👋 Hi, Luiza Jarovsky here. Read about my work, invite me to speak, tell me what you've been working on, or just say hi here.
This week's edition is sponsored by Containing Big Tech:
From our sponsor: The five largest tech companies - Meta, Apple, Amazon, Microsoft, and Google - have built innovative products that improve many aspects of our lives. But their intrusiveness and our dependence on them have created pressing threats, including the overcollection and weaponization of our most sensitive data and the problematic ways they use AI to process and act upon our data. In his new book, Tom Kemp eloquently weaves together the threats posed by Big Tech and offers actionable solutions for individuals and policymakers to advocate for change. Order Containing Big Tech today.
🔥 Privacy lawsuits against OpenAI
We are not even a year into the current AI hype - which started in November 2022, when ChatGPT was made available to the public - and lawsuits are already popping up, including privacy lawsuits.
Below are two examples that are extremely interesting for the intersection of privacy & AI and the emerging challenges we have been observing over the last few months.
[observation for non-lawyers: the Defendants are OpenAI and Microsoft; the Plaintiffs are the parties bringing the lawsuit, and you can see their identification at the top of the filing]
Lawsuit 1 - June 28, 2023
Interesting quote:
“(…) this secret and unregistered scraping of internet data, for Defendants’ own private and exorbitant financial gain, without regard to privacy risks, amounts to the negligent and otherwise illegal theft of personal data of millions of Americans who do not even use AI tools. These individuals (“Non-Users”) had their personal information scraped long before OpenAI’s applications were available to the public, and certainly before they could have registered as a ChatGPT user. In either case, no one consented to the use of their personal data to train the Products.” (pages 36-37)
Lawsuit 2 - September 9, 2023
Interesting quote:
“(iii) Control: Defendants must allow Product users and everyday internet users to opt out of all data collection and they should otherwise stop the illegal taking of internet data, delete (or compensate for) any ill-gotten data, or the algorithms which were built on the stolen data, and before any further commercial deployment, technological safety measures must be added to the Products that will prevent the technology from surpassing human intelligence and harming others.” (pages 7-8)
*
Scraping
One of the central topics in these lawsuits is scraping. For those not aware, the large language models behind AI-based chatbots such as ChatGPT are trained on data collected through web scraping, which can be defined as “the process of extracting data from a specific web page. It involves making an HTTP request to a website’s server, downloading the page’s HTML and parsing it to extract the desired data.”
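To make the mechanics concrete, here is a minimal sketch of those three steps (request, download, parse) in Python. The URL, function name, and libraries (requests, BeautifulSoup) are illustrative assumptions on my part - large-scale crawlers like the ones at issue in these lawsuits are far more complex, and this is not OpenAI's actual pipeline:

```python
# Minimal illustration of web scraping: HTTP request, download HTML, parse out text.
# Assumes the third-party packages `requests` and `beautifulsoup4` are installed.
import requests
from bs4 import BeautifulSoup

def scrape_paragraphs(url: str) -> list[str]:
    """Fetch a page and return the text of its <p> elements."""
    response = requests.get(url, timeout=10)             # HTTP request to the site's server
    response.raise_for_status()                          # fail loudly on HTTP errors
    soup = BeautifulSoup(response.text, "html.parser")   # parse the downloaded HTML
    return [p.get_text(strip=True) for p in soup.find_all("p")]

# Hypothetical usage:
# print(scrape_paragraphs("https://example.com/article"))
```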
From a privacy and data protection perspective, scraping in this case is problematic because the Defendants collected personal data - including sensitive data - without lawful grounds and are now exploiting it commercially. This personal data was used to train the model, and it is now an integral part of the model and of the outputs it generates.
Those of us who have posted content online - essentially all of us - did not consent to having our data used to train AI models that are now being deployed commercially and even integrated into more traditional products, as is happening at Microsoft.
The second lawsuit I cited above is very detailed regarding OpenAI's data sources and issues involved in scraping the data. Readers who want to learn more should read section C “ChatGPT’s Development Depends on Secret Web-Scraping,” on pages 20-25.
The criticism of scraping comes back to the idea of contextual integrity, proposed by Prof. Helen Nissenbaum (article and book). Privacy involves respecting the original context and the set of norms around it. As Nissenbaum puts it:
“Contextual integrity ties adequate protection for privacy to norms of specific contexts, demanding that information gathering and dissemination be appropriate to that context and obey the governing norms of distribution within it.”
On the topic of scraping, last month OpenAI quietly released a new web crawler - GPTBot - to expand its dataset and feed GPT-5. According to this page on OpenAI's API website:
“Web pages crawled with the GPTBot user agent may potentially be used to improve future models and are filtered to remove sources that require paywall access, are known to gather personally identifiable information (PII), or have text that violates our policies. Allowing GPTBot to access your site can help AI models become more accurate and improve their general capabilities and safety. Below, we also share how to disallow GPTBot from accessing your site.”
It is interesting to see how OpenAI was more careful about privacy concerns with this new release, expressly mentioning that it is avoiding PII and letting websites disallow GPTBot. It is unclear, however, whether this “opt out” works well in practice.
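For site owners who want to exercise that opt-out, OpenAI's page points to the site's robots.txt file. A minimal sketch is below; the directory paths in the second variant are purely illustrative:

```
# robots.txt - disallow OpenAI's GPTBot from crawling the entire site
User-agent: GPTBot
Disallow: /

# Or, to limit GPTBot to parts of the site only (illustrative paths):
# User-agent: GPTBot
# Allow: /public/
# Disallow: /private/
```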
Control
A second interesting aspect, reflected in the quote from the second lawsuit highlighted above, is the claim that:
users should be able to opt out of data collection
OpenAI should stop the “illegal taking of internet data”
OpenAI should delete or compensate for any unlawfully collected data
These are important topics that so far lack a clear regulatory solution. There is also a clear intersection with copyright claims. For example, authors Sarah Silverman, Christopher Golden, and Richard Kadrey have sued OpenAI and Meta for copyright infringement (which occurred through ChatGPT and LLaMA, respectively). In the OpenAI lawsuit, they argue that:
“The unlawful business practices described herein violate the UCL because they are unfair, immoral, unethical, oppressive, unscrupulous or injurious to consumers, because, among other reasons, Defendants used Plaintiffs’ protected works to train ChatGPT for Defendants’ own commercial profit without Plaintiffs’ and the Class’s authorization. Defendants further knowingly designed ChatGPT to output portions or summaries of Plaintiffs’ copyrighted works without attribution, and they unfairly profit from and take credit for developing a commercial product based on unattributed reproductions of those stolen writing and ideas.”
It is still unclear how these topics will develop over the coming months and years - the regulatory aspects are both challenging and interesting.
As I wrote yesterday on LinkedIn: at the very least, we deserve to be informed that whatever we upload or publish on a given website will be used to train the next AI model. We might then choose not to publish, or to publish in a different way. Judging by the comments on that post, this seems to be a common sentiment.
These emerging challenges - and where we move from here in terms of best practices or regulatory efforts - are part of the crisis privacy is undergoing, which I discuss in the case study this week (below).
If you want to dive deeper, check out my upcoming live Masterclass: AI & Privacy: Risks, Challenges and Regulation.
🔥 New report on dark patterns in privacy
Feeding the current regulatory boom around dark patterns, the UK's Information Commissioner's Office (ICO) and the Competition and Markets Authority (CMA) have published a joint position paper: “Harmful design in digital markets: how online choice architecture practices can undermine consumer choice and control over personal information.”
One of the interesting aspects of this paper is its focus on online choice architecture and how it can negatively affect users and carry legal implications.
In my academic article about dark patterns in privacy, I discussed how they are forms of choice architecture that negatively affect user privacy.
On the topic of choice architecture, it is worth bringing in the book Nudge: Improving Decisions About Health, Wealth, and Happiness by Richard Thaler and Cass Sunstein. They define a nudge as:
“any aspect of the choice architecture that alters people’s behaviour in a predictable way without forbidding any options or significantly changing their economic incentives. To count as a mere nudge, the intervention must be easy and cheap to avoid. Nudges are not mandates. Putting the fruit at eye level counts as a nudge. Banning junk food does not” (page 6).
Dark patterns, therefore, can be seen as “bad nudges,” as they try to influence people to do things that are not in their best interest. Dark patterns usually do not coerce people; instead, they exploit cognitive biases to push them to share more personal data (to learn about this in depth, join my upcoming live Masterclass on Dark Patterns and Privacy UX).
One of the most interesting parts of the report is this section with the ICO's and the CMA's expectations of firms using choice architecture in the context of personal data:
“• Put the user at the heart of your design choices: Are firms building their online interfaces around the user’s interests and preferences?
• Use design that empowers user choice and control: Are firms helping users to make effective and informed choices about their personal data, and putting them in control of how their data is collected and used?
• Test and trial your design choices: Has testing and trialling been carried out to ensure their design choices are evidence-based?
• Comply with data protection, consumer and competition law: Do firms consider the data protection, consumer protection and competition law implications of the design practices they are employing?”
My comment here is that online choice architecture with privacy relevance is everywhere: privacy settings, UX interfaces that let users access privacy information or interact with the company or with other users, interfaces that collect personally identifiable information, and so on. So even when a designer is not explicitly attempting to harm privacy or negatively influence choice, dark patterns in privacy can emerge.
Companies are incentivized to constantly gather more personal data, so simply by failing to put effort into protecting users, a UX interface might inadvertently harm user privacy.
I am happy to see that more and more authorities, advocates, researchers, and the public in general are interested in the impact that UX interfaces have on user privacy.
Dark patterns and privacy UX are among the top subjects in this newsletter - I talk about them almost every week. You can check our archive for more articles and explanations, or join the upcoming Masterclass.
🔥 Privacy is undergoing a crisis
This week's case study deals with emerging privacy challenges in the context of AI and how they affect how we perceive and enforce privacy rights: