🔥 Scraping is almost always illegal, says Dutch DPA
AI policy & regulation | Luiza's Newsletter #102
👋 Hi, Luiza Jarovsky here. Welcome to the 102nd edition of this newsletter on AI policy & regulation, read by 23,500+ subscribers in 130+ countries. I hope you enjoy reading it as much as I enjoy writing it.
➡️ A special thanks to MineOS for sponsoring this week's free edition of the newsletter. Check out their free guide:
In the ever-evolving maze of state legislation, the US is again trying to introduce a federal data privacy law, this time with strong momentum. AI progress intensifies the pressure to safeguard privacy, and the issue still holds bipartisan support. But has Congress learned the lessons from an endless stream of state privacy bills and the failure of the ADPPA in 2022? Get an overview of the American Privacy Rights Act in this free guide from MineOS. Explore its provisions, impact on businesses and individuals, and the likelihood of its passage.
🔥 Scraping is almost always illegal, says Dutch DPA
The new report from the Dutch Data Protection Authority is out, and it brings bad news for AI developers & entrepreneurs (the quotes below are an automatic translation):
➡️ According to the Dutch DPA:
"Scraping will almost always be a violation of the General Data Protection Regulation (GDPR). In a number of cases, scraping is not allowed anyway. For example:
➵ scraping the internet to create profiles of people and then resell them;
➵ scraping information from protected social media accounts or private forums;
➵ scraping data from public social media profiles with the aim of determining whether or not those people will be granted insurance they have applied for.
"A widespread misunderstanding is that scraping is allowed, because everything on the internet is already available to everyone. 'But the fact that information about you is public does not automatically mean that you also give permission for scraping,' says AP chairman Aleid Wolfsen. 'Even if you post on your social media account that you recently won the lottery or had an operation, you do not give permission for that data to be scraped. You only give permission to collect personal data if you have been asked in advance. That is usually not possible with scraping."
➡️ It's great to see that what I've been writing about privacy and AI in this newsletter since January 2023, and teaching in my courses, is slowly becoming consensus and being translated into practical recommendations, at least in the EU.
➡️ Read the full report (in Dutch) here.
📋 OECD revises its AI principles
➡️ According to the official release from the OECD: "In response to recent developments in AI technologies, notably the emergence of general-purpose and generative AI, the updated Principles more directly address AI-associated challenges involving privacy, intellectual property rights, safety, and information integrity."
➡️ Key revisions include:
➵ "Addressing safety concerns, so that if AI systems risk causing undue harm or exhibit undesired behaviour, robust mechanisms and safeguards exist to override, repair, and/or decommission them safely;
➵ Reflecting the growing importance of addressing mis- and disinformation, and safeguarding information integrity in the context of generative AI;
➵ Emphasising responsible business conduct throughout the AI system lifecycle, involving co-operation with suppliers of AI knowledge and AI resources, AI system users, and other stakeholders;
➵ Clarifying the information regarding AI systems that constitute transparency and responsible disclosure;
➵ Explicitly referencing environmental sustainability, a concern that has grown considerably in importance over the past five years;
➵ Underscoring the need for jurisdictions to work together to promote interoperable governance and policy environments for AI, as the number of AI policy initiatives worldwide surges."
➡️ Read the revised principles here.
📑 NIST: "AI Risk Management Framework: Generative AI Profile”
➡️ The US National Institute of Standards and Technology (NIST) has published the first draft of its "AI Risk Management Framework: Generative AI Profile." Important information & quotes:
➡️ This is a comprehensive document that contains an overview of risks unique to or exacerbated by generative AI (GAI) and an extensive list of actions to manage GAI's risks.
➡️ It highlights the following risks:
➵ CBRN Information
➵ Confabulation
➵ Dangerous or Violent Recommendations
➵ Data Privacy
➵ Environmental
➵ Human-AI Configuration
➵ Information Integrity
➵ Information Security
➵ Intellectual Property
➵ Obscene, Degrading, and/or Abusive Content
➵ Toxicity, Bias, and Homogenization
➵ Value Chain and Component Integration
➡️ Quotes:
"AI technology can produce varied outputs in multiple modalities and present many classes of user interfaces. This leads to a broader set of AI actors interacting with GAI systems for widely differing applications and contexts of use. These can include data labeling and preparation, development of GAI models, content moderation, code generation and review, text generation and editing, image and video generation, summarization, search, and chat. These activities can take place within organizational settings or in the public domain." (page 63)
"The quality of AI red-teaming outputs is related to the background and expertise of the AI red-team itself. Demographically and interdisciplinarily diverse AI red-teams can be used to identify flaws in the varying contexts where GAI will be used. For best results, AI red-teams should demonstrate domain expertise, and awareness of socio-cultural aspects within the deployment context. AI red teaming results should be given additional analysis before they are incorporated into organizational governance and decision making, policy and procedural updates, and AI risk management efforts." (page 66)
"Provenance data tracking processes can include and assist AI actors across the lifecycle who may not have full visibility or control over the various trade-offs and cascading impacts of early-stage model decisions on downstream performance and synthetic outputs. For example, by selecting a given model to prioritize computational efficiency over accuracy, an AI actor may inadvertently affect provenance tracking reliability." (page 67)
➡️ This is a comprehensive and informative document on Generative AI's risk profile: read it here.
🏛️ New AI bill introduced in the US
➡️ US Senators Mark Warner & Thom Tillis introduce the Secure AI Act of 2024. What you need to know:
➡️ According to the official release, the bill aims to:
"(…) improve information sharing between the federal government and private companies by updating cybersecurity reporting systems to better incorporate AI systems" and "would also create a voluntary database to record AI-related cybersecurity incidents including so-called 'near miss' events"
➡️ The bill also:
➵ "Requires NIST to update the NVD and requires CISA to update the CVE program or develop a new process to track voluntary reports of AI security vulnerabilities;
➵ Establishes a public database to track voluntary reports of AI security and safety incidents;
➵ Creates a multi-stakeholder process that encourages the development and adoption of best practices that address supply chain risks associated with training and maintaining AI models;
➵ Establishes an Artificial Intelligence Security Center at the NSA to provide an AI research testbed to the private sector and academic researchers, develop guidance to prevent or mitigate counter-AI techniques, and promote secure AI adoption."
➡️ Read the bill here.
⚖️ AI copyright lawsuit against MosaicML and Databricks
➡️ Authors Rebecca Makkai & Jason Reynolds filed an AI copyright lawsuit against MosaicML and Databricks. Important quotes:
"MosaicML has admitted training its MPT-7B model on a copy of the “RedPajama—Books” dataset, which in turn is a copy of the Books3 dataset. Therefore, MosaicML necessarily trained its MPT-7B model on a copy of Books3. Certain books written by Plaintiffs and Class members are part of Books3—including the Infringed Works—and thus MosaicML necessarily trained MPT-7B on one or more copies of the Infringed Works, thereby directly infringing the copyrights of the Plaintiffs and Class members." (page 7)
"On information and belief, MosaicML made further copies of the Books3 dataset or subsets thereof to train other models in the MPT family. For instance, MosaicML released a model called MPT-7B-StoryWriter-65k+ (“the StoryWriter model”), a variant of MPT-7B that MosaicML admits was further trained on “a filtered fiction subset of the [B]ooks3 dataset”.7 The stated purpose of the StoryWriter model is “to read and write stories”—or, put another way, to generate works that directly compete with works in the training dataset." (page 8)
"MosaicML repeatedly copied the Infringed Works without Plaintiffs’ permission. MosaicML made these copies of the Infringed Works in violation of Plaintiffs’ exclusive rights under the Copyright Act. Plaintiffs and Class members have been injured by MosaicML’s acts of direct copyright infringement. Plaintiffs and Class members are entitled to statutory damages, actual damages, restitution of profits, and other remedies provided by law." (page 9)
➡️ Read the lawsuit here.
⚖️ AI copyright lawsuit against NVIDIA
➡️ Authors Andre Dubus III & Susan Orlean filed an AI copyright lawsuit against NVIDIA. Important quotes:
"In September 2022, NVIDIA released its NeMo Megatron series of large language models. A large language model (“LLM”) is AI software designed to emit convincingly naturalistic text outputs in response to user prompts." (page 4)
"Much of the material in NVIDIA’s training dataset, however, comes from copyrighted works—including books written by Plaintiffs and Class members—that were copied by NVIDIA without consent, without credit, and without compensation." (page 5)
"NVIDIA has admitted training its NeMo Megatron models on a copy of The Pile dataset, which in turn includes the Books3 dataset. Therefore, NVIDIA necessarily also trained its NeMo Megatron models on a copy of Books3, because Books3 is part of The Pile. Certain books written by Plaintiffs and Class members are part of Books3—including the Infringed Works—and thus NVIDIA necessarily trained its NeMo Megatron models on one or more copies of the Infringed Works, thereby directly infringing the copyrights of the Plaintiffs and the Class." (page 7)
"Plaintiffs and Class members have been injured by NVIDIA’s acts of direct copyright infringement. Plaintiffs and Class members are entitled to statutory damages, actual damages, restitution of profits, and other remedies provided by law." (page 8)
➡️ Read the lawsuit here.
⚖️ AI copyright lawsuit against OpenAI & Microsoft
➡️ Eight US newspapers sued OpenAI and Microsoft for AI-related copyright infringement. Important info and quotes:
➡️ The newspapers suing are:
➵ New York Daily News
➵ Chicago Tribune
➵ Orlando Sentinel
➵ Sun Sentinel Media Group
➵ The Mercury News
➵ The Denver Post
➵ Orange County Register
➵ Pioneer Press
➡️ Quotes:
"This lawsuit arises from Defendants purloining millions of the Publishers' copyrighted articles without permission and without payment to fuel the commercialization of their generative artificial intelligence (“GenAI”) products, including ChatGPT and Copilot. Although OpenAI purported at one time to be a non-profit organization, its recent $90 billion valuation underscores how that is no longer the case. ChatGPT, along with Microsoft Copilot (formerlyknown as Bing Chat) has also added hundreds of billions of dollars to Microsoft’s market value. Defendants have created those GenAI products in violation of the law by using important journalism created by the Publishers’ newspapers without any compensation." (pages 1-2)
"This lawsuit is not a battle between new technology and old technology. It is not a battle between a thriving industry and an industry in transition. It is most surely not a battle to resolve the phalanx of social, political, moral, and economic issues that GenAI raises. This lawsuit is about how Microsoft and OpenAI are not entitled to use copyrighted newspaper content to build their new trillion-dollar enterprises, without paying for that content. As this lawsuit will demonstrate, Defendants must both obtain the Publishers’ consent to use their content and pay fair value for such use." (page 7)
➡️ They claim:
➵ Copyright Infringement (17 U.S.C. § 501);
➵ Vicarious Copyright Infringement;
➵ Contributory Copyright Infringement;
➵ Digital Millennium Copyright Act – Removal of Copyright Management Information (17 U.S.C. § 1202);
➵ Common Law Unfair Competition By Misappropriation;
➵ Trademark Dilution (15 U.S.C. § 1125(c));
➵ Dilution and Injury to Business Reputation (N.Y. Gen. Bus. Law § 360-l).
➡️ They demand:
➵ Awarding the Publishers statutory damages, compensatory damages, restitution, disgorgement, and any other relief that may be permitted by law or equity;
➵ Permanently enjoining Defendants from the unlawful, unfair, and infringing conduct alleged herein;
➵ Ordering destruction under 17 U.S.C. § 503(b) of all GPT or other LLM models and training sets that incorporate the Publishers’ Works;
➵ An award of costs, expenses, and attorneys’ fees as permitted by law;
➵ Such other or further relief as the Court may deem appropriate, just, and equitable.
➡️ Read the lawsuit here.
⚖️ AI copyright lawsuit against Google
➡️ Visual artists Jingna Zhang, Sarah Andersen, Hope Larson, and Jessica Fink have filed an AI copyright lawsuit against Google, alleging that the company used their work without permission to train its AI-based image generator. Quotes:
"Plaintiffs never authorized Google to use their copyrighted work in any way. Nevertheless, Google repeatedly violated Plaintiffs’ exclusive rights under § 106 and continues to do so today. Plaintiffs and the Class members never authorized Google to make copies of their works, make derivative works, publicly display copies (or derivative works), or distribute copies (or derivative works)."
"On information and belief, Google has used Plaintiffs’ training images to train other versions of Imagen, including Imagen 2, and so-called “multimodal” models that are trained on training images as well as text, such as Google Gemini. Collectively, Imagen and other models that Google trained on LAION-400M are called the Google–LAION Models"
"As the corporate parent of Google, Alphabet had the right and ability to supervise the infringing activity of Google when it trained the Google–LAION Models on Plaintiffs’ works. Alphabet failed to exercise that right and ability."
"Plaintiffs have been injured by Alphabet’s acts of vicarious copyright infringement. Plaintiffs are entitled to statutory damages, actual damages, restitution of profits, and other remedies provided by law."
➡️ Read the lawsuit here.
📌 Resources in AI, tech & privacy
If you enjoy this newsletter, you might also want to check out our weekly job alerts (privacy & AI governance jobs), AI Book Club (950+ members), YouTube channel (24,200+ subscribers), and 4-week Bootcamps to upskill and advance your career (730+ participants). *The EU AI Act Bootcamp starts tomorrow: read more and register here or join our course waitlist.
⚡ Free Lightning Lesson: Understand the DMA in 20 minutes
➡️ To participate, register here. For more learning and upskilling opportunities, subscribe to our course waitlist.
🙏 Thank you for reading!
If you have comments on this week's edition, I'll be happy to hear them! Write to me, and I'll get back to you soon.
A reminder that today is the last day to register for the EU AI Act 4-week Bootcamp starting tomorrow. Read more here.
Have a great day!
Luiza