Beware of AI training spiderbots

Plus: case study on Zoom, AI practices, and the new internet

Aug 09, 2023


👋 Hi, Luiza Jarovsky here. Read about my work, invite me to speak, tell me what you've been working on, or just say hi here.

This week's edition of The Privacy Whisperer is sponsored by The State of US Privacy & AI Regulation:

Want to hear directly from the people shaping the US Privacy & AI Regulation at the federal and state levels? Then join this LinkedIn Live on August 28 at 11am PST (2pm EST), with speakers Rep. Ro Khanna (member of Congress representing Silicon Valley), Alastair Mactaggart (co-author of CCPA & CPRA, and board member of the California Privacy Protection Agency), and moderator Tom Kemp (co-author of the California Delete Act, and author of the new book Containing Big Tech). Free registration here.

🔥AI red teaming: a path forward

According to the National Institute of Standards and Technology (NIST)'s glossary, a red team is “a group of people authorized and organized to emulate a potential adversary’s attack or exploitation capabilities against an enterprise’s security posture. The Red Team’s objective is to improve enterprise cybersecurity by demonstrating the impacts of successful attacks and by demonstrating what works for the defenders (i.e., the Blue Team) in an operational environment.”

This concept, initially applied in the context of cybersecurity, is now being more broadly deployed in the context of AI.

Yesterday I read this very interesting article in the Washington Post describing recent AI-related red teaming initiatives, such as the ones led by Dr. Rumman Chowdhury, co-founder of Humane Intelligence, a nonprofit developing accountable AI systems. She says, for example, that embedded harms are harder to identify, “such as biased assumptions, false claims or deceptive behavior.”

On July 19, Google published information on its AI red teams, including a “lessons learned” section. Microsoft also published its approach to AI red teaming, which has additional resources and can be useful for other companies developing their own strategies to improve their AI models. Some of Microsoft's lessons learned:

“AI red teaming is more expansive.
AI red teaming focuses on failures from both malicious and benign personas.
AI systems are constantly evolving.
Red teaming generative AI systems requires multiple attempts.
Mitigating AI failures requires defense in depth.”

Red teaming seems to be a positive path for AI development, especially when it focuses on a broader range of risks and harms and involves a more diverse group of people to put AI systems to the test.

🔥Social networks without the recommendation algorithm?

According to TikTok's recent announcement, in order to comply with the EU's Digital Services Act (DSA), they will allow EU users to turn off personalization. As a result, the "For You" and the "Live" feeds, as well as TikTok search, will show popular videos and not algorithmically recommended videos based on past user behavior and interests.

In my TikTok article, I wrote that social networks’ recommendation algorithms, and TikTok's especially - as it targets millions of minors every second - are a powerful engine for user manipulation, as the content being shown is uniquely enticing for that specific viewer. One of the consequences of these behavioral algorithms is that children and teens are so fixated that they cannot leave the app (and spend hours a day on it), as the content is always so fresh, stimulating, and intimately targeted to them. Social networks such as TikTok benefit from these recommendation algorithms by being able to show more ads to users - and profit from these ads.

I do not think that the internet we want is one in which social media companies can have manipulation power over users, especially minors. Thankfully, it looks like the DSA is starting to change that.

In this recent TikTok announcement, however, there are important issues not yet solved:

we need further accountability and transparency measures to ensure that this and other TikTok practices respect user autonomy, transparency, fairness, and privacy;
it looks like the recommendation algorithm will still be “on” by default, and users will have the possibility to opt out. Especially for minors, the recommendation algorithm should be “off” by default (following privacy-by-design and best Privacy UX practices);
we need all social networks to have similar measures; this must be the new norm.

We also need this basic privacy right - of not being algorithmically manipulated and, instead, “being let alone,” as Warren & Brandeis said in 1890 - to be respected worldwide, not only in the EU.

A bit of background: in April, TikTok was one of the 19 companies designated either as very large online platforms (VLOPs) or very large online search engines (VLOSEs) according to the DSA criteria. These companies will face stricter obligations under the DSA, including those relating to transparency, user empowerment, protection of minors, and content moderation.

I hope that, as with the GDPR, the DSA will foster a regulatory wave around the globe. We are in 2023, and some practices, such as highly manipulative recommendation algorithms targeting minors, should come to an end worldwide.

🔥 Beware of AI training spiderbots

Do you know what a spiderbot is? It is synonymous with web crawler and can be defined as “a computer program that scans the web automatically, ‘reading’ everything it finds, analyzing the information and classifying it in the database.”

Spiderbots can be useful to organize the web and help you find the information you are looking for. They have been popularly used for years by search engines. This is how it works: a search engine applies a search algorithm to the data collected by web crawlers so that it can answer a search query with relevant links from the web. For example: if you are the owner of a small business and you launch a new website, thanks to spiderbots, soon your website will be indexed by search engines, and potential clients will be able to find your products when looking for relevant terms on the search engine.
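To make the crawl-then-search idea concrete, here is a minimal sketch in Python. It is only an illustration: the pages, URLs, and function names are hypothetical, and real search engines use far more sophisticated indexing and ranking than this simple word lookup.

```python
from collections import defaultdict

# Hypothetical pages a crawler might have collected (URL -> page text).
CRAWLED_PAGES = {
    "https://example.com/bakery": "fresh sourdough bread baked daily",
    "https://example.com/bikes": "bike repair and rental shop",
    "https://example.com/cafe": "coffee and fresh bread sandwiches",
}


def build_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word in the query, sorted by URL."""
    words = query.lower().split()
    if not words:
        return []
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return sorted(results)


index = build_index(CRAWLED_PAGES)
results = search(index, "fresh bread")
print(results)
```

In this toy version, the small business owner's new page becomes findable as soon as the crawler has collected it and its words have been added to the index.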

Another application of spiderbots is training AI models. OpenAI has quietly released a new spiderbot - GPTBot - to expand its dataset and feed the upcoming GPT-5. According to this link on OpenAI's API website:

“Web pages crawled with the GPTBot user agent may potentially be used to improve future models and are filtered to remove sources that require paywall access, are known to gather personally identifiable information (PII), or have text that violates our policies. Allowing GPTBot to access your site can help AI models become more accurate and improve their general capabilities and safety. Below, we also share how to disallow GPTBot from accessing your site.”

From a privacy perspective, it is interesting that they are filtering websites that are known to have PII (of course, it is not 100% effective, and there will still be PII in their training data).

It is also a good transparency practice that they are teaching how to block and customize GPTBot (however, it requires technical literacy to edit the website's configuration, so only a limited slice of website owners will do it on their own).
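For readers wondering what that configuration edit looks like, blocking is typically done through a robots.txt file at the root of the website. The sketch below, in Python, holds an example robots.txt that disallows the GPTBot user agent (the name given in OpenAI's quote above) and uses the standard library's robots.txt parser to check the result. The example.com URL, the "SomeSearchBot" name, and the helper function are assumptions made for illustration, and robots.txt only works if the crawler chooses to respect it.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content a site owner could publish at the site root.
# It blocks GPTBot everywhere and leaves all other crawlers allowed.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check whether the given rules let this user agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical page on a hypothetical site.
page = "https://example.com/blog/post"
gptbot_allowed = is_allowed(ROBOTS_TXT, "GPTBot", page)
other_allowed = is_allowed(ROBOTS_TXT, "SomeSearchBot", page)
print(gptbot_allowed, other_allowed)
```

Keeping the general `User-agent: *` rule permissive means search engine crawlers can still index the site while the AI training crawler is asked to stay out.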

However, there are big issues here. Search engines crawl the web to index pages, and “as a reward,” they bring traffic to the owner of the website. With AI spiderbots, the deal is completely different. AI companies use crawlers to train their models, but those models will not take the person prompting them to the content owner's website. This is problematic for anyone who produces content, either as an artist/creator or as a business. According to Business Insider:

“Why would any producer of free online content let OpenAI scrape its material when that data will be used to train future LLMs that later compete with that creator by pulling users away from their site? You can already see this in action as fewer people visit Stack Overflow to get software coding help.”

In parallel, there are also discussions on the best models to govern AI training, what the rights of content creators are, and whether they deserve to be compensated for the data being fed to train AI models.

The “deal” behind the internet is changing, and it is still not clear what the next phase will look like. If you publish content online, it is time to think about what you will do with these spiderbots: will you feed or block them?

🔥 Case study: Zoom, AI practices, and the new internet

This week's case study deals with Zoom's recent terms of service change and what companies can learn about the privacy implications of AI practices.