If you’ve ever uploaded photos or illustrations, written a review, “liked” content, answered a question on Reddit, contributed to open source, or done any number of other activities online, you have done free work for tech companies, because downloading all of this content from the web is how their AI systems learn about the world.
Tech companies know this, but they mask your contributions to their products behind technical terms like “training data”, “unsupervised learning” and “data exhaust” (and, of course, impenetrable “Terms of Use” documents). In fact, a large part of AI innovation over the past few years has consisted of using more and more of your content for free. This is true of search engines like Google, social media sites like Instagram, AI research startups like OpenAI, and many other providers of smart technologies.
This exploitative dynamic is especially damaging when it comes to the new wave of generative AI programs like DALL-E and ChatGPT. Without your content, ChatGPT and everything like it simply would not exist. Many AI researchers believe that your content is actually more important than what the computer scientists do. Yet these intelligent technologies that exploit your labor are the very technologies threatening to put you out of work. It’s as if an AI system walked into your factory and drove off in the car you built.
But this dynamic also means that the users who generate the data have a great deal of power. Discussions about the use of advanced AI technologies are often marked by a sense of powerlessness and the attitude that AI companies will do what they want, and that there is little the public can do to steer the technology in a different direction. We are AI researchers, and our research suggests that the public has an enormous amount of “data leverage” that can be used to create an AI ecosystem that both generates amazing new technologies and fairly shares the benefits of those technologies with the people who created them.
Data leverage can be deployed through at least four avenues: direct action (for example, people banding together to withhold, “poison” or redirect their data), regulatory action (for example, pushing for data-protection policy and legal recognition of “data coalitions”), legal action (for example, communities adopting new data-licensing regimes or pursuing lawsuits), and market action (for example, demanding that large language models be trained only with data from consenting creators).
Let’s start with direct action, which is a particularly exciting avenue because it can be pursued immediately. Because generative AI systems rely on web scraping, website owners could significantly disrupt the training-data pipeline by prohibiting or restricting data collection, for instance by configuring their robots.txt file (the file that tells crawlers which pages are off-limits).
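As a concrete illustration, here is a minimal robots.txt sketch that opts a site out of two crawlers known to gather data used in AI training. CCBot (Common Crawl) and GPTBot (OpenAI) are real crawler names, but the list is illustrative rather than complete, and compliance is voluntary: well-behaved crawlers honor the file, others may simply ignore it.

```
# Refuse AI-training crawlers site-wide, while leaving ordinary
# search-engine bots untouched. Compliance is voluntary.
User-agent: CCBot      # Common Crawl, a major source of AI training data
Disallow: /

User-agent: GPTBot     # OpenAI's web crawler
Disallow: /
```

Because robots.txt is only a request, sites that want a harder guarantee need server-side enforcement, which brings us to the stronger measures below.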
Large user-generated content sites such as Wikipedia, StackOverflow, and Reddit are especially important to generative AI systems, and they can shut those systems out of their content in even stronger ways, such as by blocking IP traffic and restricting API access. According to Elon Musk, Twitter recently did exactly this. Content producers should also take advantage of the opt-out mechanisms that AI companies increasingly provide. For example, programmers on GitHub can exclude their code from the BigCode training data through a simple form. More generally, simply speaking up when content has been used without consent has proven somewhat effective. For example, Stability AI, a maker of large generative AI models, agreed to honor opt-out requests submitted through haveibeentrained.com after an outcry on social media. And by engaging in public forms of action, as artists did with their mass protests against AI-generated art, it may be possible to pressure companies into abandoning commercial activities that much of the public perceives as theft.
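For sites that want to enforce such restrictions rather than merely request them, refusing requests by user agent at the application layer is one option. Below is a minimal sketch using Flask; the blocklist is an assumption kept deliberately short, and a real deployment would more likely apply the same rule at the web server or CDN level.

```python
# Minimal sketch: reject requests from known AI-training crawlers at the
# application level. The user-agent names below are illustrative, not a
# complete registry, and need to be kept up to date by the site operator.
from flask import Flask, request, abort

app = Flask(__name__)

# Crawlers associated with large-scale scraping for AI training.
BLOCKED_AGENTS = ("CCBot", "GPTBot")

@app.before_request
def block_ai_crawlers():
    ua = request.headers.get("User-Agent", "")
    if any(bot in ua for bot in BLOCKED_AGENTS):
        abort(403)  # Forbidden: refuse to serve content to these crawlers

@app.route("/")
def index():
    return "Content served only to non-blocked clients."
```

Doing this at the edge (for example, in an nginx or CDN rule) is cheaper than doing it in application code, but the logic is the same: inspect the User-Agent header and refuse to serve the content.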