AI Has Created a Battle Over Web Crawling
Abstract
The article discusses the challenges facing generative AI development as websites increasingly restrict web crawlers' access to their data. It highlights how the robots.txt protocol, originally designed to guide web crawlers, is now being used by websites to limit access to their data, particularly by those who feel threatened by the capabilities of generative AI. The article also explores the implications of this trend for AI companies and potential responses, such as licensing data directly or relying on synthetic data.
Q&A
[01] The technology behind robots.txt and its relevance in the age of generative AI
1. What is the robots.txt protocol and how does it work?
- Robots.txt is a machine-readable file that crawlers (bots that navigate the web and record what they see) consult to determine which parts of a website they are permitted to crawl (see the sketch after this list).
- It became the de facto standard for directing the crawlers of web search engines such as Bing and Google, helping to improve the experience of navigating the web.
- Websites and crawlers generally had a symbiotic relationship, where websites wanted the traffic from web search engines, and the search engines wanted to index the websites.
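As a concrete illustration, here is a minimal sketch of how a well-behaved crawler consults robots.txt, using Python's standard-library `urllib.robotparser`; the site, paths, and bot names are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt, as it might be served at https://example.com/robots.txt
ROBOTS_TXT = """\
User-agent: ExampleSearchBot
Allow: /

User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(ROBOTS_TXT)

# A compliant crawler checks before fetching; nothing technically forces it to.
print(parser.can_fetch("ExampleSearchBot", "https://example.com/articles/1"))   # True
print(parser.can_fetch("SomeOtherBot", "https://example.com/private/report"))   # False
```

Compliance is entirely voluntary: the file expresses the site's preferences, and it is up to each crawler to honor them.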
2. How has the use of robots.txt changed in the age of generative AI?
- Many websites, particularly those monetized with advertising and paywalls (such as news and artist websites), are now using robots.txt to restrict bots, especially those associated with generative AI models.
- They are doing this out of fear that generative AI might impinge on their livelihoods.
3. What are the limitations of using robots.txt to restrict access?
- Robots.txt is machine-readable but not legally enforceable, unlike terms of service, which can be legally enforceable but are not machine-readable.
- Websites have to individually specify which crawlers are allowed or disallowed, which puts an undue burden on them (see the sketch after this list).
- Many major AI companies have been accused of not respecting robots.txt and crawling websites anyway, despite their stated policies.
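To make the per-crawler burden concrete, below is a minimal sketch of the kind of robots.txt a publisher would need to serve to opt out of AI-related crawlers one user agent at a time. The tokens GPTBot (OpenAI), CCBot (Common Crawl), and Google-Extended (Google's AI-training control) are commonly cited examples, but the list is illustrative rather than exhaustive, and the helper function is hypothetical.

```python
# Hypothetical helper: build robots.txt directives that opt out of each
# AI-associated crawler individually, while leaving other bots unrestricted.
AI_CRAWLER_USER_AGENTS = [
    "GPTBot",           # OpenAI's crawler
    "CCBot",            # Common Crawl's crawler, whose corpus feeds many models
    "Google-Extended",  # Google's token for controlling AI-training use
]

def build_ai_optout_robots_txt(user_agents: list[str]) -> str:
    """Return robots.txt text disallowing each listed crawler site-wide."""
    blocks = [f"User-agent: {agent}\nDisallow: /" for agent in user_agents]
    blocks.append("User-agent: *\nAllow: /")  # all other crawlers remain welcome
    return "\n\n".join(blocks) + "\n"

print(build_ai_optout_robots_txt(AI_CRAWLER_USER_AGENTS))
```

Every newly announced crawler requires another entry, which is exactly the maintenance burden the report describes.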
[02] The impact of data restrictions on generative AI training
1. What did the report find about the impact of data restrictions on popular generative AI training datasets?
- The report found that in less than a year, about 5% of the data in the C4 dataset (created in 2019) has been revoked due to websites restricting access.
- For the top 2,000 websites in the C4 dataset (which include news, academic, social media, and other high-quality sites), 25% of the data has been revoked.
- This means the distribution of training data for models that respect robots.txt is rapidly shifting away from high-quality sources to more personal, organizational, and e-commerce websites.
2. What are the potential implications of this shift in training data?
- It could lead to a performance gap between models that respect robots.txt and those that do not, as the latter would have access to higher-quality data.
- There is uncertainty around whether robots.txt can apply retroactively to older datasets, which could lead to legal battles.
3. What are the potential strategies for AI companies to address the data restriction issue?
- Large companies may license data directly from sources, which could create a higher capital requirement for entry into the market.
- Companies may invest more in acquiring exclusive access to valuable user-generated data sources like YouTube, GitHub, and Reddit, which raises antitrust concerns.
- Increased use of synthetic data, though there are concerns about model collapse due to poor-quality synthetic data.
[03] The future outlook and potential solutions
1. What is the overall trend in website data restrictions, and what factors could affect it?
- The report expects restrictions expressed through robots.txt and terms of service to continue rising.
- However, this trend could be affected by external factors such as legislation, company policy changes, lawsuit outcomes, and community pressure from groups like writers' guilds.
2. What potential solutions does the report suggest for addressing the data restriction issue?
- The report suggests the need for new standards that allow creators to express their preferences for data use in a more granular and machine-readable way, reducing the burden on websites (a hypothetical sketch follows this list).
- However, it's unclear whose responsibility it is to create and enforce such standards, and there is a risk of bias towards the interests of the standard's designer.
- The report also suggests that not all data restrictions should be respected, particularly for academic or journalistic research purposes, as not all data and use cases are equal.
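Purely as a thought experiment rather than a description of any existing standard, a more granular, machine-readable preference file might state permissions per use case instead of per crawler. Everything below, including the field names and values, is hypothetical.

```python
import json

# Hypothetical and illustrative only: no such standard currently exists.
# A site-level preference file that expresses consent per use case,
# so publishers need not track every newly announced crawler.
content_use_preferences = {
    "search_indexing": "allow",
    "ai_training": "disallow",
    "academic_research": "allow",
    "attribution_required": True,
    "contact": "licensing@example.com",
}

print(json.dumps(content_use_preferences, indent=2))
```

Who would define, maintain, and enforce such a format remains the open question the report raises.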