Apple, Nvidia, Anthropic Used Thousands of Swiped YouTube Videos to Train AI
Abstract
The article investigates how AI companies used material from thousands of YouTube videos to train their AI models, despite YouTube's rules against harvesting content from the platform without permission. The investigation found that subtitles from 173,536 YouTube videos, siphoned from more than 48,000 channels, were used by major tech companies including Anthropic, Nvidia, Apple, and Salesforce. The article discusses the concerns of YouTube creators whose work was used without their consent, as well as the legal implications and debates around using such data to train AI models.
Q&A
[01] AI Companies Using YouTube Videos for Training
1. What did the investigation by Proof News find about AI companies using YouTube videos to train their models?
- The investigation found that subtitles from 173,536 YouTube videos, siphoned from more than 48,000 channels, were used by major tech companies like Anthropic, Nvidia, Apple, and Salesforce to train their AI models.
- The dataset, called YouTube Subtitles, contained video transcripts from educational and online learning channels like Khan Academy, MIT, and Harvard, as well as content from media outlets like The Wall Street Journal, NPR, and the BBC.
- The dataset also included material from popular YouTube creators like MrBeast, Marques Brownlee, Jacksepticeye, and PewDiePie, without their consent.
2. How did the creators feel about their content being used without permission?
- Many creators were unaware that their content had been used to train AI models and expressed frustration and concern about the unauthorized use of their work.
- David Pakman, host of "The David Pakman Show," said that if AI companies are profiting from the use of his content, he should be compensated for it, as it is his livelihood.
- Dave Wiskus, the CEO of Nebula, a streaming service partially owned by its creators, called the unauthorized use of creators' work "theft" and "disrespectful," especially since it could be used to replace artists.
3. What were the concerns raised about the content used to train the AI models?
- The dataset contained profanity, as well as instances of racial and gender slurs, which could lead to "vulnerabilities and safety concerns" in the AI models trained on this data.
- The dataset also included subtitles from more than 12,000 videos that have since been deleted from YouTube, meaning the creators' work has been incorporated into an unknown number of AI models without their knowledge or consent.
[02] Legal Implications and Debates
1. How have other AI training datasets faced legal challenges?
- Similar to the YouTube Subtitles dataset, the Books3 dataset, which contained over 180,000 books, including works by authors like Margaret Atwood, Michael Pollan, and Zadie Smith, faced legal challenges from authors who alleged copyright violations.
- While some of these cases have been voluntarily dismissed, the legal questions surrounding permission and payment for the use of creative works in AI training remain unresolved.
2. What are the arguments made by AI companies regarding the use of such datasets?
- Defendants like Meta, OpenAI, and Bloomberg have argued that their use of the datasets constitutes fair use, but the litigation in these cases is still in the early stages.
- The Pile dataset, which included YouTube Subtitles, has been removed from its official download site but remains available on file-sharing services.
3. How do experts view the actions of technology companies in regards to the use of creative works for AI training?
- Consumer protection attorney Amy Keller said that technology companies have "run roughshod" over the concerns of creators, and that the real issue is that people had no choice in whether their work was used.
- AI policy researcher Jai Vipra noted that AI companies compete by procuring higher-quality data, which is why they keep their data sources secret, and that datasets like YouTube Subtitles are seen as a "gold mine" for training AI models.