
OpenAI transcribed over a million hours of YouTube videos to train GPT-4
Abstract
The article discusses the challenges AI companies are facing in gathering high-quality training data, and the ways they have dealt with this issue, which often involve using copyrighted content without permission.
Q&A
[01] Challenges in Gathering Training Data
1. What are the key challenges AI companies are facing in gathering high-quality training data?
- AI companies are running into a wall when it comes to gathering high-quality training data for their models
- As companies like OpenAI, Google, and Meta have exhausted supplies of useful data, they have turned to copyrighted content used without permission, for example by:
  - Transcribing over a million hours of YouTube videos to train GPT-4
  - Using computer code from GitHub, chess move databases, and schoolwork content from Quizlet
  - Gathering transcripts from YouTube
2. How have companies tried to address the training data shortage?
- OpenAI developed its Whisper audio transcription model to transcribe YouTube videos and use that data to train GPT-4
- Companies have discussed options like:
  - Paying for book licenses or buying a large publisher outright
  - Training models on "synthetic" data created by their own models
  - Using "curriculum learning" to feed models high-quality data in an ordered fashion
3. What are the legal and ethical concerns with the approaches companies have taken?
- Using copyrighted content without permission falls into a "hazy gray area" of AI copyright law
- Google, YouTube, and others have stated that unauthorized scraping or downloading of their content is prohibited by their terms of service
- Companies are wrestling with a rapidly shrinking supply of training data, and their consumption of data may outpace the creation of new content by 2028
[02] Company Responses
1. How have different companies responded to the training data shortage?
- OpenAI reportedly developed Whisper to transcribe YouTube videos, which it knew was legally questionable but believed to be fair use
- Google has also gathered transcripts from YouTube, but says it has done so in accordance with its agreements with YouTube creators
- Meta has discussed its unauthorized use of copyrighted works while trying to catch up to OpenAI
2. What actions have companies taken to expand their use of consumer data?
- Google's legal department asked the company's privacy team to tweak its policy language to expand what it could do with consumer data, such as data from office tools like Google Docs
- Meta was apparently limited in the ways it could use consumer data by privacy-focused changes it made in the wake of the Cambridge Analytica scandal
Shared by Daniel Chen
© 2024 NewMotor Inc.