Summarize by Aili

OpenAI transcribed over a million hours of YouTube videos to train GPT-4

https://www.theverge.com/2024/4/6/24122915/openai-youtube-transcripts-gpt-4-training-data-google

🌈 Abstract

The article discusses the challenges AI companies are facing in gathering high-quality training data, and the ways they have dealt with this issue, which often involve using copyrighted content without permission.

🙋 Q&A

[01] Challenges in Gathering Training Data

1. What are the key challenges AI companies are facing in gathering high-quality training data?

AI companies are running into a wall when it comes to gathering high-quality training data for their models
As companies like OpenAI, Google, and Meta have exhausted supplies of useful data, they have turned to sources that often involve copyrighted content used without permission, such as:
- Transcribing over a million hours of YouTube videos to train GPT-4 (OpenAI)
- Using computer code from GitHub, chess move databases, and schoolwork content from Quizlet (OpenAI)
- Gathering transcripts from YouTube (Google)

2. How have companies tried to address the training data shortage?

OpenAI developed its Whisper audio transcription model to transcribe YouTube videos and use that data to train GPT-4
Companies have discussed options like:
- Paying for book licenses or buying a large publisher outright
- Training models on "synthetic" data created by their own models
- Using "curriculum learning" to feed models high-quality data in an ordered fashion

3. What are the legal and ethical concerns with the approaches companies have taken?

Using copyrighted content without permission falls into a "hazy gray area" of AI copyright law
Google and YouTube have stated that unauthorized scraping or downloading of YouTube content is prohibited by the platform's terms of service
Companies are wrestling with a quickly shrinking supply of training data, and may consume it faster than new content is created by 2028

[02] Company Responses

1. How have different companies responded to the training data shortage?

OpenAI reportedly developed Whisper to transcribe YouTube videos, which it knew was legally questionable but believed to be fair use
Google has also gathered transcripts from YouTube, but says it has done so in accordance with its agreements with YouTube creators
Meta has discussed its unpermitted use of copyrighted works while trying to catch up to OpenAI

2. What actions have companies taken to expand their use of consumer data?

Google's legal department asked the company's privacy team to tweak its policy language to expand what it could do with consumer data, such as data from office tools like Google Docs
Meta was apparently limited in the ways it could use consumer data by privacy-focused changes it made in the wake of the Cambridge Analytica scandal

Shared by Daniel Chen
