OpenAI destroyed a trove of books used to train AI models. The employees who collected the data are gone.
Abstract
The article discusses a lawsuit filed by the Authors Guild against OpenAI, accusing the company of illegally using copyrighted books to train its AI models. The article also covers the deletion of two large datasets, "books1" and "books2," that were used to train GPT-3, and the departure of the researchers who created these datasets from OpenAI.
Q&A
[01] Lawsuit and Deleted Datasets
1. What is the Authors Guild accusing OpenAI of?
- The Authors Guild is suing OpenAI, accusing it of illegally using copyrighted books to train AI models.
2. What did the newly unsealed documents show about the datasets used to train GPT-3?
- The documents show that OpenAI deleted two datasets, "books1" and "books2," that had been used to train GPT-3.
- The datasets probably contained more than 100,000 published books and were central to the Authors Guild's allegations that OpenAI used copyrighted materials to train AI models.
3. What happened to the researchers who created the "books1" and "books2" datasets?
- The two researchers who created the "books1" and "books2" datasets are no longer employed by OpenAI.
4. How did OpenAI respond to the lawsuit and the deletion of the datasets?
- OpenAI initially resisted sharing information about the datasets, citing confidentiality concerns, but ultimately disclosed that it had deleted all copies of the data.
- OpenAI has petitioned the court to keep the names of the two employees who created the datasets, as well as information about the datasets, under seal.
- OpenAI stated that the models powering ChatGPT and its API today were not developed using these datasets, and that the datasets were last used in 2021 and deleted due to non-use in 2022.
[02] Importance of Training Data for AI Models
1. Why is high-quality training data important for powerful AI models?
- High-quality training data is a key ingredient of the powerful AI models that are taking the tech world by storm.
- OpenAI and other companies used data from the internet, including many books, to build these models.
- Many of the companies that created this information want to be paid for providing intelligence to these new AI products, but tech companies don't want to be forced to pay.
2. How does the dispute over the use of copyrighted materials for training AI models play out?
- The dispute is now being fought in court through several lawsuits: the companies that created the information want to be paid for it, while tech companies don't want to be forced to pay.