
Releasing Re-LAION 5B: transparent iteration on LAION-5B with additional safety fixes | LAION

🌈 Abstract

The article announces Re-LAION-5B, an updated version of the LAION-5B dataset that has been thoroughly cleaned of known links to suspected child sexual abuse material (CSAM). It stresses the importance of open, transparent datasets for reproducible machine learning research, and the difficulty of ensuring legal compliance for large-scale datasets gathered from the public web. It then outlines how LAION partnered with the Internet Watch Foundation (IWF) and the Canadian Centre for Child Protection (C3P) to identify and remove links to suspected CSAM, and how it removed other sensitive data in cooperation with Human Rights Watch (HRW).

🙋 Q&A

[01] Motivation and Approach

1. What is the motivation behind the release of Re-LAION-5B?

  • The motivation is to provide a web-scale dataset of text and image-link pairs that has been thoroughly cleaned of known links to suspected CSAM, enabling reproducible machine learning research on foundation models.

2. What were the key steps taken by LAION to clean the LAION-5B dataset?

  • LAION partnered with the IWF and C3P to obtain lists of MD5 hashes of images and URLs for known CSAM samples on the public internet.
  • LAION matched these hash lists against LAION-5B and removed all links to suspected CSAM samples, without having to inspect the actual content.
  • LAION also removed additional privacy-related data in cooperation with Human Rights Watch.
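
The hash-based removal described above can be illustrated with a minimal Python sketch. It is not LAION's actual pipeline: the row fields (`url`, `image_md5`), the function names, and the choice to hash the raw URL string without normalization are all assumptions made for illustration. The point it shows is that filtering needs only set membership on hashes, so the linked content is never opened.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Return the hex MD5 digest of a UTF-8 encoded string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def filter_by_hash_list(rows, blocked_url_hashes, blocked_image_hashes):
    """Split rows into (kept, removed) using hash membership only.

    Field names 'url' and 'image_md5' are hypothetical; the real
    dataset schema may differ.
    """
    kept, removed = [], []
    for row in rows:
        url_hit = md5_hex(row["url"]) in blocked_url_hashes
        image_hit = row.get("image_md5") in blocked_image_hashes
        (removed if url_hit or image_hit else kept).append(row)
    return kept, removed


# Toy data standing in for dataset rows and partner hash lists.
rows = [
    {"url": "https://example.com/a.jpg", "caption": "a cat", "image_md5": "a" * 32},
    {"url": "https://example.com/b.jpg", "caption": "a dog", "image_md5": "b" * 32},
    {"url": "https://example.com/c.jpg", "caption": "a tree", "image_md5": "c" * 32},
]
blocked_url_hashes = {md5_hex("https://example.com/b.jpg")}
blocked_image_hashes = {"c" * 32}

kept, removed = filter_by_hash_list(rows, blocked_url_hashes, blocked_image_hashes)
print(len(kept), len(removed))  # 1 2
```

In this sketch the number of removed rows is a count of hash matches, not of confirmed live content, which is why such a count can only serve as an upper bound.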

3. What are the two versions of Re-LAION-5B that are being released?

  • The two versions are Re-LAION-5B-research and Re-LAION-5B-research-safe. The research-safe version is a subset of Re-LAION-5B-research, which is itself a subset of the original LAION-5B dataset. Both versions are released with gated access, requiring affiliation information and consent for use.

[02] Findings and Insights

1. What was the total number of unique hashes provided by the partner organizations (IWF and C3P)?

  • The total number of unique hashes provided by the partner organizations was 16.2 million (2.2 million from IWF and 14 million from C3P).

2. What is the estimated upper bound for the number of links to suspected CSAM in the original LAION-5B dataset?

  • The upper bound is 2,236 links, as this is the total number of matches found between the LAION-5B dataset and the hash lists provided by the partner organizations.

3. What insights were gained during the safety iteration process?

  • Many of the matched links were likely dead or no longer accessible, as the partner organizations had already taken down the actual content.
  • The 2,236 matches subsume the 1,008 suspected links identified in the Stanford Internet Observatory report, and the actual number of links to suspected CSAM is likely much lower than this upper bound.

[03] Release and Recommendations

1. What are the key differences between the Re-LAION-5B-research and Re-LAION-5B-research-safe versions?

  • Re-LAION-5B-research-safe is a proper subset of Re-LAION-5B-research, which is in turn a proper subset of the original LAION-5B dataset.
  • Both versions are released with gated access, requiring affiliation information and consent for use.
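
The subset relation stated above can be expressed as a simple check. The following is a minimal sketch, assuming each dataset version can be represented as a set of sample identifiers; the identifiers and the function name are hypothetical. Python's `<` operator on sets tests for a strict (proper) subset.

```python
def is_proper_subset_chain(safe, research, original):
    """True if safe is a proper subset of research, and research of original."""
    return safe < research < original


# Toy identifiers standing in for dataset samples (hypothetical).
original = {"u1", "u2", "u3", "u4"}
research = original - {"u4"}   # one sample removed from the original
safe = research - {"u3"}       # one further sample removed

print(is_proper_subset_chain(safe, research, original))      # True
print(is_proper_subset_chain(research, research, original))  # False: sets are equal, not a proper subset
```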

2. What are the key recommendations for using the Re-LAION datasets?

  • The datasets are intended for research purposes, especially for conducting basic research on open multi-modal foundation models.
  • LAION strongly advises against using the datasets in industrial settings or for creating end products, as the datasets can contain links to various discomforting image samples.
  • LAION is not responsible for the content that can be accessed via the links, and researchers should not inspect the content of individual samples.
Shared by Daniel Chen
© 2024 NewMotor Inc.