Abstract
The article discusses the author's attempt to classify all the PDFs on the internet using large language models (LLMs) and traditional machine learning techniques. It covers the datasets explored, including Common Crawl and the SafeDocs dataset, and the challenges of working with data at that scale. It also describes the approaches the author experimented with, including few-shot prompting, embedding models, fine-tuning, and traditional techniques such as XGBoost and TFIDF, and closes with the results of those experiments and the lessons learned along the way.
Q&A
[01] Classifying PDFs on the Internet
1. What is the Common Crawl dataset, and how does it differ from the Internet Archive?
- Common Crawl is a web archive of the entire internet, which is currently petabytes in size and has been running since 2007. Unlike the Internet Archive, Common Crawl focuses more on archiving the internet for scientists and researchers rather than digital preservation.
- A key limitation for this project is that Common Crawl stores only the first megabyte of each PDF it finds, truncating the rest.
2. What is the SafeDocs or CC-MAIN-2021-31-PDF-UNTRUNCATED dataset, and how does it address the limitations of Common Crawl?
- The SafeDocs dataset was created by the DARPA SafeDocs program, which refetched all the different PDFs from a snapshot of Common Crawl to have untruncated versions.
- This dataset is incredibly large, with roughly 8.4 million PDFs that total 8TB when uncompressed, making it the biggest pure PDF dataset on the internet.
3. What was the author's goal in classifying the PDFs in the SafeDocs dataset?
- The author wanted to use large language models (LLMs) in their personal projects and was inspired by the FineWeb technical blog/paper, which derived an "educational" subset from the larger FineWeb dataset.
- The author followed a similar "teacher and student" setup, in which an LLM (the teacher) generates labels from unstructured text and a smaller "student", or distilled, model is trained to classify based on those labels (a minimal sketch of this loop follows below).
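A minimal sketch of that teacher/student loop, with purely illustrative names (the prompt wording, the `llm` callable, and the helper functions are assumptions, not the author's code):

```python
# Illustrative teacher/student pseudo-labeling loop; `llm` stands in for any
# hosted or local LLM call and is an assumption, not the author's exact setup.

def teacher_label(text: str, llm) -> str:
    """Ask the teacher LLM for a single category name for a document."""
    prompt = (
        "Classify the following PDF text into a single category.\n\n"
        f"{text[:2000]}\n\nCategory:"
    )
    return llm(prompt).strip()

def build_training_set(documents: list[str], llm) -> list[tuple[str, str]]:
    """Generate (text, pseudo-label) pairs from unlabeled documents."""
    return [(doc, teacher_label(doc, llm)) for doc in documents]

# A smaller "student" (e.g. a fine-tuned encoder or XGBoost on embeddings) is
# then trained on these pairs and used to label the remaining millions of PDFs.
```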
[02] Approaches and Experiments
1. How did the author use few-shot prompting to generate initial labels for the dataset?
- The author used few-shot prompting, a technique where an LLM learns from examples given in the prompt rather than through additional training, to generate 100,000 initial labels with the Llama-3-70B model.
- The author then filtered the labels, keeping only those with 250 or more occurrences and mapping the rest to "other" so that training focused on the most frequent classes (a rough sketch of the prompt and the filtering step follows).
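As a rough illustration (the categories and prompt wording below are invented, not the author's), a few-shot prompt bakes the expected output format into its examples, and the rare-label filtering reduces to a simple counter:

```python
from collections import Counter

# Hypothetical few-shot prompt: the in-context examples show the model the
# desired output format without any gradient updates.
FEW_SHOT_PROMPT = """Classify the document into a single category.

Text: "Quarterly earnings report for fiscal year 2020..."
Category: financial

Text: "Introduction to linear algebra, chapter 3 exercises..."
Category: educational

Text: "{document}"
Category:"""

def filter_rare_labels(labels: list[str], min_count: int = 250) -> list[str]:
    """Keep labels with at least `min_count` occurrences; map the rest to 'other'."""
    counts = Counter(labels)
    return [lab if counts[lab] >= min_count else "other" for lab in labels]
```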
2. What type of embeddings model did the author use, and how did they fine-tune it for the classification task?
- The author used an embeddings model to generate semantic representations of the PDF text, which were then used to train a classification model.
- The author experimented with several embedding models, including Stella_en_400M, gte-large-en-v1.5, and Arctic Embed, and fine-tuned them for the classification task (a simplified sketch of this pipeline follows).
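A simplified version of this pipeline might look like the following. It uses a small off-the-shelf sentence-transformers checkpoint and a frozen-embeddings classifier rather than the author's fine-tuned setup, so the model name and hyperparameters are assumptions:

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

def train_embedding_classifier(train_texts, train_labels, test_texts, test_labels):
    """Embed documents with a frozen encoder and fit a linear classifier on top."""
    # Any of the embedding models named above could be substituted here;
    # this small checkpoint just keeps the example light.
    encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    X_train = encoder.encode(train_texts, batch_size=64, show_progress_bar=True)
    X_test = encoder.encode(test_texts, batch_size=64)

    # The author fine-tuned the encoders themselves; this sketch instead
    # trains only a lightweight classifier on top of frozen embeddings.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, train_labels)
    print("accuracy:", clf.score(X_test, test_labels))
    return encoder, clf
```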
3. How did the author compare the performance of the deep learning-based approach to traditional machine learning techniques?
- The author explored using XGBoost and TFIDF (Term Frequency-Inverse Document Frequency) as alternative approaches to the deep learning-based classification.
- The author found that the XGBoost model using the embeddings outperformed the deep learning-based approach, achieving an accuracy of 83.97%.
- The TFIDF-based models also beat the initial deep learning-based approach, with a Linear Regressor ensemble reaching an accuracy of 70.68% (both baselines are sketched below).
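Both baselines are easy to reproduce in outline. The hyperparameters below are placeholders, a logistic-regression classifier stands in for the linear ensemble mentioned above, and the accuracies quoted come from the author's setup, not from this sketch:

```python
import xgboost as xgb
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

def xgb_on_embeddings(X_train, y_train, X_test, y_test):
    """Gradient-boosted trees over precomputed embedding vectors."""
    enc = LabelEncoder()  # XGBoost expects integer-encoded class labels
    booster = xgb.XGBClassifier(n_estimators=500, max_depth=6, learning_rate=0.1)
    booster.fit(X_train, enc.fit_transform(y_train))
    return booster.score(X_test, enc.transform(y_test))

def tfidf_linear_baseline(train_texts, y_train, test_texts, y_test):
    """Sparse TFIDF features with a simple linear classifier."""
    vectorizer = TfidfVectorizer(max_features=50_000)
    X_train = vectorizer.fit_transform(train_texts)
    X_test = vectorizer.transform(test_texts)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)
    return clf.score(X_test, y_test)
```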
4. What were the results of the author's final experiments using larger datasets and different deep learning models?
- The author generated an additional 400,000 labels using the Llama-3.1-8B model and experimented with the RoBERTa-base and gte-large models.
- The gte-large model with the 400,000 labels achieved an accuracy of 69.22%, which was closer to the author's goal of 70% accuracy.
- The author also performed a hyperparameter sweep on the XGBoost embeddings model, which resulted in an accuracy of 85.26%, the best-performing model overall (an illustrative sweep is sketched below).
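The sweep itself could be run with any search tool; the randomized-search sketch below uses illustrative parameter ranges and is not the author's exact configuration:

```python
import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV

def sweep_xgb(X_train, y_train_encoded):
    """Randomized search over a handful of XGBoost hyperparameters."""
    param_dist = {
        "n_estimators": [200, 500, 1000],
        "max_depth": [4, 6, 8, 10],
        "learning_rate": [0.01, 0.05, 0.1, 0.3],
        "subsample": [0.6, 0.8, 1.0],
        "colsample_bytree": [0.6, 0.8, 1.0],
    }
    search = RandomizedSearchCV(
        xgb.XGBClassifier(tree_method="hist"),
        param_distributions=param_dist,
        n_iter=30,
        scoring="accuracy",
        cv=3,
        n_jobs=-1,
    )
    search.fit(X_train, y_train_encoded)
    return search.best_params_, search.best_score_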
5. How did the author visualize the results of the classification task?
- The author generated PCA and UMAP visualizations of the entire dataset, as well as visualizations of the individual class distributions.
- To run the UMAP visualization on 6.5 million points, the author rented a powerful Azure virtual machine with 48 cores, 384GB of RAM, and a 768GB disk (a minimal plotting sketch follows).
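The projections themselves are only a few lines (umap-learn and scikit-learn shown here, with default-ish parameters rather than the author's settings); at 6.5 million points, nearly all of the cost sits in the UMAP fit:

```python
import matplotlib.pyplot as plt
import umap
from sklearn.decomposition import PCA

def plot_projections(embeddings, label_ids):
    """Project embeddings to 2D with PCA and UMAP and scatter-plot both."""
    pca_2d = PCA(n_components=2).fit_transform(embeddings)
    # UMAP is far more memory- and CPU-hungry at millions of points,
    # hence the large machine for the full dataset.
    umap_2d = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1).fit_transform(embeddings)

    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
    axes[0].scatter(pca_2d[:, 0], pca_2d[:, 1], c=label_ids, s=1, cmap="tab20")
    axes[0].set_title("PCA")
    axes[1].scatter(umap_2d[:, 0], umap_2d[:, 1], c=label_ids, s=1, cmap="tab20")
    axes[1].set_title("UMAP")
    plt.show()
```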
6. What are the key takeaways and next steps from the author's experience?
- The author acknowledges that they could have performed better with the deep learning-based approach, but were limited by the resources available at the time.
- The author is releasing the datasets, embeddings, and code used in the project, encouraging others to build upon the work and potentially surpass the results.
- The author suggests that PDFs may become more prevalent in training pipelines for large language models and multi-modal models in the future.