SmolLM - blazingly fast and remarkably powerful
Abstract
This blog post introduces SmolLM, a family of state-of-the-art small language models with 135M, 360M, and 1.7B parameters, trained on a new high-quality dataset called SmolLM-Corpus. The post covers the data curation process, model evaluation, and usage details of these small models.
Q&A
[01] Introduction
1. What is the motivation behind developing small language models?
- There is increasing interest in small language models that can run on local devices, since they enable novel applications while dramatically reducing inference costs and improving user privacy.
- Techniques such as distillation, quantization, and training small models from scratch on large datasets are being used to create these small models.
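As a concrete illustration of the quantization route, here is a minimal sketch that loads a causal language model with 4-bit weights through `transformers` and `bitsandbytes`. The choice of model id is an illustrative assumption, and a CUDA GPU plus the `bitsandbytes` package are assumed to be available.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Sketch of one technique mentioned above: post-training quantization.
# Loads a causal LM with 4-bit weights via bitsandbytes; the model id is
# illustrative and a CUDA GPU is assumed.
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM-1.7B",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    device_map="auto",
)
print(f"Weights occupy roughly {model.get_memory_footprint() / 1e9:.2f} GB after 4-bit quantization")
```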
2. What are some examples of small language models developed by other companies?
- Microsoft's Phi series, Alibaba's Qwen2 models under 2B parameters, and Meta's MobileLLM demonstrate that small models can achieve impressive results when designed and trained thoughtfully.
- However, most of the details about the data curation and training of these models are not publicly available.
[02] SmolLM Models
1. What are the key features of the SmolLM models?
- SmolLM is a series of state-of-the-art small language models available in three sizes: 135M, 360M, and 1.7B parameters.
- These models are built on a meticulously curated high-quality training corpus, called SmolLM-Corpus, which the authors are releasing.
2. What are the main components of the SmolLM-Corpus dataset?
- The dataset includes:
- Cosmopedia v2, an enhanced version of the Cosmopedia dataset, consisting of over 30 million textbooks, blog posts, and stories generated by large language models.
- FineWeb-Edu, a dataset of 1.3T tokens of educational web pages filtered from the FineWeb dataset.
- Python-Edu, a refined dataset of 4B tokens of educational Python code samples.
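Since the corpus is released on the Hugging Face Hub, a quick way to inspect it is to stream one subset with the `datasets` library, as in the sketch below. The repository and config names (`HuggingFaceTB/smollm-corpus`, `cosmopedia-v2`) and the `text` field are assumptions about the Hub layout rather than details confirmed in the post.

```python
from datasets import load_dataset

# Stream the Cosmopedia v2 subset of SmolLM-Corpus so nothing is downloaded in full.
cosmopedia = load_dataset(
    "HuggingFaceTB/smollm-corpus",   # repo name assumed from the public release
    "cosmopedia-v2",                 # subset/config name assumed
    split="train",
    streaming=True,
)

# Peek at a single synthetic document.
sample = next(iter(cosmopedia))
print(sample["text"][:500])  # "text" column assumed from the dataset card
```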
3. How were the prompts and seed samples for the Cosmopedia v2 dataset improved compared to the previous version?
- In Cosmopedia v2, the authors used a predefined list of 34,000 topics based on the BISAC book classification, and implemented a search tool to retrieve the most relevant web pages for each topic.
- This approach was more effective than the unsupervised clustering approach used in Cosmopedia v1.
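One way to picture the topic-to-seed retrieval step is a simple BM25 sketch: score candidate web pages against a topic string and keep the best matches as seed samples for the generation prompt. This is not the authors' search tool; the `rank_bm25` package, the toy page pool, the topic wording, and the prompt template are all illustrative assumptions.

```python
from rank_bm25 import BM25Okapi

# Toy stand-in for the web-page pool being searched; in practice this would be
# millions of documents, not three strings.
web_pages = [
    "Photosynthesis converts sunlight into chemical energy inside plant cells.",
    "An introduction to linear algebra: vectors, matrices, and linear maps.",
    "A beginner's guide to baking sourdough bread at home.",
]
index = BM25Okapi([page.lower().split() for page in web_pages])

# One BISAC-style topic (illustrative wording, not an entry from the real list).
topic = "SCIENCE / Life Sciences / Botany"
query = topic.lower().replace("/", " ").split()
scores = index.get_scores(query)

# Keep the best-matching pages as seed samples for a textbook-generation prompt.
ranked = sorted(range(len(web_pages)), key=lambda i: scores[i], reverse=True)
seeds = [web_pages[i] for i in ranked[:2]]
prompt = (
    f"Write a textbook section about '{topic}' for middle school students.\n"
    "Use the following web extracts as inspiration:\n" + "\n".join(seeds)
)
print(prompt)
```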
4. What were the findings from the ablation studies on the Cosmopedia v2 dataset?
- The authors found that textbooks based on topics and seed samples from curated educational sources such as Stanford courses and OpenStax provided the best overall performance.
- Textbooks aimed primarily at middle school students gave the best scores on most benchmarks, except for MMLU, which requires more advanced knowledge.
[03] Training and Evaluation of SmolLM Models
1. What was the training mixture used for the SmolLM models?
- The SmolLM models were trained on a mixture of data sources, including Cosmopedia v2, FineWeb-Edu, Python-Edu, and other subsets from the Cosmopedia v1 dataset.
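A rough sketch of how such a mixture can be assembled with the `datasets` library is shown below. The sampling probabilities are placeholders rather than the authors' published weights, and the subset names and shared `text` column are assumptions about the Hub release; Python-Edu and the Cosmopedia v1 subsets would be added the same way.

```python
from datasets import load_dataset, interleave_datasets

# Stream two of the text subsets and keep only their shared "text" column.
fineweb_edu = load_dataset("HuggingFaceTB/smollm-corpus", "fineweb-edu-dedup",
                           split="train", streaming=True).select_columns(["text"])
cosmopedia = load_dataset("HuggingFaceTB/smollm-corpus", "cosmopedia-v2",
                          split="train", streaming=True).select_columns(["text"])

# Placeholder sampling weights -- NOT the published SmolLM training mixture.
mixture = interleave_datasets(
    [fineweb_edu, cosmopedia],
    probabilities=[0.75, 0.25],
    seed=0,
)

# Inspect a few interleaved examples.
for i, example in enumerate(mixture):
    print(example["text"][:80])
    if i == 2:
        break
```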
2. What architectural choices were made for the different sizes of SmolLM models?
- The 135M and 360M parameter models used a design similar to MobileLLM, incorporating Grouped-Query Attention (GQA) and prioritizing depth over width.
- The 1.7B parameter model used a more traditional architecture.
- All three models used embedding tying and a context length of 2048 tokens.
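A minimal sketch of what such a deep-and-narrow, GQA-based configuration looks like in `transformers` is shown below. The specific numbers are illustrative assumptions chosen to land near 135M parameters, not the released checkpoint's exact config.

```python
from transformers import LlamaConfig, LlamaForCausalLM

# Illustrative deep-and-narrow configuration in the spirit of the smallest model:
# many layers, a small hidden size, grouped-query attention (fewer KV heads than
# query heads), tied embeddings, and a 2048-token context window.
# The exact values are assumptions, not the released checkpoint's config.
config = LlamaConfig(
    vocab_size=49152,
    hidden_size=576,
    intermediate_size=1536,
    num_hidden_layers=30,        # depth prioritized over width
    num_attention_heads=9,
    num_key_value_heads=3,       # GQA: 3 KV heads shared across 9 query heads
    max_position_embeddings=2048,
    tie_word_embeddings=True,    # embedding tying
)
model = LlamaForCausalLM(config)
print(f"{model.num_parameters() / 1e6:.0f}M parameters")
```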
3. How did the authors evaluate the performance of the SmolLM models?
- The SmolLM models were evaluated on a diverse set of benchmarks testing common sense reasoning and world knowledge, using the same evaluation setup for all models.
- The authors also instruction-tuned the models using publicly available permissive instruction datasets and evaluated their performance on the IFEval benchmark.
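To make the evaluation setup concrete, here is a minimal sketch of the log-likelihood scoring that multiple-choice harnesses typically use: score each candidate answer by the log-probability the model assigns to its tokens and pick the highest. The model id, the toy question, and the simplified handling of the prompt/answer token boundary are assumptions; this is not the authors' actual evaluation harness.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM-135M"  # Hub id assumed
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

question = "Question: What do plants use to turn sunlight into energy?\nAnswer:"
choices = [" photosynthesis", " fermentation", " condensation"]

def choice_logprob(prompt: str, choice: str) -> float:
    """Sum of log-probabilities the model assigns to the choice tokens."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    full_ids = tok(prompt + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Score only the continuation tokens (assumes a clean token boundary).
    choice_len = full_ids.shape[1] - prompt_ids.shape[1]
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    return logprobs[-choice_len:].gather(1, targets[-choice_len:, None]).sum().item()

scores = [choice_logprob(question, c) for c in choices]
print("Predicted:", choices[scores.index(max(scores))])
```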
4. How do the SmolLM models compare to other small language models in terms of performance and memory footprint?
- The SmolLM models outperform other models in their size categories across the evaluated benchmarks.
- The memory footprint of the SmolLM models makes them suitable for deployment on a wide range of devices, from smartphones to laptops.
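As a concrete usage sketch, the snippet below loads one of the checkpoints in bfloat16, reports its approximate weight memory, and generates a short completion. The model id follows the Hub naming, and the footprint figure depends on dtype and hardware.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM-360M"  # Hub id assumed
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Rough weight memory in bfloat16; quantization would shrink this further.
print(f"Approx. weight memory: {model.get_memory_footprint() / 1e9:.2f} GB")

inputs = tok("Gravity is", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(tok.decode(out[0], skip_special_tokens=True))
```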