CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation
Abstract
The article proposes a method called Corpus Retrieval and Augmentation for Fine-Tuning (CRAFT) to generate synthetic datasets for fine-tuning language models on specialized tasks. CRAFT requires only a small number of human-curated few-shot examples to retrieve relevant documents from large-scale web-crawled corpora and then uses instruction-tuned language models to augment the retrieved documents into custom-formatted task samples. The authors demonstrate the effectiveness of CRAFT on four diverse tasks: biology question-answering, medicine question-answering, commonsense question-answering, and text summarization. The results show that models fine-tuned on CRAFT-generated datasets achieve performance that is either better than or comparable to instruction-tuned language models.
Q&A
[01] Building High-Quality Datasets
1. What are the key challenges in building high-quality datasets for specialized tasks?
- Building high-quality datasets for specialized tasks is a time-consuming and resource-intensive process that often requires specialized domain knowledge.
- Existing datasets may be limited or non-existent, especially for low-resource domains or novel tasks.
2. How do the authors address these challenges with CRAFT?
- CRAFT only requires a small set of few-shot examples from the user to initiate the process of retrieving and structuring task examples from existing web-crawled corpora.
- CRAFT uses large-scale public web-crawled corpora and similarity-based document retrieval to find relevant human-written documents.
- CRAFT then uses instruction-tuned language models to augment the retrieved documents into custom-formatted task samples (a high-level sketch of the full pipeline follows this list).
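A minimal, high-level sketch of this pipeline is shown below. All function names (`build_embedding_db`, `retrieve`, `build_prompt`) are hypothetical placeholders for the stages described in section [02], not the authors' actual code.

```python
# High-level sketch of the CRAFT pipeline as summarized above. Every function name
# here is a hypothetical placeholder; the individual stages are sketched in section [02].
def craft(few_shot_examples, corpus, task_instruction, generator_llm, n_samples):
    index, embeddings = build_embedding_db(corpus)                       # embed the corpus once
    documents = retrieve(few_shot_examples, index, embeddings, corpus)   # similarity-based retrieval
    samples = []
    for doc in documents[:n_samples]:
        prompt = build_prompt(task_instruction, few_shot_examples, doc)  # few-shots + document + instruction
        samples.append(generator_llm(prompt))                            # augment into a task sample
    return samples
```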
[02] The CRAFT Approach
1. What are the key components of the CRAFT framework?
- Few-shot examples provided by the user to define the task
- An embedding database created from large corpora to enable retrieval of relevant documents
- A two-step retrieval process to efficiently find the documents most similar to the few-shot examples (sketches of the index construction and query steps follow below)
- Instruction-tuned language models to augment the retrieved documents into task-specific samples
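Below is a minimal sketch of how such an embedding database could be built, assuming sentence-transformers for the embeddings and FAISS for the index. The embedding model, index type, and corpus format are illustrative choices, not the paper's exact configuration.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical choice of embedding model

def build_embedding_db(documents: list[str]):
    """Embed every corpus document and index it for fast approximate nearest-neighbour search."""
    embeddings = np.asarray(
        encoder.encode(documents, normalize_embeddings=True), dtype="float32"
    )
    dim = embeddings.shape[1]
    # Inverted-file index: coarse but fast first-pass search.
    # nlist=256 is arbitrary and assumes the corpus is much larger than 256 documents.
    quantizer = faiss.IndexFlatIP(dim)
    index = faiss.IndexIVFFlat(quantizer, dim, 256, faiss.METRIC_INNER_PRODUCT)
    index.train(embeddings)
    index.add(embeddings)
    return index, embeddings
```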
2. How does CRAFT leverage the few-shot examples and the embedding database?
- The few-shot examples exemplify the language, content, and accuracy of high-quality corpus samples, and they also specify the task instruction and expected output.
- The embedding database provides embeddings of diverse human-written documents that can be retrieved for task-specific augmentation.
- The retrieval system uses the few-shot examples as queries to dynamically retrieve relevant documents from the corpora (see the query-side sketch below).
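The query side might look like the following sketch, which reuses the `encoder`, `index`, and `embeddings` from the previous block. The two-step scheme shown here (approximate candidate search followed by exact re-ranking) is one plausible reading of the paper's efficient retrieval, and the pool sizes are arbitrary.

```python
import numpy as np

def retrieve(few_shots, index, embeddings, documents, k_coarse=1000, k_final=100):
    """Use each few-shot example as a query; return the most similar corpus documents."""
    queries = np.asarray(
        encoder.encode(few_shots, normalize_embeddings=True), dtype="float32"
    )
    retrieved = []
    for q in queries:
        # Step 1: cheap approximate search over the whole index
        # (assumes the corpus holds more than k_coarse documents).
        _, candidate_ids = index.search(q[None, :], k_coarse)
        candidate_ids = candidate_ids[0]
        # Step 2: exact cosine similarity on the candidate pool, keep the best k_final.
        scores = embeddings[candidate_ids] @ q
        best = candidate_ids[np.argsort(-scores)[:k_final]]
        retrieved.extend(documents[i] for i in best)
    return retrieved
```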
3. How does CRAFT generate the final synthetic task samples?
- CRAFT uses instruction-tuning prompt templates that combine the few-shots, the retrieved corpus samples, and a brief instruction for the model to generate the final task samples.
- This augmentation step rephrases the retrieved documents and condenses them down to the essential information required for the task (an illustrative prompt template is sketched below).
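An illustrative prompt template for this augmentation step is sketched below. The wording and layout are assumptions rather than the paper's actual templates; the sketch only shows how the few-shots, a retrieved document, and a brief instruction are combined.

```python
# Illustrative augmentation prompt; the exact wording of the paper's templates may differ.
AUGMENT_TEMPLATE = """You are creating training data for the following task:
{task_instruction}

Here are examples of correctly formatted samples:
{few_shot_examples}

Rewrite the document below into one new sample in exactly the same format.
Keep only the information relevant to the task.

Document:
{retrieved_document}

New sample:"""

def build_prompt(task_instruction, few_shots, document):
    return AUGMENT_TEMPLATE.format(
        task_instruction=task_instruction,
        few_shot_examples="\n\n".join(few_shots),
        retrieved_document=document,
    )
```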
[03] Experimental Setup
1. What are the tasks evaluated in the experiments?
- Multiple-choice question-answering tasks in biology, medicine, and commonsense
- Generative tasks of text summarization and recipe generation
2. How do the authors evaluate the generated datasets?
- For QA tasks, they use accuracy as the evaluation metric.
- For generative tasks, they use language models as judges to provide preference scores between generated outputs and human-curated references (see the evaluation sketch below).
- They compare the performance of models trained on CRAFT-generated datasets against few-shot baselines, instruction-tuned models, and models trained on human-curated datasets.
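The two evaluation modes could be implemented along the lines of the sketch below. The accuracy computation is generic, and the judge prompt is an illustrative stand-in for whatever template the authors actually used with their judge model.

```python
def accuracy(predictions: list[str], gold: list[str]) -> float:
    """Multiple-choice QA: fraction of predictions that exactly match the gold answer."""
    return sum(p.strip() == g.strip() for p, g in zip(predictions, gold)) / len(gold)

# Hypothetical judge prompt for the generative tasks: the judge model answers "A" or "B".
JUDGE_TEMPLATE = """Given the source text and two candidate outputs, decide which output
is the better response to the task. Answer with "A" or "B" only.

Source:
{source}

Output A:
{output_a}

Output B:
{output_b}

Better output:"""
```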
[04] Results
1. How does the performance of CRAFT-generated datasets scale with the amount of data?
- The authors observe consistent performance improvements across the tasks as they increase the size of the CRAFT-generated datasets.
- Relative to the few-shot baseline, the CRAFT-generated datasets show improvements of 17% for BioQA, 12% for CSQA, 23% for MedQA, and 124% for summarization.
2. How do CRAFT-generated datasets compare to human-curated datasets?
- For the QA tasks, the CRAFT-generated datasets achieve performance that is comparable to or better than the instruction-tuned baseline.
- For summarization, the CRAFT-generated datasets outperform the models trained on human-curated data by 46 preference points.
- The authors also find that CRAFT-generated datasets exhibit lower lexical overlap with the test sets than the human-curated datasets, suggesting that their performance reflects genuine generalization rather than train-test leakage (a rough overlap check is sketched below).
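A rough version of such an overlap check is sketched below. The n-gram length and the per-example overlap statistic are illustrative choices rather than the paper's exact analysis.

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Set of word n-grams in a lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_rate(train_texts: list[str], test_texts: list[str], n: int = 8) -> float:
    """Fraction of test examples that share at least one n-gram with the training data."""
    train_grams = set().union(*(ngrams(t, n) for t in train_texts))
    hits = sum(bool(ngrams(t, n) & train_grams) for t in test_texts)
    return hits / len(test_texts)
```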
3. What are the limitations observed in the recipe generation task?
- The authors observe a drop in performance when scaling the recipe generation dataset from 100 to 25,000 examples.
- Analysis suggests that as more documents are retrieved, the CRAFT pipeline returns progressively less relevant ones, which lowers the quality of the generated data.
- The authors recommend incorporating effective stopping criteria or additional quality validation steps in future iterations of CRAFT to address this limitation (one such criterion is sketched below).
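One such quality gate could be as simple as the similarity-threshold filter sketched below. The threshold value and the reliance on retrieval scores are assumptions, not the authors' prescription.

```python
def filter_by_similarity(documents, scores, min_similarity=0.5):
    """Drop retrieved documents whose similarity to the few-shot queries falls below a threshold."""
    return [doc for doc, score in zip(documents, scores) if score >= min_similarity]
```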