CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation
Abstract
The article proposes a method called Corpus Retrieval and Augmentation for Fine-Tuning (CRAFT) to generate synthetic datasets for fine-tuning language models on specialized tasks. CRAFT requires only a small number of human-curated few-shot examples to retrieve relevant documents from large-scale web-crawled corpora and then uses instruction-tuned language models to augment the retrieved documents into custom-formatted task samples. The authors demonstrate the effectiveness of CRAFT on four diverse tasks: biology question-answering, medicine question-answering, commonsense question-answering, and text summarization. The results show that models fine-tuned on CRAFT-generated datasets achieve performance that is either better than or comparable to instruction-tuned language models.
Q&A
[01] Building High-Quality Datasets
1. What are the key challenges in building high-quality datasets for specialized tasks?
- Building high-quality datasets for specialized tasks is a time-consuming and resource-intensive process that often requires specialized domain knowledge.
- Existing datasets may be limited or non-existent, especially for low-resource domains or novel tasks.
2. How do the authors address these challenges with CRAFT?
- CRAFT only requires a small set of few-shot examples from the user to initiate the process of retrieving and structuring task examples from existing web-crawled corpora.
- CRAFT uses large-scale public web-crawled corpora and similarity-based document retrieval to find relevant human-written documents.
- CRAFT then uses instruction-tuned language models to augment the retrieved documents into custom-formatted task samples (a high-level sketch of the full pipeline follows this list).
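A minimal, high-level sketch of this pipeline is shown below. All function names (`build_embedding_db`, `retrieve`, `build_prompt`) are hypothetical placeholders for the stages described in section [02], not the authors' actual code.

```python
# High-level sketch of the CRAFT pipeline as summarized above. Every function name
# here is a hypothetical placeholder; the individual stages are sketched in section [02].
def craft(few_shot_examples, corpus, task_instruction, generator_llm, n_samples):
    index, embeddings = build_embedding_db(corpus)                       # embed the corpus once
    documents = retrieve(few_shot_examples, index, embeddings, corpus)   # similarity-based retrieval
    samples = []
    for doc in documents[:n_samples]:
        prompt = build_prompt(task_instruction, few_shot_examples, doc)  # few-shots + document + instruction
        samples.append(generator_llm(prompt))                            # augment into a task sample
    return samples
```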
[02] The CRAFT Approach
1. What are the key components of the CRAFT framework?
- Few-shot examples provided by the user to define the task
- An embedding database created from large corpora to enable retrieval of relevant documents
- A two-step retrieval process to efficiently find the documents most similar to the few-shot examples (sketches of the index construction and query steps follow below)
- Instruction-tuned language models to augment the retrieved documents into task-specific samples
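Below is a minimal sketch of how such an embedding database could be built, assuming sentence-transformers for the embeddings and FAISS for the index. The embedding model, index type, and corpus format are illustrative choices, not the paper's exact configuration.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical choice of embedding model

def build_embedding_db(documents: list[str]):
    """Embed every corpus document and index it for fast approximate nearest-neighbour search."""
    embeddings = np.asarray(
        encoder.encode(documents, normalize_embeddings=True), dtype="float32"
    )
    dim = embeddings.shape[1]
    # Inverted-file index: coarse but fast first-pass search.
    # nlist=256 is arbitrary and assumes the corpus is much larger than 256 documents.
    quantizer = faiss.IndexFlatIP(dim)
    index = faiss.IndexIVFFlat(quantizer, dim, 256, faiss.METRIC_INNER_PRODUCT)
    index.train(embeddings)
    index.add(embeddings)
    return index, embeddings
```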
2. How does CRAFT leverage the few-shot examples and the embedding database?
- The few-shot examples exemplify the language, content, and accuracy of high-quality corpus samples, and they also specify the task instruction and expected output.
- The embedding database provides embeddings of diverse human-written documents that can be retrieved for task-specific augmentation.
- The retrieval system uses the few-shot examples as queries to dynamically retrieve relevant documents from the corpora (see the query-side sketch below).
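The query side might look like the following sketch, which reuses the `encoder`, `index`, and `embeddings` from the previous block. The two-step scheme shown here (approximate candidate search followed by exact re-ranking) is one plausible reading of the paper's efficient retrieval, and the pool sizes are arbitrary.

```python
import numpy as np

def retrieve(few_shots, index, embeddings, documents, k_coarse=1000, k_final=100):
    """Use each few-shot example as a query; return the most similar corpus documents."""
    queries = np.asarray(
        encoder.encode(few_shots, normalize_embeddings=True), dtype="float32"
    )
    retrieved = []
    for q in queries:
        # Step 1: cheap approximate search over the whole index
        # (assumes the corpus holds more than k_coarse documents).
        _, candidate_ids = index.search(q[None, :], k_coarse)
        candidate_ids = candidate_ids[0]
        # Step 2: exact cosine similarity on the candidate pool, keep the best k_final.
        scores = embeddings[candidate_ids] @ q
        best = candidate_ids[np.argsort(-scores)[:k_final]]
        retrieved.extend(documents[i] for i in best)
    return retrieved
```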
3. How does CRAFT generate the final synthetic task samples?
- CRAFT uses instruction-tuning prompt templates that combine the few-shots, the retrieved corpus samples, and a brief instruction for the model to generate the final task samples.
- This augmentation step rephrases the retrieved documents and condenses them down to the essential information required for the task (an illustrative prompt template is sketched below).
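An illustrative prompt template for this augmentation step is sketched below. The wording and layout are assumptions rather than the paper's actual templates; the sketch only shows how the few-shots, a retrieved document, and a brief instruction are combined.

```python
# Illustrative augmentation prompt; the exact wording of the paper's templates may differ.
AUGMENT_TEMPLATE = """You are creating training data for the following task:
{task_instruction}

Here are examples of correctly formatted samples:
{few_shot_examples}

Rewrite the document below into one new sample in exactly the same format.
Keep only the information relevant to the task.

Document:
{retrieved_document}

New sample:"""

def build_prompt(task_instruction, few_shots, document):
    return AUGMENT_TEMPLATE.format(
        task_instruction=task_instruction,
        few_shot_examples="\n\n".join(few_shots),
        retrieved_document=document,
    )
```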
[03] Experimental Setup
1. What are the tasks evaluated in the experiments?
- Multiple-choice question-answering tasks in biology, medicine, and commonsense
- Generative tasks of text summarization and recipe generation
2. How do the authors evaluate the generated datasets?
- For QA tasks, they use accuracy as the evaluation metric.
- For generative tasks, they use language models as judges to provide preference scores between generated outputs and human-curated references (see the evaluation sketch below).
- They compare the performance of models trained on CRAFT-generated datasets against few-shot baselines, instruction-tuned models, and models trained on human-curated datasets.
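The two evaluation modes could be implemented along the lines of the sketch below. The accuracy computation is generic, and the judge prompt is an illustrative stand-in for whatever template the authors actually used with their judge model.

```python
def accuracy(predictions: list[str], gold: list[str]) -> float:
    """Multiple-choice QA: fraction of predictions that exactly match the gold answer."""
    return sum(p.strip() == g.strip() for p, g in zip(predictions, gold)) / len(gold)

# Hypothetical judge prompt for the generative tasks: the judge model answers "A" or "B".
JUDGE_TEMPLATE = """Given the source text and two candidate outputs, decide which output
is the better response to the task. Answer with "A" or "B" only.

Source:
{source}

Output A:
{output_a}

Output B:
{output_b}

Better output:"""
```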
[04] Results
1. How does the performance of CRAFT-generated datasets scale with the amount of data?
- The authors observe consistent performance improvements across the tasks as they increase the size of the CRAFT-generated datasets.
- Relative to the few-shot baseline, the CRAFT-generated datasets show improvements of 17% for BioQA, 12% for CSQA, 23% for MedQA, and 124% for summarization.
2. How do CRAFT-generated datasets compare to human-curated datasets?
- For the QA tasks, the CRAFT-generated datasets achieve performance that is comparable to or better than the instruction-tuned baseline.
- For summarization, the CRAFT-generated datasets outperform the models trained on human-curated data by 46 preference points.
- The authors also find that CRAFT-generated datasets exhibit lower lexical overlap with the test sets than the human-curated datasets, suggesting that their performance reflects genuine generalization rather than train-test leakage (a rough overlap check is sketched below).
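A rough version of such an overlap check is sketched below. The n-gram length and the per-example overlap statistic are illustrative choices rather than the paper's exact analysis.

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Set of word n-grams in a lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_rate(train_texts: list[str], test_texts: list[str], n: int = 8) -> float:
    """Fraction of test examples that share at least one n-gram with the training data."""
    train_grams = set().union(*(ngrams(t, n) for t in train_texts))
    hits = sum(bool(ngrams(t, n) & train_grams) for t in test_texts)
    return hits / len(test_texts)
```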
3. What are the limitations observed in the recipe generation task?
- The authors observe a drop in performance when scaling the recipe generation dataset from 100 to 25,000 examples.
- Analysis suggests that as more documents are retrieved, the CRAFT pipeline returns progressively less relevant ones, which lowers the quality of the generated data.
- The authors recommend incorporating effective stopping criteria or additional quality validation steps in future iterations of CRAFT to address this limitation (one such criterion is sketched below).
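One such quality gate could be as simple as the similarity-threshold filter sketched below. The threshold value and the reliance on retrieval scores are assumptions, not the authors' prescription.

```python
def filter_by_similarity(documents, scores, min_similarity=0.5):
    """Drop retrieved documents whose similarity to the few-shot queries falls below a threshold."""
    return [doc for doc, score in zip(documents, scores) if score >= min_similarity]
```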