
CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation

🌈 Abstract

The article proposes a method called Corpus Retrieval and Augmentation for Fine-Tuning (CRAFT) to generate synthetic datasets for fine-tuning language models on specialized tasks. CRAFT requires only a small number of human-curated few-shot examples: it uses them to retrieve relevant documents from large-scale web-crawled corpora and then uses instruction-tuned language models to augment the retrieved documents into custom-formatted task samples. The authors demonstrate the effectiveness of CRAFT on four diverse tasks: biology question-answering, medicine question-answering, commonsense question-answering, and text summarization. The results show that models fine-tuned on CRAFT-generated datasets perform better than or comparably to instruction-tuned language models.

🙋 Q&A

[01] Building High-Quality Datasets

1. What are the key challenges in building high-quality datasets for specialized tasks?

  • Building high-quality datasets for specialized tasks is a time-consuming and resource-intensive process that often requires specialized domain knowledge.
  • Existing datasets may be limited or non-existent, especially for low-resource domains or novel tasks.

2. How do the authors address these challenges with CRAFT?

  • CRAFT only requires a small set of user-provided few-shot examples to kick off the process of retrieving relevant documents and structuring them into task examples.
  • CRAFT uses large-scale public web-crawled corpora and similarity-based document retrieval to find relevant human-written documents.
  • CRAFT then uses instruction-tuned language models to augment the retrieved documents into custom-formatted task samples.

[02] The CRAFT Approach

1. What are the key components of the CRAFT framework?

  • Few-shot examples provided by the user to define the task
  • An embedding database created from large corpora to enable retrieval of relevant documents
  • A two-step retrieval process to efficiently find the documents most similar to the few-shot examples
  • Instruction-tuned language models to augment the retrieved documents into task-specific samples
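
Putting these components together, a minimal end-to-end sketch might look like the following (all names and the prompt wording are illustrative assumptions, not the authors' implementation):

```python
# Illustrative, self-contained sketch of the CRAFT flow; all names are
# assumptions, not the authors' code.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class FewShot:
    corpus_text: str   # corpus-style text used as the retrieval query
    instruction: str   # task instruction
    answer: str        # expected output for this example

def craft(few_shots: List[FewShot],
          retrieve: Callable[[str, int], List[str]],  # similarity-based retriever
          generate: Callable[[str], str],             # instruction-tuned LM
          samples_per_shot: int) -> List[str]:
    """Turn a handful of few-shot examples into a synthetic training set."""
    dataset = []
    for shot in few_shots:
        # 1. Retrieve human-written documents similar to the few-shot text.
        docs = retrieve(shot.corpus_text, samples_per_shot)
        # 2. Augment each document into a custom-formatted task sample.
        for doc in docs:
            prompt = (f"{shot.instruction}\n\n"
                      f"Example:\n{shot.corpus_text}\nAnswer: {shot.answer}\n\n"
                      f"Document:\n{doc}\n\nNew task sample:")
            dataset.append(generate(prompt))
    return dataset
```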

2. How does CRAFT leverage the few-shot examples and the embedding database?

  • The few-shot examples illustrate the language, content, and quality of the desired corpus samples, and they also specify the task instruction and the expected output.
  • The embedding database provides embeddings of diverse human-written documents that can be retrieved for task-specific augmentation.
  • The retrieval system uses the few-shot examples as queries to dynamically retrieve relevant documents from the corpora.
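
A minimal retrieval sketch using off-the-shelf tools; the embedding model (`all-MiniLM-L6-v2`) and the FAISS index are my own choices here, not necessarily those used in the paper:

```python
# Hedged sketch: build the embedding database and query it with a few-shot
# example; the model and index choices are assumptions.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    "Photosynthesis converts light energy into chemical energy ...",
    "Mitochondria carry out cellular respiration in eukaryotic cells ...",
]  # stand-ins for documents from a large web-crawled corpus
corpus_emb = encoder.encode(corpus, normalize_embeddings=True).astype(np.float32)

index = faiss.IndexFlatIP(corpus_emb.shape[1])  # inner product == cosine on unit vectors
index.add(corpus_emb)

few_shot_query = "Which organelle produces most of a cell's ATP?"
query_emb = encoder.encode([few_shot_query], normalize_embeddings=True).astype(np.float32)

scores, doc_ids = index.search(query_emb, 2)    # top-k most similar documents
retrieved = [corpus[i] for i in doc_ids[0]]
```

Because the embeddings are normalized, the inner-product index is equivalent to cosine-similarity search.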

3. How does CRAFT generate the final synthetic task samples?

  • CRAFT uses instruction-tuning prompt templates that combine the few-shots, the retrieved corpus samples, and a brief instruction for the model to generate the final task samples.
  • This augmentation step rephrases the text and condenses the retrieved documents down to the essential information required for the task.
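
A sketch of what such an augmentation prompt template could look like (the wording is hypothetical; the paper's exact templates are not reproduced here):

```python
# Hypothetical augmentation prompt; the exact template text is an assumption.
AUGMENT_TEMPLATE = """You are given examples of a task and a source document.
Rewrite the document into one new sample in exactly the same format as the
examples, keeping only the information needed for the task.

{few_shot_examples}

Source document:
{document}

New task sample:"""

def build_augmentation_prompt(few_shots, document):
    shots = "\n\n".join(f"Example {i + 1}:\n{s}" for i, s in enumerate(few_shots))
    return AUGMENT_TEMPLATE.format(few_shot_examples=shots, document=document)
```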

[03] Experimental Setup

1. What are the tasks evaluated in the experiments?

  • Multiple-choice question-answering tasks in biology, medicine, and commonsense
  • Generative tasks of text summarization and recipe generation

2. How do the authors evaluate the generated datasets?

  • For QA tasks, they use accuracy as the evaluation metric.
  • For generative tasks, they use language models as judges, reporting preference scores that compare generated outputs against human-curated references.
  • They compare the performance of models trained on CRAFT-generated datasets against few-shot baselines, instruction-tuned models, and models trained on human-curated datasets.
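
A sketch of both evaluation modes; the judge prompt wording and setup are assumptions, not the paper's exact protocol:

```python
# Sketch of the two evaluation modes described above.
def qa_accuracy(predictions, gold_answers):
    """Exact-match accuracy over multiple-choice answers (e.g. 'A', 'B', ...)."""
    correct = sum(p.strip() == g.strip() for p, g in zip(predictions, gold_answers))
    return correct / len(gold_answers)

JUDGE_TEMPLATE = (
    "Which summary better captures the article? Answer with 'A' or 'B'.\n"
    "Article: {article}\nSummary A: {summary_a}\nSummary B: {summary_b}\nAnswer:"
)

def preference_score(judge_lm, articles, model_outputs, references):
    """Fraction of comparisons in which the judge prefers the model output.
    Real setups usually also swap the A/B positions to avoid position bias."""
    wins = 0
    for article, output, reference in zip(articles, model_outputs, references):
        verdict = judge_lm(JUDGE_TEMPLATE.format(
            article=article, summary_a=output, summary_b=reference))
        wins += verdict.strip().upper().startswith("A")
    return wins / len(articles)
```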

[04] Results

1. How does the performance of CRAFT-generated datasets scale with the amount of data?

  • The authors observe consistent performance improvements across the tasks as they increase the size of the CRAFT-generated datasets.
  • Relative to the few-shot baseline, the CRAFT-generated datasets show improvements of 17% for BioQA, 12% for CSQA, 23% for MedQA, and 124% for summarization.

2. How do CRAFT-generated datasets compare to human-curated datasets?

  • For the QA tasks, the CRAFT-generated datasets achieve performance that is comparable to or better than the instruction-tuned baseline.
  • For summarization, the CRAFT-generated datasets outperform the models trained on human-curated data by 46 preference points.
  • The authors also find that CRAFT-generated datasets exhibit lower overlap with the test sets compared to the human-curated datasets, indicating better out-of-domain generalization.
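
One simple way to quantify such train-test overlap is n-gram containment, sketched below; the paper's exact contamination metric is not reproduced here:

```python
# Hedged sketch: fraction of test items that share an n-gram with the training set.
def ngrams(text, n=8):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_rate(train_texts, test_texts, n=8):
    train_ngrams = set().union(*(ngrams(t, n) for t in train_texts))
    contaminated = sum(bool(ngrams(t, n) & train_ngrams) for t in test_texts)
    return contaminated / len(test_texts)
```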

3. What are the limitations observed in the recipe generation task?

  • The authors observe a drop in performance when scaling the recipe generation dataset from 100 to 25,000 examples.
  • Analysis suggests that the CRAFT pipeline retrieves progressively less relevant documents as the dataset grows, which lowers data quality.
  • The authors recommend incorporating effective stopping criteria or additional quality validation steps in future iterations of CRAFT to address this limitation.
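
A sketch of one possible stopping criterion along these lines, using the retrieval similarity score as a quality gate (the threshold and the fixed cutoff are assumptions, not the authors' prescription):

```python
# Hypothetical quality gate: stop adding documents once retrieval similarity
# drops below a threshold, instead of always taking a fixed top-k.
def retrieve_until_threshold(index, query_emb, max_k=1000, min_score=0.5):
    scores, doc_ids = index.search(query_emb, max_k)
    return [int(i) for s, i in zip(scores[0], doc_ids[0])
            if i != -1 and s >= min_score]  # keep only sufficiently similar docs
```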