
Prompt Engineering Is Dead: DSPy Is the New Paradigm for Prompting

🌈 Abstract

The article discusses the evolution of prompt engineering and the limitations of current approaches. It introduces DSPy, a framework developed at Stanford that treats large language models (LLMs) as modules that can be optimized and composed into self-improving pipelines, much like the layer abstractions in PyTorch. The article highlights the challenges of prompt engineering and argues for a more systematic, scientific approach to building applications with LLMs.

🙋 Q&A

[01] Prompt Engineering and Its Limitations

1. What was the status of prompt engineering a few months ago, and how has it changed?

  • Prompt engineering was all the hype a few months ago, with the job market full of prompt-engineer roles.
  • However, large-scale experiments have since shown that no single prompt or strategy works across all kinds of problems, and that prompt engineering is largely a "Clever Hans" phenomenon in which humans supply the context the system needs to answer well.

2. What are the issues with the current approaches to prompt engineering?

  • Many books and blogs promote "top prompts" to get the best out of GPT, but these are often just selling "a ton of crap" and are not a sound way to build applications.
  • Certain papers have shown that using emotional prompts can increase LLM performance, but the author has reservations about the authenticity and generalizability of such findings.
  • Telling the system about personal situations (e.g., "I might get fired") in an attempt to hack the LLM's behavior is not a scientific or systematic approach.

3. What are the challenges in implementing specific prompting techniques in practice?

  • Techniques like "Add 5-shot CoT with RAG, using hard negative examples" are conceptually clear but very difficult to implement in practice, because LLMs are highly sensitive to exactly how a prompt is written.
  • The LLM's output also needs to be constrained so it can serve as input to other modules in a larger pipeline; the problem is not just coaxing the LLM into giving a certain answer.

[02] DSPy: A Framework for Self-Improving Pipelines

1. What is the goal of the DSPy framework?

  • The goal of DSPy is to shift the focus from tweaking LLMs to designing a good overarching system architecture.

2. How does DSPy treat LLMs as modules?

  • DSPy treats LLMs as "devices" that execute instructions, operating through an abstraction akin to a deep neural network (DNN).
  • Just as we define a convolution layer in PyTorch and stack such layers to reach a desired level of abstraction, DSPy abstracts LLM calls as modules that can be stacked in different combinations to produce a given behavior (e.g., Chain of Thought, ReAct), as in the sketch below.
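
To make the analogy concrete, here is a minimal sketch of declaring such modules; the model name is an assumption, and the module calls follow DSPy's documented API, where strategies like Chain of Thought wrap a declarative signature.

```python
import dspy

# Point DSPy at a language model once; the modules below are model-agnostic.
# (The model name here is illustrative.)
dspy.settings.configure(lm=dspy.OpenAI(model="gpt-3.5-turbo"))

# Behaviors are modules, analogous to PyTorch layers: the same
# "question -> answer" signature can be wrapped by different strategies.
plain = dspy.Predict("question -> answer")       # direct prediction
cot = dspy.ChainOfThought("question -> answer")  # injects intermediate reasoning

print(cot(question="What is the capital of France?").answer)
```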

3. How does DSPy infer the role of fields and build modules?

  • DSPy uses "signatures" to define the behavior we want from the LLM; a signature specifies what needs to be achieved rather than how to prompt the LLM to do it.
  • DSPy infers the role of each field from the signature and uses it to build modules, automatically producing high-quality prompts that elicit the desired behavior (see the sketch below).
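
For instance, a class-based signature might look like the following sketch (the field descriptions are illustrative); DSPy compiles this declaration into a prompt instead of the developer writing one by hand.

```python
import dspy

# A signature declares *what* the module should do; DSPy infers the
# role of each field from its name, its description, and the docstring.
class GenerateAnswer(dspy.Signature):
    """Answer questions with short factoid answers."""

    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="often between 1 and 5 words")

# The signature, not a hand-written prompt, parameterizes the module.
generate_answer = dspy.ChainOfThought(GenerateAnswer)
```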

4. How does the DSPy optimizer work?

  • The DSPy optimizer takes the entire pipeline and optimizes it against a chosen metric, automatically finding the best prompts and even updating the weights of the language model to fit the task at hand, as sketched below.
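
A minimal sketch of that compilation step, assuming a toy training set and using BootstrapFewShot (one of several DSPy optimizers); the metric and model name are illustrative.

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

dspy.settings.configure(lm=dspy.OpenAI(model="gpt-3.5-turbo"))  # illustrative

# A toy training set; real pipelines would use many more examples.
trainset = [
    dspy.Example(question="What is the capital of France?",
                 answer="Paris").with_inputs("question"),
]

# The metric the optimizer maximizes over the training set.
def validate_answer(example, pred, trace=None):
    return dspy.evaluate.answer_exact_match(example, pred)

# "Compiling" searches for demonstrations and prompts that maximize the metric.
teleprompter = BootstrapFewShot(metric=validate_answer)
compiled_qa = teleprompter.compile(dspy.ChainOfThought("question -> answer"),
                                   trainset=trainset)
```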

[03] Implementing a Multi-Hop QA System with DSPy

1. What is the challenge in building multi-hop QA systems?

  • A single search query is often not enough for complex QA tasks; the standard approach is to build multi-hop search systems that read the retrieved results and generate additional queries to gather more information.

2. How can DSPy be used to build a multi-hop QA system?

  • The article walks through an example "Simplified Baleen" pipeline built with DSPy, which includes modules for generating search queries, retrieving passages, and generating the final answer.
  • The pipeline is defined using DSPy's signatures and modules, and the article demonstrates how it can be compiled and optimized using DSPy's teleprompter and evaluation utilities (a sketch follows this list).
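
A condensed sketch of that pipeline, closely following the Simplified Baleen example in the DSPy documentation; the hop count, passages per hop, and field descriptions are illustrative, the GenerateAnswer signature from the earlier sketch is reused, and a retrieval model (e.g., dspy.ColBERTv2 over a wiki index) is assumed to be configured.

```python
import dspy

class GenerateSearchQuery(dspy.Signature):
    """Write a simple search query that will help answer a complex question."""

    context = dspy.InputField(desc="passages retrieved so far")
    question = dspy.InputField()
    query = dspy.OutputField()

class SimplifiedBaleen(dspy.Module):
    """Multi-hop pipeline: generate query -> retrieve -> repeat -> answer."""

    def __init__(self, passages_per_hop=3, max_hops=2):
        super().__init__()
        self.generate_query = [dspy.ChainOfThought(GenerateSearchQuery)
                               for _ in range(max_hops)]
        self.retrieve = dspy.Retrieve(k=passages_per_hop)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)  # from earlier sketch
        self.max_hops = max_hops

    def forward(self, question):
        context = []
        for hop in range(self.max_hops):
            # Each hop writes a fresh query conditioned on everything
            # retrieved so far, then folds the new passages into the context.
            query = self.generate_query[hop](context=context,
                                             question=question).query
            passages = self.retrieve(query).passages
            context = list(dict.fromkeys(context + passages))  # dedupe, keep order
        pred = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=pred.answer)
```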

3. How does the compiled DSPy pipeline perform compared to the uncompiled version?

  • The results show that the compiled DSPy pipeline outperforms the uncompiled version in retrieval score on the HotPotQA dataset, even surpassing a version built with human feedback.
  • The article also notes that a much smaller model such as T5 can outperform GPT when used in a DSPy setting, highlighting DSPy's potential for building better, more systematically designed systems (see the evaluation sketch below).
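
A hedged sketch of how such a comparison might be run with DSPy's Evaluate utility. The metric here assumes HotPotQA-style examples carrying a gold_titles set and passages formatted as "title | text"; devset and compiled_baleen are assumed to come from earlier loading and compilation steps.

```python
from dspy.evaluate import Evaluate

# Illustrative retrieval metric: were all gold supporting titles retrieved?
def gold_passages_retrieved(example, pred, trace=None):
    found_titles = {passage.split(" | ")[0] for passage in pred.context}
    return set(example.gold_titles).issubset(found_titles)

# `devset` and `compiled_baleen` are assumed from the previous steps.
evaluate = Evaluate(devset=devset, num_threads=1, display_progress=True)
evaluate(SimplifiedBaleen(), metric=gold_passages_retrieved)  # uncompiled
evaluate(compiled_baleen, metric=gold_passages_retrieved)     # compiled
```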