Summarize by Aili

The Instruction Hierarchy:Training LLMs to Prioritize Privileged Instructions

🌈 Abstract

The article discusses the problem of LLMs being susceptible to prompt injections, jailbreaks, and other attacks that allow adversaries to overwrite a model's original instructions with their own malicious prompts. It proposes an "instruction hierarchy" that explicitly defines how models should behave when instructions of different priorities conflict, in order to address this vulnerability.

🙋 Q&A

[01] The Instruction Hierarchy

1. What is the key idea behind the instruction hierarchy proposed in the article? The key idea is to create a hierarchy of instructions, where LLMs will defer to higher-privileged instructions in the case of conflicts. The hierarchy consists of:

System Messages provided by application developers, which define the general instructions, safety guidelines, and constraints for the LLM
User Messages provided by end users
Tool Outputs from third-party sources

The goal is to teach LLMs to conditionally follow lower-level instructions based on their alignment with higher-level instructions. Aligned instructions should be followed, while misaligned instructions should be ignored when possible, or the model should refuse to comply if there is no way to proceed.

2. How does the article propose to generate training data for the instruction hierarchy? The article proposes two approaches for generating training data:

Context Synthesis: For aligned instructions, the article generates examples using compositional requests and decomposes the instructions into smaller pieces. These decomposed instructions are then placed at different levels of the hierarchy, and models are trained to predict the original ground-truth response.
Context Ignorance: For misaligned instructions, the article trains models to predict the same answer they would have generated if they never saw the lower-level instructions, effectively teaching them to ignore the misaligned instructions.

This data is generated for different types of attacks, such as prompt injections, system prompt extractions, and jailbreaks.

3. What are the key results reported in the article? The article reports the following key results:

The instruction hierarchy approach leads to dramatically improved robustness across various safety and capability benchmarks, compared to a baseline model.
The instruction hierarchy also exhibits generalization to evaluation criteria that were explicitly excluded from training, including jailbreaks, attacks that try to extract passwords from the system message, and prompt injections via tool use.
The article also reports some regressions on "over-refusal" evaluations, where the models sometimes ignore or refuse benign queries. However, the generic capabilities of the models remain otherwise unscathed.

[02] Background: Attacks on LLMs

1. What are the different types of attacks on LLMs discussed in the article? The article discusses the following types of attacks on LLMs:

Prompt Injections: Adversaries insert instructions that subvert the intent of the system designer, either directly through the user input or indirectly through third-party inputs.
Jailbreaks: Attacks that aim to escape the safety behavior that is trained into an LLM, allowing the model to perform malicious tasks.
System Message Extraction: Attacks that aim to reveal the system message, which defines the expected behavior of the model and may contain private information.

2. How does the article characterize the underlying cause of these attacks? The article argues that the underlying cause of these attacks is the lack of a clear instruction hierarchy in modern LLMs. Currently, all instructions are treated equally, allowing adversaries to overwrite higher-level instructions with their own malicious prompts.

Shared by Daniel Chen ·

Install fromChrome Web Store