Ignore Previous Instruction: The Persistent Challenge of Prompt Injection in Language Models
๐ Abstract
The article discusses the persistent challenge of prompt injection in language models (LLMs). Prompt injection is an emergent vulnerability that arises because LLMs are unable to differentiate between system prompts (created by engineers) and user prompts (created by users). This can lead to undesired behavior, where a user can craft a prompt that instructs the LLM to ignore the programmed prompt and follow the user-generated prompt instead. The article outlines the potential attacks that can result from prompt injection, such as revealing the system prompt, reputational attacks, data disclosure, and privilege escalation. It also discusses various mitigations, such as canary tokens, input scanning, and output checking, but notes that these solutions are not foolproof. The article emphasizes the importance of proper design and data classification when building applications with LLMs to mitigate the risks of prompt injection.
๐ Q&A
[01] The Persistent Challenge of Prompt Injection in Language Models
1. What is prompt injection, and how does it arise in language models (LLMs)? Prompt injection is an emergent vulnerability in LLMs that arises because LLMs are unable to differentiate between system prompts (created by engineers) and user prompts (created by users). This can lead to undesired behavior, where a user can craft a prompt that instructs the LLM to ignore the programmed prompt and follow the user-generated prompt instead.
2. What are the potential attacks that can result from a prompt injection vulnerability? Potential attacks include:
- Revealing the system prompt, which can be devastating if the system prompt is the main intellectual property of the application
- Reputational attacks by getting the app to say something malicious, uncompliant, or contradictory to official documentation
- Revealing data that the LLM has access to, such as external knowledge bases used for Retrieval Augmented Generation
- Manipulating plugins to escalate privileges, such as causing the LLM to make requests to internal endpoints (SSRF)
3. What are some common mitigations used to prevent prompt injection attacks? Common mitigations include:
- Canary tokens: A token purposely placed in the prompt, which can indicate if the directive prompt has leaked
- Scanning the input with another LLM to detect potential prompt injection payloads
- Checking the output to ensure it is grounded in the context of the system prompt
4. Why are these mitigations not sufficient to completely prevent prompt injection attacks? These mitigations have limitations:
- Heuristic-based scans and LLM scans are based on previous attacks, so they may not catch new attack patterns
- Canary tokens only block instances where the specific token is used, and can be bypassed by transforming the output
- Checking the output for proper grounding can only be done in limited use cases and can also be bypassed by a secondary prompt injection
5. What is the main security control that should be used to mitigate prompt injection vulnerabilities? The main security control is proper design and data classification when building applications with LLMs. This includes:
- Ensuring plugins only have access to resources the user already has access to
- Limiting the LLM's knowledge base to only contain data the user has access to
- Avoiding classifying the system prompt as sensitive intellectual property
[02] Stored Prompt Injection
1. What is Stored Prompt Injection, and how does it differ from regular prompt injection? Stored Prompt Injection occurs when an LLM is used in conjunction with Retrieval Augmented Generation, where the LLM summarizes content from an external knowledge base. If the knowledge base contains malicious, user-inputted content, a prompt injection payload can be executed when the summarization step occurs, leading to a stored attack that can be replicated across multiple users.
2. Why is Stored Prompt Injection a notable variant of the prompt injection vulnerability? Stored Prompt Injection is notable because Retrieval Augmented Generation is a common way of enriching user queries, and the vulnerability occurs due to content stored in the database. This means the attack can be replicated across multiple users, potentially leaking more information than a normal prompt injection attack, which is typically only executed in the context of a single user's query.