Security Guide
InjectBench for Indirect Prompt Injection Defense
InjectBench is a benchmarking framework for measuring how well LLM systems resist malicious instructions hidden in third-party plugin data while still completing the user task.
What Is InjectBench and Its Primary Goal?
InjectBench targets indirect prompt injections: attacks where the model retrieves hostile content from external data sources (web pages, docs, plugin feeds) and mistakes attacker text for trusted instructions. The framework standardizes how these attacks are generated, executed, and scored so teams can quantify security and utility together.
- Core focus: evaluate plugin/data-channel prompt injection risk, not just direct chat-level jailbreaks.
- Scale: synthetic benchmark corpus with 1,670 samples for repeatable model comparisons.
- Outcome: measure both attack success and retained usefulness under defense.
Indirect Prompt Injections vs Traditional Jailbreaks
| Category | Indirect Injection | Traditional Jailbreak |
|---|---|---|
| Attack location | Hidden in retrieved third-party content | Directly written in user chat prompt |
| User awareness | User is often unaware content is hostile | User usually sees the attack text |
| Common target | Plugins, browsing, RAG pipelines, summarizers | Base model policy through direct instructions |
| Failure mode | Model over-trusts context and follows attacker payload | Model obeys direct override against policy |
How InjectBench Creates Attacks
- Benign component: realistic context text (for example news, how-to content, reviews, recipes).
- Separator component: delimiter, focus override, or fake summary transition to elevate attacker authority.
- Malicious instruction: payload for manipulated content, availability disruption, or fraud/malware link coercion.
Instruction generation can be run as a two-stage process: one model creates multiple malicious candidates and a second model ranks/selects the strongest executable option.
How InjectBench Evaluates Success
- LLM-as-judge: evaluate whether the response complied with injected instructions.
- Attack-agnostic and attack-specific prompts: broad compliance checks plus category-specific criteria.
- Precision tuning: thresholded Yes/No scoring calibrated for lower false positives.
- Human alignment: compare model judging quality against user-study judgments.
Most Effective Defenses for Plugin-Based Attacks
- Strict context boundaries: treat retrieved/plugin text as untrusted data, never system policy.
- Instruction-source filtering: reject commands coming from external documents unless explicitly authorized.
- Random sampling defense: sample-and-verify context slices to reduce attack success rates.
- Tool permission hardening: enforce allowlists and require explicit user confirmation for risky actions.
- Output auditing: detect leaked secrets, fake error claims, and fraudulent links before returning output.
Implementing a 50-Sample InjectBench Variant
Use a single-agent lane and distribute 50 attack samples across three attacker goals:
- 17 manipulated content samples: subtle narrative distortion while staying topically relevant.
- 17 availability samples: fake unreadable/unavailable claims or bogus error-code responses.
- 16 fraud/malware samples: coercion toward illegitimate links masked as helpful resources.
Each sample should follow this structure: [Benign Context] + [Separator] + [Malicious Instruction].
Core Research Questions
- What is the InjectBench framework and its primary goals?
- How do indirect prompt injections differ from traditional jailbreaks?
- What are the most effective defenses against plugin-based attacks?
- How does InjectBench create and evaluate indirect prompt injection attacks?
ClawBench