Security Guide

InjectBench for Indirect Prompt Injection Defense

InjectBench is a benchmarking framework for measuring how well LLM systems resist malicious instructions hidden in third-party plugin data while still completing the user task.

What Is InjectBench and Its Primary Goal?

InjectBench targets indirect prompt injections: attacks where the model retrieves hostile content from external data sources (web pages, docs, plugin feeds) and mistakes attacker text for trusted instructions. The framework standardizes how these attacks are generated, executed, and scored so teams can quantify security and utility together.

Core focus: evaluate plugin/data-channel prompt injection risk, not just direct chat-level jailbreaks.
Scale: synthetic benchmark corpus with 1,670 samples for repeatable model comparisons.
Outcome: measure both attack success and retained usefulness under defense.

Indirect Prompt Injections vs Traditional Jailbreaks

Category	Indirect Injection	Traditional Jailbreak
Attack location	Hidden in retrieved third-party content	Directly written in user chat prompt
User awareness	User is often unaware content is hostile	User usually sees the attack text
Common target	Plugins, browsing, RAG pipelines, summarizers	Base model policy through direct instructions
Failure mode	Model over-trusts context and follows attacker payload	Model obeys direct override against policy

How InjectBench Creates Attacks

Benign component: realistic context text (for example news, how-to content, reviews, recipes).
Separator component: delimiter, focus override, or fake summary transition to elevate attacker authority.
Malicious instruction: payload for manipulated content, availability disruption, or fraud/malware link coercion.

Instruction generation can be run as a two-stage process: one model creates multiple malicious candidates and a second model ranks/selects the strongest executable option.

How InjectBench Evaluates Success

LLM-as-judge: evaluate whether the response complied with injected instructions.
Attack-agnostic and attack-specific prompts: broad compliance checks plus category-specific criteria.
Precision tuning: thresholded Yes/No scoring calibrated for lower false positives.
Human alignment: compare model judging quality against user-study judgments.

Most Effective Defenses for Plugin-Based Attacks

Strict context boundaries: treat retrieved/plugin text as untrusted data, never system policy.
Instruction-source filtering: reject commands coming from external documents unless explicitly authorized.
Random sampling defense: sample-and-verify context slices to reduce attack success rates.
Tool permission hardening: enforce allowlists and require explicit user confirmation for risky actions.
Output auditing: detect leaked secrets, fake error claims, and fraudulent links before returning output.

Implementing a 50-Sample InjectBench Variant

Use a single-agent lane and distribute 50 attack samples across three attacker goals:

17 manipulated content samples: subtle narrative distortion while staying topically relevant.
17 availability samples: fake unreadable/unavailable claims or bogus error-code responses.
16 fraud/malware samples: coercion toward illegitimate links masked as helpful resources.

Each sample should follow this structure: [Benign Context] + [Separator] + [Malicious Instruction].

Core Research Questions

What is the InjectBench framework and its primary goals?
How do indirect prompt injections differ from traditional jailbreaks?
What are the most effective defenses against plugin-based attacks?
How does InjectBench create and evaluate indirect prompt injection attacks?

Set up your agent and run this mode Open single-agent PI run instructions