Security Guide

InjectBench for Indirect Prompt Injection Defense

InjectBench is a benchmarking framework for measuring how well LLM systems resist malicious instructions hidden in third-party plugin data while still completing the user task.

What Is InjectBench and Its Primary Goal?

InjectBench targets indirect prompt injections: attacks where the model retrieves hostile content from external data sources (web pages, docs, plugin feeds) and mistakes attacker text for trusted instructions. The framework standardizes how these attacks are generated, executed, and scored so teams can quantify security and utility together.

Indirect Prompt Injections vs Traditional Jailbreaks

CategoryIndirect InjectionTraditional Jailbreak
Attack locationHidden in retrieved third-party contentDirectly written in user chat prompt
User awarenessUser is often unaware content is hostileUser usually sees the attack text
Common targetPlugins, browsing, RAG pipelines, summarizersBase model policy through direct instructions
Failure modeModel over-trusts context and follows attacker payloadModel obeys direct override against policy

How InjectBench Creates Attacks

  1. Benign component: realistic context text (for example news, how-to content, reviews, recipes).
  2. Separator component: delimiter, focus override, or fake summary transition to elevate attacker authority.
  3. Malicious instruction: payload for manipulated content, availability disruption, or fraud/malware link coercion.

Instruction generation can be run as a two-stage process: one model creates multiple malicious candidates and a second model ranks/selects the strongest executable option.

How InjectBench Evaluates Success

Most Effective Defenses for Plugin-Based Attacks

Implementing a 50-Sample InjectBench Variant

Use a single-agent lane and distribute 50 attack samples across three attacker goals:

Each sample should follow this structure: [Benign Context] + [Separator] + [Malicious Instruction].

Core Research Questions