AI Agent Benchmark Glossary

AI Agent Benchmark

A repeatable test suite for comparing autonomous agents on quality, robustness, cost, and safety.

Eval Harness

The orchestration layer that runs tasks, captures outputs, scores results, and stores artifacts.
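
The four responsibilities named above (run, capture, score, store) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names Task, Result, run_harness, and store_artifacts are hypothetical, the agent is modeled as a plain function from prompt to text, and exact-match scoring stands in for whatever scorer a real harness would use.

```python
import json
from dataclasses import dataclass, asdict
from typing import Callable, List


@dataclass
class Task:
    # Hypothetical task record: an id, the prompt, and the expected answer.
    task_id: str
    prompt: str
    expected: str


@dataclass
class Result:
    # Hypothetical result record kept as an artifact of the run.
    task_id: str
    output: str
    score: float


def run_harness(agent: Callable[[str], str], tasks: List[Task]) -> List[Result]:
    """Run every task, capture the agent output, and score it."""
    results = []
    for task in tasks:
        output = agent(task.prompt)  # run the task and capture the output
        # Assumed scorer: exact match after trimming whitespace.
        score = 1.0 if output.strip() == task.expected else 0.0
        results.append(Result(task.task_id, output, score))
    return results


def store_artifacts(results: List[Result], path: str) -> None:
    """Store one JSON object per line so runs can be inspected later."""
    with open(path, "w", encoding="utf-8") as f:
        for result in results:
            f.write(json.dumps(asdict(result)) + "\n")
```

Keeping scoring and storage as separate steps means a stored run can be rescored later without calling the agent again.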

Prompt Injection

Malicious instructions, embedded in content the agent processes, that attempt to override its policy or exfiltrate hidden context.

Deterministic Replay

Re-running an evaluation with every source of variation held fixed (for example random seeds, model version, and tool responses) to verify that a result can be reproduced.
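
A minimal sketch of a replay check, assuming the only source of randomness is a seeded generator. The functions run_eval and replay_matches are hypothetical, and the random numbers merely stand in for a stochastic agent; a real replay would also have to pin model versions and any external tool responses.

```python
import hashlib
import random
from typing import List


def run_eval(seed: int, tasks: List[str]) -> str:
    """Run a toy evaluation and return a fingerprint of its transcript."""
    rng = random.Random(seed)  # local generator, so global state cannot leak in
    order = list(tasks)
    rng.shuffle(order)  # task order depends only on the seed
    # Stand-in for agent behavior: one pseudo-random value per task.
    transcript = [f"{task}:{rng.random():.6f}" for task in order]
    return hashlib.sha256("\n".join(transcript).encode("utf-8")).hexdigest()


def replay_matches(seed: int, tasks: List[str]) -> bool:
    """Re-run with the same seed and compare transcript fingerprints."""
    return run_eval(seed, tasks) == run_eval(seed, tasks)
```

Comparing a hash of the whole transcript, rather than only the final score, catches runs that reach the same score through different intermediate steps.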

Benchmark Drift

The divergence of benchmark tasks from real-world workloads over time, which makes scores less meaningful.

Utility Retention

The share of useful task performance that remains after security defenses are applied.
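
One way to put a number on this, offered as an assumption rather than a fixed definition, is the ratio of the defended score to the undefended baseline score on the same tasks. The function name utility_retention and its parameters are hypothetical.

```python
def utility_retention(baseline_score: float, defended_score: float) -> float:
    """Return the defended score as a fraction of the undefended baseline.

    Assumes both scores come from the same task set and the same metric.
    """
    if baseline_score <= 0:
        raise ValueError("baseline_score must be positive")
    return defended_score / baseline_score
```

For example, an agent that scores 0.80 without defenses and 0.72 with them retains about 0.9 of its utility under this definition.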