AI Agent Benchmark
A repeatable test suite for comparing autonomous agents on quality, robustness, cost, and safety.
Glossary
Benchmark: A repeatable test suite for comparing autonomous agents on quality, robustness, cost, and safety.
Harness: The orchestration layer that runs tasks, captures outputs, scores results, and stores artifacts.
Prompt injection: Malicious instruction patterns that attempt to override policy or exfiltrate hidden context.
Replication: Re-running an evaluation with fixed conditions to verify a result can be reproduced.
Benchmark drift: When benchmark tasks stop reflecting real-world workloads and scores become less meaningful.
Utility retention: How much useful task performance remains after applying security defenses.
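To make the harness and utility-retention terms concrete, here is a minimal sketch of an evaluation loop. All names (`run_suite`, `Result`, `utility_retention`, the task dict shape) are hypothetical illustrations, not this project's actual API: the harness runs each task, captures the agent's output, scores it, and keeps the result as an artifact; utility retention is then the ratio of mean score with defenses enabled to mean score without them.

```python
import time
from dataclasses import dataclass


@dataclass
class Result:
    """One artifact captured by the harness for a single task run."""
    task_id: str
    output: str
    score: float
    elapsed_s: float


def run_suite(tasks, agent, scorer):
    """Run every task through the agent, score each output, collect artifacts.

    tasks:  list of {"id": str, "prompt": str} dicts (hypothetical shape)
    agent:  callable prompt -> output string
    scorer: callable (task, output) -> float in [0, 1]
    """
    results = []
    for task in tasks:
        start = time.monotonic()
        output = agent(task["prompt"])          # capture the agent's output
        elapsed = time.monotonic() - start
        score = scorer(task, output)            # score the result
        results.append(Result(task["id"], output, score, elapsed))
    return results


def utility_retention(baseline, defended):
    """Mean defended score divided by mean baseline score.

    1.0 means defenses cost no task performance; lower means utility was lost.
    """
    base = sum(r.score for r in baseline) / len(baseline)
    kept = sum(r.score for r in defended) / len(defended)
    return kept / base
```

In practice the same task set is run twice, once with defenses off and once with them on, and `utility_retention` summarizes the trade-off as a single number alongside the per-task artifacts.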