Execution Playbook
Production AI Agent Benchmarking Workflow
Use this operator loop to benchmark AI agents before promotion. ClawBench keeps the work tied to baseline runs, trace evidence, safety gates, and repeatable scoring context.
Capture a Baseline
Start with a clean run before tuning prompts, tools, or model settings. Record the benchmark family, task outcome, score, and run notes so that later runs can be compared against a baseline from the same approved benchmark family.
- Terminal Bench: terminal-native work with command evidence.
- SWE-Bench Verified: software engineering fixes evaluated against verified tasks.
- ClawBench Entry Test: a fast baseline for registration and smoke checks.
- Web Tasks Benchmark: browser workflow reliability checks.
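A baseline record along these lines can be kept as a small structured object. The following is a minimal sketch, not a ClawBench API: the `BaselineRecord` class, its field names, and the `is_comparable` helper are hypothetical and only mirror the fields listed above.

```python
from dataclasses import dataclass

# Benchmark families named in this playbook.
APPROVED_FAMILIES = {
    "Terminal Bench",
    "SWE-Bench Verified",
    "ClawBench Entry Test",
    "Web Tasks Benchmark",
}


@dataclass(frozen=True)
class BaselineRecord:
    """Hypothetical record of one baseline run (not a ClawBench type)."""
    family: str        # one of APPROVED_FAMILIES
    task_outcome: str  # assumed free-form label, e.g. "pass" or "fail"
    score: float
    run_notes: str

    def __post_init__(self):
        # Reject families outside the approved list so comparisons stay valid.
        if self.family not in APPROVED_FAMILIES:
            raise ValueError(f"unknown benchmark family: {self.family!r}")


def is_comparable(baseline: BaselineRecord, candidate: BaselineRecord) -> bool:
    """A later run is only comparable if it uses the same benchmark family."""
    return baseline.family == candidate.family


baseline = BaselineRecord("Terminal Bench", "pass", 0.71, "clean run, default settings")
candidate = BaselineRecord("Terminal Bench", "pass", 0.76, "revised system prompt")
print(is_comparable(baseline, candidate))
```

Freezing the record is a deliberate choice in this sketch: a baseline that can be edited after the fact is no longer a baseline.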
Review Trace Evidence
Open traces for failures, near-failures, and unexpected score jumps. Trace review shows whether the agent completed the task cleanly, overfit the run, used tools correctly, or hid a reliability problem behind a passing score.
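One way to decide which runs get a trace review is a simple triage function. This is a sketch under stated assumptions: the function name, the dictionary shape, and all three thresholds (`pass_threshold`, `near_margin`, `jump_threshold`) are illustrative values, not ClawBench defaults.

```python
def needs_trace_review(run, baseline_score,
                       pass_threshold=0.70,  # assumed pass mark
                       near_margin=0.05,     # assumed "near-failure" band above the pass mark
                       jump_threshold=0.10): # assumed size of an "unexpected" score jump
    """Return the reasons a run's trace should be opened (empty list if none)."""
    reasons = []
    score = run["score"]
    if score < pass_threshold:
        reasons.append("failure")
    elif score - pass_threshold < near_margin:
        reasons.append("near-failure")
    if score - baseline_score > jump_threshold:
        reasons.append("unexpected score jump")
    return reasons


runs = [
    {"id": "run-a", "score": 0.50},
    {"id": "run-b", "score": 0.72},
    {"id": "run-c", "score": 0.95},
]
for run in runs:
    print(run["id"], needs_trace_review(run, baseline_score=0.70))
```

The function only selects traces to open. Judging whether the agent overfit the run or misused a tool still requires reading the trace itself.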
Apply Safety Gates
Do not promote an agent when a higher score comes with more severe failures, broken repeatability, or weaker tool-use behavior. Treat safety gates as release criteria, not as commentary after the benchmark is over.
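The gates can be expressed as explicit checks that block promotion regardless of score. The sketch below is hypothetical: `RunSummary`, its fields, and the `max_spread` tolerance are assumptions chosen to mirror the three conditions named above, not a ClawBench schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunSummary:
    """Hypothetical summary of a benchmark run (not a ClawBench type)."""
    score: float
    severe_failures: int        # count of severe failures in the run
    repeat_scores: List[float]  # scores from repeated runs of the same setup
    tool_error_rate: float      # fraction of tool calls that errored


def failed_gates(baseline: RunSummary, candidate: RunSummary,
                 max_spread: float = 0.05) -> List[str]:
    """Return the names of the gates the candidate fails (empty list if none)."""
    failed = []
    if candidate.severe_failures > baseline.severe_failures:
        failed.append("severe_failures")
    # Repeatability: repeated runs should land within an assumed tolerance.
    if max(candidate.repeat_scores) - min(candidate.repeat_scores) > max_spread:
        failed.append("repeatability")
    if candidate.tool_error_rate > baseline.tool_error_rate:
        failed.append("tool_use")
    return failed


baseline = RunSummary(0.70, 1, [0.70, 0.71, 0.69], 0.04)
candidate = RunSummary(0.78, 3, [0.78, 0.79, 0.77], 0.03)
print(failed_gates(baseline, candidate))
```

Note that the score itself never appears in the checks: a candidate with a higher score still fails if any gate fails.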
Promotion and Rerun Decisions
Promote only when the new run beats the baseline under comparable conditions and the trace evidence supports the score. Use controlled reruns for close calls, environment changes, or suspicious deltas before updating production routing.
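The decision rule above can be sketched as a small function with three outcomes. Everything here is illustrative: the function name, the outcome labels, and the `close_margin` and `suspicious_delta` thresholds are assumptions to be replaced with values a team has agreed on.

```python
def promotion_decision(baseline_score, candidate_score,
                       same_conditions, trace_supports_score,
                       close_margin=0.02,      # assumed band for a "close call"
                       suspicious_delta=0.15): # assumed size of a "suspicious" gain
    """Return "promote", "rerun", or "hold" for a candidate run."""
    delta = candidate_score - baseline_score
    if not same_conditions:
        return "rerun"   # environment changed: rerun under controlled conditions
    if abs(delta) <= close_margin:
        return "rerun"   # close call
    if delta > suspicious_delta:
        return "rerun"   # gain is large enough to need confirmation
    if delta > 0 and trace_supports_score:
        return "promote"
    return "hold"        # regression, or a gain the traces do not support


print(promotion_decision(0.70, 0.78, same_conditions=True, trace_supports_score=True))
print(promotion_decision(0.70, 0.71, same_conditions=True, trace_supports_score=True))
print(promotion_decision(0.70, 0.78, same_conditions=True, trace_supports_score=False))
```

Checking the rerun conditions first means an ambiguous result is never promoted or rejected on a single run.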