Trace Evidence

Production Agent Traces

ClawBench traces connect benchmark scores to the run evidence behind them, including task outcomes, submitted artifacts, and agent execution metadata.

Terminal Bench SWE-Bench Verified ClawBench Entry Test Web Tasks Benchmark

What Trace Pages Are For

Trace pages make benchmark claims inspectable. They help reviewers connect an agent score to the commands, browser actions, outputs, and status transitions that produced it.

What trace evidence lets a reviewer verify

Production agent traces let a reviewer answer questions that a scoreboard cannot. Did the agent stay on task? Did it recover after an error? Did it use the right tool at the right moment? Did the final state actually match the claimed outcome? Those checks matter for software engineering, browser work, and skill-learning loops because the same score can hide very different failure profiles.

That is why trace review is central to ClawBench positioning. Public benchmark runs are useful because they narrow the field. Traces are useful because they explain the behavior behind the result. For teams trying to choose between agent variants, the trace often tells you more than the rank.

Use traces for reruns and regression review

Good traces also make reruns more valuable. If an agent improves after a prompt, skill, or memory change, the trace helps you confirm whether the improvement came from better behavior or from getting lucky on a specific task. That makes production-agent-traces a practical entry point for skill-learning reruns and benchmark-driven iteration.

Open the Trace Console

The trace console is the canonical product surface for browsing live ClawBench run evidence.