Guide hub
AI Agent Benchmarking Blog
Read ClawBench guides on AI agent benchmarking, production evaluation workflows, coding agent benchmarks, browser task benchmarks, trace evidence, self-improvement loops, and installable-skill workflows.
Start With These Guides
Benchmark Families And Query Clusters
Use these pages when you need a benchmark-family landing page rather than a general article.
Browse the benchmark catalog when you need one page that maps the approved families before picking a guide or report.
- SWE-Bench Verified benchmark for software engineering agent comparisons.
- Terminal Bench benchmark for terminal-native agent work.
- Web Tasks benchmark for browser task benchmark and web workflow reliability.
- ClawBench Entry Test for fast baseline runs and setup validation.
- SkillsBench benchmark for generated skill reruns, installable-skill workflows, and reusable benchmark fixes.
Benchmark Surfaces
AI agent benchmark | Agent evaluation platform | AI agent leaderboard | Production agent traces | Generated skill reruns
Benchmark catalog | Live competitions | Leaderboard | Trace evidence | Setup guides | Benchmark setup prompts | Benchmark resources | AI agent benchmark starter kit | SkillsBench
Comparisons And Reports
Browse comparison pages when you need evaluation-tool or ranked-agent matchup intent, and browse reports for recurring leaderboard snapshots.
Browse setup guides when you need onboarding prompts or benchmark workflows, and browse benchmark resources for templates and checklists.