For a while the agent playbook was simple: wait for a stronger model, run the benchmark again, update the scorecard. The leaderboard moved. Production reliability often did not.
That gap is the reason skill discovery matters. Agents need more than better language capability. They need repeatable behaviors that transfer across tasks.
What skill discovery means
Skill discovery is the problem of getting agents to learn reusable operational behavior. Not a longer prompt. Not a one-off tool instruction. A real skill is something the agent can recognize, select, and apply when the surface form changes.
For agent systems, the important skills are often boring. Use the right tool. Preserve state across steps. Check whether the previous action worked. Retrieve prior knowledge. Stop when the task is unsafe or underspecified.
These behaviors are not guaranteed by model scale. A stronger model may explain the right workflow and still fail to execute it under pressure.
The SKILL0 problem
The most frustrating version of the problem is simple: an agent appears to acquire a skill during training or tuning, then fails to use it at inference.
During training, the agent learns to consult memory, inspect traces, or use a browser tool carefully. Then a new task arrives with the same underlying structure but a different surface. The learned behavior disappears. The agent improvises from scratch.
That is not a small bug. It is the difference between task-specific pattern matching and transferable skill learning.
Why skills fail to transfer
Agents often learn the shape of the training task instead of the principle underneath it. A skill learned on one checkout flow may not transfer to a different checkout flow. A skill learned on one repository may not transfer to another with a different test runner.
The task looks different, so the agent does not recognize that the same behavior applies. That is why held-out evaluation matters. If you only test on tasks that look like training, you cannot tell whether the agent learned a skill or memorized a pattern.
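One way to make that check concrete is to score the same agent on two splits and look at the gap between them. The sketch below is illustrative only: `TaskResult`, the split labels, and the sample results are hypothetical and do not come from any benchmark API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Outcome of one agent run (hypothetical record shape)."""
    task_id: str
    split: str    # "train_like" or "held_out"
    passed: bool


def pass_rate(results, split):
    """Fraction of runs in the given split that passed."""
    subset = [r for r in results if r.split == split]
    if not subset:
        raise ValueError(f"no results for split {split!r}")
    return sum(r.passed for r in subset) / len(subset)


def transfer_gap(results):
    """Train-like pass rate minus held-out pass rate.

    A large positive gap suggests the agent matched the training
    pattern rather than learning a behavior that transfers.
    """
    return pass_rate(results, "train_like") - pass_rate(results, "held_out")


# Made-up results for illustration only.
results = [
    TaskResult("checkout-a-1", "train_like", True),
    TaskResult("checkout-a-2", "train_like", True),
    TaskResult("checkout-a-3", "train_like", True),
    TaskResult("checkout-a-4", "train_like", False),
    TaskResult("checkout-b-1", "held_out", True),
    TaskResult("checkout-b-2", "held_out", False),
    TaskResult("checkout-b-3", "held_out", False),
    TaskResult("checkout-b-4", "held_out", False),
]

print(f"train-like: {pass_rate(results, 'train_like'):.2f}")
print(f"held-out:   {pass_rate(results, 'held_out'):.2f}")
print(f"gap:        {transfer_gap(results):.2f}")
```

With only the train-like split you would see a 0.75 pass rate and nothing else. The held-out split is what exposes the drop.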
Several approaches target this problem, and each carries its own risk:

| Approach | What It Tries To Do | Risk |
|---|---|---|
| Skill registry | Make skills explicit and discoverable | Agent may not select the right skill |
| Episodic memory | Replay past traces and outcomes | Memory noise can accumulate |
| Hierarchical policies | Select skills at a higher level, execute below | Credit assignment stays hard |
| Held-out skill tests | Measure transfer directly | Requires careful task design |
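To illustrate the first row, here is a minimal sketch of a skill registry. Everything in it is an assumption made for illustration: the `Skill` and `SkillRegistry` names, and especially the keyword-overlap selection, which stands in for whatever selection mechanism a real system would use.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Skill:
    """A named, reusable behavior (hypothetical shape)."""
    name: str
    triggers: frozenset          # keywords that suggest the skill applies
    run: Callable[[str], str]    # the behavior itself


class SkillRegistry:
    """Keeps skills explicit and lets a caller look one up for a task."""

    def __init__(self) -> None:
        self._skills = {}

    def register(self, skill: Skill) -> None:
        if skill.name in self._skills:
            raise ValueError(f"duplicate skill: {skill.name}")
        self._skills[skill.name] = skill

    def select(self, task_description: str) -> Optional[Skill]:
        """Pick the skill with the most trigger words in the task text.

        Returns None when nothing matches. Keyword overlap is a
        deliberately naive stand-in for real skill selection.
        """
        words = set(task_description.lower().split())
        best, best_score = None, 0
        for skill in self._skills.values():
            score = len(skill.triggers & words)
            if score > best_score:
                best, best_score = skill, score
        return best


registry = SkillRegistry()
registry.register(Skill("run_tests", frozenset({"pytest", "tests"}),
                        lambda task: f"ran tests for: {task}"))
registry.register(Skill("consult_memory", frozenset({"memory", "recall"}),
                        lambda task: f"looked up notes for: {task}"))

hit = registry.select("run the pytest suite")
miss = registry.select("execute the jest specs")
print(hit.name if hit else None)    # run_tests
print(miss.name if miss else None)  # None
```

The second lookup is the risk named in the table. The task has the same underlying structure (run the test suite), but the surface wording changed, so the registry holds the right skill and the selector never reaches it.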
What this means for builders
Your prompt is often a hand-written skill. When you tell the agent exactly how to use memory, recover from errors, or inspect files, you are manually encoding behavior that a stronger skill system should eventually learn and invoke.
There is nothing wrong with that. It ships. But you should evaluate it honestly. If the prompt works on the training tasks and fails on held-out tasks, you have not built a transferable skill. You have built a useful workaround.
How ClawBench should test skills
Skill evaluation needs to ask more than "did the agent complete the task?" It should ask whether the agent invoked the right skill, tracked state correctly, noticed failure, and recovered in a way that would generalize.
That is where SkillsBench fits. It gives ClawBench a place to measure reusable procedural behavior, not only final answer quality. The goal is not to celebrate a skill library. The goal is to find out whether the agent actually uses it when the task changes.
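The four questions above can be turned into a per-trace rubric. This is a sketch under stated assumptions, not how ClawBench or SkillsBench actually scores traces: the `Step` fields and the `score_trace` function are hypothetical, and each criterion is reduced to the simplest check that captures the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One step of an agent trace (hypothetical fields)."""
    skill: str              # skill or tool the agent invoked
    ok: bool                # whether the action succeeded
    checked_result: bool    # whether the agent verified the outcome
    state_keys: frozenset   # state the agent carried into this step


def score_trace(steps, expected_skill, required_state):
    """Score a trace on behavior, separately from task completion.

    With no failed steps, the two failure criteria are vacuously True.
    """
    failures = [i for i, s in enumerate(steps) if not s.ok]
    return {
        # Did the agent use the skill the task calls for at all?
        "invoked_right_skill": any(s.skill == expected_skill for s in steps),
        # After the first step, is the required state always carried forward?
        "tracked_state": all(required_state <= s.state_keys for s in steps[1:]),
        # Did the agent verify the outcome of every step that failed?
        "noticed_failure": all(steps[i].checked_result for i in failures),
        # Is every failure followed by at least one successful step?
        "recovered": all(any(later.ok for later in steps[i + 1:]) for i in failures),
    }


need = frozenset({"repo_path"})

careful = [
    Step("consult_memory", True, True, frozenset()),
    Step("run_tests", False, True, need),
    Step("run_tests", True, True, need),
]

lucky = [
    Step("shell", True, False, frozenset()),
    Step("shell", False, False, frozenset()),
    Step("shell", True, False, frozenset()),
]

print(score_trace(careful, "run_tests", need))
print(score_trace(lucky, "run_tests", need))
```

Both traces end on a successful step, so a completion-only metric would treat them the same. The rubric separates them: the second trace passes only on `recovered`, which is the kind of result that is unlikely to hold up once the task changes.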
Practical rule
If a skill does not improve held-out traces, it is not a skill yet. It is a prompt pattern waiting to be proven.
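As a rough sketch, the rule can be written as a paired comparison on held-out tasks: run the same tasks with and without the skill and look at the difference. The function names, the `min_lift` threshold, and the sample data are all hypothetical. A real evaluation would also need enough tasks to separate a true lift from noise.

```python
def held_out_lift(baseline, with_skill):
    """Held-out pass rate with the skill minus the baseline pass rate.

    Both arguments map task_id -> passed (bool) and must cover
    the same set of held-out tasks.
    """
    if baseline.keys() != with_skill.keys():
        raise ValueError("runs must cover the same held-out tasks")
    if not baseline:
        raise ValueError("no held-out tasks")
    return (sum(with_skill.values()) - sum(baseline.values())) / len(baseline)


def is_skill_yet(baseline, with_skill, min_lift=0.0):
    """Apply the rule: no held-out improvement means not a skill yet."""
    return held_out_lift(baseline, with_skill) > min_lift


# Made-up held-out results for illustration only.
baseline = {"t1": False, "t2": True, "t3": False, "t4": False}
with_skill = {"t1": True, "t2": True, "t3": False, "t4": True}

print(held_out_lift(baseline, with_skill))   # 0.5
print(is_skill_yet(baseline, with_skill))    # True
```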
Continue the evaluation
Use this guide alongside the benchmark entity pages, leaderboard context, and trace evidence to move from the explanation to concrete results.