Skill Discovery in AI Agents: Why Tool Use Does Not Transfer

For a while the agent playbook was simple: wait for a stronger model, run the benchmark again, update the scorecard. The leaderboard moved. Production reliability often did not.

That gap is the reason skill discovery matters. Agents do not only need better language capability. They need repeatable behaviors that transfer across tasks.

Figure: Visual map of reusable AI agent skills moving through held-out tasks. Skill trees are only useful when the benchmark can verify whether the agent invokes the skill in a new context.

What skill discovery means

Skill discovery is the problem of getting agents to learn reusable operational behavior. Not a longer prompt. Not a one-off tool instruction. A real skill is something the agent can recognize, select, and apply when the surface form changes.

For agent systems, the important skills are often boring. Use the right tool. Preserve state across steps. Check whether the previous action worked. Retrieve prior knowledge. Stop when the task is unsafe or underspecified.

These behaviors are not guaranteed by model scale. A stronger model may explain the right workflow and still fail to execute it under pressure.

Tool use: Which tool, which order, which parameters.

State: What has happened, what remains, what changed.

Recovery: What the agent does after a bad call or unexpected result.
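
To make the three behaviors concrete, here is a minimal sketch of a plan runner that records which tool ran, tracks what is done and what remains, and retries or stops after a failed call. The names (StepRecord, RunState, run_plan, max_retries) are hypothetical and chosen for illustration; real agent frameworks structure this differently.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class StepRecord:
    tool: str  # which tool was called
    ok: bool   # whether the call reported success


@dataclass
class RunState:
    done: List[str] = field(default_factory=list)       # what has happened
    remaining: List[str] = field(default_factory=list)  # what remains
    history: List[StepRecord] = field(default_factory=list)


def run_plan(plan: List[str],
             tools: Dict[str, Callable[[], bool]],
             max_retries: int = 1) -> RunState:
    """Run tools in order, checking each result before moving on."""
    state = RunState(remaining=list(plan))
    while state.remaining:
        name = state.remaining[0]
        for _attempt in range(max_retries + 1):
            ok = tools[name]()
            state.history.append(StepRecord(name, ok))
            if ok:
                break
        else:
            # Every attempt failed: stop instead of building on a bad result.
            return state
        state.done.append(state.remaining.pop(0))
    return state


# Usage: "fetch" fails once and then succeeds; "parse" succeeds immediately.
calls = {"fetch": 0}


def flaky_fetch() -> bool:
    calls["fetch"] += 1
    return calls["fetch"] > 1


state = run_plan(["fetch", "parse"],
                 {"fetch": flaky_fetch, "parse": lambda: True})
stuck = run_plan(["broken"], {"broken": lambda: False})
```

The history list is the useful part for evaluation: it shows the failed call and the retry, not just the final outcome.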

The SKILL0 problem

The most frustrating version of the problem is simple: an agent appears to acquire a skill during training or tuning, then fails to use it at inference.

During training, the agent learns to consult memory, inspect traces, or use a browser tool carefully. Then a new task arrives with the same underlying structure but a different surface. The learned behavior disappears. The agent improvises from scratch.

That is not a small bug. It is the difference between task-specific pattern matching and transferable skill learning.

Why skills fail to transfer

Agents often learn the shape of the training task instead of the principle underneath it. A skill learned on one checkout flow may not transfer to a different checkout flow. A skill learned on one repository may not transfer to another with a different test runner.

The task looks different, so the agent does not recognize that the same behavior applies. That is why held-out evaluation matters. If you only test on tasks that look like training, you cannot tell whether the agent learned a skill or memorized a pattern.
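
One rough way to put a number on this is a transfer gap: success rate on tasks that resemble training minus success rate on held-out tasks. The sketch below assumes each task outcome is a simple pass/fail boolean, and the function names are illustrative. The sample outcomes are made up to show the arithmetic, not measured results.

```python
from typing import Sequence


def success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of tasks that passed."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    return sum(outcomes) / len(outcomes)


def transfer_gap(in_distribution: Sequence[bool],
                 held_out: Sequence[bool]) -> float:
    """Positive values mean the agent does worse when the surface changes."""
    return success_rate(in_distribution) - success_rate(held_out)


# Illustrative data only: 9 of 10 training-like tasks pass, 4 of 10 held-out.
in_dist = [True] * 9 + [False]
held = [True] * 4 + [False] * 6
gap = transfer_gap(in_dist, held)
```

A large gap suggests the agent matched the training pattern. A gap near zero is at least consistent with a skill that transfers.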

Approach	What It Tries To Do	Risk
Skill registry	Make skills explicit and discoverable	Agent may not select the right skill
Episodic memory	Replay past traces and outcomes	Memory noise can accumulate
Hierarchical policies	Select skills at a higher level, execute below	Credit assignment stays hard
Held-out skill tests	Measure transfer directly	Requires careful task design
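
The first row of the table can be sketched in a few lines. This is a toy registry, assuming skills are selected by keyword overlap with the task text. That selection rule is a deliberately naive stand-in, and the Skill and SkillRegistry names are hypothetical. It is here to show the listed risk: the skill exists, but the agent does not select it when the wording changes.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional


@dataclass(frozen=True)
class Skill:
    name: str
    description: str
    triggers: FrozenSet[str]  # words that suggest the skill applies


class SkillRegistry:
    def __init__(self) -> None:
        self._skills: Dict[str, Skill] = {}

    def register(self, skill: Skill) -> None:
        self._skills[skill.name] = skill

    def select(self, task_text: str, min_overlap: int = 1) -> Optional[Skill]:
        """Return the skill with the most trigger words in the task, if any."""
        words = set(task_text.lower().split())
        best, best_score = None, 0
        for skill in self._skills.values():
            score = len(skill.triggers & words)
            if score > best_score:
                best, best_score = skill, score
        return best if best_score >= min_overlap else None


registry = SkillRegistry()
registry.register(Skill("run_tests", "Run the project's test suite",
                        frozenset({"test", "tests", "pytest"})))
registry.register(Skill("checkout", "Complete a purchase flow",
                        frozenset({"cart", "checkout", "payment"})))

hit = registry.select("Run the tests in this repository")
# Same underlying task as checkout, different surface form: no match.
miss = registry.select("Complete the purchase for the basket")
```

The second call returns nothing even though the checkout skill applies. That is the selection risk in miniature.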

What this means for builders

Your prompt is often a hand-written skill. When you tell the agent exactly how to use memory, recover from errors, or inspect files, you are manually encoding behavior that a stronger skill system should eventually learn and invoke.

There is nothing wrong with that. It ships. But you should evaluate it honestly. If the prompt works on the training tasks and fails on held-out tasks, you have not built a transferable skill. You have built a useful workaround.

Traces let you see whether a claimed skill was actually used, ignored, or invoked too late.
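
A minimal sketch of that check, assuming a trace is an ordered list of action names and that "too late" means the skill was first invoked after some committing action such as a submit. Both the trace format and the commit_action idea are assumptions made for illustration.

```python
from typing import List


def classify_skill_use(trace: List[str], skill: str, commit_action: str) -> str:
    """Label a trace as 'used', 'late', or 'ignored' for one skill."""
    if skill not in trace:
        return "ignored"
    skill_at = trace.index(skill)
    if commit_action in trace and skill_at > trace.index(commit_action):
        # The skill ran, but only after the agent had already committed.
        return "late"
    return "used"


used = classify_skill_use(["read_memory", "edit_file", "submit"],
                          "read_memory", "submit")
late = classify_skill_use(["edit_file", "submit", "read_memory"],
                          "read_memory", "submit")
ignored = classify_skill_use(["edit_file", "submit"],
                             "read_memory", "submit")
```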

How ClawBench should test skills

Skill evaluation needs to ask more than "did the agent complete the task?" It should ask whether the agent invoked the right skill, tracked state correctly, noticed failure, and recovered in a way that would generalize.
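
Those questions can be recorded as a per-trace rubric and then aggregated. The sketch below is one possible shape, not the actual ClawBench or SkillsBench schema. The field names are hypothetical and simply mirror the questions above.

```python
from dataclasses import dataclass, fields
from typing import Dict, List


@dataclass
class TraceRubric:
    completed: bool
    invoked_right_skill: bool
    tracked_state: bool
    noticed_failure: bool
    recovered: bool


def rubric_rates(rubrics: List[TraceRubric]) -> Dict[str, float]:
    """Per-dimension pass rate across a set of traces."""
    if not rubrics:
        raise ValueError("need at least one rubric")
    n = len(rubrics)
    return {
        f.name: sum(getattr(r, f.name) for r in rubrics) / n
        for f in fields(TraceRubric)
    }


# Illustrative data: both traces finish, but only one used the right skill.
rates = rubric_rates([
    TraceRubric(True, True, True, True, True),
    TraceRubric(True, False, True, False, False),
])
```

Reporting the dimensions separately keeps a high completion rate from hiding a low skill-invocation rate.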

That is where SkillsBench fits. It gives ClawBench a place to measure reusable procedural behavior, not only final answer quality. The goal is not to celebrate a skill library. The goal is to find out whether the agent actually uses it when the task changes.

Practical rule

If a skill does not improve held-out traces, it is not a skill yet. It is a prompt pattern waiting to be proven.
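
One way to apply the rule is a paired comparison on held-out tasks: run each task with and without the skill and count where the outcome changed. This sketch assumes pass/fail outcomes keyed by task ID. The function name and the task IDs are illustrative, and the data is invented.

```python
from typing import Dict, Tuple


def held_out_gain(without_skill: Dict[str, bool],
                  with_skill: Dict[str, bool]) -> Tuple[int, int]:
    """Return (wins, losses) from adding the skill on the same held-out tasks."""
    if without_skill.keys() != with_skill.keys():
        raise ValueError("both runs must cover the same tasks")
    wins = sum(1 for t in with_skill if with_skill[t] and not without_skill[t])
    losses = sum(1 for t in with_skill if without_skill[t] and not with_skill[t])
    return wins, losses


# Illustrative data only.
baseline = {"task_a": False, "task_b": True, "task_c": False}
skilled = {"task_a": True, "task_b": True, "task_c": False}
wins, losses = held_out_gain(baseline, skilled)
```

If wins do not clearly exceed losses on held-out tasks, the skill has not been proven yet.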

Continue the evaluation

Read this guide alongside the benchmark entity pages, leaderboard context, and trace evidence to move from the explanation here to concrete results.
