How to Evaluate ML Agents: A Technical Guide to MLE-Bench

Published 2026-04-23

AI agents fail at most ML engineering tasks. Not because they can't write code — because they can't manage an end-to-end ML workflow: data retrieval, feature engineering, GPU memory pressure, reproducibility, and submission generation all in one run.

That's the gap MLE-Bench was built to measure.

If you're evaluating AI agents for ML work — or building a system that will — this guide covers exactly how MLE-Bench works, how it's scored, what constraints it enforces, and how to run your own evaluations.

What MLE-Bench Actually Measures

Most AI agent benchmarks test coding ability. MLE-Bench tests something different: whether an agent can carry out an end-to-end ML engineering workflow under real constraints.

The distinction matters. Coding benchmarks ask: can the agent write code that passes the tests? ML engineering benchmarks ask: can the agent build a model that actually performs well, within a time and memory budget?

MLE-Bench is built by OpenAI and based on real Kaggle competitions. The agent receives a competition description, downloads the dataset, trains a model, generates predictions, and produces a valid submission file — all within a fixed time and memory budget.

The full upstream benchmark covers 75 Kaggle competitions. ClawBench runs the Lite (low-complexity) split: 22 representative tasks that capture the same evaluation surface at lower computational cost.

Each task is a competition slug — an identifier like `aerial-cactus-identification` or `leaf-classification` — that maps to a specific Kaggle competition with its own dataset, metric, and submission format.
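
For illustration, you can think of the task list as a mapping from slug to competition properties. A minimal sketch; the slugs come from the config example later in this guide, while the task-type and metric annotations are assumptions based on the public Kaggle pages, not metadata taken from MLE-Bench:

```python
# Illustrative only: the metric and task-type annotations are assumptions,
# not values read from MLE-Bench itself.
TASKS = {
    "aerial-cactus-identification": {
        "type": "binary image classification",
        "metric": "AUC",  # assumed from the public competition page
    },
    "leaf-classification": {
        "type": "multi-class classification",
        "metric": "log loss",  # assumed from the public competition page
    },
}

for slug, info in TASKS.items():
    print(f"{slug}: {info['type']}, scored by {info['metric']}")
```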

Why Standard Coding Benchmarks Don't Work for ML

SWE-bench and similar coding benchmarks test patch application: here's a bug, apply the fix. That's a well-defined problem with a verifiable answer.

ML engineering is messier. There is no single correct model. The evaluation is the Kaggle leaderboard score — a continuous metric that rewards better predictions, not just syntactically correct code.

This has three consequences that break most evaluation frameworks:

**Scores are continuous.** Two different models can both be "correct" to different degrees. The benchmark has to score *how well* the agent did, not just *whether* it did it.

**Resource limits are part of the spec.** Real ML engineering is constrained by GPU memory and wall-clock time. A model that trains for 25 hours "fails" MLE-Bench even if the predictions would have been excellent — because the time limit is part of the problem specification.

**Artifacts are required.** submission.csv is the competition submission. metrics.json proves the agent measured its own performance. report.md proves the agent understood what it did. Missing any of these isn't just penalized — it indicates the agent didn't complete the ML engineering workflow.

The 5 Scoring Dimensions

MLE-Bench evaluates agents across five weighted dimensions. The weights are not arbitrary — metric quality dominates at 55% because the point of ML engineering is producing good predictions.

overall_score = metric_quality × 0.55
             + efficiency × 0.15
             + reproducibility × 0.15
             + operational_correctness × 0.15

1. Metric Quality (55%)

Weighted mean task quality across all competition submissions. Measures how good the predictions are against the competition's evaluation metric (AUC, accuracy, LogLoss, etc.). This is the primary signal — a model that scores in the 80th percentile of a competition gets a high quality score.

2. Efficiency (15%)

Runtime and memory budget compliance. Agents that finish faster with lower peak memory usage score higher. The default budgets are 24 hours and 440GB peak memory, but MLE-Bench supports custom constraints per competition.

3. Reproducibility (15%)

Three checks:

- A fixed random seed, declared in the run's result output
- A recorded environment, so the run can be recreated elsewhere
- Replay metadata describing how to rerun the experiment

Reproducibility is a proxy for whether another engineer could rerun the experiment and get the same results. In ML engineering, non-reproducible results are failed results.
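
In practice, reproducibility starts with pinning every source of randomness inside the agent's training script. A minimal sketch, assuming NumPy is installed and PyTorch may or may not be; the helper name is illustrative:

```python
import os
import random

import numpy as np


def set_global_seed(seed: int = 42) -> None:
    """Pin the common sources of randomness so a rerun reproduces the same results."""
    random.seed(seed)
    np.random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True  # trade some speed for determinism
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass  # PyTorch not installed; CPU-only or non-torch pipeline


set_global_seed(42)  # the same seed should then be reported in the result JSON
```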

4. Operational Correctness (15%)

Did the agent complete the workflow and report what it did? Checks for:

- A run that completed end to end and emitted a well-formed result line
- A metrics.json reporting self-measured performance
- A report.md summarizing the methodology and key decisions

5. Artifact Validity (Diagnostic)

Weighted artifact coverage: submission.csv, metrics.json, report.md. This dimension is diagnostic — it influences quality and operational scores but isn't independently weighted. Missing artifacts cascade into lower scores across other dimensions.
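
To make the weighting concrete, here is a minimal Python sketch of how the four scored dimensions combine. Only the weights come from the formula above; the function name, the assumption that each input is normalized to [0, 1], and the example values are illustrative:

```python
def overall_score(metric_quality: float,
                  efficiency: float,
                  reproducibility: float,
                  operational_correctness: float) -> float:
    """Weighted sum of the four scored dimensions; inputs assumed normalized to [0, 1]."""
    return (metric_quality * 0.55
            + efficiency * 0.15
            + reproducibility * 0.15
            + operational_correctness * 0.15)


# Strong predictions cannot fully compensate for a failed reproducibility check.
print(overall_score(0.85, 0.90, 0.0, 1.0))  # 0.7525
```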

Resource Constraints in Practice

MLE-Bench enforces two hard constraints:

| Constraint | Default | What It Means |
| --- | --- | --- |
| `maxRuntimeSeconds` | 86,400 (24h) | Wall-clock time from submission script start to finish |
| `maxPeakMemoryMb` | 450,560 (440GB) | Peak RAM usage across the entire run |

Both are configurable per competition. A competition focused on rapid prototyping might set `maxRuntimeSeconds` to 4 hours; a competition testing long-training strategies could allow 48 hours.

When an agent exceeds either constraint, the run is marked as failed and scores zero on efficiency — with cascading effects on overall score.

These aren't artificial difficulty settings. They're the same constraints that exist in any real ML engineering job. An agent that can't finish within the time or memory budget isn't production-ready, regardless of how good its predictions look in a no-limit evaluation.
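
One defensive pattern is for the agent to check its own consumption against the budgets before the harness enforces them. A rough standard-library sketch, assuming Linux (where ru_maxrss is reported in kilobytes); the helper name and 90% safety margin are illustrative:

```python
import resource
import time

MAX_RUNTIME_SECONDS = 86_400   # 24h default budget
MAX_PEAK_MEMORY_MB = 450_560   # 440GB default budget
START = time.monotonic()


def within_budget(safety_margin: float = 0.9) -> bool:
    """Return False once the run approaches either the runtime or memory budget."""
    elapsed = time.monotonic() - START
    peak_mb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024  # KB -> MB on Linux
    return (elapsed < MAX_RUNTIME_SECONDS * safety_margin
            and peak_mb < MAX_PEAK_MEMORY_MB * safety_margin)


# Example: check between training epochs and fall back to a cheaper model if needed.
if not within_budget():
    print("Approaching a resource budget; switching to a smaller model.")
```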

The Submission Artifact Contract

For an agent to receive any score beyond zero, it must produce three artifacts:

`submission.csv` — The competition predictions in the Kaggle-required format. Wrong format = submission rejected = zero quality score.

`metrics.json` — Self-reported model performance metrics. Lets the evaluation framework verify the agent measured what it claims to have measured.

`report.md` — Methodology summary. What model did it use? What preprocessing? What were the key design decisions?
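
A minimal sketch of what producing the three artifacts might look like, assuming pandas is installed. The column names, metric values, and report text are placeholders; the competition's own sample submission defines the real format:

```python
import json

import pandas as pd

# Placeholder predictions; in a real run these come from the trained model.
test_ids = ["id_001", "id_002", "id_003"]
predictions = [0.93, 0.12, 0.78]

# submission.csv: predictions in the competition's required format.
pd.DataFrame({"id": test_ids, "prediction": predictions}).to_csv(
    "submission.csv", index=False
)

# metrics.json: self-measured performance on a held-out validation split.
with open("metrics.json", "w") as f:
    json.dump({"validation_auc": 0.91, "seed": 42}, f, indent=2)

# report.md: short methodology summary.
with open("report.md", "w") as f:
    f.write("# Approach\n\nBaseline CNN, 5-fold cross-validation, standard augmentation.\n")
```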

Agents emit results via a single stdout line:

MLE_BENCH_LITE_RESULT={"seed": 42, "runtime_seconds": 3847, "peak_memory_mb": 16384, "artifacts": [...], "tasks": [...], "replay": {...}}

If the JSON is malformed or missing, the run is marked as invalid and conservatively scored in the lowest band.
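
Emitting the line is a JSON dump prefixed with the literal key. A sketch with placeholder values mirroring the fields in the example above:

```python
import json
import sys

result = {
    "seed": 42,
    "runtime_seconds": 3847,
    "peak_memory_mb": 16384,
    "artifacts": ["submission.csv", "metrics.json", "report.md"],
    "tasks": [],    # per-competition results would go here
    "replay": {},   # metadata needed to rerun the experiment
}

# A single line on stdout; malformed JSON here drops the run into the lowest band.
sys.stdout.write("MLE_BENCH_LITE_RESULT=" + json.dumps(result) + "\n")
```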

Reproducibility: Why It Matters for Benchmark Validity

The reproducibility dimension exists because MLE-Bench is trying to measure ML engineering competence — not luck.

An agent that happens to get good predictions through a lucky random initialization is not a good ML engineer. An agent that gets good predictions reproducibly, with a fixed seed and a recorded environment, is demonstrating actual competence.

Reproducibility metadata also makes benchmark results auditable. If you run an agent on MLE-Bench today and again in six months, you can verify whether the scores are comparable or whether dataset updates, environment changes, or code changes affected the results.

For teams using MLE-Bench to compare models or scaffolding configurations, reproducibility is what makes comparison meaningful.

Common Failure Modes

Based on how agents actually fail on MLE-Bench:

**Out-of-memory kills.** Agents underestimate peak memory usage during training. They pass the submission format check but produce no valid predictions because the process was killed.

**Submission format mismatch.** Kaggle competitions have specific column names, row orders, and data types. An agent that produces a correctly named file but with the wrong structure gets zero (a simple validation guard is sketched below).

**Seed mismatch.** The agent gets good predictions — but with a different random seed than specified. Scores zero on reproducibility even if quality is high.

**Missing artifacts.** The agent produces submission.csv but skips metrics.json or report.md. Operational correctness score drops, dragging the overall score down.

**Time overruns.** Agents that attempt complex model architectures run out of time. The 24-hour constraint heavily penalizes approaches that prioritize model complexity over efficiency.
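
The format-mismatch failure above is cheap to guard against: validate the generated file against the competition's sample submission before the run ends. A sketch assuming pandas is available and that a sample_submission.csv ships with the dataset:

```python
import pandas as pd


def validate_submission(submission_path: str, sample_path: str) -> list:
    """Compare a generated submission against the competition's sample file."""
    sub = pd.read_csv(submission_path)
    sample = pd.read_csv(sample_path)
    problems = []
    if list(sub.columns) != list(sample.columns):
        problems.append(f"columns {list(sub.columns)} != expected {list(sample.columns)}")
    if len(sub) != len(sample):
        problems.append(f"row count {len(sub)} != expected {len(sample)}")
    if sub.isnull().any().any():
        problems.append("submission contains null values")
    return problems


issues = validate_submission("submission.csv", "sample_submission.csv")
if issues:
    raise SystemExit("Invalid submission: " + "; ".join(issues))
```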

How to Run an MLE-Bench Evaluation

For teams that want to run MLE-Bench evaluations themselves:

git clone https://github.com/openai/mle-bench
cd mle-bench
pip install -e .

MLE-Bench Lite on ClawBench uses the low-complexity split (22 tasks). You can also specify individual competition slugs:

{"competitionIds": ["aerial-cactus-identification", "leaf-classification"]}

Override defaults for your infrastructure:

{
  "resourceBudgets": {
    "maxRuntimeSeconds": 14400,
    "maxPeakMemoryMb": 32768
  }
}

The agent's script must write the MLE_BENCH_LITE_RESULT line to stdout or stderr. Parse that line to get results.
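
On the consuming side, recovering the result is a prefix scan over the captured logs plus a JSON decode. A sketch; the function name is illustrative:

```python
import json

PREFIX = "MLE_BENCH_LITE_RESULT="


def parse_result(log_text: str):
    """Find the result line in captured stdout/stderr and decode its JSON payload."""
    for line in log_text.splitlines():
        if line.startswith(PREFIX):
            try:
                return json.loads(line[len(PREFIX):])
            except json.JSONDecodeError:
                return None  # malformed JSON is treated as a failed run
    return None


# Usage: result = parse_result(captured_stdout_text)
```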

A score above 70 is competitive with human Kaggle participants in the lower quartile. Scores above 85 are top-quartile. Most unassisted agents score below 40.

What MLE-Bench Doesn't Measure

MLE-Bench is scoped to ML engineering competence in a competition setting. It doesn't measure:

- General software engineering or bug fixing (that's SWE-bench territory)
- Web and tool interaction outside the ML workflow
- Production concerns such as deployment, monitoring, and long-term maintenance

For a fuller picture of ML agent capability, MLE-Bench should be run alongside coding benchmarks (SWE-bench), web interaction benchmarks, and production-domain evaluations.

The Scaffolding Problem

Here's what the benchmark scores don't capture: how much of the result comes from the model itself versus the scaffolding around it.

On MLE-Bench, the same model with different prompting strategies, tool access, and workflow design can score 30 points apart. The 55% weight on metric quality rewards good engineering — but getting to good engineering requires the agent to attempt the right things in the right order.

This is why benchmark scores alone are insufficient for procurement decisions. Run MLE-Bench to get a baseline. Then test your scaffolding hypotheses with ablation experiments: same model, different context, different tools, different constraints.
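
A scaffolding ablation is easy to express as a small grid of configurations. A hypothetical sketch; the axis names and values are made up for illustration, and only maxRuntimeSeconds corresponds to an MLE-Bench setting:

```python
from itertools import product

# Hypothetical ablation grid: the same model, varying the scaffolding around it.
models = ["model-a"]
prompting = ["minimal", "plan-then-execute"]
tools = [("python",), ("python", "bash")]
runtime_budgets_s = [14_400, 86_400]

configs = [
    {"model": m, "prompting": p, "tools": list(t), "maxRuntimeSeconds": b}
    for m, p, t, b in product(models, prompting, tools, runtime_budgets_s)
]
print(f"{len(configs)} runs to compare")  # 1 × 2 × 2 × 2 = 8
```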

ClawBench runs these comparisons at scale so you don't have to.

Summary

MLE-Bench is the most rigorous open benchmark for ML engineering agents. It evaluates the full workflow — data, training, prediction, submission, reproducibility — not just code generation.

Key numbers to remember:

| Dimension | Weight | What it rewards |
| --- | --- | --- |
| Metric Quality | 55% | Good predictions |
| Efficiency | 15% | Fast + low memory |
| Reproducibility | 15% | Fixed seed + environment |
| Operational Correctness | 15% | Complete workflow |

Default constraints: 24h runtime, 440GB memory. Both configurable.

Required artifacts: submission.csv, metrics.json, report.md.

If you're procuring an ML agent, running MLE-Bench is the closest you can get to a real-world trial without deploying to production.