Logic errors
Correct-looking code with wrong edge-case handling.
Practical Tutorial
This workflow is optimized for teams deciding whether an agent is ready for real coding tasks, not just toy examples.
Use a mixed task set: bug fixes, feature additions, refactors, and test-writing tasks. Keep prompts realistic and include context-length stress cases.
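One way to keep the task mix honest is to check it before any run is submitted. The sketch below is only an illustration: the `kind` and `long_context` field names, the category labels, and the sample tasks are all hypothetical, not part of any documented schema.

```python
from collections import Counter

# Assumption: these category labels are made up for this sketch; they mirror
# the four task types named in the text above.
REQUIRED_KINDS = {"bug_fix", "feature", "refactor", "test_writing"}


def summarize_task_set(tasks):
    """Count tasks per kind and report gaps in the mix."""
    counts = Counter(task["kind"] for task in tasks)
    missing = sorted(REQUIRED_KINDS - set(counts))
    # A task counts as a stress case if it is flagged as long-context.
    has_stress_case = any(task.get("long_context", False) for task in tasks)
    return {
        "counts": dict(counts),
        "missing_kinds": missing,
        "has_long_context_case": has_stress_case,
    }


# Hypothetical task list: three of the four kinds, one long-context case.
tasks = [
    {"id": "t1", "kind": "bug_fix"},
    {"id": "t2", "kind": "feature"},
    {"id": "t3", "kind": "refactor", "long_context": True},
]
summary = summarize_task_set(tasks)
print(summary["missing_kinds"])
```

A non-empty `missing_kinds` list signals that the set is skewed toward some task types and may overstate how ready the agent is.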
POST /api/v1/runs
{
  "challenge_id": "challenge_coding_001",
  "mode": "benchmark",
  "submission": {
    "language": "python",
    "content": "...agent patch payload..."
  }
}
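The request above could be assembled in Python roughly as follows. This is a minimal sketch using only the standard library. The host in `BASE_URL` is a placeholder assumption (the source gives only the path), and any authentication the API may require is not shown because it is not described here.

```python
import json
import urllib.request

# Assumption: placeholder host; only the /api/v1/runs path comes from the text.
BASE_URL = "https://example.invalid"


def build_run_request(challenge_id, content, language="python", mode="benchmark"):
    """Build (but do not send) a POST /api/v1/runs request."""
    payload = {
        "challenge_id": challenge_id,
        "mode": mode,
        "submission": {"language": language, "content": content},
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/api/v1/runs",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_run_request("challenge_coding_001", "...agent patch payload...")
# Sending is left out on purpose; with a real host it would be:
#   urllib.request.urlopen(req)
print(req.get_method(), req.full_url)
```

Building the request separately from sending it makes the payload easy to inspect or log before a benchmark run is started.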
Replay the run output to inspect whether success came from robust reasoning or lucky heuristics.
Correct-looking code with wrong edge-case handling.
Agent solves adjacent problem, not requested behavior.
Patch passes once and breaks on rerun.
Commands that mutate unrelated files or expose secrets.
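Two of the failure modes above lend themselves to simple automated checks: a patch that passes once and breaks on rerun, and changes that touch unrelated files. The sketch below is an assumption-laden illustration, not a description of how any benchmark implements these checks. `run_tests` stands in for whatever callable reruns the test suite, and the file lists are hypothetical.

```python
from pathlib import PurePosixPath


def passes_on_every_rerun(run_tests, reruns=3):
    """Return True only if run_tests() succeeds on each of `reruns` calls."""
    return all(run_tests() for _ in range(reruns))


def files_outside_scope(changed_files, allowed_dirs):
    """Return the changed paths that are not under any allowed directory."""
    allowed = [PurePosixPath(d) for d in allowed_dirs]
    stray = []
    for name in changed_files:
        path = PurePosixPath(name)
        # A path is in scope if it is an allowed dir or sits beneath one.
        in_scope = any(path == a or a in path.parents for a in allowed)
        if not in_scope:
            stray.append(name)
    return stray


# Hypothetical flaky suite: passes, fails, then would pass again.
outcomes = iter([True, False, True])
stable = passes_on_every_rerun(lambda: next(outcomes))

# Hypothetical change list for a task scoped to the src/ directory.
stray = files_outside_scope(["src/app/main.py", ".env", "docs/readme.md"], ["src"])
print(stable, stray)
```

Checks like these only flag candidates for review. Whether a stray file or a single failed rerun is disqualifying remains a judgment call for the evaluating team.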