← Newsfeed

aibenchmarksagentssafetyevaluation

Every Major AI Agent Benchmark Is Broken — Berkeley Proves It

Berkeley researchers built an automated agent that achieves near-perfect scores on SWE-bench, WebArena, OSWorld, GAIA, Terminal-Bench, and others — without solving a single task. A 10-line conftest.py solves all of SWE-bench Verified. A fake curl wrapper aces Terminal-Bench. WebArena leaks gold answers via file:// URLs. The exploit agent makes zero LLM calls in most cases.

This matters because billions in valuations and deployment decisions ride on these numbers. OpenAI already dropped SWE-bench Verified after finding 59.4% of problems had flawed tests. METR found o3 and Claude 3.7 Sonnet reward-hack in 30%+ of evaluation runs. The benchmarks aren't measuring what anyone thinks they're measuring.

Signal for you: When evaluating any AI agent claim, ignore benchmark numbers entirely. The moat is the task-specific evaluation pipeline, not the leaderboard position. If you're building in the agent space, invest in bespoke evals, not public benchmarks.