Eight Top AI Agent Benchmarks Hit 100% Without Solving a Single Task

UC Berkeley’s Center for Responsible, Decentralized Intelligence (RDI) built an exploit agent that scored 100% on SWE-bench, Terminal-Bench, FieldWorkArena, and other benchmarks by attacking the evaluation pipeline instead of doing the work. The leaderboards the industry trusts are gameable.
artificial-intelligence
cyber-security
Author

Kabui, Charles

Published

2026-04-19

Keywords

llm-benchmarks, swe-bench, reward-hacking, benchjack, berkeley-rdi