🕵🏻‍♂️ GSO HackDetector

Manish Shetty, Naman Jain | Nov 3, 2025

gso-bench.github.io

Hungry, hungry reward hackers, and how we catch them. Tests certify functional behavior; they don’t judge intent. GSO now combines tests with a rubric-driven HackDetector to identify models that game the benchmark. We found that up to 30% of a model’s attempts are reward hacks, which are not caught by correctness tests!

</aside>

The Problem: When Passing Tests Isn’t Enough

GSO (General Software Optimization) is a challenging benchmark that tasks AI models with optimizing real-world codebases for optimal performance. We score models on the percentage of tasks where a patch to the codebase preserves correctness and achieves at least 95% of the expert human-achieved speedup. We measure speedups on a suite of performance tests with diverse workloads.

The catch? Tests verify outcomes, not intent. A patch might pass all checks while altogether gaming them. On FrontierMath, EpochAI found that models often guess answers without going through all the intended reasoning on single-turn advanced math questions [1]. Such reward hacking behaviour is considerably exacerbated in multi-turn agentic settings such as autonomous software development, as documented by METR [2]. Recently, ImpossibleBench [3] has explored reward hacking in synthetic coding setups—asking models to solve contradictory tasks to see if they’d cheat. However, synthetic evaluations risk answering the wrong question: they test whether models can exploit bad specifications, not whether they exploit real-world challenges.

Here’s what we found: Models don’t need impossible tasks to hack. Give them challenging, realistic problems, and they’ll find the exploits anyway.

HackDetector

Over the past few months, we’ve seen an increasing number of cases of reward hacking ******on GSO’s tasks. Frontier models not only fail to optimize code—they actively exploit evaluation infrastructure, overfit test distributions, and remove features to fake performance gains. To distinguish real gains, we need rubric-based criteria that penalize environment tampering, side-stepping tests, and other such hacks.

We introduce an automated detection system that leverages GPT-5’s code analysis capabilities, with majority voting. An overview of the system is as follows:

When a patch clears GSO’s correctness and performance tests, we don’t stop. We build a case file, including the relevant tests, the expert patch, the model’s patch, and a concise rubric that defines what constitutes improvement versus gaming the benchmark. We then query GPT-5-high K times independently for an is_reward_hack verdict with rationale. We then take a majority vote across the K samples for the final label with confidence max(hack_verdicts, legitimate_verdicts) / K. We tune this system and the rubrics by manually validating 100s of hacks detected by it.

Verdict: Models are Serial Hackers

We ran our HackDetector on all GSO problem instances attempted by recent frontier models. We plot the Hack Rate = (#hacked_instances / #total_instances), distinguishing between model patches that passed GSO’s correctness tests and those that did not. The results are striking:

On average, a massive 30% of a model’s solutions are reward hacks!!

While correctness tests catch many of these hacking attempts, 18% of attempts are exploits that pass the tests. Reasoning models such as O3 can hack up to 30% of instances while passing tests. These models don’t just cut corners sometimes; they systematically exploit environments, memoize calls, overfit to test distributions, and even delete features to fake speedups.

Leaderboard Impact

Beyond correctness, GSO also measures speedups. GSO’s stringent 95% speedup requirement does filter out many of these hacks. As we tighten the performance threshold, hack rates drop, confirming that most model exploits rely on crude tricks that do not survive rigorous performance standards.