TL;DR: GSO measures whether models can reach human parity on global software optimization tasks: problems that demand longer-horizon, end-to-end changes (not just quick fixes). Models just made their first real leaps on such tasks.


Leaderboard Update

Two new results stand out:

🥇 Opus-4.5 is the first model to hit 25% on GSO.

🥈 Gemini-3-Pro made a big jump to #2.

What’s actually improving

Three shifts stand out as qualitatively different from prior models:

1) Better long-horizon code changes

Earlier models often “patched symptoms” to make it look like they were finishing the task, or they over-committed to an early path and got stuck. The newer models are more willing to explore: they take more action, earlier in the trajectory, with tighter iteration.


2) “Long-horizon” isn’t one thing

A big takeaway: what counts as good long-horizon behavior depends on the task. For code optimization, doing well seems to mean smart trial-and-error exploration, not just reading and reasoning forever, and not excessive testing either.

In other words: more time spent doesn’t help if it’s spent on the wrong activities.
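To make the activity mix concrete, here is a minimal, hypothetical sketch (not GSO’s harness or any model’s actual workflow) of the kind of explore-and-measure loop that pays off in code optimization: propose candidate implementations, check they still behave correctly, benchmark them, and keep only what measurably helps. The candidate functions and workload below are illustrative stand-ins.

```python
# Hypothetical explore-and-measure loop (illustrative only, not GSO's evaluation harness).
import timeit

def sum_of_squares_loop(n):
    # Baseline: straightforward Python loop.
    total = 0
    for i in range(n):
        total += i * i
    return total

def sum_of_squares_genexpr(n):
    # Candidate change: push the loop into a builtin.
    return sum(i * i for i in range(n))

def sum_of_squares_closed_form(n):
    # Candidate change: replace iteration with the closed-form identity.
    m = n - 1
    return m * (m + 1) * (2 * m + 1) // 6

def best_candidate(candidates, workload=100_000, repeats=5):
    """Benchmark each candidate and keep the fastest one that stays correct."""
    reference = candidates[0](workload)  # treat the baseline's output as the oracle
    best_fn, best_time = None, float("inf")
    for fn in candidates:
        if fn(workload) != reference:    # reject "optimizations" that change behavior
            continue
        t = min(timeit.repeat(lambda: fn(workload), number=10, repeat=repeats))
        if t < best_time:
            best_fn, best_time = fn, t
    return best_fn, best_time

if __name__ == "__main__":
    fn, t = best_candidate([sum_of_squares_loop,
                            sum_of_squares_genexpr,
                            sum_of_squares_closed_form])
    print(f"fastest correct candidate: {fn.__name__} ({t:.4f}s for 10 calls)")
```

The point of the sketch is the shape of the loop, not the toy workload: time spent proposing and measuring concrete changes moves the needle, while time spent only reading or only re-running tests does not.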


3) More analysis and proactivity before committing

Models also seem more deliberate: they consider alternate directions, think about algorithmic complexity, and plan more explicitly before making changes.


Where this leaves us

GSO is still hard, and clear human-agent gaps remain in software engineering, but it’s genuinely exciting to see this kind of progress start to show up on a long-horizon task like software optimization.