TL;DR: GSO measures whether models can reach human parity on global software optimization: problems that demand long-horizon, end-to-end changes, not just quick fixes. Models just made their first real leaps on these tasks.

Two new results stand out:
🥇 Opus-4.5 is the first model to hit 25%.
🥈 Gemini-3-Pro made a big jump to #2.
Three shifts in behavior feel qualitatively different from what we saw in prior models:
Earlier models often “patched symptoms” to look like they were finishing the task, or got stuck over-committing to an early path. The newer models are more willing to explore: more action, earlier in the trajectory, and tighter iteration.

A big takeaway: long-horizon behavior depends on the task. For code optimization, “doing well” may mean smart trial-and-error exploration, not endless reading and reasoning, and not excessive testing either.
In other words: more time spent doesn’t help if it’s spent on the wrong activities.

Models also seem more deliberate: they weigh alternative directions, reason about algorithmic complexity, and plan more explicitly before making changes.
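
To give a flavor of what “reasoning about algorithmic complexity” looks like in an optimization task, here is a toy, hypothetical sketch (not taken from GSO itself): replacing a quadratic membership check with a set-based one. The function names and data are made up for illustration.

```python
# Toy, hypothetical example of a complexity-driven rewrite of the kind
# that code-optimization tasks reward (not an actual GSO problem).

def count_common_slow(a: list[int], b: list[int]) -> int:
    """O(len(a) * len(b)): rescans b for every element of a."""
    return sum(1 for x in a if x in b)

def count_common_fast(a: list[int], b: list[int]) -> int:
    """O(len(a) + len(b)): one pass to build a set, one pass to check."""
    b_set = set(b)
    return sum(1 for x in a if x in b_set)

if __name__ == "__main__":
    a = list(range(5_000))
    b = list(range(2_500, 7_500))
    # Both versions agree; only the asymptotics (and wall-clock time) differ.
    assert count_common_slow(a, b) == count_common_fast(a, b) == 2_500
```

A local “quick fix” would tweak the slow loop in place; the end-to-end changes GSO cares about are more like the second version, where the data structure and the algorithm change together.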

GSO is still hard, and clear human-agent gaps remain in software engineering, but it’s genuinely exciting to see this kind of progress show up on a long-horizon task like software optimization.