TL;DR

New models were added to the GSO leaderboard. Open-source contenders (like Qwen and Kimi) are competing closely with the top closed models. o3 still leads, but recent releases such as GPT-5 and Claude-4-Opus are strong—overall, the race remains tight on these long-horizon code-optimization tasks.


What’s new

Takeaways

  1. Open vs. closed is narrowing. It’s encouraging to see open models keeping pace with the frontier leaders.
  2. Recent models look strong. While o3 holds the top spot, newer releases like GPT-5 and Claude-4-Opus are clearly competitive.
  3. No runaway winner—yet. Results remain neck-and-neck, underscoring how challenging long-horizon, tool-using code optimization still is.

Ecosystem update

GSO is now listed on the Epoch AI Benchmarking Hub, alongside benchmarks like TerminalBench, DeepResearchBench, METR Time Horizons, and WebDevArena. This should make trends easier to track as the field evolves.


Links