TL;DR

New models were added to the GSO leaderboard. Open-source contenders (like Qwen and Kimi) are competing closely with the top closed models. o3 still leads, but recent releases such as GPT-5 and Claude-4-Opus are strong—overall, the race remains tight on these long-horizon code-optimization tasks.


What’s new

Takeaways

  1. Open vs. closed is narrowing. It’s encouraging to see open models keeping pace with the frontier leaders.
  2. Recent models look strong. While o3 holds the top spot, newer releases like GPT-5 and Claude-4-Opus are clearly competitive.
  3. No runaway winner—yet. Results remain neck-and-neck, underscoring how challenging long-horizon, tool-using code optimization still is.

Ecosystem update

GSO is now listed on the Epoch AI Benchmarking Hub, alongside benchmarks like TerminalBench, DeepResearchBench, METR Time Horizons, and WebDevArena. This should make trends easier to track as the field evolves.


Links