Why Single-Number Evals and Narrow Benchmarks Won’t Lead Us to AGI

Richa Sharma, May 10, 2026

The AI research community has developed a collective addiction to single-number evaluations. We compress months of model development into a leaderboard percentage, declare winners and losers on the basis of a few decimal points, and move on. This is the wrong game to be playing.

When the field celebrates a model “solving” GPQA or “saturating” HLE, what exactly has been demonstrated? That a system can select the correct answer from a fixed set of options, on problems with known solutions, under conditions that bear almost no resemblance to real scientific inquiry. [1] The benchmark tells us something. It does not tell us nearly enough.

[1] GPQA (Graduate-Level Google-Proof QA) and HLE (Humanity’s Last Exam) test narrow slices of expert knowledge but cannot evaluate the capacity to formulate novel questions, arguably the harder half of science.

We are measuring the map and mistaking it for the territory.

The Eval Trap

Single-number evaluations create perverse incentives. When a model’s worth is determined by its rank on a public benchmark, every optimization dollar flows toward that benchmark. Training data gets curated to match eval distributions. Prompting strategies get tuned to specific question formats. The model doesn’t get smarter—it gets better at the test.

This is not a new observation in machine learning, but its consequences for AGI research are particularly severe. Goodhart’s Law (when a measure becomes a target, it ceases to be a good measure) applies with full force. [2] Every narrow benchmark we saturate is a benchmark we should probably retire.

[2] The original formulation by Charles Goodhart (1975) was about monetary policy, but the principle has become a foundational warning in optimization theory and ML alignment research.

The deeper problem is structural. Benchmarks test interpolation within known distributions. They ask: given this well-posed problem with a known answer, can you find it? Science asks something categorically different: given an open-ended reality with no answer key, can you discover something true?

No multiple-choice exam has ever produced a Newton. No leaderboard has ever produced a theory.

What Benchmarks Miss

Consider what it takes to make a genuine scientific discovery. The researcher must first notice something anomalous—a residual in the data, a symmetry that shouldn’t be there, a prediction that fails in a specific and interesting way. They must then formulate a hypothesis that is not merely consistent with the data but predictive of new data.

None of these capabilities are measured by existing benchmarks. AIME tests mathematical problem-solving within well-defined contest constraints. Even the most ambitious evaluations, such as FrontierMath and the Putnam, test the ability to arrive at known answers through known techniques. [3]

[3] FrontierMath, introduced in 2024, represents a significant step forward in difficulty but still operates within the paradigm of “problems with verifiable solutions designed by humans.”

What we need to evaluate is something harder and more fundamental: can a model operate in environments where the rules are rigid but the answers are unknown?

A Different Metric

If single-number evals won’t get us to AGI, what will? We believe the answer lies in open-ended evaluation against physical reality. Instead of asking a model to solve problems we’ve already solved, we should be asking it to solve problems we haven’t.

This requires a fundamentally different evaluation infrastructure. Not a dataset of questions and answers, but a world—a deterministic physics simulator with rigid constraints where the model can hypothesize, experiment, fail, and learn.
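To make the idea concrete, here is a minimal Python sketch of what an experiment-driven evaluation loop could look like. It is a toy illustration only and not a description of any actual system: the `FreeFallWorld` class, the `fit_gravity` and `run_episode` helpers, and the choice of a free-fall law with a hidden constant are all assumptions invented for this example. The point it tries to show is that the score comes from predicting an experiment the agent has not yet run, rather than from matching a stored answer key.

```python
import math


class FreeFallWorld:
    """Hypothetical deterministic world with one rigid rule and a hidden constant.

    An object dropped from height h (metres) lands after sqrt(2h / g) seconds.
    The value of g is never exposed to the agent; it can only run experiments.
    """

    def __init__(self, g: float):
        self._g = g  # hidden parameter of the world

    def drop(self, height_m: float) -> float:
        if height_m <= 0:
            raise ValueError("height must be positive")
        return math.sqrt(2.0 * height_m / self._g)


def fit_gravity(observations):
    """Least-squares estimate of g from (height, time) pairs, assuming h = 0.5 * g * t^2."""
    numerator = sum(2.0 * h * t**2 for h, t in observations)
    denominator = sum(t**4 for _, t in observations)
    return numerator / denominator


def run_episode(world, heights, holdout_height):
    """Experiment, hypothesize, then score the hypothesis on an unseen experiment."""
    observations = [(h, world.drop(h)) for h in heights]  # experiment
    g_hat = fit_gravity(observations)                     # hypothesize
    predicted = math.sqrt(2.0 * holdout_height / g_hat)   # predict new data
    actual = world.drop(holdout_height)                   # check against the world
    return g_hat, abs(predicted - actual)


# Stand-in "agent": a fixed experimental plan. A real agent would choose its own experiments.
world = FreeFallWorld(g=3.71)
g_hat, error = run_episode(world, heights=[1.0, 2.0, 5.0], holdout_height=20.0)
print(f"estimated g = {g_hat:.4f}, prediction error = {error:.2e} s")
```

This sketch omits the harder parts of the loop described above: the agent here is handed the functional form of the law and a fixed set of experiments, so it never has to notice an anomaly, revise a failed hypothesis, or decide what to measure next.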

This is what we are building at Icosian. Not another benchmark. An arena.