The Anatomy of a Breakthrough
In 1609, Johannes Kepler, working with Tycho Brahe’s meticulously recorded naked-eye observations of Mars, arrived at simple but remarkably predictive laws of planetary motion.
For two millennia, scientific consensus held that celestial bodies moved in perfect circles. But Brahe’s data contained an anomaly: a mere 8-arcminute deviation from a perfectly circular orbit. An unthinking researcher (human or machine) would have forced the data to fit the circular paradigm. Kepler instead threw away the 2,000-year-old dogma entirely and collapsed the immense complexity of the heavens into the equation of an ellipse.
He had the capacity to resist the gravity of peer consensus (an established interpolation) and extract ground truth from a sparse, rigidly constrained state space. We have thus far failed to design an objective function that rewards this capacity in our models.
This is the essence of true scientific discovery: human scientific conviction. The question now is, if Artificial General Intelligence is to be truly realized, how do we distill that conviction into our machines?
The Silicon Engine: Verifiable Search Spaces
To understand how to engineer the kind of scientific conviction Kepler possessed, we must examine the architectural DNA of the most profound AI milestones to date. From AlphaGo and AlphaFold to the recent progress in applying Reinforcement Learning to Chain-of-Thought (RL-CoT), a single pattern emerges.
These models do not succeed through biological intuition; they succeed because they operate within domains that possess massive, discrete search spaces coupled with absolute, easily verifiable outcomes.
[Figure: example 9x9 Sudoku puzzle, row-major, 0 = empty cell: 530070000600195000098000060800060003400803001700020006060000280000419005000080079]
Consider the architecture of a Sudoku puzzle. The combinatorial state space of a 9x9 grid is immense, but the rules governing a valid state are rigid and unbreakable. A model can generate millions of stochastic, brute-force paths, but the crucial mechanism is the reward function: verifying a correct board is computationally instant and entirely binary. It is either a mathematical fit, or it is not.
This loop forces the model out of superficial pattern matching—what we might call sophisticated imitation—and violently prunes the search space until the model discovers a structured, logically sound path to the solution.
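To make the binary reward concrete, here is a minimal sketch of a Sudoku verifier in Python. The function names (`parse_grid`, `is_valid_solution`, `reward`) are illustrative assumptions, not part of any particular training stack. `PUZZLE` is a widely circulated example grid written row by row with 0 for an empty cell, and `SOLUTION` is its completed board.

```python
from typing import List

Grid = List[List[int]]

# Example puzzle (0 = empty cell) and its completed board, row-major.
PUZZLE = "530070000600195000098000060800060003400803001700020006060000280000419005000080079"
SOLUTION = "534678912672195348198342567859761423426853791713924856961537284287419635345286179"


def parse_grid(s: str) -> Grid:
    """Turn an 81-character row-major digit string into a 9x9 grid."""
    if len(s) != 81 or not s.isdigit():
        raise ValueError("expected exactly 81 digits")
    return [[int(ch) for ch in s[r * 9:(r + 1) * 9]] for r in range(9)]


def is_valid_solution(grid: Grid) -> bool:
    """True if every row, column, and 3x3 box contains 1..9 exactly once."""
    full = set(range(1, 10))
    rows = [set(row) for row in grid]
    cols = [set(grid[r][c] for r in range(9)) for c in range(9)]
    boxes = [
        set(grid[br + r][bc + c] for r in range(3) for c in range(3))
        for br in (0, 3, 6)
        for bc in (0, 3, 6)
    ]
    return all(unit == full for unit in rows + cols + boxes)


def reward(puzzle: Grid, candidate: Grid) -> int:
    """Binary reward: 1 only if the candidate keeps every clue and is valid."""
    keeps_clues = all(
        puzzle[r][c] in (0, candidate[r][c])
        for r in range(9)
        for c in range(9)
    )
    return int(keeps_clues and is_valid_solution(candidate))


print(reward(parse_grid(PUZZLE), parse_grid(SOLUTION)))
```

The check touches 27 units of nine cells each, so its cost is negligible next to the search that produced the candidate. That asymmetry is what the loop exploits.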
We have successfully built systems that can navigate the combinatorial explosion of a Go board or a protein sequence because the "win state" is undisputed. But as we pivot from these closed-system mathematical games to the open, unstructured reality of fundamental physics, the architecture of our training grounds is about to hit a definitive ceiling.
Hill-Climb and Saturation Timeline
The recent trajectory of mathematical benchmarks is not a linear progression; it is a violent, compute-driven saturation of the search space. Just eighteen months ago, research-grade evaluations like FrontierMath represented a daunting barrier, with top models struggling to clear the 10% threshold. Today, that landscape has been decimated. GPT-5.4 Pro (xhigh) has already pushed past 50% accuracy, with Gemini 3.1 trailing at 36.9%.
The delta between 10% and 50% was not achieved by feeding models more textbook examples. It was achieved by embracing The Bitter Lesson: the realization that general methods that leverage massive computation—specifically search and learning—eventually overwhelm human-designed heuristics or specialized datasets. We have moved beyond the era of simple Supervised Fine-Tuning (SFT). The current frontier is defined by scaled test-time compute and Reinforcement Learning on Chain-of-Thought (RL-CoT).
The reason narrow, short-form benchmarks are coming to an end is that labs have essentially "solved" the symbolic reasoning tier. This is where the "hard-to-find, easy-to-verify" loop becomes a double-edged sword:
- The Sudoku Effect: If a problem has a definitive, programmatically verifiable win-state, a model can simply burn tokens to search the space until it hits the target.
- The Data Diminishing Return: Releases like MathNet—which provide massive datasets of solved IMO problems—are increasingly secondary to the compute-driven reasoning loop. For the current tier of difficulty, the Bitter Lesson dictates that we don't need more examples; we need more efficient search policies.
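The "burn tokens until it hits the target" loop can be sketched in a few lines of Python. This is a toy under stated assumptions: `search_until_verified` is a hypothetical helper, and integer factoring stands in for any problem whose answer is hard to find but cheap to verify. A real system would replace the uniform random proposer with a learned policy.

```python
import math
import random
from typing import Callable, Optional, Tuple


def search_until_verified(
    propose: Callable[[random.Random], int],
    verify: Callable[[int], bool],
    budget: int,
    seed: int = 0,
) -> Tuple[Optional[int], int]:
    """Sample candidates until one passes the verifier or the budget runs out.

    Returns (candidate, attempts used); candidate is None on failure.
    """
    rng = random.Random(seed)  # seeded so runs are reproducible
    for attempt in range(1, budget + 1):
        candidate = propose(rng)
        if verify(candidate):
            return candidate, attempt
    return None, budget


# Toy problem: find a nontrivial factor of N = 43 * 47.
N = 2021
LIMIT = math.isqrt(N)  # a composite N has a factor no larger than this


def propose_factor(rng: random.Random) -> int:
    """Blind proposal: any integer in [2, LIMIT]."""
    return rng.randint(2, LIMIT)


def verify_factor(p: int) -> bool:
    """Cheap, binary check: does p divide N?"""
    return N % p == 0


found, attempts = search_until_verified(propose_factor, verify_factor, budget=10_000)
print(found, attempts)
```

Nothing in the loop knows anything about the problem. Progress comes entirely from the verifier plus compute, which is why a better proposal policy matters more here than additional solved examples.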
We are reaching the asymptote of what can be measured by single-answer evaluations. The only real frontier remaining is the long-horizon open problem—environments where there is no pre-existing dataset to overfit and no human-framed "answer key" to climb toward.
This is precisely why we must shift our focus from mathematical puzzles to the messy, high-dimensional search spaces of Computational Physics. If we want to build models with true scientific conviction, we have to move them out of the Sudoku grids and into the simulation pipelines, where the ground truth isn't just a binary "correct" on a leaderboard, but a fundamental law of the universe.
Paradigm-Shifting Discovery
The cleanest formulation of this gap is Demis Hassabis's threshold for true AGI: the Einstein Test. Cut off a model's training data at the year 1901 and see if it can derive Relativity on its own. Out of a million possible frameworks, a model must possess the discernment to choose the singular, paradigm-shifting mathematical truth, entirely without modern empirical evidence to guide it.
Human understanding of the universe has always hinged on these rare biological anomalies. Newton. Kepler. Einstein. Witten. Now, we are attempting to engineer this capacity into silicon.
Why Encode Universe?
The last remaining frontier is training models on unsolved, long-horizon research problems. This requires a grounded reality where a model can hypothesize, run simulations, fail, and refine its world-model against rigid physical constraints.
To "Encode Universe" is to move beyond text-based reasoning and into the realm of active, embodied discovery where the reward signal comes from the universe itself, not a human preference model.
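As a toy illustration of a reward that comes from observation rather than from a preference model, the Python sketch below scores candidate power laws relating orbital period to orbital size. It borrows the relation behind Kepler's third law (published in 1619, later than the 1609 laws mentioned earlier) purely as a stand-in target. The planetary values are approximate modern figures, and the names `OBSERVATIONS`, `HYPOTHESES`, and `reward` are assumptions made for this example.

```python
import math

# Approximate values: (semi-major axis in AU, orbital period in years).
OBSERVATIONS = {
    "Mercury": (0.387, 0.241),
    "Venus": (0.723, 0.615),
    "Earth": (1.000, 1.000),
    "Mars": (1.524, 1.881),
    "Jupiter": (5.203, 11.862),
    "Saturn": (9.537, 29.457),
}


def reward(exponent: float) -> float:
    """Negative mean squared log-error of the hypothesis T = a ** exponent."""
    errors = [
        (exponent * math.log(a) - math.log(t)) ** 2
        for a, t in OBSERVATIONS.values()
    ]
    return -sum(errors) / len(errors)


# Candidate hypotheses a model might propose for the exponent.
HYPOTHESES = [1.0, 1.25, 1.5, 1.75, 2.0]

best = max(HYPOTHESES, key=reward)
for k in HYPOTHESES:
    print(f"exponent={k:.2f}  reward={reward(k):.6f}")
print("best exponent:", best)
```

No human labels the right answer here. The data alone ranks the hypotheses, and the exponent 1.5 (period squared proportional to distance cubed) scores highest.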
Computational Physics Bottleneck
The primary obstacle to this vision is the massive gap between the speed of physical reality and the speed of digital learning. Traditional simulators are often too slow, too fragile, or too specialized to serve as the training ground for general discovery models.
We need differentiable, high-fidelity environments that allow models to iterate through millions of scientific hypotheses in minutes, effectively compressing centuries of human laboratory work into days of compute.
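What "differentiable" buys can be shown with a deliberately tiny simulator. The Python sketch below integrates a body under constant acceleration and carries the derivative of the final position with respect to that acceleration through every step, then uses the derivative to recover the parameter by gradient descent. The derivative is written by hand for clarity. A production environment would more likely rely on an automatic differentiation framework, and all names and constants here are illustrative assumptions.

```python
from typing import Tuple


def simulate(accel: float, steps: int = 100, dt: float = 0.01) -> Tuple[float, float]:
    """Euler-integrate a body starting at rest under constant acceleration.

    Returns the final position and its derivative with respect to `accel`,
    propagated step by step alongside the state (forward-mode sensitivity).
    """
    v = x = 0.0
    dv = dx = 0.0  # d(v)/d(accel) and d(x)/d(accel)
    for _ in range(steps):
        v += accel * dt
        dv += dt          # derivative of the velocity update
        x += v * dt
        dx += dv * dt     # derivative of the position update
    return x, dx


def fit_accel(x_observed: float, guess: float = 1.0,
              lr: float = 1.0, iterations: int = 100) -> float:
    """Gradient descent on the squared error (x - x_observed) ** 2."""
    accel = guess
    for _ in range(iterations):
        x, dx = simulate(accel)
        grad = 2.0 * (x - x_observed) * dx  # chain rule through the simulator
        accel -= lr * grad
    return accel


TRUE_ACCEL = 9.81                     # hidden parameter of the "universe"
x_obs, _ = simulate(TRUE_ACCEL)       # the single observation we get to see
recovered = fit_accel(x_obs)
print(round(recovered, 6))
```

Because the gradient points straight at the parameter error, the fit converges in a handful of simulator calls. A blind sampler over the same parameter range would need far more.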
Icosian's Mission & Final Frontier to Achieve AGI
Icosian builds the Reinforcement Learning environments and simulators required to enable truly autonomous, Nobel-grade physical discoveries. We are providing the arenas for frontier labs to take a real shot at the Einstein Test, and put literal Einsteins in the datacenter.
