In the world of AI benchmarks, most roads lead to flashy competitions: solving coding puzzles, climbing Codeforces ratings, or passing Olympiad-level problems. But a new benchmark — FormulaOne — changes the race. It doesn’t ask, “Can you win a medal?” It asks, “Can you think like a researcher?”
And the answer from today’s frontier LLMs? A resounding no.
From Codeforces Champs to Research Rookies
The authors of FormulaOne strip away the glitz of competitive programming and delve into something far more consequential: research-grade algorithmic problems grounded in Monadic Second-Order (MSO) logic over graphs. These aren’t out-of-distribution visual puzzles like ARC. They’re in-distribution, theoretically tractable problems designed with precision to demand multi-step symbolic reasoning, mathematical insight, and clean implementation.
The result? GPT-4o, Gemini 2.5, and Grok 4 Heavy, all models celebrated for their prowess in programming contests, solve less than 1% of the 120 problems in the FormulaOne dataset despite generous prompting and multiple attempts. That’s not a performance gap. It’s a chasm.
A Benchmark Rooted in Theory and Reality
FormulaOne’s problems emerge from a powerful framework: MSO logic combined with dynamic programming over tree decompositions. This is not an artificial game. These techniques underpin real-world optimization tasks like:
- Traffic routing
- Power grid scheduling
- Network design
The benchmark draws directly from Courcelle’s theorem, which guarantees that MSO-definable problems can be solved efficiently on graphs with bounded treewidth. Crucially, many of these problems are also tied to fine-grained complexity theory, specifically the Strong Exponential Time Hypothesis (SETH). That means solving them faster than known algorithms could overturn major theoretical assumptions — a holy grail in algorithm design.
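To make “MSO-definable” concrete, here is one illustrative formula (my example, not taken from the benchmark’s own DSL): Dominating Set asks for a vertex set S such that every vertex is either in S or adjacent to a vertex in S, which in MSO over graphs reads

$$\exists S\;\forall v\;\bigl(\,v \in S \;\lor\; \exists u\,(E(u,v) \land u \in S)\,\bigr).$$

Courcelle’s theorem turns any such formula into a linear-time algorithm on graphs of bounded treewidth, and its optimization extension also covers objectives like minimizing |S|.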
And yet, today’s LLMs — trained on oceans of code and mathematics — collapse under this structured reasoning task.
What Makes FormulaOne So Hard?
A few reasons:
1. State Design is an Art, Not Just Code
Solving MSO problems via dynamic programming on tree decompositions demands crafting a DP state that captures just enough information — no more, no less. For instance, solving a Dominating Set problem might involve tracking whether each vertex is in the set, dominated, or still in need of domination. Miss a subtle interaction, and the algorithm fails.
Table: Typical DP State Components for MSO-based Problems
| Component | Meaning |
|---|---|
| In | Vertex is included in the solution set |
| Out-Dominated | Not included, but dominated |
| Out-Needs-Domination | Not included, still needs domination |
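To make the state design tangible, here is a minimal sketch of that three-state DP, written for a plain rooted tree rather than a general tree decomposition (the simplest case). The identifiers are illustrative, not drawn from the benchmark:

```python
# Minimum Dominating Set by DP on a rooted tree, using the three states above.
# Illustrative sketch only; FormulaOne-style problems run such DPs over tree
# decompositions, where introduce/forget/join nodes make the bookkeeping harder.
import sys

IN, OUT_DOM, OUT_NEEDS = 0, 1, 2  # in the set / out but dominated / out, not yet dominated

def min_dominating_set(tree: dict[int, list[int]], root: int = 0) -> int:
    """tree: adjacency lists of an undirected tree, vertices 0..n-1."""
    sys.setrecursionlimit(100_000)

    def solve(v: int, parent: int) -> tuple[int, int, int]:
        subs = [solve(c, v) for c in tree[v] if c != parent]

        # v in the set: v dominates every child, so each child may take any state.
        dp_in = 1 + sum(min(s) for s in subs)

        # v out and still undominated: no child is IN, and no child may rely on v,
        # so every child must already be OUT_DOM.
        dp_needs = sum(s[OUT_DOM] for s in subs)

        # v out but dominated: at least one child must be IN.
        dp_dom = sum(min(s[IN], s[OUT_DOM]) for s in subs)
        if not any(s[IN] <= s[OUT_DOM] for s in subs):
            # force the cheapest child into the set (infeasible for a leaf)
            dp_dom += min((s[IN] - s[OUT_DOM] for s in subs), default=float("inf"))

        return dp_in, dp_dom, dp_needs

    res = solve(root, -1)
    return min(res[IN], res[OUT_DOM])  # the root cannot be left undominated

# Example: the path 0-1-2-3 has a minimum dominating set of size 2 (e.g. {1, 3}).
print(min_dominating_set({0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}))  # -> 2
```

The same three labels reappear in the tree-decomposition version, but there an entire bag of vertices carries a state vector, and the transitions must be argued correct separately for introduce, forget, and join nodes.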
2. Geometric and Temporal Complexity
LLMs consistently struggle with how partial solutions evolve across the tree decomposition. Many errors stem from:
- Prematurely finalizing states (forgetting too soon)
- Failing to account for how future nodes might merge disjoint components
- Double-counting solutions in join operations
This isn’t just an implementation problem. It’s a reasoning bottleneck.
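As a rough illustration of where that bottleneck bites, here is a hedged sketch of the join step for a Dominating Set DP on a nice tree decomposition (names and data layout are my own, not the benchmark’s):

```python
# Join node: both children share the same bag. Two of the traps listed above
# appear explicitly here: bag vertices chosen into the set are paid for by both
# child tables (subtract them once), and an OUT vertex counts as dominated in
# the parent if *either* child already dominates it.
IN, OUT_DOM, OUT_NEEDS = 0, 1, 2
INF = float("inf")

def merge_vertex(l, r):
    """Combine one vertex's state from the two children; None if incompatible."""
    if (l == IN) != (r == IN):
        return None                       # set membership must agree on the bag
    if l == IN:
        return IN
    return OUT_DOM if OUT_DOM in (l, r) else OUT_NEEDS

def join(left, right):
    """left/right: dicts mapping bag-state tuples to the minimum cost so far."""
    joined = {}
    for ls, lcost in left.items():
        for rs, rcost in right.items():
            merged = tuple(merge_vertex(a, b) for a, b in zip(ls, rs))
            if None in merged:
                continue
            overlap = sum(1 for s in merged if s == IN)
            cost = lcost + rcost - overlap   # undo the double count of bag vertices
            if cost < joined.get(merged, INF):
                joined[merged] = cost
    return joined
```

Getting either correction wrong produces exactly the failure modes listed above: double-counted solutions, or partial solutions finalized before a later merge could still change them.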
3. Combinatorial Traps
Some problems combine multiple logical constraints — like ensuring a graph is a cograph and bipartite. The naive approach? Combine two large state machines. The smarter approach? Recognize a hidden lemma: a cograph is bipartite iff it’s triangle-free. Human researchers know this. Models don’t — yet.
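For the curious, a compact argument for that lemma (my gloss, not spelled out in the article): cographs are perfect graphs, so their chromatic number equals their clique number, and the rest follows:

$$G \text{ triangle-free} \;\Rightarrow\; \omega(G) \le 2 \;\Rightarrow\; \chi(G) = \omega(G) \le 2 \;\Rightarrow\; G \text{ bipartite},$$

while the converse (bipartite implies triangle-free) holds for every graph. Spotting this is what lets a solver swap the bipartiteness machinery for a leaner triangle-freeness check.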
Why This Matters (Especially for RL and Theory)
FormulaOne isn’t just a benchmark. It’s a generative environment. Thanks to its MSO foundation, it can produce unlimited, verifiable, research-grade problems with guaranteed solutions — ideal for building Reinforcement Learning with Verifiable Rewards (RLVR) systems.
If an RL agent makes progress on FormulaOne, it’s not just scoring points. It’s discovering algorithmic shortcuts. It’s inching toward real scientific discovery. And if it ever surpasses known SETH-optimal algorithms, it would shake the foundations of computational complexity.
That’s not just a benchmark. That’s a proving ground.
Final Lap: Implications and Challenges Ahead
The lesson from FormulaOne is clear: LLMs are phenomenal interpolators, but not yet true reasoners. They thrive on seen patterns. They struggle when reasoning must be structured, recursive, and parsimonious.
The benchmark’s modular DSL, test suite diversity, and built-in evaluation framework make it a rare tool: both challenging and practical. It’s an invitation — not just to frontier model developers, but to algorithm designers, complexity theorists, and RL researchers.
The race for real AI reasoning has barely begun. FormulaOne sets the track.
Cognaptus: Automate the Present, Incubate the Future