In the world of AI benchmarks, most roads lead to flashy competitions: solving coding puzzles, climbing Codeforces ratings, or passing Olympiad-level problems. But a new benchmark — FormulaOne — changes the race. It doesn’t ask, “Can you win a medal?” It asks, “Can you think like a researcher?”
And the answer from today’s frontier LLMs? A resounding no.
From Codeforces Champs to Research Rookies
The authors of FormulaOne strip away the glitz of competitive programming and delve into something far more consequential: research-grade algorithmic problems grounded in Monadic Second-Order (MSO) logic over graphs. These aren’t out-of-distribution visual puzzles like ARC. They’re in-distribution, theoretically tractable problems designed with precision to demand multi-step symbolic reasoning, mathematical insight, and clean implementation.
The result? GPT-4o, Gemini 2.5, and Grok 4 Heavy, all models celebrated for their prowess in programming contests, solve less than 1% of the 120 problems in the FormulaOne dataset despite generous prompting and multiple attempts. That’s not a performance gap. It’s a chasm.
A Benchmark Rooted in Theory and Reality
FormulaOne’s problems emerge from a powerful framework: MSO logic combined with dynamic programming over tree decompositions. This is not an artificial game. These techniques underpin real-world optimization tasks like:
- Traffic routing
- Power grid scheduling
- Network design
The benchmark draws directly from Courcelle’s theorem, which guarantees that MSO-definable problems can be solved efficiently on graphs with bounded treewidth. Crucially, many of these problems are also tied to fine-grained complexity theory, specifically the Strong Exponential Time Hypothesis (SETH). That means solving them faster than known algorithms could overturn major theoretical assumptions — a holy grail in algorithm design.
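To make “MSO-definable” concrete, here is one illustrative formula (my example, not taken from the benchmark’s own DSL): Dominating Set asks for a vertex set S such that every vertex is either in S or adjacent to a vertex in S, which in MSO over graphs reads

$$\exists S\;\forall v\;\bigl(\,v \in S \;\lor\; \exists u\,(E(u,v) \land u \in S)\,\bigr).$$

Courcelle’s theorem turns any such formula into a linear-time algorithm on graphs of bounded treewidth, and its optimization extension also covers objectives like minimizing |S|.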
And yet, today’s LLMs — trained on oceans of code and mathematics — collapse under this structured reasoning task.
What Makes FormulaOne So Hard?
A few reasons:
1. State Design is an Art, Not Just Code
Solving MSO problems via dynamic programming on tree decompositions demands crafting a DP state that captures just enough information — no more, no less. For instance, solving a Dominating Set problem might involve tracking whether each vertex is in the set, dominated, or still in need of domination. Miss a subtle interaction, and the algorithm fails.
Table: Typical DP State Components for MSO-based Problems
| Component | Meaning |
|---|---|
| In | Vertex is included in the solution set |
| Out-Dominated | Not included, but dominated |
| Out-Needs-Domination | Not included, still needs domination |
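To make the state design tangible, here is a minimal sketch of that three-state DP, written for a plain rooted tree rather than a general tree decomposition (the simplest case). The identifiers are illustrative, not drawn from the benchmark:

```python
# Minimum Dominating Set by DP on a rooted tree, using the three states above.
# Illustrative sketch only; FormulaOne-style problems run such DPs over tree
# decompositions, where introduce/forget/join nodes make the bookkeeping harder.
import sys

IN, OUT_DOM, OUT_NEEDS = 0, 1, 2  # in the set / out but dominated / out, not yet dominated

def min_dominating_set(tree: dict[int, list[int]], root: int = 0) -> int:
    """tree: adjacency lists of an undirected tree, vertices 0..n-1."""
    sys.setrecursionlimit(100_000)

    def solve(v: int, parent: int) -> tuple[int, int, int]:
        subs = [solve(c, v) for c in tree[v] if c != parent]

        # v in the set: v dominates every child, so each child may take any state.
        dp_in = 1 + sum(min(s) for s in subs)

        # v out and still undominated: no child is IN, and no child may rely on v,
        # so every child must already be OUT_DOM.
        dp_needs = sum(s[OUT_DOM] for s in subs)

        # v out but dominated: at least one child must be IN.
        dp_dom = sum(min(s[IN], s[OUT_DOM]) for s in subs)
        if not any(s[IN] <= s[OUT_DOM] for s in subs):
            # force the cheapest child into the set (infeasible for a leaf)
            dp_dom += min((s[IN] - s[OUT_DOM] for s in subs), default=float("inf"))

        return dp_in, dp_dom, dp_needs

    res = solve(root, -1)
    return min(res[IN], res[OUT_DOM])  # the root cannot be left undominated

# Example: the path 0-1-2-3 has a minimum dominating set of size 2 (e.g. {1, 3}).
print(min_dominating_set({0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}))  # -> 2
```

The same three labels reappear in the tree-decomposition version, but there an entire bag of vertices carries a state vector, and the transitions must be argued correct separately for introduce, forget, and join nodes.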
2. Geometric and Temporal Complexity
LLMs consistently struggle with how partial solutions evolve across the tree decomposition. Many errors stem from:
- Prematurely finalizing states (forgetting too soon)
- Failing to account for how future nodes might merge disjoint components
- Double-counting solutions in join operations
This isn’t just an implementation problem. It’s a reasoning bottleneck.
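As a rough illustration of where that bottleneck bites, here is a hedged sketch of the join step for a Dominating Set DP on a nice tree decomposition (names and data layout are my own, not the benchmark’s):

```python
# Join node: both children share the same bag. Two of the traps listed above
# appear explicitly here: bag vertices chosen into the set are paid for by both
# child tables (subtract them once), and an OUT vertex counts as dominated in
# the parent if *either* child already dominates it.
IN, OUT_DOM, OUT_NEEDS = 0, 1, 2
INF = float("inf")

def merge_vertex(l, r):
    """Combine one vertex's state from the two children; None if incompatible."""
    if (l == IN) != (r == IN):
        return None                       # set membership must agree on the bag
    if l == IN:
        return IN
    return OUT_DOM if OUT_DOM in (l, r) else OUT_NEEDS

def join(left, right):
    """left/right: dicts mapping bag-state tuples to the minimum cost so far."""
    joined = {}
    for ls, lcost in left.items():
        for rs, rcost in right.items():
            merged = tuple(merge_vertex(a, b) for a, b in zip(ls, rs))
            if None in merged:
                continue
            overlap = sum(1 for s in merged if s == IN)
            cost = lcost + rcost - overlap   # undo the double count of bag vertices
            if cost < joined.get(merged, INF):
                joined[merged] = cost
    return joined
```

Getting either correction wrong produces exactly the failure modes listed above: double-counted solutions, or partial solutions finalized before a later merge could still change them.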
3. Combinatorial Traps
Some problems combine multiple logical constraints — like ensuring a graph is a cograph and bipartite. The naive approach? Combine two large state machines. The smarter approach? Recognize a hidden lemma: a cograph is bipartite iff it’s triangle-free. Human researchers know this. Models don’t — yet.
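For the curious, a compact argument for that lemma (my gloss, not spelled out in the article): cographs are perfect graphs, so their chromatic number equals their clique number, and the rest follows:

$$G \text{ triangle-free} \;\Rightarrow\; \omega(G) \le 2 \;\Rightarrow\; \chi(G) = \omega(G) \le 2 \;\Rightarrow\; G \text{ bipartite},$$

while the converse (bipartite implies triangle-free) holds for every graph. Spotting this is what lets a solver swap the bipartiteness machinery for a leaner triangle-freeness check.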
Why This Matters (Especially for RL and Theory)
FormulaOne isn’t just a benchmark. It’s a generative environment. Thanks to its MSO foundation, it can produce unlimited, verifiable, research-grade problems with guaranteed solutions — ideal for building Reinforcement Learning with Verifiable Rewards (RLVR) systems.
If an RL agent makes progress on FormulaOne, it’s not just scoring points. It’s discovering algorithmic shortcuts. It’s inching toward real scientific discovery. And if it ever surpasses known SETH-optimal algorithms, it would shake the foundations of computational complexity.
That’s not just a benchmark. That’s a proving ground.
Final Lap: Implications and Challenges Ahead
The lesson from FormulaOne is clear: LLMs are phenomenal interpolators, but not yet true reasoners. They thrive on seen patterns. They struggle when reasoning must be structured, recursive, and parsimonious.
The benchmark’s modular DSL, test suite diversity, and built-in evaluation framework make it a rare tool: both challenging and practical. It’s an invitation — not just to frontier model developers, but to algorithm designers, complexity theorists, and RL researchers.
The race for real AI reasoning has barely begun. FormulaOne sets the track.
Cognaptus: Automate the Present, Incubate the Future