Red Flag on the Track: Why LLMs Still Struggle with Real Algorithmic Reasoning

TL;DR for operators

FormulaOne is a useful red flag because it tests something many businesses quietly assume LLMs already possess: the ability to design deep algorithms, not merely write plausible code around familiar patterns.¹

The benchmark contains 120 hard dynamic-programming problems on tree-like graphs, plus 100 easier FormulaOne-Warmup problems. The hard tasks are generated from Monadic Second-Order logic, come with verifiable evaluation, and sit near the kind of combinatorial reasoning used in routing, scheduling, network design and other optimisation-heavy domains.

The headline result is brutal. On the hard FormulaOne set, o3, o3-Pro and Gemini 2.5 Pro each solve 1 out of 120 problems. Grok 4 Heavy solves 0. This is after the authors give the models a detailed scaffold, few-shot examples, maximum reasoning settings, generous token budgets and an evaluation framework that handles input parsing, graph representation and tree-decomposition traversal.

The operational lesson is not “LLMs cannot code”. That would be lazy, and we are trying not to become a LinkedIn comment section. The lesson is narrower and more important: frontier models still struggle with expert state design, future-aware reasoning, local-to-global consistency and canonical counting. These are exactly the skills needed when an optimisation problem stops looking like a coding exercise and starts looking like an actual systems problem.

For businesses, FormulaOne argues for a conservative deployment pattern: use LLMs to assist algorithm engineering, generate candidate implementations, explain code, search design alternatives and write test harnesses. Do not treat them as unsupervised designers of novel optimisation logic unless the surrounding system includes formal constraints, verifiers, systematic search and human review.

The real task is not writing Python; it is deciding what Python must remember

Imagine a logistics team asking an AI system to improve routing across warehouses, drivers, time windows and unreliable road conditions. The hard part is not printing syntactically valid Python. That is the clerical layer. The hard part is deciding what information must be carried forward as the algorithm moves through a complex structure.

FormulaOne attacks exactly that layer.

The benchmark is built around dynamic programming on graphs with low treewidth. In plain language, the graph is processed through a sequence of small “windows”, called bags, organised as a tree decomposition. At each step, the algorithm sees only a local portion of the graph. It must summarise what has already happened in a compact state, update that state when a vertex is introduced or forgotten, merge states when two branches join, and finally extract the global answer.

This sounds almost bureaucratic. Introduce. Forget. Join. Extract. A tidy little office workflow for graphs.

Then reality arrives, wearing steel-toed boots.

A correct dynamic-programming state must be rich enough to preserve every fact that may matter later, but small enough to remain computationally tractable. Store too little, and future constraints become impossible to enforce. Store too much, and the state space explodes or, worse, the algorithm counts the same object multiple times under different witness labels. The central question is not “can the model write a loop?” It is:

What does the algorithm need to know about the past and present so that it can make valid choices in the future?

FormulaOne is difficult because that question has to be answered precisely, repeatedly and under interacting graph-theoretic constraints.

MSO gives the benchmark its depth, not just its notation

FormulaOne’s problems come from Monadic Second-Order logic on graphs. MSO is a formal language that can express many graph properties by quantifying over vertices, edges and sets. The paper uses this framework because it gives the benchmark three useful properties.

First, the task family is broad. MSO can express familiar graph problems such as colouring, domination and independent sets, but also more unusual properties involving forbidden subgraphs, connectivity, topological structure, maximality and local invariants.

Second, the tasks are not arbitrary puzzle bait. Courcelle’s theorem says, roughly, that MSO-definable graph problems can be solved efficiently on graphs of bounded treewidth. That matters because the benchmark’s problems are hard to discover but guaranteed to have efficient dynamic-programming solutions under the supplied decomposition. The models are not being asked to perform magic. They are being asked to find the right machinery.

Third, the family can be scaled. The authors present FormulaOne as a step toward a large supply of verifiable, high-depth algorithmic tasks for Reinforcement Learning with Verifiable Rewards. The phrase sounds like grant-proposal seasoning, but the underlying point is real: if you can generate difficult problems and automatically check solutions, you can build training environments that reward actual algorithmic progress rather than confident explanation cosplay.

The current release is deliberately narrower than the full universe of MSO. It focuses on Weighted Model Counting objectives, where the model must compute the aggregate weight of all satisfying sets, usually modulo $10^9 + 7$, or return $-1$ if no feasible set exists. The paper notes that optimisation objectives, other counting variants, MSO2 extensions, alternative graph parameters and multi-pass traversal strategies could all broaden the environment later.

That matters for interpretation. FormulaOne is not claiming to exhaust algorithmic reasoning. It is showing that one principled, theoretically grounded slice of algorithmic reasoning is already enough to make top models stall.

Tree decompositions turn global problems into local windows

The best way to understand the benchmark is not to stare at the model scorecard first. The scorecard is the autopsy result. The mechanism of failure begins with the tree decomposition.

A tree decomposition breaks a graph into overlapping bags of vertices. The algorithm traverses these bags bottom-up. At each bag, it processes a local view of the graph and stores a table of profiles. Each profile represents a class of partial solutions that look identical from the current bag.

For a very simple problem such as Dominating Set, a workable profile might track whether each visible vertex is:

Vertex status	Meaning
`in`	The vertex is included in the emerging solution set.
`out-dominated`	The vertex is outside the set but already dominated by a selected neighbour.
`out-needs-domination`	The vertex is outside the set and still needs domination later.

That is the friendly version. It is the graph-algorithm equivalent of learning to parallel park in an empty car park.

FormulaOne quickly moves beyond that. Consider a connected induced subgraph with at least $k$ vertices. A model has to track connectivity among visible selected vertices, cap the cardinality once the threshold is reached, remember whether a component has detached from the current bag, and merge partitions correctly at join nodes. It must avoid discarding states that look disconnected locally but may connect through future vertices. It must also avoid double-counting vertices that appear in both branches of a join.

None of this is mystical. It is exactly the ordinary brutality of real algorithm design: local summaries must preserve global truth.

The benchmark exposes four failure modes that ordinary coding tests usually hide

The paper’s observed failure modes are more useful than the headline percentage because they explain what kind of reasoning breaks.

Failure mode	What goes wrong	Why it matters operationally
Premature finalisation	The model makes irreversible decisions about forgotten vertices based on neighbours whose future roles may still change.	In planning systems, this maps to committing too early before downstream constraints have been resolved.
Incomplete geometric reasoning	The model misses ways a fixed subgraph pattern can form across different branches of a tree decomposition.	In network design, local interactions can combine into global failures; missing configurations is not a cosmetic bug.
Local-to-global errors	The model enforces local rules but fails to assemble the intended global structure.	Many enterprise optimisation failures look exactly like this: each module passes, the whole system does not.
Non-canonical state representation	The model tracks auxiliary witness information that creates multiple representations of the same object, causing overcounting.	Counting, risk aggregation and scenario enumeration all become unreliable if identity and equivalence are mishandled.

The cograph example in the paper illustrates the geometric problem neatly. A cograph is a graph with no induced path of length three. To detect violations during a tree-decomposition traversal, the state may need to remember whether selected vertices have forgotten selected neighbours, whether pairs share common forgotten neighbours, and whether length-two paths have already formed through forgotten vertices. At join nodes, a forbidden path can appear only after two partial worlds are merged.

This is not a place where “try again with a better prompt” sounds especially dignified.

The paper’s bipartite-cograph example adds a second layer. A naive approach might combine a cograph state machine with a bipartiteness state machine. A sharper approach uses the theorem-level insight that a cograph is bipartite if and only if it is triangle-free. Because every clique appears inside some bag of a tree decomposition, checking triangles becomes a much smaller addition to the cograph machinery.

That is the distinction FormulaOne is testing: not whether the model knows a word like “bipartite”, but whether it can recognise when a mathematical simplification changes the state design.

The authors removed several easy excuses before reporting the score

The experimental setup is important because the benchmark is not merely asking models to solve raw contest statements from scratch.

The authors provide a purpose-built Python evaluation environment. It handles input parsing, graph representation and the traversal of the tree decomposition. The model only needs to implement five callback functions:

Callback	Role in the dynamic programme
`leaf_callback`	Initialise the DP table at a leaf node.
`introduce_callback`	Update the state when a vertex enters the bag.
`forget_callback`	Update the state when a vertex leaves the bag.
`join_callback`	Merge two previously processed branches at a shared bag.
`extract_solution`	Compute the final answer from the root table.

The prompt also includes a general description of the setting, the specific problem statement, guidelines discouraging attempts to bypass the framework, expected fixed-parameter linear complexity, a local graph-query utility class and three diverse few-shot example solutions.

The evaluation is also not a single fragile unit test. The framework includes several test types:

Test type	Likely purpose	What it supports	What it does not prove
Correctness tests	Main evidence for whether the DP logic computes the right answer on small graphs.	Catches direct semantic errors against brute-force verifiers.	Does not prove efficiency at scale.
Consistency tests	Robustness test across different tree decompositions of the same graph.	Checks whether the algorithm depends improperly on decomposition artefacts.	Does not guarantee all edge cases are covered.
Efficiency tests	Stress test for fixed-parameter linear behaviour.	Detects bloated states or inefficient joins that fail under larger inputs.	Does not rank all possible asymptotic refinements.
Sporadic exotic tests	Sensitivity probe using dense small graphs with many gadgets.	Exposes missed local configurations.	Does not by itself explain why the state design failed.
Appendix maximal-cluster solution	Implementation-detail demonstration of reasoning depth.	Shows the number of interacting obligations a correct solution may need.	It is not a separate empirical result.
Label-based performance breakdown	Exploratory diagnostic extension.	Suggests which skill categories remain difficult.	It should not be read as causal attribution from labels alone.

This matters because FormulaOne is not mainly measuring whether models can parse ugly input formats. The authors deliberately strip away much of that noise. The remaining problem is state design and transition logic.

Naturally, that is where the models fall over.

The hard-set result is not a dip; it is a cliff

On the 120-problem FormulaOne hard set, the evaluated frontier reasoning models perform as follows:

Model	Sampling setting	Problems solved	Success rate
o3 High	@10	1 / 120	0.8%
o3-Pro High	@1	1 / 120	0.8%
Gemini 2.5 Pro	@10	1 / 120	0.8%
Grok 4 Heavy	@1	0 / 120	0.0%

The sampling settings are not identical. o3 and Gemini receive ten independent completions per problem. o3-Pro and Grok 4 Heavy are evaluated with one completion because, according to the paper, those systems already aggregate multiple samples internally. Grok 4 Heavy also had internet access, unlike the others, due to technical constraints. None of this rescues the result.

The more interesting comparison is FormulaOne-Warmup. On the 100 easier problems, performance rises sharply:

Model	FormulaOne-Warmup problems solved
Gemini 2.5 Pro	62 / 100
o3 High	54 / 100
o3-Pro High	37 / 100
Grok 4 Heavy	28 / 100

This warmup result is important because it rules out a simplistic interpretation. The models are not completely unable to work with MSO-flavoured graph DP. They can solve many entry-level instances. The cliff appears when tasks require deeper combinations of state design, lookahead, connectivity, topology, global structure and canonical counting.

That is precisely the worrying part. Many business processes are not hard because each local rule is complex. They are hard because many ordinary rules interact over time.

The appendix is the quiet warning shot

The appendix walks through a full solution for a maximal cluster graph problem. It is not there as decorative mathematical furniture. It shows why “just implement the DP” is a rather large “just”.

A cluster graph is a disjoint union of cliques. The task is to sum weights of vertex sets whose induced subgraph is a cluster graph and is maximal with respect to inclusion. The paper uses the fact that cluster graphs are exactly graphs with no induced $P_3$, a path on three vertices.

The state must track, for each visible vertex, whether it is selected. For selected vertices, it must track whether they have a forgotten selected neighbour and whether they carry an outstanding obligation. For non-selected vertices, it must track whether their exclusion is already “safe”, meaning that adding them would create a forbidden induced $P_3$.

Then the transitions get delicate. When a selected vertex is introduced, the algorithm must check whether it creates a forbidden $P_3$ with visible or forgotten information. When a non-selected vertex is forgotten without being safe, it may need to place future obligations on selected neighbours. At a join node, obligations from one branch may be discharged by forgotten-neighbour information from the other branch.

This is the kind of reasoning that does not look dramatic in a benchmark table. It looks like small flags, bits and profile merges. But these are the machinery of correctness. Miss one obligation, and the answer is wrong. Add one non-canonical witness, and the count is wrong. Carry exact counts when a capped threshold would suffice, and the runtime quietly bloats from elegant to embarrassing.

What the paper shows, what Cognaptus infers, and what remains uncertain

FormulaOne is most useful when its claims are kept in their proper boxes.

Layer	Statement
What the paper directly shows	Four frontier reasoning models perform extremely poorly on 120 hard MSO-based dynamic-programming tasks, despite scaffolding, few-shot examples and verifiable evaluation.
What the mechanism suggests	The failure is concentrated around expert state design: lookahead, geometric coverage across tree decompositions, local-to-global consistency and canonical counting.
What Cognaptus infers for business use	LLMs should not be treated as autonomous designers of novel optimisation algorithms for high-stakes systems without verifiers, search, tests and expert review.
What remains uncertain	FormulaOne is a specialised benchmark. It does not quantify model performance across every logistics, scheduling, engineering or software-maintenance task.
What would change the conclusion	Evidence that models can reliably construct correct states and transitions across unseen, high-depth optimisation families, not merely pass more coding contests.

For an operator, the key distinction is between code production and algorithm invention.

LLMs can be valuable in optimisation-heavy work. They can translate requirements into prototypes, generate baselines, write brute-force checkers, suggest data structures, explain existing algorithms, create adversarial tests and help engineers inspect failure cases. That is already useful.

But FormulaOne warns against a more ambitious and riskier assumption: that a frontier model with a strong contest-coding reputation can be handed a novel graph-optimisation problem and trusted to design the correct dynamic programme. The benchmark says: not yet, and not by a small margin.

The business risk is invisible because the output still looks technical

A failed LLM answer to a graph DP problem does not necessarily look foolish. It may define plausible states, use serious terminology and produce a neatly structured implementation. That is what makes the failure operationally dangerous.

In business settings, the equivalent failure mode is a planning system that produces a schedule, route, pricing rule or resource-allocation policy that appears coherent but violates a hidden global constraint. The local checks pass. The dashboard is tidy. The model explains itself in full sentences. Somewhere in the join operation, metaphorically speaking, reality has been counted twice.

The practical response is not to ban LLMs from algorithmic workflows. It is to change the architecture around them.

A sensible deployment pattern looks more like this:

Workflow role	Good use of LLMs	Required guardrail
Problem translation	Convert business constraints into formal candidate specifications.	Human validation of the formal objective and constraints.
Algorithm ideation	Suggest multiple state designs or decompositions.	Comparative review against known algorithmic templates and edge cases.
Implementation	Generate callback logic, helper functions and test scaffolds.	Automated correctness, consistency and stress testing.
Debugging	Explain failed cases and propose repair hypotheses.	Regression tests over adversarial examples.
Search and optimisation	Explore candidate variants under a verifier.	Verifiable reward, not self-reported confidence.

The point is simple: the verifier should be more trusted than the prose. Always a useful rule, in AI and in procurement.

The RLVR angle is stronger than the benchmark headline

The most interesting long-run contribution of FormulaOne may not be the current model scorecard. Scorecards age quickly. Benchmarks saturate. Leaderboards become public relations in spreadsheet form.

The more durable idea is the environment design.

Because MSO can generate a broad family of graph problems, and because candidate solutions can be checked with verifiers and generated tests, FormulaOne points toward an RLVR setting with real depth. That is different from reward environments where the task is shallow, the dataset is static or the reward checks only superficial correctness.

The authors are careful to note that FormulaOne is only a fraction of what the framework could produce. They mention extensions across richer MSO fragments, optimisation objectives rather than only Weighted Model Counting, different traversal strategies, different graph families and parameters beyond treewidth.

For model developers, this is the attractive part: a way to train against problems where correctness is objective but the reasoning is not trivial. For businesses, the lesson is indirect but important. The next generation of useful AI systems for optimisation will probably look less like chatbots with larger context windows and more like hybrid search-and-verification systems trained inside environments that punish shallow reasoning.

A model that can explain a dynamic programme is nice. A system that can propose one, test it, repair it and prove that it respects the relevant invariants is far more interesting.

Boundaries: FormulaOne is hard evidence, not a universal condemnation

The benchmark’s implications are sharp, but they are not infinite.

FormulaOne focuses on MSO-definable graph problems over supplied tree decompositions, mainly with Weighted Model Counting objectives. That is a meaningful and theoretically grounded domain, but it is not the whole of enterprise optimisation. A warehouse scheduling tool, a power-grid optimiser or a network-planning platform may use different mathematical structures, heuristics, decomposition methods and operational constraints.

The evaluation also tests models under a particular scaffold: implement callbacks inside a prepared Python framework. That is a fair way to isolate state-design reasoning, but it is not the same as testing a full agentic system with retrieval over algorithm libraries, theorem provers, symbolic search, self-debugging loops and human-in-the-loop review.

The paper’s result therefore should not be inflated into “LLMs are useless at algorithms”. They are plainly useful for parts of algorithmic work. Nor should it be softened into “the prompt was probably bad”. The authors make the task substantially easier than raw research work, and the hard-set scores are still effectively flat.

The better conclusion is less theatrical: current frontier models remain brittle when asked to perform deep, novel state design for graph dynamic programming. That is enough to matter.

Conclusion: the red flag is about depth, not syntax

FormulaOne is a red flag because it separates two capabilities that are often bundled together in casual AI commentary.

One capability is competitive programming performance: recognising known patterns, writing code quickly, handling contest-style constraints and producing executable solutions. Frontier models have made impressive progress there.

The other capability is research-grade algorithmic reasoning: deciding what must be remembered, what can be forgotten, how local views merge into global structure, when a theorem collapses a large state into a smaller one, and how to count each object exactly once. FormulaOne shows that this second capability is still weak.

That distinction matters more than another leaderboard percentage. Businesses do not need AI systems that sound like clever contestants. They need systems that preserve constraints when the problem becomes messy, recursive and expensive to get wrong.

The race is not over. But FormulaOne has waved the red flag for anyone pretending the current cars can already handle every corner.

Cognaptus: Automate the Present, Incubate the Future.

Gal Beniamini et al., “FormulaOne: Measuring the Depth of Algorithmic Reasoning Beyond Competitive Programming,” arXiv:2507.13337, 2025. https://arxiv.org/pdf/2507.13337 ↩︎

TL;DR for operators#

The real task is not writing Python; it is deciding what Python must remember#

MSO gives the benchmark its depth, not just its notation#

Tree decompositions turn global problems into local windows#

The benchmark exposes four failure modes that ordinary coding tests usually hide#

The authors removed several easy excuses before reporting the score#

The hard-set result is not a dip; it is a cliff#

The appendix is the quiet warning shot#

What the paper shows, what Cognaptus infers, and what remains uncertain#

The business risk is invisible because the output still looks technical#

The RLVR angle is stronger than the benchmark headline#

Boundaries: FormulaOne is hard evidence, not a universal condemnation#

Conclusion: the red flag is about depth, not syntax#