
Red Flag on the Track: Why LLMs Still Struggle with Real Algorithmic Reasoning

TL;DR for operators: FormulaOne is a useful red flag because it tests something many businesses quietly assume LLMs already possess: the ability to design deep algorithms, not merely write plausible code around familiar patterns.[1] The benchmark contains 120 hard dynamic-programming problems on tree-like graphs, plus 100 easier FormulaOne-Warmup problems. The hard tasks are generated from Monadic Second-Order logic, come with verifiable evaluation, and sit near the kind of combinatorial reasoning used in routing, scheduling, network design and other optimisation-heavy domains. ...

July 18, 2025 · 17 min · Zelina