Red Flag on the Track: Why LLMs Still Struggle with Real Algorithmic Reasoning
TL;DR for operators
FormulaOne is a useful red flag because it tests something many businesses quietly assume LLMs already possess: the ability to design deep algorithms, not merely write plausible code around familiar patterns.[1] The benchmark contains 120 hard dynamic-programming problems on tree-like graphs, plus 100 easier FormulaOne-Warmup problems. The hard tasks are generated from Monadic Second-Order logic, come with verifiable evaluation, and sit near the kind of combinatorial reasoning used in routing, scheduling, network design and other optimisation-heavy domains. ...
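To make "dynamic programming on tree-like graphs" concrete, here is a minimal sketch of the simplest member of that family: maximum-weight independent set on a plain tree. It is a textbook illustration, not a FormulaOne task, and the function name, input format and choice of problem are assumptions made for this example. The benchmark's hard problems should be expected to need far richer per-node state than the two values tracked here.

```python
from collections import defaultdict


def max_weight_independent_set(n, edges, weight):
    """Return the largest total weight of pairwise non-adjacent vertices.

    Assumes the input is a tree on vertices 0..n-1 with n >= 1
    (illustrative helper, not part of any benchmark harness).
    """
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    # Iterative DFS from vertex 0: every parent appears before its children.
    parent = [-1] * n
    seen = [False] * n
    seen[0] = True
    order = []
    stack = [0]
    while stack:
        u = stack.pop()
        order.append(u)
        for v in adj[u]:
            if not seen[v]:
                seen[v] = True
                parent[v] = u
                stack.append(v)

    take = list(weight)  # best value in u's subtree when u is selected
    skip = [0] * n       # best value in u's subtree when u is not selected

    # Reverse order visits children before parents, so each child's
    # two values are final when they are folded into the parent.
    for u in reversed(order):
        p = parent[u]
        if p != -1:
            take[p] += skip[u]                 # a selected parent forbids the child
            skip[p] += max(take[u], skip[u])   # otherwise the child chooses freely
    return max(take[0], skip[0])


if __name__ == "__main__":
    # Path 0-1-2-3 with heavy endpoints: the best choice is both endpoints.
    print(max_weight_independent_set(4, [(0, 1), (1, 2), (2, 3)], [3, 2, 2, 3]))
```

The design point is that each vertex summarises its whole subtree in a fixed number of states. Familiar patterns like this one are easy to reproduce; the harder skill is working out which states are needed when the problem and the graph structure are less standard.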