Thinking Models

A workflow does not fail because the first step is hard. It fails because the seventeenth step is boring, the twenty-third step depends on a slightly wrong state, and by the thirty-first step the agent is confidently building on its own rubbish. Very enterprise. Very scalable. Very expensive. The paper behind this article, The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs, makes a deceptively simple point: judging LLM progress by short-task accuracy can badly understate the value of reliability gains over long workflows.1 A model that improves only slightly on a single step may become dramatically better at completing long sequences without failure. That is not motivational poster mathematics. It is compounding. ...