Chain-of-Thought

Think, Then Do: Why ReAct Turned LLMs into Real Agents

A chatbot answers. An agent checks. That distinction sounds small until a workflow fails at 2:17 p.m. because the model confidently invented a policy clause, skipped the database lookup, and then explained itself with the serene authority of a consultant who has already left the building. The 2022 paper ReAct: Synergizing Reasoning and Acting in Language Models matters because it made that failure mode harder to ignore.[1] It did not simply ask language models to “think step by step.” Chain-of-thought prompting already did that. It did not simply attach a search box to a model. Retrieval-augmented systems were already moving in that direction. The paper’s real contribution was more architectural: it showed that a language model could alternate between reasoning, acting, observing, and revising its next move. ...
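
The alternation described above (reason, act, observe, revise) can be sketched in a few lines. This is only an illustration, not the paper's implementation: `scripted_model`, `lookup`, `POLICY_DB`, and `react_loop` are hypothetical names, and the "model" is a hard-coded stand-in for a real LLM call so that the example runs on its own.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical tool: a tiny in-memory "policy database" (an assumption for this sketch).
POLICY_DB = {"refund_window_days": "30"}


def lookup(key: str) -> str:
    """Return the stored value for a key, or a visible miss instead of a guess."""
    return POLICY_DB.get(key, "NOT FOUND")


TOOLS: Dict[str, Callable[[str], str]] = {"lookup": lookup}


def scripted_model(transcript: List[str]) -> Tuple[str, str, str]:
    """Stand-in for an LLM call. Returns (thought, action, argument)."""
    observations = [line for line in transcript if line.startswith("Observation:")]
    if not observations:
        # No evidence yet: reason, then act by calling a tool.
        return ("I should check the policy instead of guessing.",
                "lookup", "refund_window_days")
    # Evidence is available: revise the plan and finish with a grounded answer.
    value = observations[-1].split(": ", 1)[1]
    return ("The lookup returned a value, so I can answer.",
            "finish", f"Refunds are accepted within {value} days.")


def react_loop(question: str,
               model: Callable[[List[str]], Tuple[str, str, str]],
               tools: Dict[str, Callable[[str], str]],
               max_steps: int = 5) -> Tuple[str, List[str]]:
    """Alternate thought -> action -> observation until the model finishes."""
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        thought, action, arg = model(transcript)
        transcript.append(f"Thought: {thought}")
        if action == "finish":
            transcript.append(f"Answer: {arg}")
            return arg, transcript
        tool = tools.get(action)
        observation = tool(arg) if tool else f"unknown action {action!r}"
        transcript.append(f"Action: {action}[{arg}]")
        # The observation is fed back so the next thought can use it.
        transcript.append(f"Observation: {observation}")
    return "No answer within the step budget.", transcript


if __name__ == "__main__":
    answer, trace = react_loop("What is the refund window?", scripted_model, TOOLS)
    print("\n".join(trace))
```

The design point is the feedback edge: each observation is appended to the transcript before the next reasoning step, so the answer depends on what the tool returned and not only on what the model expected to find.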

Flip the Script: When Causality Breaks the LLM Illusion

A fire alarm can cause people to evacuate. It can cause a building to enter alert mode. It can trigger emergency procedures, bring firefighters, and make everyone suddenly remember where the stairs are. But does a fire alarm cause a fire? Obviously not. At least, obviously not to a human who understands the causal structure. The alarm is usually an effect or signal of fire risk, not the origin of the fire itself. A model trained on enough sentences of the form “fire alarm causes…” may not be so careful. It may see the familiar phrase pattern, complete the familiar answer, and walk directly into the wrong conclusion with excellent grammar. ...

Potential Energy: What Chain-of-Thought Is Really Doing Inside Your LLM

The familiar ritual: ask it to think longer
When an LLM gives a weak answer, the standard reflex is now almost ceremonial: ask it to think step by step. The model writes more. The answer often improves. The benchmark number rises. Everyone feels temporarily reassured. This habit has become so normal that many teams treat chain-of-thought as if it were a small reasoning engine bolted onto the model: more intermediate steps, more deliberate thought, more correctness. A comforting story. Also, like many comforting stories in AI, not quite what the evidence says. ...

Thinking Isn’t Free: Why Chain-of-Thought Hits a Hard Wall

Reasoning budgets look harmless until they become a line item. A user asks an AI system to reconcile a long contract, inspect a transaction trail, trace dependencies in a knowledge graph, or verify whether one operational event can lead to another. The model “thinks.” The answer improves. The invoice also improves, in the less charming direction. The usual response is to ask for shorter reasoning: compress the chain of thought, use fewer tokens, impose a budget, maybe add a prompt that says “be concise,” because apparently invoices can be negotiated with adjectives. ...

Seeing Is Thinking: When Images Do the Reasoning

Paper is a good trap for artificial intelligence. Fold it, punch it, unfold it, and ask where the holes are. A person may not solve the problem instantly, but the mind knows what to do: imagine the folded sheet opening step by step. The reasoning is not mainly verbal. We do not narrate every cell of the paper grid like a bored accountant reading inventory codes. We see the transformation. ...

Pulling the Thread: Why LLM Reasoning Often Unravels

Audit is a less glamorous word than intelligence. That is unfortunate, because most business problems with AI agents do not begin with stupidity. They begin with confidence. The agent gives an answer. The answer sounds reasonable. The explanation sounds even better. A manager, analyst, compliance reviewer, or product owner reads the chain of thought and feels the mild comfort of seeing steps. There is a premise, then a bridge, then a conclusion. Very civilized. Very inspectable. Very possibly fake. ...

Think Before You Sink: Streaming Hallucinations in Long Reasoning

A bad answer is easy to audit. It sits there, smug and wrong. A bad reasoning process is worse. It looks useful while it is drifting. It explains itself. It produces intermediate steps that sound locally plausible. It may even correct one mistake while preserving another, like a spreadsheet with a broken formula hiding behind tasteful formatting. ...

When the Answer Matters More Than the Thinking

Answer. In most business systems, that is the part users actually care about. The approval decision. The risk label. The final invoice category. The recommended next action. The tidy little field that decides whether the workflow moves forward or someone opens a Slack thread titled “Why did the AI say this?” Yet much of modern LLM fine-tuning treats that answer as just another slice of text. Worse, when supervised examples include long chain-of-thought explanations, the final answer may become the shortest and least dominant part of the training objective. The model learns to produce a convincing trail of reasoning, but the tiny destination at the end receives comparatively little optimization pressure. Very elegant. Also slightly absurd. ...
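
A rough back-of-the-envelope calculation shows why the answer can end up with so little optimization pressure. The sketch below assumes a token-averaged loss in which every token contributes about equally, which is a simplification. The function name and the `answer_weight` knob are hypothetical and are not taken from the article; the weight is there only to show how the share shifts if answer tokens are counted more heavily.

```python
def answer_loss_share(reasoning_tokens: int,
                      answer_tokens: int,
                      answer_weight: float = 1.0) -> float:
    """Fraction of a per-token weighted loss that comes from answer tokens.

    Assumes each token carries the same loss before weighting (a simplification).
    """
    weighted_answer = answer_weight * answer_tokens
    return weighted_answer / (reasoning_tokens + weighted_answer)


if __name__ == "__main__":
    # 400 tokens of reasoning followed by a 5-token answer (illustrative numbers).
    print(f"{answer_loss_share(400, 5):.3f}")        # 5 / 405, roughly 1.2% of the loss
    # Hypothetical: count each answer token 20 times as heavily.
    print(f"{answer_loss_share(400, 5, 20.0):.3f}")  # 100 / 500, i.e. 20% of the loss
```

With equal weighting, the five tokens that decide the workflow account for about one percent of the objective, which is the imbalance the paragraph above describes.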

When Reasoning Meets Its Laws: Why More Thinking Isn’t Always Better

The expensive model that thinks less at the wrong moment
Tokens are not wisdom. They are rented time. Anyone who has paid for reasoning-model inference already understands the business version of this problem. A model spends hundreds or thousands of tokens circling a simple question, then compresses a genuinely compound task into a suspiciously neat answer. It looks thoughtful. It may even sound disciplined. But the bill arrives in one column and the error arrives in another. ...

When Tokens Remember: Graphing the Ghosts in LLM Reasoning

Audit is easy when the answer is a single lookup. A customer asks, “What is your refund policy?” The model quotes the policy paragraph. We check whether the quoted paragraph came from the right source. Very civilized. Everyone goes home early. But real enterprise LLM work is rarely that tidy. A compliance assistant reads a contract, extracts obligations, compares them with internal policy, reasons through exceptions, and writes a recommendation. A research assistant reads multiple sources, builds an intermediate summary, then answers a question from that summary. A support agent reads a user history, infers the likely issue, then proposes the next action. In these cases, the final sentence may depend on prompt evidence and on earlier generated text. ...