Audit logs are comforting things. They tell managers that a system took an action, they tell engineers which step fired, and they tell compliance teams that someone, somewhere, has a line of text to point at when the incident review begins.

Now imagine an AI agent inside a business workflow. It has a customer request, a list of available tools, and a visible reasoning trace. The trace says it carefully considered whether to call an API, ask for missing information, or answer directly. It sounds deliberate. It sounds inspectable. It sounds like governance.

A new arXiv paper, “Therefore I am. I Think.”, makes that comfort less comfortable [1]. The paper asks a simple question with unpleasant operational consequences: when a reasoning model decides whether to call a tool, does the visible reasoning produce the decision, or has the decision already been encoded before the first reasoning token appears?

The authors’ answer is not that chain-of-thought is always fake. That would be too neat, and therefore probably wrong. Their narrower claim is more useful: in their tool-use setting, simple probes can detect tool-call decisions from pre-generation activations with very high confidence, and steering those activations can flip later actions. In many flipped cases, the subsequent reasoning does not confess that something has changed. It explains the new action as if it had reasoned its way there all along.

That is the business-relevant part. Not “models are mysterious.” Everyone already bought that ticket. The sharper lesson is that visible reasoning may be a weak audit trail for discrete agent actions. If the decision is already latent before the explanation begins, then the explanation is evidence of what the model says about its action, not necessarily evidence of how the action was chosen.

The mechanism is decision first, narrative second

The paper studies tool-use decisions because they are unusually clean. A model either calls a tool or it does not. That binary structure makes the setting easier to probe than open-ended reasoning quality, where every interpretation immediately wanders into a swamp wearing formal shoes.

The authors use two benchmark settings. The first is NVIDIA’s When2Call benchmark, which focuses on whether a model should call a tool, answer directly, request missing information, or abstain because available tools cannot answer the query. The second is a constructed call-versus-no-call setting using BFCL Irrelevance plus BFCL Simple. The important design choice is that both benchmarks make the action decision interpretable: tool or no tool.

They then run reasoning models and record their hidden states at different points in the generation process. The key position is pre_gen: the model state just before the first reasoning token is generated. Other positions include the beginning of thinking, several percentages through the reasoning trace, the end of thinking, and the first post-reasoning decision token.

The mechanism can be summarized like this:

| Stage | What the paper measures | Why it matters |
| --- | --- | --- |
| Before visible reasoning | Residual-stream activations at pre_gen | Tests whether the action tendency exists before chain-of-thought begins |
| During reasoning | Activations at several positions through the reasoning span | Tests whether visible deliberation changes, weakens, or recovers the signal |
| End of reasoning | Activations near think_end and the decision token | Tests whether the early signal lines up with the final action |
| Intervention | Activation steering at the pre-generation position | Tests whether the signal is merely predictive or partly causal |

This is a mechanism-first paper. The point is not simply that the authors got a high score on a benchmark. The point is the sequence: latent action signal, visible reasoning, final action, post-hoc explanation. Once that sequence is clear, the governance implication becomes less theatrical and more annoying: inspecting the reasoning text may be inspecting the press release, not the board meeting.

Linear probes find the decision before the model starts “thinking”

The first empirical move is deliberately simple. The authors train logistic-regression probes on hidden states to predict whether the model will call a tool or not. They do this across sampled layers and token positions, using 5-fold stratified cross-validation and AUROC as the main metric.

This matters because a simple linear probe is not an exotic decoder with a suspicious number of knobs. If a linear classifier can separate future tool-call and no-tool decisions before reasoning begins, then the relevant information is not buried beyond practical reach. It is sitting in the representation space with the lights on.
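To make "simple linear probe" concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the activations are synthetic Gaussian clusters rather than real residual-stream states, the function names are hypothetical, and it uses a single held-out split where the paper uses 5-fold stratified cross-validation. It only shows the shape of the technique: fit a logistic regression on hidden-state vectors, then score it with AUROC.

```python
import math
import random

def make_synthetic_activations(n, dim, seed=0):
    """Hypothetical stand-in for pre_gen hidden states: two Gaussian clusters."""
    rng = random.Random(seed)
    direction = [rng.gauss(0, 1) for _ in range(dim)]
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    xs, ys = [], []
    for i in range(n):
        label = i % 2  # 1 = model later calls a tool, 0 = it does not
        shift = 1.5 if label else -1.5
        xs.append([rng.gauss(0, 1) + shift * u for u in unit])
        ys.append(label)
    return xs, ys

def sigmoid(z):
    # Numerically stable form for large negative z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def probe_score(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def train_logistic_probe(xs, ys, lr=0.1, epochs=200):
    """Full-batch gradient descent on the logistic loss (a linear probe)."""
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(probe_score(w, b, x)) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def auroc(scores, labels):
    """Probability that a random positive outranks a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

xs, ys = make_synthetic_activations(n=400, dim=16)
w, b = train_logistic_probe(xs[:300], ys[:300])
test_auc = auroc([probe_score(w, b, x) for x in xs[300:]], ys[300:])
print(f"held-out AUROC: {test_auc:.3f}")
```

The design point is how little machinery is involved. There is no hidden layer and no tuning beyond a learning rate, so a high AUROC says something about the representation rather than about the decoder.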

The main result is strong. For Qwen3-4B and GLM-Z1-9B, on both When2Call and BFCL, the authors report pre-generation tool-decision predictability above 90% AUROC in all four main settings, and above 95% in three of them. The paper also reports that predictability drops early in the reasoning trace, around the first 5% to 10%, before recovering toward the end of thinking.

That dip is easy to misread. It does not mean the paper discovered a tiny philosopher inside the model who briefly became unsure and then found wisdom. A more disciplined interpretation is that visible reasoning introduces, verifies, or perturbs intermediate representations. The action signal becomes less linearly readable for a while, then becomes readable again near the end.

The more important result is alignment over time. The pre-generation decision detected by the probe aligns with the decision detected at think_end more than 80% of the time, and the end-of-thinking probe aligns with the model’s realized action with near-perfect accuracy. In plain business English: before the model starts explaining, there is already a strong internal hint of what it will do.
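The agreement figures are simple to compute once per-example predictions exist. A minimal sketch, with made-up labels chosen only to make the arithmetic visible (they are not the paper's data, and the function name is hypothetical):

```python
def agreement_rate(first, second):
    """Fraction of examples on which two label sequences agree."""
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(a == b for a, b in zip(first, second)) / len(first)

# Illustrative labels only: 1 = tool call, 0 = no tool call.
pre_gen_probe   = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]
think_end_probe = [1, 1, 0, 1, 1, 0, 1, 0, 0, 1]
realized_action = [1, 1, 0, 1, 1, 0, 1, 0, 0, 1]

early_vs_late = agreement_rate(pre_gen_probe, think_end_probe)
late_vs_action = agreement_rate(think_end_probe, realized_action)
print(early_vs_late, late_vs_action)
```

Keeping the two comparisons separate matters: the first measures how much the reasoning span moved the detected decision, the second measures how well the final probe tracks behavior.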

The paper’s evidence has a clear role:

| Experiment | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Pre-generation linear probes | Main evidence | Tool-call decisions are linearly decodable before visible reasoning | The model consciously “decides” in any human sense |
| Layer-position heatmaps | Robustness and localization | The pattern appears across sampled layers and positions, strongest in mid-to-late layers | Every layer encodes the same causal variable |
| Agreement curves | Main diagnostic | Early detected decisions often agree with end-of-thinking decisions | Reasoning never changes decisions |
| GPT-OSS-20B appendix results | Supplemental robustness | Similar predictability pattern appears in another model family | The same steering method works for mixture-of-experts models |

That last boundary matters. The authors include supplemental GPT-OSS-20B results, but exclude it from causal steering analysis because its mixture-of-experts architecture would require a different steering technique. Sensible restraint. A rare species in AI papers.

Steering turns a readable signal into a causal suspect

Prediction alone is not enough. A weather forecast predicts rain; it does not cause the rain, unless your meteorologist has become unusually powerful. So the paper’s second move is activation steering.

The authors compute a steering vector from pre-generation activations by comparing mean activations for tool-call examples and no-tool examples. At inference time, they add or subtract this vector at the chosen layer and position. Adding the vector injects tool-call propensity. Subtracting it suppresses tool-call propensity.
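The difference-of-means construction fits in a few lines. The sketch below uses toy three-dimensional vectors in place of real activations, and the `alpha` strength parameter and function names are assumptions for illustration; in a real model the addition would happen inside a forward hook at the chosen layer and position, which is not shown here.

```python
def mean_vector(rows):
    """Component-wise mean of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def steering_vector(tool_acts, no_tool_acts):
    """Mean tool-call activation minus mean no-tool activation."""
    mu_tool = mean_vector(tool_acts)
    mu_none = mean_vector(no_tool_acts)
    return [a - b for a, b in zip(mu_tool, mu_none)]

def steer(hidden, vector, alpha):
    """alpha > 0 injects tool-call propensity, alpha < 0 suppresses it."""
    return [h + alpha * v for h, v in zip(hidden, vector)]

# Toy activations (assumed values, not from any model).
tool_acts = [[1.0, 0.2, 0.0], [0.8, 0.0, 0.2]]
no_tool_acts = [[0.0, 0.2, 0.0], [0.2, 0.0, 0.2]]

v = steering_vector(tool_acts, no_tool_acts)
hidden = [0.1, 0.1, 0.1]
injected = steer(hidden, v, alpha=2.0)
suppressed = steer(hidden, v, alpha=-2.0)
print(v, injected, suppressed)
```

In this toy case the two classes differ only in the first coordinate, so the steering vector is nonzero only there and the other coordinates of `hidden` pass through untouched.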

This is the causal test. If the pre-generation direction is only a passive marker, steering it should not reliably change later behavior. If it is connected to the action mechanism, steering should sometimes flip the final decision.

It does.

Across held-out examples, steering produces flips in both directions, with rates depending heavily on model, benchmark, direction, thinking mode, and steering strength. Across the reported settings, flip rates range from low single digits to 79%. In the clearest high-end case, GLM-Z1-9B on BFCL reaches a 79% injection flip rate at the strongest reported steering level. Qwen3-4B also shows large effects, including high injection flips on When2Call under thinking mode.

The important interpretation is not “steering always works.” It does not. Some decisions resist steering. Some settings are much less movable. GLM-Z1-9B is comparatively resistant on When2Call suppression, while BFCL injection is far more steerable. The practical point is narrower: the pre-generation direction is not merely decorative. Perturbing it can change the action that appears after reasoning.
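For readers who want the metric pinned down, a flip rate can be sketched as follows. One assumption is baked in and should be read as such: a flip is counted only among examples whose baseline action differed from the steering target, so injection is scored on baseline no-tool cases. The paper's exact denominator may differ. The labels and the function name are illustrative.

```python
def flip_rate(baseline_actions, steered_actions, target):
    """Share of eligible examples whose action became `target` under steering.

    Eligible = baseline action was not already `target` (an assumption).
    """
    eligible = [
        steered
        for base, steered in zip(baseline_actions, steered_actions)
        if base != target
    ]
    if not eligible:
        raise ValueError("no examples eligible to flip toward the target")
    return sum(s == target for s in eligible) / len(eligible)

# Illustrative injection run: all baselines abstain, steering pushes toward a call.
baseline = ["no_tool", "no_tool", "no_tool", "no_tool", "no_tool"]
steered = ["tool", "tool", "no_tool", "tool", "no_tool"]
injection_flip_rate = flip_rate(baseline, steered, target="tool")
print(injection_flip_rate)
```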

The paper also includes a useful specificity check. Steering with an unrelated binary decision direction from ProntoQA produces a 0% flip rate across the tested models and benchmarks. That is not a universal proof of causal purity, but it does reduce the worry that any random activation shove would produce similar tool-use flips.

A business reader should read the steering result in three layers:

| Layer | Direct paper result | Business interpretation |
| --- | --- | --- |
| Measurement | A pre-generation tool-call direction can be identified from activations | Some agent actions may have early internal warning signals |
| Intervention | Steering that direction can flip later tool/no-tool behavior | Action selection may be vulnerable before the visible reasoning stage |
| Specificity | An unrelated steering vector produced no flips in the reported check | Internal monitors should be task-specific, not generic “AI confidence” badges |

This is where the paper becomes more than interpretability trivia. If a model’s tool-call decision is partly encoded before visible reasoning, enterprise control should not begin only after the model has written an elegant explanation. At that point, the horse may not only have left the stable; it may have generated a 900-token memo about why leaving was strategically aligned.

Longer reasoning is not necessarily better reasoning

The steering experiments also measure chain-of-thought token inflation. This is where the paper politely attacks one of the lazier assumptions in current AI product design: more reasoning tokens must mean more thinking.

Under steering, many examples produce longer reasoning traces. In the paper’s Table 2, the average steered chain-of-thought length often rises substantially relative to baseline. For example, Qwen3-4B suppression-resistant cases on When2Call show a steered-to-baseline length ratio of 2.30, meaning the model generated more than twice as many reasoning tokens while still resisting the final action flip. GLM-Z1-9B on BFCL suppression-resistant cases shows a ratio of 2.18. In several flipped cases, the ratio also rises, often in the range of roughly 1.3 to 2.0 depending on model and benchmark.
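The inflation ratio is plain arithmetic, sketched below as mean steered length over mean baseline length. That reading is an assumption; the paper might instead average per-example ratios, which can give a different number. The token counts are invented so that the result lands on 2.3 and are not the paper's data.

```python
from statistics import mean

def cot_inflation_ratio(baseline_lengths, steered_lengths):
    """Mean steered CoT length divided by mean baseline CoT length."""
    return mean(steered_lengths) / mean(baseline_lengths)

# Invented reasoning-token counts for three paired examples.
baseline_tokens = [200, 180, 220]
steered_tokens = [460, 400, 520]

ratio = cot_inflation_ratio(baseline_tokens, steered_tokens)
print(f"inflation ratio: {ratio:.2f}")
```

A ratio above 1 says the trace got longer. It says nothing about whether the extra tokens were verification or decoration.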

The interpretation is subtle. Longer reasoning can mean the model is wrestling with the perturbation. It can also mean the model is generating extra justification for a direction it has already been pushed toward. The token count alone does not tell us which one is happening.

This matters for business deployment because “reasoning mode” is increasingly treated as a safety feature. Need higher quality? Turn on deeper reasoning. Need more confidence? Ask the model to think step by step. Need governance? Store the trace.

The paper does not say these practices are useless. It says they are insufficient. A longer trace may be a better explanation, a noisier struggle, or a more polished rationalization. Those are not the same object. Procurement documents may pretend otherwise, because procurement documents also enjoy fiction.

When the action flips, the explanation often rationalizes it

The third contribution is behavioral. After steering, the authors compare baseline and steered responses using two external LLM judges: GPT-5.4 and Claude Sonnet 4.6. The judges classify how the steered response changed relative to the baseline, choosing among six categories: seamless divergence, confabulated support, constraint override, inflated deliberation, decision instability, and no meaningful difference.

This analysis is not the main causal evidence. The steering experiment is doing the causal lifting. The judge analysis explains what the visible reasoning looks like after the intervention.

That distinction is important. LLM judges are imperfect instruments. Here they are used not to prove the mechanism, but to categorize observable behavioral changes in paired traces. The appendix also reports disagreement rates, which vary by setting and are especially high in some BFCL cases where confabulated support and constraint override are difficult to separate. That is not a fatal flaw. It is a reminder that taxonomy becomes messy exactly where behavior becomes interesting.
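A disagreement rate between two judges, plus a tally of which category pairs they confuse, can be sketched like this. The six category names come from the paper as described above; the snake_case spellings, the sample labels, and the function names are assumptions for illustration.

```python
from collections import Counter

CATEGORIES = {
    "seamless_divergence", "confabulated_support", "constraint_override",
    "inflated_deliberation", "decision_instability", "no_meaningful_difference",
}

def disagreement_rate(labels_a, labels_b):
    """Fraction of paired traces on which the two judges chose different labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    unknown = (set(labels_a) | set(labels_b)) - CATEGORIES
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    return sum(a != b for a, b in zip(labels_a, labels_b)) / len(labels_a)

def confused_pairs(labels_a, labels_b):
    """Count unordered category pairs on which the judges split."""
    return Counter(
        tuple(sorted((a, b))) for a, b in zip(labels_a, labels_b) if a != b
    )

# Invented judge outputs for four paired traces.
judge_one = ["confabulated_support", "constraint_override",
             "seamless_divergence", "inflated_deliberation"]
judge_two = ["constraint_override", "constraint_override",
             "seamless_divergence", "inflated_deliberation"]

rate = disagreement_rate(judge_one, judge_two)
pairs = confused_pairs(judge_one, judge_two)
print(rate, pairs.most_common(1))
```

The pair tally is the more useful output in practice. A high raw disagreement rate concentrated in one pair, such as confabulated support versus constraint override, points at a blurry category boundary rather than at unreliable judges.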

The qualitative pattern is still useful:

| Behavioral bucket | What it means operationally | Why it matters for audit |
| --- | --- | --- |
| Confabulated support | The model invents missing facts, defaults, or user intent to justify the steered action | The explanation can create false evidence for the action |
| Constraint override | The model notices a constraint, then weakly dismisses it | The trace may mention the right rule while still violating its spirit |
| Inflated deliberation | The model produces more hedging or repeated re-evaluation | More text may signal conflict, not better reasoning |
| Decision instability | The model visibly shifts direction during the trace | The reasoning path may be unstable even if the final action is clean |
| Seamless divergence | The model fluently reaches a different final action | Fluency can hide the intervention most neatly |
| No meaningful difference | The steered and baseline outputs behave similarly | Some decisions are resistant or unaffected |

The most business-relevant pattern appears under injection steering. In flipped injection cases, models often rationalize the new tool call through confabulated support or constraint override. The example in the paper is wonderfully mundane: the user asks to play “Baby Shark,” but the only available tool adjusts volume. The baseline correctly abstains because no playback tool exists. Under injection steering, the model repurposes set_volume as if setting volume could satisfy the play request.

That is not just a funny failure. It is the failure mode of many workflow agents in miniature: a user asks for something, the available tools almost fit, and the model bridges the gap with confidence. The dangerous part is not that the tool call is wrong. Wrong calls can be blocked. The dangerous part is that the reasoning trace can make the wrong call look internally justified.

Chain-of-thought is not an audit trail

The paper lands hardest on a common enterprise misconception: if a reasoning model prints its reasoning, the organization has an audit trail.

No. It has a generated explanation.

That explanation may still be useful. It can reveal misunderstandings, surface missing assumptions, and help humans debug prompts or tool definitions. But as compliance evidence, it is weak unless paired with independent checks. The paper’s mechanism suggests why: the action tendency can be present before the reasoning begins, and the reasoning may later conform to the action tendency.

A better governance model separates three objects that are often collapsed into one cheerful dashboard:

| Object | What it answers | Reliability problem |
| --- | --- | --- |
| Reasoning trace | What story did the model produce about its decision? | May be post-hoc or rationalized |
| Action log | What tool/action was actually selected? | Describes behavior, not cause |
| Internal-state monitor | Was there an early latent tendency toward a high-risk action? | Requires model access, calibration, and task-specific probes |

This does not mean every business needs activation probes tomorrow morning. Most companies are not running open-weight reasoning models with residual-stream hooks in production. Many are using hosted APIs where internal activations are not available at all.

But the design principle transfers: do not treat self-explanation as self-verification. For high-risk actions, tool invocation should be governed by external constraints: schema validation, permission checks, policy gates, retrieval-grounded verification, uncertainty thresholds, and human approval where the cost of error is large. Chain-of-thought can be diagnostic context. It should not be the lock on the door.
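One way to picture "external constraints" is a gate that sits between the model's proposed tool call and its execution. The sketch below is a hypothetical design, not anything from the paper: the `ToolSpec`, `ProposedCall`, and `gate` names, the registry contents, and the approval flag are all assumptions. The one deliberate choice is that the gate stores the reasoning trace but never reads it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolSpec:
    name: str
    required_args: frozenset
    high_risk: bool = False

@dataclass(frozen=True)
class ProposedCall:
    tool: str
    args: dict
    reasoning_trace: str = ""  # kept for debugging; the gate never consults it

def gate(call, registry, granted_tools, human_approved=False):
    """Return (allowed, reason) using only facts external to the model's story."""
    spec = registry.get(call.tool)
    if spec is None:
        return False, "unknown tool"
    missing = spec.required_args - set(call.args)
    if missing:
        return False, f"missing arguments: {sorted(missing)}"
    if call.tool not in granted_tools:
        return False, "caller lacks permission"
    if spec.high_risk and not human_approved:
        return False, "human approval required"
    return True, "ok"

# Hypothetical registry and permissions.
REGISTRY = {
    "set_volume": ToolSpec("set_volume", frozenset({"level"})),
    "issue_refund": ToolSpec(
        "issue_refund", frozenset({"order_id", "amount"}), high_risk=True
    ),
}
GRANTED = {"set_volume", "issue_refund"}

refund = ProposedCall(
    "issue_refund",
    {"order_id": "A1", "amount": 40},
    reasoning_trace="The customer clearly deserves a refund.",
)
print(gate(refund, REGISTRY, GRANTED))
print(gate(refund, REGISTRY, GRANTED, human_approved=True))
```

A gate like this catches structural violations: unknown tools, missing arguments, missing permissions, unapproved high-risk actions. It would not catch a well-formed call to the wrong tool, which needs a separate check of the call against the user's request.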

The operational opportunity is pre-action intelligence

There is a more constructive angle here. If action decisions are detectable early, then agent systems may eventually gain a new layer of control: pre-action intelligence.

Instead of waiting for a model to finish reasoning and then inspecting the output, a system could monitor internal or proxy signals before execution. In open-weight deployments, this might mean probes trained for specific high-risk actions: payment initiation, database mutation, sensitive retrieval, external message sending, or irreversible workflow steps. In closed-model deployments, the equivalent may be less direct: lightweight shadow classifiers, tool-call risk scoring, disagreement checks between the proposed action and retrieved policy, or structured pre-execution validators.

The business value is not mystical interpretability. It is cheaper failure interception.

| Use case | What pre-action monitoring could catch | Practical boundary |
| --- | --- | --- |
| Customer support automation | Premature tool calls when required user information is missing | Needs calibrated labels for “should call” vs “should ask” |
| Data analytics agents | Unnecessary database or API calls when direct answer is possible | Must avoid blocking legitimate exploratory calls |
| Finance or procurement workflows | Early propensity toward payment, approval, or vendor action | Requires strict external authorization, not just model confidence |
| Coding agents | Latent tendency to execute destructive commands | Needs sandboxing and command-level policy gates |
| Compliance review | Mismatch between reasoning story and action propensity | Internal activations may be unavailable in hosted systems |

The uncertain part is how far this generalizes. The paper tests binary tool-call decisions in specific benchmarks and models. Enterprise agents perform richer actions: multi-step plans, tool sequences, stateful workflows, and interactions with messy human inputs. A binary probe for call/no-call is not the same as a governance layer for a 12-step procurement process.

Still, the direction is valuable. The old model of agent governance is output inspection. The new model should be action-centric: inspect the proposed action, the permissions around it, the evidence supporting it, and—where technically possible—the early signals that the action is forming before the model writes its justification.

Where the paper stops

The limitations are not decorative here; they affect how the result should be used.

First, the main causal results are on Qwen3-4B and GLM-Z1-9B. GPT-OSS-20B appears in supplemental predictability analyses, but not in causal steering, because its mixture-of-experts architecture complicates the intervention. So the paper is not a universal law of all reasoning models.

Second, the action studied is binary: call a tool or do not call a tool. That is exactly why the paper is clean. It is also why we should be careful before extending the result to every form of reasoning, planning, or strategic behavior.

Third, steering is evaluated on held-out samples of 100 examples per benchmark/direction combination. That is enough to show the phenomenon is real in the studied setting; it is not enough to map every operational regime where enterprise agents might fail.

Fourth, behavioral categories rely on external LLM judges. The authors mitigate this by using two judges, reversed presentation order, and disagreement reporting. But the classification is still a behavioral lens, not a ground-truth microscope.

Finally, the paper does not prove that reasoning is useless. It shows that, in these settings, visible reasoning may be downstream of an already detectable action tendency and may rationalize perturbations. That is narrower than “CoT is fake” and much more useful than the slogan.

The revised mental model for AI agents

The cleanest replacement for the old mental model is this:

Not reasoning → decision → action, but latent action tendency → visible reasoning → action justification.

That does not mean the middle step is irrelevant. Visible reasoning may still verify, destabilize, recover, or occasionally change the trajectory. The paper’s early dip in predictability is a reminder that reasoning dynamics are not simply a rubber stamp. But the system designer should not assume the text trace is where the decision was born.

For business teams deploying agents, the immediate takeaway is practical:

  1. Treat chain-of-thought as diagnostic evidence, not causal evidence.
  2. Govern tool calls directly, especially when actions are costly, irreversible, or regulated.
  3. Build pre-execution validators that do not depend on the model’s own explanation.
  4. Where open-weight deployment permits it, explore task-specific probes for high-risk action propensities.
  5. Measure whether longer reasoning improves decisions, rather than assuming it does because the invoice is larger.
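
The last item on that list is the easiest to start on, because it needs only logged outcomes. A minimal sketch, assuming each logged decision carries a reasoning-token count and a correctness label from some evaluation set; the records, bucket boundaries, and function name are invented for illustration:

```python
def accuracy_by_length(records, boundaries):
    """Group (cot_tokens, correct) records by ascending upper bounds.

    Returns a dict mapping bucket label to accuracy; empty buckets are omitted.
    """
    buckets = {bound: [] for bound in boundaries}
    overflow = []
    for tokens, correct in records:
        for bound in boundaries:
            if tokens <= bound:
                buckets[bound].append(int(correct))
                break
        else:  # longer than every boundary
            overflow.append(int(correct))
    report = {
        f"<= {bound}": sum(vals) / len(vals)
        for bound, vals in buckets.items()
        if vals
    }
    if overflow:
        report[f"> {boundaries[-1]}"] = sum(overflow) / len(overflow)
    return report

# Invented evaluation log: (reasoning tokens, was the tool decision correct?)
log = [(120, True), (150, True), (400, True),
       (450, False), (900, False), (1100, True)]

report = accuracy_by_length(log, boundaries=[200, 500])
print(report)
```

If accuracy is flat or falling as traces get longer, the extra tokens are a cost without a demonstrated benefit for that workload. Length can also track task difficulty, so a fair comparison holds the task mix constant across buckets.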

The paper’s title, “Therefore I am. I Think.”, is nicely inverted. The model may not be saying, “I think, therefore I decide.” In these experiments, the more accurate line is closer to: “I am already leaning toward an action; therefore I will now explain.”

For AI governance, that is enough of a problem. The explanation may be fluent, the chain-of-thought may be long, and the dashboard may look reassuring. But if the action was already forming before the story began, then auditing only the story is not governance. It is literary criticism with enterprise pricing.

Cognaptus: Automate the Present, Incubate the Future.


  1. Esakkivel Esakkiraja, Sai Rajeswar, Denis Akhiyarov, and Rajagopal Venkatesaramani, “Therefore I am. I Think.,” arXiv:2604.01202v3, 2026. https://arxiv.org/abs/2604.01202