A failed deployment usually produces two questions. The first is easy enough to ask: what happened? The second is where the room goes quiet: what actually caused it?
Most AI systems are now quite comfortable with the first question. Give them logs, traces, workflows, tool calls, or transition histories, and they can often produce a plausible reconstruction. They can narrate the incident in confident sequence. They can point to every condition that was present. They can provide a tidy post-mortem, ideally before the humans have finished opening the dashboard.
TempoBench argues that this is not enough. The paper’s central claim is sharper than the usual “models struggle with temporal reasoning” refrain. It shows that models can be reasonably good at simulating a system forward, while failing at the backward causal task of identifying which prior inputs were necessary for an observed outcome.[1] In other words, they can replay the movie but still misunderstand the plot. A familiar talent in corporate life, admittedly, but not one we should automate without supervision.
The paper’s real distinction is not time, but causal minimality
TempoBench is built around a deceptively simple distinction.
The first task is trace simulation, or SIM. A model receives a deterministic state-transition system, represented as a Mealy machine, plus an input sequence. It must simulate the system step by step: current state, input, output, next state. This is forward reasoning. The model asks, in effect: given the rules and the inputs, what happens next?
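To make the SIM task concrete, here is a minimal sketch of forward simulation over a toy Mealy machine. The machine, its state names, and its input symbols are hypothetical and are not taken from TempoBench; the point is only the shape of the task: look up the current state and input, emit the output, move to the next state.

```python
from typing import Dict, List, Tuple

# Hypothetical toy machine: (state, input symbol) -> (output, next state).
# The names are illustrative and do not come from the benchmark.
Machine = Dict[Tuple[str, str], Tuple[str, str]]

TOY_MACHINE: Machine = {
    ("idle", "request"): ("grant", "busy"),
    ("idle", "wait"): ("none", "idle"),
    ("busy", "request"): ("deny", "busy"),
    ("busy", "wait"): ("none", "idle"),
}


def simulate(machine: Machine, start: str, inputs: List[str]) -> List[Tuple[str, str, str, str]]:
    """Return one (state, input, output, next_state) row per step."""
    rows = []
    state = start
    for symbol in inputs:
        output, next_state = machine[(state, symbol)]
        rows.append((state, symbol, output, next_state))
        state = next_state
    return rows


trace = simulate(TOY_MACHINE, "idle", ["request", "request", "wait", "request"])
for row in trace:
    print(row)
```

Nothing here requires judgement about relevance. Every step is a table lookup, which is part of why forward simulation is the easier of the two tasks.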
The second task is minimal causal attribution, or MIN. Here the model receives the system and an observed trace, then must identify which input literals were necessary for a particular output. This is backward counterfactual reasoning. The model asks: if this input had been different, would the observed output still have occurred?
That difference sounds small until it reaches operations. In an incident review, “the tests failed, the reviewer did not approve, and the deployment was blocked” is a trace. “The deployment was blocked because the approval was missing; the test result did not matter in that state” is causal minimality. The second statement is more useful, because it separates signal from decoration.
TempoBench formalises this distinction using deterministic Mealy machines. Each machine has states, transitions, input guards, and outputs. Because the full system is known, the authors can compute exact minimal causal labels rather than relying on noisy human annotation or model-generated explanations. That is the benchmark’s first contribution: it uses formal systems to generate scalable, verifiable reasoning data where the answer is not a vibes-based post-mortem wearing a lab coat.
The important word is minimal. A model does not get full credit for listing every condition present on a transition. It must prune away anything that was not necessary. That pruning step is precisely where the tested models stumble.
Why copying the transition condition is not causal reasoning
The common failure mode is overspecification.
Suppose a transition fires under a guard such as tests_pass & reviewer_approved, but the output would have been the same even if tests_pass had changed. A model that lists both literals has described a sufficient-looking condition, but it has not identified the minimal cause. It has copied the surface rule rather than performing the counterfactual test.
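The counterfactual test can be sketched in a few lines. The workflow below is hypothetical, and the check is a simplified single-step version: flip one literal at a time and see whether the output changes. It is an assumption made for illustration, not the paper’s labelling procedure, which works over whole temporal traces.

```python
from typing import Dict, Set, Tuple

Inputs = Dict[str, bool]


def step(state: str, inputs: Inputs) -> Tuple[str, str]:
    """Hypothetical review gate: returns (output, next_state)."""
    if state == "awaiting_review":
        # Two guards lead to different next states but emit the same output
        # whenever the reviewer has approved.
        if inputs["tests_pass"] and inputs["reviewer_approved"]:
            return ("proceed", "deploying")
        if inputs["reviewer_approved"]:
            return ("proceed", "retesting")
        return ("blocked", "awaiting_review")
    # In any other state the output no longer depends on the inputs.
    return ("idle", state)


def necessary_literals(state: str, inputs: Inputs) -> Set[str]:
    """Keep only literals whose individual flip changes the observed output."""
    observed_output, _ = step(state, inputs)
    needed = set()
    for name in inputs:
        flipped = dict(inputs)
        flipped[name] = not flipped[name]
        output, _ = step(state, flipped)
        if output != observed_output:
            needed.add(name)
    return needed


observed = {"tests_pass": True, "reviewer_approved": True}
guard_copy = set(observed)  # what a guard-copying model would report
minimal = necessary_literals("awaiting_review", observed)

print(sorted(guard_copy))  # ['reviewer_approved', 'tests_pass']
print(sorted(minimal))     # ['reviewer_approved']
print(necessary_literals("deploying", observed))  # set()
```

Flipping `tests_pass` changes the next state but not the output, so it drops out. The last call shows the inevitable-output case: in a state where no input matters, the honest answer is the empty set.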
This matters because many business AI systems are being evaluated on whether they can produce coherent explanations. Coherence is cheap. Causal discipline is not.
TempoBench makes the distinction operational:
| Model behaviour | Looks like reasoning? | What it actually does | Business consequence |
|---|---|---|---|
| Simulates each transition correctly | Yes | Tracks system execution forward | Useful for replay and monitoring |
| Lists all observed input conditions | Very much so | Retrieves the full guard or surrounding facts | Bloats root-cause reports with false positives |
| Identifies only necessary literals | Yes, and for better reasons | Performs counterfactual pruning | Supports diagnosis, remediation, and accountability |
| Says “no constraints” when output was inevitable | Often looks suspiciously brief | Correctly recognises that no input mattered | Prevents teams from fixing irrelevant causes |
The last row is especially uncomfortable. In real workflows, sometimes nothing the user did at a specific step mattered; the system had already entered a state where the output was inevitable. A good diagnostic model must be able to say that. A weaker model will search for agency because explanations feel more satisfying when someone or something can be blamed. Management consultants discovered this long before transformers did.
TempoBench uses formal worlds because real worlds are too messy to label cleanly
The benchmark is generated from reactive systems specified in temporal logic and synthesised into Mealy machines. The paper draws systems from the SYNTCOMP benchmark suite, including arbiters, Chomp games, and load balancers. These produce machines ranging from small mutual-exclusion controllers to larger branching systems with dozens of states and hundreds of transitions.
This design is not a claim that production systems are Mealy machines in disguise. They are not. Production systems contain partial observability, nondeterminism, missing logs, human interventions, stale permissions, fragile APIs, and at least one spreadsheet everyone is afraid to touch.
The formal setting serves a narrower purpose: it gives researchers exact ground truth. For each trace, the authors can compute which input conditions are genuinely necessary under counterfactual intervention. That makes TempoBench useful as a diagnostic microscope. It is not a replica of the enterprise; it is a controlled chamber where a specific weakness can be isolated without arguing about whether the human label was right.
The data pipeline creates two kinds of supervision. For SIM, the model learns to walk through transitions. For MIN, it learns to test whether flipping an input would change the observed output. The training examples are long: the paper reports TempoBench’s 50,000-sample training set at roughly 278.9 million tokens, which works out to about 5,600 tokens per example, far longer on average than examples in comparison datasets such as SlimOrca, OpenMathInstruct, or OpenCoder. That is not just a size difference. It is a difference in training signal. TempoBench teaches the model to reason through system behaviour, not merely to produce short final answers.
The main result: models can follow the trace, then fail the diagnosis
The headline evidence comes from the split between SIM and MIN.
On forward simulation, the strongest evaluated reasoning model reaches 96.0% step accuracy. That is impressive, though not magical: the task provides the transition rules and asks the model to apply them. Several frontier models still perform unevenly, but the general pattern is clear enough. Following the machine is easier than diagnosing it.
On causal attribution, the picture changes. In the main evaluation table, non-reasoning frontier models show high overspecification rates on MIN: Sonnet 4.6 at 60.1%, Haiku 4.5 at 95.8%, DeepSeek V3.2 in non-reasoning mode at 96.2%, and DeepSeek V3 at 76.8%. The failure is not random. It is structured. Models often return too much.
The constrained version, MIN+, is particularly revealing because it removes easy “no constraints” steps and focuses on cases where specific input literals are required. No zero-shot frontier model in the table exceeds 24.4% step accuracy on MIN+. Sonnet 4.6 reaches 13.7%. Haiku 4.5 reaches 24.4%. DeepSeek V3.2 in non-reasoning mode reaches 24.2%. DeepSeek V3 drops to 1.5%.
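The gap between all-step MIN and MIN+ is easier to see with a small scoring sketch. Two assumptions apply: a step counts as correct only when the predicted literal set exactly matches the gold set, and the constrained variant keeps only steps whose gold set is non-empty. The four steps below are invented for illustration and are not results from the paper.

```python
from typing import FrozenSet, List, Tuple

# Each step is (gold literals, predicted literals). Illustrative data only.
Step = Tuple[FrozenSet[str], FrozenSet[str]]

steps: List[Step] = [
    (frozenset(), frozenset()),                # inevitable output, correctly empty
    (frozenset(), frozenset()),                # inevitable output, correctly empty
    (frozenset({"a"}), frozenset({"a", "b"})), # overspecified
    (frozenset({"a"}), frozenset({"a"})),      # exactly right
]


def step_accuracy(rows: List[Step], constrained_only: bool = False) -> float:
    """Exact-match accuracy, optionally restricted to steps that need literals."""
    if constrained_only:
        rows = [row for row in rows if row[0]]
    if not rows:
        return 0.0
    correct = sum(1 for gold, predicted in rows if gold == predicted)
    return correct / len(rows)


print(step_accuracy(steps))                         # 0.75
print(step_accuracy(steps, constrained_only=True))  # 0.5
```

Easy empty-set steps can flatter an all-step score. Dropping them leaves only the steps where the model has to commit to specific causes.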
Extended reasoning helps, but does not dissolve the problem. DeepSeek R1 and DeepSeek V3.2 in reasoning mode use roughly 20,000 tokens per sample. They improve on some all-step MIN results, but remain weak on constrained causal attribution: DeepSeek R1 reaches 10.3% on MIN+, and DeepSeek V3.2 reasoning reaches 9.7%. More thinking tokens reduce some copying behaviour, but introduce more underspecification. The model stops merely parroting the guard; then it misses necessary causes. Progress, of a sort. Like replacing a bad audit with a thoughtful but incomplete one.
The paper’s error taxonomy is therefore more important than any single score. It separates:
- Overspecification, where the model adds unnecessary literals.
- Condition copy, a subtype where the model copies the full transition condition.
- Underspecification, where the model misses necessary literals.
- Correct minimal attribution, where it identifies exactly what mattered.
That taxonomy explains why ordinary benchmark scores can be misleading. A model can look competent because it produces detailed explanations. TempoBench asks whether those details survive counterfactual deletion.
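The taxonomy above maps naturally onto set comparisons. The sketch below is one possible reading, not the paper’s scoring code. It assumes that any extra literal makes a prediction overspecified (even if another literal is also missing), and that condition copy means the prediction equals the full transition guard.

```python
from typing import Set


def classify(predicted: Set[str], gold: Set[str], guard: Set[str]) -> str:
    """Label one prediction against the gold minimal set and the transition guard."""
    if predicted == gold:
        return "correct"
    if predicted - gold:
        # At least one unnecessary literal was reported.
        if predicted == guard:
            return "condition_copy"
        return "overspecification"
    # No extras, so the prediction is a strict subset of the gold set.
    return "underspecification"


guard = {"tests_pass", "reviewer_approved"}
gold = {"reviewer_approved"}

print(classify({"tests_pass", "reviewer_approved"}, gold, guard))  # condition_copy
print(classify({"reviewer_approved", "on_call"}, gold, guard))     # overspecification
print(classify(set(), gold, guard))                                # underspecification
print(classify({"reviewer_approved"}, gold, guard))                # correct
```

The `on_call` literal is hypothetical. It stands in for any fact that appeared nearby in the trace without being necessary.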
Fine-tuning shows the capability is learnable, but not solved
The paper’s training study is the second major result. The authors fine-tune LLaMA 3.1 8B, LLaMA 3.2 3B, and Qwen2.5 7B using LoRA, then compare TempoBench training against instruction data, math data, code data, and mixtures.
The key finding is not that small models suddenly become perfect causal analysts. They do not. The useful result is more modest and more interesting: TempoBench-style training moves models toward the right behaviour, and standard reasoning datasets do not transfer reliably into temporal causal attribution.
For example, LLaMA 3.1 8B trained with TempoBench plus SlimOrca reaches 21.7% on MIN+, exceeding Sonnet 4.6’s 13.7% and approaching Haiku 4.5’s 24.4% in the reported table. Qwen2.5 7B trained with TempoBench plus OpenMathInstruct reaches 22.4% on all-step MIN, outperforming several frontier baselines on that metric. These are not victory-lap numbers. They are evidence that the failure is at least partly a training-signal problem, not an immutable law of scale.
The comparison with math, code, and instruction data is where the business lesson starts to form. Code data trains models to work with structured systems. Math data trains multi-step derivation. Instruction data trains compliance and explanation style. Yet none of those reliably teaches the specific act of causal pruning over time. Apparently, “show your work” is not the same as “remove every fact that did not matter.” Shocking, but useful.
The ablations explain what the benchmark is actually testing
The paper includes several supporting tests. These should not be read as separate grand claims. Their purpose is narrower: to check whether the main interpretation survives obvious objections.
| Test or appendix evidence | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Error breakdown by overspecification, condition copy, and underspecification | Main diagnostic evidence | Failures are structured around copying and pruning errors | That real-world incidents will have the same error distribution |
| Causal chain-of-thought ablation | Ablation | The reasoning trace matters; final-answer-only training is weaker | That visible chain-of-thought should be exposed in production |
| Transition visibility ablation | Memorisation and input-use check | Models rely on the provided machine rather than simply memorising systems | That the learned skill generalises to unlogged enterprise systems |
| SHAP analysis of structural factors | Robustness and sensitivity test | Difficulty is linked to structural reasoning features, not just trace length | That SHAP features capture all forms of temporal complexity |
| Narrative enrichment study | Exploratory extension | Natural language can obscure formal output equivalence | That formal notation is always better for users |
The chain-of-thought ablation is especially relevant. The paper trains a LLaMA 3.1 8B model with and without causal reasoning traces. Removing the causal CoT supervision damages performance across several benchmarks and does not produce the same TempoBench gains. This supports the authors’ view that the useful training signal is not merely the final label. It is the intermediate discipline of counterfactual checking.
The memorisation test is also useful. When the authors reduce transition visibility, SIM performance degrades sharply: with full visibility, step accuracy is 59.8%; with 50% visibility, it drops to 21.7%; with 25% visibility, it drops to 8.8%. That pattern suggests the model is reading and using the transition table rather than memorising the machine. On MIN, reduced visibility increases condition-copy behaviour, reinforcing the broader diagnosis: when uncertain, models lean toward retrieval-like behaviour.
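A visibility ablation of this kind can be approximated by hiding part of the transition table before it reaches the prompt. The article does not say how the paper selects which transitions to hide, so the uniform random sampling, the fixed seed, and the toy table below are all assumptions.

```python
import random
from typing import List, Tuple

Row = Tuple[str, str, str, str]  # (state, input, output, next_state)

# Hypothetical eight-row transition table.
TABLE: List[Row] = [
    ("s0", "a", "x", "s1"),
    ("s0", "b", "y", "s0"),
    ("s1", "a", "y", "s2"),
    ("s1", "b", "x", "s0"),
    ("s2", "a", "x", "s3"),
    ("s2", "b", "y", "s1"),
    ("s3", "a", "y", "s0"),
    ("s3", "b", "x", "s2"),
]


def visible_rows(table: List[Row], visibility: float, seed: int = 0) -> List[Row]:
    """Keep a random fraction of rows, preserving their original order."""
    rng = random.Random(seed)
    keep = round(len(table) * visibility)
    kept_indices = sorted(rng.sample(range(len(table)), keep))
    return [table[i] for i in kept_indices]


for fraction in (1.0, 0.5, 0.25):
    print(fraction, len(visible_rows(TABLE, fraction)))
```

If a model’s accuracy falls as rows disappear, it was reading the table. If accuracy held steady, memorisation would be the more likely explanation.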
The narrative enrichment appendix is a small but telling caution. When a formal automaton is rewritten as a rich reactor-meltdown story with an observation archive, Claude Sonnet 4.6 correctly identifies the one genuinely necessary input but also introduces false positives by confusing changes in state with changes in output. The model performs worse overall than in the formal format. The lesson is not that natural language is bad. It is that narrative can blur equivalences that formal notation makes explicit. A beautifully written incident report can still mislead the diagnostic engine. Literature departments may celebrate; operations teams should not.
The business value is better diagnosis, not prettier explanations
For businesses, TempoBench points to a specific evaluation gap in AI agents.
Many agent systems are already tested on task completion: did the agent execute the workflow, call the right tools, retrieve the right document, resolve the ticket, or patch the issue? That is forward competence. Necessary, but insufficient.
The harder operational question appears after something goes wrong:
- Which tool call was necessary for the failure?
- Which missing approval actually blocked the workflow?
- Which policy condition mattered, and which one merely appeared in the same log segment?
- Which earlier state made a later output inevitable?
- Which remediation would have changed the outcome?
These are MIN-shaped questions. They require identifying the minimal causal set inside a temporal process.
The practical pathway is therefore not “use TempoBench as-is to certify enterprise agents.” That would be too convenient, and convenience is often where bad AI governance goes to breed. The more realistic pathway is:
- Model important workflows as stateful systems where possible. Deployment pipelines, access approvals, claims workflows, order fulfilment, ticket routing, and monitoring systems often have explicit states and transition rules.
- Generate synthetic traces from those workflow models. The goal is not to mimic every production mess, but to create controlled cases where causal ground truth is known.
- Train or evaluate agents on minimal attribution, not just replay. The model should identify which input conditions were necessary and when no condition mattered.
- Use overspecification as a risk metric. In root-cause analysis, false positives are not harmless. They send teams to fix irrelevant controls, rewrite irrelevant policies, or blame irrelevant users.
- Keep human review for ambiguous real-world cases. TempoBench tests deterministic systems with complete rules. Enterprises rarely offer that luxury. Funny how they never do.
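The first four steps can be sketched end to end on a toy workflow. Everything here is hypothetical: the deployment gate, the literal names, the single-step flip test used as ground truth, and the two stand-in “models”. It is a sketch of the evaluation shape, not a reproduction of TempoBench’s pipeline.

```python
import random
from typing import Callable, Dict, List, Set, Tuple

Inputs = Dict[str, bool]
LITERALS = ["tests_pass", "reviewer_approved"]


def step(state: str, inputs: Inputs) -> Tuple[str, str]:
    """Hypothetical deployment gate modelled as a state machine."""
    if state == "review":
        if inputs["reviewer_approved"]:
            return ("proceed", "deploy" if inputs["tests_pass"] else "retest")
        return ("blocked", "review")
    if state == "retest":
        return ("proceed", "deploy") if inputs["tests_pass"] else ("blocked", "retest")
    # state == "deploy": the output is fixed whatever the inputs are.
    return ("released", "deploy")


def gold_literals(state: str, inputs: Inputs) -> Set[str]:
    """Ground truth by construction: literals whose flip changes the output."""
    observed, _ = step(state, inputs)
    needed = set()
    for name in LITERALS:
        flipped = dict(inputs)
        flipped[name] = not flipped[name]
        if step(state, flipped)[0] != observed:
            needed.add(name)
    return needed


def generate_trace(length: int, rng: random.Random) -> List[Tuple[str, Inputs, Set[str]]]:
    """Random inputs at each step, labelled with the gold minimal set."""
    state, rows = "review", []
    for _ in range(length):
        inputs = {name: rng.random() < 0.5 for name in LITERALS}
        rows.append((state, inputs, gold_literals(state, inputs)))
        _, state = step(state, inputs)
    return rows


def overspecification_rate(rows, predict: Callable[[str, Inputs], Set[str]]) -> float:
    """Fraction of steps where the prediction contains an unnecessary literal."""
    flagged = sum(1 for state, inputs, gold in rows if predict(state, inputs) - gold)
    return flagged / len(rows)


rng = random.Random(7)
rows = [row for _ in range(20) for row in generate_trace(6, rng)]


def copy_everything(state: str, inputs: Inputs) -> Set[str]:
    """Stand-in for a model that reports every literal it saw."""
    return set(LITERALS)


print(overspecification_rate(rows, copy_everything))  # 1.0
print(overspecification_rate(rows, gold_literals))    # 0.0
```

In this toy gate no step ever needs both literals, so the copy-everything baseline is overspecified on every step. A real evaluation would swap `copy_everything` for calls to the agent under test and track the rate over time.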
This reframes ROI. The value is not simply cheaper explanation generation. It is cheaper diagnostic narrowing. In a complex workflow, the expensive part is not writing the incident summary; it is deciding where engineers, compliance staff, or operations managers should focus next. A model that overexplains can increase labour. A model that prunes correctly can reduce it.
What the paper directly shows, and what we can only infer
TempoBench directly shows three things.
First, formally generated automata can produce scalable, verified training and evaluation data for temporal causal reasoning. The authors do not need humans to annotate why an output occurred; they compute minimal causal labels from the system.
Second, current LLMs show a meaningful split between forward simulation and backward causal minimality. Even when given transition rules and algorithmic instructions, models often overspecify causes.
Third, fine-tuning on TempoBench-style data improves causal attribution and can transfer more broadly than math-, code-, or instruction-only data in the tested model families. The strongest story is not universal improvement; it is that causal temporal supervision contributes something ordinary reasoning datasets do not.
The business inference is that agent reliability programmes should add causal minimality tests to their evaluation stack. This inference is reasonable, but it is still an inference. The paper does not test production incident response, live software systems, human organisational workflows, or stochastic environments. It tests deterministic automata derived from formal specifications.
That boundary matters. Formal systems are clean. Business systems are not. Logs are incomplete. Policies conflict. Users behave creatively, which is a polite way of saying “against documentation.” A model that improves on TempoBench may still fail in a messy enterprise trace unless the surrounding system provides adequate observability, state representation, and counterfactual access.
The strongest near-term use is therefore in domains where workflows are already structured: CI/CD pipelines, policy engines, robotic process automation, access-control systems, financial approval flows, and monitored infrastructure. In those settings, TempoBench’s philosophy can be adapted: create verified synthetic cases, score causal pruning, and treat overspecification as a first-class failure.
The uncomfortable lesson for agent builders
The industry has spent a great deal of effort making models reason longer. TempoBench suggests that longer reasoning is not enough if the model is reasoning in the wrong direction.
Forward reasoning asks the model to accumulate. Backward causal reasoning asks it to delete. It must remove facts that were present but irrelevant, conditions that looked important but were redundant, and actions that happened before the outcome but did not cause it. That is cognitively different. It is also organisationally different. Companies are not short of explanations. They are short of explanations that survive pruning.
The paper’s quiet provocation is that many “agent reasoning” evaluations may be over-rewarding narrative competence. A model that can tell you everything that happened is not necessarily a model that knows what mattered. TempoBench gives that distinction a formal testing ground.
For Cognaptus readers, the practical conclusion is straightforward. When evaluating AI agents, do not only ask whether they can execute the workflow or summarise the trace. Ask whether they can identify the smallest set of conditions that would have changed the outcome. Then ask what they added unnecessarily. The extra details are not always helpful. Sometimes they are the bug.
AI systems that diagnose should not be rewarded for sounding comprehensive. They should be rewarded for being selectively correct. TempoBench is valuable because it turns that principle into a measurable task.
And yes, it turns out that “less is more” applies to causality too. Somewhere, a minimalist designer is insufferably pleased.
Cognaptus: Automate the Present, Incubate the Future.
[1] Nikolaus Holzer, William Fishell, Baishakhi Ray, and Mark Santolucito, “TempoBench: Evaluating Temporal Causal Reasoning in Large Language Models,” arXiv:2510.27544v2, 2026. https://arxiv.org/abs/2510.27544