Delay is not a footnote in automation. It is the product.

A customer support agent that takes thirty seconds to decide whether to escalate has already shaped the customer’s mood. A warehouse robot that produces the correct plan after the pallet has moved has produced something closer to poetry than control. A trading assistant that generates a gorgeous hedge after the market has repriced is not sophisticated. It is late, which is the expensive version of wrong.

That is the useful provocation in “Real-Time Reasoning Agents in Evolving Environments” by Wen, Ye, Zhang, Yang, and Zhu [1]. The paper does not merely ask whether large language model agents can reason. We have had enough demonstrations of agents thinking slowly in worlds politely frozen for their convenience. The question here is less flattering: can an agent reason while the world keeps moving?

The answer is: not reliably, if the agent is forced to choose between reflex and deliberation. Reactive agents move on time but often think too shallowly. Planning agents reason more deeply but act on stale observations. Code-as-policy can help in tidy algorithmic worlds, then becomes brittle when context and coordination matter. The authors’ proposed system, AgileThinker, is interesting because it treats real-time agency as a resource-allocation problem between fast action and slow reasoning, not as a model-size contest wearing a latency badge.

That distinction matters. The business misconception is obvious and persistent: “real-time AI” sounds like “use a faster model.” Cute. Also insufficient. A faster monologue is still a monologue. Real-time agents need architectures that decide what must be done now, what can be reasoned about in parallel, and how much of unfinished thought is still useful before the next tick of the world.

The paper makes time part of the environment, where it belonged all along

Most agent benchmarks quietly grant models a supernatural privilege: the environment waits while they think. The browser page does not change. The simulated robot state does not drift. The game board does not advance. The user does not lose patience. This is convenient for evaluation and mildly ridiculous for deployment.

The paper formalises a different setting. In Real-Time Reasoning Gym, the environment updates at a fixed rate regardless of whether the agent has completed its reasoning. If the agent has not produced an action in time, a default action is applied. That simple change is brutal. It turns inference latency from an engineering nuisance into part of the task.
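To make that rule concrete, here is a minimal Python sketch of a clock that does not wait. Nothing in it comes from the paper's code: `CountdownEnv`, `run_episode`, the `"stay"` default, and the token costs are assumed names and numbers chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

DEFAULT_ACTION = "stay"  # assumed default; the benchmark's actual defaults may differ


@dataclass
class CountdownEnv:
    """Toy world that advances one tick per step, ready or not."""
    position: int = 0
    ticks_left: int = 5

    def step(self, action: str) -> None:
        if action == "forward":
            self.position += 1
        self.ticks_left -= 1  # time passes even when the default action is applied


def run_episode(
    env: CountdownEnv,
    decide: Callable[[int], Tuple[str, int]],
    tokens_per_step: int,
) -> List[str]:
    """Run until the clock expires.

    `decide(position)` returns (action, token_cost). The action only lands once
    the agent has been granted at least `token_cost` tokens of thinking time.
    """
    applied_log: List[str] = []
    pending: Optional[Tuple[str, int]] = None  # (action, tokens still to generate)
    while env.ticks_left > 0:
        if pending is None:
            pending = decide(env.position)  # reasoning starts from this snapshot
        action, remaining = pending
        remaining -= tokens_per_step
        if remaining <= 0:
            applied, pending = action, None  # decision finished inside the window
        else:
            applied, pending = DEFAULT_ACTION, (action, remaining)  # too slow
        env.step(applied)
        applied_log.append(applied)
    return applied_log
```

With a 100-token budget per step, a decision that costs 100 tokens acts on every tick. One that costs 250 tokens lands once in five ticks, and the default action fills the rest.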

The benchmark uses three real-time games, each designed to isolate a different dynamic pressure:

| Environment | Dynamic pressure | What the agent must handle | Score intuition |
| --- | --- | --- | --- |
| Freeway | Hazards | Avoid moving cars while crossing lanes | Faster successful traversal |
| Snake | Opportunities | Catch transient food without trapping itself | More apples before collision |
| Overcooked | Partners | Coordinate with an independently acting teammate | More completed cooperative tasks |

The authors vary two dimensions. Cognitive load changes the inherent difficulty of the environment: longer paths in Freeway, more obstacles in Snake, and larger counter layouts in Overcooked. Time pressure changes how many output tokens the agent can generate before the environment advances, using token count as a hardware-agnostic proxy for elapsed inference time.
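The token-as-time idea reduces to simple arithmetic, sketched below. The function names and the decoding speed are assumptions for illustration; the paper defines the budget in tokens, and any conversion to seconds depends on the serving stack.

```python
import math


def steps_elapsed(reasoning_tokens: int, tokens_per_step: int) -> int:
    """Environment steps that pass while the agent emits `reasoning_tokens`."""
    if tokens_per_step <= 0:
        raise ValueError("tokens_per_step must be positive")
    return math.ceil(reasoning_tokens / tokens_per_step)


def step_duration_seconds(tokens_per_step: int, decode_tokens_per_second: float) -> float:
    """Wall-clock length of one step for a given (assumed) decoding speed."""
    if decode_tokens_per_second <= 0:
        raise ValueError("decode_tokens_per_second must be positive")
    return tokens_per_step / decode_tokens_per_second
```

As an illustrative case, a 20,000-token reasoning trace under an 8,000-token step budget spans three environment steps. Every one of those steps is a chance for the observation to go stale.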

This design is not just a benchmarking trick. It exposes the central operational reality of agent deployment: intelligence is not only about whether a model can derive the right answer; it is about whether the answer arrives inside the action window.

Four agent styles enter the clock; only one treats latency as structure

The paper compares several ways to deploy language models as real-time agents. The comparison is more informative than the headline win, because each failure mode maps neatly onto a business architecture mistake.

| Agent design | What it optimises | Where it breaks | Business analogue |
| --- | --- | --- | --- |
| Reactive agent | Timely response | Myopic decisions under high cognitive load | A copilot that always replies quickly but misses the strategic context |
| Planning agent | Deeper reasoning | Stale plans under time pressure | An operations assistant that produces good plans after conditions have changed |
| Code-as-policy | Executable adaptation | Works best when the task has a compact algorithmic structure | A brittle automation script pretending to be judgement |
| AgileThinker | Parallel reaction and planning | Requires careful time-budget allocation and access to useful reasoning traces | A latency-aware orchestration layer with fast fallback and slow guidance |

Reactive agents produce one action per environment step. They stay inside the update cycle, which is exactly what makes them attractive in production. The cost is reasoning depth. When the task is simple, that may be enough. When the agent must anticipate a trap, coordinate with another actor, or sacrifice immediate reward for future safety, speed alone starts looking rather underdressed.

Planning agents do the opposite. They reason across multiple steps and generate a plan. This is powerful under relaxed time constraints. It is also dangerous when the world changes during inference. The plan may be logically coherent for the state the agent saw when it began thinking, while being actively harmful for the state that exists when it finally acts.

Code-as-policy is the seductive middle option: ask a reasoning model to generate executable code that maps observations to actions. In Freeway, where the state space is small and breadth-first search can solve the core navigation problem, this works comparatively well. In Snake and Overcooked, the generated code tends to collapse into shallow heuristics. Snake needs longer-horizon survival judgement. Overcooked needs context, partner inference, and goal prioritisation. The code can encode local rules; it struggles to maintain coherent strategy in a changing, socially entangled environment. Shocking discovery: not every problem becomes clean because someone emitted Python.

AgileThinker takes a different route. It runs two threads. A planning thread performs extended reasoning over a frozen observation and streams its partial reasoning. A reactive thread, operating under strict time constraints, sees the latest environment state and can consult the planning thread’s partial trace. The reactive thread does not wait for the grand plan to complete. It uses unfinished reasoning as strategic guidance, not as a commandment.

That is the mechanism worth paying attention to. AgileThinker is not merely “fast model plus slow model.” The important move is that the fast component can access intermediate slow reasoning before it becomes stale or complete. In a real-time world, incomplete but timely guidance may be more valuable than complete but late analysis. Business systems already know this under another name: decision support.
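A rough sketch of that mechanism follows. It is not the authors' implementation: the two "threads" are interleaved in a single deterministic loop rather than run in parallel, the planner is a hard-coded generator standing in for a reasoning model, and names such as `planner`, `reactive`, and the `avoid:` hint format are invented for the example.

```python
from typing import Dict, Iterator, List


def planner(snapshot: Dict[str, str]) -> Iterator[str]:
    """Slow thread: reasons over a frozen snapshot, streaming one thought at a time."""
    yield f"scanning board, food is to the {snapshot['food']}"
    yield "avoid:left (corridor is a dead end)"
    yield "plan: right, up, up"


def reactive(state: Dict[str, str], trace: List[str]) -> str:
    """Fast thread: latest state plus whatever partial trace exists right now."""
    greedy = state["food"]  # shallow default: head straight for the food
    if any(thought.startswith(f"avoid:{greedy}") for thought in trace):
        # Partial guidance is applied to the fresh state, not obeyed blindly.
        return "right" if greedy == "left" else "left"
    return greedy


def run(states: List[Dict[str, str]], thoughts_per_step: int) -> List[str]:
    """Interleave the two threads deterministically, one environment step at a time."""
    stream = planner(states[0])  # planning observation is frozen at the first state
    trace: List[str] = []
    actions: List[str] = []
    for state in states:
        for _ in range(thoughts_per_step):  # planner progress during this step
            thought = next(stream, None)
            if thought is not None:
                trace.append(thought)
        actions.append(reactive(state, trace))  # acts without waiting for the full plan
    return actions
```

With `thoughts_per_step=0` the system is purely reactive and keeps walking left into the dead end. With one streamed thought per step, the second action already benefits from the warning, a full step before the planner states its final plan.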

Reactive agents survive the clock but lose the plot

The reactive baseline is the cleanest correction to the “just make it faster” myth.

In the paper’s cognitive-load experiments, with time pressure fixed at 8k tokens per environment step, the reactive agent based on DeepSeek-V3 degrades sharply as tasks get harder. The authors report its average score falling from 0.89 to 0.15 as cognitive load rises, while AgileThinker’s falls only from 0.88 to 0.50. The exact values matter less than the shape: the reactive agent is not primarily destroyed by the clock. It is destroyed by the task becoming cognitively expensive.

This is the pattern businesses should recognise. Many production “real-time” systems are effectively reactive wrappers around a language model. They handle the last message, the latest observation, the immediate instruction. That is useful for responsiveness. It is also how agents become locally competent and globally foolish.

The paper’s Snake case study makes the point nicely. A reactive agent greedily pursues nearby food and falls into a predictable trap. AgileThinker avoids the trap because the planning thread has identified the longer-term survival issue, while the reactive thread still responds to the current state. The lesson is not that every agent needs to play Snake. Thankfully, the economy has suffered enough. The lesson is that immediate reward, immediate risk, and future viability often conflict.

In customer operations, the equivalent is a support agent that refunds quickly but violates policy. In infrastructure monitoring, it is an incident assistant that suppresses the noisy alert while missing the cascading dependency. In procurement, it is an agent that accepts the cheapest supplier response without noticing delivery risk. Reactive systems can be fast and still expensive.

Planning agents solve yesterday’s problem beautifully

The planning baseline fails in the opposite direction, and its failure is more flattering to the model. It often reasons well. It simply reasons too slowly for a moving state.

In the paper’s time-pressure experiments, with cognitive load fixed at medium difficulty, planning agents perform strongly when the time budget is relaxed. But as pressure tightens, their average score drops from 0.92 to 0.05. AgileThinker drops from 0.90 to 0.58 over the same pressure range. That is not a small implementation wrinkle. It is the difference between a system that degrades and a system that effectively disappears.

The mechanism is straightforward. A planning agent begins reasoning from an observation. While it thinks, the environment advances. By the time it emits a plan, the assumptions may no longer hold. The agent then executes a plan for a state that no longer exists. This is not bad reasoning in the abstract. It is good reasoning with an expired timestamp.

This matters for any workflow where the state changes independently of the model: inventory, ticket queues, security events, financial books, logistics routes, hospital capacity, field-service dispatch, manufacturing lines. If an AI system reasons as though its input snapshot remains valid until completion, it is silently making a temporal assumption. In stable back-office analysis, that assumption may be acceptable. In live operations, it becomes an unpriced risk.

The paper’s contribution is therefore partly conceptual. It forces agent evaluation to ask not only “was the plan valid?” but “was it still valid when acted upon?” That is the question many demos prefer to avoid, for obvious reasons involving applause.
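The two questions can be separated in a few lines. The sketch below uses a made-up one-lane world (a car wrapping around a lane of `LANE_LENGTH` cells); the geometry and function names are assumptions, not the benchmark's Freeway rules.

```python
from typing import Tuple

LANE_LENGTH = 10  # assumed toy geometry


def car_column(start: int, speed: int, t: int) -> int:
    """Column of a car that wraps around a lane of LANE_LENGTH cells."""
    return (start + speed * t) % LANE_LENGTH


def crossing_is_safe(start: int, speed: int, player_col: int, t: int) -> bool:
    """True if the car is not on the player's column at time t."""
    return car_column(start, speed, t) != player_col


def validity_at_plan_and_act(
    start: int, speed: int, player_col: int, thinking_steps: int
) -> Tuple[bool, bool]:
    """Return (valid for the snapshot, valid once the thinking delay has passed)."""
    planned = crossing_is_safe(start, speed, player_col, t=0)
    acted = crossing_is_safe(start, speed, player_col, t=thinking_steps)
    return planned, acted
```

A car starting at column 2 with speed 1 is nowhere near a player at column 5 when the plan is made. Three steps of thinking later, it is exactly there. A static benchmark scores the first value; a live one scores the second.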

Code-as-policy is useful when the world agrees to be algorithmic

The appendix’s code-as-policy analysis is not a side show. It explains why “convert reasoning into executable policy” is sometimes powerful and sometimes performative.

In Freeway, generated code can implement breadth-first search across a small state space. The environment is dynamic but structurally simple. Cars move according to known rules. The player’s movement options are limited. A compact algorithm can search feasible paths. Here, code-as-policy is a sensible tool.
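For a sense of why this works, here is a small breadth-first search over (row, time) states in a simplified Freeway-like world. It is a sketch under stated assumptions, not the code the models generated in the paper: one car per lane, wrap-around lanes, a fixed player column, and a collision check only on the cell the player arrives in.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

LANE_LENGTH = 8  # assumed toy geometry
PLAYER_COL = 4   # the player only moves between rows, never sideways


def occupied(cars: Dict[int, Tuple[int, int]], lane: int, t: int) -> bool:
    """True if the car in `lane` sits on the player's column at time t."""
    if lane not in cars:
        return False
    start, speed = cars[lane]
    return (start + speed * t) % LANE_LENGTH == PLAYER_COL


def bfs_plan(
    cars: Dict[int, Tuple[int, int]], goal_row: int, max_steps: int = 32
) -> Optional[List[str]]:
    """Shortest action sequence from row 0 to `goal_row` that never lands on a car."""
    moves = {"up": 1, "stay": 0, "down": -1}
    start = (0, 0)  # (row, time)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (row, t), path = queue.popleft()
        if row == goal_row:
            return path
        if t >= max_steps:
            continue
        for name, delta in moves.items():
            next_row = row + delta
            if not 0 <= next_row <= goal_row:
                continue
            state = (next_row, t + 1)
            if state in seen or occupied(cars, next_row, t + 1):
                continue
            seen.add(state)
            queue.append((state, path + [name]))
    return None  # no safe route within max_steps
```

With an empty road the plan is three straight moves up. Put a car in lane 1 that reaches the player's column at `t=1`, and the search waits one tick first. The whole policy fits on a page because the world is this regular.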

Snake and Overcooked are less cooperative. In Snake, the generated policies tend to rely on limited-depth search, which misses longer-term traps or distant opportunities. In Overcooked, generated code often inspects narrow state variables, such as what the agent is currently holding, while failing to infer partner intent or maintain stable priorities. The paper gives examples where the code either stays idle in a bad state or oscillates between incompatible objectives.

The business translation is direct. Code generation is excellent when the task has a clear formal structure and a stable objective. It is weaker when the task requires judgement under partial context, changing priorities, and coordination with other actors. That does not make code-as-policy useless. It simply means it should not be confused with adaptive reasoning. A rules engine is still a rules engine, even when a language model wrote the rules five seconds ago.

AgileThinker wins because it turns unfinished thought into operational signal

AgileThinker’s advantage comes from a pragmatic compromise: let slow reasoning continue, but do not let action wait for it.

The planning thread streams reasoning about longer-horizon structure. The reactive thread observes the latest state and uses whatever partial planning trace is available. This gives the system three advantages at once.

First, the reactive thread remains timely. It still produces actions inside the environment’s update rhythm.

Second, the planning thread is not forced into artificial truncation every step. It can keep working on strategic structure across time.

Third, the reactive thread can adapt stale guidance rather than obey it. The prompts explicitly frame planning output as guidance that may no longer perfectly match the current situation. That is an important design detail. The system does not pretend old reasoning is automatically current. It treats it as a strategic reference to be reconciled with fresh observation.

This is a useful template for business AI orchestration. A fast front-line component can handle the immediate decision, while a slower reasoning component maintains situation models, scenario plans, policy interpretations, or multi-step strategies. The fast layer should not merely ask “what is happening now?” It should ask “what does the slower layer currently believe, and which parts of that belief still apply?”
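One way to picture that second question is a small reconciliation step. In AgileThinker the reconciling is done by the reactive model reading the trace through its prompt; the rule-based version below is a stand-in for illustration, and `Hint`, `still_applicable`, `choose_action`, and the `max_age` threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Hint:
    advice: str                 # what the slow layer recommended
    assumes: Dict[str, object]  # state facts the recommendation depends on
    made_at_step: int           # when the underlying snapshot was taken


def still_applicable(hint: Hint, state: Dict[str, object], step: int, max_age: int) -> bool:
    """A hint survives only if it is recent enough and its assumptions still hold."""
    if step - hint.made_at_step > max_age:
        return False
    return all(state.get(key) == value for key, value in hint.assumes.items())


def choose_action(
    state: Dict[str, object],
    step: int,
    hints: List[Hint],
    fallback: str,
    max_age: int = 3,
) -> str:
    """Prefer the newest hint that still matches reality; otherwise fall back."""
    for hint in reversed(hints):  # hints are assumed to be stored oldest first
        if still_applicable(hint, state, step, max_age):
            return hint.advice
    return fallback
```

The design choice worth copying is that every piece of slow guidance carries its own assumptions and timestamp, so the fast layer can reject it cheaply instead of trusting it by default.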

That is a richer architecture than “use the big model when you have time and the small model when you do not.” It is closer to maintaining a live planning substrate under a reactive controller.

The evidence is strongest where pressure and complexity meet

The paper’s experiments are not all serving the same purpose. Some are main evidence. Some are sensitivity checks. Some are diagnostic extensions. Treating them as one pile of “results” would be efficient and intellectually lazy, a beloved combination.

| Evidence item | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Cognitive-load and time-pressure experiments across Freeway, Snake, and Overcooked | Main evidence | Single-paradigm agents fail differently as complexity and latency constraints rise; AgileThinker degrades more gracefully | That the same magnitude will hold in every real-world domain |
| Reactive-budget sweep in AgileThinker | Sensitivity / resource-allocation test | Performance depends on allocating enough budget to the reactive thread without starving planning | That one universal budget setting exists |
| Code-as-policy appendix | Diagnostic comparison | Code policies help when tasks are compactly algorithmic and struggle with heuristic or coordination-heavy settings | That code-as-policy is generally inferior |
| Significance testing | Statistical support for main trend | AgileThinker’s advantage generally becomes significant as load and pressure increase | That every individual environment condition is equally conclusive |
| DeepSeek-V3.2 and Gemini-2.5-Flash extensions | Robustness / exploratory generalisation | The fast/slow combination trend is not confined to the exact DeepSeek-V3 plus R1 setup | Full cross-model generality, especially because Gemini lacks accessible reasoning traces |
| Limited-throughput experiment | Resource-constraint robustness | Benefits remain even when parallelism is replaced by concurrent switching | That infrastructure cost is irrelevant |
| Wall-clock experiments | Deployment relevance check | Token-count time abstraction tracks real inference time well enough in the tested DeepSeek API setting | That token count is a universal latency proxy across all providers, hardware, and serving stacks |

The complete appendix tables add useful texture. At medium difficulty under rising time pressure, planning collapses dramatically in Snake and Overcooked when token budgets tighten. AgileThinker remains substantially higher, though not perfect. In Freeway, code-as-policy is competitive because the task admits a neat algorithm. In Overcooked, AgileThinker is especially strong relative to planning under tighter pressure, because coordination punishes stale deliberation.

The wall-clock experiment is also important, but should not be over-read. The authors test whether token-count simulation corresponds to physical inference time using DeepSeek’s official API and then evaluate agents under real elapsed time. AgileThinker still wins in that setting: in Freeway it scores 0.88 versus 0.24 for the reactive agent and 0.12 for the planning agent; in Snake, 0.45 versus 0.37 and 0.04; in Overcooked, 0.89 versus 0.57 and 0.00. This supports the practical relevance of the token-time abstraction in their setup. It does not make token count a law of nature. Providers, batching, caching, hardware, streaming behaviour, and tool calls all have opinions.

Budget allocation is not housekeeping; it is the control surface

One of the paper’s most operationally useful sections asks how much token budget should be allocated to the reactive thread. Set it too low and the reactive agent cannot process the latest state plus planning guidance well enough. Set it too high and the system wastes time that the planning thread could have used productively.

The authors find that performance peaks when the reactive budget roughly matches the natural upper bound of the reactive model’s token usage. Freeway needs a higher reactive budget, around 5k tokens in their setting, while Snake and Overcooked peak around 2k. More importantly, AgileThinker remains ahead of single-paradigm baselines across broad budget ranges.
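The trade-off can be laid out as arithmetic before any model is involved. The sketch below assumes a simple schedule in which the reactive thread must start `reactive_budget` tokens before the step ends, so everything the planner streams before that point is visible to it. That schedule, the function names, and the `typical_reactive_tokens` parameter are assumptions for illustration; the sketch computes budgets only, not scores.

```python
from typing import Dict, List


def planning_head_start(step_budget: int, reactive_budget: int) -> int:
    """Tokens the planner can stream in a step before the reactive thread must start."""
    if not 0 < reactive_budget <= step_budget:
        raise ValueError("reactive_budget must be in (0, step_budget]")
    return step_budget - reactive_budget


def sweep(
    step_budget: int, reactive_budgets: List[int], typical_reactive_tokens: int
) -> List[Dict[str, object]]:
    """Tabulate both sides of the trade-off for each candidate reactive budget."""
    rows: List[Dict[str, object]] = []
    for budget in reactive_budgets:
        rows.append({
            "reactive_budget": budget,
            "planning_head_start": planning_head_start(step_budget, budget),
            # True when the reactive thread would be cut off mid-thought.
            "reactive_truncated": budget < typical_reactive_tokens,
        })
    return rows
```

Under an 8k step budget and a reactive model that typically needs about 2k tokens, a 1k reactive budget truncates the fast thread, while a 5k budget leaves it with only 3k tokens of fresh planning to read. The sweet spot sits near the reactive model's natural usage, which is the shape the authors report.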

This is a practical lesson disguised as an ablation. In production systems, latency budgets should not be treated as arbitrary API limits. They are design parameters. A fraud-monitoring agent, dispatch assistant, or industrial-control copilot needs empirical tuning around three questions:

  1. How quickly does the environment change?
  2. How much reasoning is needed for safe immediate action?
  3. How much of the slower strategic context remains useful across multiple updates?

The answer will differ by domain. A customer chat agent may tolerate seconds. A market-making assistant may not. A hospital triage support system may have strict escalation windows. A warehouse robot has physical update cycles and safety constraints. The architecture should expose these budgets explicitly instead of hiding them behind “max tokens” and vibes.

What Cognaptus infers for business architecture

The paper directly shows that, in controlled real-time game environments, dual-thread fast/slow reasoning can outperform reactive-only and planning-only LLM agents under rising cognitive load and time pressure. It also shows that token-count simulation can approximate wall-clock timing in the tested DeepSeek deployment, and that the advantage persists in several robustness and extension settings.

Cognaptus infers a broader architecture principle: latency-sensitive agent systems should separate fast action, slow reasoning, and budget governance as first-class components.

That inference is not the same as saying “deploy AgileThinker unchanged into your factory.” Please do not let a Snake benchmark steer a crane. The right takeaway is architectural.

A production version would need something like this:

| Layer | Role | Practical requirement |
| --- | --- | --- |
| Reactive controller | Produces timely actions or recommendations | Hard latency budget, fallback action, safety guardrails |
| Planning substrate | Maintains multi-step strategy and context | Longer horizon, streaming intermediate state, refresh logic |
| Reconciliation layer | Decides which slow reasoning remains valid | State-drift checks, confidence flags, contradiction handling |
| Budget manager | Allocates compute across urgency levels | SLA-aware token, model, and tool-use limits |
| Evaluation harness | Tests under live-state evolution | Latency-adjusted success metrics, not static accuracy only |
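The first row is the easiest to sketch. Below is one possible shape for a hard latency budget with a fallback action, using Python's standard `concurrent.futures`. The wrapper name `act_with_deadline` and the action strings are hypothetical, and a real controller would also need cancellation, logging, and safety checks that are omitted here.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable


def act_with_deadline(decide: Callable[[], str], deadline_s: float, fallback: str) -> str:
    """Return decide()'s answer if it arrives in time, else the fallback action."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(decide)
    try:
        return future.result(timeout=deadline_s)
    except FutureTimeout:
        return fallback  # the late answer is discarded, not applied
    finally:
        # Do not block the caller; the slow worker finishes in the background.
        pool.shutdown(wait=False)


def slow_decision() -> str:
    """Stand-in for a model call that overruns its window."""
    time.sleep(0.5)
    return "reroute"
```

Calling `act_with_deadline(slow_decision, 0.05, "hold")` yields the fallback, because the answer misses its window. The point of the pattern is that the deadline lives in the architecture, not in the hope that the model happens to be quick today.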

This matters because many enterprise agent pilots still evaluate the wrong thing. They test whether the model can produce a good answer to a frozen case. That is useful for document review, static analysis, and batch reasoning. It is inadequate for workflows where the case changes while the model is thinking.
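The gap between the two kinds of evaluation is easy to show with a toy metric. The `Decision` record and both scoring functions below are illustrative assumptions, not a metric defined in the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Decision:
    correct_for_snapshot: bool  # right answer for the state the agent saw
    latency_s: float            # how long the agent took to produce it
    deadline_s: float           # how long the answer stays actionable


def static_accuracy(decisions: List[Decision]) -> float:
    """Frozen-case accuracy: ignores when the answer arrived."""
    return sum(d.correct_for_snapshot for d in decisions) / len(decisions)


def latency_adjusted_accuracy(decisions: List[Decision]) -> float:
    """Counts an answer only if it is both correct and inside its action window."""
    on_time = [d.correct_for_snapshot and d.latency_s <= d.deadline_s for d in decisions]
    return sum(on_time) / len(decisions)
```

An agent that is right three times out of four but late on one of those three scores 0.75 on the frozen benchmark and 0.5 on the live one. Only the second number describes what the operations team will experience.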

For business leaders, the question should become: where does our process require correctness under motion? If the answer is “nowhere,” simple agent architectures may be enough. If the answer is “incident response, dispatch, customer escalation, inventory allocation, robotic action, cyber defence, financial execution, or live partner coordination,” then latency-aware reasoning architecture is not fancy. It is table stakes, though naturally someone will try to sell it as a revolution with a gradient logo.

Where the result stops

The boundaries are clear and important.

First, the environments are controlled games. They are well chosen because they isolate hazards, opportunities, and partner dynamics. They are not substitutes for messy enterprise deployments with ambiguous goals, partial observability, compliance constraints, human overrides, and unhappy legacy systems lurking in the basement.

Second, the strongest experiments use DeepSeek models because AgileThinker needs access to transparent reasoning trajectories. That matters. Many commercial model providers do not expose intermediate reasoning traces. Without access to partial reasoning, the architecture must be approximated through summaries, intermediate plans, tool states, external scratchpads, or other safe planning artifacts. The principle may generalise, but the implementation changes.

Third, the human dual-process analogy should not be over-romanticised. The paper itself is careful here. AgileThinker is inspired by fast and slow cognition, but it is not evidence that the system models human cognition. It is an engineering design that happens to use a useful metaphor. Metaphors are ladders. Climb them, then stop hugging them.

Fourth, real deployment adds costs the benchmark does not fully price: serving throughput, model concurrency, streaming reliability, safety constraints, observability, and auditability. The limited-throughput experiment helps by showing that concurrent switching still performs strongly compared with baselines, but infrastructure economics remain domain-specific.

The honest business conclusion is therefore not “AgileThinker is ready-made autonomy.” It is: real-time reasoning should be evaluated and engineered as a temporal control problem, not patched after the fact with faster endpoints.

The agent that thinks best may be the one that knows when not to finish thinking

The most interesting idea in this paper is almost anti-academic: sometimes an agent should act before the reasoning is complete.

That feels uncomfortable because AI progress has trained everyone to worship longer chains, deeper plans, and more test-time compute. But in live environments, unfinished reasoning can be useful if it is streamed, contextualised, and checked against the latest state. Complete reasoning can be useless if it arrives after the world has moved on.

AgileThinker points toward agents that do not merely think fast or think slow. They allocate cognition over time. They preserve strategic reasoning without sacrificing immediate responsiveness. They understand that a plan is not a sacred artifact; it is a perishable asset.

For companies building real automation, that is the sober lesson. Do not ask only whether an agent can reason. Ask whether it can reason under motion, whether it can act with partial insight, and whether its architecture makes latency visible enough to govern.

The present will not pause for the model. Rude, perhaps, but useful.

Cognaptus: Automate the Present, Incubate the Future.


  1. Yule Wen, Yixin Ye, Yanzhe Zhang, Diyi Yang, and Hao Zhu, “Real-Time Reasoning Agents in Evolving Environments,” arXiv:2511.04898, 2025, https://arxiv.org/html/2511.04898