About Time: When Reinforcement Learning Finally Learns to Wait

Waiting is a decision.

That sounds obvious to anyone who has watched a warehouse robot pause at an intersection, a trading system delay execution, or an autonomous vehicle slow down before a pedestrian crossing. In the real world, “do the task” is rarely the whole instruction. The operational instruction is closer to: do the task, in this order, not before this condition, not after that deadline, and preferably without wasting time while pretending that nothing is happening.

Classic reinforcement learning is not naturally fluent in that language. It can learn from rewards, and it can learn sequences of behavior, but timing is often smuggled in through state features, fixed time steps, or crude penalties. That works until the difference between “wait three seconds” and “wait four seconds” is not cosmetic but contractual.

The paper “About Time: Model-free Reinforcement Learning with Timed Reward Machines” addresses exactly this gap.¹ Its central move is simple, which is usually a good sign: if a reward specification needs time, give it clocks. Not metaphorical clocks. Actual clocks with guards, resets, deadlines, delay actions, and rewards that can accumulate while the agent waits.

The result is a formalism called Timed Reward Machines, or TRMs. The paper is not selling a new neural architecture, thank heavens. It is doing something less fashionable and more useful: fixing the reward language so reinforcement learning can represent time-sensitive objectives without pretending that “eventually correct” is the same as correct.

Ordinary reward machines know the plot, not the pacing

Reward Machines already solve an important problem in reinforcement learning. Many tasks are not Markovian in the simple sense. The reward for an action may depend not only on the current state but also on what has happened before. Pick up the passenger, then visit a location, then drop the passenger off. Complete objective A before objective B. Avoid repeating a mistake. These are task histories, not single-state snapshots.

A reward machine handles this by adding a finite-state automaton beside the environment. The automaton tracks progress through the task and issues rewards as the agent moves through stages. This makes the reward structure explicit, interpretable, and often easier to learn.

But there is a missing dimension. A normal reward machine can say:

Pick up the passenger, then go to the green location, then drop the passenger at the destination.

It cannot naturally say:

Pick up the passenger after waiting long enough, then reach the green location within the next timing window, then drop the passenger off before the deadline.

That difference is not academic decoration. It is the difference between a workflow engine and an operations system. Timing constraints are everywhere: service-level agreements, safety dwell times, minimum cooling periods, traffic rules, manufacturing cycle times, clinical escalation windows, order execution limits. If the reward formalism cannot represent timing directly, the agent learns around timing indirectly. That is where many elegant systems quietly become expensive debugging projects.

Timed Reward Machines extend reward machines with the machinery needed to express this kind of objective.

Standard reward machine	Timed reward machine
Tracks task stage	Tracks task stage and clock valuations
Rewards event sequences	Rewards event sequences under timing constraints
Cannot directly express “wait at least 10 seconds”	Can express minimum delays, deadlines, and time windows
Usually no explicit delay action	Agent can choose how long to wait before acting
Rewards attached to automaton transitions	Rewards can be attached to states and transitions

The misconception worth killing early is that ordinary time penalties are enough. They are not. A time penalty can encourage speed. It cannot cleanly represent “slow down for at least three seconds,” “act before ten seconds,” or “wait inside this interval, but not outside it.” Timing is not merely a cost coefficient. It is part of the task logic.

The mechanism shift: reward now depends on what happened and when

The paper defines a TRM by augmenting a reward machine with clocks. These clocks evolve as time passes. Transitions between TRM states are guarded by clock constraints. Some transitions reset clocks. Rewards can be attached both to states and transitions, meaning the agent can receive a lump-sum reward for completing a timed step and also accumulate costs or rewards while waiting in a state.

That last detail matters. In many operational systems, nothing visible may happen while waiting, but value is still changing. A truck sitting at a dock burns time. A robot waiting in a safe zone may be doing the right thing. A trading algorithm delaying execution may reduce market impact or miss the price. A hospital workflow that waits too long before escalation creates risk. Waiting is not neutral; it has economics.

The paper’s RL environment therefore includes not only ordinary actions but also delay actions. The agent chooses both what to do and how long to wait before doing it. The TRM then reads the resulting timed trajectory and assigns rewards according to the timing specification.
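A minimal sketch of that reading, under simplifying assumptions: a single clock, one delay followed by one event per step, and a machine that stays put when no guard is satisfied. The names `Edge`, `TimedRewardMachine`, and `step`, and all reward values, are illustrative and not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Edge:
    source: str
    event: str
    guard: Callable[[float], bool]  # constraint on the clock value
    target: str
    reward: float                   # lump-sum transition reward
    reset: bool = False             # reset the clock when this edge fires?


@dataclass
class TimedRewardMachine:
    initial: str
    edges: List[Edge]
    # Reward rate per time unit spent waiting in each state (assumed linear).
    state_reward: Dict[str, float] = field(default_factory=dict)

    def step(self, state: str, clock: float, delay: float,
             event: str) -> Tuple[str, float, float]:
        """Wait `delay`, then observe `event`; return (state, clock, reward)."""
        reward = self.state_reward.get(state, 0.0) * delay
        clock += delay
        for e in self.edges:
            if e.source == state and e.event == event and e.guard(clock):
                return e.target, (0.0 if e.reset else clock), reward + e.reward
        # Assumption: with no enabled edge the machine stays where it is.
        return state, clock, reward


# Hypothetical task: pickup only counts between 3 and 5 time units,
# and waiting in "start" costs 0.1 per time unit.
trm = TimedRewardMachine(
    initial="start",
    edges=[Edge("start", "pickup", lambda x: 3 <= x <= 5, "carrying", 10.0, reset=True)],
    state_reward={"start": -0.1},
)
too_early = trm.step("start", 0.0, 1.0, "pickup")
on_time = trm.step("start", 0.0, 3.0, "pickup")
```

In this toy machine the early pickup leaves the agent in `start` with only the waiting cost, while the on-time pickup moves to `carrying`, resets the clock, and pays the transition reward net of the waiting cost.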

A useful mental model is:

Environment state
      +
Task-progress state
      +
Clock values
      ↓
Timed product state for Q-learning

This is the core mechanism. The learning problem is not just “which environment action should I take?” It becomes “which action should I take, from this task stage, with these clock values, after which delay?”

That sounds larger because it is larger. The paper’s technical work is mostly about making that enlarged problem finite, learnable, and not completely ridiculous.

Digital clocks make learning finite, but timing becomes coarse

The first setting is the digital-clock case. Time is treated as integer-valued. Clock values are bounded using the largest constants appearing in the timing guards; beyond those constants, larger clock values behave equivalently for guard satisfaction. This lets the authors construct a finite cross-product MDP containing the environment state, the TRM state, and bounded clock valuations.
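The bounding argument can be sketched in a few lines. This is an illustration under an assumed encoding, where each clock takes the values 0 to its largest guard constant plus one extra value meaning "beyond the constant". The helper names `cap` and `product_states` are hypothetical.

```python
from itertools import product


def cap(clock: int, max_constant: int) -> int:
    """Collapse every clock value above the largest guard constant into one value."""
    return min(clock, max_constant + 1)


def product_states(env_states, trm_states, max_constants):
    """Enumerate (environment state, TRM state, capped clock valuation) triples."""
    # Each clock ranges over 0..c plus a single "beyond c" value.
    clock_ranges = [range(c + 2) for c in max_constants]
    return [
        (s, u, v)
        for s in env_states
        for u in trm_states
        for v in product(*clock_ranges)
    ]


# Hypothetical sizes: 4 environment states, 2 TRM states,
# two clocks whose largest guard constants are 5 and 3.
states = product_states(range(4), ["u0", "u1"], [5, 3])
```

With these made-up sizes the product has 4 × 2 × 7 × 5 = 280 states: finite, but already much larger than the 4-state environment it started from.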

Once this cross-product exists, tabular Q-learning can be applied. The paper proves that an optimal deterministic positional policy exists in this finite product and that Q-learning converges under the standard assumptions: every state-action pair is visited infinitely often and the learning rate decays at a suitable rate.
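To make the idea concrete, here is a toy version of Q-learning over a capped digital clock. It is a sketch, not the paper's experimental setup. The task (act only when the clock is between 3 and 5), the rewards, the uniformly random behaviour policy, and the 1/n learning rate are all assumptions chosen to keep the example small.

```python
import random

MAX_CONST = 5            # largest constant in the guard 3 <= x <= 5
BEYOND = MAX_CONST + 1   # single capped value standing for "clock > 5"
ACTIONS = ("wait", "act")
GAMMA = 0.9


def step(clock, action):
    """Return (next_clock, reward, done) for the toy timed task."""
    if action == "wait":
        return min(clock + 1, BEYOND), -0.1, False  # one tick of waiting cost
    if 3 <= clock <= 5:
        return clock, 10.0, True                    # guard satisfied
    return clock, -1.0, True                        # acted outside the window


def train(episodes=2000, seed=0):
    rng = random.Random(seed)
    q = {(c, a): 0.0 for c in range(BEYOND + 1) for a in ACTIONS}
    visits = dict.fromkeys(q, 0)
    for _ in range(episodes):
        clock = 0
        for _ in range(20):                         # cap the episode length
            action = rng.choice(ACTIONS)            # off-policy: random behaviour
            nxt, reward, done = step(clock, action)
            best_next = 0.0 if done else max(q[(nxt, a)] for a in ACTIONS)
            visits[(clock, action)] += 1
            alpha = 1.0 / visits[(clock, action)]   # decaying learning rate
            q[(clock, action)] += alpha * (reward + GAMMA * best_next - q[(clock, action)])
            if done:
                break
            clock = nxt
    return q


q = train()
policy = {c: max(ACTIONS, key=lambda a: q[(c, a)]) for c in range(MAX_CONST + 1)}
```

The greedy policy read off the table waits at clock values 0 to 2 and acts from 3 to 5. The clock is part of the state, so "wait" is learned like any other action.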

This is the clean case. It is also the case with an obvious operational compromise. Digital clocks are easy to learn over because they discretize time. But discretized time can be too blunt when the reward depends on precise timing. If the business rule cares about being just inside a time window, integer time may miss the economically relevant edge.

The digital-clock setting is therefore best read as the baseline timed version of reward-machine learning. It proves that timed reward structure can be folded into model-free RL in a principled way. It does not solve every timing problem.

Real time is more expressive, and therefore more annoying

The second setting is real-time. Here, clock values and delay actions can be continuous. This is closer to many real processes. It is also where reinforcement learning inherits the usual curse of continuous choice, plus a few timing-specific headaches for flavor.

The paper makes an important theoretical point: in real-time semantics, an optimal policy may not exist. The agent may be able to get closer and closer to the best value by choosing delays arbitrarily near a guard boundary, while the exact boundary value itself is not allowed. In other words, there may be a supremum but no attainable maximum. A system designer might call this “edge behavior.” A theorist calls it “no optimal policy.” Both are correct; one just sounds less like a migraine.
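A small worked example, constructed here for illustration rather than taken from the paper, shows how this happens. Suppose a transition pays 10, its guard is the strict constraint x > 1, waiting costs 1 per time unit, and discounting is ignored. Starting from x = 0 and delaying by d before taking the transition gives:

```latex
V(d) = 10 - d \quad \text{for } d > 1,
\qquad
\sup_{d > 1} V(d) = 9,
\qquad
V(d) < 9 \ \text{for every admissible } d .
```

A delay of d = 1 + ε earns 9 − ε, and a smaller ε always does better, so no single delay is optimal even though the value 9 can be approached arbitrarily closely.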

A naive fix is uniform discretization: cut continuous time into small intervals and learn over those. This works in principle but scales badly. Finer discretization gives better temporal precision, but the number of clock valuations and delay actions grows quickly. You pay for precision with state-action explosion.
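A back-of-the-envelope count makes the trade-off visible. The counting convention below is an assumption for illustration: grid points from 0 to the largest guard constant plus one "beyond" value per clock, and delays from 0 to the largest constant. The function name `discretized_sizes` is hypothetical.

```python
def discretized_sizes(max_constant: float, num_clocks: int, step: float):
    """Count clock valuations and delay actions on a uniform grid of width `step`."""
    intervals = round(max_constant / step)
    points_per_clock = intervals + 2      # 0, step, ..., max_constant, plus "beyond"
    valuations = points_per_clock ** num_clocks
    delay_actions = intervals + 1         # delays 0, step, ..., max_constant
    return valuations, delay_actions


# Two clocks with a largest guard constant of 10, at three grid widths.
coarse = discretized_sizes(10, 2, 1.0)
medium = discretized_sizes(10, 2, 0.1)
fine = discretized_sizes(10, 2, 0.01)
```

Under this convention, going from a step of 1 to a step of 0.01 takes two clocks from 144 valuations and 11 delays to 1,004,004 valuations and 1,001 delays, before multiplying by environment and task states.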

The paper’s more interesting answer is corner-point abstraction, borrowed from timed automata. The intuition is that many valuable timed behaviors happen near the boundaries of timing regions: just before a deadline, just after a minimum delay, near the corner where a guard changes truth value. Instead of exploring every possible real-valued time, the abstraction focuses learning around region corners.
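To give a feel for what "region corners" means, the sketch below computes the integer corner points of the clock region containing a given valuation, following the usual region construction from timed automata. It makes several assumptions: clock values stay below the largest guard constant, fractional parts are compared as plain floats (fragile for values that should tie exactly), and the function name `corner_points` is hypothetical. It does not reproduce how the paper wires corners into learning.

```python
import math
from typing import List, Tuple


def corner_points(valuation: Tuple[float, ...]) -> List[Tuple[int, ...]]:
    """Integer corners of the closure of the region containing `valuation`."""
    floors = [math.floor(x) for x in valuation]
    fracs = [x - f for x, f in zip(valuation, floors)]
    # Distinct non-zero fractional parts, largest first: the clocks with the
    # largest fractional part are the first to reach their next integer.
    levels = sorted({fr for fr in fracs if fr > 0}, reverse=True)
    corners = [tuple(floors)]
    for level in levels:
        corners.append(
            tuple(f + (1 if fr >= level else 0) for f, fr in zip(floors, fracs))
        )
    return corners


one_clock = corner_points((1.5,))
two_clocks = corner_points((0.3, 0.7))
on_boundary = corner_points((2.0, 0.5))
```

For two clocks at (0.3, 0.7) the region is 0 < x < y < 1 and its corners are (0, 0), (0, 1), and (1, 1). Every valuation in that region maps to the same three corners, which is why a handful of corner configurations can stand in for infinitely many real-valued ones.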

This is the mechanism-first heart of the paper. The authors are not merely adding time as another feature. They use timed automata theory to decide which parts of continuous time are worth representing for learning.

Timing interpretation	What it buys	What it costs
Digital clock	Finite, straightforward Q-learning	Coarse timing
Uniform real-time discretization	More precision than integer time	Large state-action growth
Corner-point abstraction	Principled focus near timing boundaries	More complex abstraction machinery
Untimed reward machine	Simpler learning	Cannot choose delay actions or enforce timed guards

The practical lesson is not “corner-point abstraction is always best.” The better lesson is: when time-sensitive reward depends on boundaries, the abstraction should understand boundaries. Uniform discretization treats all intervals as equally interesting. Timed automata do not make that mistake.

Counterfactual imagining teaches the agent from delays it did not choose

Adding delays expands the action space. Exploration becomes harder because the agent must discover not only which task action matters but also which waiting time makes the timing guard satisfiable.

The paper adapts counterfactual imagining to the timed setting. In ordinary reward-machine learning, counterfactual experiences can be generated by asking: if the reward-machine state had been different, what would this transition have meant? TRMs require a richer version: if the clock valuation or delay had been different, which timed transitions would have become possible?

So after a real transition, the learner synthesizes additional experiences by varying nearby clock valuations and useful delays, especially delays that satisfy guards. In the corner-point abstraction, it similarly imagines alternative corner configurations and delay-successor choices.
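A rough sketch of the digital-clock version follows. It is deliberately naive: it replays one real environment transition under every task state, clock value, and delay, instead of selecting nearby valuations or guard-satisfying delays as described above. It also assumes the environment transition itself does not depend on the delay. The machine, the names `trm_step` and `imagine`, and the reward values are hypothetical.

```python
from typing import Callable, Dict, Tuple

# Hypothetical single-clock TRM: (state, event) -> (guard, target, reward, reset)
Edge = Tuple[Callable[[int], bool], str, float, bool]
TRM: Dict[Tuple[str, str], Edge] = {
    ("start", "pickup"): (lambda x: 3 <= x <= 5, "carrying", 10.0, True),
}
TRM_STATES = ["start", "carrying"]
MAX_CONST = 5      # largest guard constant; clock values are capped at MAX_CONST + 1
WAIT_COST = -0.1   # assumed per-tick reward while waiting


def trm_step(u: str, clock: int, delay: int, event: str):
    """Advance the TRM by `delay` ticks, then process `event`."""
    clock = min(clock + delay, MAX_CONST + 1)
    reward = WAIT_COST * delay
    edge = TRM.get((u, event))
    if edge and edge[0](clock):
        _, target, r, reset = edge
        return target, (0 if reset else clock), reward + r
    return u, clock, reward


def imagine(s, action, s_next, event, delays=range(MAX_CONST + 2)):
    """Replay one real transition under every TRM state, clock value and delay."""
    experiences = []
    for u in TRM_STATES:
        for clock in range(MAX_CONST + 2):
            for d in delays:
                u2, clock2, r = trm_step(u, clock, d, event)
                experiences.append(((s, u, clock), (d, action), r, (s_next, u2, clock2)))
    return experiences


synthetic = imagine("s0", "do_pickup", "s1", "pickup")
successes = [e for e in synthetic if e[0][1] == "start" and e[3][1] == "carrying"]
```

One real pickup yields 98 synthetic experiences here, 15 of which satisfy the guard. Each can be fed to the same Q-learning update as a real sample, so the learner sees legal schedules it never physically tried.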

This is not philosophical counterfactual reasoning. It is more like making the reward machine do extra bookkeeping so the agent learns from timing alternatives it did not physically sample. In a sparse timed task, that can be the difference between “eventually stumbles into a legal schedule” and “actually learns the schedule.”

The paper’s ablation-style evidence supports this. In the Taxi and Frozen Lake experiments, counterfactual imagining produces higher discounted returns and reduces episode time under both digital and corner-point abstractions. This experiment is not meant to validate TRMs as a whole; it isolates the value of the counterfactual timing heuristic.

What the experiments actually show

The evaluation uses Gym Taxi and a modified Frozen Lake environment. The authors test several TRM specifications involving ordered objectives, timing constraints, slow movement, holes to avoid, pickup/drop-off behavior, and intermediate rewards. They use tabular Q-learning, 300K maximum global steps, parameter decay, and averages over 10 independent runs.

The experiments answer three research questions.

Test in the paper	Likely purpose	What it supports	What it does not prove
Counterfactual imagining on Taxi TRM1 and Frozen Lake TRM2	Ablation of the timing-specific CI heuristic	CI improves returns and episode time by exploring useful clock-delay alternatives	It does not prove the approach scales to deep RL or high-dimensional robotics
Digital clock vs uniform discretization vs corner-point abstraction vs untimed reward machines	Main comparison among timing interpretations	Timing-aware abstractions outperform untimed reward machines when specifications require delays and deadlines; corner-point abstraction is often stronger when precision matters	It does not prove one abstraction dominates all timed tasks
TRM5 with pronounced timing needs	Stress example for precise timing	Corner-point abstraction can outperform digital clocks when near-boundary timing is important	It remains a designed benchmark, not a production case study
Product size, explored states, and training time table	Scalability diagnostic	Sampling explores far fewer states than the full product space, suggesting practical learnability in these tabular domains	It does not remove the combinatorial growth problem in larger systems
Timed automaton monitor vs TRM with intermediate rewards	Comparison with declarative sparse-reward monitoring	TRMs can provide denser progress feedback and learn long-horizon timed tasks better than accept/reject monitors	It does not say declarative specifications are useless; it says sparse terminal feedback can be hard for learning

The most concrete numerical evidence appears in the scale table. Some induced product spaces are large relative to the base environments. For example, the real-time Taxi setting with TRM3 has a product size of 23,358,000, but the learner explores 2,990 states and reports 84.21 seconds of learning time. Real-time Frozen Lake with TRM2 has a product size of 2,242,368, with 5,763 explored states and 84.53 seconds. The digital-clock Taxi TRM1 setting has a product size of 51,000, with 1,050 explored states and 184.13 seconds.

Those numbers should be interpreted carefully. They do not mean large timed RL is solved. They mean the sampling-based approach does not need to enumerate the full timed product state space in these experiments. That is meaningful because many formal timed-system methods are burdened by explicit product construction. It is not a free pass to deploy this on messy continuous-control systems next Tuesday.
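The arithmetic behind that reading is simple enough to check directly. The snippet below only divides the explored-state counts by the product sizes quoted above; the dictionary labels are informal shorthand.

```python
# (product size, explored states) as quoted from the paper's scale table
rows = {
    "Taxi, real-time, TRM3": (23_358_000, 2_990),
    "Frozen Lake, real-time, TRM2": (2_242_368, 5_763),
    "Taxi, digital, TRM1": (51_000, 1_050),
}

explored_share = {name: explored / size for name, (size, explored) in rows.items()}

for name, share in explored_share.items():
    print(f"{name}: {share:.4%} of the product explored")
```

The shares come out at roughly 0.013%, 0.26%, and 2.1% respectively. In the largest case the learner touches about one product state in every 7,800.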

The comparison with timed automaton monitors is especially useful for business readers. A monitor that gives reward only at final acceptance can be formally elegant and practically unhelpful. If the agent receives almost no signal until it completes the entire timed specification, learning may fail. TRMs can preserve the timed structure while adding intermediate rewards. That is not just a technical convenience; it is reward design as feedback engineering.
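The contrast can be sketched with a toy timed trace. The specification (pickup between time 3 and 5, then dropoff within 4 time units of pickup), the reward values, and the function names are assumptions for illustration, not the paper's benchmarks.

```python
from typing import List, Tuple

Trace = List[Tuple[float, str]]  # (absolute time, event)


def milestones(trace: Trace) -> int:
    """Count timed milestones reached: pickup in [3, 5], then dropoff within 4 units."""
    reached, pickup_time = 0, 0.0
    for t, event in trace:
        if reached == 0 and event == "pickup" and 3 <= t <= 5:
            reached, pickup_time = 1, t
        elif reached == 1 and event == "dropoff" and t - pickup_time <= 4:
            reached = 2
    return reached


def monitor_reward(trace: Trace) -> float:
    """Accept/reject monitor: reward only if the whole specification is met."""
    return 1.0 if milestones(trace) == 2 else 0.0


def trm_reward(trace: Trace) -> float:
    """TRM-style reward: a hypothetical 5.0 for each timed milestone reached."""
    return 5.0 * milestones(trace)


late_dropoff = [(4.0, "pickup"), (9.0, "dropoff")]
full_success = [(4.0, "pickup"), (7.0, "dropoff")]
```

On the late-dropoff trace the monitor returns 0.0 and the milestone reward returns 5.0. The agent that got the first half right receives a signal it can build on instead of being told only that it failed.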

The business value is specifying schedules, not making agents “smarter”

The business implication is not that TRMs magically make RL safe. They do not. The paper uses small tabular environments, hand-designed TRMs, and controlled timing specifications. The direct result is methodological, not deployment-ready.

The practical value is more specific: TRMs offer a formal way to encode timed operating rules inside a learning objective.

That matters for any domain where action timing affects correctness.

Business setting	Timed requirement	Why ordinary reward shaping is weak	What TRMs suggest
Warehouse robotics	Wait before entering shared space; complete handoff within a window	A generic delay penalty may punish safe waiting	Encode minimum waits, deadlines, and progress rewards explicitly
Autonomous driving	Slow down for a duration; avoid unsafe zones for a period	“Drive safely” is too vague for timed behavior	Use guards and clocks to represent dwell-time obligations
Manufacturing	Process step must occur after cooling but before degradation	Event order alone misses process timing	Reward stage progress only under valid timing windows
Financial execution	Delay can reduce impact but increase opportunity cost	Speed penalty cannot express interval-based execution logic	Model timing as part of reward, not just latency
Business workflow automation	Escalate after SLA threshold but not before human review window closes	Binary completion rewards ignore service timing	Represent deadlines and intermediate timed milestones

Cognaptus’ inference is that TRMs are most relevant to agentic systems that need operational discipline. Many AI agent demos focus on task completion: book the meeting, route the ticket, generate the report, place the order. Real organizations care about temporal correctness: wait for approval, escalate after a threshold, retry with backoff, do not act before compliance clearance, complete before the SLA breach.

A timed reward structure gives designers a vocabulary for those rules. It does not replace governance. It gives governance something formal to attach to.

The paper’s quiet lesson: reward language is infrastructure

The tempting summary is “RL learns to wait.” That is catchy, and yes, it is the title’s joke. But the deeper point is about representation.

When the reward language cannot express time, the learning system must infer timing from indirect signals. That makes timing brittle. It also makes debugging unpleasant because failures appear as policy mistakes when they are actually specification mistakes.

TRMs move timing into the specification layer. The designer can say:

Task progress is valid only if the relevant clock guard is satisfied.
Waiting has a state-based cost or reward.
Completion has a transition reward.
Certain transitions reset the clock.

This is more transparent than hiding timing inside a neural network state vector and hoping training discovers the intended behavior. Hope is not a specification strategy. It is a project-management smell.

The paper also clarifies a useful distinction between declarative timed specifications and reward-based timed specifications. Declarative formalisms can be closer to natural language and formal verification. But if they produce only sparse accept/reject feedback, they may be hard for RL to learn from. TRMs are more operational: they let designers place intermediate rewards along the timed path. That is less pure, perhaps, but often more learnable. Systems that need to learn usually appreciate being told when they are halfway correct.

Boundaries: this is a formal building block, not an industrial RL platform

The limits are important and should be stated once, precisely.

First, the implementation is tabular Q-learning. The experiments do not show deep RL performance in high-dimensional perception, robotics, or continuous-control tasks. The authors themselves point toward future work involving continuous-time Markov models, deep continuous RL, and priced-zone guidance.

Second, the TRMs are hand-designed. In a business setting, someone still has to translate operating policy into timed guards, rewards, and resets. That is valuable work, but it is work. Bad timed specifications will still produce bad incentives, now with clocks.

Third, the experiments use standard benchmark environments with deliberately constructed timing tasks. They demonstrate mechanism and feasibility. They do not demonstrate production robustness under noisy sensors, partial observability, changing workflows, or adversarial edge cases.

Fourth, choosing the right timing abstraction remains a design decision. Digital clocks may be enough for coarse business workflows. Corner-point abstraction is more appropriate when near-boundary timing matters. Uniform discretization may be simple but expensive. The paper gives tools; it does not eliminate engineering judgment. Very rude of reality, but consistent.

Where this fits in applied AI strategy

For companies building agentic automation, the paper’s immediate lesson is not “use TRMs tomorrow.” It is to audit whether your agent’s reward or evaluation logic can express timing at all.

A useful diagnostic is:

Question	If the answer is yes, timing belongs in the specification
Is acting too early a failure, not merely inefficient?	Use minimum-delay or precondition timing guards
Is acting too late a failure, not merely lower reward?	Use deadlines or upper-bound guards
Does waiting in different states have different cost?	Use state-based timed rewards
Does progress need intermediate feedback?	Use transition rewards for timed milestones
Does task success depend on staying inside a time window?	Use clocks and interval guards

This paper is particularly relevant for organizations moving from “AI assistant” demos toward “AI operator” systems. Assistants can often be forgiven for timing sloppiness. Operators cannot. The moment an agent controls workflow, capital, equipment, or safety-relevant action, timing stops being metadata and becomes part of correctness.

TRMs are not the only possible answer. But they show the right architectural instinct: do not ask a learning system to rediscover the temporal rules your organization already knows. Encode the rules in a form the learner can use.

Conclusion: give the agent a watch, then make the watch part of the job

The paper’s contribution is not flashy. It is better than flashy.

It extends reward machines with explicit time, shows how to run model-free Q-learning under digital and real-time interpretations, uses corner-point abstraction to handle precise timing more intelligently than naive discretization, and demonstrates that counterfactual timing experiences improve learning in benchmark tasks.

The business interpretation is equally direct: many applied AI systems fail not because they do the wrong thing, but because they do the right thing at the wrong time. Timed Reward Machines make that distinction explicit.

RL did not simply need a bigger model here. It needed a watch, a rulebook, and a reward function willing to admit that waiting is sometimes the action.

Cognaptus: Automate the Present, Incubate the Future.

¹ Rajarshi Roy, Anirban Majumdar, Ritam Raha, David Parker, and Marta Kwiatkowska, “About Time: Model-free Reinforcement Learning with Timed Reward Machines,” arXiv:2512.17637.
