Opening — Why this matters now
Reinforcement learning has become remarkably good at doing things eventually. Unfortunately, many real-world systems care about when those things happen. Autonomous vehicles, industrial automation, financial execution systems, even basic robotics all live under deadlines, delays, and penalties for being too early or too late. Classic RL mostly shrugs at this. Time is either implicit, discretized away, or awkwardly stuffed into state features.
This paper makes a blunt but overdue observation: if time matters, reward specification must speak the language of time. And then it does something refreshingly concrete about it.
Background — Context and prior art
Reward Machines (RMs) already gave RL practitioners a structured way to specify non-Markovian rewards—objectives that depend on history, not just the current state. They encode tasks as finite-state automata that track progress and issue rewards accordingly. Useful, interpretable, and surprisingly practical.
But traditional RMs are temporally tone-deaf. They can say what must happen and in what order, but not how fast, how slow, or by when. If you need to “wait at least 10 seconds” or “act within 3 seconds,” vanilla reward machines simply cannot express it.
Timed automata, on the other hand, handle clocks beautifully—but they live in the world of model-based verification and control synthesis, not model-free learning.
This paper bridges that gap.
Analysis — What the paper does
The authors introduce Timed Reward Machines (TRMs): reward machines augmented with clocks, guards, and resets—borrowed from timed automata—but explicitly adapted for model-free reinforcement learning.
At a high level (sketched in code after the list):
- Rewards can depend on elapsed time, not just event order
- Agents explicitly choose delay actions (how long to wait before acting)
- Costs and rewards can accumulate during waiting, not only at transitions
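To ground those ingredients, here is a minimal sketch, in Python, of what a timed reward machine could look like: one clock, interval guards, resets, and rewards attached to edges. The state names, events, and single-clock restriction are ours for illustration; the paper's formal definition is richer (multiple clocks, costs that accrue while waiting).

```python
# Minimal sketch of a timed reward machine (illustrative, not the paper's definition):
# edges fire on event labels, gated by clock guards, optionally resetting the clock
# and emitting a reward.
from dataclasses import dataclass

@dataclass
class Edge:
    target: str           # next TRM state
    lo: float             # guard: lo <= clock <= hi
    hi: float
    reward: float         # reward emitted when this edge fires
    reset: bool = False   # reset the clock to 0 after firing?

class TimedRewardMachine:
    def __init__(self):
        self.state, self.clock = "u0", 0.0
        # (state, event) -> candidate edges; the first satisfied guard wins
        self.edges = {
            ("u0", "pickup"):  [Edge("u1", 0.0, float("inf"), 0.0, reset=True)],
            ("u1", "deliver"): [Edge("u2", 0.0, 3.0, 1.0),             # within 3 time units
                                Edge("u2", 3.0, float("inf"), -1.0)],  # too late
        }

    def advance(self, delay):
        """Let time pass: the clock grows, the TRM state stays put."""
        self.clock += delay

    def step(self, event):
        """Fire the first guard-satisfying edge for `event`; return its reward."""
        for edge in self.edges.get((self.state, event), []):
            if edge.lo <= self.clock <= edge.hi:
                if edge.reset:
                    self.clock = 0.0
                self.state = edge.target
                return edge.reward
        return 0.0  # no enabled edge: no transition, no reward
```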
Crucially, the paper does not hand-wave around learning. It provides concrete Q-learning constructions under two timing semantics:
1. Digital-time semantics
Time advances in integer steps. Clock valuations are discrete. The cross-product MDP includes:
- Environment state
- TRM state
- Bounded clock valuations
This keeps the learning problem finite and guarantees convergence, at the cost of temporal precision.
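Concretely, the digital-time construction amounts to ordinary tabular Q-learning on that product, with the wait duration folded into the action. A rough sketch, where the hyperparameters, clock cap, and action names are placeholder choices rather than the paper's:

```python
# Sketch of tabular Q-learning on the digital-time product MDP (illustrative values).
# The agent's action is a (wait, act) pair: wait an integer number of ticks, then act.
import random
from collections import defaultdict

GAMMA, ALPHA, EPSILON = 0.95, 0.1, 0.1
CLOCK_CAP = 10                               # bound clock values -> finite product MDP
WAITS = range(4)                             # integer delays the agent may choose
ACTIONS = ["left", "right", "pick", "drop"]  # hypothetical environment actions
JOINT_ACTIONS = [(w, a) for w in WAITS for a in ACTIONS]

Q = defaultdict(float)                       # Q[(product_state, (wait, act))]

def product_state(obs, trm):
    """Environment observation x TRM state x capped integer clock valuation."""
    return (obs, trm.state, min(int(trm.clock), CLOCK_CAP))

def select_action(s):
    """Epsilon-greedy over the time-augmented action space."""
    if random.random() < EPSILON:
        return random.choice(JOINT_ACTIONS)
    return max(JOINT_ACTIONS, key=lambda a: Q[(s, a)])

def update(s, a, reward, s_next):
    """Standard Q-learning target, computed on product states.
    `reward` bundles whatever the TRM emitted, including any waiting cost."""
    best_next = max(Q[(s_next, a2)] for a2 in JOINT_ACTIONS)
    Q[(s, a)] += ALPHA * (reward + GAMMA * best_next - Q[(s, a)])
```

Capping clocks at the largest constant appearing in any guard is the standard timed-automata trick that keeps this table finite.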
2. Real-time semantics
Time is continuous. Delays are real-valued. This is where things get interesting—and difficult.
The paper shows that:
- Optimal policies may not exist (only suprema of achievable returns do)
- Naive discretization explodes the state space
To deal with this, the authors introduce a corner-point abstraction, inspired by region abstractions in timed automata. The intuition is elegant: if rewards depend monotonically on time, optimal behavior tends to occur near guard boundaries. So instead of exploring all times, explore the corners.
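In a learner's terms, the abstraction replaces the continuum of delays with a small candidate set built from the guard boundaries. Here is a toy illustration of enumerating such corner candidates; the function name and the epsilon/horizon handling are ours, and the paper's region-based construction is more careful:

```python
# Illustration of the corner-point idea (not the paper's exact construction):
# given the current clock value, aim delays at guard boundaries and a hair around them.
def corner_delays(clock, guards, eps=1e-3, horizon=10.0):
    """guards: list of (lo, hi) intervals on the clock. Returns candidate delays."""
    candidates = {0.0}                        # "act now" is always a candidate
    for lo, hi in guards:
        for boundary in (lo, hi):
            if boundary == float("inf") or boundary < clock:
                continue                      # unbounded or already passed
            d = boundary - clock
            candidates.update({max(0.0, d - eps), d, d + eps})
    return sorted(d for d in candidates if d <= horizon)

# Example: clock = 1.0, one guard requiring 3 <= clock <= 5
print(corner_delays(1.0, [(3.0, 5.0)]))
# -> approximately [0.0, 1.999, 2.0, 2.001, 3.999, 4.0, 4.001]
```

Instead of estimating a Q-value for every real-valued delay, the learner only has to rank a handful of temporally meaningful ones.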
Findings — Results with visualization
Across Taxi and FrozenLake benchmarks, three patterns emerge clearly:
1. Timing-aware rewards matter
Reward Machines without timing consistently underperform when the task requires waiting, delaying, or meeting deadlines. They finish episodes faster—but incorrectly.
2. Corner-point abstraction wins
| Method | Temporal Precision | Discounted Return (relative) |
|---|---|---|
| Reward Machine | None | Lowest |
| Digital TRM | Coarse | Moderate |
| Discretized Real-Time TRM | Medium | Higher |
| Corner-Point TRM | High (near guard boundaries) | Highest |
The corner-point abstraction reliably achieves higher discounted returns by exploiting precise timing around guard boundaries.
3. Counterfactual imagining helps—again
Replaying alternative delays and clock valuations after each transition accelerates learning significantly. This is especially effective in time-augmented action spaces, where exploration is otherwise painful.
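This mirrors counterfactual experience generation for ordinary reward machines: because the machine's dynamics are known, one real transition can be relabelled under many imagined machine states, clock values, and waits at no extra environment cost. A hedged sketch, reusing the hypothetical TimedRewardMachine from earlier and assuming, for illustration, that waiting does not change the environment observation:

```python
# Sketch of counterfactual replay (illustrative, not the paper's code): relabel one
# real environment transition under imagined TRM states, clock values, and waits.
import copy

CLOCK_CAP = 10                                   # same clock cap as the product sketch

def counterfactual_experiences(trm_template, trm_states, clock_values, waits,
                               obs, act, obs_next, event):
    """Yield imagined (s, a, r, s') tuples to feed the same Q-update as real data."""
    for u in trm_states:
        for c in clock_values:
            for w in waits:
                trm = copy.deepcopy(trm_template)
                trm.state, trm.clock = u, float(c)
                trm.advance(w)                   # imagined wait before acting
                r = trm.step(event)              # reward the TRM would have emitted
                s = (obs, u, c)
                s_next = (obs_next, trm.state, min(int(trm.clock), CLOCK_CAP))
                yield s, (w, act), r, s_next
```

Every yielded tuple goes through the same Q-update as the real experience, which is why the speed-up costs no additional environment interaction.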
Implications — Why this matters beyond benchmarks
Timed Reward Machines quietly solve a problem that has been lurking in applied RL for years:
- Robotics: waiting safely is often better than acting quickly
- Autonomous driving: deadlines and dwell times are contractual, not optional
- Finance: execution timing is the reward function
- Operations: delays incur costs even when nothing “happens”
More broadly, TRMs expose a deeper point: reward engineering is not just about shaping incentives—it is about choosing the right formal language. If your language cannot express time, your agent will not learn time.
Conclusion — RL, but with a watch
This paper does not introduce a flashy new neural architecture. It does something more valuable: it fixes a structural blind spot in reinforcement learning.
Timed Reward Machines give practitioners a principled way to say: do this, in this order, within this time, and don’t rush. And they do it without abandoning model-free learning.
It turns out RL wasn’t bad at timing. It just never owned a clock.
Cognaptus: Automate the Present, Incubate the Future.