Opening — Why this matters now
Reinforcement learning has become remarkably good at doing things eventually. Unfortunately, many real-world systems care about when those things happen. Autonomous vehicles, industrial automation, financial execution systems, even basic robotics all live under deadlines, delays, and penalties for being too early or too late. Classic RL mostly shrugs at this. Time is either implicit, discretized away, or awkwardly stuffed into state features.
This paper makes a blunt but overdue observation: if time matters, reward specification must speak the language of time. And then it does something refreshingly concrete about it.
Background — Context and prior art
Reward Machines (RMs) already gave RL practitioners a structured way to specify non-Markovian rewards—objectives that depend on history, not just the current state. They encode tasks as finite-state automata that track progress and issue rewards accordingly. Useful, interpretable, and surprisingly practical.
But traditional RMs are temporally tone-deaf. They can say what must happen and in what order, but not how fast, how slow, or by when. If you need to “wait at least 10 seconds” or “act within 3 seconds,” vanilla reward machines simply cannot express it.
Timed automata, on the other hand, handle clocks beautifully—but they live in the world of model-based verification and control synthesis, not model-free learning.
This paper bridges that gap.
Analysis — What the paper does
The authors introduce Timed Reward Machines (TRMs): reward machines augmented with clocks, guards, and resets—borrowed from timed automata—but explicitly adapted for model-free reinforcement learning.
At a high level (sketched in code after the list):
- Rewards can depend on elapsed time, not just event order
- Agents explicitly choose delay actions (how long to wait before acting)
- Costs and rewards can accumulate during waiting, not only at transitions
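To ground those ingredients, here is a minimal sketch, in Python, of what a timed reward machine could look like: one clock, interval guards, resets, and rewards attached to edges. The state names, events, and single-clock restriction are ours for illustration; the paper's formal definition is richer (multiple clocks, costs that accrue while waiting).

```python
# Minimal sketch of a timed reward machine (illustrative, not the paper's definition):
# edges fire on event labels, gated by clock guards, optionally resetting the clock
# and emitting a reward.
from dataclasses import dataclass

@dataclass
class Edge:
    target: str           # next TRM state
    lo: float             # guard: lo <= clock <= hi
    hi: float
    reward: float         # reward emitted when this edge fires
    reset: bool = False   # reset the clock to 0 after firing?

class TimedRewardMachine:
    def __init__(self):
        self.state, self.clock = "u0", 0.0
        # (state, event) -> candidate edges; the first satisfied guard wins
        self.edges = {
            ("u0", "pickup"):  [Edge("u1", 0.0, float("inf"), 0.0, reset=True)],
            ("u1", "deliver"): [Edge("u2", 0.0, 3.0, 1.0),             # within 3 time units
                                Edge("u2", 3.0, float("inf"), -1.0)],  # too late
        }

    def advance(self, delay):
        """Let time pass: the clock grows, the TRM state stays put."""
        self.clock += delay

    def step(self, event):
        """Fire the first guard-satisfying edge for `event`; return its reward."""
        for edge in self.edges.get((self.state, event), []):
            if edge.lo <= self.clock <= edge.hi:
                if edge.reset:
                    self.clock = 0.0
                self.state = edge.target
                return edge.reward
        return 0.0  # no enabled edge: no transition, no reward
```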
Crucially, the paper does not hand-wave around learning. It provides concrete Q-learning constructions under two timing semantics:
1. Digital-time semantics
Time advances in integer steps. Clock valuations are discrete. The cross-product MDP includes:
- Environment state
- TRM state
- Bounded clock valuations
This keeps the learning problem finite and guarantees convergence, at the cost of temporal precision.
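Concretely, the digital-time construction amounts to ordinary tabular Q-learning on that product, with the wait duration folded into the action. A rough sketch, where the hyperparameters, clock cap, and action names are placeholder choices rather than the paper's:

```python
# Sketch of tabular Q-learning on the digital-time product MDP (illustrative values).
# The agent's action is a (wait, act) pair: wait an integer number of ticks, then act.
import random
from collections import defaultdict

GAMMA, ALPHA, EPSILON = 0.95, 0.1, 0.1
CLOCK_CAP = 10                               # bound clock values -> finite product MDP
WAITS = range(4)                             # integer delays the agent may choose
ACTIONS = ["left", "right", "pick", "drop"]  # hypothetical environment actions
JOINT_ACTIONS = [(w, a) for w in WAITS for a in ACTIONS]

Q = defaultdict(float)                       # Q[(product_state, (wait, act))]

def product_state(obs, trm):
    """Environment observation x TRM state x capped integer clock valuation."""
    return (obs, trm.state, min(int(trm.clock), CLOCK_CAP))

def select_action(s):
    """Epsilon-greedy over the time-augmented action space."""
    if random.random() < EPSILON:
        return random.choice(JOINT_ACTIONS)
    return max(JOINT_ACTIONS, key=lambda a: Q[(s, a)])

def update(s, a, reward, s_next):
    """Standard Q-learning target, computed on product states.
    `reward` bundles whatever the TRM emitted, including any waiting cost."""
    best_next = max(Q[(s_next, a2)] for a2 in JOINT_ACTIONS)
    Q[(s, a)] += ALPHA * (reward + GAMMA * best_next - Q[(s, a)])
```

Capping clocks at the largest constant appearing in any guard is the standard timed-automata trick that keeps this table finite.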
2. Real-time semantics
Time is continuous. Delays are real-valued. This is where things get interesting—and difficult.
The paper shows that:
- Optimal policies may not exist (only suprema of achievable returns do)
- Naive discretization explodes the state space
To deal with this, the authors introduce a corner-point abstraction, inspired by region abstractions in timed automata. The intuition is elegant: if rewards depend monotonically on time, optimal behavior tends to occur near guard boundaries. So instead of exploring all times, explore the corners.
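In a learner's terms, the abstraction replaces the continuum of delays with a small candidate set built from the guard boundaries. Here is a toy illustration of enumerating such corner candidates; the function name and the epsilon/horizon handling are ours, and the paper's region-based construction is more careful:

```python
# Illustration of the corner-point idea (not the paper's exact construction):
# given the current clock value, aim delays at guard boundaries and a hair around them.
def corner_delays(clock, guards, eps=1e-3, horizon=10.0):
    """guards: list of (lo, hi) intervals on the clock. Returns candidate delays."""
    candidates = {0.0}                        # "act now" is always a candidate
    for lo, hi in guards:
        for boundary in (lo, hi):
            if boundary == float("inf") or boundary < clock:
                continue                      # unbounded or already passed
            d = boundary - clock
            candidates.update({max(0.0, d - eps), d, d + eps})
    return sorted(d for d in candidates if d <= horizon)

# Example: clock = 1.0, one guard requiring 3 <= clock <= 5
print(corner_delays(1.0, [(3.0, 5.0)]))
# -> approximately [0.0, 1.999, 2.0, 2.001, 3.999, 4.0, 4.001]
```

Instead of estimating a Q-value for every real-valued delay, the learner only has to rank a handful of temporally meaningful ones.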
Findings — Results with visualization
Across Taxi and FrozenLake benchmarks, three patterns emerge clearly:
1. Timing-aware rewards matter
Reward Machines without timing consistently underperform when the task requires waiting, delaying, or meeting deadlines. They finish episodes faster—but incorrectly.
2. Corner-point abstraction wins
| Method | Temporal Precision | Discounted Return (relative) |
|---|---|---|
| Reward Machine | None | Lowest |
| Digital TRM | Coarse | Moderate |
| Discretized Real-Time TRM | Medium | Higher |
| Corner-Point TRM | High (near guard boundaries) | Highest |
The corner-point abstraction reliably achieves higher discounted returns by exploiting precise timing around guard boundaries.
3. Counterfactual imagining helps—again
Replaying alternative delays and clock valuations after each transition accelerates learning significantly. This is especially effective in time-augmented action spaces, where exploration is otherwise painful.
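This mirrors counterfactual experience generation for ordinary reward machines: because the machine's dynamics are known, one real transition can be relabelled under many imagined machine states, clock values, and waits at no extra environment cost. A hedged sketch, reusing the hypothetical TimedRewardMachine from earlier and assuming, for illustration, that waiting does not change the environment observation:

```python
# Sketch of counterfactual replay (illustrative, not the paper's code): relabel one
# real environment transition under imagined TRM states, clock values, and waits.
import copy

CLOCK_CAP = 10                                   # same clock cap as the product sketch

def counterfactual_experiences(trm_template, trm_states, clock_values, waits,
                               obs, act, obs_next, event):
    """Yield imagined (s, a, r, s') tuples to feed the same Q-update as real data."""
    for u in trm_states:
        for c in clock_values:
            for w in waits:
                trm = copy.deepcopy(trm_template)
                trm.state, trm.clock = u, float(c)
                trm.advance(w)                   # imagined wait before acting
                r = trm.step(event)              # reward the TRM would have emitted
                s = (obs, u, c)
                s_next = (obs_next, trm.state, min(int(trm.clock), CLOCK_CAP))
                yield s, (w, act), r, s_next
```

Every yielded tuple goes through the same Q-update as the real experience, which is why the speed-up costs no additional environment interaction.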
Implications — Why this matters beyond benchmarks
Timed Reward Machines quietly solve a problem that has been lurking in applied RL for years:
- Robotics: waiting safely is often better than acting quickly
- Autonomous driving: deadlines and dwell times are contractual, not optional
- Finance: execution timing is the reward function
- Operations: delays incur costs even when nothing “happens”
More broadly, TRMs expose a deeper point: reward engineering is not just about shaping incentives—it is about choosing the right formal language. If your language cannot express time, your agent will not learn time.
Conclusion — RL, but with a watch
This paper does not introduce a flashy new neural architecture. It does something more valuable: it fixes a structural blind spot in reinforcement learning.
Timed Reward Machines give practitioners a principled way to say: do this, in this order, within this time, and don’t rush. And they do it without abandoning model-free learning.
It turns out RL wasn’t bad at timing. It just never owned a clock.
Cognaptus: Automate the Present, Incubate the Future.