
When Rewards Learn Back: Evolution, but With Gradients

Opening — Why this matters now Reinforcement learning has always had an uncomfortable secret: most of the intelligence is smuggled in through the reward function. We talk about agents learning from experience, but in practice, someone—usually a tired engineer—decides what “good behavior” means, numerically. As tasks grow longer-horizon, more compositional, and more sensitive to specification errors, this arrangement stops scaling. ...

December 16, 2025 · 4 min · Zelina

When Tokens Become Actions: A Policy Gradient Built for Transformers

Opening — Why this matters now Reinforcement learning has always assumed that actions are atomic. Large language models politely disagree. In modern LLM training, an “action” is rarely a single move. It is a sequence of tokens, often structured, sometimes tool‑augmented, occasionally self‑reflective. Yet most policy‑gradient methods still pretend that Transformers behave like generic RL agents. The result is a growing mismatch between theory and practice—especially visible in agentic reasoning, tool use, and long‑horizon tasks. ...

December 14, 2025 · 4 min · Zelina

RL Grows a Third Dimension: Why Text-to-3D Finally Needs Reasoning

Opening — Why this matters now Text-to-3D generation has quietly hit a ceiling. Diffusion-based pipelines are expensive, autoregressive models are brittle, and despite impressive demos, most systems collapse the moment a prompt requires reasoning rather than recall. Meanwhile, reinforcement learning (RL) has already reshaped language models and is actively restructuring 2D image generation. The obvious question—long avoided—was whether RL could do the same for 3D. ...

December 13, 2025 · 4 min · Zelina

Agents Without Time: When Reinforcement Learning Meets Higher-Order Causality

Opening — Why this matters now Reinforcement learning has spent the last decade obsessing over better policies, better value functions, and better credit assignment. Physics, meanwhile, has been busy questioning whether time itself needs to behave nicely. This paper sits uncomfortably—and productively—between the two. At a moment when agentic AI systems are being deployed in distributed, partially observable, and poorly synchronized environments, the assumption of a fixed causal order is starting to look less like a law of nature and more like a convenience. Wilson’s work asks a precise and unsettling question: what if decision-making agents and causal structure are the same mathematical object viewed from different sides? ...

December 12, 2025 · 3 min · Zelina

Proof, Policy, and Probability: How DeepProofLog Rewrites the Rules of Reasoning

Opening — Why this matters now Neurosymbolic AI has long promised a synthesis: neural networks that learn, and logical systems that reason. But in practice, the two halves have been perpetually out of sync — neural systems scale but don’t explain, while symbolic systems explain but don’t scale. The recent paper DeepProofLog: Efficient Proving in Deep Stochastic Logic Programs takes a decisive step toward resolving this standoff by reframing reasoning itself as a policy optimization problem. In short, it teaches logic to think like a reinforcement learner. ...

November 12, 2025 · 4 min · Zelina

Forget Me Not: How IterResearch Rebuilt Long-Horizon Thinking for AI Agents

Opening — Why this matters now The AI world has become obsessed with “long-horizon” reasoning—the ability of agents to sustain coherent thought over hundreds or even thousands of interactions. Yet most large language model (LLM) agents, despite their size, collapse under their own memory. The context window fills, noise piles up, and coherence suffocates. Alibaba’s IterResearch tackles this problem not by extending memory, but by redesigning it. ...

November 11, 2025 · 4 min · Zelina

When Agents Think in Waves: Diffusion Models for Ad Hoc Teamwork

Opening — Why this matters now Collaboration is the final frontier of autonomy. As AI agents move from single-task environments to shared, unpredictable ones — driving, logistics, even disaster response — the question is no longer can they act, but can they cooperate? Most reinforcement learning (RL) systems still behave like lone wolves: excellent at optimization, terrible at teamwork. The recent paper PADiff: Predictive and Adaptive Diffusion Policies for Ad Hoc Teamwork proposes a striking alternative — a diffusion-based framework where agents learn not just to act, but to anticipate and adapt, even alongside teammates they’ve never met. ...

November 11, 2025 · 3 min · Zelina

Agents on the Clock: How TPS-Bench Exposes the Time Management Problem in AI

Opening — Why this matters now AI agents can code, search, analyze data, and even plan holidays. But when the clock starts ticking, they often stumble. The latest benchmark from Shanghai Jiao Tong University — TPS-Bench (Tool Planning and Scheduling Benchmark) — measures whether large language model (LLM) agents can not only choose the right tools, but also use them efficiently in multi-step, real-world scenarios. The results? Let’s just say most of our AI “assistants” are better at thinking than managing their calendars. ...

November 6, 2025 · 3 min · Zelina

When the Sandbox Thinks Back: Training AI Agents in Simulated Realities

Opening — Why this matters now The AI industry has a curious paradox: we can train models to reason at Olympiad level, yet they still fumble at booking flights or handling a spreadsheet. The problem isn’t intelligence—it’s context. Agents are trained in narrow sandboxes that don’t scale, and they break the moment the environment changes. Microsoft and the University of Washington’s Simia framework tackles this bottleneck with a provocative idea: what if the agent could simulate its own world? ...

November 6, 2025 · 4 min · Zelina

When Markets Dream: The Rise of Agentic AI Traders

Opening — Why this matters now The line between algorithmic trading and artificial intelligence is dissolving. What once were rigid, rules-based systems executing trades on predefined indicators are now evolving into learning entities — autonomous agents capable of adapting, negotiating, and even competing in simulated markets. The research paper under review explores this frontier, where multi-agent reinforcement learning (MARL) meets financial markets — a domain notorious for non-stationarity, strategic interaction, and limited data transparency. ...

November 5, 2025 · 3 min · Zelina