
Charting a Better Bedside: When Agentic RL Teaches RAG to Diagnose

Why this paper matters: Retrieval‑augmented generation (RAG) has been the default answer to “how do we make LLMs factual?” But clinical work is not a single hop to a single document; it’s a workflow—observe, hypothesize, retrieve, cross‑check, and only then decide. Deep‑DxSearch reframes RAG as a sequential policy, trained end‑to‑end with reinforcement learning (RL) so the model learns when to reason internally and when to consult guidelines, match similar patients, or search broader knowledge—before committing to a diagnosis. That design change is the story. ...
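The "sequential policy" framing can be made concrete with a toy rollout loop. A minimal sketch, assuming a discrete action space whose names are my paraphrase of the described behaviors (reason internally, consult guidelines, match similar patients, search, diagnose), not the paper's API:

```python
# Toy version of a sequential retrieve-or-reason policy (action names
# are illustrative paraphrases, not Deep-DxSearch's actual interface).
ACTIONS = ["reason", "consult_guidelines", "match_patients", "search", "diagnose"]

def rollout(policy, state, max_steps=8):
    """Step the policy until it commits to a diagnosis or exhausts its budget."""
    trace = []
    for _ in range(max_steps):
        action = policy(state)
        trace.append(action)
        if action == "diagnose":
            break
        state = state + [action]  # fold the chosen tool's evidence into state
    return trace
```

The point of the sketch: "diagnose" is just one more action the policy can choose, so RL can reward committing early when internal reasoning suffices and consulting tools when it does not.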

August 24, 2025 · 5 min · Zelina

Click Less, Do More: Why API-GUI + RL Could Finally Make Desktop Agents Useful

The gist (and why it matters for business)

Enterprise buyers don’t reward demos; they reward repeatable completions per dollar. ComputerRL proposes a path to that by (1) escaping pure GUI mimicry via a machine-first API-GUI action space, (2) scaling online RL across thousands of Ubuntu VMs, and (3) preventing policy entropy collapse with Entropulse—a cadence that alternates RL and supervised fine-tuning (SFT) on successful rollouts. The result: a reported 48.1% OSWorld success with markedly fewer steps than GUI-only agents. Translation for buyers: lower latency, lower cost, higher reliability. ...
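The Entropulse cadence itself is simple to state in code. A minimal sketch, assuming a two-phase cycle (the function name and data-source labels are mine, not ComputerRL's API):

```python
# Illustrative Entropulse-style training cadence: alternate online RL
# with SFT on successful rollouts to keep policy entropy from collapsing.
def entropulse_schedule(num_cycles):
    """Yield (phase, data_source) pairs: RL to explore, SFT to consolidate."""
    for _ in range(num_cycles):
        yield ("rl", "online_rollouts")       # explore across the VM fleet
        yield ("sft", "successful_rollouts")  # distill successes back into the policy
```

The design intuition: RL pushes the policy toward narrow high-reward behavior; periodically re-fitting on its own successes restores diversity before the next RL phase.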

August 20, 2025 · 5 min · Zelina

Atom by Atom, Better Research: How Fine-Grained Rewards Make Agentic Search Smarter

If you’ve ever watched a web agent swing from elegant reasoning to face‑plants on basic facts, you’ve met the limits of outcome‑only training. Atom‑Searcher proposes a simple but radical fix: stop treating the whole reasoning trace as one monolith. Instead, break it down into Atomic Thoughts—the minimal, functional units of reasoning—and supervise them directly with a Reasoning Reward Model (RRM). Then blend those process‑level rewards with the final answer score using a decaying curriculum. The result? More stable training, deeper search behavior, and better generalization across in‑ and out‑of‑domain QA. ...
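The "decaying curriculum" blend of process-level and outcome rewards can be sketched as a weighted mix. The linear decay and the initial weight `lam0` below are illustrative choices, not the paper's exact schedule:

```python
def blended_reward(process_reward, outcome_reward, step, total_steps, lam0=0.8):
    """Mix the Reasoning Reward Model's process score with the final-answer
    score; the process weight decays so outcomes dominate late in training.
    (lam0 and the linear decay are illustrative, not Atom-Searcher's schedule.)"""
    lam = lam0 * max(0.0, 1.0 - step / total_steps)
    return lam * process_reward + (1.0 - lam) * outcome_reward
```

Early on, good Atomic Thoughts are rewarded even when the final answer is wrong; by the end of training, only the answer score matters.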

August 19, 2025 · 5 min · Zelina

When Collusion Cuts Prices: The Counterintuitive Economics of Algorithmic Bidding

Most warnings about algorithmic collusion tell the same story: sellers using AI to set prices end up coordinating—without explicit communication—to keep prices higher than competition would allow. This is what regulators fear: supra-competitive prices, reduced consumer welfare, and harder-to-detect anti-competitive behavior. A new study, however, flips the narrative on its head. By analyzing multi-dimensional decision-making—where reinforcement learning (RL) agents set both prices and advertising bids on a platform like Amazon—the authors uncover a surprising outcome: in markets with high consumer search costs, algorithmic “collusion” can lower prices below competitive benchmarks. ...

August 13, 2025 · 3 min · Zelina

Search When It Hurts: How UR² Teaches Models to Retrieve Only When Needed

Most “smart” RAG stacks are actually compulsive googlers: they fetch first and think later. UR² (“Unified RAG and Reasoning”) flips that reflex. It trains a model to reason by default and retrieve only when necessary, using reinforcement learning (RL) to orchestrate the dance between internal knowledge and external evidence. Why this matters for builders: indiscriminate retrieval is the silent cost center of LLM systems—extra latency, bigger bills, brittle answers. UR² shows a way to make retrieval selective, structured, and rewarded, yielding better accuracy on exams (MMLU‑Pro, MedQA), real‑world QA (HotpotQA, Bamboogle, MuSiQue), and even math. ...
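One way to make retrieval "selective, structured, and rewarded" is to charge the policy for every search call, so fetching only pays off when it changes the outcome. A hedged sketch (the penalty form and value are illustrative, not UR²'s actual reward):

```python
def selective_retrieval_reward(correct, num_retrievals, penalty=0.1):
    """Score an episode: full credit for a correct answer, minus a per-call
    retrieval charge. (Illustrative shaping, not UR2's reward function.)"""
    return (1.0 if correct else 0.0) - penalty * num_retrievals
```

Under this shaping, a correct answer from internal knowledge alone beats the same answer reached through two searches, which is exactly the "reason by default" behavior the excerpt describes.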

August 11, 2025 · 5 min · Zelina

From Zero to Reasoning Hero: How R-Zero Teaches Itself Without Human Data

In AI development, removing humans from the training loop has long been a holy grail — not because people aren’t valuable, but because human labeling is expensive, slow, and fundamentally limited. R-Zero, a new framework from Tencent AI Seattle Lab, takes a decisive step in that direction: no seed dataset, no human annotations, and no external verifier. Just two AI roles — Challenger and Solver — locked in an evolutionary arms race. ...

August 8, 2025 · 3 min · Zelina

From GUI Novice to Digital Native: How SEAgent Teaches Itself Software Autonomously

If you’ve ever tried to automate your own software workflows using AI, you’ll know the hard part isn’t reasoning — it’s clicking the right button in a sea of ambiguous icons, drop-downs, and obscure UIs. For agents tasked with navigating GUIs like humans do, the real challenge isn’t logic — it’s context. Enter SEAgent: a self-evolving computer-use agent that doesn’t just learn to operate software — it teaches itself how to learn, using nothing but screenshots, feedback from its own past mistakes, and a clever curriculum. ...

August 7, 2025 · 4 min · Zelina

Thinking in Circles: How Self-Questioning LLMs Learn Without Labels

What if an LLM could learn not by reading more, but by thinking harder? That’s the radical premise behind Self-Questioning Language Models (SQLM), a framework that transforms large language models from passive learners into active generators of their own training data. No curated datasets. No labeled answers. Just a prompt — and a model that gets smarter by challenging itself.

From Self-Play in Robotics to Reasoning in Language

The inspiration for SQLM comes from asymmetric self-play, a technique used in robotics where one agent proposes tasks and another learns to solve them. Here, that paradigm is adapted to LLMs: ...

August 6, 2025 · 3 min · Zelina

Credit Where It's Due: How CAPO Brings Verifiable Precision to LLM Reasoning

When training Large Language Models (LLMs) to reason, reinforcement learning has proven to be a powerful yet blunt instrument. Most methods reduce the entire model output to a single pass/fail reward, applying that verdict to every token—regardless of whether it contributed to success or failure. This makes credit assignment vague, verifiability weak, and learning inefficient. Enter CAPO (Credit Assignment Policy Optimization), a method that shifts the paradigm: it brings verifiable, fine-grained credit assignment to the token level, using LLMs themselves as judgment agents. ...
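The shift from a single sequence-level verdict to token-level credit can be sketched in a few lines. The mixing rule and the judge-label format below are illustrative, not CAPO's exact formulation:

```python
def token_credits(passed, judge_labels, alpha=0.5):
    """Turn one verifiable pass/fail verdict into per-token credit, nudged
    by an LLM judge's per-token labels (+1 helpful, -1 faulty, 0 neutral).
    (Mixing rule and label scheme are illustrative, not CAPO's math.)"""
    outcome = 1.0 if passed else -1.0
    return [outcome + alpha * label for label in judge_labels]
```

Even in this toy form, a token the judge flags as faulty inside a passing answer earns less credit than its neighbors, which is the credit-assignment precision the outcome-only baseline lacks.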

August 5, 2025 · 4 min · Zelina

From Charts to Circuits: How TINs Rewire Technical Analysis for the AI Era

In a field where LSTMs, transformers, and black-box agents often dominate the conversation, a new framework dares to ask: What if our old tools weren’t wrong, just under-optimized? That’s the central premise behind Technical Indicator Networks (TINs) — a novel architecture that transforms traditional technical analysis indicators into interpretable, trainable neural networks.

Indicators, Meet Neural Networks

Rather than discarding hand-crafted indicators like MACD or RSI, the TIN approach recasts them as neural network topologies. A Moving Average becomes a linear layer. MACD? A cascade of two EMAs with a subtractive node and a smoothing layer. RSI? A bias-regularized division circuit. The resulting neural networks aren’t generic function approximators; they’re directly derived from the mathematical structure of the indicators themselves. ...
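The MACD-as-network idea is concrete enough to sketch: two EMA "layers" feed a subtractive node. This uses the standard 12/26 spans; in the TIN version the smoothing factors would become trainable weights, and this sketch is my reading of the construction, not the paper's code:

```python
def ema(series, span):
    """Exponential moving average with smoothing factor 2 / (span + 1)."""
    alpha = 2.0 / (span + 1)
    out = [series[0]]
    for x in series[1:]:
        out.append(alpha * x + (1 - alpha) * out[-1])
    return out

def macd_line(prices, fast=12, slow=26):
    """MACD as a tiny network: two EMA 'layers' feeding a subtractive node."""
    return [f - s for f, s in zip(ema(prices, fast), ema(prices, slow))]
```

Because each node is an explicit EMA or subtraction, every learned parameter keeps an interpretation in the indicator's own vocabulary, which is the interpretability claim the excerpt makes.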

August 3, 2025 · 3 min · Zelina