
Atom by Atom, Better Research: How Fine-Grained Rewards Make Agentic Search Smarter

If you’ve ever watched a web agent swing from elegant reasoning to face‑plants on basic facts, you’ve met the limits of outcome‑only training. Atom‑Searcher proposes a simple but radical fix: stop treating the whole reasoning trace as one monolith. Instead, break it down into Atomic Thoughts—the minimal, functional units of reasoning—and supervise them directly with a Reasoning Reward Model (RRM). Then blend those process‑level rewards with the final answer score using a decaying curriculum. The result? More stable training, deeper search behavior, and better generalization across in‑ and out‑of‑domain QA. ...
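
To make the reward mix concrete, here is a minimal Python sketch of a decaying process/outcome blend in the spirit described above; the linear schedule, the per-atom averaging, and the function names are illustrative assumptions, not Atom-Searcher's exact formulation.

```python
# Minimal sketch of a decaying process/outcome reward blend (illustrative only).

def blended_reward(atomic_scores, outcome_score, step, total_steps):
    """Mix per-Atomic-Thought RRM scores with the final-answer reward.

    atomic_scores : list of RRM scores, one per Atomic Thought (0..1)
    outcome_score : final-answer correctness reward (e.g. 0 or 1)
    step, total_steps : training progress, used to decay the process weight
    """
    # Process weight starts high and decays toward zero, so early training is
    # guided by reasoning quality and later training by the final answer alone.
    process_weight = max(0.0, 1.0 - step / total_steps)

    process_reward = sum(atomic_scores) / max(len(atomic_scores), 1)
    return process_weight * process_reward + (1.0 - process_weight) * outcome_score


# Example: mid-training, three atomic thoughts scored by the RRM, correct answer.
print(blended_reward([0.9, 0.4, 0.7], outcome_score=1.0, step=500, total_steps=1000))
```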

August 19, 2025 · 5 min · Zelina

When Collusion Cuts Prices: The Counterintuitive Economics of Algorithmic Bidding

Most warnings about algorithmic collusion tell the same story: sellers using AI to set prices end up coordinating—without explicit communication—to keep prices higher than competition would allow. This is what regulators fear: supra-competitive prices, reduced consumer welfare, and harder-to-detect anti-competitive behavior. A new study, however, turns that narrative on its head. By analyzing multi-dimensional decision-making—where reinforcement learning (RL) agents set both prices and advertising bids on a platform like Amazon—the authors uncover a surprising outcome: in markets with high consumer search costs, algorithmic “collusion” can push prices below competitive benchmarks. ...

August 13, 2025 · 3 min · Zelina

Search When It Hurts: How UR² Teaches Models to Retrieve Only When Needed

Most “smart” RAG stacks are actually compulsive googlers: they fetch first and think later. UR² (“Unified RAG and Reasoning”) flips that reflex. It trains a model to reason by default and retrieve only when necessary, using reinforcement learning (RL) to orchestrate the dance between internal knowledge and external evidence. Why this matters for builders: indiscriminate retrieval is the silent cost center of LLM systems—extra latency, bigger bills, brittle answers. UR² shows a way to make retrieval selective, structured, and rewarded, yielding better accuracy on exams (MMLU‑Pro, MedQA), real‑world QA (HotpotQA, Bamboogle, MuSiQue), and even math. ...
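
As a rough illustration of what “retrieve only when necessary” means in control-flow terms, the sketch below gates retrieval behind a confidence check; `answer_with_confidence`, `search`, and the threshold are hypothetical stand-ins, not UR²'s actual RL-trained policy.

```python
# Illustrative "retrieve only when needed" control flow (not UR²'s exact policy).
# `answer_with_confidence` and `search` are hypothetical helpers standing in for
# the model's internal attempt and an external retriever.

def answer_query(query, answer_with_confidence, search, threshold=0.75):
    # 1. Try to answer from parametric knowledge first.
    draft, confidence = answer_with_confidence(query, context=None)

    # 2. Only pay the retrieval cost when the model is unsure.
    if confidence >= threshold:
        return draft  # no search call, no extra latency or cost

    # 3. Fall back to retrieval-augmented answering with external evidence.
    evidence = search(query)
    refined, _ = answer_with_confidence(query, context=evidence)
    return refined
```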

August 11, 2025 · 5 min · Zelina

From Zero to Reasoning Hero: How R-Zero Teaches Itself Without Human Data

In AI development, removing humans from the training loop has long been a holy grail — not because people aren’t valuable, but because human labeling is expensive, slow, and fundamentally limited. R-Zero, a new framework from Tencent AI Seattle Lab, takes a decisive step in that direction: no seed dataset, no human annotations, and no external verifier. Just two AI roles — Challenger and Solver — locked in an evolutionary arms race. ...
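
A toy rendering of that arms race is sketched below; the `challenger`/`solver` interfaces and the self-consistency pseudo-labeling step are assumptions for illustration, not R-Zero's exact training recipe.

```python
# Toy Challenger/Solver loop (illustrative interfaces, not R-Zero's actual code).

def self_evolve(challenger, solver, rounds=3, batch=32):
    for _ in range(rounds):
        # Challenger proposes problems pitched near the Solver's current frontier.
        problems = [challenger.propose() for _ in range(batch)]

        # Solver attempts each problem several times; agreement among samples acts
        # as a pseudo-label, since there is no human annotator or external verifier.
        attempts = [solver.solve(p, n_samples=8) for p in problems]

        # Both sides update: the Solver on problems it can almost solve, the
        # Challenger toward problems that are neither trivial nor impossible.
        solver.update(problems, attempts)
        challenger.update(problems, attempts)
```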

August 8, 2025 · 3 min · Zelina

From GUI Novice to Digital Native: How SEAgent Teaches Itself Software Autonomously

If you’ve ever tried to automate your own software workflows using AI, you’ll know the hard part isn’t reasoning — it’s clicking the right button in a sea of ambiguous icons, drop-downs, and obscure UIs. For agents tasked with navigating GUIs like humans do, the real challenge isn’t logic — it’s context. Enter SEAgent: a self-evolving computer-use agent that doesn’t just learn to operate software — it teaches itself how to learn, using nothing but screenshots, feedback from its own past mistakes, and a clever curriculum. ...

August 7, 2025 · 4 min · Zelina

Thinking in Circles: How Self-Questioning LLMs Learn Without Labels

What if an LLM could learn not by reading more, but by thinking harder? That’s the radical premise behind Self-Questioning Language Models (SQLM), a framework that transforms large language models from passive learners into active generators of their own training data. No curated datasets. No labeled answers. Just a prompt — and a model that gets smarter by challenging itself. From self-play in robotics to reasoning in language: the inspiration for SQLM comes from asymmetric self-play, a technique used in robotics where one agent proposes tasks and another learns to solve them. Here, that paradigm is adapted to LLMs: ...

August 6, 2025 · 3 min · Zelina

Credit Where It's Due: How CAPO Brings Verifiable Precision to LLM Reasoning

When training Large Language Models (LLMs) to reason, reinforcement learning has proven to be a powerful yet blunt instrument. Most methods reduce the entire model output to a single pass/fail reward, applying that verdict to every token—regardless of whether it contributed to success or failure. This makes credit assignment vague, verifiability weak, and learning inefficient. Enter CAPO (Credit Assignment Policy Optimization), a method that shifts the paradigm: it brings verifiable, fine-grained credit assignment to the token level, using LLMs themselves as judgment agents. ...
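
The contrast is easiest to see in code: below is a hedged sketch of outcome-only credit versus token-level credit from an LLM judge, with the `judge.critique` interface as a hypothetical placeholder rather than CAPO's actual implementation.

```python
# Outcome-only credit vs. token-level credit from an LLM judge (illustrative).
import torch

def outcome_only_advantages(num_tokens, passed):
    # Every token inherits the same pass/fail verdict, helpful or not.
    return torch.full((num_tokens,), 1.0 if passed else -1.0)

def token_level_advantages(tokens, passed, judge):
    # A hypothetical LLM judge flags which tokens actually contributed to the
    # outcome; only those tokens receive the (positive or negative) credit.
    flags = judge.critique(tokens)          # e.g. [1, 0, 0, 1, ...] per token
    base = 1.0 if passed else -1.0
    return torch.tensor([base * f for f in flags], dtype=torch.float32)
```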

August 5, 2025 · 4 min · Zelina

From Charts to Circuits: How TINs Rewire Technical Analysis for the AI Era

In a field where LSTMs, transformers, and black-box agents often dominate the conversation, a new framework dares to ask: What if our old tools weren’t wrong, just under-optimized? That’s the central premise behind Technical Indicator Networks (TINs) — a novel architecture that transforms traditional technical analysis indicators into interpretable, trainable neural networks. Indicators, meet neural networks: rather than discarding hand-crafted indicators like MACD or RSI, the TIN approach recasts them as neural network topologies. A Moving Average becomes a linear layer. MACD? A cascade of two EMAs with a subtractive node and a smoothing layer. RSI? A bias-regularized division circuit. The resulting neural networks aren’t generic function approximators; they’re directly derived from the mathematical structure of the indicators themselves. ...
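
As a flavor of the idea, here is a minimal sketch of MACD written as a cascade of two EMAs, a subtractive node, and a smoothing layer; the class layout and the 12/26/9 initial periods are illustrative choices, not the paper's code.

```python
# MACD as a small "circuit" of EMA units: the indicator's structure fixes the
# topology, while the smoothing factors would become trainable parameters in a TIN.

class EMA:
    def __init__(self, period):
        self.alpha = 2.0 / (period + 1)   # the candidate trainable parameter
        self.state = None

    def __call__(self, x):
        # Standard exponential smoothing, initialized on the first observation.
        self.state = x if self.state is None else self.alpha * x + (1 - self.alpha) * self.state
        return self.state

class MACD:
    """Two EMAs, a subtractive node, and a smoothing layer on top."""
    def __init__(self, fast=12, slow=26, signal=9):
        self.fast, self.slow, self.signal = EMA(fast), EMA(slow), EMA(signal)

    def __call__(self, price):
        macd_line = self.fast(price) - self.slow(price)
        signal_line = self.signal(macd_line)
        return macd_line, macd_line - signal_line   # MACD line and histogram

# Example usage on a short price stream.
macd = MACD()
for p in [101.0, 102.5, 101.8, 103.2]:
    print(macd(p))
```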

August 3, 2025 · 3 min · Zelina

Judo, Not Armor: Strategic Deflection as a New Defense Against LLM Jailbreaks

Large language models have come a long way in learning to say “no.” When asked to give instructions for illegal acts or harmful behavior, modern LLMs are generally aligned to refuse. But a new class of attacks—logit manipulation—sidesteps this safety net entirely. Instead of tricking the model through prompts, it intervenes after the prompt is processed, modifying token probabilities during generation. This paper introduces Strategic Deflection (SDeflection), a defense that doesn’t rely on refusal at all. Instead, it teaches the model to elegantly pivot: providing a safe, semantically adjacent answer that appears cooperative but never fulfills the malicious intent. Think of it not as a shield, but as judo—redirecting the force of the attack instead of resisting it head-on. ...

July 31, 2025 · 3 min · Zelina

Stacking Alpha: How HARLF's Three-Tier Reinforcement Learner Beats the Market

The idea of merging language models and financial algorithms isn’t new — but HARLF takes it a step further by embedding them in a hierarchical reinforcement learning (HRL) framework that actually delivers. With a stunning 26% annualized ROI and a Sharpe ratio of 1.2, this isn’t just another LLM-meets-finance paper. It’s a blueprint for how sentiment and structure can be synergistically harnessed. From FinBERT to fortune, integrating text with tickers: most financial LLM pipelines stop at score generation, classifying sentiment and calling it a signal. But HARLF builds a full sentiment pipeline using FinBERT, generating monthly sentiment scores from scraped Google News articles for each of 14 assets. These scores aren’t just inputs — they form a complete observation vector that includes: ...
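
For the sentiment leg alone, a hedged sketch might look like the snippet below, mapping a month of headlines for one asset to a score in [-1, 1]; the `ProsusAI/finbert` checkpoint and the signed-average aggregation are assumptions, since the paper's exact scoring may differ.

```python
# Monthly per-asset sentiment from FinBERT (illustrative aggregation, not HARLF's).
from transformers import pipeline

finbert = pipeline("text-classification", model="ProsusAI/finbert")

def monthly_sentiment(headlines):
    """Map one asset's news headlines for a month to a score in [-1, 1]."""
    signed = {"positive": 1.0, "negative": -1.0, "neutral": 0.0}
    results = finbert(headlines)
    scores = [signed.get(r["label"].lower(), 0.0) * r["score"] for r in results]
    return sum(scores) / len(scores) if scores else 0.0

# Example usage for one asset-month (headlines are made up for illustration).
print(monthly_sentiment([
    "Company beats earnings expectations",
    "Regulators open probe into accounting practices",
]))
```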

July 27, 2025 · 3 min · Zelina