Reinforcement Learning

Atom by Atom, Better Research: How Fine-Grained Rewards Make Agentic Search Smarter

TL;DR for operators: Research agents fail in a very familiar way: they do several useful things, then make one bad final move, and the training signal treats the whole journey as garbage. Delightful. Efficient. Totally not a credit-assignment problem wearing a lab coat. Atom-Searcher attacks that problem by splitting an agent’s reasoning trace into Atomic Thoughts: small, functional reasoning units such as planning, verification, hypothesis testing, observation, action selection, or risk analysis. A Reasoning Reward Model then scores those units, producing an Atomic Thought Reward that is blended with the final-answer reward during reinforcement learning.1 ...
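To make the "blended" part concrete, here is a minimal sketch of how per-thought scores might be mixed with a final-answer reward. The function name, the simple averaging, and the fixed `weight` are all assumptions for illustration; the excerpt does not say how Atom-Searcher aggregates or weights the two signals.

```python
def blended_reward(atomic_scores, outcome_reward, weight=0.5):
    """Mix a mean per-thought score with the final-answer reward.

    atomic_scores: hypothetical scores in [0, 1], one per atomic thought.
    outcome_reward: reward for the final answer (e.g. 0.0 or 1.0).
    weight: assumed share given to the process signal.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    if not atomic_scores:
        # No scored thoughts: fall back to the outcome reward alone.
        return outcome_reward
    process_reward = sum(atomic_scores) / len(atomic_scores)
    return weight * process_reward + (1.0 - weight) * outcome_reward


# A trace with useful intermediate steps but a wrong final answer
# still receives partial credit instead of a flat zero.
partial_credit = blended_reward([1.0, 0.5, 0.0], outcome_reward=0.0)
```

The point of the sketch is the shape of the signal, not the numbers: a trajectory with good intermediate work no longer collapses to the same reward as one that was wrong from the start.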

Train Long, Think Short: How Curriculum Learning Makes LLMs Think Smarter, Not Longer

TL;DR for operators: The paper behind this article proposes Curriculum GRPO: a reinforcement-learning training method that starts a reasoning model with a larger token budget, then gradually shrinks that budget until the model learns to solve problems in shorter traces.1 The important point is not “ask the model to be brief.” We have tried that. It works roughly as well as asking a committee to be concise, which is to say: occasionally, under duress. The paper instead changes the training trajectory. The model is first allowed to explore longer reasoning paths, then is forced to compress successful strategies into a tighter token budget. ...
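A shrinking budget is easy to picture as a schedule. The sketch below assumes a stepwise exponential decay with a floor; the function name and every default value (`start_budget`, `final_budget`, `decay`, `interval`) are placeholders chosen for illustration, not settings taken from the paper.

```python
def token_budget(step, start_budget=1024, final_budget=256,
                 decay=0.5, interval=100):
    """Return the token budget allowed at a given training step.

    The budget is multiplied by `decay` once every `interval` steps
    and never drops below `final_budget`. All values are assumptions.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    shrinks = step // interval
    budget = int(start_budget * (decay ** shrinks))
    return max(final_budget, budget)


# Early training explores with long traces; later steps are squeezed.
schedule = [token_budget(s) for s in (0, 100, 200, 1000)]
```

In a training loop, a value like this would cap or penalise generations that run past the current budget, which is what turns "be brief" from a request into a constraint.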

When Collusion Cuts Prices: The Counterintuitive Economics of Algorithmic Bidding

TL;DR for operators: Marketplace operators usually worry that pricing algorithms learn the oldest trick in commerce: stop undercutting each other and raise prices. That worry is real. But this paper makes a more interesting point: when sellers use algorithms to optimise both product prices and sponsored-ad bids, collusion can move through the cost side before it moves through the price side.1 ...

Search When It Hurts: How UR² Teaches Models to Retrieve Only When Needed

TL;DR for operators: UR² is a useful paper because it attacks the part of RAG that most demos politely ignore: search can make a model worse when it is used badly.1 The framework trains smaller language models to coordinate retrieval and reasoning, rather than bolting a search box onto a chatbot and hoping the context window will behave itself. Hope, regrettably, is not a retrieval strategy. ...

From Zero to Reasoning Hero: How R-Zero Teaches Itself Without Human Data

TL;DR for operators: R-Zero is a self-evolving training framework for reasoning LLMs that starts with one base model, splits it into two roles, and lets them co-train: a Challenger generates difficult questions, while a Solver learns to answer them.1 The useful business takeaway is not “models no longer need data.” That is the sort of sentence that should be handled with tongs. R-Zero removes the need for external task datasets and human labels in its training loop, but it still depends on engineered reward signals, majority-vote pseudo-labels, answer-format discipline, filtering, and objective correctness checks. “Zero data” here means zero external tasks and labels, not zero structure. ...

From GUI Novice to Digital Native: How SEAgent Teaches Itself Software Autonomously

TL;DR for operators: Software automation usually breaks at the interface between “the process is known” and “the application has changed again.” A button moves. A settings panel is renamed. A vendor ships a redesign with the emotional restraint of a toddler near glitter. The usual answer is more labelled demonstrations, more brittle scripts, or more human babysitting. ...

Thinking in Circles: How Self-Questioning LLMs Learn Without Labels

TL;DR for operators: Self-Questioning Language Models, or SQLM, tests a tempting idea: can a language model improve its reasoning ability without being handed a curated training set of questions and answers? The answer in this paper is: partly, in narrow settings, if the training loop is engineered carefully enough.1 The mechanism is not mystical self-awareness. A model is split into two roles. One role proposes questions from a single topic prompt. The other tries to solve them. Reinforcement learning then updates the system using proxy rewards: majority-vote agreement for arithmetic and algebra, and proposer-generated unit tests for coding. The proposer is rewarded for problems that are not too easy and not too hard; the solver is rewarded for answers that pass the available proxy. ...
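The majority-vote proxy and the "not too easy, not too hard" rule can be sketched in a few lines. Everything here is an illustrative assumption: the function names, the binary reward values, and the `low` threshold are not taken from the paper, which the excerpt describes only in outline.

```python
from collections import Counter


def majority_answer(answers):
    """Return the most common answer and the share of samples that gave it."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / len(answers)


def solver_rewards(answers):
    """Reward each sampled answer 1.0 if it matches the majority, else 0.0."""
    majority, _ = majority_answer(answers)
    return [1.0 if a == majority else 0.0 for a in answers]


def proposer_reward(answers, low=0.3):
    """Reward a question only when solver agreement is partial.

    Full agreement is treated as "too easy"; agreement at or below
    the assumed `low` threshold is treated as "too hard".
    """
    _, agreement = majority_answer(answers)
    return 1.0 if low < agreement < 1.0 else 0.0


samples = ["4", "4", "5", "4"]          # four sampled solver answers
per_sample = solver_rewards(samples)    # [1.0, 1.0, 0.0, 1.0]
question_value = proposer_reward(samples)  # 1.0: partial agreement
```

Note what the proxy cannot do: if most samples agree on a wrong answer, the wrong answer is rewarded. That is one reason the result holds only "in narrow settings".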

Credit Where It's Due: How CAPO Brings Verifiable Precision to LLM Reasoning

TL;DR for operators: CAPO is not mainly a paper about “making models reason better” in the usual fog-machine sense. It is about fixing a specific training failure: outcome-only reinforcement learning tells a model whether the final answer was right, but not which part of the reasoning earned or destroyed that outcome. The method uses a stronger off-the-shelf LLM as a generative process reward model, or GenPRM, to inspect a rollout and identify wrong reasoning steps in one pass. Those step-level critiques are then converted into token-level penalties, so the policy update can suppress flawed reasoning segments instead of treating the whole answer as one indivisible blob. The authors test this across Llama-3-1B/3B and Qwen2.5-1.5B/7B backbones, with results showing consistent average gains over SFT, GRPO with rule-based verification, and GRPO with generative outcome reward modelling.1 ...
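The step-to-token conversion is the mechanical heart of that description, so here is a minimal sketch of it. The function name, the flat `penalty`, and the idea of subtracting it from a shared outcome advantage are assumptions made for illustration; the excerpt does not give CAPO's actual formula.

```python
def token_advantages(step_token_counts, wrong_steps,
                     outcome_advantage, penalty=1.0):
    """Spread one outcome-level advantage over tokens, penalising bad steps.

    step_token_counts: number of tokens in each reasoning step, in order.
    wrong_steps: indices of steps a critic flagged as wrong (hypothetical
        output of a GenPRM-style pass).
    outcome_advantage: the single advantage an outcome-only method would
        assign to every token.
    penalty: assumed amount subtracted from tokens in flagged steps.
    """
    flagged = set(wrong_steps)
    advantages = []
    for index, count in enumerate(step_token_counts):
        value = outcome_advantage
        if index in flagged:
            value -= penalty
        advantages.extend([value] * count)
    return advantages


# Three steps of 2, 3 and 1 tokens; the critic flags the middle step.
per_token = token_advantages([2, 3, 1], wrong_steps=[1],
                             outcome_advantage=1.0, penalty=0.5)
```

With outcome-only training all six tokens would share the same value. Here the three tokens of the flagged step get a lower one, which is the "not one indivisible blob" idea in miniature.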

From Charts to Circuits: How TINs Rewire Technical Analysis for the AI Era

TL;DR for operators: Trading platforms have spent decades giving users fixed technical indicators and then, more recently, neural models that treat those indicators as just another column in a feature table. Longfei Lu’s paper on Technical Indicator Networks, or TINs, proposes a different wiring job: make the indicator itself into the neural architecture.1 ...

Stacking Alpha: How HARLF's Three-Tier Reinforcement Learner Beats the Market

TL;DR for operators: HARLF is not a story about a large language model suddenly becoming a portfolio manager. Sensible readers may exhale. The language component is FinBERT sentiment scoring applied to financial news, then converted into monthly asset-level signals. The heavier claim is architectural: instead of throwing price metrics and sentiment into one flat reinforcement-learning model and hoping the neural soup tastes like alpha, the paper separates the decision process into three tiers. ...