Model Training

Stale Gradients, Fresh Economics: CoCD’s Lightweight Route to Zeroth-Order AI

Memory is usually treated as a luxury in machine learning. More parameters, more activations, more optimiser state, more logs, more everything. Then the invoice arrives, the device overheats, and someone rediscovers the ancient corporate virtue of not wasting things. The paper Turning Stale Gradients into Stable Gradients makes a modest but interesting proposal: perhaps an optimiser should not throw away old gradient information just because it is old.1 In the right setting, yesterday’s partial derivative is not spoiled milk. It is a slightly outdated map. If the terrain has not shifted too violently, it may still point in a useful direction. ...

High Entropy, Low Drama: The Internal Fingerprint of LLM Reasoning

Debugging a reasoning model usually starts at the wrong end. A model gives a wrong mathematical answer, so we inspect the final output. Then we inspect the chain-of-thought. Then we compare benchmark scores, sample more answers, compute pass rates, and hope the model’s visible reasoning trace tells us what happened inside. This is convenient. It is also a little like diagnosing a factory by reading only the shipping label. ...

RL Needs a Menu, Not a Miracle

Menus are underrated. When a language model knows only one way to solve a problem, reinforcement learning can mostly reward or punish that route. It can make the model more confident, more selective, and sometimes more verbose. But it has little room to choose among genuinely different ways of reaching the answer. ...

When Reasoning Pays (and When It Cheats): Fixing RL Signals in LLM Training

Scorecards are useful until people learn how the scorecard works. That is not a cynical observation. It is basic management. Sales teams optimize for commission rules. Customer-service teams optimize for handle-time dashboards. Students optimize for exams. And language models, with their charming lack of shame, optimize whatever reward function we put in front of them. ...

Training Models to Explain Themselves: Counterfactuals as a First-Class Objective

Rejected. That is where counterfactual explanations usually enter the story. A loan applicant is declined by an automated system. A hiring candidate is filtered out. An insurance customer is priced into an unfavorable category. The counterfactual explanation is supposed to answer a practical question: what would need to change for the model to give me the desired outcome? ...

Scaling Laws Without Power Laws: Why Bigger Models Still Win

Budget meetings have a way of making AI theory suddenly less philosophical. Someone asks the simple question: “If we double the model size or the training data, how much better does the system get?” Then someone else opens a spreadsheet, adds a few curves, and everyone pretends the future has become manageable. This ritual has powered a large part of modern AI investment. Scaling laws made model development feel less like guesswork and more like engineering. ...

Deployed, Retrained, Repeated: When LLMs Learn From Being Used

Acceptance is a reward, even when nobody writes reward = 1. Imagine an enterprise deploys an AI agent to generate code, reconcile invoices, or prepare operational plans. Some outputs pass automated checks and enter production. Others fail, disappear into logs, and are never seen again. Months later, the accepted outputs are collected and used to fine-tune the next model. ...

When Models Learn to Forget: Why Memorization Isn’t the Same as Intelligence

A contract clause appears in a chatbot response. Not a summary. Not a paraphrase. The clause itself, with the same odd phrasing, the same punctuation, and the same mildly embarrassing typo that legal counsel thought nobody outside the company would ever see. The model did not “reason” its way there. It remembered. ...

Branching Out of the Box: Tree-OPO Turns MCTS Traces into Better RL for Reasoning

A search tree is expensive to build. Once you have paid for it, using only the final answers is a little like buying an aircraft engine and admiring the packaging. That is the useful instinct behind Tree-OPO, a paper that asks whether Monte Carlo Tree Search traces from a stronger teacher model can be reused not merely as demonstrations, but as a structured curriculum for training a smaller reasoning policy.1 The idea is not to run MCTS at inference time and call that progress. Nor is it to imitate a teacher’s logits until the student develops the personality of a photocopier. The paper’s more interesting move is subtler: take the partial reasoning states produced by search, let the student complete from those prefixes, and compute advantages in a way that respects where each prefix sits in the tree. ...

Rollouts, Not GPUs: Why AWorld’s 14.6× Speedup Rewires Agent Training

TL;DR for operators: AWorld’s useful lesson is not “buy more GPUs”. It is more specific, and therefore more operationally annoying: if an agent learns from interaction, the bottleneck becomes the rate at which it can safely attempt tasks, collect trajectories, score outcomes, and feed those traces back into training. The paper shows three things that matter for builders. First, more rollouts per task sharply raise success rates on GAIA validation: Claude 3.7 Sonnet rises from 47.9% pass@1 to a 76.4% peak, while GPT-4o rises from 27.3% to 65.5% as rollout count increases to 32. Second, AWorld’s distributed executor cuts rollout time for one training cycle from 7,695 seconds to 525 seconds, while training time stays fixed at 144 seconds. That is the paper’s 14.6× speedup, and it is the result that makes the training loop economically less ridiculous. Third, using that loop, Qwen3-32B-AWorld reaches 32.23% GAIA test pass@1, up from 21.59% for the base Qwen3-32B model, and improves xbench-DeepSearch from 12% to 32% without direct training on that benchmark. ...
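The timing figures in the AWorld teaser are easy to check by hand, and doing so shows which part of the cycle the 14.6× applies to. The sketch below uses only the three numbers quoted above; the variable and function names are illustrative and do not come from the paper. It suggests the headline figure is the ratio of rollout times alone, and that the speedup of a whole cycle (rollout plus the fixed 144 seconds of training) would be somewhat lower.

```python
# Seconds per training cycle, as quoted in the teaser above.
ROLLOUT_BEFORE_S = 7695   # rollout time without the distributed executor
ROLLOUT_AFTER_S = 525     # rollout time with the distributed executor
TRAINING_S = 144          # training time, unchanged in both settings


def speedup(before_s: float, after_s: float) -> float:
    """Ratio of old duration to new duration."""
    return before_s / after_s


# Rollout-only ratio: about 14.66, which the source quotes as 14.6x.
rollout_speedup = speedup(ROLLOUT_BEFORE_S, ROLLOUT_AFTER_S)

# Whole-cycle ratio (rollout + training): about 11.72. This is derived here
# for illustration and is not a figure quoted in the teaser.
cycle_speedup = speedup(ROLLOUT_BEFORE_S + TRAINING_S,
                        ROLLOUT_AFTER_S + TRAINING_S)

# Share of each cycle spent on rollouts, before and after.
rollout_share_before = ROLLOUT_BEFORE_S / (ROLLOUT_BEFORE_S + TRAINING_S)
rollout_share_after = ROLLOUT_AFTER_S / (ROLLOUT_AFTER_S + TRAINING_S)

if __name__ == "__main__":
    print(f"rollout speedup:      {rollout_speedup:.2f}x")
    print(f"whole-cycle speedup:  {cycle_speedup:.2f}x")
    print(f"rollout share before: {rollout_share_before:.1%}")
    print(f"rollout share after:  {rollout_share_after:.1%}")
```

On these numbers, rollouts take up roughly 98% of a cycle before the change and still roughly 78% after it, which is consistent with the teaser's point that interaction, not training compute, is the bottleneck.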