
When Rewards Learn to Think: Teaching Agents *How* They’re Wrong

Opening — Why this matters now

Agentic AI is having a credibility problem. Not because agents can’t browse, code, or call tools—but because we still train them like they’re taking a final exam with no partial credit. Most agentic reinforcement learning (RL) systems reward outcomes, not process. Either the agent finishes the task correctly, or it doesn’t. For short problems, that’s tolerable. For long-horizon, tool-heavy reasoning tasks, it’s catastrophic. A single late-stage mistake erases an otherwise competent trajectory. ...

January 30, 2026 · 4 min · Zelina
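
To make the “no partial credit” point concrete, here is a minimal sketch (not from the post; every function name and number below is illustrative) of how an outcome-only reward scores a long trajectory versus a scheme that credits individual steps:

```python
# Illustrative sketch only: contrasts a single terminal reward with
# per-step partial credit. Names and numbers are hypothetical.

def outcome_only_return(steps: list[bool], task_solved: bool) -> float:
    """One terminal reward: a late mistake wipes out the whole trajectory."""
    return 1.0 if task_solved else 0.0

def per_step_return(steps: list[bool], task_solved: bool) -> float:
    """Average per-step correctness plus a bonus for solving the task."""
    step_score = sum(steps) / len(steps) if steps else 0.0
    return 0.5 * step_score + 0.5 * (1.0 if task_solved else 0.0)

# A 10-step trajectory that is correct until the very last action.
steps = [True] * 9 + [False]
print(outcome_only_return(steps, task_solved=False))  # 0.0, no partial credit
print(per_step_return(steps, task_solved=False))      # 0.45, competence still counts
```

Under the outcome-only scorer, nine correct steps followed by one late mistake earn exactly as much as doing nothing at all; the per-step scorer still registers the competence that came before the failure.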

When Rewards Learn Back: Evolution, but With Gradients

Opening — Why this matters now

Reinforcement learning has always had an uncomfortable secret: most of the intelligence is smuggled in through the reward function. We talk about agents learning from experience, but in practice, someone—usually a tired engineer—decides what “good behavior” numerically means. As tasks grow longer-horizon, more compositional, and more brittle to specification errors, this arrangement stops scaling. ...

December 16, 2025 · 4 min · Zelina

Atom by Atom, Better Research: How Fine-Grained Rewards Make Agentic Search Smarter

If you’ve ever watched a web agent swing from elegant reasoning to face‑plants on basic facts, you’ve met the limits of outcome‑only training. Atom‑Searcher proposes a simple but radical fix: stop treating the whole reasoning trace as one monolith. Instead, break it down into Atomic Thoughts—the minimal, functional units of reasoning—and supervise them directly with a Reasoning Reward Model (RRM). Then blend those process‑level rewards with the final answer score using a decaying curriculum. The result? More stable training, deeper search behavior, and better generalization across in‑ and out‑of‑domain QA. ...

August 19, 2025 · 5 min · Zelina
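
For readers who want the mechanics, here is a hedged sketch of the reward blend the teaser describes: per-thought scores from a reasoning reward model mixed with the final answer score under a weight that decays over training. The linear schedule, the 0.8 starting weight, and all function and parameter names are assumptions for illustration, not values taken from the Atom-Searcher paper.

```python
# Hedged sketch: blend process-level (per-thought) rewards with the final
# answer score, with the process share decaying over training.
# The schedule and weights below are assumed, not the paper's.

def blended_reward(atomic_thought_scores: list[float],
                   answer_score: float,
                   step: int,
                   total_steps: int,
                   initial_process_weight: float = 0.8) -> float:
    """Combine per-thought (process) rewards with the outcome reward.

    atomic_thought_scores: reward-model scores for each atomic thought, in [0, 1].
    answer_score: final-answer correctness score, in [0, 1].
    step / total_steps: training progress, which drives the decay.
    """
    # Linear decay of the process-reward share over training (assumed schedule).
    progress = min(step / max(total_steps, 1), 1.0)
    w_process = initial_process_weight * (1.0 - progress)

    process_score = (sum(atomic_thought_scores) / len(atomic_thought_scores)
                     if atomic_thought_scores else 0.0)
    return w_process * process_score + (1.0 - w_process) * answer_score

# Early in training, process supervision dominates; later, the answer score does.
print(blended_reward([0.9, 0.7, 0.8], answer_score=0.0, step=100, total_steps=10_000))
print(blended_reward([0.9, 0.7, 0.8], answer_score=0.0, step=9_900, total_steps=10_000))
```

Early in training the process term dominates, which rewards well-formed reasoning even when the answer is wrong; as the weight decays, the final answer score gradually takes over.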