
Evolving Minds: How LLMs Teach Themselves Through Adversarial Cooperation

Training data is the quiet tax on modern AI. Someone has to write the examples, verify the answers, clean the failures, and pretend the spreadsheet is a strategy. Reinforcement learning makes that tax even more visible: if a model is supposed to improve through feedback, then the organisation must either provide ground-truth answers, hire evaluators, or build verifiers that can tell success from nonsense. ...

November 1, 2025 · 14 min · Zelina

Deep Thinking, Dynamic Acting: How DeepAgent Redefines General Reasoning

Tools are where agent demos go to die. The pitch is usually elegant. Give the model a goal, attach a few APIs, let it reason, and watch the automation glide across systems like a tiny consultant with no calendar conflicts. Then the real world appears: too many tools, unclear documentation, stale context, partial failures, long interaction histories, and the occasional API response that seems to have been designed by someone settling a personal score. ...

October 31, 2025 · 15 min · Zelina

Blueprints of Agency: Compositional Machines and the New Architecture of Intelligence

A prototype begins innocently enough: a product team wants a small machine, a vehicle, a tool, a fixture, perhaps a mechanism that throws something across a room because medieval engineering apparently never left the group chat. The modern AI pitch says the agent can design it. Give it parts, constraints, and a goal; let it reason; let it test; let it improve. ...

October 23, 2025 · 14 min · Zelina

Plan>Then>Profit: Reinforcement Learning That Teaches LLMs to Outline Before They Think

Planning is usually the part of work everybody claims to value and nobody wants to inspect. The deck has a roadmap. The project has a strategy. The model has a chain of thought. Splendid. Now, does the plan actually make the execution better, or is it just theatre with bullet points? That is the useful question behind Plan Then Action: High-Level Planning Guidance Reinforcement Learning for LLM Reasoning, which introduces PTA-GRPO, a reinforcement-learning method that trains language models to generate an explicit analytic plan before detailed reasoning and then rewards the quality of that plan, not merely the final answer.1 ...

October 9, 2025 · 16 min · Zelina

Paths, Not Parrots: When RL Makes LLMs Plan—and When It Doesn’t

A workflow agent usually looks clever right up to the moment one service is down, one permission changes, or one customer case arrives with the wrong sort of mess attached. Then the question becomes painfully simple: did the model learn a plan, or did it learn the usual route? That distinction is the centre of Benefits and Pitfalls of Reinforcement Learning for Language Model Planning: A Theoretical Perspective, an ICLR 2026 paper by Siwei Wang, Yifei Shen, Haoran Sun, Shi Feng, Shang-Hua Teng, Li Dong, Yaru Hao, and Wei Chen.1 The paper is not another victory lap for reinforcement learning. It is more useful than that. It asks what, mechanically, changes when a language model is trained for planning with reinforcement learning rather than supervised fine-tuning. ...

October 3, 2025 · 16 min · Zelina

Branching Out of the Box: Tree‑OPO Turns MCTS Traces into Better RL for Reasoning

A search tree is expensive to build. Once you have paid for it, using only the final answers is a little like buying an aircraft engine and admiring the packaging. That is the useful instinct behind Tree-OPO, a paper that asks whether Monte Carlo Tree Search traces from a stronger teacher model can be reused not merely as demonstrations, but as a structured curriculum for training a smaller reasoning policy.1 The idea is not to run MCTS at inference time and call that progress. Nor is it to imitate a teacher’s logits until the student develops the personality of a photocopier. The paper’s more interesting move is subtler: take the partial reasoning states produced by search, let the student complete from those prefixes, and compute advantages in a way that respects where each prefix sits in the tree. ...

September 17, 2025 · 14 min · Zelina

Tool Time, Any Time: Inside RLFactory’s Plug‑and‑Play RL for Multi‑Turn Tool Use

Tool calls are where agent demos stop being cute. A chatbot can talk through a task all day. A working agent has to search, query, execute, verify, retry, and sometimes discover that the tool it politely called has returned a malformed answer after making everyone wait. That is the difference between “reasoning about work” and doing work. The former gives you fluent paragraphs. The latter gives you latency, interface contracts, timeout handling, reward ambiguity, and a suspicious number of JSON parsing errors. Glamorous, naturally. ...

September 13, 2025 · 16 min · Zelina

Mind the Gap: How OSC Turns Agent Chatter into Compound Intelligence

Teams fail quietly before they fail visibly. The procurement analyst missed a constraint. The legal reviewer assumed a definition. The finance model used a different baseline. Everyone produced competent work. The final report still wobbled because the collaboration layer never asked the obvious question: who knows what, who misunderstands what, and which disagreement is worth resolving before the answer is assembled? ...

September 11, 2025 · 16 min · Zelina

Plan, Don't Spam: The Goldilocks Rule for Test‑Time Compute

A busy agent is not necessarily a thinking agent. Anyone who has watched an LLM agent narrate every tiny move knows the feeling. It reviews the goal. It drafts a plan. It revises the plan. It reconsiders the revision. Then, with exquisite deliberation, it clicks the wrong button. The transcript looks intelligent; the behaviour looks like a consultant trapped in a revolving door. ...

September 8, 2025 · 15 min · Zelina

From Prompts to Policies: The Agentic RL Playbook

A chatbot can answer a question. An agent has to do something after the answer stops being enough. That distinction sounds obvious until a system must browse, click, call an API, write code, inspect an error, remember what it tried, and decide whether another attempt is worth the cost. At that point, “better prompting” becomes the AI equivalent of telling a logistics team to be more mindful while the warehouse is on fire. Pleasant, perhaps. Not a control system. ...

September 4, 2025 · 15 min · Zelina