
Let It Flow: ROME and the Economics of Agentic Craft

Opening — Why this matters now

2025 quietly settled an uncomfortable truth in AI: agents are not products, they are supply chains. Anyone can demo a tool-using model. Very few can make it survive contact with real environments, long-horizon tasks, and users who refuse to behave like benchmarks. The paper “Let It Flow: Agentic Crafting on Rock and Roll” arrives at exactly this inflection point. Instead of promising yet another agent, it asks a more grown-up question: what kind of ecosystem is required to reliably produce agents at scale? ...

January 1, 2026 · 3 min · Zelina

When Maps Start Thinking: Teaching Agents to Plan in Time and Space

Opening — Why this matters now

AI can already write poetry, debug code, and argue philosophy. Yet ask most large language models to plan a realistic trip—respecting time, geography, traffic, weather, and human constraints—and they quietly fall apart. Real-world planning is messy, asynchronous, and unforgiving. Unlike math problems, you cannot hallucinate a charging station that does not exist. ...

January 1, 2026 · 3 min · Zelina

Browsing Without the Bloat: Teaching Agents to Think Before They Scroll

Opening — Why this matters now

Large Language Models have learned to think. Then we asked them to act. Now we want them to browse — and suddenly everything breaks. Deep research agents are running head‑first into a practical wall: the modern web is not made of tidy pages and polite APIs. It is dynamic, stateful, bloated, and aggressively redundant. Give an agent a real browser and it drowns in tokens. Don’t give it one, and it misses the most valuable information entirely. ...

December 31, 2025 · 4 min · Zelina

Agents on the Clock: How TPS-Bench Exposes the Time Management Problem in AI

Opening — Why this matters now

AI agents can code, search, analyze data, and even plan holidays. But when the clock starts ticking, they often stumble. The latest benchmark from Shanghai Jiao Tong University — TPS-Bench (Tool Planning and Scheduling Benchmark) — measures whether large language model (LLM) agents can not only choose the right tools, but also use them efficiently in multi-step, real-world scenarios. The results? Let’s just say most of our AI “assistants” are better at thinking than managing their calendars. ...

November 6, 2025 · 3 min · Zelina

Deep Thinking, Dynamic Acting: How DeepAgent Redefines General Reasoning

In the fast-evolving landscape of agentic AI, one critical limitation persists: most frameworks can think or act, but rarely both in a fluid, self-directed manner. They follow rigid ReAct-like loops—plan, call, observe—resembling a robot that obeys instructions without ever truly reflecting on its strategy. The recent paper “DeepAgent: A General Reasoning Agent with Scalable Toolsets” from Renmin University and Xiaohongshu proposes an ambitious leap beyond this boundary. It envisions an agent that thinks deeply, acts freely, and remembers wisely. ...
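For readers who haven't met the pattern, here is a minimal sketch of the rigid ReAct-style loop the paper pushes beyond; `llm` and `call_tool` are hypothetical stand-ins, not DeepAgent's API:

```python
# Minimal ReAct-style loop: plan -> call -> observe, repeated until done.
# `llm` and `call_tool` are illustrative stand-ins for a model and a tool router.

def react_loop(llm, call_tool, task: str, max_steps: int = 10) -> str:
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        # The model emits one thought and one action per turn -- no free-form,
        # long-horizon reflection between steps, which is exactly the critique.
        step = llm(transcript + "Thought + Action (or 'FINISH: <answer>'):")
        if step.startswith("FINISH:"):
            return step.removeprefix("FINISH:").strip()
        observation = call_tool(step)          # execute the chosen tool
        transcript += f"{step}\nObservation: {observation}\n"
    return "No answer within step budget."
```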

October 31, 2025 · 4 min · Zelina

Tool Wars, Protocol Peace: What MCP‑AgentBench Really Measures

TL;DR

MCP‑AgentBench is the first broad benchmark that evaluates language agents inside the Model Context Protocol (MCP) rather than with ad‑hoc function calls. It sets up 33 MCP servers with 188 tools and runs 600 goal‑oriented queries across six task patterns. Results flip a few assumptions: open‑source leaders (notably Qwen3‑235B‑A22B) can top the table under the ReAct style, while Claude 4 Sonnet shines with native tool‑calling. Token budgets matter: o3‑mini posts the best performance‑per‑token among big names. The meta‑lesson for builders: your agent’s interaction style must match the model, and benchmarks must reward outcome, not ritual. ...
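To make the ReAct-versus-native distinction concrete, here is a generic illustration of the two interaction styles; the message shapes below are assumptions for exposition, not the benchmark's harness:

```python
# Sketch of the two interaction styles the benchmark compares.

# 1) ReAct style: the model writes its tool call as plain text, which the
#    harness must parse -- works even for models without native tool-calling.
react_output = "Thought: need the weather.\nAction: get_weather(city='Paris')"

# 2) Native tool-calling: the model returns a structured call the runtime
#    executes directly -- the style where Claude 4 Sonnet shines.
native_output = {
    "tool_calls": [
        {"name": "get_weather", "arguments": {"city": "Paris"}}
    ]
}
```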

September 19, 2025 · 5 min · Zelina

Tool Time, Any Time: Inside RLFactory’s Plug‑and‑Play RL for Multi‑Turn Tool Use

Large language models are finally learning to work the tools instead of merely talking about them. RLFactory proposes a clean way to post‑train LLMs for multi‑turn tool use by rebuilding the reinforcement learning loop around tool feedback, not just text. The result: quicker training, higher stability, and a framework teams can actually adopt.

Why this matters (and where prior setups struggle)

Most RL-for-LLM setups treat the environment as pure text: the model thinks, emits tokens, gets a scalar reward. But real tasks—searching, querying databases, compiling code, booking travel—depend on external tools that return structured results, fail intermittently, and vary in latency and format. Hard problems emerge: ...
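As a rough sketch of what "rebuilding the loop around tool feedback" can look like, consider a rollout where each environment step executes a real tool and failures are first-class; all names here are illustrative, not RLFactory's API:

```python
# Sketch: a multi-turn rollout where feedback comes from tool execution,
# not just generated text. Names are illustrative, not RLFactory's API.
from dataclasses import dataclass

@dataclass
class Turn:
    action: str        # tool call emitted by the policy
    observation: str   # structured result (or error) returned by the tool
    reward: float      # scalar credit for this step

def rollout(policy, tools: dict, task: str, max_turns: int = 8) -> list[Turn]:
    history, turns = [f"Task: {task}"], []
    for _ in range(max_turns):
        action = policy("\n".join(history))        # e.g. "search: cheap flights"
        name, _, arg = action.partition(":")
        tool = tools.get(name.strip())
        try:
            obs = tool(arg.strip()) if tool else "error: unknown tool"
        except Exception as exc:                   # tools fail intermittently
            obs = f"error: {exc}"
        reward = 0.0 if str(obs).startswith("error") else 0.1  # placeholder shaping
        turns.append(Turn(action, str(obs), reward))
        history += [action, f"observation: {obs}"]
    return turns  # trajectories like this feed the RL update, not raw text
```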

September 13, 2025 · 4 min · Zelina

From Prompts to Policies: The Agentic RL Playbook

How a new survey formalizes the shift from RLHF’d text bots to tool-using operators—and the practical playbook for product teams.

TL;DR

- Agentic RL reframes LLMs from one-shot text generators to policies acting in dynamic environments with planning, tool use, memory, and reflection.
- The paper contrasts PBRFT (preference-based RL fine-tuning) with Agentic RL via an MDP→POMDP upgrade; the action space now includes text + structured actions.
- It organizes the space by capabilities (planning, tools, memory, self-improvement, reasoning, perception) and tasks (search, code, math, GUI, vision, embodied, multi-agent).
- Open challenges: trust, scalable training, and scalable environments.
- For builders: start with short-horizon agents (verifiable rewards), invest early in evaluation, and plan a migration path from RAG pipelines to tool-integrated reasoning (TIR) with RL.

What the paper actually changes

Most “LLM RL” work you’ve seen is PBRFT—optimize responses to fit human/AI preferences (RLHF/DPO/etc.). This new survey argues that real autonomy needs Agentic RL: treat the model as a policy embedded in a sequential, partially observable world. That sounds academic, but the practical consequences are huge: ...
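A minimal sketch of that MDP→POMDP upgrade, assuming a hybrid action type of our own invention rather than the survey's notation:

```python
# Sketch of the MDP -> POMDP framing: the agent sees observations (not the
# full state) and its action space mixes free text with structured actions.
# Types here are illustrative assumptions, not the survey's formalism.
from dataclasses import dataclass
from typing import Union

@dataclass
class TextAction:            # answer / reasoning emitted to the user
    content: str

@dataclass
class ToolAction:            # structured action executed in the environment
    name: str
    args: dict

Action = Union[TextAction, ToolAction]

def step(policy, observation: str, memory: list[str]) -> Action:
    """One POMDP step: the policy conditions on partial observations plus
    its own memory, and returns either text or a structured action."""
    memory.append(observation)
    raw = policy("\n".join(memory))
    if raw.startswith("tool:"):               # e.g. "tool:search cheap flights"
        name, _, args = raw.removeprefix("tool:").partition(" ")
        return ToolAction(name=name, args={"query": args})
    return TextAction(content=raw)
```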

September 4, 2025 · 5 min · Zelina

Talk, Tool, Triumph: Training Agents with Real Conversations

TL;DR

Most “tool‑using” LLMs still practice in sterile gyms. MUA‑RL moves training into the messy real world by adding an LLM‑simulated user inside the RL rollout, wiring the agent to call actual tools and rewarding it only when the end task is truly done. The result: smaller open models close in on or beat bigger names on multi‑turn benchmarks, while learning crisper, policy‑compliant dialogue habits.

Why this matters now

Enterprises don’t want chatty copilots; they want agents that finish jobs: modify an order under policy, update a ticket with verified fields, push a fix to a repo, or reconcile an invoice—often across several conversational turns and multiple tools. Supervised fine‑tuning on synthetic traces helps, but it often overfits to static scripts and misses the live back‑and‑forth where users change their minds, add constraints, or misunderstand policy. ...
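A toy sketch of the training loop's shape, with a simulated user inside the rollout and an outcome-only reward; `agent`, `simulated_user`, and `verifier` are hypothetical stand-ins, not MUA‑RL's interfaces:

```python
# Sketch: a rollout where an LLM-simulated user sits inside the loop and
# reward is granted only on verified task completion. All names are
# illustrative assumptions, not MUA-RL's interfaces.

def rollout_with_user(agent, simulated_user, tools, verifier, task, max_turns=12):
    dialogue = [("user", simulated_user(task, history=[]))]
    for _ in range(max_turns):
        reply = agent(dialogue)                   # may embed a tool call
        if reply.startswith("tool:"):
            name, _, arg = reply.removeprefix("tool:").partition(" ")
            dialogue.append(("tool", str(tools[name](arg))))
            continue
        dialogue.append(("agent", reply))
        # The simulated user reacts, changes its mind, adds constraints...
        dialogue.append(("user", simulated_user(task, history=dialogue)))
    # Outcome-based reward: 1.0 only if the end state verifies, else 0.0.
    return dialogue, float(verifier(task, dialogue))
```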

August 27, 2025 · 4 min · Zelina

Agents on the Clock: Turning a 3‑Layer Taxonomy into a Build‑Ready Playbook

Most “agent” decks promise autonomy; few explain how to make it shippable. A new survey of LLM‑based agentic reasoning frameworks cuts through the noise with a three‑layer taxonomy—single‑agent methods, tool‑based methods, and multi‑agent methods. Below, we translate that map into a practical build/run playbook for teams deploying AI automation in real workflows.

TL;DR

- Single‑agent = shape the model’s thinking loop (roles, task prompts, reflection, iterative refinement).
- Tool‑based = widen the model’s action space (APIs, plugins/RAG, middleware; plus selection and orchestration patterns: sequential, parallel, iterative).
- Multi‑agent = scale division of labor (centralized, decentralized, or hierarchical; with cooperation, competition, negotiation).
- Treat these as orthogonal dials you tune per use‑case; don’t jump to multi‑agent if a reflective single agent with a code‑interpreter suffices.

1) What’s genuinely new (and useful) here

Most prior surveys were model‑centric (how to finetune or RLHF your way to better agents). This survey is framework‑centric: it formalizes the reasoning process—context $C$, action space $A = \{a_{\text{reason}}, a_{\text{tool}}, a_{\text{reflect}}\}$, termination $Q$—and shows where each method plugs into the loop. That formalism matters for operators: it’s the difference between “let’s try AutoGen” and “we know which knob to turn when the agent stalls, loops, or hallucinates.” ...
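To see why the formalism is operational, here is a minimal sketch of that loop with context $C$, action space $A$, and termination $Q$; the names are placeholders, not the survey's code:

```python
# Sketch of the survey's reasoning loop: context C, action space
# A = {a_reason, a_tool, a_reflect}, and a termination predicate Q.
# All function names are illustrative placeholders.
from enum import Enum

class ActionType(Enum):
    REASON = "reason"    # extend the chain of thought
    TOOL = "tool"        # act on the environment
    REFLECT = "reflect"  # critique and revise prior steps

def run(policy, execute_tool, terminate_q, task: str, max_steps: int = 20):
    context = [task]                       # C: accumulated context
    for _ in range(max_steps):
        kind, payload = policy(context)    # choose an action from A
        if kind is ActionType.TOOL:
            payload = execute_tool(payload)
        context.append(f"{kind.value}: {payload}")
        if terminate_q(context):           # Q: stop when the answer is ready
            break
    return context[-1]
```

The operational payoff: each injected piece (`policy`, `execute_tool`, `terminate_q`) maps to a distinct knob to turn when the agent stalls, loops, or hallucinates.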

August 26, 2025 · 5 min · Zelina