Talk, Tool, Triumph: Training Agents with Real Conversations

TL;DR Most “tool‑using” LLMs still practice in sterile gyms. MUA‑RL moves training into the messy real world by adding an LLM‑simulated user inside the RL rollout, wiring the agent to call actual tools and rewarding it only when the end task is truly done. The result: smaller open models close in on or beat bigger names on multi‑turn benchmarks, while learning crisper, policy‑compliant dialogue habits. Why this matters now Enterprises don’t want chatty copilots; they want agents that finish jobs: modify an order under policy, update a ticket with verified fields, push a fix to a repo, or reconcile an invoice—often across several conversational turns and multiple tools. Supervised fine‑tuning on synthetic traces helps, but it often overfits to static scripts and misses the live back‑and‑forth where users change their minds, add constraints, or misunderstand policy. ...

August 27, 2025 · 4 min · Zelina

Wheel Smarts > Wheel Reinvention: What GitTaskBench Really Measures

Agents don’t build Rome from scratch—they retrofit the city. GitTaskBench (arXiv:2508.18993) is the first benchmark that grades code agents on how well they exploit existing GitHub repositories to deliver real-world outcomes, not just pass algorithm puzzles. It also puts a price tag on success via an Alpha value that blends accuracy with cost, bringing long-missing business realism to agent evals. TL;DR What’s new: 54 tasks across 7 modalities (image, video, speech, office docs, web scraping, security/privacy, biosignals), each paired to a real repo and a practical, automated test harness. Why it matters: The hard part isn’t just writing code—it’s environment setup, dependency wrangling, repo comprehension, and workflow orchestration. Headline result: Even the best stack—OpenHands + Claude 3.7—passes only ~48% of tasks; environment/setup issues cause ~65% of all failures. Business twist: The Alpha value estimates net economic benefit per task by combining success, quality, and token costs. Expensive tasks become clear wins; cheap tasks require ruthless cost control. The Benchmark, de-jargoned Problem framed: In real shops, devs search, fork, and adapt. GitTaskBench simulates that reality. Each task gives an agent a specific repo (e.g., DeOldify, Scrapy, NeuroKit, SpeechBrain) and a concrete user goal (e.g., “colorize this photo” or “extract author/quote pairs into CSV”). Success is determined by a task-specific metric (e.g., NIQE for image quality; SNR/SDR for speech separation; field-level F1 for scraping; column/row fidelity for office docs) and an execution check (the thing actually runs and outputs in the right format). ...

August 27, 2025 · 5 min · Zelina

Agents on the Clock: Turning a 3‑Layer Taxonomy into a Build‑Ready Playbook

Most “agent” decks promise autonomy; few explain how to make it shippable. A new survey of LLM‑based agentic reasoning frameworks cuts through the noise with a three‑layer taxonomy: single‑agent methods, tool‑based methods, and multi‑agent methods. Below, we translate that map into a practical build/run playbook for teams deploying AI automation in real workflows. TL;DR Single‑agent = shape the model’s thinking loop (roles, task prompts, reflection, iterative refinement). Tool‑based = widen the model’s action space (APIs, plugins/RAG, middleware; plus selection and orchestration patterns: sequential, parallel, iterative). Multi‑agent = scale division of labor (centralized, decentralized, or hierarchical; with cooperation, competition, negotiation). Treat these as orthogonal dials you tune per use‑case; don’t jump to multi‑agent if a reflective single agent with a code‑interpreter suffices. 1) What’s genuinely new (and useful) here Most prior surveys were model‑centric (how to finetune or RLHF your way to better agents). This survey is framework‑centric: it formalizes the reasoning process (context $C$, action space $A = \{a_{\text{reason}}, a_{\text{tool}}, a_{\text{reflect}}\}$, termination $Q$) and shows where each method plugs into the loop. That formalism matters for operators: it’s the difference between “let’s try AutoGen” and “we know which knob to turn when the agent stalls, loops, or hallucinates.” ...

August 26, 2025 · 5 min · Zelina

Stop at 30k: How Hermes 4 Turns Long Chains of Thought into Shorter Time‑to‑Value

TL;DR Hermes 4 is an open‑weight “hybrid reasoner” that marries huge synthetic reasoning corpora with carefully engineered post‑training and evaluation. The headline for operators isn’t just benchmark wins—it’s control: control of format, schema, and especially when the model stops thinking. That last bit matters for latency, cost, and reliability. Why this matters for business readers If you’re piloting agentic or “think‑step” LLMs, two pains dominate: Unbounded reasoning length → blow‑ups in latency and context costs. Messy outputs → brittle downstream integrations. Hermes 4 addresses both with: (a) rejection‑sampled, verifier‑backed reasoning traces to raise answer quality, and (b) explicit output‑format and schema adherence training plus length‑control fine‑tuning to bound variance. That combo is exactly what production teams need. ...

August 26, 2025 · 4 min · Zelina

MoA vs. Moat: Agentic LLMs for Drug Competitor Mapping Cut Diligence Time 20×

The punchline: Competitive analysis for drug assets isn’t a tidy table; it’s a scavenger hunt across press releases, registries, investor decks, and alias-riddled drug names. A new paper shows that scaffolded, web-native LLM agents can reliably enumerate true competitors for a given indication, then filter hallucinations with an LLM-as-judge, beating popular “deep research” tools and cutting analyst turnaround from ~2.5 days to ~3 hours. This matters now: the EU’s Joint Clinical Assessments (JCA) regime makes comparator choice visible and consequential; missing a relevant competitor can ripple into pricing, market access, and trial design. In short: MoA (mechanism of action) meets moat (defensible advantage), and the moat is built from recall. ...

August 25, 2025 · 5 min · Zelina

ReAct Without the Chaos: AgentScope 1.0 Turns Tools into Strategy

Thesis: AgentScope 1.0 is less a toolkit and more a discipline for agentic software. By pinning everything to ReAct loops, unifying “message–model–memory–tool,” and adding group-wise tool provisioning, it addresses the real failure mode of agents in production: tool sprawl without control. The evaluation/Studio/runtime trio then turns prototypes into shippable services. What’s actually new (and why it matters) 1) A crisp core: Message → Model → Memory → Tool Most frameworks blur these into ad‑hoc objects; AgentScope forces a clean, composable boundary: ...

August 25, 2025 · 4 min · Zelina

Charting a Better Bedside: When Agentic RL Teaches RAG to Diagnose

Why this paper matters: Retrieval‑augmented generation (RAG) has been the default answer to “how do we make LLMs factual?” But clinical work is not a single hop to a single document; it’s a workflow—observe, hypothesize, retrieve, cross‑check, and only then decide. Deep‑DxSearch reframes RAG as a sequential policy, trained end‑to‑end with reinforcement learning (RL) so the model learns when to reason internally and when to consult guidelines, match similar patients, or search broader knowledge—before committing to a diagnosis. That design change is the story. ...

August 24, 2025 · 5 min · Zelina

Atom by Atom, Better Research: How Fine-Grained Rewards Make Agentic Search Smarter

If you’ve ever watched a web agent swing from elegant reasoning to face‑plants on basic facts, you’ve met the limits of outcome‑only training. Atom‑Searcher proposes a simple but radical fix: stop treating the whole reasoning trace as one monolith. Instead, break it down into Atomic Thoughts—the minimal, functional units of reasoning—and supervise them directly with a Reasoning Reward Model (RRM). Then blend those process‑level rewards with the final answer score using a decaying curriculum. The result? More stable training, deeper search behavior, and better generalization across in‑ and out‑of‑domain QA. ...

August 19, 2025 · 5 min · Zelina

Crystal Ball, Meet Cron Job: What FutureX Reveals About ‘Live’ Forecasting Agents

The one-sentence take: A new live benchmark, FutureX, swaps lab-style trivia for rolling, real-world future events, forcing agentic LLMs to search, reason, and hedge under uncertainty that actually moves, and the results expose where today’s “agents” are still brittle. Why FutureX matters now: Enterprise teams are deploying agents to answer questions whose truth changes by the hour (markets, elections, sports, product launches). Static leaderboards don’t measure that. FutureX runs as a cron job on reality: it collects new events every day, has agents make predictions, and grades them after events resolve. That turns evaluation from a screenshot into a time series and makes overfitting to benchmark quirks a lot harder. ...

August 19, 2025 · 4 min · Zelina

Knows the Facts, Misses the Plot: LLMs’ Knowledge–Reasoning Split in Clinical NLI

The gist: A new clinical natural language inference (NLI) benchmark isolates what models know from how they reason, and the results are stark. State‑of‑the‑art LLMs ace targeted fact checks (≈92% accuracy) but crater on the actual reasoning tasks (≈25% accuracy). The collapse is most extreme in compositional grounding (≈4% accuracy), where a claim depends on multiple interacting clinical constraints (e.g., drug × dose × diagnosis × schedule). Scaling yielded fluent prose, not reliable inference. ...

August 18, 2025 · 4 min · Zelina