Evaluation

From Prompts to Policies: The Agentic RL Playbook

How a new survey formalizes the shift from RLHF’d text bots to tool-using operators—and the practical playbook for product teams. TL;DR Agentic RL reframes LLMs from one-shot text generators to policies acting in dynamic environments with planning, tool use, memory, and reflection. The paper contrasts PBRFT (preference-based RL fine-tuning) with Agentic RL via an MDP→POMDP upgrade; action space now includes text + structured actions. It organizes the space by capabilities (planning, tools, memory, self-improvement, reasoning, perception) and tasks (search, code, math, GUI, vision, embodied, multi-agent). Open challenges: trust, scalable training, and scalable environments. For builders: start with short-horizon agents (verifiable rewards), invest early in evaluation, and plan a migration path from RAG pipelines to tool-integrated reasoning (TIR) with RL. What the paper actually changes Most “LLM RL” work you’ve seen is PBRFT—optimize responses to fit human/AI preferences (RLHF/DPO/etc.). This new survey argues that real autonomy needs Agentic RL: treat the model as a policy embedded in a sequential, partially observable world. That sounds academic, but the practical consequences are huge: ...

Numbers Need Narration: Making LLMs Do Reasoning‑Intensive Regression

Thesis: When the job is to read text, reason carefully, and return a precise number (not just a label), ordinary regression heads and vanilla prompting often fail in opposite ways. The paper introduces MENTAT, a lightweight recipe that marries batch‑reflective prompt evolution with a small MLP aggregator over multiple LLM rollouts. The result: tighter calibration and better ranking on tasks where each example demands real reasoning, not surface features. What counts as “Reasoning‑Intensive Regression” (RiR)? RiR tasks look like this: the model must (1) think through the input with step‑wise analysis, and then (2) score it on a real‑valued scale. The paper frames three such tasks: ...

Benchmarks with Benefits: What DeepScholar-Bench Really Measures

TL;DR DeepScholar-Bench introduces a live (continuously refreshable) benchmark and a holistic automated evaluation for generative research synthesis. Its reference pipeline, DeepScholar‑base, is simple yet competitive. The headline: today’s best systems organize text well but miss key facts, under-retrieve important sources, and fail verifiability at scale. That’s not a death knell—it’s a roadmap. Why this matters for business readers Enterprise “research copilots” promise to digest the live web, summarize options, and provide auditable citations. In practice, three gaps keep showing up: ...

Faking It to Make It: When Synthetic Data Actually Works

The latest tutorial by Li, Huang, Li, Zhou, Zhang, and Liu surveys how GANs, diffusion models, and LLMs now mass‑produce synthetic text, tables, graphs, time series, and images for data‑mining workloads. That’s the supply side. The demand side—execs asking “will this improve my model and keep us compliant?”—is where most projects stall. This piece extracts a decision framework from the tutorial and extends it with business‑grade evaluation and governance so you can decide when synthetic data is a shortcut—and when it’s a trap. ...

Judge, Jury, and Chain‑of‑Thought: Making Models StepWiser

TL;DR Generative judges that think before they judge—and are trained with online RL using stepwise labels—beat classic discriminative process reward models (PRMs). The StepWiser approach brings three wins: (1) higher accuracy at spotting the first bad step, (2) cleaner, more reliable inference via a “chunk‑reset” search that prunes bad steps while keeping overall length similar, and (3) better data selection for fine‑tuning. Why this matters (for builders and buyers) Most enterprise CoT systems fail not because they can’t produce long reasoning, but because they can’t police their own steps. Traditional PRMs act like a yes/no bouncer at each step—fast, but shallow. StepWiser reframes judging as its own reasoning task: the judge writes an analysis first, then issues a verdict. That small shift has big, practical consequences: ...

Talk, Tool, Triumph: Training Agents with Real Conversations

TL;DR Most “tool‑using” LLMs still practice in sterile gyms. MUA‑RL moves training into the messy real world by adding an LLM‑simulated user inside the RL rollout, wiring the agent to call actual tools and rewarding it only when the end task is truly done. The result: smaller open models close in on or beat bigger names on multi‑turn benchmarks, while learning crisper, policy‑compliant dialogue habits. Why this matters now Enterprises don’t want chatty copilots; they want agents that finish jobs: modify an order under policy, update a ticket with verified fields, push a fix to a repo, or reconcile an invoice—often across several conversational turns and multiple tools. Supervised fine‑tuning on synthetic traces helps, but it often overfits to static scripts and misses the live back‑and‑forth where users change their minds, add constraints, or misunderstand policy. ...

Wheel Smarts > Wheel Reinvention: What GitTaskBench Really Measures

Agents don’t build Rome from scratch—they retrofit the city. GitTaskBench (arXiv:2508.18993) is the first benchmark that grades code agents on how well they exploit existing GitHub repositories to deliver real-world outcomes, not just pass algorithm puzzles. It also puts a price tag on success via an Alpha value that blends accuracy with cost, bringing long-missing business realism to agent evals. TL;DR What’s new: 54 tasks across 7 modalities (image, video, speech, office docs, web scraping, security/privacy, biosignals), each paired to a real repo and a practical, automated test harness. Why it matters: The hard part isn’t just writing code—it’s environment setup, dependency wrangling, repo comprehension, and workflow orchestration. Headline result: Even the best stack—OpenHands + Claude 3.7—passes only ~48% of tasks; environment/setup issues cause ~65% of all failures. Business twist: The Alpha value estimates net economic benefit per task by combining success, quality, and token costs. Expensive tasks become clear wins; cheap tasks require ruthless cost control. The Benchmark, de-jargoned Problem framed: In real shops, devs search, fork, and adapt. GitTaskBench simulates that reality. Each task gives an agent a specific repo (e.g., DeOldify, Scrapy, NeuroKit, SpeechBrain) and a concrete user goal (e.g., “colorize this photo” or “extract author/quote pairs into CSV”). Success is determined by a task-specific metric (e.g., NIQE for image quality; SNR/SDR for speech separation; field-level F1 for scraping; column/row fidelity for office docs) and an execution check (the thing actually runs and outputs in the right format). ...

Agents on the Clock: Turning a 3‑Layer Taxonomy into a Build‑Ready Playbook

Most “agent” decks promise autonomy; few explain how to make it shippable. A new survey of LLM‑based agentic reasoning frameworks cuts through the noise with a three‑layer taxonomy—single‑agent methods, tool‑based methods, and multi‑agent methods. Below, we translate that map into a practical build/run playbook for teams deploying AI automation in real workflows. TL;DR Single‑agent = shape the model’s thinking loop (roles, task prompts, reflection, iterative refinement). Tool‑based = widen the model’s action space (APIs, plugins/RAG, middleware; plus selection and orchestration patterns: sequential, parallel, iterative). Multi‑agent = scale division of labor (centralized, decentralized, or hierarchical; with cooperation, competition, negotiation). Treat these as orthogonal dials you tune per use‑case; don’t jump to multi‑agent if a reflective single agent with a code‑interpreter suffices. 1) What’s genuinely new (and useful) here Most prior surveys were model‑centric (how to finetune or RLHF your way to better agents). This survey is framework‑centric: it formalizes the reasoning process—context $C$, action space $A = {a_{reason}, a_{tool}, a_{reflect}}$, termination $Q$—and shows where each method plugs into the loop. That formalism matters for operators: it’s the difference between “let’s try AutoGen” and “we know which knob to turn when the agent stalls, loops, or hallucinates.” ...

Stop at 30k: How Hermes 4 Turns Long Chains of Thought into Shorter Time‑to‑Value

TL;DR Hermes 4 is an open‑weight “hybrid reasoner” that marries huge synthetic reasoning corpora with carefully engineered post‑training and evaluation. The headline for operators isn’t just benchmark wins—it’s control: control of format, schema, and especially when the model stops thinking. That last bit matters for latency, cost, and reliability. Why this matters for business readers If you’re piloting agentic or “think‑step” LLMs, two pains dominate: Unbounded reasoning length → blow‑ups in latency and context costs. Messy outputs → brittle downstream integrations. Hermes 4 addresses both with: (a) rejection‑sampled, verifier‑backed reasoning traces to raise answer quality, and (b) explicit output‑format and schema adherence training plus length‑control fine‑tuning to bound variance. That combo is exactly what production teams need. ...

MoA vs. Moat: Agentic LLMs for Drug Competitor Mapping Cut Diligence Time 20×

The punchline Competitive analysis for drug assets isn’t a tidy table—it’s a scavenger hunt across press releases, registries, investor decks, and alias-riddled drug names. A new paper shows that scaffolded, web-native LLM agents can reliably enumerate true competitors for a given indication, then filter hallucinations with an LLM-as-judge, beating popular “deep research” tools and cutting analyst turnaround from ~2.5 days to ~3 hours. This matters now: the EU’s Joint Clinical Assessments (JCA) regime makes comparator choice visible and consequential; missing a relevant competitor can ripple into pricing, market access, and trial design. In short: MoA (mechanism of action) meets moat (defensible advantage)—and the moat is built from recall. ...