Evaluation

When More Becomes Smarter: The Unreasonable Effectiveness of Scaling Agents

Desktops are where AI ambition goes to discover gravity. A chatbot can sound competent in one turn. A coding assistant can look brilliant inside a bounded file. But ask an agent to use a real computer for a long task — open the right app, edit the right file, preserve formatting, notice a pop-up, verify the final state, and not confidently click itself into a small administrative tragedy — and the problem changes. Intelligence is no longer a single answer. It is a chain of actions, each one able to quietly poison the next. ...

Backtrack to Breakthrough: Why Great AI Agents Revisit

Search is easy. Knowing when to go back is harder. That is the useful irritation inside GSM-Agent, a new benchmark for studying agentic reasoning under controlled conditions.[1] The paper takes grade-school maths problems from GSM8K, removes the premises from the prompt, hides those premises in a searchable document database, and asks an LLM agent to recover the facts before solving the problem. The arithmetic is not supposed to be impressive. That is the point. If a model fails here, we cannot calmly blame differential geometry, PhD-level law, or some mysteriously adversarial enterprise workflow. The agent simply did not find and use the facts. ...
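To make the setup concrete, here is a toy sketch of the idea, not the paper's pipeline. The problem text, the distractor document, the keyword-overlap `search` function, and the multiply-the-two-numbers "solver" are all illustrative assumptions; a real agent would use an LLM to issue queries and to reason over what it finds.

```python
import re

def split_problem(text: str) -> tuple[list[str], str]:
    """Split a word problem into premise sentences and the final question."""
    sentences = [s.strip() for s in re.split(r"(?<=[.?])\s+", text.strip()) if s.strip()]
    premises = [s for s in sentences if not s.endswith("?")]
    question = next(s for s in sentences if s.endswith("?"))
    return premises, question

def search(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Toy keyword search: rank documents by words shared with the query."""
    def words(s: str) -> set[str]:
        return set(re.findall(r"[a-z0-9]+", s.lower()))
    q = words(query)
    return sorted(docs, key=lambda d: -len(q & words(d)))[:k]

# Hypothetical GSM8K-style problem, invented for illustration.
problem = ("Maya bought 3 boxes of pencils. Each box holds 12 pencils. "
           "How many pencils does Maya have?")
premises, question = split_problem(problem)

# Premises are hidden in a database alongside an irrelevant distractor.
distractors = ["The school library opens at 9 in the morning."]
database = premises + distractors

# The agent sees only the question and must query the database for the facts.
found = search(question, database)

# Stand-in for model reasoning: multiply the two recovered quantities.
numbers = [int(n) for fact in found for n in re.findall(r"\d+", fact)]
answer = numbers[0] * numbers[1]
print(question, "->", answer)
```

The arithmetic is trivial by design; the only way this sketch can fail is by retrieving the wrong facts, which is the failure mode the benchmark is built to isolate.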

Options = Power: Turning Empowerment into a KPI for AI Agents

Login. That is where many agent evaluations become strangely unserious. A benchmark asks whether the agent completed a task. A dashboard records whether the browser session ended successfully. A monitoring system checks whether the tool call returned an error. Then the agent enters valid credentials and suddenly gains access to a much larger part of the environment. ...

Vitals, Not Vibes: Inside the New Anatomy of Personal Health Agents

TL;DR for operators: Personal health AI is usually sold as a friendly chatbot with a fitness tracker bolted on. This paper argues for something more awkward, more expensive, and much more plausible: a coordinated system of specialised agents. One agent analyses longitudinal wearable and health-record data. One grounds advice in health knowledge and user context. One handles coaching, goal-setting, and behaviour change. An orchestrator decides who should act, who should support, what should be remembered, and how the final answer should be assembled.[1] ...

ReAct Without the Chaos: AgentScope 1.0 Turns Tools into Strategy

TL;DR for operators: AgentScope 1.0 is best read as a production-shaping framework for agentic applications, not as a victory lap over rival agent frameworks. Alibaba’s paper describes a developer-centric stack that rebuilds agents around four core abstractions (message, model, memory, and tool), then places a ReAct-style reasoning-and-action loop on top of them.[1] ...
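For readers who want the shape of that loop, here is a minimal sketch in plain Python. It does not use AgentScope's API: the `Message`, `Memory`, `TOOLS`, `scripted_model`, and `react_loop` names, and the `CALL`/`FINAL` text protocol, are all hypothetical stand-ins chosen only to show how the four abstractions fit under a reason-act-observe cycle.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Message:
    role: str      # "user", "assistant", or "tool"
    content: str

@dataclass
class Memory:
    messages: list[Message] = field(default_factory=list)

    def add(self, msg: Message) -> None:
        self.messages.append(msg)

def add(a: str, b: str) -> str:
    """A toy tool: add two integers passed as strings."""
    return str(int(a) + int(b))

TOOLS: dict[str, Callable[..., str]] = {"add": add}

def scripted_model(memory: Memory) -> str:
    """Stand-in for an LLM: call a tool once, then answer with its result."""
    last = memory.messages[-1]
    if last.role == "tool":
        return f"FINAL {last.content}"
    return "CALL add 2 3"

def react_loop(question: str, model: Callable[[Memory], str], max_steps: int = 5) -> str:
    memory = Memory()
    memory.add(Message("user", question))
    for _ in range(max_steps):
        reply = model(memory)                              # reason
        memory.add(Message("assistant", reply))
        kind, *rest = reply.split()
        if kind == "FINAL":
            return " ".join(rest)
        name, *args = rest                                 # act
        memory.add(Message("tool", TOOLS[name](*args)))    # observe
    return "no answer within step budget"

result = react_loop("What is 2 + 3?", scripted_model)
print(result)
```

The step budget is the part worth noticing: without `max_steps`, a model that keeps calling tools would never return control to the caller.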

Agents of Disruption: How LLMs Became Adversarial Testers for Autonomous Driving

TL;DR for operators: AGENTS-LLM is not another attempt to make a language model dream up an entire traffic world and then hope the simulator forgives the hallucination. It does something narrower and more operationally useful: it takes an existing real-world driving scenario, accepts a natural-language instruction such as adding a parked vehicle, jaywalker, accident site, or construction zone, and produces an augmented scenario that can be executed in closed-loop autonomous-driving simulation.[1] ...

Mind the Gap: Fixing the Flaws in Agentic Benchmarking

TL;DR for operators: Agent benchmark scores are starting to function like procurement documents. They appear in model cards, vendor decks, research claims, and internal build-versus-buy decisions. The awkward finding in this paper is that some of those scores do not measure what buyers think they measure. Zhu et al. introduce the Agentic Benchmark Checklist, or ABC, to audit whether an agentic benchmark has valid tasks, valid outcome grading, and adequate reporting.[1] Applying it to ten widely used agentic benchmarks, they find task-validity flaws in seven, outcome-validity flaws in seven, and reporting limitations in all ten. ...

Playing with Strangers: A New Benchmark for Ad-Hoc Human-AI Teamwork

TL;DR for operators: Teamwork is the awkward part of agentic AI. It is easy to show a model completing a task when the environment is clean, the instructions are explicit, and the other “teammates” behave exactly as expected. Real deployments are less polite. Humans omit context, follow local conventions, adapt unevenly, and occasionally do something that looks wrong only because the system has misunderstood the room. ...

Unchained Distortions: Why Step-by-Step Image Editing Breaks Down While Chain-of-Thought Shines

TL;DR for operators: Image-editing demos are easy. Ask a model to remove one object, recolour a jacket, or add a tasteful lamp, and most modern systems can produce something impressive enough for a product page and a LinkedIn post. Ask it to perform eight connected edits while keeping the original subject, layout, texture, lighting, and realism intact, and the polite showroom smile begins to crack. ...