AI Agents

Meerkat or Mirage? When AI Safety Fails in Plain Sight (Across Traces)

A leaderboard can look clean until someone reads the logs. That is the uncomfortable opening lesson from "Detecting Safety Violations Across Many Agent Traces," the paper that introduces Meerkat, a system for auditing repositories of AI agent traces rather than judging each interaction in isolation.[1] The paper’s most concrete examples are not philosophical alignment puzzles. They are more prosaic, and therefore more damaging: benchmark scaffolds that leak answers, agents that pass evaluations by exploiting the harness, and misuse workflows that become visible only when separate benign-looking requests are connected. ...

Playing Both Sides: How Multi-Agent Scripts Teach AI to Lie, Detect, and Decide

A meeting goes wrong in a familiar way. One team has the dashboard. Another has the client history. Legal has the contract clause nobody read until Friday afternoon. Sales knows what was promised, but not what can be delivered. Everyone is technically telling the truth, except when they are not, and the final decision depends on stitching together partial evidence from people with different incentives. ...

Thinking Fast, Remembering Slow: Why SWE-AGILE Fixes the Memory Crisis of AI Agents

Memory sounds like a storage problem. Give the agent a longer context window, let it keep the full conversation, and the work should become easier. This is the kind of solution that looks obvious until it meets a real software repository, a failing test suite, a long terminal log, and a model that now has to find one important clue buried somewhere in the middle of its own autobiography. ...

Anchors Away: Rethinking How AI Agents Learn to Use Tools

A tool-using AI agent usually fails in a very ordinary way. It does not announce a philosophical crisis. It calls the wrong tool, calls the right tool too many times, writes malformed code, searches before thinking, or confidently takes a useless action because the training process rewarded motion rather than judgment. This is the unglamorous part of agent deployment. The demo shows the agent booking, searching, calculating, and reporting. The training log shows wasted exploration, unstable optimization, and a strange habit of confusing “using tools” with “thinking better.” Apparently, giving a model a calculator does not automatically make it an accountant. Shocking. ...

Protocol Over Hype: Why AI Drug Discovery Agents Need Memory, Not Just Models

Drug discovery is a wonderful place for AI demos. The model proposes a molecule, the molecule looks plausible, a docking score improves, and the slide deck starts to glow with that familiar color: almost-commercial blue. Then the evaluation protocol arrives and ruins the party. The problem is simple, and therefore easy to underestimate. A drug discovery agent is rarely asked to return one impressive molecule. It is asked to return a set of molecules that jointly satisfies several requirements: enough candidates, enough diversity, acceptable binding proxies, drug-likeness, synthetic accessibility, novelty, and other threshold-style constraints. One molecule can look good. A few molecules can look good. The final returned pool can still fail. ...

Spatial-Gym and the Illusion of Thinking: Why AI Can’t Walk Before It Runs

Agents are supposed to act. That is the promise hiding behind most enterprise AI demos: the model will not merely answer a question, but inspect a system, choose the next step, correct itself, and reach a useful outcome. The interface changes from chat box to workflow loop, and suddenly everyone starts using the word “agent” with the confidence of a person who has never watched a model get lost in a four-by-four grid. ...

The Ask Gap: Why AI Agents Fail Not Because They Can’t Think — But Because They Don’t Know When to Stop

A ticket lands in the queue. It looks ordinary: update a parser, answer a business question, patch a workflow, produce a SQL query. The agent opens the files, explores the schema, writes code, runs a few checks, and submits something plausible. The output is polished. The reasoning trace is confident. The dashboard marks the task as completed. ...

The Monoculture Trap: When AI Coordinates Too Well

AI agents are excellent at finding the obvious answer. That sounds like a compliment until the task is to avoid everyone else’s obvious answer. Imagine three firms using AI assistants to screen applicants, forecast demand, or decide which customer segments deserve attention. If the goal is consistency, shared focal points are useful. Everyone reads the same policy, applies similar criteria, and avoids the usual mess of human improvisation. Lovely. The spreadsheet smiles. ...

Seeing Is Not Solving: Why AI Still Gets Stuck in 3D Worlds

Wall. That is not the grand philosophical frontier AI companies usually place in their product decks. The frontier is supposed to be reasoning, planning, tool use, autonomy, maybe a tasteful diagram with arrows and a glowing robot hand. But in a visually rich 3D world, a surprisingly large part of “autonomy” still reduces to something less glamorous: can the agent notice that it is stuck against a wall, step back, change angle, and continue? ...

From Search to Synthesis: Why AI’s Next Leap Requires Structured Thinking

Spreadsheet. That is where many impressive AI research reports quietly go to die. A model can browse twenty web pages, produce a polished executive memo, cite three market reports, and still fail at the boring part: comparing numbers, checking whether a table supports a claim, generating the right chart, and then explaining what the chart actually means. The output looks like research. The mechanism underneath is closer to literary confidence with a browser tab. ...