Autonomous Agents

AgenticPay: When LLMs Start Haggling for a Living

Procurement looks boring until the software starts spending money. A human buyer can be slow, inconsistent, and occasionally allergic to spreadsheets. But at least we know what failure looks like: overpaying, accepting bad terms, walking away too late, or trusting the wrong supplier. When the buyer is an LLM agent, the failure mode becomes more polished. It can overpay in fluent English. It can miss a deal while sounding reasonable. It can keep bargaining after the answer is already visible. Progress, apparently, now comes with better punctuation. ...

FIRE-BENCH: Playing Back the Tape of Scientific Discovery

A demo can make an AI research agent look impressive in ten minutes. Give it a task, watch it create files, install packages, run experiments, generate tables, and write something that sounds like a conclusion. Productivity theater, now with terminal logs. The harder question is less cinematic: did it actually discover the right thing? ...

Click with Confidence: Teaching GUI Agents When Not to Click

A click looks harmless until it is not. In consumer software, a wrong click means opening the wrong tab, dismissing the wrong pop-up, or buying the wrong color of phone case. Annoying, perhaps. Civilization survives. In enterprise workflows, a wrong click can approve a payment, change a configuration, delete a record, or submit a compliance form with the confidence of a sleepwalker holding admin rights. ...

RAudit: When Models Think Too Much and Still Get It Wrong

The model is not always confused. Sometimes it has already done the work and reached the right answer, and then it politely walks away from that answer because the user sounded confident. That is the quietly irritating problem behind RAudit, a paper that studies how large language models behave when their reasoning is audited without giving the auditor the correct answer. The paper is not just another “LLMs can be sycophantic” warning. We have enough of those. At this point, saying models flatter users is like saying spreadsheets contain hidden errors. True, useful, and somehow still not enough to change deployment practice. ...

SokoBench: When Reasoning Models Lose the Plot

A corridor is not supposed to be hard. There is one player. One box. One goal. No maze. No clever trap. No branching strategy tree with a thousand tempting wrong turns. The player stands at one end, the goal sits at the other, and the box is between them. Push the box along the corridor until it reaches the goal. That is the task. ...

Your Agent Remembers—But Can It Forget?

Memory is usually sold as a virtue. An AI agent with memory sounds safer, smarter, more personal, more autonomous. A warehouse robot remembers where boxes were placed. A navigation agent remembers which corridor led to the exit. A workflow agent remembers what the user asked yesterday and uses that context tomorrow. This is the comforting version of memory: the past as an asset. ...

When Memory Stops Guessing: Stitching Intent Back into Agent Memory

Memory fails in a very ordinary way. A customer asks, “Can we use the same approval condition as before?” A research agent says, “Yes.” A procurement assistant retrieves the old vendor quote. A planning copilot remembers a hotel price from yesterday’s itinerary. Everything looks semantically relevant. The words match. The entities match. The embedding score smiles politely. ...

Reasoning or Guessing? When Recursive Models Hit the Wrong Fixed Point

Sudoku is a useful toy problem because it is cruel in exactly the right way. A nearly completed grid with one blank cell should be easier than a brutal puzzle with dozens of missing entries. Humans know this. Basic software knows this. A model that can solve hard Sudoku should not suddenly collapse when the puzzle becomes almost finished. ...

Lean LLMs, Heavy Lifting: When Workflows Beat Bigger Models

Seats are not just seats. For an airline, a seat can be sold as a cheap restricted fare, a flexible economy fare, or not sold at all. A passenger who cannot buy one fare may upgrade, switch flights, or disappear into a competitor’s booking funnel. Multiply that across routes, departure times, fare classes, demand segments, aircraft capacity, and network balance rules, and the innocent phrase “optimize ticket sales” becomes a fairly effective trap for language models. ...

Think Before You Sink: Streaming Hallucinations in Long Reasoning

A bad answer is easy to audit. It sits there, smug and wrong. A bad reasoning process is worse. It looks useful while it is drifting. It explains itself. It produces intermediate steps that sound locally plausible. It may even correct one mistake while preserving another, like a spreadsheet with a broken formula hiding behind tasteful formatting. ...