LLM-Agents

Error 404: Peer Review Not Found — How LLMs Are Quietly Rewriting Scientific Quality Control

Deadline. That is the simplest way to understand why modern AI papers contain mistakes. Not because researchers suddenly forgot algebra. Not because reviewers are lazy. Not because the field has collectively decided that proofs are decorative furniture. The more boring explanation is also the more important one: the AI publication machine has scaled faster than the quality-control machinery around it. ...

Stacking the Odds: Why Blocksworld Still Breaks Your Fancy LLM Agent

A robot arm, a few colored blocks, and a table. That is the setup. No messy warehouse, no sensor dust, no tired operator, no forklift reversing into the wrong aisle. Just blocks. And still, the fancy LLM agent stumbles. That is the useful discomfort in Benchmark for Planning and Control with Large Language Model Agents: Blocksworld with Model Context Protocol.¹ The paper does not show a robot revolution. It shows something more valuable for anyone trying to deploy LLM agents in industrial workflows: even in a symbolic world where the rules are explicit, the actions are discrete, the state can be queried, and the tool interface is standardized, reliability degrades as soon as the task stops being politely simple. ...

Short Paths, Sharp Minds: Why Knowledge Graph Distance Feels Like Cognitive Gravity

Map distance is not truth. Anyone who has followed a GPS into a dead-end road knows this already. But distance is still useful. If a restaurant is 300 meters away, it is usually a more plausible lunch option than one across the ocean. If a customer record links directly to an invoice, and that invoice links directly to a shipment, the shipment is a more plausible grounding for a customer-service question than a random supplier buried in another region’s procurement graph. Not guaranteed. Just plausible. That small distinction is where the paper becomes interesting. ...

Think Fast, Act Faster: How 'Thinking-by-Doing' Is Rewiring LLM World Models

Feedback is addictive. Give an AI agent a tool, an API, a database, a browser, a simulator, or a workflow environment, and the temptation is obvious: let it keep poking the world until something works. It tries. It observes. It corrects. It tries again. Compared with a model sitting alone in a prompt box, imagining every possible transition in its head, this looks much healthier. Less hallucinated planning, more contact with reality. Very grown-up. ...

When Agents Treat Agents as Tools: What Tool-RoCo Tells Us About LLM Autonomy

Dispatch is where autonomy usually goes to die. A warehouse manager may have ten workers, three forklifts, two packing stations, and one increasingly dramatic dashboard. The hard part is not merely deciding what each person should do. The hard part is knowing when to call someone in, when to release them, and when extra “help” is just a polite name for congestion. ...

Cutting Through the Noise: How Programmatic Pruning Turns Web Agents into Real Operators

Clicking the right button should not be an intelligence test. For humans, a webpage is usually manageable. We scan the visible screen, ignore the footer, dismiss the newsletter trap, and find the search box without treating every hidden <div> as a philosophical object. Web agents are less lucky. They see a modern page as a swollen mixture of visible text, invisible attributes, nested containers, event handlers, accessibility metadata, layout debris, cookie banners, product cards, promotional links, and enough frontend residue to make “just use the DOM” sound like a mild punishment. ...

Enviro-Mental Gymnastics: Why Cross-Environment Agents Still Trip Over Their Own Feet

Demo day is easy. Give an AI agent one workflow, one tool stack, one database schema, one approval rule, and one forgiving evaluator, and it may look surprisingly competent. It files the ticket. It updates the CRM. It writes the SQL query. Everyone nods. Someone says “agentic transformation,” because apparently every procurement meeting now needs a spell. ...

Agents Behaving Badly: Why 'Agentic AI' Needs Adult Supervision

A travel agent that books a bad flight is annoying. A travel agent that books the wrong flight, triggers a hotel agent to change the reservation, alerts a finance agent to approve reimbursement, and then lets a calendar agent reschedule meetings around the mistake is no longer annoying. It is an organizational incident with a charming user interface. ...

Intent, Actually: Why DeFi Needs a Mind‑Reader

A wallet is easy to watch and hard to understand. That is the small comedy at the centre of DeFi analytics. Every transaction is public, every contract call can be inspected, every log can be dragged into a dashboard, and yet the actual question remains stubbornly human: what was this user trying to do? A swap may be a trade, a hedge, a liquidation defence, an arbitrage leg, a farming manoeuvre, or just someone clicking through a protocol interface with dangerous confidence. The chain shows the footprint. It does not provide the diary. ...

Skills to Pay the Agent Bills: Why LLMs Need Better Moves, Not Bigger Models

Runbooks are underrated. Not the glossy strategy kind. The real kind: “check this first, then open that system, then verify the thing that usually breaks, then escalate only if the next signal appears.” Most operational work is not heroic reasoning. It is structured repetition under partial information. This is exactly where many LLM agents still look strangely amateur. They can describe a process beautifully, then fail to follow it. They can hold a long context window, then ignore the one action that would move the task forward. They can retrieve prior examples, then drown themselves in irrelevant steps. Very impressive. Very expensive. Occasionally useful. ...