AI Agents

When Your Agent Knows It’s Lying: Detecting Tool-Calling Hallucinations from the Inside

The expensive part of an AI agent making things up is not always the sentence it writes. Sometimes it is the API call it sends. A chatbot can hallucinate a policy clause and embarrass itself. An agent can hallucinate a function call and move money, query the wrong data, calculate the wrong dose, bypass an audit trail, or quietly pretend it used a tool when it actually guessed. That is a different species of failure. The output may still look tidy. The JSON may still parse. The function name may even exist. The problem is that the agent has selected the wrong action in a system that treats actions as real. ...

Graph Before You Leap: How ComfySearch Makes AI Workflows Actually Work

Pipelines break at the seams. Pipelines look simple when drawn on a slide. A user asks for an image. A model generates it. A workflow saves it. Somewhere in the middle, a few helpful boxes connect to a few other helpful boxes, and the whole thing becomes “automation.” Lovely. Very managerial. Then someone opens the real workflow. ...

MobileDreamer: When GUI Agents Stop Guessing and Start Imagining

A phone screen is not difficult because it is visually beautiful. It is difficult because it keeps changing. Tap the wrong button, and a form disappears. Scroll too far, and the useful item vanishes below the fold. Open the wrong menu, and the agent spends the next three steps politely recovering from its own confidence. Anyone who has watched a GUI agent operate a mobile app has seen the pattern: it often looks competent right until the interface asks for a small amount of foresight. ...

Batch of Thought, Not Chain of Thought: Why LLMs Reason Better Together

Fraud review is not a solo sport. A risk analyst looking at one suspicious seller can notice a strange product description, a vague company name, or a price range that feels wrong. But the real signal often appears only when several sellers are placed side by side. One shop looks unusual. Ten shops with the same naming pattern, same product mismatch, and same pricing behavior start to look less like noise and more like a system. ...

Infinite Tasks, Finite Minds: Why Agents Keep Forgetting—and How InfiAgent Cheats Time

A report is not finished because the model “understands” the assignment. It is finished because the system still knows, two hundred actions later, which documents were read, which notes were trustworthy, which sections remain unfinished, and which half-baked intermediate answer should not accidentally become the final one. That is the boring part of agentic AI. Naturally, it is also the part most systems quietly fail at. ...

MAGMA Gets a Memory: Why Flat Retrieval Is No Longer Enough

Memory is where many impressive agents quietly become mediocre employees. They can answer the last question. They can summarize the last document. They can sound very confident about a customer, a project, or a workflow they saw three weeks ago. Then someone asks, “Why did we make that decision?”, “When did the requirement change?”, or “Was that the same client who objected last time?” Suddenly the agent rummages through its past like a consultant searching Slack at 1:43 a.m. Technically alive. Not exactly organized. ...

Trust Issues at 35,000 Feet: Assuring AI Digital Twins Before They Fly

Airspace is a bad place to discover that your simulation was “mostly right.” That sentence is obvious enough to sound useless, but it points to the real issue. For an AI-enabled digital twin of air traffic control, being “accurate” is not one property. It is a stack of claims. The data must be representative. The software representation must preserve the right details. The trajectory predictor must handle uncertainty rather than pretending aircraft behave like obedient geometry. The AI agents using the twin must receive, act on, and explain information without corrupting the control problem on the way. ...

EverMemOS: When Memory Stops Being a Junk Drawer

Memory sounds simple until the assistant has to remember two incompatible things at once. A customer loves craft beer. The same customer is temporarily taking antibiotics. A flat memory system retrieves “likes IPA” and recommends a variety pack, because apparently “memory” means grabbing the loudest sticky note from a drawer and pretending it is wisdom. A more useful assistant retrieves the preference, the medical constraint, the timing, and the relation among them. It recommends a mocktail and quietly avoids turning personalization into negligence. ...

LeanCat-astrophe: Why Category Theory Is Where LLM Provers Go to Struggle

A developer can understand what a software function should do, write something that looks reasonable, and still fail because the surrounding codebase expects a particular interface, naming convention, object hierarchy, or sequence of calls. Giving the developer four independent attempts may eventually fix a misplaced bracket. It does little when the real problem is that they do not know which internal abstraction the system expects. ...

Deployed, Retrained, Repeated: When LLMs Learn From Being Used

Acceptance is a reward, even when nobody writes reward = 1. Imagine an enterprise deploys an AI agent to generate code, reconcile invoices, or prepare operational plans. Some outputs pass automated checks and enter production. Others fail, disappear into logs, and are never seen again. Months later, the accepted outputs are collected and used to fine-tune the next model. ...