AI Agents

Memory Lane Has Potholes: MemFail and the Business of Testing Agent Recall

Memory is where enterprise AI demos go to become operationally embarrassing. In the demo, the assistant remembers that a client prefers concise weekly updates, that a trader avoids high-leverage positions after volatility spikes, or that a procurement manager only approves a supplier when compliance documents are current. In production, the same assistant may remember the attractive half of the fact and quietly lose the condition. It recalls “approves supplier” but forgets “only when compliance documents are current.” Congratulations: the agent has not forgotten. It has remembered dangerously. ...

Vibe Check: AutoResearch Is a Workflow, Not a Robot Scientist

Demo day is not discovery day. Demo day has a familiar rhythm. An AI system reads papers, proposes an idea, edits code, runs an experiment, drafts a manuscript, and perhaps even produces something that looks suspiciously like a conference submission. The slide title then arrives with great ceremony: autonomous scientist. The paper AutoResearch AI: Towards AI-Powered Research Automation for Scientific Discovery is useful because it interrupts that ceremony before everyone starts clapping at the PDF generator. Its central move is not to deny progress. Current systems really can automate meaningful pieces of research work. They can search, summarize, plan, code, run tools, assemble figures, and draft reports. That is already operationally important. ...

Think Longer, Act Smarter: Why Coding Agents Need Behavior-Preserving Reasoning

Software agents fail in a familiar way. They do not always fail because they are stupid. Sometimes they fail because they are busy. They search too widely, inspect too much, edit too early, revise the wrong file, run out of context, and then collapse under the weight of their own half-formed investigation. In enterprise language: they generate activity before they stabilize a diagnosis. We have seen humans do this too, usually in Slack threads with too many tabs open. The machines are catching up nicely. ...

Reasonable Doubt: Why LLM Reasoning Needs Process Control

Why this matters now: The business case for LLMs has quietly moved from chatbot answers to agentic work: legal review, compliance checking, market research, document synthesis, internal analytics, coding support, and decision preparation. That shift changes the risk profile. A wrong chatbot answer is annoying. A wrong agent that looks coherent, cites documents, calls tools, updates files, and confidently stops too early is a workflow liability wearing a productivity costume. ...

Context Is Not a Costume: Why Strong Agents Still Fail on Contact

The agent looks ready. Then reality answers back. The current AI-agent story is conveniently simple. Take a powerful foundation model, wrap it in tools, give it a workflow, add a polite system prompt, and call the result “ready for deployment.” Reality, as usual, has poor manners. Two recent arXiv papers examine very different agent settings. One studies whether multimodal AI agents can align their behavior with the cognitive age of child users. The other studies whether behavior foundation models for imitation learning can remain robust when the physical dynamics of an environment shift after training. They do not share a benchmark, a model class, or even the same deployment domain. That is precisely why they are useful together. ...

Credit Where It’s Due: The New Reasoning Stack for Agentic AI

Opening: Why this matters now. The current agentic AI conversation has a very convenient myth: if an AI agent fails, give it a better model, a longer context window, more tool calls, and perhaps a heroic prompt containing the phrase “think step by step” in several places. Then wait for magic. Preferably billable magic. ...

The Reward Is in the Room: Why AI Automation Needs Better Judgment, Not Just Bigger Models

Opening: Why this matters now. AI adoption has entered its second, less glamorous phase. The first phase was easy to explain: make the model generate things. Emails, reports, code, dashboards, summaries, customer replies, compliance drafts, market notes, training content. Give the machine a prompt, admire the fluent output, and pretend the future has arrived because the paragraphs are well-spaced. ...

Edge Cases: Why Graph World Models May Make AI Agents Less Lost

Opening: Why this matters now. Every serious AI roadmap now contains some version of the same promise: agents that do not merely answer questions, but perceive a situation, remember what matters, simulate what could happen next, and choose an action. The software industry has given this ambition a polite name: “agentic AI.” The less polite version is: we are trying to make machines behave usefully in environments that keep changing while everyone is still arguing about the requirements document. ...

Ctrl+Z Is Not a Strategy: When LLM Self-Correction Actually Works

Opening: Why this matters now. Agentic AI systems are currently being sold with a suspiciously comforting ritual: generate an answer, ask the same model to reflect, then ask it to improve the answer. Repeat until the dashboard looks busy. In demos, this feels intelligent. In production, it may simply be a very expensive way to turn correct answers into wrong ones. ...

Org-Charted Territory: Why AI Agents Need Middle Management

Opening: Why this matters now. The AI industry has spent the last two years trying to turn large language models into workers. The result is a small circus of agents: coding agents, browser agents, research agents, support agents, spreadsheet agents, and agents that appear to exist mainly to summon other agents. Naturally, the next problem is not intelligence. It is management. ...