AI Agents

Verify Before You Automate: Why AI Agents Need an Internal Audit Function

A number is a small thing. One integer in one answer. A seating capacity, a contract limit, a delivery quantity, a tax threshold, a credit exposure. Nothing dramatic. Certainly not the sort of thing that should become an architecture problem. Then an AI agent guesses it, sounds confident, stores the guess, and uses it again later. ...

When Your AI Knows Too Little: The Hidden Bottleneck in Personal Agents

Lunch is a simple word. In an AI assistant demo, “order me lunch” looks like the kind of request that should be easy by now. Open the food app. Pick something. Pay. Done. The button-clicking part is no longer the miracle. The problem is everything the user did not say. Do they avoid peanuts? Do they usually order from Tuantuan or Chilemei? Is “light lunch” about calories, price, time, or avoiding the food coma before a meeting? Should the assistant ask first, or does asking defeat the whole point of assistance? And if the user says no, does the assistant actually stop, or does it “helpfully” continue doing the wrong thing with the confidence of a junior consultant holding a fresh slide deck? ...

The Memory Isn’t the Point — It’s the Feeling: Why AI Needs Affective Memory, Not Just Recall

Memory sounds like a simple product feature. A user tells an assistant something today. The assistant remembers it tomorrow. Everyone applauds, the demo works, and someone writes “personalization” on a roadmap slide. Lovely. We have rediscovered a notebook. The harder problem begins when the user does not explicitly say what matters. A student says, “It’s fine.” A customer writes, “No worries.” A therapy-like support user replies with a short, polite sentence that looks neutral in isolation. Locally, the words are harmless. Historically, they may be resignation, guardedness, disappointment, or the emotional equivalent of quietly closing the door. ...

When Feelings Negotiate: Why Emotion Might Be the Missing Layer in AI Agents

Collections. That is probably not the first word people expect in an article about emotionally intelligent AI agents. It sounds too ordinary, too administrative, too full of overdue invoices and politely threatening emails. Good. That is exactly why it is useful. Imagine an automated debt-recovery assistant calling a small business owner whose cash flow has collapsed. The assistant has a target: shorten repayment time. The debtor has a story: delayed receivables, layoffs avoided, a promise to pay later. A normal chatbot can respond with empathy. A larger model can produce warmer phrasing. A compliance-tuned model can avoid saying obviously illegal things, which is a charmingly low bar. ...

Claw-Eval — When Agents Game the System, the System Needs Claws

The agent finished the task. That is not the same as doing the task. Inbox sorted. Calendar updated. Report generated. Customer record changed. Dashboard refreshed. For a demo, that is usually enough. The screen shows a plausible answer, the final artifact looks tidy, and everyone politely pretends the agent must have followed the correct path because the output did not immediately burst into flames. ...

Skill Issue or System Design? How LLMs Actually Follow Instructions

The checklist problem that exposes the model

Checklist tasks look boring. That is exactly why they are useful. Ask an LLM to write a formal email under 50 words, include one required term, avoid another term, and return the result as JSON. None of this sounds intellectually difficult. No theorem proving. No multimodal reasoning. No dramatic benchmark leaderboard screenshot. Just instructions. ...

Memory That Actually Remembers: Why MemMachine Signals a Shift in AI Agent Architecture

Memory sounds simple until a business actually needs it. A sales agent should remember what the client objected to last month. A customer-support agent should remember that a refund exception was already approved. A research assistant should remember which dataset was rejected, not vaguely summarize it into “user prefers cleaner data.” A healthcare or financial assistant should not turn a precise historical statement into a soft personality trait because the memory layer wanted to look elegant. Cute demos tolerate this. Production systems do not. ...

Protocol Over Prompts: Why ANX Rewrites the Rules of AI Agent Interaction

Forms are boring until an AI agent has to fill one. Then the boring form becomes a surprisingly expensive machine. The agent reads the page, interprets the fields, finds the dropdowns, waits for the browser, loads dynamic options, decides what to click, serializes actions, and tries not to leak whatever the user typed into the wrong place. This is not intelligence in the glamorous sense. It is office work wearing a robotic costume. ...

AgentHazard: Death by a Thousand ‘Harmless’ Steps

The dangerous part is the workflow

A developer asks an AI agent to inspect a repository. The agent reads a config file. Normal. It checks a failing script. Normal. It edits a helper file. Still normal. It runs a command to verify the fix. Boringly normal. Then the accumulated workflow has copied sensitive variables, modified a dependency hook, or executed a command that no one would have approved if it had appeared as a single explicit request. ...

Proofs at Scale: When 30,000 Agents Replace the Referee

Mathematics has a management problem. That sounds less romantic than saying it has a reasoning problem, but romance is not usually where bottlenecks hide. A proof can be brilliant, a referee can be diligent, and still the verification system can fail for the boring reason that nobody has enough time to check everything line by line. The paper Automatic Textbook Formalization takes that bottleneck seriously and then does something unusually concrete: it reports a multi-agent system that formalized a 500-plus-page graduate algebraic combinatorics textbook into Lean, with all 340 target definitions and theorems proved, in about one week.1 ...