LLM-Agents

Scaffold and Ladder: Why AI Agents Need Meta-Reasoning, Not Longer Monologues

Workflow is where AI agents usually stop looking magical. Ask one to summarize a short memo, and it behaves like a competent intern with suspiciously fast typing. Ask it to investigate a compliance question across policies, contract clauses, ticket histories, and messy attachments, and the illusion starts to wobble. The agent searches once, reads too much at once, jumps to a plausible answer, and then politely explains the wrong conclusion with the confidence of a junior consultant who has discovered formatting. ...

Think Longer, Act Smarter: Why Coding Agents Need Behavior-Preserving Reasoning

A coding agent can fail in two very different ways. One failure is obvious: it does not think enough. It sees an error report, guesses the wrong file, edits too early, and then spends the rest of the trajectory debugging its own mistake. Anyone who has watched an autonomous coding agent wander through a repository has seen this little tragedy. The machine is busy, but not necessarily useful. ...

Don’t Average the Needle: Spectral Retrieval and the RAG Evidence Problem

Enterprise search has a very old habit wearing a very modern jacket: it averages. A policy document becomes one vector. A runbook becomes one vector. A postmortem full of operational detail becomes one vector. Then a RAG system asks that one vector whether the document is relevant. This is convenient, fast, and usually defensible — until the relevant answer is a narrow paragraph hiding inside a large document. At that point, the retrieval system is no longer searching for evidence. It is asking a crowd to speak for the witness. ...

Experience Is Not Memory: Why Learning Agents Need a Better Feedback Loop

A support ticket goes wrong. A workflow agent chooses the wrong tool. A finance assistant misses a procedural step. The usual response is familiar: add the failure to memory, rewrite a prompt, perhaps ask the agent to “reflect” before trying again. This is useful, in the same way that putting a sticky note on a broken machine is useful. It may prevent the same mistake next time. It does not prove the machine has learned how to improve. ...

Think Longer, Act Worse? What M2A Teaches About Reasoning Agents

A coding agent does not fail only because it cannot think. Sometimes it fails because it keeps thinking after it should inspect the repository. Sometimes it writes a plausible explanation before checking the relevant file. Sometimes it burns the context window by wandering through hypotheses, each one almost reasonable, none of them decisive. The result is not stupidity in the familiar sense. It is a coordination failure: the model does not know when to reason, when to call a tool, when to absorb feedback, and when to edit. ...

Silent Errors, Loud Consequences: ASMR-Bench and the Coming Era of AI Auditors

Code review is supposed to be the sober adult in the room. A researcher writes code. A reviewer checks the code. A suspicious bug gets caught before it becomes a chart, a memo, a product decision, or—if everyone is having a particularly expensive week—a board presentation. That model works reasonably well when the failure is accidental and the reviewer has more patience than the author. It becomes less reassuring when the author is an AI research agent, the codebase is messy, the experiment is expensive to rerun, and the suspicious line looks less like a bug than a perfectly normal design choice. ...

Reviewer, Reviewed: When AI Starts Grading the Graders

Review queue. That is where many serious organizations quietly lose time, quality, and patience. A technical team writes a proposal. A risk team checks a report. A grant committee reads applications. A legal or compliance group inspects a document for missing evidence, weak logic, and embarrassing errors. Everyone agrees that review matters. Everyone also knows the reviewers are tired. ...

Evolve or Die Trying: When LLMs Stop Writing Code and Start Designing Algorithms

A developer asks an LLM to “write a better algorithm.” The LLM obliges. It writes code. The code runs, perhaps after a few rounds of apologetic debugging. The result is slightly better than the baseline, or at least sufficiently mysterious to be called “novel.” Everyone nods politely. Another benchmark table is born. ...

The Memory Isn’t Broken — It’s Flat: Why LLMs Need to ‘Draw’ to Remember

Memory is usually sold as a storage problem. Give the agent a vector database. Add a recall layer. Save summaries. Search harder. Expand the context window until the budget department starts making eye contact. Then ask the agent a simple question: what changed after the earlier conversation? That is where the polite demo often turns into a fog machine. ...

CivBench: When AI Stops Guessing and Starts Planning

Scoreboards are comforting. They reduce a messy contest into one neat line: winner, loser, maybe a score. Executives like them, product teams like them, investors like them, and benchmark dashboards absolutely adore them. Strategy, unfortunately, is rude enough not to fit inside that line. A company can make the right decisions and still lose because the market turns. A trading agent can survive a bad regime by managing exposure well, then look mediocre because the final return is not spectacular. A planning system can stumble into success after making terrible intermediate choices. Outcome-only evaluation is clean, but cleanliness is not the same as truth. It is often just a good-looking loss of information. ...