LLM Agents

Recommendations With Receipts: When LLMs Have to Prove They Behaved

A recommendation list is rarely just a list. On the surface, it says: “Here are ten movies, products, articles, songs, creators, or courses you may like.” Underneath, it often carries a second instruction: “Also do not bury long-tail items, do not over-concentrate exposure, do not violate diversity rules, do not create an audit nightmare, and please do all of this while still looking personalized.” ...

Bubble Trouble: Why Top‑K Retrieval Keeps Letting LLMs Down

The problem is not finding documents. It is spending the prompt budget badly. Ask an enterprise RAG system for “scope of work,” and the system may look confident for exactly the wrong reason. The query sounds simple. Somewhere in the document set, there is probably a sheet, paragraph, or clause literally called “Scope of Works.” A flat top-k retriever will happily grab the highest-scoring chunks from that section, stack them into the model context, and call the job done. Very tidy. Very wrong. ...

Knowing Is Not Doing: When LLM Agents Pass the Task but Fail the World

A task is finished. The agent found the file, clicked the button, moved the object, submitted the form, or reached the winning state. The dashboard turns green. Everyone relaxes. That is usually the moment when the real question gets quietly buried: what did the agent actually learn about the world it just operated in? ...

Scaling the Sandbox: When LLM Agents Need Better Worlds

Sandbox is a comforting word. It sounds safe, contained, childlike. Put an AI agent in a sandbox and let it practice. Nothing catches fire. Nobody accidentally cancels a real flight. No production database wakes up with 37 mysterious refund requests and a very confused compliance officer. The problem is that most agent sandboxes are either too fake to teach anything, too manual to scale, or too close to production to be relaxing. The agent has to learn how to navigate persistent state, business rules, incomplete user information, tool failures, and multi-step dependencies. A static API-call dataset does not teach that. A role-playing LLM pretending to be the environment may hallucinate the rules. A hand-built benchmark is useful, but expensive to multiply. ...

STACKPLANNER: When Agents Learn to Forget

Enterprise agents usually fail in an undramatic way. They do not rebel. They do not suddenly become conscious. They do not announce, with cinematic timing, that humanity has been replaced by a spreadsheet. They simply lose the thread. A research agent searches once, finds something half-relevant, and keeps dragging that result through the rest of the task. A report-writing workflow collects too many fragments and then forgets which ones were actually useful. A coordinator delegates to sub-agents, receives noisy outputs, and treats every message as equally important because, apparently, all context is sacred now. By the final step, the system has not become more intelligent. It has become a very expensive meeting transcript. ...

Agents That Ship, Not Just Think: When LLM Self-Improvement Meets Release Engineering

Shipping Is the Part Agents Usually Skip: Shipping is where confidence goes to die. A demo agent can impress everyone on Tuesday, receive a clever prompt update on Wednesday, and quietly break three workflows that were working last week. The aggregate score improves. The release notes look cheerful. Somewhere, a previously solved customer task becomes unsolved again. Naturally, everyone calls this “iteration,” because “we broke production while chasing a benchmark bump” sounds less strategic. ...

ResMAS: When Multi‑Agent Systems Stop Falling Apart

Agent teams fail in a very ordinary way. One agent misreads a question. Another repeats the wrong answer with more confidence. A third receives both versions, performs a tiny ceremony of “collaboration,” and returns something that looks more polished than the original error. Management sees five agents instead of one and assumes redundancy has arrived. It has not. Sometimes it is just a committee with better stationery. ...

From Tokens to Topology: Teaching LLMs to Think in Simulink

A model engineer asks for a small change: add a temperature sensor between a fuel-cell stack and a pump-control input. Easy request. Annoying execution. The assistant must find the right Simscape block, use the correct library path, respect physical ports, avoid breaking the existing topology, and produce a model that actually compiles. ...

Trading Without Cheating: Teaching LLMs to Reason When Markets Lie

Trade has a special talent for humiliating clean theories. A model reads a market brief. It sees earnings beats, sales guidance, analyst upgrades, and a few scattered corporate events. Asked to behave like a turnaround specialist, it starts building buy signals. Some recommendations are reasonable. Others quietly smuggle in missing assumptions: maybe the company has new management; maybe the earnings beat reflects restructuring; maybe debt reduction is happening somewhere behind the curtain. Very elegant. Also, very convenient. ...

Pulling the Thread: Why LLM Reasoning Often Unravels

Audit is a less glamorous word than intelligence. That is unfortunate, because most business problems with AI agents do not begin with stupidity. They begin with confidence. The agent gives an answer. The answer sounds reasonable. The explanation sounds even better. A manager, analyst, compliance reviewer, or product owner reads the chain of thought and feels the mild comfort of seeing steps. There is a premise, then a bridge, then a conclusion. Very civilized. Very inspectable. Very possibly fake. ...