LLM-Agents

Blueprints of Agency: Compositional Machines and the New Architecture of Intelligence

A prototype begins innocently enough: a product team wants a small machine, a vehicle, a tool, a fixture, perhaps a mechanism that throws something across a room because medieval engineering apparently never left the group chat. The modern AI pitch says the agent can design it. Give it parts, constraints, and a goal; let it reason; let it test; let it improve. ...

Pods over Prompts: Shachi’s Playbook for Serious Agent-Based Simulation

A boardroom simulation is only useful if you know what was being simulated. That sounds obvious. It is also where many AI-agent demos quietly fall apart. Give one hundred language-model agents a set of personas, drop them into a toy market, forum, election, auction, or customer-support queue, and the result will usually look interesting. Someone panics. Someone coordinates. Someone overpays. Someone posts something faintly unhinged. Excellent. We have recreated the internet. ...

Failures, Taxonomized: How Multi‑Level Reflection Turns Agents Into Self‑Learners

Failure is usually treated as waste. The demo breaks, the agent apologises, someone adds a prompt patch, and everyone pretends the next retry will be more mature. Very enterprise. Very ceremonial. The SaMuLe paper makes a more useful claim: failed agent runs are not just embarrassing logs. They are the curriculum.[1] More precisely, they are raw material for a structured reflection pipeline that turns messy trajectories into error taxonomies, cross-task lessons, and finally a small retrospective model trained to diagnose future failures. ...

Memory That Fights Back: How SEDM Turns Agent Logs into Verified Knowledge

Every agent platform eventually develops a storage problem and pretends it is a memory strategy. The logs are all there: user turns, tool calls, partial plans, failed attempts, corrected answers, retry traces, database lookups, compliance notes, and the occasional heroic workaround that actually solved something. The tempting move is obvious. Store everything. Embed everything. Retrieve whatever looks semantically close. Then call it “long-term memory,” because “expensive junk drawer with cosine similarity” sounds less fundable. ...

Search Party in a Notebook: JUPITER Turns Data Analysis into a Tree Game

A notebook is not just a file. In most companies, it is where the analyst tried three joins, fixed the date column, discovered the leakage, reran the model, cursed quietly, and eventually produced the chart that made it into Monday’s meeting. Then the notebook was archived, copied, half-forgotten, and treated as residue. ...

Small Gains, Long Games: Why Tiny Accuracy Bumps Explode into Big Execution Wins

A workflow does not fail because the first step is hard. It fails because the seventeenth step is boring, the twenty-third step depends on a slightly wrong state, and by the thirty-first step the agent is confidently building on its own rubbish. Very enterprise. Very scalable. Very expensive. The paper behind this article, The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs, makes a deceptively simple point: judging LLM progress by short-task accuracy can badly understate the value of reliability gains over long workflows.[1] A model that improves only slightly on a single step may become dramatically better at completing long sequences without failure. That is not motivational poster mathematics. It is compounding. ...
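The compounding claim is easy to check with a few lines of arithmetic. The sketch below is an illustration, not the paper's code: it assumes every step succeeds independently with the same probability, which real agents will not satisfy exactly, and the function names `chain_success` and `horizon_length` are made up for this example.

```python
import math


def chain_success(step_accuracy: float, steps: int) -> float:
    """Probability of finishing `steps` steps in a row with no error.

    Assumption: steps fail independently with the same per-step accuracy.
    """
    return step_accuracy ** steps


def horizon_length(step_accuracy: float, target: float = 0.5) -> float:
    """Task length at which the whole-chain success rate drops to `target`.

    Solves step_accuracy ** n == target for n.
    """
    return math.log(target) / math.log(step_accuracy)


# Compare three hypothetical per-step accuracies on a 30-step workflow.
for p in (0.90, 0.99, 0.995):
    print(
        f"step accuracy {p:.3f}: "
        f"30-step success {chain_success(p, 30):.3f}, "
        f"50% horizon ~{horizon_length(p):.0f} steps"
    )
```

Under these assumptions the 50% horizon moves from roughly 7 steps at 90% step accuracy to roughly 69 at 99% and roughly 138 at 99.5%. The last half-point of step accuracy, which looks like a rounding error on a short benchmark, about doubles the length of workflow the model can finish.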

Guardrails Before Gas: Secure Plan‑Then‑Execute Agents for Real Work

Every executive agent demo eventually reaches the same awkward moment: the model stops being a chatbot and starts touching things. Files. APIs. Databases. Code runners. Email clients. Payment workflows. Production systems, because apparently we enjoy giving probabilistic text engines access to expensive buttons. The paper Architecting Resilient LLM Agents: A Guide to Secure Plan-then-Execute Implementations argues that the core safety problem is not merely that agents sometimes reason badly. The sharper problem is that many agent architectures let untrusted information change what the agent decides to do next.[1] That is a control-flow problem. And control-flow problems are not solved by asking the model, very politely, to behave. ...

Tool Time, Any Time: Inside RLFactory’s Plug‑and‑Play RL for Multi‑Turn Tool Use

Tool calls are where agent demos stop being cute. A chatbot can talk through a task all day. A working agent has to search, query, execute, verify, retry, and sometimes discover that the tool it politely called has returned a malformed answer after making everyone wait. That is the difference between “reasoning about work” and doing work. The former gives you fluent paragraphs. The latter gives you latency, interface contracts, timeout handling, reward ambiguity, and a suspicious number of JSON parsing errors. Glamorous, naturally. ...

Plan, Act, Replan: When LLM Agents Run the Aisles

Retail planning usually fails in the hand-off. A sales team sets a target. Inventory planners translate it into stock positions. Procurement checks supplier feasibility. Operations discovers warehouse constraints. Someone exports a spreadsheet, someone else reworks the assumptions, and by the time the plan looks executable, the market has already wandered off with the innocence of a cat near an open laptop. ...

Plan, Don't Spam: The Goldilocks Rule for Test‑Time Compute

A busy agent is not necessarily a thinking agent. Anyone who has watched an LLM agent narrate every tiny move knows the feeling. It reviews the goal. It drafts a plan. It revises the plan. It reconsiders the revision. Then, with exquisite deliberation, it clicks the wrong button. The transcript looks intelligent; the behaviour looks like a consultant trapped in a revolving door. ...