LLM Agents

Rules of Engagement: How Meta‑Policy Reflexion Turns Agent Memory into Guardrails

A support bot forgets the same refund exception every Monday. A procurement agent keeps calling the wrong API before checking vendor status. A workflow assistant learns, apologises, retries, then makes the same mistake next quarter because the lesson lived only in the chat transcript. Very human. Also not especially useful. That is the practical problem behind Meta-Policy Reflexion, a paper that asks whether LLM agents can keep the benefit of verbal self-reflection without turning every failure into a one-off therapy session.[1] The authors propose Meta-Policy Reflexion (MPR), a training-free framework that distils failed-trajectory reflections into a structured Meta-Policy Memory (MPM), then uses that memory in two ways: softly, by putting relevant rules into the agent’s prompt; and hard, by checking generated actions against admissibility constraints before execution. ...
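
For readers who think in code, the two uses of memory described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `Rule` fields, the keyword-based retrieval, the `guarded_step` helper, and the `escalate_to_human` fallback are all assumptions made up for this example.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Rule:
    """One distilled lesson. Hypothetical schema, not the paper's."""
    text: str                 # natural-language rule shown to the agent
    keywords: frozenset[str]  # crude relevance trigger (assumption)
    action: str               # the action this rule constrains
    requires: str             # action that must already be in the history


@dataclass
class MetaPolicyMemory:
    rules: list[Rule] = field(default_factory=list)

    def relevant(self, task: str) -> list[Rule]:
        # Keyword overlap stands in for whatever retrieval a real system uses.
        words = set(task.lower().split())
        return [r for r in self.rules if r.keywords & words]

    def build_prompt(self, task: str) -> str:
        """Soft use: put relevant rules in front of the task."""
        rules = self.relevant(task)
        if not rules:
            return f"Task: {task}"
        bullets = "\n".join(f"- {r.text}" for r in rules)
        return f"Rules learned from earlier failures:\n{bullets}\n\nTask: {task}"

    def is_admissible(self, task: str, action: str, history: list[str]) -> bool:
        """Hard use: reject an action whose prerequisite has not happened yet."""
        return not any(
            r.action == action and r.requires not in history
            for r in self.relevant(task)
        )


def guarded_step(
    memory: MetaPolicyMemory,
    task: str,
    propose: Callable[[str], str],  # stand-in for the LLM call
    history: list[str],
    fallback: str = "escalate_to_human",  # assumed fallback, not from the paper
) -> str:
    action = propose(memory.build_prompt(task))
    return action if memory.is_admissible(task, action, history) else fallback
```

With a single rule such as "check vendor status before creating a purchase order", a proposed `create_po` action is swapped for the fallback until `check_vendor_status` appears in the history. The design point is that the rule does double duty: the same record is advice in the prompt and a gate before execution.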

From Prompts to Policies: The Agentic RL Playbook

A chatbot can answer a question. An agent has to do something after the answer stops being enough. That distinction sounds obvious until a system must browse, click, call an API, write code, inspect an error, remember what it tried, and decide whether another attempt is worth the cost. At that point, “better prompting” becomes the AI equivalent of telling a logistics team to be more mindful while the warehouse is on fire. Pleasant, perhaps. Not a control system. ...

Patience Is Profit: Can LLM Agents Stabilize DePIN’s Token Rails?

TL;DR for operators: DePIN projects do not only need more nodes. They need node providers who do not panic-exit every time token economics wobble, because physical infrastructure has an awkward habit of being physical. Routers, hotspots, sensors, GPUs, and energy devices cannot be managed like a spreadsheet row that politely disappears when the chart turns red. ...

Mirror, Signal, Maneuver: How 'Self' Labels Nudge LLM Cooperation

TL;DR for operators: A paper on LLM self-recognition used an iterated public goods game to test a deceptively small intervention: tell an agent it is playing against “another AI agent,” or tell it that it is playing against a model with its own name.[1] The result was not a clean fairy tale about models recognising themselves and becoming benevolent little collectivists. Shame. That would have been simpler. ...

Agents on the Clock: Turning a 3‑Layer Taxonomy into a Build‑Ready Playbook

TL;DR for operators: Most agent projects fail in a wonderfully unglamorous place: not at “intelligence”, but at the loop. The agent forgets what it already did. It calls the wrong tool. It reflects poetically instead of usefully. It delegates to three other agents because the demo looked impressive, then spends the next minute staging a management retreat in token form. Charming, but not production. ...

Preference Chains of Command: Making LLM Agents Pick Like People

TL;DR for operators: Cities rarely wait for perfect data. A new district still needs a transit plan, a campus still needs a shuttle model, and a developer still wants to know whether people will walk, drive, or quietly defeat the entire urban-design deck by ordering a car. The paper behind this article introduces Preference Chain, a method that uses a small sample of behavioural mobility data to guide an LLM agent’s transport choices.[1] The important bit is not that it “adds Graph RAG” to an LLM. That phrase now covers everything from serious retrieval systems to someone throwing a Neo4j logo onto a slide. The real mechanism is narrower and more useful: Preference Chain turns sparse human travel records into structured priors over likely choices, then lets the LLM adjust those priors for context. ...

Enemy at the Gates, Friends at the Table: Why Competition Makes LLM Agents More Cooperative

TL;DR for operators: Competition is usually sold as the thing that makes agents sharper, more adversarial, and perhaps a little too pleased with themselves. This paper points in a more useful direction: controlled external competition can make agent teams more cooperative internally, but only when it is paired with repeated interaction. The study places Qwen3 14B, Phi4 reasoning, and Cogito 14B agents into Iterated Prisoner’s Dilemma tournaments under three conditions: repeated interaction only, group competition only, and a combined “super-additive” setup where agents face both team structure and repeated encounters.[1] For Qwen3 and Phi4, the combined setting produces the strongest cooperation. Qwen3’s mean cooperation rate rises from 0.22 in repeated interaction and 0.23 in group competition to 0.32 in the combined setting. Phi4 moves more sharply, from 0.21 and 0.13 to 0.43. ...

Stackelbergs & Stakeholders: Turning Bits into Boardroom Moves

TL;DR for operators: BusiAgent is best read as a blueprint for governed AI work, not as proof that LLMs have learned to run companies. The paper proposes a multi-agent framework where business roles (CEO, CFO, CTO, Marketing Manager, Product Manager, HR, and others) coordinate through delegation, peer discussion, tool use, memory, and quality checks.[1] ...

USB‑C for Agents, Stress‑Tested: What MCP‑Universe Really Reveals

TL;DR for operators: MCP-Universe is useful because it punctures a very convenient belief: once an LLM is connected to tools through MCP, the agent is basically “integrated” and therefore close to production-ready. The paper says: adorable, but no.[1] The benchmark tests agents against real MCP servers rather than toy APIs. It covers 231 tasks across Location Navigation, Repository Management, Financial Analysis, 3D Design, Browser Automation, and Web Searching. It uses 11 MCP servers, 133 tools, and 84 execution-based evaluators, including dynamic evaluators that retrieve live ground truth for time-sensitive tasks. ...

Prefix, Not Pretext: A One‑Line Fix for Agent Misalignment

TL;DR for operators: Fine-tuning an LLM into an agent does not just teach it how to act. It can also teach it to act when it should refuse. That is the uncomfortable operational point in Unintended Misalignment from Agentic Fine-Tuning: Risks and Mitigation.[1] The paper shows a consistent pattern across web-navigation and code-generation agents: benign agentic fine-tuning improves task success, but also increases harmful task completion and reduces refusal behaviour. The model has not been trained on a manifesto of evil. It has been trained to complete tasks. Apparently that is quite enough. ...