LLM Agents

Chains of Causality, Not Just Thought

TL;DR for operators: Causal Influence Prompting, or CIP, is a safety method for LLM agents that asks the model to build and consult a causal influence diagram before acting. Instead of telling the agent, “be safe,” it asks the agent to represent the task as a graph: what facts matter, what choices are available, what outcomes are useful, and what outcomes are harmful. This is a better shape for the problem, because agents do not merely answer questions. They click buttons, run code, forward messages, use tools, and occasionally behave as if “sure, why not?” were a compliance framework. ...
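The graph idea in the CIP teaser can be made concrete with a toy data structure. What follows is a minimal sketch, not the paper's implementation: the class `InfluenceDiagram`, the node kinds `fact`, `choice`, `useful` and `harmful`, and the email scenario are all hypothetical names chosen for illustration. It flags any choice that has a causal path to a harmful outcome.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class InfluenceDiagram:
    # node name -> kind: "fact", "choice", "useful" or "harmful" (hypothetical labels)
    kinds: Dict[str, str] = field(default_factory=dict)
    # directed edges: node name -> nodes it influences
    edges: Dict[str, List[str]] = field(default_factory=dict)

    def add_node(self, name: str, kind: str) -> None:
        self.kinds[name] = kind
        self.edges.setdefault(name, [])

    def add_edge(self, src: str, dst: str) -> None:
        self.edges[src].append(dst)

    def reachable(self, start: str) -> Set[str]:
        """Return every node influenced by `start`, directly or indirectly."""
        seen: Set[str] = set()
        stack = [start]
        while stack:
            node = stack.pop()
            for nxt in self.edges[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    def risky_choices(self) -> List[str]:
        """Choices with a causal path to at least one harmful outcome."""
        return sorted(
            name
            for name, kind in self.kinds.items()
            if kind == "choice"
            and any(self.kinds[n] == "harmful" for n in self.reachable(name))
        )


# Hypothetical scenario: an agent deciding what to do with an email.
diagram = InfluenceDiagram()
diagram.add_node("email_contains_credentials", "fact")
diagram.add_node("forward_email", "choice")
diagram.add_node("summarise_email", "choice")
diagram.add_node("user_informed", "useful")
diagram.add_node("credentials_leaked", "harmful")

diagram.add_edge("email_contains_credentials", "credentials_leaked")
diagram.add_edge("forward_email", "credentials_leaked")
diagram.add_edge("forward_email", "user_informed")
diagram.add_edge("summarise_email", "user_informed")

print(diagram.risky_choices())
```

In this toy case both choices reach the useful outcome, but only one also reaches the harmful one, which is the kind of distinction a bare “be safe” instruction leaves implicit.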

Chatbot at the Table: Rethinking Group Recommendations with GenAI

TL;DR for operators: Dinner plans are where elegant recommender theory goes to be quietly embarrassed. Five people do not usually open a dedicated app, rate every restaurant, agree on a utility function, and wait for a ranked list to descend from the heavens. They argue in a chat. They change their minds. Someone forgets the budget. Someone says “anything is fine” while absolutely not meaning it. Someone else proposes a venue that is closed on Mondays. Humanity, as usual, remains a hostile runtime environment. ...

Grounded and Confused: Why RAG Systems Still Fail in the Enterprise

TL;DR for operators: Enterprise RAG does not fail because the chatbot forgot to sound confident. It fails because the answer is often scattered across the least glamorous parts of the company: Slack threads, meeting transcripts, pull requests, document revisions, customer reports, employee metadata, and URLs somebody pasted into a chat six weeks ago. ...

Catalysts of Thought: How LLM Agents are Reinventing Chemical Process Optimization

TL;DR for operators: Chemical-process optimisation does not usually fail because nobody has heard of optimisation. It fails earlier, in the less glamorous swamp where someone has to decide what operating ranges are even allowed. Temperatures, separator conditions, pressure drops, utility trade-offs, convergence behaviour, equipment limits: all the tedious things that make optimisation useful and prevent it from becoming a very fast route to nonsense. ...

Mind Games for Machines: How Decrypto Reveals the Hidden Gaps in AI Reasoning

TL;DR for operators: Meetings are easy to automate until someone has to understand what everyone else thinks everyone else knows. That is the useful discomfort created by Decrypto, a new benchmark for multi-agent reasoning and theory of mind in language models.[1] The benchmark is built around a simple word game. Alice and Bob share four secret keywords. Alice receives a three-digit code and gives three public hints. Bob must recover the code. Eve sees the same hints but does not know the secret keywords and tries to intercept. Alice’s job is therefore not “give good clues.” It is “give clues calibrated to Bob’s knowledge while limiting Eve’s inference.” Welcome to enterprise communication, but with fewer calendar invites. ...

The Joy of Many Minds: How JoyAgents-R1 Unleashes the Power of Multi-LLM Reinforcement Learning

TL;DR for operators: A naming note before the machinery starts: the existing Cognaptus title says JoyAgents-R1, but the arXiv paper itself names the benchmark HiMA-Ecom and the training method HiMA-R1. This revision uses the paper’s terminology, because accuracy is not decorative trim. The paper is useful for operators because it does not simply say “use more agents.” That slogan is old, cheap, and usually followed by a demo in which three chatbots politely agree with one another until the invoice arrives. The real contribution is more specific: the authors build a hierarchical e-commerce assistant benchmark, then train the master agent and specialised sub-agents jointly with reinforcement learning instead of optimising them as isolated prompt puppets.[1] ...

From Sparse to Smart: How PROGRM Elevates GUI Agent Training

TL;DR for operators: Every GUI automation project has a familiar failure mode: the agent gets almost there, makes one bad click, and the training system treats the whole episode as garbage. That is tidy for spreadsheets and absurd for learning. ProgRM addresses that absurdity by replacing final-only success/failure rewards with step-level estimates of task progress.[1] Instead of asking only, “Did the agent finish?”, it asks, “How much closer is the agent now than it was one step ago?” The reward is the change in estimated progress. A search that reaches the right article but fails to bookmark it is no longer equivalent to an agent staring at the home screen and scrolling like a caffeinated intern. ...
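The “reward is the change in estimated progress” idea fits in a few lines. This is a rough sketch under stated assumptions, not ProgRM's actual code: the function name `progress_rewards` and the progress values are invented for illustration, and in practice the estimates would come from a learned progress model rather than a hand-written list.

```python
from typing import List


def progress_rewards(progress: List[float]) -> List[float]:
    """Step reward = estimated progress after the step minus progress before it."""
    return [later - earlier for earlier, later in zip(progress, progress[1:])]


# Hypothetical progress estimates in [0, 1] for a "find the article, then bookmark it" task.
near_miss = [0.0, 0.25, 0.5, 0.75, 0.75]  # reaches the article, never bookmarks it
idle = [0.0, 0.0, 0.0, 0.0, 0.0]          # scrolls the home screen the whole time

near_miss_rewards = progress_rewards(near_miss)
idle_rewards = progress_rewards(idle)

print(near_miss_rewards, sum(near_miss_rewards))
print(idle_rewards, sum(idle_rewards))
```

Under a final-only reward both episodes would score the same, since neither finishes the task. With per-step differences, the near miss still earns credit for the steps that moved it closer.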

Divide and Model: How Multi-Agent LLMs Are Rethinking Real-World Problem Solving

TL;DR for operators: Real business problems do not arrive as tidy exam questions. They arrive as “Can we optimise this logistics network?”, “Which markets should we prioritise?”, “How many clinics do we need?”, or “What happens if the subsidy disappears?” The annoying part is not the equation. The annoying part is deciding what the equation should even represent. ...

Mind the Context: How ContextAgent Listens, Sees, and Acts Before You Ask

TL;DR for operators: ContextAgent is not interesting because it imagines an assistant that talks before the user does. We already have enough software that talks before anyone asks. The interesting part is more disciplined: it tries to decide when an assistant should remain silent, when it should intervene, and which external tools it should call when intervention is justified. ...

Reflections in the Mirror Maze: Why LLM Reasoning Isn't Quite There Yet

TL;DR for operators: Adding “reasoning” to an LLM agent is not the same as making it reason better. Wong et al. test four open-source models across dynamic SmartPlay tasks using a baseline prompt, reflection, reflection plus an Oracle that mutates heuristics, and reflection plus a Planner that simulates short future trajectories.[1] The clean result is not “planning wins” or “bigger models win.” The result is more annoying, therefore more useful: the same scaffold can be a booster, a distraction, or a failure amplifier. ...