Multi-Agent Systems

Memory That Fights Back: How SEDM Turns Agent Logs into Verified Knowledge

TL;DR Most “agent memory” is a junk drawer: it grows fast, gets noisy, and slows everything down. SEDM (Self‑Evolving Distributed Memory) proposes an auditable, efficiency‑first overhaul. It verifies each candidate memory by replaying the exact run in a Self‑Contained Execution Context (SCEC), assigns an initial utility‑aligned weight, and then self‑schedules what to retrieve next. The result: higher task accuracy with fewer tokens versus strong memory baselines on FEVER and HotpotQA. ...

Mirror, Signal, Maneuver: How 'Self' Labels Nudge LLM Cooperation

When an agent thinks it sees itself in the mirror, it doesn’t necessarily smile; it sometimes clutches its wallet.

TL;DR In an iterated public‑goods game (20 rounds, 10 tokens per round, 1.6 multiplier), telling models they’re playing “another AI” versus “themselves” shifts contributions by up to ~4 points in some settings. Direction of the shift depends on the prompt persona: with collective prompts, “self” labels often reduced contributions; with selfish prompts, “self” labels sometimes increased matching/cooperation. Effects persist under rephrased prompts and when reasoning traces aren’t requested, and they appear even in four‑agent self‑play variants. For enterprise multi‑agent AI, identity cues are levers. Manage them like you manage feature flags: test, monitor, and standardize.

What the authors tested (and why it’s clever)

Game mechanics. Two (and later four) LLM agents repeatedly choose how much to contribute (0–10) to a common pool each round. Pool is multiplied by 1.6 and split evenly; keeping more is privately optimal, but coordinated contribution yields higher joint payoffs. ...
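To make the incentive concrete, here is a minimal sketch of one round's payoff arithmetic under the mechanics described above. It assumes each agent starts a round with a 10-token endowment and keeps whatever it does not contribute; the function name `round_payoffs` and its parameters are hypothetical, not taken from the paper.

```python
def round_payoffs(contributions, endowment=10, multiplier=1.6):
    """Return each agent's payoff for one round of a public-goods game.

    Assumption (not stated in the excerpt): an agent's payoff is the
    tokens it kept plus an equal share of the multiplied pool.
    """
    for c in contributions:
        if not 0 <= c <= endowment:
            raise ValueError("each contribution must be between 0 and the endowment")
    pool = sum(contributions) * multiplier
    share = pool / len(contributions)  # pool is split evenly
    return [endowment - c + share for c in contributions]


def marginal_return(n_agents, multiplier=1.6):
    """Tokens an agent gets back for each token it contributes itself."""
    return multiplier / n_agents


if __name__ == "__main__":
    print(round_payoffs([10, 10]))  # both contribute everything
    print(round_payoffs([0, 10]))   # one free-rides on the other
    print(round_payoffs([0, 0]))    # nobody contributes
    print(marginal_return(2), marginal_return(4))
```

With two agents, each contributed token returns only 0.8 tokens to its contributor (0.4 with four agents), so withholding is individually better in any single round, while full contribution by both yields 16 each instead of the 10 each gets when nobody contributes.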

Enemy at the Gates, Friends at the Table: Why Competition Makes LLM Agents More Cooperative

TL;DR When language‑model agents compete as teams and meet the same opponents repeatedly, they cooperate more, even on the very first encounter. This “super‑additive” effect reliably appears for Qwen3 and Phi‑4, and changes how we should structure agent ecosystems at work.

Why this matters (for builders and buyers)

Most enterprise agent stacks still optimize solo intelligence (one bot per task). But real workflows are competitive–cooperative: sales vs. sales, negotiators vs. suppliers, ops vs. delays. This paper shows that if we architect the social rules (teams + rematches) rather than just tune models, we can raise cooperative behavior and stability without extra fine‑tuning, or even bigger models. ...

Peer Review, But Make It Multi‑Agent: Inside aiXiv’s Bid to Publish AI Scientists

If 2024 was the year AI started writing science, 2025 is making it figure out how to publish it. Today’s paper introduces aiXiv, an open‑access platform where AI agents (and humans) submit proposals, review each other’s work, and iterate until a paper meets acceptance criteria. Rather than bolt AI onto the old gears of journals and preprint servers, aiXiv rebuilds the conveyor belt end‑to‑end.

Why this matters (and to whom)

Research leaders get a way to pressure‑test automated discovery without waiting months for traditional peer review.
AI vendors can plug agents into a standardized workflow (through APIs/MCP), capturing telemetry to prove reliability.
Publishers face an existential question: if quality control is measurable and agentic, do we still need the old queue?

The core idea in one sentence

A closed‑loop, multi‑agent review system combines retrieval‑augmented evaluation, structured critique, and re‑submission cycles to raise the floor of AI‑generated proposals/papers and create an auditable trail of improvements. ...

Agents on the Wire: Protocols, Memory, and Guardrails for Real-World Agentic AI

TL;DR Agentic AI is moving from toy demos to systems that must coordinate, persist memory, and interoperate across teams and services. A new survey maps the landscape—frameworks (LangGraph, CrewAI, AutoGen, Semantic Kernel, Agno, Google ADK, MetaGPT), communication protocols (MCP, ACP, A2A, ANP, Agora), and the fault lines that still block production scale. This article distills what’s ready now, what breaks in production, and how to architect for the protocols coming next. ...

Therapy, Explained: How Multi‑Agent LLMs Turn DSM‑5 Screens into Auditable Logic

TL;DR DSM5AgentFlow uses three cooperating LLM agents—Therapist, Client, and Diagnostician—to simulate DSM‑5 Level‑1 screenings and then generate step‑by‑step diagnoses tied to specific DSM criteria. Experiments across four LLMs show a familiar trade‑off: dialogue‑oriented models sounded more natural, while a reasoning‑oriented model scored higher on diagnostic accuracy. For founders and PMs in digital mental health, the win is auditability: every symptom claim can be traced to a quoted utterance and an explicit DSM clause. The catch: results are built on synthetic dialogues, so ecological validity and real‑world safety remain open. ...

RAGulating Compliance: When Triplets Trump Chunks

TL;DR A new multi‑agent pipeline builds an ontology‑light knowledge graph from regulatory text, embeds subject–predicate–object triplets alongside their source snippets in one vector store, and uses triplet‑level retrieval to ground LLM answers. The result: better section retrieval at stricter similarity thresholds, slightly higher answer accuracy, and far stronger navigability across related rules. For compliance teams, the payoff is auditability and explainability baked into the data layer, not just the prompt. ...

Lights, Camera, Agents: How MAViS Reinvents Long-Sequence Video Storytelling

The dream of generating a fully realized, minute-long video from a short text prompt has always run aground on three reefs: disjointed narratives, visual glitches, and characters that morph inexplicably between shots. MAViS (Multi-Agent framework for long-sequence Video Storytelling) takes aim at all three by treating video creation not as a single monolithic AI task, but as a disciplined production pipeline staffed by specialized AI “crew members.”

The Problem with One-Shot Generators

Single-pass text-to-video systems shine in short clips but crumble under the demands of long-form storytelling. They repeat motions, lose scene continuity, and often rely on users to do the heavy lifting: writing scripts, designing shots, and manually training models for character consistency. This is not just a technical shortcoming; it’s a workflow bottleneck that makes creative scaling impossible. ...

From Chaos to Choreography: The Future of Agent Workflows

In the world of Large Language Model (LLM)-powered automation, agents are no longer experimental curiosities; they’re becoming the operational backbone for scalable, autonomous AI systems. But as the number and complexity of these agents grow, the missing piece is no longer raw capability; it’s choreography. This is where agent workflows come in: structured orchestration frameworks that govern how agents plan, collaborate, and interact with tools, data, and each other. A recent survey of 24 representative systems, from industry platforms like LangChain, AutoGen, and MetaGPT to research frameworks like ReAct and ReWOO, reveals not just technical diversity, but a strategic gap in interoperability. ...

Meta-Game Theory: What a Pokémon League Taught Us About LLM Strategy

When language models battle, their strategies talk back. In a controlled Pokémon tournament, eight LLMs drafted teams, chose moves, and logged natural‑language rationales every turn. Beyond win–loss records, those explanations exposed how models reason about uncertainty, risk, and resource management, exactly the traits we want in enterprise decision agents.

Why Pokémon is a serious benchmark (yes, really)

Pokémon delivers the trifecta we rarely get in classic AI games:

Structured complexity: 18 interacting types, clear multipliers, and crisp rules.
Uncertainty that matters: imperfect information, status effects, and accuracy trade‑offs.
Resource management: limited switches, finite HP, role specialization.

Crucially, the action space is compact enough for language-first agents to reason step‑by‑step without search trees, so we can see the strategy, not just the score. ...