Multi-Agent

Evolving Minds: How LLMs Teach Themselves Through Adversarial Cooperation

The dream of self-improving intelligence has long haunted AI research—a model that learns not from humans, but from itself. Multi-Agent Evolve (MAE) by Yixing Chen et al. (UIUC, NVIDIA, PKU) gives that dream a concrete architecture: three versions of the same LLM—Proposer, Solver, and Judge—locked in a continuous loop of challenge, response, and evaluation. No human labels. No external verifiers. Just the model, teaching itself through the friction of disagreement. ...
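To make the loop concrete, here is a minimal sketch of one challenge, response, and evaluation round. The names `self_play_round` and `RoundResult` and the single numeric score are illustrative assumptions, not the paper's interface. In MAE the three roles are the same LLM; here they are stubbed as plain callables.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RoundResult:
    question: str
    answer: str
    score: float  # hypothetical scalar reward from the Judge


def self_play_round(
    propose: Callable[[], str],
    solve: Callable[[str], str],
    judge: Callable[[str, str], float],
) -> RoundResult:
    """Run one challenge -> response -> evaluation cycle."""
    question = propose()             # Proposer writes a task
    answer = solve(question)         # Solver attempts it
    score = judge(question, answer)  # Judge grades the attempt
    return RoundResult(question, answer, score)


# Stub roles; a real setup would call one LLM with three different prompts.
result = self_play_round(
    propose=lambda: "What is 2 + 2?",
    solve=lambda q: "4",
    judge=lambda q, a: 1.0 if a == "4" else 0.0,
)
print(result)
```

In a training setup the score would feed back into the model's update; that step is left out here because the excerpt does not describe it.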

Reason, Reveal, Resist: The Persuasion Duality in Multi‑Agent AI

TL;DR: In LLM multi‑agent systems, how a model thinks matters more than how big it is. Explicit reasoning (thinking mode / CoT) creates a Persuasion Duality: sharing a model’s reasoning makes it far better at convincing others, while enabling the model’s own reasoning mode makes it far harder to convince. This shifts best practices for agent design, governance, and product UX. Why this paper matters: Cognition, not just parameter count, now drives the social dynamics of agent swarms. For Cognaptus clients building agent workers (ops, compliance, research, trading), the result is practical: toggling reasoning changes not just accuracy, but influence. Your deployment choices can tilt a network toward consensus, stalemate, or resilient truth‑seeking. ...

Recon, Then Wreck the Roadblocks: How Recon‑Act Turns Web Stumbles into Tools

Thesis: The next leap in practical web agents isn’t bigger models or deeper search trees; it’s a tight loop that learns by failing well. Recon‑Act’s two‑team architecture (Reconnaissance → Action) turns mistakes into generalized tools and feeds them back into execution. That’s not just a benchmark trick; it’s an operating system for enterprise‑grade automation. Why this matters (for operators, not just researchers): Most “browser LLMs” still thrash on real websites: ambiguous DOMs, mixed text‑image signals, fragile flows, and long horizons. Recon‑Act reframes the problem: when progress stalls, stop trying harder and learn smarter. It does three things companies can copy tomorrow: ...

Repo, Meet Your Agent: Turning GitHub into a Workforce with EnvX

Why this matters: Most “AI + devtools” still treat repos as documentation you read and code you copy. EnvX flips the model: it agentizes a repository so it can understand your request, set up its own environment (deps, data, checkpoints), run tasks end‑to‑end, verify results, and even talk to other repo‑agents. That’s a step change, from “NL2Code” to “NL2Working System.” The core shift in one line: Instead of you integrating a repo, the repo integrates itself into your workflow, and can collaborate with other repos when the task spans multiple systems. ...

Kernel Kombat: How Multi‑Agent LLMs Squeeze 1.32× More From Your GPUs

TL;DR: Astra is a multi‑agent LLM system that optimizes existing CUDA kernels instead of generating them from PyTorch. On three production‑relevant SGLang kernels, it delivered a 1.32× average speedup (up to 1.46×) without fine‑tuning, using only structured zero‑shot prompting. The win isn’t a single trick; it’s a division of labor: testing, profiling, planning, and coding are each handled by a specialized agent, and the agents iterate toward faster, still‑correct kernels. Why this matters for business readers: GPU efficiency is the new gross margin. If your serving stack pushes trillions of tokens per day, a 25–45% kernel‑level speedup compounds into: ...
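One piece of arithmetic worth keeping straight when reading these numbers: a speedup factor is not the same as the fraction of kernel time saved. The helper below is a hypothetical name for illustration, not something from the paper, and it only covers the kernels themselves, not end‑to‑end serving cost.

```python
def time_saved_fraction(speedup: float) -> float:
    """Fraction of kernel time saved for a given speedup factor (> 0)."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    # New time is old_time / speedup, so the saving is 1 - 1/speedup.
    return 1.0 - 1.0 / speedup


for s in (1.32, 1.46):
    print(f"{s:.2f}x speedup -> {time_saved_fraction(s):.1%} less kernel time")
```

So the reported 1.32× average corresponds to roughly 24% less time spent in those kernels, which is the figure to carry into any cost estimate.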

Branching Out of the Middle: How a ‘Tree of Agents’ Fixes Long-Context Blind Spots

TL;DR: Tree of Agents (TOA) splits very long documents into chunks, lets multiple agents read in different orders, shares evidence, prunes dead-ends, caches partial states, and then votes. The result: fewer hallucinations, resilience to the “lost in the middle” effect, and accuracy comparable to premium large models, all while using a compact backbone. Why this matters for operators: If your business parses contracts, annual reports, medical SOPs, or call-center transcripts, you’ve likely felt the pain of long-context LLMs: critical details buried mid-document get ignored; retrieval misses cross-paragraph logic; and bigger context windows inflate cost without guaranteeing better reasoning. TOA is a pragmatic middle path: it re-imposes structure on attention, not by scaling a single monolith, but by coordinating multiple lightweight readers with disciplined information exchange. ...
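The chunk, reorder, and vote steps can be sketched in a few lines. This is a simplified illustration under stated assumptions: `chunk`, `vote_over_orders`, and the `read` callable are hypothetical names, the agents are stubs, and the evidence sharing, pruning, and caching that TOA also performs are omitted.

```python
from collections import Counter
from itertools import islice, permutations
from typing import Callable, List


def chunk(text: str, size: int) -> List[str]:
    """Split text into fixed-size character chunks (the last may be shorter)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def vote_over_orders(
    chunks: List[str],
    read: Callable[[List[str]], str],
    max_orders: int = 3,
) -> str:
    """Let each 'agent' read the chunks in a different order, then majority-vote."""
    # islice keeps this lazy, so long documents do not enumerate every permutation.
    orders = islice(permutations(range(len(chunks))), max_orders)
    answers = [read([chunks[i] for i in order]) for order in orders]
    return Counter(answers).most_common(1)[0][0]


# Stub reader that simply answers with the first chunk it sees.
pieces = chunk("abcdef", 2)
winner = vote_over_orders(pieces, read=lambda cs: cs[0])
print(pieces, winner)
```

Taking the first few permutations is only a convenience for the demo; a real system would pick reading orders more deliberately.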

Mind the Gap: How OSC Turns Agent Chatter into Compound Intelligence

Multi‑agent LLMs work great on paper and go sideways in practice. We over‑select experts, flood the channel with verbose thoughts, and then pray a meta‑LLM can stitch it all together. OSC (Orchestrating Cognitive Synergy) proposes a missing middle: a learned orchestration layer that constantly models what each agent knows, spots “cognitive gaps,” and then tells agents how to talk—what to say, to whom, and at what level of detail—before the aggregator votes. ...

Judgment Day for RAG: How L‑MARS Cuts Legal Hallucinations by Design

TL;DR: L‑MARS replaces single‑pass RAG with a judge‑in‑the‑loop multi‑agent workflow that iteratively searches, checks sufficiency (jurisdiction, date, authority), and only then answers. On a 200‑question LegalSearchQA benchmark of current‑year questions, it reports major gains vs. pure LLMs, at the cost of latency. For regulated industries, the architecture, not just the model, does the heavy lifting. What’s actually new here: Most legal QA failures aren’t from weak language skills; they’re from missing or outdated authority. L‑MARS tackles this with three design commitments: ...
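The search, check, then answer pattern looks roughly like the sketch below. Everything named here (`Evidence`, `is_sufficient`, `answer_with_judge`, the rule‑based check) is an assumption made for illustration; the excerpt only says the judge checks jurisdiction, date, and authority, not how.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Evidence:
    text: str
    jurisdiction: str
    year: int
    authoritative: bool


def is_sufficient(evidence: List[Evidence], jurisdiction: str, min_year: int) -> bool:
    """Toy judge: at least one current, on-jurisdiction, authoritative source."""
    return any(
        e.jurisdiction == jurisdiction and e.year >= min_year and e.authoritative
        for e in evidence
    )


def answer_with_judge(
    question: str,
    search: Callable[[str, int], List[Evidence]],
    compose: Callable[[str, List[Evidence]], str],
    jurisdiction: str,
    min_year: int,
    max_rounds: int = 3,
) -> Optional[str]:
    """Search repeatedly; answer only once the judge accepts the evidence."""
    evidence: List[Evidence] = []
    for round_no in range(max_rounds):
        evidence.extend(search(question, round_no))
        if is_sufficient(evidence, jurisdiction, min_year):
            return compose(question, evidence)
    return None  # refuse rather than answer from weak evidence


calls: List[int] = []


def fake_search(question: str, round_no: int) -> List[Evidence]:
    calls.append(round_no)
    if round_no == 0:
        return [Evidence("old blog post", "CA", 2019, False)]
    return [Evidence("current statute", "CA", 2025, True)]


answer = answer_with_judge(
    "Is X allowed?", fake_search,
    compose=lambda q, ev: f"Based on {len(ev)} sources",
    jurisdiction="CA", min_year=2025,
)
print(answer, calls)
```

The extra search round in the demo is also where the latency cost mentioned above comes from.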

Assert Less, Observe More: AICL and the New QA Stack for LLM Apps

TL;DR: Traditional QA treats software as deterministic; LLM apps aren’t. This paper proposes a three‑layer view (System Shell → Prompt Orchestration → LLM Inference) and argues for a collaborative testing strategy: retain classical testing where it still fits, translate assertions into semantic checks, integrate AI‑safety style probes, and extend QA into runtime. The kicker is AICL, a compact agent‑interaction protocol that bakes in observability, context isolation, and deterministic replay. Why this matters for operators and product teams: LLM products now look like systems, not prompts. They combine RAG, tools, stateful multi‑turn workflows, and sometimes multi‑agent handoffs. The result is probabilistic behavior plus cross‑layer failure modes. If you keep writing boolean, exact‑match tests, you’ll ship brittle releases and discover regressions in production. The fix isn’t to abandon testing; it’s to move from asserting single outputs to observing semantic behavior distributions. ...
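As a rough illustration of moving from single‑output assertions to behavior distributions, the sketch below samples an app several times and checks how often a semantic predicate holds. The `pass_rate` helper, the keyword predicate, and the 0.6 threshold are all assumptions for the example; a real semantic check would more likely use an embedding or model‑based grader.

```python
from itertools import cycle
from typing import Callable


def pass_rate(generate: Callable[[], str], check: Callable[[str], bool], runs: int) -> float:
    """Fraction of sampled outputs that satisfy a semantic check."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    passed = sum(1 for _ in range(runs) if check(generate()))
    return passed / runs


# Stub app that cycles through canned replies instead of calling an LLM.
canned = cycle(["Refund approved.", "Your refund is approved.", "I cannot help."])
rate = pass_rate(
    generate=lambda: next(canned),
    check=lambda out: "refund" in out.lower(),  # stand-in for a semantic grader
    runs=9,
)

# Gate on the distribution, not on any one exact string.
assert rate >= 0.6
print(f"pass rate: {rate:.2f}")
```

An exact‑match test would fail on the second reply even though it means the same thing as the first; the rate‑based gate tolerates that variation while still catching the third.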

Edge of Reason: Orchestrating LLMs Without a Conductor

TL;DR: Most multi‑agent LLM frameworks still rely on a central organizer that becomes expensive, rigid, and a single point of failure. Symphony proposes a fully decentralized runtime, built on a capability ledger, a beacon‑based selection protocol, and weighted Chain‑of‑Thought (CoT) voting, to coordinate lightweight 7B‑class models on consumer GPUs. In benchmarks (BBH, AMC), Symphony outperforms centralized baselines like AutoGen and CrewAI, narrowing the performance gap between stronger and weaker models and adding fault tolerance with near‑negligible orchestration overhead. ...
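Weighted voting itself is simple to sketch: each agent submits an answer with a weight, and the answer with the largest total wins. The `weighted_vote` function and the example weights are illustrative assumptions; how Symphony actually derives its weights is not stated in the excerpt.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def weighted_vote(ballots: Iterable[Tuple[str, float]]) -> str:
    """Return the answer whose ballots carry the largest total weight."""
    totals: Dict[str, float] = defaultdict(float)
    for answer, weight in ballots:
        totals[answer] += weight
    if not totals:
        raise ValueError("no ballots cast")
    return max(totals, key=totals.get)


# Two moderately confident agents outvote one highly confident agent.
ballots = [("A", 0.9), ("B", 0.5), ("B", 0.5)]
print(weighted_vote(ballots))
```

No central coordinator is needed for this step: any node that has collected the ballots can compute the same result.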