Opening — Why this matters now

Multi-agent systems built on large language models are having a moment. From research copilots to autonomous report generators, the promise is seductive: split a complex task into pieces, let specialized agents work in parallel, and coordinate everything with a central planner. In practice, however, these systems tend to collapse under their own cognitive weight.

The problem is not collaboration. It’s memory.

As agentic workflows stretch from minutes to hours and from single queries to deep research pipelines, centralized coordinators accumulate bloated context, propagate early errors, and repeatedly relearn the same coordination patterns. The result is a familiar failure mode: agents that can reason, but cannot sustain reasoning.

The paper behind STACKPLANNER enters this discussion with a refreshingly blunt thesis: long-horizon multi-agent failure is fundamentally a memory management problem, not a scaling problem.

Background — Centralized agents and the memory trap

Most production-grade multi-agent systems converge toward some form of centralized coordination. Decentralized debate frameworks are flexible but noisy; fully automated agent swarms struggle with global consistency. A central coordinator simplifies planning, delegation, and synthesis.

But centralization introduces two structural weaknesses:

  1. Task memory bloat — Every sub-agent output, tool call, and partial result gets appended to the coordinator’s context. Noise accumulates faster than signal.
  2. Experience amnesia — Each new task is treated as a blank slate. Prior successful coordination strategies are rarely reused in a structured way.

Existing solutions rely on passive fixes: context truncation, heuristic summarization, or longer context windows. These approaches treat memory as a side effect of reasoning, rather than a controllable system resource.

STACKPLANNER flips that assumption.

Analysis — What STACKPLANNER actually does

STACKPLANNER is a hierarchical, centralized multi-agent framework where memory is elevated to a first-class control object.

1. A deliberately small coordinator action space

The central coordinator does not reason about execution details. Instead, it operates over a compact set of meta-actions:

| Action   | Purpose                                      |
| -------- | -------------------------------------------- |
| PLAN     | Decide the next coordination step            |
| DELEGATE | Assign scoped subtasks to specialized agents |
| REVISE   | Actively edit task memory                    |

This design enforces role discipline. The coordinator plans and edits memory; sub-agents execute.
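The constrained action space can be sketched in a few lines. This is an illustrative rendering of the idea, not the paper's implementation: the action names come from the paper, but `coordinator_step` and its state fields are assumptions for the sake of the example.

```python
from enum import Enum, auto

class MetaAction(Enum):
    """The coordinator's entire action space: three meta-actions, no execution details."""
    PLAN = auto()
    DELEGATE = auto()
    REVISE = auto()

def coordinator_step(action: MetaAction, state: dict) -> dict:
    """Dispatch a single coordinator decision (hypothetical state layout)."""
    if action is MetaAction.PLAN:
        # Decide the next coordination step.
        state["next_step"] = "decide what to do"
    elif action is MetaAction.DELEGATE:
        # Hand the planned step to a sub-agent; the coordinator never executes it.
        state.setdefault("delegated", []).append(state.pop("next_step", None))
    elif action is MetaAction.REVISE:
        # Edit task memory directly, e.g. keep only the most recent entries.
        state["memory"] = state.get("memory", [])[-2:]
    return state
```

The point of the sketch is the shape, not the bodies: because the coordinator can only plan, delegate, or revise, it cannot drift into doing sub-agent work itself.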

2. Task memory as a stack, not a transcript

Instead of treating context as an ever-growing conversation, STACKPLANNER maintains a task memory stack. Crucially, this stack can be modified.

REVISE enables three explicit operations:

  • Update: push new task-relevant information
  • Condensation: summarize completed segments into compact representations
  • Pruning: remove unproductive or erroneous exploration paths

This turns memory from an archival log into an editable working state. Errors are not merely overshadowed by new tokens—they are actively removed.
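A minimal sketch makes the contrast with an append-only transcript concrete. The class below is an assumed interface built from the three operations named above; the paper does not specify this API.

```python
class TaskMemoryStack:
    """Editable task memory: a stack of frames that can be rewritten,
    not just appended to (illustrative, not the paper's implementation)."""

    def __init__(self) -> None:
        self._frames: list[str] = []

    def update(self, info: str) -> None:
        # Push new task-relevant information.
        self._frames.append(info)

    def condense(self, n: int, summary: str) -> None:
        # Replace the top n frames with one compact summary.
        del self._frames[-n:]
        self._frames.append(summary)

    def prune(self, predicate) -> None:
        # Remove unproductive or erroneous entries entirely.
        self._frames = [f for f in self._frames if not predicate(f)]

    def context(self) -> str:
        # What the coordinator actually sees at the next step.
        return "\n".join(self._frames)
```

Usage mirrors a failed exploration path: push two searches, condense them into one finding, then prune a dead end so it never pollutes later reasoning:

```python
mem = TaskMemoryStack()
mem.update("searched source A")
mem.update("searched source B")
mem.condense(2, "A+B: no direct evidence")
mem.update("dead-end: query X failed")
mem.prune(lambda f: f.startswith("dead-end"))
```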

3. Experience memory for cross-task reuse

Beyond task-specific memory, STACKPLANNER introduces a structured experience memory with three components:

| Memory type              | Role                                   |
| ------------------------ | -------------------------------------- |
| User profiles            | Stable user attributes and preferences |
| Semantic memory          | Factual knowledge and retrieved evidence |
| Procedural memory (SOPs) | Reusable coordination patterns         |

An experience search agent retrieves relevant past trajectories and injects them into the current task memory, directly addressing cold-start and generalization failures.
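To make the retrieval step tangible, here is a toy experience-search function. It scores stored trajectories by word overlap with the current task, a deliberately crude stand-in for whatever retrieval mechanism the paper actually uses; the `store` schema is an assumption.

```python
def retrieve_experience(query: str, store: list[dict], top_k: int = 2) -> list[dict]:
    """Toy experience search: rank past trajectories by token overlap
    with the current task description (illustrative scoring only)."""
    query_tokens = set(query.lower().split())
    scored = sorted(
        store,
        key=lambda entry: len(query_tokens & set(entry["task"].lower().split())),
        reverse=True,
    )
    # The top matches would be injected into the current task memory,
    # giving the coordinator a warm start instead of a blank slate.
    return scored[:top_k]
```

In a real system the overlap score would be replaced by embedding similarity, but the control flow is the same: retrieve, then inject into task memory before planning begins.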

4. Learning to coordinate via reinforcement learning

The coordinator is trained using reinforcement learning, framing planning, delegation, and memory operations as a sequential decision process. Instead of hard-coded heuristics, coordination policies improve by observing which memory edits and delegation strategies actually lead to successful outcomes.

This matters because memory control is contextual. When to summarize, what to prune, and when to retrieve prior experience cannot be reliably hard-coded.
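The "coordination as a sequential decision process" framing can be sketched with a deliberately simple, bandit-style update: sample meta-actions, observe a terminal task reward, and reinforce the actions taken in successful episodes. The paper's actual RL algorithm is not reproduced here; this only illustrates the framing, and `reward_fn` is a hypothetical stand-in for task success.

```python
import random

def rollout_and_update(policy: dict, episode_len: int, reward_fn, lr: float = 0.1):
    """One episode of coordination: sample meta-actions from a preference
    table, score the trajectory, and nudge preferences toward what worked.
    (A toy update, not the paper's training procedure.)"""
    actions = list(policy)
    taken = [
        random.choices(actions, weights=[policy[a] for a in actions])[0]
        for _ in range(episode_len)
    ]
    reward = reward_fn(taken)  # e.g. did the final answer succeed?
    for action in taken:
        # Reinforce (or suppress) each action used in this episode.
        policy[action] = max(1e-3, policy[action] + lr * reward)
    return taken, reward
```

Even this toy version shows why learning beats heuristics: whether a REVISE at step three helps or hurts depends on the task, and only outcome feedback can tell the two apart.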

Findings — Results that validate the memory thesis

Across multi-hop QA and agentic benchmarks (2WikiMultiHopQA, MuSiQue, GAIA, FRAMES), STACKPLANNER consistently outperforms naive, single-agent, multi-agent, and agentic-RL baselines.

Main performance snapshot (F1, %)

| Model                      | 2Wiki | MuSiQue | GAIA | FRAMES |
| -------------------------- | ----- | ------- | ---- | ------ |
| Best baseline (Agentic-RL) | 29.55 | 13.38   | 7.71 | 13.49  |
| STACKPLANNER               | 32.92 | 16.48   | 7.71 | 16.23  |

The more telling result comes from ablation:

| Configuration         | GAIA F1 |
| --------------------- | ------- |
| Full model            | 7.71    |
| w/o task memory       | 4.68    |
| w/o experience memory | 5.53    |
| w/o both              | 2.47    |

Remove memory control, and performance collapses.

Not because the model forgets facts—but because it forgets how to coordinate itself.

Implications — What this means beyond benchmarks

STACKPLANNER quietly challenges several industry assumptions:

  • Long context is not a solution: Without editability, longer memory only amplifies error persistence.
  • Agents need operational memory, not just knowledge: Procedural experience matters more than factual recall in long-horizon tasks.
  • Coordination is a learnable skill: Planning and memory control benefit from reinforcement learning just as much as tool usage does.

For businesses building agentic workflows—research, compliance, analytics, or automation—the lesson is uncomfortable but clear: adding more agents or bigger models will not fix coordination decay.

Memory architecture will.

Conclusion — Forgetting, deliberately

STACKPLANNER’s real contribution is not another multi-agent framework. It is a reframing of the problem space.

In long-horizon AI systems, intelligence is not defined by what agents remember—but by what they are allowed to forget, compress, and reuse.

Until memory becomes an explicit control surface, agentic AI will remain impressive in demos and unreliable in production.

STACKPLANNER is an early but persuasive step toward fixing that.

Cognaptus: Automate the Present, Incubate the Future.