A warehouse robot turns a corner and sees three things: a shelf edge, a moving cart, and another robot’s partial path. It does not see the blocked aisle behind the shelf. It does not see whether the cart will stop or continue. It does not see the supervisor system’s full map. Still, it must act.

The usual engineering reflex is predictable: add more memory, widen the observation window, let agents communicate, or increase the policy network size until the budget starts making whimpering noises. The new ICLR 2026 paper “GlobeDiff: State Diffusion Process for Partial Observability in Multi-Agent Systems” argues that this reflex misses the real structure of the problem. [1]

Partial observability is not merely a shortage of data. It is a one-to-many inference problem.

One local observation can correspond to many plausible global states. A robot that sees an empty lane may be in a safe corridor, or it may be one second away from a crossing conflict. A drone that loses visual contact with a teammate may face a harmless occlusion, a coordination failure, or an adversarial maneuver. These states may look identical locally. Treating them as one “best estimate” is not intelligence. It is premature commitment wearing a neural-network costume.

GlobeDiff’s contribution is to take that ambiguity seriously. Instead of asking each agent to predict a single hidden global state, it uses a conditional diffusion model to generate plausible global states from local or auxiliary observations. The important word is not “diffusion,” fashionable as it may be. The important word is “plausible.”

The blind spot is not lack of memory; it is collapsed possibility

In a decentralized partially observable multi-agent system, each agent receives only a slice of the environment. The formal setting is a Dec-POMDP, but the business translation is simpler: every unit acts with incomplete situational awareness.

The paper’s central framing is this mapping:

$$ x \rightarrow \{s_1, s_2, s_3, \ldots\} $$

Here, $x$ is the auxiliary local observation available to an agent, and $s$ is the true global state. The trouble is that $x$ does not uniquely identify $s$. The same local evidence may point to several global realities.

Traditional approaches often behave as if the mapping were closer to:

$$ \hat{s} = f_\theta(x) $$

That is convenient. It is also the beginning of the problem.

A discriminative model, recurrent belief estimator, or Transformer-style history encoder can learn useful patterns, but it still tends to compress uncertainty into a single representation. When the true situation is multi-modal, the model may average incompatible states, choose one arbitrarily, or preserve a smoothed belief that is not operationally useful. This is the familiar curse of point estimation: when reality branches, the model reports a midpoint. In physical systems, midpoints can be a surprisingly efficient way to hit things.
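The midpoint problem is easy to reproduce in a toy setting. The sketch below is not from the paper: it assumes a hypothetical one-dimensional global state with two equally likely modes at -1 and +1, and an observation that cannot tell them apart. It compares the squared-error-optimal point estimate (the conditional mean) with estimates drawn by sampling.

```python
import random
import statistics

random.seed(0)

# Hypothetical 1-D "global state": a teammate is either far left (-1) or far right (+1).
# The local observation x is identical in both cases, so it says nothing about the mode.
MODES = (-1.0, 1.0)
NOISE_STD = 0.05

def sample_true_state():
    """Draw one global state from the two-mode conditional p(s | x)."""
    return random.choice(MODES) + random.gauss(0.0, NOISE_STD)

def distance_to_nearest_mode(s):
    """How far a state estimate is from the closest plausible mode centre."""
    return min(abs(s - m) for m in MODES)

states = [sample_true_state() for _ in range(10_000)]

# The squared-error-optimal single guess is the conditional mean.
point_estimate = statistics.fmean(states)

# A generative estimator samples from p(s | x) instead of averaging it.
sampled_estimates = [sample_true_state() for _ in range(1_000)]
mean_sample_gap = statistics.fmean(
    distance_to_nearest_mode(s) for s in sampled_estimates
)

print(f"point estimate:                 {point_estimate:+.3f}")
print(f"point estimate gap to a mode:   {distance_to_nearest_mode(point_estimate):.3f}")
print(f"sampled estimates gap to a mode: {mean_sample_gap:.3f}")
```

The point estimate sits near zero, a state that never occurs in this toy world, while each sampled estimate lands close to one of the two real modes.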

GlobeDiff replaces this point-estimation instinct with conditional generation. It tries to learn the distribution:

$$ p_{\theta,\phi}(s|x) = \int p_\theta(s|x,z)p_\phi(z|x)dz $$

The latent variable $z$ matters because it acts as a mode selector. The model is no longer forced to answer, “What is the one hidden state?” It can instead answer, “Given this observation and this latent mode, what global state is plausible?”
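The integral above can be read as a two-step sampling recipe: draw a latent mode, then draw a state given that mode. The sketch below illustrates this with a deliberately simplified, discrete $z$ and made-up mode names, weights, and centres; the paper's latent variable and decoder are learned networks, so every constant here is an assumption.

```python
import random
from collections import Counter

random.seed(1)

# Assumed toy prior p_phi(z | x) for one fixed observation x: three latent modes.
PRIOR = {"corridor_clear": 0.5, "cart_crossing": 0.3, "robot_merging": 0.2}
# Assumed toy decoder p_theta(s | x, z): each mode centres the state at a different value.
MODE_CENTRE = {"corridor_clear": 0.0, "cart_crossing": 4.0, "robot_merging": -3.0}
DECODER_STD = 0.2

def sample_latent():
    """z ~ p_phi(z | x)."""
    modes, weights = zip(*PRIOR.items())
    return random.choices(modes, weights=weights, k=1)[0]

def sample_state(z):
    """s ~ p_theta(s | x, z)."""
    return random.gauss(MODE_CENTRE[z], DECODER_STD)

def sample_global_state():
    """Ancestral sampling: pick a mode first, then generate a state for that mode."""
    z = sample_latent()
    return z, sample_state(z)

draws = [sample_global_state() for _ in range(5_000)]
mode_counts = Counter(z for z, _ in draws)
```

Averaging over many such draws approximates the marginal $p_{\theta,\phi}(s|x)$, but any single draw commits to one mode rather than blending them.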

That is the mechanism-first lesson of the paper. Partial observability creates ambiguity. Ambiguity creates multi-modal hidden states. Multi-modal hidden states punish single-output inference. Diffusion gives the system a way to sample plausible reconstructions instead of compressing them into one brittle guess.

GlobeDiff separates state reconstruction from policy execution

The architecture has three moving parts:

| Component | What it does | Why it matters |
| --- | --- | --- |
| Posterior network $q_\psi(z \mid x,s)$ | Learns a useful latent variable when the true global state is available during training | Uses centralized training information to understand which latent modes explain real states |
| Prior network $p_\phi(z \mid x)$ | Predicts the latent variable from local information during execution | Bridges the gap between training-time global access and execution-time local access |
| Conditional diffusion model $p_\theta(s \mid x,z)$ | Denoises from Gaussian noise into an inferred global state | Generates plausible global states rather than one deterministic estimate |

This design fits naturally into Centralized Training with Decentralized Execution. During training, the system can use true global states to train the generative state-inference module. During execution, agents do not receive the true global state; they infer it from local or auxiliary observations and then act using the reconstructed state.

This distinction is operationally important. GlobeDiff is not saying every agent should magically see everything. It is saying that before an agent chooses an action, it can run a learned reconstruction process that converts partial evidence into a plausible full-state hypothesis.

For business systems, this suggests a modular architecture:

Partial observation → Auxiliary observation construction → Generative global-state inference → Policy decision → Action

That is different from stuffing more history into the policy network and hoping the policy learns hidden-state inference as a side hobby. The paper’s position is more disciplined: state inference is its own problem, so give it its own model.
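One way to picture that separation is as three swappable callables behind a single `act` method. This is an illustrative sketch only: the class name, the field names, and the choice to pass both the local observation and the inferred state to the policy are assumptions, not the paper's interface.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Vector = Sequence[float]

@dataclass
class AgentPipeline:
    """Hypothetical wiring of the three stages; each stage can be tested on its own."""
    build_auxiliary: Callable[[Vector], List[float]]     # partial observation -> auxiliary x
    infer_global_state: Callable[[Vector], List[float]]  # x -> one sampled state hypothesis
    policy: Callable[[Vector, Vector], int]              # (local obs, inferred state) -> action id

    def act(self, observation: Vector) -> int:
        x = self.build_auxiliary(observation)
        s_hat = self.infer_global_state(x)
        return self.policy(observation, s_hat)

# Stub stages, standing in for learned models.
pipeline = AgentPipeline(
    build_auxiliary=lambda obs: list(obs),
    infer_global_state=lambda x: [sum(x), float(len(x))],
    policy=lambda obs, s_hat: 1 if s_hat[0] > 0 else 0,
)
action = pipeline.act([0.2, -0.1, 0.4])
```

Because the inference stage is a separate callable, it can be evaluated against logged global states without touching the policy at all.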

The diffusion process is not decoration; it is the anti-collapse mechanism

Diffusion models are often discussed as image-generation machines, which is slightly unfortunate because it makes every diffusion paper sound like it wants to draw a raccoon in sunglasses. GlobeDiff uses diffusion for a different reason: denoising can model complex conditional distributions.

Training starts with the true global state $s$, adds Gaussian noise over $K$ steps, and trains a denoising network to reverse that corruption while conditioned on $x$ and $z$. The loss has two main parts:
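The corruption half of that process has a closed form in standard DDPM-style diffusion. The sketch below assumes a linear variance schedule and illustrative values for $K$ and the state vector; none of these constants come from the paper.

```python
import math
import random

random.seed(0)

K = 50  # number of diffusion steps (illustrative)
# Assumed linear variance schedule, as in standard DDPM; not taken from the paper.
betas = [1e-4 + (0.05 - 1e-4) * k / (K - 1) for k in range(K)]

# alpha_bar_k is the cumulative product of (1 - beta) up to step k.
alpha_bars = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)

def noise_state(s0, k):
    """Sample s_k ~ q(s_k | s_0) in closed form; return the noisy state and the noise used."""
    eps = [random.gauss(0.0, 1.0) for _ in s0]
    a = alpha_bars[k]
    s_k = [math.sqrt(a) * v + math.sqrt(1.0 - a) * e for v, e in zip(s0, eps)]
    return s_k, eps

s0 = [1.0, -2.0, 0.5]                # a made-up global state vector
s_early, _ = noise_state(s0, 0)      # barely corrupted
s_late, _ = noise_state(s0, K - 1)   # heavily corrupted
signal_kept = [math.sqrt(alpha_bars[0]), math.sqrt(alpha_bars[-1])]
```

The denoising network is then trained to recover `eps` from `s_k`, the step index, and the conditioning inputs $x$ and $z$.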

$$ \mathcal{L} = \mathbb{E}\left[\lVert \epsilon - \epsilon_\theta(s_k,x,z,k)\rVert^2\right] + \beta_{KL}\, \mathrm{KL}\left(q_\psi(z|x,s)\,\Vert\, p_\phi(z|x)\right) $$

The first term trains the model to reconstruct global state through denoising. The second aligns the prior network with the posterior network, so that the latent variable used at execution resembles the latent information learned with access to the real state.
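For a single training sample, the two terms can be computed as follows. This is a minimal sketch that assumes both the posterior and the prior are diagonal Gaussians, which gives the KL term a closed form; the function names and the default `beta_kl` value are hypothetical.

```python
import math

def squared_noise_error(eps, eps_pred):
    """||eps - eps_pred||^2 for one sample."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))

def gaussian_kl(mu_q, std_q, mu_p, std_p):
    """KL(q || p) for diagonal Gaussians, summed over latent dimensions."""
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, std_q, mu_p, std_p):
        total += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2 * sp ** 2) - 0.5
    return total

def two_term_loss(eps, eps_pred, posterior, prior, beta_kl=0.1):
    """Denoising error plus weighted KL; posterior and prior are (means, stds) pairs."""
    return squared_noise_error(eps, eps_pred) + beta_kl * gaussian_kl(*posterior, *prior)

eps, eps_pred = [0.5, -1.0], [0.4, -0.7]
aligned = two_term_loss(eps, eps_pred, ([0.0], [1.0]), ([0.0], [1.0]))
misaligned = two_term_loss(eps, eps_pred, ([1.0], [1.0]), ([0.0], [1.0]))
```

When the prior matches the posterior, only the denoising error remains; a prior that disagrees with the posterior adds a penalty even if denoising is unchanged.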

This is not a minor detail. Without the prior-posterior alignment, training can learn latent information that execution cannot reproduce. That would be a very academic kind of success: elegant on paper, unusable when the system leaves the lab. The paper’s prior-network ablation tests this directly by removing the prior network and KL constraint. Performance declines, which supports the claim that the latent bridge is not ornamental.

The model also uses a U-Net-like architecture with one-dimensional temporal convolutions. In implementation, the authors first train the diffusion model from offline data and then update it online to reduce distribution mismatch. That matters because multi-agent environments do not politely remain stationary just because the training script has finished.
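At execution time, the trained denoiser is run in reverse, starting from Gaussian noise. The sketch below uses the standard DDPM reverse update with an oracle in place of the trained network: it assumes each latent mode corresponds to a single known state, so the exact noise can be computed. The schedule, the mode centres, and the omission of the observation $x$ are all simplifications for illustration.

```python
import math
import random

random.seed(0)

K = 50
betas = [1e-4 + (0.05 - 1e-4) * k / (K - 1) for k in range(K)]  # assumed linear schedule
alphas = [1.0 - b for b in betas]
alpha_bars = []
running = 1.0
for a in alphas:
    running *= a
    alpha_bars.append(running)

# Hypothetical mode centres standing in for what a trained network would encode.
MODE_CENTRE = {0: [2.0, 0.0], 1: [-2.0, 1.0]}

def oracle_denoiser(s_k, z, k):
    """Stand-in for eps_theta(s_k, x, z, k): the exact noise if the clean state is MODE_CENTRE[z]."""
    scale = math.sqrt(alpha_bars[k])
    spread = math.sqrt(1.0 - alpha_bars[k])
    return [(v - scale * m) / spread for v, m in zip(s_k, MODE_CENTRE[z])]

def reverse_sample(z, dim=2):
    """Run the reverse chain from pure noise down to a state estimate for mode z."""
    s = [random.gauss(0.0, 1.0) for _ in range(dim)]
    for k in reversed(range(K)):
        eps = oracle_denoiser(s, z, k)
        coef = betas[k] / math.sqrt(1.0 - alpha_bars[k])
        mean = [(v - coef * e) / math.sqrt(alphas[k]) for v, e in zip(s, eps)]
        noise_std = math.sqrt(betas[k]) if k > 0 else 0.0  # no noise on the final step
        s = [m + noise_std * random.gauss(0.0, 1.0) for m in mean]
    return s

s_hat_mode0 = reverse_sample(z=0)
s_hat_mode1 = reverse_sample(z=1)
```

With a perfect denoiser the chain ends on the centre of the selected mode; a learned denoiser would only approximate this, which is where the denoising-error assumptions in the theory come in.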

The benchmark had to be made more blind before it became useful

One quietly valuable part of the paper is methodological. The authors found that the original SMAC tasks did not restrict observations severely enough to study partial observability properly. Reducing sight range from 9 to 3 in SMAC-v2 lowered MAPPO performance by only about 0.03, because the local observations still retained enough useful information.

So they modified SMAC-v1 and SMAC-v2 into partial-observability variants by removing enemy unit types and hit points from local observations. This is a good reminder that benchmarks can pretend to measure a problem while quietly leaking the information needed to solve it. Benchmarks are like financial models in that way: sometimes the impressive part is not the answer, but how little pressure the assumption had to endure.

The paper then evaluates GlobeDiff in these adapted SMAC-v1 (PO) and SMAC-v2 (PO) environments. The authors compare against belief-state and communication baselines, including Learned Belief Search, Dynamic Belief, CommFormer, and vanilla MAPPO. They also compare against generative or reconstruction baselines, including VAE, MLP, and joint-observation MAPPO variants.

The evidence should be read in layers, not as one undifferentiated “model wins” paragraph.

| Evidence type | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| SMAC-v1/v2 (PO) win-rate curves | Main evidence | GlobeDiff improves policy performance under stronger partial observability | Real-world deployment readiness |
| t-SNE and Voronoi state visualizations | State-reconstruction evidence | Generated states resemble true global-state structure more closely than VAE/MLP reconstructions | Exact causal contribution of each visual feature |
| VAE, MLP, and joint-observation baselines | Comparison with alternative inference approaches | Merely adding a generic generator or high-dimensional joint input is insufficient | That GlobeDiff is the only possible generative design |
| Prior-network ablation | Ablation | Prior-posterior alignment improves performance | That the chosen KL setup is optimal |
| Diffusion-step and residual-block ablations | Sensitivity / robustness test | More denoising steps help; more residual blocks have limited impact | Full compute-latency trade-off in production systems |
| Same-parameter-count comparison | Capacity control | Larger MAPPO alone does not explain GlobeDiff’s advantage | Universal superiority over all larger architectures |

This layered reading is important because the paper’s strongest claim is not “diffusion is better.” The stronger claim is narrower and more useful: explicit generative global-state inference improves multi-agent decision-making when local observations are intrinsically ambiguous.

Bigger networks do not automatically learn hidden worlds

The most business-relevant appendix result is the same-parameter-count experiment. The authors increased vanilla MAPPO’s parameter count to roughly match GlobeDiff’s total parameter budget, about 12–14 million parameters. The large MAPPO baseline used four-layer actor and critic MLPs with 2048 hidden units per layer, reaching about 13.5–14 million parameters.

The result is not flattering to the “just make it bigger” school of architecture design.

| Task | Vanilla MAPPO | Vanilla MAPPO (Large) | GlobeDiff |
| --- | --- | --- | --- |
| zerg 5v5 | 0.22±0.01 | 0.23±0.00 | 0.33±0.02 |
| protoss 5v5 | 0.21±0.02 | 0.24±0.01 | 0.38±0.01 |
| terran 5v5 | 0.16±0.01 | 0.17±0.00 | 0.24±0.01 |
| zerg 10v10 | 0.13±0.01 | 0.15±0.01 | 0.25±0.01 |
| zerg 10v11 | 0.06±0.01 | 0.07±0.01 | 0.12±0.01 |
| terran 10v11 | 0.02±0.00 | 0.03±0.00 | 0.07±0.01 |
| MMM2 | 0.27±0.08 | 0.01±0.01 | 0.49±0.11 |
| 3s5z vs 3s6z | 0.20±0.01 | 0.01±0.01 | 0.28±0.04 |
| 6h vs 8z | 0.12±0.01 | 0.01±0.00 | 0.47±0.04 |

The table is not perfectly smooth. In three tasks (MMM2, 3s5z vs 3s6z, and 6h vs 8z), the large MAPPO baseline falls to about 0.01, well below vanilla MAPPO. In the other six, it improves on vanilla MAPPO by at most 0.03. That instability itself is informative. Capacity does not merely fail to solve partial observability; badly placed capacity may make learning less reliable.

GlobeDiff’s advantage is therefore not reducible to parameter count. The model spends capacity on the right subproblem: reconstructing plausible global states before policy execution.
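The size of that advantage can be read straight off the table. The snippet below copies the mean values from the same-parameter-count table (ignoring the ± spreads) and computes GlobeDiff's margin over whichever MAPPO variant did better on each task; the variable and function names are only for illustration.

```python
# Mean values copied from the same-parameter-count table:
# (vanilla MAPPO, large MAPPO, GlobeDiff)
RESULTS = {
    "zerg 5v5": (0.22, 0.23, 0.33),
    "protoss 5v5": (0.21, 0.24, 0.38),
    "terran 5v5": (0.16, 0.17, 0.24),
    "zerg 10v10": (0.13, 0.15, 0.25),
    "zerg 10v11": (0.06, 0.07, 0.12),
    "terran 10v11": (0.02, 0.03, 0.07),
    "MMM2": (0.27, 0.01, 0.49),
    "3s5z vs 3s6z": (0.20, 0.01, 0.28),
    "6h vs 8z": (0.12, 0.01, 0.47),
}

def gain_over_best_mappo(row):
    """GlobeDiff's margin over the stronger of the two MAPPO baselines."""
    vanilla, large, globediff = row
    return globediff - max(vanilla, large)

gains = {task: round(gain_over_best_mappo(row), 2) for task, row in RESULTS.items()}

# Tasks where adding parameters made MAPPO worse than the vanilla version.
large_worse = [task for task, (vanilla, large, _) in RESULTS.items() if large < vanilla]

for task, gain in gains.items():
    print(f"{task:14s} +{gain:.2f}")
```

On these means, GlobeDiff beats the better MAPPO variant on every task, and the three tasks where the large baseline regresses are exactly the SMAC-v1 style maps listed last.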

This is the part many business readers should underline. AI systems rarely fail because they are too small in the abstract. They fail because the architecture asks one component to solve several different problems at once. A policy network asked to infer hidden state, resolve multi-modal uncertainty, coordinate with other agents, and choose actions may learn something. It may also learn an expensive fog machine.

The visual evidence is about structure, not aesthetics

The paper uses t-SNE and Voronoi visualizations to compare true global states with inferred states generated by GlobeDiff, VAE, and MLP baselines. These figures should not be read as decorative proof. They are qualitative diagnostics for whether the inferred state distribution preserves neighborhood structure.

GlobeDiff’s generated states show polygon structures closer to the true states. The appendix also adds a density-based analysis: real global states form several prominent Gaussian-like modes in the projected space, and generated states from selected agents reconstruct the location, shape, and amplitude of those modes.

This supports the paper’s claim that the latent variable $z$ is doing meaningful work. It is not just random noise injected for generative glamour. It appears to encode semantic variation across possible global states.

Still, the visual evidence should be kept in its lane. t-SNE and KDE plots help diagnose whether the model preserves multi-modal structure. They do not, by themselves, prove operational safety, generalization to real robots, or robustness under adversarial sensor failure. The main performance evidence comes from win-rate curves; the visual evidence explains why the performance gains are plausible.

That separation matters. A pretty embedding can be a useful clue. It is not a deployment certificate.

The theoretical result is a useful boundary, not a magic warranty

The paper provides two error-bound results.

The first covers the single-sample setting. Under assumptions about denoising accuracy and prior alignment, the expected squared error is bounded in terms of the Wasserstein distance between the learned and true conditional distributions and the conditional variance:

$$ \mathbb{E}[\lVert \hat{s}-s\rVert^2] \leq 2W_2^2(p_{\theta,\phi}(s|x),p(s|x)) + 4\operatorname{Var}(s|x) $$
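A quick numerical sanity check makes the shape of this bound concrete. The sketch assumes a one-dimensional case where both the learned and the true conditional are Gaussian (so the 2-Wasserstein distance has a closed form) and where $\hat{s}$ and $s$ are drawn independently given $x$. Those are simplifying assumptions for illustration; the paper's conditions may differ, and the means and standard deviations are made up.

```python
import random

random.seed(0)

def w2_squared_gaussian(m1, s1, m2, s2):
    """Squared 2-Wasserstein distance between two 1-D Gaussians N(m1, s1^2) and N(m2, s2^2)."""
    return (m1 - m2) ** 2 + (s1 - s2) ** 2

# Made-up conditionals for one fixed observation x.
true_mean, true_std = 0.0, 1.0        # p(s | x)
learned_mean, learned_std = 0.3, 1.2  # p_{theta,phi}(s | x), slightly off

# Monte Carlo estimate of E[(s_hat - s)^2] with independent draws.
n = 100_000
mse = sum(
    (random.gauss(learned_mean, learned_std) - random.gauss(true_mean, true_std)) ** 2
    for _ in range(n)
) / n

bound = 2 * w2_squared_gaussian(learned_mean, learned_std, true_mean, true_std) \
    + 4 * true_std ** 2

print(f"estimated squared error: {mse:.3f}")
print(f"bound:                   {bound:.3f}")
```

The variance term dominates here, which matches the intuition that a single sample cannot beat the irreducible spread of $p(s|x)$ no matter how good the generator is.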

The second result is closer to the paper’s main intuition. When the true conditional state distribution is multi-modal Gaussian with sufficiently separated modes, the generated state remains close to one mode center:

$$ \mathbb{E}[\lVert \hat{s}-\mu_j(x)\rVert^2] \leq C_1K\delta^2 + C_2\varepsilon_{KL} + 2\max_i \operatorname{Tr}(\Sigma_i(x)) \cdot O\left(e^{-D^2/(8\sigma^2_{max})}\right) $$

The useful interpretation is not “the model is guaranteed to be right.” The useful interpretation is more precise: if denoising error is controlled, prior-posterior alignment is good, and modes are sufficiently separated, the generated state should land near a valid mode rather than collapse into an average between modes.
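The "sufficiently separated" condition lives in the exponential factor $e^{-D^2/(8\sigma^2_{max})}$. The short calculation below evaluates that factor for a few separation-to-spread ratios, taking $D$ as the mode separation and $\sigma_{max}$ as the largest mode standard deviation. The constant hidden inside the $O(\cdot)$ is ignored, so the numbers only indicate how fast the term decays, not its actual size.

```python
import math

def mode_overlap_factor(separation, sigma_max):
    """Evaluate exp(-D^2 / (8 * sigma_max^2)), the decay factor in the multi-modal bound."""
    return math.exp(-separation ** 2 / (8 * sigma_max ** 2))

# Separation expressed in units of sigma_max.
factors = {ratio: mode_overlap_factor(ratio, 1.0) for ratio in (1, 2, 4, 8)}

for ratio, factor in factors.items():
    print(f"D = {ratio} * sigma_max -> factor {factor:.6f}")
```

Modes only one or two standard deviations apart leave this factor close to 1, so the bound says little; at eight standard deviations it is below one in a thousand.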

That is exactly the kind of guarantee a multi-agent architecture needs. In ambiguous environments, being close to one plausible reality is often better than being halfway between three incompatible realities.

But the assumptions are not casual. Mode separation matters. Prior alignment matters. Denoising error matters. In a messy warehouse, financial market, or autonomous-vehicle setting, the hidden-state distribution may be non-Gaussian, drifting, adversarial, or too tightly overlapping for clean separation. The theory helps explain the mechanism. It does not remove engineering responsibility. Tragic, I know.

The business lesson is to add a generative perception layer before coordination

For business use, GlobeDiff is most relevant wherever multiple agents must coordinate under incomplete information:

  • robot fleets in warehouses and factories;
  • autonomous vehicles and delivery robots;
  • drone swarms and inspection systems;
  • distributed infrastructure monitoring;
  • multi-agent simulation platforms;
  • trading or allocation agents operating on partial market views.

The practical pathway is not to copy GlobeDiff directly into production next Tuesday. The practical pathway is architectural:

| Technical idea from GlobeDiff | Business design translation |
| --- | --- |
| Local observation maps to multiple possible global states | Treat ambiguity as structural, not as noise to be averaged away |
| Latent variable selects among plausible modes | Maintain scenario hypotheses rather than a single hidden-state estimate |
| Diffusion reconstructs global state | Use a generative state-estimation layer before action selection |
| CTDE integration | Train with richer centralized data, execute with local inference |
| Prior-network ablation matters | Validate the training-to-execution bridge, not just offline reconstruction |
| Same-parameter-count result | Spend capacity on the correct module, not merely on larger policies |

For a logistics company, this could mean separating fleet coordination into three modules: local sensing, generative global-state reconstruction, and policy execution. For a robotics integrator, it suggests that communication is not the only path to coordination; agents may infer missing structure when communication is expensive, delayed, or unreliable. For an AI operations team, it implies a new diagnostic question: when agents fail, did the policy choose badly, or did the system reconstruct the world badly?

That distinction is not academic. It changes debugging. A policy failure asks for reward shaping, policy architecture, or exploration changes. A state-reconstruction failure asks for better observation design, latent alignment, denoising quality, or online adaptation. Mixing the two creates the classic enterprise AI debugging experience: everyone stares at a dashboard, suspects everything, and fixes nothing.
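A simple triage rule shows how that split could look in practice. This is a hypothetical heuristic, not something the paper proposes: it assumes the true global state is recoverable from logs after the fact, so the policy can be replayed on the true state and compared with what it did on the inferred one. The function name, labels, and threshold are all illustrative.

```python
def triage_failure(state_error, action_from_inferred, action_from_true, error_threshold=0.5):
    """Attribute one failed decision to the policy or to state reconstruction.

    state_error: distance between the inferred and the logged true global state.
    action_from_inferred: action the policy took using the inferred state.
    action_from_true: action the same policy picks when replayed on the true state.
    """
    if action_from_inferred == action_from_true:
        # Perfect state information would not have changed the decision.
        return "policy"
    if state_error > error_threshold:
        # The decision flipped and the reconstruction was far from the truth.
        return "state-reconstruction"
    # The decision flipped on a small state error: could be either component.
    return "unclear"

labels = [
    triage_failure(0.1, action_from_inferred=2, action_from_true=2),
    triage_failure(1.4, action_from_inferred=2, action_from_true=0),
    triage_failure(0.2, action_from_inferred=2, action_from_true=0),
]
```

Counting these labels across failed episodes gives a rough first answer to which module deserves the engineering time.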

Communication is still useful, but it is not a substitute for inference

One tempting misreading is that GlobeDiff makes communication unnecessary. It does not.

The paper itself uses different auxiliary observation constructions depending on the environment. In SMAC-v1 (PO), auxiliary information is based on an individual agent’s historical trajectory. In SMAC-v2 (PO), where local information is more limited, the method uses communication between agents to construct auxiliary information.
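The two constructions can be sketched as two small functions that both return a flat vector $x$. The details here (zero-padding short histories, concatenating teammate messages in agent-id order, the function names) are assumptions for illustration; the article does not specify how the paper encodes either variant.

```python
from typing import Dict, List, Sequence

def history_auxiliary(history: Sequence[Sequence[float]], window: int, obs_dim: int) -> List[float]:
    """Flatten the last `window` observations, left-padding with zeros if history is short."""
    recent = [list(obs) for obs in history[-window:]]
    padding = [[0.0] * obs_dim for _ in range(window - len(recent))]
    return [value for obs in padding + recent for value in obs]

def message_auxiliary(own_obs: Sequence[float], messages: Dict[int, Sequence[float]]) -> List[float]:
    """Concatenate the agent's own observation with teammate messages in a fixed id order."""
    aux = list(own_obs)
    for agent_id in sorted(messages):
        aux.extend(messages[agent_id])
    return aux

# History-based x: two observations so far, window of three.
x_history = history_auxiliary([[1.0, 2.0], [3.0, 4.0]], window=3, obs_dim=2)
# Communication-based x: own observation plus messages from agents 1 and 2.
x_messages = message_auxiliary([0.5], {2: [0.2], 1: [0.1]})
```

Either vector then plays the role of $x$ for the prior network and the diffusion model, which is why communication feeds inference rather than replacing it.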

So the message is not “stop communication.” The message is: communication alone does not solve the state-inference problem. Messages still need to be interpreted through a model capable of reconstructing global structure.

This is especially relevant in business systems where adding communication is often easier than improving inference. More telemetry, more event streams, more agent-to-agent messages, more logs — the usual modern ritual sacrifice to the data platform. But if the system cannot convert auxiliary information into a coherent distribution over possible states, more communication may simply produce a larger pile of partial clues.

GlobeDiff suggests a more disciplined pipeline: communicate enough to enrich the auxiliary observation, then infer plausible global states explicitly.

Where GlobeDiff should not be over-sold

The paper is strong, but its boundaries are clear.

First, the evidence comes from modified SMAC benchmarks. These are useful for studying cooperative MARL under controlled partial observability, but they are still game-based environments. Real-world systems introduce sensor noise, actuator delays, safety constraints, non-stationary behavior, adversarial actors, and ugly integration layers written by someone who left the company in 2019.

Second, the method still depends on centralized training information. Under CTDE, the true global state is available during training. Many businesses can approximate this through logs, simulation, digital twins, or post-hoc reconciliation. Some cannot. If the system never observes global state even during training, the GlobeDiff recipe becomes harder to apply.

Third, diffusion inference has computational cost. The ablation shows that more denoising steps improve state inference, while residual-block count has relatively minor impact in the tested setting. That is encouraging, but production systems care about latency, and every extra denoising step is another network pass before the agent can act. A robot avoiding a collision does not have the luxury of admiring a long denoising chain.
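The latency constraint reduces to simple budget arithmetic. The helper below is a back-of-the-envelope sketch with made-up timings; the paper reports no latency figures, and the function name and the idea of a fixed per-step cost are assumptions.

```python
def max_denoising_steps(control_period_ms, per_step_ms, policy_ms=0.0):
    """How many denoising passes fit in one control cycle after the policy's own cost."""
    if per_step_ms <= 0:
        raise ValueError("per_step_ms must be positive")
    budget_ms = control_period_ms - policy_ms
    if budget_ms <= 0:
        return 0
    return int(budget_ms // per_step_ms)

# Hypothetical robot: 50 ms control loop, 5 ms policy pass, 1.5 ms per denoising step.
steps_available = max_denoising_steps(control_period_ms=50.0, per_step_ms=1.5, policy_ms=5.0)
```

If the step count that gives acceptable reconstruction quality exceeds this number, the options are a faster denoiser, fewer steps, or running inference less often than the control loop.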

Fourth, the theoretical multi-modal guarantee relies on separated modes. In operational environments where plausible states overlap heavily, the neat “one latent mode, one plausible world” story becomes less neat. The model may still help, but the guarantee becomes a weaker guide.

These limitations do not kill the idea. They locate it. GlobeDiff is best read as a mechanism and architecture paper, not a plug-and-play industrial product.

The quiet shift: from predicting the world to sampling it

The reason GlobeDiff is interesting is not that it adds diffusion to MARL. AI research has added diffusion to almost everything by now; soon someone will denoise quarterly earnings guidance and call it strategic finance.

The reason it is interesting is that it makes a clean architectural argument:

  1. Multi-agent partial observability creates one-to-many hidden-state ambiguity.
  2. One-to-many ambiguity is poorly served by single-state prediction.
  3. A latent-conditioned generative model can preserve multiple plausible global states.
  4. Better inferred states can improve decentralized action.
  5. More parameters alone do not solve the structural problem.

That is a useful argument beyond StarCraft. Many enterprise AI failures come from forcing systems to act as if uncertainty is a scalar confidence score. In multi-agent settings, uncertainty is often a set of possible worlds. Those worlds imply different actions.

The next generation of serious multi-agent systems may therefore need a layer that does not merely answer “what do I see?” or “what should I do?” It should answer a quieter question first:

What worlds could make my observation true?

GlobeDiff is one attempt to formalize that question. It is not the final answer. But it points in the right direction: away from bigger blind policies, and toward generative foresight as a coordination primitive.

That is the real business signal. In complex multi-agent systems, the winning architecture may not be the one that reacts fastest to partial observations. It may be the one that reconstructs the hidden situation before acting.



1. Yiqin Yang, Xu Yang, Yuhua Jiang, Ni Mu, Hao Hu, Runpeng Xie, Ziyou Zhang, Siyuan Li, Yuan-Hua Ni, Qianchuan Zhao, and Bo Xu, “GlobeDiff: State Diffusion Process for Partial Observability in Multi-Agent Systems,” arXiv:2602.15776, published as a conference paper at ICLR 2026.