When Agents Think in Waves: Diffusion Models for Ad Hoc Teamwork

A warehouse robot does not fail only when it drops the box. Sometimes it fails earlier, in the quieter moment when another robot takes an unexpected route and the first robot keeps behaving as though the original choreography still exists. Nobody crashes. Nothing explodes. The system merely becomes stupid in a very expensive way.

That is the practical problem behind ad hoc teamwork. An autonomous agent may know the task, the environment, and its own capabilities, but not the habits of the agents around it. In a factory, a fleet of robots may be reconfigured. On a road, an autonomous vehicle meets drivers it has never modelled. In disaster response, drones, humans, and ground robots may coordinate without rehearsed roles. The old assumption that cooperation means executing a shared plan starts to look charming, like a fax machine with venture funding.

The paper behind this article, PADiff: Predictive and Adaptive Diffusion Policies for Ad Hoc Teamwork, attacks that problem from a useful angle: the agent should not collapse too early onto one “best” cooperative behaviour.¹ In ad hoc teamwork, the right action may depend on what an unfamiliar teammate is about to do. If several teammate intentions remain plausible, a policy that preserves several possible cooperation modes may be more valuable than one that optimises itself into brittle confidence.

PADiff’s central move is to represent the ego agent’s policy as a diffusion process. That sounds like someone smuggled image-generation vocabulary into multi-agent control, because they did. But here the point is not pretty pictures. It is policy expressiveness. Diffusion models are good at representing distributions with multiple peaks. In teamwork, those peaks can correspond to different valid ways of cooperating: pass left, pass right, intercept, wait, reposition, assist. The intelligent part is not that the agent behaves randomly. The intelligent part is that it keeps structured alternatives alive long enough to adapt.

The paper’s real contribution is therefore mechanical, not decorative. PADiff combines three ideas: a diffusion-based policy that can represent multimodal action choices, an Adaptive Feature Modulation Network that adjusts internal features according to teammate context, and a Predictive Guidance Block that trains the denoising process to care about future team reward and future collaborative subgoals. The result is a policy that does not merely generate actions. It denoises actions under pressure from a changing team situation.

That distinction matters. Randomness is cheap. Coordination diversity is not.

The failure mode is premature agreement with oneself

Most reinforcement learning policies are trained to maximise expected return. That is sensible until the environment contains several valid cooperative patterns and the agent cannot know in advance which one its teammate will support. Then the optimisation objective quietly pushes the policy toward one dominant behaviour. The paper describes this as policy collapse: the agent learns a single high-reward response and loses the ability to represent other viable cooperation modes.

Maximum-entropy RL tries to soften this by encouraging stochasticity. That helps exploration, but exploration is not the same as structured coordination. A noisy agent may try different things. A multimodal policy can represent different meaningful strategies.

This is the misconception worth killing early: PADiff is not interesting because it makes agents more random. It is interesting because it tries to make their uncertainty organised.

In the paper’s motivating soccer-style example, an ego agent advancing with the ball may shoot directly, pass to one teammate, or pass to another. The correct decision depends partly on the teammates’ likely behaviour. A unimodal policy risks learning “usually shoot” or “usually pass left”. A multimodal policy can keep several candidate behaviours available under the same state and select among them through context.

For business systems, this maps cleanly onto a deployment problem: agents rarely fail because there is only one possible plan. They fail because there are several plausible plans, and the system locks onto one before it understands the surrounding behaviour.

PADiff treats action selection as denoising, not direct mapping

In a conventional policy, the system often learns a mapping from state to action: observe the state, produce an action distribution, select an action. PADiff instead frames the action as something generated through a reverse diffusion process. It begins from noise and iteratively denoises toward an executable action, conditioned on the current teamwork context.

For continuous diffusion models, the intuition is familiar: start from a noisy sample and refine it step by step. PADiff adapts this idea to action-level policies, including discrete action spaces, following the structured denoising logic used in discrete diffusion models. During training, noise is added to known actions; the model learns the reverse process that reconstructs useful actions from noisy ones. During inference, it samples a noisy action and iteratively denoises it into the final action to execute.
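
The train-time corruption and inference-time refinement can be sketched in a few lines. This is a toy under stated assumptions, not PADiff's implementation: the uniform-replacement noise schedule, the predict-then-renoise sampler, `NUM_ACTIONS`, `NUM_STEPS`, and the hand-written `toy_denoiser` (standing in for a learned network conditioned on team context) are all illustrative choices that the paper summary here does not specify.

```python
import random

NUM_ACTIONS = 5   # assumed size of a discrete action space
NUM_STEPS = 10    # assumed number of diffusion steps


def corrupt(action, step, rng):
    """Forward process: with probability step/NUM_STEPS, overwrite the action with uniform noise."""
    if rng.random() < step / NUM_STEPS:
        return rng.randrange(NUM_ACTIONS)
    return action


def toy_denoiser(noisy_action, step, modes):
    """Hypothetical stand-in for the learned denoiser.

    Returns a probability vector over clean actions. All mass sits on the
    cooperation modes implied by the context; a noisy input that is itself a
    mode is trusted more as the noise level (step) falls.
    """
    probs = [0.0] * NUM_ACTIONS
    trust = 1.0 - step / NUM_STEPS if noisy_action in modes else 0.0
    for m in modes:
        probs[m] += (1.0 - trust) / len(modes)
    if trust > 0.0:
        probs[noisy_action] += trust
    return probs


def sample_action(modes, rng):
    """Reverse process: start from noise, repeatedly predict a clean action and re-noise it less."""
    x = rng.randrange(NUM_ACTIONS)
    for step in range(NUM_STEPS, 0, -1):
        probs = toy_denoiser(x, step, modes)
        x0_hat = rng.choices(range(NUM_ACTIONS), weights=probs)[0]
        x = corrupt(x0_hat, step - 1, rng)  # when step - 1 == 0 nothing is corrupted
    return x


rng = random.Random(0)
samples = [sample_action(modes=[1, 3], rng=rng) for _ in range(500)]
counts = {a: samples.count(a) for a in sorted(set(samples))}
print(counts)  # both modes appear; no other action does
```

The point of the toy is the shape of the loop: one state, one context, and repeated sampling still lands on more than one action, because the denoiser never averaged the two modes into a compromise.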

That gives the policy a richer shape. Instead of being pressured into a single deterministic-looking response, it can model an action distribution with multiple modes. In ad hoc teamwork, those modes are not aesthetic. They are different cooperation plans.

A useful way to read the architecture is this:

Component	What it does technically	Why it matters for teamwork
Diffusion policy	Represents action generation as iterative denoising	Allows multiple plausible cooperation modes rather than one dominant response
State encoder	Compresses recent team state into a latent teamwork context	Gives the denoising process information about the interaction pattern, not just the current snapshot
AFM-Net	Uses context-conditioned scale and shift operations to modulate intermediate features	Lets the policy adjust its internal computation as teammate behaviour changes
PGB	Predicts collaborative return and future collaborative goal during training	Pushes denoising features toward team-aware, future-sensitive decisions

The state encoder is not a cosmetic preprocessor. PADiff uses a history window to encode the team’s recent interaction context into a latent representation. That matters because ad hoc teamwork is not solved by observing positions alone. The agent needs clues about behavioural style. Is the teammate aggressive? Hesitant? Coordinated? Drifting toward a subgoal? The history window gives the model a chance to infer those patterns.

The diffusion step and the latent teamwork context are then combined as the condition for denoising. In plain English: the model does not just ask, “What action fits this state?” It asks, “Given this team context and this stage of refinement, how should this noisy candidate action be shaped?”

That is the mechanism-first story. PADiff’s advantage is not “diffusion, therefore magic”. It is diffusion plus teammate-conditioned adaptation plus predictive training signals. Remove those latter parts, and diffusion alone becomes a well-dressed baseline.

AFM-Net makes teammate context touch the policy’s internal features

Standard diffusion architectures are not automatically suitable for ad hoc teamwork. The paper is explicit about this. Diffusion policies can represent multimodal distributions, but ordinary MLP or U-Net denoisers do not necessarily adapt well to changing teammate behaviour. Transformer-style designs can add flexibility, but often with additional computational overhead.

PADiff’s answer is AFM-Net, an Adaptive Feature Modulation Network. It borrows the broad idea of feature-wise modulation: generate scale and shift parameters from the teamwork context, then use those parameters to modulate intermediate feature vectors inside the denoising network. The paper also adds layer normalisation, residual connections, and dropout, aiming for stability and generalisation to unseen teammates.
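
The scale-and-shift idea can be sketched as follows. It is a minimal sketch assuming a FiLM-style block in plain Python: the layer sizes, the random weights, the ordering of normalisation, modulation, dropout, and residual, and the names `afm_block` and `params` are illustrative assumptions, not the paper's architecture.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(weights, bias, x):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def afm_block(features, context, params, drop_prob=0.0, rng=None):
    """Modulate denoiser features with context-derived scale and shift (FiLM-style)."""
    scale = linear(params["w_scale"], params["b_scale"], context)
    shift = linear(params["w_shift"], params["b_shift"], context)
    h = layer_norm(features)
    h = [s * v + b for s, v, b in zip(scale, h, shift)]
    if rng is not None and drop_prob > 0.0:  # training-time inverted dropout
        h = [0.0 if rng.random() < drop_prob else v / (1.0 - drop_prob) for v in h]
    return [f + v for f, v in zip(features, h)]  # residual connection


rng = random.Random(0)
DIM, CTX_DIM = 4, 3  # assumed feature and context sizes


def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


params = {
    "w_scale": rand_matrix(DIM, CTX_DIM), "b_scale": [0.0] * DIM,
    "w_shift": rand_matrix(DIM, CTX_DIM), "b_shift": [0.0] * DIM,
}
features = [0.5, -1.0, 2.0, 0.0]                       # same intermediate features...
out_a = afm_block(features, [1.0, 0.0, 0.0], params)   # ...under teammate context A
out_b = afm_block(features, [0.0, 1.0, 0.0], params)   # ...under teammate context B
print(out_a != out_b)  # the same features are transformed differently per context
```

The residual path means an uninformative context leaves the features largely alone, while an informative one bends them; that is the whole trick, and it costs two small linear maps rather than an attention stack.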

The design choice is practical. Instead of building a giant attention-heavy architecture and hoping it learns how to adapt, AFM-Net inserts the teammate context directly into the feature transformations. The denoiser’s internal representation is no longer neutral. It is bent by the inferred team situation.

That is important because ad hoc teamwork is a representation problem before it is an action problem. If the same visible state can imply different best actions depending on teammate intent, then the policy must condition its internal features on that inferred intent. Otherwise, it merely sees coordinates and pretends they are a strategy.

The paper’s ablation study supports this design. When AFM-Net is replaced with an MLP or a U-Net denoising network, performance drops across Predator-Prey, Level-Based Foraging, and Overcooked. This is an ablation, not the main benchmark claim. Its purpose is narrower: to test whether the adaptation architecture contributes beyond the diffusion-policy wrapper. The evidence says yes, at least in these simulated benchmark settings.

There is a quiet business lesson here. In deployed agent systems, “context awareness” cannot remain a slogan pasted onto a dashboard. It must enter the model at the point where decisions are formed. AFM-Net is one technical expression of that principle.

PGB trains the denoiser to care about future cooperation

PADiff’s second major module is the Predictive Guidance Block. This is where the paper becomes more interesting than “diffusion policy for teamwork”.

A standard diffusion model can generate diverse actions, but diversity alone does not mean useful cooperation. The agent needs predictive structure: what team outcome is this action moving toward, and what future collaborative state is it helping produce? PADiff trains the denoising process with two auxiliary predictions:

Collaborative Return (CoReturn): the expected cumulative future team reward.
Collaborative Goal (CoGoal): a predicted future state used as a subgoal representing teammate intention.

The total training loss combines the diffusion loss with the CoReturn and CoGoal losses:

$$ L_{\text{total}} = L_{\text{Diffusion}} + \alpha L_{\text{CoReturn}} + \beta L_{\text{CoGoal}} $$

This matters because the predictions are made from intermediate features produced during denoising. Gradients from the predictive tasks flow back into those features, shaping them to become more team-aware. The paper’s framing is clear: PGB is used during training, then removed at inference. The trained denoising block is expected to have internalised the predictive structure.
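
As a worked illustration of how the three terms combine, here is a minimal sketch. It assumes mean-squared-error losses for both auxiliary predictions and an undiscounted return-to-go target; the loss forms, the weights `alpha=0.5` and `beta=0.1`, and every number below are placeholders, not values from the paper.

```python
def returns_to_go(team_rewards, gamma=1.0):
    """Cumulative future team reward from each timestep onward (a CoReturn-style target)."""
    out, running = [], 0.0
    for r in reversed(team_rewards):
        running = r + gamma * running
        out.append(running)
    return out[::-1]


def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def total_loss(diffusion_loss, coreturn_loss, cogoal_loss, alpha, beta):
    """L_total = L_Diffusion + alpha * L_CoReturn + beta * L_CoGoal."""
    return diffusion_loss + alpha * coreturn_loss + beta * cogoal_loss


# Placeholder trajectory: the team is rewarded only at the final step.
coreturn_target = returns_to_go([0.0, 0.0, 1.0])       # [1.0, 1.0, 1.0]
coreturn_loss = mse([0.5, 1.0, 1.0], coreturn_target)  # predicted vs. target return
cogoal_loss = mse([0.0, 1.0], [0.0, 2.0])              # predicted vs. actual future state
loss = total_loss(0.4, coreturn_loss, cogoal_loss, alpha=0.5, beta=0.1)
print(round(loss, 4))
```

In a real training loop the two auxiliary predictions would be read from intermediate denoiser features, so that their gradients reshape those features; the prediction heads themselves are discarded at inference.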

That is a useful deployment property. Inference-time guidance can be flexible, but it may add runtime cost or complexity. PADiff tries to pay the prediction tax during training, then execute with the denoising policy alone.

The PGB ablation tests the two predictive components by removing CoGoal and CoReturn separately. The full model outperforms both ablated variants. Again, this is not a second thesis; it is an ablation. Its purpose is to show that future reward prediction and future subgoal prediction both contribute to the final performance. The evidence supports the paper’s claim that diffusion needs more than expressive sampling; it needs predictive pressure.

There is a slight irony here, which we may as well enjoy. The model becomes adaptive at inference partly because it was forced to do homework during training.

The benchmark design tests unfamiliar teammates, not just task competence

The experiments use three standard cooperative environments: Predator-Prey, Level-Based Foraging, and Overcooked. Each stresses coordination differently. Predator-Prey requires agents to coordinate around a moving target. Level-Based Foraging requires agents to combine capabilities to collect food items. Overcooked requires synchronised movement and task sequencing in a constrained kitchen layout.

The key experimental design choice is the teammate policy pool. The authors use Soft-Value Diversity from the CSP framework to create diverse teammate behaviours. They instantiate four independent multi-agent populations in each environment. Three populations are used for collecting training data with the ego agent, while a fourth is reserved for evaluation. The test population includes twelve checkpoints, divided into groups of four or eight test strategies.

That last detail matters because the “4” and “8” in the reported conditions are not numbers of agents. They are held-out test policy pool sizes. Treating them as agent counts would be an easy mistake, and a strangely confident one. The paper is testing whether the ego agent can cooperate with different held-out teammate strategies.

The appendix cross-play matrices serve a specific purpose: validating that the held-out policies are meaningfully diverse. The authors report that within-population cooperation performs better than cross-population cooperation, implying that different populations learned distinct strategies. This is not the main performance result. It is a robustness check on the evaluation setup. If the test teammates were not actually different, “generalising to unseen teammates” would be a softer claim. The cross-play analysis strengthens that claim by showing coordination friction between different policy populations.
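
The check itself is simple to state: compare the diagonal of the cross-play matrix (populations paired with themselves) against the off-diagonal entries (populations paired with each other). A minimal sketch follows, using a made-up 4×4 matrix that is purely illustrative and not taken from the paper; `crossplay_gap` is a hypothetical helper name.

```python
def crossplay_gap(matrix):
    """Return (mean within-population score, mean cross-population score)."""
    n = len(matrix)
    within = sum(matrix[i][i] for i in range(n)) / n
    cross = sum(matrix[i][j] for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
    return within, cross


# Illustrative scores only: entry [i][j] is population i's agents paired with population j's.
crossplay = [
    [0.9, 0.4, 0.5, 0.3],
    [0.4, 0.8, 0.3, 0.5],
    [0.5, 0.3, 0.9, 0.4],
    [0.3, 0.5, 0.4, 0.8],
]
within, cross = crossplay_gap(crossplay)
print(f"within={within:.2f} cross={cross:.2f} gap={within - cross:.2f}")
```

A large positive gap is what makes a held-out population a meaningful test. A gap near zero would suggest the "unseen" teammates are old friends in new hats.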

The main results are strongest where coordination brittleness is visible

The main comparison evaluates PADiff against AHT methods such as LIAM and ODITS, diffusion-based baselines such as Diffusion-BC and Diffusion-QL, offline RL via CQL, multi-agent diffusion and transformer baselines such as MADiff and MADT, and the more recent TAGET method. The model is trained for 20 epochs on collected offline datasets and evaluated every two epochs against held-out test pools. Results are averaged over 50 trials with 95% confidence intervals.
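
For readers reproducing this style of reporting, a mean with a 95% interval over trials can be computed as below. The normal-approximation interval (1.96 standard errors) is an assumption, since the interval construction the authors used is not stated here, and the trial returns are synthetic.

```python
import math
import statistics


def mean_with_ci95(returns):
    """Sample mean and half-width of a normal-approximation 95% confidence interval."""
    mean = statistics.mean(returns)
    half_width = 1.96 * statistics.stdev(returns) / math.sqrt(len(returns))
    return mean, half_width


trial_returns = [60.0 + (i % 5) for i in range(50)]  # 50 synthetic trial returns
mean, half_width = mean_with_ci95(trial_returns)
print(f"{mean:.1f} +/- {half_width:.2f}")
```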

The headline claim is that PADiff outperforms all baselines across all three environments and both test pool sizes, with an average reported gain of 35.25%. The appendix table gives the most useful numerical view:

Test setting	Best baseline	PADiff	Reported relative gain	Interpretation
Predator-Prey, 4-policy test pool	TAGET: 61.2 ± 1.0	67.0 ± 1.1	+9.47%	Solid improvement in a task already producing high returns
Level-Based Foraging, 4-policy test pool	ODITS/TAGET: 0.080	0.130 ± 0.022	+62.50%	Large relative gain on a low-return scale
Overcooked, 4-policy test pool	CQL: 0.64 ± 0.05	0.88 ± 0.06	+37.50%	Strong improvement in a coordination-heavy layout
Predator-Prey, 8-policy test pool	TAGET: 60.1 ± 1.1	65.7 ± 1.0	+9.32%	Similar margin under the larger held-out pool
Level-Based Foraging, 8-policy test pool	LIAM/TAGET: 0.065	0.117 ± 0.018	+80.00%	Very large relative gain, again on a small absolute scale
Overcooked, 8-policy test pool	TAGET: 0.63 ± 0.11	0.71 ± 0.06	+12.70%	Modest but consistent improvement

The pattern is more informative than the average. Predator-Prey shows moderate but steady improvement on a return scale around the 50s and 60s. Level-Based Foraging shows dramatic relative gains, but the absolute returns are small, so the percentage should not be read as a universal productivity multiplier. Overcooked shows meaningful gains, especially in the 4-policy setting, where coordination bottlenecks are more visible.
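
The arithmetic behind those percentages is worth running once, because it makes the scale issue visible. The sketch below recomputes each relative gain from the rounded scores in the table and sets it beside the absolute gain; small differences in the last digit against the reported figures are expected, since the table's inputs are themselves rounded.

```python
# (setting, best baseline, PADiff, reported relative gain in %), copied from the table
rows = [
    ("Predator-Prey, 4-policy", 61.2, 67.0, 9.47),
    ("Level-Based Foraging, 4-policy", 0.080, 0.130, 62.50),
    ("Overcooked, 4-policy", 0.64, 0.88, 37.50),
    ("Predator-Prey, 8-policy", 60.1, 65.7, 9.32),
    ("Level-Based Foraging, 8-policy", 0.065, 0.117, 80.00),
    ("Overcooked, 8-policy", 0.63, 0.71, 12.70),
]


def relative_gain(baseline, padiff):
    return 100.0 * (padiff - baseline) / baseline


for setting, baseline, padiff, reported in rows:
    absolute = padiff - baseline
    recomputed = relative_gain(baseline, padiff)
    print(f"{setting:32s} abs +{absolute:.3f}  rel +{recomputed:.2f}%  (reported {reported:.2f}%)")

mean_reported = sum(r[3] for r in rows) / len(rows)
print(f"mean reported gain: {mean_reported:.2f}%")  # 35.25%
```

The largest percentage (Level-Based Foraging, 8-policy) corresponds to an absolute gain of about 0.05, while the smallest percentages (Predator-Prey) correspond to absolute gains of more than 5. Averaging them into one headline number mixes very different scales.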

The comparison figure also suggests that PADiff tends to rise early and remain above the baselines across training. That supports the paper’s claim that its architecture is not merely finding a lucky final checkpoint. But the table is still the cleaner evidence: across all six settings, PADiff is the top reported method.

The visualisation results play a different role. In one Predator-Prey scenario, the authors feed the same initial state into the policy and show different cooperative trajectories. In the appendix, they randomly sample six Predator-Prey states and generate 1,000 action samples per state; the resulting kernel density plots show multiple peaks. These figures are best interpreted as mechanism validation. They do not prove business readiness. They support the claim that the learned policy is actually multimodal, rather than simply scoring higher for unrelated reasons.
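
A simpler cousin of that check works for discrete actions: sample repeatedly under one fixed state and count how many actions carry a meaningful share of the mass. The sketch below assumes a hypothetical `policy_sample` callable and a 20% share threshold; it substitutes a histogram for the paper's kernel density plots, and the two toy policies are invented for illustration.

```python
import random
from collections import Counter


def dominant_actions(policy_sample, state, n_samples=1000, min_share=0.2):
    """Actions that receive at least min_share of the samples drawn under one state."""
    counts = Counter(policy_sample(state) for _ in range(n_samples))
    return sorted(a for a, c in counts.items() if c / n_samples >= min_share)


rng = random.Random(0)


def multimodal_policy(state):
    # Toy policy with two cooperation modes (actions 0 and 2).
    return rng.choices([0, 1, 2, 3], weights=[0.45, 0.05, 0.45, 0.05])[0]


def collapsed_policy(state):
    # Toy policy that has collapsed onto a single response.
    return rng.choices([0, 1, 2, 3], weights=[0.97, 0.01, 0.01, 0.01])[0]


state = None  # placeholder: the toy policies ignore the state
print(dominant_actions(multimodal_policy, state))  # two modes survive the threshold
print(dominant_actions(collapsed_policy, state))   # one dominant reflex
```

The same function can be pointed at any deployed policy that exposes a sampling call, which makes it a cheap first test for collapse before anyone reaches for density plots.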

This distinction is not pedantic. In applied AI, a model that performs better and a model that performs better for the claimed reason are not the same animal. PADiff provides both benchmark evidence and some mechanism-consistent visual evidence. That combination is stronger than a leaderboard row alone.

What each experiment is actually doing

The paper contains several experimental elements, and they should not be thrown into one bucket labelled “results”. They answer different questions.

Paper element	Likely purpose	What it supports	What it does not prove
Main benchmark comparison across PP, LBF, and Overcooked	Main evidence	PADiff outperforms selected AHT, offline RL, diffusion, and transformer baselines in the tested simulated settings	Real-world deployment performance
Table 1 with six test settings	Main evidence and magnitude	The gains are consistent across both 4-policy and 8-policy held-out pools	That relative gains will transfer to other task scales
Cross-play matrices	Robustness check on evaluation setup	Test teammate pools contain meaningfully different strategies	That all real human or robot partners can be represented by such pools
Multimodal trajectory and density visualisations	Mechanism sanity check	PADiff appears to generate multiple action modes under the same state	That every mode is safe, interpretable, or operationally desirable
AFM-Net replacement with MLP/U-Net	Ablation	Context-conditioned modulation improves the denoiser in these AHT tasks	That AFM-Net is the only possible efficient adaptation design
PGB removal tests	Ablation	CoReturn and CoGoal prediction both contribute to performance	That these exact auxiliary objectives are optimal
Diffusion-step and dropout analysis	Sensitivity/implementation detail	Moderate dropout and appropriate diffusion depth balance performance and cost	A universal hyperparameter rule
Compute reporting	Implementation detail	Experiments used Quadro RTX 8000 GPUs, batch size 128, and environment-dependent training times	Production cost under deployment constraints

The hyperparameter analysis deserves a short note because it is where the paper quietly acknowledges the cost of diffusion. More denoising steps can improve generation quality, but computation rises. The authors report diminishing returns beyond 30 diffusion steps in their domains and compare depths of 2, 10, 20, and 30 with dropout variation. The business translation is simple: diffusion policies buy expressiveness by spending iterative computation. Anyone pretending that this is free should be made to run the latency budget.
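
Running that latency budget is one line of arithmetic. In the sketch below, the control period, per-step denoiser latency, and fixed overhead are hypothetical numbers chosen for illustration; none come from the paper, and `max_denoising_steps` is an invented helper.

```python
import math


def max_denoising_steps(control_period_ms, per_step_ms, overhead_ms=0.0):
    """Largest number of sequential denoising passes that fits in one control period."""
    return max(0, math.floor((control_period_ms - overhead_ms) / per_step_ms))


# Hypothetical budget: act every 50 ms, 2.5 ms per denoiser pass, 5 ms for encoding and I/O.
steps = max_denoising_steps(control_period_ms=50.0, per_step_ms=2.5, overhead_ms=5.0)
print(steps)  # 18
```

If the affordable step count falls below the depth at which the policy was validated, the choice is between a faster denoiser, a slower control loop, or a re-evaluation at the shallower depth.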

The business value is optionality under uncertainty

The immediate commercial relevance is not that diffusion models are fashionable. Fashion is not a strategy, although it does seem to raise seed rounds. The relevance is that many business deployments increasingly involve agents operating around unfamiliar partners.

In warehouse automation, the partner may be another robot from a different control stack, a human picker, or a temporary process change. In autonomous driving, the partner is every nearby road user. In drone operations, the partner may be a human operator, another drone, or a ground unit with incomplete communication. In enterprise software agents, the partner may be another automated workflow owned by a different team. The surface domains differ, but the coordination problem rhymes.

PADiff suggests a practical design principle: when partner behaviour is uncertain, preserve multiple structured cooperation modes and use recent interaction context to decide among them. Do not force the agent into a single average behaviour too early.

What the paper directly shows is narrower:

Category	Paper shows	Cognaptus business inference	Boundary
Policy representation	Diffusion policies can model multimodal action distributions in AHT benchmarks	Agent systems may benefit from preserving several plausible coordination plans	Evidence is simulated and benchmark-based
Adaptation	AFM-Net improves performance versus MLP and U-Net variants	Context should modulate internal decision features, not merely sit beside the model as metadata	Other architectures may achieve similar effects
Prediction	PGB’s CoReturn and CoGoal objectives improve results in ablations	Training agents to predict team outcomes may produce better runtime cooperation without inference-time guidance	The chosen predictive targets are tested only in the paper’s domains
Generalisation	Held-out teammate policy pools test unfamiliar strategy adaptation	Deployment teams should evaluate agents against unseen partner behaviours, not just repeated internal scenarios	Real partners may be noisier, partially observable, or strategically non-stationary
Cost	More diffusion steps create a performance-cost trade-off	Diffusion control needs latency and compute budgeting from the beginning	The paper reports research training resources, not production serving cost

The important business idea is not “use PADiff tomorrow”. It is “measure whether your agent collapses to one coordination style”. Many systems will look competent in rehearsed evaluation and brittle in mixed-partner deployment. PADiff gives one technical route to reducing that brittleness.

Where the claim should stop

PADiff is promising, but its boundary is not subtle.

First, the paper assumes full observability. The authors state this directly in the limitations. Real deployments often involve partial observability, noisy sensors, delayed communication, hidden human intention, and incomplete state sharing. A robot in a lab grid world gets a cleaner reality than a robot in a warehouse aisle at 5:47 p.m. on a Friday. Shocking, yes.

Second, the environments are simulated benchmarks. Predator-Prey, Level-Based Foraging, and Overcooked are useful stress tests, but they are not autonomous highways, disaster zones, or production factories. They abstract coordination sharply, which is exactly why researchers use them. That abstraction makes the evidence cleaner, not broader.

Third, the teammate pools are generated through a particular diversity mechanism. The cross-play matrices indicate that the policies are distinct, which strengthens the experiment. But real partner diversity can include irrational behaviour, adversarial incentives, hardware failure, human hesitation, institutional constraints, and plain old bad documentation. PADiff does not solve all of that. It addresses a structured version of unfamiliar teammate adaptation.

Fourth, the reported gains should be interpreted by task scale. An 80% relative improvement in Level-Based Foraging sounds spectacular, but the absolute scores are small. The correct reading is not “80% better teamwork everywhere”. The correct reading is that PADiff produced the strongest results in a difficult low-return benchmark condition where baselines struggled.

Finally, diffusion-based action generation has computational implications. The appendix reports training on Quadro RTX 8000 GPUs with environment training times of roughly 16 hours for Predator-Prey, 6 hours for Level-Based Foraging, and 13 hours for Overcooked, using batch size 128. That is not terrifying by modern research standards, but deployment is not research. If the agent must act under tight latency constraints, the number of denoising steps becomes an engineering decision, not a footnote.

The deeper shift is from optimal action to cooperation portfolio

The most useful way to understand PADiff is not as a diffusion paper wearing a multi-agent hat. It is a paper about cooperation portfolios.

A unimodal policy effectively says: under this state, here is the action I have learned to prefer. A multimodal policy says: under this state, several cooperation plans remain live, and the team context should decide which one survives denoising. PADiff then adds adaptation through AFM-Net and predictive training through PGB, so the distribution is not just wide; it is shaped by teammate behaviour and future team objectives.

That is a more mature view of agentic systems. In open environments, intelligence is not only choosing the best action. It is knowing when the “best” action is underdetermined because other agents have not yet revealed themselves. It is keeping strategic slack without degenerating into noise.

For businesses building multi-agent automation, the paper points toward a useful evaluation question: does your agent have one coordination reflex, or can it maintain several plausible patterns and adapt as partners reveal their intent? The answer will matter more as automation moves from isolated task execution into shared operational spaces.

PADiff is not the final architecture for real-world ad hoc teamwork. It is still simulated, fully observable, and benchmark-bound. But it does sharpen the right problem. Cooperation is not just alignment with a goal. It is adaptation to the people, robots, and agents who are also trying to reach it, sometimes badly.

That is where diffusion becomes more than a generative trick. It becomes a way to delay premature certainty. In teamwork, that delay may be the difference between an agent that merely acts and one that can actually collaborate.

Cognaptus: Automate the Present, Incubate the Future.

¹ Hohei Chan, Xinzhi Zhang, Antao Xiang, Weinan Zhang, and Mengchen Zhao, “PADiff: Predictive and Adaptive Diffusion Policies for Ad Hoc Teamwork,” arXiv:2511.07260, 2026. https://arxiv.org/abs/2511.07260
