Opening — Why This Matters Now
Foundation vision models are becoming corporate infrastructure. They sit behind defect detection systems, medical imaging workflows, retail analytics dashboards, and increasingly, compliance pipelines.
But here is the quiet operational truth: most enterprises do not retrain these models. They adapt them.
Full fine-tuning is expensive, risky, and often unnecessary. Prompt tuning—adding learnable tokens while freezing the backbone—has emerged as the pragmatic alternative. Yet most approaches rely on a single pre-trained model. A single “expert.”
That assumption is convenient. It is also limiting.
The ICLR 2025 paper *pMoE: Prompting Diverse Experts Together Wins More in Visual Adaptation* introduces a simple but powerful idea: instead of adapting one expert, let multiple pre-trained experts collaborate—dynamically.
Not by merging weights. Not by retraining everything. But by orchestrating prompts.
This shift has material implications for cost, modularity, and ROI in applied AI systems.
Background — The Limits of Single-Expert Prompting
The Rise of Parameter-Efficient Tuning
Visual Prompt Tuning (VPT) and its successors (GaPT, LSPT) showed that you can freeze a Vision Transformer (ViT) and prepend learnable prompt tokens. The backbone remains intact. Only a small set of tokens is trained.
The economic appeal is obvious:
| Method | Backbone Updated? | Extra Parameters | Deployment Cost | Risk Profile |
|---|---|---|---|---|
| Full Fine-Tuning | Yes | High | High | High |
| Adapters | Partial | Medium | Medium | Moderate |
| Prompt Tuning | No | Low | Low | Low |
Prompt tuning became the “safe” enterprise choice.
But it came with an implicit constraint:
One backbone. One knowledge source.
If your backbone was trained on ImageNet, it excels at natural images. If it was trained on medical imaging, it understands domain nuance but may lack general semantics.
Real-world tasks often require both.
Fine-grained classification may require discriminative contrastive features (DINO, MoCo). Medical segmentation may benefit from domain-specific encoders. Structured visual reasoning tasks may benefit from different inductive biases.
A single expert rarely dominates across all axes.
What pMoE Actually Does — Mechanism, Not Marketing
pMoE introduces two architectural elements:
- Expert Prompt Tokens (EPTs) — Each pre-trained model (“expert”) gets its own prompt tokens.
- A Learnable Dispatcher — A shared module that dynamically weights and fuses prompt tokens across experts.
The backbone weights remain frozen.
The experts do not directly share internal representations.
Instead, they communicate through prompts.
Step 1: Multiple Experts, Separate Prompt Streams
Assume we have K pre-trained experts:
- DINO
- MoCo v3
- MAE
- CLIP
- Medical vision encoder
Each expert has its own:
- Patch embedding
- Transformer stack
- Prompt tokens per layer
But these prompt tokens are not isolated.
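A minimal sketch of this layout, using NumPy arrays as stand-ins for learnable tensors; the sizes (3 experts, 12 layers, 8 prompt tokens, 768-dim ViT-B-scale embeddings) are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not the paper's exact config):
K = 3     # experts, e.g. DINO, MoCo v3, MAE
L = 12    # transformer layers per expert
P = 8     # prompt tokens per layer
D = 768   # embedding dimension (ViT-B scale)

# One independent stream of Expert Prompt Tokens (EPTs) per expert.
# Each expert also keeps its own frozen patch embedding and transformer
# stack; only these prompt tensors would be trained.
expert_prompts = [rng.normal(scale=0.02, size=(L, P, D)) for _ in range(K)]
```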
Step 2: The Dispatcher — Dynamic Token Fusion
At each transformer layer, the dispatcher:
- Observes each expert’s current state (its prompt tokens, accumulated prompts, and patch tokens)
- Produces dispatching weights
- Combines the same-index prompt tokens across all experts into integrated tokens
Mathematically, each integrated prompt token becomes a weighted sum across experts.
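In symbols (our notation, not the paper's, and simplified to one scalar weight per expert per layer): if $p^{(\ell,k)}_i$ is the $i$-th prompt token of expert $k$ at layer $\ell$, the integrated token is

```latex
\tilde{p}^{(\ell)}_i \;=\; \sum_{k=1}^{K} w^{(\ell)}_k \, p^{(\ell,k)}_i,
\qquad \sum_{k=1}^{K} w^{(\ell)}_k = 1,
```

where the weights $w^{(\ell)}_k$ are produced by the dispatcher from the experts' current states.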
Conceptually:
“For this task, at this layer, how much should we listen to each expert?”
This is not static averaging.
It is task- and layer-dependent routing.
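A framework-agnostic sketch of this routing step, assuming a deliberately simplified dispatcher: one pooled state vector per expert and a single learned projection producing one logit per expert. The paper's dispatcher is a learned module with its own parameterization; this is only the shape of the idea.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dispatch(prompts, state, W):
    """Fuse same-index prompt tokens across experts at one layer.

    prompts: [K, P, D]  prompt tokens of K experts (P tokens, dim D)
    state:   [K, D]     summary of each expert's current tokens
                        (e.g., mean-pooled prompt + patch tokens)
    W:       [D]        learnable projection giving one logit per expert
    """
    logits = state @ W               # [K]: "how much to listen to each expert"
    weights = softmax(logits)        # normalized dispatching weights
    # Weighted sum over the expert axis -> integrated prompts [P, D]
    fused = np.einsum("k,kpd->pd", weights, prompts)
    return fused, weights

rng = np.random.default_rng(0)
K, P, D = 3, 8, 16
prompts = rng.normal(size=(K, P, D))
state = rng.normal(size=(K, D))
W = rng.normal(size=(D,))
fused, weights = dispatch(prompts, state, W)
```

Because the weights depend on the current state, the same dispatcher can route differently per task and per layer, which is exactly the property the routing visualizations in the paper exploit.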
Findings — Where the Gains Actually Appear
The authors test across 47 adaptation tasks:
- FGVC (fine-grained classification)
- VTAB-1K (Natural, Specialized, Structured)
- Medical imaging benchmarks (classification + segmentation)
- ADE20K segmentation
The pattern is consistent.
1. VTAB-1K (ImageNet-21K Supervised Backbone)
| Method | Natural | Specialized | Structured | Avg |
|---|---|---|---|---|
| LSPT | 85.26 | 88.57 | 66.25 | 77.95 |
| LSPT + pMoE | 87.18 | 90.25 | 69.32 | 80.31 |
Structured tasks see the largest relative boost.
This is not accidental.
Structured tasks often require combining semantic reasoning and low-level feature precision—precisely where multi-expert collaboration helps.
2. Fine-Grained Classification (FGVC)
Using DINO v2 pre-trained backbones:
| Dataset | LSPT (1.10X params) | LSPT + pMoE | Gain |
|---|---|---|---|
| CUB | 84.85 | 86.07 | +1.22 |
| Flowers | 95.57 | 96.58 | +1.01 |
| Cars | 80.63 | 81.12 | +0.49 |
The parameter increase is modest (1.10X total).
The performance gains are consistent.
This is the hallmark of structural improvement, not optimization luck.
3. Medical Classification
The effect becomes more pronounced in medical tasks:
Example (Color Medical Datasets):
| Task | LSPT | LSPT + pMoE |
|---|---|---|
| Kvasir Polyp | 71.53 | 75.68 |
| ISIC Skin | 54.37 | 56.25 |
Medical domains benefit disproportionately.
Why?
Because medical imaging requires domain priors that general vision encoders lack. Multi-expert prompts allow combining:
- Contrastive representation strength
- Domain-specific encoding
- Segmentation-friendly feature hierarchies
Without retraining everything.
4. Segmentation — Dense Prediction Still Improves
ADE20K (MAE backbone):
| Method | mIoU (multi-scale) |
|---|---|
| LSPT | 41.51 |
| LSPT + pMoE | 42.87 |
Medical segmentation (Kvasir, Skin):
| Dataset | LSPT | LSPT + pMoE |
|---|---|---|
| Kvasir-seg | 45.81 | 47.95 |
| Skin | 77.63 | 80.35 |
This demonstrates something subtle:
Mixture-of-Experts via prompts works not only for classification, but also for dense spatial tasks.
That matters for enterprise AI pipelines where segmentation and detection are often more valuable than classification.
Why This Works — Architectural Interpretation
pMoE solves three structural issues in prompt tuning.
1. Knowledge Silos
Single-model prompt tuning traps adaptation inside one representation regime.
pMoE enables cross-expert knowledge flow without merging backbones.
2. Token Underutilization
VPT-deep discards accumulated prompt tokens after each layer.
pMoE preserves and integrates them, improving inter-layer information flow.
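The contrast can be sketched as follows; identity functions stand in for the frozen transformer layers, and all sizes are illustrative assumptions.

```python
import numpy as np

def frozen_layer(tokens):
    # Stand-in for a frozen transformer layer; identity keeps the sketch minimal.
    return tokens

rng = np.random.default_rng(0)
L, P, N, D = 3, 4, 16, 8                   # layers, prompts/layer, patches, dim
patches = rng.normal(size=(N, D))
fresh = [rng.normal(size=(P, D)) for _ in range(L)]

# VPT-deep style: prompt outputs are discarded after every layer.
x = patches
for l in range(L):
    out = frozen_layer(np.concatenate([fresh[l], x]))
    x = out[P:]                            # keep patch positions only

# Accumulating style (the property pMoE builds on): prompt outputs survive.
x = patches
carried = np.empty((0, D))
for l in range(L):
    n_prompts = P + carried.shape[0]       # new + previously carried prompts
    out = frozen_layer(np.concatenate([fresh[l], carried, x]))
    carried, x = out[:n_prompts], out[n_prompts:]
```

After L layers the accumulating variant carries L × P prompt outputs forward, giving later layers access to earlier prompt states instead of throwing them away.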
3. Static Prompt Bias
Traditional prompt tuning assumes all tasks benefit from the same prompt structure.
pMoE routes different tasks through different expert paths.
The visualization in the paper shows distinct expert activation paths for:
- Natural tasks
- Specialized tasks
- Structured tasks
That dynamic routing is where the performance lift originates.
Enterprise Implications — Beyond the Benchmark Table
Let’s translate this into operational terms.
1. Modular Model Strategy
Instead of:
“Which backbone should we deploy?”
You can ask:
“Which combination of experts should we orchestrate?”
This aligns with enterprise AI architecture trends toward modular, composable systems.
2. Lower Re-Training Risk
You do not retrain or merge backbones.
You:
- Freeze experts
- Train prompt tokens
- Train a lightweight dispatcher
This reduces:
- Regulatory risk (important in medical applications)
- Drift instability
- Compliance overhead
3. Better ROI per Parameter
The parameter overhead is modest (often ~1.04X–1.10X total parameters).
Performance gains are consistent across domains.
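To make "ROI per parameter" concrete, here is a back-of-envelope budget under assumed sizes (ViT-B-scale backbones and a small per-layer dispatcher projection; none of these numbers come from the paper):

```python
# Rough trainable-parameter budget (all sizes are illustrative assumptions):
BACKBONE = 86_000_000        # one frozen ViT-B-scale expert
K, L, P, D = 3, 12, 8, 768   # experts, layers, prompt tokens/layer, embed dim

prompt_params = K * L * P * D     # Expert Prompt Tokens
dispatcher_params = K * L * D     # small per-layer projection (assumption)
trainable = prompt_params + dispatcher_params

print(f"trainable params: {trainable:,}")
print(f"vs one frozen backbone: {trainable / BACKBONE:.2%}")
```

Even with generous assumptions, the trainable footprint stays well under one percent of a single backbone, which is why freezing the experts keeps retraining risk and serving cost low.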
From a cost-performance lens:
| Approach | Compute Cost | Adaptability | Cross-Domain Robustness |
|---|---|---|---|
| Single Prompt Tuning | Low | Medium | Low |
| Multi-Backbone Ensemble | High | High | High |
| pMoE | Low–Medium | High | High |
pMoE approximates ensemble benefits without ensemble-level cost.
That is a non-trivial engineering achievement.
Limitations and Trade-Offs
No architectural improvement is free.
Observed constraints:
- Gains plateau beyond ~6–9 experts
- Increasing prompt layers yields diminishing returns beyond 9–12
- Complexity increases integration overhead
But these are manageable scaling laws, not fatal flaws.
Importantly, the dispatcher itself remains lightweight.
The FLOPs increase is modest relative to full model duplication.
Broader Trend — Prompts as Orchestrators
pMoE hints at a larger paradigm shift.
Prompts are no longer just “soft instructions.”
They are becoming:
- Routing controllers
- Knowledge selectors
- Cross-model communication channels
In LLM systems, we see similar patterns emerging in multi-agent orchestration.
In vision, pMoE demonstrates that the same orchestration logic applies.
The future of adaptation may not be bigger backbones.
It may be smarter coordination.
Conclusion — Specialists, Coordinated
pMoE does not reinvent vision transformers.
It rethinks how we adapt them.
Instead of asking one expert to stretch beyond its domain, it lets multiple experts collaborate—through prompts and a lightweight dispatcher.
The result is consistent improvement across:
- General classification
- Fine-grained tasks
- Medical imaging
- Segmentation
For enterprises building AI systems that must operate across heterogeneous visual domains, this is not an academic curiosity.
It is an architectural blueprint.
Because sometimes the smartest model is not the biggest one.
It is the one that knows which expert to ask.
Cognaptus: Automate the Present, Incubate the Future.