Opening — Why This Matters Now

Foundation vision models are becoming corporate infrastructure. They sit behind defect detection systems, medical imaging workflows, retail analytics dashboards, and increasingly, compliance pipelines.

But here is the quiet operational truth: most enterprises do not retrain these models. They adapt them.

Full fine-tuning is expensive, risky, and often unnecessary. Prompt tuning—adding learnable tokens while freezing the backbone—has emerged as the pragmatic alternative. Yet most approaches rely on a single pre-trained model. A single “expert.”

That assumption is convenient. It is also limiting.

The ICLR 2025 paper “pMoE: Prompting Diverse Experts Together Wins More in Visual Adaptation” introduces a simple but powerful idea: instead of adapting one expert, let multiple pre-trained experts collaborate—dynamically.

Not by merging weights. Not by retraining everything. But by orchestrating prompts.

This shift has material implications for cost, modularity, and ROI in applied AI systems.


Background — The Limits of Single-Expert Prompting

The Rise of Parameter-Efficient Tuning

Visual Prompt Tuning (VPT) and its successors (GaPT, LSPT) showed that you can freeze a Vision Transformer (ViT) and prepend learnable prompt tokens. The backbone remains intact. Only a small set of tokens is trained.
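As a mental model, the mechanics fit in a few lines. Below is a minimal PyTorch sketch of shallow prompt tuning with a frozen encoder; the toy transformer, dimensions, and prompt count are illustrative assumptions, not VPT's exact configuration.

```python
import torch
import torch.nn as nn

class PromptTunedEncoder(nn.Module):
    """Shallow prompt tuning: a frozen encoder plus learnable prompt tokens.

    The encoder is a toy stand-in for a ViT backbone; sizes are illustrative.
    """
    def __init__(self, dim=768, depth=12, num_prompts=10):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=depth)
        for p in self.backbone.parameters():      # freeze the backbone
            p.requires_grad_(False)
        # the only trainable parameters: prompt tokens prepended to the sequence
        self.prompts = nn.Parameter(torch.randn(1, num_prompts, dim) * 0.02)

    def forward(self, patch_tokens):              # (B, N, dim) embedded patches
        b = patch_tokens.size(0)
        x = torch.cat([self.prompts.expand(b, -1, -1), patch_tokens], dim=1)
        return self.backbone(x)

enc = PromptTunedEncoder()
out = enc(torch.randn(2, 196, 768))               # 196 = 14x14 patches
print(out.shape)                                   # torch.Size([2, 206, 768])
```

Training only ever updates `self.prompts`; the backbone weights stay byte-for-byte identical, which is what makes this the low-risk row in the table below.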

The economic appeal is obvious:

| Method | Backbone Updated? | Extra Parameters | Deployment Cost | Risk Profile |
|---|---|---|---|---|
| Full Fine-Tuning | Yes | High | High | High |
| Adapters | Partial | Medium | Medium | Moderate |
| Prompt Tuning | No | Low | Low | Low |

Prompt tuning became the “safe” enterprise choice.

But it came with an implicit constraint:

One backbone. One knowledge source.

If your backbone was trained on ImageNet, it excels at natural images. If it was trained on medical imaging, it understands domain nuance but may lack general semantics.

Real-world tasks often require both.

Fine-grained classification may require discriminative contrastive features (DINO, MoCo). Medical segmentation may call for domain-specific encoders. Structured visual reasoning may favor entirely different inductive biases.

A single expert rarely dominates across all axes.


What pMoE Actually Does — Mechanism, Not Marketing

pMoE introduces two architectural elements:

  1. Expert Prompt Tokens (EPTs) — Each pre-trained model (“expert”) gets its own prompt tokens.
  2. A Learnable Dispatcher — A shared module that dynamically weights and fuses prompt tokens across experts.

The backbone weights remain frozen.

The experts do not directly share internal representations.

Instead, they communicate through prompts.

Step 1: Multiple Experts, Separate Prompt Streams

Assume we have K pre-trained experts:

  • DINO
  • MoCo v3
  • MAE
  • CLIP
  • Medical vision encoder

Each expert has its own:

  • Patch embedding
  • Transformer stack
  • Prompt tokens per layer

But these prompt tokens are not isolated.
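In PyTorch terms, the bookkeeping for Step 1 might look like this; K, the depth, and the prompt count are placeholder values, and the per-expert backbones are elided.

```python
import torch
import torch.nn as nn

K, DEPTH, NUM_PROMPTS, DIM = 3, 12, 10, 768   # illustrative sizes

# one frozen backbone per expert (DINO, MoCo v3, etc., elided here),
# and one independent stack of per-layer prompt tokens per expert
expert_prompts = nn.ParameterList([
    nn.Parameter(torch.randn(DEPTH, NUM_PROMPTS, DIM) * 0.02)
    for _ in range(K)
])
```

Each stack evolves independently until the dispatcher, described next, ties them together.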


Step 2: The Dispatcher — Dynamic Token Fusion

At each transformer layer, a dispatcher:

  • Observes the expert’s current state (prompt tokens + accumulated prompts + patch tokens)
  • Produces dispatching weights
  • Combines the same-index prompt tokens across all experts

Mathematically, each integrated prompt token becomes a weighted sum across experts.
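In symbols (our notation, not the paper's): with $K$ experts and $p_{k,j}^{(l)}$ denoting prompt token $j$ from expert $k$ at layer $l$, the dispatcher emits weights $w_{k}^{(l)}$ that sum to one and forms the integrated token

$$
\tilde{p}_{j}^{(l)} \;=\; \sum_{k=1}^{K} w_{k}^{(l)} \, p_{k,j}^{(l)},
\qquad \sum_{k=1}^{K} w_{k}^{(l)} = 1 .
$$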

Conceptually:

“For this task, at this layer, how much should we listen to each expert?”

This is not static averaging.

It is task- and layer-dependent routing.
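A minimal sketch of such a dispatcher, assuming PyTorch. The gating design here (mean-pool the current token state, one softmax over experts per layer) is our simplification, not the paper's exact module.

```python
import torch
import torch.nn as nn

class PromptDispatcher(nn.Module):
    """Fuses same-index prompt tokens across K experts with learned weights."""
    def __init__(self, dim, num_experts):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)   # token state -> expert logits

    def forward(self, prompts, state):
        # prompts: (K, B, P, D) same-index prompt tokens from each expert
        # state:   (B, N, D)   the current token sequence at this layer
        logits = self.gate(state.mean(dim=1))     # (B, K): summarize the state
        w = logits.softmax(dim=-1)                # dispatching weights per expert
        # weighted sum over the expert axis -> integrated prompts (B, P, D)
        return torch.einsum('bk,kbpd->bpd', w, prompts)

K, B, P, D = 3, 2, 10, 768
disp = PromptDispatcher(D, K)
fused = disp(torch.randn(K, B, P, D), torch.randn(B, 196 + P, D))
print(fused.shape)   # torch.Size([2, 10, 768])
```

Because the weights are produced from the live token state, two tasks hitting the same frozen experts can still induce different mixtures at different layers.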


Findings — Where the Gains Actually Appear

The authors test across 47 adaptation tasks:

  • FGVC (fine-grained classification)
  • VTAB-1K (Natural, Specialized, Structured)
  • Medical imaging benchmarks (classification + segmentation)
  • ADE20K segmentation

The pattern is consistent.

1. VTAB-1K (ImageNet-21K Supervised Backbone)

| Method | Natural | Specialized | Structured | Avg |
|---|---|---|---|---|
| LSPT | 85.26 | 88.57 | 66.25 | 77.95 |
| LSPT + pMoE | 87.18 | 90.25 | 69.32 | 80.31 |

Structured tasks see the largest relative boost.

This is not accidental.

Structured tasks often require combining semantic reasoning and low-level feature precision—precisely where multi-expert collaboration helps.


2. Fine-Grained Classification (FGVC)

Using DINO v2 pre-trained backbones:

| Dataset | LSPT (1.10X params) | LSPT + pMoE | Gain |
|---|---|---|---|
| CUB | 84.85 | 86.07 | +1.22 |
| Flowers | 95.57 | 96.58 | +1.01 |
| Cars | 80.63 | 81.12 | +0.49 |

The parameter increase is modest (1.10X total).

The performance gains are consistent.

This is the hallmark of structural improvement, not optimization luck.


3. Medical Classification

The effect becomes more pronounced in medical tasks:

Example (Color Medical Datasets):

| Task | LSPT | LSPT + pMoE |
|---|---|---|
| Kvasir Polyp | 71.53 | 75.68 |
| ISIC Skin | 54.37 | 56.25 |

Medical domains benefit disproportionately.

Why?

Because medical imaging requires domain priors that general vision encoders lack. Multi-expert prompts allow combining:

  • Contrastive representation strength
  • Domain-specific encoding
  • Segmentation-friendly feature hierarchies

Without retraining everything.


4. Segmentation — Dense Prediction Still Improves

ADE20K (MAE backbone):

| Method | mIoU (MS) |
|---|---|
| LSPT | 41.51 |
| LSPT + pMoE | 42.87 |

Medical segmentation (Kvasir, Skin):

| Dataset | LSPT | LSPT + pMoE |
|---|---|---|
| Kvasir-seg | 45.81 | 47.95 |
| Skin | 77.63 | 80.35 |

This demonstrates something subtle:

Mixture-of-Experts via prompts works not only for classification, but also for dense spatial tasks.

That matters for enterprise AI pipelines where segmentation and detection are often more valuable than classification.


Why This Works — Architectural Interpretation

pMoE solves three structural issues in prompt tuning.

1. Knowledge Silos

Single-model prompt tuning traps adaptation inside one representation regime.

pMoE enables cross-expert knowledge flow without merging backbones.

2. Token Underutilization

VPT-deep discards accumulated prompt tokens after each layer.

pMoE preserves and integrates them, improving inter-layer information flow.
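At the tensor level the contrast is easy to see; a runnable sketch with shapes and variable names of our own choosing:

```python
import torch

D, B = 768, 2
patch_tokens = torch.randn(B, 196, D)
fresh_prompts = torch.randn(B, 10, D)   # this layer's learnable prompts
accumulated = torch.randn(B, 20, D)     # prompt outputs of earlier layers

# VPT-deep style: discard accumulated prompt outputs, keep only fresh prompts
x_vpt = torch.cat([fresh_prompts, patch_tokens], dim=1)                 # (B, 206, D)

# pMoE-style preservation (sketch): earlier prompt outputs stay in the sequence
x_pmoe = torch.cat([accumulated, fresh_prompts, patch_tokens], dim=1)   # (B, 226, D)
```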

3. Static Prompt Bias

Traditional prompt tuning assumes all tasks benefit from the same prompt structure.

pMoE routes different tasks through different expert paths.

The visualization in the paper shows distinct expert activation paths for:

  • Natural tasks
  • Specialized tasks
  • Structured tasks

That dynamic routing is where the performance lift originates.


Enterprise Implications — Beyond the Benchmark Table

Let’s translate this into operational terms.

1. Modular Model Strategy

Instead of:

“Which backbone should we deploy?”

You can ask:

“Which combination of experts should we orchestrate?”

This aligns with enterprise AI architecture trends toward modular, composable systems.


2. Lower Re-Training Risk

You do not retrain or merge backbones.

You:

  • Freeze experts
  • Train prompt tokens
  • Train a lightweight dispatcher (see the sketch below)

This reduces:

  • Regulatory risk (important in medical applications)
  • Drift instability
  • Compliance overhead
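A sketch of that recipe, assuming PyTorch; the modules below are toy stand-ins for the real experts, prompt stacks, and dispatcher:

```python
import torch
import torch.nn as nn

# illustrative stand-ins for the real modules
experts = nn.ModuleList([nn.Linear(768, 768) for _ in range(3)])   # frozen backbones
prompts = nn.ParameterList([nn.Parameter(torch.zeros(10, 768)) for _ in range(3)])
dispatcher = nn.Linear(768, 3)                                     # lightweight gate

for expert in experts:                  # 1) freeze every expert
    expert.requires_grad_(False)

# 2) and 3): optimize only prompt tokens and the dispatcher
trainable = list(prompts.parameters()) + list(dispatcher.parameters())
optimizer = torch.optim.AdamW(trainable, lr=1e-3)
```

The optimizer never touches a backbone weight, which is what keeps the re-validation surface small in regulated settings.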

3. Better ROI per Parameter

The parameter overhead is modest (often ~1.04X–1.10X total parameters).
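A back-of-envelope count using the toy sizes from the earlier sketches (assumed values, not the paper's configuration; the published 1.04X–1.10X reflects larger prompt setups, but the order of magnitude is the point):

```python
# trainable additions: per-expert prompt stacks plus per-layer linear gates
K, DEPTH, NUM_PROMPTS, DIM = 3, 12, 10, 768
prompt_params = K * DEPTH * NUM_PROMPTS * DIM        # 276,480
gate_params = DEPTH * (DIM * K + K)                  # 27,684
backbone_params = 86_000_000                         # ViT-B scale, assumed
print((prompt_params + gate_params) / backbone_params)   # ~0.0035
```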

Performance gains are consistent across domains.

From a cost-performance lens:

| Approach | Compute Cost | Adaptability | Cross-Domain Robustness |
|---|---|---|---|
| Single Prompt Tuning | Low | Medium | Low |
| Multi-Backbone Ensemble | High | High | High |
| pMoE | Low–Medium | High | High |

pMoE approximates ensemble benefits without ensemble-level cost.

That is a non-trivial engineering achievement.


Limitations and Trade-Offs

No architectural improvement is free.

Observed constraints:

  • Gains plateau beyond ~6–9 experts
  • Increasing prompt layers yields diminishing returns beyond 9–12
  • Complexity increases integration overhead

But these are manageable scaling laws, not fatal flaws.

Importantly, the dispatcher itself remains lightweight.

The FLOPs increase is modest relative to full model duplication.


Broader Trend — Prompts as Orchestrators

pMoE hints at a larger paradigm shift.

Prompts are no longer just “soft instructions.”

They are becoming:

  • Routing controllers
  • Knowledge selectors
  • Cross-model communication channels

In LLM systems, we see similar patterns emerging in multi-agent orchestration.

In vision, pMoE demonstrates that the same orchestration logic applies.

The future of adaptation may not be bigger backbones.

It may be smarter coordination.


Conclusion — Specialists, Coordinated

pMoE does not reinvent vision transformers.

It rethinks how we adapt them.

Instead of asking one expert to stretch beyond its domain, it lets multiple experts collaborate—through prompts and a lightweight dispatcher.

The result is consistent improvement across:

  • General classification
  • Fine-grained tasks
  • Medical imaging
  • Segmentation

For enterprises building AI systems that must operate across heterogeneous visual domains, this is not an academic curiosity.

It is an architectural blueprint.

Because sometimes the smartest model is not the biggest one.

It is the one that knows which expert to ask.

Cognaptus: Automate the Present, Incubate the Future.