Opening — Why This Matters Now
Foundation vision models are becoming corporate infrastructure. They sit behind defect detection systems, medical imaging workflows, retail analytics dashboards, and increasingly, compliance pipelines.
But here is the quiet operational truth: most enterprises do not retrain these models. They adapt them.
Full fine-tuning is expensive, risky, and often unnecessary. Prompt tuning—adding learnable tokens while freezing the backbone—has emerged as the pragmatic alternative. Yet most approaches rely on a single pre-trained model. A single “expert.”
That assumption is convenient. It is also limiting.
The ICLR 2025 paper *pMoE: Prompting Diverse Experts Together Wins More in Visual Adaptation* introduces a simple but powerful idea: instead of adapting one expert, let multiple pre-trained experts collaborate—dynamically.
Not by merging weights. Not by retraining everything. But by orchestrating prompts.
This shift has material implications for cost, modularity, and ROI in applied AI systems.
Background — The Limits of Single-Expert Prompting
The Rise of Parameter-Efficient Tuning
Visual Prompt Tuning (VPT) and its successors (GaPT, LSPT) showed that you can freeze a Vision Transformer (ViT) and prepend learnable prompt tokens. The backbone remains intact. Only a small set of tokens is trained.
The economic appeal is obvious:
| Method | Backbone Updated? | Extra Parameters | Deployment Cost | Risk Profile |
|---|---|---|---|---|
| Full Fine-Tuning | Yes | High | High | High |
| Adapters | Partial | Medium | Medium | Moderate |
| Prompt Tuning | No | Low | Low | Low |
Prompt tuning became the “safe” enterprise choice.
But it came with an implicit constraint:
One backbone. One knowledge source.
If your backbone was trained on ImageNet, it excels at natural images. If it was trained on medical imaging, it understands domain nuance but may lack general semantics.
Real-world tasks often require both.
Fine-grained classification may require discriminative contrastive features (DINO, MoCo). Medical segmentation may benefit from domain-specific encoders. Structured visual reasoning tasks may benefit from different inductive biases.
A single expert rarely dominates across all axes.
What pMoE Actually Does — Mechanism, Not Marketing
pMoE introduces two architectural elements:
- Expert Prompt Tokens (EPTs) — Each pre-trained model (“expert”) gets its own prompt tokens.
- A Learnable Dispatcher — A shared module that dynamically weights and fuses prompt tokens across experts.
The backbone weights remain frozen.
The experts do not directly share internal representations.
Instead, they communicate through prompts.
Step 1: Multiple Experts, Separate Prompt Streams
Assume we have K pre-trained experts:
- DINO
- MoCo v3
- MAE
- CLIP
- Medical vision encoder
Each expert has its own:
- Patch embedding
- Transformer stack
- Prompt tokens per layer
But these prompt tokens are not isolated.
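A minimal sketch of this layout, using NumPy arrays as stand-ins for learnable tensors; the sizes (3 experts, 12 layers, 8 prompt tokens, 768-dim ViT-B-scale embeddings) are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not the paper's exact config):
K = 3     # experts, e.g. DINO, MoCo v3, MAE
L = 12    # transformer layers per expert
P = 8     # prompt tokens per layer
D = 768   # embedding dimension (ViT-B scale)

# One independent stream of Expert Prompt Tokens (EPTs) per expert.
# Each expert also keeps its own frozen patch embedding and transformer
# stack; only these prompt tensors would be trained.
expert_prompts = [rng.normal(scale=0.02, size=(L, P, D)) for _ in range(K)]
```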
Step 2: The Dispatcher — Dynamic Token Fusion
At each transformer layer, the dispatcher:
- Observes each expert’s current state (its prompt tokens, accumulated prompts, and patch tokens)
- Produces dispatching weights
- Combines the same-index prompt tokens across all experts into integrated tokens
Mathematically, each integrated prompt token becomes a weighted sum across experts.
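In symbols (our notation, not the paper's, and simplified to one scalar weight per expert per layer): if $p^{(\ell,k)}_i$ is the $i$-th prompt token of expert $k$ at layer $\ell$, the integrated token is

```latex
\tilde{p}^{(\ell)}_i \;=\; \sum_{k=1}^{K} w^{(\ell)}_k \, p^{(\ell,k)}_i,
\qquad \sum_{k=1}^{K} w^{(\ell)}_k = 1,
```

where the weights $w^{(\ell)}_k$ are produced by the dispatcher from the experts' current states.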
Conceptually:
“For this task, at this layer, how much should we listen to each expert?”
This is not static averaging.
It is task- and layer-dependent routing.
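A framework-agnostic sketch of this routing step, assuming a deliberately simplified dispatcher: one pooled state vector per expert and a single learned projection producing one logit per expert. The paper's dispatcher is a learned module with its own parameterization; this is only the shape of the idea.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dispatch(prompts, state, W):
    """Fuse same-index prompt tokens across experts at one layer.

    prompts: [K, P, D]  prompt tokens of K experts (P tokens, dim D)
    state:   [K, D]     summary of each expert's current tokens
                        (e.g., mean-pooled prompt + patch tokens)
    W:       [D]        learnable projection giving one logit per expert
    """
    logits = state @ W               # [K]: "how much to listen to each expert"
    weights = softmax(logits)        # normalized dispatching weights
    # Weighted sum over the expert axis -> integrated prompts [P, D]
    fused = np.einsum("k,kpd->pd", weights, prompts)
    return fused, weights

rng = np.random.default_rng(0)
K, P, D = 3, 8, 16
prompts = rng.normal(size=(K, P, D))
state = rng.normal(size=(K, D))
W = rng.normal(size=(D,))
fused, weights = dispatch(prompts, state, W)
```

Because the weights depend on the current state, the same dispatcher can route differently per task and per layer, which is exactly the property the routing visualizations in the paper exploit.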
Findings — Where the Gains Actually Appear
The authors test across 47 adaptation tasks:
- FGVC (fine-grained classification)
- VTAB-1K (Natural, Specialized, Structured)
- Medical imaging benchmarks (classification + segmentation)
- ADE20K segmentation
The pattern is consistent.
1. VTAB-1K (ImageNet-21K Supervised Backbone)
| Method | Natural | Specialized | Structured | Avg |
|---|---|---|---|---|
| LSPT | 85.26 | 88.57 | 66.25 | 77.95 |
| LSPT + pMoE | 87.18 | 90.25 | 69.32 | 80.31 |
Structured tasks see the largest relative boost.
This is not accidental.
Structured tasks often require combining semantic reasoning and low-level feature precision—precisely where multi-expert collaboration helps.
2. Fine-Grained Classification (FGVC)
Using DINO v2 pre-trained backbones:
| Dataset | LSPT (1.10X params) | LSPT + pMoE | Gain |
|---|---|---|---|
| CUB | 84.85 | 86.07 | +1.22 |
| Flowers | 95.57 | 96.58 | +1.01 |
| Cars | 80.63 | 81.12 | +0.49 |
The parameter increase is modest (1.10X total).
The performance gains are consistent.
This is the hallmark of structural improvement, not optimization luck.
3. Medical Classification
The effect becomes more pronounced in medical tasks:
Example (Color Medical Datasets):
| Task | LSPT | LSPT + pMoE |
|---|---|---|
| Kvasir Polyp | 71.53 | 75.68 |
| ISIC Skin | 54.37 | 56.25 |
Medical domains benefit disproportionately.
Why?
Because medical imaging requires domain priors that general vision encoders lack. Multi-expert prompts allow combining:
- Contrastive representation strength
- Domain-specific encoding
- Segmentation-friendly feature hierarchies
Without retraining everything.
4. Segmentation — Dense Prediction Still Improves
ADE20K (MAE backbone):
| Method | mIoU (multi-scale) |
|---|---|
| LSPT | 41.51 |
| LSPT + pMoE | 42.87 |
Medical segmentation (Kvasir, Skin):
| Dataset | LSPT | LSPT + pMoE |
|---|---|---|
| Kvasir-seg | 45.81 | 47.95 |
| Skin | 77.63 | 80.35 |
This demonstrates something subtle:
Mixture-of-Experts via prompts works not only for classification, but also for dense spatial tasks.
That matters for enterprise AI pipelines where segmentation and detection are often more valuable than classification.
Why This Works — Architectural Interpretation
pMoE solves three structural issues in prompt tuning.
1. Knowledge Silos
Single-model prompt tuning traps adaptation inside one representation regime.
pMoE enables cross-expert knowledge flow without merging backbones.
2. Token Underutilization
VPT-deep discards accumulated prompt tokens after each layer.
pMoE preserves and integrates them, improving inter-layer information flow.
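The contrast can be sketched as follows; identity functions stand in for the frozen transformer layers, and all sizes are illustrative assumptions.

```python
import numpy as np

def frozen_layer(tokens):
    # Stand-in for a frozen transformer layer; identity keeps the sketch minimal.
    return tokens

rng = np.random.default_rng(0)
L, P, N, D = 3, 4, 16, 8                   # layers, prompts/layer, patches, dim
patches = rng.normal(size=(N, D))
fresh = [rng.normal(size=(P, D)) for _ in range(L)]

# VPT-deep style: prompt outputs are discarded after every layer.
x = patches
for l in range(L):
    out = frozen_layer(np.concatenate([fresh[l], x]))
    x = out[P:]                            # keep patch positions only

# Accumulating style (the property pMoE builds on): prompt outputs survive.
x = patches
carried = np.empty((0, D))
for l in range(L):
    n_prompts = P + carried.shape[0]       # new + previously carried prompts
    out = frozen_layer(np.concatenate([fresh[l], carried, x]))
    carried, x = out[:n_prompts], out[n_prompts:]
```

After L layers the accumulating variant carries L × P prompt outputs forward, giving later layers access to earlier prompt states instead of throwing them away.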
3. Static Prompt Bias
Traditional prompt tuning assumes all tasks benefit from the same prompt structure.
pMoE routes different tasks through different expert paths.
The visualization in the paper shows distinct expert activation paths for:
- Natural tasks
- Specialized tasks
- Structured tasks
That dynamic routing is where the performance lift originates.
Enterprise Implications — Beyond the Benchmark Table
Let’s translate this into operational terms.
1. Modular Model Strategy
Instead of:
“Which backbone should we deploy?”
You can ask:
“Which combination of experts should we orchestrate?”
This aligns with enterprise AI architecture trends toward modular, composable systems.
2. Lower Re-Training Risk
You do not retrain or merge backbones.
You:
- Freeze experts
- Train prompt tokens
- Train a lightweight dispatcher
This reduces:
- Regulatory risk (important in medical applications)
- Drift instability
- Compliance overhead
3. Better ROI per Parameter
The parameter overhead is modest (often ~1.04X–1.10X total parameters).
Performance gains are consistent across domains.
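To make "ROI per parameter" concrete, here is a back-of-envelope budget under assumed sizes (ViT-B-scale backbones and a small per-layer dispatcher projection; none of these numbers come from the paper):

```python
# Rough trainable-parameter budget (all sizes are illustrative assumptions):
BACKBONE = 86_000_000        # one frozen ViT-B-scale expert
K, L, P, D = 3, 12, 8, 768   # experts, layers, prompt tokens/layer, embed dim

prompt_params = K * L * P * D     # Expert Prompt Tokens
dispatcher_params = K * L * D     # small per-layer projection (assumption)
trainable = prompt_params + dispatcher_params

print(f"trainable params: {trainable:,}")
print(f"vs one frozen backbone: {trainable / BACKBONE:.2%}")
```

Even with generous assumptions, the trainable footprint stays well under one percent of a single backbone, which is why freezing the experts keeps retraining risk and serving cost low.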
From a cost-performance lens:
| Approach | Compute Cost | Adaptability | Cross-Domain Robustness |
|---|---|---|---|
| Single Prompt Tuning | Low | Medium | Low |
| Multi-Backbone Ensemble | High | High | High |
| pMoE | Low–Medium | High | High |
pMoE approximates ensemble benefits without ensemble-level cost.
That is a non-trivial engineering achievement.
Limitations and Trade-Offs
No architectural improvement is free.
Observed constraints:
- Gains plateau beyond ~6–9 experts
- Increasing prompt layers yields diminishing returns beyond 9–12
- Complexity increases integration overhead
But these are manageable scaling laws, not fatal flaws.
Importantly, the dispatcher itself remains lightweight.
The FLOPs increase is modest relative to full model duplication.
Broader Trend — Prompts as Orchestrators
pMoE hints at a larger paradigm shift.
Prompts are no longer just “soft instructions.”
They are becoming:
- Routing controllers
- Knowledge selectors
- Cross-model communication channels
In LLM systems, we see similar patterns emerging in multi-agent orchestration.
In vision, pMoE demonstrates that the same orchestration logic applies.
The future of adaptation may not be bigger backbones.
It may be smarter coordination.
Conclusion — Specialists, Coordinated
pMoE does not reinvent vision transformers.
It rethinks how we adapt them.
Instead of asking one expert to stretch beyond its domain, it lets multiple experts collaborate—through prompts and a lightweight dispatcher.
The result is consistent improvement across:
- General classification
- Fine-grained tasks
- Medical imaging
- Segmentation
For enterprises building AI systems that must operate across heterogeneous visual domains, this is not an academic curiosity.
It is an architectural blueprint.
Because sometimes the smartest model is not the biggest one.
It is the one that knows which expert to ask.
Cognaptus: Automate the Present, Incubate the Future.