When it comes to prompting vision-language models, most methods rely on textual descriptions extracted from large language models like GPT. But those descriptions—“fluffy fur, friendly eyes, golden color”—are often verbose, ambiguous, or flat-out unreliable. What if we could skip that noisy middle step entirely?
That’s the premise behind DeMul (Description-free Multi-prompt Learning), a new method presented at ICLR 2025 that quietly delivers a major leap in few-shot image classification. Instead of generating descriptions for each class, DeMul directly distills the semantic knowledge of GPT embeddings into learnable prompt vectors. The result is simpler, more robust, and strikingly effective.
Why Description-Based Prompting Fails
In vision-language models like CLIP, prompts usually take the form of templates like:
“A photo of a {class_name}.”
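For concreteness, here is a minimal sketch of template-based zero-shot prompting using the Hugging Face CLIP implementation; the checkpoint and class names are illustrative choices, not part of the paper.

```python
# Template-based zero-shot prompting: one hand-written sentence per class.
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

class_names = ["golden retriever", "cheesecake", "helicopter"]
prompts = [f"A photo of a {name}." for name in class_names]

inputs = processor(text=prompts, return_tensors="pt", padding=True)
text_features = model.get_text_features(**inputs)  # one text embedding per class
```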
To enhance this, recent works query LLMs like GPT for richer class descriptions:
“What are the visual traits of a golden retriever?”
But these answers are fragile. They:
- Vary wildly by phrasing and temperature
- Include irrelevant or unverifiable traits (e.g., “loves to fetch”)
- Often contain qualifiers like “may,” “usually,” or “often”
Even more advanced description-based methods like WaffleCLIP or dCLIP can’t escape this variability.
The DeMul Approach: Skip the Words, Learn the Semantics
DeMul’s core idea is radical in its simplicity:
Don’t generate text—directly map learnable prompts into the GPT embedding space and optimize them to match class semantics.
Here’s how it works:
- Each class name (like “cheesecake” or “helicopter”) is embedded using OpenAI’s `text-embedding-3-large` (3072 dimensions).
- Each prompt is a vector, not a sentence, and is optimized to align with the GPT embedding.
- A mapping function $\phi$ connects the CLIP prompt space to the GPT embedding space. A frozen inverse $\psi$ preserves orientation.
This removes the noisy text generation step entirely. No prompt templates. No hand-tuned sentences. Just direct semantic alignment.
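As a rough illustration, here is a minimal PyTorch sketch of the idea, assuming a linear layer for the mapping $\phi$ and a cosine-similarity distillation target. The dimensions, initialization, and helper names (`phi`, `gpt_class_emb`, `distill_loss`) are assumptions for illustration, not the paper’s exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

clip_dim, gpt_dim = 512, 3072        # CLIP text width vs. text-embedding-3-large
num_classes, M = 100, 32             # M learnable prompt vectors per class

# Learnable prompt vectors living directly in CLIP's prompt space (no words)
prompt_vecs = nn.Parameter(0.02 * torch.randn(num_classes, M, clip_dim))

# phi maps prompts from CLIP space into the GPT embedding space for distillation
phi = nn.Linear(clip_dim, gpt_dim)

# GPT embeddings of the class names, precomputed once, e.g. via
# client.embeddings.create(model="text-embedding-3-large", input=class_names)
gpt_class_emb = torch.randn(num_classes, gpt_dim)  # placeholder values

def distill_loss(prompt_vecs, gpt_class_emb):
    """Pull each prompt's GPT-space projection toward its class-name embedding."""
    projected = phi(prompt_vecs)                          # (C, M, gpt_dim)
    target = gpt_class_emb.unsqueeze(1)                   # (C, 1, gpt_dim)
    cos = F.cosine_similarity(projected, target, dim=-1)  # (C, M)
    return (1.0 - cos).mean()
```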
Weighted Multi-Prompt Learning: Embrace Diversity, Then Prioritize
Rather than using a single prompt per class, DeMul introduces multiple prompts (M = 32) per class. These are not treated equally. Instead:
- Each prompt is given a learnable weight $w_{ij}$
- Weights are L1-regularized to encourage sparsity, so only a few prompts dominate per class
- Prompts are optimized for both classification accuracy in CLIP space and semantic alignment in GPT space
This leads to an elegant total loss:
$L_{\text{total}} = L_{\text{cls}} + \alpha L_{\text{distill}}$
where $L_{\text{cls}}$ drives classification accuracy in CLIP space and $L_{\text{distill}}$ keeps the prompts semantically aligned with their class-name embeddings in GPT space.
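Continuing the sketch above (it reuses `prompt_vecs`, `distill_loss`, and `gpt_class_emb`), here is one plausible way to wire the weights into the objective; the weighted aggregation of per-prompt similarities and the L1 coefficient are assumptions, not the paper’s exact formulation.

```python
# Learnable per-class, per-prompt weights w_ij (uniform initialization assumed)
prompt_weights = nn.Parameter(torch.full((num_classes, M), 1.0 / M))

def weighted_logits(image_features, prompt_vecs, prompt_weights, scale=100.0):
    """Class score = weighted sum of the M per-prompt cosine similarities."""
    img = F.normalize(image_features, dim=-1)            # (B, clip_dim)
    txt = F.normalize(prompt_vecs, dim=-1)               # (C, M, clip_dim)
    sims = torch.einsum("bd,cmd->bcm", img, txt)         # (B, C, M)
    return scale * (sims * prompt_weights).sum(dim=-1)   # (B, C)

def total_loss(image_features, labels, alpha=1.0, l1_lambda=1e-3):
    """L_total = L_cls + alpha * L_distill, plus L1 sparsity on the weights."""
    logits = weighted_logits(image_features, prompt_vecs, prompt_weights)
    l_cls = F.cross_entropy(logits, labels)              # classification in CLIP space
    l_dis = distill_loss(prompt_vecs, gpt_class_emb)     # alignment in GPT space
    l1 = l1_lambda * prompt_weights.abs().sum()          # few dominant prompts per class
    return l_cls + alpha * l_dis + l1
```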
Results: Simpler Prompts, Stronger Performance
On 11 benchmarks—from ImageNet to Food101—DeMul consistently outperformed zero-shot CLIP, description-based methods (dCLIP, WaffleCLIP), and even the best continuous prompt methods (CoOp, GalLoP).
| Method | Uses Descriptions | Learnable Prompts | Prompt Weights | 16-Shot Avg Accuracy |
|---|---|---|---|---|
| CLIP | ❌ | ❌ | ❌ | 78.8% |
| dCLIP | ✅ | ❌ | ❌ | 66.5% |
| CoOp | ❌ | ✅ | ❌ | 79.5% |
| GalLoP | ❌ | ✅ | Partial | 84.4% |
| DeMul | ❌ | ✅ | ✅ | 85.3% |
Even at just 1-shot, DeMul improves over GalLoP by +1.8%. More importantly, it does so with fewer assumptions, no hand-crafted text, and greater robustness.
Implications for Cognaptus and Beyond
What excites us at Cognaptus is not just the performance bump, but the architectural elegance:
- No reliance on language generation → Faster, more deterministic workflows.
- LLM distillation via embeddings → A generalizable framework for non-visual tasks (think document tags, chatbot intents, anomaly types).
- Learnable weighting of multiple perspectives → A smart way to aggregate diverse views without manual selection.
We’re already exploring how similar techniques could be applied to text-based process automation, such as invoice classification or intent routing. Embedding-level distillation might just be the key to LLM-powered enterprise tools that are robust by design.
Cognaptus: Automate the Present, Incubate the Future.