Prompt Without Words: Distilling GPT Semantics for Smarter Vision Models

TL;DR for operators

Most attempts to improve CLIP-style image classification with large language models follow a familiar ritual: ask GPT to describe a class, paste those descriptions into prompts, then hope the model pays attention to the useful bits. The problem is that GPT’s descriptions are not stable objects. They vary by query wording, include hedged statements, and sometimes contain features that are hard or impossible to verify visually. “Usually,” “may,” and “often” are not exactly the foundations of a disciplined recognition system.

The paper behind DeMul, short for Description-free Multi-prompt Learning, makes a cleaner proposal: stop asking GPT to write descriptions, and instead distil GPT’s semantic structure directly into learnable prompt vectors for CLIP-style vision-language models.¹ This is not prompt engineering with better adjectives. It is prompt learning with the sentence-shaped middleman removed.

Operationally, the idea matters because many enterprise vision tasks are few-shot by default. A team may have a handful of labelled examples for specialised product defects, food categories, retail shelf states, satellite land-use classes, or equipment conditions. DeMul suggests that GPT embeddings can provide a semantic prior without forcing engineers to manage unstable generated descriptions.

The evidence is respectable but bounded. Using frozen CLIP with a ViT-B/16 backbone, DeMul is tested on 11 image-recognition datasets in 1-, 2-, 4-, 8-, and 16-shot settings. It reports average top-1 accuracy of 75.5% in 1-shot and 85.3% in 16-shot, compared with GalLoP’s 73.7% and 84.4%. That is a meaningful improvement, not a miracle. The paper shows a better mechanism for few-shot classification, not a general solution for all visual AI deployments. Nobody gets to declare victory over industrial inspection because ImageNet smiled politely.

The weak link is not semantics. It is asking semantics to arrive as prose.

The easiest way to misunderstand this paper is to treat it as another entry in the long, crowded theatre of prompt engineering. It is not. DeMul is built around a sharper diagnosis: language models may contain useful class-level knowledge, but generated descriptions are a poor interface for transferring that knowledge into a vision-language classifier.

Consider the ordinary description-based workflow. A system asks an LLM something like: “What are useful features for distinguishing a golden retriever?” The model returns a list of plausible visual traits. Some are useful. Some are generic. Some are hedged. Some drift into facts that may be true in the world but are not necessarily visible in the image. The appendix examples in the paper make the point nicely. For “French toast,” a generated description includes being “fried in a pan.” That may be culinary information, but a classifier looking at a plated image cannot reliably observe the cooking process unless the pan had the courtesy to remain in frame.

This is the central irritant. The generated sentence is doing too many jobs at once. It is a semantic hint, a linguistic artefact, a product of a particular query, and a noisy input to another model. If performance depends on that sentence being both visually relevant and consistently generated, the pipeline inherits a surprisingly human problem: saying the right thing, in the right wording, at the right level of specificity, every time. Machines were meant to save us from committee writing, not automate it.

DeMul keeps the useful part: GPT’s semantic organisation of class names. It removes the fragile part: GPT’s verbal descriptions of those classes.

DeMul transfers GPT knowledge without making GPT talk

The mechanism is easier to understand if we separate three spaces.

First, there is the CLIP space, where image embeddings and text prompt embeddings are compared for classification. CLIP expects something text-like on one side and image-like on the other, then classifies by similarity.

Second, there is the GPT embedding space. Instead of asking GPT to complete text, DeMul uses OpenAI embedding models to represent class names as vectors. These embeddings are treated as a semantic map: not a list of descriptions, but a geometry of meaning.

Third, there is the learnable prompt space. Like earlier prompt-learning methods such as CoOp, DeMul uses continuous prompt vectors rather than fixed natural-language templates. These prompts are not hand-written phrases. They are optimised parameters.
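To make the first of these spaces concrete, here is a minimal sketch of similarity-based classification in the CLIP style. The three-dimensional vectors, the class names, and the `temperature` value are toy assumptions for illustration; real CLIP embeddings come from the frozen encoders and are far larger.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify(image_emb, class_text_embs, temperature=0.01):
    """Return (predicted class, probabilities) from cosine-similarity logits."""
    names = list(class_text_embs)
    logits = [cosine(image_emb, class_text_embs[n]) / temperature for n in names]
    top = max(logits)  # subtract the max for a numerically stable softmax
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    probs = {n: e / total for n, e in zip(names, exps)}
    return max(probs, key=probs.get), probs


# Hypothetical toy embeddings, not real CLIP outputs.
text_embs = {
    "cheesecake": [0.9, 0.1, 0.0],
    "french toast": [0.1, 0.9, 0.2],
}
image = [0.8, 0.2, 0.1]

pred, probs = classify(image, text_embs)
print(pred, probs)
```

In prompt learning, the entries of `text_embs` are not produced by hand-written templates; they are the text encoder's output for learned prompt vectors.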

The trick is alignment. DeMul learns a mapping from CLIP prompt embeddings into the GPT embedding space, then trains prompts so that their mapped representations align with the GPT embedding of the corresponding class name. In simpler terms: the prompt does not need to say “cheesecake is creamy, round, and often topped with fruit.” It needs to move into a semantic position that GPT’s embedding space already associates with “cheesecake.”

That distinction is the paper’s core contribution. Description-based methods use GPT as a writer. DeMul uses GPT as a semantic coordinate system.
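A rough sketch of the alignment idea, under stated assumptions: the mapping is taken to be a single linear layer `W`, and the alignment loss is taken to be one minus cosine similarity. This article does not spell out the paper's exact mapping or loss, so both choices, along with the toy vectors, are illustrative rather than a reproduction of DeMul.

```python
import math


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def distill_loss(prompt_emb, W, gpt_anchor):
    """Assumed loss: 1 - cos(W @ prompt_emb, GPT embedding of the class name)."""
    mapped = matvec(W, prompt_emb)  # CLIP space -> GPT space
    return 1.0 - cosine(mapped, gpt_anchor)


# Hypothetical toy setup: CLIP space is 2-d, GPT space is 3-d.
W = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
gpt_cheesecake = [1.0, 0.0, 1.0]  # stand-in for the class-name embedding

loss_aligned = distill_loss([1.0, 0.0], W, gpt_cheesecake)
loss_off = distill_loss([0.0, 1.0], W, gpt_cheesecake)
print(loss_aligned, loss_off)
```

Training would push each prompt's embedding toward the region where this loss is small, without any generated sentence in the loop.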

This also clarifies what “description-free” does and does not mean. DeMul is not language-free. It still uses class names. It still relies on GPT embeddings. It still works in a vision-language framework. What it avoids is generated class-description text. The difference sounds small until one has debugged a prompt library where the model alternates between visual features, lifestyle stereotypes, and oddly confident nonsense. Then it sounds like basic hygiene.

Multiple prompts are useful only if they are not all treated as equally wise

The second mechanism in DeMul is weighted multi-prompt learning.

A single prompt is rarely enough to represent a class cleanly. A class can contain multiple visual modes: cars across angles, food across presentations, flowers across colours, pets across breeds and poses. Multi-prompt learning responds by assigning several prompt vectors to each class. DeMul uses 32 prompts per class in its main experimental setup.

But simply averaging multiple prompts is crude. Some prompts will become more useful than others during training. Some may capture discriminative semantics; others may wander into weaker regions. Treating all of them equally is democratic, certainly, but democracy is not always the best scoring rule for vector representations.

DeMul therefore gives each prompt a learnable weight. The classifier can emphasise prompts that help and reduce the influence of prompts that do not. The paper also applies L1 regularisation to encourage sparsity, meaning the system is nudged toward relying on fewer, stronger prompts rather than spreading weight evenly across the whole crowd.

This is important because it moves multi-prompt learning from “more prompts must be better” to “different prompts should earn their keep.” That is a healthier design principle. In applied AI, ensembles, retrieval candidates, and agent outputs all face the same problem: variety is useful until it becomes clutter. Weighting is the difference between a panel of experts and a room full of people speaking at once.
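The weighting idea can be sketched in a few lines. The similarity values and weights below are made up, and the soft-thresholding step is only a standard way to show how an L1 penalty pushes small weights to exactly zero; it is an assumption for illustration, not necessarily the optimiser the paper uses.

```python
import math


def weighted_class_score(sims, weights):
    """Combine per-prompt image-text similarities using per-prompt weights."""
    return sum(w * s for w, s in zip(weights, sims))


def l1_penalty(weights, lam):
    """L1 regularisation term added to the training loss."""
    return lam * sum(abs(w) for w in weights)


def soft_threshold(weights, threshold):
    """Proximal step for L1: shrink every weight toward zero, clipping at zero."""
    return [math.copysign(max(abs(w) - threshold, 0.0), w) for w in weights]


# Hypothetical similarities between one image and four prompts of one class.
sims = [0.9, 0.2, 0.5, 0.1]

uniform_score = weighted_class_score(sims, [0.25] * 4)         # plain averaging
learned = [0.7, 0.0, 0.3, 0.0]                                  # assumed learned weights
learned_score = weighted_class_score(sims, learned)

sparse = soft_threshold([0.7, 0.05, 0.3, -0.02], threshold=0.1)
print(uniform_score, learned_score, l1_penalty(learned, lam=0.1), sparse)
```

With uniform weights, the two weak prompts drag the class score down; with learned, sparse weights, the score is dominated by the prompts that actually match the image.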

What the experiments actually show

The paper’s experiments should be read in layers. Not every table is doing the same job, and not every result supports the same claim.

| Evidence item | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Main benchmark across 11 datasets | Main evidence and comparison with prior work | DeMul improves average few-shot top-1 accuracy over zero-shot CLIP, description-based methods, and several continuous prompt-learning baselines | That DeMul will outperform in production vision systems, detection, segmentation, or open-ended multimodal tasks |
| Ablation on distillation and weighting | Ablation | Both GPT embedding-space distillation and prompt weighting contribute, with the combined method performing best | That each component is independently decisive in every dataset or deployment |
| Different GPT embedding models | Robustness / sensitivity test | Stronger embedding models appear to correlate with better DeMul performance | That any embedding model can be swapped in without engineering cost or distribution effects |
| UMAP mapping visualisation | Implementation diagnostic / explanatory analysis | The mapping strategy helps preserve class-clustering structure when moving between CLIP and GPT spaces | That the learned geometry is universally interpretable or causally sufficient |
| Prompt-weight and similarity analysis | Exploratory mechanism check | Prompt weights tend to track semantic similarity, supporting the idea that weights reflect prompt usefulness | That weights are a reliable explanation of model decisions in high-stakes settings |
| Appendix description examples | Motivation and failure-mode illustration | Generated descriptions can be variable, hedged, non-visual, or biased | That all description-based methods fail equally, or that descriptions are never useful |

The main numerical story is straightforward. Across 11 datasets, DeMul reports average top-1 accuracy of 75.5% in the 1-shot setting, rising to 85.3% in 16-shot. GalLoP, the strongest multi-prompt baseline compared in the paper, reports 73.7% and 84.4% in those same settings. Against zero-shot CLIP, the reported average improvement is much larger: DeMul’s 1-shot result is 10.5 points above CLIP’s 65.0%, and its 16-shot result is 20.3 points above that same zero-shot baseline.

The more interesting comparison is not against vanilla CLIP, though. Beating zero-shot CLIP with few-shot prompt learning is useful, but hardly shocking. The sharper comparison is against methods that already optimise prompts or use LLM-generated descriptions. DeMul’s advantage over GalLoP is modest but consistent in the reported averages: +1.8 points in 1-shot and +0.9 points in 16-shot. That is not fireworks. It is the kind of incremental gain that matters when it comes with a cleaner operating mechanism.
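The gaps quoted above are plain subtractions over the reported averages, which is worth making explicit because percentage points are easy to confuse with relative improvements. The snippet below uses only the figures already cited in this article; the dictionary layout and the `gain` helper are just a convenience for the example.

```python
# Average top-1 accuracy (%) as quoted in this article.
# Zero-shot CLIP uses no training images, so the same value serves as the
# baseline in both columns.
reported = {
    "CLIP (zero-shot)": {1: 65.0, 16: 65.0},
    "GalLoP": {1: 73.7, 16: 84.4},
    "DeMul": {1: 75.5, 16: 85.3},
}


def gain(method, baseline, shots):
    """Difference in percentage points, rounded to one decimal place."""
    return round(reported[method][shots] - reported[baseline][shots], 1)


for baseline in ("CLIP (zero-shot)", "GalLoP"):
    for shots in (1, 16):
        delta = gain("DeMul", baseline, shots)
        print(f"DeMul vs {baseline}, {shots}-shot: +{delta} points")
```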

The ablation table reinforces this interpretation. Removing the distillation component or removing prompt weighting reduces performance relative to the full method, though not dramatically. In 16-shot average accuracy, the version without the distillation component reports 84.8%, the version without weighting reports 85.2%, and the full method reports 85.3%. In 1-shot, the full method reaches 75.5%, versus 75.0% without the distillation component and 75.2% without weighting.

That pattern should calm down anyone preparing a victory parade. The paper’s novelty is not that either component single-handedly transforms visual recognition. It is that the combination gives a disciplined way to use LLM semantics without depending on generated text, and it edges out strong baselines under a controlled benchmark setup.

The embedding-model test is a sensitivity check, not a second thesis

The paper also tests different OpenAI embedding models in the 16-shot setting. In that table, which covers ten listed datasets rather than the eleven in the main benchmark, the reported average accuracy rises from 82.5% with text-embedding-ada-002, to 83.4% with text-embedding-3-small, to 84.4% with text-embedding-3-large. The authors interpret this as a positive correlation between embedding-model capability and classification performance.

That is a plausible reading. It also matters operationally. If DeMul-style systems depend on the quality of the teacher embedding space, then embedding-model choice becomes part of the performance budget. Better semantic priors may improve downstream adaptation, but they also create dependency on a specific embedding provider, embedding dimension, API behaviour, and future model changes. In enterprise settings, this is not merely a model-selection detail. It is vendor risk wearing a lab coat.
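One concrete form of that dependency is dimensional: whatever maps CLIP prompt embeddings into the teacher space has to match the teacher's output width. The sketch below assumes, purely for illustration, a single linear mapping layer, and uses the default output dimensions I understand these OpenAI models to have (worth checking against current documentation); 512 is the text-embedding width of CLIP ViT-B/16. The helper names are hypothetical.

```python
CLIP_TEXT_DIM = 512  # text-embedding width of CLIP ViT-B/16

# Default output dimensions of the teacher embedding models (assumed; verify).
TEACHER_DIMS = {
    "text-embedding-ada-002": 1536,
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}


def linear_mapping_shape(teacher_model):
    """Shape (rows, cols) of an assumed single linear map from CLIP to GPT space."""
    return (TEACHER_DIMS[teacher_model], CLIP_TEXT_DIM)


def linear_mapping_params(teacher_model, bias=True):
    """Parameter count of that assumed linear map."""
    rows, cols = linear_mapping_shape(teacher_model)
    return rows * cols + (rows if bias else 0)


for model in TEACHER_DIMS:
    print(model, linear_mapping_shape(model), linear_mapping_params(model))
```

Swapping the teacher therefore means at least re-creating and re-training the mapping, and re-embedding every class name, which is part of what makes the choice an engineering decision rather than a drop-in setting.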

The result does not mean that a larger embedding model will always be worth the cost. The paper does not run a full cost-performance analysis, nor does it test a wide zoo of proprietary and open embedding models under production constraints. The finding is best read as sensitivity evidence: the quality of the semantic teacher matters.

The visual analyses explain plausibility, not causality

The mapping-space and prompt-weight analyses are useful because they tell us whether the proposed mechanism behaves as intended.

The UMAP visualisation focuses on whether prompts retain meaningful class structure when mapped between CLIP and GPT embedding spaces. The paper reports that freezing part of the mapping and fine-tuning with the proposed loss better preserves clustering tendencies in the GPT space. This is not the main evidence for performance. It is a diagnostic: the authors are checking whether the bridge between spaces is structurally sane.

The prompt-weight analysis does something similar. During training on Food101, the paper tracks prompt weights and similarities between prompts and class names. Higher similarities generally correlate with higher weights. This supports the claim that the weighting mechanism is not arbitrary; prompts closer to class semantics tend to matter more.

Still, “tends to correlate” is not the same as “fully explains the decision.” A prompt weight is a learned parameter, not an audit trail. For business use, it may help debugging. It should not be sold as interpretability. The difference is small only to people who have never had to explain a failed automated decision to a regulator, a client, or a very patient legal team.
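For teams that want to run the same kind of check on their own prompts, a plain Pearson correlation between per-prompt weights and prompt-to-class-name similarities is a reasonable starting diagnostic. The numbers below are invented for illustration and are not taken from the paper's Food101 analysis.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for five prompts of a single class.
similarities = [0.62, 0.55, 0.71, 0.40, 0.48]   # prompt vs class-name embedding
weights = [0.20, 0.10, 0.45, 0.00, 0.05]        # learned prompt weights

r = pearson(similarities, weights)
print(f"correlation between similarity and weight: {r:.3f}")
```

A high value says the weights behave sensibly on average. It says nothing about why a particular image was classified a particular way.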

The business value is less prompt fragility, not magical vision

For operators, DeMul’s practical relevance sits in a specific class of problem: adapting a strong vision-language model to a specialised classification task with limited labelled data.

That covers many real workflows. Retail teams may need to classify shelf states or product variants. Food platforms may need to recognise long-tail dish categories. Manufacturing teams may need early prototypes for defect categories before enough labelled examples exist for a full supervised model. Remote-sensing teams may need land-use classification under data scarcity. In these settings, the cost is not only model training. It is also the human effort required to describe categories, refine prompts, test variants, and discover that half the descriptions were charming but useless.

DeMul points toward a better workflow:

1. Start with a frozen vision-language model such as CLIP.
2. Use a small number of labelled images per class.
3. Use class-name embeddings from a strong LLM embedding model as semantic anchors.
4. Learn continuous prompts that align with those anchors.
5. Let multiple prompts compete through learned weights.
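The workflow above implies a training objective with three ingredients: a classification loss on the few-shot images, an alignment term toward the class-name anchors, and a sparsity term on the prompt weights. The sketch below shows one plausible way to combine them. The additive form, the averaging of alignment losses, and the coefficients `alpha` and `lam` are assumptions for illustration; this article does not state the paper's exact formulation.

```python
import math


def cross_entropy(logits, target_index):
    """Softmax cross-entropy for one example, computed stably."""
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_sum_exp - logits[target_index]


def total_loss(logits, target_index, distill_losses, prompt_weights,
               alpha=1.0, lam=0.01):
    """Assumed objective: classification + alpha * alignment + lam * L1(weights)."""
    classification = cross_entropy(logits, target_index)
    alignment = sum(distill_losses) / len(distill_losses)
    sparsity = sum(abs(w) for w in prompt_weights)
    return classification + alpha * alignment + lam * sparsity


# Hypothetical values for one training image and two prompts.
loss = total_loss(
    logits=[2.0, 0.0],            # class scores from weighted prompt similarities
    target_index=0,               # index of the true class
    distill_losses=[0.1, 0.3],    # per-prompt alignment losses
    prompt_weights=[0.7, 0.3],    # learnable prompt weights
)
print(round(loss, 6))
```

Only the prompt vectors, the prompt weights, and whatever mapping is used would receive gradients; the CLIP encoders stay frozen.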

The inferred business advantage is not “no labels needed.” DeMul is evaluated in few-shot settings, not as a label-free production system. The advantage is better leverage from scarce labels, fewer brittle hand-written prompts, and less dependence on LLM-generated prose. That can reduce iteration cost in early-stage model adaptation.

This is also where the method is most strategically interesting. A lot of enterprise AI work fails not because the base model is weak, but because the interface between business categories and model representations is messy. Humans write labels and descriptions. Models consume embeddings and gradients. DeMul is one more sign that the future of “prompting” may be less about clever phrasing and more about controlled representation transfer.

A regrettable development for prompt gurus, perhaps. A useful one for everyone else.

Where the result should not be overextended

The paper’s boundary conditions are clear enough if we do not try to inflate them.

First, the experiments focus on image classification. DeMul is not evaluated for object detection, segmentation, visual question answering, image retrieval, robotic perception, or multimodal agents. The fact that CLIP-style embeddings are useful across tasks does not automatically move the evidence across tasks.

Second, the setup keeps the pre-trained CLIP image and text encoders frozen. This is a feature for efficiency and comparability, but it also defines the operating regime. The paper is about prompt learning around a frozen vision-language model, not full fine-tuning of a visual foundation model.

Third, the method still depends on few-shot training images. The authors explicitly note that performance can vary with the training-image distribution. This matters for businesses because few-shot samples are often not representative. If the first labelled examples come from one factory line, one lighting condition, one geography, or one customer segment, the learned prompts may inherit that narrowness.
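A cheap sanity check before training is to count how many distinct sources each class's few-shot examples come from. The sketch below is a hypothetical diagnostic, not something the paper describes; the `label` and `source` fields and the example values are assumptions about how a team might tag its data.

```python
from collections import defaultdict


def source_coverage(samples):
    """Map each class label to the set of sources its examples come from."""
    coverage = defaultdict(set)
    for sample in samples:
        coverage[sample["label"]].add(sample["source"])
    return dict(coverage)


def narrow_classes(samples, min_sources=2):
    """Return labels whose few-shot examples span fewer than min_sources sources."""
    coverage = source_coverage(samples)
    return sorted(label for label, sources in coverage.items()
                  if len(sources) < min_sources)


# Hypothetical 2-shot set for two defect classes.
few_shot = [
    {"label": "scratch", "source": "line_A"},
    {"label": "scratch", "source": "line_A"},
    {"label": "dent", "source": "line_A"},
    {"label": "dent", "source": "line_B"},
]

flagged = narrow_classes(few_shot)
print(flagged)
```

A flagged class is not necessarily a problem, but it is a prompt to ask whether the handful of examples reflects the conditions the model will see in production.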

Fourth, the authors note a memory-related limitation: learnable prompt vectors are kept identical across classes as an extension of prior work, which can hinder class-specific prompt development. That is not a minor technical footnote. For specialised domains, class-specific nuance is often the whole game. A defect category, a medical image label, or a product variant may require distinctions that generic shared prompt vectors struggle to encode.

Finally, the paper does not eliminate the need for evaluation discipline. It reduces one source of prompt instability. It does not solve dataset shift, class ambiguity, label noise, visual confounding, or the delightful habit of production images arriving nothing like benchmark images.

A better interface between language knowledge and visual adaptation

The useful idea in DeMul is not that words are bad. The useful idea is that generated words are often the wrong transport layer for semantic knowledge.

Descriptions are attractive because they are readable. Engineers can inspect them. Managers can understand them. They make demos feel reassuringly human. But readability is not the same as reliability. A generated description can sound precise while being irrelevant, non-visual, biased, or simply unstable across runs.

DeMul takes a less theatrical route. It treats GPT embeddings as a semantic teacher and trains CLIP prompts to align with that teacher while still optimising for classification. It then lets multiple prompts contribute unequally, because semantic diversity is useful only when the model can decide which parts matter.

The result is a paper with a modest numerical gain and a strong design lesson. For enterprise AI, that is often the better combination. Big benchmark jumps invite exaggeration. Clean mechanisms invite reuse.

DeMul does not make vision-language systems magically robust. It does something more believable: it removes a noisy prose bottleneck from the adaptation pipeline. Sometimes progress is not a louder model. Sometimes it is the quiet deletion of an unnecessary sentence.

Cognaptus: Automate the Present, Incubate the Future.

¹ Sua Lee, Kyubum Shin, and Jung Ho Park, “Weighted Multi-Prompt Learning with Description-free Large Language Model Distillation,” arXiv:2507.07147, 2025, https://arxiv.org/abs/2507.07147.
