Long prompts are easy to understand. They are also expensive, slow, and—in multimodal systems—very quickly ridiculous.
That is the practical tension behind many-shot multimodal in-context learning. In principle, giving a vision-language model more examples should help it recognise the task. In practice, every image costs tokens, every additional demonstration adds latency, and open-source large multimodal models do not generally enjoy infinite context windows. The business version of the problem is familiar: you want a model to adapt to a specialised workflow, but you do not want to fine-tune it every week, pay for swollen prompts forever, or discover that the “cheap” approach now requires a larger GPU.
The paper “Where and What Matters: Sensitivity-Aware Task Vectors for Many-Shot Multimodal In-Context Learning” proposes a sharper alternative.[1] Instead of placing many examples in the prompt, it compresses them into activation-level interventions. The method, called Sensitivity-aware Task Vector insertion, or STV, asks two questions that sound almost too obvious: where in the model should the task information be inserted, and what should be inserted there?
The small twist is that previous task-vector methods often answered only one of those questions properly. Some estimated a vector and inserted it into predefined locations. Others searched for locations while using a fixed or averaged vector. STV treats both choices as first-class. That is the paper’s real contribution. The benchmark gains matter, but the mechanism matters more.
The misconception: many-shot multimodal ICL is not just a longer-prompt problem
A natural reading of many-shot ICL is that the model needs more examples in its input. Add more images, add more labels, add more demonstrations, and the model should infer the pattern. This view is not wrong. It is merely operationally lazy, which is worse because it can pass a quarterly review.
The paper points out two concrete constraints. Qwen-VL-7B has an 8,192-token context window, and each image can consume up to 256 tokens. Idefics2-8B also has an 8,192-token context window, with 64 tokens per image. Multimodal examples are therefore not lightweight prompt decorations. They are expensive context occupants.
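A rough sketch of that budget, using only the figures quoted above. The function name is hypothetical, and the calculation ignores question, label, and template tokens, so the real ceiling on demonstrations would be lower than these upper bounds.

```python
def max_images(context_tokens: int, tokens_per_image: int) -> int:
    """Upper bound on images that fit if the prompt held nothing but images."""
    return context_tokens // tokens_per_image


# Figures quoted in the text; text tokens are deliberately ignored here.
qwen_limit = max_images(8192, 256)
idefics_limit = max_images(8192, 64)

print(f"Qwen-VL-7B: at most {qwen_limit} images")
print(f"Idefics2-8B: at most {idefics_limit} images")
```

Even under this generous assumption, a many-shot prompt with hundreds of image demonstrations does not fit in either window.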
Task-vector methods change the framing. Instead of keeping the examples as visible prompt tokens, they extract task information from model activations and insert compact vectors into the model during inference. The many-shot signal becomes an internal intervention rather than a long input sequence.
That reframing creates a different engineering problem:
| Prompt-based view | STV’s view |
|---|---|
| The model needs more examples in the context window. | The model needs the right task signal inserted into the right internal locations. |
| Scaling means longer prompts. | Scaling means better activation compression and placement. |
| Cost grows with demonstrations at inference. | Most work shifts into locating and selecting task vectors before inference. |
| The context window is the bottleneck. | The bottleneck is knowing which representations actually respond to context. |
This is the useful mental shift. The paper is not saying examples are irrelevant. It is saying their value does not have to remain trapped inside the prompt.
STV starts by finding the heads that actually notice context
The first stage of STV is location selection. The authors compare the model’s attention-head activations when a query is processed alone versus when it is processed with in-context demonstrations. The difference is called an activation delta.
Conceptually, this is a diagnostic test. If adding demonstrations changes a particular attention head’s activation substantially and consistently, that head is treated as sensitive to contextual task information. If the head barely reacts, it is probably not a useful insertion site.
The paper formalises this by computing the L2 difference between query-only and query-with-context activations for each attention head, averaging the result across sampled query-context pairs, and selecting the top-$K$ heads with the largest sensitivity scores.
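The scoring step can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, array layout, and toy sizes are hypothetical, and the activations are random stand-ins for values that would come from real forward passes.

```python
import numpy as np


def sensitivity_scores(query_only: np.ndarray, with_context: np.ndarray) -> np.ndarray:
    """Mean L2 activation delta per attention head.

    Both inputs have shape (num_pairs, num_layers, num_heads, head_dim).
    """
    deltas = np.linalg.norm(with_context - query_only, axis=-1)  # (pairs, layers, heads)
    return deltas.mean(axis=0)  # (layers, heads)


def top_k_heads(scores: np.ndarray, k: int) -> list[tuple[int, int]]:
    """Return (layer, head) pairs for the k largest scores, largest first."""
    flat = np.argsort(scores, axis=None)[::-1][:k]
    layers, heads = np.unravel_index(flat, scores.shape)
    return list(zip(layers.tolist(), heads.tolist()))


# Toy stand-in data: 100 query-context pairs, 4 layers, 8 heads, 16-dim heads.
rng = np.random.default_rng(0)
query_only = rng.normal(size=(100, 4, 8, 16))
with_context = query_only + 0.01 * rng.normal(size=query_only.shape)
with_context[:, 2, 5, :] += 1.0  # make one head react strongly to context

scores = sensitivity_scores(query_only, with_context)
selected = top_k_heads(scores, k=3)
print(selected)
```

In the toy data only one head is made to respond to context, so it should rank first; the remaining picks are noise.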
The important observation is not merely that some heads move. It is that the sensitive locations show consistent structure within the same task and model, while differing across tasks and model families. In the paper’s Figure 2, sensitivity patterns are shown across VizWiz and OK-VQA, across Qwen-VL-7B and Idefics2-8B, and across sample sizes of 100 and 500. The black-boxed sensitive heads do not appear as random confetti. They form task- and model-dependent patterns.
That matters because it replaces blind location search with a model-informed shortcut. STV does not need to ask every part of the model, “Would you like to be modified today?” It first identifies the heads that already reveal contextual sensitivity.
This is also where the business implication begins. Many applied teams already treat prompting and fine-tuning as the two main adaptation choices. STV suggests a third category: activation-level adaptation, where the model’s internal response to examples determines where the adaptation should occur.
The catch, naturally, is that this requires access to internal activations. If your model is only available through a hosted black-box API, STV is not something you can simply paste into a system prompt and call innovation.
The second stage chooses the vector, not just the location
Finding sensitive heads is only half the problem. Once STV knows where to intervene, it still needs to decide what activation value to insert.
Earlier task-vector methods often compress demonstrations using PCA, fixed vectors, or mean activations. That reduces cost, but it can flatten useful task-specific detail. Averaging many examples into one vector is tidy. It is also a good way to turn nuance into soup.
STV instead builds a pre-clustered activation bank for each selected location. For each sensitive attention head, the method collects context-enhanced activation values from many-shot forward passes and clusters them. Each cluster centre becomes a candidate task vector for that location.
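A minimal sketch of such a bank follows. The text above says only that the activations are clustered; plain k-means is an assumption made here for illustration, as are the function name, the bank layout (a dictionary keyed by layer and head), and the random stand-in activations.

```python
import numpy as np


def kmeans_centres(x: np.ndarray, n_clusters: int, n_iters: int = 20, seed: int = 0) -> np.ndarray:
    """Plain Lloyd's k-means; returns centres of shape (n_clusters, dim)."""
    rng = np.random.default_rng(seed)
    # Fancy indexing copies, so updating centres never modifies x.
    centres = x[rng.choice(len(x), size=n_clusters, replace=False)]
    for _ in range(n_iters):
        # Assign each activation to its nearest centre.
        dists = np.linalg.norm(x[:, None, :] - centres[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Move each centre to the mean of its members; empty clusters stay put.
        for c in range(n_clusters):
            members = x[labels == c]
            if len(members) > 0:
                centres[c] = members.mean(axis=0)
    return centres


# Hypothetical context-enhanced activations for two selected heads.
rng = np.random.default_rng(1)
activations = {
    (2, 5): rng.normal(size=(200, 16)),
    (3, 1): rng.normal(size=(200, 16)),
}

# One set of candidate task vectors per selected head.
bank = {head: kmeans_centres(acts, n_clusters=8) for head, acts in activations.items()}
print({head: centres.shape for head, centres in bank.items()})
```

The point of the structure is that each head keeps several candidates rather than one averaged vector, which is what leaves something for the selection stage to choose between.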
Then STV treats vector selection as a discrete optimisation problem. For each selected attention head, it learns a categorical distribution over the candidate cluster centres. It samples candidate vectors, inserts them, evaluates task loss, converts that into a reward, and updates the distribution using REINFORCE. After training, it selects the highest-probability cluster centre for each location.
The mechanism is worth spelling out because it explains why the method is more than “task vectors, but with a nicer table.”
STV separates the adaptation problem into three acts:
- Measure sensitivity: identify attention heads whose activations change when demonstrations are present.
- Build candidate vectors: cluster context-enhanced activations for each sensitive location.
- Select per-location vectors: use reinforcement learning to choose the best candidate for each insertion site.
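The third act can be sketched as a small REINFORCE loop over per-head categorical distributions. Everything specific here is an assumption for illustration: the reward is taken to be the negative loss, a running-mean baseline is added to reduce variance, and `toy_loss` is a hypothetical stand-in for the real step of inserting the chosen cluster centres and measuring task loss.

```python
from typing import Callable

import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def reinforce_select(
    task_loss: Callable[[np.ndarray], float],
    n_heads: int,
    n_candidates: int,
    steps: int = 3000,
    lr: float = 0.3,
    seed: int = 0,
) -> np.ndarray:
    """Learn one categorical distribution per head; return the argmax candidate per head."""
    rng = np.random.default_rng(seed)
    logits = np.zeros((n_heads, n_candidates))
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(logits)
        # Sample one candidate index per head.
        choice = np.array([rng.choice(n_candidates, p=p) for p in probs])
        reward = -task_loss(choice)  # assumption: reward is the negative loss
        advantage = reward - baseline
        baseline = 0.9 * baseline + 0.1 * reward
        # Gradient of log-probability w.r.t. logits: one_hot(choice) - probs.
        grad = -probs
        grad[np.arange(n_heads), choice] += 1.0
        logits += lr * advantage * grad
    return softmax(logits).argmax(axis=1)


# Toy loss: fraction of heads whose choice misses a hidden "best" candidate.
target = np.array([3, 0, 4, 1])


def toy_loss(choice: np.ndarray) -> float:
    return float(np.mean(choice != target))


picked = reinforce_select(toy_loss, n_heads=4, n_candidates=5)
print(picked, toy_loss(picked))
```

The design choice worth noticing is that the search space is discrete: the policy only ever picks among existing cluster centres, so no gradient has to flow through the model itself.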
At inference time, STV replaces the original activations at those selected heads with the chosen task vectors. It does not extend the prompt. It does not update model weights. It requires only a single forward pass per test instance.
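The replacement step itself is simple once the vectors are chosen. The sketch below assumes head outputs for a single token position are available as one array; in a real model this would more likely be done with forward hooks, and which token positions are patched is not specified here. The function name and shapes are hypothetical.

```python
import numpy as np


def insert_task_vectors(
    head_outputs: np.ndarray,
    task_vectors: dict[tuple[int, int], np.ndarray],
) -> np.ndarray:
    """Return a copy of head_outputs with the selected heads overwritten.

    head_outputs has shape (num_layers, num_heads, head_dim).
    """
    patched = head_outputs.copy()
    for (layer, head), vector in task_vectors.items():
        patched[layer, head] = vector
    return patched


rng = np.random.default_rng(0)
head_outputs = rng.normal(size=(4, 8, 16))      # stand-in for zero-shot activations
task_vectors = {(2, 5): np.ones(16)}            # stand-in for a chosen cluster centre

patched = insert_task_vectors(head_outputs, task_vectors)
print(patched[2, 5][:4])
```

Only the selected heads change; every other activation is whatever the zero-shot forward pass produced, which is why the prompt does not need to grow.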
That last point is easy to understate. The computational work is not eliminated; it is moved. STV needs upfront computation to estimate deltas, build activation banks, and learn vector choices. But once those vectors are selected, inference behaves more like zero-shot prompting than many-shot prompting.
For repeated workloads, that distinction is commercially meaningful. For one-off tasks, less so.
The main evidence: better accuracy without paying the long-prompt tax at inference
The paper evaluates STV on five benchmarks: VizWiz, OK-VQA, DTD, Flowers, and CUB. These cover real-world visual question answering, knowledge-intensive visual QA, and fine-grained classification across textures, flowers, and birds. The authors test two open-source multimodal models: Qwen-VL-7B and Idefics2-8B.
The headline result is straightforward: STV outperforms prior task-vector methods and standard ICL baselines across both model families.
| Model | Method | VizWiz | OK-VQA | DTD | Flowers | CUB | Average |
|---|---|---|---|---|---|---|---|
| Qwen-VL-7B | Zero-shot ICL | 35.21 | 57.76 | 55.07 | 55.24 | 56.50 | 51.96 |
| Qwen-VL-7B | 4-shot ICL | 42.00 | 54.62 | 55.50 | 54.67 | 56.16 | 52.59 |
| Qwen-VL-7B | MTV | 45.60 | 60.51 | 76.50 | 78.10 | 80.00 | 68.94 |
| Qwen-VL-7B | STV | 58.30 | 61.94 | 80.45 | 81.51 | 82.33 | 72.11 |
| Idefics2-8B | Zero-shot ICL | 31.30 | 52.40 | 88.73 | 82.80 | 88.70 | 68.79 |
| Idefics2-8B | 4-shot ICL | 40.80 | 51.50 | 89.13 | 84.32 | 87.26 | 70.60 |
| Idefics2-8B | MTV | 52.50 | 53.00 | 89.14 | 83.80 | 89.80 | 73.65 |
| Idefics2-8B | STV | 60.61 | 54.14 | 92.47 | 86.73 | 90.23 | 76.04 |
The strongest comparison is against MTV, the earlier task-vector baseline built around location selection. On Qwen-VL-7B, STV raises the reported average accuracy from 68.94 to 72.11. On Idefics2-8B, it raises it from 73.65 to 76.04.
The largest single jump appears on VizWiz: STV reaches 58.30 on Qwen-VL-7B versus MTV’s 45.60, a 12.7-point gain. On Idefics2-8B, STV reaches 60.61 versus MTV’s 52.50, an 8.1-point gain.
The fine-grained classification tasks also matter. On Qwen-VL-7B, STV reaches 80.45 on DTD, 81.51 on Flowers, and 82.33 on CUB. Each score is roughly 25 to 26 points above the corresponding zero-shot result and 2 to 4 points above MTV, so these are not small corrections to a weak baseline. They suggest that the task-vector intervention is preserving useful discriminative information, not merely injecting a vague instruction like “please classify better.”
The business reading should be careful. The paper shows controlled benchmark gains on two open-source models. It does not prove that STV will improve every enterprise vision-language workflow. But it does strengthen the case that internal activation control can be a practical adaptation layer between prompting and fine-tuning.
The efficiency result is not decorative; it is part of the thesis
The paper’s efficiency comparison with MTV is especially important because STV is not only trying to be more accurate. It is trying to avoid expensive location search.
On VizWiz with Qwen-VL-7B, the paper reports:
| Metric | MTV | STV | Change |
|---|---|---|---|
| Location search time | 6,000 seconds | 88 seconds | 98.53% reduction |
| GPU memory | 19.8 GB | 19.8 GB | No change |
| Inference time | 0.49 seconds | 0.49 seconds | No change |
| Performance | 45.6 | 58.3 | +12.7 |
This is the rare table where the engineering story is as important as the accuracy story. STV does not merely find a better intervention; it finds it much faster. The reason is mechanical: activation deltas guide the search toward context-sensitive heads instead of relying on expensive sampling over locations.
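The reduction figure follows directly from the two search times in the table. The sketch below only reproduces that arithmetic; the variable names are illustrative.

```python
# Location search times reported in the table above, in seconds.
mtv_search_s = 6000
stv_search_s = 88

reduction_pct = (1 - stv_search_s / mtv_search_s) * 100
speedup = mtv_search_s / stv_search_s

print(f"reduction: {reduction_pct:.2f}%")
print(f"speed-up:  {speedup:.1f}x")
```

Put the other way round, the same numbers mean the search runs roughly 68 times faster.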
For businesses, the practical interpretation is not “free adaptation.” Nothing is free; someone always pays, usually the GPU. The interpretation is more specific: if a task recurs often enough, STV can amortise the upfront cost of building and selecting activation-level task vectors while keeping per-query inference close to zero-shot cost.
That is useful for cases such as repeated visual QA over a specialised image domain, product taxonomy classification, field-inspection image triage, or workflow-specific document-image interpretation. It is less compelling for one-off prompts, low-volume tasks, or environments where the model cannot be instrumented internally.
The ablations show why both “where” and “what” matter
The ablation section is doing real work. It is not just a ritual sacrifice to reviewer expectations.
First, the authors test the effect of selecting more sensitive locations. On VizWiz with Qwen-VL-7B, performance rises from 35.2% to 49.2% as more sensitive locations are used. That supports the claim that sensitivity-aware location selection contributes meaningful adaptation.
But the curve does not reward indiscriminate intervention. When $K$ becomes too large—above 300 in the reported analysis—performance drops sharply. That is an important boundary. Touching more of the model is not automatically better. Internal intervention has a dosage problem.
Second, the authors test cluster granularity. With $K = 300$, increasing the number of cluster centres improves performance and saturates around 32. The interpretation is straightforward: a richer activation bank preserves more task information, but only up to a point. After that, extra granularity offers diminishing returns.
The paper attributes +14.0 percentage points to location selection and +9.1 percentage points to task-vector construction in this ablation setting. That supports the core mechanism-first reading: STV wins because it combines sensitive placement with better per-location vector choice. Either half alone is weaker.
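Read as a running total, and assuming the two contributions stack on the 35.2 zero-shot VizWiz score quoted earlier, the decomposition lands on STV's reported result:

$$
35.2 \;+\; \underbrace{14.0}_{\text{where}} \;+\; \underbrace{9.1}_{\text{what}} \;=\; 58.3
$$

The intermediate value, 49.2, matches the top of the location-count curve described above.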
Third, the paper studies scaling over the number of iterations and shots per iteration. More examples help at first, but excessive iterations or shot counts eventually reduce accuracy. Redundant information becomes noise. This is a useful corrective to the naïve “more examples must be better” instinct.
The result is not that many-shot learning is bad. It is that many-shot information must be filtered. STV is partly a filtering method.
STV is not fine-tuning, but it is also not magic zero-data adaptation
The comparison with parameter tuning is one of the more practically relevant parts of the paper.
On Qwen-VL-7B, the base model scores 35.2 on VizWiz and 58.6 on OK-VQA. Supervised fine-tuning improves VizWiz to 62.0 but collapses OK-VQA to 25.1. LoRA improves VizWiz to 44.3 but slightly reduces OK-VQA to 57.7. STV reaches 58.3 on VizWiz and 61.9 on OK-VQA.
| Adaptation method | VizWiz | OK-VQA | Practical reading |
|---|---|---|---|
| Base Qwen-VL-7B | 35.2 | 58.6 | No adaptation. |
| SFT | 62.0 | 25.1 | Strong task gain, poor transfer in this comparison. |
| LoRA | 44.3 | 57.7 | Modest task gain, little cross-task upside here. |
| STV | 58.3 | 61.9 | Strong VizWiz gain without weight updates; OK-VQA also improves. |
This is where STV becomes strategically interesting. It does not require updating model weights, yet it can outperform LoRA in the reported setting and avoid the cross-task failure shown by SFT. That does not make fine-tuning obsolete. It makes fine-tuning less obviously the default answer.
Still, STV is not a zero-data miracle. It uses in-context examples, computes task losses during selection, and depends on labelled or otherwise evaluable task data. The method avoids parameter updates; it does not avoid task setup.
That distinction matters operationally. A team considering STV would still need curated examples, access to activations, a validation signal, and an implementation pipeline for intervention. The selling point is not “no work.” It is “no model-weight update and no long prompt at inference.” Different promise. Better promise.
Against ordinary ICL, STV’s advantage is cost shape
The paper also compares STV against increasingly large few-shot ICL on VizWiz with Qwen-VL-7B.
Standard ICL improves as shots rise: 0-shot scores 35.2, 4-shot scores 42.0, 8-shot scores 44.3, 16-shot scores 46.9, and 32-shot scores 49.8. At 64 shots, the run fails with an out-of-memory error. STV, using 400 examples compressed into task vectors, reaches 58.3 without the same token overhead.
The runtime and FLOP comparison reinforces the same point. Relative to STV, 32-shot ICL is reported as 25.56× FLOPs and 8.55× runtime. STV remains normalised at 1.00× for both, essentially matching zero-shot cost in the reported setup.
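One way to reason about that cost shape is a break-even count. The sketch below measures everything in units of one STV query's runtime and uses the 8.55× ratio quoted above; the setup cost of 1,000 units is a purely hypothetical placeholder, since the total upfront cost of building and selecting vectors is not given here.

```python
import math


def break_even_queries(
    setup_cost: float,
    icl_cost_per_query: float,
    stv_cost_per_query: float = 1.0,
) -> int:
    """Smallest query count at which setup plus STV inference is cheaper than ICL."""
    saving_per_query = icl_cost_per_query - stv_cost_per_query
    if saving_per_query <= 0:
        raise ValueError("ICL must cost more per query for a break-even point to exist")
    return math.floor(setup_cost / saving_per_query) + 1


# 8.55 is the reported 32-shot runtime ratio; 1000.0 is a made-up setup cost.
n = break_even_queries(setup_cost=1000.0, icl_cost_per_query=8.55)
print(n)
```

The absolute number means little without a measured setup cost. The useful part is the structure: the larger the per-query saving, the fewer repetitions are needed before the upfront work pays for itself.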
This is the key business shape:
| Approach | Where the cost appears | When it makes sense |
|---|---|---|
| Few-shot / many-shot prompting | Every inference call pays for examples in the prompt. | Low setup, low volume, black-box API use. |
| Fine-tuning / LoRA | Upfront training cost and model-management cost. | Stable tasks, strong data, acceptable model update process. |
| STV-style activation intervention | Upfront vector construction and selection; low per-query prompt overhead. | Repeated task adaptation on accessible open-source models under context/GPU constraints. |
STV does not beat prompting because prompting is foolish. Prompting is wonderfully convenient. That is why everyone abuses it.
STV beats prompting where convenience becomes a recurring tax.
Exemplar quality still matters, because reality enjoys ruining abstractions
The paper includes two smaller tests that are easy to skip but useful for implementation thinking.
First, selecting high-quality exemplars with a Facility Location method improves STV on VizWiz from 58.3 to 61.9. That suggests activation-space adaptation is still sensitive to the quality of the demonstrations used to build its internal task signal. Garbage in, activation-modulated garbage out. Very modern.
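For readers unfamiliar with the idea, the standard greedy form of facility-location selection is sketched below. This is the textbook objective, not necessarily the paper's exact procedure: the similarity measure, the feature space, and the toy points are all assumptions, and similarities are assumed to be non-negative.

```python
import numpy as np


def facility_location_greedy(similarity: np.ndarray, budget: int) -> list[int]:
    """Greedily pick items maximising sum_i max_{j in S} similarity[i, j]."""
    n = similarity.shape[0]
    selected: list[int] = []
    best_cover = np.zeros(n)  # best similarity of each item to the selected set so far
    for _ in range(budget):
        # Marginal gain in the objective from adding each candidate column j.
        gains = np.maximum(similarity, best_cover[:, None]).sum(axis=0) - best_cover.sum()
        if selected:
            gains[selected] = -np.inf  # never pick the same item twice
        j = int(gains.argmax())
        selected.append(j)
        best_cover = np.maximum(best_cover, similarity[:, j])
    return selected


# Toy exemplar features: indices 0-4 sit near (0, 0), indices 5-9 near (10, 10).
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(0.0, 0.1, size=(5, 2)),
    rng.normal(10.0, 0.1, size=(5, 2)),
])
sq_dists = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
similarity = np.exp(-sq_dists)

picked_exemplars = facility_location_greedy(similarity, budget=2)
print(picked_exemplars)
```

With a budget of two, the greedy rule takes one representative from each group rather than two near-duplicates, which is the behaviour that makes it attractive for choosing demonstrations.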
Second, under noisy cross-domain exemplars, STV degrades by 0.7 points while 4-shot ICL degrades by 1.0 point in the reported VizWiz setup. This is a small robustness result, not a grand theorem. Its likely purpose is to show that STV is not unusually fragile to exemplar noise. It does not prove broad production robustness under domain drift, adversarial inputs, or messy enterprise labelling.
These tests are best treated as implementation signals. STV benefits from better exemplar selection, and it may be somewhat more stable than direct ICL under this specific noise condition. That is useful. It is not a licence to ignore data curation.
What the paper directly shows, what Cognaptus infers, and what remains open
The cleanest business interpretation is to separate evidence from inference.
| Layer | Claim | Status |
|---|---|---|
| Direct paper result | STV improves over MTV and other task-vector baselines across five benchmarks and two open-source LMMs. | Directly supported by reported experiments. |
| Direct paper result | STV reduces MTV location-search time from 6,000s to 88s on VizWiz while keeping GPU memory and inference time unchanged in that comparison. | Directly supported by Table 2. |
| Direct paper result | Too many insertion locations or excessive examples can reduce performance. | Directly supported by ablation trends. |
| Cognaptus inference | STV is attractive for repeated multimodal tasks where long prompts are expensive and model activations are accessible. | Reasonable operational inference. |
| Cognaptus inference | STV can serve as a middle layer between prompting and fine-tuning. | Reasonable architectural inference. |
| Still uncertain | Whether STV works as well in live enterprise workflows, safety-critical settings, proprietary long-context models, or highly dynamic domains. | Not established by the paper. |
The method is strongest where three conditions hold.
First, the task recurs often enough to justify upfront vector construction. Second, the organisation can run or modify an open-source multimodal model with access to intermediate activations. Third, the task has enough examples and evaluation signal to guide vector selection.
If any of those conditions fail, STV becomes less compelling. A black-box API user cannot easily use it. A one-time classification task may not justify it. A workflow without labelled examples may need a different adaptation strategy first.
The real lesson is not “task vectors win”; it is “adaptation has geography”
The paper’s title asks where and what matters. That framing is unusually accurate.
The “where” is the geography of the model: some attention heads are more sensitive to contextual task information than others. The “what” is the content of the intervention: averaged or fixed vectors may lose task detail, while clustered activation banks plus RL selection preserve more useful variation.
Together, these ideas point toward a more surgical view of model adaptation. Do not always lengthen the prompt. Do not always fine-tune the weights. Sometimes the better question is: which internal representations already react to the examples, and can we reuse that reaction cheaply?
That is not a universal answer. It is a narrower answer, which is why it is useful.
For enterprises building with open-source multimodal models, STV suggests a practical route for repeated visual QA and classification tasks under context and GPU constraints. The business value is not merely higher benchmark accuracy. It is a different cost profile: more upfront adaptation work, less inference-time prompt baggage, and no parameter update.
The caveat is equally practical. This belongs in the toolbox of teams that can instrument models, manage activation hooks, curate examples, and evaluate task loss. It is not a prompt-engineering trick. It is model adaptation wearing a smaller lab coat.
And that, frankly, is the more interesting direction. The future of applied AI will not only be about asking models longer questions. It will also be about knowing which parts of the model are worth disturbing.
[1] Ziyu Ma, Chenhui Gou, Yiming Hu, Yong Wang, Xiangxiang Chu, Bohan Zhuang, and Jianfei Cai, “Where and What Matters: Sensitivity-Aware Task Vectors for Many-Shot Multimodal In-Context Learning,” arXiv:2511.08246, 2025. https://arxiv.org/abs/2511.08246