Opening — Why this matters now
Multimodal Large Language Models (MLLMs) can reason, explain, and even philosophize about images—until they’re asked to notice something small. A number on a label. A word in a table. The relational context that turns a painted line into a parking space instead of a traffic lane.
The industry’s default fix has been straightforward: crop harder, zoom further, add resolution. Yet performance stubbornly plateaus. This paper makes an uncomfortable but important claim: the problem is not missing pixels. It’s missing structure.
Background — Context loss disguised as progress
Most modern MLLMs compress visual input into a fixed token budget. Fine details suffer, so recent work introduced inference-time cropping strategies: locate the “important” region via attention, crop it tightly, and feed it back alongside the original image.
On paper, this seems sensible. In practice, it produces a failure mode the authors name Contextual Blindness.
Contextual Blindness occurs when:
- The model sees the global image and the high-resolution detail
- Yet fails to integrate them meaningfully
The reason is subtle: intermediate context is gone. The crop isolates detail; the original image is too coarse. Nothing bridges the two.
Analysis — Visual Funnel as structural repair
The proposed method, Visual Funnel, is training-free and deceptively simple. Its core contribution is not better localization, but a better integration strategy.
Step 1: Contextual Anchoring
Instead of immediately asking the model to answer a question, Visual Funnel first asks:
“To answer this question, where in the image should I look?”
This forces the model to externalize its spatial uncertainty. From this single forward pass, cross-attention maps are extracted—no tools, no retraining, no extra supervision.
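In code, the anchoring step amounts to one extra forward pass whose attention is pooled into a spatial heatmap. The sketch below is a minimal illustration, assuming a Hugging Face-style model that returns attention over the concatenated text and image token sequence; the layer window, the pooling, and the `image_token_mask` argument are assumptions for illustration, not the paper's exact recipe.

```python
import torch

def anchoring_heatmap(attentions, image_token_mask, grid_hw):
    """Pool text-to-image attention from the anchoring pass into a 2D heatmap.

    attentions: tuple of (batch, heads, seq, seq) tensors, e.g. what a
        Hugging Face model returns with output_attentions=True (assumption:
        the MLLM exposes attention over the joint text + image sequence).
    image_token_mask: bool tensor of shape (seq,) marking image-patch positions.
    grid_hw: (rows, cols) of the vision patch grid.
    """
    # Average the last few layers and all heads (the layer choice is an assumption).
    attn = torch.stack(attentions[-4:]).mean(dim=(0, 2))          # (batch, seq, seq)
    # Attention that text (query) tokens place on image (key) tokens.
    text_to_image = attn[0][~image_token_mask][:, image_token_mask]
    heatmap = text_to_image.mean(dim=0)                           # (num_patches,)
    return heatmap.reshape(grid_hw)
```

The returned grid is all the later stages need: crop geometry is derived from it rather than from any external detector or tool.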
Step 2: Entropy-Scaled Portfolio Generation
Here’s the real innovation.
Rather than selecting one crop—or many arbitrary ones—the method constructs a hierarchical, multi-scale portfolio:
| Level | Purpose | Scale Logic |
|---|---|---|
| Focal | Precise detail | Attention peak |
| Immediate | Local relations | Entropy-adjusted expansion |
| Broader | Global grounding | Higher-entropy expansion |
Attention entropy determines how much context each crop deserves. Diffuse attention means ambiguity, so the crop expands. Sharp attention stays tight—but never context-free.
Crucially, crop centers are refined hierarchically, correcting for asymmetric attention distributions common in documents and natural scenes.
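A minimal sketch of how entropy-scaled expansion could be implemented over that heatmap follows. The per-level multipliers, the entropy-to-expansion mapping, and the omission of the hierarchical center refinement are simplifying assumptions, not the paper's exact formulation.

```python
import torch

def entropy_scaled_portfolio(heatmap, base_frac=0.15, max_frac=0.9):
    """Derive focal / immediate / broader crop boxes from an attention heatmap.

    Returns boxes as (x0, y0, x1, y1) in normalized [0, 1] image coordinates.
    base_frac and the per-level multipliers are illustrative, not the paper's values.
    """
    probs = (heatmap / heatmap.sum()).flatten().clamp_min(1e-12)
    # Normalized entropy in [0, 1]: diffuse attention -> more context per crop.
    entropy = -(probs * probs.log()).sum() / torch.log(torch.tensor(float(probs.numel())))

    rows, cols = heatmap.shape
    peak = torch.argmax(heatmap)                 # flattened index of the attention peak
    cy = ((peak // cols).item() + 0.5) / rows    # peak center, normalized
    cx = ((peak % cols).item() + 0.5) / cols

    boxes = {}
    for name, mult in [("focal", 1.0), ("immediate", 2.0), ("broader", 4.0)]:
        half = min(max_frac, base_frac * mult * (1.0 + entropy.item())) / 2.0
        boxes[name] = (max(0.0, cx - half), max(0.0, cy - half),
                       min(1.0, cx + half), min(1.0, cy + half))
    return boxes
```

In use, each box would be cropped from the full-resolution image and fed back alongside the original, mirroring the multi-crop setup the paper compares against, but with the structure the entropy scaling imposes.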
Findings — Structure beats quantity
Across TextVQA, DocVQA, InfoVQA, and related benchmarks, Visual Funnel consistently outperforms:
- No-crop baselines
- Single-crop attention methods
- Unstructured multi-crop baselines with identical token budgets
The most telling result is the so-called Redundancy Penalty:
Adding more crops without structure can actively hurt performance.
Representative results (DocVQA, Qwen2.5-VL-3B)
| Method | Accuracy (%) |
|---|---|
| No cropping | 51.5 |
| Single crop | 54.2 |
| Top-3 random crops | 55.3 |
| Visual Funnel | 61.1 |
Same tokens. Same attention source. Radically different outcomes.
Implications — What this changes for practitioners
This paper quietly reframes how we should think about multimodal perception:
- Resolution is not reasoning — Bigger crops don’t fix relational ambiguity.
- Attention alone is insufficient — Localization without structured integration still fails.
- Inference-time structure matters — You can meaningfully improve models without touching weights.
For applied teams working on OCR, document intelligence, or visual compliance systems, the takeaway is blunt: stop stacking crops and start designing context.
Limitations — And where this breaks
Visual Funnel assumes:
- A single dominant region of interest
- Reasonably accurate initial attention
Multi-focus reasoning (e.g., cross-document comparisons) remains out of scope. There is also a roughly 2× inference latency cost, though the method remains notably more efficient than naïve multi-crop baselines.
Conclusion — The funnel beats the magnifying glass
Visual Funnel doesn’t make MLLMs see more. It makes them see properly.
By replacing pixel hoarding with structural hierarchy, it resolves a core failure mode hiding in plain sight. Contextual Blindness was never about vision—it was about organization.
And that’s a lesson worth remembering far beyond multimodal models.
Cognaptus: Automate the Present, Incubate the Future.