Opening — Why this matters now

Multimodal Large Language Models (MLLMs) can reason, explain, and even philosophize about images—until they’re asked to notice something small. A number on a label. A word in a table. The relational context that turns a painted line into a parking space instead of a traffic lane.

The industry’s default fix has been straightforward: crop harder, zoom further, add resolution. Yet performance stubbornly plateaus. This paper makes an uncomfortable but important claim: the problem is not missing pixels. It’s missing structure.

Background — Context loss disguised as progress

Most modern MLLMs compress visual input into a fixed token budget. Fine details suffer, so recent work introduced inference-time cropping strategies: locate the “important” region via attention, crop it tightly, and feed it back alongside the original image.
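For concreteness, that baseline looks roughly like the sketch below. It is a minimal illustration, not code from the paper or any specific model: the attention heatmap is assumed to come from the MLLM itself, and crop_frac is an illustrative constant.

```python
# Minimal sketch of the naive single-crop baseline described above:
# find the attention peak, cut a tight crop, and hand it back to the
# model alongside the full image.
from PIL import Image
import numpy as np

def crop_at_attention_peak(image: Image.Image, heatmap: np.ndarray,
                           crop_frac: float = 0.25) -> Image.Image:
    """Cut a tight crop centered on the attention peak.

    `crop_frac` is an illustrative constant, not a value from the paper.
    """
    w, h = image.size
    peak_y, peak_x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    # Map attention-grid coordinates to pixel coordinates.
    cx = int((peak_x + 0.5) / heatmap.shape[1] * w)
    cy = int((peak_y + 0.5) / heatmap.shape[0] * h)
    cw, ch = int(w * crop_frac), int(h * crop_frac)
    left = max(0, min(w - cw, cx - cw // 2))
    top = max(0, min(h - ch, cy - ch // 2))
    return image.crop((left, top, left + cw, top + ch))

# The baseline then feeds [original_image, crop] to the model together.
# Everything between the global view and the peak-scale crop is discarded,
# which is exactly the gap the paper calls Contextual Blindness.
```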

On paper, this seems sensible. In practice, it produces a failure mode the authors name Contextual Blindness.

Contextual Blindness occurs when:

  • The model sees the global image and the high-resolution detail
  • Yet fails to integrate them meaningfully

The reason is subtle: intermediate context is gone. The crop isolates detail; the original image is too coarse. Nothing bridges the two.

Analysis — Visual Funnel as structural repair

The proposed method, Visual Funnel, is training-free and deceptively simple. Its core contribution is not better localization, but a better integration strategy.

Step 1: Contextual Anchoring

Instead of immediately asking the model to answer a question, Visual Funnel first asks:

“To answer this question, where in the image should I look?”

This forces the model to externalize its spatial uncertainty. From this single forward pass, cross-attention maps are extracted—no tools, no retraining, no extra supervision.
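The anchoring pass itself is model-specific, but reducing the extracted attention weights to a spatial heatmap is generic. The sketch below assumes a cross-attention tensor shaped (layers, heads, query_tokens, image_tokens); that layout, and the exact prompt wiring, are assumptions rather than details from the paper.

```python
import torch

# Prompt quoted from the paper's Contextual Anchoring step.
ANCHOR_PROMPT = "To answer this question, where in the image should I look?"

def attention_to_heatmap(cross_attn: torch.Tensor,
                         grid_h: int, grid_w: int) -> torch.Tensor:
    """Reduce anchoring-pass attention weights to a 2D heatmap.

    `cross_attn` is assumed to be shaped
    (layers, heads, query_tokens, image_tokens); real models differ,
    so treat this as a sketch of the reduction, not a drop-in API.
    """
    # Average over layers, heads, and generated query tokens.
    attn = cross_attn.mean(dim=(0, 1, 2))        # -> (image_tokens,)
    attn = attn / attn.sum().clamp_min(1e-8)     # normalize to a distribution
    return attn.reshape(grid_h, grid_w)          # back onto the patch grid
```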

Step 2: Entropy-Scaled Portfolio Generation

Here’s the real innovation.

Rather than selecting one crop—or many arbitrary ones—the method constructs a hierarchical, multi-scale portfolio:

Level     | Purpose           | Scale logic
Focal     | Precise detail    | Attention peak
Immediate | Local relations   | Entropy-adjusted expansion
Broader   | Global grounding  | Higher-entropy expansion

Attention entropy determines how much context each crop deserves. Diffuse attention means ambiguity, so the crop expands. Sharp attention stays tight—but never context-free.

Crucially, crop centers are refined hierarchically, correcting for asymmetric attention distributions common in documents and natural scenes.
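Since the paper's exact formulas are not reproduced here, the sketch below is one plausible reading of Step 2: Shannon entropy of the attention map scales the crop sizes, and each tighter level re-centers on the attention centroid computed inside the previous, wider crop. The constants (base_frac, the per-level multipliers) and the centroid-based refinement rule are assumptions, not the paper's values.

```python
import numpy as np

def entropy(p: np.ndarray) -> float:
    # Shannon entropy of a normalized attention map, in nats.
    p = p / p.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def build_portfolio(heatmap: np.ndarray, image_size: tuple[int, int],
                    base_frac: float = 0.2) -> list[tuple[int, int, int, int]]:
    """Sketch of an entropy-scaled, hierarchically re-centered crop portfolio."""
    H, W = heatmap.shape
    img_w, img_h = image_size

    # Diffuse attention (high entropy) -> larger crops; sharp attention -> tighter.
    # Entropy is normalized by its maximum, log(H*W), so the factor lies in (1, 2].
    ent_scale = 1.0 + entropy(heatmap) / np.log(H * W)

    crops = []
    level_mults = [1.0, 2.0, 4.0]  # Focal, Immediate, Broader (illustrative)

    # Start at the global attention peak, then refine hierarchically:
    # each tighter level recomputes the centroid inside the previous crop.
    cy, cx = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    for mult in reversed(level_mults):             # broader -> focal
        frac = min(1.0, base_frac * mult * ent_scale)
        cw, ch = int(img_w * frac), int(img_h * frac)
        # Map grid coordinates to pixels and clamp the box to the image.
        px = int((cx + 0.5) / W * img_w)
        py = int((cy + 0.5) / H * img_h)
        left = max(0, min(img_w - cw, px - cw // 2))
        top = max(0, min(img_h - ch, py - ch // 2))
        crops.append((left, top, left + cw, top + ch))

        # Re-center: attention centroid within this crop, correcting for
        # asymmetric attention before zooming in further.
        gl, gt = int(left / img_w * W), int(top / img_h * H)
        gr, gb = int((left + cw) / img_w * W), int((top + ch) / img_h * H)
        sub = heatmap[gt:max(gt + 1, gb), gl:max(gl + 1, gr)]
        ys, xs = np.mgrid[0:sub.shape[0], 0:sub.shape[1]]
        w_sum = sub.sum() + 1e-12
        cy = gt + (ys * sub).sum() / w_sum
        cx = gl + (xs * sub).sum() / w_sum

    return crops[::-1]  # focal, immediate, broader
```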

Findings — Structure beats quantity

Across TextVQA, DocVQA, InfoVQA, and related benchmarks, Visual Funnel consistently outperforms:

  • No-crop baselines
  • Single-crop attention methods
  • Unstructured multi-crop baselines with identical token budgets

The most telling result is the so-called Redundancy Penalty:

Adding more crops without structure can actively hurt performance.

Representative results (DocVQA, Qwen2.5-VL-3B)

Method              | Accuracy
No cropping         | 51.5
Single crop         | 54.2
Top-3 random crops  | 55.3
Visual Funnel       | 61.1

Same tokens. Same attention source. Radically different outcomes.

Implications — What this changes for practitioners

This paper quietly reframes how we should think about multimodal perception:

  1. Resolution is not reasoning — Bigger crops don’t fix relational ambiguity.
  2. Attention alone is insufficient — Localization without structured integration still fails.
  3. Inference-time structure matters — You can meaningfully improve models without touching weights.

For applied teams working on OCR, document intelligence, or visual compliance systems, the takeaway is blunt: stop stacking crops and start designing context.

Limitations — And where this breaks

Visual Funnel assumes:

  • A single dominant region of interest
  • Reasonably accurate initial attention

Multi-focus reasoning (e.g., cross-document comparisons) remains out of scope. There’s also a ~2× inference latency cost—though notably more efficient than naïve multi-crop baselines.

Conclusion — The funnel beats the magnifying glass

Visual Funnel doesn’t make MLLMs see more. It makes them see properly.

By replacing pixel hoarding with structural hierarchy, it resolves a core failure mode hiding in plain sight. Contextual Blindness was never about vision—it was about organization.

And that’s a lesson worth remembering far beyond multimodal models.

Cognaptus: Automate the Present, Incubate the Future.