A receipt is not hard to understand because it is philosophical. It is hard because the answer may live in one corner, the label in another, and the meaning in the relationship between them.

That is exactly the kind of thing multimodal large language models are supposed to be getting better at. Give the model an image. Ask a question. Let the model inspect the pixels and reason over the scene. The product demo looks magical until the model reads the wrong number, misses the column header, confuses the parking space for a lane, or confidently answers a chart question from the wrong local patch. Then the magic becomes a support ticket.

The obvious engineering reaction is simple: crop the relevant region. If the model cannot read the tiny detail, zoom in. If one crop is not enough, give it more crops. More pixels, more tokens, more chances. Nice and mechanical. Also, according to the paper “Visual Funnel: Resolving Contextual Blindness in Multimodal Large Language Models,” often not enough. [1]

The paper’s central argument is sharper than “models need higher resolution.” It says that cropping can create a new failure mode: Contextual Blindness. The model may have the full image and the tight crop, yet still fail because the intermediate context—the visual bridge between detail and scene—has been stripped away. The model is not missing pixels. It is missing structure. Very modern problem: plenty of information, badly arranged.

The proposed fix, Visual Funnel, is a training-free inference-time method. It first asks the model where to look, extracts an attention-based anchor, and then builds a structured portfolio of crops: focal detail, immediate context, and broader context. The important word is not “multi-crop.” It is structured. The crops are not three random magnifying glasses waved at the image. They are arranged as a hierarchy.

That distinction is where the paper becomes useful for business readers. For document AI, infographic QA, UI agents, inspection workflows, and other precision-heavy multimodal systems, the lesson is not simply “buy a bigger model” or “increase resolution everywhere.” The operational lesson is: when the answer depends on small visual details, the system needs a context recovery layer.

The failure is not bad eyesight; it is broken visual bookkeeping

Most discussions of vision-language model failure start with perception. The model cannot see small text. The object is too small. The chart label is too compressed. The document layout is too dense. That is true, but incomplete.

The paper separates the problem into two steps:

| Step | What it means | Why it matters |
| --- | --- | --- |
| Localization | Find the region likely to contain the answer | The model must know where to zoom |
| Integration | Present the localized detail in a form the model can interpret | The model must connect detail to surrounding context |

Earlier crop-based methods have become reasonably good at localization. They can use internal attention signals or iterative search to identify likely regions of interest. The weak point is integration. After the target is found, many methods feed back a tight crop, sometimes alongside the original image, and assume the model will connect the two.

That assumption sounds reasonable until the answer is relational.

A number in a table is not meaningful without its row and column headers. A person’s height may be “short” only relative to another person in the frame. A chart value may require reading a bar label, a category, and a negation in the question. A parking marking may be misread if the crop removes the surrounding road layout. In these cases, a tight crop improves local visibility while damaging interpretability.

This is the core mechanism behind Contextual Blindness:

Global image:     too broad to read the small detail clearly
Tight crop:       clear detail, but missing the surrounding relation
Missing layer:    intermediate context that links detail to meaning
Result:           the model has the pixels but not the usable structure

This is why the paper’s framing is more interesting than another “high-resolution helps” claim. The failure is not simply resolution scarcity. It is a mismatch between the visual units supplied to the model and the reasoning structure required by the question.

In business terms, this distinction matters because the fix is different. If the problem is only resolution, the answer is more compute. If the problem is structure, the answer is smarter inference design.

Why a tight crop can make the model worse

The intuitive story says a crop is a zoom lens. The paper’s less comfortable story says a crop is also a knife.

A crop improves detail by cutting away pixels. Sometimes those pixels are irrelevant. Often they are the context that tells the model what the detail means. This trade-off is especially painful in document and infographic tasks, where information is distributed across layout.

Consider a document question asking for a value in a table. The relevant number may be visually tiny. A tight crop around the number helps the model read it. But if the crop removes the column title, the row label, or neighboring entries, the number becomes an orphan. The model can see “45%” but may not know whether it refers to confidence, lack of confidence, experts, beginners, or some other category. Now the model is no longer blind in the ordinary sense. It is contextually blind.
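
The orphaned-number case can be made concrete with a few lines of geometry. The sketch below is illustrative only: the pixel coordinates are a made-up layout, and `contains` and `expand` are hypothetical helpers, not anything taken from the paper. It shows that a tight crop around a value cell can exclude the column header, while a wider crop around the same cell keeps both.

```python
def contains(crop, box):
    """True if `box` lies fully inside `crop`; both are (x0, y0, x1, y1)."""
    return (crop[0] <= box[0] and crop[1] <= box[1]
            and box[2] <= crop[2] and box[3] <= crop[3])


def expand(crop, factor):
    """Scale a box about its centre by `factor`."""
    cx, cy = (crop[0] + crop[2]) / 2, (crop[1] + crop[3]) / 2
    half_w = (crop[2] - crop[0]) * factor / 2
    half_h = (crop[3] - crop[1]) * factor / 2
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


# Hypothetical layout in pixels: a value cell and the column header above it.
value_cell = (400, 300, 460, 320)
column_header = (400, 240, 460, 260)

tight = expand(value_cell, 1.2)   # readable detail, little else
wider = expand(value_cell, 8.0)   # same centre, more surroundings

print("tight crop keeps header:", contains(tight, column_header))
print("wider crop keeps header:", contains(wider, column_header))
```

Both crops contain the value itself; only the wider one still carries the label that says what the value measures.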

The paper’s Figure 2 examples illustrate this pattern across natural scenes and structured visuals. The model may crop seated diners and miss the standing person needed for comparison. It may crop a value but exclude the header required to identify the metric. It may crop a chart component but lose the question-relevant negation. These examples are not just decorative failure cases. They define the mechanism the experiments later test.

This also explains why adding the original full image beside a crop does not automatically solve the problem. The global image and the crop operate at very different scales. The full image may contain the context, but in compressed form. The crop may contain the detail, but in isolation. The model has to bridge the two representations. The paper argues that MLLMs often do not bridge them reliably without intermediate visual scales.

So the naive pipeline becomes brittle:

  1. Find region.
  2. Crop tightly.
  3. Feed full image plus crop.
  4. Hope the model performs cross-scale integration.
  5. Pretend step 4 was engineering.

The paper is basically saying: step 4 is not engineering. It is wishful thinking with a GPU budget.

Visual Funnel adds the missing middle layer

Visual Funnel keeps the broad crop-based idea but changes the integration strategy. Instead of giving the model only the original image and a single tight crop, it constructs a multi-scale portfolio.

The method has two main steps.

First, Contextual Anchoring. The model is prompted with a localization-focused query: in effect, “To answer this question, where in the image should I look?” The method then extracts an attention map from the model’s internal representations. The purpose is not to answer yet. The purpose is to produce a better spatial anchor.

Second, Entropy-Scaled Portfolio Generation. Using that attention map, Visual Funnel creates three crops:

| Crop level | Role | Operational meaning |
| --- | --- | --- |
| Focal crop | Shows the fine detail | Read the small object, text, value, or local state |
| Immediate-context crop | Preserves local surroundings | Connect the detail to nearby labels, objects, or layout |
| Broader-context crop | Restores wider structure | Link the local region to the scene, table, chart, or document |

The crop sizes are not fixed blindly. The method uses normalized attention entropy as a signal of uncertainty. Low entropy means attention is concentrated, so less expansion may be enough. High entropy means attention is diffuse, suggesting the model may need broader context. The paper implements this with entropy-guided expansion factors, while always preserving some minimum context expansion so the focal detail is not isolated again.
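
A minimal sketch of the entropy signal, under stated assumptions. Normalized Shannon entropy is standard, but this article does not reproduce the paper's exact expansion formula or coefficients, so the linear rule in `expansion_factor`, its parameter names, and the `minimum` floor of 1.2 are all hypothetical placeholders chosen for illustration.

```python
import math


def normalized_entropy(attention):
    """Shannon entropy of a 2D attention map, scaled to the range [0, 1]."""
    flat = [max(v, 0.0) for row in attention for v in row]
    n = len(flat)
    total = sum(flat)
    if n < 2:
        return 0.0
    if total == 0:
        return 1.0  # no signal at all: treat as maximally diffuse
    probs = [v / total for v in flat]
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(n)


def expansion_factor(base, sensitivity, entropy, minimum=1.2):
    """Hypothetical rule: widen the crop as attention gets more diffuse,
    but never go below a minimum amount of surrounding context."""
    return max(minimum, base * (1.0 + sensitivity * entropy))


peaked = [[0, 0], [0, 1]]    # all attention on one cell
diffuse = [[1, 1], [1, 1]]   # attention spread evenly

for name, attn in (("peaked", peaked), ("diffuse", diffuse)):
    h = normalized_entropy(attn)
    print(name, round(h, 3), round(expansion_factor(1.5, 0.5, h), 3))
```

The floor in `expansion_factor` mirrors the idea that some context expansion is always preserved, so even a sharply peaked map does not collapse back to an isolated crop.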

There is another practical detail: crop centers are refined hierarchically. A standard crop may assume the important content sits near the center. Real images and documents are less polite. A target cell may sit near the edge of a table. A small sign may be off-center. Visual Funnel recalculates centers inside parent crops so that each scale can shift toward where the attention mass actually lies.
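
One way to picture hierarchical re-centering is an attention-weighted centroid computed inside the parent crop. This is a sketch of that general idea, not the paper's implementation: the function names, the grid-cell coordinate convention, and the clamping behaviour are assumptions made for the example.

```python
def attention_centroid(attention, box):
    """Attention-weighted centre (x, y) of the cells inside `box`.

    `box` is (x0, y0, x1, y1) in grid-cell indices; x1 and y1 are exclusive.
    """
    x0, y0, x1, y1 = box
    mass = sx = sy = 0.0
    for y in range(y0, y1):
        for x in range(x0, x1):
            w = attention[y][x]
            mass += w
            sx += w * (x + 0.5)
            sy += w * (y + 0.5)
    if mass == 0:
        return ((x0 + x1) / 2, (y0 + y1) / 2)  # fall back to geometric centre
    return (sx / mass, sy / mass)


def child_box(center, size, parent):
    """Box of side `size` around `center`, shifted so it stays inside `parent`."""
    px0, py0, px1, py1 = parent
    cx, cy = center
    w = min(size, px1 - px0)
    h = min(size, py1 - py0)
    x0 = min(max(cx - w / 2, px0), px1 - w)
    y0 = min(max(cy - h / 2, py0), py1 - h)
    return (x0, y0, x0 + w, y0 + h)


# Toy 4x4 map with all attention in the top-right corner.
attention = [
    [0, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
full = (0, 0, 4, 4)
center = attention_centroid(attention, full)
focal = child_box(center, 2, full)
print(center, focal)
```

A crop centred on the geometric middle of the parent would miss the corner cell entirely; following the attention mass pulls the child box toward it while keeping it inside the parent.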

The method therefore does two things at once:

Attention map -> where to look
Attention entropy -> how much surrounding context to keep
Hierarchical refinement -> how to place each crop without assuming symmetry

This is why “three crops” is not the main contribution. Three unstructured crops are just more visual material. Visual Funnel’s claim is that the three crops form a usable context ladder.

The main evidence: grounded visual QA is where the mechanism shows up

The paper evaluates Visual Funnel on three representative MLLMs: LLaVA-1.5-7B, InstructBLIP-7B, and Qwen2.5-VL-3B-Instruct. It tests seven benchmarks divided into two groups.

The first group, Grounded Visual QA, includes TextVQA, GQA, DocVQA, and InfoVQA. These tasks are more sensitive to small details, text, document layout, charts, or compositional visual grounding. This is the main evidence category because it directly tests the claimed failure mode.

The second group, Recognition Visual QA, includes POPE, A-OKVQA, and VQAv2. These tasks are useful as a boundary check. If Visual Funnel is solving Contextual Blindness rather than simply acting as a generic performance booster, the gains should be larger on grounded detail-heavy tasks and smaller on general recognition tasks.

That is broadly what the results show.

| Model | Dataset | Base | ViCrop | ViCrop Top-3 | Visual Funnel | Interpretation |
| --- | --- | --- | --- | --- | --- | --- |
| LLaVA-1.5-7B | TextVQA | 47.9 | 54.1 | 53.5 | 59.1 | Structured context beats both single crop and unstructured multi-crop |
| LLaVA-1.5-7B | DocVQA | 15.9 | 19.4 | 19.2 | 22.8 | Document layout benefits from intermediate context |
| InstructBLIP-7B | InfoVQA | 12.8 | 15.8 | 16.0 | 25.1 | Infographics show a large gain from structured context |
| Qwen2.5-VL-3B | DocVQA | 51.5 | 54.2 | 55.3 | 61.1 | Stronger base model still benefits from the same mechanism |
| Qwen2.5-VL-3B | InfoVQA | 34.2 | 39.4 | 39.9 | 49.6 | The biggest practical signal: chart/layout reasoning improves sharply |

The pattern matters more than any single number. Visual Funnel improves most on tasks where small details must be interpreted through surrounding layout. On Qwen2.5-VL-3B, DocVQA rises from 51.5 to 61.1, and InfoVQA rises from 34.2 to 49.6. On InstructBLIP, InfoVQA rises from 12.8 to 25.1. These are not tiny leaderboard confetti gains. They are the kind of gaps that change whether a document AI system feels usable or suspicious.
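
To make the size of these gaps easier to compare, the snippet below recomputes absolute and relative gains from the base and Visual Funnel scores in the table above. The numbers are copied from this article; the helper `gain` is only an illustration.

```python
# (base, visual_funnel) scores copied from the results table in this article.
SCORES = {
    ("LLaVA-1.5-7B", "TextVQA"): (47.9, 59.1),
    ("LLaVA-1.5-7B", "DocVQA"): (15.9, 22.8),
    ("InstructBLIP-7B", "InfoVQA"): (12.8, 25.1),
    ("Qwen2.5-VL-3B", "DocVQA"): (51.5, 61.1),
    ("Qwen2.5-VL-3B", "InfoVQA"): (34.2, 49.6),
}


def gain(base, funnel):
    """Return (absolute gain in points, relative gain as a percent of base)."""
    delta = funnel - base
    return round(delta, 1), round(100 * delta / base, 1)


for (model, dataset), (base, funnel) in SCORES.items():
    points, percent = gain(base, funnel)
    print(f"{model:16s} {dataset:8s} +{points} points (+{percent}% relative)")
```

Relative gains look largest where the base score is lowest, so the absolute point gains are the safer figure to compare across models.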

By contrast, GQA gains are modest: around +1.0 to +1.2 over the base models in the paper’s main table. The paper interprets this as consistent with the mechanism: GQA often involves larger visual concepts in natural scenes, where the exact missing intermediate context is less central.

Recognition tasks show smaller gains too. Visual Funnel improves POPE, A-OKVQA, and VQAv2 only modestly. That is not a weakness of the paper. It is a useful boundary. The method is not claiming to be a universal charm spell sprinkled over every visual benchmark. It is targeted at cases where the model must connect a small visual detail to its surrounding structure.

Good. We have enough universal charm spells in AI already.

The Top-3 baseline is the most important comparison

The most revealing baseline is not the base model. It is not even the single-crop ViCrop baseline. It is ViCrop Top-3.

Why? Because ViCrop Top-3 gives the model multiple crops, using the same attention map, but without Visual Funnel’s hierarchical structure. This baseline asks the obvious skeptical question: maybe Visual Funnel wins just because it gives the model more visual tokens.

The answer appears to be no.

For Qwen2.5-VL-3B, Top-3 improves only slightly over ViCrop on TextVQA, moving from 76.0 to 76.7. On DocVQA, it moves from 54.2 to 55.3. Visual Funnel, using a structured three-crop portfolio, reaches 79.8 on TextVQA and 61.1 on DocVQA.

For LLaVA-1.5-7B, Top-3 can even underperform the single-crop version: TextVQA drops from 54.1 to 53.5, and DocVQA from 19.4 to 19.2. The paper calls this a “Redundancy Penalty.” The phrase is useful if treated carefully. It does not prove that more visual tokens are generally bad. It shows that unstructured overlapping visual information can fail to help, and may interfere, under this setup.
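
The comparison is easiest to see as margins over single-crop ViCrop. The short calculation below uses only scores quoted in this article; `margins` is an illustrative helper.

```python
# (ViCrop, ViCrop Top-3, Visual Funnel) scores quoted in this article.
QWEN = {"TextVQA": (76.0, 76.7, 79.8), "DocVQA": (54.2, 55.3, 61.1)}
LLAVA = {"TextVQA": (54.1, 53.5, 59.1), "DocVQA": (19.4, 19.2, 22.8)}


def margins(vicrop, top3, funnel):
    """Points gained over single-crop ViCrop by each three-crop variant."""
    return round(top3 - vicrop, 1), round(funnel - vicrop, 1)


for model, table in (("Qwen2.5-VL-3B", QWEN), ("LLaVA-1.5-7B", LLAVA)):
    for dataset, scores in table.items():
        top3_margin, funnel_margin = margins(*scores)
        print(f"{model:14s} {dataset:8s} Top-3 {top3_margin:+.1f}  Funnel {funnel_margin:+.1f}")
```

Both variants feed three crops, so the difference between the two margins is attributable to how the crops are organized rather than to how many there are.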

That is the operational lesson. The crop budget is not a shopping list. It is an information architecture problem.

| Reader belief | Paper’s correction | Practical replacement |
| --- | --- | --- |
| “If the model misses detail, crop tighter.” | Tight crops can remove the relational context needed for interpretation. | Preserve focal detail plus intermediate context. |
| “If one crop helps, three crops should help more.” | Unstructured Top-3 crops often add little and sometimes hurt. | Use hierarchical crops with distinct roles. |
| “The full image plus crop should be enough.” | The model may fail to bridge global and focal scales. | Add intermediate visual scales explicitly. |
| “This is just a localization problem.” | Better anchoring alone gives only marginal gains in the ablation. | Treat integration as the main design problem. |

That last row is especially important.

The ablation says integration is doing most of the work

The paper’s ablation study uses Qwen2.5-VL-3B-Instruct on DocVQA and InfoVQA. Its purpose is to separate the two components: Step 1, Contextual Anchoring, and Step 2, Entropy-Scaled Portfolio Generation.

| Configuration | DocVQA | InfoVQA | Likely purpose of the test | What it supports |
| --- | --- | --- | --- | --- |
| ViCrop baseline | 54.2 | 39.4 | Baseline comparison | Single-crop attention-guided zoom helps but remains limited |
| Visual Funnel without Step 2 | 55.1 | 40.3 | Ablation of portfolio construction | Better localization alone is only a small improvement |
| Visual Funnel without Step 1 | 59.8 | 47.9 | Ablation of specialized anchoring | Structured portfolio is the main driver |
| Full Visual Funnel | 61.1 | 49.6 | Main method | Anchoring plus structured portfolio works best together |

This is one of the cleanest parts of the paper. If the method’s value came mostly from finding a better crop, removing the portfolio step should still produce a strong gain. It does not. DocVQA moves from 54.2 to 55.1. InfoVQA moves from 39.4 to 40.3.

When the portfolio remains but the specialized anchoring step is removed, the gains are much larger: DocVQA reaches 59.8 and InfoVQA reaches 47.9. Full Visual Funnel improves further to 61.1 and 49.6.

The interpretation is straightforward: localization helps, but structured integration is doing the heavy lifting. This supports the article’s mechanism-first framing. The story is not “the model learned where to look.” The story is “the system learned how to package what the model sees.”
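
The ablation can be read as a rough decomposition of the gain over ViCrop. The sketch below does that arithmetic with the four reported scores. Treating the leftover as an "interaction" term is this article's simplification, not an analysis from the paper, and it ignores run-to-run noise.

```python
ABLATION = {
    # configuration: (DocVQA, InfoVQA)
    "vicrop": (54.2, 39.4),
    "anchoring_only": (55.1, 40.3),  # Visual Funnel without Step 2
    "portfolio_only": (59.8, 47.9),  # Visual Funnel without Step 1
    "full": (61.1, 49.6),
}


def decompose(column):
    """Gains over ViCrop: (anchoring, portfolio, full, leftover interaction)."""
    base = ABLATION["vicrop"][column]
    anchoring = ABLATION["anchoring_only"][column] - base
    portfolio = ABLATION["portfolio_only"][column] - base
    total = ABLATION["full"][column] - base
    interaction = total - anchoring - portfolio
    return tuple(round(v, 1) for v in (anchoring, portfolio, total, interaction))


for name, column in (("DocVQA", 0), ("InfoVQA", 1)):
    print(name, decompose(column))
```

On both datasets the portfolio step alone accounts for most of the total gain, and the anchoring step adds under one point on its own.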

For business AI systems, that difference is not academic. Many teams obsess over retrieval or localization—find the right page, find the right crop, find the right bounding box. Then they pass the result to a model and hope. This paper is a reminder that retrieval is not the same as interpretation. The handoff format matters.

The appendix tests robustness, not a second thesis

The appendix is useful because it answers two practical questions.

First: are the entropy-scaling parameters fragile? The paper varies sensitivity coefficients on DocVQA using Qwen2.5-VL-3B. A static fixed-size configuration gets 59.5. Weak adaptation gets 60.4. The default gets 61.1. Strong adaptation gets 60.8. This suggests the adaptive entropy idea helps, but the method does not collapse if the exact coefficients shift within a reasonable range.

Second: are the base expansion factors over-tuned? Tighter crops score 60.5, default scores 61.1, and wider crops score 60.9. Again, not a wild swing. The result supports stability rather than a magic-number story.

The portfolio-size ablation is more interesting for deployment. On DocVQA:

| Portfolio size | Configuration | Accuracy |
| --- | --- | --- |
| 0 | Original image only | 51.5 |
| 1 | Focal crop only | 55.1 |
| 2 | Focal + immediate context | 58.0 |
| 3 | Focal + immediate + broader context | 61.1 |
| 4 | Adds an even wider context crop | 60.7 |

The gain from one crop to three crops is substantial. The fourth crop does not help. This reinforces the same pattern: the value comes from building a useful hierarchy, not from endlessly appending visual material.
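
The marginal value of each added crop follows directly from the DocVQA portfolio-size numbers above. A small calculation, using only those reported scores:

```python
# DocVQA accuracy by portfolio size; list index = number of crops added.
ACCURACY_BY_SIZE = [51.5, 55.1, 58.0, 61.1, 60.7]


def marginal_gains(scores):
    """Change in accuracy contributed by each additional crop."""
    return [round(b - a, 1) for a, b in zip(scores, scores[1:])]


def best_size(scores):
    """Portfolio size with the highest reported accuracy."""
    return max(range(len(scores)), key=lambda k: scores[k])


print("marginal gains:", marginal_gains(ACCURACY_BY_SIZE))
print("best portfolio size:", best_size(ACCURACY_BY_SIZE))
```

Each of the first three crops adds roughly three points, and the fourth subtracts a little.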

The efficiency analysis should be read with more care because the HTML rendering does not expose every latency number cleanly. The paper reports average token counts and relative timing on DocVQA. The base model uses 450 tokens. ViCrop uses 780. ViCrop Top-3 uses 920. Visual Funnel uses 890 and reports a relative time of 1.98, while achieving 61.1 DocVQA accuracy. The paper argues that Visual Funnel is more efficient than naive multi-crop in accuracy per computational unit.
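
One rough way to read the efficiency claim is accuracy gained per 100 extra tokens over the base model. This metric is this article's own back-of-envelope framing, not the paper's definition of efficiency, and it assumes the reported token counts pair with the Qwen2.5-VL-3B DocVQA scores quoted earlier.

```python
RUNS = {
    # method: (average tokens, DocVQA accuracy)
    "base": (450, 51.5),
    "vicrop": (780, 54.2),
    "vicrop_top3": (920, 55.3),
    "visual_funnel": (890, 61.1),
}


def gain_per_100_tokens(method):
    """Accuracy points gained over the base model per 100 additional tokens."""
    base_tokens, base_acc = RUNS["base"]
    tokens, acc = RUNS[method]
    extra = tokens - base_tokens
    if extra <= 0:
        raise ValueError("method must use more tokens than the base model")
    return 100 * (acc - base_acc) / extra


for method in ("vicrop", "vicrop_top3", "visual_funnel"):
    print(f"{method:14s} {gain_per_100_tokens(method):.2f} points per 100 extra tokens")
```

Token count is only a proxy for cost. The extra anchoring pass shows up in wall-clock time rather than in this ratio, which is why the relative timing figure still matters.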

The cautious interpretation: Visual Funnel is not free. It is training-free, not latency-free. It adds inference overhead through an extra anchoring step and additional crop encoding. But for workflows where a wrong answer is expensive—document review, insurance forms, compliance tables, industrial inspection—the accuracy gain may be worth applying selectively.

What this means for business systems: add a context recovery layer

The direct paper result is about benchmark performance. The business inference is broader but bounded: many visual AI workflows should separate “where to look” from “how to present what was found.”

A practical architecture might look like this:

User question + image/document
Base multimodal model attempts answer or localization
Trigger condition: small text, low confidence, dense layout, table/chart, visual ambiguity
Context recovery layer:
  - attention or detector-based anchor
  - focal crop
  - immediate-context crop
  - broader-context crop
Final multimodal answer with structured visual portfolio
Optional verification: cite region, show crop evidence, ask for human review if unresolved
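
The routing logic in that outline might be sketched as follows. Everything here is hypothetical scaffolding: the `Signals` fields, the 0.7 confidence threshold, and the `base_model` and `build_portfolio` callables stand in for whatever model client and crop builder a real system would use.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Cheap cues about whether Contextual Blindness is plausible."""
    confidence: float          # calibrated answer confidence in [0, 1]
    has_small_text: bool
    is_table_or_chart: bool


def needs_context_recovery(signals, min_confidence=0.7):
    """Trigger condition; the threshold is an illustrative default."""
    return (signals.confidence < min_confidence
            or signals.has_small_text
            or signals.is_table_or_chart)


def answer(question, image, signals, base_model, build_portfolio):
    """Route to the plain model, or add a structured crop portfolio first.

    `base_model(question, images)` and `build_portfolio(question, image)`
    are caller-supplied callables, assumed here for illustration.
    """
    if not needs_context_recovery(signals):
        return base_model(question, [image])
    crops = build_portfolio(question, image)  # focal, immediate, broader
    return base_model(question, [image, *crops])
```

Keeping the trigger separate from the portfolio builder lets each be tuned and tested on its own, and keeps the extra latency confined to the inputs that need it.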

This is not a call to run Visual Funnel on every image. That would be lazy architecture in a nicer coat. The paper’s own evidence suggests the method is most valuable where Contextual Blindness is plausible.

| Use case | Why Visual Funnel-style design may help | Boundary |
| --- | --- | --- |
| Document AI | Values depend on headers, rows, labels, and nearby text | Multi-region questions may still be difficult |
| Infographic QA | Meaning is distributed across chart labels, legends, and values | Complex chart reasoning may require explicit symbolic extraction |
| UI understanding | Buttons and fields need local text plus screen layout | Dynamic UI states need interaction history, not just screenshots |
| Industrial inspection | Defects may be small but interpreted relative to surrounding structure | If the target is not localized, crop portfolios inherit the error |
| Retail shelf or logistics images | Small labels or objects need surrounding product/category context | Cluttered multi-object queries may exceed single-anchor design |

The likely return on investment is not “cheaper training.” Visual Funnel is training-free, which is nice, but the business value is more specific: it offers a way to improve high-precision visual reasoning without fine-tuning the base model. That makes it attractive as an inference wrapper around existing multimodal systems.

This matters for companies that cannot retrain large models every time a document format changes. A context recovery layer can be adjusted at inference time, tested on domain-specific validation sets, and selectively triggered. The cost is added latency and engineering complexity. The benefit is fewer failures where the model saw the right patch and still answered the wrong question. Those are the failures that make users lose trust fastest, because they are hard to explain politely.

What the paper does not prove

The paper is strongest on fine-grained visual QA tasks where one main region of interest can be localized and interpreted with nearby context. It is less conclusive outside that zone.

First, Visual Funnel depends on a reasonably useful initial attention map. If the model’s attention points to the wrong area, the portfolio will be beautifully structured around the wrong evidence. Elegant error, still error.

Second, the current formulation is centered on questions with a single focal region. Some real business tasks require synthesizing multiple spatially separate regions: compare three invoices, reconcile two distant table sections, inspect multiple defects, or match a legend with several chart components. Visual Funnel may help parts of those tasks, but it is not a full multi-region reasoning system by itself.

Third, the method adds inference overhead. The paper argues the trade-off is favorable for detail-heavy tasks, and the appendix supports that view on DocVQA. Still, latency-sensitive applications need selective triggering. A customer-facing mobile UI agent cannot afford to run every screenshot through heavy multi-crop processing just to identify a button that was already obvious.

Fourth, benchmark gains do not automatically translate into production reliability. Production documents are messy. They include scans, compression artifacts, rotated pages, inconsistent fonts, overlapping stamps, and human creativity, which remains the strongest adversarial attack against enterprise software. A Visual Funnel-style layer should therefore be evaluated on the actual workflow distribution, not only on public VQA benchmarks.

These limitations do not weaken the core point. They define the deployment envelope.

The bigger lesson: visual input is a product interface

The most useful insight from this paper is not that Visual Funnel exists. It is that visual input design is becoming an interface problem.

In text-based RAG systems, teams eventually learned that retrieval quality is not enough. Chunk size, context ordering, metadata, citations, and prompt layout all shape the final answer. Vision-language systems are now running into the same lesson, but with pixels. A crop is a visual chunk. A crop portfolio is a visual context window. Bad visual chunking can break reasoning even when the right evidence is technically present.

That analogy should feel familiar. In enterprise AI, “the information was included somewhere” has never been a sufficient defense. The model needs information in a structure it can use.

Visual Funnel’s contribution is to make that structure visible. It names the failure mode, proposes a mechanism, and tests the difference between raw visual quantity and hierarchical visual context. The evidence is especially persuasive because the Top-3 baseline does not rescue the naive “more crops” story. More visual material is not the same as better visual organization. Apparently, even machines dislike messy slide decks.

For Cognaptus readers, the practical takeaway is simple: when building multimodal workflows, do not treat cropping as a zoom button. Treat it as context engineering.

If the question depends on a small detail, ask what surrounding visual evidence gives that detail meaning. If the model needs to read a value, preserve the label. If it needs to classify an object state, preserve the nearby reference objects. If it needs to interpret a chart, preserve the axis, legend, and category structure. The model may not need more pixels everywhere. It may need the right pixels arranged at the right scales.

Contextual Blindness is a useful phrase because it captures a common production failure: the model sees, but it does not see through the structure. Visual Funnel is one proposed remedy. The broader lesson will likely outlive the method.

Cognaptus: Automate the Present, Incubate the Future.


  1. Woojun Jung, Jaehoon Go, Mingyu Jeon, Sunjae Yoon, and Junyeong Kim, “Visual Funnel: Resolving Contextual Blindness in Multimodal Large Language Models,” arXiv:2512.10362v2, 27 Apr. 2026, https://arxiv.org/abs/2512.10362