A visual AI system can fail in a very boring way: it sounds confident, answers fluently, and quietly forgets to look.

That is more dangerous than a spectacular hallucination. A spectacular hallucination at least waves a red flag. The boring version looks like normal enterprise automation: an insurance claim assessment, a warehouse inspection report, a medical-image triage note, a construction progress summary, a product-quality explanation. The system has an image. It has a question. It produces an answer. Somewhere inside the model, language did most of the work and vision became decorative evidence. Very modern. Very polished. Very capable of being wrong.

The arXiv paper “Look on Demand: A Cognitive Scheduling Framework for Visual Evidence Acquisition in Multimodal Reasoning” by Zhang et al. is useful because it does not treat this as a generic “models need better vision” problem.[1] Its argument is sharper: multimodal reasoning often fails because visual evidence is introduced at the wrong time, in the wrong form, or with the wrong control structure. The paper proposes CSMR, a cognitive scheduling framework where a language model acts as a reasoning core, dynamically asks an independent vision-language model for targeted visual evidence, and stops when enough evidence has been gathered.

That may sound like another tool-using agent wrapper. It is not quite that. The interesting part is the diagnosis. The authors are not simply saying, “Ask the image model more questions.” They are saying that two popular ways of doing multimodal reasoning create opposite but related failure modes. One compresses the image too early. The other keeps the image inside the model but lets language dominate the reasoning process. CSMR is a control-loop answer to both.

The real problem is not seeing; it is knowing when to look again

Most readers will approach this topic with a reasonable assumption: if a visual-language model is weak, improve the fusion between image and text. Add better cross-modal attention. Train a stronger VLM. Give the model more visual tokens. Make the architecture more end-to-end. That instinct is understandable. It is also the point this paper complicates.

The authors separate existing multimodal reasoning methods into two broad paradigms.

The first is text-centric pre-reasoning. The system converts the image into text before reasoning begins. Maybe it generates a caption. Maybe it creates question-relevant descriptions. Maybe it decomposes the problem into visual sub-questions and answers them. After that, the reasoning model mostly operates in language space.

The advantage is obvious: large language models are strong reasoners. The cost is also obvious, though often politely ignored: the image is compressed before the model knows which details will matter. A one-shot caption can mention “a truck carrying goods,” but later the reasoning task may require deciding whether the truck is transporting cargo or selling goods. The first description may be technically true and still strategically useless. This is the kind of information loss that survives demos and then bites you in production.

The second paradigm is unified multimodal representation. Image and text tokens are jointly processed inside one model. This avoids irreversible caption compression and, in principle, allows the model to reason directly over visual representations. The problem is that “in principle” does a lot of unpaid labor here. The paper argues that unified VLMs can be structurally biased toward language during answer generation. Visual evidence is present, but language priors can dominate attention and gradients.

The two failure modes are different, but they rhyme.

| Paradigm | What it tries to fix | Failure mode identified by the paper | Business symptom |
|---|---|---|---|
| Text-centric pre-reasoning | Use strong LLM reasoning after converting image to text | Visual details are compressed too early, before the reasoning path reveals what matters | The system misses decisive visual evidence because it was never captured in the initial description |
| Unified multimodal representation | Preserve direct image access during reasoning | Text tokens and language priors dominate attention, weakening visual grounding | The system gives plausible answers that reflect textual expectations more than the actual image |
| CSMR | Separate reasoning control from perception, then query vision on demand | Depends on dynamic scheduling quality and the perception module’s reliability | The system can inspect targeted visual evidence step by step, but may incur extra latency and orchestration cost |

This is why the paper’s framing matters. It moves the discussion from “Do we have visual input?” to “Is visual evidence being acquired at the right moment for the current reasoning state?” Those are not the same question. Many enterprise systems answer the first and never notice the second.

Unified VLMs can have images and still lean on language

The paper’s most useful diagnostic section comes before the method. It analyzes why unified multimodal reasoning may lose visual faithfulness even when the image remains inside the model.

The authors make two claims. First, the standard training objective does not explicitly force the visual encoder to preserve faithful image evidence. A model can reduce loss by using linguistic priors if the question, options, hints, or common patterns already make the answer guessable. Second, during generation, text tokens receive more influence than visual tokens, so the vision encoder may receive misleading updates dominated by language rather than by the image.

The appendix gives a clean supporting test. On a selected ScienceQA subset of 232 image-containing samples from grades 8–12, the authors evaluate Qwen3-VL-8B under four input configurations. With question, options, image, and hint, accuracy is 92.67%. Remove the hint but keep the image, and accuracy falls to 81.90%. Remove the image but keep the hint, and the model still reaches 68.10%. Remove both image and hint, and it still reaches 57.33%.
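
The four percentages above are easier to reason about as counts and differences. The short sketch below only re-derives figures from the numbers quoted in this article; the variable and function names are illustrative, and nothing here comes from the paper's code.

```python
# Accuracies (%) as quoted above for Qwen3-VL-8B on the 232-sample ScienceQA subset.
N_SAMPLES = 232
ACCURACY = {
    ("image", "hint"): 92.67,
    ("image", "no_hint"): 81.90,
    ("no_image", "hint"): 68.10,
    ("no_image", "no_hint"): 57.33,
}


def implied_correct(accuracy_percent, n=N_SAMPLES):
    """Convert a reported percentage back to a whole number of correct answers."""
    return round(accuracy_percent / 100 * n)


counts = {config: implied_correct(acc) for config, acc in ACCURACY.items()}

# Percentage points lost when the image is removed, with and without the hint.
image_drop_with_hint = round(
    ACCURACY[("image", "hint")] - ACCURACY[("no_image", "hint")], 2
)
image_drop_without_hint = round(
    ACCURACY[("image", "no_hint")] - ACCURACY[("no_image", "no_hint")], 2
)

# Fraction of the full-input accuracy that survives with neither image nor hint.
retained_without_image_or_hint = (
    ACCURACY[("no_image", "no_hint")] / ACCURACY[("image", "hint")]
)

for config, correct in counts.items():
    print(config, correct)
print(image_drop_with_hint, image_drop_without_hint)
print(round(retained_without_image_or_hint, 3))
```

Under these numbers, removing the image costs about 24.6 points (57 of 232 samples) whether or not the hint is present, and roughly 62% of the full-input accuracy survives with no image and no hint at all.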

This is not the main benchmark result. It is a diagnostic test. Its purpose is to show that language-only or mostly-language inference can remain surprisingly strong even in tasks that include visual inputs. That matters because a model trained under ordinary answer-prediction loss does not automatically learn to rely on image evidence whenever humans think the task is visual. The loss function rewards correct answers. It does not send a thank-you note to the pixels.

The attention analysis supports the same diagnosis from another angle. The paper studies attention allocation when generating the first output token. For Qwen3-VL-8B on the ScienceQA subset, text tokens consistently receive higher average pre-softmax attention scores than image tokens across all 35 transformer layers. The authors also check total pre-softmax attention, not only per-token averages, to address the concern that visual tokens may be diluted by token count. Text still receives stronger total attention in their Qwen3-VL analysis.

They then test whether this is merely a Qwen-VL artifact, since Qwen-VL compresses visual tokens. On LLaVA-1.6-7B, which does not use comparable visual token compression, they again find a text-oriented attention bias at the per-token level. Visual tokens dominate total attention only in early layers because there are many of them; deeper in the model, text attention overtakes.
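
The difference between per-token and total attention is easy to blur, so here is a toy illustration. The scores are invented for the example and are not taken from the paper; the function name `attention_summary` is hypothetical. The sketch only shows how one layer can favor text per token while still giving images the larger total, simply because there are more image tokens.

```python
def attention_summary(scores, is_image):
    """Return per-token mean and total score for image and text tokens."""
    image = [s for s, img in zip(scores, is_image) if img]
    text = [s for s, img in zip(scores, is_image) if not img]
    return {
        "image_mean": sum(image) / len(image),
        "text_mean": sum(text) / len(text),
        "image_total": sum(image),
        "text_total": sum(text),
    }


# Toy layer (made-up values): 8 image tokens with low scores,
# 2 text tokens with high scores.
scores = [1.0] * 8 + [3.0] * 2
is_image = [True] * 8 + [False] * 2

summary = attention_summary(scores, is_image)
print(summary)
```

In this toy layer text wins per token while images win in total, which is the early-layer pattern described for LLaVA-1.6-7B. Reporting both statistics is what lets the authors argue the bias is not just a token-count effect.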

The purpose of these appendix and attention tests is not to prove that all VLMs always ignore images. That would be too broad, and also too theatrical. Their narrower role is to justify the mechanism: multimodal reasoning can become language-dominant even when image tokens are technically available. If that is the pathology, then simply building a tighter unified model may not solve the whole problem.

CSMR turns visual perception into an on-demand service

CSMR’s architecture is simple enough to be useful and annoying enough to be interesting. It separates the system into two modules:

  1. A Cognitive Reasoning Core, or CRC, implemented as an LLM.
  2. A Primary Visual Perception Module, or PVP, implemented as an independent VLM.

The CRC receives the task and maintains a reasoning state. At each step, it decides whether to ask a visual question or produce the final answer. If it asks a visual question, the PVP looks at the original image and returns textualized visual evidence. The CRC integrates that evidence into its state and continues. The loop ends when the CRC decides it has enough evidence or when the token budget is reached.

A minimal version looks like this:

Question + image
      |
      v
CRC: What do I need to know visually?
      |
      |-- visual query --> PVP: inspect image and answer the query
      |<-- visual evidence --|
      |
CRC: update reasoning state
      |
      |-- ask again if evidence is insufficient
      |-- stop if evidence is sufficient
      v
Final answer
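
The same loop can be written as a short Python sketch. This is not the authors' implementation: `run_loop`, `ReasoningState`, the `("ask" | "answer", text)` decision format, and the step budget standing in for the token budget are all assumptions made for illustration. The CRC and PVP are replaced by toy functions so the control flow can run without any model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Assumed decision format: ("ask", visual_query) or ("answer", final_answer).
Decision = Tuple[str, str]


@dataclass
class ReasoningState:
    question: str
    evidence: List[Tuple[str, str]] = field(default_factory=list)  # (query, answer)


def run_loop(
    question: str,
    image: object,
    crc: Callable[[ReasoningState], Decision],
    pvp: Callable[[object, str], str],
    max_steps: int = 5,  # assumption: a step budget stands in for the token budget
) -> Tuple[Optional[str], ReasoningState]:
    state = ReasoningState(question)
    for _ in range(max_steps):
        action, text = crc(state)
        if action == "answer":
            return text, state  # flexible termination: the CRC chose to stop
        # The PVP always inspects the original image, not an earlier caption.
        state.evidence.append((text, pvp(image, text)))
    return None, state  # budget exhausted without a final answer


# Toy stand-ins so the sketch runs; real modules would be an LLM and a VLM.
def toy_crc(state: ReasoningState) -> Decision:
    if not state.evidence:
        return ("ask", "What is on the truck?")
    if len(state.evidence) == 1:
        cargo = state.evidence[0][1]
        # The second query is built from the first piece of evidence.
        return ("ask", f"Are there price signs near the {cargo}?")
    selling = state.evidence[-1][1].startswith("yes")
    return ("answer", "selling goods" if selling else "transporting cargo")


def toy_pvp(image: dict, query: str) -> str:
    return image.get(query, "not visible")


toy_image = {
    "What is on the truck?": "crates of fruit",
    "Are there price signs near the crates of fruit?": "yes, handwritten price signs",
}
answer, final_state = run_loop(
    "Is the truck selling goods or transporting cargo?", toy_image, toy_crc, toy_pvp
)
print(answer)
print(final_state.evidence)
```

The detail worth noticing is that the second query cannot be written until the first answer arrives. That dependency is the part a pre-planned list of questions cannot reproduce.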

The design has three operational consequences.

First, visual evidence is conditional. The system does not commit to all visual questions at the start. It asks after partial reasoning reveals what is missing.

Second, perception and reasoning are decoupled. The perception module is asked to answer targeted visual questions based on the original image. The reasoning model does not have to carry raw visual representations through long language chains.

Third, termination is flexible. The system can stop early when enough evidence has been gathered. That matters because “more reasoning” is not always better. Sometimes more reasoning is just a larger stage for the model to trip over its own elegant nonsense.

The key distinction from DDCoT-style decomposition is timing. DDCoT decomposes into sub-questions, then answers and integrates them. CSMR allows later questions to depend on earlier evidence and intermediate reasoning. That dynamic dependency is the mechanism the paper wants us to notice.

The benchmark gains are modest in some places, large in others, and directionally consistent

The main results compare CSMR against representative baselines on M3CoT, ScienceQA, and LLaVA-Bench In-the-Wild. The paper uses Qwen2-VL-7B-Instruct as the perception backbone and Qwen2-7B-Instruct as the reasoning backbone for CSMR. The evaluations are zero-shot.

The headline is that CSMR performs best across all three reported benchmarks:

| Method group | Method | M3CoT accuracy (%) | ScienceQA accuracy (%) | LLaVA-W ROUGE-L |
|---|---|---|---|---|
| Unified VLM-style baselines | No-CoT | 43.6 | 56.3 | 32.7 |
| Unified VLM-style baselines | Multimodal CoT | 40.1 | 51.3 | 30.7 |
| Unified VLM-style baselines | CCoT | 43.3 | 56.4 | 29.4 |
| Unified VLM-style baselines | SCAFFOLD | 41.7 | 53.7 | 31.8 |
| Unified VLM-style baselines | ICoT | 44.1 | 56.8 | 34.2 |
| Text-centric baselines | Caption | 40.9 | 67.7 | 29.1 |
| Text-centric baselines | DDCoT | 39.0 | 71.9 | 26.4 |
| Proposed framework | CSMR | 45.7 | 78.2 | 34.3 |

The magnitude needs careful reading.

On M3CoT, CSMR reaches 45.7%, ahead of ICoT at 44.1% and No-CoT at 43.6%. This is a real but not enormous improvement. It suggests that cognitive scheduling helps, but it does not transform the benchmark into a solved problem. Nobody should read 45.7% and immediately schedule a procurement celebration.

On ScienceQA, the gain is much larger. CSMR reaches 78.2%, compared with 71.9% for DDCoT and 67.7% for Caption. This is where the paper’s argument has more practical weight: dynamic visual acquisition plus a strong reasoning core can outperform both static captioning and pre-planned decomposition.

On LLaVA-W, CSMR reaches 34.3 ROUGE-L, almost tied with ICoT at 34.2 and above Caption and DDCoT. This benchmark is open-ended and scored with ROUGE-L, a text-overlap metric rather than a correctness check, so the result should be interpreted more cautiously than a clean multiple-choice accuracy score. Still, CSMR does not collapse when moved from structured QA to open-ended visual instruction following.
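
To see why a ROUGE-L score deserves that caution, here is a minimal sketch of the metric as it is commonly defined: an F-score over the longest common subsequence of tokens. The whitespace tokenization, lowercasing, and equal weighting of precision and recall are simplifying assumptions; the paper's exact ROUGE configuration is not described in this article.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 with whitespace tokens (a simplified, assumed configuration)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1(
    "the truck is selling fruit",
    "the truck is transporting fruit",
)
print(round(score, 2))
```

The two sentences disagree on the one word that matters and still share four of five tokens in order. Overlap metrics reward phrasing; they do not check whether the decisive visual fact is right.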

The business reading is not “CSMR dominates everything.” The better reading is: a training-free orchestration layer can produce consistent gains across different multimodal task types, and its strongest advantage appears when reasoning requires targeted visual evidence rather than a single generic image description.

The ablations show that scheduling is the contribution, not decoration

The ablation study is the most important evidence for the paper’s mechanism. The authors compare full CSMR with three variants on M3CoT:

| Variant | Likely purpose of test | M3CoT accuracy (%) | Time per sample | Interpretation |
|---|---|---|---|---|
| Full CSMR | Main mechanism | 45.7 | 24.34s | Dynamic visual querying plus flexible termination works best |
| Single-Query CRC | Ablation: remove iterative querying | 40.0 | 12.35s | One visual query is faster but loses much of the benefit |
| Pre-planned Visual Queries | Ablation: remove state-conditioned later querying | 40.1 | 23.36s | Asking multiple questions is not enough if all questions are decided before reasoning unfolds |
| Fixed-Step CRC | Ablation: remove flexible termination | 42.7 | 72.48s | More steps cost more and can reduce accuracy by adding redundant or distracting information |

This table is the paper’s strongest defense against a lazy interpretation. CSMR is not better merely because it asks the vision model more questions. If that were true, pre-planned queries or fixed steps should perform well. They do not.

The single-query variant drops from 45.7% to 40.0%, which supports the need for iterative evidence acquisition. The pre-planned variant lands at 40.1%, almost the same. That is the more interesting result: multiple visual queries without reasoning feedback are still weak. The system needs to let earlier answers reshape later questions.

The fixed-step variant reaches 42.7% but takes 72.48 seconds per sample, roughly three times the time of full CSMR. This is a useful reminder for enterprise teams: forcing a model to “think longer” can be both slower and worse. Reasoning loops need stopping rules. Otherwise, you do not get cognition; you get an expensive monologue.
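
The trade-off is clearer when every variant is expressed relative to full CSMR. The sketch below uses only the accuracy and timing figures from the ablation table; the dictionary layout and names are illustrative.

```python
# (M3CoT accuracy %, seconds per sample) as reported in the ablation table.
ablations = {
    "Full CSMR": (45.7, 24.34),
    "Single-Query CRC": (40.0, 12.35),
    "Pre-planned Visual Queries": (40.1, 23.36),
    "Fixed-Step CRC": (42.7, 72.48),
}

base_accuracy, base_time = ablations["Full CSMR"]

comparison = {
    name: {
        "accuracy_delta": round(accuracy - base_accuracy, 1),  # percentage points
        "time_ratio": round(seconds / base_time, 2),           # 1.0 = same cost
    }
    for name, (accuracy, seconds) in ablations.items()
}

for name, row in comparison.items():
    print(f"{name}: {row['accuracy_delta']:+.1f} pts, {row['time_ratio']:.2f}x time")
```

Read this way, the pre-planned variant pays about 96% of the time cost for 5.6 fewer points, and the fixed-step variant pays nearly three times the cost for 3.0 fewer points. Neither extra queries nor extra steps buy accuracy on their own.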

The hallucination analysis points to semantic drift control

The paper’s hallucination analysis compares CSMR with DDCoT on 200 randomly sampled M3CoT instances. Both methods involve multiple uses of visual perception, so the comparison is fairer than comparing CSMR with a single-pass caption baseline. The authors preserve full multi-round dialogues and use GPT-5 as an automatic evaluator to identify descriptions inconsistent with the image.

The result: CSMR increases the proportion of samples without hallucinations by 9 percentage points compared with DDCoT.

This should be treated as an exploratory extension, not as the final word on hallucination measurement. The sample size is limited, the evaluator is another model, and the paper reports the gain as a percentage-point improvement rather than giving every underlying evaluator detail in the main text. Still, it is aligned with the mechanism.
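
A rough sense of scale helps with the sample-size caveat. The sketch below computes the worst-case standard error of a single proportion measured on 200 samples, using the standard normal approximation. It is an assumption-laden yardstick, not a reanalysis: the article does not give the underlying proportions, and both methods were scored on the same instances, which a proper paired analysis would need to account for.

```python
import math


def worst_case_standard_error(n):
    """Standard error of a proportion at p = 0.5, where it is largest."""
    return 0.5 / math.sqrt(n)


n_samples = 200  # sampled M3CoT instances in the hallucination analysis
standard_error = worst_case_standard_error(n_samples)
half_width_95 = 1.96 * standard_error  # normal-approximation 95% half-width

print(round(standard_error * 100, 1), "percentage points")
print(round(half_width_95 * 100, 1), "percentage points")
```

At this sample size a single measured proportion can carry an uncertainty of several percentage points, which is the same order of magnitude as the reported 9-point gain. That supports reading the result as directional evidence.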

The case study explains why. DDCoT commits to static, parallel sub-questions before concrete evidence is obtained. In the example discussed by the authors, DDCoT keeps circling around the type of goods carried by a truck. CSMR first asks for necessary visual-semantic information, then shifts toward the discriminative issue: whether the scene indicates selling goods or transporting cargo. That progression matters. It narrows the decision space instead of expanding along a convenient but irrelevant semantic axis.

That is the paper’s practical hallucination story. Hallucination is not only a failure to “see” an object. It can be a failure to keep visual questioning aligned with the actual decision boundary. DDCoT may ask visual questions. CSMR tries to ask the next useful visual question.

The effectiveness-regime test tells us when this architecture is most useful

The paper also tests CSMR under different perception-reasoning backbone configurations on M3CoT. This is not just another leaderboard table. It is a regime test: when does centralized reasoning control help most?

| Perception backbone | Reasoning backbone | Caption | DDCoT | CSMR | CSMR time/sample |
|---|---|---|---|---|---|
| Qwen2-VL-7B | Qwen3-8B | 43.1 | 43.3 | 48.1 | 20.7s |
| Qwen3-VL-8B | Qwen3-8B | 45.2 | 50.3 | 53.6 | 23.1s |

In the first configuration, the reasoning backbone is substantially stronger than the perception backbone. CSMR beats DDCoT by 4.8 accuracy points. In the second, the perception and reasoning backbones are closer in effective reasoning capability, and the CSMR advantage narrows to 3.3 points.
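
The gaps quoted here, plus the ones against the Caption baseline that the text does not spell out, can be recomputed from the table. The snippet is plain arithmetic on the reported accuracies; the labels are illustrative.

```python
# M3CoT accuracy (%) per perception + reasoning configuration, from the table.
configurations = {
    "Qwen2-VL-7B + Qwen3-8B": {"Caption": 43.1, "DDCoT": 43.3, "CSMR": 48.1},
    "Qwen3-VL-8B + Qwen3-8B": {"Caption": 45.2, "DDCoT": 50.3, "CSMR": 53.6},
}

gaps = {
    name: {
        "csmr_minus_ddcot": round(row["CSMR"] - row["DDCoT"], 1),
        "csmr_minus_caption": round(row["CSMR"] - row["Caption"], 1),
    }
    for name, row in configurations.items()
}

for name, row in gaps.items():
    print(name, row)
```

The margin over DDCoT shrinks from 4.8 to 3.3 points with the stronger perception backbone, while the margin over one-shot captioning grows from 5.0 to 8.4. The narrowing is specific to the comparison with decomposition, where a stronger VLM can do more of the reasoning itself.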

The authors interpret this as evidence that CSMR is most valuable when the reasoning model is stronger than the perception model. That makes sense. If the VLM is already a strong autonomous reasoner, DDCoT can exploit more of that capability. If the VLM is weaker as a reasoner but still useful as a perceptual module, CSMR lets the stronger LLM take over global control.

This has direct product relevance. In many enterprise systems, the best available language model may be stronger at planning, decomposition, and decision logic than the vision model attached to a workflow. CSMR suggests a practical architecture: do not force the VLM to be the whole brain. Let it be the eyes. Use the LLM to decide when the eyes need to check again.

A small insult to monolithic model maximalism, perhaps. But a productive one.

What Cognaptus infers for enterprise visual AI

The paper directly shows zero-shot benchmark gains from a training-free cognitive scheduling framework, plus supporting ablations that tie the gains to dynamic querying and flexible termination. Cognaptus would translate this into a broader but bounded business principle:

For visual AI workflows, grounding is not only a model property. It is also an orchestration property.

That principle matters in several enterprise settings.

| Technical design in CSMR | Operational consequence | ROI relevance | Boundary |
|---|---|---|---|
| Separate reasoning core from perception module | Teams can upgrade reasoning and perception components independently | Avoids retraining a full multimodal model whenever reasoning requirements change | Integration quality and prompt reliability become important engineering risks |
| Ask visual questions dynamically | The system can inspect evidence after it knows what is missing | Better fit for claims review, inspection, compliance, QA, and triage tasks where decisive details emerge late | More calls can increase latency and cost |
| Preserve reasoning state | Later questions can depend on earlier evidence | Reduces semantic drift in multi-step visual decisions | Long context may still accumulate irrelevant evidence without good control |
| Flexible termination | The system stops when evidence is sufficient | Avoids the “always reason longer” cost trap | Stopping criteria are prompt-governed in the current implementation, not guaranteed by formal verification |
| Textualized visual evidence | Intermediate evidence is inspectable by humans and downstream systems | Easier audit trails than opaque latent fusion | Textual evidence can still be incomplete or wrong if the perception module misreads the image |

The clearest business use case is not a generic image chatbot. It is a workflow where the answer depends on targeted visual verification: checking whether damage exists in a specific region, whether a shelf arrangement violates a planogram, whether a construction site matches a milestone, whether a form image contains a missing stamp, or whether a safety image shows a relevant violation rather than a visually similar non-issue.

In those settings, a one-shot caption is too blunt. A fully fused VLM may be hard to audit. A CSMR-style loop gives the system a more operational shape: reason, inspect, update, inspect again, stop.
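
Because each inspection step produces text, the loop leaves behind something a reviewer can read. A minimal sketch of such an audit record follows; the `EvidenceRecord` schema and the JSON Lines format are assumptions chosen for illustration, not something the paper specifies.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class EvidenceRecord:
    step: int
    visual_query: str     # what the reasoning core asked
    visual_evidence: str  # what the perception module answered


def to_audit_log(records: List[EvidenceRecord]) -> str:
    """Serialize records as JSON Lines, one inspectable entry per query."""
    return "\n".join(json.dumps(asdict(r), ensure_ascii=False) for r in records)


# Hypothetical records for a form-checking workflow.
records = [
    EvidenceRecord(1, "Is there a stamp in the lower-right corner of the form?", "no stamp visible"),
    EvidenceRecord(2, "Is the signature field filled in?", "yes, a handwritten signature"),
]

log = to_audit_log(records)
print(log)
```

A log like this shows what was asked and what was reported, which is the part a human reviewer can contest. It does not show whether the perception module read the image correctly.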

This is not glamorous. It is closer to how competent human analysts work. They do not stare at an image once and then free-associate. They ask, “What would distinguish option A from option B?” Then they look for that detail. Then they revise. The paper’s contribution is to make that mundane discipline explicit in a model workflow.

The boundaries are practical, not cosmetic

The limitations are not fatal, but they are real enough to affect deployment decisions.

First, CSMR is training-free in the experiments, which is useful for rapid adoption but does not guarantee domain-level reliability. A hospital, insurer, manufacturer, or logistics operator would still need domain adaptation, validation data, and human review for high-stakes decisions. The paper’s impact statement correctly notes that training-free is a current implementation choice, not a structural ceiling. The CRC and PVP could be optimized independently or jointly later.

Second, CSMR introduces inference overhead. The main ablation table reports full CSMR at 24.34 seconds per sample on M3CoT under the reported setup. It is much faster than fixed-step CRC and faster than DDCoT in the later backbone-regime tests, but it is still slower than single-pass captioning. For many enterprise workflows, this is acceptable. For real-time robotics, high-throughput moderation, or low-margin inspection pipelines, it may not be.

Third, the hallucination analysis is promising but not definitive. It uses 200 sampled M3CoT instances and GPT-5 as an automatic evaluator. That is reasonable for exploratory analysis, but production hallucination control requires task-specific labels, evaluator calibration, and failure audits. Otherwise, one model grades another model and everyone pretends the spreadsheet is a courtroom.

Fourth, CSMR depends on the perception module answering targeted questions correctly. Dynamic querying cannot rescue a perception module that consistently misreads small objects, spatial relations, OCR, or domain-specific visual signs. It can ask better questions. It cannot manufacture visual truth.

Finally, the framework shifts some risk from model training to workflow design. Prompt formats, routing parsers, context accumulation, stopping behavior, logging, and evidence presentation become part of the system’s reliability surface. That is not a weakness. It is simply where the engineering work moves.
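
The routing parser is a good example of that relocated work. The sketch below assumes a hypothetical output convention in which the reasoning core prefixes its turn with `ASK:` or `ANSWER:`; the paper's actual prompt format is not described in this article. The point is only that malformed output needs an explicit path instead of a silent guess.

```python
import re
from typing import Tuple

# Hypothetical convention: the reasoning core replies "ASK: ..." or "ANSWER: ...".
_ROUTE = re.compile(r"^\s*(ASK|ANSWER)\s*:\s*(.+?)\s*$", re.IGNORECASE | re.DOTALL)


def parse_crc_output(text: str) -> Tuple[str, str]:
    """Return ("ask" | "answer" | "invalid", payload) for one model turn."""
    match = _ROUTE.match(text)
    if match is None:
        # Surface malformed output so the caller can retry or escalate.
        return ("invalid", text.strip())
    return (match.group(1).lower(), match.group(2))


print(parse_crc_output("ASK: Is there a stamp in the lower-right corner?"))
print(parse_crc_output("answer: B\n"))
print(parse_crc_output("I think the answer is probably B."))
```

Returning an explicit `invalid` route keeps the failure visible in logs. Whether the system then retries, falls back to a single-pass answer, or hands off to a human is a workflow decision, which is exactly the kind of risk the paragraph above describes.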

The paper is really about control, not captions

The temptation is to summarize CSMR as “LLM asks VLM questions.” That is true, but too thin. The more important idea is that visual evidence should be scheduled by the reasoning state.

That statement sounds small until you compare it with the usual alternatives. In one-shot captioning, visual evidence is gathered before reasoning reveals what is needed. In unified VLM reasoning, visual evidence remains inside the model but may be overpowered by language. In static decomposition, sub-questions are generated before evidence can reshape the search path. CSMR changes the order of operations: reason enough to know what to inspect, inspect the image, update the reasoning state, and stop when sufficient.

For enterprise AI, that is a more useful lesson than “bigger multimodal models are coming.” Bigger models are coming anyway. They are always coming. The question is whether the workflow makes them look at the right evidence at the right time.

The paper’s results do not prove that cognitive scheduling is the final architecture for multimodal reasoning. They do show that evidence timing is a real design variable, not a footnote. In visual AI systems, grounding is not achieved merely by attaching an image to a prompt. Grounding is maintained by controlling how visual facts enter the decision process.

The model should not just see. It should know when to look.

Cognaptus: Automate the Present, Incubate the Future.


  1. Yang Zhang, Xiaoshuai Sun, Rui Zhao, Wujin Sun, Yidong Chen, Jiayi Ji, Qian Chen, and Rongrong Ji, “Look on Demand: A Cognitive Scheduling Framework for Visual Evidence Acquisition in Multimodal Reasoning,” arXiv:2605.28160v1, 27 May 2026, https://arxiv.org/abs/2605.28160