A camera sees the scene. The model reads the question. The options look reasonable. One of them must be right.

That last sentence is the problem.

Many enterprise video-AI workflows are built around this quiet assumption. A model reviews a warehouse clip and chooses the most likely safety violation. It watches a customer interaction and classifies the complaint. It checks a manufacturing video and identifies the defect category. The system may be wrong, of course, but the menu is treated as complete. The correct answer is assumed to be hiding somewhere among the choices, waiting for the model to point at it with sufficient confidence.

The paper “When No Answer Is Correct: Diagnosing Absent Answer Detection for MLLMs in Video Understanding” attacks exactly that assumption.[1] Its core move is simple and rather rude, which is often how good diagnostics begin: take a standard video multiple-choice question, remove the correct answer, and see whether the model notices.

Most models do not.

They choose plausible distractors. They often do so with high confidence. Temporal questions make the problem worse. Adding more video frames can improve ordinary accuracy while making absent-answer detection worse. Chain-of-thought prompting helps, but not enough to make the failure disappear. So the business lesson is not “video models are bad.” That would be too easy, and also false. The sharper lesson is this: a model can become better at selecting from a candidate set while remaining poor at judging whether the candidate set deserves trust.

That distinction matters anywhere video AI is expected to support operational decisions rather than win a benchmark beauty contest.

The failure mechanism is not wrong vision; it is forced-choice behavior

The paper defines a diagnostic task called absent answer detection. Start with a video $V$, a question $q$, and a candidate set $C$ that originally contains the ground-truth answer $c^*$.

Then intervene:

$$ \tilde{C} = C \setminus \{c^*\} $$

The video and question remain unchanged. Only the candidate set is changed. A reliable model should now reject the remaining options because none of them is correct.

The authors compare an ordinary baseline with three detection settings:

| Setting | What changes | Metric | Likely purpose |
| --- | --- | --- | --- |
| Baseline | Original answer set includes the correct answer | ACC | Main reference point for ordinary video QA ability |
| Multi-choice detection | Correct answer removed; “None of the above” added | MCDR | Main evidence for whether an explicit rejection option is used |
| Open-ended detection | Correct answer removed; model is told it may answer none | OEDR | Test of active rejection without merely choosing a listed NOTA option |
| Unprompted detection | Correct answer removed; no cue is given | UDR | Stress test for spontaneous doubt under the standard protocol |
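The intervention is easy to reproduce on any in-house multiple-choice set. The following is a minimal sketch, not the authors' code: the `MCQItem` type, the function names, the NOTA wording, and the open-ended instruction text are all assumptions made for illustration.

```python
from dataclasses import dataclass, replace
from typing import List, Optional

NOTA = "None of the above"  # assumed wording; the paper's exact option text may differ


@dataclass(frozen=True)
class MCQItem:
    question: str
    options: List[str]
    answer: Optional[str]   # ground-truth option text, or None once it has been removed
    instruction: str = ""   # extra cue shown to the model, if any


def remove_answer(item: MCQItem) -> MCQItem:
    """Apply C~ = C \\ {c*}: drop the ground-truth option, keep everything else."""
    if item.answer not in item.options:
        raise ValueError("the ground-truth answer must be one of the options")
    remaining = [o for o in item.options if o != item.answer]
    return replace(item, options=remaining, answer=None)


def multi_choice_detection(item: MCQItem) -> MCQItem:
    """Correct answer removed, explicit rejection option appended."""
    stripped = remove_answer(item)
    return replace(stripped, options=stripped.options + [NOTA], answer=NOTA)


def open_ended_detection(item: MCQItem) -> MCQItem:
    """Correct answer removed, model told it may reject (hypothetical cue text)."""
    cue = "If none of the options is correct, say so."
    return replace(remove_answer(item), instruction=cue)


def unprompted_detection(item: MCQItem) -> MCQItem:
    """Correct answer removed, no cue of any kind."""
    return remove_answer(item)
```

The video and the question are never touched in any variant, which is what lets a drop in performance be attributed to the candidate set alone.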

This design is useful because it separates two abilities that ordinary benchmarks tend to merge:

  1. Can the model pick the right answer when the right answer is present?
  2. Can the model recognize that the right answer is absent?

The paper shows that these are not the same skill. This is the first mechanism to keep in mind. Standard multiple-choice evaluation rewards selection. Absent-answer detection tests whether selection is conditional on evidence. Those are different behavioral regimes.
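To make the separation concrete, the four metrics can be computed as independent rates over the same items. This is a rough sketch under stated assumptions: the record keys (`baseline_correct`, `chose_nota`, and so on) are hypothetical field names, and how a free-text response is judged to be a rejection is left to whatever grader the team uses.

```python
from typing import Dict, List


def rate(flags: List[bool]) -> float:
    """Percentage of True values; 0.0 for an empty list."""
    return 100.0 * sum(flags) / len(flags) if flags else 0.0


def score(records: List[Dict[str, bool]]) -> Dict[str, float]:
    """Each record holds one item's outcome under all four settings."""
    return {
        # Baseline: did the model pick the correct option when it was present?
        "ACC": rate([r["baseline_correct"] for r in records]),
        # Multi-choice detection: did it pick the listed NOTA option?
        "MCDR": rate([r["chose_nota"] for r in records]),
        # Open-ended detection: did it reject after being told it may?
        "OEDR": rate([r["rejected_open_ended"] for r in records]),
        # Unprompted detection: did it reject with no cue at all?
        "UDR": rate([r["rejected_unprompted"] for r in records]),
    }
```

Reporting the four numbers side by side, rather than folding them into one score, is the point: a high ACC next to a near-zero UDR is the pattern the paper describes.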

In business language: there is a difference between an AI system that can classify an incident and an AI system that knows when the incident does not fit the classification menu. The second capability is less glamorous. Naturally, it is also where production systems quietly break.

The main tables show a gap between accuracy and abstention

On VideoMME, models can show respectable baseline accuracy while detecting absent answers poorly. Gemini-2.5-Flash reaches 68.9% baseline accuracy, but only 33.9% MCDR, 43.6% OEDR, and 2.4% UDR. Qwen3-VL reaches 67.0% baseline accuracy, but only 17.4% MCDR, 16.2% OEDR, and 0.7% UDR. InternVL3.5 reaches 65.3% baseline accuracy, but 13.6% MCDR, 6.5% OEDR, and 0.0% UDR.

The EgoSchema results tell the same unpleasant story. InternVL3 gets 79.2% baseline accuracy but only 9.6% MCDR, 2.2% OEDR, and 0.0% UDR. MiMo-VL is a partial exception on multi-choice detection, with 40.6% MCDR on EgoSchema, but its open-ended detection is still only 17.4%, and unprompted detection is 0.0%.

The exact model ranking is not the main point. Model rankings are entertaining, but so are horoscopes if one is tired enough. The more useful pattern is structural:

| Observation | Interpretation | Business meaning |
| --- | --- | --- |
| Baseline ACC can be much higher than MCDR or OEDR | Knowing how to answer does not imply knowing when no answer is valid | Accuracy dashboards can overstate decision reliability |
| UDR is near zero across almost all models | Models rarely question the completeness of the option set without an explicit cue | Abstention will not appear magically in production unless designed |
| OEDR is often lower than MCDR | Generating a rejection is harder than selecting a visible rejection option | “The model may refuse” is weaker than an engineered refusal path |
| Some high OEDR results require inspection | A number can be inflated by unstable generation behavior | Evaluation must include qualitative failure audits, not only aggregate metrics |

The most severe result is the unprompted setting. When the correct answer is removed and the model is simply asked to choose from the remaining options, detection is almost nonexistent. That means the default behavior is not “check whether the premise is valid.” The default behavior is “choose the least wrong option.”

For enterprise use, this is the difference between a system that says “this looks like Class B” and one that says “none of our known classes explain this clip.” The first response is convenient. The second response is often what the organization actually needed.

“None of the above” is treated like a decorative option

A tempting fix is to add “None of the above.” The paper shows that this helps, but not enough.

The multi-choice detection setting gives the model a visible NOTA option. Detection improves compared with the unprompted setting, but remains weak. Many models still choose a distractor. This suggests that NOTA is not treated as a serious semantic candidate. It is present, but not operationally alive.

The appendix confidence analysis explains why. The authors compare confidence distributions under the normal baseline and the multi-choice detection setting. The distributions substantially overlap. Removing the correct answer does not meaningfully reduce the model’s confidence in its selected option. In a second confidence analysis, they compare the probability assigned to NOTA with the probability assigned to the chosen distractor. NOTA receives very low probability mass, while the chosen distractor receives high confidence.
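A team with access to per-option logits can run a similar comparison on its own items. The sketch below is an assumption-laden illustration, not the paper's procedure: it treats a softmax over option logits as "confidence", the function and key names are hypothetical, and the numbers in the usage note are invented.

```python
import math
from typing import Dict


def softmax(logits: Dict[str, float]) -> Dict[str, float]:
    """Convert option logits into probabilities (numerically stable)."""
    peak = max(logits.values())
    exps = {k: math.exp(v - peak) for k, v in logits.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


def confidence_report(
    baseline_logits: Dict[str, float],
    detection_logits: Dict[str, float],
    nota_key: str = "NOTA",
) -> Dict[str, object]:
    """Compare confidence with the answer present vs. removed-plus-NOTA."""
    base = softmax(baseline_logits)
    det = softmax(detection_logits)
    # Highest-probability option that is not the rejection option.
    distractors = {k: p for k, p in det.items() if k != nota_key}
    top_distractor = max(distractors, key=distractors.get)
    return {
        "baseline_confidence": max(base.values()),
        "detection_confidence": max(det.values()),
        "nota_probability": det[nota_key],
        "top_distractor": top_distractor,
        "top_distractor_probability": det[top_distractor],
    }
```

With made-up logits such as `{"A": 3.0, "B": 0.5, "C": 0.2, "D": 0.1}` at baseline and `{"B": 2.8, "C": 0.3, "D": 0.1, "NOTA": -1.0}` after removing A, the report shows nearly unchanged top confidence and a tiny NOTA probability, which is the shape of failure the appendix describes.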

This appendix evidence is not a second thesis. Its purpose is diagnostic: it explains the behavioral mechanism behind the main tables. The failure is not merely that models sometimes miss NOTA. The failure is that models often continue to act as though the best distractor is a legitimate answer.

That is why a UI-level “allow refusal” switch is not enough. If the model’s internal preference is still to satisfy the answer-selection pattern, NOTA becomes a polite suggestion taped to the emergency exit. Nice font. Poor evacuation performance.

Temporal questions make distractors more dangerous

The paper then moves from general video QA to temporal subsets of VideoMME. This is not just a category breakdown. It tests a plausible mechanism: in video, wrong answers can be close to the truth in time.

A temporal distractor may describe something that happened just before the relevant moment, just after it, or in a causally related part of the scene. These options are not random nonsense. They are attractive wrong answers.

The results support the concern. On VideoMME temporal perception and temporal reasoning subsets, many open-source models show lower absent-answer detection than their overall results. For example, Qwen3-VL has 80.0% / 55.4% ACC on temporal perception / temporal reasoning, but only 9.1% / 2.3% MCDR and 7.3% / 4.0% OEDR. InternVL3.5 has 76.4% / 48.0% ACC, but only 3.6% / 3.4% MCDR and 1.8% / 0.6% OEDR. Even where accuracy remains decent, rejection remains fragile.

The mechanism is straightforward. Video adds temporal structure. Temporal structure creates near-miss answers. Near-miss answers give a forced-choice model something plausible to grab.

For business workflows, that matters more than it first appears. Many high-value video tasks are temporal by nature:

| Workflow | Temporal ambiguity | Absent-answer risk |
| --- | --- | --- |
| Workplace safety review | A near-accident may resemble a violation but not meet the definition | Model assigns the wrong violation category |
| Manufacturing inspection | A transient artifact may look like a defect event | Model forces a defect label when the issue is outside the taxonomy |
| Training and compliance review | A later corrective action may be confused with the original behavior | Model summarizes the wrong procedural failure |
| Retail or security monitoring | Similar events occur in sequence | Model selects the closest scripted scenario instead of flagging uncertainty |
| Insurance or claims review | Relevant evidence may occur outside the sampled segment | Model chooses the most plausible claim type anyway |

The paper directly tests benchmark video QA, not these enterprise workflows. The Cognaptus inference is that any production video system with a closed label menu inherits a similar risk pattern when temporal near-misses are common. The boundary matters: this paper does not prove failure rates in factories, hospitals, warehouses, or insurers. It gives a diagnostic mechanism those deployments should test before pretending their label menu is complete.

More frames can improve answer selection while worsening abstention

The frame-sampling result is the paper’s most interesting trap for practical readers.

A common intuition is that more visual evidence should produce more faithful understanding. In ordinary accuracy terms, the paper observes that increasing sampled frames improves baseline performance across the tested models shown in Figure 2. So far, so comforting.

Then comes the twist: MCDR and OEDR decrease with denser frame sampling.

This is a sensitivity test on input density, and its purpose is to separate candidate matching from candidate-set verification. More frames help the model match options to evidence. But in the absent-answer setting, stronger matching can make the best distractor more seductive. The model sees more material, finds more partial alignment, and becomes less willing to reject the menu.

That sounds paradoxical only if we assume that evidence automatically produces skepticism. It does not. Evidence can also produce better rationalization.

For enterprise systems, this result is important because “more context” is one of the most common procurement promises in AI. More frames, longer clips, richer metadata, longer context windows, more sensor streams. All useful, sometimes necessary. But the paper suggests that richer input can improve classification confidence without improving abstention.

So the operational question should not be:

Did adding more frames improve accuracy?

It should be:

Did adding more frames improve the system’s ability to reject all wrong options?

Those are different metrics. A model can pass the first and fail the second. That is how a better video model becomes a more confident operational nuisance. Progress, apparently.
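One way to keep both questions in view is to log accuracy and a detection rate at every frame budget and flag the steps where they move in opposite directions. A minimal sketch, assuming a hypothetical `sweep` mapping from frame count to an `(accuracy, detection_rate)` pair measured on the team's own evaluation set:

```python
from typing import Dict, List, Tuple


def diverging_steps(sweep: Dict[int, Tuple[float, float]]) -> List[Tuple[int, int]]:
    """Return (fewer_frames, more_frames) pairs where accuracy rose
    while absent-answer detection fell."""
    frames = sorted(sweep)
    flagged = []
    for lo, hi in zip(frames, frames[1:]):
        acc_lo, det_lo = sweep[lo]
        acc_hi, det_hi = sweep[hi]
        if acc_hi > acc_lo and det_hi < det_lo:
            flagged.append((lo, hi))
    return flagged
```

For example, with invented measurements `{8: (60.0, 20.0), 16: (63.0, 18.0), 32: (65.0, 18.5)}` the function flags only the step from 8 to 16 frames. Any flagged step is a case where "more context" bought better selection at the cost of worse rejection.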

Chain-of-thought helps because it changes the decision procedure

The authors test chain-of-thought prompting as a mitigation strategy. The prompt asks the model to analyze the video, evaluate each candidate option against the video content, and then decide whether any option fully and correctly answers the question.

This is best read as an intervention test. It asks whether the failure is partly procedural: if the model is forced into per-option verification, can it recover some absent-answer detection capability?

The answer is yes, partially.

| Model | MCDR baseline (%) | MCDR with CoT (%) | OEDR baseline (%) | OEDR with CoT (%) |
| --- | --- | --- | --- | --- |
| InternVL3.5 | 13.6 | 25.7 | 6.5 | 18.6 |
| Qwen3-VL | 17.4 | 48.2 | 16.2 | 49.9 |

The improvement is large, especially for Qwen3-VL. But the endpoint is still not robust: even the best result, 49.9% OEDR for Qwen3-VL, leaves roughly half of absent-answer cases undetected, and InternVL3.5 stays below 26% on both metrics. The paper also notes that CoT adds inference cost, which matters in video systems where latency and compute are not decorative accounting details.

The mechanism here is worth stating carefully. CoT does not simply make the model “smarter.” It changes the decision procedure from direct candidate selection to sequential candidate verification. That matters because absent-answer detection is not only a perception task. It is a verification task over a set of possible answers.
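The shift from selection to verification can be seen in how a prompt is assembled. The template below is a paraphrase written for illustration, not the paper's prompt: the step wording, the `NONE` keyword, and the `build_verification_prompt` name are all assumptions.

```python
from typing import List


def build_verification_prompt(question: str, options: List[str]) -> str:
    """Assemble a prompt that forces per-option checking before a final choice."""
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    if len(options) > len(letters):
        raise ValueError("too many options to label with single letters")
    lines = [
        "Watch the video, then work through these steps.",
        "1. Describe what the video shows that is relevant to the question.",
        "2. For each option, state whether the video fully supports it, and why.",
        "3. Decide whether any option fully and correctly answers the question.",
        "4. Answer with one option letter, or with NONE if no option qualifies.",
        "",
        f"Question: {question}",
    ]
    # Label the candidates A, B, C, ... in the order given.
    lines += [f"{letter}. {text}" for letter, text in zip(letters, options)]
    return "\n".join(lines)
```

The design choice worth noticing is step 3: the model is asked a question about the whole candidate set before it is allowed to commit to a member of it.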

For businesses, the lesson is not “turn on chain-of-thought and go home.” The better lesson is that the architecture should contain an explicit verification stage. That stage may be prompted, trained, rule-assisted, calibrated, or delegated to a separate model. The key is that it should ask a different question from the classifier:

Does any available option meet the evidence threshold?

Without that question, the model is mostly just browsing the buffet.

The appendix tests why the headline numbers should be trusted carefully

The appendices matter because they separate three things that are easy to confuse: implementation details, explanatory evidence, and quality control.

| Paper component | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Prompt templates | Implementation detail | The settings differ only in candidate set and instruction | That prompts are optimal for all models |
| Confidence distributions | Mechanism diagnosis | Models remain confident in distractors and underweight NOTA | A complete causal account of training dynamics |
| Statistical association test | Robustness / explanatory support | Models are more likely to select NOTA when they answered correctly at baseline | That knowing the answer is sufficient for abstention |
| Outlier analysis of Qwen2.5-Omni | Quality-control check | High OEDR can reflect generation degeneration rather than genuine detection | That all high detection scores are invalid |
| Future-work discussion | Research implication | Benchmarks and training data should include absent-answer cases | A tested training-level solution |

The statistical analysis is especially useful. The authors test whether selecting NOTA is associated with answering the original baseline question correctly. The association is statistically significant across models, with odds ratios from 2.7 to 5.5. So models are more likely to detect absence when they had genuine baseline knowledge.

But the effect sizes are small. Cramér’s $\phi$ ranges from 0.13 to 0.26. In plain language: knowing the correct answer helps, but it does not solve the problem. The model may know what should have been there and still choose a distractor when it is removed.
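A sizeable odds ratio and a small $\phi$ can coexist, and a toy 2x2 table shows how. The counts in the usage note are invented for illustration and are not from the paper; the function uses the standard formulas for the odds ratio and the phi coefficient of a 2x2 contingency table.

```python
import math
from typing import Tuple


def odds_ratio_and_phi(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Association measures for a 2x2 table laid out as:

                          chose NOTA   chose a distractor
        baseline correct      a               b
        baseline wrong        c               d
    """
    if b * c == 0:
        raise ValueError("odds ratio is undefined when b or c is zero")
    odds_ratio = (a * d) / (b * c)
    # Phi coefficient: normalised difference of the diagonal products.
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    phi = (a * d - b * c) / denom
    return odds_ratio, phi
```

With hypothetical counts `a=60, b=340, c=10, d=190`, the odds ratio is about 3.35 while $\phi$ is about 0.15. The odds of choosing NOTA are several times higher when the model knew the answer, yet 340 of those 400 items still end in a distractor, so baseline knowledge explains little of the overall behavior.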

This is a good example of a result that should not be flattened into a slogan. The paper does not say models have zero sensitivity to absence. It says the sensitivity is weak, insufficient, and easily dominated by forced-choice behavior.

The outlier analysis matters for a different reason. Qwen2.5-Omni reports a high OEDR of 61.7% on VideoMME, far above most models. The appendix says this should be interpreted cautiously because the model exhibits severe generation degeneration, including hallucinated multi-turn dialogue patterns and repetitive loops. That makes the high score less trustworthy as evidence of deliberate absent-answer detection.

This is the evaluation version of checking whether the student got the answer right because they understood calculus or because the grading script exploded. Tedious, yes. Necessary, also yes.

What the paper directly shows, and what businesses should infer

The paper directly shows a diagnostic failure in evaluated video MLLMs: when the correct answer is removed from a candidate set, models usually select plausible distractors rather than recognizing that no valid option exists. It shows this under multiple evaluation settings, across VideoMME and EgoSchema, with temporal tasks and frame-sampling density adding important stress tests. It also shows that CoT prompting improves detection but does not make it reliable.

Cognaptus infers a business design principle:

Video AI systems should be evaluated not only on whether they choose the right label when it exists, but on whether they refuse the label menu when it does not.

That principle affects evaluation, system design, and governance.

1. Evaluation should include no-valid-answer cases

A video-AI benchmark for enterprise use should include cases where all options are wrong. This is not a philosophical luxury. It is the only way to measure whether the model can abstain appropriately.

A useful evaluation set should contain at least three classes of cases:

| Case type | Example | What it tests |
| --- | --- | --- |
| Standard answerable case | The correct defect category is present | Ordinary classification ability |
| Missing-label case | The observed defect is outside the taxonomy | Abstention from incomplete menus |
| Insufficient-evidence case | The video segment does not show enough information | Evidence-based refusal |

The third case is distinct from the paper’s main setting. In the paper, the question remains answerable from the video, but the correct option is removed. In insufficient-evidence cases, the video itself may not support an answer. Production systems need both tests.
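A small harness can keep the three case types apart when results are reported. This is a sketch with hypothetical names (`CASE_TYPES`, `behaviour_by_case_type`); it only checks whether the system abstained when it should have, not whether a chosen label was correct.

```python
from typing import Dict, Iterable, Optional, Tuple

CASE_TYPES = ("answerable", "missing_label", "insufficient_evidence")


def behaviour_by_case_type(
    results: Iterable[Tuple[str, bool]],
) -> Dict[str, Optional[float]]:
    """results: (case_type, abstained) pairs.

    Returns, per case type, the percentage of items where the system
    abstained exactly when it should have; None if the type is missing
    from the evaluation set.
    """
    tallies = {t: [0, 0] for t in CASE_TYPES}  # [appropriate, total]
    for case_type, abstained in results:
        should_abstain = case_type != "answerable"
        tallies[case_type][0] += int(abstained == should_abstain)
        tallies[case_type][1] += 1
    return {
        t: (100.0 * ok / n if n else None) for t, (ok, n) in tallies.items()
    }
```

A `None` in the output is itself a finding: it means the evaluation set never asked the question.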

2. Confidence calibration must include distractor confidence, not only answer confidence

The appendix confidence results show that models can maintain high confidence after the correct answer is removed. That means ordinary confidence thresholds may be misleading. A high-confidence answer can be a high-confidence distractor.

A better evaluation should ask:

  • How much probability is assigned to rejection?
  • How does confidence change when the correct option is removed?
  • Does the model distinguish exact evidence from partial temporal resemblance?
  • Does confidence drop when all options are semantically close but wrong?

The point is not to worship calibration plots. The point is to prevent a model from mistaking “best available option” for “valid answer.”

3. Per-option evidence checks should be explicit

The CoT result suggests that per-option verification helps. In production, this does not have to mean exposing long reasoning text to users. It means the system should internally require evidence for each candidate.

A practical pattern might look like this:

  1. Generate candidate labels.
  2. Retrieve or identify video evidence for each candidate.
  3. Score whether each candidate is fully supported, partially supported, contradicted, or unsupported.
  4. Accept a label only if support clears a threshold.
  5. Return “no valid label” or escalate to human review when all candidates fail.

This is slower than direct classification. It is also less likely to confidently misfile reality under the nearest dropdown option.
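The last two steps of that pattern, accepting a label only above a threshold and escalating otherwise, can be sketched as follows. The support scores are assumed to come from some upstream verifier that is not shown, and the `Decision` type, the `decide` function, and the 0.8 threshold are illustrative assumptions rather than recommended values.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Decision:
    label: Optional[str]  # None means "no valid label"
    escalate: bool        # True means route to human review
    reason: str


def decide(support: Dict[str, float], threshold: float = 0.8) -> Decision:
    """support maps each candidate label to an evidence score in [0, 1]."""
    if not support:
        return Decision(None, True, "no candidates were generated")
    best = max(support, key=support.get)
    if support[best] >= threshold:
        return Decision(best, False, f"'{best}' cleared the threshold")
    # Best-available is not the same as valid: refuse the whole menu.
    return Decision(None, True, "no candidate cleared the threshold")
```

For instance, `decide({"blocked exit": 0.55, "missing PPE": 0.40})` returns no label and requests escalation, whereas a plain argmax classifier would have filed the clip under "blocked exit".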

4. Abstention should be trained, not merely permitted

The paper’s future-work section points to training data that rewards appropriate abstention. This matters because “the model is allowed to say no” is weaker than “the model has been optimized to say no when evidence is insufficient.”

For enterprise procurement, this becomes a concrete question for vendors:

Was the model or application layer evaluated on no-valid-answer video cases, and what was the abstention performance?

A vendor that answers only with ordinary accuracy, F1, or benchmark rank has not answered the question. They have changed the subject, perhaps very smoothly.

Boundaries: this is a diagnostic study, not a deployment guarantee

The paper is valuable because it isolates a failure mechanism. It should not be overextended.

First, the evaluated tasks are benchmark video QA tasks built from VideoMME and EgoSchema. The results do not automatically quantify error rates in specific industries such as manufacturing, retail security, insurance, or clinical operations.

Second, the paper tests prompting as a mitigation but does not test training-level interventions. It argues that training objectives should include absent-answer-aware cases, but it does not show a trained solution.

Third, model development moves quickly. The evaluated models are representative, not eternal monuments. Future architectures may behave differently. That does not weaken the diagnostic logic. It means each new system should be tested under the same kind of intervention instead of being trusted because the model card sounds expensive.

Fourth, absent-answer detection is only one form of reliability. A production video system also needs temporal localization, data governance, privacy controls, human review design, domain-specific taxonomies, and monitoring after deployment. The paper does not cover all of that. It covers a narrower and very important question: what happens when the correct answer is not available?

That narrow question is enough to embarrass many evaluation pipelines.

The practical checklist: do not let the menu define reality

A useful enterprise video-AI evaluation should include the following checks before deployment:

| Governance question | Why it matters |
| --- | --- |
| Does the evaluation set include cases where no listed option is correct? | Measures absent-answer detection directly |
| Are temporal near-miss distractors included? | Tests the hardest plausible wrong answers |
| Is abstention measured separately from accuracy? | Prevents ordinary accuracy from hiding forced-choice behavior |
| Does confidence fall when the correct answer is removed? | Tests whether confidence reflects evidence, not menu pressure |
| Are per-option evidence checks required before final classification? | Reduces best-distractor selection |
| Are high detection scores audited for generation degeneration or formatting artifacts? | Prevents misleading metric gains |
| Is there an escalation path when no option is valid? | Turns abstention into an operational workflow |

The key design change is simple: the candidate menu should be treated as a hypothesis set, not a law of nature.

That sounds obvious. It is not how many systems behave.

Conclusion: the right answer is sometimes missing

This paper is useful because it makes a hidden assumption measurable. Standard video QA benchmarks ask whether a model can select the correct answer when the correct answer is present. The authors ask what happens when it is absent.

The answer is uncomfortable. Models often choose plausible distractors. They rarely abstain without explicit cues. Temporal tasks make the problem harder. More frames can improve ordinary accuracy while weakening absent-answer detection. Chain-of-thought prompting helps by forcing per-option verification, but it remains an incomplete mitigation.

For business readers, the takeaway is not that video AI should be avoided. It is that video AI should be tested for a behavior that ordinary accuracy does not measure: the ability to reject a bad option set.

A model that always chooses from the menu is not necessarily intelligent. Sometimes it is just very obedient to a menu written by someone who forgot reality has edge cases.

And reality, as usual, did not attend the vendor demo.

Cognaptus: Automate the Present, Incubate the Future.


  1. Yiheng Wang, Yueqian Lin, Lichen Zhu, Yudong Liu, Hai “Helen” Li, and Yiran Chen, “When No Answer Is Correct: Diagnosing Absent Answer Detection for MLLMs in Video Understanding,” arXiv:2606.08239, 2026. https://arxiv.org/abs/2606.08239