None Taken: Why Video AI Must Learn When No Answer Is Correct

A camera sees the scene. The model reads the question. The options look reasonable. One of them must be right.

That last sentence is the problem.

Many enterprise video-AI workflows are built around this quiet assumption. A model reviews a warehouse clip and chooses the most likely safety violation. It watches a customer interaction and classifies the complaint. It checks a manufacturing video and identifies the defect category. The system may be wrong, of course, but the menu is treated as complete. The correct answer is assumed to be hiding somewhere among the choices, waiting for the model to point at it with sufficient confidence.

The paper When No Answer Is Correct: Diagnosing Absent Answer Detection for MLLMs in Video Understanding attacks exactly that assumption.¹ Its core move is simple and rather rude, which is often how good diagnostics begin: take a standard video multiple-choice question, remove the correct answer, and see whether the model notices.

Most models do not.

They choose plausible distractors. They often do so with high confidence. Temporal questions make the problem worse. Adding more video frames can improve ordinary accuracy while making absent-answer detection worse. Chain-of-thought prompting helps, but not enough to make the failure disappear. So the business lesson is not “video models are bad.” That would be too easy, and also false. The sharper lesson is this: a model can become better at selecting from a candidate set while remaining poor at judging whether the candidate set deserves trust.

That distinction matters anywhere video AI is expected to support operational decisions rather than win a benchmark beauty contest.

The failure mechanism is not wrong vision; it is forced-choice behavior

The paper defines a diagnostic task called absent answer detection. Start with a video $V$, a question $q$, and a candidate set $C$ that originally contains the ground-truth answer $c^\ast$.

Then intervene:

$$ \tilde{C} = C \setminus \{c^\ast\} $$

The video and question remain unchanged. Only the candidate set is changed. A reliable model should now reject the remaining options because none of them is correct.
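As a minimal sketch of that intervention, assuming each benchmark item is stored as a small record: the `MCQItem` class, its field names, and the sample clip are illustrative and do not come from the paper.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MCQItem:
    """One video multiple-choice item (hypothetical schema)."""
    video_id: str
    question: str
    options: tuple[str, ...]
    answer: str  # ground-truth option text, c* in the notation above


def remove_correct_answer(item: MCQItem) -> MCQItem:
    """Return a copy whose candidate set is C minus {c*}.

    The video reference and the question are left untouched.
    """
    if item.answer not in item.options:
        raise ValueError("ground-truth answer is not among the options")
    remaining = tuple(o for o in item.options if o != item.answer)
    return replace(item, options=remaining)


# Made-up example item.
item = MCQItem(
    video_id="clip_001",
    question="What does the worker do after closing the gate?",
    options=("Picks up a box", "Starts the forklift",
             "Leaves the bay", "Checks a clipboard"),
    answer="Starts the forklift",
)
ablated = remove_correct_answer(item)
print(ablated.options)
```

Every option left in `ablated.options` is wrong by construction, so the only correct behavior on this item is to reject the menu.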

The authors compare an ordinary baseline with three detection settings:

Setting	What changes	Metric	Likely purpose
Baseline	Original answer set includes the correct answer	ACC	Main reference point for ordinary video QA ability
Multi-choice detection	Correct answer removed; “None of the above” added	MCDR	Main evidence for whether an explicit rejection option is used
Open-ended detection	Correct answer removed; model is told it may answer none	OEDR	Test of active rejection without merely choosing a listed NOTA option
Unprompted detection	Correct answer removed; no cue is given	UDR	Stress test for spontaneous doubt under the standard protocol

This design is useful because it separates two abilities that ordinary benchmarks tend to merge:

Can the model pick the right answer when the right answer is present?
Can the model recognize that the right answer is absent?

The paper shows that these are not the same skill. This is the first mechanism to keep in mind. Standard multiple-choice evaluation rewards selection. Absent-answer detection tests whether selection is conditional on evidence. Those are different behavioral regimes.

In business language: there is a difference between an AI system that can classify an incident and an AI system that knows when the incident does not fit the classification menu. The second capability is less glamorous. Naturally, it is also where production systems quietly break.

The main tables show a gap between accuracy and abstention

On VideoMME, models can show respectable baseline accuracy while detecting absent answers poorly. Gemini-2.5-Flash reaches 68.9% baseline accuracy, but only 33.9% MCDR, 43.6% OEDR, and 2.4% UDR. Qwen3-VL reaches 67.0% baseline accuracy, but only 17.4% MCDR, 16.2% OEDR, and 0.7% UDR. InternVL3.5 reaches 65.3% baseline accuracy, but 13.6% MCDR, 6.5% OEDR, and 0.0% UDR.

The EgoSchema results tell the same unpleasant story. InternVL3 gets 79.2% baseline accuracy but only 9.6% MCDR, 2.2% OEDR, and 0.0% UDR. Mimo-VL is a partial exception on multi-choice detection, with 40.6% MCDR on EgoSchema, but its open-ended detection is still only 17.4%, and unprompted detection is 0.0%.
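To make the size of the gap concrete, the short sketch below subtracts each detection rate from baseline accuracy. The numbers are the ones quoted above; the helper name `abstention_gaps` is illustrative.

```python
# Percentages copied from the VideoMME and EgoSchema figures quoted above.
reported = {
    "Gemini-2.5-Flash (VideoMME)": {"ACC": 68.9, "MCDR": 33.9, "OEDR": 43.6, "UDR": 2.4},
    "Qwen3-VL (VideoMME)": {"ACC": 67.0, "MCDR": 17.4, "OEDR": 16.2, "UDR": 0.7},
    "InternVL3.5 (VideoMME)": {"ACC": 65.3, "MCDR": 13.6, "OEDR": 6.5, "UDR": 0.0},
    "InternVL3 (EgoSchema)": {"ACC": 79.2, "MCDR": 9.6, "OEDR": 2.2, "UDR": 0.0},
}


def abstention_gaps(scores):
    """Percentage-point gap between baseline accuracy and each detection rate."""
    return {m: round(scores["ACC"] - scores[m], 1) for m in ("MCDR", "OEDR", "UDR")}


gaps = {name: abstention_gaps(s) for name, s in reported.items()}
for name, gap in gaps.items():
    print(f"{name}: {gap}")
```

Under these figures, every listed model loses at least 25 points between answering and abstaining, and the unprompted gap is nearly the entire baseline accuracy.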

The exact model ranking is not the main point. Model rankings are entertaining, but so are horoscopes if one is tired enough. The more useful pattern is structural:

Observation	Interpretation	Business meaning
Baseline ACC can be much higher than MCDR or OEDR	Knowing how to answer does not imply knowing when no answer is valid	Accuracy dashboards can overstate decision reliability
UDR is near zero across almost all models	Models rarely question the completeness of the option set without an explicit cue	Abstention will not appear magically in production unless designed
OEDR is often lower than MCDR	Generating a rejection is harder than selecting a visible rejection option	“The model may refuse” is weaker than an engineered refusal path
Some high OEDR results require inspection	A number can be inflated by unstable generation behavior	Evaluation must include qualitative failure audits, not only aggregate metrics

The most severe result is the unprompted setting. When the correct answer is removed and the model is simply asked to choose from the remaining options, detection is almost nonexistent. That means the default behavior is not “check whether the premise is valid.” The default behavior is “choose the least wrong option.”

For enterprise use, this is the difference between a system that says “this looks like Class B” and one that says “none of our known classes explain this clip.” The first response is convenient. The second response is often what the organization actually needed.

“None of the above” is treated like a decorative option

A tempting fix is to add “None of the above.” The paper shows that this helps, but not enough.

The multi-choice detection setting gives the model a visible NOTA option. Detection improves compared with the unprompted setting, but remains weak. Many models still choose a distractor. This suggests that NOTA is not treated as a serious semantic candidate. It is present, but not operationally alive.

The appendix confidence analysis explains why. The authors compare confidence distributions under the normal baseline and the multi-choice detection setting. The distributions substantially overlap. Removing the correct answer does not meaningfully reduce the model’s confidence in its selected option. In a second confidence analysis, they compare the probability assigned to NOTA with the probability assigned to the chosen distractor. NOTA receives very low probability mass, while the chosen distractor receives high confidence.
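A rough sketch of the second comparison, assuming per-option logits can be read from the model: the logit values, the `"NOTA"` key, and the function names are hypothetical, and this is not the authors' analysis code.

```python
import math


def softmax(logits):
    """Turn option logits into probabilities (max subtracted for stability)."""
    m = max(logits.values())
    exps = {k: math.exp(v - m) for k, v in logits.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


def nota_vs_distractor(logits, nota_key="NOTA"):
    """Compare probability mass on NOTA with the most probable distractor."""
    probs = softmax(logits)
    distractors = {k: p for k, p in probs.items() if k != nota_key}
    chosen = max(distractors, key=distractors.get)
    return {"chosen": chosen, "p_chosen": probs[chosen], "p_nota": probs[nota_key]}


# Hypothetical logits for one item whose correct answer was removed.
example_logits = {"A": 3.1, "B": 0.4, "C": -0.2, "NOTA": -1.0}
report = nota_vs_distractor(example_logits)
print(report)
```

With these made-up logits the distractor takes most of the probability mass and NOTA almost none, which is the shape of failure the appendix describes.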

This appendix evidence is not a second thesis. Its purpose is diagnostic: it explains the behavioral mechanism behind the main tables. The failure is not merely that models sometimes miss NOTA. The failure is that models often continue to act as though the best distractor is a legitimate answer.

That is why a UI-level “allow refusal” switch is not enough. If the model’s internal preference is still to satisfy the answer-selection pattern, NOTA becomes a polite suggestion taped to the emergency exit. Nice font. Poor evacuation performance.

Temporal questions make distractors more dangerous

The paper then moves from general video QA to temporal subsets of VideoMME. This is not just a category breakdown. It tests a plausible mechanism: in video, wrong answers can be close to the truth in time.

A temporal distractor may describe something that happened just before the relevant moment, just after it, or in a causally related part of the scene. These options are not random nonsense. They are attractive wrong answers.

The results support the concern. On VideoMME temporal perception and temporal reasoning subsets, many open-source models show lower absent-answer detection than their overall results. For example, Qwen3-VL has 80.0% / 55.4% ACC on temporal perception / temporal reasoning, but only 9.1% / 2.3% MCDR and 7.3% / 4.0% OEDR. InternVL3.5 has 76.4% / 48.0% ACC, but only 3.6% / 3.4% MCDR and 1.8% / 0.6% OEDR. Even where accuracy remains decent, rejection remains fragile.

The mechanism is straightforward. Video adds temporal structure. Temporal structure creates near-miss answers. Near-miss answers give a forced-choice model something plausible to grab.

For business workflows, that matters more than it first appears. Many high-value video tasks are temporal by nature:

Workflow	Temporal ambiguity	Absent-answer risk
Workplace safety review	A near-accident may resemble a violation but not meet the definition	Model assigns the wrong violation category
Manufacturing inspection	A transient artifact may look like a defect event	Model forces a defect label when the issue is outside the taxonomy
Training and compliance review	A later corrective action may be confused with the original behavior	Model summarizes the wrong procedural failure
Retail or security monitoring	Similar events occur in sequence	Model selects the closest scripted scenario instead of flagging uncertainty
Insurance or claims review	Relevant evidence may occur outside the sampled segment	Model chooses the most plausible claim type anyway

The paper directly tests benchmark video QA, not these enterprise workflows. The Cognaptus inference is that any production video system with a closed label menu inherits a similar risk pattern when temporal near-misses are common. The boundary matters: this paper does not prove failure rates in factories, hospitals, warehouses, or insurers. It gives a diagnostic mechanism those deployments should test before pretending their label menu is complete.

Chain-of-thought helps because it changes the decision procedure

The authors test chain-of-thought prompting as a mitigation strategy. The prompt asks the model to analyze the video, evaluate each candidate option against the video content, and then decide whether any option fully and correctly answers the question.

This is best read as an intervention test. It asks whether the failure is partly procedural: if the model is forced into per-option verification, can it recover some absent-answer detection capability?

The answer is yes, partially.

Model	MCDR baseline	MCDR with CoT	OEDR baseline	OEDR with CoT
InternVL3.5	13.6	25.7	6.5	18.6
Qwen3-VL	17.4	48.2	16.2	49.9

The improvement is large, especially for Qwen3-VL. But the endpoint is still not robust: even the best CoT result catches just under half of the absent-answer cases, and InternVL3.5 stays below 26%. The paper also notes that CoT adds inference cost, which matters in video systems where latency and compute are not decorative accounting details.
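The arithmetic behind "large but not enough" can be spelled out from the table. The sketch below uses only the four pairs of numbers reported there; `cot_gain` is an illustrative helper name.

```python
# (without CoT, with CoT) in percent, copied from the table above.
cot_results = {
    "InternVL3.5": {"MCDR": (13.6, 25.7), "OEDR": (6.5, 18.6)},
    "Qwen3-VL": {"MCDR": (17.4, 48.2), "OEDR": (16.2, 49.9)},
}


def cot_gain(before, after):
    """Absolute gain, relative gain, and the share of cases still missed."""
    return {
        "points": round(after - before, 1),
        "ratio": round(after / before, 2),
        "still_missed": round(100 - after, 1),
    }


gains = {
    model: {metric: cot_gain(*pair) for metric, pair in metrics.items()}
    for model, metrics in cot_results.items()
}
for model, metrics in gains.items():
    for metric, g in metrics.items():
        print(model, metric, g)
```

Qwen3-VL roughly triples its OEDR, yet it still misses about half of the absent-answer cases. InternVL3.5 nearly doubles its MCDR and still misses about three quarters.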

The mechanism here is worth stating carefully. CoT does not simply make the model “smarter.” It changes the decision procedure from direct candidate selection to sequential candidate verification. That matters because absent-answer detection is not only a perception task. It is a verification task over a set of possible answers.

For businesses, the lesson is not “turn on chain-of-thought and go home.” The better lesson is that the architecture should contain an explicit verification stage. That stage may be prompted, trained, rule-assisted, calibrated, or delegated to a separate model. The key is that it should ask a different question from the classifier:

Does any available option meet the evidence threshold?

Without that question, the model is mostly just browsing the buffet.

The appendix tests why the headline numbers should be trusted carefully

The appendices matter because they separate three things that are easy to confuse: implementation details, explanatory evidence, and quality control.

Paper component	Likely purpose	What it supports	What it does not prove
Prompt templates	Implementation detail	The settings differ only in candidate set and instruction	That prompts are optimal for all models
Confidence distributions	Mechanism diagnosis	Models remain confident in distractors and underweight NOTA	A complete causal account of training dynamics
Statistical association test	Robustness / explanatory support	Models are more likely to select NOTA when they answered correctly at baseline	That knowing the answer is sufficient for abstention
Outlier analysis of Qwen2.5-Omni	Quality-control check	High OEDR can reflect generation degeneration rather than genuine detection	That all high detection scores are invalid
Future-work discussion	Research implication	Benchmarks and training data should include absent-answer cases	A tested training-level solution

The statistical analysis is especially useful. The authors test whether selecting NOTA is associated with answering the original baseline question correctly. The association is statistically significant across models, with odds ratios from 2.7 to 5.5. So models are more likely to detect absence on items they answered correctly at baseline.

But the effect sizes are small. Cramér’s $\phi$ ranges from 0.13 to 0.26. In plain language: knowing the correct answer helps, but it does not solve the problem. The model may know what should have been there and still choose a distractor when it is removed.
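An odds ratio above 3 can sit alongside a small effect size, which is easy to check on a 2×2 table. The counts below are hypothetical and not taken from the paper. The sketch also assumes the reported statistic is the standard phi coefficient for a 2×2 table, which the paper may compute differently.

```python
import math


def odds_ratio_and_phi(a, b, c, d):
    """Association measures for a 2x2 table.

    a: baseline correct,   chose NOTA
    b: baseline correct,   chose a distractor
    c: baseline incorrect, chose NOTA
    d: baseline incorrect, chose a distractor
    """
    odds_ratio = (a * d) / (b * c)
    phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return odds_ratio, phi


# Hypothetical counts chosen only to illustrate the pattern.
odds, phi = odds_ratio_and_phi(a=150, b=500, c=30, d=320)
share_missed = 500 / (150 + 500)  # baseline-correct items that still drew a distractor
print(f"odds ratio={odds:.2f}, phi={phi:.2f}, missed among correct={share_missed:.0%}")
```

In this made-up table the odds of choosing NOTA are about three times higher when the model answered correctly at baseline, yet roughly three quarters of those items still end in a distractor. That is how a clear odds ratio can coexist with a weak overall association.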

This is a good example of a result that should not be flattened into a slogan. The paper does not say models have zero sensitivity to absence. It says the sensitivity is weak, insufficient, and easily dominated by forced-choice behavior.

The outlier analysis matters for a different reason. Qwen2.5-Omni reports a high OEDR of 61.7% on VideoMME, far above most models. The appendix says this should be interpreted cautiously because the model exhibits severe generation degeneration, including hallucinated multi-turn dialogue patterns and repetitive loops. That makes the high score less trustworthy as evidence of deliberate absent-answer detection.

This is the evaluation version of checking whether the student got the answer right because they understood calculus or because the grading script exploded. Tedious, yes. Necessary, also yes.

What the paper directly shows, and what businesses should infer

The paper directly shows a diagnostic failure in evaluated video MLLMs: when the correct answer is removed from a candidate set, models usually select plausible distractors rather than recognizing that no valid option exists. It shows this under multiple evaluation settings, across VideoMME and EgoSchema, with temporal tasks and frame-sampling density adding important stress tests. It also shows that CoT prompting improves detection but does not make it reliable.

Cognaptus infers a business design principle:

Video AI systems should be evaluated not only on whether they choose the right label when it exists, but on whether they refuse the label menu when it does not.

That principle affects evaluation, system design, and governance.

1. Evaluation should include no-valid-answer cases

A video-AI benchmark for enterprise use should include cases where all options are wrong. This is not a philosophical luxury. It is the only way to measure whether the model can abstain appropriately.

A useful evaluation set should contain at least three classes of cases:

Case type	Example	What it tests
Standard answerable case	The correct defect category is present	Ordinary classification ability
Missing-label case	The observed defect is outside the taxonomy	Abstention from incomplete menus
Insufficient-evidence case	The video segment does not show enough information	Evidence-based refusal

The third case is distinct from the paper’s main setting. In the paper, the question remains answerable from the video, but the correct option is removed. In insufficient-evidence cases, the video itself may not support an answer. Production systems need both tests.
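One way to keep the three case types from blurring into a single accuracy number is to tag each case and score the types separately. The sketch below assumes a simple record layout; the `ABSTAIN` sentinel, the field names, and the defect labels are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

ABSTAIN = "NO_VALID_LABEL"  # hypothetical sentinel for "no listed option applies"


@dataclass(frozen=True)
class EvalCase:
    case_type: str  # "answerable", "missing_label", or "insufficient_evidence"
    expected: str   # gold label, or ABSTAIN when no listed option is valid
    predicted: str  # what the system returned


def score_by_case_type(cases):
    """Fraction of cases handled correctly, reported per case type."""
    totals, hits = defaultdict(int), defaultdict(int)
    for c in cases:
        totals[c.case_type] += 1
        hits[c.case_type] += int(c.predicted == c.expected)
    return {t: hits[t] / totals[t] for t in totals}


# Made-up predictions for five cases.
cases = [
    EvalCase("answerable", "scratch", "scratch"),
    EvalCase("answerable", "dent", "scratch"),
    EvalCase("missing_label", ABSTAIN, "dent"),
    EvalCase("missing_label", ABSTAIN, ABSTAIN),
    EvalCase("insufficient_evidence", ABSTAIN, "scratch"),
]
scores = score_by_case_type(cases)
print(scores)
```

Reporting the three rates side by side keeps a strong answerable score from masking a zero on the abstention cases.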

2. Confidence calibration must include distractor confidence, not only answer confidence

The appendix confidence results show that models can maintain high confidence after the correct answer is removed. That means ordinary confidence thresholds may be misleading. A high-confidence answer can be a high-confidence distractor.

A better evaluation should ask:

How much probability is assigned to rejection?
How does confidence change when the correct option is removed?
Does the model distinguish exact evidence from partial temporal resemblance?
Does confidence drop when all options are semantically close but wrong?

The point is not to worship calibration plots. The point is to prevent a model from mistaking “best available option” for “valid answer.”
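The second question in that list can be turned into a simple paired check. The sketch assumes the system exposes a top-option confidence for each item before and after the correct option is removed; the confidences and the 0.10 threshold are illustrative, not recommended values.

```python
from statistics import mean


def confidence_shift(pairs, min_drop=0.10):
    """Summarize how confidence moves when the correct option is removed.

    pairs: (confidence on the original item,
            confidence on the same item with the correct option removed)
    """
    drops = [before - after for before, after in pairs]
    unmoved = sum(1 for d in drops if d < min_drop)
    return {"mean_drop": mean(drops), "share_unmoved": unmoved / len(drops)}


# Hypothetical top-option confidences for four items.
pairs = [(0.92, 0.90), (0.88, 0.61), (0.95, 0.93), (0.81, 0.84)]
shift = confidence_shift(pairs)
print(shift)
```

A high `share_unmoved` means a confidence threshold on the selected option would wave most distractors through.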

3. Per-option evidence checks should be explicit

The CoT result suggests that per-option verification helps. In production, this does not have to mean exposing long reasoning text to users. It means the system should internally require evidence for each candidate.

A practical pattern might look like this:

Generate candidate labels.
Retrieve or identify video evidence for each candidate.
Score whether each candidate is fully supported, partially supported, contradicted, or unsupported.
Accept a label only if support clears a threshold.
Return “no valid label” or escalate to human review when all candidates fail.
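The steps above can be sketched as a small verification loop. Candidate generation and evidence retrieval are stubbed out as a caller-supplied `score_support` function returning a support score in [-1, 1]. That function, the thresholds, the sentinel value, and the example labels are all assumptions made for illustration.

```python
NO_VALID_LABEL = "no_valid_label"  # hypothetical sentinel that triggers human review


def grade(score, accept_at=0.8, partial_at=0.4):
    """Map a support score in [-1, 1] to one of the four verdicts."""
    if score >= accept_at:
        return "fully_supported"
    if score >= partial_at:
        return "partially_supported"
    if score < 0:
        return "contradicted"
    return "unsupported"


def verify_candidates(candidates, score_support, accept_at=0.8):
    """Accept the best fully supported label, otherwise return NO_VALID_LABEL."""
    scored = {label: score_support(label) for label in candidates}
    verdicts = {label: grade(s, accept_at) for label, s in scored.items()}
    accepted = [label for label, v in verdicts.items() if v == "fully_supported"]
    if not accepted:
        return NO_VALID_LABEL, verdicts
    return max(accepted, key=scored.get), verdicts


# Stub scores standing in for a real evidence model.
weak = {"blocked exit": 0.55, "missing helmet": 0.10, "forklift speeding": -0.30}
label_weak, verdicts_weak = verify_candidates(weak, weak.get)

strong = {"blocked exit": 0.91, "missing helmet": 0.10, "forklift speeding": -0.30}
label_strong, _ = verify_candidates(strong, strong.get)
print(label_weak, verdicts_weak, label_strong)
```

The design choice that matters is the early return: a partially supported candidate is recorded in the verdicts but never promoted to a final label just because it is the best one available.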

This is slower than direct classification. It is also less likely to confidently misfile reality under the nearest dropdown option.

4. Abstention should be trained, not merely permitted

The paper’s future-work section points to training data that rewards appropriate abstention. This matters because “the model is allowed to say no” is weaker than “the model has been optimized to say no when evidence is insufficient.”

For enterprise procurement, this becomes a concrete question for vendors:

Was the model or application layer evaluated on no-valid-answer video cases, and what was the abstention performance?

A vendor that answers only with ordinary accuracy, F1, or benchmark rank has not answered the question. They have changed the subject, perhaps very smoothly.

Boundaries: this is a diagnostic study, not a deployment guarantee

The paper is valuable because it isolates a failure mechanism. It should not be overextended.

First, the evaluated tasks are benchmark video QA tasks built from VideoMME and EgoSchema. The results do not automatically quantify error rates in specific industries such as manufacturing, retail security, insurance, or clinical operations.

Second, the paper tests prompting as a mitigation but does not test training-level interventions. It argues that training objectives should include absent-answer-aware cases, but it does not show a trained solution.

Third, model development moves quickly. The evaluated models are representative, not eternal monuments. Future architectures may behave differently. That does not weaken the diagnostic logic. It means each new system should be tested under the same kind of intervention instead of being trusted because the model card sounds expensive.

Fourth, absent-answer detection is only one form of reliability. A production video system also needs temporal localization, data governance, privacy controls, human review design, domain-specific taxonomies, and monitoring after deployment. The paper does not cover all of that. It covers a narrower and very important question: what happens when the correct answer is not available?

That narrow question is enough to embarrass many evaluation pipelines.

The practical checklist: do not let the menu define reality

A useful enterprise video-AI evaluation should include the following checks before deployment:

Governance question	Why it matters
Does the evaluation set include cases where no listed option is correct?	Measures absent-answer detection directly
Are temporal near-miss distractors included?	Tests the hardest plausible wrong answers
Is abstention measured separately from accuracy?	Prevents ordinary accuracy from hiding forced-choice behavior
Does confidence fall when the correct answer is removed?	Tests whether confidence reflects evidence, not menu pressure
Are per-option evidence checks required before final classification?	Reduces best-distractor selection
Are high detection scores audited for generation degeneration or formatting artifacts?	Prevents misleading metric gains
Is there an escalation path when no option is valid?	Turns abstention into an operational workflow

The key design change is simple: the candidate menu should be treated as a hypothesis set, not a law of nature.

That sounds obvious. It is not how many systems behave.

Conclusion: the right answer is sometimes missing

This paper is useful because it makes a hidden assumption measurable. Standard video QA benchmarks ask whether a model can select the correct answer when the correct answer is present. The authors ask what happens when it is absent.

The answer is uncomfortable. Models often choose plausible distractors. They rarely abstain without explicit cues. Temporal tasks make the problem harder. More frames can improve ordinary accuracy while weakening absent-answer detection. Chain-of-thought prompting helps by forcing per-option verification, but it remains an incomplete mitigation.

For business readers, the takeaway is not that video AI should be avoided. It is that video AI should be tested for a behavior that ordinary accuracy does not measure: the ability to reject a bad option set.

A model that always chooses from the menu is not necessarily intelligent. Sometimes it is just very obedient to a menu written by someone who forgot reality has edge cases.

And reality, as usual, did not attend the vendor demo.

Cognaptus: Automate the Present, Incubate the Future.

¹ Yiheng Wang, Yueqian Lin, Lichen Zhu, Yudong Liu, Hai “Helen” Li, and Yiran Chen, “When No Answer Is Correct: Diagnosing Absent Answer Detection for MLLMs in Video Understanding,” arXiv:2606.08239, 2026. https://arxiv.org/abs/2606.08239
