Opening — Why this matters now

Multimodal models have become unnervingly confident readers of documents. Hand them a PDF, a scanned exam paper, or a photographed worksheet, and they will happily extract text, diagrams, and even implied structure. The problem is not what they can read. It is what they refuse to admit they cannot read.

In real classrooms, mathematics exam papers are not pristine artifacts. They are scribbled on, folded, stained, partially photographed, and occasionally vandalized by enthusiastic graders. Yet most document benchmarks still assume a polite world where inputs are complete and legible. This gap matters. An AI system that confidently invents missing math questions is not merely wrong—it is operationally dangerous.

Background — Clean documents, dirty reality

Most document understanding benchmarks optimize for layout parsing, OCR accuracy, or downstream question answering. They treat documents as collections of blocks, tokens, or regions. This abstraction works for invoices and forms. It breaks down for mathematics.

Math questions are units. A stem, options, formulas, and diagrams form a single semantic object. Remove one piece and the rest may become meaningless. Prior benchmarks largely ignore this unit-level integrity and almost never ask the model a harder question: should you answer at all?

What the paper does — MathDoc in plain terms

The paper introduces MathDoc, a benchmark built from real high-school mathematics exam papers photographed in uncontrolled conditions. Not synthetic noise. Not cleaned scans. Real artifacts: handwriting covering text, page folds cutting equations in half, blurred figures, and truncated questions.

The dataset contains 3,609 questions, including a substantial subset explicitly labeled as unrecognizable. These are not trick cases. They are questions a human would reasonably refuse to transcribe.

The task definition is deliberately strict:

  • Extract structured questions only if all critical information is present.
  • Extract diagrams with semantic fidelity, not just bounding boxes.
  • Explicitly reject questions that are incomplete or illegible.

In short: extraction, perception, and refusal are evaluated together, not in isolation.
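
To make this contract concrete, here is a minimal sketch of the kind of structured record such a task implies, written in Python. The field names (`stem`, `options`, `figure_regions`, `status`) and the `finalize` helper are assumptions for illustration, not MathDoc's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ExtractedQuestion:
    # One extracted question. Field names are illustrative, not MathDoc's schema.
    question_id: str
    status: str                                    # "extracted" or "refused"
    stem: Optional[str] = None                     # full stem text, only if complete
    options: list[str] = field(default_factory=list)
    figure_regions: list[tuple[int, int, int, int]] = field(default_factory=list)
    refusal_reason: Optional[str] = None           # e.g. "stem truncated by page fold"

def finalize(q: ExtractedQuestion) -> ExtractedQuestion:
    # The strict contract: if a critical piece is missing, the only valid
    # output is an explicit refusal, never a plausible completion.
    if q.status == "extracted" and not q.stem:
        return ExtractedQuestion(q.question_id, "refused",
                                 refusal_reason="missing stem")
    return q
```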

How it is evaluated — Beyond accuracy

MathDoc introduces a multi-dimensional evaluation pipeline:

| Dimension | What is measured | Why it matters |
|---|---|---|
| Stem Accuracy | Normalized Levenshtein similarity | Penalizes speculative completion |
| Visual Similarity | LLM-based semantic judging of cropped figures | Captures missing labels, edges, symbols |
| Refusal Ability | Precision / Recall / F1 on unrecognizable cases | Measures epistemic humility |

Notably, refusal is treated as a positive behavior. A model that answers everything scores poorly.

The final score explicitly allocates weight to refusal performance, making it impossible to hide behind strong OCR.
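
The scoring ideas above can be sketched in a few lines of Python. The Levenshtein routine is the standard dynamic-programming edit distance; the 0.6 / 0.4 weights in `final_score` are placeholders, not the benchmark's actual allocation.

```python
def levenshtein(a: str, b: str) -> int:
    # Classic edit-distance dynamic program over the two strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def stem_similarity(pred: str, gold: str) -> float:
    # Normalized similarity in [0, 1]; speculative completions that invent
    # text push the edit distance, and therefore the penalty, up.
    if not pred and not gold:
        return 1.0
    return 1.0 - levenshtein(pred, gold) / max(len(pred), len(gold))

def refusal_f1(tp: int, fp: int, fn: int) -> float:
    # Refusing an unrecognizable question is the positive class.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def final_score(extraction: float, refusal: float,
                w_extract: float = 0.6, w_refuse: float = 0.4) -> float:
    # Refusal carries explicit weight, so strong OCR alone cannot win.
    return w_extract * extraction + w_refuse * refusal
```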

Findings — Strong readers, weak skeptics

The results are uncomfortable.

State-of-the-art multimodal models—both open and closed—perform well on recognizable questions. End-to-end models outperform traditional OCR pipelines, especially on complex layouts.

But when inputs degrade, a pattern emerges:

  • High-precision, low-recall refusal: models can refuse correctly, but rarely choose to.
  • Forced transcription dominates behavior: incomplete text is completed, not rejected.
  • Larger models are more sensitive to missing information, but still far from reliable.

In practical terms, the models behave like overconfident interns: impressive when conditions are ideal, dangerously assertive when they are not.

Probing refusal — When do models say no?

The paper goes further and tests refusal boundaries by progressively erasing parts of a question.

A key finding: model scale correlates with sensitivity to information loss. Large models begin refusing when roughly 20–30% of critical content disappears. Smaller models continue answering even when nearly half the question is gone.
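
A rough sketch of such a probe, applied to question text rather than page images for simplicity, looks like the following. The `model_answers` callable is a hypothetical stand-in for whichever model and prompt you test; it should return True when the model answers and False when it refuses.

```python
import random
from typing import Callable

def refusal_threshold(question: str,
                      model_answers: Callable[[str], bool],
                      step: float = 0.1,
                      seed: int = 0) -> float:
    """Return the smallest erased fraction at which the model first refuses."""
    rng = random.Random(seed)
    words = question.split()
    order = list(range(len(words)))
    rng.shuffle(order)                      # erase words in a fixed random order
    n_steps = int(round(1.0 / step))
    for k in range(1, n_steps + 1):
        fraction = k * step
        erased = set(order[: int(len(words) * fraction)])
        degraded = " ".join("____" if i in erased else w
                            for i, w in enumerate(words))
        if not model_answers(degraded):     # first refusal found
            return fraction
    return 1.0                              # answered even with everything erased
```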

Another insight is more actionable: cropping helps.

When a single question is isolated from the page, models are significantly more likely to refuse incomplete inputs. Full-page context appears to encourage hallucination; focused context exposes gaps.

This suggests refusal is not purely a training issue. It is also a perception framing problem.
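
The cropping intervention is easy to replicate in principle: query the model once per isolated question region instead of once per full page. In this sketch, `question_boxes` (pixel bounding boxes) and `ask_model` are hypothetical helpers, not part of MathDoc's tooling.

```python
from PIL import Image

def per_question_pass(page_path, question_boxes, ask_model):
    # Crop each question region and query the model per crop, not per page.
    page = Image.open(page_path)
    results = []
    for box in question_boxes:             # (left, top, right, bottom) in pixels
        crop = page.crop(box)              # focused context exposes missing pieces
        results.append(ask_model(crop))    # model either extracts or refuses
    return results
```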

Implications — Reliability is not accuracy

MathDoc quietly shifts the evaluation conversation.

If AI systems are to be deployed in education, compliance, finance, or law, refusal is not an edge case. It is a core capability. A system that cannot say “I don’t know” is not intelligent—it is brittle.

For builders, the message is clear:

  • Train on incomplete data explicitly (a minimal sketch follows this list).
  • Reward refusal as a first-class outcome.
  • Design pipelines that surface uncertainty instead of smoothing it away.
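
As one concrete, deliberately simplified take on the first two items, degraded variants of clean questions can be folded into training data with refusal as an explicit target. The 0.4 threshold and the `[REFUSE]` token below are illustrative assumptions, not the paper's recipe.

```python
import random

REFUSE = "[REFUSE: question incomplete]"

def make_pair(question: str, erase_fraction: float, rng: random.Random):
    # Build a (degraded input, target) training pair from a clean question.
    words = question.split()
    k = int(len(words) * erase_fraction)
    erased = set(rng.sample(range(len(words)), k))
    degraded = " ".join("____" if i in erased else w for i, w in enumerate(words))
    # Beyond a critical amount of lost content, the correct behavior is
    # refusal, not reconstruction of the original text.
    target = question if erase_fraction < 0.4 else REFUSE
    return degraded, target
```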

For buyers of AI systems, the warning is sharper: benchmarks that report only accuracy are hiding risk.

Conclusion

MathDoc is not just a dataset. It is a critique.

It exposes a fundamental mismatch between how multimodal models are evaluated and how documents exist in the real world. By treating refusal as a measurable, desirable behavior, it forces the field to confront a simple truth: reliability begins where generation stops.

Cognaptus: Automate the Present, Incubate the Future.