Opening — Why this matters now
Long-context multimodal models are starting to look fluent enough to pass surface-level exams on scientific papers. They answer questions correctly. They summarize convincingly. And yet, something feels off. The answers often arrive without a visible path—no trail of figures, no textual anchors, no defensible reasoning chain. In other words, the model knows what to say, but not necessarily why it is true.
This gap is no longer academic. As LLMs move into research assistance, peer review, and automated analysis, correctness without traceability becomes a liability. The paper behind SIN-Bench makes a blunt claim: if a model cannot show its evidence, it should not get the score.
Background — From needles to oceans
Most long-context benchmarks still rely on a familiar trick: hide a few synthetic facts inside a large body of irrelevant text and test whether the model can retrieve them. This “Needle-in-a-Haystack” paradigm is useful for probing memory limits, but it quietly sidesteps the real difficulty of scientific reading.
Real papers are not haystacks. They are oceans.
Information is native, dense, and interconnected. Figures depend on methods sections. Conclusions depend on ablations buried pages earlier. Evidence is not isolated; it is entangled. The authors reframe the task accordingly with the Fish-in-the-Ocean (FITO) paradigm: models must locate, connect, and order native evidence across text and figures to justify an answer.
Formally, the shift is simple but consequential. Instead of optimizing only for the answer $A$ given a document $D$ and question $Q$, evaluation now centers on the joint probability of the answer and its evidence chain:
$$ P(A, E | D, Q) = P(E | D, Q) \cdot P(A | E, D, Q) $$
The latent variable $E$—the evidence chain—is no longer optional. It is the point.
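Concretely, that decomposition maps onto a gated scoring rule: answer credit only counts when the evidence chain holds up. Below is a minimal sketch of such a scorer, assuming a toy ordered-overlap metric for evidence alignment and exact-match answer checking; the benchmark's actual metrics may differ.

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    answer: str
    evidence: list[str]  # ordered anchors into the document (text spans / figure labels)

def evidence_score(pred_chain: list[str], gold_chain: list[str]) -> float:
    """Toy stand-in for P(E | D, Q): fraction of gold anchors recovered in order."""
    if not gold_chain:
        return 0.0
    hits, next_gold = 0, 0
    for anchor in pred_chain:
        if next_gold < len(gold_chain) and anchor == gold_chain[next_gold]:
            hits += 1
            next_gold += 1
    return hits / len(gold_chain)

def graded_score(pred: Prediction, gold_answer: str, gold_chain: list[str]) -> float:
    """'No Evidence, No Score': answer credit is gated by evidence quality,
    mirroring P(E | D, Q) * P(A | E, D, Q)."""
    e = evidence_score(pred.evidence, gold_chain)
    a = 1.0 if pred.answer.strip().lower() == gold_answer.strip().lower() else 0.0
    return e * a  # a correct answer with an empty or wrong chain earns nothing
```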
Analysis — What SIN-Bench actually builds
To operationalize FITO, the authors introduce two components: SIN-Data and SIN-Bench.
SIN-Data: restoring native interleaving
Scientific papers are linearized into a semantic-first interleaved format where figures are injected at their first citation point, not dumped at the end or treated as captions-only artifacts. This preserves how humans actually read papers: text, then figure, then back to text.
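As a rough illustration of what "injected at their first citation point" means in practice, the sketch below splices figure blocks into the text stream at the paragraph where they are first mentioned. It assumes figures are keyed by labels such as "Figure 3"; the authors' actual pipeline is more involved.

```python
import re

def interleave(paragraphs: list[str], figures: dict[str, str]) -> list[dict]:
    """Splice each figure into the text stream at its first citation point.

    paragraphs: body text in reading order.
    figures: maps a label like "Figure 3" to a figure payload (image path + caption).
    Returns a list of blocks, each {"type": "text" | "figure", "content": ...}.
    """
    placed: set[str] = set()
    blocks: list[dict] = []
    for para in paragraphs:
        blocks.append({"type": "text", "content": para})
        for label, payload in figures.items():
            if label not in placed and re.search(rf"\b{re.escape(label)}\b", para):
                blocks.append({"type": "figure", "content": payload})
                placed.add(label)
    # Figures never cited in the text fall back to the end of the document.
    for label, payload in figures.items():
        if label not in placed:
            blocks.append({"type": "figure", "content": payload})
    return blocks
```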
From an initial pool of ~50,000 arXiv and PMC papers, aggressive filtering yields 4,000 high-density documents spanning 12 top-level disciplines and more than 80 subfields. The goal is not scale for its own sake, but controllable, evidence-rich contexts.
SIN-Bench: four tasks, one workflow
SIN-Bench mirrors a realistic research loop through four progressive tasks:
| Task | What it tests | Core failure mode exposed |
|---|---|---|
| SIN-Find | Evidence discovery | Missing prerequisites |
| SIN-Verify | Evidence sufficiency | Answer-driven rationalization |
| SIN-QA | Grounded reasoning | Hallucinated support |
| SIN-Summary | Evidence-anchored synthesis | Shallow narrative coherence |
Crucially, all tasks operate over the same interface: a document $D$, a query $Q$, an answer $A$, and an ordered interleaved evidence chain $E$ spanning text and figures.
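One way to picture that shared interface is a single record type reused across all four tasks; the field names below are illustrative, not the benchmark's actual schema.

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class EvidenceAnchor:
    kind: Literal["text", "figure"]   # a sentence span or a figure reference
    ref: str                          # e.g. a paragraph/sentence ID or figure label

@dataclass
class SINInstance:
    document: list[dict]                   # interleaved text/figure blocks (D)
    query: str                             # Q
    answer: str                            # A
    evidence_chain: list[EvidenceAnchor]   # ordered, spanning text and figures (E)
```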
Findings — Accuracy is cheap, grounding is not
The headline result is uncomfortable for anyone who equates benchmark leadership with understanding.
Across eight leading multimodal models, evidence grounding—not answer accuracy—is the primary bottleneck.
A simplified view of the results:
| Model | SIN-QA Answer Accuracy | Overall Evidence-Aligned Score |
|---|---|---|
| GPT-5 | Highest | Middling |
| Gemini-3-Pro | Slightly lower | Best overall |
| Open-weight VL models | Lower | Often invalid |
GPT-5 frequently produces the correct answer in SIN-QA, yet underperforms once evidence quality is scored. Gemini-3-Pro, by contrast, is more disciplined in anchoring answers to verifiable text–figure chains.
The implication is subtle but important: parametric knowledge can still win answer-only games, but it struggles when asked to show its work.
Two additional findings stand out:
- Interleaving matters. Preserving native text–figure order improves SIN-QA and SIN-Summary scores by over 0.10 in absolute terms.
- Evidence requirements help models think. Forcing explicit evidence chains improves answer accuracy, acting as a lightweight, multimodal chain-of-thought constraint (a prompting sketch follows this list).
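As a sketch of what "forcing explicit evidence chains" can look like in practice, the snippet below builds an evidence-first prompt over the interleaved blocks, using an OpenAI-style multimodal message layout. The instruction wording, JSON shape, and content keys are assumptions, not the benchmark's protocol.

```python
def evidence_first_prompt(question: str) -> str:
    """Ask for an ordered evidence chain before the answer (a lightweight,
    multimodal chain-of-thought constraint)."""
    return (
        "You will answer a question about the attached paper.\n"
        "Before answering, list an ordered evidence chain: each step must quote\n"
        "a text span or name a figure from the document.\n"
        'Respond as JSON: {"evidence": [...], "answer": "..."}.\n'
        "If the document does not support an answer, say so instead of guessing.\n\n"
        f"Question: {question}\n"
    )

def build_messages(question: str, interleaved_blocks: list[dict]) -> list[dict]:
    """Assemble a chat message that keeps the native text-figure order intact."""
    content = []
    for block in interleaved_blocks:
        if block["type"] == "text":
            content.append({"type": "text", "text": block["content"]})
        else:  # figure payload, e.g. an image URL or base64 data URI
            content.append({"type": "image_url", "image_url": {"url": block["content"]}})
    content.append({"type": "text", "text": evidence_first_prompt(question)})
    return [{"role": "user", "content": content}]
```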
Implications — What this changes for practitioners
For practitioners deploying LLMs in research-heavy workflows, SIN-Bench offers three sobering lessons:
- Correct answers are not enough. If your system cannot surface where claims come from, it will fail under scrutiny.
- Structured outputs are a real bottleneck. Many open-weight models collapse not because they are “dumb,” but because they cannot reliably emit ordered, anchored evidence (see the validator sketch after this list).
- Evaluation shapes behavior. When evidence is scored explicitly, models become more conservative—and more useful.
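For the structured-output point in particular, a simple validator makes the failure mode visible: outputs that are not parseable, not ordered lists, or not anchored to real document spans get rejected outright. A minimal sketch, assuming the model was asked for JSON with `evidence` and `answer` fields:

```python
import json

def validate_evidence_output(raw: str, known_anchors: set[str]) -> tuple[bool, str]:
    """Check that a model's output is parseable JSON carrying an ordered,
    anchored evidence chain plus an answer. Returns (is_valid, reason)."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"not valid JSON: {exc}"
    if not isinstance(obj, dict):
        return False, "top-level JSON is not an object"
    evidence = obj.get("evidence")
    if not isinstance(evidence, list) or not evidence:
        return False, "missing or empty 'evidence' list"
    unknown = [a for a in evidence if a not in known_anchors]
    if unknown:
        return False, f"evidence anchors not found in the document: {unknown}"
    answer = obj.get("answer")
    if not isinstance(answer, str) or not answer.strip():
        return False, "missing or empty 'answer'"
    return True, "ok"
```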
More broadly, the benchmark challenges the industry’s quiet assumption that longer context windows automatically imply deeper understanding. Length amplifies variance. Without grounding, it also amplifies hallucination.
Conclusion — Raising the bar, uncomfortably
SIN-Bench does not claim that current models are incapable of scientific reasoning. It claims something more precise: we have been grading the wrong thing.
By enforcing a strict “No Evidence, No Score” principle, the benchmark exposes a gap between fluency and understanding that answer-only metrics politely ignore. In doing so, it nudges the field toward a more uncomfortable, but more honest, standard—one where models are judged not by how confidently they speak, but by whether their claims can be traced back to the document itself.
That may slow down leaderboard progress. It will almost certainly improve trust.
Cognaptus: Automate the Present, Incubate the Future.