Fish in the Ocean, Not Needles in the Haystack
Documents are where confident AI demos go to become slightly embarrassing. A model reads a long report. It gives the right answer. The room relaxes. Someone says “great, it understood the document,” and everyone pretends the word “understood” has not just been smuggled into the meeting without a passport. That is the exact mistake SIN-Bench is designed to catch.[1] The paper is not merely another benchmark asking whether multimodal large language models can answer questions about scientific literature. It asks a more operationally painful question: can the model show the evidence path that makes the answer legitimate? ...