Seeing Is Thinking: When Multimodal Reasoning Stops Talking and Starts Drawing
Image work has always had a small credibility problem: people can say where they looked, but we do not always know whether they actually looked there. The same problem shows up in multimodal AI. A model can answer a question about a chart, a photograph, a geometry diagram, or a robotic scene, then produce a neat textual chain of thought afterwards. It may sound procedural. It may mention “examining the relevant region.” It may even say “the graph shows…” with the confidence of a consultant holding a laser pointer. ...