
Reasoning with Both Eyes Open: Why Multimodal Chain-of-Thought Still Trips Up LLMs
If today’s AI models can ace bar exams, explain astrophysics, and generate functional code from a napkin sketch, why do they still fail at seemingly simple questions that require looking and thinking? A new benchmark called MCORE (Multimodal Chain-of-Reasoning Evaluation) answers that question with a resounding: because reasoning across modalities is hard—and we’re not as far along as we thought. Beyond Pattern Matching: What MCORE Tests The majority of multimodal evaluations today rely on either: ...