Multimodal

Reasoning with Both Eyes Open: Why Multimodal Chain-of-Thought Still Trips Up LLMs

If today’s AI models can ace bar exams, explain astrophysics, and generate functional code from a napkin sketch, why do they still fail at seemingly simple questions that require looking and thinking? A new benchmark called MCORE (Multimodal Chain-of-Reasoning Evaluation) answers that question bluntly: reasoning across modalities is hard, and we’re not as far along as we thought.

Beyond Pattern Matching: What MCORE Tests

The majority of multimodal evaluations today rely on either: ...

Aya Vision

An open-weight multilingual vision-language model developed by Cohere For AI, part of Project Aya’s global multilingual and multimodal research initiative.

ChatGPT-4o (Omni)

OpenAI’s flagship GPT-4 variant that natively supports text, vision, and audio input/output with faster performance and improved reasoning.

Claude 3 Sonnet

A mid-sized member of Anthropic’s Claude 3 model family, optimized for balanced performance across reasoning, speed, and multimodal understanding.

DeepSeek-V3

A Mixture-of-Experts large language model by DeepSeek AI. The released DeepSeek-V3 weights are text-only; DeepSeek’s vision-language work on tasks such as OCR, captioning, and visual reasoning ships separately in its DeepSeek-VL models.

Gemini 2 Flash

A fast and lightweight variant of Google’s Gemini 2.0 multimodal model, optimized for latency-sensitive tasks via the Gemini API.

Grok-2

The next-generation model from xAI, built on a new architecture and fully integrated into X (formerly Twitter) as part of Elon Musk’s AI assistant efforts.

PaliGemma 2

A vision-language model by Google that combines the Gemma 2 language model with the SigLIP vision encoder for image captioning, VQA, and image-text reasoning tasks.