Multimodal Models

When Papers Learn to Draw: AutoFigure and the End of Ugly Science Diagrams

Opening — Why this matters now AI can already write papers, review papers, and in some cases get papers accepted. Yet one stubborn artifact has remained conspicuously human: the scientific figure. Diagrams, pipelines, conceptual schematics—these are still hand-crafted, visually inconsistent, and painfully slow to produce. For AI-driven research agents, this isn’t cosmetic. It’s a structural failure. ...

When One Patch Rules Them All: Teaching MLLMs to See What Isn’t There

Opening — Why this matters now Multimodal large language models (MLLMs) are no longer research curiosities. They caption images, reason over diagrams, guide robots, and increasingly sit inside commercial products that users implicitly trust. That trust rests on a fragile assumption: that these models see the world in a reasonably stable way. The paper behind this article quietly dismantles that assumption. It shows that a single, reusable visual perturbation—not tailored to any specific image—can reliably coerce closed-source systems like GPT‑4o or Gemini‑2.0 into producing attacker‑chosen outputs. Not once. Not occasionally. But consistently, across arbitrary, previously unseen images. ...

MemCtrl: Teaching Small Models What Not to Remember

Opening — Why this matters now Embodied AI is hitting a very human bottleneck: memory. Not storage capacity, not retrieval speed—but judgment. Modern multimodal large language models (MLLMs) can see, reason, and act, yet when deployed as embodied agents they tend to remember too much, too indiscriminately. Every frame, every reflection, every redundant angle piles into context until the agent drowns in its own experience. ...

Thinking Twice: Why Making AI Argue With Itself Actually Works

Opening — Why this matters now Multimodal large language models (MLLMs) are everywhere: vision-language assistants, document analyzers, agents that claim to see, read, and reason simultaneously. Yet anyone who has deployed them seriously knows an awkward truth: they often say confident nonsense, especially when images are involved. The paper behind this article tackles an uncomfortable but fundamental question: what if the problem isn’t lack of data or scale—but a mismatch between how models generate answers and how they understand them? The proposed fix is surprisingly philosophical: let the model contradict itself, on purpose. ...

Fish in the Ocean, Not Needles in the Haystack

Opening — Why this matters now Long-context multimodal models are starting to look fluent enough to pass surface-level exams on scientific papers. They answer questions correctly. They summarize convincingly. And yet, something feels off. The answers often arrive without a visible path—no trail of figures, no textual anchors, no defensible reasoning chain. In other words, the model knows what to say, but not necessarily why it is true. ...

When the Right Answer Is No Answer: Teaching AI to Refuse Messy Math

Opening — Why this matters now Multimodal models have become unnervingly confident readers of documents. Hand them a PDF, a scanned exam paper, or a photographed worksheet, and they will happily extract text, diagrams, and even implied structure. The problem is not what they can read. It is what they refuse to unread. In real classrooms, mathematics exam papers are not pristine artifacts. They are scribbled on, folded, stained, partially photographed, and occasionally vandalized by enthusiastic graders. Yet most document benchmarks still assume a polite world where inputs are complete and legible. This gap matters. An AI system that confidently invents missing math questions is not merely wrong—it is operationally dangerous. ...

Seeing Is Thinking: When Multimodal Reasoning Stops Talking and Starts Drawing

Opening — Why this matters now Multimodal AI has spent the last two years narrating its thoughts like a philosophy student with a whiteboard it refuses to use. Images go in, text comes out, and the actual visual reasoning—zooming, marking, tracing, predicting—happens offstage, if at all. Omni-R1 arrives with a blunt correction: reasoning that depends on vision should generate vision. ...

Seeing Too Much: When Multimodal Models Forget Privacy

Opening — Why this matters now Multimodal models have learned to see. Unfortunately, they have also learned to remember—and sometimes to reveal far more than they should. As vision-language models (VLMs) are deployed into search, assistants, surveillance-adjacent tools, and enterprise workflows, the question is no longer whether they can infer personal information from images, but how often they do so—and under what conditions they fail to hold back. ...

Hard Problems Pay Better: Why Difficulty-Aware DPO Fixes Multimodal Hallucinations

Opening — Why this matters now Multimodal large language models (MLLMs) are getting better at seeing—but not necessarily at knowing. Despite steady architectural progress, hallucinations remain stubbornly common: models confidently describe objects that do not exist, infer relationships never shown, and fabricate visual details with unsettling fluency. The industry response has been predictable: more preference data, more alignment, more optimization. ...

RxnBench: Reading Chemistry Like a Human (Turns Out That’s Hard)

Opening — Why this matters now Multimodal Large Language Models (MLLMs) have become impressively fluent readers of the world. They can caption images, parse charts, and answer questions about documents that would once have required a human analyst and a strong coffee. Naturally, chemistry was next. But chemistry does not speak in sentences. It speaks in arrows, wedges, dashed bonds, cryptic tables, and reaction schemes buried three pages away from their explanations. If we want autonomous “AI chemists,” the real test is not trivia or SMILES strings — it is whether models can read actual chemical papers. ...