
When Agents Start Thinking Twice: Teaching Multimodal AI to Doubt Itself

Opening — Why this matters now
Multimodal models are getting better at seeing, but not necessarily at understanding. They describe images fluently, answer visual questions confidently—and yet still contradict themselves when asked to reason across perception and language. The gap isn’t capability. It’s coherence. The paper behind this article targets a subtle but costly problem in modern AI systems: models that generate answers they cannot later justify—or even agree with. In real-world deployments, that gap shows up as unreliable assistants, brittle agents, and automation that looks smart until it’s asked why. ...

February 9, 2026 · 3 min · Zelina

Seeing Is Not Reasoning: Why Mental Imagery Still Breaks Multimodal AI

Opening — Why this matters now
Multimodal AI is having its cinematic moment. Video generation, image rollouts, and interleaved vision–language reasoning are being marketed as steps toward models that can think visually. The implicit promise is seductive: if models can generate images while reasoning, perhaps they can finally reason with them. This paper delivers a colder verdict. When tested under controlled conditions, today’s strongest multimodal models fail at something deceptively basic: maintaining and manipulating internal visual representations over time. In short, they can see—but they cannot mentally imagine in any robust, task‑reliable way. ...

February 3, 2026 · 4 min · Zelina

Thinking in Panels: Why Comics Might Beat Video for Multimodal Reasoning

Opening — Why this matters now
Multimodal reasoning has quietly hit an efficiency wall. We taught models to think step by step with text, then asked them to imagine with images, and finally to reason with videos. Each step added expressive power—and cost. Images freeze time. Videos drown signal in redundancy. Somewhere between the two, reasoning gets expensive fast. ...

February 3, 2026 · 3 min · Zelina

MI-ZO: Teaching Vision-Language Models Where to Look

Opening — Why this matters now
Vision-Language Models (VLMs) are everywhere—judging images, narrating videos, and increasingly acting as reasoning engines layered atop perception. But there is a quiet embarrassment in the room: most state-of-the-art VLMs are trained almost entirely on 2D data, then expected to reason about 3D worlds as if depth, occlusion, and viewpoint were minor details. ...

January 2, 2026 · 4 min · Zelina

When Rewards Learn to See: Teaching Humanoids What the Ground Looks Like

Opening — Why this matters now
Humanoid robots can now run, jump, and occasionally impress investors. What they still struggle with is something more mundane: noticing the stairs before falling down them. For years, reinforcement learning (RL) has delivered impressive locomotion demos—mostly on flat floors. The uncomfortable truth is that many of these robots are, functionally speaking, blind. They walk well only because the ground behaves politely. Once the terrain becomes uneven, discontinuous, or adversarial, performance collapses. ...

December 21, 2025 · 4 min · Zelina

CitySeeker: Lost in Translation, Found in the City

Opening — Why this matters now
Urban navigation looks deceptively solved. We have GPS, street-view imagery, and multimodal models that can describe a scene better than most humans. And yet, when vision-language models (VLMs) are asked to actually navigate a city — not just caption it — performance collapses in subtle, embarrassing ways. The gap is no longer about perception quality. It is about cognition: remembering where you have been, knowing when you are wrong, and understanding implicit human intent. This is the exact gap CitySeeker is designed to expose. ...

December 19, 2025 · 3 min · Zelina

Seeing Isn’t Knowing: Why Vision-Language Models Still Miss the Details

Opening — Why this matters now
Vision-language models (VLMs) have become unreasonably confident. Ask them to explain a chart, reason over a meme, or narrate an image, and they respond with eloquence that borders on arrogance. Yet, beneath this fluency lies an uncomfortable truth: many of these models still struggle with seeing the right thing. ...

December 14, 2025 · 4 min · Zelina

Tunnel Vision, Literally: When Cropping Makes Multimodal Models Blind

Opening — Why this matters now
Multimodal Large Language Models (MLLMs) can reason, explain, and even philosophize about images—until they’re asked to notice something small. A number on a label. A word in a table. The relational context that turns a painted line into a parking space instead of a traffic lane. The industry’s default fix has been straightforward: crop harder, zoom further, add resolution. Yet performance stubbornly plateaus. This paper makes an uncomfortable but important claim: the problem is not missing pixels. It’s missing structure. ...

December 14, 2025 · 3 min · Zelina

You Know It When You See It—But Can the Model?

Opening — Why this matters now
Vision models have become remarkably competent at recognizing things. Dogs, cars, traffic lights—no drama. The problem starts when we ask them to recognize judgment. Is this image unhealthy food? Is this visual clickbait? Is this borderline unsafe? These are not classification problems with clean edges; they are negotiations. And most existing pipelines pretend otherwise. ...

December 12, 2025 · 4 min · Zelina

Seeing is Retraining: How VizGenie Turns Visualization into a Self-Improving AI Loop

Scientific visualization has long been caught in a bind: the more complex the dataset, the more domain-specific the visualization, and the harder it is to automate. From MRI scans to hurricane simulations, modern scientific data is massive, high-dimensional, and notoriously messy. While dashboards and 2D plots have benefited from LLM-driven automation, 3D volumetric visualization—especially in high-performance computing (HPC) settings—has remained stubbornly manual.

VizGenie changes that. Developed at Los Alamos National Laboratory, VizGenie is a hybrid agentic system that doesn’t just automate visualization tasks—it refines itself through them. It blends traditional visualization tools (like VTK) with dynamically generated Python modules and augments this with vision-language models fine-tuned on domain-specific images. The result: a system that can answer questions like “highlight the tissue boundaries” and actually improve its answers over time. ...
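For readers who want a concrete picture before opening the article, here is a minimal sketch of the kind of generate–render–critique loop the excerpt describes. It is a hypothetical illustration, not VizGenie’s actual code: the helpers `generate_script`, `render`, and `critique` are stand-ins for the LLM code generator, the VTK-backed renderer, and the fine-tuned vision-language model.

```python
# Hypothetical sketch of a self-improving visualization loop (not VizGenie's API).
from dataclasses import dataclass


@dataclass
class Attempt:
    script: str   # generated visualization code (e.g., a VTK-style Python module)
    image: str    # stand-in for the rendered image
    score: float  # quality score assigned by the vision-language model, in [0, 1]


def generate_script(question: str, feedback: str) -> str:
    # Stand-in for an LLM call that writes visualization code from the question,
    # incorporating any critique from the previous round.
    return f"# viz script for: {question}\n# revised with: {feedback or 'no feedback'}"


def render(script: str) -> str:
    # Stand-in for executing the generated script with a real renderer
    # and capturing the resulting image.
    return f"<render of {len(script)}-char script>"


def critique(question: str, image: str) -> tuple[float, str]:
    # Stand-in for a domain-tuned vision-language model scoring the image
    # against the question and returning textual feedback for the next attempt.
    return 0.9, "boundaries visible; consider a sharper transfer function"


def answer(question: str, max_rounds: int = 3, threshold: float = 0.8) -> Attempt:
    # Generate, render, and critique until the critique score clears the threshold,
    # keeping the best attempt seen so far.
    feedback = ""
    best = Attempt(script="", image="", score=0.0)
    for _ in range(max_rounds):
        script = generate_script(question, feedback)
        image = render(script)
        score, feedback = critique(question, image)
        if score > best.score:
            best = Attempt(script, image, score)
        if score >= threshold:
            break
    return best


if __name__ == "__main__":
    result = answer("highlight the tissue boundaries")
    print(result.score, result.image)
```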

August 2, 2025 · 4 min · Zelina