
When Rewards Learn to See: Teaching Humanoids What the Ground Looks Like

Opening — Why this matters now. Humanoid robots can now run, jump, and occasionally impress investors. What they still struggle with is something more mundane: noticing the stairs before falling down them. For years, reinforcement learning (RL) has delivered impressive locomotion demos—mostly on flat floors. The uncomfortable truth is that many of these robots are, functionally speaking, blind. They walk well only because the ground behaves politely. Once the terrain becomes uneven, discontinuous, or adversarial, performance collapses. ...

December 21, 2025 · 4 min · Zelina

CitySeeker: Lost in Translation, Found in the City

Opening — Why this matters now. Urban navigation looks deceptively solved. We have GPS, street-view imagery, and multimodal models that can describe a scene better than most humans. And yet, when vision-language models (VLMs) are asked to actually navigate a city — not just caption it — performance collapses in subtle, embarrassing ways. The gap is no longer about perception quality. It is about cognition: remembering where you have been, knowing when you are wrong, and understanding implicit human intent. This is the exact gap CitySeeker is designed to expose. ...

December 19, 2025 · 3 min · Zelina

Seeing Isn’t Knowing: Why Vision-Language Models Still Miss the Details

Opening — Why this matters now. Vision-language models (VLMs) have become unreasonably confident. Ask them to explain a chart, reason over a meme, or narrate an image, and they respond with eloquence that borders on arrogance. Yet beneath this fluency lies an uncomfortable truth: many of these models still struggle with seeing the right thing. ...

December 14, 2025 · 4 min · Zelina

Tunnel Vision, Literally: When Cropping Makes Multimodal Models Blind

Opening — Why this matters now. Multimodal Large Language Models (MLLMs) can reason, explain, and even philosophize about images—until they’re asked to notice something small. A number on a label. A word in a table. The relational context that turns a painted line into a parking space instead of a traffic lane. The industry’s default fix has been straightforward: crop harder, zoom further, add resolution. Yet performance stubbornly plateaus. This paper makes an uncomfortable but important claim: the problem is not missing pixels. It’s missing structure. ...

December 14, 2025 · 3 min · Zelina

You Know It When You See It—But Can the Model?

Opening — Why this matters now. Vision models have become remarkably competent at recognizing things. Dogs, cars, traffic lights—no drama. The problem starts when we ask them to recognize judgment. Does this image show unhealthy food? Is this visual clickbait? Is this borderline unsafe? These are not classification problems with clean edges; they are negotiations. And most existing pipelines pretend otherwise. ...

December 12, 2025 · 4 min · Zelina

Seeing is Retraining: How VizGenie Turns Visualization into a Self-Improving AI Loop

Scientific visualization has long been caught in a bind: the more complex the dataset, the more domain-specific the visualization, and the harder it is to automate. From MRI scans to hurricane simulations, modern scientific data is massive, high-dimensional, and notoriously messy. While dashboards and 2D plots have benefitted from LLM-driven automation, 3D volumetric visualization—especially in high-performance computing (HPC) settings—has remained stubbornly manual. VizGenie changes that. Developed at Los Alamos National Laboratory, VizGenie is a hybrid agentic system that doesn’t just automate visualization tasks—it refines itself through them. It blends traditional visualization tools (like VTK) with dynamically generated Python modules and augments this with vision-language models fine-tuned on domain-specific images. The result: a system that can answer questions like “highlight the tissue boundaries” and actually improve its answers over time. ...
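
To make "dynamically generated Python modules" concrete, here is a minimal sketch, assuming a standard VTK Python install, of the kind of module an agentic system might emit for a request like "highlight the tissue boundaries": load a volume, extract an isosurface, render it off-screen, and save an image. The file name, reader choice, and iso value are illustrative assumptions, not VizGenie's actual generated code.

```python
import vtk

# Load a volumetric scan (file name and NRRD format are assumptions).
reader = vtk.vtkNrrdReader()
reader.SetFileName("scan.nrrd")

# Extract an isosurface at an assumed intensity threshold, a rough stand-in
# for "highlight the tissue boundaries".
surface = vtk.vtkMarchingCubes()
surface.SetInputConnection(reader.GetOutputPort())
surface.SetValue(0, 500)

mapper = vtk.vtkPolyDataMapper()
mapper.SetInputConnection(surface.GetOutputPort())
mapper.ScalarVisibilityOff()

actor = vtk.vtkActor()
actor.SetMapper(mapper)

renderer = vtk.vtkRenderer()
renderer.AddActor(actor)

# Render headlessly, as one would on an HPC node without a display.
window = vtk.vtkRenderWindow()
window.AddRenderer(renderer)
window.SetOffScreenRendering(1)
window.Render()

# Save the frame to disk so a vision-language model can inspect the result.
to_image = vtk.vtkWindowToImageFilter()
to_image.SetInput(window)
to_image.Update()

writer = vtk.vtkPNGWriter()
writer.SetFileName("tissue_boundaries.png")
writer.SetInputConnection(to_image.GetOutputPort())
writer.Write()
```

In a loop like the one described above, such a rendering could then be passed back to the fine-tuned vision-language model, which judges whether the boundary is actually visible and proposes the next refinement.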

August 2, 2025 · 4 min · Zelina

Prompt Without Words: Distilling GPT Semantics for Smarter Vision Models

When it comes to prompting vision-language models, most methods rely on textual descriptions extracted from large language models like GPT. But those descriptions—“fluffy fur, friendly eyes, golden color”—are often verbose, ambiguous, or flat-out unreliable. What if we could skip that noisy middle step entirely? That’s the premise behind DeMul (Description-free Multi-prompt Learning), a new method presented at ICLR 2025 that quietly delivers a major leap in few-shot image classification. Instead of generating descriptions for each class, DeMul directly distills the semantic knowledge of GPT embeddings into learnable prompt vectors. The result is simpler, more robust, and strikingly effective. ...
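
As a rough illustration of "distilling GPT embeddings into learnable prompt vectors", the sketch below is a hedged approximation rather than the paper's exact recipe: learnable per-class prompt vectors live in the shared image-text embedding space and are trained with a CLIP-style classification loss plus a distillation term that pulls them toward frozen GPT embeddings of the class names. The class name DistilledPrompts, the pooling, the loss weight alpha, and the assumption that the GPT embeddings are already projected to CLIP's dimension are all illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DistilledPrompts(nn.Module):
    """Learnable per-class prompt vectors distilled toward frozen GPT embeddings."""

    def __init__(self, gpt_class_embeds: torch.Tensor, num_prompts: int = 4):
        super().__init__()
        n_classes, dim = gpt_class_embeds.shape
        # Multiple learnable prompt vectors per class, randomly initialized.
        self.prompts = nn.Parameter(0.02 * torch.randn(n_classes, num_prompts, dim))
        # Frozen GPT embeddings of the class names; no text descriptions are used.
        self.register_buffer("gpt_embeds", F.normalize(gpt_class_embeds, dim=-1))

    def forward(self, image_feats: torch.Tensor, labels: torch.Tensor,
                alpha: float = 0.5) -> torch.Tensor:
        # Pool each class's prompts into a single class embedding.
        class_embeds = F.normalize(self.prompts.mean(dim=1), dim=-1)
        image_feats = F.normalize(image_feats, dim=-1)

        # CLIP-style classification: cosine-similarity logits against the labels.
        logits = 100.0 * image_feats @ class_embeds.t()
        cls_loss = F.cross_entropy(logits, labels)

        # Distillation: keep each class embedding close to its GPT embedding.
        distill_loss = (1.0 - (class_embeds * self.gpt_embeds).sum(dim=-1)).mean()
        return cls_loss + alpha * distill_loss
```

In practice, image_feats would come from a frozen image encoder and gpt_class_embeds from an embedding model, both mapped to a common dimension; the point is simply that no per-class text descriptions are generated anywhere in the loop.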

July 13, 2025 · 3 min · Zelina