
Click Like a Human: Why Avenir-Web Is a Quiet Breakthrough in Web Agents

Opening — Why this matters now
For years, autonomous web agents have promised to automate the internet: booking flights, scraping dashboards, configuring enterprise tools, or simply clicking buttons so humans don’t have to. And yet, anyone who has actually tried to deploy one knows the truth—these agents fail in embarrassingly human ways. They get lost. They click the wrong thing. They forget what they were doing halfway through. ...

February 3, 2026 · 5 min · Zelina

Seeing Is Not Reasoning: Why Mental Imagery Still Breaks Multimodal AI

Opening — Why this matters now
Multimodal AI is having its cinematic moment. Video generation, image rollouts, and interleaved vision–language reasoning are being marketed as steps toward models that can think visually. The implicit promise is seductive: if models can generate images while reasoning, perhaps they can finally reason with them. This paper delivers a colder verdict. When tested under controlled conditions, today’s strongest multimodal models fail at something deceptively basic: maintaining and manipulating internal visual representations over time. In short, they can see—but they cannot mentally imagine in any robust, task‑reliable way. ...

February 3, 2026 · 4 min · Zelina

Thinking in Panels: Why Comics Might Beat Video for Multimodal Reasoning

Opening — Why this matters now
Multimodal reasoning has quietly hit an efficiency wall. We taught models to think step by step with text, then asked them to imagine with images, and finally to reason with videos. Each step added expressive power—and cost. Images freeze time. Videos drown signal in redundancy. Somewhere between the two, reasoning gets expensive fast. ...

February 3, 2026 · 3 min · Zelina

Seeing Is Thinking: When Images Do the Reasoning

Opening — Why this matters now
Large language models have learned to talk their way through reasoning. But the real world does not speak in tokens. It moves, collides, folds, and occludes. As multimodal models mature, a quiet question has become unavoidable: is language really the best internal medium for thinking about physical reality? ...

February 2, 2026 · 3 min · Zelina

When LLMs Invent Languages: Efficiency, Secrecy, and the Limits of Natural Speech

Opening — Why this matters now
Large language models are supposed to speak our language. Yet as they become more capable, something uncomfortable emerges: when pushed to cooperate efficiently, models often abandon natural language altogether. This paper shows that modern vision–language models (VLMs) can spontaneously invent task-specific communication protocols—compressed, opaque, and sometimes deliberately unreadable to outsiders—without any fine-tuning. Just prompts. ...

January 31, 2026 · 3 min · Zelina

Seeing Is Misleading: When Climate Images Need Receipts

Opening — Why this matters now
Climate misinformation has matured. It no longer argues; it shows. A melting glacier with the wrong caption. A wildfire image from another decade. A meme that looks scientific enough to feel authoritative. In an era where images travel faster than footnotes, public understanding of climate science is increasingly shaped by visuals that lie by omission, context shift, or outright fabrication. ...

January 23, 2026 · 3 min · Zelina

MobileDreamer: When GUI Agents Stop Guessing and Start Imagining

Opening — Why this matters now
GUI agents are everywhere in demos and nowhere in production. They click, scroll, and type impressively—right up until the task requires foresight. The moment an interface branches, refreshes, or hides its intent behind two more screens, today’s agents revert to trial-and-error behavior. The core problem isn’t vision. It’s imagination. ...

January 8, 2026 · 4 min · Zelina

Crossing the Line: Teaching Pedestrian Models to Reason, Not Memorize

Opening — Why this matters now
Pedestrian fatalities are rising, mid-block crossings dominate risk exposure, and yet most models tasked with predicting pedestrian behavior remain stubbornly local. They perform well—until they don’t. Move them to a new street, a wider arterial, or a different land-use mix, and accuracy quietly collapses. This is not a data problem. It’s a reasoning problem. ...

January 5, 2026 · 4 min · Zelina

Echoes, Not Amnesia: Teaching GUI Agents to Remember What Worked

Opening — Why this matters now
GUI agents are finally competent enough to click buttons without embarrassing themselves. And yet, they suffer from a strangely human flaw: they forget everything they just learned. Each task is treated as a clean slate. Every mistake is patiently re‑made. Every success is quietly discarded. In a world obsessed with scaling models, this paper asks a simpler, sharper question: what if agents could remember? ...

December 23, 2025 · 3 min · Zelina

Seeing Isn’t Knowing: Why Vision-Language Models Still Miss the Details

Opening — Why this matters now
Vision-language models (VLMs) have become unreasonably confident. Ask them to explain a chart, reason over a meme, or narrate an image, and they respond with eloquence that borders on arrogance. Yet beneath this fluency lies an uncomfortable truth: many of these models still struggle to see the right thing. ...

December 14, 2025 · 4 min · Zelina