Multimodal AI


When the Right Answer Is No Answer: Teaching AI to Refuse Messy Math

A scanned exam paper is not a polite input. It arrives bent, shadowed, annotated, folded, half-covered by a student’s handwriting, and occasionally photographed at an angle chosen by someone apparently in active conflict with geometry. For a human teacher, this is annoying. For a document AI system, it is more than annoying. It creates a dangerous fork in the road: extract what is visible, or admit that the question cannot be recovered. ...

GUI-Eyes: When Agents Learn Where to Look

Screenshots look simple until they are not. A human opening a dense professional application does not inspect every pixel with equal seriousness. We glance, zoom in mentally, ignore decorative clutter, search for the likely region, then focus. In other words, we do not merely “see” the interface. We decide where to look. ...

Seeing Is Thinking: When Multimodal Reasoning Stops Talking and Starts Drawing

Image work has always had a small credibility problem: people can say where they looked, but we do not always know whether they actually looked there. The same problem shows up in multimodal AI. A model can answer a question about a chart, a photograph, a geometry diagram, or a robotic scene, then produce a neat textual chain of thought afterwards. It may sound procedural. It may mention “examining the relevant region.” It may even say “the graph shows…” with the confidence of a consultant holding a laser pointer. ...

Seeing Too Much: When Multimodal Models Forget Privacy

Face. That is where the privacy problem starts to become awkward. A company does not need to build a facial-recognition product to create facial-recognition risk. It may only add a multimodal model to a customer-support workflow, an HR document review process, a KYC assistant, a media-monitoring tool, or a claims-processing system. Someone uploads an image. The model sees a person. Then the user asks: Who is this? Where do they live? What is their email? What is their religion? What is their medical condition? ...

TowerMind: When Language Models Learn That Towers Have Consequences

Tower placement is a small decision until it is wrong. In a tower-defense game, a bad tower is not merely an inelegant plan. It is money spent, coverage lost, enemies leaked, and time wasted. The game does not care that the explanation sounded strategic. It only asks whether the tower actually touches the road. ...

Hard Problems Pay Better: Why Difficulty-Aware DPO Fixes Multimodal Hallucinations

Training data has a bad habit: the easiest examples talk the loudest. Anyone who has trained a model on preference pairs knows the scene. One answer is clearly grounded in the image; the other confidently invents an object, a color, or an action that is not there. The model learns the contrast quickly. Everyone applauds. The loss goes down. The dashboard looks obedient. ...

Teaching Has a Poker Face: Why Teacher Emotion Needs Its Own AI

A teacher can say “Good, let’s try again” in at least five different emotional languages. It can mean patience. It can mean disappointment carefully wrapped in professionalism. It can mean encouragement, routine classroom management, mild frustration, or the heroic survival instinct of someone explaining the same concept for the fourth time while thirty students perform collective eye contact avoidance. ...

When 1B Beats 200B: DeepSeek’s Quiet Coup in Clinical AI

Chest X-rays are not a glamorous AI benchmark. They are routine, repetitive, and brutally operational. A hospital does not need a model that can write poetry about radiology. It needs reports that are accurate enough, fast enough, structured enough, and cheap enough to run inside an actual clinical workflow without turning the IT department into a cloud-billing support group. ...

When One Clip Isn’t Enough: Teaching LLMs to Watch Long Videos Like Adults

Video is a terrible place to hide evidence. Not because the evidence is invisible. Because it is usually obvious only after someone has already found the right minute, the right scene, and the right visual detail. A person reviewing a long customer-support screen recording, a training video, a compliance recording, or a surveillance clip rarely watches everything with equal attention. They skim, localize, zoom in, check the detail, and then answer. Primitive, yes. Effective, also yes. ...

Echoes, Not Amnesia: Teaching GUI Agents to Remember What Worked

Memory is not a folder. A useful employee does not fill out the same form from scratch every morning as if yesterday never happened. They remember which menu hides the export button, which warning can be ignored, which field must be filled before the “Next” button wakes up, and which apparently harmless click sends the process into a small bureaucratic swamp. ...