Computer Vision

When Motion Lies: Why Video LLMs Keep Misreading Physics

A car approaches a crosswalk. The frames look simple: car, road, direction, movement. A human can still ask the useful question: is the car speeding up, slowing down, or merely moving at a steady pace? A video language model may answer with the confidence of a dashboard camera that has read too many captions and learned too little physics. It sees a car getting closer. It infers “accelerating.” The problem is not that the model missed the car. The problem is that it saw the same visual pattern and failed to model the hidden change in motion. ...

CAPTION THIS: Why Multimodal RAG Is Finally Growing Up

Captioning looks easy until the caption has to be true. A consumer image captioning model can say, “a man standing at a podium,” and most people will nod. A newsroom cannot stop there. It needs to know whether the man is a senator, a witness, a CEO, a defendant, or simply someone unlucky enough to stand near a microphone. It may need the committee name, the location, the event, the year, the organization behind the banner, and the person half-visible at the edge of the frame. Journalism, as usual, ruins the demo. ...

Merge, Bound, and Determined: Why Weight-Space Surgery May Be CIL’s Most Underrated Trick

Catalogs change. Defect categories change. Fraud patterns change. Document types change. The model, unfortunately, often reacts like an employee who learns the new product line and immediately forgets where the old shelves are. That is the everyday problem behind Class-Incremental Learning (CIL): a model must learn new classes over time while still recognizing old ones. The difficult part is not merely adding output labels. It is keeping the feature extractor from being rewritten by the latest task until yesterday’s knowledge becomes decorative archaeology. ...

Pruned but Not Muted: How Frequency-Aware Token Reduction Saves Vision Transformers

Images are expensive. Not emotionally, although some product managers do try. They are expensive because modern visual models turn an image into a sequence of tokens, then let those tokens attend to one another. In a Vision Transformer, more tokens usually mean more detail, but also more attention cost. The obvious response is to reduce the number of tokens. ...
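To put rough numbers on that cost, here is a minimal sketch, assuming a square-patch tokenizer and counting only the pairwise attention scores in a single head of a single layer. The function names and the 224-pixel image with 16-pixel patches are illustrative assumptions, not details from the article, and the sketch does not implement frequency-aware reduction itself.

```python
def num_patch_tokens(height: int, width: int, patch: int) -> int:
    """Count the patch tokens for an image split into patch x patch squares.

    Assumes both image sides are exact multiples of the patch size.
    """
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    return (height // patch) * (width // patch)


def attention_pairs(num_tokens: int) -> int:
    """Count the token-to-token scores in one self-attention map.

    Every token attends to every token, so the count grows quadratically.
    """
    return num_tokens * num_tokens


if __name__ == "__main__":
    full = num_patch_tokens(224, 224, 16)
    half = full // 2
    print(full, attention_pairs(full))
    print(half, attention_pairs(half))
```

Under these assumptions, halving the token count cuts the number of attention scores to a quarter, which is why token reduction is such a tempting lever.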

When Raindrops Become Data: Hypergraphs, Event Cameras, and the New Shape of Perception

Rain is easy to understand until you try to measure every drop. A conventional camera solves this problem by pretending time arrives in neat rectangular packages: one frame, then another frame, then another. An event camera does something stranger and, in many real-world settings, more useful. It does not record the whole scene at fixed intervals. It records changes. A pixel fires when brightness changes, producing a stream of asynchronous events rather than a normal video. ...
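The "record changes, not frames" idea can be sketched in a few lines. This is a simplified simulation, not how the sensor hardware or the article's method works: it steps through ordinary frames, uses raw intensity differences, and applies a hypothetical fixed threshold, whereas a real event camera reacts per pixel and asynchronously. The function name and event layout are assumptions for illustration.

```python
from typing import List, Tuple

# One event: (time step, x, y, polarity), where polarity is +1 for
# brighter and -1 for darker.
Event = Tuple[int, int, int, int]


def frames_to_events(frames: List[List[List[float]]],
                     threshold: float = 10.0) -> List[Event]:
    """Emit an event whenever a pixel drifts far enough from its reference.

    The reference level of a pixel is its brightness at the last moment it
    fired (or in the first frame), so unchanged pixels produce no events.
    """
    if not frames:
        return []
    reference = [row[:] for row in frames[0]]
    events: List[Event] = []
    for t, frame in enumerate(frames[1:], start=1):
        for y, row in enumerate(frame):
            for x, value in enumerate(row):
                delta = value - reference[y][x]
                if abs(delta) >= threshold:
                    events.append((t, x, y, 1 if delta > 0 else -1))
                    reference[y][x] = value  # reset reference after firing
    return events
```

A static scene yields an empty list here, however many frames are fed in, which is the practical appeal: the output size tracks how much the scene changes rather than how long the recording runs.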

Mind the Markov Gap: How a Lightweight Agent Outsmarts Heavy LLMs in Open-Vocabulary Vision

A camera on a factory line does not need to write an essay before deciding whether a part is cracked. That sounds obvious. Yet a surprising amount of recent AI architecture quietly assumes the opposite: when vision systems become uncertain, bring in a large language model, ask it to generate richer descriptions, then run the detector again. Sometimes this works. It also turns a detection problem into a small committee meeting, and committee meetings are rarely known for real-time throughput. ...

The Latent Truth: Why Prototype Explanations Need a Reality Check

Audit starts with a simple request: show me why. For prototype-based neural networks, that request has always had a pleasantly visual answer. The model points to a learned prototype from training data and says, in effect, “this part of the image looks like that part of an example I already know.” This is the interpretability sales pitch in its most charming form. No opaque wall of logits. No post-hoc heatmap pretending to be a confession. Just a case-based explanation: this resembles that. ...

From Yarn to Code: What CrochetBench Reveals About AI’s Procedural Blind Spot

A pattern is not a caption. That sounds obvious until a multimodal model looks at a finished object, produces a confident set of instructions, and everyone in the room quietly rounds “looks plausible” up to “can build it.” This is one of the industry’s more expensive habits: mistaking descriptive competence for operational competence. The model can say what is there. Therefore, surely, it can infer how to make it. Very neat. Very wrong. ...

Learning by X-ray: When Surgical Robots Teach Themselves to See in Shadows

X-rays are useful because they are cheap, familiar, and already sitting in the operating room. They are also, inconveniently, shadows. That is the central tension in “Investigating Robot Control Policy Learning for Autonomous X-ray-guided Spine Procedures,” a paper that asks whether a robot policy can plan vertebroplasty cannula trajectories from only bi-planar X-ray views (one anterior-posterior view, one lateral view) without CT-based navigation, registration, or a lovingly over-engineered suite of intra-operative infrastructure.1 ...

Noisy but Wise: How Simple Noise Injection Beats Shortcut Learning in Medical AI

X-rays look clinical. To a neural network, they can also look like stationery. A hospital name in the corner. A scanner signature. A compression pattern. A familiar positioning marker. A slightly different way of cropping the lung field. None of these is pneumonia. None of these is COVID-19. Yet a deep learning model trained on small medical datasets can treat them as wonderfully convenient diagnostic evidence, because machines are very good at passing exams and less naturally committed to understanding what the exam is about. ...