Computer Vision

Seeing Is Thinking: When Multimodal Reasoning Stops Talking and Starts Drawing

Image work has always had a small credibility problem: people can say where they looked, but we do not always know whether they actually looked there. The same problem shows up in multimodal AI. A model can answer a question about a chart, a photograph, a geometry diagram, or a robotic scene, then produce a neat textual chain of thought afterwards. It may sound procedural. It may mention “examining the relevant region.” It may even say “the graph shows…” with the confidence of a consultant holding a laser pointer. ...

Crossing the Line: Teaching Pedestrian Models to Reason, Not Memorize

Crosswalks look simple from a spreadsheet. A pedestrian either crosses at the intersection or crosses mid-block. The model sees age group, gender, lane count, lighting, weather, signal timing, maybe a bus stop nearby, and then predicts the choice. Very civilized. Very tabular. Very likely to fail when the same logic is moved to a different road. ...

When Three Examples Beat a Thousand GPUs

A GPU bill is usually treated as a hardware problem. Buy faster accelerators, shorten training runs, negotiate a better cloud contract. Less often asked is whether the expensive part of the pipeline began with a badly calibrated prompt. An LLM generating neural-network architectures can create thousands of candidates before training begins. If the prompt provides too little context, the model may repeatedly produce shallow variations of the same familiar design. Add more examples, and it may combine useful ideas across architectural families. Add still more, and the output can become worse, incomplete, or invalid. ...

Label Now, Drive Later: Why Autonomous Driving Needs Fewer Clicks, Not Smarter Annotators

Clicks are a cost centre. In a 3D annotation tool, deleting an unnecessary bounding box may take one or two seconds. Creating a missed vehicle annotation from scratch takes about 23 seconds. Correcting a poorly positioned box falls somewhere in between. These actions may all count as model errors. They do not cost the same amount of human time. ...

MaskOpt or It Didn’t Happen: Teaching AI to See Chips Like Lithography Engineers

Cells repeat. That is the comforting part of chip design. A NAND gate appears thousands of times. A buffer shows up again and again. Standard-cell libraries exist because repetition is economically useful: design once, place many times, avoid reinventing geometry until everyone loses the will to live. ...

Dexterity Over Data: Why Sign Language Broke Generic 3D Pose Models

Hands are small, fast, and inconvenient. That is a problem for AI systems that prefer the world to be large, slow, and conveniently labeled. A walking person can be reconstructed with some tolerance for imprecision. A signer cannot. In sign language, a curled finger, wrist angle, palm orientation, or moment of hand-body contact may carry meaning. When the model gets that wrong, it is not merely producing an awkward avatar. It is quietly changing the message. ...

When Tensors Meet Telemedicine: Diagnosing Leukemia at the Edge

Blood Smears, But Make Them Networked

A blood smear is not exactly the image most executives imagine when they say “AI transformation.” It is small, stained, quiet, and usually examined under conditions that do not look like a glossy product demo. Yet this is where many medical AI systems either become useful or become another benchmark trophy gathering dust in a PDF. ...

SceneMaker: When 3D Scene Generation Stops Guessing

A chair behind a table is not half a chair

A single image can be a very rude input. It shows the front of a room, hides the back of objects, compresses depth into pixels, and then asks a model to produce a coherent 3D scene. The model must decide what the hidden side of a chair looks like, how large the chair is, whether it sits behind the table or intersects with it, and where everything belongs in 3D space. Naturally, when the result looks wrong, we often blame “weak 3D generation.” ...

Drunk on Data: How Recurrent Fusion Models Soberingly Outperform Traditional Intoxication Detection

A checkpoint camera is not a breathalyzer. That sounds obvious, until a model reports 95.82% accuracy and everyone in the room suddenly starts imagining frictionless alcohol screening at entrances, vehicles, warehouses, airports, and campuses. This is the useful tension in “Detection of Intoxicated Individuals from Facial Video Sequences via a Recurrent Fusion Model” [1]. The paper does not claim to measure blood alcohol concentration. It does not turn facial video into courtroom-grade evidence. What it does is more specific, and arguably more operationally interesting: it shows how a video model can combine facial geometry, temporal movement, and adaptive fusion to classify likely intoxication from short facial video clips. ...

Noise Without Borders: How Single-Pair Guidance Rewrites Diffusion Synthesis

Camera noise is annoying in the same way logistics is annoying: nobody wants to talk about it until the system fails. A phone camera, a factory inspection camera, a medical imaging sensor, or a night-time security device does not merely capture a clean scene plus a cute little sprinkle of Gaussian noise. Real image noise is shaped by sensors, ISO settings, shutter speed, color processing, demosaicing, compression, and whatever private magic lives inside the image signal processing pipeline. In research papers, that pipeline is often politely summarized as “real-world noise.” In deployment, it is the reason a denoising model that looked excellent in the lab starts behaving like it has never seen darkness before. ...