Computer Vision

Caught on Skeleton: How Pose-Based AI Is Teaching Retail Cameras to Adapt

A camera in a store has one job that sounds simple until one remembers that stores are not laboratories. People browse. Children run. Staff restock shelves. Customers bend, hesitate, carry bags, reach into pockets, and occasionally do all of that without stealing anything. A system that treats every awkward motion as a crime will quickly become less a security tool than a very expensive way to annoy employees. Retail has enough of those already. ...

When Prompts Hire Specialists: Why pMoE Changes Visual Adaptation Economics

Inspection cameras, pathology scanners, product catalog systems, and retail shelf analytics all create the same inconvenient problem: the image may look simple, but the knowledge needed to interpret it rarely comes from one source. A model trained on broad natural images may recognize general objects well. A contrastive model may separate fine visual categories better. A medical encoder may notice domain-specific patterns that a general model treats as visual noise. A segmentation-oriented model may understand spatial boundaries better than a classifier. Asking one backbone to cover all of this is elegant in a slide deck and occasionally foolish in production. Nature, sadly, did not optimize itself for clean model procurement. ...

Motivation Is Something Your Models Need: When Curiosity Becomes a Training Strategy

Training budgets are where elegant architecture slogans go to be audited. The usual response to a model that needs better accuracy is painfully familiar: make it larger, train it longer, feed it more data, and then pretend the GPU bill is a philosophical problem. The paper Motivation Is Something You Need takes a more interesting route. It asks whether a model needs to be large all the time, or whether extra capacity can be activated only when training signals suggest the model is “getting somewhere.” ...

Swin or Swim: Federated Fusion for Lung AI

Hospital AI sounds simple until someone asks where the patient images will live. A research team can build a decent chest X-ray classifier in a lab. A hospital network, however, has to answer less glamorous questions. Can private data stay inside each institution? Can the model improve across sites without pooling raw images? Can the system run without consuming hardware like a small dragon? And, after all that, does accuracy actually improve enough to justify the complexity? ...

Seeing Is Not Reasoning: Why Mental Imagery Still Breaks Multimodal AI

A model can generate a pretty sequence of images. Good. So can a slide deck. The harder question is whether those images actually help it think. That is the uncomfortable point behind MentisOculi: Revealing the Limits of Reasoning with Mental Imagery, a new benchmark paper that tests whether frontier multimodal models can do something closer to human mental imagery: form a visual state, keep it stable, transform it step by step, and use the transformed state to decide what to do next. Not merely “look at an image and answer a question.” Not “draw a plausible intermediate picture.” Actual visual reasoning, with consequences. ...

Edge Cases Matter: Teaching Drones to See the Small Stuff

A drone can cover a construction site, a traffic corridor, or a flooded street in minutes. That is the easy part. The harder part is noticing the small object that changes the decision: a person near a road barrier, a tiny vehicle in a dense intersection, a partly hidden target on a high-resolution aerial image. ...

PyraTok: When Video Tokens Finally Learn to Speak Human

Video looks easy until a machine has to remember what matters. A human watches a short clip and immediately separates the important layers: the object, the action, the background, the timing, the implied intent, the scene transition. A model sees a much less polite object: frames, pixels, motion, compression artifacts, and a large bill for GPU memory. Then we ask it to generate video, answer questions, segment objects, localize actions, and preserve meaning across time. Naturally, the model responds by becoming expensive. Very relatable. ...

When Models Guess the Verb by Looking at the Drawer

Drawer. That is the easy part. A model sees a drawer, and it knows that drawers are often opened. Then it watches a video where someone is closing the drawer and predicts opening anyway. This is not the kind of error that makes a demo look silly for five seconds and then disappear into the benchmark appendix. It is the kind of error that reveals what the system is really using as evidence. The model is not necessarily watching the motion. It may be recognizing the object, remembering the most common verb attached to that object during training, and calling that “video understanding.” Very efficient. Also wrong. ...

Noise Without Regret: How Error Feedback Fixes Differentially Private Image Generation

Photos are annoying data. They are useful because they contain details: the handle of a bag, the edge of a sleeve, the texture of a face, the faint classroom gesture that matters only after someone trains a model on it. They are risky for exactly the same reason. If a generated image looks too much like the real training data, it may quietly leak what the organization was trying not to reveal. If it is protected too aggressively, it becomes a blurry souvenir from a dataset that used to be useful. ...

Punching Above Baselines: When Boxing Strategy Learns to Differentiate

Li Qian is the useful part of the paper, not the medal count

Boxing is a simple sport only if you watch it from far enough away. Two athletes enter a ring. One wins. The spectators remember the clean punch, the late-round pressure, the judge’s card, maybe the celebration. Coaches remember something less theatrical: distance, lead-hand rhythm, counter timing, target selection, whether a hook was thrown from the wrong range, whether the opponent’s aggression was actually a trap. ...