VLMs | Cognaptus

Benchmarks Lie, Rooms Don’t: Why Embodied AI Fails the Moment It Enters Your House

The room is not impressed by your leaderboard. A robot that performs well on a public benchmark has not necessarily learned how to operate in your house. It may recognize a chair in a dataset. It may answer a visual question about a tidy image. It may even produce a confident paragraph explaining where the coffee mug should be. Then it enters a real room (with mirrors, partial views, cluttered corners, awkward sightlines, and objects that are not positioned for benchmark convenience) and suddenly the “general intelligence” starts behaving like a tourist holding the map upside down. ...

Seeing Too Much: When Multimodal Models Forget Privacy

Face. That is where the privacy problem starts to become awkward. A company does not need to build a facial-recognition product to create facial-recognition risk. It may only add a multimodal model to a customer-support workflow, an HR document review process, a KYC assistant, a media-monitoring tool, or a claims-processing system. Someone uploads an image. The model sees a person. Then the user asks: Who is this? Where do they live? What is their email? What is their religion? What is their medical condition? ...

RoboSafe: When Robots Need a Conscience (That Actually Runs)

A robot does not need evil intent to become dangerous. It only needs a bad next action. “Turn on the microwave” sounds ordinary until the microwave contains a fork. “Pick up the knife” may be harmless in a cooking task until the next move is to swing it around. “Turn on the stove” may be safe for one step and unsafe three steps later if the agent forgets to turn it off. Physical risk is annoyingly literal that way. It does not wait for a model to finish reflecting on its values. ...

When Rewards Learn to See: Teaching Humanoids What the Ground Looks Like

Robots do not fall because the word “walk” is ambiguous. They fall because the ground has opinions. A flat floor, a gap, a pile of blocks, and a staircase may all ask for “locomotion,” but they do not ask for the same behavior. One asks for velocity tracking. Another asks for foot placement. Another punishes careless exploration. A staircase, because it has a flair for drama, asks the robot to negotiate gravity one step at a time. ...

CitySeeker: Lost in Translation, Found in the City

The city does not answer literal questions. A person says, “I’m thirsty.” A human does not usually reply, “Please specify whether you require a vending machine, café, convenience store, supermarket, juice shop, water fountain, or bubble tea store.” That would be technically attentive and socially catastrophic. A human looks around, remembers what cities usually contain, infers which places can satisfy the need, and starts walking toward a plausible target. ...

STRIDE Gets a Plus-One: How ASTRIDE Rewrites Threat Modeling for the Agentic Era

Diagram reviews are where many security problems first become visible. Not in the production logs. Not in the postmortem. Not after a user discovers that a tool-calling agent has confidently pushed private data into the wrong API. The humble architecture diagram is supposed to be the place where adults in the room ask: what can go wrong here? ...

Steering by the Token: How GRAINS Turns Attribution into Alignment

TL;DR for operators: GrAInS is not “fine-tuning, but cheaper.” That framing misses the point and commits the usual business sin of turning a mechanism into a procurement slogan. The paper’s useful claim is more specific: token-level attribution can be converted into an inference-time steering signal. Instead of retraining model weights, GrAInS identifies which text or image tokens most strongly push the model toward preferred or dispreferred outputs, builds layer-wise steering vectors from those activation shifts, and applies normalized edits during inference. ...

Tunnel Vision: Why Vision-Language Models Still Miss the Bigger Picture

TL;DR for operators: A vision-language model can describe an image, answer a chart question, and still fail at the kind of seeing that a bored intern would perform before lunch. That is the operational lesson from Shmuel Berman and Jia Deng’s paper, “VLMs have Tunnel Vision: Evaluating Nonlocal Visual Reasoning in Leading VLMs.” The paper tests whether leading VLMs can do three basic things: compare two visual objects across an image, follow a sequence of visual clues, and trace a continuous line to its endpoint. Humans find these tasks trivial. Current VLMs do not. ...