Vision-Language Models

Seeing Is Deceiving: Diagnosing and Fixing Hallucinations in Multimodal AI

TL;DR for operators: A multimodal model can look at an image and still answer from memory, habit, or linguistic guesswork. That is the uncomfortable core of visual hallucination: the output is fluent, relevant-looking, and sometimes even useful, while being only loosely attached to the pixels it claims to describe. The practical lesson is not “never use multimodal AI.” That would be tidy, dramatic, and mostly useless. The lesson is narrower and more valuable: visual hallucinations need to be diagnosed by where grounding fails, not merely counted after the model has embarrassed itself. ...

Seeing is Believing? Not Quite — How CoCoT Makes Vision-Language Models Think Before They Judge

TL;DR for operators: Vision-language models do not merely “look at an image” and answer. In social tasks, they must perform three different jobs: notice what is visually present, infer what situation those cues imply, and judge what social or safety norm applies. Standard chain-of-thought prompting often smears those jobs together into one confident little essay. Very charming. Also very dangerous. ...

One Model to Train Them All: How OmniTrain Rethinks Open-Vocabulary Detection

TL;DR for operators: OmniTrain’s useful claim is not that open-vocabulary object detection needs a bigger vocabulary, a more theatrical prompt, or yet another detection head with a confident acronym stapled to it. Its claim is simpler and more operational: the training interface is the bottleneck.[1] Open-vocabulary detection asks a detector to find categories it may not have seen as boxed labels during training. That promise is attractive for retail shelves, industrial inspection, visual search, robotics, and any business where the object list changes faster than the annotation budget. But many systems still inherit a messy workflow: pre-train a vision-language model, fine-tune a detector, add grounding supervision, reconcile losses, then hope the pieces do not quietly disagree. ...

Prompt Without Words: Distilling GPT Semantics for Smarter Vision Models

TL;DR for operators: Most attempts to improve CLIP-style image classification with large language models follow a familiar ritual: ask GPT to describe a class, paste those descriptions into prompts, then hope the model pays attention to the useful bits. The problem is that GPT’s descriptions are not stable objects. They vary by query wording, include hedged statements, and sometimes contain features that are hard or impossible to verify visually. “Usually,” “may,” and “often” are not exactly the foundations of a disciplined recognition system. ...

Good Bot, Bad Reward: Fixing Feedback Loops in Vision-Language Reasoning

TL;DR for operators: The useful lesson is not that vision-language models need longer reasoning traces. They already produce plenty of words. Some of them are even adjacent to thought. The useful lesson is that multimodal systems need feedback that can tell where a reasoning path breaks, not merely whether the final answer looks acceptable. ...

From Infinite Paths to Intelligent Steps: How AI Learns What Matters

TL;DR for operators: GUI automation agents do not usually fail because clicking is hard. They fail because almost everything they could click is irrelevant. The CoGA paper proposes a pragmatic way to reduce that waste: use a vision-language model before reinforcement learning begins to generate executable code that identifies which GUI actions are currently affordable, then use that code as an action mask during RL training and inference.[1] The VLM is not the agent. It is more like an expensive consultant brought in once to write a rule-based narrowing function. After that, a reinforcement learning agent still learns the policy. ...
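
The masking idea in this excerpt can be illustrated with a minimal sketch. It is not the CoGA implementation: the GUI state fields, the action names, and the rules inside `affordable_actions` are all invented here to stand in for whatever code a VLM might generate. The only point is to show how a rule-based narrowing function can restrict both exploration and greedy action selection for an RL agent.

```python
import random
from typing import Dict, List, Sequence

# Hypothetical GUI state: a few boolean facts about the current screen.
State = Dict[str, bool]

# Hypothetical action vocabulary; a real GUI would expose far more actions.
ACTIONS: List[str] = ["click_save", "click_submit", "type_name", "close_dialog"]


def affordable_actions(state: State) -> List[bool]:
    """Stand-in for the generated narrowing function (rules are assumptions)."""
    dialog_open = state.get("dialog_open", False)
    return [
        not dialog_open and state.get("dirty", False),          # click_save
        not dialog_open and state.get("form_complete", False),  # click_submit
        not dialog_open,                                        # type_name
        dialog_open,                                            # close_dialog
    ]


def allowed_indices(mask: Sequence[bool]) -> List[int]:
    """Indices of the actions the mask permits; fail loudly if none remain."""
    allowed = [i for i, ok in enumerate(mask) if ok]
    if not allowed:
        raise ValueError("mask excludes every action")
    return allowed


def masked_greedy(q_values: Sequence[float], mask: Sequence[bool]) -> int:
    """Pick the highest-valued action among those the mask permits."""
    return max(allowed_indices(mask), key=lambda i: q_values[i])


def masked_epsilon_greedy(
    q_values: Sequence[float],
    mask: Sequence[bool],
    epsilon: float,
    rng: random.Random,
) -> int:
    """Explore only inside the permitted set; otherwise act greedily under the mask."""
    if rng.random() < epsilon:
        return rng.choice(allowed_indices(mask))
    return masked_greedy(q_values, mask)


if __name__ == "__main__":
    state = {"dialog_open": True, "dirty": True}
    mask = affordable_actions(state)
    q_values = [5.0, 1.0, 0.2, -1.0]  # unmasked, the agent would prefer click_save
    print(ACTIONS[masked_greedy(q_values, mask)])  # prints "close_dialog"
```

In this toy case the agent's highest-valued action is unavailable while a dialog is open, so the mask forces the only affordable choice. The learned policy still ranks whatever the mask leaves in place, which matches the division of labor the excerpt describes.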