Seeing is Believing? Not Quite — How CoCoT Makes Vision-Language Models Think Before They Judge

TL;DR for operators: Vision-language models do not merely “look at an image” and answer. In social tasks, they must perform three different jobs: notice what is visually present, infer what situation those cues imply, and judge what social or safety norm applies. Standard chain-of-thought prompting often smears those jobs together into one confident little essay. Very charming. Also very dangerous. ...

July 29, 2025 · 17 min · Zelina

Good Bot, Bad Reward: Fixing Feedback Loops in Vision-Language Reasoning

TL;DR for operators: The useful lesson is not that vision-language models need longer reasoning traces. They already produce plenty of words. Some of them are even adjacent to thought. The useful lesson is that multimodal systems need feedback that can tell where a reasoning path breaks, not merely whether the final answer looks acceptable. ...

June 13, 2025 · 15 min · Zelina