Seeing is Believing? Not Quite — How CoCoT Makes Vision-Language Models Think Before They Judge
TL;DR for operators Vision-language models do not merely “look at an image” and answer. In social tasks, they must perform three different jobs: notice what is visually present, infer what situation those cues imply, and judge what social or safety norm applies. Standard chain-of-thought prompting often smears those jobs together into one confident little essay. Very charming. Also very dangerous. ...