Multimodal Reasoning

Name the Speaker, Then Ask the Plot: Selective Reasoning for Drama Transcripts

TL;DR for operators: A streaming transcript can contain thousands of dialogue lines. Most are easy to assign to a speaker from the audio; the difficult minority consists of whispers, brief replies, crowded casts, and speech from outside the frame. DramaSR-LRM raises overall attribution accuracy from 85.49% to 87.79%. The average gain looks modest, but accuracy for utterances shorter than 0.5 seconds rises from 67.45% to 76.65%, precisely where acoustic evidence is most limited. ...

Seeing is Believing? Not Quite — How CoCoT Makes Vision-Language Models Think Before They Judge

TL;DR for operators: Vision-language models do not merely “look at an image” and answer. In social tasks, they must perform three different jobs: notice what is visually present, infer what situation those cues imply, and judge what social or safety norm applies. Standard chain-of-thought prompting often smears those jobs together into one confident little essay. Very charming. Also very dangerous. ...

Good Bot, Bad Reward: Fixing Feedback Loops in Vision-Language Reasoning

TL;DR for operators: The useful lesson is not that vision-language models need longer reasoning traces. They already produce plenty of words. Some of them are even adjacent to thought. The useful lesson is that multimodal systems need feedback that can tell where a reasoning path breaks, not merely whether the final answer looks acceptable. ...