Hard Problems Pay Better: Why Difficulty-Aware DPO Fixes Multimodal Hallucinations
Training data has a bad habit: the easiest examples talk the loudest. Anyone who has trained a model on preference pairs knows the scene. One answer is clearly grounded in the image; the other confidently invents an object, a color, or an action that is not there. The model learns the contrast quickly. Everyone applauds. The loss goes down. The dashboard looks obedient. ...