Signal Over Noise: Why Multimodal RL Needs to Know What to Ignore
Opening — Why this matters now

Multimodal models have become the new default. Text, audio, video—feed it all in and let the transformer figure it out. The assumption is elegant: more signals, more intelligence. Reality is less polite. In production systems, signals are often missing, delayed, degraded, or irrelevant. Yet most RL post-training pipelines treat multimodal trajectories as if they were drawn from a single, homogeneous distribution. Every rollout is mixed together. Every reward is normalized together. Every gradient update assumes the model needed all modalities. ...
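To make the "every reward is normalized together" failure concrete, here is a minimal sketch. It assumes a simplified setup (synthetic Gaussian rewards, two hypothetical trajectory groups, z-score advantage normalization); none of these names or numbers come from a real pipeline. It shows how pooling rewards across modality mixes with different baseline scores makes the group gap, not within-group quality, dominate the advantages.

```python
import random
import statistics

random.seed(0)

# Hypothetical rollout rewards from two modality mixes with different
# baseline scales: text-only trajectories happen to score higher on
# average than audio+video ones. (Synthetic data for illustration.)
rewards = {
    "text_only": [random.gauss(0.8, 0.1) for _ in range(64)],
    "audio_video": [random.gauss(0.3, 0.1) for _ in range(64)],
}

def zscore(xs):
    """Standard advantage normalization: (x - mean) / std."""
    mu, sd = statistics.mean(xs), statistics.stdev(xs)
    return [(x - mu) / sd for x in xs]

# Pooled normalization: one mean/std across all trajectories. The gap
# between modality mixes dominates, so audio+video rollouts get strongly
# negative advantages regardless of how good they were within their group.
pooled = zscore(rewards["text_only"] + rewards["audio_video"])
av_pooled = pooled[64:]  # the audio+video slice
print(f"audio+video mean advantage, pooled norm: "
      f"{statistics.mean(av_pooled):.2f}")  # well below zero

# Per-group normalization: compare each trajectory only against rollouts
# with the same modality mix, so the gradient signal reflects within-group
# quality rather than which modalities happened to be present.
per_group = {k: zscore(v) for k, v in rewards.items()}
print(f"audio+video mean advantage, per-group norm: "
      f"{statistics.mean(per_group['audio_video']):.2f}")
```

Under pooled normalization the lower-scoring modality mix is pushed down wholesale; per-group normalization recenters each mix at zero, which is one simple way a pipeline could stop treating heterogeneous trajectories as one distribution.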