Don’t Just Fuse It — Align It: When Multimodal Recommendation Grows a Spine
Opening: Why this matters now

Multimodal recommendation has quietly hit a ceiling. Not because we ran out of data; quite the opposite. Images are sharper, text embeddings richer, and interaction logs longer than ever. The problem is architectural complacency: most systems add modalities, but few truly reason across them. Visual features get concatenated. Text is averaged. Users remain thin ID vectors staring helplessly at semantically over-engineered items. ...