Cross-Modal Attention

A product page has a photo. A description. A category. A few user clicks. Maybe a rating, if the platform is lucky. The ordinary recommender-system reflex is to pour all of that into the model and call it “multimodal.” Image embedding here, text embedding there, concatenate, pool, sum, ship. Then, when performance disappoints, add another feature extractor, another graph layer, another auxiliary objective, and hope the leaderboard blushes. ...