Phantasia and the Illusion of Safety: When AI Lies Without Looking Wrong
Safety checks usually look for the model doing something strange. That sounds reasonable. A compromised model should produce a strange phrase, repeat a suspicious payload, ignore the image, or behave in a way that feels obviously detached from the input. This is the comforting version of AI security: attackers leave fingerprints, defenders look for fingerprints, and everyone goes home after filling out a procurement checklist. ...