Consensus Sampling

A model generates an image. It looks ordinary. A horse in a meadow, a lighthouse in a storm, a bowl of oranges. Nothing dramatic. No obvious watermark, no visible glitch, no suspicious artefact screaming “please call the security team”. That is precisely the problem. Some AI failures are meant to be seen. Toxic text, obvious hallucinations, broken code, bizarre images with eight fingers and a cursed wrist. Those are the easy cases, relatively speaking. The harder cases are outputs that look fine while carrying something unsafe: a hidden message, a planted vulnerability, a backdoor trigger, or another payload that cannot be reliably detected by staring harder at the finished product. ...