Mind the Slot: Jailbreak Prompts Have Weak Points, Not Just Bad Words
Security teams like to search for suspicious strings. That habit is understandable. Strings are visible. They can be logged, filtered, matched, scored, and proudly displayed in dashboards. A bad suffix at the end of a prompt looks like a bad suffix at the end of a prompt. Convenient. Almost too convenient. The problem is that prompts are not flat text boxes. They are transformed into token sequences, wrapped in chat templates, and passed through attention layers that do not treat every position equally. Some positions receive more influence over the model’s next-token behavior than others. Put adversarial tokens there, and the same amount of “badness” can travel farther. ...