ThinkSafe: Teaching Models to Refuse Without Forgetting How to Think
A model can be very good at solving math problems and very bad at saying no. That sentence sounds like a joke until it becomes a deployment problem. A reasoning model trained to work harder, think longer, and satisfy difficult prompts may also become more willing to satisfy harmful prompts. The training objective says: solve the problem. The model obeys. Safety, apparently, was not copied on the memo. ...