Model Steering

A refusal is easy to recognize. The model says it cannot help. The sentence sounds polite. The compliance team relaxes for three seconds. Everyone moves on. That is the comfortable version of AI safety: refusal as an observable behavior. The uncomfortable version is that refusal may be only the visible end of a much narrower internal computation. If that computation can be found, isolated, and steered, then the model’s “sorry, I can’t assist with that” is not a moral boundary. It is a circuit behavior. Very reassuring, in the same way a locked glass door is reassuring before someone points out the hinge. ...