Steer by Equation: When LLM Alignment Learns to Drive with ODEs
Opening — Why This Matters Now

Activation steering has become the quiet workhorse of LLM alignment. No retraining. No RLHF reruns. Just a subtle nudge inside the model's hidden states at inference time. Efficient? Yes. Principled? Not quite. Most steering methods rely on one-step activation addition: compute a direction vector, add it once, and hope the model behaves. It works—until it doesn't. Complex behaviors like truthfulness, helpfulness, and toxicity mitigation rarely live on clean linear boundaries. ...
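For readers unfamiliar with the baseline being critiqued, here is a minimal sketch of one-step activation addition: a direction vector is computed as a difference of means over contrastive activations, then added once to a hidden state at inference. All function names and the toy data are illustrative assumptions, not taken from any particular steering library.

```python
import numpy as np

def steering_direction(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Difference-of-means direction between contrastive activation sets,
    normalized to unit length."""
    d = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def steer_once(hidden: np.ndarray, direction: np.ndarray, alpha: float = 4.0) -> np.ndarray:
    """The one-step nudge: h <- h + alpha * d, applied a single time."""
    return hidden + alpha * direction

# Toy stand-ins for layer activations on "desired" vs. "undesired" prompts.
rng = np.random.default_rng(0)
pos = rng.normal(loc=1.0, scale=1.0, size=(32, 8))
neg = rng.normal(loc=-1.0, scale=1.0, size=(32, 8))

d = steering_direction(pos, neg)
h = rng.normal(size=8)          # one hidden state to steer
h_steered = steer_once(h, d)    # moved by exactly alpha along d
```

Because `d` is unit-norm, the steered state's projection onto the direction grows by exactly `alpha`; the method's fragility comes from assuming that a single fixed step along one line captures the target behavior everywhere.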