GAVEL: When AI Safety Grows a Rulebook
Rules are boring until the audit starts. That is roughly where enterprise AI safety is heading. A chatbot can be polite, policy-aligned, and harmless on the surface while still doing the internal work of manipulation, scam automation, or unsafe assistance. Text moderation catches what the model says. Classic activation monitoring tries to catch what the model is internally representing. But both become awkward in production: text moderation sees too little, and activation monitoring often explains too little. ...