LLM Guardrails

Sirens in the Weights: Why AI Safety May Be Hiding Inside the Model

Moderation usually sits outside the model. A user sends a prompt. A model answers. Then a separate guard model steps in, reads the text, and declares the content safe or unsafe. In business terms, this is a familiar architecture: put a checkpoint at the gate, classify traffic, block what violates policy, and hope the checkpoint is both fast and sensible. It is the airport-security model of AI safety, except the passenger may be a 40-token prompt, a 4,000-token reasoning trace, or a response that is still being generated while the guard is politely looking for its shoes. ...

When Drones Think Too Much: Defining Cognition Envelopes for Bounded AI Reasoning

A drone finds a clue. Not a dramatic clue, necessarily. A backpack near a trailhead. A red hat in water. A pair of goggles on rock. The kind of object a human search-and-rescue team would treat as operational evidence, not as a philosophical invitation. But once a vision-language model captions the image, a language model assesses its relevance, and another model proposes a search action, the system has quietly crossed an important line. ...

OneShield Against the Storm: A Smarter Firewall for LLM Risks

TL;DR for operators

Enterprise LLM safety is often discussed as if the main question is whether the model has been trained to “behave”. That is the comforting version of the story. It is also too small. IBM’s OneShield paper argues for a different operating model: treat safety as a separate, model-agnostic guardrail layer that sits around the LLM, runs multiple specialised detectors in parallel, and then applies explicit policy decisions through a separate policy manager.[1] In plain business terms, OneShield is less like teaching the model good manners and more like installing a configurable safety-control plane around every AI interaction. Glamorous? Not especially. Operationally useful? Very much so. ...
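
To make the "detectors in parallel, then a policy manager" shape concrete, here is a minimal Python sketch. It is not OneShield's actual API or implementation: the detector functions, the `POLICY` table, the action names (`allow`, `mask`, `block`), and the "most severe action wins" rule are all assumptions made up for illustration, and the toy regex and keyword checks stand in for real specialised detectors.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    detector: str   # name of the detector that produced this finding
    flagged: bool   # True if the detector considers the text risky
    detail: str = ""


# Hypothetical toy detectors; a real deployment would use trained classifiers.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def detect_email(text: str) -> Finding:
    match = EMAIL_RE.search(text)
    return Finding("pii_email", match is not None, match.group(0) if match else "")


def detect_blocked_terms(text: str) -> Finding:
    blocked = ("build a bomb",)  # illustrative term list, not a real policy
    hit = next((term for term in blocked if term in text.lower()), "")
    return Finding("blocked_terms", bool(hit), hit)


DETECTORS = (detect_email, detect_blocked_terms)

# Assumed policy table: detector name -> action to take when it flags.
# Kept separate from the detectors so policy can change without retraining.
POLICY = {"pii_email": "mask", "blocked_terms": "block"}
SEVERITY = {"allow": 0, "mask": 1, "block": 2}


def run_detectors(text: str) -> list[Finding]:
    # Detectors are independent, so they can run concurrently.
    with ThreadPoolExecutor(max_workers=len(DETECTORS)) as pool:
        return list(pool.map(lambda detector: detector(text), DETECTORS))


def decide(findings: list[Finding], policy: dict[str, str] = POLICY) -> str:
    # The policy manager maps flagged findings to actions; the most severe wins.
    actions = [policy.get(f.detector, "allow") for f in findings if f.flagged]
    return max(actions, key=SEVERITY.__getitem__, default="allow")


def guard(text: str) -> str:
    return decide(run_detectors(text))
```

Calling `guard(prompt)` on an input or `guard(response)` on an output returns one action string. The point of the split is that the detectors only report what they see, while the decision about what to do with a finding lives in one editable table.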