As businesses embrace large language models (LLMs) across sectors like healthcare, finance, and customer support, a pressing concern has emerged: how do we guard against hallucinations, toxicity, and data leaks without killing performance or flexibility?
Enter OneShield, IBM’s next-generation guardrail framework. Think of it not as a rigid moral compass baked into the model, but as an external, modular firewall — capable of custom rules, parallel scanning, and jurisdiction-aware policy enforcement. The design principle is simple but powerful: separate safety from generation.
Why Current Guardrails Fall Short
Most current safety solutions suffer from at least one of three flaws:
- They’re entangled with the LLM — limiting transparency and introducing blind spots.
- They’re inflexible — hardcoded rules that don’t adapt to new threats or legal boundaries.
- They’re monolithic — failing to support nuanced, multi-detector coordination.
Even advanced systems like LlamaGuard and NeMo Guardrails struggle when domain customization or multi-risk interplay is required. And while OpenAI’s Moderation API casts a wide net, it still falls short in edge cases where context matters more than keywords.
OneShield: Modular, Model-Agnostic, Real-Time
OneShield’s architecture is split into four main components:
| Component | Function |
|---|---|
| Orchestrator | Routes user inputs/outputs and aggregates detector results |
| Detectors | Stateless services (classification, extraction, comparison) |
| Policy Manager | Applies contextualized actions based on detector outputs |
| Data Stores | For proprietary knowledge retrieval and matching |
All detectors run in parallel, keeping latency low. Policies are applied after aggregation, enabling complex multi-risk logic (e.g., allow PII if no hate speech is present).
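To make that division of labor concrete, here is a minimal Python sketch of the pattern described above: detectors fan out in parallel, and policy logic runs only over the aggregated findings. All names and the toy detectors are illustrative assumptions, not OneShield’s actual API.

```python
# Illustrative sketch of the orchestrator pattern: parallel detectors, then policy.
# Class/function names and the placeholder detectors are hypothetical.
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

Detector = Callable[[str], dict]  # stateless: text in, findings out

def run_detectors(text: str, detectors: Dict[str, Detector]) -> Dict[str, dict]:
    """Fan detectors out in parallel so total latency tracks the slowest detector."""
    with ThreadPoolExecutor(max_workers=len(detectors)) as pool:
        futures = {name: pool.submit(fn, text) for name, fn in detectors.items()}
        return {name: f.result() for name, f in futures.items()}

def apply_policy(findings: Dict[str, dict]) -> str:
    """Multi-risk logic runs only after aggregation, e.g. PII alone is tolerable,
    PII combined with hate speech is not."""
    if findings["hate_speech"]["flagged"]:
        return "block"
    if findings["pii"]["entities"]:
        return "redact"
    return "pass"

detectors = {
    "pii": lambda t: {"entities": []},           # placeholder detector
    "hate_speech": lambda t: {"flagged": False}, # placeholder detector
}
print(apply_policy(run_detectors("Hello, my SSN is ...", detectors)))
```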
Detector Arsenal: Classification, Extraction, Comparison
🧠 Classification Detectors
These classify text into categories such as:
- Self-Harm: Using Reddit SuicideWatch + LLM logs (F1: 96.5%)
- Adult Content: Crawled data with TF-IDF filtering (F1: 93.8%)
- Health Advice: Cross-domain benchmarked (F1: 87.7%)
All classification models are BERT- or SepCNN-based, maintained via sparse human-in-the-loop updating.
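For intuition, a stateless classification detector can be as thin as a wrapper around a text-classification model that reports a score and leaves the decision to the Policy Manager. The sketch below uses a public sentiment checkpoint purely as a placeholder; the paper’s detectors are custom BERT/SepCNN models, and the threshold is invented for illustration.

```python
# Hypothetical stateless classification detector.
# The checkpoint below is a public placeholder, not one of OneShield's models.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

def classify_detector(text: str, threshold: float = 0.8) -> dict:
    """Return a label, score, and flag; the Policy Manager decides the action."""
    result = classifier(text, truncation=True)[0]
    return {
        "label": result["label"],
        "score": result["score"],
        "flagged": result["score"] >= threshold,
    }
```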
🕵️ Extractor Detectors
These focus on PII detection, combining hybrid methods (a toy version follows the list below):
- Regex rules + entity extractors + contextual scoring
- Covers names, addresses, SSNs, health IDs, etc.
- Offers redaction or masking depending on jurisdiction
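A toy version of the rule-based layer might look like this; the real detector adds entity extractors and contextual scoring on top, and the redact-versus-mask choice is driven by jurisdiction. The patterns and names below are illustrative only.

```python
# Toy regex-based PII extractor covering two patterns; real systems layer
# entity extraction and contextual scoring on top of rules like these.
import re

PII_PATTERNS = {
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

def extract_pii(text: str) -> list[dict]:
    findings = []
    for label, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append({"type": label, "span": match.span(), "value": match.group()})
    return findings

def redact(text: str, findings: list[dict]) -> str:
    """Mask matched spans in place; redaction vs. masking is a policy choice."""
    for f in sorted(findings, key=lambda f: f["span"][0], reverse=True):
        start, end = f["span"]
        text = text[:start] + f"[{f['type'].upper()}]" + text[end:]
    return text

sample = "Reach me at jane@example.com"
print(redact(sample, extract_pii(sample)))  # -> "Reach me at [EMAIL]"
```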
📚 Comparison Detectors
Two specialized modules:
- Text Attribution: Identifies copyright/data leaks by matching user or model text against proprietary corpora using hybrid vector + text similarity (a rough sketch follows this list).
- Factuality Checker: Evaluates hallucinations by comparing LLM outputs to known facts (e.g., Forbes company facts).
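The snippet below blends a TF-IDF cosine score (a lightweight stand-in for embedding similarity) with lexical overlap to illustrate the hybrid idea; the weights, proxy measures, and corpus are invented for demonstration, not the system’s actual matcher.

```python
# Rough stand-in for "hybrid vector + text similarity" attribution scoring.
# TF-IDF cosine approximates the vector side; SequenceMatcher covers lexical overlap.
from difflib import SequenceMatcher
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def attribution_score(candidate: str, corpus: list[str]) -> float:
    vec = TfidfVectorizer().fit(corpus + [candidate])
    vector_sim = cosine_similarity(vec.transform([candidate]), vec.transform(corpus)).max()
    lexical_sim = max(SequenceMatcher(None, candidate, doc).ratio() for doc in corpus)
    # Equal weighting is arbitrary; a high blended score suggests a potential leak.
    return 0.5 * vector_sim + 0.5 * lexical_sim

corpus = ["Internal memo: Q3 revenue grew 12% driven by the consulting unit."]
print(attribution_score("Q3 revenue grew 12% driven by the consulting unit.", corpus))
```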
The Real Star: Policy Manager
This is where OneShield gets clever. Instead of each detector acting in isolation, the Policy Manager:
- Aggregates all signals
- Applies custom or jurisdictional templates (e.g., GDPR, CCPA)
- Takes actions like pass, redact, or block — contextually
It’s what allows OneShield to support complex, real-world requirements without being overly blunt or excessively permissive.
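One way to picture such a policy is as a declarative template: an ordered list of conditions over aggregated findings, each mapped to an action. The rule names, fields, and GDPR-flavored example below are hypothetical, not OneShield’s actual policy schema.

```python
# Hypothetical declarative policy template; first matching rule wins.
POLICY_GDPR = [
    {"when": lambda f: f["hate_speech"]["flagged"],                  "action": "block"},
    {"when": lambda f: f["pii"]["entities"] and not f["anonymized"], "action": "redact"},
]

def decide(findings: dict, policy: list[dict]) -> str:
    """Evaluate rules in order over aggregated detector findings; default to pass."""
    for rule in policy:
        if rule["when"](findings):
            return rule["action"]
    return "pass"

findings = {
    "hate_speech": {"flagged": False},
    "pii": {"entities": ["email"]},
    "anonymized": False,
}
print(decide(findings, POLICY_GDPR))  # -> "redact"
```

Keeping policy as data rather than code is what lets jurisdictional templates (GDPR, CCPA, or a customer’s own rules) be swapped without touching the detectors.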
Real Deployment: GitHub Meets Compliance
IBM tested OneShield in the wild via InstructLab, a community-driven model training hub.
- PRs containing prompt-response training data were auto-scanned by OneShield
- 8.25% of submissions were flagged for potential violations (sexual content, hate speech, doxxing)
- Results were passed to human triagers, slashing review time and elevating trust
OneShield acted as a GitHub bot, demonstrating its flexibility and ease of integration.
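For a sense of what that integration can look like, here is a minimal webhook-style sketch: receive a pull-request event, scan its text, and surface violations for human triage. Flask, the payload fields used, and the stub scanner are assumptions for illustration, not the InstructLab bot’s actual code.

```python
# Minimal webhook sketch of the bot pattern: scan PR text, report for triage.
from flask import Flask, request

app = Flask(__name__)

def scan_with_guardrails(text: str) -> list[str]:
    """Stand-in for the guardrail scan; returns a list of violation labels."""
    return []

@app.route("/webhook", methods=["POST"])
def on_pull_request():
    event = request.get_json()
    body = event.get("pull_request", {}).get("body", "") or ""
    violations = scan_with_guardrails(body)
    # A real bot would post a review comment or label for human triagers here.
    return {"flagged": bool(violations), "violations": violations}
```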
Why This Matters
As LLMs power more core business functions, static rules and ad hoc moderation will fail. We need:
- Context-aware safety (e.g., allow PII if anonymized and not combined with hate speech)
- Flexible deployment (e.g., as a firewall layer across vendors)
- Continual updates without retraining the core model
OneShield shows that safety doesn’t have to be brittle or slow. With modular detectors and a smart policy engine, it paves the way for a future where trustworthy AI isn’t just about what the model knows — but how it’s governed.
Cognaptus: Automate the Present, Incubate the Future.