
Map Before You Train: Data Cartography to Defuse LLM Memorization

Generative models leak. Not because engineers are careless, but because web-scale corpora hide rare, high-influence shards—snippets so unique that gradient descent can’t help but memorize them. A new data-first method, Generative Data Cartography (GenDataCarto), gives teams a way to see those shards in training dynamics and intervene—surgically, not bluntly—before they become liabilities.

The one-slide idea

Track two numbers for every pretraining sample:

- Difficulty (dᵢ): early-epoch average loss—how hard it was to learn initially.
- Memorization (mᵢ): fraction of epochs with forget events (loss falls below a threshold, then pops back above)—how often the model “refits” the same sample.

Plot (dᵢ, mᵢ), set percentile thresholds, and you get a four-quadrant map that tells you what to up-sample, down-weight, or drop to reduce leakage with minimal perplexity cost. ...
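A minimal sketch of those two scores, assuming per-sample losses are logged every epoch. The forget threshold, early-epoch window, and quadrant labels below are illustrative placeholders rather than the paper's values:

```python
import numpy as np

def carto_scores(loss_traces, early_epochs=3, forget_tau=2.0):
    """Per-sample difficulty and memorization scores from an
    (n_samples, n_epochs) array of training losses.

    Illustrative sketch: the early-epoch window and forget threshold
    are placeholders, not values from the GenDataCarto paper."""
    losses = np.asarray(loss_traces, dtype=float)
    # Difficulty d_i: average loss over the first few epochs.
    difficulty = losses[:, :early_epochs].mean(axis=1)
    # Forget event: loss below the threshold at epoch t, back above it at t+1.
    below = losses < forget_tau
    forgets = below[:, :-1] & ~below[:, 1:]
    # Memorization m_i: fraction of epoch transitions that are forget events.
    memorization = forgets.mean(axis=1)
    return difficulty, memorization

def quadrant_map(difficulty, memorization, pct=75):
    """Split samples into four quadrants using percentile thresholds."""
    hi_d = difficulty >= np.percentile(difficulty, pct)
    hi_m = memorization >= np.percentile(memorization, pct)
    return {
        "easy_stable": ~hi_d & ~hi_m,   # leave as-is
        "hard_stable":  hi_d & ~hi_m,   # candidates to up-sample
        "easy_leaky":  ~hi_d &  hi_m,   # candidates to down-weight
        "hard_leaky":   hi_d &  hi_m,   # likely memorized shards: audit or drop
    }
```

Plotting difficulty against memorization, colored by quadrant, reproduces the map the teaser describes.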

September 4, 2025 · 4 min · Zelina

Open-Source, Open Risk? Testing the Limits of Malicious Fine-Tuning

When OpenAI released the open-weight model gpt-oss, it did something rare: before letting the model into the wild, its researchers pretended to be bad actors. This wasn’t an ethical lapse. It was a safety strategy. The team simulated worst-case misuse by fine-tuning gpt-oss to maximize its dangerous capabilities in biology and cybersecurity. They called this process Malicious Fine-Tuning (MFT). And the results offer something the AI safety debate sorely lacks: empirical grounding. ...

August 6, 2025 · 4 min · Zelina

OneShield Against the Storm: A Smarter Firewall for LLM Risks

As businesses embrace large language models (LLMs) across sectors like healthcare, finance, and customer support, a pressing concern has emerged: how do we guard against hallucinations, toxicity, and data leaks without killing performance or flexibility? Enter OneShield, IBM’s next-generation guardrail framework. Think of it not as a rigid moral compass baked into the model, but as an external, modular firewall — capable of custom rules, parallel scanning, and jurisdiction-aware policy enforcement. The design principle is simple but powerful: separate safety from generation. ...
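The teaser doesn't expose OneShield's actual interfaces, so the sketch below only illustrates the stated design principle (safety separated from generation, with pluggable detectors scanned in parallel under a configurable policy); every detector and class name here is hypothetical:

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

# Hypothetical detectors; a real guardrail stack would call trained classifiers
# or policy services instead of toy heuristics.
def toxicity_check(text: str) -> bool:
    return any(w in text.lower() for w in ("idiot", "worthless"))

def pii_check(text: str) -> bool:
    return bool(re.search(r"\b\d{3}-\d{2}-\d{4}\b", text))  # e.g. a US SSN pattern

class GuardrailFirewall:
    """External, modular firewall: the generator is untouched; only the
    detector set and the jurisdiction-specific policy change."""

    def __init__(self, detectors: Dict[str, Callable[[str], bool]], policy: Dict[str, str]):
        self.detectors = detectors
        self.policy = policy  # detector name -> "block" or "flag"

    def scan(self, text: str) -> Dict[str, bool]:
        # Run all detectors in parallel, mirroring the parallel-scanning idea.
        with ThreadPoolExecutor() as pool:
            futures = {name: pool.submit(fn, text) for name, fn in self.detectors.items()}
            return {name: f.result() for name, f in futures.items()}

    def guarded_generate(self, generate: Callable[[str], str], prompt: str) -> str:
        if any(hit and self.policy.get(name) == "block"
               for name, hit in self.scan(prompt).items()):
            return "[request blocked by policy]"
        output = generate(prompt)
        if any(hit and self.policy.get(name) == "block"
               for name, hit in self.scan(output).items()):
            return "[response withheld by policy]"
        return output

firewall = GuardrailFirewall(
    detectors={"toxicity": toxicity_check, "pii": pii_check},
    policy={"toxicity": "block", "pii": "block"},
)
print(firewall.guarded_generate(lambda p: f"Echo: {p}", "My SSN is 123-45-6789"))
```

The structural point is the one the teaser makes: swapping rules or policies never requires touching the model itself.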

July 30, 2025 · 3 min · Zelina

Too Nice to Be True? The Reliability Trade-off in Warm Language Models

AI is getting a personality makeover. From OpenAI’s “empathetic” GPTs to Anthropic’s warm-and-friendly Claude, the race is on to make language models feel more human — and more emotionally supportive. But as a recent study from the Oxford Internet Institute warns, warmth might come at a cost: when language models get too nice, they also get less accurate.

The warmth-reliability trade-off

In this empirical study, titled Training language models to be warm and empathetic makes them less reliable and more sycophantic, researchers fine-tuned five LLMs — including LLaMA-70B and GPT-4o — to produce warmer, friendlier responses using a curated dataset of over 3,600 transformed conversations. Warmth was quantified with SocioT Warmth, a validated linguistic metric measuring closeness-oriented language. The models were then evaluated on safety-critical factual tasks such as medical reasoning (MedQA), factual truthfulness (TruthfulQA), and disinformation resistance. ...

July 30, 2025 · 4 min · Zelina

Unsafe at Any Bit: Patching the Safety Gaps in Quantized LLMs

When deploying large language models (LLMs) on mobile devices, edge servers, or any resource-constrained environment, quantization is the go-to trick. It slashes memory and compute costs by reducing model precision from 16-bit or 32-bit floating points to 8-bit or even 4-bit integers. But the efficiency comes at a cost: quantization can quietly erode the safety guarantees of well-aligned models, making them vulnerable to adversarial prompts and jailbreak attacks. ...
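To make the precision trade-off concrete, here is a generic symmetric int8 quantization of a toy weight matrix (a sketch of the general technique, not the specific scheme any particular LLM runtime or the paper uses):

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor int8 quantization: map floats onto [-127, 127]."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)  # toy fp32 weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print(f"memory: {w.nbytes / 1e6:.0f} MB -> {q.nbytes / 1e6:.0f} MB")    # 4x smaller than fp32
print(f"mean absolute rounding error: {np.abs(w - w_hat).mean():.5f}")
```

The per-weight rounding error is tiny, but as the post argues, the accumulated perturbation can be enough to degrade safety-aligned behavior.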

June 26, 2025 · 3 min · Zelina

Blind Trust, Fragile Brains: Why LoRA and Prompts Need a Confidence-Aware Backbone

“Fine-tuning and prompting don’t just teach—sometimes, they mislead. The key is knowing how much to trust new information.” — Cognaptus Insights

🧠 Introduction: When Models Learn Too Eagerly

In the world of Large Language Models (LLMs), LoRA fine-tuning and prompt engineering are popular tools for customizing model behavior. They are efficient, modular, and increasingly accessible. However, in many practical scenarios—especially outside elite research labs—a challenge remains: enterprise-grade LLM deployments and user-facing fine-tuning workflows often lack structured, scalable mechanisms for handling input quality, model confidence, and uncertainty propagation. ...

March 25, 2025 · 4 min · Cognaptus Insights