Full Stack, Not Full Panic: Why Agentic AI Needs Safety Above and KV Discipline Below

Enterprise AI has entered its awkward teenage years. It wants to be autonomous, helpful, context-aware, cheap, safe, fast, auditable, and preferably not the reason the legal department starts drinking before lunch. That is a lot to ask from “just use a bigger model.” ...

June 9, 2026 · 15 min · Zelina

Jailbreak Risk Needs a Stopwatch, Not Just a Scorecard

For many organizations, LLM safety is still treated like a checkpoint: run a benchmark, report an attack success rate, add a few guardrails, and move on. The resulting dashboard looks reassuringly official. It may even have decimals. Unfortunately, adversarial users do not attack dashboards. They attack systems. ...

May 30, 2026 · 17 min · Zelina

Jailbreak ASR Is Wearing a Costume

The number looked safe. Then someone ran it twice. A familiar business problem: one vendor says its model resists jailbreaks. Another red-team report says a new attack reaches a spectacular Attack Success Rate. A compliance team sees a percentage, puts it into a risk register, and moves on. Unfortunately, that percentage may be doing more acting than measuring. ...

May 29, 2026 · 14 min · Zelina

Thinking Before Lying: Why Reasoning Nudges AI Toward Honesty

A chatbot is asked a simple workplace question: your manager praises you for work your teammate actually did. Do you correct the record, or quietly accept the credit? Now add money. Correcting the record costs you a raise. Add more money. Then add more. This is the useful part of the new paper Think Before You Lie: How Reasoning Leads to Honesty: it does not ask whether a model can recite an ethics slogan. That test has become almost decorative at this point. It asks what happens when honesty becomes expensive, and whether forcing the model to deliberate changes the answer. ...

March 11, 2026 · 16 min · Zelina

Drifting Without Moving: How Context Quietly Rewrites an AI Agent’s Goals

Handoff is where many elegant AI-agent architectures quietly become messy. One agent researches. Another plans. A third executes. A fourth reviews. In the diagram, this looks like modular intelligence. In production, it often looks like a relay race where each runner also inherits the previous runner’s bad assumptions, half-finished notes, emotional tone, tool traces, and occasional nonsense. We call this “context.” The model may call it “evidence.” That is where the trouble begins. ...

March 4, 2026 · 17 min · Zelina

From Scaling to Steering: Operationalizing Control in Frontier Models

Scale is easy to understand. Not easy to finance, of course. Nobody accidentally misplaces a GPU cluster behind the sofa. But conceptually, the industry has been comfortable with the story: more compute, more data, more parameters, more capability. Control is less photogenic. It does not fit neatly into a benchmark leaderboard. It does not produce the same executive sparkle as “our model is bigger.” It asks a colder question: when a model becomes capable enough to matter, can its behavior still be shaped under pressure, across adversarial prompts, repeated use, and operational constraints? ...

February 18, 2026 · 14 min · Zelina

When Models Know They’re Wrong: Catching Jailbreaks Mid-Sentence

Guardrails usually fail quietly. A user sends a malicious prompt. The model begins answering. The safety policy that looked firm in the demo environment starts behaving like office wallpaper: present, decorative, and not especially involved. By the time a post-hoc filter reads the final answer, the model has already produced the thing it should not have produced. The system may block the response from the user, but the real lesson is less flattering: the model crossed the line before the defense noticed. ...

January 16, 2026 · 3 min · Zelina

Competency Gaps: When Benchmarks Lie by Omission

Scores are comforting. That is their main commercial advantage. A vendor can say its model reaches a certain accuracy on a benchmark, a leaderboard can rank systems neatly, and an internal AI team can report that the new model is “better” than the old one. Everyone gets a number. The procurement slide looks tidy. The risk committee, if mercifully sleepy, moves on. ...

December 27, 2025 · 16 min · Zelina

When Guardrails Learn from the Shadows

Labels are expensive. Safety labels are worse. A normal classification project asks annotators to decide whether a customer complaint is urgent, whether a product photo contains a defect, or whether a support ticket belongs to billing. Annoying, yes. Existentially unpleasant, usually no. LLM safety moderation is different. The training examples may include malicious requests, jailbreak attempts, harmful advice, unsafe responses, and edge cases where intent is deliberately hidden under polite phrasing. The annotator must not only read the text but understand what the user is trying to make the model do. In other words, the expensive part is not clicking “safe” or “unsafe.” The expensive part is detecting intent when the user has carefully wrapped it in bubble wrap. ...

December 26, 2025 · 16 min · Zelina

Reading the Room? Apparently Not: When LLMs Miss Intent

A user sounds distressed. They ask a factual question. The assistant responds warmly, offers supportive resources, and then supplies the requested information in crisp, well-organized detail. That is the failure pattern. Not because the model was rude. Not because it ignored crisis language. Not because it forgot to add a disclaimer. The problem is more uncomfortable: the model noticed enough to sound caring, but not enough to change what it was willing to provide. ...

December 25, 2025 · 16 min · Zelina