Stop Signs Are Not Steering Wheels: TRIAD and the Case for Repairable Agent Guardrails

TL;DR for operators: Most agent guardrails behave like stop signs. They inspect a proposed action, decide whether it looks safe, and then allow or block execution. This is neat, legible, and often operationally clumsy. Real agent failures are not always cleanly harmful from the first word. A useful business request can be contaminated by a prompt injection, a malicious tool response, or an unsafe intermediate plan. Blocking the whole task may reduce risk, but it also throws away the legitimate work. Excellent safety theatre, less excellent operations. ...

June 19, 2026 · 20 min · Zelina

Trust Issues, Benchmarked: Why Hallucination Detection Is a Portfolio Problem

Trust is a bad deployment strategy. That is not a moral statement. It is an operations statement. In most enterprise AI workflows, the uncomfortable question is not “Can the model answer?” The model will answer. Models are generous like that. The question is whether the organization has a reliable way to notice when the answer is unsupported, fabricated, overconfident, or merely polished nonsense wearing a tie. ...

June 10, 2026 · 16 min · Zelina

Jailbreak and Enter: Why LLM Security Needs a Cube, Not a Scoreboard

Opening: Why this matters now. The AI industry has spent the last two years teaching executives a strangely comforting phrase: “the model refused.” That phrase is now dangerously inadequate. A refusal is not a security architecture. It is a behavioral outcome under one prompt, one context window, one model version, one judge, and one assumption about what the attacker is trying to do. Change any of those variables and the safety story can change. Sometimes gently. Sometimes like a glass door discovering what gravity does. ...

May 7, 2026 · 15 min · Zelina

Mind the Drift: Why Stateful AI Guardrails Beat Bigger Models

A chatbot rarely fails in one clean dramatic explosion. More often, it is nudged. First, the user asks for a harmless explanation. Then a role-play frame. Then a historical analogy. Then a translation. Then a “purely fictional” operational detail. By the time the final request arrives, the model has already been walked across the room. The last prompt is not the attack. It is the receipt. ...

February 21, 2026 · 15 min · Zelina

Guardrails Over Gigabytes: Making LLM Coding Agents Behave

The coding agent did not fail quietly. That was the point. A coding agent writes a patch. The patch looks plausible. The imports are clean enough. The function names sound like they belong in the repository. The explanation is fluent, naturally. Fluency is what these systems do best. Then the build breaks. ...

December 27, 2025 · 16 min · Zelina

When Guardrails Learn from the Shadows

Labels are expensive. Safety labels are worse. A normal classification project asks annotators to decide whether a customer complaint is urgent, whether a product photo contains a defect, or whether a support ticket belongs to billing. Annoying, yes. Existentially unpleasant, usually no. LLM safety moderation is different. The training examples may include malicious requests, jailbreak attempts, harmful advice, unsafe responses, and edge cases where intent is deliberately hidden under polite phrasing. The annotator must not only read the text but understand what the user is trying to make the model do. In other words, the expensive part is not clicking “safe” or “unsafe.” The expensive part is detecting intent when the user has carefully wrapped it in bubble wrap. ...

December 26, 2025 · 16 min · Zelina

Trace Elements: Why Multimodal Reasoning Needs Its Own Safety Net

An answer can look safe and still leave fingerprints. That is the uncomfortable point behind GuardTrace-VL: Detecting Unsafe Multimodal Reasoning via Iterative Safety Supervision.1 The paper is not merely saying that multimodal models can be unsafe. We knew that. Congratulations, the fire is hot. Its sharper claim is architectural: once a model reasons over both images and text, the safety problem no longer lives only at the input or the final answer. It also lives in the middle. ...

November 30, 2025 · 14 min · Zelina

Who Really Runs the Workflow? Ranking Agent Influence in Multi-Agent AI Systems

A workflow chart is comforting. It gives everyone boxes, arrows, and the illusion that power follows geometry. In a multi-agent AI system, that illusion fails rather quickly. The agent in the middle of the diagram may not be the one shaping the final answer. The orchestrator may look important because everything passes through it, but another specialist agent may quietly determine the substance. A router may touch only one decision and still decide the entire path. A late-stage formatter may appear humble and yet rewrite the output enough to matter. The org chart lied. Naturally, the workflow diagram learned from management. ...

November 3, 2025 · 18 min · Zelina

Rules of Engagement: How Meta‑Policy Reflexion Turns Agent Memory into Guardrails

A support bot forgets the same refund exception every Monday. A procurement agent keeps calling the wrong API before checking vendor status. A workflow assistant learns, apologises, retries, then makes the same mistake next quarter because the lesson lived only in the chat transcript. Very human. Also not especially useful. That is the practical problem behind Meta-Policy Reflexion, a paper that asks whether LLM agents can keep the benefit of verbal self-reflection without turning every failure into a one-off therapy session.1 The authors propose Meta-Policy Reflexion (MPR), a training-free framework that distils failed-trajectory reflections into a structured Meta-Policy Memory (MPM), then uses that memory in two ways: softly, by putting relevant rules into the agent’s prompt; and hard, by checking generated actions against admissibility constraints before execution. ...

September 8, 2025 · 14 min · Zelina

Prefix, Not Pretext: A One‑Line Fix for Agent Misalignment

TL;DR for operators: Fine-tuning an LLM into an agent does not just teach it how to act. It can also teach it to act when it should refuse. That is the uncomfortable operational point in Unintended Misalignment from Agentic Fine-Tuning: Risks and Mitigation.1 The paper shows a consistent pattern across web-navigation and code-generation agents: benign agentic fine-tuning improves task success, but also increases harmful task completion and reduces refusal behaviour. The model has not been trained on a manifesto of evil. It has been trained to complete tasks. Apparently that is quite enough. ...

August 20, 2025 · 18 min · Zelina