
When Your Dataset Needs a Credit Score

A dataset can look respectable for all the wrong reasons. It may have a familiar name. It may sit on a well-known repository. It may come with a license file, a citation, a download button, and just enough academic polish to make procurement, product, and engineering all feel that the risk has been handled. Wonderful. A PDF said it was fine. What could possibly go wrong? ...

December 29, 2025 · 15 min · Zelina

When the Chain Watches the Brain: Governing Agentic AI Before It Acts

Approval is boring. That is why most automation diagrams hide it. A user request arrives, a sensor emits a signal, an AI agent reasons through the situation, a tool call fires, and something in the real world changes. A stock level is replenished. A traffic light is adjusted. A healthcare alert is escalated. In the clean version of the diagram, the agent looks wonderfully autonomous. In the operational version, someone eventually asks the unpleasant question: who allowed this thing to act? ...

December 28, 2025 · 19 min · Zelina

When Guardrails Learn from the Shadows

Labels are expensive. Safety labels are worse. A normal classification project asks annotators to decide whether a customer complaint is urgent, whether a product photo contains a defect, or whether a support ticket belongs to billing. Annoying, yes. Existentially unpleasant, usually no. LLM safety moderation is different. The training examples may include malicious requests, jailbreak attempts, harmful advice, unsafe responses, and edge cases where intent is deliberately hidden under polite phrasing. The annotator must not only read the text but understand what the user is trying to make the model do. In other words, the expensive part is not clicking “safe” or “unsafe.” The expensive part is detecting intent when the user has carefully wrapped it in bubble wrap. ...

December 26, 2025 · 16 min · Zelina

When Models Learn to Forget: Why Memorization Isn’t the Same as Intelligence

A contract clause appears in a chatbot response. Not a summary. Not a paraphrase. The clause itself, with the same odd phrasing, the same punctuation, and the same mildly embarrassing typo that legal counsel thought nobody outside the company would ever see. The model did not “reason” its way there. It remembered. ...

December 26, 2025 · 15 min · Zelina

When Policies Read Each Other: Teaching Agents to Cooperate by Reading the Code

A workflow breaks in a familiar way. The planning agent assumes the procurement agent will wait. The procurement agent assumes the planning agent has already revised the forecast. The compliance agent flags the output after both have acted. Everyone had access to the same dashboard. Nobody had access to the thing that actually mattered: the other agent’s decision policy. ...

December 26, 2025 · 19 min · Zelina

Reading the Room? Apparently Not: When LLMs Miss Intent

A user sounds distressed. They ask a factual question. The assistant responds warmly, offers supportive resources, and then supplies the requested information in crisp, well-organized detail. That is the failure pattern. Not because the model was rude. Not because it ignored crisis language. Not because it forgot to add a disclaimer. The problem is more uncomfortable: the model noticed enough to sound caring, but not enough to change what it was willing to provide. ...

December 25, 2025 · 16 min · Zelina

When 100% Sensitivity Isn’t Safety: How LLMs Fail in Real Clinical Work

Clinic. That is where the comforting AI story starts to wobble. In a benchmark, a clinical model receives a clean question, enough context, and a scoring rule that usually rewards the right answer. In a clinic, the same model sees an elderly patient with multiple conditions, incomplete records, medication changes from years ago, possible specialist involvement, ambiguous prescribing history, and a problem that may not require action at all. The model is not merely being asked, “Can you spot a risk?” It is being asked, “Do you understand whether this risk is real, current, important, and safely actionable?” ...

December 25, 2025 · 20 min · Zelina

When More Explanation Hurts: The Early‑Stopping Paradox of Agentic XAI

A farmer does not need ninety-three charts before deciding what to do next. That sounds obvious. Unfortunately, “obvious” is where many agentic AI workflows go to die. Give an LLM a model explanation, ask it to improve the explanation, let it generate more analysis, feed the results back, and repeat. The process feels responsible. More checks. More plots. More reasoning. More “depth.” Somewhere in the background, a product manager begins to hear the soft music of enterprise automation. ...

December 25, 2025 · 16 min · Zelina

Think Before You Beam: When AI Learns to Plan Like a Physicist

Beam planning sounds like the sort of work automation should have solved years ago. There is a target. There are organs at risk. There are dose constraints. There is an optimizer. Surely the machine should find the best plan while humans do something more dignified than nudging parameters inside a treatment planning system for the seventeenth time. ...

December 24, 2025 · 14 min · Zelina

When Benchmarks Rot: Why Static ‘Gold Labels’ Are a Clinical Liability

Clinical AI has a paperwork problem. Not the usual paperwork problem, where doctors drown in documentation and everyone promises that software will save them. The more interesting problem sits one layer below: the paperwork used to judge the software may itself be wrong. That is the uncomfortable center of “Scalable Stewardship of an LLM-Assisted Clinical Benchmark with Physician Oversight,” a paper that audits MedCalc-Bench, a benchmark for testing whether language models can compute medical risk scores from patient narratives.[1] The paper’s target is not a toy dataset. MedCalc-Bench covers 55 medical calculators and includes 10,053 training instances plus 1,047 test instances. Its labels were produced through an LLM-assisted pipeline: GPT-3.5 matched patient contexts to calculator questions, GPT-4 extracted clinical features, and Python scripts aggregated those features into final scores. ...

December 23, 2025 · 15 min · Zelina