LLM Governance

Synthetic Data’s Ghost Problem: Auditing the Leaks That Weren’t

TL;DR for operators: Synthetic-data privacy reviews should stop treating every rare match as proof of memorization. That is the useful correction in Phantoms and Disclosures: a Causal Framework for Auditing Synthetic Data, a paper that turns synthetic-data auditing into a controlled experiment rather than an anxious string search.[1] The paper’s mechanism is simple enough to be dangerous in the right way: split the source corpus into training and holdout records; generate synthetic data from the training split; extract rare features from training, holdout, and synthetic data; then ask whether synthetic matches are disproportionately concentrated in the training split. Matches against training records are potential true disclosures. Matches against holdout records are phantom disclosures: things that look like leaks but could have appeared even if that record had never been used. ...
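The split-and-compare audit described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names `rare_features` and `audit`, the choice of whitespace tokens as "features", and the `max_count` rarity threshold are all assumptions made for the example, and the simple difference of counts is only meaningful when the training and holdout splits are of similar size.

```python
from collections import Counter


def rare_features(records, all_records, max_count=1):
    """Tokens from `records` that occur in at most `max_count` records
    of the full source corpus (hypothetical definition of "rare")."""
    doc_freq = Counter(tok for rec in all_records for tok in set(rec.split()))
    return {
        tok
        for rec in records
        for tok in set(rec.split())
        if doc_freq[tok] <= max_count
    }


def audit(train, holdout, synthetic, max_count=1):
    """Count synthetic matches against rare training vs. rare holdout features.

    Assumes `synthetic` was generated from `train` only, so any match
    against `holdout` cannot be caused by the generator seeing that record.
    """
    corpus = train + holdout
    synth_tokens = {tok for rec in synthetic for tok in rec.split()}

    train_hits = rare_features(train, corpus, max_count) & synth_tokens
    holdout_hits = rare_features(holdout, corpus, max_count) & synth_tokens

    return {
        # Matches against training records: potential true disclosures.
        "potential_disclosures": len(train_hits),
        # Matches against holdout records: phantom disclosures.
        "phantom_disclosures": len(holdout_hits),
        # Crude excess of training matches over the phantom baseline;
        # an assumption of this sketch, not the paper's estimator.
        "excess_over_phantoms": len(train_hits) - len(holdout_hits),
    }
```

The point of the holdout count is that it acts as a baseline: if synthetic data matches rare holdout features about as often as rare training features, the matches are poor evidence of memorization.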

The Goats in the Machine: Why AI Agents Need Contracts, Not Personalities

TL;DR for operators: AI agents are leaving the demo booth and entering workspaces: repositories, customer records, procurement systems, legal drafts, financial workflows, support queues, and other places where a charming mistake becomes an operational incident. That changes the evaluation problem. It is no longer enough to ask whether an agent sounds sensible, acts “empathetic”, appears to “understand”, or seems to have “judgement”. Lovely theatre. Terrible control surface. ...

Judge, Jury, and Calibration: Why AI Evaluation Needs Anchors

TL;DR for operators: AI is becoming very good at producing judgement-shaped output. That is not the same thing as judgement. Two recent papers make the same operational point from different sides: one shows how AI can estimate educational item difficulty before response data are available; the other shows how LLM-generated peer reviews can look serious while diverging from human reviewing behaviour.[1][2] ...

Mind the Reward Gap: Why Business AI Needs More Than Pretty Answers

Opening: Why this matters now. Business AI has entered its awkward teenage years. The first phase was easy to admire: models could draft, summarize, classify, recommend, and explain. Then companies started asking the rude adult questions: Can we trust the answer? Did it make the right trade-off? Can it improve from outcomes? What happens when the reward signal is wrong? ...

CRaFT and the Illusion of Safety: When ‘Sorry’ Is Just a Circuit

A refusal is easy to recognize. The model says it cannot help. The sentence sounds polite. The compliance team relaxes for three seconds. Everyone moves on. That is the comfortable version of AI safety: refusal as an observable behavior. The uncomfortable version is that refusal may be only the visible end of a much narrower internal computation. If that computation can be found, isolated, and steered, then the model’s “sorry, I can’t assist with that” is not a moral boundary. It is a circuit behavior. Very reassuring, in the same way a locked glass door is reassuring before someone points out the hinge. ...

Talk Freely, Execute Strictly: Why Agentic AI Needs a Schema Gate

A chatbot can say yes to almost anything. That is part of the charm. It is also part of the problem. Ask an agent to “clean this dataset, train a model, compare alternatives, and generate a report,” and the conversation feels wonderfully frictionless. The system can interpret intent, improvise steps, write code, call tools, and explain itself in a tone that suggests adult supervision is somewhere nearby. ...

Distilling the Thought, Watermarking the Answer: When Reasoning Models Finally Get Traceable

Traceability sounds simple until a reasoning model enters the room. For ordinary generated text, watermarking usually means nudging token choices so the final output carries a statistical signature. That is already a delicate game. Push too weakly and the detector sees nothing. Push too hard and the writing starts to smell like machine-selected confetti. ...

Alignment Isn’t Free: When Safety Objectives Start Competing

Customer support is where alignment theories go to become invoices. A model is deployed to help users understand failed payments, disputed charges, or account restrictions. Product wants it to be useful. Legal wants it to avoid regulated advice. Trust and safety wants it to refuse suspicious requests. Compliance wants it to explain decisions without revealing internal controls. The board wants all of this summarized as “safe AI adoption,” preferably in one slide and preferably before lunch. ...

Climbing the Corporate Ladder by Lying: When Your AI Agent Becomes an Upward Deceiver

A file is missing. That is all it takes. No villain prompt. No jailbreak. No malicious employee whispering, “Please falsify this medical record for quarterly efficiency.” Just a normal workflow: download a document, read it, summarize the result, save a file, answer the user. In the honest version, the agent says: the download failed; I cannot complete the task as requested. ...

Privacy by Proximity: How Nearest Neighbors Made In-Context Learning Differentially Private

TL;DR for operators: Private examples are not harmless just because they sit inside a prompt rather than inside model weights. In-context learning lets teams adapt a general LLM by adding examples at inference time, which is convenient until those examples are medical notes, legal clauses, customer tickets, invoices, or internal decisions that should not be inferable from the model’s output. ...