Model Auditing

A Full Distribution Is Not a Risk Certificate

TL;DR for operators: A team is deciding whether an AI risk dashboard should trigger a safer action, an operational alert, or a governance review. The system predicts a range of possible returns for each action rather than only an average, so its output may appear to provide stronger evidence about worst-case outcomes. ...

Count the Missing, Weight the Rare: A Better Bargain for Cardiac Phenotyping

TL;DR for operators: CW-B is not interesting because it discovers a novel model architecture. It is interesting because it treats three routinely neglected design decisions as one system. First, rare classes receive more influence during training: class weights are computed separately inside each training fold, preventing common phenotypes from dominating the tree-building process. Second, missing values retain their provenance: the pipeline imputes a numerical value but also adds an indicator showing that the original measurement was absent. Third, clinical priorities are audited separately from aggregate performance: stable coronary artery disease, acute coronary syndrome, and non-obstructive coronary disease are grouped into a predefined evaluation set because missing them can alter follow-up and treatment pathways. On a five-class dataset containing 4,354 patient records and 57 structured features, CW-B records the strongest accuracy, Macro-F1, balanced accuracy, and prioritized F1 among the tested tree, ensemble, and neural baselines. Its balanced accuracy reaches 0.73, compared with 0.66 for a larger XGBoost baseline. Its prioritized F1 is 0.69, compared with 0.67 for that baseline. ...
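The first two design decisions can be illustrated with a minimal, standard-library sketch. It is not the CW-B pipeline itself: the inverse-frequency weighting formula, the median fill value, and the helper names `balanced_class_weights`, `impute_with_indicator`, and `fold_indices` are all assumptions chosen for illustration.

```python
from collections import Counter
from statistics import median


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count).

    Assumed weighting scheme; the excerpt does not give the exact formula.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


def impute_with_indicator(column, fill=None):
    """Replace None with a fill value and keep a 0/1 'was missing' flag.

    If no fill is given, the median of the observed values is used
    (an assumption). Returns (list of (value, flag) pairs, fill used).
    """
    observed = [v for v in column if v is not None]
    if fill is None:
        fill = median(observed)
    pairs = [(fill if v is None else v, int(v is None)) for v in column]
    return pairs, fill


def fold_indices(n_samples, n_folds):
    """Yield (train_idx, val_idx) for a simple interleaved k-fold split."""
    for f in range(n_folds):
        val = [i for i in range(n_samples) if i % n_folds == f]
        train = [i for i in range(n_samples) if i % n_folds != f]
        yield train, val


# Hypothetical toy data: one rare class and one column with gaps.
labels = ["common", "common", "common", "rare", "common", "rare"]
feature = [1.0, None, 3.0, 4.0, None, 6.0]

for train, val in fold_indices(len(labels), 3):
    # Weights and fill value come from the training fold only,
    # so nothing about the validation fold leaks into them.
    weights = balanced_class_weights([labels[i] for i in train])
    _, fill = impute_with_indicator([feature[i] for i in train])
    val_pairs, _ = impute_with_indicator([feature[i] for i in val], fill=fill)
```

Passing the training-fold `fill` into the validation call is the point of the design: the validation rows are imputed with a statistic they did not contribute to, while the indicator flag still records which measurements were absent.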

Right Answer, Wrong Audit: When Reasoning Models Grade the Destination, Not the Route

A reviewer sees the final number. It is correct. Then the quiet failure begins. The reviewer stops asking whether the argument actually works. The missing step becomes “implicit.” The shuffled logic becomes “not ideal, but acceptable.” The circular explanation becomes “verbose but essentially correct.” The answer has done something worse than persuade. It has anesthetized the audit. ...

RAG’s Receipt Problem: When Correct Answers Don’t Prove Retrieval

Retrieval-augmented generation has become the respectable outfit enterprise AI wears when it wants to look grounded. Add a document store, retrieve a few passages, attach citations, and the answer suddenly appears more disciplined than a free-floating chatbot. That appearance is useful. It is not proof. ...

Auditing the Illusion of Forgetting: When Unlearning Isn’t Enough

Deletion requests sound simple until the model answers politely. A user asks for data to be removed. A publisher demands that copyrighted passages stop being reproduced. A compliance team wants evidence that a fine-tuned model no longer carries traces of a forbidden dataset. The model is run through an unlearning method, the surface tests improve, the dashboard turns less red, and everyone enjoys the brief spiritual comfort of a green checkmark. ...

When Fairness Fails in Groups: From Lone Counterexamples to Discrimination Clusters

Imagine two fairness bugs. In the first, changing a protected attribute while holding everything else constant shifts a model’s output enough to trigger one unfair decision. In the second, the same underlying applicant profile can fracture into nineteen meaningfully different score bands as protected attributes change. A conventional pairwise fairness test records both as violations. One counterexample each. Very tidy. Also not especially useful. ...
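The difference between one counterexample and a cluster can be made concrete with a small sketch. It assumes a scoring function that takes a dict of features, a fixed band width for deciding when two scores are "meaningfully different", and a hypothetical helper named `score_bands`; none of these come from the article, which may define bands differently.

```python
import math
from itertools import product


def score_bands(model, profile, protected_values, band_width):
    """Return the set of score bands one profile falls into as its
    protected attributes range over every listed combination.

    model            -- callable: dict of features -> numeric score
    profile          -- dict of non-protected features, held constant
    protected_values -- dict: protected attribute -> list of values to try
    band_width       -- assumed threshold for 'meaningfully different'
    """
    names = list(protected_values)
    bands = set()
    for combo in product(*(protected_values[n] for n in names)):
        variant = {**profile, **dict(zip(names, combo))}
        bands.add(math.floor(model(variant) / band_width))
    return bands


# Hypothetical toy model that (wrongly) uses protected attributes.
def toy_model(x):
    return 50.0 + 10.0 * (x["sex"] == "F") + 25.0 * (x["age_group"] == "old")


bands = score_bands(
    toy_model,
    profile={"income": 40_000},
    protected_values={"sex": ["F", "M"], "age_group": ["young", "old"]},
    band_width=10.0,
)
pairwise_violation = len(bands) > 1   # what a pairwise test reports
cluster_size = len(bands)             # what a cluster view adds
```

A pairwise test collapses this profile to a single yes/no violation, while `cluster_size` keeps the number of distinct bands the same applicant can land in.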

The Ethics of Not Knowing: When Uncertainty Becomes an Obligation

Uncertainty is the most convenient word in governance. A model is uncertain, so the system waits. A committee is uncertain, so the decision is deferred. A risk officer is uncertain, so the memo gets another paragraph of decorative caution and nobody quite owns the next step. Very mature. Very responsible. Also, sometimes, very useful for avoiding responsibility while looking intellectually respectable. ...

When Tokens Remember: Graphing the Ghosts in LLM Reasoning

Audit is easy when the answer is a single lookup. A customer asks, “What is your refund policy?” The model quotes the policy paragraph. We check whether the quoted paragraph came from the right source. Very civilized. Everyone goes home early. But real enterprise LLM work is rarely that tidy. A compliance assistant reads a contract, extracts obligations, compares them with internal policy, reasons through exceptions, and writes a recommendation. A research assistant reads multiple sources, builds an intermediate summary, then answers a question from that summary. A support agent reads a user history, infers the likely issue, then proposes the next action. In these cases, the final sentence may depend on prompt evidence and on earlier generated text. ...

Who Owns Your Words? Copyright, LLMs, and the Quiet Arms Race Over Training Data

The new copyright question is not “did the model copy me?” but “how would I know?” A writer uploads a chapter. A publisher uploads a manuscript. A compliance team uploads a protected document. The question is simple enough to ask in one sentence: did this material end up inside a large language model’s training data? ...

What LLMs Remember—and Why: Unpacking the Entropy-Memorization Law

TL;DR for operators: Memorization audits usually start with the wrong question: “Which individual text snippets look memorized?” This paper suggests a better first diagnostic: group many snippets by how closely the model reproduces them, then measure the entropy of the token distribution inside each group.1 The result is an empirical pattern the authors call Entropy–Memorization Linearity. In plain English: when training examples are pooled by edit-distance score, their set-level entropy forms a strong linear relationship with how closely the model reproduces them. Since the paper’s “memorization score” is an edit distance, lower score means stronger verbatim reproduction; higher score means the generated continuation is farther from the ground truth. ...
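The grouping step described above can be sketched as follows. This is a rough illustration rather than the paper's procedure: it assumes token-level Levenshtein distance as the memorization score, pools the ground-truth tokens of each group (the excerpt does not say which tokens are pooled), and uses hypothetical helper names.

```python
import math
from collections import Counter, defaultdict


def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]


def set_entropy(token_lists):
    """Shannon entropy (bits) of the pooled empirical token distribution."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def entropy_by_score(pairs):
    """Group (ground_truth, generated) token pairs by edit distance and
    return {score: set-level entropy of the group's ground-truth tokens}.

    Pooling the ground-truth side is an assumption made for this sketch.
    """
    groups = defaultdict(list)
    for truth, generated in pairs:
        groups[edit_distance(truth, generated)].append(truth)
    return {score: set_entropy(g) for score, g in sorted(groups.items())}


# Hypothetical toy pairs: the first is reproduced verbatim (score 0).
pairs = [
    (["the", "the"], ["the", "the"]),
    (["red", "fox"], ["red", "dog"]),
]
table = entropy_by_score(pairs)
```

With real data, the keys of `table` would be the memorization scores and the values the set-level entropies whose relationship the linearity claim is about; fitting a line to those points is left out of this sketch.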