Conformal Prediction

The Model Got Smaller. The Risk Got Wider.

TL;DR for operators Compression is usually sold as a clean engineering bargain: smaller model, lower memory, cheaper inference, acceptable accuracy loss. This paper asks the more operationally annoying question: after compression, does the model still know when it should hedge? The answer is: not reliably. Tong et al. benchmark compressed LLMs using conformal prediction, a framework that converts model probabilities into prediction sets with target coverage.1 In this setup, the important uncertainty metric is prediction set size: if the model needs to include more answer options to maintain coverage, it is less certain, even if its top-1 accuracy still looks respectable. ...

The Reward Model Was Confident. That Was the Bug.

TL;DR for operators Reward models should not be treated as little oracles that hand down one clean number from the alignment heavens. In the paper’s diagnosis, the problem is more mundane and therefore more dangerous: a reward model can be wrong, uncertain, and numerically confident-looking at the same time. GRPO then standardizes those rewards inside a rollout group, giving extreme scores large influence even when the reward model is least reliable. Excellent. The pipeline has discovered a way to launder uncertainty into policy updates. ...

When the Judge Needs Judging: LLM Evaluators Under Cross-Examination

The dashboard says the judge is fine. The document disagrees. Judge is an easy word to trust. It suggests robes, procedure, and someone in the room who is supposed to be less confused than everyone else. In AI evaluation, the word has become dangerously comfortable. Product teams now use LLMs to score summaries, rank chatbot answers, approve RAG outputs, compare model releases, and decide whether another model’s response is “good enough.” The attraction is obvious: human review is expensive, slow, and occasionally insists on context. An LLM judge is fast, scalable, and does not ask why the evaluation rubric was written five minutes before the sprint review. ...

The Truth Filter Paradox: When Reliable AI Becomes Useless

Silence is safe. That is the awkward little secret behind many “reliable AI” systems. Ask a retrieval-augmented generation system a question. It drafts an answer. A factuality filter checks each claim. Risky claims are removed. The final answer is cleaner, safer, and statistically more defensible. On a dashboard, factuality goes up. In a meeting, everyone nods. In production, the user receives something that says almost nothing. ...

When Agents Behave: Conformal Policy Control and the Business of Safe Autonomy

Deployment has a boring problem. That is usually where the expensive problems live. A company has an existing model, workflow, or agent policy that is not brilliant but has behaved well enough not to frighten legal, compliance, or operations. Then someone improves it. The new version is more capable, more exploratory, perhaps trained with better preference data or optimized for a sharper reward. It also does things the old version would not have done. ...

Confidence, Not Confidence Tricks: Statistical Guardrails for Generative AI

A product team launches an AI assistant. The demo works. The benchmark looks respectable. The model even says “I’m confident” with the serene authority of a consultant who has never owned a pager. Then the real users arrive. Some ask ambiguous questions. Some ask adversarial questions. Some ask perfectly normal questions that happen to sit outside the model’s competence. The assistant still answers. Sometimes it refuses too often. Sometimes it refuses too late. Sometimes its confidence score is less a forecast and more a decorative sticker. ...