Model Governance

The Model Got Smaller. The Risk Got Wider.

TL;DR for operators: Compression is usually sold as a clean engineering bargain: smaller model, lower memory, cheaper inference, acceptable accuracy loss. This paper asks the more operationally annoying question: after compression, does the model still know when it should hedge? The answer is: not reliably. Tong et al. benchmark compressed LLMs using conformal prediction, a framework that converts model probabilities into prediction sets with target coverage [1]. In this setup, the important uncertainty metric is prediction set size: if the model needs to include more answer options to maintain coverage, it is less certain, even if its top-1 accuracy still looks respectable. ...
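To make the prediction-set idea concrete, here is a minimal sketch of split conformal prediction for a multiple-choice task. It is an illustration under stated assumptions, not the benchmark code from Tong et al.: the nonconformity score (one minus the probability of the true answer), the function names, and all calibration numbers below are hypothetical.

```python
import math


def conformal_threshold(true_class_probs, alpha=0.25):
    """Calibrate a score threshold for roughly (1 - alpha) coverage.

    true_class_probs: probability the model gave to the correct answer
    on each held-out calibration example (hypothetical inputs).
    """
    # Nonconformity score: high when the model underrated the true answer.
    scores = sorted(1.0 - p for p in true_class_probs)
    n = len(scores)
    # Finite-sample corrected rank used by split conformal prediction.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points: fall back to including every option.
        return 1.0
    return scores[k - 1]


def prediction_set(probs, threshold):
    """Return every answer index whose score stays within the threshold."""
    return [i for i, p in enumerate(probs) if 1.0 - p <= threshold]


# Hypothetical calibration data for a full model and a compressed one.
full_q = conformal_threshold([0.9, 0.8, 0.85, 0.7, 0.95, 0.6, 0.75])
small_q = conformal_threshold([0.5, 0.4, 0.45, 0.35, 0.55, 0.3, 0.42])

# Both models pick answer 0 as top-1, but the compressed one needs a wider set.
print(len(prediction_set([0.8, 0.1, 0.06, 0.04], full_q)))   # -> 1
print(len(prediction_set([0.45, 0.4, 0.1, 0.05], small_q)))  # -> 2
```

In this toy case both models agree on the top answer, yet the compressed one has to keep two options to hold the same coverage target. That widening is the signal the set-size metric is meant to surface.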

The Clean Label Fairy Is Not Coming

TL;DR for operators: Hospitals do not label images the same way. Radiologists disagree on contours. Pathologists disagree on grades. Automatically generated masks miss structures, add structures, or quietly confuse one target for another. In centralized AI, those errors are already irritating. In federated learning, they become operationally awkward because the data cannot simply be pooled, inspected, cleaned, and morally forgiven by a heroic annotation team. ...

Less Prompt, More Blueprint: MOSAIC and the Data-Science Agent That Keeps Receipts

TL;DR for operators: MOSAIC is best read as a system-design paper, not as another entry in the increasingly crowded genre of “we attached an LLM to Python and hoped for the best.” The paper introduces a structured agentic framework for automated data science where the agent builds an explicit workflow blueprint before generating code, then verifies, executes, and refines candidates using diagnostic feedback and failure-aware offline reinforcement learning [1]. ...

Mind the Middle: Why AI Reliability Lives Between the Data and the Answer

TL;DR for operators: AI systems rarely fail only at the final answer. They fail earlier, in the quiet machinery that decides which evidence is seen, which records are aligned, which identity is protected, and which previous model behaviour is worth reusing. Three recent papers make that point from very different technical worlds. One improves few-shot object detection by correcting the imbalance between base-class and novel-class region proposals. One builds anonymous two-party gradient-boosted decision tree training so parties can align records without exposing shared identifiers. One maps the behavioural geometry of LLMs so jailbreak risk and defences can be predicted or transferred across model populations. ...

Tail Risk: Why Imbalanced AI Needs Shared Depth, Not Bigger Weights

TL;DR for operators: Most business AI failures on imbalanced data do not look like dramatic model collapse. They look quieter: the system performs well on common cases, under-serves rare cases, and then someone discovers that “rare” was another word for “expensive when wrong”. The OSDTW paper tackles this long-tailed recognition problem by treating head and tail classes as two related tasks rather than one flattened classification problem [1]. Its practical message is not “care more about minority classes”, although that would make a pleasant conference slogan. The message is sharper: imbalance is a structural design problem. You must decide which representation layers should be shared, which parts should specialise, and how much head versus tail supervision should shape the shared model. ...

Class Action: Fairness Is a Frontier, Not a Checkbox

TL;DR for operators: Fairness work usually arrives in one of two flavours: mathematical fog or compliance theatre. OptFair is more useful than both. Zhang et al. study multi-class fair classification and show how to define the optimal accuracy-fairness frontier, then approximate it through two deployable routes: intervention during training and calibration after training [1]. ...

Context Collapse: Why AI’s Next Bottleneck Is Knowing What Matters

TL;DR for operators: AI is getting fluent enough to be dangerous in boring ways. It can describe a scene, generate a video, and write a policy memo with impressive confidence. The problem is that real operations rarely fail at the level of generic fluency. They fail when the system confuses which person did what, blends event one into event two, or treats a documented atrocity as a debate club prompt because a user asked for “balance”. ...

Split Before You Scale: Why Useful AI Starts by Sorting the Mess

TL;DR for operators: AI systems fail less dramatically when they stop treating every messy signal as the same kind of mess. The three papers in this cluster look unrelated at first: one generates graphs, one studies exploration in restless bandits, and one improves reinforcement-learning generalisation from formal task specifications. Under the surface, they make a shared operational point: before scaling an AI system, separate the structure that must be preserved, the uncertainty that should guide action, and the supervision signal stable enough to train on. ...

Statecraft, Not Scorecards: Why Reliable AI Lives on the Path

TL;DR for operators: AI reliability is increasingly a path problem, not a score problem. One paper argues that post-training methods such as supervised fine-tuning, reinforcement learning, and on-policy distillation should be understood by asking where supervision is applied in the model’s state space [1]. Another argues that GUI-agent software evaluation fails when a single unsuccessful rollout is treated as proof of a broken application, even though the evaluator has only inspected one path through a larger UI state graph [2]. ...

Gradient Customs: AlphaToken Checks Which Tokens Are Allowed to Train

Fine-tuning looks deceptively democratic. Every response token gets its little vote in the gradient. The commas, the boilerplate, the obvious connective tissue, the wrong kind of certainty, the genuinely task-bearing step in the middle of the answer: all are invited to update the model. A charmingly egalitarian arrangement. Also a rather efficient way to teach a model to forget things it used to know. ...
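The "every token votes" picture can be sketched in a few lines. What follows is a generic illustration of token-level loss masking, not AlphaToken's selection rule, which this excerpt does not describe. The function name, the probabilities, and the mask are all hypothetical.

```python
import math


def masked_nll(token_probs, train_mask):
    """Mean negative log-likelihood over the tokens allowed to train.

    token_probs: model probability of each reference response token.
    train_mask: True where a token may contribute to the gradient.
    Both inputs are hypothetical; how the mask is chosen is out of scope.
    """
    kept = [-math.log(p) for p, keep in zip(token_probs, train_mask) if keep]
    if not kept:
        # No token was admitted, so there is nothing to train on.
        return 0.0
    return sum(kept) / len(kept)


# Hypothetical response: boilerplate the model already predicts well,
# plus one task-bearing token it gets badly wrong.
probs = [0.95, 0.9, 0.2, 0.97]

everyone_votes = masked_nll(probs, [True, True, True, True])
only_the_step = masked_nll(probs, [False, False, True, False])
print(everyone_votes, only_the_step)
```

With the full mask, the low-probability token is averaged together with tokens the model already handles, so its share of the loss is diluted. Restricting the mask concentrates the training signal on the token that carries the task.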