Knowledge Distillation

You Can’t Reweight a Dead End: TRD and the Prefix Failure Problem

TL;DR for operators: The paper’s main message is simple: if a reasoning model has already walked into a dead end, per-token distillation often keeps supervising it from inside the dead end. A clever loss cap is not a map. A top-k filter is not a tow truck. Trajectory-Refined Distillation, or TRD, repairs the student’s own rollout before using it for distillation. The pipeline is: sample the student’s attempt, ask a teacher or privileged self-teacher to rewrite the trajectory into a better one, then train on the refined trajectory rather than on the original failed rollout. The technical contribution is not “better prompting”, although prompts are used. It is the shift from token-level correction to trajectory-level correction. ...
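The three steps in that pipeline (sample, refine, train on the refined trajectory) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the paper's implementation: `build_trd_dataset`, `sample_student`, `refine`, `Example`, and the toy stand-ins are hypothetical names introduced here, and the actual fine-tuning step is left out.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    prompt: str
    target: str  # trajectory the student would be trained to reproduce


def build_trd_dataset(
    prompts: List[str],
    sample_student: Callable[[str], str],   # hypothetical: returns the student's rollout
    refine: Callable[[str, str], str],      # hypothetical: teacher rewrites a rollout
) -> List[Example]:
    """Collect (prompt, refined trajectory) pairs for later distillation."""
    dataset = []
    for prompt in prompts:
        rollout = sample_student(prompt)          # 1. the student's own attempt
        refined = refine(prompt, rollout)         # 2. trajectory-level rewrite
        dataset.append(Example(prompt, refined))  # 3. supervise on the refined text,
                                                  #    not on the failed rollout
    return dataset


# Toy stand-ins so the sketch runs without any model (assumptions, not real APIs).
def toy_student(prompt: str) -> str:
    return f"{prompt} -> wrong turn -> dead end"


def toy_refiner(prompt: str, rollout: str) -> str:
    return rollout.replace("wrong turn -> dead end", "right turn -> answer")


data = build_trd_dataset(["2+2"], toy_student, toy_refiner)
print(data[0].target)
```

The point of the sketch is where the supervision comes from: the training target is the rewritten trajectory as a whole, so no per-token loss is ever computed on the original dead-end prefix.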

The Rule Is the Model: DEM’s Case for Bedside Anomaly Detection Without Explainer Theatre

Alerts are cheap; trusted alerts are not. A hospital monitor that screams without explaining itself is not a decision-support system. It is a very expensive doorbell. That is the practical problem behind Singh, Roy, Bose, and Hota’s Distilled Explanation Model, or DEM, for physiological anomaly detection in wireless body area networks.[1] The paper is nominally about clinical sensor data: heart rate, oxygen saturation, blood pressure, temperature, stress signals, sensor dropouts, and ICU monitoring. But the more interesting argument is architectural. DEM is not trying to make a black-box model more charming after it has already made a decision. It is trying to make the explanation part of the decision itself. ...

When 256 Dimensions Pretend to Be 16: The Quiet Overengineering of Vision-Language Segmentation

A prompt is usually a small thing. “White dog.” “Person in a blue jacket.” “Cup on the table.” Nobody hears these phrases and thinks: excellent, time to deploy a large general-purpose language encoder. Yet that is often what modern vision-language segmentation systems do. The visual model may be carefully optimized. The deployment team may obsess over image encoder latency, GPU memory, and batch size. Then the text side sits there, inherited from a larger foundation model stack, quietly burning capacity to understand what is often a noun phrase with a color adjective attached. Very sophisticated machinery, bravely parsing “red car.” Heroic. ...

When Models Listen but Stop Thinking: Teaching Audio Models to Reason Like They Read

A voice assistant can transcribe your question correctly and still answer like it heard something else. That is the awkward part of modern audio-language models. The obvious diagnosis is usually “better speech recognition.” The less obvious diagnosis is nastier: the model may receive an audio input that is semantically equivalent to the text prompt, but once generation begins, its audio-conditioned reasoning trajectory drifts away from the reasoning trajectory it would have followed if the same question had been typed. ...