Model-Evaluation

You Can’t Reweight a Dead End: TRD and the Prefix Failure Problem

TL;DR for operators: The paper’s main message is simple: if a reasoning model has already walked into a dead end, per-token distillation often keeps supervising it from inside the dead end. A clever loss cap is not a map. A top-k filter is not a tow truck. Trajectory-Refined Distillation, or TRD, repairs the student’s own rollout before using it for distillation. The pipeline is: sample the student’s attempt, ask a teacher or privileged self-teacher to rewrite the trajectory into a better one, then train on the refined trajectory rather than on the original failed rollout. The technical contribution is not “better prompting”, although prompts are used. It is the shift from token-level correction to trajectory-level correction. ...

Heads You Lose: Why Ablation-Reversible Interpretability Doesn’t Transfer

TL;DR for operators: The paper is a useful slap on the wrist for anyone tempted to turn an interpretability result into an operational control too quickly. It asks a simple question: when an attention head looks important, contains readable information, and can restore model behaviour after ablation, does that mean it carries a transferable representation of the computation? ...

Mind the Flux: Why Average Accuracy Fails Where the Towers Aren’t

TL;DR for operators: Models are often sold as if accuracy were a passport: one clean number, stamped at the border, cleared for deployment. FLUXtrapolation is a useful reminder that the border is usually where the problem begins. The paper introduces a benchmark for predicting hourly ecosystem fluxes (carbon, water, and energy exchanges between ecosystems and the atmosphere) when direct measurements exist only at sparse flux-tower sites. The mechanism is simple and unpleasant: train models where towers exist, then test them in progressively less comfortable situations where the future, the geography, or the temperature regime has shifted. ...

Stop Model Shopping: Build the AI Control Tower

TL;DR for operators: AI deployment is no longer mainly a question of whether a model can produce something plausible. That problem has been solved often enough to become boring, which is usually when businesses start wasting money at scale. The live problem is control. Which model should be trusted on this workload? When should a system query another model, pay more, or stop? When an LLM produces an analytical “insight”, is it finding the pattern you care about, or merely discovering an aggregate confound wearing a nice blazer? ...

Mind the Readout: Why AI Gets Smarter When We Stop Worshipping the Output

The current AI industry has a strangely theatrical relationship with intelligence. We judge models by the visible performance: the answer they print, the image they reconstruct, the attention map they expose, the number of reasoning steps they perform, the architectural flourish in the diagram. If the output looks sophisticated, we call the system capable. If the output looks wrong, we assume the capability is missing. This is convenient, measurable, and often completely misleading. Naturally, it is popular. ...

Control, Alt, Generate: Why AI Needs Control Surfaces, Not Bigger Prompts

Generative AI has become very good at producing things that look finished. That is useful. It is also the problem. A polished answer can quietly overuse the same words until every research abstract sounds like it was written by one over-caffeinated committee. A video model can obey an edit instruction and still damage the background, distort motion, or leave a ghost of the removed object behind. The output looks like a product feature. The failure behaves like a production-control problem. ...

Look Before You Think: Why Visual AI Needs Evidence Scheduling

A visual AI system can fail in a very boring way: it sounds confident, answers fluently, and quietly forgets to look. That is more dangerous than a spectacular hallucination. A spectacular hallucination at least waves a red flag. The boring version looks like normal enterprise automation: an insurance claim assessment, a warehouse inspection report, a medical-image triage note, a construction progress summary, a product-quality explanation. The system has an image. It has a question. It produces an answer. Somewhere inside the model, language did most of the work and vision became decorative evidence. Very modern. Very polished. Very capable of being wrong. ...

One Pass to Forecast Them All: Toto 2.0 and the Scaling Recipe for Time-Series AI

Forecasting is where machine learning often learns humility. A language model can sound clever while being wrong. A forecasting model has fewer hiding places. Revenue arrives or it does not. CPU saturation happens or it does not. Demand spikes, latency drifts, inventories rot, turbines fail, and the spreadsheet smiles politely before punishing everyone involved. This is why time-series foundation models have been treated with a particular kind of suspicion: useful, interesting, sometimes impressive, but not yet comfortably scalable in the way large language models became scalable. ...

Uncertain Terms: Hallucination Scores Are Triage Signals, Not Lie Detectors

A support ticket lands on the AI team’s desk: the enterprise chatbot answered confidently, cited the wrong policy, and somehow made the compliance team nostalgic for search boxes. The obvious next idea is to add an uncertainty score. When the model is unsure, route the answer to a verifier. When the score is high, reject the output. When the score is low, let it pass. Elegant. Cheap. Measurable. Also, as usual, a little too clean. ...

Synthetic and Sensibility: Why More Data Needs a Control Stack

Synthetic data has become the convenient answer to almost every uncomfortable AI training question. Need more reasoning traces? Generate them. Need domain examples? Generate them. Need privacy-preserving replacements for customer data? Generate them. Need a dataset that looks suspiciously like a benchmark but not too suspiciously like a benchmark? Generate it, then call it “curriculum design.” ...