LLM Evaluation

When Your House Talks Back: Teaching Buildings to Think About Energy

A high electricity bill arrives. You ask your smart-home assistant what happened. It checks the meter data, explains that the electric-vehicle charger ran during peak-rate hours, and recommends a cheaper schedule. Useful. Then you ask how much the new schedule will save next month. The assistant retrieves the tariff, forecasts consumption, applies export credits from the solar panels, and confidently reports a number. ...

When Safety Stops Being a Turn-Based Game

Jailbreaks are not polite enough to wait their turn. That is the awkward weakness in many safety-training pipelines. A model is attacked, patched, tested, and released. Then another attack appears, usually crafted with more creativity than the previous defense assumed. The safety team patches again. The benchmark improves. The real attack surface moves. Everyone calls this iteration, because “organized whack-a-mole with GPUs” sounds less respectable. ...

When 100% Sensitivity Isn’t Safety: How LLMs Fail in Real Clinical Work

Clinic. That is where the comforting AI story starts to wobble. In a benchmark, a clinical model receives a clean question, enough context, and a scoring rule that usually rewards the right answer. In a clinic, the same model sees an elderly patient with multiple conditions, incomplete records, medication changes from years ago, possible specialist involvement, ambiguous prescribing history, and a problem that may not require action at all. The model is not merely being asked, “Can you spot a risk?” It is being asked, “Do you understand whether this risk is real, current, important, and safely actionable?” ...

From Benchmarks to Beakers: Stress‑Testing LLMs as Scientific Co‑Scientists

Benchmarks are clean. Research is not. A benchmark asks a model to answer a question, then politely stops. A research workflow asks the model to form a hypothesis, test it, read the result, notice what went wrong, adjust the plan, and try again without wandering into scientific nonsense. One is a quiz. The other is a beaker with a budget, a deadline, and a surprisingly expensive simulation queue. ...

Long Thoughts, Short Bills: Distilling Mathematical Reasoning at Scale

The invoice arrives after the benchmark party. Math benchmarks are fun until the training bill arrives. A model can be taught to produce longer reasoning traces. It can be shown more olympiad problems. It can be given Python. It can be pushed into 128K-token contexts and told, heroically, to think harder. All of this sounds impressive in a benchmark table. Less impressive is the operational detail that most training samples do not need the full 128K window, yet a naive training setup can still make every step pay for it. ...

Mind-Reading Without Telepathy: Predictive Concept Decoders

Audit is usually boring until the system being audited can write a beautiful excuse. Ask a language model why it refused a harmful request, why it used a shortcut, or why it made a strange numerical mistake, and it may give a polished answer. That answer may even sound morally mature, procedurally clean, and delightfully compliant with the safety policy. Very nice. Also: not enough. ...

Picking Less to Know More: When RAG Stops Ranking and Starts Thinking

Search is not judgment. Search is easy to admire because it produces something visible. A ranked list. A bigger context window. A satisfying pile of passages that says, “Look, we retrieved evidence.” Very comforting. Also not the same as knowing what evidence is actually needed. That distinction is the core of Context-Picker: Dynamic Context Selection Using Multi-stage Reinforcement Learning.¹ The paper studies a familiar RAG problem: if a system retrieves too little, it misses the answer; if it retrieves too much, it drags in distractors, repeats, weakly related fragments, and the usual long-context swamp where useful evidence politely disappears in the middle. ...

NeuralFOMO: When LLMs Care About Being Second

Losing is not the problem. Being seen losing is. Put two AI agents in the same workflow and the design immediately stops being a simple productivity question. One agent writes code. Another reviews it. A third ranks alternatives. A fourth routes the next task to whoever looks most competent. At the slide-deck level, this is “multi-agent collaboration.” In the logs, it is often a scoreboard with better manners. ...

When Reasoning Needs Receipts: Graphs Over Guesswork in Medical AI

Diagnosis is not a magic word. In medicine, the answer matters, but the path to the answer matters almost as much. A model that says the correct disease name after skipping the decisive evidence is not “reasoning efficiently.” It is guessing with bedside manner. That is the problem addressed by MedCEG: Reinforcing Verifiable Medical Reasoning with Critical Evidence Graph.¹ The paper’s core claim is not simply that a medical LLM can score higher on benchmarks. That would be useful, but not especially surprising. The more interesting move is architectural: the authors try to make clinical reasoning trainable by turning it into a graph of required evidence, then rewarding the model for following that graph. ...

When LLMs Get Fatty Liver: Diagnosing AI-MASLD in Clinical AI

A patient walks into a clinic and tells the doctor several things at once: chest tightness, shortness of breath, leg swelling, leg pain, maybe a history of walking too much, maybe some anxiety, maybe something that sounds more obviously cardiac. The dangerous part is not the word “chest.” The dangerous part is the chain: leg swelling and pain may suggest deep vein thrombosis; shortness of breath may suggest pulmonary embolism; pulmonary embolism can kill. ...