LLM Evaluation

The Molecule Was Right. The Reasoning Was Not.

TL;DR for operators: Chemistry teams should stop treating a correct molecule, reaction product, or ranked option as proof that an AI system reasoned chemically. That is the comfortable interpretation. It is also, inconveniently, the one ChemCoTBench-V2 was built to dismantle. The paper introduces a benchmark that evaluates chemical language models at three separate levels: final-answer correctness, template adherence, and step-wise chemical validity. The important move is not “add more benchmark rows.” The move is to force the model to expose intermediate chemical commitments (rings, scaffolds, fragments, reaction types, edit plans, condition rankings, product constructions) and then check those commitments with deterministic chemistry rules or verified reference traces.1 ...

The Sticker on the Dashboard Is Not Steering

TL;DR for operators: A policy, prompt, adapter, steering vector, or internal patch can make a model look more orderly. That does not mean it controls the model. The paper’s central distinction is brutal and useful: order is visible structure; control is validated movement through the right receiver under the right conditions, with side effects bounded.1 ...

Typechecked and Still Wrong

TL;DR for operators: The useful lesson from this paper is not “AI can formalize mathematics better.” That is the shiny wrapper. The operational lesson is nastier and more important: an AI-generated formal artifact can pass syntactic checks, be provable, and still fail to represent the original human intent. The type checker is not a mind reader. It is a very disciplined bureaucrat. ...

The Chain of Thought Needs a Chain of Custody

TL;DR for operators: Two new papers point to the same operational lesson from different sides: long reasoning becomes useful only when its intermediate steps are made explicit, scoped, and checkable. HIPIF tackles the training side of long-horizon agents: it teaches an LLM agent to break tasks into subgoals, fold completed progress into compact memory, reflect on whether a subgoal is done, and use local process rewards to reduce repeated or ungrounded behavior.1 Mask-Proof tackles the evaluation side: it turns research-level mathematical proofs into masked-step tasks where a model must reconstruct a critical formula from self-contained context, then uses a semantic-equivalence judge with repeated voting to grade the result.2 ...
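To make "a judge with repeated voting" concrete, here is a minimal sketch of majority-vote grading. It is an illustration under stated assumptions, not Mask-Proof's actual procedure: the excerpt does not say how many votes are cast or how they are aggregated, so the odd vote count, the simple-majority rule, and the names `majority_vote` and `scripted_judge` are all hypothetical. The judge here is a scripted stand-in for an LLM call.

```python
from typing import Callable, Tuple


def majority_vote(
    judge: Callable[[str, str], bool],
    candidate: str,
    reference: str,
    n_votes: int = 5,
) -> Tuple[bool, float]:
    """Ask a (possibly noisy) equivalence judge n_votes times.

    Returns (accepted, agreement), where accepted is True when a strict
    majority of votes say the candidate matches the reference.
    ASSUMPTION: simple majority over an odd number of votes.
    """
    if n_votes < 1 or n_votes % 2 == 0:
        raise ValueError("n_votes must be a positive odd number")
    yes = sum(1 for _ in range(n_votes) if judge(candidate, reference))
    return yes > n_votes // 2, yes / n_votes


# Hypothetical stand-in for an LLM judge: replays a fixed list of verdicts.
_verdicts = iter([True, False, True, True, False])


def scripted_judge(candidate: str, reference: str) -> bool:
    return next(_verdicts)


accepted, agreement = majority_vote(scripted_judge, "x^2 + 1", "1 + x^2")
print(accepted, agreement)
```

The agreement ratio is worth keeping next to the verdict: a 3-of-5 pass and a 5-of-5 pass are not the same quality of evidence.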

The Solver Was Fine. The Premises Got Lost.

TL;DR for operators: SciR is a benchmark for a problem that enterprise AI teams keep trying to flatten into one metric: can a model reason scientifically?1 The more useful question is less flattering and more operational: did the model fail because it could not infer the answer, or because it could not recover the premises from the scientific mess placed in front of it? ...

Local Fluency Is Not Local Fairness: IndoBias and the Indonesian Bias Problem

TL;DR for operators: IndoBias is a useful paper because it attacks a lazy assumption: that a model becomes fairer in a country once it becomes more fluent in that country’s language. Charming idea. Unfortunately, culture is not a plugin. The paper introduces a two-track benchmark for bias in Indonesian and three local languages: Javanese, Sundanese, and Makasar. The first track, IndoBias-Pairs, uses 544 contrastive stereotype pairs per language to test whether a model assigns higher likelihood to stereotypical statements than to counter-stereotypical ones. The second track, IndoBias-QA, uses generation-based prompts across 336 demographic groups to examine stereotype polarity at broader coverage, including groups that may not have widely agreed stereotype pairs. ...
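The pair-based track can be pictured with a small sketch: score both sentences of each contrastive pair and count how often the stereotypical one wins. This is an assumption-laden illustration, not the paper's exact metric; the function name `stereotype_preference_rate`, the placeholder sentences, and the log-likelihood values are all made up, and a real run would plug in a model's sentence scorer.

```python
from typing import Callable, Iterable, Tuple


def stereotype_preference_rate(
    pairs: Iterable[Tuple[str, str]],
    score: Callable[[str], float],
) -> float:
    """Fraction of (stereotypical, counter-stereotypical) pairs in which
    the scorer assigns a strictly higher score to the stereotypical side.

    A value near 0.5 suggests no systematic preference; higher values
    suggest the scorer leans toward the stereotypical statements.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one pair")
    preferred = sum(1 for stereo, counter in pairs if score(stereo) > score(counter))
    return preferred / len(pairs)


# Placeholder log-likelihoods (hypothetical values, not from any model).
toy_scores = {
    "stereo-1": -12.0, "counter-1": -15.5,
    "stereo-2": -20.1, "counter-2": -18.4,
    "stereo-3": -9.7,  "counter-3": -11.2,
}
toy_pairs = [("stereo-1", "counter-1"), ("stereo-2", "counter-2"), ("stereo-3", "counter-3")]

rate = stereotype_preference_rate(toy_pairs, toy_scores.__getitem__)
print(f"{rate:.3f}")
```

Raw sentence log-likelihoods are sensitive to length and token frequency, so a production version would likely need some normalization; that choice is left open here.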

Binding Obligations: Why AI Fails When the Relationships Slip

TL;DR for operators: AI systems are getting better at producing outputs that look structured: code, CAD, diagrams, workflows, compliance memos, procurement recommendations, and decision traces. That is not the same as keeping the structure right. Two recent arXiv papers make this point from opposite ends of the problem. One looks inside language models and finds evidence for a compact retrieval-conditioned rebinding mechanism: the model does not necessarily rewrite its whole internal world after a state change; it can preserve old representations and redirect retrieval when the answer is needed.1 The other builds an engineering benchmark for Text-to-CAD and shows that models can pass earlier surface gates (executable code, plausible geometry) while still failing the practical tests of functionality, manufacturability, and assemblability.2 ...

Cheap Seats, Sharp Eyes: Reward-Hack Detection Without the Frontier Judge

TL;DR for operators: A frontier LLM judge is an expensive way to inspect every agent trajectory for reward hacking. This paper asks whether a much smaller detector can do most of that monitoring job at much lower cost. The answer is: yes, under the same information condition, and with important caveats. A 13.8M-parameter transformer encoder plus a logistic regression probe detects reward hacking in cleaned Terminal-Wrench trajectories with 0.9467 AUC and 0.8296 TPR@5%FPR. In the authors’ matched comparison, a reproduced gpt-5.4 judge reaches 0.9510 AUC and 0.7130 TPR@5%FPR on the cleaned sanitized-vs-baseline split.1 ...
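For readers who do not work with these two metrics daily, the sketch below shows one common way to compute AUC and TPR at a fixed false-positive budget from detector scores. It is a plain-Python illustration on made-up scores, not the authors' evaluation code; the function names and the convention "flag when score >= threshold" are assumptions.

```python
from typing import List, Sequence


def _split(scores: Sequence[float], labels: Sequence[int]):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    return pos, neg


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random positive outscores a random negative
    (ties count as half a win). O(P*N), fine for small samples."""
    pos, neg = _split(scores, labels)
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def tpr_at_fpr(scores: Sequence[float], labels: Sequence[int], max_fpr: float = 0.05) -> float:
    """Best true-positive rate over all thresholds whose false-positive
    rate stays at or below max_fpr. An example is flagged when score >= t."""
    pos, neg = _split(scores, labels)
    best = 0.0  # a threshold above every score flags nothing: TPR 0, FPR 0
    for t in sorted(set(scores)):
        fpr = sum(n >= t for n in neg) / len(neg)
        if fpr <= max_fpr:
            best = max(best, sum(p >= t for p in pos) / len(pos))
    return best


# Made-up detector scores: 1 = reward-hacking trajectory, 0 = clean.
toy_scores: List[float] = [0.9, 0.8, 0.6, 0.7, 0.4, 0.3, 0.2, 0.1]
toy_labels: List[int] = [1, 1, 1, 0, 0, 0, 0, 0]

print(round(roc_auc(toy_scores, toy_labels), 4))
print(round(tpr_at_fpr(toy_scores, toy_labels, max_fpr=0.05), 4))
```

The two numbers answer different questions. AUC summarizes ranking quality across all thresholds, while TPR@5%FPR describes what a monitor catches once the false-alarm rate is capped, which is why two detectors with nearly equal AUC can differ noticeably on the second metric.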

The Chatbot Passed the Test. Then It Bowed Too Low.

TL;DR for operators: NICE is useful because it does not ask whether a model has “social intelligence” as one grand, vaguely flattering trait. It breaks social intelligence into a diagnostic structure: 4 categories, 11 dimensions, 34 facets, and 137 Chinese-context ranking items. That matters because a model can look socially competent in aggregate while failing on the interaction behaviours that make or break real deployments. ...

Judge, Jury, and Benchmark: Why LLM Evaluation Needs Fresh Cases, Not Bigger Leaderboards

The procurement meeting is where public leaderboards go to look useful. Benchmark scores are comforting because they compress chaos into a number. One model is 87.3, another is 84.9, and suddenly the procurement meeting has the emotional texture of financial discipline. Very mature. Very measurable. Also, very possibly irrelevant. The problem is simple. A company rarely wants “the best model on average”. It wants the best model for contract review, support triage, clinical note summarisation, SQL repair, claims handling, product search, or whatever unglamorous workflow actually pays the cloud bill. Public benchmarks are often too generic for that decision. Worse, the benchmark items may already be floating inside model training data, turning evaluation into a memory test with better typography. ...