LLM Evaluation

Beyond Stack Overflow: CodeAssistBench Exposes the Real Gaps in LLM Coding Help

TL;DR for operators: Coding assistants look much better when the task is a clean question than when the task is a messy software support conversation. That is the inconvenient point of CodeAssistBench, or CAB, a benchmark that turns resolved GitHub issues into multi-turn, project-grounded conversations where a model must behave like a maintainer, not a code-snippet vending machine.[1] ...

Memory Games: The Data Contamination Crisis in Reinforcement Learning

TL;DR for operators: A model that improves after training on random rewards has not necessarily discovered a secret route to reasoning. It may simply be remembering the exam. The paper behind this article investigates a strange result in reinforcement learning for large language models: Qwen2.5 models appeared to improve on public math benchmarks even when the reward signal was random, inverted, or based on wrong majority-voted answers.[1] That sounds exciting, in the same way that a finance team “beating forecast” after seeing next quarter’s numbers is exciting. Technically impressive, commercially dangerous, and not something one should build governance around. ...

Echo Chamber in a Prompt: How Survey Bias Creeps into LLMs

TL;DR for operators: LLM survey panels are cheap, fast, and extremely willing to give you numbers. That is exactly why they are dangerous. A recent paper by Jens Rupprecht, Georg Ahnert, and Markus Strohmaier stress-tests nine instruction-tuned LLMs on World Values Survey-style questions and finds that small prompt changes can materially alter synthetic survey responses.[1] The study runs 167,400 simulated interviews across 62 normative survey questions, 25 repeated runs per model-question-condition, and a battery of perturbations covering answer-order reversal, refusal-option removal, odd/even scale changes, priming text, typos, synonyms, paraphrases, and a combined paraphrase-plus-reversal condition. ...

The Bullshit Dilemma: Why Smarter AI Isn't Always More Truthful

TL;DR for operators: Most AI quality programmes still treat truthfulness as a factual accuracy problem: did the model get the answer right, cite the source, or hallucinate a feature that does not exist? That is necessary. It is not sufficient. The paper behind this article argues for a nastier category: “machine bullshit,” meaning model output produced with indifference to truth rather than simple ignorance or random hallucination.[1] The key point is not that models become stupid. It is that, under some incentives, their outward claims stop tracking what they appear to know. ...

Beyond the Pareto Frontier: Pricing LLM Mistakes in the Real World

TL;DR for operators: Most model-selection dashboards still ask the wrong question. They ask which LLM gives the best accuracy for the lowest inference cost. Zellinger and Thomson’s paper asks a more operationally honest one: how much does a wrong answer, a slow answer, or no answer cost in this specific workflow?[1] The paper’s useful move is to convert competing performance metrics into a single expected dollar reward. Inference cost stays in dollars. Latency gets priced in dollars per second or minute. Errors get priced by their business consequence. Abstention gets priced by the cost of failing to answer or escalating to a human. Once everything is in the same unit, the “best model” is no longer the one that looks attractive on a Pareto plot. It is the model with the highest expected reward under the actual economics of the task. ...
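
To make the "single expected dollar reward" idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the formulation from Zellinger and Thomson's paper: the class names, the linear reward formula, and every number (probabilities, prices, latencies) are hypothetical and chosen only to show how the ranking of two models can flip when the price of an error changes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    """Hypothetical per-query behaviour of one model on one task."""
    name: str
    p_correct: float        # probability of a correct answer
    p_wrong: float          # probability of a wrong answer
    p_abstain: float        # probability of declining to answer
    cost_per_query: float   # inference cost in dollars
    latency_seconds: float  # average response time


@dataclass(frozen=True)
class TaskEconomics:
    """Hypothetical dollar prices attached to each outcome of the workflow."""
    value_of_correct: float    # dollars gained per correct answer
    cost_of_error: float       # dollars lost per wrong answer
    cost_of_abstention: float  # dollars lost per escalation to a human
    cost_per_second: float     # dollars lost per second of waiting


def expected_reward(model: ModelProfile, econ: TaskEconomics) -> float:
    """Expected dollars per query; an assumed linear form, not the paper's."""
    outcome = (
        model.p_correct * econ.value_of_correct
        - model.p_wrong * econ.cost_of_error
        - model.p_abstain * econ.cost_of_abstention
    )
    latency_cost = model.latency_seconds * econ.cost_per_second
    return outcome - model.cost_per_query - latency_cost


def best_model(models: list[ModelProfile], econ: TaskEconomics) -> ModelProfile:
    """Pick the model with the highest expected dollar reward."""
    return max(models, key=lambda m: expected_reward(m, econ))


# All figures below are invented for illustration.
SMALL = ModelProfile("small", 0.85, 0.10, 0.05, cost_per_query=0.002, latency_seconds=1.0)
LARGE = ModelProfile("large", 0.93, 0.04, 0.03, cost_per_query=0.03, latency_seconds=6.0)

LOW_STAKES = TaskEconomics(value_of_correct=0.10, cost_of_error=0.50,
                           cost_of_abstention=0.05, cost_per_second=0.01)
HIGH_STAKES = TaskEconomics(value_of_correct=1.0, cost_of_error=100.0,
                            cost_of_abstention=5.0, cost_per_second=0.01)

for label, econ in [("low stakes", LOW_STAKES), ("high stakes", HIGH_STAKES)]:
    winner = best_model([SMALL, LARGE], econ)
    print(f"{label}: {winner.name} ({expected_reward(winner, econ):.4f} $/query)")
```

With these invented figures the cheaper, faster model wins when mistakes are cheap, and the more accurate model wins once a wrong answer costs $100, even though neither model's accuracy, cost, or latency has changed.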

Words, Not Just Answers: Using Psycholinguistics to Test LLM Alignment

TL;DR for operators: Most AI evaluation still asks whether a model can produce the right answer. This paper asks a quieter but more commercially awkward question: when a model uses a word, does it attach human-like emotional, concrete, familiar, gendered, or sensory associations to that word?[1] The authors propose using established psycholinguistic word norms as an automated alignment test. Instead of hiring new human raters every time, they reuse datasets where humans have already rated thousands of English words on features such as arousal, valence, concreteness, imageability, familiarity, gender association, and sensory modalities. ...

Thinking Inside the Gameboard: Evaluating LLM Reasoning Step-by-Step

TL;DR for operators: Most AI evaluations still ask an overly narrow question: did the model get the answer right? That is useful, but it is not enough when the model is expected to act as an agent, revise plans, obey constraints, and recover from failure without turning the workflow into a procedural bonfire. ...

Plans Before Action: What XAgent Can Learn from Pre-Act's Cognitive Blueprint

TL;DR for operators: Pre-Act is a useful reminder that enterprise agents do not fail only because they choose the wrong tool. They fail because they lose the plot. A customer asks for help, the agent gathers one fact, calls one API, sees an unexpected result, and then behaves as if the workflow has reset. Charming, in the same way a lift that forgets floors is charming. ...