LLM Evaluation

Trust Issues, Benchmarked: Why Hallucination Detection Is a Portfolio Problem

Trust is a bad deployment strategy. That is not a moral statement. It is an operations statement. In most enterprise AI workflows, the uncomfortable question is not “Can the model answer?” The model will answer. Models are generous like that. The question is whether the organization has a reliable way to notice when the answer is unsupported, fabricated, overconfident, or merely polished nonsense wearing a tie. ...

Trust Me, I’m Benchmarked: Why Enterprise AI Needs Two Audits

Enterprise AI has developed two favorite comfort blankets: the model’s confident explanation and the benchmark score. The first says, “Relax, I reasoned through this.” The second says, “Relax, I scored well on a public test.” Both are useful. Neither is a warranty. And when business teams treat either as proof of reliability, the result is not governance. It is theatre with better typography. ...

Wrong on Purpose: FalsifyBench and the Agent Skill We Keep Forgetting

A good analyst should occasionally try to break their own idea. Not performatively. Not with a decorative “on the other hand” paragraph. Actually break it. Ask the kind of question that could make the current hypothesis collapse, then watch whether the evidence forces a better one. That simple discipline is the center of “FalsifyBench: Evaluating Inductive Reasoning in LLMs with Rule Discovery Games,” a new paper by Leonardo Bertolazzi, Katya Tentori, and Raffaella Bernardi.[1] The paper is framed around scientific reasoning, but its practical message travels well beyond science. If an AI agent cannot test outside its own current belief, it may look careful while doing something much less impressive: confirming the first plausible story it invented. ...

Right Answer, Wrong Audit: When Reasoning Models Grade the Destination, Not the Route

A reviewer sees the final number. It is correct. Then the quiet failure begins. The reviewer stops asking whether the argument actually works. The missing step becomes “implicit.” The shuffled logic becomes “not ideal, but acceptable.” The circular explanation becomes “verbose but essentially correct.” The answer has done something worse than persuade. It has anesthetized the audit. ...

Entropy, My Dear Watson: Finding Hallucinations in the Shape of Uncertainty

A customer-support bot gives a fluent answer. The grammar is clean, the tone is helpful, and the confidence is offensively calm. Then someone checks the underlying fact and discovers the answer is wrong. The old operating question was: Was the model confident? The better question is: What did the model’s uncertainty look like while it was speaking? ...

Cache Me If You Can: Why LLM Benchmarks Need Contamination-Resistant Data

The benchmark score is not the product. The test pipeline is. Benchmarks used to feel like neutral scoreboards. A model sat down, answered questions, received a number, and everyone pretended the number meant generalization. That story became less charming once benchmark questions started appearing in the same public data oceans used to train the models being tested. ...

Peer Pressure: AI Reviewers Pass the Item Test, Not the Replacement Test

Review is a strange business process. The visible output is a verdict: accept, reject, revise, approve, block, escalate. The useful output is usually smaller and more annoying: one specific criticism that is correct, important, and supported by evidence. That distinction is where the new paper “On the limits and opportunities of AI reviewers: Reviewing the reviews of Nature-family papers with 45 expert scientists” becomes more interesting than the usual “can AI replace reviewers?” theatre.[1] The paper does not ask whether an AI reviewer can imitate a human reviewer’s overall score. It asks whether each individual criticism is any good. ...

RAG and the Art of Not Dropping the Answer

A RAG team usually starts with a familiar ambition: make the retrieved context smarter. The raw document feels too long. The search snippet feels too primitive. The page structure looks messy. A query-focused summary sounds more elegant. A proposition list sounds more machine-readable. A paraphrase from a strong LLM sounds, at least cosmetically, like an upgrade. So the team builds another representation layer between retrieval and generation, hoping the model will reward the extra sophistication. ...

The Benchmark Drop Is Not the Verdict: Re-reading GSM-Symbolic with Statistics

A benchmark result lands on the desk. The chart is clean. The message is dramatic. A model performs well on the original math questions, then worse on symbolic variants. Someone in the meeting says the obvious thing: “So it cannot really reason.” That sentence is attractive because it is simple. It is also the kind of sentence that should be forced to pass through a statistical checkpoint before being allowed near procurement, product strategy, or a LinkedIn post with too many lightning emojis. ...

Score and Disorder: Why LLM Reasoning Needs More Than Accuracy

A model review often begins with a spreadsheet. One column says accuracy. Another says cost. A third says latency. Someone asks whether the model is “good enough.” Someone else points at the benchmark score. A decision is made. Procurement smiles. Compliance does not, but compliance rarely smiles anyway. The problem is not that accuracy is useless. The problem is that accuracy is too small a container for the thing businesses actually want from reasoning systems. A final answer can be correct while the route to that answer is unstable, unnecessarily expensive, locally contradictory, or impossible to reproduce under a harmless rewording of the question. That is not a philosophical inconvenience. It is an operational failure mode waiting politely inside a dashboard. ...