AI Evaluation

No Structure, No Glory: Why AI Cognition Has to Be Shown, Not Named

TL;DR for operators: AI systems are now sold with labels that sound increasingly cognitive: reasoning, planning, agency, memory, autonomy, sometimes even the more theatrical hints of machine consciousness. Lovely. The marketing department has discovered philosophy. The useful question is not whether the label feels exciting. It is whether the system realizes an internal organization that could actually support the claimed capability. ...

The Harness Wants a Promotion

TL;DR for operators: Most agent failures are blamed on the model because blaming “the model” is emotionally convenient and operationally vague. HarnessX makes a more useful claim: the runtime harness around the model (prompts, tools, memory, control flow, tracing, evaluators, safety checks, and training interfaces) is not scaffolding in the disposable sense. It is part of the system’s intelligence surface.[1] ...

The Lesson Plan Is the Product

TL;DR for operators: AI learning is usually sold as a volume story: more data, more retrieval, more reasoning tokens, more reinforcement learning. Comforting. Also incomplete. Three recent papers make a more useful point. The model does not merely need more exposure. It needs a better lesson plan. One paper shows that a model can be given a more meaningful difficulty ranking for training examples, yet still fail to beat ordinary full-data training unless scoring and pacing are engineered together. Another shows that travel-planning agents become more factually grounded when forced into retrieval, but that the burden of grounding can damage instruction retention and preference satisfaction. A third shows that legal AI systems can be rewarded for correct prosecution outcomes without learning the underlying discrimination process that separates evidence insufficiency, statutory non-liability, discretionary non-prosecution, and prosecution. ...
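For readers who want the scoring-versus-pacing distinction in concrete form, here is a minimal sketch. It is not taken from any of the three papers: the linear pacing schedule, the `start_fraction` parameter, and the function names are illustrative assumptions. The scoring function decides the order of examples; the pacing function decides how much of that order the model is allowed to see at a given step.

```python
def linear_pacing(step, total_steps, start_fraction=0.2):
    """Fraction of the difficulty-sorted data unlocked at `step`.

    Hypothetical schedule: start with the easiest `start_fraction`
    of the data and grow linearly to the full set.
    """
    progress = min(1.0, step / total_steps)
    return start_fraction + (1.0 - start_fraction) * progress


def curriculum_pool(examples, difficulty, step, total_steps, start_fraction=0.2):
    """Return the examples the model may sample from at `step`.

    `difficulty` is the scoring function (lower means easier);
    `linear_pacing` is the pacing function. Both are assumptions.
    """
    ranked = sorted(examples, key=difficulty)
    fraction = linear_pacing(step, total_steps, start_fraction)
    n = max(1, round(fraction * len(ranked)))
    return ranked[:n]


# Toy data: each example is (name, difficulty score).
examples = [("e%d" % d, d) for d in (7, 2, 9, 0, 4, 1, 8, 3, 6, 5)]
score = lambda ex: ex[1]

early = curriculum_pool(examples, score, step=0, total_steps=100)
late = curriculum_pool(examples, score, step=100, total_steps=100)
print([name for name, _ in early])
print(len(late))
```

A good ranking paired with a poor schedule can still starve the model of hard examples for too long, or release them too early, which is one plausible reading of why the two have to be tuned together.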

The Retriever Found Similar Things. The Evidence Was Elsewhere.

TL;DR for operators: The current enterprise RAG conversation still has a charmingly stubborn misconception: if the model hallucinates, buy better embeddings, increase the context window, add an agent, and hope the PowerPoint becomes true. The two papers here point in a less theatrical direction. One paper, Non-negative Elastic Net Decoding for Information Retrieval, argues that dense retrieval has a structural weakness: it scores each candidate independently, so it can retrieve several similar items instead of the complementary set actually needed to answer the query.[1] The other, Agentic Hybrid RAG for Evidence-Grounded Muon Collider Analysis, shows what happens when retrieval is treated as a full evidence workflow: sparse and dense retrieval are fused, queries are decomposed under constraints, evidence is deduplicated and budgeted, and answers are judged for coverage, hallucination, and abstention.[2] ...
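The independent-scoring weakness is easy to reproduce on a toy example. The sketch below is an illustration of the general idea, not the paper's algorithm: the two-dimensional "embeddings", the penalty values, and the coordinate-descent solver are all assumptions made for the demonstration. Top-k by dot product picks two near-duplicates, while a non-negative elastic net fit, which has to reconstruct the query from the candidates jointly, gives its largest weights to a complementary pair.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_k_independent(query, docs, k):
    """Score each document on its own and keep the k best."""
    scores = [dot(query, d) for d in docs]
    return sorted(range(len(docs)), key=lambda i: -scores[i])[:k]


def nonneg_elastic_net(query, docs, l1=0.05, l2=0.01, sweeps=500):
    """Minimise 0.5*||q - sum_j w_j d_j||^2 + l1*sum(w) + 0.5*l2*sum(w^2)
    subject to w >= 0, using plain coordinate descent."""
    w = [0.0] * len(docs)
    residual = list(query)  # query minus the current reconstruction
    for _ in range(sweeps):
        for j, d in enumerate(docs):
            # Remove document j's contribution to get the partial residual.
            residual = [r + w[j] * x for r, x in zip(residual, d)]
            # Closed-form non-negative update for coordinate j.
            w[j] = max(0.0, (dot(d, residual) - l1) / (dot(d, d) + l2))
            residual = [r - w[j] * x for r, x in zip(residual, d)]
    return w


# Toy setup: the query needs "fact A" (axis 0) and "fact B" (axis 1).
query = [1.0, 1.0]
docs = [
    [1.0, 0.0],   # doc 0: covers fact A
    [0.95, 0.0],  # doc 1: near-duplicate of doc 0
    [0.0, 0.9],   # doc 2: covers fact B
]

independent = top_k_independent(query, docs, k=2)
weights = nonneg_elastic_net(query, docs)
joint = sorted(sorted(range(len(docs)), key=lambda i: -weights[i])[:2])

print(independent)  # [0, 1]: two documents about the same fact
print(joint)        # [0, 2]: one document per fact
```

The duplicate still receives some weight because the ridge term spreads mass across correlated candidates, but it no longer outranks the only document that covers the second fact.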

Measure Twice, Deploy Once: The Hidden Geometry of Reliable AI

TL;DR for operators: The practical problem is not that AI systems lack benchmarks. We are drowning in benchmarks. The problem is that many benchmarks, design scores, and demo metrics politely avoid the failure modes that will later become incident reports, refund requests, clinical risk reviews, or broken robots wedged under furniture. Two recent papers make the same point from very different directions. One studies Argus, a spherical, many-legged robot designed around dynamic isotropy: the uniformity of attainable center-of-mass acceleration across directions.[1] The other reworks panoptic segmentation evaluation by replacing a fixed one-to-one segment matching rule with a configurable assignment framework that can handle fragmentation, merging, thresholds, Voronoi regions, and part-aware targets.[2] ...

Context Collapse: Why AI’s Next Bottleneck Is Knowing What Matters

TL;DR for operators: AI is getting fluent enough to be dangerous in boring ways. It can describe a scene, generate a video, and write a policy memo with impressive confidence. The problem is that real operations rarely fail at the level of generic fluency. They fail when the system confuses which person did what, blends event one into event two, or treats a documented atrocity as a debate club prompt because a user asked for “balance”. ...

The Path of Least Assurance: Why AI Reliability Lives Between the Steps

TL;DR for operators: AI reliability is increasingly a process problem, not an answer-checking problem. Three recent arXiv papers make that point from very different angles. MoCo-EA shows that adversarial examples are not merely isolated malicious pixels lurking in the shrubbery; they can lie along continuous, optimisable paths.[1] ConceptAgent shows that erasing a concept from a diffusion model may disrupt the early text-to-image link while leaving later trajectory dynamics available for concept re-entry.[2] BlueFin shows that LLM agents doing finance spreadsheet work fail in ways that only appear when you inspect formulas, recalculation behaviour, workbook mutations, tool choices, and whether the output helps a human analyst do useful work.[3] ...

Judge, Jury, and Calibration: Why AI Evaluation Needs Anchors

TL;DR for operators: AI is becoming very good at producing judgement-shaped output. That is not the same thing as judgement. Two recent papers make the same operational point from different sides: one shows how AI can estimate educational item difficulty before response data are available; the other shows how LLM-generated peer reviews can look serious while diverging from human reviewing behaviour.[1][2] ...

Pre-Review, Not Peer Review: The Drafting Gate AI Actually Earns

TL;DR for operators: AI-Paper-Review is useful because it behaves like a disciplined pre-submission review room, not because it makes peer reviewers obsolete. The system selects a panel of AI reviewer personas, makes them review independently, clusters duplicated concerns, ranks the resulting issues by consensus and severity, then compares them with human reviews. That mechanism matters more than the slogan, because raw AI critique is cheap, noisy, and very good at sounding busy. ...
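The cluster-then-rank step is where the noise gets squeezed out, so a rough sketch may help. This is not AI-Paper-Review's implementation: the word-overlap (Jaccard) similarity, the 0.5 threshold, the 1 to 3 severity scale, and every name below are assumptions chosen to keep the example self-contained. Concerns that say roughly the same thing are merged; issues are then ordered by how many distinct reviewers raised them, with severity as the tie-breaker.

```python
def tokens(text):
    return set(text.lower().split())


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_concerns(concerns, threshold=0.5):
    """Greedily merge concerns whose wording overlaps with a cluster's
    first concern. Real systems would likely use embeddings instead."""
    clusters = []
    for reviewer, text, severity in concerns:
        for cluster in clusters:
            if jaccard(tokens(text), tokens(cluster["text"])) >= threshold:
                cluster["reviewers"].add(reviewer)
                cluster["severity"] = max(cluster["severity"], severity)
                break
        else:
            clusters.append(
                {"text": text, "reviewers": {reviewer}, "severity": severity}
            )
    return clusters


def rank_issues(clusters):
    """Order by consensus (distinct reviewers), then by severity."""
    return sorted(
        clusters,
        key=lambda c: (len(c["reviewers"]), c["severity"]),
        reverse=True,
    )


# Hypothetical panel output: (persona, concern, severity from 1 to 3).
concerns = [
    ("methodologist", "baseline comparison is missing", 3),
    ("statistician", "figure labels are too small", 1),
    ("statistician", "missing baseline comparison", 2),
    ("domain_expert", "test set leaks into training data", 3),
    ("domain_expert", "the baseline comparison is missing", 3),
]

ranked = rank_issues(cluster_concerns(concerns))
for issue in ranked:
    print(len(issue["reviewers"]), issue["severity"], issue["text"])
```

Five raw comments collapse into three issues, and the one that three personas raised independently goes to the top. Whether consensus should outrank severity is a design choice; a single reviewer who spots a data leak may deserve more attention than this ordering gives them.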

Wait, Let Me Check: Why Long-CoT AI Can Still Verify the Wrong Thing

Checking is supposed to calm people down. In business, a second review makes a financial model feel safer. A compliance checklist makes a release feel governed. A senior analyst saying “let me double-check that” gives the room a small dopamine hit of procedural seriousness. Long Chain-of-Thought models have learned the same theatre. They pause. They reconsider. They say “wait.” They verify arithmetic. They sometimes generate reasoning traces so long that one begins to feel the model must be thinking deeply, if only because wasting that many tokens while being shallow seems rude. ...