Benchmarking

The Judge Is Not Always Right: Stress‑Testing LLM Judges

A judge is useful only if it can survive the boring parts of reality. Not the dramatic failure cases. Not the philosophical debates about machine intelligence. The boring parts: an extra blank line, a shorter answer, a paraphrased sentence, a multi-turn transcript where one message quietly changes the outcome, or a scoring rubric that asks for a number instead of a yes-or-no label. ...

When the Model Knows but Doesn't Remember: The Hidden Blind Spot in LLM Contamination Detection

Audit. That is the word companies like to use when they want uncertainty to sound disciplined. Model audit. Benchmark audit. Contamination audit. The phrase suggests a clean checklist: run the detector, read the score, decide whether the benchmark is safe. The paper behind today’s article makes that picture less comfortable. It studies Contamination Detection via output Distribution, or CDD, on small language models and finds a simple but awkward failure mode: a model can be trained on contaminated benchmark examples, learn from them, and still avoid the kind of verbatim memorization that CDD is designed to catch.[1] ...

Brains, Bias & Benchmarks: Why Multimodal AI Still Struggles with Tumor Truth

MRI is a useful reality check for multimodal AI. It looks like an image problem, behaves like a reasoning problem, and punishes lazy confidence with the quiet brutality of clinical ambiguity. That is why MM-NeuroOnco is more interesting than another “new benchmark” headline.[1] The paper introduces a multimodal instruction dataset and benchmark for MRI-based brain tumor diagnosis, but the dataset size is not the main story. Yes, the authors curate a 73,226-image pool, build 24,726 semantically attributed samples, generate more than 200,000 VQA pairs, and construct a 1,000-image benchmark with more than 3,000 questions. Fine. The spreadsheet is muscular. ...

From Reactive to Preemptive: Benchmarking the Rise of Proactive Mobile Agents

Phone assistants have one deeply underrated talent: they wait. They wait for the user to unlock the screen. They wait for a command. They wait for a nicely phrased instruction that explains the goal, the app, the constraints, and preferably the user’s hidden motivation. Then, if the demo gods are merciful, they execute. ...

When Retrieval Isn’t Enough: The DEEPSYNTH Wake‑Up Call

Search is easy to admire because it looks busy. The agent opens pages. It follows links. It finds PDFs. It writes Python. It returns a neat JSON object, ideally with the confidence of someone who has just discovered government statistics. This is the part of AI demos that makes executives lean forward: the machine appears to have become an analyst. ...

Flip the Script: When Causality Breaks the LLM Illusion

A fire alarm can cause people to evacuate. It can cause a building to enter alert mode. It can trigger emergency procedures, bring firefighters, and make everyone suddenly remember where the stairs are. But does a fire alarm cause a fire? Obviously not. At least, obviously not to a human who understands the causal structure. The alarm is usually an effect or signal of fire risk, not the origin of the fire itself. A model trained on enough sentences of the form “fire alarm causes…” may not be so careful. It may see the familiar phrase pattern, complete the familiar answer, and walk directly into the wrong conclusion with excellent grammar. ...

Lost in the Links: When World Knowledge Isn’t Enough

Links look harmless. One click from one Wikipedia page to another. Then another. Then another. No robotics. No messy browser UI. No customer database. No procurement workflow with three inconsistent Excel files and one person named Mike who “usually knows where that form is.” Just hyperlinks. That is why LLM-WikiRace is useful. It strips agentic AI down to a small, irritating question: when a model knows a lot about the world, can it use that knowledge step by step without getting lost?[1] ...

Signal Over Noise: Why Multimodal RL Needs to Know What to Ignore

Audio. Video. Subtitles. The standard instinct is to send all of them into the model and hope the transformer performs its usual magic trick: turn a messy pile of signals into a useful answer. This instinct is understandable. It is also expensive, noisy, and occasionally a magnificent way to teach the model the wrong lesson. ...

See, Plan, Snap: Why AI Can Think in Blocks but Can’t Drop Them

Blocks are supposed to make programming easier. That is the whole promise of Scratch: instead of typing syntax, the learner drags colorful blocks, snaps them together, and watches the program run. No semicolons. No import errors. No spiritual damage from invisible whitespace. Very civilized. Now give that same interface to an AI agent. ...

Houston, We Have a Benchmark: When Agentic AI Meets Orbital Reality

Space is not impressed by fluent reasoning. A satellite does not care that an AI agent has produced a confident plan. A ground station cannot magically see through the Earth because the prompt says “ensure connectivity.” A sensor cannot keep collecting images after its onboard storage is full. Orbital mechanics, power budgets, slew angles, data buffers, and line-of-sight geometry are not stakeholder preferences. They are constraints. Reality, annoyingly, still has root access. ...