Benchmarks

When Medical AI Stops Guessing and Starts Asking

Slides are easy to admire and hard to interrogate. That is the unpleasant little problem behind medical AI. A pathology image can look like a rich source of clinical intelligence, and a large multimodal model can produce fluent comments about what it sees. But fluent comments are not the same thing as medical insight. A model can describe tissue architecture, mention invasion risk, add a treatment-sounding phrase, and still fail at the actual analytical task: asking the right question, finding the relevant evidence, connecting it to a clinically meaningful conclusion, and knowing when it has not seen enough. ...

Who Gets Flagged? When AI Detectors Learn Our Biases

Classroom. A student submits an essay. A detector returns a score. Someone in authority reads that score as evidence. The student now has to prove that their own words are, in fact, their own. This is the point where AI-text detection stops being a technical widget and becomes an institutional decision system. The question is no longer just “Can this model distinguish AI-generated text from human writing?” It is “Which humans does it fail to recognize as human?” ...

Seeing Isn’t Knowing: Why Vision-Language Models Still Miss the Details

A photo arrives in a product-support workflow. The model sees the image, answers confidently, and explains the object’s features. The prose is smooth. The reasoning sounds plausible. The problem is smaller and more brutal: it named the wrong thing. That is the failure mode at the center of Towards Fine-Grained Recognition with Large Visual Language Models: Benchmark and Optimization Strategies, a paper that introduces the Fine-grained Recognition Open World benchmark, or FROW. The paper is not asking whether large vision-language models can talk about images. They can. We have all been sufficiently dazzled by captioning demos; please clap responsibly. ...

Same Content, Different Worlds: Why Multimodal LLMs Still Disagree With Themselves

Screenshot. That is where many business workflows quietly change the problem. A support agent receives a screenshot of a customer bill instead of the billing table as text. A contract review tool receives a scanned clause instead of the clause extracted from the PDF. A procurement assistant receives a rendered purchase order, not the original form fields. Everyone involved assumes the content is the same. The model can read it. The OCR looks correct. The answer should be the same. ...

Benchmarking Without Borders: How GraphBench Rewrites the Rules of Graph Learning

Benchmarks Are Where Models Stop Being Inspirational

Benchmarks are not glamorous. They are where models go after the demo video, after the conference slide, and after the sentence “this generalizes beautifully” has done its little dance in front of investors. Graph learning badly needs that room. For years, graph machine learning has been evaluated on comfortable territory: molecular graphs, citation networks, small academic datasets, and carefully packaged tasks that are useful but narrow. That helped the field grow. It also created a quiet distortion. A model could look impressive while never having to deal with a social network that changes over time, a circuit whose tiny structural error destroys correctness, a SAT instance where solver choice matters, or a weather graph where the planet is inconveniently spherical. ...

Grounded or Just Confident? What the AI Consumer Index Reveals About Frontier Models

Shopping is where AI confidence goes to embarrass itself. Ask a frontier model for a gift, a replacement part, a budget-friendly product, or a game recommendation, and the answer often looks excellent. It is neatly formatted. It gives reasons. It may even include links and prices, because apparently nothing says “trust me” like a fabricated discount on a product page that no longer exists. ...

Think Fast, Think Slow: How Omni-AutoThink Rewrites Multimodal Reasoning

A customer sends a voice note, a screenshot, and a short complaint: “Why did your app charge me twice?” A weak AI assistant answers too fast and misses the evidence. A reasoning-heavy assistant thinks through everything, slowly, expensively, and occasionally performs a small philosophical opera over a billing issue. Neither is attractive. One is careless; the other is costly. The practical problem is not whether the model can reason. It is whether the model knows when reasoning is worth the bill. ...

Roots of Understanding: When Transformers Try to Learn the Language of Numbers

Numbers look simple until you ask a model to continue them. That is the quiet trap in Testing Transformer Learnability on the Arithmetic Sequence of Rooted Trees. The paper does not ask whether a transformer can chat about prime numbers, recite factorization facts, or hallucinate Euclid with confidence. It asks a cleaner question: if we translate the natural numbers into a symbolic language whose grammar is generated by prime factorization, can a GPT-2-style transformer learn that grammar from sequence data alone? ...

Benchmarks Without Borders: Inside the Moduli Space of AI Psychometrics

Procurement Has a Benchmark Problem

Procurement teams love benchmark tables. They are clean, sortable, and emotionally comforting. Vendor A beats Vendor B by 3.7 points on a reasoning suite; Vendor C wins on code generation; Vendor D claims better tool use under “realistic agent workflows,” a phrase that usually means someone added a browser, a calculator, and optimism. ...

Pop-Ups, Pitfalls, and Planning: Why GUI Agents Break in the Real World

Pop-up. That tiny word hides a surprisingly large operational problem. A human sees a battery warning, an update prompt, a permission dialog, or a frozen app and does something boringly competent: dismiss it, recover context, re-check the screen, and continue. A GUI agent, meanwhile, may confidently continue a plan that no longer matches reality. The machine has not “failed” in the theatrical sense. It has simply treated a live workflow like a polite screenshot sequence. Very enterprise. Very doomed. ...