AI Evaluation

Themis Knows Best: When AI Judges Start Training Other AI

Click. The button moved. The page refreshed. A popup appeared, then disappeared. The agent says the task is done. The screenshot looks plausible. The log is long enough to impress a project manager and confusing enough to defeat a reviewer with a normal human attention span. Now comes the awkward question: should the agent be rewarded? ...

The Art of Interrupting AI: When Knowing Isn’t Talking

The meeting-room test AI still fails: Meeting rooms are unforgiving places for intelligence. A person can know the topic, understand the slides, recognize every face around the table, and still be a terrible participant. Speak too early, and they interrupt. Speak too late, and the moment has passed. Say something factually relevant but socially tone-deaf, and the room quietly deducts points. No spreadsheet records this. Everyone notices anyway. ...

Crystal Clear? Why AI Needs to Show Its Work

Answers are cheap. In a business setting, this is slightly annoying. A model reads a chart, extracts a number, answers a compliance question, classifies a product defect, or explains a visual inspection result. The answer lands in the dashboard. It looks clean. It may even be correct. Then someone asks the only question that matters: how did it get there? ...

Thinking Out Loud — Why LLMs Might Need Chain‑of‑Thought

Audit trails are boring until something goes wrong. In ordinary business operations, this is not controversial. If a payment approval, legal review, procurement decision, or trading order leaves intermediate records, people can reconstruct what happened. If the whole decision is buried inside a black-box system that simply outputs “approved,” “rejected,” or “buy now,” the audit team has a less glamorous job: guessing which invisible machinery produced the visible answer. Charming, in the way dental surgery is charming. ...

Too Many Doctors in the Room? Benchmarking the Rise of Medical AI Agent Teams

Doctors know the problem. A difficult case enters the room. One specialist sees a radiology pattern. Another notices a metabolic clue. A third worries about a rare diagnosis. Everyone has a useful fragment. Then the meeting gets longer, the notes get messier, and somehow the final answer becomes less clear than the first opinion. ...

Cut to the Chase: When AI Learns to Summarize Videos by Thinking in Events

Video is where organizational knowledge goes to become expensive furniture. Meetings are recorded. Lectures are archived. Product demos are uploaded. Customer calls, training sessions, interviews, sports broadcasts, livestreams, and conference talks accumulate in cloud storage with admirable discipline and very little afterlife. Everyone agrees the videos are valuable. Almost nobody has time to watch them. ...

Don’t Just Answer — Ask: Why Interactive Benchmarks May Redefine AI Intelligence

Meeting. That is where many AI demos go to die. A model receives a tidy prompt, produces a tidy answer, and everyone nods. Then the real work begins: the client clarifies a requirement, the dataset has a missing column, the UI screenshot does not match the written description, the user contradicts themselves, and the model has to decide whether to ask, revise, infer, test, or gracefully admit that it is flying blind. ...

Judging the Judges: How Bias-Bounded Evaluation Could Make LLM Referees Trustworthy

Scores look clean on dashboards. That is part of the problem. A model gets 4.7 out of 5. A customer-support agent receives a “pass.” A generated legal summary is marked “acceptable.” A coding assistant is judged “safe to deploy.” The number is tidy, the workflow continues, and everyone pretends the judge was a neutral instrument rather than another model with its own sensitivities, habits, and small theatrical preferences. ...

When the Model Knows but Doesn't Remember: The Hidden Blind Spot in LLM Contamination Detection

Audit. That is the word companies like to use when they want uncertainty to sound disciplined. Model audit. Benchmark audit. Contamination audit. The phrase suggests a clean checklist: run the detector, read the score, decide whether the benchmark is safe. The paper behind today’s article makes that picture less comfortable. It studies Contamination Detection via output Distribution, or CDD, on small language models and finds a simple but awkward failure mode: a model can be trained on contaminated benchmark examples, learn from them, and still avoid the kind of verbatim memorization that CDD is designed to catch.[1] ...

Cheap Signals, Expensive Insights: Rethinking AI Evaluation with Tensor Factorization

Budget is where evaluation systems usually lose their innocence. A team wants to compare several models across hundreds or thousands of prompts. The obvious answer is human evaluation. The less obvious invoice arrives later: annotator time, reviewer fatigue, prompt coverage gaps, inconsistent judgments, and the slow realization that “we evaluated the model” often means “we averaged away the only differences that mattered.” ...