LLM Evaluation

Reasoning Under Pressure: When Smart Models Second-Guess Themselves

A customer challenges the answer. Not with new evidence. Not with a better calculation. Just with one of those tiny conversational needles: “Are you sure?” Or worse: “Most people disagree with this.” Or the classic office-friendly version: “As an expert, I’m confident you are wrong.” A human analyst might pause, check the source, and decide whether the objection contains actual information. A large reasoning model may also pause. It may even produce several polished paragraphs of careful reconsideration. Then, occasionally, it abandons the correct answer. ...


Proof Over Probabilities: Why AI Oversight Needs a Judge That Can Do Math

Agents now do things. That sounds obvious, but it is the entire problem. A chatbot can be wrong and mostly embarrass itself. An agent can book the wrong hotel, leak the wrong file, fabricate the wrong report, or move through a workflow with the quiet confidence of a junior employee who has just discovered automation and has not yet discovered liability. ...

Thinking About Thinking: When LLMs Start Writing Their Own Report Cards

Report cards are usually written by teachers, managers, examiners, auditors, or other people with the institutional privilege of saying, “Nice effort, but no.” The paper Reinforcing Chain-of-Thought Reasoning with Self-Evolving Rubrics asks a stranger question: what if the model helps write the report card for its own reasoning process? That sounds like the kind of governance idea that would make a compliance officer reach for coffee. A model evaluating itself is not automatically trustworthy. Sometimes it is self-reflection. Sometimes it is theatre with JSON brackets. ...

When Agents Hesitate: Smarter Test-Time Scaling for Web AI

Forms are boring. That is exactly why they are dangerous for AI agents. A human filling out an enterprise dashboard does not treat every click as a philosophical crisis. Search here. Scroll there. Submit. Done. A web agent, unfortunately, has no such common-sense guarantee. It can overthink a routine step, miss a pivotal one, or spend a small fortune sampling twenty versions of the same obvious action. Very diligent. Also very expensive. ...

AIRS-Bench: When AI Starts Doing the Science, Not Just Talking About It

A benchmark is supposed to be a ruler. In AI, it often becomes a trophy shelf. A model gets a higher score, a chart moves up and to the right, and everyone politely pretends the hard part has been settled. That ritual works when the task is narrow: classify an image, answer a question, pass a coding test, retrieve a document. But it becomes much less comforting when the system being evaluated is no longer just answering. It is planning experiments, writing code, debugging failures, training models, interpreting results, and deciding what to try next. ...

When Agents Believe Their Own Hype: The Hidden Cost of Agentic Overconfidence

Code review has a comforting ritual. A developer submits a patch. A reviewer inspects it. The reviewer says it looks good. Everyone feels slightly better, because at least someone checked. In AI-agent workflows, this ritual becomes even more tempting: let one agent write the patch, let another agent review it, then ask the reviewer how confident it is. ...

When Your Agent Starts Copying Itself: Breaking Conversational Inertia

A support agent keeps asking the same diagnostic question after the customer has already answered it. A research agent revisits the same failed source path with slightly different wording. A workflow agent tries the same invalid action again because, apparently, the best evidence for what to do next is what it just did badly. ...

DRIFT-BENCH: When Agents Stop Asking and Start Breaking

A user says, “Update the record with a sensible value.” That sentence is small. The damage may not be. For a normal chatbot, the worst outcome might be a vague answer wearing a confident expression. Annoying, yes, but usually recoverable. For an agent connected to a database, file system, workflow platform, or API service, the same ambiguity becomes operational. The model may update the wrong row, call the wrong endpoint, overwrite a file, or politely explain its mistake after making it. Charming, in the same way a self-driving forklift is charming. ...

Grading the Doctor: How Health-SCORE Scales Judgment in Medical AI

Checklist is a boring word. That is why it is useful. In healthcare AI, the glamorous question is whether a model can “reason like a doctor.” The operational question is uglier: did it invent a lab value, miss an emergency referral, overstate certainty, ignore the requested format, recommend unsafe antibiotics, or fail to ask for missing context? ...

When Empathy Needs a Map: Benchmarking Tool‑Augmented Emotional Support

Empathy is easy to fake for one sentence. A chatbot can say “that sounds exhausting” without knowing anything about you, your situation, your city, your time zone, or whether the advice it is about to give is physically possible. That is the awkward part of emotional support AI: the tone can be soft while the facts are made of air. A very caring assistant can still recommend a midnight walk at 3 p.m., suggest a closed café, or confidently invent local details because it wants to be helpful. The kindness is real enough in style. The grounding is not. ...