AI Evaluation

Follow the Heads, Not the Hype: How LLMs Route Deductive Reasoning

A compliance bot does not fail only when it gives the wrong final answer. It can fail earlier, in a quieter and more expensive place: it selects the wrong premise, stops collecting evidence too soon, matches the wrong rule, and then writes a perfectly fluent explanation of a decision that was already broken three steps ago. Very elegant. Very useless. ...

If Logic, Then Trouble: Why LLMs Still Miss Human Conditionals

Contract. A supplier writes, “If payment is received by Friday, the discount applies.” Most business readers do not treat this as a detached logic puzzle. They hear a practical rule: pay by Friday, get the discount; miss Friday, probably no discount. The phrase carries intent, relevance, and a small but important threat wrapped in polite operational language. ...

Reasonable Doubt: Why LLM Reasoning Needs Process Control

Why this matters now: The business case for LLMs has quietly moved from chatbot answers to agentic work: legal review, compliance checking, market research, document synthesis, internal analytics, coding support, and decision preparation. That shift changes the risk profile. A wrong chatbot answer is annoying. A wrong agent that looks coherent, cites documents, calls tools, updates files, and confidently stops too early is a workflow liability wearing a productivity costume. ...

Do the Math, Not the Mime: Why LLM Reasoning Needs a Verification Pipeline

Spreadsheet errors have a special talent: they look boring until they become expensive. That is the business version of the LLM math problem. A model can produce a calm, step-by-step explanation, put a confident number at the bottom, and still be wrong in the only place that matters. Worse, the reasoning may look plausible enough that a manager, analyst, tutor, or compliance reviewer nods and moves on. The answer has the rhythm of thinking. It has the costume of calculation. It may even have a chain-of-thought trace. Very civilized. Still not proof. ...

If Logic Were Enough: Why LLMs Still Miss the Point of Conditionals

A promise is rarely just a logical operator. “If you mow the lawn, I’ll give you 50 dollars” does not sound like a philosophical exercise in truth tables. It sounds like a deal. Most people hear it as: no mowing, no money. By contrast, “If you’re hungry, there’s pizza in the oven” does not mean the pizza appears only under the metaphysical condition of your hunger. It means the pizza is there, and your hunger merely explains why I am telling you. ...

The Confidence Trick: When Long AI Reasoning Arrives Too Early

A model gives you a long answer. It lists assumptions. It walks through steps. It sounds patient, organized, and slightly overqualified for the task. In a business setting, that style is comforting. A compliance analyst sees a neat explanation. A finance team sees a transparent calculation. A product manager sees “reasoning.” Everyone relaxes a little. ...

Red Queen Receipts: AI Security Testing Needs Logs, Not Vibes

Security testing is not a screenshot. A model gives a dangerous answer. Someone posts the transcript. A vendor says the model has been updated. A consultant turns the incident into a slide titled “AI risk is real.” Everyone nods gravely. Very mature. Very enterprise. The harder question is less theatrical: can the same vulnerability be tested again, under controlled conditions, with visible logs, a consistent evaluator, repeatable statistics, and enough human inspection to make the result defensible? ...

Prompt and Circumstance: Why One Accuracy Number Is Not a Reliability Audit

Opening: Why this matters now. The AI market has learned to worship benchmark tables with the solemnity once reserved for quarterly earnings. One model is up two points on MMLU, another is slightly better at reasoning, a third is cheaper, smaller, faster, and therefore apparently ready to run your compliance workflow by Tuesday. ...

Look Who’s Reasoning Now: UpstreamQA and the Fine Print of Video AI

Opening: Why this matters now. Video is becoming one of the most tempting inputs for business AI. Warehouses have cameras. Clinics have consultation rooms. Retailers have shelves, queues, and checkout counters. Property managers have inspection footage. Factories have safety recordings. Everyone wants to ask the same beautifully dangerous question: Can the model just watch the video and tell us what happened? ...

Zero Degrees, Still Feverish: Why Deterministic AI Needs a Thermometer

Opening: Why this matters now. The comforting myth of enterprise AI is that setting an LLM’s temperature to zero makes it deterministic. A nice little checkbox. A procedural sedative. Press it, and the machine behaves. The paper “Introducing Background Temperature to Characterise Hidden Randomness in Large Language Models” is useful because it attacks that myth directly. Its central claim is not that LLMs are chaotic by nature. That would be dramatic, and therefore probably a conference keynote. The claim is sharper: even when a model is asked to decode at $T = 0$, the surrounding inference environment can introduce enough tiny numerical variation to produce divergent outputs. ...
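To make "tiny numerical variation" concrete, here is a toy sketch in Python. It is not the paper's method. It assumes that one source of variation is the order in which floating-point values are summed, which is a general property of floating-point arithmetic rather than anything specific to a given model or serving stack. The `greedy_pick` helper and the logit values are hypothetical, chosen only to create a near-tie.

```python
def greedy_pick(logits):
    """Greedy (T = 0) decoding: return the index of the largest logit.

    Ties go to the lowest index, as with Python's built-in max().
    """
    return max(range(len(logits)), key=lambda i: logits[i])


# Hypothetical contributions to a single token's logit.
parts = [0.1, 0.2, 0.3]

# The same three numbers, accumulated in two different orders.
# Floating-point addition is not associative, so the results can differ.
left_to_right = (parts[0] + parts[1]) + parts[2]
right_to_left = parts[0] + (parts[1] + parts[2])

# A hypothetical competing token whose logit sits right at the tie point.
competitor = 0.6

# Two "runs" that differ only in summation order.
pick_a = greedy_pick([competitor, left_to_right])
pick_b = greedy_pick([competitor, right_to_left])

print(left_to_right == right_to_left)  # the two sums are not bit-identical
print(pick_a, pick_b)                  # the greedy choice differs between runs
```

The gap between the two sums is on the order of one unit in the last place, yet greedy decoding has no tolerance for it: once a different token is picked, every later token is conditioned on a different prefix. Real logits are rarely this close, so the sketch shows that divergence is possible, not how often it happens.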