AI Evaluation

Benchmarked Brilliance: How CreBench Rewrites the Rules of Machine Creativity

Design review is where creativity usually goes to become awkward. One person likes the concept because it feels original. Another dislikes it because it looks impractical. A third praises the visual polish while quietly ignoring whether the idea solves the actual problem. Then someone asks whether the AI can “evaluate creativity”, and everyone pretends the word creativity has a stable meaning. Excellent. Very efficient. ...

Back to the Drawing Board: How DiagramIR Quietly Fixes Math Diagrams for AI

A diagram is not a paragraph with lines attached. That sounds obvious, which is usually where software product teams get into trouble. Text can be judged by fluency, relevance, and whether the answer has wandered into confident nonsense. A geometry diagram has extra obligations. The side marked 8 should look longer than the side marked 3. The angle labelled $90^\circ$ should not be having an identity crisis. Labels should sit near the thing they label. The image should not be half outside the frame, unless the product strategy is “modern art, but for sixth grade”. ...

The Problem with Problems: Why LLMs Still Don’t Know What’s Interesting

A tutoring system has one deceptively simple job: give the learner the next problem. Not the hardest problem. Not the flashiest problem. Not the one that makes the model feel terribly pleased with itself after a 4,000-token monologue. The next problem: the one that keeps a student engaged, teaches the right structure, and feels worth the effort. ...

Agents with Interest: How Fintech Taught RAG to Read the Fine Print

Ask a product manager in a financial technology company a simple question — “How does this feature behave under that framework?” — and the answer may live in five places, three teams, two stale wikis, and one acronym that means different things depending on who had coffee with whom. This is the everyday enemy of enterprise AI. Not lack of models. Not lack of dashboards. Not even lack of documents. The problem is that internal knowledge rarely behaves like a neat public benchmark. It is fragmented, duplicated, partially obsolete, acronym-heavy, and governed by access rules that make the usual “just send it to a cloud assistant” suggestion both naïve and professionally adventurous. ...

Who Really Runs the Workflow? Ranking Agent Influence in Multi-Agent AI Systems

A workflow chart is comforting. It gives everyone boxes, arrows, and the illusion that power follows geometry. In a multi-agent AI system, that illusion fails rather quickly. The agent in the middle of the diagram may not be the one shaping the final answer. The orchestrator may look important because everything passes through it, but another specialist agent may quietly determine the substance. A router may touch only one decision and still decide the entire path. A late-stage formatter may appear humble and yet rewrite the output enough to matter. The org chart lied. Naturally, the workflow diagram learned from management. ...

Evolving Minds: How LLMs Teach Themselves Through Adversarial Cooperation

Training data is the quiet tax on modern AI. Someone has to write the examples, verify the answers, clean the failures, and pretend the spreadsheet is a strategy. Reinforcement learning makes that tax even more visible: if a model is supposed to improve through feedback, then the organisation must either provide ground-truth answers, hire evaluators, or build verifiers that can tell success from nonsense. ...

The Benchmark Awakens: AstaBench and the New Standard for Agentic Science

Procurement meetings have a habit of turning AI agents into theatre. A vendor shows a polished research assistant. It finds papers, writes a summary, cites sources, maybe generates a small experiment plan. Everyone nods. Someone says “agentic workflow.” Someone else says “autonomous discovery.” A budget appears. The machine is declared practically scientific, which is convenient, because the machine itself has not yet been asked to survive the boring parts of science: retrieval under controlled conditions, code execution, data analysis, experimental reproduction, hypothesis testing, and the small matter of completing all required steps without wandering into the digital bushes. ...

Beyond Answers: Measuring How Deep Research Agents Really Think

A research report is not an answer with extra paragraphs. That sounds obvious until an enterprise team tries to evaluate a deep research agent by asking whether its final conclusion looks plausible, whether it included citations, and whether the prose sounded confident enough to survive a board deck. Congratulations: the machine has produced something that resembles diligence. Whether it actually performed diligence is the inconvenient question. ...

Benchmarks That Fight Back: Adaptive Testing for LMs

A benchmark is supposed to be a measuring instrument. In practice, many AI benchmarks behave more like a tired clipboard. Every model gets the same questions. Every question receives the same accounting treatment. The final score is usually a mean accuracy number, neat enough for a leaderboard and blunt enough to hide the messy truth underneath. Some items are too easy to tell strong models apart. Some are too hard to tell weak models apart. Some are mislabeled. Some have stopped mattering because everyone competent now solves them. Yet the ritual continues: run the suite, average the answers, update the chart, pretend the thermometer is not melting. ...

Confidence, Not Confidence Tricks: Statistical Guardrails for Generative AI

A product team launches an AI assistant. The demo works. The benchmark looks respectable. The model even says “I’m confident” with the serene authority of a consultant who has never owned a pager. Then the real users arrive. Some ask ambiguous questions. Some ask adversarial questions. Some ask perfectly normal questions that happen to sit outside the model’s competence. The assistant still answers. Sometimes it refuses too often. Sometimes it refuses too late. Sometimes its confidence score is less a forecast and more a decorative sticker. ...