AI Evaluation

Judge Math-Not by Its Parser

Opening — Why this matters now The AI industry has discovered a wonderfully pedestrian way to misread progress: build models that can solve harder math problems, then grade them with evaluators that panic when 2040 minutes is not written as 34 hours. That is not a joke. It is the central irritation behind “Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity”, an arXiv paper that examines how mathematical reasoning benchmarks can be distorted by rigid symbolic verification.1 ...

When the Referee Wants to Be Nice: Hidden Bias in AI Judges

Audit. That is the word companies use when they want something to sound objective, disciplined, and preferably immune to politics. A model produces an answer. Another model evaluates it. The evaluator gives a verdict. Everyone gets a dashboard. The dashboard gets shown to management. Management nods, because dashboards have a calming effect on adults in conference rooms. ...

When Maps Start Thinking: GeoAgentBench and the Audit of Spatial AI

When Maps Start Thinking: GeoAgentBench and the Audit of Spatial AI Maps look calm. That is their trick. A finished map gives the impression of order: roads align, polygons close, rivers flow, color ramps behave, labels politely stay out of the way. Behind that calm surface, a GIS workflow is usually a small bureaucratic state: coordinate systems, raster-vector conversions, topology checks, interpolation choices, file paths, layer ordering, and visualization rules all negotiating with one another. One wrong projection, one invalid geometry, one missing intermediate file, and the whole administrative state collapses. It does not collapse poetically. It throws an error. ...

The Search That Remembers: Training AI Without Answers

Search looks cheap until you try to train it. A business can usually collect plenty of questions. Employees ask support bots why a policy changed. Analysts ask internal search systems for comparable transactions. Legal teams ask where a contract clause first appears. Researchers ask agents to chase a multi-step trail across documents, web pages, and databases. ...

The Ask Gap: Why AI Agents Fail Not Because They Can’t Think — But Because They Don’t Know When to Stop

A ticket lands in the queue. It looks ordinary: update a parser, answer a business question, patch a workflow, produce a SQL query. The agent opens the files, explores the schema, writes code, runs a few checks, and submits something plausible. The output is polished. The reasoning trace is confident. The dashboard marks the task as completed. ...

CivBench: When AI Stops Guessing and Starts Planning

Scoreboards are comforting. They reduce a messy contest into one neat line: winner, loser, maybe a score. Executives like them, product teams like them, investors like them, and benchmark dashboards absolutely adore them. Strategy, unfortunately, is rude enough not to fit inside that line. A company can make the right decisions and still lose because the market turns. A trading agent can survive a bad regime by managing exposure well, then look mediocre because the final return is not spectacular. A planning system can stumble into success after making terrible intermediate choices. Outcome-only evaluation is clean, but cleanliness is not the same as truth. It is often just a good-looking loss of information. ...

From Search to Synthesis: Why AI’s Next Leap Requires Structured Thinking

Spreadsheet. That is where many impressive AI research reports quietly go to die. A model can browse twenty web pages, produce a polished executive memo, cite three market reports, and still fail at the boring part: comparing numbers, checking whether a table supports a claim, generating the right chart, and then explaining what the chart actually means. The output looks like research. The mechanism underneath is closer to literary confidence with a browser tab. ...

From Seeing to Doing: Why Agentic AI Still Trips Over Reality

Tools do not make an agent; they make the failure more interesting Camera. Browser. Crop tool. Search engine. Python sandbox. That sounds like the beginning of an intelligent workflow. Give a multimodal model these tools, and it should move from merely seeing the world to actually doing something with it: zoom into the blurry sign, search the extracted clue, cross-check the result, and produce the answer. ...

Walking the Graph: When LLMs Stop Guessing and Start Navigating

Enterprise data has a familiar bad habit: it looks organized until someone asks a question that requires moving across it. A supplier is connected to a factory, the factory is connected to a product line, the product line is connected to a delayed shipment, and the shipment is tied to a contract clause that nobody wants to read at 11:40 p.m. The graph exists. The relationships exist. The answer is somewhere inside the structure. Then an LLM pipeline retrieves a subgraph, pastes it into a prompt, and asks the model to “reason carefully.” ...

Temperament Over Talent: Why AI Behavior Is the New Competitive Edge

Procurement loves a leaderboard. That is understandable. A leaderboard is clean, sortable, and emotionally comforting. One model scores higher on reasoning. Another is cheaper per token. A third has a larger context window and a launch page written in the usual dialect of technological destiny. Decision made, presumably. Then the model enters a real workflow. ...