LLM Evaluation

When the AI Becomes the Agronomist: Can Chatbots Really Replace the Literature Review?

A farmer does not need a literature review. She needs to know what works. That simple sentence is why AI agronomy is so tempting. Somewhere inside thousands of papers are useful answers: which microbial agents suppress whitefly, whether botanicals work outside the lab, how much pest control disappears when a method leaves a greenhouse and meets weather, soil, and actual insects with their own little business plans. The evidence exists, but it is fragmented, multilingual, paywalled, and written in the soothing dialect of “further research is warranted.” ...

Replace, Don’t Expand: When RAG Learns to Throw Things Away

The inbox problem hiding inside RAG. Inbox. That is the easiest way to understand what goes wrong in many retrieval-augmented generation systems. A query arrives. The system retrieves a few documents. The answer is not obvious. So the system retrieves more. Then more. Then perhaps a web search result. Then a rewritten query. Then another bundle of passages. ...

Bench to the Future: Why E-commerce Is the Real Final Boss for Foundation Agents

Shopping looks easy until someone has to calculate the customs duty. That is roughly the lesson of EcomBench, a new benchmark designed to evaluate foundation agents on realistic e-commerce tasks. The paper’s most useful finding is not that one model ranks above another. Leaderboards are entertaining, in the same way airport departure boards are entertaining when your flight is already delayed. The useful finding is the shape of failure. ...

It Takes a Village (of Models): Why Multi-Agent Intelligence Won't Emerge by Accident

Agents are easy to multiply. That is the attractive part. Give one model a browser. Give another a code editor. Add a planner, a critic, a memory layer, a few tools, a dashboard, and suddenly the product demo looks like a small digital office. Everyone has a job title. Everyone talks. Nobody asks whether the “team” actually knows how to be a team. ...

Error Bars for the Algorithmic Mind: What ReasonBench Reveals About LLM Instability

A demo is not a deployment. In a demo, the model answers once. The answer looks correct. The cost looks tolerable. The team nods, the slide deck gains a green checkmark, and someone says the usual fatal sentence: “This seems reliable enough.” Then production happens. The same prompt goes through the same provider endpoint. The same workflow runs again. Sometimes the answer changes. Sometimes the reasoning trace wanders. Sometimes the bill is higher. Sometimes a supposedly more “thoughtful” strategy spends extra tokens to become confidently less useful. Beautiful. The machine has developed not consciousness, but variance. ...

When Research Becomes a Tree: Why Static-DRA Matters in an Agentic World

A research agent enters a company budget meeting. That sounds like the beginning of a bad consulting joke, but it is exactly where “deep research” systems are heading. The first generation of excitement was about capability: can an AI agent search, plan, decompose, synthesize, and write a report that feels less like a chatbot answer and more like an analyst memo? Fine. The next question is less glamorous and far more operational: can the company control how much research the agent performs before the invoice becomes a small weather event? ...

Agents Without Prompts: When LLMs Finally Learn to Check Their Own Homework

Instructions are usually treated as the beginning of an AI workflow. A user, developer, or system designer writes a prompt. The model produces an output. Then, if the output looks wrong, someone writes another prompt telling the model how to check it, another prompt telling it how to repair it, and eventually a small mountain of prompt glue accumulates around what was supposed to be an automated system. ...

Flame Tamed: Can LLMs Put Out the Internet’s Worst Fires?

A comment thread rarely explodes in one clean motion. It starts with a correction. Then someone reads the correction as condescension. Then another person adds a historical grievance, a screenshot, three exclamation marks, and the kind of moral certainty normally reserved for courtrooms and family dinners. By the time a moderator arrives, the thread is no longer a conversation. It is archaeology with insults. ...

Checkmating the Hype: What LLM CHESS Reveals About 'Reasoning Models'

Chess is useful because it is rude. It does not care whether a model writes elegant explanations. It does not reward confident prose. It does not politely accept a move that looks plausible but violates the rules. Either the move is legal, the position improves, and the game continues—or the model has just exposed something that a benchmark score on math or coding can easily hide. ...

Rules of Attraction: How LLMs Learn to Judge Better Than We Do

Rubrics are supposed to make judgment boring. That is their charm. A good rubric tells a teacher why one essay deserves a 5 instead of a 3, tells a compliance reviewer why one response is acceptable and another is risky, and tells an internal QA team why a generated summary is useful rather than merely confident. In business, boring judgment is valuable. It scales. It can be audited. It survives employee turnover. It does not wake up one morning and decide that “clarity” now means “vibes with a semicolon.” ...