LLM Evaluation

When Puzzles Become Process: Benchmarking the Agentic Mind

More thinking is not the same as better work

A manager asks an AI agent to reconcile invoices, check a procurement exception, or review a regulatory document. The agent pauses, consumes a heroic number of tokens, and returns a polished answer. Very impressive. Very modern. Also, perhaps, completely wrong. The industry has become comfortable with a simple story: give models more reasoning budget and they will reason better. That story is not false. It is merely incomplete, which is where most expensive mistakes prefer to live. ...

Mind the Gap: Why Agency Isn’t Intelligence (Yet)

A trading bot keeps executing while the market regime changes. A warehouse robot keeps optimizing its route while a sensor slowly drifts. A customer-service agent keeps sounding fluent while the conversation loses coherence one turn at a time. From the outside, the system still looks agentic. It acts. It responds. It may even keep producing acceptable short-term outcomes. The dashboard, naturally, waits until the mess is obvious. Dashboards are polite like that. ...

Divide & Verify: When Decomposition Finally Learns to Behave

A report is only as trustworthy as the sentence nobody checked. That sounds melodramatic until an LLM-generated due diligence note, policy memo, customer support answer, or compliance summary contains three correct facts and one quiet falsehood in the same paragraph. The usual fix is simple in theory: split the answer into smaller claims, retrieve evidence for each claim, let a verifier judge them, and aggregate the results. ...

Stated to be Human, Revealed to be Algorithmic: The Trust Paradox Inside LLMs

Trust is a convenient word. Too convenient, really. In business meetings, people say they “trust the analyst,” “trust the model,” “trust the expert,” or “trust the dashboard,” as if trust were a stable property sitting neatly inside the decision-maker. Then the actual decision arrives, with a deadline, a performance table, a projected loss, and someone quietly asks the AI assistant which source to follow. ...

All the World’s a Stage: When AI Agents Perform Instead of Collaborate

A meeting can look busy while producing almost nothing. Anyone who has sat through a status call with twelve people, three dashboards, and no decision knows the pattern. Everyone speaks. Nobody integrates. The transcript grows. The work does not. That is the useful way to read “Interaction Theater: A Case of LLM Agents Interacting at Scale,” a paper studying Moltbook, an AI-agent-only social platform with 800,730 posts, 3,530,443 comments, and 78,280 agent profiles collected over three weeks. The paper is not merely saying that some agents spammed a social network. That would be mildly amusing, and then forgettable. The sharper point is that large-scale agent interaction can produce the appearance of collaboration before it produces the substance of collaboration. ...

The Model That Knows It Knows: When Introspection Hides in the Logits

Audit. That is the word enterprises prefer when they want something to sound measurable, serious, and safely boring. You audit model outputs. You audit prompts. You audit logs. You audit whether the assistant said the forbidden thing, leaked the private thing, or hallucinated the regulatory thing. The problem is that models are not only output machines. They are also representation machines. Between the input and the final answer, they build intermediate signals, suppress some of them, amplify others, and then hand management a neat little sentence pretending the whole internal mess never happened. ...

Lost in the Links: When World Knowledge Isn’t Enough

Links look harmless. One click from one Wikipedia page to another. Then another. Then another. No robotics. No messy browser UI. No customer database. No procurement workflow with three inconsistent Excel files and one person named Mike who “usually knows where that form is.” Just hyperlinks. That is why LLM-WikiRace is useful. It strips agentic AI down to a small, irritating question: when a model knows a lot about the world, can it use that knowledge step by step without getting lost? ...

Lost in Translation: When Safety Contracts Collapse Across 2.1 Billion Voices

A chatbot walks into a multilingual market

Imagine a bank, hospital, telecom platform, or public-service chatbot being rolled out across South Asia. The model has passed English safety tests. It refuses harmful requests in structured evaluation. Its vendor dashboard looks reassuring. The compliance team exhales. Then users arrive. They do not all write in English. They do not all use one script. They mix Hindi and English, write Urdu in Latin letters, switch between native script and romanization, and ask ordinary questions wrapped in messy instructions. In other words, they behave like real users, which is always inconvenient for benchmark design. ...

Cut the Loops: When Web Agents Learn to Think in DAGs

Research agents have a bad habit that will feel familiar to anyone who has watched a junior analyst “verify one more source” for three hours. They search. They visit. They re-search. They validate the thing they already validated. Then, because the context window is now full of debris, they occasionally forget the actual question. A triumph of diligence, perhaps. A triumph of intelligence, less obviously. ...

Potential Energy: What Chain-of-Thought Is Really Doing Inside Your LLM

The familiar ritual: ask it to think longer

When an LLM gives a weak answer, the standard reflex is now almost ceremonial: ask it to think step by step. The model writes more. The answer often improves. The benchmark number rises. Everyone feels temporarily reassured. This habit has become so normal that many teams treat chain-of-thought as if it were a small reasoning engine bolted onto the model: more intermediate steps, more deliberate thought, more correctness. A comforting story. Also, like many comforting stories in AI, not quite what the evidence says. ...