LLM Evaluation

Branching Out of the Middle: How a ‘Tree of Agents’ Fixes Long-Context Blind Spots

Contracts are not polite. They hide the important clause on page 83, define the crucial exception on page 17, and bury the fatal cross-reference in an appendix nobody wanted to read. Annual reports behave similarly. So do medical SOPs, litigation files, policy manuals, technical logs, and most documents produced by institutions that have discovered both Microsoft Word and committees. ...

Fault Lines & Safety Nets: How RAFFLES Finds the First Domino in Agent Failures

A failed agent run rarely fails politely. It does not raise its hand at step 4 and say, “Here is the causal error; please patch the planner.” It drifts. A web agent grabs the wrong source. A coding agent trusts a bad assumption. A verifier rubber-stamps a plausible-looking answer. Twenty steps later the final output is wrong, the dashboard says “failed,” and the team is left doing digital archaeology with a very expensive shovel. ...

Model Portfolio: When LLMs Sit the CFA

Exams are useful because they are rude. They do not care that a model sounds polished, cites the right buzzwords, or can produce a gorgeous paragraph about duration risk. They ask for A, B, or C. Then they mark the answer wrong. That is why a new CFA-based benchmark is more useful than another misty-eyed essay about AI “transforming finance.” The paper evaluates GPT-4o, GPT-o1, and o3-mini on 1,560 official CFA mock multiple-choice questions across Levels I, II, and III, both zero-shot and with a domain-reasoning RAG pipeline built from official CFA curriculum materials.[1] The result is not a single leaderboard. It is closer to a routing manual. ...

Agreeable to a Fault: Why LLM ‘People’ Can’t Hold Their Ground

A focus group is expensive. A virtual focus group is cheap, infinitely patient, and available at 2 a.m. It also never asks for coffee, parking reimbursement, or clarification about the incentive payment. Naturally, this makes synthetic users attractive to anyone trying to test products, policies, campaigns, or customer journeys before real humans get involved. ...

Fusion Cuisine for RAG: Z‑Scores, Rankers, and the Two‑Source Diet

A RAG system usually fails in one of two annoyingly familiar ways. It retrieves documents that are factually relevant but gives the model no clue about the task’s decision boundary. Or it retrieves labelled examples that show the decision pattern but are too parochial to help when the topic drifts. One source knows the world. The other knows the exam rubric. Naturally, many systems pick one and then pretend the compromise was strategy. ...

Razor Burn: Why LLMs Nick Themselves on Induction and Abduction

Diagnosis is where AI systems start to look clever, then suddenly start charging consultancy rates. Give a model a handful of symptoms, incident logs, customer complaints, or audit traces, and ask it what explains them. It will usually produce something plausible. Sometimes several plausible things. Occasionally an entire decorative shrubbery of plausible things. The practical question is not whether the model can invent an explanation. That bar is underground. The harder question is whether it can find the simplest explanation that accounts for the evidence without adding unnecessary machinery. ...

Numbers Need Narration: Making LLMs Do Reasoning‑Intensive Regression

TL;DR for operators: Many AI workflows do not need a yes-or-no judgment. They need a number: how well did this answer follow the instruction, how far did this reasoning trace remain valid, how much better is answer A than answer B, how strong is this essay, how risky is this case, how close is this support call to escalation? ...

Prolog & Paycheck: When Tax AI Shows Its Work

TL;DR for operators: Tax AI should not be judged by whether the model can produce a confident answer in fluent prose. That is how one builds a very polite liability machine. The useful pattern in this paper is architectural: let the language model translate statutory text and taxpayer facts into executable Prolog; let a symbolic solver compute the result; reject outputs that fail execution or disagree across independent attempts; then evaluate the system using an error-cost ledger, not just accuracy.[1] The paper’s strongest practical message is therefore not “LLMs can do tax”. It is: high-stakes rule automation becomes more credible when the model is demoted from final authority to structured translator. ...
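
The "reject outputs that fail execution or disagree across independent attempts" step is easy to picture as a small gate in front of the final answer. What follows is a minimal Python sketch of that idea, not the paper's implementation: the name `gated_answer`, the `min_agree` threshold, and the rounding to cents are all assumptions made for illustration, and each "attempt" is modelled as a plain callable standing in for "translate to Prolog, then run the solver".

```python
from typing import Callable, Optional, Sequence


def gated_answer(
    attempts: Sequence[Callable[[], float]],
    min_agree: int = 2,  # hypothetical threshold, not taken from the paper
) -> Optional[float]:
    """Return a consensus result, or None to abstain.

    Each attempt stands in for one independent translate-and-execute run.
    An attempt that raises is treated as a failed execution and discarded.
    """
    results = []
    for run in attempts:
        try:
            # Round to cents so trivially different floats still count as equal.
            results.append(round(run(), 2))
        except Exception:
            # Failed execution: this attempt contributes nothing.
            continue

    # Abstain if too few attempts executed, or if the survivors disagree.
    if len(results) < min_agree or len(set(results)) != 1:
        return None
    return results[0]
```

The design choice worth noticing is that the gate returns `None` instead of picking a winner: an abstention can be routed to a human, whereas a confident wrong number goes straight onto the error-cost side of the ledger.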

Talk, Tool, Triumph: Training Agents with Real Conversations

TL;DR for operators: The paper behind this article is useful because it changes the unit of training. Instead of training an agent to emit the right function call after a tidy prompt, MUA-RL trains the agent inside a live-feeling loop: user message, agent response, tool call, database result, another user message, another decision, and so on.[1] That is much closer to customer support, travel booking, retail order management, telecom troubleshooting, and internal workflow automation. In other words: the model is not just learning which button to press. It is learning when to ask, when to verify, when to act, and when not to confidently vandalise the database. Progress. ...

Stop at 30k: How Hermes 4 Turns Long Chains of Thought into Shorter Time‑to‑Value

TL;DR for operators: Reasoning models are not expensive because they are philosophical. They are expensive because they can keep thinking long after the business value has stopped arriving. The Hermes 4 Technical Report is easiest to misread as another open-weight leaderboard announcement. That is the least useful reading. The more useful reading is that Hermes 4 is a build manual for making open reasoning models behave like deployable systems: generate diverse synthetic data, verify what can be verified, preserve general instruction-following, control runaway reasoning length, and evaluate with enough logging to know whether the model failed or the benchmark harness sneezed.[1] ...