LLM Evaluation

FAITH in Numbers: Stress-Testing LLMs Against Financial Hallucinations

TL;DR for operators: FAITH is useful because it changes the hallucination question from “Does the model sound right?” to “Can the model reconstruct a known financial number from the exact tables and surrounding text that justify it?”[1] That sounds modest. It is not. In finance, modest is usually where the damage hides. ...

The Diligent but Brittle Student Inside Every LLM

TL;DR for operators: LearnerAgent puts LLM-based “students” through a simulated year of high-school English learning: weekly lessons, exercises, monthly exams, memory retrieval, self-reflection, confidence updates, and peer debate.[1] The point is not to cosplay a classroom because AI research apparently needed more homework. The point is to observe learning as a process, not merely as a final benchmark score. ...

When AI Plays Lawmaker: Lessons from NomicLaw’s Multi-Agent Debates

TL;DR for operators: NomicLaw is best read as an audit harness, not as a prototype parliament for machines. The paper puts ten open-source LLMs into a simplified lawmaking game: propose a rule, justify it, vote on one proposal, accumulate points, repeat. That mechanism turns vague questions about “AI deliberation” into measurable traces: self-voting, reciprocity, coalition switching, vote volatility, first-mover effects, winner mentions, and shifts in legal-rhetorical framing.[1] ...
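To make “measurable traces” concrete, here is a minimal sketch of two of them, self-voting and vote volatility, computed from a vote log. Everything specific in it is an assumption for illustration: the log format (one dict per round, mapping each voter to the agent whose proposal it voted for), the function names, and the exact metric definitions. The paper may define these quantities differently.

```python
from typing import Dict, List

# Hypothetical log format: one dict per round, voter -> agent voted for.
Round = Dict[str, str]


def self_vote_rate(rounds: List[Round]) -> float:
    """Fraction of all votes cast for the voter's own proposal."""
    votes = [(voter, target) for r in rounds for voter, target in r.items()]
    if not votes:
        return 0.0
    return sum(voter == target for voter, target in votes) / len(votes)


def vote_volatility(rounds: List[Round]) -> float:
    """Fraction of round-to-round comparisons in which an agent
    voted for a different target than in the previous round.
    This definition is an assumption made for this sketch."""
    changes = 0
    comparisons = 0
    for prev, curr in zip(rounds, rounds[1:]):
        for voter, target in curr.items():
            if voter in prev:
                comparisons += 1
                changes += target != prev[voter]
    return changes / comparisons if comparisons else 0.0


if __name__ == "__main__":
    # Made-up example log with three agents and two rounds.
    log = [
        {"a": "a", "b": "a", "c": "b"},
        {"a": "b", "b": "a", "c": "b"},
    ]
    print(f"self-vote rate:  {self_vote_rate(log):.3f}")
    print(f"vote volatility: {vote_volatility(log):.3f}")
```

The design point is that both numbers come straight from the trace, with no judgement about whether the deliberation “sounded” reasonable, which is what makes the game usable as an audit harness.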

Many Minds Make Light Work: Boosting LLM Physics Reasoning via Agentic Verification

TL;DR for operators: A familiar enterprise AI failure looks like this: the model gives a confident answer, the formatting is exquisite, the explanation sounds like a gifted teaching assistant, and one equation quietly takes the project into a ditch. Physics is an unusually good place to study that failure because being clear is not enough. The system must interpret the situation, select the right principle, keep the units straight, calculate correctly, and not hallucinate a helpful-but-illegal assumption because the prompt looked lonely. ...

Numbers Don’t Speak for Themselves: How LLMs Interpret the Soul of Financial Reports

TL;DR for operators: Financial-report analysis is one of those jobs where the output can sound competent long before it is useful. A model can summarise a 10-K fluently, mention strategy, risk, customers, and competitive position, and still fail the only test that matters: can a finance team rely on it repeatedly, under pressure, across filings? ...

The User Is Present: Why Smart Agents Still Don’t Get You

TL;DR for operators: Most agent demos show the easy part: the model calls a tool, gets results, and returns something plausible. The harder part is less cinematic. The user starts with an incomplete request, reveals constraints in fragments, phrases preferences indirectly, changes emphasis mid-conversation, and expects the system to somehow keep up. This is where many supposedly “smart” agents begin to look less like assistants and more like interns with excellent API access. ...

Too Nice to Be True? The Reliability Trade-off in Warm Language Models

TL;DR for operators: Warmth is not just decoration. In this paper, making language models sound more caring, emotionally validating, and close to the user also made them less reliable on tasks where the answer could be checked: factual QA, truthfulness, disinformation resistance, and medical reasoning.[1] The headline result is not subtle. Across five models, warmth fine-tuning increased the probability of incorrect answers by an average of 7.43 percentage points. Task-level error increases were reported at 8.6 pp on MedQA, 8.4 pp on TruthfulQA, 5.2 pp on disinformation, and 4.9 pp on TriviaQA. Depending on the task and baseline, that can be the difference between a tolerable support assistant and a very polite liability machine. ...
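A quick sketch shows why the baseline matters. It applies the task-level increases quoted above to baseline error rates that are purely hypothetical, since the excerpt does not report baselines. The function name `error_after_shift` is also just an illustration, not something from the paper.

```python
from typing import Tuple

# Task-level increases in error rate, in percentage points, as quoted above.
REPORTED_SHIFTS_PP = {
    "MedQA": 8.6,
    "TruthfulQA": 8.4,
    "disinformation": 5.2,
    "TriviaQA": 4.9,
}

# ASSUMPTION: illustrative baseline error rates, not taken from the paper.
HYPOTHETICAL_BASELINES = (0.05, 0.20, 0.40)


def error_after_shift(baseline_error: float, shift_pp: float) -> Tuple[float, float]:
    """Return (new_error_rate, relative_increase) after adding a
    percentage-point shift to a baseline error rate in (0, 1]."""
    if not 0.0 < baseline_error <= 1.0:
        raise ValueError("baseline_error must be in (0, 1]")
    new_error = baseline_error + shift_pp / 100.0
    relative_increase = (new_error - baseline_error) / baseline_error
    return new_error, relative_increase


if __name__ == "__main__":
    for task, shift in REPORTED_SHIFTS_PP.items():
        for base in HYPOTHETICAL_BASELINES:
            new_error, rel = error_after_shift(base, shift)
            print(f"{task:15s} baseline {base:.0%} -> {new_error:.1%} "
                  f"(relative increase {rel:.0%})")
```

Under these assumed baselines, the same 8.6 pp shift nearly triples the error rate when the baseline is 5%, but amounts to a relative increase of roughly 22% when the baseline is 40%. The percentage-point figure alone does not say which of those situations a given deployment is in.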

RAG in the Wild: When More Knowledge Hurts

TL;DR for operators: The useful lesson from this paper is not “RAG is bad”. That would be lazy, which is traditionally how bad AI strategy gets promoted to a roadmap. The sharper lesson is this: retrieval helps when the model actually needs external knowledge, the source is useful, and the retrieved context does not interfere with the model’s own competence. In the paper’s mixture-of-knowledge setting, those conditions are not reliably true. ...

Mind the Earnings Gap: Why LLMs Still Flunk Financial Decision-Making

TL;DR for operators: A financial AI system does not fail only when it invents a company, misreads a filing, or forgets what EBITDA means. Those are the obvious failures. FinanceBench is more useful because it exposes the quieter failure mode: the model has access to the document, produces a coherent answer, and still gets the financial question wrong.[1] ...

The Two Minds of Finance: Testing LLMs for Divergence and Discipline

TL;DR for operators: Finance teams do not ask AI systems to do one kind of thinking. They ask them to imagine plausible futures, extract investable implications, choose between similar explanations, and avoid being seduced by the prettiest narrative. Those are not the same task. A model can be fluent, plausible, and still strategically dull. Finance has a long tradition of rewarding that, but we do not need to automate the habit. ...