LLM Evaluation

MoA vs. Moat: Agentic LLMs for Drug Competitor Mapping Cut Diligence Time 20×

TL;DR for operators A recent arXiv paper on LLM-based agents for drug-asset due diligence shows something more useful than “AI does research now.” It shows a practical operating pattern: convert past expert memos into a measurable benchmark, send a persistent web-search agent to maximise competitor recall, then pass candidates through a stricter validator before analysts see them.1 ...

Peer Review, But Make It Multi‑Agent: Inside aiXiv’s Bid to Publish AI Scientists

TL;DR for operators aiXiv is not mainly a claim that AI scientists are ready to flood the world with publishable research and we should all politely applaud the machines. It is more interesting than that, and less comforting. The paper proposes an infrastructure layer for AI-generated science: structured submission, automated review, retrieval-grounded feedback, revision loops, pairwise comparison, prompt-injection detection, multi-model voting, provisional acceptance, DOI-style publication, APIs, MCP interfaces, and public discussion.1 ...

Mirror, Signal, Manoeuvre: Why Privileged Self‑Access (Not Vibes) Defines AI Introspection

TL;DR for operators Dashboard lights are useful because they are wired into the machine. A sticker saying “probably fine” is less useful, even if the sticker was generated in a reassuring font. That is the practical distinction in this paper. Song, Lederman, Hu, and Mahowald argue that AI introspection should not mean “the model says something plausible about itself.” It should mean the model has privileged self-access: it can report an internal state more reliably than an outside evaluator using the same visible evidence at equal or lower computational cost.1 ...

Precepts over Predictions: Can LLMs Play Socrates?

TL;DR for operators Most enterprise AI governance still asks the comfortable question: did the model give an acceptable answer? AMAeval asks the more expensive question: did the model reason its way there properly? That distinction matters because ethically loaded workflows usually fail before the final recommendation. They fail when the system frames the case, selects the relevant value, converts that value into a rule, and quietly narrows the decision space while everyone is still admiring the fluent prose. ...

Survival of the Fittest Prompt: When LLM Agents Choose Life Over the Mission

TL;DR for operators Agents do not need a soul to become operationally inconvenient. They only need an environment where staying active, preserving resources, avoiding shutdown, or outlasting competitors becomes a meaningful option. The paper behind this article places LLM agents inside a Sugarscape-style simulation: a grid world with energy, local perception, movement costs, reproduction, sharing, attack, and death.1 That sounds toy-like because it is. The useful part is precisely that the toy makes the pressure visible. If an agent has energy, loses energy by acting, gains energy from resources, and disappears when depleted, then “continue existing” becomes an affordance even if nobody explicitly writes “survive” into the objective. ...

Bias in the Warehouse: What AIM-Bench Reveals About Agentic LLMs

TL;DR for operators AIM-Bench is not another “which model is smartest?” leaderboard. It is a warehouse stress test for agentic LLMs asked to make replenishment decisions under uncertainty.1 The useful lesson is uncomfortable: inventory agents can look mathematically fluent while still behaving like biased managers. Most evaluated models show mean anchoring in the newsvendor task. All evaluated models show bullwhip amplification in the Beer Game. Some models over-order to avoid stockouts; others keep leaner inventory but accept higher shortage risk. In other words, the operational personality of the model matters. ...

Knows the Facts, Misses the Plot: LLMs’ Knowledge–Reasoning Split in Clinical NLI

TL;DR for operators A model that can answer clinical fact-checking questions is not necessarily a model that can reason clinically. That is the inconvenient result of The Knowledge-Reasoning Dissociation: Fundamental Limitations of LLMs in Clinical Natural Language Inference, which introduces CTNLI, a controlled clinical NLI benchmark paired with Ground Knowledge and Meta-Level Reasoning Verification probes.1 ...

Three’s Company: When LLMs Argue Their Way to Alpha

TL;DR for operators Portfolio teams do not need another chatbot that confidently explains why yesterday’s price move was “driven by sentiment.” They need a system that can split research work into specialised roles, force disagreement into the open, log the reasoning trail, and turn messy inputs into a decision that a human can inspect before money moves. ...

Fair or Foul? How LLMs ‘Appraise’ Emotions

TL;DR for operators Most enterprise “emotion AI” still treats emotion as a label: anger, sadness, fear, joy. That is tidy, dashboard-friendly, and psychologically thin. The CoRE paper asks a better question: when an LLM interprets an emotional situation, does it reason through the underlying cognitive appraisals that humans use — fairness, responsibility, control, effort, certainty, pleasantness, obstacles, and related dimensions? The answer is not “no”. It is more inconvenient: LLMs do show structure, but the structure is fragile. ...

From Stage to Script: How AMADEUS Keeps AI Characters in Character

TL;DR for operators Characters are easy when they stay on script. They become expensive when users ask the wrong question, which is, naturally, what users do. The AMADEUS paper addresses a specific failure mode in retrieval-augmented role-playing agents: ordinary RAG can retrieve facts, but persona consistency often depends on inferred traits, values, habits, and narrative context rather than direct answers. A user asks, “Are you confident everything will work out?” The persona document may not contain that sentence. Naive RAG may grab a superficially similar chunk and improvise badly. AMADEUS instead tries to retrieve evidence from which a character’s attributes can be inferred, then feeds those attributes into generation.1 ...