LLM Evaluation

Precepts over Predictions: Can LLMs Play Socrates?

TL;DR Most LLM ethics tests score the verdict. AMAeval scores the reasoning. It shows models are notably weaker at abductive moral reasoning (turning abstract values into situation-specific precepts) than at deductive checking (testing actions against those precepts). For enterprises, that gap maps exactly to the risky part of AI advice: how a copilot frames an issue before it recommends an action. Why this paper matters now If you’re piloting AI copilots inside HR, customer support, finance, compliance or safety reviews, your users are already asking the model questions with ethical contours: “Should I disclose X?”, “Is this fair to the customer?”, “What’s the responsible escalation?” ...

FAITH in Numbers: Stress-Testing LLMs Against Financial Hallucinations

Financial AI promises speed and scale — but in finance, a single misplaced digit can be the difference between compliance and catastrophe. The FAITH (Framework for Assessing Intrinsic Tabular Hallucinations) benchmark tackles this risk head‑on, probing how well large language models can faithfully extract and compute numbers from the dense, interconnected tables in 10‑K filings. From Idea to Dataset: Masking With a Purpose FAITH reframes hallucination detection as a context‑aware masked span prediction task. It takes real S&P 500 annual reports, hides specific numeric spans, and asks the model to recover them — but only after ensuring three non‑negotiable conditions: ...

Many Minds Make Light Work: Boosting LLM Physics Reasoning via Agentic Verification

If you think AI models are getting too good at math, you’re not wrong. Benchmarks like GSM8K and MATH have been largely conquered. But when it comes to physics—where reasoning isn’t just about arithmetic, but about assumptions, abstractions, and real-world alignment—the picture is murkier. A new paper, PhysicsEval: Inference-Time Techniques to Improve the Reasoning Proficiency of Large Language Models on Physics Problems, makes a bold stride in this direction. It introduces a massive benchmark of 19,609 physics problems called PHYSICSEVAL and rigorously tests how frontier LLMs fare across topics from thermodynamics to quantum mechanics. Yet the real breakthrough isn’t just the dataset—it’s the method: multi-agent inference-time critique. ...

Beyond Stack Overflow: CodeAssistBench Exposes the Real Gaps in LLM Coding Help

The Trouble With Stack Overflow-Style Benchmarks Large language models (LLMs) have been hailed as revolutionizing programming workflows. But most coding benchmarks still test them like they’re junior devs solving textbook exercises. Benchmarks such as HumanEval, MBPP, and even InfiBench focus on code synthesis in single-turn scenarios. These tests make models look deceptively good — ChatGPT-4 gets 83% on StackEval. Yet in real development, engineers don’t just ask isolated questions. They explore, revise, troubleshoot, and clarify — all while navigating large, messy codebases. ...

Memory Games: The Data Contamination Crisis in Reinforcement Learning

Reinforcement learning (RL) has recently emerged as the favored path to boost large language models’ reasoning abilities. The latest headline-grabbing claim? That even random or incorrect reward signals can help models like Qwen2.5 become better reasoners. But a new paper, “Reasoning or Memorization?”, cuts through the hype—and it does so with scalpel-like precision. It reveals that what we thought were signs of emergent reasoning in Qwen2.5 might, in fact, be a textbook case of data contamination. If true, the implications are serious: much of what we thought we knew about RL-driven reasoning gains could be little more than sophisticated memory retrieval. ...

Words, Not Just Answers: Using Psycholinguistics to Test LLM Alignment

Words, Not Just Answers: Using Psycholinguistics to Test LLM Alignment For years, evaluating large language models (LLMs) has revolved around whether they get the answer right. Multiple-choice benchmarks, logical puzzles, and coding tasks dominate the leaderboard mindset. But a new study argues we may be asking the wrong questions — or at least, measuring the wrong aspects of language. Instead of judging models by their correctness, Psycholinguistic Word Features: a New Approach for the Evaluation of LLMs Alignment with Humans introduces a richer, more cognitively grounded evaluation: comparing how LLMs rate words on human-centric features like arousal, concreteness, and even gustatory experience. The study repurposes well-established datasets from psycholinguistics to assess whether LLMs process language in ways similar to people — not just syntactically, but experientially. ...

Thinking Inside the Gameboard: Evaluating LLM Reasoning Step-by-Step

LLMs are great at spitting out answers—but are they any good at thinking through problems? A new benchmark, AdvGameBench, introduces a process-based evaluation approach that places LLMs into three rule-based strategic games to measure not outcomes, but the quality of reasoning. Developed by Yuan et al., this framework focuses on how LLMs plan, revise, and make resource-limited decisions in dynamic settings. Three Games, Three Cognitive Demands 1. Tower Defense tests spatial planning and rule-following. Models place defenders on a battlefield to block enemies—positioning, cooldowns, and cost management are key. ...