
Many Minds Make Light Work: Boosting LLM Physics Reasoning via Agentic Verification

If you think AI models are getting too good at math, you’re not wrong. Benchmarks like GSM8K and MATH have been largely conquered. But when it comes to physics—where reasoning isn’t just about arithmetic, but about assumptions, abstractions, and real-world alignment—the picture is murkier. A new paper, PhysicsEval: Inference-Time Techniques to Improve the Reasoning Proficiency of Large Language Models on Physics Problems, tackles this gap head-on. It introduces PHYSICSEVAL, a massive benchmark of 19,609 physics problems, and rigorously tests how frontier LLMs fare across topics from thermodynamics to quantum mechanics. Yet the real breakthrough isn’t just the dataset—it’s the method: multi-agent inference-time critique. ...

August 4, 2025 · 3 min · Zelina

Beyond Stack Overflow: CodeAssistBench Exposes the Real Gaps in LLM Coding Help

The Trouble With Stack Overflow-Style Benchmarks

Large language models (LLMs) have been hailed as revolutionizing programming workflows. But most coding benchmarks still test them like they’re junior devs solving textbook exercises. Benchmarks such as HumanEval, MBPP, and even InfiBench focus on code synthesis in single-turn scenarios. These tests make models look deceptively good — ChatGPT-4 gets 83% on StackEval. Yet in real development, engineers don’t just ask isolated questions. They explore, revise, troubleshoot, and clarify — all while navigating large, messy codebases. ...

July 16, 2025 · 4 min · Zelina

Memory Games: The Data Contamination Crisis in Reinforcement Learning

Reinforcement learning (RL) has recently emerged as the favored path to boost large language models’ reasoning abilities. The latest headline-grabbing claim? That even random or incorrect reward signals can help models like Qwen2.5 become better reasoners. But a new paper, “Reasoning or Memorization?”, cuts through the hype—and it does so with scalpel-like precision. It reveals that what we thought were signs of emergent reasoning in Qwen2.5 might, in fact, be a textbook case of data contamination. If true, the implications are serious: much of what we thought we knew about RL-driven reasoning gains could be little more than sophisticated memory retrieval. ...

July 15, 2025 · 3 min · Zelina

Words, Not Just Answers: Using Psycholinguistics to Test LLM Alignment

For years, evaluating large language models (LLMs) has revolved around whether they get the answer right. Multiple-choice benchmarks, logical puzzles, and coding tasks dominate the leaderboard mindset. But a new study argues we may be asking the wrong questions — or at least, measuring the wrong aspects of language. Instead of judging models by their correctness, Psycholinguistic Word Features: a New Approach for the Evaluation of LLMs Alignment with Humans introduces a richer, more cognitively grounded evaluation: comparing how LLMs rate words on human-centric features like arousal, concreteness, and even gustatory experience. The study repurposes well-established datasets from psycholinguistics to assess whether LLMs process language in ways similar to people — not just syntactically, but experientially. ...

July 1, 2025 · 4 min · Zelina

Thinking Inside the Gameboard: Evaluating LLM Reasoning Step-by-Step

LLMs are great at spitting out answers—but are they any good at thinking through problems? A new benchmark, AdvGameBench, introduces a process-based evaluation approach that places LLMs into three rule-based strategic games to measure not outcomes, but the quality of reasoning. Developed by Yuan et al., this framework focuses on how LLMs plan, revise, and make resource-limited decisions in dynamic settings.

Three Games, Three Cognitive Demands

1. Tower Defense tests spatial planning and rule-following. Models place defenders on a battlefield to block enemies—positioning, cooldowns, and cost management are key. ...

June 20, 2025 · 3 min · Zelina