Words, Not Just Answers: Using Psycholinguistics to Test LLM Alignment

For years, evaluating large language models (LLMs) has revolved around whether they get the answer right. Multiple-choice benchmarks, logical puzzles, and coding tasks dominate the leaderboard mindset. But a new study argues we may be asking the wrong questions, or at least measuring the wrong aspects of language. Instead of judging models by their correctness, "Psycholinguistic Word Features: A New Approach for the Evaluation of LLMs Alignment with Humans" introduces a richer, more cognitively grounded evaluation: comparing how LLMs rate words on human-centric features such as arousal, concreteness, and even gustatory experience. The study repurposes well-established datasets from psycholinguistics to assess whether LLMs process language in ways similar to people: not just syntactically, but experientially. ...

July 1, 2025 · 4 min · Zelina

Thinking Inside the Gameboard: Evaluating LLM Reasoning Step-by-Step

LLMs are great at spitting out answers, but are they any good at thinking through problems? A new benchmark, AdvGameBench, introduces a process-based evaluation approach that places LLMs in three rule-based strategic games to measure not outcomes, but the quality of reasoning. Developed by Yuan et al., this framework focuses on how LLMs plan, revise, and make resource-limited decisions in dynamic settings.

Three Games, Three Cognitive Demands

1. Tower Defense tests spatial planning and rule-following. Models place defenders on a battlefield to block enemies; positioning, cooldowns, and cost management are key. ...
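The spirit of process-based evaluation can be sketched with a toy rule-checker: instead of judging only who wins, every move a model proposes is validated against the game's constraints. The `TowerDefenseJudge` class, its rules, and its numbers are all hypothetical; AdvGameBench's actual games and scoring differ.

```python
# Hypothetical sketch of process-based scoring: track budget, cooldowns, and
# occupied cells, and record every rule violation a model's move sequence makes.

from dataclasses import dataclass, field

@dataclass
class TowerDefenseJudge:
    budget: int = 10
    cooldown: int = 2            # turns before the same unit type can be reused
    occupied: set = field(default_factory=set)
    last_used: dict = field(default_factory=dict)
    turn: int = 0
    violations: list = field(default_factory=list)

    def step(self, unit, cost, cell):
        """Validate one proposed placement; apply it only if it is legal."""
        self.turn += 1
        if cost > self.budget:
            self.violations.append((self.turn, "over budget"))
        elif cell in self.occupied:
            self.violations.append((self.turn, "cell occupied"))
        elif self.turn - self.last_used.get(unit, -self.cooldown) < self.cooldown:
            self.violations.append((self.turn, "unit on cooldown"))
        else:
            self.budget -= cost
            self.occupied.add(cell)
            self.last_used[unit] = self.turn

    def rule_following_rate(self):
        """Fraction of proposed moves that respected the rules."""
        return 1 - len(self.violations) / max(self.turn, 1)

judge = TowerDefenseJudge()
judge.step("archer", 3, (0, 0))   # legal
judge.step("archer", 3, (0, 1))   # illegal: archer still on cooldown
judge.step("cannon", 5, (0, 0))   # illegal: cell already occupied
judge.step("cannon", 5, (1, 1))   # legal
# judge.rule_following_rate() → 0.5 for this sequence
```

The point of this framing is that two models can reach the same final outcome while differing sharply in how often they violated constraints along the way, and that difference is exactly what a process-based benchmark captures.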

June 20, 2025 · 3 min · Zelina