AI Evaluation

Thinking in Circles: How Self-Questioning LLMs Learn Without Labels

TL;DR for operators: Self-Questioning Language Models, or SQLM, tests a tempting idea: can a language model improve its reasoning ability without being handed a curated training set of questions and answers? The answer in this paper is: partly, in narrow settings, if the training loop is engineered carefully enough.[1] The mechanism is not mystical self-awareness. A model is split into two roles. One role proposes questions from a single topic prompt. The other tries to solve them. Reinforcement learning then updates the system using proxy rewards: majority-vote agreement for arithmetic and algebra, and proposer-generated unit tests for coding. The proposer is rewarded for problems that are not too easy and not too hard; the solver is rewarded for answers that pass the available proxy. ...
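To make the majority-vote proxy concrete, here is a minimal sketch, not the paper's implementation. It assumes the solver is sampled several times on one proposed question, that answers are compared as plain strings, and that "not too easy and not too hard" can be approximated by an agreement band. The function names and the `low`/`high` thresholds are hypothetical and chosen only for illustration.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Score sampled solver answers against their own majority answer.

    Returns (majority_answer, per_sample_rewards, agreement_rate).
    No ground-truth label is used: the majority answer is the proxy.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    majority, top_count = counts.most_common(1)[0]
    rewards = [1.0 if answer == majority else 0.0 for answer in answers]
    return majority, rewards, top_count / len(answers)


def proposer_reward(agreement_rate, low=0.3, high=0.9):
    """Hypothetical difficulty band: reward questions the solver neither
    always agrees on (too easy) nor scatters on (too hard)."""
    return 1.0 if low <= agreement_rate <= high else 0.0


samples = ["12", "12", "15", "12"]  # four sampled answers to one question
majority, solver_rewards, agreement = majority_vote_rewards(samples)
question_reward = proposer_reward(agreement)
```

The point of the sketch is the caveat it exposes: a confident but wrong majority still earns full reward, which is why the proxy only works in settings where agreement tracks correctness.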

Credit Where It's Due: How CAPO Brings Verifiable Precision to LLM Reasoning

TL;DR for operators: CAPO is not mainly a paper about “making models reason better” in the usual fog-machine sense. It is about fixing a specific training failure: outcome-only reinforcement learning tells a model whether the final answer was right, but not which part of the reasoning earned or destroyed that outcome. The method uses a stronger off-the-shelf LLM as a generative process reward model, or GenPRM, to inspect a rollout and identify wrong reasoning steps in one pass. Those step-level critiques are then converted into token-level penalties, so the policy update can suppress flawed reasoning segments instead of treating the whole answer as one indivisible blob. The authors test this across Llama-3-1B/3B and Qwen2.5-1.5B/7B backbones, with results showing consistent average gains over SFT, GRPO with rule-based verification, and GRPO with generative outcome reward modelling.[1] ...
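The step-to-token conversion can be sketched in a few lines. This is an illustration under stated assumptions, not CAPO's actual formula: it assumes the rollout is already segmented into steps with known token counts, that the GenPRM returns the indices of wrong steps, and that a flat `penalty` is subtracted from the outcome reward for every token in a flagged step. The function name and the penalty value are hypothetical.

```python
def token_level_credit(step_token_counts, wrong_steps, outcome_reward, penalty=1.0):
    """Spread an outcome reward over tokens, penalising flagged steps.

    step_token_counts: number of tokens in each reasoning step, in order.
    wrong_steps: set of step indices the critic marked as wrong.
    Returns one credit value per token in the rollout.
    """
    credits = []
    for step_index, n_tokens in enumerate(step_token_counts):
        if step_index in wrong_steps:
            value = outcome_reward - penalty  # assumed flat penalty
        else:
            value = outcome_reward
        credits.extend([value] * n_tokens)
    return credits


# Three steps of 3, 2, and 4 tokens; the critic flags the middle step.
credits = token_level_credit([3, 2, 4], wrong_steps={1}, outcome_reward=1.0, penalty=0.5)
```

With outcome-only credit, all nine tokens would receive the same value; here the two tokens of the flagged step receive less, which is the whole idea in miniature.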

Seeing Is Deceiving: Diagnosing and Fixing Hallucinations in Multimodal AI

TL;DR for operators: A multimodal model can look at an image and still answer from memory, habit, or linguistic guesswork. That is the uncomfortable core of visual hallucination: the output is fluent, relevant-looking, and sometimes even useful, while being only loosely attached to the pixels it claims to describe. The practical lesson is not “never use multimodal AI.” That would be tidy, dramatic, and mostly useless. The lesson is narrower and more valuable: visual hallucinations need to be diagnosed by where grounding fails, not merely counted after the model has embarrassed itself. ...

How Sparse is Your Thought? Cracking the Inner Logic of Chain-of-Thought Prompts

TL;DR for operators: Chain-of-thought prompting is often sold as a window into model reasoning. This paper is more useful because it treats CoT as something less mystical and more testable: a prompt-induced change in internal representations.[1] The researchers train sparse autoencoders on hidden activations from two Pythia models solving GSM8K math problems under CoT and NoCoT prompts. They then patch CoT-derived sparse features into NoCoT runs and ask a sharper question: does inserting those internal features increase the log-probability of the correct answer? ...
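The measurement at the end of that question is easy to state in code. The sketch below covers only the scoring step, not the sparse autoencoder or the patching itself: it assumes you already have output logits from a baseline NoCoT run and from a patched run, and it compares the log-probability assigned to the correct answer token. The function names and the toy logits are hypothetical.

```python
import math


def log_prob(logits, index):
    """Log-softmax of `logits` evaluated at `index`, computed stably."""
    peak = max(logits)
    log_normaliser = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return logits[index] - log_normaliser


def patching_effect(base_logits, patched_logits, correct_index):
    """Change in log-probability of the correct answer after patching.

    Positive means the patched features made the correct answer more likely.
    """
    return log_prob(patched_logits, correct_index) - log_prob(base_logits, correct_index)


base = [1.0, 1.0, 1.0]     # toy NoCoT logits over three candidate tokens
patched = [2.0, 1.0, 1.0]  # toy logits after inserting CoT-derived features
effect = patching_effect(base, patched, correct_index=0)
```

A positive `effect` is evidence that the patched features carry answer-relevant information; it is not by itself evidence that they encode reasoning.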

Mind the Gap: How AI Papers Misuse Psychology

TL;DR for operators: AI teams love borrowing psychology. It gives messy model behaviour a tidy name: “reasoning,” “empathy,” “Theory of Mind,” “bias,” “motivation,” “attention.” The problem is that a borrowed label is not the same as a valid construct. A new paper, The Incomplete Bridge: How AI Research (Mis)Engages with Psychology, studies this borrowing directly by mapping 1,006 LLM-related papers from major AI venues and the 2,544 psychology papers they cite.[1] ...

Mirage Agents: When LLMs Act on Illusions

TL;DR for operators: LLM agents do not merely hallucinate by saying false things. They hallucinate when they act on a version of the world that does not match the task, the history, or the screen in front of them. That is the useful idea in MIRAGE-Bench: it treats agent hallucination as context-unfaithful action. The agent may click a button that is not there, assume a page transition succeeded when it did not, answer a colleague’s question with invented information, submit code despite failed tests, or report success when the environment says otherwise. Very industrious. Very confident. Very much not what you want near production systems. ...

Tools of Thought: Why Reasoning Isn’t an Illusion After All

TL;DR for operators: The useful question is not whether reasoning models “really think”. That debate is charming, mostly because it lets everyone pretend a benchmark table is a metaphysics seminar. The operational question is simpler: when you give a reasoning model the same tools as a non-reasoning model, does it use them better? ...

Mirror, Mirror in the Model: How MLLMs Learn from Their Own Mistakes

TL;DR for operators: Image generators fail in a familiar way: the output looks polished, but the prompt was quietly ignored. A product photo misses the specified texture. A campaign image reverses a spatial relation. A science illustration draws the visually plausible version, not the physically correct one. Everyone then discovers, with appropriate corporate surprise, that “high quality” and “correct” are not synonyms. ...

LLMs Meet Logic: SymbolicThought Turns AI Relationship Guesswork into Graphs

TL;DR for operators: SymbolicThought[1] is a useful reminder that relationship extraction is not a vibes problem. It is a graph problem wearing a language-model costume. The paper proposes a human-in-the-loop system for extracting character relationships from narrative text. The pipeline lets an LLM propose characters and relations, then applies symbolic rules to infer missing edges, detect contradictions, retrieve supporting evidence, and ask humans to confirm or correct what matters. That is the important mechanism: the LLM is not trusted as a final judge. It is treated as a noisy extractor inside a controlled annotation workflow. ...

Mind Games: How LLMs Subtly Rewire Human Judgment

TL;DR for operators: When an LLM summarises a review, policy memo, support ticket, medical note, or news item, the operational question is not only “Did it get the facts right?” The sharper question is: did it change what the user is likely to believe, prioritise, or buy? The paper behind this article studies exactly that problem. It treats LLM-generated content as a decision interface and measures three ways the interface can quietly bend human judgment: changing the sentiment frame of the source, overemphasising the beginning of the source, and fabricating confident answers for events beyond the model’s knowledge cutoff.[1] ...