Reasoning Evaluation

TL;DR for operators A model giving the same answer five times is comforting in the same way that five interns copying the same spreadsheet error is comforting: technically consistent, operationally useless. The paper behind this article proposes structural uncertainty, a black-box method for evaluating whether an LLM can stably rank its own reasoning paths, not merely whether its final answers agree.1 The method samples multiple candidate solutions, asks the same model to compare pairs of its own outputs, turns those comparisons into ranking distributions using Bradley-Terry or TrueSkill plus PageRank, then measures two things: whether rankings fluctuate across comparison trials, and whether each trial remains ambiguous among candidates. ...