When Benchmarks Forget What They Learned
Opening — Why this matters now Large language models are getting better at everything — or at least that’s what the leaderboards suggest. Yet beneath the glossy scores lies a quiet distortion: many benchmarks are no longer measuring learning, but recall. The paper you’ve just uploaded dissects this issue with surgical precision, showing how memorization creeps into evaluation pipelines and quietly inflates our confidence in model capability. ...