IRT | Cognaptus

A benchmark is supposed to be a measuring instrument. In practice, many AI benchmarks behave more like a tired clipboard. Every model gets the same questions. Every question receives the same accounting treatment. The final score is usually a mean accuracy number, neat enough for a leaderboard and blunt enough to hide the messy truth underneath. Some items are too easy to tell strong models apart. Some are too hard to tell weak models apart. Some are mislabeled. Some have stopped mattering because everyone competent now solves them. Yet the ritual continues: run the suite, average the answers, update the chart, pretend the thermometer is not melting. ...