Cache Me If You Can: Why LLM Benchmarks Need Contamination-Resistant Data
The benchmark score is not the product. The test pipeline is.
Benchmarks used to feel like neutral scoreboards. A model sat down, answered questions, received a number, and everyone pretended the number meant generalization. That story became less charming once benchmark questions started appearing in the same public data oceans used to train the models being tested. ...