The Data Diet for Reasoning Models: Why Less (But Smarter) Wins
Opening: Why this matters now

The current arms race in AI has a predictable bias: more data, more compute, more parameters. It’s the industrialization of intelligence, with scale as a proxy for progress. And yet, quietly, a different thesis is emerging: what if the bottleneck isn’t model size, but data quality and selection? This paper introduces SUPERNOVA, a data curation framework that challenges a deeply held assumption in AI development: that more diverse training data always improves reasoning. Spoiler: it doesn’t. ...