Opening — Why this matters now
The current arms race in AI has a predictable bias: more data, more compute, more parameters. It’s the industrialization of intelligence—scale as a proxy for progress.
And yet, quietly, a different thesis is emerging: what if the bottleneck isn’t model size, but data quality and selection?
This paper introduces SUPERNOVA, a data curation framework that challenges a deeply held assumption in AI development—that more diverse training data always improves reasoning. Spoiler: it doesn’t.
Background — Context and prior art
Modern reasoning models rely heavily on reinforcement learning (RL) pipelines, often trained on mixtures of tasks drawn from large instruction datasets. The prevailing approach has been:
- Aggregate hundreds (or thousands) of tasks
- Mix them broadly
- Let scale average out weaknesses
Datasets like Super-Natural Instructions (SuperNI) and reasoning-focused corpora such as DAPO or Nemotron-CrossThink reflect this philosophy.
But this introduces a quiet problem: not all tasks contribute equally to reasoning ability.
In fact, some tasks actively degrade performance.
Analysis — What the paper does differently
SUPERNOVA reframes data curation as a selection and composition problem, not a scaling problem.
1. Task Utility is Uneven (and Sometimes Negative)
The authors show that individual training tasks vary dramatically in their downstream impact.
| Task Type | Impact on Reasoning |
|---|---|
| Multi-hop reasoning | Strong positive |
| Coreference / logic | Positive |
| Narrative tasks | Neutral to negative |
| Surface formatting | Negative |
In controlled experiments, the gap between the best and worst tasks reached roughly 48 percentage points, an uncomfortable statistic for anyone who still believes all training data is equally valuable.
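The per-task analysis above can be made concrete with a small sketch. Everything here is hypothetical: `eval_delta` stands in for an expensive train-then-evaluate run, and the task names and numbers are invented to mirror the table, not taken from the paper.

```python
# Hypothetical sketch of per-task utility estimation: train with one task
# added to a base mix, measure the change in a downstream reasoning score,
# and rank tasks by that delta. All names and numbers are invented;
# eval_delta stands in for a real RL training + evaluation pipeline.

def eval_delta(task: str) -> float:
    """Placeholder for (score with task added) - (baseline score)."""
    toy_deltas = {
        "multi_hop": +0.36,       # strong positive, per the table above
        "coref_logic": +0.15,     # positive
        "narrative": -0.02,       # neutral to negative
        "surface_format": -0.12,  # negative
    }
    return toy_deltas[task]

def rank_tasks(tasks):
    """Sort tasks from most helpful to most harmful."""
    return sorted(tasks, key=eval_delta, reverse=True)

ranking = rank_tasks(["narrative", "multi_hop", "surface_format", "coref_logic"])
best_worst_gap = eval_delta(ranking[0]) - eval_delta(ranking[-1])  # 0.48 here
```

The point of the sketch is the shape of the procedure, not the numbers: once each task has a measured delta, selection becomes an ordinary ranking problem.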
2. Micro Mixing > Macro Mixing
Rather than blending tasks based on global ranking, SUPERNOVA introduces Micro Mixing—selecting top-performing tasks per sub-skill.
| Strategy | pass@8 Performance |
|---|---|
| Macro Mixing (Top 2) | 21.7 |
| Micro Mixing (Top 2) | 22.8 |
The improvement may look modest, but in reasoning benchmarks, this is meaningful—and consistent.
The implication is subtle but important:
Diversity should be structured, not indiscriminate.
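The macro/micro distinction is easy to sketch in code. In this hypothetical example (task names, sub-skill labels, and scores are all invented), macro mixing takes the globally top-k tasks, while micro mixing takes the top-k within each sub-skill:

```python
# Hypothetical sketch of macro vs. micro mixing. Scores and skill labels
# are invented; the paper's actual selection criteria may differ.

# (task, sub_skill, utility_score)
TASKS = [
    ("hotpot_qa",   "multi_hop", 0.90),
    ("musique",     "multi_hop", 0.85),
    ("strategy_qa", "multi_hop", 0.80),
    ("winogrande",  "coref",     0.60),
    ("quoref",      "coref",     0.55),
    ("logic_grid",  "logic",     0.50),
]

def macro_mix(tasks, k):
    """Top-k tasks by global utility, ignoring sub-skill coverage."""
    return [name for name, _, _ in sorted(tasks, key=lambda x: -x[2])[:k]]

def micro_mix(tasks, k):
    """Top-k tasks *within each sub-skill*: structured, not indiscriminate."""
    by_skill = {}
    for name, skill, score in tasks:
        by_skill.setdefault(skill, []).append((score, name))
    selected = []
    for entries in by_skill.values():
        selected += [name for _, name in sorted(entries, reverse=True)[:k]]
    return selected
```

With k=2, the macro mix collapses onto two multi-hop tasks, while the micro mix still covers coreference and logic. That coverage, not the raw task count, is what "structured diversity" buys.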
3. Synthetic Data Interventions Don’t Help (Much)
The paper tests multiple transformations—long-context prompts, inductive reasoning variations, error injection.
Result: none outperform the original curated dataset.
| Intervention | pass@8 |
|---|---|
| Base (Micro-Top2) | 22.8 |
| Going Against Prior | 22.6 |
| Long Context | 21.3 |
| Inductive Reasoning | 20.4 |
This is mildly ironic. We spend enormous effort inventing clever augmentations—only to discover that well-selected original data already does the job better.
4. Smaller Models, Better Data → Bigger Results
SUPERNOVA-trained models consistently outperform larger baselines.
| Model | Size | pass@8 |
|---|---|---|
| SUPERNOVA | 4B | 33.3 |
| Qwen3 | 8B | 24.2 |
| General Reasoner | 4B | 32.9 |
A 4B model beating an 8B model is not just an engineering win—it’s a cost argument.
5. Generalization Holds (Even OOD)
The framework improves performance across unseen benchmarks like logical puzzles and multi-step reasoning tasks.
Notably, gains persist even at higher sampling budgets (pass@k scaling), suggesting:
- Genuinely better reasoning paths
- Not merely luckier sampling
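For readers unfamiliar with the pass@8 numbers quoted throughout: pass@k is the probability that at least one of k sampled generations is correct. The sketch below uses the unbiased estimator that is standard in code-evaluation work; whether SUPERNOVA computes it exactly this way is an assumption on my part.

```python
# Standard unbiased pass@k estimator (common in code/reasoning evals):
# draw k samples without replacement from n generations, c of which are
# correct; pass@k = 1 - C(n-c, k) / C(n, k).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples is correct."""
    if n - c < k:
        return 1.0  # too few wrong generations: every size-k subset hits
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 generations, 4 correct, evaluated at k=8.
score = pass_at_k(16, 4, 8)
```

The estimator explains the blog's "lucky guesses" caveat: a model can inflate pass@k at small k through variance alone, so gains that survive at larger k are stronger evidence of real reasoning improvement.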
Findings — What actually moves the needle
Let’s distill the paper into a practical hierarchy:
| Factor | Impact | Interpretation |
|---|---|---|
| Task selection | Very high | Choose wisely or suffer |
| Task mixing strategy | High | Structure matters |
| Synthetic augmentation | Low/negative | Often over-engineered |
| Model size | Moderate | Secondary to data quality |
In short:
The marginal value of better data exceeds the marginal value of more model.
Implications — What this means for AI builders
1. Data Curation Becomes a Core Competency
AI teams will need to move beyond dataset aggregation into dataset design.
Expect roles like:
- Reasoning data analysts
- Task utility evaluators
- RL data pipeline engineers
2. ROI Shifts from Compute to Selection
Training costs are dominated by compute—but SUPERNOVA suggests:
- Better data → fewer training steps
- Smaller models → lower infra cost
- Higher performance → better product outcomes
This is one of the rare cases where quality also reduces cost.
3. Benchmarks May Be Misleading
Aggregate metrics hide important dynamics.
A model that performs well on average may:
- Be carried by a few strong capabilities
- Fail systematically on others
SUPERNOVA’s per-task analysis highlights the need for fine-grained evaluation.
4. “More Tasks” Is Not a Strategy
The industry default—throwing more tasks into training—now looks inefficient.
Worse, it can degrade performance.
Which makes many current pipelines… politely speaking, suboptimal.
Conclusion — Precision beats volume
SUPERNOVA doesn’t introduce a new model architecture.
It doesn’t rely on exotic training tricks.
Instead, it asks a sharper question:
What if we simply trained on the right data?
The answer is quietly disruptive:
- Smaller models outperform larger ones
- Carefully selected tasks outperform broad mixtures
- Synthetic complexity fails to beat curated simplicity
It turns out intelligence isn’t just about how much you learn.
It’s about what you choose to learn from.
Cognaptus: Automate the Present, Incubate the Future.