When Models Forget on Purpose: Why Data Selection Matters More Than Data Volume

Training data has become the AI industry’s favorite comfort blanket. When performance stalls, add more tokens. When a benchmark looks stubborn, add more tokens. When the model behaves badly, add more tokens and call it a roadmap.

This worked well enough to become a reflex. Unfortunately, reflexes are not strategies. The uncomfortable question is no longer whether data matters. Of course it matters. The better question is whether every token deserves the same vote during training.

A recent paper, LLM Data Selection and Utilization via Dynamic Bi-level Optimization, gives a useful answer: no, it does not.¹ The paper does not merely argue for cleaner datasets or better filtering before training begins. That part is now practically table stakes. Its more interesting claim is that a model’s preferred training data changes during training, so static data selection leaves value on the table.

In plainer language: models may need to forget on purpose. Not by deleting knowledge after the fact. Not by moralizing the dataset. By learning which examples should matter less at a given stage of training.

That is a quiet idea with expensive implications.

Data volume is not the same as data value

The old scaling story was beautifully simple: bigger models, more compute, more data. It had the elegance of a gym poster. Lift heavier, eat more, become enormous.

The problem is that language-model training is not bodybuilding. Once the industry moved from “can we train a useful model?” to “can we train one efficiently, safely, and predictably?”, raw volume became a suspicious metric. A corpus can be large because it is broad. It can also be large because it is repetitive, noisy, redundant, legally awkward, stylistically narrow, or full of material that teaches the model the wrong shortcuts with great enthusiasm.

Prior work has already made this point from several directions. DSIR treats data selection as a distribution-matching problem: select pretraining data that better resembles a target distribution rather than blindly sampling from everything available.² QuRating asks language models to score text quality across dimensions such as writing style, expertise, facts, and educational value, then uses those ratings to improve pretraining data selection.³ LESS shows that, for instruction tuning, a carefully selected small subset can outperform training on the full dataset for targeted capabilities.⁴
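
To make the distribution-matching idea concrete, here is a minimal sketch of importance-based selection. It is not the DSIR implementation: the published method works on hashed n-gram features and resamples stochastically, while this version swaps in plain unigram counts and a deterministic top-k. The function names and the toy texts are illustrative assumptions.

```python
import math
from collections import Counter

def tokenize(text):
    # Whitespace tokenization keeps the sketch dependency-free.
    return text.lower().split()

def smoothed_log_probs(texts, vocab, alpha=1.0):
    # Add-alpha smoothed unigram log-probabilities over a shared vocabulary.
    counts = Counter(tok for text in texts for tok in tokenize(text))
    total = sum(counts[tok] for tok in vocab) + alpha * len(vocab)
    return {tok: math.log((counts[tok] + alpha) / total) for tok in vocab}

def select_by_importance(raw_texts, target_texts, k):
    vocab = {tok for text in raw_texts + target_texts for tok in tokenize(text)}
    log_p_target = smoothed_log_probs(target_texts, vocab)
    log_p_raw = smoothed_log_probs(raw_texts, vocab)

    def log_weight(text):
        # log importance weight = log p_target(text) - log p_raw(text)
        return sum(log_p_target[t] - log_p_raw[t] for t in tokenize(text))

    # Simplification: deterministic top-k instead of stochastic resampling.
    return sorted(raw_texts, key=log_weight, reverse=True)[:k]

# Hypothetical toy corpora, invented for illustration only.
raw_pool = [
    "the team won the match",
    "the patient took a dose",
    "the match went to extra time",
]
target_examples = [
    "the patient received a dose",
    "the dose reduced patient symptoms",
]

selected = select_by_importance(raw_pool, target_examples, k=1)
print(selected)
```

The point of the sketch is only that "select what resembles the target" can be written as a ratio of two simple density estimates. Everything about feature choice and sampling is where the real method earns its keep.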

These studies all weaken the “more is better” instinct. The new paper goes one step further. It asks not only which data should enter training, but how much influence each sample should have while training is already happening.

That distinction matters. Static filtering is like hiring a team before a project starts. Dynamic weighting is like changing who speaks during each meeting as the project evolves. Anyone who has attended a meeting knows the second problem is where civilization usually collapses.

The paper’s core move: data selection becomes data utilization

The paper proposes a Data Weighting Model, or DWM. Instead of treating every selected sample in a batch equally, DWM assigns different weights to samples during training. These weights determine how strongly each sample contributes to the model update.
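
A minimal sketch of what "different weights within a batch" means for the loss, assuming the weights come from a softmax over per-sample scores. The softmax normalization and the names below are assumptions made for illustration; the article does not specify how the paper normalizes its weights.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_batch_loss(per_sample_losses, scores):
    # Each sample's loss is scaled by its normalized weight before summing,
    # so high-scoring samples pull harder on the model update.
    weights = softmax(scores)
    loss = sum(w * l for w, l in zip(weights, per_sample_losses))
    return loss, weights

losses = [2.0, 1.0, 3.0]  # hypothetical per-sample losses

# Equal scores reproduce the ordinary batch mean.
uniform_loss, uniform_w = weighted_batch_loss(losses, [0.0, 0.0, 0.0])

# Raising one score shifts influence toward that sample.
skewed_loss, skewed_w = weighted_batch_loss(losses, [0.0, 2.0, 0.0])

print(uniform_loss, skewed_loss)
```

With equal scores this collapses to uniform training, which is a useful sanity check: the conventional approach is just the special case where the weighting model has no opinion.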

The mechanism is important because the authors are not simply saying “prefer high-quality data.” That would be too easy, and therefore probably wrong.

Their point is that the value of a sample depends on three things:

the model’s current stage of training;
the other samples in the batch;
the validation objective used to judge whether the weighting actually improves generalization.

A data point is not valuable in isolation. It is valuable because of the update direction it helps create. A polished reasoning example may be useful later but inefficient early. A broad, ordinary text sample may be useful early but less helpful once the model needs deeper expertise. The model’s appetite changes. Feeding it the same meal plan throughout training is convenient, not necessarily optimal.

The authors formalize this using a bi-level optimization setup. The lower level trains the language model using weighted training samples. The upper level updates the weighting model based on how the trained language model performs on a validation task. The point is not to minimize training loss directly; doing that could reward trivial behavior, such as assigning low weights to difficult examples. The point is to learn weights that improve validation performance after the language model has been updated.
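
In symbols, the two levels can be written roughly as follows. The notation is an assumption chosen for readability and may not match the paper's: θ denotes the language-model parameters, φ the weighting-model parameters, w_φ(x_i) the weight assigned to sample x_i in a batch of size B (in a batch-aware model it also depends on the other samples), ℓ the per-sample training loss, and L_val the validation loss.

```latex
\begin{aligned}
\text{lower level:}\quad & \theta^{*}(\phi) \;=\; \arg\min_{\theta} \sum_{i=1}^{B} w_{\phi}(x_i)\,\ell(x_i;\theta) \\
\text{upper level:}\quad & \min_{\phi}\; \mathcal{L}_{\mathrm{val}}\bigl(\theta^{*}(\phi)\bigr)
\end{aligned}
```

The upper level never sees the training loss directly. It only sees what the weighted update did to validation performance, which is what blocks the trivial strategy of down-weighting everything difficult.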

That small shift is the paper’s main contribution. It moves data curation from a pre-processing decision to a training-time control system.
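
A toy version of that control loop, written as a sketch under heavy simplifications rather than as the paper's algorithm. The "language model" is a single scalar `theta`, the "weighting model" is one learnable score per training sample, and the upper level differentiates through exactly one lower-level update. All names and numbers are illustrative assumptions.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy data: theta is fitted to scalar targets with loss 0.5 * (theta - y)^2.
train_targets = [2.0, 2.0, -2.0, -2.0]  # the last two play the role of bad data
val_target = 2.0                         # clean validation signal

theta = 0.0
scores = [0.0] * len(train_targets)      # one score per sample (assumption)
inner_lr, outer_lr = 0.1, 1.0

for _ in range(200):
    weights = softmax(scores)
    grads = [theta - y for y in train_targets]          # per-sample gradients
    mean_grad = sum(w * g for w, g in zip(weights, grads))

    # Lower level: one weighted gradient step on the model.
    new_theta = theta - inner_lr * mean_grad

    # Upper level: gradient of the validation loss at new_theta with respect
    # to each score, using d new_theta / d s_j = -inner_lr * w_j * (g_j - mean_grad).
    val_grad = new_theta - val_target
    hyper = [val_grad * (-inner_lr) * w * (g - mean_grad)
             for w, g in zip(weights, grads)]
    scores = [s - outer_lr * h for s, h in zip(scores, hyper)]

    theta = new_theta

final_weights = softmax(scores)
print(theta, final_weights)
```

Nothing tells the loop which samples are bad. The weights on the two corrupted targets shrink only because steps dominated by them make the validation loss worse, which is the whole argument in miniature.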

Layer	Conventional approach	DWM approach	Practical interpretation
Dataset construction	Select or filter data before training	Still select data, but do not stop there	Curation is necessary but incomplete
Batch usage	Treat samples uniformly	Weight samples differently within a batch	Not every token deserves equal influence
Training stage	Same selection logic throughout	Relearn data preference across stages	The useful data mix changes over time
Objective	Reduce training loss	Improve validation performance after weighted updates	Optimize for generalization, not just easier fitting
Business meaning	“Buy or collect more data”	“Govern data influence”	Data strategy becomes operational, not archival

This is why the title’s “forgetting” should not be read as erasure. It is closer to attention discipline. The model still sees selected data, but the training process learns when to listen less.

The evidence is modest, but the pattern is useful

The experiments use selected data from SlimPajama, a cleaned and deduplicated version of RedPajama, and train Llama-2-style models at 370M and 1.3B parameter scales. The authors select 30B tokens for the main comparisons and evaluate across nine downstream tasks, including ARC, BoolQ, HellaSwag, PIQA, SciQ, WinoGrande, LogiQA, and OpenBookQA.¹

The headline result is not that DWM magically transforms a small model into a frontier model. It does not. Thankfully. We have enough magic in AI slides already.

The useful result is narrower: dynamic weighting improves average scores across the tested settings, especially under two-shot evaluation, and the gains hold up when the weighting model is transferred to larger models or applied on top of other data-selection methods.

Setting	Baseline	With DWM	Average change	Interpretation
370M, random data, zero-shot	44.0	45.0	+1.0	Dynamic weighting improves random-selected data, but unevenly across tasks
370M, random data, two-shot	45.1	46.4	+1.3	The stronger gain appears in few-shot use, consistent with the validation-task explanation
370M, DSIR data, two-shot	45.3	46.4	+1.1	DWM can add value on top of another selection method
370M, QuRating data, two-shot	48.5	48.6	+0.1	If static selection is already strong, marginal gain may be small
1.3B, random data, two-shot	47.6	48.7	+1.1	A weighting model trained at smaller scale transfers upward
1.3B, QuRating data, two-shot	50.3	51.0	+0.7	Larger models may better absorb high-quality selected data
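
The changes in the table are simple differences, and it is worth seeing them next to their relative size. A small sketch that recomputes them from the baseline and DWM averages listed above; the relative-gain column is an addition for illustration and does not come from the paper.

```python
# (setting, baseline average, average with DWM), copied from the table above.
rows = [
    ("370M, random data, zero-shot", 44.0, 45.0),
    ("370M, random data, two-shot", 45.1, 46.4),
    ("370M, DSIR data, two-shot", 45.3, 46.4),
    ("370M, QuRating data, two-shot", 48.5, 48.6),
    ("1.3B, random data, two-shot", 47.6, 48.7),
    ("1.3B, QuRating data, two-shot", 50.3, 51.0),
]

changes = {}
relative_pct = {}
for setting, baseline, with_dwm in rows:
    changes[setting] = round(with_dwm - baseline, 1)
    relative_pct[setting] = round(100 * (with_dwm - baseline) / baseline, 2)

for setting in changes:
    print(f"{setting}: {changes[setting]:+.1f} points ({relative_pct[setting]:+.2f}%)")
```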

These are not dramatic numbers. That is precisely why they are worth reading carefully.

In production AI, a one-point gain is not automatically meaningful. It depends on cost, deployment sensitivity, task distribution, and whether the improvement is stable across the cases that matter. The paper’s result is best interpreted as evidence that training-time data utilization is a real lever, not as proof that DWM itself is the final answer.

The most interesting comparison is not “DWM beats everything.” It does not. QuRating remains strong in several settings. The better reading is that DWM is orthogonal to static selection. You can first select better data, then still ask how each selected sample should influence training at a particular stage.

For business users, that is the difference between procurement and operations. Buying better ingredients is good. Knowing when to use each ingredient is a different competence. Restaurants discovered this before AI labs did. Embarrassing, but educational.

The model’s taste changes during training

The paper’s most valuable section may not be the benchmark table. It is the analysis of dynamic data preference.

The authors examine which samples the weighting model prefers at different training stages. They use QuRating-style dimensions to compare preferred and unpreferred samples: writing quality, expertise, facts and trivia, and educational value. The pattern is revealing. Early in training, the weighting model tends to prefer data that scores well across broad quality dimensions. Later, the model becomes less attracted to merely better writing style and more attracted to expertise-heavy data.¹

This is the mechanism behind the business lesson.

A model early in training may need broad linguistic grounding. Later, it may benefit more from specialized, knowledge-rich, or reasoning-heavy examples. If you freeze the selection rule too early, you risk optimizing yesterday’s need. If you push high-expertise material too early or too uniformly, smaller models may saturate or fail to use it well.

The paper also reports an important failure mode: the weighting model learned after stage 2 shows preferences that are inconsistent with those of later stages, and using it coincides with a performance drop relative to uniform use of random data in that setting. That matters because it prevents a lazy conclusion. Dynamic weighting is not automatically better because it is dynamic. It is better only when the weighting model tracks a useful validation signal at the right granularity.

This is where the accepted misconception should die quietly: data selection is not just a filter.

Filtering asks, “Should this example be in the dataset?” Weighting asks, “How much should this example shape the model right now?” Curriculum learning asks a neighboring question: “When should the model see which kind of example?” Recent curriculum-learning work in LLM pretraining reports faster convergence and sustained improvements when data ordering is used as an efficiency mechanism.⁵ Together, these directions suggest that the dataset is becoming less like a warehouse and more like a control surface.
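
The three questions map onto three different operations on the same examples. A minimal sketch, assuming each example carries a hypothetical `quality` score and `difficulty` level. In a DWM-style system the weights would come from a learned model and change during training; the static score here is only a stand-in.

```python
# Hypothetical examples; the fields and values are invented for illustration.
examples = [
    {"id": "a", "quality": 0.9, "difficulty": 3},
    {"id": "b", "quality": 0.2, "difficulty": 1},
    {"id": "c", "quality": 0.6, "difficulty": 2},
]

def filter_examples(examples, min_quality):
    # Filtering: a yes/no decision about dataset membership.
    return [ex for ex in examples if ex["quality"] >= min_quality]

def weight_examples(examples):
    # Weighting: every example stays, but with a different share of influence.
    total = sum(ex["quality"] for ex in examples)
    return {ex["id"]: ex["quality"] / total for ex in examples}

def order_examples(examples):
    # Curriculum: the same examples, presented easiest first.
    return sorted(examples, key=lambda ex: ex["difficulty"])

kept_ids = [ex["id"] for ex in filter_examples(examples, min_quality=0.5)]
weights = weight_examples(examples)
ordered_ids = [ex["id"] for ex in order_examples(examples)]

print(kept_ids, weights, ordered_ids)
```

Note that example "b" disappears under filtering, survives with a small weight under weighting, and goes first under the curriculum. Same data, three different decisions.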

A warehouse can be large and still useless. A control surface has to be operated.

Validation choice is not a footnote; it is the steering wheel

One of the paper’s ablation results deserves more attention than it usually gets in quick summaries.

DWM learns weights by optimizing validation performance. The authors compare two validation choices: LAMBADA and a held-out SlimPajama validation set. LAMBADA performs better in their setup. In zero-shot evaluation, random selection averages 44.0; DWM with held-out SlimPajama reaches 44.6; DWM with LAMBADA reaches 45.0. In two-shot evaluation, the corresponding averages are 45.1, 46.1, and 46.4.¹

This is not just a technical detail. It says the weighting system learns what the validation task rewards. If the validation signal favors text understanding and prediction, the weighting model will learn data preferences aligned with that. If the validation signal is poorly matched to the intended deployment domain, the weighting model may become a very efficient machine for optimizing the wrong thing.
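
A toy illustration of how the validation target decides which training samples look useful. It scores each sample by the validation improvement after a single gradient step on that sample alone, which is a simpler technique than the paper's learned weighting model and is used here only to show the steering effect. The scalar model and the targets are illustrative assumptions.

```python
def val_loss(theta, val_target):
    return 0.5 * (theta - val_target) ** 2

def lookahead_gains(train_targets, val_target, theta=0.0, lr=0.5):
    # For each sample: take one gradient step on 0.5 * (theta - y)^2 using
    # that sample alone, then measure how much the validation loss dropped.
    gains = []
    for y in train_targets:
        stepped = theta - lr * (theta - y)
        gains.append(val_loss(theta, val_target) - val_loss(stepped, val_target))
    return gains

train_targets = [2.0, -2.0, 0.5]  # the same training pool in both runs

gains_a = lookahead_gains(train_targets, val_target=2.0)
gains_b = lookahead_gains(train_targets, val_target=-2.0)

best_a = max(range(len(train_targets)), key=gains_a.__getitem__)
best_b = max(range(len(train_targets)), key=gains_b.__getitem__)

print(gains_a, best_a)
print(gains_b, best_b)
```

The training pool never changes between the two runs. Only the validation target does, and the ranking of "useful" samples flips with it.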

Businesses should stare at this part for a moment.

Many firms want domain-specialized models: legal review, financial analysis, customer support, procurement, coding assistance, medical documentation, compliance triage. They often treat dataset assembly as the central challenge. But if dynamic data utilization becomes part of the training or fine-tuning process, then the validation set becomes a strategic artifact. It defines what “useful data” means.

A weak validation set gives you weak data weighting. A generic validation set gives you generic preferences. A contaminated validation set gives you a beautifully automated mistake.

This is why business AI teams should not reduce the paper to “we can train cheaper.” The stronger implication is “we need better evaluation governance before data weighting becomes safe to automate.”

The business value is cheaper diagnosis, not just cheaper training

The obvious business takeaway is cost efficiency. Better data utilization can reduce wasted training compute. That is real, especially when training budgets are constrained and every additional token has a GPU bill attached.

But the more durable value is diagnosis.

Dynamic weighting gives teams a way to ask operational questions that static datasets hide:

Business question	Technical proxy	Why it matters
Which data types help at early versus late training stages?	Stage-wise preferred samples	Supports staged training curricula instead of one static mixture
Which domains are overrepresented but underuseful?	Low-weight recurring domains	Reveals corpus bloat and procurement waste
Which examples improve target validation performance?	Weight influence	Connects data governance to measurable model behavior
Does a specialist dataset help the current model size?	Model-size transfer results	Avoids feeding small models material they cannot exploit
Is performance gain coming from selection or utilization?	Static baseline versus DWM comparison	Separates data acquisition value from training process value
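
The second row of the table, overrepresented but underuseful domains, can be approximated with a very small report. A sketch, assuming a team has logged a domain label and a learned weight for each training sample; the records below are hypothetical and the ratio is one possible diagnostic, not a metric from the paper.

```python
from collections import defaultdict

def domain_weight_ratio(records):
    # Ratio of a domain's share of total weight to its share of samples.
    # Below 1.0 means the domain takes up more room than influence.
    n = len(records)
    total_weight = sum(w for _, w in records)
    count = defaultdict(int)
    weight = defaultdict(float)
    for domain, w in records:
        count[domain] += 1
        weight[domain] += w
    return {d: (weight[d] / total_weight) / (count[d] / n) for d in count}

# Hypothetical (domain, learned weight) log entries.
records = [
    ("contracts_boilerplate", 0.02),
    ("contracts_boilerplate", 0.03),
    ("contracts_boilerplate", 0.05),
    ("annotated_edge_cases", 0.40),
    ("regulatory_updates", 0.50),
]

ratios = domain_weight_ratio(records)
flagged = sorted(d for d, r in ratios.items() if r < 1.0)

print(ratios)
print(flagged)
```

A flagged domain is a prompt for investigation rather than a verdict. Low weight at one stage says nothing about usefulness at another, which is the reason to track the ratio per stage rather than once.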

This is useful for firms that do not train frontier models from scratch. Most companies will not pretrain 1.3B-parameter models on 30B tokens, let alone frontier systems. But the same logic applies downstream: fine-tuning, domain adaptation, preference optimization, retrieval corpus design, synthetic data selection, and evaluation set construction.

If a company is building an internal legal assistant, the relevant question is not “How many documents can we dump into training or retrieval?” The question is which documents improve the model’s target behavior, at which stage, and under which validation objective. Contract boilerplate may be useful for language familiarity. Annotated edge cases may be useful for reasoning. Recent regulatory updates may be critical but should not dominate every update. The right data is not a pile. It is a schedule, a weighting policy, and an evaluation loop.

Cognaptus inference: the next competitive advantage in enterprise AI will not come from owning “more data” in the abstract. It will come from knowing how to assign influence to data under specific business objectives. That is a less glamorous sentence than “data is the new oil,” but it has the advantage of not being nonsense.

What the paper directly shows, and what it does not

The paper directly shows that dynamic data weighting can improve performance in the tested pretraining settings, that the learned weighting model can transfer to larger models and other selection methods, and that preferred data characteristics shift across training stages.¹

It also shows that the benefit is not uniform. Some task-level results decline even when averages improve. DWM adds only a small gain on top of QuRating for the 370M two-shot setting. The stage-2 preference inconsistency shows that dynamic preference estimation can be unstable. The validation-task ablation shows that the steering signal matters.

Those boundaries are not cosmetic. They affect deployment interpretation.

Claim	Supported by the paper?	Business reading	Boundary
Data weighting can improve selected-data training	Yes	Data utilization should be treated as a training lever	Gains are modest and task-dependent
Static data selection is incomplete	Yes	Filtering alone may waste value inside the selected corpus	Does not mean static selection is obsolete
Smaller weighting models can transfer upward	Partly	Cheaper pilot experiments may inform larger training runs	Transfer was tested within specific model families and scales
More data is always worse	No	Volume without selection can be inefficient	Large, diverse data remains valuable when well used
Validation design determines useful weighting	Yes	Evaluation governance becomes central to data strategy	Wrong validation objectives can misdirect the system
The method solves enterprise data risk	No	It helps manage influence, not legality or provenance	Compliance, privacy, and licensing remain separate controls

This section is necessary because AI commentary often performs a small magic trick: it takes a controlled research result and turns it into a universal business doctrine. Charming. Also expensive.

The correct doctrine is narrower. Data selection and utilization should become part of AI operations, but the method must be validated against the actual target task, model size, and risk profile.

The appendix tests robustness, not a second thesis

The ablations should be read as robustness checks around the central claim, not as separate grand theories.

The number-of-stages experiment suggests that more frequent updates can help capture changing data preferences, but also introduces overhead. In the reported two-shot 370M setting, two stages average 45.3, five stages average 46.4, and eight stages average 46.5. The improvement from five to eight stages is tiny compared with the extra complexity.¹
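
The diminishing return is easier to see as gain per additional stage. A small sketch using only the three averages quoted above; the per-stage framing is an illustrative convenience, not a quantity reported in the paper.

```python
# (number of stages, two-shot average at 370M), as quoted above.
stage_results = [(2, 45.3), (5, 46.4), (8, 46.5)]

marginal = []
for (s0, a0), (s1, a1) in zip(stage_results, stage_results[1:]):
    # Average points gained per stage added between two configurations.
    marginal.append(((s0, s1), (a1 - a0) / (s1 - s0)))

for (s0, s1), per_stage in marginal:
    print(f"{s0} -> {s1} stages: {per_stage:.3f} points per added stage")
```

Going from two to five stages buys roughly a third of a point per added stage. Going from five to eight buys about a thirtieth.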

That is a practical signal. Data-weighting systems should not be updated just because updating feels sophisticated. The goal is to match the pace at which useful data preference changes. Too slow, and the model trains under stale preferences. Too fast, and the overhead may buy little more than a nicer diagram.

The transfer experiment also needs careful reading. The paper estimates that reusing a weighting model trained with the 370M model during 1.3B training adds roughly 9% training overhead. The figure stays that low because the weighting model is used mainly for forward inference during larger-model training.¹ That overhead may be worthwhile if the performance gains matter for the target deployment. It may not be worthwhile if the benchmark gain is irrelevant, unstable, or below business tolerance.

In other words, the ROI question is not “Does DWM improve the average?” The ROI question is:

Does the gain appear on the tasks we care about, at a cost lower than alternative improvements such as better data cleaning, stronger evaluation sets, better prompting, retrieval, or a larger base model?

Less romantic. More useful.

From data hoarding to data governance

The paper belongs to a larger shift in AI: data is moving from passive asset to managed process.

The old enterprise instinct was to collect everything. Emails, manuals, tickets, contracts, PDFs, meeting transcripts, CRM notes, product documents, support logs. Throw them into a pipeline, add embeddings, sprinkle “AI transformation” on top, and hope procurement does not ask for a postmortem.

The new instinct should be different. Every data source should be evaluated by contribution, timing, risk, and fit.

This changes the operating model for AI teams:

Data teams need influence metrics, not only quality labels. A document can be clean and still unhelpful for a target model behavior.
Evaluation sets become strategic infrastructure. If validation steers weighting, then validation quality determines training quality.
Training should be staged. Early-stage broad competence and later-stage specialized reasoning may require different data emphasis.
Model size matters. Smaller models may not exploit the same high-quality data that larger models can use effectively.
Data governance must include down-weighting. Removing data is not the only intervention. Sometimes the right move is to reduce its training influence.

The best business interpretation is not that companies should rush to implement this exact DWM method tomorrow. Most should not. The method is research-grade and requires careful engineering. The immediate lesson is managerial: stop measuring AI readiness by data volume alone.

A company with a smaller but well-characterized, stage-aware, validation-linked corpus may be better positioned than a company with a giant folder called final_final_all_docs_v7.

The folder name, sadly, is not fictional enough.

Where this applies, and where it does not

This research is most relevant when a team has meaningful control over training or fine-tuning. It applies strongly to pretraining experiments, continued pretraining, domain adaptation, instruction tuning, and preference optimization. It is also conceptually relevant to retrieval systems, where ranking and weighting documents can shape model outputs even without parameter updates.

It applies less directly to teams that only use closed models through an API and never fine-tune. For them, the lesson translates into retrieval governance, evaluation design, prompt-data selection, and synthetic-data filtering rather than DWM-style training.

There is also a safety boundary. Dynamic data weighting does not automatically solve memorization, privacy leakage, copyright risk, or bias. It can reduce the influence of less useful data if the system learns that those samples hurt validation performance. But legal and ethical risk do not always show up as validation loss. A model can perform well and still memorize sensitive material. The compliance department will not be impressed by your average benchmark gain. They are famously difficult that way.

The method also depends on the validation objective. If the validation task is narrow, the weighting model may learn narrow preferences. If the validation data is noisy, the weighting model may learn noise with confidence. If the deployment domain changes, yesterday’s weighting policy may become stale.

So the boundary is clear: dynamic data utilization is a training-efficiency and generalization tool. It is not a substitute for dataset provenance, privacy controls, evaluation audits, or domain expertise.

Forgetting is becoming an optimization skill

The AI industry used to treat forgetting as failure. A model forgot because it was too small, undertrained, poorly fine-tuned, or damaged by continual learning. This paper points to a more mature view: selective forgetting can be a feature of competent training.

A model should not learn every pattern with equal intensity. It should not grant the same importance to boilerplate, duplicated snippets, shallow style markers, expert reasoning traces, ordinary prose, and domain-specific edge cases. The training process should decide what deserves influence, when, and for which objective.

That is why data selection now matters more than data volume. Not because volume is irrelevant, but because volume without influence control becomes expensive noise. The future of AI training will not be won by teams that merely own the largest piles of text. It will be won by teams that understand which parts of the pile should be amplified, delayed, discounted, or ignored.

For businesses, the practical shift is simple enough to say and hard enough to implement:

Do not ask only what data you have. Ask what each piece of data is allowed to teach the model.

That is where data strategy begins to look less like storage management and more like intelligence design.

Cognaptus: Automate the Present, Incubate the Future.

¹ Yang Yu, Kai Han, Hang Zhou, Yehui Tang, Kaiqi Huang, Yunhe Wang, and Dacheng Tao, “LLM Data Selection and Utilization via Dynamic Bi-level Optimization,” arXiv:2507.16178, 2025. https://arxiv.org/abs/2507.16178
² Sang Michael Xie, Shibani Santurkar, Tengyu Ma, and Percy Liang, “Data Selection for Language Models via Importance Resampling,” NeurIPS 2023. https://openreview.net/forum?id=uPSQv0leAu
³ Alexander Wettig, Aatmik Gupta, Saumya Malik, and Danqi Chen, “QuRating: Selecting High-Quality Data for Training Language Models,” arXiv:2402.09739, 2024. https://arxiv.org/abs/2402.09739
⁴ Mengzhou Xia, Sadhika Malladi, Suchin Gururangan, Sanjeev Arora, and Danqi Chen, “LESS: Selecting Influential Data for Targeted Instruction Tuning,” arXiv:2402.04333, 2024. https://arxiv.org/abs/2402.04333
⁵ Yang Zhang, Amr Mohamed, Hadi Abdine, Guokan Shang, and Michalis Vazirgiannis, “Beyond Random Sampling: Efficient Language Model Pretraining via Curriculum Learning,” arXiv:2506.11300, 2025. https://arxiv.org/abs/2506.11300
