TL;DR for operators

A useful scaling law does not merely say “bigger is better.” That is not a law; that is a purchasing department with a GPU account.

The paper behind this article studies whether the composition of pretraining data can change the compute-optimal balance between model size and downstream data in jet classification [1]. The answer, in this setting, is yes. Training from scratch on JetClass produces a nearly balanced scaling rule: as compute grows, the optimal model size and dataset size grow at roughly similar rates. But after pretraining on a JetClass-II corpus augmented with Beyond Standard Model resonance decays, the compute-optimal rule shifts sharply toward downstream data. More of the next compute budget should be spent processing more examples, not inflating the model.

The important detail is that pretraining alone is not the story. QCD-only pretraining reduces loss but barely moves the main compute-optimal exponents. The large shift appears when the pretraining data becomes both more diverse and better aligned with the downstream task’s multi-prong jet structures. In other words, “more pretraining” is the lazy reading. “Better-composed pretraining changes the resource frontier” is the useful one.

For business readers, the translation is not “particle physics discovered the new universal AI recipe.” Please, no. The practical lesson is narrower and more interesting: in domains where synthetic data can be generated with some control, data composition may become a steering wheel for model economics. It can influence whether the next marginal dollar should buy more parameters, more task-specific examples, or a better-designed pretraining mix.

The boundary is equally important. This is one simulated scientific domain, one architecture family, one supervised pretraining setup, one downstream fine-tuning task, and pretraining subsets where diversity and alignment move together. The paper does not prove that every enterprise should stop scaling models and start generating synthetic data. It does show that in controlled-data domains, scaling behavior is not just something to observe. It may be something to engineer.

The mechanism is budget reallocation, not pretraining pixie dust

The usual scaling-law conversation has a comforting rhythm. Increase compute. Increase model size. Increase data. Measure loss. Fit power laws. Then pretend the curve has delivered strategic wisdom from the mountain.

The better version asks a harsher question: when compute increases, where should the next unit go?

A compute-optimal scaling law answers that question by estimating how the optimal model size and optimal dataset size grow with total compute. The paper uses the familiar notation:

$$ N^\ast \propto C^a $$
$$ D^\ast \propto C^b $$

where $N^\ast$ is the compute-optimal model size, $D^\ast$ is the compute-optimal dataset size, $C$ is compute, and the exponents $a$ and $b$ indicate how fast model size and data should grow as compute grows. The authors use the consistency condition $a + b \approx 1$.
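
One way to see where that consistency condition comes from, under an assumption made here for illustration rather than quoted from the paper: training compute scales with the product of model size and examples processed.

$$ C \propto N D \;\Longrightarrow\; N^\ast D^\ast \propto C $$
$$ N^\ast D^\ast \propto C^{a} \cdot C^{b} = C^{a+b} \;\Longrightarrow\; a + b = 1 $$

Fitted exponents carry estimation error, so in practice the sum only needs to land near 1, which is why the condition is stated as approximate.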

In plain English: if $a$ is large, the scaling law is telling you to grow the model aggressively. If $b$ is large, it is telling you to spend relatively more compute on data.

This distinction matters because “pretraining improves performance” is not the same claim as “pretraining changes the compute-optimal allocation.” A pretrained model can start lower on the loss curve while preserving the same long-run scaling balance. That is a discount, not a new budget rule.

The paper’s central claim is stronger. It argues that pretraining data composition can reshape the scaling exponents themselves. The model does not merely get a better starting point. The training regime changes which resource becomes more valuable at the margin.

That is the mechanism worth caring about.

The paper chooses a domain where data can be designed, not merely scraped

The experiment sits in high-energy physics, specifically hadronic jet classification. Jets are streams of particles produced in high-energy collisions. For machine learning purposes, each jet is treated as a set of constituent particles with kinematic, trajectory, and particle-identification features.

The choice of domain is not decorative. Particle physics has high-fidelity simulators. That changes the economics. In language or image domains, acquiring genuinely new data is often expensive, legally messy, operationally slow, or all three, which is the industry’s preferred combination of pain. In particle physics, synthetic examples can be generated from simulator-defined processes. That makes pretraining corpus design a plausible experimental lever.

The authors use a transformer architecture that processes jet data as a point cloud. They vary the embedding dimension and number of heads while holding depth at four layers and the attention head dimension at eight, producing 12 model sizes from roughly 3,000 to 10.5 million parameters.

The pretraining data comes from JetClass-II, which contains 188 simulated jet classes. The downstream task uses JetClass, a 10-class jet classification dataset. The paper defines four pretraining subsets:

| Pretraining subset | Composition | Why it matters |
| --- | --- | --- |
| QCD | QCD jets only, 17 classes | Baseline pretraining on light-quark/gluon-like structures |
| QCD + res2p | QCD plus two-prong BSM resonance decays | Adds multi-prong structures closer to downstream labels |
| QCD + res34p | QCD plus three- or four-prong BSM resonance decays | Adds richer resonance topology and broader phase-space coverage |
| QCD + res2p + res34p | QCD plus all resonant decays | Most diverse and most aligned tested pretraining corpus |

The downstream JetClass task is not just “classify jets” in a vague sense. Its labels correspond to physical topologies: one-prong light quark/gluon jets, two-prong boson or Higgs decays, three-prong top decays, and four-prong Higgs-to-WW decays. That is why QCD-only pretraining is limited. It touches some downstream classes, but it provides little signal for the multi-prong resonance structures that dominate much of the classification problem.

The BSM-augmented subsets add exactly that missing coverage. Convenient? Yes. Also the point.

What each experiment is doing before we steal its meaning

The paper is compact, but the figures and appendix are doing different jobs. Treating all of them as equal “results” would be a small act of analytical vandalism.

| Paper component | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Scratch scaling analysis | Main baseline evidence | JetClass scratch training has near-balanced compute-optimal scaling | That all jet tasks or all architectures behave this way |
| QCD-only pretraining | Contrast / ablation-like test | Pretraining can lower loss while barely shifting the main scaling exponents | That QCD pretraining is useless for every downstream objective |
| BSM-augmented pretraining subsets | Main evidence for composition effect | More diverse and aligned pretraining shifts scaling toward downstream data | Whether diversity or alignment is the causal driver alone |
| Intermediate subsets in Figure 2 | Sensitivity / progression check | Scaling shifts are not limited to a single full-corpus comparison | A clean monotonic law over all possible data compositions |
| Fine-tuning loss curves in Figure 3 | Supporting evidence | BSM-enhanced pretraining lowers loss across model sizes, with larger benefits for larger models | That lower loss alone explains the exponent shift |
| Appendix parametric loss fit | Robustness check | The exponent shift is not merely an artifact of one extraction method | That the parametric loss surface fit is high quality in this dataset |

That last row matters. The appendix uses a parametric loss model,

$$ L(N,D) = L_\infty + \frac{A}{N^\alpha} + \frac{B}{D^\beta}, $$

then derives compute-optimal exponents analytically as:

$$ a = \frac{\beta}{\alpha + \beta} $$
$$ b = \frac{\alpha}{\alpha + \beta} $$

The authors report poor fit quality for this parametric method and therefore use the IsoFLOP profile approach as the main analysis. Still, the appendix produces similar qualitative exponent shifts. That makes it a robustness check, not a second thesis. The result survives a less cooperative fitting method. Always nice when the math sulks but still points in the same direction.
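
To make the appendix formulas concrete, here is a minimal Python sketch that compares the closed-form exponents with a brute-force search over model sizes. It assumes compute is simply $C = N \cdot D$, and the fit values in `params` are hypothetical numbers chosen for illustration, not values from the paper. The function names are also invented for this sketch.

```python
import math

def parametric_loss(n, d, l_inf, A, alpha, B, beta):
    """L(N, D) = L_inf + A / N^alpha + B / D^beta."""
    return l_inf + A / n**alpha + B / d**beta

def analytic_exponents(alpha, beta):
    """Closed-form exponents: a = beta/(alpha+beta), b = alpha/(alpha+beta)."""
    return beta / (alpha + beta), alpha / (alpha + beta)

def numeric_optimal_n(compute, params, steps=20000):
    """Grid-search the loss-minimizing N at fixed compute, ASSUMING C = N * D."""
    log_c = math.log10(compute)
    best_n, best_loss = None, math.inf
    for i in range(1, steps):
        n = 10 ** (log_c * i / steps)
        loss = parametric_loss(n, compute / n, **params)
        if loss < best_loss:
            best_n, best_loss = n, loss
    return best_n

def empirical_exponent(c_low, c_high, params):
    """Slope of log N* against log C between two compute budgets."""
    n_low = numeric_optimal_n(c_low, params)
    n_high = numeric_optimal_n(c_high, params)
    return math.log(n_high / n_low) / math.log(c_high / c_low)

# Hypothetical fit values, for illustration only (not from the paper).
params = dict(l_inf=0.5, A=1.0, alpha=0.6, B=1.0, beta=0.2)
a, b = analytic_exponents(params["alpha"], params["beta"])
print(f"analytic a={a:.3f}, b={b:.3f}")
print(f"numeric  a={empirical_exponent(1e12, 1e16, params):.3f}")

# Rescaling A, B, and L_inf changes the loss level but not the slope.
cheaper = dict(params, A=0.2, B=0.2, l_inf=0.3)
print(f"numeric  a={empirical_exponent(1e12, 1e16, cheaper):.3f} after rescaling")
```

In this form the exponents depend only on $\alpha$ and $\beta$. Lowering $A$, $B$, or $L_\infty$ is the "discount, not a new budget rule" case: the curve drops, but the allocation slope stays put.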

The main result: scratch training is balanced; BSM pretraining is data-hungry

The central table is the cleanest way into the result.

| Pretraining subset | $a$ for $N^\ast$ | $b$ for $D^\ast$ | Operational reading |
| --- | --- | --- | --- |
| Scratch | 0.517 ± 0.002 | 0.483 ± 0.002 | Grow model and data at similar rates |
| QCD | 0.526 ± 0.004 | 0.474 ± 0.004 | Almost the same scaling balance as scratch |
| QCD + res2p | 0.265 ± 0.005 | 0.735 ± 0.005 | Strong shift toward downstream data |
| QCD + res34p | 0.283 ± 0.005 | 0.717 ± 0.005 | Similar data-favoring shift |
| QCD + res2p + res34p | 0.224 ± 0.005 | 0.776 ± 0.005 | Strongest data-favoring regime tested |

The scratch baseline says: as compute increases, the optimal model size and dataset size should rise at almost equal rates. With $a = 0.517$ and $b = 0.483$, a tenfold compute increase implies roughly a 3.3x increase in optimal model size and a 3.0x increase in optimal data.

The full BSM-augmented pretraining case says something very different. With $a = 0.224$ and $b = 0.776$, the same tenfold compute increase implies roughly a 1.7x increase in optimal model size and a 6.0x increase in optimal data. That is not a subtle rebalancing. It is a change in purchasing instructions.
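
The arithmetic behind these multipliers is just the compute factor raised to each exponent. A small Python sketch, using the exponents from the table above (the function name `growth_multipliers` is invented for illustration):

```python
# Central exponent values from the article's table: (a for N*, b for D*).
EXPONENTS = {
    "Scratch": (0.517, 0.483),
    "QCD": (0.526, 0.474),
    "QCD + res2p": (0.265, 0.735),
    "QCD + res34p": (0.283, 0.717),
    "QCD + res2p + res34p": (0.224, 0.776),
}

def growth_multipliers(a, b, compute_factor):
    """Return (model-size multiplier, data multiplier) for a compute multiplier."""
    return compute_factor ** a, compute_factor ** b

# Print the implied multipliers for a tenfold compute increase.
for name, (a, b) in EXPONENTS.items():
    n_mult, d_mult = growth_multipliers(a, b, 10.0)
    print(f"{name:<22} model x{n_mult:.1f}  data x{d_mult:.1f}")
```

This uses only the central values and ignores the reported uncertainties, so treat the output as a rough planning guide rather than a precise forecast.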

The paper phrases this as a shift from a balanced model-data allocation to a strongly data-favoring regime. That is exactly the right interpretation, with one important caveat: this is about compute-optimal fine-tuning after a specific supervised pretraining setup, not an eternal law of all physics models.

Still, the magnitude is hard to ignore. The model-size exponent falls by a factor of about 2.3 from scratch to full BSM pretraining, while the data exponent rises from about 0.48 to about 0.78. A plausible reading is that pretraining has given the model representations that make additional parameters less urgent. Once the backbone has learned more relevant structure, the best use of downstream compute is to expose it to more examples.

This is the moment where the paper becomes more than a performance report.

The QCD-only result quietly kills the lazy interpretation

A weaker version of the story would be: pretraining helps, therefore pretraining changes scaling.

The paper does not support that lazy version. QCD-only pretraining does improve loss, but its main exponents are $a = 0.526$ and $b = 0.474$, almost indistinguishable from scratch training’s $a = 0.517$ and $b = 0.483$ for the purpose of compute allocation.

That is the useful negative result.

It says the scaling shift is not triggered simply by running a pretraining phase. It appears when the pretraining corpus contains examples that broaden the support of the data distribution and overlap with the downstream task’s relevant structure. In this case, BSM resonance decays add multi-prong topologies, mass-scale variation, and substructure cues that QCD-only jets do not supply.

The paper uses two terms for this: diversity and alignment. Diversity means a wider variety of physics processes and kinematic configurations. Alignment means overlap with the downstream task’s useful features: prong multiplicity, mass scale, and substructure.

The annoying but honest detail is that the paper does not isolate these two variables. The BSM-augmented datasets are both more diverse and more aligned. So the result should not be read as “diversity alone wins” or “alignment alone wins.” The tested lever is a bundled one: add pretraining data that is broader and more relevant.

That may be less slogan-friendly. It is also more likely to survive contact with engineering.

Figure 3 shows improvement, but the exponent shift is the real asset

The fine-tuning loss curves are visually satisfying. BSM-enhanced pretraining lowers loss across model sizes, and the benefit increases for larger models. That supports the claim that the pretraining data is doing useful representational work.

But the most business-relevant contribution is not that the curves are lower. Lower loss is performance evidence. The exponent shift is allocation evidence.

This distinction is not semantic nitpicking. Enterprises already know that better pretraining can improve downstream performance. They have built entire procurement rituals around that idea, many involving dashboards, vague embeddings, and someone saying “domain-specific” with devotional intensity.

What they often do not know is whether pretraining should change their scaling plan. Should the next tranche of compute go to a larger model, a broader synthetic corpus, a cleaner downstream dataset, or another pretraining variant?

The paper’s answer, in this domain, is that better-composed pretraining makes additional downstream data more valuable relative to additional parameters. That changes the decision boundary. It says: once representation quality improves, model capacity may become less of the bottleneck than data exposure.

The practical interpretation is not “small models always win.” It is more precise: a well-composed pretraining corpus can reduce how quickly optimal model size needs to grow with compute. The saved compute can then be redirected toward processing more data, assuming such data is available and meaningful.

That assumption is doing work. In particle physics, it is plausible because simulators can generate training examples. In many businesses, “more data” means more compliance review, more labeling, more system integration, and more people discovering that their source-of-truth database has five truths and no source.

The business translation: scaling laws become design variables when data is controllable

The business relevance is clearest in sectors where synthetic or simulator-derived data is not just a toy but part of the operating model: scientific computing, robotics, industrial simulation, cyber ranges, financial scenario generation, manufacturing process models, digital twins, logistics simulation, and some forms of medical imaging research.

In those settings, the data pipeline is not merely a passive feedstock. It can be designed.

| Paper shows directly | Cognaptus inference for operators | Boundary |
| --- | --- | --- |
| BSM-augmented pretraining shifts compute-optimal scaling toward data in jet classification | Synthetic data composition can affect whether the marginal budget should favor data generation or model growth | Proven here only for this jet-classification setup |
| QCD-only pretraining barely shifts the main exponents | Pretraining volume alone is not enough; task-relevant coverage matters | QCD may be useful for other downstream objectives |
| Intermediate BSM subsets produce data-favoring scaling | The effect is not only a one-off full-corpus artifact | Diversity and alignment are confounded in the design |
| Parametric appendix fit gives similar qualitative exponents despite poor fit quality | The qualitative direction is reasonably robust to extraction method | Fit quality limits overconfidence in precise exponent values |
| Larger models benefit more visibly from BSM-enhanced pretraining loss curves | Richer pretraining may unlock capacity that scratch training struggles to use efficiently | Not a guarantee for larger architectures or bigger datasets beyond the tested range |

The operating lesson is to stop treating synthetic data generation as a volume knob. It is a portfolio problem.

A useful synthetic data program should ask:

  1. Which downstream features matter?
  2. Which simulator settings generate coverage of those features?
  3. Which rare or structurally important cases are underrepresented?
  4. Which pretraining examples add diversity without becoming irrelevant noise?
  5. Which examples improve transfer rather than merely beautifying the dataset statistics?

The fifth question is where many projects politely fall into a ditch.

A corpus can be diverse in the way a junk drawer is diverse. That does not make it aligned. Conversely, a corpus can be aligned but narrow, teaching the model the same trick repeatedly until it performs competence in a very small theater. The paper’s strongest setup works because it changes both coverage and relevance.

The resource-allocation lesson is sharper than “use more synthetic data”

There is a temptation to turn this result into a synthetic-data manifesto. Resist it. The industry has enough manifestos. Some of them have logos.

The paper does not say synthetic data is inherently superior. It says that in a domain where synthetic data can be generated from meaningful physical processes, the composition of that data can reshape compute-optimal scaling. That is a much narrower claim, but also a more operationally useful one.

For AI teams, the relevant question is not “Can we generate more data?” The relevant question is “Can we generate data whose structure changes the bottleneck?”

That means the data-generation process needs to be tied to a model-economics loop:

| Step | Operator question | Failure mode |
| --- | --- | --- |
| Map downstream structure | What features, modes, or regimes drive downstream decisions? | Generating superficially varied but irrelevant examples |
| Design pretraining composition | Which synthetic sources broaden coverage and transfer to the task? | Treating all data diversity as useful |
| Run scaling sweeps | How do $N^\ast$ and $D^\ast$ shift under different mixtures? | Optimizing one benchmark score and missing the budget implication |
| Compare marginal costs | Is simulator data actually cheaper than model growth or labeling? | Assuming “synthetic” means “free” |
| Reallocate budget | Should the next compute tranche go to model size, data volume, or corpus redesign? | Scaling the most visible resource instead of the binding one |

This is where the paper’s mechanism-first reading pays off. The pretraining corpus is not just an accuracy enhancer. It is a way of altering which constraint binds later.

In business terms, that is a shift from “model training” to “model economics.” Less glamorous. More useful. The CFO may even stay awake.

The limitations are not footnotes; they define the operating perimeter

The paper is careful about what it does not prove, and the boundaries matter.

First, the domain is particle-physics jet classification. The paper’s advantage comes partly from physics simulators that can generate structured synthetic data. This does not automatically transfer to domains where the simulator is weak, biased, missing real-world mess, or basically a PowerPoint with equations.

Second, the experiment studies one downstream task: 10-class JetClass classification. The pretraining corpus is evaluated by fine-tuning on that task using a single-epoch protocol. Different downstream tasks could value different features, and therefore respond differently to the same pretraining mixtures.

Third, the architecture range is controlled but not enormous by modern foundation-model standards: roughly 3,000 to 10.5 million parameters. The study is meaningful inside its scale range, not a guarantee about billion-parameter scientific models.

Fourth, the pretraining setup is supervised and fixed at about 25.6 million jets per model across subsets. The paper does not vary pretraining compute budget, so it cannot tell us whether the data-favoring shift strengthens, weakens, or changes character under larger-scale pretraining.

Fifth, and most important for interpretation, BSM augmentation bundles diversity and alignment. The BSM subsets add more process classes and broader phase-space coverage, but they also add structures that overlap with the downstream task. The result shows the bundled lever works. It does not cleanly identify which component dominates.

Finally, the appendix parametric fit has poor quality. The authors use it as a secondary validation because the extracted exponents are qualitatively similar, but the poor fit is a warning against worshipping the third decimal place. The directional conclusion is more credible than any overly precise budget spreadsheet derived from these exact values.

What an AI program should do with this result

The practical move is not to copy JetClass-II. Unless your business also classifies hadronic jets, in which case congratulations on having a more interesting data pipeline than most.

The move is to copy the question.

When building domain models in controllable-data environments, teams should evaluate pretraining mixtures by their effect on downstream scaling, not only by their effect on immediate validation loss. That requires small but deliberate scaling sweeps across model size and data size. Yes, this costs compute. So does guessing, but guessing usually invoices later.
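
As a rough sketch of what such a sweep analysis could look like, the Python below takes (model size, loss) pairs at several fixed compute budgets, estimates the loss-minimizing model size at each budget, and fits the exponent $a$ as a log-log slope. This is a generic IsoFLOP-style readout, not the authors' pipeline. `toy_loss` is a hypothetical stand-in for actually training and evaluating a model, compute is assumed to be $C = N \cdot D$, and all names and numbers are illustrative.

```python
import math

def isoflop_optimum(points):
    """Estimate the loss-minimizing model size from (model_size, loss) pairs
    measured at one compute budget, via a parabola through the best 3 points."""
    pts = sorted(points)
    i = min(range(len(pts)), key=lambda k: pts[k][1])
    if i == 0 or i == len(pts) - 1:
        return pts[i][0]  # Sweep did not bracket the optimum; widen the grid.
    (x0, y0), (x1, y1), (x2, y2) = [(math.log(n), loss) for n, loss in pts[i - 1:i + 2]]
    denom = (x0 - x1) * (x0 - x2) * (x1 - x2)
    curv = (x2 * (y1 - y0) + x1 * (y0 - y2) + x0 * (y2 - y1)) / denom
    lin = (x2**2 * (y0 - y1) + x1**2 * (y2 - y0) + x0**2 * (y1 - y2)) / denom
    if curv <= 0:
        return pts[i][0]  # Not convex locally; fall back to the grid minimum.
    return math.exp(-lin / (2 * curv))

def fit_exponent(computes, optima):
    """Least-squares slope of log(optimum) against log(compute)."""
    lx = [math.log(c) for c in computes]
    ly = [math.log(n) for n in optima]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    sxx = sum((x - mx) ** 2 for x in lx)
    sxy = sum((x - mx) * (y - my) for x, y in zip(lx, ly))
    return sxy / sxx

def toy_loss(n, d):
    """HYPOTHETICAL loss surface standing in for real fine-tuning runs."""
    return 0.5 + n**-0.6 + d**-0.2

budgets = [1e10, 1e11, 1e12, 1e13]
model_sizes = [10 ** (k / 4) for k in range(4, 40)]
# Under the C = N * D assumption, each model size gets D = C / N examples.
sweep = {c: [(n, toy_loss(n, c / n)) for n in model_sizes] for c in budgets}
optima = [isoflop_optimum(sweep[c]) for c in budgets]
a_hat = fit_exponent(budgets, optima)
print(f"estimated a = {a_hat:.3f}, implied b = {1 - a_hat:.3f}")
```

Running the same readout once per candidate pretraining mixture, then comparing the fitted slopes, is the cheap version of the question the paper asks. With real runs the loss values are noisy, so a smoother fit over more than three points is likely to be safer.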

A sensible operating pattern would look like this:

| Decision area | Better question after this paper |
| --- | --- |
| Pretraining corpus design | Does this mixture add task-relevant coverage or just more examples? |
| Synthetic data generation | Which generated regimes change downstream bottlenecks? |
| Model-size planning | Does pretraining reduce the need to scale parameters aggressively? |
| Fine-tuning budget | Does the compute-optimal frontier now favor more examples? |
| Evaluation | Are we measuring allocation shifts, or only reporting lower loss? |

The most valuable teams will stop asking whether data is “real” or “synthetic” as if those are moral categories. They will ask whether the data has the right structure to change transfer, capacity use, and marginal returns.

That is less theatrical than debating model size. It is also where the money hides.

The conclusion: bigger is not a strategy when composition can move the frontier

The paper’s real contribution is not that BSM-augmented pretraining improves jet classification. That is useful, but expected enough not to deserve a parade.

The real contribution is showing that pretraining composition can move the compute-optimal frontier. Scratch training says: grow model and data together. QCD-only pretraining says: lower the loss, but keep roughly the same scaling balance. BSM-augmented pretraining says: the model now benefits more from additional data than from rapidly increasing parameters.

That distinction is the article.

For scientific AI and other simulator-rich domains, this changes how model-building should be planned. The pretraining corpus is not just a preliminary dataset. It is an economic instrument. It can decide whether future compute should buy parameters or examples.

The catch is that the instrument must be tuned. “More data” is not the strategy. “More relevant structural coverage, tested against downstream scaling behavior” is the strategy. A bit less catchy, perhaps. Also less likely to waste six months.

Cognaptus: Automate the Present, Incubate the Future.


1. Jan-Lucas Uslu, Kevin Greif, Daniel Whiteson, and Benjamin Nachman, “Towards Engineering Scaling Laws with Pretraining Data Composition,” arXiv:2606.19781, 2026.