Synthetic and Sensibility: Why More Data Needs a Control Stack

Synthetic data has become the convenient answer to almost every uncomfortable AI training question. Need more reasoning traces? Generate them. Need domain examples? Generate them. Need privacy-preserving replacements for customer data? Generate them. Need a dataset that looks suspiciously like a benchmark but not too suspiciously like a benchmark? Generate it, then call it “curriculum design.”

This is not entirely foolish. Synthetic data is cheap, scalable, and sometimes extremely effective. But the current business conversation often treats synthetic data as if it has a natural quality score: stronger source model, cleaner prose, more benchmark-like examples, larger corpus, better result. That view is comforting. It is also too simple.

Two recent arXiv papers are useful because they attack the same problem from opposite ends of the pipeline. Dai et al.’s UniCo paper asks what synthetic data must look like if the goal is to deliberately train causal reasoning: broad coverage, exact answers, multiple representations, and filters against shortcut learning.¹ Alemohammad et al.’s prompt-free self-training paper asks a different question: even if synthetic text is available, when is it actually yours to learn from? Their answer is colder: utility is relational. The same synthetic corpus can help one student model, underperform for another, and fail to be predicted by simple quality proxies.²

Together, they point to a better business interpretation: synthetic data is not a commodity input. It is a controlled training interface. The point is not to ask whether synthetic data is “good.” The point is to ask whether it passes a sequence of gates: target definition, generation validity, shortcut resistance, diversity, student compatibility, downstream transfer, and side-effect audit.

In other words, more data is not the strategy. More controlled admission is the strategy. Less glamorous, perhaps. Also less likely to waste a quarter of your AI budget on beautifully formatted nonsense.

The shared problem: synthetic data is easy to produce and hard to trust

The two papers are not doing the same experiment. That is precisely why they are useful together.

The UniCo paper works from the data side. It starts with a specific capability target: causal reasoning. The authors argue that existing causal reasoning datasets often work better as benchmarks than as training data because they are narrow, under-specified, ambiguous, or too easy to solve through shortcuts. UniCo therefore builds synthetic causal examples from structural causal models, computes ground-truth answers through exact causal inference, covers 18 causal query types across association, intervention, and counterfactual reasoning, and renders the same underlying causal structures in symbolic, code, and natural-language forms.

The self-training paper works from the model side. It strips synthetic training down to an intentionally weak setting: base models are fine-tuned on unconditional text sampled from the beginning-of-sequence token, with no prompts, no verifier, no reward model, no teacher, and no task specification. The point of this minimalist setup is not to propose a production recipe. The point is to isolate whether synthetic utility comes from the corpus itself or from its relationship with the student model.

Those roles fit together:

Question	UniCo’s answer	Self-training paper’s answer	Business translation
What should synthetic data contain?	Verifiable task structure, broad coverage, and quality controls.	Not addressed as a designed curriculum; the setting is deliberately weak.	Build data for a capability, not for volume.
Can formal correctness be enough?	No. UniCo also controls ambiguity, missing conditions, representation diversity, and shortcut-solvable cases.	No. Even fluent or high-likelihood data may not help the consuming student.	Validation has to include both data-side and model-side tests.
Does stronger source data always help?	The paper compares causal-data frameworks, not arbitrary source models.	No. Self-generated and same-lineage data can beat stronger but mismatched sources.	Vendor prestige is not compatibility. Annoying, but useful to know early.
What counts as success?	Better in-distribution and out-of-distribution causal reasoning, plus improved reasoning faithfulness.	Selective benchmark gains, failed intrinsic proxies, and a utility–extractability effect.	Measure training effect, transfer, and side effects, not dataset aesthetics.

This is a complementary logic chain, not a pair of separate paper summaries. UniCo explains how to construct synthetic data that is worth learning from. The self-training paper explains why “worth learning from” still cannot be judged without the student model that consumes it.

Gate 1: define the capability before generating the data

Synthetic data projects often begin with a dangerous sentence: “Let’s generate more examples.”

More examples of what?

UniCo’s first useful lesson is that a capability must be decomposed before it can be trained. The paper does not treat causal reasoning as one generic benchmark score. It separates causal queries across Pearl’s causal ladder: association, intervention, and counterfactual reasoning. Within those levels, it covers 18 query types, including marginal probability, conditional probability, average treatment effect, backdoor adjustment, frontdoor adjustment, counterfactual probability, natural direct and indirect effects, probability of necessity, and probability of sufficiency.

That matters because models can appear competent when the test stays on one rung of the ladder. A model that handles association questions may still fail when asked to reason about interventions. A model that can solve symbolic causal notation may fail when the same causal structure is embedded in code or natural language. UniCo’s experiments show large performance gaps across representation forms and causal levels before training, which is exactly the kind of brittleness businesses usually discover after deployment, when it is more expensive and more embarrassing.

The business version of this gate is simple:

Do not generate synthetic data for a vague model behavior. Generate it for a named capability with known subskills, known surface forms, and known failure modes.

For a financial analyst assistant, the target is not “better reasoning.” It might be “distinguish correlation from causal explanation in earnings-call narratives.” For a legal workflow assistant, it might be “recognize when a contractual conclusion depends on a missing condition.” For a healthcare summarization tool, it might be “avoid converting temporal association into clinical causality.”

Once the target is defined at that level, synthetic data becomes an engineering object rather than a content pile.
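
One way to make that target concrete is a small "capability card" record that must be complete before any generation starts. The sketch below is illustrative only: the class name, its fields, and the readiness rule are assumptions for this article, not something either paper defines.

```python
from dataclasses import dataclass, field


@dataclass
class CapabilityCard:
    """Hypothetical record of what a synthetic dataset is meant to teach."""
    target_behavior: str
    subskills: list
    surface_forms: list
    known_failure_modes: list = field(default_factory=list)

    def missing_fields(self) -> list:
        """Return the names of fields that are still empty."""
        missing = []
        if not self.target_behavior.strip():
            missing.append("target_behavior")
        for name in ("subskills", "surface_forms", "known_failure_modes"):
            if not getattr(self, name):
                missing.append(name)
        return missing

    def ready_for_generation(self) -> bool:
        """Generation is allowed only when every field has been filled in."""
        return not self.missing_fields()


card = CapabilityCard(
    target_behavior=(
        "Distinguish correlation from causal explanation "
        "in earnings-call narratives"
    ),
    subskills=["association", "intervention", "counterfactual"],
    surface_forms=["symbolic", "code", "natural language"],
)
print(card.missing_fields())  # ['known_failure_modes']
```

The card above is deliberately incomplete: nobody has yet written down how the model is expected to fail, so the generation job should not start.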

Gate 2: make the synthetic examples verifiable, not just plausible

The next gate is answerability.

UniCo constructs examples from structural causal models. That design choice matters because it gives each synthetic question an underlying causal graph and probability structure. The answer is not whatever a language model thinks sounds reasonable. It is computed by exact probabilistic and causal inference, including graph surgery, adjustment methods, and twin-network reasoning for counterfactual cases.

This is where synthetic data becomes valuable: not because it is synthetic, but because the synthetic world is fully specified. A fully specified synthetic world can produce labels that are difficult to obtain from messy real-world data. It can generate intervention questions without running unethical experiments. It can generate counterfactual questions without pretending the observed world contains both factual and hypothetical outcomes. Synthetic data, used properly, is not a shortcut around rigor. It is a way to create rigor under controlled assumptions.
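
To see what "fully specified" buys, consider a toy structural causal model with one confounder. The sketch below is not UniCo's generator; the graph (Z → X, Z → Y, X → Y), every probability, and all function names are assumptions chosen so the arithmetic is easy to follow. It computes an observational answer by enumeration and an interventional answer by graph surgery, using exact fractions. Counterfactual queries via twin networks are not shown.

```python
from fractions import Fraction as F
from itertools import product

# Toy SCM (all numbers are illustrative assumptions): Z -> X, Z -> Y, X -> Y.
P_Z1 = F(1, 2)                                   # P(Z=1)
P_X1_GIVEN_Z = {0: F(1, 5), 1: F(4, 5)}          # P(X=1 | Z=z)
P_Y1_GIVEN_XZ = {(0, 0): F(1, 10), (0, 1): F(1, 2),
                 (1, 0): F(3, 10), (1, 1): F(7, 10)}  # P(Y=1 | X=x, Z=z)


def bern(p_one, value):
    """Probability that a binary variable with P(1)=p_one takes `value`."""
    return p_one if value == 1 else 1 - p_one


def joint(z, x, y, do_x=None):
    """Joint probability; under do(X=do_x) the factor P(x | z) is removed."""
    p = bern(P_Z1, z)
    if do_x is None:
        p *= bern(P_X1_GIVEN_Z[z], x)
    elif x != do_x:
        return F(0)
    return p * bern(P_Y1_GIVEN_XZ[(x, z)], y)


def p_y1_given_x1():
    """Observational query P(Y=1 | X=1), by enumeration."""
    num = sum(joint(z, 1, 1) for z in (0, 1))
    den = sum(joint(z, 1, y) for z, y in product((0, 1), repeat=2))
    return num / den


def p_y1_do_x1():
    """Interventional query P(Y=1 | do(X=1)), by graph surgery."""
    return sum(joint(z, 1, 1, do_x=1) for z in (0, 1))


print(p_y1_given_x1(), p_y1_do_x1())  # 31/50 1/2
```

In this toy world the observational answer (0.62) overstates the effect of forcing X (0.50) because Z drives both X and Y. A label computed this way is exact under the stated assumptions, which is the property a language-model-written answer cannot guarantee.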

But UniCo also shows why verifiability is not enough. The authors explicitly check for insufficient conditions, ambiguous question formulations, and incorrect answers. In their human evaluation sample, UniCo had no such cases, while comparison datasets had ambiguity or missing-condition problems. That is not a decorative quality check. In post-training, ambiguity is not harmless noise. It teaches the model that guessing is acceptable.

For business teams, this means synthetic-data validation should include at least three layers:

Validation layer	What it asks	Why it matters
Structural validity	Is the synthetic world fully specified?	Prevents unsolvable or underdetermined examples.
Label validity	Is the target answer computed or checked by a reliable method?	Prevents fluent wrong supervision.
Instruction validity	Is the question unambiguous and complete?	Prevents the model from learning to infer missing constraints by vibes.

The third layer is the one people skip. Naturally. It is harder to automate and less exciting to announce. It is also where many enterprise failures live.
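
The three layers can be expressed as three independent checks on each example. The sketch below is a hypothetical shape for such a check: the field names, the tolerance, and the idea of a reviewer flag standing in for the hard-to-automate ambiguity judgment are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class SyntheticExample:
    world_params_required: set    # parameters the generating world needs
    world_params_defined: set     # parameters that were actually set
    stored_label: float           # answer shipped with the example
    recomputed_label: float       # answer from an independent solver run
    prompt_params_needed: set     # facts the question depends on
    prompt_params_stated: set     # facts the prompt actually states
    reviewer_flagged_ambiguous: bool = False


def validate(ex: SyntheticExample, tol: float = 1e-9) -> dict:
    """Return a pass/fail verdict for each of the three layers."""
    return {
        "structural": ex.world_params_required <= ex.world_params_defined,
        "label": abs(ex.stored_label - ex.recomputed_label) <= tol,
        "instruction": (
            ex.prompt_params_needed <= ex.prompt_params_stated
            and not ex.reviewer_flagged_ambiguous
        ),
    }


example = SyntheticExample(
    world_params_required={"P(Z)", "P(X|Z)", "P(Y|X,Z)"},
    world_params_defined={"P(Z)", "P(X|Z)", "P(Y|X,Z)"},
    stored_label=0.5,
    recomputed_label=0.5,
    prompt_params_needed={"P(Z)", "P(Y|X,Z)"},
    prompt_params_stated={"P(Y|X,Z)"},   # P(Z) was left out of the prompt
)
print(validate(example))
# {'structural': True, 'label': True, 'instruction': False}
```

The example passes the first two layers and fails the third: the world is complete and the label is right, but the prompt omits a condition the answer depends on.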

Gate 3: remove shortcut-solvable examples before they teach the wrong lesson

One of the most important details in UniCo is the treatment of “causally naive” questions. These are examples that appear to require a higher-level causal operation but can actually be solved by a lower-level calculation.

For example, an intervention question may be answerable by simply comparing conditional probabilities if the graph structure makes the action-observation distinction collapse. Technically, the answer can still be correct. Pedagogically, the example is dangerous. It looks like causal reasoning but rewards a shortcut.

UniCo finds that, without control, a large share of intervention and counterfactual samples can be causally naive. The authors reduce this ratio through query-specific rejection sampling, and their ablation suggests that shortcut-heavy training data weakens the intended learning effect.

This is the synthetic-data version of a familiar business problem: KPI gaming. If the training set rewards surface correlations, the model learns surface correlations. If the evaluation set rewards template matching, the model learns template matching. If the dataset says “think causally” but allows the answer to be obtained by observational arithmetic, the model will take the cheaper route. Models, like interns and consultants, are sensitive to incentives.

The lesson generalizes beyond causality. In code training, a shortcut-solvable example might be one where the function name gives away the solution. In legal reasoning, it might be a fact pattern where the conclusion is implied by a keyword rather than by legal structure. In customer-support training, it might be a ticket where sentiment alone predicts escalation, even though the desired capability is policy interpretation.

A serious synthetic-data pipeline should therefore include a shortcut audit:

Intended reasoning path → possible easier path → reject, revise, or downweight

This is not just dataset hygiene. It is capability governance. If the synthetic example does not force the desired operation, it may train a cheaper behavior that later masquerades as competence.
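
A shortcut audit for the intervention case can be sketched as rejection sampling: draw a candidate world, check whether the lower-rung calculation already gives the answer, and discard it if so. This is a simplified stand-in, not UniCo's query-specific procedure. The graph (Z → X, Z → Y, X → Y), the sampling ranges, the 0.02 tolerance, and all function names are assumptions.

```python
import random


def observational(pz, px, py):
    """P(Y=1 | X=1); px[z] = P(X=1 | Z=z), py[z] = P(Y=1 | X=1, Z=z)."""
    w0, w1 = (1 - pz) * px[0], pz * px[1]
    return (w0 * py[0] + w1 * py[1]) / (w0 + w1)


def interventional(pz, px, py):
    """P(Y=1 | do(X=1)) by backdoor adjustment over Z."""
    return (1 - pz) * py[0] + pz * py[1]


def is_causally_naive(pz, px, py, tol=0.02):
    """True when plain conditioning already yields the interventional answer."""
    return abs(observational(pz, px, py) - interventional(pz, px, py)) < tol


def sample_scm(rng):
    """Draw a random world; about half have no confounding of X by Z."""
    pz = rng.uniform(0.2, 0.8)
    if rng.random() < 0.5:
        p = rng.uniform(0.2, 0.8)
        px = {0: p, 1: p}                      # Z does not influence X
    else:
        px = {0: rng.uniform(0.05, 0.3), 1: rng.uniform(0.7, 0.95)}
    py = {0: rng.uniform(0.05, 0.95), 1: rng.uniform(0.05, 0.95)}
    return pz, px, py


def rejection_sample(rng, n, max_tries=10_000):
    """Keep only worlds where the shortcut gives a different answer."""
    kept = []
    for _ in range(max_tries):
        if len(kept) == n:
            break
        scm = sample_scm(rng)
        if not is_causally_naive(*scm):
            kept.append(scm)
    return kept


kept = rejection_sample(random.Random(0), n=20)
print(len(kept), all(not is_causally_naive(*s) for s in kept))
```

Every world with `px[0] == px[1]` is rejected, because without confounding the conditional probability and the interventional probability coincide. Those are exactly the examples that look like intervention questions but reward observational arithmetic.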

Gate 4: diversify the representation, not just the topic

A common way to diversify data is to vary topics. UniCo makes a more precise point: diversify the representation of the same underlying skill.

The paper uses three forms: symbolic notation, executable code, and natural-language narratives. This matters because real business tasks rarely announce their formal structure. A causal graph in a textbook is polite. A causal dependency buried in a product incident report is not.

Before training, the models in UniCo show substantial gaps between symbolic, code, and natural-language versions of causal problems. After training on diverse forms and levels, performance improves more broadly than when training is restricted to one representation or one causal level. The important point is not that every business needs symbolic causal data. The point is that training on one clean surface form does not guarantee transfer to the messy forms users actually submit.
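
The idea of one underlying structure with several surfaces can be sketched as a single specification rendered three ways with a shared answer. The templates below are invented for illustration and are not UniCo's. The customer-discount story, the parameter names, and the assumption that the answer was computed elsewhere by exact inference are all hypothetical.

```python
def render_forms(p_z1: float, p_y1_x1: dict, answer: float) -> list:
    """Render one interventional query in three surface forms.

    p_z1 is P(Z=1); p_y1_x1[z] is P(Y=1 | X=1, Z=z); `answer` is the
    precomputed value of P(Y=1 | do(X=1)) shared by all three prompts.
    """
    symbolic = (
        f"P(Z=1)={p_z1}; P(Y=1|X=1,Z=0)={p_y1_x1[0]}; "
        f"P(Y=1|X=1,Z=1)={p_y1_x1[1]}. Compute P(Y=1|do(X=1))."
    )
    code = (
        f"p_z1 = {p_z1}\n"
        f"p_y1_given_x1 = {{0: {p_y1_x1[0]}, 1: {p_y1_x1[1]}}}  # keyed by z\n"
        "# Q: if x is forced to 1 for every unit, "
        "what is the probability that y == 1?"
    )
    narrative = (
        f"{p_z1:.0%} of customers are heavy users. With a discount, "
        f"{p_y1_x1[0]:.0%} of light users and {p_y1_x1[1]:.0%} of heavy "
        "users renew. If every customer were given the discount, "
        "what share would renew?"
    )
    prompts = {
        "symbolic": symbolic,
        "code": code,
        "natural_language": narrative,
    }
    return [
        {"form": form, "prompt": prompt, "answer": answer}
        for form, prompt in prompts.items()
    ]


for item in render_forms(0.5, {0: 0.3, 1: 0.7}, answer=0.5):
    print(item["form"], "->", item["answer"])
```

Because all three prompts carry the same answer key, any gap in a model's accuracy across them is attributable to the surface form rather than to the underlying problem.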

For an enterprise AI system, representation diversity should be designed around the workflow:

Capability	Clean form	Messy operational form
Causal reasoning	Formal intervention query	Executive memo mixing correlation, explanation, and recommendation
Policy reasoning	Rule table	Customer email with partial facts and emotional framing
Financial reasoning	Spreadsheet formula	Narrative management commentary and footnotes
Technical debugging	Minimal reproducible example	Log fragments, screenshots, and vague user descriptions

The business implication is uncomfortable but practical: if your synthetic data looks cleaner than your operating environment, the model may be learning an office fantasy. The real world will decline to cooperate.

Gate 5: test whether the student model can actually learn from it

UniCo explains how to build strong source-side synthetic supervision. The second paper adds the missing warning label: even well-constructed synthetic data is not intrinsically useful.

Alemohammad et al. test prompt-free unconditional self-training. The setup is intentionally sparse: a source model samples plain text from the beginning-of-sequence token, and a student model is fine-tuned on that text. There are no task prompts or external labels. If performance improves, the improvement cannot be credited to a carefully designed teacher or verifier. It must come from the interaction between the student’s existing latent capabilities and the weak synthetic signal.

The findings are particularly relevant for businesses evaluating synthetic-data vendors or internal model-distillation plans. Self-generated data helps most. Same-lineage transfer comes next. A larger but differently trained source model can transfer worse than a smaller, more compatible source. Cross-family transfer is weaker and can be harmful. The authors also test obvious proxies and find them insufficient: benchmark semantic similarity does not explain the gains, and average likelihood under the student does not predict which corpora help.

This is the part procurement teams will enjoy least. A stronger source model does not automatically mean stronger training data for your model. A synthetic corpus can be fluent, benchmark-adjacent, high-likelihood, and still induce the wrong update direction. Quality is not floating in the dataset like a vitamin. It appears in the interaction.

The authors are careful about mechanism: they do not directly measure gradient alignment, so student-source compatibility remains an empirically supported explanatory hypothesis rather than a proven mechanism. That limitation matters. Still, the operational message is already strong enough:

Synthetic data must be tested on the exact student model and task family before it is scaled.

Not on a related model. Not on a leaderboard cousin. Not on the source model’s own evaluation deck, lovingly prepared by the vendor. On the model you will actually update.
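
A compatibility pilot can be as simple as fine-tuning the same student on each candidate corpus and ranking the measured gains. In the sketch below, the scores are made up for illustration and are not results from the paper. The source names, the three-seed setup, and the 0.01 admission threshold are likewise assumptions.

```python
from statistics import mean


def rank_sources(baseline_runs, pilot_runs, min_gain=0.01):
    """Rank candidate corpora by mean gain over the untrained student."""
    base = mean(baseline_runs)
    rows = []
    for source, scores in pilot_runs.items():
        gain = mean(scores) - base
        rows.append({
            "source": source,
            "gain": round(gain, 4),
            "admit": gain >= min_gain,
        })
    return sorted(rows, key=lambda row: row["gain"], reverse=True)


# Made-up pilot scores: one student model, one task family, three seeds each.
baseline = [0.50, 0.51, 0.49]
pilots = {
    "larger_cross_family_source": [0.48, 0.49, 0.47],
    "self_generated": [0.55, 0.56, 0.54],
    "same_lineage_source": [0.53, 0.52, 0.54],
}
for row in rank_sources(baseline, pilots):
    print(row)
```

The ranking is computed from the student's own post-training scores, not from any property of the corpora. That is the point: no field in this code describes how fluent, large, or prestigious a source is.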

The combined control stack

The two papers produce a clean synthetic-data control stack:

1. Capability target
   ↓
2. Verifiable generation
   ↓
3. Ambiguity and shortcut filters
   ↓
4. Representation diversity
   ↓
5. Student compatibility pilot
   ↓
6. Downstream transfer evaluation
   ↓
7. Side-effect and audit review

This stack is the central business takeaway. Synthetic data should not flow directly from generator to training job. It should pass through admission control.

Control stack layer	Paper support	Business decision it informs
Capability target	UniCo decomposes causal reasoning across query types and ladder levels.	What exactly are we trying to teach?
Verifiable generation	UniCo grounds answers in exact causal inference.	Can we trust the labels?
Shortcut filtering	UniCo reduces causally naive examples.	Are we rewarding the intended skill or a cheaper proxy?
Representation diversity	UniCo spans symbolic, code, and natural language forms.	Will the learned skill survive workflow variation?
Student compatibility	Self-training results show utility depends on the source-student pair.	Does this data actually help our model?
Transfer evaluation	UniCo tests out-of-distribution causal benchmarks and faithfulness; the self-training paper tests multiple benchmarks and source pairings.	Does improvement move beyond the training distribution?
Side-effect audit	The self-training paper studies utility versus memorization/extractability.	What else changed when the model improved?
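
Admission control of this kind can be sketched as an ordered list of gates that stops at the first failure. The gate names follow the stack above. The candidate fields and every threshold are illustrative assumptions, not values taken from either paper.

```python
# Each gate is (name, check). A missing field fails its gate by default.
GATES = [
    ("capability target",
     lambda c: bool(c.get("capability_card"))),
    ("verifiable generation",
     lambda c: c.get("label_error_rate", 1.0) == 0.0),
    ("ambiguity and shortcut filters",
     lambda c: c.get("shortcut_ratio", 1.0) <= 0.05),
    ("representation diversity",
     lambda c: len(c.get("surface_forms", [])) >= 2),
    ("student compatibility pilot",
     lambda c: c.get("pilot_gain", 0.0) > 0.0),
    ("downstream transfer evaluation",
     lambda c: c.get("ood_gain", 0.0) > 0.0),
    ("side-effect and audit review",
     lambda c: not c.get("regressions", ["unchecked"])),
]


def admit(candidate: dict, gates=GATES):
    """Return (True, None) if all gates pass, else (False, first_failed_gate)."""
    for name, check in gates:
        if not check(candidate):
            return False, name
    return True, None


candidate = {
    "capability_card": "causal reasoning in earnings-call narratives",
    "label_error_rate": 0.0,
    "shortcut_ratio": 0.02,
    "surface_forms": ["symbolic", "code", "natural language"],
    "pilot_gain": -0.01,   # the student got worse in the pilot
}
print(admit(candidate))  # (False, 'student compatibility pilot')
```

The candidate clears every data-side gate and is still rejected, because the model-side pilot shows no gain. Ordering the gates this way also keeps the cheap checks ahead of the ones that require training runs.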

Notice what is missing: “Generate 10 million examples and hope.” A bold strategy, but perhaps not one to place near customer-facing systems.

What the papers show, and what business should infer

It is worth separating the papers’ evidence from the business interpretation.

The papers show that carefully designed synthetic causal data can train smaller models into stronger causal reasoners, including improvements on out-of-distribution causal benchmarks and reasoning-faithfulness tests. They also show that weak self-training can produce selective gains when the synthetic corpus is compatible with the student model, and that simple corpus-level proxies fail to predict utility.

The business interpretation is that synthetic-data programs should be governed as controlled post-training experiments. That interpretation goes beyond the papers, but it follows naturally from their combined logic. Businesses do not merely need synthetic data. They need a repeatable way to decide which synthetic data is allowed to influence the model.

A practical adoption workflow would look like this:

1. Define the capability. Write a capability card: target behavior, subskills, expected input forms, unacceptable shortcut paths, and deployment risks.
2. Generate a small verified curriculum. Start with hundreds or thousands of examples, not millions. Include labels or answers that can be checked by rules, solvers, simulations, or expert review.
3. Run shortcut and ambiguity audits. Identify examples solvable by superficial cues, missing conditions, or accidental leakage from templates.
4. Train pilot models. Fine-tune the exact student model using small controlled variants of the dataset.
5. Compare against baselines. Use real-data replay, existing instruction data, no-training baselines, and ablated versions of the synthetic set.
6. Evaluate transfer. Test not only on matched tasks but on adjacent tasks and operationally realistic cases.
7. Audit side effects. Track hallucination, memorization, privacy-relevant extraction, calibration, and performance regressions on unrelated capabilities.
8. Scale only after compatibility is observed. If the pilot does not move the target metric cleanly, more synthetic data may simply make the wrong update more confidently.
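
The last steps of this workflow amount to a decision rule: the full synthetic run must beat every baseline and ablation on the target metric without regressing elsewhere. The sketch below assumes all metrics are "higher is better". The run names, metric names, thresholds, and numbers are made up for illustration.

```python
def should_scale(results, target, guardrails, min_gain=0.01, max_drop=0.01):
    """Decide whether a pilot justifies scaling the synthetic corpus.

    `results` maps a run name to its metrics. The run "synthetic_full" is
    compared with every other run on `target`, and with "no_training" on
    each guardrail metric.
    """
    full = results["synthetic_full"]
    others = [m for name, m in results.items() if name != "synthetic_full"]
    beats_all = all(full[target] - m[target] >= min_gain for m in others)
    untrained = results["no_training"]
    regressions = [g for g in guardrails if untrained[g] - full[g] > max_drop]
    return {
        "scale": beats_all and not regressions,
        "beats_all": beats_all,
        "regressions": regressions,
    }


# Made-up pilot metrics for one student model.
results = {
    "no_training": {"causal_acc": 0.50, "general_qa": 0.70, "calibration": 0.80},
    "real_data_replay": {"causal_acc": 0.54, "general_qa": 0.70, "calibration": 0.80},
    "synthetic_no_shortcut_filter": {"causal_acc": 0.56, "general_qa": 0.69, "calibration": 0.79},
    "synthetic_full": {"causal_acc": 0.61, "general_qa": 0.66, "calibration": 0.80},
}
print(should_scale(results, "causal_acc", ["general_qa", "calibration"]))
# {'scale': False, 'beats_all': True, 'regressions': ['general_qa']}
```

In this invented pilot the synthetic set wins on the target capability and still should not be scaled, because an unrelated capability dropped by more than the allowed margin.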

This is not a glamorous workflow. It is a boring workflow. In enterprise AI, boring is often what remains after the expensive mistakes have been removed.

The privacy and memorization wrinkle

The self-training paper includes an intriguing result: in controlled Pythia experiments, favorable self-training preserved or improved benchmark utility while sharply reducing held-out exact-match extraction and lowering the probability assigned to true pretraining continuations. The authors interpret this as evidence that compatible self-training can reinforce distributed task structure while moving probability mass away from sequence-specific recall.

That finding is interesting, but businesses should not inflate it into “synthetic data solves memorization.” The authors explicitly note that their analysis does not constitute a complete privacy audit. It measures held-out exact-match extraction and true-continuation likelihood on documented Pythia pretraining sequences. That is stronger than a casual check, but it is not the same as proving safety for a production model trained on sensitive enterprise data.

The useful lesson is narrower and better: capability gain and memorization are not always mechanically tied together. A training update can move a model into a neighboring mode where benchmark behavior improves and verbatim recall drops. That suggests side effects should be measured, not assumed.

For business systems, this becomes another control-stack requirement. Every synthetic-data update should be evaluated not only for target capability but also for what changed elsewhere. Did extraction risk fall or rise? Did refusal behavior change? Did the model become overconfident? Did it forget domain conventions? Did it improve on the demo and quietly degrade on the boring tickets that pay the bills?
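
One of those side-effect measurements, exact-match extraction, is simple to express. The sketch below assumes a `generate(prefix, n)` callable that returns the model's next `n` tokens. That interface, the toy lookup "model", and the token sequences are hypothetical stand-ins rather than the paper's evaluation code.

```python
def exact_match_extraction_rate(pairs, generate) -> float:
    """Fraction of (prefix, true_continuation) pairs reproduced verbatim."""
    if not pairs:
        raise ValueError("need at least one (prefix, continuation) pair")
    hits = sum(
        1 for prefix, truth in pairs
        if list(generate(prefix, len(truth))) == list(truth)
    )
    return hits / len(pairs)


def make_lookup_model(memorized: dict):
    """Toy stand-in for a model: replays memorized continuations, else zeros."""
    def generate(prefix, n):
        return memorized.get(tuple(prefix), [0] * n)[:n]
    return generate


# Token-id sequences standing in for known training documents.
pairs = [((1, 2), [3, 4]), ((5, 6), [7, 8]), ((9, 9), [1, 1]), ((2, 4), [6, 8])]
before = make_lookup_model({(1, 2): [3, 4], (5, 6): [7, 8], (9, 9): [1, 1]})
after = make_lookup_model({(1, 2): [3, 4]})

print(exact_match_extraction_rate(pairs, before))  # 0.75
print(exact_match_extraction_rate(pairs, after))   # 0.25
```

Running the same measurement before and after each update turns "did extraction risk fall or rise?" into a number. It remains one narrow signal: a lower exact-match rate says nothing about paraphrased leakage or other privacy risks.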

Synthetic data is not a free privacy shield. It is a tool whose privacy consequences depend on the training regime, source data, student model, and audit method. Naturally, that sentence is harder to put on a slide. It is also closer to the truth.

Why this matters now

Synthetic data is becoming the default answer because real data is scarce, regulated, expensive, messy, and politically inconvenient inside organizations. That pressure will only increase. Companies want domain-tuned AI systems, but they often lack clean labeled data. They also want to avoid exposing sensitive data. Synthetic data appears to offer a neat escape.

The two papers suggest a more disciplined path. Synthetic data can help, but not as a bulk substitute. It helps when it is structured around a target capability, checked for correctness and ambiguity, diversified across real usage forms, filtered against shortcuts, matched to the student model, and audited after training.

The central misconception to prevent is that synthetic data has an intrinsic quality score. It does not, or at least no such score is sufficient on its own. A dataset can be formally correct and still be too narrow. It can be diverse and still contain shortcuts. It can be fluent and still be mismatched to the student. It can come from a stronger source and still produce weaker updates. It can improve one benchmark and still fail the workflow.

The better question is not:

Is this synthetic data good?

The better question is:

Under this training recipe, for this student model, on this target capability, after these audits, does this synthetic data produce the intended update with acceptable side effects?

That question is longer. It is less marketable. It is also the question that separates synthetic-data engineering from synthetic-data superstition.

The operating principle

The combined lesson of these papers is not anti-synthetic-data. It is anti-magic-data.

UniCo shows the upside of carefully designed synthetic supervision: when examples are structurally grounded, answerable, diverse, and shortcut-resistant, they can teach meaningful reasoning behavior. The self-training paper shows the boundary condition: even well-built synthetic data must be judged through the student model it updates, because utility is relational rather than intrinsic.

For Cognaptus readers, the practical principle is straightforward:

Treat synthetic data as an input that must earn training influence.

That means no automatic admission from generator to fine-tuning. It means no blind faith in source-model size. It means no assumption that pretty data teaches deep skills. It means building a control stack before scaling the corpus.

The firms that learn this early will not necessarily generate the most synthetic data. They will waste the least synthetic data. That may be the less glamorous advantage, but it is the one finance departments tend to understand.

Cognaptus: Automate the Present, Incubate the Future.

¹ Qirun Dai, Xiao Liu, Jiawei Zhang, Dylan Zhang, Hao Peng, and Chenhao Tan, “Towards a Universal Causal Reasoner,” arXiv:2605.24873, 2026, https://arxiv.org/abs/2605.24873.
² Sina Alemohammad, Li Chen, Richard G. Baraniuk, and Zhangyang Wang, “Not All Synthetic Data Is Yours to Learn From,” arXiv:2605.31126, 2026, https://arxiv.org/abs/2605.31126.
