The Invariance Trap: Why Matching Distributions Can Break Your Model

Noise is easy to add. Information is rather less cooperative.

A high-resolution camera image can be blurred. A precise sensor reading can be contaminated with noise. A complete genetic record can be reduced to a coarser code. Reversing any of those operations is much harder, because the missing information has already left the building.

Yet many domain-adaptation systems treat the two directions as though they were interchangeable. They train representations until source and target data become difficult to distinguish, celebrating when the distributions finally overlap.

That sounds sensible until the source contains information the target does not. Then the easiest way to make the domains match is not to improve the target. It is to damage the source.

The paper “Le Cam Distortion: A Decision-Theoretic Framework for Robust Transfer Learning” argues that this is a structural flaw in the usual pursuit of domain invariance.¹ Its proposed replacement is directional simulability: instead of asking whether two domains can be made indistinguishable, ask whether the target can be simulated from the source without erasing the source’s useful information.

That apparently modest change—from matching to simulating—alters the objective, the safety logic, and the meaning of successful transfer.

The mistake is treating unequal information as a symmetric mismatch

Suppose a model is trained using clean observations but must operate with noisy sensors after deployment.

The conventional framing describes this as a distribution mismatch:

the source distribution contains clean observations;
the target distribution contains noisy observations;
therefore, the learned representation should align the two distributions.

This framing skips the most important fact: the two domains are not merely different. They have different information content.

A clean observation can generate a noisy observation through a simple degradation process. Add noise, reduce resolution, discard detail, or hide part of the state.

A noisy observation generally cannot recreate the clean one without additional information. No amount of optimism in the loss function changes that.

The paper expresses each domain as a statistical experiment: a family of observations indexed by some underlying parameter $\theta$, such as a class label, physical state, or biological condition.

Let the source and target experiments be:

$$ \mathcal{E}_S = \{P_\theta^S\}_{\theta \in \Theta}, \qquad \mathcal{E}_T = \{P_\theta^T\}_{\theta \in \Theta}. $$

The question is not whether $P_\theta^S$ and $P_\theta^T$ look identical. The question is whether there exists a parameter-independent transformation, or Markov kernel, $K$, that can turn source observations into valid target-like observations.

The corresponding directional deficiency is:

$$ \delta(\mathcal{E}_S,\mathcal{E}_T) = \inf_K \sup_{\theta} \left\| K P_\theta^S - P_\theta^T \right\|_{\mathrm{TV}}. $$

A small value means the target can be simulated well from the source. Importantly, $K$ cannot inspect the true value of $\theta$. Otherwise, the simulator could cheat by generating the correct target observation directly.

This quantity is directional. In general, swapping the two experiments changes its value:

$$ \delta(\mathcal{E}_S,\mathcal{E}_T) \neq \delta(\mathcal{E}_T,\mathcal{E}_S). $$

For a clean source and noisy target, the forward deficiency can be small or zero: add the appropriate noise and the target is reproduced. The reverse deficiency remains positive because processing noisy data cannot restore information that was removed.
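As a worked instance, assume a scalar Gaussian location model in which both domains observe the same $\theta$ with different noise levels. The variances $\sigma_S^2 < \sigma_T^2$ are illustrative symbols introduced here, not notation taken from the paper.

$$ P_\theta^S = \mathcal{N}(\theta,\sigma_S^2), \qquad P_\theta^T = \mathcal{N}(\theta,\sigma_T^2), \qquad K(x,\cdot) = \mathcal{N}\left(x,\ \sigma_T^2-\sigma_S^2\right). $$

$$ K P_\theta^S = \mathcal{N}\left(\theta,\ \sigma_S^2 + (\sigma_T^2-\sigma_S^2)\right) = P_\theta^T \ \text{ for every } \theta \quad\Longrightarrow\quad \delta(\mathcal{E}_S,\mathcal{E}_T) = 0. $$

The kernel only adds independent noise and never consults $\theta$, which is exactly what the definition requires.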

That asymmetry is the mechanism behind the invariance trap.

Symmetric alignment can succeed by making the good domain worse

The paper contrasts directional deficiency with a symmetric distance:

$$ \Delta(\mathcal{E}_S,\mathcal{E}_T) = \max \left\{ \delta(\mathcal{E}_S,\mathcal{E}_T), \delta(\mathcal{E}_T,\mathcal{E}_S) \right\}. $$

Minimizing the symmetric quantity demands that each experiment simulate the other.

When both domains contain comparable information, that may be reasonable. When one domain is clearly richer, the reverse simulation is impossible unless the rich representation first discards whatever the degraded domain lacks.

Consider a clean Gaussian signal and a noisier version of the same signal. Adding noise produces the target exactly. Removing that noise without knowing the underlying parameter is impossible. If an encoder is nevertheless rewarded for making the clean and noisy representations equivalent, it can satisfy the objective by suppressing the informative signal in the clean representation.

The distributions align. The model becomes less capable. The metric applauds politely.

The paper formalizes this result through Fisher information: in its Gaussian setting, enforcing symmetric equivalence requires reducing the information contained in the source representation to the level available in the target.
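To make the Fisher-information statement concrete, assume a scalar Gaussian location model with source noise variance $\sigma_S^2$ and target noise variance $\sigma_T^2 > \sigma_S^2$. These symbols are illustrative assumptions rather than the paper's notation. The standard Fisher information for a Gaussian mean gives:

$$ I_S(\theta) = \frac{1}{\sigma_S^2} \;>\; I_T(\theta) = \frac{1}{\sigma_T^2}, \qquad I_S(\theta) - I_T(\theta) = \frac{1}{\sigma_S^2} - \frac{1}{\sigma_T^2} > 0. $$

Processing cannot raise Fisher information, so a source representation that is statistically equivalent to the target can carry no more than $I_T(\theta)$. The difference above is what the encoder has to discard. With $\sigma_S^2 = 1$ and $\sigma_T^2 = 5$, for example, that is $1 - 0.2 = 0.8$, or 80% of the source's information about $\theta$.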

This is more precise than saying that domain adaptation can occasionally cause negative transfer. It identifies a condition under which destructive transfer is the expected solution to the optimization problem.
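The collapse can be sketched numerically. The example below is a minimal illustration, not the paper's experiment: it assumes a zero-mean Gaussian state, a shared linear encoder `z = gain * x`, and a Gaussian-kernel MMD with a fixed bandwidth, for which the squared MMD between two zero-mean Gaussians has a closed form. All names and constants are hypothetical.

```python
import math


def gaussian_mmd2(var_p: float, var_q: float, bandwidth: float) -> float:
    """Closed-form squared MMD between N(0, var_p) and N(0, var_q).

    Uses the kernel k(x, y) = exp(-(x - y)**2 / (2 * bandwidth**2)).
    """
    def mean_kernel(var_of_difference: float) -> float:
        # E[k(X, Y)] when X - Y is distributed as N(0, var_of_difference).
        return bandwidth / math.sqrt(bandwidth ** 2 + var_of_difference)

    return (mean_kernel(2.0 * var_p) + mean_kernel(2.0 * var_q)
            - 2.0 * mean_kernel(var_p + var_q))


# Hypothetical setup: the state is N(0, 1); the source sees it cleanly and
# the target sees it plus independent noise of variance 4.
VAR_SOURCE = 1.0
VAR_TARGET = 1.0 + 4.0
BANDWIDTH = 1.0  # fixed kernel bandwidth (an assumption of this sketch)


def alignment_loss(gain: float) -> float:
    """MMD^2 between the two domains after the shared encoder z = gain * x."""
    return gaussian_mmd2(gain ** 2 * VAR_SOURCE, gain ** 2 * VAR_TARGET, BANDWIDTH)


gains = [1.0, 0.5, 0.1, 0.0]
losses = [alignment_loss(g) for g in gains]
for g, loss in zip(gains, losses):
    print(f"gain={g:.1f}  alignment loss={loss:.5f}")
```

The loss falls as the gain shrinks and vanishes at gain zero, where the encoder no longer carries the state at all. A scale-invariant divergence such as total variation would not reward pure rescaling in this way, so the sketch is specific to fixed-bandwidth kernel objectives.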

Deficiency changes the contract from similarity to decision risk

The paper’s theoretical contribution is not simply another distance measure. It connects simulability to downstream decisions.

If one experiment can simulate another with bounded error, then the gap in achievable decision risk can also be bounded, subject to the paper’s assumptions. The result holds across decision problems with bounded loss, not just for one chosen classification objective.
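In its classical form, which the paper's exact statement may refine, the bound can be written as follows. The loss taking values in $[0,1]$, the decision rules $\rho_S$ and $\rho_T$, and the risk functions $R_S$ and $R_T$ are notation assumed here for illustration.

$$ \delta(\mathcal{E}_S,\mathcal{E}_T) \le \varepsilon \quad\Longrightarrow\quad \forall \rho_T \ \exists \rho_S: \quad R_S(\theta,\rho_S) \le R_T(\theta,\rho_T) + \varepsilon \quad \text{for all } \theta \in \Theta. $$

The intuition is constructive: take $\rho_S = \rho_T \circ K$, that is, run the target rule on simulated target observations. Because a loss in $[0,1]$ can change in expectation by at most the total variation distance between the simulated and true target distributions, the risk moves by at most $\varepsilon$.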

This produces a different contract for representation learning:

Transfer objective	What success means	Main failure risk
Symmetric invariance	Source and target representations are difficult to distinguish	Useful source information may be removed to achieve alignment
Directional simulability	A learned kernel can reproduce target conditions from source observations	The simulator or deficiency estimate may be inaccurate
Source-only training	Source capability is preserved	Deployment shift remains untreated
Target fine-tuning	The model specializes to observed target data	Requires suitable target labels and may reduce broader capability

The distinction matters because businesses rarely need a representation that merely wins a domain-classification game. They need a model that continues making acceptable decisions under deployment conditions.

A system that becomes equally mediocre on every sensor has achieved invariance in the same way that turning off the lights standardizes office brightness.

The learned simulator changes what MMD is asked to do

The paper’s practical proposal is to learn a degradation simulator from source to target.

Conceptually, the workflow is:

Informative Source
       |
       v
Learned degradation kernel K
       |
       v
Simulated Target Conditions
       |
       v
Train a model robust to deployment conditions

Source capability remains available separately

The simulator is trained using a discrepancy measure such as Maximum Mean Discrepancy, or MMD, to make its generated observations resemble unlabeled target data.

This detail is important because MMD also appears in many invariant-domain methods. The paper is not arguing that MMD itself destroys information. The problem is what the optimization asks MMD to accomplish.

In symmetric alignment, MMD encourages source and target representations to converge toward each other. In the proposed directional method, MMD helps fit a simulator that transforms source observations toward the target while leaving the original source representation intact.

The same measuring instrument can support two very different objectives.

The theoretical deficiency is defined as an infimum over all valid kernels and uses total variation distance. The implemented method searches within a restricted simulator family and estimates similarity using MMD. The practical guarantee is therefore conditional on two substantial assumptions:

the selected simulator family can represent the actual degradation process;
the empirical divergence measure reliably detects remaining differences.

The paper states this gap explicitly. It later becomes one of the most important qualifications to its safety claims.
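The fitting step can be sketched under strong simplifying assumptions: scalar observations, a one-parameter additive-noise simulator family, a grid search in place of gradient training, and a Gaussian-kernel MMD with a hand-picked bandwidth. None of these choices come from the paper, and every name below is hypothetical. The sketch also fits the reverse direction to show the residual that the restricted family cannot remove.

```python
import math
import random


def mmd2(xs, ys, bandwidth):
    """Biased (V-statistic) estimate of squared MMD with a Gaussian kernel."""
    def mean_k(a, b):
        total = 0.0
        for u in a:
            for v in b:
                total += math.exp(-((u - v) ** 2) / (2.0 * bandwidth ** 2))
        return total / (len(a) * len(b))

    return mean_k(xs, xs) + mean_k(ys, ys) - 2.0 * mean_k(xs, ys)


def fit_noise_kernel(inputs, outputs, sigma_grid, bandwidth, rng):
    """Return (sigma, residual MMD^2) for the best additive-noise simulator."""
    # The same base noise is reused for every grid point so that scores
    # differ only through sigma, not through fresh random draws.
    base_noise = [rng.gauss(0.0, 1.0) for _ in inputs]
    best = None
    for sigma in sigma_grid:
        simulated = [x + sigma * z for x, z in zip(inputs, base_noise)]
        score = mmd2(simulated, outputs, bandwidth)
        if best is None or score < best[1]:
            best = (sigma, score)
    return best


rng = random.Random(0)
N = 200
TRUE_SIGMA = 2.0  # hypothetical deployment noise, unknown to the fitter

# Unpaired samples: the target is drawn from fresh states, not from `source`.
source = [rng.gauss(0.0, 1.0) for _ in range(N)]
target = [rng.gauss(0.0, 1.0) + rng.gauss(0.0, TRUE_SIGMA) for _ in range(N)]

grid = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
BANDWIDTH = 1.5  # assumed; in practice this needs its own validation

fwd_sigma, fwd_residual = fit_noise_kernel(source, target, grid, BANDWIDTH, rng)
rev_sigma, rev_residual = fit_noise_kernel(target, source, grid, BANDWIDTH, rng)

print(f"source -> target: sigma={fwd_sigma}, residual={fwd_residual:.4f}")
print(f"target -> source: sigma={rev_sigma}, residual={rev_residual:.4f}")
```

The source samples are never modified: the simulator produces a degraded copy, and the residual in the reverse direction stays clearly larger because adding noise cannot undo noise. Both residuals are only as trustworthy as the simulator family and the bandwidth chosen here.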

The experiments test different parts of the argument

The paper presents experiments across Gaussian data, reinforcement learning, CIFAR-10 images, and synthetic HLA genomics. These should not all be read as interchangeable benchmark victories.

Each serves a different evidential purpose.

Experiment	Likely purpose	What it supports	What it does not establish
Gaussian shift and quantization tests	Theoretical and estimator verification	Directionality, monotonic response to information loss, recovery of known degradation structure	Performance in complex real deployments
One- and two-dimensional control	Main mechanism evidence	Symmetric alignment can collapse task-relevant signals and produce unsafe control	General superiority across realistic robotics tasks
CIFAR-10 degradation	Main safety-performance comparison	Source preservation and partial target improvement under a constrained degradation simulator	State-of-the-art image domain adaptation
Synthetic HLA recovery	Exploratory discrete-domain extension	The directional logic can be implemented in a combinatorial setting	Clinical validity on real genomic populations
MuJoCo preliminary test	Sensitivity and failure-mode evidence	Simulator learning can lose under limited data and high dimensionality	A completed evaluation of large-scale control

This hierarchy matters. The strongest evidence in the paper concerns the mechanism: forcing equivalence between unequally informative domains can remove information. Evidence for broad operational superiority remains more conditional.

In control, invariance learns to do nothing

The reinforcement-learning experiments provide the clearest demonstration of the mechanism because a collapsed representation has an immediate physical consequence.

In the one-dimensional task, an agent must control a state toward zero. The source observation exposes the clean state, while the target observation adds noise.

Three approaches are compared:

Method	Target return
Naive transfer	-48.6
Invariant RL	-1290.2
Le Cam RL	-25.3

The naive policy learns aggressive control from clean observations and becomes unstable when exposed to noise.

The invariant policy encounters a stranger failure. To make clean and noisy representations match, it suppresses the state signal and approaches a zero-gain, “do nothing” policy. The representation becomes invariant partly because it stops representing what the controller needs to know.

The directional method instead learns a noise simulator and trains a conservative policy under simulated target conditions. It retains the ability to observe the clean state while learning not to overreact to degraded observations.
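The qualitative ordering of the three behaviours can be reproduced in a toy linear system. This is a hedged sketch, not the paper's environment: it assumes dynamics `x_{t+1} = x_t - gain * observation`, a quadratic state cost, and hand-set gains standing in for learned policies, so it will not reproduce the returns in the table above.

```python
import random


def average_cost(gain, obs_noise, episodes=200, horizon=50, x0=3.0, seed=0):
    """Mean sum of squared states when the controller acts on noisy readings."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(episodes):
        x = x0
        for _ in range(horizon):
            total += x * x
            observation = x + rng.gauss(0.0, obs_noise)
            x = x - gain * observation  # proportional control toward zero
    return total / episodes


NOISE_STD = 2.0  # hypothetical target-domain sensor noise

# Hand-set gains standing in for the three learned policies.
policies = {
    "aggressive": 1.0,    # ideal for clean observations
    "conservative": 0.3,  # tuned with observation noise in mind
    "do nothing": 0.0,    # the collapsed, zero-gain policy
}

clean_costs = {name: average_cost(g, 0.0) for name, g in policies.items()}
noisy_costs = {name: average_cost(g, NOISE_STD) for name, g in policies.items()}

for name in policies:
    print(f"{name:12s} clean={clean_costs[name]:8.1f}  noisy={noisy_costs[name]:8.1f}")
```

Under these assumed settings the aggressive gain is best on clean observations, the conservative gain is best once noise is present, and the zero-gain policy is worst under noise because it never corrects the initial error.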

The two-dimensional extension makes the point more specific. One axis receives reliable observations while the other is heavily degraded. A useful system should respond differently along the two dimensions.

The directional approach learns that anisotropy and applies aggressive control where observations remain reliable and conservative control where they do not. Its target return of -61.65 improves on the naive method’s -166.92 and the invariant method’s -1789.37.

The lesson is not merely that one algorithm received a better score. It is that directional modeling can preserve where information remains useful, while global invariance can discard signals too indiscriminately.

CIFAR-10 exposes the price hidden inside target accuracy

The CIFAR-10 experiment constructs a degraded target domain by applying blur and noise to clean images. A ResNet-18 trained only on the source performs well on clean images but poorly on degraded ones.

The comparison then reveals three distinct strategies:

Method	Source accuracy	Target accuracy	Source change
Source-only	81.01%	17.51%	0.0 points
CycleGAN	46.30%	34.73%	-34.7 points
Le Cam harmonization	81.17%	26.46%	+0.2 points

CycleGAN achieves the best target accuracy. It also reduces clean-source accuracy from about 81% to 46%.

If the sole business objective is performance on the degraded target and the source will never be used again, that trade may be acceptable. The result does not prove that symmetric alignment is useless.

It does show why target accuracy alone is an incomplete evaluation metric.

CycleGAN’s target improvement arrives together with severe source amnesia. High-frequency details that help distinguish clean images are removed as the representation adapts to blurred inputs. From a single-target benchmark perspective, this may look efficient. From the perspective of a multi-sensor system expected to handle both clean and degraded inputs, it is a capability failure.

The directional method preserves source accuracy while improving target accuracy by roughly nine percentage points over source-only training. It does not match CycleGAN on the degraded target.

The paper initially describes the remaining 8.3-point target gap as the “price of safety.” Its later limitation analysis gives a more useful explanation: the chosen simulator models Gaussian blur and noise, while the actual degradation pipeline also contains effects such as compression and downsampling artifacts. The simulator is misspecified.

That distinction matters operationally. Some of the performance gap may reflect an unavoidable trade-off between preservation and specialization. Some may simply reflect a simulator that does not yet model the deployment process well enough.

Calling the entire gap the price of safety is neat. Production systems are rarely so obliging.

The appendix validates the proxy—and demonstrates why it cannot be trusted blindly

The controlled Gaussian and quantization experiments are best read as unit tests.

As quantization becomes coarser and more information is lost, the estimated deficiency rises monotonically. In the known Gaussian-shift setting, the directional model learns a conservative estimate of the noise process while broadly preserving clean-domain performance.

These tests support the internal consistency of the proposed objective. They are not substitutes for external validation, but they show that the estimator reacts in the intended direction under controlled conditions.

More interestingly, the appendix includes a “proxy blindness” test. Two distributions have matching moments but different shapes. With an unsuitable bandwidth, standard MMD fails to detect the difference even though the total variation distance is large.

This is not a minor implementation footnote. It directly limits the meaning of an empirical safety certificate.

The theory provides guarantees in terms of deficiency. The implementation estimates a proxy for deficiency. If the proxy is blind to an important difference, the apparent guarantee becomes decorative mathematics attached to an undetected shift.

For businesses, this means that simulator validation must include deliberate adversarial and sensitivity testing. Reporting a low MMD value is not sufficient. Teams need evidence that their chosen divergence, bandwidths, sample sizes, and representations can detect the failure modes that matter for the application.
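A sensitivity check of this kind can be sketched in a few lines. The pair of distributions below is an illustrative choice, not the paper's test case: a standard Gaussian and a two-point distribution on -1 and +1 share mean 0 and variance 1, yet their total variation distance is 1 because one is continuous and the other is discrete. The bandwidth values are likewise assumptions.

```python
import math
import random


def mmd2(xs, ys, bandwidth):
    """Biased (V-statistic) estimate of squared MMD with a Gaussian kernel."""
    def mean_k(a, b):
        total = 0.0
        for u in a:
            for v in b:
                total += math.exp(-((u - v) ** 2) / (2.0 * bandwidth ** 2))
        return total / (len(a) * len(b))

    return mean_k(xs, xs) + mean_k(ys, ys) - 2.0 * mean_k(xs, ys)


rng = random.Random(1)
N = 200

# Same mean and variance, very different shapes.
gaussian = [rng.gauss(0.0, 1.0) for _ in range(N)]
two_point = [rng.choice((-1.0, 1.0)) for _ in range(N)]

mmd_narrow = mmd2(gaussian, two_point, bandwidth=0.3)
mmd_wide = mmd2(gaussian, two_point, bandwidth=50.0)

print(f"bandwidth 0.3 : MMD^2 = {mmd_narrow:.6f}")
print(f"bandwidth 50.0: MMD^2 = {mmd_wide:.6f}")
```

With the wide bandwidth the kernel is nearly flat over the data, so the estimate is driven almost entirely by the mean and variance and stays close to zero. The narrow bandwidth exposes the difference. A validation suite would plant known shifts like this one and confirm that the chosen proxy flags them.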

The HLA experiment makes information loss obvious, but remains synthetic

The synthetic HLA experiment translates the same mechanism into discrete genetic data.

A high-resolution, phased genetic state is degraded by removing allelic detail and phase information. The lower-resolution observation cannot uniquely recreate the source without knowledge of population structure.

The paper compares a naive suffix-guessing method, a classical expectation-maximization baseline, and a simulation-based Le Cam method trained on 10,000 synthetic pairs. Evaluation uses 1,000 synthetic test individuals.

Metric	Le Cam method	EM baseline	Naive baseline
Allele accuracy	90.2%	90.8%	62.3%
Haplotype accuracy	90.2%	90.5%	41.0%
Phase accuracy	88.9%	89.7%	19.0%
Frequency correlation	0.999	0.986	-0.207

EM retains a small advantage in individual reconstruction accuracy. The Le Cam method produces the strongest population-frequency correlation and offers amortized inference after training.

The experiment is useful because the asymmetry is easy to see: a high-resolution allele code contains detail that a low-resolution code does not. Forcing the two to be equivalent plainly destroys relevant biological information.

However, the study uses a synthetic universe of 15 haplotypes with designed frequencies. It supports the claim that the framework can operate on discrete and combinatorial data. It does not demonstrate clinical readiness, performance on diverse real populations, or validity for transplant decisions.

The paper’s language occasionally reaches for universality. The evidence more safely supports cross-domain plausibility.

Business use starts with a domain-dominance audit

The practical value of the paper is not that every organization should immediately replace its transfer-learning stack with a Le Cam implementation.

The more useful implication is that teams should diagnose the information relationship between environments before selecting an adaptation objective.

A workable process would begin with four questions.

1. Is the target a degradation of the source?

Examples include lower-resolution scanners, noisier sensors, missing fields, quantized records, partial state observations, or cheaper measurement processes.

When the answer is yes, the source may be able to simulate the target through a physically meaningful degradation kernel.

When the relationship is unknown, symmetric alignment is risky because the direction of information dominance has not been established.

2. Can the degradation process be modeled and tested?

A useful simulator should correspond to a plausible process: blur, noise, dropout, censoring, resolution reduction, latency, or sensor failure.

An expressive neural generator may fit the target distribution more closely, but it can also learn transformations that are difficult to interpret or validate. A constrained simulator is easier to inspect but may miss important degradation modes.

This is not merely a model-selection decision. It determines what kind of failure can remain hidden.

3. Must the source capability remain usable?

Some systems adapt once to a permanent target and never need the source environment again. In that case, sacrificing source performance may be economically rational.

Other systems must serve mixed inputs, support multiple sensor qualities, undergo continual learning, or retain certification on the original environment. For them, source preservation is a requirement rather than a pleasant side effect.

4. How expensive is a false transfer certificate?

The directional framework becomes most valuable when negative transfer is expensive: control systems, diagnostic pipelines, industrial inspection, or decisions made from heterogeneous measurement devices.

The paper does not calculate return on investment. Cognaptus’ business inference is that the main economic value would come from avoiding destructive deployment and preserving previously validated capability—not simply from producing a higher target benchmark score.

A transfer scorecard should measure preserved capability, not just adaptation

The CIFAR-10 results suggest a simple governance improvement: evaluate adaptation on both environments.

A deployment scorecard should include at least:

Measure	Operational question
Source utility after adaptation	Did the model lose capability it previously possessed?
Target utility after adaptation	Does the model perform better under deployment conditions?
Simulator fit	Does the learned degradation process reproduce the relevant target characteristics?
Proxy sensitivity	Can the divergence measure detect known dangerous shifts?
Direction validity	Is there evidence that source-to-target simulation is the appropriate direction?
Refusal condition	When should transfer be rejected rather than forced?

This changes the approval question from “Did target accuracy improve?” to “What capability was exchanged for that improvement, and was the exchange authorized?”
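As a minimal sketch, the first two rows of such a scorecard can be computed directly from the CIFAR-10 figures reported above. The approval rule and the two-point tolerance are hypothetical placeholders for a policy each team would set itself, and the remaining rows (simulator fit, proxy sensitivity, direction validity, refusal condition) are not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdaptationResult:
    method: str
    source_acc: float  # accuracy in percent on clean source data
    target_acc: float  # accuracy in percent on degraded target data


# Accuracies taken from the CIFAR-10 comparison in this article.
BASELINE = AdaptationResult("Source-only", 81.01, 17.51)
CANDIDATES = [
    AdaptationResult("CycleGAN", 46.30, 34.73),
    AdaptationResult("Le Cam harmonization", 81.17, 26.46),
]

MAX_SOURCE_DROP = 2.0  # hypothetical tolerance, in percentage points


def scorecard(result, baseline, max_source_drop):
    """Report what was gained on the target and what was paid on the source."""
    source_change = result.source_acc - baseline.source_acc
    target_gain = result.target_acc - baseline.target_acc
    return {
        "method": result.method,
        "source_change": round(source_change, 2),
        "target_gain": round(target_gain, 2),
        # Approve only if the target improves and the source loss is tolerated.
        "approved": target_gain > 0 and source_change >= -max_source_drop,
    }


rows = [scorecard(r, BASELINE, MAX_SOURCE_DROP) for r in CANDIDATES]
for row in rows:
    print(row)
```

Under this assumed rule, the method with the higher target accuracy is the one rejected, because its source loss exceeds the tolerance. A team that never needs the source again would set a different tolerance and reach a different verdict.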

That is a more tedious question. It is also the one a system in production deserves.

The guarantee is conditional on the part practitioners must estimate

The theory is attractive because it provides a decision-oriented interpretation of transfer. The practical method still faces several boundaries.

First, the true deficiency is not directly optimized. The implementation searches within a selected simulator class and uses MMD as a proxy. If the real degradation lies outside that class, the estimated simulability can remain poor even when the conceptual direction is correct.

Second, divergence estimation becomes harder in high-dimensional spaces. The paper notes that MMD has quadratic computational cost in batch size and can suffer from poor sensitivity. Larger or more expressive simulators also require more data.

Third, simulator learning introduces a sample-complexity tax. In preliminary MuJoCo experiments conducted under a limited training budget, the naive source-trained agent outperformed both the directional and invariant approaches. The paper interprets this as a case where the learned policy was already sufficiently robust and the additional adaptation objective consumed scarce training capacity.

Fourth, most of the strongest demonstrations are deliberately controlled. The control tasks expose the mechanism cleanly, but they are not autonomous-driving systems. The HLA results are synthetic. The CIFAR target degradation is constructed.

These are not reasons to dismiss the framework. They define where its evidence currently ends.

A responsible deployment claim would therefore be narrower:

Directional simulability offers a principled objective when the source is demonstrably more informative than the target, the degradation process can be modeled, the proxy can be validated, and preserving source capability matters.

Outside those conditions, the safer option may be target fine-tuning, source-only deployment with validation, or refusing to claim that the domains are transferable at all.

Stop rewarding sameness when the system needs capability

The invariance trap begins with a reasonable instinct: reduce the difference between training and deployment environments.

The problem appears when difference is treated as though it were always removable without cost.

Some differences represent nuisance variation. Others represent lost information. A transfer objective that cannot distinguish between them may solve the alignment problem by deleting whatever makes the source useful.

The paper’s most valuable contribution is therefore not a claim that directional simulation will dominate every adaptation method. Its contribution is a sharper diagnostic question:

Can the target be generated from the source, and what must be destroyed to pretend that the reverse is also true?

That question should be asked before distributions are matched, before representations are aligned, and certainly before a lower domain-classification score is presented as evidence of safety.

Symmetry is elegant. Deployment is directional.

Cognaptus: Automate the Present, Incubate the Future.

¹ Deniz Akdemir, “Le Cam Distortion: A Decision-Theoretic Framework for Robust Transfer Learning,” arXiv:2512.23617, 2025.
