When SGD Remembers: The Hidden Memory Inside Training Dynamics

Reset Is the Most Honest Experiment

Resetting an optimizer sounds boring. It is the kind of engineering operation that hides inside training scripts, not the kind of thing that gets people excited at conference coffee breaks.

But in this paper, reset becomes a scalpel.

The authors ask a deceptively simple question: when a neural network receives the same next training intervention, does that intervention behave the same way regardless of what just happened before?¹ In a tidy Markovian story, the answer should be yes, at least once the relevant state is specified. In practical training, the answer is more inconvenient. Momentum buffers, batch overlap, augmentation choices, and short update histories can all make yesterday’s path leak into today’s update.

That is not surprising to practitioners. Anyone who has touched curriculum learning, staged augmentation, or optimizer warm-up has probably developed a private superstition about order effects. The useful part of this paper is not that it says “order matters.” That line has already been overworked enough to qualify for early retirement. The useful part is that it turns order-dependence into a measurable witness.

The mechanism is compact:

1. Create two short histories, $A$ and $A'$.
2. Measure how different the resulting models are on a fixed probe set.
3. Apply the same second intervention, $B$, to both.
4. Measure whether the difference grows.

If the difference grows after the same $B$, then $B$ is not acting only on the observable state. Something about the hidden preparation still matters.

That “something” is the memory channel.

The Paper Is Not Claiming That SGD Is Mystically Non-Markovian

Before the argument gets too dramatic, one clarification matters.

The paper is not claiming that SGD, properly expanded to include parameters, optimizer buffers, sampler state, random seeds, and every other hidden variable, is metaphysically non-Markovian. That would be a much larger and less useful claim. With enough state augmentation, many processes can be made Markovian. One can always carry a bigger suitcase.

The paper’s claim is narrower and more operational: at the level of what the experiment observes — predictive distributions on a fixed probe set — a fixed second intervention may fail to behave like a single channel acting on the measured one-step outcome.

That distinction is the whole point.

For a training team, the relevant question is rarely “does the universe contain a fully Markovian description of this update process?” The relevant question is closer to: “Can I tell whether my next training phase is still being shaped by the previous phase in a way that affects model outputs?”

The paper formalizes that question using a process-tensor view. In this framing, a training segment is a multi-time map: it takes controllable instruments such as mini-batch choices, augmentations, and optimizer micro-steps, and returns observables such as softmax predictions on a held-out probe set. The theory may come from open quantum systems, but the practical object is refreshingly ordinary: a controlled training protocol with a before-and-after distance measurement.

The Two-Step Witness: When the Same Future Amplifies Different Pasts

The central statistic is:

$$ \Delta_{BF} = D_2 - D_1 $$

Here, $D_1$ is the distance between model outputs after the first intervention, and $D_2$ is the distance after both branches receive the same second intervention $B$.

The first intervention comes in two versions, $A$ and $A'$. In the experiments, the two versions use the same data indices but different augmentations. That matters because the authors are not merely comparing two unrelated training runs. They are isolating a controlled difference: same starting parameters, same batch identity, different transformation history, then the same next intervention.

The distances are computed over predictive distributions on a fixed held-out probe set. Total variation is the primary metric, while Jensen-Shannon and Hellinger are used as robustness checks. All three are chosen because they obey the kind of contractivity condition needed for the operational argument: under a fixed stochastic channel, distinguishability should not increase.
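To make the three distances and the contractivity condition concrete, here is a minimal NumPy sketch. It is an illustration, not the paper's code: the function names, the base-2 convention for Jensen-Shannon, the Dirichlet-sampled distributions, and the random row-stochastic matrix standing in for a "fixed stochastic channel" are all assumptions.

```python
import numpy as np

def total_variation(p, q):
    """Total variation: half the L1 distance between two probability vectors."""
    return 0.5 * np.abs(p - q).sum()

def hellinger(p, q):
    """Hellinger distance, bounded in [0, 1]."""
    return np.sqrt(0.5 * ((np.sqrt(p) - np.sqrt(q)) ** 2).sum())

def jensen_shannon(p, q):
    """Jensen-Shannon divergence with base-2 logs, bounded in [0, 1]."""
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(5))  # stand-in for one branch's softmax output
q = rng.dirichlet(np.ones(5))  # stand-in for the other branch
channel = rng.dirichlet(np.ones(5), size=5)  # each row is a conditional distribution
p_out, q_out = p @ channel, q @ channel      # push both through the same channel

metrics = {"TV": total_variation, "Hellinger": hellinger, "JS": jensen_shannon}
for name, dist in metrics.items():
    print(f"{name}: before={dist(p, q):.4f} after={dist(p_out, q_out):.4f}")
```

For each metric the "after" value should not exceed the "before" value. That is the property the witness leans on: if the same $B$ acted as one fixed channel on the observed predictions, it could not pull the two branches further apart.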

So the logic is simple:

Step	What happens	What it tests
$A$ vs. $A'$	Same data indices, different first augmentation histories	Whether two controlled histories produce different observable states
Measure $D_1$	Compare probe-set predictions after the first step	Baseline distinguishability
Apply common $B$	Same second intervention applied to both branches	Whether the future acts independently of the prior preparation
Measure $D_2$	Compare probe-set predictions again	Whether distinguishability grew or shrank
Compute $\Delta_{BF}$	$D_2 - D_1$	Positive values witness observable-level memory

If $\Delta_{BF} > 0$, the second intervention increased the difference between the two histories. Under the paper’s operational Markov condition, that should not happen. A positive value therefore falsifies the memoryless observable-channel story for the specific protocol, observable, and divergence.

Notice the discipline here. A negative value does not prove that the training process is Markovian. It only means this particular witness did not observe back-flow. A positive value, however, is meaningful: the same second step made two histories more distinguishable.
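The statistic itself is a short computation once the four sets of probe predictions exist. The sketch below assumes the per-example TV distances are averaged over the probe set (an assumption, since the aggregation is not spelled out here), and the arrays are tiny hand-made placeholders rather than real model outputs.

```python
import numpy as np

def mean_tv(preds_a, preds_b):
    """Mean total variation over probe examples; inputs are (n_probe, n_classes)."""
    return 0.5 * np.abs(preds_a - preds_b).sum(axis=1).mean()

def backflow_witness(after_a, after_a_prime, after_ab, after_a_prime_b):
    """Delta_BF = D2 - D1, where D1 compares the branches before B and D2 after B."""
    d1 = mean_tv(after_a, after_a_prime)
    d2 = mean_tv(after_ab, after_a_prime_b)
    return d2 - d1

# Hypothetical softmax outputs on a two-example, two-class probe set.
after_a = np.array([[0.6, 0.4], [0.5, 0.5]])
after_a_prime = np.array([[0.5, 0.5], [0.5, 0.5]])
after_ab = np.array([[0.8, 0.2], [0.6, 0.4]])
after_a_prime_b = np.array([[0.5, 0.5], [0.5, 0.5]])

delta_bf = backflow_witness(after_a, after_a_prime, after_ab, after_a_prime_b)
print(f"{delta_bf:.3f}")  # prints 0.150
```

In this toy case $D_1 = 0.05$ and $D_2 = 0.20$, so the shared second step increased distinguishability and the witness is positive.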

That is the asymmetry that makes the diagnostic useful.

The Causal Break Is the Mechanism Test, Not Just Another Ablation

The strongest part of the paper is the causal break.

After the first intervention, the authors either carry the optimizer state into $B$ or reset the optimizer immediately before $B$, while keeping the parameters fixed. The reset removes momentum buffers and similar optimizer memory, but does not erase the model weights created by the first intervention.

That gives a clean mechanism test:

Test	Likely purpose	What it supports	What it does not prove
No-break $A/A' \rightarrow B$	Main evidence	Measures whether a shared $B$ amplifies differences between histories	Does not identify the carrier of memory by itself
Optimizer reset before $B$	Mechanism ablation / causal break	Tests whether optimizer state carries the observed memory	Does not prove all memory comes only from optimizer buffers
TV, JS, Hellinger agreement	Robustness test	Checks that the effect is not a quirk of one distance metric	Does not show downstream accuracy improvement
Negative regime	Control	Checks that weak/no-memory settings stay near zero	Does not replace the causal-break test
Non-commutativity and momentum alignment diagnostics	Mechanism interpretation	Links back-flow to order sensitivity and optimizer-buffer alignment	Does not become a full causal model of generalization
Placebo $A=A'$ and $\mu=0$ checks	Falsification checks	Tests whether the witness collapses when the controlled difference or momentum channel is removed	Does not cover all optimizers or all training regimes

This is better than the usual “we tried several settings and the number moved” ablation. The reset targets a specific hidden channel. If the optimizer buffers carry history, then breaking that channel should shrink, nullify, or flip the back-flow effect. If they do not, the reset should matter much less.
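A toy version of the causal break can be written in a few dozen lines. The sketch below is a deliberately small stand-in, not the paper's setup: a linear softmax classifier, heavy-ball SGD in the common form $v \leftarrow \mu v + g$, additive Gaussian noise playing the role of "augmentation", and hypothetical names throughout. It shows only the mechanics of carrying versus resetting the buffer while the weights stay fixed.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def grad(w, x, y):
    """Gradient of mean cross-entropy for a linear softmax classifier."""
    p = softmax(x @ w)
    p[np.arange(len(y)), y] -= 1.0
    return x.T @ p / len(y)

def run_steps(w, v, x, y, mu, lr, k):
    """k micro-steps of heavy-ball SGD; returns updated weights and buffer."""
    for _ in range(k):
        v = mu * v + grad(w, x, y)
        w = w - lr * v
    return w, v

def mean_tv(w1, w2, probe):
    diff = softmax(probe @ w1) - softmax(probe @ w2)
    return 0.5 * np.abs(diff).sum(axis=1).mean()

def witness(x_a, x_a_prime, y_a, x_b, y_b, probe, w0, mu, lr, k, reset):
    """Delta_BF = D2 - D1 for one A/A' -> B run, with or without the break."""
    branches = [run_steps(w0, np.zeros_like(w0), x_first, y_a, mu, lr, k)
                for x_first in (x_a, x_a_prime)]
    d1 = mean_tv(branches[0][0], branches[1][0], probe)
    finals = []
    for w, v in branches:
        if reset:
            v = np.zeros_like(v)  # causal break: drop the buffer, keep the weights
        finals.append(run_steps(w, v, x_b, y_b, mu, lr, k)[0])
    return mean_tv(finals[0], finals[1], probe) - d1

rng = np.random.default_rng(0)
n, d, c = 32, 8, 3
x = rng.normal(size=(n, d))
y = rng.integers(0, c, size=n)
x_a = x + 0.1 * rng.normal(size=x.shape)        # mild stand-in "augmentation" A
x_a_prime = x + 0.5 * rng.normal(size=x.shape)  # stronger stand-in "augmentation" A'
probe = rng.normal(size=(64, d))
w0 = 0.01 * rng.normal(size=(d, c))
args = (x_a, x_a_prime, y, x, y, probe, w0)     # B reuses the clean batch

for reset in (False, True):
    delta = witness(*args, mu=0.9, lr=0.1, k=3, reset=reset)
    print(f"reset={reset}: Delta_BF = {delta:+.5f}")
```

Two sanity properties hold by construction in this toy: passing the same batch as both $A$ and $A'$ gives exactly zero, and with `mu=0.0` the carried and reset runs coincide because the buffer contributes nothing. The sign and size of the printed values say nothing about real networks.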

Empirically, the reset matters a lot.

Across 50 unique dataset-model-regime-stage setups and two optimizer conditions, the paper reports 100 total-variation groups. Without the reset, 35 positive and 12 negative groups have confidence intervals excluding zero, with only 3 non-significant. With the reset, 22 positive and 25 negative groups are significant, again with only 3 non-significant. Every unique setup is significant in at least one condition, and 44 of 50 are significant in both.
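The positive / negative / non-significant labelling can be pictured as a confidence-interval check on repeated measurements of $\Delta_{BF}$. The sketch below uses a percentile bootstrap over per-run values; that choice, the function names, and the example numbers are assumptions for illustration, since the interval construction is not described here.

```python
import numpy as np

def bootstrap_ci(values, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean of per-run Delta_BF values."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    idx = rng.integers(0, values.size, size=(n_boot, values.size))
    means = values[idx].mean(axis=1)  # mean of each resampled set of runs
    lo, hi = np.quantile(means, [(1 - level) / 2, (1 + level) / 2])
    return float(lo), float(hi)

def classify(values):
    """Label a group by whether its interval lies above, below, or across zero."""
    lo, hi = bootstrap_ci(values)
    if lo > 0:
        return "positive"
    if hi < 0:
        return "negative"
    return "non-significant"

# Hypothetical per-seed values, invented purely for illustration.
print(classify([0.051, 0.060, 0.055, 0.052, 0.058]))
print(classify([-0.030, -0.025, -0.031, -0.028]))
print(classify([-0.020, 0.020, -0.010, 0.010, 0.000]))
```

A group counts as a sign flip in this picture when its label moves from "positive" to "negative", or the reverse, between the no-break and break conditions.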

The more revealing number is the sign flip. Resetting optimizer state before $B$ changes both magnitude and direction: 14 of 50 setups, or 28%, flip sign between no-break and break conditions.

That is not a decorative robustness check. It is the paper’s causal hinge.

The Magnitudes Are Small Enough to Be Plausible and Large Enough to Matter

The paper is measuring short-horizon effects on predictive distributions, not end-of-training leaderboard gains. So the numbers should not be read as direct business KPIs. They are diagnostic magnitudes.

Still, the observed shifts are not just statistical confetti.

Representative sign-flip cases show clear changes. For CIFAR-100 with ViT-B/16 in the resonant-strong regime, TV back-flow moves from $+0.0557$ without a break to $-0.0291$ with a break, a change of $-0.0847$. CIFAR-100 with VGG-11 in the same regime moves from $+0.0331$ to $-0.0492$. Imagenette with ResNet-18 under the standard regime moves from $+0.0161$ to $-0.0225$.

Those are not all the same story, and that matters. Some model-regime combinations already attenuate without a break. Others amplify strongly. The value of the diagnostic is precisely that it does not assume one universal training folklore. It gives a protocol for asking: in this model, under this schedule, with this intervention, is memory amplifying or attenuating the branch difference?

Regime averages sharpen the picture. Across dataset-model combinations, the negative control stays near zero: about $6 \times 10^{-4}$ without a break and $5 \times 10^{-4}$ with a break, with medians near $10^{-7}$. Standard, orthogonal, and resonant-strong regimes amplify on average without a break and shift to attenuation with a break; resonant-mid already attenuates on average and attenuates further with a break. The reported regime means of TV $\Delta_{BF}$ are:

Regime	No break mean $\Delta_{BF}$	Break mean $\Delta_{BF}$	Interpretation
Standard	$+0.0127$	$-0.00729$	Ordinary training choices can still carry measurable memory
Orthogonal	$+0.0103$	$-0.0206$	Even disjoint/content-separated designs can show order effects
Resonant strong	$+0.00734$	$-0.0310$	High momentum and overlap make the break especially visible
Resonant mid	$-0.0177$	$-0.0448$	Not every high-control setting amplifies; the break still pushes toward attenuation
Negative control	$\approx 6 \times 10^{-4}$	$\approx 5 \times 10^{-4}$	Control behavior remains near zero

The important business reading is not “always reset the optimizer.” That would be the kind of simplistic operational advice that sounds decisive until it meets a real training pipeline and quietly dies.

The better reading is: optimizer memory can be measured before teams make schedule decisions. A reset is not a moral virtue. It is an intervention. Sometimes you want continuity. Sometimes you want separation. The point is to stop guessing.

Momentum and Overlap Turn Memory Into a Training-Schedule Variable

The paper’s mechanism story has two main drivers: momentum and overlap.

Momentum is the obvious suspect. It is literally designed to carry past gradient information forward. The paper’s linearized argument predicts that amplification increases with a factor like:

$$ A_{\mu} = \frac{1 - \mu^k}{1 - \mu} $$

where $\mu$ is momentum and $k$ is the number of micro-steps. Higher momentum and more micro-steps allow past gradients to shape the next update more strongly. That is not a bug. It is the product feature working exactly as advertised, then occasionally behaving like a ghost in the training pipeline.
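One way to see where a factor of this shape can come from, offered as a simplified reading rather than the paper's own derivation: assume heavy-ball updates $v_t = \mu v_{t-1} + g_t$, a buffer that starts at zero, and a gradient that stays roughly constant at $g$ over the $k$ micro-steps. Unrolling the recursion gives

$$ v_k = \sum_{j=0}^{k-1} \mu^{j} g = \frac{1 - \mu^k}{1 - \mu}\, g = A_{\mu}\, g $$

so under these assumptions the buffer holds about $A_{\mu}$ copies of the recent gradient by the time the next phase begins. For $\mu < 1$ the factor saturates at $1/(1-\mu)$ as $k$ grows, which is why very high momentum leaves so much more to carry over.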

The experiments support this mechanism in several ways.

First, the authors log the cosine alignment between the $B$-gradient and the momentum buffer just before the first $B$ step. Without a break, that alignment concentrates above zero. Across no-break configurations, $\Delta_{BF}$ correlates with pre-$B$ momentum alignment, with Pearson $r = 0.409$ and Spearman $\rho = 0.454$, both highly significant. After the causal break, this alignment vanishes by construction.
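The alignment diagnostic is an ordinary cosine similarity between two flattened vectors. A minimal sketch follows, with a hypothetical function name and the assumed convention that a zeroed buffer yields an alignment of exactly zero.

```python
import numpy as np

def cosine_alignment(grad_b, momentum_buffer, eps=1e-12):
    """Cosine between the first B-gradient and the buffer carried over from A."""
    g = np.ravel(grad_b)
    m = np.ravel(momentum_buffer)
    denom = np.linalg.norm(g) * np.linalg.norm(m)
    if denom < eps:
        return 0.0  # a reset (all-zero) buffer has no direction to align with
    return float(g @ m / denom)

grad_b = np.array([1.0, 0.0])
carried = np.array([1.0, 1.0])       # buffer inherited from the first phase
print(cosine_alignment(grad_b, carried))            # about 0.707
print(cosine_alignment(grad_b, np.zeros(2)))        # 0.0 after a reset
```

A positive value means the first $B$ step is pushed further along a direction the previous phase already favoured, which is the situation the no-break runs tend to show.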

Second, they examine order sensitivity through non-commutativity: whether $AB$ differs from $BA$. The TV distance between $AB$ and $BA$ increases with the number of micro-steps in the illustrated CIFAR-100/ViT-B/16 resonant-strong case. Across configurations, the slope of non-commutativity versus micro-step depth correlates with back-flow: Pearson $r = 0.184$ and Spearman $\rho = 0.329$.

The Pearson correlation is modest. The Spearman result is stronger. That is exactly the kind of detail that should prevent over-reading. The diagnostics support the mechanism; they do not reduce training dynamics to one neat scalar. Neural networks remain inconsiderate like that.

Batch overlap is the second driver. If $B$ reuses samples from $A$, or overlaps in class composition, the next update can resonate with the previous one. The paper’s regimes make this explicit: standard uses momentum $0.90$, $k=3$, and overlap $0.5$; resonant-mid uses momentum $0.95$, $k=6$, and overlap $0.75$; resonant-strong uses momentum $0.99$, $k=6$, and overlap $1.0$; orthogonal uses high momentum but zero overlap and different class sampling; negative control uses no momentum, one micro-step, identity augmentation, and zero overlap.
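Plugging the stated regime settings into the factor $A_{\mu}$ from earlier in this section gives a rough sense of scale. The helper name is hypothetical, and the orthogonal regime is left out because its exact momentum value is not given above.

```python
def momentum_gain(mu, k):
    """Geometric sum 1 + mu + ... + mu**(k-1), equal to (1 - mu**k) / (1 - mu)."""
    return sum(mu ** j for j in range(k))

# (momentum, micro-steps) as listed for each regime in the text
regimes = {
    "negative control": (0.00, 1),
    "standard": (0.90, 3),
    "resonant-mid": (0.95, 6),
    "resonant-strong": (0.99, 6),
}
for name, (mu, k) in regimes.items():
    print(f"{name}: A_mu = {momentum_gain(mu, k):.3f}")
# negative control: A_mu = 1.000
# standard: A_mu = 2.710
# resonant-mid: A_mu = 5.298
# resonant-strong: A_mu = 5.852
```

The factor alone separates the negative control and standard regimes from the two resonant ones, but it barely distinguishes resonant-mid from resonant-strong, which is consistent with overlap doing part of the work.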

The appendix dose-response study is best read as a sensitivity extension, not a second thesis. With fixed weak $B$ augmentation and no break, the authors compare standard, resonant-mid, and resonant-strong settings. A simple regression is not clean because momentum, step depth, and overlap move together across regimes. The more useful within-pair comparison fixes $k=6$ and compares resonant-strong with resonant-mid for the same dataset and model. In all 10 of 10 pairs, resonant-strong yields a larger $\Delta_{BF}$ than resonant-mid, with a mean lift of $0.0251$ and a 95% confidence interval of $[0.0208, 0.0288]$.

So the practical claim is not merely “momentum matters.” The more operational claim is: momentum, overlap, and step depth jointly determine whether a training phase carries useful continuity or unwanted residue into the next phase.

Robustness Here Means “The Witness Survives Alternate Measurements”

The paper uses three divergences: total variation, Jensen-Shannon, and Hellinger. This is not a cosmetic choice. If the back-flow effect appeared only under one convenient metric, the measurement story would be much weaker.

The appendix reports strong agreement. Out of 100 groups, TV is significant in 97, Hellinger in 94, and Jensen-Shannon in 82. TV and Hellinger agree in 93 of 100 groups; TV and Jensen-Shannon agree in 80 of 100; all three are significant in 80 of 100. TV-only significance is zero, which is a useful sanity check: the primary metric is not behaving like a lonely attention seeker.

The false-discovery-rate analysis tells the same story. At 5% FDR, 94 of 100 TV groups, 93 of 100 Hellinger groups, and 81 of 100 Jensen-Shannon groups remain significant. The exact magnitudes differ by metric, as they should. The existence of the effect does not disappear.
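For readers who have not met false-discovery-rate control, the sketch below implements the Benjamini-Hochberg step-up procedure in NumPy. Treating Benjamini-Hochberg as the procedure behind the 5% FDR figure is an assumption, and the p-values are invented for illustration.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Boolean mask of hypotheses rejected at FDR level alpha (BH step-up)."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                          # indices of p-values, ascending
    thresholds = alpha * np.arange(1, m + 1) / m   # i / m * alpha for rank i
    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        cutoff = np.max(np.nonzero(passed)[0])     # largest rank meeting its threshold
        reject[order[: cutoff + 1]] = True         # reject everything up to that rank
    return reject

p_values = [0.004, 0.5, 0.02, 0.012, 0.045]        # hypothetical group-level p-values
print(benjamini_hochberg(p_values))                # [ True False  True  True False]
```

In this example four p-values fall below 0.05 uncorrected, but only three survive the correction, which mirrors the small drop in significant groups reported after FDR control.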

The paper also uses placebo and collapse checks. When $A = A'$ (identical augmentations), $\Delta_{BF}$ is statistically indistinguishable from zero. When momentum is set to zero, no-break runs collapse toward the break baseline. These checks are not glamorous, but they are important. A diagnostic that lights up when there is no controlled difference would be useless. A mechanism story about optimizer buffers that survives zero momentum would be suspicious.

In other words, the paper does not only report a positive result. It tries to make the result fail in places where it should fail.

That is how one earns the right to be interesting.

The Business Value Is Cheaper Diagnosis, Not Magic Training Gains

The direct result is a measurement contribution. It does not show that using this witness automatically improves final accuracy, reduces training cost, or produces better deployment behavior. Those would be downstream claims, and the paper does not prove them.

The business relevance comes from a different pathway: training operations often depend on phase transitions. Teams switch curricula, change augmentations, alter data mixtures, move from general data to domain data, fine-tune on high-value examples, or adjust optimizer settings. Each transition raises a practical question: should the next phase inherit momentum from the previous phase, or should the pipeline deliberately break continuity?

Today, that decision is often made by habit. Keep optimizer state because continuity is efficient. Reset because fine-tuning feels cleaner. Lower the learning rate because everyone does. Sacrifice a goat to the validation curve. The usual mature MLOps repertoire.

The paper suggests a more disciplined workflow:

Training decision	What the witness can diagnose	Possible operational use
Curriculum transition	Whether the next phase amplifies differences created by the previous phase	Decide whether to preserve or reset optimizer state before a new curriculum block
Augmentation schedule switch	Whether augmentation history remains active through $B$	Test whether aggressive augmentation leaves unwanted carryover
Domain adaptation	Whether source-domain training memory affects target-domain updates	Consider reset or partial reset before domain-specific fine-tuning
Data-mixture redesign	Whether overlapping batches/classes create resonance	Adjust sampling overlap, class ordering, or phase boundaries
Optimizer comparison	How much memory each optimizer induces under controlled interventions	Compare optimizers on memory behavior, not only speed and final loss

This is not marketing magic. It is diagnostic infrastructure. The ROI, if it exists, would come from reducing blind schedule search, making phase-boundary decisions more testable, and catching unwanted carryover before it becomes an expensive late-stage training surprise.

For large training pipelines, even a small reduction in schedule uncertainty can matter. But that remains an inference from the paper, not something the paper directly validates. The experiments measure short-horizon observable memory in vision classification settings. They do not run a production-scale LLM training program and show a budget reduction. Annoying, yes. Also scientifically respectable.

Where the Result Applies, and Where It Should Not Be Over-Sold

The paper’s boundary is clear.

First, the experiments are short-horizon and two-step. They are designed to detect and attribute memory at a micro scale, not to model an entire training run. The witness tells us that a fixed $B$ can amplify prior differences under controlled conditions. It does not tell us how that amplification accumulates across millions of steps.

Second, the benchmarks are image classification datasets — CIFAR-100 and Imagenette — with standard vision backbones including SmallCNN, ResNet-18, VGG-11, MobileNetV2, and ViT-B/16. The method is model-agnostic in principle, but the evidence is not yet universal across modalities, scales, optimizers, or foundation-model training regimes.

Third, the observable is probe-set prediction behavior. That is practical and measurable, but it is still a chosen lens. A different probe set, a representation-level metric, or a task-specific downstream evaluation might reveal different memory structure. The paper is careful about this: positive back-flow witnesses observable-level non-Markovianity for the specified intervention and observable.

Fourth, memory is not automatically bad. Optimizer history exists because it often helps optimization. The question is not whether to eliminate memory. The question is whether a particular training transition should preserve it.

That is the managerial version of the paper’s technical point. Memory is not a sin. Unmeasured memory is the problem.

From Folklore to an Instrument Panel

The strongest contribution of this work is not a new optimizer, a new architecture, or another heroic scaling curve. It is an instrument panel for something training teams already suspect but usually cannot measure cleanly.

“Data order matters” becomes: under this $A/A'/B$ protocol, does $\Delta_{BF}$ go positive?

“Momentum carries history” becomes: does resetting the optimizer before $B$ collapse or flip the witness?

“Curriculum effects are mysterious” becomes: do different phase orderings create different observable back-flow patterns?

That is a quieter kind of progress, but a useful one. Modern AI systems are increasingly shaped by pipeline decisions rather than isolated model choices. Data order, augmentation schedules, optimizer state, and phase transitions are all part of the product. They deserve diagnostics that are more precise than vibes with a YAML file.

The paper’s message is not that SGD has a memory problem. It is that SGD has measurable memory. Sometimes that memory amplifies the right signal. Sometimes it preserves the wrong residue. Either way, once it becomes measurable, training design becomes less like superstition and more like operations.

SGD remembers. The better question is whether the pipeline wanted it to.

Cognaptus: Automate the Present, Incubate the Future.

¹ Vasileios Sevetlidis and George Pavlidis, “Process-Tensor Tomography of SGD: Measuring Non-Markovian Memory via Back-Flow of Distinguishability,” arXiv:2601.16563, 2026, https://arxiv.org/abs/2601.16563.
