Opening — Why this matters now
Modern deep learning quietly assumes a comforting fiction: that training is memoryless. Given the current parameters (and maybe the optimizer buffers), tomorrow’s update shouldn’t care about yesterday’s data order, augmentation choice, or micro-step path. This assumption underwrites theory, stabilizes intuition, and keeps whiteboards clean.
Reality, however, has been less cooperative. Practitioners know that order matters, momentum carries ghosts of past gradients, and small curriculum tweaks can echo far longer than expected. Yet until now, there has been no clean, operational way to measure whether training truly forgets—or merely pretends to.
This paper introduces exactly that: a diagnostic that tells you, with confidence bounds, whether SGD remembers its past.
Background — From quantum memory to neural training
The authors borrow an idea from an unlikely place: open quantum systems. In that literature, non-Markovianity is detected via back-flow of distinguishability—if two states become more distinguishable over time under the same operation, information must be flowing back from hidden memory.
The conceptual leap here is elegant: treat training itself as a process tensor—a multi-time map from controlled interventions (batch selection, augmentations, optimizer micro-steps) to observable outcomes (model predictions on a fixed probe set).
Instead of asking whether SGD is Markovian in theory, the paper asks something sharper:
> Given what we can actually observe, does a fixed training step act independently of the immediate past?
Analysis — Training as a two-step experiment
The core protocol is disarmingly simple.
- Two histories: Start from identical parameters. Apply one of two first-step interventions, A or A′. These differ only in something subtle—typically augmentation—while using the same data.
- Measurement: Compare the two models' predictions on a fixed probe set and record their distance. Call this $D_1$.
- Common future: Apply the same second intervention B to both models.
- Measure again: Compute the new distance $D_2$.
The statistic of interest is:
$$ \Delta_{BF} = D_2 - D_1 $$
If training were operationally Markovian at the observable level, $\Delta_{BF}$ should never be positive. Any increase means that B amplified a past difference instead of washing it out.
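In code, the whole protocol fits in a few lines. Here is a minimal sketch in Python; `make_optimizer`, `step_A`, `step_A_prime`, `step_B`, and `probe_distance` are hypothetical placeholders standing in for the paper's interventions and probe-set divergence, not its actual implementation.

```python
import copy

def backflow_statistic(model, make_optimizer,
                       step_A, step_A_prime, step_B, probe_distance):
    """Two-history protocol: returns Delta_BF = D2 - D1.

    step_* apply one intervention (a few optimizer micro-steps) to a
    (model, optimizer) pair in place; probe_distance maps two models
    to a scalar divergence on a fixed probe set. All placeholders.
    """
    # Two histories from identical parameters.
    m1, m2 = copy.deepcopy(model), copy.deepcopy(model)
    opt1, opt2 = make_optimizer(m1), make_optimizer(m2)

    step_A(m1, opt1)        # intervention A
    step_A_prime(m2, opt2)  # intervention A', e.g. a different augmentation
    d1 = probe_distance(m1, m2)

    # Common future: the same intervention B on both histories.
    step_B(m1, opt1)
    step_B(m2, opt2)
    d2 = probe_distance(m1, m2)

    return d2 - d1  # positive => back-flow: B amplified a past difference
```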
To avoid metric games, the authors test three contractive divergences on softmax outputs (computed as in the sketch after this list):
- Total Variation (TV)
- Jensen–Shannon (JS)
- Hellinger
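All three are straightforward to compute from softmax outputs. A sketch in PyTorch (the exact probe-set aggregation used in the paper is an assumption here):

```python
import torch
import torch.nn.functional as F

def probe_divergences(logits_1, logits_2):
    """TV, Jensen-Shannon, and Hellinger between two models' softmax
    outputs, averaged over a probe set of shape [n_probe, n_classes]."""
    p = F.softmax(logits_1, dim=-1)
    q = F.softmax(logits_2, dim=-1)
    m = 0.5 * (p + q)

    def kl(a, b):  # KL(a || b), clamped to avoid log(0)
        return (a * (a.clamp_min(1e-12) / b.clamp_min(1e-12)).log()).sum(-1)

    tv = 0.5 * (p - q).abs().sum(-1)
    js = 0.5 * kl(p, m) + 0.5 * kl(q, m)  # bounded by ln 2
    hellinger = ((p.sqrt() - q.sqrt()).pow(2).sum(-1) / 2).sqrt()

    return {"tv": tv.mean(), "js": js.mean(), "hellinger": hellinger.mean()}
```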
The result: positive back-flow appears consistently, and it is robust to the choice of metric.
A crucial intervention — The causal break
The most revealing move in the paper is the causal break.
Before applying B, the authors reset the optimizer state—momentum buffers, history, everything—while keeping the parameters fixed. This severs any hidden memory channel that might connect A to B.
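Mechanically, the reset is tiny. A minimal sketch, assuming a PyTorch-style optimizer (the paper's exact implementation may differ):

```python
import torch

def causal_break(optimizer: torch.optim.Optimizer) -> None:
    """Sever the optimizer's memory channel without touching parameters.

    Clearing per-parameter state discards momentum buffers and any
    other history; SGD re-initializes them on its next step. Building
    a fresh optimizer over the same parameters does the same thing.
    """
    optimizer.state.clear()
```

Call it between interventions A and B, and any influence of A that survives must travel through the parameters alone.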
The prediction is clean:
- If memory is carried by optimizer state, $\Delta_{BF}$ should collapse.
- If not, nothing should change.
What happens empirically is decisive. Across datasets, architectures, and regimes, the back-flow not only shrinks—it often flips sign. Roughly 28% of configurations reverse from amplification to attenuation.
This is not correlation. It is a controlled causal test.
Findings — What actually drives training memory
The experiments span CIFAR-100 and Imagenette, with CNNs, ResNets, MobileNets, and ViTs. The patterns are strikingly consistent.
1. Momentum amplifies memory
Higher momentum produces stronger back-flow. A simple linearized analysis predicts this: accumulated gradients align future updates with past ones, magnifying small differences. Empirically, $\Delta_{BF}$ scales with a momentum amplification factor and with direct gradient–momentum alignment.
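A back-of-the-envelope version of that linearization, assuming standard heavy-ball momentum and ignoring how the parameter gap feeds back into later gradients (the paper's amplification factor may be defined differently): with velocity $v_t$, momentum $\mu$, and learning rate $\eta$,

$$ v_{t+1} = \mu v_t + g_t, \qquad \theta_{t+1} = \theta_t - \eta\, v_{t+1}, $$

a first-step gradient difference $\delta g_0$ between the two histories echoes through the velocity as $\delta v_t = \mu^{t-1} \delta g_0$, so the accumulated parameter gap approaches

$$ \delta\theta_\infty = -\eta \sum_{t=1}^{\infty} \mu^{t-1}\, \delta g_0 = -\frac{\eta}{1-\mu}\, \delta g_0. $$

At $\mu = 0.9$ that is a tenfold amplification over plain SGD ($\mu = 0$), consistent with back-flow growing with momentum.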
2. Data overlap creates resonance
When B reuses data seen in A, especially with similar class composition, amplification increases. Disjoint batches damp it. Training behaves less like noise injection and more like a resonant system.
3. Micro-step depth matters
More optimizer steps per intervention increase non-commutativity (AB ≠ BA) and correlate with larger back-flow. Short horizons are already enough to reveal order sensitivity.
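Order sensitivity can be probed with the same machinery as the back-flow statistic. A sketch reusing the hypothetical placeholders from the protocol code above:

```python
import copy

def order_sensitivity(model, make_optimizer, step_A, step_B, probe_distance):
    """Compare A-then-B against B-then-A from identical parameters.

    If the two interventions commuted (AB == BA), the probe distance
    would be zero; larger values mean stronger order sensitivity.
    """
    m_ab, m_ba = copy.deepcopy(model), copy.deepcopy(model)
    opt_ab, opt_ba = make_optimizer(m_ab), make_optimizer(m_ba)

    step_A(m_ab, opt_ab)
    step_B(m_ab, opt_ab)   # A then B

    step_B(m_ba, opt_ba)
    step_A(m_ba, opt_ba)   # B then A

    return probe_distance(m_ab, m_ba)  # 0 when the interventions commute
```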
4. The break works
Resetting optimizer state reliably attenuates or nullifies back-flow. Setting momentum to zero produces the same effect. The mechanism is unambiguous.
Summary snapshot
| Factor | Effect on Memory |
|---|---|
| Momentum ↑ | Amplifies back-flow |
| Batch overlap ↑ | Resonant amplification |
| Micro-steps ↑ | Stronger order sensitivity |
| Optimizer reset | Memory collapses |
Implications — From folklore to diagnostics
This work turns a long-standing intuition—“data order matters”—into a testable operator.
For practitioners, the implications are immediate:
- Schedule design can be probed, not guessed. If a phase transition is coming (e.g., curriculum switch, augmentation change), a causal break may be beneficial.
- Optimizer comparisons gain a new axis: not just speed or stability, but induced memory.
- Curriculum learning can be instrumented. The paper includes an exploratory case where micro-level back-flow predicts macro-level training behavior.
More broadly, this reframes training dynamics as something closer to a controlled experiment than a black art.
Conclusion — SGD is not amnesic
The central message is quiet but firm: practical SGD deviates from the Markov idealization in measurable, controllable ways. Optimizer state acts as a memory channel. Sometimes that memory helps. Sometimes it hurts. But it is no longer invisible.
By importing process-tensor thinking and back-flow diagnostics into machine learning, this paper provides a rare thing: a bridge between theory, mechanism, and hands-on practice.
SGD remembers. Now we can ask whether it should.
Cognaptus: Automate the Present, Incubate the Future.