When SGD Remembers: The Hidden Memory Inside Training Dynamics
Reset Is the Most Honest Experiment
Resetting an optimizer sounds boring. It is the kind of engineering operation that hides inside training scripts, not the kind of thing that gets people excited at conference coffee breaks. But in this paper, reset becomes a scalpel. The authors ask a deceptively simple question: when a neural network receives the same next training intervention, does that intervention behave the same way regardless of what just happened before? [1] In a tidy Markovian story, the answer should be yes, at least once the relevant state is specified. In practical training, the answer is more inconvenient. Momentum buffers, batch overlap, augmentation choices, and short update histories can all make yesterday’s path leak into today’s update. ...
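To make the momentum-buffer point concrete, here is a minimal sketch, not the paper's experimental setup. It assumes plain heavy-ball SGD on a toy quadratic loss; the names `momentum_step` and `grad_fn`, the learning rate, and the buffer values are all hypothetical choices for illustration. Two runs hold identical weights and see the identical gradient, yet they take different steps because their buffers remember different pasts. Zeroing the buffer (the reset) makes the two steps coincide.

```python
import numpy as np

def momentum_step(w, v, grad, lr=0.1, beta=0.9):
    """One heavy-ball SGD step; returns the new weights and the new buffer."""
    v_new = beta * v + grad
    w_new = w - lr * v_new
    return w_new, v_new

def grad_fn(w):
    # Gradient of the toy loss 0.5 * ||w||^2 (a stand-in for a minibatch gradient).
    return w

# Both runs arrive at the SAME weights, but via different (assumed) histories,
# so their momentum buffers differ.
w = np.array([1.0, -2.0])
v_path_a = np.array([0.5, 0.5])
v_path_b = np.array([-0.5, 0.0])

# Same weights, same gradient, different buffers: the next update differs.
g = grad_fn(w)
w_a, _ = momentum_step(w, v_path_a, g)
w_b, _ = momentum_step(w, v_path_b, g)
history_leaks = not np.allclose(w_a, w_b)

# Reset: zero the buffer before the step. The update now depends only on w.
zero = np.zeros_like(w)
w_a_reset, _ = momentum_step(w, zero, g)
w_b_reset, _ = momentum_step(w, zero, g)
reset_removes_leak = np.allclose(w_a_reset, w_b_reset)

print("history leaks into update:", history_leaks)
print("reset removes the leak:   ", reset_removes_leak)
```

In this toy setting the only hidden state is the buffer, so treating the pair (weights, buffer) as the state restores the Markovian picture. The other channels named above, such as batch overlap and augmentation choices, are not modeled in this sketch.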