Training Dynamics

Reset Is the Most Honest Experiment

Resetting an optimizer sounds boring. It is the kind of engineering operation that hides inside training scripts, not the kind of thing that gets people excited at conference coffee breaks. But in this paper, reset becomes a scalpel. The authors ask a deceptively simple question: when a neural network receives the same next training intervention, does that intervention behave the same way regardless of what just happened before?[1] In a tidy Markovian story, the answer should be yes, at least once the relevant state is specified. In practical training, the answer is more inconvenient. Momentum buffers, batch overlap, augmentation choices, and short update histories can all make yesterday’s path leak into today’s update. ...
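To make the momentum point concrete, here is a minimal toy sketch, not the paper's experimental setup. It assumes a single scalar parameter and the common heavy-ball update (buffer = beta * buffer + grad); the function name `momentum_step` and all numeric values are made up for illustration. Two runs share the same parameter and receive the same next gradient, yet land in different places because their momentum buffers differ. Resetting the buffer removes the difference.

```python
def momentum_step(param, grad, buf, lr=0.1, beta=0.9):
    """One SGD-with-momentum step on a scalar; returns (new_param, new_buf)."""
    new_buf = beta * buf + grad          # history enters through the old buffer
    return param - lr * new_buf, new_buf

param = 1.0        # both runs sit at the same parameter value
next_grad = 0.5    # and receive the same "next intervention"

# Hypothetical histories: run A arrives with positive momentum, run B with negative.
after_a, _ = momentum_step(param, next_grad, buf=2.0)
after_b, _ = momentum_step(param, next_grad, buf=-2.0)

# Same intervention after an optimizer reset (buffer zeroed in both runs).
reset_a, _ = momentum_step(param, next_grad, buf=0.0)
reset_b, _ = momentum_step(param, next_grad, buf=0.0)

print("with history:", after_a, after_b)   # the two runs disagree
print("after reset: ", reset_a, reset_b)   # the two runs agree
```

In this toy, the parameter alone is not a sufficient description of the state: the update becomes history-independent only once the buffer is either included in the state or reset.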

TL;DR for operators

Deletion is simple in a database. It is not simple in a neural network that has already used the deleted record to improve its internal machinery. That is the unpleasant little invoice this paper presents. Gaurav R. Ghosal, Pratyush Maini, and Aditi Raghunathan study why repeated natural text is hard to remove from language models after training, then propose MemSinks, a training-time mechanism designed to make memorization easier to isolate later.[1] The important shift is not “better pruning.” It is architectural accounting. Instead of hoping that memorized text happens to live in a few removable neurons, MemSinks gives repeated sequences a controlled place to accumulate memorization during training. ...


When SGD Remembers: The Hidden Memory Inside Training Dynamics

The Sink That Remembers: Solving LLM Memorization Without Forgetting Everything Else