Coordinate Descent

Memory is usually treated as a luxury in machine learning. More parameters, more activations, more optimiser state, more logs, more everything. Then the invoice arrives, the device overheats, and someone rediscovers the ancient corporate virtue of not wasting things. The paper “Turning Stale Gradients into Stable Gradients” makes a modest but interesting proposal: perhaps an optimiser should not throw away old gradient information just because it is old.¹ In the right setting, yesterday’s partial derivative is not spoiled milk. It is a slightly outdated map. If the terrain has not shifted too violently, it may still point in a useful direction. ...