Memory is usually treated as a luxury in machine learning. More parameters, more activations, more optimiser state, more logs, more everything. Then the invoice arrives, the device overheats, and someone rediscovers the ancient corporate virtue of not wasting things.

The paper “Turning Stale Gradients into Stable Gradients” makes a modest but interesting proposal: perhaps an optimiser should not throw away old gradient information just because it is old.[1] In the right setting, yesterday’s partial derivative is not spoiled milk. It is a slightly outdated map. If the terrain has not shifted too violently, it may still point in a useful direction.

That is the core idea behind Coherent Coordinate Descent, or CoCD. It is a deterministic zeroth-order optimiser: it optimises using only function evaluations, not backpropagated gradients. It cycles through coordinates, estimates a few finite-difference partial derivatives at each step, stores them in a buffer, and reuses stale entries as part of a global descent direction. Add momentum, add fading memory, and suddenly the supposedly obsolete gradient becomes a cheap approximation of the current landscape.

The paper’s real contribution is not merely that CoCD beats a baseline in a few tables, although it does. The interesting move is conceptual. It argues that stale gradients and larger finite-difference intervals, two things an optimiser designer might instinctively distrust, can become stabilising mechanisms when the optimisation trajectory is locally coherent.

This is not a universal recipe for training giant models on pocket change. That would be the sort of claim best left to pitch decks and other controlled substances. CoCD is much narrower and more useful than that: it is a serious attempt to make zeroth-order optimisation more stable and sample-efficient in small-to-medium neural-network settings where backpropagation is unavailable, too memory-heavy, or operationally awkward.

CoCD keeps a gradient ledger instead of rolling dice

Zeroth-order optimisation starts from a difficult premise: the optimiser cannot ask for the gradient. It can only evaluate the objective. This matters in black-box systems, simulation-based optimisation, adversarial testing, and memory-constrained training where building a full computation graph is expensive or impossible.

Classical finite-difference methods are accurate but costly. For a model with $n$ parameters, estimating every coordinate direction requires work proportional to $n$. Randomised zeroth-order methods cut that cost by sampling directions, but they pay in variance. The optimiser gets cheaper signals, but noisier ones. Naturally, the field has learned to compensate with small learning rates, larger batches, and various forms of averaging. In other words: the algorithm saves money, then spends it calming itself down. Corporate finance would recognise the pattern.

CoCD takes a different path. Instead of constructing a fresh random gradient estimate at every step, it maintains a dense gradient buffer $\hat{g}$. At each optimisation step, it performs only $B$ finite-difference queries, where $B$ is the compute budget. It refreshes those coordinates in a deterministic cyclic order and leaves the rest of the buffer as remembered information.

The finite-difference estimate for coordinate $i$ is the familiar central difference, where $e_i$ is the $i$-th standard basis vector and $\epsilon$ is the step size:

$$ \tilde{\nabla}_i f(x) = \frac{f(x+\epsilon e_i)-f(x-\epsilon e_i)}{2\epsilon} $$
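As a minimal sketch of that formula, assuming parameters live in a flat NumPy array: the helper name `central_diff` and the toy objective `quadratic` are illustrative choices, not names from the paper.

```python
import numpy as np

def central_diff(f, x, i, eps):
    """Central finite-difference estimate of the i-th partial derivative of f at x."""
    e_i = np.zeros_like(x)          # i-th standard basis vector
    e_i[i] = 1.0
    return (f(x + eps * e_i) - f(x - eps * e_i)) / (2.0 * eps)

def quadratic(x):
    """Toy objective (an assumption for illustration): sum of squares, gradient 2x."""
    return float(np.sum(x ** 2))

x = np.array([1.0, -2.0, 3.0])
# Two function evaluations per coordinate; the result should be close to 2 * x[1] = -4.
print(central_diff(quadratic, x, i=1, eps=1e-3))
```

Each coordinate costs two evaluations of `f`, which is why estimating all $n$ coordinates at every step becomes expensive.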

CoCD does not estimate every coordinate every time. It updates a subset, stores the result, and then updates the parameters using the whole available buffer:

$$ x_{t+1}=x_t-\alpha \hat{g}_t $$

The method therefore separates two clocks that are usually treated as one. The model takes an optimisation step at time $t$, but the gradient buffer contains entries refreshed at different internal times. Some are fresh. Some are stale. CoCD’s bet is that this mixture can still form a useful descent direction.

The method also introduces a momentum-like decay parameter $\gamma$ that defines three regimes:

| Setting | Technical behaviour | Interpretation |
| --- | --- | --- |
| $\gamma = 1$ | Stale gradient entries remain until refreshed | Maximum reuse; the buffer acts like a time-lagged full-gradient estimate |
| $\gamma = 0$ | Old entries are cleared | Equivalent to standard Block Cyclic Coordinate Descent, using only freshly updated coordinates |
| $0 < \gamma < 1$ | Older entries are exponentially suppressed | Fading memory; useful when the landscape changes too quickly for perfect memory |

This framing matters because it turns CoCD from a trick into a spectrum. At one end is conservative coordinate descent. At the other is aggressive historical reuse. The middle is where business systems usually live: useful memory, but not religious devotion to it.
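The buffer logic can be sketched in a few lines. This is a reading of the description above, not the authors' code: it assumes the decay is applied by multiplying the whole buffer by $\gamma$ before the $B$ scheduled coordinates are overwritten, and the names `cocd_step`, `run`, and `toy` are hypothetical.

```python
import numpy as np

def cocd_step(f, x, g_buf, ptr, B, eps, alpha, gamma):
    """One CoCD-style step; returns the new iterate and the next cyclic pointer."""
    n = x.size
    g_buf *= gamma                      # assumed decay rule: gamma=1 keeps, gamma=0 clears
    for k in range(B):                  # refresh the next B coordinates in cyclic order
        i = (ptr + k) % n
        e_i = np.zeros(n)
        e_i[i] = 1.0
        g_buf[i] = (f(x + eps * e_i) - f(x - eps * e_i)) / (2.0 * eps)
    # Descend along the whole buffer: fresh entries plus remembered ones.
    return x - alpha * g_buf, (ptr + B) % n

def run(f, x0, steps, **kw):
    x, g_buf, ptr = x0.copy(), np.zeros_like(x0), 0
    for _ in range(steps):
        x, ptr = cocd_step(f, x, g_buf, ptr, **kw)
    return x

def toy(x):
    """Toy convex objective (illustrative only)."""
    return 0.5 * float(np.sum(x ** 2))

x0 = np.ones(8)
x_reuse = run(toy, x0, steps=48, B=2, eps=1e-2, alpha=0.1, gamma=1.0)
x_forget = run(toy, x0, steps=48, B=2, eps=1e-2, alpha=0.1, gamma=0.0)
print("start:", toy(x0), "gamma=0:", toy(x_forget), "gamma=1:", toy(x_reuse))
```

On this smooth toy bowl the reuse setting makes more progress per function evaluation than the forgetful one. That is the friendliest possible case for stale entries, so it illustrates the mechanism rather than reproducing any result from the paper.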

The counterintuitive mechanism: stale gradients are useful only when the path is coherent

A stale gradient is dangerous when the model has moved far from the point where that gradient was measured. The paper does not deny this. Instead, it asks a better question: how far has the optimisation trajectory actually moved?

The paper formalises this through local coherence. Roughly, if recent iterates lie in a region where the objective is smooth, gradients do not change arbitrarily between steps. The stale coordinate is not exact, but its error is bounded by two quantities: how rough the local landscape is, and how far the optimiser has travelled since the coordinate was last refreshed.

This leads to the paper’s key error bound. Let $K=\lfloor n/B \rfloor$ and $r=n \bmod B$. The approximation error between the CoCD buffer and the finite-difference gradient is bounded by a term of the form:

$$ \left\|\hat{g}_t-\tilde{\nabla}^{\epsilon}f(x_t)\right\| \le \frac{L_{\epsilon}\delta}{2} \left(BK(K-1)+2rK\right) $$

The details matter less than the structure. The error is governed by $L_{\epsilon}\delta$: the effective smoothness of the finite-difference gradient times the local movement of the optimisation path.

That is the mechanism. Staleness is not automatically fatal. It becomes costly when the landscape is too jagged, the optimiser moves too far, or the compute budget is too low to refresh coordinates often enough. If the trajectory is coherent, stale entries remain informative long enough to be useful.

This also explains why compute budget $B$ matters. Higher $B$ refreshes more coordinates per step and reduces staleness. Lower $B$ makes the method cheaper, but forces it to trust older information. The algorithm is not magic; it is accounting. It simply accounts for gradient age more intelligently than discarding it.
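To see how strongly the budget drives the bound, the bracketed multiplier $BK(K-1)+2rK$ can be evaluated directly. The sketch below uses a round $n = 13{,}000$ purely as an illustrative parameter count; the function name `staleness_multiplier` is hypothetical, and the full bound still needs the problem-dependent factor $L_{\epsilon}\delta/2$.

```python
def staleness_multiplier(n: int, B: int) -> int:
    """Bracketed term B*K*(K-1) + 2*r*K, with K = floor(n / B) and r = n mod B."""
    K, r = divmod(n, B)
    return B * K * (K - 1) + 2 * r * K

n = 13_000  # illustrative round number, not an exact model size
for B in (16, 64, 256, 1024, n):
    print(f"B={B:>6}  multiplier={staleness_multiplier(n, B)}")
```

At $B=n$ the multiplier is zero, because every coordinate is refreshed at every step. As $B$ shrinks, the multiplier grows roughly like $n^2/B$, which is the accounting behind the cost of trusting older entries.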

Larger finite differences can smooth the landscape, not just blur the answer

The second misconception the paper attacks is subtler. In finite differences, a smaller $\epsilon$ is usually treated as more faithful because it approximates the true derivative more closely. CoCD’s analysis says: yes, but faithfulness is not the only objective.

A finite-difference estimate is also an average over a local neighbourhood. With central differences, the coordinate derivative is averaged across an interval around the current parameter value. A larger $\epsilon$ therefore smooths away small-scale irregularities. In a noisy, highly non-convex landscape, that can be useful.

The paper captures this using an effective coordinate-wise smoothness constant, $L_{\epsilon}$. A larger finite-difference interval can reduce $L_{\epsilon}$, which lowers the staleness error bound and permits more stable descent. The price is bias: a coarser finite difference may optimise a smoothed version of the objective rather than the exact original one.

That trade-off is important. CoCD is not saying “make $\epsilon$ large and enjoy enlightenment.” It is saying that in this setting, a larger $\epsilon$ can function like implicit landscape smoothing. For business users, the translation is simple: sometimes the optimiser benefits from a slightly blurred map if the exact map is full of distracting cracks.
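A one-dimensional toy makes the smoothing effect concrete. The function `bumpy` below is an invented example (a quadratic bowl plus a small, fast ripple), and the crude Lipschitz estimate is only a stand-in for the paper's $L_{\epsilon}$, not its definition.

```python
import numpy as np

def bumpy(x, a=0.05, w=200.0):
    """Smooth bowl x^2 plus a small, fast ripple (hypothetical test function)."""
    return x ** 2 + a * np.sin(w * x)

def central_diff_1d(f, x, eps):
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

def roughness(eps, xs=np.linspace(-1.0, 1.0, 20001)):
    """Return (largest ripple left in the slope estimate, crude Lipschitz estimate)."""
    est = central_diff_1d(bumpy, xs, eps)
    ripple = float(np.max(np.abs(est - 2.0 * xs)))      # deviation from the bowl's slope 2x
    lipschitz = float(np.max(np.abs(np.diff(est) / np.diff(xs))))
    return ripple, lipschitz

for eps in (1e-4, 1e-1):
    ripple, lipschitz = roughness(eps)
    print(f"eps={eps:g}  ripple={ripple:.3f}  crude Lipschitz estimate={lipschitz:.1f}")
```

With the small interval the estimate follows every ripple, so it changes sharply between nearby points. With the large interval the ripple mostly averages out and the estimate tracks the bowl. The bias is visible too: the large-interval estimate is no longer the derivative of `bumpy` itself, only of its smoothed version.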

The theory gives permission, not immunity

The theoretical section links CoCD to Block Cyclic Coordinate Descent with stale-gradient warm starts. This is the paper’s main legitimacy move. Without that connection, CoCD could look like a buffer hack. With it, the authors can reason about when gradient reuse remains stable.

Under $L$-smoothness and a Polyak-Łojasiewicz (PŁ) condition, the paper proves linear convergence for CoCD, with a staleness factor $\tau=n/B-1$. In the full-budget case, $B=n$, staleness vanishes ($\tau=0$) and CoCD behaves like standard gradient descent driven by finite-difference gradients under the same PŁ-style setting.
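For readers who want the two assumptions spelled out, the standard textbook forms are given here. They are stated for exact gradients with generic constants $L$ and $\mu$; the paper's own rate also depends on $\tau$ and $\epsilon$, and its exact constants are not reproduced. Smoothness bounds how fast the gradient can change:

$$ \|\nabla f(x)-\nabla f(y)\| \le L\|x-y\| $$

The PŁ condition says the gradient is large whenever the objective is far from its minimum value $f^{\ast}$:

$$ \tfrac{1}{2}\|\nabla f(x)\|^2 \ge \mu\left(f(x)-f^{\ast}\right) $$

With exact gradients and step size $1/L$, these two conditions together give the familiar linear rate:

$$ f(x_{t+1})-f^{\ast} \le \left(1-\frac{\mu}{L}\right)\left(f(x_t)-f^{\ast}\right) $$

The staleness factor itself is simple arithmetic: $B=n$ gives $\tau=0$, while $B=n/4$ gives $\tau=3$, which reads naturally as each coordinate being reused for three further steps before its next refresh.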

This is useful but should be read with discipline. PŁ conditions are stronger than “modern neural networks are messy but somehow train.” They provide a worst-case guarantee under idealised assumptions, not a production guarantee. The paper itself notes that practical convergence can be faster than the conservative bound because the proof relaxes several cross terms to worst-case estimates.

The theory is best understood as permission: stale gradients are not automatically illegitimate. Under local smoothness, limited movement, adequate refresh rates, and reasonable memory decay, they can support descent.

It is not immunity: if the landscape changes too fast, if the compute budget is too small, or if stale entries are trusted for too long, the approximation will degrade. Optimisers, unlike consultants, cannot survive forever on old information.

The main experiments test whether reuse beats forgetting

The empirical design is clean enough to be useful. The authors compare CoCD against standard BCCD, where $\gamma=0$, so the gradient buffer does not reuse stale information. That comparison isolates the effect of gradient history.

The main benchmarks cover SARCOS regression, MNIST classification, and CIFAR-10 classification. The tested models are deliberately modest: a 5-layer MLP with about 13k parameters for SARCOS, and a small CNN with about 20k parameters for MNIST and CIFAR-10. SGD is included only as a first-order oracle reference, not as a fair zeroth-order baseline, because it uses exact gradients.

The result is not subtle:

| Dataset | Method | Final metric | Time per epoch |
| --- | --- | --- | --- |
| SARCOS | SGD oracle | Loss: 5.38 | |
| SARCOS | BCCD | Loss: 188.73 | 6.23s |
| SARCOS | CoCD | Loss: 31.18 | 6.12s |
| MNIST | SGD oracle | Accuracy: 99.21% | |
| MNIST | BCCD | Accuracy: 27.03% | ≈44.42s |
| MNIST | CoCD | Accuracy: 95.48% | ≈44.56s |
| CIFAR-10 | SGD oracle | Accuracy: 62.51% | |
| CIFAR-10 | BCCD | Accuracy: 10.13% | ≈77.01s |
| CIFAR-10 | CoCD | Accuracy: 45.08% | ≈77.03s |

The practical interpretation is straightforward. CoCD’s advantage over BCCD is not coming from materially higher epoch time. It comes from using the same broad coordinate-descent machinery while retaining and weighting historical gradient information. On CIFAR-10, BCCD sits near random guessing, while CoCD reaches meaningful classification performance. That is a large signal, not a cosmetic improvement.

The gap to SGD remains visible, especially on SARCOS and CIFAR-10. That is expected. Exact backpropagated gradients are still superior when available and affordable. CoCD’s claim is not “better than backpropagation.” It is “substantially better than forgetting useful finite-difference information when backpropagation is not the tool on the table.”

The ablations show which knob actually matters

The ablation section is not a second thesis. Its purpose is to test whether the proposed mechanism survives changes in smoothing radius, momentum, compute budget, and memory budget.

| Test | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Smoothing radius $\epsilon$ | Ablation of implicit smoothing | Larger $\epsilon$ improves SARCOS performance and stabilises MNIST training curves, though MNIST can prefer smaller $\epsilon$ for final accuracy under larger budgets | It does not prove one universal $\epsilon$ works across tasks |
| Momentum $\gamma$ | Ablation of gradient-history reuse | Higher $\gamma$ accelerates convergence; in a SARCOS stress test with $B=1$, CoCD reaches loss 68.64 using $\gamma=0.95$, $\epsilon=1.0$, while BCCD stagnates near 188 even with $B=64$ | It does not prove stale history is safe under all non-convex dynamics |
| Compute budget $B$ | Sensitivity and stability test | For the MLP, $B=64$ is the threshold needed to escape the initial plateau quickly; larger budgets show diminishing returns | It does not remove the scaling problem as parameter counts grow |
| Memory budget $M$ | Memory-efficiency test | Reducing memory introduces noise but does not prevent convergence; even $M=0.25$ remains competitive | It does not establish behaviour under severe embedded-memory limits across model families |
| SPSA / ZO-SGD comparison | Comparison with random-update baselines | On SARCOS, randomised baselines diverge partway while CoCD remains stable; CoCD is reported at 8.1s versus 15.7s per episode against ZO-SGD under matched function-evaluation budgets | Classification baselines are reported to match CoCD final accuracy, so the stability claim is strongest for the regression case |
| ResNet-20 appendix | Exploratory scaling extension | CoCD remains stable on a roughly 270k-parameter ResNet-20 with $B=2048$ and beats BCCD over the short 10-episode window | It is not evidence of large-model or LLM-scale readiness |

The most important ablation is momentum. This is where the paper’s mechanism becomes hard to dismiss. If $\gamma=0$ performs poorly and higher $\gamma$ accelerates convergence, then the improvement is not merely a coordinate-ordering artefact. It is the reuse of history.

The smoothing result is also important, but more nuanced. On SARCOS, larger $\epsilon$ helps monotonically. On MNIST, smaller $\epsilon$ can produce better final accuracy when compute is generous, while larger $\epsilon$ produces smoother, more stable training. That is exactly the sort of trade-off one should expect if $\epsilon$ is doing both approximation and smoothing. It is not a magic constant. It is a bias-stability knob.

The random-baseline comparison clarifies what CoCD is really competing against

A useful part of the paper compares CoCD against SPSA and ZO-SGD under matched function-evaluation budgets. This is not just another leaderboard row. It tests whether deterministic coordinate structure can avoid the variance burden of randomised zeroth-order methods.

On the SARCOS regression task, the random-update baselines diverge partway through training, while CoCD stabilises. The paper also reports that CoCD is almost twice as fast as ZO-SGD in that experiment, at 8.1 seconds versus 15.7 seconds per episode on average. The reason is not mysterious: ZO-SGD relies on dense Gaussian sampling for explicit smoothing, while CoCD gets smoothing through finite differences and structure, without drawing new dense random directions every step.
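For contrast, here is a minimal sketch of the two randomised estimators in their common textbook forms. The paper's exact baseline configurations are not given here, so treat the function names, the toy objective, and the settings as assumptions.

```python
import numpy as np

def spsa_grad(f, x, c, rng):
    """SPSA-style estimate: one random +/-1 direction, two function evaluations."""
    delta = rng.choice([-1.0, 1.0], size=x.size)
    # For +/-1 entries, dividing by delta_i equals multiplying by delta_i.
    return (f(x + c * delta) - f(x - c * delta)) / (2.0 * c) * delta

def gaussian_grad(f, x, mu, rng):
    """Two-point Gaussian-smoothing estimate: one dense normal direction."""
    u = rng.standard_normal(x.size)
    return (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu) * u

def toy(x):
    return 0.5 * float(np.sum(x ** 2))   # true gradient is x itself

rng = np.random.default_rng(0)
x = np.ones(50)
single = spsa_grad(toy, x, 1e-2, rng)
mean_spsa = np.mean([spsa_grad(toy, x, 1e-2, rng) for _ in range(5000)], axis=0)
mean_gauss = np.mean([gaussian_grad(toy, x, 1e-2, rng) for _ in range(5000)], axis=0)
print("single-draw SPSA error:", np.linalg.norm(single - x))
print("averaged SPSA error:   ", np.linalg.norm(mean_spsa - x))
print("averaged Gaussian error:", np.linalg.norm(mean_gauss - x))
```

Both estimators are cheap per draw and correct on average for this toy, but a single draw is far from the true gradient. Averaging many draws closes the gap, and that averaging is the variance cost the deterministic buffer tries to avoid paying.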

For classification tasks, the authors say the random baselines can match CoCD’s final accuracy. That boundary is important. CoCD’s empirical advantage over randomised methods is not uniformly “higher final accuracy everywhere.” It is stronger stability and lower overhead in the tested regression setting, plus a deterministic alternative to variance-heavy search.

That distinction matters for deployment. If final accuracy is the only objective and random methods behave well, the case for switching may be weaker. If stability, repeatability, and predictable query budgets matter, deterministic structure becomes more attractive.

The engineering contribution is boring in the best possible way

One practical obstacle with coordinate descent on neural networks is that model parameters are not a clean flat vector. They are a collection of tensors: convolution kernels, matrices, biases, and other shaped objects. Naively flattening and reshaping them every step would add overhead and damage the point of the method.

CoCD’s implementation uses a flattened FIFO buffer with pointers that map between the flat coordinate logic and structured model parameters. It stores a global gradient buffer of size $m$, where $m$ is the memory budget. Usually $m=n$, but the paper allows smaller buffers for constrained settings. During descent, it uses views and in-place updates instead of reconstructing full gradient tensors.
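The view-based idea can be illustrated with plain NumPy. This sketch covers only the full-buffer case ($m=n$) and does not model the paper's FIFO handling of smaller buffers; the layer names and shapes are made up for the example.

```python
import numpy as np

# Hypothetical layer shapes, chosen only for illustration.
shapes = {"conv.weight": (4, 1, 3, 3), "fc.weight": (10, 16), "fc.bias": (10,)}

def build_flat_params(shapes):
    """Allocate one flat parameter vector and return per-tensor views into it."""
    sizes = {name: int(np.prod(shape)) for name, shape in shapes.items()}
    flat = np.zeros(sum(sizes.values()))
    views, offset = {}, 0
    for name, shape in shapes.items():
        # Slicing a contiguous 1-D array and reshaping yields a view, not a copy.
        views[name] = flat[offset:offset + sizes[name]].reshape(shape)
        offset += sizes[name]
    return flat, views

flat, views = build_flat_params(shapes)
g_buf = np.zeros_like(flat)          # one gradient-history buffer, same flat layout

views["fc.bias"][:] = 1.0            # write through the structured view
g_buf[-10:] = 2.0                    # pretend these coordinates were just refreshed
flat -= 0.25 * g_buf                 # in-place descent step on the flat vector

print(views["fc.bias"])              # the structured tensor sees the update
print(np.shares_memory(flat, views["fc.weight"]))
```

Coordinate logic works on `flat` and `g_buf`, while the model reads the shaped views, so nothing is flattened or rebuilt per step.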

This is not glamorous. It is also the part that makes the method plausible as a systems contribution. A clever optimiser that requires theatrical memory movement is not lightweight; it is merely wearing a smaller hat.

The authors emphasise that CoCD needs one extra buffer for gradient history. That is lighter than Adam-style optimisers that keep multiple moment buffers, and it avoids the projection-matrix baggage of some random subspace approaches. The precise memory advantage will depend on implementation and model structure, but the design direction is credible.

What the paper directly shows

The paper directly supports four claims.

First, stale finite-difference gradients can be reused effectively when updates follow a coherent trajectory. This is supported by the algorithm, the staleness error analysis, and the BCCD comparison.

Second, finite-difference smoothing can be beneficial rather than merely harmful. The theory explains how larger $\epsilon$ can reduce an effective smoothness term, and the SARCOS/MNIST ablations show the expected stability behaviour.

Third, CoCD can substantially outperform standard BCCD in the tested small-to-medium neural-network regimes. The main table is strong on this point.

Fourth, deterministic zeroth-order updates can offer stability advantages over randomised baselines in at least some matched-budget settings, especially the SARCOS regression case.

Those are meaningful claims. They are also bounded claims.

What Cognaptus infers for business use

The business relevance is not “this trains big models without GPUs.” Please do not put that on a slide unless the slide is part of a compliance investigation.

The more realistic pathway is lightweight adaptation and optimisation where gradients are unavailable, expensive, or operationally constrained.

Three business situations fit the paper’s shape:

  1. Edge and on-device learning. When memory is tight, avoiding a full backpropagation graph and heavy optimiser state can matter. CoCD’s one-buffer design is directly relevant, although the paper does not test truly tiny hardware.

  2. Black-box model or system tuning. In cases where the objective can be evaluated but gradients are not exposed, deterministic finite-difference reuse may provide a more stable alternative to random search.

  3. Simulation and control-adjacent optimisation. The paper’s lineage from stale-gradient coherence and its use of SARCOS make it relevant for systems where objective evaluations are expensive but trajectories change smoothly enough for history to remain informative.

The ROI logic is not that CoCD reduces all training cost. It may reduce wasted function evaluations and improve stability when the alternative is variance-heavy zeroth-order search or forgetful coordinate descent. In operational terms, CoCD is attractive where query budget, memory budget, and repeatability are constraints.

A practical adoption checklist would look like this:

| Question | Why it matters |
| --- | --- |
| Are gradients unavailable or too costly? | If backpropagation is cheap and available, standard first-order methods remain the obvious baseline |
| Is the objective locally coherent across updates? | CoCD relies on stale gradients staying informative long enough to matter |
| Can the system afford a gradient-history buffer? | The method is memory-light, not memory-free |
| Is deterministic repeatability valuable? | CoCD avoids random-direction variance, which can simplify debugging and certification |
| Does the model sit in the small-to-medium parameter regime? | The paper’s strongest evidence is below large-scale foundation-model territory |

Where the boundary should be drawn

The paper is admirably clear about its main boundary: scale. CoCD is currently most effective for small-to-medium models. As parameter count increases, keeping coordinate-wise finite-difference estimates accurate requires larger compute budgets. That can become impractical quickly.

The ResNet-20 appendix is useful, but it should be interpreted as a scaling check, not a scale solution. The model has about 270k parameters, uses $B=2048$, and is tested over a short 10-episode window. CoCD with $\gamma \in \{0.5, 0.9, 0.95\}$ beats BCCD and remains stable. That is encouraging. It is not evidence that CoCD is ready for production LLM fine-tuning.

There is also the finite-difference bias issue. Larger $\epsilon$ can smooth the landscape and improve stability, but it can also move the optimiser toward the optimum of a smoothed objective rather than the exact original objective. The paper acknowledges this through an error term associated with $\epsilon$. In business terms: smoothing may improve training behaviour while slightly changing the target. Sometimes that is acceptable. Sometimes it is not.

Finally, the method has hyperparameter sensitivity. Compute budget, memory budget, momentum, smoothing radius, and learning rate interact. CoCD is not a zero-configuration optimiser. It is a structured method with knobs. Better knobs, perhaps, but still knobs.

The strategic point: optimisation is becoming a memory-management problem

CoCD is interesting because it reframes zeroth-order optimisation as memory management. The question is not just “which direction should we sample?” but “which old information is still worth carrying?”

That shift has broader implications. As AI systems move into constrained environments (devices, private deployments, black-box APIs, robotics simulators, enterprise systems with limited observability), the assumption of easy gradients becomes less universal. Optimisers that can survive on partial, delayed, and imperfect information will matter.

CoCD’s particular answer is deterministic, cyclic, and buffer-based. It says: stop rolling dice when structure is available; stop discarding gradients when the path is coherent; stop assuming finer finite differences are always better if a smoother landscape is what the optimiser actually needs.

The lesson is not that stale gradients are always good. They are not. The lesson is that age is not the same as uselessness. In a coherent system, old information can still carry signal. The trick is knowing when to trust it, when to fade it, and when to refresh it.

That is a useful message for machine learning. Also, frankly, for management.



1. Chen Liang, Xiatao Sun, Qian Wang, and Daniel Rakita, “Turning Stale Gradients into Stable Gradients: Coherent Coordinate Descent with Implicit Landscape Smoothing for Lightweight Zeroth-Order Optimization,” arXiv:2605.14373v3, 2026. https://arxiv.org/abs/2605.14373