Opening — Why this matters now
Modern deep learning training is an odd contradiction. We obsess over architectures, data curation, and trillion-token scaling laws—then quietly accept Cosine Annealing as if it were gravity. Learning rate schedules are often inherited, not argued for. This paper challenges that complacency with a scheduler that does something almost offensive in its simplicity: it just watches the loss and reacts.
The result, GreedyLR, is not glamorous. It is, however, effective—across NLP, CV, fine-tuning, pre-training, and even noisy training regimes that resemble real-world chaos more than benchmark utopia.
Background — Context and prior art
Learning rate scheduling has long lived in two camps:
- Fixed-shape schedules — cosine, linear, polynomial. Elegant curves, zero situational awareness.
- Adaptive optimizers — Adam, RMSProp, Adagrad. Parameter-wise adaptation, but still often paired with fixed global schedules.
A third camp exists—line search, Lipschitz-based tuning, Bayesian or evolutionary AutoLR—but these tend to be expensive, brittle, or operationally complex. The industry default remains cosine decay not because it is optimal, but because it is predictable.
GreedyLR enters as a zeroth-order scheduler: no gradients, no curvature estimates, no validation lookahead. Just loss comparison.
Analysis — What the paper actually does
At its core, GreedyLR applies a single rule:
- If loss improves → increase learning rate
- If loss worsens → decrease learning rate
Formally, with scaling factor $F \in (0,1)$:
- Improvement: $\gamma_t = \gamma_{t-1} / F$
- Regression: $\gamma_t = \gamma_{t-1} \times F$
This is wrapped with practical safeguards—patience, smoothing windows, warmup, cooldown, and LR bounds—turning a naïve rule into a production-grade scheduler.
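To make the mechanism concrete, here is a minimal sketch of such a loss-reactive scheduler in PyTorch style. This is our own illustration, not the authors' reference implementation: the class name, parameter names, and defaults are assumptions, and the warmup/cooldown phases are omitted for brevity.

```python
from collections import deque


class GreedyLRSketch:
    """Loss-reactive learning-rate scheduler in the spirit of GreedyLR.

    Illustrative only: names, defaults, and the smoothing/patience logic
    are our assumptions; warmup and cooldown phases are omitted.
    """

    def __init__(self, optimizer, factor=0.75, patience=2,
                 window=5, min_lr=1e-6, max_lr=1.0):
        # factor is F in (0, 1); the paper's sweep suggests keeping F >= 0.5.
        self.optimizer = optimizer
        self.factor = factor
        self.patience = patience
        self.losses = deque(maxlen=window)   # smoothing window over recent losses
        self.min_lr, self.max_lr = min_lr, max_lr
        self.best = float("inf")
        self.good_steps = 0
        self.bad_steps = 0

    def step(self, loss):
        self.losses.append(float(loss))
        smoothed = sum(self.losses) / len(self.losses)
        if smoothed < self.best:             # loss improved -> step more greedily
            self.best = smoothed
            self.good_steps, self.bad_steps = self.good_steps + 1, 0
            if self.good_steps >= self.patience:
                self._scale(1.0 / self.factor)   # gamma_t = gamma_{t-1} / F
                self.good_steps = 0
        else:                                # loss worsened or stalled -> back off
            self.bad_steps, self.good_steps = self.bad_steps + 1, 0
            if self.bad_steps >= self.patience:
                self._scale(self.factor)         # gamma_t = gamma_{t-1} * F
                self.bad_steps = 0

    def _scale(self, mult):
        # Apply the multiplicative update, clamped to the configured LR bounds.
        for group in self.optimizer.param_groups:
            group["lr"] = min(self.max_lr, max(self.min_lr, group["lr"] * mult))
```

Note that nothing here touches gradients or curvature: the scheduler consumes a single scalar per step, which is exactly what makes it a zeroth-order method.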
Why this works (more often than it should)
Loss acts as a crude directional signal. In smooth regions, consistent improvement implies under-stepping; in volatile or high-curvature regions, loss spikes force rapid LR contraction. GreedyLR behaves less like a predefined curve and more like a reflex.
The authors provide a convergence proof under standard smooth convex assumptions, yielding an $O(1/T)$ rate for averaged iterates. More interestingly, they derive an optimal scaling factor:
$$ F^* = 1 - \frac{1}{L_{\max}} $$
In practice, $L_{\max}$ is unknowable. Fortunately, the experiments show you don't need it.
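Still, the formula is a useful sanity check. Taking it at face value (the arithmetic below is ours, not the paper's), a smoothness constant of $L_{\max} = 4$ would give

$$ F^* = 1 - \tfrac{1}{4} = 0.75, $$

and more generally $F^* \ge 0.5$ whenever $L_{\max} \ge 2$, which lines up neatly with the empirical $F \ge 0.5$ stability rule reported below.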
Findings — Results that actually matter
Performance summary
| Scenario | Result |
|---|---|
| Small models (<500M parameters) | As good as or better than baselines in ≥86% of cases |
| Large models (0.5–7B parameters) | As good as or better than baselines in ≥83% of fine-tuning cases |
| LLM pre-training | 5.4% lower final loss vs cosine |
| Early training | Up to 47% faster convergence |
GreedyLR shines most where training is unstable: early fine-tuning, pre-training from scratch, or noisy regimes.
The F ≥ 0.5 rule (the paper’s quiet killer feature)
A sweep over $F \in \{0.25, 0.5, 0.75, 0.99\}$ reveals a sharp stability threshold:
- F < 0.5 → catastrophic divergence
- F ≥ 0.5 → stable, near-identical convergence (within ~1.5%)
This effectively collapses hyperparameter tuning into a binary decision. That alone is operationally valuable.
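To see what that binary decision means operationally, here is a hedged usage sketch built on the hypothetical GreedyLRSketch class above; the training loop, data, and defaults are purely illustrative.

```python
import torch

# Toy setup: a tiny regression model trained with SGD.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# Per the paper's sweep, any F in [0.5, 1) converged within ~1.5% of the others,
# so an untuned default such as 0.75 is a reasonable choice.
scheduler = GreedyLRSketch(optimizer, factor=0.75)

for step in range(100):
    x, y = torch.randn(32, 10), torch.randn(32, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step(loss.item())   # the scheduler only ever sees the loss value
```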
Robustness under noise
Across 8,100 noisy training runs—including adversarial and spike perturbations—GreedyLR:
- Achieved 37% lower median final loss than the best traditional scheduler
- Recovered 3–5× faster after disruptions
- Exhibited tighter performance distributions (higher reliability)
In plain terms: it panics less, and recovers faster.
Implications — What this means for practice
GreedyLR is not a silver bullet. It does not dominate every task, and it does not replace adaptive optimizers. But it offers something rare:
- Minimal assumptions
- Low implementation cost
- Strong early-phase gains
- Predictable stability rules
For practitioners training LLMs, adapters, or domain-shifted models, GreedyLR is a credible default, not an exotic alternative.
For infrastructure teams, its loss-only dependency makes it attractive in distributed or partially observed training systems.
Conclusion — A scheduler that reacts, not recites
GreedyLR does not try to be clever. It reacts to reality instead of following a script. In an ecosystem crowded with elaborate heuristics, that restraint is refreshing.
Will it replace cosine annealing everywhere? No. But it makes one thing clear: the learning rate should be allowed to listen.
Cognaptus: Automate the Present, Incubate the Future.