Opening — Why this matters now
Modern deep learning training is an odd contradiction. We obsess over architectures, data curation, and trillion-token scaling laws—then quietly accept Cosine Annealing as if it were gravity. Learning rate schedules are often inherited, not argued for. This paper challenges that complacency with a scheduler that does something almost offensive in its simplicity: it just watches the loss and reacts.
The result, GreedyLR, is not glamorous. It is, however, effective—across NLP, CV, fine-tuning, pre-training, and even noisy training regimes that resemble real-world chaos more than benchmark utopia.
Background — Context and prior art
Learning rate scheduling has long lived in two camps:
- Fixed-shape schedules — cosine, linear, polynomial. Elegant curves, zero situational awareness.
- Adaptive optimizers — Adam, RMSProp, Adagrad. Parameter-wise adaptation, but still often paired with fixed global schedules.
A third camp exists—line search, Lipschitz-based tuning, Bayesian or evolutionary AutoLR—but these tend to be expensive, brittle, or operationally complex. The industry default remains cosine decay not because it is optimal, but because it is predictable.
GreedyLR enters as a zeroth-order scheduler: no gradients, no curvature estimates, no validation lookahead. Just loss comparison.
Analysis — What the paper actually does
At its core, GreedyLR applies a single rule:
- If loss improves → increase learning rate
- If loss worsens → decrease learning rate
Formally, with scaling factor $F \in (0,1)$:
- Improvement: $\gamma_t = \gamma_{t-1} / F$
- Regression: $\gamma_t = \gamma_{t-1} \times F$
This is wrapped with practical safeguards—patience, smoothing windows, warmup, cooldown, and LR bounds—turning a naïve rule into a production-grade scheduler.
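To make the mechanism concrete, here is a minimal sketch of such a loss-reactive scheduler in PyTorch style. This is our own illustration, not the authors' reference implementation: the class name, parameter names, and defaults are assumptions, and the warmup/cooldown phases are omitted for brevity.

```python
from collections import deque


class GreedyLRSketch:
    """Loss-reactive learning-rate scheduler in the spirit of GreedyLR.

    Illustrative only: names, defaults, and the smoothing/patience logic
    are our assumptions; warmup and cooldown phases are omitted.
    """

    def __init__(self, optimizer, factor=0.75, patience=2,
                 window=5, min_lr=1e-6, max_lr=1.0):
        # factor is F in (0, 1); the paper's sweep suggests keeping F >= 0.5.
        self.optimizer = optimizer
        self.factor = factor
        self.patience = patience
        self.losses = deque(maxlen=window)   # smoothing window over recent losses
        self.min_lr, self.max_lr = min_lr, max_lr
        self.best = float("inf")
        self.good_steps = 0
        self.bad_steps = 0

    def step(self, loss):
        self.losses.append(float(loss))
        smoothed = sum(self.losses) / len(self.losses)
        if smoothed < self.best:             # loss improved -> step more greedily
            self.best = smoothed
            self.good_steps, self.bad_steps = self.good_steps + 1, 0
            if self.good_steps >= self.patience:
                self._scale(1.0 / self.factor)   # gamma_t = gamma_{t-1} / F
                self.good_steps = 0
        else:                                # loss worsened or stalled -> back off
            self.bad_steps, self.good_steps = self.bad_steps + 1, 0
            if self.bad_steps >= self.patience:
                self._scale(self.factor)         # gamma_t = gamma_{t-1} * F
                self.bad_steps = 0

    def _scale(self, mult):
        # Apply the multiplicative update, clamped to the configured LR bounds.
        for group in self.optimizer.param_groups:
            group["lr"] = min(self.max_lr, max(self.min_lr, group["lr"] * mult))
```

Note that nothing here touches gradients or curvature: the scheduler consumes a single scalar per step, which is exactly what makes it a zeroth-order method.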
Why this works (more often than it should)
Loss acts as a crude directional signal. In smooth regions, consistent improvement implies under-stepping; in volatile or high-curvature regions, loss spikes force rapid LR contraction. GreedyLR behaves less like a predefined curve and more like a reflex.
The authors provide a convergence proof under standard smooth convex assumptions, yielding an $O(1/T)$ rate for averaged iterates. More interestingly, they derive an optimal scaling factor:
$$ F^* = 1 - \frac{1}{L_{\max}} $$
In practice, $L_{\max}$ is unknowable. Fortunately, the experiments show you don't need it.
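Still, the formula is a useful sanity check. Taking it at face value (the arithmetic below is ours, not the paper's), a smoothness constant of $L_{\max} = 4$ would give

$$ F^* = 1 - \tfrac{1}{4} = 0.75, $$

and more generally $F^* \ge 0.5$ whenever $L_{\max} \ge 2$, which lines up neatly with the empirical $F \ge 0.5$ stability rule reported below.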
Findings — Results that actually matter
Performance summary
| Scenario | Result |
|---|---|
| Small models (<500M parameters) | As good as or better than baselines in ≥86% of cases |
| Large models (0.5–7B parameters) | As good as or better than baselines in ≥83% of fine-tuning cases |
| LLM pre-training | 5.4% lower final loss vs cosine |
| Early training | Up to 47% faster convergence |
GreedyLR shines most where training is unstable: early fine-tuning, pre-training from scratch, or noisy regimes.
The F ≥ 0.5 rule (the paper’s quiet killer feature)
A sweep over $F \in \{0.25, 0.5, 0.75, 0.99\}$ reveals a sharp stability threshold:
- F < 0.5 → catastrophic divergence
- F ≥ 0.5 → stable, near-identical convergence (within ~1.5%)
This effectively collapses hyperparameter tuning into a binary decision. That alone is operationally valuable.
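To see what that binary decision means operationally, here is a hedged usage sketch built on the hypothetical GreedyLRSketch class above; the training loop, data, and defaults are purely illustrative.

```python
import torch

# Toy setup: a tiny regression model trained with SGD.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# Per the paper's sweep, any F in [0.5, 1) converged within ~1.5% of the others,
# so an untuned default such as 0.75 is a reasonable choice.
scheduler = GreedyLRSketch(optimizer, factor=0.75)

for step in range(100):
    x, y = torch.randn(32, 10), torch.randn(32, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step(loss.item())   # the scheduler only ever sees the loss value
```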
Robustness under noise
Across 8,100 noisy training runs—including adversarial and spike perturbations—GreedyLR:
- Achieved 37% lower median final loss than the best traditional scheduler
- Recovered 3–5× faster after disruptions
- Exhibited tighter performance distributions (higher reliability)
In plain terms: it panics less, and recovers faster.
Implications — What this means for practice
GreedyLR is not a silver bullet. It does not dominate every task, and it does not replace adaptive optimizers. But it offers something rare:
- Minimal assumptions
- Low implementation cost
- Strong early-phase gains
- Predictable stability rules
For practitioners training LLMs, adapters, or domain-shifted models, GreedyLR is a credible default, not an exotic alternative.
For infrastructure teams, its loss-only dependency makes it attractive in distributed or partially observed training systems.
Conclusion — A scheduler that reacts, not recites
GreedyLR does not try to be clever. It reacts to reality instead of following a script. In an ecosystem crowded with elaborate heuristics, that restraint is refreshing.
Will it replace cosine annealing everywhere? No. But it makes one thing clear: the learning rate should be allowed to listen.
Cognaptus: Automate the Present, Incubate the Future.