Training runs rarely fail with cinematic drama. They do not burst into flames. They simply become expensive, slow, and faintly embarrassing.
A fine-tuning job starts with promise, the loss descends, then progress flattens. Another run behaves well for 200 steps, then becomes jumpy after a data shard changes. A third run is rescued by lowering the learning rate, except nobody knows whether the rescue came too early, too late, or by accident. Eventually, the team does what teams do: try cosine decay again, because at least cosine looks mathematically respectable while doing whatever it was going to do anyway.
The paper behind GreedyLR asks a blunt question: what if the learning-rate schedule should stop reciting a prewritten curve and start watching the loss?[1]
That sounds too simple. Slightly rude, even. Deep learning has adaptive optimizers, gradient moments, warmups, decay schedules, restarts, line-search traditions, Bayesian hyperparameter search, and a small graveyard of “almost default” scheduler ideas. GreedyLR’s core rule is almost comic by comparison: if the loss improves, increase the learning rate; if the loss worsens, decrease it.
The surprising part is not that this rule sometimes works. The surprising part is the range of evidence the authors assemble: small NLP and CV models, large-model fine-tuning up to 7B parameters, a limited Llama-3.2-1B pre-training run, an empirical scaling-factor sweep, and a large set of noisy training simulations. The result is not a proof that GreedyLR is now the One True Scheduler. Please, no shrine-building. The stronger interpretation is more useful: loss-driven scheduling may be a cheap operational reflex that often removes wasted motion from training.
This article uses an evidence-first reading because the mechanism alone is too simple to carry the argument. If we start with the rule, GreedyLR looks like a cute heuristic. If we start with the evidence, the rule becomes a plausible answer to a real operational problem: most learning-rate schedules are stable because they are ignorant.
The evidence says “default candidate,” not “universal winner”
The paper’s main claim is empirical: GreedyLR is a strong default scheduler across many tested settings. That wording matters. “Default candidate” is an operational phrase. “Universal winner” is how benchmarking discourse loses adult supervision.
The authors compare GreedyLR against common schedules such as linear, cosine, polynomial, constant warmup, cosine with restarts, and exponential decay, depending on the experiment family. The evidence falls into four categories:
| Evidence category | Likely purpose in the paper | What it supports | What it does not prove |
|---|---|---|---|
| Small-model NLP/CV experiments | Main evidence | GreedyLR often matches or beats standard schedules across many architectures, datasets, optimizers, and task types | That it dominates every model, metric, or production environment |
| Large-model fine-tuning | Main evidence plus comparison with a common default | GreedyLR can improve early and final loss versus cosine in tested 2B–7B fine-tuning settings | That it is always better for all LLM fine-tuning recipes |
| Llama-3.2-1B pre-training | Exploratory extension | GreedyLR can remain useful beyond adapter or instruction fine-tuning | That it scales to frontier pre-training over hundreds of thousands or millions of steps |
| Scaling-factor and noise experiments | Sensitivity and robustness tests | GreedyLR has practical stability guidance and can recover from engineered perturbations | That all real training noise is captured by the simulations |
That distinction prevents two opposite misreadings. The first is dismissive: “Adam already adapts learning rates, so scheduler adaptation is redundant.” Not exactly. Adam and AdamW adapt parameter-wise updates using gradient moments; most training stacks still place a global schedule on top. GreedyLR works at that global-schedule layer, using loss movement as a feedback signal.
The second misreading is enthusiastic and therefore more dangerous: “GreedyLR beats cosine, so replace everything.” Also no. The paper itself reports cases where GreedyLR underperforms. The useful conclusion is narrower and more practical: if a team currently uses cosine because nobody wants to spend two days justifying scheduler choice, GreedyLR is credible enough to test as a default alternative.
Small models show breadth, not merely a lucky demo
The small-model section is the paper’s breadth test. The authors run 132 experiments across 16 model architectures and 15 datasets. The coverage includes translation, question answering, summarization, named entity recognition, and image tasks. Architectures include families such as Pegasus, BERT, T5, BART, ResNet, ViT, and CamemBERT. Optimizers include AdamW, Adafactor, Adagrad, and SGD. The scheduler comparison includes GreedyLR against common Hugging Face-style scheduling choices.
This is not the most glamorous part of the paper, which is precisely why it is useful. Small models are where a scheduler can be tested across many conditions without pretending every experiment costs a national budget.
The headline result: for small models below 500M parameters, GreedyLR is reported as “as good or better” in 86.73% of comparisons. It is “better” in 57.14%, “clearly better” in 24.49%, and worse in 13.27%. The authors also report an average loss benefit of 0.16, a maximum benefit of 2.3, and a maximum deficit of -0.62.
Those numbers need careful reading. “As good or better” includes cases where the difference is small enough to be treated as practically insignificant. That makes it a stability claim as much as a superiority claim. GreedyLR does not need to crush every baseline to be operationally interesting. If it frequently avoids being worse while sometimes being meaningfully better, it becomes a rational default candidate.
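To make the "as good or better" bucket concrete, here is a minimal sketch of how such a tally could be computed. The thresholds `tie_tol` and `clear_tol`, the function name `classify`, and the loss pairs are all hypothetical; the paper's exact cutoffs are not reproduced here.

```python
from collections import Counter


def classify(greedy_loss: float, baseline_loss: float,
             tie_tol: float = 0.01, clear_tol: float = 0.10) -> str:
    """Bucket one comparison by relative loss difference (lower loss is better).

    tie_tol and clear_tol are hypothetical thresholds, not the paper's.
    """
    # rel > 0 means GreedyLR reached the lower loss.
    rel = (baseline_loss - greedy_loss) / baseline_loss
    if rel >= clear_tol:
        return "clearly better"
    if rel > tie_tol:
        return "better"
    if rel >= -tie_tol:
        return "tie"  # difference too small to matter in practice
    return "worse"


# Made-up (GreedyLR loss, baseline loss) pairs, for illustration only.
pairs = [(0.80, 1.00), (0.95, 1.00), (1.00, 1.00), (1.20, 1.00)]
counts = Counter(classify(g, b) for g, b in pairs)

# "As good or better" is everything that is not "worse", ties included.
as_good_or_better = sum(n for label, n in counts.items() if label != "worse") / len(pairs)
```

The point of the sketch is the last line: ties count toward the headline share, which is why that share reads as a stability measure as much as a superiority measure.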
The stage-level results are also revealing. GreedyLR is as-good-or-better 92.42% of the time at 10% of training, 81.81% at 50%, and 85.94% at 100%. The early-stage number matters because early training is where teams often diagnose whether a run is “healthy” or “quietly wasting compute.” A scheduler that reacts well early can reduce the number of abandoned or overextended runs.
For business use, the small-model evidence says: do not think of GreedyLR only as a large-language-model trick. Its practical value may start in the less glamorous places—classification systems, summarizers, internal NLP models, and CV pipelines where training budgets are smaller but iteration speed matters.
Large-model fine-tuning is where the cosine habit gets uncomfortable
The large-model fine-tuning results are narrower but more attention-grabbing. The authors test Microsoft Phi-2, Falcon 7B, and Gemma 7B across instruct, SQL-generation, and French Alpaca-style datasets. The comparison is mainly against cosine scheduling with AdamW.
Across the large-model fine-tuning measurements, GreedyLR is reported to be as good or better 83.33% of the time and clearly better 62.5% of the time. The paper reports an average benefit of 2.89%, a maximum benefit of 47%, and a maximum deficit of 28%.
The maximum numbers deserve restraint. A 47% early benefit is impressive, but maximums are not operating policies. They are evidence that the scheduler can matter a lot under some conditions. The more reliable takeaway is that GreedyLR shows a net positive pattern across the tested large-model fine-tuning setups, especially early in training.
The appendix gives the texture behind that summary. For example, in one Gemma-7B SQL-context experiment, GreedyLR is reported as 47% better at 10% of training, then only 0.7% better at 50% and 1.5% better at completion. In another Gemma-7B French-Alpaca run, GreedyLR starts 21% better at 10% but ends 28% worse. Phi-2 on xP3mt-code is worse across the measured stages.
That pattern is more useful than a clean victory narrative. GreedyLR can accelerate early adaptation, but early advantage does not guarantee final superiority. In deployment language: GreedyLR is a candidate for reducing wasted fine-tuning time, not an excuse to stop using validation curves.
The business implication is straightforward. Many companies do not fine-tune once. They fine-tune repeatedly: new customer domain, new product taxonomy, new policy corpus, new language market, new retrieval style, new compliance requirement. A scheduler that improves the chance of early useful progress can reduce the cost of experimentation. But the final checkpoint still needs evaluation. The learning rate can listen to the loss; the business still has to listen to the task metric.
The pre-training result is promising, but intentionally small
The paper also tests GreedyLR in a pre-training setting: Llama-3.2-1B on the RedPajama-arxiv subset for 1,000 steps, with a 100-step warmup, batch size 1, gradient accumulation 32, bf16, initial learning rate $2 \times 10^{-4}$, GreedyLR factor $F = 0.95$, and minimum learning rate $1.85 \times 10^{-5}$.
GreedyLR achieves lower loss than cosine by 1.0% at 10% of training, 3.0% at 50%, and 5.4% at the end, with final loss 2.16 versus 2.28.
Unlike the fine-tuning experiments, where early-stage gains dominate, the pre-training result improves over time. That is interesting because pre-training has less task-specific prior structure. The scheduler is not merely adapting to a known downstream format; it is responding to ongoing loss movement in a noisier, broader training process.
But this is also where discipline matters. A 1,000-step run on one 1B model with one random seed is not a frontier pre-training conclusion. It is a serious exploratory signal. It says GreedyLR may deserve larger-scale pre-training investigation. It does not say that a trillion-token training run should switch schedules on Monday because one arXiv subset behaved nicely.
For AI infrastructure teams, the pre-training result is best interpreted as a research prioritization signal. If training costs are large enough that scheduler choice materially affects budget, GreedyLR should be evaluated in internal pilot runs. But it should be evaluated under the team’s actual data mixture, batch construction, optimizer stack, and checkpoint policy. Otherwise, we are just replacing cosine superstition with GreedyLR superstition. Same temple, new incense.
The F-sweep turns a theoretical parameter into a usable rule
GreedyLR uses a multiplicative factor $F \in (0,1)$. In the simplest rule:
- if loss improves, increase learning rate: $\gamma_t = \gamma_{t-1}/F$;
- if loss worsens, decrease learning rate: $\gamma_t = \gamma_{t-1} \times F$.
A smaller $F$ makes the scheduler more aggressive. A value closer to 1 makes it more conservative.
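The bare rule fits in a few lines. The following is a minimal sketch rather than the paper's implementation: the function name `greedy_lr_step` and its arguments are hypothetical, the handling of an unchanged loss is an assumption, and none of the practical safeguards are included.

```python
def greedy_lr_step(lr: float, prev_loss: float, curr_loss: float, factor: float) -> float:
    """Return the next learning rate under the bare GreedyLR rule.

    `factor` plays the role of F, with 0 < F < 1. All names are illustrative.
    """
    if not 0.0 < factor < 1.0:
        raise ValueError("factor must lie strictly between 0 and 1")
    if curr_loss < prev_loss:
        return lr / factor  # loss improved: take larger steps
    if curr_loss > prev_loss:
        return lr * factor  # loss worsened: take smaller steps
    return lr               # unchanged loss: keep the rate (an assumption)


# Toy loss trace: two improvements, then one regression.
losses = [2.0, 1.5, 1.2, 1.4]
lr = 1e-3
trace = [lr]
for prev, curr in zip(losses, losses[1:]):
    lr = greedy_lr_step(lr, prev, curr, factor=0.5)
    trace.append(lr)
```

With `factor=0.5`, the toy trace doubles the rate twice and then halves it once, which is the whole mechanism in miniature.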
The paper’s theoretical appendix derives an optimal value of $F$ that depends on $L_{\max}$, the smoothness constant of the objective. This is mathematically neat and practically annoying, because nobody training a real neural network knows $L_{\max}$ in a usable way.
The authors therefore run an $F$-sweep on Phi-2 fine-tuning over 250 steps, testing $F \in \{0.25, 0.50, 0.75, 0.99\}$. The result is the paper’s quiet operational gift: $F = 0.25$ causes catastrophic divergence, with final loss 7.78 versus initial loss 2.28; all values at or above 0.5 converge stably, ending around 1.89, 1.92, and 1.91 (within 1.5%).
This is a sensitivity test, not a second thesis. Its purpose is not to prove a universal magic threshold across all domains. Its purpose is to reduce adoption friction. If practitioners had to estimate a smoothness constant before using GreedyLR, the method would become another clever idea trapped in an implementation swamp. The $F \ge 0.5$ result gives a practical starting rule for the tested setting.
The catch is also explicit: the stability threshold is established for an LLM fine-tuning case, not for every possible training regime. Reinforcement learning, mixture-of-experts routing, multimodal batches, and continual learning may produce loss movements that do not behave like this experiment. Still, as a default engineering heuristic, “avoid overly aggressive $F$; start at or above 0.5” is the kind of guidance teams can actually use.
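One way to build intuition for why a small $F$ is risky: under the bare rule, every improving step multiplies the learning rate by $1/F$. The sketch below only does that arithmetic for the four swept values. It does not reproduce the paper's runs, and `steps_to_grow` is a hypothetical helper.

```python
import math


def steps_to_grow(factor: float, growth: float = 10.0) -> int:
    """Consecutive improving steps the bare rule needs to scale the LR by `growth`."""
    return math.ceil(math.log(growth) / math.log(1.0 / factor))


for f in (0.25, 0.50, 0.75, 0.99):
    print(f"F={f:.2f}: x{1.0 / f:.3f} per improving step, "
          f"{steps_to_grow(f)} improving steps to reach 10x")
```

At $F = 0.25$ the rate quadruples on each improving step, while at $F = 0.99$ it rises by about 1%. The practical scheduler adds patience and bounds, so real trajectories will differ from this back-of-envelope view.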
The robustness experiments test recovery, not just final loss
The robustness section is easy to overread because the numbers are large. The authors run 8,100 training experiments with engineered perturbations: Gaussian noise, periodic spike noise, random spike noise, adversarial noise, and a clean baseline. They compare GreedyLR with cosine annealing, cosine with restarts, and exponential decay across multiple neural architectures.
This is a robustness and recovery test. It does not prove that these perturbations fully represent distributed training reality. It does test whether a loss-driven scheduler can recover from disruptions better than fixed-shape schedules.
The reported median final loss is 0.148 for GreedyLR, compared with 0.232 for cosine annealing, 0.226 for cosine with restarts, and 0.249 for exponential decay. The paper describes this as 37% lower median loss than the best traditional scheduler in that setup.
The recovery metric is even more operational. The authors define recovery performance as the ratio between maximum loss during training and final achieved loss. GreedyLR has median recovery of 134× and best-case recovery of 72,999×. Cosine has a similar median recovery of 132× but a much lower best-case recovery of about 5,000×; cosine restarts and exponential decay lag further. GreedyLR also recovers 3–5× faster after perturbations, with a median of 12 steps versus 45 steps for cosine.
The median comparison between GreedyLR and cosine recovery is worth noticing. GreedyLR’s median recovery is not dramatically higher than cosine’s. Its advantage appears more in best-case recovery, faster recovery speed, final-loss distribution, and consistency across perturbation types. That makes the evidence more nuanced—and more credible.
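The recovery ratio described above is simple enough to compute from any logged loss curve. This is a minimal sketch based on the definition as stated here (maximum loss divided by final loss); the function name `recovery_ratio` and the toy trace are hypothetical.

```python
def recovery_ratio(losses: list[float]) -> float:
    """Maximum loss seen during training divided by the final loss."""
    if not losses:
        raise ValueError("need at least one loss value")
    final = losses[-1]
    if final <= 0:
        raise ValueError("final loss must be positive for the ratio to be meaningful")
    return max(losses) / final


# Made-up trace with an injected spike at index 3.
trace = [1.0, 0.6, 0.4, 5.0, 0.9, 0.3, 0.05]
ratio = recovery_ratio(trace)  # 5.0 / 0.05
```

Because the ratio depends on the single worst loss value, best-case figures can be dominated by one large spike. That is one more reason to read medians and maximums separately.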
For business practice, robustness matters because training failures are not always algorithmic in the pure sense. They come from messy data batches, checkpoint disruptions, transient infrastructure behavior, and distribution shifts. If the learning-rate schedule can contract after a spike and re-expand after recovery, it acts like a reflex system. Cosine, elegant as it is, does not know that the floor just moved.
GreedyLR is simple, but the production version is not naive
The paper’s simple rule is not the whole implementation. The practical scheduler includes safeguards: patience, smoothing windows, warmup, cooldown, minimum and maximum learning-rate bounds, thresholds for minor loss changes, and reset functionality.
That matters because raw loss is noisy. If every tiny loss movement caused an immediate learning-rate change, GreedyLR would become a caffeinated thermostat: technically responsive, operationally unbearable.
The practical version tries to avoid that. Patience waits for repeated improvement or deterioration before changing the learning rate. Smoothing compares averaged loss values rather than single-step fluctuations. Warmup and cooldown prevent overreaction during transitional periods. Bounds prevent the scheduler from exploding or shrinking the learning rate into uselessness.
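A rough sketch of how those safeguards might fit together is shown below. It is illustrative only: the class name `GreedyLRSketch`, its parameter names, and its defaults are assumptions, not the paper's reference implementation, and warmup, cooldown, and reset are left out to keep it short.

```python
from collections import deque


class GreedyLRSketch:
    """Loss-driven LR controller with smoothing, patience, a threshold, and bounds.

    Hypothetical names and defaults; not the paper's implementation.
    """

    def __init__(self, lr, factor=0.9, patience=2, window=5,
                 threshold=1e-3, min_lr=1e-6, max_lr=1e-2):
        if not 0.0 < factor < 1.0:
            raise ValueError("factor must lie strictly between 0 and 1")
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.threshold = threshold
        self.min_lr, self.max_lr = min_lr, max_lr
        self.recent = deque(maxlen=window)  # smoothing window of raw losses
        self.prev = None                    # previous smoothed loss
        self.good = 0                       # consecutive improving steps
        self.bad = 0                        # consecutive worsening steps

    def step(self, loss):
        """Feed one raw loss value; return the (possibly updated) learning rate."""
        self.recent.append(loss)
        smoothed = sum(self.recent) / len(self.recent)
        if self.prev is None:
            self.prev = smoothed
            return self.lr
        change = smoothed - self.prev
        self.prev = smoothed
        if change < -self.threshold:
            self.good, self.bad = self.good + 1, 0
        elif change > self.threshold:
            self.good, self.bad = 0, self.bad + 1
        else:
            # Minor change: treat as noise and leave the streaks untouched
            # (a design assumption of this sketch).
            return self.lr
        if self.good >= self.patience:
            self.lr = min(self.lr / self.factor, self.max_lr)  # expand, capped
            self.good = 0
        elif self.bad >= self.patience:
            self.lr = max(self.lr * self.factor, self.min_lr)  # contract, floored
            self.bad = 0
        return self.lr


# window=1 disables smoothing so the toy trace is easy to follow by hand.
sched = GreedyLRSketch(lr=1e-3, factor=0.5, patience=2, window=1,
                       threshold=0.01, min_lr=1e-4, max_lr=3e-3)
lrs = [sched.step(x) for x in [2.0, 1.8, 1.6, 1.4, 1.2, 1.205, 1.5, 1.8]]
```

In the toy trace, the rate only moves after two consecutive improvements or deteriorations, the second expansion is clipped by `max_lr`, and the tiny bump from 1.2 to 1.205 is ignored. That is the difference between a reflex and a caffeinated thermostat.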
This also weakens the theoretical elegance. The convergence proof is for SGD under smooth convex assumptions. Modern deep learning usually uses Adam or AdamW on non-convex objectives, and the practical GreedyLR algorithm contains features not covered by the proof. That gap is not fatal. Most useful training tricks live somewhere between theory and operational evidence. But the gap should stop readers from treating the proof as a certificate for all real models.
The better interpretation is: the theory explains why the core reflex is plausible under controlled assumptions; the experiments show that the practical version often behaves well in messy settings. Neither one alone is enough. Together, they justify trying the method.
The business value is fewer wasted runs, not a smarter model
The most important business point is easy to miss: GreedyLR does not make the model architecture smarter. It does not add domain knowledge. It does not improve retrieval quality. It does not solve evaluation. It changes how training effort is spent.
That is still valuable.
| What the paper directly shows | Cognaptus business inference | What remains uncertain |
|---|---|---|
| GreedyLR often matches or beats common schedules across tested small NLP/CV models | Teams with many modest training jobs may reduce scheduler trial-and-error | Whether the same pattern holds for each company’s data and metrics |
| GreedyLR improves many tested large-model fine-tuning runs versus cosine | Domain adaptation runs may reach useful loss levels sooner | Early loss gains may not always translate into final task performance |
| A limited Llama-3.2-1B pre-training run shows 5.4% lower final loss | Scheduler adaptation may matter for larger training economics | Long-horizon, multi-seed, frontier-scale pre-training remains untested |
| Robustness simulations show lower median final loss and faster recovery | GreedyLR may reduce wasted compute after noisy training events | Engineered noise may not match real distributed-training failures |
| $F \ge 0.5$ is stable in the tested sweep | Adoption can start with a simple practical rule | The threshold is not proven universal |
This is where GreedyLR becomes operationally interesting. Training efficiency is not only about reducing the cost of a successful run. It is also about reducing the number of ambiguous runs: the runs where the curve looks strange, nobody knows whether to stop, and the next experiment is launched mostly out of irritation.
For consulting and enterprise AI teams, that ambiguity has a real cost. It slows iteration. It makes budgets harder to forecast. It turns engineering judgment into superstition decorated with dashboards.
GreedyLR’s value proposition is not “higher accuracy by magic.” It is “a low-cost feedback mechanism that may reduce avoidable scheduler mismatch.” Less glamorous. More bankable.
Where the method can mislead itself
Loss is useful because it is available. Loss is dangerous for the same reason.
The paper is clear about this. Consecutive loss changes are a zeroth-order proxy for optimization progress. In smooth settings, that proxy can be informative. In highly non-convex regions, saddle points, sharp curvature, or heterogeneous batch regimes, the loss may move for reasons unrelated to whether the previous learning rate was wise.
The most important boundary cases are not obscure:
- multi-domain training where consecutive mini-batches come from different distributions;
- mixture-of-experts models where routing changes affect which parameters see which examples;
- heterogeneous batches where loss movement reflects sample composition more than optimization;
- reinforcement learning, where reward-derived signals have different noise structure than supervised loss;
- long-horizon continual learning, where plateaus, forgetting, and distribution drift complicate interpretation.
These are not reasons to ignore GreedyLR. They are reasons to instrument it. A team testing GreedyLR should log not only training loss but learning-rate trajectory, validation metrics, batch composition, gradient norms where feasible, and failure cases. The question is not simply “does it converge?” The question is “when it changes the learning rate, was the loss signal telling the truth?”
A reasonable production evaluation would compare GreedyLR against the current schedule on:
- final task metric, not only training loss;
- early stopping efficiency;
- variance across seeds;
- number of failed or abandoned runs;
- recovery after known disruptions;
- sensitivity to batch composition;
- learning-rate trajectory interpretability.
This is how a scheduler becomes infrastructure rather than a paper result.
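A comparison along those lines can start as a small summary over logged runs. The sketch below is one possible shape, assuming each run records a scheduler name, a seed, and a final task metric (or `None` if the run failed or was abandoned). The `RunResult` record, the `summarize` helper, and every number are placeholders, not results.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Optional


@dataclass
class RunResult:
    scheduler: str
    seed: int
    task_metric: Optional[float]  # None marks a failed or abandoned run


def summarize(runs):
    """Group runs by scheduler and report failures, mean metric, and seed spread."""
    by_sched = {}
    for r in runs:
        by_sched.setdefault(r.scheduler, []).append(r)
    report = {}
    for name, group in by_sched.items():
        scores = [r.task_metric for r in group if r.task_metric is not None]
        report[name] = {
            "runs": len(group),
            "failed": len(group) - len(scores),
            "mean_metric": mean(scores) if scores else None,
            "seed_std": pstdev(scores) if len(scores) > 1 else None,
        }
    return report


# Placeholder numbers purely to exercise the code.
runs = [
    RunResult("current", 0, 0.70), RunResult("current", 1, 0.72), RunResult("current", 2, None),
    RunResult("candidate", 0, 0.71), RunResult("candidate", 1, 0.73), RunResult("candidate", 2, 0.72),
]
report = summarize(runs)
```

Reporting failed runs alongside the mean matters: a scheduler that looks good only after its failures are silently dropped has not been evaluated.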
What to do with GreedyLR now
The practical recommendation is neither “replace cosine” nor “wait for perfect proof.” Both are lazy in opposite directions.
For small and mid-sized training jobs, GreedyLR is worth adding to the default scheduler menu. The implementation cost is low, and the paper’s breadth evidence is strong enough to justify routine A/B testing. Teams that frequently fine-tune models for client-specific domains should pay special attention to early-stage convergence, because that is where GreedyLR often shows its advantage.
For LLM fine-tuning, GreedyLR is a credible candidate when cosine is being used mainly by convention. Start conservatively. Use the paper’s guidance around $F \ge 0.5$, keep smoothing and patience enabled, and evaluate on task metrics rather than loss alone. If the model is being trained on highly mixed domains, track whether loss spikes correlate with batch composition before trusting the scheduler’s reactions.
For pre-training, treat GreedyLR as a research candidate, not a default switch. The 1B-parameter, 1,000-step result is promising, but the unanswered questions are exactly the expensive ones: long-run dynamics, seed variance, scaling beyond 7B, behavior after major learning-rate decay, and interaction with large distributed training systems.
For AI infrastructure teams, the most interesting use case may be monitoring and recovery. GreedyLR’s loss-driven reactions create a visible learning-rate trace. That trace can become part of training observability: when the scheduler contracts, expands, or resets, the system records a diagnosis of training stress. Cosine gives you a schedule. GreedyLR gives you a pulse.
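As a sketch of what reading that pulse could look like, the helper below turns a logged learning-rate trace into a list of expand and contract events. The name `lr_events` and the toy trace are hypothetical; a real pipeline would read the trace from its own training logs.

```python
def lr_events(lr_trace):
    """Return (step, event) pairs wherever the learning rate changed."""
    events = []
    for step, (prev, curr) in enumerate(zip(lr_trace, lr_trace[1:]), start=1):
        if curr > prev:
            events.append((step, "expand"))
        elif curr < prev:
            events.append((step, "contract"))
    return events


# Made-up trace: one expansion, a flat step, then two contractions.
lr_trace = [1e-3, 2e-3, 2e-3, 1e-3, 5e-4]
events = lr_events(lr_trace)
```

Joined with batch identifiers or infrastructure logs, a cluster of contract events becomes a cheap pointer to where training stress occurred.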
The scheduler should be allowed to listen
The deeper lesson of GreedyLR is not that loss alone is a perfect signal. It is not. The deeper lesson is that fixed schedules are a strange default for systems whose optimization behavior changes constantly during training.
Cosine annealing is elegant. Linear decay is simple. Polynomial decay is obedient. None of them knows whether yesterday’s batch was easy, today’s gradient was unstable, or the model just entered a flat region where larger steps would help. They are scripts.
GreedyLR is a reflex. A crude reflex, yes. But a reflex can be more useful than a script when the environment moves.
The paper’s strongest contribution is therefore not the formula, nor even the headline wins. It is the evidence that a very cheap feedback loop can often compete with carefully inherited scheduler habits. That is an uncomfortable result for anyone who likes elegant curves. Fortunately, optimization does not care about elegance. It cares about progress.
And if the loss is already shouting, the learning rate might as well stop pretending it cannot hear.
Cognaptus: Automate the Present, Incubate the Future.
[1] Shreyas Subramanian, Bala Krishnamoorthy, and Pranav Murthy, “Dynamic Learning Rate Scheduling based on Loss Changes Leads to Faster Convergence,” arXiv:2512.14527, 2025.