Learning Rate Calculator and Scheduler Guide
Calculate optimal learning rates and schedule them correctly for your ML training runs. Covers warmup, cosine decay, cyclical, and step decay strategies.
Learning rate is the single hyperparameter that breaks more training runs than any other. Too high and your loss explodes or oscillates. Too low and you waste compute crawling toward a mediocre minimum. Getting it right isn't magic — there's a principled approach to finding a good starting point and then scheduling it intelligently.
Starting Point: The Learning Rate Range Test
The most reliable way to find a good initial learning rate is to run a short "LR range test": train for ~100 steps with the LR increasing exponentially from 1e-7 to 10, then plot loss vs. LR. A learning rate roughly one order of magnitude below the point where the loss starts to blow up is usually your sweet spot.
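The exponential ramp behind the range test is simple to generate. The sketch below (function name and defaults are illustrative, not from a specific library) produces the per-step LRs; you would train one step at each value and record the loss:

```python
def lr_range_test_schedule(lr_min=1e-7, lr_max=10.0, num_steps=100):
    """Exponentially increasing LRs for a range test: lr_min -> lr_max over num_steps."""
    ratio = lr_max / lr_min
    # Each step multiplies the LR by a constant factor, so the sweep is
    # linear on a log scale -- exactly what you want for the loss-vs-LR plot.
    return [lr_min * ratio ** (step / (num_steps - 1)) for step in range(num_steps)]

lrs = lr_range_test_schedule()
# lrs[0] is 1e-7, lrs[-1] is 10.0; plot your per-step loss against these values.
```

Because the loss at each step reflects the LR used for that step, the plot directly shows where training becomes unstable.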
For quick estimation without a range test, the CalcHub Learning Rate Calculator lets you enter your model type, optimizer, and batch size. It outputs a suggested starting LR based on published heuristics, with reasoning.
Common Starting Points by Setup
| Model Type | Optimizer | Typical LR Range |
|---|---|---|
| Fine-tuning BERT/RoBERTa | AdamW | 1e-5 to 5e-5 |
| Training ResNet from scratch | SGD + momentum | 0.01 to 0.1 |
| Fine-tuning ResNet | Adam | 1e-4 to 1e-3 |
| Small transformer from scratch | Adam | 1e-4 to 3e-4 |
| LLM fine-tuning (LoRA) | AdamW | 2e-4 to 1e-3 |
| GAN training (both networks) | Adam | 2e-4 (classic default) |
Linear Scaling Rule
When you change batch size, you should adjust LR proportionally:
New LR = Base LR × (New Batch Size / Reference Batch Size)
If you trained at batch 256 with LR 0.1 and you're now using batch 64, try LR 0.025. This rule holds well for SGD at batch sizes of roughly 64–512. Adam is less sensitive to batch size, but still benefits from some adjustment.
The calculator applies this rule automatically when you enter your hardware's max batch size alongside your reference configuration.
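The scaling rule is one line of arithmetic; a minimal helper (the function name is ours, not the calculator's API) makes it explicit:

```python
def scaled_lr(base_lr, ref_batch_size, new_batch_size):
    """Linear scaling rule: LR scales in proportion to batch size."""
    return base_lr * new_batch_size / ref_batch_size

# The example from above: reference batch 256 at LR 0.1, now training at batch 64.
scaled_lr(0.1, 256, 64)  # 0.025
```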
Scheduling Strategies
The learning rate shouldn't stay constant throughout training. Here's how common schedules work:
- Warmup + Cosine Decay: start from a very small LR (e.g., 1e-6), ramp up linearly over the first 5–10% of total steps, then decay along a cosine curve to near zero. This is the default for most transformer training runs and is the schedule CalcHub generates.
- Step Decay: multiply the LR by 0.1 at fixed epoch milestones (e.g., epochs 30, 60, and 90 for ResNet training). Simple and effective for CNNs trained from scratch.
- Cyclical Learning Rates (CLR): oscillate between a minimum and maximum LR in cycles. Useful for escaping local minima, and often reduces the need to tune the LR precisely.
- Reduce on Plateau: automatically halve the LR when validation loss stops improving for N epochs. Hands-off, but typically slower to converge than a manual schedule.
A Warmup Schedule Example
For a BERT fine-tuning run over 3 epochs, 10,000 steps total, LR max 3e-5:
| Phase | Steps | LR Behavior |
|---|---|---|
| Warmup | 0–500 | Linear: 0 → 3e-5 |
| Cosine decay | 500–10,000 | Cosine: 3e-5 → ~0 |
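The table above maps directly to a piecewise formula. A minimal sketch, using the same numbers (500 warmup steps, 10,000 total, peak 3e-5); the function name is illustrative:

```python
import math

def warmup_cosine_lr(step, lr_max=3e-5, warmup_steps=500, total_steps=10_000):
    """LR at a given step: linear warmup to lr_max, then cosine decay to zero."""
    if step < warmup_steps:
        return lr_max * step / warmup_steps  # linear ramp: 0 -> lr_max
    # Cosine half-period from the end of warmup to the last step: lr_max -> 0.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return lr_max * 0.5 * (1 + math.cos(math.pi * progress))

warmup_cosine_lr(250)     # mid-warmup: 1.5e-5
warmup_cosine_lr(500)     # peak: 3e-5
warmup_cosine_lr(10_000)  # end of training: ~0
```

In practice you would call this once per optimizer step and assign the result to the optimizer's LR.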
Tips
- Don't reuse the same LR across 1-cycle and long multi-epoch runs. An LR that's too high for a long training run can be perfect for a short, aggressive one-cycle schedule.
- Layer-wise LR decay (LLRD) works well for fine-tuning: give lower layers (closer to the input) a smaller LR than the task-specific head. A decay factor of 0.9 per layer moving down from the head is a common starting point.
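To make the LLRD tip concrete, here is a small sketch computing per-layer LRs with a 0.9 decay per layer below the head (function name and the 12-layer example are hypothetical):

```python
def llrd_lrs(base_lr, num_layers, decay=0.9):
    """Per-layer LRs: the top layer gets base_lr; each layer below gets decay x the layer above."""
    # Index 0 = layer closest to the input, last index = top layer nearest the head.
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]

# A hypothetical 12-layer encoder fine-tuned at a peak LR of 3e-5:
lrs = llrd_lrs(3e-5, 12)
# lrs[-1] == 3e-5 (top layer), lrs[-2] == 2.7e-5, and so on down toward the input.
```

You would then build one optimizer parameter group per layer, each with its assigned LR.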
- Use nonzero weight decay with AdamW, typically 0.01–0.1. Leaving weight decay at 0 with Adam-family optimizers is a very common silent failure mode.
How do I know if my learning rate is too high?
Training loss oscillates or diverges in the first few hundred steps. If your loss is jumping up and down rather than trending down, halve the LR and restart.
Is a lower learning rate always safer?
Not really. An LR that's too small converges slowly and may get stuck in sharp, poorly-generalizing minima. The goal is to find the largest stable LR, not the smallest possible one.
Does the learning rate affect final model quality?
Significantly. Learning rate is often the difference between hitting 91% and 93% accuracy on the same architecture. It's worth spending time on a grid search or range test rather than using a default.
Related Calculators
- Batch Size Calculator — learning rate and batch size are linked
- Training Time Calculator — plan total training steps
- Confusion Matrix Calculator — evaluate model after training