Learning Rate Calculator and Scheduler Guide
Calculate optimal learning rates and schedule them correctly for your ML training runs. Covers warmup, cosine decay, cyclical, and step decay strategies.
Learning rate is the single hyperparameter that breaks more training runs than any other. Too high and your loss explodes or oscillates. Too low and you waste compute crawling toward a mediocre minimum. Getting it right isn't magic — there's a principled approach to finding a good starting point and then scheduling it intelligently.
Starting Point: The Learning Rate Range Test
The most reliable way to find a good initial learning rate is to run a short "LR range test": train for ~100 steps with the LR increasing exponentially from 1e-7 to 10, then plot loss vs. LR. A learning rate roughly one order of magnitude below the point where the loss starts to blow up is usually your sweet spot.
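The exponential ramp behind the range test is simple to generate. The sketch below (function name and defaults are illustrative, not from a specific library) produces the per-step LRs; you would train one step at each value and record the loss:

```python
def lr_range_test_schedule(lr_min=1e-7, lr_max=10.0, num_steps=100):
    """Exponentially increasing LRs for a range test: lr_min -> lr_max over num_steps."""
    ratio = lr_max / lr_min
    # Each step multiplies the LR by a constant factor, so the sweep is
    # linear on a log scale -- exactly what you want for the loss-vs-LR plot.
    return [lr_min * ratio ** (step / (num_steps - 1)) for step in range(num_steps)]

lrs = lr_range_test_schedule()
# lrs[0] is 1e-7, lrs[-1] is 10.0; plot your per-step loss against these values.
```

Because the loss at each step reflects the LR used for that step, the plot directly shows where training becomes unstable.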
For quick estimation without a range test, the CalcHub Learning Rate Calculator lets you enter your model type, optimizer, and batch size. It outputs a suggested starting LR based on published heuristics, with reasoning.
Common Starting Points by Setup
| Model Type | Optimizer | Typical LR Range |
|---|---|---|
| Fine-tuning BERT/RoBERTa | AdamW | 1e-5 to 5e-5 |
| Training ResNet from scratch | SGD + momentum | 0.01 to 0.1 |
| Fine-tuning ResNet | Adam | 1e-4 to 1e-3 |
| Small transformer from scratch | Adam | 1e-4 to 3e-4 |
| LLM fine-tuning (LoRA) | AdamW | 2e-4 to 1e-3 |
| GAN training (both networks) | Adam | 2e-4 (classic default) |
Linear Scaling Rule
When you change batch size, you should adjust LR proportionally:
New LR = Base LR × (New Batch Size / Reference Batch Size)
If you trained at batch 256 with LR 0.1 and you're now using batch 64, try LR 0.025. This rule holds well for SGD at batch sizes of roughly 64–512. Adam is less sensitive to batch size, but still benefits from some adjustment.
The calculator applies this rule automatically when you enter your hardware's max batch size alongside your reference configuration.
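The scaling rule is one line of arithmetic; a minimal helper (the function name is ours, not the calculator's API) makes it explicit:

```python
def scaled_lr(base_lr, ref_batch_size, new_batch_size):
    """Linear scaling rule: LR scales in proportion to batch size."""
    return base_lr * new_batch_size / ref_batch_size

# The example from above: reference batch 256 at LR 0.1, now training at batch 64.
scaled_lr(0.1, 256, 64)  # 0.025
```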
Scheduling Strategies
The learning rate shouldn't stay constant throughout training. Here's how common schedules work:
- Warmup + Cosine Decay: start from a very small LR (e.g., 1e-6), ramp up linearly over the first 5–10% of total steps, then decay along a cosine curve to near zero. This is the default for most transformer training runs and is the schedule CalcHub generates.
- Step Decay: multiply the LR by 0.1 at fixed epoch milestones (e.g., epochs 30, 60, and 90 for ResNet training). Simple and effective for CNNs trained from scratch.
- Cyclical Learning Rates (CLR): oscillate between a minimum and maximum LR in cycles. Useful for escaping local minima, and often reduces the need to tune the LR precisely.
- Reduce on Plateau: automatically halve the LR when validation loss stops improving for N epochs. Hands-off, but typically slower to converge than a manual schedule.
A Warmup Schedule Example
For a BERT fine-tuning run over 3 epochs, 10,000 steps total, LR max 3e-5:
| Phase | Steps | LR Behavior |
|---|---|---|
| Warmup | 0–500 | Linear: 0 → 3e-5 |
| Cosine decay | 500–10,000 | Cosine: 3e-5 → ~0 |
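The table above maps directly to a piecewise formula. A minimal sketch, using the same numbers (500 warmup steps, 10,000 total, peak 3e-5); the function name is illustrative:

```python
import math

def warmup_cosine_lr(step, lr_max=3e-5, warmup_steps=500, total_steps=10_000):
    """LR at a given step: linear warmup to lr_max, then cosine decay to zero."""
    if step < warmup_steps:
        return lr_max * step / warmup_steps  # linear ramp: 0 -> lr_max
    # Cosine half-period from the end of warmup to the last step: lr_max -> 0.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return lr_max * 0.5 * (1 + math.cos(math.pi * progress))

warmup_cosine_lr(250)     # mid-warmup: 1.5e-5
warmup_cosine_lr(500)     # peak: 3e-5
warmup_cosine_lr(10_000)  # end of training: ~0
```

In practice you would call this once per optimizer step and assign the result to the optimizer's LR.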
Tips
- Don't reuse the same LR across 1-cycle and long multi-epoch runs. An LR that's too high for a long training run can be perfect for a short, aggressive one-cycle schedule.
- Layer-wise LR decay (LLRD) works well for fine-tuning: give lower layers (closer to the input) a smaller LR than the task-specific head. A decay factor of 0.9 per layer moving down from the head is a common starting point.
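To make the LLRD tip concrete, here is a small sketch computing per-layer LRs with a 0.9 decay per layer below the head (function name and the 12-layer example are hypothetical):

```python
def llrd_lrs(base_lr, num_layers, decay=0.9):
    """Per-layer LRs: the top layer gets base_lr; each layer below gets decay x the layer above."""
    # Index 0 = layer closest to the input, last index = top layer nearest the head.
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]

# A hypothetical 12-layer encoder fine-tuned at a peak LR of 3e-5:
lrs = llrd_lrs(3e-5, 12)
# lrs[-1] == 3e-5 (top layer), lrs[-2] == 2.7e-5, and so on down toward the input.
```

You would then build one optimizer parameter group per layer, each with its assigned LR.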
- Use nonzero weight decay with AdamW, typically 0.01–0.1. Leaving weight decay at 0 with Adam-family optimizers is a very common silent failure mode.
How do I know if my learning rate is too high?
Training loss oscillates or diverges in the first few hundred steps. If your loss is jumping up and down rather than trending down, halve the LR and restart.
Is a lower learning rate always safer?
Not really. An LR that's too small converges slowly and may get stuck in sharp, poorly-generalizing minima. The goal is to find the largest stable LR, not the smallest possible one.
Does the learning rate affect final model quality?
Significantly. Learning rate is often the difference between hitting 91% and 93% accuracy on the same architecture. It's worth spending time on a grid search or range test rather than using a default.
Related Calculators
- Batch Size Calculator — learning rate and batch size are linked
- Training Time Calculator — plan total training steps
- Confusion Matrix Calculator — evaluate model after training