FLOPs Calculator — Model Compute Cost Estimator
Calculate floating-point operations (FLOPs) for your neural network. Compare model compute efficiency and estimate hardware requirements for training.
Model size in parameters tells you about memory requirements. FLOPs — floating-point operations — tell you about compute. A model can have few parameters but be compute-heavy (due to long sequences or large intermediate activations), or have many parameters but low FLOPs (sparse models). Both metrics matter when planning infrastructure.
What Counts as a FLOP?
A FLOP is a single floating-point multiply or add. For a matrix multiplication of shape (m × k) × (k × n), the number of FLOPs is roughly 2 × m × k × n (the factor of 2 accounts for the multiply-accumulate pattern). In practice, when people report "X FLOPs" for a model, they usually mean FLOPs per forward pass on a single sample.
For training, you need roughly 3× the forward-pass FLOPs per step: one forward pass plus approximately twice that for backpropagation.
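The two rules of thumb above — 2 × m × k × n per matmul and roughly 3× forward cost per training step — can be sketched in a few lines (helper names are illustrative, not from any library):

```python
def matmul_flops(m: int, k: int, n: int) -> int:
    # (m x k) @ (k x n): m*n outputs, each a k-term multiply-accumulate,
    # counted as 2 FLOPs (one multiply + one add) per MAC
    return 2 * m * k * n

def training_flops_per_step(forward_flops: int) -> int:
    # backward pass costs roughly 2x the forward pass
    return 3 * forward_flops

fwd = matmul_flops(1, 512, 256)       # one sample through Linear(512, 256)
print(fwd)                             # 262144, i.e. ~262K
print(training_flops_per_step(fwd))    # ~3x that per training step
```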
FLOPs by Layer Type
| Layer | FLOPs Formula | Example |
|---|---|---|
| Linear (in → out) | 2 × in × out | Linear(512, 256) → 262K |
| Conv2D (k×k, in→out, spatial H×W) | 2 × k² × in × out × H × W | Conv(3×3, 64, 128, 14×14) → 29M |
| Self-attention (seq len L, dim D) | 8 × L × D² + 4 × L² × D | L=512, D=768 → 3.2B |
| Feed-forward (2 linear layers, D → 4D → D) | 4 × D × 4D per token | D=768 → 9.4M per token |
| Embedding lookup | ~0 (table lookup, no arithmetic) | — |
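The table formulas translate directly into code. This sketch counts two FLOPs per multiply-accumulate throughout (the same convention as the linear and conv rows); the attention count covers the Q/K/V/output projections plus the score matrix and the weighted sum:

```python
def linear_flops(d_in: int, d_out: int) -> int:
    return 2 * d_in * d_out

def conv2d_flops(k: int, c_in: int, c_out: int, h: int, w: int) -> int:
    return 2 * k * k * c_in * c_out * h * w

def attention_flops(seq: int, dim: int) -> int:
    # 4 projections (2*L*D^2 each) + scores (2*L^2*D) + weighted sum (2*L^2*D)
    return 8 * seq * dim**2 + 4 * seq**2 * dim

def ffn_flops(dim: int, expansion: int = 4) -> int:
    # two linear layers per token: dim -> expansion*dim -> dim
    return 2 * 2 * dim * (expansion * dim)

print(linear_flops(512, 256))        # 262144  (~262K)
print(conv2d_flops(3, 64, 128, 14, 14))  # ~29M
print(attention_flops(512, 768))     # ~3.2B
print(ffn_flops(768))                # ~9.4M per token
```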
Using the CalcHub FLOPs Calculator
Visit CalcHub and open the FLOPs Calculator. You can:
- Build layer by layer — add each layer type with its dimensions
- Use model presets — BERT-base, GPT-2 variants, ResNet-50, ViT-B, etc.
- Enter sequence length for transformer models (critical for the quadratic attention cost)
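To see why sequence length is critical, here is a small sketch (reusing the attention formula from the table above) showing how the quadratic L² term grows from a minority of the cost to the dominant share as the sequence gets longer:

```python
def attention_flops(seq: int, dim: int) -> int:
    return 8 * seq * dim**2 + 4 * seq**2 * dim

for L in (128, 512, 2048):
    total = attention_flops(L, 768)
    quadratic_share = 4 * L**2 * 768 / total
    print(f"L={L}: {total / 1e9:.2f} GFLOPs, {quadratic_share:.0%} from the L^2 term")
```

At L=128 the quadratic term is under 10% of the cost; by L=2048 it is the majority.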
FLOPs for Well-Known Models (Single Forward Pass)
| Model | Parameters | FLOPs (forward pass) |
|---|---|---|
| ResNet-50 | 25M | 4.1 GFLOPs |
| BERT-base (seq 128) | 110M | 22 GFLOPs |
| BERT-base (seq 512) | 110M | 178 GFLOPs |
| GPT-2 Small | 117M | 8.8 GFLOPs |
| GPT-3 175B | 175B | ~350 TFLOPs |
| Llama 3 8B (seq 512) | 8B | ~8 TFLOPs |
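A useful cross-check on tables like this is the common dense-transformer rule of thumb: forward FLOPs ≈ 2 × (non-embedding parameters) × tokens. BERT-base has roughly 86M non-embedding parameters (the ~110M total minus ~24M of embeddings — an approximation, not an exact figure):

```python
def forward_flops_2nd(params: int, seq_len: int) -> int:
    # rule of thumb: ~2 FLOPs per non-embedding parameter per token
    return 2 * params * seq_len

est = forward_flops_2nd(86_000_000, 128)
print(f"{est / 1e9:.1f} GFLOPs")  # 22.0 GFLOPs -- matches the BERT-base (seq 128) row
```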
From FLOPs to Hardware Requirements
Theoretical peak FLOPs of common GPUs:
| GPU | FP32 TFLOPs | FP16 TFLOPs |
|---|---|---|
| RTX 3090 | 35.6 | 71.2 |
| RTX 4090 | 82.6 | 165.2 |
| A100 40GB | 19.5 | 312 (bf16) |
| A100 80GB | 19.5 | 312 (bf16) |
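Dividing a workload's FLOPs by a GPU's sustained throughput gives a first-order wall-clock estimate. Real workloads rarely exceed ~40–50% of peak, so a utilization factor is essential; the 40% default and the 1e21-FLOP workload below are illustrative assumptions:

```python
def training_days(total_flops: float, peak_flops_per_s: float,
                  utilization: float = 0.4, n_gpus: int = 1) -> float:
    # sustained throughput = peak * utilization, summed over GPUs
    sustained = peak_flops_per_s * utilization * n_gpus
    return total_flops / sustained / 86_400  # seconds per day

# hypothetical: 1e21 total training FLOPs on 8x A100 (312 TFLOPs bf16 peak)
print(f"{training_days(1e21, 312e12, n_gpus=8):.1f} days")  # ~11.6 days
```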
Tips
- FLOPs vs MACs: Some papers report MACs (multiply-accumulate operations) instead of FLOPs. MACs = FLOPs / 2. Make sure you're comparing apples to apples.
- FLOPs don't account for memory bandwidth. A memory-bound operation (like layer normalization) may bottleneck your hardware even when compute FLOPs look modest. Profile with actual hardware, not just FLOPs estimates.
- Efficient architectures: Models like MobileNet use depthwise separable convolutions to cut FLOPs by 8–9× vs standard conv while maintaining similar accuracy. The tradeoff is lower parallelism.
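The depthwise-separable saving in the last tip is easy to verify from the conv formula: the standard conv's k² × in × out product is split into a per-channel k² term plus a 1×1 channel-mixing term, and the ratio works out to roughly 1/out + 1/k². Using the same example dimensions as the layer table:

```python
def standard_conv_flops(k: int, c_in: int, c_out: int, h: int, w: int) -> int:
    return 2 * k * k * c_in * c_out * h * w

def depthwise_separable_flops(k: int, c_in: int, c_out: int, h: int, w: int) -> int:
    depthwise = 2 * k * k * c_in * h * w   # one k x k filter per input channel
    pointwise = 2 * c_in * c_out * h * w   # 1x1 conv mixes channels
    return depthwise + pointwise

std = standard_conv_flops(3, 64, 128, 14, 14)
sep = depthwise_separable_flops(3, 64, 128, 14, 14)
print(f"{std / sep:.1f}x fewer FLOPs")  # 8.4x, within the quoted 8-9x range
```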
Is lower FLOPs always better?
For deployment, usually — fewer FLOPs mean faster inference and lower energy cost. For training, you care more about FLOPs per quality unit. A model with 2× the FLOPs might reach target accuracy in half the steps, making total training FLOPs comparable.
How do I measure actual FLOPs vs theoretical?
Use torch.profiler for per-operation FLOPs counting, or the fvcore library which provides FlopCountAnalysis. These give measured FLOPs for your actual computation graph, which can differ from theoretical due to optimizations.
What's a petaFLOP-day?
It's a unit of compute often used to report training cost: 1 petaFLOP-day = 10¹⁵ FLOP/s sustained for 86,400 seconds ≈ 8.64 × 10¹⁹ FLOPs. GPT-3 training required roughly 3,640 petaFLOP-days. At 40% efficiency on A100s (312 TFLOPs bf16), one A100 delivers about 0.12 petaFLOP-days per day, so GPT-3-scale training works out to on the order of 30,000 A100-days.
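The petaFLOP-day arithmetic is worth writing out, since the units (a rate × a duration) are easy to trip over. The 40% utilization factor is the same illustrative assumption as above:

```python
PFLOP_DAY = 1e15 * 86_400           # 1 PFLOP/s for a day = 8.64e19 FLOPs

gpt3_flops = 3_640 * PFLOP_DAY      # ~3.1e23 FLOPs total
a100_sustained = 312e12 * 0.4       # bf16 peak at 40% utilization

pflop_days_per_day = a100_sustained * 86_400 / PFLOP_DAY
print(f"{pflop_days_per_day:.2f} petaFLOP-days per A100 per day")  # ~0.12
print(f"{gpt3_flops / (a100_sustained * 86_400):,.0f} A100-days")  # ~29,000
```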
Related Calculators
- Model Parameters Calculator — count parameters alongside FLOPs
- Training Time Calculator — convert FLOPs to wall-clock time
- GPU Memory Calculator — plan VRAM alongside compute requirements