March 26, 2026 · 4 min read

FLOPs Calculator — Model Compute Cost Estimator

Calculate floating-point operations (FLOPs) for your neural network. Compare model compute efficiency and estimate hardware requirements for training.

Tags: flops, compute, machine learning, model efficiency, calchub

Model size in parameters tells you about memory requirements. FLOPs — floating-point operations — tell you about compute. A model can have few parameters but be compute-heavy (due to long sequences or large intermediate activations), or have many parameters but low FLOPs (sparse models). Both metrics matter when planning infrastructure.

What Counts as a FLOP?

A FLOP is a single floating-point multiply or add. For a matrix multiplication of shape (m × k) × (k × n), the number of FLOPs is roughly 2 × m × k × n (the factor of 2 accounts for the multiply-accumulate pattern). In practice, when people report "X FLOPs" for a model, they usually mean FLOPs per forward pass on a single sample.
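The 2 × m × k × n rule is easy to wrap in a one-line helper for quick estimates (a sketch; the function name is illustrative):

```python
def matmul_flops(m: int, k: int, n: int) -> int:
    """FLOPs for an (m x k) @ (k x n) matrix multiply.

    Each of the m*n output elements needs k multiplies and k adds,
    hence the factor of 2.
    """
    return 2 * m * k * n

# A Linear(512, 256) layer applied to a single sample (m = 1):
print(matmul_flops(1, 512, 256))  # 262144, i.e. ~262K FLOPs
```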

For training, you need roughly 3× the forward-pass FLOPs per step: one forward pass plus approximately twice that for backpropagation.

FLOPs by Layer Type

| Layer | FLOPs formula | Example |
|---|---|---|
| Linear (in → out) | 2 × in × out | Linear(512, 256) → 262K |
| Conv2D (k×k, in → out, spatial H×W) | 2 × k² × in × out × H × W | Conv(3×3, 64→128, 28×28) → 116M |
| Self-attention (seq len L, dim D) | 8 × L × D² + 4 × L² × D | L=512, D=768 → 3.2B |
| Feed-forward (2 linear layers, hidden 4D) | 4 × D × 4D per token | D=768 → 9.4M per token, per layer |
| Embedding lookup | ~0 (table lookup, no arithmetic) | |

The attention layer's quadratic term (4 × L² × D) is why long sequences are expensive: doubling the sequence length quadruples the attention-score FLOPs.
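These per-layer formulas translate directly into code (a sketch following the factor-of-2 convention above, counting both multiplies and adds; function names are illustrative):

```python
def linear_flops(d_in: int, d_out: int) -> int:
    """Linear layer: 2 * in * out FLOPs per sample/token."""
    return 2 * d_in * d_out

def conv2d_flops(k: int, c_in: int, c_out: int, h: int, w: int) -> int:
    """Conv2D: 2 * k^2 * in * out * H * W FLOPs per image."""
    return 2 * k * k * c_in * c_out * h * w

def attention_flops(seq_len: int, d_model: int) -> int:
    """Self-attention per layer: Q/K/V/output projections (8 * L * D^2)
    plus the quadratic score and value terms (4 * L^2 * D)."""
    return 8 * seq_len * d_model ** 2 + 4 * seq_len ** 2 * d_model

print(f"{linear_flops(512, 256):,}")       # 262,144  (~262K)
print(f"{attention_flops(512, 768):.2e}")  # 3.22e+09 (~3.2B)
```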

Using the CalcHub FLOPs Calculator

Visit CalcHub and open the FLOPs Calculator. You can:

  1. Build layer by layer — add each layer type with its dimensions
  2. Use model presets — BERT-base, GPT-2 variants, ResNet-50, ViT-B, etc.
  3. Enter sequence length for transformer models (critical for the quadratic attention cost)

The output shows FLOPs per forward pass, FLOPs per training step, total FLOPs for a full training run given your dataset size and epoch count, and the equivalent GPU-days on common hardware.
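The full-run total the calculator reports follows from the 3× training rule described earlier (a sketch; the numbers in the example are illustrative):

```python
def training_flops(forward_flops: float, samples: int, epochs: int) -> float:
    """Total training FLOPs: ~3x the forward pass (forward + ~2x backward)
    per sample, per epoch."""
    return 3 * forward_flops * samples * epochs

# A BERT-base-like forward pass (22 GFLOPs), 1M samples, 3 epochs:
print(f"{training_flops(22e9, 1_000_000, 3):.2e}")  # 1.98e+17
```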

FLOPs for Well-Known Models (Single Forward Pass)

| Model | Parameters | FLOPs (forward pass) |
|---|---|---|
| ResNet-50 | 25M | 4.1 GFLOPs |
| BERT-base (seq 128) | 110M | 22 GFLOPs |
| BERT-base (seq 512) | 110M | 178 GFLOPs |
| GPT-2 Small | 117M | 8.8 GFLOPs |
| GPT-3 175B | 175B | ~350 TFLOPs |
| Llama 3 8B (seq 512) | 8B | ~8 TFLOPs |

The sequence-length effect is stark: BERT-base at sequence 512 uses 8× more FLOPs than at sequence 128 despite identical parameters.

From FLOPs to Hardware Requirements

Theoretical peak FLOPs of common GPUs:
| GPU | FP32 TFLOPs | FP16/BF16 TFLOPs |
|---|---|---|
| RTX 3090 | 35.6 | 71.2 |
| RTX 4090 | 82.6 | 165.2 |
| A100 40GB | 19.5 | 312 (bf16, tensor cores) |
| A100 80GB | 19.5 | 312 (bf16, tensor cores) |
Real-world utilization is typically 30–50% of theoretical peak for transformer training. The calculator applies a realistic efficiency factor (default 40%) to convert theoretical TFLOPs to practical throughput.
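Converting a FLOPs budget into wall-clock time is then a single division (a sketch; the 40% default matches the efficiency factor described above):

```python
def gpu_days(total_flops: float, peak_tflops: float,
             efficiency: float = 0.40) -> float:
    """Wall-clock GPU-days at a realistic fraction of theoretical peak."""
    sustained_flops_per_s = peak_tflops * 1e12 * efficiency
    return total_flops / sustained_flops_per_s / 86_400

# 1e18 FLOPs on one A100 at 312 TFLOPs bf16, 40% efficiency:
print(round(gpu_days(1e18, 312), 2))  # 0.09
```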

Tips

  • FLOPs vs MACs: Some papers report MACs (multiply-accumulate operations) instead of FLOPs. MACs = FLOPs / 2. Make sure you're comparing apples to apples.
  • FLOPs don't account for memory bandwidth. A memory-bound operation (like layer normalization) may bottleneck your hardware even when compute FLOPs look modest. Profile with actual hardware, not just FLOPs estimates.
  • Efficient architectures: Models like MobileNet use depthwise separable convolutions to cut FLOPs by 8–9× vs standard conv while maintaining similar accuracy. The tradeoff is lower parallelism.
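The depthwise-separable saving can be checked with the Conv2D formula from earlier (a sketch; the example dimensions are illustrative):

```python
def standard_conv_flops(k, c_in, c_out, h, w):
    """Standard k x k convolution: 2 * k^2 * in * out * H * W."""
    return 2 * k * k * c_in * c_out * h * w

def depthwise_separable_flops(k, c_in, c_out, h, w):
    """Depthwise k x k conv (one filter per channel) + 1x1 pointwise conv."""
    depthwise = 2 * k * k * c_in * h * w
    pointwise = 2 * c_in * c_out * h * w
    return depthwise + pointwise

std = standard_conv_flops(3, 64, 128, 28, 28)
sep = depthwise_separable_flops(3, 64, 128, 28, 28)
print(round(std / sep, 1))  # 8.4: the ~8-9x saving for 3x3 kernels
```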

Is lower FLOPs always better?

For deployment, yes — fewer FLOPs means faster inference and lower energy cost. For training, you care more about FLOPs per quality unit. A model with 2× the FLOPs might reach target accuracy in half the steps, making total training FLOPs comparable.

How do I measure actual FLOPs vs theoretical?

Use torch.profiler (pass with_flops=True) for per-operation FLOPs counting, or the fvcore library, which provides FlopCountAnalysis. These give measured FLOPs for your actual computation graph, which can differ from theoretical estimates due to optimizations.

What's a petaFLOP-day?

It's a unit of compute often used to report training cost: 1 petaFLOP-day is 10¹⁵ FLOP/s sustained for 86,400 seconds, or about 8.64 × 10¹⁹ FLOPs. GPT-3 training required roughly 3,640 petaFLOP-days. At 40% efficiency, an A100 (312 TFLOPs bf16) sustains about 125 TFLOP/s, which works out to roughly 0.12 petaFLOP-days per day, so a GPT-3-scale run needs on the order of 30,000 A100-days.
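This conversion is easy to get wrong because a petaFLOP-day is a rate times a duration, so here is the arithmetic spelled out (a sketch; the 40% figure is the efficiency assumption from above):

```python
PFLOP_PER_S = 1e15
SECONDS_PER_DAY = 86_400
pflop_day_in_flops = PFLOP_PER_S * SECONDS_PER_DAY  # 8.64e19 FLOPs

a100_peak = 312e12            # A100 bf16 tensor-core peak, FLOP/s
sustained = 0.40 * a100_peak  # assume 40% real-world utilization

# petaFLOP-days one A100 actually delivers per wall-clock day:
per_day = sustained * SECONDS_PER_DAY / pflop_day_in_flops
print(round(per_day, 3))      # 0.125

# A100-days for a GPT-3-scale run (~3,640 petaFLOP-days):
print(round(3640 / per_day))
```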
