March 26, 2026 · 4 min read

FLOPs Calculator — Model Compute Cost Estimator

Calculate floating-point operations (FLOPs) for your neural network. Compare model compute efficiency and estimate hardware requirements for training.

Tags: flops, compute, machine learning, model efficiency, calchub

Model size in parameters tells you about memory requirements. FLOPs — floating-point operations — tell you about compute. A model can have few parameters but be compute-heavy (due to long sequences or large intermediate activations), or have many parameters but low FLOPs (sparse models). Both metrics matter when planning infrastructure.

What Counts as a FLOP?

A FLOP is a single floating-point multiply or add. For a matrix multiplication of shape (m × k) × (k × n), the number of FLOPs is roughly 2 × m × k × n (the factor of 2 accounts for the multiply-accumulate pattern). In practice, when people report "X FLOPs" for a model, they usually mean FLOPs per forward pass on a single sample.
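The 2 × m × k × n rule is easy to wrap in a one-line helper for quick estimates (a sketch; the function name is illustrative):

```python
def matmul_flops(m: int, k: int, n: int) -> int:
    """FLOPs for an (m x k) @ (k x n) matrix multiply.

    Each of the m*n output elements needs k multiplies and k adds,
    hence the factor of 2.
    """
    return 2 * m * k * n

# A Linear(512, 256) layer applied to a single sample (m = 1):
print(matmul_flops(1, 512, 256))  # 262144, i.e. ~262K FLOPs
```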

For training, you need roughly 3× the forward-pass FLOPs per step: one forward pass plus approximately twice that for backpropagation.

FLOPs by Layer Type

| Layer | FLOPs formula | Example |
|---|---|---|
| Linear (in → out) | 2 × in × out | Linear(512, 256) → 262K |
| Conv2D (k×k, in → out, spatial H×W) | 2 × k² × in × out × H × W | Conv(3×3, 64→128, 28×28) → 116M |
| Self-attention (seq len L, dim D) | 8 × L × D² + 4 × L² × D | L=512, D=768 → 3.2B |
| Feed-forward (2 linear layers, hidden 4D) | 4 × D × 4D per token | D=768 → 9.4M per token, per layer |
| Embedding lookup | ~0 (table lookup, no arithmetic) | |

The attention layer's quadratic term (4 × L² × D) is why long sequences are expensive: doubling the sequence length quadruples the attention-score FLOPs.
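These per-layer formulas translate directly into code (a sketch following the factor-of-2 convention above, counting both multiplies and adds; function names are illustrative):

```python
def linear_flops(d_in: int, d_out: int) -> int:
    """Linear layer: 2 * in * out FLOPs per sample/token."""
    return 2 * d_in * d_out

def conv2d_flops(k: int, c_in: int, c_out: int, h: int, w: int) -> int:
    """Conv2D: 2 * k^2 * in * out * H * W FLOPs per image."""
    return 2 * k * k * c_in * c_out * h * w

def attention_flops(seq_len: int, d_model: int) -> int:
    """Self-attention per layer: Q/K/V/output projections (8 * L * D^2)
    plus the quadratic score and value terms (4 * L^2 * D)."""
    return 8 * seq_len * d_model ** 2 + 4 * seq_len ** 2 * d_model

print(f"{linear_flops(512, 256):,}")       # 262,144  (~262K)
print(f"{attention_flops(512, 768):.2e}")  # 3.22e+09 (~3.2B)
```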

Using the CalcHub FLOPs Calculator

Visit CalcHub and open the FLOPs Calculator. You can:

  1. Build layer by layer — add each layer type with its dimensions
  2. Use model presets — BERT-base, GPT-2 variants, ResNet-50, ViT-B, etc.
  3. Enter sequence length for transformer models (critical for the quadratic attention cost)

The output shows FLOPs per forward pass, FLOPs per training step, total FLOPs for a full training run given your dataset size and epoch count, and the equivalent GPU-days on common hardware.
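The full-run total the calculator reports follows from the 3× training rule described earlier (a sketch; the numbers in the example are illustrative):

```python
def training_flops(forward_flops: float, samples: int, epochs: int) -> float:
    """Total training FLOPs: ~3x the forward pass (forward + ~2x backward)
    per sample, per epoch."""
    return 3 * forward_flops * samples * epochs

# A BERT-base-like forward pass (22 GFLOPs), 1M samples, 3 epochs:
print(f"{training_flops(22e9, 1_000_000, 3):.2e}")  # 1.98e+17
```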

FLOPs for Well-Known Models (Single Forward Pass)

| Model | Parameters | FLOPs (forward pass) |
|---|---|---|
| ResNet-50 | 25M | 4.1 GFLOPs |
| BERT-base (seq 128) | 110M | 22 GFLOPs |
| BERT-base (seq 512) | 110M | 178 GFLOPs |
| GPT-2 Small | 117M | 8.8 GFLOPs |
| GPT-3 175B | 175B | ~350 TFLOPs |
| Llama 3 8B (seq 512) | 8B | ~8 TFLOPs |

The sequence-length effect is stark: BERT-base at sequence 512 uses 8× more FLOPs than at sequence 128 despite identical parameters.

From FLOPs to Hardware Requirements

Theoretical peak FLOPs of common GPUs:
| GPU | FP32 TFLOPs | FP16/BF16 TFLOPs |
|---|---|---|
| RTX 3090 | 35.6 | 71.2 |
| RTX 4090 | 82.6 | 165.2 |
| A100 40GB | 19.5 | 312 (bf16, tensor cores) |
| A100 80GB | 19.5 | 312 (bf16, tensor cores) |
Real-world utilization is typically 30–50% of theoretical peak for transformer training. The calculator applies a realistic efficiency factor (default 40%) to convert theoretical TFLOPs to practical throughput.
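Converting a FLOPs budget into wall-clock time is then a single division (a sketch; the 40% default matches the efficiency factor described above):

```python
def gpu_days(total_flops: float, peak_tflops: float,
             efficiency: float = 0.40) -> float:
    """Wall-clock GPU-days at a realistic fraction of theoretical peak."""
    sustained_flops_per_s = peak_tflops * 1e12 * efficiency
    return total_flops / sustained_flops_per_s / 86_400

# 1e18 FLOPs on one A100 at 312 TFLOPs bf16, 40% efficiency:
print(round(gpu_days(1e18, 312), 2))  # 0.09
```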

Tips

  • FLOPs vs MACs: Some papers report MACs (multiply-accumulate operations) instead of FLOPs. MACs = FLOPs / 2. Make sure you're comparing apples to apples.
  • FLOPs don't account for memory bandwidth. A memory-bound operation (like layer normalization) may bottleneck your hardware even when compute FLOPs look modest. Profile with actual hardware, not just FLOPs estimates.
  • Efficient architectures: Models like MobileNet use depthwise separable convolutions to cut FLOPs by 8–9× vs standard conv while maintaining similar accuracy. The tradeoff is lower parallelism.
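The depthwise-separable saving can be checked with the Conv2D formula from earlier (a sketch; the example dimensions are illustrative):

```python
def standard_conv_flops(k, c_in, c_out, h, w):
    """Standard k x k convolution: 2 * k^2 * in * out * H * W."""
    return 2 * k * k * c_in * c_out * h * w

def depthwise_separable_flops(k, c_in, c_out, h, w):
    """Depthwise k x k conv (one filter per channel) + 1x1 pointwise conv."""
    depthwise = 2 * k * k * c_in * h * w
    pointwise = 2 * c_in * c_out * h * w
    return depthwise + pointwise

std = standard_conv_flops(3, 64, 128, 28, 28)
sep = depthwise_separable_flops(3, 64, 128, 28, 28)
print(round(std / sep, 1))  # 8.4: the ~8-9x saving for 3x3 kernels
```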

Is lower FLOPs always better?

For deployment, yes — fewer FLOPs means faster inference and lower energy cost. For training, you care more about FLOPs per quality unit. A model with 2× the FLOPs might reach target accuracy in half the steps, making total training FLOPs comparable.

How do I measure actual FLOPs vs theoretical?

Use torch.profiler (pass with_flops=True) for per-operation FLOPs counting, or the fvcore library, which provides FlopCountAnalysis. These give measured FLOPs for your actual computation graph, which can differ from theoretical estimates due to optimizations.

What's a petaFLOP-day?

It's a unit of compute often used to report training cost: 1 petaFLOP-day is 10¹⁵ FLOP/s sustained for 86,400 seconds, or about 8.64 × 10¹⁹ FLOPs. GPT-3 training required roughly 3,640 petaFLOP-days. At 40% efficiency, an A100 (312 TFLOPs bf16) sustains about 125 TFLOP/s, which works out to roughly 0.12 petaFLOP-days per day, so a GPT-3-scale run needs on the order of 30,000 A100-days.
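This conversion is easy to get wrong because a petaFLOP-day is a rate times a duration, so here is the arithmetic spelled out (a sketch; the 40% figure is the efficiency assumption from above):

```python
PFLOP_PER_S = 1e15
SECONDS_PER_DAY = 86_400
pflop_day_in_flops = PFLOP_PER_S * SECONDS_PER_DAY  # 8.64e19 FLOPs

a100_peak = 312e12            # A100 bf16 tensor-core peak, FLOP/s
sustained = 0.40 * a100_peak  # assume 40% real-world utilization

# petaFLOP-days one A100 actually delivers per wall-clock day:
per_day = sustained * SECONDS_PER_DAY / pflop_day_in_flops
print(round(per_day, 3))      # 0.125

# A100-days for a GPT-3-scale run (~3,640 petaFLOP-days):
print(round(3640 / per_day))
```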
