FLOP, FLOPs, FLOPS, IsoFLOP: A practical guide to compute accounting

May 13, 2026

by Leonardo

1. Four acronyms, three different units

The deep learning literature uses FLOP, FLOPs, FLOPS, and IsoFLOP almost interchangeably, but they refer to three fundamentally different things: a count of operations, a rate, and an experimental protocol. Mixing them up leads to order-of-magnitude mistakes when you compare papers, plan training runs, or read a GPU spec sheet. The disambiguation:

FLOP (singular): one FLoating-point OPeration. A single $+$ , $-$ , $\times$ , or $\div$ on floating-point numbers. This is a count of one atomic event, so it almost never appears alone outside of explanations like this paragraph.
FLOPs (plural, lowercase s): FLoating-point OPerations, a cumulative count over some workload. Units: a dimensionless count, usually written with SI prefixes ( $10^{18}$ FLOPs is one ExaFLOP, $10^{21}$ FLOPs is one ZettaFLOP). This is the natural unit for "how much compute did training this model cost?"
FLOPS or FLOP/s (capital S, or with a slash): FLoating-point OPerations per Second, a rate. Units: $\frac{\text{FLOPs}}{\text{second}}$ . This is the natural unit for "how fast is this GPU?" Modern accelerators are rated in TeraFLOPS ( $10^{12}$ ) or PetaFLOPS ( $10^{15}$ ).
IsoFLOP: not a unit at all, but an experimental protocol: hold total training FLOPs constant and vary something else (model size, data, hyperparameters) to find an optimum on a fixed compute budget.

The shorthand:

$$\text{FLOPs (work)} = \text{FLOPS (rate)} \times \text{time}$$

An H100 SXM rated at 989 TFLOPS of BF16 dense throughput, run for one hour, performs at most $989 \times 10^{12} \times 3600 \approx 3.56 \times 10^{18}$ FLOPs of useful work, or 3.56 ExaFLOPs. It is "at most" because real workloads almost never sustain peak throughput; see Section 4.
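The work = rate × time identity can be checked with a few lines of Python. This is a minimal sketch: the function name `work_flops` is made up for illustration, and the only number used is the 989 TFLOPS peak figure quoted above.

```python
def work_flops(rate_flops_per_s: float, seconds: float) -> float:
    """Upper bound on FLOPs performed: rate (FLOPS) times elapsed time (s)."""
    return rate_flops_per_s * seconds


# Peak dense BF16 rate quoted in the text; sustained rates are lower.
H100_BF16_PEAK = 989e12  # FLOPS
ONE_HOUR = 3600.0        # seconds

peak_work = work_flops(H100_BF16_PEAK, ONE_HOUR)
print(f"{peak_work:.3e} FLOPs at peak over one hour")
```

Keeping the rate and the count in separately named variables is a cheap guard against mixing the two units.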

2. Counting FLOPs in a transformer forward pass

The atomic operation in deep learning is the matrix multiplication. For $𝐴 \in ℝ^{𝑚 \times 𝑘}$ and $𝐵 \in ℝ^{𝑘 \times 𝑛}$ , computing $𝐶 = 𝐴 𝐵$ takes

$$\text{FLOPs}(𝐴 𝐵) = 2 𝑚 𝑛 𝑘$$

The factor of 2 comes from each output element requiring $𝑘$ multiplications and $𝑘 - 1 \approx 𝑘$ additions. People sometimes call a fused multiply-add a single "operation" and write $𝑚 𝑛 𝑘$ instead — this is the source of the most common factor-of-two discrepancy between papers. Modern accelerators implement FMA in hardware but report throughput counting each FMA as 2 FLOPs, so the $2 𝑚 𝑛 𝑘$ convention is the one that matches vendor spec sheets.
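To see where the factor of 2 comes from, the sketch below counts multiplications and additions in a naive triple-loop matrix multiply and compares the exact total, $𝑚 𝑛 (2 𝑘 - 1)$ , with the $2 𝑚 𝑛 𝑘$ convention. The function name and the toy shapes are illustrative assumptions; real kernels are tiled and fused, but perform the same arithmetic.

```python
def matmul_with_op_count(a, b):
    """Naive matmul over lists of lists, returning (C, multiplications, additions)."""
    m, k = len(a), len(a[0])
    n = len(b[0])
    c = [[0.0] * n for _ in range(m)]
    mults = adds = 0
    for i in range(m):
        for j in range(n):
            # First product needs no addition; the remaining k - 1 each need one.
            acc = a[i][0] * b[0][j]
            mults += 1
            for p in range(1, k):
                acc += a[i][p] * b[p][j]
                mults += 1
                adds += 1
            c[i][j] = acc
    return c, mults, adds


m, k, n = 4, 64, 5
a = [[1.0] * k for _ in range(m)]
b = [[1.0] * n for _ in range(k)]
c, mults, adds = matmul_with_op_count(a, b)

exact = mults + adds          # m * n * (2k - 1)
convention = 2 * m * n * k    # the 2mnk rule
print(exact, convention, exact / convention)
```

The relative gap between the two counts is $\frac{1}{2 𝑘}$ , which is negligible for the inner dimensions used in practice.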

For a decoder-only transformer with $𝑁$ non-embedding parameters processing a sequence of length $𝑇$ , the forward pass cost decomposes into:

Linear projections (QKV, output projection, two MLP layers): every parameter is touched once per token, costing $2 𝑁 𝑇$ FLOPs total.
Attention scores and values (the $𝑄 𝐾^{⊤}$ and softmax-weighted $𝑉$ steps): $2 \cdot 2 \cdot 𝑛_{layer} \cdot 𝑛_{head} \cdot 𝑑_{head} \cdot 𝑇^{2}$ FLOPs.

For models where $𝑇 \ll 12 𝑑_{model}$ , the attention term is a small correction and the widely cited approximation holds:

$$𝐶_{forward} \approx 2 𝑁 𝑇 \quad \text{(per sequence of length } 𝑇 \text{)}$$

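Per token, that is $2 𝑁$ . The backward pass is conventionally costed at $4 𝑁$ per token: it computes one gradient with respect to the layer inputs and one with respect to the weights, each costing roughly as much as the forward pass. Adding forward and backward gives the widely used Kaplan/Chinchilla estimate for total training compute: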

$$𝐶_{train} \approx 6 𝑁 𝐷$$

where $𝐷$ is the total number of training tokens. This single formula, six FLOPs per parameter per training token, underpins most modern scaling-law papers.
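The forward-pass decomposition above can be sketched in code to see how small the attention term is at ordinary context lengths. The configuration is hypothetical, and the parameter count uses the common approximation $𝑁 \approx 12 \, 𝑛_{layer} \, 𝑑_{model}^{2}$ , which is an assumption not stated in the text.

```python
def forward_flops(n_layer: int, n_head: int, d_head: int, seq_len: int):
    """Return (linear, attention) forward FLOPs for one sequence.

    Assumes N ~= 12 * n_layer * d_model^2 non-embedding parameters
    (attention + MLP weights of a standard decoder block).
    """
    d_model = n_head * d_head
    n_params = 12 * n_layer * d_model**2
    linear = 2 * n_params * seq_len                                # 2 N T
    attention = 2 * 2 * n_layer * n_head * d_head * seq_len**2     # QK^T and weighted V
    return linear, attention


# Hypothetical configuration, chosen only for illustration.
n_layer, n_head, d_head, seq_len = 32, 32, 128, 2048
linear, attention = forward_flops(n_layer, n_head, d_head, seq_len)
ratio = attention / linear
print(f"attention / linear = {ratio:.4f}")
```

Under this parameter-count assumption the ratio simplifies to $\frac{𝑇}{6 𝑑_{model}}$ , so doubling the context length doubles the relative weight of attention.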

A concrete instance: a 70B-parameter model trained on 2T tokens spends approximately $6 \times 7 \times 10^{10} \times 2 \times 10^{12} = 8.4 \times 10^{23}$ FLOPs, or 0.84 YottaFLOPs. On a cluster sustaining 1 ExaFLOPS of effective throughput, that is about 10 days of wall-clock time.
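The same arithmetic as a small script, as a sketch: the helper names are invented here, and the 1 ExaFLOPS figure is the sustained (not peak) cluster throughput assumed in the paragraph above.

```python
SECONDS_PER_DAY = 86_400.0


def train_flops(n_params: float, n_tokens: float) -> float:
    """Total training compute under the C ~= 6 N D approximation."""
    return 6.0 * n_params * n_tokens


def wall_clock_days(total_flops: float, sustained_flops_per_s: float) -> float:
    """Days needed at a given sustained (not peak) cluster throughput."""
    return total_flops / sustained_flops_per_s / SECONDS_PER_DAY


c_train = train_flops(n_params=70e9, n_tokens=2e12)
days = wall_clock_days(c_train, sustained_flops_per_s=1e18)
print(f"{c_train:.2e} FLOPs, {days:.1f} days")
```

Passing the sustained rate rather than the spec-sheet peak matters: at 40% utilization the same run would need a cluster with 2.5 ExaFLOPS of peak throughput to finish in the same time.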

3. Where the $6 𝑁 𝐷$ formula bends

The $6 𝑁 𝐷$ rule is an approximation, and it leaks in three predictable places.

First, activation checkpointing. Recomputing activations during the backward pass to save memory adds up to one extra forward pass, pushing the per-token cost from $6 𝑁$ toward $8 𝑁$ under full recomputation. This is a deliberate trade: paying up to $\approx 33\%$ more compute to fit a larger model in memory.

Second, long context. The attention $𝑇^{2}$ term is no longer negligible when $𝑇$ is large. A 1M-token sequence makes attention the dominant cost even for trillion-parameter models, and the $6 𝑁 𝐷$ formula understates compute by a factor that grows linearly in $𝑇$ .

Third, Mixture-of-Experts. In an MoE model with $𝑁_{total}$ total parameters and $𝑁_{active}$ activated per token (e.g., DeepSeek-V3's 671B total / 37B active), training compute scales with $𝑁_{active}$ , not $𝑁_{total}$ . The relevant accounting is $𝐶 \approx 6 𝑁_{active} 𝐷$ , while memory and bandwidth scale with $𝑁_{total}$ . This is exactly why MoE is attractive: you buy capacity at memory prices, not compute prices.
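The three corrections can be folded into one per-token estimate. The sketch below is an illustration under stated assumptions, not a reference implementation: it treats the backward pass as twice the forward cost for attention as well as for the linear layers, models full recomputation as one extra forward pass, and uses a hypothetical dense configuration (80 layers, $𝑑_{model} = 8192$ ) for the long-context case.

```python
def train_flops_per_token(n_active: float, n_layer: int = 0, d_model: int = 0,
                          seq_len: int = 0, full_recompute: bool = False) -> float:
    """Per-token training FLOPs with the three corrections from the text.

    n_active: parameters used per token (equals total parameters for dense models).
    The attention term is 4 * n_layer * d_model * T forward FLOPs per token,
    i.e. the article's 4 * n_layer * d_model * T^2 per sequence divided by T.
    """
    forward = 2.0 * n_active + 4.0 * n_layer * d_model * seq_len
    # forward + backward (2x forward) = 3 passes; full recompute adds a 4th.
    forward_equivalents = 4 if full_recompute else 3
    return forward_equivalents * forward


dense = train_flops_per_token(70e9)                                # 6 N
checkpointed = train_flops_per_token(70e9, full_recompute=True)    # 8 N
moe_active = train_flops_per_token(37e9)                           # 6 N_active
moe_if_dense = train_flops_per_token(671e9)                        # 6 N_total
long_ctx = train_flops_per_token(70e9, n_layer=80, d_model=8192, seq_len=1_000_000)

print(f"checkpointing overhead: {checkpointed / dense:.3f}x")
print(f"MoE saving vs dense-equivalent: {moe_if_dense / moe_active:.1f}x")
print(f"1M-token context vs 6N: {long_ctx / dense:.1f}x")
```

With the attention arguments left at zero the function reduces to the plain $6 𝑁$ rule, which makes it easy to see how much each correction contributes on its own.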

4. FLOPS as a hardware spec: peak, sustained, and "useful"

A vendor's headline FLOPS number is peak theoretical throughput under ideal conditions — typically dense matrix multiplications at the lowest supported precision, with the on-chip tensor cores fully utilized. Real workloads see something much lower. The gap between peak FLOPS and what you actually get out of a training run is large enough that the field invented a dedicated vocabulary for it.

4.1. Achieved FLOPS, MFU, and HFU

Three utilization metrics get conflated in practice, and the differences matter:

Achieved FLOPS: the raw observed rate of useful FLOPs per second during a run. If your training step processes $𝐵$ tokens in $𝑡_{step}$ seconds on a dense model of $𝑁$ parameters, achieved FLOPS $\approx \frac{6 𝑁 𝐵}{𝑡_{step}}$ .
Model FLOPs Utilization (MFU): achieved FLOPS divided by the aggregate peak FLOPS of all accelerators:

$$\text{MFU} = \frac{\text{useful model FLOPs per step}}{𝑛_{gpu} \cdot \text{FLOPS}_{peak} \cdot 𝑡_{step}} = \frac{6 𝑁 𝐵}{𝑛_{gpu} \cdot \text{FLOPS}_{peak} \cdot 𝑡_{step}}$$

Hardware FLOPs Utilization (HFU): same denominator, but the numerator counts every FLOP the chip executed, including redundant work:

$$\text{HFU} = \frac{\text{all FLOPs executed per step}}{𝑛_{gpu} \cdot \text{FLOPS}_{peak} \cdot 𝑡_{step}}$$

The numerator distinction is the whole point. Activation checkpointing recomputes a forward pass during the backward pass to save memory: those FLOPs run on the silicon (counted in HFU) but do not appear in the $6 𝑁 𝐷$ model accounting (not counted in MFU). So $\text{HFU} \geq \text{MFU}$ always, and a wide gap signals that the run is paying for recomputation or kernel-level redundancy that is not buying direct optimization progress. The inference-side analogue is work discarded by speculative decoding.
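A minimal sketch of both metrics, assuming a dense model and full activation recomputation (so the chip executes roughly $8 𝑁 𝐵$ FLOPs per step). The run parameters are hypothetical; only the 989 TFLOPS peak figure comes from earlier in the article.

```python
def mfu(n_params: float, tokens_per_step: float, t_step: float,
        n_gpu: int, peak_flops: float) -> float:
    """Model FLOPs Utilization: useful 6*N*B FLOPs over aggregate peak capacity."""
    return 6.0 * n_params * tokens_per_step / (n_gpu * peak_flops * t_step)


def hfu(executed_flops_per_step: float, t_step: float,
        n_gpu: int, peak_flops: float) -> float:
    """Hardware FLOPs Utilization: every executed FLOP over the same denominator."""
    return executed_flops_per_step / (n_gpu * peak_flops * t_step)


# Hypothetical run: 70B dense model, 4M tokens per step, 4 s per step, 1024 GPUs.
N, B, T_STEP, N_GPU, PEAK = 70e9, 4e6, 4.0, 1024, 989e12

run_mfu = mfu(N, B, T_STEP, N_GPU, PEAK)
# Assumption: full recomputation adds one forward pass, so 8*N*B FLOPs are executed.
run_hfu = hfu(8.0 * N * B, T_STEP, N_GPU, PEAK)
print(f"MFU = {run_mfu:.3f}, HFU = {run_hfu:.3f}")
```

In a real run the HFU numerator would come from profiler or hardware counters rather than from a formula; the fixed 8/6 ratio here is just the full-recompute case.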

The PaLM paper popularized MFU specifically because peak FLOPS is a misleading bragging right — what matters is how much of that peak survives the journey through memory bandwidth, all-reduce collectives, and pipeline bubbles. MFU is the single number that lets you compare training efficiency across model sizes, hardware generations, and parallelism strategies on an equal footing.

4.2. What sits between peak FLOPS and MFU

A 100% peak number erodes through a stack of friction layers, in roughly decreasing order of damage on a modern training cluster:

Memory-bound operations. LayerNorm, softmax, dropout, residual adds, and the entire attention $𝑄 𝐾^{⊤} / 𝑉$ kernel at small sequence lengths are limited by HBM bandwidth, not tensor-core throughput. They contribute a small percentage of total FLOPs but a large percentage of wall-clock time.
Communication overhead. All-reduce gradients across data-parallel ranks, all-gather weights/activations under tensor parallelism, point-to-point sends under pipeline parallelism. Even on NVLink/InfiniBand fabrics, the GPU sits idle during the un-overlapped portion of every collective.
Pipeline bubbles. With $𝑝$ pipeline stages and $𝑚$ micro-batches per step, the bubble fraction is $\approx \frac{𝑝 - 1}{𝑝 + 𝑚 - 1}$ . Small $𝑚$ (memory-constrained) directly burns peak FLOPS.
Optimizer and overhead steps. Adam updates, gradient clipping, weight casting, checkpoint saves. These are tiny in FLOPs but block the step boundary.
Dataloading and host-side overhead. Tokenization, shuffling, host-to-device transfers, kernel launch latency — visible as stalls if the input pipeline is not perfectly overlapped.

A useful diagnostic frame: budget your $𝑡_{step}$ across these buckets with a profiler (Nsight Systems, PyTorch profiler) and check whether MFU losses match where the timeline says time is spent. If the bubble math says $15\%$ but you are losing $30\%$ of peak, the gap is elsewhere.
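The bubble math from the list above is easy to script. This sketch uses the $\frac{𝑝 - 1}{𝑝 + 𝑚 - 1}$ approximation as given; the helper names and the example stage counts are illustrative assumptions.

```python
import math


def bubble_fraction(p: int, m: int) -> float:
    """Approximate idle fraction of a pipeline with p stages and m micro-batches."""
    return (p - 1) / (p + m - 1)


def min_microbatches(p: int, max_bubble: float) -> int:
    """Smallest m with bubble_fraction(p, m) <= max_bubble.

    Solves (p - 1) / (p + m - 1) <= f for m, giving m >= (p - 1) * (1 - f) / f.
    """
    return math.ceil((p - 1) * (1.0 - max_bubble) / max_bubble)


for m in (8, 32, 128):
    print(f"p=8, m={m}: bubble = {bubble_fraction(8, m):.1%}")

m_needed = min_microbatches(p=8, max_bubble=0.15)
print(f"micro-batches needed for <=15% bubble at p=8: {m_needed}")
```

Comparing this predicted fraction with the idle time a profiler actually shows is one way to tell whether the bubble or something else is eating the step.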

4.3. When you actually need MFU vs HFU

A short decision rule:

Comparing two training runs on the same hardware: use MFU. It strips out implementation choices like recomputation that change "how much silicon you bought" without changing model progress.
Diagnosing hardware-level inefficiency: use HFU. The gap to peak tells you whether your kernels are getting near the tensor cores at all.
Comparing inference stacks: use MBU (model bandwidth utilization) for single-stream latency or tokens per second per dollar for throughput. Achieved FLOPS hides whether you are even on the right roofline.
Reporting cost/efficiency in a paper: report MFU and the absolute achieved FLOPS, so readers can sanity-check both your model accounting and your hardware assumptions.

5. IsoFLOP: holding compute constant to find what to vary

IsoFLOP is not a unit — it is the protocol that made Chinchilla famous (Hoffmann et al., 2022). The question it answers: given a fixed compute budget $𝐶$ , what allocation of parameters $𝑁$ and tokens $𝐷$ minimizes loss?

The procedure:

Pick several compute budgets $𝐶_{1} < 𝐶_{2} < \dots < 𝐶_{𝑘}$ (each spanning roughly one order of magnitude).
At each $𝐶_{𝑖}$ , train multiple models with different $(𝑁, 𝐷)$ pairs all satisfying $6 𝑁 𝐷 = 𝐶_{𝑖}$ .
Plot final loss as a function of $𝑁$ (or equivalently $𝐷 = \frac{𝐶_{𝑖}}{6 𝑁}$ ). This is an IsoFLOP curve: each curve corresponds to one compute budget and traces out the loss-vs-allocation trade.
Find the minimum of each curve; call this $𝑁^{⋆} (𝐶_{𝑖})$ .
Fit a power law $𝑁^{⋆} \propto 𝐶^{𝑎}$ across compute budgets. Chinchilla found $𝑎 \approx 0.5$ , implying $𝑁^{⋆} \propto 𝐷^{⋆}$ — model size and tokens should scale equally as compute grows.
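The procedure above can be sketched end to end on synthetic data. Everything numeric here is an assumption for illustration: the loss function is a made-up stand-in for real training runs (with symmetric exponents chosen so the optimum is known to be $𝑁^{⋆} = \sqrt{𝐶 / 6}$ ), and the minimum is found by grid search rather than by fitting a curve to noisy measurements.

```python
import math


def synthetic_loss(n_params: float, n_tokens: float) -> float:
    """Stand-in for a trained model's final loss. NOT fitted to any real data."""
    return 1.7 + 400.0 / n_params**0.3 + 400.0 / n_tokens**0.3


def isoflop_curve(budget: float, log10_n_grid):
    """All (N, D, loss) points on one IsoFLOP curve, with D = C / (6 N)."""
    points = []
    for x in log10_n_grid:
        n = 10.0**x
        d = budget / (6.0 * n)
        points.append((n, d, synthetic_loss(n, d)))
    return points


def fit_exponent(budgets, n_opts) -> float:
    """Least-squares slope of log10(N*) against log10(C)."""
    xs = [math.log10(c) for c in budgets]
    ys = [math.log10(n) for n in n_opts]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den


budgets = [1e18, 1e19, 1e20, 1e21]
grid = [7.0 + 0.05 * i for i in range(101)]   # N from 1e7 to 1e12

# Minimum of each IsoFLOP curve, then the power-law exponent across budgets.
n_opts = [min(isoflop_curve(c, grid), key=lambda pt: pt[2])[0] for c in budgets]
a = fit_exponent(budgets, n_opts)
print(f"fitted exponent a = {a:.3f}")
```

With real runs, each grid point is a separate training job and the losses are noisy, so the minimum is usually estimated by fitting a smooth curve through the points of each budget instead of taking the raw argmin.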

This was a sharp departure from the earlier Kaplan scaling laws, which had recommended scaling parameters about three times faster than data. The difference came from a methodological choice: Kaplan held the learning-rate schedule fixed and varied $𝑁$ with $𝐷$ very large; Chinchilla's IsoFLOP design varied both jointly under fixed compute, which is the actual constraint a training team faces.

IsoFLOP is now standard for any scaling claim. A modern paper will report IsoFLOP frontiers showing how a new architecture, optimizer, or data mixture moves the loss-vs-compute curve down — a comparison that is only meaningful when the FLOPs axis really is held constant.
