KV Cache Size Estimator

A calculator readout for the autoregressive KV cache. Six inputs — layers, KV heads, head dim, sequence length, batch size, bytes per element — drive a single headline X GiB total. A breakdown table beneath splits the total into its K and V halves so the leading 2 in the formula 2 * layers * kvHeads * seqLen * headDim * batchSize * bpe reads as the literal K cache + V cache it represents.
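The formula can be sketched in a few lines of TypeScript. This is a minimal illustration, not the component's source: the `KVCacheInputs` type and the `tensorBytes` and `kvCacheBreakdown` helpers are hypothetical names, with fields chosen to mirror the component's props.

```typescript
// Hypothetical input shape; field names mirror the component's props.
interface KVCacheInputs {
  numLayers: number;
  numHeads: number; // KV heads, not query heads
  headDim: number;
  seqLen: number;
  batchSize: number;
  bytesPerElement: number;
}

const GIB = 1024 ** 3;

// Bytes for one tensor (K or V):
// layers * kvHeads * seqLen * headDim * batchSize * bytesPerElement.
function tensorBytes(i: KVCacheInputs): number {
  return (
    i.numLayers * i.numHeads * i.seqLen * i.headDim * i.batchSize * i.bytesPerElement
  );
}

// K, V, and total bytes. The leading 2 in the formula is simply k + v.
function kvCacheBreakdown(i: KVCacheInputs) {
  const k = tensorBytes(i);
  const v = tensorBytes(i);
  return { k, v, total: k + v };
}

// The default configuration: 32 layers, 32 KV heads, d = 128, 4096 tokens, batch 1, fp16.
const defaults: KVCacheInputs = {
  numLayers: 32,
  numHeads: 32,
  headDim: 128,
  seqLen: 4096,
  batchSize: 1,
  bytesPerElement: 2,
};

const breakdown = kvCacheBreakdown(defaults);
console.log(
  (breakdown.k / GIB).toFixed(2),
  (breakdown.v / GIB).toFixed(2),
  (breakdown.total / GIB).toFixed(2),
); // 1.00 1.00 2.00
```

Each of K and V comes to 1,073,741,824 bytes for these inputs, which is exactly 1 GiB, so the total is 2 GiB.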

Where KVCacheBarViz plots the same formula as a horizontal bar against MHA / GQA / MQA comparisons, this primitive is the number itself: a large readout whose inputs you can scrub to feel how each multiplier moves the total.

KV cache size estimator. 32 layers, 32 KV heads of 128 dim, 4K tokens, batch 1, 2 bytes per element. K cache 1.00 GiB, V cache 1.00 GiB, total 2.00 GiB.
KV cache size: 2 · 32 L · 32 H · 4K S · 128 D · 1 B · 2 bytes

| tensor | shape | bytes |
| --- | --- | --- |
| K cache | 32L × 32H × 4KS × 128D | 1.00 GiB |
| V cache | 32L × 32H × 4KS × 128D | 1.00 GiB |
| total | K + V (× batch 1 · 2 bytes per element) | 2.00 GiB |

Installation

npx shadcn@latest add https://craftbits.dev/r/kv-cache-size-estimator.json

Usage

import { KVCacheSizeEstimator } from "@craft-bits/core";
 
<KVCacheSizeEstimator
  defaultNumLayers={32}
  defaultNumHeads={32}
  defaultHeadDim={128}
  defaultSeqLen={4096}
  defaultBatchSize={1}
  bytesPerElement={2}
/>

Drive every input from outside (controlled):

<KVCacheSizeEstimator
  numLayers={layers}
  onNumLayersChange={setLayers}
  numHeads={heads}
  onNumHeadsChange={setHeads}
  headDim={dim}
  onHeadDimChange={setDim}
  seqLen={tokens}
  onSeqLenChange={setTokens}
  batchSize={batch}
  onBatchSizeChange={setBatch}
/>

Hide the breakdown table when you only need the headline number:

<KVCacheSizeEstimator showBreakdown={false} />

Understanding the component

  1. One formula. Total bytes equal 2 * layers * kvHeads * seqLen * headDim * batchSize * bytesPerElement, summed across the whole batch. The leading 2 covers both the K and V tensors. Every slider in the panel maps to exactly one term in the product, so each input scales the output linearly.
  2. Headline readout. A single X GiB figure dominates the card. On each new total, the readout fades in (SPRINGS.smooth) so the eye lands on the change without re-reading every digit; prefers-reduced-motion: reduce collapses the fade to an instant swap.
  3. K / V / total breakdown. Beneath the readout, three rows (K cache, V cache, total) show the shape L × H × S × D and the byte count of each tensor. The K and V rows match in shape and size on purpose: the leading 2 in the formula is exactly this pair, and that is the point the breakdown surfaces.
  4. Controlled or uncontrolled. Every slider input follows the Radix-style pattern: pass the value prop plus its change handler (e.g. numLayers plus onNumLayersChange) for controlled mode, or the default prop (e.g. defaultNumLayers) for uncontrolled mode.
  5. Sliders. Each slider is a LabeledSlider wrapping a native <input type="range">, so keyboard operation and the aria-valuemin / aria-valuemax / aria-valuenow / aria-valuetext semantics come for free.
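The controlled / uncontrolled convention from item 4 can be sketched without any framework. The `createControllable` helper below is hypothetical and is not the library's implementation; in a React component the same logic would normally live in a state hook. It only shows which side owns the value in each mode.

```typescript
// Hypothetical helper, not the library's internals: resolves one slider's
// value under the controlled / uncontrolled convention.
interface ControllableOptions {
  value?: number; // controlled value, e.g. numLayers
  defaultValue: number; // uncontrolled initial value, e.g. defaultNumLayers
  onChange?: (next: number) => void; // e.g. onNumLayersChange
}

function createControllable(opts: ControllableOptions) {
  let internal = opts.defaultValue;
  const isControlled = opts.value !== undefined;

  return {
    // Controlled mode always reads the prop; uncontrolled reads internal state.
    get: (): number => (isControlled ? (opts.value as number) : internal),
    set: (next: number): void => {
      // In controlled mode the parent owns the value, so nothing is stored here.
      if (!isControlled) internal = next;
      // The change handler fires in both modes.
      if (opts.onChange) opts.onChange(next);
    },
  };
}

// Uncontrolled: the helper stores the new value itself.
const uncontrolled = createControllable({ defaultValue: 32 });
uncontrolled.set(48);

// Controlled: the helper only reports the change; the value stays at the
// prop until the parent passes a new one.
const seen: number[] = [];
const controlled = createControllable({
  value: 32,
  defaultValue: 32,
  onChange: (n) => {
    seen.push(n);
  },
});
controlled.set(48);

console.log(uncontrolled.get(), controlled.get(), seen); // 48 32 [ 48 ]
```

The design consequence is that a controlled slider without a change handler is effectively read-only, which is why each controlled prop is paired with its handler in the usage example.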

Props

| Prop | Type | Default | Description |
| --- | --- | --- | --- |
| numLayers | number | | Controlled layer count. |
| defaultNumLayers | number | 32 | Uncontrolled initial layer count. |
| onNumLayersChange | (numLayers: number) => void | | Fires when the slider commits. |
| numHeads | number | | Controlled KV-head count. Equals the query-head count for MHA, the number of KV groups for GQA, and 1 for MQA. |
| defaultNumHeads | number | 32 | Uncontrolled initial KV-head count. |
| onNumHeadsChange | (numHeads: number) => void | | Fires when the slider commits. |
| headDim | number | | Controlled per-head dimension d_k. |
| defaultHeadDim | number | 128 | Uncontrolled initial per-head dimension. |
| onHeadDimChange | (headDim: number) => void | | Fires when the slider commits. |
| seqLen | number | | Controlled sequence length in tokens. |
| defaultSeqLen | number | 4096 | Uncontrolled initial sequence length. |
| onSeqLenChange | (seqLen: number) => void | | Fires when the slider commits. |
| batchSize | number | | Controlled batch size. |
| defaultBatchSize | number | 1 | Uncontrolled initial batch size. |
| onBatchSizeChange | (batchSize: number) => void | | Fires when the slider commits. |
| bytesPerElement | number | 2 | 2 for fp16/bf16, 4 for fp32, 1 for int8. |
| showBreakdown | boolean | true | Render the K / V / total breakdown table beneath the headline readout. |
| numLayersMin | number | 1 | Minimum layer count the slider allows. |
| numLayersMax | number | 128 | Maximum layer count the slider allows. |
| numHeadsMin | number | 1 | Minimum KV-head count the slider allows. |
| numHeadsMax | number | 128 | Maximum KV-head count the slider allows. |
| headDimMin | number | 32 | Minimum head dimension the slider allows. |
| headDimMax | number | 256 | Maximum head dimension the slider allows. |
| seqLenMin | number | 256 | Minimum sequence length the slider allows. |
| seqLenMax | number | 131072 | Maximum sequence length the slider allows. |
| batchSizeMin | number | 1 | Minimum batch size the slider allows. |
| batchSizeMax | number | 64 | Maximum batch size the slider allows. |
| transition | Transition | SPRINGS.smooth | Spring used for the headline-readout fade. |
| className | string | | Merged onto the root via cn(). |
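To see how numHeads and bytesPerElement move the total, the sketch below evaluates the formula for three KV-head counts. The configuration is illustrative: a model with 32 query heads is assumed, and the choice of 8 KV groups for GQA is an assumption made for this example. The `KVShape` type and `kvCacheTotalBytes` function are hypothetical names, not library exports.

```typescript
// Field names mirror the component's props; the type itself is hypothetical.
interface KVShape {
  numLayers: number;
  numHeads: number; // KV heads
  headDim: number;
  seqLen: number;
  batchSize: number;
  bytesPerElement: number;
}

const GIB = 1024 ** 3;

// 2 * layers * kvHeads * seqLen * headDim * batchSize * bytesPerElement
function kvCacheTotalBytes(s: KVShape): number {
  return (
    2 * s.numLayers * s.numHeads * s.seqLen * s.headDim * s.batchSize * s.bytesPerElement
  );
}

// Illustrative base: 32 layers, d_k = 128, 4096 tokens, batch 1, fp16.
const base = { numLayers: 32, headDim: 128, seqLen: 4096, batchSize: 1, bytesPerElement: 2 };

// Assumed head counts for a model with 32 query heads.
const variants = [
  { label: "MHA", numHeads: 32 }, // KV heads equal query heads
  { label: "GQA", numHeads: 8 }, // 8 KV groups (an assumption for this example)
  { label: "MQA", numHeads: 1 }, // a single shared KV head
];

const sizesGiB = variants.map((v) => ({
  label: v.label,
  gib: kvCacheTotalBytes({ ...base, numHeads: v.numHeads }) / GIB,
}));

for (const row of sizesGiB) {
  console.log(`${row.label}: ${row.gib} GiB`);
}
// MHA: 2 GiB
// GQA: 0.5 GiB
// MQA: 0.0625 GiB

// Doubling bytesPerElement (fp16 to fp32) doubles the MHA total.
const fp32GiB = kvCacheTotalBytes({ ...base, numHeads: 32, bytesPerElement: 4 }) / GIB;
console.log(fp32GiB); // 4
```

Because every term is a plain multiplier, cutting KV heads from 32 to 8 cuts the cache by the same factor of 4.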

Accessibility

  • The figure is role="figure" with a hidden summary listing every input and the K / V / total byte counts — screen readers hear the whole story whenever props change.
  • The breakdown is a role="table" with columnheader and cell roles so AT users walk it as a real table.
  • Every slider is a native <input type="range"> via the library's LabeledSlider with aria-valuemin / aria-valuemax / aria-valuenow / aria-valuetext — full keyboard plus screen-reader semantics for free.
  • The headline readout is aria-hidden="true" because the canonical announcement lives in the figure's summary; otherwise screen readers would announce the total twice on every drag.
  • Colour is never the only signal — the breakdown's accent tone on the total row pairs with font-medium and the explicit total row label.
  • Motion respects prefers-reduced-motion: reduce.
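One way the hidden summary could be assembled is sketched below. The `buildSummary`, `formatTokens`, and `formatGiB` helpers are hypothetical, and the wording is modelled on the summary the demo renders for the default inputs; the component's actual string builder may differ.

```typescript
// Hypothetical input shape; field names mirror the component's props.
interface SummaryInputs {
  numLayers: number;
  numHeads: number;
  headDim: number;
  seqLen: number;
  batchSize: number;
  bytesPerElement: number;
}

const GIB = 1024 ** 3;

// Assumption: multiples of 1024 get a K suffix (4096 -> "4K"),
// everything else prints as a plain number.
function formatTokens(n: number): string {
  return n >= 1024 && n % 1024 === 0 ? `${n / 1024}K` : String(n);
}

// Two decimal places, matching the readout in the demo.
function formatGiB(bytes: number): string {
  return `${(bytes / GIB).toFixed(2)} GiB`;
}

// Builds one sentence-style summary covering every input and all three byte counts.
function buildSummary(i: SummaryInputs): string {
  const perTensor =
    i.numLayers * i.numHeads * i.seqLen * i.headDim * i.batchSize * i.bytesPerElement;
  return [
    "KV cache size estimator.",
    `${i.numLayers} layers, ${i.numHeads} KV heads of ${i.headDim} dim, ` +
      `${formatTokens(i.seqLen)} tokens, batch ${i.batchSize}, ` +
      `${i.bytesPerElement} bytes per element.`,
    `K cache ${formatGiB(perTensor)}, V cache ${formatGiB(perTensor)}, ` +
      `total ${formatGiB(2 * perTensor)}.`,
  ].join(" ");
}

const summary = buildSummary({
  numLayers: 32,
  numHeads: 32,
  headDim: 128,
  seqLen: 4096,
  batchSize: 1,
  bytesPerElement: 2,
});
console.log(summary);
```

Keeping the announcement in one string like this is what allows the visual headline to be aria-hidden without losing information for assistive technology.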

Credits

  • Extracted from: craftingattention (app/src/lessons/primitives/nn/KVCacheSizeEstimator.tsx). The source paired the calculator with a stacked GPU memory bar (weights plus KV cache plus headroom), three model presets (GPT-2, LLaMA-7B, LLaMA-70B), three GPU configs (4090, A100, H100), a four-phase narration state machine (observe / sizing / overflow / insight), a per-factor "relative contribution" mini-bar chart, an overflow-pulse animation, a breathing-pulse hint, and discrete-step sliders. The library extract is the pure calculator — six inputs in, K / V / total bytes out — with controlled / uncontrolled APIs on every slider and the headline fade as the only motion. Presets, GPU comparisons, narration, and the contribution chart belong in the consuming lesson, not the primitive. The sibling KVCacheBarViz covers the bar-chart presentation of the same formula.