KV Cache Size Estimator
A calculator readout for the autoregressive KV cache. Six inputs — layers, KV heads, head dim, sequence length, batch size, bytes per element — drive a single headline X GiB total. A breakdown table beneath splits the total into its K and V halves so the leading 2 in the formula 2 * layers * kvHeads * seqLen * headDim * batchSize * bpe reads as the literal K cache + V cache it represents.
Where KVCacheBarViz plots the same formula as a horizontal bar against MHA / GQA / MQA comparisons, this primitive is the number itself: a large readout whose inputs you can scrub to feel each multiplier.
KV cache size estimator. 32 layers, 32 KV heads of 128 dim, 4K tokens, batch 1, 2 bytes per element. K cache 1.00 GiB, V cache 1.00 GiB, total 2.00 GiB.
KV cache size: 2 · 32 L · 32 H · 4K S · 128 D · 1 B · 2b
| tensor | shape | bytes |
|---|---|---|
| K cache | 32L × 32H × 4KS × 128D | 1.00 GiB |
| V cache | 32L × 32H × 4KS × 128D | 1.00 GiB |
| total | K + V (× batch 1 · 2 B) | 2.00 GiB |
Installation
npx shadcn@latest add https://craftbits.dev/r/kv-cache-size-estimator.json
Usage
import { KVCacheSizeEstimator } from "@craft-bits/core";
<KVCacheSizeEstimator
  defaultNumLayers={32}
  defaultNumHeads={32}
  defaultHeadDim={128}
  defaultSeqLen={4096}
  defaultBatchSize={1}
  bytesPerElement={2}
/>
Drive every input from outside (controlled):
<KVCacheSizeEstimator
  numLayers={layers}
  onNumLayersChange={setLayers}
  numHeads={heads}
  onNumHeadsChange={setHeads}
  headDim={dim}
  onHeadDimChange={setDim}
  seqLen={tokens}
  onSeqLenChange={setTokens}
  batchSize={batch}
  onBatchSizeChange={setBatch}
/>
Hide the breakdown table when you only need the headline number:
<KVCacheSizeEstimator showBreakdown={false} />
Understanding the component
- One formula. Total bytes equal 2 * layers * kvHeads * seqLen * headDim * batchSize * bytesPerElement. The leading 2 covers both K and V tensors. Every slider in the panel maps to exactly one term in the product, so the relationship between input and output is direct.
- Headline readout. A single X GiB figure dominates the card. On each new total, the readout fades in (SPRINGS.smooth) so the eye lands on the change without scrubbing every digit; prefers-reduced-motion: reduce collapses the fade to an instant swap.
- K / V / total breakdown. Beneath the readout, three rows (K cache, V cache, total) show the shape L x H x S x D and the byte count of each tensor. The K and V rows are deliberately identical in shape and size: that is the point the breakdown surfaces.
- Controlled or uncontrolled. Every slider supports the Radix pattern: pass value plus onValueChange (e.g. numLayers plus onNumLayersChange) for controlled mode, or defaultValue (defaultNumLayers) for uncontrolled.
- Sliders. Each slider is a LabeledSlider, a native <input type="range">, so the full aria-valuemin / max / now / text keyboard plus screen-reader story comes for free.
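The formula in the list above can be sketched as a small pure function. This is a minimal illustration, not the component's actual source: the `KVCacheInputs` interface and the helpers `tensorBytes`, `kvCacheBytes`, and `formatGiB` are hypothetical names chosen here, with fields that mirror the documented props.

```typescript
// Hypothetical input shape; field names mirror the documented props.
interface KVCacheInputs {
  numLayers: number;
  numHeads: number; // KV heads, not query heads
  headDim: number;
  seqLen: number;
  batchSize: number;
  bytesPerElement: number;
}

const GIB = 1024 ** 3;

// Bytes held by ONE tensor (K or V) across all layers and the whole batch.
function tensorBytes(i: KVCacheInputs): number {
  return (
    i.numLayers * i.numHeads * i.seqLen * i.headDim * i.batchSize * i.bytesPerElement
  );
}

// Total cache size: the leading 2 is literally K cache + V cache.
function kvCacheBytes(i: KVCacheInputs): number {
  return 2 * tensorBytes(i);
}

// Format a byte count as GiB with two decimals (assumed display format).
function formatGiB(bytes: number): string {
  return `${(bytes / GIB).toFixed(2)} GiB`;
}

// The documented defaults: 32 layers, 32 KV heads, 128 dim, 4096 tokens, batch 1, fp16.
const defaults: KVCacheInputs = {
  numLayers: 32,
  numHeads: 32,
  headDim: 128,
  seqLen: 4096,
  batchSize: 1,
  bytesPerElement: 2,
};

console.log(formatGiB(tensorBytes(defaults))); // "1.00 GiB" (K or V alone)
console.log(formatGiB(kvCacheBytes(defaults))); // "2.00 GiB" (K + V)
```

Because every input is a plain factor in the product, doubling any one of them doubles the total, which is the behavior the sliders are meant to make tangible.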
Props
| Prop | Type | Default | Description |
|---|---|---|---|
| numLayers | number | — | Controlled layer count. |
| defaultNumLayers | number | 32 | Uncontrolled initial layer count. |
| onNumLayersChange | (numLayers: number) => void | — | Fires when the slider commits. |
| numHeads | number | — | Controlled KV-head count. For MHA equals query heads; for GQA equals groups; for MQA equals 1. |
| defaultNumHeads | number | 32 | Uncontrolled initial KV-head count. |
| onNumHeadsChange | (numHeads: number) => void | — | Fires when the slider commits. |
| headDim | number | — | Controlled per-head dimension d_k. |
| defaultHeadDim | number | 128 | Uncontrolled initial per-head dimension. |
| onHeadDimChange | (headDim: number) => void | — | Fires when the slider commits. |
| seqLen | number | — | Controlled sequence length in tokens. |
| defaultSeqLen | number | 4096 | Uncontrolled initial sequence length. |
| onSeqLenChange | (seqLen: number) => void | — | Fires when the slider commits. |
| batchSize | number | — | Controlled batch size. |
| defaultBatchSize | number | 1 | Uncontrolled initial batch size. |
| onBatchSizeChange | (batchSize: number) => void | — | Fires when the slider commits. |
| bytesPerElement | number | 2 | 2 for fp16/bf16, 4 for fp32, 1 for int8. |
| showBreakdown | boolean | true | Render the K / V / total breakdown table beneath the headline readout. |
| numLayersMin | number | 1 | Minimum layer count the slider allows. |
| numLayersMax | number | 128 | Maximum layer count the slider allows. |
| numHeadsMin | number | 1 | Minimum KV-head count the slider allows. |
| numHeadsMax | number | 128 | Maximum KV-head count the slider allows. |
| headDimMin | number | 32 | Minimum head dimension the slider allows. |
| headDimMax | number | 256 | Maximum head dimension the slider allows. |
| seqLenMin | number | 256 | Minimum sequence length the slider allows. |
| seqLenMax | number | 131072 | Maximum sequence length the slider allows. |
| batchSizeMin | number | 1 | Minimum batch size the slider allows. |
| batchSizeMax | number | 64 | Maximum batch size the slider allows. |
| transition | Transition | SPRINGS.smooth | Spring used for the headline-readout fade. |
| className | string | — | Merged onto the root via cn(). |
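The numHeads row notes that the KV-head count depends on the attention variant. A rough sketch of what that means for the total, holding the other documented defaults fixed: the helper name `cacheBytesForHeads` is hypothetical, and the choice of 8 GQA groups is an assumption made purely for illustration.

```typescript
// Hypothetical helper: total K + V bytes for a given KV-head count,
// with the other inputs fixed at the documented defaults.
function cacheBytesForHeads(
  kvHeads: number,
  layers = 32,
  headDim = 128,
  seqLen = 4096,
  batchSize = 1,
  bytesPerElement = 2,
): number {
  return 2 * layers * kvHeads * seqLen * headDim * batchSize * bytesPerElement;
}

const QUERY_HEADS = 32; // matches defaultNumHeads
const GQA_GROUPS = 8; // assumed group count, for illustration only

// The value to pass as numHeads for each attention variant.
const variants: Array<[string, number]> = [
  ["MHA", QUERY_HEADS], // one KV head per query head
  ["GQA", GQA_GROUPS], // one KV head per group
  ["MQA", 1], // a single shared KV head
];

const MIB = 1024 ** 2;
for (const [name, kvHeads] of variants) {
  console.log(`${name}: ${kvHeads} KV heads -> ${cacheBytesForHeads(kvHeads) / MIB} MiB`);
}
// MHA: 32 KV heads -> 2048 MiB
// GQA: 8 KV heads -> 512 MiB
// MQA: 1 KV heads -> 64 MiB
```

The cache shrinks in direct proportion to the KV-head count, so the value passed as numHeads should be the number of KV heads, not the number of query heads.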
Accessibility
- The figure is role="figure" with a hidden summary listing every input and the K / V / total byte counts, so screen readers hear the whole story whenever props change.
- The breakdown is a role="table" with columnheader and cell roles so AT users walk it as a real table.
- Every slider is a native <input type="range"> via the library's LabeledSlider with aria-valuemin / aria-valuemax / aria-valuenow / aria-valuetext: full keyboard plus screen-reader semantics for free.
- The headline readout is aria-hidden="true" because the canonical announcement lives in the figure's summary; otherwise screen readers would announce the total twice on every drag.
- Colour is never the only signal: the breakdown's accent tone on the total row pairs with font-medium and the explicit total row label.
- Motion respects prefers-reduced-motion: reduce.
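The hidden summary described in the first bullet could be assembled along these lines. This is only a sketch: `buildSummary` and `formatTokens` are hypothetical names, and the wording is modeled on the example summary shown near the top of this page rather than taken from the component's source.

```typescript
const BYTES_PER_GIB = 1024 ** 3;

// Hypothetical: render whole multiples of 1024 as "4K", anything else as a plain number.
function formatTokens(seqLen: number): string {
  return seqLen >= 1024 && seqLen % 1024 === 0 ? `${seqLen / 1024}K` : String(seqLen);
}

// Hypothetical: build the screen-reader summary from the six inputs.
function buildSummary(
  layers: number,
  kvHeads: number,
  headDim: number,
  seqLen: number,
  batchSize: number,
  bytesPerElement: number,
): string {
  // Bytes in one tensor (K or V); the total is twice this.
  const perTensor = layers * kvHeads * seqLen * headDim * batchSize * bytesPerElement;
  const gib = (bytes: number): string => `${(bytes / BYTES_PER_GIB).toFixed(2)} GiB`;
  return (
    `KV cache size estimator. ${layers} layers, ${kvHeads} KV heads of ${headDim} dim, ` +
    `${formatTokens(seqLen)} tokens, batch ${batchSize}, ${bytesPerElement} bytes per element. ` +
    `K cache ${gib(perTensor)}, V cache ${gib(perTensor)}, total ${gib(2 * perTensor)}.`
  );
}

console.log(buildSummary(32, 32, 128, 4096, 1, 2));
// KV cache size estimator. 32 layers, 32 KV heads of 128 dim, 4K tokens, batch 1,
// 2 bytes per element. K cache 1.00 GiB, V cache 1.00 GiB, total 2.00 GiB.
```

Keeping the announcement in one string makes it straightforward to leave the visual readout aria-hidden without losing any information for screen-reader users.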
Credits
- Extracted from: craftingattention (app/src/lessons/primitives/nn/KVCacheSizeEstimator.tsx). The source paired the calculator with a stacked GPU memory bar (weights plus KV cache plus headroom), three model presets (GPT-2, LLaMA-7B, LLaMA-70B), three GPU configs (4090, A100, H100), a four-phase narration state machine (observe / sizing / overflow / insight), a per-factor "relative contribution" mini-bar chart, an overflow-pulse animation, a breathing-pulse hint, and discrete-step sliders. The library extract is the pure calculator (six inputs in, K / V / total bytes out) with controlled / uncontrolled APIs on every slider and the headline fade as the only motion. Presets, GPU comparisons, narration, and the contribution chart belong in the consuming lesson, not the primitive. The sibling KVCacheBarViz covers the bar-chart presentation of the same formula.