KV Cache Size Estimator

A calculator readout for the autoregressive KV cache. Six inputs — layers, KV heads, head dim, sequence length, batch size, bytes per element — drive a single headline X GiB total. A breakdown table beneath splits the total into its K and V halves so the leading 2 in the formula 2 * layers * kvHeads * seqLen * headDim * batchSize * bpe reads as the literal K cache + V cache it represents.

Where KVCacheBarViz plots the same formula as a horizontal bar for MHA / GQA / MQA comparisons, this primitive is the number itself: a large readout whose inputs you can scrub to feel how each multiplier moves the total.

KV cache size: 2 · 32 L · 32 H · 4K S · 128 D · 1 B · 2b

tensor	shape	bytes
K cache	32L × 32H × 4KS × 128D	1.00 GiB
V cache	32L × 32H × 4KS × 128D	1.00 GiB
total	K + V (× batch 1 · 2 B)	2.00 GiB

Installation

npx shadcn@latest add https://craftbits.dev/r/kv-cache-size-estimator.json

Usage

import { KVCacheSizeEstimator } from "@craft-bits/core";
 
<KVCacheSizeEstimator
  defaultNumLayers={32}
  defaultNumHeads={32}
  defaultHeadDim={128}
  defaultSeqLen={4096}
  defaultBatchSize={1}
  bytesPerElement={2}
/>

Drive every input from outside (controlled):

<KVCacheSizeEstimator
  numLayers={layers}
  onNumLayersChange={setLayers}
  numHeads={heads}
  onNumHeadsChange={setHeads}
  headDim={dim}
  onHeadDimChange={setDim}
  seqLen={tokens}
  onSeqLenChange={setTokens}
  batchSize={batch}
  onBatchSizeChange={setBatch}
/>
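
As a rough model of what the controlled props above imply, the sketch below resolves one input's value outside React. The names `createControllable`, `value`, `defaultValue`, and `onChange` are hypothetical stand-ins (for example for `numLayers`, `defaultNumLayers`, and `onNumLayersChange`); the real component presumably does this with React state.

```typescript
// Hypothetical model of a single input; not the library's implementation.
interface ControllableOptions {
  value?: number; // stands in for e.g. numLayers
  defaultValue: number; // stands in for e.g. defaultNumLayers
  onChange?: (next: number) => void; // stands in for e.g. onNumLayersChange
}

function createControllable(options: ControllableOptions) {
  // An input counts as controlled when a value prop is supplied.
  const isControlled = options.value !== undefined;
  let internal = options.defaultValue;

  return {
    isControlled,
    // Controlled: the supplied value is the source of truth.
    // Uncontrolled: the internally stored value is.
    get: (): number => options.value ?? internal,
    set: (next: number): void => {
      if (!isControlled) {
        internal = next;
      }
      // The handler fires in both modes so the parent can react.
      options.onChange?.(next);
    },
  };
}

// Uncontrolled: the input remembers the new value itself.
const uncontrolledLayers = createControllable({ defaultValue: 32 });
uncontrolledLayers.set(48);

// Controlled: the value only changes when the parent passes a new one.
const controlledLayers = createControllable({ value: 32, defaultValue: 32 });
controlledLayers.set(48);
```

In the controlled case the displayed value stays at 32 until the owner feeds a new `value` back in, which is why the change handler has to update the owner's state.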

Hide the breakdown table when you only need the headline number:

<KVCacheSizeEstimator showBreakdown={false} />

Understanding the component

One formula. Total bytes equal 2 * layers * kvHeads * seqLen * headDim * batchSize * bytesPerElement. The leading 2 covers both the K and V tensors. Every slider in the panel maps to exactly one term in the product, so each input scales the total linearly: doubling any one of them doubles the cache.
Headline readout. A single X GiB figure dominates the card. On each new total, the readout fades in (SPRINGS.smooth) so the eye lands on the change without scrubbing every digit; prefers-reduced-motion: reduce collapses the fade to an instant swap.
K / V / total breakdown. Beneath the readout, three rows (K cache, V cache, total) show the shape L × H × S × D and the byte count of each tensor. K and V hold different values but have identical shapes, so their byte counts always match; surfacing that symmetry is the point of the breakdown.
Controlled or uncontrolled. Every slider input supports the Radix pattern: pass a value plus its change handler (e.g. numLayers plus onNumLayersChange) for controlled mode, or a default value (e.g. defaultNumLayers) for uncontrolled mode.
Sliders. Each slider is a LabeledSlider built on a native <input type="range">, so keyboard operation and the aria-valuemin / aria-valuemax / aria-valuenow / aria-valuetext semantics come with it.
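
The formula in the first point can be written out as a small function. This is a minimal sketch rather than the component's source: `kvCacheBytes`, `formatGiB`, and the `KVCacheShape` type are hypothetical names, and the two-decimal GiB formatting is an assumption based on the 1.00 GiB / 2.00 GiB figures in the demo readout.

```typescript
// Hypothetical helper names; not part of the component's published API.
interface KVCacheShape {
  numLayers: number;
  numHeads: number; // KV heads, not query heads
  headDim: number;
  seqLen: number; // tokens
  batchSize: number;
  bytesPerElement: number;
}

interface KVCacheBytes {
  k: number;
  v: number;
  total: number;
}

const BYTES_PER_GIB = 1024 * 1024 * 1024;

function kvCacheBytes(shape: KVCacheShape): KVCacheBytes {
  // Size of one tensor (K or V) across the whole batch.
  const perTensor =
    shape.numLayers *
    shape.numHeads *
    shape.seqLen *
    shape.headDim *
    shape.batchSize *
    shape.bytesPerElement;
  // The leading 2 in the formula is simply K + V.
  return { k: perTensor, v: perTensor, total: 2 * perTensor };
}

function formatGiB(bytes: number): string {
  // Assumed display format: binary gibibytes with two decimals.
  return `${(bytes / BYTES_PER_GIB).toFixed(2)} GiB`;
}

// Default inputs from the Props table.
const defaultSize = kvCacheBytes({
  numLayers: 32,
  numHeads: 32,
  headDim: 128,
  seqLen: 4096,
  batchSize: 1,
  bytesPerElement: 2,
});
// formatGiB(defaultSize.k) → "1.00 GiB", formatGiB(defaultSize.total) → "2.00 GiB"
```

With the defaults, one tensor is 32 × 32 × 4096 × 128 × 1 × 2 = 1,073,741,824 bytes, exactly 1 GiB, so the K + V total is 2 GiB.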

Props

Prop	Type	Default	Description
`numLayers`	`number`	—	Controlled layer count.
`defaultNumLayers`	`number`	`32`	Uncontrolled initial layer count.
`onNumLayersChange`	`(numLayers: number) => void`	—	Fires when the slider commits.
`numHeads`	`number`	—	Controlled KV-head count: the query-head count for MHA, the number of KV groups for GQA, 1 for MQA.
`defaultNumHeads`	`number`	`32`	Uncontrolled initial KV-head count.
`onNumHeadsChange`	`(numHeads: number) => void`	—	Fires when the slider commits.
`headDim`	`number`	—	Controlled per-head dimension `d_k`.
`defaultHeadDim`	`number`	`128`	Uncontrolled initial per-head dimension.
`onHeadDimChange`	`(headDim: number) => void`	—	Fires when the slider commits.
`seqLen`	`number`	—	Controlled sequence length in tokens.
`defaultSeqLen`	`number`	`4096`	Uncontrolled initial sequence length.
`onSeqLenChange`	`(seqLen: number) => void`	—	Fires when the slider commits.
`batchSize`	`number`	—	Controlled batch size.
`defaultBatchSize`	`number`	`1`	Uncontrolled initial batch size.
`onBatchSizeChange`	`(batchSize: number) => void`	—	Fires when the slider commits.
`bytesPerElement`	`number`	`2`	Bytes per cached element: 2 for fp16/bf16, 4 for fp32, 1 for int8.
`showBreakdown`	`boolean`	`true`	Render the K / V / total breakdown table beneath the headline readout.
`numLayersMin`	`number`	`1`	Minimum layer count the slider allows.
`numLayersMax`	`number`	`128`	Maximum layer count the slider allows.
`numHeadsMin`	`number`	`1`	Minimum KV-head count the slider allows.
`numHeadsMax`	`number`	`128`	Maximum KV-head count the slider allows.
`headDimMin`	`number`	`32`	Minimum head dimension the slider allows.
`headDimMax`	`number`	`256`	Maximum head dimension the slider allows.
`seqLenMin`	`number`	`256`	Minimum sequence length the slider allows.
`seqLenMax`	`number`	`131072`	Maximum sequence length the slider allows.
`batchSizeMin`	`number`	`1`	Minimum batch size the slider allows.
`batchSizeMax`	`number`	`64`	Maximum batch size the slider allows.
`transition`	`Transition`	`SPRINGS.smooth`	Spring used for the headline-readout fade.
`className`	`string`	—	Merged onto the root via `cn()`.
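
To make the `numHeads` and `bytesPerElement` rows concrete, the sketch below picks values for each attention variant and dtype and compares the resulting cache sizes. The helpers `kvHeadsFor` and `estimateBytes`, and the example of 32 query heads shared across 8 KV groups, are illustrative assumptions rather than anything the library exports.

```typescript
// Hypothetical helpers for choosing prop values; not exported by the library.
type AttentionVariant = "MHA" | "GQA" | "MQA";
type CacheDtype = "fp32" | "fp16" | "bf16" | "int8";

// Byte widths as listed in the bytesPerElement row.
const BYTES_PER_ELEMENT: Record<CacheDtype, number> = {
  fp32: 4,
  fp16: 2,
  bf16: 2,
  int8: 1,
};

// KV-head count to pass as numHeads for each attention variant.
function kvHeadsFor(variant: AttentionVariant, queryHeads: number, kvGroups = 1): number {
  switch (variant) {
    case "MHA":
      return queryHeads; // every query head has its own K/V head
    case "GQA":
      return kvGroups; // query heads share K/V within a group
    case "MQA":
      return 1; // all query heads share one K/V head
  }
}

// Same product the estimator displays.
function estimateBytes(
  numLayers: number,
  numHeads: number,
  headDim: number,
  seqLen: number,
  batchSize: number,
  dtype: CacheDtype,
): number {
  return 2 * numLayers * numHeads * seqLen * headDim * batchSize * BYTES_PER_ELEMENT[dtype];
}

const MIB = 1024 * 1024;
const variants: AttentionVariant[] = ["MHA", "GQA", "MQA"];

// 32 layers, head dim 128, 4096 tokens, batch 1, fp16; 32 query heads, 8 KV groups.
const comparison = variants.map((variant) => ({
  variant,
  mib: estimateBytes(32, kvHeadsFor(variant, 32, 8), 128, 4096, 1, "fp16") / MIB,
}));
// comparison → MHA: 2048 MiB, GQA: 512 MiB, MQA: 64 MiB
```

Because the KV-head count is a plain multiplier, moving from 32 heads to 8 groups cuts the cache by 4×, and MQA cuts it by 32×.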

Accessibility

The figure is role="figure" with a hidden summary listing every input and the K / V / total byte counts — screen readers hear the whole story whenever props change.
The breakdown is a role="table" with columnheader and cell roles so AT users walk it as a real table.
Every slider is a native <input type="range"> via the library's LabeledSlider with aria-valuemin / aria-valuemax / aria-valuenow / aria-valuetext — full keyboard plus screen-reader semantics for free.
The headline readout is aria-hidden="true" because the canonical announcement lives in the figure's summary; otherwise screen readers would announce the total twice on every drag.
Colour is never the only signal — the breakdown's accent tone on the total row pairs with font-medium and the explicit total row label.
Motion respects prefers-reduced-motion: reduce.
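
For a sense of what the hidden summary and the slider value text might contain, here is a hedged sketch. The wording, the "4K tokens" abbreviation rule, and the names `formatTokens`, `formatGiBText`, and `buildSummary` are all assumptions; the component's actual strings are not documented here.

```typescript
// Hypothetical formatters; the component's real summary wording may differ.
function formatTokens(seqLen: number): string {
  // Whole multiples of 1024 read as "4K"; anything else is spelled out.
  return seqLen >= 1024 && seqLen % 1024 === 0
    ? `${seqLen / 1024}K tokens`
    : `${seqLen} tokens`;
}

function formatGiBText(bytes: number): string {
  return `${(bytes / (1024 * 1024 * 1024)).toFixed(2)} GiB`;
}

interface SummaryInputs {
  numLayers: number;
  numHeads: number;
  headDim: number;
  seqLen: number;
  batchSize: number;
  bytesPerElement: number;
}

// Builds one sentence pair: every input first, then K / V / total.
function buildSummary(i: SummaryInputs): string {
  const perTensor =
    i.numLayers * i.numHeads * i.seqLen * i.headDim * i.batchSize * i.bytesPerElement;
  const inputs =
    `KV cache estimate for ${i.numLayers} layers, ${i.numHeads} KV heads, ` +
    `head dim ${i.headDim}, ${formatTokens(i.seqLen)}, batch ${i.batchSize}, ` +
    `${i.bytesPerElement} bytes per element.`;
  const sizes =
    `K cache ${formatGiBText(perTensor)}, V cache ${formatGiBText(perTensor)}, ` +
    `total ${formatGiBText(2 * perTensor)}.`;
  return `${inputs} ${sizes}`;
}

const summaryText = buildSummary({
  numLayers: 32,
  numHeads: 32,
  headDim: 128,
  seqLen: 4096,
  batchSize: 1,
  bytesPerElement: 2,
});
```

Keeping the inputs and the three byte counts in one string is what lets the visual headline stay aria-hidden without losing information for screen-reader users.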

Credits

Extracted from: craftingattention (app/src/lessons/primitives/nn/KVCacheSizeEstimator.tsx). The source paired the calculator with a stacked GPU memory bar (weights plus KV cache plus headroom), three model presets (GPT-2, LLaMA-7B, LLaMA-70B), three GPU configs (4090, A100, H100), a four-phase narration state machine (observe / sizing / overflow / insight), a per-factor "relative contribution" mini-bar chart, an overflow-pulse animation, a breathing-pulse hint, and discrete-step sliders. The library extract is the pure calculator — six inputs in, K / V / total bytes out — with controlled / uncontrolled APIs on every slider and the headline fade as the only motion. Presets, GPU comparisons, narration, and the contribution chart belong in the consuming lesson, not the primitive. The sibling KVCacheBarViz covers the bar-chart presentation of the same formula.