KV Cache Bar Viz

A horizontal bar visualisation of the autoregressive KV cache. During decoding, the K and V tensors of every previous token are kept for each layer and each KV head, so the cache size scales as 2 * seqLen * numLayers * effectiveHeads * headDim * bytesPerElement. The active attention scheme (MHA, GQA, or MQA) renders as the headline bar; the other two render as comparison bars at the same scale, so the savings show up as differences in bar length.
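
As a worked check of the formula above, here is a minimal sketch in plain TypeScript. The names `KVCacheConfig`, `effectiveHeads`, and `kvCacheBytes` are illustrative and not part of the library's API; the head mapping (numHeads for MHA, gqaGroups for GQA, 1 for MQA) follows the component's own description.

```typescript
type AttentionType = "mha" | "gqa" | "mqa";

// Hypothetical config shape that mirrors the component's numeric props.
interface KVCacheConfig {
  seqLen: number;
  numLayers: number;
  numHeads: number;
  headDim: number;
  bytesPerElement: number;
  gqaGroups: number;
}

// KV heads that actually get cached for each attention scheme.
function effectiveHeads(type: AttentionType, cfg: KVCacheConfig): number {
  if (type === "mha") return cfg.numHeads;
  if (type === "gqa") return cfg.gqaGroups;
  return 1; // MQA shares a single KV head across all query heads
}

// The leading 2 accounts for storing both K and V.
function kvCacheBytes(type: AttentionType, cfg: KVCacheConfig): number {
  return (
    2 *
    cfg.seqLen *
    cfg.numLayers *
    effectiveHeads(type, cfg) *
    cfg.headDim *
    cfg.bytesPerElement
  );
}

const demo: KVCacheConfig = {
  seqLen: 2048,
  numLayers: 32,
  numHeads: 32,
  headDim: 128,
  bytesPerElement: 2,
  gqaGroups: 8,
};

console.log(kvCacheBytes("mha", demo) / 2 ** 30); // 1   (GiB)
console.log(kvCacheBytes("gqa", demo) / 2 ** 20); // 256 (MiB)
console.log(kvCacheBytes("mqa", demo) / 2 ** 20); // 32  (MiB)
```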

KV cache memory. 32 layers, 32 KV heads of 128 dim, 2K tokens, 2 bytes per element. MHA total: 1.00 GiB.
[Interactive demo: three bars at a shared scale, with controls for sequence length, layer count, KV heads, and the displayed attention type]
MHA · 32 KV heads · 1.00 GiB
GQA · 8 KV heads · 256 MiB
MQA · 1 KV head · 32.0 MiB

Installation

npx shadcn@latest add https://craftbits.dev/r/kv-cache-bar-viz.json

Usage

import { KVCacheBarViz } from "@craft-bits/core";
 
<KVCacheBarViz
  defaultSeqLen={2048}
  defaultNumLayers={32}
  numHeads={32}
  headDim={128}
  bytesPerElement={2}
  gqaGroups={8}
  defaultAttentionType="mha"
/>

Drive the sequence length and the attention type from outside:

<KVCacheBarViz
  seqLen={seqLen}
  onSeqLenChange={setSeqLen}
  attentionType={type}
  onAttentionTypeChange={setType}
  numHeads={32}
  gqaGroups={8}
/>
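
The rule behind the controlled form can be sketched without React. This is an illustration of the general controlled / uncontrolled pattern, not the library's implementation; `createControllable` and its option names are hypothetical.

```typescript
// Hypothetical helper, not part of @craft-bits/core.
interface ControllableOptions<T> {
  value?: T; // when defined, the parent owns the state (controlled)
  defaultValue: T; // starting value when the component owns the state
  onChange?: (next: T) => void;
}

function createControllable<T>(opts: ControllableOptions<T>) {
  let internal = opts.defaultValue;
  const isControlled = opts.value !== undefined;
  return {
    // Controlled: always read the prop. Uncontrolled: read internal state.
    get: (): T => (isControlled ? (opts.value as T) : internal),
    // Only uncontrolled mode mutates internal state; both modes notify.
    set: (next: T): void => {
      if (!isControlled) internal = next;
      opts.onChange?.(next);
    },
  };
}

// Uncontrolled: the helper tracks the value itself.
const uncontrolled = createControllable<number>({ defaultValue: 2048 });
uncontrolled.set(4096);
console.log(uncontrolled.get()); // 4096

// Controlled: the value stays at the prop until the parent passes a new one.
const controlled = createControllable<number>({ value: 2048, defaultValue: 256 });
controlled.set(4096);
console.log(controlled.get()); // 2048
```

In controlled mode the parent is expected to respond to the change callback by passing a new value on the next render, which is what the `setSeqLen` and `setType` setters do in the snippet above.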

Hide the comparison rows when you want a single clean bar:

<KVCacheBarViz showComparison={false} />

Understanding the component

  1. One formula. Bytes per request equal 2 * seqLen * numLayers * effectiveHeads * headDim * bytesPerElement. The leading 2 covers both K and V tensors. effectiveHeads collapses to numHeads for MHA, gqaGroups for GQA, and 1 for MQA.
  2. Active bar plus comparison bars. The active attention type renders as the headline 7-px-tall bar; the other two schemes render below it as 3-px-tall bars at the same horizontal scale, so the savings can be read directly from bar length rather than inferred from a ratio.
  3. Shared normaliser. Bar widths are divided by the largest of the three totals, usually MHA. The MQA bar therefore never balloons to 100% just because it is the active type, and the comparison preserves real proportions across schemes.
  4. "x less than MHA" caption. When the active type is GQA or MQA, the headline bar shows the multiplicative reduction against MHA (e.g. 4x for gqaGroups=8 over numHeads=32, 32x for MQA over the same MHA).
  5. Controlled or uncontrolled. seqLen, numLayers, and attentionType all support the Radix pattern — pass value plus onValueChange for controlled mode, or defaultValue for uncontrolled. The picker pills are a role="radiogroup" of role="radio" buttons.
  6. SPRINGS.smooth everywhere. Bar-width changes animate with the canonical smooth spring; prefers-reduced-motion: reduce collapses every spring to an instant swap.
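
The shared normaliser and the reduction caption from the list above can be sketched as follows. `layoutBars` and `BarSpec` are hypothetical names, and the function takes precomputed byte totals as input; the library's internals may differ.

```typescript
type AttentionType = "mha" | "gqa" | "mqa";

interface BarSpec {
  type: AttentionType;
  bytes: number;
  widthFraction: number; // 0..1, share of the widest bar
  reductionVsMha: number; // e.g. 4 means "4x less than MHA"
}

// Every bar is divided by the same maximum, so widths stay proportional
// no matter which attention type is currently active.
function layoutBars(totals: Record<AttentionType, number>): BarSpec[] {
  const order: AttentionType[] = ["mha", "gqa", "mqa"];
  const largest = Math.max(...order.map((t) => totals[t]));
  return order.map((type) => ({
    type,
    bytes: totals[type],
    widthFraction: totals[type] / largest,
    reductionVsMha: totals.mha / totals[type],
  }));
}

const GiB = 2 ** 30;
const MiB = 2 ** 20;
const bars = layoutBars({ mha: 1 * GiB, gqa: 256 * MiB, mqa: 32 * MiB });

for (const bar of bars) {
  console.log(bar.type, bar.widthFraction, bar.reductionVsMha);
}
// mha 1 1
// gqa 0.25 4
// mqa 0.03125 32
```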

Props

| Prop | Type | Default | Description |
| --- | --- | --- | --- |
| seqLen | number | | Controlled sequence length. |
| defaultSeqLen | number | 2048 | Uncontrolled initial sequence length. |
| onSeqLenChange | (seqLen: number) => void | | Fires when the slider commits. |
| numLayers | number | | Controlled layer count. |
| defaultNumLayers | number | 32 | Uncontrolled initial layer count. |
| onNumLayersChange | (numLayers: number) => void | | Fires when the slider commits. |
| numHeads | number | 32 | Total query heads (MHA case). |
| headDim | number | 128 | Per-head dimension d_k. |
| bytesPerElement | number | 2 | 2 for fp16/bf16, 4 for fp32, 1 for int8. |
| attentionType | `'mha' \| 'gqa' \| 'mqa'` | | |
| defaultAttentionType | `'mha' \| 'gqa' \| 'mqa'` | | |
| onAttentionTypeChange | (t) => void | | Fires when the picker commits. |
| gqaGroups | number | 8 | KV groups when attentionType === "gqa". |
| seqLenMin | number | 256 | Minimum sequence length the slider allows. |
| seqLenMax | number | 131072 | Maximum sequence length the slider allows. |
| numLayersMin | number | 8 | Minimum layer count the slider allows. |
| numLayersMax | number | 96 | Maximum layer count the slider allows. |
| showComparison | boolean | true | Render the two non-active attention types as comparison bars. |
| transition | Transition | SPRINGS.smooth | Spring used for bar-width transitions. |
| className | string | | Merged onto the root via cn(). |

Accessibility

  • The figure is role="figure" with a hidden summary listing layers, KV heads, head dim, sequence length, bytes per element, and the active total — screen readers hear the story whenever props change.
  • The attention-type picker is a role="radiogroup" of role="radio" pills with aria-checked. Tab focuses the group; Space and Enter commit a selection.
  • Sliders are native <input type="range"> via the library's LabeledSlider with aria-valuemin / aria-valuemax / aria-valuenow / aria-valuetext — full keyboard + screen-reader semantics for free.
  • Color is never the only signal — each attention type has a textual label, an effective-heads count, and a byte readout. The reduction against MHA is also stated textually.
  • Motion respects prefers-reduced-motion: reduce.
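
A hidden summary like the one described in the first bullet could be assembled along these lines. This is a sketch only: `buildSummary`, `formatBytes`, and `formatTokens` are hypothetical helpers, and the rounding rules (about three significant digits, binary units) are assumptions chosen to match the readouts shown in the demo.

```typescript
// Hypothetical formatting helpers; the library's own rounding may differ.
function formatBytes(bytes: number): string {
  const units = ["B", "KiB", "MiB", "GiB", "TiB"];
  let value = bytes;
  let i = 0;
  while (value >= 1024 && i < units.length - 1) {
    value /= 1024;
    i += 1;
  }
  // Roughly three significant digits: 1.00, 32.0, 256.
  const digits = value >= 100 ? 0 : value >= 10 ? 1 : 2;
  return `${value.toFixed(digits)} ${units[i]}`;
}

function formatTokens(n: number): string {
  return n >= 1024 && n % 1024 === 0 ? `${n / 1024}K` : String(n);
}

interface SummaryInput {
  type: "mha" | "gqa" | "mqa";
  numLayers: number;
  kvHeads: number; // effective KV heads for the active type
  headDim: number;
  seqLen: number;
  bytesPerElement: number;
  totalBytes: number;
}

function buildSummary(s: SummaryInput): string {
  const heads = `${s.kvHeads} KV ${s.kvHeads === 1 ? "head" : "heads"}`;
  const bytes = `${s.bytesPerElement} ${s.bytesPerElement === 1 ? "byte" : "bytes"}`;
  return (
    `KV cache memory. ${s.numLayers} layers, ${heads} of ${s.headDim} dim, ` +
    `${formatTokens(s.seqLen)} tokens, ${bytes} per element. ` +
    `${s.type.toUpperCase()} total: ${formatBytes(s.totalBytes)}.`
  );
}

console.log(
  buildSummary({
    type: "mha",
    numLayers: 32,
    kvHeads: 32,
    headDim: 128,
    seqLen: 2048,
    bytesPerElement: 2,
    totalBytes: 2 ** 30,
  }),
);
// KV cache memory. 32 layers, 32 KV heads of 128 dim, 2K tokens, 2 bytes per element. MHA total: 1.00 GiB.
```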

Credits

  • Extracted from: craftingattention (app/src/lessons/primitives/nn/KVCacheBarViz.tsx). Stripped the four-phase narration state machine (observe / scaling / comparing / insight), the GPU-capacity threshold markers (RTX 4090 / A100 / H100), the live formula display with active-slider highlighting, the overflow-pulse animation, the breathing-pulse hint, and the discrete-step sliders for kv_heads and d_k. The library extract is the pure plotting primitive: configuration in, normalised bars out, with controlled / uncontrolled APIs for sequence length, layer count, and attention type.