KV Cache Bar Viz
A horizontal bar visualisation of the autoregressive KV cache. During decoding, the K and V tensors of every previous token are kept per layer per head, so the cache size scales as 2 * seqLen * numLayers * effectiveHeads * headDim * bytesPerElement. The active attention-sharing scheme (MHA, GQA, or MQA) renders as the headline bar; the other two render as comparison bars at the same scale, so the savings show up as bar length.
KV cache memory. 32 layers, 32 KV heads of 128 dim, 2K tokens, 2 bytes per element. MHA total: 1.00 GiB.
Preview (interactive demo): KV cache = 2 · 32 L · 32 H · 2K S · 128 D · 2 B
MHA, 32 KV heads: 1.00 GiB
GQA, 8 KV heads: 256 MiB
MQA, 1 KV head: 32.0 MiB
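The totals quoted above (1.00 GiB, 256 MiB, 32.0 MiB) follow directly from the formula. The sketch below reproduces that arithmetic; `KVCacheConfig`, `effectiveHeads`, and `kvCacheBytes` are illustrative names chosen here, not exports of the library.

```typescript
type AttentionType = "mha" | "gqa" | "mqa";

// Hypothetical config shape mirroring the component's props.
interface KVCacheConfig {
  seqLen: number;
  numLayers: number;
  numHeads: number;
  headDim: number;
  bytesPerElement: number;
  gqaGroups: number;
}

// Number of KV heads actually stored under each sharing scheme.
function effectiveHeads(type: AttentionType, cfg: KVCacheConfig): number {
  if (type === "mha") return cfg.numHeads;
  if (type === "gqa") return cfg.gqaGroups;
  return 1; // MQA: one shared KV head
}

// 2 (K and V) * seqLen * numLayers * effectiveHeads * headDim * bytesPerElement
function kvCacheBytes(type: AttentionType, cfg: KVCacheConfig): number {
  return (
    2 *
    cfg.seqLen *
    cfg.numLayers *
    effectiveHeads(type, cfg) *
    cfg.headDim *
    cfg.bytesPerElement
  );
}

// The demo configuration: 32 layers, 32 heads of 128 dim, 2K tokens, 2 bytes.
const demo: KVCacheConfig = {
  seqLen: 2048,
  numLayers: 32,
  numHeads: 32,
  headDim: 128,
  bytesPerElement: 2,
  gqaGroups: 8,
};

const MiB = 1024 * 1024;
const mhaMiB = kvCacheBytes("mha", demo) / MiB; // 1024 MiB = 1.00 GiB
const gqaMiB = kvCacheBytes("gqa", demo) / MiB; // 256 MiB
const mqaMiB = kvCacheBytes("mqa", demo) / MiB; // 32 MiB
```

Doubling `seqLen` or `numLayers` doubles every total, which is why the three bars keep their ratio to one another as the sliders move.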
Installation
npx shadcn@latest add https://craftbits.dev/r/kv-cache-bar-viz.json
Usage
import { KVCacheBarViz } from "@craft-bits/core";
<KVCacheBarViz
  defaultSeqLen={2048}
  defaultNumLayers={32}
  numHeads={32}
  headDim={128}
  bytesPerElement={2}
  gqaGroups={8}
  defaultAttentionType="mha"
/>
Drive the sequence length and the attention type from outside:
<KVCacheBarViz
  seqLen={seqLen}
  onSeqLenChange={setSeqLen}
  attentionType={type}
  onAttentionTypeChange={setType}
  numHeads={32}
  gqaGroups={8}
/>
Hide the comparison rows when you want a single clean bar:
<KVCacheBarViz showComparison={false} />
Understanding the component
- One formula. Bytes per request equal `2 * seqLen * numLayers * effectiveHeads * headDim * bytesPerElement`. The leading 2 covers both K and V tensors. `effectiveHeads` collapses to `numHeads` for MHA, `gqaGroups` for GQA, and `1` for MQA.
- Active bar plus comparison bars. The active attention type renders as the headline 7-px-tall bar; the other two schemes render below at the same scale (3-px-tall) so the savings read by length, not by inferred ratio.
- Shared normaliser. Bar widths divide by the largest of the three totals, usually MHA. The MQA bar never balloons to 100% just because it is the active type, and the comparison preserves real proportions across schemes.
- `x less than MHA` caption. When the active type is GQA or MQA, the headline bar shows the multiplicative reduction against MHA (e.g. `4x` for `gqaGroups=8` over `numHeads=32`, `32x` for MQA over the same MHA).
- Controlled or uncontrolled. `seqLen`, `numLayers`, and `attentionType` all support the Radix pattern: pass `value` plus `onValueChange` for controlled mode (for example `seqLen` plus `onSeqLenChange`), or `defaultValue` for uncontrolled (for example `defaultSeqLen`). The picker pills are a `role="radiogroup"` of `role="radio"` buttons.
- `SPRINGS.smooth` everywhere. Bar-width changes animate with the canonical smooth spring; `prefers-reduced-motion: reduce` collapses every spring to an instant swap.
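The shared normaliser and the reduction caption can be sketched as two small pure functions. This is a minimal illustration of the behaviour described in the list, assuming the three totals are already computed; `barFractions` and `reductionVsMha` are hypothetical names, not the library's internals.

```typescript
type AttentionType = "mha" | "gqa" | "mqa";
type Totals = Record<AttentionType, number>;

// Bar width as a fraction (0..1) of the largest total across all three
// schemes, so the active bar is never stretched to 100% on its own.
function barFractions(totals: Totals): Totals {
  const max = Math.max(totals.mha, totals.gqa, totals.mqa);
  const scale = (bytes: number): number => (max > 0 ? bytes / max : 0);
  return {
    mha: scale(totals.mha),
    gqa: scale(totals.gqa),
    mqa: scale(totals.mqa),
  };
}

// Multiplicative reduction of the active scheme against MHA.
// Returns null for MHA itself, where no caption applies.
function reductionVsMha(active: AttentionType, totals: Totals): number | null {
  if (active === "mha" || totals[active] <= 0) return null;
  return totals.mha / totals[active];
}

// Totals for numHeads=32, gqaGroups=8 at the demo configuration.
const MiB = 1024 * 1024;
const totals: Totals = { mha: 1024 * MiB, gqa: 256 * MiB, mqa: 32 * MiB };

const widths = barFractions(totals); // { mha: 1, gqa: 0.25, mqa: 0.03125 }
const gqaReduction = reductionVsMha("gqa", totals); // 4
const mqaReduction = reductionVsMha("mqa", totals); // 32
```

Normalising against the maximum of all three totals, rather than against the active total, is what keeps the MQA bar at roughly 3% of the track even when MQA is the selected type.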
Props
| Prop | Type | Default | Description |
|---|---|---|---|
| `seqLen` | `number` | — | Controlled sequence length. |
| `defaultSeqLen` | `number` | 2048 | Uncontrolled initial sequence length. |
| `onSeqLenChange` | `(seqLen: number) => void` | — | Fires when the slider commits. |
| `numLayers` | `number` | — | Controlled layer count. |
| `defaultNumLayers` | `number` | 32 | Uncontrolled initial layer count. |
| `onNumLayersChange` | `(numLayers: number) => void` | — | Fires when the slider commits. |
| `numHeads` | `number` | 32 | Total query heads (MHA case). |
| `headDim` | `number` | 128 | Per-head dimension d_k. |
| `bytesPerElement` | `number` | 2 | 2 for fp16/bf16, 4 for fp32, 1 for int8. |
| `attentionType` | `'mha' \| 'gqa' \| 'mqa'` | — | Controlled attention type. |
| `defaultAttentionType` | `'mha' \| 'gqa' \| 'mqa'` | | Uncontrolled initial attention type. |
| `onAttentionTypeChange` | `(t) => void` | — | Fires when the picker commits. |
| `gqaGroups` | `number` | 8 | KV groups when `attentionType === "gqa"`. |
| `seqLenMin` | `number` | 256 | Minimum sequence length the slider allows. |
| `seqLenMax` | `number` | 131072 | Maximum sequence length the slider allows. |
| `numLayersMin` | `number` | 8 | Minimum layer count the slider allows. |
| `numLayersMax` | `number` | 96 | Maximum layer count the slider allows. |
| `showComparison` | `boolean` | true | Render the two non-active attention types as comparison bars. |
| `transition` | `Transition` | `SPRINGS.smooth` | Spring used for bar-width transitions. |
| `className` | `string` | — | Merged onto the root via `cn()`. |
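When deriving props from a model config, it can help to map a dtype name to `bytesPerElement` and to keep a requested sequence length inside the slider bounds before passing it in. The sketch below is a caller-side helper under stated assumptions: the `KVDtype` names and the `clamp` step are hypothetical, and whether the component clamps out-of-range values itself is not documented here.

```typescript
// Hypothetical dtype names; the component only takes a numeric bytesPerElement.
type KVDtype = "fp32" | "fp16" | "bf16" | "int8";

// Byte widths as listed in the bytesPerElement row of the table.
const BYTES_PER_ELEMENT: Record<KVDtype, number> = {
  fp32: 4,
  fp16: 2,
  bf16: 2,
  int8: 1,
};

// Assumed caller-side guard: keep a value inside [min, max].
function clamp(value: number, min: number, max: number): number {
  return Math.min(max, Math.max(min, value));
}

const SEQ_LEN_MIN = 256; // seqLenMin default from the table
const SEQ_LEN_MAX = 131072; // seqLenMax default from the table

const bytes = BYTES_PER_ELEMENT.bf16; // 2
const tooLong = clamp(1000000, SEQ_LEN_MIN, SEQ_LEN_MAX); // 131072
const tooShort = clamp(100, SEQ_LEN_MIN, SEQ_LEN_MAX); // 256
```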
Accessibility
- The figure is `role="figure"` with a hidden summary listing layers, KV heads, head dim, sequence length, bytes per element, and the active total, so screen readers hear the story whenever props change.
- The attention-type picker is a `role="radiogroup"` of `role="radio"` pills with `aria-checked`. Tab focuses the group; Space and Enter commit a selection.
- Sliders are native `<input type="range">` via the library's `LabeledSlider` with `aria-valuemin` / `aria-valuemax` / `aria-valuenow` / `aria-valuetext`: full keyboard + screen-reader semantics for free.
- Color is never the only signal: each attention type has a textual label, an effective-heads count, and a byte readout. The reduction against MHA is also stated textually.
- Motion respects `prefers-reduced-motion: reduce`.
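One way to honour the reduced-motion preference is to swap the spring for a zero-duration transition when the media query matches. The sketch below is an assumption about how that could look, not the library's implementation: the transition shapes, the spring numbers, and `pickTransition` are placeholders, and the media matcher is injected so the function also runs outside a browser.

```typescript
// Minimal stand-in for the part of window.matchMedia this sketch needs.
type MediaMatcher = (query: string) => { matches: boolean };

// Simplified, hypothetical transition shapes for illustration only.
interface SpringTransition {
  type: "spring";
  stiffness: number;
  damping: number;
}
interface InstantTransition {
  duration: 0;
}
type BarTransition = SpringTransition | InstantTransition;

// Returns the spring unchanged unless the user asked for reduced motion,
// in which case bar widths swap instantly.
function pickTransition(
  spring: SpringTransition,
  matchMedia?: MediaMatcher,
): BarTransition {
  const reduced = matchMedia
    ? matchMedia("(prefers-reduced-motion: reduce)").matches
    : false; // no matcher available (e.g. server render): keep the spring
  return reduced ? { duration: 0 } : spring;
}

// Placeholder values; the real SPRINGS.smooth parameters are not given here.
const smooth: SpringTransition = { type: "spring", stiffness: 200, damping: 30 };

const reducedResult = pickTransition(smooth, () => ({ matches: true }));
const normalResult = pickTransition(smooth, () => ({ matches: false }));
```

In a browser the matcher would typically be `(q) => window.matchMedia(q)`.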
Credits
- Extracted from: `craftingattention` (`app/src/lessons/primitives/nn/KVCacheBarViz.tsx`). Stripped the four-phase narration state machine (observe / scaling / comparing / insight), the GPU-capacity threshold markers (RTX 4090 / A100 / H100), the live formula display with active-slider highlighting, the overflow-pulse animation, the breathing-pulse hint, and the discrete-step sliders for kv_heads and d_k. The library extract is the pure plotting primitive: configuration in, normalised bars out, with controlled / uncontrolled APIs for sequence length, layer count, and attention type.