KV Cache Bar Viz

A horizontal bar visualisation of the autoregressive KV cache. During decoding, the K and V tensors of every previous token are kept per layer and per KV head, so the cache size scales as 2 * seqLen * numLayers * effectiveHeads * headDim * bytesPerElement. The active configuration renders as the headline bar; the other two of the three attention-sharing schemes (MHA, GQA, MQA) render as comparison bars at the same scale, so the savings are visible directly as bar length.
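As a rough check of the formula, the sketch below computes the cache size for each scheme. The names `kvCacheBytes`, `effectiveHeads`, and `KVConfig` are illustrative assumptions and not part of the component's exported API; only the formula itself comes from the description above.

```typescript
// Hypothetical helper names; only the formula is taken from the docs.
type AttentionType = "mha" | "gqa" | "mqa";

interface KVConfig {
  seqLen: number;
  numLayers: number;
  numHeads: number; // query heads; also the KV-head count under MHA
  gqaGroups: number; // KV-head count under GQA
  headDim: number;
  bytesPerElement: number; // 2 for fp16/bf16, 4 for fp32, 1 for int8
}

// KV heads actually stored per layer for a given scheme.
function effectiveHeads(type: AttentionType, cfg: KVConfig): number {
  if (type === "mha") return cfg.numHeads;
  if (type === "gqa") return cfg.gqaGroups;
  return 1; // MQA shares a single KV head
}

// The leading 2 accounts for storing both K and V.
function kvCacheBytes(type: AttentionType, cfg: KVConfig): number {
  return (
    2 *
    cfg.seqLen *
    cfg.numLayers *
    effectiveHeads(type, cfg) *
    cfg.headDim *
    cfg.bytesPerElement
  );
}

const cfg: KVConfig = {
  seqLen: 2048,
  numLayers: 32,
  numHeads: 32,
  gqaGroups: 8,
  headDim: 128,
  bytesPerElement: 2,
};

const MiB = 1024 ** 2;
console.log(kvCacheBytes("mha", cfg) / MiB); // 1024 (i.e. 1 GiB)
console.log(kvCacheBytes("gqa", cfg) / MiB); // 256
console.log(kvCacheBytes("mqa", cfg) / MiB); // 32
```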

Demo preview (2 tensors, 32 layers, 32 query heads, 2K sequence length, head dim 128, 2 bytes per element):

MHA, 32 KV heads: 1.00 GiB
GQA, 8 KV heads: 256 MiB
MQA, 1 KV head: 32.0 MiB

Demo controls: seq len, layers, query heads, GQA groups, attention type, comparison bars.

Installation

npx shadcn@latest add https://craftbits.dev/r/kv-cache-bar-viz.json

Usage

import { KVCacheBarViz } from "@craft-bits/core";
 
<KVCacheBarViz
  defaultSeqLen={2048}
  defaultNumLayers={32}
  numHeads={32}
  headDim={128}
  bytesPerElement={2}
  gqaGroups={8}
  defaultAttentionType="mha"
/>

Drive the sequence length and the attention type from outside:

<KVCacheBarViz
  seqLen={seqLen}
  onSeqLenChange={setSeqLen}
  attentionType={type}
  onAttentionTypeChange={setType}
  numHeads={32}
  gqaGroups={8}
/>
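The split between controlled and uncontrolled props can be pictured with a small framework-free sketch. `createControllable` is a hypothetical helper written for illustration; it is not the component's implementation, which presumably relies on React state.

```typescript
// Hypothetical, framework-free illustration of controlled vs uncontrolled state.
interface ControllableOptions<T> {
  value?: T; // set by the parent in controlled mode
  defaultValue: T; // initial value in uncontrolled mode
  onChange?: (next: T) => void;
}

function createControllable<T>(opts: ControllableOptions<T>) {
  let internal = opts.defaultValue;
  const isControlled = opts.value !== undefined;
  return {
    // Controlled mode always reads the parent's value.
    get: (): T => (isControlled ? (opts.value as T) : internal),
    set: (next: T): void => {
      if (!isControlled) internal = next; // only uncontrolled mode stores state
      opts.onChange?.(next); // the parent is notified in both modes
    },
  };
}

// Uncontrolled: the helper owns the value.
const uncontrolled = createControllable<number>({ defaultValue: 2048 });
uncontrolled.set(4096);
console.log(uncontrolled.get()); // 4096

// Controlled: the value stays at 2048 until the parent passes a new one.
const seen: number[] = [];
const controlled = createControllable<number>({
  value: 2048,
  defaultValue: 2048,
  onChange: (v) => seen.push(v),
});
controlled.set(4096);
console.log(controlled.get(), seen); // 2048 [ 4096 ]
```

In controlled mode the parent is expected to store the reported value and pass it back in, which is what the `seqLen` / `onSeqLenChange` pairing above does.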

Hide the comparison rows when you want a single clean bar:

<KVCacheBarViz showComparison={false} />

Understanding the component

One formula. Bytes per request equal 2 * seqLen * numLayers * effectiveHeads * headDim * bytesPerElement. The leading 2 covers both K and V tensors. effectiveHeads collapses to numHeads for MHA, gqaGroups for GQA, and 1 for MQA.
Active bar plus comparison bars. The active attention type renders as the headline 7-px-tall bar; the other two schemes render below at the same scale (3-px-tall) so the savings read by length, not by inferred ratio.
Shared normaliser. Bar widths divide by the largest of the three totals — usually MHA. The MQA bar never balloons to 100% just because it is the active type, and the comparison preserves real proportions across schemes.
x less than MHA caption. When the active type is GQA or MQA, the headline bar shows the multiplicative reduction against MHA (e.g. 4x for gqaGroups=8 over numHeads=32, 32x for MQA over the same MHA).
Controlled or uncontrolled. seqLen, numLayers, and attentionType all support the Radix pattern: pass the value prop plus its change callback (for example seqLen with onSeqLenChange) for controlled mode, or the matching default prop (for example defaultSeqLen) for uncontrolled mode. The picker pills are a role="radiogroup" of role="radio" buttons.
SPRINGS.smooth everywhere. Bar-width changes animate with the canonical smooth spring; prefers-reduced-motion: reduce collapses every spring to an instant swap.
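The shared normaliser and the reduction caption described in the list above can be sketched as follows. The function names `barFractions` and `reductionVsMha` are assumptions made for illustration, not the library's internals, and the totals are the 2K-token example values.

```typescript
// Hypothetical sketch; names are illustrative, not the library's API.
type Scheme = "mha" | "gqa" | "mqa";
type Totals = Record<Scheme, number>;

// Bar width for each scheme as a fraction (0..1) of the largest total,
// so every bar shares one scale regardless of which type is active.
function barFractions(totals: Totals): Totals {
  const max = Math.max(totals.mha, totals.gqa, totals.mqa);
  const frac = (bytes: number): number => (max > 0 ? bytes / max : 0);
  return { mha: frac(totals.mha), gqa: frac(totals.gqa), mqa: frac(totals.mqa) };
}

// Multiplicative reduction of the active scheme against MHA.
function reductionVsMha(active: Scheme, totals: Totals): number {
  return totals.mha / totals[active];
}

const MIB = 1024 ** 2;
const exampleTotals: Totals = { mha: 1024 * MIB, gqa: 256 * MIB, mqa: 32 * MIB };

const widths = barFractions(exampleTotals);
console.log(widths.mha, widths.gqa, widths.mqa); // 1 0.25 0.03125
console.log(reductionVsMha("gqa", exampleTotals)); // 4
console.log(reductionVsMha("mqa", exampleTotals)); // 32
```

Because the widths are divided by the largest total rather than by the active one, selecting MQA leaves its bar at about 3% of the track instead of stretching it to full width.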

Props

Prop	Type	Default	Description
`seqLen`	`number`	—	Controlled sequence length.
`defaultSeqLen`	`number`	`2048`	Uncontrolled initial sequence length.
`onSeqLenChange`	`(seqLen: number) => void`	—	Fires when the slider commits.
`numLayers`	`number`	—	Controlled layer count.
`defaultNumLayers`	`number`	`32`	Uncontrolled initial layer count.
`onNumLayersChange`	`(numLayers: number) => void`	—	Fires when the slider commits.
`numHeads`	`number`	`32`	Total query heads (MHA case).
`headDim`	`number`	`128`	Per-head dimension `d_k`.
`bytesPerElement`	`number`	`2`	2 for fp16/bf16, 4 for fp32, 1 for int8.
`attentionType`	`'mha' | 'gqa' | 'mqa'`	—	Controlled attention type.
`defaultAttentionType`	`'mha' | 'gqa' | 'mqa'`	`'mha'`	Uncontrolled initial attention type.
`onAttentionTypeChange`	`(t) => void`	—	Fires when the picker commits.
`gqaGroups`	`number`	`8`	KV groups when `attentionType === "gqa"`.
`seqLenMin`	`number`	`256`	Minimum sequence length the slider allows.
`seqLenMax`	`number`	`131072`	Maximum sequence length the slider allows.
`numLayersMin`	`number`	`8`	Minimum layer count the slider allows.
`numLayersMax`	`number`	`96`	Maximum layer count the slider allows.
`showComparison`	`boolean`	`true`	Render the two non-active attention types as comparison bars.
`transition`	`Transition`	`SPRINGS.smooth`	Spring used for bar-width transitions.
`className`	`string`	—	Merged onto the root via `cn()`.

Accessibility

The figure is role="figure" with a hidden summary listing layers, KV heads, head dim, sequence length, bytes per element, and the active total — screen readers hear the story whenever props change.
The attention-type picker is a role="radiogroup" of role="radio" pills with aria-checked. Tab focuses the group; Space and Enter commit a selection.
Sliders are native <input type="range"> via the library's LabeledSlider with aria-valuemin / aria-valuemax / aria-valuenow / aria-valuetext — full keyboard + screen-reader semantics for free.
Color is never the only signal — each attention type has a textual label, an effective-heads count, and a byte readout. The reduction against MHA is also stated textually.
Motion respects prefers-reduced-motion: reduce.

Credits

Extracted from: craftingattention (app/src/lessons/primitives/nn/KVCacheBarViz.tsx). Stripped the four-phase narration state machine (observe / scaling / comparing / insight), the GPU-capacity threshold markers (RTX 4090 / A100 / H100), the live formula display with active-slider highlighting, the overflow-pulse animation, the breathing-pulse hint, and the discrete-step sliders for kv_heads and d_k. The library extract is the pure plotting primitive: configuration in, normalised bars out, with controlled / uncontrolled APIs for sequence length, layer count, and attention type.