GPU Memory Math Viz
Live, workload-agnostic breakdown of the memory required to train a model. A log-scale slider drives the parameter count; a ZeRO radiogroup demonstrates how data-parallel sharding across N GPUs redistributes the load. Dashed GPU capacity lines and a hatched overflow overlay reveal the OOM cliff; a second per-GPU bar reveals itself once ZeRO is active.
GPU memory math. 3.7B parameters. Total training memory 62 GiB.
Presets
3.7B
Weights (2B/param)
Master (4B/param)
Momentum (4B/param)
Variance (4B/param)
Gradients (2B/param)
Activations (2B/param)
Total Training Memory62 GiB
ZeRO Stage
Memory formula
Weights = 3.7B x 2B = 6.8 GiBMaster = 3.7B x 4B = 14 GiBMomentum = 3.7B x 4B = 14 GiBVariance = 3.7B x 4B = 14 GiBGradients = 3.7B x 2B = 6.8 GiBActivations = 3.7B x 2B = 6.8 GiB
Total = 3.7B x 18 bytes = 62 GiBLeft / Right: adjust1-4: presetsZ: cycle ZeRO
GPU memory math. 3.7B parameters. Total training memory 62 GiB.
Customize
Model
3.7B
Distributed
8
none
Installation
npx shadcn@latest add https://craftbits.dev/r/gpu-memory-math-viz.jsonUsage
import { GpuMemoryMathViz } from "@craft-bits/viz/gpu-memory-math-viz";
<GpuMemoryMathViz defaultParams={7e9} defaultZeroStage="none" />Drive the parameter count + ZeRO stage from outside (controlled mode):
const [params, setParams] = useState(7e9);
const [zeroStage, setZeroStage] = useState<"none" | "stage1" | "stage2" | "stage3">("none");
<GpuMemoryMathViz
params={params}
onParamsChange={setParams}
zeroStage={zeroStage}
onZeroStageChange={setZeroStage}
/>;Swap in a custom memory table (e.g. quantised inference, not Adam training):
<GpuMemoryMathViz
segments={[
{ id: "weights", label: "INT4 Weights", shortLabel: "Weights", bytesPerParam: 0.5, shardFrom: "stage3" },
{ id: "kv", label: "FP16 KV Cache", shortLabel: "KV", bytesPerParam: 1.5, shardFrom: "stage3" },
]}
presets={[
{ label: "7B", params: 7e9, key: "1" },
{ label: "70B", params: 70e9, key: "2" },
]}
/>Understanding the component
- Log-scale slider. Parameters span ~3 orders of magnitude (100M → 70B), so the slider operates on
log10(params). Pointer position maps to log space, then back to params on read — so equal slider distance feels like equal "model size" in conceptual terms. - Stacked memory bar. Each segment contributes
params x bytesPerParambytes. Segments are stacked horizontally, normalised againstmax(totalGiB, maxGpuRef) x 1.15so the bar always shows context (the next GPU ceiling) even when the workload comfortably fits. - GPU capacity lines. Every
gpuRefs[]entry paints a dashed line at its capacity in the bar's coordinate space and labels itself below. When the total crosses a line, a "DOESN'T FIT" badge appears above and the overflow region of every segment gets a hatched overlay. - ZeRO stages. Each segment declares its
shardFromstage. At a given ZeRO stage, sharded segments are divided acrossgpuCountGPUs; non-sharded segments remain full. The component reveals a second bar showing the per-GPU breakdown, plus chips that label each segment asname / N(sharded) orname (full). - Controlled / uncontrolled.
params+onParamsChangeandzeroStage+onZeroStageChangefollow the Radix pattern — pass both for controlled, omit both for self-managed. - Reduced motion. Under
prefers-reduced-motion: reduce, every bar transition collapses toduration: 0. The ZeRO detail still appears but enters without movement.
Props
| Prop | Type | Default | Description |
|---|---|---|---|
params | number | — | Controlled parameter count. Pair with onParamsChange. |
defaultParams | number | 7e9 | Uncontrolled initial parameter count. |
onParamsChange | (next: number) => void | — | Fires when the slider or a preset chip changes the count. |
zeroStage | "none" | "stage1" | "stage2" | "stage3" | — | Controlled ZeRO stage. |
defaultZeroStage | "none" | "stage1" | "stage2" | "stage3" | "none" | Uncontrolled initial ZeRO stage. |
onZeroStageChange | (next) => void | — | Fires when the ZeRO stage changes. |
minParams | number | 100e6 | Lower bound of the log-scale slider. |
maxParams | number | 70e9 | Upper bound of the log-scale slider. |
gpuCount | number | 8 | Number of GPUs assumed for ZeRO sharding. |
segments | readonly GpuMemoryMathVizSegment[] | Adam table | Memory cost segments (label, bytes/param, shardFrom). |
gpuRefs | readonly GpuMemoryMathVizGpuRef[] | A100 / H100 / 4090 | GPU capacity reference lines. |
presets | readonly GpuMemoryMathVizPreset[] | GPT-2 / 7B / 13B / 70B | Quick-preset chips. |
transition | Transition | SPRINGS.smooth | Override the bar-segment growth spring. |
className | string | — | Merged onto the root via cn(). |
Accessibility
- The root is
role="figure"with anaria-labelledbysummary that names the parameter count, the total training memory, and (when ZeRO is active) the per-GPU memory. - The summary is mirrored in a live region (
aria-live="polite") so screen-reader users hear the same update as sighted users when the slider moves. - The parameter slider is a native
<input type="range">witharia-valuemin/aria-valuemax/aria-valuenow/aria-valuetextmirroring its current state and a visible label. - The ZeRO toggle is a proper
role="radiogroup"witharia-checkedon eachrole="radio"button. - Keyboard support:
Left/Rightmove the slider one log-step at a time,1…njump to preset chips,Zcycles ZeRO stages. - Focus styling on every control uses
:focus-visiblewith a 2px ring offset from the surface so it remains visible against both light and dark themes. - Colour is never the only signal — the per-segment GiB readouts, the "DOESN'T FIT" badges, and the per-GPU chips are all text.
- Motion respects
prefers-reduced-motion: reduce— bar transitions snap and the ZeRO detail enters without movement.
Credits
- Extracted from:
craftingattention(app/src/lessons/primitives/systems/GPUMemoryMathViz.tsx). The source baked in the Adam optimizer memory table, two hardcoded GPU configs, four hardcoded LLM presets, a lesson-specific narration paragraph, and rawvar(--color-*)lesson tokens. The viz extract drops the narration (lesson chrome), generalises every list (segments,gpuRefs,presets,gpuCount, slider bounds) to a prop with a sensible default, remaps the palette tovar(--cb-*)semantic tokens, swaps inlineSPRINGS.snappy / .gentle+STAGGER.tightreferences for the canonicalSPRINGS.snap / .smooth+STAGGERconstant from@craft-bits/core/motion, surfacesparamsandzeroStagevia the controlled/uncontrolled Radix pattern, and replaces ad-hocuseReducedMotion()with the library'susePrefersReducedMotionhook.