Tensor Parallelism Viz
A diagram of one forward pass through a weight matrix that has been split column-wise across multiple GPUs, paired with a latency-comparison bar that plots compute vs all-reduce time for every configured GPU count. The input vector X feeds every shard at once; each GPU multiplies X by its column slice in parallel; an all-reduce sums the partial results; the final output vector Y appears on the right. With one GPU the all-reduce phase is skipped.
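The arithmetic behind the diagram can be sketched in plain TypeScript. This is an illustrative model only, not code from the component: the helper names (`columnRanges`, `shardPartial`, `allReduceSum`, `forwardPass`) are hypothetical. It assumes each shard writes its results into a zero-padded, full-length vector, so that summing the partials (the "all-reduce" step) yields the same `Y` as a single-GPU multiply.

```typescript
// Illustrative model of a column-sharded forward pass (hypothetical helpers).
type Matrix = number[][]; // w[r][c], rows x cols

/** Split column indices [0, cols) into `gpus` contiguous half-open ranges. */
function columnRanges(cols: number, gpus: number): Array<[number, number]> {
  const ranges: Array<[number, number]> = [];
  const base = Math.floor(cols / gpus);
  const extra = cols % gpus;
  let start = 0;
  for (let g = 0; g < gpus; g++) {
    const width = base + (g < extra ? 1 : 0);
    ranges.push([start, start + width]);
    start += width;
  }
  return ranges;
}

/** One shard's work: fill only its own columns of Y, leave the rest at zero. */
function shardPartial(x: number[], w: Matrix, range: [number, number]): number[] {
  const [from, to] = range;
  const partial = new Array<number>(w[0].length).fill(0);
  for (let c = from; c < to; c++) {
    for (let r = 0; r < x.length; r++) {
      partial[c] += x[r] * w[r][c];
    }
  }
  return partial;
}

/** All-reduce modeled as an element-wise sum over every shard's partial. */
function allReduceSum(partials: number[][]): number[] {
  return partials.reduce((acc, p) => acc.map((v, i) => v + p[i]));
}

/** Full pass: every shard sees the whole X; one GPU skips the all-reduce. */
function forwardPass(x: number[], w: Matrix, gpus: number): number[] {
  const partials = columnRanges(w[0].length, gpus).map((range) =>
    shardPartial(x, w, range),
  );
  return gpus === 1 ? partials[0] : allReduceSum(partials);
}
```

Under these assumptions, `forwardPass(x, w, 1)`, `forwardPass(x, w, 2)`, and `forwardPass(x, w, 4)` return the same vector for the same `x` and `w`; only the split of the work changes.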
Live preview (initial state): 1 GPU, 8×8 weight matrix W, idle. Latency comparison (compute + all-reduce): 1 GPU 100 ms, 2 GPUs 55 ms, 4 GPUs 30 ms.
Installation
npx shadcn@latest add https://craftbits.dev/r/tensor-parallelism-viz.json

Usage

import { TensorParallelismViz } from "@craft-bits/viz/tensor-parallelism-viz";

<TensorParallelismViz />

Start on four GPUs and run the forward pass automatically:

<TensorParallelismViz defaultGpus={4} defaultPlaying />

Drive the visual from outside (controlled phase + active GPU):

<TensorParallelismViz
  gpus={4}
  phase="compute"
  activeGpu={2}
/>

Provide a custom latency table (e.g. for an 8-GPU configuration):

<TensorParallelismViz
  gpuOptions={[1, 2, 4, 8]}
  timings={{
    1: { compute: 200, comm: 0, total: 200 },
    2: { compute: 100, comm: 8, total: 108 },
    4: { compute: 50, comm: 10, total: 60 },
    8: { compute: 25, comm: 12, total: 37 },
  }}
  defaultGpus={8}
/>

Understanding the component
- Diagram layout. The SVG places the input vector `X` on the left, the sharded weight matrix `W` in the centre, an all-reduce node next to it, and the output vector `Y` on the right. Shards are offset horizontally by four pixels each so they read as separate physical devices.
- Lifecycle. Four phases (`idle`, `compute`, `allreduce`, `done`) gate every visual element. Arrows draw in on `compute`; the all-reduce block appears on `allreduce` and tints `success` once `done`; the output `Y` cells fade in only on `done`.
- Autoplay loop. When `playing` is true and reduced motion is off, a `setInterval` advances the phase every `playSpeed` milliseconds. The compute phase walks `activeGpu` from `0` to `N − 1` before flipping to `allreduce` (or directly to `done` when `N = 1`).
- Controlled / uncontrolled. `gpus`, `phase`, `activeGpu`, and `playing` all follow Radix's pattern: pass the controlled prop with a handler, or use the `default*` counterpart. The component never double-tracks state.
- Latency-comparison bar. A horizontal proportional bar per option in `gpuOptions` reads `compute` (accent) and `comm` (warning) widths against the baseline (`gpuOptions[0]`) total. The current row glows accent once the forward pass reaches `allreduce` / `done`.
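The lifecycle and autoplay rules above can be modeled as a small pure state machine. The sketch below is not the component's source: `PassState`, `nextState`, `effectiveInterval`, and `runToDone` are hypothetical names, and three behaviours are assumptions rather than documented facts (that `idle` advances to `compute` on GPU `0`, that `activeGpu` resets to `-1` once compute finishes, and that `done` is terminal). The 80 ms floor comes from the `playSpeed` prop description.

```typescript
type Phase = "idle" | "compute" | "allreduce" | "done";

interface PassState {
  phase: Phase;
  activeGpu: number; // -1 means "no single GPU is singled out"
}

const MIN_PLAY_SPEED_MS = 80;

/** Clamp the tick interval to the documented 80 ms floor. */
function effectiveInterval(playSpeed: number): number {
  return Math.max(MIN_PLAY_SPEED_MS, playSpeed);
}

/** Advance one autoplay tick for a configuration with `gpus` shards. */
function nextState(state: PassState, gpus: number): PassState {
  switch (state.phase) {
    case "idle":
      // Assumption: the first tick starts compute on GPU 0.
      return { phase: "compute", activeGpu: 0 };
    case "compute":
      if (state.activeGpu < gpus - 1) {
        return { phase: "compute", activeGpu: state.activeGpu + 1 };
      }
      // Last shard finished: a single GPU skips the all-reduce entirely.
      return { phase: gpus === 1 ? "done" : "allreduce", activeGpu: -1 };
    case "allreduce":
      return { phase: "done", activeGpu: -1 };
    case "done":
      // Assumption: done is terminal until something resets the pass.
      return state;
  }
}

/** Collect every state visited from idle until done, as "phase:gpu" strings. */
function runToDone(gpus: number): string[] {
  const visited: string[] = [];
  let state: PassState = { phase: "idle", activeGpu: -1 };
  while (state.phase !== "done") {
    state = nextState(state, gpus);
    visited.push(`${state.phase}:${state.activeGpu}`);
  }
  return visited;
}
```

In a React component, a `setInterval` callback running every `effectiveInterval(playSpeed)` milliseconds could apply `nextState` to the current state; keeping the transition pure makes it straightforward to test without timers.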
Props
| Prop | Type | Default | Description |
|---|---|---|---|
| `gpuOptions` | `readonly number[]` | `[1, 2, 4]` | GPU-count options surfaced as buttons. |
| `timings` | `Record<number, TensorParallelismVizTiming>` | `TENSOR_PARALLELISM_VIZ_DEFAULT_TIMINGS` | Per-GPU-count latency table. |
| `gpus` / `defaultGpus` | `number` | `1` | Controlled / uncontrolled GPU count. |
| `phase` / `defaultPhase` | `TensorParallelismVizPhase` | `"idle"` | Controlled / uncontrolled phase. |
| `activeGpu` / `defaultActiveGpu` | `number` | `-1` | Active GPU index during compute. `-1` highlights every shard. |
| `playing` / `defaultPlaying` | `boolean` | `false` | Controlled / uncontrolled autoplay state. |
| `playSpeed` | `number` | `420` | Milliseconds between phase advances. Floored at 80 ms. |
| `rows` / `cols` | `number` | `8` / `8` | Weight-matrix dimensions. |
| `showGpuLabels` | `boolean` | `true` | Render the "GPU 1, GPU 2, …" labels under shards. |
| `showTimingBar` | `boolean` | `true` | Render the latency-comparison bar. |
| `transition` | `Transition` | `SPRINGS.smooth` | Override the spring used for cell / label transitions. |
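To make the relationship between `timings`, `gpuOptions`, and the latency-comparison bar concrete, here is a minimal sketch of how segment widths could be derived. It is an assumption about the calculation, not the component's actual code: the `Timing` interface simply mirrors the entry shape used in the usage example, and `barSegments` is a hypothetical helper. Widths are expressed as a percentage of the baseline row's total, where the baseline is `gpuOptions[0]`.

```typescript
// Mirrors the entry shape shown in the usage example's `timings` prop.
interface Timing {
  compute: number; // ms spent multiplying
  comm: number; // ms spent in the all-reduce
  total: number; // ms end to end
}

interface BarSegments {
  computePct: number;
  commPct: number;
}

/** Width of each segment as a percentage of the baseline row's total. */
function barSegments(
  timings: Record<number, Timing>,
  gpuOptions: readonly number[],
  gpus: number,
): BarSegments {
  const baseline = timings[gpuOptions[0]];
  const row = timings[gpus];
  if (!baseline || !row || baseline.total <= 0) {
    throw new Error(`No usable timing entry for ${gpus} GPU(s)`);
  }
  return {
    computePct: (row.compute * 100) / baseline.total,
    commPct: (row.comm * 100) / baseline.total,
  };
}

// The custom table from the usage example.
const exampleTimings: Record<number, Timing> = {
  1: { compute: 200, comm: 0, total: 200 },
  2: { compute: 100, comm: 8, total: 108 },
  4: { compute: 50, comm: 10, total: 60 },
  8: { compute: 25, comm: 12, total: 37 },
};

const eightGpuBar = barSegments(exampleTimings, [1, 2, 4, 8], 8);
// eightGpuBar.computePct is 12.5 and eightGpuBar.commPct is 6
```

Scaling every row against the same baseline keeps the bars comparable across GPU counts, so a shorter bar always means a faster configuration.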
Accessibility
- The root is a `role="figure"` with `aria-labelledby` pointing at a hidden summary so screen-reader users get a one-line description before exploring the diagram.
- A polite live region announces phase changes, e.g. "GPU 3 multiplying its shard.", "All-reduce. Summing partial outputs across 4 GPUs.", "Done. 30 ms total."
- The GPU buttons are real `<button>` elements with `aria-pressed` and a descriptive `aria-label` ("4 GPUs"). The forward-pass button mirrors its visible label through `aria-label` and is `disabled` when not idle.
- Colour is never the only signal: the active shard gets a thicker stroke in addition to higher fill opacity; the all-reduce block changes shape; the phase chips at the bottom highlight the current phase with both colour and a filled dot.
- Motion respects `prefers-reduced-motion: reduce`: every cell / arrow / all-reduce / output transition collapses to instant. Autoplay is a no-op when reduced motion is on.
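The live-region messages quoted above could come from a small pure function like the one below. This is a hedged sketch: `announcement` is a hypothetical helper, the mapping from the zero-based `activeGpu` index to a one-based spoken label is an assumption, and the wording for the `-1` ("every shard") case is invented for illustration. The other strings follow the examples on this page.

```typescript
type Phase = "idle" | "compute" | "allreduce" | "done";

/** Build the text a polite live region would announce for the current state. */
function announcement(
  phase: Phase,
  activeGpu: number,
  gpus: number,
  totalMs: number,
): string {
  switch (phase) {
    case "idle":
      return "Idle. Ready to compute.";
    case "compute":
      // Assumption: index 2 is spoken as "GPU 3"; -1 wording is hypothetical.
      return activeGpu < 0
        ? "All GPUs multiplying their shards."
        : `GPU ${activeGpu + 1} multiplying its shard.`;
    case "allreduce":
      return `All-reduce. Summing partial outputs across ${gpus} GPUs.`;
    case "done":
      return `Done. ${totalMs} ms total.`;
  }
}
```

Rendering the returned string inside an element with `aria-live="polite"` lets assistive technology read each phase change without interrupting the user.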
Credits
- Extracted from: `craftingattention` (`app/src/lessons/primitives/viz/TensorParallelismViz.tsx`). The source wrapped the diagram in a `Widget` with three modes (Explore / Predict / Challenge), `useWidgetHistory` undo / redo, bookmarks, a narration band keyed to a fixed 100 / 50 / 25 ms latency table, and per-shard colour palette. The library extract strips the Widget chrome and lesson modes entirely, drops the multi-hue per-GPU palette in favour of a single `--cb-accent`, keeps the GPU selector + run button + diagram + latency-comparison bar as the four primitive pieces, lifts the latency table to a `timings` prop, and exposes the full Radix-style controlled / uncontrolled API for `gpus` / `phase` / `activeGpu` / `playing`. Colours are remapped to `var(--cb-accent)` / `var(--cb-success)` / `var(--cb-warning)` / `var(--cb-fg-*)` / `var(--cb-bg-*)`, the inline spring is replaced by `SPRINGS.smooth` from `@craft-bits/core/motion`, and the per-arrow markers are scoped by `useId` to avoid SVG marker-id collisions when multiple instances render on the same page.