Tensor Parallelism Viz

A diagram of one forward pass through a weight matrix that has been sharded across multiple GPUs. The input vector X feeds every shard at once; each GPU multiplies X by its column slice in parallel; an all-reduce sums the partial results; the final output vector Y appears on the right.

The component walks four phases — idle, compute, allreduce, done — driven either by its own autoplay loop or by the caller (controlled phase, active GPU, and play state).

Installation

npx shadcn@latest add https://craftbits.dev/r/tensor-parallelism-viz.json

Usage

import { TensorParallelismViz } from "@craft-bits/core";
 
<TensorParallelismViz numGpus={4} />

Autoplay the full forward pass:

<TensorParallelismViz numGpus={4} defaultPlaying />

Drive the phase from outside (e.g. wire it to a scrubber):

const [phase, setPhase] = useState<TensorParallelismPhase>("idle");
 
<TensorParallelismViz
  numGpus={4}
  phase={phase}
  onPhaseChange={setPhase}
  activeGpu={1}
/>
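
One way to wire the scrubber mentioned above is to map a range input's integer position to a phase name. The sketch below is only an illustration: the local `Phase` type stands in for the library's `TensorParallelismPhase` (assuming the four phase names listed on this page are its only members), and `PHASES`, `phaseFromSlider`, and `sliderFromPhase` are hypothetical names that are not part of the library.

```typescript
// Local stand-in for the library's TensorParallelismPhase type (assumption:
// the four phase names documented on this page are its only members).
type Phase = "idle" | "compute" | "allreduce" | "done";

// Phases in the order the component walks them.
const PHASES: readonly Phase[] = ["idle", "compute", "allreduce", "done"];

// Hypothetical helper: convert a slider position to a phase, rounding
// fractional values and clamping anything out of range to a valid index.
function phaseFromSlider(position: number): Phase {
  if (!Number.isFinite(position)) return PHASES[0];
  const index = Math.min(PHASES.length - 1, Math.max(0, Math.round(position)));
  return PHASES[index];
}

// Hypothetical inverse: the slider position that corresponds to a phase.
function sliderFromPhase(phase: Phase): number {
  return PHASES.indexOf(phase);
}
```

A range input with min 0 and max 3 could then call `setPhase(phaseFromSlider(Number(event.target.value)))` in its change handler and use `sliderFromPhase(phase)` as its value.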

Understanding the component

  1. Input vector X. The left column of cells is the input activation. It sits at low opacity while idle and brightens once the forward pass starts so the eye follows the data into the matrix.
  2. Sharded weight matrix W. The component splits the matrix into numGpus column shards. Every shard renders in the same accent colour but is offset horizontally by a few pixels so the seams between shards are visible. The data-shard attribute on each cell lets you target a specific shard from external CSS.
  3. Compute phase. While the component is in the compute phase, the cells of the currently active shard glow at high opacity and gain a thicker border. Setting activeGpu to -1 highlights every shard at once — useful when narrating "this happens on every GPU in parallel".
  4. All-reduce. When more than one GPU is in play, a small all-reduce node appears between the matrix and the output during the all-reduce phase. It picks up the accent colour while reducing, then switches to the success colour once the result is in.
  5. Output vector Y. On done, the output column animates in from the right, scaled up from 0.8 to 1 with a tiny stagger per row.
  6. Controlled or uncontrolled. Phase, active GPU, and playing all follow the Radix pattern — pass the controlled prop plus its onChange callback for controlled mode, or rely on the default* variants for self-driven autoplay.
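
The list above describes the visual; the arithmetic it stands for can be sketched as follows. This is a hypothetical numeric model, not code from the library. `shardedForward` is an illustrative name, each GPU's partial result is modelled as a full-length vector that is zero outside its own columns so that an elementwise sum plays the role of the all-reduce, and the rule that the last shard absorbs leftover columns is an assumption taken from the `cols` prop description.

```typescript
// Hypothetical numeric model of the pass the diagram depicts; not library code.
// w is a rows x cols matrix, x has one entry per row, y has one entry per column.
function shardedForward(x: number[], w: number[][], numGpus: number): number[] {
  const cols = w[0].length;
  const base = Math.floor(cols / numGpus); // columns per shard

  // Compute phase: each GPU fills only the output columns it owns.
  const partials: number[][] = [];
  for (let gpu = 0; gpu < numGpus; gpu++) {
    const start = gpu * base;
    // Assumption: the last shard also takes any remainder columns.
    const end = gpu === numGpus - 1 ? cols : start + base;
    const partial = new Array<number>(cols).fill(0);
    for (let c = start; c < end; c++) {
      for (let r = 0; r < x.length; r++) {
        partial[c] += x[r] * w[r][c];
      }
    }
    partials.push(partial);
  }

  // All-reduce phase: elementwise sum of every GPU's partial vector.
  const y = new Array<number>(cols).fill(0);
  for (const partial of partials) {
    for (let c = 0; c < cols; c++) {
      y[c] += partial[c];
    }
  }
  return y;
}
```

Under these assumptions the result is the same for any shard count, which is the property the diagram is meant to convey: splitting the work changes where it runs, not what it computes.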

Variants

A single GPU collapses the layout — no shard seams, no all-reduce node, and the pass jumps straight from compute to done:

<TensorParallelismViz numGpus={1} defaultPlaying />
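
The phase order, including the single-GPU shortcut, can be sketched as a small transition function. `nextPhase` and the local `Phase` type are hypothetical and only mirror the behaviour described on this page; in particular, staying on done once it is reached is an assumption of this sketch, and the component's own autoplay loop may behave differently at that point.

```typescript
// Local stand-in for the library's TensorParallelismPhase type (assumption).
type Phase = "idle" | "compute" | "allreduce" | "done";

// Hypothetical helper: the phase that follows `phase` for a given GPU count.
function nextPhase(phase: Phase, numGpus: number): Phase {
  if (phase === "idle") return "compute";
  // With a single GPU there is nothing to reduce, so all-reduce is skipped.
  if (phase === "compute") return numGpus > 1 ? "allreduce" : "done";
  // Assumption: both allreduce and done lead to done (the sketch does not loop).
  return "done";
}
```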

An eight-GPU split with a wider matrix:

<TensorParallelismViz numGpus={8} cols={16} defaultPlaying />

Pause on a specific phase for a screenshot, with one GPU highlighted:

<TensorParallelismViz numGpus={4} phase="compute" activeGpu={2} />

Drop the GPU labels when the surrounding caption already names the shards:

<TensorParallelismViz numGpus={4} showGpuLabels={false} />

Props

| Prop | Type | Default | Description |
| --- | --- | --- | --- |
| numGpus | number | 2 | Number of column shards / GPUs. |
| rows | number | 8 | Rows in the weight matrix. |
| cols | number | 8 | Columns in the weight matrix. Floored to a multiple of numGpus shards; remainder goes to the last shard. |
| phase | TensorParallelismPhase | | Controlled phase. |
| defaultPhase | TensorParallelismPhase | "idle" | Uncontrolled initial phase. |
| onPhaseChange | (phase) => void | | Fires when the phase advances. |
| activeGpu | number | | Controlled active-GPU index during compute. -1 highlights every GPU. |
| defaultActiveGpu | number | -1 | Uncontrolled initial active GPU. |
| onActiveGpuChange | (idx) => void | | Fires when the active GPU changes. |
| playing | boolean | | Controlled autoplay state. |
| defaultPlaying | boolean | false | Uncontrolled initial autoplay state. |
| onPlayingChange | (playing) => void | | Fires when play / pause flips. |
| playSpeed | number | 420 | Milliseconds between phase advances. Floored at 80 ms. |
| showGpuLabels | boolean | true | Render the per-shard GPU N labels under the matrix. |
| transition | Transition | SPRINGS.smooth | Override for cell-fill / label transitions. |
| className | string | | Merged onto the root via cn(). |
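
Two of the rules in the props table are easier to see with numbers. The helpers below are hypothetical and not part of the library. `shardWidths` encodes one possible reading of the `cols` rule (every shard gets floor(cols / numGpus) columns and the last shard also takes whatever is left over), which is an assumption about the component's actual behaviour. `effectivePlaySpeed` applies the documented 80 ms floor. Both assume `numGpus` is at least 1.

```typescript
// Hypothetical helpers illustrating two rules from the props table; not library code.

// Assumed reading of the `cols` rule: each shard gets floor(cols / numGpus)
// columns, and the last shard additionally takes the remainder.
function shardWidths(cols: number, numGpus: number): number[] {
  const base = Math.floor(cols / numGpus);
  const widths = new Array<number>(numGpus).fill(base);
  widths[numGpus - 1] += cols - base * numGpus;
  return widths;
}

// `playSpeed` is documented as floored at 80 ms.
function effectivePlaySpeed(playSpeed: number): number {
  return Math.max(80, playSpeed);
}
```

Under this reading, `shardWidths(8, 4)` gives four shards of two columns each, while `shardWidths(10, 4)` gives three shards of two columns and a last shard of four.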

Accessibility

  • The root element has role="figure" and a hidden summary listing the GPU count and matrix shape, so screen readers hear the configuration whenever props change.
  • A polite aria-live region announces the current phase and active GPU, so non-sighted users follow the same pass as sighted ones.
  • The SVG itself is aria-hidden. Colour is never the only signal: phase chips in the footer carry textual labels (idle, compute, allreduce, done) alongside the active dot.
  • Motion respects prefers-reduced-motion: reduce — cell fades collapse to instant swaps and autoplay never starts.
  • data-phase on the root and data-shard / data-col-in-shard on each weight cell expose state to CSS without resorting to className toggles.
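
The data attributes in the last bullet can be combined into a single selector that matches one shard's cells only during a given phase. The helper below is a hypothetical sketch: `shardCellSelector` is not a library function, and it assumes that `data-phase` holds the phase name as plain text and that `data-shard` holds a zero-based shard index, neither of which this page confirms.

```typescript
// Local stand-in for the library's TensorParallelismPhase type (assumption).
type Phase = "idle" | "compute" | "allreduce" | "done";

// Hypothetical helper: build a CSS selector for the cells of one shard while
// the root is in a given phase. Assumes data-phase stores the phase name and
// data-shard stores a zero-based index.
function shardCellSelector(phase: Phase, shard: number): string {
  return `[data-phase="${phase}"] [data-shard="${shard}"]`;
}
```

The resulting string can be pasted into a stylesheet rule or passed to `querySelectorAll` to style or inspect those cells without touching class names.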

Credits

  • Extracted from: craftingattention (app/src/lessons/primitives/viz/TensorParallelismViz.tsx). The source wrapped the diagram in a three-mode Widget (Explore / Predict / Challenge) with bookmarks, undo / redo via useWidgetHistory, score dots, narration that timed itself to a fixed 100 / 50 / 25 ms latency table, and a stacked timing-comparison bar. The library extract is the pure diagram primitive — input vector, sharded weight matrix, all-reduce node, output vector — driven entirely by props with controlled / uncontrolled and play / pause APIs. Latency comparisons, narration, prediction prompts, and challenge framing belong in lesson code, not in the library primitive.