Tensor Parallelism Viz

A diagram of one forward pass through a weight matrix that has been sharded across multiple GPUs. The input vector X feeds every shard at once; each GPU multiplies X by its column slice in parallel; an all-reduce sums the partial results; the final output vector Y appears on the right.

The component walks four phases (idle, compute, allreduce, done), driven either by its own autoplay loop or by the caller through the controlled phase, active-GPU, and playing props.
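The phase sequence can be pictured as a small state machine. The sketch below is illustrative only: the phase names come from this page, but `nextPhase` and `walkPass` are hypothetical helpers, and stopping at done (rather than looping back to idle) is an assumption about the autoplay loop.

```typescript
// Phase names as listed on this page; the stepping rules are an assumption.
type TensorParallelismPhase = "idle" | "compute" | "allreduce" | "done";

// Hypothetical helper: the phase that follows `phase` for a given GPU count.
function nextPhase(
  phase: TensorParallelismPhase,
  numGpus: number,
): TensorParallelismPhase {
  switch (phase) {
    case "idle":
      return "compute";
    case "compute":
      // With a single GPU there is nothing to reduce, so skip to done.
      return numGpus > 1 ? "allreduce" : "done";
    case "allreduce":
      return "done";
    case "done":
      // Assumption: the pass stops here; a real loop might restart at idle.
      return "done";
  }
}

// Walk one full pass from idle and collect the phases visited.
function walkPass(numGpus: number): TensorParallelismPhase[] {
  const visited: TensorParallelismPhase[] = ["idle"];
  while (visited[visited.length - 1] !== "done") {
    visited.push(nextPhase(visited[visited.length - 1], numGpus));
  }
  return visited;
}
```

Under these assumptions, a four-GPU pass visits all four phases, while a single-GPU pass skips allreduce.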

Installation

npx shadcn@latest add https://craftbits.dev/r/tensor-parallelism-viz.json

Usage

import { TensorParallelismViz } from "@craft-bits/core";
 
<TensorParallelismViz numGpus={4} />

Autoplay the full forward pass:

<TensorParallelismViz numGpus={4} defaultPlaying />

Drive the phase from outside (e.g. wire it to a scrubber):

import { useState } from "react";

const [phase, setPhase] = useState<TensorParallelismPhase>("idle");
 
<TensorParallelismViz
  numGpus={4}
  phase={phase}
  onPhaseChange={setPhase}
  activeGpu={1}
/>
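One way to wire the controlled phase to a scrubber is to map a slider position onto the four phase names. The helpers below are a hypothetical sketch, not library exports; the local `TensorParallelismPhase` type simply mirrors the phase names listed on this page.

```typescript
// Local mirror of the phase names; assumed to match the library's type.
type TensorParallelismPhase = "idle" | "compute" | "allreduce" | "done";

const PHASES: readonly TensorParallelismPhase[] = [
  "idle",
  "compute",
  "allreduce",
  "done",
];

// Hypothetical helper: map a slider value (0..3) to a phase.
// Non-finite input falls back to the first phase; out-of-range values are clamped.
function phaseFromScrubber(value: number): TensorParallelismPhase {
  if (!Number.isFinite(value)) return PHASES[0];
  const index = Math.min(PHASES.length - 1, Math.max(0, Math.round(value)));
  return PHASES[index];
}

// Hypothetical helper: map a phase back to its slider position.
function scrubberFromPhase(phase: TensorParallelismPhase): number {
  return PHASES.indexOf(phase);
}
```

A range input with min 0 and max 3 could then pass `scrubberFromPhase(phase)` as its value and call `setPhase(phaseFromScrubber(Number(event.target.value)))` in its change handler.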

Understanding the component

Input vector X. The left column of cells is the input activation. It sits at low opacity while idle and brightens once the forward pass starts so the eye follows the data into the matrix.
Sharded weight matrix W. The component splits the matrix into numGpus column shards. Every shard renders in the same accent colour but is offset horizontally by a few pixels so the seam between shards is visible. The data-shard attribute on each cell lets you target a specific shard from external CSS.
Compute phase. While the component is in the compute phase, the cells of the currently active shard glow at high opacity and gain a thicker border. Setting activeGpu to -1 highlights every shard at once — useful when narrating "this happens on every GPU in parallel".
All-reduce. When more than one GPU is in play, a small all-reduce node appears between the matrix and the output during the all-reduce phase. It picks up the accent colour while reducing, then switches to the success colour once the result is in.
Output vector Y. In the done phase, the output column animates in from the right, scaling up from 0.8 to 1 with a small stagger per row.
Controlled or uncontrolled. Phase, active GPU, and playing all follow the Radix pattern — pass the controlled prop plus its onChange callback for controlled mode, or rely on the default* variants for self-driven autoplay.
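The arithmetic behind the picture can be sketched in a few lines. This is a minimal sketch, assuming the diagram depicts Y = W·X with W split by columns, so that each shard multiplies its columns by the matching entries of X and contributes a full-length partial vector; under that reading an element-wise sum is the correct combining step. All function names are illustrative and are not part of the component.

```typescript
type Matrix = number[][]; // rows × cols, stored row by row

// Dense reference: y[r] = sum over c of W[r][c] * x[c].
function matVec(W: Matrix, x: number[]): number[] {
  return W.map((row) => row.reduce((acc, w, c) => acc + w * x[c], 0));
}

// Partial product for one shard: only columns [start, end) contribute.
function shardPartial(
  W: Matrix,
  x: number[],
  start: number,
  end: number,
): number[] {
  return W.map((row) => {
    let acc = 0;
    for (let c = start; c < end; c++) acc += row[c] * x[c];
    return acc;
  });
}

// Simulated all-reduce: element-wise sum of every shard's partial vector.
function allReduceSum(partials: number[][]): number[] {
  return partials[0].map((_, r) => partials.reduce((acc, p) => acc + p[r], 0));
}

// Split the columns across numGpus shards, compute partials, then reduce.
function shardedMatVec(W: Matrix, x: number[], numGpus: number): number[] {
  const cols = x.length;
  const width = Math.floor(cols / numGpus);
  const partials: number[][] = [];
  for (let g = 0; g < numGpus; g++) {
    const start = g * width;
    // Assumption: the last shard absorbs any leftover columns.
    const end = g === numGpus - 1 ? cols : start + width;
    partials.push(shardPartial(W, x, start, end));
  }
  return allReduceSum(partials);
}
```

For any shard count of at least one, `shardedMatVec` should agree with `matVec`, which is the property the all-reduce step in the diagram stands for.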

Variants

A single GPU collapses the layout — no shard seams, no all-reduce node, and the pass jumps straight from compute to done:

<TensorParallelismViz numGpus={1} defaultPlaying />

An eight-GPU split with a wider matrix:

<TensorParallelismViz numGpus={8} cols={16} defaultPlaying />

Pause on a specific phase for a screenshot, with one GPU highlighted:

<TensorParallelismViz numGpus={4} phase="compute" activeGpu={2} />

Drop the GPU labels when the surrounding caption already names the shards:

<TensorParallelismViz numGpus={4} showGpuLabels={false} />

Props

Prop	Type	Default	Description
`numGpus`	`number`	`2`	Number of column shards / GPUs.
`rows`	`number`	`8`	Rows in the weight matrix.
`cols`	`number`	`8`	Columns in the weight matrix. Each shard gets `floor(cols / numGpus)` columns; any remainder goes to the last shard.
`phase`	`TensorParallelismPhase`	—	Controlled phase.
`defaultPhase`	`TensorParallelismPhase`	`"idle"`	Uncontrolled initial phase.
`onPhaseChange`	`(phase) => void`	—	Fires when the phase advances.
`activeGpu`	`number`	—	Controlled active-GPU index during compute. `-1` highlights every GPU.
`defaultActiveGpu`	`number`	`-1`	Uncontrolled initial active GPU.
`onActiveGpuChange`	`(idx) => void`	—	Fires when the active GPU changes.
`playing`	`boolean`	—	Controlled autoplay state.
`defaultPlaying`	`boolean`	`false`	Uncontrolled initial autoplay state.
`onPlayingChange`	`(playing) => void`	—	Fires when play / pause flips.
`playSpeed`	`number`	`420`	Milliseconds between phase advances. Values below 80 ms are clamped to 80 ms.
`showGpuLabels`	`boolean`	`true`	Render the per-shard `GPU N` labels under the matrix.
`transition`	`Transition`	`SPRINGS.smooth`	Override for cell-fill / label transitions.
`className`	`string`	—	Merged onto the root via `cn()`.
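The `cols` and `playSpeed` rules in the table can be read as simple arithmetic. The sketch below shows one plausible reading: each shard gets floor(cols / numGpus) columns, the last shard absorbs the remainder, and play speed has an 80 ms minimum. `shardWidths` and `clampPlaySpeed` are hypothetical names, not library exports, and the sketch assumes numGpus is at least 1.

```typescript
// Hypothetical helper: number of columns assigned to each shard.
function shardWidths(cols: number, numGpus: number): number[] {
  const base = Math.floor(cols / numGpus);
  const widths = new Array<number>(numGpus).fill(base);
  // Leftover columns (if any) go to the last shard.
  widths[numGpus - 1] += cols - base * numGpus;
  return widths;
}

// Minimum interval taken from the playSpeed row of the props table.
const MIN_PLAY_SPEED_MS = 80;

// Hypothetical helper: raise too-small intervals to the minimum.
function clampPlaySpeed(ms: number): number {
  return Math.max(MIN_PLAY_SPEED_MS, ms);
}
```

With the defaults shown earlier (`cols={8}`, `numGpus={4}`), this reading gives four shards of two columns each.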

Accessibility

The root element has role="figure" and a visually hidden summary of the GPU count and matrix shape, so screen readers hear the configuration whenever those props change.
A polite aria-live region announces the current phase and active GPU, so non-sighted users follow the same pass as sighted ones.
The SVG itself is aria-hidden. Colour is never the only signal: phase chips in the footer carry textual labels (idle, compute, allreduce, done) alongside the active dot.
Motion respects prefers-reduced-motion: reduce — cell fades collapse to instant swaps and autoplay never starts.
data-phase on the root and data-shard / data-col-in-shard on each weight cell expose state to CSS without resorting to className toggles.
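Callers that render their own play button may want to mirror the reduced-motion rule, for example by hiding the button when autoplay would never start. The helper below is a hypothetical sketch, not part of the library; it relies only on the standard `matchMedia` API and takes the environment as a parameter so it also works where no `window` exists.

```typescript
// Minimal shape of the part of window.matchMedia this sketch relies on.
type MatchMediaEnv = {
  matchMedia?: (query: string) => { matches: boolean };
};

// Hypothetical helper: true when the user has asked for reduced motion.
// Returns false when matchMedia is unavailable (e.g. during server rendering).
function prefersReducedMotion(
  env: MatchMediaEnv = globalThis as unknown as MatchMediaEnv,
): boolean {
  if (typeof env.matchMedia !== "function") return false;
  // Called as a method so `this` stays bound to the environment object.
  return env.matchMedia("(prefers-reduced-motion: reduce)").matches;
}
```

In a browser, calling `prefersReducedMotion()` with no argument reads the user's current setting from the global object.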

Credits

Extracted from: craftingattention (app/src/lessons/primitives/viz/TensorParallelismViz.tsx). The source wrapped the diagram in a three-mode Widget (Explore / Predict / Challenge) with bookmarks, undo / redo via useWidgetHistory, score dots, narration that timed itself to a fixed 100 / 50 / 25 ms latency table, and a stacked timing-comparison bar. The library extract is the pure diagram primitive — input vector, sharded weight matrix, all-reduce node, output vector — driven entirely by props with controlled / uncontrolled and play / pause APIs. Latency comparisons, narration, prediction prompts, and challenge framing belong in lesson code, not in the library primitive.