Speculative Decoding Viz

Two horizontal lanes share a wall-clock axis. The standard lane emits one token per targetMs. The speculative lane runs K cheap draft steps (K * draftMs) followed by a single large-model verify pass (targetMs). The accepted prefix of the drafted tokens is committed at no extra target-model cost, plus one bonus target-model token per round. When acceptance is high, the speculative lane finishes the same token budget in a fraction of the wall-clock time, and the bar charts and per-token chips make the speedup readable at a glance.
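The cost model above can be worked through by hand. The sketch below is not the component's source: the Round shape is inferred from the usage examples, and wallClockTotals and CostModel are hypothetical names introduced only for illustration.

```typescript
// Shape inferred from the usage examples; the library's own type may carry more fields.
type Round = { drafted: readonly string[]; nAccepted: number };

interface CostModel {
  draftK: number;   // drafted tokens per round
  targetMs: number; // one target (large) model forward pass
  draftMs: number;  // one draft (small) model step
}

// Hypothetical helper, not part of the component's API.
function wallClockTotals(rounds: readonly Round[], cost: CostModel) {
  // Each round commits its accepted prefix plus one bonus target-model token.
  const totalTokens = rounds.reduce((sum, r) => sum + r.nAccepted + 1, 0);
  // Standard decoding pays one target forward pass per token.
  const standardMs = totalTokens * cost.targetMs;
  // Speculative decoding pays K draft steps plus one verify pass per round,
  // regardless of how many drafts were accepted.
  const speculativeMs = rounds.length * (cost.draftK * cost.draftMs + cost.targetMs);
  const speedup = speculativeMs > 0 ? standardMs / speculativeMs : 1;
  return { totalTokens, standardMs, speculativeMs, speedup };
}

const totals = wallClockTotals(
  [
    { drafted: ["the", "cat", "sat", "on", "mat"], nAccepted: 5 },
    { drafted: ["a", "big", "red", "warm", "rug"], nAccepted: 3 },
  ],
  { draftK: 5, targetMs: 100, draftMs: 10 },
);
console.log(totals);
```

For this two-round trace the standard lane needs 10 * 100 = 1000 ms, while the speculative lane needs 2 * (5 * 10 + 100) = 300 ms, a speedup of roughly 3.3x.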

This is the verify-prefix sibling to ContinuousBatchingViz (per-request token-fill timeline). Reach for SpeculativeDecodingViz when the point is throughput per forward pass — that a verified draft amortises one expensive forward pass across many tokens — rather than how a single batch absorbs new arrivals.

Installation

npx shadcn@latest add https://craftbits.dev/r/speculative-decoding-viz.json

Usage

import { SpeculativeDecodingViz } from "@craft-bits/viz/speculative-decoding-viz";
 
<SpeculativeDecodingViz draftK={5} targetMs={100} draftMs={10} acceptanceRate={0.8} />

Drive the rounds from outside (e.g. let a parent narrate the trace):

<SpeculativeDecodingViz
  rounds={rounds}
  onRoundsChange={setRounds}
  draftK={5}
  maxRounds={4}
/>
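When the parent owns rounds, it can derive narration text from the same array it passes down. A minimal sketch, assuming the round shape shown in the fixed-trace example; narrateRound is a hypothetical helper, not something the library exports.

```typescript
// Shape inferred from the usage examples; the library's own type may differ.
type Round = { drafted: readonly string[]; nAccepted: number };

// Hypothetical narration helper a parent could run on the rounds it owns.
function narrateRound(round: Round, index: number): string {
  const k = round.drafted.length;
  // The accepted tokens are always a prefix of the drafted tokens.
  const accepted = round.drafted.slice(0, round.nAccepted).join(" ");
  // Accepted prefix plus the one bonus target-model token.
  const committed = round.nAccepted + 1;
  const prefix = round.nAccepted > 0 ? `"${accepted}"` : "nothing";
  return `Round ${index + 1}: accepted ${round.nAccepted}/${k} drafts (${prefix}), committed ${committed} tokens.`;
}

const line = narrateRound(
  { drafted: ["a", "big", "red", "warm", "rug"], nAccepted: 3 },
  1,
);
console.log(line);
```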

Inject a fixed trace for a deterministic figure:

<SpeculativeDecodingViz
  rounds={[
    { drafted: ["the", "cat", "sat", "on", "mat"], nAccepted: 5 },
    { drafted: ["a", "big", "red", "warm", "rug"], nAccepted: 3 },
  ]}
  draftK={5}
/>

Autoplay until the timeline is full:

<SpeculativeDecodingViz playing playSpeed={1000} maxRounds={4} />

Understanding the component

  1. Two lanes share an axis. The standard lane (top) and speculative lane (bottom) are normalised to the same max(standardMs, speculativeMs) wall-clock width, so a shorter speculative bar literally reads as "finishes sooner."
  2. Standard cost is per token. Every committed token (each round's accepted prefix plus its one bonus target-model token) pays a full targetMs. The standard lane is therefore totalTokens * targetMs wide, with per-token tick marks layered on top to show the granularity.
  3. Speculative cost is per round. Every round pays K * draftMs for the draft pass and a single targetMs for verify, regardless of how many drafts were accepted. Within each round the draft sub-bar splits into K cells, accepted ones tinted accent and rejected ones desaturated.
  4. Last-round chips. The token chips below the lanes list the drafted strings, colour-coded accepted vs rejected, with an nAccepted/K readout, so the speedup story is grounded in actual tokens, not just bars.
  5. Controlled or uncontrolled rounds. rounds follows the Radix pattern: pass rounds plus onRoundsChange for full control, or defaultRounds for an initial trace that the component owns.
  6. Autoplay with SPRINGS.smooth. When playing is true, the component appends one fresh round every playSpeed ms via window.setInterval until maxRounds is reached, using acceptanceRate to sample each draft token's accept/reject. Bars and cells animate with the transition prop, which defaults to SPRINGS.smooth. Under prefers-reduced-motion: reduce, autoplay never starts and every bar transition snaps to instant.
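The sampling step in item 6 can be sketched as follows. This is an illustration under stated assumptions, not the component's generator: it assumes tokens are drawn uniformly from a non-empty vocab and that the first rejected draft ends the accepted prefix. sampleRound and replay are hypothetical names, and rng is injectable so the output can be made reproducible.

```typescript
type Round = { drafted: string[]; nAccepted: number };

// Hypothetical generator; assumes `vocab` is non-empty and `rng` returns values in [0, 1).
function sampleRound(
  draftK: number,
  acceptanceRate: number,
  vocab: readonly string[],
  rng: () => number = Math.random,
): Round {
  // Draw K draft tokens uniformly from the vocabulary.
  const drafted: string[] = [];
  for (let i = 0; i < draftK; i++) {
    drafted.push(vocab[Math.floor(rng() * vocab.length)]);
  }
  // Accept tokens left to right; the first rejection ends the accepted prefix.
  let nAccepted = 0;
  while (nAccepted < draftK && rng() < acceptanceRate) {
    nAccepted++;
  }
  return { drafted, nAccepted };
}

// Deterministic stub that replays a fixed sequence of draws.
function replay(values: readonly number[]): () => number {
  let i = 0;
  return () => values[i++ % values.length];
}

// Three token draws (0.0, 0.3, 0.99), then accept draws (0.1, 0.5, 0.9).
const round = sampleRound(
  3,
  0.8,
  ["the", "cat", "sat", "mat"],
  replay([0.0, 0.3, 0.99, 0.1, 0.5, 0.9]),
);
console.log(round);
```

With the replayed draws, the first two drafts are accepted and the third draw (0.9) fails the 0.8 threshold, so the round ends with nAccepted equal to 2.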

Props

| Prop | Type | Default | Description |
| --- | --- | --- | --- |
| draftK | number | 5 | Drafted tokens per speculative round. |
| targetMs | number | 100 | Cost of one target (large) model forward pass. |
| draftMs | number | 10 | Cost of one draft (small) model step. |
| rounds | readonly SpeculativeDecodingRound[] | - | Controlled rounds list. |
| defaultRounds | readonly SpeculativeDecodingRound[] | [] | Uncontrolled initial rounds. |
| onRoundsChange | (rounds) => void | - | Fires on autoplay tick or external mutation. |
| acceptanceRate | number | 0.8 | Per-token accept probability for the built-in generator. |
| maxRounds | number | 4 | Cap on rounds the timeline will hold. |
| vocab | readonly string[] | built-in | Token vocabulary for the generator. |
| playing | boolean | false | When true, autoplay appends rounds until maxRounds. |
| playSpeed | number | 1200 | Milliseconds between autoplay rounds. Floored at 120 ms. |
| transition | Transition | SPRINGS.smooth | Bar/cell transition. |
| className | string | - | Merged onto the root via cn(). |
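For a rough sense of how acceptanceRate and draftK interact with the cost props, the long-run speedup can be estimated in closed form. The sketch below assumes each draft token is accepted independently with probability p and that the first rejection ends the accepted prefix; expectedTokensPerRound and expectedSpeedup are hypothetical helpers, not part of the component.

```typescript
// Expected tokens committed per round under the independence assumption above.
function expectedTokensPerRound(draftK: number, p: number): number {
  // P(at least i drafts accepted) = p^i, so E[nAccepted] = p + p^2 + ... + p^K.
  // The i = 0 term contributes the one bonus target-model token.
  let expected = 0;
  for (let i = 0; i <= draftK; i++) {
    expected += Math.pow(p, i);
  }
  return expected;
}

// Long-run ratio of standard wall-clock time to speculative wall-clock time.
function expectedSpeedup(
  draftK: number,
  p: number,
  targetMs: number,
  draftMs: number,
): number {
  // Standard decoding pays targetMs for every token a round would commit.
  const standardMs = expectedTokensPerRound(draftK, p) * targetMs;
  // A speculative round always pays K draft steps plus one verify pass.
  const speculativeMs = draftK * draftMs + targetMs;
  return standardMs / speculativeMs;
}

const tokens = expectedTokensPerRound(5, 0.8);
const speedup = expectedSpeedup(5, 0.8, 100, 10);
console.log(tokens.toFixed(2), speedup.toFixed(2));
```

With the default props (draftK 5, acceptanceRate 0.8, targetMs 100, draftMs 10) this gives about 3.69 tokens per round and a speedup of about 2.46x. Individual traces will vary around that figure because each round is sampled.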

Accessibility

  • The figure is role="figure" with a hidden summary that lists round count, total tokens, both wall-clock totals, and the standing speedup — screen readers hear the comparison whenever props change.
  • A polite aria-live region announces the latest round count and speedup as autoplay advances.
  • Token chips and tick decorations are aria-hidden — the textual summary carries the meaning. Colour is never the only signal: accepted vs rejected token chips also differ in opacity, border weight, and the per-round nAccepted/K readout.
  • Motion respects prefers-reduced-motion: reduce — bar transitions snap to instant and autoplay never starts.

Credits

  • Extracted from: craftingattention (app/src/lessons/primitives/viz/SpeculativeDecodingViz.tsx). The source wrapped the timeline in a three-mode Widget (Explore / Predict / Challenge) with bookmarks, undo/redo via useWidgetHistory, a labelled slider, per-round narration, score dots, and challenge predicates. The library extract is the pure timeline primitive: standard vs speculative lanes, draft + verify cost model, accept/reject token chips — driven entirely by rounds plus acceptanceRate, with controlled/uncontrolled and play/pause APIs, so callers compose the harness around it.