Speculative Decoding Viz
Two horizontal lanes share a wall-clock axis. The standard lane emits one token per targetMs. The speculative lane runs K cheap draft steps (K * draftMs) and then a single large-model verify pass (targetMs); the accepted prefix of the drafted tokens is committed at no extra cost, plus one bonus target-model token per round. When acceptance is high, the speculative lane finishes the same token budget in a fraction of the wall-clock time, and the bar charts and per-token chips make the speedup readable at a glance.
This is the verify-prefix sibling to ContinuousBatchingViz (per-request token-fill timeline). Reach for SpeculativeDecodingViz when the point is throughput per forward pass — that a verified draft amortises one expensive forward pass across many tokens — rather than how a single batch absorbs new arrivals.
Installation
npx shadcn@latest add https://craftbits.dev/r/speculative-decoding-viz.json
Usage
import { SpeculativeDecodingViz } from "@craft-bits/viz/speculative-decoding-viz";
<SpeculativeDecodingViz draftK={5} targetMs={100} draftMs={10} acceptanceRate={0.8} />
Drive the rounds from outside (e.g. let a parent narrate the trace):
<SpeculativeDecodingViz
  rounds={rounds}
  onRoundsChange={setRounds}
  draftK={5}
  maxRounds={4}
/>
Inject a fixed trace for a deterministic figure:
<SpeculativeDecodingViz
  rounds={[
    { drafted: ["the", "cat", "sat", "on", "mat"], nAccepted: 5 },
    { drafted: ["a", "big", "red", "warm", "rug"], nAccepted: 3 },
  ]}
  draftK={5}
/>
Autoplay until the timeline is full:
<SpeculativeDecodingViz playing playSpeed={1000} maxRounds={4} />
Understanding the component
- Two lanes share an axis. The standard lane (top) and speculative lane (bottom) are normalised to the same `max(standardMs, speculativeMs)` wall-clock width, so a shorter speculative bar literally reads as "finishes sooner."
- Standard cost is per token. Each verified-prefix token plus the one bonus target-model token per round pays `targetMs`. The standard lane is `totalTokens * targetMs` wide, with per-token tick marks layered on top to show the granularity.
- Speculative cost is per round. Every round pays `K * draftMs` for the draft pass and a single `targetMs` for verify, regardless of how many drafts were accepted. Within each round the draft sub-bar splits into `K` cells, accepted ones tinted accent and rejected ones desaturated.
- Last round chips. The token chips below the lanes list the drafted strings, colour-coded accepted vs rejected, with a `nAccepted/K` readout, so the speedup story is grounded in actual tokens, not just bars.
- Controlled or uncontrolled rounds. `rounds` follows the Radix pattern: pass `rounds` plus `onRoundsChange` for full control, or `defaultRounds` for an initial trace that the component owns.
- Autoplay with `SPRINGS.smooth`. When `playing` is `true`, the component appends one fresh round every `playSpeed` ms via `window.setInterval`, using `acceptanceRate` to sample each draft token's accept/reject. `prefers-reduced-motion: reduce` short-circuits autoplay and snaps every bar transition to instant.
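The cost model described in the list above can be sketched in a few lines. This is a minimal illustration, not the component's internal code: the `laneTotals` helper and the `CostModel` and `LaneTotals` types are hypothetical names, and the round shape is inferred from the usage examples.

```typescript
// Round shape inferred from the usage examples; not imported from the library.
interface SpeculativeDecodingRound {
  drafted: readonly string[];
  nAccepted: number;
}

// Hypothetical grouping of the three cost-related props.
interface CostModel {
  draftK: number;
  targetMs: number;
  draftMs: number;
}

interface LaneTotals {
  totalTokens: number;
  standardMs: number;
  speculativeMs: number;
  speedup: number;
}

// Hypothetical helper: derives both lane widths from a list of rounds.
function laneTotals(
  rounds: readonly SpeculativeDecodingRound[],
  cost: CostModel,
): LaneTotals {
  // Each round commits its accepted prefix plus one bonus target-model token.
  const totalTokens = rounds.reduce((sum, r) => sum + r.nAccepted + 1, 0);
  // Standard lane: every committed token pays one target forward pass.
  const standardMs = totalTokens * cost.targetMs;
  // Speculative lane: every round pays K draft steps plus one verify pass.
  const speculativeMs =
    rounds.length * (cost.draftK * cost.draftMs + cost.targetMs);
  // Assumption: report a neutral 1x speedup when there are no rounds yet.
  const speedup = speculativeMs > 0 ? standardMs / speculativeMs : 1;
  return { totalTokens, standardMs, speculativeMs, speedup };
}

// The fixed trace from the usage section, with the default costs.
const fixedTrace: SpeculativeDecodingRound[] = [
  { drafted: ["the", "cat", "sat", "on", "mat"], nAccepted: 5 },
  { drafted: ["a", "big", "red", "warm", "rug"], nAccepted: 3 },
];
const totals = laneTotals(fixedTrace, { draftK: 5, targetMs: 100, draftMs: 10 });
// 10 tokens: 1000 ms standard vs 300 ms speculative, a speedup of about 3.33x.
console.log(totals);
```

Under this model the speedup depends only on the accepted counts and the ratio of `draftMs` to `targetMs`, which is why a high acceptance rate with a cheap draft model shortens the speculative bar so visibly.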
Props
| Prop | Type | Default | Description |
|---|---|---|---|
| `draftK` | `number` | `5` | Drafted tokens per speculative round. |
| `targetMs` | `number` | `100` | Cost of one target (large) model forward pass. |
| `draftMs` | `number` | `10` | Cost of one draft (small) model step. |
| `rounds` | `readonly SpeculativeDecodingRound[]` | — | Controlled rounds list. |
| `defaultRounds` | `readonly SpeculativeDecodingRound[]` | `[]` | Uncontrolled initial rounds. |
| `onRoundsChange` | `(rounds) => void` | — | Fires on autoplay tick or external mutation. |
| `acceptanceRate` | `number` | `0.8` | Per-token accept probability for the built-in generator. |
| `maxRounds` | `number` | `4` | Cap on rounds the timeline will hold. |
| `vocab` | `readonly string[]` | built-in | Token vocabulary for the generator. |
| `playing` | `boolean` | `false` | When true, autoplay appends rounds until `maxRounds`. |
| `playSpeed` | `number` | `1200` | Milliseconds between autoplay rounds. Floored at 120 ms. |
| `transition` | `Transition` | `SPRINGS.smooth` | Bar/cell transition. |
| `className` | `string` | — | Merged onto the root via `cn()`. |
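To make the roles of `acceptanceRate` and `vocab` concrete, here is a rough sketch of how a round generator could work. The function name `sampleRound`, the injectable `rng` parameter, and the rule that acceptance stops at the first rejected draft are all assumptions for illustration; the built-in generator may differ.

```typescript
// Round shape inferred from the usage examples; not imported from the library.
interface SpeculativeDecodingRound {
  drafted: readonly string[];
  nAccepted: number;
}

// Hypothetical generator. `rng` must return values in [0, 1); it defaults to
// Math.random and can be replaced with a seeded source for reproducible traces.
function sampleRound(
  draftK: number,
  acceptanceRate: number,
  vocab: readonly string[],
  rng: () => number = Math.random,
): SpeculativeDecodingRound {
  if (vocab.length === 0) {
    throw new Error("vocab must contain at least one token");
  }
  // Draw K draft tokens uniformly from the vocabulary.
  const drafted: string[] = [];
  for (let i = 0; i < draftK; i++) {
    const index = Math.min(vocab.length - 1, Math.floor(rng() * vocab.length));
    drafted.push(vocab[index]);
  }
  // Assumption: only a prefix is committed, so counting stops at the first
  // draft token whose accept roll fails.
  let nAccepted = 0;
  while (nAccepted < draftK && rng() < acceptanceRate) {
    nAccepted++;
  }
  return { drafted, nAccepted };
}

// One random round with the default prop values and a tiny vocabulary.
console.log(sampleRound(5, 0.8, ["the", "cat", "sat", "on", "mat"]));
```

Passing `rng` explicitly keeps the sketch testable; a parent that owns `rounds` could use something like this to build a trace before handing it to the component.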
Accessibility
- The figure is `role="figure"` with a hidden summary that lists round count, total tokens, both wall-clock totals, and the standing speedup, so screen readers hear the comparison whenever props change.
- A polite `aria-live` region announces the latest round count and speedup as autoplay advances.
- Token chips and tick decorations are `aria-hidden`; the textual summary carries the meaning. Colour is never the only signal: accepted vs rejected token chips also differ in opacity, border weight, and the per-round `nAccepted/K` readout.
- Motion respects `prefers-reduced-motion: reduce`: bar transitions snap to instant and autoplay never starts.
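The autoplay rules stated above and in the props table (no autoplay under reduced motion, a stop at `maxRounds`, and a 120 ms floor on `playSpeed`) can be expressed as one small decision function. This is a sketch under those stated rules only; `autoplayIntervalMs`, `AutoplayInputs`, and `MIN_PLAY_SPEED_MS` are hypothetical names, not part of the library.

```typescript
// The 120 ms floor comes from the props table; the constant name is hypothetical.
const MIN_PLAY_SPEED_MS = 120;

interface AutoplayInputs {
  playing: boolean;
  playSpeed: number;
  roundCount: number;
  maxRounds: number;
  reducedMotion: boolean;
}

// Returns the interval to schedule, or null when autoplay should not run.
function autoplayIntervalMs(inputs: AutoplayInputs): number | null {
  // Reduced motion short-circuits autoplay entirely.
  if (!inputs.playing || inputs.reducedMotion) return null;
  // Stop once the timeline already holds maxRounds rounds.
  if (inputs.roundCount >= inputs.maxRounds) return null;
  // Clamp very small playSpeed values up to the floor.
  return Math.max(MIN_PLAY_SPEED_MS, inputs.playSpeed);
}

const interval = autoplayIntervalMs({
  playing: true,
  playSpeed: 50,
  roundCount: 1,
  maxRounds: 4,
  reducedMotion: false,
});
console.log(interval); // 120, because playSpeed is below the floor
```

In a browser, the `reducedMotion` flag would typically come from `window.matchMedia("(prefers-reduced-motion: reduce)").matches`.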
Credits
- Extracted from: `craftingattention` (`app/src/lessons/primitives/viz/SpeculativeDecodingViz.tsx`). The source wrapped the timeline in a three-mode `Widget` (Explore / Predict / Challenge) with bookmarks, undo/redo via `useWidgetHistory`, a labelled slider, per-round narration, score dots, and challenge predicates. The library extract is the pure timeline primitive: standard vs speculative lanes, draft + verify cost model, accept/reject token chips. It is driven entirely by `rounds` plus `acceptanceRate`, with controlled/uncontrolled and play/pause APIs, so callers compose the harness around it.