Tokenizer Viz

A three-mode BPE tokenizer playground. Type into the textarea or pick one of six presets (English, rare long word, CJK, JSON, Python) and watch the text shatter into subword tokens, each chip coloured from the semantic palette. Stat readouts surface token count, character count, chars-per-token compression, and an approximate dollar cost at a configurable rate per million input tokens.

Three modes:

  • Explore — free-form input plus six presets that demonstrate compression edge cases.
  • Predict — five rounds where you guess the token count for a string before the reveal.
  • Challenge — four multiple-choice diagnostics on tokenization trade-offs (CJK vs English, JSON overhead, whitespace cost, common-word compression).
Common English words are typically 1 token each. This 11-character text compresses to just 2 tokens — excellent efficiency.
11 of 500 characters
Hello␣world
2tokens11characters5.5chars/token≈ <$0.0001at $3/M input tokens

Common English words are typically 1 token each. This 11-character text compresses to just 2 tokens — excellent efficiency.

Customize
Tokenizer
$3.00
500 chars

Installation

npx shadcn@latest add https://craftbits.dev/r/tokenizer-viz.json

Usage

import { TokenizerViz } from "@craft-bits/viz/tokenizer-viz";
 
<TokenizerViz />

Bring your own presets and quiz rounds:

<TokenizerViz
  presets={[
    {
      label: "My example",
      text: "tokens are fun",
      tokens: ["tokens", " are", " fun"],
      narration: "Common short words map cleanly to single tokens.",
    },
  ]}
  predictRounds={[
    {
      text: "Hello world",
      tokens: ["Hello", " world"],
      options: [1, 2, 5, 11],
      correctIdx: 1,
      explanation: "Common English words map to single tokens.",
    },
  ]}
/>

Set your model's actual rate so the cost readout matches your bill:

<TokenizerViz costPerMillionTokens={0.5} />

Subscribe to score events:

<TokenizerViz
  onModeComplete={({ mode, correct, total }) => {
    /* lift the totals into your own analytics */
  }}
/>

Understanding the component

  1. Mode tabs. Three semantic-coloured pill buttons drive data-mode on the root and crossfade between sub-views. Each switch resets per-mode state so a fresh predict run never inherits a stale answer.
  2. Token chips. The explore-mode chip row is a popLayout AnimatePresence: each chip pops in with a tiny blur-out spring tinted by its index into the six-slot semantic colour wheel (cb-accent / cb-warning / cb-success / cb-info / cb-error / cb-fg-muted). Hover scales the chip by 5 %; the title attribute exposes the literal token text and length.
  3. Stat row. Token count, character count, chars-per-token, and cost are each wrapped in their own AnimatePresence mode="wait" so only the value that changed flips — preventing the whole row from re-mounting on every keystroke. Numbers use font-variant-numeric: tabular-nums.
  4. Heuristic tokenizer. The exported tokenizeWithPresets helper short-circuits on exact-preset matches (so the demo stays deterministic) and otherwise runs a teaching-grade BPE approximation: common words map to single tokens, rare words shatter at vowel-consonant boundaries, CJK characters are one-per-token, punctuation is its own token. It is not a real tiktoken implementation — it reproduces the qualitative behaviour learners care about without bundling a vocab.
  5. Predict mode. Each round shows the text in a monospace box plus four numeric options. After checking, the correct option scale-bumps, the wrong one shake-x's, the full token breakdown reveals beneath, and the explanation slides in.
  6. Reduced motion. Under prefers-reduced-motion: reduce, every entrance animation, the chip pop-in, the wrong-answer shake, the correct-answer scale-bump, and the mode crossfade collapse to instant transitions.

Props

PropTypeDefaultDescription
presetsreadonly TokenizerVizPreset[]6 presetsDrop-in strings for the explore-mode chip bar.
predictRoundsreadonly TokenizerVizPredictRound[]5 rounds"How many tokens?" MCQ rounds.
challengeRoundsreadonly TokenizerVizChallengeRound[]4 roundsMultiple-choice diagnostics on tokenization trade-offs.
defaultMode"explore" | "predict" | "challenge""explore"Mode visible on first render.
maxInputLengthnumber500Hard cap on the explore-mode textarea.
costPerMillionTokensnumber3Dollars per million input tokens.
transitionTransitionSPRINGS.snapOverride the entrance spring for tokens and stat readouts.
onModeChange(mode) => voidFires when the active mode changes.
onInputChange(text: string) => voidFires on every explore-mode keystroke.
onModeComplete(score) => voidFires when a predict / challenge run finishes.
classNamestringMerged onto the root via cn().

Accessibility

  • The three mode buttons are a role="group" with aria-label="Tokenizer mode" and aria-pressed on the active one.
  • A polite live region (aria-live="polite") announces the current narration / round / explanation / final score on every state change.
  • Mode switches focus the mode-content wrapper so screen readers re-read the new pane.
  • The textarea has a visible :focus ring and a hidden character-count hint exposed via aria-describedby.
  • The token strip is a role="region" with an aria-label naming the token count; each chip carries a title with the literal token text and length.
  • Each option button in predict / challenge has aria-pressed, a visible focus ring, and meets the 36-pixel minimum hit area.
  • Colour is never the only signal — correct / wrong answers also carry an explicit FeedbackBadge ("Correct!" / "Not quite") and a score-dots row with role="img".
  • Motion respects prefers-reduced-motion: reduce — every entrance, the chip pop-in, the wrong-answer shake, the correct-answer scale-bump, and the mode crossfade collapse to instant transitions.

Credits

  • Extracted from: craftingattention (app/src/lessons/primitives/systems/TokenizerViz.tsx). The source was a lesson component wired to Widget / useWidgetHistory / ModeStrip / ChallengeBtn / FeedbackBadge / ScoreDots primitives in @/lessons/primitives/chrome/*, used CA palette tokens (--color-accent-400, --color-warn-400, --color-success-400, --color-fail-400, --color-ink-*, --color-surface-*, ca-narration) for the token-chip wheel, and called CA-specific spring aliases (SPRINGS.snappy, SPRINGS.gentle, STAGGER.tight, TIMING.wrong.shakeSpring, TIMING.correct.scaleBounce, MICRO.tap). The craft-bits extract drops the lesson chrome (no widget shell, no undo/redo history — pure useState), re-keys the eight-slot CA palette wheel to a six-slot semantic cb-* wheel so consumer themes repaint freely, and routes every transition through the canonical SPRINGS.snap / SPRINGS.smooth / SPRINGS.bouncy / TAP_SCALE / STAGGER from @craft-bits/core/motion. The hard-coded preset list, predict rounds, and challenge rounds are now props with the originals exposed as TOKENIZER_VIZ_DEFAULT_* constants. The mode strip, primary / secondary buttons, feedback badge, score dots, numeric option, and text option are inlined as private sub-components consuming cb-* tokens. forwardRef + cn() + ...props spread were added; lessonId / SvgLabel / ChallengeBtn lesson chrome was stripped.