Attention Scale Viz

Side-by-side bar charts of softmax(QKᵀ / √dₖ) (left) versus softmax(QKᵀ) (right) over the same row of raw dot-product logits. As the dₖ slider grows, the unscaled distribution saturates toward a one-hot, where the softmax gradient vanishes for every key, while the scaled distribution keeps its shape. Makes the case for the 1/√dₖ factor in scaled dot-product attention visceral.
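
For context, the usual argument for choosing 1/√dₖ specifically can be written out as a short derivation. This is the standard textbook reasoning, not something taken from this component's source, and it assumes the components of a query q and a key k are independent with zero mean and unit variance:

```latex
\[
q \cdot k = \sum_{i=1}^{d_k} q_i k_i,
\qquad
\operatorname{Var}(q \cdot k) = \sum_{i=1}^{d_k} \operatorname{Var}(q_i k_i) = d_k,
\qquad
\operatorname{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = 1 .
\]
```

Under that assumption, raw logits have a typical magnitude of about √dₖ, so they spread further apart as the head dimension grows; dividing by √dₖ brings them back to unit scale before the softmax.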

Where the Attention Heatmap shows a static N×N weight matrix and the Attention Stepper Viz walks an output cursor across precomputed weights, this primitive isolates a single softmax row and lets the user drag the scaling factor in real time.

[Interactive demo: "Attention scale" at dₖ = 64, 1 / √dₖ = 0.125. Left panel "scaled · z / √dₖ" peaks at 25.0% on k₁; right panel "unscaled · softmax(z)" peaks at 63.2% on k₁. Both panels plot probability (%) on a 0 to 100 axis for keys k₁ through k₅. Controls: dₖ slider, distribution preset ("clear winner"), display options.]

Installation

npx shadcn@latest add https://craftbits.dev/r/attention-scale-viz.json

Usage

import { AttentionScaleViz } from "@craft-bits/core";
 
const logits = [4.0, 3.0, 2.0, 1.0, 0.5];
 
<AttentionScaleViz
  logits={logits}
  labels={["k₁", "k₂", "k₃", "k₄", "k₅"]}
  defaultDk={64}
/>

Drive dk from a parent so the slider can be controlled by a scrollytelling step or a sibling widget:

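import { useState } from "react";

// Inside the parent component: the parent owns dk and passes it down.
const [dk, setDk] = useState(64);

<AttentionScaleViz
  logits={logits}
  dk={dk}
  onDkChange={setDk}
/>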
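
As one possible way to connect a scrollytelling step to the controlled dk prop, the sketch below maps a step index to a head dimension. The helper name dkForStep and the DK_STEPS values are hypothetical (they are not part of the component); the values were chosen only so they stay inside the default dkRange of [1, 1024].

```typescript
// Hypothetical step-to-dk mapping; neither name is exported by the component.
const DK_STEPS: readonly number[] = [1, 16, 64, 256, 1024];

// Clamp the step index into the table so out-of-range steps
// reuse the first or last head dimension instead of returning undefined.
function dkForStep(step: number): number {
  // Non-finite input (NaN, ±Infinity) falls back to the first entry.
  if (!Number.isFinite(step)) return DK_STEPS[0] ?? 1;
  const i = Math.min(Math.max(Math.trunc(step), 0), DK_STEPS.length - 1);
  return DK_STEPS[i] ?? 1;
}
```

A parent could then call setDk(dkForStep(activeStep)) whenever the active step changes, while keeping onDkChange={setDk} so manual slider drags still update the same state.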

Hide the comparison column for a compact display:

<AttentionScaleViz logits={logits} showComparison={false} />

Understanding the component

  1. Two charts, one row of logits. The same QKᵀ row is fed to both panels. The left panel divides by √dₖ before softmax; the right panel doesn't. Both use the textbook max-subtraction trick so even huge unscaled logits never overflow exp.
  2. dₖ drives the gap. As the slider walks from 1 toward 1024, the scale factor 1/√dₖ shrinks from 1 to ~0.031. The right chart sees the raw logits and collapses to a one-hot; the left chart sees z / √dₖ and stays well-spread.
  3. Bar height animates with SPRINGS.smooth. Per-bar motion.rect interpolates y and height on every dₖ change, so the saturation is felt as a continuous motion rather than a snap. Reduced-motion users always get instant transitions.
  4. Argmax callouts. The peak bar in each chart floats its softmax(...) percentage above its top edge so the "100% on k₁" moment in the unscaled column reads at a glance.
  5. Uniform 1/N reference. A dashed accent line marks where every bar would land for a perfectly flat distribution — useful for spotting "still uniform" vs "starting to peak" without squinting.
  6. Controlled + uncontrolled dk. Pair dk with onDkChange for full control; leave them off to let the component own its own state via defaultDk.
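
The arithmetic behind the two panels can be sketched in a few lines. This is a minimal illustration of max-subtracted softmax with and without the 1/√dₖ scale, not the component's actual source; the function names softmax, scaledSoftmax, and peak are assumptions made for this example.

```typescript
// Numerically stable softmax: subtracting the row maximum before exp
// keeps every exponent <= 0, so large logits cannot overflow.
function softmax(z: readonly number[]): number[] {
  if (z.length === 0) return [];
  const max = Math.max(...z);
  const exps = z.map((v) => Math.exp(v - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Scaled variant: multiply each logit by 1 / sqrt(dk) before the softmax.
function scaledSoftmax(z: readonly number[], dk: number): number[] {
  const scale = 1 / Math.sqrt(dk);
  return softmax(z.map((v) => v * scale));
}

// Largest probability in a distribution (the bar that gets the callout).
const peak = (p: readonly number[]): number => Math.max(...p);

const logits = [4.0, 3.0, 2.0, 1.0, 0.5];
const unscaled = softmax(logits);
const scaled = scaledSoftmax(logits, 64);
const uniform = 1 / logits.length; // 0.2, the flat 1/N reference level

console.log((peak(unscaled) * 100).toFixed(1)); // "63.2"
console.log((peak(scaled) * 100).toFixed(1)); // "25.0"
console.log(uniform); // 0.2
```

With these five logits and dₖ = 64, the scaled peak sits just above the 20% uniform reference, while the unscaled peak already holds most of the probability mass.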

Props

| Prop | Type | Default | Description |
| --- | --- | --- | --- |
| logits | readonly number[] | - | Raw QKᵀ row for a single query. Required. |
| labels | readonly string[] | numeric indices | Key-token labels under each bar. |
| dk | number | - | Controlled head dimension. Pair with onDkChange. |
| defaultDk | number | 64 | Uncontrolled initial head dimension. |
| onDkChange | (dk: number) => void | - | Fires when the dₖ slider changes. |
| dkRange | readonly [number, number] | [1, 1024] | [min, max] for the dₖ slider. |
| applyScale | boolean | true | When false, the left chart shows plain softmax(z) too. |
| showComparison | boolean | true | When false, only the scaled chart renders. |
| transition | Transition | SPRINGS.smooth | Spring for bar-height transitions. |
| className | string | - | Merged onto the root <div> via cn(). |

Accessibility

  • The outer element is role="figure" with aria-labelledby pointing at the heading and an aria-live="polite" summary that announces the current dₖ plus the peak probability and key in each chart.
  • Each chart has its own <svg role="img"> with aria-labelledby pointing at a per-chart title ("scaled · z / √dₖ" / "unscaled · softmax(z)") so the two panels are distinguishable in an assistive-technology (AT) outline.
  • The dₖ slider is a native <input type="range"> with aria-valuemin / aria-valuemax / aria-valuenow / aria-valuetext — keyboard arrows nudge the head dimension and screen readers narrate the value.
  • Color is never the only signal: the argmax bar carries its softmax percentage as visible text, and labels under each bar bold when they win.
  • prefers-reduced-motion: reduce collapses every transition to duration: 0 — the saturation snaps instead of springs.
  • Color contrast is theme-driven via --cb-accent / --cb-fg / --cb-fg-muted tokens, so AA contrast holds in both light and dark mode.
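
To make the announced text concrete, here is a rough sketch of how a live-region summary and a slider aria-valuetext string could be assembled. The names Peak, peakOf, liveSummary, and sliderValueText are hypothetical, and the exact wording the component emits may differ; the fallback label for a missing entry is also an assumption.

```typescript
interface Peak {
  label: string;
  probability: number;
}

// Find the winning bar and its label; falls back to the index as a label
// when none is supplied (an assumption about the default labelling).
function peakOf(probs: readonly number[], labels: readonly string[]): Peak {
  let best = 0;
  let bestP = -Infinity;
  probs.forEach((p, i) => {
    if (p > bestP) {
      bestP = p;
      best = i;
    }
  });
  return {
    label: labels[best] ?? String(best),
    probability: probs.length > 0 ? bestP : 0,
  };
}

// Text for the aria-live="polite" region: current dk plus both peaks.
function liveSummary(dk: number, scaled: Peak, unscaled: Peak): string {
  const pct = (p: number): string => `${Math.round(p * 100)}%`;
  return (
    `At dk ${dk}: scaled softmax peaks at ${pct(scaled.probability)} on ${scaled.label}; ` +
    `unscaled peaks at ${pct(unscaled.probability)} on ${unscaled.label}.`
  );
}

// Text for the slider's aria-valuetext, so a screen reader hears more than a bare number.
function sliderValueText(dk: number): string {
  return `head dimension ${dk}, scale factor ${(1 / Math.sqrt(dk)).toFixed(3)}`;
}
```

For example, liveSummary(64, { label: "k₁", probability: 0.2504 }, { label: "k₁", probability: 0.6316 }) yields "At dk 64: scaled softmax peaks at 25% on k₁; unscaled peaks at 63% on k₁.", and sliderValueText(64) yields "head dimension 64, scale factor 0.125".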

Credits

  • Extracted from: craftingattention (app/src/lessons/primitives/viz/AttentionScaleViz.tsx). The original was an interactive widget about the quadratic memory wall (N² scaling), bundled with a lesson harness. This extract is a sibling concept — the 1/√dₖ scaling factor — distilled into a focused primitive: two softmax bars, one slider, controlled / uncontrolled wiring.