Bootstrap CI Viz

An interactive visualisation of bootstrap confidence intervals for LLM-eval scores. The viewer presses Resample to step one bootstrap draw at a time (watching the histogram populate one bar at a time) or Run 1,000 to sprint through a bulk run. A sample-size toggle (N=50 vs N=200) demonstrates how more data narrows the CI. Once both sample sizes have been bootstrapped, a comparison panel surfaces whether each CI excludes a configurable baseline.

A single point estimate hides uncertainty; the bootstrap distribution makes the uncertainty visible — and shows when it's small enough to act on.

You evaluated your LLM on 50 examples and got 92% accuracy. Looks good — but is this enough to deploy? A single point estimate hides the uncertainty. Press Resample to see how much the score can shift.

Installation

npx shadcn@latest add https://craftbits.dev/r/bootstrap-ci-viz.json

Usage

import { BootstrapCIViz } from "@craft-bits/viz/bootstrap-ci-viz";
 
<BootstrapCIViz />

Start with the larger sample size if you want viewers to skip straight to the tightening:

<BootstrapCIViz defaultSampleSize={200} />

Lift the bulk result into your own chart:

<BootstrapCIViz
  onBulkComplete={({ sampleSize, ci }) => {
    /* feed ci into a downstream chart */
  }}
/>
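One thing a handler like the one above might do is classify the interval against a baseline, mirroring the comparison panel's verdict. The sketch below is illustrative only: `verdictAgainstBaseline` and the `Verdict` labels are hypothetical names, and the function takes plain lower and upper bounds because this page does not document the exact shape of `ci`.

```typescript
// Hypothetical helper: not part of the component's API.
type Verdict = "above-baseline" | "below-baseline" | "overlaps-baseline";

// Classify a confidence interval [lower, upper] against a baseline score.
// The default of 0.89 matches the documented default of the `baseline` prop.
function verdictAgainstBaseline(
  lower: number,
  upper: number,
  baseline: number = 0.89,
): Verdict {
  if (lower > baseline) return "above-baseline"; // whole CI sits above the baseline
  if (upper < baseline) return "below-baseline"; // whole CI sits below the baseline
  return "overlaps-baseline"; // the baseline falls inside the CI
}
```

Only the first two outcomes exclude the baseline; an interval that straddles it leaves the comparison unresolved.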

Understanding the component

  1. The dataset. A grid of dots — green for pass, red for fail. The point estimate is passes / total, surfaced in monospace tabular numerals.
  2. Single resample. Pressing Resample (or Space) picks N indices with replacement, highlights the picked dots, badges any dot picked more than once with its pick count (2, 3, and so on), and appends the resampled accuracy to the histogram.
  3. Bulk run. Pressing Run 1,000 (or Enter) adds a batch of bulkBatchSize resamples every bulkTickMs milliseconds until bulkTarget resamples have accumulated. The CI band paints only once the histogram holds at least 100 samples.
  4. Sample-size toggle. Flipping between N=50 and N=200 resets the resample history but keeps any completed bulk result. The component remembers both so the comparison panel can land once both have run.
  5. Comparison panel. Renders side-by-side CI bars and a verdict against the configurable baseline: a CI that overlaps the baseline prints in --cb-error, a CI that excludes it in --cb-success.
  6. Reduced motion. Under prefers-reduced-motion: reduce, every entrance animation is disabled and the bulk run collapses to a single synchronous computation.
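The resampling in steps 2 and 3 can be sketched as follows. This is a minimal illustration, not the component's source. It assumes a 95% percentile interval, since this page does not say which CI method the component uses. All function names are hypothetical, and the seeded generator is there only to make runs repeatable.

```typescript
// Small seeded PRNG (mulberry32) so runs are repeatable. Math.random would also work.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Build a pass/fail dataset: true = pass, false = fail.
function makeDataset(pass: number, fail: number): boolean[] {
  return Array.from({ length: pass + fail }, (_, i) => i < pass);
}

// One bootstrap draw: pick data.length indices with replacement
// and return the accuracy of the resampled set.
function resampleAccuracy(data: boolean[], rand: () => number): number {
  let passes = 0;
  for (let i = 0; i < data.length; i++) {
    const idx = Math.floor(rand() * data.length);
    if (data[idx]) passes++;
  }
  return passes / data.length;
}

// Percentile CI (assumed method): sort the resampled scores and read off
// the alpha/2 and 1 - alpha/2 quantiles.
function percentileCI(
  scores: number[],
  alpha: number = 0.05,
): { lower: number; upper: number } {
  const sorted = [...scores].sort((x, y) => x - y);
  const at = (q: number): number =>
    sorted[Math.min(sorted.length - 1, Math.floor(q * sorted.length))] ?? NaN;
  return { lower: at(alpha / 2), upper: at(1 - alpha / 2) };
}

// Bulk run: accumulate `resamples` draws, then summarise them as a CI.
function bootstrapCI(
  data: boolean[],
  resamples: number,
  seed: number = 1,
): { lower: number; upper: number } {
  const rand = mulberry32(seed);
  const scores = Array.from({ length: resamples }, () =>
    resampleAccuracy(data, rand),
  );
  return percentileCI(scores);
}
```

Calling `bootstrapCI(makeDataset(46, 4), 1000)` reproduces the N=50 case from the demo. A 200-example dataset at the same 92% accuracy (for instance a hypothetical 184 pass, 16 fail) should give a visibly narrower interval, which is the effect the sample-size toggle demonstrates.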

Props

Prop               Type              Default       Description
defaultSampleSize  50 | 200          50            Sample size shown on mount. The toggle still lets the viewer flip.
bulkTarget         number            1000          Total resamples to accumulate in a bulk run.
bulkBatchSize      number            50            Resamples added per animation tick during a bulk run.
bulkTickMs         number            40            Tick interval (ms) between bulk batches. Reduced-motion users skip this.
baseline           number            0.89          Baseline the comparison panel measures each CI against.
transition         Transition        SPRINGS.snap  Override histogram bar / dot / band entrance transition.
onBulkComplete     (result) => void  -             Fires after each bulk run completes.
className          string            -             Merged onto the root via cn().
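To see how the three bulk props interact, the arithmetic can be sketched as below. `planBulkRun` is a hypothetical helper, not part of the component. It assumes exactly one batch per tick, with the first batch landing after one full interval, which the source does not confirm.

```typescript
// Hypothetical helper: estimates how a bulk run unfolds from the three bulk props.
interface BulkPlan {
  ticks: number;      // number of batches needed to reach bulkTarget
  lastBatch: number;  // size of the final (possibly partial) batch
  durationMs: number; // approximate wall-clock time, assuming one batch per tick
}

function planBulkRun(
  bulkTarget: number = 1000,
  bulkBatchSize: number = 50,
  bulkTickMs: number = 40,
): BulkPlan {
  const ticks = Math.ceil(bulkTarget / bulkBatchSize);
  const lastBatch = bulkTarget - (ticks - 1) * bulkBatchSize;
  return { ticks, lastBatch, durationMs: ticks * bulkTickMs };
}
```

With the documented defaults this works out to 20 ticks of 50 resamples, or roughly 800 ms of animation. Raising bulkBatchSize shortens the run at the cost of a choppier histogram build-up.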

Accessibility

  • The root is role="figure" with a descriptive aria-label; the histogram and dot grid carry their own role="img" labels summarising the dataset and CI.
  • A polite live region announces the current resample count and CI without spamming on every histogram bar.
  • Keyboard model: Space resamples once, Enter runs the bulk, R resets. The component itself is focusable so the shortcuts work without targeting the buttons.
  • The sample-size toggle is a role="radiogroup" of role="radio" buttons with aria-checked reflecting the active size; the bulk button disables while running so repeat-clicks can't queue.
  • Colour is never the only signal — the narration, the legend, and the comparison verdict all encode pass / fail / CI overlap as words.
  • Motion respects prefers-reduced-motion: reduce. Every entrance animation is disabled and the bulk run collapses to a single synchronous computation.
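The keyboard model above can be sketched as a small pure function from key to action. The names `VizAction` and `actionForKey` are hypothetical. Ignoring Enter while a bulk run is in progress is an assumption extrapolated from the bulk button disabling, not documented shortcut behaviour.

```typescript
// Hypothetical mapping from KeyboardEvent.key values to component actions.
type VizAction = "resample" | "bulk" | "reset";

function actionForKey(key: string, bulkRunning: boolean = false): VizAction | null {
  switch (key) {
    case " ": // KeyboardEvent.key for the space bar is a single space character
      return "resample";
    case "Enter":
      // Assumption: mirror the disabled bulk button so repeat presses cannot queue.
      return bulkRunning ? null : "bulk";
    case "r":
    case "R":
      return "reset";
    default:
      return null; // every other key is left alone
  }
}
```

A keydown handler on the focusable root could call `actionForKey(event.key, running)` and dispatch only when the result is non-null, leaving Tab and other navigation keys untouched.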

Credits

  • Extracted from: craftingattention (app/src/lessons/primitives/systems/BootstrapCIViz.tsx). The source was a lesson primitive for an LLM-eval lesson; the extract drops the curriculum chrome (lesson narration banner, ca-narration class, no-cb --color-* ink/surface tokens) and lifts the controls into a Radix-style controlled API. Inline SPRINGS.snappy / SPRINGS.gentle are re-keyed to the canonical SPRINGS.snap / SPRINGS.smooth from @craft-bits/core/motion; STAGGER.tight is replaced with the canonical scalar STAGGER. Histogram bars now animate via a scaleY transform (rather than animating height / y) so the entrance respects the transform-and-opacity-only rule. Per-track palette tokens are remapped to var(--cb-*) semantic tokens so consumer themes repaint freely.