Eval Pipeline Viz

An interactive visualisation of the CI/CD eval gate pattern: every prompt or model change triggers automated eval suites and deployment is blocked when scores regress beyond a confidence-interval overlap test. A coloured packet animates through five stages — Commit → Run Evals → Score + CI → Gate → Deploy/Block — and each stage opens a detail panel with its concrete artifact: a git-style diff, live progress bars, scored metrics with CIs, the overlap test itself, then a pass/block summary.

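The teaching insight: confidence intervals, not point estimates, decide whether a change is a real regression. If the candidate and baseline CIs overlap, the gate treats the difference as noise and lets the change through. This is a deliberately conservative rule, since overlapping intervals do not by themselves prove that two scores are equal. Only a candidate CI that sits entirely below the baseline CI trips the gate. The Introduce regression toggle switches Helpfulness to a regressed score so that its CIs separate and the gate blocks deployment.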

Installation

npx shadcn@latest add https://craftbits.dev/r/eval-pipeline-viz.json

Usage

import { EvalPipelineViz } from "@craft-bits/viz/eval-pipeline-viz";
 
<EvalPipelineViz />

Boot directly into the regression scenario:

<EvalPipelineViz defaultRegression />

Subscribe to the final verdict:

<EvalPipelineViz
  onComplete={({ pass, hasRegression, outcomes }) => {
    /* lift the gate verdict into your own dashboard */
  }}
/>
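
As a rough sketch of what such a callback might do, the helper below turns the summary into a one-line status string. Only the three field names (pass, hasRegression, outcomes) come from this page; the GateSummary type and the describeVerdict helper are hypothetical, and the shape of outcomes is not documented here, so it is left as unknown.

```typescript
// Assumed shape: the field names are from the docs above, the types are guesses.
interface GateSummary {
  pass: boolean;
  hasRegression: boolean;
  outcomes: unknown; // per-metric results; exact shape not documented here
}

// Hypothetical helper: map the gate verdict to a dashboard status line.
function describeVerdict({ pass, hasRegression }: GateSummary): string {
  if (pass) return "deploy: no regression";
  return hasRegression ? "blocked: regression detected" : "blocked";
}
```

A handler could then pass the summary straight through, for example onComplete={(summary) => setStatus(describeVerdict(summary))}, where setStatus stands in for whatever state setter your dashboard uses.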

Understanding the component

  1. The pipeline header. Five connected stage nodes with a static connector line behind them; an animated progress line and a glowing packet travel left-to-right as you advance. The active stage scales gently, and completed stages show a check mark.
  2. Detail panels. Each non-idle phase renders a panel with the artifact for that stage. Stage 2 (Run Evals) runs three simulated progress bars in parallel — Accuracy (200 cases), Helpfulness (100), Safety (50) — and auto-advances to scoring when every bar fills.
  3. The overlap test. Stage 4 draws two horizontal CI bars per metric (grey baseline on top, violet candidate on bottom) over a shared axis. Where the bars intersect, a soft yellow rectangle labels the overlap; where they do not, a dashed red line spans the gap. The verdict banner underneath summarises the gate outcome.
  4. The gate decision. A metric is a regression when its new CI sits entirely below the baseline CI (newCI[1] < baseCI[0]). Any single regression blocks the deploy.
  5. Reduced motion. Under prefers-reduced-motion: reduce, every panel transition, packet motion, and SVG entrance collapses to a snap, and the eval-progress simulation runs about three times faster.
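
The gate rule from item 4 can be sketched in a few lines. This is a minimal illustration, not the component's actual source: the MetricResult type, the classify and gate helpers, the "improvement" label, and the sample numbers are all assumptions. Only the comparison newCI[1] < baseCI[0] and the "any single regression blocks" rule come from the description above.

```typescript
// Hypothetical types; the component's internal shapes are not shown in this document.
type CI = [number, number]; // [lower bound, upper bound]

interface MetricResult {
  name: string;
  baseCI: CI;
  newCI: CI;
}

type Outcome = "regression" | "overlap" | "improvement";

// A regression is a candidate CI that sits entirely below the baseline CI.
// The "improvement" branch is an added assumption for symmetry.
function classify({ baseCI, newCI }: MetricResult): Outcome {
  if (newCI[1] < baseCI[0]) return "regression";
  if (newCI[0] > baseCI[1]) return "improvement";
  return "overlap";
}

// Any single regression blocks the deploy.
function gate(metrics: MetricResult[]): { pass: boolean; regressions: string[] } {
  const regressions = metrics
    .filter((m) => classify(m) === "regression")
    .map((m) => m.name);
  return { pass: regressions.length === 0, regressions };
}

// Illustrative numbers only, not taken from the component.
const metrics: MetricResult[] = [
  { name: "Accuracy", baseCI: [0.88, 0.94], newCI: [0.87, 0.93] },
  { name: "Helpfulness", baseCI: [0.8, 0.88], newCI: [0.66, 0.76] },
  { name: "Safety", baseCI: [0.95, 0.99], newCI: [0.94, 0.99] },
];

console.log(gate(metrics));
```

With these sample values, Accuracy and Safety overlap their baselines while the Helpfulness candidate interval tops out at 0.76, below the baseline's lower bound of 0.8, so the gate blocks on that one metric.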

Props

Prop | Type | Default | Description
defaultRegression | boolean | false | Whether the candidate starts in the regression scenario.
showRegressionToggle | boolean | true | Show the Introduce regression button and bind the R shortcut.
transition | Transition | SPRINGS.default | Override the spring used for the packet, progress line, and panel transitions.
onComplete | (summary) => void | - | Fires when the pipeline reaches result with { pass, hasRegression, outcomes }.
className | string | - | Merged onto the root via cn().

Accessibility

  • Root is role="figure" with an aria-label summarising the visualisation so screen-reader users get the headline.
  • A polite live region announces each stage transition by its label.
  • Each interactive control (Start Pipeline, Introduce regression, Reset) has a visible focus ring, an aria-label, and the regression toggle exposes its state via aria-pressed.
  • Keyboard shortcuts: Space / ArrowRight advance through the stages, R toggles the regression scenario when the toggle is shown.
  • Each CI overlap SVG carries an aria-label with its metric, both CIs, and the textual outcome — colour is never the only signal.
  • Motion respects prefers-reduced-motion: reduce: every entrance, scale pulse, packet motion, and panel transition collapses to a snap.
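
To make the "colour is never the only signal" point concrete, here is one way a label for a CI overlap SVG could be assembled. The ciAriaLabel helper and its exact wording are hypothetical; the component's real label text is not shown on this page. The sketch only demonstrates packing the metric, both intervals, and the textual outcome into one string.

```typescript
type CI = [number, number]; // [lower bound, upper bound]

// Hypothetical helper: build a screen-reader label for one metric's overlap chart.
function ciAriaLabel(metric: string, baseCI: CI, newCI: CI): string {
  const fmt = ([lo, hi]: CI): string => `${lo.toFixed(2)} to ${hi.toFixed(2)}`;
  // Same rule as the gate: candidate entirely below baseline means regression.
  const outcome =
    newCI[1] < baseCI[0] ? "intervals do not overlap, regression" : "no regression";
  return `${metric}: baseline ${fmt(baseCI)}, candidate ${fmt(newCI)}, ${outcome}`;
}
```

The resulting string would be set as the aria-label on the SVG element, so the verdict is available without relying on the yellow overlap rectangle or the red gap line.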

Credits

  • Extracted from: craftingattention (app/src/lessons/primitives/systems/EvalPipelineViz.tsx). The source was a lesson component that bundled SvgLabel chrome, a ChallengeBtn predict-the-outcome quiz, and the lesson's lessonId narration framing. The viz extract keeps only the interactive 5-stage pipeline, the eval progress simulation, and the CI overlap test — the quiz round and lesson plumbing are curriculum-specific and live in the lesson source. Per-track palette tokens (--color-ink-*, --color-success-*, --color-fail-*, --color-accent-*, --color-surface-raised) are remapped to var(--cb-fg-*) / var(--cb-success) / var(--cb-error) / var(--cb-accent) / var(--cb-bg-elevated) so consumer themes repaint freely. Inline SPRINGS.gentle / SPRINGS.snappy and STAGGER.normal are re-keyed to canonical SPRINGS.default / SPRINGS.snap / STAGGER from @craft-bits/core/motion.