Part 6 · Testbench Architecture · Intermediate

The Regression Flow

Test-list and seed matrix, parallel dispatch, result aggregation, triage buckets, rerun loop, and nightly vs per-commit tiers.

From test list to verdict board

A regression is a matrix: a test list (which tests, with which knobs) crossed with seeds (how many random universes per test). The dispatcher expands the matrix into individual simulation jobs, farms them out in parallel, then scrapes each log for its verdict banner and exit code. Everything earlier in this topic — banners, seeds in headers, knobs, grep-able logs — exists so this flow can run without humans in the loop.
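As a rough illustration of the expansion and dispatch steps, the sketch below turns a test list into one job per (test, seed) pair and runs the jobs in fixed-size parallel batches. The `<test> <seed_count>` list format, the function names `expand_matrix`, `dispatch`, and `run_job`, and the `<outdir>/<test>_<seed>/run.log` layout are all assumptions made for this example. `run_job` is a stand-in that only writes a log header where a real flow would invoke the simulator.

```shell
#!/bin/sh
# Sketch only: names and file formats here are assumptions, not a real tool's interface.

# expand_matrix: stdin "<test> <seed_count>" lines -> stdout "<test> <seed>" lines.
expand_matrix() {
  while read -r name count; do
    case "$name" in ''|'#'*) continue ;; esac   # skip blank lines and comments
    i=0
    while [ "$i" -lt "$count" ]; do
      # 4 random bytes from /dev/urandom, printed as one unsigned decimal
      seed=$(od -An -N4 -tu4 /dev/urandom | tr -dc '0-9')
      echo "$name $seed"
      i=$((i + 1))
    done
  done
}

# run_job: stand-in for the simulator call; records the seed in the log header.
run_job() {   # run_job <test> <seed> <outdir>
  mkdir -p "$3/$1_$2"
  echo "TEST=$1 SEED=$2" > "$3/$1_$2/run.log"
}

# dispatch: stdin "<test> <seed>" lines; runs at most <max_parallel> jobs per batch.
dispatch() {  # dispatch <max_parallel> <outdir>
  running=0
  while read -r name seed; do
    run_job "$name" "$seed" "$2" &
    running=$((running + 1))
    if [ "$running" -ge "$1" ]; then wait; running=0; fi
  done
  wait   # drain the last partial batch
}

# Demo with the test list from the flow diagram: 1 + 20 + 10 + 5 = 36 jobs.
outdir=$(mktemp -d)
expand_matrix <<'EOF' | dispatch 4 "$outdir"
smoke_test 1
burst_test 20
err_inject 10
soak_test 5
EOF
ls "$outdir" | wc -l
```

Batching keeps the sketch portable to plain `sh`, at the cost of idling while the slowest job in each batch finishes. A farm scheduler would normally keep every slot busy instead.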

diagram
REGRESSION FLOW

  test list                       seeds
  ─────────                       ─────
  smoke_test    x1                random per run,
  burst_test    x20               recorded in log
  err_inject    x10
  soak_test     x5
        │
        ▼
  ┌──────────────────────────────────────────────┐
  │ DISPATCH: N jobs in parallel (farm / cores)  │
  │  job = (test, seed, knobs) → sim → run.log   │
  └──────────────────────────────────────────────┘
        │
        ▼
  AGGREGATE per job:
    PASS  banner + exit 0         → pass
    FAIL  banner / exit != 0      → fail  + first-error signature
    no banner / timeout           → infra-error (counts as fail!)
        │
        ▼
  TRIAGE: bucket fails by first-error signature
    bucket A (14 fails): "SCB txn mismatch addr=0x1040" → one bug
    bucket B  (2 fails): "timeout waiting for grant"    → another bug
        │
        ▼
  RERUN failures with logged seeds → reproduce → debug → fix
        └──────────── repeat until green ────────────────┘
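The three-way verdict in the AGGREGATE step can be sketched as a small shell function. This is an illustration under stated assumptions: the `TEST PASSED` string matches the aggregation script in this section, while the `TEST FAILED` banner string, the function name `classify_run`, and the use of exit code 124 to signal a timeout (the status GNU `timeout` returns when it kills a command) are assumptions.

```shell
#!/bin/sh
# Sketch only: banner strings and the timeout convention are assumptions.

# classify_run <log> <exit_code>  ->  prints pass | fail | infra-error
classify_run() {
  log=$1
  rc=$2
  # A missing log or a timed-out job never produced a trustworthy verdict.
  if [ ! -f "$log" ] || [ "$rc" -eq 124 ]; then
    echo "infra-error"
    return 0
  fi
  has_pass=no
  has_fail=no
  if grep -q "TEST PASSED" "$log"; then has_pass=yes; fi
  if grep -q "TEST FAILED" "$log"; then has_fail=yes; fi
  if [ "$has_pass" = no ] && [ "$has_fail" = no ]; then
    echo "infra-error"      # the run died before printing any banner
  elif [ "$has_pass" = yes ] && [ "$has_fail" = no ] && [ "$rc" -eq 0 ]; then
    echo "pass"             # PASS banner and a clean exit are both required
  else
    echo "fail"
  fi
}

# Demo on three hand-written logs
tmp=$(mktemp -d)
printf 'SEED=1\nTEST PASSED\n' > "$tmp/ok.log"
printf 'ERROR mismatch\nTEST FAILED\n' > "$tmp/bad.log"
printf 'SEED=3\n' > "$tmp/dead.log"
classify_run "$tmp/ok.log" 0     # pass
classify_run "$tmp/bad.log" 1    # fail
classify_run "$tmp/dead.log" 1   # infra-error
classify_run "$tmp/ok.log" 124   # infra-error
```

Requiring both the banner and a zero exit code means a simulator crash after the banner was printed still shows up as a failure.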

Aggregation and triage buckets

Thirty failing runs rarely mean thirty bugs. The aggregator extracts each run's first-error signature — the first ERROR line, with volatile fields like timestamps and data values masked — and clusters identical signatures into buckets. Each bucket is one suspected bug; engineers debug one representative run per bucket, not all thirty.

bash
# Aggregate: one verdict line per run ("PASS  <log>" or "FAIL  <log>  <signature>")
for log in results/*/run.log; do
  # Banner check only: the exit-code half of the verdict must come from the dispatcher
  if grep -q "TEST PASSED" "$log"; then
    echo "PASS  $log"
  else
    # first-error signature, with hex values and #<n> fields masked for clustering
    sig=$(grep -m1 "ERROR" "$log" | sed 's/0x[0-9a-fA-F]*/0xN/g; s/#[0-9]*/#N/g')
    if [ -z "$sig" ]; then sig="NO_BANNER_OR_CRASH"; fi
    echo "FAIL  $log  $sig"
  fi
done | tee summary.txt

# Triage buckets: strip the "FAIL  <log>  " prefix, then count identical signatures
# (assumes log paths contain no spaces)
grep "^FAIL" summary.txt | sed 's/^FAIL  [^ ]*  //' | sort | uniq -c | sort -rn

The rerun-failures loop

  1. Aggregate the overnight run; cluster failures into signature buckets.

  2. Pick one representative (test, seed) per bucket — the seed is in the log header.

  3. Replay it with +VERBOSITY=DEBUG and waves on; debug; fix RTL or TB.

  4. Rerun just the failing (test, seed) pairs to confirm each fix.

  5. Rerun the full matrix — fixes can unmask new failures downstream of the old one.
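Steps 1 and 2 of the loop can be sketched in shell: pick the first failing log of each bucket, then read its seed back from the log header. The input is assumed to be a summary in the `FAIL  <log>  <signature>` format written by the aggregation script in this section. The `SEED=<number>` header format and the function names `pick_representatives` and `seed_of` are hypothetical; adapt them to whatever your testbench actually prints.

```shell
#!/bin/sh
# Sketch only: the SEED=<n> header format and function names are assumptions.

# pick_representatives: stdin summary lines -> first failing log of each bucket.
pick_representatives() {
  awk '$1 == "FAIL" {
         path = $2
         sig = $0
         sub(/^FAIL +[^ ]+ +/, "", sig)      # drop "FAIL  <log>  ", keep the signature
         if (!(sig in seen)) { seen[sig] = 1; print path }
       }'
}

# seed_of: print the first SEED=<n> value found in a log.
seed_of() {
  sed -n 's/.*SEED=\([0-9][0-9]*\).*/\1/p' "$1" | head -n 1
}

# Demo: three failures in two buckets yield two replays.
tmp=$(mktemp -d)
printf 'TEST=burst_test SEED=101\nERROR SCB txn mismatch addr=0x1040\n' > "$tmp/a.log"
printf 'TEST=err_inject SEED=303\nERROR timeout waiting for grant\n' > "$tmp/c.log"
cat > "$tmp/summary.txt" <<EOF
FAIL  $tmp/a.log  ERROR SCB txn mismatch addr=0xN
FAIL  $tmp/b.log  ERROR SCB txn mismatch addr=0xN
FAIL  $tmp/c.log  ERROR timeout waiting for grant
PASS  $tmp/d.log
EOF
pick_representatives < "$tmp/summary.txt" | while read -r log; do
  echo "replay $log with seed $(seed_of "$log")"
done
```

The same (log, seed) list can be reused for step 4, so each fix is confirmed against exactly the seeds that exposed the bug.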


Nightly vs per-commit tiers

Not every change deserves the full matrix. Healthy projects layer regressions like a pyramid: tiny and fast at the commit gate, broad and slow at night, exhaustive before tape-out milestones.

diagram
REGRESSION PYRAMID

            ┌────────────┐
            │  WEEKLY /  │   full matrix, soak tests, long seeds,
            │  MILESTONE │   coverage merge → closure report
            ├────────────┤
            │  NIGHTLY   │   all tests x many seeds (hours, farm)
            │            │   triage board updated every morning
            ├────────────┤
            │ PER-COMMIT │   smoke list x 1-2 seeds (minutes)
            │  (gate)    │   blocks merge on failure
            └────────────┘
   wide base = run constantly, must be fast and rock-stable
   narrow top = run rarely, allowed to be slow
  • Per-commit smoke must be fast and flake-free — a flaky gate trains people to ignore it.

  • Nightly catches the cross-test, multi-seed interactions a smoke list cannot.

  • Track failure counts per bucket, and the overall pass rate, over days: a slowly growing bucket is a real bug, not noise.

  • Coverage merge belongs to the nightly/weekly tiers — single-run coverage means little.
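One way to watch a bucket over several days is to append each night's bucket counts to a history file and then filter by signature. The sketch below assumes summaries in the `FAIL  <log>  <signature>` format used by the aggregation script in this section. The tab-separated history layout and the function names `record_buckets` and `bucket_trend` are assumptions for illustration.

```shell
#!/bin/sh
# Sketch only: the history file layout and function names are assumptions.

# record_buckets <day> <summary_file> <history_file>
# Appends one "<day>\t<count>\t<signature>" line per bucket.
record_buckets() {
  grep '^FAIL' "$2" \
    | sed 's/^FAIL  *[^ ][^ ]*  *//' \
    | sort | uniq -c \
    | awk -v day="$1" '{ n = $1; sub(/^ *[0-9]+ /, ""); print day "\t" n "\t" $0 }' \
    >> "$3"
}

# bucket_trend <history_file> <signature>  ->  "<day> <count>" per recorded day
bucket_trend() {
  awk -F '\t' -v sig="$2" '$3 == sig { print $1, $2 }' "$1"
}

# Demo: the same signature fails once on day1 and twice on day2.
tmp=$(mktemp -d)
printf 'FAIL  r1/run.log  ERROR grant timeout\n' > "$tmp/mon.txt"
printf 'FAIL  r1/run.log  ERROR grant timeout\nFAIL  r2/run.log  ERROR grant timeout\nPASS  r3/run.log\n' > "$tmp/tue.txt"
record_buckets day1 "$tmp/mon.txt" "$tmp/history.tsv"
record_buckets day2 "$tmp/tue.txt" "$tmp/history.tsv"
bucket_trend "$tmp/history.tsv" "ERROR grant timeout"
# prints:
# day1 1
# day2 2
```

In a nightly job the first argument would typically be `$(date +%Y-%m-%d)` rather than a fixed label.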

Interview angle

  • "Walk me through your regression flow" — matrix → parallel dispatch → banner/exit-code scrape → signature buckets → seeded rerun loop.

  • "Thirty failures overnight, where do you start?" — bucket by first-error signature; debug one representative per bucket.

  • "What runs on every commit vs nightly?" — fast deterministic smoke at the gate; broad seed sweeps and coverage merge at night.

Key takeaways

  • A regression is a (test x seed) matrix dispatched in parallel and scraped mechanically.

  • Cluster failures by first-error signature — buckets map to bugs, runs do not.

  • No-banner and timeout runs count as failures — infrastructure errors must not pass silently.

  • Layer the pyramid: fast smoke per commit, broad nightly sweeps, exhaustive milestone runs.

Common pitfalls

  • Treating each failing run as a separate bug — thirty runs in one bucket are one bug.

  • Rerunning failures with fresh seeds: a pass on a different seed does not show the bug is fixed, only that this seed missed it. Rerun with the logged seed.

  • Counting timed-out or crashed runs as neither pass nor fail — they silently vanish from reports.

  • A slow, flaky commit gate — developers learn to bypass it, and the gate is worse than none.