Part 10 · Advanced Topics · Intermediate

Intermittent Failure Workflow

A repeatable triage loop for flaky failures: replay, stabilize, classify signatures, and isolate root cause with controlled perturbations.

From flaky symptom to deterministic case

Intermittent failures often look random, but most reproduce deterministically once the seed, binary, and environment match the failing run. The workflow below converts an occasional farm failure into a repeatable local debug target.

diagram
[REG] flaky triage loop

  1) Capture failing row (seed + metadata)
  2) Replay exact command locally
  3) Confirm same signature appears
  4) Tighten instrumentation around failing phase
  5) Reduce test scope while preserving failure
  6) Build hypothesis and verify with controlled toggles
  7) Add regression guard (test/seed/assertion)
  8) Close with documented signature + fix sha

Signature-first triage

Group failures by signature before investigating by seed. Multiple seeds often reveal the same bug pattern; signature bucketing prevents duplicated work.

  • signature = error_id + file:line + key protocol context

  • seed is evidence, signature is classification

  • same signature across many seeds implies systemic bug
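
As a rough illustration of signature bucketing, the sketch below extracts a signature from each log and counts how many logs share it. It assumes the default UVM report line layout (`UVM_ERROR <file>(<line>) @ <time>: <path> [<ID>] <message>`) and uses only the ID and file:line parts; the protocol-context part of the signature is project-specific and is left out. The function names `signature_of_log` and `bucket_signatures` are hypothetical.

```shell
#!/usr/bin/env bash
# Print "<ID> <file>(<line>)" for the first UVM_ERROR/UVM_FATAL message in a log.
# Report-summary lines such as "UVM_ERROR :    1" are skipped.
signature_of_log() {
  awk '/^UVM_(ERROR|FATAL) / && !/^UVM_(ERROR|FATAL) *:/ {
         id = ""
         if (match($0, /\[[^]]+\]/)) id = substr($0, RSTART + 1, RLENGTH - 2)
         print id, $2
         exit
       }' "$1"
}

# Count logs per signature, most frequent bucket first.
# Logs without an error message contribute nothing.
bucket_signatures() {
  for log in "$@"; do
    sig=$(signature_of_log "$log")
    if [ -n "$sig" ]; then
      printf '%s\n' "$sig"
    fi
  done | sort | uniq -c | sort -rn
}
```

A call such as `bucket_signatures out/logs/*.log` (path assumed) would then give one row per signature, so each bucket can be debugged once rather than once per seed.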


Replay reliability checks

If replay does not fail consistently, verify environment parity before changing testbench code.

bash
# Validate same binary and knobs
echo "expected build sha: abc1234"
./simv -version
rg "RUN_META|build=" out/logs/replay_seed821734.log

# Run N repeats with the same seed to test determinism
# (stderr is captured too, so tool errors land in the same log)
for i in $(seq 1 20); do
  ./simv +UVM_TESTNAME=axi_random_test +ntb_random_seed=821734 > "replay_$i.log" 2>&1
done
diagram
[REG] determinism matrix (same seed)

  20/20 fail same signature -> deterministic bug
  20/20 pass               -> environment mismatch or stale binary
  mixed outcomes           -> race/timing issue or external nondeterminism
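
The matrix above can be applied mechanically to the replay logs. The following is a minimal sketch, assuming that a failing run logs at least one default-format UVM error line and that a passing run logs none; `classify_replays` is a hypothetical helper name, and the signature here is reduced to message ID plus file:line.

```shell
#!/usr/bin/env bash
# Classify repeated same-seed replays using the determinism matrix.
# Assumption: a failing run contains a line like
#   UVM_ERROR file.sv(42) @ 1000: path [ID] message
classify_replays() {
  total=$#
  # One "<ID> <file>(<line>)" line per failing log (first error only),
  # with the timestamp and message text stripped.
  sigs=$(
    for log in "$@"; do
      grep -E '^UVM_(ERROR|FATAL) [^:]' "$log" | head -n 1 |
        sed -E 's/^UVM_[A-Z]+ ([^ ]+) @ [^[]*\[([^]]+)\].*/\2 \1/'
    done
  )
  fails=$(printf '%s\n' "$sigs" | grep -c .)
  distinct=$(printf '%s\n' "$sigs" | sort -u | grep -c .)

  if [ "$fails" -eq 0 ]; then
    echo "all-pass"        # check environment parity and binary
  elif [ "$fails" -eq "$total" ] && [ "$distinct" -eq 1 ]; then
    echo "deterministic"   # same signature in every run
  else
    echo "mixed"           # race/timing or external nondeterminism
  fi
}
```

Running `classify_replays replay_*.log` after the repeat loop would print one of the three outcomes. Runs that all fail but with different signatures are reported as `mixed` here, which is a design choice of this sketch rather than something the matrix prescribes.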

Common hidden nondeterminism sources

  • timeouts derived from wall-clock or host load

  • unordered associative iteration used in scoreboarding/reporting

  • external file ordering assumptions

  • parallel testbench threads without deterministic synchronization
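
To make the file-ordering item concrete: directory listings and locale-aware sorts can differ between the farm and a local machine, so a flow that feeds stimulus files to the testbench in "whatever order they come back" may not replay the same way. One possible guard, sketched below with a hypothetical helper name and file pattern, is to build the file list with an explicit byte-wise sort.

```shell
#!/usr/bin/env bash
# List files matching a pattern in a byte-wise stable order, independent of
# the host locale and of the order the filesystem returns directory entries.
stable_file_list() {
  dir=$1
  pattern=$2
  find "$dir" -type f -name "$pattern" | LC_ALL=C sort
}
```

For example, `stable_file_list stim '*.hex'` (directory name assumed) should produce the same order on every host, which removes one variable when comparing a farm run against a local replay.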


Controlled perturbation strategy

When exact replay still appears unstable, isolate dimensions one at a time: verbosity, timeout knobs, traffic volume, and monitor granularity. Change only one knob per run.

bash
# Keep seed fixed and vary one parameter
./simv +UVM_TESTNAME=axi_random_test +ntb_random_seed=821734 +NUM_TXN=500
./simv +UVM_TESTNAME=axi_random_test +ntb_random_seed=821734 +NUM_TXN=100
./simv +UVM_TESTNAME=axi_random_test +ntb_random_seed=821734 +NUM_TXN=20

# Raise targeted verbosity only for relevant component
./simv +UVM_TESTNAME=axi_random_test +ntb_random_seed=821734 \
     +uvm_set_verbosity=uvm_test_top.env.scb,SCB,UVM_HIGH,run
systemverilog
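// Example: temporary debug test for the seed 821734 failure.
// The seed itself is pinned on the command line (+ntb_random_seed=821734).
class repro_821734_test extends axi_random_test;
  `uvm_component_utils(repro_821734_test)

  // Standard component constructor, required for factory creation.
  function new(string name = "repro_821734_test", uvm_component parent = null);
    super.new(name, parent);
  endfunction

  task run_phase(uvm_phase phase);
    phase.raise_objection(this);
    // Run exactly one sequence path to minimize noise.
    // run_targeted_sequence_only() stands in for a user-defined task.
    run_targeted_sequence_only();
    phase.drop_objection(this);
  endtask
endclass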
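
The one-knob-per-run rule can also be scripted so that each sweep leaves a log per value and a one-line verdict. This is a sketch under assumptions: `sweep_knob` is a hypothetical helper, the simulator path comes from a `SIM` variable defaulting to `./simv`, and pass/fail is judged by the presence of a default-format UVM error line in the log rather than by exit code.

```shell
#!/usr/bin/env bash
# Rerun one fixed-seed command while sweeping a single plusarg.
SIM=${SIM:-./simv}      # simulator binary (override as needed)
SEED=${SEED:-821734}    # seed under investigation

# usage: sweep_knob <PLUSARG_NAME> <value> [<value> ...]
sweep_knob() {
  knob=$1
  shift
  for value in "$@"; do
    log="sweep_${knob}_${value}.log"
    # Everything except the swept plusarg stays identical across runs.
    "$SIM" +UVM_TESTNAME=axi_random_test "+ntb_random_seed=$SEED" \
           "+${knob}=${value}" > "$log" 2>&1
    if grep -Eq '^UVM_(ERROR|FATAL) [^:]' "$log"; then
      echo "$knob=$value FAIL"
    else
      echo "$knob=$value PASS"
    fi
  done
}
```

Calling `sweep_knob NUM_TXN 500 100 20` mirrors the three manual runs above. Because only one plusarg changes per call, a PASS/FAIL boundary in the output can be attributed to that knob alone.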

Key takeaways

  • Classify by signature, then validate replay determinism.

  • Hold seed constant while perturbing one variable at a time.

  • Reduce scope to a minimal reproducible failing flow.

  • Close the loop by adding a permanent guard for the discovered bug.

Common pitfalls

  • Changing multiple knobs per iteration and losing causality.

  • Investigating seed-by-seed without signature grouping.

  • Declaring a failure unreproducible before checking binary/environment parity.


Failure journal template

Document intermittent issues with a short reproducibility journal. This prevents knowledge loss when ownership shifts.

diagram
[REG] intermittent issue journal

  Bug ID:
  Signature:
  First seen date:
  Affected tests:
  Repro command:
  Determinism check (N runs):
  Minimal failing configuration:
  Root cause:
  Fix commit:
  Guard added (test/assertion/seed):
  Residual risk: