Part 10 · Advanced Topics · Intermediate
Intermittent Failure Workflow
A repeatable triage loop for flaky failures: replay, stabilize, classify signatures, and isolate root cause with controlled perturbations.
From flaky symptom to deterministic case
Intermittent failures often look random, but most are deterministic under the right environment and observation conditions. The workflow below converts an occasional farm failure into a repeatable local debug target.
[REG] flaky triage loop
1) Capture failing row (seed + metadata)
2) Replay exact command locally
3) Confirm same signature appears
4) Tighten instrumentation around failing phase
5) Reduce test scope while preserving failure
6) Build hypothesis and verify with controlled toggles
7) Add regression guard (test/seed/assertion)
8) Close with documented signature + fix sha
Signature-first triage
Group failures by signature before investigating by seed. Multiple seeds often reveal the same bug pattern; signature bucketing prevents duplicated work.
signature = error_id + file:line + key protocol context
seed is evidence, signature is classification
same signature across many seeds implies systemic bug
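The bucketing idea above can be sketched in a few lines of shell. This is a minimal sketch, assuming logs use the default UVM report layout (`UVM_ERROR file(line) @ time: path [ID] message`); the function names `first_signature` and `bucket_signatures` are illustrative, and the "key protocol context" part of the signature is left out because it is project-specific.

```shell
# Print "ID file:line" for the first UVM_ERROR/UVM_FATAL report in a log.
# Assumed line layout (default UVM report format):
#   UVM_ERROR tb/scb.sv(88) @ 4500: uvm_test_top.env.scb [SCB_MISMATCH] msg
first_signature() {
  awk '
    /^UVM_(ERROR|FATAL) / && match($0, /\[[^]]+\]/) {
      id = substr($0, RSTART + 1, RLENGTH - 2)   # text inside [ ]
      loc = $2                                   # e.g. tb/scb.sv(88)
      sub(/\(/, ":", loc)
      sub(/\)$/, "", loc)
      print id, loc
      exit
    }
  ' "$1"
}

# Count logs per signature, most frequent bucket first.
# Logs without an error report are counted under NO_ERROR.
bucket_signatures() {
  for log in "$@"; do
    sig=$(first_signature "$log")
    echo "${sig:-NO_ERROR}"
  done | sort | uniq -c | sort -rn
}
```

Running `bucket_signatures out/logs/*.log` would then list each signature with the number of logs (and therefore seeds) that hit it, so the largest bucket can be investigated first.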
Replay reliability checks
If replay does not fail consistently, verify environment parity before changing testbench code.
# Validate same binary and knobs
echo "expected build sha: abc1234"
./simv -version
rg "RUN_META|build=" out/logs/replay_seed821734.log
# Run N repeats with same seed to test determinism
for i in $(seq 1 20); do
  ./simv +UVM_TESTNAME=axi_random_test +ntb_random_seed=821734 > "replay_$i.log" 2>&1
done
[REG] determinism matrix (same seed)
20/20 fail same signature -> deterministic bug
20/20 pass -> non-parity environment or stale binary
mixed outcomes -> race/timing issue or external nondeterminism
Common hidden nondeterminism sources
timeouts derived from wall-clock or host load
unordered associative iteration used in scoreboarding/reporting
external file ordering assumptions
parallel testbench threads without deterministic synchronization
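Before chasing any of the sources above, the determinism matrix can be applied mechanically to the replay logs. The following is a minimal sketch, assuming a failing run logs a line that starts with `UVM_ERROR` or `UVM_FATAL` followed by a file location, and that the summary lines (`UVM_ERROR :    0`) are the only other lines with that prefix. The function name `classify_replays` is illustrative, and the file location alone stands in for the full signature.

```shell
# Classify a set of same-seed replay logs per the determinism matrix.
classify_replays() {
  total=$#
  # One line per failing log: file(line) of its first error report.
  # The [^:] excludes report-summary lines such as "UVM_ERROR :    0".
  sigs=$(for log in "$@"; do
    awk '/^UVM_(ERROR|FATAL) [^:]/ { print $2; exit }' "$log"
  done)
  failed=$(printf '%s\n' "$sigs" | awk 'NF { n++ } END { print n + 0 }')
  distinct=$(printf '%s\n' "$sigs" | sort -u | awk 'NF { n++ } END { print n + 0 }')

  if [ "$failed" -eq 0 ]; then
    echo "0/$total fail: check environment parity or a stale binary"
  elif [ "$failed" -eq "$total" ] && [ "$distinct" -eq 1 ]; then
    echo "$failed/$total fail, same signature: deterministic bug"
  else
    echo "$failed/$total fail: race/timing issue or external nondeterminism"
  fi
}
```

With the replay loop shown earlier, `classify_replays replay_*.log` prints one verdict for the whole batch. Runs that all fail but with different locations are reported as nondeterministic here; that choice goes beyond the three rows of the matrix and is an assumption of this sketch.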
Controlled perturbation strategy
When exact replay still appears unstable, isolate dimensions one at a time: verbosity, timeout knobs, traffic volume, and monitor granularity. Change only one knob per run.
# Keep seed fixed and vary one parameter
./simv +UVM_TESTNAME=axi_random_test +ntb_random_seed=821734 +NUM_TXN=500
./simv +UVM_TESTNAME=axi_random_test +ntb_random_seed=821734 +NUM_TXN=100
./simv +UVM_TESTNAME=axi_random_test +ntb_random_seed=821734 +NUM_TXN=20
# Raise targeted verbosity only for relevant component
./simv +UVM_TESTNAME=axi_random_test +ntb_random_seed=821734 \
+uvm_set_verbosity=uvm_test_top.env.scb,SCB,UVM_HIGH,run
// Example: deterministic seed pin in a temporary debug test
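class repro_821734_test extends axi_random_test;
  `uvm_component_utils(repro_821734_test)
  // Constructor needed so the factory can create the test.
  function new(string name, uvm_component parent);
    super.new(name, parent);
  endfunction
  task run_phase(uvm_phase phase);
    phase.raise_objection(this);
    // Run exactly one sequence path to minimize noise (helper not shown).
    run_targeted_sequence_only();
    phase.drop_objection(this);
  endtask
endclass
Key takeaways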
Classify by signature, then validate replay determinism.
Hold seed constant while perturbing one variable at a time.
Reduce scope to a minimal reproducible failing flow.
Close the loop by adding a permanent guard for the discovered bug.
Common pitfalls
Changing multiple knobs per iteration and losing causality.
Investigating seed-by-seed without signature grouping.
Declaring a failure unreproducible before checking binary/environment parity.
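The first pitfall above can be caught mechanically by diffing the plusargs of two consecutive runs. This is a minimal sketch, assuming knobs are passed as whitespace-separated `+NAME=value` or `+NAME` plusargs with no embedded spaces; `changed_knobs` is an illustrative name, not part of any tool.

```shell
# Print the names of plusargs whose presence or value differs
# between two simulator command lines, one name per line.
changed_knobs() {
  {
    printf '%s\n' "$1" | tr -s ' \t' '\n' | sed 's/^/A /'
    printf '%s\n' "$2" | tr -s ' \t' '\n' | sed 's/^/B /'
  } | awk '
    $2 ~ /^\+/ {
      split($2, kv, "=")                     # kv[1] = knob name, e.g. +NUM_TXN
      val = substr($2, length(kv[1]) + 2)    # text after the first "="
      if ($1 == "A") a[kv[1]] = val
      else           b[kv[1]] = val
      seen[kv[1]] = 1
    }
    END {
      for (k in seen)
        # Check membership first: indexing a[k] would create the entry.
        if (((k in a) != (k in b)) || a[k] != b[k]) print k
    }
  ' | sort
}
```

For example, comparing the `+NUM_TXN=500` and `+NUM_TXN=100` commands from the perturbation section prints only `+NUM_TXN`. If the output has more than one line, the two runs differ in more than one knob and any change in outcome cannot be attributed to a single cause.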
Failure journal template
Document intermittent issues with a short reproducibility journal. This prevents knowledge loss when ownership shifts.
[REG] intermittent issue journal
Bug ID:
Signature:
First seen date:
Affected tests:
Repro command:
Determinism check (N runs):
Minimal failing configuration:
Root cause:
Fix commit:
Guard added (test/assertion/seed):
Residual risk:
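To make the journal cheap to start, the template can be emitted by a small helper with the fields known at triage time already filled in. This is only a sketch: the function name `new_journal_entry` and its argument order (bug ID, signature, repro command) are assumptions, and the remaining fields are left blank to be completed by hand.

```shell
# Print a journal entry with bug ID, signature, and repro command filled in.
# Usage: new_journal_entry <bug-id> <signature> <repro-command>
new_journal_entry() {
  cat <<EOF
[REG] intermittent issue journal
Bug ID: $1
Signature: $2
First seen date:
Affected tests:
Repro command: $3
Determinism check (N runs):
Minimal failing configuration:
Root cause:
Fix commit:
Guard added (test/assertion/seed):
Residual risk:
EOF
}
```

Redirecting the output to a file next to the failing logs keeps the journal and the evidence together.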