Part 8 · Senior & Interview Prep · Intermediate
Failure Triage Method
First-error discipline, error-signature classification, the reproduce-minimize-isolate loop, and deciding TB bug vs RTL bug vs spec ambiguity.
First-error discipline
The single highest-leverage debug habit is brutal: only the first error matters. Once a scoreboard mismatches or an assertion fires, the simulation is in a corrupted state: the testbench may be comparing against stale expectations, sequences may have desynchronized from the DUT, and every subsequent message is likely a downstream echo of the original fault. Engineers who debug error #47 because it 'looks more familiar' routinely lose days chasing a symptom of a symptom.
In practice: grep the log for the first ERROR or FATAL, note its simulation time, and treat everything after that timestamp as suspect. The one exception: a later error with a much clearer signature can guide where to look, but the verdict must still be confirmed at the time of the first error.
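A minimal sketch of that first-error search, assuming log lines in the stylized form used by this section's sample log (`@ <time> ns UVM_ERROR ...`). Real simulator output is formatted differently, so the pattern and the `first_error` helper name are illustrative assumptions rather than any tool's actual format.

```python
import re

# Assumed line shape (stylized, as in this section's sample log):
#   "@ 12,450 ns UVM_ERROR SB: data mismatch exp=0xA4 act=0x00"
# Adjust the pattern to whatever your simulator actually prints.
LINE_RE = re.compile(r"@\s*([\d,]+)\s*ns\s+(UVM_ERROR|UVM_FATAL)\s+(.*)")


def first_error(log_lines):
    """Return (time_ns, severity, message) of the earliest error, or None."""
    earliest = None
    for line in log_lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # info/warning lines and unrelated text are ignored
        time_ns = int(match.group(1).replace(",", ""))
        # Strict '<' keeps the first-seen line when two errors share a time.
        if earliest is None or time_ns < earliest[0]:
            earliest = (time_ns, match.group(2), match.group(3).strip())
    return earliest


sample_log = [
    "@ 12,470 ns UVM_ERROR SB: data mismatch exp=0x31 act=0x00",
    "@ 12,450 ns UVM_ERROR SB: data mismatch exp=0xA4 act=0x00",
    "@ 12,490 ns UVM_ERROR SB: data mismatch exp=0x77 act=0x00",
]
print(first_error(sample_log))
```

Comparing timestamps rather than taking the first matching line guards against logs whose lines are not strictly ordered by simulation time, for example when several log files have been merged.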
LOG WITH 200 ERRORS — what a senior reads
@ 12,450 ns UVM_ERROR SB: data mismatch exp=0xA4 act=0x00 ◄── ONLY THIS
@ 12,470 ns UVM_ERROR SB: data mismatch exp=0x31 act=0x00 (echo)
@ 12,490 ns UVM_ERROR SB: data mismatch exp=0x77 act=0x00 (echo)
...197 more... (noise)
Clue hidden in the noise: act is ALWAYS 0x00 after 12,450 ns
→ something stopped responding at ~12,450 ns. Debug THAT instant.
Error-signature classification
Before opening a single waveform, classify the failure by its signature. Each class has a different most-likely cause and a different first experiment, so thirty seconds of classification saves hours of undirected searching.
Data mismatch (exp vs act differ) — scoreboard or reference-model logic, monitor sampling, or a real RTL data-path bug. First move: is the mismatch in one field or all fields?
Missing transaction (expected, never seen) — DUT dropped it, monitor missed it, or the predictor over-generated. First move: did the txn appear on the pins at all?
Extra/unexpected transaction — monitor double-sampling, protocol retry not modeled, or DUT replaying. First move: compare counts at each pipeline stage.
Timeout/hang — objection or drain bug, mailbox deadlock, DUT stall, or a sequence waiting forever. First move: where is simulation time stuck and what is each thread blocked on?
Assertion failure — read the assertion before anything else; half the time the assertion itself encodes the spec wrong.
X on outputs — initialization or reset gap; jump straight to the X-hunting method.
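The classification step above can be roughed out as a keyword lookup on the first error's message. This is only a sketch: the keyword lists are hypothetical and depend entirely on how your testbench words its messages, and the `classify` helper is not part of any library.

```python
# Hypothetical keyword rules, checked in order. Specific signatures come
# before the generic "mismatch" so that an X in the actual value wins.
RULES = [
    ("timeout/hang", ("timeout", "timed out", "watchdog"),
     "find where sim time is stuck and what each thread is blocked on"),
    ("assertion failure", ("assertion",),
     "read the assertion itself before anything else"),
    ("X on outputs", ("=x", "'hx"),
     "look for an initialization or reset gap"),
    ("missing transaction", ("missing", "never seen", "not received"),
     "check whether the txn appeared on the pins at all"),
    ("extra transaction", ("unexpected", "extra"),
     "compare counts at each pipeline stage"),
    ("data mismatch", ("mismatch",),
     "check whether one field or all fields differ"),
]


def classify(message):
    """Map an error message to (signature class, suggested first move)."""
    text = message.lower()
    for name, keywords, first_move in RULES:
        if any(keyword in text for keyword in keywords):
            return name, first_move
    return "unclassified", "read the first error by hand"


signature, first_move = classify("SB: data mismatch exp=0xA4 act=0x00")
print(signature, "->", first_move)
```

Rule order matters here: a message can match several keyword sets, and the first matching rule decides the class.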
Reproduce → minimize → isolate
REPRODUCE                 MINIMIZE                  ISOLATE
─────────────             ─────────────             ─────────────
same seed                 shorter sim               binary-search time:
same build                (kill after 1st err)      which txn # breaks it?
same +args          ──►   fewer agents        ──►
fails identically?        fewer txn types           binary-search space:
      │                   single sequence           which component lies?
no → env diff,                  │                         │
race, or stale            still fails with          smallest config that
build: fix                3 txns instead of         fails = your debug
this FIRST                30,000? gold.             vehicle from now on
Minimization is not optional polish: it is the step that makes every later experiment 100x cheaper. A failure that reproduces in a 3-transaction, 2-microsecond run can be re-simulated after every hypothesis tweak; the original 30,000-transaction overnight run cannot.
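The "binary-search time: which txn # breaks it?" step can be sketched as a search for the shortest failing prefix of the run. The sketch assumes the failure is deterministic for a fixed seed and monotonic (once a prefix of the transaction stream fails, every longer prefix fails too). The `fails` callback stands in for re-running the simulation truncated to `n` transactions; it and the breaking transaction number are hypothetical.

```python
def smallest_failing_prefix(total_txns, fails):
    """Return the smallest n in [1, total_txns] for which fails(n) is True.

    Assumes monotonic failure: if a prefix of n transactions fails,
    every longer prefix fails as well.
    """
    if not fails(total_txns):
        raise ValueError("full run does not fail; reproduce first")
    lo, hi = 1, total_txns
    while lo < hi:
        mid = (lo + hi) // 2
        if fails(mid):
            hi = mid          # failure is at or before mid
        else:
            lo = mid + 1      # failure needs more than mid transactions
    return lo


# Stand-in for "rerun with the same seed, stop after n transactions".
runs = []

def fake_sim_fails(n, breaking_txn=17042):
    runs.append(n)
    return n >= breaking_txn


n = smallest_failing_prefix(30000, fake_sim_fails)
print("first failing prefix length:", n, "in", len(runs), "runs")
```

Each probe is one truncated simulation, so a 30,000-transaction run needs at most 16 of them, including the initial check that the full run fails. If the failure is not monotonic, that points back to the reproduce step: something nondeterministic still has to be pinned down first.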
The verdict: TB bug, RTL bug, or spec ambiguity
Every debug ends in one of three verdicts, and each verdict requires positive evidence, not just the absence of other explanations. Seniors are distinguished by refusing to file an RTL bug until the evidence checklist is complete, because a falsely accused designer who finds a TB bug in your report costs you credibility for months.
Evidence checklist per verdict
TB bug: the pins show correct DUT behavior but the scoreboard/monitor disagrees; the expected value is provably wrong against the spec; or the failure disappears when a TB component is corrected. Pin-level truth is the arbiter.
RTL bug: the pins themselves violate the spec — you can cite the spec section, the cycle number, the offending signal, and show the input stimulus was legal. All four or it is not ready to file.
Spec ambiguity: TB and RTL each implement a defensible reading of the spec; you can quote the ambiguous sentence and articulate both readings. Verdict: an email to the architect, not a bug to either side.
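The "all four or it is not ready to file" rule for an RTL verdict can be written down as a small checklist object. This is an illustrative sketch only: the `RtlBugEvidence` class, its field names, and the example values are made up for the illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RtlBugEvidence:
    """The four pieces of evidence an RTL-bug verdict needs."""
    spec_section: Optional[str] = None
    cycle_number: Optional[int] = None
    offending_signal: Optional[str] = None
    stimulus_shown_legal: bool = False

    def missing(self):
        """List the evidence items that are still absent."""
        gaps = []
        if not self.spec_section:
            gaps.append("spec section")
        if self.cycle_number is None:
            gaps.append("cycle number")
        if not self.offending_signal:
            gaps.append("offending signal")
        if not self.stimulus_shown_legal:
            gaps.append("proof that the stimulus was legal")
        return gaps

    def ready_to_file(self):
        return not self.missing()


# Hypothetical example values, for illustration only.
evidence = RtlBugEvidence(spec_section="4.2.1", cycle_number=1245)
print(evidence.ready_to_file(), evidence.missing())
```

Returning the list of gaps, rather than a bare yes/no, makes it obvious which experiment is still owed before the bug goes to the designer.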
Bug report that lands
One-line summary with the signature class (mismatch, hang, X...).
Seed, build, command line — exact reproduction in one paste.
First-error time and the minimized scenario (3 txns, not 30,000).
Evidence: pin-level waveform or log excerpt, with the spec section cited.
Your verdict and confidence level — and what would change your mind.
Key takeaways
Debug only the first error; later errors are echoes of a corrupted state.
Classify the signature before opening waves — each class has a different first experiment.
Minimize relentlessly: a 3-transaction repro makes every hypothesis cheap to test.
Each verdict (TB/RTL/spec) needs positive evidence; pin-level truth is the arbiter.
Common pitfalls
Debugging error #47 because it looks familiar — it is downstream noise of error #1.
Filing an RTL bug from scoreboard output alone, without confirming at the pins.
Skipping minimization 'to save time' — then waiting 40 minutes per experiment all night.
Calling spec ambiguity to dodge a hard debug — you must quote the ambiguous sentence.