Part 8 · Senior & Interview Prep · Intermediate

Failure Triage Method

First-error discipline, error-signature classification, the reproduce-minimize-isolate loop, and deciding TB bug vs RTL bug vs spec ambiguity.

First-error discipline

The single highest-leverage debug habit is brutally simple: only the first error matters. Once a scoreboard mismatches or an assertion fires, the simulation must be treated as corrupted: the testbench may be comparing against stale expectations, sequences may have desynchronized from the DUT, and every subsequent message may be a downstream echo of the original fault. Engineers who debug error #47 because it 'looks more familiar' routinely lose days chasing a symptom of a symptom.

In practice: grep the log for the first ERROR or FATAL, note its simulation time, and treat everything after that timestamp as suspect. The one exception: a later error with a much clearer signature can guide where to look, but the verdict must still be confirmed at the time of the first error.
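The grep step can be scripted so the first error and its timestamp come out in one shot. The following is a minimal sketch, not a definitive tool: it assumes errors carry a UVM_ERROR or UVM_FATAL token and that the time is written as "@ <number> [unit]"; the function name first_error and all three patterns are illustrative and will need adjusting to your simulator's log format.

```python
import re

# Assumptions: errors carry a UVM_ERROR or UVM_FATAL token, and the time is
# written as "@ <number> [unit]". Adjust the patterns to your log format.
SEVERITY = re.compile(r"\b(UVM_ERROR|UVM_FATAL)\b")
# End-of-run count lines such as "UVM_ERROR :    0" are not real errors.
SUMMARY = re.compile(r"\bUVM_(ERROR|FATAL)\s*:\s*\d+\s*$")
TIMESTAMP = re.compile(r"@\s*([\d,]+(?:\.\d+)?)\s*(fs|ps|ns|us|ms|s)?\b")


def first_error(lines):
    """Return (line_number, time_text, line) for the first error, or None."""
    for number, line in enumerate(lines, start=1):
        if SEVERITY.search(line) and not SUMMARY.search(line):
            stamp = TIMESTAMP.search(line)
            time_text = stamp.group(0).lstrip("@").strip() if stamp else "unknown"
            return number, time_text, line.rstrip("\n")
    return None
```

Pass it an open log file (or any iterable of lines). Everything in the log after the returned line number is suspect until the first error is explained.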

LOG WITH 200 ERRORS — what a senior reads

  @ 12,450 ns  UVM_ERROR  SB: data mismatch exp=0xA4 act=0x00   ◄── ONLY THIS
  @ 12,470 ns  UVM_ERROR  SB: data mismatch exp=0x31 act=0x00       (echo)
  @ 12,490 ns  UVM_ERROR  SB: data mismatch exp=0x77 act=0x00       (echo)
  ...197 more...                                                     (noise)

  Clue hidden in the noise: act is ALWAYS 0x00 from 12,450 ns onward
  → something stopped responding at ~12,450 ns. Debug THAT instant.
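The clue in the excerpt above (one field frozen while the others vary) can be pulled out mechanically. This is a rough sketch that assumes error messages print their fields as name=value pairs, as the scoreboard lines above do; constant_fields is a hypothetical helper name, not part of any library.

```python
import re

# Assumption: fields are printed as name=value with no spaces inside the value.
FIELD = re.compile(r"\b(\w+)=(\S+)")


def constant_fields(error_lines):
    """Return {field: value} for fields present on every line with one value."""
    lines = list(error_lines)
    if len(lines) < 2:
        return {}  # a single line cannot show that anything is "always" the same
    values, counts = {}, {}
    for line in lines:
        for name, value in dict(FIELD.findall(line)).items():
            values.setdefault(name, set()).add(value)
            counts[name] = counts.get(name, 0) + 1
    return {
        name: next(iter(seen))
        for name, seen in values.items()
        if len(seen) == 1 and counts[name] == len(lines)
    }
```

A frozen field is only a pointer to where to look. The verdict still has to be confirmed at the time of the first error.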

Error-signature classification

Before opening a single waveform, classify the failure by its signature. Each class has a different most-likely cause and a different first experiment, so thirty seconds of classification saves hours of undirected searching.

Data mismatch (exp vs act differ) — scoreboard or reference-model logic, monitor sampling, or a real RTL data-path bug. First move: is the mismatch in one field or all fields?
Missing transaction (expected, never seen) — DUT dropped it, monitor missed it, or the predictor over-generated. First move: did the txn appear on the pins at all?
Extra/unexpected transaction — monitor double-sampling, protocol retry not modeled, or DUT replaying. First move: compare counts at each pipeline stage.
Timeout/hang — objection or drain bug, mailbox deadlock, DUT stall, or a sequence waiting forever. First move: where is simulation time stuck and what is each thread blocked on?
Assertion failure — read the assertion before anything else; half the time the assertion itself encodes the spec wrong.
X on outputs — initialization or reset gap; jump straight to the X-hunting method.
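The classification above can be approximated as a first-pass lookup on the first error message. This is only a sketch: the keywords and the X-value pattern are assumptions about how a testbench words its messages, not a standard, and the fallback class is added here for completeness. The first moves are taken from the list above.

```python
import re

# Assumed pattern for an X value in a printed field, e.g. "act=8'hxx" or "q=x".
X_VALUE = re.compile(r"=\s*(?:\d*'[bhdo])?[xX]+\b")

# (class, assumed keywords, first move). Order matters: first match wins.
RULES = [
    ("assertion failure", ("assert",),
     "read the assertion itself before anything else"),
    ("timeout/hang", ("timeout", "timed out", "watchdog"),
     "find where simulation time is stuck and what each thread is blocked on"),
    ("missing transaction", ("missing", "never seen", "not received"),
     "check whether the transaction appeared on the pins at all"),
    ("extra transaction", ("unexpected", "extra"),
     "compare transaction counts at each pipeline stage"),
    ("data mismatch", ("mismatch",),
     "check whether the mismatch is in one field or all fields"),
]


def classify(message):
    """Return (signature_class, first_move) for one error message."""
    if X_VALUE.search(message):
        return "X on outputs", "look for an initialization or reset gap"
    lowered = message.lower()
    for name, keywords, first_move in RULES:
        if any(keyword in lowered for keyword in keywords):
            return name, first_move
    return "unclassified", "read the message and the code that emitted it"
```

The X check runs first because a scoreboard usually reports an X as a mismatch, and the reset question should be answered before the data-path one.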

Reproduce → minimize → isolate

  REPRODUCE                MINIMIZE                 ISOLATE
  ─────────────            ─────────────            ─────────────
  same seed                shorter sim              binary-search time:
  same build               (kill after 1st err)     which txn # breaks it?
  same +args         ──►   fewer agents       ──►
  fails identically?       fewer txn types          binary-search space:
       │                   single sequence          which component lies?
       no → env diff,           │                        │
       race, or stale      still fails with         smallest config that
       build — fix         3 txns instead of        fails = your debug
       this FIRST          30,000? gold.            vehicle from now on

Minimization is not optional polish; it is the step that makes every later experiment orders of magnitude cheaper. A failure that reproduces in a 3-transaction, 2-microsecond run can be re-simulated after every hypothesis tweak; the original 30,000-transaction overnight run cannot.
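The "which txn # breaks it?" question is a binary search over run length. The sketch below assumes a hypothetical callback fails(n) that runs the simulation with the same seed, build, and +args, stops after the first n transactions, and returns True if the failure reproduces. It also assumes that once the failure appears at some n it appears for every longer run; if that does not hold, the result is not meaningful.

```python
def smallest_failing_count(fails, total):
    """Return the smallest n in 1..total for which fails(n) is True.

    Assumes fails is monotonic: False below some threshold, True from it on.
    """
    if not fails(total):
        # The full run must fail first; otherwise reproduction is the problem.
        raise ValueError("full run does not fail; fix reproduction first")
    low, high = 1, total
    while low < high:
        mid = (low + high) // 2
        if fails(mid):
            high = mid      # failure already present; look earlier
        else:
            low = mid + 1   # failure not yet triggered; look later
    return low
```

For a 30,000-transaction run this needs at most 16 simulations, and most of them are shorter than the original. The same halving idea applies to the spatial search: disable half the agents or transaction types and keep whichever half still fails.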

The verdict: TB bug, RTL bug, or spec ambiguity

Every debug ends in one of three verdicts, and each verdict requires positive evidence, not just the absence of other explanations. Seniors are distinguished by refusing to file an RTL bug until the evidence checklist is complete, because a falsely accused designer who finds a TB bug in your report costs you credibility for months.

Evidence checklist per verdict

TB bug: the pins show correct DUT behavior but the scoreboard/monitor disagrees; the expected value is provably wrong against the spec; or the failure disappears when a TB component is corrected. Pin-level truth is the arbiter.
RTL bug: the pins themselves violate the spec — you can cite the spec section, the cycle number, the offending signal, and show the input stimulus was legal. All four or it is not ready to file.
Spec ambiguity: TB and RTL each implement a defensible reading of the spec; you can quote the ambiguous sentence and articulate both readings. Verdict: an email to the architect, not a bug to either side.
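The "all four or it is not ready to file" rule for an RTL bug can be made explicit as a small record that reports what is still missing. This is an illustrative sketch; the class and field names are hypothetical and simply mirror the four items in the checklist above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RtlBugEvidence:
    spec_section: Optional[str] = None   # the spec section being violated
    cycle: Optional[int] = None          # cycle number of the violation
    signal: Optional[str] = None         # the offending signal at the pins
    stimulus_shown_legal: bool = False   # input stimulus demonstrated legal

    def missing(self) -> List[str]:
        """List the checklist items that have not been supplied yet."""
        gaps = []
        if not self.spec_section:
            gaps.append("spec section")
        if self.cycle is None:
            gaps.append("cycle number")
        if not self.signal:
            gaps.append("offending signal")
        if not self.stimulus_shown_legal:
            gaps.append("proof that the stimulus was legal")
        return gaps

    def ready_to_file(self) -> bool:
        return not self.missing()
```

An empty missing() list is necessary, not sufficient: each field still has to point at pin-level evidence rather than scoreboard output.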

Bug report that lands

One-line summary with the signature class (mismatch, hang, X...).
Seed, build, command line — exact reproduction in one paste.
First-error time and the minimized scenario (3 txns, not 30,000).
Evidence: pin-level waveform or log excerpt, with the spec section cited.
Your verdict and confidence level — and what would change your mind.
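The five items above map directly onto a fixed template, which keeps reports uniform and makes an omission obvious. The following is a minimal sketch; BugReport and its field names are hypothetical, and the layout of each line is one possible choice rather than a required format.

```python
from dataclasses import dataclass


@dataclass
class BugReport:
    signature: str            # signature class: mismatch, hang, X, ...
    summary: str
    seed: str
    build: str
    command: str              # exact command line, pasteable as-is
    first_error_time: str
    minimized_scenario: str
    evidence: str             # pin-level waveform or log excerpt + spec section
    verdict: str              # TB bug, RTL bug, or spec ambiguity
    confidence: str
    would_change_mind: str

    def render(self) -> str:
        """Render the report as five lines, one per checklist item."""
        return "\n".join([
            f"[{self.signature}] {self.summary}",
            f"Repro: seed={self.seed} build={self.build} cmd: {self.command}",
            f"First error: {self.first_error_time}; minimized to: {self.minimized_scenario}",
            f"Evidence: {self.evidence}",
            f"Verdict: {self.verdict} (confidence: {self.confidence}); "
            f"would change my mind: {self.would_change_mind}",
        ])
```

Because every field is required, constructing the report fails loudly if the seed, the minimized scenario, or the verdict has been left out.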

Key takeaways

Debug only the first error; later errors are echoes of a corrupted state.
Classify the signature before opening waves — each class has a different first experiment.
Minimize relentlessly: a 3-transaction repro makes every hypothesis cheap to test.
Each verdict (TB/RTL/spec) needs positive evidence; pin-level truth is the arbiter.

Common pitfalls

Debugging error #47 because it looks familiar — it is downstream noise of error #1.
Filing an RTL bug from scoreboard output alone, without confirming at the pins.
Skipping minimization 'to save time' — then waiting 40 minutes per experiment all night.
Calling spec ambiguity to dodge a hard debug — you must quote the ambiguous sentence.
