Part 8 · Senior & Interview Prep · Intermediate

Three Debug Case Studies

Three complete worked debugs with timelines — a monitor sampling-edge bug, an uninitialized config knob, and an objection-drain hang — each with the generalized lesson.

Case 1: The scoreboard mismatch that blamed the wrong component

A bus environment began reporting sporadic data mismatches after an RTL drop. The natural narrative — 'new RTL, new bug' — pointed at the design. The playbook said otherwise.

TIMELINE — Case 1 (monitor sampling-edge bug)

  T+0:00  Regression: 7/400 seeds fail, all "SB data mismatch".
  T+0:10  Triage: first error only; signature = single-field mismatch
          (data wrong, addr/resp correct). Note: exp looks SHIFTED —
          act of txn N equals exp of txn N-1 in 3 of 3 checked cases.
  T+0:25  Minimize: 6-txn directed repro fails identically. Gold.
  T+0:40  Verdict test at the pins: wave shows DUT rdata CORRECT
          at the protocol's sample point. DUT acquitted → TB bug.
  T+1:05  Backward trace in the MONITOR: it samples rdata on the
          ready edge; new RTL legally asserts ready one cycle EARLIER
          (spec allows 0-wait responses). Monitor grabs the bus one
          cycle before data is valid → captures the PREVIOUS data beat.
  T+1:20  Fix: monitor samples per spec (valid && ready), not on the
          ready edge, which baked in the old RTL's wait-state timing.
          Add a 0-wait-state directed test. All 400 seeds green.
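
The sampling bug in this timeline can be reproduced with a small cycle-level model. The sketch below is written in Python as a simulator-independent illustration, not as UVM code. The `Cycle` record, the three-cycle `beat` shape, and the data values are assumptions made for the example; only the two sampling rules (ready edge versus valid && ready) come from the case.

```python
from typing import List, NamedTuple


class Cycle(NamedTuple):
    """One clock cycle of a hypothetical read-response channel."""
    valid: bool
    ready: bool
    rdata: int


def beat(old: int, new: int) -> List[Cycle]:
    """One data beat where ready rises a cycle before the data is valid."""
    return [
        Cycle(valid=False, ready=True, rdata=old),   # ready early, bus still holds old data
        Cycle(valid=True, ready=True, rdata=new),    # the legal sample point
        Cycle(valid=False, ready=False, rdata=new),  # idle, bus keeps its last value
    ]


EXPECTED = [0xA1, 0xB2, 0xC3]
TRACE = beat(0x00, 0xA1) + beat(0xA1, 0xB2) + beat(0xB2, 0xC3)


def monitor_on_ready_edge(trace: List[Cycle]) -> List[int]:
    """Buggy monitor: captures rdata on the rising edge of ready."""
    captured, prev_ready = [], False
    for c in trace:
        if c.ready and not prev_ready:
            captured.append(c.rdata)
        prev_ready = c.ready
    return captured


def monitor_on_handshake(trace: List[Cycle]) -> List[int]:
    """Fixed monitor: captures rdata only when valid and ready are both high."""
    return [c.rdata for c in trace if c.valid and c.ready]


buggy = monitor_on_ready_edge(TRACE)
fixed = monitor_on_handshake(TRACE)
print("buggy:", [hex(x) for x in buggy])
print("fixed:", [hex(x) for x in fixed])
```

In this model the buggy monitor's capture for transaction N equals the expected data of transaction N-1, which is the same signature the triage step noted, while the handshake-based monitor matches the expected stream exactly.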

The off-by-one-transaction pattern in the mismatch — actual matching the previous expected — was the tell visible in the first ten minutes. Generalized lesson: when exp/act look time-shifted rather than corrupted, suspect a sampling-position bug in the monitor before anything else, and remember that monitors must encode the protocol's legal timing range, not the timing the current RTL happens to exhibit.
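
The "time-shifted rather than corrupted" check can be automated during triage. The helper below is a minimal Python sketch under assumed inputs: `exp` and `act` are the scoreboard's expected and actual data values in transaction order, and `shifted_by` and `max_lag` are hypothetical names chosen for illustration.

```python
from typing import Optional, Sequence


def shifted_by(exp: Sequence[int], act: Sequence[int],
               max_lag: int = 3) -> Optional[int]:
    """Return the smallest lag k >= 1 with act[n] == exp[n - k] for all n >= k.

    Returns None when the streams already match or when no lag up to
    max_lag explains the mismatch.
    """
    if list(exp) == list(act):
        return None
    for lag in range(1, max_lag + 1):
        # Pair act[lag] with exp[0], act[lag + 1] with exp[1], and so on.
        pairs = list(zip(act[lag:], exp))
        if pairs and all(a == e for a, e in pairs):
            return lag
    return None


exp = [0xA1, 0xB2, 0xC3, 0xD4]
print(shifted_by(exp, [0x00, 0xA1, 0xB2, 0xC3]))  # act lags exp by one transaction
print(shifted_by(exp, [0xA1, 0x99, 0xC3, 0xD4]))  # one corrupted value, no shift
```

A non-None result points at sampling position; None with mismatches present points at genuine corruption. Repeated or constant data can produce false positives, so the check is most trustworthy on streams with distinct values and more than a handful of transactions.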


Case 2: The random failure at seed 4217

One seed in a thousand failed with an illegal-burst assertion. Re-running seed 4217 reproduced it perfectly — a deterministic 'random' failure, which is the best kind.

TIMELINE — Case 2 (uninitialized config knob)

  T+0:00  Seed 4217 fails: SVA "burst must not cross 4KB" fires.
          Triage: assertion read first — assertion encodes spec
          correctly. Stimulus is genuinely illegal → TB generated it.
  T+0:15  Where do burst bounds come from? cfg.max_burst_bytes,
          set by the test from a +arg... supposedly.
  T+0:30  Log the cfg object at start: max_burst_bytes = 0 (!).
          The +arg parse populated a LOCAL variable, and the copy into
          cfg normally ran after the parse. A build refactor had moved
          the parse later for one test variant, so there the copy ran
          first and captured 0.
  T+0:45  Why only seed 4217? With max=0 the constraint
          "len*size <= max_burst_bytes" became insoluble; randomize()
          returned 0 — UNCHECKED — leaving len/size at stale values
          from the previous txn. Most stale pairs were legal by luck;
          seed 4217's sequence produced a stale pair crossing 4KB.
  T+1:00  Fixes (all three): check every randomize() return;
          `uvm_fatal on cfg fields still at poison defaults
          (init to 'hDEAD_BEEF, not 0); end-of-build cfg.print().
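
The stale-value mechanism at T+0:45 is easy to miss, so here is a small Python model of it. It imitates the SystemVerilog behavior described in the timeline, where a failed randomize() returns 0 and leaves the fields untouched. The `BurstTxn` class, the legal ranges for `len` and `size`, and the `randomize_checked` wrapper are assumptions for illustration, not the original testbench code.

```python
import random

SIZES = (1, 2, 4, 8)   # assumed bytes-per-beat choices
LENS = range(1, 17)    # assumed beats-per-burst choices


class BurstTxn:
    def __init__(self) -> None:
        self.len = 1
        self.size = 1

    def randomize(self, max_burst_bytes: int, rng: random.Random) -> bool:
        """Pick a legal (len, size) pair; on failure change nothing, return False."""
        legal = [(n, s) for n in LENS for s in SIZES if n * s <= max_burst_bytes]
        if not legal:
            return False  # fields keep their previous (stale) values
        self.len, self.size = rng.choice(legal)
        return True


def randomize_checked(txn: BurstTxn, max_burst_bytes: int, rng: random.Random) -> None:
    """Wrapper that turns a failed randomize into an immediate error."""
    if not txn.randomize(max_burst_bytes, rng):
        raise RuntimeError(
            f"randomize failed: no len*size <= {max_burst_bytes} is solvable")


rng = random.Random(4217)
txn = BurstTxn()
txn.randomize(64, rng)               # healthy constraint, fields updated
stale = (txn.len, txn.size)
ok = txn.randomize(0, rng)           # insoluble constraint, result ignored
print(ok, (txn.len, txn.size) == stale)
```

The unchecked call reports failure only through its return value, so ignoring it lets the previous transaction's values flow into the next one. The checked wrapper fails on the first insoluble call instead.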

Generalized lesson: a failure 'at random seed N' is rarely about seed N — it is a latent deterministic bug (here: uninitialized knob plus unchecked randomize) that needed a specific stimulus pattern to become visible. Poison-default config values turn this class of bug from a one-in-a-thousand mystery into a first-run fatal.
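
The poison-default idea can be sketched in a few lines. The Python model below assumes a config object whose knobs all start at a sentinel value and a check that runs once configuration is supposed to be complete. `BusConfig`, `num_outstanding`, `unset_knobs`, and `check_config` are hypothetical names; only `max_burst_bytes` and the DEAD_BEEF sentinel come from the case.

```python
from dataclasses import dataclass, fields
from typing import List

POISON = 0xDEAD_BEEF  # sentinel that no legal knob value should ever equal


@dataclass
class BusConfig:
    max_burst_bytes: int = POISON
    num_outstanding: int = POISON  # hypothetical second knob for the example


def unset_knobs(cfg: BusConfig) -> List[str]:
    """Names of all knobs still holding the poison default."""
    return [f.name for f in fields(cfg) if getattr(cfg, f.name) == POISON]


def check_config(cfg: BusConfig) -> None:
    """Fail loudly if any knob was never written."""
    missing = unset_knobs(cfg)
    if missing:
        raise RuntimeError("config knobs never set: " + ", ".join(missing))


cfg = BusConfig(num_outstanding=4)   # max_burst_bytes was never copied in
print(unset_knobs(cfg))
cfg.max_burst_bytes = 4096
check_config(cfg)                    # passes once every knob is set
```

A default of 0 is indistinguishable from a legitimately configured 0, which is why the sentinel has to be a value outside the legal range of every knob.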


Case 3: The hang at end of test

A previously stable test began timing out at the global watchdog. No errors, no assertion failures — simulation time simply marched to the timeout with the DUT idle. Hang signature → thread-dump approach.

TIMELINE — Case 3 (objection/drain logic)

  T+0:00  Watchdog timeout at 2ms; DUT idle since 0.4ms. No errors.
  T+0:10  Heartbeat counters: drv_sent=200/200, mon_seen=200,
          sb_cmp=198. Two responses outstanding... forever.
  T+0:30  Thread dump: scoreboard blocked in expected-queue drain
          wait (waiting for 2 more matches); test's end-of-test
          objection held by the scoreboard until the queue empties.
          Classic standoff: nothing will produce 2 more responses.
  T+0:50  Why 2 missing? Monitor counts 200, scoreboard compared
          198 → drop INSIDE the scoreboard path. The analysis fifo
          was sized — bounded! — at 64; under a late burst the
          writer's try_put silently dropped 2 transactions.
  T+1:10  Fix: unbounded analysis fifo (or put() with backpressure,
          never try_put on a checking path). Drain logic gains a
          progress watchdog: if outstanding count is static for
          100us, FATAL with the queue contents printed — a hang
          becomes a 1-minute diagnosis next time.
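
The 200-versus-198 accounting gap can be reproduced with a toy model of a bounded fifo. This Python sketch assumes a depth of 64 as in the timeline. The split into 134 steady transactions followed by a 66-transaction burst is an assumption chosen so the numbers line up with the case, and `BoundedFifo`, `write_unchecked`, `write_checked`, and `run` are hypothetical names.

```python
from collections import deque


class BoundedFifo:
    def __init__(self, depth: int) -> None:
        self.depth = depth
        self.items = deque()

    def try_put(self, item) -> bool:
        """Non-blocking put: returns False, storing nothing, when full."""
        if len(self.items) >= self.depth:
            return False
        self.items.append(item)
        return True


def write_unchecked(fifo: BoundedFifo, item) -> None:
    fifo.try_put(item)  # result discarded: a full fifo drops the item silently


def write_checked(fifo: BoundedFifo, item) -> None:
    if not fifo.try_put(item):
        raise RuntimeError(f"checking-path fifo full, would drop txn {item}")


def run(write, total: int = 200, steady: int = 134):
    """Return (mon_seen, sb_cmp) heartbeat counters for one simulated test."""
    fifo = BoundedFifo(depth=64)
    mon_seen = sb_cmp = 0
    for txn in range(total):
        write(fifo, txn)
        mon_seen += 1
        if txn < steady:            # steady phase: the reader keeps up
            fifo.items.popleft()
            sb_cmp += 1
    sb_cmp += len(fifo.items)       # reader finally drains the late burst
    return mon_seen, sb_cmp


print(run(write_unchecked))
```

With the unchecked writer the monitor-side counter reaches 200 while the scoreboard compares only 198, and nothing is reported at the moment of the drop. Swapping in `write_checked` makes the same run fail at the first dropped transaction.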

Generalized lesson: end-of-test hangs are almost always accounting bugs — something holds completion hostage waiting for events that can no longer occur. Instrument the drain path to report what it is waiting for, and never allow silent drops (try_put, void-cast gets) anywhere on a checking path. A hang that explains itself costs minutes; one that does not costs a day.
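
A drain path that explains itself can be as simple as a stall detector polled alongside the wait. The Python sketch below models the progress watchdog from the fix, with time passed in explicitly as microseconds. `DrainWatchdog` and `poll` are hypothetical names, and the transaction labels are made up for the example.

```python
from typing import List, Optional


class DrainWatchdog:
    """Raises if the outstanding count stays non-zero and static too long."""

    def __init__(self, stall_limit_us: int = 100) -> None:
        self.stall_limit_us = stall_limit_us
        self.last_count: Optional[int] = None
        self.last_change_us = 0

    def poll(self, now_us: int, outstanding: List[str]) -> None:
        count = len(outstanding)
        if count != self.last_count:
            # Progress (or first poll): remember when the count last changed.
            self.last_count = count
            self.last_change_us = now_us
            return
        if count == 0:
            return  # fully drained, nothing to wait for
        stalled_us = now_us - self.last_change_us
        if stalled_us >= self.stall_limit_us:
            raise RuntimeError(
                f"drain stalled for {stalled_us}us with {count} outstanding: "
                f"{outstanding}")


wd = DrainWatchdog(stall_limit_us=100)
pending = ["txn198", "txn199"]
wd.poll(400, pending)      # first observation at 400us
wd.poll(450, pending)      # static for 50us, still within the limit
try:
    wd.poll(500, pending)  # static for 100us, watchdog fires
except RuntimeError as err:
    print(err)
```

The important design choice is that the error message carries the queue contents, so the report names the transactions being waited on rather than just announcing a timeout.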

The three lessons side by side

  1. Time-shifted exp/act → monitor sampling bug; monitors encode the spec's timing range, not current RTL behavior.

  2. Random-seed failure → latent deterministic bug; poison defaults and checked randomize() surface it on run one.

  3. End-of-test hang → completion accounting; drain logic must report what it awaits, and checking paths never drop silently.

Key takeaways

  • Patterns in HOW the data is wrong (shifted, stale, missing-count) identify the guilty component class before waves are opened.

  • Acquit or convict the DUT at the pins first — every case above pivoted on that single check.

  • Poison defaults, checked randomize, and self-reporting drain logic convert mystery failures into self-diagnosing ones.

  • After every debug, generalize: what checklist item, assertion, or instrument would have caught this in minutes?

Common pitfalls

  • Accepting 'new RTL drop, must be RTL' as a verdict — narratives are not evidence.

  • Chasing the seed instead of the latent bug the seed exposed.

  • Bounded fifos and try_put on checking paths — silent drops that surface as unrelated hangs.

  • Closing a debug without adding the test or instrument that would have caught it.