Part 8 · Senior & Interview Prep · Intermediate
Three Debug Case Studies
Three complete worked debugs with timelines — a monitor sampling-edge bug, an uninitialized config knob, and an objection-drain hang — each with the generalized lesson.
Case 1: The scoreboard mismatch that blamed the wrong component
A bus environment began reporting sporadic data mismatches after an RTL drop. The natural narrative — 'new RTL, new bug' — pointed at the design. The playbook said otherwise.
TIMELINE — Case 1 (monitor sampling-edge bug)
T+0:00 Regression: 7/400 seeds fail, all "SB data mismatch".
T+0:10 Triage: first error only; signature = single-field mismatch
(data wrong, addr/resp correct). Note: exp looks SHIFTED —
act of txn N equals exp of txn N-1 in 3 of 3 checked cases.
T+0:25 Minimize: 6-txn directed repro fails identically. Gold.
T+0:40 Verdict test at the pins: wave shows DUT rdata CORRECT
at the protocol's sample point. DUT acquitted → TB bug.
T+1:05 Backward trace in the MONITOR: it samples rdata on the
ready edge; new RTL legally asserts ready one cycle EARLIER
(spec allows 0-wait responses). Monitor grabs the bus one
cycle before data is valid → captures the PREVIOUS data beat.
T+1:20 Fix: monitor samples per spec (valid && ready), not on the
ready edge alone or an assumed wait count. Add a 0-wait-state
directed test.
All 400 seeds green.
The off-by-one-transaction pattern in the mismatch (actual matching the previous expected) was the tell, visible in the first ten minutes. Generalized lesson: when exp/act look time-shifted rather than corrupted, suspect a sampling-position bug in the monitor before anything else, and remember that monitors must encode the protocol's legal timing range, not the timing the current RTL happens to exhibit.
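The sampling difference in Case 1 can be reproduced outside the simulator. What follows is a minimal Python model, not testbench code: the `Cycle` record, `make_trace`, and both sampler functions are hypothetical names, and the bus is assumed to hold the previous beat until the new one is driven. It sketches why a monitor that triggers on the ready edge alone returns the previous beat once ready leads valid by one cycle.

```python
from typing import List, NamedTuple


class Cycle(NamedTuple):
    """One clock cycle as seen at the pins (hypothetical signal set)."""
    valid: bool
    ready: bool
    rdata: int


def make_trace(beats: List[int], ready_lead: int) -> List[Cycle]:
    """Build a pin-level trace in which ready rises `ready_lead` cycles
    before valid. Assumption: the bus holds the previous beat until the
    new one is driven, and starts at 0 after reset."""
    trace = []
    bus = 0
    for data in beats:
        trace.append(Cycle(False, False, bus))     # idle cycle, ready low
        for _ in range(ready_lead):
            trace.append(Cycle(False, True, bus))  # ready early, data still old
        bus = data
        trace.append(Cycle(True, True, bus))       # handshake, data valid
    return trace


def sample_on_ready_edge(trace: List[Cycle]) -> List[int]:
    """Buggy monitor: captures rdata on the rising edge of ready only."""
    captured = []
    prev_ready = False
    for cyc in trace:
        if cyc.ready and not prev_ready:
            captured.append(cyc.rdata)
        prev_ready = cyc.ready
    return captured


def sample_on_handshake(trace: List[Cycle]) -> List[int]:
    """Spec-style monitor: captures rdata whenever valid && ready."""
    return [cyc.rdata for cyc in trace if cyc.valid and cyc.ready]


beats = [0xA1, 0xB2, 0xC3]
old_rtl = make_trace(beats, ready_lead=0)  # ready and valid rise together
new_rtl = make_trace(beats, ready_lead=1)  # ready asserted one cycle earlier

print(sample_on_ready_edge(old_rtl))  # equals beats: the bug stays latent
print(sample_on_ready_edge(new_rtl))  # shifted: each entry is the previous beat
print(sample_on_handshake(new_rtl))   # equals beats on either trace
```

In this model the first value captured by the buggy sampler is the assumed reset value 0 rather than a real beat; every later entry is the expected data of the previous transaction, which is the signature noted at T+0:10.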
Case 2: The random failure at seed 4217
One seed in a thousand failed with an illegal-burst assertion. Re-running seed 4217 reproduced it perfectly — a deterministic 'random' failure, which is the best kind.
TIMELINE — Case 2 (uninitialized config knob)
T+0:00 Seed 4217 fails: SVA "burst must not cross 4KB" fires.
Triage: assertion read first — assertion encodes spec
correctly. Stimulus is genuinely illegal → TB generated it.
T+0:15 Where do burst bounds come from? cfg.max_burst_bytes,
set by the test from a +arg... supposedly.
T+0:30 Log the cfg object at start: max_burst_bytes = 0 (!).
The +arg parse populated a LOCAL variable. Normally the copy
into cfg ran after the parse, but a build refactor had moved
the parse later for one test variant, so there the copy ran
first and cfg kept its default of 0.
T+0:45 Why only seed 4217? With max=0 the constraint
"len*size <= max_burst_bytes" became insoluble; randomize()
returned 0 — UNCHECKED — leaving len/size at stale values
from the previous txn. Most stale pairs were legal by luck;
seed 4217's sequence produced a stale pair crossing 4KB.
T+1:00 Fixes (all three): check every randomize() return;
`uvm_fatal on cfg fields still at poison defaults
(init to 'hDEAD_BEEF, not 0); end-of-build cfg.print().
Generalized lesson: a failure 'at random seed N' is rarely about seed N; it is a latent deterministic bug (here: an uninitialized knob plus an unchecked randomize) that needed a specific stimulus pattern to become visible. Poison-default config values turn this class of bug from a one-in-a-thousand mystery into a first-run fatal.
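The two mechanisms in Case 2 can be illustrated with a small Python model rather than real testbench code. Everything here is a hypothetical stand-in: `Txn.randomize` mimics the behaviour described in the timeline (return false and leave the fields untouched when the constraint has no solution), the ranges for `len` and `size` are arbitrary, and `Cfg.check_initialized` plays the role of the fatal check on poison defaults.

```python
import random

POISON = 0xDEAD_BEEF  # poison default, as suggested in the timeline


class Cfg:
    """Hypothetical config object whose knob starts at a poison value."""

    def __init__(self):
        self.max_burst_bytes = POISON

    def check_initialized(self):
        # Stand-in for a fatal check at the end of the build step.
        if self.max_burst_bytes == POISON:
            raise RuntimeError("cfg.max_burst_bytes still at poison default")


class Txn:
    """Hypothetical transaction with a len*size <= max constraint."""

    def __init__(self):
        self.len = 1
        self.size = 1

    def randomize(self, max_burst_bytes, rng):
        # Enumerate legal (len, size) pairs; the ranges are assumptions.
        legal = [(l, s) for l in range(1, 17) for s in (1, 2, 4, 8)
                 if l * s <= max_burst_bytes]
        if not legal:
            return False          # no solution: fields keep their old values
        self.len, self.size = rng.choice(legal)
        return True


def checked_randomize(txn, max_burst_bytes, rng):
    """Wrapper that refuses to continue on a failed randomize."""
    if not txn.randomize(max_burst_bytes, rng):
        raise RuntimeError("randomize() failed: constraint has no solution")


rng = random.Random(4217)
txn = Txn()
txn.randomize(64, rng)                 # succeeds, fields updated
stale = (txn.len, txn.size)
ok = txn.randomize(0, rng)             # max = 0: no legal pair exists
print(ok, (txn.len, txn.size) == stale)  # False True: stale values reused
```

With the return value ignored, the stale pair is sent as if it were fresh stimulus. With `checked_randomize`, or with `check_initialized` called before any traffic starts, the same run stops at the first symptom instead.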
Case 3: The hang at end of test
A previously stable test began timing out at the global watchdog. No errors, no assertion failures — simulation time simply marched to the timeout with the DUT idle. Hang signature → thread-dump approach.
TIMELINE — Case 3 (objection/drain logic)
T+0:00 Watchdog timeout at 2ms; DUT idle since 0.4ms. No errors.
T+0:10 Heartbeat counters: drv_sent=200/200, mon_seen=200,
sb_cmp=198. Two responses outstanding... forever.
T+0:30 Thread dump: scoreboard blocked in expected-queue drain
wait (waiting for 2 more matches); test's end-of-test
objection held by the scoreboard until the queue empties.
Classic standoff: nothing will produce 2 more responses.
T+0:50 Why 2 missing? Monitor counts 200, scoreboard compared
198 → drop INSIDE the scoreboard path. The analysis fifo
was sized — bounded! — at 64; under a late burst the
writer's try_put silently dropped 2 transactions.
T+1:10 Fix: unbounded analysis fifo (or put() with backpressure,
never try_put on a checking path). Drain logic gains a
progress watchdog: if outstanding count is static for
100us, FATAL with the queue contents printed — a hang
becomes a 1-minute diagnosis next time.
Generalized lesson: end-of-test hangs are almost always accounting bugs: something holds completion hostage waiting for events that can no longer occur. Instrument the drain path to report what it is waiting for, and never allow silent drops (try_put, void-cast gets) anywhere on a checking path. A hang that explains itself costs minutes; one that does not costs a day.
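Both halves of the Case 3 fix can be sketched in a few lines of Python. This is a model under stated assumptions, not the testbench's code: `BoundedFifo`, `careless_write`, `checked_write`, and `DrainWatchdog` are hypothetical names, time is a plain integer, and the stall limit of 100 simply mirrors the 100us figure in the timeline.

```python
from collections import deque


class BoundedFifo:
    """Hypothetical bounded FIFO whose try_put fails when full."""

    def __init__(self, depth):
        self.depth = depth
        self.items = deque()

    def try_put(self, item):
        if len(self.items) >= self.depth:
            return False            # the caller is expected to check this
        self.items.append(item)
        return True


def careless_write(fifo, item):
    """Writer that ignores the try_put result: a silent drop when full."""
    fifo.try_put(item)


def checked_write(fifo, item):
    """Writer that refuses to drop anything on a checking path."""
    if not fifo.try_put(item):
        raise RuntimeError(f"checking-path FIFO full, would drop {item!r}")


class DrainWatchdog:
    """Raises when the outstanding count is non-zero and has not changed
    for `stall_limit` time units, reporting what is still pending."""

    def __init__(self, stall_limit):
        self.stall_limit = stall_limit
        self.last_count = None
        self.last_change = 0

    def check(self, now, outstanding, pending):
        if outstanding != self.last_count:
            self.last_count = outstanding   # progress was made
            self.last_change = now
            return
        stalled_for = now - self.last_change
        if outstanding > 0 and stalled_for >= self.stall_limit:
            raise RuntimeError(
                f"drain stalled: {outstanding} outstanding for "
                f"{stalled_for} time units, pending={list(pending)}")


fifo = BoundedFifo(depth=64)
for txn_id in range(66):            # late burst with no reader draining
    careless_write(fifo, txn_id)
print(len(fifo.items))              # 64: two transactions vanished silently
```

The design choice is that a drop either cannot happen (unbounded storage or blocking put) or is reported loudly at the point of loss, and that the drain wait names its pending items when it stops making progress.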
The three lessons side by side
Time-shifted exp/act → monitor sampling bug; monitors encode the spec's timing range, not current RTL behavior.
Random-seed failure → latent deterministic bug; poison defaults and checked randomize() surface it on run one.
End-of-test hang → completion accounting; drain logic must report what it awaits, and checking paths never drop silently.
Key takeaways
Patterns in HOW the data is wrong (shifted, stale, missing-count) identify the guilty component class before waves are opened.
Acquit or convict the DUT at the pins first — every case above pivoted on that single check.
Poison defaults, checked randomize, and self-reporting drain logic convert mystery failures into self-diagnosing ones.
After every debug, generalize: what checklist item, assertion, or instrument would have caught this in minutes?
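The first takeaway, reading the shape of the mismatch before opening waves, can be turned into a small triage helper. The sketch below is a hypothetical Python function, not part of any scoreboard API; it only covers the time-shifted and missing-count shapes, since the stale-stimulus shape from Case 2 does not show up as an expected/actual difference.

```python
def classify_mismatch(exp, act):
    """Classify how `act` differs from `exp` (both lists of data values).

    Returns one of: "match", "missing-count", "time-shifted", "corrupted".
    The category names are illustrative labels only.
    """
    if act == exp:
        return "match"
    if len(act) < len(exp):
        return "missing-count"      # fewer items compared than expected
    if len(act) == len(exp) and len(exp) > 1 and act[1:] == exp[:-1]:
        return "time-shifted"       # act[N] equals exp[N-1]
    return "corrupted"


print(classify_mismatch([1, 2, 3], [0, 1, 2]))   # time-shifted
print(classify_mismatch([1, 2, 3], [1, 2]))      # missing-count
print(classify_mismatch([1, 2, 3], [1, 9, 3]))   # corrupted
```

A time-shifted result points toward a monitor sampling position, and a missing-count result toward a drop somewhere on the checking path, matching Cases 1 and 3.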
Common pitfalls
Accepting 'new RTL drop, must be RTL' as a verdict — narratives are not evidence.
Chasing the seed instead of the latent bug the seed exposed.
Bounded fifos and try_put on checking paths — silent drops that surface as unrelated hangs.
Closing a debug without adding the test or instrument that would have caught it.