Part 8 · Senior & Interview Prep · Intermediate

Hunting X's and Races

X-source back-propagation, reset/init audits, race symptoms across tool versions, deterministic-rerun techniques, and a classic case walkthrough.

X-source tracing: back-propagate to the first X

An X on an output is never the bug — it is the last car of a train that started somewhere upstream and cycles earlier. The method mirrors backward waveform tracing, with one addition: at each stage, find when the signal first went X (jump to its previous non-X value), then ask which input was X at that moment. Repeat until you reach a signal that went X with all-known inputs — that is the origin, and it is almost always one of a short list.


X BACK-PROPAGATION

  output X @ 5,200 ns
     │ when did it FIRST go X?           ◄── jump-to-previous-edge, not scroll
     ▼
  first X @ 3,100 ns ── which input was X then? ── pipe_q was X
     │
     ▼
  pipe_q first X @ 3,000 ns ── input? ── ctrl_reg was X
     │
     ▼
  ctrl_reg X SINCE TIME 0  ◄── ORIGIN
     │
     ▼ origin checklist (one of these, nearly always):
     · flop never reset (missing from reset list / wrong reset domain)
     · RAM read before first write
     · X'ed input port (TB never drove it / wrong interface modport)
     · multi-driver conflict resolving to X on a wire
     · explicit 'x assignment in RTL (X-injection on purpose) reached live data
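One way to shorten the walk is to let the simulator report the first X on each suspect stage. The following is a minimal sketch, not a definitive implementation: the module x_chain, the ports cfg_we, cfg_d, din, dout and the active-low reset rst_n are all hypothetical names chosen to mirror the diagram above.

```systemverilog
// Hypothetical chain mirroring the diagram; module and port names are illustrative.
module x_chain (
  input  logic       clk,
  input  logic       rst_n,
  input  logic       cfg_we,
  input  logic       cfg_d,
  input  logic [7:0] din,
  output logic [7:0] dout
);
  logic       ctrl_reg;
  logic [7:0] pipe_q;

  // Deliberate bug: no reset term, so ctrl_reg stays X until the first write.
  always_ff @(posedge clk)
    if (cfg_we) ctrl_reg <= cfg_d;

  // Properly reset, yet it still goes X: an X select makes every bit X
  // where the two arms of the conditional differ.
  always_ff @(posedge clk or negedge rst_n)
    if (!rst_n) pipe_q <= '0;
    else        pipe_q <= ctrl_reg ? din : '0;

  assign dout = pipe_q;

  // One X check per stage. The earliest failure timestamp marks the origin.
  a_ctrl_known: assert property (@(posedge clk) disable iff (!rst_n)
                                 !$isunknown(ctrl_reg))
    else $error("ctrl_reg is X");

  a_pipe_known: assert property (@(posedge clk) disable iff (!rst_n)
                                 !$isunknown(pipe_q))
    else $error("pipe_q is X");
endmodule
```

If reset is released before any configuration write and din is nonzero, the ctrl_reg check fails one clock before the pipe_q check. Sorting assertion failures by time therefore lands on the origin without scrolling the waveform.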

Reset and initialization X audit

List every state element on the failure path; for each, confirm it is reset, and by WHICH reset domain — cross-domain misses are the classic gap.
Check reset duration: a synchronously reset flop on a divided or gated clock may never see an active clock edge during a too-short reset pulse.
Memories are usually not reset by design — verify the read-before-write protection logic instead of demanding reset.
Beware two-state variables (bit, int) in the TB masking X's the RTL would produce: the DUT sees X, your two-state mirror sees 0.
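The two-state masking item is easy to reproduce in isolation. A minimal sketch, assuming a hypothetical four-state variable dut_val standing in for a DUT output and a two-state TB copy named mirror:

```systemverilog
// Illustrative only: dut_val stands in for a DUT output that was never driven.
module two_state_mask_demo;
  logic [7:0] dut_val;   // four-state; defaults to X because nothing assigns it
  bit   [7:0] mirror;    // two-state TB copy

  initial begin
    mirror = dut_val;    // X bits convert to 0 on assignment to a two-state type

    if (mirror == 8'h00)
      $display("two-state check passes: mirror = %0h", mirror);

    if ($isunknown(dut_val))
      $display("four-state check catches it: dut_val = %h", dut_val);
  end
endmodule
```

Both branches are taken: the comparison against the two-state copy succeeds while $isunknown on the original flags the X. Keeping TB mirrors four-state, or checking $isunknown before copying, avoids the blind spot.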

Race symptoms and deterministic reruns

A race condition announces itself with a distinctive smell: the failure appears or vanishes with changes that should be behavior-neutral — a different tool version, an added $display, a changed optimization flag, a different seed with identical stimulus. When a 'heisenbug' shifts under observation, stop debugging the data path and start debugging event ordering.

Classic TB-side cause: driving DUT inputs with blocking assignments from a clocked block — same-edge sampling order becomes tool-dependent. Fix: nonblocking drives, or better, a clocking block which makes the race structurally impossible.
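Classic crossover cause: TB reads DUT state via hierarchical reference in the Active region while RTL updates in NBA. Whether the read returns the pre- or post-update value depends on which pass through the Active region the TB process wakes in, and that can shift with clock deltas and with the scheduler's mood.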
Zero-delay loops: two always blocks communicating through a shared variable at the same timestamp — order undefined by the LRM, stable per tool, broken on tool upgrade.
$display heisenbug: adding a print reorders Active-region evaluation just enough to flip the race — the print is a detector, not a fix.
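The clocking-block fix for the TB-side drive race can be sketched as follows. The interface bus_if, its signals valid and data, and the modules tb_drive and top are hypothetical names used only for illustration; a real bench would connect a DUT to the same interface.

```systemverilog
// Hypothetical interface and driver; all names are illustrative.
interface bus_if (input logic clk);
  logic       valid;
  logic [7:0] data;

  // Output skew #0: drives land in the Re-NBA region of the clock edge,
  // after design processes triggered directly by that edge have run.
  clocking cb @(posedge clk);
    default input #1step output #0;
    output valid, data;
  endclocking
endinterface

module tb_drive (bus_if bus);
  logic [7:0] next_data = '0;

  // Racy alternative, for contrast only:
  //   always @(posedge bus.clk) bus.data = next_data;  // blocking, same edge the DUT samples

  initial forever begin
    @(bus.cb);                   // wait for the clocking event
    bus.cb.valid <= 1'b1;        // synchronous drive through the clocking block
    bus.cb.data  <= next_data;
    next_data++;
  end
endmodule

module top;
  logic clk = 0;
  always #5 clk = ~clk;

  bus_if   bus (clk);
  tb_drive drv (bus);

  initial #100 $finish;
endmodule
```

Because every drive goes through cb, a flop in the design that samples bus.data on the same posedge sees the value from the previous cycle regardless of how the simulator orders processes within the edge.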

Making reruns deterministic

Pin everything: seed, tool version, compile flags, +args — a race hunt with a drifting baseline is unwinnable.
Bisect the nondeterminism: if two runs with identical inputs differ, diff their logs to the first divergent line; that line usually points at the racing pair.
Force both orderings deliberately: add a #0 or region-shifting change at the suspect point and confirm you can flip the failure on and off — that is proof, not coincidence.
Fix structurally (clocking blocks, NBA drives, mailbox handshakes), then remove your ordering hack and confirm stability across the tool versions and flags that previously disagreed.
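The "force both orderings" step can be tried on a toy race before applying it to a real bench. This is a sketch under stated assumptions: the plusarg names DELAY_WRITER and DELAY_READER and the variables shared and seen are hypothetical, and the #0 delays are diagnostic hacks, not fixes.

```systemverilog
// Toy race: one process writes 'shared' and another reads it on the same edge.
module race_flip_demo;
  logic clk = 0;
  int   shared = 0;
  int   seen;
  bit   delay_writer, delay_reader;

  always #5 clk = ~clk;

  initial begin
    // Hypothetical plusargs selecting which side is pushed to the Inactive region.
    delay_writer = ($test$plusargs("DELAY_WRITER") != 0);
    delay_reader = ($test$plusargs("DELAY_READER") != 0);
    repeat (4) @(posedge clk);
    #1 $finish;                  // finish off-edge so the last sample still prints
  end

  // Writer: blocking update at the clock edge.
  always @(posedge clk) begin
    if (delay_writer) #0;        // write moves behind all Active-region work
    shared = shared + 1;
  end

  // Reader: samples the same variable at the same edge.
  always @(posedge clk) begin
    if (delay_reader) #0;        // read moves behind all Active-region work
    seen = shared;
    $display("t=%0t seen=%0d", $time, seen);
  end
endmodule
```

With +DELAY_WRITER the reader always samples the value from before the increment; with +DELAY_READER it always samples the value after it. With neither plusarg (or both) the order is left to the simulator. Being able to flip the result at will confirms that the two processes race.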

Case walkthrough: the X that only appeared at gate level

A design passed full RTL regression, then the same tests produced X-storms in gate-level simulation. Triage classified it as an X signature with divergence at the very first post-reset cycle. Back-propagation found the origin: a configuration register whose reset pin was tied off in synthesis because the RTL initialized it with an initial block instead of a reset term. RTL simulation obligingly initialized the flop; the netlist flop powered up X. The X then propagated into an enable that gated an entire datapath.

Symptom: gate-level only X-storm; RTL regression 100% green.
Triage: X signature; first divergence at first post-reset cycle — points at initialization, not logic.
Back-propagation: output X ← datapath enable X ← cfg_reg X since time 0.
Origin check: cfg_reg initialized via initial block (simulation-only), no reset term in its clocked always block.
Fix: real reset term in RTL; lesson item added to review checklist: 'no state initialization via initial blocks in synthesizable code.'
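The before and after of this fix might look like the sketch below. It is a reconstruction of the bug pattern, not the original design: the module names cfg_reg_bad and cfg_reg_good, the ports we and d, and the active-low asynchronous reset rst_n are assumptions.

```systemverilog
// Before: simulation-only initialization (illustrative reconstruction).
module cfg_reg_bad (
  input  logic clk,
  input  logic we,
  input  logic d,
  output logic cfg_reg
);
  // Honored by RTL simulation; an ASIC flow typically ignores it, so the
  // gate-level flop starts as X.
  initial cfg_reg = 1'b0;

  // Plain always: many tools reject an initial-block write to a variable
  // that is also written by always_ff.
  always @(posedge clk)
    if (we) cfg_reg <= d;
endmodule

// After: a real reset term that survives synthesis.
module cfg_reg_good (
  input  logic clk,
  input  logic rst_n,
  input  logic we,
  input  logic d,
  output logic cfg_reg
);
  always_ff @(posedge clk or negedge rst_n)
    if (!rst_n)  cfg_reg <= 1'b0;
    else if (we) cfg_reg <= d;
endmodule
```

In RTL simulation both versions start at 0, which is why the regression stayed green. Only the second version produces a netlist flop with a connected reset.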

The generalized lesson: RTL simulation is more forgiving than reality — two-state types, initial-block init, and optimistic X-handling all hide bugs that gates and silicon will not. The X-hunt method works identically at both levels; the origin checklist just gains entries.

Key takeaways

Never debug the X you see; back-propagate to the FIRST X and check the short origin checklist.
Audit reset by domain and duration — the classic origin is a flop missed by its reset or clocked too slowly to see it.
A failure that shifts with prints, flags, or tool versions is a race — debug event ordering, not data.
Prove a race by flipping it on and off deliberately, then fix structurally with clocking blocks or NBA drives.

Common pitfalls

Fixing the X where it annoys you (a forced value, a reset bolt-on) instead of at its origin.
Two-state TB types silently converting DUT X's to zeros — checking passes, bug survives.
Treating a $display that 'fixes' a failure as a fix rather than as race evidence.
Initial-block initialization in synthesizable RTL — green in sim, X in gates and silicon.
