
Q&A: Scenario & Design Questions

Verifying a FIFO end to end, building a testbench for an arbiter, verifying a CDC path, debugging a hung test, and handling the tapeout-eve bug report.

Q: How would you verify a FIFO end to end? (the full senior answer)

Structure the answer as plan → environment → checks → coverage → closure. Plan: extract features (ordering, flags at exact boundaries, backpressure, reset mid-traffic), risk-rank (full/empty corners and reset highest), log spec ambiguities (simultaneous push+pop at full?). Environment: interface with clocking blocks; generator → mailbox → driver; independent input/output monitors; queue-based reference model in a scoreboard. Checks: scoreboard compares in-order data; bound assertion module for no-push-at-full, no-pop-at-empty, count consistency, flag definitions. Coverage: levels including exact boundaries, flag transitions, operation×state crosses, backpressure scenarios. Closure: seed sweep, merged coverage, hole analysis, bug-rate convergence, sign-off review.

diagram

gen ─mb─► drv ─┐                  ┌─► in_mon ──► ┌────────────┐
                 fifo_if ──► DUT ──► fifo_if ─► out_mon ─► │ scoreboard │
   assertions bound to DUT ▲                               │ (queue mdl)│
   (full/empty/count/data) │                               └────────────┘
                       coverage: boundaries, crosses, backpressure
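
The queue-based reference model in the scoreboard can be sketched as follows. This is a minimal sketch under assumptions, not a definitive implementation: the class name `fifo_scoreboard`, the 8-bit data width, and the methods `write_in`, `write_out`, and `check_empty` are hypothetical, and it assumes the input monitor reports only pushes the DUT accepted and the output monitor only pops the DUT performed.

```systemverilog
// Minimal sketch of a queue-based FIFO scoreboard (all names hypothetical).
class fifo_scoreboard;
  typedef bit [7:0] data_t;       // assumed data width
  data_t expected[$];             // reference model: a SystemVerilog queue
  int unsigned n_match, n_error;

  // Called by the input monitor for every push the DUT accepted.
  function void write_in(data_t d);
    expected.push_back(d);
  endfunction

  // Called by the output monitor for every pop the DUT performed.
  function void write_out(data_t d);
    data_t want;
    if (expected.size() == 0) begin
      n_error++;
      $error("pop of 0x%0h while the reference model is empty", d);
      return;
    end
    want = expected.pop_front();  // in-order check: oldest entry first
    if (d !== want) begin
      n_error++;
      $error("data mismatch: got 0x%0h, expected 0x%0h", d, want);
    end
    else n_match++;
  endfunction

  // End-of-test check: leftover entries mean data was lost in the DUT.
  function void check_empty();
    if (expected.size() != 0) begin
      n_error++;
      $error("%0d entries never came out of the DUT", expected.size());
    end
  endfunction
endclass
```

Keeping the two monitor callbacks separate means the scoreboard never looks at DUT flags; flag behavior belongs to the assertions, which is the separation of concerns the answer is built on.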

Follow-up: "Which bug class does each checker catch that the others miss?" — Scoreboard: data corruption and ordering. Assertions: cycle-accurate protocol violations (push accepted at full) the scoreboard only sees later as corruption. Coverage catches nothing — it proves what was exercised. All three or you have holes.
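
The bound assertion module could look like the sketch below. Everything about the DUT here is an assumption: the port names, the `DEPTH` parameter, a `count` output wide enough to hold `DEPTH`, the reading of `push`/`pop` as accepted strobes, and a spec that forbids a push at full even with a simultaneous pop (exactly the ambiguity the plan should have logged).

```systemverilog
// Hypothetical FIFO checker; port names and DEPTH are assumptions about the DUT.
module fifo_sva #(parameter int DEPTH = 16) (
  input logic clk, rst_n,
  input logic push, pop,                 // accepted push / pop strobes
  input logic full, empty,
  input logic [$clog2(DEPTH):0] count
);
  default clocking cb @(posedge clk); endclocking
  default disable iff (!rst_n);

  // Flag definitions at the exact boundaries.
  a_full_def:  assert property (full  == (count == DEPTH));
  a_empty_def: assert property (empty == (count == 0));

  // Protocol: no push accepted at full, no pop accepted at empty.
  a_no_push_full: assert property (full  |-> !push);
  a_no_pop_empty: assert property (empty |-> !pop);

  // Count consistency: next count = count + push - pop.
  a_count: assert property (
    1'b1 |=> count == $past(count) + $past(push) - $past(pop));
endmodule

// Attach without editing RTL, assuming the DUT module is named "fifo"
// and uses the same signal and parameter names:
//   bind fifo fifo_sva #(.DEPTH(DEPTH)) u_fifo_sva (.*);
```

These checks fire in the cycle of the violation, which is why they catch a push accepted at full before the scoreboard sees the resulting corruption.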

Junior vs senior: a junior describes driving pushes and pops. A senior gives the plan-first lifecycle, the three-checker separation of concerns, and names the specific corner inventory (boundaries, simultaneous ops, reset mid-traffic).

Q: Design a testbench for a 4-master arbiter.

The structural answer is four active agents and one passive checker: one driver/monitor pair per master port (each with its own vif), a grant-side monitor, and a scoreboard checking three property families: exclusivity (at most one grant at a time), legality (grant only to a requester), and fairness/starvation (every continuous requester granted within N cycles, per the arbitration policy). Stimulus must create contention deliberately: all-request bursts, staggered arrivals, one master hogging, randomized priorities if programmable.

systemverilog

// the three property families, as assertions:
a_excl:  assert property (@(posedge clk) disable iff (!rst_n)
           $onehot0(gnt));                        // at most one grant
a_legal: assert property (@(posedge clk) disable iff (!rst_n)
           (gnt & ~req) == '0);                   // no grant without its req
a_starv: assert property (@(posedge clk) disable iff (!rst_n)
           (req[0] && !gnt[0]) |-> ##[1:64] (gnt[0] || !req[0]));
           // bound per policy; released if the request is withdrawn;
           // shown for master 0, replicate per master
// + scoreboard models the documented policy (round-robin pointer, etc.)
// + coverage: contention patterns (1,2,3,4 simultaneous requesters), 
//   back-to-back grants, pointer wraps

Follow-up: "How do you verify round-robin specifically?" — Model the pointer in the scoreboard and predict the exact winner each cycle; assert grant matches prediction. Pure properties can check starvation bounds, but exact rotation needs a reference model.
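
A pointer-based predictor for the scoreboard might look like the sketch below. The policy is an assumption, not taken from any particular spec: the search starts at the master after the last winner, the pointer moves only when a grant is issued, and master 0 wins first after reset. The names `rr_model` and `predict` are hypothetical.

```systemverilog
// Minimal round-robin predictor for a 4-master arbiter (names hypothetical).
// Assumed policy: search starts at the master after the last winner;
// the pointer moves only when a grant is issued.
class rr_model;
  localparam int N = 4;
  int unsigned last = N - 1;   // so that master 0 has priority after reset

  // Returns the expected one-hot grant for this cycle's request vector.
  function bit [N-1:0] predict(bit [N-1:0] req);
    bit [N-1:0] gnt = '0;
    for (int i = 1; i <= N; i++) begin
      int unsigned idx = (last + i) % N;
      if (req[idx]) begin
        gnt[idx] = 1'b1;
        last = idx;            // rotate: the winner becomes lowest priority
        break;
      end
    end
    return gnt;
  endfunction
endclass
```

The scoreboard would call `predict` once per arbitration cycle with the monitored `req` and compare the result against the monitored `gnt`, shifted by whatever grant latency the design documents.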

Junior vs senior: a junior builds four drivers. A senior leads with the three property families, knows fairness needs a policy model not just assertions, and designs stimulus for contention rather than hoping randomness finds it.

Q: How would you verify a CDC path?

Three layers. Structural: CDC lint (SpyGlass-class tools) proving every crossing has a legal synchronizer structure; this is the layer simulation cannot replace. Protocol: assertions on the crossing contract. Single-bit: source value stable long enough to be captured. Handshake: req held until ack, no new req before ack falls. Gray-coded pointers: at most one bit changes per source clock cycle. Dynamic: simulate with truly asynchronous clock ratios (including near-same and extreme), and run metastability injection (randomized synchronizer delay models) so the design's tolerance is actually exercised, plus formal on the handshake FSMs where feasible.

systemverilog

// gray-code contract on an async FIFO pointer crossing:
a_gray: assert property (@(posedge wr_clk) disable iff (!rst_n)
  $countones(wr_ptr_gray ^ $past(wr_ptr_gray)) <= 1);

// handshake contract:
a_hold: assert property (@(posedge src_clk) disable iff (!rst_n)
  (req && !ack_sync) |=> req);    // req held until ack seen

Follow-up: "Why can't normal RTL simulation find CDC bugs?" — RTL sim has no metastability: a violated setup window still resolves to a clean value, in zero time, deterministically. The bug class physically does not exist in plain RTL simulation — hence lint for structure, injection for resilience.
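
One way to picture the injection side is a simulation-only synchronizer model like the sketch below. It is a hypothetical model, not a vendor cell or a tool feature: whenever the asynchronous input has changed since the previous destination clock edge, the first stage randomly captures either the old or the new value, so downstream logic sees the one-cycle latency uncertainty that real metastability resolution produces.

```systemverilog
// Simulation-only 2-flop synchronizer with randomized capture (hypothetical model).
// If the async input changed since the previous destination clock edge,
// the first flop randomly captures the old or the new value.
module sync2_meta_model (
  input  logic clk,      // destination clock
  input  logic rst_n,
  input  logic d_async,  // signal arriving from the source domain
  output logic q
);
  logic d_prev;  // input value seen at the previous destination edge
  logic meta;    // first synchronizer stage

  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
      d_prev <= 1'b0;
      meta   <= 1'b0;
      q      <= 1'b0;
    end
    else begin
      d_prev <= d_async;
      // On a change, model an uncertain capture: pick old or new at random.
      if (d_async !== d_prev) meta <= $urandom_range(1) ? d_async : d_prev;
      else                    meta <= d_async;
      q <= meta;
    end
  end
endmodule
```

Swapping a model like this in for the plain synchronizer during regression makes the crossing latency vary with the seed, so logic that silently depends on a fixed two-cycle delay starts to fail.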

Junior vs senior: a junior says "add two-flop synchronizers and test it." A senior gives the three layers and explains why simulation alone is structurally blind here — the single most senior-flavored sentence in this bank.

Q: A test hangs and the simulation never finishes — walk me through the debug.

First classify the hang, then locate it. Is sim time advancing? If wall clock burns but sim time is frozen, it is a zero-delay loop: always_comb feedback, or a forever loop whose wait(expr) is already true and so never yields to the scheduler. If sim time keeps advancing but the test never ends, a process is blocked forever: get() on an empty mailbox, @(ev) on a never-fired event (the missed-trigger race), a fork...join waiting on a stuck branch, a semaphore key leaked by a killed process, or a DUT handshake that never completes. Then find who: pause in the interactive debugger and inspect process states, or bisect with timestamped prints at each component boundary; the blockage sits at or just downstream of the last component that printed.
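
The missed-trigger race from the list above is easy to reproduce in isolation. In this sketch (module and signal names are hypothetical) the trigger fires before either waiter arrives in the same timestep; the edge-sensitive `@(done)` blocks forever, while `wait (done.triggered)` still unblocks because the `triggered` property holds for the rest of the timestep.

```systemverilog
// Demonstrates the same-timestep missed trigger (names hypothetical).
module missed_trigger_demo;
  event done;
  bit   saw_edge, saw_level;

  // Producer: triggers at time 0, before the waiters are listening.
  initial -> done;

  // Edge-sensitive waiter: arrives later in the same timestep and hangs.
  initial begin
    #0;
    @(done);
    saw_edge = 1;
  end

  // Level-sensitive waiter: arrives equally late but still unblocks.
  initial begin
    #0;
    wait (done.triggered);
    saw_level = 1;
  end

  initial #1 $display("saw_edge=%0b saw_level=%0b", saw_edge, saw_level);
endmodule
```

When a hang traces back to `@(event)`, checking whether the producer can run in the same timestep as the waiter is usually the first question to ask.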

diagram

HANG TRIAGE TREE

  sim time frozen?
   ├─ YES → zero-delay loop: comb feedback / wait() that never yields
   │        → simulator profile or pause shows the spinning process
   └─ NO  → blocking wait:
            ├─ mailbox.get() — producer dead? (gen finished early?)
            ├─ @(event)      — same-timestep missed trigger?
            ├─ join          — which branch is stuck? (print per branch)
            ├─ semaphore     — key leaked by disable fork?
            └─ DUT handshake — req with no ack: RTL bug or bad stimulus
  always-on defense: global watchdog
    initial begin #1ms; $fatal(1, "global timeout"); end
    + objection/activity tracking to report WHO was still busy

Follow-up: "How do you make hangs debuggable before they happen?" — A global watchdog with a fatal and a status dump (queue sizes, outstanding transaction counts, per-component heartbeats), so a hang becomes a report naming the stuck component instead of a 4-hour silent regression slot.
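
A watchdog with a status dump could be sketched as below. The package name `tb_status_pkg`, its functions, and the 1 ms budget are hypothetical; the idea is only that each component records when it starts and retires a transaction, so the timeout message can name who was still busy.

```systemverilog
// Sketch of a global watchdog that reports who is still busy (names hypothetical).
package tb_status_pkg;
  int unsigned outstanding[string];   // in-flight transactions per component
  time         last_beat[string];     // last activity time per component

  // Components call these as they issue and retire transactions.
  function automatic void txn_start(string who);
    if (!outstanding.exists(who)) outstanding[who] = 0;
    outstanding[who]++;
    last_beat[who] = $time;
  endfunction

  function automatic void txn_done(string who);
    outstanding[who]--;
    last_beat[who] = $time;
  endfunction

  function automatic void dump_status();
    foreach (outstanding[who])
      $display("  %s: outstanding=%0d, last activity at %0t",
               who, outstanding[who], last_beat[who]);
  endfunction
endpackage

module watchdog #(parameter realtime TIMEOUT = 1ms);
  initial begin
    #TIMEOUT;
    $display("global timeout at %0t, component status:", $time);
    tb_status_pkg::dump_status();
    $fatal(1, "global timeout");
  end
endmodule
```

A component with a non-zero outstanding count and an old last-activity time is the first place to look; the timeout budget should be set per test rather than left at one global value.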

Junior vs senior: a junior adds prints at random. A senior splits sim-time-frozen vs advancing first, walks the blocking-primitive checklist, and has the watchdog-plus-status-dump pattern already in the bench.

Q: One day to tapeout, a bug report lands — what do you check?

Triage for decision-grade facts in hours, not a fix: (1) Reproduce it: exact seed, test, RTL version; does it reproduce on the tapeout candidate? (2) Is it a real DUT bug or a testbench/checker artifact? (3) Severity and reach: which feature, what traffic pattern triggers it, how likely in the field, is there a software workaround or a register/strap mitigation? (4) Scope the RTL delta if fixed: one line in one block, or a re-verification cascade? (5) Hand management a decision package: fix-and-slip vs ship-with-erratum vs respin-risk, each with evidence. The verification deliverable on tapeout eve is the risk assessment, not the patch.

diagram

TAPEOUT-EVE TRIAGE CHECKLIST

  □ reproducible on the tapeout RTL?   (seed, test, version pinned)
  □ DUT bug vs TB artifact?            (waveform to root cause class)
  □ trigger conditions                 (how reachable in real use?)
  □ workaround exists?                 (sw sequence / register / strap)
  □ fix blast radius                   (re-verify what, for how long?)
  □ decision package up                (options + risks + recommendation)

  the answer they want: a decision process under pressure —
  not "I'd fix it fast"

Follow-up: "The bug is real but the fix needs a week of re-verification — what do you recommend?" — Depends on the triage facts: field likelihood × failure cost vs slip cost. If a software workaround fully contains it, ship with a documented erratum; if it corrupts data on a common path, the slip is cheaper than a respin. Recommend with the evidence; the project decides.

Junior vs senior: a junior promises a fast fix. A senior runs the triage checklist, separates DUT-bug from TB-artifact early, and frames the output as a decision package with options and risk — which is what the question is really probing.

Key takeaways

FIFO answer = lifecycle + three-checker separation (scoreboard/assertions/coverage) + corner inventory.
Arbiter = exclusivity, legality, starvation properties + a policy reference model + engineered contention.
CDC = lint for structure, assertions for contract, async-ratio sim + metastability injection for resilience.
Hang debug = sim-time frozen vs advancing first, then the blocking-primitive checklist; watchdog by default.
Tapeout-eve bug = triage to a decision package: reproduce, classify, scope, recommend.

Common pitfalls

Jumping to testbench components before stating the plan — the answer's frame matters.
Claiming simulation verifies CDC — interviewers wait for exactly this mistake.
Debugging hangs with random prints instead of classifying the hang type first.
Answering the tapeout question with heroics instead of a triage process.
