Part 11 · Senior Prep · Intermediate

Night Triage Interview Q&A

Model answers on nightly failure bucketing, seed reproduction, log mining, owner assignment, and escalation when closure is at risk.

Triage process questions

Q: Regression red at 2 AM — triage process?

[INT][SENIOR][UVM] MODEL ANSWER

Q: Nightly red — triage?

A:
  MECHANISM:  Classify failure bucket: hang, mismatch, config, stimulus, predict.
  MOTIVATION:  Scrolling logs at random wastes farm time and human attention.
  PROCESS:    1) Identify failing test+seed from manifest.
              2) Bucket symptom (objection / UVM_ERROR / config fatal).
              3) Repro locally with same seed+plusargs.
              4) Assign owner by component (RAL, scb, agent, test).
              5) Block closure-sensitive merges until root cause known.
  PITFALL:    Re-running entire regression to 'see if green' without repro.
  EXAMPLE:    axi_rand fails seed 4099 — objection leak in new vseq fork.
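
Step 2 of the process above (bucketing by symptom) can be sketched as a small shell helper. This is a minimal sketch under stated assumptions, not a definitive classifier: `bucket_log` is a hypothetical name, `PH_TIMEOUT` is used as the hang marker (the UVM phase-timeout message ID, which only appears if the simulation hits a UVM timeout rather than being killed by the farm), and the `config_db` match is an assumed marker for config-related fatals.

```shell
# Classify one simulation log into a triage bucket.
# Pattern choices are assumptions about log content, not a fixed contract:
#   PH_TIMEOUT : UVM phase-timeout message ID, treated here as a hang
#   config_db  : assumed marker text inside a config-related UVM_FATAL
# 'UVM_ERROR [^:]' skips the end-of-run summary counter ("UVM_ERROR :    0").
bucket_log() {
  log="$1"
  if grep -q 'PH_TIMEOUT' "$log"; then
    echo hang
  elif grep -E 'UVM_FATAL [^:]' "$log" | grep -q 'config_db'; then
    echo config
  elif grep -qE 'UVM_ERROR [^:]' "$log"; then
    echo mismatch
  else
    echo unknown
  fi
}
```

Calling `bucket_log sim.log` on each failing run gives a bucket count for the night before anyone opens a waveform. The hang check comes first because a timeout fatal would otherwise be mistaken for a different bucket.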

Q: Hang vs mismatch — first diagnostic move for each?

[INT][SENIOR][UVM] MODEL ANSWER

Q: Hang vs mismatch first move?

A:
  HANG:       display_objections() → driver get_next_item/item_done → vif check.
  MISMATCH:   Identify first UVM_ERROR timestamp → scoreboard predict path →
              compare expected/actual item fields → seed stability rerun.
  MOTIVATION:  Different buckets — different tools; mixing them wastes time.
  PITFALL:    Opening waves on hang before objection trace.
  EXAMPLE:    Hang: vseq fork never joined. Mismatch: byte order wrong on wide beat.

systemverilog

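// hang triage: interviewers expect you to name these
// phase_done is directly accessible in UVM 1.1; in UVM 1.2 use phase.get_objection().
phase.phase_done.display_objections();
// mismatch triage: enable targeted compare
// exp/act are the expected and actual items; %0p formatting is simulator-dependent.
`uvm_info("SCB", $sformatf("exp=%0p act=%0p", exp, act), UVM_LOW)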
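
The "first UVM_ERROR timestamp" move for mismatches can also be done from the log before opening the testbench. The sketch below assumes the default UVM report layout (`UVM_ERROR <file(line)> @ <time>: ...`); `first_error` and `error_time` are hypothetical helper names, and a custom report server may format messages differently.

```shell
# Print the first real UVM_ERROR/UVM_FATAL message, prefixed with its log line number.
# 'UVM_(ERROR|FATAL) [^:]' skips the end-of-run summary counters ("UVM_ERROR :    3").
first_error() {
  grep -nE 'UVM_(ERROR|FATAL) [^:]' "$1" | head -n 1
}

# Extract the simulation time that follows the first '@' in that message.
# Assumes the default "... @ <time>: ..." layout of UVM report messages.
error_time() {
  first_error "$1" |
    sed -n 's/^[0-9]*:UVM_[A-Z]* [^@]*@ \([0-9][0-9.]*\).*/\1/p'
}
```

The time printed by `error_time sim.log` is the point to jump to in the scoreboard predict path; later errors are often knock-on effects of the first one.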

Key takeaways

Night triage: bucket → repro seed → assign owner — not rerun farm blindly.
Hang = objections first; mismatch = first error timestamp + field compare.
Block risky merges until P0 failure root-caused.

Common pitfalls

Assigning triage to one person — seniors distribute buckets fast.
Rerunning without same seed — cannot confirm fix.

Log mining and escalation

Q: How do you mine a 50 GB nightly log without drowning?

[INT][SENIOR][UVM] MODEL ANSWER

Q: Large log triage?

A:
  MECHANISM:  Manifest gives test+seed+exit code; grep UVM_ERROR/FATAL first;
              sim.log tail for objection summary; bisect if multi-commit window.
  MOTIVATION:  Full log read is impossible at scale — structured extraction first.
  COMMANDS:   grep -E "UVM_ERROR|UVM_FATAL|FATAL" sim.log | head
              grep "display_objections" -A20 sim.log
  PITFALL:    Opening waveforms for every failure; reserve them for failures
              with a confirmed repro.
  EXAMPLE:    12 failures all same test — one root cause, not 12 independent bugs.
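
The "12 failures, one root cause" observation falls out of a per-test count over the manifest. The sketch below assumes a hypothetical manifest format of one `test,seed,exit_code` line per run, with a non-zero exit code meaning failure; real farm manifests will differ, and `failures_by_test` is an illustrative name.

```shell
# Count failing runs per test from a manifest, most-failing test first.
# Assumed (hypothetical) manifest format: "test,seed,exit_code", one run per line,
# where a non-zero exit_code marks a failed run.
failures_by_test() {
  awk -F, '$3 != 0 { n[$1]++ } END { for (t in n) print n[t], t }' "$1" |
    sort -rn
}
```

A single test name dominating the output suggests one root cause across many seeds, which is the bucket count worth reporting instead of a raw failure total.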

Q: New failure after merge — bisect strategy?

[INT][SENIOR][UVM] MODEL ANSWER

Q: Bisect regression failure?

A:
  MECHANISM:  Identify last green SHA, binary search commits with same test+seed.
  MOTIVATION:  Nightly may span 20 commits — bisect finds culprit fast.
  WHEN:       Failure not obvious from manifest diff (new test vs new RTL).
  PITFALL:    Blaming random seed when failure is deterministic on that seed.
  EXAMPLE:    axi_rand seed 8821 fails from commit abc123 — bisect lands on scb fix.
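
The binary search itself can be delegated to `git bisect run`, which treats exit code 0 as good, 125 as "skip this commit", and other codes from 1 to 127 as bad. The helper below is a sketch of that mapping only; `bisect_verdict` is a hypothetical name, and it detects mismatch-style failures from the log, so a hang bucket would need an added timeout check.

```shell
# Map one build+run attempt to the exit code that `git bisect run` expects:
#   0   -> commit is good
#   1   -> commit is bad (the failure reproduces)
#   125 -> commit cannot be tested (for example, it does not compile); skip it
# Arguments: $1 = build exit status, $2 = simulation log path.
bisect_verdict() {
  build_status="$1"
  log="$2"
  if [ "$build_status" -ne 0 ] || [ ! -f "$log" ]; then
    return 125
  fi
  if grep -qE 'UVM_(ERROR|FATAL) [^:]' "$log"; then
    return 1
  fi
  return 0
}
```

A hypothetical wrapper script, say `bisect_step.sh`, would build, run the same test and seed, and exit with this function's status. Running `git bisect start <bad-sha> <last-green-sha>` and then `git bisect run ./bisect_step.sh` drives the search without manual checkout steps.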

bash

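# Reproduce one nightly failure with the exact test and seed from the manifest.
TEST=axi_rand_test
SEED=4099
LOG="triage_${TEST}_${SEED}.log"
# +ntb_random_seed and -l are VCS simv options; substitute your simulator's equivalents.
simv +UVM_TESTNAME="${TEST}" +ntb_random_seed="${SEED}" \
  +UVM_VERBOSITY=UVM_MEDIUM -l "${LOG}"
# Summarize errors and objection traces, then record the repro recipe for handoff.
grep -E "UVM_ERROR|UVM_FATAL|objection" "${LOG}" | head -n 40
echo "TEST=${TEST} SEED=${SEED}" > "${LOG}.meta"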

Q: When do you escalate to leads during night triage?

[INT][SENIOR][UVM] MODEL ANSWER

Q: Escalation criteria?

A:
  MECHANISM:  Escalate when a P0 test is broken, the coverage closure DB regresses
              by more than 5%, the flake rate spikes, or a suspected RTL
              regression blocks the morning commit window.
  MOTIVATION:  Seniors protect the team's sleep, but not at the cost of schedule risk.
  ESCALATE:   P0 smoke red, widespread multi-test failure, cov drop on merge.
  DON'T:      Single directed test fail with known owner and morning fix plan.
  EXAMPLE:    All DMA tests fail after RTL sync — escalate; one closure test — queue.
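
The escalation criteria above reduce to a small decision rule. The sketch below encodes the P0 and 5% coverage conditions from the answer; the cutoff for "widespread" (`WIDE_FAIL_MIN`) is a hypothetical value chosen for illustration, and flake-rate tracking is left out because it needs run history.

```shell
# Decide whether a nightly result warrants waking a lead.
# Arguments (all integers):
#   $1 = 1 if a P0 smoke test is red, else 0
#   $2 = number of distinct failing tests
#   $3 = coverage drop versus the last green run, in whole percent
# WIDE_FAIL_MIN is a hypothetical cutoff for "widespread"; tune it per project.
WIDE_FAIL_MIN=10
should_escalate() {
  if [ "$1" -eq 1 ] || [ "$2" -ge "$WIDE_FAIL_MIN" ] || [ "$3" -gt 5 ]; then
    echo escalate
  else
    echo queue
  fi
}
```

For instance, `should_escalate 0 1 0` (one failing directed test, no P0 impact, no coverage drop) yields `queue`, matching the "known owner and morning fix plan" case.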

[INT][SENIOR][UVM] triage owner routing

  SYMPTOM              OWNER           FIRST TOOL
  objection leak       test/seq        display_objections
  scb mismatch         scoreboard      exp/act compare
  RAL mirror fail      RAL/predictor   mirror.check + adapter
  config fatal         env/integration config_db print
  driver stall         agent/driver    handshake probe
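
The routing table can be kept next to the triage scripts as a lookup, so owner assignment does not depend on who is awake. This is a sketch: the symptom keys and the `unassigned` fallback are hypothetical identifiers, while the owner names mirror the table above.

```shell
# Map a symptom key to the owning component, following the routing table.
# Symptom keys and the "unassigned" fallback are illustrative names.
route_owner() {
  case "$1" in
    objection_leak) echo "test/seq" ;;
    scb_mismatch)   echo "scoreboard" ;;
    ral_mirror)     echo "RAL/predictor" ;;
    config_fatal)   echo "env/integration" ;;
    driver_stall)   echo "agent/driver" ;;
    *)              echo "unassigned" ;;
  esac
}
```

Anything that lands in `unassigned` is a signal that the bucket list needs a new row, not that the failure can be ignored.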

Key takeaways

Structured log mining — manifest, grep errors, objection summary.
Bisect when culprit commit unclear — same test+seed throughout.
Escalate on P0 smoke or widespread red — not every single failure.

Common pitfalls

Telling PM 'all red' without bucket count — causes panic.
Closing triage without .meta repro file — morning handoff fails.
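
A handoff file is cheap to generate at the end of each repro. The sketch below writes one next to the log; `write_meta` is a hypothetical helper, the `SHA` and `RERUN` fields are illustrative additions beyond the test and seed, and the rerun line assumes the VCS-style `simv` invocation used elsewhere in this lesson.

```shell
# Write a repro handoff file next to the log, as <log>.meta.
# Arguments: $1 = test name, $2 = seed, $3 = log path, $4 = commit SHA (optional).
# SHA and RERUN are illustrative fields; the simv plusargs assume a VCS flow.
write_meta() {
  {
    echo "TEST=$1"
    echo "SEED=$2"
    echo "SHA=${4:-unknown}"
    echo "RERUN=simv +UVM_TESTNAME=$1 +ntb_random_seed=$2 -l $3"
  } > "$3.meta"
}
```

With the rerun command stored in the file, the morning owner can reproduce the failure without reconstructing plusargs from chat history.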
