Part 8 · Senior & Interview Prep · Intermediate
Risk Management & Schedule Reality
Late spec changes, IP quality surprises, regression capacity planning, descoping decisions, and communicating risk upward.
The risks that actually happen
Verification schedules rarely die from coding speed. They die from late spec changes, IP that arrives broken, and regression compute that runs out exactly when closure needs it. A senior engineer plans for these on day one, because all three happen on essentially every project.
THE USUAL SUSPECTS
Risk                  │ Typical timing     │ Mitigation planned up front
──────────────────────┼────────────────────┼────────────────────────────
Late spec change      │ 60-80% through     │ change-impact process:
                      │ schedule           │ plan rows tagged by feature
                      │                    │ → re-open affected rows only
3rd-party IP broken   │ at integration     │ IP acceptance smoke suite
                      │                    │ BEFORE depending on it
Regression capacity   │ closure phase      │ measure seeds/night early;
shortage              │ (when demand 10x)  │ book compute at planning
Key engineer pulled   │ anytime            │ owner column + docs, no
                      │                    │ single-owner critical paths
RTL freeze slips      │ end                │ coverage results time-tagged
                      │                    │ vs RTL version; stale
                      │                    │ closure is re-run, not argued
Late spec changes: the impact drill
Tag every plan row with the spec section it verifies. Do this while writing the plan, when it costs nothing.
On a spec change, grep the plan for affected sections: those rows revert from closed to open.
Re-estimate: changed constraints? new bins? new assertions? scoreboard model updates?
Report the cost in the same change-review where the spec change is approved — not two weeks later.
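The drill above reduces to a lookup once plan rows carry spec-section tags. The following is a minimal sketch, assuming the plan is held as a list of records; the `PlanRow` type, the `reopen_affected` helper, and the row names and section numbers are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class PlanRow:
    name: str
    spec_sections: tuple      # spec sections this row verifies (hypothetical tags)
    status: str = "open"      # "open" or "closed"


def reopen_affected(plan, changed_sections):
    """Revert closed rows that touch any changed spec section; return them."""
    changed = set(changed_sections)
    reopened = []
    for row in plan:
        if row.status == "closed" and changed.intersection(row.spec_sections):
            row.status = "open"
            reopened.append(row)
    return reopened


# Hypothetical plan: two closed rows and one still open.
plan = [
    PlanRow("arb_fairness", ("4.2",), "closed"),
    PlanRow("reg_model_rw", ("3.1", "3.2"), "closed"),
    PlanRow("dma_burst", ("5.1",), "open"),
]

# A spec change lands in section 3.2: only rows tagged with it re-open.
reopened = reopen_affected(plan, ["3.2"])
print([r.name for r in reopened])  # ['reg_model_rw']
```

The returned list is the input to the re-estimate step: each re-opened row is checked for changed constraints, bins, assertions, and scoreboard updates before the cost is reported.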
Regression capacity planning
Coverage closure is a compute problem: merged coverage grows with seeds × nights, and demand peaks exactly when everyone else's project also hits closure. Measure your seeds-per-night burn rate early and extrapolate; running out of licenses in the last month is a planning failure, not bad luck.
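The extrapolation is simple enough to script and re-run whenever the test count or runtime changes. This is a rough sketch that assumes one license drives one simulation at a time, so license-hours and cpu-hours are interchangeable; the `capacity_gap` function and its parameter names are hypothetical, and the inputs are the figures from this section's example.

```python
def capacity_gap(tests, seeds_per_test, avg_sim_minutes, licenses, window_hours):
    """Compare nightly regression demand with available compute.

    Assumes one license runs one sim at a time (a simplification).
    """
    sims_wanted = tests * seeds_per_test
    demand_hours = sims_wanted * avg_sim_minutes / 60
    supply_hours = licenses * window_hours
    # How many average-length sims actually fit in the nightly window.
    sims_that_fit = int(supply_hours * 60 // avg_sim_minutes)
    return sims_wanted, demand_hours, supply_hours, sims_that_fit


sims, demand, supply, fit = capacity_gap(
    tests=400, seeds_per_test=3, avg_sim_minutes=25, licenses=15, window_hours=12
)
print(f"wanted {sims} sims = {demand:.0f} cpu-hours; have {supply} cpu-hours")
print(f"demand/supply = {demand / supply:.1f}x; only {fit} sims fit per night")
```

With these inputs the demand is 500 cpu-hours against 180 available, a ratio of about 2.8, and roughly 432 of the 1200 wanted simulations fit in one night.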
CAPACITY ARITHMETIC (example)
regression:   400 tests × 3 avg seeds = 1200 sims/night wanted
sim runtime:  avg 25 min → 1200 × 25 min = 30,000 min = 500 cpu-hours/night
available:    15 licenses × 12h window = 180 cpu-hours/night   ✗ 2.8x short
levers, in order of preference:
  1. cut sim waste: profile slowest 5% of tests; cap runaway tests
  2. rank tests:    coverage-per-cpu-hour; drop redundant ones nightly,
                    full sweep weekly
  3. stagger:       smoke per-checkin, full nightly, soak weekend
  4. buy/borrow:    escalate WITH this arithmetic, not with vibes
Descoping: deciding what not to verify
When the schedule cannot fit the plan, something gives. Descoping done well is an explicit, documented, risk-ranked decision; done badly it is silent test deletion discovered after tape-out. The rule: descope verification effort if you must, but never descope silently.
Candidates: low-risk features (pass-through paths, well-proven reused IP), redundant coverage crosses, long-tail performance scenarios.
Never candidates: anything CDC, reset, or power-related; features with open bugs; new logic regardless of perceived simplicity.
Each descoped row gets a waiver: what is not verified, why the risk is acceptable, who approved it, and the silicon-debug plan if it bites.
The waiver list is presented at sign-off review — descoping is a project decision, not a verification-team secret.
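The waiver rules above can be enforced by a small check run before the sign-off review. The sketch below is one possible shape, not a standard format: the `Waiver` record, the `waiver_problems` helper, the category labels, and the example rows are all hypothetical.

```python
from dataclasses import dataclass

# Categories the text lists as never descopable (label strings are hypothetical).
NEVER_DESCOPE = {"cdc", "reset", "power", "new_logic"}


@dataclass(frozen=True)
class Waiver:
    plan_row: str
    category: str
    not_verified: str          # what is not verified
    risk_rationale: str        # why the risk is acceptable
    approved_by: str           # who approved it
    silicon_debug_plan: str    # what to do if it bites in silicon
    has_open_bugs: bool = False


def waiver_problems(w):
    """Return reasons to reject this waiver; an empty list means acceptable."""
    problems = []
    if w.category in NEVER_DESCOPE:
        problems.append(f"category '{w.category}' is never a descope candidate")
    if w.has_open_bugs:
        problems.append("feature has open bugs")
    for field in ("not_verified", "risk_rationale", "approved_by", "silicon_debug_plan"):
        if not getattr(w, field).strip():
            problems.append(f"missing {field}")
    return problems


# Hypothetical examples: one complete low-risk waiver, one that must be rejected.
ok = Waiver("passthrough_bypass", "pass_through",
            "back-pressure corners on the bypass path",
            "well-proven reused IP", "project lead",
            "read back bypass status registers")
bad = Waiver("async_fifo_reset", "reset", "reset during burst", "", "", "")

print(waiver_problems(ok))        # []
print(len(waiver_problems(bad)))  # 4
```

Rejecting on category first keeps the "never candidates" rule out of the hands of schedule pressure: a reset waiver fails the check no matter how well the other fields are filled in.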
Communicating risk upward
Managers do not need simulation details; they need decision-grade information: what is the risk, what does it cost to fix, what happens if we ship anyway. The credibility you build reporting bad news early is the only currency that works when you need schedule or compute.
RISK ESCALATION — A FORMAT THAT WORKS
BAD:  "Coverage is at 74% and the arbiter is concerning."
GOOD: "Arbiter fairness coverage is 40% and flat for 2 weeks.
       Root cause: random stimulus can't create 4-master
       contention; needs a directed scenario layer.
       Cost: ~1 engineer-week.
       If we skip it: starvation bugs of the class that caused
       chip X's field recall stay unverified.
       Ask: keep RK on arbiter 1 more week, slip reg-model
       closure (low risk, reused IP) by the same week."
status + root cause + cost + risk-if-ignored + a concrete ask
Key takeaways
Plan day one for the big three: spec changes, broken IP, compute shortage.
Tag plan rows by spec section — spec-change impact becomes a query, not an archaeology dig.
Do capacity arithmetic early; escalate with numbers before closure, not during.
Descope explicitly with risk-ranked, approved waivers — never by silent deletion.
Escalate with status, root cause, cost, risk-if-ignored, and a concrete ask.
Common pitfalls
Accepting third-party IP into the environment without an acceptance smoke suite.
Closure declared on coverage collected against last month's RTL.
Sitting on bad news until it is unfixable — the schedule slips either way; your credibility doesn't have to.
Descoping CDC or reset verification under pressure — these produce the silicon bugs that end careers.
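The stale-closure pitfall (coverage collected against an older RTL) can be guarded against mechanically if every coverage result records the RTL version it ran on. The following is a minimal sketch under that assumption; the `CoverageRun` record, the `merge_current` helper, and the version strings and bin names are hypothetical and do not reflect any particular tool's database format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoverageRun:
    test: str
    rtl_version: str      # e.g. a tag or commit id recorded at sim time
    bins_hit: frozenset   # coverage bins this run hit


def merge_current(runs, current_rtl):
    """Merge bins only from runs on the current RTL; list tests needing a re-run."""
    fresh = frozenset().union(
        *(r.bins_hit for r in runs if r.rtl_version == current_rtl)
    )
    stale_tests = sorted({r.test for r in runs if r.rtl_version != current_rtl})
    return fresh, stale_tests


# Hypothetical results: one run predates the current RTL tag.
runs = [
    CoverageRun("arb_smoke", "rtl_v12", frozenset({"a", "b"})),
    CoverageRun("arb_contend", "rtl_v11", frozenset({"c"})),
    CoverageRun("dma_burst", "rtl_v12", frozenset({"b", "d"})),
]

fresh, stale = merge_current(runs, "rtl_v12")
print(sorted(fresh))  # ['a', 'b', 'd']
print(stale)          # ['arb_contend']
```

Bin `c` drops out of the merged result until `arb_contend` is re-run on the current RTL, which is the "re-run, not argued" policy expressed as a filter.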