Part 8 · Senior & Interview Prep · Intermediate

Risk Management & Schedule Reality

Late spec changes, IP quality surprises, regression capacity planning, descoping decisions, and communicating risk upward.

The risks that actually happen

Verification schedules rarely die from coding speed. They die from late spec changes, IP that arrives broken, and regression compute that runs out exactly when closure needs it. A senior engineer plans for these on day one, because all three happen on essentially every project.

THE USUAL SUSPECTS

  Risk                  │ Typical timing     │ Mitigation planned up front
  ──────────────────────┼────────────────────┼────────────────────────────
  Late spec change      │ 60-80% through     │ change-impact process:
                        │ schedule           │ rows tagged by spec section
                        │                    │ → re-open affected rows only
  3rd-party IP broken   │ at integration     │ IP acceptance smoke suite
                        │                    │ BEFORE depending on it
  Regression capacity   │ closure phase      │ measure seeds/night early;
  shortage              │ (demand grows 10x) │ book compute at planning
  Key engineer pulled   │ anytime            │ owner column + docs, no
                        │                    │ single-owner critical paths
  RTL freeze slips      │ end                │ coverage results time-tagged
                        │                    │ vs RTL version — stale
                        │                    │ closure is re-run, not argued
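
The last row of the table, coverage results tagged with the RTL version they were collected on, can be sketched in a few lines. This is a minimal illustration, not a real coverage-database API: the `CoverageRun` record, the row IDs, and the RTL tags are all hypothetical, and a real flow would pull these fields from the coverage tool's merged database.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CoverageRun:
    plan_row: str      # hypothetical plan row ID
    rtl_version: str   # RTL tag the simulation was built from
    closed: bool       # did this run meet the row's coverage goal?

def stale_closures(runs, current_rtl):
    """Rows that were closed on some older RTL but not on the current one."""
    closed_ever = {r.plan_row for r in runs if r.closed}
    closed_on_current = {
        r.plan_row for r in runs if r.closed and r.rtl_version == current_rtl
    }
    return sorted(closed_ever - closed_on_current)

# Illustrative data only.
runs = [
    CoverageRun("ARB-01", "rtl_v0.9", True),
    CoverageRun("ARB-01", "rtl_v1.0", True),
    CoverageRun("DMA-03", "rtl_v0.9", True),
    CoverageRun("DMA-03", "rtl_v1.0", False),
    CoverageRun("RST-02", "rtl_v0.9", True),
]
stale = stale_closures(runs, "rtl_v1.0")
print(stale)  # ['DMA-03', 'RST-02']
```

Rows in the returned list are the ones to re-run after an RTL drop; anything closed only on an older tag does not count toward sign-off.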

Late spec changes — the impact drill

Tag every plan row with the spec section it verifies. Do this while writing the plan, when it costs nothing.
On a spec change, grep the plan for affected sections: those rows revert from closed to open.
Re-estimate the reopened rows: do constraints change, and are new bins, new assertions, or scoreboard model updates needed?
Report the cost in the same change-review where the spec change is approved — not two weeks later.
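
The drill above amounts to a query over the plan. The sketch below assumes the plan is available as structured records; the `PlanRow` type, the row IDs, and the section numbers are invented for illustration, and a plan kept in a spreadsheet or a planning tool would need its own export step.

```python
from dataclasses import dataclass

@dataclass
class PlanRow:
    row_id: str            # hypothetical plan row ID
    spec_sections: set     # spec sections this row verifies
    status: str = "open"   # "open" or "closed"

def apply_spec_change(plan, changed_sections):
    """Reopen rows that verify a changed section; return all affected row IDs."""
    affected = []
    for row in plan:
        if row.spec_sections & changed_sections:
            row.status = "open"   # closed rows revert; open rows stay open
            affected.append(row.row_id)
    return affected

# Illustrative plan only.
plan = [
    PlanRow("ARB-01", {"4.2"}, "closed"),
    PlanRow("ARB-02", {"4.2", "4.3"}, "open"),
    PlanRow("DMA-03", {"5.1"}, "closed"),
    PlanRow("REG-07", {"4.3", "7.1"}, "closed"),
]
affected = apply_spec_change(plan, {"4.3"})
print(affected)  # ['ARB-02', 'REG-07']
```

The affected list is what gets re-estimated and reported in the change review; rows that do not touch a changed section keep their status.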

Regression capacity planning

Coverage closure is a compute problem: merged coverage grows with seeds × nights, and demand peaks exactly when every other project also hits closure. Measure your seeds-per-night burn rate early and extrapolate; running out of licenses in the last month is a planning failure, not bad luck.

CAPACITY ARITHMETIC (example)

  regression: 400 tests × 3 avg seeds  = 1200 sims/night wanted
  sim runtime: avg 25 min  → 1200 × 25 = 500 cpu-hours/night
  available: 15 licenses × 12h window  = 180 cpu-hours/night  ✗ 2.8x short

  levers, in order of preference:
  1. cut sim waste     — profile slowest 5% of tests; cap runaway tests
  2. rank tests        — coverage-per-cpu-hour; drop redundant ones nightly,
                         full sweep weekly
  3. stagger           — smoke per-checkin, full nightly, soak weekend
  4. buy/borrow        — escalate WITH this arithmetic, not with vibes
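
The arithmetic above, and lever 2, are easy to keep as a small script that is rerun whenever the test count or runtime changes. This is a sketch under stated assumptions: each license runs one simulation at a time, the test names and coverage bins are invented, and the greedy ordering is one simple way to rank by coverage per cpu-hour, not the only one.

```python
import math

def demand_cpu_hours(tests, avg_seeds, avg_runtime_min):
    """CPU-hours per night needed to run every test with avg_seeds seeds."""
    return tests * avg_seeds * avg_runtime_min / 60

def supply_cpu_hours(licenses, window_hours):
    """CPU-hours per night available, assuming one simulation per license."""
    return licenses * window_hours

def rank_by_coverage_per_cpu_hour(tests):
    """Greedy order by newly covered bins per cpu-hour.

    tests maps name -> (cpu_hours, set_of_bins); cpu_hours must be > 0.
    Tests that add no new bins are returned separately as redundant.
    """
    covered, ranked, redundant = set(), [], []
    remaining = dict(tests)
    while remaining:
        best = max(
            remaining,
            key=lambda n: len(remaining[n][1] - covered) / remaining[n][0],
        )
        _, bins = remaining.pop(best)
        if bins - covered:
            ranked.append(best)
            covered |= bins
        else:
            redundant.append(best)
    return ranked, redundant

# Numbers from the capacity example above.
demand = demand_cpu_hours(400, 3, 25)
supply = supply_cpu_hours(15, 12)
shortfall = demand / supply
licenses_needed = math.ceil(demand / 12)   # licenses to fit a 12h window
print(f"demand {demand:.0f} h, supply {supply} h, {shortfall:.1f}x short, "
      f"{licenses_needed} licenses would close the gap")

# Hypothetical tests: name -> (cpu_hours, bins hit).
suite = {
    "t_burst": (0.5, {"b1", "b2", "b3"}),
    "t_long_soak": (4.0, {"b1", "b2", "b3", "b4"}),
    "t_single": (0.25, {"b1"}),
    "t_err": (1.0, {"b5"}),
}
ranked, redundant = rank_by_coverage_per_cpu_hour(suite)
print(ranked, redundant)
```

Redundant tests are candidates for the weekly sweep rather than the nightly run. The ranking only sees the bins recorded per test, so a test kept for a checker or an assertion, not for coverage, should be pinned by hand.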

Descoping — deciding what not to verify

When the schedule cannot fit the plan, something gives. Done well, descoping is an explicit, documented, risk-ranked decision; done badly, it is silent test deletion discovered after tape-out. The rule: descope verification effort if you must, but never descope silently.

Candidates: low-risk features (pass-through paths, well-proven reused IP), redundant coverage crosses, long-tail performance scenarios.
Never candidates: anything CDC, reset, or power-related; features with open bugs; new logic regardless of perceived simplicity.
Each descoped row gets a waiver: what is not verified, why the risk is acceptable, who approved it, and the silicon-debug plan if it bites.
The waiver list is presented at sign-off review — descoping is a project decision, not a verification-team secret.
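
The candidate rules and waiver fields above can be checked mechanically before a descope request reaches review. The sketch below is illustrative only: the `Waiver` record, the category strings, and the `check_descope` helper are hypothetical names, and the rules are copied from the list above, not from any standard.

```python
from dataclasses import dataclass, fields

# Categories the text says are never descope candidates.
NEVER_DESCOPE = {"cdc", "reset", "power"}

@dataclass(frozen=True)
class Waiver:
    plan_row: str
    not_verified: str        # what is not verified
    risk_rationale: str      # why the risk is acceptable
    approved_by: str         # who approved it
    silicon_debug_plan: str  # what to do if it bites in silicon

def check_descope(category, has_open_bugs, is_new_logic, waiver):
    """Return the reasons a descope request is unacceptable (empty list = OK)."""
    problems = []
    if category.lower() in NEVER_DESCOPE:
        problems.append(f"category '{category}' is never a descope candidate")
    if has_open_bugs:
        problems.append("feature has open bugs")
    if is_new_logic:
        problems.append("new logic is never a descope candidate")
    if waiver is None:
        problems.append("no waiver")
    else:
        for f in fields(waiver):
            if not getattr(waiver, f.name).strip():
                problems.append(f"waiver field '{f.name}' is empty")
    return problems

# Illustrative request for a reused pass-through path.
waiver = Waiver(
    plan_row="BYP-04",
    not_verified="back-pressure cross on the bypass path",
    risk_rationale="reused IP, unchanged since the previous project",
    approved_by="project lead",
    silicon_debug_plan="bypass can be disabled by register",
)
print(check_descope("pass-through", False, False, waiver))  # []
print(check_descope("CDC", False, False, None))
```

An empty result means the request is well-formed, not that it is approved; the decision still belongs to the sign-off review.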

Communicating risk upward

Managers do not need simulation details; they need decision-grade information: what the risk is, what it costs to fix, and what happens if we ship anyway. The credibility you build by reporting bad news early is the only currency that works when you need schedule or compute.

RISK ESCALATION — A FORMAT THAT WORKS

  BAD:  "Coverage is at 74% and the arbiter is concerning."

  GOOD: "Arbiter fairness coverage is 40% and flat for 2 weeks.
         Root cause: random stimulus can't create 4-master 
         contention; needs a directed scenario layer.
         Cost: ~1 engineer-week.
         If we skip it: starvation bugs of the class that caused
         chip X's field recall stay unverified.
         Ask: keep RK on arbiter 1 more week, slip reg-model
         closure (low risk, reused IP) by the same week."

  status + root cause + cost + risk-if-ignored + a concrete ask

Key takeaways

Plan on day one for the big three: spec changes, broken IP, compute shortage.
Tag plan rows by spec section — spec-change impact becomes a query, not an archaeology dig.
Do capacity arithmetic early; escalate with numbers before closure, not during.
Descope explicitly with risk-ranked, approved waivers — never by silent deletion.
Escalate with status, root cause, cost, risk-if-ignored, and a concrete ask.

Common pitfalls

Accepting third-party IP into the environment without an acceptance smoke suite.
Declaring closure on coverage collected against last month's RTL.
Sitting on bad news until it is unfixable — the schedule slips either way; your credibility doesn't have to.
Descoping CDC or reset verification under pressure — these produce the silicon bugs that end careers.
