# Incident Response Runbook
Sprint I.7 — skeleton. Upgrade this runbook once PagerDuty is provisioned (Raj dep) and the first post-incident review lands.
## Scope
Any user-visible failure of the production API, the web surfaces at
gohatch.fun, or the scoring pipeline. Degradations that don't reach users
(slow background jobs, developer-only tools) are tracked in
docs/runbooks/ but don't enter this flow.
## Severity
| Level | Criteria |
|---|---|
| major | Any public surface returning 5xx > 1% of requests, or attestation freeze |
| minor | Degraded latency (P95 > 5s for > 5 min), single non-critical endpoint |
| security | Suspected unauthorized access, exfil, or vulnerability disclosure |
Severity decides paging, not the customer-impact tone of the eventual
postmortem — prefer over-paging a minor to under-paging a major.
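The thresholds in the table read as a simple decision rule. A minimal sketch, assuming hypothetical names (`classifySeverity`, the `Signals` shape, and its fields are illustrative, not part of the Hatch codebase):

```typescript
type Severity = "major" | "minor" | "security" | "none";

interface Signals {
  errorRate5xx: number;      // fraction of requests returning 5xx (0..1)
  attestationFrozen: boolean;
  p95LatencyMs: number;      // current P95 latency
  degradedForMin: number;    // minutes the latency degradation has lasted
  suspectedBreach: boolean;  // unauthorized access, exfil, or disclosure
}

// Mirrors the severity table: security trumps everything, then major, then minor.
function classifySeverity(s: Signals): Severity {
  if (s.suspectedBreach) return "security";
  if (s.errorRate5xx > 0.01 || s.attestationFrozen) return "major";
  if (s.p95LatencyMs > 5000 && s.degradedForMin > 5) return "minor";
  return "none";
}
```

Note that `security` wins even when major criteria also hold, since it changes the disclosure rules below, not just the paging.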
## Detection → Acknowledge
- Page fires (PagerDuty — pending) OR operator observes symptom.
- Within 5 minutes: ack in Telegram `#hatch-oncall` with one sentence of symptom + affected surface.
- Open `/admin/status` — this is the first-touch signal for API + DB.
- Open Vercel dashboard; open Railway dashboard; open Supabase logs.
## Triage (first 15 minutes)
- API 5xx spike → check `/v1/admin/status` first. If DB latency > 500ms or DB `ok: false`, escalate to the database-outage branch below. If DB ok, check Railway logs for the most recent deploy timestamp — last-green rollback takes 90 seconds and is almost always the right first move.
- DB outage → Supabase dashboard → incidents. Do NOT run manual SQL; read-only queries still consume connections. Ask ops if the pool is saturated before touching anything.
- Attestation publishing freeze → `publisher` 503s in logs with `not_configured` mean env drift (`BSC_RPC_URL`/`ATTESTER_PRIVATE_KEY` rotated). Any other error code means a chain-side problem; log the address being published and pause publishes until the chain is back.
- Stubbed-signal regression → if a signal flipped from live to stub post-deploy, the third-party key is down or the scorer is throwing. Scores keep shipping with `hasStubs: true`; attestations refuse as designed. Page the model owner, not the oncall.
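The first triage branch can be sketched as a decision function. This assumes a hypothetical response shape for `/v1/admin/status` (a `db` object with `ok` and `latencyMs`); `nextStep` and the payload shape are illustrative, not the actual API contract:

```typescript
interface AdminStatus {
  db: { ok: boolean; latencyMs: number };
}

// Given the status payload and how long ago the last deploy went out,
// return the triage branch this runbook prescribes.
function nextStep(status: AdminStatus, minutesSinceLastDeploy: number): string {
  if (!status.db.ok || status.db.latencyMs > 500) {
    // DB unhealthy: database-outage branch. No manual SQL.
    return "database-outage branch: Supabase incidents, ask ops about pool saturation";
  }
  if (minutesSinceLastDeploy < 60) {
    // DB healthy and a fresh deploy exists: rollback is the cheap, fast bet.
    return "rollback last deploy (last-green, ~90 seconds)";
  }
  return "inspect Railway logs for the erroring route";
}
```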
## Communication
- Internal first. Post what you observe, what you've tried, what's changed in the last 24h. Do NOT post speculation publicly until confirmed.
- External after mitigation. A tweet from `@gohatch` and an entry in `apps/web/src/lib/incidents.ts` (renders on `/transparency`) are both required for any user-visible incident. Update the status page if the degradation lasts > 15 minutes.
- Keep the timeline. Every message in `#hatch-oncall` should stamp UTC. The postmortem wants the timeline verbatim.
## Mitigate
Preferred order:
- Rollback the last deploy — Vercel "Promote to production" on the previous green; Railway rollback via the dashboard.
- Flip the feature flag — if the degraded surface is flag-gated, flip it off. (Hatch currently has no runtime flags; add one only during the mitigate phase if genuinely reversible.)
- Scale or restart — Railway restart is safe for stateless API; Supabase scale-up is a paid action and must be approved by Raj.
- Rate-limit the attacker — if the cause is abuse, tighten the per-IP limits in `apps/api/src/middleware/rate-limit.ts` and redeploy the API. This is a new deploy, not a config change.
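For context on what "tighten the per-IP limits" means in practice, here is a minimal fixed-window limiter sketch. The real limits live in `apps/api/src/middleware/rate-limit.ts` and may look different; every name and constant below is hypothetical:

```typescript
const WINDOW_MS = 60_000;   // one-minute window
const MAX_REQUESTS = 60;    // the knob you tighten during abuse, then redeploy

const hits = new Map<string, { count: number; windowStart: number }>();

// Returns true if the request from `ip` is within its window budget.
function allowRequest(ip: string, now: number = Date.now()): boolean {
  const entry = hits.get(ip);
  if (!entry || now - entry.windowStart >= WINDOW_MS) {
    // New IP or expired window: start a fresh window.
    hits.set(ip, { count: 1, windowStart: now });
    return true;
  }
  entry.count += 1;
  return entry.count <= MAX_REQUESTS;
}
```

Because the limit is baked into the middleware at build time, lowering `MAX_REQUESTS` requires a new API deploy, which is why the bullet above calls this out as a deploy rather than a config change.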
## Resolve + Record
Once the symptom clears and has stayed clear for 15 minutes:
- Append an entry to `apps/web/src/lib/incidents.ts` with severity, impact, root cause (preliminary is fine), fix, followups.
- Resolve the page.
- Schedule the postmortem within 72 hours (template in `templates/runbook.md`).
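A sketch of what such an entry might look like, assuming a shape that mirrors the fields listed above. The actual type in `apps/web/src/lib/incidents.ts` is authoritative and may differ; `IncidentEntry` and the example values are illustrative:

```typescript
interface IncidentEntry {
  date: string;        // UTC, ISO 8601, matching the #hatch-oncall timeline
  severity: "major" | "minor" | "security";
  impact: string;
  rootCause: string;   // preliminary is fine
  fix: string;
  followups: string[];
}

// Example entry with made-up values.
const example: IncidentEntry = {
  date: "2025-01-01T00:00:00Z",
  severity: "minor",
  impact: "P95 latency above 5s on one endpoint for 22 minutes",
  rootCause: "connection-pool saturation after deploy (preliminary)",
  fix: "rolled back to last green",
  followups: ["add pool-saturation alert", "schedule postmortem"],
};
```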
## Security disclosures
If the incident is tagged `security`:
- Do not discuss in public channels until disclosed.
- Do reach out to the reporter within 4 business hours.
- Legal is looped in before any external acknowledgement.
- RFC 9116 metadata lives in `apps/web/public/.well-known/security.txt`.
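For reference, a minimal RFC 9116 file has this shape (`Contact` and `Expires` are the required fields). The values below are placeholders; the live file at the path above is authoritative:

```text
Contact: mailto:security@example.com
Expires: 2026-06-30T00:00:00.000Z
Preferred-Languages: en
```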