# Incident Response Runbook
Sprint I.7 — skeleton. Upgrade this runbook once PagerDuty is provisioned (Raj dep) and the first post-incident review lands.
## Scope
Any user-visible failure of the production API, the web surfaces at
gohatch.fun, or the scoring pipeline. Degradations that don't reach users
(slow background jobs, developer-only tools) are tracked in
docs/runbooks/ but don't enter this flow.
## Severity
| Level | Criteria |
|---|---|
| major | Any public surface returning 5xx > 1% of requests, or attestation freeze |
| minor | Degraded latency (P95 > 5s for > 5 min), single non-critical endpoint |
| security | Suspected unauthorized access, exfil, or vulnerability disclosure |
Severity decides paging, not the customer-impact tone of the eventual
postmortem — prefer over-paging a minor to under-paging a major.
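The thresholds in the table read as a simple decision rule. A minimal sketch, assuming hypothetical names (`classifySeverity`, the `Signals` shape, and its fields are illustrative, not part of the Hatch codebase):

```typescript
type Severity = "major" | "minor" | "security" | "none";

interface Signals {
  errorRate5xx: number;      // fraction of requests returning 5xx (0..1)
  attestationFrozen: boolean;
  p95LatencyMs: number;      // current P95 latency
  degradedForMin: number;    // minutes the latency degradation has lasted
  suspectedBreach: boolean;  // unauthorized access, exfil, or disclosure
}

// Mirrors the severity table: security trumps everything, then major, then minor.
function classifySeverity(s: Signals): Severity {
  if (s.suspectedBreach) return "security";
  if (s.errorRate5xx > 0.01 || s.attestationFrozen) return "major";
  if (s.p95LatencyMs > 5000 && s.degradedForMin > 5) return "minor";
  return "none";
}
```

Note that `security` wins even when major criteria also hold, since it changes the disclosure rules below, not just the paging.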
## Detection → Acknowledge
- Page fires (PagerDuty — pending) OR operator observes symptom.
- Within 5 minutes: ack in Telegram `#hatch-oncall` with one sentence of symptom + affected surface.
- Open `/admin/status` — this is the first-touch signal for API + DB.
- Open Vercel dashboard; open Railway dashboard; open Supabase logs.
## Triage (first 15 minutes)
- API 5xx spike → check `/v1/admin/status` first. If DB latency > 500ms or DB `ok: false`, escalate to the database-outage branch below. If DB ok, check Railway logs for the most recent deploy timestamp — last-green rollback takes 90 seconds and is almost always the right first move.
- DB outage → Supabase dashboard → incidents. Do NOT run manual SQL; read-only queries still consume connections. Ask ops if the pool is saturated before touching anything.
- Attestation publishing freeze → `publisher` 503s in logs with `not_configured` mean env drift (`BSC_RPC_URL`/`ATTESTER_PRIVATE_KEY` rotated). Any other error code means a chain-side problem; log the address being published and pause publishes until the chain is back.
- Stubbed-signal regression → if a signal flipped from live to stub post-deploy, the third-party key is down or the scorer is throwing. Scores keep shipping with `hasStubs: true`; attestations refuse as designed. Page the model owner, not the oncall.
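The first triage branch can be sketched as a decision function. This assumes a hypothetical response shape for `/v1/admin/status` (a `db` object with `ok` and `latencyMs`); `nextStep` and the payload shape are illustrative, not the actual API contract:

```typescript
interface AdminStatus {
  db: { ok: boolean; latencyMs: number };
}

// Given the status payload and how long ago the last deploy went out,
// return the triage branch this runbook prescribes.
function nextStep(status: AdminStatus, minutesSinceLastDeploy: number): string {
  if (!status.db.ok || status.db.latencyMs > 500) {
    // DB unhealthy: database-outage branch. No manual SQL.
    return "database-outage branch: Supabase incidents, ask ops about pool saturation";
  }
  if (minutesSinceLastDeploy < 60) {
    // DB healthy and a fresh deploy exists: rollback is the cheap, fast bet.
    return "rollback last deploy (last-green, ~90 seconds)";
  }
  return "inspect Railway logs for the erroring route";
}
```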
## Communication
- Internal first. Post what you observe, what you've tried, what's changed in the last 24h. Do NOT post speculation publicly until confirmed.
- External after mitigation. A tweet from `@gohatch` and an entry in `apps/web/src/lib/incidents.ts` (renders on `/transparency`) are both required for any user-visible incident. Update the status page if the degradation lasts > 15 minutes.
- Keep the timeline. Every message in `#hatch-oncall` should stamp UTC. The postmortem wants the timeline verbatim.
## Mitigate
Preferred order:
- Rollback the last deploy — Vercel "Promote to production" on the previous green; Railway rollback via the dashboard.
- Flip the feature flag — if the degraded surface is flag-gated, flip it off. (Hatch currently has no runtime flags; add one only during the mitigate phase if genuinely reversible.)
- Scale or restart — Railway restart is safe for stateless API; Supabase scale-up is a paid action and must be approved by Raj.
- Rate-limit the attacker — if the cause is abuse, tighten the per-IP limits in `apps/api/src/middleware/rate-limit.ts` and redeploy the API. This is a new deploy, not a config change.
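For context on what "tighten the per-IP limits" means in practice, here is a minimal fixed-window limiter sketch. The real limits live in `apps/api/src/middleware/rate-limit.ts` and may look different; every name and constant below is hypothetical:

```typescript
const WINDOW_MS = 60_000;   // one-minute window
const MAX_REQUESTS = 60;    // the knob you tighten during abuse, then redeploy

const hits = new Map<string, { count: number; windowStart: number }>();

// Returns true if the request from `ip` is within its window budget.
function allowRequest(ip: string, now: number = Date.now()): boolean {
  const entry = hits.get(ip);
  if (!entry || now - entry.windowStart >= WINDOW_MS) {
    // New IP or expired window: start a fresh window.
    hits.set(ip, { count: 1, windowStart: now });
    return true;
  }
  entry.count += 1;
  return entry.count <= MAX_REQUESTS;
}
```

Because the limit is baked into the middleware at build time, lowering `MAX_REQUESTS` requires a new API deploy, which is why the bullet above calls this out as a deploy rather than a config change.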
## Resolve + Record
Once the symptom clears and has stayed clear for 15 minutes:
- Append an entry to `apps/web/src/lib/incidents.ts` with severity, impact, root cause (preliminary is fine), fix, followups.
- Resolve the page.
- Schedule the postmortem within 72 hours (template in `templates/runbook.md`).
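A sketch of what such an entry might look like, assuming a shape that mirrors the fields listed above. The actual type in `apps/web/src/lib/incidents.ts` is authoritative and may differ; `IncidentEntry` and the example values are illustrative:

```typescript
interface IncidentEntry {
  date: string;        // UTC, ISO 8601, matching the #hatch-oncall timeline
  severity: "major" | "minor" | "security";
  impact: string;
  rootCause: string;   // preliminary is fine
  fix: string;
  followups: string[];
}

// Example entry with made-up values.
const example: IncidentEntry = {
  date: "2025-01-01T00:00:00Z",
  severity: "minor",
  impact: "P95 latency above 5s on one endpoint for 22 minutes",
  rootCause: "connection-pool saturation after deploy (preliminary)",
  fix: "rolled back to last green",
  followups: ["add pool-saturation alert", "schedule postmortem"],
};
```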
## Security disclosures
If the incident is tagged `security`:
- Do not discuss in public channels until disclosed.
- Do reach out to the reporter within 4 business hours.
- Legal is looped in before any external acknowledgement.
- RFC 9116 metadata lives in `apps/web/public/.well-known/security.txt`.
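For reference, a minimal RFC 9116 file has this shape (`Contact` and `Expires` are the required fields). The values below are placeholders; the live file at the path above is authoritative:

```text
Contact: mailto:security@example.com
Expires: 2026-06-30T00:00:00.000Z
Preferred-Languages: en
```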