ADR 0003 — Scoring service architecture (Sprint C.1)
Status: Accepted · 2026-04-17
Deciders: Raj
Supersedes: —
Context
Hatch's core product surface is the on-chain AI Hatch Score published before every
Four.meme launch. The scoring service is the LLM-facing half of that surface — it
takes a token submission, computes six independent signals (spec §6.2-6.3), aggregates
them, and publishes an attestation to the HatchAttest contract deployed in Domain B.
Sprint C.1 covers the skeleton + first signal (meme). Sprint C.2 fills in the
remaining five. Sprint C.3 does the aggregation + on-chain publish. This ADR captures
the architectural shape chosen in C.1 and locked for the rest of Domain C.
Decision
1. Dependency-free Anthropic client
We use a minimal, purpose-built HTTP client against the Messages API instead of the
@anthropic-ai/sdk npm package.
Why:
- Explicit control — the scoring path owns timeout, retry backoff, and circuit breaker numbers. The SDK's defaults are fine but can change between minor versions.
- Zero new deps — the API worker bundle stays lean; upgrade decisions stay ours.
- Tight schema contract — we parse one well-typed response shape (tool-use), nothing else. An SDK that abstracts streaming/events is overkill for synchronous scoring.
Trade-off accepted: we re-implement retry/backoff/jitter. ~100 LoC; tested.
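The re-implemented retry path could look roughly like the sketch below — a hedged illustration, not the actual client code. `isRetryable`, `backoffMs`, and `withRetry` are illustrative names, and the status set and timing constants are assumptions, not the committed numbers.

```typescript
// Illustrative sketch of the ~100 LoC retry/backoff/jitter helper.
// Status codes and timing constants are assumptions, not the real config.
const RETRYABLE_STATUSES = new Set([429, 500, 502, 503, 529]);

function isRetryable(status: number): boolean {
  return RETRYABLE_STATUSES.has(status);
}

// Full jitter: delay drawn uniformly from [0, min(cap, base * 2^attempt)).
function backoffMs(attempt: number, baseMs = 500, capMs = 8_000): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.random() * ceiling;
}

async function withRetry<T>(fn: () => Promise<T>, maxAttempts = 3): Promise<T> {
  let lastErr: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      const status = (err as { status?: number }).status;
      if (status === undefined || !isRetryable(status)) throw err;
      await new Promise((r) => setTimeout(r, backoffMs(attempt)));
    }
  }
  throw lastErr;
}
```

Owning this loop is what makes the trade-off explicit: the retryable set and the backoff curve are visible in one diff, rather than buried in an SDK minor release.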
2. Structured output via tool-use
Every prompt MUST emit its result through a named Anthropic tool with a strict JSON schema. The client rejects any response that lacks a tool_use block or uses the wrong tool name.
Why:
- Makes schema drift a hard failure, not a parsing mystery.
- Avoids free-text parsing or JSON-in-text hacks.
- The tool schema IS the API contract; it diffs cleanly in code review.
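The rejection rule can be sketched as a single strict extractor — type names here are illustrative, not the real `anthropic-client.ts` definitions:

```typescript
// Illustrative shapes; the real client's types live in anthropic-client.ts.
interface ContentBlock {
  type: string;
  name?: string;
  input?: unknown;
}

interface MessagesResponse {
  content: ContentBlock[];
}

// Hard failure on schema drift: no matching tool_use block means no result.
function extractToolInput(res: MessagesResponse, toolName: string): unknown {
  const block = res.content.find(
    (b) => b.type === "tool_use" && b.name === toolName,
  );
  if (!block) {
    throw new Error(`expected tool_use block "${toolName}" in response`);
  }
  return block.input;
}
```

Any free-text preamble the model emits is simply ignored; only the named tool's `input` survives, and a missing or misnamed tool is an error, never a fallback parse.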
3. Immutable, semver-versioned prompts
Every prompt has an id (scoring.meme) and a semver (1.0.0) and lives in its own
file. Once tagged, the file is never edited. Changes bump the semver and create a
new file. The registry keeps every version forever.
Why:
- Replayability — given a score + prompt version, we can reconstruct the exact prompt that produced it.
- Explainability (Sprint C.4) — "why this score?" needs the prompt that was run, not the prompt that's current.
- Regulatory posture — when a creator disputes a score, the audit trail points to the exact prompt + tool schema used, frozen by version.
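The immutability rule is enforceable mechanically. A minimal sketch, assuming a hypothetical in-memory registry keyed by `id@version` (the real layout is one file per version):

```typescript
// Hypothetical registry sketch; field names are illustrative.
interface PromptVersion {
  id: string;       // e.g. "scoring.meme"
  version: string;  // semver, frozen once tagged
  system: string;
  toolName: string;
}

const registry = new Map<string, PromptVersion>();

function register(p: PromptVersion): void {
  const key = `${p.id}@${p.version}`;
  if (registry.has(key)) {
    throw new Error(`prompt ${key} is immutable; bump the semver instead`);
  }
  registry.set(key, p);
}

// Given a historical score's (id, version), return the exact prompt it ran.
function resolve(id: string, version: string): PromptVersion {
  const p = registry.get(`${id}@${version}`);
  if (!p) throw new Error(`unknown prompt ${id}@${version}`);
  return p;
}
```

The duplicate-register guard is the code-level twin of "once tagged, never edited": the only way to change a prompt is to mint a new version.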
4. Deterministic stub mode
When ANTHROPIC_API_KEY is absent, the client returns a deterministic pseudo-score
derived from a hash of the user message. Results are marked stub: true.
Why:
- C.1 can ship + be tested before Dep.2.4 (Anthropic enterprise credits) lands.
- Staging and dev environments can exercise the full pipeline (including persist + routes) without burning Anthropic credits.
- Tests run deterministically; no environment-dependent flakiness.
Safety rail: any score with at least one stub: true signal has hasStubs: true
at the aggregate level. The C.3 attestation publisher refuses to attest rows with
hasStubs: true. This prevents stub scores from leaking on-chain.
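The deterministic pseudo-score can be derived like this — a sketch under the assumption that the stub hashes the user message with SHA-256 (`stubScore` and the exact mapping are illustrative):

```typescript
import { createHash } from "node:crypto";

// Illustrative stub scorer: maps the user message to a stable score in [0, 100].
// The hash function and range mapping are assumptions, not the real client's.
function stubScore(userMessage: string): { score: number; stub: true } {
  const digest = createHash("sha256").update(userMessage).digest();
  // First two bytes of the digest, reduced into the score range.
  const score = digest.readUInt16BE(0) % 101;
  return { score, stub: true };
}
```

Same input, same score, on every machine and every run — which is exactly what makes tests deterministic and keeps staging off the metered API.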
5. Results are durable, raw responses are audit-only
Two tables:
- score_requests — the committed result and submission, queryable.
- llm_audit — the full raw Anthropic response, kept forever for audit/replay.
A failure to write llm_audit never fails the scoring call (best-effort); a failure
to write score_requests logs + returns the result anyway (the service treats the
result as authoritative; the client should persist it). This matches the operating
rule "every production change ships with logs + metrics + an alert" — DB downtime
triggers alerts but does not hard-fail user-visible scoring.
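The two failure policies differ only in wording, but both converge on "never hard-fail the scoring call." A hedged sketch, where `Db` is an assumed interface rather than the real data layer:

```typescript
// Hypothetical persistence wrapper; `Db` and `persistScore` are illustrative.
interface Db {
  insertAudit(raw: unknown): Promise<void>;
  insertScore(row: unknown): Promise<void>;
}

async function persistScore<T>(
  db: Db,
  row: T,
  raw: unknown,
  log: (msg: string) => void,
): Promise<T> {
  // llm_audit is best-effort: an audit failure must never break scoring.
  try {
    await db.insertAudit(raw);
  } catch {
    log("llm_audit write failed (alerting); continuing");
  }
  // score_requests failure logs + alerts, but the result is still returned.
  try {
    await db.insertScore(row);
  } catch {
    log("score_requests write failed (alerting); returning result anyway");
  }
  return row;
}
```

In production the `log` callback is where the "logs + metrics + an alert" rule attaches; the caller always gets the computed result back.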
6. Synchronous HTTP today, async job path tomorrow
POST /v1/score is synchronous with a 30s budget. Sprint C.3 introduces a job
queue (BullMQ on Redis) so that attestation publishing is decoupled from the
creator-facing request. C.1 ships only the synchronous path.
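One way the 30s budget could be enforced on the synchronous path — a sketch only; `withBudget` is a hypothetical name and the real handler may wire cancellation differently:

```typescript
// Illustrative 30s budget guard for the synchronous POST /v1/score path.
async function withBudget<T>(
  fn: (signal: AbortSignal) => Promise<T>,
  budgetMs = 30_000,
): Promise<T> {
  const ctrl = new AbortController();
  const timer = setTimeout(() => ctrl.abort(), budgetMs);
  try {
    return await fn(ctrl.signal);
  } finally {
    clearTimeout(timer);
  }
}
```

Passing the `AbortSignal` down lets the in-flight Anthropic request be cancelled when the budget expires, instead of orphaning it; the C.3 job queue replaces this budget with per-job timeouts.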
Consequences
Positive:
- Small, reviewable surface. ~800 LoC of TS + tests.
- Shippable before external dependencies (e.g. Dep.2.4) are unblocked.
- Every prompt diff is reviewable in PR history; every score is replayable.
- Aggressive upgrade resilience — SDK changes don't touch us.
Negative:
- We own retry/backoff logic. If Anthropic adds a new retryable status code, we have to notice and update isRetryable.
- We lose the SDK's type-safety helpers for tool definitions. Mitigated by our own strict parsing in anthropic-client.ts.
- Stub mode introduces a branch that must be exercised in tests every time we change the client. Two dedicated tests guard this.
Rejected alternatives
- Anthropic SDK — rejected for reasons in §1 above. Revisit in C.8 if streaming or batch becomes required.
- Unstructured prose + regex parsing — rejected. Fragile; has bitten every LLM-powered system that ever shipped it.
- One monolithic prompt for all six signals — rejected. Coupling. A bad score on one signal would force a full rerun; per-signal timeouts impossible; latency compounds serially.
- Fire-and-forget scoring (no persisted result) — rejected. No audit trail, no replay, no "why this score?" modal, no regulatory posture.
Open questions (for Sprint C.2)
- Meme-similarity embedding store — pgvector or a standalone vector DB?
- Creator-signal data pipe — Bitquery streaming via WS or scheduled pulls?
- Image-signal cost ceiling — Vision calls are ~4× text calls; budget envelope?
These are explicitly Sprint C.2 decisions and out of scope here.