ADR 0003 — Scoring service architecture (Sprint C.1)
Status: Accepted · 2026-04-17
Deciders: Raj
Supersedes: —
Context
Hatch's core product surface is the on-chain AI Hatch Score published before every
Four.meme launch. The scoring service is the LLM-facing half of that surface — it
takes a token submission, computes six independent signals (spec §6.2-6.3), aggregates
them, and publishes an attestation to the HatchAttest contract deployed in Domain B.
Sprint C.1 covers the skeleton + first signal (meme). Sprint C.2 fills in the
remaining five. Sprint C.3 does the aggregation + on-chain publish. This ADR captures
the architectural shape chosen in C.1 and locked for the rest of Domain C.
Decision
1. Dependency-free Anthropic client
We use a minimal, purpose-built HTTP client against the Messages API instead of the
@anthropic-ai/sdk npm package.
Why:
- Explicit control — the scoring path owns timeout, retry backoff, and circuit breaker numbers. The SDK's defaults are fine but can change between minor versions.
- Zero new deps — the API worker bundle stays lean; upgrade decisions stay ours.
- Tight schema contract — we parse one well-typed response shape (tool-use), nothing else. An SDK that abstracts streaming/events is overkill for synchronous scoring.
Trade-off accepted: we re-implement retry/backoff/jitter. ~100 LoC; tested.
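The re-implemented retry path could look roughly like the sketch below — a hedged illustration, not the actual client code. `isRetryable`, `backoffMs`, and `withRetry` are illustrative names, and the status set and timing constants are assumptions, not the committed numbers.

```typescript
// Illustrative sketch of the ~100 LoC retry/backoff/jitter helper.
// Status codes and timing constants are assumptions, not the real config.
const RETRYABLE_STATUSES = new Set([429, 500, 502, 503, 529]);

function isRetryable(status: number): boolean {
  return RETRYABLE_STATUSES.has(status);
}

// Full jitter: delay drawn uniformly from [0, min(cap, base * 2^attempt)).
function backoffMs(attempt: number, baseMs = 500, capMs = 8_000): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.random() * ceiling;
}

async function withRetry<T>(fn: () => Promise<T>, maxAttempts = 3): Promise<T> {
  let lastErr: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      const status = (err as { status?: number }).status;
      if (status === undefined || !isRetryable(status)) throw err;
      await new Promise((r) => setTimeout(r, backoffMs(attempt)));
    }
  }
  throw lastErr;
}
```

Owning this loop is what makes the trade-off explicit: the retryable set and the backoff curve are visible in one diff, rather than buried in an SDK minor release.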
2. Structured output via tool-use
Every prompt MUST emit its result through a named Anthropic tool with a strict JSON schema. The client rejects any response that lacks a tool_use block or uses the wrong tool name.
Why:
- Makes schema drift a hard failure, not a parsing mystery.
- Avoids free-text parsing or JSON-in-text hacks.
- The tool schema IS the API contract; it diffs cleanly in code review.
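The rejection rule can be sketched as a single strict extractor — type names here are illustrative, not the real `anthropic-client.ts` definitions:

```typescript
// Illustrative shapes; the real client's types live in anthropic-client.ts.
interface ContentBlock {
  type: string;
  name?: string;
  input?: unknown;
}

interface MessagesResponse {
  content: ContentBlock[];
}

// Hard failure on schema drift: no matching tool_use block means no result.
function extractToolInput(res: MessagesResponse, toolName: string): unknown {
  const block = res.content.find(
    (b) => b.type === "tool_use" && b.name === toolName,
  );
  if (!block) {
    throw new Error(`expected tool_use block "${toolName}" in response`);
  }
  return block.input;
}
```

Any free-text preamble the model emits is simply ignored; only the named tool's `input` survives, and a missing or misnamed tool is an error, never a fallback parse.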
3. Immutable, semver-versioned prompts
Every prompt has an id (scoring.meme) and a semver (1.0.0) and lives in its own
file. Once tagged, the file is never edited. Changes bump the semver and create a
new file. The registry keeps every version forever.
Why:
- Replayability — given a score + prompt version, we can reconstruct the exact prompt that produced it.
- Explainability (Sprint C.4) — "why this score?" needs the prompt that was run, not the prompt that's current.
- Regulatory posture — when a creator disputes a score, the audit trail points to the exact prompt + tool schema used, frozen by version.
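The immutability rule is enforceable mechanically. A minimal sketch, assuming a hypothetical in-memory registry keyed by `id@version` (the real layout is one file per version):

```typescript
// Hypothetical registry sketch; field names are illustrative.
interface PromptVersion {
  id: string;       // e.g. "scoring.meme"
  version: string;  // semver, frozen once tagged
  system: string;
  toolName: string;
}

const registry = new Map<string, PromptVersion>();

function register(p: PromptVersion): void {
  const key = `${p.id}@${p.version}`;
  if (registry.has(key)) {
    throw new Error(`prompt ${key} is immutable; bump the semver instead`);
  }
  registry.set(key, p);
}

// Given a historical score's (id, version), return the exact prompt it ran.
function resolve(id: string, version: string): PromptVersion {
  const p = registry.get(`${id}@${version}`);
  if (!p) throw new Error(`unknown prompt ${id}@${version}`);
  return p;
}
```

The duplicate-register guard is the code-level twin of "once tagged, never edited": the only way to change a prompt is to mint a new version.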
4. Deterministic stub mode
When ANTHROPIC_API_KEY is absent, the client returns a deterministic pseudo-score
derived from a hash of the user message. Results are marked stub: true.
Why:
- C.1 can ship + be tested before Dep.2.4 (Anthropic enterprise credits) lands.
- Staging and dev environments can exercise the full pipeline (including persist + routes) without burning Anthropic credits.
- Tests run deterministically; no environment-dependent flakiness.
Safety rail: any score with at least one stub: true signal has hasStubs: true
at the aggregate level. The C.3 attestation publisher refuses to attest rows with
hasStubs: true. This prevents stub scores from leaking on-chain.
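The deterministic pseudo-score can be derived like this — a sketch under the assumption that the stub hashes the user message with SHA-256 (`stubScore` and the exact mapping are illustrative):

```typescript
import { createHash } from "node:crypto";

// Illustrative stub scorer: maps the user message to a stable score in [0, 100].
// The hash function and range mapping are assumptions, not the real client's.
function stubScore(userMessage: string): { score: number; stub: true } {
  const digest = createHash("sha256").update(userMessage).digest();
  // First two bytes of the digest, reduced into the score range.
  const score = digest.readUInt16BE(0) % 101;
  return { score, stub: true };
}
```

Same input, same score, on every machine and every run — which is exactly what makes tests deterministic and keeps staging off the metered API.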
5. Results are durable, raw responses are audit-only
Two tables:
- score_requests — the committed result and submission, queryable.
- llm_audit — the full raw Anthropic response, kept forever for audit/replay.
A failure to write llm_audit never fails the scoring call (best-effort); a failure
to write score_requests logs + returns the result anyway (the service treats the
result as authoritative; the client should persist it). This matches the operating
rule "every production change ships with logs + metrics + an alert" — DB downtime
triggers alerts but does not hard-fail user-visible scoring.
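The two failure policies differ only in wording, but both converge on "never hard-fail the scoring call." A hedged sketch, where `Db` is an assumed interface rather than the real data layer:

```typescript
// Hypothetical persistence wrapper; `Db` and `persistScore` are illustrative.
interface Db {
  insertAudit(raw: unknown): Promise<void>;
  insertScore(row: unknown): Promise<void>;
}

async function persistScore<T>(
  db: Db,
  row: T,
  raw: unknown,
  log: (msg: string) => void,
): Promise<T> {
  // llm_audit is best-effort: an audit failure must never break scoring.
  try {
    await db.insertAudit(raw);
  } catch {
    log("llm_audit write failed (alerting); continuing");
  }
  // score_requests failure logs + alerts, but the result is still returned.
  try {
    await db.insertScore(row);
  } catch {
    log("score_requests write failed (alerting); returning result anyway");
  }
  return row;
}
```

In production the `log` callback is where the "logs + metrics + an alert" rule attaches; the caller always gets the computed result back.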
6. Synchronous HTTP today, async job path tomorrow
POST /v1/score is synchronous with a 30s budget. Sprint C.3 introduces a job
queue (BullMQ on Redis) so that attestation publishing is decoupled from the
creator-facing request. C.1 ships only the synchronous path.
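One way the 30s budget could be enforced on the synchronous path — a sketch only; `withBudget` is a hypothetical name and the real handler may wire cancellation differently:

```typescript
// Illustrative 30s budget guard for the synchronous POST /v1/score path.
async function withBudget<T>(
  fn: (signal: AbortSignal) => Promise<T>,
  budgetMs = 30_000,
): Promise<T> {
  const ctrl = new AbortController();
  const timer = setTimeout(() => ctrl.abort(), budgetMs);
  try {
    return await fn(ctrl.signal);
  } finally {
    clearTimeout(timer);
  }
}
```

Passing the `AbortSignal` down lets the in-flight Anthropic request be cancelled when the budget expires, instead of orphaning it; the C.3 job queue replaces this budget with per-job timeouts.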
Consequences
Positive:
- Small, reviewable surface. ~800 LoC of TS + tests.
- Shippable before external dependencies (e.g. Dep.2.4) are unblocked.
- Every prompt diff is reviewable in PR history; every score is replayable.
- Aggressive upgrade resilience — SDK changes don't touch us.
Negative:
- We own retry/backoff logic. If Anthropic adds a new retryable status code, we have to notice and update isRetryable.
- We lose the SDK's type-safety helpers for tool definitions. Mitigated by our own strict parsing in anthropic-client.ts.
- Stub mode introduces a branch that must be exercised in tests every time we change the client. Two dedicated tests guard this.
Rejected alternatives
- Anthropic SDK — rejected for reasons in §1 above. Revisit in C.8 if streaming or batch becomes required.
- Unstructured prose + regex parsing — rejected. Fragile; has bitten every LLM-powered system that ever shipped it.
- One monolithic prompt for all six signals — rejected. Coupling. A bad score on one signal would force a full rerun; per-signal timeouts impossible; latency compounds serially.
- Fire-and-forget scoring (no persisted result) — rejected. No audit trail, no replay, no "why this score?" modal, no regulatory posture.
Open questions (for Sprint C.2)
- Meme-similarity embedding store — pgvector or a standalone vector DB?
- Creator-signal data pipe — Bitquery streaming via WS or scheduled pulls?
- Image-signal cost ceiling — Vision calls are ~4× text calls; budget envelope?
These are explicitly Sprint C.2 decisions and out of scope here.