Conformance checking and drift handling

wraith check is the conformance engine. It replays every recorded exchange through the synthesized twin and produces a quantitative measure of how well the twin matches reality — not a string-equality check, but a semantic diff that knows about generated IDs, timestamps, enums, and structural shape.

If the twin scores high enough, you can trust it. If it doesn’t, the report tells you exactly which routes are diverging and how.

In-memory vs wire mode

wraith check stripe                  # in-memory replay (fast, default)
wraith check stripe --wire           # spawn the real serve and replay through HTTP
wraith check stripe --upstream       # replay against the live upstream API

Mode	What it tests	Speed
`--in-memory`	Synth model directly. Skips the HTTP stack, scrub layer, header strip.	Fast
`--wire`	Spawns `wraith serve` on a loopback port and replays through reqwest. Catches protocol-level bugs the in-memory check is blind to (header stripping, scrub-layer mismatches, status drift).	Slower
`--upstream`	Replays the recorded requests against the live API. Detects drift in the upstream itself rather than in the twin.	Network-bound

In-memory is the default. Use wire mode in CI to catch problems that only a real HTTP stack surfaces. Use upstream mode to find out whether the recordings themselves are still valid.

What “conformance” means

The check engine compares each replayed response field-by-field against the recorded response. Every field is classified before comparison:

Classification	Comparison rule
`generated`	Skipped (different values are fine — UUIDs, etc.)
`timestamp_like`	Type-only (both numbers? both strings? compared structurally)
`constant`	Exact value comparison
`enum`	Value must be in the recorded set
`echoed`	Value must match what was sent in the request
Default (unclassified)	Exact value comparison

This classification is automatic: synth derives it from observing how each field behaves across recordings. To override it for a specific field, add an entry to wraith.toml:

[diff.fields]
"summary.total_value" = { classify = "constant" }
"expires_at" = { classify = "timestamp" }
"theme.color" = { classify = "enum", values = ["dark", "light"] }

Paths are hole-style: write them without the body. prefix, which is added automatically.
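The comparison rules in the classification table can be sketched in a few lines of Python. This is an illustration only, not Wraith's implementation: the function name `field_matches` and its parameters `recorded_set` and `request_value` are hypothetical, and the real engine may treat edge cases (nested values, numeric types) differently.

```python
from typing import Any, Collection, Optional


def field_matches(
    classification: Optional[str],
    recorded: Any,
    replayed: Any,
    *,
    recorded_set: Collection[Any] = (),  # hypothetical: values seen for an enum field
    request_value: Any = None,           # hypothetical: value sent in the request
) -> bool:
    """Apply the comparison rule for one classified field."""
    if classification == "generated":
        return True  # skipped: differing values are fine
    if classification == "timestamp_like":
        return type(recorded) is type(replayed)  # type-only comparison
    if classification == "enum":
        return replayed in recorded_set  # must be in the recorded set
    if classification == "echoed":
        return replayed == request_value  # must match what the request sent
    # "constant" and unclassified fields: exact value comparison
    return recorded == replayed
```

For example, `field_matches("timestamp_like", 1700000000, "2024-01-01")` is false under this sketch because a number is compared with a string, while two different UUID strings classified as `generated` always match.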

Scoring

Per-exchange scores cover three components: body structure, body values, headers. A session passes when ≥95% of exchanges pass; a run passes when ≥90% of sessions pass. Tune the thresholds:

[diff]
required_score = 0.90
session_pass_rate = 0.95

[diff.thresholds]
status_exact_match = true
body_structure = 0.90
body_values = 0.85
symbol_consistency = 1.0
header_conformance = 0.80

Per-route overrides are available for routes that legitimately need tighter or looser bars:

[diff.overrides."POST /v1/charges"]
body_structure = 0.95
body_values = 0.95

Score format is canonical: every output is score_bp (basis points, 0–10000) — 9500 means 95.00%.
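When consuming score_bp in scripts, it helps to stay in integer basis points and convert only for display. The helpers below are a small sketch; the names `bp_to_percent` and `meets_threshold` are hypothetical, and comparing score_bp against a fractional threshold from wraith.toml is an assumption about how you might gate on it, not a description of Wraith's own logic.

```python
def bp_to_percent(score_bp: int) -> str:
    """Render basis points (0 to 10000) as a percentage string."""
    if not 0 <= score_bp <= 10000:
        raise ValueError(f"score_bp out of range: {score_bp}")
    return f"{score_bp / 100:.2f}%"


def meets_threshold(score_bp: int, threshold: float) -> bool:
    """Compare against a fractional threshold such as 0.90.

    The threshold is converted to integer basis points first so the
    comparison itself does not depend on float rounding.
    """
    return score_bp >= round(threshold * 10000)
```

With these helpers, `bp_to_percent(9500)` gives `"95.00%"`, matching the convention stated above.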

Reading a divergence report

Run wraith check --format json to get the structured envelope:

{
  "twin": "stripe",
  "score_bp": 9847,
  "session_pass_rate": 1.0,
  "exchanges_total": 1240,
  "exchanges_passed": 1221,
  "divergences": [
    {
      "route": "POST /v1/charges",
      "session": "session-3",
      "exchange": 17,
      "path": "body.metadata.client_ip",
      "category": "value_mismatch",
      "severity": "error",
      "expected": "10.0.0.1",
      "actual": "192.168.1.1",
      "drift_id": "drift-9f2c4b8e1a3d5f70",
      "drift_type": "value_drift"
    }
  ]
}

Each divergence carries:

path: dotted path to the field that diverged (for example body.metadata.client_ip).
category: what kind of divergence (value_mismatch, extra_field, missing_field, array_length_mismatch, status_code_mismatch, etc.).
severity: error, warning, or info. Only error affects scoring.
drift_id: stable fingerprint of (route + path + category + values). Cite it in suppression rules.
drift_type: semantic classification of why it drifted (numeric_drift, url_drift, value_drift, host_rewrite, enum_expansion, additive_optional_field, field_removed, status_code_shift).
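Because the envelope is plain JSON, finding the highest-leverage route is a short script. The sketch below uses a trimmed copy of the sample report and counts error-severity divergences per route; the function name `errors_by_route` is hypothetical, and it relies only on the fields shown in the example envelope.

```python
import json
from collections import Counter

# Trimmed copy of the sample envelope shown above.
SAMPLE = """
{
  "twin": "stripe",
  "score_bp": 9847,
  "divergences": [
    {"route": "POST /v1/charges", "path": "body.metadata.client_ip",
     "category": "value_mismatch", "severity": "error",
     "drift_id": "drift-9f2c4b8e1a3d5f70", "drift_type": "value_drift"}
  ]
}
"""


def errors_by_route(report: dict) -> Counter:
    """Count divergences per route, keeping only those that affect scoring."""
    return Counter(
        d["route"]
        for d in report.get("divergences", [])
        if d.get("severity") == "error"
    )


report = json.loads(SAMPLE)
for route, count in errors_by_route(report).most_common():
    print(f"{count:4d}  {route}")
```

In practice you would load the output of wraith check --format json instead of the embedded sample; this sketch assumes the envelope is written to standard output and saved to a file first.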

Two suppression layers

Wraith separates two kinds of “this divergence is fine”:

`[[diff.suppress]]` in `wraith.toml` — inherent twin behavior

Divergences that are inherent to the synthesized twin and should never have been reported. Suppressed entries are excluded from scoring and from the divergence list (but they’re counted, and --show-suppressed lists them).

[[diff.suppress]]
path = "body.created_at"
reason = "twin uses placeholder timestamps"

[[diff.suppress]]
route = "POST /repos/*/statuses/*"
category = "value_mismatch"
reason = "commit status fields are state-dependent"

Use when the divergence is something the synth model fundamentally can’t replicate (placeholder timestamps, generated IDs in surrogate format, etc.). These aren’t drifts — they’re inherent.
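To reason about what a rule will catch before adding it, it can help to model the matching yourself. The following is a rough sketch using Python's `fnmatch` globs, under the assumption that a rule matches when every key it sets (route, path, category) matches the divergence. The function name `rule_matches` is hypothetical, and Wraith's own glob semantics may differ; for instance, `fnmatch` lets `*` match across `/`.

```python
from fnmatch import fnmatchcase

# Keys a suppression rule can match on; "reason" is documentation only.
MATCH_KEYS = ("route", "path", "category")


def rule_matches(rule: dict, divergence: dict) -> bool:
    """Return True when every matching key set on the rule globs the divergence."""
    keys = [key for key in MATCH_KEYS if key in rule]
    if not keys:
        return False  # a rule with no matching keys suppresses nothing
    return all(
        fnmatchcase(str(divergence.get(key, "")), rule[key])
        for key in keys
    )


rule = {
    "route": "POST /repos/*/statuses/*",
    "category": "value_mismatch",
    "reason": "commit status fields are state-dependent",
}
divergence = {
    "route": "POST /repos/acme/statuses/abc123",
    "path": "body.state",
    "category": "value_mismatch",
}
print(rule_matches(rule, divergence))
```

The rule here constrains route and category but not path, so any field on a matching route is covered. That breadth is worth confirming with --show-suppressed.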

`drift.toml` — accepted drifts

Optional file next to scrub.toml. Drifts that have been reviewed and accepted as known and harmless, but stay visible in the report:

[[suppress]]
drift_type = "additive_optional_field"
route = "GET /v1/users/*"
reason = "backend adds optional fields on schedule; not worth reclassifying"

[[suppress]]
drift_id = "drift-9f2c4b8e1a3d5f70"
reason = "known harmless field-order change in search responses"

Reclassify drifts that should be a different category without suppressing them:

[[reclassify]]
match = { route = "POST /v1/jobs", path = "body.status" }
new_drift_type = "enum_expansion"
reason = "upstream adds new enum values often; not a schema break"

wraith.toml suppressions are applied before drift classification; drift.toml rules are applied after it. --show-suppressed lists entries from both layers.
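The ordering of the two layers can be pictured as a small pipeline. This is a conceptual sketch only: the `Report` type, the `apply_layers` function, and the `accepted` flag are all hypothetical names, and rules are modeled as plain predicates rather than parsed TOML.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Divergence = Dict[str, object]
Rule = Callable[[Divergence], bool]


@dataclass
class Report:
    visible: List[Divergence] = field(default_factory=list)     # stays in the report
    suppressed: List[Divergence] = field(default_factory=list)  # hidden, but counted


def apply_layers(
    divergences: List[Divergence],
    inherent: List[Rule],   # stands in for [[diff.suppress]] in wraith.toml
    accepted: List[Rule],   # stands in for [[suppress]] in drift.toml
) -> Report:
    report = Report()
    for d in divergences:
        # Layer 1 runs first: inherent twin behavior leaves the divergence list.
        if any(rule(d) for rule in inherent):
            report.suppressed.append(d)
            continue
        # Layer 2 runs afterwards: accepted drifts are flagged but remain visible.
        entry = dict(d)
        entry["accepted"] = any(rule(entry) for rule in accepted)
        report.visible.append(entry)
    return report
```

A divergence on body.created_at that matches an inherent rule never reaches the second layer, while an accepted drift_id is still listed, just marked as reviewed.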

Authored-output deviations (Lua handlers & fixtures)

Since v0.17.0, routes served by a Lua handler get an extra check: the handler’s raw output is compared against the shape of the recorded responses for that route. A structural deviation — a mis-cased field name, a missing key, a wrong type, a status the recordings never produced — is reported as an authored_deviation finding, and an unmarked one fails the run with exit 2 even when the conformance score passes. This closes the “confidently wrong twin” gap: before v0.17.0 a handler emitting licenseAgreementID where every recording said licenseAgreementId shipped at a perfect score.

Intentional deviations are declared in wraith.toml — same glob matching as [[diff.suppress]], but reason is required and the meaning is different: this is declared intent, not noise suppression.

[[deviations]]
route = "GET /assets/:id"
path = "$.comparisonSegments"
reason = "segments unused in this workflow; handler serves an empty list"

A marked deviation passes the gate, and value-level divergences at or under the marked path are exempted from scoring (listed by --show-suppressed). If you want the findings reported without failing builds — e.g. while triaging after an upgrade — set the migration valve:

[handlers]
deviation_policy = "warn"   # default is "error"

In JSON output the findings are in .conformance.authored_deviations[], each with marked and the matched rule’s reason.
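The gate described in this section can be modeled as a function from the JSON envelope and the configured policy to an exit code. This sketch assumes only the `marked` and `reason` fields named above; the `route` and `path` keys in the sample finding and the function name `deviation_exit_code` are hypothetical, and the real exit code also depends on the rest of the check.

```python
def deviation_exit_code(envelope: dict, policy: str = "error") -> int:
    """Return 2 when an unmarked authored deviation should fail the run."""
    findings = envelope.get("conformance", {}).get("authored_deviations", [])
    unmarked = [f for f in findings if not f.get("marked", False)]
    if unmarked and policy == "error":
        return 2  # an unmarked deviation fails even when the score passes
    return 0      # all deviations marked, or policy is "warn"


envelope = {
    "conformance": {
        "authored_deviations": [
            # "route" and "path" are illustrative keys, not confirmed by the docs.
            {"route": "GET /assets/:id", "path": "$.comparisonSegments",
             "marked": True, "reason": "segments unused in this workflow"},
            {"route": "GET /assets/:id", "path": "$.licenseAgreementID",
             "marked": False},
        ]
    }
}
print(deviation_exit_code(envelope))          # unmarked finding under "error"
print(deviation_exit_code(envelope, "warn"))  # same findings under "warn"
```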

Freshness SLA

A twin recorded 47 days ago serves with the same confidence as one recorded yesterday — unless you make staleness a gate. Add a [freshness] section to wraith.toml:

[freshness]
max_age   = "30d"    # SLA: check fails (exit 2) when the newest recording is older
warn_age  = "14d"    # advisory: warning advice past this
max_drift = 0.05     # optional: max fraction of drifted routes from the latest refresh

wraith check then fails with twin-stale / twin-drift-exceeded advice when the SLA is violated, and wraith doctor reports the same verdict; the JSON envelope carries the verdict under .freshness.status (fresh | warn | stale). Without a [freshness] section nothing changes.
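The age part of the verdict reduces to two comparisons. The sketch below is an approximation under stated assumptions: it only parses the day suffix used in the example config, treats "older" as strictly greater than the limit, and ignores max_drift. The function names `parse_age` and `freshness_status` are hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone


def parse_age(text: str) -> timedelta:
    """Parse a duration like "30d" (only the day suffix is handled here)."""
    match = re.fullmatch(r"(\d+)d", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return timedelta(days=int(match.group(1)))


def freshness_status(
    newest_recording: datetime,
    now: datetime,
    max_age: str = "30d",
    warn_age: str = "14d",
) -> str:
    """Map the age of the newest recording to fresh, warn, or stale."""
    age = now - newest_recording
    if age > parse_age(max_age):
        return "stale"  # SLA violated: the check would fail
    if age > parse_age(warn_age):
        return "warn"   # advisory only
    return "fresh"


now = datetime(2025, 1, 31, tzinfo=timezone.utc)
print(freshness_status(now - timedelta(days=47), now))
```

Under the example thresholds, the 47-day-old twin from the opening sentence lands in the stale bucket.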

Age is also visible at serve time regardless of config: wraith serve prints a startup banner with the twin’s age, stamps X-Wraith-Twin-Age / X-Wraith-Recorded-At on every response, and reports the same fields in --ready-json and /__wraith/info. The intended CI loop: wraith refresh re-records on a schedule, and wraith check with a [freshness] SLA makes sure a rotting twin can’t pass silently.

`--show-suppressed`

wraith check stripe --show-suppressed

Lists every suppressed field path with the reason from the rule that suppressed it. Useful when adding a new suppression — confirm it’s catching what you intended and not more.

Action loop

If the check fails:

1. Read the report. Find the highest-leverage divergence; one rule often explains many findings.
2. Decide which suppression layer applies. Inherent twin behavior → [[diff.suppress]]. Known, accepted drift → drift.toml.
3. If neither applies, the twin is wrong. Re-record (wraith record against the upstream), re-synthesize (wraith synth), or run wraith generate to apply LLM-assisted fixes.
4. Re-run the check and confirm the score moved.

If the score has plateaued and the remaining divergences are real drift in the upstream itself, the fix is upstream — not in the twin. The twin is doing its job by surfacing the drift.