Components & Services — Audit Trail Platform (ATP)

This document is the component-level reference for ATP. It turns the Architecture Overview and HLD into deployable, testable building blocks with clear responsibilities, interfaces, data boundaries, tenancy guards, resilience patterns, and observability hooks.

  • Purpose — Describe each service/component so teams can implement, scale, and operate it consistently:
    • What it does (responsibility & boundaries)
    • How it’s called (REST/gRPC/webhooks/events)
    • Where its data lives (authoritative stores & projections)
    • How tenancy, security, and policy are enforced
    • How we keep it reliable (timeouts, retries, outbox/DLQ) and observable (OTel, SLIs/SLOs)
  • Audience — Engineers (app/platform), SRE/Operations, Security/Compliance, and Product/Delivery.

How this doc relates to other architecture docs

Scope & Structure

Each component uses the same template:

  1. Responsibility & Boundaries — what the component owns; what it explicitly does not do.
  2. Interfaces — REST/gRPC/webhooks (endpoints, verbs, auth), event subjects it publishes/consumes.
  3. Data — authoritative store, projections/indexes, partitioning strategy, retention/classification hooks.
  4. Tenancy & Security — tenant/edition enforcement, RBAC/ABAC checks, KMS/secret usage, PII handling.
  5. Resilience — timeouts, retries with jitter, outbox/inbox, DLQ & replay, bulkheads/circuit breakers.
  6. Observability & SLOs — required traces/metrics/logs, golden signals, SLI/SLO targets and alerts.
  7. Risks & Edge Cases — known failure modes and mitigations (linking to runbooks/ADRs).
  8. Contracts & Links — pointers to API specs, schemas, and related diagrams.

Out of scope: exhaustive API parameter docs (see Contracts), cloud-specific manifests (see Deployment Views), and business-process runbooks (see Operations).



Service Catalog (At-a-Glance)

This catalog mirrors the bounded contexts in the Context Map and summarizes each component using our standard sections: Responsibility, Interfaces, Data, Tenancy, Contracts, Resilience, Observability/SLOs, Risks. Deep-dives follow in later cycles.

Legend: REST=gateway-fronted APIs, EVT=event subjects (publish/consume).

| Component | Responsibility | Interfaces (primary) | Data (authoritative) | Tenancy | Contracts | Resilience | Observability / SLOs | Risks (top) |
|---|---|---|---|---|---|---|---|---|
| Gateway | Single ingress: AuthN/Z, tenancy resolution, versioning, rate limits, schema validation | REST: /api/v{n}/... (append/query/export/admin) | — (stateless) | Require tenantId (JWT claim preferred); edition gates at edge | REST specs; webhook sigs | Shed with 429; no retries downstream; timeouts per route; WAF | Route p95/p99, 4xx/5xx, 429 rate | Tenant spoofing; mis-versioning |
| Ingestion | Canonicalize & validate events; apply classification/retention; append; emit acceptance | REST: POST /audit/append · EVT (pub): audit.appended, audit.accepted | Append Store (hot) | Persist tenantId; RLS/filters; per-tenant quotas | REST + event schemas | Tx outbox, idempotency keys, retries with jitter, DLQ | Append p95; outbox age p99 | Scale spikes; schema drift |
| Policy | Decisions: classify, retain, redact; edition evaluation | REST: /policy/evaluate · EVT (pub): policy.changed | Policy store (read-mostly) | Tenant-scoped decisions; cache with TTL | REST + event schemas | Short timeouts; local cache; breakers; fallback modes | Decision p95; hit ratio | Compliance drift; stale cache |
| Integrity | Hash chains, signatures, checkpoints; verification APIs; manifest attestation | REST: /integrity/verify/* · EVT (sub/pub): audit.appended / integrity.verified | Evidence ledger (chain/checkpoints) | Tenant chains per time slice; key scope per env/tenant (opt) | REST + manifest format | Idempotent chain updates; signer retries; checkpoint windows | Verify p95; signer errors | Key rotation gaps; chain gaps |
| Projection | Build/maintain read models & indexes; track watermarks/lag; rebuilds | EVT (sub): audit.accepted · EVT (pub): projection.updated | Read Models / Indexes (warm) | Tenant-partitioned models; no cross-tenant joins | Event schemas | Idempotent projectors; lag-based autoscaling; DLQ + replay | Lag median/p95; DLQ depth | Hot partitions; replay mistakes |
| Query | Tenant-scoped retrieval; policy-aware filtering & redaction; selection for export | REST: GET /query/* | Read Models (RO) | Enforce tenant/edition; redaction by classification | REST specs | Cache; bulkheads vs Export; timeouts | p95/p99; cache hit; watermark drift | Noisy neighbor; read amp |
| Search (optional) | Full-text/time-range across projected data | REST: /search/* · EVT (sub): projection feed | Search index | Tenant shards; per-tenant analyzers (where relevant) | REST + index schema | Async refresh; ISR windows; circuit breakers | p95 latency; index age | Cost creep; index drift |
| Export | Package selections; build signed manifests; deliver; optional webhooks | REST: /export/* (create/poll/download) · EVT (pub): export.requested\|completed | Cold artifacts (immutable object store) | Tenant-scoped packages; legal hold & residency respected | REST + manifest format | Resumable streams; concurrency caps; queue back-pressure | TTFB p95; completion p95 | Read SLO impact; large jobs |
| Admin / Control Plane | Policies, schema registry pointers, DLQ/replay, rebuilds, feature flags | REST: /admin/* | Control metadata & logs | Break-glass guarded; SSO + IP allow-lists; full audit | REST specs | Guarded ops; approvals; dry-run replays | Action success; time-to-approve | Misuse of power; audit gaps |


Audit.Gateway (Component Deep-Dive)

Responsibility

The single ingress for ATP. Enforces AuthN/Z, tenancy resolution, edition gates, rate limits/quotas, schema validation, and versioning. It terminates TLS, injects Zero-Trust signals (tenant/edition/trace), and forwards to app services through the mesh using standardized headers. Gateway exposes an Open Host Service (OHS) with a Published Language (stable paths, error model, headers).

Out of scope: business decisions (classification/retention), persistence, projections.


C4 L3 — Component Diagram

flowchart LR
  subgraph Edge
    WAF[WAF/Ingress Controller]
    GW[Audit.Gateway<br/>AuthN/Z • Tenancy • RL • Versioning • Schema]
  end

  subgraph Mesh[Service Mesh - mTLS/L7 AuthZ]
    ING[Audit.Ingestion]
    QRY[Audit.Query]
    EXP[Audit.Export]
    ADM[Admin/Control]
    POL[Audit.Policy - meta]
  end

  Producer[(Event Producers / Clients / UIs)] --> WAF --> GW --> Mesh
  GW -->|REST/gRPC| ING
  GW -->|REST/gRPC| QRY
  GW -->|REST/gRPC| EXP
  GW -->|REST/gRPC| ADM
  GW -.cache/meta.-> POL

  classDef s fill:#F4F7FF,stroke:#7aa6ff,rx:6,ry:6;
  class GW,ING,QRY,EXP,ADM,POL s;

Dependency edges: synchronous requests to Ingestion/Query/Export/Admin; read-only metadata (edition/route policy, schema registry) optionally cached from Policy. No write decisions flow through Gateway.


Interfaces

Inbound (OHS)

  • REST (primary), optional gRPC for high-QPS clients.
  • Versioning: /api/v{n}/... or X-Api-Version: n.
  • Error model: Problem+JSON.

Outbound (to services)

  • REST/gRPC over mTLS via mesh.
  • Standard header propagation: x-tenant-id, x-edition, traceparent, tracestate, x-correlation-id, x-policy-version (if known), x-idempotency-key (append).

API Surface (selected)

| Route | Verb | Target | Notes |
|---|---|---|---|
| /api/v{n}/audit/append | POST | Ingestion | Requires X-Idempotency-Key; schema/size limits |
| /api/v{n}/query/* | GET | Query | Returns X-Watermark, X-Lag |
| /api/v{n}/export | POST | Export | Creates package; resumable download via signed URL |
| /api/v{n}/export/{id} | GET | Export | Poll/download; supports Range |
| /api/v{n}/admin/* | GET/POST | Admin | SSO + IP allow-lists; break-glass logged |

Webhook Ingest (optional)

  • Endpoint: /api/v{n}/webhooks/ingest
  • Security: HMAC/signature verification; replay detection via eventId/X-Idempotency-Key.
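A minimal sketch of the webhook verification step described above, assuming an HMAC-SHA256 signature over the raw body and an in-memory replay store (a real deployment would use a shared TTL store such as Redis; the header and store names here are illustrative, not part of the contract):

```python
import hashlib
import hmac

SEEN_EVENT_IDS: set[str] = set()  # illustrative; in practice a TTL-bounded shared store

def verify_webhook(secret: bytes, body: bytes, signature: str, event_id: str) -> bool:
    """Accept only if the HMAC-SHA256 of the raw body matches and eventId is fresh."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # hmac.compare_digest avoids timing side channels on the comparison.
    if not hmac.compare_digest(expected, signature):
        return False  # forged or corrupted payload
    if event_id in SEEN_EVENT_IDS:
        return False  # replay: same eventId delivered twice
    SEEN_EVENT_IDS.add(event_id)
    return True
```

Note that verification must run over the raw request bytes, before any JSON parsing or canonicalization.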

Data

Gateway is stateless. It persists no tenant data. Caches allowed:

  • Schema registry (read-through, TTL).
  • Edition/route policy (small, TTL).
  • JWKS/IdP (per OIDC recommendations).

Tenancy & Security

  • Tenant required at ingress. Prefer token claim; accept X-Tenant-Id only for trusted workload identities.
  • Edition gates: deny routes not enabled for the tenant’s edition.
  • AuthN: OIDC/OAuth2 for users; workload identity/JWT for services.
  • AuthZ (coarse) at Gateway; fine ABAC in services.
  • TLS at edge, mTLS inside mesh; deny-by-default egress.

Contracts

  • Published Language: stable routes, headers, and Problem+JSON error shapes.
  • Headers (ingress): Authorization, Content-Type, X-Idempotency-Key (append).
  • Headers (propagated): x-tenant-id, x-edition, traceparent, tracestate, x-correlation-id.
  • Deprecation: Deprecation: true, Sunset: <rfc1123>, Link: <url>; rel="deprecation".

Problem+JSON example

{
  "type": "https://errors.atp.example/schema",
  "title": "Invalid request payload",
  "status": 400,
  "detail": "Field 'attributes' failed schema validation",
  "instance": "urn:trace:01J9...-req-7f3a",
  "tenantId": "t-acme",
  "code": "SCHEMA_VALIDATION_FAILED"
}

Rate Limits, Quotas & Validation

| Control | Default (tunable) | Notes |
|---|---|---|
| Burst RL | e.g., 200 rps / tenant / route | 429 + Retry-After |
| Sustained RL | e.g., 50 rps / tenant / route | Sliding window |
| Payload size | e.g., 256 KB on append | Hard reject at edge |
| Schema validation | Strict by default | Unknown fields rejected unless whitelisted |
| Export concurrency | e.g., 2 concurrent / tenant | Protect Query SLOs |

Resilience

  • Fail fast at edge: invalid schema/tenant/edition → 4xx.
  • Back-pressure: 429 with budgets; communicate X-RateLimit-*.
  • No retries downstream from Gateway (avoid retry storms).
  • Timeouts per route (derived from SLO; ≈ p95 × 1.2), circuit-breakers for IdP/Policy cache.
  • WAF rules for common threats; request body size/time caps.
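The per-route timeout rule above (timeout ≈ p95 × 1.2) is a simple derivation; a sketch with illustrative latencies (the routes and numbers are examples, not real SLO values):

```python
# Observed p95 latencies per route in milliseconds (illustrative numbers).
P95_MS = {"/audit/append": 120, "/query": 250, "/export": 900}

def route_timeout_ms(p95_ms: float, factor: float = 1.2) -> int:
    """Derive a per-route timeout from the SLO rule of thumb: ~ p95 x 1.2."""
    return round(p95_ms * factor)

TIMEOUTS = {route: route_timeout_ms(ms) for route, ms in P95_MS.items()}
```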

Observability

  • Traces: one span per request with route, tenant, edition, auth latency, schema check time.
  • Metrics: RPS; p95/p99 per route; 4xx/5xx; 429 rate; schema-reject rate.
  • Logs: structured; include x-request-id, traceId, tenant/edition; no PII.

SLOs (initial)

  • Route latency p95 ≤ X ms, p99 ≤ Y ms.
  • 5xx ratio ≤ R ppm (excluding downstream 5xx surfaced as 503).
  • Accurate tenancy propagation: 100% of forwarded calls carry x-tenant-id and traceparent.

Risks

  • Tenant spoofing → accept tenant from validated tokens; normalize once; reject conflicts.
  • Schema drift → strict validation + registry, contract tests in CI.
  • Noisy neighbor → per-tenant RL/quotas; export concurrency caps.
  • Mis-versioning → deprecation headers + changelog; dual-stack during sunset.


Audit.Ingestion (Component Deep-Dive)

Responsibility

Accept append requests from Gateway/webhooks, canonicalize payloads (Anti-Corruption Layer), evaluate policy (classification/retention), enforce idempotency, persist to the Append Store (hot), and emit acceptance events for downstream Integrity and Projection. Provides the exactly-once intent boundary for writes.

Out of scope: read/query, search indexing, export packaging, admin policy authoring.


C4 L3 — Component Diagram

flowchart LR
  subgraph API[API Adapters]
    CTRL[AppendController<br/>REST/gRPC]
    WH[WebhookReceiver<br/>- optional]
  end

  subgraph APP[Application -> Use-cases]
    UC[AppendAuditRecordUseCase]
    IDEM[IdempotencyService]
    CAN[Canonicalizer/ACL]
    VAL[SchemaValidator]
    POL[PolicyClient - sync]
    OUTB[Outbox Port]
    REPO[AuditRecordRepository Port]
  end

  subgraph DOM[Domain]
    AR[AuditRecord Aggregate]
    CLS[Classification]
    RET[RetentionPolicy]
  end

  subgraph INFRA[Infrastructure Adapters]
    RDB[(Append Store - hot)]
    BUS[(Event Bus)]
    OTL[(OTel/Logs/Metrics)]
  end

  CTRL --> UC
  WH --> UC
  UC --> VAL
  UC --> IDEM
  UC --> CAN
  UC --> POL
  UC --> REPO
  UC --> OUTB
  UC --> AR
  REPO -.-> RDB
  OUTB -.-> BUS
  UC -.-> OTL

  classDef b fill:#F4F7FF,stroke:#7aa6ff,rx:6,ry:6;
  class CTRL,WH,UC,IDEM,CAN,VAL,POL,OUTB,REPO,AR,CLS,RET b;

Dependency edges: synchronous to Policy (decisions), transactional to Append Store, asynchronous to Event Bus via Outbox.


Interfaces

Inbound

  • REST/gRPC (via Gateway): POST /api/v{n}/audit/append
  • Webhook (optional): POST /api/v{n}/webhooks/ingest (HMAC/signature verification)

Outbound

  • Events (publish): audit.appended, audit.accepted
  • Sync decisions: POST /policy/evaluate (classification/retention/redaction snapshot)
  • Integrity materials: provided to Integrity indirectly via events (no tight coupling)

Headers (ingress)

  • Required: Authorization, X-Idempotency-Key, Content-Type: application/json
  • Propagated from Gateway: x-tenant-id, x-edition, traceparent, tracestate, x-correlation-id

Sequence — Accept → Persist → Emit

sequenceDiagram
  autonumber
  participant GW as API Gateway
  participant ING as Ingestion
  participant POL as Policy
  participant DB as Append Store (hot)
  participant OB as Outbox (tx)
  participant BUS as Event Bus

  GW->>ING: AppendCommand(tenant, idemKey, payload)
  ING->>ING: Schema validate + Canonicalize (ACL)
  ING->>ING: Idempotency check (idemKey)
  alt duplicate
    ING-->>GW: 200 OK { recordId, idempotent: true }
  else new
    ING->>POL: Evaluate(classify, retain) [TTL cache]
    POL-->>ING: Decision(version, labels, retention)
    ING->>DB: Append(record + policy stamps) [atomic tx]
    ING->>OB: Enqueue(audit.appended, audit.accepted) [same tx]
    OB->>BUS: Relay (idempotent publish)
    ING-->>GW: 202 Accepted { recordId, traceId }
  end

Data

  • Authoritative store: Append Store (hot); append-only records with policy stamps and digests.
  • Keys
    • recordId: ULID (sortable within time bucket)
    • idempotencyKey: tenantId:sourceId:sequence|hash
  • Partitions: time bucket (e.g., day) × tenantId (optionally sourceId for hot shards)
  • Indexes: (tenantId, time), (tenantId, sourceId, sequence), recordId
  • Retention: short in Hot (policy-driven), tiered to Warm/Cold by lifecycle jobs
  • Integrity prep: canonical content digest stored with record; chain linking done downstream
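The canonical content digest mentioned above can be sketched like this, assuming JSON canonicalization by sorted keys and compact separators (a production ACL would pin an exact spec such as RFC 8785 JCS; this is only one reasonable choice):

```python
import hashlib
import json

def canonical_digest(record: dict) -> str:
    """Digest over a canonical JSON form: sorted keys, no insignificant whitespace.
    Two semantically equal records must hash identically regardless of key order."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

The resulting digest is what Integrity later links into the per-tenant chain.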

Tenancy & Security

  • Tenant required (token claim preferred); verify edition gates before accepting.
  • ABAC: workload or user role must authorize append for tenantId.
  • PII discipline: no PII in headers; payload fields carry classification tags.
  • Secrets/Keys: no direct KMS calls; uses Gateway-provided identity and internal secret manager for config.

Contracts

  • API (OHS): POST /api/v{n}/audit/append
    • Body: canonical audit record (sourceId, actorId, action, resource, attributes, occurredAt)
    • Headers: X-Idempotency-Key mandatory
    • Responses:
      • 202 Accepted { recordId, traceId } (new write)
      • 200 OK { recordId, idempotent: true } (dedupe)
      • 400/403 (schema/policy/auth errors) — no retry
  • Events (Published Language)
    • audit.appended (pre-commit checks passed; for Integrity)
    • audit.accepted (persisted + classified + retained; for Projection/Search)
  • Schema evolution: additive-first; breaking via new subject/major; validated in CI

Idempotency Scope & Inbox/Dedupe

  • Scope: per tenantId:sourceId; uniqueness by sequence (preferred) or payload hash.
  • Behavior:
    • Duplicate key with identical payload → 200 OK, idempotent:true.
    • Duplicate key with different payload → 409 Conflict (IDEMPOTENCY_MISMATCH).
  • Inbox receipts (consumer side): store (eventId|idempotencyKey) for M days to drop duplicates.
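The duplicate/mismatch behavior above can be sketched with a payload-digest store keyed by idempotency key (the `SEEN` dict stands in for a durable receipts table and is purely illustrative):

```python
import hashlib
import json

SEEN: dict[str, str] = {}  # idempotencyKey -> digest of the payload first seen (illustrative)

def check_idempotency(key: str, payload: dict) -> str:
    """Return 'new', 'duplicate' (-> 200 idempotent:true), or 'mismatch' (-> 409)."""
    digest = hashlib.sha256(
        json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    ).hexdigest()
    prior = SEEN.get(key)
    if prior is None:
        SEEN[key] = digest
        return "new"
    # Same key, same payload is a harmless retry; same key, different payload is a bug.
    return "duplicate" if prior == digest else "mismatch"
```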

Resilience

  • Transactional Outbox: record + events committed atomically; relay publishes with retries + jitter.
  • Retry policy: no retry on 4xx; transient 5xx/timeouts → bounded retries (decorrelated jitter, cap 10s).
  • DLQ: per subscription; triage with schema/version/attempt metadata; replay tools by tenant/time window.
  • Back-pressure: Gateway 429 on hot partitions; per-tenant quotas; consumer concurrency caps.
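The bounded-retry policy above uses decorrelated jitter; a sketch of the backoff schedule (base and cap values are the tunables named in the policy, here with illustrative defaults):

```python
import random

def decorrelated_jitter(prev_s: float, base_s: float = 0.1, cap_s: float = 10.0) -> float:
    """Next sleep per 'decorrelated jitter': uniform between base and 3x the
    previous sleep, capped (here at 10 s, matching the policy above)."""
    return min(cap_s, random.uniform(base_s, prev_s * 3))
```

Each retry feeds its own sleep back in as `prev_s`, which spreads retry storms out over time instead of synchronizing them.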

Observability

  • Traces: append span (attrs: tenantId, edition, idempotencyKey, policyVersion, recordId).
  • Metrics: append.accepted.latency, outbox.relay.age, events.published.count, dedupe.rate, schema.reject.rate.
  • Logs: structured; include traceId, recordId, idempotent flag, no PII; redaction hints on classified fields.

SLOs (initial)

  • Append accepted p95 ≤ X ms at baseline QPS.
  • Outbox age p99 ≤ N s under steady load.
  • Schema reject rate ≤ R ppm (non-malformed traffic).
  • Idempotent dedupe accuracy: 100% within retention window.

Risks

  • Scale spikes → enforce per-tenant RL/quotas; shard by tenantId:sourceId; autoscale consumers on outbox age.
  • Schema drift → strict validation; schema registry contracts; staged rollouts.
  • Policy outage/latency → TTL cache + circuit breaker; fail-closed on risky decisions.
  • Hot partitions → adaptive concurrency; producer feedback via Retry-After; replay after pressure subsides.


Audit.Policy (Component Deep-Dive)

Responsibility

Provide deterministic policy decisions for:

  • Classification (e.g., PUBLIC/INTERNAL/PERSONAL/SENSITIVE),
  • Retention (policy binding + duration),
  • Redaction (mask/hash/omit rules on read/export),
  • Edition gating (feature availability per tenant/edition).

This service is the Supplier in Customer–Supplier relationships with Ingestion, Query, and Export: they consume decisions; Policy supplies them and emits change signals.
Out of scope: authoring UIs (owned by Admin/Control Plane), data writes beyond its own policy store, and business telemetry.


C4 L3 — Component Diagram

flowchart LR
  subgraph API[API Adapters]
    POLCTL[PolicyController<br/>REST/gRPC]
  end

  subgraph APP[Application -> Use-cases]
    EVAL[EvaluatePolicyUseCase]
    CACHE[DecisionCache - TTL + SWR]
    CMP[Compiler/RuleEngine]
    PUB[PolicyChangePublisher]
    STORE[PolicyRepository Port]
  end

  subgraph INFRA[Infrastructure]
    PDB[(Policy Store<br/>read-mostly)]
    BUS[(Event Bus)]
    OTL[(OTel/Logs/Metrics)]
  end

  POLCTL --> EVAL
  EVAL --> CACHE
  EVAL --> STORE
  EVAL --> CMP
  PUB -.-> BUS
  STORE -.-> PDB
  EVAL -.-> OTL

  classDef b fill:#F4F7FF,stroke:#7aa6ff,rx:6,ry:6;
  class POLCTL,EVAL,CACHE,CMP,PUB,STORE b;

Dependency edges: synchronous evaluate calls from Ingestion, Query, Export; policy.changed events published for caches to invalidate across services.


Interfaces

Inbound (decision APIs)

  • POST /api/v{n}/policy/evaluate (general) — with mode = append|read|export.
  • (Optional split) POST /api/v{n}/policy/evaluate/append · /read · /export.

Outbound

  • Events (publish): policy.changed with affected scopes and policyVersion.
  • (Optional) ETag-based config endpoints for bulk policy retrieval by Admin tools.

Headers (required/propagated)

  • Inbound: Authorization, Content-Type: application/json.
  • Propagated context: x-tenant-id, x-edition, traceparent, x-correlation-id.

Contracts (Requests/Responses)

Evaluate — Append (classify + retain)

POST /api/v1/policy/evaluate
{
  "mode": "append",
  "tenantId": "t-acme",
  "edition": "enterprise",
  "attributes": {
    "sourceId": "order-svc",
    "actorId": "u-123",
    "action": "UPDATE",
    "resource": "Order/4711",
    "fields": { "attributes.email": "alice@example.com" }
  }
}

Response

{
  "allow": true,
  "policyVersion": 17,
  "classification": "PERSONAL",
  "retentionPolicyId": "rp-2025-01",
  "labels": ["pii", "contact"],
  "ttlSeconds": 300,
  "reason": "rule#CLS-12"
}

Evaluate — Read/Export (redaction + gates)

POST /api/v1/policy/evaluate
{
  "mode": "read",
  "tenantId": "t-acme",
  "edition": "enterprise",
  "selection": { "resource": "Order/4711", "fields": ["actorId","attributes.email","attributes.cardLast4"] }
}

Response

{
  "allow": true,
  "policyVersion": 17,
  "redaction": {
    "fields": {
      "actorId": "MASK4",
      "attributes.email": "HASH",
      "attributes.cardLast4": "ALLOW_IF(role:Auditor & scope:PII)"
    }
  },
  "editionGates": ["export.manifest.signing","search.facets"],
  "ttlSeconds": 120,
  "reason": "rule#RED-03"
}

Change Event — Policy Changed

{
  "subject": "policy.changed",
  "schemaVersion": "1.0.0",
  "tenantId": "t-acme",
  "policyVersion": 18,
  "effectiveAt": "2025-10-22T12:00:00Z",
  "scopes": ["classification","retention","redaction"]
}

Problem+JSON for errors: 400 (invalid request), 403 (denied), 429 (rate limit), 503 (dependency).


Data

  • Authoritative store: Policy Store (read-mostly).
  • Artifacts: PolicySet, Rule, DecisionTemplate, and compiled decision DAG.
  • Versioning: monotonic policyVersion per tenant; stamped on write decisions and re-evaluated on read/export.
  • Cache: in-memory LRU with TTL + stale-while-revalidate (SWR); warmers at startup.

Tenancy & Security

  • Tenant-scoped policies; no cross-tenant reads.
  • Edition-aware: decisions may enable/deny features per edition.
  • AuthZ: evaluate requires caller’s identity/role context; Admin updates are separate and guarded (SSO + approvals) via Admin/Control.
  • PII discipline: requests should pass field names, not values, where possible; response contains directives (MASK/HASH/OMIT), not data.

Cache Strategy

  • Local TTL (e.g., 300s for append; 120s for read/export).
  • Invalidate on policy.changed event for the matching tenant/scope.
  • ETag for bulk fetch by Admin tooling (If-None-Match).
  • Fallbacks:
    • Write (append): fail-closed if policy unavailable (deny or minimal retention).
    • Read (query/export): fail-safe with maximal redaction if decision unavailable.

Resilience

  • Timeouts: short (e.g., 100–300 ms budget) — aligns with append/query SLOs.
  • Retries: bounded with jitter on transient store/timeouts.
  • Circuit breaker to store; use cached/stale decisions per fallback rules above.
  • Idempotency: not needed for evaluate; events (policy.changed) are idempotent (ULID eventId + inbox receipts).

Observability

  • Traces: policy.evaluate span with tenantId, edition, mode, policyVersion, cache.hit=true|false.
  • Metrics: policy.decision.latency{mode}, policy.cache.hit_ratio, policy.deny.rate, policy.changed.count.
  • Logs: structured; ruleIds referenced; no PII; include reason code and traceId.

SLOs (initial)

  • Decision latency p95 ≤ Q ms (append/read).
  • Cache hit ratio ≥ H% (target set per env).
  • Deny error rate within expected bounds (watch for spikes indicating mis-policy).

Risks

  • Compliance drift — stale or incorrect rules → mitigate with policies-as-code, reviews, and test packs; emit policy.changed with approvals.
  • Cache staleness — high TTL leading to outdated decisions → keep TTL small; rely on events for invalidation.
  • Ambiguous redaction — unclear directives → use constrained enum (MASK, HASH, OMIT, ALLOW_IF(...)).
  • Hot path latency — evaluate on append must be fast → precompile rules; keep dependency chain short; use SWR.

Customer–Supplier Relation (explicit)

  • Customers: Ingestion (append), Query (read), Export (package scope).
  • Supplier: Policy (this component) defines contract and versioning; consumers must treat decisions as authoritative and include policyVersion in writes/events/exports.


Audit.Integrity (Component Deep-Dive)

Responsibility

Provide tamper-evidence and cryptographic attestation for ATP:

  • Build per-tenant hash chains over appended records (linear chains, optional Merkle per checkpoint).
  • Issue signed checkpoints on defined windows (e.g., hourly/daily).
  • Produce and attach evidence to exports (signed manifests).
  • Expose verification APIs for records, ranges, and export packages.
  • Manage key usage & rotation via KMS/HSM with auditable provenance.

Out of scope: data classification/retention decisions, query/search, export packaging itself.


C4 L3 — Component Diagram

flowchart LR
  subgraph API[API Adapters]
    VREC[VerifyRecordController]
    VRNG[VerifyRangeController]
    VEXP[VerifyExportController]
  end

  subgraph APP[Application -> Use-cases]
    UPD[UpdateChainUseCase]
    CHKPT[CheckpointIssuer]
    VFY[VerificationUseCase]
    KEY[KeyProvider - KMS/HSM]
    LEDGER[EvidenceRepository Port]
    BUSPUB[EventPublisher Port]
  end

  subgraph INFRA[Infrastructure]
    EDB[(Evidence Ledger<br/>Chains/Checkpoints/Manifests)]
    BUS[(Event Bus)]
    KMS[(KMS/HSM)]
    OTL[(OTel/Logs/Metrics)]
  end

  API --> VFY
  BUS -->|sub: audit.appended| UPD
  UPD --> LEDGER
  UPD --> CHKPT
  CHKPT --> KEY
  KEY -.signs.-> CHKPT
  CHKPT --> LEDGER
  VFY --> LEDGER
  VFY --> KEY
  APP -.-> OTL
  BUSPUB -.-> BUS

Event edges: Consumes audit.appended (or audit.accepted depending on topology). Publishes integrity.verified (optional) and emits telemetry; Export reads proofs via API.


Interfaces

Inbound (REST)

  • POST /api/v{n}/integrity/verify/record
  • POST /api/v{n}/integrity/verify/range
  • POST /api/v{n}/integrity/verify/export

Inbound (Events)

  • audit.appended (preferred for prompt materialization) or audit.accepted (if integrity follows persistence)

Outbound

  • integrity.verified (optional attestation events) for downstream consumers (Export/UI).
  • Read-only artifact fetch for Export when assembling packages.

Headers / Context

  • x-tenant-id, traceparent, x-correlation-id, x-policy-version.
  • No PII in headers or integrity artifacts.

Contracts (Requests/Responses)

Verify Record

POST /api/v1/integrity/verify/record
{
  "tenantId": "t-acme",
  "recordId": "01JAXY...ULID"
}

Response

{
  "ok": true,
  "recordId": "01JAXY...ULID",
  "digest": "sha256:9c...ab",
  "chainProof": {
    "chainId": "t-acme:2025-10-22",
    "index": 85105,
    "prevHash": "sha256:aa...ef",
    "hash": "sha256:bb...42"
  },
  "checkpoint": {
    "id": "ckp-2025-10-22T12:00:00Z",
    "algo": "sha256",
    "keyId": "kms:ed25519:v4",
    "signature": "base64:MEYCIQ..."
  }
}

Verify Range

POST /api/v1/integrity/verify/range
{
  "tenantId": "t-acme",
  "chainId": "t-acme:2025-10-22",
  "fromIndex": 84000,
  "toIndex": 85123
}

Response

{
  "ok": true,
  "chainId": "t-acme:2025-10-22",
  "H_last": "sha256:ab..ef",
  "span": { "fromIndex": 84000, "toIndex": 85123 },
  "checkpoint": { "id": "ckp-2025-10-22T12:00:00Z", "signature": "base64:..." }
}

Verify Export

POST /api/v1/integrity/verify/export
{
  "exportId": "exp-01JAZ...",
  "tenantId": "t-acme"
}

Response

{
  "ok": true,
  "exportId": "exp-01JAZ...",
  "manifest": {
    "algo": "sha256",
    "keyId": "kms:ed25519:v4",
    "digest": "sha256:7f...c1",
    "signature": "base64:MEUCIQ..."
  },
  "chainRefs": [
    { "chainId": "t-acme:2025-10-22", "toIndex": 85123, "H_last": "sha256:ab..ef", "checkpointId": "ckp-2025-10-22T12:00:00Z" }
  ]
}

Event (optional) — integrity.verified

{
  "subject": "integrity.verified",
  "schemaVersion": "1.0.0",
  "tenantId": "t-acme",
  "target": "record|range|export",
  "targetId": "01JAXY...ULID",
  "status": "ok",
  "checkpointId": "ckp-2025-10-22T12:00:00Z"
}

Data

Evidence Ledger (authoritative)

  • Chain node: { chainId, chainIndex, prevHash, recordId, recordDigest, committedAt, policyVersion }
  • Checkpoint: { checkpointId, chainId, fromIndex, toIndex, H_last, algo, keyId, keyVersion, signature, issuedAt }
  • Manifest index (for exports): { exportId, manifestDigest, algo, keyId, signature, createdAt, chainRefs[] }

Partitioning

  • chainId = tenantId : time-slice (e.g., day).
  • Time + tenant partitions; optional shard by sourceId for very hot tenants (still per-tenant chains).

Algorithms & Keys

  • Hash: SHA-256 (default); support algo versioning.
  • Signatures: Ed25519 or ECDSA P-256.
  • Keys: KMS/HSM managed; keyId + version stamped into artifacts.
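The linear chain described above can be sketched with plain SHA-256 (signing via KMS is omitted; `genesis` and the concatenation scheme are illustrative choices, not the wire format):

```python
import hashlib

def link_hash(prev_hash: str, record_digest: str) -> str:
    """H_i = SHA-256(H_{i-1} || digest_i): each node commits to the whole prefix."""
    return "sha256:" + hashlib.sha256((prev_hash + record_digest).encode()).hexdigest()

def build_chain(record_digests: list[str], genesis: str = "sha256:0") -> list[dict]:
    """Materialize chain nodes; altering any record changes every later hash."""
    nodes, prev = [], genesis
    for i, digest in enumerate(record_digests):
        h = link_hash(prev, digest)
        nodes.append({"index": i, "prevHash": prev, "recordDigest": digest, "hash": h})
        prev = h
    return nodes

def verify_chain(nodes: list[dict]) -> bool:
    """Recompute every link; a single mismatch flags tampering or a gap."""
    return all(n["hash"] == link_hash(n["prevHash"], n["recordDigest"]) for n in nodes)
```

A checkpoint then signs only `H_last` for the window, which is what lets range verification avoid re-reading the records themselves.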

Rotation Policy

  • Rotate signing keys at planned cadence; dual-key windows allowed.
  • Introduce new algo/keyId only at checkpoint boundaries.
  • Publish public verification keys; maintain historic keys for legacy verifications.

Tenancy & Security

  • Chains are per-tenant; no cross-tenant links.
  • Ledger access is tenant-scoped (RLS/filters); verify APIs require tenant context.
  • KMS operations use workload identity with least privilege; audit all sign/verify calls.
  • No PII in digests, chain metadata, or manifests; only identifiers and hashes.

Resilience

  • Idempotent chain updates: recomputing H_i for a record must be a no-op if already present; conflict → alert.
  • Retries with jitter for KMS and ledger writes; circuit breaker on KMS slowness.
  • Checkpointing: timer or watermark-triggered; missed window → alert; safe to re-issue (idempotent by checkpointId).
  • DLQ for audit.appended processing; replay by tenant/time window; proofs recomputed deterministically.

Observability

  • Traces: chain.update, checkpoint.issue, verify.record|range|export; attrs: tenantId, chainId, algo, keyId, checkpointId.
  • Metrics: integrity.chain.queue_depth, checkpoint.delay.seconds, kms.latency, verify.latency{target}, signature.error.count.
  • Logs: structured; include recordId/exportId, chainId, checkpointId, error codes; no payload/PII.

SLOs (initial)

  • Verify latency p95:
    • record ≤ T_rec ms; range ≤ T_rng ms for N entries; export ≤ T_exp s for K items.
  • Checkpoint issuance delay ≤ D_ckp s from window end (p95).
  • KMS error rate ≤ E ppm; signature verification failures trigger page.

Risks

  • Chain gaps/reordering → detect via recompute & mismatch; quarantine slice; rebuild from last good checkpoint; issue advisory.
  • Key rotation mistakes → dual-key window, staged rollout, verification with previous public keys; rotation drills.
  • KMS latency/outage → breaker + queue updates; delay checkpoints (alert); do not block append path.
  • Clock skew → rely on committedAt from append store; include in proofs; monitor skew.


Audit.Projection (Component Deep-Dive)

Responsibility

Maintain read models and indexes optimized for query/search, driven by events from the append path. Track watermarks and lag, support rebuild/replay (per tenant/time-range), and publish projection.updated signals. Projections are rebuildable (not the source of truth) and must be idempotent.

Out of scope: append/write acceptance, export packaging, policy authoring.


C4 L3 — Component Diagram

flowchart LR
  subgraph EVT[Event Ingestion]
    SUB[Subscriber<br/>audit.accepted]
    INBOX[Inbox/Dedup]
  end

  subgraph APP[Projector Workers]
    PLAN[Projector Planner<br/>- tenant/partition routing]
    P1[Timeline Projector]
    P2[Facet/Aggregation Projector]
    P3[Lookup Projector]
    P4[Search Indexer]
    WM[Watermark Tracker]
  end

  subgraph DATA[Read Models - Warm]
    RM1[(Timeline Model)]
    RM2[(Facet/Aggregations)]
    RM3[(Lookup by Id/Correlation)]
    IDX[(Search Index)]
    META[(Watermarks/Offsets)]
  end

  subgraph BUS[Event Bus]
    TOP[(topics + dlq)]
  end

  subgraph OUT[Outbound]
    PUB[projection.updated]
  end

  TOP --> SUB --> INBOX --> PLAN
  PLAN --> P1 --> RM1
  PLAN --> P2 --> RM2
  PLAN --> P3 --> RM3
  P4 --> IDX
  P1 --> WM
  P2 --> WM
  P3 --> WM
  P4 --> WM
  WM --> META
  PUB -.-> TOP

  classDef b fill:#F4F7FF,stroke:#7aa6ff,rx:6,ry:6;
  class SUB,INBOX,PLAN,P1,P2,P3,P4,WM,PUB b;

Dependency edges: Consumes audit.accepted (authoritative write signal). Publishes projection.updated with tenant/watermark. Optional feed from policy.changed to trigger cache invalidation/recompute where needed.


Interfaces

Inbound (events)

  • audit.accepted (required)
  • policy.changed (optional; to recompute derived redaction-aware material)
  • export.requested (optional; to warm selection caches)

Control (admin)

  • POST /admin/projection/replay { tenantId, from, to, dryRun }
  • POST /admin/projection/rebuild { tenantId, model, from, to, concurrency }
  • GET /admin/projection/status — per-tenant/partition watermarks, lag, DLQ depth

Outbound (events)

  • projection.updated { tenantId, model, watermark, lagSeconds }

Headers/Context

  • Propagate: x-tenant-id, traceparent, x-correlation-id, x-policy-version (from producing event).

Data

Read Models (warm; rebuildable)

  • Timeline — keyed by (tenantId, occurredAt), with secondary keys actorId, resource, action; serves primary time-range queries.
  • Facet/Aggregations — keyed by (tenantId, window, dimension); counts/analytics for filters and summaries.
  • Lookup — keyed by (tenantId, recordId | sourceId+sequence | correlationId); fast direct access and joins across selections.
  • Search Index (optional) — tenant-sharded index over a subset of fields; full-text and fast filtering.

Offsets & Watermarks

  • Consumer offset per {subscription, partition} (driver-managed).
  • Watermark per {tenantId, model} = highest occurredAt (or event position) materialized.
  • Lag = now - watermark (or broker consumer lag); persisted in META.
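The watermark and lag rules above can be sketched as follows — a minimal in-memory tracker, assuming watermarks are keyed by (tenantId, model) and advance monotonically; names and storage are illustrative, not the service's actual implementation:

```python
from datetime import datetime, timezone

class WatermarkTracker:
    """Tracks the highest materialized position per (tenantId, model).

    Watermarks only move forward (monotonic per tenant/model); lag is
    derived as now - watermark and persisted to META in the real service.
    """

    def __init__(self):
        self._marks = {}  # (tenant_id, model) -> datetime

    def advance(self, tenant_id, model, occurred_at):
        key = (tenant_id, model)
        current = self._marks.get(key)
        # Never regress: replaying an older event must not move the watermark back.
        if current is None or occurred_at > current:
            self._marks[key] = occurred_at
        return self._marks[key]

    def lag_seconds(self, tenant_id, model, now=None):
        now = now or datetime.now(timezone.utc)
        return (now - self._marks[(tenant_id, model)]).total_seconds()
```

Replaying an out-of-order event leaves the watermark untouched, which is what makes the "100% monotonic" SLO checkable.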

Idempotency

  • Use Inbox receipts (eventId|idempotencyKey) and upsert-or-noop writes keyed by recordId.
  • Every projector must be replay-safe: applying the same event twice yields the same read model.
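A minimal sketch of the two idempotency mechanisms together — inbox receipts keyed by eventId plus upsert-or-noop keyed by recordId. The in-memory stores and result strings are assumptions for illustration; real receipts and read models live in durable stores:

```python
class Projector:
    """Replay-safe projection handler: inbox dedup + upsert-or-noop."""

    def __init__(self):
        self.inbox = set()      # receipts keyed by eventId (or idempotencyKey)
        self.read_model = {}    # rows keyed by recordId

    def handle(self, event):
        receipt = event["eventId"]
        if receipt in self.inbox:
            return "skipped"    # duplicate delivery of the same event
        row = {"occurredAt": event["occurredAt"], "action": event["action"]}
        # Upsert-or-noop: writing the same derived row twice changes nothing.
        result = "noop" if self.read_model.get(event["recordId"]) == row else "upsert"
        self.read_model[event["recordId"]] = row
        self.inbox.add(receipt)
        return result
```

Applying the same event twice yields the same read model, which is exactly the replay-safety property the bullet demands.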

Partitioning

  • Tenant-first partitioning; hot tenants optionally sharded by sourceId or time buckets.

Tenancy & Security

  • All read/writes scoped to tenantId. No cross-tenant joins.
  • Redaction/classification respected at projection time where necessary (or deferred to Query with policy checks).
  • Admin control endpoints guarded (SSO + IP allow-lists); all actions audited.

Contracts

Consumes

  • audit.accepted { recordId, tenantId, occurredAt, classification, policyVersion, ... }

Publishes

  • projection.updated { tenantId, model, watermark, countDelta, lagSeconds }

Control APIs (Admin)

  • Replay/Rebuild endpoints return Problem+JSON on validation or safety violations (e.g., cross-tenant windows).

Schema evolution: additive-first; projectors treat unknown fields as optional.


Sequence — Steady State & Replay

Steady State

sequenceDiagram
  autonumber
  participant BUS as Event Bus
  participant SUB as Subscriber
  participant P as Projector
  participant RM as Read Model
  participant WM as Watermarks

  BUS-->>SUB: audit.accepted (tenant, eventId, payload)
  SUB->>P: Handle(event)
  P->>RM: Upsert-or-noop(recordId, derived)
  P->>WM: Advance watermark (tenant, model)
  P-->>BUS: projection.updated (tenant, watermark)

Replay/Rebuild

sequenceDiagram
  autonumber
  participant ADM as Admin
  participant CTRL as Projection Control API
  participant JOB as Replay Job
  participant RM as Read Models
  participant WM as Watermarks

  ADM->>CTRL: POST /admin/projection/rebuild {tenant, from, to, dryRun}
  CTRL->>JOB: Start job (scoped)
  JOB->>RM: (dryRun?) compute diffs
  alt dryRun
    JOB-->>CTRL: Report would-change counts
  else execute
    JOB->>RM: Apply deterministic upserts
    JOB->>WM: Recompute watermarks
    JOB-->>CTRL: Success + metrics
  end

Resilience

  • Inbox de-dup + idempotent upserts guarantee replay safety.
  • Retries with decorrelated jitter on transient store errors; DLQ after N attempts (per subscription).
  • Autoscaling on lag (seconds) and queue depth; cap concurrency per tenant to prevent noisy-neighbor effects.
  • Backfills use separate worker pools and lower priority to protect steady-state SLIs.
  • Schema guards: unknown fields do not fail projection; breaking changes handled via dual projectors.
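The retry bullet above can be made concrete — a sketch of decorrelated jitter with dead-lettering after N attempts, assuming a generic transient-error type and an injectable sleeper; bounds and attempt counts are illustrative:

```python
import random

BASE, CAP = 0.1, 30.0  # seconds; illustrative bounds

class TransientStoreError(Exception):
    """Stand-in for a retryable read-model write failure."""

def decorrelated_jitter(previous):
    # Next sleep is random between BASE and 3x the previous sleep, capped.
    return min(CAP, random.uniform(BASE, previous * 3))

def handle_with_retry(apply, event, max_attempts=5, dlq=None, sleeper=None):
    """Retry a projector write on transient errors; dead-letter after N attempts."""
    sleep = BASE
    for attempt in range(1, max_attempts + 1):
        try:
            return apply(event)
        except TransientStoreError:
            if attempt == max_attempts:
                (dlq if dlq is not None else []).append(event)  # hand off for later replay
                return None
            sleep = decorrelated_jitter(sleep)
            if sleeper:
                sleeper(sleep)  # time.sleep(sleep) in production
```

Decorrelated jitter spreads retries from many workers so a recovering store is not hit by a synchronized thundering herd.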

Observability

  • Traces: projection.handle spans with tenantId, recordId, model, watermark.before/after.
  • Metrics (golden):
    • projector.lag.seconds{model} median/p95,
    • consumer.lag.messages,
    • projection.updated.count,
    • inbox.duplicate.rate,
    • dlq.depth and dlq.replay.rate.
  • Logs: structured; include tenantId, eventId, model, result (upsert|noop|skipped), reason codes.

SLOs (initial)

  • Projector lag: median ≤ 5 s, p95 ≤ 30 s.
  • DLQ rate: ≤ R ppm steady; zero after successful fix + replay.
  • Replay throughput: ≥ N records/min per worker (guide).
  • Watermark accuracy: 100% monotonic per tenant/model.

Risks

  • Hot partitions / bursty tenants → scale on lag; shard by tenantId:sourceId; apply per-tenant concurrency caps.
  • Idempotency bugs (duplicates or gaps) → enforce inbox receipts + upsert-or-noop; property tests with replays.
  • Schema drift → additive-first; dual-run projectors; DLQ & replay after fix.
  • Long rebuilds impact reads → separate pools, throttle, publish progress; coordinate with Query/Export windows.

Runbook Hooks

  • Commands: replay --tenant t-xxx --from 2025-10-01 --to 2025-10-02 --dry-run, rebuild --model timeline.
  • Safeties: dry-run diff %, concurrency caps, pause/resume, watermark guardrails.
  • Evidence: job logs, counts changed, final watermarks, DLQ clearance report.


Audit.Query (Component Deep-Dive)

Responsibility

Serve authorized, tenant-scoped reads over read models with policy-aware redaction and freshness hints. Provide selection capabilities that hand off to Export for packaging. Enforce per-tenant quotas, pagination, and time-range limits; surface watermarks to indicate projection freshness.

Out of scope: write/append, export packaging itself, policy authoring, full-text search (when deployed as a separate Search service).


C4 L3 — Component Diagram

flowchart LR
  subgraph API[API Adapters]
    QREST[QueryController<br/>REST/gRPC]
    QGQL[GraphQL Gateway]
  end

  subgraph APP[Application -> Use-cases]
    QExec[ExecuteQueryUseCase]
    QSel[BuildSelectionUseCase]
    QPol[PolicyClient - read/export]
    QRed[RedactionPlanner]
    QPg[Pagination/Cursor]
    RMPort[ReadModelRepository Port]
    Cache[Result Cache]
  end

  subgraph DATA[Read Models - Warm]
    TL[(Timeline Model)]
    FAC[(Facet/Aggregations)]
    LKP[(Lookup Model)]
    IDX[(Search Index - optional)]
    WM[(Watermarks/Metadata)]
  end

  subgraph OUT[Handoff]
    EXP[Export Service]
  end

  API --> QExec
  API --> QSel
  QExec --> RMPort
  QExec --> QPol
  QExec --> QRed
  QExec --> QPg
  QExec --> Cache
  RMPort -.-> TL & FAC & LKP & IDX & WM
  QSel --> EXP

  classDef b fill:#F4F7FF,stroke:#7aa6ff,rx:6,ry:6;
  class QREST,QGQL,QExec,QSel,QPol,QRed,QPg,RMPort,Cache b;

Dependency edges: synchronous to Read Models; synchronous to Policy for read/export decisions; synchronous handoff to Export to create packages.


Interfaces

Inbound

  • REST (primary)
    • GET /api/v{n}/query/timeline — time-range, filters (tenant-scoped)
    • GET /api/v{n}/query/lookup/{recordId}
    • GET /api/v{n}/query/facets — aggregations for filters
    • POST /api/v{n}/query/selection — create a selection token (for export)
  • gRPC (optional for high-throughput)
  • GraphQL (optional read-only schema for consoles)

Outbound

  • Policy: POST /policy/evaluate with mode=read|export to obtain redaction plan & edition gates.
  • Export: POST /export with { selectionToken | inline selectionSpec, format }.

Headers (ingress/egress)

  • Ingress: Authorization, Content-Type, tenant (prefer token claim).
  • Propagate: x-tenant-id, x-edition, traceparent, x-correlation-id.
  • Responses: X-Watermark, X-Lag, optional X-Redaction-Policy-Version.

Data

  • Authoritative: none (read-only).
  • Sources: Timeline, Facet/Aggregation, Lookup (and Search index if enabled).
  • Watermarks: track latest materialized position per tenant/model; returned on every response.
  • Caching: short TTL (e.g., 5–30s) for common queries; key includes tenant, filters, page cursor, policy version.
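A sketch of the cache-key rule in the last bullet, assuming filters are a JSON-serializable dict; field names and key layout are illustrative. Including tenant, normalized filters, cursor, and policy version means a policy change or another tenant can never hit a stale or cross-tenant entry:

```python
import hashlib
import json

def query_cache_key(tenant_id, filters, cursor, policy_version):
    """Build a result-cache key: tenant + canonical filters + page cursor + policy version."""
    # Canonicalize filters so logically equal queries share one entry.
    filters_hash = hashlib.sha256(
        json.dumps(filters, sort_keys=True, separators=(",", ":")).encode()
    ).hexdigest()[:16]
    return f"q:{tenant_id}:{filters_hash}:{cursor or '-'}:{policy_version}"
```

Bumping policyVersion naturally invalidates every cached result computed under the old redaction plan, with no explicit purge.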

Tenancy & Security

  • Tenant required; enforced at API and repository filters.
  • ABAC/RBAC: verify caller roles/scopes; edition gates (e.g., search, export).
  • Policy-aware: evaluate redaction plan per request; apply before returning data.
  • PII discipline: logs and headers contain no PII; redaction never relies on client-side behavior.
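The "apply before returning data" rule can be sketched as a server-side pass over each record. The rule names (MASK4, HASH, DROP) mirror the redactionHints in the example responses, but their exact semantics and the dotted-path addressing are assumptions for illustration:

```python
import hashlib
import json

def apply_redaction(record, plan):
    """Apply a typed redaction plan server-side, before the response leaves Query."""
    out = json.loads(json.dumps(record))  # deep copy; never mutate the source row
    for path, rule in plan.items():
        parts = path.split(".")
        parent = out
        for p in parts[:-1]:
            parent = parent.get(p, {})
        leaf = parts[-1]
        if leaf not in parent:
            continue
        value = str(parent[leaf])
        if rule == "MASK4":
            # Assumed masking shape: keep a short prefix and the final character.
            parent[leaf] = (value[:2] + "***" + value[-1]) if len(value) > 3 else "***"
        elif rule == "HASH":
            parent[leaf] = "hash:" + hashlib.sha256(value.encode()).hexdigest()[:8]
        elif rule == "DROP":
            del parent[leaf]
    return out
```

Because redaction runs on the server against a typed plan, it never relies on client-side behavior, matching the PII bullet above.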

Contracts

Example — Timeline Query

GET /api/v1/query/timeline?from=2025-10-21T00:00:00Z&to=2025-10-22T00:00:00Z&resource=Order/4711&page.size=100
Authorization: Bearer <JWT>

Response

{
  "results": [
    {
      "recordId": "01JAXY...",
      "occurredAt": "2025-10-21T12:32:10Z",
      "actorId": "u-***3",             // masked per policy
      "action": "UPDATE",
      "resource": "Order/4711",
      "attributes": { "email": "hash:ab12...", "status": "Shipped" }
    }
  ],
  "page": { "next": "eyJjdXJzb3IiOi..." },
  "redactionHints": {
    "actorId": "MASK4",
    "attributes.email": "HASH"
  },
  "watermark": "2025-10-22T00:00:00Z"
}

Response headers: X-Watermark: 2025-10-22T00:00:00Z · X-Lag: 8s · X-RateLimit-Remaining: 92

Example — Build Selection & Handoff to Export

POST /api/v1/query/selection
{
  "timeRange": { "from": "2025-10-01T00:00:00Z", "to": "2025-10-07T00:00:00Z" },
  "filters": { "resource": "Order/*", "action": ["CREATE","UPDATE"] }
}

Response

{ "selectionToken": "sel-01JB2...", "estimatedCount": 12840, "expiresAt": "2025-10-22T13:00:00Z" }

Now handoff:

POST /api/v1/export
{ "selectionToken": "sel-01JB2...", "format": "jsonl" }

Error model: Problem+JSON (400 invalid filters, 403 authorization, 429 quota, 503 transient).
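The page.next tokens in the examples ("eyJjdXJzb3Ii...") suggest base64url-encoded JSON cursors. A minimal sketch, assuming the cursor embeds the tenant so the server can reject a cursor replayed against another tenant (production cursors should additionally be signed or encrypted):

```python
import base64
import json

def encode_cursor(state):
    """Opaque page cursor: base64url-encoded canonical JSON."""
    raw = json.dumps(state, sort_keys=True).encode()
    return base64.urlsafe_b64encode(raw).decode().rstrip("=")

def decode_cursor(token, expected_tenant):
    """Decode a cursor and enforce tenant binding before using it."""
    pad = "=" * (-len(token) % 4)
    state = json.loads(base64.urlsafe_b64decode(token + pad))
    if state.get("tenantId") != expected_tenant:
        raise PermissionError("cursor does not belong to this tenant")
    return state
```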


Sequence — Read & Handoff

sequenceDiagram
  autonumber
  participant C as Client
  participant GW as Gateway
  participant Q as Query
  participant POL as Policy
  participant RM as Read Models
  participant EXP as Export

  C->>GW: GET /query/timeline (JWT, tenant)
  GW->>Q: QueryRequest
  Q->>RM: Fetch slice (tenant, filters, page)
  Q->>POL: Evaluate(mode=read, fields=...)
  POL-->>Q: Redaction plan (policyVersion)
  Q-->>GW: 200 OK { results (redacted), page, watermark }
  C->>GW: POST /query/selection { filters, timeRange }
  GW->>Q: BuildSelection
  Q-->>GW: { selectionToken, estimatedCount }
  C->>GW: POST /export { selectionToken }
  GW->>EXP: CreateExport(selectionToken)
  EXP-->>C: 202 Accepted { exportId }

Resilience

  • Freshness: serve with last known watermark; include X-Lag. If lag > budget, return warning in body and headers.
  • Back-pressure: rate limits + page size caps; protect against read amplification.
  • Timeouts: short, derived from SLO (p95 × 1.2); cancel upstream calls on client abort.
  • Cache: negative/empty-result caching with tiny TTLs for hot miss patterns.
  • Bulkheads: isolate selection/handoff path from standard queries; protect Query when Export is busy.

Observability

  • Traces: query.execute and query.selection spans; attrs: tenantId, route, filters.hash, policyVersion, watermark, lag.
  • Metrics:
    • query.latency.p95|p99{route}
    • query.cache.hit_ratio
    • query.watermark.lag.seconds
    • query.redacted.field.count
    • selection.estimated.count, selection.build.latency
  • Logs: structured; include traceId, tenantId, cursor, result.count; no PII.

SLOs (initial)

  • Latency: p95 ≤ Y ms, p99 ≤ Z ms at baseline RPS.
  • Availability: ≥ 99.95% (excl. client 4xx).
  • Watermark lag: median ≤ 5 s, p95 ≤ 30 s.
  • Cache hit: ≥ H% on common filters.

Risks

  • Noisy neighbor / read amplification → per-tenant quotas, page caps, cache protection, limit “select *” patterns.
  • Stale projections → surface X-Lag, autoscale projectors, allow client to gate actions based on freshness.
  • Redaction errors → use typed redaction plans; unit tests for masking/hashing; fail-closed if policy unavailable.
  • Selection abuse (massive exports) → estimate counts, throttle selection building, edition/role gates, export concurrency caps.


Audit.Search (Component Deep-Dive)

Responsibility

Provide fast, flexible discovery over audit data via full-text, filters, facets, and type-ahead suggestions. Search is optional (edition-gated) and rebuildable from read models. It must respect classification (no indexing of disallowed fields) and tenant isolation.

Out of scope: authoritative storage, write/append, export packaging (handoff via Query).


C4 L3 — Component Diagram

flowchart LR
  subgraph EVT[Event Feed]
    PUPD[projection.updated]
  end

  subgraph ING[Indexing Pipeline]
    SUB[Subscriber]
    INBOX[Inbox/Dedup]
    BUF[Batcher/Backfill Planner]
    MAP[Mapper -> policy-aware field selection]
    IDXW[Index Writer]
    WM[Indexer Watermark]
  end

  subgraph DATA[Search Data]
    SIDX[(Tenant-partitioned Index)]
    META[(Index Metadata: analyzers, mappings, watermarks)]
  end

  subgraph API[Search APIs]
    SREST[SearchController - REST]
  end

  PUPD --> SUB --> INBOX --> BUF --> MAP --> IDXW --> SIDX
  IDXW --> WM --> META
  API --> SIDX

  classDef b fill:#F4F7FF,stroke:#7aa6ff,rx:6,ry:6;
  class SUB,INBOX,BUF,MAP,IDXW,WM,SREST b;

Dependency edges: Consumes projection.updated (and, if enabled, direct audit.accepted for hot paths). Serves REST search APIs to Gateway/clients.


Interfaces

Inbound (events)

  • projection.updated — drives incremental indexing and watermark advancement.
  • (Optional) audit.accepted — for ultra-low-latency tenants; still policy-aware mapping.

APIs (REST)

  • GET /api/v{n}/search?q=...&from=...&to=...&filters=...&page.size=... — search with filters.
  • GET /api/v{n}/search/facets?fields=action,resource,...&q=...&timeRange=... — facet counts.
  • GET /api/v{n}/search/suggest?q=Ord...&field=resource — type-ahead suggestions.
  • GET /api/v{n}/search/explain?id=... — diagnostics (edition: admin).

Headers

  • Ingress: Authorization, Content-Type.
  • Propagate: x-tenant-id, x-edition, traceparent, x-correlation-id.
  • Responses: X-Index-Watermark, X-Index-Lag.

Data

Indexing source: Read Models (warm) + event hints; never bypass classification. Authoritative: none (index is a derived, rebuildable view).

Partitioning

  • Preferred: index-per-tenant (small/med tenants grouped into shards via routing).
  • Alternative: shared index with tenantId as a routing key and RLS-like filters.

Fields & Mappings (overview)

  • tenantId (keyword, routing/shard key)
  • recordId (keyword, stored)
  • occurredAt (date)
  • action (keyword)
  • resource (text + keyword subfield for exact match)
  • actorId (keyword or masked token; respect classification)
  • attributes.* (controlled allow-list; use appropriate types or text with analyzers)
  • _policyVersion (integer)
  • _classification (keyword; drives filtering/redaction)

Analyzers

  • Standard lowercase/ascii-folding; keyword for identifiers; per-tenant locale stopwords (optional).

What we do not index

  • Fields marked SENSITIVE unless explicitly allowed by policy.
  • Large blobs/opaque payloads; store hashes only.

Tenancy & Security

  • All queries and index reads are tenant-scoped; no cross-tenant search.
  • Index shards are either per-tenant or routed by tenantId.
  • Edition gates: Search endpoints enabled only for tenants/editions that include Search.
  • PII discipline: mapping stage applies policy; sensitive fields are masked/hashed or excluded.

Contracts

Search (example)

GET /api/v1/search?q=status:Shipped AND resource:"Order/4711"&from=2025-10-20T00:00:00Z&to=2025-10-22T00:00:00Z&page.size=50
Authorization: Bearer <JWT>

Response

{
  "results": [
    {
      "recordId": "01JAXY...",
      "occurredAt": "2025-10-21T12:32:10Z",
      "action": "UPDATE",
      "resource": "Order/4711",
      "snippet": "status: Shipped, carrier: DHL",
      "highlights": { "attributes.status": ["<em>Shipped</em>"] }
    }
  ],
  "page": { "next": "eyJjdXJzb3IiOi..." },
  "watermark": "2025-10-22T00:00:00Z",
  "redactionHints": { "actorId": "MASK4" }
}

Facets (example)

GET /api/v1/search/facets?fields=action,resource&from=...&to=...&q=resource:Order/*

Response

{ "action": { "CREATE": 1203, "UPDATE": 8542 }, "resource": { "Order/*": 9745 } }

Suggest (example)

GET /api/v1/search/suggest?q=Ord&field=resource

Response

{ "suggestions": ["Order/4711", "Order/47*", "Order/Tracking"] }

Error model: Problem+JSON — 400 (bad query/filter), 403 (authz/edition), 429 (quota), 503 (transient).


Indexing Pipeline

  1. Subscribe to projection.updated (or audit.accepted for hot tenants).
  2. Inbox/Dedup on (eventId|idempotencyKey) to avoid duplicate writes.
  3. Batch/Backfill: coalesce small updates; windows sized by throughput and SLA.
  4. Map projection → index document with policy-aware field selection (classification-aware).
  5. Write with upsert-or-noop semantics keyed by recordId.
  6. Advance watermark per tenant; emit metrics.
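Step 4 (policy-aware mapping) is the stage that keeps disallowed fields out of the index. A sketch under stated assumptions — the allow-list, the SENSITIVE marker, and the payload-digest convention are illustrative, following the "What we do not index" rules above:

```python
import hashlib

ALLOW_LIST = {"recordId", "occurredAt", "action", "resource"}  # illustrative

def to_index_doc(record, classification):
    """Map a projection record to a search document, applying classification."""
    doc = {k: record[k] for k in ALLOW_LIST if k in record}
    doc["tenantId"] = record["tenantId"]  # routing/shard key
    # Controlled attribute allow-list: drop anything classified SENSITIVE.
    doc["attributes"] = {
        k: v for k, v in record.get("attributes", {}).items()
        if classification.get(f"attributes.{k}") != "SENSITIVE"
    }
    # Large/opaque payloads are never indexed; store a digest only.
    blob = record.get("payload")
    if blob is not None:
        doc["payloadDigest"] = "sha256:" + hashlib.sha256(str(blob).encode()).hexdigest()
    return doc
```

Because exclusion happens at mapping time, a later query can never surface a field that policy forbade from being indexed.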

Rebuild strategy

  • From read models: scan + map + bulk index per tenant/window.
  • Watermark and lag tracked separately from projection; publish X-Index-Watermark to clients.
  • Run in separate pools to protect query/search SLOs.

Resilience

  • Retries with jitter on index write failures; DLQ for poison docs; replay by tenant/window.
  • Circuit breakers for index cluster; serve degraded from cached results when possible.
  • Back-pressure: cap indexing concurrency per tenant; adaptive batch sizes.
  • Consistency: search is eventually consistent; clients rely on watermark/lag headers.

Observability

  • Traces: search.query, search.facets, search.suggest, index.write; attrs: tenantId, route, filters.hash, watermark, lag.
  • Metrics:
    • search.latency.p95|p99,
    • search.error.rate,
    • indexer.queue.depth,
    • index.refresh.age.seconds,
    • indexer.lag.seconds,
    • dlq.depth (indexer).
  • Logs: structured; include queryId, tenantId, cardinality.hint, took.ms; never log raw PII or full payloads.

SLOs (initial)

  • Search latency: p95 ≤ Y ms, p99 ≤ Z ms.
  • Indexer lag: median ≤ 5 s, p95 ≤ 30 s relative to projection watermark.
  • Availability: ≥ 99.9% (search API).
  • Accuracy: zero cross-tenant hits (guardrail).

Risks

  • Cost creep (index size/IO) → strict field allow-list, compression, TTL for ephemeral indices, shard sizing, FinOps review.
  • Index drift vs read models → watermark monitoring, rebuild tools, contract tests on mappings.
  • PII leakage → policy-aware mapper, unit tests for excluded fields, masked tokens only.
  • Hot tenants causing shard hotspots → per-tenant shards or routed partitions; throttle and rebalance.

Optionality & Edition Behavior

  • Feature flag + edition gate: search.enabled per tenant.
  • Tenants without Search still operate via Query; links and UI hide Search features.
  • Upgrading edition triggers background backfill and progressive enablement.


Audit.Export (Component Deep-Dive)

Responsibility

Create immutable, verifiable export packages from authorized selections. Stream records from Read Models (via Query), build signed manifests with Integrity references, store artifacts in Cold storage, and provide resumable downloads and optional webhooks. Enforce per-tenant concurrency and legal-hold/residency constraints.

Out of scope: read model computation (Projection), search/indexing, policy authoring.


C4 L3 — Component Diagram

flowchart LR
  subgraph API[API Adapters]
    EXCTL[ExportController<br/> REST]
    WHCTL[WebhookController<br/>optional]
  end

  subgraph APP[Application -> Use-cases]
    PLAN[PlanSelectionUseCase]
    PKG[PackageBuilder]
    MFS[ManifestSigner - Integrity Client]
    QSTR[QueryStreamer]
    RES[ResumptionManager]
    QOS[Quota/Throttling]
    DLR[DeliveryManager]
    PUB[EventPublisher]
  end

  subgraph DATA[Storage]
    COLD[(Cold Store<br/>immutable objects)]
    META[(Export Metadata<br/>jobs, parts, status)]
  end

  subgraph EXT[Dependencies]
    QRY[Query Service]
    INT[Integrity Service]
    BUS[(Event Bus)]
  end

  API --> PLAN
  PLAN --> QOS
  PLAN --> QSTR
  QSTR --> PKG
  PKG --> MFS
  MFS --> INT
  PKG --> COLD
  PKG --> META
  DLR --> META
  DLR --> COLD
  PUB -.-> BUS

  classDef b fill:#F4F7FF,stroke:#7aa6ff,rx:6,ry:6;
  class EXCTL,WHCTL,PLAN,PKG,MFS,QSTR,RES,QOS,DLR,PUB b;

Dependency edges: synchronous to Query for streaming selections; synchronous to Integrity for manifest signing; writes to Cold Store; publishes export lifecycle events.


Interfaces

Inbound (REST)

  • POST /api/v{n}/export — create export from { selectionToken | inline selectionSpec, format }
  • GET /api/v{n}/export/{exportId} — poll status; returns 303 → signed URL when ready
  • GET /api/v{n}/export/{exportId}/parts/{n} — (optional) list parts for segmented downloads
  • DELETE /api/v{n}/export/{exportId} — cancel pending export (if not sealed)

Outbound

  • Query: stream records by selection (paged, resumable cursor)
  • Integrity: request manifest digest/signature and chain references
  • Events (publish):
    • export.requested { exportId, tenantId, selectionMeta, format }
    • export.completed { exportId, tenantId, manifestDigest, url, bytes, parts }
    • export.failed { exportId, tenantId, code, detail }
  • Webhooks (optional): POST {tenantEndpoint} /on-export-completed with signed payload

Headers

  • Ingress: Authorization, Content-Type.
  • Propagate: x-tenant-id, x-edition, traceparent, x-correlation-id.
  • Responses: Retry-After (if queued), X-Export-ETA-Hint (best-effort), Content-Range on downloads.

Data

Artifacts in Cold (immutable)

  • Package: *.jsonl (default), *.parquet (optional), *.csv (narrow shape).
  • Manifest (JSON): { packageId, tenantId, createdAt, algo, keyId, items[], chainRefs[], signature }.
  • Parts (for large outputs): package.part-000N with rolling checksums.
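The part naming and rolling-checksum idea can be sketched as below — an in-memory splitter, assuming jsonl part bodies and a rolling sha256 over per-part digests so a resumed job can verify everything written so far; the layout is illustrative:

```python
import hashlib

def write_parts(records, part_size):
    """Split an export stream into package.part-000N entries with rolling checksums."""
    parts, rolling = [], hashlib.sha256()
    for i in range(0, len(records), part_size):
        body = "\n".join(records[i:i + part_size]).encode()  # jsonl part body
        digest = hashlib.sha256(body).hexdigest()
        rolling.update(digest.encode())  # chain part digests for resume verification
        parts.append({"name": f"package.part-{len(parts):04d}",
                      "bytes": len(body), "sha256": digest})
    return parts, rolling.hexdigest()
```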

Metadata (Export Catalog)

  • Job state: queued | running | sealing | completed | failed | canceled
  • Progress counters: selected, emitted, bytesWritten, parts
  • Resumption tokens & cursors for Query streaming
  • Legal/residency flags: region, legalHoldApplied, retentionClass

Retention

  • Packages inherit tenant retention for eDiscovery; can be pinned by legal hold (non-deletable) until released.

Tenancy & Security

  • Tenant-scoped: every export is bound to a single tenantId.
  • Edition gates: exports enabled by edition; concurrency caps per tenant.
  • AuthZ: role/scope check (e.g., role:Auditor); optional approval workflow for large selections.
  • PII discipline: relies on Query + Policy redaction; no extra PII added.
  • Delivery: signed, time-limited URLs; webhook payloads signed (HMAC or asymmetric), rotate secrets regularly.
  • Residency: artifacts stored in tenant’s home region only.
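The HMAC option for webhook signing can be sketched as follows; the header name is an assumption, and the receiver must compare with a constant-time check over the exact delivered bytes:

```python
import hashlib
import hmac
import json

def sign_webhook(payload, secret):
    """Sign an on-export-completed webhook body with HMAC-SHA256."""
    body = json.dumps(payload, separators=(",", ":"), sort_keys=True).encode()
    sig = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # X-ATP-Signature is an illustrative header name, not a fixed contract.
    return {"body": body, "headers": {"X-ATP-Signature": f"sha256={sig}"}}

def verify_webhook(body, header, secret):
    """Receiver side: recompute over the raw bytes; constant-time comparison."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(f"sha256={expected}", header)
```

Rotating secrets (as the bullet requires) typically means the receiver accepts signatures under both the current and previous secret during the rotation window.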

Contracts

Create Export (selection token)

POST /api/v1/export
{
  "selectionToken": "sel-01JB2...",
  "format": "jsonl",
  "filename": "acme-order-2025-10-week1"
}

Response

{
  "exportId": "exp-01JB3...",
  "status": "queued",
  "ttfbHintSeconds": 8,
  "rateLimited": false
}

Poll / Download

GET /api/v1/export/exp-01JB3...

Response (ready)

303 See Other
Location: https://cold.example/tenant/t-acme/exp-01JB3.../download?... (signed)

Manifest (sealed)

{
  "schemaVersion": "1.0.0",
  "packageId": "exp-01JB3...",
  "tenantId": "t-acme",
  "createdAt": "2025-10-22T12:04:55Z",
  "format": "jsonl",
  "algo": "sha256",
  "keyId": "kms:key/ed25519:v4",
  "items": [
    { "recordId": "01H...", "digest": "sha256:...", "occurredAt": "2025-10-21T10:12:00Z" }
  ],
  "chainRefs": [
    { "chainId": "t-acme:2025-10-22", "fromIndex": 84000, "toIndex": 85123, "H_last": "sha256:...", "checkpointId": "ckp-2025-10-22T12:00:00Z" }
  ],
  "signature": "base64:..."
}

Events

// export.requested
{ "subject":"export.requested","tenantId":"t-acme","exportId":"exp-01JB3...","format":"jsonl","selection":{"estCount":12840} }

// export.completed
{ "subject":"export.completed","tenantId":"t-acme","exportId":"exp-01JB3...","manifestDigest":"sha256:7f...c1","bytes":734003200,"parts":7 }

Error model: Problem+JSON — 400 invalid selection, 403 authorization/edition, 409 canceled or conflict, 429 quota/concurrency, 503 transient.


Sequence — Plan → Stream → Seal → Deliver

sequenceDiagram
  autonumber
  participant C as Client
  participant GW as Gateway
  participant EXP as Export
  participant Q as Query
  participant INT as Integrity
  participant COLD as Cold Store
  participant WH as Webhook (opt)

  C->>GW: POST /export { selectionToken, format }
  GW->>EXP: CreateExport
  EXP-->>GW: 202 Accepted { exportId, pollUrl }
  EXP->>Q: Stream(selection) [paged, resumable]
  loop pages
    Q-->>EXP: batch(records, cursor)
    EXP->>EXP: write to temp parts
  end
  EXP->>INT: Seal(manifest request with digests)
  INT-->>EXP: { manifest.signature, chainRefs }
  EXP->>COLD: Put package + manifest (immutable)
  C->>GW: GET /export/{id}
  GW-->>C: 303 → signed URL (download - supports Range)
  EXP-->>WH: POST /on-export-completed (optional)

Throttling & Resumability

  • Per-tenant concurrency caps; queued requests get Retry-After.
  • If streaming fails mid-way, resume from last cursor/part; downloads support HTTP Range.
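Range support on downloads can be sketched as a small resolver on the delivery path — a sketch covering the single-range forms resumable clients actually send ('bytes=N-', 'bytes=N-M', 'bytes=-K'); multi-range requests are rejected for simplicity:

```python
def parse_range(header, total):
    """Resolve an HTTP Range header against a package of `total` bytes.

    Returns an inclusive (start, end) pair; raises ValueError for
    unsatisfiable ranges (maps to HTTP 416) or unsupported forms.
    """
    unit, _, spec = header.partition("=")
    if unit != "bytes" or "," in spec:
        raise ValueError("unsupported range")
    start_s, _, end_s = spec.partition("-")
    if start_s == "":                      # suffix form: last K bytes
        start = max(total - int(end_s), 0)
        end = total - 1
    else:
        start = int(start_s)
        end = int(end_s) if end_s else total - 1
    if start > end or start >= total:
        raise ValueError("range not satisfiable")
    return start, min(end, total - 1)
```

A client that stalls at byte 100 simply re-requests with `Range: bytes=100-` and receives a 206 with the matching `Content-Range`.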

Legal Hold & Residency

  • If a legal hold exists for any selected data, the package is created and pinned; deletion disabled until hold is cleared.
  • Artifacts never cross region; restore drills confirm region binding.

Resilience

  • Back-pressure: queue depth thresholds; temporarily throttle bulk exports to protect Query SLIs.
  • Retries with jitter for Query pages and Cold Store PUTs; resume on network errors via part checkpoints.
  • Integrity slowness: buffer digests; seal when signature arrives; do not publish completed status until manifest is signed.
  • DLQ: rare; used for webhook deliveries (retry with backoff, dead-letter after N failures).
  • Idempotency: exportId is unique; repeated create with same selection/filename returns the existing job (configurable).

Observability

  • Traces: export.create, export.stream.page, export.seal, export.deliver; attrs: tenantId, selection.size.est, format, bytes, parts, manifest.keyId.
  • Metrics:
    • export.queue.depth, export.concurrent.tenants,
    • export.ttfb.seconds.p95, export.completion.minutes.p95,
    • export.bytes.total, export.parts.count,
    • webhook.success.rate, webhook.retry.count.
  • Logs: structured; exportId, tenantId, cursor, part, retry, signedUrl.expiresAt; no payload data.

SLOs (initial)

  • TTFB p95 ≤ Z s for packages ≤ K MB.
  • Completion p95 ≤ M min for N records.
  • Download availability ≥ 99.9%; signed URL validity 24–72 h (configurable).
  • Zero cross-region artifacts (guardrail; alert on violation).

Risks

  • Read SLO impact (export vs query) → enforce per-tenant concurrency; isolate worker pools; throttle selection streaming.
  • Large-package failures → part-based writes, resume on cursor; expose partial status to clients.
  • Manifest/seal errors → block completion until Integrity signs; alert on delays; retry with backoff.
  • Compliance leakage (wrong region or hold) → residency map enforcement; legal-hold pinning; periodic audits.


Audit.Admin / Control Plane (Component Deep-Dive)

Responsibility

Operate the control plane for ATP. Provides tenant onboarding & lifecycle, policy/schema/feature-flag administration, DLQ & replay tooling, projection rebuild controls, and compliance operations (legal holds, residency maps, key rotation orchestration). Ensures every change is audited, approved, and traceable to ADRs/epics.

Out of scope: data-plane append/query/export paths; business UIs for end users.


C4 L3 — Component Diagram

flowchart LR
  subgraph API[Admin APIs - SSO + IP allow-list]
    AADM[AdminController<br/>REST/gRPC]
    AUI[Admin UI / CLI]
  end

  subgraph APP[Application -> Use-cases]
    TEN[TenantLifecycle]
    SCH[SchemaRegistryProxy]
    FLG[FeatureFlagService]
    RPY[DLQ/ReplayOrchestrator]
    PRJ[ProjectionControl]
    LHC[LegalHoldController]
    RES[ResidencyMapManager]
    KEY[KeyRotationPlanner]
    AUD[AdminAuditLog]
    GOV[ADR/Governance Links]
  end

  subgraph INFRA[Infra/Deps]
    CTRLDB[(Control Metadata Store)]
    BUS[(Event Bus)]
    DQ[(DLQ Endpoints)]
    PROJ[(Projection Service)]
    POL[(Policy Service)]
    INT[(Integrity Service)]
    QRY[(Query Service)]
    SEC[(Secrets/KMS)]
  end

  API --> TEN & SCH & FLG & RPY & PRJ & LHC & RES & KEY & GOV
  TEN & SCH & FLG & RPY & PRJ & LHC & RES & KEY --> CTRLDB
  RPY --> DQ
  PRJ --> PROJ
  SCH --> POL
  KEY --> INT & SEC
  APP --> AUD --> CTRLDB

Edges: Writes to Control Metadata; calls Projection/Policy/Integrity admin surfaces; interacts with DLQs and Event Bus for replay; never bypasses data-plane tenancy controls.


Interfaces

Admin (internal / privileged)

  • Protected by SSO, MFA, role-based scopes, and IP allow-lists.
  • All mutating operations require approval workflow (2-person rule) and are audited.

Tenant Admin (scoped to their tenant)

  • Limited set: view quotas/usage, request export/policy changes, view watermarks/health, request replays (requires approval).

Selected REST APIs

  • Tenant lifecycle
    • POST /admin/tenants — onboard (id, edition, region, quotas)
    • PATCH /admin/tenants/{id} — update edition/quotas/residency
    • GET /admin/tenants/{id} — status, usage, watermarks
  • Schema/Policy
    • GET /admin/schemas — list subjects/versions (proxy to registry)
    • POST /admin/policies/apply — propose PolicySet change (requires approval → emits policy.changed)
  • DLQ/Replay
    • GET /admin/dlq/subscriptions — list DLQs + depth
    • POST /admin/dlq/{sub}/replay { tenantId, from, to, dryRun, concurrency }
    • POST /admin/dlq/{sub}/purge — guarded (rare; approval + evidence)
  • Projection Control
    • POST /admin/projection/rebuild { tenantId, model, from, to, dryRun }
    • GET /admin/projection/status?tenant=... — watermarks/lag
  • Legal Holds & Residency
    • POST /admin/legal-holds { tenantId, reason, scope, expiresAt? }
    • DELETE /admin/legal-holds/{id} — release
    • PUT /admin/residency-map/{tenantId} — set region binding
  • Keys & Integrity
    • POST /admin/keys/rotate — plan/execute rotation (checkpoint boundary)
    • GET /admin/integrity/checkpoints?tenant=... — inspect windows
  • Feature Flags / Quotas
    • PATCH /admin/flags — toggles (edition/tenant-scoped)
    • PATCH /admin/quotas — RPS/export/storage caps
  • Break-glass
    • POST /admin/breakglass/enable — time-boxed elevation
    • POST /admin/breakglass/disable

Headers

  • Authorization (SSO), x-admin-role, x-approval-ticket, traceparent. Responses use Problem+JSON.

Data

Control Metadata Store (authoritative for admin state)

  • Tenant { id, edition, region, quotas, createdAt, status }
  • LegalHold { id, tenantId, scope, createdAt, releasedAt }
  • ResidencyMap { tenantId, region, effectiveAt }
  • ReplayJob { id, tenantId, sub, from, to, dryRun, status, metrics }
  • ProjectionJob { id, tenantId, model, window, status, metrics }
  • AdminAudit { id, actor, action, target, changeSet, approvals[], traceId, adrRef? }
  • FeatureFlag { key, value, tenantScope?, editionScope? }

Not stored: raw audit events/records; those remain in Hot/Warm/Cold data planes.


Tenancy & Security

  • Two surfaces:
    • Platform Admin (internal): cross-tenant visibility; strongest controls.
    • Tenant Admin (scoped): only their tenantId.
  • Break-glass: time-limited elevation (e.g., 30–60 min), session recording, dual approval, post-action review.
  • PII discipline: Admin never exposes raw PII; views show counts/metadata, not payload values.
  • Every action produces an AdminAudit record; certain actions require ADR reference (adrRef in payload) or change ticket link.

Contracts

Tenant Onboard (example)

POST /api/v1/admin/tenants
{
  "tenantId": "t-acme",
  "edition": "enterprise",
  "region": "eu-central-1",
  "quotas": { "appendRps": 200, "exportConcurrent": 2, "storageGB": 500 }
}

Response

{ "tenantId": "t-acme", "status": "ready", "links": { "policy": "/admin/policies?t=t-acme" } }

Replay DLQ (dry run)

POST /api/v1/admin/dlq/audit.accepted/replay
{
  "tenantId": "t-acme",
  "from": "2025-10-21T00:00:00Z",
  "to": "2025-10-21T06:00:00Z",
  "dryRun": true,
  "concurrency": 4
}

Response

{
  "jobId": "rep-01JB7...",
  "dryRun": true,
  "wouldProcess": 18432,
  "estimatedDurationSec": 320
}

Problem+JSON (guardrail breach)

{
  "type": "https://errors.atp.example/guardrail",
  "title": "Cross-tenant window not allowed",
  "status": 403,
  "detail": "Replay must target a single tenant and bounded time window",
  "code": "REPLAY_SCOPE_INVALID",
  "instance": "urn:trace:01J9..."
}

Resilience

  • Dry-run first for replay/rebuild/purge; enforce bounded windows and per-tenant scope.
  • Approval workflow: long-running or destructive operations require two approvals; execution tokens expire quickly.
  • Idempotent jobs: re-submitting same job spec returns existing job; jobs can resume after failure.
  • Rate/Concurrency caps for admin-triggered workloads; separate low-priority pools.
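The idempotent-jobs rule above can be sketched by keying jobs on a stable digest of their spec; a minimal sketch, with an in-memory dict standing in for the Control Metadata Store (names are illustrative):

```python
import hashlib
import json

# Hypothetical job registry; re-submitting the same spec returns the existing job.
_jobs: dict[str, dict] = {}

def job_key(spec: dict) -> str:
    """Stable digest of the canonicalized spec (key order must not matter)."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

def submit_job(spec: dict) -> dict:
    """Idempotent submit: an identical spec maps to the same jobId."""
    key = job_key(spec)
    if key not in _jobs:
        _jobs[key] = {"jobId": f"job-{key}", "spec": spec, "status": "pending"}
    return _jobs[key]
```

Canonicalizing with sorted keys ensures that two callers who serialize the same spec in different field orders still converge on one job.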

Observability

  • Traces: admin.tenants.create, admin.replay.start, admin.projection.rebuild, admin.keys.rotate; include actor, tenantId, jobId, adrRef?.
  • Metrics:
    • admin.action.success.rate,
    • admin.approval.latency.seconds,
    • replay.throughput.records_per_min,
    • dlq.depth and dlq.replay.rate,
    • projection.rebuild.duration.
  • Logs: structured AdminAudit entries; immutable and exportable for compliance.

SLOs (initial)

  • Admin API availability ≥ 99.9% (control plane).
  • Approval latency p95 ≤ 30 min (business-hour window).
  • Replay dry-run start ≤ 2 min; execution throughput ≥ N records/min (env-tuned).
  • Zero unauthorized cross-tenant operations (guardrail).

Risks

  • Misuse of power → dual-approval, break-glass TTL, session recording, immutable audit, periodic review.
  • Accidental cross-tenant impact → strict scoping, guardrails, simulation (dry-run), and Problem+JSON failures.
  • Policy/Schema regressions → ADRs + CI contract tests; staged rollout; policy.changed events.
  • Long rebuilds hurting SLOs → separate pools, throttling, off-peak scheduling, progress & cancel endpoints.

Runbook Hooks

  • Onboard tenant: POST /admin/tenants → verify residency/quotas → seed policies → smoke append/query.
  • Apply policy: PR to policy-as-code → ADR/link → POST /admin/policies/apply → watch policy.changed.
  • DLQ replay: dry-run → approval → execute with concurrency caps → verify DLQ zero → attach report to ticket.
  • Projection rebuild: narrow window → dedicated pool → monitor lag/watermarks → success artifacts stored.
  • Key rotation: plan window at checkpoint boundary → rotate → verify with both key sets → publish rotation bundle.


Cross-Cutting Modules: Outbox/Inbox & Idempotency

Responsibility

Provide shared reliability primitives so every service can retry safely, publish/consume exactly-once intent, and replay deterministically:

  • Transactional Outbox (producer side): persist domain changes and the to-be-published event in the same transaction; a relay worker publishes later, with retries and trace propagation.
  • Inbox / De-dup (consumer side): record processed eventId/idempotencyKey receipts to drop duplicates; handlers are idempotent (upsert-or-noop).
  • Idempotency Keys: standard structure for dedupe across boundaries (HTTP, events, webhooks).
  • Headers & Trace: standardized metadata on every message/request so retries are observable and correlatable.
  • DLQ + Replay: every subscription has a Dead Letter Queue with audited replay by tenant/time window.

Out of scope: business logic for individual services, policy decisions, or schema authoring.


Design Overview

flowchart LR
  subgraph Producer Service
    A[Use-case / Tx] -->|commit| OB[(Outbox Table)]
    OB --> RL[Relay Worker]
    RL -->|"publish, key = tenantId:sourceId"| BUS[(Event Bus + DLQ)]
  end

  subgraph Consumer Service
    BUS --> SUB[Subscriber]
    SUB --> IBX[(Inbox Receipts)]
    SUB --> H[Idempotent Handler]
    H --> STORE[(State / Read Model)]
  end

  classDef s fill:#F4F7FF,stroke:#7aa6ff,rx:6,ry:6;
  class OB,RL,SUB,IBX,H s;
  • Ordering: broker partitions by tenantId[:sourceId] to preserve per-partition order.
  • Exactly-once intent: duplicates may happen in transit, but effects are idempotent.

Standard Metadata & Headers

All HTTP requests and events carry the following:

| Name | Where | Purpose |
|---|---|---|
| X-Idempotency-Key | HTTP (append/webhook) | dedupe at the write boundary |
| x-tenant-id | HTTP + Events | tenancy scoping & partitioning |
| x-edition | HTTP + Events | edition gates & quotas |
| traceparent, tracestate | HTTP + Events | distributed tracing |
| x-correlation-id | HTTP + Events | end-to-end correlation across retries |
| x-policy-version | Events | decision provenance |
| x-schema-version | Events | evolve consumers safely |

Idempotency Key (canonical):

<tenantId>:<sourceId>:<sequence|hash>
  • Prefer monotonic sequence per tenantId:sourceId.
  • Fallback hash = stable content hash when no sequence is available.
  • Conflict rule: same key, different payload → 409 IDEMPOTENCY_MISMATCH.

Producer Side — Transactional Outbox

sequenceDiagram
  autonumber
  participant UC as Use-case
  participant DB as App DB / Append Store
  participant OB as Outbox (same tx)
  participant RL as Relay Worker
  participant BUS as Broker

  UC->>DB: Persist state (record/aggregate)
  UC->>OB: Persist event (headers, idemKey) in same tx
  DB-->>UC: Commit
  RL->>OB: Poll due events (FIFO per partition)
  RL->>BUS: Publish (idempotent - with key & headers)
  RL->>OB: Mark delivered (deliverOnce via token)

Properties

  • Atomicity: state + event committed together (no dual-write loss).
  • Retry: relay uses decorrelated jitter with caps; back-pressure on persistent failures.
  • Stop-the-world safe: if relay is down, events persist and publish later—no data loss.

Outbox schema (minimal)

  • eventId (ULID), subject, tenantId, key, headers{}, payload, createdAt, attempts, status.
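The same-transaction property can be sketched with SQLite; a minimal sketch in which table and column names mirror the minimal schema above but are otherwise illustrative:

```python
import json
import sqlite3
import uuid

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE orders (id TEXT PRIMARY KEY, tenant_id TEXT, total INTEGER);
CREATE TABLE outbox (event_id TEXT PRIMARY KEY, subject TEXT, tenant_id TEXT,
                     payload TEXT, status TEXT DEFAULT 'pending',
                     attempts INTEGER DEFAULT 0);
""")

def place_order(tenant_id: str, order_id: str, total: int) -> None:
    """State change and to-be-published event commit in the SAME transaction."""
    with db:  # sqlite3 connection as context manager = one commit/rollback unit
        db.execute("INSERT INTO orders VALUES (?, ?, ?)",
                   (order_id, tenant_id, total))
        db.execute("INSERT INTO outbox (event_id, subject, tenant_id, payload) "
                   "VALUES (?, ?, ?, ?)",
                   (str(uuid.uuid4()), "audit.appended", tenant_id,
                    json.dumps({"orderId": order_id, "total": total})))

def relay_due_events(limit: int = 100) -> list:
    """Relay worker: poll pending events FIFO, publish, mark delivered."""
    rows = db.execute("SELECT event_id, subject, payload FROM outbox "
                      "WHERE status = 'pending' ORDER BY rowid LIMIT ?",
                      (limit,)).fetchall()
    for event_id, _subject, _payload in rows:
        # A real relay would publish to the broker here; on success:
        db.execute("UPDATE outbox SET status = 'delivered' WHERE event_id = ?",
                   (event_id,))
    db.commit()
    return rows
```

If the relay is down, rows simply accumulate as `pending` and are published on the next poll, which is the "stop-the-world safe" property described above.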

Consumer Side — Inbox & Idempotent Handlers

sequenceDiagram
  autonumber
  participant SUB as Subscriber
  participant IBX as Inbox
  participant H as Handler (Idempotent)
  participant STO as Store/Read Model

  SUB->>IBX: Check(eventId | idemKey)
  alt seen
    SUB-->>SUB: Drop (ack)
  else new
    H->>STO: Upsert-or-noop(recordId, derived fields)
    IBX->>IBX: Record receipt (TTL M days)
  end

Inbox schema (minimal)

  • receiptId (ULID), tenantId, eventId|idempotencyKey, hash, firstSeenAt, expiresAt.

TTL

  • Keep receipts 7–30 days (config); long enough to cover broker at-least-once duplicates and operator replays.

DLQ & Replay (Safety Rails)

  • DLQ per subscription; metadata includes error code, attempt count, schema version, payload hash, headers.
  • Replay:
    • Scoped by tenant and bounded time window.
    • Dry-run mode reports would-be successes/failures.
    • Handlers must be replay-safe (idempotent upserts; no non-idempotent side-effects).

Partitioning & Keys

  • Partition key for publish = tenantId[:sourceId] to keep order for logically related events.
  • Message key (broker) aligns with partition key; consumers use keys for sharding work.
  • Watermarks: consumers update “last-applied” per tenant/model to expose freshness.

Failure Modes & Handling

| Scenario | Detection | Handling |
|---|---|---|
| Outbox relay cannot reach broker | publish timeout | retry w/ jitter; circuit breaker; alert on age/attempts |
| Duplicate publish (network retry) | consumer sees same eventId | inbox drop; no-op handler |
| Poison message (schema/version error) | N retries fail | route to DLQ; operator fixes schema; replay |
| Ordering break (cross-partition) | out-of-order derived state | use per-tenant/source partitioning; reconcile on replay |
| Hot partition saturation | rising lag | autoscale consumer; per-tenant concurrency caps; 429 at gateway for append |

Resilience Defaults (starters)

  • Retries: 3–5 attempts with decorrelated jitter, cap 10s; exponential backoff multiplier 1.5–2.0.
  • Timeouts: based on SLO; use p95 × 1.2 rule-of-thumb for downstream calls.
  • Circuit breakers: open on dependency failures; short half-open probes.
  • Bulkheads: separate worker pools for steady-state vs. replays/backfills.

Observability

  • Traces: link request → outbox.persist → relay.publish → consumer.handle → inbox.record. Span attrs: tenantId, subject, partition, attempt, idempotent=true|false, dlq=true|false.
  • Metrics:
    • outbox.relay.age.p95|p99,
    • outbox.pending.count,
    • consumer.lag.seconds,
    • inbox.duplicate.rate,
    • dlq.depth, dlq.replay.rate,
    • retry.count, breaker.open.count.
  • Logs: structured with eventId, idempotencyKey, tenantId, subject, schemaVersion; no payload/PII.

SLOs (cross-cutting)

  • Outbox age p99 ≤ N s under steady load.
  • Consumer lag p95 ≤ M s by partition.
  • Inbox duplicate drop accuracy = 100% within receipt TTL.
  • DLQ cleared within T hours after a schema/consumer fix (tracked per incident).

Contracts & Bus Ties

  • Subjects (examples): audit.appended, audit.accepted, policy.changed, projection.updated, export.requested, export.completed.
  • Message envelope (event bus):
{
  "subject": "audit.accepted",
  "eventId": "01JB9...",
  "schemaVersion": "1.0.0",
  "tenantId": "t-acme",
  "key": "t-acme:order-svc",
  "headers": {
    "x-tenant-id": "t-acme",
    "x-policy-version": "17",
    "traceparent": "00-...."
  },
  "payload": { /* domain fields */ }
}

Testable Controls

  • CI contract tests ensure headers present (x-tenant-id, traceparent, x-schema-version) and no PII in headers.
  • Unit/property tests: consumer handlers idempotent; repeating same event yields no delta.
  • Synthetic drills: force broker hiccups → verify outbox age alarms; inject poison → DLQ then replay green.
  • Lint rules: all new topics/handlers define partition key and DLQ binding.
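The header-presence/no-PII contract test can be sketched as a pure function over the envelope headers; a minimal sketch in which the PII denylist is an illustrative assumption:

```python
REQUIRED_HEADERS = {"x-tenant-id", "traceparent", "x-schema-version"}
# Illustrative heuristic: header names that suggest PII leaked into metadata.
PII_HINTS = {"email", "name", "ssn", "phone", "address"}

def check_envelope_headers(headers: dict) -> list[str]:
    """Return a list of contract violations; an empty list means the check passes."""
    problems = [f"missing header: {h}" for h in REQUIRED_HEADERS - headers.keys()]
    for key in headers:
        if any(hint in key.lower() for hint in PII_HINTS):
            problems.append(f"possible PII in header name: {key}")
    return problems
```

Run against every registered subject in CI so a new topic cannot merge without the standard headers.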

Risks

  • Duplicate side-effects if handlers are not idempotent → enforce upsert-or-noop, receipts, and property tests.
  • Lost events if outbox isn’t transactional → must be same transaction as state; admissions block on policy.
  • Cross-tenant leakage → partition & filter strictly on tenantId; never share inbox/outbox tables across tenants without scoping.
  • Replay hazards (oversized windows) → require dry-run + approvals; throttle concurrency.

Config (starter values)

  • OUTBOX_RELAY_MAX_ATTEMPTS=5
  • OUTBOX_RELAY_BASE_DELAY_MS=200 (jittered; cap 10s)
  • INBOX_RETENTION_DAYS=14
  • CONSUMER_MAX_CONCURRENCY_PER_TENANT=4
  • DLQ_RETENTION_DAYS=30


Storage Components per Service

This section lists the authoritative stores and major derived stores per component, with tenancy partitioning and RLS/filters guidance. Hot/Warm/Cold tiers follow the platform’s lifecycle rules.

Legend
Hot = append-only, short retention, write-optimized.
Warm = read models / indexes, rebuildable.
Cold = immutable artifacts for export/eDiscovery.
RLS = Row-Level Security (or equivalent tenant filter).

Store Catalog

Component / Context Store (type & example) Purpose (authoritative?) Tenancy Partitioning RLS / Filters Backup & Retention (summary)
Gateway — (stateless) n/a n/a n/a n/a
Ingestion Append Store (Hot) — transactional DB or log-structured store Authoritative for raw committed records Partition by tenantId + time bucket (YYYY-MM-DD); optional sourceId shard for hot tenants DB RLS: tenant_id = current_tenant(); app filter enforced at repo Point-in-time recovery; Hot→Warm tiering by policy; short online retention
Outbox Table (per-service) Durable event publication Partition by tenantId (or bus key) RLS by tenantId; relay worker service principal Same as Hot; purge delivered after TTL
Policy Policy Store (read-mostly KV/DB) Authoritative for policy sets & versions Partition by tenantId RLS by tenantId; read-only outside Admin Daily backups; long retention for audit history
Integrity Evidence Ledger (chains & checkpoints) Authoritative for chain nodes & signed checkpoints Partition by tenantId : time-slice RLS by tenantId; verify APIs scoped Frequent backups; key material externalized in KMS; never re-writes
Projection Read Models (Warm) — timeline, facets, lookups Derived; rebuildable Partition by tenantId (+ optional sourceId or time window) RLS by tenantId; upsert-or-noop writes Snapshot or recreate from events; retain metadata (watermarks)
Search (optional) Search Index (multi-tenant clusters) Derived; rebuildable Index-per-tenant or routed by tenantId Query filter/routing by tenantId; mapping excludes sensitive fields Recreate on demand; snapshot indices optionally
Query — (uses Read Models) n/a n/a Repository filters by tenantId n/a
Export Cold Artifact Store (objects) Authoritative immutable packages & manifests Tenant folder/prefix per home region Access scoped by signed URLs; path includes tenantId Versioned objects; long retention per legal/eDiscovery; legal-hold pinning
Export Metadata Catalog Job state, cursors, parts, manifest digest Partition by tenantId RLS by tenantId Backup daily; retain while any package exists
Admin / Control Control Metadata Store Authoritative for tenants, quotas, holds, residency map, admin audit Partition by tenantId where applicable; platform-wide tables separate RLS by tenantId for tenant-scoped views; platform admin tables restricted Full backups; audit logs immutable retention
Cross-Cutting Inbox Receipts (per subscription) De-duplication receipts Partition by tenantId + subscription RLS by tenantId; TTL enforced TTL 7–30 days; no backups required beyond TTL
Messaging Broker Topics/Queues + DLQ Transport; not a data store Partition key = tenantId[:sourceId] Authz by principal; tenant key governs order Broker retention per topic; DLQ retained per ops policy

Tenancy & RLS Guidance

  • Hard boundary: every store enforces tenancy through RLS (or equivalent) and application-level filters.
  • Keys: all primary/unique keys include tenantId (or a scoped synthetic key) to prevent cross-tenant collisions.
  • Access patterns: administrative cross-tenant tables (e.g., AdminAudit, ResidencyMap) are platform-admin only.
  • Search: use index-per-tenant where feasible; otherwise enforce both routing by tenantId and a must-match tenant filter at query time.
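The application-level filter that complements DB RLS can be sketched as a tenant-scoped repository; a minimal sketch using SQLite (which has no native RLS, so only the app-side half of the hard boundary is shown; all names are illustrative):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE records (record_id TEXT, tenant_id TEXT, body TEXT)")
db.executemany("INSERT INTO records VALUES (?, ?, ?)",
               [("r1", "t-acme", "a"), ("r2", "t-other", "b")])
db.commit()

class TenantScopedRepo:
    """Every query carries a mandatory tenant_id predicate; callers cannot omit it."""
    def __init__(self, conn: sqlite3.Connection, tenant_id: str):
        self.conn, self.tenant_id = conn, tenant_id

    def find(self, record_id: str):
        row = self.conn.execute(
            "SELECT record_id, body FROM records "
            "WHERE tenant_id = ? AND record_id = ?",
            (self.tenant_id, record_id)).fetchone()
        return row  # None if the record belongs to another tenant
```

With RLS enabled in the real store, even a repository bug that drops the predicate would still be caught at the database layer.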

Encryption & Keying

  • At rest: all stores encrypted with cloud KMS; key IDs recorded in manifests/checkpoints where applicable.
  • In transit: TLS everywhere; mTLS within the mesh.
  • Rotation: store configs reference key aliases; rotation happens underneath with attested evidence (see Integrity & Admin).

Backup, Restore & Rebuild Posture

  • Authoritative: Append Store, Policy Store, Evidence Ledger, Control Metadata, Export Artifacts — must be backed up with verifiable restore drills.
  • Derived/Rebuildable: Read Models, Search Index — can be recreated from audit.accepted + policies; store only watermarks and projection metadata.
  • Cold Artifacts: immutability + versioning; region-pinned; legal holds supersede lifecycle deletion.

Indexes & Hot Paths (starter set)

  • Append Store (Hot):
    • PK: (tenantId, timeBucket, recordId)
    • Secondary: (tenantId, sourceId, sequence), (tenantId, occurredAt)
  • Read Models (Warm):
    • Timeline: (tenantId, occurredAt) (+ actorId, resource)
    • Lookup: (tenantId, recordId); (tenantId, correlationId)
  • Evidence Ledger:
    • Chain: (tenantId, slice, chainIndex); unique recordId
    • Checkpoint: (tenantId, slice, checkpointId)

Residency Rules (recap)

  • Every store that holds tenant data is bound to the tenant’s home region.
  • Cross-region replication disabled by default; DR uses region-local copies unless contractually allowed.
  • Restore drills verify that region binding is preserved for tenant-scoped restores.


Messaging Topologies & Contracts

Responsibility

Define the event subjects, topics/queues, partitioning keys, and subscription rules used across ATP so producers/consumers interoperate safely with at-least-once delivery and exactly-once effects (via inbox/idempotent handlers). Contracts are owned by their source context and evolve additively-first through a schema registry.

Out of scope: physical broker choice/sizing (see Deployment Views), per-language SDK details (see SDK docs).


Bus Topology (logical)

flowchart LR
  ING[Ingestion] -- audit.appended / audit.accepted --> BUS[(Event Bus)]
  POL[Policy] -- policy.changed --> BUS
  PRJ[Projection] -- projection.updated --> BUS
  EXP[Export] -- export.requested|completed|failed --> BUS
  INT[Integrity] --(consumes appended)-> BUS
  INT --> integrity.verified --> BUS

  subgraph Consumers
    PRJc[Projection]:::c
    INTc[Integrity]:::c
    EXPc[Export]:::c
    ADMc[Admin jobs]:::c
    UIs[Ops Consoles]:::c
  end

  BUS --> PRJc
  BUS --> INTc
  BUS --> EXPc
  BUS --> ADMc
  BUS --> UIs

  classDef c fill:#F4F7FF,stroke:#7aa6ff,rx:6,ry:6;

Partitioning policy: key = tenantId[:sourceId] to preserve per-tenant (and optionally per-source) order where it matters (append → integrity → projection).


Standard Envelope & Headers

All subjects share a consistent envelope and headers.

{
  "subject": "audit.accepted",
  "schemaVersion": "1.0.0",
  "eventId": "01JB9Z5G...",
  "occurredAt": "2025-10-22T12:04:55Z",
  "tenantId": "t-acme",
  "key": "t-acme:order-svc",
  "headers": {
    "x-tenant-id": "t-acme",
    "x-edition": "enterprise",
    "x-policy-version": "17",
    "x-schema-version": "1.0.0",
    "traceparent": "00-6e0c...-01",
    "x-correlation-id": "req-7f3a"
  },
  "payload": { /* subject-specific fields (see contracts) */ }
}

Rules:

  • No PII in headers; payload fields carry classification.
  • eventId is ULID; used with Inbox receipts for de-dup.
  • schemaVersion follows semantic versioning; breaking ⇒ new subject name or major version channel.

Subjects & Topics (inventory)

Subject Topic (logical) Partition Key Producer(s) Consumer(s) Payload (high-level) Retention DLQ
audit.appended audit.appended.v1 tenantId[:sourceId] Ingestion (outbox) Integrity, (optionally) Projection/Search warmers recordId, digest, sourceId, occurredAt, policyVersion ≥ projection catch-up window Yes (per consumer)
audit.accepted audit.accepted.v1 tenantId[:sourceId] Ingestion (outbox) Projection (required), Search (opt), Admin analytics recordId, attributes*, classification, retention, occurredAt ≥ replay window Yes
policy.changed policy.changed.v1 tenantId Policy Ingestion caches, Query caches, Projection (optional recompute), Admin UI policyVersion, scopes[], effectiveAt ≥ 24–72h Yes
projection.updated projection.updated.v1 tenantId Projection Dashboards, Admin, Search indexer model, watermark, countDelta, lagSeconds short (telemetry) Optional
export.requested export.requested.v1 tenantId Export Admin/Notifications exportId, format, estCount ≥ 24h Optional
export.completed export.completed.v1 tenantId Export Admin/UI, Audit trail exportId, manifestDigest, bytes, parts, url? ≥ compliance evidence window Yes
export.failed export.failed.v1 tenantId Export Admin/on-call exportId, code, detail ≥ 24–72h Yes
integrity.verified (opt) integrity.verified.v1 tenantId Integrity Admin/UI target (record range export), status, checkpointId short Optional
  • attributes are policy-aware (redacted/hints as needed) in audit.accepted projections; raw append payload remains in the Append Store.

Subscription Rules

  • Consumer group per component (e.g., projection.timeline, integrity.chain, search.indexer).
  • Inbox receipts required for each subscription: store (eventId | idempotencyKey) to drop duplicates.
  • Commit offsets after idempotent write completes (upsert-or-noop).
  • Back-pressure: scale consumers on lag; set per-tenant concurrency caps to avoid noisy-neighbor blow-ups.
  • DLQ on repeated failure (N attempts); DLQ entries capture errorCode, attempts, schemaVersion, payloadHash.
  • Replay is scoped (tenant + time window) with dry-run first; see Admin/Control.
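The subscription rules above can be sketched as a consumer loop; a minimal sketch with bounded retries, inbox de-dup, and DLQ routing (names are illustrative; a real subscriber would use the broker SDK and persistent receipts):

```python
MAX_ATTEMPTS = 3
dlq: list[dict] = []
_processed: set[str] = set()  # inbox receipts (in-memory stand-in)

def consume(event: dict, handler) -> str:
    """Handle one event: drop duplicates, retry on failure, DLQ after N attempts."""
    if event["eventId"] in _processed:
        return "duplicate"               # inbox drop: ack without side-effects
    last_error = None
    for _attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            handler(event)               # handler must be upsert-or-noop
            _processed.add(event["eventId"])  # "commit offset" after the write
            return "processed"
        except Exception as exc:
            last_error = exc
    dlq.append({"event": event, "error": str(last_error),
                "attempts": MAX_ATTEMPTS})
    return "dlq"
```

Note the ordering: the receipt is recorded only after the idempotent write succeeds, mirroring the "commit offsets after idempotent write completes" rule.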

  • Appended → audit.appended (write acknowledgment & digest for Integrity).
  • Accepted → audit.accepted (authoritative persisted event for Projection/Search).
  • Projection → projection.updated (freshness signal).
  • Export → export.requested|completed|failed.
  • Integrity → integrity.verified (optional attestation).

See: Message Schemas (../domain/contracts/message-schemas.md) and Events Catalog (../domain/events-catalog.md).


Schema Evolution Policy

  • Additive-first: add optional fields with defaults; avoid breaking consumers.
  • Breaking change ⇒ publish new subject (e.g., audit.accepted.v2) or bump schemaVersion major with parallel topic.
  • Deprecation windows: announce via changelog; dual-publish/read during migration.
  • Registry: all schemas registered; CI contract tests gate merges.

Tenancy & Security on the Bus

  • Partitioning by tenantId (and optional sourceId) to preserve order.
  • ACLs restrict produce/consume to specific subjects per service principal.
  • mTLS within the mesh; broker credentials rotated regularly.
  • Residency: topics are region-local; cross-region links disabled unless explicitly contracted.

Error Model & Invariants

  • Consumers must treat unknown fields as optional (forward compatibility).
  • If schema validation fails at consumer:
    • retry with backoff → DLQ after N; fix schema/handler → replay.
  • Producers never include PII in headers; payload classification guides downstream masking.
  • Ordering invariant: for a given partition key, audit.appended precedes or coincides with audit.accepted for the same record (if both are emitted).

Observability (bus-level)

  • Spans: relay.publish, consumer.handle; attrs: tenantId, subject, partition, offset, attempt, dlq.
  • Metrics: topic.rps, consumer.lag.seconds, outbox.age.seconds, dlq.depth, schema.validation.failures.
  • Logs: eventId, subject, schemaVersion, partition/offset; no payloads in INFO-level logs.

Testable Controls

  • CI: contract tests against registry; subject presence in inventory table required for new topics.
  • Pre-deploy: canary consumers validating schema compatibility.
  • Ops: synthetic publishes to ensure DLQ wiring and replay tools function.
  • Security: broker ACL tests, mTLS verification, no PII in headers linter.


Security & Tenancy Guards in Components

This cross-cutting section standardizes AuthN/Z, tenancy enforcement, edition gates, and least-privilege edges for all components. It complements per-component notes (Gateway, Ingestion, Policy, Integrity, Projection, Query, Search, Export, Admin).


Identity & Tokens (inbound → mesh → events)

Token types

  • User/Service OIDC JWT (preferred): short-lived, signed by IdP; audience = atp-api.
  • Workload JWT (service-to-service): minted for internal callers; narrower scopes.
  • mTLS identity (mesh): service principal for broker/DB/Object Store access.
  • Job Token (optional): time-boxed token minted by Admin for long exports/replays.

Required claims (JWT)

| Claim | Purpose |
|---|---|
| sub | actor/service identity |
| iss, aud | issuer/audience checks (prevent token confusion) |
| exp, iat, nbf | time validity |
| tenantId (custom claim) | authoritative tenant (never accept from body) |
| roles[] | coarse role check at Gateway |
| scopes[] | fine-grained API permissions |
| edition | edition gating |
| jti | replay protection (optional) |

Propagation (documented invariant)

  • Gateway → Services: add/forward headers x-tenant-id, x-edition, traceparent, tracestate, x-correlation-id, x-policy-version?.
  • Events (bus): include in envelope headers x-tenant-id, x-edition, x-schema-version, traceparent. No PII in headers.

Tenant source of truth is the token claim. Any X-Tenant-Id header from clients is ignored unless the caller is a trusted workload identity and values match.
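This precedence rule can be sketched directly; a minimal sketch where `resolve_tenant` is an illustrative name for the Gateway-side check:

```python
def resolve_tenant(token_claims: dict, headers: dict,
                   trusted_workload: bool = False) -> str:
    """The token claim is authoritative; a client X-Tenant-Id header is ignored
    unless the caller is a trusted workload, in which case it must match."""
    claim = token_claims["tenantId"]
    header = headers.get("x-tenant-id")
    if trusted_workload and header is not None and header != claim:
        raise PermissionError("TENANT_MISMATCH")
    return claim  # untrusted header values are simply ignored
```

A production Gateway would additionally log ignored mismatches as potential spoofing attempts.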


Authorization Model (3 layers)

  1. Edge (Gateway): authenticate token; enforce coarse RBAC + edition gates; rate/size limits.
  2. Service (use-case): scoped ABAC checks (tenant, resource, action) + per-route scopes.
  3. Data: RLS/tenant filters at the store (append/read models/ledger/cold artifacts).

Roles & Scopes Matrix (per component)

Component Typical Roles Required Scopes (examples) Notes
Gateway any AuthN/Z, versioning, rate limits at edge
Ingestion service:producer audit.append:write Requires X-Idempotency-Key; denies if edition/tenant missing
Policy service:* policy:evaluate Read path only; Admin uses policy:admin via Control Plane
Integrity service:*,auditor integrity:verify integrity:admin for checkpoint/keys via Admin
Projection system — (no public API) Admin controls: projection:admin
Query auditor,support,analyst audit.read, selection:create Redaction enforced per decision
Search (opt) auditor search.read Edition gate search.enabled
Export auditor export:create, export:download Webhooks: export:webhook:verify (server-side secret)
Admin/Control platform-admin, tenant-admin admin:* (platform) · tenant-admin:* (scoped) Break-glass flow: time-boxed elevation

Scopes are additive; revocation is immediate at Gateway (token refresh) and cached minimally downstream.


Edition Gates (feature flags)

| Feature | Edition Gate | Components Affected |
|---|---|---|
| Full-text Search | search.enabled | Search, Query UI routes |
| Export | export.enabled | Export, Query (selection→export) |
| Signed Manifests | integrity.manifest.signing | Integrity, Export |
| Admin Replays | admin.replay.enabled | Admin/Control, Projection |
| Large Selections | query.large-selection | Query, Export concurrency planner |

Edition gates are evaluated at the Gateway and again at the service (defense-in-depth).


Per-Tenant Quotas & Limits (guardrails)

| Domain | Default (tunable) | Enforcement |
|---|---|---|
| Append RPS | 50 sustained / 200 burst | Gateway rate limiter (429 + Retry-After) |
| Append payload size | 256 KB | Gateway hard limit |
| Query page size | 100 default / 1000 max | Query route caps |
| Search QPS | 50 rps per tenant | Search limiter |
| Export concurrency | 2 jobs per tenant | Export queue & scheduler |
| Storage quota | contract-based (GB/TB) | Control metadata + lifecycle |
| Admin replay window | ≤ 24h, dry-run first | Admin policy gate |
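The Gateway-side append limit (50 sustained / 200 burst) can be sketched as a per-tenant token bucket that emits 429 + Retry-After; a minimal sketch, with class and field names as illustrative assumptions:

```python
import math

class TenantRateLimiter:
    """Per-tenant token bucket: `rate` tokens/s sustained, `burst` capacity."""
    def __init__(self, rate: float = 50.0, burst: float = 200.0):
        self.rate, self.burst = rate, burst
        self.buckets: dict[str, tuple[float, float]] = {}  # tenant -> (tokens, last_ts)

    def allow(self, tenant_id: str, now: float):
        """Return (True, None) if admitted, else (False, 429 response hint)."""
        tokens, last = self.buckets.get(tenant_id, (self.burst, now))
        tokens = min(self.burst, tokens + (now - last) * self.rate)  # refill
        if tokens >= 1.0:
            self.buckets[tenant_id] = (tokens - 1.0, now)
            return True, None
        self.buckets[tenant_id] = (tokens, now)
        retry_after = math.ceil((1.0 - tokens) / self.rate)  # secs to next token
        return False, {"status": 429, "Retry-After": retry_after}
```

Keying buckets by tenant keeps one noisy tenant's burst from consuming another tenant's budget.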

Least-Privilege Edges (who can talk to what)

| Edge | Principal | Allowed Actions (minimal) |
|---|---|---|
| Services → IdP/JWKS | Gateway only | Token validation/caching; no direct user identity calls from data-plane |
| Services → KMS/HSM | Integrity (sign/verify), Admin (rotate via workflow) | Non-exportable keys; per-action audit; scoped key IDs |
| Export → Cold Store | Export worker principal | PutObject, HeadObject; downloads via signed URLs only |
| Services → Broker | Each service account | Produce/consume specific subjects; tenant-partitioned keys |
| Services → Datastores | Per-service account + RLS | Only own tables/DBs; RLS by tenantId enforced |

Data Handling & Classification

  • Append: Ingestion stamps classification, retentionPolicyId, policyVersion.
  • Read/Export: Query obtains redaction plan from Policy; masks/hashes/omits before returning.
  • Logs/Traces: structured, no PII; include traceId, tenantId, policyVersion.
  • Residency: all data and artifacts are region-pinned to the tenant’s home region.

Failure/Threat Scenarios & Mitigations

| Scenario | Mitigation |
|---|---|
| Token confusion (wrong aud/iss) | Strict audience/issuer checks at Gateway; reject on mismatch |
| Tenant spoofing | Use token claim for tenant; ignore client header; verify once and propagate |
| Header tampering | Mesh policy denies external traffic; services trust only Gateway-set headers |
| Privilege escalation | Scopes checked at service; Admin actions require dual approval and break-glass TTL |
| Cross-tenant data access | RLS + repository filters; partition keys include tenantId; synthetic tests verify |
| Key misuse | KMS policies per key; non-exportable; audit every sign/verify; rotation drills |

Testable Controls

  • CI policy: every route declares required scopes and edition gates; contracts linted.
  • Pen tests: header spoofing blocked; tenant claim precedence verified.
  • Synthetic tests: ensure x-tenant-id & traceparent propagate across HTTP → bus → consumers.
  • DB checks: RLS policies exist on all tenant tables; cross-tenant queries fail by default.
  • KMS evidence: rotation and key usage appear in Integrity/Admin audit logs.


Observability in Components

Purpose

Standardize traces, metrics, and logs across all services so teams can diagnose issues quickly, protect SLOs, and maintain tenant isolation. This section defines golden signals per component, required tags/attributes, exemplar spans, dashboards, and alert hooks.

See also: Observability (OTel/Logs/Metrics) (../operations/observability.md), Alerts & SLOs (../operations/alerts-slos.md).


Global Conventions (OTel-first)

Resource attributes (set once per service):

  • service.name, service.version, deployment.environment, cloud.region.

Required span attributes (every request/operation)

  • atp.tenant_id (matches token claim)
  • atp.edition
  • atp.policy_version (if known)
  • http.route | messaging.operation + messaging.destination (subject/topic)
  • atp.correlation_id (from x-correlation-id)
  • error.type, error.code (on failure)

Event/Header hygiene

  • Always propagate: traceparent, tracestate, x-correlation-id, x-tenant-id, x-edition, x-policy-version?.
  • No PII in headers or logs; use content hashes where helpful.

Sampling

  • Tail-based: 100% of errors, 100% of slow requests (above p95), 10% of normal; boost sampling for tenants on watchlists.
  • Exemplars: attach trace_id to latency/error metrics for jump-to-trace.

Cardinality controls

  • Labels on metrics: include route|subject, never raw recordId/actorId.
  • Tenant labels: use scoped series only on tenant dashboards; platform-wide metrics use top-N or tenant.class (small/med/large).

Exemplar Spans (naming)

| Span | Example name | Key attrs |
|---|---|---|
| Gateway request | gateway.request | http.route, atp.tenant_id, rl.outcome |
| Ingestion append | ingestion.append | idempotent, schema.ok, policyVersion |
| Outbox publish | relay.publish | subject, partition, attempt |
| Consumer handle | consumer.handle | subject, eventId, result: upsert\|noop\|dlq |
| Policy evaluate | policy.evaluate | mode: append\|read\|export, cache.hit |
| Integrity verify | integrity.verify.record\|range\|export | algo, keyId, checkpointId |
| Projection apply | projection.handle | model, watermark.before/after |
| Query execute | query.execute | route, filters.hash, watermark, lag.s |
| Search query | search.query | q.hash, filters.hash, index.lag.s |
| Export create | export.create | selection.est, format, parts |
| Admin action | admin.replay.start | tenantId, window, dryRun |

Example span snippet:

{
  "name": "ingestion.append",
  "attributes": {
    "atp.tenant_id": "t-acme",
    "atp.edition": "enterprise",
    "http.route": "/api/v1/audit/append",
    "idempotent": false,
    "schema.ok": true,
    "atp.policy_version": 17
  }
}

Logs (structured JSON)

  • Fields: ts, level, service, traceId, spanId, atp.tenant_id, atp.correlation_id, eventId?, subject?, code, message, details{}.
  • Never log payload values or PII. Use payload.hash, recordId only if necessary.

Example:

{
  "ts":"2025-10-22T12:06:41Z",
  "level":"WARN",
  "service":"projection",
  "traceId":"01JB..",
  "atp.tenant_id":"t-acme",
  "eventId":"01JB9..",
  "subject":"audit.accepted",
  "code":"PROJECTION_RETRY",
  "details":{"attempt":2,"reason":"store-timeout"}
}


Golden Signals per Component

Component Latency Errors Saturation/Backlog Freshness/Lag Extras
Gateway http.server.duration{route} p95/p99 4xx/5xx, 429.rate inflight{route} schema.reject.rate, auth latency
Ingestion append.accepted.latency problemjson.4xx/5xx outbox.pending.count outbox.relay.age.p95 dedupe.rate, payload reject rate
Policy policy.decision.latency{mode} deny/error rate cache size policy.cache.hit_ratio
Integrity verify.latency{target} signature/verify errors signer queue depth checkpoint.delay.seconds kms.latency
Projection handler latency DLQ rate consumer queue depth projector.lag.seconds{model} inbox.duplicate.rate
Query query.latency{route} 4xx/5xx cache pressure query.watermark.lag.seconds selection.build.latency
Search search.latency{route} 4xx/5xx indexer queue depth indexer.lag.seconds index.refresh.age.seconds
Export export.ttfb.seconds export.failed.count export.queue.depth export.completion.minutes, webhook success
Admin admin route latency 4xx/5xx job queue depth replay/rebuild lag approval.latency.seconds, breakglass gauge

---

### Metrics — Required Names & Labels (starter set)

- **Gateway**
  - `gateway_http_requests_duration_seconds{route,status}` (histogram)
  - `gateway_rate_limit_rejections_total{route,tenant_class}`
  - `gateway_schema_reject_total{route}`
- **Ingestion**
  - `ingestion_append_latency_seconds` (histogram)
  - `outbox_relay_age_seconds` (summary)
  - `ingestion_dedupe_ratio` (gauge)
  - `ingestion_schema_reject_total`
- **Policy**
  - `policy_decision_latency_seconds{mode}`
  - `policy_cache_hit_ratio`
  - `policy_deny_total{mode}`
- **Integrity**
  - `integrity_verify_latency_seconds{target}`
  - `integrity_checkpoint_delay_seconds`
  - `kms_request_latency_seconds`
  - `integrity_signature_error_total`
- **Projection**
  - `projector_lag_seconds{model}`
  - `consumer_lag_messages{subscription}`
  - `dlq_depth{subscription}`
  - `dlq_replay_rate{subscription}`
- **Query**
  - `query_latency_seconds{route}`
  - `query_watermark_lag_seconds`
  - `query_cache_hit_ratio`
  - `selection_build_latency_seconds`
- **Search**
  - `search_latency_seconds{route}`
  - `indexer_lag_seconds`
  - `index_refresh_age_seconds`
- **Export**
  - `export_queue_depth`
  - `export_ttfb_seconds`
  - `export_completion_minutes`
  - `webhook_delivery_success_ratio`
- **Admin**
  - `admin_action_success_ratio`
  - `admin_approval_latency_seconds`
  - `replay_throughput_records_per_min`
  - `breakglass_enabled{actor}` (gauge, with TTL)

All metrics carry `service`, `env`, and `region`, plus only safe (low-cardinality, non-PII) labels; per-tenant metrics appear only on tenant-scoped dashboards.


---

### Dashboards & SLO Hooks

#### Dashboards

- **Platform Overview**: route latencies p95/p99, error ratios, outbox age p99, consumer lag, export backlog, indexer lag; top-N tenants by RPS.
- **Per-Tenant**: gateway 429s, append/query latency, watermark lag, export status.
- **Per-Service**: golden signals, DLQ depth, retry counts, breaker state, dependency latencies.
- **Integrity/Compliance**: checkpoint delays, signature errors, verify latencies.

#### SLO linkage

Map each golden signal to an SLO (examples; tune per env):

- Gateway latency: p95/p99 thresholds; error-budget burn alerts.
- Ingestion outbox age p99 → alert at N seconds for M minutes.
- Projection lag p95 → page when > 30 s sustained.
- Query watermark lag → warn at 30 s; page at 2 min.
- Export TTFB p95 and completion p95 thresholds.
- Integrity checkpoint delay beyond window end.

See [Alerts & SLOs](../operations/alerts-slos.md) for concrete thresholds.
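As a sketch of the error-budget math behind "burn alerts": a burn rate of 1.0 consumes the budget exactly over the SLO window, and multi-window rules page only on sustained fast burns. The 99.9 % target and the 14.4× page threshold below are illustrative values, not ATP-mandated ones.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being consumed.

    1.0 exhausts the budget exactly at the end of the SLO window;
    14.4 exhausts a 30-day budget in roughly 2 days.
    """
    budget = 1.0 - slo_target          # allowed error fraction, e.g. 0.001
    return error_ratio / budget


def should_page(err_1h: float, err_5m: float, slo_target: float = 0.999) -> bool:
    """Multi-window rule: page only when both a long and a short window burn fast.

    The short window filters out burns that have already stopped; the long
    window filters out brief blips.
    """
    return burn_rate(err_1h, slo_target) > 14.4 and burn_rate(err_5m, slo_target) > 14.4
```

For example, a steady 2 % error ratio against a 99.9 % target burns the budget 20× too fast and pages; a 0.1 % ratio burns at exactly 1.0 and does not.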


---

### Example: End-to-End Trace (Append → Verify → Project → Query → Export)

```mermaid
sequenceDiagram
  autonumber
  participant G as Gateway (span: gateway.request)
  participant I as Ingestion (span: ingestion.append)
  participant OB as Outbox (span: relay.publish)
  participant P as Projection (span: consumer.handle/projection.handle)
  participant Q as Query (span: query.execute)
  participant E as Export (span: export.create|export.seal)
  participant INT as Integrity (span: integrity.verify.export)

  G->>I: /audit/append (traceparent, x-tenant-id)
  I->>OB: outbox persist (trace link)
  OB-->>P: audit.accepted (event trace link)
  P-->>Q: watermark advanced
  G->>Q: /query/selection (traceparent)
  Q-->>E: create export (selection)
  E->>INT: request manifest signing
  INT-->>E: signature + chain refs
```

Each span carries `atp.tenant_id` and `atp.correlation_id`, and emits metric exemplars.


---

### Alert Hints (starter set)

- High 429 rate at Gateway (per tenant & route) → notify the tenant and throttle producers.
- Outbox age p99 > threshold → scale relay / investigate broker.
- Consumer lag > threshold → autoscale projectors or reduce hot-tenant limits.
- DLQ depth rising → create incident; run dry-run replay.
- Checkpoint delay beyond window → investigate KMS/signing.
- Export backlog → raise export concurrency for affected tenants within caps.

### Testable Controls

- Lint check: every route/consumer defines trace names and required attributes.
- Synthetic traces: periodic "canary append → query" to verify links and SLOs.
- Log-scrubber CI: block accidental PII fields.
- Metric cardinality guard: budgets enforced; alert when new labels explode the series count.
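The cardinality guard can be sketched as a counter of distinct label combinations per metric; the class name and the budget numbers here are illustrative assumptions, not part of the ATP contract.

```python
from collections import defaultdict


class CardinalityGuard:
    """Tracks distinct label combinations per metric and flags budget breaches."""

    def __init__(self, budgets: dict, default_budget: int = 500):
        self.budgets = budgets                 # per-metric series budgets
        self.default_budget = default_budget
        self.series = defaultdict(set)         # metric -> set of label tuples

    def observe(self, metric: str, labels: dict) -> bool:
        """Record a series; return False once the metric exceeds its budget."""
        key = tuple(sorted(labels.items()))    # order-insensitive series identity
        self.series[metric].add(key)
        budget = self.budgets.get(metric, self.default_budget)
        return len(self.series[metric]) <= budget
```

In CI or at scrape time, a `False` return would fail the build or fire the "label explosion" alert for that metric.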


---

## Resilience & Back-Pressure in Components

### Purpose

Define timeouts, retries with jitter, bulkheads, and DLQ patterns per service so that ATP remains stable under spikes, dependency hiccups, and partial failures. This section also calls out event-choreography resilience (EDA: at-least-once delivery plus exactly-once effects via idempotency).

### Cross-Cutting Principles

- **Fail fast at the edge**: schema/tenant/edition errors → 4xx; never flow into the mesh.
- **No retries on 4xx**; bounded, jittered retries only for transient errors (timeouts, 5xx).
- **Timeout budgets derive from SLOs**: `route_timeout ≈ p95 × 1.2` (capped by upstream budgets).
- **Bulkheads everywhere**: separate worker pools (steady-state vs. replay/backfill/export).
- **At-least-once on the bus, exactly-once intent in handlers** (inbox + upsert-or-noop).
- **Back-pressure tunnel**: consumer lag → throttle producers/exporters → if needed, 429 at the Gateway for hotspots.
- **DLQ per subscription**, dry-run replay first; replays are tenant- and time-window-scoped.
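The inbox + upsert-or-noop principle can be sketched as follows. SQLite stands in for the real projection store, and the table and event shapes are illustrative assumptions; the point is that the receipt insert and the effect are decided together, so redeliveries become no-ops.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE inbox (event_id TEXT PRIMARY KEY)")
db.execute("CREATE TABLE balances (tenant_id TEXT PRIMARY KEY, total INTEGER)")


def handle(event_id: str, tenant_id: str, amount: int) -> bool:
    """Apply the event's effect exactly once; duplicate deliveries are dropped."""
    # Inbox receipt: PRIMARY KEY makes the second insert a no-op.
    cur = db.execute("INSERT OR IGNORE INTO inbox (event_id) VALUES (?)", (event_id,))
    if cur.rowcount == 0:            # receipt already present -> duplicate
        return False
    # Upsert-or-noop effect on the read model.
    db.execute(
        "INSERT INTO balances (tenant_id, total) VALUES (?, ?) "
        "ON CONFLICT(tenant_id) DO UPDATE SET total = total + excluded.total",
        (tenant_id, amount),
    )
    db.commit()
    return True
```

In a real handler the receipt insert and the effect share one transaction, so a crash between them cannot strand a receipt without its effect.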

### Policy Table — Timeouts, Retries, Bulkheads, DLQs

| Component | Timeout Budget | Retry Policy (transient) | Bulkheads & Concurrency | DLQ / Replay | Back-Pressure Signals |
| --- | --- | --- | --- | --- | --- |
| Gateway | route-specific (e.g., append 1–3 s; query 800–1500 ms) | none downstream (edge fails fast) | per-route workers; token/validation isolated | n/a | 429 rate, auth/cache latency |
| Ingestion | policy eval 100–300 ms; store write ≤ 500–1000 ms | 3–5 tries, decorrelated jitter (base 200 ms, cap 10 s); no retry on 4xx | separate pools for API vs. outbox relay | DLQ for relay; replay by tenant/window | outbox age p95, broker depth, dedupe rate |
| Policy | 100–300 ms | 2–3 tries w/ jitter; breaker to store; SWR cache fallback | evaluate pool vs. cache-refresh pool | optional DLQ for `policy.changed` consumers | decision latency p95, cache hit ratio |
| Integrity | ledger write ≤ 300–800 ms; KMS sign ≤ 200–500 ms | 3–5 tries w/ jitter; breaker to KMS | chain-update workers vs. checkpoint issuer | DLQ for chain updates; replay deterministic | checkpoint delay, KMS error rate |
| Projection | per event ≤ 50–150 ms | 5 tries w/ jitter; then DLQ | separate steady-state vs. rebuild pools; per-tenant caps | DLQ + replay (dry-run first) | consumer lag, DLQ depth |
| Query | 500–1200 ms typical | 1–2 tries to read model only (fast-fail overall) | selection builders isolated from standard routes | n/a | watermark lag, cache pressure |
| Search (opt) | 500–1500 ms; index write ≤ 200 ms | 3 tries write; 1 try read | indexer pool separate; per-tenant shard caps | DLQ for indexer | indexer lag, refresh age |
| Export | per page read ≤ 1–2 s; seal ≤ 3–5 s | 5 tries page/PUT w/ resume; no retry on manifest 4xx | queued jobs; per-tenant concurrency caps; deliverers isolated | webhook DLQ (backoff) | queue depth, TTFB/completion p95 |
| Admin/Control | 1–3 s for control APIs | 2–3 tries non-destructive; ops require approvals | job-runner pool low-priority; hard caps | DLQ uncommon; jobs are resumable | approval latency, job backlog |

#### Jitter recipe (decorrelated)

```
sleep = min(cap, rand(base, sleep * 3))
```

Start with `base = 200 ms`, `cap = 10 s`. Stop retrying on client cancellations or on 4xx.
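Wrapped into a retry helper, the recipe looks like this minimal sketch; `TransientError` stands in for whatever the real clients raise on timeouts and 5xx, and the injectable `sleeper` exists only so callers (and tests) can observe the delays.

```python
import random
import time


class TransientError(Exception):
    """Timeouts / 5xx. 4xx should raise a different type and never be retried."""


def retry(op, base=0.2, cap=10.0, attempts=5, sleeper=time.sleep):
    """Run `op`, retrying transient failures with decorrelated jitter."""
    sleep = base
    for attempt in range(1, attempts + 1):
        try:
            return op()
        except TransientError:
            if attempt == attempts:
                raise                                  # budget exhausted -> surface the error
            # Decorrelated jitter: next delay drawn from [base, 3 * previous], capped.
            sleep = min(cap, random.uniform(base, sleep * 3))
            sleeper(sleep)
```

Decorrelated jitter keeps retry storms from synchronizing across clients while still backing off roughly exponentially.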


---

### Event Choreography Resilience (EDA)

- **Ordering**: partition key = `tenantId[:sourceId]`, preserving sequence for related events.
- **Duplicates**: tolerated; Inbox receipts drop dupes; handlers are idempotent.
- **Poison messages**: after N failures → DLQ with schema/version/attempt metadata; fix → replay.
- **Saga-like rebuilds**: Projection/Search rebuild from `audit.accepted`; Integrity can deterministically recompute proof chains from append digests + checkpoints.
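The ordering rule amounts to a stable hash of the partition key, so all events for a given tenant/source land on the same partition in order. A sketch follows; the SHA-256 choice and the partition count are illustrative, not what the broker actually uses.

```python
import hashlib


def partition_for(tenant_id: str, source_id: str, partitions: int = 32) -> int:
    """Stable partition assignment: same tenant+source always maps to the same partition."""
    key = f"{tenant_id}:{source_id}".encode()
    digest = hashlib.sha256(key).digest()
    # Take 8 bytes of the digest as an integer; modulo gives the partition index.
    return int.from_bytes(digest[:8], "big") % partitions
```

Stability matters more than the hash itself: re-hashing with a different function (or resizing the partition count) reshuffles keys and temporarily breaks per-key ordering guarantees.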

### Failure Modes & Resolution Notes

| Scenario | What you'll see | Resolution (first/second moves) |
| --- | --- | --- |
| Broker hiccup / partition outage | rising outbox age, consumer lag | (1) autoscale relay/consumers; (2) open breaker to slow producers via per-tenant rate limits; (3) incident if sustained |
| KMS slowness/outage (Integrity) | checkpoint delay, verify latency spikes | (1) breaker + queue chain updates; (2) defer checkpoints, alert; (3) rotate to standby key if provider issue |
| Hot tenant burst | tenant-localized 429s, projector lag for the key | (1) throttle offending tenant; (2) shard by `tenantId:sourceId`; (3) suggest off-peak exports |
| Append schema drift | gateway schema rejects, DLQ growth downstream | (1) roll back producer or hotfix schema; (2) re-validate via contract tests; (3) replay DLQ |
| Index cluster degraded (Search) | indexer lag, refresh age ↑ | (1) reduce batch size; (2) pause low-value indexing; (3) switch to Query-only UI mode |
| Cold store latency (Export) | TTFB ↑, incomplete downloads | (1) smaller part size; (2) increase download expiry; (3) reroute within region; resume `Range` downloads |
| Policy store outage | append evaluate timeouts | (1) use cached decisions; (2) fail closed for writes; (3) page team; validate after recovery |
| Projection backlog | per-tenant lag > budget | (1) autoscale; (2) reduce export concurrency; (3) targeted replay of the lagged window |
| Integrity chain gaps | verify mismatches | (1) quarantine slice; (2) recompute from last good checkpoint; (3) publish advisory |
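Several of the first moves above "open a breaker" toward a slow dependency. A minimal breaker sketch follows; the threshold, cooldown, and class name are illustrative assumptions.

```python
import time


class CircuitBreaker:
    """Open after `threshold` consecutive failures; probe again after `cooldown` seconds."""

    def __init__(self, threshold=5, cooldown=30.0, clock=time.monotonic):
        self.threshold, self.cooldown, self.clock = threshold, cooldown, clock
        self.failures = 0
        self.opened_at = None          # monotonic time when the breaker opened

    def allow(self) -> bool:
        if self.opened_at is None:
            return True                # closed: traffic flows
        if self.clock() - self.opened_at >= self.cooldown:
            return True                # half-open: let one probe through
        return False                   # open: fail fast, protect the dependency

    def record(self, ok: bool) -> None:
        if ok:
            self.failures, self.opened_at = 0, None
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()
```

The injectable `clock` makes the breaker testable without real waiting; per-dependency instances (KMS, policy store, broker) keep one slow dependency from exhausting shared worker pools.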

### Back-Pressure Flow (from lag to edge)

```mermaid
flowchart LR
  C[Consumers lag ↑] --> P[Projection/Indexer autoscale]
  P -->|insufficient| Q[Queue depth ↑ / Watermark lag ↑]
  Q --> T[Tenant limiter tighten]
  T --> G[Gateway 429 - offending tenant/routes]
  Q --> E[Export scheduler slowdowns]
  E --> G
```

- Worker autoscalers are the first responders.
- If lag persists → tenant-specific throttles (not global).
- The export scheduler yields first, to protect Query SLOs.
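The "tenant limiter tighten" step is typically a per-tenant token bucket at the Gateway; tightening means lowering one tenant's `rate` without touching the others. A sketch with illustrative rates follows (class and method names are assumptions).

```python
import time


class TenantLimiter:
    """Per-tenant token bucket; hot tenants are throttled individually, never globally."""

    def __init__(self, rate: float, burst: float, clock=time.monotonic):
        self.rate, self.burst, self.clock = rate, burst, clock
        self.state = {}                # tenant -> (tokens, last_timestamp)

    def allow(self, tenant_id: str) -> bool:
        now = self.clock()
        tokens, last = self.state.get(tenant_id, (self.burst, now))
        # Refill proportionally to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self.state[tenant_id] = (tokens - 1.0, now)
            return True                # admit the request
        self.state[tenant_id] = (tokens, now)
        return False                   # would surface as a 429 at the Gateway
```

Because each tenant has its own bucket, a burst from one tenant drains only that tenant's tokens and cannot cause cross-tenant 429s.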

### Concrete Defaults (tune per env)

- **Append**: timeout 2–3 s; retries 3–5; idempotency required.
- **Query/Search**: timeout 1 s; retries ≤ 1; page size default 100 (max 1000).
- **Export**: read page 1–2 s; cold PUT 3 s; resumable with `Range`; per-tenant concurrency = 2.
- **Integrity**: sign 200–500 ms; checkpoint rule: window end + D seconds (alert if late).
- **Projection**: handler budget 100 ms; DLQ after 5 attempts; replay throughput target ≥ N rec/min/worker.

### Testable Controls

- **Chaos drills**: kill a broker partition or KMS; assert that outbox-age alarms fire, breakers open, and the recovery path completes.
- **Replay drills**: inject poison into the DLQ; perform a dry-run, then an actual replay; verify no dupes.
- **Hot-partition drills**: synthetic burst on one tenant; verify throttling and no cross-tenant impact.
- **Timeout lint**: routes/clients must declare budgets; CI blocks missing configs.


---

## API & Event Contract Pointers

This section links each component to its REST/gRPC specs and event subjects. Schemas live under Contracts; subjects are listed in the Events Catalog.

Canonical contract docs:

- [REST/gRPC](../domain/contracts/rest-apis.md)
- [Message Schemas](../domain/contracts/message-schemas.md)
- [Webhooks](../domain/contracts/webhooks.md)
- [Events Catalog](../domain/events-catalog.md)


### Inventory (at a glance)

| Component | REST / gRPC Contracts | Webhooks | Produces (Events) | Consumes (Events) | Schemas |
| --- | --- | --- | --- | --- | --- |
| Gateway | [API Gateway](../domain/contracts/rest-apis.md) | [Webhook Ingest](../domain/contracts/webhooks.md) | — | — | error model & headers in REST; see [Message Schemas](../domain/contracts/message-schemas.md) |
| Ingestion | [Append API](../domain/contracts/rest-apis.md) | [(via Gateway)](../domain/contracts/webhooks.md) | [`audit.appended`, `audit.accepted`](../domain/events-catalog.md) | — | [Subjects & payloads](../domain/contracts/message-schemas.md) |
| Policy | [Evaluate (append/read/export)](../domain/contracts/rest-apis.md) | — | [`policy.changed`](../domain/events-catalog.md) | — | [Decision/Policy schemas](../domain/contracts/message-schemas.md) |
| Integrity | [Verify APIs](../domain/contracts/rest-apis.md) | — | [`integrity.verified` (opt)](../domain/events-catalog.md) | [`audit.appended` (or `audit.accepted`)](../domain/events-catalog.md) | [Chain/checkpoint/manifest schemas](../domain/contracts/message-schemas.md) |
| Projection | [Admin control via Control Plane](../domain/contracts/rest-apis.md) | — | [`projection.updated`](../domain/events-catalog.md) | [`audit.accepted`](../domain/events-catalog.md) | [Projection metadata schemas](../domain/contracts/message-schemas.md) |
| Query | [Query (timeline/lookup/facets/selection)](../domain/contracts/rest-apis.md) | — | — | — | [Response/selection schemas](../domain/contracts/message-schemas.md) |
| Search (opt) | [Search (query/facets/suggest)](../domain/contracts/rest-apis.md) | — | — | [`projection.updated`](../domain/events-catalog.md) | [Index doc schemas](../domain/contracts/message-schemas.md) |
| Export | [Export (create/poll/download)](../domain/contracts/rest-apis.md) | [Export Completed (optional)](../domain/contracts/webhooks.md) | [`export.requested`, `export.completed`, `export.failed`](../domain/events-catalog.md) | — | [Manifest/package schemas](../domain/contracts/message-schemas.md) |
| Admin / Control | [Admin (tenants, DLQ/replay, policies, keys)](../domain/contracts/rest-apis.md) | — | (drives `policy.changed` via Policy) | `projection.updated` (status) | [Admin audit/job schemas](../domain/contracts/message-schemas.md) |

### Gateway

### Ingestion

### Policy

### Integrity

### Projection

### Query

### Search (optional)

### Export

### Admin / Control Plane


### Header & Token Invariants (for contracts)

- Always propagate: `x-tenant-id`, `x-edition`, `traceparent`, `x-correlation-id`, and `x-policy-version` (when relevant).
- No PII in headers; payload classification drives redaction.
- Error model: Problem+JSON, as defined in the [REST contracts](../domain/contracts/rest-apis.md).
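A small propagation helper can enforce this allow-list on outbound calls, so only the invariant headers cross service boundaries. The header set matches the list above; the function name is an assumption.

```python
# Only these headers may cross service boundaries (compare lower-cased).
PROPAGATED = {"x-tenant-id", "x-edition", "traceparent", "x-correlation-id", "x-policy-version"}


def outbound_headers(inbound: dict) -> dict:
    """Copy only the invariant headers; everything else (auth, cookies, PII) is dropped."""
    return {k.lower(): v for k, v in inbound.items() if k.lower() in PROPAGATED}
```

Dropping by default (rather than blocking a deny-list) is what keeps accidental PII out of headers when a new caller forwards its whole header map.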

---

The mini-matrix below maps each component section to its roadmap Epics/Features and the most relevant artifacts under `/docs`.

Roadmap baseline: see [Planned Work (Epics & Features)](../planning/index.md).

### Mini-Matrix

| Section | Roadmap Epic / Feature | Related Docs (under `/docs`) |
| --- | --- | --- |
| Document Header & Scope | AUD-ARC-001 | `architecture/architecture.md`, `architecture/hld.md`, `architecture/context-map.md` |
| Service Catalog (At-a-Glance) | AUD-ARC-001 | `architecture/context-map.md`, `domain/events-catalog.md` |
| Audit.Gateway (Deep-Dive) | AUD-GATEWAY-001 | `domain/contracts/rest-apis.md`, `architecture/architecture.md#api-gateway-connectivity` |
| Audit.Ingestion (Deep-Dive) | AUD-INGEST-001 | `domain/contracts/message-schemas.md`, `implementation/outbox-inbox-idempotency.md` |
| Audit.Policy (Deep-Dive) | AUD-COMPLIANCE-001 | `platform/pii-redaction-classification.md`, `platform/privacy-gdpr-hipaa-soc2.md` |
| Audit.Integrity (Deep-Dive) | AUD-INTEGRITY-001 | `hardening/tamper-evidence.md`, `operations/backups-restore-ediscovery.md` |
| Audit.Projection (Deep-Dive) | AUD-QUERY-001 | `implementation/query-views-indexing.md`, `architecture/sequence-flows.md` |
| Audit.Query (Deep-Dive) | AUD-QUERY-001 | `domain/contracts/rest-apis.md`, `architecture/sequence-flows.md` |
| Audit.Search (Deep-Dive, optional) | AUD-QUERY-SEARCH-001 | `implementation/query-views-indexing.md`, `platform/pii-redaction-classification.md` |
| Audit.Export (Deep-Dive) | AUD-EXPORT-001 | `domain/contracts/rest-apis.md`, `operations/backups-restore-ediscovery.md` |
| Admin / Control Plane (Deep-Dive) | AUD-OPS-001, AUD-GOV-ADR-001 | `operations/runbook.md`, `architecture/architecture.md#adr-index-governance` |
| Cross-Cutting: Outbox/Inbox & Idempotency | AUD-CHAOS-001 | `implementation/messaging.md`, `implementation/outbox-inbox-idempotency.md` |
| Storage Components per Service | AUD-STORAGE-001 | `implementation/persistence.md`, `architecture/data-model.md` |
| Messaging Topologies & Contracts | AUD-BUS-001 | `domain/events-catalog.md`, `domain/contracts/message-schemas.md` |
| Security & Tenancy Guards | AUD-SECURITY-001, AUD-TENANT-001 | `platform/security-compliance.md`, `platform/multitenancy-tenancy.md` |
| Observability in Components | AUD-OTEL-001 | `operations/observability.md`, `operations/alerts-slos.md` |
| Resilience & Back-Pressure | AUD-CHAOS-001 | `implementation/outbox-inbox-idempotency.md`, `operations/runbook.md` |
| API & Event Contract Pointers | AUD-CONTRACTS-001 | `domain/contracts/rest-apis.md`, `domain/contracts/message-schemas.md`, `domain/events-catalog.md` |

### Notes & Acceptance

- Each row links to the primary epic and the canonical artifacts (contracts, sequences, platform pages).
- Anchors within this file follow your markdown renderer's auto-ID rules; adjust if link checkers flag mismatches.
- Accept when all component deep-dives above appear in the matrix and all links pass the docs link-checker.