
Sequence Flows — Header, Scope & Notation — Audit Trail Platform (ATP)

This document captures end-to-end sequence flows for the Audit Trail Platform (ATP). It shows how requests move across services (Gateway → Ingestion → Storage → Integrity → Projection → Query/Search/Export), which headers and IDs are propagated, and where policy/redaction and integrity operations occur.

JSON uses lowerCamel; C#/gRPC (code-first) uses PascalCase; Protobuf fields are PascalCase with json_name mapped to lowerCamel. Times are ISO-8601 UTC with ms precision.


Purpose

  • Provide a definitive reference for request/response choreography across ATP.
  • Make tenancy, correlation, idempotency, redaction, and integrity touchpoints explicit.
  • Enable engineers, SREs, and auditors to reason about correctness and operational SLOs (e.g., projection and sealing lag).

Audience

  • Platform engineers implementing services and SDKs.
  • SRE/Operations running ATP in production.
  • Security & Compliance validating controls, proofs, and holds.
  • Integrations & SDK authors producing/consuming audit data.

Scope

  • Online and async ingestion, projection, integrity, query/search, export, policy/hold, recovery, observability.
  • Happy paths with alt/opt blocks for errors, retries, and degraded modes.
  • Cross-references to Data Model, Message Schemas, HLD, and Components.

Non-goals

  • Full API parameter docs (see REST/gRPC contracts).
  • Deep internals of cryptographic primitives (see Integrity spec).
  • Runbook procedures (see Operations/Runbook).

How to read these diagrams

  • Each flow is expressed with Mermaid sequenceDiagram.
  • We use consistent participant names (below) and consistent labels for calls:
    • op name [headers] {summary} for requests.
    • ↩ status body for responses.
  • Headers are shown with [h], bodies with [b] when helpful.
  • Tenancy & correlation appear on first hop and are implied downstream unless called out.
  • Errors use alt/else blocks; retries use loop with backoff notes.

Canonical participants (legend)

| Label | Meaning |
| --- | --- |
| Client | External producer/consumer (browser, service, tool) |
| Gateway | API Gateway / Edge (authN/Z, rate limit, tenancy) |
| Ingestion | Write path (validate, canonicalize, classify/redact, append) |
| Storage | Authoritative append-only store (WORM) |
| Integrity | Segment/block sealing, Merkle roots, signatures |
| Projection | Read-model updaters; checkpoints/watermarks |
| Query | Timeline/resource/actor queries; masking profiles |
| Search | Full-text/facets/suggest over per-tenant indices |
| Export | eDiscovery and bulk packages; signed manifests |
| Policy | Classification, redaction, retention evaluation |
| LegalHold | Hold application/release, scope indexing |
| Bus | Message transport (e.g., Service Bus/MassTransit/NSB) |
| KMS | Key management for signatures/manifests |
| IdP | Identity provider (JWT/OIDC) |
| Obs | Observability pipeline (metrics/logs/traces) |

Flows may also show Inbox/Outbox, Indexer, or Admin where relevant.


Cross-cutting conventions

  • Tenancy: All flows carry x-tenant-id (or gRPC metadata tenant); RLS enforced at storage and read models.
  • Correlation: OTel traceparent is required; optional baggage includes tenant, edition, shard.
  • Idempotency: Producers SHOULD send x-idempotency-key (REST) or idempotency (gRPC metadata); ingestion dedupes per (tenantId, key).
  • Problem+JSON: Errors return RFC 7807 problem details with type, title, status, detail, and errors[] { pointer, reason }.
  • Redaction: Write path applies classification/redaction per policy. Reads apply masking profiles (Safe|Support|Investigator|Raw).
  • Integrity: Sealing is asynchronous; verify-on-read is optional and called out explicitly where supported.
  • Pagination: Seek cursors encode (createdAt, auditRecordId); included in query flows.
  • Clocks: createdAt (producer), observedAt (platform), sealedAt (integrity), eligibleAt (retention).
  • Status codes (REST): 2xx (OK/Accepted), 4xx (validation/limits/auth), 5xx (transient). gRPC codes: OK, INVALID_ARGUMENT, ALREADY_EXISTS, RESOURCE_EXHAUSTED, UNAVAILABLE, DEADLINE_EXCEEDED.
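Illustratively, the seek-cursor convention above — an opaque token over (createdAt, auditRecordId) — can be sketched in Python. The base64url/JSON token layout here is an assumption for illustration, not the platform's actual wire format:

```python
import base64
import json

def encode_cursor(created_at: str, audit_record_id: str) -> str:
    """Pack the (createdAt, auditRecordId) seek position into an opaque token."""
    payload = json.dumps(
        {"createdAt": created_at, "auditRecordId": audit_record_id},
        separators=(",", ":"),
    )
    return base64.urlsafe_b64encode(payload.encode()).decode().rstrip("=")

def decode_cursor(cursor: str) -> tuple[str, str]:
    """Reverse of encode_cursor; base64 padding is restored before decoding."""
    padded = cursor + "=" * (-len(cursor) % 4)
    doc = json.loads(base64.urlsafe_b64decode(padded))
    return doc["createdAt"], doc["auditRecordId"]
```

Seek cursors like this keep pagination stable under concurrent appends, because the next page is defined by a position rather than an offset.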

Sample notation (Mermaid)

sequenceDiagram
  autonumber
  actor Client
  participant Gateway
  participant Ingestion
  participant Policy
  participant Storage
  participant Projection
  participant Integrity
  participant Obs as Observability

  Client->>Gateway: POST /audit [h: x-tenant-id, traceparent, x-idempotency-key] [b: AuditRecord]
  Note right of Gateway: AuthN/Z (IdP), rate limiting, tenancy check
  Gateway->>Ingestion: Append(request) [h: forwarded headers]
  Ingestion->>Policy: Evaluate(classify, redact hints)
  Policy-->>Ingestion: decision {classes, redactions}
  Ingestion->>Storage: INSERT AuditRecord (canonical JSON, WORM)
  Storage-->>Ingestion: ↩ ack {auditRecordId}
  Ingestion-->>Gateway: ↩ 202 Accepted {auditRecordId}
  par Async
    Storage-->>Projection: event AuditRecord.Accepted
    Storage-->>Integrity: leaf hash → segment buffer
  and
    Ingestion-->>Obs: metrics/traces/logs
  end
  Projection-->>Projection: upsert read models, advance checkpoint
  Integrity-->>Integrity: seal block, sign, emit ProofComputed
Hold "Alt" / "Option" to enable pan & zoom

Legend

  • Solid arrows: synchronous calls.
  • Dashed arrows (-->>): async publish/consume or responses.
  • par blocks: parallel async work.
  • alt/else blocks: branching (validation errors, retries).
  • loop blocks: retry with backoff.

Reading map (what comes next)

The remaining sections detail each area with a dedicated diagram and callouts:

  1. Ingestion (REST/gRPC/Bus/Actors) — validation, classification/redaction, idempotency
  2. Integrity — chain/segment/block sealing, verification, key rotation
  3. Projections & Search — read models, indexing, checkpoints, pagination
  4. Query & Read — policy-aware masking, verify-on-read, filters & time windows
  5. Export & eDiscovery — job lifecycle, manifests, delivery, legal hold
  6. Policy, Retention & Hold — evaluation, eligibility, purge block
  7. Reliability — retry, DLQ, circuit breaker, compensation, rebuild
  8. Observability — metrics, traces, health, alerts
  9. Admin — onboarding, schema evolution, configuration, partitioning, auto-scaling


Standard Audit Record Ingestion Flow

Canonical online path to append an AuditRecord via the API Gateway. Covers authN/Z, tenancy routing, rate limiting, validation & canonicalization, policy-driven classification/redaction hints, append to WORM storage, and async fan-out (AuditRecord.Accepted, projections, integrity). Emphasizes idempotency and Problem+JSON error semantics for safe retries.


Overview

Purpose: Accept a producer’s audit fact and durably append it to the authoritative store with correct tenancy, correlation, and privacy posture.
Scope: Single-record REST ingestion through the Gateway; includes validation, classification/redaction hints, append, and async fan-out triggers. Excludes gRPC and bus-based ingestion (covered in separate flows).
Context: Entry point for most interactive producers; downstream projections power query/search; integrity sealing is asynchronous.
Key Participants:

  • Client (producer)
  • API Gateway (authN/Z, limits, tenancy)
  • Ingestion Service (validate/canonicalize/classify)
  • Policy Service (classification/redaction hints)
  • Storage Service (authoritative append, WORM)
  • Projection Service (read models; async)
  • Integrity Service (segment/block sealing; async)

Prerequisites

System Requirements

  • API Gateway, Ingestion, Policy, Storage online and reachable
  • TLS enabled end-to-end; trusted IdP/JWT validation configured
  • Network routes opened Gateway → Ingestion → Policy/Storage
  • Schema Registry accessible to Ingestion

Business Requirements

  • Tenant exists and is active; residency and edition set
  • Policy (classification/redaction) published and cacheable
  • Retention policy present (for later lifecycle)
  • Legal holds (if any) indexed (no effect on write, affects lifecycle)

Performance Requirements

  • Gateway rate-limit buckets sized for tenant (burst/sustain)
  • Ingestion p95 latency < 50 ms at target load
  • Payload size ≤ 256 KiB; attributes/fields within limits

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    actor Client
    participant Gateway as API Gateway
    participant Ingestion as Ingestion Service
    participant Policy as Policy Service
    participant Storage as Storage (Authoritative)
    participant Projection as Projection Service
    participant Integrity as Integrity Service

    Client->>Gateway: POST /audit/v1/records<br/>[h: Authorization, x-tenant-id, traceparent, x-idempotency-key]<br/>[b: AuditRecord JSON]
    Note right of Gateway: AuthN (JWT/OIDC) • AuthZ (tenant scope) • Rate limit • Header validation
    Gateway->>Ingestion: Append(request)<br/>[forward headers]
    Ingestion->>Policy: Evaluate(classify/redaction hints)
    Policy-->>Ingestion: decision { classes, redactions }
    Ingestion->>Ingestion: Validate & canonicalize<br/>(size, clocks, action, resource, attrs)
    Ingestion->>Storage: INSERT canonical JSON (WORM)
    Storage-->>Ingestion: ↩ ack { auditRecordId }
    Ingestion-->>Gateway: ↩ 202 Accepted { auditRecordId, status:"Created" }
    par Async fan-out
      Storage-->>Projection: event AuditRecord.Accepted
      Storage-->>Integrity: enqueue leaf → segment
    end
    Note over Projection,Integrity: Projections update read models, Integrity seals blocks later
Hold "Alt" / "Option" to enable pan & zoom

Alternative Paths

  • Duplicate idempotency key: Ingestion returns 202 with status:"Duplicate" and original auditRecordId.
  • Server-assigned ID: If auditRecordId omitted, Ingestion assigns ULID and returns it.
  • Sealing disabled: Integrity branch skipped for tenant/edition; lifecycle proceeds to eligibility without proofs.
  • Partial policy outage: Use last-known policy (stale-tolerant) and tag decision with basis:"Cached".
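The idempotent retry contract above can be sketched from the producer's side in Python. The `send(headers, body)` transport interface and the backoff constants are illustrative stand-ins for any HTTP client, not part of the API:

```python
import time
import uuid

def append_with_retry(send, record, max_attempts=5):
    """POST an AuditRecord, reusing one x-idempotency-key across all attempts.

    `send(headers, body)` is an injected transport returning
    (status, resp_headers, body); a hypothetical interface for illustration.
    """
    headers = {
        "x-idempotency-key": str(uuid.uuid4()),  # fixed for the whole operation
        "content-type": "application/json",
    }
    delay = 0.2
    for _attempt in range(max_attempts):
        status, resp_headers, body = send(headers, record)
        if status in (200, 202):
            return body                      # Created or Duplicate — both terminal
        if status in (429, 503):             # transient: back off, keep the same key
            wait = float(resp_headers.get("Retry-After", delay))
            time.sleep(wait)
            delay = min(delay * 2, 5.0)
            continue
        raise RuntimeError(f"non-retryable status {status}: {body}")
    raise TimeoutError("retry budget exhausted")
```

Because the key is fixed before the first attempt, a retry after a timeout can only yield `status:"Duplicate"`, never a second record.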

Error Paths

sequenceDiagram
    actor Client
    participant Gateway as API Gateway
    participant Ingestion as Ingestion Service
    Client->>Gateway: POST /audit/v1/records
    alt Validation error
        Gateway->>Ingestion: Append(request)
        Ingestion-->>Gateway: ↩ 400 Problem+JSON (action.invalid, payload.tooLarge, ...)
        Gateway-->>Client: ↩ 400 Problem+JSON
    else Rate limited
        Gateway-->>Client: ↩ 429 Problem+JSON + Retry-After
    else Storage unavailable
        Gateway->>Ingestion: Append(request)
        Ingestion-->>Gateway: ↩ 503 Problem+JSON
        Gateway-->>Client: ↩ 503 (retry with backoff)
    end
Hold "Alt" / "Option" to enable pan & zoom

Request/Response Specifications

Input Requirements

| Field | Type | Req. | Description | Validation |
| --- | --- | --- | --- | --- |
| Authorization (header) | string | Y | Bearer JWT | Valid signature; tenant claims |
| x-tenant-id (header) | string | Y | Tenant routing key | ^[A-Za-z0-9._-]{1,128}$ |
| traceparent (header) | string | Y | W3C trace context | 55-char format |
| x-idempotency-key (header) | string | Y | Dedupe key per tenant | ≤128 ASCII visible |
| tenantId | string | Y | Tenant id (body) | Must equal header |
| schemaVersion | string | Y | Payload schema id | auditrecord.v1 (or newer) |
| auditRecordId | ULID | N | Client- or server-assigned id | ULID pattern |
| createdAt | timestamp | Y | Producer time | ≤ now + 2m, ms precision |
| action | string | Y | verb or verb.noun | ^[a-z]+(\.[a-z0-9_-]+)?$ |
| resource.type | string | Y | PascalCase dotted type | ^[A-Z][A-Za-z0-9]*(\.[A-Z][A-Za-z0-9]*)*$ |
| resource.id | string | Y | Opaque id | ≤128, no spaces |
| resource.path | string | N | JSON Pointer | ≤512, normalized |
| actor.id | string | Y | Actor identifier | ≤128, no spaces |
| actor.type | enum | Y | Unknown \| User \| Service \| Job | Enum |
| actor.display | string | N | Friendly name | Masked on read |
| decision.outcome | enum | N | Access verdict | Allow \| Deny \| NotApplicable \| Indeterminate |
| delta.fields | map | N | Field changes | ≤256 entries |
| attributes | map | N | Extra key/values | ≤64 keys; key/val length |
| correlation.traceId | hex | N | Trace id | 32 lowercase hex |
| correlation.requestId | string | N | Client request id | ≤128 |
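A minimal Python sketch of these checks, using the regexes from the Validation column; field coverage is deliberately partial, and the returned entries follow the errors[] { pointer, reason } convention:

```python
import re

# Patterns transcribed from the Validation column above.
ACTION_RE = re.compile(r"^[a-z]+(\.[a-z0-9_-]+)?$")
RESOURCE_TYPE_RE = re.compile(r"^[A-Z][A-Za-z0-9]*(\.[A-Z][A-Za-z0-9]*)*$")
TENANT_RE = re.compile(r"^[A-Za-z0-9._-]{1,128}$")

def validation_errors(record: dict) -> list[dict]:
    """Return Problem+JSON-style errors[] entries; an empty list means valid."""
    errors = []
    if not ACTION_RE.match(record.get("action", "")):
        errors.append({"pointer": "/action", "reason": "regex"})
    rtype = record.get("resource", {}).get("type", "")
    if not RESOURCE_TYPE_RE.match(rtype):
        errors.append({"pointer": "/resource/type", "reason": "regex"})
    if not TENANT_RE.match(record.get("tenantId", "")):
        errors.append({"pointer": "/tenantId", "reason": "regex"})
    return errors
```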

Output Specifications

| Field | Type | Description | Notes |
| --- | --- | --- | --- |
| auditRecordId | ULID | Durable id | Server returns original or assigned |
| status | string | Created or Duplicate | Idempotent semantics |
| observedAt | timestamp | Ingestion time | ms precision |
| traceId | hex32 | Echo for correlation | From traceparent |
| links.self | string | Record URL | REST locator |
| links.operation | string | Idempotency op URL | Stable outcome resource |

Example Payloads

Request

{
  "tenantId": "splootvets",
  "schemaVersion": "auditrecord.v1",
  "createdAt": "2025-10-22T12:00:03.100Z",
  "action": "appointment.update",
  "resource": { "type": "Vetspire.Appointment", "id": "A-9981", "path": "/status" },
  "actor": { "id": "user_123", "type": "User", "display": "A. Smith" },
  "decision": { "outcome": "Allow" },
  "delta": { "fields": { "status": { "before": "Pending", "after": "Booked" } } },
  "attributes": { "client.ip": "203.0.113.42", "client.userAgent": "Mozilla/5.0 ..." },
  "correlation": { "traceId": "3e1f2d0c9b8a7f6e5d4c3b2a19081716", "requestId": "req-7a9f" }
}

Response — 202 Accepted

{
  "auditRecordId": "01JE7K4J9F9D0S6E7X5Q1A3BCP",
  "status": "Created",
  "observedAt": "2025-10-22T12:00:03.300Z",
  "traceId": "3e1f2d0c9b8a7f6e5d4c3b2a19081716",
  "links": {
    "self": "/audit/v1/records/01JE7K4J9F9D0S6E7X5Q1A3BCP",
    "operation": "/audit/v1/operations/prod-ord-9981-v1"
  }
}

Error Handling

Error Scenarios

| Error Code | Scenario | Recovery Action | Retry Strategy |
| --- | --- | --- | --- |
| 400 | Schema/clock/format invalid | Fix request; follow details/pointers | Do not retry until corrected |
| 401 | Invalid/missing JWT | Acquire valid token | Retry after re-auth |
| 403 | Tenant forbidden | Correct tenant or permissions | Do not retry |
| 409 | Idempotency conflict (rare) | Reuse same key; inspect operation link | Safe retry with same key |
| 413 | Payload > 256 KiB | Reduce size / trim delta | Do not retry until reduced |
| 415 | Wrong media type | Use application/json | Retry with correct header |
| 429 | Rate limited/backpressure | Respect Retry-After | Exponential backoff + jitter |
| 503 | Storage/Policy unavailable | Transient outage | Exponential backoff + jitter; reuse idempotency key |
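The retry column above condenses into a small client-side decision helper; a sketch, with illustrative action names that are not part of any contract:

```python
def recovery_action(status: int) -> str:
    """Map the REST error codes from the table above to a client-side action."""
    if status in (429, 503):
        return "retry-with-backoff-same-key"   # transient; preserve x-idempotency-key
    if status == 401:
        return "reauthenticate-then-retry"     # token problem, not a payload problem
    if status in (400, 403, 413, 415):
        return "fix-request-no-retry"          # retrying unchanged input cannot succeed
    if status == 409:
        return "inspect-operation-link"        # idempotency conflict: check the outcome
    return "unknown"
```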

Failure Modes

  • Network Failures: Timeouts, TLS issues → client retries with backoff; preserve x-idempotency-key.
  • Service Unavailability: Return 503 from Gateway; circuit breaker may open.
  • Data Corruption: Validation rejects; Problem+JSON details include errors[].pointer.
  • Policy Violations: Credentials detected → dropped at write; log redactionHint.

Recovery Procedures

  1. Inspect Problem+JSON type, detail, and errors[].
  2. For transient failures, retry with the same idempotency key using backoff; honor Retry-After.
  3. For validation failures, correct the payload (see rules), then resubmit.

Performance Characteristics

Latency Expectations

  • P50: 15–25 ms
  • P95: ≤ 50 ms
  • P99: ≤ 120 ms
  • P99.9: ≤ 300 ms (under burst control)

Throughput Limits

  • Per Tenant (sustain): ~500 rps (edition-dependent)
  • Per Tenant (burst): up to 2,000 rps for 60 s
  • Global Target: ≥ 50k rps across shards

Resource Requirements

  • CPU: Ingestion nodes sized for JSON parse + hashing; vectorized canonicalization where available
  • Memory: Payload buffers ≤ 256 KiB × concurrency; header maps
  • Network: TLS offload at Gateway or service mesh
  • Storage: WAL/redo sized for 2× burst; keep secondary indexes minimal

Scaling Considerations

  • Horizontal: Scale Gateway/Ingestion statelessly (HPA/KEDA based on rps/CPU/queue depth)
  • Vertical: Rarely needed; prefer horizontal
  • Auto-scaling Triggers: rps, p95 latency, queue depth, 429 rate, CPU > 75%

Security & Compliance

Authentication

  • Method: JWT (OIDC); short-lived tokens; clock skew ±60s
  • Token Requirements: Audience/service match; tenant claims present
  • Session Management: Stateless; no cookies

Authorization

  • Permissions: Producer role allowed to audit:append for x-tenant-id
  • Tenant Isolation: RLS enforced in Storage/Projections; headers validated at edge
  • RBAC: Gateway policy + service layer checks

Data Protection

  • Transit: TLS 1.2+; HSTS at edge
  • At Rest: DB/storage encryption; key management via KMS
  • PII Handling: Write-time classification/redaction; credentials dropped; personal/sensitive masked/hashed

Compliance

  • GDPR/HIPAA/SOC2: Audit trail of who appended; immutable WORM; data subject exports via Export flows

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
| --- | --- | --- | --- |
| ingest_requests_total | counter | Count of POSTs | Anomaly vs. baseline |
| ingest_latency_ms | histogram | End-to-end latency | p95 > 50 ms (5m) |
| ingest_payload_bytes | histogram | Payload sizes | > 90th near 256 KiB |
| ingest_rate_limited_total | counter | 429 responses | Spike > 5% |
| storage_errors_total | counter | 5xx from Storage | > 0.5% |
| policy_eval_latency_ms | histogram | Policy call latency | p95 > 30 ms |

Logging Requirements

  • Structured JSON logs; include tenantId, auditRecordId, traceId, idempotencyKey (hash)
  • Mask personal/sensitive values; never log raw credentials
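The "idempotencyKey (hash)" requirement above can be met by logging a digest instead of the raw key; a sketch, where the truncation length is an assumption:

```python
import hashlib

def log_safe_idempotency_key(key: str) -> str:
    """Hash the raw idempotency key so logs can correlate retries
    without persisting the key itself. A truncated digest is still
    stable enough to join log lines on."""
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
```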

Distributed Tracing

  • Propagate traceparent; spans: ingest.request, ingest.validate, ingest.append, policy.evaluate
  • Span attrs: tenant, payloadBytes, status, dedupe="Created|Duplicate"

Health Checks

  • Liveness: process heartbeats
  • Readiness: downstream (Policy/Storage) probes with budgets
  • Dependency: Registry reachability, KMS if signing on write (rare)

Operational Procedures

Deployment

  1. Deploy/roll Gateway and Ingestion behind feature flag audit.ingest.enabled=false
  2. Warm caches (schema, policy); run smoke POST against canary
  3. Flip flag, ramp traffic using traffic splitting (e.g., 10% → 50% → 100%)

Configuration

  • Env Vars: RATE_BURST, RATE_SUSTAIN, MAX_PAYLOAD_BYTES=262144
  • Config: Policy endpoint base URL; schema registry URL
  • Feature Flags: Sealing on write (usually off), request verification levels

Maintenance

  • Rotate tokens/keys; tune rate limits; review metrics for near-limit payloads

Troubleshooting

  • High 400s → inspect Problem+JSON pointers
  • High 429s → increase tenant buckets or advise producers to backoff
  • 5xx spikes → check Storage/Policy dependency health, breaker state

Testing Scenarios

Happy Path Tests

  • Accept minimal valid record; returns 202 Created
  • With server-assigned ULID; returns new auditRecordId
  • Duplicate x-idempotency-key returns status:"Duplicate"

Error Path Tests

  • action.invalid → 400 with pointer /action
  • Payload over 256 KiB → 413
  • Missing/invalid JWT → 401; forbidden tenant → 403
  • Rate limit exceeded → 429 with Retry-After

Performance Tests

  • Sustain 500 rps per tenant; p95 < 50 ms
  • Burst 2k rps per tenant for 60s without error inflation
  • Large but valid payload near limit; still < 50 ms p95

Security Tests

  • Credential key in attributes is dropped/redacted
  • PII masked on read paths (verify via downstream Query)
  • Multi-tenant isolation (no cross-tenant access)

Internal References

External References

  • RFC 7807 (Problem Details for HTTP APIs)
  • W3C Trace Context (traceparent)

Appendices

A. Configuration Examples

  • NGINX/L7 snippet to pass through traceparent, x-tenant-id, x-idempotency-key

B. Troubleshooting Guide

  • Decision tree for 4xx vs 5xx vs 429 responses

C. Performance Benchmarks

  • Latest load test summary attached in CI artifacts

D. Security Checklist

  • No secrets logged
  • Masking rules applied on read
  • RLS enforced in all queries

Batch Audit Record Ingestion Flow

Efficient bulk ingest of many AuditRecord items using multipart upload or presigned object storage. The Gateway creates a batch job, the client uploads JSONL (optionally gzip), and an Ingestion Batch Worker validates, canonicalizes, and appends records to the WORM store with partial-failure reporting, chunking, and resume tokens.


Overview

Purpose: Move large volumes of audit facts into ATP reliably and cost-effectively with resumable uploads and per-record error isolation.
Scope: REST orchestration for batch jobs, uploads (multipart or presigned URLs), background processing, partial failures, status polling, and completion artifacts. Excludes online single-record ingest and streaming bus pipelines.
Context: Preferred for backfills, partner dumps, and nightly loads. Downstream, projections and integrity run asynchronously as with standard ingestion.
Key Participants:

  • Client (uploader)
  • API Gateway (job control, presigned URLs, limits)
  • Object Storage (S3/GCS/Azure Blob; optional path)
  • Ingestion Batch Worker (validate/canonicalize/process chunks)
  • Storage (Authoritative) (WORM append)
  • Integrity Service (hash/segment/block sealing; async)
  • Projection Service (read models; async)

Prerequisites

System Requirements

  • API Gateway, Batch Worker, Storage, Integrity, Projection online
  • TLS end-to-end; object storage reachable from workers
  • IdP configured; JWT audience for Gateway set
  • Schema Registry reachable by workers

Business Requirements

  • Tenant active; residency/edition configured
  • Classification/redaction & retention policies published
  • Legal holds indexed (affects lifecycle, not write)

Performance Requirements

  • Chunk size and worker parallelism tuned (defaults below)
  • Storage capacity sized for expected peak insert rate
  • Backpressure thresholds configured (429/503 policies)

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    actor Client
    participant Gateway as API Gateway
    participant Store as Object Storage
    participant Batch as Ingestion Batch Worker
    participant Storage as Storage (Authoritative)
    participant Projection as Projection Service
    participant Integrity as Integrity Service

    Client->>Gateway: POST /audit/v1/batches { manifest, strategy }
    Gateway-->>Client: ↩ 202 { batchId, uploadPlan, resumeToken }

    alt Presigned strategy
      Client->>Store: PUT parts to presigned URLs (JSONL[.gz])
      Client->>Gateway: POST /audit/v1/batches/{batchId}/finalize
    else Multipart strategy
      Client->>Gateway: POST /audit/v1/batches/{batchId}/upload (multipart)
    end

    Gateway-->>Batch: event Batch.Created { batchId, objectUris }
    Batch->>Batch: Plan chunks (e.g., 5k recs or 16 MiB)
    loop Each chunk
      Batch->>Store: READ chunk bytes (stream)
      Batch->>Batch: Validate & canonicalize each JSONL line
      Batch->>Storage: INSERT valid AuditRecord rows (idempotent)
      Batch-->>Batch: Record per-line status, advance resumeToken
    end
    par Async fan-out for accepted rows
      Storage-->>Projection: AuditRecord.Accepted
      Storage-->>Integrity: enqueue leaf → segment
    end

    Batch-->>Gateway: status { processed, succeeded, failed, resumeToken }
    Gateway-->>Client: ↩ 200/202 GET /batches/{id}/status
    Note over Batch,Client: Completion → summary + downloadable error report for failed lines
Hold "Alt" / "Option" to enable pan & zoom

Alternative Paths

  • Continue-on-error: Process full batch; emit per-line errors; job ends CompletedWithFailures.
  • Halt-on-threshold: Stop when failed/processed ≥ threshold (e.g., 5%); job Aborted.
  • Resume: Client provides resumeToken; worker skips processed chunks.
  • Single-URL manifest: Gateway returns one upload URL; worker enumerates parts by convention.
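The chunk-planning step ("e.g., 5k recs or 16 MiB") amounts to greedy packing under two budgets. A sketch: real workers would plan over streamed JSONL, while this illustration operates on a list of per-line byte lengths:

```python
def plan_chunks(line_sizes, max_records=5000, max_bytes=16 * 1024 * 1024):
    """Greedy chunk planner: close the current chunk when adding the next
    line would exceed either the record-count or byte budget from
    options.chunk. Returns [start, end) line ranges."""
    chunks = []
    start, count, size = 0, 0, 0
    for i, n in enumerate(line_sizes):
        if count and (count + 1 > max_records or size + n > max_bytes):
            chunks.append((start, i))
            start, count, size = i, 0, 0
        count += 1
        size += n
    if count:
        chunks.append((start, start + count))
    return chunks
```

Closing on whichever budget trips first keeps worst-case chunk memory bounded even when line sizes vary widely.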

Error Paths

sequenceDiagram
    actor Client
    participant Gateway as API Gateway
    participant Batch as Ingestion Batch Worker

    Client->>Gateway: POST /audit/v1/batches { manifest }
    alt Invalid manifest
      Gateway-->>Client: ↩ 400 Problem+JSON (manifest.invalid)
    else Failure threshold exceeded
      Batch-->>Gateway: status { state:"Aborted", reason:"FailureThreshold" }
      Gateway-->>Client: ↩ 409 Problem+JSON + link:errorReport
    else Storage unavailable
      Batch-->>Gateway: status { state:"Retrying", backoff:"exponential" }
      Gateway-->>Client: ↩ 503 on status until recovery
    end
Hold "Alt" / "Option" to enable pan & zoom

Request/Response Specifications

Input Requirements

| Field | Type | Required | Description | Validation |
| --- | --- | --- | --- | --- |
| Authorization (header) | string | Y | Bearer JWT | Valid signature; tenant claim |
| x-tenant-id (header) | string | Y | Tenant routing | ^[A-Za-z0-9._-]{1,128}$ |
| traceparent (header) | string | Y | W3C trace context | 55-char format |
| x-idempotency-key (header) | string | Y | Job creation dedupe | ≤128 ASCII |
| strategy | enum | Y | Presigned or Multipart | Enum |
| manifest.files[] | array | Y | Object URIs or file descriptors | ≤256 files |
| manifest.format | enum | Y | Jsonl or JsonlGzip | Enum |
| manifest.schemaVersion | string | Y | Expected schema | e.g., auditrecord.v1 |
| options.chunk.maxRecords | int | N | Records per chunk | 1–10,000 (default 5,000) |
| options.chunk.maxBytes | int | N | Bytes per chunk | 1–32 MiB (default 16 MiB) |
| options.failure.mode | enum | N | Continue/HaltOnThreshold | Default Continue |
| options.failure.threshold | number | N | 0.0–1.0 | Default 0.05 |
| options.parallelism | int | N | Worker concurrency | 1–32 (edition gated) |

Output Specifications

| Field | Type | Description | Notes |
| --- | --- | --- | --- |
| batchId | ULID | Batch identifier | Returned on create |
| uploadPlan | object | Presigned URLs or upload endpoints | May include part sizes |
| resumeToken | string | Opaque position token | For resume |
| state | enum | Created \| Uploading \| Processing \| Retrying \| Completed \| CompletedWithFailures \| Aborted \| Failed | From status API |
| counters | object | {processed, succeeded, failed, bytesRead} | Status API |
| errorReport | url | Download failed-lines report | On completion/abort |

Example Payloads

Create batch (presigned)

{
  "strategy": "Presigned",
  "manifest": {
    "format": "JsonlGzip",
    "schemaVersion": "auditrecord.v1",
    "files": [
      { "name": "part-0001.jsonl.gz", "sizeBytes": 104857600 },
      { "name": "part-0002.jsonl.gz", "sizeBytes": 83886080 }
    ]
  },
  "options": {
    "chunk": { "maxRecords": 5000, "maxBytes": 16777216 },
    "failure": { "mode": "Continue", "threshold": 0.05 },
    "parallelism": 8
  }
}

Create response

{
  "batchId": "01JE8A3GZ8X0E9K3N5R6V7B8C9",
  "uploadPlan": {
    "presigned": [
      { "name": "part-0001.jsonl.gz", "method": "PUT", "url": "https://store/..." },
      { "name": "part-0002.jsonl.gz", "method": "PUT", "url": "https://store/..." }
    ]
  },
  "resumeToken": "r-01je8a3g-0000"
}

Status response

{
  "batchId": "01JE8A3GZ8X0E9K3N5R6V7B8C9",
  "state": "CompletedWithFailures",
  "counters": { "processed": 180000, "succeeded": 176400, "failed": 3600, "bytesRead": 183500800 },
  "resumeToken": "r-01je8a3g-ffff",
  "errorReport": "/audit/v1/batches/01JE8A3G.../errors?profile=Safe"
}

Error Handling

Error Scenarios

| Error Code | Scenario | Recovery Action | Retry Strategy |
| --- | --- | --- | --- |
| 400 | Invalid manifest/options | Fix payload (schema, limits) | No retry until corrected |
| 401/403 | AuthN/Z failure | Acquire token / permissions | Retry after fix |
| 409 | Duplicate x-idempotency-key | Use status endpoint / operation link | Safe to reuse key |
| 413 | Part too large | Reduce part size | Re-upload affected part |
| 422 | Failure threshold exceeded | Inspect error report; fix data | New batch recommended |
| 429 | Gateway/worker backpressure | Honor Retry-After; slow uploads | Exponential backoff + jitter |
| 503 | Storage/object store unavailable | Wait for recovery | Workers auto-retry chunks |

Failure Modes

  • Line-level validation failures: recorded {line, pointer, reason}; good lines continue.
  • Chunk retry: transient errors → chunk-level retries with capped attempts.
  • Poison lines: after N retries, line written to dead-letter file in the error report.

Recovery Procedures

  1. GET status; if CompletedWithFailures, download errorReport.
  2. Fix rejected lines; re-upload as new batch or incremental patch.
  3. If Aborted due to threshold, pre-clean data or lower threshold; start a new batch.
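The threshold rule from options.failure can be sketched as a terminal-state decision; state names come from the status enum, and the defaults mirror the spec's values:

```python
def batch_state(processed: int, failed: int,
                mode: str = "Continue", threshold: float = 0.05) -> str:
    """Decide the terminal state of a batch from its counters.

    HaltOnThreshold aborts once failed/processed meets the threshold;
    otherwise the job completes, flagged when any lines were rejected.
    """
    if mode == "HaltOnThreshold" and processed and failed / processed >= threshold:
        return "Aborted"
    if failed == 0:
        return "Completed"
    return "CompletedWithFailures"
```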

Performance Characteristics

Latency Expectations

  • Job creation: ~10–50 ms
  • Per-chunk processing: target ≤ 2 s for 5k records
  • End-to-end: proportional to data volume and parallelism

Throughput Limits

  • Worker ingest: ≥ 3k rps per shard sustained (shared with online writes)
  • Per-job parallelism: default 8 chunks in flight (edition gated)
  • Upload: presigned PUT up to provider limits; prefer 8–16 MiB parts

Resource Requirements

  • CPU: JSON parse + hashing; concurrency N × vCPU
  • Memory: streaming parse; per-chunk buffers (≤ 16–32 MiB each)
  • Network: high egress from object store to workers; colocate where possible
  • Storage: WAL sized for burst; keep secondary indexes minimal on authoritative store

Scaling Considerations

  • Horizontal: scale workers by queue depth and chunk latency
  • Auto-scaling triggers: backlog age, running jobs, p95 chunk duration, CPU > 75%
  • Backpressure: workers advertise capacity; Gateway throttles create/upload

Security & Compliance

Authentication

  • JWT (OIDC) to create/manage batches; presigned URLs for object store writes (scoped, short-lived).

Authorization

  • Require audit:batch:create for tenant; status and error report scoped to same tenant and batch.

Data Protection

  • Transit: TLS 1.2+; presigned HTTPS only
  • At Rest: object storage + DB encryption; server-side KMS keys
  • PII: same write-time classification/redaction as standard ingest (no raw credentials persisted)

Compliance

  • Batch operations are audited: who created, uploaded, resumed, and downloaded error reports.

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
| --- | --- | --- | --- |
| batch_created_total | counter | Batches created | Anomaly vs baseline |
| batch_records_processed_total | counter | Lines processed | Drops or stalls |
| batch_failures_total | counter | Per-line rejects | > 2% sustained |
| batch_chunk_latency_ms | histogram | Chunk processing time | p95 > 2 s |
| batch_inflight_jobs | gauge | Active batches | Capacity saturation |
| batch_bytes_read | counter | Input bytes | Sudden spikes |

Logging Requirements

  • Structured logs with batchId, lineNo, error pointer, reason; mask sensitive values.

Distributed Tracing

  • Root span batch.create; child spans per chunk (batch.process.chunk) including chunkId, records, bytes.

Health Checks

  • Readiness includes object store access, Storage connectivity, Schema Registry reachability.

Operational Procedures

Deployment

  1. Roll out Batch Worker with feature flag audit.batch.enabled=false.
  2. Validate presigned URL issuance in non-prod.
  3. Enable flag; ramp per-tenant concurrency caps.

Configuration

  • Env Vars: BATCH_MAX_PARALLELISM, BATCH_CHUNK_MAX_BYTES, BATCH_CHUNK_MAX_RECORDS, BATCH_FAILURE_THRESHOLD
  • Storage: connection pools sized for concurrent inserts
  • Object Store: bucket/container, lifecycle policy for temp uploads and error reports

Maintenance

  • Periodic cleanup of stale, incomplete batches and expired presigned URLs.
  • Rotate KMS keys as per policy.

Troubleshooting

  • High batch_failures_total → download error report; inspect common pointers.
  • Slow chunks → reduce chunk size or increase parallelism; check DB bottlenecks.
  • Frequent 503 → verify storage health and worker retry logs.

Testing Scenarios

Happy Path Tests

  • Create presigned batch; upload two parts; completion with zero failures
  • Multipart upload success path with server parsing
  • Resume from resumeToken after intentional worker restart

Error Path Tests

  • Invalid manifest → 400 with pointer to failing field
  • Failure threshold exceeded → job Aborted, 409 on finalize
  • Object store permission denied → 403 on PUT, recover with new presigned URL

Performance Tests

  • 100M records across 20 files; verify throughput and stability
  • Chunk size sweep (4–32 MiB) to tune p95
  • Parallel jobs from multiple tenants without starvation

Security Tests

  • Presigned URL expiry respected; uploads fail after TTL
  • Error report redacts/masks PII appropriately
  • Tenant isolation—no cross-tenant batch visibility

Internal References

  • gRPC Service Ingestion Flow
  • Service Bus (MassTransit) Ingestion Flow
  • Audit Record Projection Update Flow

External References

  • Provider docs for presigned URLs (S3/GCS/Azure Blob)
  • RFC 7231 (HTTP semantics) for 202/409/413 usage

Appendices

A. Minimal JSONL Example (uncompressed)

{"tenantId":"acme","schemaVersion":"auditrecord.v1","createdAt":"2025-10-22T12:00:00.000Z","action":"user.create","resource":{"type":"Iam.User","id":"U-1"},"actor":{"id":"svc_gw","type":"Service"}}
{"tenantId":"acme","schemaVersion":"auditrecord.v1","createdAt":"2025-10-22T12:00:01.000Z","action":"appointment.update","resource":{"type":"Vetspire.Appointment","id":"A-2"},"actor":{"id":"user_123","type":"User"},"delta":{"fields":{"status":{"before":"Pending","after":"Booked"}}}}

B. Error Report Schema (per-line)

{
  "batchId": "01JE8A3GZ8X0E9K3N5R6V7B8C9",
  "summary": { "processed": 100000, "succeeded": 98400, "failed": 1600 },
  "errors": [
    { "line": 42, "pointer": "/action", "reason": "regex", "code": "action.invalid", "rawSnippet": "..." }
  ]
}

C. Resume Token Example

{ "batchId": "01JE8A3G...", "chunk": 128, "offset": 7340032, "file": "part-0002.jsonl.gz" }
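A worker restarting mid-batch consumes a token like the one above to continue where it left off. The sketch below is a minimal Python illustration, assuming the field names from this appendix (`batchId`, `chunk`, `offset`, `file`) and that `offset` is a byte offset into the part file; a real worker would also validate `batchId` against the job record before seeking.

```python
import json

def resume_position(token_json: str) -> tuple[str, int]:
    """Parse a resume token and return (file, byte offset) to continue from.

    Shape assumed from Appendix C; raises ValueError on a malformed token
    so the worker can fall back to restarting the chunk.
    """
    token = json.loads(token_json)
    for key in ("batchId", "chunk", "offset", "file"):
        if key not in token:
            raise ValueError(f"resume token missing {key!r}")
    if token["offset"] < 0:
        raise ValueError("offset must be non-negative")
    return token["file"], token["offset"]

# Usage with a token like the example above ("01JE8A3G" is a placeholder id):
part, offset = resume_position(
    '{"batchId":"01JE8A3G","chunk":128,"offset":7340032,"file":"part-0002.jsonl.gz"}'
)
```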

Audit Record Validation & Classification Flow

Applies schema/limits validation, canonicalization, and policy-driven classification & redaction before persisting an AuditRecord. Ensures deterministic normalization, consistent privacy posture, and auditable decisions that accompany the record through its lifecycle.


Overview

Purpose: Validate and normalize incoming audit facts, classify data sensitivity, and apply redaction actions prior to append.
Scope: Ingestion-time validation/canonicalization, policy evaluation, classification flags, redaction (drop/mask/hash/tokenize), decision auditing. Excludes post-read masking (covered in Query flows) and integrity/projection specifics.
Context: Runs during Standard/Batch ingestion just before the authoritative append. Outputs include normalized payload, DataClass flags, RedactionHints, and a policy decision trail.
Key Participants:

  • Ingestion Service (validator/canonicalizer/orchestrator)
  • Schema Registry (JSON Schema/contract resolution)
  • Policy Service (classification & redaction policy)
  • Classification Engine (PII/secret detectors, patterns)
  • Redaction Service (hash/mask/tokenize/drop transforms)
  • Storage (Authoritative) (WORM append with decision audit)

Prerequisites

System Requirements

  • Ingestion reachable to Schema Registry and Policy endpoints
  • Policy/Classification/Redaction services healthy (or cached policy available)
  • Clock sync within ±60s (for timestamp validations)
  • TLS enabled; service identities trusted

Business Requirements

  • Tenant active; edition/residency known (affects policy set)
  • Current Policy revision published; cache TTL configured
  • Data classification catalog aligned with Data Model

Performance Requirements

  • Validation + policy evaluation p95 ≤ 30 ms per record
  • Classification engine p95 ≤ 10 ms for typical payloads
  • End-to-end ingest validation budget p95 ≤ 50 ms

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    participant Ingestion as Ingestion Service
    participant Registry as Schema Registry
    participant Policy as Policy Service
    participant Classify as Classification Engine
    participant Redact as Redaction Service
    participant Storage as Storage (Authoritative)

    Ingestion->>Registry: Resolve schema (auditrecord.v1)
    Registry-->>Ingestion: ↩ schema (cacheable)

    Ingestion->>Ingestion: Structural validate + limits (size, clocks)
    Ingestion->>Ingestion: Canonicalize (strings NFC, action, resource.path)

    Ingestion->>Policy: Evaluate(tenant, edition, payload summary)
    Policy-->>Ingestion: ↩ decision {classes, actions, revision, basis:"Live"}

    Ingestion->>Classify: Detect PII/Secrets (hints, patterns)
    Classify-->>Ingestion: ↩ findings {keys, types, confidence}

    Ingestion->>Redact: Apply(actions, findings) → transform fields
    Redact-->>Ingestion: ↩ normalized payload + redactionHints

    Ingestion->>Storage: INSERT payload + {classes, redactionHints, policyRevision}
    Storage-->>Ingestion: ↩ ack {auditRecordId}

Alternative Paths

  • Cached policy: If Policy unavailable, use last-known decision template (basis:"Cached") with TTL; record basis in decision trail.
  • Dry-run mode: Apply classification only; annotate recommended actions without mutating payload (used in partner onboarding).
  • Producer hints: Producer supplies dataClass hints; engine verifies/augments but never downgrades sensitivity.
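The "never downgrade" rule for producer hints can be expressed as a set union plus a max over a sensitivity ordering. This is a sketch under an assumed three-level ranking (Operational < Personal < Sensitive); the authoritative class catalog lives in the Data Model.

```python
# Assumed ordering for illustration; the real catalog is in the Data Model.
RANK = {"Operational": 0, "Personal": 1, "Sensitive": 2}

def merge_classes(producer_hints: set[str], engine_findings: set[str]) -> set[str]:
    """Union of hints and findings: the engine may add classes (augment),
    but a producer hint can never remove one (downgrade)."""
    return producer_hints | engine_findings

def effective_level(classes: set[str]) -> str:
    """Highest-sensitivity class wins for gating decisions."""
    return max(classes, key=lambda c: RANK[c])
```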

Error Paths

sequenceDiagram
    participant Ingestion as Ingestion Service
    participant Registry as Schema Registry
    participant Policy as Policy Service

    Ingestion->>Registry: Resolve schema
    alt Schema mismatch/invalid
      Registry-->>Ingestion: ↩ error(schema.invalid)
      Ingestion-->>Client: ↩ 400 Problem+JSON (pointers)
    else Policy hard outage and no cache
      Ingestion->>Policy: Evaluate(...)
      Policy-->>Ingestion: ↩ 503
      Ingestion-->>Client: ↩ 503 Problem+JSON (retry with idempotency)
    end

Request/Response Specifications

This flow executes inside ingestion. External interfaces (e.g., REST /audit/v1/records) are shown for the fields pertinent to validation & classification.

Input Requirements

| Field | Type | Required | Description | Validation |
|---|---|---|---|---|
| schemaVersion | string | Y | Payload contract id | Known & active in Registry |
| createdAt | timestamp | Y | Producer clock | ISO-8601 UTC, ms; ≤ now+2m |
| effectiveAt | timestamp | N | Effect time | createdAt |
| action | string | Y | verb or verb.noun | `^[a-z]+(\.[a-z0-9_-]+)?$` |
| resource.type | string | Y | Dotted PascalCase type | `^[A-Z][A-Za-z0-9]*(\.[A-Z][A-Za-z0-9]*)*$` |
| resource.id | string | Y | Opaque id | ≤128, visible ASCII |
| resource.path | string | N | JSON Pointer | normalized, ≤512 |
| actor.id | string | Y | Actor identifier | ≤128 |
| actor.type | enum | Y | Unknown \| User \| Service \| Job | Enum |
| attributes.* | map | N | Extra k/v pairs | ≤64 keys; key ≤64, val ≤1024 |
| delta.fields | map | N | Field-level changes | ≤256 entries |
| correlation.traceId | hex | N | Trace correlation | 32 lowercase hex |
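The action, resource.type, and timestamp rules above can be checked with simple patterns. A minimal sketch, assuming the regexes and the ≤ now+2m clock-skew rule exactly as stated; the real validator also enforces the size/key limits.

```python
import re
from datetime import datetime, timedelta, timezone

ACTION_RE = re.compile(r"^[a-z]+(\.[a-z0-9_-]+)?$")
RESOURCE_TYPE_RE = re.compile(r"^[A-Z][A-Za-z0-9]*(\.[A-Z][A-Za-z0-9]*)*$")
MAX_CLOCK_SKEW = timedelta(minutes=2)  # createdAt must be <= now + 2m

def validate(record: dict) -> list[str]:
    """Return JSON Pointers of failing fields (empty list means valid)."""
    violations = []
    if not ACTION_RE.match(record.get("action", "")):
        violations.append("/action")
    if not RESOURCE_TYPE_RE.match(record.get("resource", {}).get("type", "")):
        violations.append("/resource/type")
    created = datetime.fromisoformat(record["createdAt"].replace("Z", "+00:00"))
    if created > datetime.now(timezone.utc) + MAX_CLOCK_SKEW:
        violations.append("/createdAt")
    return violations
```

Failing pointers feed straight into the Problem+JSON `errors[].pointer` field used by the 400 responses.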

Output Specifications

| Field | Type | Description | Notes |
|---|---|---|---|
| normalizedPayload | object | Canonical JSON after transforms | JCS canonical form |
| classes | bitset/array | DataClass flags | e.g., Personal \| Sensitive |
| redactionHints[] | array | Where/why redacted | { pointer, action } |
| policyRevision | string | Policy rev used | rev-YYYYMMDD-n |
| policyBasis | enum | Live \| Cached \| DryRun | Audit of basis |
| violations[] | array | Validation/policy errors | For 4xx generation |

Example Payloads

Input (pre-normalization)

{
  "schemaVersion": "auditrecord.v1",
  "createdAt": "2025-10-22T12:00:03.100Z",
  "action": "User.Create",
  "resource": { "type": "Iam.User", "id": " U-1001 ", "path": "/name" },
  "actor": { "id": "svc_gw", "type": "Service", "display": "ingress-gw" },
  "attributes": {
    "email": "alice@example.com",
    "password": "hunter2",
    "client.ip": "2001:db8::1"
  }
}

Normalized + decision (stored)

{
  "schemaVersion": "auditrecord.v1",
  "createdAt": "2025-10-22T12:00:03.100Z",
  "action": "user.create",
  "resource": { "type": "Iam.User", "id": "U-1001", "path": "/name" },
  "actor": { "id": "svc_gw", "type": "Service", "display": "ingress-gw" },
  "attributes": {
    "email": "sha256:2c26b46b68ffc68ff99b453c1d304134",
    "client.ip": "2001:db8::/64"
  },
  "_decision": {
    "classes": ["Personal", "Sensitive"],
    "redactionHints": [
      { "pointer": "/attributes/password", "action": "Drop" },
      { "pointer": "/attributes/email", "action": "Hash" },
      { "pointer": "/attributes/client.ip", "action": "Mask" }
    ],
    "policyRevision": "rev-20251022-1",
    "policyBasis": "Live"
  }
}
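The Drop/Hash/Mask actions in the decision above can be sketched as pure transforms over the attributes map. Assumptions in this sketch: unsalted SHA-256 for Hash (the real policy applies salt/pepper where applicable) and /64 network masking for IPv6 (/24 assumed for IPv4).

```python
import hashlib
import ipaddress

def hash_value(value: str) -> str:
    """One-way hash; real deployments apply the salt/pepper policy."""
    return "sha256:" + hashlib.sha256(value.encode()).hexdigest()

def mask_ip(value: str, v6_prefix: int = 64) -> str:
    """Collapse an IP to its network: /64 for IPv6, /24 assumed for IPv4."""
    ip = ipaddress.ip_address(value)
    prefix = v6_prefix if ip.version == 6 else 24
    return str(ipaddress.ip_network(f"{value}/{prefix}", strict=False))

def redact(attributes: dict, hints: list[dict]) -> dict:
    """Apply redaction hints of shape {pointer, action} to /attributes/* keys."""
    out = dict(attributes)
    for hint in hints:
        key = hint["pointer"].removeprefix("/attributes/")
        if key not in out:
            continue
        if hint["action"] == "Drop":
            del out[key]
        elif hint["action"] == "Hash":
            out[key] = hash_value(out[key])
        elif hint["action"] == "Mask":
            out[key] = mask_ip(out[key])
    return out
```

Applied to the input example, `password` disappears, `email` becomes a `sha256:` digest, and `2001:db8::1` collapses to `2001:db8::/64`, matching the stored record above.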

Error Handling

Error Scenarios

| Error Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Schema/shape invalid | Fix payload per pointers | No retry until corrected |
| 400 | Limits exceeded (size/keys/delta) | Reduce payload size/keys | No retry until corrected |
| 422 | Policy violation (forbidden fields) | Remove/transform offending fields | Retry after fix |
| 503 | Policy/Registry unavailable & no cache | Wait for recovery | Retry with same idempotency key |
| 409 | Policy revision conflict (rare) | Resubmit; server reconciles | Safe retry (idempotent) |

Failure Modes

  • Secret detected: Field dropped; hint recorded; no write-time failure unless configured “fail-closed”.
  • Classifier ambiguity: Lowest-risk action chosen (mask/hash) and flagged for review.
  • Cache staleness: Decision marked basis:"Cached"; async audit triggers re-eval if needed.

Recovery Procedures

  1. If 4xx, inspect Problem+JSON errors[].pointer and correct data.
  2. If 503, retry with backoff; preserve idempotency key.
  3. If repeated classifier ambiguities, update policy patterns; redeploy.

Performance Characteristics

Latency Expectations

  • Validation + Canonicalization: p95 ≤ 20 ms
  • Policy Evaluation: p95 ≤ 30 ms (local cache hit ≤ 5 ms)
  • Classification/Redaction: p95 ≤ 10 ms typical payloads

Throughput Limits

  • Designed to sustain the same per-tenant ingest targets as Standard Ingestion (e.g., 500 rps), bounded by policy eval capacity.

Resource Requirements

  • CPU for JSON parsing and pattern matching; memory for small transient field buffers (< 512 KiB).
  • Optional vectorized hashing for tokenization.

Scaling Considerations

  • Scale Ingestion horizontally; cache policy decisions per-tenant.
  • Separate classifier pool if heavy patterns enabled.

Security & Compliance

Authentication

  • mTLS/service identity between Ingestion and Policy/Classification/Redaction services.

Authorization

  • Ingestion authorized to access tenant-scoped policies only.

Data Protection

  • Secrets never persisted; PII transformed per policy before write.
  • Hashing uses approved algorithms (e.g., SHA-256 with salt/pepper policy where applicable).

Compliance

  • Decision trail persisted (policyRevision, policyBasis, redactionHints) to support audits and DSAR exports.

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| validation_failures_total | counter | Number of 4xx validations | Spike > baseline |
| policy_eval_latency_ms | histogram | Policy call latency | p95 > 30 ms |
| redactions_applied_total | counter | Actions applied | Sudden drop (policy drift) |
| classified_records_total | counter | Records with classes | Monotonic vs ingest |
| cached_policy_decisions_total | counter | Cached-basis uses | > 5% sustained |

Logging Requirements

  • Structured logs include tenantId, auditRecordId (if available), policyRevision, policyBasis, and summarized redactionHints (no raw data).

Distributed Tracing

  • Spans: ingest.validate, policy.evaluate, classify.detect, redact.apply; attributes: tenant, payloadBytes, basis.

Health Checks

  • Readiness checks: Registry reachability, Policy cache warmness, classifier models loaded.

Operational Procedures

Deployment

  1. Deploy Ingestion with feature flag policy.eval.enabled=true, redaction.apply.enabled=true.
  2. Warm policy cache for top tenants; prefetch schema versions.
  3. Flip traffic gradually and watch latency/4xx/5xx rates.

Configuration

  • Env Vars: POLICY_BASE_URL, POLICY_CACHE_TTL, CLASSIFIER_TIMEOUT_MS, REDACTION_MODE (Apply|DryRun).
  • Patterns: versioned classifier pattern sets per tenant/edition.

Maintenance

  • Rotate hashing salts/peppers per schedule; invalidate caches.
  • Refresh classifier patterns as policies evolve.

Troubleshooting

  • High 400s: inspect pointers; verify schema version drift.
  • High cached-basis usage: Policy outage or network; check health and TTLs.
  • Unexpected PII in reads: verify redaction applied and read-profile masking.

Testing Scenarios

Happy Path Tests

  • Valid payload normalized; policy Live; redactions applied; append succeeds
  • Producer hints merged; never downgrade sensitivity
  • Cached policy basis used during brief outage; append still succeeds

Error Path Tests

  • Schema validation failure → 400 with pointers
  • Forbidden field by policy → 422 with pointer
  • Policy outage with empty cache → 503

Performance Tests

  • p95 validation+policy ≤ 50 ms at 500 rps/tenant
  • Classifier throughput with large attributes maps

Security Tests

  • Secrets dropped, not logged
  • PII hashing/tokenization conforms to policy (golden samples)
  • Authorization scoping of policy endpoints

Internal References

  • Batch Audit Record Ingestion Flow
  • Data Redaction Flow (Read)

External References

  • RFC 8785 (JSON Canonicalization Scheme)
  • W3C Trace Context (for correlation)

Appendices

A. Common Validation Rules (excerpt)

  • No NaN/Infinity; UTF-8, strings normalized to NFC; key set size ≤ 64; payload ≤ 256 KiB.

B. DataClass Examples

  • Personal: name, email; Sensitive: secrets, tokens; Operational: IP/UA.

C. Redaction Actions

  • Drop (remove), Mask (partial), Hash (one-way), Tokenize (reversible, vault-backed).

Audit Record Integrity Chain Flow

Creates a tamper-evidence chain for accepted audit facts. Each persisted AuditRecord becomes a leaf hash, batched into segments (Merkle trees), then sealed into blocks signed by KMS. Proof artifacts are written to the Evidence Store, a reference is attached to the record, and Integrity.ProofComputed is emitted.


Overview

Purpose: Guarantee immutability-at-rest by linking records into signed, verifiable chains with exportable proofs.
Scope: Post-append integrity processing: leaf hashing, segment buffering, Merkle root computation, block sealing/signing, evidence persistence, record back-reference, and event publication. Excludes verify-on-read (covered in a separate flow).
Context: Runs asynchronously after AuditRecord.Accepted. Segments seal on size/age thresholds. Blocks form a forward-only chain with PrevBlockRoot.
Key Participants:

  • Storage (Authoritative) — source of accepted records
  • Integrity Service — orchestrates hashing, sealing, signing
  • KMS — signs block headers; manages key rotation
  • Evidence Store — durable proofs (segments/blocks/manifests)
  • Projection Service — indexes proof refs for reads/search (optional)
  • Event Bus — publishes Integrity.ProofComputed

Prerequisites

System Requirements

  • Integrity workers online; access to Storage and Evidence Store
  • KMS key (current + optional previous for dual-verify window) available
  • Time sync within ±60s across services
  • Reliable message delivery from Storage to Integrity

Business Requirements

  • Tenant configured with integrity policy (segment size/age, edition/residency)
  • Retention rules do not remove proofs before data eligibility
  • Legal holds respected (proofs retained regardless)

Performance Requirements

  • Seal latency SLO: p95 ≤ 120s from Accepted to ProofComputed
  • Integrity throughput sized for ingest peak × safety margin (e.g., 1.5×)
  • Evidence Store write amplification budgeted

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    participant Storage as Storage (Authoritative)
    participant Integrity as Integrity Service
    participant KMS as KMS
    participant Evidence as Evidence Store
    participant Bus as Event Bus
    participant Projection as Projection Service

    Storage-->>Integrity: AuditRecord.Accepted { auditRecordId, tenantId, canonicalBytesRef }
    Integrity->>Integrity: LeafHash = SHA-256(canonicalBytes)
    Integrity->>Integrity: Append leaf to SegmentBuffer(tenant, shard)
    alt Seal threshold met (size or age)
        Integrity->>Integrity: MerkleRoot = merkle(leafHashes)
        Integrity->>KMS: Sign(BlockHeader { SegmentId, MerkleRoot, PrevBlockRoot })
        KMS-->>Integrity: ↩ Signature { keyId, sig }
        Integrity->>Evidence: Store { Segment, BlockHeader, Signature }
        Evidence-->>Integrity: ↩ EvidenceRef { segmentUri, blockUri }
        Integrity-->>Storage: Write IntegrityRef on records in segment
        Integrity-->>Bus: Publish Integrity.ProofComputed { tenantId, segmentId, blockId }
        Bus-->>Projection: Event fan-out (optional)
    else Buffer continues
        Integrity->>Integrity: Wait for more leaves or seal timeout
    end

Alternative Paths

  • Time-based seal: If size threshold not reached within sealMaxAge, force seal to bound verification lag.
  • Dual-sign window: During key rotation, blocks are signed with new key, and verifiers accept old or new keyId.
  • Cross-region catch-up: If region falls behind, segments seal independently; later anchor block links chains (see DR flow).

Error Paths

sequenceDiagram
    participant Integrity as Integrity Service
    participant KMS as KMS
    participant Evidence as Evidence Store

    Integrity->>KMS: Sign(BlockHeader)
    alt KMS unavailable
        KMS-->>Integrity: ↩ 503
        Integrity->>Integrity: Retry with backoff, keep SegmentBuffer open
    else Signature reject
        KMS-->>Integrity: ↩ error(key.invalid)
        Integrity->>Integrity: Quarantine segment, raise alert
    end

    Integrity->>Evidence: Store proofs
    alt Evidence store error
        Evidence-->>Integrity: ↩ 503
        Integrity->>Integrity: Retry, if max attempts → DLQ & operator action
    end

Request/Response Specifications

The chain creation is internal, but two public/operational surfaces are relevant: the event and the evidence retrieval API.

Input Requirements (event consumed by Integrity)

| Field | Type | Required | Description | Validation |
|---|---|---|---|---|
| auditRecordId | ULID | Y | Record identifier | Exists in Storage |
| tenantId | string | Y | Tenant scope | Valid tenant |
| canonicalBytesRef | uri | Y | Pointer to canonical JSON | Dereferenceable |
| createdAt | timestamp | Y | Record time | ISO-8601 UTC |
| observedAt | timestamp | Y | Ingestion time | ISO-8601 UTC |

Output Specifications

Event: Integrity.ProofComputed

| Field | Type | Description | Notes |
|---|---|---|---|
| tenantId | string | Tenant | |
| segmentId | ULID | Sealed segment id | |
| blockId | ULID | Block id | |
| keyId | string | Signing key identifier | From KMS |
| merkleRoot | hex | Root hash | SHA-256 |
| recordRange | object | {fromId, toId} | Optional |
| evidence | object | {segmentUri, blockUri} | Evidence Store refs |
| sealedAt | timestamp | Seal time | UTC |

API: GET /integrity/v1/proofs/{auditRecordId}

| Field | Type | Description | Notes |
|---|---|---|---|
| auditRecordId | path | Record id | ULID |
| include | query | leaf \| segment \| block \| all | Optional |

Response (200)

{
  "auditRecordId": "01JE9C5V6A7B8C9D0E1F2G3H4I",
  "leaf": { "hash": "sha256:ab…", "position": 128, "segmentId": "01JE9C6…" },
  "segment": { "merkleRoot": "sha256:cd…", "proofPath": ["ef…","01…"] },
  "block": { "blockId": "01JE9C7…", "prevBlockRoot": "sha256:12…", "signature": { "keyId": "kms-2025-10", "sig": "MEUCIQ…" } },
  "sealedAt": "2025-10-22T12:01:45.120Z"
}
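A client holding this response can recompute the segment root from the leaf hash and `proofPath` without trusting the server. The sketch below assumes one common encoding: siblings listed bottom-up, with the leaf's position bits selecting left/right concatenation at each level. The actual proof encoding is defined in the Integrity spec.

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def verify_proof(leaf_hash: bytes, position: int, proof_path: list[bytes],
                 expected_root: bytes) -> bool:
    """Recompute the Merkle root: at each level, the position's low bit
    says whether our node is the right (1) or left (0) child."""
    node = leaf_hash
    for sibling in proof_path:
        if position & 1:
            node = sha256(sibling + node)
        else:
            node = sha256(node + sibling)
        position >>= 1
    return node == expected_root
```

If the recomputed root matches `segment.merkleRoot`, the client then checks the block signature over that root with the published key for `keyId`.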

Error Handling

Error Scenarios

| Error Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Bad include value / malformed id | Correct request (ULID/enum) | No retry until corrected |
| 404 | Record or proof not found (not yet sealed or purged) | Poll later or verify eligibility | Retry after backoff |
| 409 | Append attempt to sealed segment (internal) | Start new segment; do not mutate sealed | N/A (system fix) |
| 422 | Signature cannot be generated due to key policy mismatch | Adjust policy / rotate properly | Retry after policy fix |
| 429 | Integrity backlog/backpressure | System scales workers | Automatic; client retries evidence GET |
| 503 | KMS/Evidence store unavailable | Wait for recovery | Exponential backoff + jitter |

Failure Modes

  • Segment overflow beyond configured max leaves: immediate seal and roll to next segment.
  • KMS key disabled: seals paused; alert; switch to standby key or rotate.
  • Evidence write partial: transactionally retry, or mark segment PendingEvidence.

Recovery Procedures

  1. If KMS/Evidence outage, allow buffers to grow; workers retry with capped backoff.
  2. If quarantine triggered (signature reject), isolate segment and open incident; re-sign with correct key after root cause.
  3. Reconcile PrevBlockRoot on restart to maintain a single forward chain per (tenant, shard).
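Step 3's PrevBlockRoot reconciliation amounts to walking block headers oldest-first and checking each link. A minimal sketch, assuming headers carry `merkleRoot` and `prevBlockRoot` as in Appendix A; the all-zeros genesis sentinel is an illustration, not the spec's value.

```python
from typing import Optional

GENESIS_PREV = "sha256:" + "0" * 64  # assumed sentinel for the first block

def verify_chain(blocks: list[dict]) -> Optional[int]:
    """Return the index of the first broken link, or None if the forward
    chain is intact for this (tenant, shard)."""
    prev_root = GENESIS_PREV
    for i, block in enumerate(blocks):
        if block["prevBlockRoot"] != prev_root:
            return i
        prev_root = block["merkleRoot"]
    return None
```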

Performance Characteristics

Latency Expectations

  • Leaf→ProofComputed: p50 20–40s; p95 ≤ 120s (time/size thresholds dependent)

Throughput Limits

  • Leaf hashing ≥ ingest throughput; segment sealing limited by Merkle + I/O (target ≥ 5k leaves/s per worker).

Resource Requirements

  • CPU for SHA-256/Merkle; memory for SegmentBuffer (bounded by max leaves or bytes).
  • Evidence Store IOPS sized for block bursts.

Scaling Considerations

  • Horizontal scale by tenant/shard queues.
  • Auto-seal if buffers exceed memory pressure.
  • Backpressure signaled to upstream only in extreme cases (avoid impacting ingest).

Security & Compliance

Authentication

  • mTLS between Integrity and KMS/Evidence Store.

Authorization

  • Integrity service principal limited to sign and write evidence; read-only for verify endpoints.

Data Protection

  • Proof artifacts encrypted at rest; signatures cover SegmentId, MerkleRoot, PrevBlockRoot, sealedAt.

Compliance

  • Proofs retained for at least as long as corresponding records; legal holds pin proofs.
  • Audit trail includes keyId, sealedAt, and policyRevision used for sealing thresholds.

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| integrity_queue_depth | gauge | Pending leaves | Rising > 10× baseline |
| segment_seal_latency_ms | histogram | Accept→seal delay | p95 > 120s |
| proof_compute_errors_total | counter | Failed proof writes | > 0 over 5m |
| kms_sign_latency_ms | histogram | KMS call time | p95 > 200ms |
| segments_sealed_total | counter | Count per tenant/shard | Trend watch |

Logging Requirements

  • Log segmentId, blockId, keyId, leaf counts, thresholds used; never log raw record bytes.

Distributed Tracing

  • Spans: integrity.hash.leaf, integrity.seal.segment, kms.sign, evidence.write; attributes include tenant, segmentSize, ageSec.

Health Checks

  • Readiness: KMS reachable; Evidence Store writable; backlog below watermark.
  • Liveness: worker heartbeats; buffer pressure alarms.

Operational Procedures

Deployment

  1. Deploy Integrity workers; keep integrity.enabled=false.
  2. Validate KMS permissions and dry-run seal on a test tenant.
  3. Enable and monitor queue_depth, seal_latency_ms.

Configuration

  • Env Vars: SEAL_MAX_LEAVES, SEAL_MAX_AGE_SEC, KMS_KEY_ID, MAX_BUFFER_BYTES
  • Backoff: KMS_RETRY_BACKOFF, EVIDENCE_RETRY_BACKOFF

Maintenance

  • Rotate keyId on schedule; run dual-verify window; archive old public keys.
  • Periodic integrity audit: random-sample verify segments nightly.

Troubleshooting

  • High queue depth → add workers; lower seal thresholds temporarily.
  • Signature failures → verify KMS policy/alg; check clock skew.
  • Missing proofs → check DLQ for segments marked PendingEvidence.

Testing Scenarios

Happy Path Tests

  • Given AuditRecord.Accepted, then Integrity.ProofComputed within SLO and record has IntegrityRef.
  • Merkle proof verifies for random leaves in sealed segment.

Error Path Tests

  • KMS outage → seals delayed; proofs catch up after recovery.
  • Evidence store 503 → retries; no data loss; segment eventually Sealed.

Performance Tests

  • Seal at size threshold (e.g., 10k leaves) under peak ingest.
  • Seal at age threshold (e.g., 60s) with sparse ingest.

Security Tests

  • Signatures verify with current and previous keyId during rotation.
  • Unauthorized client cannot fetch proofs from another tenant.

Internal References

  • Verify-On-Read Flow
  • Export eDiscovery Flow
  • Tamper Detection Flow

External References

  • Merkle tree concepts (general)
  • KMS provider docs for signing APIs

Appendices

A. Block Header (conceptual)

{
  "blockId": "01JE9C7…",
  "segmentId": "01JE9C6…",
  "merkleRoot": "sha256:cd…",
  "prevBlockRoot": "sha256:12…",
  "sealedAt": "2025-10-22T12:01:45.120Z",
  "keyId": "kms-2025-10",
  "signature": "MEQCIF…"
}

B. Leaf Hash Definition

LeafHash = SHA-256( CanonicalRecordBytes )
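The leaf definition extends naturally to the segment root computed at seal time. The sketch below pairs hashes level by level, duplicating the last node when a level is odd; that padding rule is one common convention and an assumption here, since the Integrity spec governs the actual tree construction.

```python
import hashlib

def leaf_hash(canonical_bytes: bytes) -> bytes:
    """LeafHash = SHA-256(CanonicalRecordBytes), per Appendix B."""
    return hashlib.sha256(canonical_bytes).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    """Pairwise-hash levels up to the root; odd node duplicated (assumed)."""
    if not leaves:
        raise ValueError("segment must contain at least one leaf")
    level = leaves
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]  # duplicate last leaf of an odd level
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]
```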

Audit Record Projection Update Flow

Builds query-optimized views from authoritative append-only facts. The Projector consumes accepted records, performs idempotent upserts into read models (AuditEvents timeline, Resource- and Actor-centric projections), updates the Search index, invalidates caches, and advances a checkpoint/watermark to guarantee at-least-once processing without duplication.


Overview

Purpose: Materialize fast, tenant-scoped views for queries and search while tracking consistent progress via checkpoints.
Scope: Post-append event consumption, idempotent projection updates, search indexing, cache invalidation, checkpointing, and replay/rebuild controls. Excludes ingestion, redaction policy evaluation, and verify-on-read.
Context: Runs asynchronously after AuditRecord.Accepted; multiple projector shards process per tenant/partition with strict ordering guarantees.
Key Participants:

  • Storage (Authoritative) — emits AuditRecord.Accepted
  • Projector — applies projection logic, maintains idempotency & checkpoints
  • Read DB — projection tables (AuditEvents, Resource, Actor)
  • Search Index — per-tenant documents for full-text/facets/suggest
  • Cache — key-based caches for hot read paths
  • Checkpoint Store — durable cursor (offset/watermark)
  • Event Bus — transport for Accepted and internal signals

Prerequisites

System Requirements

  • Storage → Bus delivery configured; Projector subscribed to AuditRecord.Accepted
  • Read DB reachable with migrations applied for projection schemas
  • Checkpoint Store provisioned (per tenant/shard)
  • Search cluster online and tenant indices created (if enabled)

Business Requirements

  • Tenants activated with edition flags for Search (optional)
  • Data minimization rules acknowledged in projection shapes
  • Cache TTLs defined per view (timeline/resource/actor)

Performance Requirements

  • Projection lag SLO: p95 ≤ 5 s from Accepted to visible in reads
  • Indexing throughput sized to match ingest rate (≥ 1×)
  • Checkpoint advance p99 commit ≤ 50 ms

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    participant Storage as Storage (Authoritative)
    participant Bus as Event Bus
    participant Proj as Projector
    participant ReadDB as Read DB (Projections)
    participant Search as Search Index
    participant Cache as Cache
    participant Ckpt as Checkpoint Store

    Storage-->>Bus: Publish AuditRecord.Accepted {tenantId, auditRecordId, canonicalRef}
    Bus-->>Proj: Deliver event (ordered per partition)
    Proj->>Proj: Idempotency check (eventId vs last offset)
    Proj->>ReadDB: UPSERT AuditEvents (timeline)
    Proj->>ReadDB: UPSERT ResourceProjection (by resource)
    Proj->>ReadDB: UPSERT ActorProjection (by actor)
    alt Search enabled
      Proj->>Search: UPSERT index document(s)
    end
    Proj->>Cache: Invalidate keys {timeline:tenant, resource:id, actor:id}
    Proj->>Ckpt: Commit watermark {offset, auditRecordId, observedAt}
    Ckpt-->>Proj: ↩ ack

Alternative Paths

  • Out-of-order duplicate: Projector detects processed offset and skips; checkpoint remains.
  • Rebuild: Admin issues Rebuild command → Projector resets checkpoint to origin, clears projections (or writes compaction shadow tables), replays events, then swaps.
  • Partial Indexing: If Search is temporarily disabled for a tenant, projector queues index updates and advances DB projections; index will catch up later from a backlog.

Error Paths

sequenceDiagram
    participant Proj as Projector
    participant ReadDB as Read DB
    participant Ckpt as Checkpoint Store
    participant Search as Search Index

    Proj->>ReadDB: UPSERT projections
    alt Constraint conflict (unique key)
        ReadDB-->>Proj: ↩ 409 conflict
        Proj->>Proj: Apply idempotent merge, retry once
    else Bad projection payload (schema drift)
        ReadDB-->>Proj: ↩ 400 bad request
        Proj->>Proj: Quarantine record → DLQ, continue stream
    end

    Proj->>Ckpt: Commit watermark
    alt Not found checkpoint stream
        Ckpt-->>Proj: ↩ 404 not found
        Proj->>Ckpt: Create stream atomically, retry
    end

    Proj->>Search: UPSERT doc
    alt Index unavailable / rate-limited
        Search-->>Proj: ↩ 429/503
        Proj->>Proj: Buffer + backoff, do not block DB projections
    end

Request/Response Specifications

External APIs are operational controls; projections themselves are internal upserts.

Input Requirements (event consumed)

| Field | Type | Required | Description | Validation |
|---|---|---|---|---|
| tenantId | string | Y | Tenant scope | Known tenant |
| auditRecordId | ULID | Y | Record id | Exists in Storage |
| createdAt | timestamp | Y | Producer time | ISO-8601 UTC |
| observedAt | timestamp | Y | Ingestion time | ISO-8601 UTC |
| action | string | Y | Event verb | normalized |
| resource | object | Y | {type,id,path?} | normalized |
| actor | object | Y | {id,type,display?} | present |
| decision | object | N | Access outcome | enum |
| attributes | map | N | extras | bounded |

Output Specifications (projections)

| Projection | Key | Shape (summary) | Notes |
|---|---|---|---|
| AuditEvents | (tenantId, createdAt, auditRecordId) | timeline row | paginates by cursor |
| ResourceProjection | (tenantId, resource.type, resource.id) | latest state + last actions | small, denormalized |
| ActorProjection | (tenantId, actor.id) | last actions, resources touched | for actor-centric queries |
| Search Document | (tenantId, auditRecordId) | flattened facets + text | edition-gated |

Operational APIs

GET /projections/v1/{tenant}/{name}/status

Response 200:

{
  "tenant": "acme",
  "name": "AuditEvents",
  "watermark": { "offset": 1203981, "auditRecordId": "01JEA...", "updatedAt": "2025-10-22T12:00:06.100Z" },
  "lag": { "seconds": 2.4, "records": 180 },
  "state": "Healthy"
}

POST /projections/v1/{tenant}/{name}/rebuild → 202 with { jobId }


Error Handling

Error Scenarios

| Error Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Bad request to ops API (invalid name, bad params) | Fix request | No retry until corrected |
| 404 | Status/rebuild for unknown projection or tenant | Validate inputs | No retry |
| 409 | Rebuild already in progress / checkpoint conflict | Use existing job or wait | Retry after completion |
| 422 | Event schema drift vs projection mapper | Quarantine & hotfix mapper | Continue stream; backfill later |
| 429 | Search/index or cache backpressure | Defer indexing; advance DB | Automatic retry/backoff |
| 503 | Read DB/Checkpoint store transient failure | Keep event, retry | Exponential backoff + jitter |

Failure Modes

  • Poison event: irreconcilable mapping → send to DLQ with pointers; continue stream.
  • Cache stampede: cache invalidations batched/coalesced; use jittered TTLs.
  • Idempotency race: unique key conflicts resolved via UPSERT with deterministic merge.
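The idempotency-race resolution can be sketched as a deterministic merge keyed on the event offset. Below, a dict stands in for the Read DB table (an assumption for illustration): the higher offset wins, so duplicate deliveries and out-of-order replays are no-ops.

```python
def upsert(table: dict, key: str, row: dict) -> bool:
    """Idempotent upsert: apply only if this event's offset is newer than
    the stored row's. Returns True if the row changed. Deterministic:
    re-delivering the same offset can never flip the stored state."""
    existing = table.get(key)
    if existing is not None and existing["offset"] >= row["offset"]:
        return False  # duplicate or out-of-order replay: skip
    table[key] = row
    return True
```

The same rule maps onto a SQL UPSERT with a `WHERE excluded.offset > current.offset` guard, which is what makes at-least-once delivery safe for the projections.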

Recovery Procedures

  1. If Read DB/Checkpoint outage, pause commits but keep events buffered; resume and commit in order.
  2. For DLQ items, fix mapper/policy, then replay from saved offset range.
  3. During rebuild, expose state:"Rebuilding"; queries read from shadow tables if configured.

Performance Characteristics

Latency Expectations

  • Accept → Read visible: p95 ≤ 5 s
  • Accept → Indexed: p95 ≤ 10 s (if search enabled)

Throughput Limits

  • Sustains ingest parity; projectors process ≥ 1× ingest rps per shard.

Resource Requirements

  • CPU for mapping/JSON flatten; DB connections sized for write bursts.
  • Search bulkers batch 500–1,000 docs or 5–10 MiB per flush.

Scaling Considerations

  • Horizontal scale by tenant/shard.
  • HPA/KEDA on queue depth, projection lag, and p95 projector latency.
  • Apply backpressure to indexing only; keep DB projections current.

Security & Compliance

Authentication

  • mTLS between Projector and Read DB/Search/Checkpoint.

Authorization

  • Projector principal has write on projections & checkpoint, write/bulk on Search, no read of other tenants.

Data Protection

  • Store only minimized fields required for query/search; avoid sensitive raw values.
  • Tenant isolation enforced at table/index level (prefix/shard keys).

Compliance

  • Projection updates logged with tenant, auditRecordId, and mapperVersion for auditability.

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| projection_lag_seconds | gauge | Accept→visible delay | > 5s p95 (5m) |
| projected_records_total | counter | Rows upserted | Trend vs ingest |
| checkpoint_commit_latency_ms | histogram | Commit time | p95 > 50ms |
| projection_conflicts_total | counter | 409 upserts | Rising trend |
| index_updates_backlog | gauge | Pending index docs | Growing without drop |

Logging Requirements

  • Structured logs: tenant, auditRecordId, offset, mapperVersion, conflict summaries (no sensitive values).

Distributed Tracing

  • Spans: projector.consume, mapper.apply, readdb.upsert, index.bulk, checkpoint.commit.
  • Attributes: tenant, offset, bulkCount, lagMs.

Health Checks

  • Readiness: connectivity to Read DB/Search/Checkpoint; lag below threshold.
  • Liveness: consumer heartbeats; partition ownership indicator.

Operational Procedures

Deployment

  1. Deploy Projector with projector.enabled=false.
  2. Run migrations for projection schemas.
  3. Enable consumers per tenant/shard; monitor projection_lag_seconds.

Configuration

  • Env Vars: PROJECTOR_PARALLELISM, CHECKPOINT_BATCH, INDEX_BULK_BYTES, INDEX_BULK_DOCS
  • Flags: search.enabled, rebuild.shadowSwap=true

Maintenance

  • Periodic compaction of timeline tables; rotate old index aliases.
  • Update mapperVersion with schema changes; keep backward compatibility.

Troubleshooting

  • Rising lag → scale workers or reduce index bulk size; inspect DB write contention.
  • Many conflicts → verify UPSERT keys & mapping determinism.
  • Backlog in indexing → check cluster health; enable backpressure-only mode.

Testing Scenarios

Happy Path Tests

  • Accepted event produces AuditEvents row, Resource & Actor upserts; watermark advances.
  • Search document visible; cache invalidated and repopulated on read.

Error Path Tests

  • Unique key conflict handled idempotently (no duplicate rows).
  • Bad ops API request → 400; unknown projection → 404; rebuild in progress → 409.

Performance Tests

  • Maintain p95 ≤ 5 s at target ingest rps with search enabled/disabled.
  • Bulk indexing flush sizes tuned for p95 < 1 s per bulk.

Security Tests

  • Tenant isolation in projections and index aliases.
  • No sensitive fields persisted beyond minimization policy.

Internal References

  • Standard Audit Record Ingestion Flow
  • Audit Record Integrity Chain Flow
  • Search Query Flow

External References

  • Bulk indexing guidance for the chosen search engine (vendor docs)

Appendices

A. UPSERT Keys (example)

  • AuditEvents: (tenantId, createdAt, auditRecordId)
  • ResourceProjection: (tenantId, resourceType, resourceId)
  • ActorProjection: (tenantId, actorId)

B. Checkpoint Record (example)

{
  "tenant": "acme",
  "partition": "p3",
  "offset": 1203981,
  "auditRecordId": "01JEA…",
  "updatedAt": "2025-10-22T12:00:06.100Z",
  "mapperVersion": "v7"
}

HTTP REST API Ingestion Flow

REST transport for appending a single AuditRecord via API Gateway. Details HTTP method/endpoint, required headers, authentication & rate limiting, header-to-internal mapping (traceparent, x-tenant-id, x-idempotency-key), response behaviors (2xx/4xx/5xx), and concrete request/response examples.


Overview

Purpose: Provide a secure, idempotent HTTP interface for producers to append audit facts through the Gateway.
Scope: HTTP semantics (headers, status codes, retries), authN/Z at the edge, rate limiting, payload size/type validation, Problem+JSON errors. Excludes batch/gRPC/bus transports (separate flows) and downstream integrity/projection internals.
Context: Front door for most interactive clients; maps cleanly to the canonical ingestion path.
Key Participants:

  • HTTP Client (producer)
  • API Gateway (edge policy, authN/Z, limits)
  • Ingestion Service (validation/canonicalization)
  • Policy Service (classification/redaction hints, invoked by Ingestion)
  • Storage (Authoritative) (append/WORM)

Prerequisites

System Requirements

  • TLS 1.2+ enabled on Gateway; valid certificates
  • Gateway has JWKS/issuer config to validate JWTs (OIDC)
  • Network routes Gateway → Ingestion (and Ingestion → Policy/Storage)

Business Requirements

  • Tenant exists, active, and mapped to regions/partitions
  • Policy/retention configurations present for tenant
  • Edition flags set (may influence limits)

Performance Requirements

  • Gateway rate limits sized per tenant (burst/sustained)
  • Max payload ≤ 256 KiB; P95 end-to-end ≤ 50 ms at target RPS
  • Idempotency store capacity sized for 24h dedupe window

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    actor Client as HTTP Client
    participant Gateway as API Gateway
    participant Ingestion as Ingestion Service
    participant Storage as Storage (Authoritative)

    Client->>Gateway: POST /audit/v1/records<br/>(h: Authorization, x-tenant-id, traceparent, x-idempotency-key)<br/>(b: application/json)
    Note right of Gateway: Validate JWT, tenant scope, rate-limit, content-type & size
    Gateway->>Ingestion: Append(request) (forward required headers)
    Ingestion->>Ingestion: Validate + canonicalize + policy hints
    Ingestion->>Storage: INSERT canonical record (WORM)
    Storage-->>Ingestion: ack {auditRecordId}
    Ingestion-->>Gateway: 202 {auditRecordId, status:"Created"}
    Gateway-->>Client: 202 Accepted {auditRecordId, status} + rate-limit headers
Hold "Alt" / "Option" to enable pan & zoom

Alternative Paths

  • Duplicate idempotency key: 202 with status:"Duplicate" and original auditRecordId.
  • Server-assigned ULID: Omit auditRecordId and receive assigned value in response.
  • CORS/browser clients: Preflight OPTIONS handled by Gateway; only safelisted headers exposed.

Error Paths

sequenceDiagram
    actor Client
    participant Gateway as API Gateway
    Client->>Gateway: POST /audit/v1/records (bad/missing bits)
    alt Bad request (shape/size/type)
        Gateway-->>Client: 400/413/415 Problem+JSON
    else Unauthorized / Forbidden
        Gateway-->>Client: 401/403 Problem+JSON
    else Not found / wrong route
        Gateway-->>Client: 404 Problem+JSON
    else Conflict (idempotency anomaly)
        Gateway-->>Client: 409 Problem+JSON
    else Rate limited
        Gateway-->>Client: 429 Problem+JSON (+ Retry-After)
    else Upstream unavailable
        Gateway-->>Client: 503 Problem+JSON
    end
Hold "Alt" / "Option" to enable pan & zoom

Request/Response Specifications

Input Requirements

| Field | Type | Required | Description | Validation |
| --- | --- | --- | --- | --- |
| Method | HTTP | Y | POST | POST /audit/v1/records |
| Content-Type | header | Y | Body MIME type | application/json; charset=utf-8 |
| Authorization | header | Y | Bearer JWT | Valid signature, audience, tenant claim |
| x-tenant-id | header | Y | Tenant routing | ^[A-Za-z0-9._-]{1,128}$ |
| traceparent | header | Y | W3C trace context | 55-char format |
| x-idempotency-key | header | Y | Dedupe per tenant (24h) | ≤128 visible ASCII |
| Body | JSON | Y | Canonical AuditRecord fields | See Data Model rules |
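The header validations above can be sketched as a small helper. This is an illustrative Python sketch, not the Gateway's actual code; the function name is an assumption:

```python
import re

# traceparent follows W3C Trace Context:
# version(2) - trace-id(32) - parent-id(16) - flags(2) = 55 chars incl. dashes.
TRACEPARENT = re.compile(r"^[0-9a-f]{2}-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}$")
TENANT = re.compile(r"^[A-Za-z0-9._-]{1,128}$")

def validate_headers(h: dict) -> list:
    """Return the names of required headers that fail validation."""
    errors = []
    if not TENANT.match(h.get("x-tenant-id", "")):
        errors.append("x-tenant-id")
    tp = h.get("traceparent", "")
    if len(tp) != 55 or not TRACEPARENT.match(tp):
        errors.append("traceparent")
    key = h.get("x-idempotency-key", "")
    # "visible ASCII" = printable, non-space characters (0x21-0x7E)
    if not (0 < len(key) <= 128 and all(33 <= ord(c) <= 126 for c in key)):
        errors.append("x-idempotency-key")
    return errors

ok = {
    "x-tenant-id": "acme",
    "traceparent": "00-3e1f2d0c9b8a7f6e5d4c3b2a19081716-7f6e5d4c3b2a1908-01",
    "x-idempotency-key": "acme-ord-9981-v1",
}
print(validate_headers(ok))                           # []
print(validate_headers({**ok, "traceparent": "xx"}))  # ['traceparent']
```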

Output Specifications

| Field | Type | Description | Notes |
| --- | --- | --- | --- |
| auditRecordId | ULID | Durable record id | Server returns original or assigned |
| status | string | Created or Duplicate | Idempotent behavior |
| observedAt | timestamp | Ingestion observation time | ms precision |
| traceId | hex32 | Echo from traceparent | Correlation |
| links.self | uri | Record locator | Optional operation link |

Example Payloads

Request

POST /audit/v1/records HTTP/1.1
Host: api.atp.example
Authorization: Bearer eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9...
x-tenant-id: acme
traceparent: 00-3e1f2d0c9b8a7f6e5d4c3b2a19081716-7f6e5d4c3b2a1908-01
x-idempotency-key: acme-ord-9981-v1
Content-Type: application/json; charset=utf-8

{
  "tenantId": "acme",
  "schemaVersion": "auditrecord.v1",
  "createdAt": "2025-10-22T12:00:03.100Z",
  "action": "user.create",
  "resource": { "type": "Iam.User", "id": "U-1001" },
  "actor": { "id": "svc_ingress", "type": "Service" }
}

Response — 202 Accepted

{
  "auditRecordId": "01JEB0V2G7NY5T6Q9KX3M4C8AP",
  "status": "Created",
  "observedAt": "2025-10-22T12:00:03.280Z",
  "traceId": "3e1f2d0c9b8a7f6e5d4c3b2a19081716",
  "links": {
    "self": "/audit/v1/records/01JEB0V2G7NY5T6Q9KX3M4C8AP"
  }
}

Response — 400 Bad Request (Problem+JSON)

{
  "type": "urn:connectsoft:errors/validation/action.invalid",
  "title": "Invalid action",
  "status": 400,
  "detail": "Action must match ^[a-z]+(\\.[a-z0-9_-]+)?$",
  "errors": [{ "pointer": "/action", "reason": "regex" }],
  "traceId": "3e1f2d0c9b8a7f6e5d4c3b2a19081716"
}

Error Handling

Status Code Matrix

| Class | Code | When | Notes |
| --- | --- | --- | --- |
| 2xx | 202 | Accepted (created or deduped) | Body includes status: "Created" or "Duplicate" |
| 4xx | 400 | Shape/field invalid, schema mismatch | Problem+JSON with errors[].pointer |
| 4xx | 401 | Missing/invalid JWT | Bearer challenge omitted for APIs; response body explains |
| 4xx | 403 | Tenant/permission forbidden | Token valid but insufficient scope |
| 4xx | 404 | Unknown route/tenant or disabled feature | Useful for wrong base path or edition |
| 4xx | 409 | Idempotency anomaly / conflicting op link | Rare; follow links.operation if present |
| 4xx | 413 | Payload exceeds 256 KiB | Include maxBytes hint |
| 4xx | 415 | Unsupported media type | Require application/json |
| 4xx | 429 | Rate-limited/backpressure | Include Retry-After (seconds or HTTP date) |
| 5xx | 503 | Upstream dependency unavailable | Retry with same idempotency key |

Failure Modes

  • Clock skew: createdAt > now+2m → 400 with pointer /createdAt.
  • Tenant mismatch: body tenantId ≠ header x-tenant-id → 403.
  • Idempotency race: concurrent distinct payloads under same key → 409.
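The first two failure modes reduce to simple request checks. A minimal sketch, assuming a hypothetical `check` helper (not the Gateway's actual code), returning the status code and error pointer described above:

```python
from datetime import datetime, timedelta, timezone

MAX_SKEW = timedelta(minutes=2)  # createdAt may not be > now + 2m

def check(body: dict, headers: dict, now: datetime):
    """Return (statusCode, errorPointer) for the skew and tenant checks."""
    created = datetime.fromisoformat(body["createdAt"].replace("Z", "+00:00"))
    if created > now + MAX_SKEW:
        return 400, "/createdAt"   # clock skew: reject with field pointer
    if body["tenantId"] != headers["x-tenant-id"]:
        return 403, None           # tenant mismatch: body vs routing header
    return 202, None

now = datetime(2025, 10, 22, 12, 0, 3, tzinfo=timezone.utc)
body = {"tenantId": "acme", "createdAt": "2025-10-22T12:00:03.100Z"}
print(check(body, {"x-tenant-id": "acme"}, now))   # (202, None)
print(check({**body, "createdAt": "2025-10-22T12:05:00.000Z"},
            {"x-tenant-id": "acme"}, now))         # (400, '/createdAt')
print(check(body, {"x-tenant-id": "other"}, now))  # (403, None)
```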

Recovery Procedures

  1. For 4xx, correct payload/headers and resend (new key except for 409).
  2. For 429/503, retry with exponential backoff + jitter; reuse the same x-idempotency-key.
  3. Track traceId from responses to correlate retries.

Performance Characteristics

Latency Expectations

  • Gateway edge: P50 5–10 ms, P95 ≤ 20 ms
  • End-to-end to 202: P50 15–25 ms, P95 ≤ 50 ms

Throughput Limits

  • Default per-tenant: 500 rps sustained, 2k rps burst (60s)
  • Global: ≥ 50k rps across shards (capacity dependent)

Resource Requirements

  • Gateway CPU for JWT validation and header processing; memory for small payload buffers.

Scaling Considerations

  • Scale Gateway horizontally; HPA on rps & p95.
  • Separate rate limit buckets per tenant and per route.

Security & Compliance

Authentication

  • OIDC JWT Bearer; short-lived (≤ 15m), leeway ±60s.

Authorization

  • Require audit:append scoped to x-tenant-id; Gateway enforces edition access.

Data Protection

  • TLS 1.2+; HSTS at edge; CORS preflight for browser-based producers (restrict origins & headers).

Compliance

  • Log who/when appended; immutable WORM store; Problem+JSON avoids leaking sensitive values.

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
| --- | --- | --- | --- |
| http_requests_total{route="/audit/v1/records"} | counter | Request rate | Anomaly vs baseline |
| http_request_duration_ms | histogram | Latency | p95 > 50 ms (5m) |
| http_responses_total{status=4xx/5xx} | counter | Error rates | > 1% 4xx (validation spikes), any 5xx |
| rate_limited_total | counter | 429 responses | > 5% sustained |

Logging Requirements

  • Structured logs with tenantId, traceId, idempotencyKey (hashed), statusCode; no sensitive payloads.

Distributed Tracing

  • Propagate traceparent to Ingestion; spans gateway.authz, gateway.forward with attributes tenant, payloadBytes.

Health Checks

  • Liveness: process/thread checks; Readiness: JWKS reachable, Ingestion upstream healthy.

Operational Procedures

Deployment

  1. Deploy Gateway route behind feature flag ingest.rest.enabled=false.
  2. Smoke test with signed JWT and minimal payload; verify 202 and headers.
  3. Enable feature flag and gradually raise rate limits.

Configuration

  • Env Vars / Config: JWKS URI, audiences, rate limit buckets, max payload bytes, allowed CORS origins/headers.
  • Headers to forward: traceparent, x-tenant-id, x-idempotency-key.

Maintenance

  • Rotate keys/JWKS; cache with TTL; monitor expired/invalid token spikes.

Troubleshooting

  • Many 401s → check JWKS drift/clock skew.
  • Many 415s → clients mis-sending Content-Type.
  • Elevated 409s → investigate idempotency key collisions in client.

Testing Scenarios

Happy Path Tests

  • Valid POST returns 202 with status:"Created" and auditRecordId.
  • Duplicate x-idempotency-key returns 202 with status:"Duplicate".
  • Trace propagation: traceId echoed matches traceparent.

Error Path Tests

  • 400 invalid action; pointer /action.
  • 404 wrong route (e.g., /audit/v2/...).
  • 409 conflicting idempotency key (distinct payload).
  • 415 wrong media type; 413 too large.
  • 429 with Retry-After; 503 transient outage.

Performance Tests

  • Sustain 500 rps tenant; p95 ≤ 50 ms.
  • Burst to 2k rps without >1% errors.

Security Tests

  • JWT expiration & audience checks enforced.
  • CORS preflight honors allowed origins and headers.
  • Tenant mismatch (header vs body) rejected with 403.

Internal References

  • gRPC Service Ingestion Flow
  • Service Bus (MassTransit) Ingestion Flow
  • Retry Flow

External References

  • RFC 7807 (Problem Details for HTTP APIs)
  • W3C Trace Context (traceparent)

Appendices

A. cURL Examples

curl -sS -X POST "https://api.atp.example/audit/v1/records" \
  -H "Authorization: Bearer $TOKEN" \
  -H "x-tenant-id: acme" \
  -H "traceparent: 00-$(uuidgen | tr 'A-Z' 'a-z' | tr -d '-')-$(uuidgen | tr 'A-Z' 'a-z' | cut -c1-16)-01" \
  -H "x-idempotency-key: acme-ord-9981-v1" \
  -H "Content-Type: application/json; charset=utf-8" \
  --data-binary @record.json

B. Rate Limiting Headers (example)

RateLimit-Limit: 2000, 500;w=60
RateLimit-Remaining: 1980
RateLimit-Reset: 45
Retry-After: 3
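A client honoring these headers must accept both `Retry-After` forms (delta-seconds or an HTTP-date, per RFC 9110). An illustrative sketch of the conversion (helper name is an assumption):

```python
from email.utils import parsedate_to_datetime
from datetime import datetime, timezone

def retry_after_seconds(value: str, now: datetime) -> float:
    """Convert a Retry-After header value into a wait in seconds."""
    if value.isdigit():                    # delta-seconds form, e.g. "3"
        return float(value)
    when = parsedate_to_datetime(value)    # HTTP-date form
    return max(0.0, (when - now).total_seconds())

now = datetime(2025, 10, 22, 12, 0, 0, tzinfo=timezone.utc)
print(retry_after_seconds("3", now))                              # 3.0
print(retry_after_seconds("Wed, 22 Oct 2025 12:00:30 GMT", now))  # 30.0
```

After the wait, resend with the same `x-idempotency-key` so the retry dedupes server-side.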

gRPC Service Ingestion Flow

High-QPS, low-latency transport for appending individual AuditRecord items using gRPC. Clients call a unary Append RPC on the Gateway, passing metadata for tenant, traceparent, idempotency, and authorization. The Gateway authenticates/authorizes and forwards to Ingestion; responses use canonical gRPC status codes with retry/backoff guidance.


Overview

Purpose: Provide a high-throughput ingestion path with efficient framing, multiplexing, and connection reuse.
Scope: gRPC method shape, metadata requirements, authN/Z, rate limiting, error code mapping, retries/backoff, and sample code-first contracts. Excludes batch uploads and message bus ingestion.
Context: Preferred for service-to-service producers and heavy internal traffic; functionally equivalent to REST ingestion but with gRPC semantics.
Key Participants:

  • gRPC Client (producer)
  • gRPC Gateway (edge; authN/Z, limits, metadata mapping)
  • Ingestion Service (validate/canonicalize, policy/classification/redaction)
  • Storage (Authoritative) (append/WORM)

Prerequisites

System Requirements

  • Gateway and Ingestion expose/accept HTTP/2 with TLS (mTLS optional for internal meshes)
  • OIDC/JWKS configured at the Gateway to validate authorization metadata
  • Network connectivity Gateway ↔ Ingestion ↔ Storage/Policy services

Business Requirements

  • Tenant active and mapped to partitions/regions
  • Policy and retention configured for tenant
  • Edition flags (e.g., max RPS) set if applicable

Performance Requirements

  • Connection pooling enabled; client max concurrent streams tuned (HTTP/2)
  • End-to-end p95 ≤ 40 ms at target RPS; message size ≤ 256 KiB
  • Idempotency store sized for 24h dedupe window

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    actor Client as gRPC Client
    participant GW as gRPC Gateway
    participant Ing as Ingestion Service
    participant Store as Storage (Authoritative)

    Client->>GW: Append(AuditRecord) + metadata{authorization, x-tenant-id, traceparent, x-idempotency-key}
    Note right of GW: Validate token, tenant scope, rate limit; map metadata → headers
    GW->>Ing: Append(request, forwarded metadata)
    Ing->>Ing: Validate + canonicalize + policy/classification/redaction
    Ing->>Store: INSERT canonical record (WORM)
    Store-->>Ing: ack {auditRecordId}
    Ing-->>GW: AppendReply {auditRecordId, status=Created}
    GW-->>Client: OK (AppendReply) + trailers {traceId}
Hold "Alt" / "Option" to enable pan & zoom

Alternative Paths

  • Duplicate idempotency key: return OK with status=Duplicate and original auditRecordId.
  • Server-assigned ID: client omits auditRecordId; service returns assigned ULID.
  • Streaming batch (future): optional client- or server-streaming variants reuse the same metadata (not covered here).

Error Paths

sequenceDiagram
    actor Client
    participant GW as gRPC Gateway

    Client->>GW: Append(bad or unauthorized)
    alt Invalid argument / too large
        GW-->>Client: INVALID_ARGUMENT / RESOURCE_EXHAUSTED
    else Unauthenticated / permission denied
        GW-->>Client: UNAUTHENTICATED / PERMISSION_DENIED
    else Not found route / disabled
        GW-->>Client: NOT_FOUND
    else Idempotency conflict (payload differs)
        GW-->>Client: ALREADY_EXISTS
    else Rate limited / upstream unavailable
        GW-->>Client: RESOURCE_EXHAUSTED / UNAVAILABLE
    end
Hold "Alt" / "Option" to enable pan & zoom

Request/Response Specifications

Input Requirements

| Field | Type | Required | Description | Validation |
| --- | --- | --- | --- | --- |
| RPC | unary | Y | Append(AppendRequest) returns (AppendReply) | gRPC |
| authorization (metadata) | string | Y | Bearer <JWT> | Valid signature, audience, tenant claim |
| x-tenant-id (metadata) | string | Y | Tenant routing | ^[A-Za-z0-9._-]{1,128}$ |
| traceparent (metadata) | string | Y | W3C Trace Context | 55-char format |
| x-idempotency-key (metadata) | string | Y | Dedupe per tenant (24h) | ≤128 visible ASCII |
| AppendRequest.auditRecord | message | Y | Canonical AuditRecord | See Data Model limits (≤ 256 KiB) |
| AppendRequest.schemaVersion | string | Y | Contract version | Known & active |

Metadata naming: gRPC metadata keys are lowercase ASCII. Use exactly: authorization, x-tenant-id, traceparent, x-idempotency-key.

Output Specifications

| Field | Type | Description | Notes |
| --- | --- | --- | --- |
| AppendReply.auditRecordId | string (ULID) | Durable id | Assigned or echoed |
| AppendReply.status | enum | Created or Duplicate | Idempotent result |
| AppendReply.observedAt | timestamp | Ingestion observation | ms precision |
| trailers:traceid | hex32 | Correlation id | Derived from traceparent |

Example Payloads

Proto (illustrative; see code-first C# below)

service AuditIngestion {
  rpc Append (AppendRequest) returns (AppendReply);
}

message AppendRequest {
  string schemaVersion = 1;
  AuditRecord auditRecord = 2;
}

message AppendReply {
  string auditRecordId = 1;
  string status = 2; // "Created" | "Duplicate"
  string observedAt = 3; // ISO-8601 UTC
}

Example grpcurl

grpcurl -d @ \
  -H "authorization: Bearer $TOKEN" \
  -H "x-tenant-id: acme" \
  -H "traceparent: 00-3e1f2d0c9b8a7f6e5d4c3b2a19081716-7f6e5d4c3b2a1908-01" \
  -H "x-idempotency-key: acme-ord-9981-v1" \
  api.atp.example:443 audit.AuditIngestion/Append <<'JSON'
{
  "schemaVersion": "auditrecord.v1",
  "auditRecord": {
    "tenantId": "acme",
    "createdAt": "2025-10-22T12:00:03.100Z",
    "action": "user.create",
    "resource": { "type": "Iam.User", "id": "U-1001" },
    "actor": { "id": "svc_ingress", "type": "Service" }
  }
}
JSON

Error Handling

Error Scenarios (gRPC ↔ HTTP analogy)

| gRPC Code | HTTP Analogy | Scenario | Recovery Action | Retry Strategy |
| --- | --- | --- | --- | --- |
| OK | 202 | Created or Duplicate | — | — |
| INVALID_ARGUMENT | 400 | Schema/shape/limits invalid | Fix per details | No retry until corrected |
| NOT_FOUND | 404 | Unknown service/method or tenant/feature disabled | Check route/tenant | No retry |
| ALREADY_EXISTS | 409 | Idempotency conflict (same key, different payload) | Use new key; reconcile | Do not retry with same key |
| UNAUTHENTICATED | 401 | Missing/invalid token | Acquire valid JWT | Retry after fix |
| PERMISSION_DENIED | 403 | Insufficient scope or tenant mismatch | Adjust perms/tenant | No retry until corrected |
| RESOURCE_EXHAUSTED | 429 | Rate limit/backpressure | Honor retry hints | Exponential backoff + jitter |
| UNAVAILABLE | 503 | Upstream unavailable / transient gateway error | Wait for recovery | Retry with same idempotency key |
| DEADLINE_EXCEEDED | 504 | Client/server deadline hit | Increase deadline if safe | Limited retries |
| INTERNAL | 500 | Unexpected server error | Open incident if persistent | Bounded retries with backoff |

Failure Modes

  • Metadata missing/uppercase: gRPC metadata keys must be lowercase; missing required keys → INVALID_ARGUMENT.
  • Clock skew: createdAt > now+2m → INVALID_ARGUMENT with field pointer.
  • Concurrent duplicates: distinct payload under same key → ALREADY_EXISTS.

Recovery Procedures

  1. For 4xx analogs (INVALID_ARGUMENT, PERMISSION_DENIED, ALREADY_EXISTS, NOT_FOUND) fix request/config before retry.
  2. For RESOURCE_EXHAUSTED/UNAVAILABLE/DEADLINE_EXCEEDED, backoff with jitter; reuse x-idempotency-key.
  3. Log/propagate traceid from trailers for correlation.

Performance Characteristics

Latency Expectations

  • P50: 10–20 ms
  • P95: ≤ 40 ms
  • P99: ≤ 75 ms

Throughput Limits

  • Per connection: hundreds of concurrent streams (HTTP/2)
  • Per tenant: baseline 1k rps sustained, burst 4k rps (edition dependent)
  • Global: scales linearly with Gateway instances

Resource Requirements

  • Persistent HTTP/2 channels; tune client pool size and max streams per connection.

Scaling Considerations

  • Horizontal scale Gateway on RPS/p95; shard by tenant/region.
  • Configure server and client receive/send message size caps (≤ 256 KiB).

Security & Compliance

Authentication

  • authorization metadata with OIDC JWT; short-lived (≤ 15m), leeway ±60s; optional mTLS for extra assurance.

Authorization

  • Require audit:append scoped to x-tenant-id; Gateway enforces RBAC/ABAC.

Data Protection

  • TLS 1.2+; no sensitive data in logs; redaction/classification applied by Ingestion before persist.

Compliance

  • Producer identity, idempotency key hash, and decision trail logged; aligns with privacy/PII policies.

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
| --- | --- | --- | --- |
| grpc_server_started_total | counter | Calls started | Anomaly detection |
| grpc_server_handled_total{code} | counter | Calls by status code | Any 5xx; spikes in INVALID_ARGUMENT |
| grpc_server_handling_seconds | histogram | Latency | p95 > 40 ms |
| rate_limited_total | counter | RESOURCE_EXHAUSTED | > 5% sustained |

Logging Requirements

  • Structured logs: tenant, traceId, idempotencyKey (hashed), grpc.code, latencyMs; omit payload bodies.

Distributed Tracing

  • Map traceparent to gRPC context; spans: gateway.authz, ingestion.append. Include attributes tenant, payloadBytes.

Health Checks

  • Liveness: process/thread; Readiness: JWKS reachability, upstream Ingestion healthy.

Operational Procedures

Deployment

  1. Enable gRPC port/route under flag ingest.grpc.enabled=false.
  2. Smoke test with signed JWT and minimal payload; verify OK and trailers.
  3. Gradually raise per-tenant limits; observe grpc_server_handled_total{code!="OK"}.

Configuration

  • Gateway: JWKS URI, audiences, rate limits, max recv/send message bytes, allowed metadata keys/size.
  • Client: channel pool size, per-call deadline (e.g., 2s), retry policy (UNAVAILABLE, RESOURCE_EXHAUSTED).

Maintenance

  • Rotate JWKS/keys; monitor token validation failures; tune backoff policies.

Troubleshooting

  • Many INVALID_ARGUMENT → inspect validation pointers; schema drift.
  • Many UNAVAILABLE → upstream health; check saturation.
  • Frequent ALREADY_EXISTS → idempotency key collisions—fix client keying.

Testing Scenarios

Happy Path Tests

  • Valid Append returns OK with status:"Created" and auditRecordId.
  • Duplicate x-idempotency-key returns OK with status:"Duplicate".

Error Path Tests

  • Missing x-tenant-id → INVALID_ARGUMENT.
  • Unknown method/route → NOT_FOUND.
  • Conflicting idempotency payload → ALREADY_EXISTS.
  • Rate limit → RESOURCE_EXHAUSTED with retry backoff honored.

Performance Tests

  • Sustain 1k rps/tenant with p95 ≤ 40 ms.
  • Connection reuse across 10k calls without reconnect churn.

Security Tests

  • JWT expiration/audience enforced.
  • Tenant mismatch (metadata vs body) → PERMISSION_DENIED.
  • Trace propagation verified end-to-end.

Internal References

  • Standard Audit Record Ingestion Flow
  • Retry Flow
  • Distributed Tracing Flow

External References

  • gRPC Status Codes guide
  • W3C Trace Context

Appendices

A. C# gRPC code-first contract (protobuf-net.Grpc style)

using System.ServiceModel;
using ProtoBuf.Grpc;
using ProtoBuf.Grpc.Configuration;

[Service]
public interface IAuditIngestionService
{
    [Operation]
    Task<AppendReply> AppendAsync(AppendRequest request, CallContext context = default);
}

public sealed class AppendRequest
{
    public string SchemaVersion { get; set; } = "auditrecord.v1";
    public AuditRecord AuditRecord { get; set; } = default!;
}

public sealed class AppendReply
{
    public string AuditRecordId { get; set; } = default!;
    public string Status { get; set; } = "Created"; // or "Duplicate"
    public DateTimeOffset ObservedAt { get; set; }
}

B. C# client stub usage (metadata mapping)

var channel = GrpcChannel.ForAddress("https://api.atp.example");
var client  = channel.CreateGrpcService<IAuditIngestionService>();

var headers = new Metadata {
    { "authorization", $"Bearer {token}" },
    { "x-tenant-id", "acme" },
    { "traceparent", traceparent },
    { "x-idempotency-key", "acme-ord-9981-v1" }
};

var ctx = new CallContext(new CallOptions(headers: headers, deadline: DateTime.UtcNow.AddSeconds(2)));

var reply = await client.AppendAsync(new AppendRequest {
    SchemaVersion = "auditrecord.v1",
    AuditRecord = record
}, ctx);
Suggested client retry policy:

  • Retry on: UNAVAILABLE, RESOURCE_EXHAUSTED, DEADLINE_EXCEEDED
  • Backoff: exponential (base 100 ms, max 5 s), 20% jitter
  • Max attempts: 5
  • Reuse the same x-idempotency-key across attempts
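The backoff schedule above can be sketched independently of language; an illustrative Python version (function name and structure are assumptions):

```python
import random

RETRYABLE = {"UNAVAILABLE", "RESOURCE_EXHAUSTED", "DEADLINE_EXCEEDED"}

def backoff_ms(attempt, base=100.0, cap=5000.0, jitter=0.20, rng=random):
    """Delay before retry `attempt` (0-based): exponential, capped, jittered."""
    delay = min(cap, base * (2 ** attempt))       # 100, 200, 400, ... up to 5000
    return delay * rng.uniform(1 - jitter, 1 + jitter)

# Undecorated schedule (jitter disabled) for the 5 allowed attempts:
print([backoff_ms(a, jitter=0) for a in range(5)])
# [100.0, 200.0, 400.0, 800.0, 1600.0]
```

Only codes in `RETRYABLE` should trigger this loop, and every attempt must carry the same `x-idempotency-key` so retries dedupe to one record.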

Service Bus (MassTransit) Ingestion Flow

Asynchronous ingestion path using the Outbox → Bus → Inbox pattern. A producer writes to its own Outbox in the same transaction as its business change; an Outbox Dispatcher publishes to the MassTransit bus. The Ingestion Consumer reads messages, performs validation/canonicalization, applies dedupe/idempotency, appends to the WORM store, and emits AuditRecord.Accepted. Poison messages are routed to a DLQ with reprocess tooling.


Overview

Purpose: Provide a resilient, high-throughput async ingestion path with exactly-once effects (at-least-once delivery + idempotent consumer).
Scope: Producer outbox semantics, bus delivery (MassTransit), consumer inbox/deduplication, retry/backoff, DLQ handling, and operational reprocessing. Excludes REST/gRPC transports and batch presigned uploads.
Context: Recommended for internal microservices and partner pipelines that already publish domain events.
Key Participants:

  • Producer Service (business txn + Outbox write)
  • Outbox Dispatcher (background publisher)
  • Message Bus (MassTransit over RabbitMQ/Azure SB/Kafka)
  • Ingestion Consumer (MassTransit consumer)
  • Idempotency Store (consumer-inbox/dedupe keys)
  • Storage (Authoritative) (append-only WORM)
  • DLQ / Error Queue (quarantine and reprocess)

Prerequisites

System Requirements

  • MassTransit configured with a supported broker and durable queues/topics
  • Producer DB migration includes Outbox table (append-only)
  • Ingestion Consumer has Idempotency/Inbox store (e.g., table or cache)
  • Network connectivity Producer ↔ Broker ↔ Ingestion; TLS enabled end-to-end

Business Requirements

  • Tenants provisioned; routing keys/partitions defined per tenant
  • Policy/retention/classification configured (used by Ingestion)
  • DLQ retention meets compliance requirements

Performance Requirements

  • Producer Outbox dispatch interval (poll/batch size) tuned for target throughput
  • Consumer prefetch/concurrency tuned; p95 end-to-append ≤ 100 ms under load
  • Broker quotas/partitions sized for expected peak

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    participant Prod as Producer Service
    participant DB as Producer DB + Outbox
    participant Disp as Outbox Dispatcher
    participant Bus as Message Bus (MassTransit)
    participant Cons as Ingestion Consumer
    participant Idem as Idempotency Store
    participant Store as Storage (Authoritative)

    Prod->>DB: BEGIN TX: business change + INSERT Outbox{Message, IdempotencyKey, Tenant, Trace}
    DB-->>Prod: COMMIT

    Disp->>DB: Poll Outbox (unpublished rows)
    Disp->>Bus: Publish AuditRecordEnvelope (MessageId, CorrelationId, headers)
    Bus-->>Disp: Ack (broker)

    Bus-->>Cons: Deliver message
    Cons->>Idem: Check/put(idempotencyKey) // atomic get-or-create
    alt First delivery
        Cons->>Cons: Validate + canonicalize + policy/classification/redaction
        Cons->>Store: INSERT canonical record (WORM)
        Store-->>Cons: ack {auditRecordId}
        Cons->>Idem: Mark completed(auditRecordId)
    else Duplicate
        Idem-->>Cons: already completed
        Cons->>Cons: Skip side effects, ack broker
    end
Hold "Alt" / "Option" to enable pan & zoom

Alternative Paths

  • Transactional Outbox (in-proc): Outbox insert is in the same DB transaction as business write (recommended).
  • Partition affinity: Route by tenantId (or resourceId) to guarantee in-order delivery per key.
  • Saga assistance: Optional MassTransit saga can coordinate multi-message batches or ensure exactly-one finalization event per batch.
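The dispatcher half of the flow above reduces to a poll–publish–mark loop. A minimal sketch with in-memory stand-ins (names and structures are illustrative, not the actual Outbox schema):

```python
import itertools

# Outbox rows: a row is "pending" until published_at is stamped.
outbox = [
    {"id": 1, "message": "evt-a", "published_at": None},
    {"id": 2, "message": "evt-b", "published_at": None},
]
published = []
clock = itertools.count(100)  # fake timestamps

def publish(msg):
    """Stand-in for bus.Publish; True represents the broker ack."""
    published.append(msg)
    return True

def dispatch_once(batch=500):
    # Poll unpublished rows; stamp published_at only AFTER the broker ack.
    # Rows are never deleted pre-ack, so a crash between publish and stamp
    # only causes a redelivery, which the consumer's inbox dedupes.
    for row in [r for r in outbox if r["published_at"] is None][:batch]:
        if publish(row["message"]):
            row["published_at"] = next(clock)

dispatch_once()
print(published)                               # ['evt-a', 'evt-b']
print(all(r["published_at"] for r in outbox))  # True
```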

Error Paths

sequenceDiagram
    participant Disp as Outbox Dispatcher
    participant Bus as Message Bus
    participant Cons as Ingestion Consumer
    participant DLQ as Dead Letter Queue

    Disp->>Bus: Publish
    alt Broker unavailable
        Bus-->>Disp: nack/exception
        Disp->>Disp: Retry with exponential backoff, do not delete Outbox row
    end

    Bus-->>Cons: Deliver message
    alt Validation fails (poison message)
        Cons-->>Bus: reject (no requeue)
        Bus-->>DLQ: route
    else Transient error (Storage 503)
        Cons-->>Bus: nack (requeue)
        Bus->>Cons: redeliver with backoff
    end
Hold "Alt" / "Option" to enable pan & zoom

Request/Response Specifications

This flow is message-based. The message contract and headers are the stable surface. Operational HTTP endpoints (status, reprocess) are listed for completeness.

Input Requirements (message contract)

| Field | Type | Required | Description | Validation |
| --- | --- | --- | --- | --- |
| MessageId | GUID/ULID | Y | Broker message id | Generated by bus |
| CorrelationId | GUID/ULID | Y | Correlates with trace/saga | Present |
| IdempotencyKey | string | Y | Stable key per producer event | ≤128 ASCII |
| TenantId | string | Y | Tenant scope | Header & body match |
| Traceparent | string | Y | W3C trace context | 55-char |
| SchemaVersion | string | Y | auditrecord.v1 | Known |
| AuditRecord | object | Y | Canonical fields | ≤ 256 KiB after serialization |

Recommended headers (MassTransit)

  • tenant-id, traceparent, idempotency-key, schema-version, content-type=application/json

Output Specifications

| Field | Type | Description | Notes |
| --- | --- | --- | --- |
| AuditRecord.Accepted | event | Downstream event from Storage | Async |
| Consumer ack | broker ack | Successful handle | Commit offset / ack message |
| DLQ message | broker dead-letter | On poison/MaxRetry exceeded | Inspect & reprocess |

Example Message (Envelope)

{
  "SchemaVersion": "auditrecord.v1",
  "IdempotencyKey": "acme:order#9981:v1",
  "TenantId": "acme",
  "AuditRecord": {
    "tenantId": "acme",
    "createdAt": "2025-10-22T12:00:03.100Z",
    "action": "user.create",
    "resource": { "type": "Iam.User", "id": "U-1001" },
    "actor": { "id": "svc_billing", "type": "Service" }
  }
}

Error Handling

Error Scenarios (bus & ops APIs)

| Code/Outcome | Scenario | Recovery Action | Retry Strategy |
| --- | --- | --- | --- |
| INVALID (poison) → DLQ | Schema/shape invalid at consumer | Quarantine; fix mapper or data | Reprocess after fix |
| Requeue | Storage/Policy transient failure | Backoff & retry | Exponential backoff + jitter |
| Duplicate (idempotent skip) | IdempotencyKey already completed | No action | Ack immediately |
| 400 Bad Request (ops API) | Bad reprocess/status request | Correct request | No retry until fixed |
| 404 Not Found (ops API) | Unknown batch/msgId/tenant | Verify identifiers | — |
| 409 Conflict (ops API) | Reprocess while job active | Wait & retry | After completion |
| 503 Service Unavailable | Broker or Storage outage | Wait for recovery | Bounded backoff, circuit-breaker |

Failure Modes

  • Outbox row deletion before publish: never delete until broker ack; use “published_at IS NOT NULL” marker.
  • Inbox/idempotency race: ensure atomic get-or-create; use unique index on (TenantId, IdempotencyKey).
  • Re-delivery storm: cap retries; move to DLQ after N attempts.
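The atomic get-or-create over a unique `(TenantId, IdempotencyKey)` index can be sketched with SQLite as a stand-in (table and helper names are assumptions, not the platform's actual inbox schema):

```python
import sqlite3

# Consumer inbox: the unique index makes get-or-create atomic. The first
# delivery wins the INSERT and performs side effects; any redelivery hits
# the constraint and is acked as a duplicate with no side effects.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Inbox (
        TenantId       TEXT NOT NULL,
        IdempotencyKey TEXT NOT NULL,
        AuditRecordId  TEXT,
        UNIQUE (TenantId, IdempotencyKey)
    )""")

def handle(tenant, key):
    try:
        conn.execute(
            "INSERT INTO Inbox (TenantId, IdempotencyKey) VALUES (?, ?)",
            (tenant, key))
    except sqlite3.IntegrityError:
        return "Duplicate"  # already seen: skip side effects, ack broker
    # First delivery: validate, append to WORM store, then mark completed.
    conn.execute(
        "UPDATE Inbox SET AuditRecordId = ? WHERE TenantId = ? AND IdempotencyKey = ?",
        ("01JEB0V2G7NY5T6Q9KX3M4C8AP", tenant, key))
    return "Created"

print(handle("acme", "acme:order#9981:v1"))  # Created
print(handle("acme", "acme:order#9981:v1"))  # Duplicate
```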

Recovery Procedures

  1. Inspect DLQ; download sample and Problem details if present.
  2. Patch mapper/policy or data; use reprocess API/command to move back to primary queue.
  3. For stuck Outbox rows, resume dispatcher (no manual delete).

Performance Characteristics

Latency Expectations

  • Outbox write: ~1–2 ms (in-proc tx)
  • Dispatch to broker: sub-10 ms typical
  • Consume → append: p95 ≤ 100 ms steady state

Throughput Limits

  • Producer: controlled by Outbox polling batch size (e.g., 500) and dispatch concurrency.
  • Consumer: controlled by prefetch (e.g., 256) and concurrency (e.g., 8–32).
  • Broker: ensure partitions/queues per tenant or shard.

Resource Requirements

  • Producer DB IOPS for Outbox; Consumer CPU for JSON + hashing; Idempotency store write IOPS.

Scaling Considerations

  • Scale by queue/partition per tenant/shard; increase consumer count.
  • Use bulk publish from dispatcher; avoid tiny batches.

Security & Compliance

Authentication

  • Broker auth via username/secret/SAS; TLS enabled. MassTransit transport credentials stored securely.

Authorization

  • Topic/queue ACLs restrict producers/consumers to tenant-scoped routes.

Data Protection

  • Message payloads encrypted on the wire (TLS); sensitive attributes redacted by Ingestion before persist.

Compliance

  • Retain DLQ items per policy; operations on DLQ are audited (who/when reprocessed or purged).

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| outbox_rows_pending | gauge | Unpublished rows | Growth > 3× baseline |
| dispatcher_publish_rate | counter | Messages/sec to broker | Drop vs ingest |
| consumer_lag | gauge | Backlog size/age | Age > 60 s |
| consumer_retry_total | counter | Redeliveries | Spike indicates transient failures |
| dlq_messages_total | counter | DLQ count | > 0 sustained |

Logging Requirements

  • Include tenant, messageId, idempotencyKey (hashed), deliveryAttempt, and decision of DLQ vs retry; never log full payloads.
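One way to log the idempotency key hashed, as required above, is a short SHA-256 digest; this sketch assumes .NET 5+ (`SHA256.HashData`, `Convert.ToHexString`):

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class LogSafe
{
    // Hash the idempotency key so logs can correlate retries without leaking the raw value.
    public static string HashIdempotencyKey(string key)
    {
        var digest = SHA256.HashData(Encoding.UTF8.GetBytes(key));
        return Convert.ToHexString(digest)[..16]; // 64-bit hex prefix is enough for correlation
    }
}
```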

Distributed Tracing

  • Propagate traceparent via message headers; spans: outbox.enqueue, dispatcher.publish, consumer.handle, storage.append.

Health Checks

  • Producer: DB + broker connectivity; Consumer: broker + Storage/Idempotency write access.

Operational Procedures

Deployment

  1. Migrate Producer DB to add Outbox table; enable MassTransit outbox middleware.
  2. Deploy Ingestion Consumer with inbox/idempotency enabled (unique key index).
  3. Create queues/topics, bindings, and DLQ; enable TLS and ACLs.

Configuration

  • Producer: OutboxPollIntervalMs, OutboxBatchSize, broker connection, TLS certs.
  • Consumer: PrefetchCount, ConcurrentMessageLimit, retry policy (incremental/exponential), idempotency TTL.
  • Routing: exchange/topic per tenantId or shard key.
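The consumer settings above might be wired in MassTransit roughly like this. RabbitMQ transport and the `AuditRecordEnvelopeConsumer` name are assumptions; queue name and numeric values are illustrative:

```csharp
using MassTransit;

services.AddMassTransit(x =>
{
    x.AddConsumer<AuditRecordEnvelopeConsumer>();          // assumed consumer type
    x.UsingRabbitMq((context, cfg) =>
    {
        cfg.ReceiveEndpoint("audit.ingest", e =>
        {
            e.PrefetchCount = 256;                          // broker prefetch
            e.ConcurrentMessageLimit = 16;                  // in-process concurrency
            e.UseMessageRetry(r =>
                r.Exponential(5,                            // retry limit before DLQ
                    TimeSpan.FromMilliseconds(200),         // min interval
                    TimeSpan.FromSeconds(30),               // max interval
                    TimeSpan.FromMilliseconds(500)));       // interval delta
            e.ConfigureConsumer<AuditRecordEnvelopeConsumer>(context);
        });
    });
});
```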

Maintenance

  • Purge published Outbox rows by retention (based on published_at).
  • DLQ review and reprocess runs; archive old DLQ messages per policy.

Troubleshooting

  • Rising outbox_rows_pending → broker unreachable or dispatch stalled.
  • Spiking consumer_retry_total → investigate Storage/Policy health.
  • Many duplicates → check idempotency unique index and key construction.

Testing Scenarios

Happy Path Tests

  • Business txn writes Outbox; Dispatcher publishes; Consumer appends; Accepted observed.
  • Duplicate delivery skipped via idempotency store.

Error Path Tests

  • Poison message → DLQ; reprocess after fix returns success.
  • Broker outage → Outbox retains; auto-catchup after recovery.
  • Ops API: 400 bad reprocess request; 404 unknown message; 409 reprocess job already running.

Performance Tests

  • Validate throughput at target RPS with prefetch/concurrency sweeps.
  • Backpressure behavior under Storage throttling.

Security Tests

  • Tenant isolation via routing and ACLs.
  • TLS enforcement; credentials rotation without downtime.

Internal References

  • Orleans Actor Ingestion Flow

External References

  • MassTransit Outbox/Inbox docs for chosen transport
  • Broker-specific DLQ and retry policies

Appendices

A. Producer Outbox table (example)

CREATE TABLE Outbox (
  Id            bigint IDENTITY PRIMARY KEY,
  MessageId     uniqueidentifier NOT NULL,
  IdempotencyKey nvarchar(128) NOT NULL,
  TenantId      nvarchar(128) NOT NULL,
  Body          varbinary(max) NOT NULL,
  Traceparent   nvarchar(64) NULL,
  CreatedAt     datetime2 NOT NULL DEFAULT sysutcdatetime(),
  PublishedAt   datetime2 NULL
);
CREATE UNIQUE INDEX UX_Outbox_Idempotency ON Outbox (TenantId, IdempotencyKey);

B. Consumer Idempotency (Inbox) table (example)

CREATE TABLE ConsumerInbox (
  TenantId        nvarchar(128) NOT NULL,
  IdempotencyKey  nvarchar(128) NOT NULL,
  CompletedAt     datetime2 NULL,
  AuditRecordId   char(26) NULL, -- ULID
  PRIMARY KEY (TenantId, IdempotencyKey)
);

C. C# Contracts (MassTransit)

public record AuditRecordEnvelope(
    string SchemaVersion,
    string IdempotencyKey,
    string TenantId,
    AuditRecord AuditRecord
);
// Configure send
cfg.Message<AuditRecordEnvelope>(x => x.SetEntityName("audit.ingest"));
cfg.Send<AuditRecordEnvelope>(x => {
    x.UseRoutingKeyFormatter(ctx => ctx.Message.TenantId);
});

Orleans Actor Ingestion Flow

Actor-to-actor ingestion path using Microsoft Orleans. A producer Grain invokes an Ingestion Grain with an AuditRecord and context (tenant, traceparent, idempotencyKey). The Ingestion Grain enforces at-least-once delivery with idempotent effects, appends to the WORM store, and returns an AppendResult. Notes cover activation, placement, and reentrancy to achieve high concurrency without duplication.


Overview

Purpose: Provide a low-latency, in-cluster ingestion path that preserves actor semantics and ordering guarantees per key.
Scope: Orleans grain contract, RequestContext propagation, idempotency/inbox, storage append, reentrancy, activation/placement, and failure handling including DLQ for poison messages. Excludes REST/gRPC and external bus transports.
Context: Used by actor-based services already running on Orleans (e.g., domain aggregates or workflow grains); per-tenant or per-resource sharding maps naturally to grain keys.
Key Participants:

  • Producer Grain (domain actor generating audit facts)
  • Ingestion Grain (IAuditIngestionGrain) — validates, canonicalizes, dedupes, appends
  • Idempotency/Inbox Store — per-grain dedupe table or grain state
  • Storage (Authoritative) — append-only WORM store
  • DLQ (optional) — for poison inputs when configured

Prerequisites

System Requirements

  • Orleans cluster healthy (silos, membership, reminders/timers)
  • RequestContext propagation enabled between grains
  • Ingestion Grain type registered; access to Storage and Idempotency store
  • TLS/mTLS for silo-to-silo traffic if crossing nodes/regions

Business Requirements

  • Tenants configured; placement strategy keyed by (tenantId[, shard])
  • Policy/retention/classification active for tenant
  • DLQ or operator alerting policy defined for poison records

Performance Requirements

  • Ingestion Grain reentrancy policy chosen (see below) and tested at target RPS
  • Per-grain mailboxes sized; throughput meets ingest parity
  • Idempotency lookup p95 ≤ 5 ms (local state or fast store)

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    participant Producer as Producer Grain
    participant Ing as Ingestion Grain (IAuditIngestionGrain)
    participant Inbox as Idempotency/Inbox Store
    participant Store as Storage (Authoritative)

    Producer->>Ing: Append(auditRecord, idempotencyKey)<br/>(RequestContext: tenant, traceparent)
    Ing->>Ing: Validate + canonicalize + policy/classification/redaction
    Ing->>Inbox: GetOrPut(tenant,idempotencyKey)
    alt First delivery
        Ing->>Store: INSERT canonical record (WORM)
        Store-->>Ing: ack {auditRecordId}
        Ing->>Inbox: MarkCompleted(auditRecordId)
        Ing-->>Producer: AppendResult {auditRecordId, status:"Created"}
    else Duplicate
        Inbox-->>Ing: Found Completed(auditRecordId)
        Ing-->>Producer: AppendResult {auditRecordId, status:"Duplicate"}
    end

Alternative Paths

  • Per-tenant placement: IAuditIngestionGrain keys on tenantId (or (tenantId, shard)), preserving ordering within the key while allowing horizontal scale across tenants/shards.
  • Local persistent state inbox: Use Orleans PersistentState within the grain for fastest dedupe; or external table if cross-language consumers also write.
  • Reentrant grain: Enable reentrancy to allow concurrent requests sharing the same trace id/group; protect critical sections (idempotency write + store append) with coarse-grained serialization.

Error Paths

sequenceDiagram
    participant Ing as Ingestion Grain
    participant Store as Storage
    participant Inbox as Idempotency/Inbox

    Ing->>Store: INSERT
    alt Storage transient
        Store-->>Ing: throws transient
        Ing->>Ing: Retry with backoff, do not mark inbox completed
    else Validation failure (poison)
        Ing-->>Ing: throw ValidationException
        Ing->>Inbox: MarkFailed(optional) / emit DLQ if configured
    end

Request/Response Specifications

Input Requirements

| Field | Type | Required | Description | Validation |
|---|---|---|---|---|
| auditRecord | object | Y | Canonical AuditRecord | Data Model rules; ≤ 256 KiB |
| idempotencyKey | string | Y | Unique per submitted record | ≤ 128 ASCII |
| RequestContext["tenant-id"] | string | Y | Tenant routing | Must match auditRecord.tenantId |
| RequestContext["traceparent"] | string | Y | W3C context | 55-char format |
| RequestContext["schema-version"] | string | Y | Contract version | Known & active |

Output Specifications

| Field | Type | Description | Notes |
|---|---|---|---|
| AppendResult.auditRecordId | ULID | Durable id | Assigned or echoed |
| AppendResult.status | enum | Created or Duplicate | Idempotent outcome |
| AppendResult.observedAt | timestamp | Ingestion observation | ms precision |

Example Grain Contract (C#)

public interface IAuditIngestionGrain : IGrainWithStringKey
{
    Task<AppendResult> Append(AuditRecord record, string idempotencyKey);
}

public sealed record AppendResult(string AuditRecordId, string Status, DateTimeOffset ObservedAt);

Producer call

RequestContext.Set("tenant-id", tenantId);
RequestContext.Set("traceparent", traceparent);
RequestContext.Set("schema-version", "auditrecord.v1");

var grain = GrainFactory.GetGrain<IAuditIngestionGrain>(tenantId); // or $"{tenantId}:{shard}"
var result = await grain.Append(record, idempotencyKey);

Error Handling

Error Scenarios (Orleans ↔ HTTP analogy)

| Orleans Exception/Outcome | HTTP Analogy | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|---|
| — (OK) | 202 Accepted | Created/Duplicate | — | — |
| ArgumentException / validation error | 400 Bad Request | Schema/shape/limits invalid | Fix payload | No retry until corrected |
| GrainReferenceNotFoundException / unknown key | 404 Not Found | Wrong grain key/tenant or disabled feature | Check routing/tenant | No retry |
| IdempotencyConflictException | 409 Conflict | Same key, different payload | Use a new key; reconcile | Do not retry with same key |
| OrleansException with IsTransient | 503 Service Unavailable | Store or infra transient | Backoff & retry | Exponential backoff + jitter |
| TimeoutException | 504 Gateway Timeout | Grain busy or network stall | Increase timeout if safe | Limited retries |

Failure Modes

  • Reentrancy hazard: racing requests with same key—protect with atomic GetOrPut in inbox and serialize append section.
  • Activation churn: hotspot tenants cause frequent activations; use sticky placement and activation warmup.
  • Poison record: repeated validation failures—optionally route to DLQ or mark Failed in inbox for operator review.
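The atomic GetOrPut guarding against the reentrancy hazard can be shown as an in-memory reference implementation of the semantics; a production store would get the same atomicity from the (TenantId, IdempotencyKey) primary key on the inbox table. Type and member names here are illustrative, not an existing API:

```csharp
using System.Collections.Concurrent;

public sealed class InMemoryInbox
{
    private sealed record Entry(bool Completed, string? AuditRecordId);
    private readonly ConcurrentDictionary<(string Tenant, string Key), Entry> _rows = new();

    // Returns First=true exactly once per (tenant, key); duplicates see the recorded id, if any.
    public (bool First, string? ExistingId) GetOrPut(string tenantId, string key)
    {
        if (_rows.TryAdd((tenantId, key), new Entry(false, null)))
            return (true, null);                       // this delivery won the race
        var existing = _rows[(tenantId, key)];
        return (false, existing.AuditRecordId);        // duplicate; may still be Pending
    }

    public void MarkCompleted(string tenantId, string key, string auditRecordId)
        => _rows[(tenantId, key)] = new Entry(true, auditRecordId);
}
```

`TryAdd` makes the insert-or-observe step a single atomic operation, which is the property the SQL variant must also provide.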

Recovery Procedures

  1. For transients, retry with jitter; maintain idempotency key.
  2. For conflict, choose canonical payload and re-attempt with a new key if necessary.
  3. For poison, capture Problem details and trigger operator workflow or DLQ.

Performance Characteristics

Latency Expectations

  • P50: 5–15 ms
  • P95: ≤ 35 ms
  • P99: ≤ 75 ms

Throughput Limits

  • Single ingestion grain: thousands of req/s with reentrancy on and critical section minimized.
  • Cluster throughput scales linearly with # of silos × # of shards/tenants.

Resource Requirements

  • CPU for JSON parse/hash; memory for small inbox state.
  • Low storage write IOPS per grain; batch commits optional if available in store client.

Scaling Considerations

  • Placement: Prefer hash-based placement by (tenantId[, shard]).
  • Reentrancy: Enable grain reentrancy; serialize only the idempotency + append critical section.
  • Backpressure: Use Orleans.Concurrency.Limit or custom queue length monitors to shed load gracefully.

Security & Compliance

Authentication

  • Internal cluster auth (mTLS/IPSec as required); producer identity derived from grain identity and/or tokens in RequestContext if crossing trust boundaries.

Authorization

  • Validate tenant-id context matches auditRecord.tenantId; enforce RBAC/ABAC as needed for cross-tenant actors.

Data Protection

  • No sensitive data in logs; redaction/classification applied before persist.

Compliance

  • Append operations recorded with tenant, grainKey, idempotencyKey (hashed), and traceId.

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| orleans_grain_queue_length | gauge | Mailbox depth per ingestion grain | Sustained growth |
| ingestion_append_latency_ms | histogram | Grain handle latency | p95 > 35 ms |
| inbox_getorput_latency_ms | histogram | Idempotency lookup time | p95 > 5 ms |
| idempotent_duplicates_total | counter | Duplicate skips | Track trend |
| orleans_activations_total | counter | Activations of ingestion grains | Unexpected spikes |

Logging Requirements

  • Structured logs: tenant, grainKey, traceId, idempotencyKey (hashed), outcome (Created|Duplicate|Failed).

Distributed Tracing

  • Carry traceparent in RequestContext; spans: grain.append, inbox.check, storage.append; include tenant, grainKey.

Health Checks

  • Silo membership stable; storage reachable; inbox store latency under thresholds.

Operational Procedures

Deployment

  1. Register IAuditIngestionGrain and storage/idempotency providers; deploy silos.
  2. Warm hot-tenant grains (pre-activation) to reduce cold-start latency.
  3. Validate end-to-end append and idempotency in non-prod.

Configuration

  • Reentrancy: [Reentrant] attribute or runtime config as appropriate.
  • Placement: consistent hashing or custom placement by tenant.
  • Timeouts/Retries: client call timeouts (e.g., 2s) and retry policies for transient exceptions.

Maintenance

  • Monitor inbox state growth; compact or TTL-complete entries older than dedupe window.
  • Rotate cluster certs/keys if mTLS in use.

Troubleshooting

  • Many TimeoutExceptions → check reentrancy, queue length, storage latency.
  • Frequent IdempotencyConflictException → investigate client keying logic.
  • Activation spikes → adjust placement/keep-alive or increase silos.

Testing Scenarios

Happy Path Tests

  • Append returns Created with auditRecordId.
  • Second call with same idempotencyKey returns Duplicate without extra writes.

Error Path Tests

  • Validation error → 400 analog (ArgumentException), not persisted.
  • Unknown grain key/disabled tenant → 404 analog.
  • Conflict on idempotency (different payload) → 409 analog.
  • Transient storage failure → retried then succeeds.

Performance Tests

  • Reentrancy on: sustain target RPS with p95 ≤ 35 ms.
  • Critical section profiling (inbox+append) shows minimal blocking.

Security Tests

  • tenant-id in RequestContext matches payload; mismatches rejected.
  • Trace propagation visible across grains and storage client.

Internal References

  • gRPC Service Ingestion Flow
  • Service Bus (MassTransit) Ingestion Flow
  • Retry Flow

External References

  • Orleans Docs: Grains, Persistence, Reentrancy, RequestContext

Appendices

A. Inbox table (if using external store)

CREATE TABLE IngestionInbox (
  TenantId        nvarchar(128) NOT NULL,
  IdempotencyKey  nvarchar(128) NOT NULL,
  Status          tinyint NOT NULL, -- 0=Pending,1=Completed,2=Failed
  AuditRecordId   char(26) NULL,
  UpdatedAt       datetime2 NOT NULL DEFAULT sysutcdatetime(),
  PRIMARY KEY (TenantId, IdempotencyKey)
);

B. Reentrancy pattern (C# sketch)

[Reentrant]
public class AuditIngestionGrain : Grain, IAuditIngestionGrain
{
    // _criticalSection (per-key async lock), _inbox (idempotency store), and _storage
    // (WORM append client) are injected collaborators, elided from this sketch.
    public async Task<AppendResult> Append(AuditRecord record, string key)
    {
        using var _ = await _criticalSection.EnterAsync(key); // narrow critical region
        var (first, existingId) = await _inbox.GetOrPutAsync(record.TenantId, key);
        if (!first) return new(existingId, "Duplicate", DateTimeOffset.UtcNow);

        var id = await _storage.AppendAsync(record); // may retry internally
        await _inbox.MarkCompletedAsync(record.TenantId, key, id);
        return new(id, "Created", DateTimeOffset.UtcNow);
    }
}

Tenant-Scoped Query Flow

Retrieves a tenant’s AuditEvents timeline via the Query Service through the API Gateway. Uses row-level security (RLS) / tenant validation, seek-based pagination (cursor over (createdAt,auditRecordId)), and returns X-Watermark and X-Lag headers indicating projection freshness.


Overview

Purpose: Provide a low-latency, read-optimized timeline of audit events for a single tenant with consistent ordering and efficient pagination.
Scope: Gateway authN/Z, tenant scoping (header/path), RLS enforcement in Read DB, timeline query, seek pagination, watermark/lag headers. Excludes full-text search (see Search flow) and on-read PII masking (covered in Data Redaction flow).
Context: Runs against the AuditEvents projection maintained by the Projection Service; consults the Checkpoint Store for the current watermark.
Key Participants:

  • Query Client (API consumer)
  • API Gateway (authN/Z, rate limiting, header normalization)
  • Query Service (query planning, pagination, response shaping)
  • Read DB (AuditEvents) (tenant-scoped projection with indexes & RLS)
  • Checkpoint Store (per-tenant watermark)
  • Cache (optional, key-scoped response caching)

Prerequisites

System Requirements

  • API Gateway reachable with TLS; JWKS configured for JWT validation
  • Query Service deployed with network access to Read DB & Checkpoint Store
  • Read DB has RLS policies enforcing tenantId on AuditEvents
  • Projection/Checkpoint up and healthy (watermark progressing)

Business Requirements

  • Tenant exists and is active; edition permits timeline queries
  • Data retention/visibility policies do not restrict requested window
  • If multi-region, tenant’s home region is routable by Gateway

Performance Requirements

  • p95 ≤ 150 ms for limit<=200 over hot partitions
  • Indexes on (tenantId, createdAt DESC, auditRecordId) present
  • Cache configured (optional) with safe TTL & keying by tenant + params

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    actor Client as Query Client
    participant GW as API Gateway
    participant Q as Query Service
    participant RDB as Read DB (AuditEvents + RLS)
    participant CKPT as Checkpoint Store
    participant Cache as Cache

    Client->>GW: GET /audit/v1/events?limit=100&cursor=... <br/> h:{Authorization,x-tenant-id,traceparent}
    Note right of GW: Validate JWT, tenant scope, rate-limit, normalize headers
    GW->>Q: Forward request + tenant context + traceparent
    Q->>CKPT: Read tenant watermark (offset,timestamp)
    alt Cache enabled and hit
        Q->>Cache: Lookup by {tenant, params}
        Cache-->>Q: Cached page + cursors
    else No cache / miss
        Q->>RDB: SELECT ... FROM AuditEvents WHERE tenantId=? AND (seek by cursor) ORDER BY createdAt DESC, auditRecordId DESC LIMIT N
        RDB-->>Q: rows, next/prev anchors
        Q->>Cache: Put page (optional TTL)
    end
    Q-->>GW: 200 JSON {items, nextCursor, prevCursor} + headers X-Watermark, X-Lag
    GW-->>Client: 200 OK
    Note over Client,RDB: Seek-based pagination avoids deep OFFSET scans

Alternative Paths

  • Time-bounded query: from/to timestamps narrow the scan before seek pagination.
  • Ascending order: order=asc for forward-in-time scans; cursors encode direction.
  • Head polling: Client uses If-None-Match: "wmk:<value>"; Query Service returns 304 Not Modified if X-Watermark unchanged.
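The seek-based pagination used above keys on the composite (createdAt, auditRecordId) anchor from the decoded cursor rather than OFFSET. A sketch of the descending page, with PostgreSQL-style parameters (adapt TOP/OFFSET-FETCH for SQL Server):

```csharp
// Descending seek page: the cursor supplies the (ts, id) anchor of the last row returned.
const string TimelinePageSql = @"
SELECT auditRecordId, createdAt, action, resource, actor
FROM   AuditEvents
WHERE  tenantId = @tenantId
  AND  (createdAt < @anchorTs
        OR (createdAt = @anchorTs AND auditRecordId < @anchorId))
ORDER BY createdAt DESC, auditRecordId DESC
LIMIT  @limit;
";
```

The tie-break on auditRecordId keeps ordering total when multiple rows share a createdAt millisecond, so pages never skip or repeat rows.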

Error Paths

sequenceDiagram
    actor Client
    participant GW as API Gateway
    participant Q as Query Service

    Client->>GW: GET /audit/v1/events?limit=5000&cursor=bad
    alt Invalid params / cursor parse fail
        GW-->>Client: 400 Bad Request (Problem+JSON)
    else Unknown tenant / route
        GW-->>Client: 404 Not Found (Problem+JSON)
    else Conflicting params (e.g., both cursor & page)
        GW-->>Client: 409 Conflict (Problem+JSON)
    else Unauthorized / Forbidden
        GW-->>Client: 401/403 (Problem+JSON)
    else Service backpressure / upstream down
        GW-->>Client: 429/503 (Retry-After)
    end

Request/Response Specifications

Input Requirements

| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| Method/Path | HTTP | Y | GET /audit/v1/events or /audit/v1/tenants/{tenantId}/events (timeline endpoint) | One of header or path must provide tenant |
| Authorization | header | Y | Bearer <JWT> | Valid signature, audience; not expired |
| x-tenant-id | header | Y* | Tenant scope (if not in path) | ^[A-Za-z0-9._-]{1,128}$ |
| traceparent | header | O | W3C trace context | 55-char format |
| limit | query | O | Max items per page | 1–1000, default 100 |
| cursor | query | O | Opaque base64url cursor (ts,id,dir) | Valid/owned by tenant |
| order | query | O | desc (default) or asc | enum |
| from/to | query | O | ISO-8601 UTC time bounds | from ≤ to, within retention |
| filter.resourceType | query | O | Optional type filter | matches known types |
| filter.actorId | query | O | Optional actor filter | ≤ 128 chars |

*Required unless tenant is in path.

Output Specifications

| Field | Type | Description | Notes |
|---|---|---|---|
| items[] | array | Page of timeline entries | Ordered by order |
| nextCursor | string? | Opaque cursor for next page | Omitted if no more |
| prevCursor | string? | Opaque cursor for reverse page | Omitted on first page |
| count | integer | Number of items in this page | ≤ limit |

Response Headers

  • X-Watermark: ISO-8601 UTC of latest committed projection timestamp for the tenant.
  • X-Lag: Seconds behind “now” (now - X-Watermark).
  • Cache-Control: typically no-store, max-age=0 (or short TTL if allowed).

Example Requests/Responses

Request

GET /audit/v1/events?limit=100&order=desc&from=2025-10-22T00:00:00Z HTTP/1.1
Host: api.atp.example
Authorization: Bearer eyJhbGciOi...
x-tenant-id: acme
traceparent: 00-9f0c1d2e3a4b5c6d7e8f9a0b1c2d3e4f-1111222233334444-01

200 OK

HTTP/1.1 200 OK
Content-Type: application/json
X-Watermark: 2025-10-22T12:03:05.120Z
X-Lag: 4.8
Cache-Control: no-store

{
  "items": [
    {
      "auditRecordId": "01JEC2A2V7N9M0X1Y2Z3A4B5C6",
      "createdAt": "2025-10-22T12:02:59.812Z",
      "action": "user.create",
      "resource": { "type": "Iam.User", "id": "U-1001" },
      "actor": { "id": "svc_ingress", "type": "Service", "display": "ingress-gw" },
      "decision": { "result": "Allow" }
    }
  ],
  "nextCursor": "eyJ0cyI6IjIwMjUtMTAtMjJUMTI6MDI6NTkuODEyWiIsImlkIjoiMDFK...IiwgImRpciI6ImRlc2MifQ",
  "prevCursor": null,
  "count": 1
}

400 Bad Request (invalid cursor)

{
  "type": "urn:connectsoft:errors/query/cursor.invalid",
  "title": "Invalid cursor",
  "status": 400,
  "detail": "Cursor is malformed or expired for this tenant.",
  "errors": [{ "pointer": "query.cursor", "reason": "malformed" }],
  "traceId": "9f0c1d2e3a4b5c6d7e8f9a0b1c2d3e4f"
}

Error Handling

Error Scenarios

| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Missing/invalid params; bad cursor; from>to; limit out of bounds | Correct request; regenerate cursor | No retry until fixed |
| 401 | Missing/invalid/expired JWT | Obtain valid token | Retry after renewal |
| 403 | Token not authorized for x-tenant-id | Request proper scope/role | No retry until fixed |
| 404 | Tenant or route not found; tenant disabled | Verify tenant/URL | No retry |
| 409 | Conflicting params (e.g., cursor with from/to not allowed) or cursor tenant mismatch | Remove conflict; obtain fresh cursor | Retry after fix |
| 429 | Rate limit / query backpressure | Backoff; respect Retry-After | Exponential backoff + jitter |
| 503 | Upstream (DB/checkpoint) unavailable | Wait for recovery | Retry with backoff |
| 304 | If-None-Match matched watermark | Use cached data | Re-poll later |

Failure Modes

  • Stale cursor after rebuild/compaction: server returns 409 with type: .../cursor.stale and a resyncFrom hint.
  • RLS misconfiguration: query returns 403/500; health checks should detect missing policy.
  • Watermark stale: X-Lag grows; alerting should trigger projector scaling.

Recovery Procedures

  1. On 409 cursor.stale, drop cursor and re-start from from=lastSeenTime.
  2. On 429/503, backoff with jitter; do not increase limit to compensate.
  3. If RLS errors occur, fail closed (no data) and escalate to operations.

Performance Characteristics

Latency Expectations

  • P50 ≤ 60 ms, P95 ≤ 150 ms, P99 ≤ 300 ms for limit≤200 over warm cache/index.

Throughput Limits

  • Per tenant: 200 rps sustained, 800 rps burst (configurable).
  • Global: scales with read replicas and cache hit rate.

Resource Requirements

  • Read DB IOPS proportional to limit and filter selectivity; ensure covering indexes.
  • Cache memory sized for hot cursors/pages if enabled.

Scaling Considerations

  • Add read replicas; shard by tenant.
  • Use index-only scans with narrow projections to reduce I/O.
  • Apply adaptive limit caps under load; enable result caching for hot ranges.

Security & Compliance

Authentication

  • OIDC JWT (short-lived), traceparent propagated; mTLS between Gateway ↔ Query Service (optional but recommended).

Authorization

  • Enforce audit:read:timeline scope; verify sub/tenant claims; apply DB-level RLS on tenantId.

Data Protection

  • Only minimal fields returned; no secret values.
  • X-Watermark reveals timing only; avoid leaking internal offsets.

Compliance

  • Access logged with tenantId, subject, filters, and watermark for auditability.

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| query_latency_ms{route="/audit/v1/events"} | histogram | End-to-end latency | p95 > 150 ms (5m) |
| timeline_results_count | histogram | Items per page | Sudden 0 across tenants |
| watermark_lag_seconds | gauge | now - watermark | > target (e.g., > 10 s) |
| query_rate_limited_total | counter | 429 responses | > 5% sustained |
| cursor_stale_total | counter | 409 due to stale/malformed cursor | Spike indicates rebuild issues |

Logging Requirements

  • Structured logs: tenantId, traceId, limit, order, from/to (if set), cursorHash, resultCount, watermark, lagSec. Do not log raw cursor tokens.

Distributed Tracing

  • Spans: query.parse, db.select.timeline, ckpt.read, cache.get/set.
  • Attributes: tenant, limit, order, hasCursor, rows, lagMs.

Health Checks

  • Readiness: DB + checkpoint reachable; RLS policy verified; index present.
  • Liveness: threadpool saturation, connection pool usage below thresholds.

Operational Procedures

Deployment

  1. Apply/verify AuditEvents schema & RLS policies in Read DB.
  2. Deploy Query Service behind Gateway route /audit/v1/events.
  3. Validate watermark propagation and X-Lag accuracy in staging.

Configuration

  • Env: QUERY_MAX_LIMIT, DEFAULT_LIMIT, CACHE_TTL_SECONDS, RLS_ENABLED=true.
  • Indexing: (tenantId, createdAt DESC, auditRecordId) plus optional partial indexes per tenant.

Maintenance

  • Periodic VACUUM/ANALYZE (SQL) or compaction (NoSQL).
  • Rotate JWT keys; update JWKS URL.
  • Monitor and refresh cache layer sizing.

Troubleshooting

  • High watermark_lag_seconds → check projector lag, search bulk backlog.
  • Many 409 (cursor.stale) → investigate projection rebuilds/compaction.
  • Slow queries → examine query plans; add/adjust indexes.

Testing Scenarios

Happy Path Tests

  • GET with valid x-tenant-id and limit=100 returns 200 with ordered items and X-Watermark/X-Lag.
  • nextCursor yields the next page; prevCursor navigates back without duplication.

Error Path Tests

  • 400 on malformed cursor / invalid limit / from>to.
  • 404 when tenant missing/disabled or route incorrect.
  • 409 when cursor used with disallowed params or tenant mismatch.
  • 429/503 trigger proper backoff behavior.

Performance Tests

  • p95 ≤ 150 ms for limit=200 under typical load.
  • Index-only scan verified via EXPLAIN plan.

Security Tests

  • JWT audience/scope enforced; RLS prevents cross-tenant leakage.
  • X-tenant-id header vs path tenant consistency enforced.

Internal References

  • Search Query Flow
  • Filtered Query Flow (policy/redaction on read)
  • Audit Record Projection Update Flow

External References

  • RFC 9110 (HTTP Semantics; obsoletes RFCs 7231/7233) for headers
  • W3C Trace Context (traceparent)

Appendices

A. Cursor Encoding (example)

cursor = base64url( JSON.stringify({ ts:"2025-10-22T12:02:59.812Z", id:"01JEC2A2V7N9M0X1Y2Z3A4B5C6", dir:"desc" }) )
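An equivalent sketch in C#, including the base64url padding handling the JS one-liner glosses over (`Cursor` and `CursorCodec` are illustrative names):

```csharp
using System;
using System.Text;
using System.Text.Json;

public sealed record Cursor(string Ts, string Id, string Dir);

public static class CursorCodec
{
    private static readonly JsonSerializerOptions Options =
        new() { PropertyNamingPolicy = JsonNamingPolicy.CamelCase, PropertyNameCaseInsensitive = true };

    public static string Encode(Cursor c)
    {
        var json = JsonSerializer.Serialize(c, Options);
        // Standard base64 → base64url: strip padding, swap '+'/'/' for '-'/'_'.
        return Convert.ToBase64String(Encoding.UTF8.GetBytes(json))
            .TrimEnd('=').Replace('+', '-').Replace('/', '_');
    }

    public static Cursor Decode(string token)
    {
        var b64 = token.Replace('-', '+').Replace('_', '/');
        b64 = b64.PadRight(b64.Length + (4 - b64.Length % 4) % 4, '='); // restore padding
        var json = Encoding.UTF8.GetString(Convert.FromBase64String(b64));
        return JsonSerializer.Deserialize<Cursor>(json, Options)
               ?? throw new FormatException("Malformed cursor");
    }
}
```

A decode failure here maps to the 400 cursor.invalid Problem response shown earlier.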

B. Example RLS Policy (PostgreSQL)

ALTER TABLE audit_events ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON audit_events
USING (tenant_id = current_setting('app.tenant_id')::text);
-- Set current_setting('app.tenant_id') per request in the DB session.
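Setting app.tenant_id per request from the Query Service might look like this; Npgsql is assumed as the driver, and `set_config(..., true)` scopes the value to the current transaction so RLS filters every statement inside it:

```csharp
using Npgsql;

public static class TenantSession
{
    // Begin a transaction with app.tenant_id bound for the RLS policy above.
    public static async Task<NpgsqlTransaction> BeginTenantScopedAsync(
        NpgsqlConnection conn, string tenantId)
    {
        var tx = await conn.BeginTransactionAsync();
        await using var cmd = new NpgsqlCommand(
            "SELECT set_config('app.tenant_id', @tenant, true)", conn, tx);
        cmd.Parameters.AddWithValue("tenant", tenantId);
        await cmd.ExecuteScalarAsync();
        return tx; // run the timeline SELECT inside this transaction, then commit
    }
}
```

Because the setting is transaction-local, pooled connections cannot leak one tenant's scope into another request.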

Search Query Flow

Full-text, facet, and type-ahead search over tenant-scoped indices. The Search Service executes per-tenant queries against a per-tenant alias (or filtered index), returning ranked results, facet aggregations, and optional suggest completions. Responses include X-Index-Watermark and X-Index-Lag to convey indexing freshness.


Overview

Purpose: Provide fast, flexible discovery of audit records using full-text, filters, facets, and suggesters.
Scope: Query parsing, tenant isolation via alias/filter, facet execution, pagination, highlights, and freshness reporting. Excludes authoritative reads (timeline) and export; on-read masking follows redaction policy.
Context: Operates on the Search Index projection populated by the Projection Service; eventual consistency vs. authoritative store is expected.
Key Participants:

  • Search Client (API consumer)
  • Search Service (query planner/executor)
  • Search Engine (per-tenant indices/aliases)
  • Checkpoint Store (optional: index watermark)
  • Cache (optional: hot query caching)

Prerequisites

System Requirements

  • Search cluster reachable with TLS; per-tenant indices/aliases created
  • Search Service has network access and service account with read permissions
  • Projection → Index pipeline healthy (indexers running)

Business Requirements

  • Tenant has Search edition/feature enabled
  • Data minimization and on-read masking rules configured for Search documents
  • Retention and residency policies applied to search indices

Performance Requirements

  • p95 query latency ≤ 200 ms for size ≤ 50 and modest facets
  • Cluster capacity sized for QPS and aggregation workload
  • Index freshness SLO: p95 ≤ 10 s Accept→Indexed

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    actor Client as Search Client
    participant Svc as Search Service
    participant Engine as Search Engine (Tenant Alias)
    participant CKPT as Checkpoint Store

    Client->>Svc: POST /search/v1/query<br/>h:{Authorization,x-tenant-id}<br/>{q, filters, facets, size, cursor?}
    Svc->>Svc: Validate params, build per-tenant query, apply redaction-on-read
    Svc->>Engine: Execute { index: tenant-alias, body: query+aggs }
    Engine-->>Svc: Hits, facets, next cursor, took
    Svc->>CKPT: Read index watermark (optional)
    Svc-->>Client: 200 {results, facets, nextCursor} + X-Index-Watermark + X-Index-Lag

Alternative Paths

  • Time freshness bias: apply recency boost within a freshness window (e.g., last 24h).
  • Filter-only queries (q empty): return filtered timeline with facets.
  • Suggest endpoint: /search/v1/suggest uses completion or n-gram suggesters with prefix and filters.
  • Read-through cache: cache popular queries for short TTL (exclude personalized filters).
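A minimal client sketch for the query endpoint described above, using HttpClient; the host, token source, and payload values are placeholders, with the body shape taken from the Input Requirements:

```csharp
using System;
using System.Linq;
using System.Net.Http;
using System.Net.Http.Json;

var jwt = Environment.GetEnvironmentVariable("ATP_TOKEN") ?? ""; // placeholder token source
var http = new HttpClient { BaseAddress = new Uri("https://api.atp.example") };
http.DefaultRequestHeaders.Add("x-tenant-id", "acme");
http.DefaultRequestHeaders.Authorization = new("Bearer", jwt);

var request = new
{
    q = "user create OR signup",
    filters = new { resourceType = "Iam.User" },
    facets = new[] { "resourceType", "action" },
    size = 25,
    highlight = true
};

using var response = await http.PostAsJsonAsync("/search/v1/query", request);
// Freshness headers accompany the body; callers can surface lag to end users.
var watermark = response.Headers.TryGetValues("X-Index-Watermark", out var v)
    ? v.First() : null;
```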

Error Paths

sequenceDiagram
    actor Client
    participant Svc as Search Service

    Client->>Svc: POST /search/v1/query (bad params/tenant)
    alt Bad request (malformed cursor/invalid facet)
        Svc-->>Client: 400 Problem+JSON
    else Tenant alias missing / disabled
        Svc-->>Client: 404 Problem+JSON
    else Conflicting params (both page & cursor, or size>cap)
        Svc-->>Client: 409 Problem+JSON
    else Rate limited / engine unavailable
        Svc-->>Client: 429/503 Problem+JSON (+ Retry-After)
    end

Request/Response Specifications

Input Requirements

| Field | Type | Req | Description | Validation |
| --- | --- | --- | --- | --- |
| Method/Path | HTTP | Y | POST /search/v1/query (search endpoint) | JSON body |
| Authorization | header | Y | Bearer <JWT> | Valid, not expired |
| x-tenant-id | header | Y* | Tenant scope | ^[A-Za-z0-9._-]{1,128}$ |
| q | string | O | Query string (full-text) | 0–2048 chars |
| filters | object | O | {resourceType?, actorId?, action?, time:{from?,to?}, decision?} | enums/ISO-8601 |
| facets | array | O | Facets to compute (e.g., ["resourceType","action"]) | allowlist only |
| size | int | O | Page size | 1–100 (default 25) |
| cursor | string | O | Opaque search-after token | base64url |
| highlight | bool | O | Return snippets | default false |
| sort | enum | O | relevance (default), createdAt:desc, createdAt:asc | allowlist |

*Required unless tenant is encoded in a dedicated tenant path variant.

Output Specifications

| Field | Type | Description | Notes |
| --- | --- | --- | --- |
| results[] | array | Search hits with essential fields | Redacted as needed |
| facets | object | Buckets per requested facet | Top-N buckets |
| nextCursor | string? | Token for next page | Omitted if no more |
| tookMs | int | Engine execution time | From engine |
| totalApprox | int | Approx total matches | Not exact if tracking disabled |

Response Headers

  • X-Index-Watermark: ISO-8601 UTC of latest indexed event for tenant
  • X-Index-Lag: Seconds behind “now” (now - X-Index-Watermark)

Example Payloads

Request

{
  "q": "user create OR signup",
  "filters": {
    "resourceType": "Iam.User",
    "time": { "from": "2025-10-22T00:00:00Z", "to": "2025-10-22T23:59:59Z" }
  },
  "facets": ["resourceType", "action"],
  "size": 25,
  "sort": "relevance",
  "highlight": true
}

200 OK

X-Index-Watermark: 2025-10-22T12:03:05.120Z
X-Index-Lag: 7.2
{
  "results": [
    {
      "auditRecordId": "01JEC7KX8…",
      "createdAt": "2025-10-22T11:58:10.201Z",
      "action": "user.create",
      "resource": { "type": "Iam.User", "id": "U-1001" },
      "actor": { "id": "svc_signup", "type": "Service", "display": "signup-svc" },
      "score": 7.42,
      "highlights": { "action": ["<em>user</em>.create"] }
    }
  ],
  "facets": {
    "resourceType": [{ "key": "Iam.User", "count": 128 }],
    "action": [{ "key": "user.create", "count": 92 }]
  },
  "nextCursor": "eyJzZWFyY2hBZnRlciI6WyIxLjIzIiwiMDFK...Il19",
  "tookMs": 23,
  "totalApprox": 612
}

400 Bad Request (invalid facet)

{
  "type": "urn:connectsoft:errors/search/facet.invalid",
  "title": "Invalid facet",
  "status": 400,
  "detail": "Facet 'userEmail' is not allowed.",
  "errors": [{ "pointer": "/facets/0", "reason": "allowlist" }]
}

Error Handling

Error Scenarios

| HTTP Code | Scenario | Recovery Action | Retry Strategy |
| --- | --- | --- | --- |
| 400 | Malformed cursor, disallowed facet, bad time range, size out of bounds | Fix request | No retry until corrected |
| 401 | Missing/invalid JWT | Acquire valid token | Retry after renewal |
| 403 | Insufficient audit:search scope or tenant mismatch | Request proper scope | No retry |
| 404 | Tenant alias/index missing or feature disabled | Verify tenant/feature | No retry |
| 409 | Conflicting params (e.g., cursor with sort not supported) | Adjust params | Retry after fix |
| 422 | Query too complex (clause limit, wildcard explosion) | Simplify query | No retry until changed |
| 429 | Rate limited/backpressure | Respect Retry-After | Exponential backoff + jitter |
| 503 | Engine unavailable / timeout | Wait for recovery | Retry with jitter |

Failure Modes

  • Stale cursor after reindex/alias swap → 409 cursor.stale with resyncFrom hint.
  • Facet blow-up (high cardinality) → 422 with guidance to narrow filters.
  • Highlight overflow → server truncates snippets to configured limit.

Recovery Procedures

  1. On 409 cursor.stale, drop cursor and re-issue query without cursor or with from bound.
  2. On 429/503, backoff; keep query identical to benefit from caching when enabled.
  3. Replace disallowed facets with supported ones per schema allowlist.

Performance Characteristics

Latency Expectations

  • P50 ≤ 80 ms, P95 ≤ 200 ms, P99 ≤ 400 ms (moderate facets, size ≤ 50).

Throughput Limits

  • Per tenant baseline 300 rps sustained; global scales with cluster nodes and shard count.

Resource Requirements

  • Aggregations demand CPU/heap; ensure shard sizing and circuit breakers for large queries.

Scaling Considerations

  • Scale by shards/replicas; use per-tenant alias routing.
  • Enable result caching and request coalescing for hot queries.
  • Apply freshness bias instead of hard refresh to avoid heavy refresh calls.

Security & Compliance

Authentication

  • OIDC JWT at Gateway; mTLS between Search Service and engine (optional).

Authorization

  • Enforce audit:search scope; per-tenant isolation via index alias filter or index-per-tenant.

Data Protection

  • Documents store minimized fields; sensitive values tokenized or omitted.
  • Highlights sanitized; never return dropped/redacted fields.

Compliance

  • Record search access with tenant, subject, queryHash, filters, and returnedCount.
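The queryHash recorded above can be a stable digest of the canonicalized request body, so access audits never store raw query text. A minimal sketch (helper name is hypothetical):

```python
import hashlib
import json

def query_hash(body: dict) -> str:
    """Stable SHA-256 of a search request body for access-audit records.

    Canonical JSON (sorted keys, no whitespace) makes the hash independent
    of client-side key ordering.
    """
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```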

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
| --- | --- | --- | --- |
| search_latency_ms | histogram | End-to-end latency | p95 > 200 ms |
| search_qps | counter | Requests/sec | Capacity planning |
| index_freshness_seconds | gauge | now - indexWatermark | > 10 s sustained |
| search_429_total | counter | Rate limited count | > 5% sustained |
| cursor_stale_total | counter | 409 due to stale cursor | Spike detection |

Logging Requirements

  • Structured logs: tenant, traceId, qHash, filtersHash, size, sort, tookMs, indexLagSec.
  • Do not log raw queries or highlights.

Distributed Tracing

  • Spans: search.plan, engine.search, engine.aggs, cache.get/set.
  • Attributes: tenant, hasCursor, facetCount, size, tookMs.

Health Checks

  • Readiness: engine reachable; tenant alias exists; index freshness within SLO.
  • Liveness: threadpool/connection pool healthy; circuit breakers closed.

Operational Procedures

Deployment

  1. Create index template & per-tenant alias with filter tenantId=....
  2. Deploy Search Service route /search/v1/query and /search/v1/suggest.
  3. Validate end-to-end queries and index freshness headers.

Configuration

  • Env: SEARCH_MAX_SIZE=100, DEFAULT_SIZE=25, ALLOWED_FACETS=..., CURSOR_TTL, RECENCY_BOOST_WINDOW.
  • Engine: shard/replica count, analyzers, suggesters, circuit breakers.

Maintenance

  • Rolling reindex and alias swap; backfill lag tracking.
  • Periodic shard rebalancing; optimize/forcemerge as needed off-peak.

Troubleshooting

  • High index_freshness_seconds → inspect projector/indexer lag.
  • Many 422 → educate clients on query limits; adjust clause caps if safe.
  • 429 spikes → scale nodes or adjust rate limits/caching.

Testing Scenarios

Happy Path Tests

  • Keyword query with filters returns ranked hits and requested facets within p95 ≤ 200 ms.
  • Pagination via nextCursor returns non-overlapping result sets.
  • Headers include X-Index-Watermark and X-Index-Lag.

Error Path Tests

  • 400 on invalid facet, malformed cursor, or bad time bounds.
  • 404 when tenant alias missing/disabled.
  • 409 on stale cursor or conflicting params.
  • 422 on overly complex query (clause cap).
  • 429/503 obey retry/backoff.

Performance Tests

  • Facet cost under control for typical cardinalities.
  • Query load at target QPS with p95 ≤ 200 ms.

Security Tests

  • RBAC scope audit:search enforced; cross-tenant leakage prevented by alias filter.
  • Redaction/minimization verified in results and highlights.

Internal References

  • Tenant-Scoped Query Flow
  • Audit Record Projection Update Flow
  • Data Redaction Flow

External References

  • Vendor docs for analyzers, aggregations, and suggesters (e.g., ES/OpenSearch)

Appendices

A. Example Engine Query (conceptual)

{
  "query": {
    "bool": {
      "filter": [{ "term": { "tenantId": "acme" } }],
      "must": [{ "simple_query_string": { "query": "user create OR signup", "fields": ["action^3","resource.type","attributes.*"] }}]
    }
  },
  "aggs": {
    "resourceType": { "terms": { "field": "resource.type", "size": 10 } },
    "action": { "terms": { "field": "action.keyword", "size": 10 } }
  },
  "sort": ["_score", { "createdAt": "desc" }],
  "size": 25,
  "search_after": ["1.23", "01JEC7KX8..."]
}

B. Example Suggest Request

{
  "prefix": "user.c",
  "filters": { "resourceType": "Iam.User" },
  "size": 10
}

Filtered Query Flow

Policy-aware read path that applies purpose-of-use evaluation, field-level allow/deny, and on-read redaction/masking before returning results. The Query Service consults the Policy Service to compute an effective redaction profile for the caller, then executes a tenant-scoped query and post-processes rows according to the profile.


Overview

Purpose: Return tenant-scoped audit results filtered by caller intent and masked according to privacy & PII policies.
Scope: Purpose-of-use signaling, policy evaluation, field projection, masking strategies (hash/mask/tokenize/drop), seek pagination, and response headers indicating applied policy and freshness. Excludes full-text search (see Search flow) and raw timeline (see Tenant-Scoped Query).
Context: Operates on AuditEvents projection; combines pre-index filters with post-fetch masking.
Key Participants:

  • Client (API consumer)
  • API Gateway (authN/Z, rate limiting)
  • Query Service (query + masking orchestrator)
  • Policy Service (purpose-of-use, allow/deny, redaction profile)
  • Read DB (AuditEvents + RLS) (tenant-isolated projection)
  • Checkpoint Store (watermark for freshness)

Prerequisites

System Requirements

  • TLS at Gateway; JWKS configured for JWT verification
  • Query Service access to Read DB and Policy Service
  • RLS on Read DB enforcing tenantId
  • Redaction libraries & configs deployed (hash/mask/tokenize/drop)

Business Requirements

  • Tenant active; privacy/PII classifications configured
  • Policy definitions include purpose-of-use to field permissions/masking
  • Data residency respected for cross-region reads

Performance Requirements

  • p95 ≤ 180 ms for limit≤200 with standard masking
  • Policy evaluation cache (per subject+purpose) warmed; TTL tuned
  • Indexes support common filter predicates

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    actor Client as Client
    participant GW as API Gateway
    participant Q as Query Service
    participant P as Policy Service
    participant R as Read DB (AuditEvents + RLS)
    participant C as Checkpoint Store

    Client->>GW: POST /query/v1/filtered <br/> h:{Authorization,x-tenant-id,traceparent,x-purpose-of-use}
    GW->>Q: Forward request + headers
    Q->>P: Evaluate(subject, tenant, purpose, requestedFields, filters)
    P-->>Q: RedactionProfile {allowed, denied, maskRules}
    Q->>R: SELECT ... WHERE tenantId=? AND <server-validated filters> ORDER BY createdAt DESC LIMIT N
    R-->>Q: rows
    Q->>Q: Apply RedactionProfile (drop/transform fields) + build cursors
    Q->>C: Read tenant watermark
    Q-->>GW: 200 {items(masked), nextCursor} + X-Watermark, X-Lag, X-Policy-Decision-Id
    GW-->>Client: 200 OK

Alternative Paths

  • Field projection: Client requests fields=[...]; server intersects with allowed and masks per rules.
  • Explain-only: dryRun=true returns the effective RedactionProfile without data.
  • Head polling: If-None-Match: "wmk:<value>" → 304 if the watermark is unchanged.

Error Paths

sequenceDiagram
    actor Client
    participant GW as API Gateway
    participant Q as Query Service

    Client->>GW: POST /query/v1/filtered (bad params/conflicts)
    alt Bad request (invalid filter/purpose/fields)
        GW-->>Client: 400 Problem+JSON
    else Tenant/route not found or feature disabled
        GW-->>Client: 404 Problem+JSON
    else Fields conflict with policy decision
        GW-->>Client: 409 Problem+JSON
    else Unauthorized / Forbidden
        GW-->>Client: 401/403 Problem+JSON
    else Backpressure / upstream down
        GW-->>Client: 429/503 Problem+JSON (+ Retry-After)
    end

Request/Response Specifications

Input Requirements

| Field | Type | Req | Description | Validation |
| --- | --- | --- | --- | --- |
| Method/Path | HTTP | Y | POST /query/v1/filtered (filtered & masked read) | JSON body |
| Authorization | header | Y | Bearer <JWT> | Valid, not expired |
| x-tenant-id | header | Y* | Tenant scope | ^[A-Za-z0-9._-]{1,128}$ |
| traceparent | header | O | W3C trace context | 55-char |
| x-purpose-of-use | header | Y | Caller intent (e.g., Support, SecurityOps, Analytics) | Enum allowlist |
| limit | body.int | O | Items per page | 1–200 (default 100) |
| cursor | body.string | O | Opaque seek token | base64url |
| filters | body.object | O | Server-validated predicates | Allowlist only |
| fields | body.array | O | Requested projections | Intersected with policy |
| dryRun | body.bool | O | Return policy only | default false |

*Required unless tenant embedded in path variant.

Supported filter keys (allowlist example): createdAt.from/to, action, resource.type, resource.id, actor.id, decision.result.
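Server-side enforcement of this allowlist might look like the following sketch (allowlist keys from this section; the RFC 7807-style error entries match the 400 example below):

```python
ALLOWED_FILTERS = {
    "createdAt.from", "createdAt.to", "action",
    "resource.type", "resource.id", "actor.id", "decision.result",
}

def flatten(filters: dict, prefix: str = "") -> dict:
    """Flatten nested filter objects to dotted keys, e.g. {'createdAt': {'from': x}} -> {'createdAt.from': x}."""
    out = {}
    for key, value in filters.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            out.update(flatten(value, prefix=f"{path}."))
        else:
            out[path] = value
    return out

def validate_filters(filters: dict) -> list:
    """Return Problem+JSON error entries for every non-allowlisted filter key."""
    return [
        {"pointer": f"/filters/{key}", "reason": "not-allowed"}
        for key in flatten(filters)
        if key not in ALLOWED_FILTERS
    ]
```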

Output Specifications

| Field | Type | Description | Notes |
| --- | --- | --- | --- |
| items[] | array | Masked rows honoring RedactionProfile | Order: createdAt DESC |
| nextCursor | string? | Seek token for next page | Omitted if end |
| policy | object? | Returned if dryRun=true | Effective profile summary |
| count | int | Items in this page | ≤ limit |

Response Headers

  • X-Watermark: tenant projection watermark (ISO-8601 UTC)
  • X-Lag: seconds behind now
  • X-Policy-Decision-Id: opaque id of the applied policy decision (for audit)

Example Payloads

Request

{
  "limit": 50,
  "fields": ["auditRecordId","createdAt","action","resource.id","actor.display","client.ip"],
  "filters": {
    "resource.type": "Iam.User",
    "createdAt": { "from": "2025-10-22T00:00:00Z", "to": "2025-10-22T23:59:59Z" }
  }
}

Headers:

Authorization: Bearer eyJhbGciOi...
x-tenant-id: acme
x-purpose-of-use: Support
traceparent: 00-9f0c1d2e3a4b5c6d7e8f9a0b1c2d3e4f-1111222233334444-01

200 OK (masked)

X-Watermark: 2025-10-22T12:05:10.330Z
X-Lag: 6.9
X-Policy-Decision-Id: pol_7b3f8d1a
{
  "items": [
    {
      "auditRecordId": "01JEC9VX2Z…",
      "createdAt": "2025-10-22T11:57:03.200Z",
      "action": "user.create",
      "resource": { "id": "U-1001" },
      "actor": { "display": "signup-svc" },
      "client": { "ip": "203.0.113.0/24" }  // IP truncated per Support profile
    }
  ],
  "nextCursor": "eyJ0cyI6IjIwMjUtMTAtMjJUMTE6NTc6MDMuMjAwWiIsImlkIjoiMDFK...In0",
  "count": 1
}

dryRun=true (policy only)

{
  "policy": {
    "allowed": ["auditRecordId","createdAt","action","resource.id","actor.display","client.ip"],
    "denied": ["client.userAgent","geo.location","subject.email"],
    "maskRules": {
      "client.ip": "truncate_cidr_24",
      "subject.email": "mask_localpart"
    }
  }
}

400 Bad Request (conflicting filters)

{
  "type": "urn:connectsoft:errors/query/filters.invalid",
  "title": "Invalid filters",
  "status": 400,
  "detail": "Unsupported filter 'subject.email'.",
  "errors": [{ "pointer": "/filters/subject.email", "reason": "not-allowed" }]
}

Error Handling

Error Scenarios

| HTTP Code | Scenario | Recovery Action | Retry Strategy |
| --- | --- | --- | --- |
| 400 | Malformed filters/cursor; unknown x-purpose-of-use; invalid fields | Fix request; use allowlisted fields | No retry until corrected |
| 401 | Missing/invalid JWT | Acquire valid token | Retry after renewal |
| 403 | Subject lacks audit:read:filtered scope or policy denies all fields | Request correct scope; adjust purpose | No retry until fixed |
| 404 | Tenant/route not found; feature disabled | Verify tenant/URL/edition | No retry |
| 409 | Requested fields conflict with policy (e.g., denied but required) or cursor param conflicts | Remove offending fields/params | Retry after fix |
| 429 | Rate limit/backpressure | Respect Retry-After | Exponential backoff + jitter |
| 503 | Policy or DB dependency unavailable | Wait for recovery | Retry with same params |

Failure Modes

  • Policy cache staleness: returns stricter profile than expected—safe by design; refresh on next call.
  • Cursor invalid after rebuild: 409 cursor.stale with resyncFrom hint.
  • Overbroad projection: requesting many fields increases payload size; server may trim to allowed ∩ requested.

Recovery Procedures

  1. On 409 field-policy conflict, re-issue request with fields returned in policy.allowed.
  2. On 429/503, backoff with jitter; do not widen limit.
  3. For stale cursor, restart from from time bound or omit cursor.

Performance Characteristics

Latency Expectations

  • P50 ≤ 70 ms, P95 ≤ 180 ms (policy cache hit); add 15–30 ms if cache miss.

Throughput Limits

  • Per tenant: 150 rps sustained, 600 rps burst (configurable).
  • Global: scales with read replicas and policy cache hit rate.

Resource Requirements

  • CPU for masking transforms (e.g., hashing/tokenization); memory for page shaping.

Scaling Considerations

  • Cache policy decisions keyed by (tenant, subject, purpose) with short TTL (e.g., 60–300s).
  • Pre-compute allowlists per purpose to minimize per-request overhead.
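The decision cache described above can be sketched as a small TTL map keyed by (tenant, subject, purpose); the evaluator callable and class name are hypothetical:

```python
import time

class PolicyDecisionCache:
    """TTL cache for policy decisions keyed by (tenant, subject, purpose)."""

    def __init__(self, evaluate, ttl_seconds: float = 120.0, clock=time.monotonic):
        self._evaluate = evaluate   # callable(tenant, subject, purpose) -> RedactionProfile
        self._ttl = ttl_seconds     # short TTL, e.g. 60-300 s per the guidance above
        self._clock = clock
        self._entries = {}          # key -> (stored_at, profile)

    def get(self, tenant: str, subject: str, purpose: str):
        key = (tenant, subject, purpose)
        entry = self._entries.get(key)
        if entry is not None and self._clock() - entry[0] < self._ttl:
            return entry[1]         # fresh hit: skip the Policy Service round-trip
        profile = self._evaluate(tenant, subject, purpose)
        self._entries[key] = (self._clock(), profile)
        return profile
```

Because a stale entry can only be as permissive as a past decision, expiry errs toward re-evaluation rather than wider access.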

Security & Compliance

Authentication

  • OIDC JWT; traceparent propagated; optional mTLS Gateway↔Query Service.

Authorization

  • Require audit:read:filtered; validate x-tenant-id claim and RBAC.
  • Enforce DB-level RLS and post-query field-level controls from policy.

Data Protection

  • Apply masking strategies per Data Model (truncate_cidr_24, mask_localpart, hash_sha256, drop).
  • Do not return fields marked denied by policy; never include raw PII if policy says mask/drop.

Compliance

  • Emit access audit: subject, tenant, purpose, decisionId, requestedFields, returnedFields.

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
| --- | --- | --- | --- |
| filtered_query_latency_ms | histogram | End-to-end latency | p95 > 180 ms |
| policy_eval_latency_ms | histogram | Policy round-trip | p95 > 30 ms |
| policy_denied_total | counter | Requests with any denied fields | Sudden spikes |
| masked_fields_total | counter | Count of masked field applications | Trend monitoring |
| cursor_stale_total | counter | 409 due to stale cursor | Rebuild detection |
| query_429_total | counter | Rate-limited responses | > 5% sustained |

Logging Requirements

  • Structured logs: tenant, traceId, purpose, decisionId, requestedFieldsHash, returnedFieldsHash, resultCount, watermark, lagSec. Do not log raw PII.

Distributed Tracing

  • Spans: policy.evaluate, db.select.filtered, mask.apply.
  • Attributes: purpose, allowedCount, maskedCount, deniedCount.

Health Checks

  • Readiness: Policy Service reachable; RLS verified; masking config loaded.
  • Liveness: threadpool/connection pools healthy.

Operational Procedures

Deployment

  1. Deploy/enable /query/v1/filtered route behind feature flag query.filtered.enabled=false.
  2. Load policy catalogs and masking configuration; warm caches.
  3. Validate dryRun and live calls in staging with test profiles.

Configuration

  • Env: QUERY_MAX_LIMIT, DEFAULT_LIMIT, POLICY_CACHE_TTL, MASKING_RULES_PATH.
  • Headers: accept x-purpose-of-use values from allowlist only.

Maintenance

  • Rotate JWT keys; review policy changes; audit decision logs.
  • Monitor masked vs. denied trends to tune rules.

Troubleshooting

  • Many 409 field conflicts → educate clients to request dryRun first or fetch policy.allowed.
  • High policy_eval_latency_ms → investigate Policy Service capacity/caching.
  • Data leakage concerns → verify masking config version & hot reload.

Testing Scenarios

Happy Path Tests

  • Valid request with x-purpose-of-use: Support returns masked IP and allowed fields.
  • dryRun=true returns expected profile; subsequent call applies it.

Error Path Tests

  • 400 on invalid filter key or unknown purpose.
  • 404 when tenant missing/disabled.
  • 409 when requesting denied fields.
  • 429/503 obey retry/backoff with unchanged parameters.

Performance Tests

  • Cache-hit p95 ≤ 180 ms; cache-miss overhead within budget.
  • Large page (limit=200) still meets p95 under typical load.

Security Tests

  • RLS prevents cross-tenant access.
  • No raw PII fields returned when policy mandates mask/drop.
  • Access audit entries include purpose and decisionId.

Internal References

  • Data Redaction Flow (on-read), Policy & Retention flows
  • Compliance Audit Flow

External References

  • RFC 7807 (Problem Details)
  • Organization Privacy/PII policy catalog

Appendices

A. Example RedactionProfile (concept)

{
  "decisionId": "pol_7b3f8d1a",
  "purpose": "Support",
  "allowed": ["auditRecordId","createdAt","action","resource.id","actor.display","client.ip"],
  "denied": ["subject.email","geo.location","client.userAgent"],
  "maskRules": {
    "client.ip": "truncate_cidr_24",
    "subject.email": "mask_localpart"
  }
}

B. Masking Rules (summary)

  • truncate_cidr_24 → IPv4 a.b.c.d → a.b.c.0/24
  • mask_localpart → name@domain → n***@domain
  • hash_sha256 → irreversible 64-hex digest
  • drop → remove field from output
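The rules above can be sketched as pure transforms (rule names come from this spec; the exact truncation and masking behavior shown is an assumption consistent with the examples):

```python
import hashlib

def truncate_cidr_24(ip: str) -> str:
    """IPv4 a.b.c.d -> a.b.c.0/24 (drops the host octet, keeps the /24 network)."""
    a, b, c, _ = ip.split(".")
    return f"{a}.{b}.{c}.0/24"

def mask_localpart(email: str) -> str:
    """name@domain -> n***@domain (keep only the first character of the local part)."""
    local, domain = email.split("@", 1)
    return f"{local[:1]}***@{domain}"

def hash_sha256(value: str) -> str:
    """Irreversible 64-hex SHA-256 digest of the field value."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

# 'drop' is not a transform: the field is simply removed from the output document.
```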

Time-Range Query Flow

Efficiently retrieves audit events constrained by a time window. The Query Service translates from/to predicates into partition/shard pruning (e.g., daily/monthly tenant partitions), executes seek-paginated scans over the minimal set of partitions, and returns watermark/lag headers to describe projection freshness.


Overview

Purpose: Provide fast, predictable retrieval of audit events within a specified time range while minimizing IO via partition/shard pruning.
Scope: Time predicates, partition selection, shard routing, seek-based pagination across multiple partitions, and freshness exposition. Excludes full-text relevance (see Search) and policy-driven masking (see Filtered Query).
Context: Operates on the AuditEvents read model that is physically partitioned by tenant and time; the Projection Service updates these partitions asynchronously.
Key Participants:

  • Client (API consumer)
  • API Gateway (authN/Z, rate limiting)
  • Query Service (planner/executor, paginator)
  • Read Store (time-partitioned AuditEvents with RLS)
  • Partition Catalog (maps time windows → partitions/shards)
  • Checkpoint Store (per-tenant watermark)

Prerequisites

System Requirements

  • Gateway with TLS and JWT validation
  • Query Service can access Read Store, Partition Catalog, and Checkpoint Store
  • Read Store enforces RLS by tenantId
  • Time partitions (e.g., daily/monthly) exist and are discoverable in the catalog

Business Requirements

  • Tenant is active and permitted to query historical windows requested
  • Retention policy covers the requested from/to period
  • Regional residency honored for multi-region tenants

Performance Requirements

  • p95 ≤ 160 ms for limit≤200 and ≤ 14 partitions scanned
  • Covering index on (tenantId, createdAt DESC, auditRecordId) per partition
  • Partition discovery latency p95 ≤ 10 ms

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    actor Client as Client
    participant GW as API Gateway
    participant Q as Query Service
    participant Cat as Partition Catalog
    participant R as Read Store (AuditEvents + RLS)
    participant Ck as Checkpoint Store

    Client->>GW: GET /audit/v1/events/range?from=...&to=...&limit=200&cursor=... <br/> h:{Authorization,x-tenant-id,traceparent}
    GW->>Q: Forward request + normalized headers
    Q->>Q: Validate time window, normalize [from,to], parse/verify cursor (if any)
    Q->>Cat: Resolve partitions/shards for [from,to] + tenant
    Cat-->>Q: Ordered partition list (most-recent → oldest)
    Q->>R: Query partitions with seek pagination (ORDER BY createdAt DESC, auditRecordId)
    R-->>Q: Page of rows + next anchor (ts,id,partitionIdx)
    Q->>Ck: Read tenant watermark
    Q-->>GW: 200 {items, nextCursor} + X-Watermark + X-Lag + X-Partitions-Scanned
    GW-->>Client: 200 OK

Alternative Paths

  • Open-ended range: only from provided (defaults to=now), or only to (backfill).
  • Ascending traversal: order=asc for forward scans; cursor encodes direction + partition index.
  • Server-side downsampling: for very wide windows, service may cap maxPartitions and advise narrowing via Problem+JSON type: .../range.too_wide (422) when appropriate.

Error Paths

sequenceDiagram
    actor Client
    participant GW as API Gateway

    Client->>GW: GET /audit/v1/events/range?from=bad&to=2025-10-22T00:00:00Z
    alt Bad request (malformed/invalid window)
        GW-->>Client: 400 Bad Request (Problem+JSON)
    else Tenant route not found / disabled
        GW-->>Client: 404 Not Found (Problem+JSON)
    else Conflicting params (cursor with changed window/order)
        GW-->>Client: 409 Conflict (Problem+JSON)
    else Rate limited / store unavailable
        GW-->>Client: 429/503 (Problem+JSON + Retry-After)
    end

Request/Response Specifications

Input Requirements

| Field | Type | Req | Description | Validation |
| --- | --- | --- | --- | --- |
| Method/Path | HTTP | Y | GET /audit/v1/events/range or /tenants/{tenantId}/events/range | Time-range endpoint |
| Authorization | header | Y | Bearer <JWT> | Valid, not expired |
| x-tenant-id | header | Y* | Tenant scope | ^[A-Za-z0-9._-]{1,128}$ |
| traceparent | header | O | W3C trace context | 55-char |
| from | query | O* | ISO-8601 UTC lower bound | ≤ to; within retention |
| to | query | O* | ISO-8601 UTC upper bound | ≥ from; not in future+skew |
| limit | query | O | Items per page (default 100) | 1–1000 |
| order | query | O | desc (default) or asc | enum |
| cursor | query | O | Opaque base64url (ts,id,partitionIdx,dir) | Must match current params |
| filters… | query | O | Optional allowlisted filters (e.g., action, resource.type) | Validated server-side |

  • At least one of from or to is required; if only one is provided, the other defaults to now (bounded by retention and skew rules).

Output Specifications

| Field | Type | Description | Notes |
| --- | --- | --- | --- |
| items[] | array | Results in requested order | Seek-paginated |
| nextCursor | string? | Encodes next anchor + partition index | Omitted if no more |
| count | int | Items in this page | ≤ limit |

Response Headers

  • X-Watermark: tenant projection watermark (ISO-8601 UTC)
  • X-Lag: seconds behind now (now - watermark)
  • X-Partitions-Scanned: integer count of partitions touched
  • Cache-Control: typically no-store (or short TTL where safe)

Example Request

GET /audit/v1/events/range?from=2025-10-20T00:00:00Z&to=2025-10-22T23:59:59Z&limit=200&order=desc HTTP/1.1
Host: api.atp.example
Authorization: Bearer eyJhbGciOi...
x-tenant-id: acme
traceparent: 00-3e1f2d0c9b8a7f6e5d4c3b2a19081716-7f6e5d4c3b2a1908-01

200 OK

X-Watermark: 2025-10-22T12:10:05.412Z
X-Lag: 5.6
X-Partitions-Scanned: 3
{
  "items": [
    {
      "auditRecordId": "01JECZ6Y8K1V...",
      "createdAt": "2025-10-22T12:02:59.812Z",
      "action": "user.create",
      "resource": { "type": "Iam.User", "id": "U-1001" },
      "actor": { "id": "svc_ingress", "type": "Service" }
    }
  ],
  "nextCursor": "eyJ0cyI6IjIwMjUtMTAtMjJUMTE6NTU6MDAuMDAwWiIsImlkIjoiMDFK...IiwicGFydGl0aW9uSW5kZXgiOjEsImRpciI6ImRlc2MifQ",
  "count": 1
}

Error Handling

Error Scenarios

| HTTP Code | Scenario | Recovery Action | Retry Strategy |
| --- | --- | --- | --- |
| 400 | Malformed from/to; from>to; window exceeds max span; limit out of bounds | Fix params; reduce window | No retry until corrected |
| 401 | Missing/invalid JWT | Acquire valid token | Retry after renewal |
| 403 | Token lacks audit:read:timeline for tenant | Request proper scope | No retry |
| 404 | Tenant/route not found; tenant disabled; partitions not present (fully aged out) | Verify tenant/window | No retry |
| 409 | cursor does not match from/to/order; stale cursor after compaction | Drop/refresh cursor, re-issue | Retry after fix |
| 429 | Rate limit/backpressure | Honor Retry-After | Exponential backoff + jitter |
| 503 | Read Store/Catalog unavailable | Wait for recovery | Retry with same params |

Failure Modes

  • Stale cursor after partition compaction/rotation → 409 with type: .../cursor.stale and resyncFrom hint.
  • Excessive partitions for wide windows → 422 range.too_wide with suggested subranges.
  • Clock skew: future to beyond now+skew → clamp or 400 with pointer to to.

Recovery Procedures

  1. For 409 cursor.stale, restart without cursor or with from=lastSeen.createdAt.
  2. For 422 range.too_wide, split the request by suggested daily/monthly windows.
  3. Monitor X-Partitions-Scanned; if high, narrow the time window.
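Step 2's window splitting can be sketched as a day-aligned subdivision (helper name is hypothetical; monthly splitting is analogous):

```python
from datetime import datetime, timedelta

def split_daily(start: datetime, end: datetime) -> list:
    """Split [start, end] into UTC-day-aligned subranges for re-issuing narrower range queries."""
    windows = []
    cursor = start
    while cursor < end:
        # Next UTC midnight after 'cursor', clamped to 'end'.
        next_day = (cursor + timedelta(days=1)).replace(hour=0, minute=0, second=0, microsecond=0)
        upper = min(next_day, end)
        windows.append((cursor, upper))
        cursor = upper
    return windows
```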

Performance Characteristics

Latency Expectations

  • P50 ≤ 70 ms, P95 ≤ 160 ms, P99 ≤ 320 ms when ≤14 partitions scanned.

Throughput Limits

  • Per tenant: 150 rps sustained, burst 600 rps (configurable).
  • Global: scales with number of read replicas and partition cache hit rate.

Resource Requirements

  • Partition catalog lookup in-memory or fast key-value store; read DB requires covering indexes per partition.

Scaling Considerations

  • Pruning first: always resolve partitions before issuing any scans.
  • Adaptive limits: cap limit when many partitions are touched; prefer more pages over wide scans.
  • Parallel partition reads (optional): small fan-out with strict per-tenant concurrency to preserve order semantics when stitching results.

Security & Compliance

Authentication

  • OIDC JWT at Gateway; propagate traceparent; optional mTLS Gateway↔Query.

Authorization

  • Enforce audit:read:timeline; verify tenant claims; RLS must filter by tenantId.

Data Protection

  • Only return fields allowed by baseline read model; masking/redaction applied in dedicated filtered flow if required.

Compliance

  • Log access with tenant, from, to, limit, partitionsScanned, watermark, and lag.

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
| --- | --- | --- | --- |
| range_query_latency_ms | histogram | End-to-end latency | p95 > 160 ms |
| partitions_scanned | histogram | Partitions per request | > 16 median |
| cursor_stale_total | counter | 409 due to stale cursor | Spike indicates compaction |
| range_too_wide_total | counter | 422 due to excessive span | Trend watch |
| watermark_lag_seconds | gauge | now - watermark | > target (e.g., > 10 s) |

Logging Requirements

  • Structured logs: tenant, traceId, from, to, order, limit, cursorHash, partitionsScanned, resultCount, watermark, lagSec. Do not log raw cursor.

Distributed Tracing

  • Spans: catalog.resolvePartitions, db.scan.partition, stitch.page, ckpt.read.
  • Attributes: partitionCount, limit, dir, hasCursor.

Health Checks

  • Readiness: catalog reachable; partitions for today resolvable; indexes present.
  • Liveness: DB/connection pools healthy; threadpool not saturated.

Operational Procedures

Deployment

  1. Enable /audit/v1/events/range route; confirm RLS and partition catalog.
  2. Smoke-test with a 24h window and verify X-Partitions-Scanned.
  3. Validate cursor stability across partition boundaries.

Configuration

  • Env: RANGE_MAX_SPAN_DAYS, QUERY_MAX_LIMIT, DEFAULT_LIMIT, PARTITION_LOOKUP_TTL.
  • Pruning: enable negative caching for empty/aged-out partitions.

Maintenance

  • Keep partition catalog in sync with DDL/rotation jobs; prune aged partitions per retention.
  • Rebuild indexes offline before alias/cutover when rotating partitions.

Troubleshooting

  • High partitions_scanned → check catalog gaps or miscomputed from/to.
  • Frequent 409 cursor conflicts → ensure clients don’t change window/order between pages.
  • Elevated watermark_lag_seconds → scale projectors or indexers.

Testing Scenarios

Happy Path Tests

  • Query 48h window returns ordered results with X-Partitions-Scanned ≤ 3.
  • Pagination crosses a partition boundary without duplicates or gaps.

Error Path Tests

  • 400 on malformed/invalid time bounds or from>to.
  • 404 when tenant/route disabled or fully aged-out window.
  • 409 when cursor does not match current from/to/order.
  • 429/503 cause client backoff and retry with same params.

Performance Tests

  • p95 ≤ 160 ms for limit=200, ≤14 partitions.
  • Partition discovery p95 ≤ 10 ms under load.

Security Tests

  • RLS prevents cross-tenant access.
  • JWT scope audit:read:timeline enforced.

Internal References

  • Tenant-Scoped Query Flow
  • Filtered Query Flow
  • Audit Record Projection Update Flow

External References

  • RFC 3339 / ISO-8601 for timestamps
  • W3C Trace Context (traceparent)

Appendices

A. Cursor schema (concept)

{
  "ts": "2025-10-22T11:55:00.000Z",
  "id": "01JECZ6Y8K1V...",
  "partitionIdx": 1,
  "dir": "desc"
}
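
Clients treat the cursor as opaque; only the server reads its fields. A minimal sketch of one plausible encoding (base64url over canonical JSON; the actual wire format is service-internal and may additionally be signed or encrypted):

```python
import base64
import json

def encode_cursor(ts: str, record_id: str, partition_idx: int, direction: str) -> str:
    """Serialize cursor fields to canonical JSON, then base64url (no padding)."""
    payload = json.dumps(
        {"ts": ts, "id": record_id, "partitionIdx": partition_idx, "dir": direction},
        sort_keys=True, separators=(",", ":"),
    ).encode()
    return base64.urlsafe_b64encode(payload).decode().rstrip("=")

def decode_cursor(token: str) -> dict:
    """Reverse of encode_cursor; re-add base64 padding before decoding."""
    padded = token + "=" * (-len(token) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))
```

Because the token is derived from the window/order parameters, the server can detect the 409 "cursor does not match current from/to/order" case by comparing decoded fields against the request.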

B. Example partition policy

  • Key: (tenantId, yyyymm) monthly partitions; for high-volume tenants use daily (tenantId, yyyymmdd).
  • Pruning: select partitions where [from,to] intersects partition time bounds; query newest-first for desc.
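
The pruning rule above can be sketched as follows (hypothetical helper; assumes monthly (tenantId, yyyymm) keys and UTC month bounds):

```python
from datetime import datetime, timezone

def partitions_for_window(from_ts: datetime, to_ts: datetime) -> list[str]:
    """Return yyyymm partition keys whose month intersects [from_ts, to_ts],
    newest-first to suit descending timeline queries."""
    keys = []
    year, month = from_ts.year, from_ts.month
    while (year, month) <= (to_ts.year, to_ts.month):
        keys.append(f"{year:04d}{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return list(reversed(keys))  # newest-first for desc order
```

Negative caching for empty/aged-out partitions would sit on top of this: keys known to be empty are dropped from the result before the per-partition scans are issued.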

Standard Export Flow

On-demand export that builds a consistent snapshot of tenant-scoped audit data, runs a scoped query over the read models, streams results in chunked parts (JSONL or Parquet, optionally gzipped), produces a signed ExportManifest (with integrity proofs), delivers via presigned URLs and/or webhook callback, and emits Export.Completed. For forensically packaged exports, see the eDiscovery Export Flow.


Overview

Purpose: Enable compliance officers to export audit data for a given tenant/time window with integrity evidence and policy safeguards.
Scope: Job creation, query scoping, chunked packaging, integrity/manifest generation, delivery (URLs/webhook), completion events, and resume/cancel. Excludes continuous/scheduled exports (see Bulk Export Flow).
Context: Runs against the projection/read models (e.g., AuditEvents) and consults Integrity Service for proofs, Policy/Retention/LegalHold for eligibility, and Storage for canonical IDs.
Key Participants:

  • Compliance Officer / Client
  • API Gateway
  • Export Service (job orchestration, packaging)
  • Query Service / Read Store (scoped read with seek pagination)
  • Integrity Service (Merkle roots / signatures)
  • Delivery Backend (object storage for parts, presigned URLs)
  • Webhook Receiver (optional callback on completion)

Prerequisites

System Requirements

  • API Gateway with TLS and JWT validation
  • Export Service deployed with access to Read Store, Integrity Service, Delivery Backend
  • Read Store enforces RLS by tenantId; indexes support range scans
  • Webhook signing keys configured (if callbacks used)

Business Requirements

  • Tenant active; retention and residency policies provisioned
  • Legal holds registered; export must honor holds and exclusions
  • Officer has audit:export permission; purpose-of-use recorded

Performance Requirements

  • Target p95 job time-to-first-part ≤ 30 s for typical scopes
  • Per-part target size (e.g., 128–512 MiB) to optimize download throughput
  • Concurrency caps per tenant to protect read replicas

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    actor Officer as Compliance Officer
    participant GW as API Gateway
    participant EXP as Export Service
    participant Q as Query Service / Read Store
    participant INT as Integrity Service
    participant OBJ as Delivery Backend (Object Storage)
    participant WH as Webhook Receiver (optional)

    Officer->>GW: POST /export/v1/jobs {tenant, range, filters, format, partSize, webhook?}
    GW->>EXP: Forward request (authN/Z, x-tenant-id, traceparent)
    EXP-->>Officer: 202 Accepted {jobId, status:"Running"} (job continues async)
    EXP->>Q: Open scoped cursor (tenant, from/to, filters)
    loop Chunk until exhausted
        Q-->>EXP: Page of rows + next cursor
        EXP->>EXP: Serialize to JSONL/Parquet, gzip if requested
        EXP->>INT: Append leaf hashes, update segment/merkle state
        EXP->>OBJ: PUT part (presigned upload or service credentials)
        OBJ-->>EXP: URL + ETag
        EXP->>EXP: Record part metadata, update resumeToken
    end
    EXP->>INT: Seal block → MerkleRoot + signature
    EXP->>EXP: Build ExportManifest {parts, counts, bytes, root, signature, resumeToken}
    Officer->>GW: GET /export/v1/jobs/{jobId}/manifest
    GW-->>Officer: 200 OK (signed manifest; presigned GETs if requested)
    alt webhook configured
        EXP->>WH: POST /webhook/export {jobId,status:"Completed",manifestUrl,signature}
    end
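
The per-part packaging inside the loop above can be sketched like this (a minimal sketch; `write_parts` and `_seal_part` are hypothetical helpers, and the real service streams each part to object storage rather than returning payloads):

```python
import gzip
import json

def write_parts(rows, part_size_bytes: int):
    """Serialize rows to JSONL, gzip each part, and roll to a new part once
    the uncompressed buffer crosses the target size. Returns part metadata."""
    parts, buf = [], bytearray()
    for row in rows:
        buf += json.dumps(row, separators=(",", ":")).encode() + b"\n"
        if len(buf) >= part_size_bytes:
            parts.append(_seal_part(len(parts), bytes(buf)))
            buf.clear()
    if buf:  # flush the final, possibly short, part
        parts.append(_seal_part(len(parts), bytes(buf)))
    return parts

def _seal_part(index: int, data: bytes) -> dict:
    compressed = gzip.compress(data)
    # In the real service this is a PUT to object storage returning an ETag.
    return {"index": index, "bytes": len(compressed),
            "records": data.count(b"\n"), "payload": compressed}
```

The recorded per-part metadata (index, bytes, records) is exactly what later lands in the manifest's parts[] array.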

Alternative Paths

  • Presigned download: Service writes parts to bucket and returns read-only presigned URLs.
  • Direct upload: Client provides presigned PUT URLs per part (client-managed storage).
  • Parquet + schema: Columnar output with embedded schema for analytics workloads.
  • Resume: Client POST /export/v1/jobs/{jobId}:resume with server-provided resumeToken.

Error Paths

sequenceDiagram
    actor Officer
    participant GW as API Gateway
    participant EXP as Export Service

    Officer->>GW: POST /export/v1/jobs {invalid filters/format}
    alt Invalid request
        GW-->>Officer: 400 Bad Request (Problem+JSON)
    else Tenant not found/feature disabled
        GW-->>Officer: 404 Not Found (Problem+JSON)
    else Job state conflict (e.g., resume running job)
        GW-->>Officer: 409 Conflict (Problem+JSON)
    else Rate limited / dependencies down
        GW-->>Officer: 429/503 (Retry-After/Problem+JSON)
    end

Request/Response Specifications

Input Requirements

Field Type Req Description Validation
Method/Path POST /export/v1/jobs Y Create export job JSON body
Authorization header Y Bearer <JWT> Valid, not expired
x-tenant-id header Y* Tenant scope Must match body.tenant
traceparent header O W3C trace context 55-char
tenant string Y Target tenant ^[A-Za-z0-9._-]{1,128}$
range object O {from?, to?} ISO-8601 UTC from ≤ to, within retention
filters object O Allowlisted filters (action/resource/actor/decision) Server validated
format enum O jsonl (default), parquet allowlist
compression enum O none (default), gzip allowlist
partSizeMiB int O Target part size 16–1024, default 256
fields array O Projection/columns Valid subset of schema
webhook.url url O Completion callback HTTPS + signature method
webhook.secretId string O Key id for HMAC Must exist in KMS
delivery.mode enum O presigned-get, client-presigned-put allowlist
  • Header required unless using path variant /tenants/{tenantId}/export/jobs.

Output Specifications

Create Job — 202 Accepted

Field Type Description
jobId string Server-assigned id (ULID/GUID)
status enum Running
estimation object {partsApprox, bytesApprox?}
pollUrl url GET /export/v1/jobs/{jobId}
manifestUrl url GET /export/v1/jobs/{jobId}/manifest (when ready)

Get Job — 200 OK

Field Type Description
jobId string id
status enum Queued | Running | Completed | Failed | Canceled
counts object {records, parts}
bytes object {written}
parts[] array {index,url,etag,bytes,records} (if presigned-get)
resumeToken string? For resume/cancel/retry
startedAt/finishedAt timestamp ISO-8601 UTC
watermark timestamp Consistency snapshot time

Manifest (JSON)

{
  "jobId": "exp_01JECXYZ...",
  "tenant": "acme",
  "range": {"from":"2025-10-20T00:00:00Z","to":"2025-10-22T23:59:59Z"},
  "format": "jsonl",
  "compression": "gzip",
  "parts": [
    {"index":0,"url":"https://.../p0.gz","bytes":268435456,"records":100000,"etag":"\"abc123\""}
  ],
  "counts":{"records":250000,"parts":3},
  "bytes":{"written":734003200},
  "integrity":{"merkleRoot":"8a4f...","signature":{"alg":"Ed25519","kid":"int-key-2025","sig":"MEQC..."}},
  "createdAt":"2025-10-22T12:30:12Z",
  "resumeToken":"r:01JEC...",
  "watermark":"2025-10-22T12:25:00Z"
}
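
Before trusting downloaded data, a consumer can cross-check the manifest's aggregate counts against its parts (a minimal sketch; field names follow the example above, and this complements, not replaces, signature verification):

```python
def manifest_is_consistent(manifest: dict) -> bool:
    """Verify that per-part bytes/records sum to the manifest totals."""
    parts = manifest.get("parts", [])
    total_bytes = sum(p["bytes"] for p in parts)
    total_records = sum(p["records"] for p in parts)
    return (total_bytes == manifest["bytes"]["written"]
            and total_records == manifest["counts"]["records"]
            and len(parts) == manifest["counts"]["parts"])
```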

Error Handling

Error Scenarios

HTTP Code Scenario Recovery Action Retry Strategy
400 Malformed range/filters; unsupported format/compression; invalid partSizeMiB; insecure webhook URL Fix request; use allowlisted values No retry until corrected
401 Missing/invalid JWT Acquire valid token Retry after renewal
403 Caller lacks audit:export or tenant mismatch Request proper role/scope No retry
404 Tenant/route not found; GET /jobs/{id} for unknown id Verify identifiers/tenant
409 Job state conflict (resume/cancel when not applicable); changing scope on resume Wait for state; create new job Retry after fix
413 Estimated export exceeds max allowed per job Narrow scope or switch to Bulk Export
429 Per-tenant/global export rate limited Respect Retry-After Exponential backoff + jitter
503 Read store/integrity/object storage unavailable Wait for recovery Retry create/poll

Failure Modes

  • Retention/residency violation: service rejects with 400 type: .../policy.violation.
  • Legal hold conflict: either enforced inclusion or exclusion per policy; decision id returned via X-Policy-Decision-Id.
  • Webhook failure: job completes, callback retries with backoff; manifest always retrievable via GET.

Recovery Procedures

  1. For 409, poll job until terminal; then retry with new job if needed.
  2. For 503/429, back off using Retry-After; do not alter request to preserve idempotency.
  3. Use resumeToken to continue aborted jobs without duplicating parts.

Performance Characteristics

Latency Expectations

  • Time-to-first-part p95 ≤ 30 s for typical 24–48h windows.
  • Per-part write steady-state throughput aligned with object storage (100–500 MiB/s aggregate across workers).

Throughput Limits

  • Per tenant: ≤ 2 concurrent running jobs (configurable).
  • Global: bounded by export workers × read replica capacity.

Resource Requirements

  • Read IOPS proportional to projected records; CPU for serialization/compression; memory for part buffers.

Scaling Considerations

  • Horizontal worker pool with fair-share per tenant.
  • Adaptive partSizeMiB and dynamic concurrency to maintain steady throughput.
  • Use seek pagination from Query Service to avoid deep offsets.

Security & Compliance

Authentication

  • OIDC JWT at Gateway; mTLS for service-to-service (optional).

Authorization

  • Require audit:export for tenant; enforce RLS in reads; verify x-tenant-id.

Data Protection

  • Parts stored with server-side encryption; presigned URLs time-limited and least-privilege.
  • Redaction/minimization applied if using Filtered export mode (optional flag).

Compliance

  • Enforce retention/residency and legal holds; include decision metadata in manifest.
  • Manifest contains integrity proof (Merkle root + signature) for end-to-end verification.

Monitoring & Observability

Key Metrics

Metric Type Description Alert Threshold
export_jobs_active gauge Running jobs count > tenant/global cap
export_bytes_written_total counter Cumulative bytes Trend/throughput
export_parts_total counter Parts produced
export_job_duration_seconds histogram Job runtime p95 > SLO
export_failures_total counter Failed jobs > 0 sustained
export_webhook_fail_total counter Callback failures spike alerts

Logging Requirements

  • Structured logs: tenant, jobId, range, filtersHash, format, partIndex, bytes, records, watermark, integrity.merkleRoot, decisionId (if policy applied). No raw PII.

Distributed Tracing

  • Spans: export.create, query.page, serialize.chunk, compress, object.put, integrity.seal, webhook.post.
  • Attributes: tenant, format, partSizeMiB, parts, bytes, lagSec.

Health Checks

  • Readiness: access to Read Store, Integrity, Object Storage; signing keys loaded.
  • Liveness: worker queue depth within bounds; no stuck jobs.

Operational Procedures

Deployment

  1. Provision object storage buckets and KMS keys; configure presign service.
  2. Deploy Export Service and register /export/v1/* routes.
  3. Validate end-to-end export on a test tenant (JSONL + Parquet).

Configuration

  • Env: EXPORT_MAX_CONCURRENCY_PER_TENANT, EXPORT_DEFAULT_PART_MIB, EXPORT_MAX_PART_MIB, WEBHOOK_SIGNING_KID, PRESIGN_TTL_SEC.
  • SLOs: define job duration targets per size window.

Maintenance

  • Rotate signing keys and presign credentials; prune expired parts/manifests.
  • Rehearse DR: re-run export from resumeToken after worker failover.

Troubleshooting

  • Slow jobs → check read replica load, part size too small/large, compression CPU bound.
  • Frequent 409 conflicts → review client workflow (don’t resume running jobs).
  • Webhook failures → verify DNS/TLS; use manual manifest retrieval.

Testing Scenarios

Happy Path Tests

  • Create job with 24h range → parts produced; manifest includes merkle root/signature.
  • Presigned URLs download successfully; counts/bytes match manifest.

Error Path Tests

  • 400 on invalid range/filters/format; 404 on unknown jobId; 409 on resume while running.
  • 429/503 lead to client backoff and eventual success.

Performance Tests

  • Validate time-to-first-part p95 ≤ 30 s under nominal load.
  • Confirm linear scaling with worker count up to configured cap.

Security Tests

  • audit:export scope enforced; cross-tenant access blocked.
  • Presigned URLs expire and are scoped to objects; encryption at rest verified.
  • Manifest signature verifies against Integrity public key.

Internal References

  • Legal Hold Export Flow
  • eDiscovery Export Flow
  • Bulk Export Flow
  • Audit Record Projection Update Flow

External References

  • RFC 4180 (CSV, if supported), JSON Lines spec, Parquet format spec
  • W3C Trace Context; RFC 7807 (Problem Details)

Appendices

A. Example Problem+JSON (retention violation)

{
  "type": "urn:connectsoft:errors/export/policy.violation",
  "title": "Retention policy violation",
  "status": 400,
  "detail": "Requested 'from' precedes tenant retention window.",
  "traceId": "9f0c1d2e3a4b5c6d...",
  "errors": [{"pointer": "/range/from", "reason": "before-retention-start"}]
}

B. Webhook Payload (HMAC signed)

{
  "event": "Export.Completed",
  "jobId": "exp_01JECXYZ...",
  "tenant": "acme",
  "manifestUrl": "https://api.../export/v1/jobs/exp_01JECXYZ.../manifest",
  "status": "Completed",
  "signature": {"alg":"HMAC-SHA256","kid":"wh-2025","ts":"2025-10-22T12:31:02Z","sig":"b64..."}
}
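
A receiver should verify the callback before acting on it. A minimal sketch, assuming the signing input is `ts` + "." + raw request body (a common convention; the platform's exact canonical form is defined by its webhook spec):

```python
import base64
import hashlib
import hmac

def verify_webhook(raw_body: bytes, ts: str, sig_b64: str, secret: bytes) -> bool:
    """Recompute HMAC-SHA256 over '<ts>.<body>' and compare in constant time."""
    mac = hmac.new(secret, ts.encode() + b"." + raw_body, hashlib.sha256).digest()
    return hmac.compare_digest(base64.b64encode(mac).decode(), sig_b64)
```

Binding the timestamp into the MAC lets the receiver also reject stale deliveries (replay protection) by checking `ts` against a freshness window.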

Legal Hold Export Flow

Export of audit data subject to active Legal Holds. The LegalHold Service validates scope and policy, instructs the Export Service to run a hold-compliant export, embeds integrity evidence (integrity root, hold decision metadata, and optional per-part/per-record proofs) in a signed manifest, delivers via secure presigned URLs and/or webhook, and emits completion events. Holds continue to block purge, and all actions are themselves audited.


Overview

Purpose: Produce a defensible, tamper-evident export of all records covered by one or more active Legal Holds for a tenant (or set of scopes).
Scope: Hold resolution & validation, compliance decision capture, hold-aware query scoping, chunked packaging, integrity & proof inclusion policy, secure delivery, resume/cancel, and auditable completion. Excludes non-hold exports (see Standard Export Flow).
Context: Builds on the Export Service and Integrity Service; queries the Read Store (projections) with server-side filters derived from LegalHold definitions and their current Revision.
Key Participants:

  • Legal Team / Client
  • API Gateway
  • LegalHold Service (hold registry, scope/eligibility, decisioning)
  • Export Service (orchestrator, packaging)
  • Query Service / Read Store (tenant-scoped reads)
  • Integrity Service (Merkle roots, signatures)
  • Delivery Backend (object storage, presigned URLs)
  • Webhook Receiver (optional callback endpoint)

Prerequisites

System Requirements

  • Gateway with TLS + JWT validation
  • LegalHold Service reachable; hold registry & revisioning enabled
  • Export Service has access to Read Store, Integrity, Delivery Backend
  • Webhook signing keys/KMS available if callbacks are used

Business Requirements

  • Target LegalHold exists and is Active (not Released)
  • Tenant retention/residency policies configured; hold implies purge block
  • Operator runbook for evidence requests and key rotation

Performance Requirements

  • p95 time-to-first-part ≤ 45 s for typical hold scopes
  • Concurrency caps per tenant and per hold to avoid read hot spots
  • Indexes support hold filters (resource/action/time) efficiently

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    actor Legal as Legal Team
    participant GW as API Gateway
    participant LHS as LegalHold Service
    participant EXP as Export Service
    participant Q as Query Service / Read Store
    participant INT as Integrity Service
    participant OBJ as Delivery Backend
    participant WH as Webhook (optional)

    Legal->>GW: POST /legal-hold/v1/exports {holdId, format, partSize, proofMode, webhook?}
    GW->>LHS: Validate authN/Z, fetch hold(holdId) + current Revision
    LHS-->>GW: 200 {holdSnapshot:{id, revision, scope, status:Active}}
    GW->>EXP: Create export job (mode: LEGAL_HOLD, holdSnapshot, proofMode)
    EXP-->>Legal: 202 Accepted {jobId, status:"Running"} (job continues async)
    EXP->>Q: Open scoped cursor using holdSnapshot.scope (tenant, filters, time)
    loop Chunk until exhausted
        Q-->>EXP: Page of rows + next cursor
        EXP->>INT: Add leaves to integrity segment (per-part proofs if requested)
        EXP->>OBJ: PUT part (JSONL/Parquet, optional gzip)
        EXP->>EXP: Record part metadata + resumeToken
    end
    EXP->>INT: Seal block → MerkleRoot + signature
    EXP->>EXP: Build signed ExportManifest {parts, counts, bytes, holdSnapshot, proofPolicy, merkleRoot, signature}
    Legal->>GW: GET /legal-hold/v1/exports/{jobId} (poll until Completed)
    alt webhook configured
        EXP->>WH: POST Export.Completed {jobId, manifestUrl, holdSnapshot, signature}
    end

Alternative Paths

  • Multiple holds: request {holdIds:[...]}; LHS returns merged scope (union) and aggregated decision id(s).
  • Incremental export: sinceDecisionId or sinceWatermark to export only new/changed covered records.
  • Client-provided storage: delivery.mode=client-presigned-put with per-part presigned PUT URLs.
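
The merged scope (union) for multiple holds can be sketched as follows (hypothetical helper; the scope shape is drawn from the manifest excerpt below, with a missing from/to treated as unbounded):

```python
def merge_hold_scopes(scopes: list[dict]) -> dict:
    """Union resourceTypes and widen the time window across hold scopes.
    A missing 'from'/'to' means unbounded and stays unbounded in the union."""
    types: set[str] = set()
    froms, tos = [], []
    bounded_from = bounded_to = True
    for s in scopes:
        types.update(s.get("resourceTypes", []))
        t = s.get("time", {})
        if "from" in t:
            froms.append(t["from"])
        else:
            bounded_from = False
        if "to" in t:
            tos.append(t["to"])
        else:
            bounded_to = False
    time: dict = {}
    if bounded_from and froms:
        time["from"] = min(froms)  # ISO-8601 UTC strings sort lexically
    if bounded_to and tos:
        time["to"] = max(tos)
    return {"resourceTypes": sorted(types), "time": time}
```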

Error Paths

sequenceDiagram
    actor Legal
    participant GW as API Gateway
    participant LHS as LegalHold Service

    Legal->>GW: POST /legal-hold/v1/exports {holdId:"unknown"}
    alt Bad request (malformed payload/params)
        GW-->>Legal: 400 Bad Request (Problem+JSON)
    else Hold not found or not Active
        GW->>LHS: GET hold(holdId)
        LHS-->>GW: 404/409 (Released|NotFound)
        GW-->>Legal: 404/409 Problem+JSON
    else Conflict with hold revision (If-Match mismatch)
        GW-->>Legal: 412 Precondition Failed (Problem+JSON)
    else Rate limited / dependency down
        GW-->>Legal: 429/503 (Retry-After)
    end

Request/Response Specifications

Input Requirements

Field Type Req Description Validation
Method/Path POST /legal-hold/v1/exports Y Create a hold-governed export job JSON body
Authorization header Y Bearer <JWT> Valid, not expired
x-tenant-id header Y Tenant scope Must match hold tenant
traceparent header O W3C trace context 55-char
holdId string Y Target legal hold id Exists & status=Active
ifMatch header O Expected holdRevision (optimistic) Matches current revision
format enum O jsonl (default), parquet allowlist
compression enum O none, gzip allowlist
partSizeMiB int O Target part size 16–1024, default 256
proofMode enum O manifest-only, per-part, per-record allowlist
webhook.url/webhook.secretId string O Completion callback + signing HTTPS + known KMS key
delivery.mode enum O presigned-get, client-presigned-put allowlist

Output Specifications

Create — 202 Accepted

Field Type Description
jobId string Server-assigned id
status enum Queued | Running
holdSnapshot object {id, revision, scope, decidedAt, decisionId}
proofPolicy object {mode, algorithm, keyId}
pollUrl / manifestUrl url Where to poll/fetch manifest

Manifest (excerpt)

{
  "jobId": "exp_01JF3…",
  "mode": "LEGAL_HOLD",
  "tenant": "acme",
  "holdSnapshot": {
    "id": "lh_2025_001",
    "revision": 7,
    "scope": {"resourceTypes":["Case.File","Iam.User"],"time":{"from":"2025-09-01T00:00:00Z"}},
    "decidedAt": "2025-10-10T12:01:22Z",
    "decisionId": "lhdec_8a12…"
  },
  "proofPolicy": {"mode":"per-part","algorithm":"Ed25519","keyId":"int-key-2025"},
  "integrity": {"merkleRoot":"8a4f…","signature":{"alg":"Ed25519","kid":"int-key-2025","sig":"MEQC…"}},
  "parts":[{"index":0,"url":"https://…/p0.gz","bytes":268435456,"records":100000,"etag":"\"abc123\""}]
}

Error Handling

Error Scenarios

HTTP Code Scenario Recovery Action Retry Strategy
400 Malformed body; unsupported format/proofMode; invalid partSizeMiB Correct request No retry until fixed
401 Missing/invalid JWT Acquire valid token Retry after renewal
403 Missing audit:legalhold.export or tenant mismatch Request proper role/scope No retry
404 holdId not found (or not in tenant) Verify hold/tenant
409 Hold status not Active (e.g., Released); job state conflict on resume/cancel Activate/select correct hold; create new job Retry after fix
412 If-Match revision mismatch (hold updated mid-flight) Re-fetch hold; restart with new revision Retry with new precondition
429 Per-tenant/global rate limit Respect Retry-After Backoff + jitter
503 Read store/Integrity/Delivery unavailable Wait for recovery Retry idempotently

Failure Modes

  • Hold mutated during export: precondition fails (412) to ensure defensibility; job halts.
  • Policy violation (residency/retention): 400 .../policy.violation with decisionId.
  • Webhook delivery failure: job completes; callback retried with backoff; manifest always retrievable.

Recovery Procedures

  1. On 412, fetch latest holdSnapshot and recreate the job.
  2. On 503/429, back off; use the server-provided resumeToken to continue.
  3. If policy violation, adjust scope with Legal team; re-request.

Performance Characteristics

Latency Expectations

  • Time-to-first-part p95 ≤ 45 s for typical holds.
  • Steady-state throughput bounded by read replicas and object storage.

Throughput Limits

  • Per hold: 1–2 concurrent jobs (configurable).
  • Per tenant: combined cap across holds/exports to preserve SLOs.

Resource Requirements

  • CPU for serialization/compression; memory for part buffers; IOPS for scans.

Scaling Considerations

  • Shard by tenant; sequence chunks with seek pagination.
  • Prefer per-part proofs for balance of size vs. verifiability; per-record for high-assurance cases only.

Security & Compliance

Authentication

  • OIDC JWT at Gateway; optional mTLS between services.

Authorization

  • Require audit:legalhold.read to resolve holds and audit:legalhold.export to create jobs.
  • Enforce RLS on reads; verify x-tenant-id vs hold tenant.

Data Protection

  • Parts encrypted at rest; presigned URLs are short-lived, least-privilege; webhook payloads HMAC-signed.
  • Redaction/minimization may still apply if configured for hold exports (jurisdictional constraint).

Compliance

  • Holds block purge throughout job lifetime; export does not weaken hold.
  • Manifest includes holdSnapshot (id, revision, decisionId) and integrity proofs per proofPolicy.
  • All requests emit audit entries (who, when, purpose, hold ids, decision ids).

Monitoring & Observability

Key Metrics

Metric Type Description Alert Threshold
lh_export_jobs_active gauge Running hold exports > cap per tenant/hold
lh_export_job_duration_seconds histogram Runtime per job p95 > SLO
lh_hold_revision_conflicts_total counter 412 preconditions hit Spike indicates frequent edits
lh_export_bytes_written_total counter Bytes exported under holds Trend/forecast
lh_export_failures_total counter Failed jobs > 0 sustained

Logging Requirements

  • Structured logs: tenant, holdId, holdRevision, jobId, decisionId, proofMode, partIndex, bytes, records, watermark. No raw PII.

Distributed Tracing

  • Spans: legalhold.resolve, export.create, query.page, integrity.seal, object.put, webhook.post.
  • Attributes: holdId, revision, proofMode, parts, bytes.

Health Checks

  • Readiness: LHS/Read Store/Integrity/Delivery reachable; signing keys loaded.
  • Liveness: worker queues healthy; no stuck jobs; purge-block signal latched for hold.

Operational Procedures

Deployment

  1. Deploy LegalHold Service & /legal-hold/v1/exports route behind Gateway.
  2. Configure KMS keys for manifest/proof signing and webhook HMAC.
  3. Validate end-to-end on a test hold (Active → export → Completed).

Configuration

  • Env: LH_EXPORT_MAX_CONCURRENCY, EXPORT_DEFAULT_PART_MIB, PROOF_DEFAULT_MODE, PRESIGN_TTL_SEC, WEBHOOK_SIGNING_KID.
  • Policy: toggle allowPerRecordProofs by edition/regulatory need.

Maintenance

  • Rotate signing keys; prune expired presigned URLs and old manifests per policy.
  • Periodically reconcile hold purge-block flags across stores.

Troubleshooting

  • 412 spikes → educate counsel/operators to avoid modifying holds during exports; rely on ifMatch.
  • Slow jobs → check read replica load, part size, compression CPU.
  • Webhook failures → review TLS/HMAC configuration; fall back to polling manifestUrl.

Testing Scenarios

Happy Path Tests

  • Active holdId export produces parts and manifest with holdSnapshot, merkleRoot, signature.
  • Proof policy per-part includes per-part proofs; manifest-only includes only root/signature.

Error Path Tests

  • 400 on unsupported proofMode/invalid partSizeMiB.
  • 404 on unknown holdId.
  • 409 when hold status is Released.
  • 412 when ifMatch revision mismatches.
  • 429/503 cause compliant backoff and resume.

Performance Tests

  • Time-to-first-part p95 ≤ 45 s under nominal load.
  • Linear scaling with additional workers up to cap.

Security Tests

  • RBAC scopes enforced; cross-tenant blocked.
  • Presigned URLs expire; webhook HMAC validates.
  • Manifest signature verifies with Integrity public key.

Internal References

  • Standard Export Flow
  • Legal Hold Processing Flow
  • Compliance Audit Flow

External References

  • RFC 7807 (Problem Details)
  • W3C Trace Context
  • Organization Legal Hold & Evidence Handling Policy

Appendices

A. Example Problem+JSON (hold released)

{
  "type": "urn:connectsoft:errors/legalhold/status.invalid",
  "title": "Hold is not active",
  "status": 409,
  "detail": "Legal hold 'lh_2025_001' is Released (rev=7).",
  "traceId": "9f0c1d2e3a4b5c6d...",
  "errors": [{"pointer": "/holdId", "reason": "released"}]
}

B. Proof Inclusion Policy Options

  • manifest-only: single MerkleRoot + signature in manifest.
  • per-part: each part contains a subtree root; manifest maps parts→proofs.
  • per-record (high assurance): each line embeds leaf hash or side proof; larger output, strongest verification.
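
All three modes rest on a standard Merkle construction. A minimal sketch (SHA-256, duplicating the last node on odd levels; the platform's actual tree rules live in the Integrity spec):

```python
import hashlib

def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    """Fold leaf hashes pairwise up to a single root."""
    level = [_h(l) for l in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate last node on odd levels
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def inclusion_proof(leaves: list[bytes], index: int) -> list[bytes]:
    """Sibling hashes from the leaf at `index` up to the root."""
    level = [_h(l) for l in leaves]
    proof = []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        proof.append(level[index ^ 1])  # sibling at this level
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return proof

def verify_inclusion(leaf: bytes, index: int, proof: list[bytes], root: bytes) -> bool:
    """Recompute the path from leaf to root; order depends on left/right position."""
    node = _h(leaf)
    for sib in proof:
        node = _h(sib + node) if index % 2 else _h(node + sib)
        index //= 2
    return node == root
```

per-record mode ships one such proof per line; per-part mode ships one subtree root per part, keeping bundle size proportional to parts rather than records.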

eDiscovery Export Flow

Generates a forensically defensible export tailored for eDiscovery: runs a scoped export, computes a signed ExportManifest, invokes KMS/HSM to produce a detached signature over the manifest and Merkle root, and assembles an Integrity Bundle (manifest + proofs + public key material) for delivery.


Overview

Purpose: Provide legal/forensic teams with a complete, tamper-evident export that includes a signed manifest and Merkle proofs suitable for independent verification.
Scope: Job creation, scoped read, manifest construction, Merkle tree computation, KMS signing, bundle packaging (ZIP/TAR.GZ), delivery via presigned URLs or webhook, and completion event. Excludes hold-governed constraints (see Legal Hold Export Flow) and generic on-demand exports (see Standard Export Flow).
Context: Builds on Export Service and Integrity Service with KMS/HSM for signing. Reads from Read Store via Query Service.
Key Participants:

  • eDiscovery Client (case management/tooling)
  • API Gateway
  • Export Service (orchestrator, packaging)
  • Query Service / Read Store (scoped reads)
  • Integrity Service (Merkle computation)
  • KMS/HSM (key management, signing)
  • Delivery Backend (object storage, presigned URLs)
  • Webhook Receiver (optional)

Prerequisites

System Requirements

  • Gateway with TLS + JWT validation
  • Export & Integrity Services deployed; integration with KMS/HSM configured (key IDs, policies)
  • Read Store accessible with RLS by tenantId
  • Object storage bucket for parts, manifest, and bundle

Business Requirements

  • Tenant’s retention/residency policies defined and enforced
  • eDiscovery caseId lifecycle managed (optional, but recommended)
  • Operator runbook for key rotation & signature verification

Performance Requirements

  • p95 time-to-manifest ≤ 30 s for typical 24–48h scopes
  • Bundle assembly completes ≤ 60 s after final part upload
  • Per-tenant export concurrency capped to protect read replicas

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    actor EDC as eDiscovery Client
    participant GW as API Gateway
    participant EXP as Export Service
    participant Q as Query Service / Read Store
    participant INT as Integrity Service
    participant KMS as KMS/HSM (Signer)
    participant OBJ as Delivery Backend
    participant WH as Webhook (optional)

    EDC->>GW: POST /ediscovery/v1/exports {tenant, caseId, range, filters, format, proofMode, bundle:{type}}
    GW->>EXP: Create export job (mode: EDISCOVERY) + params
    EXP-->>EDC: 202 Accepted {jobId, status:"Running"} (job continues async)
    EXP->>Q: Open scoped cursor (tenant/from-to/filters)
    loop Stream pages → parts
        Q-->>EXP: Page of rows + next cursor
        EXP->>INT: Update Merkle segment with leaf hashes
        EXP->>OBJ: PUT part (JSONL/Parquet, optional gzip)
        EXP->>EXP: Track part metadata (index, bytes, records, ETag)
    end
    EXP->>INT: Seal block → {merkleRoot}
    EXP->>EXP: Build ExportManifest {parts, counts, bytes, watermarks, merkleRoot}
    EXP->>KMS: Sign canonicalized(manifest) + merkleRoot → {signature, kid, alg}
    EXP->>OBJ: PUT manifest.json and manifest.sig
    EXP->>EXP: Assemble Integrity Bundle (manifest, signature, publicKey/chain, optional proofs)
    EXP->>OBJ: PUT bundle (bundle.zip/.tar.gz) → bundleUrl
    EDC->>GW: GET /ediscovery/v1/exports/{jobId} (poll until Completed)
    alt webhook configured
        EXP->>WH: POST Export.Completed {jobId, bundleUrl, manifestUrl, signature}
    end

Alternative Paths

  • Proof modes: manifest-only (root+sig), per-part (subtree proofs), per-record (leaf proofs; larger bundle).
  • Client-provided storage: delivery.mode=client-presigned-put for manifest/parts/bundle.
  • Re-sign: POST /ediscovery/v1/exports/{jobId}:resign {kid} to reissue signature with a rotated key (no data rewrite).

Error Paths

sequenceDiagram
    actor EDC
    participant GW as API Gateway
    participant EXP as Export Service

    EDC->>GW: POST /ediscovery/v1/exports {invalid params}
    alt Malformed request / unsupported proofMode/format
        GW-->>EDC: 400 Bad Request (Problem+JSON)
    else Unknown tenant / route
        GW-->>EDC: 404 Not Found (Problem+JSON)
    else Conflict (resign while job running, or bundle requested before complete)
        GW-->>EDC: 409 Conflict (Problem+JSON)
    else Unauthorized / Forbidden
        GW-->>EDC: 401/403 (Problem+JSON)
    else Backpressure / dependency down
        GW-->>EDC: 429/503 (Retry-After)
    end

Request/Response Specifications

Input Requirements

Field Type Req Description Validation
Method/Path POST /ediscovery/v1/exports Y Create eDiscovery export JSON body
Authorization header Y Bearer <JWT> Valid, not expired
x-tenant-id header Y* Tenant scope Must match tenant
traceparent header O W3C trace context 55-char
tenant string Y Target tenant ^[A-Za-z0-9._-]{1,128}$
caseId string O eDiscovery case identifier ≤ 128 chars
range object O {from?, to?} ISO-8601 UTC from ≤ to, retention bounds
filters object O Allowlisted predicates Server validated
format enum O jsonl (default), parquet allowlist
compression enum O none, gzip (default) allowlist
proofMode enum O manifest-only, per-part, per-record allowlist
bundle.type enum O zip (default), tar.gz allowlist
kms.kid string O Key id for signing Must exist in KMS
delivery.mode enum O presigned-get, client-presigned-put allowlist
webhook.url/webhook.secretId string O Completion callback + HMAC key HTTPS + known key
  • Header may be omitted if using path variant /tenants/{tenantId}/ediscovery/exports.

Output Specifications

Create — 202 Accepted

Field Type Description
jobId string Server-assigned id (ULID/GUID)
status enum Queued | Running
pollUrl url GET /ediscovery/v1/exports/{jobId}
manifestUrl url? Available once ready
bundleUrl url? Available once ready

Get — 200 OK

Field Type Description
jobId string Identifier
status enum Queued | Running | Sealing | Signing | Bundling | Completed | Failed | Canceled
counts object {records, parts}
bytes object {written}
merkleRoot string Hex/base64url root
signature object? {alg,kid,sig} once signed
manifestUrl / bundleUrl url? Delivery endpoints
resumeToken string? For resume/retry
startedAt/finishedAt timestamp ISO-8601 UTC

Integrity Bundle Contents (concept)

bundle/
  manifest.json
  manifest.sig            # COSE_Sign1 or JWS (detached)
  integrity/
    root.json             # { merkleRoot, algorithm, createdAt }
    proofs/               # per-part or per-record .proof files (optional)
  keys/
    publicKey.pem         # PEM or JWK
    key-metadata.json     # { kid, alg, issuer, notBefore, notAfter }
  README.txt              # verification instructions

Manifest (excerpt)

{
  "jobId": "exp_01JFG2...",
  "mode": "EDISCOVERY",
  "tenant": "acme",
  "caseId": "CASE-2025-0421",
  "range": {"from":"2025-10-01T00:00:00Z","to":"2025-10-22T23:59:59Z"},
  "format": "jsonl",
  "compression": "gzip",
  "parts": [
    {"index":0,"url":"https://.../p0.gz","bytes":268435456,"records":100000,"etag":"\"abc123\""}
  ],
  "counts":{"records":250000,"parts":3},
  "bytes":{"written":734003200},
  "integrity":{"merkleRoot":"8a4f...","algorithm":"sha256","createdAt":"2025-10-22T12:30:12Z"},
  "createdAt":"2025-10-22T12:30:12Z",
  "watermark":"2025-10-22T12:25:00Z"
}

Error Handling

Error Scenarios

HTTP Code Scenario Recovery Action Retry Strategy
400 Malformed body; invalid range/filters; unsupported proofMode/format/bundle.type; unknown kms.kid Correct request/params No retry until fixed
401 Missing/invalid JWT Acquire valid token Retry after renewal
403 Missing audit:ediscovery.export or tenant mismatch Request proper scope/role No retry
404 Tenant/route not found; jobId unknown; manifest/bundle not available Verify tenant/IDs; wait for completion
409 Bundle requested before job complete; resign while signing; resume on terminal job Poll until terminal; create new job Retry after fix
412 If-Match on manifest version failed (re-signed) Fetch latest manifest; retry Retry with new ETag
429 Per-tenant/global export rate limited Respect Retry-After Exponential backoff + jitter
503 Read store/Integrity/KMS/Object storage unavailable Wait for recovery Retry idempotently

Failure Modes

  • KMS key disabled/rotated: signing fails → 503; operator selects new kid or uses :resign.
  • Proof volume blow-up with per-record mode on huge jobs → 413/422 with guidance to switch to per-part.
  • Residency/retention policy violation → 400 .../policy.violation (decision id included).

Recovery Procedures

  1. On 409, poll job status until Completed then fetch manifestUrl/bundleUrl.
  2. On 503/429, back off and use resumeToken to continue without duplicating parts.
  3. On signature/key issues, re-run :resign with a valid kms.kid.

Performance Characteristics

Latency Expectations

  • Time-to-manifest p95 ≤ 30 s for typical scopes; bundling overhead ≤ 60 s.

Throughput Limits

  • Per tenant: ≤ 2 concurrent eDiscovery jobs (configurable).
  • Global: limited by export workers, KMS QPS, and object storage throughput.

Resource Requirements

  • CPU for serialization/compression; memory for part buffers and proof generation; KMS signing latency budget (p95 ≤ 100 ms).

Scaling Considerations

  • Horizontal export workers; bound KMS concurrency; stream proof files to avoid large in-memory structures.
  • Prefer per-part proofs for balance of size vs. verifiability.

Security & Compliance

Authentication

  • OIDC JWT at Gateway; optional mTLS for service-to-service calls.

Authorization

  • Require audit:ediscovery.export; enforce RLS on reads; verify x-tenant-id.

Data Protection

  • Object storage encryption at rest; time-limited presigned URLs; webhook payloads HMAC-signed.
  • No raw secret material in logs; public keys shipped as JWK/PEM inside bundle only.
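
The HMAC signing of webhook payloads mentioned above can be checked on the receiver side roughly as follows; the hex-digest encoding and header placement are assumptions to be matched against the actual webhook contract:

```python
import hashlib
import hmac

def verify_webhook(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Constant-time check of an HMAC-SHA256 signature over the raw body.

    Assumes the sender computes HMAC-SHA256(secret, body) and transmits
    the hex digest alongside the payload (e.g. in a header).
    """
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)
```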

Compliance

  • Manifest + signature + proofs enable independent verification.
  • Include watermark (projection snapshot time) and caseId in manifest for chain-of-custody.
  • Emit audit entries for create/resume/resign/bundle fetch actions.

Monitoring & Observability

Key Metrics

Metric Type Description Alert Threshold
ediscovery_jobs_active gauge Running jobs > tenant/global cap
manifest_build_duration_ms histogram Build + sign time p95 > 30s
kms_sign_latency_ms histogram KMS sign call p95 > 100 ms
bundle_bytes_total counter Size of bundles Trend/forecast
ediscovery_failures_total counter Failed jobs > 0 sustained

Logging Requirements

  • Structured logs: tenant, caseId, jobId, merkleRoot, kid, proofMode, parts, bytes, watermark.
  • Do not log raw proofs or presigned URLs.

Distributed Tracing

  • Spans: export.create, query.page, integrity.seal, kms.sign, bundle.pack, object.put, webhook.post.
  • Attributes: kid, proofMode, bundleType, parts, bytes.

Health Checks

  • Readiness: KMS key available, Integrity & Object storage reachable.
  • Liveness: worker queues healthy; no stuck Signing/Bundling states.

Operational Procedures

Deployment

  1. Configure KMS key(s) and kid mapping; verify sign/verify path in staging.
  2. Deploy /ediscovery/v1/exports route; ensure buckets and presign service are ready.
  3. Validate end-to-end: create job → manifest signed → bundle downloadable and verifiable.

Configuration

  • Env: EXPORT_MAX_CONCURRENCY_PER_TENANT, EXPORT_DEFAULT_PART_MIB, PROOF_DEFAULT_MODE, KMS_DEFAULT_KID, PRESIGN_TTL_SEC.
  • Policies: enforce retention/residency on the export scope.

Maintenance

  • Rotate KMS keys; support :resign to reissue signatures.
  • Prune expired presigned URLs and old bundles per policy.

Troubleshooting

  • High kms_sign_latency_ms → check KMS limits/region; enable key caching.
  • Large bundles/timeouts → switch to per-part proofs; increase part size.
  • 409 conflicts → ensure clients poll status before requesting bundle/resign.

Testing Scenarios

Happy Path Tests

  • Create eDiscovery export with proofMode=per-part → manifest + signature + bundle available; verification succeeds.
  • resign with new kid produces new manifest.sig without rewriting parts.

Error Path Tests

  • 400 on invalid proofMode/format/bundle.type or bad range.
  • 404 on unknown jobId or bundle before creation.
  • 409 when requesting bundle before completion or resign during signing.
  • 429/503 trigger compliant backoff and resume.

Performance Tests

  • Time-to-manifest p95 ≤ 30 s; bundling overhead ≤ 60 s under nominal load.
  • KMS signing p95 ≤ 100 ms for 95% of signatures.

Security Tests

  • RBAC scope audit:ediscovery.export enforced; cross-tenant blocked.
  • Manifest signature verifies with exported public key (JWK/PEM).
  • Presigned URLs expire and are least-privilege.

Internal References

External References

  • COSE (RFC 8152) / JWS (RFC 7515) for signatures
  • W3C Trace Context; RFC 7807 (Problem Details)

Appendices

A. Example manifest.sig (JWS detached)

{
  "protected": "eyJhbGciOiJFZDI1NTE5Iiwia2lkIjoiaW50LWtleS0yMDI1In0",
  "signature": "L5Jq...cQ"
}
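
The `protected` value is base64url-encoded JSON; decoding it exposes the algorithm and key id a verifier must use (note that RFC 7515/8037 register the EdDSA algorithm name as `EdDSA`; this example carries `Ed25519` literally). A minimal decoding sketch:

```python
import base64
import json

def decode_protected(b64url: str) -> dict:
    """Decode a JWS protected header, re-adding stripped base64url padding."""
    padded = b64url + "=" * (-len(b64url) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))

header = decode_protected("eyJhbGciOiJFZDI1NTE5Iiwia2lkIjoiaW50LWtleS0yMDI1In0")
```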

B. Verification Outline

  1. Download manifest.json, manifest.sig, and keys/publicKey.pem.
  2. Verify signature over canonicalized manifest (UTF-8, no BOM).
  3. Recompute Merkle root from all part proofs (if provided) and compare to manifest.integrity.merkleRoot.
  4. Spot-verify a subset of parts/records using proofs/*.proof.
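
Step 3 can be sketched as below, assuming a plain binary SHA-256 tree in which an odd trailing node is promoted unchanged; the real leaf encoding, tree shape, and any domain separation are defined by the Integrity spec and must be matched exactly:

```python
import hashlib

def merkle_root(leaves: list[bytes]) -> str:
    """Recompute a binary SHA-256 Merkle root over part/record contents."""
    level = [hashlib.sha256(leaf).digest() for leaf in leaves]
    while len(level) > 1:
        nxt = [hashlib.sha256(level[i] + level[i + 1]).digest()
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # odd node: promote to the next level
            nxt.append(level[-1])
        level = nxt
    return level[0].hex()
```

The result is compared byte-for-byte against `manifest.integrity.merkleRoot`.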

Bulk Export Flow

Scheduled or ad-hoc large-scale exports that split a wide scope into time/key slices, run them in parallel across a controlled worker pool, write results as multiple packages (parts/bundles), and support resume tokens for fault-tolerant continuation. Exposes explicit SLA/throughput metrics and enforces per-tenant/global concurrency limits.


Overview

Purpose: Efficiently export very large datasets (days/months of audit events) on a schedule or on demand, with parallelization, resumability, and integrity/manifest generation.
Scope: Scheduler, job creation, slicing strategy (time/partition), parallel workers, packaging (JSONL/Parquet, gzip), resume/cancel, integrity sealing, delivery via presigned URLs/webhook, and metrics. Excludes hold-specific rules (see Legal Hold Export Flow) and eDiscovery signing options (see eDiscovery Export Flow).
Context: Orchestrated by Export Service with a Scheduler; reads from Read Store via Query Service; uses Integrity Service for Merkle roots/signatures and Object Storage for parts/bundles.
Key Participants:

  • Scheduler (cron/rrule, “run now”)
  • API Gateway
  • Export Service (orchestrator, slicer, worker pool)
  • Query Service / Read Store (tenant-scoped scans)
  • Integrity Service (hash/merkle/seal)
  • Object Storage (parts, manifests, bundles)
  • Webhook Receiver (optional callbacks)
  • Metrics/Tracing Backend

Prerequisites

System Requirements

  • Gateway with TLS + JWT; Export Service reachable by Scheduler
  • Read Store with RLS by tenantId; seek pagination available
  • Integrity Service & Object Storage configured (KMS keys, buckets)
  • Clock skew controls; partition catalog available for slicing

Business Requirements

  • Tenant retention/residency policies configured and enforced
  • Export feature/edition enabled; per-tenant concurrency limits defined
  • Optional webhook signing keys provisioned

Performance Requirements

  • Target throughput per worker (e.g., 50–150 MB/s effective)
  • Time-to-first-part p95 ≤ 60 s for bulk slice runs
  • Slice width chosen to keep slice p95 ≤ 10–20 min under load

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    actor Sched as Scheduler
    participant GW as API Gateway
    participant EXP as Export Service (Orchestrator)
    participant SL as Slicer / Planner
    participant WP as Worker Pool
    participant Q as Query Service / Read Store
    participant INT as Integrity Service
    participant OBJ as Object Storage
    participant WH as Webhook (optional)

    Sched->>GW: POST /export/v1/bulk-jobs {tenant, schedule, range, sliceWidth, format, partSize}
    GW->>EXP: Create/Upsert BulkJob
    loop On schedule tick or run-now
        EXP->>SL: Plan slices for window (time/partition)
        SL-->>EXP: [Slice#0..Slice#N] + dependencies
        par N parallel slices (bounded by concurrency caps)
            EXP->>WP: Dispatch Slice#i {cursor, sliceWindow, resumeToken?}
            WP->>Q: Stream pages via seek pagination
            Q-->>WP: Rows + next cursor
            WP->>INT: Append leaf hashes, update merkle segment
            WP->>OBJ: PUT part(s) (JSONL/Parquet, gzip?)
            WP->>EXP: Report progress {bytes, records, partMeta, resumeToken}
        end
        EXP->>INT: Seal slice block → MerkleRoot + signature
        EXP->>OBJ: PUT slice manifest, update BulkJob manifest index
        alt webhook configured
            EXP->>WH: POST Export.SliceCompleted {jobId, sliceId, manifestUrl}
        end
    end
    EXP->>OBJ: PUT final Bulk Manifest (index of slice manifests) + signature
    EXP-->>GW: 200/202 {jobId, status:"Completed", manifestUrl, stats}
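
The "stream pages via seek pagination" step above avoids OFFSET scans by carrying a (timestamp, id) cursor between pages. An in-memory sketch of one page; the real implementation is a store query along the lines of `WHERE (ts, id) > (:ts, :id) ORDER BY ts, id LIMIT :n`:

```python
def seek_page(rows, cursor, page_size=1000):
    """One page of keyset ('seek') pagination over (ts, id) keys.

    `rows` must be sorted by (ts, id); `cursor` is the last-seen key or
    None for the first page. Returns the page and the next cursor.
    """
    start = 0
    if cursor is not None:
        while start < len(rows) and (rows[start]["ts"], rows[start]["id"]) <= cursor:
            start += 1
    page = rows[start:start + page_size]
    next_cursor = (page[-1]["ts"], page[-1]["id"]) if page else None
    return page, next_cursor
```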

Alternative Paths

  • Run now: POST /export/v1/bulk-jobs/{id}:run-now triggers immediate cycle outside schedule.
  • Catch-up mode: planner advances by watermark; only exports new slices since last success.
  • Client-managed storage: use presigned PUT per slice/part.
  • Dynamic re-slicing: large slices auto-split if runtime exceeds threshold.
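
The planner's slicing step amounts to cutting the window into contiguous intervals; a sketch assuming the width was already parsed (e.g. `24h` → `timedelta(hours=24)`):

```python
from datetime import datetime, timedelta

def plan_slices(start: datetime, end: datetime,
                width: timedelta) -> list[tuple[datetime, datetime]]:
    """Plan contiguous [from, to) slices covering [start, end).

    The trailing slice is clipped to `end`; dynamic re-slicing would
    further split any slice whose runtime exceeds the threshold.
    """
    slices, cursor = [], start
    while cursor < end:
        upper = min(cursor + width, end)
        slices.append((cursor, upper))
        cursor = upper
    return slices
```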

Error Paths

sequenceDiagram
    actor Client
    participant GW as API Gateway
    participant EXP as Export Service

    Client->>GW: POST /export/v1/bulk-jobs {invalid config}
    alt Bad request (bad schedule/sliceWidth/partSize)
        GW-->>Client: 400 Problem+JSON
    else Unknown jobId / tenant route not found
        GW-->>Client: 404 Problem+JSON
    else Conflict (modify running job / duplicate schedule window)
        GW-->>Client: 409 Problem+JSON
    else Backpressure or deps down
        GW-->>Client: 429/503 Problem+JSON (+ Retry-After)
    end

Request/Response Specifications

Input Requirements

Field Type Req Description Validation
Create/Update POST /export/v1/bulk-jobs Y Create/Upsert bulk job JSON body
Authorization header Y Bearer <JWT> Valid, not expired
x-tenant-id header Y* Tenant scope Must match body.tenant
tenant string Y Target tenant ^[A-Za-z0-9._-]{1,128}$
range object O {from?, to?} for initial catch-up ISO-8601 UTC
schedule object O {cron:"0 2 * * *"} or {rrule:"RRULE:..."} validated
sliceWidth string O e.g., 24h, 7d, 1mo max per policy
format enum O jsonl (default) parquet
compression enum O none gzip (default)
partSizeMiB int O 16–1024 (default 256) bounds checked
maxParallelSlices int O Per-tenant concurrency cap ≤ tenant cap
webhook.url/secretId string O Completion callbacks HTTPS + known key
delivery.mode enum O presigned-get client-presigned-put

*Header may be omitted for /tenants/{tenantId}/export/bulk-jobs.

Control Endpoints

  • POST /export/v1/bulk-jobs/{id}:run-now
  • POST /export/v1/bulk-jobs/{id}:pause / :resume / :cancel
  • GET /export/v1/bulk-jobs/{id} (status, stats, current window, next run)
  • GET /export/v1/bulk-jobs/{id}/manifest (bulk manifest index)

Output Specifications

Field Type Description Notes
jobId string Bulk job identifier ULID/GUID
status enum Paused | Scheduled | Running | Completed | Failed | Canceled
currentSlice object? {sliceId, window, status, resumeToken} When running
stats object {slicesCompleted, bytes, records, parts} Cumulative
manifestUrl url? Bulk manifest index After completion
nextRunAt timestamp Next scheduled tick ISO-8601 UTC

Bulk Manifest Index (concept)

{
  "jobId":"bulk_01JH2…",
  "tenant":"acme",
  "schedule":"0 2 * * *",
  "slices":[
    {"sliceId":"s_2025_10_01","from":"2025-10-01T00:00:00Z","to":"2025-10-02T00:00:00Z","manifestUrl":"https://.../s_2025_10_01.manifest.json","merkleRoot":"8a4f...","signature":{"alg":"Ed25519","kid":"int-key-2025","sig":"MEQC..."}}  
  ],
  "counts":{"records":12003450,"parts":480},
  "bytes":{"written":358721987654},
  "createdAt":"2025-10-22T02:00:00Z",
  "completedAt":"2025-10-22T09:40:00Z"
}

Error Handling

Error Scenarios

HTTP Code Scenario Recovery Action Retry Strategy
400 Invalid schedule/sliceWidth/partSizeMiB; malformed range Fix config No retry until corrected
401 Missing/invalid JWT Obtain valid token Retry after renewal
403 Missing audit:export.bulk or tenant mismatch Request proper scope/role
404 Unknown jobId or tenant route disabled Verify identifiers/tenant
409 Modify/pause/resume conflict; duplicate scheduled window; attempt to run while Running Wait/resolve state; use run-now after idle Retry after fix
413 Estimated bulk size exceeds job cap Reduce scope/sliceWidth; increase cap by policy
422 sliceWidth too large for SLO; range outside retention Choose smaller slices / valid window
429 Per-tenant/global concurrency limit hit Honor Retry-After Backoff + jitter
503 Read store/Integrity/Object storage unavailable Wait for recovery Idempotent retry using resumeToken

Failure Modes

  • Slice timeout → auto reslice into smaller sub-slices; remaining work re-queued.
  • Resume after crash → resumeToken resumes at last committed cursor/part.
  • Storage throttling → Export Service reduces parallelism; returns 429 to clients.

Recovery Procedures

  1. Use :resume with server-provided resumeToken to continue failed slices.
  2. On 429/503, back off and let the scheduler retry the tick; do not spawn duplicate runs.
  3. Adjust sliceWidth/maxParallelSlices to match observed throughput.

Performance Characteristics

Latency Expectations

  • Time-to-first-part p95 ≤ 60 s per run.
  • Per-slice runtime p95 within configured SLO (e.g., ≤ 15 min for 24h slice on typical volume).

Throughput Limits

  • Per worker: target sustained 50–150 MB/s effective write.
  • Per tenant: cap maxParallelSlices (e.g., ≤ 4).
  • Global: orchestrator enforces cluster-wide max workers.

Resource Requirements

  • CPU for serialization/compression; RAM for part buffers; IOPS for wide scans; network to object storage.

Scaling Considerations

  • Plan then fan-out: precompute slice plan and submit to a bounded queue.
  • Fair-share: per-tenant token bucket to avoid noisy neighbors.
  • Adaptive concurrency: scale workers based on export QPS, object storage throttling, and read replica load.
  • Backpressure: honor Retry-After; dynamically shrink maxParallelSlices.
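
The per-tenant fair-share mentioned above is commonly a token bucket; a single-threaded sketch (`rate` in tokens per second; production code would add locking and persistence):

```python
import time

class TokenBucket:
    """Per-tenant token bucket for fair-share slice admission.

    Each tenant accrues `rate` tokens/second up to `capacity`; a slice
    is dispatched only when a token is available, bounding any single
    tenant's share of the worker pool.
    """
    def __init__(self, rate: float, capacity: float):
        self.rate, self.capacity = rate, capacity
        self.tokens = capacity
        self.updated = time.monotonic()

    def try_acquire(self, n: float = 1.0) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```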

Security & Compliance

Authentication

  • OIDC JWT at Gateway; optional mTLS for service-to-service calls.

Authorization

  • Require audit:export.bulk; enforce RLS on reads; validate x-tenant-id.

Data Protection

  • Server-side encryption at rest; presigned URLs short-lived and scoped.
  • Optional on-read masking if bulk job set to filtered mode.

Compliance

  • Respect retention/residency; include watermarks and integrity proofs per slice.
  • Emit audit events for schedule create/update, run, pause/resume/cancel, and completion.

Monitoring & Observability

Key Metrics

Metric Type Description Alert Threshold
bulk_jobs_active gauge Running bulk jobs > global cap
bulk_slices_inflight gauge Concurrent slice executions > per-tenant cap
bulk_bytes_written_total counter Bytes written across slices Trend/throughput
bulk_slice_duration_seconds histogram Runtime per slice p95 > SLO
bulk_failures_total counter Failed slices/jobs > 0 sustained
resume_events_total counter Resumed slices Spike indicates instability

Logging Requirements

  • Structured logs: tenant, jobId, sliceId, window, resumeToken, parts, bytes, records, watermark, merkleRoot, status. No raw PII or presigned URLs.

Distributed Tracing

  • Spans: bulk.plan, slice.run, query.page, serialize.part, object.put, integrity.seal, webhook.post.
  • Attributes: sliceWidth, parallelism, bytes, records, throttleEvents.

Health Checks

  • Readiness: object storage/Integrity/Read Store reachable; scheduler connected.
  • Liveness: worker queues draining; no stuck slices beyond timeout.

Operational Procedures

Deployment

  1. Deploy Scheduler and Export Service; register /export/v1/bulk-jobs routes.
  2. Configure tenant/global concurrency caps and default sliceWidth.
  3. Run a dry run on a non-prod tenant to validate planning and sealing.

Configuration

  • Env: BULK_MAX_PARALLEL_SLICES_PER_TENANT, BULK_DEFAULT_SLICE_WIDTH, EXPORT_DEFAULT_PART_MIB, RESUME_TOKEN_TTL, PRESIGN_TTL_SEC, SLA_SLICE_P95_SECONDS.
  • Planner: enable dynamic reslicing thresholds (time/size).

Maintenance

  • Rotate signing keys; prune expired manifests/parts; archive bulk manifest indices per policy.
  • Periodically reassess sliceWidth vs. observed volumes.

Troubleshooting

  • Many resume events → check read replica throttling/object storage limits; reduce parallelism.
  • Frequent 409 on job ops → ensure clients don’t modify running jobs; use pause then update.
  • Slow slices → inspect filters/indexes and increase part size or reduce masking.

Testing Scenarios

Happy Path Tests

  • Create bulk job with cron schedule; verify automatic run creates multiple slices and parts.
  • Resume a slice after induced worker crash using resumeToken.

Error Path Tests

  • 400 on invalid schedule/sliceWidth/partSizeMiB.
  • 404 on unknown jobId.
  • 409 when updating a running job without pause.
  • 429/503 cause backoff and eventual success without duplication.

Performance Tests

  • Achieve target throughput per worker and per tenant; slice p95 ≤ SLO.
  • Concurrency caps prevent read replica saturation.

Security Tests

  • RBAC audit:export.bulk enforced; cross-tenant isolation verified.
  • Presigned URLs expire and are least-privilege.
  • Integrity sealing produces valid Merkle roots/signatures per slice.

Internal References

External References

  • RFC 7807 (Problem Details)
  • W3C Trace Context

Appendices

A. Example Create Bulk Job Request

{
  "tenant": "acme",
  "schedule": { "cron": "0 2 * * *" },
  "range": { "from": "2025-09-01T00:00:00Z" },
  "sliceWidth": "24h",
  "format": "parquet",
  "compression": "gzip",
  "partSizeMiB": 256,
  "maxParallelSlices": 3,
  "delivery": { "mode": "presigned-get" },
  "webhook": { "url": "https://hooks.example/exports", "secretId": "wh-2025" }
}

B. Resume Token (concept)

{
  "sliceId":"s_2025_10_21",
  "cursor":"eyJ0cyI6IjIwMjUtMTAtMjFUMTI6MDA6MDAuMDAwWiIsImlkIjoiMDFK...In0",
  "partIndex": 17,
  "bytesCommitted": 134217728
}
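
A resume token's `cursor` is typically opaque base64url-encoded JSON; a round-trip sketch with illustrative field names:

```python
import base64
import json

def encode_cursor(ts: str, record_id: str) -> str:
    """Pack a (timestamp, id) seek position into an opaque base64url cursor."""
    raw = json.dumps({"ts": ts, "id": record_id}).encode()
    return base64.urlsafe_b64encode(raw).decode().rstrip("=")

def decode_cursor(cursor: str) -> dict:
    """Unpack a cursor produced by encode_cursor, re-adding padding."""
    padded = cursor + "=" * (-len(cursor) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))
```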

Retention Policy Evaluation Flow

Computes and records eligibleAt timestamps for purge based on the active Retention Policy. Evaluations run on schedule and on policy change, marking candidates in the retention index and emitting Retention.EligibleComputed events with decision basis (policy id, rule id, revision, window).


Overview

Purpose: Determine when audit records (or partitions) become eligible for purge and persist eligibleAt along with decision metadata for defensible lifecycle operations.
Scope: Policy fetch & revision checks, rules evaluation (scopes/windows/exceptions), candidate marking, event emission, and re-evaluation on policy updates or clock ticks. Excludes purge execution (see Data Lifecycle & States / Purge flow).
Context: The Policy Service is the source of truth for Retention Policies and their forward-only revisions. The Lifecycle Evaluator (part of Policy or Lifecycle service) scans read/canonical stores and updates a Retention Index used by purge workers.
Key Participants:

  • Scheduler (periodic + on-change trigger)
  • API Gateway (for admin endpoints)
  • Policy Service (policies, revisions, decisions)
  • Lifecycle Evaluator (rules engine, candidate marker)
  • Metadata/Retention Index (stores eligibleAt, decision basis)
  • Event Bus (emits Retention.EligibleComputed)

Prerequisites

System Requirements

  • Policy Service reachable; policy registry seeded with tenant policy
  • Lifecycle Evaluator has read access to stores and write access to Retention Index
  • Event Bus configured for Retention.* topics
  • Time source synchronized; clock skew guardrails applied

Business Requirements

  • Tenant has an Active retention policy with forward-only Revision
  • Residency constraints configured (region-aware evaluation if required)
  • Legal Holds honored (holds block eligibility marking)

Performance Requirements

  • Evaluation p95 per partition ≤ 3 min for typical volumes
  • Index write throughput supports peak daily windows (e.g., midnight marks)
  • Backpressure controls on scans and index writes

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    participant SCH as Scheduler
    participant POL as Policy Service
    participant LCE as Lifecycle Evaluator
    participant IDX as Retention Index
    participant BUS as Event Bus

    SCH->>POL: GET /policy/v1/retention?tenant=acme (If-None-Match: rev)
    POL-->>SCH: 200 {policyId, revision, rules, windows} or 304 if unchanged
    SCH->>LCE: Trigger evaluate {tenant, policyId, revision, windowHint}
    LCE->>LCE: Enumerate candidate sets (by partition/time/resource)
    LCE->>LCE: For each record/partition: compute eligibleAt = createdAt + window(rule)
    LCE->>IDX: Upsert {recordId/partitionKey, eligibleAt, decisionBasis{policyId,ruleId,revision}}
    IDX-->>LCE: Ack (batched)
    LCE->>BUS: Publish Retention.EligibleComputed {tenant, policyId, revision, stats}

Alternative Paths

  • On-Change Re-eval: Policy.Changed event triggers incremental re-evaluation for affected scopes only.
  • Partition-Level Evaluation: compute once per partition boundary and apply to contained records (for WORM append stores).
  • Dry Run: evaluation writes to a shadow index and returns a delta report (no marking).

Error Paths

sequenceDiagram
    participant GW as API Gateway
    participant POL as Policy Service
    participant LCE as Lifecycle Evaluator

    GW->>POL: POST /policy/v1/retention:evaluate {tenant, revision:999}
    alt Unknown tenant/policy
        POL-->>GW: 404 Not Found (Problem+JSON)
    else Revision conflict (client expects different rev)
        POL-->>GW: 409 Conflict (Problem+JSON)
    else Bad request (invalid window/spec)
        POL-->>GW: 400 Bad Request (Problem+JSON)
    end

Request/Response Specifications

Input Requirements

Field Type Req Description Validation
Method/Path POST /policy/v1/retention:evaluate Y Manual/adhoc evaluation trigger JSON body
Authorization header Y Bearer <JWT> Valid, not expired
x-tenant-id header Y Tenant scope ^[A-Za-z0-9._-]{1,128}$
traceparent header O W3C trace context 55-char
policyId string O Explicit policy to apply Must belong to tenant
revision int O Expected policy revision (If-Match equivalent) Mismatch with current causes 409
scope object O Limit evaluation to subset (time/resources) Server-validated
mode enum O normal (default) dry-run

Output Specifications

202 Accepted

Field Type Description
evaluationId string Operation identifier
status enum Queued | Running
policy object {policyId, revision}
scopeApplied object Effective evaluated scope

200 OK (dry-run report)

Field Type Description
estimatedCandidates int Count that would be marked
sample array Example {recordId, computedEligibleAt, ruleId}
diff object Prior vs. new policy impact

Retention Index (concept row)

{
  "tenantId": "acme",
  "recordId": "01JECZ6Y8K1V...",
  "eligibleAt": "2026-01-21T10:12:00Z",
  "decisionBasis": { "policyId":"ret_001", "ruleId":"r_login_365d", "revision":5 },
  "decidedAt": "2025-10-22T12:00:00Z"
}

Event Retention.EligibleComputed (summary)

{
  "tenant": "acme",
  "policyId": "ret_001",
  "revision": 5,
  "window": {"from":"2025-10-21T00:00:00Z","to":"2025-10-22T00:00:00Z"},
  "stats": {"marked": 124553, "skippedHeld": 112, "errors": 0}
}

Error Handling

Error Scenarios

HTTP Code Scenario Recovery Action Retry Strategy
400 Invalid policy spec/windows; negative/zero retention; malformed scope Fix policy/scope No retry until corrected
401 Missing/invalid JWT Acquire valid token Retry after renewal
403 Missing policy:retention.evaluate permission Request proper scope/role
404 Tenant/policy not found Verify tenant/policy id
409 Policy revision conflict; evaluation for same scope already running Re-fetch policy; wait or cancel prior run Retry after fix
412 If-Match (revision) mismatch Fetch latest policy; retry with current rev Conditional retry
422 Policy invalid for tenant residency/edition Adjust policy to constraints
429 Evaluator rate limited Honor Retry-After Backoff + jitter
503 Stores/Index/Event bus unavailable Wait for recovery Idempotent retry of evaluation step

Failure Modes

  • Legal Hold present: candidate skipped; index notes skippedHeld and basis includes holdId.
  • Window change shrinks retention: re-evaluation moves eligibleAt forward only; it never moves earlier than a prior decision without an explicit administrative re-baseline.
  • Clock skew: eligibleAt never set before now - skew.

Recovery Procedures

  1. On 409 or 412, fetch current {policyId, revision} and re-issue with updated precondition.
  2. When 503/429, back off; evaluation jobs are idempotent by (tenant, policyId, revision, scopeKey).
  3. Use dry-run to assess impact before applying a new revision.

Performance Characteristics

Latency Expectations

  • Partition-sized evaluation p95 ≤ 3 min; small scope ad-hoc p95 ≤ 30 s.

Throughput Limits

  • Evaluator concurrency limited per tenant to protect read/metadata stores (e.g., ≤ 2 concurrent scopes).

Resource Requirements

  • Bounded memory for rule evaluation batches; write-optimized Retention Index with bulk upserts.

Scaling Considerations

  • Batch by partition and time windows; use checkpointing to resume mid-run.
  • Prefer set-based updates (partition-level) when rules are uniform (e.g., 365d global).
  • Emit periodic progress to avoid long silent runs.

Security & Compliance

Authentication

  • OIDC JWT at Gateway; service-to-service credentials for Evaluator ↔ Index.

Authorization

  • Enforce policy:retention.read and policy:retention.evaluate; verify x-tenant-id.

Data Protection

  • Decision basis recorded without copying sensitive payload; only IDs/timestamps stored.

Compliance

  • Forward-only versions: revision monotonically increases; decisions log basis {policyId, ruleId, revision, computedAt} for auditability.
  • Residency honored by running evaluation in-region and by scoping reads.

Monitoring & Observability

Key Metrics

Metric Type Description Alert Threshold
ret_eval_jobs_active gauge Running evaluations > tenant/global cap
ret_candidates_marked_total counter Records marked eligible Trend
ret_eval_duration_seconds histogram Runtime per evaluation p95 > SLO
ret_eval_skipped_held_total counter Skipped due to Legal Hold Spike watch
ret_eval_conflicts_total counter 409/412 occurrences Investigate policy churn

Logging Requirements

  • Structured logs: tenant, policyId, revision, ruleId, scopeKey, marked, skippedHeld, durationMs, errors. No PII.

Distributed Tracing

  • Spans: policy.fetch, eval.scan, eval.batch, index.upsert, event.publish.
  • Attributes: revision, batchSize, marked, skipped.

Health Checks

  • Readiness: Index writable; Policy Service reachable; Event Bus available.
  • Liveness: job queue drains; checkpoints advance.

Operational Procedures

Deployment

  1. Deploy Lifecycle Evaluator workers; register /policy/v1/retention:evaluate.
  2. Seed policies; verify revisioning and on-change triggers.
  3. Run a dry-run evaluation in staging; verify index shape and events.

Configuration

  • Env: RET_EVAL_BATCH_SIZE, RET_EVAL_MAX_CONCURRENCY, RET_EVAL_CHECKPOINT_TTL, CLOCK_SKEW_SEC.
  • Policy: enforce forward-only revisions; require change justification metadata.

Maintenance

  • Compact Retention Index (drop superseded decisions); rotate event topics per retention.
  • Re-baseline procedures for exceptional policy rollbacks (administrative only).

Troubleshooting

  • High conflicts: educate admins to supply If-Match revision when triggering evaluations.
  • Slow runs: increase batch size carefully; verify index write IOPS; reduce scan scope.
  • Skewed results: check time normalization and partition catalog.

Testing Scenarios

Happy Path Tests

  • Evaluate 24h scope → candidates marked with correct eligibleAt and decisionBasis.
  • Policy change (revision++) triggers incremental re-eval for affected scopes only.

Error Path Tests

  • 400 for invalid windows/rules; 404 for unknown policy; 409/412 for revision issues.
  • 422 when policy violates residency/edition.
  • 429/503 lead to compliant backoff and eventual success.

Performance Tests

  • Partition evaluation p95 ≤ 3 min; throughput meets index SLOs.
  • Checkpoint resume after induced worker restart.

Security Tests

  • RBAC scopes enforced; cross-tenant isolation verified.
  • Logs contain decision basis without payload leakage.

Internal References

  • Legal Hold Processing Flow
  • Data Lifecycle (Purge Execution) Flow
  • Policy Change Propagation Flow

External References

  • RFC 7807 (Problem Details)
  • W3C Trace Context

Appendices

A. Example Problem+JSON (revision conflict)

{
  "type": "urn:connectsoft:errors/policy/revision.conflict",
  "title": "Policy revision conflict",
  "status": 409,
  "detail": "Requested evaluation with revision 5, current is 6.",
  "traceId": "9f0c1d2e3a4b5c6d...",
  "errors": [{"pointer": "/revision", "reason": "stale"}]
}

B. Decision Basis (concept)

{
  "policyId": "ret_001",
  "ruleId": "r_login_365d",
  "revision": 6,
  "formula": "eligibleAt = createdAt + P365D",
  "inputs": {"createdAt":"2025-10-21T11:00:00Z"},
  "output": {"eligibleAt":"2026-10-21T11:00:00Z"}
}
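
The formula in the decision basis can be reproduced directly; this sketch handles only the day-based `PnD` duration form shown here, while the real rules engine supports the full policy window grammar:

```python
import re
from datetime import datetime, timedelta

def eligible_at(created_at: str, window: str) -> str:
    """Compute eligibleAt = createdAt + window for PnD (days) durations."""
    m = re.fullmatch(r"P(\d+)D", window)
    if m is None:
        raise ValueError(f"unsupported window: {window}")
    created = datetime.fromisoformat(created_at.replace("Z", "+00:00"))
    return (created + timedelta(days=int(m.group(1)))).strftime("%Y-%m-%dT%H:%M:%SZ")
```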

Legal Hold Processing Flow

Applies, updates, and releases Legal Holds against tenant data. Resolves scope unambiguously, materializes a holdSnapshot (with forward-only revision), matches target records/partitions, marks them OnHold (purge-block), and emits lifecycle events. Releasing a hold clears blockers and triggers dependent re-evaluations.


Overview

Purpose: Provide a defensible mechanism to place and release Legal Holds so that covered records are preserved and exports can reference verifiable hold decisions.
Scope: Create/apply/update/release flows, scope resolution and match indexing, purge-block signaling, event emission, and concurrency controls. Excludes exporting data under hold (see Legal Hold Export Flow).
Context: The LegalHold Service is authoritative for hold definitions and state. It interacts with Read/Projection Stores to match data, the Lifecycle/Purge subsystem to block deletion, and Policy/Retention to re-evaluate eligibility.
Key Participants:

  • Legal Team / Client
  • API Gateway
  • LegalHold Service (registry, matcher, state machine)
  • Read/Projection Store (query targets by scope)
  • Hold Index / Purge Guard (flags OnHold)
  • Event Bus (LegalHold.Applied|Updated|Released)

Prerequisites

System Requirements

  • API Gateway with TLS and JWT validation
  • LegalHold Service deployed with access to Read/Projection Store and Hold Index
  • Event Bus topics configured (LegalHold.*)
  • Clock/time normalization to UTC; deterministic scope resolvers

Business Requirements

  • Tenant enabled for Legal Hold; roles and approvals defined
  • Case management identifiers available (caseId)
  • Residency constraints and retention policies configured

Performance Requirements

  • p95 apply time for typical scopes ≤ 60 s (to first confirmation)
  • Hold matching throughput sized to tenant volume (seek pagination)
  • Low-latency purge-block propagation (seconds, not minutes)

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    actor Legal as Legal Team
    participant GW as API Gateway
    participant LHS as LegalHold Service
    participant RD as Read/Projection Store
    participant HIX as Hold Index / Purge Guard
    participant BUS as Event Bus

    Legal->>GW: POST /legal-hold/v1/holds {tenant, scope, caseId, reason, expiresAt?}
    GW->>LHS: Create+Apply request (authN/Z, x-tenant-id, traceparent)
    LHS->>LHS: Validate scope → normalize ResourceRef/time boundaries
    LHS->>RD: Enumerate targets via cursor (tenant, scope)
    loop Batched match
        RD-->>LHS: Batch of record/partition keys
        LHS->>HIX: Mark OnHold {keys..., holdId, revision}
    end
    LHS->>LHS: Persist holdSnapshot {id, revision, scope, decidedAt}
    LHS->>BUS: Publish LegalHold.Applied {holdId, tenant, revision, scope}
    LHS-->>GW: 201 Created {holdId, status:"Active", snapshot}

Alternative Paths

  • Preview: mode=preview returns counts and sample keys without applying.
  • Incremental expand: PATCH /holds/{id} with additional scope → revision++, match only delta.
  • Auto-expiry: expiresAt schedules automatic Release at timestamp.
  • Partition-level hold: mark append partitions instead of individual records for large scopes.

Error Paths

sequenceDiagram
    actor Legal
    participant GW as API Gateway
    participant LHS as LegalHold Service

    Legal->>GW: POST /legal-hold/v1/holds {invalid scope}
    alt Bad request
        GW-->>Legal: 400 Bad Request (Problem+JSON)
    else Hold not found (read/update/release)
        GW-->>Legal: 404 Not Found (Problem+JSON)
    else Conflict (apply on already Active, release on Released)
        GW-->>Legal: 409 Conflict (Problem+JSON)
    else Precondition failed (If-Match revision mismatch)
        GW-->>Legal: 412 Precondition Failed (Problem+JSON)
    else Rate limited / dependencies down
        GW-->>Legal: 429/503 (Retry-After)
    end

Request/Response Specifications

Input Requirements

| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| Create/Apply | POST /legal-hold/v1/holds | Y | Create + apply a hold | JSON body |
| Authorization | header | Y | Bearer <JWT> | Valid, not expired |
| x-tenant-id | header | Y | Tenant scope | Must match body.tenant |
| traceparent | header | O | W3C trace context | 55-char |
| tenant | string | Y | Target tenant | ^[A-Za-z0-9._-]{1,128}$ |
| caseId | string | Y | Legal case identifier | ≤ 128 chars |
| reason | string | Y | Business/legal justification | ≤ 512 chars |
| scope | object | Y | Resource/time predicates | Normalized server-side |
| expiresAt | timestamp | O | Auto-release time (UTC) | Must be in future |
| mode | enum | O | apply (default) \| preview | |

Update (expand/restrict)

| Field | Type | Req | Description |
|---|---|---|---|
| PATCH /legal-hold/v1/holds/{holdId} | path | Y | Modify scope (forward-only*); requires If-Match: <rev> |
| Body: {scopeDelta} | json | Y | Additive change preferred; shrink requires admin override |

*Forward-only scope changes recommended; shrinking scope is exceptional and audited.

Release

| Field | Type | Req | Description |
|---|---|---|---|
| POST /legal-hold/v1/holds/{holdId}:release | path | Y | Release hold |
| If-Match | header | O | Expected revision; prevents races |
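The If-Match discipline on update/release can be sketched from the client side as a fetch-latest-and-retry loop. A minimal sketch; `fetch_hold` and `release_hold` are hypothetical stand-ins for the HTTP calls, not ATP SDK functions:

```python
def conditional_release(fetch_hold, release_hold, hold_id, max_attempts=3):
    """Release a hold with If-Match; on 412 (stale revision), refetch and retry."""
    for _ in range(max_attempts):
        snapshot = fetch_hold(hold_id)                      # GET /holds/{id}
        status, body = release_hold(hold_id, if_match=snapshot["revision"])
        if status == 200:
            return body                                     # {"status": "Released", ...}
        if status != 412:                                   # not a revision race
            raise RuntimeError(f"release failed with {status}")
    raise RuntimeError("unresolved revision conflicts")
```

On 409 (already Released) the loop raises immediately, which matches the "align state, then retry" guidance in the error table.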

Output Specifications

Create — 201 Created

| Field | Type | Description |
|---|---|---|
| holdId | string | Hold identifier (ULID/GUID) |
| status | enum | Active |
| revision | int | Current revision |
| snapshot | object | {scope, decidedAt, caseId, reason, expiresAt?} |
| stats | object | {matched, partitions, partial?:bool} |

Release — 200 OK

| Field | Type | Description |
|---|---|---|
| holdId | string | Id |
| status | enum | Released |
| releasedAt | timestamp | ISO-8601 UTC |
| revision | int | Final revision |

Example Payloads

// Create & apply
{
  "tenant": "acme",
  "caseId": "CASE-2025-099",
  "reason": "Regulatory investigation",
  "scope": {
    "time": {"from": "2025-09-01T00:00:00Z"},
    "resourceTypes": ["Iam.User","Case.File"],
    "actions": ["Create","Update"]
  },
  "expiresAt": "2026-03-01T00:00:00Z"
}
// Release
{
  "note": "Case concluded; hold lifted by order #1234"
}

Error Handling

Error Scenarios

| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Malformed scope; invalid expiresAt; missing caseId/reason | Correct request | No retry until fixed |
| 401 | Missing/invalid JWT | Acquire valid token | Retry after renewal |
| 403 | Missing audit:legalhold.apply \| update \| release | Request proper role/scope | |
| 404 | Unknown holdId or tenant route not found | Verify ids/tenant | |
| 409 | Apply on already Active; Release on Released; concurrent modify | Align state (PATCH or fetch latest) | Retry after fix |
| 412 | If-Match revision mismatch | Fetch latest snapshot → retry | Conditional retry |
| 422 | Scope cannot be resolved unambiguously | Adjust scope; use preview | |
| 429 | Per-tenant/global rate limit | Honor Retry-After | Backoff + jitter |
| 503 | Read/Index/Event bus unavailable | Wait for recovery | Idempotent retry (server de-dupes) |

Failure Modes

  • Partial match (timeouts/limits): partial=true in stats; matcher continues asynchronously until complete.
  • Residency boundary: cross-region scope split into regional sub-holds to remain compliant.
  • Clock skew: time predicates normalized to UTC; inclusive start, exclusive end by convention.

Recovery Procedures

  1. On 412/409, retrieve latest {holdId, revision, status} and re-issue with correct preconditions.
  2. For partial matches, monitor progress events or query stats until partial=false.
  3. If 503/429, back off; the apply operation is idempotent by (tenant, caseId, normalizedScopeHash).
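The idempotency triple in step 3 depends on a deterministic scope hash. The canonicalization rules below (sorted keys, sorted lists, compact separators) are an assumption — ATP's normalizer is server-side — but the key shape matches (tenant, caseId, normalizedScopeHash):

```python
import hashlib
import json

def normalized_scope_hash(scope: dict) -> str:
    """Hash a canonical form of the scope so semantically equal scopes
    (reordered keys or list items) yield the same digest."""
    def canon(node):
        if isinstance(node, dict):
            return {k: canon(node[k]) for k in sorted(node)}
        if isinstance(node, list):
            return sorted((canon(v) for v in node),
                          key=lambda v: json.dumps(v, sort_keys=True))
        return node
    blob = json.dumps(canon(scope), separators=(",", ":"), sort_keys=True)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()

def idempotency_key(tenant: str, case_id: str, scope: dict) -> str:
    """The (tenant, caseId, normalizedScopeHash) de-duplication triple."""
    return f"{tenant}:{case_id}:{normalized_scope_hash(scope)}"
```

Two requests that differ only in field or list ordering then de-duplicate to the same apply operation.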

Performance Characteristics

Latency Expectations

  • Apply confirmation p95 ≤ 60 s for typical scopes; full match completion may continue async.

Throughput Limits

  • Matcher QPS bounded by read replica capacity; batch size tuned per tenant.

Resource Requirements

  • CPU for scope normalization; memory for batching keys; I/O for index updates.

Scaling Considerations

  • Use seek pagination and partition-aware queries.
  • Mark partitions OnHold when feasible for large contiguous ranges.
  • Backpressure from Hold Index updates reduces batch size automatically.
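The automatic batch-size reduction can be as simple as an AIMD (additive-increase / multiplicative-decrease) controller; a minimal sketch with illustrative thresholds:

```python
def next_batch_size(current: int, pushback: bool,
                    floor: int = 50, ceiling: int = 5000, step: int = 100) -> int:
    """Halve the match batch on Hold Index pushback, otherwise grow by a
    fixed step, clamped to [floor, ceiling]."""
    if pushback:
        return max(floor, current // 2)
    return min(ceiling, current + step)
```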

Security & Compliance

Authentication

  • OIDC JWT at Gateway; optional mTLS service-to-service.

Authorization

  • Require audit:legalhold.apply, audit:legalhold.update, audit:legalhold.release.
  • Enforce RLS by tenantId; verify x-tenant-id.

Data Protection

  • Store minimal decision basis (ids/timestamps); do not copy payloads.
  • All hold state transitions are audited with actor and purpose-of-use.

Compliance

  • Holds block purge immediately via Purge Guard; Retention Evaluator records skippedHeld.
  • holdSnapshot (id, revision, scope, decidedAt) provides chain-of-custody.

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| holds_active | gauge | Active holds per tenant | Sudden spikes |
| hold_applied_total | counter | Holds applied | |
| hold_released_total | counter | Holds released | |
| hold_match_duration_seconds | histogram | Matching latency | p95 > SLO |
| purge_block_signals_total | counter | Purge-block updates sent | Drop indicates risk |

Logging Requirements

  • Structured logs: tenant, holdId, revision, caseId, scopeHash, matched, partial, actor, reason. No PII.

Distributed Tracing

  • Spans: hold.apply, match.scan, index.mark, hold.release, event.publish.
  • Attributes: scopeHash, batchSize, matched, partial.

Health Checks

  • Readiness: Read/Projection and Hold Index reachable; Event Bus available.
  • Liveness: matcher queue drains; no stuck Applying holds beyond timeout.

Operational Procedures

Deployment

  1. Deploy LegalHold Service and register /legal-hold/v1/* routes.
  2. Initialize Hold Index and Purge Guard hooks.
  3. Verify preview/apply/release in staging with synthetic scopes.

Configuration

  • Env: HOLD_MATCH_BATCH, HOLD_APPLY_TIMEOUT, HOLD_MAX_SCOPE_SIZE, RESIDENCY_MODE.
  • Policy: require reason and caseId; optional expiresAt auto-release.

Maintenance

  • Compact Hold Index (drop released markers no longer needed).
  • Rotate webhook/signing keys if callbacks to external systems are used.

Troubleshooting

  • High partial rates → increase batch size cautiously; check read replica health.
  • Frequent 409/412 → educate clients to use If-Match and fetch-latest patterns.
  • Purge still running on held data → verify Purge Guard subscription and index state.

Testing Scenarios

Happy Path Tests

  • Apply hold with resource/time scope → holds_active increments; purge-block engaged.
  • Update scope (additive) → revision++, only delta matched; events emitted.
  • Release hold → blockers cleared; LegalHold.Released published.

Error Path Tests

  • 400 for malformed scope; 404 for unknown holdId; 409 for invalid state transitions; 412 for revision mismatch.
  • 429/503 cause compliant backoff; operation remains idempotent.

Performance Tests

  • Matching completes within SLO for typical tenants; no read replica saturation.
  • Purge-block propagation latency within seconds.

Security Tests

  • RBAC enforced; cross-tenant access blocked.
  • Audit log contains actor, purpose, scope hash; no PII leakage.

Internal References

External References

  • RFC 7807 (Problem Details)
  • W3C Trace Context

Appendices

A. Example Problem+JSON (invalid scope)

{
  "type": "urn:connectsoft:errors/legalhold/scope.invalid",
  "title": "Invalid legal hold scope",
  "status": 400,
  "detail": "Scope must include at least one of resourceTypes or actors, and a bounded time window.",
  "traceId": "9f0c1d2e3a4b5c6d...",
  "errors": [
    {"pointer": "/scope/time", "reason": "missing-or-unbounded"}
  ]
}

B. Hold Snapshot (concept)

{
  "id": "lh_2025_001",
  "tenant": "acme",
  "revision": 3,
  "status": "Active",
  "caseId": "CASE-2025-099",
  "reason": "Regulatory investigation",
  "scope": { "resourceTypes": ["Iam.User"], "time": {"from":"2025-09-01T00:00:00Z"} },
  "decidedAt": "2025-10-22T11:45:10Z",
  "expiresAt": "2026-03-01T00:00:00Z"
}

Data Redaction Flow (Read)

Applies policy-driven masking to query results at read time. The Query Service consults the Redaction Service to enforce a requested profile (Safe, Support, Investigator, Raw), optionally validates a Just-In-Time (JIT) unmask approval, and returns transformed results. All unmask attempts and approvals are audited.


Overview

Purpose: Ensure returned data complies with privacy policy via profile-based masking, with tightly controlled JIT unmask for break-glass scenarios.
Scope: Profile selection, purpose-of-use capture, redaction rules execution, JIT approval verification, response annotation, and auditing. Excludes write-time classification (see Validation & Classification Flow).
Context: Sits on the Query path between Read Models/Search and clients. Uses Data Classification from the model and Redaction Rules (mask/hash/tokenize/drop).
Key Participants:

  • Client (consumer of audit data)
  • API Gateway
  • Query Service (fetch, orchestrate)
  • Redaction Service (policy engine, transform)
  • Approval Service (JIT unmask token issuance/validation)
  • Audit/Event Bus (log read/unmask decisions)

Prerequisites

System Requirements

  • Gateway with TLS + JWT validation
  • Query Service can call Redaction & Approval Services
  • Read Models/Search indices annotated with DataClass metadata
  • Clock sync for JIT token TTL enforcement

Business Requirements

  • Redaction profiles & policy configured per tenant
  • Purpose-of-use taxonomy and RBAC scopes defined
  • Approver roster & workflow for JIT unmask (with SLA)

Performance Requirements

  • p95 redaction overhead ≤ 15 ms per page (server-side)
  • JIT token verification p95 ≤ 50 ms
  • Budget for page sizes (e.g., ≤ 200 records) to maintain SLOs

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    actor C as Client
    participant GW as API Gateway
    participant Q as Query Service
    participant R as Redaction Service
    participant A as Approval Service
    participant AUD as Audit/Event Bus

    C->>GW: GET /query/v1/events?filters…<br/>Headers: x-redaction-profile=Support, x-purpose-of-use=SupportOps
    GW->>Q: Forward request (authN/Z, tenant)
    Q->>Q: Fetch page from Read Model / Search
    Q->>R: ApplyProfile(records, profile=Support, tenant, purpose)
    R-->>Q: Redacted(records, redactionMeta)
    Q->>AUD: Publish Read.Audited {tenant, profile, purpose, actor, resultCount}
    Q-->>GW: 200 OK (masked results + X-Redaction-Profile + X-Watermark)
    GW-->>C: 200 OK

Alternative Paths

  • Investigator profile: broader reveal than Support but still masked for HighlySensitive; requires higher RBAC.
  • Raw profile with JIT: client supplies x-jit-approval-token; Approval Service validates token → Redaction Service bypasses selected fields (field-scoped unmask).
  • Field-scoped override: request includes fields=… to minimize exposure; redaction runs only on returned fields.

Error Paths

sequenceDiagram
    actor C as Client
    participant GW as API Gateway
    participant Q as Query Service
    participant A as Approval Service

    C->>GW: GET … x-redaction-profile=Raw, x-jit-approval-token=abc
    GW->>Q: Forward
    Q->>A: ValidateToken(abc)
    alt Token invalid/expired/not-for-tenant
        A-->>Q: 403 Forbidden (reason)
        Q-->>GW: 403 Problem+JSON
        GW-->>C: 403 Forbidden
    else Bad profile or params
        Q-->>GW: 400 Bad Request (Problem+JSON)
        GW-->>C: 400
    else Record id requested but not found
        Q-->>GW: 404 Not Found (Problem+JSON)
        GW-->>C: 404
    else Conflict (token already consumed / different subject)
        Q-->>GW: 409 Conflict (Problem+JSON)
        GW-->>C: 409
    end

Request/Response Specifications

Input Requirements

| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| Method/Path | GET /query/v1/events | Y | Search/scroll timeline | Query params allowlisted |
| Authorization | header | Y | Bearer <JWT> | Valid, not expired |
| x-tenant-id | header | Y | Tenant scope | Matches JWT/route |
| x-redaction-profile | header | O | Safe (default) \| Support \| Investigator \| Raw | Profile allowlist |
| x-purpose-of-use | header | Y | Business purpose taxonomy | Non-empty, allowlist |
| x-jit-approval-token | header | O | Break-glass token for unmask | JIT policy validates |
| traceparent | header | O | W3C trace context | 55-char |
| fields | query | O | Comma list of fields to return | Minimization applied |
| page.after | query | O | Seek cursor | Opaque; server-issued |
| limit | query | O | Page size | 1–200, default 100 |

Output Specifications

200 OK

| Field | Type | Description | Notes |
|---|---|---|---|
| items[] | array | Records with masking applied | See examples |
| redactionMeta | object | {profile, rulesApplied[], jit:{used, reason?}} | Optional when Safe |
| watermark | string | Projection snapshot time | Also in header |

Headers

  • X-Redaction-Profile: effective profile
  • X-Purpose-Of-Use: echoed purpose
  • X-Watermark: ISO-8601 UTC projection watermark

Example Payloads

// Request (Support profile)
GET /query/v1/events?resourceType=Payment&from=2025-10-01T00:00:00Z
x-redaction-profile: Support
x-purpose-of-use: SupportOps
// Response (masked)
{
  "items": [
    {
      "id": "01JF…",
      "actor": {"id":"u_123","displayName":"A**** T****"},
      "resource": {"type":"Payment","id":"pay_789"},
      "action": "Create",
      "createdAt": "2025-10-22T11:01:22Z",
      "deltas": {
        "after": {
          "cardLast4": "****",
          "email": "a***@e***.com",
          "amount": 1299
        }
      }
    }
  ],
  "redactionMeta": {
    "profile": "Support",
    "rulesApplied": [
      {"field":"deltas.after.cardLast4","rule":"mask-last4"},
      {"field":"deltas.after.cardBin","rule":"drop"},
      {"field":"deltas.after.email","rule":"mask-email"}
    ]
  },
  "watermark": "2025-10-22T11:05:00Z"
}
// Raw with JIT token (field-scoped unmask)
GET /query/v1/events/{id}
x-redaction-profile: Raw
x-jit-approval-token: jt_01ABC…
x-purpose-of-use: IncidentResponse
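The masked response above can be produced by a small interpreter over rulesApplied. A minimal sketch of the mask-last4, mask-email, and drop rules; the helpers and dotted-path convention are illustrative, not ATP APIs:

```python
def _get_path(obj, path):
    """Read a dotted path like 'deltas.after.email' from a nested dict."""
    for part in path.split("."):
        if not isinstance(obj, dict) or part not in obj:
            return None
        obj = obj[part]
    return obj

def _set_path(obj, path, value=None, drop=False):
    """Replace (or remove, when drop=True) the field at a dotted path."""
    parts = path.split(".")
    for part in parts[:-1]:
        obj = obj.get(part)
        if not isinstance(obj, dict):
            return
    if drop:
        obj.pop(parts[-1], None)
    elif parts[-1] in obj:
        obj[parts[-1]] = value

def mask_email(v: str) -> str:
    """'alice@example.com' -> 'a***@e***.com', as in the example response."""
    local, domain = v.split("@", 1)
    host, tld = domain.rsplit(".", 1)
    return f"{local[0]}***@{host[0]}***.{tld}"

RULES = {"mask-last4": lambda v: "****", "mask-email": mask_email}

def apply_rules(record: dict, rules_applied: list) -> dict:
    for r in rules_applied:
        field, rule = r["field"], r["rule"]
        if rule == "drop":
            _set_path(record, field, drop=True)
        else:
            value = _get_path(record, field)
            if value is not None:
                _set_path(record, field, RULES[rule](value))
    return record
```

Note that "drop" removes the field entirely, while the mask rules rewrite it in place.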

Error Handling

Error Scenarios

| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Unsupported profile; invalid limit/fields; bad time filters | Correct request | No retry until fixed |
| 401 | Missing/invalid JWT | Acquire valid token | Retry after renewal |
| 403 | Insufficient RBAC for profile; JIT token invalid/expired; tenant mismatch | Request proper scope or new JIT approval | |
| 404 | Requested record id not found | Verify id/tenant | |
| 409 | JIT token subject mismatch or already consumed | Obtain a fresh token | |
| 422 | Purpose-of-use missing/invalid; policy disallows Raw for tenant | Fix usage/policy | |
| 429 | Rate limited for sensitive profiles | Honor Retry-After | Backoff + jitter |
| 503 | Redaction/Approval service unavailable | Wait for recovery | Idempotent retry (re-run query) |

Failure Modes

  • Partial redaction (missing DataClass metadata): default to most restrictive (mask/drop) and include warning in redactionMeta.
  • Policy change mid-request: response includes X-Policy-Revision-Used; clients re-issue if needed.

Recovery Procedures

  1. For 403/409, request/refresh JIT approval; ensure subject/resource matches token scope.
  2. On 503/429, back off; queries are safe to retry with same cursor.
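The "back off" step (and the backoff + jitter strategy in the error table) is commonly implemented as full-jitter exponential backoff; a minimal sketch with illustrative defaults:

```python
import random

def backoff_delays(attempts: int = 5, base: float = 0.5, cap: float = 30.0):
    """Full-jitter exponential backoff: the i-th delay is drawn uniformly
    from [0, min(cap, base * 2**i)] seconds."""
    for i in range(attempts):
        yield random.uniform(0.0, min(cap, base * (2 ** i)))
```

Because queries are safe to retry with the same cursor, each delay can simply precede a re-issue of the identical request.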

Performance Characteristics

Latency Expectations

  • Redaction transform p95 ≤ 15 ms/page; JIT verification p95 ≤ 50 ms.

Throughput Limits

  • Sensitive profiles (Investigator, Raw) may be throttled per tenant (token bucket).

Resource Requirements

  • CPU-bound transforms; memory proportional to page size; minimal I/O overhead.

Scaling Considerations

  • Cache compiled redaction plans per {profile, schemaVersion}.
  • Prefer field projection (fields=…) to reduce work and exposure.
  • Co-locate Redaction Service with Query Service to minimize RPC latency.

Security & Compliance

Authentication

  • OIDC JWT at Gateway; service credentials between Query ↔ Redaction/Approval.

Authorization

  • RBAC scopes per profile (e.g., audit:read.support, audit:read.investigator, audit:read.raw).
  • Enforce tenant RLS; verify x-tenant-id.

Data Protection

  • No raw PII in logs; only masked samples and rule stats.
  • JIT tokens are short-lived, single-use, audience- and subject-scoped; signed & time-bounded.
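These constraints (audience/subject scoping, time bounds, single use) can be checked mechanically. A toy sketch using an HMAC signature and an in-memory consumed-token set — the real Approval Service uses its own signing scheme and durable storage; field names follow the JIT token concept in Appendix B:

```python
import hashlib
import hmac
import json
from datetime import datetime, timezone

_consumed = set()  # single-use ledger; a real deployment persists this

def _ts(s):
    return datetime.fromisoformat(s.replace("Z", "+00:00"))

def validate_jit(token, *, secret, tenant, audience, subject, now=None):
    """Raise PermissionError unless the token satisfies every constraint."""
    now = now or datetime.now(timezone.utc)
    body = {k: v for k, v in token.items() if k != "sig"}
    expected = hmac.new(secret, json.dumps(body, sort_keys=True).encode(),
                        hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, token["sig"]):
        raise PermissionError("bad signature")               # -> 403
    if token["tenant"] != tenant or token["aud"] != audience:
        raise PermissionError("tenant/audience mismatch")    # -> 403
    if token["subject"] != subject:
        raise PermissionError("subject mismatch")            # -> 409
    if not (_ts(token["nbf"]) <= now < _ts(token["exp"])):
        raise PermissionError("outside validity window")     # -> 403
    if token["jitId"] in _consumed:
        raise PermissionError("token already consumed")      # -> 409
    _consumed.add(token["jitId"])
```

The comments map each rejection onto the HTTP codes in the error table (403 for invalid/expired, 409 for subject mismatch or reuse).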

Compliance

  • All unmask uses are audited with actor, purpose, scope, token id, and fields revealed.
  • Profiles & rule sets derived from tenant policy; revision id echoed in responses.

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| redaction_requests_total | counter | Redaction calls by profile | Sudden spikes |
| redaction_latency_ms | histogram | Transform latency | p95 > 15 ms |
| jit_token_validations_total | counter | Approval checks | Track failures |
| jit_validation_latency_ms | histogram | JIT check latency | p95 > 50 ms |
| unmask_events_total | counter | Successful JIT unmask | Unusual growth |

Logging Requirements

  • Structured logs: tenant, profile, purpose, actorId, resultCount, policyRevision, jit.used, fieldsRevealed[] (names only). No values.

Distributed Tracing

  • Spans: query.fetch, redaction.apply, approval.validate.
  • Attributes: profile, purpose, maskedFieldsCount, jitUsed.

Health Checks

  • Readiness: Redaction & Approval endpoints reachable; policy cache warm.
  • Liveness: transform queue drains; token cache not stale.

Operational Procedures

Deployment

  1. Deploy Redaction & Approval Services; enable headers in Gateway.
  2. Prime policy/profile caches; validate with synthetic records.

Configuration

  • Env: REDACTION_DEFAULT_PROFILE, JIT_TTL_SEC, JIT_AUDIENCE, PROFILE_RBAC_MAP, SENSITIVE_RATE_LIMITS.
  • Policy: map DataClass → rule (mask/hash/tokenize/drop) per profile.

Maintenance

  • Rotate signing keys for JIT tokens; tune rate limits by tenant.
  • Review unmask audit reports periodically with compliance.

Troubleshooting

  • Latency regressions → inspect rule plan caching, page size, co-location.
  • Frequent 403/409 → check token issuance workflow and subject scoping.
  • Unexpected reveals → verify policy revision and RBAC mapping.

Testing Scenarios

Happy Path Tests

  • Safe returns masked payload per policy with correct redactionMeta.
  • Support reveals operational fields but masks HighlySensitive.
  • Raw with valid JIT token reveals requested fields only; audit event emitted.

Error Path Tests

  • 400 for invalid profile; 422 for missing/invalid purpose-of-use.
  • 403/409 for bad/consumed JIT token; 404 for missing record id.
  • 429/503 result in compliant backoff and successful retry.

Performance Tests

  • p95 redaction ≤ 15 ms for 100-record pages.
  • JIT validation ≤ 50 ms p95.

Security Tests

  • RBAC enforced per profile; cross-tenant blocked.
  • Logs exclude PII values; unmask audited with token id.

Internal References

External References

  • RFC 7807 (Problem Details)
  • W3C Trace Context

Appendices

A. Example Problem+JSON (invalid profile)

{
  "type": "urn:connectsoft:errors/redaction/profile.invalid",
  "title": "Unsupported redaction profile",
  "status": 400,
  "detail": "Profile 'Debug' is not enabled for tenant 'acme'.",
  "traceId": "9f0c1d2e3a4b5c6d...",
  "errors": [{"pointer": "x-redaction-profile", "reason": "unsupported"}]
}

B. JIT Token (concept)

{
  "jitId": "jt_01ABC…",
  "tenant": "acme",
  "subject": {"type":"Payment","id":"pay_789"},
  "fields": ["deltas.after.email","actor.displayName"],
  "purpose": "IncidentResponse",
  "aud": "audit-read",
  "nbf": "2025-10-22T11:00:00Z",
  "exp": "2025-10-22T11:10:00Z",
  "sig": "MEQCI…"
}

Compliance Audit Flow

Generates a defensible compliance report by collecting evidence (records, lifecycle transitions, retention/legal-hold decisions, and integrity proofs), independently verifying tamper-evidence, and assembling a signed report artifact with full end-to-end traceability.


Overview

Purpose: Produce an auditable report that demonstrates data integrity, lifecycle adherence, and policy compliance over a defined scope and period.
Scope: Audit job creation, evidence collection, integrity verification (Merkle/Signatures), control checks (retention, legal hold, redaction on read), report assembly/signing, delivery, and audit of the audit. Excludes exporting large datasets (see Export flows) and policy authoring.
Context: Orchestrated by Audit Service; reads from Read Models/Indices, Lifecycle/Retention Index, Legal Hold, and Integrity Service; produces a signed Compliance Report and optional Evidence Bundle.
Key Participants:

  • Auditor / Compliance Client
  • API Gateway
  • Audit Service (orchestrator, verifier, report builder)
  • Query Service / Read Store (records, timelines)
  • Integrity Service (Merkle & signatures verification)
  • Policy/LegalHold/Retention services (decisions & states)
  • Delivery Backend (report/evidence URLs)
  • Webhook Receiver (optional callbacks)

Prerequisites

System Requirements

  • API Gateway with TLS and JWT validation
  • Audit Service with access to Read Store, Integrity, Policy, LegalHold, Retention Index
  • Object storage for report artifacts and optional evidence bundle
  • KMS/HSM configured for report signing (optional but recommended)

Business Requirements

  • Tenant compliance profile defined (e.g., GDPR/HIPAA/SOC2 control set)
  • Purpose-of-use and auditor role(s) configured
  • Time-bound audit scope agreed (from/to, resources, actors)

Performance Requirements

  • p95 time-to-summary ≤ 60 s for typical 24–48h windows
  • Evidence sampling and cap thresholds configured to avoid oversize bundles
  • Parallel verification workers sized to volume

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    actor AUD as Auditor
    participant GW as API Gateway
    participant AS as Audit Service
    participant Q as Query Service / Read Store
    participant INT as Integrity Service
    participant POL as Policy/Retention/LegalHold
    participant OBJ as Delivery Backend
    participant WH as Webhook (optional)

    AUD->>GW: POST /compliance/v1/audits {tenant, scope, frameworks, options{verifyIntegrity, includeEvidence}}
    GW->>AS: Create audit job (authN/Z, x-tenant-id, traceparent)
    AS->>Q: Collect evidence set (records, lifecycle states, decisions)
    AS->>POL: Fetch decisions (retention elig., legal holds, policy revisions)
    AS->>INT: Verify integrity (Merkle chain, signatures, sample leaves)
    INT-->>AS: Verification results {ok, failures[], merkleRoot, keyIds}
    AS->>AS: Compile control checks + traceability map
    AS->>OBJ: PUT report.pdf/json + (optional) evidence.zip
    AS-->>GW: 202 Accepted {auditId, status:"Queued"}
    alt webhook configured
        AS->>WH: POST Compliance.ReportReady {auditId, reportUrl, summary}
    end

Alternative Paths

  • Lightweight attest-only: verifyIntegrity=true with no evidence bundle; report includes verification transcript and pointers.
  • Delta audit: sinceAuditId to compare changes between two audits.
  • Framework-specific: frameworks=["SOC2"] limits control set and sections rendered.

Error Paths

sequenceDiagram
    actor AUD as Auditor
    participant GW as API Gateway
    participant AS as Audit Service

    AUD->>GW: POST /compliance/v1/audits {malformed}
    alt 400 Bad Request
        GW-->>AUD: 400 Problem+JSON
    else 404 Not Found (tenant/route/auditId)
        GW-->>AUD: 404 Problem+JSON
    else 409 Conflict (modify running audit / duplicate request-id)
        GW-->>AUD: 409 Problem+JSON
    else 429/503 Backpressure/Dependency down
        GW-->>AUD: 429/503 Problem+JSON (+Retry-After)
    end

Request/Response Specifications

Input Requirements

| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| Method/Path | POST /compliance/v1/audits | Y | Create a compliance audit job | JSON body |
| Authorization | header | Y | Bearer <JWT> | Valid, not expired |
| x-tenant-id | header | Y | Tenant scope | Matches body.tenant |
| traceparent | header | O | W3C trace context | 55-char |
| tenant | string | Y | Target tenant | ^[A-Za-z0-9._-]{1,128}$ |
| scope | object | Y | {time:{from,to}, resourceTypes?, actors?} | UTC ISO-8601, bounded |
| frameworks | array | O | ["GDPR","HIPAA","SOC2"] | Allowlist |
| options.verifyIntegrity | bool | O | Run integrity verification | Default: true |
| options.includeEvidence | enum | O | none \| sampled \| full | |
| options.sampleRate | number | O | 0–1 for sampled proofs | Bounds checked |
| webhook.url/secretId | string | O | Completion callback + HMAC | HTTPS + known key |
| idempotency-key | header | O | De-duplicate create | ≤ 128 chars |

Control & Status

  • GET /compliance/v1/audits/{auditId}
  • POST /compliance/v1/audits/{auditId}:cancel
  • GET /compliance/v1/audits/{auditId}/report (redirect/URL)
  • GET /compliance/v1/audits/{auditId}/evidence (if produced)
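A client typically polls the status endpoint until a terminal state before fetching the report. A minimal sketch; `get_status` is a hypothetical stand-in for the GET call:

```python
import time

TERMINAL_STATES = {"Completed", "Failed", "Canceled"}

def wait_for_audit(get_status, audit_id, poll_seconds=5.0, timeout_seconds=600.0):
    """Poll GET /compliance/v1/audits/{auditId} until a terminal state."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status(audit_id)   # -> {"auditId": ..., "status": ..., ...}
        if status["status"] in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"audit {audit_id} still {status['status']}")
        time.sleep(poll_seconds)
```

When a webhook is configured, the Compliance.ReportReady callback replaces polling; this loop is the fallback.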

Output Specifications

Create — 202 Accepted

| Field | Type | Description |
|---|---|---|
| auditId | string | Operation id (ULID/GUID) |
| status | enum | Queued \| Collecting \| Verifying \| Assembling \| Completed \| Failed \| Canceled |
| summaryUrl | url? | Interim human-readable status |
| reportUrl | url? | Set when ready |

Status — 200 OK

| Field | Type | Description |
|---|---|---|
| auditId | string | Identifier |
| status | enum | Terminal or running state |
| counts | object | {records, proofsChecked, holds, eligible} |
| verifications | object | {merkleRoot, keyIds[], ok, failures[]} |
| reportUrl / evidenceUrl | url? | Delivery |

Report (concept outline)

  • Executive Summary (scope, date range, frameworks)
  • Data Integrity (roots, signatures, verification transcript)
  • Lifecycle & Retention (eligibleAt coverage, purge windows)
  • Legal Holds (active timeline, affected records/partitions)
  • Redaction & Privacy Controls (profiles, sampling of masked fields)
  • Exceptions & Findings (severity, impacted scope)
  • Appendices (inputs, hashes, timestamps, key ids)

Error Handling

Error Scenarios

| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Malformed scope/time window; unsupported framework; invalid sampleRate | Fix request | No retry until corrected |
| 401 | Missing/invalid JWT | Acquire valid token | Retry after renewal |
| 403 | Missing audit:compliance.run or cross-tenant attempt | Request proper scope/role | |
| 404 | Unknown auditId/tenant; route disabled | Verify ids/tenant | |
| 409 | Modify/cancel while running; duplicate idempotency-key | Wait for terminal state or change key | Retry after fix |
| 412 | If-Match mismatch on update/cancel | Fetch latest status and retry | Conditional retry |
| 422 | Evidence size would exceed cap; incompatible options (full with restricted edition) | Adjust options | |
| 429 | Per-tenant/global audit concurrency limit | Honor Retry-After | Backoff + jitter |
| 503 | Read/Integrity/Policy service unavailable | Wait for recovery | Idempotent retry |

Failure Modes

  • Proof sampling too low/high: report flags sampling level; enforce min/max per policy.
  • Key unavailability: signature verification deferred; report marks inconclusive for specific windows with remediation steps.
  • Projection lag: report includes watermark; sections constrained to consistent point-in-time.

Recovery Procedures

  1. Reduce evidence mode to sampled or raise cap via admin policy if 422.
  2. Re-run verification portion when keys/services recover; re-issue report with new signature.
  3. For 409/412, poll latest status, then retry control action.

Performance Characteristics

Latency Expectations

  • Time-to-summary p95 ≤ 60 s for 24–48h windows; full verification depends on scope and sampling.

Throughput Limits

  • Concurrency caps per tenant (e.g., ≤ 2 running audits); global worker pool bounded.

Resource Requirements

  • CPU for hashing/verification; I/O for evidence fetch; memory for report assembly (streamed).

Scaling Considerations

  • Parallelize by time/partition slices; verify proofs in worker pool; stream artifact assembly to object storage.
  • Use seek pagination and limit evidence to sampled mode for very large scopes.
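Parallelizing by time slices amounts to splitting the audit window into contiguous half-open intervals, one per worker. A minimal sketch:

```python
from datetime import datetime, timedelta, timezone

def time_slices(start, end, width):
    """Yield contiguous half-open [from, to) slices covering [start, end),
    so each slice can be collected/verified by an independent worker."""
    cursor = start
    while cursor < end:
        upper = min(cursor + width, end)
        yield cursor, upper
        cursor = upper
```

Half-open slices match the inclusive-start/exclusive-end convention used elsewhere in this document, so no record falls into two slices.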

Security & Compliance

Authentication

  • OIDC JWT at Gateway; optional mTLS for service-to-service.

Authorization

  • Require audit:compliance.run to start; audit:compliance.read to fetch results; strict tenant RLS.

Data Protection

  • Reports/evidence encrypted at rest; presigned URLs short-lived and least-privilege; webhook payloads HMAC-signed.

Compliance

  • Report is signed (JWS/COSE) with kid; includes verification transcript, watermarks, and policy revisions used.
  • All audit actions are themselves audited (actor, purpose, scope, outputs).

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| compliance_audits_active | gauge | Running audits | > tenant/global cap |
| compliance_audit_duration_seconds | histogram | Runtime per audit | p95 > SLO |
| integrity_verifications_total | counter | Proof checks performed | Trend |
| verification_failures_total | counter | Failed proof checks | > 0 sustained |
| report_build_failures_total | counter | Report assembly/sign failures | > 0 |

Logging Requirements

  • Structured logs: tenant, auditId, scopeHash, frameworks[], proofsChecked, failures, watermark, kid. No PII.

Distributed Tracing

  • Spans: audit.collect, policy.fetch, integrity.verify, report.assemble, object.put, webhook.post.
  • Attributes: sampleRate, evidenceMode, bytes, records.

Health Checks

  • Readiness: Read/Integrity/Policy reachable; KMS key loadable.
  • Liveness: job queue draining; no stuck Verifying/Assembling states.

Operational Procedures

Deployment

  1. Deploy Audit Service; expose /compliance/v1/audits routes.
  2. Configure KMS signing keys and buckets for artifacts.
  3. Validate E2E on staging: create → verify → signed report downloadable.

Configuration

  • Env: AUDIT_MAX_CONCURRENCY_PER_TENANT, AUDIT_SAMPLE_RATE_DEFAULT, AUDIT_EVIDENCE_CAP_BYTES, PRESIGN_TTL_SEC, REPORT_SIGNING_KID.
  • Policy: min/max sampling, allowed frameworks per edition.

Maintenance

  • Rotate signing keys; prune expired artifacts; archive reports according to retention.
  • Periodic verification health checks against known-good test datasets.

Troubleshooting

  • Verification failures → inspect key rotation, integrity roots, time window alignment.
  • Large artifacts → switch to sampled mode; extend caps only if justified.
  • Frequent 409/412 → ensure clients poll before modifying audit jobs.

Testing Scenarios

Happy Path Tests

  • Create audit with verifyIntegrity=true, includeEvidence=sampled → signed report produced; verification transcript included.
  • Fetch report/evidence; signature validates with published public key.

Error Path Tests

  • 400 malformed scope; 404 unknown auditId; 409 modify while running.
  • 422 evidence exceeds cap triggers clear guidance; 429/503 backoff works.

Performance Tests

  • p95 time-to-summary ≤ 60 s; verify scaling across parallel slices.
  • Sampled proof checks meet throughput targets.

Security Tests

  • RBAC scopes enforced; presigned URLs expire; webhook HMAC validated.
  • Report signature verifies via JWS/COSE with current kid.

Internal References

External References

  • RFC 7807 (Problem Details)
  • W3C Trace Context
  • JWS (RFC 7515) / COSE (RFC 8152)

Appendices

A. Example Problem+JSON (evidence cap exceeded)

{
  "type": "urn:connectsoft:errors/compliance/evidence.cap.exceeded",
  "title": "Evidence bundle too large",
  "status": 422,
  "detail": "Estimated evidence size 8.4GB exceeds cap 5GB. Use includeEvidence=sampled or narrow scope.",
  "traceId": "9f0c1d2e3a4b5c6d...",
  "errors": [{"pointer": "/options/includeEvidence", "reason": "cap-exceeded"}]
}
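A client receiving this Problem+JSON should branch on the machine-readable type rather than parsing detail. A hedged sketch, assuming the request body shape used elsewhere in this flow (adjust_request and the field handling are illustrative, not a published client API):

```python
import json

# Error type from the appendix example above.
CAP_EXCEEDED = "urn:connectsoft:errors/compliance/evidence.cap.exceeded"

def adjust_request(request: dict, problem_json: str) -> tuple[dict, bool]:
    """On the evidence-cap error, downgrade to sampled evidence and signal a retry.
    Returns (possibly adjusted request, should_retry)."""
    problem = json.loads(problem_json)
    if problem.get("status") == 422 and problem.get("type") == CAP_EXCEEDED:
        fixed = dict(request)
        fixed["options"] = {**request.get("options", {}), "includeEvidence": "sampled"}
        return fixed, True   # retry with sampled evidence, per the error guidance
    return request, False    # not recoverable by this handler
```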

B. Report Verification (outline)

  1. Download report.json and report.sig (or signed PDF).
  2. Verify signature with published JWK/PEM (kid in report header).
  3. Re-run sample integrity proofs listed in the transcript; compare roots.
  4. Confirm watermarks and policy revision ids match tenant records.
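Step 2 above selects the public key by the kid in the report header. A small sketch of extracting that protected header from a compact JWS using only the standard library; the signature check itself still requires a crypto library and the JWK matching the returned kid:

```python
import base64
import json

def jws_header(compact_jws: str) -> dict:
    """Decode the protected header of a compact JWS (header.payload.signature).
    Only header parsing is shown; verifying the signature needs the JWK
    selected by the returned kid and a crypto library."""
    header_b64 = compact_jws.split(".")[0]
    padded = header_b64 + "=" * (-len(header_b64) % 4)  # restore base64url padding
    return json.loads(base64.urlsafe_b64decode(padded))
```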

Integrity Verification Flow

Runs an on-demand proof check for one or more records, validating leaf hash → Merkle path → block/segment root → signature. Produces a per-record evidence report (OK|FAIL|INCONCLUSIVE) and supports degraded mode when some materials (e.g., keys, archived proofs) are unavailable.
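The leaf → path → root portion of this pipeline can be sketched as a Merkle inclusion check. The (sibling, position) path encoding below is an assumption for illustration; ATP's stored proof format may differ, and the root-signature step is separate:

```python
import hashlib
import hmac

def verify_inclusion(leaf_hash: bytes, path: list, expected_root: bytes) -> bool:
    """Recompute the root from a leaf hash and its Merkle audit path.
    Each path entry is (sibling_hash, position) with position in
    {"left", "right"} — an assumed encoding, not ATP's actual format."""
    node = leaf_hash
    for sibling, position in path:
        # Concatenation order depends on which side the sibling sits on.
        node = hashlib.sha256(sibling + node if position == "left" else node + sibling).digest()
    return hmac.compare_digest(node, expected_root)
```

A failed comparison here corresponds to the pathVerify step reporting FAIL in the per-item result.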


Overview

Purpose: Allow clients and auditors to independently verify that returned records are authentic and untampered, using stored proofs and signatures.
Scope: Request intake, materialization of proof inputs (leaf, path, roots, signatures), verification pipeline, degraded-mode policies, report generation, and optional evidence bundle. Excludes integrity creation/sealing (see Integrity Chain flow).
Context: The Integrity Service reads Integrity Store/Evidence Store (paths, roots, manifests) and may call KMS/HSM or use public keys to verify signatures.
Key Participants:

  • Client (verifier)
  • API Gateway
  • Integrity Service (verifier/orchestrator)
  • Evidence Store / Integrity Store (proofs, roots, manifests)
  • KMS/HSM or Key Registry (public keys / verification)
  • Object Storage (optional evidence bundles)

Prerequisites

System Requirements

  • API Gateway with TLS and JWT validation
  • Integrity Service with read access to Integrity/Evidence stores and key registry
  • Object storage bucket for optional per-request evidence bundles
  • Time source synchronized; hash and signature algorithms configured

Business Requirements

  • Tenant integrity policy defines algorithms (e.g., SHA-256, Ed25519) and acceptable degraded modes
  • Retention of proofs/manifests meets verification SLAs
  • Auditing enabled for verification requests

Performance Requirements

  • p95 verification latency ≤ 200 ms for single-record checks (cached proofs)
  • Batch verification throughput meets SLO (e.g., 2k–10k records/s with precomputed paths)
  • Backpressure & rate limits for large batches

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    actor CL as Client
    participant GW as API Gateway
    participant INT as Integrity Service
    participant EVI as Evidence Store / Integrity Store
    participant KMS as KMS/HSM or Key Registry
    participant OBJ as Object Storage (optional)

    CL->>GW: POST /integrity/v1/verify {tenant, items[], mode: "full"}
    GW->>INT: Forward (authN/Z, x-tenant-id, traceparent)
    INT->>EVI: Fetch materials (leaf hash or record, path, blockRoot, manifest)
    INT->>KMS: Load/validate public key (by kid) and verify signature(root)
    KMS-->>INT: ok {kid, alg}
    INT->>INT: Verify inclusion (leaf→path→blockRoot) and chain(root→segmentRoot?)
    alt returnEvidence = "bundle"
        INT->>OBJ: PUT evidence.zip (paths, manifest, key metadata)
    end
    INT-->>GW: 200 OK {perItemResults[], summary, evidenceUrl?}
    GW-->>CL: 200 OK

Alternative Paths

  • Fast mode: mode="fast" skips recomputation of leaf hash when caller supplies leafHash; verifies path→root→signature only.
  • Degraded mode: allowDegraded=true permits INCONCLUSIVE with reasons (e.g., signature service offline) while still verifying available steps.
  • External leaf: caller provides payload to hash server-side (canonicalization rules applied).
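The external-leaf path above applies canonicalization before hashing. A simplified sketch using sorted-key, compact, UTF-8 JSON; ATP's actual canonicalization rules may be stricter (e.g., a scheme along the lines of RFC 8785 JCS):

```python
import hashlib
import json

def canonical_leaf_hash(payload: dict) -> str:
    """Hash a payload deterministically: sorted keys, no whitespace, UTF-8.
    A simplified stand-in for a full canonicalization scheme; the service's
    actual rules may differ."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Without a fixed canonical form, two semantically identical payloads can hash differently, which is exactly what the 422 "cannot be canonicalized" error below guards against.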

Error Paths

sequenceDiagram
    actor CL as Client
    participant GW as API Gateway
    participant INT as Integrity Service

    CL->>GW: POST /integrity/v1/verify {malformed}
    alt Bad request (invalid item spec/algorithm)
        GW-->>CL: 400 Bad Request (Problem+JSON)
    else Not found (record/proof/manifest missing)
        GW-->>CL: 404 Not Found (Problem+JSON)
    else Conflict (verify while block is resealing/rotating)
        GW-->>CL: 409 Conflict (Problem+JSON)
    else Unauthorized/Forbidden
        GW-->>CL: 401/403 (Problem+JSON)
    else Rate limit / dependency down
        GW-->>CL: 429/503 (Retry-After)
    end

Request/Response Specifications

Input Requirements

| Field | Type | Req | Description | Validation |
| --- | --- | --- | --- | --- |
| Method/Path | POST /integrity/v1/verify | Y | Start verification | JSON body |
| Authorization | header | Y | Bearer <JWT> | Valid, not expired |
| x-tenant-id | header | Y | Tenant scope | Must match body.tenant |
| traceparent | header | O | W3C trace context | 55-char |
| tenant | string | Y | Target tenant | ^[A-Za-z0-9._-]{1,128}$ |
| mode | enum | O | full (default) or fast | — |
| allowDegraded | bool | O | Permit partial verify | default: false |
| returnEvidence | enum | O | none (default) or bundle | — |
| items[] | array | Y | Records to verify | 1–10k items |
| items[].recordId | string | O* | Record identifier | ULID/GUID |
| items[].leafHash | string | O* | Base64url/hex hash | matches algorithm |
| items[].payload | object | O* | Canonicalizable payload | size bounded |
| items[].algorithm | enum | O | sha256 (default) | allowlist |
| items[].expectedRoot | string | O | Optional asserted root | must match stored |
| idempotency-key | header | O | De-dupe request | ≤ 128 chars |

*Provide at least one of recordId, leafHash, or payload.
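The constraints above (tenant pattern, batch bounds, at-least-one-of rule) can be sketched as a request validator; the function name and error strings are illustrative, not the service's actual implementation:

```python
import re

TENANT_RE = re.compile(r"^[A-Za-z0-9._-]{1,128}$")

def validate_verify_request(body: dict) -> list:
    """Return Problem+JSON-style pointer/reason strings for a
    /integrity/v1/verify body; an empty list means the request passes."""
    errors = []
    if not TENANT_RE.match(body.get("tenant", "")):
        errors.append("/tenant: pattern")
    items = body.get("items") or []
    if not 1 <= len(items) <= 10_000:
        errors.append("/items: 1-10k items required")
    for i, item in enumerate(items):
        # The footnote's rule: at least one proof input per item.
        if not any(k in item for k in ("recordId", "leafHash", "payload")):
            errors.append(f"/items/{i}: one of recordId|leafHash|payload required")
    return errors
```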

Output Specifications

200 OK

| Field | Type | Description | Notes |
| --- | --- | --- | --- |
| results[] | array | Per-item verification results | See below |
| summary | object | {ok, fail, inconclusive} | Counts |
| evidenceUrl | url? | If returnEvidence=bundle | Presigned, short-lived |
| policyRevisionUsed | int | Integrity policy revision | For audit |

Per-item result

{
  "input": {"recordId":"01JF…","algorithm":"sha256"},
  "steps": {
    "leafHash": {"status":"OK","computed":"8a4f..."},
    "pathVerify": {"status":"OK","depth":17},
    "rootSignature": {"status":"OK","kid":"int-key-2025","alg":"Ed25519"},
    "chainLink": {"status":"OK","segment":"seg_2025_10_22"}
  },
  "status": "OK",                // OK | FAIL | INCONCLUSIVE
  "degraded": false,             // true if allowed and used
  "reason": null,                // failure/inconclusive reason
  "timingsMs": {"total": 42, "leaf": 1, "path": 6, "sig": 8}
}

Example Payloads

// Full verification by recordId
{
  "tenant": "acme",
  "mode": "full",
  "items": [
    {"recordId": "01JF3W8KTR2D3WQF3B9R0KJY9Y", "algorithm": "sha256"}
  ],
  "returnEvidence": "bundle"
}
// Fast verification using supplied leafHash and allowing degraded mode
{
  "tenant": "acme",
  "mode": "fast",
  "allowDegraded": true,
  "items": [
    {"leafHash": "8a4f...", "expectedRoot": "d1c2...", "algorithm": "sha256"}
  ]
}

Error Handling

Error Scenarios

| HTTP Code | Scenario | Recovery Action | Retry Strategy |
| --- | --- | --- | --- |
| 400 | Malformed body; none of recordId, leafHash, or payload provided; unsupported algorithm | Fix request | No retry until corrected |
| 401 | Missing/invalid JWT | Acquire valid token | Retry after renewal |
| 403 | Missing audit:integrity.verify or tenant mismatch | Request proper scope/role | — |
| 404 | Record/proof/manifest not found | Verify id/scope; ensure proofs retained | — |
| 409 | Verification against block being resealed/rotated | Retry after block state settles | Short backoff |
| 412 | If-Match on root version failed | Fetch latest root/manifest; retry | Conditional retry |
| 422 | Payload cannot be canonicalized to leaf hash | Use server-known recordId or supply leafHash | — |
| 429 | Rate limited for batch or per-tenant | Honor Retry-After | Exponential backoff + jitter |
| 503 | Evidence store, key service, or integrity store unavailable | Wait for recovery | Idempotent retry |

Failure Modes

  • Missing signature key (archived/rotated): inclusion verified, signature step INCONCLUSIVE when allowDegraded=true.
  • Archived proofs (cold tier): request becomes async; 202 with later webhook/report when materials restored.
  • Projection drift: record exists but proof not yet sealed; respond 409 until seal completes.

Recovery Procedures

  1. On 409/412, fetch latest block status/root and retry verification.
  2. If 503/429, back off; request is idempotent by (tenant, itemsHash, idempotency-key?).
  3. When proofs are archived, re-issue request with allowDegraded=true or wait for restoration event.
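Step 2 above makes retries safe by keying idempotency on (tenant, itemsHash, idempotency-key). A sketch of deriving that hash plus a backoff helper that honors Retry-After; the names and defaults are illustrative:

```python
import hashlib
import json
import random

def items_hash(tenant: str, items: list) -> str:
    """Stable digest over (tenant, items): sorted keys and compact separators
    make the hash independent of field ordering in each item."""
    canonical = json.dumps({"tenant": tenant, "items": items},
                           sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

def backoff_delay(attempt: int, retry_after=None, base=0.5, cap=30.0) -> float:
    """Exponential backoff with full jitter; a Retry-After hint (429/503) wins."""
    if retry_after is not None:
        return float(retry_after)
    return random.uniform(0, min(cap, base * 2 ** attempt))
```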

Performance Characteristics

Latency Expectations

  • Single-record, cached materials: p95 ≤ 200 ms.
  • Batch with precomputed paths: thousands/sec per verifier instance.

Throughput Limits

  • Per-tenant verification QPS caps; batch size limits (e.g., ≤ 1k items/request).

Resource Requirements

  • CPU-bound hashing/path checks; memory proportional to path depth and batch size; small I/O for manifest/path fetch.

Scaling Considerations

  • Cache recent roots and key material by kid.
  • Pre-fetch proof paths for hot records; shard verifier workers by tenant/segment.
  • Use asynchronous retrieval for cold-storage proofs.

Security & Compliance

Authentication

  • OIDC JWT at Gateway; service credentials for store/key access.

Authorization

  • Require audit:integrity.verify; enforce x-tenant-id RLS.

Data Protection

  • Do not log payloads or raw proofs; only hashes and ids.
  • Evidence bundles are encrypted at rest and shared via short-lived presigned URLs.

Compliance

  • Verification report contains key ids, algorithms, roots, and timestamps for chain-of-custody.
  • Degraded-mode decisions are explicit and auditable.

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
| --- | --- | --- | --- |
| verify_requests_total | counter | Verification requests | Trend |
| verify_latency_ms | histogram | End-to-end latency | p95 > SLO |
| verify_failures_total | counter | Items with FAIL | > 0 sustained |
| verify_inconclusive_total | counter | Degraded outcomes | Spike watch |
| proof_cache_hit_ratio | gauge | Cache effectiveness | < 0.8 sustained |

Logging Requirements

  • Structured logs: tenant, requestId, batchSize, ok/fail/inconclusive, alg, kid, degraded. No PII or payloads.

Distributed Tracing

  • Spans: materials.fetch, leaf.hash, path.verify, sig.verify, bundle.pack.
  • Attributes: pathDepth, kid, mode, degraded.

Health Checks

  • Readiness: evidence/key stores reachable; cache warmed.
  • Liveness: verifier queue drains; no stuck requests beyond timeout.

Operational Procedures

Deployment

  1. Deploy Integrity Service; expose /integrity/v1/verify.
  2. Configure key registry/KMS access and algorithm allowlist.
  3. Warm caches with latest roots and public keys.

Configuration

  • Env: VERIFY_MAX_BATCH, VERIFY_RATE_LIMITS, KEY_CACHE_TTL, ROOT_CACHE_TTL, EVIDENCE_BUNDLE_TTL.
  • Policy: allowed degraded modes; acceptable algorithms; maximum batch sizes.

Maintenance

  • Rotate verification keys and update registry; verify legacy roots with retained public keys.
  • Periodically test cold-proof restore paths.

Troubleshooting

  • Rising INCONCLUSIVE → check KMS availability and key retention.
  • High FAIL rates → inspect canonicalization/version mismatches or corrupted paths.
  • Latency spikes → verify cache TTLs and storage hot/cold tiering.

Testing Scenarios

Happy Path Tests

  • Verify by recordId with full steps → status=OK, signature validated.
  • Batch verify with provided leafHash → status=OK for all items; summary counts correct.

Error Path Tests

  • 400 when no recordId|leafHash|payload; 404 for unknown record/proof.
  • 409 when verifying during reseal; 412 when root version mismatches.
  • 429/503 induce backoff and successful retry.

Performance Tests

  • Achieve target throughput with cached proofs; measure p95 latency.
  • Stress with 10k items; ensure backpressure and partial progress reporting.

Security Tests

  • RBAC scopes enforced; cross-tenant blocked.
  • Evidence bundle URL expiry honored; keys validated by kid.

Internal References

External References

  • RFC 7807 (Problem Details)
  • W3C Trace Context

Appendices

A. Example Problem+JSON (degraded not allowed)

{
  "type": "urn:connectsoft:errors/integrity/degraded.disallowed",
  "title": "Degraded verification not permitted",
  "status": 422,
  "detail": "Key service unavailable and allowDegraded is false.",
  "traceId": "9f0c1d2e3a4b5c6d..."
}

B. Evidence Bundle (concept)

evidence/
  item_01JF…/
    leaf.txt
    path.json
    manifest.json
    root.sig           # JWS/COSE detached signature
    key-metadata.json  # {kid, alg, issuer, notBefore, notAfter}
README.txt             # verification instructions

Tamper Detection Flow

Continuously (or on-demand) scans integrity materials to detect anomalies—such as gaps, forks, reseals outside policy, signature/key issues, or out-of-order segments—then alerts and escalates with actionable context. The pipeline emphasizes low false positives through suppression, correlation, and thresholds.


Overview

Purpose: Proactively detect and surface potential tampering or integrity regressions before consumers encounter affected data.
Scope: Scheduling, scope planning, chain/segment/manifest checks, anomaly scoring & suppression, alerting/escalation, and case tracking. Excludes remediation (sealing/repair) which is handled by operations runbooks.
Context: Runs within the Integrity Validator component against Integrity/Evidence Stores and Key Registry/KMS; feeds alerts to Observability and Incident Management systems.
Key Participants:

  • Scheduler/Detector Orchestrator
  • Integrity Validator (check runners, anomaly detector)
  • Integrity Store / Evidence Store (roots, manifests, paths)
  • Key Registry/KMS (public keys, validity windows)
  • Alerting / On-Call (Pager/Email/Webhooks)
  • SIEM / Case Manager (ticketing, correlation)

Prerequisites

System Requirements

  • Validator has read access to Integrity/Evidence stores and Key Registry
  • Object storage reachable for manifests and archived proofs
  • Time synchronization across services; policy cache warm (algorithms, seal cadence)

Business Requirements

  • Tenant integrity policy defines seal cadence, allowed reseal windows, acceptable algorithms, and escalation paths
  • Alert routing configured (webhooks/pager) with on-call schedule
  • Compliance logging enabled for anomaly events

Performance Requirements

  • Chain scan p95 ≤ 2 min per segment; continuous mode amortized to keep staleness ≤ 5 min
  • Alert fan-out latency p95 ≤ 30 s
  • Bounded load on stores (rate-limited walkers)

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    participant SCH as Scheduler
    participant VAL as Integrity Validator
    participant IST as Integrity/Evidence Store
    participant KMS as Key Registry/KMS
    participant ALR as Alerting (Pager/Webhook)
    participant SIEM as SIEM/Case Manager

    SCH->>VAL: Tick {tenant, window, policyRevision}
    VAL->>IST: Enumerate segments/blocks within window
    loop For each segment
        VAL->>IST: Fetch manifests + roots + metadata
        VAL->>KMS: Get key by kid, check validity window
        VAL->>VAL: Run checks (gap/fork/order/sig/freshness/seal cadence)
    end
    VAL->>VAL: Score & suppress duplicates, correlate with recent changes
    alt Anomalies found
        VAL->>ALR: Create alert {type, severity, evidence pointers}
        ALR-->>VAL: Ack alert id
        VAL->>SIEM: Open case/ticket {links to evidence}
    else No anomalies
        VAL->>VAL: Record heartbeat metric & watermark
    end

Alternative Paths

  • On-demand scan: operator invokes POST /integrity/v1/tamper-detection:scan for a tenant/time range.
  • Hot segment watch: watch new blocks; verify seal cadence and signature freshness in near-real-time.
  • Degraded verification: if keys unavailable, emit warning with degraded=true (no hard alert) depending on policy.

Error Paths

sequenceDiagram
    participant OP as Operator
    participant GW as API Gateway
    participant VAL as Integrity Validator

    OP->>GW: POST /integrity/v1/tamper-detection:scan {malformed}
    alt 400 Bad Request (invalid window/tenant/algo)
        GW-->>OP: 400 Problem+JSON
    else 404 Not Found (unknown detectorId/tenant)
        GW-->>OP: 404 Problem+JSON
    else 409 Conflict (scan already running for same scope)
        GW-->>OP: 409 Problem+JSON
    else 429/503 (rate limit/dependency down)
        GW-->>OP: 429/503 Problem+JSON (+Retry-After)
    end

Request/Response Specifications

Input Requirements

| Field | Type | Req | Description | Validation |
| --- | --- | --- | --- | --- |
| Method/Path | POST /integrity/v1/tamper-detection:scan | O | On-demand scan trigger | JSON body |
| Authorization | header | Y | Bearer <JWT> | Valid, not expired |
| x-tenant-id | header | Y | Tenant scope | Matches body.tenant |
| tenant | string | Y | Target tenant | ^[A-Za-z0-9._-]{1,128}$ |
| window | object | O | {from,to} override | ISO-8601 UTC, bounded |
| checks | array | O | Subset of (gap, fork, order, seal, sig, freshness) | allowlist |
| severityThreshold | enum | O | info, low, medium, high, critical | default: medium |
| suppressWindow | string | O | Duplicate suppression (e.g., 10m) | ≤ policy max |
| traceparent | header | O | W3C trace context | 55-char |
| idempotency-key | header | O | De-dupe create | ≤ 128 chars |

Output Specifications

202 Accepted / 200 OK

| Field | Type | Description | Notes |
| --- | --- | --- | --- |
| scanId | string | Operation id | ULID/GUID |
| status | enum | Queued, Running, Completed, Failed | — |
| summary | object | {checkedSegments, anomalies, degraded} | Final on 200 |
| watermark | string | Latest segment time examined | ISO-8601 UTC |

Anomaly Event (concept)

{
  "tenant": "acme",
  "type": "Integrity.ForkDetected",
  "severity": "high",
  "segment": "seg_2025_10_22",
  "policyRevision": 12,
  "details": {
    "roots": ["9a1c...", "77fb..."],
    "firstSeenAt": "2025-10-22T12:00:07Z",
    "evidence": {"manifestUrl": "s3://.../seg_2025_10_22.manifest.json"}
  }
}

Error Handling

Error Scenarios

| HTTP Code | Scenario | Recovery Action | Retry Strategy |
| --- | --- | --- | --- |
| 400 | Invalid time window or checks list; from >= to | Fix request | No retry until corrected |
| 401 | Missing/invalid JWT | Acquire valid token | Retry after renewal |
| 403 | Missing audit:integrity.tamper.scan | Request proper role/scope | — |
| 404 | Unknown detectorId/tenant | Verify ids/tenant | — |
| 409 | Scan already running for same {tenant, window} | Wait for completion or use different scope | Retry after fix |
| 422 | Suppression window exceeds policy | Adjust parameter | — |
| 429 | Rate limited | Honor Retry-After | Backoff + jitter |
| 503 | Integrity/Evidence/Key service unavailable | Wait for recovery | Idempotent retry |

Failure Modes

  • Transient fork (eventual consistency): auto-downgrade to warning unless it persists beyond stabilityDelay.
  • Key rotation gap: signatures verify with new kid but manifests still reference old key; mark degraded=false, add remediation hint.
  • Late seal: block sealed outside allowed window; alert severity based on policy (medium → high if repeated).

Recovery Procedures

  1. For 409, query scan status and avoid duplicate runs; use idempotency-key.
  2. For intermittent fork/gap, re-scan after stabilityDelay; escalate only if repeated.
  3. On 503/429, validator backs off automatically; operator may re-issue trigger.

Performance Characteristics

Latency Expectations

  • Segment check p95 ≤ 2 min; near-real-time watch detects issues within ≤ 5 min of occurrence.

Throughput Limits

  • Bounded walkers per tenant (e.g., ≤ 2 concurrent); global cap to protect stores.

Resource Requirements

  • CPU for hashing/verification; small read IO for manifests/roots; minimal memory with streaming checks.

Scaling Considerations

  • Shard by tenant and segment time; cache recent roots and valid kids.
  • Use adaptive sampling: deep checks on hot segments; summary checks elsewhere.
  • Apply duplicate suppression windows to maintain low FP.
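The duplicate-suppression bullet above can be sketched as a small dedupe cache keyed by (tenant, segment, type), mirroring the dedupe keys recommended under Troubleshooting; the in-memory store and API are illustrative only:

```python
import time

class AlertSuppressor:
    """Drop duplicate anomaly alerts for the same (tenant, segment, type)
    within a suppression window (seconds). A production detector would
    back this with a shared store rather than process memory."""
    def __init__(self, window_s: float = 600.0, clock=time.monotonic):
        self.window_s = window_s
        self.clock = clock
        self._last_seen = {}

    def should_emit(self, tenant: str, segment: str, anomaly_type: str) -> bool:
        key = (tenant, segment, anomaly_type)
        now = self.clock()
        last = self._last_seen.get(key)
        if last is not None and now - last < self.window_s:
            return False  # suppressed duplicate sighting
        self._last_seen[key] = now
        return True
```

Injecting the clock keeps the window testable and lets deployments use a monotonic source immune to wall-clock jumps.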

Security & Compliance

Authentication

  • OIDC JWT at Gateway; service credentials for store/key access.

Authorization

  • Require audit:integrity.tamper.scan (run) and audit:integrity.tamper.read (results).
  • Enforce tenant RLS via x-tenant-id.

Data Protection

  • Do not include payloads in alerts; only ids, hashes, URLs to manifests (access-controlled).
  • Evidence links shared as short-lived presigned URLs.

Compliance

  • All anomalies and operator triggers are audited with actor, purpose, scope, and policy revision.
  • Detector configuration changes tracked with forward-only revisions.

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
| --- | --- | --- | --- |
| tamper_scans_active | gauge | Running scans | > tenant/global cap |
| tamper_anomalies_total | counter | Anomalies by type/severity | Spike indicates issue |
| tamper_false_positives_total | counter | Operator-marked FP | > target triggers tuning |
| tamper_scan_duration_seconds | histogram | Scan runtime | p95 > SLO |
| tamper_degraded_checks_total | counter | Checks in degraded mode | Sustained rise → key/store health |

Logging Requirements

  • Structured logs: tenant, scanId, policyRevision, segmentsChecked, anomalies[], degraded, watermark. No PII.

Distributed Tracing

  • Spans: scan.plan, segment.fetch, check.run(type), alert.emit, case.open.
  • Attributes: segmentCount, checks, severity, suppressed.

Health Checks

  • Readiness: Integrity/Evidence stores and Key Registry reachable; policy cache loaded.
  • Liveness: scan queue advancing; no segment stuck beyond timeout.

Operational Procedures

Deployment

  1. Deploy Integrity Validator; enable scheduler and on-demand endpoint.
  2. Configure alert routes (pager/webhook) and SIEM integration.
  3. Validate with seeded test anomalies (simulated fork/gap).

Configuration

  • Env: DETECTOR_MAX_CONCURRENCY, DETECTOR_STABILITY_DELAY, DETECTOR_SUPPRESS_WINDOW, DETECTOR_DEFAULT_CHECKS.
  • Policy: seal cadence, reseal allowances, severity mappings, degraded-mode policy.

Maintenance

  • Tune thresholds using tamper_false_positives_total and incident postmortems.
  • Rotate keys and ensure manifests reference valid kids across rotations.

Troubleshooting

  • Repeated transient forks → increase stabilityDelay slightly; verify store replication lag.
  • Many degraded checks → investigate Key Registry/KMS availability.
  • Alert floods → widen suppression window; confirm dedupe keys include {tenant, segment, type}.

Testing Scenarios

Happy Path Tests

  • Continuous scan detects a forced manifest gap and raises a single actionable alert.
  • On-demand scan limits to given window and returns summary with watermark.

Error Path Tests

  • 400 on malformed window/checks; 404 unknown tenant; 409 duplicate scan scope.
  • 429/503 produce compliant backoff with no duplicate alerts.

Performance Tests

  • Segment check p95 ≤ 2 min; scan staleness ≤ 5 min under steady load.
  • Suppression prevents duplicate alerts during repeated sightings.

Security Tests

  • RBAC respected; cross-tenant access blocked.
  • Alerts contain no payload data; evidence URLs expire and are scoped.

Internal References

External References

  • RFC 7807 (Problem Details)
  • W3C Trace Context

Appendices

A. Example Problem+JSON (duplicate scope)

{
  "type": "urn:connectsoft:errors/detector/scope.conflict",
  "title": "Tamper scan already running for scope",
  "status": 409,
  "detail": "A scan for tenant 'acme' and window 2025-10-22T00:00Z..2025-10-22T12:00Z is already running.",
  "traceId": "9f0c1d2e3a4b5c6d...",
  "errors": [{"pointer": "/window", "reason": "duplicate-scope"}]
}

B. Anomaly Types (reference)

  • gap: missing block/segment in expected sequence
  • fork: two different roots for the same segment
  • order: out-of-order seal time or index
  • seal: seal outside configured cadence or early reseal
  • sig: signature invalid/key mismatch/outside validity window
  • freshness: seal/manifest not produced within SLA
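The gap and fork types above can be sketched as a scan over (segmentIndex, root) observations; the flat integer index model and field names are simplifying assumptions over ATP's real segment identifiers:

```python
def detect_gaps_and_forks(observations: list) -> list:
    """Scan (segment_index, root) pairs.
    gap  -> a missing index in the expected contiguous sequence
    fork -> two different roots observed for the same index"""
    anomalies = []
    roots_by_index = {}
    for idx, root in observations:
        roots_by_index.setdefault(idx, set()).add(root)
    indices = sorted(roots_by_index)
    for prev, curr in zip(indices, indices[1:]):
        for missing in range(prev + 1, curr):
            anomalies.append({"type": "gap", "segment": missing})
    for idx in indices:
        if len(roots_by_index[idx]) > 1:
            anomalies.append({"type": "fork", "segment": idx})
    return anomalies
```

In the real detector a transient fork would additionally be held for stabilityDelay before alerting, as described under Failure Modes.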

Key Rotation Flow

Safely rotates signing keys for integrity sealing and verification. Introduces a new key (kid_new) in KMS, publishes it via the Key Registry, enables a dual-verify window where both kid_old and kid_new are trusted for verification, then transitions the signer to kid_new and retires kid_old without breaking backward verification.


Overview

Purpose: Regularly rotate integrity signing keys while ensuring uninterrupted signing and verification, preserving the ability to verify historical signatures.
Scope: Key generation/activation, registry publication, signer switchover, dual-verify window, verifier cache refresh, deactivation/retirement, and audit events. Excludes general IAM/PKI hardening (covered elsewhere).
Context: Security (SecOps) initiates rotation in KMS/HSM. Key Registry (JWKS/COSE keyset) distributes public keys to Integrity Service (signer) and all Verifiers (Verification/Compliance services).
Key Participants:

  • Security (SecOps)
  • KMS/HSM (key creation, protection, activation windows)
  • Key Registry / Publisher (JWKS/COSE sets, versioning)
  • Integrity Service (Signer) (seals blocks with active kid)
  • Verification Services (Integrity Verify, Compliance Audit)
  • Event Bus / Observability (Key.Rotated, metrics/alerts)

Prerequisites

System Requirements

  • KMS/HSM reachable; policies allow key create/rotate/disable
  • Key Registry supports versioned JWKS/COSE publication with cache headers
  • Integrity Service can hot-reload signer kid without restart
  • Verifiers fetch/refresh keys on cache miss or via periodic refresh

Business Requirements

  • Rotation cadence defined (e.g., 90 days) and emergency rotation runbook approved
  • Dual-verify window configured (e.g., 14 days) and documented
  • Audit logging enabled for all key lifecycle operations

Performance Requirements

  • JWKS fetch p95 ≤ 200 ms; cache TTL tuned (e.g., 5–10 min)
  • Signer switchover ≤ 1 min between publish and activation
  • Verification failure rate due to unknown kid < 0.01% during rotation

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    actor SEC as Security (SecOps)
    participant KMS as KMS/HSM
    participant REG as Key Registry (JWKS/COSE)
    participant SIG as Integrity Service (Signer)
    participant VER as Verification Services
    participant BUS as Event Bus / Observability

    SEC->>KMS: CreateKey {alg:Ed25519, usage:sign, tags:{tenant, purpose}}
    KMS-->>SEC: KeyMetadata {kid_new, state:PreActive}
    SEC->>REG: Publish {kid_new, pubKey, notBefore, notAfter}
    REG-->>VER: JWKS {kid_old, kid_new} (cacheable)
    SEC->>SIG: Schedule Activate {kid_new, at: T0+5m}
    Note over VER,REG: Dual-verify window begins: verifiers trust {kid_old, kid_new}
    SEC->>BUS: Emit Key.RotationPlanned {kid_old, kid_new, at:T0+5m}
    SIG->>KMS: Load key {kid_new}
    SIG->>SIG: Activate signer kid = kid_new (at T0+5m)
    SIG->>BUS: Emit Key.Rotated {active:kid_new, retired:kid_old?}
    SEC->>KMS: Set kid_old to verify-only (disable sign) at T0+14d
    SEC->>REG: Unpublish kid_old (or mark as retiring) at T0+14d
    REG-->>VER: JWKS {kid_new} (kid_old removed after window)

Alternative Paths

  • Emergency rotation: immediate switch due to suspected compromise; shorten dual-verify window, revoke kid_old for signing at once; maintain verify-only if integrity permits.
  • Canary activation: enable kid_new for a subset of signers; verify end-to-end before global activation.
  • Per-region phased rollout: publish globally, activate region by region with overlap.

Error Paths

sequenceDiagram
    actor SEC as Security
    participant GW as API Gateway
    participant KM as KMS/HSM

    SEC->>GW: POST /keys/v1/rotate {alg:"foo"}  %% unsupported alg
    alt 400 Bad Request
        GW-->>SEC: 400 Problem+JSON
    else 404 Not Found (kid_old)
        GW-->>SEC: 404 Problem+JSON
    else 409 Conflict (active rotation in progress / multiple active signers)
        GW-->>SEC: 409 Problem+JSON
    else 503 KMS unavailable
        GW-->>SEC: 503 Problem+JSON (Retry-After)
    end

Request/Response Specifications

Input Requirements

| Field | Type | Req | Description | Validation |
| --- | --- | --- | --- | --- |
| Method/Path | POST /keys/v1/rotate | Y | Initiate rotation (planned) | JSON body |
| Authorization | header | Y | Bearer <JWT> (SecOps) | Role: security:keys.rotate |
| x-tenant-id | header | O | If tenant-scoped keys | Matches policy |
| algorithm | enum | O | Ed25519 (default) or ES256 | — |
| activateAt | timestamp | O | Planned activation time (UTC) | ≥ now+5m |
| dualVerifyWindow | duration | O | e.g., 14d | policy bounds |
| reason | string | O | Rotation rationale | ≤ 256 chars |
| idempotency-key | header | O | De-dupe | ≤ 128 chars |

Operations

  • POST /keys/v1/activate {kid} — force activate kid_new now (emergency).
  • POST /keys/v1/retire {kid} — set kid_old verify-only / disable sign.
  • GET /.well-known/jwks.json — public keys (Key Registry).
  • GET /keys/v1/status — signer active kid, registry freshness, next rotation date.

Output Specifications

202 Accepted / 200 OK

| Field | Type | Description |
| --- | --- | --- |
| kidOld | string | Previously active key id |
| kidNew | string | New key id to activate |
| activateAt | timestamp | Planned activation |
| dualVerifyWindow | string | Duration (e.g., P14D) |
| status | enum | Planned, Activating, Active, Retiring, Retired |

Key.Rotated Event (concept)

{
  "tenant": "platform",
  "kidOld": "int-key-2025-07",
  "kidNew": "int-key-2025-10",
  "activatedAt": "2025-10-22T11:00:00Z",
  "dualVerifyUntil": "2025-11-05T11:00:00Z"
}

Error Handling

Error Scenarios

| HTTP Code | Scenario | Recovery Action | Retry Strategy |
| --- | --- | --- | --- |
| 400 | Unsupported algorithm; invalid activateAt/dualVerifyWindow | Correct request | No retry until fixed |
| 401 | Missing/invalid JWT | Acquire valid token | Retry after renewal |
| 403 | Missing security:keys.rotate | Request proper role | — |
| 404 | kid_old not found; JWKS endpoint not available | Verify ids/registry | — |
| 409 | Rotation already in progress; multiple active signers detected | Wait or cancel/prune; ensure single active signer | Retry after fix |
| 412 | If-Match on signer version mismatch | Fetch status; retry with latest | Conditional retry |
| 422 | Dual-verify window outside policy bounds | Adjust window | — |
| 429 | Excessive rotation attempts | Honor Retry-After | Backoff + jitter |
| 503 | KMS/Registry unavailable | Wait for recovery | Idempotent retry |

Failure Modes

  • Verifier cache staleness: transient verify failures for kid_new until JWKS refreshed; verifiers must re-fetch on unknown_kid.
  • Key compromise: emergency path—disable signing for kid_old immediately; maintain verify-only if proofs still need validation, else revoke and mark proofs inconclusive with remediation guidance.
  • Clock skew: activation timestamps are UTC; signer defers switch until now ≥ activateAt + safetyMargin.

Recovery Procedures

  1. On unknown kid verification failures, force JWKS refresh and reprocess.
  2. If 409 multiple active signers, demote extras to verify-only and audit the window.
  3. For 503, pause activation and retry KMS/Registry operations with backoff.

Performance Characteristics

Latency Expectations

  • Signer key load & switchover ≤ 60 s from activation time.
  • JWKS refresh propagation to verifiers within TTL (e.g., ≤ 10 min).

Throughput Limits

  • JWKS endpoint sized for spike during rotation; CDN cache recommended.

Resource Requirements

  • Minimal CPU; network I/O for JWKS distribution; signer maintains small in-memory key cache.

Scaling Considerations

  • Stage keys ahead of activation; pre-warm caches by triggering background JWKS fetch on publish.
  • Stagger regional activations to limit burst load.

Security & Compliance

Authentication

  • SecOps endpoints protected by OIDC + fine-grained RBAC; service-to-service mTLS optional.

Authorization

  • Roles: security:keys.rotate, security:keys.activate, security:keys.retire, security:keys.read.

Data Protection

  • Private keys never leave KMS/HSM; signing via KMS APIs or HSM PKCS#11.
  • JWKS served over HTTPS with integrity headers; include kid, alg, use.

Compliance

  • All key lifecycle changes audited (who, when, why, diff).
  • Backward verification preserved: historical signatures tied to archived public keys and validity windows.

Monitoring & Observability

Key Metrics

Metric Type Description Alert Threshold
signer_active_kid gauge(label) Current signer kid Change outside window
verify_unknown_kid_total counter Verifications failing due to unknown kid > 0 sustained
jwks_cache_age_seconds gauge Age of verifier key cache > TTL
key_rotation_events_total counter Rotations/emergencies Annotate releases
sign_failures_total counter Signing errors post-activation > 0

Logging Requirements

  • Structured logs: kidOld, kidNew, activateAt, actor, status, reason, region. No private key material.

Distributed Tracing

  • Spans: kms.create, registry.publish, signer.activate, verifier.refresh.
  • Attributes: kid, alg, dualVerifyWindow, region.

Health Checks

  • Readiness: signer can load kid_new; registry reachable.
  • Liveness: signer reports active kid; verification path succeeds with both keys during window.

Operational Procedures

Deployment

  1. Ensure signer supports dynamic kid reload; deploy Registry with JWKS endpoint.
  2. Test canary rotation in staging with synthetic seals and verifications.
  3. Schedule production rotation with maintenance window & comms.

Configuration

  • Env: SIGNING_ACTIVE_KID, KEY_ROTATION_SAFETY_MARGIN_SEC, JWKS_CACHE_TTL_SEC, DUAL_VERIFY_WINDOW_DEFAULT.
  • Policy: rotation cadence, emergency procedures, window bounds.

Maintenance

  • Archive decommissioned public keys and manifests; keep for lifetime of signed data.
  • Regularly validate that verifiers honor unknown_kid → refresh path.

Troubleshooting

  • Spike in verify_unknown_kid_total → verify JWKS TTL, CDN invalidation, clock skew.
  • Signing failures post-activate → confirm KMS grants, key state, signer reload status.
  • Conflicting actives → audit deployment orchestrations; enforce single active signer guard.

Testing Scenarios

Happy Path Tests

  • Plan → publish → activate kid_new; verify new seals validate with both keys during window.
  • Post-window, verify historical proofs with kid_old and new proofs with kid_new.

Error Path Tests

  • 400 invalid algorithm/time; 404 unknown kid; 409 rotation already in progress.
  • 503 KMS/Registry outage causes graceful delay and retries.

Performance Tests

  • JWKS propagation within TTL; negligible signing latency change.
  • High verification traffic during rotation does not exceed registry capacity.

Security Tests

  • Private keys never leave KMS; signer only holds handles.
  • Emergency rotation disables signing for kid_old immediately; verify-only allowed as policy dictates.

Internal References

External References

  • JWS (RFC 7515) / JWKS (RFC 7517)
  • COSE (RFC 8152)

Appendices

A. Example Problem+JSON (rotation conflict)

{
  "type": "urn:connectsoft:errors/keys/rotation.conflict",
  "title": "Another rotation is already in progress",
  "status": 409,
  "detail": "Active signer kid is already scheduled to rotate at 2025-10-22T11:00:00Z.",
  "traceId": "9f0c1d2e3a4b5c6d..."
}

B. JWKS Example

{
  "keys": [
    {"kty":"OKP","crv":"Ed25519","kid":"int-key-2025-10","use":"sig","alg":"EdDSA","x":"lJp..."},
    {"kty":"OKP","crv":"Ed25519","kid":"int-key-2025-07","use":"sig","alg":"EdDSA","x":"h3Q...", "status":"verify-only","notAfter":"2025-11-05T11:00:00Z"}
  ]
}
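
Verifier-side key selection against a JWKS document of this shape can be sketched as follows (Python, illustrative only; the function name and return conventions are assumptions, not part of the ATP API). Verify-only keys remain valid for verification until `notAfter`; anything else is treated as an unknown kid, which should trigger a JWKS refresh per the recovery procedures above.

```python
from datetime import datetime, timezone

def select_verification_key(jwks, kid, now=None):
    """Pick the JWKS entry matching `kid`, rejecting keys past their
    `notAfter` window. Verify-only keys are still usable to verify."""
    now = now or datetime.now(timezone.utc)
    for key in jwks.get("keys", []):
        if key.get("kid") != kid:
            continue
        not_after = key.get("notAfter")
        if not_after and datetime.fromisoformat(not_after.replace("Z", "+00:00")) < now:
            return None  # past dual-verify window: treat as unknown kid
        return key
    return None  # unknown kid -> caller should force a JWKS refresh
```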

Retry Flow

Executes resilient retries with exponential backoff + jitter to achieve safe at-least-once delivery semantics. Failed operations are scheduled by the Retry Service, executed when due, and on terminal failure are DLQ-routed with full context. All retryable work must be idempotent via an idempotencyKey.


Overview

Purpose: Increase robustness of transient or downstream-dependent operations by automated retries with guardrails, while preventing thundering herds via jitter and honoring tenant backpressure.
Scope: Scheduling, backoff calculation, jitter, execution, success/failure reporting, DLQ routing, observability. Excludes business-specific compensation (see Compensation Flow).
Context: Sits alongside Ingestion, Export, Projection, etc. Services emit retryable tasks to the Retry Service; on success the original workflow continues; on terminal failure the task is routed to DLQ for manual/automated handling.
Key Participants:

  • Producer Service (emits retryable work)
  • Retry Service (scheduler + executor)
  • Target Service (downstream dependency being called)
  • DLQ / Review Tool (terminal task handling)
  • Event Bus / Metrics

Prerequisites

System Requirements

  • Retry Service deployed with durable queue and time-based scheduling
  • Clock synchronized (UTC); stable monotonic timers
  • Network egress to Target Services; circuit breaker library available
  • Idempotent endpoints or idempotency keys supported by Target Services

Business Requirements

  • Per-tenant retry policies (maxAttempts, baseDelay, cap, jitter, retryable codes)
  • DLQ process and ownership defined (runbook, on-call group)
  • Data minimization for task payloads; no sensitive values in logs

Performance Requirements

  • p95 schedule-to-execute latency within ±1s of due time under nominal load
  • Executor throughput sized to peak retry storms; global and per-tenant caps
  • Backpressure signals honored (reduce concurrency, extend delays)

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    participant P as Producer Service
    participant R as Retry Service (Scheduler/Executor)
    participant T as Target Service
    participant BUS as Event Bus / Metrics

    P->>R: POST /retries/v1/schedule {task, idempotencyKey, policy}
    R->>R: Persist task, compute delay = backoff(attempt=1)+jitter
    R->>R: Enqueue for due time
    R->>T: (when due) Execute task with idempotencyKey
    T-->>R: 200 OK (or success code)
    R->>BUS: Emit Retry.Succeeded {taskId, attempts}
    R-->>P: 201 Created {taskId, status:"Scheduled"}

Alternative Paths

  • Transient failure: Target returns retryable error → attempt++, recompute delay with jitter → reschedule until success or maxAttempts.
  • Immediate retry hints: Target returns Retry-After → override computed delay (bounded by policy).
  • Work dedupe: if idempotencyKey seen recently, executor skips duplicate execution and marks Succeeded (idempotent).
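
The work-dedupe path above can be sketched with a TTL-bounded set of recently seen keys (Python, illustrative; class and method names are assumptions, and a production executor would use a durable store rather than process memory).

```python
import time

class IdempotencyGuard:
    """Remembers recently executed idempotency keys so a duplicate task
    is marked Succeeded without re-invoking the target (sketch)."""
    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self._seen = {}  # key -> expiry timestamp

    def check_and_record(self, key, now=None):
        now = time.monotonic() if now is None else now
        # purge expired entries before checking
        self._seen = {k: exp for k, exp in self._seen.items() if exp > now}
        if key in self._seen:
            return False  # duplicate: skip execution, mark Succeeded
        self._seen[key] = now + self.ttl
        return True  # first sighting: execute the task
```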

Error Paths

sequenceDiagram
    participant P as Producer
    participant R as Retry Service
    participant D as DLQ

    P->>R: Schedule task {malformed}
    alt 400 Bad Request
        R-->>P: 400 Problem+JSON
    else Task not found / status query bad id
        R-->>P: 404 Not Found (Problem+JSON)
    else Update while executing
        R-->>P: 409 Conflict (Problem+JSON)
    end
    R->>R: Execute attempt N (last allowed)
    R->>D: Route to DLQ {task, lastError, attempts=N}

Request/Response Specifications

Input Requirements

Field Type Req Description Validation
Method/Path POST /retries/v1/schedule Y Schedule a retryable task JSON body
Authorization header Y Bearer <JWT> Valid
x-tenant-id header Y Tenant scope Matches policy
traceparent header O W3C trace context 55-char
task.type string Y Logical task kind (e.g., Export.Callback) allowlist
task.payload object Y Minimal inputs to re-execute Size ≤ policy cap
idempotencyKey string Y De-dupes executions ≤ 128 chars
policy object O Override defaults See below

Policy Overrides (optional)

Field Type Description
maxAttempts int e.g., 6 (including first)
baseDelayMs int e.g., 250
multiplier number e.g., 2.0 (exponential)
maxDelayMs int cap, e.g., 60_000
jitter enum/number Jitter mode, e.g., full or decorrelated
retryable array Retryable status codes / reasons
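
The policy fields above map directly onto a delay computation (Python sketch, illustrative only; defaults mirror the example values in the table and are assumptions, not a contract).

```python
import random

def compute_delay_ms(attempt, policy):
    """Exponential backoff with optional full jitter, using the policy
    fields above. `attempt` is 1-based; attempt 1 is the first try."""
    base = policy.get("baseDelayMs", 250)
    mult = policy.get("multiplier", 2.0)
    cap = policy.get("maxDelayMs", 60_000)
    raw = min(cap, base * (mult ** (attempt - 1)))
    if policy.get("jitter") == "full":
        return random.uniform(0, raw)  # full jitter: anywhere in [0, raw]
    return raw
```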

Status/Control

  • GET /retries/v1/tasks/{taskId} → status, attempts, nextDueAt
  • POST /retries/v1/tasks/{taskId}:cancel (if safe)
  • GET /retries/v1/dlq → items; POST /retries/v1/dlq/{id}:replay

Output Specifications

201 Created

Field Type Description
taskId string ULID/GUID
status enum Scheduled
nextDueAt timestamp First attempt due time
policyEffective object Resolved policy
attempt int 1

200 OK (Status)

Field Type Description
taskId string Id
attempt int Current attempt
nextDueAt timestamp? Null if running/completed
state enum Running | Succeeded | Failed | DLQ
lastError object? {code, reason, ts}

Example Payloads

// Schedule with policy override
{
  "task": {
    "type": "Export.Callback",
    "payload": {"url":"https://example.com/hook","exportId":"exp_01JF..."}
  },
  "idempotencyKey": "exp_01JF...:callback",
  "policy": {"maxAttempts": 6, "baseDelayMs": 500, "multiplier": 2, "maxDelayMs": 60000, "jitter":"full"}
}
// Status response
{
  "taskId": "rtk_01JF...",
  "state": "Running",
  "attempt": 3,
  "nextDueAt": "2025-10-22T11:14:25Z",
  "lastError": {"code":"HTTP_503","reason":"Upstream unavailable","ts":"2025-10-22T11:12:13Z"}
}

Error Handling

Error Scenarios

HTTP Code Scenario Recovery Action Retry Strategy
400 Malformed task/policy; payload too large Fix request
401 Missing/invalid JWT Renew token Retry after renewal
403 Caller lacks retry:schedule Acquire role/scope
404 Unknown taskId Verify id
409 Update/cancel during execution window Wait for state change Retry after fix
412 If-Match version mismatch on update Fetch latest, retry Conditional retry
422 Non-idempotent target / policy disallowed Change endpoint/policy
429 Per-tenant/global throttle exceeded Honor Retry-After Backoff + jitter
503 Scheduler/Executor dependency down Wait for recovery Idempotent reschedule

Failure Modes

  • Poison task: repeatedly fails with non-retryable error → immediate DLQ.
  • Retry storm: global backoff and concurrency caps applied; jitter widened.
  • Clock skew: due times computed in UTC; executor compares with monotonic clock guard.

Recovery Procedures

  1. Inspect DLQ item; fix root cause; replay via DLQ endpoint.
  2. Adjust policy (raise cap, widen backoff) for transient incidents.
  3. Use idempotencyKey to ensure safe replays.

Performance Characteristics

Latency Expectations

  • Schedule-to-execute drift p95 ≤ 1s at steady load; may widen under backpressure.

Throughput Limits

  • Executor concurrency: per-tenant & global caps to protect downstreams.
  • Batched scheduling & due-time bucketing for high-volume workloads.

Resource Requirements

  • Lightweight CPU; memory for queues; persistent storage for tasks and attempts.

Scaling Considerations

  • Shard by tenant/time buckets; use decorrelated jitter to reduce synchronization.
  • Propagate Retry-After and circuit-breaker state into backoff calculation.

Security & Compliance

Authentication

  • OIDC JWT at Gateway; service accounts for producers.

Authorization

  • Roles: retry:schedule, retry:read, retry:cancel, retry:dlq.read, retry:dlq.replay.
  • Enforce tenant RLS via x-tenant-id.

Data Protection

  • Store minimal payloads; encrypt at rest; no secrets in task payloads—use references (e.g., secret ids).

Compliance

  • All attempts and state transitions audited with actor, reason, and outcomes.

Monitoring & Observability

Key Metrics

Metric Type Description Alert Threshold
retry_scheduled_total counter Tasks scheduled Trend
retry_attempts_total counter Attempts made Sudden surge
retry_success_total counter Completed via retry
retry_dlq_total counter Routed to DLQ > baseline
retry_delay_applied_ms histogram Backoff + jitter p95 sanity
executor_concurrency gauge Active workers Cap breaches

Logging Requirements

  • Structured logs per attempt: taskId, tenant, attempt, delayMs, jitterMs, code, reason. No sensitive payloads.

Distributed Tracing

  • Spans: retry.schedule, retry.execute, retry.backoff, dlq.route.
  • Attributes: attempt, delayMs, policyId, idempotencyKey.

Health Checks

  • Readiness: queue store reachable; scheduler tick healthy.
  • Liveness: executor draining; no stuck partitions.

Operational Procedures

Deployment

  1. Deploy Scheduler and Executor; configure queues/stores.
  2. Register retry policies per tenant; validate with synthetic faults.

Configuration

  • Env: RETRY_MAX_CONCURRENCY, RETRY_DEFAULT_POLICY, RETRY_MAX_PAYLOAD_BYTES, RETRY_STORM_GUARD_MULTIPLIER.
  • Policy: retryable codes map (HTTP/gRPC), base delays, caps, jitter mode.

Maintenance

  • Periodically purge completed tasks; archive DLQ with retention.
  • Tune jitter/backoff from incident postmortems.

Troubleshooting

  • DLQ spike → inspect non-retryable reasons; verify idempotency at Target.
  • Drift in due execution → check scheduler lag and backpressure controls.
  • Duplicate side effects → confirm Target honors idempotencyKey.

Testing Scenarios

Happy Path Tests

  • Target 503 twice then 200 → attempts increase, success within policy, no DLQ.
  • Retry-After honored to override computed delay.

Error Path Tests

  • 400 malformed schedule; 404 unknown task; 409 modify during run.
  • 422 when endpoint marked non-idempotent; 429/503 backoff honored.

Performance Tests

  • High-volume storm—executor respects caps; jitter spreads load.
  • p95 schedule-to-execute ≤ 1s under nominal load.

Security Tests

  • RBAC enforced; cross-tenant access blocked.
  • No secrets in logs/payloads; encryption at rest verified.

Internal References

External References

  • RFC 7807 (Problem Details)
  • W3C Trace Context

Appendices

A. Backoff Formula (examples)

  • Exponential: delay = min(maxDelay, base * (multiplier^(attempt-1))) + jitter
  • Decorrelated jitter: sleep = min(maxDelay, random(base, sleep*3))
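
The decorrelated-jitter formula above is stateful (each sleep depends on the previous one); a minimal sketch in Python:

```python
import random

def decorrelated_jitter(base_ms, max_ms):
    """Generator for the decorrelated-jitter schedule above:
    sleep = min(maxDelay, random(base, sleep * 3))."""
    sleep = base_ms
    while True:
        sleep = min(max_ms, random.uniform(base_ms, sleep * 3))
        yield sleep
```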

B. Example Problem+JSON (policy violation)

{
  "type": "urn:connectsoft:errors/retry/policy.invalid",
  "title": "Retry policy invalid",
  "status": 422,
  "detail": "Endpoint requires idempotency but idempotencyKey was not provided.",
  "traceId": "9f0c1d2e3a4b5c6d..."
}

Dead Letter Queue Flow

Operational path to triage, diagnose, and replay messages that exhausted retries or failed with non-retryable errors. Ensures no duplicate side effects by requiring idempotent targets and preserving the original idempotencyKey during replay. Provides auditability, metrics, and safe deletion/quarantine.


Overview

Purpose: Restore messages from failure to success with controlled, observable, and compliant procedures.
Scope: DLQ item listing, inspection, annotation, fix/runbook execution, safe replay (single/bulk), quarantine or delete, and auditing. Excludes business-side compensation (see Compensation Flow).
Context: DLQ is fed by Retry Service and other producers. Replay Tool orchestrates re-submission to the Target Service using at-least-once semantics with idempotency guarantees.
Key Participants:

  • Operator / SRE (triage & action)
  • API Gateway (authN/Z, tenancy)
  • DLQ Store (dead letters, metadata)
  • Replay Tool / DLQ Service (orchestrates fix & replay)
  • Target Service (original destination)
  • Runbook/Knowledge Base (known-error fixes)
  • Observability (metrics, logs, alerts)
  • Audit/Event Bus (operator actions, outcomes)

Prerequisites

System Requirements

  • DLQ store with durable retention and per-tenant partitioning
  • Replay Tool has network access to Target Service(s)
  • Original endpoint supports idempotency keys or is side-effect free
  • Circuit breaker and rate limits configured for replay traffic

Business Requirements

  • Runbooks for top failure signatures (e.g., mapping fixes, schema bumps)
  • Role-based access for DLQ operations with approvals where needed
  • Data minimization policies for viewing payloads (mask PII by default)

Performance Requirements

  • Listing/inspect p95 ≤ 200 ms per page
  • Replay throughput bounded (tenant/global) to protect targets
  • Batch replay progress reporting and partial-failure handling

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    actor OP as Operator
    participant GW as API Gateway
    participant DLQ as DLQ Service/Store
    participant RB as Runbook/KB
    participant RT as Replay Tool
    participant T as Target Service
    participant AUD as Audit/Event Bus

    OP->>GW: GET /ops/v1/dlq?filters… (search & select item)
    GW->>DLQ: Query items (tenant, filters)
    DLQ-->>GW: Page of items
    OP->>GW: GET /ops/v1/dlq/{id} (inspect, view masked payload, lastError)
    GW->>DLQ: Fetch item + metadata
    DLQ-->>GW: Item + recommended runbook link
    OP->>RB: Follow runbook, apply fix (config/schema/data)
    OP->>GW: POST /ops/v1/dlq/{id}:replay {mode:"safe"}
    GW->>RT: Orchestrate replay (authZ, tenancy)
    RT->>T: Re-submit with original idempotencyKey/payload
    T-->>RT: 200 OK (idempotent success)
    RT->>DLQ: Mark Resolved, attach replay transcript
    RT->>AUD: Emit DLQ.Replayed {id, attempts, actor, outcome}
    GW-->>OP: 200 OK {status:"Replayed", transcriptUrl}

Alternative Paths

  • Bulk replay: operator selects a query window/signature and triggers :bulk-replay with concurrency caps.
  • Quarantine: item moved to a separate queue to prevent accidental replay while investigation continues.
  • Redrive to alternative endpoint: route to a newer API version when the original is deprecated (policy-gated).

Error Paths

sequenceDiagram
    actor OP as Operator
    participant GW as API Gateway
    participant DLQ as DLQ Service
    participant RT as Replay Tool

    OP->>GW: POST /ops/v1/dlq/{id}:replay
    alt 400 Bad Request (invalid mode/filters)
        GW-->>OP: 400 Problem+JSON
    else 404 Not Found (unknown item)
        GW-->>OP: 404 Problem+JSON
    else 409 Conflict (item locked/by another replay)
        GW-->>OP: 409 Problem+JSON
    else 422 Unprocessable (target non-idempotent, policy forbids)
        GW-->>OP: 422 Problem+JSON
    else 429/503 (rate limit/dependency down)
        GW-->>OP: 429/503 Problem+JSON (+Retry-After)
    end

Request/Response Specifications

Input Requirements

Field Type Req Description Validation
List GET /ops/v1/dlq Y List DLQ items Pagination with page.after, limit≤200
Inspect GET /ops/v1/dlq/{id} Y Fetch one item {id} ULID/GUID
Replay POST /ops/v1/dlq/{id}:replay Y Re-submit safely JSON body
Bulk Replay POST /ops/v1/dlq:bulk-replay O Replay by filter JSON body
Quarantine POST /ops/v1/dlq/{id}:quarantine O Move to quarantine
Delete DELETE /ops/v1/dlq/{id} O Drop after approval Policy-gated
Authorization header Y Bearer <JWT> Role: DLQ ops
x-tenant-id header Y Tenant scope RLS enforced
traceparent header O W3C trace 55-char
idempotencyKey string O Override if missing ≤ 128 chars
mode enum O Replay mode safe (default) | force

DLQ Item (shape)

Field Description
id DLQ item id
source Producer (service/flow)
target Endpoint/service intended
payload Masked by default (toggle with RBAC)
idempotencyKey Original key (if any)
attempts Attempts made
firstSeenAt / lastErrorAt Timestamps
lastError {code, reason, traceId}
annotations[] Operator notes
status Pending | Quarantined | Replayed | Deleted

Output Specifications

200 OK (Inspect)

Field Type Description
item object DLQ item
recommendedRunbook url Link to doc
replayEligible bool true if idempotent & policy allows
warnings[] array E.g., “missing idempotencyKey”

200 OK (Replay)

Field Type Description
status enum Replayed | InProgress | Quarantined
transcriptUrl url Steps & outcomes
attempt int Attempt count after replay
effectiveIdempotencyKey string Key used

Example Payloads

// Replay request (safe)
POST /ops/v1/dlq/01JF...:replay
{
  "mode": "safe",
  "notes": "Fixed mapping for resourceType=Invoice; re-submitting."
}
// DLQ item (inspect response excerpt)
{
  "id": "01JF…",
  "source": "Ingestion.Consumer",
  "target": "Storage.Append",
  "idempotencyKey": "ar:01JF…",
  "attempts": 6,
  "lastError": {"code":"HTTP_422","reason":"Schema validation failed"},
  "replayEligible": true
}

Error Handling

Error Scenarios

HTTP Code Scenario Recovery Action Retry Strategy
400 Invalid filters, mode, or bulk selection too large Fix request/trim selection
401 Missing/invalid JWT Acquire valid token Retry after renewal
403 Lacks dlq:operate or PII unmask permission Request proper role
404 DLQ item not found Refresh list; verify id/tenant
409 Item locked by another operator/replay in progress Wait or take lock after TTL Retry after unlock
412 If-Match version mismatch on annotate/delete Refetch item; retry with latest Conditional retry
422 Replay blocked (non-idempotent target / missing key) Provide key or route to compensation
429 Replay throughput cap exceeded Honor Retry-After Backoff + jitter
503 DLQ store or target unavailable Wait for recovery Idempotent replay later

Failure Modes

  • Duplicate side effects risk: target not idempotent or key missing → block replay unless force with executive approval; log and audit.
  • Payload drift: original payload stale after schema change → tool offers auto-migrate transform preview before replay.
  • Replay storm: bulk selection triggers target throttling → tool enforces per-tenant QPS caps and adaptive backoff.
  • PII exposure: viewing raw payload requires elevated RBAC; otherwise masked.
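
The replay-eligibility rules above reduce to a small gate (Python sketch; function name and return values are illustrative assumptions, and field names follow the DLQ item shape in this document).

```python
def replay_decision(item, mode="safe", force_approved=False):
    """Gate a DLQ replay: items without an idempotencyKey are blocked in
    safe mode; force mode additionally requires an explicit approval."""
    if item.get("idempotencyKey"):
        return "replay"  # safe: target can de-dupe on the original key
    if mode == "force" and force_approved:
        return "replay-forced"  # audited; duplicate side effects possible
    return "blocked"  # 422: supply a key or route to compensation
```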

Recovery Procedures

  1. If 422, attempt payload migration using versioned transformers; re-try in safe mode.
  2. If 409, wait for lock TTL or coordinate via on-call; avoid parallel replay.
  3. For 503/429, the tool pauses and resumes respecting backoff and circuit breaker state.

Performance Characteristics

Latency Expectations

  • Inspect/list p95 ≤ 200 ms; single replay end-to-end typically ≤ 2 s (excluding target latency).

Throughput Limits

  • Default bulk replay ≤ 50 msg/s per tenant (configurable), global cap to protect targets.
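
A per-tenant replay cap like the one above is commonly enforced with a token bucket; a minimal sketch (Python, illustrative only; class name and parameters are assumptions).

```python
import time

class TokenBucket:
    """Per-tenant replay rate limiter (sketch): refill at `rate` tokens/s
    up to `burst`; each replayed message consumes one token."""
    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst
        self.tokens = burst
        self.last = time.monotonic()

    def try_acquire(self, now=None):
        now = time.monotonic() if now is None else now
        # refill proportionally to elapsed time, capped at burst
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should back off (429 with Retry-After)
```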

Resource Requirements

  • Light CPU/IO for listing; replay workers sized to throughput; encrypted storage for transcripts.

Scaling Considerations

  • Shard DLQ by tenant and creation time; support cursor-based pagination; parallel workers with per-target concurrency.

Security & Compliance

Authentication

  • OIDC JWT at Gateway; service tokens for replay to targets.

Authorization

  • Roles: dlq:read, dlq:operate, dlq:quarantine, dlq:delete, dlq:pii.unmask.
  • Fine-grained approvals required for mode=force and deletions.

Data Protection

  • Payloads masked by default; unmask requires explicit action (with purpose-of-use).
  • Transcripts and payload snapshots encrypted at rest; presigned URLs short-lived.

Compliance

  • All DLQ actions are audited (who, what, why, before/after, result).
  • Retention for DLQ items and transcripts aligns with tenant policy.

Monitoring & Observability

Key Metrics

Metric Type Description Alert Threshold
dlq_items_total gauge Current DLQ size (by tenant) Rising trend
dlq_oldest_age_seconds gauge Age of oldest item > SLO
dlq_replay_success_total counter Successful replays Track rate
dlq_replay_failure_total counter Failed replays Spike alert
dlq_quarantine_total counter Items quarantined Investigate
dlq_bulk_replay_inflight gauge Active bulk operations Cap breaches

Logging Requirements

  • Structured logs: tenant, dlqId, action, actor, mode, outcome, idempotencyKey, target, attempts. Do not log payload values.

Distributed Tracing

  • Spans: dlq.list, dlq.inspect, dlq.replay, dlq.quarantine.
  • Attributes: bulkSize, replayed, failed, throttled, transformVersion.

Health Checks

  • Readiness: DLQ store reachable; replay workers healthy.
  • Liveness: no stuck locks; bulk runners progressing.

Operational Procedures

Deployment

  1. Deploy DLQ Service and Replay Tool; wire to Gateway with RBAC.
  2. Configure per-tenant throughput caps and masking defaults.
  3. Validate end-to-end with seeded poison messages.

Configuration

  • Env: DLQ_LIST_PAGE_MAX, DLQ_REPLAY_QPS_PER_TENANT, DLQ_GLOBAL_QPS_CAP, DLQ_LOCK_TTL_SEC, TRANSFORMER_DEFAULT_VERSION.
  • Policy: allowed force operations, deletion approvals, payload unmask rules.

Maintenance

  • Periodic purge/archival of resolved items; rotate transcript encryption keys.
  • Review top failure signatures and update runbooks/transformers.

Troubleshooting

  • Duplicates observed → verify target idempotency and keys; disable force path.
  • Bulk replay throttled → reduce concurrency or expand caps with approval.
  • Payload migration errors → roll back transformer version and fix mapping.

Testing Scenarios

Happy Path Tests

  • Inspect → apply mapping fix → safe replay succeeds; DLQ item resolved.
  • Bulk replay with 5,000 items respects QPS caps and completes with transcript.

Error Path Tests

  • 400 invalid filters; 404 unknown id; 409 locked item; 422 non-idempotent blocked.
  • 429/503 backoff honored; operation resumes and completes.

Performance Tests

  • Listing p95 ≤ 200 ms at 1M items/tenant (indexed).
  • Bulk replay maintains target SLOs under cap.

Security Tests

  • PII masked by default; unmask requires RBAC + purpose-of-use; all actions audited.
  • Deletions require multi-party approval when enabled.

Internal References

External References

  • RFC 7807 (Problem Details)
  • W3C Trace Context

Appendices

A. Example Problem+JSON (non-idempotent target)

{
  "type": "urn:connectsoft:errors/dlq/replay.disallowed",
  "title": "Replay blocked by policy",
  "status": 422,
  "detail": "Target endpoint is not idempotent and force mode is disabled for this tenant.",
  "traceId": "9f0c1d2e3a4b5c6d..."
}

B. Example Annotation

POST /ops/v1/dlq/01JF...:annotate
{
  "note": "Fixed customer mapping (CUS-123). Verified with runbook RB-42."
}

Circuit Breaker Flow

Contains downstream failures and prevents cascading outages by short-circuiting failing calls, routing to fallbacks/queues, and probing recovery via half-open trials. Exposes clear client signals (headers/status) and integrates with Retry/DLQ to preserve at-least-once semantics.


Overview

Purpose: Protect services from unstable dependencies using automated open/half-open/closed state transitions, graceful degradation, and recovery probing.
Scope: Policy configuration, failure/latency detection, state transitions, short-circuit responses, fallback and queueing, recovery probes, client signaling. Excludes business-specific compensation (see Compensation Flow).
Context: Libraries/middleware wrap all client calls to downstreams (HTTP/gRPC/bus). Breaker state may be per-tenant, per-endpoint, per-partition.
Key Participants:

  • Caller Service (producer of the downstream call)
  • Circuit Breaker (in-process or sidecar)
  • Target Service (downstream dependency)
  • Fallback/Cache (optional read cache or static responses)
  • Retry/DLQ Services (for write/side-effect operations)
  • Observability/Config (metrics, alerts, ops overrides)

Prerequisites

System Requirements

  • Circuit breaker library enabled for HTTP/gRPC clients with configurable policies
  • Sliding windows for failure rate and slow-call rate with min call thresholds
  • Central config and runtime override API (ops) with safe defaults
  • Correlation/tracing propagation through fallback paths

Business Requirements

  • Defined fallback strategy per call type (read: cache; write: enqueue → Retry)
  • Tenant- and endpoint-level SLOs to tune thresholds
  • Runbook for operator overrides (force-open/close, reset)

Performance Requirements

  • Wrapper overhead p95 ≤ 1 ms per call (fast path, closed)
  • Probe batch size and interval sized to recover quickly without stampedes
  • Backpressure headers documented for clients

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    participant C as Caller Service
    participant CB as Circuit Breaker
    participant T as Target Service
    participant F as Fallback/Queue (optional)

    C->>CB: Invoke downstream operation
    alt State = CLOSED
        CB->>T: Forward request
        T-->>CB: 200/OK (within latency budget)
        CB-->>C: Success (propagate response)
    else State = HALF-OPEN (probe window)
        CB->>T: Limited probes (N% or fixed concurrency=1..k)
        T-->>CB: OK responses exceed threshold
        CB-->>C: Success, transition → CLOSED
    end

Alternative Paths

  • Fallback (read): CB returns cached/derived response with X-ATP-Circuit-State: open and X-ATP-Source: cache.
  • Queue (write): CB enqueues to Retry Service with idempotencyKey, returns 202 Accepted (Problem+JSON alternative body optional).
  • Partitioned breakers: isolate a bad shard/tenant from healthy traffic.
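
The closed/open/half-open transitions described here can be sketched as a small state machine (Python, illustrative; a count-based window stands in for the sliding windows described under Prerequisites, and all names are assumptions).

```python
import time

class CircuitBreaker:
    """Minimal breaker sketch: trips to OPEN when the failure rate over a
    recent-call window exceeds a threshold; probes recovery after cool-down."""
    def __init__(self, failure_rate=0.5, min_calls=10, cooldown_s=5.0):
        self.failure_rate, self.min_calls, self.cooldown = failure_rate, min_calls, cooldown_s
        self.state = "closed"
        self.results = []      # rolling window of recent call outcomes
        self.opened_at = 0.0

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        if self.state == "open" and now - self.opened_at >= self.cooldown:
            self.state = "half-open"  # permit a limited probe
        return self.state != "open"   # open -> short-circuit (503)

    def record(self, success, now=None):
        now = time.monotonic() if now is None else now
        if self.state == "half-open":
            if success:
                self.state, self.results = "closed", []
            else:
                self.state, self.opened_at = "open", now
            return
        self.results = (self.results + [success])[-self.min_calls:]
        if (len(self.results) >= self.min_calls and
                self.results.count(False) / len(self.results) > self.failure_rate):
            self.state, self.opened_at = "open", now
```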

Error Paths

sequenceDiagram
    participant C as Caller
    participant CB as Circuit Breaker
    participant T as Target
    participant Q as Retry/DLQ

    C->>CB: Invoke downstream operation
    alt State = OPEN (short-circuit)
        CB-->>C: 503 Service Unavailable
        Note right of C: Headers: X-ATP-Circuit-State: open, Retry-After: 5
    else State = CLOSED but failure/slow-call triggers thresholds
        CB->>T: Request
        T-->>CB: 5xx/timeout/slow
        CB->>CB: Increment counters, if trip threshold → OPEN
        CB->>Q: (write ops) enqueue for retry
        CB-->>C: 503/504 or 202 (queued) with Problem+JSON
    end

Request/Response Specifications

The breaker primarily shapes responses; ops endpoints allow safe overrides.

Input Requirements (Ops)

Field Type Req Description Validation
Method/Path POST /ops/v1/circuits/{id}:override O Force open | half-open | closed with TTL {id} exists
Authorization header Y Bearer <JWT> Role ops:circuits
state enum Y open | half-open | closed allowlist
ttl duration O Override duration (e.g., 10m) ≤ policy max
notes string O Reason ≤ 256 chars

Output Specifications (Client-Facing)

  • Closed (success): normal 2xx/OK.
  • Open (short-circuited read): 503 Service Unavailable. Headers: X-ATP-Circuit-State: open, Retry-After: <sec>, X-ATP-Circuit-Reason: failure-rate|slow-calls|min-calls-not-met. Body (Problem+JSON example):

{
  "type":"urn:connectsoft:errors/circuit/open",
  "title":"Dependency temporarily unavailable",
  "status":503,
  "detail":"Calls short-circuited by circuit breaker (failure rate > 50% over 20s).",
  "retryAfterSeconds":5,
  "traceId":"9f0c1d2e3a4b5c6d..."
}
  • Open (queued write): 202 Accepted with Location: /retries/v1/tasks/{taskId} and the headers above plus X-ATP-Queued: true.


Error Handling

Error Scenarios

| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Ops override payload invalid (state/ttl) | Fix request | |
| 401 | Missing/invalid JWT (ops) | Acquire valid token | Retry after renewal |
| 403 | Lacks ops:circuits | Request proper role | |
| 404 | Unknown circuit {id} | Verify id/scope | |
| 409 | Conflicting override/state transition | Clear override or wait TTL | Retry after fix |
| 412 | If-Match on circuit version mismatch | Read latest, retry | Conditional retry |
| 422 | TTL or state not permitted by policy | Adjust inputs | |
| 429 | Too many overrides/changes | Back off | Jittered retry |
| 503/504 | Short-circuited/open or downstream timeout | Respect headers | Exponential backoff + jitter |

Failure Modes

  • Min-calls not met: insufficient samples → breaker stays closed but labels responses with X-ATP-Circuit-Reason: warmup.
  • Stampede on recovery: too many probes → configure half-open concurrency and jitter.
  • Cache staleness: fallback exceeds TTL → downgrade to 503 instead of serving stale beyond policy.

Recovery Procedures

  1. When open, allow half-open after cool-down; probe with limited concurrency.
  2. Tune thresholds based on SLOs and observed metrics (failure/slow-call rate).
  3. For write paths, confirm idempotencyKey is propagated before enabling queue mode.

Performance Characteristics

Latency Expectations

  • Added wrapper overhead p95 ≤ 1 ms (closed).
  • Half-open probes routed immediately; unaffected calls still short-circuited.

Throughput Limits

  • Limit concurrent probes (e.g., 1–5) per breaker key; cap queued writes per tenant.

Resource Requirements

  • In-process counters/timers; optional small shared state for cluster coordination.

Scaling Considerations

  • Key breaker by {tenant, endpoint, partition} to avoid global trips.
  • Use decorrelated jitter for cool-down and probe scheduling.
  • Optional shared state (e.g., Redis) for multi-instance consistency.
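The decorrelated jitter mentioned above picks each cool-down or probe delay uniformly between the base delay and three times the previous delay, capped at a maximum. A minimal sketch (parameter names are illustrative, not part of any ATP API):

```python
import random

def decorrelated_jitter(base: float, cap: float, prev: float) -> float:
    """Next delay in seconds: uniform between base and 3x the previous
    delay, capped. Successive delays grow but never synchronize across
    instances, which avoids probe stampedes on recovery."""
    return min(cap, random.uniform(base, prev * 3))

# Example: a schedule of delays, each derived from the previous one.
delay = 1.0
for _ in range(5):
    delay = decorrelated_jitter(base=1.0, cap=30.0, prev=delay)
    assert 1.0 <= delay <= 30.0
```

Compared with plain exponential backoff, the randomized lower bound keeps many clients from retrying in lockstep after the same trip.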

Security & Compliance

Authentication

  • Client requests authenticated as usual; ops overrides require OIDC JWT and RBAC.

Authorization

  • Ops roles: ops:circuits.read, ops:circuits.override, ops:circuits.reset.

Data Protection

  • Headers reveal state but not sensitive internals; avoid leaking backend hostnames.

Compliance

  • All trips, overrides, and recoveries are audited (who, when, why, thresholds, counts).

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| circuit_state{key} | gauge | 0=closed, 1=half-open, 2=open | Open > 0 sustained |
| circuit_short_circuits_total | counter | Calls blocked by open state | Spike alert |
| circuit_failure_rate | gauge | Recent failure % | > policy trip |
| circuit_slow_call_rate | gauge | Recent slow-call % | > policy trip |
| circuit_probe_success_total | counter | Half-open successes | Low during recovery |
| fallback_invocations_total | counter | Cache/queue usage | Track degradation |

Logging Requirements

  • Structured logs: breakerKey, state, reason, window, failRate, slowRate, probe, override, actor, traceId.

Distributed Tracing

  • Tag spans with circuit.state, circuit.reason, fallback=true, queued=true, include downstream span links when available.

Health Checks

  • Readiness: breaker config loaded; counters active.
  • Liveness: state machine transitions occur; no stuck half-open beyond TTL.

Operational Procedures

Deployment

  1. Enable breaker middleware for all outbound clients; set sane defaults.
  2. Wire ops API and dashboards; define per-tenant keys.
  3. Validate with chaos testing (inject 5xx/timeouts).

Configuration

  • Policy: {window=20s, minCalls=20, failureRate=50%, slowThreshold=1s, slowRate=50%, cooldown=5s, probe=2}
  • Headers: X-ATP-Circuit-State, X-ATP-Circuit-Reason, Retry-After, X-ATP-Queued.
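The policy above drives a small state machine. A minimal sketch, assuming the minCalls/failureRate/cooldown/probe knobs from the default policy; it uses a simple cumulative counter rather than a real 20 s sliding window and omits slow-call tracking:

```python
class CircuitBreaker:
    """Illustrative closed -> open -> half-open -> closed state machine."""

    def __init__(self, min_calls=20, failure_rate=0.5, cooldown=5.0, probes=2):
        self.min_calls, self.failure_rate = min_calls, failure_rate
        self.cooldown, self.probes = cooldown, probes
        self.state, self.calls, self.failures = "closed", 0, 0
        self.opened_at, self.probe_successes = 0.0, 0

    def allow(self, now: float) -> bool:
        """Gate a call; after cool-down, open transitions to half-open."""
        if self.state == "open" and now - self.opened_at >= self.cooldown:
            self.state, self.probe_successes = "half-open", 0
        return self.state != "open"

    def record(self, ok: bool, now: float) -> None:
        """Feed the outcome of a permitted call back into the breaker."""
        if self.state == "half-open":
            if ok:
                self.probe_successes += 1
                if self.probe_successes >= self.probes:
                    self.state, self.calls, self.failures = "closed", 0, 0
            else:  # failed probe reopens immediately
                self.state, self.opened_at = "open", now
            return
        self.calls += 1
        self.failures += (not ok)
        if (self.calls >= self.min_calls
                and self.failures / self.calls > self.failure_rate):
            self.state, self.opened_at = "open", now  # trip
```

Note the min-calls guard: with fewer than 20 samples the breaker stays closed, matching the warmup reason described under Failure Modes.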

Maintenance

  • Review trip analytics weekly; adjust thresholds and probe sizes.
  • Rotate cache TTLs for fallbacks per freshness requirements.

Troubleshooting

  • Frequent opens → inspect dependency SLOs, retry storms, and idempotency.
  • No recovery → increase probe window or check downstream health checks.
  • Client confusion → verify headers are surfaced at Gateway.

Testing Scenarios

Happy Path Tests

  • Closed → success; zero wrapper overhead regressions.
  • Half-open with limited probes transitions to closed after consecutive successes.

Error Path Tests

  • Trip on failure rate > threshold; open cool-down respected; headers set.
  • Read fallback returns cached response with correct state headers.
  • Write enqueued returns 202 with Location and idempotencyKey.

Performance Tests

  • Probe concurrency prevents stampede; short-circuit path p95 ≤ 1 ms.
  • High QPS under open state does not overload queue/cache.

Security Tests

  • Ops override RBAC enforced; audit trail captured.
  • Headers do not leak sensitive backend identifiers.

Internal References

External References

  • RFC 7807 (Problem Details)
  • W3C Trace Context

Appendices

A. Example Ops Override

POST /ops/v1/circuits/tenant:search:index:primary:override
{
  "state": "open",
  "ttl": "10m",
  "notes": "Isolate failing shard while indexers recover."
}

B. Client Header Cheatsheet

  • X-ATP-Circuit-State: closed|half-open|open
  • X-ATP-Circuit-Reason: failure-rate|slow-calls|override|warmup
  • Retry-After: seconds until next probe/cooldown ends
  • X-ATP-Queued: true when write queued for retry
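A client consuming the cheatsheet headers might branch as follows. This is a hedged sketch; the returned action strings are illustrative, not part of any ATP SDK:

```python
def next_action(status: int, headers: dict) -> str:
    """Map a breaker-shaped response to a client action using the
    cheatsheet headers above."""
    if status == 202 and headers.get("X-ATP-Queued") == "true":
        # Write was queued for retry; poll the task at Location.
        return "poll " + headers.get("Location", "")
    if headers.get("X-ATP-Circuit-State") == "open":
        # Short-circuited; honor Retry-After before the next attempt.
        return "retry after " + headers.get("Retry-After", "5") + "s"
    if status >= 500:
        return "retry with backoff"
    return "ok"
```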

Compensation Flow

Repairs partial failures or out-of-order effects by executing a deterministic, idempotent sequence of inverse actions (e.g., projection rewrites, search index corrections, pointer re-links). Produces a complete audit trail and supports dry-run planning before execution.


Overview

Purpose: Restore system invariants when a transaction or workflow completed partially (e.g., append succeeded but projection/index update failed).
Scope: Detection/selection of a failed transaction, plan synthesis, dry-run validation, execution of compensating steps, verification, and audit. Excludes business refunds or external systems remediation (covered by domain runbooks).
Context: Invoked by operators or automation (DLQ/alerts). Coordinates with Projection Service, Search Index, Storage, and Integrity to ensure consistency.
Key Participants:

  • Operator / Automation (trigger)
  • Compensation Service (planner/executor)
  • Storage / Projection / Search Index (targets)
  • Audit/Event Bus (actions & outcomes)
  • Retry/DLQ (feeder, optional post-fix replay)

Prerequisites

System Requirements

  • Compensation Service deployed with access to Storage, Projections, and Indexes
  • Idempotency primitives available (step keys, compare-and-set guards)
  • Read-only snapshot capability for dry-run planning
  • Time-synchronized environment (UTC), consistent tracing

Business Requirements

  • Catalog of compensable scenarios and their inverse steps
  • Approval policy for destructive operations and bulk compensations
  • Masking rules for any payloads surfaced to operators

Performance Requirements

  • p95 plan synthesis ≤ 500 ms for typical cases
  • Batched execution with rate limits to protect targets
  • Backpressure-aware executor with progress reporting

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    actor OP as Operator/Automation
    participant GW as API Gateway
    participant CMP as Compensation Service
    participant ST as Storage
    participant PR as Projection Service
    participant IX as Search Index
    participant AUD as Audit/Event Bus

    OP->>GW: POST /ops/v1/compensations {txnId|recordId,..., dryRun:true}
    GW->>CMP: Create Plan (authN/Z, x-tenant-id)
    CMP->>ST: Inspect ground truth (append store)
    CMP->>PR: Inspect projection state
    CMP->>IX: Inspect index documents
    CMP->>CMP: Synthesize plan (ordered idempotent steps)
    CMP-->>GW: 200 OK {plan, impact, approvals}
    OP->>GW: POST /ops/v1/compensations/{id}:run
    GW->>CMP: Execute Plan
    CMP->>ST: (if needed) no-op or pointer fix
    CMP->>PR: Rewrite/repair projections (CAS by watermark)
    CMP->>IX: Reindex specific docs (with version guards)
    CMP->>CMP: Verify invariants, mark Completed
    CMP->>AUD: Emit Compensation.Completed {id, steps, result}
    GW-->>OP: 200 OK {status:"Completed", metrics}
Hold "Alt" / "Option" to enable pan & zoom

Alternative Paths

  • Auto-compensation from DLQ: DLQ item contains signature; Compensation Service builds & runs plan before replay.
  • Partial plan: execute only safe subset; schedule remaining steps via Retry Service.
  • Integrity-first: if integrity proofs affected, run Integrity Verification/re-seal checks before projection/index fixes.

Error Paths

sequenceDiagram
    actor OP as Operator
    participant GW as API Gateway
    participant CMP as Compensation Service

    OP->>GW: POST /ops/v1/compensations {invalid}
    alt 400 Bad Request (invalid scope, missing ids)
        GW-->>OP: 400 Problem+JSON
    else 404 Not Found (unknown txn/record)
        GW-->>OP: 404 Problem+JSON
    else 409 Conflict (plan already running / step lock held)
        GW-->>OP: 409 Problem+JSON
    else 422 Unprocessable (scenario not compensable)
        GW-->>OP: 422 Problem+JSON
    else 429/503 (rate limit/dependency down)
        GW-->>OP: 429/503 Problem+JSON (+Retry-After)
    end
Hold "Alt" / "Option" to enable pan & zoom

Request/Response Specifications

Input Requirements

| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| Method/Path | POST /ops/v1/compensations | Y | Create compensation plan | JSON body |
| Authorization | header | Y | Bearer <JWT> | Role ops:compensate |
| x-tenant-id | header | Y | Tenant scope | RLS enforced |
| traceparent | header | O | W3C trace context | 55-char |
| txnId | string | O* | Transaction/workflow id | ULID/GUID |
| recordId | string | O* | Affected record id | ULID/GUID |
| scope | object | O* | {from, to, filters} window | ISO-8601 UTC |
| dryRun | bool | O | Only produce plan | default true |
| strategy | enum | O | repair (default) or replay | allowlist |
| notes | string | O | Operator context | ≤ 512 chars |
| idempotency-key | header | O | De-dupe | ≤ 128 chars |
  • Provide at least one of txnId, recordId, or scope.

Control/Status

  • GET /ops/v1/compensations/{id} → status, steps, metrics
  • POST /ops/v1/compensations/{id}:run → execute planned steps
  • POST /ops/v1/compensations/{id}:cancel → cancel if safe

Output Specifications

200 OK (Plan)

| Field | Type | Description | Notes |
|---|---|---|---|
| id | string | Plan id | ULID/GUID |
| steps[] | array | Ordered idempotent steps | See step shape |
| impact | object | Counters by target (proj/index/records) | Estimate |
| approvalsRequired | bool | Whether approval gate is needed | Policy-driven |

Step (shape)

{
  "stepId": "S1",
  "type": "Projection.Rewrite",
  "target": {"projection":"AuditEvents","key":"01JF..."},
  "idempotencyKey": "cmp:proj:AuditEvents:01JF...",
  "precondition": {"watermarkAtLeast":"2025-10-22T10:55:00Z"},
  "action": {"rewriteFrom": "storage", "schemaVersion": 3},
  "verify": {"projectionMatches":"storageHash"}
}
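A step of this shape can be executed idempotently by checking the idempotency key and the watermark precondition before acting. A minimal sketch; the in-memory dict and set stand in for the projection store and the plan's completed-step log:

```python
def execute_step(step: dict, projection: dict, completed: set) -> str:
    """Run one compensation step; field names follow the step shape above."""
    key = step["idempotencyKey"]
    if key in completed:
        return "skipped"  # re-running a completed step is a no-op
    precondition = step.get("precondition", {})
    required = precondition.get("watermarkAtLeast")
    # ISO-8601 UTC strings compare correctly lexicographically.
    if required and projection.get("watermark", "") < required:
        return "precondition-failed"  # surfaces to the caller as HTTP 412
    # The "rewrite": repair the projection entry from ground truth.
    projection[step["target"]["key"]] = step["action"]
    completed.add(key)
    return "completed"
```

On "precondition-failed" the executor re-plans with refreshed state, matching Recovery Procedure 1 below; on "skipped" it simply advances, which is what makes whole-plan re-runs safe.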

Example Payloads

// Create plan (dry-run) by recordId
{
  "tenant": "acme",
  "recordId": "01JF3W8KTR2D3WQF3B9R0KJY9Y",
  "dryRun": true,
  "strategy": "repair",
  "notes": "Projection missing due to prior outage."
}
// Execute planned compensation
POST /ops/v1/compensations/01K0...:run
{
  "approvalToken": "appr_9c1...",
  "concurrency": 8
}

Error Handling

Error Scenarios

| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Invalid scope; both txnId and recordId missing; bad timestamps | Fix request | |
| 401 | Missing/invalid JWT | Acquire valid token | Retry after renewal |
| 403 | Lacks ops:compensate or approval missing | Request role/approval | |
| 404 | Transaction/record not found | Verify ids/window | |
| 409 | Another plan running on same target; step lock held | Wait/Cancel existing | Retry after unlock |
| 412 | Precondition (watermark/version) failed | Refresh state; re-plan | Conditional retry |
| 422 | Scenario not compensable or non-idempotent step detected | Route to manual runbook | |
| 429 | Throttled by target system | Honor Retry-After | Backoff + jitter |
| 503 | Dependency unavailable (Projection/Index/Storage) | Wait or partial run | Idempotent retry later |

Failure Modes

  • Non-idempotent side effect: step flagged and blocked unless operator uses explicit force gate.
  • Stale projection: CAS/watermark precondition fails → re-plan with updated state.
  • Wide impact plan: bulk changes require staged batches with checkpoints to avoid long locks.

Recovery Procedures

  1. On 412, refresh state and regenerate plan; executor resumes from last completed step.
  2. If 503/429, executor backs off, persists progress, and continues when healthy.
  3. For 409, inspect running plan and either merge or cancel the conflicting one.

Performance Characteristics

Latency Expectations

  • Plan (single record) typically ≤ 500 ms; execution time is dominated by target-service latencies.

Throughput Limits

  • Concurrency governed per target (e.g., proj=16, index=8) and per-tenant caps.

Resource Requirements

  • Light CPU for planning; executor memory proportional to batch window.

Scaling Considerations

  • Shard plans by tenant and time window; use watermarks to ensure deterministic ordering.
  • Persist checkpoints every N steps; support resume-after-failure.
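Checkpoint-every-N with resume-after-failure can be sketched as below; the dict stands in for a persisted checkpoint store. Because steps are idempotent, re-running at most N−1 steps after a crash is safe:

```python
def run_plan(steps, checkpoint: dict, do_step, every: int = 10) -> int:
    """Execute ordered steps, persisting progress every `every` steps.

    A restarted executor passes the same checkpoint dict and resumes
    from the last recorded position; names here are illustrative.
    """
    start = checkpoint.get("done", 0)
    for i in range(start, len(steps)):
        do_step(steps[i])
        # Persist after every Nth step and at the end of the plan.
        if (i + 1) % every == 0 or i == len(steps) - 1:
            checkpoint["done"] = i + 1
    return checkpoint.get("done", 0)
```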

Security & Compliance

Authentication

  • OIDC JWT/OAuth at Gateway; service accounts for inter-service calls.

Authorization

  • Roles: ops:compensate.plan, ops:compensate.run, ops:compensate.cancel, ops:compensate.read.
  • Approval tokens required for destructive/bulk plans.

Data Protection

  • Mask PII in operator views; only show necessary diffs.
  • Encrypt transcripts and store with short-lived presigned access.

Compliance

  • Emit Compensation.Planned|Started|StepCompleted|Completed|Failed events with actor, reason, and evidence.
  • Plans and transcripts retained per tenant retention policy.

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| compensation_plans_total | counter | Plans created | Trend |
| compensation_steps_completed_total | counter | Steps done | |
| compensation_failures_total | counter | Failed steps | > 0 sustained |
| compensation_runtime_seconds | histogram | End-to-end duration | p95 > SLO |
| compensation_blocked_total | counter | Blocked by preconditions/locks | Spike alert |

Logging Requirements

  • Structured logs include: planId, tenant, stepId, type, idempotencyKey, precondition, outcome, traceId. No payload values.

Distributed Tracing

  • Spans: plan.synthesize, step.execute(type), verify, checkpoint.
  • Attributes: concurrency, watermark, casVersion, affectedCount.

Health Checks

  • Readiness: access to Storage/Projection/Index; plan store reachable.
  • Liveness: executors progressing; no step stuck beyond timeout.

Operational Procedures

Deployment

  1. Deploy Compensation Service with plan store and executor.
  2. Wire RBAC, approval gates, and observability.
  3. Seed known scenarios and step templates.

Configuration

  • Env: COMP_PLAN_MAX_SCOPE, COMP_EXEC_CONCURRENCY, COMP_STEP_TIMEOUT, COMP_APPROVAL_REQUIRED.
  • Policy: destructive-step approvals; per-target QPS caps; retry/backoff settings.

Maintenance

  • Review top compensation causes; add detectors to prevent recurrence.
  • Tune watermark/CAS policies to reduce 412 conflicts.

Troubleshooting

  • Frequent 412 → stale state; check projection lag and adjust watermarks.
  • High blocked_total → missing approvals or non-idempotent steps; refine templates.
  • Long runtimes → lower concurrency or break plan into smaller batches.

Testing Scenarios

Happy Path Tests

  • Plan & run for “missing projection” fixes projection and index, verifies equality to storage.
  • DLQ-triggered auto-compensation succeeds, then DLQ replay passes.

Error Path Tests

  • 400 invalid scope; 404 unknown record/txn; 409 conflicting plan; 422 non-compensable scenario.
  • 412 precondition failure reruns after re-plan and completes.

Performance Tests

  • Batch plan (1k records) executes within rate limits; checkpoints allow resume.
  • Executor maintains p95 step time within target under load.

Security Tests

  • RBAC and approvals enforced; transcripts encrypted; PII masked by default.
  • Idempotency verified by re-running completed plan → no additional side effects.

Internal References

External References

  • RFC 7807 (Problem Details)
  • W3C Trace Context

Appendices

A. Example Problem+JSON (precondition failed)

{
  "type": "urn:connectsoft:errors/compensation/precondition.failed",
  "title": "Watermark precondition failed",
  "status": 412,
  "detail": "Projection watermark 2025-10-22T11:02:10Z is below required 2025-10-22T11:05:00Z.",
  "traceId": "9f0c1d2e3a4b5c6d..."
}

B. Step Type Catalog (excerpt)

  • Projection.Rewrite — rebuild from storage by key with CAS
  • Index.Reindex — single-doc reindex with version guard
  • Pointer.Relink — fix correlation/resource pointers with invariants check
  • Event.Replay — re-emit projection events from checkpoint (idempotent)

Metrics Collection Flow

Collects and aggregates golden signals and SLO-aligned KPIs from all platform services using OpenTelemetry (OTel) and Prometheus exposition/scrape. Emits standardized counters/gauges/histograms with tenant/shard/region labels, stores them in a scalable TSDB, and drives dashboards & alerts (ingest latency, projection lag, seal lag, queue depth).


Overview

Purpose: Provide reliable, low-cardinality telemetry for capacity planning, incident detection, and SLO compliance.
Scope: In-process instrumentation (OTel SDK), export (OTLP gRPC/HTTP or Prom scrape), aggregation, storage, dashboards, alerting. Excludes application logs and traces (covered in other flows).
Context: Every service ships metrics to an OTel Collector (agent/sidecar/daemonset) which forwards to Metrics Backend (Prometheus/Mimir/Thanos). Alert rules and dashboards read from the backend.
Key Participants:

  • Service (instrumented application)
  • OTel SDK (metrics API + views)
  • OTel Collector (receivers/processors/exporters)
  • Metrics Backend (TSDB) (Prometheus-compatible)
  • Alerting (Alertmanager/Notifications)
  • Dashboards (Grafana)

Prerequisites

System Requirements

  • OTel SDK enabled in each service with histograms for latency and gauges for lags
  • OTel Collector reachable (4317 gRPC / 4318 HTTP) with TLS/mTLS
  • Metrics backend with remote write or federated scrape; retention configured
  • Resource attributes set (service.name, service.version, deployment.environment, region)

Business Requirements

  • SLOs defined per domain: Ingestion latency, Projection lag, Seal lag, Search latency
  • Alert routing/ownership documented; runbooks linked from alerts
  • Cardinality budgets per tenant and endpoint (guardrails/policies)

Performance Requirements

  • Metrics export overhead < 1% CPU; payloads ≤ policy size (batching on)
  • Scrape intervals tuned (e.g., 15s) without overloading services
  • End-to-end telemetry freshness p95 ≤ 30s

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    participant SVC as Service
    participant SDK as OTel SDK (Metrics)
    participant COL as OTel Collector
    participant TSDB as Metrics Backend (Prom/Mimir)
    participant ALR as Alerting
    participant DB as Dashboards

    SVC->>SDK: Record metrics (counters/gauges/histograms)
    SDK->>COL: Export (OTLP) with resource attrs & exemplars (traceId)
    COL->>TSDB: Remote write / Prom scrape pipeline
    TSDB-->>ALR: Rule eval -> alert fire/inhibit
    TSDB-->>DB: Power SLO dashboards & drilldowns
Hold "Alt" / "Option" to enable pan & zoom

Alternative Paths

  • Prometheus scrape: service exposes /metrics; TSDB scrapes directly (no collector) where allowed.
  • Edge aggregation: Collector performs histogram downsampling or delta temporality conversion before write.
  • Multi-tenant split: per-tenant remote-write endpoints or relabeling to enforce isolation.

Error Paths

sequenceDiagram
    participant SVC as Service
    participant COL as OTel Collector
    participant TSDB as Metrics Backend

    SVC->>COL: Export (invalid metrics/labels)
    alt 400 Bad Request (schema/label violation)
        COL-->>SVC: 400 Problem (drop + log)
    else 404 Not Found (unknown tenant/series namespace)
        TSDB-->>COL: 404, metric rejected
    else 409 Conflict (type change for existing metric name)
        TSDB-->>COL: 409, reject write
    else 429/503 (rate limit/outage)
        TSDB-->>COL: 429/503, backoff + retry
    end
Hold "Alt" / "Option" to enable pan & zoom

Request/Response Specifications

Input Requirements

| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| OTLP Endpoint | url | Y | grpc://collector:4317 or https://collector:4318/v1/metrics | TLS/mTLS |
| resource.service.name | string | Y | Logical service name | kebab-case |
| resource.deployment.environment | string | Y | prod / staging / dev | enum |
| resource.cloud.region | string | O | Region/zone | allowlist |
| Metric names | string | Y | atp_* prefix + unit suffix | Prom rules |
| Labels | map | Y | {tenant, shard, region, result, route} | cardinality caps |
| Views | config | O | Histogram buckets, temporality | per-SLO |
| Exemplars | bool | O | Attach trace links to histograms | sample rate cap |

Output Specifications

| Field | Type | Description | Notes |
|---|---|---|---|
| Dashboards | URL | Grafana folders per domain | RBAC enforced |
| Alerts | YAML | Rule groups with SLO burn rates | Routed to on-call |
| Recording Rules | YAML | Pre-agg series by tenant/shard | Reduces cost |
| Telemetry Health | JSON | Collector/TSDB status endpoints | For probes |

Example Payloads

.NET OTel setup (C#)

builder.Services.AddOpenTelemetry()
    .WithMetrics(m => m
        .AddMeter("atp.ingestion","atp.projection","atp.integrity")
        .AddRuntimeInstrumentation()
        .AddAspNetCoreInstrumentation()
        .AddOtlpExporter(o => o.Endpoint = new Uri("http://otel-collector:4317")));

Metric naming & units (examples)

  • atp_ingest_latency_seconds (histogram) — client→accepted latency
  • atp_projection_lag_seconds (gauge) — append→projection lag
  • atp_integrity_seal_lag_seconds (gauge) — append→seal lag
  • atp_ingest_records_total (counter) — records ingested
  • atp_export_jobs_active (gauge) — active export jobs

Recommended histogram buckets (seconds)

[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]

Error Handling

Error Scenarios

| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Invalid metric name/unit; disallowed label; excessive cardinality | Fix SDK config; drop or remap labels | No retry until fixed |
| 401 | Missing/invalid token for remote write | Renew credentials | Retry after renewal |
| 403 | Tenant not authorized to write namespace | Update RBAC/relabeling | |
| 404 | Unknown tenant/namespace; dashboard id missing | Create namespace / correct link | |
| 409 | Type conflict (counter→histogram reuse of name) | Rename metric; update dashboards | |
| 413 | Payload too large | Reduce batch size; increase limits | Retry with smaller batches |
| 429 | Rate limited by TSDB/collector | Honor Retry-After | Exponential backoff + jitter |
| 503 | Collector/TSDB unavailable | Buffer (within cap) | Bounded retry with drop policy |

Failure Modes

  • Cardinality explosion (e.g., userId in labels) → automatic label sanitizer drops high-cardinality keys; emit warning counter.
  • Type migration (metric renamed without deprecation) → breaks dashboards; use recording rules to bridge.
  • Clock skew → out-of-order samples dropped; sync NTP and use server timestamping.
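A label sanitizer of the kind described, built on the allowlist from the Labels row above, might look like the sketch below; the function and constant names are ours, and the dropped count mirrors what feeds atp_metrics_cardinality_dropped_total:

```python
# Allowlist from the input-requirements table; anything else (userId,
# email, raw IPs, ...) is treated as high-cardinality and dropped.
ALLOWED_LABELS = {"tenant", "shard", "region", "result", "route"}

def sanitize_labels(labels: dict, allowed=ALLOWED_LABELS,
                    max_value_len: int = 64):
    """Return (kept_labels, dropped_count); values are also truncated
    so no single label value can blow up series size."""
    kept = {k: str(v)[:max_value_len]
            for k, v in labels.items() if k in allowed}
    return kept, len(labels) - len(kept)
```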

Recovery Procedures

  1. Enable views to aggregate/drop labels causing explosion; redeploy with safe config.
  2. Roll out metric renames via dual-publish window + recording rules → retire old names.
  3. During TSDB outage, buffer with caps; after recovery, drain at limited QPS.

Performance Characteristics

Latency Expectations

  • Exporter p95 < 50 ms per batch; end-to-end metric freshness p95 ≤ 30 s.

Throughput Limits

  • Default 10k samples/s per pod (configurable); per-tenant write QPS caps at the collector.

Resource Requirements

  • SDK minimal CPU; Collector memory sized for queues; backend disk/retention sized to SLO analytics.

Scaling Considerations

  • Shard TSDB by tenant/region; use recording rules to pre-aggregate; leverage remote write to long-term store (Thanos/Mimir).

Security & Compliance

Authentication

  • OTLP with mTLS; Prom scrape secured by service mesh identities or basic auth over TLS.

Authorization

  • Per-tenant write tokens; relabeling at collector enforces tenant isolation.

Data Protection

  • No PII in labels; label sanitizer strips ids, emails, IPs unless explicitly allowlisted.

Compliance

  • Alert acknowledgments/audits stored; SLO reports preserved per retention policy.

Monitoring & Observability

Key Metrics

| Metric Name | Type | Description | Alert Threshold |
|---|---|---|---|
| atp_ingest_latency_seconds | histogram | Client→accepted latency | Burn rate on p95/p99 |
| atp_projection_lag_seconds | gauge | Append→projection lag | > 60s sustained |
| atp_integrity_seal_lag_seconds | gauge | Append→seal lag | > 120s sustained |
| otelcol_exporter_queue_size | gauge | Collector queue depth | > 80% capacity |
| prom_remote_write_requests_failed_total | counter | Failed writes | Rising trend |
| atp_metrics_cardinality_dropped_total | counter | Dropped label pairs | Spike → investigate |

Logging Requirements

  • Collector structured logs for drops/backpressure; include tenant, series, reason.

Distributed Tracing

  • Exemplars: attach traceId to latency histogram buckets for drill-down.
  • Trace spans for exporter/collector with attributes: seriesCount, dropped, retry.

Health Checks

  • Collector readiness (receivers/exporters live); TSDB scrape targets up; dashboard datasource healthy.

Operational Procedures

Deployment

  1. Ship OTel SDK across services; configure default meters and views.
  2. Deploy OTel Collector (agent/daemonset) with TLS and remote write.
  3. Provision dashboards and alert rules from GitOps repo.

Configuration

  • Env: OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_METRIC_EXPORT_INTERVAL, OTEL_RESOURCE_ATTRIBUTES.
  • Collector: processors (batch, memory_limiter), exporters (prometheusremotewrite).
  • Backend: retention, compaction, ruler/alertmanager endpoints.

Maintenance

  • Periodic review of cardinality budget; prune unused metrics.
  • Tune histogram buckets as traffic patterns evolve.

Troubleshooting

  • Missing metrics → check SDK meter enabled, service.name correct, collector pipelines.
  • High drops → inspect label sanitizer logs; remove high-cardinality labels.
  • Alert noise → adjust SLO burn-rate windows and inhibit rules.

Testing Scenarios

Happy Path Tests

  • Ingestion service publishes latency histogram; dashboard shows p95/p99; alerts fire under synthetic slowness.
  • Projection lag gauge reflects backlog; alert triggers and clears after recovery.

Error Path Tests

  • 400 invalid label name → collector drops with warning counter incremented.
  • 404 unknown tenant namespace → write rejected; dashboards unaffected.
  • 409 type conflict on metric rename → dual-publish + recording rule bridges.

Performance Tests

  • 10k samples/s sustained without exporter backpressure; queue sizes stable.
  • TSDB outage → buffered then drained within limits; no OOM.

Security Tests

  • mTLS enforced; cross-tenant writes denied.
  • No PII observed in labels; sanitizer counters remain near zero.

Internal References

External References

  • OpenTelemetry Metrics Spec
  • Prometheus Best Practices

Appendices

A. Example Alert (Projection Lag SLO)

groups:
- name: projection-lag
  rules:
  - alert: ProjectionLagHigh
    expr: atp_projection_lag_seconds{environment="prod"} > 60
    for: 5m
    labels: {severity: page, team: projections}
    annotations:
      summary: "Projection lag high (>{{ $value }}s)"
      runbook: "https://runbooks/projection-lag"

B. Collector Pipeline (excerpt)

receivers:
  otlp:
    protocols: { grpc: {}, http: {} }
processors:
  batch: {}
  memory_limiter: { check_interval: 1s, limit_mib: 512 }
exporters:
  prometheusremotewrite:
    endpoint: https://mimir.remote/api/v1/push
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]

Distributed Tracing Flow

Correlates requests across all hops using W3C Trace Context (traceparent, tracestate) and OpenTelemetry spans. Propagates baggage (e.g., tenant, edition) with strict guardrails to enable per-tenant analytics without leaking PII. All spans are exported to a trace store for query and troubleshooting.


Overview

Purpose: Provide end-to-end visibility of a request from Gateway → Ingestion → Storage → Integrity → Projection → Search/Export, enabling root-cause analysis and SLO burn tracking.
Scope: Context propagation (HTTP/gRPC/bus), span creation and attributes, sampling (head/tail), baggage policy, export via OTel → Collector → Trace Backend, and trace query UX. Excludes logs/metrics (covered elsewhere).
Context: Each service uses OTel SDK. The API Gateway starts/continues a trace, forwards context, and attaches safe baggage (tenant, edition). Downstream services create child spans. Collector batches/exports to a Jaeger/Tempo-compatible backend.
Key Participants:

  • Client / Producer
  • API Gateway
  • Ingestion Service
  • Storage Service
  • Integrity Service
  • Projection Service
  • Search / Export Services
  • OTel Collector
  • Trace Backend (Jaeger/Tempo)

Prerequisites

System Requirements

  • OTel SDK enabled for HTTP, gRPC, DB instrumentation (server & client)
  • W3C Trace Context and Baggage propagators registered
  • OTel Collector reachable with TLS (gRPC 4317 / HTTP 4318)
  • Trace backend available (Tempo/Jaeger) with retention & indexing

Business Requirements

  • Baggage policy allowlist: tenant, edition, optional purpose (no PII)
  • Sampling policy defined (head: rate/parent; tail: error/latency based)
  • SRE runbooks for “missing span”, “broken parent”, and “dropped export”

Performance Requirements

  • Tracing overhead < 3% CPU at default sample rates
  • Export latency hidden via batching; queue backpressure bounded
  • Query p95 ≤ 3 s for recent traces

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    participant CL as Client
    participant GW as API Gateway
    participant ING as Ingestion Service
    participant ST as Storage Service
    participant INT as Integrity Service
    participant PR as Projection Service
    participant COL as OTel Collector
    participant TR as Trace Backend

    CL->>GW: HTTP/gRPC request (+traceparent?, +baggage: tenant,edition)
    Note right of GW: Start/continue root span, enforce baggage allowlist
    GW->>ING: Forward request (+traceparent,+baggage)
    ING->>ST: Append audit (child span)
    ST-->>ING: Ack (db client/server spans)
    ING->>INT: Enqueue/compute integrity (child span)
    INT-->>ING: Proof computed
    ING->>PR: Emit projection event (child span)
    PR-->>ING: Projected
    par Export spans
      GW-->>COL: OTLP export (batched)
      ING-->>COL: OTLP export (batched)
      ST-->>COL: OTLP export (batched)
      INT-->>COL: OTLP export (batched)
      PR-->>COL: OTLP export (batched)
    end
    COL->>TR: Push spans
    TR-->>GW: Trace available for query
    Note over GW,PR: Baggage {tenant,edition} available on all spans
Hold "Alt" / "Option" to enable pan & zoom

Alternative Paths

  • Message bus propagation: inject traceparent/baggage into message headers; consumers extract and create linked spans if processing is async.
  • Tail sampling: collector performs tail-based sampling (error/latency heuristics) for high-value traces while keeping head sampling low.
  • Gateway as root: if client sends no traceparent, Gateway creates the root span; otherwise, it joins the provided context.
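Injecting context into bus message headers while enforcing the baggage allowlist can be sketched as below; the header and baggage key names follow the policy in this flow, while the function names and in-memory header dict are illustrative:

```python
# Allowlist from the baggage policy: tenant, edition, optional purpose.
ALLOWED_BAGGAGE = {"tenant", "edition", "purpose"}

def inject_context(headers: dict, traceparent: str,
                   baggage: dict) -> dict:
    """Copy trace context into outbound message headers, dropping any
    baggage key outside the allowlist so PII never crosses the bus."""
    out = dict(headers)
    out["traceparent"] = traceparent
    safe = {k: v for k, v in baggage.items() if k in ALLOWED_BAGGAGE}
    if safe:
        out["baggage"] = ",".join(
            f"{k}={v}" for k, v in sorted(safe.items()))
    return out

def extract_context(headers: dict):
    """Consumer side: pull context; a missing traceparent means the
    consumer starts a new trace and links spans instead of parenting."""
    return headers.get("traceparent"), headers.get("baggage")
```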

Error Paths

sequenceDiagram
    participant CL as Client
    participant GW as API Gateway
    participant COL as OTel Collector
    participant TR as Trace Backend

    CL->>GW: Request (malformed trace headers)
    alt 400 Bad Request (invalid traceparent format)
        GW-->>CL: 400 Problem+JSON (with new trace id for error handling)
    else Backend query for traceId
        GW->>TR: GET /traces/{traceId}
        alt 404 Not Found (expired/unknown)
            TR-->>GW: 404 Not Found
            GW-->>CL: 404 Problem+JSON
        else 409 Conflict (concurrent sampling policy change)
            TR-->>GW: 409 Conflict
            GW-->>CL: 409 Problem+JSON
        end
    end
Hold "Alt" / "Option" to enable pan & zoom

Request/Response Specifications

Input Requirements (Propagation & Policy)

| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| traceparent | header/metadata | O | W3C Trace Context | 55-char format |
| tracestate | header/metadata | O | Vendor/state hints | size ≤ 512B |
| baggage | header/metadata | O | tenant=acme,edition=enterprise | allowlist keys; total ≤ 1024B |
| x-tenant-id | header | Y | Tenant RLS (also echoed in baggage) | must match |
| trace-flags | bitfield | O | Sampling decision (head) | 0/1 |
| idempotency-key | header | O | For write flows (not tracing but correlated) | ≤ 128 chars |

Ops / Query

  • GET /traces/{traceId} → rendered trace
  • GET /traces/search?tenant=&error=true&latencyMs>… → find traces
  • POST /ops/v1/tracing/sampling {headRate, tailPolicies[]} → update sampling (RBAC)

Output Specifications

  • Spans include attributes (examples):
    • Common: tenant, edition, environment, region, trace.sampled
    • Gateway: route, status_code, client.ip_hash
    • Ingestion: audit.schemaVersion, payload.bytes, validation.result
    • Storage: db.system, db.operation=append, db.statement?=off
    • Integrity: integrity.blockId, segment, proof.kid
    • Projection: watermark, lag.ms
    • Search/Export: query.kind, result.count, package.id

Example HTTP with headers

POST /audit/v1/records HTTP/1.1
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
tracestate: atp=gw;ver=1
baggage: tenant=acme,edition=enterprise
x-tenant-id: acme
content-type: application/json

Example gRPC metadata (pseudo)

:authority: ingestion.atp
traceparent: 00-4bf92f3577b34...-00f067aa0b...-01
baggage: tenant=acme,edition=enterprise

Error Handling

Error Scenarios

| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Malformed traceparent/baggage | Drop/regen context; return Problem+JSON if strict | No retry until fixed |
| 401 | Querying traces without auth | Acquire token | Retry after renewal |
| 403 | Cross-tenant trace access | Enforce RLS; deny | — |
| 404 | Trace id not found/expired | Verify id/retention window | — |
| 409 | Sampling policy update conflicts | Re-fetch policy; retry op | Conditional retry |
| 413 | Oversized baggage | Trim to policy; drop disallowed keys | Resend with smaller baggage |
| 429 | Collector/back-end rate limit | Honor Retry-After | Exponential backoff + jitter |
| 503 | Collector/back-end unavailable | Buffer within caps | Bounded retry, drop oldest if over cap |

Failure Modes

  • Broken parentage: services that don’t extract context create new roots → detectable by orphan span metric.
  • Baggage misuse: high-cardinality/PII snuck into baggage → sanitizer drops keys and emits policy violations.
  • Excess sampling: high head sampling inflates overhead → shift to tail sampling for error/slow traces.
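The baggage-sanitizer behavior described above might look like the following; the allowlist mirrors the `tenant`/`edition` keys used throughout this flow, but the function itself is an illustrative sketch, not the shipped middleware:

```python
BAGGAGE_ALLOWLIST = {"tenant", "edition"}
BAGGAGE_MAX_BYTES = 1024  # total budget from the propagation policy table

def sanitize_baggage(header: str) -> tuple[str, list[str]]:
    """Keep allowlisted keys; return sanitized header and dropped keys."""
    kept, dropped = [], []
    for item in header.split(","):
        key, _, value = item.strip().partition("=")
        if key in BAGGAGE_ALLOWLIST:
            kept.append(f"{key}={value}")
        else:
            dropped.append(key)       # reported on a policy-violation counter
    out = ",".join(kept)
    if len(out.encode()) > BAGGAGE_MAX_BYTES:
        out = ""                      # over budget: drop baggage entirely
    return out, dropped
```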

Recovery Procedures

  1. Enable/verify propagators in all client/server middleware.
  2. Turn on tail sampling policies (e.g., error=true, latency>500ms).
  3. Inspect “orphan span” dashboards; fix missing extract/inject in specific services.

Performance Characteristics

Latency Expectations

  • Instrumentation overhead p95 ≤ 1 ms per hop (sampled), near-zero when unsampled.

Throughput Limits

  • Collector queue sized for burst N× steady state; backpressure triggers temporary head sampling reductions.

Resource Requirements

  • Small CPU for SDK; Collector memory for queues; backend disk for retention (e.g., 7–14 days).

Scaling Considerations

  • Shard collectors per region/tenant; enable tail sampling at edge; compress exports; prefer OTLP gRPC.

Security & Compliance

Authentication

  • Query/UI protected by OIDC; service-to-collector via mTLS.

Authorization

  • Enforce tenant isolation on trace queries (filter by baggage tenant and RLS).

Data Protection

  • No PII in baggage or span attributes; hash IPs/UAs; redact payloads; disable SQL/body capture by default.
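The hashed-IP rule above (span attribute `client.ip_hash`) could use a keyed hash so values correlate within a key-rotation window without being reversible. The truncation length and key handling below are assumptions:

```python
import hashlib
import hmac

def ip_hash(ip: str, key: bytes) -> str:
    """Keyed, truncated hash of a client IP for span attributes."""
    digest = hmac.new(key, ip.encode(), hashlib.sha256).hexdigest()
    return digest[:16]                # correlation, not identity
```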

Compliance

  • Retention adheres to tenant policy; trace access is audited with actor and purpose-of-use.

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| otel_traces_exported_total | counter | Spans successfully exported | Sudden drop |
| otel_traces_dropped_total | counter | Dropped spans (queue/limits) | > baseline |
| trace_orphan_spans_total | counter | Spans without valid parent | Spike alert |
| collector_queue_size | gauge | Export queue depth | > 80% capacity |
| trace_tail_sampled_total | counter | Tail-sampled traces | Track ratio |
| trace_query_latency_seconds | histogram | UI/API query latency | p95 > SLO |

Logging Requirements

  • Structured logs: traceId, spanId, dropReason, policyId, tenant, edition. No payload values.

Distributed Tracing

  • (Meta) link exporter spans to service spans; include exemplars on latency histograms (metrics flow).

Health Checks

  • Collector readiness/liveness; backend ingestion status; UI availability.

Operational Procedures

Deployment

  1. Enable OTel SDKs with HTTP/gRPC/DB instrumentation and W3C propagators.
  2. Deploy OTel Collector (batch, memory_limiter, tail_sampling processors).
  3. Wire trace backend and provision dashboards.

Configuration

  • Env: OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_TRACES_SAMPLER, OTEL_RESOURCE_ATTRIBUTES.
  • Tail Sampling (examples): error=true, status_code>=500, latency_ms>500, selective by tenant.
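The tail-sampling examples above map naturally onto the OTel Collector's `tail_sampling` processor. A hedged config fragment (policy names and the baseline rate are illustrative, not ATP's shipped configuration):

```yaml
processors:
  tail_sampling:
    decision_wait: 10s            # buffer spans before deciding per trace
    policies:
      - name: errors              # keep all error traces
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow                # keep slow traces (latency_ms > 500)
        type: latency
        latency: {threshold_ms: 500}
      - name: baseline            # small probabilistic floor for the rest
        type: probabilistic
        probabilistic: {sampling_percentage: 1}
```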

Maintenance

  • Adjust sampling as traffic patterns evolve; rotate retention; prune noisy attributes.

Troubleshooting

  • Missing links → check inject/extract middleware order.
  • High drops → increase collector queues or reduce sampling; inspect backpressure.
  • Cross-tenant leak alerts → confirm baggage sanitizer & RLS.

Testing Scenarios

Happy Path Tests

  • End-to-end trace spans present across Gateway→Ingestion→Storage→Integrity→Projection.
  • Baggage (tenant=acme, edition=enterprise) visible on all spans.

Error Path Tests

  • 400 invalid traceparent handled; new trace created for error path.
  • 404 unknown trace id query returns Problem+JSON, no data leakage.
  • 409 sampling change during export handled without crash.

Performance Tests

  • Sampled high-QPS traffic keeps overhead < 3%.
  • Collector withstands burst without dropping (or drops < policy).

Security Tests

  • No PII in spans/baggage; sanitizer counters near zero.
  • Trace queries scoped to tenant via RLS.

External References

  • W3C Trace Context & Baggage
  • OpenTelemetry Specification

Appendices

A. Example Problem+JSON (invalid trace headers)

{
  "type": "urn:connectsoft:errors/tracing/traceparent.invalid",
  "title": "Invalid W3C traceparent header",
  "status": 400,
  "detail": "Trace ID length not 16 bytes (hex).",
  "traceId": "9f0c1d2e3a4b5c6d..."
}

B. Suggested Span Attribute Keys (allowlist)

  • tenant, edition, environment, region, route, status_code, db.system, db.operation, integrity.blockId, projection.watermark, search.query.kind, export.package.id

Health Check Flow

Implements liveness, readiness, and startup probes with per-component dependency checks and an aggregated status that signals deploy orchestrators (e.g., Kubernetes) for safe rollouts and traffic routing. Probes are budgeted and isolated to avoid noisy-neighbor effects; timeouts and intervals are tuned to service SLOs.


Overview

Purpose: Provide reliable health signaling for deployment safety, traffic gating, and fast failure detection without causing additional load or false negatives.
Scope: Local process liveness, startup warmup, dependency readiness (DB, queue, cache, search, integrity, policy), aggregation, export via HTTP endpoints, and ops overrides (maintenance mode).
Context: Orchestrators consume /health/liveness, /health/readiness, /health/startup. Readiness reflects dependencies & backpressure, not just process up. Liveness is crash/lock detection only.
Key Participants:

  • Service (with HealthCheck library)
  • Dependency Probers (DB/Cache/Queue/Search/Integrity/Policy)
  • Aggregator (health manager + budgeter)
  • Orchestrator (Kubernetes/Service Mesh/Gateway)
  • Ops UI / API (maintenance & overrides)
  • Observability (metrics/logs)

Prerequisites

System Requirements

  • HealthCheck middleware/library enabled with endpoints: /health/liveness, /health/readiness, /health/startup
  • Per-dependency prober with timeouts, concurrency caps, and circuit-break aware checks
  • Clock synchronized (UTC) for timestamps; structured logging enabled
  • Network policies allow orchestrator-to-service health traffic

Business Requirements

  • Defined maintenance mode procedure (drain → mark NotReady → perform ops)
  • Per-tenant/edition readiness policies when dependencies are multi-tenant
  • Runbooks for common failure signatures (DB degraded, queue backlog, index lag)

Performance Requirements

  • Probe p95 ≤ 50 ms for local checks, ≤ 200 ms for remote deps
  • Readiness interval typically 10s–30s; liveness interval 5s–10s
  • Probe CPU overhead < 1%; IO bounded with concurrency limits

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    participant ORCH as Orchestrator (K8s)
    participant SVC as Service
    participant AGG as Health Aggregator
    participant DB as Database
    participant Q as Queue
    participant C as Cache

    ORCH->>SVC: GET /health/startup
    SVC->>AGG: Run startup checks (one-time warmups)
    AGG-->>SVC: status: Up
    SVC-->>ORCH: 200 OK {status:"Up"}

    ORCH->>SVC: GET /health/readiness
    SVC->>AGG: Parallel probers (DB/Q/Cache) with budgets
    AGG->>DB: ping (timeout ≤ 150ms)
    AGG->>Q: depth/head check
    AGG->>C: get/set key
    DB-->>AGG: OK
    Q-->>AGG: OK
    C-->>AGG: OK
    AGG-->>SVC: Ready
    SVC-->>ORCH: 200 OK {status:"Ready", components:[...]}

    ORCH->>SVC: GET /health/liveness
    SVC-->>ORCH: 200 OK {status:"Alive"}

Alternative Paths

  • Maintenance mode: Ops toggles → service returns 503 on readiness with Retry-After, keeps liveness 200 to avoid restarts during planned work.
  • Degraded-but-Serving: Non-critical dependency fails; readiness remains 200 with warnings[], traffic allowed but autoscaler informed via metrics.
  • Backpressure-aware readiness: If queue depth/backlog exceeds threshold, respond 429 Too Many Requests (optionally) or 503 with reason to trigger traffic shifting.

Error Paths

sequenceDiagram
    participant ORCH as Orchestrator
    participant SVC as Service
    participant AGG as Health Aggregator
    participant DB as Database

    ORCH->>SVC: GET /health/readiness
    SVC->>AGG: Run checks
    AGG->>DB: ping
    DB-->>AGG: timeout
    AGG-->>SVC: NotReady {db:"Timeout"}
    alt Not Ready
        SVC-->>ORCH: 503 Service Unavailable (Problem+JSON)
    end

Request/Response Specifications

Input Requirements

| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| GET /health/liveness | http | Y | Process health (no deps) | Always lightweight |
| GET /health/startup | http | Y | Warmup complete? | One-time gates |
| GET /health/readiness | http | Y | Dependency/traffic readiness | Budgeted checks |
| POST /ops/v1/health:maintenance | http | O | Enter/exit maintenance | AuthZ required |
| Authorization (ops) | header | O | Bearer <JWT> | Role ops:health |
| traceparent | header | O | Trace exemplar correlation | Optional |
| full=true (query) | bool | O | Include per-component detail | RBAC for PII masking |

Output Specifications

200 OK (Readiness/Liveness/Startup)

{
  "status": "Ready",
  "service": "ingestion",
  "time": "2025-10-27T08:21:45Z",
  "warnings": [],
  "components": [
    {"name":"db", "type":"postgres", "status":"Up", "latencyMs": 32},
    {"name":"queue", "type":"rabbitmq", "status":"Up", "latencyMs": 18},
    {"name":"cache", "type":"redis", "status":"Up", "latencyMs": 4}
  ]
}

503 Service Unavailable (Not Ready)

{
  "type": "urn:connectsoft:errors/health/not-ready",
  "title": "Readiness check failed",
  "status": 503,
  "detail": "postgres timeout; queue connecting",
  "retryAfterSeconds": 10
}

Maintenance Mode Toggle

// POST /ops/v1/health:maintenance
{ "enabled": true, "reason": "DB failover", "ttlSeconds": 900 }

Error Handling

Error Scenarios

| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Invalid maintenance payload (negative TTL/unknown field) | Fix request | — |
| 401 | Missing/invalid JWT for ops endpoint | Obtain token | Retry after renewal |
| 403 | Caller lacks ops:health role | Request access | — |
| 404 | Unknown component in ?component= query | Remove/rename | — |
| 409 | Conflicting state change (maintenance enabled while drain in progress) | Wait or cancel prior op | Retry after resolution |
| 429 | Health endpoint rate-limited (human/automation abuse) | Back off | Jittered retry |
| 503 | Not Ready (dependency down/backpressure) | Remediate dependency | Retry after Retry-After |
| 504 | Probe exceeded timeout budget | Increase timeout if justified | Backoff; verify load |

Failure Modes

  • Noisy-neighbor probes: too-frequent or heavy checks cause dependency load → enforce intervals, timeouts, and read-only probes.
  • Coupled liveness/readiness: using dependency checks for liveness causes restarts → separate strictly.
  • Flapping readiness: thresholds too tight → add stabilization window and hysteresis.
  • Leaky details: exposing internal hostnames/errors externally → sanitize messages.
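The stabilization-window/hysteresis fix for flapping readiness can be sketched as a small state machine; the thresholds below are assumptions to be tuned against your probe intervals:

```python
class ReadinessHysteresis:
    """Flip readiness only after N consecutive failures/successes."""

    def __init__(self, fail_threshold: int = 3, ok_threshold: int = 2):
        self.fail_threshold = fail_threshold
        self.ok_threshold = ok_threshold
        self.ready = True
        self._fails = 0
        self._oks = 0

    def observe(self, probe_ok: bool) -> bool:
        if probe_ok:
            self._oks += 1
            self._fails = 0
            if not self.ready and self._oks >= self.ok_threshold:
                self.ready = True     # sustained recovery: report Ready
        else:
            self._fails += 1
            self._oks = 0
            if self.ready and self._fails >= self.fail_threshold:
                self.ready = False    # sustained failure: report NotReady
        return self.ready
```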

Recovery Procedures

  1. Enter maintenance mode → drain traffic (readiness 503), keep liveness 200, perform remediation.
  2. Enable degraded mode for non-critical deps; keep serving with warnings.
  3. Increase probe intervals/timeouts cautiously; verify impact via metrics.

Performance Characteristics

Latency Expectations

  • Liveness: p95 ≤ 5 ms; Start-up: first success within warmup target; Readiness: p95 ≤ 150–200 ms.

Throughput Limits

  • Cap concurrent dependency checks (e.g., max 2 per dep per instance).
  • Global RPS limit on health endpoints to prevent abuse.

Resource Requirements

  • Minimal CPU; network usage proportional to dependency checks; cache results for stabilization window (e.g., 2–5s).

Scaling Considerations

  • Shard readiness by tenant/shard if dependencies are partitioned; expose components[].partition.
  • Push passive signals (e.g., queue depth) from dependencies to reduce active probing.

Security & Compliance

Authentication

  • Health endpoints for orchestrator may be anonymous inside cluster (network-policy protected). Ops endpoints require OIDC JWT.

Authorization

  • Roles: ops:health.read, ops:health.maintain.

Data Protection

  • Mask error details in public readiness; full component diagnostics behind RBAC. No secrets in responses.

Compliance

  • Health state transitions and maintenance toggles audited with actor, reason, and duration.

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| health_readiness_status | gauge | 1=Ready, 0=NotReady | 0 for >1m |
| health_probe_latency_ms{component} | histogram | Per-component probe latency | p95 breach |
| health_notready_total{reason} | counter | Fail events by reason | Spike alert |
| health_maintenance_mode | gauge | 1 when enabled | Unexpected >0 |
| health_flaps_total | counter | Ready↔NotReady transitions | >N/hour |

Logging Requirements

  • Structured logs: probe, component, latencyMs, result, timeout, reason, traceId.

Distributed Tracing

  • Health endpoints not traced by default (to reduce noise); ops toggles may emit spans with attributes maintenance=true.

Health Checks

  • Internal self-check (threadpool saturation, GC, disk space).
  • Dependency checks with budgeted timeouts and circuit-breaker awareness.

Operational Procedures

Deployment

  1. Expose /health/liveness, /health/readiness, /health/startup.
  2. Configure orchestrator probes and thresholds (see Appendix).
  3. Register metrics and alerts; link runbooks.

Configuration

  • Env: HEALTH_READINESS_TIMEOUT_MS, HEALTH_PROBE_INTERVAL_S, HEALTH_STABILIZATION_WINDOW_S, HEALTH_MAX_CONCURRENCY, HEALTH_MAINTENANCE_TTL_S.
  • Policy: which dependencies are critical vs advisory for readiness.

Maintenance

  • Use ops endpoint to enable maintenance → drain → operate → disable → verify readiness.

Troubleshooting

  • Frequent flaps → extend stabilization, review dependency SLOs.
  • Probes time out → check network/circuit breaker; raise timeout only with evidence.
  • Orchestrator killing pods unexpectedly → confirm liveness is local-only.

Testing Scenarios

Happy Path Tests

  • Startup becomes Up after caches warmed; readiness 200.
  • All components return Up; status JSON includes latencies.

Error Path Tests

  • DB timeout triggers readiness 503 with sanitized Problem+JSON.
  • 400 invalid maintenance payload rejected; 404 unknown component; 409 conflicting state change handled.

Performance Tests

  • Probe p95 ≤ 200 ms under load; intervals respected; no excess CPU/IO.
  • High RPS to health endpoints remains within rate limits.

Security Tests

  • Public readiness hides internals; full diagnostics gated by RBAC.
  • Audit records for maintenance toggles captured.

External References

  • Kubernetes probe guidance (liveness/readiness/startup)

Appendices

A. Example Kubernetes Probes

livenessProbe:
  httpGet: { path: /health/liveness, port: 8080 }
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 2
readinessProbe:
  httpGet: { path: /health/readiness, port: 8080 }
  initialDelaySeconds: 20
  periodSeconds: 15
  timeoutSeconds: 2
  successThreshold: 1
  failureThreshold: 3
startupProbe:
  httpGet: { path: /health/startup, port: 8080 }
  failureThreshold: 30
  periodSeconds: 5

B. Example Problem+JSON (Not Ready)

{
  "type": "urn:connectsoft:errors/health/not-ready",
  "title": "Readiness check failed",
  "status": 503,
  "detail": "queue backlog > threshold; integrity service degraded",
  "retryAfterSeconds": 15
}

Alert Generation Flow

Turns signals into action: evaluates thresholds and SLO burn rates, fires alerts, routes to pager/chat/webhook, opens a ticket, and auto-closes on recovery. Noise is controlled via grouping, inhibition, dedup windows, silences, and maintenance calendars. Escalation paths are explicit and auditable.


Overview

Purpose: Deliver timely, actionable notifications with clear ownership and escalation while minimizing false positives.
Scope: Rule evaluation, grouping/dedup, routing, paging/notifications, ticket creation, auto-resolve, silencing and inhibition controls.
Context: Metrics and events feed a Rule Engine (e.g., Prometheus Ruler). Alerts traverse a Router (Alertmanager-like) to destinations: PagerDuty/On-call, Chat (Slack/Teams), Webhook (runbooks/automation), and Ticketing (Jira/ServiceNow).
Key Participants:

  • Metrics Backend / Rule Engine
  • Alert Router (grouping, dedup, silences, inhibition)
  • Destinations: Pager, Chat, Webhook, Ticketing
  • On-call Engineer / Team
  • Ops API/UI (manage silences, ack, routes)
  • Runbooks (linked from alerts)

Prerequisites

System Requirements

  • Metrics and logs published with low cardinality labels (tenant, shard, region, service)
  • Rule Engine with multi-window SLO burn capability and dependency-aware inhibition
  • Alert Router HA with persistent silences and dedup state
  • Integrations to pager/chat/ticket with retry & backoff

Business Requirements

  • Defined ownership map: service → team → escalation policy
  • Runbooks per alert with clear first actions and diagnostics links
  • Maintenance windows / change freeze calendars integrated

Performance Requirements

  • End-to-end alerting latency p95 ≤ 30s from breach to page
  • Router throughput sized for peak fan-out; delivery retries with backoff
  • Dedup window defaults (e.g., 5m) to limit paging storms

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    participant MET as Metrics/Rule Engine
    participant RTR as Alert Router
    participant PD as Pager (On-call)
    participant CHAT as Chat (Slack/Teams)
    participant TKT as Ticketing (Jira/SNOW)
    participant OPS as On-call Engineer

    MET->>RTR: Alert{labels, annotations, status="firing"}
    RTR->>RTR: Group & dedup (fingerprint), apply inhibition/silences
    RTR->>PD: Page (severity=page, service=ingestion)
    RTR->>CHAT: Notify #oncall-ingestion (runbook link)
    RTR->>TKT: Create ticket (P1) with alert context
    PD-->>OPS: Page delivered (push/phone/SMS)
    OPS->>TKT: Acknowledge ticket, start mitigation
    MET-->>RTR: Alert{status="resolved"}
    RTR->>PD: Auto-resolve page
    RTR->>TKT: Auto-close with resolution note
    RTR->>CHAT: Post recovery message

Alternative Paths

  • Warning-only: severity warn → chat/webhook only, no page.
  • Escalation: no ack within 10m → escalate to secondary, then manager-on-call.
  • Bulk correlation: many shard alerts collapse into one parent incident with children inhibited.
  • Auto-remediation: webhook triggers safe runbook; success posts to thread and downgrades severity.

Error Paths

sequenceDiagram
    participant MET as Metrics/Rule Engine
    participant RTR as Alert Router
    participant PD as Pager

    MET->>RTR: Alert firing
    alt 400 Bad Request (invalid labels/size)
        RTR-->>MET: 400 drop + audit
    else 404 Destination not configured
        RTR-->>MET: 404, fallback to default route
    else 409 Conflict (duplicate route update)
        RTR-->>MET: 409, keep last-good config
    else 429/503 Pager API throttled/outage
        RTR-->>PD: retry with backoff, queue locally
    end

Request/Response Specifications

Input Requirements (Alert Payload to Router)

| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| status | enum | Y | — | firing \| resolved |
| labels | map | Y | {alertname, service, tenant, shard, severity} | size ≤ 50, allowlist keys |
| annotations | map | O | {summary, description, runbook, dashboard} | ≤ 4KB |
| startsAt / endsAt | RFC3339 | Y/O | When firing/resolved | UTC |
| generatorURL | url | O | Link to rule source | valid URL |
| fingerprint | string | O | Stable dedup key | computed if missing |
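The "computed if missing" fingerprint above could be derived as a stable hash over sorted label pairs, so the same alert always dedups to one key regardless of label ordering. The truncation to 16 hex chars is an assumption:

```python
import hashlib

def fingerprint(labels: dict[str, str]) -> str:
    """Stable dedup key: hash of canonically ordered label pairs."""
    canonical = "\n".join(f"{k}={labels[k]}" for k in sorted(labels))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]
```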

Output Specifications (Destinations)

  • Pager: payload includes service, severity, routing_key, dedup_key=fingerprint, links (runbook/dashboards).
  • Chat: message with summary, top labels, graph image link, ack emoji workflow.
  • Ticket: fields summary, description, priority, labels, customFields (tenant/shard), plus auto-close comment on resolve.
  • Webhook: signed POST with HMAC; body includes current status, last N samples, silence suggestions.
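The signed-webhook contract above implies HMAC on send and constant-time verification on receive (a mismatch yields the 412 in the error table). A sketch assuming an SHA-256 HMAC over the raw request body:

```python
import hashlib
import hmac

def sign(body: bytes, secret: bytes) -> str:
    """HMAC-SHA256 signature of the raw webhook body (hex-encoded)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()

def verify(body: bytes, secret: bytes, signature: str) -> bool:
    """Constant-time comparison avoids timing side channels."""
    return hmac.compare_digest(sign(body, secret), signature)
```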

Example Payloads

// Alert to Router (condensed)
{
  "status": "firing",
  "labels": {
    "alertname": "ProjectionLagHigh",
    "service": "projection",
    "tenant": "acme",
    "severity": "page",
    "region": "eu-west-1"
  },
  "annotations": {
    "summary": "Projection lag > 60s",
    "description": "Watermark delay crossing SLO for tenant=acme.",
    "runbook": "https://runbooks/projection-lag",
    "dashboard": "https://grafana/d/lag"
  },
  "startsAt": "2025-10-27T08:15:00Z",
  "generatorURL": "prom://ruler/expr/123"
}
# Burn-rate rule example (SLO 99.9% over 30d)
- alert: IngestSLOBurnHigh
  expr: |
    (sum(rate(atp_ingest_errors_total[5m])) by (service,tenant)
     / sum(rate(atp_ingest_requests_total[5m])) by (service,tenant))
    > (0.001 * 14.4)
  for: 5m
  labels: {severity: page, service: ingestion}
  annotations:
    summary: "Ingest SLO fast burn (5m)"
    runbook: "https://runbooks/ingest-slo"

Error Handling

Error Scenarios

| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Invalid alert payload (missing labels/oversized) | Fix rule/labels; drop event | No retry until fixed |
| 401 | Webhook/Pager auth failed | Rotate tokens/keys | Retry after renewal |
| 403 | Route not permitted for tenant/edition | Update RBAC/route policy | — |
| 404 | Route/destination not found | Use default route; fix config | — |
| 409 | Concurrent route config updates | Apply last-write-wins or CAS | Retry after fetch |
| 412 | HMAC signature mismatch (webhook) | Recalculate with correct secret | — |
| 429 | Destination rate-limiting | Honor vendor backoff | Exponential backoff + jitter |
| 503 | Destination outage | Queue & retry within TTL | Progressive backoff, failover route |

Failure Modes

  • Alert storms: ungrouped high-cardinality labels → enable grouping keys and label sanitization.
  • Flapping: thresholds too tight → add for: windows and hysteresis.
  • Cascading pages: child alerts page alongside parent → add inhibition until parent resolves.
  • Silent failures: misconfigured routes → periodic synthetic alerts verify end-to-end.

Recovery Procedures

  1. Activate global silence or maintenance mode during planned incidents.
  2. Expand grouping and increase group_wait/group_interval to dampen bursts.
  3. Fail over to secondary pager provider if primary remains 503/429 beyond SLO.

Performance Characteristics

Latency Expectations

  • Signal-to-page p95 ≤ 30s; chat/webhook p95 ≤ 15s; ticket creation ≤ 60s.

Throughput Limits

  • Router handles thousands of alerts/min with grouping; per-destination QPS caps and queues.

Resource Requirements

  • Router memory for dedup store and silence registry; HA storage (e.g., S3/object store or DB) for persistence.

Scaling Considerations

  • Partition routes by region and service; replicate router HA; shard rules by domain.

Security & Compliance

Authentication

  • Mutual TLS for webhook receivers; OAuth tokens/keys for pager/ticket/chat APIs.

Authorization

  • Route policies per tenant/edition; ops roles to create silences and modify routes (ops:alerts.*).

Data Protection

  • Do not include PII in labels/annotations; link dashboards instead of embedding raw data.

Compliance

  • All alert lifecycle actions (fire/route/ack/resolve/silence) audited with actor, reason, and timestamps.

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| alerts_firing_total | gauge | Active firing alerts | Trend by service |
| alerts_notifications_sent_total | counter | Deliveries by destination | Sudden drop |
| alerts_delivery_failures_total | counter | Failed sends by dest | Spike alert |
| alerts_routing_latency_seconds | histogram | Router processing latency | p95 breach |
| alerts_silences_active | gauge | Current silences | Unexpected growth |
| alerts_inhibited_total | counter | Child alerts inhibited | Track correlation |

Logging Requirements

  • Structured logs: alertname, fingerprint, status, route, destination, deliveryId, retry, actor (for silences/acks).

Distributed Tracing

  • Trace Router pipeline (ingest→group→deliver); attach exemplars to routing latency histograms.

Health Checks

  • Router readiness includes destination probes (token check, rate-limit status); synthetic canaries validate end-to-end.

Operational Procedures

Deployment

  1. Deploy Rule Engine & Router HA; configure storage for silences/dedup.
  2. Create base routes (page/warn/info) and default receivers.
  3. Set up synthetic alerts per region/service.

Configuration

  • Router: group_by: [alertname, service, tenant], group_wait: 10s, group_interval: 5m, repeat_interval: 2h.
  • Escalation: ack timeout 10m; primary → secondary → manager-on-call.
  • Webhook HMAC secret rotation schedule.

Maintenance

  • Review top talkers weekly; reduce cardinality; tune thresholds and for: windows.
  • Validate runbook links and dashboard IDs quarterly.

Troubleshooting

  • No pages received → check destination quotas, auth, and router queue depths.
  • Excess noise → increase grouping, add inhibition rules, widen hysteresis.
  • Auto-close not working → verify resolved events flow and ticket webhooks.

Testing Scenarios

Happy Path Tests

  • Fire ProjectionLagHigh → page+chat+ticket created; resolves and auto-closes on recovery.
  • Warning-only alert posts to chat without paging.

Error Path Tests

  • 400/404 misrouted alerts handled; default route used.
  • 429/503 destination throttling triggers retries and eventual delivery/failover.

Performance Tests

  • Burst of 10k alerts grouped to ≤ 100 pages; router p95 latency within SLO.
  • Dedup prevents duplicate pages across replicas.

Security Tests

  • Webhook HMAC verified; invalid signature (412) rejected.
  • No PII in labels/annotations; audits present for silences/acks.

External References

  • SRE Workbook: Multi-window, multi-burn-rate alerts
  • Vendor APIs: PagerDuty/Slack/Jira

Appendices

A. Router Route Snippet (YAML)

route:
  group_by: ['alertname','service','tenant']
  group_wait: 10s
  group_interval: 5m
  repeat_interval: 2h
  receiver: 'default'
  routes:
    - match: {severity: 'page'}
      receiver: 'pager'
      continue: true
    - match: {severity: 'page'}
      receiver: 'chat'
    - match_re: {severity: 'warn|info'}
      receiver: 'chat'

receivers:
  - name: pager
    pagerduty_configs:
      - routing_key: ${PAGERDUTY_KEY}
        dedup_key: '{{ .GroupLabels.fingerprint }}'
  - name: chat
    slack_configs:
      - channel: '#oncall-{{ .GroupLabels.service }}'
        title: '{{ .CommonLabels.alertname }}'
        text: '{{ .CommonAnnotations.summary }}\n{{ .CommonAnnotations.runbook }}'

B. Example Silence (API)

POST /ops/v1/alerts/silences
{
  "matchers": [{"name":"service","value":"projection","isRegex":false}],
  "startsAt": "2025-10-27T08:00:00Z",
  "endsAt": "2025-10-27T10:00:00Z",
  "createdBy": "deploy-bot",
  "comment": "Planned projection migration"
}

Tenant Onboarding Flow

Provisions and activates a new tenant with IdP linkage, policy defaults, partitioned storage & indexes, per-tenant KMS keys and residency settings. Ensures strict isolation (RLS) and emits onboarding welcome/events. All steps are idempotent and fully audited.


Overview

Purpose: Safely create a tenant boundary (identity, data, policy, encryption, residency) and make it ready for use.
Scope: Intake → validation → IdP linkage → resource provisioning (storage/projections/search) → policy/key/residency setup → activation → welcome events. Excludes billing system specifics.
Context: Orchestrated by Onboarding Service with calls to Identity/IdP, Policy, Storage/Projection/Search, KMS/Secrets, and Notifications.
Key Participants:

  • Tenant Admin / Operator
  • Onboarding Service (orchestrator)
  • Identity/SSO (SAML/OIDC, optional SCIM)
  • Policy Service (defaults: retention, redaction)
  • Storage Service (append store partitions)
  • Projection/Search Services (read models, index aliases)
  • KMS / Secrets (per-tenant keys/creds)
  • Notification/Webhooks

Prerequisites

System Requirements

  • Onboarding API enabled with RBAC and idempotency support
  • KMS, Storage, Projection DB, and Search clusters reachable and quota available
  • DNS/Domain verification service (for SAML domains)
  • OTel tracing/metrics active for step diagnostics

Business Requirements

  • Approved edition/plan matrix (limits, features)
  • Default policy bundles per edition/region (retention, redaction profiles)
  • Residency catalog (allowed regions per tenant)

Performance Requirements

  • Synchronous intake p95 ≤ 300 ms; async provisioning target < 2 min
  • Parallelizable steps (keys/indexes) with bounded concurrency
  • Backpressure handling when cluster capacity is constrained
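The parallelizable provisioning steps with bounded concurrency could be orchestrated as below; step names mirror the happy-path diagram, and the concurrency cap is an assumption:

```python
import asyncio

async def provision(steps, max_concurrency: int = 3):
    """Run provisioning steps in parallel, at most max_concurrency at once."""
    sem = asyncio.Semaphore(max_concurrency)

    async def run(name, step_fn):
        async with sem:               # bounded concurrency per tenant create
            return name, await step_fn()

    results = await asyncio.gather(*(run(n, f) for n, f in steps))
    return dict(results)
```

The onboarding orchestrator would pass coroutines for the KMS key, storage partition, projection schema, and index-alias steps, then run its health checks on the returned statuses.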

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    actor TA as Tenant Admin
    participant GW as API Gateway
    participant ONB as Onboarding Service
    participant IDP as Identity/IdP
    participant POL as Policy Service
    participant KMS as KMS/Secrets
    participant ST as Storage (Append)
    participant PR as Projection DB
    participant IX as Search Index
    participant NTF as Notifications/Webhooks

    TA->>GW: POST /tenants/v1 (tenantSlug, region, edition, idpConfig, adminEmails)
    GW->>ONB: CreateTenant (idempotency-key)
    ONB->>ONB: Validate & reserve tenantId/slug (unique)
    ONB->>IDP: Link IdP / Verify domain (SAML/OIDC/SCIM)
    ONB->>POL: Apply default policies (retention/redaction)
    par Provision resources
        ONB->>KMS: Create tenant key + alias (kid)
        ONB->>ST: Create partition/shard & RLS bindings
        ONB->>PR: Create schemas (namespaced) & watermarks
        ONB->>IX: Create per-tenant index alias/mappings
    end
    ONB->>ONB: Health checks (readiness of resources)
    ONB-->>GW: 202 Accepted {tenantId, status:"Provisioning", resumeToken}
    GW-->>TA: 202 Accepted {tenantId, status:"Provisioning", resumeToken}
    ONB->>NTF: Emit Tenant.Provisioned
    TA->>GW: POST /tenants/v1/{tenantId}:activate
    GW->>ONB: ActivateTenant
    ONB-->>GW: 200 OK {status:"Active"}
    GW-->>TA: 200 OK {status:"Active"}
    ONB->>NTF: Emit Tenant.Activated + Welcome

Alternative Paths

  • Deferred IdP linkage: create tenant with local admin; link IdP later via /link-idp.
  • Pre-provisioned resources: BYO KMS key or existing index namespace accepted when validated.
  • Staged activation: keep status="Provisioned" until external readiness checks pass.

Error Paths

sequenceDiagram
    participant TA as Tenant Admin
    participant GW as API Gateway
    participant ONB as Onboarding Service
    participant IDP as Identity/IdP

    TA->>GW: POST /tenants/v1 {invalid payload or duplicate slug}
    alt 400 Bad Request (invalid/unsupported fields)
        GW-->>TA: 400 Problem+JSON
    else 409 Conflict (slug/domain already in use)
        GW-->>TA: 409 Problem+JSON
    else 422 Unprocessable (IdP metadata invalid, domain not verified)
        ONB-->>GW: 422 Problem+JSON
        GW-->>TA: 422 Problem+JSON
    else 503 Dependency unavailable (KMS/Search/DB)
        GW-->>TA: 503 Problem+JSON (+Retry-After)
    end

Request/Response Specifications

Input Requirements

| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| Method/Path | POST /tenants/v1 | Y | Create tenant | JSON body |
| Authorization | header | Y | Admin/ops JWT | Role tenants:create |
| idempotency-key | header | O | De-dupe create | ≤128 chars |
| tenantSlug | string | Y | Human slug (acme) | ^[a-z0-9-]{3,40}$, unique |
| displayName | string | Y | Tenant display name | 3–100 chars |
| edition | enum | Y | free \| standard \| enterprise | allowlist |
| region | enum | Y | Residency region | allowlist |
| idpConfig | object | O | SAML/OIDC metadata/urls | schema-validated |
| adminEmails[] | array | Y | Initial admins | valid emails |
| webhooks[] | array | O | Event targets (HMAC) | URL + secret |

Control

  • GET /tenants/v1/{tenantId} → status (Provisioning|Provisioned|Active|Failed), components health
  • POST /tenants/v1/{tenantId}:activate → promote to Active
  • POST /tenants/v1/{tenantId}:link-idp → attach/replace IdP config
  • POST /tenants/v1/{tenantId}:rotate-keys → new KMS key version (dual-read window)
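The status/activate pair above is typically consumed as a poll-until-terminal loop on the client side. A minimal sketch; the `get_status` callable is a stand-in for the authenticated GET /tenants/v1/{tenantId} call and is not part of the API:

```python
import time

TERMINAL = {"Active", "Failed"}

def wait_for_activation(get_status, tenant_id, timeout_s=300, poll_s=5):
    """Poll tenant status until a terminal state is reached.

    `get_status(tenant_id)` returns the lifecycle status string
    (Provisioning|Provisioned|Active|Failed); in production it would
    wrap the HTTP call with auth headers and the resumeToken.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = get_status(tenant_id)
        if status in TERMINAL:
            return status
        time.sleep(poll_s)
    raise TimeoutError(f"tenant {tenant_id} still provisioning after {timeout_s}s")
```

The caller decides whether `Failed` warrants a rerun with the same idempotency-key (see Recovery Procedures).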

Output Specifications

| Field | Type | Description | Notes |
|---|---|---|---|
| tenantId | string (ULID/GUID) | System identifier | Immutable |
| tenantSlug | string | Human label | Unique, mutable with policy |
| status | enum | Lifecycle status | see above |
| kid | string | Current KMS key id | For integrity/signing |
| residency | object | Region/data classes | PII routing policy |
| policyBundle | object | Defaults applied | versioned |
| endpoints | object | Tenant endpoints/aliases | for SDK setup |

Example Payloads

Create Tenant

{
  "tenantSlug": "acme",
  "displayName": "Acme Corp",
  "edition": "enterprise",
  "region": "eu-west",
  "idpConfig": {
    "type": "saml",
    "metadataUrl": "https://idp.acme.com/metadata.xml",
    "domains": ["acme.com"]
  },
  "adminEmails": ["secops@acme.com","platform@acme.com"]
}

Create Response (202)

{
  "tenantId": "01JF6V3A6W1T6E2TB1C2N2YV9Q",
  "tenantSlug": "acme",
  "status": "Provisioning",
  "resumeToken": "onb_7b2d..."
}

Activate

POST /tenants/v1/01JF6V3A6W1T6E2TB1C2N2YV9Q:activate

Error Handling

Error Scenarios

| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Invalid slug, edition, region; missing admins | Fix payload | |
| 401 | Missing/invalid admin JWT | Authenticate | Retry after renewal |
| 403 | Plan/edition not allowed for region | Choose allowed combo | |
| 404 | Unknown tenantId (status/activate/link) | Verify id | |
| 409 | tenantSlug or domain already bound to another tenant | Pick new slug / release domain | |
| 412 | Activation preconditions unmet (resources not healthy) | Wait for ready; fix failing component | Conditional retry |
| 422 | IdP metadata invalid, DNS TXT not verified | Correct & re-submit | |
| 429 | Create rate-limited | Back off | Exponential backoff + jitter |
| 503 | KMS/Storage/Search unavailable | Retry later | Respect Retry-After |
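The 429/503 guidance above (exponential backoff with jitter, honoring Retry-After) reduces to a small delay helper. A sketch; the base and cap defaults are illustrative, not part of the contract:

```python
import random

def retry_delay(attempt, retry_after=None, base=0.5, cap=30.0):
    """Seconds to wait before retrying a 429/503 response.

    A server-provided Retry-After value always wins; otherwise use
    exponential backoff with full jitter: uniform(0, base * 2^attempt),
    capped so late attempts do not wait unboundedly.
    """
    if retry_after is not None:
        return float(retry_after)
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

Full jitter (rather than a fixed exponential delay) spreads retries from many clients so a recovering KMS/Search/DB dependency is not re-saturated.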

Failure Modes

  • Partial provisioning: some resources created; idempotent reruns resume from checkpoints.
  • Cross-tenant leakage risk: misbound index alias or RLS → automated sanity checks block activation.
  • IdP domain hijack: require DNS TXT proof + admin email domain match.

Recovery Procedures

  1. Use status API to inspect failing step; rerun with same idempotency-key.
  2. Roll back or repair mis-provisioned resources (Compensation flow) before activation.
  3. Re-verify domain/IdP, then call :activate.

Performance Characteristics

Latency Expectations

  • POST /tenants/v1: p95 ≤ 300 ms (enqueue & reserve).
  • Provisioning background: typical 30–120 s (parallelized steps).
  • Activation p95 ≤ 200 ms after readiness.

Throughput Limits

  • Controlled by cluster quotas; default ≤ 5 concurrent onboardings per region.

Resource Requirements

  • Onboarding workers sized for parallel KMS/DB/Index operations; cautious with index creation.

Scaling Considerations

  • Shard provisioning queues by region; backpressure from dependent clusters pauses new starts.
  • Pre-create warm pools (schemas/aliases) for popular editions.

Security & Compliance

Authentication

  • Admin/ops endpoints require OIDC JWT; service-to-service with mTLS.

Authorization

  • Roles: tenants:create|read|activate|link-idp|rotate-keys.
  • Least-privilege service identities for each provisioning step.

Data Protection

  • Tenant KMS key per tenant; secrets stored encrypted; residency enforced across storage/search/projections.
  • No PII stored beyond admin contacts; audit all operations.

Compliance

  • Emit Tenant.Provisioned|Activated|Failed|IdpLinked events with actor, reason, evidence.
  • Residency and key policies attached to tenant record for audits.

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| tenant_onboard_started_total | counter | Onboarding requests | Anomaly trend |
| tenant_onboard_completed_total | counter | Successful onboardings | Drop vs start |
| tenant_onboard_duration_seconds | histogram | End-to-end time | p95 > 180s |
| tenant_onboard_step_failures_total{step} | counter | Failures per step | Spike alert |
| tenant_activation_gates_open | gauge | Waiting for readiness | Stuck > 10m |

Logging Requirements

  • Structured logs with tenantId, tenantSlug, step, result, component, retry, traceId. Mask secrets/metadata.

Distributed Tracing

  • Span per step: idp.link, kms.key.create, storage.partition.create, projection.schema.create, index.alias.create, activate. Include tenantSlug, region, edition.

Health Checks

  • Readiness depends on KMS, Storage, DB, Search; onboarding worker queue depth monitored.

Operational Procedures

Deployment

  1. Deploy Onboarding Service with worker pool and step registry.
  2. Configure RBAC, KMS access policies, and cluster credentials per region.
  3. Register default policy bundles and residency maps.

Configuration

  • Env: ONB_MAX_CONCURRENCY, ONB_REGION_ALLOWLIST, ONB_IDP_DOMAIN_TTL, ONB_PROVISION_TIMEOUT_S.
  • Policies: default retention/redaction per edition; index templates per region.

Maintenance

  • Rotate service credentials; rotate default index templates; verify domain verification CA chains.
  • Periodic dry runs in staging.

Troubleshooting

  • 409 slug/domain → list bindings, confirm ownership.
  • 422 IdP → validate metadata XML/JWKS, DNS TXT ownership.
  • Activation stuck → inspect failing component health; run targeted repair.

Testing Scenarios

Happy Path Tests

  • Create → provision all components → activate → welcome events emitted.
  • IdP linked and login works for admin users.

Error Path Tests

  • 400 invalid payload; 409 duplicate slug/domain; 404 unknown tenant.
  • 412 activation blocked until readiness passes; succeeds after fix.
  • 422 invalid IdP metadata rejected with clear Problem+JSON.

Performance Tests

  • Parallel onboardings (N=5) complete within target; no cluster saturation.
  • Index/schema creation time within SLO per region.

Security Tests

  • RLS verified—tenant cannot query others’ data.
  • Residency enforced—data and indexes created only in chosen region.
  • Audit events present for all steps; secrets never logged.

Internal References

External References

  • SAML / OIDC specs (metadata, JWKS)
  • Regional residency regulations (org policy)

Appendices

A. DNS TXT Verification (example)

_acme-verify.atp.example.com  TXT  atp-verify=01JF6V3A6W1T6E2T
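A receiver-side check of this proof can be sketched as follows; it assumes the TXT values were already fetched by a resolver and only validates the `atp-verify=<token>` format shown above:

```python
def txt_record_matches(txt_values, expected_token):
    """Verify the DNS TXT ownership proof for a tenant domain.

    `txt_values` is the list of TXT strings returned for
    _acme-verify.<domain>; fetching them is left to a resolver so the
    check itself stays pure and testable.
    """
    expected = f"atp-verify={expected_token}"
    return any(v.strip() == expected for v in txt_values)
```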

B. Example Events

{
  "type": "Tenant.Provisioned",
  "tenantId": "01JF6V3A6W1T6E2TB1C2N2YV9Q",
  "region": "eu-west",
  "kid": "kms:eu-west:acme:v1",
  "time": "2025-10-27T08:05:21Z"
}
{
  "type": "Tenant.Activated",
  "tenantId": "01JF6V3A6W1T6E2TB1C2N2YV9Q",
  "time": "2025-10-27T08:06:41Z",
  "endpoints": {
    "ingest": "https://eu-west.api.atp/ingest/acme",
    "query": "https://eu-west.api.atp/query/acme"
  }
}
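Webhook targets register a URL plus an HMAC secret (see the input table), so events like these would be delivered with a signature the receiver verifies. A sketch assuming HMAC-SHA256 over the raw body; the exact header name and scheme are not specified in this document:

```python
import hashlib
import hmac
import json

def sign_event(secret: bytes, body: bytes) -> str:
    """Compute the hex HMAC-SHA256 a webhook sender would attach
    (e.g. in a signature header; header name is an assumption)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()

def verify_event(secret: bytes, body: bytes, signature: str) -> bool:
    """Constant-time comparison on the receiver side."""
    return hmac.compare_digest(sign_event(secret, body), signature)
```

Receivers should verify over the raw bytes before any JSON parsing, since re-serialization can change key order and break the signature.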

Schema Evolution Flow

Rolls out safe, additive schema changes across write and read paths. Publishes vNext to the Schema Registry, advertises availability via SDK/Gateway announcements, runs a dual-write / tolerant-read window (projectors, search), and executes a sunset plan for deprecated fields. Enforces a compatibility matrix to prevent breaking consumers.


Overview

Purpose: Introduce new fields/enums without breaking existing producers/consumers; coordinate rollout and rollback with clear guardrails.
Scope: Registry publish & validation → SDK/Gateway announcement → producer feature flag/canary → dual-write (events, projections) → tolerant-read (unknown fields) → metrics/alerts → deprecation & sunset. Excludes large-scale data migrations (covered by backfill runbooks).
Context: Works with Ingestion, Projection, Search, Export, and SDKs. Contracts defined in JSON Schema / Protobuf; REST/gRPC negotiate schema version via headers/metadata.
Key Participants:

  • Schema Author (engineer)
  • Schema Registry (validation, compatibility rules)
  • API Gateway / SDKs (announce, negotiate)
  • Producers (writers; may dual-write)
  • Consumers (readers; tolerant-read)
  • Projection/Search Services (tolerant/readers)
  • Ops/Release (flags, canaries)

Prerequisites

System Requirements

  • Schema Registry online with compatibility checks and artifact signing
  • CI pipeline to lint/validate schemas (JSON Schema/Protobuf)
  • Gateway supports version advertisement headers & graceful negotiation
  • Services compiled with tolerant parsers (ignore unknowns; default enums)
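The tolerant-parser requirement above (ignore unknowns, default missing enums) amounts to projecting a payload onto the known field set. A minimal sketch; field names and defaults are purely illustrative:

```python
def tolerant_read(payload: dict, known_fields: dict) -> dict:
    """Tolerant-read a record: fields the reader does not know are
    dropped, known fields absent from the payload take their default
    (including enum defaults). This lets vN-1 readers consume vN data.
    """
    return {name: payload.get(name, default) for name, default in known_fields.items()}
```

With this rule, a v2 reader handed the v3 record (which adds `Geo`) simply never sees the new field, and a v3 reader handed v2 data fills `Geo` with its default.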

Business Requirements

  • Compatibility matrix approved (e.g., vN write requires readers ≥ vN-1)
  • Rollout plan (tenants/regions/canaries) and rollback criteria defined
  • Deprecation timeline communicated to stakeholders

Performance Requirements

  • Registry publish p95 ≤ 300 ms; lookup cache TTL tuned
  • Dual-write overhead ≤ 10% QPS/egress during window
  • No more than 1 additional index refresh per change in Search

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    actor SA as Schema Author
    participant CI as CI/CD
    participant REG as Schema Registry
    participant GW as API Gateway
    participant PRD as Producer (Service/SDK)
    participant PRJ as Projection Service
    participant IDX as Search Index
    participant CSM as Consumer (Query/Export)

    SA->>CI: Open PR with vNext (add fields/enums)
    CI->>REG: Validate & publish draft vNext (compatibility=FORWARD+BACKWARD)
    REG-->>CI: OK (artifactId, version=v3, signature)
    CI->>GW: Deploy Gateway/SDK announcement (X-Schema-Latest: v3)
    PRD->>PRD: Enable canary flag (tenant subset)
    PRD->>GW: Writes (dual-write: v2 + v3 metadata)
    GW-->>PRD: 202 Accepted (X-Schema-Active: v3)
    PRJ->>PRJ: Read tolerant (unknown fields ignored, defaults applied)
    IDX->>IDX: Mapping updated (add new fields as optional)
    CSM->>GW: Reads (request v2, receives v2) / (request v3, receives v3)
    CI->>REG: Promote v3 to stable, start deprecation clock for v1

Alternative Paths

  • Canary-by-tenant: enable v3 only for tenant in {acme,beta}; expand after burn-in.
  • Header-only announce: Gateway advertises X-Schema-Latest before any producer dual-writes (readers prep first).
  • Soft-fail: Producer emits v3-only but Gateway downgrades to v2 for legacy consumers via transformation map (temporary).

Error Paths

sequenceDiagram
    participant CI as CI/CD
    participant REG as Schema Registry
    participant GW as API Gateway
    participant PRD as Producer

    CI->>REG: Publish vNext (breaking removal/rename)
    alt 400 Bad Request (invalid schema)
        REG-->>CI: 400 Problem+JSON
    else 409 Conflict (compatibility violation)
        REG-->>CI: 409 Problem+JSON (matrix failed)
    end

    PRD->>GW: Write with v3 before announce
    GW-->>PRD: 412 Precondition Failed (X-Required-Schema: v2)

Request/Response Specifications

Input Requirements (Key Endpoints & Headers)

| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| POST /registry/v1/schemas/{name}/versions | http | Y | Publish schema vNext | Signed commit |
| compatibility | enum | Y | BACKWARD, FORWARD, FULL | policy |
| X-Schema-Write-Version | header | O | Producer-declared write version | int ≥ 1 |
| X-Schema-Read-Version | header | O | Consumer requested read version | int ≥ 1 |
| Accept | header | O | application/json;profile="#v3" | negotiated |
| gRPC metadata: schema-version | meta | O | Read/write hint | int |
| idempotency-key | header | O | Dual-write de-dupe | ≤128 chars |

Output Specifications

| Field | Type | Description | Notes |
|---|---|---|---|
| artifactId | string | Registry id of version | immutable |
| version | int | Published version (e.g., 3) | monotonic |
| X-Schema-Latest | header | Latest stable version | set by Gateway |
| X-Schema-Active | header | Version currently served | per route/tenant |
| downgrade | flag | Whether Gateway transformed response | temporary only |

Example Payloads

Publish vNext (JSON Schema)

POST /registry/v1/schemas/auditrecord/versions
{
  "version": 3,
  "compatibility": "FULL",
  "schema": {
    "$id": "urn:atp:auditrecord:v3",
    "type": "object",
    "properties": {
      "Id": {"type":"string"},
      "Actor": {"$ref":"urn:atp:actor:v2"},
      "Decision": {"$ref":"urn:atp:decision:v1"},
      "Geo": {"type":"object","properties":{"Country":{"type":"string"}}} // new additive
    },
    "additionalProperties": false
  }
}

Write (dual-write hint)

POST /audit/v1/records
X-Schema-Write-Version: 3
Idempotency-Key: wr_01JF...
Content-Type: application/json

Read (negotiate v2)

GET /audit/v1/records?sv=2
X-Schema-Read-Version: 2
Accept: application/json; profile="#v2"
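The negotiation implied by these headers is a small lookup at the Gateway: honor `X-Schema-Read-Version` when that version is served, otherwise default to latest. A sketch; surfacing an unsupported version as an error matches the 404 case in this flow's error table:

```python
def negotiate(requested, supported, latest):
    """Resolve the schema version to serve for a read.

    `requested` is the parsed X-Schema-Read-Version (None if absent),
    `supported` the set of versions this route can serve, `latest`
    the advertised X-Schema-Latest.
    """
    if requested is None:
        return latest                       # no preference: serve latest stable
    if requested in supported:
        return requested                    # echoed back as X-Schema-Active
    raise LookupError(f"schema v{requested} not available")  # -> 404
```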

Error Handling

Error Scenarios

| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Invalid schema JSON/Proto; unknown fields without defaults | Fix schema; re-validate | |
| 401 | Unauthorized schema publish | Authenticate | Retry after renewal |
| 403 | Caller lacks schemas:publish or tenant attempting global change | Request access | |
| 404 | Unknown schema name/version; consumer requests non-existent sv | Request supported version; update client | |
| 409 | Compatibility violation vs matrix; mapping collision in search | Adjust change or update matrix; run reindex plan | |
| 412 | Producer writing vNext before Gateway/Registry mark active | Wait for announce; enable flag after | Conditional |
| 422 | Enum narrowing or field type change detected | Redesign as additive; use new field name | |
| 429 | Publish rate-limited | Back off | Jittered backoff |
| 503 | Registry/Gateway dependency unavailable | Retry later | Exponential backoff |

Failure Modes

  • Breaking removal/rename: rejected by Registry; use add + deprecate pattern.
  • Dual-write drift: v2 & v3 diverge → enable consistency checkers and fail fast on mismatch.
  • Search mapping conflicts: new field analyzer mismatches existing index → create new index alias v3 and reindex.
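The dual-write drift check mentioned above can be as simple as comparing every v2 field against its v3 counterpart and failing fast on any divergence (v3-only additive fields are expected and ignored). A sketch:

```python
def dual_write_mismatch(v2_record: dict, v3_record: dict) -> list:
    """Return the field names where the v2 and v3 copies of a
    dual-written record disagree; empty list means consistent.
    Fields only present in v3 (additive) are not mismatches.
    """
    return [k for k, v in v2_record.items() if v3_record.get(k) != v]
```

Feeding the result into `dual_write_mismatch_total` (see Key Metrics) gives the "Any > 0" alert its signal.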

Recovery Procedures

  1. Roll back producer flag to v2-only; keep Registry v3 published but inactive.
  2. If search mapping conflict, cut over to v3 alias after backfill; keep reads tolerant.
  3. Use Compensation Flow to repair projections that missed new fields during early canary.

Performance Characteristics

Latency Expectations

  • Version negotiation adds ≤ 1 ms at Gateway (header processing/cache).
  • Registry lookup cached; cache miss p95 ≤ 50 ms.

Throughput Limits

  • Dual-write increases write amp; restrict to canary tenants initially.
  • Reindex/backfill throttled per-tenant to protect cluster SLOs.

Resource Requirements

  • Registry store for versions & metadata; small footprint per artifact.
  • Backfill/reindex workers sized to edition limits.

Scaling Considerations

  • Per-tenant activation gates; gradual region waves.
  • Keep old readers working via tolerant-read and optional downgrade transforms (temporary only).

Security & Compliance

Authentication

  • OIDC/JWT for publish & toggle APIs; mTLS service-to-service.

Authorization

  • Roles: schemas:publish, schemas:promote, schemas:deprecate, schemas:read.
  • Only release managers can promote to stable or start sunset.

Data Protection

  • Signed artifacts; checksum headers; registry enforces immutability.
  • No PII stored in schema metadata beyond author id.

Compliance

  • Audit events: Schema.Published|Promoted|Activated|Deprecated|SunsetCompleted with actor & diff.
  • Backward/forward compatibility reports attached to change record.

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| schema_publish_total | counter | Versions published | Spike analysis |
| schema_compat_fail_total | counter | Registry rejects | >0 sustained |
| schema_negotiations_total | counter | Gateway negotiations | Trend |
| dual_write_mismatch_total | counter | v2 vs v3 mismatch | Any > 0 |
| reader_unknown_field_rate | counter | Unknowns seen by readers | Spike |
| search_reindex_progress | gauge | Backfill completion | Stalls |

Logging Requirements

  • Structured logs: schema, fromVersion, toVersion, tenant, compatMode, result, traceId.

Distributed Tracing

  • Spans: registry.validate, gateway.negotiate, producer.dualwrite, projection.tolerant-read, search.mapping.update.

Health Checks

  • Registry readiness (DB/object store); Gateway cache health; index template availability.

Operational Procedures

Deployment

  1. Deploy/upgrade Registry with compatibility policies.
  2. Enable Gateway negotiation & headers; roll SDKs with version awareness.
  3. Register CI checks (lint/compat) and block merges on failure.

Configuration

  • Env: SCHEMA_COMPAT_MODE=FULL, SCHEMA_CACHE_TTL=300s, SCHEMA_DOWNGRADE_ENABLED=true (temporary).
  • Flags: feature.auditrecord.v3.enabled, feature.search.mapping.v3.enabled.

Maintenance

  • Periodic cleanup of deprecated versions after sunset window.
  • Rotate registry signing keys; verify artifact signatures in CI.

Troubleshooting

  • 409 compatibility failures → inspect matrix report; adjust plan to additive-only.
  • Reader errors on unknown fields → ensure tolerant-read; verify SDK versions.
  • Search failures → create new alias with updated template; reindex flow.

Testing Scenarios

Happy Path Tests

  • Publish v3 (additive); Gateway advertises; producer dual-writes; readers tolerant; promote to stable.
  • Search mapping updated; index accepts new field; dashboards reflect new attribute.

Error Path Tests

  • 400 invalid schema rejected; 404 unknown version on read; 409 matrix violation blocked.
  • 412 write blocked before announce; passes after activation.

Performance Tests

  • Dual-write adds ≤ 10% overhead; Gateway negotiation ≤ 1 ms p95.
  • Reindex completes within planned window without SLO breach.

Security Tests

  • Only schemas:promote role can activate vNext; artifacts signed/verified.
  • Audit events emitted for publish/promote/deprecate.

Internal References

External References

  • JSON Schema / Protobuf compatibility guides

Appendices

A. Compatibility Matrix (excerpt)

| Change Type | Backward | Forward | Allowed |
|---|---|---|---|
| Add optional field | ✓ | ✓ | Yes |
| Add enum value | ✓ | ✓* | Yes (readers must default) |
| Remove field | ✗ | ✗ | No (use deprecate) |
| Change type (string→int) | ✗ | ✗ | No (new field) |
| Widen type (int32→int64) | ✓ | ✓* | Yes with defaults |

(*) Requires tolerant-read or defaulting behavior.
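The matrix can be encoded as data so CI can gate publishes mechanically. A sketch; the change-type keys are illustrative names, not Registry API values, and "needs-defaults" marks the starred entries:

```python
# Encodes the compatibility matrix above (illustrative keys).
MATRIX = {
    "add_optional_field": {"backward": True,  "forward": True,            "allowed": True},
    "add_enum_value":     {"backward": True,  "forward": "needs-defaults", "allowed": True},
    "remove_field":       {"backward": False, "forward": False,            "allowed": False},
    "change_type":        {"backward": False, "forward": False,            "allowed": False},
    "widen_type":         {"backward": True,  "forward": "needs-defaults", "allowed": True},
}

def is_allowed(change_type: str) -> bool:
    """CI gate: reject any change type the matrix marks disallowed."""
    if change_type not in MATRIX:
        raise ValueError(f"unknown change type: {change_type}")
    return MATRIX[change_type]["allowed"]
```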

B. Problem+JSON (compatibility violation)

{
  "type": "urn:connectsoft:errors/schema/compatibility-violation",
  "title": "Schema change is not backward compatible",
  "status": 409,
  "detail": "Removing field 'Decision' breaks existing consumers.",
  "violations": [
    {"path":"$.Decision", "rule":"field-removal"}
  ]
}

Configuration Update Flow

Safely rolls out configuration changes using validation (dry-run), staged rollout (feature flags/canaries), hot reload in services, and automatic verification / rollback with strict blast-radius controls. Separates config from secrets; every change is audited and idempotent.


Overview

Purpose: Apply config changes without disrupting tenants, maintaining SLOs and isolation.
Scope: Propose → validate (schema & semantic) → stage → canary rollout → service reload → verify (metrics/health) → promote or rollback. Excludes secret rotation (covered elsewhere).
Context: Config is stored in a Config Registry/Repo, announced via Config Service, consumed by Gateway/Ingestion/Projection/Search/Export at runtime with hot reload or restart on failure.
Key Participants:

  • Operator / CI/CD
  • Config Registry/Repo (GitOps or API)
  • Config Service (distribution, versioning, audits)
  • Feature Flag Service (progressive exposure)
  • Target Services (Gateway / Ingestion / …)
  • Observability (metrics/logs/traces)
  • Orchestrator (deploy hooks for restarts if needed)

Prerequisites

System Requirements

  • Config schemas (JSON Schema/Protobuf) with server-side validation and dry-run execution
  • Feature flag platform for canary/percentage/segment rollouts
  • Services implement hot reload endpoint or SIGHUP handler and config guards (shadow config)
  • Config Service supports versioning, idempotency, and RBAC

Business Requirements

  • Change approval workflow (CAB) with blast-radius assessment
  • Runbooks & rollback plans linked to config keys
  • Tenant/edition-aware defaults to prevent cross-tenant leakage

Performance Requirements

  • Validation p95 ≤ 200 ms; distribution to all pods p95 ≤ 60 s
  • Hot reload p95 ≤ 250 ms per service; zero-downtime guarantee
  • Verification window (post-change) default 5–15 min with auto-rollback gates

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    actor OP as Operator/CI
    participant REG as Config Registry/Repo
    participant CFG as Config Service
    participant FF as Feature Flag Service
    participant SVC as Target Services
    participant OBS as Observability

    OP->>REG: Propose config vNext (PR/ChangeSet)
    REG->>CFG: Validate (schema + semantic dry-run)
    CFG-->>REG: OK (change-id, version=v17)
    OP->>FF: Stage flag "cfg.v17.enabled=false" (guard)
    OP->>CFG: Apply v17 (scope: canary tenants/perc=5%)
    CFG->>SVC: Distribute v17 (signed, If-None-Match)
    SVC->>SVC: Hot reload, shadow compare, begin verification
    SVC-->>OBS: Emit KPIs (errors/latency/health)
    OBS-->>CFG: Verification passed (within SLO)
    OP->>FF: Ramp to 50% → 100%
    CFG->>SVC: Finalize v17 (active for all)
    CFG-->>REG: Promote v17 to Active, close change

Alternative Paths

  • Flag-only change: no new config payload; toggle flag segments to roll out behavior changes.
  • Tenant-staged rollout: enable by region/tenant/edition gates before global activation.
  • Restart-required: services lacking hot reload receive orchestrated rolling restart with readiness guards.
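Percentage-based staging needs a deterministic tenant→bucket mapping so a tenant does not flap in and out of the canary between evaluations. A common sketch; salting per change is an assumption, not something this document mandates:

```python
import hashlib

def in_rollout(tenant_id: str, percent: int, salt: str = "cfg.v17") -> bool:
    """Stable canary membership: hash (salt, tenant) into one of 100
    buckets; tenants in buckets below `percent` get the new config.
    The per-change salt keeps bucket assignment independent across
    changes, so the same tenants are not always the canaries.
    """
    digest = hashlib.sha256(f"{salt}:{tenant_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent
```

Ramping 5% → 50% → 100% then only raises the threshold; tenants already enrolled stay enrolled.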

Error Paths

sequenceDiagram
    participant OP as Operator
    participant REG as Config Registry
    participant CFG as Config Service
    participant SVC as Target Services

    OP->>REG: Submit invalid config (schema fail)
    REG-->>OP: 400 Bad Request (Problem+JSON)

    OP->>CFG: Apply v17 (unknown key/scope)
    CFG-->>OP: 404 Not Found (key/scope)

    OP->>CFG: Apply while v16 rollout in-progress
    CFG-->>OP: 409 Conflict (change in progress)

    CFG->>SVC: Distribute v17
    SVC-->>CFG: 503 Service Unavailable (reload guard failed)
    CFG->>CFG: Auto-rollback to v16, raise alert

Request/Response Specifications

Input Requirements (APIs)

| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| POST /ops/v1/config/validate | http | Y | Dry-run schema & semantic checks | JSON body |
| POST /ops/v1/config/apply | http | Y | Apply version with scope/strategy | RBAC + idempotent |
| changeId | string | Y | Unique change identifier | ULID/UUID |
| version | int | Y | Candidate version | monotonic |
| scope | object | O | {tenants, regions, editions, percent} | allowlists |
| strategy | object | O | {mode: canary \| all, ramp: [5,50,100], verifyMins: 10} | sane ranges |
| preconditions.etag | string | O | CAS guard | matches head |
| reason | string | Y | Change reason | 1–256 chars |

Output Specifications

| Field | Type | Description | Notes |
|---|---|---|---|
| status | enum | Validated \| Applying \| Partial \| RolledBack \| Active \| Failed | lifecycle |
| activeVersion | int | Current active config version | |
| appliedTo | object | Effective scope (tenants/percent) | resolved |
| verification | object | KPIs & window state | pass/fail |
| rollbackToken | string | Token to execute rollback | TTL-bound |

Example Payloads

Validate

POST /ops/v1/config/validate
{
  "changeId": "chg_01JF8C6Q...",
  "version": 17,
  "payload": { "Ingestion": { "MaxBatchBytes": 1048576 } }
}

Apply (canary 5%)

POST /ops/v1/config/apply
{
  "changeId": "chg_01JF8C6Q...",
  "version": 17,
  "scope": { "percent": 5, "regions": ["eu-west"] },
  "strategy": { "mode": "canary", "ramp": [5,50,100], "verifyMins": 10 },
  "preconditions": { "etag": "v16-etag" },
  "reason": "Lower ingest batch size to reduce p99"
}
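The `preconditions.etag` field in this payload is a compare-and-set guard: the apply succeeds only while the caller's ETag still matches head, otherwise it maps to 409. A sketch of the server-side check, with state kept in memory for illustration:

```python
class ConflictError(Exception):
    """Maps to 409 Conflict: the ETag precondition no longer matches head."""

class ConfigStore:
    """Minimal compare-and-set semantics for config apply (in-memory sketch)."""

    def __init__(self, version=16, etag="v16-etag"):
        self.version, self.etag = version, etag

    def apply(self, new_version, expected_etag):
        if expected_etag != self.etag:
            raise ConflictError(f"etag mismatch, head is {self.etag}")
        self.version = new_version
        self.etag = f"v{new_version}-etag"   # new head for the next CAS
        return self.etag
```

A caller that loses the race simply refetches the latest ETag and retries (the table below lists this as a conditional retry).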

Service Hot Reload Contract

POST /config/reload
If-None-Match: v17

Error Handling

Error Scenarios

| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Schema/semantic validation failed | Fix payload; re-validate | |
| 401 | Missing/invalid token | Authenticate | Retry after renewal |
| 403 | Caller lacks config:apply | Request access | |
| 404 | Unknown config key/version/scope | Verify ids; fetch latest | |
| 409 | Concurrent change in progress; ETag mismatch | Wait; retry with latest ETag | Conditional retry |
| 412 | Preconditions failed (guardrails) | Adjust scope/strategy | |
| 422 | Semantic violation (unsafe value range) | Choose safe value | |
| 429 | Apply rate-limited | Back off | Exponential + jitter |
| 503 | Target service not ready/reload failure | Auto-rollback; investigate | Retry after health OK |

Failure Modes

  • Blast radius: global apply without canary → guarded by policy (requires staged rollout).
  • Config drift: some pods on v16, others v17 → Config Service reconciles until convergence.
  • Hot reload hazards: partial initialization using new values → shadow config & atomic swap.
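The shadow-config/atomic-swap mitigation above can be sketched as: build the candidate fully, validate it (the reload guard), and only then swap the reference readers see, so no reader ever observes a partially initialized config:

```python
import threading

class HotConfig:
    """Hot-reloadable config with shadow build and atomic swap."""

    def __init__(self, initial: dict):
        self._lock = threading.Lock()
        self._current = dict(initial)

    def snapshot(self) -> dict:
        """Readers always get a complete config (treat as immutable)."""
        with self._lock:
            return self._current

    def reload(self, candidate: dict, validate) -> bool:
        shadow = dict(candidate)       # build fully aside, never mutate in place
        if not validate(shadow):       # reload guard: reject unsafe config
            return False               # -> service reports 503, auto-rollback
        with self._lock:
            self._current = shadow     # atomic reference swap
        return True
```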

Recovery Procedures

  1. Trigger auto-rollback via policy gate failure; restore activeVersion to previous.
  2. Freeze changes (global mute) and open incident; evaluate metrics & logs.
  3. Re-run apply with reduced scope or adjusted values after fix.

Performance Characteristics

Latency Expectations

  • Validation p95 ≤ 200 ms; distribution to pods ≤ 60 s; hot reload ≤ 250 ms.

Throughput Limits

  • Max N parallel applies per region (e.g., 1); queue subsequent changes.

Resource Requirements

  • Config Service cache/ETag store; signed bundles; modest CPU for validation.

Scaling Considerations

  • Shard config topics per service/region; CDN or sidecar cache for large payloads.
  • Prefer delta distribution over full bundle for frequent small tweaks.

Security & Compliance

Authentication

  • OIDC JWT for ops APIs; mTLS service-to-service.

Authorization

  • Roles: config:validate, config:apply, config:rollback, config:read.
  • Tenant/edition scoping enforced at apply time.

Data Protection

  • No secrets in config; secrets managed via dedicated Secrets Service/KMS.
  • Signed config bundles (checksum, signature) verified by services.

Compliance

  • Audit events: Config.Validated|Applied|Promoted|RolledBack with actor, diff, scope, reason.
  • Change records linked to incident/ticket ids.

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| config_apply_total | counter | Applies by result | Spike in failures |
| config_active_version | gauge | Current active version | Unexpected regress |
| config_rollbacks_total | counter | Auto/manual rollbacks | >0 sustained |
| config_distribution_lag_seconds | histogram | Registry→pod lag | p95 > 60s |
| service_config_reload_failures_total | counter | Reload errors | Any > 0 |

Logging Requirements

  • Structured logs: changeId, version, service, scope, strategy, result, traceId, rollbackToken.

Distributed Tracing

  • Spans: config.validate, config.apply, service.reload, verify.window. Include changeId & version.

Health Checks

  • Readiness includes config freshness (expected vs actual version).
  • Synthetic probes after apply to confirm behavior.

Operational Procedures

Deployment

  1. Deploy Config Service (HA) with schema validators & signing keys.
  2. Enable hot reload endpoints in services; wire feature flag SDK.
  3. Configure GitOps or Ops API pipeline with approval gates.

Configuration

  • Env: CFG_APPLY_CONCURRENCY=1, CFG_VERIFY_WINDOW=10m, CFG_MAX_SCOPE_PERCENT=10, CFG_REQUIRE_FLAG_GUARD=true.
  • Policies: mandatory canary for high-risk keys; deny global applies during peak.

Maintenance

  • Rotate signing keys; prune deprecated keys; rehearse rollback drills quarterly.

Troubleshooting

  • Apply stuck → check distribution lag metrics & queue; verify RBAC/ETag.
  • Errors spike post-apply → auto-rollback should trigger; confirm guardrail worked.
  • Only subset updated → reconcile loop; investigate failing pods’ reload logs.

Testing Scenarios

Happy Path Tests

  • Validate → apply to 5% → verify → ramp to 100% with no SLO breach.
  • Hot reload succeeds across services; config version converges.

Error Path Tests

  • 400 invalid payload rejected; 404 unknown key; 409 concurrent apply guarded.
  • 503 reload failure triggers automatic rollback.

Performance Tests

  • Distribution completes ≤ 60 s across 200 pods; reload p95 ≤ 250 ms.
  • Multiple small deltas do not exceed CPU/network budgets.

Security Tests

  • Only config:apply role can promote; signatures verified; audits present.
  • No secrets present in config payloads.

Internal References

External References

  • Progressive Delivery / Feature Flags best practices

Appendices

A. Canary Strategy (YAML)

strategy:
  mode: canary
  ramp: [5, 25, 50, 100]
  verify:
    window: 10m
    guards:
      - metric: atp_ingest_errors_ratio
        threshold: "< 0.5%"
      - metric: atp_projection_lag_seconds
        threshold: "< 60"
      - metric: health_readiness_status
        threshold: "== 1"
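The guards above can be evaluated mechanically against a metrics snapshot. This sketch supports only the `<` and `==` forms shown, treating a `%` suffix as a ratio:

```python
def guard_passes(value: float, threshold: str) -> bool:
    """Evaluate one verification guard, e.g. guard_passes(0.002, "< 0.5%")."""
    op, _, raw = threshold.partition(" ")
    target = float(raw.rstrip("%"))
    if raw.endswith("%"):
        target /= 100.0                # "0.5%" compares against a 0.005 ratio
    if op == "<":
        return value < target
    if op == "==":
        return value == target
    raise ValueError(f"unsupported operator: {op}")

def verify(metrics: dict, guards: list) -> bool:
    """All guards must hold for the verification window to pass."""
    return all(guard_passes(metrics[g["metric"]], g["threshold"]) for g in guards)
```

A `False` result at the end of the verify window is what triggers the auto-rollback described in Recovery Procedures.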

B. Problem+JSON Examples

{
  "type": "urn:connectsoft:errors/config/invalid",
  "title": "Invalid configuration payload",
  "status": 400,
  "detail": "Ingestion.MaxBatchBytes exceeds allowed maximum."
}
{
  "type": "urn:connectsoft:errors/config/conflict",
  "title": "Change conflict",
  "status": 409,
  "detail": "Another change chg_01JF8B... is applying.",
  "currentChangeId": "chg_01JF8B..."
}

Backup & Recovery Flow

Implements durable backups (snapshots/exports) with integrity verification and WORM-secure storage, plus periodic recovery drills that prove RPO/RTO objectives are met. Covers append store, projections, and search indexes with consistent cutover points and tenant-aware restores. Evidence of successful restore is captured and audited.


Overview

Purpose: Guarantee recoverability of tenant data with defined RPO/RTO and cryptographic proof of integrity.
Scope: Scheduled/on-demand backups → snapshot/export → sign/verify → store in immutable object storage → catalog → recovery drills (sandbox restore + validation) → reporting. Excludes hot replicas (covered by HA).
Context: Orchestrated by Backup Service. Sources: Storage (Append/WORM), Projection DB, Search Index. Targets: Object Store (WORM/Object Lock) with tenant/region prefixes and KMS encryption.
Key Participants:

  • Backup Scheduler/Service (orchestrator)
  • Storage (Append Store) / Projection DB / Search Index
  • Integrity Service (hash/Merkle proofs)
  • Object Store (WORM) with KMS
  • Drill Runner (restore validator)
  • Ops / Compliance (approvals, reports)

Prerequisites

System Requirements

  • Snapshot/backup endpoints enabled for all data planes (append/projection/index)
  • Object store with WORM/Object Lock & lifecycle policies; mTLS + signed URLs
  • Integrity Service available for proof computation/verification
  • Catalog/Manifest registry with index of recovery points

Business Requirements

  • Tenant residency & encryption policies mapped to backup targets
  • Defined RPO (e.g., ≤ 15 min) and RTO (e.g., ≤ 60 min) per edition
  • Drill cadence (e.g., monthly per region; quarterly per tenant sample) and evidence requirements

Performance Requirements

  • Backup windows avoid peak hours; bandwidth caps per region/tenant
  • Incremental backups preferred; fulls on weekly cadence
  • Verification completes within a bounded share of the backup duration (target ≤ 30% of backup wall time)

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    participant SCH as Scheduler
    participant BAK as Backup Service
    participant ST as Storage (Append)
    participant PR as Projection DB
    participant IX as Search Index
    participant INT as Integrity Service
    participant OBJ as Object Store (WORM)
    participant CAT as Catalog/Manifest
    participant DR as Drill Runner

    SCH->>BAK: Trigger Backup (policy, scope, type=incremental)
    BAK->>ST: Consistent snapshot/export (cutover @ T)
    BAK->>PR: Projection dump @ watermark<=T
    BAK->>IX: Index snapshot (optional or template)
    BAK->>INT: Compute hashes/Merkle root + sign (kid)
    INT-->>BAK: Proof bundle {root, signature, kid}
    BAK->>OBJ: Upload packages (JSONL/Parquet/SQL), proofs, manifest (WORM)
    BAK->>CAT: Register Recovery Point (RP-2025-10-27T08:00Z)
    BAK-->>SCH: Success {recoveryPointId, sizes, proof}
    SCH->>DR: Schedule recovery drill (sandbox)
    DR->>OBJ: Fetch packages + manifest
    DR->>INT: Verify proofs/signatures
    DR->>ST: Restore append, reproject read models
    DR-->>SCH: Drill report (RPO/RTO met, sample checks OK)

Alternative Paths

  • On-demand tenant backup: operator requests scoped backup for a single tenant; catalog marks it tenant-scoped.
  • Warm-standby region: ship encrypted copies to secondary region with residency-allowed classes only.
  • Indexless restore: restore append store and rebuild projections/search from facts to reduce backup volume.

Error Paths

sequenceDiagram
    participant BAK as Backup Service
    participant OBJ as Object Store
    participant INT as Integrity
    participant CAT as Catalog

    BAK->>OBJ: PUT package (network issue)
    alt 503 Storage unavailable
        BAK-->>BAK: Retry with backoff, pause schedule if persistent
    else 409 Conflict (WORM retention/exists)
        BAK-->>BAK: Switch to new key (timestamped), update manifest
    end

    BAK->>INT: Compute proof
    alt Proof mismatch
        INT-->>BAK: 422 Unprocessable (hash mismatch)
        BAK-->>CAT: Mark recovery point FAILED, alert
    end
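The "retry with backoff, pause schedule if persistent" branch above can be sketched as capped exponential backoff with full jitter; the `upload` callable and the retry limits here are illustrative assumptions:

```python
import random
import time

def put_with_backoff(upload, max_attempts=5, base_delay=0.5, cap=30.0,
                     sleep=time.sleep):
    """Retry a transiently failing upload with capped exponential backoff.

    `upload` is assumed to raise ConnectionError on transient failures
    (e.g., a 503 from the object store); other exceptions propagate.
    On exhaustion the last error is re-raised so the caller can pause
    the schedule and alert.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return upload()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # persistent failure: pause schedule, alert
            # Full jitter: sleep uniformly in [0, min(cap, base * 2^attempt)].
            sleep(random.uniform(0.0, min(cap, base_delay * 2 ** attempt)))
```

Injecting `sleep` keeps the helper testable and lets a scheduler substitute its own delay mechanism.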

Request/Response Specifications

Input Requirements (APIs)

| Field | Type | Req | Description | Validation |
| --- | --- | --- | --- | --- |
| `POST /ops/v1/backups` | http | Y | Start backup | RBAC `backup:start` |
| scope | object | Y | `{tenants:[], regions:[], dataClasses:[]}` | allowlists/residency |
| type | enum | Y | `full` \| `incremental` | — |
| cutover | RFC3339 | O | Desired snapshot time | ≤ now |
| retentionDays | int | O | Override default retention | ≤ policy max |
| `POST /ops/v1/restores` | http | Y | Start restore/drill | RBAC `backup:restore` |
| recoveryPointId | string | Y | Catalog id | exists |
| mode | enum | Y | `sandbox` \| `production` | — |
| target | object | O | `{tenantId?, region}` | valid & empty slot |
| verifyPolicy | object | O | sampling, row-counts, checksums | schema |

Output Specifications

| Field | Type | Description | Notes |
| --- | --- | --- | --- |
| recoveryPointId | string | Unique id for backup | sortable by time |
| manifestUrl | url | Signed URL to manifest | time-limited |
| proof | object | `{merkleRoot, signature, kid}` | integrity |
| sizes | object | bytes per package | budgeting |
| restoreJobId | string | Track restore/drill | status API |

Example Payloads

Start Backup

POST /ops/v1/backups
{
  "scope": { "regions": ["eu-west"], "tenants": ["acme"] },
  "type": "incremental",
  "retentionDays": 30
}

Catalog Manifest (excerpt)

{
  "recoveryPointId": "RP-2025-10-27T08:00:00Z-eu-west-acme",
  "time": "2025-10-27T08:00:00Z",
  "packages": [
    {"name":"append-0001.jsonl","sha256":"...","bytes": 73482910},
    {"name":"projection.sql","sha256":"...","bytes": 2183412}
  ],
  "merkleRoot": "b3f3…",
  "signature": "MEUCIQ…",
  "kid": "kms:eu-west:tenant/acme:v3",
  "watermark": "2025-10-27T07:59:58Z"
}
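Before trusting a recovery point, a drill runner can recompute the manifest's `merkleRoot` from the per-package `sha256` digests. A minimal sketch; the pairing rules (raw-digest concatenation, odd node carried up unchanged) are assumptions here, since ATP's actual tree layout is defined in the Integrity spec:

```python
import hashlib

def merkle_root(leaf_hashes: list[str]) -> str:
    """Fold hex-encoded leaf digests into a single hex root.

    Each pair is hashed as sha256(left || right) over raw digest bytes;
    an odd trailing node is promoted unchanged to the next level.
    """
    if not leaf_hashes:
        raise ValueError("empty package list")
    level = [bytes.fromhex(h) for h in leaf_hashes]
    while len(level) > 1:
        nxt = [hashlib.sha256(level[i] + level[i + 1]).digest()
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd node carried up
        level = nxt
    return level[0].hex()

def verify_manifest(manifest: dict) -> bool:
    """Compare a recomputed root against the manifest's claimed merkleRoot."""
    leaves = [p["sha256"] for p in manifest["packages"]]
    return merkle_root(leaves) == manifest["merkleRoot"]
```

Signature verification over the root (with the manifest's `kid`) would follow via KMS and is omitted here.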

Start Restore (Sandbox)

POST /ops/v1/restores
{
  "recoveryPointId": "RP-2025-10-27T08:00:00Z-eu-west-acme",
  "mode": "sandbox",
  "target": { "region": "eu-west" },
  "verifyPolicy": { "rowCounts": true, "samplePercent": 5, "proofs": true }
}

Error Handling

Error Scenarios

| HTTP Code | Scenario | Recovery Action | Retry Strategy |
| --- | --- | --- | --- |
| 400 | Invalid scope/type/cutover; residency mismatch | Correct payload/policy | — |
| 401 | Missing/invalid token | Authenticate | Retry after renewal |
| 403 | Caller lacks `backup:*` or `restore:*` | Request access | — |
| 404 | Unknown recoveryPointId or package missing | Choose valid point; investigate catalog | — |
| 409 | Restore already in progress for target / resource lock | Wait or choose new target | Conditional retry |
| 412 | Preconditions not met (sandbox not empty; legal hold prevents overwrite) | Satisfy preconditions / choose sandbox | — |
| 423 | Target locked (admin lock/maintenance) | Release lock | Retry |
| 429 | Region throughput/backups rate-limited | Back off | Exponential + jitter |
| 503 | Object store/Integrity service unavailable | Retry later | Bounded retries with backoff |

Failure Modes

  • Inconsistent cutover: sources not frozen → use watermark T and quiesce writes for snapshot window.
  • WORM conflict: attempting overwrite before retention expires → versioned keys; never mutate existing.
  • Silent corruption: block-level issues → end-to-end checksums + Merkle proofs required; drill detects.

Recovery Procedures

  1. Re-run backup with quiesce (short write pause or log-based incremental with LSN).
  2. For failed proof, invalidate recovery point and alert; run full backup next window.
  3. During restore, rebuild projections and search from append facts if projection package absent or stale.

Performance Characteristics

Latency Expectations

  • Catalog publish p95 ≤ 1 s; proof computation bounded by package size (parallelizable).
  • Drill restore: RTO target (e.g., ≤ 60 min for medium tenants) including re-projection.

Throughput Limits

  • Per-region bandwidth caps (e.g., ≤ 200 MB/s aggregate); per-tenant rate caps to avoid noisy neighbors.

Resource Requirements

  • Temporary staging disk for package creation; CPU for hashing; memory for buffering; KMS for signing.

Scaling Considerations

  • Incremental forever + periodic synth full to limit restore chains.
  • Shard backups by tenant/shard and time slots to flatten I/O.

Security & Compliance

Authentication

  • Ops endpoints via OIDC; service-to-object store via mTLS and scoped IAM roles.

Authorization

  • Roles: backup:start|read|restore|drill|approve. Production restore requires two-person approval.

Data Protection

  • KMS encryption at rest; WORM/Object Lock with retention & legal hold support; signed manifests/proofs.
  • Residency: copy only to allowed regions per data class; PII masking is not required because data is encrypted at rest, but residency policy must still be observed.

Compliance

  • Evidence pack: drill reports, manifest, proof verification, timing → archived for audits.
  • Legal holds honored—restore does not violate purge blocks; backups include hold metadata.

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
| --- | --- | --- | --- |
| `backup_runs_total{result}` | counter | Backups by result | Failures > baseline |
| `backup_bytes_total` | counter | Total bytes uploaded | Sudden drop/spike |
| `backup_duration_seconds` | histogram | Backup wall time | p95 > SLO |
| `restore_duration_seconds` | histogram | Drill/restore time | p95 > RTO |
| `backup_proof_failures_total` | counter | Integrity verification failures | Any > 0 |
| `rpo_effective_seconds` | gauge | Now − last successful cutover | > target |
| `rto_drill_pass_rate` | gauge | % drills meeting RTO | < target |

Logging Requirements

  • Structured logs: recoveryPointId, tenant, region, sizes, hash, kid, result, traceId, rpo, rto.

Distributed Tracing

  • Spans for snapshot, package.upload, proof.compute, proof.verify, restore.apply, reprojection.run.

Health Checks

  • Readiness of object store, KMS, Integrity; catalog consistency checks (manifest ↔ objects).

Operational Procedures

Deployment

  1. Deploy Backup Service (HA) with schedulers and workers per region.
  2. Configure object store buckets with Object Lock (compliance mode) and lifecycle.
  3. Register policies (cadence, scope, RPO/RTO) per edition.

Configuration

  • Env: BACKUP_WINDOW=02:00-05:00, BACKUP_MAX_BW_MBPS, BACKUP_TYPE=incremental, BACKUP_VERIFY=true.
  • Policies: weekly full, daily incremental; monthly drill per region.

Maintenance

  • Rotate KMS keys; test restore runbooks quarterly; refresh lifecycle policies and retention.

Troubleshooting

  • Missing package → verify catalog vs. object listing; re-upload if upload was interrupted.
  • Proof mismatch → recalc locally; if persistent, invalidate RP and run full backup.
  • RTO miss → profile slow steps (download bandwidth, reprojection speed) and optimize.

Testing Scenarios

Happy Path Tests

  • Scheduled incremental backup creates catalog entry with valid proofs.
  • Monthly drill restores to sandbox, reprojects, and meets RTO.

Error Path Tests

  • 400 invalid scope rejected; 404 unknown recoveryPointId; 409 concurrent restore blocked.
  • 503 object store outage triggers retries and eventual success/fail with alert.

Performance Tests

  • Backup completes within window; verify overhead does not breach SLOs.
  • Drill on medium tenant completes within RTO under load.

Security Tests

  • WORM enforced—no overwrite/delete within retention; manifests signed & verified.
  • Access controls prevent cross-tenant reads of backup artifacts.

Internal References

External References

  • Object Lock/WORM (vendor docs)
  • NIST SP 800-34 (Contingency Planning)

Appendices

A. Example Object Store Bucket Policy (WORM)

{
  "ObjectLockEnabled": "Enabled",
  "Rules": [{
    "DefaultRetention": { "Mode": "COMPLIANCE", "Days": 30 }
  }]
}

B. Recovery Drill Checklist

  1. Select latest valid recoveryPointId for target region/tenant.
  2. Provision isolated sandbox (no outbound webhooks).
  3. Restore append → reproject → (optional) reindex.
  4. Verify counts (rows/events) & sample diffs; verify proofs.
  5. Capture RTO and evidence; archive report; clean up sandbox.

C. Problem+JSON (example)

{
  "type": "urn:connectsoft:errors/backup/recovery-point-not-found",
  "title": "Recovery point not found",
  "status": 404,
  "detail": "RP-2025-10-27T08:00:00Z-eu-west-acme does not exist or is invalid."
}

Load Balancing Flow

Distributes incoming traffic fairly across healthy service instances using L7/L4 load balancing, with optional affinity (cookie/hash) for sticky paths and standard stateless routing for idempotent calls. Includes multi-region routing (geo/DNS/anycast) with residency and failover policies. Integrates with health checks, rate limiting, and circuit breakers.


Overview

Purpose: Balance requests to healthy backends, maximize utilization, and minimize latency while enforcing tenant isolation and residency.
Scope: Edge routing (DNS/anycast) → Regional LB/Ingress (L7) → per-service pools with health/affinity → response path and headers. Excludes per-tenant throttling logic (covered by Gateway rate limiting).
Context: Client enters via Global LB (GSLB/Anycast), then Regional L7 LB/Ingress/Gateway (Envoy/Nginx/API GW) that selects a backend (Ingestion/Query/Export).
Key Participants:

  • Client
  • Global Traffic Manager (GTM) (GeoDNS/Anycast)
  • Regional L7 LB / API Gateway
  • Target Service Pool (Ingestion / Query / Export)
  • Health Check / Discovery
  • Observability (metrics/logs/traces)

Prerequisites

System Requirements

  • Edge TLS termination with modern ciphers; optional end-to-end mTLS to services
  • Active+passive health checks (HTTP/gRPC/TCP) with outlier detection
  • Service discovery (EDS/SD) with instance metadata: {region, shard, edition}
  • Circuit Breaker and connection pools configured per service/route

Business Requirements

  • Residency policy maps tenants → allowed regions
  • Edition/plan may influence weights (e.g., enterprise canary lanes)
  • Documented sticky vs stateless routes (e.g., Query=stateless, Export job UI=sticky)

Performance Requirements

  • End-to-end added LB latency p95 ≤ 5 ms (regional), ≤ 20 ms (global routing)
  • Per-service concurrency/connection limits defined; surge queue bounded
  • Balancing algorithm chosen per route: least-request, weighted RR, ring-hash (affinity)

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    participant C as Client
    participant GTM as Global Traffic Manager (GeoDNS/Anycast)
    participant L7 as Regional L7 LB / API Gateway
    participant S as Service Pool (e.g., Ingestion)
    participant HC as Health/Discovery

    C->>GTM: Resolve api.atp.example (Geo/latency policy)
    GTM-->>C: Regional VIP (eu-west)
    C->>L7: HTTPS request (Host: api.atp.example)
    L7->>HC: Get healthy endpoints & weights
    L7->>S: Route to least-loaded healthy instance (affinity if provided)
    S-->>L7: 200 OK (payload)
    L7-->>C: 200 OK + headers (X-Region, X-Backend-Id, Server-Timing)

Alternative Paths

  • Sticky (affinity) routing: LB sets atp_affinity cookie or uses ring-hash on X-Sticky-Key/tenantId for session locality.
  • Multi-region: GTM favors closest allowed region; on regional brownout, fail over to next policy region.
  • Canary/weighted: subset traffic (5%) routed to canary pool via header or flag for progressive delivery.

Error Paths

sequenceDiagram
    participant C as Client
    participant L7 as Regional L7 LB
    participant S as Service Pool
    participant HC as Health/Discovery

    C->>L7: Request /ingest
    L7->>HC: Endpoints?
    alt No healthy backends
        L7-->>C: 503 Service Unavailable (Retry-After)
    else Backend times out
        L7->>S: Forward
        S-->>L7: (timeout)
        L7-->>C: 504 Gateway Timeout
    end

Request/Response Specifications

Input Requirements

| Field | Type | Req | Description | Validation |
| --- | --- | --- | --- | --- |
| Host / SNI | header | Y | Virtual host routing | Matches configured domain |
| Authorization | header | O | Propagated to Gateway | If present, well-formed |
| traceparent | header | O | Trace propagation | W3C format |
| X-Tenant-Id | header | O | Residency/affinity hint | ULID/UUID |
| X-Region-Hint | header | O | Client preferred region | Allowlist |
| X-Sticky-Key | header | O | Consistent hashing key | ≤ 128 chars |
| Cookie: atp_affinity | cookie | O | LB-issued sticky cookie | Signed |
| Accept / Content-Type | header | O | Protocol negotiation | Valid MIME |
| Idempotency-Key | header | O | For retries across LB | ≤ 128 chars |

Output Specifications

| Field | Type | Description | Notes |
| --- | --- | --- | --- |
| X-Region | header | Region that served the request | e.g., eu-west |
| X-Backend-Id | header | Instance/pod identifier | For debugging |
| X-Served-By | header | LB node identifier | Optional |
| Server-Timing | header | `lb;dur=...` | Perf insights |
| Retry-After | header | Sent on 429/503 | Seconds or HTTP date |

Example Payloads

GET /query/v1/records?tenant=acme HTTP/1.1
Host: api.atp.example
X-Tenant-Id: 01HZXM0...
X-Region-Hint: eu-west
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

HTTP/1.1 200 OK
X-Region: eu-west
X-Backend-Id: proj-7f9c6bd9d8-2m4sx
Server-Timing: lb;dur=3, gw;dur=6
Content-Type: application/json

Error Handling

Error Scenarios

| HTTP Code | Scenario | Recovery Action | Retry Strategy |
| --- | --- | --- | --- |
| 400 | Invalid host/SNI, malformed headers (X-Region-Hint) | Correct request | — |
| 401 | Auth failure (if L7 does authN) | Re-authenticate | Retry after renewal |
| 403 | Region not allowed by residency | Remove hint / use allowed region | — |
| 404 | Route/service not found | Verify path/host | — |
| 409 | Sticky key conflicts with pool policy | Clear cookie/change key | — |
| 429 | LB/Gateway rate limit | Back off | Exponential + jitter |
| 502 | Bad gateway (abrupt upstream close) | Investigate upstream | Retry idempotent |
| 503 | No healthy backends / brownout | Failover or wait | Respect Retry-After |
| 504 | Upstream timeout | Tune timeouts or retry | Idempotent only |

Failure Modes

  • Hot spotting: poor hash key → use ring-hash on tenantId and minimum healthy hosts.
  • Sticky drift: deleted pod but cookie persists → cookie TTL/clearing and outlier ejection.
  • Cross-region leakage: missing residency guard → enforce allowlist at GTM and L7.

Recovery Procedures

  1. Drain failing instances (connection draining) and eject outliers.
  2. Flip traffic weights away from impaired pool; enable canary disable flag.
  3. Trigger regional failover at GTM if health below threshold.

Performance Characteristics

Latency Expectations

  • Added L7 overhead p95 ≤ 5 ms; GTM selection ≤ 20 ms additional.

Throughput Limits

  • Tune per-service max connections/requests; queue length capped (e.g., 100) to prevent head-of-line blocking.

Resource Requirements

  • LB nodes sized for TLS termination (ECDSA), HTTP/2, and gRPC fan-in/out; enable connection reuse.

Scaling Considerations

  • Scale LB nodes horizontally; shard by region; enable autoscaling based on RPS and CPU.
  • Prefer least-request for spiky traffic; ring-hash for affinity; weighted RR for canaries.
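Least-request selection is commonly approximated with the power-of-two-choices heuristic: sample two healthy hosts and keep the one with fewer in-flight requests. A sketch under that assumption (the host record shape is illustrative):

```python
import random

def pick_least_request(hosts, rng=random):
    """Power-of-two-choices: sample two healthy hosts, keep the less loaded.

    Each host is modelled as {"id": ..., "healthy": bool, "inflight": int}.
    Raises when no healthy backend exists (the LB would answer 503).
    """
    healthy = [h for h in hosts if h["healthy"]]
    if not healthy:
        raise RuntimeError("no healthy backends")
    if len(healthy) == 1:
        return healthy[0]
    a, b = rng.sample(healthy, 2)
    return a if a["inflight"] <= b["inflight"] else b
```

Sampling two hosts instead of scanning the whole pool keeps selection O(1) while still strongly biasing traffic away from loaded instances.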

Security & Compliance

Authentication

  • TLS 1.2+ at edge; optional mTLS to backends; ALPN for HTTP/2/gRPC.

Authorization

  • If Gateway performs authZ, L7 forwards identity context; deny routes without matching policies.

Data Protection

  • No PII in LB logs; mask headers; use HSTS; secure cookies (HttpOnly, Secure, SameSite=Lax).

Compliance

  • Residency honored at GTM/L7; all decisions auditable (who changed routes/weights).

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
| --- | --- | --- | --- |
| `lb_requests_total{route,region}` | counter | Requests by route | Trend |
| `lb_latency_seconds` | histogram | Added LB latency | p95 breach |
| `lb_upstream_5xx_total` | counter | Backend errors | Spike |
| `lb_no_healthy_backends_total` | counter | Routing failures | Any > 0 |
| `lb_active_connections` | gauge | Concurrent conns | Saturation |
| `lb_outlier_ejections_total` | counter | Ejected hosts | Investigate |

Logging Requirements

  • Access logs with region, backendId, status, bytes, durationMs, traceId; redact sensitive headers.

Distributed Tracing

  • Start or propagate traceparent; add span attributes lb.region, lb.backend_id, policy.

Health Checks

  • Active (HTTP/gRPC) + passive checks; outlier detection (consecutive 5xx/latency) with ejection & recovery.

Operational Procedures

Deployment

  1. Deploy GTM records (Geo/latency policy + failover).
  2. Roll out L7 LB/Ingress with routes, TLS certs, and backends.
  3. Enable discovery (EDS) and health checks; validate with synthetic probes.

Configuration

  • Algorithms: least_request, ring_hash(key=X-Sticky-Key|tenantId), weighted_round_robin.
  • Timeouts: connect=1s, request=5s (per route), idle=60s.
  • Headers: set X-Region, X-Backend-Id, and propagate traceparent.

Maintenance

  • Rotate TLS certs; tune weights during canaries; routinely test failover.
  • Drain nodes before upgrades (connection draining, readiness gates).

Troubleshooting

  • Elevated 5xx → check outlier ejections, backend health, circuit breaker trips.
  • High latency → verify least-request and connection pool sizes; inspect Nagle/HTTP/2 settings.
  • Sticky anomalies → clear cookies, verify ring-hash seed and host set stability.

Testing Scenarios

Happy Path Tests

  • Requests distributed evenly under steady load (Gini coefficient within target).
  • Sticky session remains on same backend across N requests.

Error Path Tests

  • 503 when all backends unhealthy; 504 on upstream timeout; 404 on unknown route.
  • 409 when sticky key conflicts with policy handled gracefully.

Performance Tests

  • p95 LB overhead ≤ 5 ms at target RPS; no queue growth beyond cap.
  • Failover to secondary region within SLA (< 60s) under regional outage.

Security Tests

  • TLS and cipher policy enforced; mTLS to backends verified.
  • Residency blocks cross-region routing attempts; logs contain no PII.

Internal References

External References

  • Load balancing algorithms (least-request, ring-hash) and best practices

Appendices

A. Example Envoy Route (weighted + ring-hash)

route:
  match: { prefix: "/query" }
  route:
    hash_policy:
      - header: { header_name: "X-Sticky-Key" }
      - cookie: { name: "atp_affinity", ttl: 3600s, path: "/" }
    weighted_clusters:
      clusters:
        - name: query-primary
          weight: 95
        - name: query-canary
          weight: 5
    timeout: 5s
    idle_timeout: 60s

B. Problem+JSON (example 503)

{
  "type": "urn:connectsoft:errors/lb/no-healthy-backends",
  "title": "No healthy backends available",
  "status": 503,
  "detail": "All instances for route '/ingest' are out of service.",
  "retryAfterSeconds": 10
}

Caching Flow

Reduces read latency and load on backing stores via tenant-scoped caches with L1 (in-process) and L2 (distributed) tiers. Supports read-through + stale-while-revalidate (SWR), with projection-driven invalidation and export-safe cache bypass when strong freshness is required. Consistency model and TTLs are explicit per resource.


Overview

Purpose: Serve query responses quickly while honoring tenant isolation and documented freshness guarantees.
Scope: Cache lookup → hit/miss handling → read-through fill → TTL/SWR behavior → projector/exports invalidations → observability. Excludes CDN/public caching.
Context: Query Service fronts Projection DB/Search with L1/L2 caches; Projection Update Flow emits invalidations; Export may request bypass/lock.
Key Participants:

  • Client
  • API Gateway / Query Service
  • Cache L1 (per-pod)
  • Cache L2 (Redis/Memcache)
  • Projection DB / Search Index
  • Invalidation Bus (events from Projector/Export)

Prerequisites

System Requirements

  • L1 in-process cache with bounded memory and eviction (LRU/LFU)
  • L2 distributed cache with multi-tenant namespaces, TLS, and ACLs
  • Invalidation channel (pub/sub or stream) from Projector & Export
  • Strong hashing for keys; serialization with versioned schema

Business Requirements

  • Documented consistency choices per endpoint: strong, bounded-staleness, or eventual
  • Per-edition TTLs and max object sizes; negative-caching policy
  • Clear semantics for export and legal-hold reads (bypass or SWR disabled)

Performance Requirements

  • p95 cache hit latency: L1 ≤ 1 ms, L2 ≤ 3 ms
  • Target hit ratio: ≥ 85% for hot keys; ≥ 60% overall for query endpoints
  • Fill amplification bounded (parallel request coalescing)

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    participant C as Client
    participant GW as API Gateway / Query Service
    participant L1 as Cache L1 (in-process)
    participant L2 as Cache L2 (Redis)
    participant DB as Projection DB / Search
    participant BUS as Invalidation Bus

    C->>GW: GET /query/v1/records?tenant=acme&from=... (Cache-Mode: default)
    GW->>L1: GET cache[key(Tenant,QueryHash)]
    alt L1 hit (fresh)
        L1-->>GW: value, meta{ttl,freshness}
        GW-->>C: 200 OK (X-Cache: L1-HIT, X-Cache-Freshness: fresh)
    else L1 miss
        GW->>L2: GET key
        alt L2 hit (fresh or SWR-eligible)
            L2-->>GW: value, meta
            GW-->>C: 200 OK (X-Cache: L2-HIT, X-Cache-Freshness: fresh|stale)
            opt SWR revalidate in background if stale
                GW->>DB: Query (If-None-Match: etag)
                DB-->>GW: 304 or 200 + new value
                GW->>L2: SET key (ttl)
                GW->>L1: SET key (ttl)
            end
        else L2 miss
            GW->>DB: Query
            DB-->>GW: 200 result (etag)
            GW->>L2: SET key (ttl, etag)
            GW->>L1: SET key (ttl, etag)
            GW-->>C: 200 OK (X-Cache: MISS)
        end
    end
    BUS-->>L2: Invalidation(key or tag) on projection update
    L2-->>L1: Fan-out eviction notice

Alternative Paths

  • Bypass: header Cache-Mode: bypass → skip L1/L2 for strict reads (e.g., export) and optionally refresh cache.
  • Write-around: projector writes DB then publishes tag-based invalidations (e.g., tenant:acme, resource:order:123).
  • Coalesced fills: first request holds a per-key mutex; subsequent misses wait to avoid stampede.
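The hit/miss/SWR decision in the diagram reduces to comparing an entry's age against `max-age` and the SWR window. A condensed single-tier sketch (class and hook names are illustrative assumptions, not ATP APIs):

```python
import time

class SwrCache:
    """Read-through cache with stale-while-revalidate semantics (one tier)."""

    def __init__(self, max_age=30.0, swr=60.0, clock=time.monotonic):
        self.max_age, self.swr, self.clock = max_age, swr, clock
        self._store = {}  # key -> (value, stored_at)

    def get(self, key, fetch, revalidate=None):
        """Return (value, state) with state in {'fresh', 'stale', 'miss'}.

        A stale hit serves the cached value and calls `revalidate` (if
        given) so a background refresh can be scheduled; beyond the SWR
        window the entry is refetched synchronously (read-through fill).
        """
        now = self.clock()
        entry = self._store.get(key)
        if entry:
            value, stored_at = entry
            age = now - stored_at
            if age <= self.max_age:
                return value, "fresh"
            if age <= self.max_age + self.swr:
                if revalidate:
                    revalidate(key)  # e.g., enqueue async re-fetch
                return value, "stale"
        value = fetch()
        self._store[key] = (value, now)
        return value, "miss"
```

The returned state maps naturally onto the `X-Cache-Freshness` header shown later in this flow.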

Error Paths

sequenceDiagram
    participant GW as Query Service
    participant L2 as Cache L2
    participant DB as Projection DB

    GW->>L2: GET key
    alt 503 L2 unavailable
        GW->>DB: Fallback to DB
        DB-->>GW: 200
        GW->>L2: (skip SET) or queue async warm
    else 409 CAS/ETag conflict on SET
        L2-->>GW: 409 Conflict
        GW->>L2: GET latest → retry SET (backoff)
    end

Request/Response Specifications

Input Requirements (Headers & Query)

| Field | Type | Req | Description | Validation |
| --- | --- | --- | --- | --- |
| X-Tenant-Id | header | Y | Tenant namespace for cache | ULID/UUID |
| Cache-Mode | header | O | `default` \| `bypass` \| `refresh` \| `swr-only` | enum |
| Cache-Control | header | O | `max-age`, `stale-while-revalidate`, `no-store` | RFC 7234 |
| If-None-Match | header | O | Revalidation with ETag | string |
| X-Consistency | header | O | `strong` \| `bounded` \| `eventual` | per-route |
| Query params | query | O | Affect key hash | canonicalized order |

Output Specifications (Response & Meta)

| Field | Type | Description | Notes |
| --- | --- | --- | --- |
| X-Cache | header | `L1-HIT` \| `L2-HIT` \| `MISS` \| `BYPASS` \| `STALE` | observability |
| ETag | header | Entity tag for revalidation | stable per value |
| Cache-Control | header | Response caching directives | includes max-age |
| X-Cache-Key | header | Debug key (hashed/short) | no PII |
| X-Cache-Freshness | header | `fresh` \| `stale(<sec>)` | SWR info |
| X-Watermark | header | Projection watermark time | freshness signal |

Example Payloads

Bounded-staleness read with SWR

GET /query/v1/records?tenant=acme&from=2025-10-27T08:00Z HTTP/1.1
X-Tenant-Id: 01JF...
Cache-Mode: default
X-Consistency: bounded

HTTP/1.1 200 OK
X-Cache: L2-HIT
Cache-Control: max-age=30, stale-while-revalidate=60
ETag: "recset:acme:ab12"
X-Cache-Freshness: stale(12)
X-Watermark: 2025-10-27T08:05:30Z

Error Handling

Error Scenarios

| HTTP Code | Scenario | Recovery Action | Retry Strategy |
| --- | --- | --- | --- |
| 400 | Invalid Cache-Mode/X-Consistency value; oversized key | Fix headers/params | — |
| 401 | Missing tenant header for cached endpoints | Add X-Tenant-Id | Retry after fix |
| 403 | Tenant not allowed on this region/cache | Correct region or policy | — |
| 404 | Cache management API: unknown key/tag on purge | No-op; verify key | — |
| 409 | CAS/ETag conflict on concurrent SET | Retry with backoff; re-GET latest | Jittered backoff |
| 412 | Revalidation precondition failed (ETag mismatch) | Fetch full object | Conditional retry |
| 429 | Cache rate limit (management ops) | Back off | Exponential |
| 503 | L2 unavailable | Fallback to DB; degrade to L1-only | Bounded retries |

Failure Modes

  • Cache stampede: thundering herd on popular key → request coalescing, jittered TTLs, SWR background refresh.
  • Stale reads too old: misconfigured stale-while-revalidate → enforce max-staleness cap per route.
  • Cross-tenant leakage: missing tenant in key → mandatory X-Tenant-Id + namespace prefixes.
  • Oversized entries: evictions/fragmentation → cap object size, compress payloads, or avoid caching.

Recovery Procedures

  1. Disable SWR temporarily for problematic routes; set shorter TTLs.
  2. Purge by tag (tenant:acme, resource:order:123) after projection anomalies.
  3. Route around L2 failures (feature flag) while keeping read path via DB.
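Purging by tag, as in step 2, requires a reverse index from tags to the keys carrying them. A minimal sketch (the index shape is an assumption; a distributed L2 would typically keep it in sets alongside the values):

```python
from collections import defaultdict

class TaggedCache:
    """Cache values under keys, with tags enabling group invalidation."""

    def __init__(self):
        self._values = {}
        self._by_tag = defaultdict(set)  # tag -> keys carrying it

    def set(self, key, value, tags=()):
        self._values[key] = value
        for tag in tags:
            self._by_tag[tag].add(key)

    def get(self, key):
        return self._values.get(key)

    def purge_tag(self, tag):
        """Evict every key carrying `tag`; returns the eviction count."""
        keys = self._by_tag.pop(tag, set())
        for key in keys:
            self._values.pop(key, None)
        return len(keys)
```

Tags such as `tenant:acme` or `resource:order:123` let a projector invalidate whole groups without enumerating individual query keys.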

Performance Characteristics

Latency Expectations

  • L1 ≤ 1 ms p95; L2 ≤ 3 ms p95; read-through to DB ≤ endpoint SLO.

Throughput Limits

  • L2 QPS sized for peak miss + revalidation; keyspace cardinality controlled via hashing and tag strategy.

Resource Requirements

  • Memory budgets per pod (L1) and per cluster (L2); eviction policy tuned (LFU for skewed traffic).

Scaling Considerations

  • Partition L2 by region and shard; enable replica readers; avoid cross-AZ chatter.
  • Use compressed values (e.g., zstd) for large result sets with CPU tradeoff.

Security & Compliance

Authentication

  • mTLS between services and L2; signed purge APIs.

Authorization

  • RBAC for cache management (cache:purge|inspect); tenant-scoped purge only.

Data Protection

  • No PII in keys; values encrypted at rest if L2 supports; TLS in transit.

Compliance

  • Audit cache management actions (purge/warm) with actor, scope, reason.

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
| --- | --- | --- | --- |
| `cache_hit_total{tier}` | counter | Hits by tier | Drop signals issues |
| `cache_miss_total` | counter | Misses (cold + reval) | Spike alert |
| `cache_hit_ratio` | gauge | Hits / (hits+misses) | < target |
| `cache_evictions_total` | counter | Evictions by reason | Unexpected growth |
| `cache_swr_served_total` | counter | Stale responses served | Excess indicates lag |
| `cache_fill_duration_seconds` | histogram | Miss→filled latency | p95 breach |
| `cache_invalidation_total{tag}` | counter | Invalidation events | Monitor volume |

Logging Requirements

  • Include tenantId, short cacheKey, tier, freshness, hit/miss, fillMs, traceId. No payloads/PII in logs.

Distributed Tracing

  • Child spans for cache.l1.get, cache.l2.get/set, cache.swr.revalidate, with attributes key_hash, tier.

Health Checks

  • L2 readiness probes; replication lag; pub/sub connectivity for invalidations.

Operational Procedures

Deployment

  1. Deploy L2 cache cluster (HA) with TLS and ACL; configure namespaces per region.
  2. Enable L1 caches in services with bounds and eviction settings.
  3. Wire projector → invalidation bus → L2 pub/sub fan-out.

Configuration

  • Defaults: TTL=30s, stale-while-revalidate=60s, max-staleness=90s, negativeTTL=3s.
  • Enable request coalescing and per-key mutex; cap value size (e.g., 512 KB).

Maintenance

  • Periodic warm-up for hot keys post-deploy; tune TTLs using hit/miss analytics.
  • Rotate L2 credentials; defragment and scale nodes as keyspace grows.

Troubleshooting

  • Low hit ratio → verify key canonicalization and tenant scoping.
  • Stampedes → increase jitter, enable SWR, and coalescing.
  • Staleness complaints → reduce TTL or require Cache-Mode: bypass for affected endpoints.

Testing Scenarios

Happy Path Tests

  • L1/L2 hits return within target latencies and correct headers.
  • Revalidation updates cache while serving stale safely (SWR).

Error Path Tests

  • 503 L2 outage falls back to DB with acceptable latency.
  • 409 CAS conflict on SET resolves with retry and no corruption.
  • 400 invalid Cache-Mode rejected.

Performance Tests

  • Hit ratio meets targets under production-like skew (Zipfian).
  • Thundering herd prevented under bursty traffic.

Security Tests

  • No cross-tenant cache bleed; purge is tenant-scoped and audited.
  • TLS and ACLs enforced for L2 connections.

Internal References

External References

  • RFC 7234 (HTTP Caching), SWR patterns; Redis best practices

Appendices

A. Cache Key Schema (canonicalized)

Key = sha256(
  "tenant=" + TenantId +
  "&route=" + RouteId +
  "&params=" + CanonicalQueryString +
  "&version=" + SchemaVersion
)
Namespace = "atp:{region}:{edition}"
Final = Namespace + ":q:" + KeyPrefix
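The key schema above can be realized roughly as follows; the canonicalization choices (sorting params by name and value, keeping a short digest prefix) are assumptions for illustration:

```python
import hashlib
from urllib.parse import urlencode

def cache_key(tenant_id, route_id, params, schema_version,
              region, edition, prefix_len=16):
    """Build a namespaced cache key from a canonicalized query.

    Params are sorted so equivalent queries hash identically; only a
    short prefix of the sha256 digest is kept in the final key.
    """
    canonical = urlencode(sorted(params.items()))
    material = (f"tenant={tenant_id}&route={route_id}"
                f"&params={canonical}&version={schema_version}")
    digest = hashlib.sha256(material.encode()).hexdigest()[:prefix_len]
    return f"atp:{region}:{edition}:q:{digest}"
```

Because the tenant id is hashed into the key and the namespace is region/edition scoped, two tenants can never collide on the same cache entry.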

B. Problem+JSON Examples

{
  "type": "urn:connectsoft:errors/cache/invalid-mode",
  "title": "Invalid Cache-Mode",
  "status": 400,
  "detail": "Allowed values are default|bypass|refresh|swr-only."
}

{
  "type": "urn:connectsoft:errors/cache/conflict",
  "title": "Concurrent cache update conflict",
  "status": 409,
  "detail": "ETag mismatch during SET. Value updated by another request."
}

Partitioning Flow

Routes traffic and data by tenant / shard / region using a deterministic partition strategy (e.g., TenantId + TimeBucket) mapped onto a consistent-hash ring. Ensures RLS enforcement at the data plane and honors residency flags so data stays within allowed regions. Supports shard pruning on reads and smooth ring changes with minimal rebalancing.


Overview

Purpose: Achieve scalable, cost-efficient storage and query performance by distributing load across shards while preserving strict tenant isolation and residency.
Scope: Partition key derivation → ring lookup → write placement (append store & indexes) → read-time shard pruning → ring change management (add/remove/move) → RLS enforcement. Excludes cross-region replication (covered elsewhere).
Context: Ingestion and Query paths use the Placement Service and Partition Catalog to route writes/reads. Storage (Append), Projection DB, and Search Index expose per-shard/tenant namespaces.
Key Participants:

  • API Gateway / Ingestion Service
  • Placement Service (ring lookup)
  • Partition Catalog (tenants, shards, regions)
  • Storage (Append) / Projection DB / Search Index
  • RLS/Policy Engine

Prerequisites

System Requirements

  • Global Partition Catalog with tenant → region/edition → shard mapping
  • Consistent-hash ring with virtual nodes; gossip or control-plane updates
  • Time bucketing policy (e.g., hour|day) for hot-key spreading and pruning
  • RLS enabled in all data planes (tenant-scoped schemas/aliases)

Business Requirements

  • Residency policy per tenant/edition with allowed regions and data classes
  • Hot-tenant isolation rules (dedicated shards/weighting)
  • Ring change governance (approvals, maintenance windows for big moves)

Performance Requirements

  • Target shard load imbalance (P95) ≤ 1.5× average
  • Read pruning effectiveness ≥ 90% of shards skipped for typical time windows
  • Partition lookup p95 ≤ 1 ms (cached in-process)

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    participant C as Client
    participant GW as API Gateway
    participant ING as Ingestion Service
    participant PLC as Placement Service (Ring)
    participant ST as Storage (Append / Shard)
    participant PR as Projection DB (Shard)
    participant IX as Search Index (Tenant Alias)

    C->>GW: POST /audit/v1/records (X-Tenant-Id, time=2025-10-27T08:05Z)
    GW->>ING: Canonicalized record (TenantId, OccurredAt)
    ING->>PLC: ResolvePartition(TenantId, TimeBucket=2025-10-27:08)
    PLC-->>ING: {region: eu-west, shard: s-17, keyspace: k_acme}
    ING->>ST: Append to s-17 (RLS=TenantId)
    ST-->>ING: ack (offset, partitionId)
    ING-->>GW: 202 Accepted (X-Partition: s-17, X-Region: eu-west)

    C->>GW: GET /query/v1/records?tenant=acme&from=08:00&to=08:10
    GW->>PLC: PlanQuery(TenantId, Range)
    PLC-->>GW: {prunedShards:[s-17,s-18], watermark}
    GW->>PR: Read from pruned shards (RLS=TenantId)
    PR-->>GW: results
    GW-->>C: 200 OK (X-Shards: s-17,s-18)

Alternative Paths

  • Hot-tenant isolation: Placement pins tenant to a dedicated shard set (higher vNode weight) to prevent noisy neighbors.
  • Multi-bucket fanout: Large ranges map to multiple time buckets → pruned shard list per bucket, executed in parallel with bounded concurrency.
  • Search path: Query uses per-tenant alias → resolves to index shards in allowed region only (no cross-region hits).
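The multi-bucket fanout with bounded concurrency could look like the sketch below; `resolve_shards` and `read_shard` are hypothetical stand-ins for the Placement plan call and the per-shard reads:

```python
from concurrent.futures import ThreadPoolExecutor

def fanout_query(buckets, resolve_shards, read_shard, max_concurrency=4):
    """Resolve the pruned shard list for each time bucket, dedupe the
    shards, then read them in parallel with bounded concurrency and merge."""
    shards = sorted({s for b in buckets for s in resolve_shards(b)})
    merged = []
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        for rows in pool.map(read_shard, shards):
            merged.extend(rows)
    return merged
```

Deduping first matters: adjacent buckets often map to overlapping shard sets (as in the s-17/s-18 example above), and each shard should be read once.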

Error Paths

sequenceDiagram
    participant ING as Ingestion
    participant PLC as Placement
    participant ST as Storage

    ING->>PLC: ResolvePartition(TenantId=T?, TimeBucket=?)
    alt 400 Bad Request (invalid tenant/time)
        PLC-->>ING: 400 Problem+JSON
    else 403 Residency violation (region hint not allowed)
        PLC-->>ING: 403 Problem+JSON
    else 404 Not Found (tenant or shard mapping missing)
        PLC-->>ING: 404 Problem+JSON
    else 409 Conflict (ring update in progress, epoch mismatch)
        PLC-->>ING: 409 Problem+JSON (retry with new epoch)
    else 503 Service Unavailable (catalog/ring unavailable)
        PLC-->>ING: 503 Problem+JSON (Retry-After)
    end

Request/Response Specifications

Input Requirements

Field Type Req Description Validation
X-Tenant-Id header Y Tenant identity for RLS and partitioning ULID/UUID
X-Region-Hint header O Preferred region (must be allowed) Residency allowlist
OccurredAt body field Y Event time used for time bucket RFC3339 UTC
Partition-Key header O Override hash key (advanced) Controlled via policy
Range query O from/to time for reads from ≤ to, bounded span
X-Ring-Epoch header O Client-observed ring epoch Monotonic int

Output Specifications

Field Type Description Notes
X-Partition header Chosen shard id For debugging
X-Region header Serving region Residency proof
X-Shards header Pruned shard list for reads Comma-separated
X-Watermark header Lowest consistent time served For staleness checks
X-Ring-Epoch header Ring epoch used for routing Detect drift

Example Payloads

Resolve Partition (internal)

POST /placement/v1/resolve
{
  "tenantId": "01JF6V3A6W1T6E2TB1C2N2YV9Q",
  "occurredAt": "2025-10-27T08:05:12Z"
}

Response

{
  "region": "eu-west",
  "shardId": "s-17",
  "epoch": 42,
  "keyspace": "k_acme"
}

Query Plan (pruning)

POST /placement/v1/plan-query
{
  "tenantId": "01JF6V3A6W1T6E2TB1C2N2YV9Q",
  "from": "2025-10-27T08:00:00Z",
  "to": "2025-10-27T08:10:00Z"
}

Response

{
  "region": "eu-west",
  "shards": ["s-17","s-18"],
  "watermark": "2025-10-27T08:09:58Z",
  "epoch": 42
}

Error Handling

Error Scenarios

HTTP Code Scenario Recovery Action Retry Strategy
400 Missing/invalid X-Tenant-Id, bad time window Fix request headers/params
401 Unauthenticated request to placement APIs Authenticate Retry after renewal
403 Residency/edition violation (region not allowed) Choose allowed region
404 Tenant or shard mapping not found Re-sync catalog / onboard tenant
409 Ring epoch mismatch during write/read Fetch latest epoch; redo resolve Jittered retry
412 Preconditions (RLS context) not present Include tenant scope
429 Placement lookups rate-limited Back off Exponential + jitter
503 Placement/Catalog unavailable Degrade to cached hint or fail Bounded retries
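The 409 recovery path (refresh epoch, jittered retry) and bounded retries for 503 might be sketched as follows; the exception classes are illustrative stand-ins for the Problem+JSON responses, not real client types:

```python
import random
import time

class EpochMismatch(Exception):   # stand-in for a 409 epoch-mismatch response
    pass

class Unavailable(Exception):     # stand-in for a 503 response
    pass

def resolve_with_retry(resolve, fetch_epoch, max_attempts=5, base_delay=0.05):
    """Retry ResolvePartition: on 409 refresh the ring epoch before
    retrying, on 503 just back off. Delays use exponential backoff
    with full jitter to avoid synchronized retry storms."""
    epoch = fetch_epoch()
    for attempt in range(max_attempts):
        try:
            return resolve(epoch)
        except EpochMismatch:
            epoch = fetch_epoch()        # pick up the new ring epoch
        except Unavailable:
            pass                         # transient; epoch still valid
        time.sleep(random.uniform(0, base_delay * 2 ** attempt))
    raise RuntimeError("placement resolve failed after %d attempts" % max_attempts)
```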

Failure Modes

  • Hot shard: skewed hash or burst tenant → adjust vNode weights, or isolate tenant to dedicated shard set.
  • Ring churn: frequent membership changes cause 409s → stage updates and epoch gating with drain.
  • Cross-region spill: misconfigured residency → hard deny at placement and gateway.

Recovery Procedures

  1. Enable skew mitigations (weighting, pinning) and backfill if rebalancing moved ranges.
  2. Roll back ring change to prior epoch if error rate spikes; drain and retry in controlled batches.
  3. Rebuild tenant alias in Search/Projection if shard move required index re-aliasing.

Performance Characteristics

Latency Expectations

  • Placement cache lookup ≤ 1 ms p95; cold fetch ≤ 10 ms p95.
  • Pruned read fanout limited to ≤ 4 shards for typical query windows.

Throughput Limits

  • Placement QPS sized for all writes + planning; use edge caches in services to reduce calls.

Resource Requirements

  • Small in-memory partition maps per service; watch stream for updates; compact ring representation with virtual nodes.

Scaling Considerations

  • Multi-ring design (per-region) to avoid cross-region chatter.
  • Add shards by adding vNodes (smooth rebalance ≤ 10% key movement).
  • Time buckets control hot partitions; tune bucket size by workload.
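The smooth-rebalance property follows from consistent hashing with virtual nodes: when a shard is added, only the keys that land on the new shard's vNode arcs move, roughly 1/N of the key space. An illustrative sketch (vNode count and hash choice are arbitrary here, not ATP settings):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Consistent-hash ring with virtual nodes. Adding a shard moves
    only the keys whose successor point becomes one of the new shard's
    vNodes (~1/N of the space for N shards)."""

    def __init__(self, shards=(), vnodes=64):
        self._points = []                 # sorted list of (hash, shard)
        self._vnodes = vnodes
        for shard in shards:
            self.add_shard(shard)

    @staticmethod
    def _hash(value):
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add_shard(self, shard):
        for i in range(self._vnodes):
            bisect.insort(self._points, (self._hash(f"{shard}#vn{i}"), shard))

    def lookup(self, key):
        # Owner is the first ring point at or after the key's hash (wrapping).
        hashes = [h for h, _ in self._points]
        i = bisect.bisect(hashes, self._hash(key)) % len(self._points)
        return self._points[i][1]
```

Because points are only added, every key that changes owner maps to the new shard, which is what keeps key movement bounded during ring growth.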

Security & Compliance

Authentication

  • mTLS between services and Placement/Catalog; OIDC for ops.

Authorization

  • Roles: placement:read, placement:update; only platform ops can alter ring/vNodes.

Data Protection

  • Enforce RLS at DB and index layers; per-tenant schemas/aliases; no PII in partition keys.

Compliance

  • Residency is enforced at planning/placement time and audited; changes to ring membership are recorded as Partition.RingUpdated events.

Monitoring & Observability

Key Metrics

Metric Type Description Alert Threshold
partition_lookup_latency_seconds histogram Placement latency p95 > 10 ms
partition_skew_ratio gauge Max shard load / avg > 1.5
ring_epoch_mismatch_total counter 409 due to epoch drift Spike
reads_shards_scanned histogram Shards touched per query p95 > target
residency_denied_total counter 403 due to residency Any sustained
hot_tenant_isolations_total counter Isolation activations Trend

Logging Requirements

  • Include tenantId, region, epoch, shardId, bucket, planId, traceId; never log plaintext PII.

Distributed Tracing

  • Spans: placement.resolve, placement.planQuery; attributes: epoch, shard_list, bucket_count.

Health Checks

  • Catalog freshness (last update time), ring convergence across nodes, RLS guard status.

Operational Procedures

Deployment

  1. Deploy Placement Service (HA) and Catalog with watch streams.
  2. Configure per-region rings; seed vNodes; warm caches.

Configuration

  • Hash: fnv1a/xxhash on TenantId + BucketKey.
  • Bucket: daily/hourly; configurable per tenant/class.
  • Ring: vNodes=256 default; epoch increments on changes.

Maintenance

  • Quarterly ring review; rebalance heavy shards; rotate ring secrets.
  • Simulate ring changes in staging with shadow placement before production.

Troubleshooting

  • High shard scan count → check time bucket tuning and secondary predicates.
  • 409 spikes → ensure services refresh epoch quickly; increase push frequency.
  • Residency denials → verify tenant policy and region hint.

Testing Scenarios

Happy Path Tests

  • Ingest routes to correct shard/region with proper headers.
  • Query pruning selects minimal shards and returns correct results.

Error Path Tests

  • 400/404 invalid tenant/mapping rejected; 409 epoch mismatch handled by retry.
  • 403 residency violations blocked decisively.

Performance Tests

  • Placement p95 ≤ 1 ms cached; shard skew ratio ≤ 1.5× under load.
  • Query scans ≤ target shards for standard ranges.

Security Tests

  • RLS enforced on all reads/writes; no cross-tenant leakage.
  • Residency never violated even under failover.

Internal References

External References

  • Consistent hashing & virtual nodes best practices

Appendices

A. Partition Key Derivation

BucketKey = floor(to_unix(OccurredAt) / BucketSizeSeconds)
HashInput = TenantId || ":" || BucketKey
Shard = Ring(hash(HashInput))
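A runnable version of this derivation, assuming hourly buckets, the FNV-1a 64-bit hash named under Configuration, and a simple modulo standing in for the consistent-hash ring lookup:

```python
from datetime import datetime, timezone

def fnv1a_64(data: bytes) -> int:
    # FNV-1a, 64-bit: standard offset basis and prime.
    h = 0xcbf29ce484222325
    for byte in data:
        h ^= byte
        h = (h * 0x100000001b3) & 0xFFFFFFFFFFFFFFFF
    return h

def bucket_key(occurred_at: str, bucket_size_seconds: int = 3600) -> int:
    # BucketKey = floor(to_unix(OccurredAt) / BucketSizeSeconds)
    ts = datetime.fromisoformat(occurred_at.replace("Z", "+00:00"))
    return int(ts.timestamp()) // bucket_size_seconds

def shard_for(tenant_id: str, occurred_at: str, ring_size: int = 32) -> str:
    hash_input = f"{tenant_id}:{bucket_key(occurred_at)}".encode("utf-8")
    # Modulo here is a placeholder for Ring(hash(HashInput)).
    return f"s-{fnv1a_64(hash_input) % ring_size}"
```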

B. Problem+JSON Examples

{
  "type": "urn:connectsoft:errors/partition/epoch-mismatch",
  "title": "Ring epoch mismatch",
  "status": 409,
  "detail": "Client epoch 41 != current epoch 42."
}
{
  "type": "urn:connectsoft:errors/partition/residency-violation",
  "title": "Region not allowed by residency policy",
  "status": 403,
  "detail": "Tenant 'acme' is restricted to eu-west."
}

Auto-Scaling Flow

Scales services safely with load using policy-driven HPA/KEDA decisions, proactive warmup/readiness gates, and cost guardrails. Prevents thrash via stabilization windows, rate limits, and deliberate scale-in. Maintains SLOs while distributing load across newly ready instances.


Overview

Purpose: Automatically add/remove capacity to meet SLOs while controlling cost and avoiding oscillation.
Scope: Signal collection → scaling decision → resource provisioning → service scale-out/in → warmup/readiness → load distribution → verification/rollback. Excludes manual capacity planning.
Context: Metrics from Observability and Queue/Bus feed Autoscaler (HPA/KEDA). Kubernetes (orchestrator) applies replica changes. Gateway/LB route traffic only to ready pods.
Key Participants:

  • Load Monitor (Prometheus/OTel, Queue metrics)
  • Autoscaler (HPA/KEDA controller)
  • Orchestrator (Kubernetes API Server)
  • Target Service (e.g., Ingestion/Query/Export)
  • Warmup Manager (init tasks, cache warm)
  • API Gateway / L7 LB
  • Cost Guard (budget policy evaluator)

Prerequisites

System Requirements

  • Metrics (CPU, memory, RPS, p95 latency, queue depth/lag) exported and scraped
  • HPA/KEDA installed with stabilization windows & scale rate limits
  • Readiness/Startup probes and graceful shutdown configured
  • Optional Warm Pool or pre-provisioned nodes for burst traffic

Business Requirements

  • SLOs defined per service (latency/error budget)
  • Cost guardrails (min/max replicas, monthly budget caps, per-tenant limits)
  • Change approvals for autoscaling policy updates

Performance Requirements

  • Scale-out reaction time ≤ 30–60s for CPU/RPS, ≤ 10s for queue lag (event-driven)
  • Scale-in conservatively; error budget burn must stay within targets
  • No oscillation: replica changes limited by stabilization (e.g., 300s down, 60s up)
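The anti-oscillation rules (per-direction stabilization windows plus step caps) can be sketched as a pure clamping function. Defaults below mirror the example values in this section; the function shape itself is illustrative, not the HPA algorithm:

```python
def apply_scale_policy(current, desired, seconds_since_last_change,
                       min_replicas=2, max_replicas=40,
                       up_stabilization=60, down_stabilization=300,
                       max_increase_pct=100, max_decrease_pct=33):
    """Clamp a raw desired-replica count by bounds, per-direction
    stabilization windows, and max step percentages (anti-thrash)."""
    desired = max(min_replicas, min(max_replicas, desired))
    if desired > current:
        if seconds_since_last_change < up_stabilization:
            return current                       # hold: up window still active
        ceiling = current + max(1, current * max_increase_pct // 100)
        return min(desired, ceiling)             # cap the scale-up step
    if desired < current:
        if seconds_since_last_change < down_stabilization:
            return current                       # hold: down window still active
        floor = current - max(1, current * max_decrease_pct // 100)
        return max(desired, floor)               # cap the scale-down step
    return current
```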

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    participant LM as Load Monitor (Metrics/Queue)
    participant AS as Autoscaler (HPA/KEDA)
    participant OR as Orchestrator (K8s API)
    participant SVC as Target Service
    participant WM as Warmup Manager
    participant LB as API Gateway / L7 LB
    participant CG as Cost Guard

    LM-->>AS: Signals {cpu=78%, rps=1.8k, p95=230ms, queueLag=high}
    AS->>CG: Check policy & budget (min/max, cost caps)
    CG-->>AS: OK (within budget)
    AS->>OR: Patch Deployment replicas +3 (rate-limited)
    OR->>SVC: Create Pods (Pending→Init→Running)
    SVC->>WM: Warmup (JIT cache, connection pools)
    SVC-->>OR: Readiness=TRUE (startup probe passed)
    OR-->>LB: Endpoint added to ready set
    LB-->>SVC: Start routing a % of traffic (ramp-up)
    LM-->>AS: Metrics improve (p95→140ms, queueLag→normal)
    AS->>OR: Hold steady (stabilization window active)

Alternative Paths

  • Predictive/Scheduled: pre-scale based on calendar or forecast (e.g., top-of-hour export).
  • Event-driven (KEDA): scale on queue depth/lag or webhook events (spikes).
  • Per-tenant partitions: scale labeled shard Deployments independently to isolate hot tenants.

Error Paths

sequenceDiagram
    participant AS as Autoscaler
    participant OR as Orchestrator
    participant CG as Cost Guard
    participant SVC as Target Service

    AS->>CG: Request scale beyond max
    CG-->>AS: 409 Conflict (budget cap)
    AS-->>AS: Clamp to max, raise alert

    AS->>OR: Scale to N
    OR-->>AS: 503 API unavailable / quota exceeded
    AS-->>AS: Retry w/ backoff, keep stabilization timer

    OR->>SVC: Start pods
    SVC-->>OR: Readiness FAILED (startup)
    OR-->>AS: Scale not effective
    AS-->>AS: Pause scale-in, open incident, hold window

Request/Response Specifications

Input Requirements (Autoscaling Policy APIs)

Field Type Req Description Validation
POST /ops/v1/autoscale/policies http Y Create/update policy RBAC
service string Y Target service name existing
minReplicas / maxReplicas int Y Bounds 1 ≤ min ≤ max
targets object O e.g., cpu=70, rps=200, p95Ms=180, queueLag=5s sane ranges
scaleUpPolicy object O stabilizationSec, maxIncreasePercent, step limits
scaleDownPolicy object O stabilizationSec, maxDecreasePercent, idleWindowSec limits
costGuardrails object O {maxMonthlyCents, maxNodes, burstAllowance} non-negative
predictive object O schedule/cron or model id valid cron

Output Specifications

Field Type Description Notes
policyId string Identifier immutable
status enum Active | Pending | Error
effectiveAt time Activation time RFC3339
reason string Policy validation result optional

Example Payloads

Create Policy

POST /ops/v1/autoscale/policies
{
  "service": "query",
  "minReplicas": 4,
  "maxReplicas": 40,
  "targets": { "cpu": 70, "p95Ms": 180, "rps": 250 },
  "scaleUpPolicy": { "stabilizationSec": 60, "maxIncreasePercent": 100, "step": 4 },
  "scaleDownPolicy": { "stabilizationSec": 300, "maxDecreasePercent": 33, "idleWindowSec": 600 },
  "costGuardrails": { "maxMonthlyCents": 250000, "maxNodes": 60 }
}

Decision Record (emit)

{
  "decisionId": "asd_01JF9A...",
  "service": "query",
  "from": 16,
  "to": 24,
  "reason": "p95>180ms and rps>target",
  "window": "60s",
  "guardrailsApplied": false,
  "timestamp": "2025-10-27T08:06:30Z"
}
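How the Cost Guard might clamp a request and emit such a decision record — a sketch only: `replicas_per_node` is an assumed packing factor used to turn the maxNodes budget into a replica cap, not an ATP parameter:

```python
def guarded_scale_decision(service, current, requested,
                           max_replicas, max_nodes, replicas_per_node,
                           timestamp):
    """Clamp a requested replica target to the policy max and to the
    node budget, flagging whether guardrails changed the outcome."""
    node_cap = max_nodes * replicas_per_node   # replicas affordable in budget
    target = min(requested, max_replicas, node_cap)
    return {
        "service": service,
        "from": current,
        "to": target,
        "guardrailsApplied": target < requested,
        "timestamp": timestamp,
    }
```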

Error Handling

Error Scenarios

HTTP Code Scenario Recovery Action Retry Strategy
400 Invalid policy (min>max, bad targets) Fix payload
401 Missing/invalid token to ops API Authenticate Retry after renewal
403 Caller lacks autoscale:write Request access
404 Policy/service not found Verify name; create first
409 Policy conflicts with cost guardrails or active rollout Adjust bounds or wait Conditional retry
412 Preconditions failed (budget exceeded) Increase budget or reduce target
429 Throttled ops updates Back off Exponential + jitter
503 Orchestrator unavailable/quota exhausted Retry; open incident Backoff; clamp to safe min

Failure Modes

  • Thrashing: rapid up/down changes → increase stabilization windows; lower sensitivity; coarser steps.
  • Cold-start latency: new pods routed too early → enforce readiness gates and ramp-up percentage.
  • Exceeding budget: forecast misses → cost guard clamps, triggers graceful degradation plans.

Recovery Procedures

  1. Freeze scale-down; hold steady at current replicas; widen windows.
  2. Enable predictive pre-scale during known peaks; warm caches.
  3. If quota hit, divert traffic (multi-region) or shed load (429) with idempotency keys.

Performance Characteristics

Latency Expectations

  • Scale-out decision path (signal→ready) ≤ 60–90s typical; ≤ 15s for KEDA on lag spikes.
  • No SLO breach during scale-in; drain connections before termination.

Throughput Limits

  • Max scale step per window (e.g., +100% up, −33% down).
  • Node autoscaler pre-warms to ensure pods schedule within target.

Resource Requirements

  • Metrics store sized for scrape interval and cardinality; autoscaler controller HA.
  • Warm pool (optional) sized to absorb N minutes of surge.

Scaling Considerations

  • Separate control plane autoscaler resources from workloads.
  • Partition by service/shard for isolation; avoid global contention.
  • Use pod disruption budgets (PDBs) to protect capacity on rollouts.

Security & Compliance

Authentication

  • OIDC for ops APIs; mTLS between autoscaler and cluster API.

Authorization

  • RBAC: autoscale:read, autoscale:write, autoscale:admin. Least privilege for controllers.

Data Protection

  • No PII in scaling logs/metrics; scrub tenant identifiers or hash.

Compliance

  • Emit audited events: Autoscale.PolicyUpdated|DecisionMade|ScaleApplied|GuardrailClamped with reason & actor.

Monitoring & Observability

Key Metrics

Metric Type Description Alert Threshold
autoscale_desired_replicas gauge Desired vs current Large sustained delta
autoscale_decisions_total{reason} counter Scale events Spike analysis
autoscale_thrash_total counter Up/down flips within window > 0 sustained
service_slo_latency_p95_ms gauge p95 latency > target
queue_lag_seconds gauge Event backlog > target
cost_estimated_monthly_cents gauge Spend projection > budget

Logging Requirements

  • Decision logs: decisionId, from→to, reasons, signals, guardrailsApplied, traceId.

Distributed Tracing

  • Spans: autoscale.evaluate, autoscale.apply; link to service load spans via traceparent.

Health Checks

  • Controller health, permission checks, K8s API latency; synthetic scale probe in staging.

Operational Procedures

Deployment

  1. Install HPA/KEDA; configure metrics adapters.
  2. Enable readiness/startup probes and graceful draining (preStop hooks).
  3. Apply baseline policies per service; verify guardrails.

Configuration

  • Example defaults: min=2, max=40, cpu=70%, p95=180ms, queueLag=5s.
  • scaleUpStabilization=60s, scaleDownStabilization=300s, maxIncrease=100%, maxDecrease=33%.
  • Cost guard: maxMonthlyCents, maxNodes, burstAllowance.

Maintenance

  • Quarterly policy review vs. observed traffic.
  • Load tests before peak seasons; adjust predictive schedules.

Troubleshooting

  • Oscillation → widen stabilization, reduce sensitivity, increase step size.
  • Pods not becoming ready → inspect warmup dependencies, increase startupProbe timeouts.
  • Budget clamp events → validate forecasts; consider reserved capacity.

Testing Scenarios

Happy Path Tests

  • Sustained load triggers scale-out within target time; SLO met.
  • Post-peak scale-in occurs after stabilization; no SLO regressions.

Error Path Tests

  • 409 guardrail clamp logged; system holds safe capacity.
  • 503 orchestrator outage handled by retries without thrash.

Performance Tests

  • Burst load with KEDA (queue lag) scales within ≤ 15s to clear backlog.
  • Scale-in preserves error budget and maintains p95 latency.

Security Tests

  • Only authorized roles can modify policies; all changes audited.
  • No PII in autoscale logs/metrics.

Internal References

External References

  • HPA/KEDA best practices; SRE guides on autoscaling and error budgets

Appendices

A. Example HPA (CPU + custom p95 latency via metrics adapter)

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: query-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: query
  minReplicas: 4
  maxReplicas: 40
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 33
          periodSeconds: 300
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Pods
      pods:
        metric:
          name: service_latency_p95_ms
        target:
          type: AverageValue
          averageValue: "180"

B. Example KEDA ScaledObject (queue lag)

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: export-worker
spec:
  scaleTargetRef:
    name: export-worker
  minReplicaCount: 2
  maxReplicaCount: 60
  cooldownPeriod: 300
  triggers:
    - type: redis
      metadata:
        address: REDIS_ADDR
        listName: export-jobs
        listLength: "100" # target backlog

C. Problem+JSON (policy conflict)

{
  "type": "urn:connectsoft:errors/autoscale/policy-conflict",
  "title": "Autoscale policy conflicts with guardrails",
  "status": 409,
  "detail": "Requested maxReplicas 120 exceeds maxNodes budget."
}