
Sequence Flows — Header, Scope & Notation — Audit Trail Platform (ATP)

This document captures end-to-end sequence flows for the Audit Trail Platform (ATP). It shows how requests move across services (Gateway → Ingestion → Storage → Integrity → Projection → Query/Search/Export), which headers and IDs are propagated, and where policy/redaction and integrity operations occur.

JSON uses lowerCamel; C#/gRPC (code-first) uses PascalCase; Protobuf fields are PascalCase with json_name mapped to lowerCamel. Times are ISO-8601 UTC with ms precision.


Purpose

  • Provide a definitive reference for request/response choreography across ATP.
  • Make tenancy, correlation, idempotency, redaction, and integrity touchpoints explicit.
  • Enable engineers, SREs, and auditors to reason about correctness and operational SLOs (e.g., projection and sealing lag).

Audience

  • Platform engineers implementing services and SDKs.
  • SRE/Operations running ATP in production.
  • Security & Compliance validating controls, proofs, and holds.
  • Integrations & SDK authors producing/consuming audit data.

Scope

  • Online and async ingestion, projection, integrity, query/search, export, policy/hold, recovery, observability.
  • Happy paths with alt/opt blocks for errors, retries, and degraded modes.
  • Cross-references to Data Model, Message Schemas, HLD, and Components.

Non-goals

  • Full API parameter docs (see REST/gRPC contracts).
  • Deep internals of cryptographic primitives (see Integrity spec).
  • Runbook procedures (see Operations/Runbook).

How to read these diagrams

  • Each flow is expressed with Mermaid sequenceDiagram.
  • We use consistent participant names (below) and consistent labels for calls:
    • op name [headers] {summary} for requests.
    • ↩ status body for responses.
  • Headers are shown with [h], bodies with [b] when helpful.
  • Tenancy & correlation appear on first hop and are implied downstream unless called out.
  • Errors use alt/else blocks; retries use loop with backoff notes.

Canonical participants (legend)

| Label | Meaning |
| --- | --- |
| Client | External producer/consumer (browser, service, tool) |
| Gateway | API Gateway / Edge (authN/Z, rate limit, tenancy) |
| Ingestion | Write path (validate, canonicalize, classify/redact, append) |
| Storage | Authoritative append-only store (WORM) |
| Integrity | Segment/block sealing, Merkle roots, signatures |
| Projection | Read-model updaters; checkpoints/watermarks |
| Query | Timeline/resource/actor queries; masking profiles |
| Search | Full-text/facets/suggest over per-tenant indices |
| Export | eDiscovery and bulk packages; signed manifests |
| Policy | Classification, redaction, retention evaluation |
| LegalHold | Hold application/release, scope indexing |
| Bus | Message transport (e.g., Service Bus/MassTransit/NSB) |
| KMS | Key management for signatures/manifests |
| IdP | Identity provider (JWT/OIDC) |
| Obs | Observability pipeline (metrics/logs/traces) |

Flows may also show Inbox/Outbox, Indexer, or Admin where relevant.


Cross-cutting conventions

  • Tenancy: All flows carry x-tenant-id (or gRPC metadata tenant); RLS enforced at storage and read models.
  • Correlation: OTel traceparent is required; optional baggage includes tenant, edition, shard.
  • Idempotency: Producers SHOULD send x-idempotency-key (REST) or idempotency (gRPC metadata); ingestion dedupes per (tenantId, key).
  • Problem+JSON: Errors return RFC 7807 problem details with type, title, status, detail, and errors[] { pointer, reason }.
  • Redaction: Write path applies classification/redaction per policy. Reads apply masking profiles (Safe|Support|Investigator|Raw).
  • Integrity: Sealing is asynchronous; verify-on-read is optional and called out explicitly where supported.
  • Pagination: Seek cursors encode (createdAt, auditRecordId); included in query flows.
  • Clocks: createdAt (producer), observedAt (platform), sealedAt (integrity), eligibleAt (retention).
  • Status codes (REST): 2xx (OK/Accepted), 4xx (validation/limits/auth), 5xx (transient). gRPC codes: OK, INVALID_ARGUMENT, ALREADY_EXISTS, RESOURCE_EXHAUSTED, UNAVAILABLE, DEADLINE_EXCEEDED.
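Illustratively, the seek-cursor convention above — an opaque token over (createdAt, auditRecordId) — can be sketched in Python. The base64url/JSON token layout here is an assumption for illustration, not the platform's actual wire format:

```python
import base64
import json

def encode_cursor(created_at: str, audit_record_id: str) -> str:
    """Pack the (createdAt, auditRecordId) seek position into an opaque token."""
    payload = json.dumps(
        {"createdAt": created_at, "auditRecordId": audit_record_id},
        separators=(",", ":"),
    )
    return base64.urlsafe_b64encode(payload.encode()).decode().rstrip("=")

def decode_cursor(cursor: str) -> tuple[str, str]:
    """Reverse of encode_cursor; base64 padding is restored before decoding."""
    padded = cursor + "=" * (-len(cursor) % 4)
    doc = json.loads(base64.urlsafe_b64decode(padded))
    return doc["createdAt"], doc["auditRecordId"]
```

Seek cursors like this keep pagination stable under concurrent appends, because the next page is defined by a position rather than an offset.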

Sample notation (Mermaid)

sequenceDiagram
  autonumber
  actor Client
  participant Gateway
  participant Ingestion
  participant Policy
  participant Storage
  participant Projection
  participant Integrity
  participant Obs as Observability

  Client->>Gateway: POST /audit [h: x-tenant-id, traceparent, x-idempotency-key] [b: AuditRecord]
  Note right of Gateway: AuthN/Z (IdP), rate limiting, tenancy check
  Gateway->>Ingestion: Append(request) [h: forwarded headers]
  Ingestion->>Policy: Evaluate(classify, redact hints)
  Policy-->>Ingestion: decision {classes, redactions}
  Ingestion->>Storage: INSERT AuditRecord (canonical JSON, WORM)
  Storage-->>Ingestion: ↩ ack {auditRecordId}
  Ingestion-->>Gateway: ↩ 202 Accepted {auditRecordId}
  par Async
    Storage-->>Projection: event AuditRecord.Accepted
    Storage-->>Integrity: leaf hash → segment buffer
  and
    Ingestion-->>Obs: metrics/traces/logs
  end
  Projection-->>Projection: upsert read models, advance checkpoint
  Integrity-->>Integrity: seal block, sign, emit ProofComputed
Hold "Alt" / "Option" to enable pan & zoom

Legend

  • Solid arrows: synchronous calls.
  • Dashed arrows (-->>): async publish/consume or responses.
  • par blocks: parallel async work.
  • alt/else blocks: branching (validation errors, retries).
  • loop blocks: retry with backoff.

Reading map (what comes next)

The remaining sections detail each area with a dedicated diagram and callouts:

  1. Ingestion (REST/gRPC/Bus/Actors) — validation, classification/redaction, idempotency
  2. Integrity — chain/segment/block sealing, verification, key rotation
  3. Projections & Search — read models, indexing, checkpoints, pagination
  4. Query & Read — policy-aware masking, verify-on-read, filters & time windows
  5. Export & eDiscovery — job lifecycle, manifests, delivery, legal hold
  6. Policy, Retention & Hold — evaluation, eligibility, purge block
  7. Reliability — retry, DLQ, circuit breaker, compensation, rebuild
  8. Observability — metrics, traces, health, alerts
  9. Admin — onboarding, schema evolution, configuration, partitioning, auto-scaling


Standard Audit Record Ingestion Flow

Canonical online path to append an AuditRecord via the API Gateway. Covers authN/Z, tenancy routing, rate limiting, validation & canonicalization, policy-driven classification/redaction hints, append to WORM storage, and async fan-out (AuditRecord.Accepted, projections, integrity). Emphasizes idempotency and Problem+JSON error semantics for safe retries.


Overview

Purpose: Accept a producer’s audit fact and durably append it to the authoritative store with correct tenancy, correlation, and privacy posture.
Scope: Single-record REST ingestion through the Gateway; includes validation, classification/redaction hints, append, and async fan-out triggers. Excludes gRPC and bus-based ingestion (covered in separate flows).
Context: Entry point for most interactive producers; downstream projections power query/search; integrity sealing is asynchronous.
Key Participants:

  • Client (producer)
  • API Gateway (authN/Z, limits, tenancy)
  • Ingestion Service (validate/canonicalize/classify)
  • Policy Service (classification/redaction hints)
  • Storage Service (authoritative append, WORM)
  • Projection Service (read models; async)
  • Integrity Service (segment/block sealing; async)

Prerequisites

System Requirements

  • API Gateway, Ingestion, Policy, Storage online and reachable
  • TLS enabled end-to-end; trusted IdP/JWT validation configured
  • Network routes opened Gateway → Ingestion → Policy/Storage
  • Schema Registry accessible to Ingestion

Business Requirements

  • Tenant exists and is active; residency and edition set
  • Policy (classification/redaction) published and cacheable
  • Retention policy present (for later lifecycle)
  • Legal holds (if any) indexed (no effect on write, affects lifecycle)

Performance Requirements

  • Gateway rate-limit buckets sized for tenant (burst/sustain)
  • Ingestion p95 latency < 50 ms at target load
  • Payload size ≤ 256 KiB; attributes/fields within limits

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    actor Client
    participant Gateway as API Gateway
    participant Ingestion as Ingestion Service
    participant Policy as Policy Service
    participant Storage as Storage (Authoritative)
    participant Projection as Projection Service
    participant Integrity as Integrity Service

    Client->>Gateway: POST /audit/v1/records<br/>[h: Authorization, x-tenant-id, traceparent, x-idempotency-key]<br/>[b: AuditRecord JSON]
    Note right of Gateway: AuthN (JWT/OIDC) • AuthZ (tenant scope) • Rate limit • Header validation
    Gateway->>Ingestion: Append(request)<br/>[forward headers]
    Ingestion->>Policy: Evaluate(classify/redaction hints)
    Policy-->>Ingestion: decision { classes, redactions }
    Ingestion->>Ingestion: Validate & canonicalize<br/>(size, clocks, action, resource, attrs)
    Ingestion->>Storage: INSERT canonical JSON (WORM)
    Storage-->>Ingestion: ↩ ack { auditRecordId }
    Ingestion-->>Gateway: ↩ 202 Accepted { auditRecordId, status:"Created" }
    par Async fan-out
      Storage-->>Projection: event AuditRecord.Accepted
      Storage-->>Integrity: enqueue leaf → segment
    end
    Note over Projection,Integrity: Projections update read models, Integrity seals blocks later
Hold "Alt" / "Option" to enable pan & zoom

Alternative Paths

  • Duplicate idempotency key: Ingestion returns 202 with status:"Duplicate" and original auditRecordId.
  • Server-assigned ID: If auditRecordId omitted, Ingestion assigns ULID and returns it.
  • Sealing disabled: Integrity branch skipped for tenant/edition; lifecycle proceeds to eligibility without proofs.
  • Partial policy outage: Use last-known policy (stale-tolerant) and tag decision with basis:"Cached".
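The idempotent retry contract above can be sketched from the producer's side in Python. The `send(headers, body)` transport interface and the backoff constants are illustrative stand-ins for any HTTP client, not part of the API:

```python
import time
import uuid

def append_with_retry(send, record, max_attempts=5):
    """POST an AuditRecord, reusing one x-idempotency-key across all attempts.

    `send(headers, body)` is an injected transport returning
    (status, resp_headers, body); a hypothetical interface for illustration.
    """
    headers = {
        "x-idempotency-key": str(uuid.uuid4()),  # fixed for the whole operation
        "content-type": "application/json",
    }
    delay = 0.2
    for _attempt in range(max_attempts):
        status, resp_headers, body = send(headers, record)
        if status in (200, 202):
            return body                      # Created or Duplicate — both terminal
        if status in (429, 503):             # transient: back off, keep the same key
            wait = float(resp_headers.get("Retry-After", delay))
            time.sleep(wait)
            delay = min(delay * 2, 5.0)
            continue
        raise RuntimeError(f"non-retryable status {status}: {body}")
    raise TimeoutError("retry budget exhausted")
```

Because the key is fixed before the first attempt, a retry after a timeout can only yield `status:"Duplicate"`, never a second record.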

Error Paths

sequenceDiagram
    actor Client
    participant Gateway as API Gateway
    participant Ingestion as Ingestion Service
    Client->>Gateway: POST /audit/v1/records
    alt Validation error
        Gateway->>Ingestion: Append(request)
        Ingestion-->>Gateway: ↩ 400 Problem+JSON (action.invalid, payload.tooLarge, ...)
        Gateway-->>Client: ↩ 400 Problem+JSON
    else Rate limited
        Gateway-->>Client: ↩ 429 Problem+JSON + Retry-After
    else Storage unavailable
        Gateway->>Ingestion: Append(request)
        Ingestion-->>Gateway: ↩ 503 Problem+JSON
        Gateway-->>Client: ↩ 503 (retry with backoff)
    end
Hold "Alt" / "Option" to enable pan & zoom

Request/Response Specifications

Input Requirements

| Field | Type | Req. | Description | Validation |
| --- | --- | --- | --- | --- |
| Authorization (header) | string | Y | Bearer JWT | Valid signature; tenant claims |
| x-tenant-id (header) | string | Y | Tenant routing key | ^[A-Za-z0-9._-]{1,128}$ |
| traceparent (header) | string | Y | W3C trace context | 55-char format |
| x-idempotency-key (header) | string | Y | Dedupe key per tenant | ≤128 ASCII visible |
| tenantId | string | Y | Tenant id (body) | Must equal header |
| schemaVersion | string | Y | Payload schema id | auditrecord.v1 (or newer) |
| auditRecordId | ULID | N | Client- or server-assigned id | ULID pattern |
| createdAt | timestamp | Y | Producer time | ≤ now + 2m, ms precision |
| action | string | Y | verb or verb.noun | ^[a-z]+(\.[a-z0-9_-]+)?$ |
| resource.type | string | Y | PascalCase dotted type | ^[A-Z][A-Za-z0-9]*(\.[A-Z][A-Za-z0-9]*)*$ |
| resource.id | string | Y | Opaque id | ≤128, no spaces |
| resource.path | string | N | JSON Pointer | ≤512, normalized |
| actor.id | string | Y | Actor identifier | ≤128, no spaces |
| actor.type | enum | Y | Unknown \| User \| Service \| Job | Enum |
| actor.display | string | N | Friendly name | Masked on read |
| decision.outcome | enum | N | Access verdict | Allow \| Deny \| NotApplicable \| Indeterminate |
| delta.fields | map | N | Field changes | ≤256 entries |
| attributes | map | N | Extra key/values | ≤64 keys; key/val length |
| correlation.traceId | hex | N | Trace id | 32 lowercase hex |
| correlation.requestId | string | N | Client request id | ≤128 |
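A minimal Python sketch of these checks, using the regexes from the Validation column; field coverage is deliberately partial, and the returned entries follow the errors[] { pointer, reason } convention:

```python
import re

# Patterns transcribed from the Validation column above.
ACTION_RE = re.compile(r"^[a-z]+(\.[a-z0-9_-]+)?$")
RESOURCE_TYPE_RE = re.compile(r"^[A-Z][A-Za-z0-9]*(\.[A-Z][A-Za-z0-9]*)*$")
TENANT_RE = re.compile(r"^[A-Za-z0-9._-]{1,128}$")

def validation_errors(record: dict) -> list[dict]:
    """Return Problem+JSON-style errors[] entries; an empty list means valid."""
    errors = []
    if not ACTION_RE.match(record.get("action", "")):
        errors.append({"pointer": "/action", "reason": "regex"})
    rtype = record.get("resource", {}).get("type", "")
    if not RESOURCE_TYPE_RE.match(rtype):
        errors.append({"pointer": "/resource/type", "reason": "regex"})
    if not TENANT_RE.match(record.get("tenantId", "")):
        errors.append({"pointer": "/tenantId", "reason": "regex"})
    return errors
```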

Output Specifications

| Field | Type | Description | Notes |
| --- | --- | --- | --- |
| auditRecordId | ULID | Durable id | Server returns original or assigned |
| status | string | Created or Duplicate | Idempotent semantics |
| observedAt | timestamp | Ingestion time | ms precision |
| traceId | hex32 | Echo for correlation | From traceparent |
| links.self | string | Record URL | REST locator |
| links.operation | string | Idempotency op URL | Stable outcome resource |

Example Payloads

Request

{
  "tenantId": "splootvets",
  "schemaVersion": "auditrecord.v1",
  "createdAt": "2025-10-22T12:00:03.100Z",
  "action": "appointment.update",
  "resource": { "type": "Vetspire.Appointment", "id": "A-9981", "path": "/status" },
  "actor": { "id": "user_123", "type": "User", "display": "A. Smith" },
  "decision": { "outcome": "Allow" },
  "delta": { "fields": { "status": { "before": "Pending", "after": "Booked" } } },
  "attributes": { "client.ip": "203.0.113.42", "client.userAgent": "Mozilla/5.0 ..." },
  "correlation": { "traceId": "3e1f2d0c9b8a7f6e5d4c3b2a19081716", "requestId": "req-7a9f" }
}

Response — 202 Accepted

{
  "auditRecordId": "01JE7K4J9F9D0S6E7X5Q1A3BCP",
  "status": "Created",
  "observedAt": "2025-10-22T12:00:03.300Z",
  "traceId": "3e1f2d0c9b8a7f6e5d4c3b2a19081716",
  "links": {
    "self": "/audit/v1/records/01JE7K4J9F9D0S6E7X5Q1A3BCP",
    "operation": "/audit/v1/operations/prod-ord-9981-v1"
  }
}

Error Handling

Error Scenarios

| Error Code | Scenario | Recovery Action | Retry Strategy |
| --- | --- | --- | --- |
| 400 | Schema/clock/format invalid | Fix request; follow details/pointers | Do not retry until corrected |
| 401 | Invalid/missing JWT | Acquire valid token | Retry after re-auth |
| 403 | Tenant forbidden | Correct tenant or permissions | Do not retry |
| 409 | Idempotency conflict (rare) | Reuse same key; inspect operation link | Safe retry with same key |
| 413 | Payload > 256 KiB | Reduce size / trim delta | Do not retry until reduced |
| 415 | Wrong media type | Use application/json | Retry with correct header |
| 429 | Rate limited/backpressure | Respect Retry-After | Exponential backoff + jitter |
| 503 | Storage/Policy unavailable | Transient outage | Exponential backoff + jitter; reuse idempotency key |
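The retry column above condenses into a small client-side decision helper; a sketch, with illustrative action names that are not part of any contract:

```python
def recovery_action(status: int) -> str:
    """Map the REST error codes from the table above to a client-side action."""
    if status in (429, 503):
        return "retry-with-backoff-same-key"   # transient; preserve x-idempotency-key
    if status == 401:
        return "reauthenticate-then-retry"     # token problem, not a payload problem
    if status in (400, 403, 413, 415):
        return "fix-request-no-retry"          # retrying unchanged input cannot succeed
    if status == 409:
        return "inspect-operation-link"        # idempotency conflict: check the outcome
    return "unknown"
```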

Failure Modes

  • Network Failures: Timeouts, TLS issues → client retries with backoff; preserve x-idempotency-key.
  • Service Unavailability: Return 503 from Gateway; circuit breaker may open.
  • Data Corruption: Validation rejects; Problem+JSON details include errors[].pointer.
  • Policy Violations: Credentials detected → dropped at write; log redactionHint.

Recovery Procedures

  1. Inspect Problem+JSON type, detail, and errors[].
  2. For transient failures, retry with the same idempotency key using backoff; honor Retry-After.
  3. For validation failures, correct the payload (see rules), then resubmit.

Performance Characteristics

Latency Expectations

  • P50: 15–25 ms
  • P95: ≤ 50 ms
  • P99: ≤ 120 ms
  • P99.9: ≤ 300 ms (under burst control)

Throughput Limits

  • Per Tenant (sustain): ~500 rps (edition-dependent)
  • Per Tenant (burst): up to 2,000 rps for 60 s
  • Global Target: ≥ 50k rps across shards

Resource Requirements

  • CPU: Ingestion nodes sized for JSON parse + hashing; vectorized canonicalization where available
  • Memory: Payload buffers ≤ 256 KiB × concurrency; header maps
  • Network: TLS offload at Gateway or service mesh
  • Storage: WAL/redo sized for 2× burst; keep secondary indexes minimal

Scaling Considerations

  • Horizontal: Scale Gateway/Ingestion statelessly (HPA/KEDA based on rps/CPU/queue depth)
  • Vertical: Rarely needed; prefer horizontal
  • Auto-scaling Triggers: rps, p95 latency, queue depth, 429 rate, CPU > 75%

Security & Compliance

Authentication

  • Method: JWT (OIDC); short-lived tokens; clock skew ±60s
  • Token Requirements: Audience/service match; tenant claims present
  • Session Management: Stateless; no cookies

Authorization

  • Permissions: Producer role allowed to audit:append for x-tenant-id
  • Tenant Isolation: RLS enforced in Storage/Projections; headers validated at edge
  • RBAC: Gateway policy + service layer checks

Data Protection

  • Transit: TLS 1.2+; HSTS at edge
  • At Rest: DB/storage encryption; key management via KMS
  • PII Handling: Write-time classification/redaction; credentials dropped; personal/sensitive masked/hashed

Compliance

  • GDPR/HIPAA/SOC2: Audit trail of who appended; immutable WORM; data subject exports via Export flows

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
| --- | --- | --- | --- |
| ingest_requests_total | counter | Count of POSTs | Anomaly vs. baseline |
| ingest_latency_ms | histogram | End-to-end latency | p95 > 50 ms (5m) |
| ingest_payload_bytes | histogram | Payload sizes | > 90th near 256 KiB |
| ingest_rate_limited_total | counter | 429 responses | Spike > 5% |
| storage_errors_total | counter | 5xx from Storage | > 0.5% |
| policy_eval_latency_ms | histogram | Policy call latency | p95 > 30 ms |

Logging Requirements

  • Structured JSON logs; include tenantId, auditRecordId, traceId, idempotencyKey (hash)
  • Mask personal/sensitive values; never log raw credentials
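The "idempotencyKey (hash)" requirement above can be met by logging a digest instead of the raw key; a sketch, where the truncation length is an assumption:

```python
import hashlib

def log_safe_idempotency_key(key: str) -> str:
    """Hash the raw idempotency key so logs can correlate retries
    without persisting the key itself. A truncated digest is still
    stable enough to join log lines on."""
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
```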

Distributed Tracing

  • Propagate traceparent; spans: ingest.request, ingest.validate, ingest.append, policy.evaluate
  • Span attrs: tenant, payloadBytes, status, dedupe="Created|Duplicate"

Health Checks

  • Liveness: process heartbeats
  • Readiness: downstream (Policy/Storage) probes with budgets
  • Dependency: Registry reachability, KMS if signing on write (rare)

Operational Procedures

Deployment

  1. Deploy/roll Gateway and Ingestion behind feature flag audit.ingest.enabled=false
  2. Warm caches (schema, policy); run smoke POST against canary
  3. Flip flag, ramp traffic using traffic splitting (e.g., 10% → 50% → 100%)

Configuration

  • Env Vars: RATE_BURST, RATE_SUSTAIN, MAX_PAYLOAD_BYTES=262144
  • Config: Policy endpoint base URL; schema registry URL
  • Feature Flags: Sealing on write (usually off), request verification levels

Maintenance

  • Rotate tokens/keys; tune rate limits; review metrics for near-limit payloads

Troubleshooting

  • High 400s → inspect Problem+JSON pointers
  • High 429s → increase tenant buckets or advise producers to backoff
  • 5xx spikes → check Storage/Policy dependency health, breaker state

Testing Scenarios

Happy Path Tests

  • Accept minimal valid record; returns 202 Created
  • With server-assigned ULID; returns new auditRecordId
  • Duplicate x-idempotency-key returns status:"Duplicate"

Error Path Tests

  • action.invalid → 400 with pointer /action
  • Payload over 256 KiB → 413
  • Missing/invalid JWT → 401; forbidden tenant → 403
  • Rate limit exceeded → 429 with Retry-After

Performance Tests

  • Sustain 500 rps per tenant; p95 < 50 ms
  • Burst 2k rps per tenant for 60s without error inflation
  • Large but valid payload near limit; still < 50 ms p95

Security Tests

  • Credential key in attributes is dropped/redacted
  • PII masked on read paths (verify via downstream Query)
  • Multi-tenant isolation (no cross-tenant access)

Internal References

External References

  • RFC 7807 (Problem Details for HTTP APIs)
  • W3C Trace Context (traceparent)

Appendices

A. Configuration Examples

  • NGINX/L7 snippet to pass through traceparent, x-tenant-id, x-idempotency-key

B. Troubleshooting Guide

  • Decision tree for 4xx vs 5xx vs 429 responses

C. Performance Benchmarks

  • Latest load test summary attached in CI artifacts

D. Security Checklist

  • No secrets logged
  • Masking rules applied on read
  • RLS enforced in all queries

Batch Audit Record Ingestion Flow

Efficient bulk ingest of many AuditRecord items using multipart upload or presigned object storage. The Gateway creates a batch job, the client uploads JSONL (optionally gzip), and an Ingestion Batch Worker validates, canonicalizes, and appends records to the WORM store with partial-failure reporting, chunking, and resume tokens.


Overview

Purpose: Move large volumes of audit facts into ATP reliably and cost-effectively with resumable uploads and per-record error isolation.
Scope: REST orchestration for batch jobs, uploads (multipart or presigned URLs), background processing, partial failures, status polling, and completion artifacts. Excludes online single-record ingest and streaming bus pipelines.
Context: Preferred for backfills, partner dumps, and nightly loads. Downstream, projections and integrity run asynchronously as with standard ingestion.
Key Participants:

  • Client (uploader)
  • API Gateway (job control, presigned URLs, limits)
  • Object Storage (S3/GCS/Azure Blob; optional path)
  • Ingestion Batch Worker (validate/canonicalize/process chunks)
  • Storage (Authoritative) (WORM append)
  • Integrity Service (hash/segment/block sealing; async)
  • Projection Service (read models; async)

Prerequisites

System Requirements

  • API Gateway, Batch Worker, Storage, Integrity, Projection online
  • TLS end-to-end; object storage reachable from workers
  • IdP configured; JWT audience for Gateway set
  • Schema Registry reachable by workers

Business Requirements

  • Tenant active; residency/edition configured
  • Classification/redaction & retention policies published
  • Legal holds indexed (affects lifecycle, not write)

Performance Requirements

  • Chunk size and worker parallelism tuned (defaults below)
  • Storage capacity sized for expected peak insert rate
  • Backpressure thresholds configured (429/503 policies)

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    actor Client
    participant Gateway as API Gateway
    participant Store as Object Storage
    participant Batch as Ingestion Batch Worker
    participant Storage as Storage (Authoritative)
    participant Projection as Projection Service
    participant Integrity as Integrity Service

    Client->>Gateway: POST /audit/v1/batches { manifest, strategy }
    Gateway-->>Client: ↩ 202 { batchId, uploadPlan, resumeToken }

    alt Presigned strategy
      Client->>Store: PUT parts to presigned URLs (JSONL[.gz])
      Client->>Gateway: POST /audit/v1/batches/{batchId}/finalize
    else Multipart strategy
      Client->>Gateway: POST /audit/v1/batches/{batchId}/upload (multipart)
    end

    Gateway-->>Batch: event Batch.Created { batchId, objectUris }
    Batch->>Batch: Plan chunks (e.g., 5k recs or 16 MiB)
    loop Each chunk
      Batch->>Store: READ chunk bytes (stream)
      Batch->>Batch: Validate & canonicalize each JSONL line
      Batch->>Storage: INSERT valid AuditRecord rows (idempotent)
      Batch-->>Batch: Record per-line status, advance resumeToken
    end
    par Async fan-out for accepted rows
      Storage-->>Projection: AuditRecord.Accepted
      Storage-->>Integrity: enqueue leaf → segment
    end

    Batch-->>Gateway: status { processed, succeeded, failed, resumeToken }
    Gateway-->>Client: ↩ 200/202 GET /batches/{id}/status
    Note over Batch,Client: Completion → summary + downloadable error report for failed lines
Hold "Alt" / "Option" to enable pan & zoom

Alternative Paths

  • Continue-on-error: Process full batch; emit per-line errors; job ends CompletedWithFailures.
  • Halt-on-threshold: Stop when failed/processed ≥ threshold (e.g., 5%); job Aborted.
  • Resume: Client provides resumeToken; worker skips processed chunks.
  • Single-URL manifest: Gateway returns one upload URL; worker enumerates parts by convention.
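The chunk-planning step ("e.g., 5k recs or 16 MiB") amounts to greedy packing under two budgets. A sketch: real workers would plan over streamed JSONL, while this illustration operates on a list of per-line byte lengths:

```python
def plan_chunks(line_sizes, max_records=5000, max_bytes=16 * 1024 * 1024):
    """Greedy chunk planner: close the current chunk when adding the next
    line would exceed either the record-count or byte budget from
    options.chunk. Returns [start, end) line ranges."""
    chunks = []
    start, count, size = 0, 0, 0
    for i, n in enumerate(line_sizes):
        if count and (count + 1 > max_records or size + n > max_bytes):
            chunks.append((start, i))
            start, count, size = i, 0, 0
        count += 1
        size += n
    if count:
        chunks.append((start, start + count))
    return chunks
```

Closing on whichever budget trips first keeps worst-case chunk memory bounded even when line sizes vary widely.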

Error Paths

sequenceDiagram
    actor Client
    participant Gateway as API Gateway
    participant Batch as Ingestion Batch Worker

    Client->>Gateway: POST /audit/v1/batches { manifest }
    alt Invalid manifest
      Gateway-->>Client: ↩ 400 Problem+JSON (manifest.invalid)
    else Failure threshold exceeded
      Batch-->>Gateway: status { state:"Aborted", reason:"FailureThreshold" }
      Gateway-->>Client: ↩ 409 Problem+JSON + link:errorReport
    else Storage unavailable
      Batch-->>Gateway: status { state:"Retrying", backoff:"exponential" }
      Gateway-->>Client: ↩ 503 on status until recovery
    end
Hold "Alt" / "Option" to enable pan & zoom

Request/Response Specifications

Input Requirements

| Field | Type | Required | Description | Validation |
| --- | --- | --- | --- | --- |
| Authorization (header) | string | Y | Bearer JWT | Valid signature; tenant claim |
| x-tenant-id (header) | string | Y | Tenant routing | ^[A-Za-z0-9._-]{1,128}$ |
| traceparent (header) | string | Y | W3C trace context | 55-char format |
| x-idempotency-key (header) | string | Y | Job creation dedupe | ≤128 ASCII |
| strategy | enum | Y | Presigned or Multipart | Enum |
| manifest.files[] | array | Y | Object URIs or file descriptors | ≤256 files |
| manifest.format | enum | Y | Jsonl or JsonlGzip | Enum |
| manifest.schemaVersion | string | Y | Expected schema | e.g., auditrecord.v1 |
| options.chunk.maxRecords | int | N | Records per chunk | 1–10,000 (default 5,000) |
| options.chunk.maxBytes | int | N | Bytes per chunk | 1–32 MiB (default 16 MiB) |
| options.failure.mode | enum | N | Continue/HaltOnThreshold | Default Continue |
| options.failure.threshold | number | N | 0.0–1.0 | Default 0.05 |
| options.parallelism | int | N | Worker concurrency | 1–32 (edition gated) |

Output Specifications

| Field | Type | Description | Notes |
| --- | --- | --- | --- |
| batchId | ULID | Batch identifier | Returned on create |
| uploadPlan | object | Presigned URLs or upload endpoints | May include part sizes |
| resumeToken | string | Opaque position token | For resume |
| state | enum | Created \| Uploading \| Processing \| Retrying \| Completed \| CompletedWithFailures \| Aborted \| Failed | From status API |
| counters | object | {processed, succeeded, failed, bytesRead} | Status API |
| errorReport | url | Download failed-lines report | On completion/abort |

Example Payloads

Create batch (presigned)

{
  "strategy": "Presigned",
  "manifest": {
    "format": "JsonlGzip",
    "schemaVersion": "auditrecord.v1",
    "files": [
      { "name": "part-0001.jsonl.gz", "sizeBytes": 104857600 },
      { "name": "part-0002.jsonl.gz", "sizeBytes": 83886080 }
    ]
  },
  "options": {
    "chunk": { "maxRecords": 5000, "maxBytes": 16777216 },
    "failure": { "mode": "Continue", "threshold": 0.05 },
    "parallelism": 8
  }
}

Create response

{
  "batchId": "01JE8A3GZ8X0E9K3N5R6V7B8C9",
  "uploadPlan": {
    "presigned": [
      { "name": "part-0001.jsonl.gz", "method": "PUT", "url": "https://store/..." },
      { "name": "part-0002.jsonl.gz", "method": "PUT", "url": "https://store/..." }
    ]
  },
  "resumeToken": "r-01je8a3g-0000"
}

Status response

{
  "batchId": "01JE8A3GZ8X0E9K3N5R6V7B8C9",
  "state": "CompletedWithFailures",
  "counters": { "processed": 180000, "succeeded": 176400, "failed": 3600, "bytesRead": 183500800 },
  "resumeToken": "r-01je8a3g-ffff",
  "errorReport": "/audit/v1/batches/01JE8A3G.../errors?profile=Safe"
}

Error Handling

Error Scenarios

| Error Code | Scenario | Recovery Action | Retry Strategy |
| --- | --- | --- | --- |
| 400 | Invalid manifest/options | Fix payload (schema, limits) | No retry until corrected |
| 401/403 | AuthN/Z failure | Acquire token / permissions | Retry after fix |
| 409 | Duplicate x-idempotency-key | Use status endpoint / operation link | Safe to reuse key |
| 413 | Part too large | Reduce part size | Re-upload affected part |
| 422 | Failure threshold exceeded | Inspect error report; fix data | New batch recommended |
| 429 | Gateway/worker backpressure | Honor Retry-After; slow uploads | Exponential backoff + jitter |
| 503 | Storage/object store unavailable | Wait for recovery | Workers auto-retry chunks |

Failure Modes

  • Line-level validation failures: recorded {line, pointer, reason}; good lines continue.
  • Chunk retry: transient errors → chunk-level retries with capped attempts.
  • Poison lines: after N retries, line written to dead-letter file in the error report.

Recovery Procedures

  1. GET status; if CompletedWithFailures, download errorReport.
  2. Fix rejected lines; re-upload as new batch or incremental patch.
  3. If Aborted due to threshold, pre-clean data or lower threshold; start a new batch.
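The threshold rule from options.failure can be sketched as a terminal-state decision; state names come from the status enum, and the defaults mirror the spec's values:

```python
def batch_state(processed: int, failed: int,
                mode: str = "Continue", threshold: float = 0.05) -> str:
    """Decide the terminal state of a batch from its counters.

    HaltOnThreshold aborts once failed/processed meets the threshold;
    otherwise the job completes, flagged when any lines were rejected.
    """
    if mode == "HaltOnThreshold" and processed and failed / processed >= threshold:
        return "Aborted"
    if failed == 0:
        return "Completed"
    return "CompletedWithFailures"
```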

Performance Characteristics

Latency Expectations

  • Job creation: ~10–50 ms
  • Per-chunk processing: target ≤ 2 s for 5k records
  • End-to-end: proportional to data volume and parallelism

Throughput Limits

  • Worker ingest: ≥ 3k rps per shard sustained (shared with online writes)
  • Per-job parallelism: default 8 chunks in flight (edition gated)
  • Upload: presigned PUT up to provider limits; prefer 8–16 MiB parts

Resource Requirements

  • CPU: JSON parse + hashing; concurrency N × vCPU
  • Memory: streaming parse; per-chunk buffers (≤ 16–32 MiB each)
  • Network: high egress from object store to workers; colocate where possible
  • Storage: WAL sized for burst; keep secondary indexes minimal on authoritative store

Scaling Considerations

  • Horizontal: scale workers by queue depth and chunk latency
  • Auto-scaling triggers: backlog age, running jobs, p95 chunk duration, CPU > 75%
  • Backpressure: workers advertise capacity; Gateway throttles create/upload

Security & Compliance

Authentication

  • JWT (OIDC) to create/manage batches; presigned URLs for object store writes (scoped, short-lived).

Authorization

  • Require audit:batch:create for tenant; status and error report scoped to same tenant and batch.

Data Protection

  • Transit: TLS 1.2+; presigned HTTPS only
  • At Rest: object storage + DB encryption; server-side KMS keys
  • PII: same write-time classification/redaction as standard ingest (no raw credentials persisted)

Compliance

  • Batch operations are audited: who created, uploaded, resumed, and downloaded error reports.

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
| --- | --- | --- | --- |
| batch_created_total | counter | Batches created | Anomaly vs baseline |
| batch_records_processed_total | counter | Lines processed | Drops or stalls |
| batch_failures_total | counter | Per-line rejects | > 2% sustained |
| batch_chunk_latency_ms | histogram | Chunk processing time | p95 > 2 s |
| batch_inflight_jobs | gauge | Active batches | Capacity saturation |
| batch_bytes_read | counter | Input bytes | Sudden spikes |

Logging Requirements

  • Structured logs with batchId, lineNo, error pointer, reason; mask sensitive values.

Distributed Tracing

  • Root span batch.create; child spans per chunk (batch.process.chunk) including chunkId, records, bytes.

Health Checks

  • Readiness includes object store access, Storage connectivity, Schema Registry reachability.

Operational Procedures

Deployment

  1. Roll out Batch Worker with feature flag audit.batch.enabled=false.
  2. Validate presigned URL issuance in non-prod.
  3. Enable flag; ramp per-tenant concurrency caps.

Configuration

  • Env Vars: BATCH_MAX_PARALLELISM, BATCH_CHUNK_MAX_BYTES, BATCH_CHUNK_MAX_RECORDS, BATCH_FAILURE_THRESHOLD
  • Storage: connection pools sized for concurrent inserts
  • Object Store: bucket/container, lifecycle policy for temp uploads and error reports

Maintenance

  • Periodic cleanup of stale, incomplete batches and expired presigned URLs.
  • Rotate KMS keys as per policy.

Troubleshooting

  • High batch_failures_total → download error report; inspect common pointers.
  • Slow chunks → reduce chunk size or increase parallelism; check DB bottlenecks.
  • Frequent 503 → verify storage health and worker retry logs.

Testing Scenarios

Happy Path Tests

  • Create presigned batch; upload two parts; completion with zero failures
  • Multipart upload success path with server parsing
  • Resume from resumeToken after intentional worker restart

Error Path Tests

  • Invalid manifest → 400 with pointer to failing field
  • Failure threshold exceeded → job Aborted, 409 on finalize
  • Object store permission denied → 403 on PUT, recover with new presigned URL

Performance Tests

  • 100M records across 20 files; verify throughput and stability
  • Chunk size sweep (4–32 MiB) to tune p95
  • Parallel jobs from multiple tenants without starvation

Security Tests

  • Presigned URL expiry respected; uploads fail after TTL
  • Error report redacts/masks PII appropriately
  • Tenant isolation—no cross-tenant batch visibility

Internal References

  • gRPC Service Ingestion Flow
  • Service Bus (MassTransit) Ingestion Flow
  • Audit Record Projection Update Flow

External References

  • Provider docs for presigned URLs (S3/GCS/Azure Blob)
  • RFC 7231 (HTTP semantics) for 202/409/413 usage

Appendices

A. Minimal JSONL Example (uncompressed)

{"tenantId":"acme","schemaVersion":"auditrecord.v1","createdAt":"2025-10-22T12:00:00.000Z","action":"user.create","resource":{"type":"Iam.User","id":"U-1"},"actor":{"id":"svc_gw","type":"Service"}}
{"tenantId":"acme","schemaVersion":"auditrecord.v1","createdAt":"2025-10-22T12:00:01.000Z","action":"appointment.update","resource":{"type":"Vetspire.Appointment","id":"A-2"},"actor":{"id":"user_123","type":"User"},"delta":{"fields":{"status":{"before":"Pending","after":"Booked"}}}}

B. Error Report Schema (per-line)

{
  "batchId": "01JE8A3GZ8X0E9K3N5R6V7B8C9",
  "summary": { "processed": 100000, "succeeded": 98400, "failed": 1600 },
  "errors": [
    { "line": 42, "pointer": "/action", "reason": "regex", "code": "action.invalid", "rawSnippet": "..." }
  ]
}

C. Resume Token Example

{ "batchId": "01JE8A3G...", "chunk": 128, "offset": 7340032, "file": "part-0002.jsonl.gz" }
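A worker restarting mid-batch consumes a token like the one above to continue where it left off. The sketch below is a minimal Python illustration, assuming the field names from this appendix (`batchId`, `chunk`, `offset`, `file`) and that `offset` is a byte offset into the part file; a real worker would also validate `batchId` against the job record before seeking.

```python
import json

def resume_position(token_json: str) -> tuple[str, int]:
    """Parse a resume token and return (file, byte offset) to continue from.

    Shape assumed from Appendix C; raises ValueError on a malformed token
    so the worker can fall back to restarting the chunk.
    """
    token = json.loads(token_json)
    for key in ("batchId", "chunk", "offset", "file"):
        if key not in token:
            raise ValueError(f"resume token missing {key!r}")
    if token["offset"] < 0:
        raise ValueError("offset must be non-negative")
    return token["file"], token["offset"]

# Usage with a token like the example above ("01JE8A3G" is a placeholder id):
part, offset = resume_position(
    '{"batchId":"01JE8A3G","chunk":128,"offset":7340032,"file":"part-0002.jsonl.gz"}'
)
```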

Audit Record Validation & Classification Flow

Applies schema/limits validation, canonicalization, and policy-driven classification & redaction before persisting an AuditRecord. Ensures deterministic normalization, consistent privacy posture, and auditable decisions that accompany the record through its lifecycle.


Overview

Purpose: Validate and normalize incoming audit facts, classify data sensitivity, and apply redaction actions prior to append.
Scope: Ingestion-time validation/canonicalization, policy evaluation, classification flags, redaction (drop/mask/hash/tokenize), decision auditing. Excludes post-read masking (covered in Query flows) and integrity/projection specifics.
Context: Runs during Standard/Batch ingestion just before the authoritative append. Outputs include normalized payload, DataClass flags, RedactionHints, and a policy decision trail.
Key Participants:

  • Ingestion Service (validator/canonicalizer/orchestrator)
  • Schema Registry (JSON Schema/contract resolution)
  • Policy Service (classification & redaction policy)
  • Classification Engine (PII/secret detectors, patterns)
  • Redaction Service (hash/mask/tokenize/drop transforms)
  • Storage (Authoritative) (WORM append with decision audit)

Prerequisites

System Requirements

  • Ingestion reachable to Schema Registry and Policy endpoints
  • Policy/Classification/Redaction services healthy (or cached policy available)
  • Clock sync within ±60s (for timestamp validations)
  • TLS enabled; service identities trusted

Business Requirements

  • Tenant active; edition/residency known (affects policy set)
  • Current Policy revision published; cache TTL configured
  • Data classification catalog aligned with Data Model

Performance Requirements

  • Validation + policy evaluation p95 ≤ 30 ms per record
  • Classification engine p95 ≤ 10 ms for typical payloads
  • End-to-end ingest validation budget p95 ≤ 50 ms

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    participant Ingestion as Ingestion Service
    participant Registry as Schema Registry
    participant Policy as Policy Service
    participant Classify as Classification Engine
    participant Redact as Redaction Service
    participant Storage as Storage (Authoritative)

    Ingestion->>Registry: Resolve schema (auditrecord.v1)
    Registry-->>Ingestion: ↩ schema (cacheable)

    Ingestion->>Ingestion: Structural validate + limits (size, clocks)
    Ingestion->>Ingestion: Canonicalize (strings NFC, action, resource.path)

    Ingestion->>Policy: Evaluate(tenant, edition, payload summary)
    Policy-->>Ingestion: ↩ decision {classes, actions, revision, basis:"Live"}

    Ingestion->>Classify: Detect PII/Secrets (hints, patterns)
    Classify-->>Ingestion: ↩ findings {keys, types, confidence}

    Ingestion->>Redact: Apply(actions, findings) → transform fields
    Redact-->>Ingestion: ↩ normalized payload + redactionHints

    Ingestion->>Storage: INSERT payload + {classes, redactionHints, policyRevision}
    Storage-->>Ingestion: ↩ ack {auditRecordId}

Alternative Paths

  • Cached policy: If Policy unavailable, use last-known decision template (basis:"Cached") with TTL; record basis in decision trail.
  • Dry-run mode: Apply classification only; annotate recommended actions without mutating payload (used in partner onboarding).
  • Producer hints: Producer supplies dataClass hints; engine verifies/augments but never downgrades sensitivity.
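The "never downgrade" rule for producer hints can be expressed as a set union plus a max over a sensitivity ordering. This is a sketch under an assumed three-level ranking (Operational < Personal < Sensitive); the authoritative class catalog lives in the Data Model.

```python
# Assumed ordering for illustration; the real catalog is in the Data Model.
RANK = {"Operational": 0, "Personal": 1, "Sensitive": 2}

def merge_classes(producer_hints: set[str], engine_findings: set[str]) -> set[str]:
    """Union of hints and findings: the engine may add classes (augment),
    but a producer hint can never remove one (downgrade)."""
    return producer_hints | engine_findings

def effective_level(classes: set[str]) -> str:
    """Highest-sensitivity class wins for gating decisions."""
    return max(classes, key=lambda c: RANK[c])
```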

Error Paths

sequenceDiagram
    participant Ingestion as Ingestion Service
    participant Registry as Schema Registry
    participant Policy as Policy Service

    Ingestion->>Registry: Resolve schema
    alt Schema mismatch/invalid
      Registry-->>Ingestion: ↩ error(schema.invalid)
      Ingestion-->>Client: ↩ 400 Problem+JSON (pointers)
    else Policy hard outage and no cache
      Ingestion->>Policy: Evaluate(...)
      Policy-->>Ingestion: ↩ 503
      Ingestion-->>Client: ↩ 503 Problem+JSON (retry with idempotency)
    end

Request/Response Specifications

This flow executes inside ingestion. External interfaces (e.g., REST /audit/v1/records) are shown for the fields pertinent to validation & classification.

Input Requirements

| Field | Type | Required | Description | Validation |
|---|---|---|---|---|
| schemaVersion | string | Y | Payload contract id | Known & active in Registry |
| createdAt | timestamp | Y | Producer clock | ISO-8601 UTC, ms; ≤ now+2m |
| effectiveAt | timestamp | N | Effect time | createdAt |
| action | string | Y | verb or verb.noun | `^[a-z]+(\.[a-z0-9_-]+)?$` |
| resource.type | string | Y | Dotted PascalCase type | `^[A-Z][A-Za-z0-9]*(\.[A-Z][A-Za-z0-9]*)*$` |
| resource.id | string | Y | Opaque id | ≤128, visible ASCII |
| resource.path | string | N | JSON Pointer | normalized, ≤512 |
| actor.id | string | Y | Actor identifier | ≤128 |
| actor.type | enum | Y | Unknown \| User \| Service \| Job | Enum |
| attributes.* | map | N | Extra k/v pairs | ≤64 keys; key ≤64, val ≤1024 |
| delta.fields | map | N | Field-level changes | ≤256 entries |
| correlation.traceId | hex | N | Trace correlation | 32 lowercase hex |
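The action, resource.type, and timestamp rules above can be checked with simple patterns. A minimal sketch, assuming the regexes and the ≤ now+2m clock-skew rule exactly as stated; the real validator also enforces the size/key limits.

```python
import re
from datetime import datetime, timedelta, timezone

ACTION_RE = re.compile(r"^[a-z]+(\.[a-z0-9_-]+)?$")
RESOURCE_TYPE_RE = re.compile(r"^[A-Z][A-Za-z0-9]*(\.[A-Z][A-Za-z0-9]*)*$")
MAX_CLOCK_SKEW = timedelta(minutes=2)  # createdAt must be <= now + 2m

def validate(record: dict) -> list[str]:
    """Return JSON Pointers of failing fields (empty list means valid)."""
    violations = []
    if not ACTION_RE.match(record.get("action", "")):
        violations.append("/action")
    if not RESOURCE_TYPE_RE.match(record.get("resource", {}).get("type", "")):
        violations.append("/resource/type")
    created = datetime.fromisoformat(record["createdAt"].replace("Z", "+00:00"))
    if created > datetime.now(timezone.utc) + MAX_CLOCK_SKEW:
        violations.append("/createdAt")
    return violations
```

Failing pointers feed straight into the Problem+JSON `errors[].pointer` field used by the 400 responses.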

Output Specifications

| Field | Type | Description | Notes |
|---|---|---|---|
| normalizedPayload | object | Canonical JSON after transforms | JCS canonical form |
| classes | bitset/array | DataClass flags | e.g., Personal \| Sensitive |
| redactionHints[] | array | Where/why redacted | { pointer, action } |
| policyRevision | string | Policy rev used | rev-YYYYMMDD-n |
| policyBasis | enum | Live \| Cached \| DryRun | Audit of basis |
| violations[] | array | Validation/policy errors | For 4xx generation |

Example Payloads

Input (pre-normalization)

{
  "schemaVersion": "auditrecord.v1",
  "createdAt": "2025-10-22T12:00:03.100Z",
  "action": "User.Create",
  "resource": { "type": "Iam.User", "id": " U-1001 ", "path": "/name" },
  "actor": { "id": "svc_gw", "type": "Service", "display": "ingress-gw" },
  "attributes": {
    "email": "alice@example.com",
    "password": "hunter2",
    "client.ip": "2001:db8::1"
  }
}

Normalized + decision (stored)

{
  "schemaVersion": "auditrecord.v1",
  "createdAt": "2025-10-22T12:00:03.100Z",
  "action": "user.create",
  "resource": { "type": "Iam.User", "id": "U-1001", "path": "/name" },
  "actor": { "id": "svc_gw", "type": "Service", "display": "ingress-gw" },
  "attributes": {
    "email": "sha256:2c26b46b68ffc68ff99b453c1d304134",
    "client.ip": "2001:db8::/64"
  },
  "_decision": {
    "classes": ["Personal", "Sensitive"],
    "redactionHints": [
      { "pointer": "/attributes/password", "action": "Drop" },
      { "pointer": "/attributes/email", "action": "Hash" },
      { "pointer": "/attributes/client.ip", "action": "Mask" }
    ],
    "policyRevision": "rev-20251022-1",
    "policyBasis": "Live"
  }
}
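The Drop/Hash/Mask actions in the decision above can be sketched as pure transforms over the attributes map. Assumptions in this sketch: unsalted SHA-256 for Hash (the real policy applies salt/pepper where applicable) and /64 network masking for IPv6 (/24 assumed for IPv4).

```python
import hashlib
import ipaddress

def hash_value(value: str) -> str:
    """One-way hash; real deployments apply the salt/pepper policy."""
    return "sha256:" + hashlib.sha256(value.encode()).hexdigest()

def mask_ip(value: str, v6_prefix: int = 64) -> str:
    """Collapse an IP to its network: /64 for IPv6, /24 assumed for IPv4."""
    ip = ipaddress.ip_address(value)
    prefix = v6_prefix if ip.version == 6 else 24
    return str(ipaddress.ip_network(f"{value}/{prefix}", strict=False))

def redact(attributes: dict, hints: list[dict]) -> dict:
    """Apply redaction hints of shape {pointer, action} to /attributes/* keys."""
    out = dict(attributes)
    for hint in hints:
        key = hint["pointer"].removeprefix("/attributes/")
        if key not in out:
            continue
        if hint["action"] == "Drop":
            del out[key]
        elif hint["action"] == "Hash":
            out[key] = hash_value(out[key])
        elif hint["action"] == "Mask":
            out[key] = mask_ip(out[key])
    return out
```

Applied to the input example, `password` disappears, `email` becomes a `sha256:` digest, and `2001:db8::1` collapses to `2001:db8::/64`, matching the stored record above.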

Error Handling

Error Scenarios

| Error Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Schema/shape invalid | Fix payload per pointers | No retry until corrected |
| 400 | Limits exceeded (size/keys/delta) | Reduce payload size/keys | No retry until corrected |
| 422 | Policy violation (forbidden fields) | Remove/transform offending fields | Retry after fix |
| 503 | Policy/Registry unavailable & no cache | Wait for recovery | Retry with same idempotency key |
| 409 | Policy revision conflict (rare) | Resubmit; server reconciles | Safe retry (idempotent) |

Failure Modes

  • Secret detected: Field dropped; hint recorded; no write-time failure unless configured “fail-closed”.
  • Classifier ambiguity: Lowest-risk action chosen (mask/hash) and flagged for review.
  • Cache staleness: Decision marked basis:"Cached"; async audit triggers re-eval if needed.

Recovery Procedures

  1. If 4xx, inspect Problem+JSON errors[].pointer and correct data.
  2. If 503, retry with backoff; preserve idempotency key.
  3. If repeated classifier ambiguities, update policy patterns; redeploy.

Performance Characteristics

Latency Expectations

  • Validation + Canonicalization: p95 ≤ 20 ms
  • Policy Evaluation: p95 ≤ 30 ms (local cache hit ≤ 5 ms)
  • Classification/Redaction: p95 ≤ 10 ms typical payloads

Throughput Limits

  • Designed to sustain the same per-tenant ingest targets as Standard Ingestion (e.g., 500 rps), bounded by policy eval capacity.

Resource Requirements

  • CPU for JSON parsing and pattern matching; memory for small transient field buffers (< 512 KiB).
  • Optional vectorized hashing for tokenization.

Scaling Considerations

  • Scale Ingestion horizontally; cache policy decisions per-tenant.
  • Separate classifier pool if heavy patterns enabled.

Security & Compliance

Authentication

  • mTLS/service identity between Ingestion and Policy/Classification/Redaction services.

Authorization

  • Ingestion authorized to access tenant-scoped policies only.

Data Protection

  • Secrets never persisted; PII transformed per policy before write.
  • Hashing uses approved algorithms (e.g., SHA-256 with salt/pepper policy where applicable).

Compliance

  • Decision trail persisted (policyRevision, policyBasis, redactionHints) to support audits and DSAR exports.

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| validation_failures_total | counter | Number of 4xx validations | Spike > baseline |
| policy_eval_latency_ms | histogram | Policy call latency | p95 > 30 ms |
| redactions_applied_total | counter | Actions applied | Sudden drop (policy drift) |
| classified_records_total | counter | Records with classes | Monotonic vs ingest |
| cached_policy_decisions_total | counter | Cached-basis uses | > 5% sustained |

Logging Requirements

  • Structured logs include tenantId, auditRecordId (if available), policyRevision, policyBasis, and summarized redactionHints (no raw data).

Distributed Tracing

  • Spans: ingest.validate, policy.evaluate, classify.detect, redact.apply; attributes: tenant, payloadBytes, basis.

Health Checks

  • Readiness checks: Registry reachability, Policy cache warmness, classifier models loaded.

Operational Procedures

Deployment

  1. Deploy Ingestion with feature flag policy.eval.enabled=true, redaction.apply.enabled=true.
  2. Warm policy cache for top tenants; prefetch schema versions.
  3. Flip traffic gradually and watch latency/4xx/5xx rates.

Configuration

  • Env Vars: POLICY_BASE_URL, POLICY_CACHE_TTL, CLASSIFIER_TIMEOUT_MS, REDACTION_MODE (Apply|DryRun).
  • Patterns: versioned classifier pattern sets per tenant/edition.

Maintenance

  • Rotate hashing salts/peppers per schedule; invalidate caches.
  • Refresh classifier patterns as policies evolve.

Troubleshooting

  • High 400s: inspect pointers; verify schema version drift.
  • High cached-basis usage: Policy outage or network; check health and TTLs.
  • Unexpected PII in reads: verify redaction applied and read-profile masking.

Testing Scenarios

Happy Path Tests

  • Valid payload normalized; policy Live; redactions applied; append succeeds
  • Producer hints merged; never downgrade sensitivity
  • Cached policy basis used during brief outage; append still succeeds

Error Path Tests

  • Schema validation failure → 400 with pointers
  • Forbidden field by policy → 422 with pointer
  • Policy outage with empty cache → 503

Performance Tests

  • p95 validation+policy ≤ 50 ms at 500 rps/tenant
  • Classifier throughput with large attributes maps

Security Tests

  • Secrets dropped, not logged
  • PII hashing/tokenization conforms to policy (golden samples)
  • Authorization scoping of policy endpoints

Internal References

  • Batch Audit Record Ingestion Flow
  • Data Redaction Flow (Read)

External References

  • RFC 8785 (JSON Canonicalization Scheme)
  • W3C Trace Context (for correlation)

Appendices

A. Common Validation Rules (excerpt)

  • No NaN/Infinity; UTF-8, strings normalized to NFC; key set size ≤ 64; payload ≤ 256 KiB.

B. DataClass Examples

  • Personal: name, email; Sensitive: secrets, tokens; Operational: IP/UA.

C. Redaction Actions

  • Drop (remove), Mask (partial), Hash (one-way), Tokenize (reversible, vault-backed).

Audit Record Integrity Chain Flow

Creates a tamper-evidence chain for accepted audit facts. Each persisted AuditRecord becomes a leaf hash, batched into segments (Merkle trees), then sealed into blocks signed by KMS. Proof artifacts are written to the Evidence Store, a reference is attached to the record, and Integrity.ProofComputed is emitted.


Overview

Purpose: Guarantee immutability-at-rest by linking records into signed, verifiable chains with exportable proofs.
Scope: Post-append integrity processing: leaf hashing, segment buffering, Merkle root computation, block sealing/signing, evidence persistence, record back-reference, and event publication. Excludes verify-on-read (covered in a separate flow).
Context: Runs asynchronously after AuditRecord.Accepted. Segments seal on size/age thresholds. Blocks form a forward-only chain with PrevBlockRoot.
Key Participants:

  • Storage (Authoritative) — source of accepted records
  • Integrity Service — orchestrates hashing, sealing, signing
  • KMS — signs block headers; manages key rotation
  • Evidence Store — durable proofs (segments/blocks/manifests)
  • Projection Service — indexes proof refs for reads/search (optional)
  • Event Bus — publishes Integrity.ProofComputed

Prerequisites

System Requirements

  • Integrity workers online; access to Storage and Evidence Store
  • KMS key (current + optional previous for dual-verify window) available
  • Time sync within ±60s across services
  • Reliable message delivery from Storage to Integrity

Business Requirements

  • Tenant configured with integrity policy (segment size/age, edition/residency)
  • Retention rules do not remove proofs before data eligibility
  • Legal holds respected (proofs retained regardless)

Performance Requirements

  • Seal latency SLO: p95 ≤ 120s from Accepted to ProofComputed
  • Integrity throughput sized for ingest peak × safety margin (e.g., 1.5×)
  • Evidence Store write amplification budgeted

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    participant Storage as Storage (Authoritative)
    participant Integrity as Integrity Service
    participant KMS as KMS
    participant Evidence as Evidence Store
    participant Bus as Event Bus
    participant Projection as Projection Service

    Storage-->>Integrity: AuditRecord.Accepted { auditRecordId, tenantId, canonicalBytesRef }
    Integrity->>Integrity: LeafHash = SHA-256(canonicalBytes)
    Integrity->>Integrity: Append leaf to SegmentBuffer(tenant, shard)
    alt Seal threshold met (size or age)
        Integrity->>Integrity: MerkleRoot = merkle(leafHashes)
        Integrity->>KMS: Sign(BlockHeader { SegmentId, MerkleRoot, PrevBlockRoot })
        KMS-->>Integrity: ↩ Signature { keyId, sig }
        Integrity->>Evidence: Store { Segment, BlockHeader, Signature }
        Evidence-->>Integrity: ↩ EvidenceRef { segmentUri, blockUri }
        Integrity-->>Storage: Write IntegrityRef on records in segment
        Integrity-->>Bus: Publish Integrity.ProofComputed { tenantId, segmentId, blockId }
        Bus-->>Projection: Event fan-out (optional)
    else Buffer continues
        Integrity->>Integrity: Wait for more leaves or seal timeout
    end

Alternative Paths

  • Time-based seal: If size threshold not reached within sealMaxAge, force seal to bound verification lag.
  • Dual-sign window: During key rotation, blocks are signed with new key, and verifiers accept old or new keyId.
  • Cross-region catch-up: If region falls behind, segments seal independently; later anchor block links chains (see DR flow).

Error Paths

sequenceDiagram
    participant Integrity as Integrity Service
    participant KMS as KMS
    participant Evidence as Evidence Store

    Integrity->>KMS: Sign(BlockHeader)
    alt KMS unavailable
        KMS-->>Integrity: ↩ 503
        Integrity->>Integrity: Retry with backoff, keep SegmentBuffer open
    else Signature reject
        KMS-->>Integrity: ↩ error(key.invalid)
        Integrity->>Integrity: Quarantine segment, raise alert
    end

    Integrity->>Evidence: Store proofs
    alt Evidence store error
        Evidence-->>Integrity: ↩ 503
        Integrity->>Integrity: Retry, if max attempts → DLQ & operator action
    end

Request/Response Specifications

The chain creation is internal, but two public/operational surfaces are relevant: the event and the evidence retrieval API.

Input Requirements (event consumed by Integrity)

| Field | Type | Required | Description | Validation |
|---|---|---|---|---|
| auditRecordId | ULID | Y | Record identifier | Exists in Storage |
| tenantId | string | Y | Tenant scope | Valid tenant |
| canonicalBytesRef | uri | Y | Pointer to canonical JSON | Dereferenceable |
| createdAt | timestamp | Y | Record time | ISO-8601 UTC |
| observedAt | timestamp | Y | Ingestion time | ISO-8601 UTC |

Output Specifications

Event: Integrity.ProofComputed

| Field | Type | Description | Notes |
|---|---|---|---|
| tenantId | string | Tenant | |
| segmentId | ULID | Sealed segment id | |
| blockId | ULID | Block id | |
| keyId | string | Signing key identifier | From KMS |
| merkleRoot | hex | Root hash | SHA-256 |
| recordRange | object | {fromId, toId} | Optional |
| evidence | object | {segmentUri, blockUri} | Evidence Store refs |
| sealedAt | timestamp | Seal time | UTC |

API: GET /integrity/v1/proofs/{auditRecordId}

| Field | Type | Description | Notes |
|---|---|---|---|
| auditRecordId | path | Record id | ULID |
| include | query | leaf \| segment \| block \| all | Optional |

Response (200)

{
  "auditRecordId": "01JE9C5V6A7B8C9D0E1F2G3H4I",
  "leaf": { "hash": "sha256:ab…", "position": 128, "segmentId": "01JE9C6…" },
  "segment": { "merkleRoot": "sha256:cd…", "proofPath": ["ef…","01…"] },
  "block": { "blockId": "01JE9C7…", "prevBlockRoot": "sha256:12…", "signature": { "keyId": "kms-2025-10", "sig": "MEUCIQ…" } },
  "sealedAt": "2025-10-22T12:01:45.120Z"
}
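A client holding this response can recompute the segment root from the leaf hash and `proofPath` without trusting the server. The sketch below assumes one common encoding: siblings listed bottom-up, with the leaf's position bits selecting left/right concatenation at each level. The actual proof encoding is defined in the Integrity spec.

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def verify_proof(leaf_hash: bytes, position: int, proof_path: list[bytes],
                 expected_root: bytes) -> bool:
    """Recompute the Merkle root: at each level, the position's low bit
    says whether our node is the right (1) or left (0) child."""
    node = leaf_hash
    for sibling in proof_path:
        if position & 1:
            node = sha256(sibling + node)
        else:
            node = sha256(node + sibling)
        position >>= 1
    return node == expected_root
```

If the recomputed root matches `segment.merkleRoot`, the client then checks the block signature over that root with the published key for `keyId`.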

Error Handling

Error Scenarios

| Error Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Bad include value / malformed id | Correct request (ULID/enum) | No retry until corrected |
| 404 | Record or proof not found (not yet sealed or purged) | Poll later or verify eligibility | Retry after backoff |
| 409 | Append attempt to sealed segment (internal) | Start new segment; do not mutate sealed | N/A (system fix) |
| 422 | Signature cannot be generated due to key policy mismatch | Adjust policy / rotate properly | Retry after policy fix |
| 429 | Integrity backlog/backpressure | System scales workers | Automatic; client retries evidence GET |
| 503 | KMS/Evidence store unavailable | Wait for recovery | Exponential backoff + jitter |

Failure Modes

  • Segment overflow beyond configured max leaves: immediate seal and roll to next segment.
  • KMS key disabled: seals paused; alert; switch to standby key or rotate.
  • Evidence write partial: transactionally retry, or mark segment PendingEvidence.

Recovery Procedures

  1. If KMS/Evidence outage, allow buffers to grow; workers retry with capped backoff.
  2. If quarantine triggered (signature reject), isolate segment and open incident; re-sign with correct key after root cause.
  3. Reconcile PrevBlockRoot on restart to maintain a single forward chain per (tenant, shard).
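Step 3's PrevBlockRoot reconciliation amounts to walking block headers oldest-first and checking each link. A minimal sketch, assuming headers carry `merkleRoot` and `prevBlockRoot` as in Appendix A; the all-zeros genesis sentinel is an illustration, not the spec's value.

```python
from typing import Optional

GENESIS_PREV = "sha256:" + "0" * 64  # assumed sentinel for the first block

def verify_chain(blocks: list[dict]) -> Optional[int]:
    """Return the index of the first broken link, or None if the forward
    chain is intact for this (tenant, shard)."""
    prev_root = GENESIS_PREV
    for i, block in enumerate(blocks):
        if block["prevBlockRoot"] != prev_root:
            return i
        prev_root = block["merkleRoot"]
    return None
```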

Performance Characteristics

Latency Expectations

  • Leaf→ProofComputed: p50 20–40s; p95 ≤ 120s (time/size thresholds dependent)

Throughput Limits

  • Leaf hashing ≥ ingest throughput; segment sealing limited by Merkle + I/O (target ≥ 5k leaves/s per worker).

Resource Requirements

  • CPU for SHA-256/Merkle; memory for SegmentBuffer (bounded by max leaves or bytes).
  • Evidence Store IOPS sized for block bursts.

Scaling Considerations

  • Horizontal scale by tenant/shard queues.
  • Auto-seal if buffers exceed memory pressure.
  • Backpressure signaled to upstream only in extreme cases (avoid impacting ingest).

Security & Compliance

Authentication

  • mTLS between Integrity and KMS/Evidence Store.

Authorization

  • Integrity service principal limited to sign and write evidence; read-only for verify endpoints.

Data Protection

  • Proof artifacts encrypted at rest; signatures cover SegmentId, MerkleRoot, PrevBlockRoot, sealedAt.

Compliance

  • Proofs retained for at least as long as corresponding records; legal holds pin proofs.
  • Audit trail includes keyId, sealedAt, and policyRevision used for sealing thresholds.

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| integrity_queue_depth | gauge | Pending leaves | Rising > 10× baseline |
| segment_seal_latency_ms | histogram | Accept→seal delay | p95 > 120s |
| proof_compute_errors_total | counter | Failed proof writes | > 0 over 5m |
| kms_sign_latency_ms | histogram | KMS call time | p95 > 200ms |
| segments_sealed_total | counter | Count per tenant/shard | Trend watch |

Logging Requirements

  • Log segmentId, blockId, keyId, leaf counts, thresholds used; never log raw record bytes.

Distributed Tracing

  • Spans: integrity.hash.leaf, integrity.seal.segment, kms.sign, evidence.write; attributes include tenant, segmentSize, ageSec.

Health Checks

  • Readiness: KMS reachable; Evidence Store writable; backlog below watermark.
  • Liveness: worker heartbeats; buffer pressure alarms.

Operational Procedures

Deployment

  1. Deploy Integrity workers; keep integrity.enabled=false.
  2. Validate KMS permissions and dry-run seal on a test tenant.
  3. Enable and monitor queue_depth, seal_latency_ms.

Configuration

  • Env Vars: SEAL_MAX_LEAVES, SEAL_MAX_AGE_SEC, KMS_KEY_ID, MAX_BUFFER_BYTES
  • Backoff: KMS_RETRY_BACKOFF, EVIDENCE_RETRY_BACKOFF

Maintenance

  • Rotate keyId on schedule; run dual-verify window; archive old public keys.
  • Periodic integrity audit: random-sample verify segments nightly.

Troubleshooting

  • High queue depth → add workers; lower seal thresholds temporarily.
  • Signature failures → verify KMS policy/alg; check clock skew.
  • Missing proofs → check DLQ for segments marked PendingEvidence.

Testing Scenarios

Happy Path Tests

  • Given AuditRecord.Accepted, then Integrity.ProofComputed within SLO and record has IntegrityRef.
  • Merkle proof verifies for random leaves in sealed segment.

Error Path Tests

  • KMS outage → seals delayed; proofs catch up after recovery.
  • Evidence store 503 → retries; no data loss; segment eventually Sealed.

Performance Tests

  • Seal at size threshold (e.g., 10k leaves) under peak ingest.
  • Seal at age threshold (e.g., 60s) with sparse ingest.

Security Tests

  • Signatures verify with current and previous keyId during rotation.
  • Unauthorized client cannot fetch proofs from another tenant.

Internal References

  • Verify-On-Read Flow
  • Export eDiscovery Flow
  • Tamper Detection Flow

External References

  • Merkle tree concepts (general)
  • KMS provider docs for signing APIs

Appendices

A. Block Header (conceptual)

{
  "blockId": "01JE9C7…",
  "segmentId": "01JE9C6…",
  "merkleRoot": "sha256:cd…",
  "prevBlockRoot": "sha256:12…",
  "sealedAt": "2025-10-22T12:01:45.120Z",
  "keyId": "kms-2025-10",
  "signature": "MEQCIF…"
}

B. Leaf Hash Definition

LeafHash = SHA-256( CanonicalRecordBytes )
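The leaf definition extends naturally to the segment root computed at seal time. The sketch below pairs hashes level by level, duplicating the last node when a level is odd; that padding rule is one common convention and an assumption here, since the Integrity spec governs the actual tree construction.

```python
import hashlib

def leaf_hash(canonical_bytes: bytes) -> bytes:
    """LeafHash = SHA-256(CanonicalRecordBytes), per Appendix B."""
    return hashlib.sha256(canonical_bytes).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    """Pairwise-hash levels up to the root; odd node duplicated (assumed)."""
    if not leaves:
        raise ValueError("segment must contain at least one leaf")
    level = leaves
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]  # duplicate last leaf of an odd level
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]
```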

Audit Record Projection Update Flow

Builds query-optimized views from authoritative append-only facts. The Projector consumes accepted records, performs idempotent upserts into read models (AuditEvents timeline, Resource- and Actor-centric projections), updates the Search index, invalidates caches, and advances a checkpoint/watermark to guarantee at-least-once processing without duplication.


Overview

Purpose: Materialize fast, tenant-scoped views for queries and search while tracking consistent progress via checkpoints.
Scope: Post-append event consumption, idempotent projection updates, search indexing, cache invalidation, checkpointing, and replay/rebuild controls. Excludes ingestion, redaction policy evaluation, and verify-on-read.
Context: Runs asynchronously after AuditRecord.Accepted; multiple projector shards process per tenant/partition with strict ordering guarantees.
Key Participants:

  • Storage (Authoritative) — emits AuditRecord.Accepted
  • Projector — applies projection logic, maintains idempotency & checkpoints
  • Read DB — projection tables (AuditEvents, Resource, Actor)
  • Search Index — per-tenant documents for full-text/facets/suggest
  • Cache — key-based caches for hot read paths
  • Checkpoint Store — durable cursor (offset/watermark)
  • Event Bus — transport for Accepted and internal signals

Prerequisites

System Requirements

  • Storage → Bus delivery configured; Projector subscribed to AuditRecord.Accepted
  • Read DB reachable with migrations applied for projection schemas
  • Checkpoint Store provisioned (per tenant/shard)
  • Search cluster online and tenant indices created (if enabled)

Business Requirements

  • Tenants activated with edition flags for Search (optional)
  • Data minimization rules acknowledged in projection shapes
  • Cache TTLs defined per view (timeline/resource/actor)

Performance Requirements

  • Projection lag SLO: p95 ≤ 5 s from Accepted to visible in reads
  • Indexing throughput sized to match ingest rate (≥ 1×)
  • Checkpoint advance p99 commit ≤ 50 ms

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    participant Storage as Storage (Authoritative)
    participant Bus as Event Bus
    participant Proj as Projector
    participant ReadDB as Read DB (Projections)
    participant Search as Search Index
    participant Cache as Cache
    participant Ckpt as Checkpoint Store

    Storage-->>Bus: Publish AuditRecord.Accepted {tenantId, auditRecordId, canonicalRef}
    Bus-->>Proj: Deliver event (ordered per partition)
    Proj->>Proj: Idempotency check (eventId vs last offset)
    Proj->>ReadDB: UPSERT AuditEvents (timeline)
    Proj->>ReadDB: UPSERT ResourceProjection (by resource)
    Proj->>ReadDB: UPSERT ActorProjection (by actor)
    alt Search enabled
      Proj->>Search: UPSERT index document(s)
    end
    Proj->>Cache: Invalidate keys {timeline:tenant, resource:id, actor:id}
    Proj->>Ckpt: Commit watermark {offset, auditRecordId, observedAt}
    Ckpt-->>Proj: ↩ ack

Alternative Paths

  • Out-of-order duplicate: Projector detects processed offset and skips; checkpoint remains.
  • Rebuild: Admin issues Rebuild command → Projector resets checkpoint to origin, clears projections (or writes compaction shadow tables), replays events, then swaps.
  • Partial Indexing: If Search is temporarily disabled for a tenant, projector queues index updates and advances DB projections; index will catch up later from a backlog.

Error Paths

sequenceDiagram
    participant Proj as Projector
    participant ReadDB as Read DB
    participant Ckpt as Checkpoint Store
    participant Search as Search Index

    Proj->>ReadDB: UPSERT projections
    alt Constraint conflict (unique key)
        ReadDB-->>Proj: ↩ 409 conflict
        Proj->>Proj: Apply idempotent merge, retry once
    else Bad projection payload (schema drift)
        ReadDB-->>Proj: ↩ 400 bad request
        Proj->>Proj: Quarantine record → DLQ, continue stream
    end

    Proj->>Ckpt: Commit watermark
    alt Not found checkpoint stream
        Ckpt-->>Proj: ↩ 404 not found
        Proj->>Ckpt: Create stream atomically, retry
    end

    Proj->>Search: UPSERT doc
    alt Index unavailable / rate-limited
        Search-->>Proj: ↩ 429/503
        Proj->>Proj: Buffer + backoff, do not block DB projections
    end

Request/Response Specifications

External APIs are operational controls; projections themselves are internal upserts.

Input Requirements (event consumed)

| Field | Type | Required | Description | Validation |
|---|---|---|---|---|
| tenantId | string | Y | Tenant scope | Known tenant |
| auditRecordId | ULID | Y | Record id | Exists in Storage |
| createdAt | timestamp | Y | Producer time | ISO-8601 UTC |
| observedAt | timestamp | Y | Ingestion time | ISO-8601 UTC |
| action | string | Y | Event verb | normalized |
| resource | object | Y | {type,id,path?} | normalized |
| actor | object | Y | {id,type,display?} | present |
| decision | object | N | Access outcome | enum |
| attributes | map | N | extras | bounded |

Output Specifications (projections)

| Projection | Key | Shape (summary) | Notes |
|---|---|---|---|
| AuditEvents | (tenantId, createdAt, auditRecordId) | timeline row | paginates by cursor |
| ResourceProjection | (tenantId, resource.type, resource.id) | latest state + last actions | small, denormalized |
| ActorProjection | (tenantId, actor.id) | last actions, resources touched | for actor-centric queries |
| Search Document | (tenantId, auditRecordId) | flattened facets + text | edition-gated |

Operational APIs

GET /projections/v1/{tenant}/{name}/status

Response 200:

{
  "tenant": "acme",
  "name": "AuditEvents",
  "watermark": { "offset": 1203981, "auditRecordId": "01JEA...", "updatedAt": "2025-10-22T12:00:06.100Z" },
  "lag": { "seconds": 2.4, "records": 180 },
  "state": "Healthy"
}

POST /projections/v1/{tenant}/{name}/rebuild → 202 with { jobId }


Error Handling

Error Scenarios

| Error Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Bad request to ops API (invalid name, bad params) | Fix request | No retry until corrected |
| 404 | Status/rebuild for unknown projection or tenant | Validate inputs | No retry |
| 409 | Rebuild already in progress / checkpoint conflict | Use existing job or wait | Retry after completion |
| 422 | Event schema drift vs projection mapper | Quarantine & hotfix mapper | Continue stream; backfill later |
| 429 | Search/index or cache backpressure | Defer indexing; advance DB | Automatic retry/backoff |
| 503 | Read DB/Checkpoint store transient failure | Keep event, retry | Exponential backoff + jitter |

Failure Modes

  • Poison event: irreconcilable mapping → send to DLQ with pointers; continue stream.
  • Cache stampede: cache invalidations batched/coalesced; use jittered TTLs.
  • Idempotency race: unique key conflicts resolved via UPSERT with deterministic merge.
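The idempotency-race resolution can be sketched as a deterministic merge keyed on the event offset. Below, a dict stands in for the Read DB table (an assumption for illustration): the higher offset wins, so duplicate deliveries and out-of-order replays are no-ops.

```python
def upsert(table: dict, key: str, row: dict) -> bool:
    """Idempotent upsert: apply only if this event's offset is newer than
    the stored row's. Returns True if the row changed. Deterministic:
    re-delivering the same offset can never flip the stored state."""
    existing = table.get(key)
    if existing is not None and existing["offset"] >= row["offset"]:
        return False  # duplicate or out-of-order replay: skip
    table[key] = row
    return True
```

The same rule maps onto a SQL UPSERT with a `WHERE excluded.offset > current.offset` guard, which is what makes at-least-once delivery safe for the projections.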

Recovery Procedures

  1. If Read DB/Checkpoint outage, pause commits but keep events buffered; resume and commit in order.
  2. For DLQ items, fix mapper/policy, then replay from saved offset range.
  3. During rebuild, expose state:"Rebuilding"; queries read from shadow tables if configured.

Performance Characteristics

Latency Expectations

  • Accept → Read visible: p95 ≤ 5 s
  • Accept → Indexed: p95 ≤ 10 s (if search enabled)

Throughput Limits

  • Sustains ingest parity; projectors process ≥ 1× ingest rps per shard.

Resource Requirements

  • CPU for mapping/JSON flatten; DB connections sized for write bursts.
  • Search bulkers batch 500–1,000 docs or 5–10 MiB per flush.

Scaling Considerations

  • Horizontal scale by tenant/shard.
  • HPA/KEDA on queue depth, projection lag, and p95 projector latency.
  • Apply backpressure to indexing only; keep DB projections current.

Security & Compliance

Authentication

  • mTLS between Projector and Read DB/Search/Checkpoint.

Authorization

  • Projector principal has write on projections & checkpoint, write/bulk on Search, no read of other tenants.

Data Protection

  • Store only minimized fields required for query/search; avoid sensitive raw values.
  • Tenant isolation enforced at table/index level (prefix/shard keys).

Compliance

  • Projection updates logged with tenant, auditRecordId, and mapperVersion for auditability.

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| projection_lag_seconds | gauge | Accept→visible delay | > 5s p95 (5m) |
| projected_records_total | counter | Rows upserted | Trend vs ingest |
| checkpoint_commit_latency_ms | histogram | Commit time | p95 > 50ms |
| projection_conflicts_total | counter | 409 upserts | Rising trend |
| index_updates_backlog | gauge | Pending index docs | Growing without drop |

Logging Requirements

  • Structured logs: tenant, auditRecordId, offset, mapperVersion, conflict summaries (no sensitive values).

Distributed Tracing

  • Spans: projector.consume, mapper.apply, readdb.upsert, index.bulk, checkpoint.commit.
  • Attributes: tenant, offset, bulkCount, lagMs.

Health Checks

  • Readiness: connectivity to Read DB/Search/Checkpoint; lag below threshold.
  • Liveness: consumer heartbeats; partition ownership indicator.

Operational Procedures

Deployment

  1. Deploy Projector with projector.enabled=false.
  2. Run migrations for projection schemas.
  3. Enable consumers per tenant/shard; monitor projection_lag_seconds.

Configuration

  • Env Vars: PROJECTOR_PARALLELISM, CHECKPOINT_BATCH, INDEX_BULK_BYTES, INDEX_BULK_DOCS
  • Flags: search.enabled, rebuild.shadowSwap=true

Maintenance

  • Periodic compaction of timeline tables; rotate old index aliases.
  • Update mapperVersion with schema changes; keep backward compatibility.

Troubleshooting

  • Rising lag → scale workers or reduce index bulk size; inspect DB write contention.
  • Many conflicts → verify UPSERT keys & mapping determinism.
  • Backlog in indexing → check cluster health; enable backpressure-only mode.

Testing Scenarios

Happy Path Tests

  • Accepted event produces AuditEvents row, Resource & Actor upserts; watermark advances.
  • Search document visible; cache invalidated and repopulated on read.

Error Path Tests

  • Unique key conflict handled idempotently (no duplicate rows).
  • Bad ops API request → 400; unknown projection → 404; rebuild in progress → 409.

Performance Tests

  • Maintain p95 ≤ 5 s at target ingest rps with search enabled/disabled.
  • Bulk indexing flush sizes tuned for p95 < 1 s per bulk.

Security Tests

  • Tenant isolation in projections and index aliases.
  • No sensitive fields persisted beyond minimization policy.

Internal References

  • Standard Audit Record Ingestion Flow
  • Audit Record Integrity Chain Flow
  • Search Query Flow

External References

  • Bulk indexing guidance for the chosen search engine (vendor docs)

Appendices

A. UPSERT Keys (example)

  • AuditEvents: (tenantId, createdAt, auditRecordId)
  • ResourceProjection: (tenantId, resourceType, resourceId)
  • ActorProjection: (tenantId, actorId)

B. Checkpoint Record (example)

{
  "tenant": "acme",
  "partition": "p3",
  "offset": 1203981,
  "auditRecordId": "01JEA…",
  "updatedAt": "2025-10-22T12:00:06.100Z",
  "mapperVersion": "v7"
}

HTTP REST API Ingestion Flow

REST transport for appending a single AuditRecord via API Gateway. Details HTTP method/endpoint, required headers, authentication & rate limiting, header-to-internal mapping (traceparent, x-tenant-id, x-idempotency-key), response behaviors (2xx/4xx/5xx), and concrete request/response examples.


Overview

Purpose: Provide a secure, idempotent HTTP interface for producers to append audit facts through the Gateway.
Scope: HTTP semantics (headers, status codes, retries), authN/Z at the edge, rate limiting, payload size/type validation, Problem+JSON errors. Excludes batch/gRPC/bus transports (separate flows) and downstream integrity/projection internals.
Context: Front door for most interactive clients; maps cleanly to the canonical ingestion path.
Key Participants:

  • HTTP Client (producer)
  • API Gateway (edge policy, authN/Z, limits)
  • Ingestion Service (validation/canonicalization)
  • Policy Service (classification/redaction hints, invoked by Ingestion)
  • Storage (Authoritative) (append/WORM)

Prerequisites

System Requirements

  • TLS 1.2+ enabled on Gateway; valid certificates
  • Gateway has JWKS/issuer config to validate JWTs (OIDC)
  • Network routes Gateway → Ingestion (and Ingestion → Policy/Storage)

Business Requirements

  • Tenant exists, active, and mapped to regions/partitions
  • Policy/retention configurations present for tenant
  • Edition flags set (may influence limits)

Performance Requirements

  • Gateway rate limits sized per tenant (burst/sustained)
  • Max payload ≤ 256 KiB; P95 end-to-end ≤ 50 ms at target RPS
  • Idempotency store capacity sized for 24h dedupe window

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    actor Client as HTTP Client
    participant Gateway as API Gateway
    participant Ingestion as Ingestion Service
    participant Storage as Storage (Authoritative)

    Client->>Gateway: POST /audit/v1/records<br/>(h: Authorization, x-tenant-id, traceparent, x-idempotency-key)<br/>(b: application/json)
    Note right of Gateway: Validate JWT, tenant scope, rate-limit, content-type & size
    Gateway->>Ingestion: Append(request) (forward required headers)
    Ingestion->>Ingestion: Validate + canonicalize + policy hints
    Ingestion->>Storage: INSERT canonical record (WORM)
    Storage-->>Ingestion: ack {auditRecordId}
    Ingestion-->>Gateway: 202 {auditRecordId, status:"Created"}
    Gateway-->>Client: 202 Accepted {auditRecordId, status} + rate-limit headers
Hold "Alt" / "Option" to enable pan & zoom

Alternative Paths

  • Duplicate idempotency key: 202 with status:"Duplicate" and original auditRecordId.
  • Server-assigned ULID: Omit auditRecordId and receive assigned value in response.
  • CORS/browser clients: Preflight OPTIONS handled by Gateway; only safelisted headers exposed.

Error Paths

sequenceDiagram
    actor Client
    participant Gateway as API Gateway
    Client->>Gateway: POST /audit/v1/records (bad/missing bits)
    alt Bad request (shape/size/type)
        Gateway-->>Client: 400/413/415 Problem+JSON
    else Unauthorized / Forbidden
        Gateway-->>Client: 401/403 Problem+JSON
    else Not found / wrong route
        Gateway-->>Client: 404 Problem+JSON
    else Conflict (idempotency anomaly)
        Gateway-->>Client: 409 Problem+JSON
    else Rate limited
        Gateway-->>Client: 429 Problem+JSON (+ Retry-After)
    else Upstream unavailable
        Gateway-->>Client: 503 Problem+JSON
    end
Hold "Alt" / "Option" to enable pan & zoom

Request/Response Specifications

Input Requirements

| Field | Type | Required | Description | Validation |
| --- | --- | --- | --- | --- |
| Method | HTTP | Y | POST | POST /audit/v1/records |
| Content-Type | header | Y | Body MIME type | application/json; charset=utf-8 |
| Authorization | header | Y | Bearer JWT | Valid signature, audience, tenant claim |
| x-tenant-id | header | Y | Tenant routing | ^[A-Za-z0-9._-]{1,128}$ |
| traceparent | header | Y | W3C trace context | 55-char format |
| x-idempotency-key | header | Y | Dedupe per tenant (24h) | ≤128 visible ASCII |
| Body | JSON | Y | Canonical AuditRecord fields | See Data Model rules |
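The header validations above can be sketched as a small helper. This is an illustrative Python sketch, not the Gateway's actual code; the function name is an assumption:

```python
import re

# traceparent follows W3C Trace Context:
# version(2) - trace-id(32) - parent-id(16) - flags(2) = 55 chars incl. dashes.
TRACEPARENT = re.compile(r"^[0-9a-f]{2}-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}$")
TENANT = re.compile(r"^[A-Za-z0-9._-]{1,128}$")

def validate_headers(h: dict) -> list:
    """Return the names of required headers that fail validation."""
    errors = []
    if not TENANT.match(h.get("x-tenant-id", "")):
        errors.append("x-tenant-id")
    tp = h.get("traceparent", "")
    if len(tp) != 55 or not TRACEPARENT.match(tp):
        errors.append("traceparent")
    key = h.get("x-idempotency-key", "")
    # "visible ASCII" = printable, non-space characters (0x21-0x7E)
    if not (0 < len(key) <= 128 and all(33 <= ord(c) <= 126 for c in key)):
        errors.append("x-idempotency-key")
    return errors

ok = {
    "x-tenant-id": "acme",
    "traceparent": "00-3e1f2d0c9b8a7f6e5d4c3b2a19081716-7f6e5d4c3b2a1908-01",
    "x-idempotency-key": "acme-ord-9981-v1",
}
print(validate_headers(ok))                           # []
print(validate_headers({**ok, "traceparent": "xx"}))  # ['traceparent']
```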

Output Specifications

| Field | Type | Description | Notes |
| --- | --- | --- | --- |
| auditRecordId | ULID | Durable record id | Server returns original or assigned |
| status | string | Created or Duplicate | Idempotent behavior |
| observedAt | timestamp | Ingestion observation time | ms precision |
| traceId | hex32 | Echo from traceparent | Correlation |
| links.self | uri | Record locator | Optional operation link |

Example Payloads

Request

POST /audit/v1/records HTTP/1.1
Host: api.atp.example
Authorization: Bearer eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9...
x-tenant-id: acme
traceparent: 00-3e1f2d0c9b8a7f6e5d4c3b2a19081716-7f6e5d4c3b2a1908-01
x-idempotency-key: acme-ord-9981-v1
Content-Type: application/json; charset=utf-8

{
  "tenantId": "acme",
  "schemaVersion": "auditrecord.v1",
  "createdAt": "2025-10-22T12:00:03.100Z",
  "action": "user.create",
  "resource": { "type": "Iam.User", "id": "U-1001" },
  "actor": { "id": "svc_ingress", "type": "Service" }
}

Response — 202 Accepted

{
  "auditRecordId": "01JEB0V2G7NY5T6Q9KX3M4C8AP",
  "status": "Created",
  "observedAt": "2025-10-22T12:00:03.280Z",
  "traceId": "3e1f2d0c9b8a7f6e5d4c3b2a19081716",
  "links": {
    "self": "/audit/v1/records/01JEB0V2G7NY5T6Q9KX3M4C8AP"
  }
}

Response — 400 Bad Request (Problem+JSON)

{
  "type": "urn:connectsoft:errors/validation/action.invalid",
  "title": "Invalid action",
  "status": 400,
  "detail": "Action must match ^[a-z]+(\\.[a-z0-9_-]+)?$",
  "errors": [{ "pointer": "/action", "reason": "regex" }],
  "traceId": "3e1f2d0c9b8a7f6e5d4c3b2a19081716"
}

Error Handling

Status Code Matrix

| Class | Code | When | Notes |
| --- | --- | --- | --- |
| 2xx | 202 | Accepted (created or deduped) | Body includes status: "Created" or "Duplicate" |
| 4xx | 400 | Shape/field invalid, schema mismatch | Problem+JSON with errors[].pointer |
| 4xx | 401 | Missing/invalid JWT | Bearer challenge omitted for APIs; response body explains |
| 4xx | 403 | Tenant/permission forbidden | Token valid but insufficient scope |
| 4xx | 404 | Unknown route/tenant or disabled feature | Useful for wrong base path or edition |
| 4xx | 409 | Idempotency anomaly / conflicting op link | Rare; follow links.operation if present |
| 4xx | 413 | Payload exceeds 256 KiB | Include maxBytes hint |
| 4xx | 415 | Unsupported media type | Require application/json |
| 4xx | 429 | Rate-limited/backpressure | Include Retry-After (seconds or HTTP date) |
| 5xx | 503 | Upstream dependency unavailable | Retry with same idempotency key |

Failure Modes

  • Clock skew: createdAt > now+2m → 400 with pointer /createdAt.
  • Tenant mismatch: body tenantId ≠ header x-tenant-id → 403.
  • Idempotency race: concurrent distinct payloads under same key → 409.
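The first two failure modes reduce to simple request checks. A minimal sketch, assuming a hypothetical `check` helper (not the Gateway's actual code), returning the status code and error pointer described above:

```python
from datetime import datetime, timedelta, timezone

MAX_SKEW = timedelta(minutes=2)  # createdAt may not be > now + 2m

def check(body: dict, headers: dict, now: datetime):
    """Return (statusCode, errorPointer) for the skew and tenant checks."""
    created = datetime.fromisoformat(body["createdAt"].replace("Z", "+00:00"))
    if created > now + MAX_SKEW:
        return 400, "/createdAt"   # clock skew: reject with field pointer
    if body["tenantId"] != headers["x-tenant-id"]:
        return 403, None           # tenant mismatch: body vs routing header
    return 202, None

now = datetime(2025, 10, 22, 12, 0, 3, tzinfo=timezone.utc)
body = {"tenantId": "acme", "createdAt": "2025-10-22T12:00:03.100Z"}
print(check(body, {"x-tenant-id": "acme"}, now))   # (202, None)
print(check({**body, "createdAt": "2025-10-22T12:05:00.000Z"},
            {"x-tenant-id": "acme"}, now))         # (400, '/createdAt')
print(check(body, {"x-tenant-id": "other"}, now))  # (403, None)
```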

Recovery Procedures

  1. For 4xx, correct payload/headers and resend (new key except for 409).
  2. For 429/503, retry with exponential backoff + jitter; reuse the same x-idempotency-key.
  3. Track traceId from responses to correlate retries.

Performance Characteristics

Latency Expectations

  • Gateway edge: P50 5–10 ms, P95 ≤ 20 ms
  • End-to-end to 202: P50 15–25 ms, P95 ≤ 50 ms

Throughput Limits

  • Default per-tenant: 500 rps sustained, 2k rps burst (60s)
  • Global: ≥ 50k rps across shards (capacity dependent)

Resource Requirements

  • Gateway CPU for JWT validation and header processing; memory for small payload buffers.

Scaling Considerations

  • Scale Gateway horizontally; HPA on rps & p95.
  • Separate rate limit buckets per tenant and per route.

Security & Compliance

Authentication

  • OIDC JWT Bearer; short-lived (≤ 15m), leeway ±60s.

Authorization

  • Require audit:append scoped to x-tenant-id; Gateway enforces edition access.

Data Protection

  • TLS 1.2+; HSTS at edge; CORS preflight for browser-based producers (restrict origins & headers).

Compliance

  • Log who/when appended; immutable WORM store; Problem+JSON avoids leaking sensitive values.

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
| --- | --- | --- | --- |
| http_requests_total{route="/audit/v1/records"} | counter | Request rate | Anomaly vs baseline |
| http_request_duration_ms | histogram | Latency | p95 > 50 ms (5m) |
| http_responses_total{status=4xx/5xx} | counter | Error rates | > 1% 4xx (validation spikes), any 5xx |
| rate_limited_total | counter | 429 responses | > 5% sustained |

Logging Requirements

  • Structured logs with tenantId, traceId, idempotencyKey (hashed), statusCode; no sensitive payloads.

Distributed Tracing

  • Propagate traceparent to Ingestion; spans gateway.authz, gateway.forward with attributes tenant, payloadBytes.

Health Checks

  • Liveness: process/thread checks; Readiness: JWKS reachable, Ingestion upstream healthy.

Operational Procedures

Deployment

  1. Deploy Gateway route behind feature flag ingest.rest.enabled=false.
  2. Smoke test with signed JWT and minimal payload; verify 202 and headers.
  3. Enable feature flag and gradually raise rate limits.

Configuration

  • Env Vars / Config: JWKS URI, audiences, rate limit buckets, max payload bytes, allowed CORS origins/headers.
  • Headers to forward: traceparent, x-tenant-id, x-idempotency-key.

Maintenance

  • Rotate keys/JWKS; cache with TTL; monitor expired/invalid token spikes.

Troubleshooting

  • Many 401s → check JWKS drift/clock skew.
  • Many 415s → clients mis-sending Content-Type.
  • Elevated 409s → investigate idempotency key collisions in client.

Testing Scenarios

Happy Path Tests

  • Valid POST returns 202 with status:"Created" and auditRecordId.
  • Duplicate x-idempotency-key returns 202 with status:"Duplicate".
  • Trace propagation: traceId echoed matches traceparent.

Error Path Tests

  • 400 invalid action; pointer /action.
  • 404 wrong route (e.g., /audit/v2/...).
  • 409 conflicting idempotency key (distinct payload).
  • 415 wrong media type; 413 too large.
  • 429 with Retry-After; 503 transient outage.

Performance Tests

  • Sustain 500 rps tenant; p95 ≤ 50 ms.
  • Burst to 2k rps without >1% errors.

Security Tests

  • JWT expiration & audience checks enforced.
  • CORS preflight honors allowed origins and headers.
  • Tenant mismatch (header vs body) rejected with 403.

Internal References

  • gRPC Service Ingestion Flow
  • Service Bus (MassTransit) Ingestion Flow
  • Retry Flow

External References

  • RFC 7807 (Problem Details for HTTP APIs)
  • W3C Trace Context (traceparent)

Appendices

A. cURL Examples

curl -sS -X POST "https://api.atp.example/audit/v1/records" \
  -H "Authorization: Bearer $TOKEN" \
  -H "x-tenant-id: acme" \
  -H "traceparent: 00-$(uuidgen | tr 'A-Z' 'a-z' | tr -d '-')-$(uuidgen | tr 'A-Z' 'a-z' | cut -c1-16)-01" \
  -H "x-idempotency-key: acme-ord-9981-v1" \
  -H "Content-Type: application/json; charset=utf-8" \
  --data-binary @record.json

B. Rate Limiting Headers (example)

RateLimit-Limit: 2000, 500;w=60
RateLimit-Remaining: 1980
RateLimit-Reset: 45
Retry-After: 3
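A client honoring these headers must accept both `Retry-After` forms (delta-seconds or an HTTP-date, per RFC 9110). An illustrative sketch of the conversion (helper name is an assumption):

```python
from email.utils import parsedate_to_datetime
from datetime import datetime, timezone

def retry_after_seconds(value: str, now: datetime) -> float:
    """Convert a Retry-After header value into a wait in seconds."""
    if value.isdigit():                    # delta-seconds form, e.g. "3"
        return float(value)
    when = parsedate_to_datetime(value)    # HTTP-date form
    return max(0.0, (when - now).total_seconds())

now = datetime(2025, 10, 22, 12, 0, 0, tzinfo=timezone.utc)
print(retry_after_seconds("3", now))                              # 3.0
print(retry_after_seconds("Wed, 22 Oct 2025 12:00:30 GMT", now))  # 30.0
```

After the wait, resend with the same `x-idempotency-key` so the retry dedupes server-side.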

gRPC Service Ingestion Flow

High-QPS, low-latency transport for appending individual AuditRecord items using gRPC. Clients call a unary Append RPC on the Gateway, passing metadata for tenant, traceparent, idempotency, and authorization. The Gateway authenticates/authorizes and forwards to Ingestion; responses use canonical gRPC status codes with retry/backoff guidance.


Overview

Purpose: Provide a high-throughput ingestion path with efficient framing, multiplexing, and connection reuse.
Scope: gRPC method shape, metadata requirements, authN/Z, rate limiting, error code mapping, retries/backoff, and sample code-first contracts. Excludes batch uploads and message bus ingestion.
Context: Preferred for service-to-service producers and heavy internal traffic; functionally equivalent to REST ingestion but with gRPC semantics.
Key Participants:

  • gRPC Client (producer)
  • gRPC Gateway (edge; authN/Z, limits, metadata mapping)
  • Ingestion Service (validate/canonicalize, policy/classification/redaction)
  • Storage (Authoritative) (append/WORM)

Prerequisites

System Requirements

  • Gateway and Ingestion expose/accept HTTP/2 with TLS (mTLS optional for internal meshes)
  • OIDC/JWKS configured at the Gateway to validate authorization metadata
  • Network connectivity Gateway ↔ Ingestion ↔ Storage/Policy services

Business Requirements

  • Tenant active and mapped to partitions/regions
  • Policy and retention configured for tenant
  • Edition flags (e.g., max RPS) set if applicable

Performance Requirements

  • Connection pooling enabled; client max concurrent streams tuned (HTTP/2)
  • End-to-end p95 ≤ 40 ms at target RPS; message size ≤ 256 KiB
  • Idempotency store sized for 24h dedupe window

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    actor Client as gRPC Client
    participant GW as gRPC Gateway
    participant Ing as Ingestion Service
    participant Store as Storage (Authoritative)

    Client->>GW: Append(AuditRecord) + metadata{authorization, x-tenant-id, traceparent, x-idempotency-key}
    Note right of GW: Validate token, tenant scope, rate limit; map metadata → headers
    GW->>Ing: Append(request, forwarded metadata)
    Ing->>Ing: Validate + canonicalize + policy/classification/redaction
    Ing->>Store: INSERT canonical record (WORM)
    Store-->>Ing: ack {auditRecordId}
    Ing-->>GW: AppendReply {auditRecordId, status=Created}
    GW-->>Client: OK (AppendReply) + trailers {traceId}
Hold "Alt" / "Option" to enable pan & zoom

Alternative Paths

  • Duplicate idempotency key: return OK with status=Duplicate and original auditRecordId.
  • Server-assigned ID: client omits auditRecordId; service returns assigned ULID.
  • Streaming batch (future): optional client- or server-streaming variants reuse the same metadata (not covered here).

Error Paths

sequenceDiagram
    actor Client
    participant GW as gRPC Gateway

    Client->>GW: Append(bad or unauthorized)
    alt Invalid argument / too large
        GW-->>Client: INVALID_ARGUMENT / RESOURCE_EXHAUSTED
    else Unauthenticated / permission denied
        GW-->>Client: UNAUTHENTICATED / PERMISSION_DENIED
    else Not found route / disabled
        GW-->>Client: NOT_FOUND
    else Idempotency conflict (payload differs)
        GW-->>Client: ALREADY_EXISTS
    else Rate limited / upstream unavailable
        GW-->>Client: RESOURCE_EXHAUSTED / UNAVAILABLE
    end
Hold "Alt" / "Option" to enable pan & zoom

Request/Response Specifications

Input Requirements

| Field | Type | Required | Description | Validation |
| --- | --- | --- | --- | --- |
| RPC | unary | Y | Append(AppendRequest) returns (AppendReply) | gRPC |
| authorization (metadata) | string | Y | Bearer <JWT> | Valid signature, audience, tenant claim |
| x-tenant-id (metadata) | string | Y | Tenant routing | ^[A-Za-z0-9._-]{1,128}$ |
| traceparent (metadata) | string | Y | W3C Trace Context | 55-char format |
| x-idempotency-key (metadata) | string | Y | Dedupe per tenant (24h) | ≤128 visible ASCII |
| AppendRequest.auditRecord | message | Y | Canonical AuditRecord | See Data Model limits (≤ 256 KiB) |
| AppendRequest.schemaVersion | string | Y | Contract version | Known & active |

Metadata naming: gRPC metadata keys are lowercase ASCII. Use exactly: authorization, x-tenant-id, traceparent, x-idempotency-key.

Output Specifications

| Field | Type | Description | Notes |
| --- | --- | --- | --- |
| AppendReply.auditRecordId | string (ULID) | Durable id | Assigned or echoed |
| AppendReply.status | enum | Created or Duplicate | Idempotent result |
| AppendReply.observedAt | timestamp | Ingestion observation | ms precision |
| trailers:traceid | hex32 | Correlation id | Derived from traceparent |

Example Payloads

Proto (illustrative; see code-first C# below)

service AuditIngestion {
  rpc Append (AppendRequest) returns (AppendReply);
}

message AppendRequest {
  string schemaVersion = 1;
  AuditRecord auditRecord = 2;
}

message AppendReply {
  string auditRecordId = 1;
  string status = 2; // "Created" | "Duplicate"
  string observedAt = 3; // ISO-8601 UTC
}

Example grpcurl

grpcurl -d @ \
  -H "authorization: Bearer $TOKEN" \
  -H "x-tenant-id: acme" \
  -H "traceparent: 00-3e1f2d0c9b8a7f6e5d4c3b2a19081716-7f6e5d4c3b2a1908-01" \
  -H "x-idempotency-key: acme-ord-9981-v1" \
  api.atp.example:443 audit.AuditIngestion/Append <<'JSON'
{
  "schemaVersion": "auditrecord.v1",
  "auditRecord": {
    "tenantId": "acme",
    "createdAt": "2025-10-22T12:00:03.100Z",
    "action": "user.create",
    "resource": { "type": "Iam.User", "id": "U-1001" },
    "actor": { "id": "svc_ingress", "type": "Service" }
  }
}
JSON

Error Handling

Error Scenarios (gRPC ↔ HTTP analogy)

| gRPC Code | HTTP Analogy | Scenario | Recovery Action | Retry Strategy |
| --- | --- | --- | --- | --- |
| OK | 202 | Created or Duplicate | — | — |
| INVALID_ARGUMENT | 400 | Schema/shape/limits invalid | Fix per details | No retry until corrected |
| NOT_FOUND | 404 | Unknown service/method or tenant/feature disabled | Check route/tenant | No retry |
| ALREADY_EXISTS | 409 | Idempotency conflict (same key, different payload) | Use new key; reconcile | Do not retry with same key |
| UNAUTHENTICATED | 401 | Missing/invalid token | Acquire valid JWT | Retry after fix |
| PERMISSION_DENIED | 403 | Insufficient scope or tenant mismatch | Adjust perms/tenant | No retry until corrected |
| RESOURCE_EXHAUSTED | 429 | Rate limit/backpressure | Honor retry hints | Exponential backoff + jitter |
| UNAVAILABLE | 503 | Upstream unavailable / transient gateway error | Wait for recovery | Retry with same idempotency key |
| DEADLINE_EXCEEDED | 504 | Client/server deadline hit | Increase deadline if safe | Limited retries |
| INTERNAL | 500 | Unexpected server error | Open incident if persistent | Bounded retries with backoff |

Failure Modes

  • Metadata missing/uppercase: gRPC metadata keys must be lowercase; missing required keys → INVALID_ARGUMENT.
  • Clock skew: createdAt > now+2m → INVALID_ARGUMENT with field pointer.
  • Concurrent duplicates: distinct payload under same key → ALREADY_EXISTS.

Recovery Procedures

  1. For 4xx analogs (INVALID_ARGUMENT, PERMISSION_DENIED, ALREADY_EXISTS, NOT_FOUND) fix request/config before retry.
  2. For RESOURCE_EXHAUSTED/UNAVAILABLE/DEADLINE_EXCEEDED, backoff with jitter; reuse x-idempotency-key.
  3. Log/propagate traceid from trailers for correlation.

Performance Characteristics

Latency Expectations

  • P50: 10–20 ms
  • P95: ≤ 40 ms
  • P99: ≤ 75 ms

Throughput Limits

  • Per connection: hundreds of concurrent streams (HTTP/2)
  • Per tenant: baseline 1k rps sustained, burst 4k rps (edition dependent)
  • Global: scales linearly with Gateway instances

Resource Requirements

  • Persistent HTTP/2 channels; tune client pool size and max streams per connection.

Scaling Considerations

  • Horizontal scale Gateway on RPS/p95; shard by tenant/region.
  • Configure server and client receive/send message size caps (≤ 256 KiB).

Security & Compliance

Authentication

  • authorization metadata with OIDC JWT; short-lived (≤ 15m), leeway ±60s; optional mTLS for extra assurance.

Authorization

  • Require audit:append scoped to x-tenant-id; Gateway enforces RBAC/ABAC.

Data Protection

  • TLS 1.2+; no sensitive data in logs; redaction/classification applied by Ingestion before persist.

Compliance

  • Producer identity, idempotency key hash, and decision trail logged; aligns with privacy/PII policies.

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
| --- | --- | --- | --- |
| grpc_server_started_total | counter | Calls started | Anomaly detection |
| grpc_server_handled_total{code} | counter | Calls by status code | Any 5xx; spikes in INVALID_ARGUMENT |
| grpc_server_handling_seconds | histogram | Latency | p95 > 40 ms |
| rate_limited_total | counter | RESOURCE_EXHAUSTED | > 5% sustained |

Logging Requirements

  • Structured logs: tenant, traceId, idempotencyKey (hashed), grpc.code, latencyMs; omit payload bodies.

Distributed Tracing

  • Map traceparent to gRPC context; spans: gateway.authz, ingestion.append. Include attributes tenant, payloadBytes.

Health Checks

  • Liveness: process/thread; Readiness: JWKS reachability, upstream Ingestion healthy.

Operational Procedures

Deployment

  1. Enable gRPC port/route under flag ingest.grpc.enabled=false.
  2. Smoke test with signed JWT and minimal payload; verify OK and trailers.
  3. Gradually raise per-tenant limits; observe grpc_server_handled_total{code!="OK"}.

Configuration

  • Gateway: JWKS URI, audiences, rate limits, max recv/send message bytes, allowed metadata keys/size.
  • Client: channel pool size, per-call deadline (e.g., 2s), retry policy (UNAVAILABLE, RESOURCE_EXHAUSTED).

Maintenance

  • Rotate JWKS/keys; monitor token validation failures; tune backoff policies.

Troubleshooting

  • Many INVALID_ARGUMENT → inspect validation pointers; schema drift.
  • Many UNAVAILABLE → upstream health; check saturation.
  • Frequent ALREADY_EXISTS → idempotency key collisions—fix client keying.

Testing Scenarios

Happy Path Tests

  • Valid Append returns OK with status:"Created" and auditRecordId.
  • Duplicate x-idempotency-key returns OK with status:"Duplicate".

Error Path Tests

  • Missing x-tenant-id → INVALID_ARGUMENT.
  • Unknown method/route → NOT_FOUND.
  • Conflicting idempotency payload → ALREADY_EXISTS.
  • Rate limit → RESOURCE_EXHAUSTED with retry backoff honored.

Performance Tests

  • Sustain 1k rps/tenant with p95 ≤ 40 ms.
  • Connection reuse across 10k calls without reconnect churn.

Security Tests

  • JWT expiration/audience enforced.
  • Tenant mismatch (metadata vs body) → PERMISSION_DENIED.
  • Trace propagation verified end-to-end.

Internal References

  • Standard Audit Record Ingestion Flow
  • Retry Flow
  • Distributed Tracing Flow

External References

  • gRPC Status Codes guide
  • W3C Trace Context

Appendices

A. C# gRPC code-first contract (protobuf-net.Grpc style)

using System.ServiceModel;
using ProtoBuf.Grpc;
using ProtoBuf.Grpc.Configuration;

[Service]
public interface IAuditIngestionService
{
    [Operation]
    Task<AppendReply> AppendAsync(AppendRequest request, CallContext context = default);
}

public sealed class AppendRequest
{
    public string SchemaVersion { get; set; } = "auditrecord.v1";
    public AuditRecord AuditRecord { get; set; } = default!;
}

public sealed class AppendReply
{
    public string AuditRecordId { get; set; } = default!;
    public string Status { get; set; } = "Created"; // or "Duplicate"
    public DateTimeOffset ObservedAt { get; set; }
}

B. C# client stub usage (metadata mapping)

var channel = GrpcChannel.ForAddress("https://api.atp.example");
var client  = channel.CreateGrpcService<IAuditIngestionService>();

var headers = new Metadata {
    { "authorization", $"Bearer {token}" },
    { "x-tenant-id", "acme" },
    { "traceparent", traceparent },
    { "x-idempotency-key", "acme-ord-9981-v1" }
};

var ctx = new CallContext(new CallOptions(headers: headers, deadline: DateTime.UtcNow.AddSeconds(2)));

var reply = await client.AppendAsync(new AppendRequest {
    SchemaVersion = "auditrecord.v1",
    AuditRecord = record
}, ctx);
Suggested client retry policy:

  • Retry on: UNAVAILABLE, RESOURCE_EXHAUSTED, DEADLINE_EXCEEDED
  • Backoff: exponential (base 100 ms, max 5 s), 20% jitter
  • Max attempts: 5
  • Reuse the same x-idempotency-key across attempts
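The backoff schedule above can be sketched independently of language; an illustrative Python version (function name and structure are assumptions):

```python
import random

RETRYABLE = {"UNAVAILABLE", "RESOURCE_EXHAUSTED", "DEADLINE_EXCEEDED"}

def backoff_ms(attempt, base=100.0, cap=5000.0, jitter=0.20, rng=random):
    """Delay before retry `attempt` (0-based): exponential, capped, jittered."""
    delay = min(cap, base * (2 ** attempt))       # 100, 200, 400, ... up to 5000
    return delay * rng.uniform(1 - jitter, 1 + jitter)

# Undecorated schedule (jitter disabled) for the 5 allowed attempts:
print([backoff_ms(a, jitter=0) for a in range(5)])
# [100.0, 200.0, 400.0, 800.0, 1600.0]
```

Only codes in `RETRYABLE` should trigger this loop, and every attempt must carry the same `x-idempotency-key` so retries dedupe to one record.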

Service Bus (MassTransit) Ingestion Flow

Asynchronous ingestion path using the Outbox → Bus → Inbox pattern. A producer writes to its own Outbox in the same transaction as its business change; an Outbox Dispatcher publishes to the MassTransit bus. The Ingestion Consumer reads messages, performs validation/canonicalization, applies dedupe/idempotency, appends to the WORM store, and emits AuditRecord.Accepted. Poison messages are routed to a DLQ with reprocess tooling.


Overview

Purpose: Provide a resilient, high-throughput async ingestion path with exactly-once effects (at-least-once delivery + idempotent consumer).
Scope: Producer outbox semantics, bus delivery (MassTransit), consumer inbox/deduplication, retry/backoff, DLQ handling, and operational reprocessing. Excludes REST/gRPC transports and batch presigned uploads.
Context: Recommended for internal microservices and partner pipelines that already publish domain events.
Key Participants:

  • Producer Service (business txn + Outbox write)
  • Outbox Dispatcher (background publisher)
  • Message Bus (MassTransit over RabbitMQ/Azure SB/Kafka)
  • Ingestion Consumer (MassTransit consumer)
  • Idempotency Store (consumer-inbox/dedupe keys)
  • Storage (Authoritative) (append-only WORM)
  • DLQ / Error Queue (quarantine and reprocess)

Prerequisites

System Requirements

  • MassTransit configured with a supported broker and durable queues/topics
  • Producer DB migration includes Outbox table (append-only)
  • Ingestion Consumer has Idempotency/Inbox store (e.g., table or cache)
  • Network connectivity Producer ↔ Broker ↔ Ingestion; TLS enabled end-to-end

Business Requirements

  • Tenants provisioned; routing keys/partitions defined per tenant
  • Policy/retention/classification configured (used by Ingestion)
  • DLQ retention meets compliance requirements

Performance Requirements

  • Producer Outbox dispatch interval (poll/batch size) tuned for target throughput
  • Consumer prefetch/concurrency tuned; p95 end-to-append ≤ 100 ms under load
  • Broker quotas/partitions sized for expected peak

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    participant Prod as Producer Service
    participant DB as Producer DB + Outbox
    participant Disp as Outbox Dispatcher
    participant Bus as Message Bus (MassTransit)
    participant Cons as Ingestion Consumer
    participant Idem as Idempotency Store
    participant Store as Storage (Authoritative)

    Prod->>DB: BEGIN TX: business change + INSERT Outbox{Message, IdempotencyKey, Tenant, Trace}
    DB-->>Prod: COMMIT

    Disp->>DB: Poll Outbox (unpublished rows)
    Disp->>Bus: Publish AuditRecordEnvelope (MessageId, CorrelationId, headers)
    Bus-->>Disp: Ack (broker)

    Bus-->>Cons: Deliver message
    Cons->>Idem: Check/put(idempotencyKey) // atomic get-or-create
    alt First delivery
        Cons->>Cons: Validate + canonicalize + policy/classification/redaction
        Cons->>Store: INSERT canonical record (WORM)
        Store-->>Cons: ack {auditRecordId}
        Cons->>Idem: Mark completed(auditRecordId)
    else Duplicate
        Idem-->>Cons: already completed
        Cons->>Cons: Skip side effects, ack broker
    end
Hold "Alt" / "Option" to enable pan & zoom

Alternative Paths

  • Transactional Outbox (in-proc): Outbox insert is in the same DB transaction as business write (recommended).
  • Partition affinity: Route by tenantId (or resourceId) to guarantee in-order delivery per key.
  • Saga assistance: Optional MassTransit saga can coordinate multi-message batches or ensure exactly-one finalization event per batch.
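The dispatcher half of the flow above reduces to a poll–publish–mark loop. A minimal sketch with in-memory stand-ins (names and structures are illustrative, not the actual Outbox schema):

```python
import itertools

# Outbox rows: a row is "pending" until published_at is stamped.
outbox = [
    {"id": 1, "message": "evt-a", "published_at": None},
    {"id": 2, "message": "evt-b", "published_at": None},
]
published = []
clock = itertools.count(100)  # fake timestamps

def publish(msg):
    """Stand-in for bus.Publish; True represents the broker ack."""
    published.append(msg)
    return True

def dispatch_once(batch=500):
    # Poll unpublished rows; stamp published_at only AFTER the broker ack.
    # Rows are never deleted pre-ack, so a crash between publish and stamp
    # only causes a redelivery, which the consumer's inbox dedupes.
    for row in [r for r in outbox if r["published_at"] is None][:batch]:
        if publish(row["message"]):
            row["published_at"] = next(clock)

dispatch_once()
print(published)                               # ['evt-a', 'evt-b']
print(all(r["published_at"] for r in outbox))  # True
```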

Error Paths

sequenceDiagram
    participant Disp as Outbox Dispatcher
    participant Bus as Message Bus
    participant Cons as Ingestion Consumer
    participant DLQ as Dead Letter Queue

    Disp->>Bus: Publish
    alt Broker unavailable
        Bus-->>Disp: nack/exception
        Disp->>Disp: Retry with exponential backoff, do not delete Outbox row
    end

    Bus-->>Cons: Deliver message
    alt Validation fails (poison message)
        Cons-->>Bus: reject (no requeue)
        Bus-->>DLQ: route
    else Transient error (Storage 503)
        Cons-->>Bus: nack (requeue)
        Bus->>Cons: redeliver with backoff
    end
Hold "Alt" / "Option" to enable pan & zoom

Request/Response Specifications

This flow is message-based. The message contract and headers are the stable surface. Operational HTTP endpoints (status, reprocess) are listed for completeness.

Input Requirements (message contract)

| Field | Type | Required | Description | Validation |
| --- | --- | --- | --- | --- |
| MessageId | GUID/ULID | Y | Broker message id | Generated by bus |
| CorrelationId | GUID/ULID | Y | Correlates with trace/saga | Present |
| IdempotencyKey | string | Y | Stable key per producer event | ≤128 ASCII |
| TenantId | string | Y | Tenant scope | Header & body match |
| Traceparent | string | Y | W3C trace context | 55-char |
| SchemaVersion | string | Y | auditrecord.v1 | Known |
| AuditRecord | object | Y | Canonical fields | ≤ 256 KiB after serialization |

Recommended headers (MassTransit)

  • tenant-id, traceparent, idempotency-key, schema-version, content-type=application/json

Output Specifications

| Field | Type | Description | Notes |
| --- | --- | --- | --- |
| AuditRecord.Accepted | event | Downstream event from Storage | Async |
| Consumer ack | broker ack | Successful handle | Commit offset / ack message |
| DLQ message | broker dead-letter | On poison/MaxRetry exceeded | Inspect & reprocess |

Example Message (Envelope)

{
  "SchemaVersion": "auditrecord.v1",
  "IdempotencyKey": "acme:order#9981:v1",
  "TenantId": "acme",
  "AuditRecord": {
    "tenantId": "acme",
    "createdAt": "2025-10-22T12:00:03.100Z",
    "action": "user.create",
    "resource": { "type": "Iam.User", "id": "U-1001" },
    "actor": { "id": "svc_billing", "type": "Service" }
  }
}

Error Handling

Error Scenarios (bus & ops APIs)

| Code/Outcome | Scenario | Recovery Action | Retry Strategy |
| --- | --- | --- | --- |
| INVALID (poison) → DLQ | Schema/shape invalid at consumer | Quarantine; fix mapper or data | Reprocess after fix |
| Requeue | Storage/Policy transient failure | Backoff & retry | Exponential backoff + jitter |
| Duplicate (idempotent skip) | IdempotencyKey already completed | No action | Ack immediately |
| 400 Bad Request (ops API) | Bad reprocess/status request | Correct request | No retry until fixed |
| 404 Not Found (ops API) | Unknown batch/msgId/tenant | Verify identifiers | — |
| 409 Conflict (ops API) | Reprocess while job active | Wait & retry | After completion |
| 503 Service Unavailable | Broker or Storage outage | Wait for recovery | Bounded backoff, circuit-breaker |

Failure Modes

  • Outbox row deletion before publish: never delete until broker ack; use “published_at IS NOT NULL” marker.
  • Inbox/idempotency race: ensure atomic get-or-create; use unique index on (TenantId, IdempotencyKey).
  • Re-delivery storm: cap retries; move to DLQ after N attempts.
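The atomic get-or-create over a unique `(TenantId, IdempotencyKey)` index can be sketched with SQLite as a stand-in (table and helper names are assumptions, not the platform's actual inbox schema):

```python
import sqlite3

# Consumer inbox: the unique index makes get-or-create atomic. The first
# delivery wins the INSERT and performs side effects; any redelivery hits
# the constraint and is acked as a duplicate with no side effects.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Inbox (
        TenantId       TEXT NOT NULL,
        IdempotencyKey TEXT NOT NULL,
        AuditRecordId  TEXT,
        UNIQUE (TenantId, IdempotencyKey)
    )""")

def handle(tenant, key):
    try:
        conn.execute(
            "INSERT INTO Inbox (TenantId, IdempotencyKey) VALUES (?, ?)",
            (tenant, key))
    except sqlite3.IntegrityError:
        return "Duplicate"  # already seen: skip side effects, ack broker
    # First delivery: validate, append to WORM store, then mark completed.
    conn.execute(
        "UPDATE Inbox SET AuditRecordId = ? WHERE TenantId = ? AND IdempotencyKey = ?",
        ("01JEB0V2G7NY5T6Q9KX3M4C8AP", tenant, key))
    return "Created"

print(handle("acme", "acme:order#9981:v1"))  # Created
print(handle("acme", "acme:order#9981:v1"))  # Duplicate
```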

Recovery Procedures

  1. Inspect DLQ; download sample and Problem details if present.
  2. Patch mapper/policy or data; use reprocess API/command to move back to primary queue.
  3. For stuck Outbox rows, resume dispatcher (no manual delete).

Performance Characteristics

Latency Expectations

  • Outbox write: ~1–2 ms (in-proc tx)
  • Dispatch to broker: sub-10 ms typical
  • Consume → append: p95 ≤ 100 ms steady state

Throughput Limits

  • Producer: controlled by Outbox polling batch size (e.g., 500) and dispatch concurrency.
  • Consumer: controlled by prefetch (e.g., 256) and concurrency (e.g., 8–32).
  • Broker: ensure partitions/queues per tenant or shard.

Resource Requirements

  • Producer DB IOPS for Outbox; Consumer CPU for JSON + hashing; Idempotency store write IOPS.

Scaling Considerations

  • Scale by queue/partition per tenant/shard; increase consumer count.
  • Use bulk publish from dispatcher; avoid tiny batches.

Security & Compliance

Authentication

  • Broker auth via username/secret/SAS; TLS enabled. MassTransit transport credentials stored securely.

Authorization

  • Topic/queue ACLs restrict producers/consumers to tenant-scoped routes.

Data Protection

  • Message payloads encrypted on the wire (TLS); sensitive attributes redacted by Ingestion before persist.

Compliance

  • Retain DLQ items per policy; operations on DLQ are audited (who/when reprocessed or purged).

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| outbox_rows_pending | gauge | Unpublished rows | Growth > 3× baseline |
| dispatcher_publish_rate | counter | Messages/sec to broker | Drop vs ingest |
| consumer_lag | gauge | Backlog size/age | Age > 60 s |
| consumer_retry_total | counter | Redeliveries | Spike indicates transient failures |
| dlq_messages_total | counter | DLQ count | > 0 sustained |

Logging Requirements

  • Include tenant, messageId, idempotencyKey (hashed), deliveryAttempt, and decision of DLQ vs retry; never log full payloads.
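One way to log the idempotency key hashed, as required above, is a short SHA-256 digest; this sketch assumes .NET 5+ (`SHA256.HashData`, `Convert.ToHexString`):

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class LogSafe
{
    // Hash the idempotency key so logs can correlate retries without leaking the raw value.
    public static string HashIdempotencyKey(string key)
    {
        var digest = SHA256.HashData(Encoding.UTF8.GetBytes(key));
        return Convert.ToHexString(digest)[..16]; // 64-bit hex prefix is enough for correlation
    }
}
```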

Distributed Tracing

  • Propagate traceparent via message headers; spans: outbox.enqueue, dispatcher.publish, consumer.handle, storage.append.

Health Checks

  • Producer: DB + broker connectivity; Consumer: broker + Storage/Idempotency write access.

Operational Procedures

Deployment

  1. Migrate Producer DB to add Outbox table; enable MassTransit outbox middleware.
  2. Deploy Ingestion Consumer with inbox/idempotency enabled (unique key index).
  3. Create queues/topics, bindings, and DLQ; enable TLS and ACLs.

Configuration

  • Producer: OutboxPollIntervalMs, OutboxBatchSize, broker connection, TLS certs.
  • Consumer: PrefetchCount, ConcurrentMessageLimit, retry policy (incremental/exponential), idempotency TTL.
  • Routing: exchange/topic per tenantId or shard key.
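The consumer settings above might be wired in MassTransit roughly like this. RabbitMQ transport and the `AuditRecordEnvelopeConsumer` name are assumptions; queue name and numeric values are illustrative:

```csharp
using MassTransit;

services.AddMassTransit(x =>
{
    x.AddConsumer<AuditRecordEnvelopeConsumer>();          // assumed consumer type
    x.UsingRabbitMq((context, cfg) =>
    {
        cfg.ReceiveEndpoint("audit.ingest", e =>
        {
            e.PrefetchCount = 256;                          // broker prefetch
            e.ConcurrentMessageLimit = 16;                  // in-process concurrency
            e.UseMessageRetry(r =>
                r.Exponential(5,                            // retry limit before DLQ
                    TimeSpan.FromMilliseconds(200),         // min interval
                    TimeSpan.FromSeconds(30),               // max interval
                    TimeSpan.FromMilliseconds(500)));       // interval delta
            e.ConfigureConsumer<AuditRecordEnvelopeConsumer>(context);
        });
    });
});
```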

Maintenance

  • Purge published Outbox rows by retention (based on published_at).
  • DLQ review and reprocess runs; archive old DLQ messages per policy.

Troubleshooting

  • Rising outbox_rows_pending → broker unreachable or dispatch stalled.
  • Spiking consumer_retry_total → investigate Storage/Policy health.
  • Many duplicates → check idempotency unique index and key construction.

Testing Scenarios

Happy Path Tests

  • Business txn writes Outbox; Dispatcher publishes; Consumer appends; Accepted observed.
  • Duplicate delivery skipped via idempotency store.

Error Path Tests

  • Poison message → DLQ; reprocess after fix returns success.
  • Broker outage → Outbox retains; auto-catchup after recovery.
  • Ops API: 400 bad reprocess request; 404 unknown message; 409 reprocess job already running.

Performance Tests

  • Validate throughput at target RPS with prefetch/concurrency sweeps.
  • Backpressure behavior under Storage throttling.

Security Tests

  • Tenant isolation via routing and ACLs.
  • TLS enforcement; credentials rotation without downtime.

Internal References

  • Orleans Actor Ingestion Flow

External References

  • MassTransit Outbox/Inbox docs for chosen transport
  • Broker-specific DLQ and retry policies

Appendices

A. Producer Outbox table (example)

CREATE TABLE Outbox (
  Id            bigint IDENTITY PRIMARY KEY,
  MessageId     uniqueidentifier NOT NULL,
  IdempotencyKey nvarchar(128) NOT NULL,
  TenantId      nvarchar(128) NOT NULL,
  Body          varbinary(max) NOT NULL,
  Traceparent   nvarchar(64) NULL,
  CreatedAt     datetime2 NOT NULL DEFAULT sysutcdatetime(),
  PublishedAt   datetime2 NULL
);
CREATE UNIQUE INDEX UX_Outbox_Idempotency ON Outbox (TenantId, IdempotencyKey);

B. Consumer Idempotency (Inbox) table (example)

CREATE TABLE ConsumerInbox (
  TenantId        nvarchar(128) NOT NULL,
  IdempotencyKey  nvarchar(128) NOT NULL,
  CompletedAt     datetime2 NULL,
  AuditRecordId   char(26) NULL, -- ULID
  PRIMARY KEY (TenantId, IdempotencyKey)
);

C. C# Contracts (MassTransit)

public record AuditRecordEnvelope(
    string SchemaVersion,
    string IdempotencyKey,
    string TenantId,
    AuditRecord AuditRecord
);
// Configure send
cfg.Message<AuditRecordEnvelope>(x => x.SetEntityName("audit.ingest"));
cfg.Send<AuditRecordEnvelope>(x => {
    x.UseRoutingKeyFormatter(ctx => ctx.Message.TenantId);
});

Orleans Actor Ingestion Flow

Actor-to-actor ingestion path using Microsoft Orleans. A producer Grain invokes an Ingestion Grain with an AuditRecord and context (tenant, traceparent, idempotencyKey). The Ingestion Grain enforces at-least-once delivery with idempotent effects, appends to the WORM store, and returns an AppendResult. Notes cover activation, placement, and reentrancy to achieve high concurrency without duplication.


Overview

Purpose: Provide a low-latency, in-cluster ingestion path that preserves actor semantics and ordering guarantees per key.
Scope: Orleans grain contract, RequestContext propagation, idempotency/inbox, storage append, reentrancy, activation/placement, and failure handling including DLQ for poison messages. Excludes REST/gRPC and external bus transports.
Context: Used by actor-based services already running on Orleans (e.g., domain aggregates or workflow grains); per-tenant or per-resource sharding maps naturally to grain keys.
Key Participants:

  • Producer Grain (domain actor generating audit facts)
  • Ingestion Grain (IAuditIngestionGrain) — validates, canonicalizes, dedupes, appends
  • Idempotency/Inbox Store — per-grain dedupe table or grain state
  • Storage (Authoritative) — append-only WORM store
  • DLQ (optional) — for poison inputs when configured

Prerequisites

System Requirements

  • Orleans cluster healthy (silos, membership, reminders/timers)
  • RequestContext propagation enabled between grains
  • Ingestion Grain type registered; access to Storage and Idempotency store
  • TLS/mTLS for silo-to-silo traffic if crossing nodes/regions

Business Requirements

  • Tenants configured; placement strategy keyed by (tenantId[, shard])
  • Policy/retention/classification active for tenant
  • DLQ or operator alerting policy defined for poison records

Performance Requirements

  • Ingestion Grain reentrancy policy chosen (see below) and tested at target RPS
  • Per-grain mailboxes sized; throughput meets ingest parity
  • Idempotency lookup p95 ≤ 5 ms (local state or fast store)

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    participant Producer as Producer Grain
    participant Ing as Ingestion Grain (IAuditIngestionGrain)
    participant Inbox as Idempotency/Inbox Store
    participant Store as Storage (Authoritative)

    Producer->>Ing: Append(auditRecord, idempotencyKey)<br/>(RequestContext: tenant, traceparent)
    Ing->>Ing: Validate + canonicalize + policy/classification/redaction
    Ing->>Inbox: GetOrPut(tenant,idempotencyKey)
    alt First delivery
        Ing->>Store: INSERT canonical record (WORM)
        Store-->>Ing: ack {auditRecordId}
        Ing->>Inbox: MarkCompleted(auditRecordId)
        Ing-->>Producer: AppendResult {auditRecordId, status:"Created"}
    else Duplicate
        Inbox-->>Ing: Found Completed(auditRecordId)
        Ing-->>Producer: AppendResult {auditRecordId, status:"Duplicate"}
    end

Alternative Paths

  • Per-tenant placement: IAuditIngestionGrain keys on tenantId (or (tenantId, shard)), preserving ordering within the key while allowing horizontal scale across tenants/shards.
  • Local persistent state inbox: Use Orleans PersistentState within the grain for fastest dedupe; or external table if cross-language consumers also write.
  • Reentrant grain: Enable reentrancy to allow concurrent requests sharing the same trace id/group; protect critical sections (idempotency write + store append) with coarse-grained serialization.

Error Paths

sequenceDiagram
    participant Ing as Ingestion Grain
    participant Store as Storage
    participant Inbox as Idempotency/Inbox

    Ing->>Store: INSERT
    alt Storage transient
        Store-->>Ing: throws transient
        Ing->>Ing: Retry with backoff, do not mark inbox completed
    else Validation failure (poison)
        Ing-->>Ing: throw ValidationException
        Ing->>Inbox: MarkFailed(optional) / emit DLQ if configured
    end

Request/Response Specifications

Input Requirements

| Field | Type | Required | Description | Validation |
|---|---|---|---|---|
| auditRecord | object | Y | Canonical AuditRecord | Data Model rules; ≤ 256 KiB |
| idempotencyKey | string | Y | Unique per submitted record | ≤ 128 ASCII |
| RequestContext["tenant-id"] | string | Y | Tenant routing | Must match auditRecord.tenantId |
| RequestContext["traceparent"] | string | Y | W3C context | 55-char format |
| RequestContext["schema-version"] | string | Y | Contract version | Known & active |

Output Specifications

| Field | Type | Description | Notes |
|---|---|---|---|
| AppendResult.auditRecordId | ULID | Durable id | Assigned or echoed |
| AppendResult.status | enum | Created or Duplicate | Idempotent outcome |
| AppendResult.observedAt | timestamp | Ingestion observation | ms precision |

Example Grain Contract (C#)

public interface IAuditIngestionGrain : IGrainWithStringKey
{
    Task<AppendResult> Append(AuditRecord record, string idempotencyKey);
}

public sealed record AppendResult(string AuditRecordId, string Status, DateTimeOffset ObservedAt);

Producer call

RequestContext.Set("tenant-id", tenantId);
RequestContext.Set("traceparent", traceparent);
RequestContext.Set("schema-version", "auditrecord.v1");

var grain = GrainFactory.GetGrain<IAuditIngestionGrain>(tenantId); // or $"{tenantId}:{shard}"
var result = await grain.Append(record, idempotencyKey);

Error Handling

Error Scenarios (Orleans ↔ HTTP analogy)

| Orleans Exception/Outcome | HTTP Analogy | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|---|
| — (OK) | 202 Accepted | Created/Duplicate | — | — |
| ArgumentException / validation error | 400 Bad Request | Schema/shape/limits invalid | Fix payload | No retry until corrected |
| GrainReferenceNotFoundException / unknown key | 404 Not Found | Wrong grain key/tenant or disabled feature | Check routing/tenant | No retry |
| IdempotencyConflictException | 409 Conflict | Same key, different payload | Use a new key; reconcile | Do not retry with same key |
| OrleansException with IsTransient | 503 Service Unavailable | Store or infra transient | Backoff & retry | Exponential backoff + jitter |
| TimeoutException | 504 Gateway Timeout | Grain busy or network stall | Increase timeout if safe | Limited retries |

Failure Modes

  • Reentrancy hazard: racing requests with same key—protect with atomic GetOrPut in inbox and serialize append section.
  • Activation churn: hotspot tenants cause frequent activations; use sticky placement and activation warmup.
  • Poison record: repeated validation failures—optionally route to DLQ or mark Failed in inbox for operator review.
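The atomic GetOrPut guarding against the reentrancy hazard can be shown as an in-memory reference implementation of the semantics; a production store would get the same atomicity from the (TenantId, IdempotencyKey) primary key on the inbox table. Type and member names here are illustrative, not an existing API:

```csharp
using System.Collections.Concurrent;

public sealed class InMemoryInbox
{
    private sealed record Entry(bool Completed, string? AuditRecordId);
    private readonly ConcurrentDictionary<(string Tenant, string Key), Entry> _rows = new();

    // Returns First=true exactly once per (tenant, key); duplicates see the recorded id, if any.
    public (bool First, string? ExistingId) GetOrPut(string tenantId, string key)
    {
        if (_rows.TryAdd((tenantId, key), new Entry(false, null)))
            return (true, null);                       // this delivery won the race
        var existing = _rows[(tenantId, key)];
        return (false, existing.AuditRecordId);        // duplicate; may still be Pending
    }

    public void MarkCompleted(string tenantId, string key, string auditRecordId)
        => _rows[(tenantId, key)] = new Entry(true, auditRecordId);
}
```

`TryAdd` makes the insert-or-observe step a single atomic operation, which is the property the SQL variant must also provide.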

Recovery Procedures

  1. For transients, retry with jitter; maintain idempotency key.
  2. For conflict, choose canonical payload and re-attempt with a new key if necessary.
  3. For poison, capture Problem details and trigger operator workflow or DLQ.

Performance Characteristics

Latency Expectations

  • P50: 5–15 ms
  • P95: ≤ 35 ms
  • P99: ≤ 75 ms

Throughput Limits

  • Single ingestion grain: thousands of req/s with reentrancy on and critical section minimized.
  • Cluster throughput scales linearly with # of silos × # of shards/tenants.

Resource Requirements

  • CPU for JSON parse/hash; memory for small inbox state.
  • Low storage write IOPS per grain; batch commits optional if available in store client.

Scaling Considerations

  • Placement: Prefer hash-based placement by (tenantId[, shard]).
  • Reentrancy: Enable grain reentrancy; serialize only the idempotency + append critical section.
  • Backpressure: Use Orleans.Concurrency.Limit or custom queue length monitors to shed load gracefully.

Security & Compliance

Authentication

  • Internal cluster auth (mTLS/IPSec as required); producer identity derived from grain identity and/or tokens in RequestContext if crossing trust boundaries.

Authorization

  • Validate tenant-id context matches auditRecord.tenantId; enforce RBAC/ABAC as needed for cross-tenant actors.

Data Protection

  • No sensitive data in logs; redaction/classification applied before persist.

Compliance

  • Append operations recorded with tenant, grainKey, idempotencyKey (hashed), and traceId.

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| orleans_grain_queue_length | gauge | Mailbox depth per ingestion grain | Sustained growth |
| ingestion_append_latency_ms | histogram | Grain handle latency | p95 > 35 ms |
| inbox_getorput_latency_ms | histogram | Idempotency lookup time | p95 > 5 ms |
| idempotent_duplicates_total | counter | Duplicate skips | Track trend |
| orleans_activations_total | counter | Activations of ingestion grains | Unexpected spikes |

Logging Requirements

  • Structured logs: tenant, grainKey, traceId, idempotencyKey (hashed), outcome (Created|Duplicate|Failed).

Distributed Tracing

  • Carry traceparent in RequestContext; spans: grain.append, inbox.check, storage.append; include tenant, grainKey.

Health Checks

  • Silo membership stable; storage reachable; inbox store latency under thresholds.

Operational Procedures

Deployment

  1. Register IAuditIngestionGrain and storage/idempotency providers; deploy silos.
  2. Warm hot-tenant grains (pre-activation) to reduce cold-start latency.
  3. Validate end-to-end append and idempotency in non-prod.

Configuration

  • Reentrancy: [Reentrant] attribute or runtime config as appropriate.
  • Placement: consistent hashing or custom placement by tenant.
  • Timeouts/Retries: client call timeouts (e.g., 2s) and retry policies for transient exceptions.

Maintenance

  • Monitor inbox state growth; compact or TTL-complete entries older than dedupe window.
  • Rotate cluster certs/keys if mTLS in use.

Troubleshooting

  • Many TimeoutExceptions → check reentrancy, queue length, storage latency.
  • Frequent IdempotencyConflictException → investigate client keying logic.
  • Activation spikes → adjust placement/keep-alive or increase silos.

Testing Scenarios

Happy Path Tests

  • Append returns Created with auditRecordId.
  • Second call with same idempotencyKey returns Duplicate without extra writes.

Error Path Tests

  • Validation error → 400 analog (ArgumentException), not persisted.
  • Unknown grain key/disabled tenant → 404 analog.
  • Conflict on idempotency (different payload) → 409 analog.
  • Transient storage failure → retried then succeeds.

Performance Tests

  • Reentrancy on: sustain target RPS with p95 ≤ 35 ms.
  • Critical section profiling (inbox+append) shows minimal blocking.

Security Tests

  • tenant-id in RequestContext matches payload; mismatches rejected.
  • Trace propagation visible across grains and storage client.

Internal References

  • gRPC Service Ingestion Flow
  • Service Bus (MassTransit) Ingestion Flow
  • Retry Flow

External References

  • Orleans Docs: Grains, Persistence, Reentrancy, RequestContext

Appendices

A. Inbox table (if using external store)

CREATE TABLE IngestionInbox (
  TenantId        nvarchar(128) NOT NULL,
  IdempotencyKey  nvarchar(128) NOT NULL,
  Status          tinyint NOT NULL, -- 0=Pending,1=Completed,2=Failed
  AuditRecordId   char(26) NULL,
  UpdatedAt       datetime2 NOT NULL DEFAULT sysutcdatetime(),
  PRIMARY KEY (TenantId, IdempotencyKey)
);

B. Reentrancy pattern (C# sketch)

[Reentrant]
public class AuditIngestionGrain : Grain, IAuditIngestionGrain
{
    // _criticalSection (per-key async lock), _inbox (idempotency store), and _storage
    // (WORM append client) are injected collaborators, elided from this sketch.
    public async Task<AppendResult> Append(AuditRecord record, string key)
    {
        using var _ = await _criticalSection.EnterAsync(key); // narrow critical region
        var (first, existingId) = await _inbox.GetOrPutAsync(record.TenantId, key);
        if (!first) return new(existingId, "Duplicate", DateTimeOffset.UtcNow);

        var id = await _storage.AppendAsync(record); // may retry internally
        await _inbox.MarkCompletedAsync(record.TenantId, key, id);
        return new(id, "Created", DateTimeOffset.UtcNow);
    }
}

Tenant-Scoped Query Flow

Retrieves a tenant’s AuditEvents timeline via the Query Service through the API Gateway. Uses row-level security (RLS) / tenant validation, seek-based pagination (cursor over (createdAt,auditRecordId)), and returns X-Watermark and X-Lag headers indicating projection freshness.


Overview

Purpose: Provide a low-latency, read-optimized timeline of audit events for a single tenant with consistent ordering and efficient pagination.
Scope: Gateway authN/Z, tenant scoping (header/path), RLS enforcement in Read DB, timeline query, seek pagination, watermark/lag headers. Excludes full-text search (see Search flow) and on-read PII masking (covered in Data Redaction flow).
Context: Runs against the AuditEvents projection maintained by the Projection Service; consults the Checkpoint Store for the current watermark.
Key Participants:

  • Query Client (API consumer)
  • API Gateway (authN/Z, rate limiting, header normalization)
  • Query Service (query planning, pagination, response shaping)
  • Read DB (AuditEvents) (tenant-scoped projection with indexes & RLS)
  • Checkpoint Store (per-tenant watermark)
  • Cache (optional, key-scoped response caching)

Prerequisites

System Requirements

  • API Gateway reachable with TLS; JWKS configured for JWT validation
  • Query Service deployed with network access to Read DB & Checkpoint Store
  • Read DB has RLS policies enforcing tenantId on AuditEvents
  • Projection/Checkpoint up and healthy (watermark progressing)

Business Requirements

  • Tenant exists and is active; edition permits timeline queries
  • Data retention/visibility policies do not restrict requested window
  • If multi-region, tenant’s home region is routable by Gateway

Performance Requirements

  • p95 ≤ 150 ms for limit<=200 over hot partitions
  • Indexes on (tenantId, createdAt DESC, auditRecordId) present
  • Cache configured (optional) with safe TTL & keying by tenant + params

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    actor Client as Query Client
    participant GW as API Gateway
    participant Q as Query Service
    participant RDB as Read DB (AuditEvents + RLS)
    participant CKPT as Checkpoint Store
    participant Cache as Cache

    Client->>GW: GET /audit/v1/events?limit=100&cursor=... <br/> h:{Authorization,x-tenant-id,traceparent}
    Note right of GW: Validate JWT, tenant scope, rate-limit, normalize headers
    GW->>Q: Forward request + tenant context + traceparent
    Q->>CKPT: Read tenant watermark (offset,timestamp)
    alt Cache enabled and hit
        Q->>Cache: Lookup by {tenant, params}
        Cache-->>Q: Cached page + cursors
    else No cache / miss
        Q->>RDB: SELECT ... FROM AuditEvents WHERE tenantId=? AND (seek by cursor) ORDER BY createdAt DESC, auditRecordId DESC LIMIT N
        RDB-->>Q: rows, next/prev anchors
        Q->>Cache: Put page (optional TTL)
    end
    Q-->>GW: 200 JSON {items, nextCursor, prevCursor} + headers X-Watermark, X-Lag
    GW-->>Client: 200 OK
    Note over Client,RDB: Seek-based pagination avoids deep OFFSET scans

Alternative Paths

  • Time-bounded query: from/to timestamps narrow the scan before seek pagination.
  • Ascending order: order=asc for forward-in-time scans; cursors encode direction.
  • Head polling: Client uses If-None-Match: "wmk:<value>"; Query Service returns 304 Not Modified if X-Watermark unchanged.
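The seek-based pagination used above keys on the composite (createdAt, auditRecordId) anchor from the decoded cursor rather than OFFSET. A sketch of the descending page, with PostgreSQL-style parameters (adapt TOP/OFFSET-FETCH for SQL Server):

```csharp
// Descending seek page: the cursor supplies the (ts, id) anchor of the last row returned.
const string TimelinePageSql = @"
SELECT auditRecordId, createdAt, action, resource, actor
FROM   AuditEvents
WHERE  tenantId = @tenantId
  AND  (createdAt < @anchorTs
        OR (createdAt = @anchorTs AND auditRecordId < @anchorId))
ORDER BY createdAt DESC, auditRecordId DESC
LIMIT  @limit;
";
```

The tie-break on auditRecordId keeps ordering total when multiple rows share a createdAt millisecond, so pages never skip or repeat rows.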

Error Paths

sequenceDiagram
    actor Client
    participant GW as API Gateway
    participant Q as Query Service

    Client->>GW: GET /audit/v1/events?limit=5000&cursor=bad
    alt Invalid params / cursor parse fail
        GW-->>Client: 400 Bad Request (Problem+JSON)
    else Unknown tenant / route
        GW-->>Client: 404 Not Found (Problem+JSON)
    else Conflicting params (e.g., both cursor & page)
        GW-->>Client: 409 Conflict (Problem+JSON)
    else Unauthorized / Forbidden
        GW-->>Client: 401/403 (Problem+JSON)
    else Service backpressure / upstream down
        GW-->>Client: 429/503 (Retry-After)
    end

Request/Response Specifications

Input Requirements

| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| Method/Path | HTTP | Y | GET /audit/v1/events or /audit/v1/tenants/{tenantId}/events (timeline endpoint) | One of header or path must provide tenant |
| Authorization | header | Y | Bearer <JWT> | Valid signature, audience; not expired |
| x-tenant-id | header | Y* | Tenant scope (if not in path) | ^[A-Za-z0-9._-]{1,128}$ |
| traceparent | header | O | W3C trace context | 55-char format |
| limit | query | O | Max items per page | 1–1000, default 100 |
| cursor | query | O | Opaque base64url cursor (ts,id,dir) | Valid/owned by tenant |
| order | query | O | desc (default) or asc | enum |
| from/to | query | O | ISO-8601 UTC time bounds | from ≤ to, within retention |
| filter.resourceType | query | O | Optional type filter | matches known types |
| filter.actorId | query | O | Optional actor filter | ≤ 128 chars |

*Required unless tenant is in path.

Output Specifications

| Field | Type | Description | Notes |
|---|---|---|---|
| items[] | array | Page of timeline entries | Ordered by order |
| nextCursor | string? | Opaque cursor for next page | Omitted if no more |
| prevCursor | string? | Opaque cursor for reverse page | Omitted on first page |
| count | integer | Number of items in this page | ≤ limit |

Response Headers

  • X-Watermark: ISO-8601 UTC of latest committed projection timestamp for the tenant.
  • X-Lag: Seconds behind “now” (now - X-Watermark).
  • Cache-Control: typically no-store, max-age=0 (or short TTL if allowed).

Example Requests/Responses

Request

GET /audit/v1/events?limit=100&order=desc&from=2025-10-22T00:00:00Z HTTP/1.1
Host: api.atp.example
Authorization: Bearer eyJhbGciOi...
x-tenant-id: acme
traceparent: 00-9f0c1d2e3a4b5c6d7e8f9a0b1c2d3e4f-1111222233334444-01

200 OK

HTTP/1.1 200 OK
Content-Type: application/json
X-Watermark: 2025-10-22T12:03:05.120Z
X-Lag: 4.8
Cache-Control: no-store

{
  "items": [
    {
      "auditRecordId": "01JEC2A2V7N9M0X1Y2Z3A4B5C6",
      "createdAt": "2025-10-22T12:02:59.812Z",
      "action": "user.create",
      "resource": { "type": "Iam.User", "id": "U-1001" },
      "actor": { "id": "svc_ingress", "type": "Service", "display": "ingress-gw" },
      "decision": { "result": "Allow" }
    }
  ],
  "nextCursor": "eyJ0cyI6IjIwMjUtMTAtMjJUMTI6MDI6NTkuODEyWiIsImlkIjoiMDFK...IiwgImRpciI6ImRlc2MifQ",
  "prevCursor": null,
  "count": 1
}

400 Bad Request (invalid cursor)

{
  "type": "urn:connectsoft:errors/query/cursor.invalid",
  "title": "Invalid cursor",
  "status": 400,
  "detail": "Cursor is malformed or expired for this tenant.",
  "errors": [{ "pointer": "query.cursor", "reason": "malformed" }],
  "traceId": "9f0c1d2e3a4b5c6d7e8f9a0b1c2d3e4f"
}

Error Handling

Error Scenarios

| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Missing/invalid params; bad cursor; from>to; limit out of bounds | Correct request; regenerate cursor | No retry until fixed |
| 401 | Missing/invalid/expired JWT | Obtain valid token | Retry after renewal |
| 403 | Token not authorized for x-tenant-id | Request proper scope/role | No retry until fixed |
| 404 | Tenant or route not found; tenant disabled | Verify tenant/URL | No retry |
| 409 | Conflicting params (e.g., cursor with from/to not allowed) or cursor tenant mismatch | Remove conflict; obtain fresh cursor | Retry after fix |
| 429 | Rate limit / query backpressure | Backoff; respect Retry-After | Exponential backoff + jitter |
| 503 | Upstream (DB/checkpoint) unavailable | Wait for recovery | Retry with backoff |
| 304 | If-None-Match matched watermark | Use cached data | Re-poll later |

Failure Modes

  • Stale cursor after rebuild/compaction: server returns 409 with type: .../cursor.stale and a resyncFrom hint.
  • RLS misconfiguration: query returns 403/500; health checks should detect missing policy.
  • Watermark stale: X-Lag grows; alerting should trigger projector scaling.

Recovery Procedures

  1. On 409 cursor.stale, drop cursor and re-start from from=lastSeenTime.
  2. On 429/503, backoff with jitter; do not increase limit to compensate.
  3. If RLS errors occur, fail closed (no data) and escalate to operations.

Performance Characteristics

Latency Expectations

  • P50 ≤ 60 ms, P95 ≤ 150 ms, P99 ≤ 300 ms for limit≤200 over warm cache/index.

Throughput Limits

  • Per tenant: 200 rps sustained, 800 rps burst (configurable).
  • Global: scales with read replicas and cache hit rate.

Resource Requirements

  • Read DB IOPS proportional to limit and filter selectivity; ensure covering indexes.
  • Cache memory sized for hot cursors/pages if enabled.

Scaling Considerations

  • Add read replicas; shard by tenant.
  • Use index-only scans with narrow projections to reduce I/O.
  • Apply adaptive limit caps under load; enable result caching for hot ranges.

Security & Compliance

Authentication

  • OIDC JWT (short-lived), traceparent propagated; mTLS between Gateway ↔ Query Service (optional but recommended).

Authorization

  • Enforce audit:read:timeline scope; verify sub/tenant claims; apply DB-level RLS on tenantId.

Data Protection

  • Only minimal fields returned; no secret values.
  • X-Watermark reveals timing only; avoid leaking internal offsets.

Compliance

  • Access logged with tenantId, subject, filters, and watermark for auditability.

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| query_latency_ms{route="/audit/v1/events"} | histogram | End-to-end latency | p95 > 150 ms (5m) |
| timeline_results_count | histogram | Items per page | Sudden 0 across tenants |
| watermark_lag_seconds | gauge | now - watermark | > target (e.g., > 10 s) |
| query_rate_limited_total | counter | 429 responses | > 5% sustained |
| cursor_stale_total | counter | 409 due to stale/malformed cursor | Spike indicates rebuild issues |

Logging Requirements

  • Structured logs: tenantId, traceId, limit, order, from/to (if set), cursorHash, resultCount, watermark, lagSec. Do not log raw cursor tokens.

Distributed Tracing

  • Spans: query.parse, db.select.timeline, ckpt.read, cache.get/set.
  • Attributes: tenant, limit, order, hasCursor, rows, lagMs.

Health Checks

  • Readiness: DB + checkpoint reachable; RLS policy verified; index present.
  • Liveness: threadpool saturation, connection pool usage below thresholds.

Operational Procedures

Deployment

  1. Apply/verify AuditEvents schema & RLS policies in Read DB.
  2. Deploy Query Service behind Gateway route /audit/v1/events.
  3. Validate watermark propagation and X-Lag accuracy in staging.

Configuration

  • Env: QUERY_MAX_LIMIT, DEFAULT_LIMIT, CACHE_TTL_SECONDS, RLS_ENABLED=true.
  • Indexing: (tenantId, createdAt DESC, auditRecordId) plus optional partial indexes per tenant.

Maintenance

  • Periodic VACUUM/ANALYZE (SQL) or compaction (NoSQL).
  • Rotate JWT keys; update JWKS URL.
  • Monitor and refresh cache layer sizing.

Troubleshooting

  • High watermark_lag_seconds → check projector lag, search bulk backlog.
  • Many 409 (cursor.stale) → investigate projection rebuilds/compaction.
  • Slow queries → examine query plans; add/adjust indexes.

Testing Scenarios

Happy Path Tests

  • GET with valid x-tenant-id and limit=100 returns 200 with ordered items and X-Watermark/X-Lag.
  • nextCursor yields the next page; prevCursor navigates back without duplication.

Error Path Tests

  • 400 on malformed cursor / invalid limit / from>to.
  • 404 when tenant missing/disabled or route incorrect.
  • 409 when cursor used with disallowed params or tenant mismatch.
  • 429/503 trigger proper backoff behavior.

Performance Tests

  • p95 ≤ 150 ms for limit=200 under typical load.
  • Index-only scan verified via EXPLAIN plan.

Security Tests

  • JWT audience/scope enforced; RLS prevents cross-tenant leakage.
  • X-tenant-id header vs path tenant consistency enforced.

Internal References

  • Search Query Flow
  • Filtered Query Flow (policy/redaction on read)
  • Audit Record Projection Update Flow

External References

  • RFC 9110 (HTTP Semantics; obsoletes RFCs 7231/7233) for headers
  • W3C Trace Context (traceparent)

Appendices

A. Cursor Encoding (example)

cursor = base64url( JSON.stringify({ ts:"2025-10-22T12:02:59.812Z", id:"01JEC2A2V7N9M0X1Y2Z3A4B5C6", dir:"desc" }) )
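An equivalent sketch in C#, including the base64url padding handling the JS one-liner glosses over (`Cursor` and `CursorCodec` are illustrative names):

```csharp
using System;
using System.Text;
using System.Text.Json;

public sealed record Cursor(string Ts, string Id, string Dir);

public static class CursorCodec
{
    private static readonly JsonSerializerOptions Options =
        new() { PropertyNamingPolicy = JsonNamingPolicy.CamelCase, PropertyNameCaseInsensitive = true };

    public static string Encode(Cursor c)
    {
        var json = JsonSerializer.Serialize(c, Options);
        // Standard base64 → base64url: strip padding, swap '+'/'/' for '-'/'_'.
        return Convert.ToBase64String(Encoding.UTF8.GetBytes(json))
            .TrimEnd('=').Replace('+', '-').Replace('/', '_');
    }

    public static Cursor Decode(string token)
    {
        var b64 = token.Replace('-', '+').Replace('_', '/');
        b64 = b64.PadRight(b64.Length + (4 - b64.Length % 4) % 4, '='); // restore padding
        var json = Encoding.UTF8.GetString(Convert.FromBase64String(b64));
        return JsonSerializer.Deserialize<Cursor>(json, Options)
               ?? throw new FormatException("Malformed cursor");
    }
}
```

A decode failure here maps to the 400 cursor.invalid Problem response shown earlier.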

B. Example RLS Policy (PostgreSQL)

ALTER TABLE audit_events ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON audit_events
USING (tenant_id = current_setting('app.tenant_id')::text);
-- Set current_setting('app.tenant_id') per request in the DB session.
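Setting app.tenant_id per request from the Query Service might look like this; Npgsql is assumed as the driver, and `set_config(..., true)` scopes the value to the current transaction so RLS filters every statement inside it:

```csharp
using Npgsql;

public static class TenantSession
{
    // Begin a transaction with app.tenant_id bound for the RLS policy above.
    public static async Task<NpgsqlTransaction> BeginTenantScopedAsync(
        NpgsqlConnection conn, string tenantId)
    {
        var tx = await conn.BeginTransactionAsync();
        await using var cmd = new NpgsqlCommand(
            "SELECT set_config('app.tenant_id', @tenant, true)", conn, tx);
        cmd.Parameters.AddWithValue("tenant", tenantId);
        await cmd.ExecuteScalarAsync();
        return tx; // run the timeline SELECT inside this transaction, then commit
    }
}
```

Because the setting is transaction-local, pooled connections cannot leak one tenant's scope into another request.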

Search Query Flow

Full-text, facet, and type-ahead search over tenant-scoped indices. The Search Service executes per-tenant queries against a per-tenant alias (or filtered index), returning ranked results, facet aggregations, and optional suggest completions. Responses include X-Index-Watermark and X-Index-Lag to convey indexing freshness.


Overview

Purpose: Provide fast, flexible discovery of audit records using full-text, filters, facets, and suggesters.
Scope: Query parsing, tenant isolation via alias/filter, facet execution, pagination, highlights, and freshness reporting. Excludes authoritative reads (timeline) and export; on-read masking follows redaction policy.
Context: Operates on the Search Index projection populated by the Projection Service; eventual consistency vs. authoritative store is expected.
Key Participants:

  • Search Client (API consumer)
  • Search Service (query planner/executor)
  • Search Engine (per-tenant indices/aliases)
  • Checkpoint Store (optional: index watermark)
  • Cache (optional: hot query caching)

Prerequisites

System Requirements

  • Search cluster reachable with TLS; per-tenant indices/aliases created
  • Search Service has network access and service account with read permissions
  • Projection → Index pipeline healthy (indexers running)

Business Requirements

  • Tenant has Search edition/feature enabled
  • Data minimization and on-read masking rules configured for Search documents
  • Retention and residency policies applied to search indices

Performance Requirements

  • p95 query latency ≤ 200 ms for size ≤ 50 and modest facets
  • Cluster capacity sized for QPS and aggregation workload
  • Index freshness SLO: p95 ≤ 10 s Accept→Indexed

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    actor Client as Search Client
    participant Svc as Search Service
    participant Engine as Search Engine (Tenant Alias)
    participant CKPT as Checkpoint Store

    Client->>Svc: POST /search/v1/query<br/>h:{Authorization,x-tenant-id}<br/>{q, filters, facets, size, cursor?}
    Svc->>Svc: Validate params, build per-tenant query, apply redaction-on-read
    Svc->>Engine: Execute { index: tenant-alias, body: query+aggs }
    Engine-->>Svc: Hits, facets, next cursor, took
    Svc->>CKPT: Read index watermark (optional)
    Svc-->>Client: 200 {results, facets, nextCursor} + X-Index-Watermark + X-Index-Lag

Alternative Paths

  • Time freshness bias: apply recency boost within a freshness window (e.g., last 24h).
  • Filter-only queries (q empty): return filtered timeline with facets.
  • Suggest endpoint: /search/v1/suggest uses completion or n-gram suggesters with prefix and filters.
  • Read-through cache: cache popular queries for short TTL (exclude personalized filters).
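A minimal client sketch for the query endpoint described above, using HttpClient; the host, token source, and payload values are placeholders, with the body shape taken from the Input Requirements:

```csharp
using System;
using System.Linq;
using System.Net.Http;
using System.Net.Http.Json;

var jwt = Environment.GetEnvironmentVariable("ATP_TOKEN") ?? ""; // placeholder token source
var http = new HttpClient { BaseAddress = new Uri("https://api.atp.example") };
http.DefaultRequestHeaders.Add("x-tenant-id", "acme");
http.DefaultRequestHeaders.Authorization = new("Bearer", jwt);

var request = new
{
    q = "user create OR signup",
    filters = new { resourceType = "Iam.User" },
    facets = new[] { "resourceType", "action" },
    size = 25,
    highlight = true
};

using var response = await http.PostAsJsonAsync("/search/v1/query", request);
// Freshness headers accompany the body; callers can surface lag to end users.
var watermark = response.Headers.TryGetValues("X-Index-Watermark", out var v)
    ? v.First() : null;
```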

Error Paths

sequenceDiagram
    actor Client
    participant Svc as Search Service

    Client->>Svc: POST /search/v1/query (bad params/tenant)
    alt Bad request (malformed cursor/invalid facet)
        Svc-->>Client: 400 Problem+JSON
    else Tenant alias missing / disabled
        Svc-->>Client: 404 Problem+JSON
    else Conflicting params (both page & cursor, or size>cap)
        Svc-->>Client: 409 Problem+JSON
    else Rate limited / engine unavailable
        Svc-->>Client: 429/503 Problem+JSON (+ Retry-After)
    end

Request/Response Specifications

Input Requirements

| Field | Type | Req | Description | Validation |
| --- | --- | --- | --- | --- |
| Method/Path | HTTP | Y | POST /search/v1/query (search endpoint) | JSON body |
| Authorization | header | Y | Bearer <JWT> | Valid, not expired |
| x-tenant-id | header | Y* | Tenant scope | ^[A-Za-z0-9._-]{1,128}$ |
| q | string | O | Query string (full-text) | 0–2048 chars |
| filters | object | O | {resourceType?, actorId?, action?, time:{from?,to?}, decision?} | enums/ISO-8601 |
| facets | array | O | Facets to compute (e.g., ["resourceType","action"]) | allowlist only |
| size | int | O | Page size | 1–100 (default 25) |
| cursor | string | O | Opaque search-after token | base64url |
| highlight | bool | O | Return snippets | default false |
| sort | enum | O | relevance (default), createdAt:desc, createdAt:asc | allowlist |

*Required unless tenant is encoded in a dedicated tenant path variant.

Output Specifications

| Field | Type | Description | Notes |
| --- | --- | --- | --- |
| results[] | array | Search hits with essential fields | Redacted as needed |
| facets | object | Buckets per requested facet | Top-N buckets |
| nextCursor | string? | Token for next page | Omitted if no more |
| tookMs | int | Engine execution time | From engine |
| totalApprox | int | Approx total matches | Not exact if tracking disabled |

Response Headers

  • X-Index-Watermark: ISO-8601 UTC of latest indexed event for tenant
  • X-Index-Lag: Seconds behind “now” (now - X-Index-Watermark)

Example Payloads

Request

{
  "q": "user create OR signup",
  "filters": {
    "resourceType": "Iam.User",
    "time": { "from": "2025-10-22T00:00:00Z", "to": "2025-10-22T23:59:59Z" }
  },
  "facets": ["resourceType", "action"],
  "size": 25,
  "sort": "relevance",
  "highlight": true
}

200 OK

X-Index-Watermark: 2025-10-22T12:03:05.120Z
X-Index-Lag: 7.2
{
  "results": [
    {
      "auditRecordId": "01JEC7KX8…",
      "createdAt": "2025-10-22T11:58:10.201Z",
      "action": "user.create",
      "resource": { "type": "Iam.User", "id": "U-1001" },
      "actor": { "id": "svc_signup", "type": "Service", "display": "signup-svc" },
      "score": 7.42,
      "highlights": { "action": ["<em>user</em>.create"] }
    }
  ],
  "facets": {
    "resourceType": [{ "key": "Iam.User", "count": 128 }],
    "action": [{ "key": "user.create", "count": 92 }]
  },
  "nextCursor": "eyJzZWFyY2hBZnRlciI6WyIxLjIzIiwiMDFK...Il19",
  "tookMs": 23,
  "totalApprox": 612
}

400 Bad Request (invalid facet)

{
  "type": "urn:connectsoft:errors/search/facet.invalid",
  "title": "Invalid facet",
  "status": 400,
  "detail": "Facet 'userEmail' is not allowed.",
  "errors": [{ "pointer": "/facets/0", "reason": "allowlist" }]
}

Error Handling

Error Scenarios

| HTTP Code | Scenario | Recovery Action | Retry Strategy |
| --- | --- | --- | --- |
| 400 | Malformed cursor, disallowed facet, bad time range, size out of bounds | Fix request | No retry until corrected |
| 401 | Missing/invalid JWT | Acquire valid token | Retry after renewal |
| 403 | Insufficient audit:search scope or tenant mismatch | Request proper scope | No retry |
| 404 | Tenant alias/index missing or feature disabled | Verify tenant/feature | No retry |
| 409 | Conflicting params (e.g., cursor with sort not supported) | Adjust params | Retry after fix |
| 422 | Query too complex (clause limit, wildcard explosion) | Simplify query | No retry until changed |
| 429 | Rate limited/backpressure | Respect Retry-After | Exponential backoff + jitter |
| 503 | Engine unavailable / timeout | Wait for recovery | Retry with jitter |

Failure Modes

  • Stale cursor after reindex/alias swap → 409 cursor.stale with resyncFrom hint.
  • Facet blow-up (high cardinality) → 422 with guidance to narrow filters.
  • Highlight overflow → server truncates snippets to configured limit.

Recovery Procedures

  1. On 409 cursor.stale, drop cursor and re-issue query without cursor or with from bound.
  2. On 429/503, backoff; keep query identical to benefit from caching when enabled.
  3. Replace disallowed facets with supported ones per schema allowlist.

Performance Characteristics

Latency Expectations

  • P50 ≤ 80 ms, P95 ≤ 200 ms, P99 ≤ 400 ms (moderate facets, size ≤ 50).

Throughput Limits

  • Per tenant baseline 300 rps sustained; global scales with cluster nodes and shard count.

Resource Requirements

  • Aggregations demand CPU/heap; ensure shard sizing and circuit breakers for large queries.

Scaling Considerations

  • Scale by shards/replicas; use per-tenant alias routing.
  • Enable result caching and request coalescing for hot queries.
  • Apply freshness bias instead of hard refresh to avoid heavy refresh calls.

Security & Compliance

Authentication

  • OIDC JWT at Gateway; mTLS between Search Service and engine (optional).

Authorization

  • Enforce audit:search scope; per-tenant isolation via index alias filter or index-per-tenant.

Data Protection

  • Documents store minimized fields; sensitive values tokenized or omitted.
  • Highlights sanitized; never return dropped/redacted fields.

Compliance

  • Record search access with tenant, subject, queryHash, filters, and returnedCount.
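The queryHash recorded above can be a stable digest of the canonicalized request body, so access audits never store raw query text. A minimal sketch (helper name is hypothetical):

```python
import hashlib
import json

def query_hash(body: dict) -> str:
    """Stable SHA-256 of a search request body for access-audit records.

    Canonical JSON (sorted keys, no whitespace) makes the hash independent
    of client-side key ordering.
    """
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```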

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
| --- | --- | --- | --- |
| search_latency_ms | histogram | End-to-end latency | p95 > 200 ms |
| search_qps | counter | Requests/sec | Capacity planning |
| index_freshness_seconds | gauge | now - indexWatermark | > 10 s sustained |
| search_429_total | counter | Rate limited count | > 5% sustained |
| cursor_stale_total | counter | 409 due to stale cursor | Spike detection |

Logging Requirements

  • Structured logs: tenant, traceId, qHash, filtersHash, size, sort, tookMs, indexLagSec.
  • Do not log raw queries or highlights.

Distributed Tracing

  • Spans: search.plan, engine.search, engine.aggs, cache.get/set.
  • Attributes: tenant, hasCursor, facetCount, size, tookMs.

Health Checks

  • Readiness: engine reachable; tenant alias exists; index freshness within SLO.
  • Liveness: threadpool/connection pool healthy; circuit breakers closed.

Operational Procedures

Deployment

  1. Create index template & per-tenant alias with filter tenantId=....
  2. Deploy Search Service route /search/v1/query and /search/v1/suggest.
  3. Validate end-to-end queries and index freshness headers.

Configuration

  • Env: SEARCH_MAX_SIZE=100, DEFAULT_SIZE=25, ALLOWED_FACETS=..., CURSOR_TTL, RECENCY_BOOST_WINDOW.
  • Engine: shard/replica count, analyzers, suggesters, circuit breakers.

Maintenance

  • Rolling reindex and alias swap; backfill lag tracking.
  • Periodic shard rebalancing; optimize/forcemerge as needed off-peak.

Troubleshooting

  • High index_freshness_seconds → inspect projector/indexer lag.
  • Many 422 → educate clients on query limits; adjust clause caps if safe.
  • 429 spikes → scale nodes or adjust rate limits/caching.

Testing Scenarios

Happy Path Tests

  • Keyword query with filters returns ranked hits and requested facets within p95 ≤ 200 ms.
  • Pagination via nextCursor returns non-overlapping result sets.
  • Headers include X-Index-Watermark and X-Index-Lag.

Error Path Tests

  • 400 on invalid facet, malformed cursor, or bad time bounds.
  • 404 when tenant alias missing/disabled.
  • 409 on stale cursor or conflicting params.
  • 422 on overly complex query (clause cap).
  • 429/503 obey retry/backoff.

Performance Tests

  • Facet cost under control for typical cardinalities.
  • Query load at target QPS with p95 ≤ 200 ms.

Security Tests

  • RBAC scope audit:search enforced; cross-tenant leakage prevented by alias filter.
  • Redaction/minimization verified in results and highlights.

Internal References

  • Tenant-Scoped Query Flow
  • Audit Record Projection Update Flow
  • Data Redaction Flow

External References

  • Vendor docs for analyzers, aggregations, and suggesters (e.g., ES/OpenSearch)

Appendices

A. Example Engine Query (conceptual)

{
  "query": {
    "bool": {
      "filter": [{ "term": { "tenantId": "acme" } }],
      "must": [{ "simple_query_string": { "query": "user create OR signup", "fields": ["action^3","resource.type","attributes.*"] }}]
    }
  },
  "aggs": {
    "resourceType": { "terms": { "field": "resource.type", "size": 10 } },
    "action": { "terms": { "field": "action.keyword", "size": 10 } }
  },
  "sort": ["_score", { "createdAt": "desc" }],
  "size": 25,
  "search_after": ["1.23", "01JEC7KX8..."]
}

B. Example Suggest Request

{
  "prefix": "user.c",
  "filters": { "resourceType": "Iam.User" },
  "size": 10
}

Filtered Query Flow

Policy-aware read path that applies purpose-of-use evaluation, field-level allow/deny, and on-read redaction/masking before returning results. The Query Service consults the Policy Service to compute an effective redaction profile for the caller, then executes a tenant-scoped query and post-processes rows according to the profile.


Overview

Purpose: Return tenant-scoped audit results filtered by caller intent and masked according to privacy & PII policies.
Scope: Purpose-of-use signaling, policy evaluation, field projection, masking strategies (hash/mask/tokenize/drop), seek pagination, and response headers indicating applied policy and freshness. Excludes full-text search (see Search flow) and raw timeline (see Tenant-Scoped Query).
Context: Operates on AuditEvents projection; combines pre-index filters with post-fetch masking.
Key Participants:

  • Client (API consumer)
  • API Gateway (authN/Z, rate limiting)
  • Query Service (query + masking orchestrator)
  • Policy Service (purpose-of-use, allow/deny, redaction profile)
  • Read DB (AuditEvents + RLS) (tenant-isolated projection)
  • Checkpoint Store (watermark for freshness)

Prerequisites

System Requirements

  • TLS at Gateway; JWKS configured for JWT verification
  • Query Service access to Read DB and Policy Service
  • RLS on Read DB enforcing tenantId
  • Redaction libraries & configs deployed (hash/mask/tokenize/drop)

Business Requirements

  • Tenant active; privacy/PII classifications configured
  • Policy definitions include purpose-of-use to field permissions/masking
  • Data residency respected for cross-region reads

Performance Requirements

  • p95 ≤ 180 ms for limit≤200 with standard masking
  • Policy evaluation cache (per subject+purpose) warmed; TTL tuned
  • Indexes support common filter predicates

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    actor Client as Client
    participant GW as API Gateway
    participant Q as Query Service
    participant P as Policy Service
    participant R as Read DB (AuditEvents + RLS)
    participant C as Checkpoint Store

    Client->>GW: POST /query/v1/filtered <br/> h:{Authorization,x-tenant-id,traceparent,x-purpose-of-use}
    GW->>Q: Forward request + headers
    Q->>P: Evaluate(subject, tenant, purpose, requestedFields, filters)
    P-->>Q: RedactionProfile {allowed, denied, maskRules}
    Q->>R: SELECT ... WHERE tenantId=? AND <server-validated filters> ORDER BY createdAt DESC LIMIT N
    R-->>Q: rows
    Q->>Q: Apply RedactionProfile (drop/transform fields) + build cursors
    Q->>C: Read tenant watermark
    Q-->>GW: 200 {items(masked), nextCursor} + X-Watermark, X-Lag, X-Policy-Decision-Id
    GW-->>Client: 200 OK

Alternative Paths

  • Field projection: Client requests fields=[...]; server intersects with allowed and masks per rules.
  • Explain-only: dryRun=true returns the effective RedactionProfile without data.
  • Head polling: If-None-Match: "wmk:<value>" → 304 if the watermark is unchanged.

Error Paths

sequenceDiagram
    actor Client
    participant GW as API Gateway
    participant Q as Query Service

    Client->>GW: POST /query/v1/filtered (bad params/conflicts)
    alt Bad request (invalid filter/purpose/fields)
        GW-->>Client: 400 Problem+JSON
    else Tenant/route not found or feature disabled
        GW-->>Client: 404 Problem+JSON
    else Fields conflict with policy decision
        GW-->>Client: 409 Problem+JSON
    else Unauthorized / Forbidden
        GW-->>Client: 401/403 Problem+JSON
    else Backpressure / upstream down
        GW-->>Client: 429/503 Problem+JSON (+ Retry-After)
    end

Request/Response Specifications

Input Requirements

| Field | Type | Req | Description | Validation |
| --- | --- | --- | --- | --- |
| Method/Path | HTTP | Y | POST /query/v1/filtered (filtered & masked read) | JSON body |
| Authorization | header | Y | Bearer <JWT> | Valid, not expired |
| x-tenant-id | header | Y* | Tenant scope | ^[A-Za-z0-9._-]{1,128}$ |
| traceparent | header | O | W3C trace context | 55-char |
| x-purpose-of-use | header | Y | Caller intent (e.g., Support, SecurityOps, Analytics) | Enum allowlist |
| limit | body.int | O | Items per page | 1–200 (default 100) |
| cursor | body.string | O | Opaque seek token | base64url |
| filters | body.object | O | Server-validated predicates | Allowlist only |
| fields | body.array | O | Requested projections | Intersected with policy |
| dryRun | body.bool | O | Return policy only | default false |

*Required unless tenant embedded in path variant.

Supported filter keys (allowlist example): createdAt.from/to, action, resource.type, resource.id, actor.id, decision.result.
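Server-side enforcement of this allowlist might look like the following sketch (allowlist keys from this section; the RFC 7807-style error entries match the 400 example below):

```python
ALLOWED_FILTERS = {
    "createdAt.from", "createdAt.to", "action",
    "resource.type", "resource.id", "actor.id", "decision.result",
}

def flatten(filters: dict, prefix: str = "") -> dict:
    """Flatten nested filter objects to dotted keys, e.g. {'createdAt': {'from': x}} -> {'createdAt.from': x}."""
    out = {}
    for key, value in filters.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            out.update(flatten(value, prefix=f"{path}."))
        else:
            out[path] = value
    return out

def validate_filters(filters: dict) -> list:
    """Return Problem+JSON error entries for every non-allowlisted filter key."""
    return [
        {"pointer": f"/filters/{key}", "reason": "not-allowed"}
        for key in flatten(filters)
        if key not in ALLOWED_FILTERS
    ]
```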

Output Specifications

| Field | Type | Description | Notes |
| --- | --- | --- | --- |
| items[] | array | Masked rows honoring RedactionProfile | Order: createdAt DESC |
| nextCursor | string? | Seek token for next page | Omitted if end |
| policy | object? | Returned if dryRun=true | Effective profile summary |
| count | int | Items in this page | ≤ limit |

Response Headers

  • X-Watermark: tenant projection watermark (ISO-8601 UTC)
  • X-Lag: seconds behind now
  • X-Policy-Decision-Id: opaque id of the applied policy decision (for audit)

Example Payloads

Request

{
  "limit": 50,
  "fields": ["auditRecordId","createdAt","action","resource.id","actor.display","client.ip"],
  "filters": {
    "resource.type": "Iam.User",
    "createdAt": { "from": "2025-10-22T00:00:00Z", "to": "2025-10-22T23:59:59Z" }
  }
}

Headers:

Authorization: Bearer eyJhbGciOi...
x-tenant-id: acme
x-purpose-of-use: Support
traceparent: 00-9f0c1d2e3a4b5c6d7e8f9a0b1c2d3e4f-1111222233334444-01

200 OK (masked)

X-Watermark: 2025-10-22T12:05:10.330Z
X-Lag: 6.9
X-Policy-Decision-Id: pol_7b3f8d1a
{
  "items": [
    {
      "auditRecordId": "01JEC9VX2Z…",
      "createdAt": "2025-10-22T11:57:03.200Z",
      "action": "user.create",
      "resource": { "id": "U-1001" },
      "actor": { "display": "signup-svc" },
      "client": { "ip": "203.0.113.0/24" }  // IP truncated per Support profile
    }
  ],
  "nextCursor": "eyJ0cyI6IjIwMjUtMTAtMjJUMTE6NTc6MDMuMjAwWiIsImlkIjoiMDFK...In0",
  "count": 1
}

dryRun=true (policy only)

{
  "policy": {
    "allowed": ["auditRecordId","createdAt","action","resource.id","actor.display","client.ip"],
    "denied": ["client.userAgent","geo.location","subject.email"],
    "maskRules": {
      "client.ip": "truncate_cidr_24",
      "subject.email": "mask_localpart"
    }
  }
}

400 Bad Request (conflicting filters)

{
  "type": "urn:connectsoft:errors/query/filters.invalid",
  "title": "Invalid filters",
  "status": 400,
  "detail": "Unsupported filter 'subject.email'.",
  "errors": [{ "pointer": "/filters/subject.email", "reason": "not-allowed" }]
}

Error Handling

Error Scenarios

| HTTP Code | Scenario | Recovery Action | Retry Strategy |
| --- | --- | --- | --- |
| 400 | Malformed filters/cursor; unknown x-purpose-of-use; invalid fields | Fix request; use allowlisted fields | No retry until corrected |
| 401 | Missing/invalid JWT | Acquire valid token | Retry after renewal |
| 403 | Subject lacks audit:read:filtered scope or policy denies all fields | Request correct scope; adjust purpose | No retry until fixed |
| 404 | Tenant/route not found; feature disabled | Verify tenant/URL/edition | No retry |
| 409 | Requested fields conflict with policy (e.g., denied but required) or cursor param conflicts | Remove offending fields/params | Retry after fix |
| 429 | Rate limit/backpressure | Respect Retry-After | Exponential backoff + jitter |
| 503 | Policy or DB dependency unavailable | Wait for recovery | Retry with same params |

Failure Modes

  • Policy cache staleness: returns stricter profile than expected—safe by design; refresh on next call.
  • Cursor invalid after rebuild: 409 cursor.stale with resyncFrom hint.
  • Overbroad projection: requesting many fields increases payload size; server may trim to allowed ∩ requested.

Recovery Procedures

  1. On 409 field-policy conflict, re-issue request with fields returned in policy.allowed.
  2. On 429/503, backoff with jitter; do not widen limit.
  3. For stale cursor, restart from from time bound or omit cursor.

Performance Characteristics

Latency Expectations

  • P50 ≤ 70 ms, P95 ≤ 180 ms (policy cache hit); add 15–30 ms if cache miss.

Throughput Limits

  • Per tenant: 150 rps sustained, 600 rps burst (configurable).
  • Global: scales with read replicas and policy cache hit rate.

Resource Requirements

  • CPU for masking transforms (e.g., hashing/tokenization); memory for page shaping.

Scaling Considerations

  • Cache policy decisions keyed by (tenant, subject, purpose) with short TTL (e.g., 60–300s).
  • Pre-compute allowlists per purpose to minimize per-request overhead.
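The decision cache described above can be sketched as a small TTL map keyed by (tenant, subject, purpose); the evaluator callable and class name are hypothetical:

```python
import time

class PolicyDecisionCache:
    """TTL cache for policy decisions keyed by (tenant, subject, purpose)."""

    def __init__(self, evaluate, ttl_seconds: float = 120.0, clock=time.monotonic):
        self._evaluate = evaluate   # callable(tenant, subject, purpose) -> RedactionProfile
        self._ttl = ttl_seconds     # short TTL, e.g. 60-300 s per the guidance above
        self._clock = clock
        self._entries = {}          # key -> (stored_at, profile)

    def get(self, tenant: str, subject: str, purpose: str):
        key = (tenant, subject, purpose)
        entry = self._entries.get(key)
        if entry is not None and self._clock() - entry[0] < self._ttl:
            return entry[1]         # fresh hit: skip the Policy Service round-trip
        profile = self._evaluate(tenant, subject, purpose)
        self._entries[key] = (self._clock(), profile)
        return profile
```

Because a stale entry can only be as permissive as a past decision, expiry errs toward re-evaluation rather than wider access.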

Security & Compliance

Authentication

  • OIDC JWT; traceparent propagated; optional mTLS Gateway↔Query Service.

Authorization

  • Require audit:read:filtered; validate x-tenant-id claim and RBAC.
  • Enforce DB-level RLS and post-query field-level controls from policy.

Data Protection

  • Apply masking strategies per Data Model (truncate_cidr_24, mask_localpart, hash_sha256, drop).
  • Do not return fields marked denied by policy; never include raw PII if policy says mask/drop.

Compliance

  • Emit access audit: subject, tenant, purpose, decisionId, requestedFields, returnedFields.

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
| --- | --- | --- | --- |
| filtered_query_latency_ms | histogram | End-to-end latency | p95 > 180 ms |
| policy_eval_latency_ms | histogram | Policy round-trip | p95 > 30 ms |
| policy_denied_total | counter | Requests with any denied fields | Sudden spikes |
| masked_fields_total | counter | Count of masked field applications | Trend monitoring |
| cursor_stale_total | counter | 409 due to stale cursor | Rebuild detection |
| query_429_total | counter | Rate-limited responses | > 5% sustained |

Logging Requirements

  • Structured logs: tenant, traceId, purpose, decisionId, requestedFieldsHash, returnedFieldsHash, resultCount, watermark, lagSec. Do not log raw PII.

Distributed Tracing

  • Spans: policy.evaluate, db.select.filtered, mask.apply.
  • Attributes: purpose, allowedCount, maskedCount, deniedCount.

Health Checks

  • Readiness: Policy Service reachable; RLS verified; masking config loaded.
  • Liveness: threadpool/connection pools healthy.

Operational Procedures

Deployment

  1. Deploy/enable /query/v1/filtered route behind feature flag query.filtered.enabled=false.
  2. Load policy catalogs and masking configuration; warm caches.
  3. Validate dryRun and live calls in staging with test profiles.

Configuration

  • Env: QUERY_MAX_LIMIT, DEFAULT_LIMIT, POLICY_CACHE_TTL, MASKING_RULES_PATH.
  • Headers: accept x-purpose-of-use values from allowlist only.

Maintenance

  • Rotate JWT keys; review policy changes; audit decision logs.
  • Monitor masked vs. denied trends to tune rules.

Troubleshooting

  • Many 409 field conflicts → educate clients to request dryRun first or fetch policy.allowed.
  • High policy_eval_latency_ms → investigate Policy Service capacity/caching.
  • Data leakage concerns → verify masking config version & hot reload.

Testing Scenarios

Happy Path Tests

  • Valid request with x-purpose-of-use: Support returns masked IP and allowed fields.
  • dryRun=true returns expected profile; subsequent call applies it.

Error Path Tests

  • 400 on invalid filter key or unknown purpose.
  • 404 when tenant missing/disabled.
  • 409 when requesting denied fields.
  • 429/503 obey retry/backoff with unchanged parameters.

Performance Tests

  • Cache-hit p95 ≤ 180 ms; cache-miss overhead within budget.
  • Large page (limit=200) still meets p95 under typical load.

Security Tests

  • RLS prevents cross-tenant access.
  • No raw PII fields returned when policy mandates mask/drop.
  • Access audit entries include purpose and decisionId.

Internal References

  • Data Redaction Flow (on-read), Policy & Retention flows
  • Compliance Audit Flow

External References

  • RFC 7807 (Problem Details)
  • Organization Privacy/PII policy catalog

Appendices

A. Example RedactionProfile (concept)

{
  "decisionId": "pol_7b3f8d1a",
  "purpose": "Support",
  "allowed": ["auditRecordId","createdAt","action","resource.id","actor.display","client.ip"],
  "denied": ["subject.email","geo.location","client.userAgent"],
  "maskRules": {
    "client.ip": "truncate_cidr_24",
    "subject.email": "mask_localpart"
  }
}

B. Masking Rules (summary)

  • truncate_cidr_24 → IPv4 a.b.c.d → a.b.c.0/24
  • mask_localpart → name@domain → n***@domain
  • hash_sha256 → irreversible 64-hex digest
  • drop → remove field from output
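The rules above can be sketched as pure transforms (rule names come from this spec; the exact truncation and masking behavior shown is an assumption consistent with the examples):

```python
import hashlib

def truncate_cidr_24(ip: str) -> str:
    """IPv4 a.b.c.d -> a.b.c.0/24 (drops the host octet, keeps the /24 network)."""
    a, b, c, _ = ip.split(".")
    return f"{a}.{b}.{c}.0/24"

def mask_localpart(email: str) -> str:
    """name@domain -> n***@domain (keep only the first character of the local part)."""
    local, domain = email.split("@", 1)
    return f"{local[:1]}***@{domain}"

def hash_sha256(value: str) -> str:
    """Irreversible 64-hex SHA-256 digest of the field value."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

# 'drop' is not a transform: the field is simply removed from the output document.
```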

Time-Range Query Flow

Efficiently retrieves audit events constrained by a time window. The Query Service translates from/to predicates into partition/shard pruning (e.g., daily/monthly tenant partitions), executes seek-paginated scans over the minimal set of partitions, and returns watermark/lag headers to describe projection freshness.


Overview

Purpose: Provide fast, predictable retrieval of audit events within a specified time range while minimizing IO via partition/shard pruning.
Scope: Time predicates, partition selection, shard routing, seek-based pagination across multiple partitions, and freshness exposition. Excludes full-text relevance (see Search) and policy-driven masking (see Filtered Query).
Context: Operates on the AuditEvents read model that is physically partitioned by tenant and time; the Projection Service updates these partitions asynchronously.
Key Participants:

  • Client (API consumer)
  • API Gateway (authN/Z, rate limiting)
  • Query Service (planner/executor, paginator)
  • Read Store (time-partitioned AuditEvents with RLS)
  • Partition Catalog (maps time windows → partitions/shards)
  • Checkpoint Store (per-tenant watermark)

Prerequisites

System Requirements

  • Gateway with TLS and JWT validation
  • Query Service can access Read Store, Partition Catalog, and Checkpoint Store
  • Read Store enforces RLS by tenantId
  • Time partitions (e.g., daily/monthly) exist and are discoverable in the catalog

Business Requirements

  • Tenant is active and permitted to query historical windows requested
  • Retention policy covers the requested from/to period
  • Regional residency honored for multi-region tenants

Performance Requirements

  • p95 ≤ 160 ms for limit≤200 and ≤ 14 partitions scanned
  • Covering index on (tenantId, createdAt DESC, auditRecordId) per partition
  • Partition discovery latency p95 ≤ 10 ms

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    actor Client as Client
    participant GW as API Gateway
    participant Q as Query Service
    participant Cat as Partition Catalog
    participant R as Read Store (AuditEvents + RLS)
    participant Ck as Checkpoint Store

    Client->>GW: GET /audit/v1/events/range?from=...&to=...&limit=200&cursor=... <br/> h:{Authorization,x-tenant-id,traceparent}
    GW->>Q: Forward request + normalized headers
    Q->>Q: Validate time window, normalize [from,to], parse/verify cursor (if any)
    Q->>Cat: Resolve partitions/shards for [from,to] + tenant
    Cat-->>Q: Ordered partition list (most-recent → oldest)
    Q->>R: Query partitions with seek pagination (ORDER BY createdAt DESC, auditRecordId)
    R-->>Q: Page of rows + next anchor (ts,id,partitionIdx)
    Q->>Ck: Read tenant watermark
    Q-->>GW: 200 {items, nextCursor} + X-Watermark + X-Lag + X-Partitions-Scanned
    GW-->>Client: 200 OK

Alternative Paths

  • Open-ended range: only from provided (defaults to=now), or only to (backfill).
  • Ascending traversal: order=asc for forward scans; cursor encodes direction + partition index.
  • Server-side downsampling: for very wide windows, service may cap maxPartitions and advise narrowing via Problem+JSON type: .../range.too_wide (422) when appropriate.

Error Paths

sequenceDiagram
    actor Client
    participant GW as API Gateway

    Client->>GW: GET /audit/v1/events/range?from=bad&to=2025-10-22T00:00:00Z
    alt Bad request (malformed/invalid window)
        GW-->>Client: 400 Bad Request (Problem+JSON)
    else Tenant route not found / disabled
        GW-->>Client: 404 Not Found (Problem+JSON)
    else Conflicting params (cursor with changed window/order)
        GW-->>Client: 409 Conflict (Problem+JSON)
    else Rate limited / store unavailable
        GW-->>Client: 429/503 (Problem+JSON + Retry-After)
    end

Request/Response Specifications

Input Requirements

| Field | Type | Req | Description | Validation |
| --- | --- | --- | --- | --- |
| Method/Path | HTTP | Y | GET /audit/v1/events/range or /tenants/{tenantId}/events/range | Time-range endpoint |
| Authorization | header | Y | Bearer <JWT> | Valid, not expired |
| x-tenant-id | header | Y* | Tenant scope | ^[A-Za-z0-9._-]{1,128}$ |
| traceparent | header | O | W3C trace context | 55-char |
| from | query | O* | ISO-8601 UTC lower bound | ≤ to; within retention |
| to | query | O* | ISO-8601 UTC upper bound | ≥ from; not in future+skew |
| limit | query | O | Items per page (default 100) | 1–1000 |
| order | query | O | desc (default) or asc | enum |
| cursor | query | O | Opaque base64url (ts,id,partitionIdx,dir) | Must match current params |
| filters… | query | O | Optional allowlisted filters (e.g., action, resource.type) | Validated server-side |

  • At least one of from or to is required; if only one is provided, the other defaults to now (bounded by retention and skew rules).

Output Specifications

| Field | Type | Description | Notes |
| --- | --- | --- | --- |
| items[] | array | Results in requested order | Seek-paginated |
| nextCursor | string? | Encodes next anchor + partition index | Omitted if no more |
| count | int | Items in this page | ≤ limit |

Response Headers

  • X-Watermark: tenant projection watermark (ISO-8601 UTC)
  • X-Lag: seconds behind now (now - watermark)
  • X-Partitions-Scanned: integer count of partitions touched
  • Cache-Control: typically no-store (or short TTL where safe)

Example Request

GET /audit/v1/events/range?from=2025-10-20T00:00:00Z&to=2025-10-22T23:59:59Z&limit=200&order=desc HTTP/1.1
Host: api.atp.example
Authorization: Bearer eyJhbGciOi...
x-tenant-id: acme
traceparent: 00-3e1f2d0c9b8a7f6e5d4c3b2a19081716-7f6e5d4c3b2a1908-01

200 OK

X-Watermark: 2025-10-22T12:10:05.412Z
X-Lag: 5.6
X-Partitions-Scanned: 3
{
  "items": [
    {
      "auditRecordId": "01JECZ6Y8K1V...",
      "createdAt": "2025-10-22T12:02:59.812Z",
      "action": "user.create",
      "resource": { "type": "Iam.User", "id": "U-1001" },
      "actor": { "id": "svc_ingress", "type": "Service" }
    }
  ],
  "nextCursor": "eyJ0cyI6IjIwMjUtMTAtMjJUMTE6NTU6MDAuMDAwWiIsImlkIjoiMDFK...IiwicGFydGl0aW9uSW5kZXgiOjEsImRpciI6ImRlc2MifQ",
  "count": 1
}

Error Handling

Error Scenarios

| HTTP Code | Scenario | Recovery Action | Retry Strategy |
| --- | --- | --- | --- |
| 400 | Malformed from/to; from>to; window exceeds max span; limit out of bounds | Fix params; reduce window | No retry until corrected |
| 401 | Missing/invalid JWT | Acquire valid token | Retry after renewal |
| 403 | Token lacks audit:read:timeline for tenant | Request proper scope | No retry |
| 404 | Tenant/route not found; tenant disabled; partitions not present (fully aged out) | Verify tenant/window | No retry |
| 409 | cursor does not match from/to/order; stale cursor after compaction | Drop/refresh cursor, re-issue | Retry after fix |
| 429 | Rate limit/backpressure | Honor Retry-After | Exponential backoff + jitter |
| 503 | Read Store/Catalog unavailable | Wait for recovery | Retry with same params |

Failure Modes

  • Stale cursor after partition compaction/rotation → 409 with type: .../cursor.stale and resyncFrom hint.
  • Excessive partitions for wide windows → 422 range.too_wide with suggested subranges.
  • Clock skew: future to beyond now+skew → clamp or 400 with pointer to to.

Recovery Procedures

  1. For 409 cursor.stale, restart without cursor or with from=lastSeen.createdAt.
  2. For 422 range.too_wide, split the request by suggested daily/monthly windows.
  3. Monitor X-Partitions-Scanned; if high, narrow the time window.
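Step 2's window splitting can be sketched as a day-aligned subdivision (helper name is hypothetical; monthly splitting is analogous):

```python
from datetime import datetime, timedelta

def split_daily(start: datetime, end: datetime) -> list:
    """Split [start, end] into UTC-day-aligned subranges for re-issuing narrower range queries."""
    windows = []
    cursor = start
    while cursor < end:
        # Next UTC midnight after 'cursor', clamped to 'end'.
        next_day = (cursor + timedelta(days=1)).replace(hour=0, minute=0, second=0, microsecond=0)
        upper = min(next_day, end)
        windows.append((cursor, upper))
        cursor = upper
    return windows
```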

Performance Characteristics

Latency Expectations

  • P50 ≤ 70 ms, P95 ≤ 160 ms, P99 ≤ 320 ms when ≤14 partitions scanned.

Throughput Limits

  • Per tenant: 150 rps sustained, burst 600 rps (configurable).
  • Global: scales with number of read replicas and partition cache hit rate.

Resource Requirements

  • Partition catalog lookup in-memory or fast key-value store; read DB requires covering indexes per partition.

Scaling Considerations

  • Pruning first: always resolve partitions before issuing any scans.
  • Adaptive limits: cap limit when many partitions are touched; prefer more pages over wide scans.
  • Parallel partition reads (optional): small fan-out with strict per-tenant concurrency to preserve order semantics when stitching results.

Security & Compliance

Authentication

  • OIDC JWT at Gateway; propagate traceparent; optional mTLS Gateway↔Query.

Authorization

  • Enforce audit:read:timeline; verify tenant claims; RLS must filter by tenantId.

Data Protection

  • Only return fields allowed by baseline read model; masking/redaction applied in dedicated filtered flow if required.

Compliance

  • Log access with tenant, from, to, limit, partitionsScanned, watermark, and lag.

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
| --- | --- | --- | --- |
| range_query_latency_ms | histogram | End-to-end latency | p95 > 160 ms |
| partitions_scanned | histogram | Partitions per request | > 16 median |
| cursor_stale_total | counter | 409 due to stale cursor | Spike indicates compaction |
| range_too_wide_total | counter | 422 due to excessive span | Trend watch |
| watermark_lag_seconds | gauge | now - watermark | > target (e.g., > 10 s) |

Logging Requirements

  • Structured logs: tenant, traceId, from, to, order, limit, cursorHash, partitionsScanned, resultCount, watermark, lagSec. Do not log raw cursor.

Distributed Tracing

  • Spans: catalog.resolvePartitions, db.scan.partition, stitch.page, ckpt.read.
  • Attributes: partitionCount, limit, dir, hasCursor.

Health Checks

  • Readiness: catalog reachable; partitions for today resolvable; indexes present.
  • Liveness: DB/connection pools healthy; threadpool not saturated.

Operational Procedures

Deployment

  1. Enable /audit/v1/events/range route; confirm RLS and partition catalog.
  2. Smoke-test with a 24h window and verify X-Partitions-Scanned.
  3. Validate cursor stability across partition boundaries.

Configuration

  • Env: RANGE_MAX_SPAN_DAYS, QUERY_MAX_LIMIT, DEFAULT_LIMIT, PARTITION_LOOKUP_TTL.
  • Pruning: enable negative caching for empty/aged-out partitions.

Maintenance

  • Keep partition catalog in sync with DDL/rotation jobs; prune aged partitions per retention.
  • Rebuild indexes offline before alias/cutover when rotating partitions.

Troubleshooting

  • High partitions_scanned → check catalog gaps or miscomputed from/to.
  • Frequent 409 cursor conflicts → ensure clients don’t change window/order between pages.
  • Elevated watermark_lag_seconds → scale projectors or indexers.

Testing Scenarios

Happy Path Tests

  • Query 48h window returns ordered results with X-Partitions-Scanned ≤ 3.
  • Pagination crosses a partition boundary without duplicates or gaps.

Error Path Tests

  • 400 on malformed/invalid time bounds or from>to.
  • 404 when tenant/route disabled or fully aged-out window.
  • 409 when cursor does not match current from/to/order.
  • 429/503 cause client backoff and retry with same params.

Performance Tests

  • p95 ≤ 160 ms for limit=200, ≤14 partitions.
  • Partition discovery p95 ≤ 10 ms under load.

Security Tests

  • RLS prevents cross-tenant access.
  • JWT scope audit:read:timeline enforced.

Internal References

  • Tenant-Scoped Query Flow
  • Filtered Query Flow
  • Audit Record Projection Update Flow

External References

  • RFC 3339 / ISO-8601 for timestamps
  • W3C Trace Context (traceparent)

Appendices

A. Cursor schema (concept)

{
  "ts": "2025-10-22T11:55:00.000Z",
  "id": "01JECZ6Y8K1V...",
  "partitionIdx": 1,
  "dir": "desc"
}
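
Clients treat the cursor as opaque; only the server reads its fields. A minimal sketch of one plausible encoding (base64url over canonical JSON; the actual wire format is service-internal and may additionally be signed or encrypted):

```python
import base64
import json

def encode_cursor(ts: str, record_id: str, partition_idx: int, direction: str) -> str:
    """Serialize cursor fields to canonical JSON, then base64url (no padding)."""
    payload = json.dumps(
        {"ts": ts, "id": record_id, "partitionIdx": partition_idx, "dir": direction},
        sort_keys=True, separators=(",", ":"),
    ).encode()
    return base64.urlsafe_b64encode(payload).decode().rstrip("=")

def decode_cursor(token: str) -> dict:
    """Reverse of encode_cursor; re-add base64 padding before decoding."""
    padded = token + "=" * (-len(token) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))
```

Because the token is derived from the window/order parameters, the server can detect the 409 "cursor does not match current from/to/order" case by comparing decoded fields against the request.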

B. Example partition policy

  • Key: (tenantId, yyyymm) monthly partitions; for high-volume tenants use daily (tenantId, yyyymmdd).
  • Pruning: select partitions where [from,to] intersects partition time bounds; query newest-first for desc.
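
The pruning rule above can be sketched as follows (hypothetical helper; assumes monthly (tenantId, yyyymm) keys and UTC month bounds):

```python
from datetime import datetime, timezone

def partitions_for_window(from_ts: datetime, to_ts: datetime) -> list[str]:
    """Return yyyymm partition keys whose month intersects [from_ts, to_ts],
    newest-first to suit descending timeline queries."""
    keys = []
    year, month = from_ts.year, from_ts.month
    while (year, month) <= (to_ts.year, to_ts.month):
        keys.append(f"{year:04d}{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return list(reversed(keys))  # newest-first for desc order
```

Negative caching for empty/aged-out partitions would sit on top of this: keys known to be empty are dropped from the result before the per-partition scans are issued.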

Standard Export Flow

On-demand export that builds a consistent snapshot of tenant-scoped audit data, runs a scoped query over the read models, streams results in chunked parts (JSONL or Parquet, optionally gzipped), produces a signed ExportManifest (with integrity proofs), delivers via presigned URLs and/or webhook callback, and emits Export.Completed. For forensically packaged exports, see the eDiscovery Export Flow.


Overview

Purpose: Enable compliance officers to export audit data for a given tenant/time window with integrity evidence and policy safeguards.
Scope: Job creation, query scoping, chunked packaging, integrity/manifest generation, delivery (URLs/webhook), completion events, and resume/cancel. Excludes continuous/scheduled exports (see Bulk Export Flow).
Context: Runs against the projection/read models (e.g., AuditEvents) and consults Integrity Service for proofs, Policy/Retention/LegalHold for eligibility, and Storage for canonical IDs.
Key Participants:

  • Compliance Officer / Client
  • API Gateway
  • Export Service (job orchestration, packaging)
  • Query Service / Read Store (scoped read with seek pagination)
  • Integrity Service (Merkle roots / signatures)
  • Delivery Backend (object storage for parts, presigned URLs)
  • Webhook Receiver (optional callback on completion)

Prerequisites

System Requirements

  • API Gateway with TLS and JWT validation
  • Export Service deployed with access to Read Store, Integrity Service, Delivery Backend
  • Read Store enforces RLS by tenantId; indexes support range scans
  • Webhook signing keys configured (if callbacks used)

Business Requirements

  • Tenant active; retention and residency policies provisioned
  • Legal holds registered; export must honor holds and exclusions
  • Officer has audit:export permission; purpose-of-use recorded

Performance Requirements

  • Target p95 job time-to-first-part ≤ 30 s for typical scopes
  • Per-part target size (e.g., 128–512 MiB) to optimize download throughput
  • Concurrency caps per tenant to protect read replicas

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    actor Officer as Compliance Officer
    participant GW as API Gateway
    participant EXP as Export Service
    participant Q as Query Service / Read Store
    participant INT as Integrity Service
    participant OBJ as Delivery Backend (Object Storage)
    participant WH as Webhook Receiver (optional)

    Officer->>GW: POST /export/v1/jobs {tenant, range, filters, format, partSize, webhook?}
    GW->>EXP: Forward request (authN/Z, x-tenant-id, traceparent)
    EXP-->>Officer: 202 Accepted {jobId, status:"Running"} (job continues async)
    EXP->>Q: Open scoped cursor (tenant, from/to, filters)
    loop Chunk until exhausted
        Q-->>EXP: Page of rows + next cursor
        EXP->>EXP: Serialize to JSONL/Parquet, gzip if requested
        EXP->>INT: Append leaf hashes, update segment/merkle state
        EXP->>OBJ: PUT part (presigned upload or service credentials)
        OBJ-->>EXP: URL + ETag
        EXP->>EXP: Record part metadata, update resumeToken
    end
    EXP->>INT: Seal block → MerkleRoot + signature
    EXP->>EXP: Build ExportManifest {parts, counts, bytes, root, signature, resumeToken}
    Officer->>GW: GET /export/v1/jobs/{jobId}/manifest
    GW-->>Officer: 200 OK (signed manifest; presigned GETs if requested)
    alt webhook configured
        EXP->>WH: POST /webhook/export {jobId,status:"Completed",manifestUrl,signature}
    end
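
The per-part packaging inside the loop above can be sketched like this (a minimal sketch; `write_parts` and `_seal_part` are hypothetical helpers, and the real service streams each part to object storage rather than returning payloads):

```python
import gzip
import json

def write_parts(rows, part_size_bytes: int):
    """Serialize rows to JSONL, gzip each part, and roll to a new part once
    the uncompressed buffer crosses the target size. Returns part metadata."""
    parts, buf = [], bytearray()
    for row in rows:
        buf += json.dumps(row, separators=(",", ":")).encode() + b"\n"
        if len(buf) >= part_size_bytes:
            parts.append(_seal_part(len(parts), bytes(buf)))
            buf.clear()
    if buf:  # flush the final, possibly short, part
        parts.append(_seal_part(len(parts), bytes(buf)))
    return parts

def _seal_part(index: int, data: bytes) -> dict:
    compressed = gzip.compress(data)
    # In the real service this is a PUT to object storage returning an ETag.
    return {"index": index, "bytes": len(compressed),
            "records": data.count(b"\n"), "payload": compressed}
```

The recorded per-part metadata (index, bytes, records) is exactly what later lands in the manifest's parts[] array.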

Alternative Paths

  • Presigned download: Service writes parts to bucket and returns read-only presigned URLs.
  • Direct upload: Client provides presigned PUT URLs per part (client-managed storage).
  • Parquet + schema: Columnar output with embedded schema for analytics workloads.
  • Resume: Client POST /export/v1/jobs/{jobId}:resume with server-provided resumeToken.

Error Paths

sequenceDiagram
    actor Officer
    participant GW as API Gateway
    participant EXP as Export Service

    Officer->>GW: POST /export/v1/jobs {invalid filters/format}
    alt Invalid request
        GW-->>Officer: 400 Bad Request (Problem+JSON)
    else Tenant not found/feature disabled
        GW-->>Officer: 404 Not Found (Problem+JSON)
    else Job state conflict (e.g., resume running job)
        GW-->>Officer: 409 Conflict (Problem+JSON)
    else Rate limited / dependencies down
        GW-->>Officer: 429/503 (Retry-After/Problem+JSON)
    end

Request/Response Specifications

Input Requirements

Field Type Req Description Validation
Method/Path POST /export/v1/jobs Y Create export job JSON body
Authorization header Y Bearer <JWT> Valid, not expired
x-tenant-id header Y* Tenant scope Must match body.tenant
traceparent header O W3C trace context 55-char
tenant string Y Target tenant ^[A-Za-z0-9._-]{1,128}$
range object O {from?, to?} ISO-8601 UTC from ≤ to, within retention
filters object O Allowlisted filters (action/resource/actor/decision) Server validated
format enum O jsonl (default), parquet allowlist
compression enum O none (default), gzip allowlist
partSizeMiB int O Target part size 16–1024, default 256
fields array O Projection/columns Valid subset of schema
webhook.url url O Completion callback HTTPS + signature method
webhook.secretId string O Key id for HMAC Must exist in KMS
delivery.mode enum O presigned-get, client-presigned-put allowlist
  • Header required unless using path variant /tenants/{tenantId}/export/jobs.

Output Specifications

Create Job — 202 Accepted

Field Type Description
jobId string Server-assigned id (ULID/GUID)
status enum Running
estimation object {partsApprox, bytesApprox?}
pollUrl url GET /export/v1/jobs/{jobId}
manifestUrl url GET /export/v1/jobs/{jobId}/manifest (when ready)

Get Job — 200 OK

Field Type Description
jobId string id
status enum Queued | Running | Completed | Failed | Canceled
counts object {records, parts}
bytes object {written}
parts[] array {index,url,etag,bytes,records} (if presigned-get)
resumeToken string? For resume/cancel/retry
startedAt/finishedAt timestamp ISO-8601 UTC
watermark timestamp Consistency snapshot time

Manifest (JSON)

{
  "jobId": "exp_01JECXYZ...",
  "tenant": "acme",
  "range": {"from":"2025-10-20T00:00:00Z","to":"2025-10-22T23:59:59Z"},
  "format": "jsonl",
  "compression": "gzip",
  "parts": [
    {"index":0,"url":"https://.../p0.gz","bytes":268435456,"records":100000,"etag":"\"abc123\""}
  ],
  "counts":{"records":250000,"parts":3},
  "bytes":{"written":734003200},
  "integrity":{"merkleRoot":"8a4f...","signature":{"alg":"Ed25519","kid":"int-key-2025","sig":"MEQC..."}},
  "createdAt":"2025-10-22T12:30:12Z",
  "resumeToken":"r:01JEC...",
  "watermark":"2025-10-22T12:25:00Z"
}
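
Before trusting downloaded data, a consumer can cross-check the manifest's aggregate counts against its parts (a minimal sketch; field names follow the example above, and this complements, not replaces, signature verification):

```python
def manifest_is_consistent(manifest: dict) -> bool:
    """Verify that per-part bytes/records sum to the manifest totals."""
    parts = manifest.get("parts", [])
    total_bytes = sum(p["bytes"] for p in parts)
    total_records = sum(p["records"] for p in parts)
    return (total_bytes == manifest["bytes"]["written"]
            and total_records == manifest["counts"]["records"]
            and len(parts) == manifest["counts"]["parts"])
```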

Error Handling

Error Scenarios

HTTP Code Scenario Recovery Action Retry Strategy
400 Malformed range/filters; unsupported format/compression; invalid partSizeMiB; insecure webhook URL Fix request; use allowlisted values No retry until corrected
401 Missing/invalid JWT Acquire valid token Retry after renewal
403 Caller lacks audit:export or tenant mismatch Request proper role/scope No retry
404 Tenant/route not found; GET /jobs/{id} for unknown id Verify identifiers/tenant
409 Job state conflict (resume/cancel when not applicable); changing scope on resume Wait for state; create new job Retry after fix
413 Estimated export exceeds max allowed per job Narrow scope or switch to Bulk Export
429 Per-tenant/global export rate limited Respect Retry-After Exponential backoff + jitter
503 Read store/integrity/object storage unavailable Wait for recovery Retry create/poll

Failure Modes

  • Retention/residency violation: service rejects with 400 type: .../policy.violation.
  • Legal hold conflict: either enforced inclusion or exclusion per policy; decision id returned via X-Policy-Decision-Id.
  • Webhook failure: job completes, callback retries with backoff; manifest always retrievable via GET.

Recovery Procedures

  1. For 409, poll job until terminal; then retry with new job if needed.
  2. For 503/429, back off using Retry-After; do not alter request to preserve idempotency.
  3. Use resumeToken to continue aborted jobs without duplicating parts.

Performance Characteristics

Latency Expectations

  • Time-to-first-part p95 ≤ 30 s for typical 24–48h windows.
  • Per-part write steady-state throughput aligned with object storage (100–500 MiB/s aggregate across workers).

Throughput Limits

  • Per tenant: ≤ 2 concurrent running jobs (configurable).
  • Global: bounded by export workers × read replica capacity.

Resource Requirements

  • Read IOPS proportional to projected records; CPU for serialization/compression; memory for part buffers.

Scaling Considerations

  • Horizontal worker pool with fair-share per tenant.
  • Adaptive partSizeMiB and dynamic concurrency to maintain steady throughput.
  • Use seek pagination from Query Service to avoid deep offsets.

Security & Compliance

Authentication

  • OIDC JWT at Gateway; mTLS for service-to-service (optional).

Authorization

  • Require audit:export for tenant; enforce RLS in reads; verify x-tenant-id.

Data Protection

  • Parts stored with server-side encryption; presigned URLs time-limited and least-privilege.
  • Redaction/minimization applied if using Filtered export mode (optional flag).

Compliance

  • Enforce retention/residency and legal holds; include decision metadata in manifest.
  • Manifest contains integrity proof (Merkle root + signature) for end-to-end verification.

Monitoring & Observability

Key Metrics

Metric Type Description Alert Threshold
export_jobs_active gauge Running jobs count > tenant/global cap
export_bytes_written_total counter Cumulative bytes Trend/throughput
export_parts_total counter Parts produced
export_job_duration_seconds histogram Job runtime p95 > SLO
export_failures_total counter Failed jobs > 0 sustained
export_webhook_fail_total counter Callback failures spike alerts

Logging Requirements

  • Structured logs: tenant, jobId, range, filtersHash, format, partIndex, bytes, records, watermark, integrity.merkleRoot, decisionId (if policy applied). No raw PII.

Distributed Tracing

  • Spans: export.create, query.page, serialize.chunk, compress, object.put, integrity.seal, webhook.post.
  • Attributes: tenant, format, partSizeMiB, parts, bytes, lagSec.

Health Checks

  • Readiness: access to Read Store, Integrity, Object Storage; signing keys loaded.
  • Liveness: worker queue depth within bounds; no stuck jobs.

Operational Procedures

Deployment

  1. Provision object storage buckets and KMS keys; configure presign service.
  2. Deploy Export Service and register /export/v1/* routes.
  3. Validate end-to-end export on a test tenant (JSONL + Parquet).

Configuration

  • Env: EXPORT_MAX_CONCURRENCY_PER_TENANT, EXPORT_DEFAULT_PART_MIB, EXPORT_MAX_PART_MIB, WEBHOOK_SIGNING_KID, PRESIGN_TTL_SEC.
  • SLOs: define job duration targets per size window.

Maintenance

  • Rotate signing keys and presign credentials; prune expired parts/manifests.
  • Rehearse DR: re-run export from resumeToken after worker failover.

Troubleshooting

  • Slow jobs → check read replica load, part size too small/large, compression CPU bound.
  • Frequent 409 conflicts → review client workflow (don’t resume running jobs).
  • Webhook failures → verify DNS/TLS; use manual manifest retrieval.

Testing Scenarios

Happy Path Tests

  • Create job with 24h range → parts produced; manifest includes merkle root/signature.
  • Presigned URLs download successfully; counts/bytes match manifest.

Error Path Tests

  • 400 on invalid range/filters/format; 404 on unknown jobId; 409 on resume while running.
  • 429/503 lead to client backoff and eventual success.

Performance Tests

  • Validate time-to-first-part p95 ≤ 30 s under nominal load.
  • Confirm linear scaling with worker count up to configured cap.

Security Tests

  • audit:export scope enforced; cross-tenant access blocked.
  • Presigned URLs expire and are scoped to objects; encryption at rest verified.
  • Manifest signature verifies against Integrity public key.

Internal References

  • Legal Hold Export Flow
  • eDiscovery Export Flow
  • Bulk Export Flow
  • Audit Record Projection Update Flow

External References

  • RFC 4180 (CSV, if supported), JSON Lines spec, Parquet format spec
  • W3C Trace Context; RFC 7807 (Problem Details)

Appendices

A. Example Problem+JSON (retention violation)

{
  "type": "urn:connectsoft:errors/export/policy.violation",
  "title": "Retention policy violation",
  "status": 400,
  "detail": "Requested 'from' precedes tenant retention window.",
  "traceId": "9f0c1d2e3a4b5c6d...",
  "errors": [{"pointer": "/range/from", "reason": "before-retention-start"}]
}

B. Webhook Payload (HMAC signed)

{
  "event": "Export.Completed",
  "jobId": "exp_01JECXYZ...",
  "tenant": "acme",
  "manifestUrl": "https://api.../export/v1/jobs/exp_01JECXYZ.../manifest",
  "status": "Completed",
  "signature": {"alg":"HMAC-SHA256","kid":"wh-2025","ts":"2025-10-22T12:31:02Z","sig":"b64..."}
}
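
A receiver should verify the callback before acting on it. A minimal sketch, assuming the signing input is `ts` + "." + raw request body (a common convention; the platform's exact canonical form is defined by its webhook spec):

```python
import base64
import hashlib
import hmac

def verify_webhook(raw_body: bytes, ts: str, sig_b64: str, secret: bytes) -> bool:
    """Recompute HMAC-SHA256 over '<ts>.<body>' and compare in constant time."""
    mac = hmac.new(secret, ts.encode() + b"." + raw_body, hashlib.sha256).digest()
    return hmac.compare_digest(base64.b64encode(mac).decode(), sig_b64)
```

Binding the timestamp into the MAC lets the receiver also reject stale deliveries (replay protection) by checking `ts` against a freshness window.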

Legal Hold Export Flow

Export of audit data subject to active Legal Holds. The LegalHold Service validates scope and policy, instructs the Export Service to run a hold-compliant export, embeds integrity evidence (integrity root, hold decision metadata, and optional per-part/per-record proofs) in a signed manifest, delivers via secure presigned URLs and/or webhook, and emits completion events. Holds continue to block purge, and all actions are themselves audited.


Overview

Purpose: Produce a defensible, tamper-evident export of all records covered by one or more active Legal Holds for a tenant (or set of scopes).
Scope: Hold resolution & validation, compliance decision capture, hold-aware query scoping, chunked packaging, integrity & proof inclusion policy, secure delivery, resume/cancel, and auditable completion. Excludes non-hold exports (see Standard Export Flow).
Context: Builds on the Export Service and Integrity Service; queries the Read Store (projections) with server-side filters derived from LegalHold definitions and their current Revision.
Key Participants:

  • Legal Team / Client
  • API Gateway
  • LegalHold Service (hold registry, scope/eligibility, decisioning)
  • Export Service (orchestrator, packaging)
  • Query Service / Read Store (tenant-scoped reads)
  • Integrity Service (Merkle roots, signatures)
  • Delivery Backend (object storage, presigned URLs)
  • Webhook Receiver (optional callback endpoint)

Prerequisites

System Requirements

  • Gateway with TLS + JWT validation
  • LegalHold Service reachable; hold registry & revisioning enabled
  • Export Service has access to Read Store, Integrity, Delivery Backend
  • Webhook signing keys/KMS available if callbacks are used

Business Requirements

  • Target LegalHold exists and is Active (not Released)
  • Tenant retention/residency policies configured; hold implies purge block
  • Operator runbook for evidence requests and key rotation

Performance Requirements

  • p95 time-to-first-part ≤ 45 s for typical hold scopes
  • Concurrency caps per tenant and per hold to avoid read hot spots
  • Indexes support hold filters (resource/action/time) efficiently

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    actor Legal as Legal Team
    participant GW as API Gateway
    participant LHS as LegalHold Service
    participant EXP as Export Service
    participant Q as Query Service / Read Store
    participant INT as Integrity Service
    participant OBJ as Delivery Backend
    participant WH as Webhook (optional)

    Legal->>GW: POST /legal-hold/v1/exports {holdId, format, partSize, proofMode, webhook?}
    GW->>LHS: Validate authN/Z, fetch hold(holdId) + current Revision
    LHS-->>GW: 200 {holdSnapshot:{id, revision, scope, status:Active}}
    GW->>EXP: Create export job (mode: LEGAL_HOLD, holdSnapshot, proofMode)
    EXP-->>Legal: 202 Accepted {jobId, status:"Running"} (job continues async)
    EXP->>Q: Open scoped cursor using holdSnapshot.scope (tenant, filters, time)
    loop Chunk until exhausted
        Q-->>EXP: Page of rows + next cursor
        EXP->>INT: Add leaves to integrity segment (per-part proofs if requested)
        EXP->>OBJ: PUT part (JSONL/Parquet, optional gzip)
        EXP->>EXP: Record part metadata + resumeToken
    end
    EXP->>INT: Seal block → MerkleRoot + signature
    EXP->>EXP: Build signed ExportManifest {parts, counts, bytes, holdSnapshot, proofPolicy, merkleRoot, signature}
    Legal->>GW: GET /legal-hold/v1/exports/{jobId} (poll until Completed)
    alt webhook configured
        EXP->>WH: POST Export.Completed {jobId, manifestUrl, holdSnapshot, signature}
    end

Alternative Paths

  • Multiple holds: request {holdIds:[...]}; LHS returns merged scope (union) and aggregated decision id(s).
  • Incremental export: sinceDecisionId or sinceWatermark to export only new/changed covered records.
  • Client-provided storage: delivery.mode=client-presigned-put with per-part presigned PUT URLs.
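
The merged scope (union) for multiple holds can be sketched as follows (hypothetical helper; the scope shape is drawn from the manifest excerpt below, with a missing from/to treated as unbounded):

```python
def merge_hold_scopes(scopes: list[dict]) -> dict:
    """Union resourceTypes and widen the time window across hold scopes.
    A missing 'from'/'to' means unbounded and stays unbounded in the union."""
    types: set[str] = set()
    froms, tos = [], []
    bounded_from = bounded_to = True
    for s in scopes:
        types.update(s.get("resourceTypes", []))
        t = s.get("time", {})
        if "from" in t:
            froms.append(t["from"])
        else:
            bounded_from = False
        if "to" in t:
            tos.append(t["to"])
        else:
            bounded_to = False
    time: dict = {}
    if bounded_from and froms:
        time["from"] = min(froms)  # ISO-8601 UTC strings sort lexically
    if bounded_to and tos:
        time["to"] = max(tos)
    return {"resourceTypes": sorted(types), "time": time}
```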

Error Paths

sequenceDiagram
    actor Legal
    participant GW as API Gateway
    participant LHS as LegalHold Service

    Legal->>GW: POST /legal-hold/v1/exports {holdId:"unknown"}
    alt Bad request (malformed payload/params)
        GW-->>Legal: 400 Bad Request (Problem+JSON)
    else Hold not found or not Active
        GW->>LHS: GET hold(holdId)
        LHS-->>GW: 404/409 (Released|NotFound)
        GW-->>Legal: 404/409 Problem+JSON
    else Conflict with hold revision (If-Match mismatch)
        GW-->>Legal: 412 Precondition Failed (Problem+JSON)
    else Rate limited / dependency down
        GW-->>Legal: 429/503 (Retry-After)
    end

Request/Response Specifications

Input Requirements

Field Type Req Description Validation
Method/Path POST /legal-hold/v1/exports Y Create a hold-governed export job JSON body
Authorization header Y Bearer <JWT> Valid, not expired
x-tenant-id header Y Tenant scope Must match hold tenant
traceparent header O W3C trace context 55-char
holdId string Y Target legal hold id Exists & status=Active
ifMatch header O Expected holdRevision (optimistic) Matches current revision
format enum O jsonl (default), parquet allowlist
compression enum O none, gzip allowlist
partSizeMiB int O Target part size 16–1024, default 256
proofMode enum O manifest-only, per-part, per-record allowlist
webhook.url/webhook.secretId string O Completion callback + signing HTTPS + known KMS key
delivery.mode enum O presigned-get, client-presigned-put allowlist

Output Specifications

Create — 202 Accepted

Field Type Description
jobId string Server-assigned id
status enum Queued | Running
holdSnapshot object {id, revision, scope, decidedAt, decisionId}
proofPolicy object {mode, algorithm, keyId}
pollUrl / manifestUrl url Where to poll/fetch manifest

Manifest (excerpt)

{
  "jobId": "exp_01JF3…",
  "mode": "LEGAL_HOLD",
  "tenant": "acme",
  "holdSnapshot": {
    "id": "lh_2025_001",
    "revision": 7,
    "scope": {"resourceTypes":["Case.File","Iam.User"],"time":{"from":"2025-09-01T00:00:00Z"}},
    "decidedAt": "2025-10-10T12:01:22Z",
    "decisionId": "lhdec_8a12…"
  },
  "proofPolicy": {"mode":"per-part","algorithm":"Ed25519","keyId":"int-key-2025"},
  "integrity": {"merkleRoot":"8a4f…","signature":{"alg":"Ed25519","kid":"int-key-2025","sig":"MEQC…"}},
  "parts":[{"index":0,"url":"https://…/p0.gz","bytes":268435456,"records":100000,"etag":"\"abc123\""}]
}

Error Handling

Error Scenarios

HTTP Code Scenario Recovery Action Retry Strategy
400 Malformed body; unsupported format/proofMode; invalid partSizeMiB Correct request No retry until fixed
401 Missing/invalid JWT Acquire valid token Retry after renewal
403 Missing audit:legalhold.export or tenant mismatch Request proper role/scope No retry
404 holdId not found (or not in tenant) Verify hold/tenant
409 Hold status not Active (e.g., Released); job state conflict on resume/cancel Activate/select correct hold; create new job Retry after fix
412 If-Match revision mismatch (hold updated mid-flight) Re-fetch hold; restart with new revision Retry with new precondition
429 Per-tenant/global rate limit Respect Retry-After Backoff + jitter
503 Read store/Integrity/Delivery unavailable Wait for recovery Retry idempotently

Failure Modes

  • Hold mutated during export: precondition fails (412) to ensure defensibility; job halts.
  • Policy violation (residency/retention): 400 .../policy.violation with decisionId.
  • Webhook delivery failure: job completes; callback retried with backoff; manifest always retrievable.

Recovery Procedures

  1. On 412, fetch latest holdSnapshot and recreate the job.
  2. On 503/429, back off; use the server-provided resumeToken to continue.
  3. If policy violation, adjust scope with Legal team; re-request.

Performance Characteristics

Latency Expectations

  • Time-to-first-part p95 ≤ 45 s for typical holds.
  • Steady-state throughput bounded by read replicas and object storage.

Throughput Limits

  • Per hold: 1–2 concurrent jobs (configurable).
  • Per tenant: combined cap across holds/exports to preserve SLOs.

Resource Requirements

  • CPU for serialization/compression; memory for part buffers; IOPS for scans.

Scaling Considerations

  • Shard by tenant; sequence chunks with seek pagination.
  • Prefer per-part proofs for balance of size vs. verifiability; per-record for high-assurance cases only.

Security & Compliance

Authentication

  • OIDC JWT at Gateway; optional mTLS between services.

Authorization

  • Require audit:legalhold.read to resolve holds and audit:legalhold.export to create jobs.
  • Enforce RLS on reads; verify x-tenant-id vs hold tenant.

Data Protection

  • Parts encrypted at rest; presigned URLs are short-lived, least-privilege; webhook payloads HMAC-signed.
  • Redaction/minimization may still apply if configured for hold exports (jurisdictional constraint).

Compliance

  • Holds block purge throughout job lifetime; export does not weaken hold.
  • Manifest includes holdSnapshot (id, revision, decisionId) and integrity proofs per proofPolicy.
  • All requests emit audit entries (who, when, purpose, hold ids, decision ids).

Monitoring & Observability

Key Metrics

Metric Type Description Alert Threshold
lh_export_jobs_active gauge Running hold exports > cap per tenant/hold
lh_export_job_duration_seconds histogram Runtime per job p95 > SLO
lh_hold_revision_conflicts_total counter 412 preconditions hit Spike indicates frequent edits
lh_export_bytes_written_total counter Bytes exported under holds Trend/forecast
lh_export_failures_total counter Failed jobs > 0 sustained

Logging Requirements

  • Structured logs: tenant, holdId, holdRevision, jobId, decisionId, proofMode, partIndex, bytes, records, watermark. No raw PII.

Distributed Tracing

  • Spans: legalhold.resolve, export.create, query.page, integrity.seal, object.put, webhook.post.
  • Attributes: holdId, revision, proofMode, parts, bytes.

Health Checks

  • Readiness: LHS/Read Store/Integrity/Delivery reachable; signing keys loaded.
  • Liveness: worker queues healthy; no stuck jobs; purge-block signal latched for hold.

Operational Procedures

Deployment

  1. Deploy LegalHold Service & /legal-hold/v1/exports route behind Gateway.
  2. Configure KMS keys for manifest/proof signing and webhook HMAC.
  3. Validate end-to-end on a test hold (Active → export → Completed).

Configuration

  • Env: LH_EXPORT_MAX_CONCURRENCY, EXPORT_DEFAULT_PART_MIB, PROOF_DEFAULT_MODE, PRESIGN_TTL_SEC, WEBHOOK_SIGNING_KID.
  • Policy: toggle allowPerRecordProofs by edition/regulatory need.

Maintenance

  • Rotate signing keys; prune expired presigned URLs and old manifests per policy.
  • Periodically reconcile hold purge-block flags across stores.

Troubleshooting

  • 412 spikes → educate counsel/operators to avoid modifying holds during exports; rely on ifMatch.
  • Slow jobs → check read replica load, part size, compression CPU.
  • Webhook failures → review TLS/HMAC configuration; fall back to polling manifestUrl.

Testing Scenarios

Happy Path Tests

  • Active holdId export produces parts and manifest with holdSnapshot, merkleRoot, signature.
  • Proof policy per-part includes per-part proofs; manifest-only includes only root/signature.

Error Path Tests

  • 400 on unsupported proofMode/invalid partSizeMiB.
  • 404 on unknown holdId.
  • 409 when hold status is Released.
  • 412 when ifMatch revision mismatches.
  • 429/503 cause compliant backoff and resume.

Performance Tests

  • Time-to-first-part p95 ≤ 45 s under nominal load.
  • Linear scaling with additional workers up to cap.

Security Tests

  • RBAC scopes enforced; cross-tenant blocked.
  • Presigned URLs expire; webhook HMAC validates.
  • Manifest signature verifies with Integrity public key.

Internal References

  • Standard Export Flow
  • Legal Hold Processing Flow
  • Compliance Audit Flow

External References

  • RFC 7807 (Problem Details)
  • W3C Trace Context
  • Organization Legal Hold & Evidence Handling Policy

Appendices

A. Example Problem+JSON (hold released)

{
  "type": "urn:connectsoft:errors/legalhold/status.invalid",
  "title": "Hold is not active",
  "status": 409,
  "detail": "Legal hold 'lh_2025_001' is Released (rev=7).",
  "traceId": "9f0c1d2e3a4b5c6d...",
  "errors": [{"pointer": "/holdId", "reason": "released"}]
}

B. Proof Inclusion Policy Options

  • manifest-only: single MerkleRoot + signature in manifest.
  • per-part: each part contains a subtree root; manifest maps parts→proofs.
  • per-record (high assurance): each line embeds leaf hash or side proof; larger output, strongest verification.
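
All three modes rest on a standard Merkle construction. A minimal sketch (SHA-256, duplicating the last node on odd levels; the platform's actual tree rules live in the Integrity spec):

```python
import hashlib

def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    """Fold leaf hashes pairwise up to a single root."""
    level = [_h(l) for l in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate last node on odd levels
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def inclusion_proof(leaves: list[bytes], index: int) -> list[bytes]:
    """Sibling hashes from the leaf at `index` up to the root."""
    level = [_h(l) for l in leaves]
    proof = []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        proof.append(level[index ^ 1])  # sibling at this level
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return proof

def verify_inclusion(leaf: bytes, index: int, proof: list[bytes], root: bytes) -> bool:
    """Recompute the path from leaf to root; order depends on left/right position."""
    node = _h(leaf)
    for sib in proof:
        node = _h(sib + node) if index % 2 else _h(node + sib)
        index //= 2
    return node == root
```

per-record mode ships one such proof per line; per-part mode ships one subtree root per part, keeping bundle size proportional to parts rather than records.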

eDiscovery Export Flow

Generates a forensically defensible export tailored for eDiscovery: runs a scoped export, computes a signed ExportManifest, invokes KMS/HSM to produce a detached signature over the manifest and Merkle root, and assembles an Integrity Bundle (manifest + proofs + public key material) for delivery.


Overview

Purpose: Provide legal/forensic teams with a complete, tamper-evident export that includes a signed manifest and Merkle proofs suitable for independent verification.
Scope: Job creation, scoped read, manifest construction, Merkle tree computation, KMS signing, bundle packaging (ZIP/TAR.GZ), delivery via presigned URLs or webhook, and completion event. Excludes hold-governed constraints (see Legal Hold Export Flow) and generic on-demand exports (see Standard Export Flow).
Context: Builds on Export Service and Integrity Service with KMS/HSM for signing. Reads from Read Store via Query Service.
Key Participants:

  • eDiscovery Client (case management/tooling)
  • API Gateway
  • Export Service (orchestrator, packaging)
  • Query Service / Read Store (scoped reads)
  • Integrity Service (Merkle computation)
  • KMS/HSM (key management, signing)
  • Delivery Backend (object storage, presigned URLs)
  • Webhook Receiver (optional)

Prerequisites

System Requirements

  • Gateway with TLS + JWT validation
  • Export & Integrity Services deployed; integration with KMS/HSM configured (key IDs, policies)
  • Read Store accessible with RLS by tenantId
  • Object storage bucket for parts, manifest, and bundle

Business Requirements

  • Tenant’s retention/residency policies defined and enforced
  • eDiscovery caseId lifecycle managed (optional, but recommended)
  • Operator runbook for key rotation & signature verification

Performance Requirements

  • p95 time-to-manifest ≤ 30 s for typical 24–48h scopes
  • Bundle assembly completes ≤ 60 s after final part upload
  • Per-tenant export concurrency capped to protect read replicas

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    actor EDC as eDiscovery Client
    participant GW as API Gateway
    participant EXP as Export Service
    participant Q as Query Service / Read Store
    participant INT as Integrity Service
    participant KMS as KMS/HSM (Signer)
    participant OBJ as Delivery Backend
    participant WH as Webhook (optional)

    EDC->>GW: POST /ediscovery/v1/exports {tenant, caseId, range, filters, format, proofMode, bundle:{type}}
    GW->>EXP: Create export job (mode: EDISCOVERY) + params
    EXP-->>EDC: 202 Accepted {jobId, status:"Running"} (job continues async)
    EXP->>Q: Open scoped cursor (tenant/from-to/filters)
    loop Stream pages → parts
        Q-->>EXP: Page of rows + next cursor
        EXP->>INT: Update Merkle segment with leaf hashes
        EXP->>OBJ: PUT part (JSONL/Parquet, optional gzip)
        EXP->>EXP: Track part metadata (index, bytes, records, ETag)
    end
    EXP->>INT: Seal block → {merkleRoot}
    EXP->>EXP: Build ExportManifest {parts, counts, bytes, watermarks, merkleRoot}
    EXP->>KMS: Sign canonicalized(manifest) + merkleRoot → {signature, kid, alg}
    EXP->>OBJ: PUT manifest.json and manifest.sig
    EXP->>EXP: Assemble Integrity Bundle (manifest, signature, publicKey/chain, optional proofs)
    EXP->>OBJ: PUT bundle (bundle.zip/.tar.gz) → bundleUrl
    EDC->>GW: GET /ediscovery/v1/exports/{jobId} (poll until Completed)
    alt webhook configured
        EXP->>WH: POST Export.Completed {jobId, bundleUrl, manifestUrl, signature}
    end

Alternative Paths

  • Proof modes: manifest-only (root+sig), per-part (subtree proofs), per-record (leaf proofs; larger bundle).
  • Client-provided storage: delivery.mode=client-presigned-put for manifest/parts/bundle.
  • Re-sign: POST /ediscovery/v1/exports/{jobId}:resign {kid} to reissue signature with a rotated key (no data rewrite).

Error Paths

sequenceDiagram
    actor EDC
    participant GW as API Gateway
    participant EXP as Export Service

    EDC->>GW: POST /ediscovery/v1/exports {invalid params}
    alt Malformed request / unsupported proofMode/format
        GW-->>EDC: 400 Bad Request (Problem+JSON)
    else Unknown tenant / route
        GW-->>EDC: 404 Not Found (Problem+JSON)
    else Conflict (resign while job running, or bundle requested before complete)
        GW-->>EDC: 409 Conflict (Problem+JSON)
    else Unauthorized / Forbidden
        GW-->>EDC: 401/403 (Problem+JSON)
    else Backpressure / dependency down
        GW-->>EDC: 429/503 (Retry-After)
    end

Request/Response Specifications

Input Requirements

Field Type Req Description Validation
Method/Path POST /ediscovery/v1/exports Y Create eDiscovery export JSON body
Authorization header Y Bearer <JWT> Valid, not expired
x-tenant-id header Y* Tenant scope Must match tenant
traceparent header O W3C trace context 55-char
tenant string Y Target tenant ^[A-Za-z0-9._-]{1,128}$
caseId string O eDiscovery case identifier ≤ 128 chars
range object O {from?, to?} ISO-8601 UTC from ≤ to, retention bounds
filters object O Allowlisted predicates Server validated
format enum O jsonl (default), parquet allowlist
compression enum O none, gzip (default) allowlist
proofMode enum O manifest-only, per-part, per-record allowlist
bundle.type enum O zip (default), tar.gz allowlist
kms.kid string O Key id for signing Must exist in KMS
delivery.mode enum O presigned-get, client-presigned-put allowlist
webhook.url/webhook.secretId string O Completion callback + HMAC key HTTPS + known key
  • Header may be omitted if using path variant /tenants/{tenantId}/ediscovery/exports.

Output Specifications

Create — 202 Accepted

Field Type Description
jobId string Server-assigned id (ULID/GUID)
status enum Queued | Running
pollUrl url GET /ediscovery/v1/exports/{jobId}
manifestUrl url? Available once ready
bundleUrl url? Available once ready

Get — 200 OK

Field Type Description
jobId string Identifier
status enum Queued | Running | Sealing | Signing | Bundling | Completed | Failed | Canceled
counts object {records, parts}
bytes object {written}
merkleRoot string Hex/base64url root
signature object? {alg,kid,sig} once signed
manifestUrl / bundleUrl url? Delivery endpoints
resumeToken string? For resume/retry
startedAt/finishedAt timestamp ISO-8601 UTC

Integrity Bundle Contents (concept)

bundle/
  manifest.json
  manifest.sig            # COSE_Sign1 or JWS (detached)
  integrity/
    root.json             # { merkleRoot, algorithm, createdAt }
    proofs/               # per-part or per-record .proof files (optional)
  keys/
    publicKey.pem         # PEM or JWK
    key-metadata.json     # { kid, alg, issuer, notBefore, notAfter }
  README.txt              # verification instructions

Manifest (excerpt)

{
  "jobId": "exp_01JFG2...",
  "mode": "EDISCOVERY",
  "tenant": "acme",
  "caseId": "CASE-2025-0421",
  "range": {"from":"2025-10-01T00:00:00Z","to":"2025-10-22T23:59:59Z"},
  "format": "jsonl",
  "compression": "gzip",
  "parts": [
    {"index":0,"url":"https://.../p0.gz","bytes":268435456,"records":100000,"etag":"\"abc123\""}
  ],
  "counts":{"records":250000,"parts":3},
  "bytes":{"written":734003200},
  "integrity":{"merkleRoot":"8a4f...","algorithm":"sha256","createdAt":"2025-10-22T12:30:12Z"},
  "createdAt":"2025-10-22T12:30:12Z",
  "watermark":"2025-10-22T12:25:00Z"
}

Error Handling

Error Scenarios

HTTP Code Scenario Recovery Action Retry Strategy
400 Malformed body; invalid range/filters; unsupported proofMode/format/bundle.type; unknown kms.kid Correct request/params No retry until fixed
401 Missing/invalid JWT Acquire valid token Retry after renewal
403 Missing audit:ediscovery.export or tenant mismatch Request proper scope/role No retry
404 Tenant/route not found; jobId unknown; manifest/bundle not available Verify tenant/IDs; wait for completion
409 Bundle requested before job complete; resign while signing; resume on terminal job Poll until terminal; create new job Retry after fix
412 If-Match on manifest version failed (re-signed) Fetch latest manifest; retry Retry with new ETag
429 Per-tenant/global export rate limited Respect Retry-After Exponential backoff + jitter
503 Read store/Integrity/KMS/Object storage unavailable Wait for recovery Retry idempotently

Failure Modes

  • KMS key disabled/rotated: signing fails → 503; operator selects new kid or uses :resign.
  • Proof volume blow-up with per-record mode on huge jobs → 413/422 with guidance to switch to per-part.
  • Residency/retention policy violation → 400 .../policy.violation (decision id included).

Recovery Procedures

  1. On 409, poll job status until Completed then fetch manifestUrl/bundleUrl.
  2. On 503/429, back off and use resumeToken to continue without duplicating parts.
  3. On signature/key issues, re-run :resign with a valid kms.kid.

Performance Characteristics

Latency Expectations

  • Time-to-manifest p95 ≤ 30 s for typical scopes; bundling overhead ≤ 60 s.

Throughput Limits

  • Per tenant: ≤ 2 concurrent eDiscovery jobs (configurable).
  • Global: limited by export workers, KMS QPS, and object storage throughput.

Resource Requirements

  • CPU for serialization/compression; memory for part buffers and proof generation; KMS signing latency budget (p95 ≤ 100 ms).

Scaling Considerations

  • Horizontal export workers; bound KMS concurrency; stream proof files to avoid large in-memory structures.
  • Prefer per-part proofs for balance of size vs. verifiability.

Security & Compliance

Authentication

  • OIDC JWT at Gateway; optional mTLS for service-to-service calls.

Authorization

  • Require audit:ediscovery.export; enforce RLS on reads; verify x-tenant-id.

Data Protection

  • Object storage encryption at rest; time-limited presigned URLs; webhook payloads HMAC-signed.
  • No raw secret material in logs; public keys shipped as JWK/PEM inside bundle only.
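
The HMAC signing of webhook payloads mentioned above can be checked on the receiver side roughly as follows; the hex-digest encoding and header placement are assumptions to be matched against the actual webhook contract:

```python
import hashlib
import hmac

def verify_webhook(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Constant-time check of an HMAC-SHA256 signature over the raw body.

    Assumes the sender computes HMAC-SHA256(secret, body) and transmits
    the hex digest alongside the payload (e.g. in a header).
    """
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)
```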

Compliance

  • Manifest + signature + proofs enable independent verification.
  • Include watermark (projection snapshot time) and caseId in manifest for chain-of-custody.
  • Emit audit entries for create/resume/resign/bundle fetch actions.

Monitoring & Observability

Key Metrics

Metric Type Description Alert Threshold
ediscovery_jobs_active gauge Running jobs > tenant/global cap
manifest_build_duration_ms histogram Build + sign time p95 > 30s
kms_sign_latency_ms histogram KMS sign call p95 > 100 ms
bundle_bytes_total counter Size of bundles Trend/forecast
ediscovery_failures_total counter Failed jobs > 0 sustained

Logging Requirements

  • Structured logs: tenant, caseId, jobId, merkleRoot, kid, proofMode, parts, bytes, watermark.
  • Do not log raw proofs or presigned URLs.

Distributed Tracing

  • Spans: export.create, query.page, integrity.seal, kms.sign, bundle.pack, object.put, webhook.post.
  • Attributes: kid, proofMode, bundleType, parts, bytes.

Health Checks

  • Readiness: KMS key available, Integrity & Object storage reachable.
  • Liveness: worker queues healthy; no stuck Signing/Bundling states.

Operational Procedures

Deployment

  1. Configure KMS key(s) and kid mapping; verify sign/verify path in staging.
  2. Deploy /ediscovery/v1/exports route; ensure buckets and presign service are ready.
  3. Validate end-to-end: create job → manifest signed → bundle downloadable and verifiable.

Configuration

  • Env: EXPORT_MAX_CONCURRENCY_PER_TENANT, EXPORT_DEFAULT_PART_MIB, PROOF_DEFAULT_MODE, KMS_DEFAULT_KID, PRESIGN_TTL_SEC.
  • Policies: enforce retention/residency on the export scope.

Maintenance

  • Rotate KMS keys; support :resign to reissue signatures.
  • Prune expired presigned URLs and old bundles per policy.

Troubleshooting

  • High kms_sign_latency_ms → check KMS limits/region; enable key caching.
  • Large bundles/timeouts → switch to per-part proofs; increase part size.
  • 409 conflicts → ensure clients poll status before requesting bundle/resign.

Testing Scenarios

Happy Path Tests

  • Create eDiscovery export with proofMode=per-part → manifest + signature + bundle available; verification succeeds.
  • resign with new kid produces new manifest.sig without rewriting parts.

Error Path Tests

  • 400 on invalid proofMode/format/bundle.type or bad range.
  • 404 on unknown jobId or bundle before creation.
  • 409 when requesting bundle before completion or resign during signing.
  • 429/503 trigger compliant backoff and resume.

Performance Tests

  • Time-to-manifest p95 ≤ 30 s; bundling overhead ≤ 60 s under nominal load.
  • KMS signing p95 ≤ 100 ms for 95% of signatures.

Security Tests

  • RBAC scope audit:ediscovery.export enforced; cross-tenant blocked.
  • Manifest signature verifies with exported public key (JWK/PEM).
  • Presigned URLs expire and are least-privilege.

Internal References

External References

  • COSE (RFC 8152) / JWS (RFC 7515) for signatures
  • W3C Trace Context; RFC 7807 (Problem Details)

Appendices

A. Example manifest.sig (JWS detached)

{
  "protected": "eyJhbGciOiJFZDI1NTE5Iiwia2lkIjoiaW50LWtleS0yMDI1In0",
  "signature": "L5Jq...cQ"
}
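
The `protected` value is base64url-encoded JSON; decoding it exposes the algorithm and key id a verifier must use (note that RFC 7515/8037 register the EdDSA algorithm name as `EdDSA`; this example carries `Ed25519` literally). A minimal decoding sketch:

```python
import base64
import json

def decode_protected(b64url: str) -> dict:
    """Decode a JWS protected header, re-adding stripped base64url padding."""
    padded = b64url + "=" * (-len(b64url) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))

header = decode_protected("eyJhbGciOiJFZDI1NTE5Iiwia2lkIjoiaW50LWtleS0yMDI1In0")
```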

B. Verification Outline

  1. Download manifest.json, manifest.sig, and keys/publicKey.pem.
  2. Verify signature over canonicalized manifest (UTF-8, no BOM).
  3. Recompute Merkle root from all part proofs (if provided) and compare to manifest.integrity.merkleRoot.
  4. Spot-verify a subset of parts/records using proofs/*.proof.
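
Step 3 can be sketched as below, assuming a plain binary SHA-256 tree in which an odd trailing node is promoted unchanged; the real leaf encoding, tree shape, and any domain separation are defined by the Integrity spec and must be matched exactly:

```python
import hashlib

def merkle_root(leaves: list[bytes]) -> str:
    """Recompute a binary SHA-256 Merkle root over part/record contents."""
    level = [hashlib.sha256(leaf).digest() for leaf in leaves]
    while len(level) > 1:
        nxt = [hashlib.sha256(level[i] + level[i + 1]).digest()
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # odd node: promote to the next level
            nxt.append(level[-1])
        level = nxt
    return level[0].hex()
```

The result is compared byte-for-byte against `manifest.integrity.merkleRoot`.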

Bulk Export Flow

Scheduled or ad-hoc large-scale exports that split a wide scope into time/key slices, run them in parallel across a controlled worker pool, write results as multiple packages (parts/bundles), and support resume tokens for fault-tolerant continuation. Exposes explicit SLA/throughput metrics and enforces per-tenant/global concurrency limits.


Overview

Purpose: Efficiently export very large datasets (days/months of audit events) on a schedule or on demand, with parallelization, resumability, and integrity/manifest generation.
Scope: Scheduler, job creation, slicing strategy (time/partition), parallel workers, packaging (JSONL/Parquet, gzip), resume/cancel, integrity sealing, delivery via presigned URLs/webhook, and metrics. Excludes hold-specific rules (see Legal Hold Export Flow) and eDiscovery signing options (see eDiscovery Export Flow).
Context: Orchestrated by Export Service with a Scheduler; reads from Read Store via Query Service; uses Integrity Service for Merkle roots/signatures and Object Storage for parts/bundles.
Key Participants:

  • Scheduler (cron/rrule, “run now”)
  • API Gateway
  • Export Service (orchestrator, slicer, worker pool)
  • Query Service / Read Store (tenant-scoped scans)
  • Integrity Service (hash/merkle/seal)
  • Object Storage (parts, manifests, bundles)
  • Webhook Receiver (optional callbacks)
  • Metrics/Tracing Backend

Prerequisites

System Requirements

  • Gateway with TLS + JWT; Export Service reachable by Scheduler
  • Read Store with RLS by tenantId; seek pagination available
  • Integrity Service & Object Storage configured (KMS keys, buckets)
  • Clock skew controls; partition catalog available for slicing

Business Requirements

  • Tenant retention/residency policies configured and enforced
  • Export feature/edition enabled; per-tenant concurrency limits defined
  • Optional webhook signing keys provisioned

Performance Requirements

  • Target throughput per worker (e.g., 50–150 MB/s effective)
  • Time-to-first-part p95 ≤ 60 s for bulk slice runs
  • Slice width chosen to keep slice p95 ≤ 10–20 min under load

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    actor Sched as Scheduler
    participant GW as API Gateway
    participant EXP as Export Service (Orchestrator)
    participant SL as Slicer / Planner
    participant WP as Worker Pool
    participant Q as Query Service / Read Store
    participant INT as Integrity Service
    participant OBJ as Object Storage
    participant WH as Webhook (optional)

    Sched->>GW: POST /export/v1/bulk-jobs {tenant, schedule, range, sliceWidth, format, partSize}
    GW->>EXP: Create/Upsert BulkJob
    loop On schedule tick or run-now
        EXP->>SL: Plan slices for window (time/partition)
        SL-->>EXP: [Slice#0..Slice#N] + dependencies
        par N parallel slices (bounded by concurrency caps)
            EXP->>WP: Dispatch Slice#i {cursor, sliceWindow, resumeToken?}
            WP->>Q: Stream pages via seek pagination
            Q-->>WP: Rows + next cursor
            WP->>INT: Append leaf hashes, update merkle segment
            WP->>OBJ: PUT part(s) (JSONL/Parquet, gzip?)
            WP->>EXP: Report progress {bytes, records, partMeta, resumeToken}
        end
        EXP->>INT: Seal slice block → MerkleRoot + signature
        EXP->>OBJ: PUT slice manifest, update BulkJob manifest index
        alt webhook configured
            EXP->>WH: POST Export.SliceCompleted {jobId, sliceId, manifestUrl}
        end
    end
    EXP->>OBJ: PUT final Bulk Manifest (index of slice manifests) + signature
    EXP-->>GW: 200/202 {jobId, status:"Completed", manifestUrl, stats}
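
The "stream pages via seek pagination" step above avoids OFFSET scans by carrying a (timestamp, id) cursor between pages. An in-memory sketch of one page; the real implementation is a store query along the lines of `WHERE (ts, id) > (:ts, :id) ORDER BY ts, id LIMIT :n`:

```python
def seek_page(rows, cursor, page_size=1000):
    """One page of keyset ('seek') pagination over (ts, id) keys.

    `rows` must be sorted by (ts, id); `cursor` is the last-seen key or
    None for the first page. Returns the page and the next cursor.
    """
    start = 0
    if cursor is not None:
        while start < len(rows) and (rows[start]["ts"], rows[start]["id"]) <= cursor:
            start += 1
    page = rows[start:start + page_size]
    next_cursor = (page[-1]["ts"], page[-1]["id"]) if page else None
    return page, next_cursor
```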

Alternative Paths

  • Run now: POST /export/v1/bulk-jobs/{id}:run-now triggers immediate cycle outside schedule.
  • Catch-up mode: planner advances by watermark; only exports new slices since last success.
  • Client-managed storage: use presigned PUT per slice/part.
  • Dynamic re-slicing: large slices auto-split if runtime exceeds threshold.
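
The planner's slicing step amounts to cutting the window into contiguous intervals; a sketch assuming the width was already parsed (e.g. `24h` → `timedelta(hours=24)`):

```python
from datetime import datetime, timedelta

def plan_slices(start: datetime, end: datetime,
                width: timedelta) -> list[tuple[datetime, datetime]]:
    """Plan contiguous [from, to) slices covering [start, end).

    The trailing slice is clipped to `end`; dynamic re-slicing would
    further split any slice whose runtime exceeds the threshold.
    """
    slices, cursor = [], start
    while cursor < end:
        upper = min(cursor + width, end)
        slices.append((cursor, upper))
        cursor = upper
    return slices
```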

Error Paths

sequenceDiagram
    actor Client
    participant GW as API Gateway
    participant EXP as Export Service

    Client->>GW: POST /export/v1/bulk-jobs {invalid config}
    alt Bad request (bad schedule/sliceWidth/partSize)
        GW-->>Client: 400 Problem+JSON
    else Unknown jobId / tenant route not found
        GW-->>Client: 404 Problem+JSON
    else Conflict (modify running job / duplicate schedule window)
        GW-->>Client: 409 Problem+JSON
    else Backpressure or deps down
        GW-->>Client: 429/503 Problem+JSON (+ Retry-After)
    end

Request/Response Specifications

Input Requirements

Field Type Req Description Validation
Create/Update POST /export/v1/bulk-jobs Y Create/Upsert bulk job JSON body
Authorization header Y Bearer <JWT> Valid, not expired
x-tenant-id header Y* Tenant scope Must match body.tenant
tenant string Y Target tenant ^[A-Za-z0-9._-]{1,128}$
range object O {from?, to?} for initial catch-up ISO-8601 UTC
schedule object O {cron:"0 2 * * *"} or {rrule:"RRULE:..."} validated
sliceWidth string O e.g., 24h, 7d, 1mo max per policy
format enum O jsonl (default) parquet
compression enum O none gzip (default)
partSizeMiB int O 16–1024 (default 256) bounds checked
maxParallelSlices int O Per-tenant concurrency cap ≤ tenant cap
webhook.url/secretId string O Completion callbacks HTTPS + known key
delivery.mode enum O presigned-get client-presigned-put

*Header may be omitted for /tenants/{tenantId}/export/bulk-jobs.

Control Endpoints

  • POST /export/v1/bulk-jobs/{id}:run-now
  • POST /export/v1/bulk-jobs/{id}:pause / :resume / :cancel
  • GET /export/v1/bulk-jobs/{id} (status, stats, current window, next run)
  • GET /export/v1/bulk-jobs/{id}/manifest (bulk manifest index)

Output Specifications

Field Type Description Notes
jobId string Bulk job identifier ULID/GUID
status enum Paused | Scheduled | Running | Completed | Failed | Canceled
currentSlice object? {sliceId, window, status, resumeToken} When running
stats object {slicesCompleted, bytes, records, parts} Cumulative
manifestUrl url? Bulk manifest index After completion
nextRunAt timestamp Next scheduled tick ISO-8601 UTC

Bulk Manifest Index (concept)

{
  "jobId":"bulk_01JH2…",
  "tenant":"acme",
  "schedule":"0 2 * * *",
  "slices":[
    {"sliceId":"s_2025_10_01","from":"2025-10-01T00:00:00Z","to":"2025-10-02T00:00:00Z","manifestUrl":"https://.../s_2025_10_01.manifest.json","merkleRoot":"8a4f...","signature":{"alg":"Ed25519","kid":"int-key-2025","sig":"MEQC..."}}  
  ],
  "counts":{"records":12003450,"parts":480},
  "bytes":{"written":358721987654},
  "createdAt":"2025-10-22T02:00:00Z",
  "completedAt":"2025-10-22T09:40:00Z"
}

Error Handling

Error Scenarios

HTTP Code Scenario Recovery Action Retry Strategy
400 Invalid schedule/sliceWidth/partSizeMiB; malformed range Fix config No retry until corrected
401 Missing/invalid JWT Obtain valid token Retry after renewal
403 Missing audit:export.bulk or tenant mismatch Request proper scope/role
404 Unknown jobId or tenant route disabled Verify identifiers/tenant
409 Modify/pause/resume conflict; duplicate scheduled window; attempt to run while Running Wait/resolve state; use run-now after idle Retry after fix
413 Estimated bulk size exceeds job cap Reduce scope/sliceWidth; increase cap by policy
422 sliceWidth too large for SLO; range outside retention Choose smaller slices / valid window
429 Per-tenant/global concurrency limit hit Honor Retry-After Backoff + jitter
503 Read store/Integrity/Object storage unavailable Wait for recovery Idempotent retry using resumeToken

Failure Modes

  • Slice timeout → auto reslice into smaller sub-slices; remaining work re-queued.
  • Resume after crash → resumeToken resumes at last committed cursor/part.
  • Storage throttling → Export Service reduces parallelism; returns 429 to clients.

Recovery Procedures

  1. Use :resume with server-provided resumeToken to continue failed slices.
  2. On 429/503, back off and let the scheduler retry the tick; do not spawn duplicate runs.
  3. Adjust sliceWidth/maxParallelSlices to match observed throughput.

Performance Characteristics

Latency Expectations

  • Time-to-first-part p95 ≤ 60 s per run.
  • Per-slice runtime p95 within configured SLO (e.g., ≤ 15 min for 24h slice on typical volume).

Throughput Limits

  • Per worker: target sustained 50–150 MB/s effective write.
  • Per tenant: cap maxParallelSlices (e.g., ≤ 4).
  • Global: orchestrator enforces cluster-wide max workers.

Resource Requirements

  • CPU for serialization/compression; RAM for part buffers; IOPS for wide scans; network to object storage.

Scaling Considerations

  • Plan then fan-out: precompute slice plan and submit to a bounded queue.
  • Fair-share: per-tenant token bucket to avoid noisy neighbors.
  • Adaptive concurrency: scale workers based on export QPS, object storage throttling, and read replica load.
  • Backpressure: honor Retry-After; dynamically shrink maxParallelSlices.
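
The per-tenant fair-share mentioned above is commonly a token bucket; a single-threaded sketch (`rate` in tokens per second; production code would add locking and persistence):

```python
import time

class TokenBucket:
    """Per-tenant token bucket for fair-share slice admission.

    Each tenant accrues `rate` tokens/second up to `capacity`; a slice
    is dispatched only when a token is available, bounding any single
    tenant's share of the worker pool.
    """
    def __init__(self, rate: float, capacity: float):
        self.rate, self.capacity = rate, capacity
        self.tokens = capacity
        self.updated = time.monotonic()

    def try_acquire(self, n: float = 1.0) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```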

Security & Compliance

Authentication

  • OIDC JWT at Gateway; optional mTLS for service-to-service calls.

Authorization

  • Require audit:export.bulk; enforce RLS on reads; validate x-tenant-id.

Data Protection

  • Server-side encryption at rest; presigned URLs short-lived and scoped.
  • Optional on-read masking if bulk job set to filtered mode.

Compliance

  • Respect retention/residency; include watermarks and integrity proofs per slice.
  • Emit audit events for schedule create/update, run, pause/resume/cancel, and completion.

Monitoring & Observability

Key Metrics

Metric Type Description Alert Threshold
bulk_jobs_active gauge Running bulk jobs > global cap
bulk_slices_inflight gauge Concurrent slice executions > per-tenant cap
bulk_bytes_written_total counter Bytes written across slices Trend/throughput
bulk_slice_duration_seconds histogram Runtime per slice p95 > SLO
bulk_failures_total counter Failed slices/jobs > 0 sustained
resume_events_total counter Resumed slices Spike indicates instability

Logging Requirements

  • Structured logs: tenant, jobId, sliceId, window, resumeToken, parts, bytes, records, watermark, merkleRoot, status. No raw PII or presigned URLs.

Distributed Tracing

  • Spans: bulk.plan, slice.run, query.page, serialize.part, object.put, integrity.seal, webhook.post.
  • Attributes: sliceWidth, parallelism, bytes, records, throttleEvents.

Health Checks

  • Readiness: object storage/Integrity/Read Store reachable; scheduler connected.
  • Liveness: worker queues draining; no stuck slices beyond timeout.

Operational Procedures

Deployment

  1. Deploy Scheduler and Export Service; register /export/v1/bulk-jobs routes.
  2. Configure tenant/global concurrency caps and default sliceWidth.
  3. Run a dry run on a non-prod tenant to validate planning and sealing.

Configuration

  • Env: BULK_MAX_PARALLEL_SLICES_PER_TENANT, BULK_DEFAULT_SLICE_WIDTH, EXPORT_DEFAULT_PART_MIB, RESUME_TOKEN_TTL, PRESIGN_TTL_SEC, SLA_SLICE_P95_SECONDS.
  • Planner: enable dynamic reslicing thresholds (time/size).

Maintenance

  • Rotate signing keys; prune expired manifests/parts; archive bulk manifest indices per policy.
  • Periodically reassess sliceWidth vs. observed volumes.

Troubleshooting

  • Many resume events → check read replica throttling/object storage limits; reduce parallelism.
  • Frequent 409 on job ops → ensure clients don’t modify running jobs; use pause then update.
  • Slow slices → inspect filters/indexes and increase part size or reduce masking.

Testing Scenarios

Happy Path Tests

  • Create bulk job with cron schedule; verify automatic run creates multiple slices and parts.
  • Resume a slice after induced worker crash using resumeToken.

Error Path Tests

  • 400 on invalid schedule/sliceWidth/partSizeMiB.
  • 404 on unknown jobId.
  • 409 when updating a running job without pause.
  • 429/503 cause backoff and eventual success without duplication.

Performance Tests

  • Achieve target throughput per worker and per tenant; slice p95 ≤ SLO.
  • Concurrency caps prevent read replica saturation.

Security Tests

  • RBAC audit:export.bulk enforced; cross-tenant isolation verified.
  • Presigned URLs expire and are least-privilege.
  • Integrity sealing produces valid Merkle roots/signatures per slice.

Internal References

External References

  • RFC 7807 (Problem Details)
  • W3C Trace Context

Appendices

A. Example Create Bulk Job Request

{
  "tenant": "acme",
  "schedule": { "cron": "0 2 * * *" },
  "range": { "from": "2025-09-01T00:00:00Z" },
  "sliceWidth": "24h",
  "format": "parquet",
  "compression": "gzip",
  "partSizeMiB": 256,
  "maxParallelSlices": 3,
  "delivery": { "mode": "presigned-get" },
  "webhook": { "url": "https://hooks.example/exports", "secretId": "wh-2025" }
}

B. Resume Token (concept)

{
  "sliceId":"s_2025_10_21",
  "cursor":"eyJ0cyI6IjIwMjUtMTAtMjFUMTI6MDA6MDAuMDAwWiIsImlkIjoiMDFK...In0",
  "partIndex": 17,
  "bytesCommitted": 134217728
}
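
A resume token's `cursor` is typically opaque base64url-encoded JSON; a round-trip sketch with illustrative field names:

```python
import base64
import json

def encode_cursor(ts: str, record_id: str) -> str:
    """Pack a (timestamp, id) seek position into an opaque base64url cursor."""
    raw = json.dumps({"ts": ts, "id": record_id}).encode()
    return base64.urlsafe_b64encode(raw).decode().rstrip("=")

def decode_cursor(cursor: str) -> dict:
    """Unpack a cursor produced by encode_cursor, re-adding padding."""
    padded = cursor + "=" * (-len(cursor) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))
```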

Retention Policy Evaluation Flow

Computes and records eligibleAt timestamps for purge based on the active Retention Policy. Evaluations run on schedule and on policy change, marking candidates in the retention index and emitting Retention.EligibleComputed events with decision basis (policy id, rule id, revision, window).


Overview

Purpose: Determine when audit records (or partitions) become eligible for purge and persist eligibleAt along with decision metadata for defensible lifecycle operations.
Scope: Policy fetch & revision checks, rules evaluation (scopes/windows/exceptions), candidate marking, event emission, and re-evaluation on policy updates or clock ticks. Excludes purge execution (see Data Lifecycle & States / Purge flow).
Context: The Policy Service is the source of truth for Retention Policies and their forward-only revisions. The Lifecycle Evaluator (part of Policy or Lifecycle service) scans read/canonical stores and updates a Retention Index used by purge workers.
Key Participants:

  • Scheduler (periodic + on-change trigger)
  • API Gateway (for admin endpoints)
  • Policy Service (policies, revisions, decisions)
  • Lifecycle Evaluator (rules engine, candidate marker)
  • Metadata/Retention Index (stores eligibleAt, decision basis)
  • Event Bus (emits Retention.EligibleComputed)

Prerequisites

System Requirements

  • Policy Service reachable; policy registry seeded with tenant policy
  • Lifecycle Evaluator has read access to stores and write access to Retention Index
  • Event Bus configured for Retention.* topics
  • Time source synchronized; clock skew guardrails applied

Business Requirements

  • Tenant has an Active retention policy with forward-only Revision
  • Residency constraints configured (region-aware evaluation if required)
  • Legal Holds honored (holds block eligibility marking)

Performance Requirements

  • Evaluation p95 per partition ≤ 3 min for typical volumes
  • Index write throughput supports peak daily windows (e.g., midnight marks)
  • Backpressure controls on scans and index writes

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    participant SCH as Scheduler
    participant POL as Policy Service
    participant LCE as Lifecycle Evaluator
    participant IDX as Retention Index
    participant BUS as Event Bus

    SCH->>POL: GET /policy/v1/retention?tenant=acme (If-None-Match: rev)
    POL-->>SCH: 200 {policyId, revision, rules, windows} or 304 if unchanged
    SCH->>LCE: Trigger evaluate {tenant, policyId, revision, windowHint}
    LCE->>LCE: Enumerate candidate sets (by partition/time/resource)
    LCE->>LCE: For each record/partition: compute eligibleAt = createdAt + window(rule)
    LCE->>IDX: Upsert {recordId/partitionKey, eligibleAt, decisionBasis{policyId,ruleId,revision}}
    IDX-->>LCE: Ack (batched)
    LCE->>BUS: Publish Retention.EligibleComputed {tenant, policyId, revision, stats}

Alternative Paths

  • On-Change Re-eval: Policy.Changed event triggers incremental re-evaluation for affected scopes only.
  • Partition-Level Evaluation: compute once per partition boundary and apply to contained records (for WORM append stores).
  • Dry Run: evaluation writes to a shadow index and returns a delta report (no marking).

Error Paths

sequenceDiagram
    participant GW as API Gateway
    participant POL as Policy Service
    participant LCE as Lifecycle Evaluator

    GW->>POL: POST /policy/v1/retention:evaluate {tenant, revision:999}
    alt Unknown tenant/policy
        POL-->>GW: 404 Not Found (Problem+JSON)
    else Revision conflict (client expects different rev)
        POL-->>GW: 409 Conflict (Problem+JSON)
    else Bad request (invalid window/spec)
        POL-->>GW: 400 Bad Request (Problem+JSON)
    end

Request/Response Specifications

Input Requirements

Field Type Req Description Validation
Method/Path POST /policy/v1/retention:evaluate Y Manual/adhoc evaluation trigger JSON body
Authorization header Y Bearer <JWT> Valid, not expired
x-tenant-id header Y Tenant scope ^[A-Za-z0-9._-]{1,128}$
traceparent header O W3C trace context 55-char
policyId string O Explicit policy to apply Must belong to tenant
revision int O Expected policy revision (If-Match equivalent) Mismatch with current causes 409
scope object O Limit evaluation to subset (time/resources) Server-validated
mode enum O normal (default) dry-run

Output Specifications

202 Accepted

Field Type Description
evaluationId string Operation identifier
status enum Queued | Running
policy object {policyId, revision}
scopeApplied object Effective evaluated scope

200 OK (dry-run report)

Field Type Description
estimatedCandidates int Count that would be marked
sample array Example {recordId, computedEligibleAt, ruleId}
diff object Prior vs. new policy impact

Retention Index (concept row)

{
  "tenantId": "acme",
  "recordId": "01JECZ6Y8K1V...",
  "eligibleAt": "2026-01-21T10:12:00Z",
  "decisionBasis": { "policyId":"ret_001", "ruleId":"r_login_365d", "revision":5 },
  "decidedAt": "2025-10-22T12:00:00Z"
}

Event Retention.EligibleComputed (summary)

{
  "tenant": "acme",
  "policyId": "ret_001",
  "revision": 5,
  "window": {"from":"2025-10-21T00:00:00Z","to":"2025-10-22T00:00:00Z"},
  "stats": {"marked": 124553, "skippedHeld": 112, "errors": 0}
}

Error Handling

Error Scenarios

HTTP Code Scenario Recovery Action Retry Strategy
400 Invalid policy spec/windows; negative/zero retention; malformed scope Fix policy/scope No retry until corrected
401 Missing/invalid JWT Acquire valid token Retry after renewal
403 Missing policy:retention.evaluate permission Request proper scope/role
404 Tenant/policy not found Verify tenant/policy id
409 Policy revision conflict; evaluation for same scope already running Re-fetch policy; wait or cancel prior run Retry after fix
412 If-Match (revision) mismatch Fetch latest policy; retry with current rev Conditional retry
422 Policy invalid for tenant residency/edition Adjust policy to constraints
429 Evaluator rate limited Honor Retry-After Backoff + jitter
503 Stores/Index/Event bus unavailable Wait for recovery Idempotent retry of evaluation step

Failure Modes

  • Legal Hold present: candidate skipped; index notes skippedHeld and basis includes holdId.
  • Window change shrinks retention: re-evaluation moves eligibleAt forward only; it never moves earlier than a prior decision without an explicit administrative re-baseline.
  • Clock skew: eligibleAt never set before now - skew.

Recovery Procedures

  1. On 409 or 412, fetch current {policyId, revision} and re-issue with updated precondition.
  2. When 503/429, back off; evaluation jobs are idempotent by (tenant, policyId, revision, scopeKey).
  3. Use dry-run to assess impact before applying a new revision.

Performance Characteristics

Latency Expectations

  • Partition-sized evaluation p95 ≤ 3 min; small scope ad-hoc p95 ≤ 30 s.

Throughput Limits

  • Evaluator concurrency limited per tenant to protect read/metadata stores (e.g., ≤ 2 concurrent scopes).

Resource Requirements

  • Bounded memory for rule evaluation batches; write-optimized Retention Index with bulk upserts.

Scaling Considerations

  • Batch by partition and time windows; use checkpointing to resume mid-run.
  • Prefer set-based updates (partition-level) when rules are uniform (e.g., 365d global).
  • Emit periodic progress to avoid long silent runs.

Security & Compliance

Authentication

  • OIDC JWT at Gateway; service-to-service credentials for Evaluator ↔ Index.

Authorization

  • Enforce policy:retention.read and policy:retention.evaluate; verify x-tenant-id.

Data Protection

  • Decision basis recorded without copying sensitive payload; only IDs/timestamps stored.

Compliance

  • Forward-only versions: revision monotonically increases; decisions log basis {policyId, ruleId, revision, computedAt} for auditability.
  • Residency honored by running evaluation in-region and by scoping reads.

Monitoring & Observability

Key Metrics

Metric Type Description Alert Threshold
ret_eval_jobs_active gauge Running evaluations > tenant/global cap
ret_candidates_marked_total counter Records marked eligible Trend
ret_eval_duration_seconds histogram Runtime per evaluation p95 > SLO
ret_eval_skipped_held_total counter Skipped due to Legal Hold Spike watch
ret_eval_conflicts_total counter 409/412 occurrences Investigate policy churn

Logging Requirements

  • Structured logs: tenant, policyId, revision, ruleId, scopeKey, marked, skippedHeld, durationMs, errors. No PII.

Distributed Tracing

  • Spans: policy.fetch, eval.scan, eval.batch, index.upsert, event.publish.
  • Attributes: revision, batchSize, marked, skipped.

Health Checks

  • Readiness: Index writable; Policy Service reachable; Event Bus available.
  • Liveness: job queue drains; checkpoints advance.

Operational Procedures

Deployment

  1. Deploy Lifecycle Evaluator workers; register /policy/v1/retention:evaluate.
  2. Seed policies; verify revisioning and on-change triggers.
  3. Run a dry-run evaluation in staging; verify index shape and events.

Configuration

  • Env: RET_EVAL_BATCH_SIZE, RET_EVAL_MAX_CONCURRENCY, RET_EVAL_CHECKPOINT_TTL, CLOCK_SKEW_SEC.
  • Policy: enforce forward-only revisions; require change justification metadata.

Maintenance

  • Compact Retention Index (drop superseded decisions); rotate event topics per retention.
  • Re-baseline procedures for exceptional policy rollbacks (administrative only).

Troubleshooting

  • High conflicts: educate admins to supply If-Match revision when triggering evaluations.
  • Slow runs: increase batch size carefully; verify index write IOPS; reduce scan scope.
  • Skewed results: check time normalization and partition catalog.

Testing Scenarios

Happy Path Tests

  • Evaluate 24h scope → candidates marked with correct eligibleAt and decisionBasis.
  • Policy change (revision++) triggers incremental re-eval for affected scopes only.

Error Path Tests

  • 400 for invalid windows/rules; 404 for unknown policy; 409/412 for revision issues.
  • 422 when policy violates residency/edition.
  • 429/503 lead to compliant backoff and eventual success.

Performance Tests

  • Partition evaluation p95 ≤ 3 min; throughput meets index SLOs.
  • Checkpoint resume after induced worker restart.

Security Tests

  • RBAC scopes enforced; cross-tenant isolation verified.
  • Logs contain decision basis without payload leakage.

Internal References

  • Legal Hold Processing Flow
  • Data Lifecycle (Purge Execution) Flow
  • Policy Change Propagation Flow

External References

  • RFC 7807 (Problem Details)
  • W3C Trace Context

Appendices

A. Example Problem+JSON (revision conflict)

{
  "type": "urn:connectsoft:errors/policy/revision.conflict",
  "title": "Policy revision conflict",
  "status": 409,
  "detail": "Requested evaluation with revision 5, current is 6.",
  "traceId": "9f0c1d2e3a4b5c6d...",
  "errors": [{"pointer": "/revision", "reason": "stale"}]
}

B. Decision Basis (concept)

{
  "policyId": "ret_001",
  "ruleId": "r_login_365d",
  "revision": 6,
  "formula": "eligibleAt = createdAt + P365D",
  "inputs": {"createdAt":"2025-10-21T11:00:00Z"},
  "output": {"eligibleAt":"2026-10-21T11:00:00Z"}
}
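
The formula in the decision basis can be reproduced directly; this sketch handles only the day-based `PnD` duration form shown here, while the real rules engine supports the full policy window grammar:

```python
import re
from datetime import datetime, timedelta

def eligible_at(created_at: str, window: str) -> str:
    """Compute eligibleAt = createdAt + window for PnD (days) durations."""
    m = re.fullmatch(r"P(\d+)D", window)
    if m is None:
        raise ValueError(f"unsupported window: {window}")
    created = datetime.fromisoformat(created_at.replace("Z", "+00:00"))
    return (created + timedelta(days=int(m.group(1)))).strftime("%Y-%m-%dT%H:%M:%SZ")
```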

Legal Hold Processing Flow

Applies, updates, and releases Legal Holds against tenant data. Resolves scope unambiguously, materializes a holdSnapshot (with forward-only revision), matches target records/partitions, marks them OnHold (purge-block), and emits lifecycle events. Releasing a hold clears blockers and triggers dependent re-evaluations.


Overview

Purpose: Provide a defensible mechanism to place and release Legal Holds so that covered records are preserved and exports can reference verifiable hold decisions.
Scope: Create/apply/update/release flows, scope resolution and match indexing, purge-block signaling, event emission, and concurrency controls. Excludes exporting data under hold (see Legal Hold Export Flow).
Context: The LegalHold Service is authoritative for hold definitions and state. It interacts with Read/Projection Stores to match data, the Lifecycle/Purge subsystem to block deletion, and Policy/Retention to re-evaluate eligibility.
Key Participants:

  • Legal Team / Client
  • API Gateway
  • LegalHold Service (registry, matcher, state machine)
  • Read/Projection Store (query targets by scope)
  • Hold Index / Purge Guard (flags OnHold)
  • Event Bus (LegalHold.Applied|Updated|Released)

Prerequisites

System Requirements

  • API Gateway with TLS and JWT validation
  • LegalHold Service deployed with access to Read/Projection Store and Hold Index
  • Event Bus topics configured (LegalHold.*)
  • Clock/time normalization to UTC; deterministic scope resolvers

Business Requirements

  • Tenant enabled for Legal Hold; roles and approvals defined
  • Case management identifiers available (caseId)
  • Residency constraints and retention policies configured

Performance Requirements

  • p95 apply time for typical scopes ≤ 60 s (to first confirmation)
  • Hold matching throughput sized to tenant volume (seek pagination)
  • Low-latency purge-block propagation (seconds, not minutes)

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    actor Legal as Legal Team
    participant GW as API Gateway
    participant LHS as LegalHold Service
    participant RD as Read/Projection Store
    participant HIX as Hold Index / Purge Guard
    participant BUS as Event Bus

    Legal->>GW: POST /legal-hold/v1/holds {tenant, scope, caseId, reason, expiresAt?}
    GW->>LHS: Create+Apply request (authN/Z, x-tenant-id, traceparent)
    LHS->>LHS: Validate scope → normalize ResourceRef/time boundaries
    LHS->>RD: Enumerate targets via cursor (tenant, scope)
    loop Batched match
        RD-->>LHS: Batch of record/partition keys
        LHS->>HIX: Mark OnHold {keys..., holdId, revision}
    end
    LHS->>LHS: Persist holdSnapshot {id, revision, scope, decidedAt}
    LHS->>BUS: Publish LegalHold.Applied {holdId, tenant, revision, scope}
    LHS-->>GW: 201 Created {holdId, status:"Active", snapshot}

Alternative Paths

  • Preview: mode=preview returns counts and sample keys without applying.
  • Incremental expand: PATCH /holds/{id} with additional scope → revision++, match only delta.
  • Auto-expiry: expiresAt schedules automatic Release at timestamp.
  • Partition-level hold: mark append partitions instead of individual records for large scopes.

Error Paths

sequenceDiagram
    actor Legal
    participant GW as API Gateway
    participant LHS as LegalHold Service

    Legal->>GW: POST /legal-hold/v1/holds {invalid scope}
    alt Bad request
        GW-->>Legal: 400 Bad Request (Problem+JSON)
    else Hold not found (read/update/release)
        GW-->>Legal: 404 Not Found (Problem+JSON)
    else Conflict (apply on already Active, release on Released)
        GW-->>Legal: 409 Conflict (Problem+JSON)
    else Precondition failed (If-Match revision mismatch)
        GW-->>Legal: 412 Precondition Failed (Problem+JSON)
    else Rate limited / dependencies down
        GW-->>Legal: 429/503 (Retry-After)
    end

Request/Response Specifications

Input Requirements

| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| Create/Apply | POST /legal-hold/v1/holds | Y | Create + apply a hold | JSON body |
| Authorization | header | Y | Bearer <JWT> | Valid, not expired |
| x-tenant-id | header | Y | Tenant scope | Must match body.tenant |
| traceparent | header | O | W3C trace context | 55-char |
| tenant | string | Y | Target tenant | ^[A-Za-z0-9._-]{1,128}$ |
| caseId | string | Y | Legal case identifier | ≤ 128 chars |
| reason | string | Y | Business/legal justification | ≤ 512 chars |
| scope | object | Y | Resource/time predicates | Normalized server-side |
| expiresAt | timestamp | O | Auto-release time (UTC) | Must be in future |
| mode | enum | O | apply (default) \| preview | |

Update (expand/restrict)

| Field | Type | Req | Description |
|---|---|---|---|
| PATCH /legal-hold/v1/holds/{holdId} | path | Y | Modify scope (forward-only*); requires If-Match: <rev> |
| Body: {scopeDelta} | json | Y | Additive change preferred; shrink requires admin override |

*Forward-only scope changes recommended; shrinking scope is exceptional and audited.

Release

| Field | Type | Req | Description |
|---|---|---|---|
| POST /legal-hold/v1/holds/{holdId}:release | path | Y | Release hold |
| If-Match | header | O | Expected revision; prevents races |
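The If-Match discipline on update/release can be sketched from the client side as a fetch-latest-and-retry loop. A minimal sketch; `fetch_hold` and `release_hold` are hypothetical stand-ins for the HTTP calls, not ATP SDK functions:

```python
def conditional_release(fetch_hold, release_hold, hold_id, max_attempts=3):
    """Release a hold with If-Match; on 412 (stale revision), refetch and retry."""
    for _ in range(max_attempts):
        snapshot = fetch_hold(hold_id)                      # GET /holds/{id}
        status, body = release_hold(hold_id, if_match=snapshot["revision"])
        if status == 200:
            return body                                     # {"status": "Released", ...}
        if status != 412:                                   # not a revision race
            raise RuntimeError(f"release failed with {status}")
    raise RuntimeError("unresolved revision conflicts")
```

On 409 (already Released) the loop raises immediately, which matches the "align state, then retry" guidance in the error table.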

Output Specifications

Create — 201 Created

| Field | Type | Description |
|---|---|---|
| holdId | string | Hold identifier (ULID/GUID) |
| status | enum | Active |
| revision | int | Current revision |
| snapshot | object | {scope, decidedAt, caseId, reason, expiresAt?} |
| stats | object | {matched, partitions, partial?:bool} |

Release — 200 OK

| Field | Type | Description |
|---|---|---|
| holdId | string | Id |
| status | enum | Released |
| releasedAt | timestamp | ISO-8601 UTC |
| revision | int | Final revision |

Example Payloads

// Create & apply
{
  "tenant": "acme",
  "caseId": "CASE-2025-099",
  "reason": "Regulatory investigation",
  "scope": {
    "time": {"from": "2025-09-01T00:00:00Z"},
    "resourceTypes": ["Iam.User","Case.File"],
    "actions": ["Create","Update"]
  },
  "expiresAt": "2026-03-01T00:00:00Z"
}
// Release
{
  "note": "Case concluded; hold lifted by order #1234"
}

Error Handling

Error Scenarios

| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Malformed scope; invalid expiresAt; missing caseId/reason | Correct request | No retry until fixed |
| 401 | Missing/invalid JWT | Acquire valid token | Retry after renewal |
| 403 | Missing audit:legalhold.apply \| update \| release | Request proper role/scope | |
| 404 | Unknown holdId or tenant route not found | Verify ids/tenant | |
| 409 | Apply on already Active; Release on Released; concurrent modify | Align state (PATCH or fetch latest) | Retry after fix |
| 412 | If-Match revision mismatch | Fetch latest snapshot → retry | Conditional retry |
| 422 | Scope cannot be resolved unambiguously | Adjust scope; use preview | |
| 429 | Per-tenant/global rate limit | Honor Retry-After | Backoff + jitter |
| 503 | Read/Index/Event bus unavailable | Wait for recovery | Idempotent retry (server de-dupes) |

Failure Modes

  • Partial match (timeouts/limits): partial=true in stats; matcher continues asynchronously until complete.
  • Residency boundary: cross-region scope split into regional sub-holds to remain compliant.
  • Clock skew: time predicates normalized to UTC; inclusive start, exclusive end by convention.

Recovery Procedures

  1. On 412/409, retrieve latest {holdId, revision, status} and re-issue with correct preconditions.
  2. For partial matches, monitor progress events or query stats until partial=false.
  3. If 503/429, back off; the apply operation is idempotent by (tenant, caseId, normalizedScopeHash).
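The idempotency triple in step 3 depends on a deterministic scope hash. The canonicalization rules below (sorted keys, sorted lists, compact separators) are an assumption — ATP's normalizer is server-side — but the key shape matches (tenant, caseId, normalizedScopeHash):

```python
import hashlib
import json

def normalized_scope_hash(scope: dict) -> str:
    """Hash a canonical form of the scope so semantically equal scopes
    (reordered keys or list items) yield the same digest."""
    def canon(node):
        if isinstance(node, dict):
            return {k: canon(node[k]) for k in sorted(node)}
        if isinstance(node, list):
            return sorted((canon(v) for v in node),
                          key=lambda v: json.dumps(v, sort_keys=True))
        return node
    blob = json.dumps(canon(scope), separators=(",", ":"), sort_keys=True)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()

def idempotency_key(tenant: str, case_id: str, scope: dict) -> str:
    """The (tenant, caseId, normalizedScopeHash) de-duplication triple."""
    return f"{tenant}:{case_id}:{normalized_scope_hash(scope)}"
```

Two requests that differ only in field or list ordering then de-duplicate to the same apply operation.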

Performance Characteristics

Latency Expectations

  • Apply confirmation p95 ≤ 60 s for typical scopes; full match completion may continue async.

Throughput Limits

  • Matcher QPS bounded by read replica capacity; batch size tuned per tenant.

Resource Requirements

  • CPU for scope normalization; memory for batching keys; I/O for index updates.

Scaling Considerations

  • Use seek pagination and partition-aware queries.
  • Mark partitions OnHold when feasible for large contiguous ranges.
  • Backpressure from Hold Index updates reduces batch size automatically.
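The automatic batch-size reduction can be as simple as an AIMD (additive-increase / multiplicative-decrease) controller; a minimal sketch with illustrative thresholds:

```python
def next_batch_size(current: int, pushback: bool,
                    floor: int = 50, ceiling: int = 5000, step: int = 100) -> int:
    """Halve the match batch on Hold Index pushback, otherwise grow by a
    fixed step, clamped to [floor, ceiling]."""
    if pushback:
        return max(floor, current // 2)
    return min(ceiling, current + step)
```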

Security & Compliance

Authentication

  • OIDC JWT at Gateway; optional mTLS service-to-service.

Authorization

  • Require audit:legalhold.apply, audit:legalhold.update, audit:legalhold.release.
  • Enforce RLS by tenantId; verify x-tenant-id.

Data Protection

  • Store minimal decision basis (ids/timestamps); do not copy payloads.
  • All hold state transitions are audited with actor and purpose-of-use.

Compliance

  • Holds block purge immediately via Purge Guard; Retention Evaluator records skippedHeld.
  • holdSnapshot (id, revision, scope, decidedAt) provides chain-of-custody.

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| holds_active | gauge | Active holds per tenant | Sudden spikes |
| hold_applied_total | counter | Holds applied | |
| hold_released_total | counter | Holds released | |
| hold_match_duration_seconds | histogram | Matching latency | p95 > SLO |
| purge_block_signals_total | counter | Purge-block updates sent | Drop indicates risk |

Logging Requirements

  • Structured logs: tenant, holdId, revision, caseId, scopeHash, matched, partial, actor, reason. No PII.

Distributed Tracing

  • Spans: hold.apply, match.scan, index.mark, hold.release, event.publish.
  • Attributes: scopeHash, batchSize, matched, partial.

Health Checks

  • Readiness: Read/Projection and Hold Index reachable; Event Bus available.
  • Liveness: matcher queue drains; no stuck Applying holds beyond timeout.

Operational Procedures

Deployment

  1. Deploy LegalHold Service and register /legal-hold/v1/* routes.
  2. Initialize Hold Index and Purge Guard hooks.
  3. Verify preview/apply/release in staging with synthetic scopes.

Configuration

  • Env: HOLD_MATCH_BATCH, HOLD_APPLY_TIMEOUT, HOLD_MAX_SCOPE_SIZE, RESIDENCY_MODE.
  • Policy: require reason and caseId; optional expiresAt auto-release.

Maintenance

  • Compact Hold Index (drop released markers no longer needed).
  • Rotate webhook/signing keys if callbacks to external systems are used.

Troubleshooting

  • High partial rates → increase batch size cautiously; check read replica health.
  • Frequent 409/412 → educate clients to use If-Match and fetch-latest patterns.
  • Purge still running on held data → verify Purge Guard subscription and index state.

Testing Scenarios

Happy Path Tests

  • Apply hold with resource/time scope → holds_active increments; purge-block engaged.
  • Update scope (additive) → revision++, only delta matched; events emitted.
  • Release hold → blockers cleared; LegalHold.Released published.

Error Path Tests

  • 400 for malformed scope; 404 for unknown holdId; 409 for invalid state transitions; 412 for revision mismatch.
  • 429/503 cause compliant backoff; operation remains idempotent.

Performance Tests

  • Matching completes within SLO for typical tenants; no read replica saturation.
  • Purge-block propagation latency within seconds.

Security Tests

  • RBAC enforced; cross-tenant access blocked.
  • Audit log contains actor, purpose, scope hash; no PII leakage.

Internal References

External References

  • RFC 7807 (Problem Details)
  • W3C Trace Context

Appendices

A. Example Problem+JSON (invalid scope)

{
  "type": "urn:connectsoft:errors/legalhold/scope.invalid",
  "title": "Invalid legal hold scope",
  "status": 400,
  "detail": "Scope must include at least one of resourceTypes or actors, and a bounded time window.",
  "traceId": "9f0c1d2e3a4b5c6d...",
  "errors": [
    {"pointer": "/scope/time", "reason": "missing-or-unbounded"}
  ]
}

B. Hold Snapshot (concept)

{
  "id": "lh_2025_001",
  "tenant": "acme",
  "revision": 3,
  "status": "Active",
  "caseId": "CASE-2025-099",
  "reason": "Regulatory investigation",
  "scope": { "resourceTypes": ["Iam.User"], "time": {"from":"2025-09-01T00:00:00Z"} },
  "decidedAt": "2025-10-22T11:45:10Z",
  "expiresAt": "2026-03-01T00:00:00Z"
}

Data Redaction Flow (Read)

Applies policy-driven masking to query results at read time. The Query Service consults the Redaction Service to enforce a requested profile (Safe, Support, Investigator, Raw), optionally validates a Just-In-Time (JIT) unmask approval, and returns transformed results. All unmask attempts and approvals are audited.


Overview

Purpose: Ensure returned data complies with privacy policy via profile-based masking, with tightly controlled JIT unmask for break-glass scenarios.
Scope: Profile selection, purpose-of-use capture, redaction rules execution, JIT approval verification, response annotation, and auditing. Excludes write-time classification (see Validation & Classification Flow).
Context: Sits on the Query path between Read Models/Search and clients. Uses Data Classification from the model and Redaction Rules (mask/hash/tokenize/drop).
Key Participants:

  • Client (consumer of audit data)
  • API Gateway
  • Query Service (fetch, orchestrate)
  • Redaction Service (policy engine, transform)
  • Approval Service (JIT unmask token issuance/validation)
  • Audit/Event Bus (log read/unmask decisions)

Prerequisites

System Requirements

  • Gateway with TLS + JWT validation
  • Query Service can call Redaction & Approval Services
  • Read Models/Search indices annotated with DataClass metadata
  • Clock sync for JIT token TTL enforcement

Business Requirements

  • Redaction profiles & policy configured per tenant
  • Purpose-of-use taxonomy and RBAC scopes defined
  • Approver roster & workflow for JIT unmask (with SLA)

Performance Requirements

  • p95 redaction overhead ≤ 15 ms per page (server-side)
  • JIT token verification p95 ≤ 50 ms
  • Budget for page sizes (e.g., ≤ 200 records) to maintain SLOs

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    actor C as Client
    participant GW as API Gateway
    participant Q as Query Service
    participant R as Redaction Service
    participant A as Approval Service
    participant AUD as Audit/Event Bus

    C->>GW: GET /query/v1/events?filters…<br/>Headers: x-redaction-profile=Support, x-purpose-of-use=SupportOps
    GW->>Q: Forward request (authN/Z, tenant)
    Q->>Q: Fetch page from Read Model / Search
    Q->>R: ApplyProfile(records, profile=Support, tenant, purpose)
    R-->>Q: Redacted(records, redactionMeta)
    Q->>AUD: Publish Read.Audited {tenant, profile, purpose, actor, resultCount}
    Q-->>GW: 200 OK (masked results + X-Redaction-Profile + X-Watermark)
    GW-->>C: 200 OK

Alternative Paths

  • Investigator profile: broader reveal than Support but still masked for HighlySensitive; requires higher RBAC.
  • Raw profile with JIT: client supplies x-jit-approval-token; Approval Service validates token → Redaction Service bypasses selected fields (field-scoped unmask).
  • Field-scoped override: request includes fields=… to minimize exposure; redaction runs only on returned fields.

Error Paths

sequenceDiagram
    actor C as Client
    participant GW as API Gateway
    participant Q as Query Service
    participant A as Approval Service

    C->>GW: GET … x-redaction-profile=Raw, x-jit-approval-token=abc
    GW->>Q: Forward
    Q->>A: ValidateToken(abc)
    alt Token invalid/expired/not-for-tenant
        A-->>Q: 403 Forbidden (reason)
        Q-->>GW: 403 Problem+JSON
        GW-->>C: 403 Forbidden
    else Bad profile or params
        Q-->>GW: 400 Bad Request (Problem+JSON)
        GW-->>C: 400
    else Record id requested but not found
        Q-->>GW: 404 Not Found (Problem+JSON)
        GW-->>C: 404
    else Conflict (token already consumed / different subject)
        Q-->>GW: 409 Conflict (Problem+JSON)
        GW-->>C: 409
    end

Request/Response Specifications

Input Requirements

| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| Method/Path | GET /query/v1/events | Y | Search/scroll timeline | Query params allowlisted |
| Authorization | header | Y | Bearer <JWT> | Valid, not expired |
| x-tenant-id | header | Y | Tenant scope | Matches JWT/route |
| x-redaction-profile | header | O | Safe (default) \| Support \| Investigator \| Raw | Profile allowlist |
| x-purpose-of-use | header | Y | Business purpose taxonomy | Non-empty, allowlist |
| x-jit-approval-token | header | O | Break-glass token for unmask | JIT policy validates |
| traceparent | header | O | W3C trace context | 55-char |
| fields | query | O | Comma list of fields to return | Minimization applied |
| page.after | query | O | Seek cursor | Opaque; server-issued |
| limit | query | O | Page size | 1–200, default 100 |

Output Specifications

200 OK

| Field | Type | Description | Notes |
|---|---|---|---|
| items[] | array | Records with masking applied | See examples |
| redactionMeta | object | {profile, rulesApplied[], jit:{used, reason?}} | Optional when Safe |
| watermark | string | Projection snapshot time | Also in header |

Headers

  • X-Redaction-Profile: effective profile
  • X-Purpose-Of-Use: echoed purpose
  • X-Watermark: ISO-8601 UTC projection watermark

Example Payloads

// Request (Support profile)
GET /query/v1/events?resourceType=Payment&from=2025-10-01T00:00:00Z
x-redaction-profile: Support
x-purpose-of-use: SupportOps
// Response (masked)
{
  "items": [
    {
      "id": "01JF…",
      "actor": {"id":"u_123","displayName":"A**** T****"},
      "resource": {"type":"Payment","id":"pay_789"},
      "action": "Create",
      "createdAt": "2025-10-22T11:01:22Z",
      "deltas": {
        "after": {
          "cardLast4": "****",
          "email": "a***@e***.com",
          "amount": 1299
        }
      }
    }
  ],
  "redactionMeta": {
    "profile": "Support",
    "rulesApplied": [
      {"field":"deltas.after.cardLast4","rule":"mask-last4"},
      {"field":"deltas.after.cardBin","rule":"drop"},
      {"field":"deltas.after.email","rule":"mask-email"}
    ]
  },
  "watermark": "2025-10-22T11:05:00Z"
}
// Raw with JIT token (field-scoped unmask)
GET /query/v1/events/{id}
x-redaction-profile: Raw
x-jit-approval-token: jt_01ABC…
x-purpose-of-use: IncidentResponse
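The masked response above can be produced by a small interpreter over rulesApplied. A minimal sketch of the mask-last4, mask-email, and drop rules; the helpers and dotted-path convention are illustrative, not ATP APIs:

```python
def _get_path(obj, path):
    """Read a dotted path like 'deltas.after.email' from a nested dict."""
    for part in path.split("."):
        if not isinstance(obj, dict) or part not in obj:
            return None
        obj = obj[part]
    return obj

def _set_path(obj, path, value=None, drop=False):
    """Replace (or remove, when drop=True) the field at a dotted path."""
    parts = path.split(".")
    for part in parts[:-1]:
        obj = obj.get(part)
        if not isinstance(obj, dict):
            return
    if drop:
        obj.pop(parts[-1], None)
    elif parts[-1] in obj:
        obj[parts[-1]] = value

def mask_email(v: str) -> str:
    """'alice@example.com' -> 'a***@e***.com', as in the example response."""
    local, domain = v.split("@", 1)
    host, tld = domain.rsplit(".", 1)
    return f"{local[0]}***@{host[0]}***.{tld}"

RULES = {"mask-last4": lambda v: "****", "mask-email": mask_email}

def apply_rules(record: dict, rules_applied: list) -> dict:
    for r in rules_applied:
        field, rule = r["field"], r["rule"]
        if rule == "drop":
            _set_path(record, field, drop=True)
        else:
            value = _get_path(record, field)
            if value is not None:
                _set_path(record, field, RULES[rule](value))
    return record
```

Note that "drop" removes the field entirely, while the mask rules rewrite it in place.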

Error Handling

Error Scenarios

| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Unsupported profile; invalid limit/fields; bad time filters | Correct request | No retry until fixed |
| 401 | Missing/invalid JWT | Acquire valid token | Retry after renewal |
| 403 | Insufficient RBAC for profile; JIT token invalid/expired; tenant mismatch | Request proper scope or new JIT approval | |
| 404 | Requested record id not found | Verify id/tenant | |
| 409 | JIT token subject mismatch or already consumed | Obtain a fresh token | |
| 422 | Purpose-of-use missing/invalid; policy disallows Raw for tenant | Fix usage/policy | |
| 429 | Rate limited for sensitive profiles | Honor Retry-After | Backoff + jitter |
| 503 | Redaction/Approval service unavailable | Wait for recovery | Idempotent retry (re-run query) |

Failure Modes

  • Partial redaction (missing DataClass metadata): default to most restrictive (mask/drop) and include warning in redactionMeta.
  • Policy change mid-request: response includes X-Policy-Revision-Used; clients re-issue if needed.

Recovery Procedures

  1. For 403/409, request/refresh JIT approval; ensure subject/resource matches token scope.
  2. On 503/429, back off; queries are safe to retry with same cursor.
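The "back off" step (and the backoff + jitter strategy in the error table) is commonly implemented as full-jitter exponential backoff; a minimal sketch with illustrative defaults:

```python
import random

def backoff_delays(attempts: int = 5, base: float = 0.5, cap: float = 30.0):
    """Full-jitter exponential backoff: the i-th delay is drawn uniformly
    from [0, min(cap, base * 2**i)] seconds."""
    for i in range(attempts):
        yield random.uniform(0.0, min(cap, base * (2 ** i)))
```

Because queries are safe to retry with the same cursor, each delay can simply precede a re-issue of the identical request.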

Performance Characteristics

Latency Expectations

  • Redaction transform p95 ≤ 15 ms/page; JIT verification p95 ≤ 50 ms.

Throughput Limits

  • Sensitive profiles (Investigator, Raw) may be throttled per tenant (token bucket).

Resource Requirements

  • CPU-bound transforms; memory proportional to page size; minimal I/O overhead.

Scaling Considerations

  • Cache compiled redaction plans per {profile, schemaVersion}.
  • Prefer field projection (fields=…) to reduce work and exposure.
  • Co-locate Redaction Service with Query Service to minimize RPC latency.

Security & Compliance

Authentication

  • OIDC JWT at Gateway; service credentials between Query ↔ Redaction/Approval.

Authorization

  • RBAC scopes per profile (e.g., audit:read.support, audit:read.investigator, audit:read.raw).
  • Enforce tenant RLS; verify x-tenant-id.

Data Protection

  • No raw PII in logs; only masked samples and rule stats.
  • JIT tokens are short-lived, single-use, audience- and subject-scoped; signed & time-bounded.
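These constraints (audience/subject scoping, time bounds, single use) can be checked mechanically. A toy sketch using an HMAC signature and an in-memory consumed-token set — the real Approval Service uses its own signing scheme and durable storage; field names follow the JIT token concept in Appendix B:

```python
import hashlib
import hmac
import json
from datetime import datetime, timezone

_consumed = set()  # single-use ledger; a real deployment persists this

def _ts(s):
    return datetime.fromisoformat(s.replace("Z", "+00:00"))

def validate_jit(token, *, secret, tenant, audience, subject, now=None):
    """Raise PermissionError unless the token satisfies every constraint."""
    now = now or datetime.now(timezone.utc)
    body = {k: v for k, v in token.items() if k != "sig"}
    expected = hmac.new(secret, json.dumps(body, sort_keys=True).encode(),
                        hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, token["sig"]):
        raise PermissionError("bad signature")               # -> 403
    if token["tenant"] != tenant or token["aud"] != audience:
        raise PermissionError("tenant/audience mismatch")    # -> 403
    if token["subject"] != subject:
        raise PermissionError("subject mismatch")            # -> 409
    if not (_ts(token["nbf"]) <= now < _ts(token["exp"])):
        raise PermissionError("outside validity window")     # -> 403
    if token["jitId"] in _consumed:
        raise PermissionError("token already consumed")      # -> 409
    _consumed.add(token["jitId"])
```

The comments map each rejection onto the HTTP codes in the error table (403 for invalid/expired, 409 for subject mismatch or reuse).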

Compliance

  • All unmask uses are audited with actor, purpose, scope, token id, and fields revealed.
  • Profiles & rule sets derived from tenant policy; revision id echoed in responses.

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| redaction_requests_total | counter | Redaction calls by profile | Sudden spikes |
| redaction_latency_ms | histogram | Transform latency | p95 > 15 ms |
| jit_token_validations_total | counter | Approval checks | Track failures |
| jit_validation_latency_ms | histogram | JIT check latency | p95 > 50 ms |
| unmask_events_total | counter | Successful JIT unmask | Unusual growth |

Logging Requirements

  • Structured logs: tenant, profile, purpose, actorId, resultCount, policyRevision, jit.used, fieldsRevealed[] (names only). No values.

Distributed Tracing

  • Spans: query.fetch, redaction.apply, approval.validate.
  • Attributes: profile, purpose, maskedFieldsCount, jitUsed.

Health Checks

  • Readiness: Redaction & Approval endpoints reachable; policy cache warm.
  • Liveness: transform queue drains; token cache not stale.

Operational Procedures

Deployment

  1. Deploy Redaction & Approval Services; enable headers in Gateway.
  2. Prime policy/profile caches; validate with synthetic records.

Configuration

  • Env: REDACTION_DEFAULT_PROFILE, JIT_TTL_SEC, JIT_AUDIENCE, PROFILE_RBAC_MAP, SENSITIVE_RATE_LIMITS.
  • Policy: map DataClass → rule (mask/hash/tokenize/drop) per profile.

Maintenance

  • Rotate signing keys for JIT tokens; tune rate limits by tenant.
  • Review unmask audit reports periodically with compliance.

Troubleshooting

  • Latency regressions → inspect rule plan caching, page size, co-location.
  • Frequent 403/409 → check token issuance workflow and subject scoping.
  • Unexpected reveals → verify policy revision and RBAC mapping.

Testing Scenarios

Happy Path Tests

  • Safe returns masked payload per policy with correct redactionMeta.
  • Support reveals operational fields but masks HighlySensitive.
  • Raw with valid JIT token reveals requested fields only; audit event emitted.

Error Path Tests

  • 400 for invalid profile; 422 for missing/invalid purpose-of-use.
  • 403/409 for bad/consumed JIT token; 404 for missing record id.
  • 429/503 result in compliant backoff and successful retry.

Performance Tests

  • p95 redaction ≤ 15 ms for 100-record pages.
  • JIT validation ≤ 50 ms p95.

Security Tests

  • RBAC enforced per profile; cross-tenant blocked.
  • Logs exclude PII values; unmask audited with token id.

Internal References

External References

  • RFC 7807 (Problem Details)
  • W3C Trace Context

Appendices

A. Example Problem+JSON (invalid profile)

{
  "type": "urn:connectsoft:errors/redaction/profile.invalid",
  "title": "Unsupported redaction profile",
  "status": 400,
  "detail": "Profile 'Debug' is not enabled for tenant 'acme'.",
  "traceId": "9f0c1d2e3a4b5c6d...",
  "errors": [{"pointer": "x-redaction-profile", "reason": "unsupported"}]
}

B. JIT Token (concept)

{
  "jitId": "jt_01ABC…",
  "tenant": "acme",
  "subject": {"type":"Payment","id":"pay_789"},
  "fields": ["deltas.after.email","actor.displayName"],
  "purpose": "IncidentResponse",
  "aud": "audit-read",
  "nbf": "2025-10-22T11:00:00Z",
  "exp": "2025-10-22T11:10:00Z",
  "sig": "MEQCI…"
}

Compliance Audit Flow

Generates a defensible compliance report by collecting evidence (records, lifecycle transitions, retention/legal-hold decisions, and integrity proofs), independently verifying tamper-evidence, and assembling a signed report artifact with full end-to-end traceability.


Overview

Purpose: Produce an auditable report that demonstrates data integrity, lifecycle adherence, and policy compliance over a defined scope and period.
Scope: Audit job creation, evidence collection, integrity verification (Merkle/Signatures), control checks (retention, legal hold, redaction on read), report assembly/signing, delivery, and audit of the audit. Excludes exporting large datasets (see Export flows) and policy authoring.
Context: Orchestrated by Audit Service; reads from Read Models/Indices, Lifecycle/Retention Index, Legal Hold, and Integrity Service; produces a signed Compliance Report and optional Evidence Bundle.
Key Participants:

  • Auditor / Compliance Client
  • API Gateway
  • Audit Service (orchestrator, verifier, report builder)
  • Query Service / Read Store (records, timelines)
  • Integrity Service (Merkle & signatures verification)
  • Policy/LegalHold/Retention services (decisions & states)
  • Delivery Backend (report/evidence URLs)
  • Webhook Receiver (optional callbacks)

Prerequisites

System Requirements

  • API Gateway with TLS and JWT validation
  • Audit Service with access to Read Store, Integrity, Policy, LegalHold, Retention Index
  • Object storage for report artifacts and optional evidence bundle
  • KMS/HSM configured for report signing (optional but recommended)

Business Requirements

  • Tenant compliance profile defined (e.g., GDPR/HIPAA/SOC2 control set)
  • Purpose-of-use and auditor role(s) configured
  • Time-bound audit scope agreed (from/to, resources, actors)

Performance Requirements

  • p95 time-to-summary ≤ 60 s for typical 24–48h windows
  • Evidence sampling and cap thresholds configured to avoid oversize bundles
  • Parallel verification workers sized to volume

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    actor AUD as Auditor
    participant GW as API Gateway
    participant AS as Audit Service
    participant Q as Query Service / Read Store
    participant INT as Integrity Service
    participant POL as Policy/Retention/LegalHold
    participant OBJ as Delivery Backend
    participant WH as Webhook (optional)

    AUD->>GW: POST /compliance/v1/audits {tenant, scope, frameworks, options{verifyIntegrity, includeEvidence}}
    GW->>AS: Create audit job (authN/Z, x-tenant-id, traceparent)
    AS->>Q: Collect evidence set (records, lifecycle states, decisions)
    AS->>POL: Fetch decisions (retention elig., legal holds, policy revisions)
    AS->>INT: Verify integrity (Merkle chain, signatures, sample leaves)
    INT-->>AS: Verification results {ok, failures[], merkleRoot, keyIds}
    AS->>AS: Compile control checks + traceability map
    AS->>OBJ: PUT report.pdf/json + (optional) evidence.zip
    AS-->>GW: 202 Accepted {auditId, status:"Queued"}
    alt webhook configured
        AS->>WH: POST Compliance.ReportReady {auditId, reportUrl, summary}
    end

Alternative Paths

  • Lightweight attest-only: verifyIntegrity=true with no evidence bundle; report includes verification transcript and pointers.
  • Delta audit: sinceAuditId to compare changes between two audits.
  • Framework-specific: frameworks=["SOC2"] limits control set and sections rendered.

Error Paths

sequenceDiagram
    actor AUD as Auditor
    participant GW as API Gateway
    participant AS as Audit Service

    AUD->>GW: POST /compliance/v1/audits {malformed}
    alt 400 Bad Request
        GW-->>AUD: 400 Problem+JSON
    else 404 Not Found (tenant/route/auditId)
        GW-->>AUD: 404 Problem+JSON
    else 409 Conflict (modify running audit / duplicate request-id)
        GW-->>AUD: 409 Problem+JSON
    else 429/503 Backpressure/Dependency down
        GW-->>AUD: 429/503 Problem+JSON (+Retry-After)
    end

Request/Response Specifications

Input Requirements

| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| Method/Path | POST /compliance/v1/audits | Y | Create a compliance audit job | JSON body |
| Authorization | header | Y | Bearer <JWT> | Valid, not expired |
| x-tenant-id | header | Y | Tenant scope | Matches body.tenant |
| traceparent | header | O | W3C trace context | 55-char |
| tenant | string | Y | Target tenant | ^[A-Za-z0-9._-]{1,128}$ |
| scope | object | Y | {time:{from,to}, resourceTypes?, actors?} | UTC ISO-8601, bounded |
| frameworks | array | O | ["GDPR","HIPAA","SOC2"] | Allowlist |
| options.verifyIntegrity | bool | O | Run integrity verification | Default: true |
| options.includeEvidence | enum | O | none \| sampled \| full | |
| options.sampleRate | number | O | 0–1 for sampled proofs | Bounds checked |
| webhook.url/secretId | string | O | Completion callback + HMAC | HTTPS + known key |
| idempotency-key | header | O | De-duplicate create | ≤ 128 chars |

Control & Status

  • GET /compliance/v1/audits/{auditId}
  • POST /compliance/v1/audits/{auditId}:cancel
  • GET /compliance/v1/audits/{auditId}/report (redirect/URL)
  • GET /compliance/v1/audits/{auditId}/evidence (if produced)
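A client typically polls the status endpoint until a terminal state before fetching the report. A minimal sketch; `get_status` is a hypothetical stand-in for the GET call:

```python
import time

TERMINAL_STATES = {"Completed", "Failed", "Canceled"}

def wait_for_audit(get_status, audit_id, poll_seconds=5.0, timeout_seconds=600.0):
    """Poll GET /compliance/v1/audits/{auditId} until a terminal state."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status(audit_id)   # -> {"auditId": ..., "status": ..., ...}
        if status["status"] in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"audit {audit_id} still {status['status']}")
        time.sleep(poll_seconds)
```

When a webhook is configured, the Compliance.ReportReady callback replaces polling; this loop is the fallback.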

Output Specifications

Create — 202 Accepted

| Field | Type | Description |
|---|---|---|
| auditId | string | Operation id (ULID/GUID) |
| status | enum | Queued \| Collecting \| Verifying \| Assembling \| Completed \| Failed \| Canceled |
| summaryUrl | url? | Interim human-readable status |
| reportUrl | url? | Set when ready |

Status — 200 OK

| Field | Type | Description |
|---|---|---|
| auditId | string | Identifier |
| status | enum | Terminal or running state |
| counts | object | {records, proofsChecked, holds, eligible} |
| verifications | object | {merkleRoot, keyIds[], ok, failures[]} |
| reportUrl / evidenceUrl | url? | Delivery |

Report (concept outline)

  • Executive Summary (scope, date range, frameworks)
  • Data Integrity (roots, signatures, verification transcript)
  • Lifecycle & Retention (eligibleAt coverage, purge windows)
  • Legal Holds (active timeline, affected records/partitions)
  • Redaction & Privacy Controls (profiles, sampling of masked fields)
  • Exceptions & Findings (severity, impacted scope)
  • Appendices (inputs, hashes, timestamps, key ids)

Error Handling

Error Scenarios

| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Malformed scope/time window; unsupported framework; invalid sampleRate | Fix request | No retry until corrected |
| 401 | Missing/invalid JWT | Acquire valid token | Retry after renewal |
| 403 | Missing audit:compliance.run or cross-tenant attempt | Request proper scope/role | |
| 404 | Unknown auditId/tenant; route disabled | Verify ids/tenant | |
| 409 | Modify/cancel while running; duplicate idempotency-key | Wait for terminal state or change key | Retry after fix |
| 412 | If-Match mismatch on update/cancel | Fetch latest status and retry | Conditional retry |
| 422 | Evidence size would exceed cap; incompatible options (full with restricted edition) | Adjust options | |
| 429 | Per-tenant/global audit concurrency limit | Honor Retry-After | Backoff + jitter |
| 503 | Read/Integrity/Policy service unavailable | Wait for recovery | Idempotent retry |

Failure Modes

  • Proof sampling too low/high: report flags sampling level; enforce min/max per policy.
  • Key unavailability: signature verification deferred; report marks inconclusive for specific windows with remediation steps.
  • Projection lag: report includes watermark; sections constrained to consistent point-in-time.

Recovery Procedures

  1. Reduce evidence mode to sampled or raise cap via admin policy if 422.
  2. Re-run verification portion when keys/services recover; re-issue report with new signature.
  3. For 409/412, poll latest status, then retry control action.

Performance Characteristics

Latency Expectations

  • Time-to-summary p95 ≤ 60 s for 24–48h windows; full verification depends on scope and sampling.

Throughput Limits

  • Concurrency caps per tenant (e.g., ≤ 2 running audits); global worker pool bounded.

Resource Requirements

  • CPU for hashing/verification; I/O for evidence fetch; memory for report assembly (streamed).

Scaling Considerations

  • Parallelize by time/partition slices; verify proofs in worker pool; stream artifact assembly to object storage.
  • Use seek pagination and limit evidence to sampled mode for very large scopes.
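Parallelizing by time slices amounts to splitting the audit window into contiguous half-open intervals, one per worker. A minimal sketch:

```python
from datetime import datetime, timedelta, timezone

def time_slices(start, end, width):
    """Yield contiguous half-open [from, to) slices covering [start, end),
    so each slice can be collected/verified by an independent worker."""
    cursor = start
    while cursor < end:
        upper = min(cursor + width, end)
        yield cursor, upper
        cursor = upper
```

Half-open slices match the inclusive-start/exclusive-end convention used elsewhere in this document, so no record falls into two slices.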

Security & Compliance

Authentication

  • OIDC JWT at Gateway; optional mTLS for service-to-service.

Authorization

  • Require audit:compliance.run to start; audit:compliance.read to fetch results; strict tenant RLS.

Data Protection

  • Reports/evidence encrypted at rest; presigned URLs short-lived and least-privilege; webhook payloads HMAC-signed.

Compliance

  • Report is signed (JWS/COSE) with kid; includes verification transcript, watermarks, and policy revisions used.
  • All audit actions are themselves audited (actor, purpose, scope, outputs).

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| compliance_audits_active | gauge | Running audits | > tenant/global cap |
| compliance_audit_duration_seconds | histogram | Runtime per audit | p95 > SLO |
| integrity_verifications_total | counter | Proof checks performed | Trend |
| verification_failures_total | counter | Failed proof checks | > 0 sustained |
| report_build_failures_total | counter | Report assembly/sign failures | > 0 |

Logging Requirements

  • Structured logs: tenant, auditId, scopeHash, frameworks[], proofsChecked, failures, watermark, kid. No PII.

Distributed Tracing

  • Spans: audit.collect, policy.fetch, integrity.verify, report.assemble, object.put, webhook.post.
  • Attributes: sampleRate, evidenceMode, bytes, records.

Health Checks

  • Readiness: Read/Integrity/Policy reachable; KMS key loadable.
  • Liveness: job queue draining; no stuck Verifying/Assembling states.

Operational Procedures

Deployment

  1. Deploy Audit Service; expose /compliance/v1/audits routes.
  2. Configure KMS signing keys and buckets for artifacts.
  3. Validate E2E on staging: create → verify → signed report downloadable.

Configuration

  • Env: AUDIT_MAX_CONCURRENCY_PER_TENANT, AUDIT_SAMPLE_RATE_DEFAULT, AUDIT_EVIDENCE_CAP_BYTES, PRESIGN_TTL_SEC, REPORT_SIGNING_KID.
  • Policy: min/max sampling, allowed frameworks per edition.

Maintenance

  • Rotate signing keys; prune expired artifacts; archive reports according to retention.
  • Periodic verification health checks against known-good test datasets.

Troubleshooting

  • Verification failures → inspect key rotation, integrity roots, time window alignment.
  • Large artifacts → switch to sampled mode; extend caps only if justified.
  • Frequent 409/412 → ensure clients poll before modifying audit jobs.

Testing Scenarios

Happy Path Tests

  • Create audit with verifyIntegrity=true, includeEvidence=sampled → signed report produced; verification transcript included.
  • Fetch report/evidence; signature validates with published public key.

Error Path Tests

  • 400 malformed scope; 404 unknown auditId; 409 modify while running.
  • 422 evidence exceeds cap triggers clear guidance; 429/503 backoff works.

Performance Tests

  • p95 time-to-summary ≤ 60 s; verify scaling across parallel slices.
  • Sampled proof checks meet throughput targets.

Security Tests

  • RBAC scopes enforced; presigned URLs expire; webhook HMAC validated.
  • Report signature verifies via JWS/COSE with current kid.

Internal References

External References

  • RFC 7807 (Problem Details)
  • W3C Trace Context
  • JWS (RFC 7515) / COSE (RFC 8152)

Appendices

A. Example Problem+JSON (evidence cap exceeded)

{
  "type": "urn:connectsoft:errors/compliance/evidence.cap.exceeded",
  "title": "Evidence bundle too large",
  "status": 422,
  "detail": "Estimated evidence size 8.4GB exceeds cap 5GB. Use includeEvidence=sampled or narrow scope.",
  "traceId": "9f0c1d2e3a4b5c6d...",
  "errors": [{"pointer": "/options/includeEvidence", "reason": "cap-exceeded"}]
}
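A client receiving this Problem+JSON should branch on the machine-readable type rather than parsing detail. A hedged sketch, assuming the request body shape used elsewhere in this flow (adjust_request and the field handling are illustrative, not a published client API):

```python
import json

# Error type from the appendix example above.
CAP_EXCEEDED = "urn:connectsoft:errors/compliance/evidence.cap.exceeded"

def adjust_request(request: dict, problem_json: str) -> tuple[dict, bool]:
    """On the evidence-cap error, downgrade to sampled evidence and signal a retry.
    Returns (possibly adjusted request, should_retry)."""
    problem = json.loads(problem_json)
    if problem.get("status") == 422 and problem.get("type") == CAP_EXCEEDED:
        fixed = dict(request)
        fixed["options"] = {**request.get("options", {}), "includeEvidence": "sampled"}
        return fixed, True   # retry with sampled evidence, per the error guidance
    return request, False    # not recoverable by this handler
```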

B. Report Verification (outline)

  1. Download report.json and report.sig (or signed PDF).
  2. Verify signature with published JWK/PEM (kid in report header).
  3. Re-run sample integrity proofs listed in the transcript; compare roots.
  4. Confirm watermarks and policy revision ids match tenant records.
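Step 2 above selects the public key by the kid in the report header. A small sketch of extracting that protected header from a compact JWS using only the standard library; the signature check itself still requires a crypto library and the JWK matching the returned kid:

```python
import base64
import json

def jws_header(compact_jws: str) -> dict:
    """Decode the protected header of a compact JWS (header.payload.signature).
    Only header parsing is shown; verifying the signature needs the JWK
    selected by the returned kid and a crypto library."""
    header_b64 = compact_jws.split(".")[0]
    padded = header_b64 + "=" * (-len(header_b64) % 4)  # restore base64url padding
    return json.loads(base64.urlsafe_b64decode(padded))
```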

Integrity Verification Flow

Runs an on-demand proof check for one or more records, validating leaf hash → Merkle path → block/segment root → signature. Produces a per-record evidence report (OK|FAIL|INCONCLUSIVE) and supports degraded mode when some materials (e.g., keys, archived proofs) are unavailable.
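The leaf → path → root portion of this pipeline can be sketched as a Merkle inclusion check. The (sibling, position) path encoding below is an assumption for illustration; ATP's stored proof format may differ, and the root-signature step is separate:

```python
import hashlib
import hmac

def verify_inclusion(leaf_hash: bytes, path: list, expected_root: bytes) -> bool:
    """Recompute the root from a leaf hash and its Merkle audit path.
    Each path entry is (sibling_hash, position) with position in
    {"left", "right"} — an assumed encoding, not ATP's actual format."""
    node = leaf_hash
    for sibling, position in path:
        # Concatenation order depends on which side the sibling sits on.
        node = hashlib.sha256(sibling + node if position == "left" else node + sibling).digest()
    return hmac.compare_digest(node, expected_root)
```

A failed comparison here corresponds to the pathVerify step reporting FAIL in the per-item result.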


Overview

Purpose: Allow clients and auditors to independently verify that returned records are authentic and untampered, using stored proofs and signatures.
Scope: Request intake, materialization of proof inputs (leaf, path, roots, signatures), verification pipeline, degraded-mode policies, report generation, and optional evidence bundle. Excludes integrity creation/sealing (see Integrity Chain flow).
Context: The Integrity Service reads Integrity Store/Evidence Store (paths, roots, manifests) and may call KMS/HSM or use public keys to verify signatures.
Key Participants:

  • Client (verifier)
  • API Gateway
  • Integrity Service (verifier/orchestrator)
  • Evidence Store / Integrity Store (proofs, roots, manifests)
  • KMS/HSM or Key Registry (public keys / verification)
  • Object Storage (optional evidence bundles)

Prerequisites

System Requirements

  • API Gateway with TLS and JWT validation
  • Integrity Service with read access to Integrity/Evidence stores and key registry
  • Object storage bucket for optional per-request evidence bundles
  • Time source synchronized; hash and signature algorithms configured

Business Requirements

  • Tenant integrity policy defines algorithms (e.g., SHA-256, Ed25519) and acceptable degraded modes
  • Retention of proofs/manifests meets verification SLAs
  • Auditing enabled for verification requests

Performance Requirements

  • p95 verification latency ≤ 200 ms for single-record checks (cached proofs)
  • Batch verification throughput meets SLO (e.g., 2k–10k records/s with precomputed paths)
  • Backpressure & rate limits for large batches

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    actor CL as Client
    participant GW as API Gateway
    participant INT as Integrity Service
    participant EVI as Evidence Store / Integrity Store
    participant KMS as KMS/HSM or Key Registry
    participant OBJ as Object Storage (optional)

    CL->>GW: POST /integrity/v1/verify {tenant, items[], mode: "full"}
    GW->>INT: Forward (authN/Z, x-tenant-id, traceparent)
    INT->>EVI: Fetch materials (leaf hash or record, path, blockRoot, manifest)
    INT->>KMS: Load/validate public key (by kid) and verify signature(root)
    KMS-->>INT: ok {kid, alg}
    INT->>INT: Verify inclusion (leaf→path→blockRoot) and chain(root→segmentRoot?)
    alt returnEvidence = "bundle"
        INT->>OBJ: PUT evidence.zip (paths, manifest, key metadata)
    end
    INT-->>GW: 200 OK {perItemResults[], summary, evidenceUrl?}
    GW-->>CL: 200 OK

Alternative Paths

  • Fast mode: mode="fast" skips recomputation of leaf hash when caller supplies leafHash; verifies path→root→signature only.
  • Degraded mode: allowDegraded=true permits INCONCLUSIVE with reasons (e.g., signature service offline) while still verifying available steps.
  • External leaf: caller provides payload to hash server-side (canonicalization rules applied).
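The external-leaf path above applies canonicalization before hashing. A simplified sketch using sorted-key, compact, UTF-8 JSON; ATP's actual canonicalization rules may be stricter (e.g., a scheme along the lines of RFC 8785 JCS):

```python
import hashlib
import json

def canonical_leaf_hash(payload: dict) -> str:
    """Hash a payload deterministically: sorted keys, no whitespace, UTF-8.
    A simplified stand-in for a full canonicalization scheme; the service's
    actual rules may differ."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Without a fixed canonical form, two semantically identical payloads can hash differently, which is exactly what the 422 "cannot be canonicalized" error below guards against.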

Error Paths

sequenceDiagram
    actor CL as Client
    participant GW as API Gateway
    participant INT as Integrity Service

    CL->>GW: POST /integrity/v1/verify {malformed}
    alt Bad request (invalid item spec/algorithm)
        GW-->>CL: 400 Bad Request (Problem+JSON)
    else Not found (record/proof/manifest missing)
        GW-->>CL: 404 Not Found (Problem+JSON)
    else Conflict (verify while block is resealing/rotating)
        GW-->>CL: 409 Conflict (Problem+JSON)
    else Unauthorized/Forbidden
        GW-->>CL: 401/403 (Problem+JSON)
    else Rate limit / dependency down
        GW-->>CL: 429/503 (Retry-After)
    end

Request/Response Specifications

Input Requirements

| Field | Type | Req | Description | Validation |
| --- | --- | --- | --- | --- |
| Method/Path | POST /integrity/v1/verify | Y | Start verification | JSON body |
| Authorization | header | Y | Bearer <JWT> | Valid, not expired |
| x-tenant-id | header | Y | Tenant scope | Must match body.tenant |
| traceparent | header | O | W3C trace context | 55-char |
| tenant | string | Y | Target tenant | ^[A-Za-z0-9._-]{1,128}$ |
| mode | enum | O | full (default) or fast | — |
| allowDegraded | bool | O | Permit partial verify | default: false |
| returnEvidence | enum | O | none (default) or bundle | — |
| items[] | array | Y | Records to verify | 1–10k items |
| items[].recordId | string | O* | Record identifier | ULID/GUID |
| items[].leafHash | string | O* | Base64url/hex hash | matches algorithm |
| items[].payload | object | O* | Canonicalizable payload | size bounded |
| items[].algorithm | enum | O | sha256 (default) | allowlist |
| items[].expectedRoot | string | O | Optional asserted root | must match stored |
| idempotency-key | header | O | De-dupe request | ≤ 128 chars |

*Provide at least one of recordId, leafHash, or payload.
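The constraints above (tenant pattern, batch bounds, at-least-one-of rule) can be sketched as a request validator; the function name and error strings are illustrative, not the service's actual implementation:

```python
import re

TENANT_RE = re.compile(r"^[A-Za-z0-9._-]{1,128}$")

def validate_verify_request(body: dict) -> list:
    """Return Problem+JSON-style pointer/reason strings for a
    /integrity/v1/verify body; an empty list means the request passes."""
    errors = []
    if not TENANT_RE.match(body.get("tenant", "")):
        errors.append("/tenant: pattern")
    items = body.get("items") or []
    if not 1 <= len(items) <= 10_000:
        errors.append("/items: 1-10k items required")
    for i, item in enumerate(items):
        # The footnote's rule: at least one proof input per item.
        if not any(k in item for k in ("recordId", "leafHash", "payload")):
            errors.append(f"/items/{i}: one of recordId|leafHash|payload required")
    return errors
```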

Output Specifications

200 OK

| Field | Type | Description | Notes |
| --- | --- | --- | --- |
| results[] | array | Per-item verification results | See below |
| summary | object | {ok, fail, inconclusive} | Counts |
| evidenceUrl | url? | If returnEvidence=bundle | Presigned, short-lived |
| policyRevisionUsed | int | Integrity policy revision | For audit |

Per-item result

{
  "input": {"recordId":"01JF…","algorithm":"sha256"},
  "steps": {
    "leafHash": {"status":"OK","computed":"8a4f..."},
    "pathVerify": {"status":"OK","depth":17},
    "rootSignature": {"status":"OK","kid":"int-key-2025","alg":"Ed25519"},
    "chainLink": {"status":"OK","segment":"seg_2025_10_22"}
  },
  "status": "OK",                // OK | FAIL | INCONCLUSIVE
  "degraded": false,             // true if allowed and used
  "reason": null,                // failure/inconclusive reason
  "timingsMs": {"total": 42, "leaf": 1, "path": 6, "sig": 8}
}

Example Payloads

// Full verification by recordId
{
  "tenant": "acme",
  "mode": "full",
  "items": [
    {"recordId": "01JF3W8KTR2D3WQF3B9R0KJY9Y", "algorithm": "sha256"}
  ],
  "returnEvidence": "bundle"
}
// Fast verification using supplied leafHash and allowing degraded mode
{
  "tenant": "acme",
  "mode": "fast",
  "allowDegraded": true,
  "items": [
    {"leafHash": "8a4f...", "expectedRoot": "d1c2...", "algorithm": "sha256"}
  ]
}

Error Handling

Error Scenarios

| HTTP Code | Scenario | Recovery Action | Retry Strategy |
| --- | --- | --- | --- |
| 400 | Malformed body; none of recordId, leafHash, or payload provided; unsupported algorithm | Fix request | No retry until corrected |
| 401 | Missing/invalid JWT | Acquire valid token | Retry after renewal |
| 403 | Missing audit:integrity.verify or tenant mismatch | Request proper scope/role | — |
| 404 | Record/proof/manifest not found | Verify id/scope; ensure proofs retained | — |
| 409 | Verification against block being resealed/rotated | Retry after block state settles | Short backoff |
| 412 | If-Match on root version failed | Fetch latest root/manifest; retry | Conditional retry |
| 422 | Payload cannot be canonicalized to leaf hash | Use server-known recordId or supply leafHash | — |
| 429 | Rate limited for batch or per-tenant | Honor Retry-After | Exponential backoff + jitter |
| 503 | Evidence store, key service, or integrity store unavailable | Wait for recovery | Idempotent retry |

Failure Modes

  • Missing signature key (archived/rotated): inclusion verified, signature step INCONCLUSIVE when allowDegraded=true.
  • Archived proofs (cold tier): request becomes async; 202 with later webhook/report when materials restored.
  • Projection drift: record exists but proof not yet sealed; respond 409 until seal completes.

Recovery Procedures

  1. On 409/412, fetch latest block status/root and retry verification.
  2. If 503/429, back off; request is idempotent by (tenant, itemsHash, idempotency-key?).
  3. When proofs are archived, re-issue request with allowDegraded=true or wait for restoration event.
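Step 2 above makes retries safe by keying idempotency on (tenant, itemsHash, idempotency-key). A sketch of deriving that hash plus a backoff helper that honors Retry-After; the names and defaults are illustrative:

```python
import hashlib
import json
import random

def items_hash(tenant: str, items: list) -> str:
    """Stable digest over (tenant, items): sorted keys and compact separators
    make the hash independent of field ordering in each item."""
    canonical = json.dumps({"tenant": tenant, "items": items},
                           sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

def backoff_delay(attempt: int, retry_after=None, base=0.5, cap=30.0) -> float:
    """Exponential backoff with full jitter; a Retry-After hint (429/503) wins."""
    if retry_after is not None:
        return float(retry_after)
    return random.uniform(0, min(cap, base * 2 ** attempt))
```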

Performance Characteristics

Latency Expectations

  • Single-record, cached materials: p95 ≤ 200 ms.
  • Batch with precomputed paths: thousands/sec per verifier instance.

Throughput Limits

  • Per-tenant verification QPS caps; batch size limits (e.g., ≤ 1k items/request).

Resource Requirements

  • CPU-bound hashing/path checks; memory proportional to path depth and batch size; small I/O for manifest/path fetch.

Scaling Considerations

  • Cache recent roots and key material by kid.
  • Pre-fetch proof paths for hot records; shard verifier workers by tenant/segment.
  • Use asynchronous retrieval for cold-storage proofs.

Security & Compliance

Authentication

  • OIDC JWT at Gateway; service credentials for store/key access.

Authorization

  • Require audit:integrity.verify; enforce x-tenant-id RLS.

Data Protection

  • Do not log payloads or raw proofs; only hashes and ids.
  • Evidence bundles are encrypted at rest and shared via short-lived presigned URLs.

Compliance

  • Verification report contains key ids, algorithms, roots, and timestamps for chain-of-custody.
  • Degraded-mode decisions are explicit and auditable.

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
| --- | --- | --- | --- |
| verify_requests_total | counter | Verification requests | Trend |
| verify_latency_ms | histogram | End-to-end latency | p95 > SLO |
| verify_failures_total | counter | Items with FAIL | > 0 sustained |
| verify_inconclusive_total | counter | Degraded outcomes | Spike watch |
| proof_cache_hit_ratio | gauge | Cache effectiveness | < 0.8 sustained |

Logging Requirements

  • Structured logs: tenant, requestId, batchSize, ok/fail/inconclusive, alg, kid, degraded. No PII or payloads.

Distributed Tracing

  • Spans: materials.fetch, leaf.hash, path.verify, sig.verify, bundle.pack.
  • Attributes: pathDepth, kid, mode, degraded.

Health Checks

  • Readiness: evidence/key stores reachable; cache warmed.
  • Liveness: verifier queue drains; no stuck requests beyond timeout.

Operational Procedures

Deployment

  1. Deploy Integrity Service; expose /integrity/v1/verify.
  2. Configure key registry/KMS access and algorithm allowlist.
  3. Warm caches with latest roots and public keys.

Configuration

  • Env: VERIFY_MAX_BATCH, VERIFY_RATE_LIMITS, KEY_CACHE_TTL, ROOT_CACHE_TTL, EVIDENCE_BUNDLE_TTL.
  • Policy: allowed degraded modes; acceptable algorithms; maximum batch sizes.

Maintenance

  • Rotate verification keys and update registry; verify legacy roots with retained public keys.
  • Periodically test cold-proof restore paths.

Troubleshooting

  • Rising INCONCLUSIVE → check KMS availability and key retention.
  • High FAIL rates → inspect canonicalization/version mismatches or corrupted paths.
  • Latency spikes → verify cache TTLs and storage hot/cold tiering.

Testing Scenarios

Happy Path Tests

  • Verify by recordId with full steps → status=OK, signature validated.
  • Batch verify with provided leafHash → status=OK for all items; summary counts correct.

Error Path Tests

  • 400 when no recordId|leafHash|payload; 404 for unknown record/proof.
  • 409 when verifying during reseal; 412 when root version mismatches.
  • 429/503 induce backoff and successful retry.

Performance Tests

  • Achieve target throughput with cached proofs; measure p95 latency.
  • Stress with 10k items; ensure backpressure and partial progress reporting.

Security Tests

  • RBAC scopes enforced; cross-tenant blocked.
  • Evidence bundle URL expiry honored; keys validated by kid.

Internal References

External References

  • RFC 7807 (Problem Details)
  • W3C Trace Context

Appendices

A. Example Problem+JSON (degraded not allowed)

{
  "type": "urn:connectsoft:errors/integrity/degraded.disallowed",
  "title": "Degraded verification not permitted",
  "status": 422,
  "detail": "Key service unavailable and allowDegraded is false.",
  "traceId": "9f0c1d2e3a4b5c6d..."
}

B. Evidence Bundle (concept)

evidence/
  item_01JF…/
    leaf.txt
    path.json
    manifest.json
    root.sig           # JWS/COSE detached signature
    key-metadata.json  # {kid, alg, issuer, notBefore, notAfter}
README.txt             # verification instructions

Tamper Detection Flow

Continuously (or on-demand) scans integrity materials to detect anomalies—such as gaps, forks, reseals outside policy, signature/key issues, or out-of-order segments—then alerts and escalates with actionable context. The pipeline emphasizes low false positives through suppression, correlation, and thresholds.


Overview

Purpose: Proactively detect and surface potential tampering or integrity regressions before consumers encounter affected data.
Scope: Scheduling, scope planning, chain/segment/manifest checks, anomaly scoring & suppression, alerting/escalation, and case tracking. Excludes remediation (sealing/repair) which is handled by operations runbooks.
Context: Runs within the Integrity Validator component against Integrity/Evidence Stores and Key Registry/KMS; feeds alerts to Observability and Incident Management systems.
Key Participants:

  • Scheduler/Detector Orchestrator
  • Integrity Validator (check runners, anomaly detector)
  • Integrity Store / Evidence Store (roots, manifests, paths)
  • Key Registry/KMS (public keys, validity windows)
  • Alerting / On-Call (Pager/Email/Webhooks)
  • SIEM / Case Manager (ticketing, correlation)

Prerequisites

System Requirements

  • Validator has read access to Integrity/Evidence stores and Key Registry
  • Object storage reachable for manifests and archived proofs
  • Time synchronization across services; policy cache warm (algorithms, seal cadence)

Business Requirements

  • Tenant integrity policy defines seal cadence, allowed reseal windows, acceptable algorithms, and escalation paths
  • Alert routing configured (webhooks/pager) with on-call schedule
  • Compliance logging enabled for anomaly events

Performance Requirements

  • Chain scan p95 ≤ 2 min per segment; continuous mode amortized to keep staleness ≤ 5 min
  • Alert fan-out latency p95 ≤ 30 s
  • Bounded load on stores (rate-limited walkers)

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    participant SCH as Scheduler
    participant VAL as Integrity Validator
    participant IST as Integrity/Evidence Store
    participant KMS as Key Registry/KMS
    participant ALR as Alerting (Pager/Webhook)
    participant SIEM as SIEM/Case Manager

    SCH->>VAL: Tick {tenant, window, policyRevision}
    VAL->>IST: Enumerate segments/blocks within window
    loop For each segment
        VAL->>IST: Fetch manifests + roots + metadata
        VAL->>KMS: Get key by kid, check validity window
        VAL->>VAL: Run checks (gap/fork/order/sig/freshness/seal cadence)
    end
    VAL->>VAL: Score & suppress duplicates, correlate with recent changes
    alt Anomalies found
        VAL->>ALR: Create alert {type, severity, evidence pointers}
        ALR-->>VAL: Ack alert id
        VAL->>SIEM: Open case/ticket {links to evidence}
    else No anomalies
        VAL->>VAL: Record heartbeat metric & watermark
    end

Alternative Paths

  • On-demand scan: operator invokes POST /integrity/v1/tamper-detection:scan for a tenant/time range.
  • Hot segment watch: watch new blocks; verify seal cadence and signature freshness in near-real-time.
  • Degraded verification: if keys unavailable, emit warning with degraded=true (no hard alert) depending on policy.

Error Paths

sequenceDiagram
    participant OP as Operator
    participant GW as API Gateway
    participant VAL as Integrity Validator

    OP->>GW: POST /integrity/v1/tamper-detection:scan {malformed}
    alt 400 Bad Request (invalid window/tenant/algo)
        GW-->>OP: 400 Problem+JSON
    else 404 Not Found (unknown detectorId/tenant)
        GW-->>OP: 404 Problem+JSON
    else 409 Conflict (scan already running for same scope)
        GW-->>OP: 409 Problem+JSON
    else 429/503 (rate limit/dependency down)
        GW-->>OP: 429/503 Problem+JSON (+Retry-After)
    end

Request/Response Specifications

Input Requirements

| Field | Type | Req | Description | Validation |
| --- | --- | --- | --- | --- |
| Method/Path | POST /integrity/v1/tamper-detection:scan | O | On-demand scan trigger | JSON body |
| Authorization | header | Y | Bearer <JWT> | Valid, not expired |
| x-tenant-id | header | Y | Tenant scope | Matches body.tenant |
| tenant | string | Y | Target tenant | ^[A-Za-z0-9._-]{1,128}$ |
| window | object | O | {from,to} override | ISO-8601 UTC, bounded |
| checks | array | O | Subset of (gap, fork, order, seal, sig, freshness) | allowlist |
| severityThreshold | enum | O | info, low, medium, high, critical | default: medium |
| suppressWindow | string | O | Duplicate suppression (e.g., 10m) | ≤ policy max |
| traceparent | header | O | W3C trace context | 55-char |
| idempotency-key | header | O | De-dupe create | ≤ 128 chars |

Output Specifications

202 Accepted / 200 OK

| Field | Type | Description | Notes |
| --- | --- | --- | --- |
| scanId | string | Operation id | ULID/GUID |
| status | enum | Queued, Running, Completed, Failed | — |
| summary | object | {checkedSegments, anomalies, degraded} | Final on 200 |
| watermark | string | Latest segment time examined | ISO-8601 UTC |

Anomaly Event (concept)

{
  "tenant": "acme",
  "type": "Integrity.ForkDetected",
  "severity": "high",
  "segment": "seg_2025_10_22",
  "policyRevision": 12,
  "details": {
    "roots": ["9a1c...", "77fb..."],
    "firstSeenAt": "2025-10-22T12:00:07Z",
    "evidence": {"manifestUrl": "s3://.../seg_2025_10_22.manifest.json"}
  }
}

Error Handling

Error Scenarios

| HTTP Code | Scenario | Recovery Action | Retry Strategy |
| --- | --- | --- | --- |
| 400 | Invalid time window or checks list; from >= to | Fix request | No retry until corrected |
| 401 | Missing/invalid JWT | Acquire valid token | Retry after renewal |
| 403 | Missing audit:integrity.tamper.scan | Request proper role/scope | — |
| 404 | Unknown detectorId/tenant | Verify ids/tenant | — |
| 409 | Scan already running for same {tenant, window} | Wait for completion or use different scope | Retry after fix |
| 422 | Suppression window exceeds policy | Adjust parameter | — |
| 429 | Rate limited | Honor Retry-After | Backoff + jitter |
| 503 | Integrity/Evidence/Key service unavailable | Wait for recovery | Idempotent retry |

Failure Modes

  • Transient fork (eventual consistency): auto-downgrade to warning unless it persists beyond stabilityDelay.
  • Key rotation gap: signatures verify with new kid but manifests still reference old key; mark degraded=false, add remediation hint.
  • Late seal: block sealed outside allowed window; alert severity based on policy (medium → high if repeated).

Recovery Procedures

  1. For 409, query scan status and avoid duplicate runs; use idempotency-key.
  2. For intermittent fork/gap, re-scan after stabilityDelay; escalate only if repeated.
  3. On 503/429, validator backs off automatically; operator may re-issue trigger.

Performance Characteristics

Latency Expectations

  • Segment check p95 ≤ 2 min; near-real-time watch detects issues within ≤ 5 min of occurrence.

Throughput Limits

  • Bounded walkers per tenant (e.g., ≤ 2 concurrent); global cap to protect stores.

Resource Requirements

  • CPU for hashing/verification; small read IO for manifests/roots; minimal memory with streaming checks.

Scaling Considerations

  • Shard by tenant and segment time; cache recent roots and valid kids.
  • Use adaptive sampling: deep checks on hot segments; summary checks elsewhere.
  • Apply duplicate suppression windows to maintain low FP.
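The duplicate-suppression bullet above can be sketched as a small dedupe cache keyed by (tenant, segment, type), mirroring the dedupe keys recommended under Troubleshooting; the in-memory store and API are illustrative only:

```python
import time

class AlertSuppressor:
    """Drop duplicate anomaly alerts for the same (tenant, segment, type)
    within a suppression window (seconds). A production detector would
    back this with a shared store rather than process memory."""
    def __init__(self, window_s: float = 600.0, clock=time.monotonic):
        self.window_s = window_s
        self.clock = clock
        self._last_seen = {}

    def should_emit(self, tenant: str, segment: str, anomaly_type: str) -> bool:
        key = (tenant, segment, anomaly_type)
        now = self.clock()
        last = self._last_seen.get(key)
        if last is not None and now - last < self.window_s:
            return False  # suppressed duplicate sighting
        self._last_seen[key] = now
        return True
```

Injecting the clock keeps the window testable and lets deployments use a monotonic source immune to wall-clock jumps.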

Security & Compliance

Authentication

  • OIDC JWT at Gateway; service credentials for store/key access.

Authorization

  • Require audit:integrity.tamper.scan (run) and audit:integrity.tamper.read (results).
  • Enforce tenant RLS via x-tenant-id.

Data Protection

  • Do not include payloads in alerts; only ids, hashes, URLs to manifests (access-controlled).
  • Evidence links shared as short-lived presigned URLs.

Compliance

  • All anomalies and operator triggers are audited with actor, purpose, scope, and policy revision.
  • Detector configuration changes tracked with forward-only revisions.

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
| --- | --- | --- | --- |
| tamper_scans_active | gauge | Running scans | > tenant/global cap |
| tamper_anomalies_total | counter | Anomalies by type/severity | Spike indicates issue |
| tamper_false_positives_total | counter | Operator-marked FP | > target triggers tuning |
| tamper_scan_duration_seconds | histogram | Scan runtime | p95 > SLO |
| tamper_degraded_checks_total | counter | Checks in degraded mode | Sustained rise → key/store health |

Logging Requirements

  • Structured logs: tenant, scanId, policyRevision, segmentsChecked, anomalies[], degraded, watermark. No PII.

Distributed Tracing

  • Spans: scan.plan, segment.fetch, check.run(type), alert.emit, case.open.
  • Attributes: segmentCount, checks, severity, suppressed.

Health Checks

  • Readiness: Integrity/Evidence stores and Key Registry reachable; policy cache loaded.
  • Liveness: scan queue advancing; no segment stuck beyond timeout.

Operational Procedures

Deployment

  1. Deploy Integrity Validator; enable scheduler and on-demand endpoint.
  2. Configure alert routes (pager/webhook) and SIEM integration.
  3. Validate with seeded test anomalies (simulated fork/gap).

Configuration

  • Env: DETECTOR_MAX_CONCURRENCY, DETECTOR_STABILITY_DELAY, DETECTOR_SUPPRESS_WINDOW, DETECTOR_DEFAULT_CHECKS.
  • Policy: seal cadence, reseal allowances, severity mappings, degraded-mode policy.

Maintenance

  • Tune thresholds using tamper_false_positives_total and incident postmortems.
  • Rotate keys and ensure manifests reference valid kids across rotations.

Troubleshooting

  • Repeated transient forks → increase stabilityDelay slightly; verify store replication lag.
  • Many degraded checks → investigate Key Registry/KMS availability.
  • Alert floods → widen suppression window; confirm dedupe keys include {tenant, segment, type}.

Testing Scenarios

Happy Path Tests

  • Continuous scan detects a forced manifest gap and raises a single actionable alert.
  • On-demand scan limits to given window and returns summary with watermark.

Error Path Tests

  • 400 on malformed window/checks; 404 unknown tenant; 409 duplicate scan scope.
  • 429/503 produce compliant backoff with no duplicate alerts.

Performance Tests

  • Segment check p95 ≤ 2 min; scan staleness ≤ 5 min under steady load.
  • Suppression prevents duplicate alerts during repeated sightings.

Security Tests

  • RBAC respected; cross-tenant access blocked.
  • Alerts contain no payload data; evidence URLs expire and are scoped.

Internal References

External References

  • RFC 7807 (Problem Details)
  • W3C Trace Context

Appendices

A. Example Problem+JSON (duplicate scope)

{
  "type": "urn:connectsoft:errors/detector/scope.conflict",
  "title": "Tamper scan already running for scope",
  "status": 409,
  "detail": "A scan for tenant 'acme' and window 2025-10-22T00:00Z..2025-10-22T12:00Z is already running.",
  "traceId": "9f0c1d2e3a4b5c6d...",
  "errors": [{"pointer": "/window", "reason": "duplicate-scope"}]
}

B. Anomaly Types (reference)

  • gap: missing block/segment in expected sequence
  • fork: two different roots for the same segment
  • order: out-of-order seal time or index
  • seal: seal outside configured cadence or early reseal
  • sig: signature invalid/key mismatch/outside validity window
  • freshness: seal/manifest not produced within SLA
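The gap and fork types above can be sketched as a scan over (segmentIndex, root) observations; the flat integer index model and field names are simplifying assumptions over ATP's real segment identifiers:

```python
def detect_gaps_and_forks(observations: list) -> list:
    """Scan (segment_index, root) pairs.
    gap  -> a missing index in the expected contiguous sequence
    fork -> two different roots observed for the same index"""
    anomalies = []
    roots_by_index = {}
    for idx, root in observations:
        roots_by_index.setdefault(idx, set()).add(root)
    indices = sorted(roots_by_index)
    for prev, curr in zip(indices, indices[1:]):
        for missing in range(prev + 1, curr):
            anomalies.append({"type": "gap", "segment": missing})
    for idx in indices:
        if len(roots_by_index[idx]) > 1:
            anomalies.append({"type": "fork", "segment": idx})
    return anomalies
```

In the real detector a transient fork would additionally be held for stabilityDelay before alerting, as described under Failure Modes.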

Key Rotation Flow

Safely rotates signing keys for integrity sealing and verification. Introduces a new key (kid_new) in KMS, publishes it via the Key Registry, enables a dual-verify window where both kid_old and kid_new are trusted for verification, then transitions the signer to kid_new and retires kid_old without breaking backward verification.


Overview

Purpose: Regularly rotate integrity signing keys while ensuring uninterrupted signing and verification, preserving the ability to verify historical signatures.
Scope: Key generation/activation, registry publication, signer switchover, dual-verify window, verifier cache refresh, deactivation/retirement, and audit events. Excludes general IAM/PKI hardening (covered elsewhere).
Context: Security (SecOps) initiates rotation in KMS/HSM. Key Registry (JWKS/COSE keyset) distributes public keys to Integrity Service (signer) and all Verifiers (Verification/Compliance services).
Key Participants:

  • Security (SecOps)
  • KMS/HSM (key creation, protection, activation windows)
  • Key Registry / Publisher (JWKS/COSE sets, versioning)
  • Integrity Service (Signer) (seals blocks with active kid)
  • Verification Services (Integrity Verify, Compliance Audit)
  • Event Bus / Observability (Key.Rotated, metrics/alerts)

Prerequisites

System Requirements

  • KMS/HSM reachable; policies allow key create/rotate/disable
  • Key Registry supports versioned JWKS/COSE publication with cache headers
  • Integrity Service can hot-reload signer kid without restart
  • Verifiers fetch/refresh keys on cache miss or via periodic refresh

Business Requirements

  • Rotation cadence defined (e.g., 90 days) and emergency rotation runbook approved
  • Dual-verify window configured (e.g., 14 days) and documented
  • Audit logging enabled for all key lifecycle operations

Performance Requirements

  • JWKS fetch p95 ≤ 200 ms; cache TTL tuned (e.g., 5–10 min)
  • Signer switchover ≤ 1 min between publish and activation
  • Verification failure rate due to unknown kid < 0.01% during rotation

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    actor SEC as Security (SecOps)
    participant KMS as KMS/HSM
    participant REG as Key Registry (JWKS/COSE)
    participant SIG as Integrity Service (Signer)
    participant VER as Verification Services
    participant BUS as Event Bus / Observability

    SEC->>KMS: CreateKey {alg:Ed25519, usage:sign, tags:{tenant, purpose}}
    KMS-->>SEC: KeyMetadata {kid_new, state:PreActive}
    SEC->>REG: Publish {kid_new, pubKey, notBefore, notAfter}
    REG-->>VER: JWKS {kid_old, kid_new} (cacheable)
    SEC->>SIG: Schedule Activate {kid_new, at: T0+5m}
    Note over VER,REG: Dual-verify window begins: verifiers trust {kid_old, kid_new}
    SEC->>BUS: Emit Key.RotationPlanned {kid_old, kid_new, at:T0+5m}
    SIG->>KMS: Load key {kid_new}
    SIG->>SIG: Activate signer kid = kid_new (at T0+5m)
    SIG->>BUS: Emit Key.Rotated {active:kid_new, retired:kid_old?}
    SEC->>KMS: Set kid_old to verify-only (disable sign) at T0+14d
    SEC->>REG: Unpublish kid_old (or mark as retiring) at T0+14d
    REG-->>VER: JWKS {kid_new} (kid_old removed after window)

Alternative Paths

  • Emergency rotation: immediate switch due to suspected compromise; shorten dual-verify window, revoke kid_old for signing at once; maintain verify-only if integrity permits.
  • Canary activation: enable kid_new for a subset of signers; verify end-to-end before global activation.
  • Per-region phased rollout: publish globally, activate region by region with overlap.

Error Paths

sequenceDiagram
    actor SEC as Security
    participant GW as API Gateway
    participant KM as KMS/HSM

    SEC->>GW: POST /keys/v1/rotate {alg:"foo"}  %% unsupported alg
    alt 400 Bad Request
        GW-->>SEC: 400 Problem+JSON
    else 404 Not Found (kid_old)
        GW-->>SEC: 404 Problem+JSON
    else 409 Conflict (active rotation in progress / multiple active signers)
        GW-->>SEC: 409 Problem+JSON
    else 503 KMS unavailable
        GW-->>SEC: 503 Problem+JSON (Retry-After)
    end

Request/Response Specifications

Input Requirements

| Field | Type | Req | Description | Validation |
| --- | --- | --- | --- | --- |
| Method/Path | POST /keys/v1/rotate | Y | Initiate rotation (planned) | JSON body |
| Authorization | header | Y | Bearer <JWT> (SecOps) | Role: security:keys.rotate |
| x-tenant-id | header | O | If tenant-scoped keys | Matches policy |
| algorithm | enum | O | Ed25519 (default) or ES256 | — |
| activateAt | timestamp | O | Planned activation time (UTC) | ≥ now+5m |
| dualVerifyWindow | duration | O | e.g., 14d | policy bounds |
| reason | string | O | Rotation rationale | ≤ 256 chars |
| idempotency-key | header | O | De-dupe | ≤ 128 chars |

Operations

  • POST /keys/v1/activate {kid} — force activate kid_new now (emergency).
  • POST /keys/v1/retire {kid} — set kid_old verify-only / disable sign.
  • GET /.well-known/jwks.json — public keys (Key Registry).
  • GET /keys/v1/status — signer active kid, registry freshness, next rotation date.

Output Specifications

202 Accepted / 200 OK

| Field | Type | Description |
| --- | --- | --- |
| kidOld | string | Previously active key id |
| kidNew | string | New key id to activate |
| activateAt | timestamp | Planned activation |
| dualVerifyWindow | string | Duration (e.g., P14D) |
| status | enum | Planned, Activating, Active, Retiring, Retired |

Key.Rotated Event (concept)

{
  "tenant": "platform",
  "kidOld": "int-key-2025-07",
  "kidNew": "int-key-2025-10",
  "activatedAt": "2025-10-22T11:00:00Z",
  "dualVerifyUntil": "2025-11-05T11:00:00Z"
}

Error Handling

Error Scenarios

| HTTP Code | Scenario | Recovery Action | Retry Strategy |
| --- | --- | --- | --- |
| 400 | Unsupported algorithm; invalid activateAt/dualVerifyWindow | Correct request | No retry until fixed |
| 401 | Missing/invalid JWT | Acquire valid token | Retry after renewal |
| 403 | Missing security:keys.rotate | Request proper role | — |
| 404 | kid_old not found; JWKS endpoint not available | Verify ids/registry | — |
| 409 | Rotation already in progress; multiple active signers detected | Wait or cancel/prune; ensure single active signer | Retry after fix |
| 412 | If-Match on signer version mismatch | Fetch status; retry with latest | Conditional retry |
| 422 | Dual-verify window outside policy bounds | Adjust window | — |
| 429 | Excessive rotation attempts | Honor Retry-After | Backoff + jitter |
| 503 | KMS/Registry unavailable | Wait for recovery | Idempotent retry |

Failure Modes

  • Verifier cache staleness: transient verify failures for kid_new until JWKS refreshed; verifiers must re-fetch on unknown_kid.
  • Key compromise: emergency path—disable signing for kid_old immediately; maintain verify-only if proofs still need validation, else revoke and mark proofs inconclusive with remediation guidance.
  • Clock skew: activation timestamps are UTC; signer defers switch until now ≥ activateAt + safetyMargin.

Recovery Procedures

  1. On unknown kid verification failures, force JWKS refresh and reprocess.
  2. If 409 multiple active signers, demote extras to verify-only and audit the window.
  3. For 503, pause activation and retry KMS/Registry operations with backoff.

Performance Characteristics

Latency Expectations

  • Signer key load & switchover ≤ 60 s from activation time.
  • JWKS refresh propagation to verifiers within TTL (e.g., ≤ 10 min).

Throughput Limits

  • JWKS endpoint sized for spike during rotation; CDN cache recommended.

Resource Requirements

  • Minimal CPU; network I/O for JWKS distribution; signer maintains small in-memory key cache.

Scaling Considerations

  • Stage keys ahead of activation; pre-warm caches by triggering background JWKS fetch on publish.
  • Stagger regional activations to limit burst load.

Security & Compliance

Authentication

  • SecOps endpoints protected by OIDC + fine-grained RBAC; service-to-service mTLS optional.

Authorization

  • Roles: security:keys.rotate, security:keys.activate, security:keys.retire, security:keys.read.

Data Protection

  • Private keys never leave KMS/HSM; signing via KMS APIs or HSM PKCS#11.
  • JWKS served over HTTPS with integrity headers; include kid, alg, use.

Compliance

  • All key lifecycle changes audited (who, when, why, diff).
  • Backward verification preserved: historical signatures tied to archived public keys and validity windows.

Monitoring & Observability

Key Metrics

Metric Type Description Alert Threshold
signer_active_kid gauge(label) Current signer kid Change outside window
verify_unknown_kid_total counter Verifications failing due to unknown kid > 0 sustained
jwks_cache_age_seconds gauge Age of verifier key cache > TTL
key_rotation_events_total counter Rotations/emergencies Annotate releases
sign_failures_total counter Signing errors post-activation > 0

Logging Requirements

  • Structured logs: kidOld, kidNew, activateAt, actor, status, reason, region. No private key material.

Distributed Tracing

  • Spans: kms.create, registry.publish, signer.activate, verifier.refresh.
  • Attributes: kid, alg, dualVerifyWindow, region.

Health Checks

  • Readiness: signer can load kid_new; registry reachable.
  • Liveness: signer reports active kid; verification path succeeds with both keys during window.

Operational Procedures

Deployment

  1. Ensure signer supports dynamic kid reload; deploy Registry with JWKS endpoint.
  2. Test canary rotation in staging with synthetic seals and verifications.
  3. Schedule production rotation with maintenance window & comms.

Configuration

  • Env: SIGNING_ACTIVE_KID, KEY_ROTATION_SAFETY_MARGIN_SEC, JWKS_CACHE_TTL_SEC, DUAL_VERIFY_WINDOW_DEFAULT.
  • Policy: rotation cadence, emergency procedures, window bounds.

Maintenance

  • Archive decommissioned public keys and manifests; keep for lifetime of signed data.
  • Regularly validate that verifiers honor unknown_kid → refresh path.

Troubleshooting

  • Spike in verify_unknown_kid_total → verify JWKS TTL, CDN invalidation, clock skew.
  • Signing failures post-activate → confirm KMS grants, key state, signer reload status.
  • Conflicting actives → audit deployment orchestrations; enforce single active signer guard.

Testing Scenarios

Happy Path Tests

  • Plan → publish → activate kid_new; verify new seals validate with both keys during window.
  • Post-window, verify historical proofs with kid_old and new proofs with kid_new.

Error Path Tests

  • 400 invalid algorithm/time; 404 unknown kid; 409 rotation already in progress.
  • 503 KMS/Registry outage causes graceful delay and retries.

Performance Tests

  • JWKS propagation within TTL; negligible signing latency change.
  • High verification traffic during rotation does not exceed registry capacity.

Security Tests

  • Private keys never leave KMS; signer only holds handles.
  • Emergency rotation disables signing for kid_old immediately; verify-only allowed as policy dictates.

Internal References

External References

  • JWS (RFC 7515) / JWKS (RFC 7517)
  • COSE (RFC 8152)

Appendices

A. Example Problem+JSON (rotation conflict)

{
  "type": "urn:connectsoft:errors/keys/rotation.conflict",
  "title": "Another rotation is already in progress",
  "status": 409,
  "detail": "Active signer kid is already scheduled to rotate at 2025-10-22T11:00:00Z.",
  "traceId": "9f0c1d2e3a4b5c6d..."
}

B. JWKS Example

{
  "keys": [
    {"kty":"OKP","crv":"Ed25519","kid":"int-key-2025-10","use":"sig","alg":"EdDSA","x":"lJp..."},
    {"kty":"OKP","crv":"Ed25519","kid":"int-key-2025-07","use":"sig","alg":"EdDSA","x":"h3Q...", "status":"verify-only","notAfter":"2025-11-05T11:00:00Z"}
  ]
}
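
Verifier-side key selection against a JWKS document of this shape can be sketched as follows (Python, illustrative only; the function name and return conventions are assumptions, not part of the ATP API). Verify-only keys remain valid for verification until `notAfter`; anything else is treated as an unknown kid, which should trigger a JWKS refresh per the recovery procedures above.

```python
from datetime import datetime, timezone

def select_verification_key(jwks, kid, now=None):
    """Pick the JWKS entry matching `kid`, rejecting keys past their
    `notAfter` window. Verify-only keys are still usable to verify."""
    now = now or datetime.now(timezone.utc)
    for key in jwks.get("keys", []):
        if key.get("kid") != kid:
            continue
        not_after = key.get("notAfter")
        if not_after and datetime.fromisoformat(not_after.replace("Z", "+00:00")) < now:
            return None  # past dual-verify window: treat as unknown kid
        return key
    return None  # unknown kid -> caller should force a JWKS refresh
```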

Retry Flow

Executes resilient retries with exponential backoff + jitter to achieve safe at-least-once delivery semantics. Failed operations are scheduled by the Retry Service, executed when due, and on terminal failure are DLQ-routed with full context. All retryable work must be idempotent via an idempotencyKey.


Overview

Purpose: Increase robustness of transient or downstream-dependent operations by automated retries with guardrails, while preventing thundering herds via jitter and honoring tenant backpressure.
Scope: Scheduling, backoff calculation, jitter, execution, success/failure reporting, DLQ routing, observability. Excludes business-specific compensation (see Compensation Flow).
Context: Sits alongside Ingestion, Export, Projection, etc. Services emit retryable tasks to the Retry Service; on success the original workflow continues; on terminal failure the task is routed to DLQ for manual/automated handling.
Key Participants:

  • Producer Service (emits retryable work)
  • Retry Service (scheduler + executor)
  • Target Service (downstream dependency being called)
  • DLQ / Review Tool (terminal task handling)
  • Event Bus / Metrics

Prerequisites

System Requirements

  • Retry Service deployed with durable queue and time-based scheduling
  • Clock synchronized (UTC); stable monotonic timers
  • Network egress to Target Services; circuit breaker library available
  • Idempotent endpoints or idempotency keys supported by Target Services

Business Requirements

  • Per-tenant retry policies (maxAttempts, baseDelay, cap, jitter, retryable codes)
  • DLQ process and ownership defined (runbook, on-call group)
  • Data minimization for task payloads; no sensitive values in logs

Performance Requirements

  • p95 schedule-to-execute latency within ±1s of due time under nominal load
  • Executor throughput sized to peak retry storms; global and per-tenant caps
  • Backpressure signals honored (reduce concurrency, extend delays)

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    participant P as Producer Service
    participant R as Retry Service (Scheduler/Executor)
    participant T as Target Service
    participant BUS as Event Bus / Metrics

    P->>R: POST /retries/v1/schedule {task, idempotencyKey, policy}
    R->>R: Persist task, compute delay = backoff(attempt=1)+jitter
    R->>R: Enqueue for due time
    R->>T: (when due) Execute task with idempotencyKey
    T-->>R: 200 OK (or success code)
    R->>BUS: Emit Retry.Succeeded {taskId, attempts}
    R-->>P: 201 Created {taskId, status:"Scheduled"}

Alternative Paths

  • Transient failure: Target returns retryable error → attempt++, recompute delay with jitter → reschedule until success or maxAttempts.
  • Immediate retry hints: Target returns Retry-After → override computed delay (bounded by policy).
  • Work dedupe: if idempotencyKey seen recently, executor skips duplicate execution and marks Succeeded (idempotent).
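
The work-dedupe path above can be sketched with a TTL-bounded set of recently seen keys (Python, illustrative; class and method names are assumptions, and a production executor would use a durable store rather than process memory).

```python
import time

class IdempotencyGuard:
    """Remembers recently executed idempotency keys so a duplicate task
    is marked Succeeded without re-invoking the target (sketch)."""
    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self._seen = {}  # key -> expiry timestamp

    def check_and_record(self, key, now=None):
        now = time.monotonic() if now is None else now
        # purge expired entries before checking
        self._seen = {k: exp for k, exp in self._seen.items() if exp > now}
        if key in self._seen:
            return False  # duplicate: skip execution, mark Succeeded
        self._seen[key] = now + self.ttl
        return True  # first sighting: execute the task
```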

Error Paths

sequenceDiagram
    participant P as Producer
    participant R as Retry Service
    participant D as DLQ

    P->>R: Schedule task {malformed}
    alt 400 Bad Request
        R-->>P: 400 Problem+JSON
    else Task not found / status query bad id
        R-->>P: 404 Not Found (Problem+JSON)
    else Update while executing
        R-->>P: 409 Conflict (Problem+JSON)
    end
    R->>R: Execute attempt N (last allowed)
    R->>D: Route to DLQ {task, lastError, attempts=N}

Request/Response Specifications

Input Requirements

Field Type Req Description Validation
Method/Path POST /retries/v1/schedule Y Schedule a retryable task JSON body
Authorization header Y Bearer <JWT> Valid
x-tenant-id header Y Tenant scope Matches policy
traceparent header O W3C trace context 55-char
task.type string Y Logical task kind (e.g., Export.Callback) allowlist
task.payload object Y Minimal inputs to re-execute Size ≤ policy cap
idempotencyKey string Y De-dupes executions ≤ 128 chars
policy object O Override defaults See below

Policy Overrides (optional)

Field Type Description
maxAttempts int e.g., 6 (including first)
baseDelayMs int e.g., 250
multiplier number e.g., 2.0 (exponential)
maxDelayMs int cap, e.g., 60_000
jitter enum/number Jitter mode, e.g., full or decorrelated
retryable array Retryable status codes / reasons
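
The policy fields above map directly onto a delay computation (Python sketch, illustrative only; defaults mirror the example values in the table and are assumptions, not a contract).

```python
import random

def compute_delay_ms(attempt, policy):
    """Exponential backoff with optional full jitter, using the policy
    fields above. `attempt` is 1-based; attempt 1 is the first try."""
    base = policy.get("baseDelayMs", 250)
    mult = policy.get("multiplier", 2.0)
    cap = policy.get("maxDelayMs", 60_000)
    raw = min(cap, base * (mult ** (attempt - 1)))
    if policy.get("jitter") == "full":
        return random.uniform(0, raw)  # full jitter: anywhere in [0, raw]
    return raw
```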

Status/Control

  • GET /retries/v1/tasks/{taskId} → status, attempts, nextDueAt
  • POST /retries/v1/tasks/{taskId}:cancel (if safe)
  • GET /retries/v1/dlq → items; POST /retries/v1/dlq/{id}:replay

Output Specifications

201 Created

Field Type Description
taskId string ULID/GUID
status enum Scheduled
nextDueAt timestamp First attempt due time
policyEffective object Resolved policy
attempt int 1

200 OK (Status)

Field Type Description
taskId string Id
attempt int Current attempt
nextDueAt timestamp? Null if running/completed
state enum Running | Succeeded | Failed | DLQ
lastError object? {code, reason, ts}

Example Payloads

// Schedule with policy override
{
  "task": {
    "type": "Export.Callback",
    "payload": {"url":"https://example.com/hook","exportId":"exp_01JF..."}
  },
  "idempotencyKey": "exp_01JF...:callback",
  "policy": {"maxAttempts": 6, "baseDelayMs": 500, "multiplier": 2, "maxDelayMs": 60000, "jitter":"full"}
}
// Status response
{
  "taskId": "rtk_01JF...",
  "state": "Running",
  "attempt": 3,
  "nextDueAt": "2025-10-22T11:14:25Z",
  "lastError": {"code":"HTTP_503","reason":"Upstream unavailable","ts":"2025-10-22T11:12:13Z"}
}

Error Handling

Error Scenarios

HTTP Code Scenario Recovery Action Retry Strategy
400 Malformed task/policy; payload too large Fix request
401 Missing/invalid JWT Renew token Retry after renewal
403 Caller lacks retry:schedule Acquire role/scope
404 Unknown taskId Verify id
409 Update/cancel during execution window Wait for state change Retry after fix
412 If-Match version mismatch on update Fetch latest, retry Conditional retry
422 Non-idempotent target / policy disallowed Change endpoint/policy
429 Per-tenant/global throttle exceeded Honor Retry-After Backoff + jitter
503 Scheduler/Executor dependency down Wait for recovery Idempotent reschedule

Failure Modes

  • Poison task: repeatedly fails with non-retryable error → immediate DLQ.
  • Retry storm: global backoff and concurrency caps applied; jitter widened.
  • Clock skew: due times computed in UTC; executor compares with monotonic clock guard.

Recovery Procedures

  1. Inspect DLQ item; fix root cause; replay via DLQ endpoint.
  2. Adjust policy (raise cap, widen backoff) for transient incidents.
  3. Use idempotencyKey to ensure safe replays.

Performance Characteristics

Latency Expectations

  • Schedule-to-execute drift p95 ≤ 1s at steady load; may widen under backpressure.

Throughput Limits

  • Executor concurrency: per-tenant & global caps to protect downstreams.
  • Batched scheduling & due-time bucketing for high-volume workloads.

Resource Requirements

  • Lightweight CPU; memory for queues; persistent storage for tasks and attempts.

Scaling Considerations

  • Shard by tenant/time buckets; use decorrelated jitter to reduce synchronization.
  • Propagate Retry-After and circuit-breaker state into backoff calculation.

Security & Compliance

Authentication

  • OIDC JWT at Gateway; service accounts for producers.

Authorization

  • Roles: retry:schedule, retry:read, retry:cancel, retry:dlq.read, retry:dlq.replay.
  • Enforce tenant RLS via x-tenant-id.

Data Protection

  • Store minimal payloads; encrypt at rest; no secrets in task payloads—use references (e.g., secret ids).

Compliance

  • All attempts and state transitions audited with actor, reason, and outcomes.

Monitoring & Observability

Key Metrics

Metric Type Description Alert Threshold
retry_scheduled_total counter Tasks scheduled Trend
retry_attempts_total counter Attempts made Sudden surge
retry_success_total counter Completed via retry
retry_dlq_total counter Routed to DLQ > baseline
retry_delay_applied_ms histogram Backoff + jitter p95 sanity
executor_concurrency gauge Active workers Cap breaches

Logging Requirements

  • Structured logs per attempt: taskId, tenant, attempt, delayMs, jitterMs, code, reason. No sensitive payloads.

Distributed Tracing

  • Spans: retry.schedule, retry.execute, retry.backoff, dlq.route.
  • Attributes: attempt, delayMs, policyId, idempotencyKey.

Health Checks

  • Readiness: queue store reachable; scheduler tick healthy.
  • Liveness: executor draining; no stuck partitions.

Operational Procedures

Deployment

  1. Deploy Scheduler and Executor; configure queues/stores.
  2. Register retry policies per tenant; validate with synthetic faults.

Configuration

  • Env: RETRY_MAX_CONCURRENCY, RETRY_DEFAULT_POLICY, RETRY_MAX_PAYLOAD_BYTES, RETRY_STORM_GUARD_MULTIPLIER.
  • Policy: retryable codes map (HTTP/gRPC), base delays, caps, jitter mode.

Maintenance

  • Periodically purge completed tasks; archive DLQ with retention.
  • Tune jitter/backoff from incident postmortems.

Troubleshooting

  • DLQ spike → inspect non-retryable reasons; verify idempotency at Target.
  • Drift in due execution → check scheduler lag and backpressure controls.
  • Duplicate side effects → confirm Target honors idempotencyKey.

Testing Scenarios

Happy Path Tests

  • Target 503 twice then 200 → attempts increase, success within policy, no DLQ.
  • Retry-After honored to override computed delay.

Error Path Tests

  • 400 malformed schedule; 404 unknown task; 409 modify during run.
  • 422 when endpoint marked non-idempotent; 429/503 backoff honored.

Performance Tests

  • High-volume storm—executor respects caps; jitter spreads load.
  • p95 schedule-to-execute ≤ 1s under nominal load.

Security Tests

  • RBAC enforced; cross-tenant access blocked.
  • No secrets in logs/payloads; encryption at rest verified.

Internal References

External References

  • RFC 7807 (Problem Details)
  • W3C Trace Context

Appendices

A. Backoff Formula (examples)

  • Exponential: delay = min(maxDelay, base * (multiplier^(attempt-1))) + jitter
  • Decorrelated jitter: sleep = min(maxDelay, random(base, sleep*3))
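
The decorrelated-jitter formula above is stateful (each sleep depends on the previous one); a minimal sketch in Python:

```python
import random

def decorrelated_jitter(base_ms, max_ms):
    """Generator for the decorrelated-jitter schedule above:
    sleep = min(maxDelay, random(base, sleep * 3))."""
    sleep = base_ms
    while True:
        sleep = min(max_ms, random.uniform(base_ms, sleep * 3))
        yield sleep
```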

B. Example Problem+JSON (policy violation)

{
  "type": "urn:connectsoft:errors/retry/policy.invalid",
  "title": "Retry policy invalid",
  "status": 422,
  "detail": "Endpoint requires idempotency but idempotencyKey was not provided.",
  "traceId": "9f0c1d2e3a4b5c6d..."
}

Dead Letter Queue Flow

Operational path to triage, diagnose, and replay messages that exhausted retries or failed with non-retryable errors. Ensures no duplicate side effects by requiring idempotent targets and preserving the original idempotencyKey during replay. Provides auditability, metrics, and safe deletion/quarantine.


Overview

Purpose: Restore messages from failure to success with controlled, observable, and compliant procedures.
Scope: DLQ item listing, inspection, annotation, fix/runbook execution, safe replay (single/bulk), quarantine or delete, and auditing. Excludes business-side compensation (see Compensation Flow).
Context: DLQ is fed by Retry Service and other producers. Replay Tool orchestrates re-submission to the Target Service using at-least-once semantics with idempotency guarantees.
Key Participants:

  • Operator / SRE (triage & action)
  • API Gateway (authN/Z, tenancy)
  • DLQ Store (dead letters, metadata)
  • Replay Tool / DLQ Service (orchestrates fix & replay)
  • Target Service (original destination)
  • Runbook/Knowledge Base (known-error fixes)
  • Observability (metrics, logs, alerts)
  • Audit/Event Bus (operator actions, outcomes)

Prerequisites

System Requirements

  • DLQ store with durable retention and per-tenant partitioning
  • Replay Tool has network access to Target Service(s)
  • Original endpoint supports idempotency keys or is side-effect free
  • Circuit breaker and rate limits configured for replay traffic

Business Requirements

  • Runbooks for top failure signatures (e.g., mapping fixes, schema bumps)
  • Role-based access for DLQ operations with approvals where needed
  • Data minimization policies for viewing payloads (mask PII by default)

Performance Requirements

  • Listing/inspect p95 ≤ 200 ms per page
  • Replay throughput bounded (tenant/global) to protect targets
  • Batch replay progress reporting and partial-failure handling

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    actor OP as Operator
    participant GW as API Gateway
    participant DLQ as DLQ Service/Store
    participant RB as Runbook/KB
    participant RT as Replay Tool
    participant T as Target Service
    participant AUD as Audit/Event Bus

    OP->>GW: GET /ops/v1/dlq?filters… (search & select item)
    GW->>DLQ: Query items (tenant, filters)
    DLQ-->>GW: Page of items
    OP->>GW: GET /ops/v1/dlq/{id} (inspect, view masked payload, lastError)
    GW->>DLQ: Fetch item + metadata
    DLQ-->>GW: Item + recommended runbook link
    OP->>RB: Follow runbook, apply fix (config/schema/data)
    OP->>GW: POST /ops/v1/dlq/{id}:replay {mode:"safe"}
    GW->>RT: Orchestrate replay (authZ, tenancy)
    RT->>T: Re-submit with original idempotencyKey/payload
    T-->>RT: 200 OK (idempotent success)
    RT->>DLQ: Mark Resolved, attach replay transcript
    RT->>AUD: Emit DLQ.Replayed {id, attempts, actor, outcome}
    GW-->>OP: 200 OK {status:"Replayed", transcriptUrl}

Alternative Paths

  • Bulk replay: operator selects a query window/signature and triggers :bulk-replay with concurrency caps.
  • Quarantine: item moved to a separate queue to prevent accidental replay while investigation continues.
  • Redrive to alternative endpoint: route to a newer API version when the original is deprecated (policy-gated).

Error Paths

sequenceDiagram
    actor OP as Operator
    participant GW as API Gateway
    participant DLQ as DLQ Service
    participant RT as Replay Tool

    OP->>GW: POST /ops/v1/dlq/{id}:replay
    alt 400 Bad Request (invalid mode/filters)
        GW-->>OP: 400 Problem+JSON
    else 404 Not Found (unknown item)
        GW-->>OP: 404 Problem+JSON
    else 409 Conflict (item locked/by another replay)
        GW-->>OP: 409 Problem+JSON
    else 422 Unprocessable (target non-idempotent, policy forbids)
        GW-->>OP: 422 Problem+JSON
    else 429/503 (rate limit/dependency down)
        GW-->>OP: 429/503 Problem+JSON (+Retry-After)
    end

Request/Response Specifications

Input Requirements

Field Type Req Description Validation
List GET /ops/v1/dlq Y List DLQ items Pagination with page.after, limit≤200
Inspect GET /ops/v1/dlq/{id} Y Fetch one item {id} ULID/GUID
Replay POST /ops/v1/dlq/{id}:replay Y Re-submit safely JSON body
Bulk Replay POST /ops/v1/dlq:bulk-replay O Replay by filter JSON body
Quarantine POST /ops/v1/dlq/{id}:quarantine O Move to quarantine
Delete DELETE /ops/v1/dlq/{id} O Drop after approval Policy-gated
Authorization header Y Bearer <JWT> Role: DLQ ops
x-tenant-id header Y Tenant scope RLS enforced
traceparent header O W3C trace 55-char
idempotencyKey string O Override if missing ≤ 128 chars
mode enum O Replay mode safe (default) | force

DLQ Item (shape)

Field Description
id DLQ item id
source Producer (service/flow)
target Endpoint/service intended
payload Masked by default (toggle with RBAC)
idempotencyKey Original key (if any)
attempts Attempts made
firstSeenAt / lastErrorAt Timestamps
lastError {code, reason, traceId}
annotations[] Operator notes
status Pending | Quarantined | Replayed | Deleted

Output Specifications

200 OK (Inspect)

Field Type Description
item object DLQ item
recommendedRunbook url Link to doc
replayEligible bool true if idempotent & policy allows
warnings[] array E.g., “missing idempotencyKey”

200 OK (Replay)

Field Type Description
status enum Replayed | InProgress | Quarantined
transcriptUrl url Steps & outcomes
attempt int Attempt count after replay
effectiveIdempotencyKey string Key used

Example Payloads

// Replay request (safe)
POST /ops/v1/dlq/01JF...:replay
{
  "mode": "safe",
  "notes": "Fixed mapping for resourceType=Invoice; re-submitting."
}
// DLQ item (inspect response excerpt)
{
  "id": "01JF…",
  "source": "Ingestion.Consumer",
  "target": "Storage.Append",
  "idempotencyKey": "ar:01JF…",
  "attempts": 6,
  "lastError": {"code":"HTTP_422","reason":"Schema validation failed"},
  "replayEligible": true
}

Error Handling

Error Scenarios

HTTP Code Scenario Recovery Action Retry Strategy
400 Invalid filters, mode, or bulk selection too large Fix request/trim selection
401 Missing/invalid JWT Acquire valid token Retry after renewal
403 Lacks dlq:operate or PII unmask permission Request proper role
404 DLQ item not found Refresh list; verify id/tenant
409 Item locked by another operator/replay in progress Wait or take lock after TTL Retry after unlock
412 If-Match version mismatch on annotate/delete Refetch item; retry with latest Conditional retry
422 Replay blocked (non-idempotent target / missing key) Provide key or route to compensation
429 Replay throughput cap exceeded Honor Retry-After Backoff + jitter
503 DLQ store or target unavailable Wait for recovery Idempotent replay later

Failure Modes

  • Duplicate side effects risk: target not idempotent or key missing → block replay unless force with executive approval; log and audit.
  • Payload drift: original payload stale after schema change → tool offers auto-migrate transform preview before replay.
  • Replay storm: bulk selection triggers target throttling → tool enforces per-tenant QPS caps and adaptive backoff.
  • PII exposure: viewing raw payload requires elevated RBAC; otherwise masked.
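
The replay-eligibility rules above reduce to a small gate (Python sketch; function name and return values are illustrative assumptions, and field names follow the DLQ item shape in this document).

```python
def replay_decision(item, mode="safe", force_approved=False):
    """Gate a DLQ replay: items without an idempotencyKey are blocked in
    safe mode; force mode additionally requires an explicit approval."""
    if item.get("idempotencyKey"):
        return "replay"  # safe: target can de-dupe on the original key
    if mode == "force" and force_approved:
        return "replay-forced"  # audited; duplicate side effects possible
    return "blocked"  # 422: supply a key or route to compensation
```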

Recovery Procedures

  1. If 422, attempt payload migration using versioned transformers; re-try in safe mode.
  2. If 409, wait for lock TTL or coordinate via on-call; avoid parallel replay.
  3. For 503/429, the tool pauses and resumes respecting backoff and circuit breaker state.

Performance Characteristics

Latency Expectations

  • Inspect/list p95 ≤ 200 ms; single replay end-to-end typically ≤ 2 s (excluding target latency).

Throughput Limits

  • Default bulk replay ≤ 50 msg/s per tenant (configurable), global cap to protect targets.
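
A per-tenant replay cap like the one above is commonly enforced with a token bucket; a minimal sketch (Python, illustrative only; class name and parameters are assumptions).

```python
import time

class TokenBucket:
    """Per-tenant replay rate limiter (sketch): refill at `rate` tokens/s
    up to `burst`; each replayed message consumes one token."""
    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst
        self.tokens = burst
        self.last = time.monotonic()

    def try_acquire(self, now=None):
        now = time.monotonic() if now is None else now
        # refill proportionally to elapsed time, capped at burst
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should back off (429 with Retry-After)
```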

Resource Requirements

  • Light CPU/IO for listing; replay workers sized to throughput; encrypted storage for transcripts.

Scaling Considerations

  • Shard DLQ by tenant and creation time; support cursor-based pagination; parallel workers with per-target concurrency.

Security & Compliance

Authentication

  • OIDC JWT at Gateway; service tokens for replay to targets.

Authorization

  • Roles: dlq:read, dlq:operate, dlq:quarantine, dlq:delete, dlq:pii.unmask.
  • Fine-grained approvals required for mode=force and deletions.

Data Protection

  • Payloads masked by default; unmask requires explicit action (with purpose-of-use).
  • Transcripts and payload snapshots encrypted at rest; presigned URLs short-lived.

Compliance

  • All DLQ actions are audited (who, what, why, before/after, result).
  • Retention for DLQ items and transcripts aligns with tenant policy.

Monitoring & Observability

Key Metrics

Metric Type Description Alert Threshold
dlq_items_total gauge Current DLQ size (by tenant) Rising trend
dlq_oldest_age_seconds gauge Age of oldest item > SLO
dlq_replay_success_total counter Successful replays Track rate
dlq_replay_failure_total counter Failed replays Spike alert
dlq_quarantine_total counter Items quarantined Investigate
dlq_bulk_replay_inflight gauge Active bulk operations Cap breaches

Logging Requirements

  • Structured logs: tenant, dlqId, action, actor, mode, outcome, idempotencyKey, target, attempts. Do not log payload values.

Distributed Tracing

  • Spans: dlq.list, dlq.inspect, dlq.replay, dlq.quarantine.
  • Attributes: bulkSize, replayed, failed, throttled, transformVersion.

Health Checks

  • Readiness: DLQ store reachable; replay workers healthy.
  • Liveness: no stuck locks; bulk runners progressing.

Operational Procedures

Deployment

  1. Deploy DLQ Service and Replay Tool; wire to Gateway with RBAC.
  2. Configure per-tenant throughput caps and masking defaults.
  3. Validate end-to-end with seeded poison messages.

Configuration

  • Env: DLQ_LIST_PAGE_MAX, DLQ_REPLAY_QPS_PER_TENANT, DLQ_GLOBAL_QPS_CAP, DLQ_LOCK_TTL_SEC, TRANSFORMER_DEFAULT_VERSION.
  • Policy: allowed force operations, deletion approvals, payload unmask rules.

Maintenance

  • Periodic purge/archival of resolved items; rotate transcript encryption keys.
  • Review top failure signatures and update runbooks/transformers.

Troubleshooting

  • Duplicates observed → verify target idempotency and keys; disable force path.
  • Bulk replay throttled → reduce concurrency or expand caps with approval.
  • Payload migration errors → roll back transformer version and fix mapping.

Testing Scenarios

Happy Path Tests

  • Inspect → apply mapping fix → safe replay succeeds; DLQ item resolved.
  • Bulk replay with 5,000 items respects QPS caps and completes with transcript.

Error Path Tests

  • 400 invalid filters; 404 unknown id; 409 locked item; 422 non-idempotent blocked.
  • 429/503 backoff honored; operation resumes and completes.

Performance Tests

  • Listing p95 ≤ 200 ms at 1M items/tenant (indexed).
  • Bulk replay maintains target SLOs under cap.

Security Tests

  • PII masked by default; unmask requires RBAC + purpose-of-use; all actions audited.
  • Deletions require multi-party approval when enabled.

Internal References

External References

  • RFC 7807 (Problem Details)
  • W3C Trace Context

Appendices

A. Example Problem+JSON (non-idempotent target)

{
  "type": "urn:connectsoft:errors/dlq/replay.disallowed",
  "title": "Replay blocked by policy",
  "status": 422,
  "detail": "Target endpoint is not idempotent and force mode is disabled for this tenant.",
  "traceId": "9f0c1d2e3a4b5c6d..."
}

B. Example Annotation

POST /ops/v1/dlq/01JF...:annotate
{
  "note": "Fixed customer mapping (CUS-123). Verified with runbook RB-42."
}

Circuit Breaker Flow

Contains downstream failures and prevents cascading outages by short-circuiting failing calls, routing to fallbacks/queues, and probing recovery via half-open trials. Exposes clear client signals (headers/status) and integrates with Retry/DLQ to preserve at-least-once semantics.


Overview

Purpose: Protect services from unstable dependencies using automated open/half-open/closed state transitions, graceful degradation, and recovery probing.
Scope: Policy configuration, failure/latency detection, state transitions, short-circuit responses, fallback and queueing, recovery probes, client signaling. Excludes business-specific compensation (see Compensation Flow).
Context: Libraries/middleware wrap all client calls to downstreams (HTTP/gRPC/bus). Breaker state may be per-tenant, per-endpoint, per-partition.
Key Participants:

  • Caller Service (producer of the downstream call)
  • Circuit Breaker (in-process or sidecar)
  • Target Service (downstream dependency)
  • Fallback/Cache (optional read cache or static responses)
  • Retry/DLQ Services (for write/side-effect operations)
  • Observability/Config (metrics, alerts, ops overrides)

Prerequisites

System Requirements

  • Circuit breaker library enabled for HTTP/gRPC clients with configurable policies
  • Sliding windows for failure rate and slow-call rate with min call thresholds
  • Central config and runtime override API (ops) with safe defaults
  • Correlation/tracing propagation through fallback paths

Business Requirements

  • Defined fallback strategy per call type (read: cache; write: enqueue → Retry)
  • Tenant- and endpoint-level SLOs to tune thresholds
  • Runbook for operator overrides (force-open/close, reset)

Performance Requirements

  • Wrapper overhead p95 ≤ 1 ms per call (fast path, closed)
  • Probe batch size and interval sized to recover quickly without stampedes
  • Backpressure headers documented for clients

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    participant C as Caller Service
    participant CB as Circuit Breaker
    participant T as Target Service
    participant F as Fallback/Queue (optional)

    C->>CB: Invoke downstream operation
    alt State = CLOSED
        CB->>T: Forward request
        T-->>CB: 200/OK (within latency budget)
        CB-->>C: Success (propagate response)
    else State = HALF-OPEN (probe window)
        CB->>T: Limited probes (N% or fixed concurrency=1..k)
        T-->>CB: OK responses exceed threshold
        CB-->>C: Success, transition → CLOSED
    end

Alternative Paths

  • Fallback (read): CB returns cached/derived response with X-ATP-Circuit-State: open and X-ATP-Source: cache.
  • Queue (write): CB enqueues to Retry Service with idempotencyKey, returns 202 Accepted (Problem+JSON alternative body optional).
  • Partitioned breakers: isolate a bad shard/tenant from healthy traffic.
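
The closed/open/half-open transitions described here can be sketched as a small state machine (Python, illustrative; a count-based window stands in for the sliding windows described under Prerequisites, and all names are assumptions).

```python
import time

class CircuitBreaker:
    """Minimal breaker sketch: trips to OPEN when the failure rate over a
    recent-call window exceeds a threshold; probes recovery after cool-down."""
    def __init__(self, failure_rate=0.5, min_calls=10, cooldown_s=5.0):
        self.failure_rate, self.min_calls, self.cooldown = failure_rate, min_calls, cooldown_s
        self.state = "closed"
        self.results = []      # rolling window of recent call outcomes
        self.opened_at = 0.0

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        if self.state == "open" and now - self.opened_at >= self.cooldown:
            self.state = "half-open"  # permit a limited probe
        return self.state != "open"   # open -> short-circuit (503)

    def record(self, success, now=None):
        now = time.monotonic() if now is None else now
        if self.state == "half-open":
            if success:
                self.state, self.results = "closed", []
            else:
                self.state, self.opened_at = "open", now
            return
        self.results = (self.results + [success])[-self.min_calls:]
        if (len(self.results) >= self.min_calls and
                self.results.count(False) / len(self.results) > self.failure_rate):
            self.state, self.opened_at = "open", now
```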

Error Paths

sequenceDiagram
    participant C as Caller
    participant CB as Circuit Breaker
    participant T as Target
    participant Q as Retry/DLQ

    C->>CB: Invoke downstream operation
    alt State = OPEN (short-circuit)
        CB-->>C: 503 Service Unavailable
        Note right of C: Headers: X-ATP-Circuit-State: open, Retry-After: 5
    else State = CLOSED but failure/slow-call triggers thresholds
        CB->>T: Request
        T-->>CB: 5xx/timeout/slow
        CB->>CB: Increment counters, if trip threshold → OPEN
        CB->>Q: (write ops) enqueue for retry
        CB-->>C: 503/504 or 202 (queued) with Problem+JSON
    end

Request/Response Specifications

The breaker primarily shapes responses; ops endpoints allow safe overrides.

Input Requirements (Ops)

Field Type Req Description Validation
Method/Path POST /ops/v1/circuits/{id}:override O Force open | half-open | closed with TTL {id} exists
Authorization header Y Bearer <JWT> Role ops:circuits
state enum Y open | half-open | closed allowlist
ttl duration O Override duration (e.g., 10m) ≤ policy max
notes string O Reason ≤ 256 chars

Output Specifications (Client-Facing)

  • Closed (success): normal 2xx/OK.
  • Open (short-circuited read): 503 Service Unavailable. Headers: X-ATP-Circuit-State: open, Retry-After: <sec>, X-ATP-Circuit-Reason: failure-rate|slow-calls|min-calls-not-met. Body (Problem+JSON example):

{
  "type":"urn:connectsoft:errors/circuit/open",
  "title":"Dependency temporarily unavailable",
  "status":503,
  "detail":"Calls short-circuited by circuit breaker (failure rate > 50% over 20s).",
  "retryAfterSeconds":5,
  "traceId":"9f0c1d2e3a4b5c6d..."
}
  • Open (queued write): 202 Accepted with Location: /retries/v1/tasks/{taskId} and the headers above plus X-ATP-Queued: true.


Error Handling

Error Scenarios

| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Ops override payload invalid (state/ttl) | Fix request | |
| 401 | Missing/invalid JWT (ops) | Acquire valid token | Retry after renewal |
| 403 | Lacks ops:circuits | Request proper role | |
| 404 | Unknown circuit {id} | Verify id/scope | |
| 409 | Conflicting override/state transition | Clear override or wait TTL | Retry after fix |
| 412 | If-Match on circuit version mismatch | Read latest, retry | Conditional retry |
| 422 | TTL or state not permitted by policy | Adjust inputs | |
| 429 | Too many overrides/changes | Back off | Jittered retry |
| 503/504 | Short-circuited/open or downstream timeout | Respect headers | Exponential backoff + jitter |

Failure Modes

  • Min-calls not met: insufficient samples → breaker stays closed but labels responses with X-ATP-Circuit-Reason: warmup.
  • Stampede on recovery: too many probes → configure half-open concurrency and jitter.
  • Cache staleness: fallback exceeds TTL → downgrade to 503 instead of serving stale beyond policy.

Recovery Procedures

  1. When open, allow half-open after cool-down; probe with limited concurrency.
  2. Tune thresholds based on SLOs and observed metrics (failure/slow-call rate).
  3. For write paths, confirm idempotencyKey is propagated before enabling queue mode.

Performance Characteristics

Latency Expectations

  • Added wrapper overhead p95 ≤ 1 ms (closed).
  • Half-open probes routed immediately; unaffected calls still short-circuited.

Throughput Limits

  • Limit concurrent probes (e.g., 1–5) per breaker key; cap queued writes per tenant.

Resource Requirements

  • In-process counters/timers; optional small shared state for cluster coordination.

Scaling Considerations

  • Key breaker by {tenant, endpoint, partition} to avoid global trips.
  • Use decorrelated jitter for cool-down and probe scheduling.
  • Optional shared state (e.g., Redis) for multi-instance consistency.
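The decorrelated jitter mentioned above picks each cool-down or probe delay uniformly between the base delay and three times the previous delay, capped at a maximum. A minimal sketch (parameter names are illustrative, not part of any ATP API):

```python
import random

def decorrelated_jitter(base: float, cap: float, prev: float) -> float:
    """Next delay in seconds: uniform between base and 3x the previous
    delay, capped. Successive delays grow but never synchronize across
    instances, which avoids probe stampedes on recovery."""
    return min(cap, random.uniform(base, prev * 3))

# Example: a schedule of delays, each derived from the previous one.
delay = 1.0
for _ in range(5):
    delay = decorrelated_jitter(base=1.0, cap=30.0, prev=delay)
    assert 1.0 <= delay <= 30.0
```

Compared with plain exponential backoff, the randomized lower bound keeps many clients from retrying in lockstep after the same trip.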

Security & Compliance

Authentication

  • Client requests authenticated as usual; ops overrides require OIDC JWT and RBAC.

Authorization

  • Ops roles: ops:circuits.read, ops:circuits.override, ops:circuits.reset.

Data Protection

  • Headers reveal state but not sensitive internals; avoid leaking backend hostnames.

Compliance

  • All trips, overrides, and recoveries are audited (who, when, why, thresholds, counts).

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| circuit_state{key} | gauge | 0=closed, 1=half-open, 2=open | Open > 0 sustained |
| circuit_short_circuits_total | counter | Calls blocked by open state | Spike alert |
| circuit_failure_rate | gauge | Recent failure % | > policy trip |
| circuit_slow_call_rate | gauge | Recent slow-call % | > policy trip |
| circuit_probe_success_total | counter | Half-open successes | Low during recovery |
| fallback_invocations_total | counter | Cache/queue usage | Track degradation |

Logging Requirements

  • Structured logs: breakerKey, state, reason, window, failRate, slowRate, probe, override, actor, traceId.

Distributed Tracing

  • Tag spans with circuit.state, circuit.reason, fallback=true, queued=true, include downstream span links when available.

Health Checks

  • Readiness: breaker config loaded; counters active.
  • Liveness: state machine transitions occur; no stuck half-open beyond TTL.

Operational Procedures

Deployment

  1. Enable breaker middleware for all outbound clients; set sane defaults.
  2. Wire ops API and dashboards; define per-tenant keys.
  3. Validate with chaos testing (inject 5xx/timeouts).

Configuration

  • Policy: {window=20s, minCalls=20, failureRate=50%, slowThreshold=1s, slowRate=50%, cooldown=5s, probe=2}
  • Headers: X-ATP-Circuit-State, X-ATP-Circuit-Reason, Retry-After, X-ATP-Queued.
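The policy above drives a small state machine. A minimal sketch, assuming the minCalls/failureRate/cooldown/probe knobs from the default policy; it uses a simple cumulative counter rather than a real 20 s sliding window and omits slow-call tracking:

```python
class CircuitBreaker:
    """Illustrative closed -> open -> half-open -> closed state machine."""

    def __init__(self, min_calls=20, failure_rate=0.5, cooldown=5.0, probes=2):
        self.min_calls, self.failure_rate = min_calls, failure_rate
        self.cooldown, self.probes = cooldown, probes
        self.state, self.calls, self.failures = "closed", 0, 0
        self.opened_at, self.probe_successes = 0.0, 0

    def allow(self, now: float) -> bool:
        """Gate a call; after cool-down, open transitions to half-open."""
        if self.state == "open" and now - self.opened_at >= self.cooldown:
            self.state, self.probe_successes = "half-open", 0
        return self.state != "open"

    def record(self, ok: bool, now: float) -> None:
        """Feed the outcome of a permitted call back into the breaker."""
        if self.state == "half-open":
            if ok:
                self.probe_successes += 1
                if self.probe_successes >= self.probes:
                    self.state, self.calls, self.failures = "closed", 0, 0
            else:  # failed probe reopens immediately
                self.state, self.opened_at = "open", now
            return
        self.calls += 1
        self.failures += (not ok)
        if (self.calls >= self.min_calls
                and self.failures / self.calls > self.failure_rate):
            self.state, self.opened_at = "open", now  # trip
```

Note the min-calls guard: with fewer than 20 samples the breaker stays closed, matching the warmup reason described under Failure Modes.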

Maintenance

  • Review trip analytics weekly; adjust thresholds and probe sizes.
  • Rotate cache TTLs for fallbacks per freshness requirements.

Troubleshooting

  • Frequent opens → inspect dependency SLOs, retry storms, and idempotency.
  • No recovery → increase probe window or check downstream health checks.
  • Client confusion → verify headers are surfaced at Gateway.

Testing Scenarios

Happy Path Tests

  • Closed → success; zero wrapper overhead regressions.
  • Half-open with limited probes transitions to closed after consecutive successes.

Error Path Tests

  • Trip on failure rate > threshold; open cool-down respected; headers set.
  • Read fallback returns cached response with correct state headers.
  • Write enqueued returns 202 with Location and idempotencyKey.

Performance Tests

  • Probe concurrency prevents stampede; short-circuit path p95 ≤ 1 ms.
  • High QPS under open state does not overload queue/cache.

Security Tests

  • Ops override RBAC enforced; audit trail captured.
  • Headers do not leak sensitive backend identifiers.

Internal References

External References

  • RFC 7807 (Problem Details)
  • W3C Trace Context

Appendices

A. Example Ops Override

POST /ops/v1/circuits/tenant:search:index:primary:override
{
  "state": "open",
  "ttl": "10m",
  "notes": "Isolate failing shard while indexers recover."
}

B. Client Header Cheatsheet

  • X-ATP-Circuit-State: closed|half-open|open
  • X-ATP-Circuit-Reason: failure-rate|slow-calls|override|warmup
  • Retry-After: seconds until next probe/cooldown ends
  • X-ATP-Queued: true when write queued for retry
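A client consuming the cheatsheet headers might branch as follows. This is a hedged sketch; the returned action strings are illustrative, not part of any ATP SDK:

```python
def next_action(status: int, headers: dict) -> str:
    """Map a breaker-shaped response to a client action using the
    cheatsheet headers above."""
    if status == 202 and headers.get("X-ATP-Queued") == "true":
        # Write was queued for retry; poll the task at Location.
        return "poll " + headers.get("Location", "")
    if headers.get("X-ATP-Circuit-State") == "open":
        # Short-circuited; honor Retry-After before the next attempt.
        return "retry after " + headers.get("Retry-After", "5") + "s"
    if status >= 500:
        return "retry with backoff"
    return "ok"
```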

Compensation Flow

Repairs partial failures or out-of-order effects by executing a deterministic, idempotent sequence of inverse actions (e.g., projection rewrites, search index corrections, pointer re-links). Produces a complete audit trail and supports dry-run planning before execution.


Overview

Purpose: Restore system invariants when a transaction or workflow completed partially (e.g., append succeeded but projection/index update failed).
Scope: Detection/selection of a failed transaction, plan synthesis, dry-run validation, execution of compensating steps, verification, and audit. Excludes business refunds or external systems remediation (covered by domain runbooks).
Context: Invoked by operators or automation (DLQ/alerts). Coordinates with Projection Service, Search Index, Storage, and Integrity to ensure consistency.
Key Participants:

  • Operator / Automation (trigger)
  • Compensation Service (planner/executor)
  • Storage / Projection / Search Index (targets)
  • Audit/Event Bus (actions & outcomes)
  • Retry/DLQ (feeder, optional post-fix replay)

Prerequisites

System Requirements

  • Compensation Service deployed with access to Storage, Projections, and Indexes
  • Idempotency primitives available (step keys, compare-and-set guards)
  • Read-only snapshot capability for dry-run planning
  • Time-synchronized environment (UTC), consistent tracing

Business Requirements

  • Catalog of compensable scenarios and their inverse steps
  • Approval policy for destructive operations and bulk compensations
  • Masking rules for any payloads surfaced to operators

Performance Requirements

  • p95 plan synthesis ≤ 500 ms for typical cases
  • Batched execution with rate limits to protect targets
  • Backpressure-aware executor with progress reporting

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    actor OP as Operator/Automation
    participant GW as API Gateway
    participant CMP as Compensation Service
    participant ST as Storage
    participant PR as Projection Service
    participant IX as Search Index
    participant AUD as Audit/Event Bus

    OP->>GW: POST /ops/v1/compensations {txnId|recordId,..., dryRun:true}
    GW->>CMP: Create Plan (authN/Z, x-tenant-id)
    CMP->>ST: Inspect ground truth (append store)
    CMP->>PR: Inspect projection state
    CMP->>IX: Inspect index documents
    CMP->>CMP: Synthesize plan (ordered idempotent steps)
    CMP-->>GW: 200 OK {plan, impact, approvals}
    OP->>GW: POST /ops/v1/compensations/{id}:run
    GW->>CMP: Execute Plan
    CMP->>ST: (if needed) no-op or pointer fix
    CMP->>PR: Rewrite/repair projections (CAS by watermark)
    CMP->>IX: Reindex specific docs (with version guards)
    CMP->>CMP: Verify invariants, mark Completed
    CMP->>AUD: Emit Compensation.Completed {id, steps, result}
    GW-->>OP: 200 OK {status:"Completed", metrics}
Hold "Alt" / "Option" to enable pan & zoom

Alternative Paths

  • Auto-compensation from DLQ: DLQ item contains signature; Compensation Service builds & runs plan before replay.
  • Partial plan: execute only safe subset; schedule remaining steps via Retry Service.
  • Integrity-first: if integrity proofs affected, run Integrity Verification/re-seal checks before projection/index fixes.

Error Paths

sequenceDiagram
    actor OP as Operator
    participant GW as API Gateway
    participant CMP as Compensation Service

    OP->>GW: POST /ops/v1/compensations {invalid}
    alt 400 Bad Request (invalid scope, missing ids)
        GW-->>OP: 400 Problem+JSON
    else 404 Not Found (unknown txn/record)
        GW-->>OP: 404 Problem+JSON
    else 409 Conflict (plan already running / step lock held)
        GW-->>OP: 409 Problem+JSON
    else 422 Unprocessable (scenario not compensable)
        GW-->>OP: 422 Problem+JSON
    else 429/503 (rate limit/dependency down)
        GW-->>OP: 429/503 Problem+JSON (+Retry-After)
    end
Hold "Alt" / "Option" to enable pan & zoom

Request/Response Specifications

Input Requirements

| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| Method/Path | POST /ops/v1/compensations | Y | Create compensation plan | JSON body |
| Authorization | header | Y | Bearer <JWT> | Role ops:compensate |
| x-tenant-id | header | Y | Tenant scope | RLS enforced |
| traceparent | header | O | W3C trace context | 55-char |
| txnId | string | O* | Transaction/workflow id | ULID/GUID |
| recordId | string | O* | Affected record id | ULID/GUID |
| scope | object | O* | {from, to, filters} window | ISO-8601 UTC |
| dryRun | bool | O | Only produce plan | default true |
| strategy | enum | O | repair (default) or replay | allowlist |
| notes | string | O | Operator context | ≤ 512 chars |
| idempotency-key | header | O | De-dupe | ≤ 128 chars |
  • Provide at least one of txnId, recordId, or scope.

Control/Status

  • GET /ops/v1/compensations/{id} → status, steps, metrics
  • POST /ops/v1/compensations/{id}:run → execute planned steps
  • POST /ops/v1/compensations/{id}:cancel → cancel if safe

Output Specifications

200 OK (Plan)

| Field | Type | Description | Notes |
|---|---|---|---|
| id | string | Plan id | ULID/GUID |
| steps[] | array | Ordered idempotent steps | See step shape |
| impact | object | Counters by target (proj/index/records) | Estimate |
| approvalsRequired | bool | Whether approval gate is needed | Policy-driven |

Step (shape)

{
  "stepId": "S1",
  "type": "Projection.Rewrite",
  "target": {"projection":"AuditEvents","key":"01JF..."},
  "idempotencyKey": "cmp:proj:AuditEvents:01JF...",
  "precondition": {"watermarkAtLeast":"2025-10-22T10:55:00Z"},
  "action": {"rewriteFrom": "storage", "schemaVersion": 3},
  "verify": {"projectionMatches":"storageHash"}
}
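A step of this shape can be executed idempotently by checking the idempotency key and the watermark precondition before acting. A minimal sketch; the in-memory dict and set stand in for the projection store and the plan's completed-step log:

```python
def execute_step(step: dict, projection: dict, completed: set) -> str:
    """Run one compensation step; field names follow the step shape above."""
    key = step["idempotencyKey"]
    if key in completed:
        return "skipped"  # re-running a completed step is a no-op
    precondition = step.get("precondition", {})
    required = precondition.get("watermarkAtLeast")
    # ISO-8601 UTC strings compare correctly lexicographically.
    if required and projection.get("watermark", "") < required:
        return "precondition-failed"  # surfaces to the caller as HTTP 412
    # The "rewrite": repair the projection entry from ground truth.
    projection[step["target"]["key"]] = step["action"]
    completed.add(key)
    return "completed"
```

On "precondition-failed" the executor re-plans with refreshed state, matching Recovery Procedure 1 below; on "skipped" it simply advances, which is what makes whole-plan re-runs safe.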

Example Payloads

// Create plan (dry-run) by recordId
{
  "tenant": "acme",
  "recordId": "01JF3W8KTR2D3WQF3B9R0KJY9Y",
  "dryRun": true,
  "strategy": "repair",
  "notes": "Projection missing due to prior outage."
}
// Execute planned compensation
POST /ops/v1/compensations/01K0...:run
{
  "approvalToken": "appr_9c1...",
  "concurrency": 8
}

Error Handling

Error Scenarios

| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Invalid scope; both txnId and recordId missing; bad timestamps | Fix request | |
| 401 | Missing/invalid JWT | Acquire valid token | Retry after renewal |
| 403 | Lacks ops:compensate or approval missing | Request role/approval | |
| 404 | Transaction/record not found | Verify ids/window | |
| 409 | Another plan running on same target; step lock held | Wait/Cancel existing | Retry after unlock |
| 412 | Precondition (watermark/version) failed | Refresh state; re-plan | Conditional retry |
| 422 | Scenario not compensable or non-idempotent step detected | Route to manual runbook | |
| 429 | Throttled by target system | Honor Retry-After | Backoff + jitter |
| 503 | Dependency unavailable (Projection/Index/Storage) | Wait or partial run | Idempotent retry later |

Failure Modes

  • Non-idempotent side effect: step flagged and blocked unless operator uses explicit force gate.
  • Stale projection: CAS/watermark precondition fails → re-plan with updated state.
  • Wide impact plan: bulk changes require staged batches with checkpoints to avoid long locks.

Recovery Procedures

  1. On 412, refresh state and regenerate plan; executor resumes from last completed step.
  2. If 503/429, executor backs off, persists progress, and continues when healthy.
  3. For 409, inspect running plan and either merge or cancel the conflicting one.

Performance Characteristics

Latency Expectations

  • Plan (single record) typically ≤ 500 ms; execution time is dominated by target-service latencies.

Throughput Limits

  • Concurrency governed per target (e.g., proj=16, index=8) and per-tenant caps.

Resource Requirements

  • Light CPU for planning; executor memory proportional to batch window.

Scaling Considerations

  • Shard plans by tenant and time window; use watermarks to ensure deterministic ordering.
  • Persist checkpoints every N steps; support resume-after-failure.
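Checkpoint-every-N with resume-after-failure can be sketched as below; the dict stands in for a persisted checkpoint store. Because steps are idempotent, re-running at most N−1 steps after a crash is safe:

```python
def run_plan(steps, checkpoint: dict, do_step, every: int = 10) -> int:
    """Execute ordered steps, persisting progress every `every` steps.

    A restarted executor passes the same checkpoint dict and resumes
    from the last recorded position; names here are illustrative.
    """
    start = checkpoint.get("done", 0)
    for i in range(start, len(steps)):
        do_step(steps[i])
        # Persist after every Nth step and at the end of the plan.
        if (i + 1) % every == 0 or i == len(steps) - 1:
            checkpoint["done"] = i + 1
    return checkpoint.get("done", 0)
```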

Security & Compliance

Authentication

  • OIDC JWT/OAuth at Gateway; service accounts for inter-service calls.

Authorization

  • Roles: ops:compensate.plan, ops:compensate.run, ops:compensate.cancel, ops:compensate.read.
  • Approval tokens required for destructive/bulk plans.

Data Protection

  • Mask PII in operator views; only show necessary diffs.
  • Encrypt transcripts and store with short-lived presigned access.

Compliance

  • Emit Compensation.Planned|Started|StepCompleted|Completed|Failed events with actor, reason, and evidence.
  • Plans and transcripts retained per tenant retention policy.

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| compensation_plans_total | counter | Plans created | Trend |
| compensation_steps_completed_total | counter | Steps done | |
| compensation_failures_total | counter | Failed steps | > 0 sustained |
| compensation_runtime_seconds | histogram | End-to-end duration | p95 > SLO |
| compensation_blocked_total | counter | Blocked by preconditions/locks | Spike alert |

Logging Requirements

  • Structured logs include: planId, tenant, stepId, type, idempotencyKey, precondition, outcome, traceId. No payload values.

Distributed Tracing

  • Spans: plan.synthesize, step.execute(type), verify, checkpoint.
  • Attributes: concurrency, watermark, casVersion, affectedCount.

Health Checks

  • Readiness: access to Storage/Projection/Index; plan store reachable.
  • Liveness: executors progressing; no step stuck beyond timeout.

Operational Procedures

Deployment

  1. Deploy Compensation Service with plan store and executor.
  2. Wire RBAC, approval gates, and observability.
  3. Seed known scenarios and step templates.

Configuration

  • Env: COMP_PLAN_MAX_SCOPE, COMP_EXEC_CONCURRENCY, COMP_STEP_TIMEOUT, COMP_APPROVAL_REQUIRED.
  • Policy: destructive-step approvals; per-target QPS caps; retry/backoff settings.

Maintenance

  • Review top compensation causes; add detectors to prevent recurrence.
  • Tune watermark/CAS policies to reduce 412 conflicts.

Troubleshooting

  • Frequent 412 → stale state; check projection lag and adjust watermarks.
  • High blocked_total → missing approvals or non-idempotent steps; refine templates.
  • Long runtimes → lower concurrency or break plan into smaller batches.

Testing Scenarios

Happy Path Tests

  • Plan & run for “missing projection” fixes projection and index, verifies equality to storage.
  • DLQ-triggered auto-compensation succeeds, then DLQ replay passes.

Error Path Tests

  • 400 invalid scope; 404 unknown record/txn; 409 conflicting plan; 422 non-compensable scenario.
  • 412 precondition failure reruns after re-plan and completes.

Performance Tests

  • Batch plan (1k records) executes within rate limits; checkpoints allow resume.
  • Executor maintains p95 step time within target under load.

Security Tests

  • RBAC and approvals enforced; transcripts encrypted; PII masked by default.
  • Idempotency verified by re-running completed plan → no additional side effects.

Internal References

External References

  • RFC 7807 (Problem Details)
  • W3C Trace Context

Appendices

A. Example Problem+JSON (precondition failed)

{
  "type": "urn:connectsoft:errors/compensation/precondition.failed",
  "title": "Watermark precondition failed",
  "status": 412,
  "detail": "Projection watermark 2025-10-22T11:02:10Z is below required 2025-10-22T11:05:00Z.",
  "traceId": "9f0c1d2e3a4b5c6d..."
}

B. Step Type Catalog (excerpt)

  • Projection.Rewrite — rebuild from storage by key with CAS
  • Index.Reindex — single-doc reindex with version guard
  • Pointer.Relink — fix correlation/resource pointers with invariants check
  • Event.Replay — re-emit projection events from checkpoint (idempotent)

Metrics Collection Flow

Collects and aggregates golden signals and SLO-aligned KPIs from all platform services using OpenTelemetry (OTel) and Prometheus exposition/scrape. Emits standardized counters/gauges/histograms with tenant/shard/region labels, stores them in a scalable TSDB, and drives dashboards & alerts (ingest latency, projection lag, seal lag, queue depth).


Overview

Purpose: Provide reliable, low-cardinality telemetry for capacity planning, incident detection, and SLO compliance.
Scope: In-process instrumentation (OTel SDK), export (OTLP gRPC/HTTP or Prom scrape), aggregation, storage, dashboards, alerting. Excludes application logs and traces (covered in other flows).
Context: Every service ships metrics to an OTel Collector (agent/sidecar/daemonset) which forwards to Metrics Backend (Prometheus/Mimir/Thanos). Alert rules and dashboards read from the backend.
Key Participants:

  • Service (instrumented application)
  • OTel SDK (metrics API + views)
  • OTel Collector (receivers/processors/exporters)
  • Metrics Backend (TSDB) (Prometheus-compatible)
  • Alerting (Alertmanager/Notifications)
  • Dashboards (Grafana)

Prerequisites

System Requirements

  • OTel SDK enabled in each service with histograms for latency and gauges for lags
  • OTel Collector reachable (4317 gRPC / 4318 HTTP) with TLS/mTLS
  • Metrics backend with remote write or federated scrape; retention configured
  • Resource attributes set (service.name, service.version, deployment.environment, region)

Business Requirements

  • SLOs defined per domain: Ingestion latency, Projection lag, Seal lag, Search latency
  • Alert routing/ownership documented; runbooks linked from alerts
  • Cardinality budgets per tenant and endpoint (guardrails/policies)

Performance Requirements

  • Metrics export overhead < 1% CPU; payloads ≤ policy size (batching on)
  • Scrape intervals tuned (e.g., 15s) without overloading services
  • End-to-end telemetry freshness p95 ≤ 30s

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    participant SVC as Service
    participant SDK as OTel SDK (Metrics)
    participant COL as OTel Collector
    participant TSDB as Metrics Backend (Prom/Mimir)
    participant ALR as Alerting
    participant DB as Dashboards

    SVC->>SDK: Record metrics (counters/gauges/histograms)
    SDK->>COL: Export (OTLP) with resource attrs & exemplars (traceId)
    COL->>TSDB: Remote write / Prom scrape pipeline
    TSDB-->>ALR: Rule eval -> alert fire/inhibit
    TSDB-->>DB: Power SLO dashboards & drilldowns
Hold "Alt" / "Option" to enable pan & zoom

Alternative Paths

  • Prometheus scrape: service exposes /metrics; TSDB scrapes directly (no collector) where allowed.
  • Edge aggregation: Collector performs histogram downsampling or delta temporality conversion before write.
  • Multi-tenant split: per-tenant remote-write endpoints or relabeling to enforce isolation.

Error Paths

sequenceDiagram
    participant SVC as Service
    participant COL as OTel Collector
    participant TSDB as Metrics Backend

    SVC->>COL: Export (invalid metrics/labels)
    alt 400 Bad Request (schema/label violation)
        COL-->>SVC: 400 Problem (drop + log)
    else 404 Not Found (unknown tenant/series namespace)
        TSDB-->>COL: 404, metric rejected
    else 409 Conflict (type change for existing metric name)
        TSDB-->>COL: 409, reject write
    else 429/503 (rate limit/outage)
        TSDB-->>COL: 429/503, backoff + retry
    end
Hold "Alt" / "Option" to enable pan & zoom

Request/Response Specifications

Input Requirements

| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| OTLP Endpoint | url | Y | grpc://collector:4317 or https://collector:4318/v1/metrics | TLS/mTLS |
| resource.service.name | string | Y | Logical service name | kebab-case |
| resource.deployment.environment | string | Y | prod / staging / dev | enum |
| resource.cloud.region | string | O | Region/zone | allowlist |
| Metric names | string | Y | atp_* prefix + unit suffix | Prom rules |
| Labels | map | Y | {tenant, shard, region, result, route} | cardinality caps |
| Views | config | O | Histogram buckets, temporality | per-SLO |
| Exemplars | bool | O | Attach trace links to histograms | sample rate cap |

Output Specifications

| Field | Type | Description | Notes |
|---|---|---|---|
| Dashboards | URL | Grafana folders per domain | RBAC enforced |
| Alerts | YAML | Rule groups with SLO burn rates | Routed to on-call |
| Recording Rules | YAML | Pre-agg series by tenant/shard | Reduces cost |
| Telemetry Health | JSON | Collector/TSDB status endpoints | For probes |

Example Payloads

.NET OTel setup (C#)

builder.Services.AddOpenTelemetry()
    .WithMetrics(m => m
        .AddMeter("atp.ingestion","atp.projection","atp.integrity")
        .AddRuntimeInstrumentation()
        .AddAspNetCoreInstrumentation()
        .AddOtlpExporter(o => o.Endpoint = new Uri("http://otel-collector:4317")));

Metric naming & units (examples)

  • atp_ingest_latency_seconds (histogram) — client→accepted latency
  • atp_projection_lag_seconds (gauge) — append→projection lag
  • atp_integrity_seal_lag_seconds (gauge) — append→seal lag
  • atp_ingest_records_total (counter) — records ingested
  • atp_export_jobs_active (gauge) — active export jobs

Recommended histogram buckets (seconds)

[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]

Error Handling

Error Scenarios

| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Invalid metric name/unit; disallowed label; excessive cardinality | Fix SDK config; drop or remap labels | No retry until fixed |
| 401 | Missing/invalid token for remote write | Renew credentials | Retry after renewal |
| 403 | Tenant not authorized to write namespace | Update RBAC/relabeling | |
| 404 | Unknown tenant/namespace; dashboard id missing | Create namespace / correct link | |
| 409 | Type conflict (counter→histogram reuse of name) | Rename metric; update dashboards | |
| 413 | Payload too large | Reduce batch size; increase limits | Retry with smaller batches |
| 429 | Rate limited by TSDB/collector | Honor Retry-After | Exponential backoff + jitter |
| 503 | Collector/TSDB unavailable | Buffer (within cap) | Bounded retry with drop policy |

Failure Modes

  • Cardinality explosion (e.g., userId in labels) → automatic label sanitizer drops high-cardinality keys; emit warning counter.
  • Type migration (metric renamed without deprecation) → breaks dashboards; use recording rules to bridge.
  • Clock skew → out-of-order samples dropped; sync NTP and use server timestamping.
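A label sanitizer of the kind described, built on the allowlist from the Labels row above, might look like the sketch below; the function and constant names are ours, and the dropped count mirrors what feeds atp_metrics_cardinality_dropped_total:

```python
# Allowlist from the input-requirements table; anything else (userId,
# email, raw IPs, ...) is treated as high-cardinality and dropped.
ALLOWED_LABELS = {"tenant", "shard", "region", "result", "route"}

def sanitize_labels(labels: dict, allowed=ALLOWED_LABELS,
                    max_value_len: int = 64):
    """Return (kept_labels, dropped_count); values are also truncated
    so no single label value can blow up series size."""
    kept = {k: str(v)[:max_value_len]
            for k, v in labels.items() if k in allowed}
    return kept, len(labels) - len(kept)
```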

Recovery Procedures

  1. Enable views to aggregate/drop labels causing explosion; redeploy with safe config.
  2. Roll out metric renames via dual-publish window + recording rules → retire old names.
  3. During TSDB outage, buffer with caps; after recovery, drain at limited QPS.

Performance Characteristics

Latency Expectations

  • Exporter p95 < 50 ms per batch; end-to-end metric freshness p95 ≤ 30 s.

Throughput Limits

  • Default 10k samples/s per pod (configurable); per-tenant write QPS caps at the collector.

Resource Requirements

  • SDK minimal CPU; Collector memory sized for queues; backend disk/retention sized to SLO analytics.

Scaling Considerations

  • Shard TSDB by tenant/region; use recording rules to pre-aggregate; leverage remote write to long-term store (Thanos/Mimir).

Security & Compliance

Authentication

  • OTLP with mTLS; Prom scrape secured by service mesh identities or basic auth over TLS.

Authorization

  • Per-tenant write tokens; relabeling at collector enforces tenant isolation.

Data Protection

  • No PII in labels; label sanitizer strips ids, emails, IPs unless explicitly allowlisted.

Compliance

  • Alert acknowledgments/audits stored; SLO reports preserved per retention policy.

Monitoring & Observability

Key Metrics

| Metric Name | Type | Description | Alert Threshold |
|---|---|---|---|
| atp_ingest_latency_seconds | histogram | Client→accepted latency | Burn rate on p95/p99 |
| atp_projection_lag_seconds | gauge | Append→projection lag | > 60s sustained |
| atp_integrity_seal_lag_seconds | gauge | Append→seal lag | > 120s sustained |
| otelcol_exporter_queue_size | gauge | Collector queue depth | > 80% capacity |
| prom_remote_write_requests_failed_total | counter | Failed writes | Rising trend |
| atp_metrics_cardinality_dropped_total | counter | Dropped label pairs | Spike → investigate |

Logging Requirements

  • Collector structured logs for drops/backpressure; include tenant, series, reason.

Distributed Tracing

  • Exemplars: attach traceId to latency histogram buckets for drill-down.
  • Trace spans for exporter/collector with attributes: seriesCount, dropped, retry.

Health Checks

  • Collector readiness (receivers/exporters live); TSDB scrape targets up; dashboard datasource healthy.

Operational Procedures

Deployment

  1. Ship OTel SDK across services; configure default meters and views.
  2. Deploy OTel Collector (agent/daemonset) with TLS and remote write.
  3. Provision dashboards and alert rules from GitOps repo.

Configuration

  • Env: OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_METRIC_EXPORT_INTERVAL, OTEL_RESOURCE_ATTRIBUTES.
  • Collector: processors (batch, memory_limiter), exporters (prometheusremotewrite).
  • Backend: retention, compaction, ruler/alertmanager endpoints.

Maintenance

  • Periodic review of cardinality budget; prune unused metrics.
  • Tune histogram buckets as traffic patterns evolve.

Troubleshooting

  • Missing metrics → check SDK meter enabled, service.name correct, collector pipelines.
  • High drops → inspect label sanitizer logs; remove high-cardinality labels.
  • Alert noise → adjust SLO burn-rate windows and inhibit rules.

Testing Scenarios

Happy Path Tests

  • Ingestion service publishes latency histogram; dashboard shows p95/p99; alerts fire under synthetic slowness.
  • Projection lag gauge reflects backlog; alert triggers and clears after recovery.

Error Path Tests

  • 400 invalid label name → collector drops with warning counter incremented.
  • 404 unknown tenant namespace → write rejected; dashboards unaffected.
  • 409 type conflict on metric rename → dual-publish + recording rule bridges.

Performance Tests

  • 10k samples/s sustained without exporter backpressure; queue sizes stable.
  • TSDB outage → buffered then drained within limits; no OOM.

Security Tests

  • mTLS enforced; cross-tenant writes denied.
  • No PII observed in labels; sanitizer counters remain near zero.

Internal References

External References

  • OpenTelemetry Metrics Spec
  • Prometheus Best Practices

Appendices

A. Example Alert (Projection Lag SLO)

groups:
- name: projection-lag
  rules:
  - alert: ProjectionLagHigh
    expr: atp_projection_lag_seconds{environment="prod"} > 60
    for: 5m
    labels: {severity: page, team: projections}
    annotations:
      summary: "Projection lag high (>{{ $value }}s)"
      runbook: "https://runbooks/projection-lag"

B. Collector Pipeline (excerpt)

receivers:
  otlp:
    protocols: { grpc: {}, http: {} }
processors:
  batch: {}
  memory_limiter: { check_interval: 1s, limit_mib: 512 }
exporters:
  prometheusremotewrite:
    endpoint: https://mimir.remote/api/v1/push
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]

Distributed Tracing Flow

Correlates requests across all hops using W3C Trace Context (traceparent, tracestate) and OpenTelemetry spans. Propagates baggage (e.g., tenant, edition) with strict guardrails to enable per-tenant analytics without leaking PII. All spans are exported to a trace store for query and troubleshooting.


Overview

Purpose: Provide end-to-end visibility of a request from Gateway → Ingestion → Storage → Integrity → Projection → Search/Export, enabling root-cause analysis and SLO burn tracking.
Scope: Context propagation (HTTP/gRPC/bus), span creation and attributes, sampling (head/tail), baggage policy, export via OTel → Collector → Trace Backend, and trace query UX. Excludes logs/metrics (covered elsewhere).
Context: Each service uses OTel SDK. The API Gateway starts/continues a trace, forwards context, and attaches safe baggage (tenant, edition). Downstream services create child spans. Collector batches/exports to a Jaeger/Tempo-compatible backend.
Key Participants:

  • Client / Producer
  • API Gateway
  • Ingestion Service
  • Storage Service
  • Integrity Service
  • Projection Service
  • Search / Export Services
  • OTel Collector
  • Trace Backend (Jaeger/Tempo)

Prerequisites

System Requirements

  • OTel SDK enabled for HTTP, gRPC, DB instrumentation (server & client)
  • W3C Trace Context and Baggage propagators registered
  • OTel Collector reachable with TLS (gRPC 4317 / HTTP 4318)
  • Trace backend available (Tempo/Jaeger) with retention & indexing

Business Requirements

  • Baggage policy allowlist: tenant, edition, optional purpose (no PII)
  • Sampling policy defined (head: rate/parent; tail: error/latency based)
  • SRE runbooks for “missing span”, “broken parent”, and “dropped export”

Performance Requirements

  • Tracing overhead < 3% CPU at default sample rates
  • Export latency hidden via batching; queue backpressure bounded
  • Query p95 ≤ 3 s for recent traces

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    participant CL as Client
    participant GW as API Gateway
    participant ING as Ingestion Service
    participant ST as Storage Service
    participant INT as Integrity Service
    participant PR as Projection Service
    participant COL as OTel Collector
    participant TR as Trace Backend

    CL->>GW: HTTP/gRPC request (+traceparent?, +baggage: tenant,edition)
    Note right of GW: Start/continue root span, enforce baggage allowlist
    GW->>ING: Forward request (+traceparent,+baggage)
    ING->>ST: Append audit (child span)
    ST-->>ING: Ack (db client/server spans)
    ING->>INT: Enqueue/compute integrity (child span)
    INT-->>ING: Proof computed
    ING->>PR: Emit projection event (child span)
    PR-->>ING: Projected
    par Export spans
      GW-->>COL: OTLP export (batched)
      ING-->>COL: OTLP export (batched)
      ST-->>COL: OTLP export (batched)
      INT-->>COL: OTLP export (batched)
      PR-->>COL: OTLP export (batched)
    end
    COL->>TR: Push spans
    TR-->>GW: Trace available for query
    Note over GW,PR: Baggage {tenant,edition} available on all spans
Hold "Alt" / "Option" to enable pan & zoom

Alternative Paths

  • Message bus propagation: inject traceparent/baggage into message headers; consumers extract and create linked spans if processing is async.
  • Tail sampling: collector performs tail-based sampling (error/latency heuristics) for high-value traces while keeping head sampling low.
  • Gateway as root: if client sends no traceparent, Gateway creates the root span; otherwise, it joins the provided context.
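Injecting context into bus message headers while enforcing the baggage allowlist can be sketched as below; the header and baggage key names follow the policy in this flow, while the function names and in-memory header dict are illustrative:

```python
# Allowlist from the baggage policy: tenant, edition, optional purpose.
ALLOWED_BAGGAGE = {"tenant", "edition", "purpose"}

def inject_context(headers: dict, traceparent: str,
                   baggage: dict) -> dict:
    """Copy trace context into outbound message headers, dropping any
    baggage key outside the allowlist so PII never crosses the bus."""
    out = dict(headers)
    out["traceparent"] = traceparent
    safe = {k: v for k, v in baggage.items() if k in ALLOWED_BAGGAGE}
    if safe:
        out["baggage"] = ",".join(
            f"{k}={v}" for k, v in sorted(safe.items()))
    return out

def extract_context(headers: dict):
    """Consumer side: pull context; a missing traceparent means the
    consumer starts a new trace and links spans instead of parenting."""
    return headers.get("traceparent"), headers.get("baggage")
```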

Error Paths

sequenceDiagram
    participant CL as Client
    participant GW as API Gateway
    participant COL as OTel Collector
    participant TR as Trace Backend

    CL->>GW: Request (malformed trace headers)
    alt 400 Bad Request (invalid traceparent format)
        GW-->>CL: 400 Problem+JSON (with new trace id for error handling)
    else Backend query for traceId
        GW->>TR: GET /traces/{traceId}
        alt 404 Not Found (expired/unknown)
            TR-->>GW: 404 Not Found
            GW-->>CL: 404 Problem+JSON
        else 409 Conflict (concurrent sampling policy change)
            TR-->>GW: 409 Conflict
            GW-->>CL: 409 Problem+JSON
        end
    end
Hold "Alt" / "Option" to enable pan & zoom

Request/Response Specifications

Input Requirements (Propagation & Policy)

| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| traceparent | header/metadata | O | W3C Trace Context | 55-char format |
| tracestate | header/metadata | O | Vendor/state hints | size ≤ 512B |
| baggage | header/metadata | O | tenant=acme,edition=enterprise | allowlist keys; total ≤ 1024B |
| x-tenant-id | header | Y | Tenant RLS (also echoed in baggage) | must match |
| trace-flags | bitfield | O | Sampling decision (head) | 0/1 |
| idempotency-key | header | O | For write flows (not tracing but correlated) | ≤ 128 chars |

Ops / Query

  • GET /traces/{traceId} → rendered trace
  • GET /traces/search?tenant=&error=true&latencyMs>… → find traces
  • POST /ops/v1/tracing/sampling {headRate, tailPolicies[]} → update sampling (RBAC)

Output Specifications

  • Spans include attributes (examples):
    • Common: tenant, edition, environment, region, trace.sampled
    • Gateway: route, status_code, client.ip_hash
    • Ingestion: audit.schemaVersion, payload.bytes, validation.result
    • Storage: db.system, db.operation=append, db.statement?=off
    • Integrity: integrity.blockId, segment, proof.kid
    • Projection: watermark, lag.ms
    • Search/Export: query.kind, result.count, package.id

Example HTTP with headers

POST /audit/v1/records HTTP/1.1
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
tracestate: atp=gw;ver=1
baggage: tenant=acme,edition=enterprise
x-tenant-id: acme
content-type: application/json

Example gRPC metadata (pseudo)

:authority: ingestion.atp
traceparent: 00-4bf92f3577b34...-00f067aa0b...-01
baggage: tenant=acme,edition=enterprise

Error Handling

Error Scenarios

| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Malformed traceparent/baggage | Drop/regen context; return Problem+JSON if strict | No retry until fixed |
| 401 | Querying traces without auth | Acquire token | Retry after renewal |
| 403 | Cross-tenant trace access | Enforce RLS; deny | — |
| 404 | Trace id not found/expired | Verify id/retention window | — |
| 409 | Sampling policy update conflicts | Re-fetch policy; retry op | Conditional retry |
| 413 | Oversized baggage | Trim to policy; drop disallowed keys | Resend with smaller baggage |
| 429 | Collector/back-end rate limit | Honor Retry-After | Exponential backoff + jitter |
| 503 | Collector/back-end unavailable | Buffer within caps | Bounded retry, drop oldest if over cap |

Failure Modes

  • Broken parentage: services that don’t extract context create new roots → detectable by orphan span metric.
  • Baggage misuse: high-cardinality/PII snuck into baggage → sanitizer drops keys and emits policy violations.
  • Excess sampling: high head sampling inflates overhead → shift to tail sampling for error/slow traces.
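The baggage-sanitizer behavior described above might look like the following; the allowlist mirrors the `tenant`/`edition` keys used throughout this flow, but the function itself is an illustrative sketch, not the shipped middleware:

```python
BAGGAGE_ALLOWLIST = {"tenant", "edition"}
BAGGAGE_MAX_BYTES = 1024  # total budget from the propagation policy table

def sanitize_baggage(header: str) -> tuple[str, list[str]]:
    """Keep allowlisted keys; return sanitized header and dropped keys."""
    kept, dropped = [], []
    for item in header.split(","):
        key, _, value = item.strip().partition("=")
        if key in BAGGAGE_ALLOWLIST:
            kept.append(f"{key}={value}")
        else:
            dropped.append(key)       # reported on a policy-violation counter
    out = ",".join(kept)
    if len(out.encode()) > BAGGAGE_MAX_BYTES:
        out = ""                      # over budget: drop baggage entirely
    return out, dropped
```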

Recovery Procedures

  1. Enable/verify propagators in all client/server middleware.
  2. Turn on tail sampling policies (e.g., error=true, latency>500ms).
  3. Inspect “orphan span” dashboards; fix missing extract/inject in specific services.

Performance Characteristics

Latency Expectations

  • Instrumentation overhead p95 ≤ 1 ms per hop (sampled), near-zero when unsampled.

Throughput Limits

  • Collector queue sized for burst N× steady state; backpressure triggers temporary head sampling reductions.

Resource Requirements

  • Small CPU for SDK; Collector memory for queues; backend disk for retention (e.g., 7–14 days).

Scaling Considerations

  • Shard collectors per region/tenant; enable tail sampling at edge; compress exports; prefer OTLP gRPC.

Security & Compliance

Authentication

  • Query/UI protected by OIDC; service-to-collector via mTLS.

Authorization

  • Enforce tenant isolation on trace queries (filter by baggage tenant and RLS).

Data Protection

  • No PII in baggage or span attributes; hash IPs/UAs; redact payloads; disable SQL/body capture by default.
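The hashed-IP rule above (span attribute `client.ip_hash`) could use a keyed hash so values correlate within a key-rotation window without being reversible. The truncation length and key handling below are assumptions:

```python
import hashlib
import hmac

def ip_hash(ip: str, key: bytes) -> str:
    """Keyed, truncated hash of a client IP for span attributes."""
    digest = hmac.new(key, ip.encode(), hashlib.sha256).hexdigest()
    return digest[:16]                # correlation, not identity
```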

Compliance

  • Retention adheres to tenant policy; trace access is audited with actor and purpose-of-use.

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| otel_traces_exported_total | counter | Spans successfully exported | Sudden drop |
| otel_traces_dropped_total | counter | Dropped spans (queue/limits) | > baseline |
| trace_orphan_spans_total | counter | Spans without valid parent | Spike alert |
| collector_queue_size | gauge | Export queue depth | > 80% capacity |
| trace_tail_sampled_total | counter | Tail-sampled traces | Track ratio |
| trace_query_latency_seconds | histogram | UI/API query latency | p95 > SLO |

Logging Requirements

  • Structured logs: traceId, spanId, dropReason, policyId, tenant, edition. No payload values.

Distributed Tracing

  • (Meta) link exporter spans to service spans; include exemplars on latency histograms (metrics flow).

Health Checks

  • Collector readiness/liveness; backend ingestion status; UI availability.

Operational Procedures

Deployment

  1. Enable OTel SDKs with HTTP/gRPC/DB instrumentation and W3C propagators.
  2. Deploy OTel Collector (batch, memory_limiter, tail_sampling processors).
  3. Wire trace backend and provision dashboards.

Configuration

  • Env: OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_TRACES_SAMPLER, OTEL_RESOURCE_ATTRIBUTES.
  • Tail Sampling (examples): error=true, status_code>=500, latency_ms>500, selective by tenant.
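The tail-sampling examples above map naturally onto the OTel Collector's `tail_sampling` processor. A hedged config fragment (policy names and the baseline rate are illustrative, not ATP's shipped configuration):

```yaml
processors:
  tail_sampling:
    decision_wait: 10s            # buffer spans before deciding per trace
    policies:
      - name: errors              # keep all error traces
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow                # keep slow traces (latency_ms > 500)
        type: latency
        latency: {threshold_ms: 500}
      - name: baseline            # small probabilistic floor for the rest
        type: probabilistic
        probabilistic: {sampling_percentage: 1}
```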

Maintenance

  • Adjust sampling as traffic patterns evolve; rotate retention; prune noisy attributes.

Troubleshooting

  • Missing links → check inject/extract middleware order.
  • High drops → increase collector queues or reduce sampling; inspect backpressure.
  • Cross-tenant leak alerts → confirm baggage sanitizer & RLS.

Testing Scenarios

Happy Path Tests

  • End-to-end trace spans present across Gateway→Ingestion→Storage→Integrity→Projection.
  • Baggage (tenant=acme, edition=enterprise) visible on all spans.

Error Path Tests

  • 400 invalid traceparent handled; new trace created for error path.
  • 404 unknown trace id query returns Problem+JSON, no data leakage.
  • 409 sampling change during export handled without crash.

Performance Tests

  • Sampled high-QPS traffic keeps overhead < 3%.
  • Collector withstands burst without dropping (or drops < policy).

Security Tests

  • No PII in spans/baggage; sanitizer counters near zero.
  • Trace queries scoped to tenant via RLS.

External References

  • W3C Trace Context & Baggage
  • OpenTelemetry Specification

Appendices

A. Example Problem+JSON (invalid trace headers)

{
  "type": "urn:connectsoft:errors/tracing/traceparent.invalid",
  "title": "Invalid W3C traceparent header",
  "status": 400,
  "detail": "Trace ID length not 16 bytes (hex).",
  "traceId": "9f0c1d2e3a4b5c6d..."
}

B. Suggested Span Attribute Keys (allowlist)

  • tenant, edition, environment, region, route, status_code, db.system, db.operation, integrity.blockId, projection.watermark, search.query.kind, export.package.id

Health Check Flow

Implements liveness, readiness, and startup probes with per-component dependency checks and an aggregated status that signals deploy orchestrators (e.g., Kubernetes) for safe rollouts and traffic routing. Probes are budgeted and isolated to avoid noisy-neighbor effects; timeouts and intervals are tuned to service SLOs.


Overview

Purpose: Provide reliable health signaling for deployment safety, traffic gating, and fast failure detection without causing additional load or false negatives.
Scope: Local process liveness, startup warmup, dependency readiness (DB, queue, cache, search, integrity, policy), aggregation, export via HTTP endpoints, and ops overrides (maintenance mode).
Context: Orchestrators consume /health/liveness, /health/readiness, /health/startup. Readiness reflects dependencies & backpressure, not just process up. Liveness is crash/lock detection only.
Key Participants:

  • Service (with HealthCheck library)
  • Dependency Probers (DB/Cache/Queue/Search/Integrity/Policy)
  • Aggregator (health manager + budgeter)
  • Orchestrator (Kubernetes/Service Mesh/Gateway)
  • Ops UI / API (maintenance & overrides)
  • Observability (metrics/logs)

Prerequisites

System Requirements

  • HealthCheck middleware/library enabled with endpoints: /health/liveness, /health/readiness, /health/startup
  • Per-dependency prober with timeouts, concurrency caps, and circuit-break aware checks
  • Clock synchronized (UTC) for timestamps; structured logging enabled
  • Network policies allow orchestrator-to-service health traffic

Business Requirements

  • Defined maintenance mode procedure (drain → mark NotReady → perform ops)
  • Per-tenant/edition readiness policies when dependencies are multi-tenant
  • Runbooks for common failure signatures (DB degraded, queue backlog, index lag)

Performance Requirements

  • Probe p95 ≤ 50 ms for local checks, ≤ 200 ms for remote deps
  • Readiness interval typically 10s–30s; liveness interval 5s–10s
  • Probe CPU overhead < 1%; IO bounded with concurrency limits

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    participant ORCH as Orchestrator (K8s)
    participant SVC as Service
    participant AGG as Health Aggregator
    participant DB as Database
    participant Q as Queue
    participant C as Cache

    ORCH->>SVC: GET /health/startup
    SVC->>AGG: Run startup checks (one-time warmups)
    AGG-->>SVC: status: Up
    SVC-->>ORCH: 200 OK {status:"Up"}

    ORCH->>SVC: GET /health/readiness
    SVC->>AGG: Parallel probers (DB/Q/Cache) with budgets
    AGG->>DB: ping (timeout ≤ 150ms)
    AGG->>Q: depth/head check
    AGG->>C: get/set key
    DB-->>AGG: OK
    Q-->>AGG: OK
    C-->>AGG: OK
    AGG-->>SVC: Ready
    SVC-->>ORCH: 200 OK {status:"Ready", components:[...]}

    ORCH->>SVC: GET /health/liveness
    SVC-->>ORCH: 200 OK {status:"Alive"}

Alternative Paths

  • Maintenance mode: Ops toggles → service returns 503 on readiness with Retry-After, keeps liveness 200 to avoid restarts during planned work.
  • Degraded-but-Serving: Non-critical dependency fails; readiness remains 200 with warnings[], traffic allowed but autoscaler informed via metrics.
  • Backpressure-aware readiness: If queue depth/backlog exceeds threshold, respond 429 Too Many Requests (optionally) or 503 with reason to trigger traffic shifting.

Error Paths

sequenceDiagram
    participant ORCH as Orchestrator
    participant SVC as Service
    participant AGG as Health Aggregator
    participant DB as Database

    ORCH->>SVC: GET /health/readiness
    SVC->>AGG: Run checks
    AGG->>DB: ping
    DB-->>AGG: timeout
    AGG-->>SVC: NotReady {db:"Timeout"}
    alt Not Ready
        SVC-->>ORCH: 503 Service Unavailable (Problem+JSON)
    end

Request/Response Specifications

Input Requirements

| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| GET /health/liveness | http | Y | Process health (no deps) | Always lightweight |
| GET /health/startup | http | Y | Warmup complete? | One-time gates |
| GET /health/readiness | http | Y | Dependency/traffic readiness | Budgeted checks |
| POST /ops/v1/health:maintenance | http | O | Enter/exit maintenance | AuthZ required |
| Authorization (ops) | header | O | Bearer <JWT> | Role ops:health |
| traceparent | header | O | Trace exemplar correlation | Optional |
| full=true (query) | bool | O | Include per-component detail | RBAC for PII masking |

Output Specifications

200 OK (Readiness/Liveness/Startup)

{
  "status": "Ready",
  "service": "ingestion",
  "time": "2025-10-27T08:21:45Z",
  "warnings": [],
  "components": [
    {"name":"db", "type":"postgres", "status":"Up", "latencyMs": 32},
    {"name":"queue", "type":"rabbitmq", "status":"Up", "latencyMs": 18},
    {"name":"cache", "type":"redis", "status":"Up", "latencyMs": 4}
  ]
}

503 Service Unavailable (Not Ready)

{
  "type": "urn:connectsoft:errors/health/not-ready",
  "title": "Readiness check failed",
  "status": 503,
  "detail": "postgres timeout; queue connecting",
  "retryAfterSeconds": 10
}

Maintenance Mode Toggle

// POST /ops/v1/health:maintenance
{ "enabled": true, "reason": "DB failover", "ttlSeconds": 900 }

Error Handling

Error Scenarios

| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Invalid maintenance payload (negative TTL/unknown field) | Fix request | — |
| 401 | Missing/invalid JWT for ops endpoint | Obtain token | Retry after renewal |
| 403 | Caller lacks ops:health role | Request access | — |
| 404 | Unknown component in ?component= query | Remove/rename | — |
| 409 | Conflicting state change (maintenance enabled while drain in progress) | Wait or cancel prior op | Retry after resolution |
| 429 | Health endpoint rate-limited (human/automation abuse) | Back off | Jittered retry |
| 503 | Not Ready (dependency down/backpressure) | Remediate dependency | Retry after Retry-After |
| 504 | Probe exceeded timeout budget | Increase timeout if justified | Backoff; verify load |

Failure Modes

  • Noisy-neighbor probes: too-frequent or heavy checks cause dependency load → enforce intervals, timeouts, and read-only probes.
  • Coupled liveness/readiness: using dependency checks for liveness causes restarts → separate strictly.
  • Flapping readiness: thresholds too tight → add stabilization window and hysteresis.
  • Leaky details: exposing internal hostnames/errors externally → sanitize messages.
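The stabilization-window/hysteresis fix for flapping readiness can be sketched as a small state machine; the thresholds below are assumptions to be tuned against your probe intervals:

```python
class ReadinessHysteresis:
    """Flip readiness only after N consecutive failures/successes."""

    def __init__(self, fail_threshold: int = 3, ok_threshold: int = 2):
        self.fail_threshold = fail_threshold
        self.ok_threshold = ok_threshold
        self.ready = True
        self._fails = 0
        self._oks = 0

    def observe(self, probe_ok: bool) -> bool:
        if probe_ok:
            self._oks += 1
            self._fails = 0
            if not self.ready and self._oks >= self.ok_threshold:
                self.ready = True     # sustained recovery: report Ready
        else:
            self._fails += 1
            self._oks = 0
            if self.ready and self._fails >= self.fail_threshold:
                self.ready = False    # sustained failure: report NotReady
        return self.ready
```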

Recovery Procedures

  1. Enter maintenance mode → drain traffic (readiness 503), keep liveness 200, perform remediation.
  2. Enable degraded mode for non-critical deps; keep serving with warnings.
  3. Increase probe intervals/timeouts cautiously; verify impact via metrics.

Performance Characteristics

Latency Expectations

  • Liveness: p95 ≤ 5 ms; Start-up: first success within warmup target; Readiness: p95 ≤ 150–200 ms.

Throughput Limits

  • Cap concurrent dependency checks (e.g., max 2 per dep per instance).
  • Global RPS limit on health endpoints to prevent abuse.

Resource Requirements

  • Minimal CPU; network usage proportional to dependency checks; cache results for stabilization window (e.g., 2–5s).

Scaling Considerations

  • Shard readiness by tenant/shard if dependencies are partitioned; expose components[].partition.
  • Push passive signals (e.g., queue depth) from dependencies to reduce active probing.

Security & Compliance

Authentication

  • Health endpoints for orchestrator may be anonymous inside cluster (network-policy protected). Ops endpoints require OIDC JWT.

Authorization

  • Roles: ops:health.read, ops:health.maintain.

Data Protection

  • Mask error details in public readiness; full component diagnostics behind RBAC. No secrets in responses.

Compliance

  • Health state transitions and maintenance toggles audited with actor, reason, and duration.

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| health_readiness_status | gauge | 1=Ready, 0=NotReady | 0 for >1m |
| health_probe_latency_ms{component} | histogram | Per-component probe latency | p95 breach |
| health_notready_total{reason} | counter | Fail events by reason | Spike alert |
| health_maintenance_mode | gauge | 1 when enabled | Unexpected >0 |
| health_flaps_total | counter | Ready↔NotReady transitions | >N/hour |

Logging Requirements

  • Structured logs: probe, component, latencyMs, result, timeout, reason, traceId.

Distributed Tracing

  • Health endpoints not traced by default (to reduce noise); ops toggles may emit spans with attributes maintenance=true.

Health Checks

  • Internal self-check (threadpool saturation, GC, disk space).
  • Dependency checks with budgeted timeouts and circuit-breaker awareness.

Operational Procedures

Deployment

  1. Expose /health/liveness, /health/readiness, /health/startup.
  2. Configure orchestrator probes and thresholds (see Appendix).
  3. Register metrics and alerts; link runbooks.

Configuration

  • Env: HEALTH_READINESS_TIMEOUT_MS, HEALTH_PROBE_INTERVAL_S, HEALTH_STABILIZATION_WINDOW_S, HEALTH_MAX_CONCURRENCY, HEALTH_MAINTENANCE_TTL_S.
  • Policy: which dependencies are critical vs advisory for readiness.

Maintenance

  • Use ops endpoint to enable maintenance → drain → operate → disable → verify readiness.

Troubleshooting

  • Frequent flaps → extend stabilization, review dependency SLOs.
  • Probes time out → check network/circuit breaker; raise timeout only with evidence.
  • Orchestrator killing pods unexpectedly → confirm liveness is local-only.

Testing Scenarios

Happy Path Tests

  • Startup becomes Up after caches warmed; readiness 200.
  • All components return Up; status JSON includes latencies.

Error Path Tests

  • DB timeout triggers readiness 503 with sanitized Problem+JSON.
  • 400 invalid maintenance payload rejected; 404 unknown component; 409 conflicting state change handled.

Performance Tests

  • Probe p95 ≤ 200 ms under load; intervals respected; no excess CPU/IO.
  • High RPS to health endpoints remains within rate limits.

Security Tests

  • Public readiness hides internals; full diagnostics gated by RBAC.
  • Audit records for maintenance toggles captured.

External References

  • Kubernetes probe guidance (liveness/readiness/startup)

Appendices

A. Example Kubernetes Probes

livenessProbe:
  httpGet: { path: /health/liveness, port: 8080 }
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 2
readinessProbe:
  httpGet: { path: /health/readiness, port: 8080 }
  initialDelaySeconds: 20
  periodSeconds: 15
  timeoutSeconds: 2
  successThreshold: 1
  failureThreshold: 3
startupProbe:
  httpGet: { path: /health/startup, port: 8080 }
  failureThreshold: 30
  periodSeconds: 5

B. Example Problem+JSON (Not Ready)

{
  "type": "urn:connectsoft:errors/health/not-ready",
  "title": "Readiness check failed",
  "status": 503,
  "detail": "queue backlog > threshold; integrity service degraded",
  "retryAfterSeconds": 15
}

Alert Generation Flow

Turns signals into action: evaluates thresholds and SLO burn rates, fires alerts, routes to pager/chat/webhook, opens a ticket, and auto-closes on recovery. Noise is controlled via grouping, inhibition, dedup windows, silences, and maintenance calendars. Escalation paths are explicit and auditable.


Overview

Purpose: Deliver timely, actionable notifications with clear ownership and escalation while minimizing false positives.
Scope: Rule evaluation, grouping/dedup, routing, paging/notifications, ticket creation, auto-resolve, silencing and inhibition controls.
Context: Metrics and events feed a Rule Engine (e.g., Prometheus Ruler). Alerts traverse a Router (Alertmanager-like) to destinations: PagerDuty/On-call, Chat (Slack/Teams), Webhook (runbooks/automation), and Ticketing (Jira/ServiceNow).
Key Participants:

  • Metrics Backend / Rule Engine
  • Alert Router (grouping, dedup, silences, inhibition)
  • Destinations: Pager, Chat, Webhook, Ticketing
  • On-call Engineer / Team
  • Ops API/UI (manage silences, ack, routes)
  • Runbooks (linked from alerts)

Prerequisites

System Requirements

  • Metrics and logs published with low cardinality labels (tenant, shard, region, service)
  • Rule Engine with multi-window SLO burn capability and dependency-aware inhibition
  • Alert Router HA with persistent silences and dedup state
  • Integrations to pager/chat/ticket with retry & backoff

Business Requirements

  • Defined ownership map: service → team → escalation policy
  • Runbooks per alert with clear first actions and diagnostics links
  • Maintenance windows / change freeze calendars integrated

Performance Requirements

  • End-to-end alerting latency p95 ≤ 30s from breach to page
  • Router throughput sized for peak fan-out; delivery retries with backoff
  • Dedup window defaults (e.g., 5m) to limit paging storms

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    participant MET as Metrics/Rule Engine
    participant RTR as Alert Router
    participant PD as Pager (On-call)
    participant CHAT as Chat (Slack/Teams)
    participant TKT as Ticketing (Jira/SNOW)
    participant OPS as On-call Engineer

    MET->>RTR: Alert{labels, annotations, status="firing"}
    RTR->>RTR: Group & dedup (fingerprint), apply inhibition/silences
    RTR->>PD: Page (severity=page, service=ingestion)
    RTR->>CHAT: Notify #oncall-ingestion (runbook link)
    RTR->>TKT: Create ticket (P1) with alert context
    PD-->>OPS: Page delivered (push/phone/SMS)
    OPS->>TKT: Acknowledge ticket, start mitigation
    MET-->>RTR: Alert{status="resolved"}
    RTR->>PD: Auto-resolve page
    RTR->>TKT: Auto-close with resolution note
    RTR->>CHAT: Post recovery message

Alternative Paths

  • Warning-only: severity warn → chat/webhook only, no page.
  • Escalation: no ack within 10m → escalate to secondary, then manager-on-call.
  • Bulk correlation: many shard alerts collapse into one parent incident with children inhibited.
  • Auto-remediation: webhook triggers safe runbook; success posts to thread and downgrades severity.

Error Paths

sequenceDiagram
    participant MET as Metrics/Rule Engine
    participant RTR as Alert Router
    participant PD as Pager

    MET->>RTR: Alert firing
    alt 400 Bad Request (invalid labels/size)
        RTR-->>MET: 400 drop + audit
    else 404 Destination not configured
        RTR-->>MET: 404, fallback to default route
    else 409 Conflict (duplicate route update)
        RTR-->>MET: 409, keep last-good config
    else 429/503 Pager API throttled/outage
        RTR-->>PD: retry with backoff, queue locally
    end

Request/Response Specifications

Input Requirements (Alert Payload to Router)

| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| status | enum | Y | — | firing \| resolved |
| labels | map | Y | {alertname, service, tenant, shard, severity} | size ≤ 50, allowlist keys |
| annotations | map | O | {summary, description, runbook, dashboard} | ≤ 4KB |
| startsAt / endsAt | RFC3339 | Y/O | When firing/resolved | UTC |
| generatorURL | url | O | Link to rule source | valid URL |
| fingerprint | string | O | Stable dedup key | computed if missing |
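The "computed if missing" fingerprint above could be derived as a stable hash over sorted label pairs, so the same alert always dedups to one key regardless of label ordering. The truncation to 16 hex chars is an assumption:

```python
import hashlib

def fingerprint(labels: dict[str, str]) -> str:
    """Stable dedup key: hash of canonically ordered label pairs."""
    canonical = "\n".join(f"{k}={labels[k]}" for k in sorted(labels))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]
```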

Output Specifications (Destinations)

  • Pager: payload includes service, severity, routing_key, dedup_key=fingerprint, links (runbook/dashboards).
  • Chat: message with summary, top labels, graph image link, ack emoji workflow.
  • Ticket: fields summary, description, priority, labels, customFields (tenant/shard), plus auto-close comment on resolve.
  • Webhook: signed POST with HMAC; body includes current status, last N samples, silence suggestions.
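The signed-webhook contract above implies HMAC on send and constant-time verification on receive (a mismatch yields the 412 in the error table). A sketch assuming an SHA-256 HMAC over the raw request body:

```python
import hashlib
import hmac

def sign(body: bytes, secret: bytes) -> str:
    """HMAC-SHA256 signature of the raw webhook body (hex-encoded)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()

def verify(body: bytes, secret: bytes, signature: str) -> bool:
    """Constant-time comparison avoids timing side channels."""
    return hmac.compare_digest(sign(body, secret), signature)
```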

Example Payloads

// Alert to Router (condensed)
{
  "status": "firing",
  "labels": {
    "alertname": "ProjectionLagHigh",
    "service": "projection",
    "tenant": "acme",
    "severity": "page",
    "region": "eu-west-1"
  },
  "annotations": {
    "summary": "Projection lag > 60s",
    "description": "Watermark delay crossing SLO for tenant=acme.",
    "runbook": "https://runbooks/projection-lag",
    "dashboard": "https://grafana/d/lag"
  },
  "startsAt": "2025-10-27T08:15:00Z",
  "generatorURL": "prom://ruler/expr/123"
}
# Burn-rate rule example (SLO 99.9% over 30d)
- alert: IngestSLOBurnHigh
  expr: |
    (sum(rate(atp_ingest_errors_total[5m])) by (service,tenant)
     / sum(rate(atp_ingest_requests_total[5m])) by (service,tenant))
    > (0.001 * 14.4)
  for: 5m
  labels: {severity: page, service: ingestion}
  annotations:
    summary: "Ingest SLO fast burn (5m)"
    runbook: "https://runbooks/ingest-slo"

Error Handling

Error Scenarios

| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Invalid alert payload (missing labels/oversized) | Fix rule/labels; drop event | No retry until fixed |
| 401 | Webhook/Pager auth failed | Rotate tokens/keys | Retry after renewal |
| 403 | Route not permitted for tenant/edition | Update RBAC/route policy | — |
| 404 | Route/destination not found | Use default route; fix config | — |
| 409 | Concurrent route config updates | Apply last-write-wins or CAS | Retry after fetch |
| 412 | HMAC signature mismatch (webhook) | Recalculate with correct secret | — |
| 429 | Destination rate-limiting | Honor vendor backoff | Exponential backoff + jitter |
| 503 | Destination outage | Queue & retry within TTL | Progressive backoff, failover route |

Failure Modes

  • Alert storms: ungrouped high-cardinality labels → enable grouping keys and label sanitization.
  • Flapping: thresholds too tight → add for: windows and hysteresis.
  • Cascading pages: child alerts page alongside parent → add inhibition until parent resolves.
  • Silent failures: misconfigured routes → periodic synthetic alerts verify end-to-end.

Recovery Procedures

  1. Activate global silence or maintenance mode during planned incidents.
  2. Expand grouping and increase group_wait/group_interval to dampen bursts.
  3. Fail over to secondary pager provider if primary remains 503/429 beyond SLO.

Performance Characteristics

Latency Expectations

  • Signal-to-page p95 ≤ 30s; chat/webhook p95 ≤ 15s; ticket creation ≤ 60s.

Throughput Limits

  • Router handles thousands of alerts/min with grouping; per-destination QPS caps and queues.

Resource Requirements

  • Router memory for dedup store and silence registry; HA storage (e.g., S3/object store or DB) for persistence.

Scaling Considerations

  • Partition routes by region and service; replicate router HA; shard rules by domain.

Security & Compliance

Authentication

  • Mutual TLS for webhook receivers; OAuth tokens/keys for pager/ticket/chat APIs.

Authorization

  • Route policies per tenant/edition; ops roles to create silences and modify routes (ops:alerts.*).

Data Protection

  • Do not include PII in labels/annotations; link dashboards instead of embedding raw data.

Compliance

  • All alert lifecycle actions (fire/route/ack/resolve/silence) audited with actor, reason, and timestamps.

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| alerts_firing_total | gauge | Active firing alerts | Trend by service |
| alerts_notifications_sent_total | counter | Deliveries by destination | Sudden drop |
| alerts_delivery_failures_total | counter | Failed sends by dest | Spike alert |
| alerts_routing_latency_seconds | histogram | Router processing latency | p95 breach |
| alerts_silences_active | gauge | Current silences | Unexpected growth |
| alerts_inhibited_total | counter | Child alerts inhibited | Track correlation |

Logging Requirements

  • Structured logs: alertname, fingerprint, status, route, destination, deliveryId, retry, actor (for silences/acks).

Distributed Tracing

  • Trace Router pipeline (ingest→group→deliver); attach exemplars to routing latency histograms.

Health Checks

  • Router readiness includes destination probes (token check, rate-limit status); synthetic canaries validate end-to-end.

Operational Procedures

Deployment

  1. Deploy Rule Engine & Router HA; configure storage for silences/dedup.
  2. Create base routes (page/warn/info) and default receivers.
  3. Set up synthetic alerts per region/service.

Configuration

  • Router: group_by: [alertname, service, tenant], group_wait: 10s, group_interval: 5m, repeat_interval: 2h.
  • Escalation: ack timeout 10m; primary → secondary → manager-on-call.
  • Webhook HMAC secret rotation schedule.

Maintenance

  • Review top talkers weekly; reduce cardinality; tune thresholds and for: windows.
  • Validate runbook links and dashboard IDs quarterly.

Troubleshooting

  • No pages received → check destination quotas, auth, and router queue depths.
  • Excess noise → increase grouping, add inhibition rules, widen hysteresis.
  • Auto-close not working → verify resolved events flow and ticket webhooks.

Testing Scenarios

Happy Path Tests

  • Fire ProjectionLagHigh → page+chat+ticket created; resolves and auto-closes on recovery.
  • Warning-only alert posts to chat without paging.

Error Path Tests

  • 400/404 misrouted alerts handled; default route used.
  • 429/503 destination throttling triggers retries and eventual delivery/failover.

Performance Tests

  • Burst of 10k alerts grouped to ≤ 100 pages; router p95 latency within SLO.
  • Dedup prevents duplicate pages across replicas.

Security Tests

  • Webhook HMAC verified; invalid signature (412) rejected.
  • No PII in labels/annotations; audits present for silences/acks.

External References

  • SRE Workbook: Multi-window, multi-burn-rate alerts
  • Vendor APIs: PagerDuty/Slack/Jira

Appendices

A. Router Route Snippet (YAML)

route:
  group_by: ['alertname','service','tenant']
  group_wait: 10s
  group_interval: 5m
  repeat_interval: 2h
  receiver: 'default'
  routes:
    - match: {severity: 'page'}
      receiver: 'pager'
      continue: true
    - match: {severity: 'page'}
      receiver: 'chat'
    - match_re: {severity: 'warn|info'}
      receiver: 'chat'

receivers:
  - name: pager
    pagerduty_configs:
      - routing_key: ${PAGERDUTY_KEY}
        dedup_key: '{{ .GroupLabels.fingerprint }}'
  - name: chat
    slack_configs:
      - channel: '#oncall-{{ .GroupLabels.service }}'
        title: '{{ .CommonLabels.alertname }}'
        text: '{{ .CommonAnnotations.summary }}\n{{ .CommonAnnotations.runbook }}'

B. Example Silence (API)

POST /ops/v1/alerts/silences
{
  "matchers": [{"name":"service","value":"projection","isRegex":false}],
  "startsAt": "2025-10-27T08:00:00Z",
  "endsAt": "2025-10-27T10:00:00Z",
  "createdBy": "deploy-bot",
  "comment": "Planned projection migration"
}

Tenant Onboarding Flow

Provisions and activates a new tenant with IdP linkage, policy defaults, partitioned storage & indexes, per-tenant KMS keys and residency settings. Ensures strict isolation (RLS) and emits onboarding welcome/events. All steps are idempotent and fully audited.


Overview

Purpose: Safely create a tenant boundary (identity, data, policy, encryption, residency) and make it ready for use.
Scope: Intake → validation → IdP linkage → resource provisioning (storage/projections/search) → policy/key/residency setup → activation → welcome events. Excludes billing system specifics.
Context: Orchestrated by Onboarding Service with calls to Identity/IdP, Policy, Storage/Projection/Search, KMS/Secrets, and Notifications.
Key Participants:

  • Tenant Admin / Operator
  • Onboarding Service (orchestrator)
  • Identity/SSO (SAML/OIDC, optional SCIM)
  • Policy Service (defaults: retention, redaction)
  • Storage Service (append store partitions)
  • Projection/Search Services (read models, index aliases)
  • KMS / Secrets (per-tenant keys/creds)
  • Notification/Webhooks

Prerequisites

System Requirements

  • Onboarding API enabled with RBAC and idempotency support
  • KMS, Storage, Projection DB, and Search clusters reachable and quota available
  • DNS/Domain verification service (for SAML domains)
  • OTel tracing/metrics active for step diagnostics

Business Requirements

  • Approved edition/plan matrix (limits, features)
  • Default policy bundles per edition/region (retention, redaction profiles)
  • Residency catalog (allowed regions per tenant)

Performance Requirements

  • Synchronous intake p95 ≤ 300 ms; async provisioning target < 2 min
  • Parallelizable steps (keys/indexes) with bounded concurrency
  • Backpressure handling when cluster capacity is constrained
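The parallelizable provisioning steps with bounded concurrency could be orchestrated as below; step names mirror the happy-path diagram, and the concurrency cap is an assumption:

```python
import asyncio

async def provision(steps, max_concurrency: int = 3):
    """Run provisioning steps in parallel, at most max_concurrency at once."""
    sem = asyncio.Semaphore(max_concurrency)

    async def run(name, step_fn):
        async with sem:               # bounded concurrency per tenant create
            return name, await step_fn()

    results = await asyncio.gather(*(run(n, f) for n, f in steps))
    return dict(results)
```

The onboarding orchestrator would pass coroutines for the KMS key, storage partition, projection schema, and index-alias steps, then run its health checks on the returned statuses.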

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    actor TA as Tenant Admin
    participant GW as API Gateway
    participant ONB as Onboarding Service
    participant IDP as Identity/IdP
    participant POL as Policy Service
    participant KMS as KMS/Secrets
    participant ST as Storage (Append)
    participant PR as Projection DB
    participant IX as Search Index
    participant NTF as Notifications/Webhooks

    TA->>GW: POST /tenants/v1 (tenantSlug, region, edition, idpConfig, adminEmails)
    GW->>ONB: CreateTenant (idempotency-key)
    ONB->>ONB: Validate & reserve tenantId/slug (unique)
    ONB->>IDP: Link IdP / Verify domain (SAML/OIDC/SCIM)
    ONB->>POL: Apply default policies (retention/redaction)
    par Provision resources
        ONB->>KMS: Create tenant key + alias (kid)
        ONB->>ST: Create partition/shard & RLS bindings
        ONB->>PR: Create schemas (namespaced) & watermarks
        ONB->>IX: Create per-tenant index alias/mappings
    end
    ONB->>ONB: Health checks (readiness of resources)
    ONB-->>GW: 202 Accepted {tenantId, status:"Provisioning", resumeToken}
    GW-->>TA: 202 Accepted {tenantId, status:"Provisioning", resumeToken}
    ONB->>NTF: Emit Tenant.Provisioned
    TA->>GW: POST /tenants/v1/{tenantId}:activate
    GW->>ONB: ActivateTenant
    ONB-->>GW: 200 OK {status:"Active"}
    GW-->>TA: 200 OK {status:"Active"}
    ONB->>NTF: Emit Tenant.Activated + Welcome

Alternative Paths

  • Deferred IdP linkage: create tenant with local admin; link IdP later via /link-idp.
  • Pre-provisioned resources: BYO KMS key or existing index namespace accepted when validated.
  • Staged activation: keep status="Provisioned" until external readiness checks pass.

Error Paths

sequenceDiagram
    participant TA as Tenant Admin
    participant GW as API Gateway
    participant ONB as Onboarding Service
    participant IDP as Identity/IdP

    TA->>GW: POST /tenants/v1 {invalid payload or duplicate slug}
    alt 400 Bad Request (invalid/unsupported fields)
        GW-->>TA: 400 Problem+JSON
    else 409 Conflict (slug/domain already in use)
        GW-->>TA: 409 Problem+JSON
    else 422 Unprocessable (IdP metadata invalid, domain not verified)
        ONB-->>GW: 422 Problem+JSON
        GW-->>TA: 422 Problem+JSON
    else 503 Dependency unavailable (KMS/Search/DB)
        GW-->>TA: 503 Problem+JSON (+Retry-After)
    end

Request/Response Specifications

Input Requirements

| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| Method/Path | POST /tenants/v1 | Y | Create tenant | JSON body |
| Authorization | header | Y | Admin/ops JWT | Role tenants:create |
| idempotency-key | header | O | De-dupe create | ≤128 chars |
| tenantSlug | string | Y | Human slug (acme) | ^[a-z0-9-]{3,40}$, unique |
| displayName | string | Y | Tenant display name | 3–100 chars |
| edition | enum | Y | free \| standard \| enterprise | allowlist |
| region | enum | Y | Residency region | allowlist |
| idpConfig | object | O | SAML/OIDC metadata/urls | schema-validated |
| adminEmails[] | array | Y | Initial admins | valid emails |
| webhooks[] | array | O | Event targets (HMAC) | URL + secret |

Control

  • GET /tenants/v1/{tenantId} → status (Provisioning|Provisioned|Active|Failed), components health
  • POST /tenants/v1/{tenantId}:activate → promote to Active
  • POST /tenants/v1/{tenantId}:link-idp → attach/replace IdP config
  • POST /tenants/v1/{tenantId}:rotate-keys → new KMS key version (dual-read window)
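The status/activate pair above is typically consumed as a poll-until-terminal loop on the client side. A minimal sketch; the `get_status` callable is a stand-in for the authenticated GET /tenants/v1/{tenantId} call and is not part of the API:

```python
import time

TERMINAL = {"Active", "Failed"}

def wait_for_activation(get_status, tenant_id, timeout_s=300, poll_s=5):
    """Poll tenant status until a terminal state is reached.

    `get_status(tenant_id)` returns the lifecycle status string
    (Provisioning|Provisioned|Active|Failed); in production it would
    wrap the HTTP call with auth headers and the resumeToken.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = get_status(tenant_id)
        if status in TERMINAL:
            return status
        time.sleep(poll_s)
    raise TimeoutError(f"tenant {tenant_id} still provisioning after {timeout_s}s")
```

The caller decides whether `Failed` warrants a rerun with the same idempotency-key (see Recovery Procedures).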

Output Specifications

| Field | Type | Description | Notes |
|---|---|---|---|
| tenantId | string (ULID/GUID) | System identifier | Immutable |
| tenantSlug | string | Human label | Unique, mutable with policy |
| status | enum | Lifecycle status | see above |
| kid | string | Current KMS key id | For integrity/signing |
| residency | object | Region/data classes | PII routing policy |
| policyBundle | object | Defaults applied | versioned |
| endpoints | object | Tenant endpoints/aliases | for SDK setup |

Example Payloads

Create Tenant

{
  "tenantSlug": "acme",
  "displayName": "Acme Corp",
  "edition": "enterprise",
  "region": "eu-west",
  "idpConfig": {
    "type": "saml",
    "metadataUrl": "https://idp.acme.com/metadata.xml",
    "domains": ["acme.com"]
  },
  "adminEmails": ["secops@acme.com","platform@acme.com"]
}

Create Response (202)

{
  "tenantId": "01JF6V3A6W1T6E2TB1C2N2YV9Q",
  "tenantSlug": "acme",
  "status": "Provisioning",
  "resumeToken": "onb_7b2d..."
}

Activate

POST /tenants/v1/01JF6V3A6W1T6E2TB1C2N2YV9Q:activate

Error Handling

Error Scenarios

| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Invalid slug, edition, region; missing admins | Fix payload | |
| 401 | Missing/invalid admin JWT | Authenticate | Retry after renewal |
| 403 | Plan/edition not allowed for region | Choose allowed combo | |
| 404 | Unknown tenantId (status/activate/link) | Verify id | |
| 409 | tenantSlug or domain already bound to another tenant | Pick new slug / release domain | |
| 412 | Activation preconditions unmet (resources not healthy) | Wait for ready; fix failing component | Conditional retry |
| 422 | IdP metadata invalid, DNS TXT not verified | Correct & re-submit | |
| 429 | Create rate-limited | Back off | Exponential backoff + jitter |
| 503 | KMS/Storage/Search unavailable | Retry later | Respect Retry-After |
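The 429/503 guidance above (exponential backoff with jitter, honoring Retry-After) reduces to a small delay helper. A sketch; the base and cap defaults are illustrative, not part of the contract:

```python
import random

def retry_delay(attempt, retry_after=None, base=0.5, cap=30.0):
    """Seconds to wait before retrying a 429/503 response.

    A server-provided Retry-After value always wins; otherwise use
    exponential backoff with full jitter: uniform(0, base * 2^attempt),
    capped so late attempts do not wait unboundedly.
    """
    if retry_after is not None:
        return float(retry_after)
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

Full jitter (rather than a fixed exponential delay) spreads retries from many clients so a recovering KMS/Search/DB dependency is not re-saturated.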

Failure Modes

  • Partial provisioning: some resources created; idempotent reruns resume from checkpoints.
  • Cross-tenant leakage risk: misbound index alias or RLS → automated sanity checks block activation.
  • IdP domain hijack: require DNS TXT proof + admin email domain match.

Recovery Procedures

  1. Use status API to inspect failing step; rerun with same idempotency-key.
  2. Roll back or repair mis-provisioned resources (Compensation flow) before activation.
  3. Re-verify domain/IdP, then call :activate.

Performance Characteristics

Latency Expectations

  • POST /tenants/v1: p95 ≤ 300 ms (enqueue & reserve).
  • Provisioning background: typical 30–120 s (parallelized steps).
  • Activation p95 ≤ 200 ms after readiness.

Throughput Limits

  • Controlled by cluster quotas; default ≤ 5 concurrent onboardings per region.

Resource Requirements

  • Onboarding workers sized for parallel KMS/DB/Index operations; cautious with index creation.

Scaling Considerations

  • Shard provisioning queues by region; backpressure from dependent clusters pauses new starts.
  • Pre-create warm pools (schemas/aliases) for popular editions.

Security & Compliance

Authentication

  • Admin/ops endpoints require OIDC JWT; service-to-service with mTLS.

Authorization

  • Roles: tenants:create|read|activate|link-idp|rotate-keys.
  • Least-privilege service identities for each provisioning step.

Data Protection

  • Tenant KMS key per tenant; secrets stored encrypted; residency enforced across storage/search/projections.
  • No PII stored beyond admin contacts; audit all operations.

Compliance

  • Emit Tenant.Provisioned|Activated|Failed|IdpLinked events with actor, reason, evidence.
  • Residency and key policies attached to tenant record for audits.

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| tenant_onboard_started_total | counter | Onboarding requests | Anomaly trend |
| tenant_onboard_completed_total | counter | Successful onboardings | Drop vs start |
| tenant_onboard_duration_seconds | histogram | End-to-end time | p95 > 180s |
| tenant_onboard_step_failures_total{step} | counter | Failures per step | Spike alert |
| tenant_activation_gates_open | gauge | Waiting for readiness | Stuck > 10m |

Logging Requirements

  • Structured logs with tenantId, tenantSlug, step, result, component, retry, traceId. Mask secrets/metadata.

Distributed Tracing

  • Span per step: idp.link, kms.key.create, storage.partition.create, projection.schema.create, index.alias.create, activate. Include tenantSlug, region, edition.

Health Checks

  • Readiness depends on KMS, Storage, DB, Search; onboarding worker queue depth monitored.

Operational Procedures

Deployment

  1. Deploy Onboarding Service with worker pool and step registry.
  2. Configure RBAC, KMS access policies, and cluster credentials per region.
  3. Register default policy bundles and residency maps.

Configuration

  • Env: ONB_MAX_CONCURRENCY, ONB_REGION_ALLOWLIST, ONB_IDP_DOMAIN_TTL, ONB_PROVISION_TIMEOUT_S.
  • Policies: default retention/redaction per edition; index templates per region.

Maintenance

  • Rotate service credentials; rotate default index templates; verify domain verification CA chains.
  • Periodic dry runs in staging.

Troubleshooting

  • 409 slug/domain → list bindings, confirm ownership.
  • 422 IdP → validate metadata XML/JWKS, DNS TXT ownership.
  • Activation stuck → inspect failing component health; run targeted repair.

Testing Scenarios

Happy Path Tests

  • Create → provision all components → activate → welcome events emitted.
  • IdP linked and login works for admin users.

Error Path Tests

  • 400 invalid payload; 409 duplicate slug/domain; 404 unknown tenant.
  • 412 activation blocked until readiness passes; succeeds after fix.
  • 422 invalid IdP metadata rejected with clear Problem+JSON.

Performance Tests

  • Parallel onboardings (N=5) complete within target; no cluster saturation.
  • Index/schema creation time within SLO per region.

Security Tests

  • RLS verified—tenant cannot query others’ data.
  • Residency enforced—data and indexes created only in chosen region.
  • Audit events present for all steps; secrets never logged.

Internal References

External References

  • SAML / OIDC specs (metadata, JWKS)
  • Regional residency regulations (org policy)

Appendices

A. DNS TXT Verification (example)

_acme-verify.atp.example.com  TXT  atp-verify=01JF6V3A6W1T6E2T
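A receiver-side check of this proof can be sketched as follows; it assumes the TXT values were already fetched by a resolver and only validates the `atp-verify=<token>` format shown above:

```python
def txt_record_matches(txt_values, expected_token):
    """Verify the DNS TXT ownership proof for a tenant domain.

    `txt_values` is the list of TXT strings returned for
    _acme-verify.<domain>; fetching them is left to a resolver so the
    check itself stays pure and testable.
    """
    expected = f"atp-verify={expected_token}"
    return any(v.strip() == expected for v in txt_values)
```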

B. Example Events

{
  "type": "Tenant.Provisioned",
  "tenantId": "01JF6V3A6W1T6E2TB1C2N2YV9Q",
  "region": "eu-west",
  "kid": "kms:eu-west:acme:v1",
  "time": "2025-10-27T08:05:21Z"
}
{
  "type": "Tenant.Activated",
  "tenantId": "01JF6V3A6W1T6E2TB1C2N2YV9Q",
  "time": "2025-10-27T08:06:41Z",
  "endpoints": {
    "ingest": "https://eu-west.api.atp/ingest/acme",
    "query": "https://eu-west.api.atp/query/acme"
  }
}
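Webhook targets register a URL plus an HMAC secret (see the input table), so events like these would be delivered with a signature the receiver verifies. A sketch assuming HMAC-SHA256 over the raw body; the exact header name and scheme are not specified in this document:

```python
import hashlib
import hmac
import json

def sign_event(secret: bytes, body: bytes) -> str:
    """Compute the hex HMAC-SHA256 a webhook sender would attach
    (e.g. in a signature header; header name is an assumption)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()

def verify_event(secret: bytes, body: bytes, signature: str) -> bool:
    """Constant-time comparison on the receiver side."""
    return hmac.compare_digest(sign_event(secret, body), signature)
```

Receivers should verify over the raw bytes before any JSON parsing, since re-serialization can change key order and break the signature.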

Schema Evolution Flow

Rolls out safe, additive schema changes across write and read paths. Publishes vNext to the Schema Registry, advertises availability via SDK/Gateway announcements, runs a dual-write / tolerant-read window (projectors, search), and executes a sunset plan for deprecated fields. Enforces a compatibility matrix to prevent breaking consumers.


Overview

Purpose: Introduce new fields/enums without breaking existing producers/consumers; coordinate rollout and rollback with clear guardrails.
Scope: Registry publish & validation → SDK/Gateway announcement → producer feature flag/canary → dual-write (events, projections) → tolerant-read (unknown fields) → metrics/alerts → deprecation & sunset. Excludes large-scale data migrations (covered by backfill runbooks).
Context: Works with Ingestion, Projection, Search, Export, and SDKs. Contracts defined in JSON Schema / Protobuf; REST/gRPC negotiate schema version via headers/metadata.
Key Participants:

  • Schema Author (engineer)
  • Schema Registry (validation, compatibility rules)
  • API Gateway / SDKs (announce, negotiate)
  • Producers (writers; may dual-write)
  • Consumers (readers; tolerant-read)
  • Projection/Search Services (tolerant/readers)
  • Ops/Release (flags, canaries)

Prerequisites

System Requirements

  • Schema Registry online with compatibility checks and artifact signing
  • CI pipeline to lint/validate schemas (JSON Schema/Protobuf)
  • Gateway supports version advertisement headers & graceful negotiation
  • Services compiled with tolerant parsers (ignore unknowns; default enums)
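The tolerant-parser requirement above (ignore unknowns, default missing enums) amounts to projecting a payload onto the known field set. A minimal sketch; field names and defaults are purely illustrative:

```python
def tolerant_read(payload: dict, known_fields: dict) -> dict:
    """Tolerant-read a record: fields the reader does not know are
    dropped, known fields absent from the payload take their default
    (including enum defaults). This lets vN-1 readers consume vN data.
    """
    return {name: payload.get(name, default) for name, default in known_fields.items()}
```

With this rule, a v2 reader handed the v3 record (which adds `Geo`) simply never sees the new field, and a v3 reader handed v2 data fills `Geo` with its default.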

Business Requirements

  • Compatibility matrix approved (e.g., vN write requires readers ≥ vN-1)
  • Rollout plan (tenants/regions/canaries) and rollback criteria defined
  • Deprecation timeline communicated to stakeholders

Performance Requirements

  • Registry publish p95 ≤ 300 ms; lookup cache TTL tuned
  • Dual-write overhead ≤ 10% QPS/egress during window
  • No more than 1 additional index refresh per change in Search

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    actor SA as Schema Author
    participant CI as CI/CD
    participant REG as Schema Registry
    participant GW as API Gateway
    participant PRD as Producer (Service/SDK)
    participant PRJ as Projection Service
    participant IDX as Search Index
    participant CSM as Consumer (Query/Export)

    SA->>CI: Open PR with vNext (add fields/enums)
    CI->>REG: Validate & publish draft vNext (compatibility=FORWARD+BACKWARD)
    REG-->>CI: OK (artifactId, version=v3, signature)
    CI->>GW: Deploy Gateway/SDK announcement (X-Schema-Latest: v3)
    PRD->>PRD: Enable canary flag (tenant subset)
    PRD->>GW: Writes (dual-write: v2 + v3 metadata)
    GW-->>PRD: 202 Accepted (X-Schema-Active: v3)
    PRJ->>PRJ: Read tolerant (unknown fields ignored, defaults applied)
    IDX->>IDX: Mapping updated (add new fields as optional)
    CSM->>GW: Reads (request v2, receives v2) / (request v3, receives v3)
    CI->>REG: Promote v3 to stable, start deprecation clock for v1

Alternative Paths

  • Canary-by-tenant: enable v3 only for tenant in {acme,beta}; expand after burn-in.
  • Header-only announce: Gateway advertises X-Schema-Latest before any producer dual-writes (readers prep first).
  • Soft-fail: Producer emits v3-only but Gateway downgrades to v2 for legacy consumers via transformation map (temporary).

Error Paths

sequenceDiagram
    participant CI as CI/CD
    participant REG as Schema Registry
    participant GW as API Gateway
    participant PRD as Producer

    CI->>REG: Publish vNext (breaking removal/rename)
    alt 400 Bad Request (invalid schema)
        REG-->>CI: 400 Problem+JSON
    else 409 Conflict (compatibility violation)
        REG-->>CI: 409 Problem+JSON (matrix failed)
    end

    PRD->>GW: Write with v3 before announce
    GW-->>PRD: 412 Precondition Failed (X-Required-Schema: v2)

Request/Response Specifications

Input Requirements (Key Endpoints & Headers)

| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| POST /registry/v1/schemas/{name}/versions | http | Y | Publish schema vNext | Signed commit |
| compatibility | enum | Y | BACKWARD, FORWARD, FULL | policy |
| X-Schema-Write-Version | header | O | Producer-declared write version | int ≥ 1 |
| X-Schema-Read-Version | header | O | Consumer requested read version | int ≥ 1 |
| Accept | header | O | application/json;profile="#v3" | negotiated |
| gRPC metadata: schema-version | meta | O | Read/write hint | int |
| idempotency-key | header | O | Dual-write de-dupe | ≤128 chars |

Output Specifications

| Field | Type | Description | Notes |
|---|---|---|---|
| artifactId | string | Registry id of version | immutable |
| version | int | Published version (e.g., 3) | monotonic |
| X-Schema-Latest | header | Latest stable version | set by Gateway |
| X-Schema-Active | header | Version currently served | per route/tenant |
| downgrade | flag | Whether Gateway transformed response | temporary only |

Example Payloads

Publish vNext (JSON Schema)

POST /registry/v1/schemas/auditrecord/versions
{
  "version": 3,
  "compatibility": "FULL",
  "schema": {
    "$id": "urn:atp:auditrecord:v3",
    "type": "object",
    "properties": {
      "Id": {"type":"string"},
      "Actor": {"$ref":"urn:atp:actor:v2"},
      "Decision": {"$ref":"urn:atp:decision:v1"},
      "Geo": {"type":"object","properties":{"Country":{"type":"string"}}} // new additive
    },
    "additionalProperties": false
  }
}

Write (dual-write hint)

POST /audit/v1/records
X-Schema-Write-Version: 3
Idempotency-Key: wr_01JF...
Content-Type: application/json

Read (negotiate v2)

GET /audit/v1/records?sv=2
X-Schema-Read-Version: 2
Accept: application/json; profile="#v2"
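The negotiation implied by these headers is a small lookup at the Gateway: honor `X-Schema-Read-Version` when that version is served, otherwise default to latest. A sketch; surfacing an unsupported version as an error matches the 404 case in this flow's error table:

```python
def negotiate(requested, supported, latest):
    """Resolve the schema version to serve for a read.

    `requested` is the parsed X-Schema-Read-Version (None if absent),
    `supported` the set of versions this route can serve, `latest`
    the advertised X-Schema-Latest.
    """
    if requested is None:
        return latest                       # no preference: serve latest stable
    if requested in supported:
        return requested                    # echoed back as X-Schema-Active
    raise LookupError(f"schema v{requested} not available")  # -> 404
```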

Error Handling

Error Scenarios

| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Invalid schema JSON/Proto; unknown fields without defaults | Fix schema; re-validate | |
| 401 | Unauthorized schema publish | Authenticate | Retry after renewal |
| 403 | Caller lacks schemas:publish or tenant attempting global change | Request access | |
| 404 | Unknown schema name/version; consumer requests non-existent sv | Request supported version; update client | |
| 409 | Compatibility violation vs matrix; mapping collision in search | Adjust change or update matrix; run reindex plan | |
| 412 | Producer writing vNext before Gateway/Registry mark active | Wait for announce; enable flag after | Conditional |
| 422 | Enum narrowing or field type change detected | Redesign as additive; use new field name | |
| 429 | Publish rate-limited | Back off | Jittered backoff |
| 503 | Registry/Gateway dependency unavailable | Retry later | Exponential backoff |

Failure Modes

  • Breaking removal/rename: rejected by Registry; use add + deprecate pattern.
  • Dual-write drift: v2 & v3 diverge → enable consistency checkers and fail fast on mismatch.
  • Search mapping conflicts: new field analyzer mismatches existing index → create new index alias v3 and reindex.
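The dual-write drift check mentioned above can be as simple as comparing every v2 field against its v3 counterpart and failing fast on any divergence (v3-only additive fields are expected and ignored). A sketch:

```python
def dual_write_mismatch(v2_record: dict, v3_record: dict) -> list:
    """Return the field names where the v2 and v3 copies of a
    dual-written record disagree; empty list means consistent.
    Fields only present in v3 (additive) are not mismatches.
    """
    return [k for k, v in v2_record.items() if v3_record.get(k) != v]
```

Feeding the result into `dual_write_mismatch_total` (see Key Metrics) gives the "Any > 0" alert its signal.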

Recovery Procedures

  1. Roll back producer flag to v2-only; keep Registry v3 published but inactive.
  2. If search mapping conflict, cut over to v3 alias after backfill; keep reads tolerant.
  3. Use Compensation Flow to repair projections that missed new fields during early canary.

Performance Characteristics

Latency Expectations

  • Version negotiation adds ≤ 1 ms at Gateway (header processing/cache).
  • Registry lookup cached; cache miss p95 ≤ 50 ms.

Throughput Limits

  • Dual-write increases write amp; restrict to canary tenants initially.
  • Reindex/backfill throttled per-tenant to protect cluster SLOs.

Resource Requirements

  • Registry store for versions & metadata; small footprint per artifact.
  • Backfill/reindex workers sized to edition limits.

Scaling Considerations

  • Per-tenant activation gates; gradual region waves.
  • Keep old readers working via tolerant-read and optional downgrade transforms (temporary only).

Security & Compliance

Authentication

  • OIDC/JWT for publish & toggle APIs; mTLS service-to-service.

Authorization

  • Roles: schemas:publish, schemas:promote, schemas:deprecate, schemas:read.
  • Only release managers can promote to stable or start sunset.

Data Protection

  • Signed artifacts; checksum headers; registry enforces immutability.
  • No PII stored in schema metadata beyond author id.

Compliance

  • Audit events: Schema.Published|Promoted|Activated|Deprecated|SunsetCompleted with actor & diff.
  • Backward/forward compatibility reports attached to change record.

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| schema_publish_total | counter | Versions published | Spike analysis |
| schema_compat_fail_total | counter | Registry rejects | >0 sustained |
| schema_negotiations_total | counter | Gateway negotiations | Trend |
| dual_write_mismatch_total | counter | v2 vs v3 mismatch | Any > 0 |
| reader_unknown_field_rate | counter | Unknowns seen by readers | Spike |
| search_reindex_progress | gauge | Backfill completion | Stalls |

Logging Requirements

  • Structured logs: schema, fromVersion, toVersion, tenant, compatMode, result, traceId.

Distributed Tracing

  • Spans: registry.validate, gateway.negotiate, producer.dualwrite, projection.tolerant-read, search.mapping.update.

Health Checks

  • Registry readiness (DB/object store); Gateway cache health; index template availability.

Operational Procedures

Deployment

  1. Deploy/upgrade Registry with compatibility policies.
  2. Enable Gateway negotiation & headers; roll SDKs with version awareness.
  3. Register CI checks (lint/compat) and block merges on failure.

Configuration

  • Env: SCHEMA_COMPAT_MODE=FULL, SCHEMA_CACHE_TTL=300s, SCHEMA_DOWNGRADE_ENABLED=true (temporary).
  • Flags: feature.auditrecord.v3.enabled, feature.search.mapping.v3.enabled.

Maintenance

  • Periodic cleanup of deprecated versions after sunset window.
  • Rotate registry signing keys; verify artifact signatures in CI.

Troubleshooting

  • 409 compatibility failures → inspect matrix report; adjust plan to additive-only.
  • Reader errors on unknown fields → ensure tolerant-read; verify SDK versions.
  • Search failures → create new alias with updated template; reindex flow.

Testing Scenarios

Happy Path Tests

  • Publish v3 (additive); Gateway advertises; producer dual-writes; readers tolerant; promote to stable.
  • Search mapping updated; index accepts new field; dashboards reflect new attribute.

Error Path Tests

  • 400 invalid schema rejected; 404 unknown version on read; 409 matrix violation blocked.
  • 412 write blocked before announce; passes after activation.

Performance Tests

  • Dual-write adds ≤ 10% overhead; Gateway negotiation ≤ 1 ms p95.
  • Reindex completes within planned window without SLO breach.

Security Tests

  • Only schemas:promote role can activate vNext; artifacts signed/verified.
  • Audit events emitted for publish/promote/deprecate.

Internal References

External References

  • JSON Schema / Protobuf compatibility guides

Appendices

A. Compatibility Matrix (excerpt)

| Change Type | Backward | Forward | Allowed |
|---|---|---|---|
| Add optional field | ✓ | ✓ | Yes |
| Add enum value | ✓ | ✓* | Yes (readers must default) |
| Remove field | ✗ | ✗ | No (use deprecate) |
| Change type (string→int) | ✗ | ✗ | No (new field) |
| Widen type (int32→int64) | ✓ | ✓* | Yes with defaults |

(*) Requires tolerant-read or defaulting behavior.
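The matrix can be encoded as data so CI can gate publishes mechanically. A sketch; the change-type keys are illustrative names, not Registry API values, and "needs-defaults" marks the starred entries:

```python
# Encodes the compatibility matrix above (illustrative keys).
MATRIX = {
    "add_optional_field": {"backward": True,  "forward": True,            "allowed": True},
    "add_enum_value":     {"backward": True,  "forward": "needs-defaults", "allowed": True},
    "remove_field":       {"backward": False, "forward": False,            "allowed": False},
    "change_type":        {"backward": False, "forward": False,            "allowed": False},
    "widen_type":         {"backward": True,  "forward": "needs-defaults", "allowed": True},
}

def is_allowed(change_type: str) -> bool:
    """CI gate: reject any change type the matrix marks disallowed."""
    if change_type not in MATRIX:
        raise ValueError(f"unknown change type: {change_type}")
    return MATRIX[change_type]["allowed"]
```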

B. Problem+JSON (compatibility violation)

{
  "type": "urn:connectsoft:errors/schema/compatibility-violation",
  "title": "Schema change is not backward compatible",
  "status": 409,
  "detail": "Removing field 'Decision' breaks existing consumers.",
  "violations": [
    {"path":"$.Decision", "rule":"field-removal"}
  ]
}

Configuration Update Flow

Safely rolls out configuration changes using validation (dry-run), staged rollout (feature flags/canaries), hot reload in services, and automatic verification / rollback with strict blast-radius controls. Separates config from secrets; every change is audited and idempotent.


Overview

Purpose: Apply config changes without disrupting tenants, maintaining SLOs and isolation.
Scope: Propose → validate (schema & semantic) → stage → canary rollout → service reload → verify (metrics/health) → promote or rollback. Excludes secret rotation (covered elsewhere).
Context: Config is stored in a Config Registry/Repo, announced via Config Service, consumed by Gateway/Ingestion/Projection/Search/Export at runtime with hot reload or restart on failure.
Key Participants:

  • Operator / CI/CD
  • Config Registry/Repo (GitOps or API)
  • Config Service (distribution, versioning, audits)
  • Feature Flag Service (progressive exposure)
  • Target Services (Gateway / Ingestion / …)
  • Observability (metrics/logs/traces)
  • Orchestrator (deploy hooks for restarts if needed)

Prerequisites

System Requirements

  • Config schemas (JSON Schema/Protobuf) with server-side validation and dry-run execution
  • Feature flag platform for canary/percentage/segment rollouts
  • Services implement hot reload endpoint or SIGHUP handler and config guards (shadow config)
  • Config Service supports versioning, idempotency, and RBAC

Business Requirements

  • Change approval workflow (CAB) with blast-radius assessment
  • Runbooks & rollback plans linked to config keys
  • Tenant/edition-aware defaults to prevent cross-tenant leakage

Performance Requirements

  • Validation p95 ≤ 200 ms; distribution to all pods p95 ≤ 60 s
  • Hot reload p95 ≤ 250 ms per service; zero-downtime guarantee
  • Verification window (post-change) default 5–15 min with auto-rollback gates

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    actor OP as Operator/CI
    participant REG as Config Registry/Repo
    participant CFG as Config Service
    participant FF as Feature Flag Service
    participant SVC as Target Services
    participant OBS as Observability

    OP->>REG: Propose config vNext (PR/ChangeSet)
    REG->>CFG: Validate (schema + semantic dry-run)
    CFG-->>REG: OK (change-id, version=v17)
    OP->>FF: Stage flag "cfg.v17.enabled=false" (guard)
    OP->>CFG: Apply v17 (scope: canary tenants/perc=5%)
    CFG->>SVC: Distribute v17 (signed, If-None-Match)
    SVC->>SVC: Hot reload, shadow compare, begin verification
    SVC-->>OBS: Emit KPIs (errors/latency/health)
    OBS-->>CFG: Verification passed (within SLO)
    OP->>FF: Ramp to 50% → 100%
    CFG->>SVC: Finalize v17 (active for all)
    CFG-->>REG: Promote v17 to Active, close change

Alternative Paths

  • Flag-only change: no new config payload; toggle flag segments to roll out behavior changes.
  • Tenant-staged rollout: enable by region/tenant/edition gates before global activation.
  • Restart-required: services lacking hot reload receive orchestrated rolling restart with readiness guards.
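Percentage-based staging needs a deterministic tenant→bucket mapping so a tenant does not flap in and out of the canary between evaluations. A common sketch; salting per change is an assumption, not something this document mandates:

```python
import hashlib

def in_rollout(tenant_id: str, percent: int, salt: str = "cfg.v17") -> bool:
    """Stable canary membership: hash (salt, tenant) into one of 100
    buckets; tenants in buckets below `percent` get the new config.
    The per-change salt keeps bucket assignment independent across
    changes, so the same tenants are not always the canaries.
    """
    digest = hashlib.sha256(f"{salt}:{tenant_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent
```

Ramping 5% → 50% → 100% then only raises the threshold; tenants already enrolled stay enrolled.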

Error Paths

sequenceDiagram
    participant OP as Operator
    participant REG as Config Registry
    participant CFG as Config Service
    participant SVC as Target Services

    OP->>REG: Submit invalid config (schema fail)
    REG-->>OP: 400 Bad Request (Problem+JSON)

    OP->>CFG: Apply v17 (unknown key/scope)
    CFG-->>OP: 404 Not Found (key/scope)

    OP->>CFG: Apply while v16 rollout in-progress
    CFG-->>OP: 409 Conflict (change in progress)

    CFG->>SVC: Distribute v17
    SVC-->>CFG: 503 Service Unavailable (reload guard failed)
    CFG->>CFG: Auto-rollback to v16, raise alert

Request/Response Specifications

Input Requirements (APIs)

| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| POST /ops/v1/config/validate | http | Y | Dry-run schema & semantic checks | JSON body |
| POST /ops/v1/config/apply | http | Y | Apply version with scope/strategy | RBAC + idempotent |
| changeId | string | Y | Unique change identifier | ULID/UUID |
| version | int | Y | Candidate version | monotonic |
| scope | object | O | {tenants, regions, editions, percent} | allowlists |
| strategy | object | O | {mode: canary \| all, ramp: [5,50,100], verifyMins: 10} | sane ranges |
| preconditions.etag | string | O | CAS guard | matches head |
| reason | string | Y | Change reason | 1–256 chars |

Output Specifications

| Field | Type | Description | Notes |
|---|---|---|---|
| status | enum | Validated \| Applying \| Partial \| RolledBack \| Active \| Failed | lifecycle |
| activeVersion | int | Current active config version | |
| appliedTo | object | Effective scope (tenants/percent) | resolved |
| verification | object | KPIs & window state | pass/fail |
| rollbackToken | string | Token to execute rollback | TTL-bound |

Example Payloads

Validate

POST /ops/v1/config/validate
{
  "changeId": "chg_01JF8C6Q...",
  "version": 17,
  "payload": { "Ingestion": { "MaxBatchBytes": 1048576 } }
}

Apply (canary 5%)

POST /ops/v1/config/apply
{
  "changeId": "chg_01JF8C6Q...",
  "version": 17,
  "scope": { "percent": 5, "regions": ["eu-west"] },
  "strategy": { "mode": "canary", "ramp": [5,50,100], "verifyMins": 10 },
  "preconditions": { "etag": "v16-etag" },
  "reason": "Lower ingest batch size to reduce p99"
}
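The `preconditions.etag` field in this payload is a compare-and-set guard: the apply succeeds only while the caller's ETag still matches head, otherwise it maps to 409. A sketch of the server-side check, with state kept in memory for illustration:

```python
class ConflictError(Exception):
    """Maps to 409 Conflict: the ETag precondition no longer matches head."""

class ConfigStore:
    """Minimal compare-and-set semantics for config apply (in-memory sketch)."""

    def __init__(self, version=16, etag="v16-etag"):
        self.version, self.etag = version, etag

    def apply(self, new_version, expected_etag):
        if expected_etag != self.etag:
            raise ConflictError(f"etag mismatch, head is {self.etag}")
        self.version = new_version
        self.etag = f"v{new_version}-etag"   # new head for the next CAS
        return self.etag
```

A caller that loses the race simply refetches the latest ETag and retries (the table below lists this as a conditional retry).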

Service Hot Reload Contract

POST /config/reload
If-None-Match: v17

Error Handling

Error Scenarios

| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Schema/semantic validation failed | Fix payload; re-validate | |
| 401 | Missing/invalid token | Authenticate | Retry after renewal |
| 403 | Caller lacks config:apply | Request access | |
| 404 | Unknown config key/version/scope | Verify ids; fetch latest | |
| 409 | Concurrent change in progress; ETag mismatch | Wait; retry with latest ETag | Conditional retry |
| 412 | Preconditions failed (guardrails) | Adjust scope/strategy | |
| 422 | Semantic violation (unsafe value range) | Choose safe value | |
| 429 | Apply rate-limited | Back off | Exponential + jitter |
| 503 | Target service not ready/reload failure | Auto-rollback; investigate | Retry after health OK |

Failure Modes

  • Blast radius: global apply without canary → guarded by policy (requires staged rollout).
  • Config drift: some pods on v16, others v17 → Config Service reconciles until convergence.
  • Hot reload hazards: partial initialization using new values → shadow config & atomic swap.
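The shadow-config/atomic-swap mitigation above can be sketched as: build the candidate fully, validate it (the reload guard), and only then swap the reference readers see, so no reader ever observes a partially initialized config:

```python
import threading

class HotConfig:
    """Hot-reloadable config with shadow build and atomic swap."""

    def __init__(self, initial: dict):
        self._lock = threading.Lock()
        self._current = dict(initial)

    def snapshot(self) -> dict:
        """Readers always get a complete config (treat as immutable)."""
        with self._lock:
            return self._current

    def reload(self, candidate: dict, validate) -> bool:
        shadow = dict(candidate)       # build fully aside, never mutate in place
        if not validate(shadow):       # reload guard: reject unsafe config
            return False               # -> service reports 503, auto-rollback
        with self._lock:
            self._current = shadow     # atomic reference swap
        return True
```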

Recovery Procedures

  1. Trigger auto-rollback via policy gate failure; restore activeVersion to previous.
  2. Freeze changes (global mute) and open incident; evaluate metrics & logs.
  3. Re-run apply with reduced scope or adjusted values after fix.

Performance Characteristics

Latency Expectations

  • Validation p95 ≤ 200 ms; distribution to pods ≤ 60 s; hot reload ≤ 250 ms.

Throughput Limits

  • Max N parallel applies per region (e.g., 1); queue subsequent changes.

Resource Requirements

  • Config Service cache/ETag store; signed bundles; modest CPU for validation.

Scaling Considerations

  • Shard config topics per service/region; CDN or sidecar cache for large payloads.
  • Prefer delta distribution over full bundle for frequent small tweaks.

Security & Compliance

Authentication

  • OIDC JWT for ops APIs; mTLS service-to-service.

Authorization

  • Roles: config:validate, config:apply, config:rollback, config:read.
  • Tenant/edition scoping enforced at apply time.

Data Protection

  • No secrets in config; secrets managed via dedicated Secrets Service/KMS.
  • Signed config bundles (checksum, signature) verified by services.

Compliance

  • Audit events: Config.Validated|Applied|Promoted|RolledBack with actor, diff, scope, reason.
  • Change records linked to incident/ticket ids.

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| config_apply_total | counter | Applies by result | Spike in failures |
| config_active_version | gauge | Current active version | Unexpected regress |
| config_rollbacks_total | counter | Auto/manual rollbacks | >0 sustained |
| config_distribution_lag_seconds | histogram | Registry→pod lag | p95 > 60s |
| service_config_reload_failures_total | counter | Reload errors | Any > 0 |

Logging Requirements

  • Structured logs: changeId, version, service, scope, strategy, result, traceId, rollbackToken.

Distributed Tracing

  • Spans: config.validate, config.apply, service.reload, verify.window. Include changeId & version.

Health Checks

  • Readiness includes config freshness (expected vs actual version).
  • Synthetic probes after apply to confirm behavior.

Operational Procedures

Deployment

  1. Deploy Config Service (HA) with schema validators & signing keys.
  2. Enable hot reload endpoints in services; wire feature flag SDK.
  3. Configure GitOps or Ops API pipeline with approval gates.

Configuration

  • Env: CFG_APPLY_CONCURRENCY=1, CFG_VERIFY_WINDOW=10m, CFG_MAX_SCOPE_PERCENT=10, CFG_REQUIRE_FLAG_GUARD=true.
  • Policies: mandatory canary for high-risk keys; deny global applies during peak.

Maintenance

  • Rotate signing keys; prune deprecated keys; rehearse rollback drills quarterly.

Troubleshooting

  • Apply stuck → check distribution lag metrics & queue; verify RBAC/ETag.
  • Errors spike post-apply → auto-rollback should trigger; confirm guardrail worked.
  • Only subset updated → reconcile loop; investigate failing pods’ reload logs.

Testing Scenarios

Happy Path Tests

  • Validate → apply to 5% → verify → ramp to 100% with no SLO breach.
  • Hot reload succeeds across services; config version converges.

Error Path Tests

  • 400 invalid payload rejected; 404 unknown key; 409 concurrent apply guarded.
  • 503 reload failure triggers automatic rollback.

Performance Tests

  • Distribution completes ≤ 60 s across 200 pods; reload p95 ≤ 250 ms.
  • Multiple small deltas do not exceed CPU/network budgets.

Security Tests

  • Only config:apply role can promote; signatures verified; audits present.
  • No secrets present in config payloads.

Internal References

External References

  • Progressive Delivery / Feature Flags best practices

Appendices

A. Canary Strategy (YAML)

strategy:
  mode: canary
  ramp: [5, 25, 50, 100]
  verify:
    window: 10m
    guards:
      - metric: atp_ingest_errors_ratio
        threshold: "< 0.5%"
      - metric: atp_projection_lag_seconds
        threshold: "< 60"
      - metric: health_readiness_status
        threshold: "== 1"
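The guards above can be evaluated mechanically against a metrics snapshot. This sketch supports only the `<` and `==` forms shown, treating a `%` suffix as a ratio:

```python
def guard_passes(value: float, threshold: str) -> bool:
    """Evaluate one verification guard, e.g. guard_passes(0.002, "< 0.5%")."""
    op, _, raw = threshold.partition(" ")
    target = float(raw.rstrip("%"))
    if raw.endswith("%"):
        target /= 100.0                # "0.5%" compares against a 0.005 ratio
    if op == "<":
        return value < target
    if op == "==":
        return value == target
    raise ValueError(f"unsupported operator: {op}")

def verify(metrics: dict, guards: list) -> bool:
    """All guards must hold for the verification window to pass."""
    return all(guard_passes(metrics[g["metric"]], g["threshold"]) for g in guards)
```

A `False` result at the end of the verify window is what triggers the auto-rollback described in Recovery Procedures.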

B. Problem+JSON Examples

{
  "type": "urn:connectsoft:errors/config/invalid",
  "title": "Invalid configuration payload",
  "status": 400,
  "detail": "Ingestion.MaxBatchBytes exceeds allowed maximum."
}
{
  "type": "urn:connectsoft:errors/config/conflict",
  "title": "Change conflict",
  "status": 409,
  "detail": "Another change chg_01JF8B... is applying.",
  "currentChangeId": "chg_01JF8B..."
}

Backup & Recovery Flow

Implements durable backups (snapshots/exports) with integrity verification and WORM-secure storage, plus periodic recovery drills that prove RPO/RTO objectives are met. Covers append store, projections, and search indexes with consistent cutover points and tenant-aware restores. Evidence of successful restore is captured and audited.


Overview

Purpose: Guarantee recoverability of tenant data with defined RPO/RTO and cryptographic proof of integrity.
Scope: Scheduled/on-demand backups → snapshot/export → sign/verify → store in immutable object storage → catalog → recovery drills (sandbox restore + validation) → reporting. Excludes hot replicas (covered by HA).
Context: Orchestrated by Backup Service. Sources: Storage (Append/WORM), Projection DB, Search Index. Targets: Object Store (WORM/Object Lock) with tenant/region prefixes and KMS encryption.
Key Participants:

  • Backup Scheduler/Service (orchestrator)
  • Storage (Append Store) / Projection DB / Search Index
  • Integrity Service (hash/Merkle proofs)
  • Object Store (WORM) with KMS
  • Drill Runner (restore validator)
  • Ops / Compliance (approvals, reports)

Prerequisites

System Requirements

  • Snapshot/backup endpoints enabled for all data planes (append/projection/index)
  • Object store with WORM/Object Lock & lifecycle policies; mTLS + signed URLs
  • Integrity Service available for proof computation/verification
  • Catalog/Manifest registry with index of recovery points

Business Requirements

  • Tenant residency & encryption policies mapped to backup targets
  • Defined RPO (e.g., ≤ 15 min) and RTO (e.g., ≤ 60 min) per edition
  • Drill cadence (e.g., monthly per region; quarterly per tenant sample) and evidence requirements

Performance Requirements

  • Backup windows avoid peak hours; bandwidth caps per region/tenant
  • Incremental backups preferred; fulls on weekly cadence
  • Verification completes within a bounded share of the backup duration (target ≤ 30% of backup wall time)

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    participant SCH as Scheduler
    participant BAK as Backup Service
    participant ST as Storage (Append)
    participant PR as Projection DB
    participant IX as Search Index
    participant INT as Integrity Service
    participant OBJ as Object Store (WORM)
    participant CAT as Catalog/Manifest
    participant DR as Drill Runner

    SCH->>BAK: Trigger Backup (policy, scope, type=incremental)
    BAK->>ST: Consistent snapshot/export (cutover @ T)
    BAK->>PR: Projection dump @ watermark<=T
    BAK->>IX: Index snapshot (optional or template)
    BAK->>INT: Compute hashes/Merkle root + sign (kid)
    INT-->>BAK: Proof bundle {root, signature, kid}
    BAK->>OBJ: Upload packages (JSONL/Parquet/SQL), proofs, manifest (WORM)
    BAK->>CAT: Register Recovery Point (RP-2025-10-27T08:00Z)
    BAK-->>SCH: Success {recoveryPointId, sizes, proof}
    SCH->>DR: Schedule recovery drill (sandbox)
    DR->>OBJ: Fetch packages + manifest
    DR->>INT: Verify proofs/signatures
    DR->>ST: Restore append, reproject read models
    DR-->>SCH: Drill report (RPO/RTO met, sample checks OK)

Alternative Paths

  • On-demand tenant backup: operator requests scoped backup for a single tenant; catalog marks it tenant-scoped.
  • Warm-standby region: ship encrypted copies to secondary region with residency-allowed classes only.
  • Indexless restore: restore append store and rebuild projections/search from facts to reduce backup volume.

Error Paths

sequenceDiagram
    participant BAK as Backup Service
    participant OBJ as Object Store
    participant INT as Integrity
    participant CAT as Catalog

    BAK->>OBJ: PUT package (network issue)
    alt 503 Storage unavailable
        BAK-->>BAK: Retry with backoff, pause schedule if persistent
    else 409 Conflict (WORM retention/exists)
        BAK-->>BAK: Switch to new key (timestamped), update manifest
    end

    BAK->>INT: Compute proof
    alt Proof mismatch
        INT-->>BAK: 422 Unprocessable (hash mismatch)
        BAK-->>CAT: Mark recovery point FAILED, alert
    end
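The "retry with backoff, pause schedule if persistent" branch above can be sketched as capped exponential backoff with full jitter; the `upload` callable and the retry limits here are illustrative assumptions:

```python
import random
import time

def put_with_backoff(upload, max_attempts=5, base_delay=0.5, cap=30.0,
                     sleep=time.sleep):
    """Retry a transiently failing upload with capped exponential backoff.

    `upload` is assumed to raise ConnectionError on transient failures
    (e.g., a 503 from the object store); other exceptions propagate.
    On exhaustion the last error is re-raised so the caller can pause
    the schedule and alert.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return upload()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # persistent failure: pause schedule, alert
            # Full jitter: sleep uniformly in [0, min(cap, base * 2^attempt)].
            sleep(random.uniform(0.0, min(cap, base_delay * 2 ** attempt)))
```

Injecting `sleep` keeps the helper testable and lets a scheduler substitute its own delay mechanism.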

Request/Response Specifications

Input Requirements (APIs)

| Field | Type | Req | Description | Validation |
| --- | --- | --- | --- | --- |
| `POST /ops/v1/backups` | http | Y | Start backup | RBAC `backup:start` |
| scope | object | Y | `{tenants:[], regions:[], dataClasses:[]}` | allowlists/residency |
| type | enum | Y | `full` \| `incremental` | — |
| cutover | RFC3339 | O | Desired snapshot time | ≤ now |
| retentionDays | int | O | Override default retention | ≤ policy max |
| `POST /ops/v1/restores` | http | Y | Start restore/drill | RBAC `backup:restore` |
| recoveryPointId | string | Y | Catalog id | exists |
| mode | enum | Y | `sandbox` \| `production` | — |
| target | object | O | `{tenantId?, region}` | valid & empty slot |
| verifyPolicy | object | O | sampling, row-counts, checksums | schema |

Output Specifications

| Field | Type | Description | Notes |
| --- | --- | --- | --- |
| recoveryPointId | string | Unique id for backup | sortable by time |
| manifestUrl | url | Signed URL to manifest | time-limited |
| proof | object | `{merkleRoot, signature, kid}` | integrity |
| sizes | object | bytes per package | budgeting |
| restoreJobId | string | Track restore/drill | status API |

Example Payloads

Start Backup

POST /ops/v1/backups
{
  "scope": { "regions": ["eu-west"], "tenants": ["acme"] },
  "type": "incremental",
  "retentionDays": 30
}

Catalog Manifest (excerpt)

{
  "recoveryPointId": "RP-2025-10-27T08:00:00Z-eu-west-acme",
  "time": "2025-10-27T08:00:00Z",
  "packages": [
    {"name":"append-0001.jsonl","sha256":"...","bytes": 73482910},
    {"name":"projection.sql","sha256":"...","bytes": 2183412}
  ],
  "merkleRoot": "b3f3…",
  "signature": "MEUCIQ…",
  "kid": "kms:eu-west:tenant/acme:v3",
  "watermark": "2025-10-27T07:59:58Z"
}
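Before trusting a recovery point, a drill runner can recompute the manifest's `merkleRoot` from the per-package `sha256` digests. A minimal sketch; the pairing rules (raw-digest concatenation, odd node carried up unchanged) are assumptions here, since ATP's actual tree layout is defined in the Integrity spec:

```python
import hashlib

def merkle_root(leaf_hashes: list[str]) -> str:
    """Fold hex-encoded leaf digests into a single hex root.

    Each pair is hashed as sha256(left || right) over raw digest bytes;
    an odd trailing node is promoted unchanged to the next level.
    """
    if not leaf_hashes:
        raise ValueError("empty package list")
    level = [bytes.fromhex(h) for h in leaf_hashes]
    while len(level) > 1:
        nxt = [hashlib.sha256(level[i] + level[i + 1]).digest()
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd node carried up
        level = nxt
    return level[0].hex()

def verify_manifest(manifest: dict) -> bool:
    """Compare a recomputed root against the manifest's claimed merkleRoot."""
    leaves = [p["sha256"] for p in manifest["packages"]]
    return merkle_root(leaves) == manifest["merkleRoot"]
```

Signature verification over the root (with the manifest's `kid`) would follow via KMS and is omitted here.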

Start Restore (Sandbox)

POST /ops/v1/restores
{
  "recoveryPointId": "RP-2025-10-27T08:00:00Z-eu-west-acme",
  "mode": "sandbox",
  "target": { "region": "eu-west" },
  "verifyPolicy": { "rowCounts": true, "samplePercent": 5, "proofs": true }
}

Error Handling

Error Scenarios

| HTTP Code | Scenario | Recovery Action | Retry Strategy |
| --- | --- | --- | --- |
| 400 | Invalid scope/type/cutover; residency mismatch | Correct payload/policy | — |
| 401 | Missing/invalid token | Authenticate | Retry after renewal |
| 403 | Caller lacks `backup:*` or `restore:*` | Request access | — |
| 404 | Unknown recoveryPointId or package missing | Choose valid point; investigate catalog | — |
| 409 | Restore already in progress for target / resource lock | Wait or choose new target | Conditional retry |
| 412 | Preconditions not met (sandbox not empty; legal hold prevents overwrite) | Satisfy preconditions / choose sandbox | — |
| 423 | Target locked (admin lock/maintenance) | Release lock | Retry |
| 429 | Region throughput/backups rate-limited | Back off | Exponential + jitter |
| 503 | Object store/Integrity service unavailable | Retry later | Bounded retries with backoff |

Failure Modes

  • Inconsistent cutover: sources not frozen → use watermark T and quiesce writes for snapshot window.
  • WORM conflict: attempting overwrite before retention expires → versioned keys; never mutate existing.
  • Silent corruption: block-level issues → end-to-end checksums + Merkle proofs required; drill detects.

Recovery Procedures

  1. Re-run backup with quiesce (short write pause or log-based incremental with LSN).
  2. For failed proof, invalidate recovery point and alert; run full backup next window.
  3. During restore, rebuild projections and search from append facts if projection package absent or stale.

Performance Characteristics

Latency Expectations

  • Catalog publish p95 ≤ 1 s; proof computation bounded by package size (parallelizable).
  • Drill restore: RTO target (e.g., ≤ 60 min for medium tenants) including re-projection.

Throughput Limits

  • Per-region bandwidth caps (e.g., ≤ 200 MB/s aggregate); per-tenant rate caps to avoid noisy neighbors.

Resource Requirements

  • Temporary staging disk for package creation; CPU for hashing; memory for buffering; KMS for signing.

Scaling Considerations

  • Incremental forever + periodic synth full to limit restore chains.
  • Shard backups by tenant/shard and time slots to flatten I/O.

Security & Compliance

Authentication

  • Ops endpoints via OIDC; service-to-object store via mTLS and scoped IAM roles.

Authorization

  • Roles: backup:start|read|restore|drill|approve. Production restore requires two-person approval.

Data Protection

  • KMS encryption at rest; WORM/Object Lock with retention & legal hold support; signed manifests/proofs.
  • Residency: copy only to allowed regions per data class; PII masking is not required because data is encrypted at rest, but residency policy must still be observed.

Compliance

  • Evidence pack: drill reports, manifest, proof verification, timing → archived for audits.
  • Legal holds honored—restore does not violate purge blocks; backups include hold metadata.

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
| --- | --- | --- | --- |
| `backup_runs_total{result}` | counter | Backups by result | Failures > baseline |
| `backup_bytes_total` | counter | Total bytes uploaded | Sudden drop/spike |
| `backup_duration_seconds` | histogram | Backup wall time | p95 > SLO |
| `restore_duration_seconds` | histogram | Drill/restore time | p95 > RTO |
| `backup_proof_failures_total` | counter | Integrity verification failures | Any > 0 |
| `rpo_effective_seconds` | gauge | Now − last successful cutover | > target |
| `rto_drill_pass_rate` | gauge | % drills meeting RTO | < target |

Logging Requirements

  • Structured logs: recoveryPointId, tenant, region, sizes, hash, kid, result, traceId, rpo, rto.

Distributed Tracing

  • Spans for snapshot, package.upload, proof.compute, proof.verify, restore.apply, reprojection.run.

Health Checks

  • Readiness of object store, KMS, Integrity; catalog consistency checks (manifest ↔ objects).

Operational Procedures

Deployment

  1. Deploy Backup Service (HA) with schedulers and workers per region.
  2. Configure object store buckets with Object Lock (compliance mode) and lifecycle.
  3. Register policies (cadence, scope, RPO/RTO) per edition.

Configuration

  • Env: BACKUP_WINDOW=02:00-05:00, BACKUP_MAX_BW_MBPS, BACKUP_TYPE=incremental, BACKUP_VERIFY=true.
  • Policies: weekly full, daily incremental; monthly drill per region.

Maintenance

  • Rotate KMS keys; test restore runbooks quarterly; refresh lifecycle policies and retention.

Troubleshooting

  • Missing package → verify catalog vs. object listing; re-upload if upload was interrupted.
  • Proof mismatch → recalc locally; if persistent, invalidate RP and run full backup.
  • RTO miss → profile slow steps (download bandwidth, reprojection speed) and optimize.

Testing Scenarios

Happy Path Tests

  • Scheduled incremental backup creates catalog entry with valid proofs.
  • Monthly drill restores to sandbox, reprojects, and meets RTO.

Error Path Tests

  • 400 invalid scope rejected; 404 unknown recoveryPointId; 409 concurrent restore blocked.
  • 503 object store outage triggers retries and eventual success/fail with alert.

Performance Tests

  • Backup completes within window; verify overhead does not breach SLOs.
  • Drill on medium tenant completes within RTO under load.

Security Tests

  • WORM enforced—no overwrite/delete within retention; manifests signed & verified.
  • Access controls prevent cross-tenant reads of backup artifacts.

Internal References

External References

  • Object Lock/WORM (vendor docs)
  • NIST SP 800-34 (Contingency Planning)

Appendices

A. Example Object Store Bucket Policy (WORM)

{
  "ObjectLockEnabled": "Enabled",
  "Rules": [{
    "DefaultRetention": { "Mode": "COMPLIANCE", "Days": 30 }
  }]
}

B. Recovery Drill Checklist

  1. Select latest valid recoveryPointId for target region/tenant.
  2. Provision isolated sandbox (no outbound webhooks).
  3. Restore append → reproject → (optional) reindex.
  4. Verify counts (rows/events) & sample diffs; verify proofs.
  5. Capture RTO and evidence; archive report; clean up sandbox.

C. Problem+JSON (example)

{
  "type": "urn:connectsoft:errors/backup/recovery-point-not-found",
  "title": "Recovery point not found",
  "status": 404,
  "detail": "RP-2025-10-27T08:00:00Z-eu-west-acme does not exist or is invalid."
}

Load Balancing Flow

Distributes incoming traffic fairly across healthy service instances using L7/L4 load balancing, with optional affinity (cookie/hash) for sticky paths and standard stateless routing for idempotent calls. Includes multi-region routing (geo/DNS/anycast) with residency and failover policies. Integrates with health checks, rate limiting, and circuit breakers.


Overview

Purpose: Balance requests to healthy backends, maximize utilization, and minimize latency while enforcing tenant isolation and residency.
Scope: Edge routing (DNS/anycast) → Regional LB/Ingress (L7) → per-service pools with health/affinity → response path and headers. Excludes per-tenant throttling logic (covered by Gateway rate limiting).
Context: Client enters via Global LB (GSLB/Anycast), then Regional L7 LB/Ingress/Gateway (Envoy/Nginx/API GW) that selects a backend (Ingestion/Query/Export).
Key Participants:

  • Client
  • Global Traffic Manager (GTM) (GeoDNS/Anycast)
  • Regional L7 LB / API Gateway
  • Target Service Pool (Ingestion / Query / Export)
  • Health Check / Discovery
  • Observability (metrics/logs/traces)

Prerequisites

System Requirements

  • Edge TLS termination with modern ciphers; optional end-to-end mTLS to services
  • Active+passive health checks (HTTP/gRPC/TCP) with outlier detection
  • Service discovery (EDS/SD) with instance metadata: {region, shard, edition}
  • Circuit Breaker and connection pools configured per service/route

Business Requirements

  • Residency policy maps tenants → allowed regions
  • Edition/plan may influence weights (e.g., enterprise canary lanes)
  • Documented sticky vs stateless routes (e.g., Query=stateless, Export job UI=sticky)

Performance Requirements

  • End-to-end added LB latency p95 ≤ 5 ms (regional), ≤ 20 ms (global routing)
  • Per-service concurrency/connection limits defined; surge queue bounded
  • Balancing algorithm chosen per route: least-request, weighted RR, ring-hash (affinity)

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    participant C as Client
    participant GTM as Global Traffic Manager (GeoDNS/Anycast)
    participant L7 as Regional L7 LB / API Gateway
    participant S as Service Pool (e.g., Ingestion)
    participant HC as Health/Discovery

    C->>GTM: Resolve api.atp.example (Geo/latency policy)
    GTM-->>C: Regional VIP (eu-west)
    C->>L7: HTTPS request (Host: api.atp.example)
    L7->>HC: Get healthy endpoints & weights
    L7->>S: Route to least-loaded healthy instance (affinity if provided)
    S-->>L7: 200 OK (payload)
    L7-->>C: 200 OK + headers (X-Region, X-Backend-Id, Server-Timing)

Alternative Paths

  • Sticky (affinity) routing: LB sets atp_affinity cookie or uses ring-hash on X-Sticky-Key/tenantId for session locality.
  • Multi-region: GTM favors closest allowed region; on regional brownout, fail over to next policy region.
  • Canary/weighted: subset traffic (5%) routed to canary pool via header or flag for progressive delivery.

Error Paths

sequenceDiagram
    participant C as Client
    participant L7 as Regional L7 LB
    participant S as Service Pool
    participant HC as Health/Discovery

    C->>L7: Request /ingest
    L7->>HC: Endpoints?
    alt No healthy backends
        L7-->>C: 503 Service Unavailable (Retry-After)
    else Backend times out
        L7->>S: Forward
        S-->>L7: (timeout)
        L7-->>C: 504 Gateway Timeout
    end

Request/Response Specifications

Input Requirements

| Field | Type | Req | Description | Validation |
| --- | --- | --- | --- | --- |
| Host / SNI | header | Y | Virtual host routing | Matches configured domain |
| Authorization | header | O | Propagated to Gateway | If present, well-formed |
| traceparent | header | O | Trace propagation | W3C format |
| X-Tenant-Id | header | O | Residency/affinity hint | ULID/UUID |
| X-Region-Hint | header | O | Client preferred region | Allowlist |
| X-Sticky-Key | header | O | Consistent hashing key | ≤ 128 chars |
| Cookie: atp_affinity | cookie | O | LB-issued sticky cookie | Signed |
| Accept / Content-Type | header | O | Protocol negotiation | Valid MIME |
| Idempotency-Key | header | O | For retries across LB | ≤ 128 chars |

Output Specifications

| Field | Type | Description | Notes |
| --- | --- | --- | --- |
| X-Region | header | Region that served the request | e.g., eu-west |
| X-Backend-Id | header | Instance/pod identifier | For debugging |
| X-Served-By | header | LB node identifier | Optional |
| Server-Timing | header | `lb;dur=...` | Perf insights |
| Retry-After | header | Sent on 429/503 | Seconds or HTTP date |

Example Payloads

GET /query/v1/records?tenant=acme HTTP/1.1
Host: api.atp.example
X-Tenant-Id: 01HZXM0...
X-Region-Hint: eu-west
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

HTTP/1.1 200 OK
X-Region: eu-west
X-Backend-Id: proj-7f9c6bd9d8-2m4sx
Server-Timing: lb;dur=3, gw;dur=6
Content-Type: application/json

Error Handling

Error Scenarios

| HTTP Code | Scenario | Recovery Action | Retry Strategy |
| --- | --- | --- | --- |
| 400 | Invalid host/SNI, malformed headers (X-Region-Hint) | Correct request | — |
| 401 | Auth failure (if L7 does authN) | Re-authenticate | Retry after renewal |
| 403 | Region not allowed by residency | Remove hint / use allowed region | — |
| 404 | Route/service not found | Verify path/host | — |
| 409 | Sticky key conflicts with pool policy | Clear cookie/change key | — |
| 429 | LB/Gateway rate limit | Back off | Exponential + jitter |
| 502 | Bad gateway (abrupt upstream close) | Investigate upstream | Retry idempotent |
| 503 | No healthy backends / brownout | Failover or wait | Respect Retry-After |
| 504 | Upstream timeout | Tune timeouts or retry | Idempotent only |

Failure Modes

  • Hot spotting: poor hash key → use ring-hash on tenantId and minimum healthy hosts.
  • Sticky drift: deleted pod but cookie persists → cookie TTL/clearing and outlier ejection.
  • Cross-region leakage: missing residency guard → enforce allowlist at GTM and L7.

Recovery Procedures

  1. Drain failing instances (connection draining) and eject outliers.
  2. Flip traffic weights away from impaired pool; enable canary disable flag.
  3. Trigger regional failover at GTM if health below threshold.

Performance Characteristics

Latency Expectations

  • Added L7 overhead p95 ≤ 5 ms; GTM selection ≤ 20 ms additional.

Throughput Limits

  • Tune per-service max connections/requests; queue length capped (e.g., 100) to prevent head-of-line blocking.

Resource Requirements

  • LB nodes sized for TLS termination (ECDSA), HTTP/2, and gRPC fan-in/out; enable connection reuse.

Scaling Considerations

  • Scale LB nodes horizontally; shard by region; enable autoscaling based on RPS and CPU.
  • Prefer least-request for spiky traffic; ring-hash for affinity; weighted RR for canaries.
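Least-request selection is commonly approximated with the power-of-two-choices heuristic: sample two healthy hosts and keep the one with fewer in-flight requests. A sketch under that assumption (the host record shape is illustrative):

```python
import random

def pick_least_request(hosts, rng=random):
    """Power-of-two-choices: sample two healthy hosts, keep the less loaded.

    Each host is modelled as {"id": ..., "healthy": bool, "inflight": int}.
    Raises when no healthy backend exists (the LB would answer 503).
    """
    healthy = [h for h in hosts if h["healthy"]]
    if not healthy:
        raise RuntimeError("no healthy backends")
    if len(healthy) == 1:
        return healthy[0]
    a, b = rng.sample(healthy, 2)
    return a if a["inflight"] <= b["inflight"] else b
```

Sampling two hosts instead of scanning the whole pool keeps selection O(1) while still strongly biasing traffic away from loaded instances.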

Security & Compliance

Authentication

  • TLS 1.2+ at edge; optional mTLS to backends; ALPN for HTTP/2/gRPC.

Authorization

  • If Gateway performs authZ, L7 forwards identity context; deny routes without matching policies.

Data Protection

  • No PII in LB logs; mask headers; use HSTS; secure cookies (HttpOnly, Secure, SameSite=Lax).

Compliance

  • Residency honored at GTM/L7; all decisions auditable (who changed routes/weights).

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
| --- | --- | --- | --- |
| `lb_requests_total{route,region}` | counter | Requests by route | Trend |
| `lb_latency_seconds` | histogram | Added LB latency | p95 breach |
| `lb_upstream_5xx_total` | counter | Backend errors | Spike |
| `lb_no_healthy_backends_total` | counter | Routing failures | Any > 0 |
| `lb_active_connections` | gauge | Concurrent conns | Saturation |
| `lb_outlier_ejections_total` | counter | Ejected hosts | Investigate |

Logging Requirements

  • Access logs with region, backendId, status, bytes, durationMs, traceId; redact sensitive headers.

Distributed Tracing

  • Start or propagate traceparent; add span attributes lb.region, lb.backend_id, policy.

Health Checks

  • Active (HTTP/gRPC) + passive checks; outlier detection (consecutive 5xx/latency) with ejection & recovery.

Operational Procedures

Deployment

  1. Deploy GTM records (Geo/latency policy + failover).
  2. Roll out L7 LB/Ingress with routes, TLS certs, and backends.
  3. Enable discovery (EDS) and health checks; validate with synthetic probes.

Configuration

  • Algorithms: least_request, ring_hash(key=X-Sticky-Key|tenantId), weighted_round_robin.
  • Timeouts: connect=1s, request=5s (per route), idle=60s.
  • Headers: set X-Region, X-Backend-Id, and propagate traceparent.

Maintenance

  • Rotate TLS certs; tune weights during canaries; routinely test failover.
  • Drain nodes before upgrades (connection draining, readiness gates).

Troubleshooting

  • Elevated 5xx → check outlier ejections, backend health, circuit breaker trips.
  • High latency → verify least-request and connection pool sizes; inspect Nagle/HTTP/2 settings.
  • Sticky anomalies → clear cookies, verify ring-hash seed and host set stability.

Testing Scenarios

Happy Path Tests

  • Requests distributed evenly under steady load (Gini coefficient within target).
  • Sticky session remains on same backend across N requests.

Error Path Tests

  • 503 when all backends unhealthy; 504 on upstream timeout; 404 on unknown route.
  • 409 when sticky key conflicts with policy handled gracefully.

Performance Tests

  • p95 LB overhead ≤ 5 ms at target RPS; no queue growth beyond cap.
  • Failover to secondary region within SLA (< 60s) under regional outage.

Security Tests

  • TLS and cipher policy enforced; mTLS to backends verified.
  • Residency blocks cross-region routing attempts; logs contain no PII.

Internal References

External References

  • Load balancing algorithms (least-request, ring-hash) and best practices

Appendices

A. Example Envoy Route (weighted + ring-hash)

route:
  match: { prefix: "/query" }
  route:
    hash_policy:
      - header: { header_name: "X-Sticky-Key" }
      - cookie: { name: "atp_affinity", ttl: 3600s, path: "/" }
    weighted_clusters:
      clusters:
        - name: query-primary
          weight: 95
        - name: query-canary
          weight: 5
    timeout: 5s
    idle_timeout: 60s

B. Problem+JSON (example 503)

{
  "type": "urn:connectsoft:errors/lb/no-healthy-backends",
  "title": "No healthy backends available",
  "status": 503,
  "detail": "All instances for route '/ingest' are out of service.",
  "retryAfterSeconds": 10
}

Caching Flow

Reduces read latency and load on backing stores via tenant-scoped caches with L1 (in-process) and L2 (distributed) tiers. Supports read-through + stale-while-revalidate (SWR), with projection-driven invalidation and export-safe cache bypass when strong freshness is required. Consistency model and TTLs are explicit per resource.


Overview

Purpose: Serve query responses quickly while honoring tenant isolation and documented freshness guarantees.
Scope: Cache lookup → hit/miss handling → read-through fill → TTL/SWR behavior → projector/exports invalidations → observability. Excludes CDN/public caching.
Context: Query Service fronts Projection DB/Search with L1/L2 caches; Projection Update Flow emits invalidations; Export may request bypass/lock.
Key Participants:

  • Client
  • API Gateway / Query Service
  • Cache L1 (per-pod)
  • Cache L2 (Redis/Memcache)
  • Projection DB / Search Index
  • Invalidation Bus (events from Projector/Export)

Prerequisites

System Requirements

  • L1 in-process cache with bounded memory and eviction (LRU/LFU)
  • L2 distributed cache with multi-tenant namespaces, TLS, and ACLs
  • Invalidation channel (pub/sub or stream) from Projector & Export
  • Strong hashing for keys; serialization with versioned schema

Business Requirements

  • Documented consistency choices per endpoint: strong, bounded-staleness, or eventual
  • Per-edition TTLs and max object sizes; negative-caching policy
  • Clear semantics for export and legal-hold reads (bypass or SWR disabled)

Performance Requirements

  • p95 cache hit latency: L1 ≤ 1 ms, L2 ≤ 3 ms
  • Target hit ratio: ≥ 85% for hot keys; ≥ 60% overall for query endpoints
  • Fill amplification bounded (parallel request coalescing)

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    participant C as Client
    participant GW as API Gateway / Query Service
    participant L1 as Cache L1 (in-process)
    participant L2 as Cache L2 (Redis)
    participant DB as Projection DB / Search
    participant BUS as Invalidation Bus

    C->>GW: GET /query/v1/records?tenant=acme&from=... (Cache-Mode: default)
    GW->>L1: GET cache[key(Tenant,QueryHash)]
    alt L1 hit (fresh)
        L1-->>GW: value, meta{ttl,freshness}
        GW-->>C: 200 OK (X-Cache: L1-HIT, X-Cache-Freshness: fresh)
    else L1 miss
        GW->>L2: GET key
        alt L2 hit (fresh or SWR-eligible)
            L2-->>GW: value, meta
            GW-->>C: 200 OK (X-Cache: L2-HIT, X-Cache-Freshness: fresh|stale)
            opt SWR revalidate in background if stale
                GW->>DB: Query (If-None-Match: etag)
                DB-->>GW: 304 or 200 + new value
                GW->>L2: SET key (ttl)
                GW->>L1: SET key (ttl)
            end
        else L2 miss
            GW->>DB: Query
            DB-->>GW: 200 result (etag)
            GW->>L2: SET key (ttl, etag)
            GW->>L1: SET key (ttl, etag)
            GW-->>C: 200 OK (X-Cache: MISS)
        end
    end
    BUS-->>L2: Invalidation(key or tag) on projection update
    L2-->>L1: Fan-out eviction notice

Alternative Paths

  • Bypass: header Cache-Mode: bypass → skip L1/L2 for strict reads (e.g., export) and optionally refresh cache.
  • Write-around: projector writes DB then publishes tag-based invalidations (e.g., tenant:acme, resource:order:123).
  • Coalesced fills: first request holds a per-key mutex; subsequent misses wait to avoid stampede.
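The hit/miss/SWR decision in the diagram reduces to comparing an entry's age against `max-age` and the SWR window. A condensed single-tier sketch (class and hook names are illustrative assumptions, not ATP APIs):

```python
import time

class SwrCache:
    """Read-through cache with stale-while-revalidate semantics (one tier)."""

    def __init__(self, max_age=30.0, swr=60.0, clock=time.monotonic):
        self.max_age, self.swr, self.clock = max_age, swr, clock
        self._store = {}  # key -> (value, stored_at)

    def get(self, key, fetch, revalidate=None):
        """Return (value, state) with state in {'fresh', 'stale', 'miss'}.

        A stale hit serves the cached value and calls `revalidate` (if
        given) so a background refresh can be scheduled; beyond the SWR
        window the entry is refetched synchronously (read-through fill).
        """
        now = self.clock()
        entry = self._store.get(key)
        if entry:
            value, stored_at = entry
            age = now - stored_at
            if age <= self.max_age:
                return value, "fresh"
            if age <= self.max_age + self.swr:
                if revalidate:
                    revalidate(key)  # e.g., enqueue async re-fetch
                return value, "stale"
        value = fetch()
        self._store[key] = (value, now)
        return value, "miss"
```

The returned state maps naturally onto the `X-Cache-Freshness` header shown later in this flow.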

Error Paths

sequenceDiagram
    participant GW as Query Service
    participant L2 as Cache L2
    participant DB as Projection DB

    GW->>L2: GET key
    alt 503 L2 unavailable
        GW->>DB: Fallback to DB
        DB-->>GW: 200
        GW->>L2: (skip SET) or queue async warm
    else 409 CAS/ETag conflict on SET
        L2-->>GW: 409 Conflict
        GW->>L2: GET latest → retry SET (backoff)
    end

Request/Response Specifications

Input Requirements (Headers & Query)

| Field | Type | Req | Description | Validation |
| --- | --- | --- | --- | --- |
| X-Tenant-Id | header | Y | Tenant namespace for cache | ULID/UUID |
| Cache-Mode | header | O | `default` \| `bypass` \| `refresh` \| `swr-only` | enum |
| Cache-Control | header | O | `max-age`, `stale-while-revalidate`, `no-store` | RFC 7234 |
| If-None-Match | header | O | Revalidation with ETag | string |
| X-Consistency | header | O | `strong` \| `bounded` \| `eventual` | per-route |
| Query params | query | O | Affect key hash | canonicalized order |

Output Specifications (Response & Meta)

| Field | Type | Description | Notes |
| --- | --- | --- | --- |
| X-Cache | header | `L1-HIT` \| `L2-HIT` \| `MISS` \| `BYPASS` \| `STALE` | observability |
| ETag | header | Entity tag for revalidation | stable per value |
| Cache-Control | header | Response caching directives | includes max-age |
| X-Cache-Key | header | Debug key (hashed/short) | no PII |
| X-Cache-Freshness | header | `fresh` \| `stale(<sec>)` | SWR info |
| X-Watermark | header | Projection watermark time | freshness signal |

Example Payloads

Bounded-staleness read with SWR

GET /query/v1/records?tenant=acme&from=2025-10-27T08:00Z HTTP/1.1
X-Tenant-Id: 01JF...
Cache-Mode: default
X-Consistency: bounded

HTTP/1.1 200 OK
X-Cache: L2-HIT
Cache-Control: max-age=30, stale-while-revalidate=60
ETag: "recset:acme:ab12"
X-Cache-Freshness: stale(12)
X-Watermark: 2025-10-27T08:05:30Z

Error Handling

Error Scenarios

| HTTP Code | Scenario | Recovery Action | Retry Strategy |
| --- | --- | --- | --- |
| 400 | Invalid Cache-Mode/X-Consistency value; oversized key | Fix headers/params | — |
| 401 | Missing tenant header for cached endpoints | Add X-Tenant-Id | Retry after fix |
| 403 | Tenant not allowed on this region/cache | Correct region or policy | — |
| 404 | Cache management API: unknown key/tag on purge | No-op; verify key | — |
| 409 | CAS/ETag conflict on concurrent SET | Retry with backoff; re-GET latest | Jittered backoff |
| 412 | Revalidation precondition failed (ETag mismatch) | Fetch full object | Conditional retry |
| 429 | Cache rate limit (management ops) | Back off | Exponential |
| 503 | L2 unavailable | Fallback to DB; degrade to L1-only | Bounded retries |

Failure Modes

  • Cache stampede: thundering herd on popular key → request coalescing, jittered TTLs, SWR background refresh.
  • Stale reads too old: misconfigured stale-while-revalidate → enforce max-staleness cap per route.
  • Cross-tenant leakage: missing tenant in key → mandatory X-Tenant-Id + namespace prefixes.
  • Oversized entries: evictions/fragmentation → cap object size, compress payloads, or avoid caching.

Recovery Procedures

  1. Disable SWR temporarily for problematic routes; set shorter TTLs.
  2. Purge by tag (tenant:acme, resource:order:123) after projection anomalies.
  3. Route around L2 failures (feature flag) while keeping read path via DB.
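Purging by tag, as in step 2, requires a reverse index from tags to the keys carrying them. A minimal sketch (the index shape is an assumption; a distributed L2 would typically keep it in sets alongside the values):

```python
from collections import defaultdict

class TaggedCache:
    """Cache values under keys, with tags enabling group invalidation."""

    def __init__(self):
        self._values = {}
        self._by_tag = defaultdict(set)  # tag -> keys carrying it

    def set(self, key, value, tags=()):
        self._values[key] = value
        for tag in tags:
            self._by_tag[tag].add(key)

    def get(self, key):
        return self._values.get(key)

    def purge_tag(self, tag):
        """Evict every key carrying `tag`; returns the eviction count."""
        keys = self._by_tag.pop(tag, set())
        for key in keys:
            self._values.pop(key, None)
        return len(keys)
```

Tags such as `tenant:acme` or `resource:order:123` let a projector invalidate whole groups without enumerating individual query keys.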

Performance Characteristics

Latency Expectations

  • L1 ≤ 1 ms p95; L2 ≤ 3 ms p95; read-through to DB ≤ endpoint SLO.

Throughput Limits

  • L2 QPS sized for peak miss + revalidation; keyspace cardinality controlled via hashing and tag strategy.

Resource Requirements

  • Memory budgets per pod (L1) and per cluster (L2); eviction policy tuned (LFU for skewed traffic).

Scaling Considerations

  • Partition L2 by region and shard; enable replica readers; avoid cross-AZ chatter.
  • Use compressed values (e.g., zstd) for large result sets with CPU tradeoff.

Security & Compliance

Authentication

  • mTLS between services and L2; signed purge APIs.

Authorization

  • RBAC for cache management (cache:purge|inspect); tenant-scoped purge only.

Data Protection

  • No PII in keys; values encrypted at rest if L2 supports; TLS in transit.

Compliance

  • Audit cache management actions (purge/warm) with actor, scope, reason.

Monitoring & Observability

Key Metrics

| Metric | Type | Description | Alert Threshold |
| --- | --- | --- | --- |
| `cache_hit_total{tier}` | counter | Hits by tier | Drop signals issues |
| `cache_miss_total` | counter | Misses (cold + reval) | Spike alert |
| `cache_hit_ratio` | gauge | Hits / (hits+misses) | < target |
| `cache_evictions_total` | counter | Evictions by reason | Unexpected growth |
| `cache_swr_served_total` | counter | Stale responses served | Excess indicates lag |
| `cache_fill_duration_seconds` | histogram | Miss→filled latency | p95 breach |
| `cache_invalidation_total{tag}` | counter | Invalidation events | Monitor volume |

Logging Requirements

  • Include tenantId, short cacheKey, tier, freshness, hit/miss, fillMs, traceId. No payloads/PII in logs.

Distributed Tracing

  • Child spans for cache.l1.get, cache.l2.get/set, cache.swr.revalidate, with attributes key_hash, tier.

Health Checks

  • L2 readiness probes; replication lag; pub/sub connectivity for invalidations.

Operational Procedures

Deployment

  1. Deploy L2 cache cluster (HA) with TLS and ACL; configure namespaces per region.
  2. Enable L1 caches in services with bounds and eviction settings.
  3. Wire projector → invalidation bus → L2 pub/sub fan-out.

Configuration

  • Defaults: TTL=30s, stale-while-revalidate=60s, max-staleness=90s, negativeTTL=3s.
  • Enable request coalescing and per-key mutex; cap value size (e.g., 512 KB).

Maintenance

  • Periodic warm-up for hot keys post-deploy; tune TTLs using hit/miss analytics.
  • Rotate L2 credentials; defragment and scale nodes as keyspace grows.

Troubleshooting

  • Low hit ratio → verify key canonicalization and tenant scoping.
  • Stampedes → increase jitter, enable SWR, and coalescing.
  • Staleness complaints → reduce TTL or require Cache-Mode: bypass for affected endpoints.

Testing Scenarios

Happy Path Tests

  • L1/L2 hits return within target latencies and correct headers.
  • Revalidation updates cache while serving stale safely (SWR).

Error Path Tests

  • 503 L2 outage falls back to DB with acceptable latency.
  • 409 CAS conflict on SET resolves with retry and no corruption.
  • 400 invalid Cache-Mode rejected.

Performance Tests

  • Hit ratio meets targets under production-like skew (Zipfian).
  • Thundering herd prevented under bursty traffic.

Security Tests

  • No cross-tenant cache bleed; purge is tenant-scoped and audited.
  • TLS and ACLs enforced for L2 connections.

Internal References

External References

  • RFC 7234 (HTTP Caching), SWR patterns; Redis best practices

Appendices

A. Cache Key Schema (canonicalized)

Key = sha256(
  "tenant=" + TenantId +
  "&route=" + RouteId +
  "&params=" + CanonicalQueryString +
  "&version=" + SchemaVersion
)
Namespace = "atp:{region}:{edition}"
Final = Namespace + ":q:" + KeyPrefix
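The key schema above can be realized roughly as follows; the canonicalization choices (sorting params by name and value, keeping a short digest prefix) are assumptions for illustration:

```python
import hashlib
from urllib.parse import urlencode

def cache_key(tenant_id, route_id, params, schema_version,
              region, edition, prefix_len=16):
    """Build a namespaced cache key from a canonicalized query.

    Params are sorted so equivalent queries hash identically; only a
    short prefix of the sha256 digest is kept in the final key.
    """
    canonical = urlencode(sorted(params.items()))
    material = (f"tenant={tenant_id}&route={route_id}"
                f"&params={canonical}&version={schema_version}")
    digest = hashlib.sha256(material.encode()).hexdigest()[:prefix_len]
    return f"atp:{region}:{edition}:q:{digest}"
```

Because the tenant id is hashed into the key and the namespace is region/edition scoped, two tenants can never collide on the same cache entry.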

B. Problem+JSON Examples

{
  "type": "urn:connectsoft:errors/cache/invalid-mode",
  "title": "Invalid Cache-Mode",
  "status": 400,
  "detail": "Allowed values are default|bypass|refresh|swr-only."
}

{
  "type": "urn:connectsoft:errors/cache/conflict",
  "title": "Concurrent cache update conflict",
  "status": 409,
  "detail": "ETag mismatch during SET. Value updated by another request."
}

Partitioning Flow

Routes traffic and data by tenant / shard / region using a deterministic partition strategy (e.g., TenantId + TimeBucket) mapped onto a consistent-hash ring. Ensures RLS enforcement at the data plane and honors residency flags so data stays within allowed regions. Supports shard pruning on reads and smooth ring changes with minimal rebalancing.


Overview

Purpose: Achieve scalable, cost-efficient storage and query performance by distributing load across shards while preserving strict tenant isolation and residency.
Scope: Partition key derivation → ring lookup → write placement (append store & indexes) → read-time shard pruning → ring change management (add/remove/move) → RLS enforcement. Excludes cross-region replication (covered elsewhere).
Context: Ingestion and Query paths use the Placement Service and Partition Catalog to route writes/reads. Storage (Append), Projection DB, and Search Index expose per-shard/tenant namespaces.
Key Participants:

  • API Gateway / Ingestion Service
  • Placement Service (ring lookup)
  • Partition Catalog (tenants, shards, regions)
  • Storage (Append) / Projection DB / Search Index
  • RLS/Policy Engine

Prerequisites

System Requirements

  • Global Partition Catalog with tenant → region/edition → shard mapping
  • Consistent-hash ring with virtual nodes; gossip or control-plane updates
  • Time bucketing policy (e.g., hour|day) for hot-key spreading and pruning
  • RLS enabled in all data planes (tenant-scoped schemas/aliases)

Business Requirements

  • Residency policy per tenant/edition with allowed regions and data classes
  • Hot-tenant isolation rules (dedicated shards/weighting)
  • Ring change governance (approvals, maintenance windows for big moves)

Performance Requirements

  • Target shard load imbalance (P95) ≤ 1.5× average
  • Read pruning effectiveness ≥ 90% of shards skipped for typical time windows
  • Partition lookup p95 ≤ 1 ms (cached in-process)

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    participant C as Client
    participant GW as API Gateway
    participant ING as Ingestion Service
    participant PLC as Placement Service (Ring)
    participant ST as Storage (Append / Shard)
    participant PR as Projection DB (Shard)
    participant IX as Search Index (Tenant Alias)

    C->>GW: POST /audit/v1/records (X-Tenant-Id, time=2025-10-27T08:05Z)
    GW->>ING: Canonicalized record (TenantId, OccurredAt)
    ING->>PLC: ResolvePartition(TenantId, TimeBucket=2025-10-27:08)
    PLC-->>ING: {region: eu-west, shard: s-17, keyspace: k_acme}
    ING->>ST: Append to s-17 (RLS=TenantId)
    ST-->>ING: ack (offset, partitionId)
    ING-->>GW: 202 Accepted (X-Partition: s-17, X-Region: eu-west)

    C->>GW: GET /query/v1/records?tenant=acme&from=08:00&to=08:10
    GW->>PLC: PlanQuery(TenantId, Range)
    PLC-->>GW: {prunedShards:[s-17,s-18], watermark}
    GW->>PR: Read from pruned shards (RLS=TenantId)
    PR-->>GW: results
    GW-->>C: 200 OK (X-Shards: s-17,s-18)

Alternative Paths

  • Hot-tenant isolation: Placement pins tenant to a dedicated shard set (higher vNode weight) to prevent noisy neighbors.
  • Multi-bucket fanout: Large ranges map to multiple time buckets → pruned shard list per bucket, executed in parallel with bounded concurrency.
  • Search path: Query uses per-tenant alias → resolves to index shards in allowed region only (no cross-region hits).
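The multi-bucket fanout with bounded concurrency could look like the sketch below; `resolve_shards` and `read_shard` are hypothetical stand-ins for the Placement plan call and the per-shard reads:

```python
from concurrent.futures import ThreadPoolExecutor

def fanout_query(buckets, resolve_shards, read_shard, max_concurrency=4):
    """Resolve the pruned shard list for each time bucket, dedupe the
    shards, then read them in parallel with bounded concurrency and merge."""
    shards = sorted({s for b in buckets for s in resolve_shards(b)})
    merged = []
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        for rows in pool.map(read_shard, shards):
            merged.extend(rows)
    return merged
```

Deduping first matters: adjacent buckets often map to overlapping shard sets (as in the s-17/s-18 example above), and each shard should be read once.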

Error Paths

sequenceDiagram
    participant ING as Ingestion
    participant PLC as Placement
    participant ST as Storage

    ING->>PLC: ResolvePartition(TenantId=T?, TimeBucket=?)
    alt 400 Bad Request (invalid tenant/time)
        PLC-->>ING: 400 Problem+JSON
    else 403 Residency violation (region hint not allowed)
        PLC-->>ING: 403 Problem+JSON
    else 404 Not Found (tenant or shard mapping missing)
        PLC-->>ING: 404 Problem+JSON
    else 409 Conflict (ring update in progress, epoch mismatch)
        PLC-->>ING: 409 Problem+JSON (retry with new epoch)
    else 503 Service Unavailable (catalog/ring unavailable)
        PLC-->>ING: 503 Problem+JSON (Retry-After)
    end

Request/Response Specifications

Input Requirements

Field Type Req Description Validation
X-Tenant-Id header Y Tenant identity for RLS and partitioning ULID/UUID
X-Region-Hint header O Preferred region (must be allowed) Residency allowlist
OccurredAt body field Y Event time used for time bucket RFC3339 UTC
Partition-Key header O Override hash key (advanced) Controlled via policy
Range query O from/to time for reads from ≤ to, bounded span
X-Ring-Epoch header O Client-observed ring epoch Monotonic int

Output Specifications

Field Type Description Notes
X-Partition header Chosen shard id For debugging
X-Region header Serving region Residency proof
X-Shards header Pruned shard list for reads Comma-separated
X-Watermark header Lowest consistent time served For staleness checks
X-Ring-Epoch header Ring epoch used for routing Detect drift

Example Payloads

Resolve Partition (internal)

POST /placement/v1/resolve
{
  "tenantId": "01JF6V3A6W1T6E2TB1C2N2YV9Q",
  "occurredAt": "2025-10-27T08:05:12Z"
}

Response

{
  "region": "eu-west",
  "shardId": "s-17",
  "epoch": 42,
  "keyspace": "k_acme"
}

Query Plan (pruning)

POST /placement/v1/plan-query
{
  "tenantId": "01JF6V3A6W1T6E2TB1C2N2YV9Q",
  "from": "2025-10-27T08:00:00Z",
  "to": "2025-10-27T08:10:00Z"
}

Response

{
  "region": "eu-west",
  "shards": ["s-17","s-18"],
  "watermark": "2025-10-27T08:09:58Z",
  "epoch": 42
}

Error Handling

Error Scenarios

HTTP Code Scenario Recovery Action Retry Strategy
400 Missing/invalid X-Tenant-Id, bad time window Fix request headers/params
401 Unauthenticated request to placement APIs Authenticate Retry after renewal
403 Residency/edition violation (region not allowed) Choose allowed region
404 Tenant or shard mapping not found Re-sync catalog / onboard tenant
409 Ring epoch mismatch during write/read Fetch latest epoch; redo resolve Jittered retry
412 Preconditions (RLS context) not present Include tenant scope
429 Placement lookups rate-limited Back off Exponential + jitter
503 Placement/Catalog unavailable Degrade to cached hint or fail Bounded retries
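The 409 recovery path (refresh epoch, jittered retry) and bounded retries for 503 might be sketched as follows; the exception classes are illustrative stand-ins for the Problem+JSON responses, not real client types:

```python
import random
import time

class EpochMismatch(Exception):   # stand-in for a 409 epoch-mismatch response
    pass

class Unavailable(Exception):     # stand-in for a 503 response
    pass

def resolve_with_retry(resolve, fetch_epoch, max_attempts=5, base_delay=0.05):
    """Retry ResolvePartition: on 409 refresh the ring epoch before
    retrying, on 503 just back off. Delays use exponential backoff
    with full jitter to avoid synchronized retry storms."""
    epoch = fetch_epoch()
    for attempt in range(max_attempts):
        try:
            return resolve(epoch)
        except EpochMismatch:
            epoch = fetch_epoch()        # pick up the new ring epoch
        except Unavailable:
            pass                         # transient; epoch still valid
        time.sleep(random.uniform(0, base_delay * 2 ** attempt))
    raise RuntimeError("placement resolve failed after %d attempts" % max_attempts)
```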

Failure Modes

  • Hot shard: skewed hash or burst tenant → adjust vNode weights, or isolate tenant to dedicated shard set.
  • Ring churn: frequent membership changes cause 409s → stage updates and epoch gating with drain.
  • Cross-region spill: misconfigured residency → hard deny at placement and gateway.

Recovery Procedures

  1. Enable skew mitigations (weighting, pinning) and backfill if rebalancing moved ranges.
  2. Roll back ring change to prior epoch if error rate spikes; drain and retry in controlled batches.
  3. Rebuild tenant alias in Search/Projection if shard move required index re-aliasing.

Performance Characteristics

Latency Expectations

  • Placement cache lookup ≤ 1 ms p95; cold fetch ≤ 10 ms p95.
  • Pruned read fanout limited to ≤ 4 shards for typical query windows.

Throughput Limits

  • Placement QPS sized for all writes + planning; use edge caches in services to reduce calls.

Resource Requirements

  • Small in-memory partition maps per service; watch stream for updates; compact ring representation with virtual nodes.

Scaling Considerations

  • Multi-ring design (per-region) to avoid cross-region chatter.
  • Add shards by adding vNodes (smooth rebalance ≤ 10% key movement).
  • Time buckets control hot partitions; tune bucket size by workload.
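The smooth-rebalance property follows from consistent hashing with virtual nodes: when a shard is added, only the keys that land on the new shard's vNode arcs move, roughly 1/N of the key space. An illustrative sketch (vNode count and hash choice are arbitrary here, not ATP settings):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Consistent-hash ring with virtual nodes. Adding a shard moves
    only the keys whose successor point becomes one of the new shard's
    vNodes (~1/N of the space for N shards)."""

    def __init__(self, shards=(), vnodes=64):
        self._points = []                 # sorted list of (hash, shard)
        self._vnodes = vnodes
        for shard in shards:
            self.add_shard(shard)

    @staticmethod
    def _hash(value):
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add_shard(self, shard):
        for i in range(self._vnodes):
            bisect.insort(self._points, (self._hash(f"{shard}#vn{i}"), shard))

    def lookup(self, key):
        # Owner is the first ring point at or after the key's hash (wrapping).
        hashes = [h for h, _ in self._points]
        i = bisect.bisect(hashes, self._hash(key)) % len(self._points)
        return self._points[i][1]
```

Because points are only added, every key that changes owner maps to the new shard, which is what keeps key movement bounded during ring growth.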

Security & Compliance

Authentication

  • mTLS between services and Placement/Catalog; OIDC for ops.

Authorization

  • Roles: placement:read, placement:update; only platform ops can alter ring/vNodes.

Data Protection

  • Enforce RLS at DB and index layers; per-tenant schemas/aliases; no PII in partition keys.

Compliance

  • Residency is enforced at planning/placement time and audited; changes to ring membership are recorded as Partition.RingUpdated events.

Monitoring & Observability

Key Metrics

Metric Type Description Alert Threshold
partition_lookup_latency_seconds histogram Placement latency p95 > 10 ms
partition_skew_ratio gauge Max shard load / avg > 1.5
ring_epoch_mismatch_total counter 409 due to epoch drift Spike
reads_shards_scanned histogram Shards touched per query p95 > target
residency_denied_total counter 403 due to residency Any sustained
hot_tenant_isolations_total counter Isolation activations Trend

Logging Requirements

  • Include tenantId, region, epoch, shardId, bucket, planId, traceId; never log plaintext PII.

Distributed Tracing

  • Spans: placement.resolve, placement.planQuery; attributes: epoch, shard_list, bucket_count.

Health Checks

  • Catalog freshness (last update time), ring convergence across nodes, RLS guard status.

Operational Procedures

Deployment

  1. Deploy Placement Service (HA) and Catalog with watch streams.
  2. Configure per-region rings; seed vNodes; warm caches.

Configuration

  • Hash: fnv1a/xxhash on TenantId + BucketKey.
  • Bucket: daily/hourly; configurable per tenant/class.
  • Ring: vNodes=256 default; epoch increments on changes.

Maintenance

  • Quarterly ring review; rebalance heavy shards; rotate ring secrets.
  • Simulate ring changes in staging with shadow placement before production.

Troubleshooting

  • High shard scan count → check time bucket tuning and secondary predicates.
  • 409 spikes → ensure services refresh epoch quickly; increase push frequency.
  • Residency denials → verify tenant policy and region hint.

Testing Scenarios

Happy Path Tests

  • Ingest routes to correct shard/region with proper headers.
  • Query pruning selects minimal shards and returns correct results.

Error Path Tests

  • 400/404 invalid tenant/mapping rejected; 409 epoch mismatch handled by retry.
  • 403 residency violations blocked decisively.

Performance Tests

  • Placement p95 ≤ 1 ms cached; shard skew ratio ≤ 1.5× under load.
  • Query scans ≤ target shards for standard ranges.

Security Tests

  • RLS enforced on all reads/writes; no cross-tenant leakage.
  • Residency never violated even under failover.

Internal References

External References

  • Consistent hashing & virtual nodes best practices

Appendices

A. Partition Key Derivation

BucketKey = floor(to_unix(OccurredAt) / BucketSizeSeconds)
HashInput = TenantId || ":" || BucketKey
Shard = Ring(hash(HashInput))
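A runnable version of this derivation, assuming hourly buckets, the FNV-1a 64-bit hash named under Configuration, and a simple modulo standing in for the consistent-hash ring lookup:

```python
from datetime import datetime, timezone

def fnv1a_64(data: bytes) -> int:
    # FNV-1a, 64-bit: standard offset basis and prime.
    h = 0xcbf29ce484222325
    for byte in data:
        h ^= byte
        h = (h * 0x100000001b3) & 0xFFFFFFFFFFFFFFFF
    return h

def bucket_key(occurred_at: str, bucket_size_seconds: int = 3600) -> int:
    # BucketKey = floor(to_unix(OccurredAt) / BucketSizeSeconds)
    ts = datetime.fromisoformat(occurred_at.replace("Z", "+00:00"))
    return int(ts.timestamp()) // bucket_size_seconds

def shard_for(tenant_id: str, occurred_at: str, ring_size: int = 32) -> str:
    hash_input = f"{tenant_id}:{bucket_key(occurred_at)}".encode("utf-8")
    # Modulo here is a placeholder for Ring(hash(HashInput)).
    return f"s-{fnv1a_64(hash_input) % ring_size}"
```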

B. Problem+JSON Examples

{
  "type": "urn:connectsoft:errors/partition/epoch-mismatch",
  "title": "Ring epoch mismatch",
  "status": 409,
  "detail": "Client epoch 41 != current epoch 42."
}
{
  "type": "urn:connectsoft:errors/partition/residency-violation",
  "title": "Region not allowed by residency policy",
  "status": 403,
  "detail": "Tenant 'acme' is restricted to eu-west."
}

Auto-Scaling Flow

Scales services safely with load using policy-driven HPA/KEDA decisions, proactive warmup/readiness gates, and cost guardrails. Prevents thrash via stabilization windows, rate limits, and deliberate scale-in. Maintains SLOs while distributing load across newly ready instances.


Overview

Purpose: Automatically add/remove capacity to meet SLOs while controlling cost and avoiding oscillation.
Scope: Signal collection → scaling decision → resource provisioning → service scale-out/in → warmup/readiness → load distribution → verification/rollback. Excludes manual capacity planning.
Context: Metrics from Observability and Queue/Bus feed Autoscaler (HPA/KEDA). Kubernetes (orchestrator) applies replica changes. Gateway/LB route traffic only to ready pods.
Key Participants:

  • Load Monitor (Prometheus/OTel, Queue metrics)
  • Autoscaler (HPA/KEDA controller)
  • Orchestrator (Kubernetes API Server)
  • Target Service (e.g., Ingestion/Query/Export)
  • Warmup Manager (init tasks, cache warm)
  • API Gateway / L7 LB
  • Cost Guard (budget policy evaluator)

Prerequisites

System Requirements

  • Metrics (CPU, memory, RPS, p95 latency, queue depth/lag) exported and scraped
  • HPA/KEDA installed with stabilization windows & scale rate limits
  • Readiness/Startup probes and graceful shutdown configured
  • Optional Warm Pool or pre-provisioned nodes for burst traffic

Business Requirements

  • SLOs defined per service (latency/error budget)
  • Cost guardrails (min/max replicas, monthly budget caps, per-tenant limits)
  • Change approvals for autoscaling policy updates

Performance Requirements

  • Scale-out reaction time ≤ 30–60s for CPU/RPS, ≤ 10s for queue lag (event-driven)
  • Scale-in conservatively; error budget burn must stay within targets
  • No oscillation: replica changes limited by stabilization (e.g., 300s down, 60s up)
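The anti-oscillation rules (per-direction stabilization windows plus step caps) can be sketched as a pure clamping function. Defaults below mirror the example values in this section; the function shape itself is illustrative, not the HPA algorithm:

```python
def apply_scale_policy(current, desired, seconds_since_last_change,
                       min_replicas=2, max_replicas=40,
                       up_stabilization=60, down_stabilization=300,
                       max_increase_pct=100, max_decrease_pct=33):
    """Clamp a raw desired-replica count by bounds, per-direction
    stabilization windows, and max step percentages (anti-thrash)."""
    desired = max(min_replicas, min(max_replicas, desired))
    if desired > current:
        if seconds_since_last_change < up_stabilization:
            return current                       # hold: up window still active
        ceiling = current + max(1, current * max_increase_pct // 100)
        return min(desired, ceiling)             # cap the scale-up step
    if desired < current:
        if seconds_since_last_change < down_stabilization:
            return current                       # hold: down window still active
        floor = current - max(1, current * max_decrease_pct // 100)
        return max(desired, floor)               # cap the scale-down step
    return current
```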

Sequence Flow

Happy Path

sequenceDiagram
    autonumber
    participant LM as Load Monitor (Metrics/Queue)
    participant AS as Autoscaler (HPA/KEDA)
    participant OR as Orchestrator (K8s API)
    participant SVC as Target Service
    participant WM as Warmup Manager
    participant LB as API Gateway / L7 LB
    participant CG as Cost Guard

    LM-->>AS: Signals {cpu=78%, rps=1.8k, p95=230ms, queueLag=high}
    AS->>CG: Check policy & budget (min/max, cost caps)
    CG-->>AS: OK (within budget)
    AS->>OR: Patch Deployment replicas +3 (rate-limited)
    OR->>SVC: Create Pods (Pending→Init→Running)
    SVC->>WM: Warmup (JIT cache, connection pools)
    SVC-->>OR: Readiness=TRUE (startup probe passed)
    OR-->>LB: Endpoint added to ready set
    LB-->>SVC: Start routing a % of traffic (ramp-up)
    LM-->>AS: Metrics improve (p95→140ms, queueLag→normal)
    AS->>OR: Hold steady (stabilization window active)

Alternative Paths

  • Predictive/Scheduled: pre-scale based on calendar or forecast (e.g., top-of-hour export).
  • Event-driven (KEDA): scale on queue depth/lag or webhook events (spikes).
  • Per-tenant partitions: scale labeled shard Deployments independently to isolate hot tenants.

Error Paths

sequenceDiagram
    participant AS as Autoscaler
    participant OR as Orchestrator
    participant CG as Cost Guard
    participant SVC as Target Service

    AS->>CG: Request scale beyond max
    CG-->>AS: 409 Conflict (budget cap)
    AS-->>AS: Clamp to max, raise alert

    AS->>OR: Scale to N
    OR-->>AS: 503 API unavailable / quota exceeded
    AS-->>AS: Retry w/ backoff, keep stabilization timer

    OR->>SVC: Start pods
    SVC-->>OR: Readiness FAILED (startup)
    OR-->>AS: Scale not effective
    AS-->>AS: Pause scale-in, open incident, hold window

Request/Response Specifications

Input Requirements (Autoscaling Policy APIs)

Field Type Req Description Validation
POST /ops/v1/autoscale/policies http Y Create/update policy RBAC
service string Y Target service name existing
minReplicas / maxReplicas int Y Bounds 1 ≤ min ≤ max
targets object O e.g., cpu=70, rps=200, p95Ms=180, queueLag=5s sane ranges
scaleUpPolicy object O stabilizationSec, maxIncreasePercent, step limits
scaleDownPolicy object O stabilizationSec, maxDecreasePercent, idleWindowSec limits
costGuardrails object O {maxMonthlyCents, maxNodes, burstAllowance} non-negative
predictive object O schedule/cron or model id valid cron

Output Specifications

Field Type Description Notes
policyId string Identifier immutable
status enum Active | Pending | Error
effectiveAt time Activation time RFC3339
reason string Policy validation result optional

Example Payloads

Create Policy

POST /ops/v1/autoscale/policies
{
  "service": "query",
  "minReplicas": 4,
  "maxReplicas": 40,
  "targets": { "cpu": 70, "p95Ms": 180, "rps": 250 },
  "scaleUpPolicy": { "stabilizationSec": 60, "maxIncreasePercent": 100, "step": 4 },
  "scaleDownPolicy": { "stabilizationSec": 300, "maxDecreasePercent": 33, "idleWindowSec": 600 },
  "costGuardrails": { "maxMonthlyCents": 250000, "maxNodes": 60 }
}

Decision Record (emit)

{
  "decisionId": "asd_01JF9A...",
  "service": "query",
  "from": 16,
  "to": 24,
  "reason": "p95>180ms and rps>target",
  "window": "60s",
  "guardrailsApplied": false,
  "timestamp": "2025-10-27T08:06:30Z"
}
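How the Cost Guard might clamp a request and emit such a decision record — a sketch only: `replicas_per_node` is an assumed packing factor used to turn the maxNodes budget into a replica cap, not an ATP parameter:

```python
def guarded_scale_decision(service, current, requested,
                           max_replicas, max_nodes, replicas_per_node,
                           timestamp):
    """Clamp a requested replica target to the policy max and to the
    node budget, flagging whether guardrails changed the outcome."""
    node_cap = max_nodes * replicas_per_node   # replicas affordable in budget
    target = min(requested, max_replicas, node_cap)
    return {
        "service": service,
        "from": current,
        "to": target,
        "guardrailsApplied": target < requested,
        "timestamp": timestamp,
    }
```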

Error Handling

Error Scenarios

HTTP Code Scenario Recovery Action Retry Strategy
400 Invalid policy (min>max, bad targets) Fix payload
401 Missing/invalid token to ops API Authenticate Retry after renewal
403 Caller lacks autoscale:write Request access
404 Policy/service not found Verify name; create first
409 Policy conflicts with cost guardrails or active rollout Adjust bounds or wait Conditional retry
412 Preconditions failed (budget exceeded) Increase budget or reduce target
429 Throttled ops updates Back off Exponential + jitter
503 Orchestrator unavailable/quota exhausted Retry; open incident Backoff; clamp to safe min

Failure Modes

  • Thrashing: rapid up/down changes → increase stabilization windows; lower sensitivity; coarser steps.
  • Cold-start latency: new pods routed too early → enforce readiness gates and ramp-up percentage.
  • Exceeding budget: forecast misses → cost guard clamps, triggers graceful degradation plans.

Recovery Procedures

  1. Freeze scale-down; hold steady at current replicas; widen windows.
  2. Enable predictive pre-scale during known peaks; warm caches.
  3. If quota hit, divert traffic (multi-region) or shed load (429) with idempotency keys.

Performance Characteristics

Latency Expectations

  • Scale-out decision path (signal→ready) ≤ 60–90s typical; ≤ 15s for KEDA on lag spikes.
  • No SLO breach during scale-in; drain connections before termination.

Throughput Limits

  • Max scale step per window (e.g., +100% up, −33% down).
  • Node autoscaler pre-warms to ensure pods schedule within target.

Resource Requirements

  • Metrics store sized for scrape interval and cardinality; autoscaler controller HA.
  • Warm pool (optional) sized to absorb N minutes of surge.

Scaling Considerations

  • Separate control plane autoscaler resources from workloads.
  • Partition by service/shard for isolation; avoid global contention.
  • Use pod disruption budgets (PDBs) to protect capacity on rollouts.

Security & Compliance

Authentication

  • OIDC for ops APIs; mTLS between autoscaler and cluster API.

Authorization

  • RBAC: autoscale:read, autoscale:write, autoscale:admin. Least privilege for controllers.

Data Protection

  • No PII in scaling logs/metrics; scrub tenant identifiers or hash.

Compliance

  • Emit audited events: Autoscale.PolicyUpdated|DecisionMade|ScaleApplied|GuardrailClamped with reason & actor.

Monitoring & Observability

Key Metrics

Metric Type Description Alert Threshold
autoscale_desired_replicas gauge Desired vs current Large sustained delta
autoscale_decisions_total{reason} counter Scale events Spike analysis
autoscale_thrash_total counter Up/down flips within window > 0 sustained
service_slo_latency_p95_ms gauge p95 latency > target
queue_lag_seconds gauge Event backlog > target
cost_estimated_monthly_cents gauge Spend projection > budget

Logging Requirements

  • Decision logs: decisionId, from→to, reasons, signals, guardrailsApplied, traceId.

Distributed Tracing

  • Spans: autoscale.evaluate, autoscale.apply; link to service load spans via traceparent.

Health Checks

  • Controller health, permission checks, K8s API latency; synthetic scale probe in staging.

Operational Procedures

Deployment

  1. Install HPA/KEDA; configure metrics adapters.
  2. Enable readiness/startup probes and graceful draining (preStop hooks).
  3. Apply baseline policies per service; verify guardrails.

Configuration

  • Example defaults: min=2, max=40, cpu=70%, p95=180ms, queueLag=5s.
  • scaleUpStabilization=60s, scaleDownStabilization=300s, maxIncrease=100%, maxDecrease=33%.
  • Cost guard: maxMonthlyCents, maxNodes, burstAllowance.

Maintenance

  • Quarterly policy review vs. observed traffic.
  • Load tests before peak seasons; adjust predictive schedules.

Troubleshooting

  • Oscillation → widen stabilization, reduce sensitivity, increase step size.
  • Pods not becoming ready → inspect warmup dependencies, increase startupProbe timeouts.
  • Budget clamp events → validate forecasts; consider reserved capacity.

Testing Scenarios

Happy Path Tests

  • Sustained load triggers scale-out within target time; SLO met.
  • Post-peak scale-in occurs after stabilization; no SLO regressions.

Error Path Tests

  • 409 guardrail clamp logged; system holds safe capacity.
  • 503 orchestrator outage handled by retries without thrash.

Performance Tests

  • Burst load with KEDA (queue lag) scales within ≤ 15s to clear backlog.
  • Scale-in preserves error budget and maintains p95 latency.

Security Tests

  • Only authorized roles can modify policies; all changes audited.
  • No PII in autoscale logs/metrics.

Internal References

External References

  • HPA/KEDA best practices; SRE guides on autoscaling and error budgets

Appendices

A. Example HPA (CPU + custom p95 latency via metrics adapter)

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: query-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: query
  minReplicas: 4
  maxReplicas: 40
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 33
          periodSeconds: 300
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Pods
      pods:
        metric:
          name: service_latency_p95_ms
        target:
          type: AverageValue
          averageValue: "180"

B. Example KEDA ScaledObject (queue lag)

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: export-worker
spec:
  scaleTargetRef:
    name: export-worker
  minReplicaCount: 2
  maxReplicaCount: 60
  cooldownPeriod: 300
  triggers:
    - type: redis
      metadata:
        address: REDIS_ADDR
        listName: export-jobs
        listLength: "100" # target backlog

C. Problem+JSON (policy conflict)

{
  "type": "urn:connectsoft:errors/autoscale/policy-conflict",
  "title": "Autoscale policy conflicts with guardrails",
  "status": 409,
  "detail": "Requested maxReplicas 120 exceeds maxNodes budget."
}