Sequence Flows — Header, Scope & Notation — Audit Trail Platform (ATP)¶
This document captures end-to-end sequence flows for the Audit Trail Platform (ATP). It shows how requests move across services (Gateway → Ingestion → Storage → Integrity → Projection → Query/Search/Export), which headers and IDs are propagated, and where policy/redaction and integrity operations occur.
JSON uses lowerCamel; C#/gRPC (code-first) uses PascalCase; Protobuf fields are PascalCase with `json_name` mapped to lowerCamel. Times are ISO-8601 UTC with ms precision.
Purpose¶
- Provide a definitive reference for request/response choreography across ATP.
- Make tenancy, correlation, idempotency, redaction, and integrity touchpoints explicit.
- Enable engineers, SREs, and auditors to reason about correctness and operational SLOs (e.g., projection and sealing lag).
Audience¶
- Platform engineers implementing services and SDKs.
- SRE/Operations running ATP in production.
- Security & Compliance validating controls, proofs, and holds.
- Integrations & SDK authors producing/consuming audit data.
Scope¶
- Online and async ingestion, projection, integrity, query/search, export, policy/hold, recovery, observability.
- Happy paths with `alt`/`opt` blocks for errors, retries, and degraded modes.
- Cross-references to Data Model, Message Schemas, HLD, and Components.
Non-goals¶
- Full API parameter docs (see REST/gRPC contracts).
- Deep internals of cryptographic primitives (see Integrity spec).
- Runbook procedures (see Operations/Runbook).
How to read these diagrams¶
- Each flow is expressed with a Mermaid `sequenceDiagram`.
- We use consistent participant names (below) and consistent labels for calls: `op name [headers] {summary}` for requests; `↩ status body` for responses.
- Headers are shown with `[h]`, bodies with `[b]` when helpful.
- Tenancy & correlation appear on the first hop and are implied downstream unless called out.
- Errors use `alt/else` blocks; retries use `loop` with backoff notes.
Canonical participants (legend)¶
| Label | Meaning |
|---|---|
| `Client` | External producer/consumer (browser, service, tool) |
| `Gateway` | API Gateway / Edge (authN/Z, rate limit, tenancy) |
| `Ingestion` | Write path (validate, canonicalize, classify/redact, append) |
| `Storage` | Authoritative append-only store (WORM) |
| `Integrity` | Segment/block sealing, Merkle roots, signatures |
| `Projection` | Read-model updaters; checkpoints/watermarks |
| `Query` | Timeline/resource/actor queries; masking profiles |
| `Search` | Full-text/facets/suggest over per-tenant indices |
| `Export` | eDiscovery and bulk packages; signed manifests |
| `Policy` | Classification, redaction, retention evaluation |
| `LegalHold` | Hold application/release, scope indexing |
| `Bus` | Message transport (e.g., Service Bus/MassTransit/NSB) |
| `KMS` | Key management for signatures/manifests |
| `IdP` | Identity provider (JWT/OIDC) |
| `Obs` | Observability pipeline (metrics/logs/traces) |
Flows may also show `Inbox`/`Outbox`, `Indexer`, or `Admin` where relevant.
Cross-cutting conventions¶
- Tenancy: All flows carry `x-tenant-id` (or gRPC metadata `tenant`); RLS is enforced at storage and read models.
- Correlation: OTel `traceparent` is required; optional `baggage` includes `tenant`, `edition`, `shard`.
- Idempotency: Producers SHOULD send `x-idempotency-key` (REST) or `idempotency` (gRPC metadata); ingestion dedupes per `(tenantId, key)`.
- Problem+JSON: Errors return RFC 7807 with `type`, `title`, `status`, `detail`, and `errors[] { pointer, reason }`.
- Redaction: The write path applies classification/redaction per policy. Reads apply masking profiles (`Safe|Support|Investigator|Raw`).
- Integrity: Sealing is asynchronous; verify-on-read is optional and called out explicitly where supported.
- Pagination: Seek cursors encode `(createdAt, auditRecordId)`; included in query flows.
- Clocks: `createdAt` (producer), `observedAt` (platform), `sealedAt` (integrity), `eligibleAt` (retention).
- Status codes (REST): 2xx (OK/Accepted), 4xx (validation/limits/auth), 5xx (transient). gRPC codes: `OK`, `INVALID_ARGUMENT`, `ALREADY_EXISTS`, `RESOURCE_EXHAUSTED`, `UNAVAILABLE`, `DEADLINE_EXCEEDED`.
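The pagination convention can be sketched as follows — a minimal, assumed encoding of the `(createdAt, auditRecordId)` seek position into an opaque URL-safe cursor. The platform's actual wire format is not specified here; only the tuple being encoded comes from this document.

```python
import base64
import json

def encode_cursor(created_at: str, audit_record_id: str) -> str:
    """Pack the seek position into an opaque, URL-safe token."""
    raw = json.dumps([created_at, audit_record_id]).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")

def decode_cursor(cursor: str) -> tuple[str, str]:
    """Recover (createdAt, auditRecordId) from the token."""
    padded = cursor + "=" * (-len(cursor) % 4)
    created_at, audit_record_id = json.loads(base64.urlsafe_b64decode(padded))
    return created_at, audit_record_id

cursor = encode_cursor("2025-10-22T12:00:03.100Z", "01JE7K4J9F9D0S6E7X5Q1A3BCP")
assert decode_cursor(cursor) == ("2025-10-22T12:00:03.100Z", "01JE7K4J9F9D0S6E7X5Q1A3BCP")
```

Because the tuple round-trips losslessly, a query service can resume a seek scan exactly where the previous page ended, regardless of inserts elsewhere in the timeline.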
Sample notation (Mermaid)¶
sequenceDiagram
autonumber
actor Client
participant Gateway
participant Ingestion
participant Policy
participant Storage
participant Projection
participant Integrity
participant Obs as Observability
Client->>Gateway: POST /audit [h: x-tenant-id, traceparent, x-idempotency-key] [b: AuditRecord]
Note right of Gateway: AuthN/Z (IdP), rate limiting, tenancy check
Gateway->>Ingestion: Append(request) [h: forwarded headers]
Ingestion->>Policy: Evaluate(classify, redact hints)
Policy-->>Ingestion: decision {classes, redactions}
Ingestion->>Storage: INSERT AuditRecord (canonical JSON, WORM)
Storage-->>Ingestion: ↩ ack {auditRecordId}
Ingestion-->>Gateway: ↩ 202 Accepted {auditRecordId}
par Async
Storage-->>Projection: event AuditRecord.Accepted
Storage-->>Integrity: leaf hash → segment buffer
and
Ingestion-->>Obs: metrics/traces/logs
end
Projection-->>Projection: upsert read models, advance checkpoint
Integrity-->>Integrity: seal block, sign, emit ProofComputed
Legend
- Solid arrows: synchronous calls.
- Dashed arrows (`-->>`): async publish/consume or responses.
- `par` blocks: parallel async work.
- `alt/else` blocks: branching (validation errors, retries).
- `loop` blocks: retry with backoff.
Reading map (what comes next)¶
The remaining sections detail each area with a dedicated diagram and callouts:
- Ingestion (REST/gRPC/Bus/Actors) — validation, classification/redaction, idempotency
- Integrity — chain/segment/block sealing, verification, key rotation
- Projections & Search — read models, indexing, checkpoints, pagination
- Query & Read — policy-aware masking, verify-on-read, filters & time windows
- Export & eDiscovery — job lifecycle, manifests, delivery, legal hold
- Policy, Retention & Hold — evaluation, eligibility, purge block
- Reliability — retry, DLQ, circuit breaker, compensation, rebuild
- Observability — metrics, traces, health, alerts
- Admin — onboarding, schema evolution, configuration, partitioning, auto-scaling
Links¶
- → Architecture Overview
- → High-Level Design
- → Context Map
- → Components & Services
- → Data Model
- → Message Schemas
- → REST APIs
- → Observability
- → Runbook
Standard Audit Record Ingestion Flow¶
Canonical online path to append an AuditRecord via the API Gateway. Covers authN/Z, tenancy routing, rate limiting, validation & canonicalization, policy-driven classification/redaction hints, append to WORM storage, and async fan-out (AuditRecord.Accepted, projections, integrity). Uniquely emphasizes idempotency and Problem+JSON error semantics for safe retries.
Overview¶
Purpose: Accept a producer’s audit fact and durably append it to the authoritative store with correct tenancy, correlation, and privacy posture.
Scope: Single-record REST ingestion through the Gateway; includes validation, classification/redaction hints, append, and async fan-out triggers. Excludes gRPC and bus-based ingestion (covered in separate flows).
Context: Entry point for most interactive producers; downstream projections power query/search; integrity sealing is asynchronous.
Key Participants:
- Client (producer)
- API Gateway (authN/Z, limits, tenancy)
- Ingestion Service (validate/canonicalize/classify)
- Policy Service (classification/redaction hints)
- Storage Service (authoritative append, WORM)
- Projection Service (read models; async)
- Integrity Service (segment/block sealing; async)
Prerequisites¶
System Requirements¶
- API Gateway, Ingestion, Policy, Storage online and reachable
- TLS enabled end-to-end; trusted IdP/JWT validation configured
- Network routes opened Gateway → Ingestion → Policy/Storage
- Schema Registry accessible to Ingestion
Business Requirements¶
- Tenant exists and is active; residency and edition set
- Policy (classification/redaction) published and cacheable
- Retention policy present (for later lifecycle)
- Legal holds (if any) indexed (no effect on write, affects lifecycle)
Performance Requirements¶
- Gateway rate-limit buckets sized for tenant (burst/sustain)
- Ingestion p95 latency < 50 ms at target load
- Payload size ≤ 256 KiB; attributes/fields within limits
Sequence Flow¶
Happy Path¶
sequenceDiagram
autonumber
actor Client
participant Gateway as API Gateway
participant Ingestion as Ingestion Service
participant Policy as Policy Service
participant Storage as Storage (Authoritative)
participant Projection as Projection Service
participant Integrity as Integrity Service
Client->>Gateway: POST /audit/v1/records<br/>[h: Authorization, x-tenant-id, traceparent, x-idempotency-key]<br/>[b: AuditRecord JSON]
Note right of Gateway: AuthN (JWT/OIDC) • AuthZ (tenant scope) • Rate limit • Header validation
Gateway->>Ingestion: Append(request)<br/>[forward headers]
Ingestion->>Policy: Evaluate(classify/redaction hints)
Policy-->>Ingestion: decision { classes, redactions }
Ingestion->>Ingestion: Validate & canonicalize<br/>(size, clocks, action, resource, attrs)
Ingestion->>Storage: INSERT canonical JSON (WORM)
Storage-->>Ingestion: ↩ ack { auditRecordId }
Ingestion-->>Gateway: ↩ 202 Accepted { auditRecordId, status:"Created" }
par Async fan-out
Storage-->>Projection: event AuditRecord.Accepted
Storage-->>Integrity: enqueue leaf → segment
end
Note over Projection,Integrity: Projections update read models, Integrity seals blocks later
Alternative Paths¶
- Duplicate idempotency key: Ingestion returns `202` with `status:"Duplicate"` and the original `auditRecordId`.
- Server-assigned ID: If `auditRecordId` is omitted, Ingestion assigns a ULID and returns it.
- Sealing disabled: Integrity branch skipped for tenant/edition; lifecycle proceeds to eligibility without proofs.
- Partial policy outage: Use last-known policy (stale-tolerant) and tag the decision with `basis:"Cached"`.
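The duplicate-key and server-assigned-ID paths can be illustrated with a small sketch. The in-memory dict and the ULID-like generator are stand-ins for the durable per-tenant dedupe index and a real ULID library; only the `Created`/`Duplicate` contract keyed by `(tenantId, key)` comes from this document.

```python
import secrets

# In-memory stand-in for the ingestion dedupe store, keyed per (tenantId, key).
_dedupe: dict[tuple[str, str], str] = {}

# Crockford base32 alphabet used by ULIDs (illustrative generator, not spec-exact).
_B32 = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

def _new_ulid_like() -> str:
    return "".join(secrets.choice(_B32) for _ in range(26))

def append(tenant_id: str, idempotency_key: str, record: dict) -> dict:
    """Return 202 body: Created on first write, Duplicate (same id) on replay."""
    slot = (tenant_id, idempotency_key)
    if slot in _dedupe:
        return {"status": "Duplicate", "auditRecordId": _dedupe[slot]}
    # Honor a client-assigned id; otherwise assign one server-side.
    audit_record_id = record.get("auditRecordId") or _new_ulid_like()
    _dedupe[slot] = audit_record_id
    return {"status": "Created", "auditRecordId": audit_record_id}

first = append("acme", "req-7a9f", {"action": "user.create"})
replay = append("acme", "req-7a9f", {"action": "user.create"})
assert first["status"] == "Created"
assert replay == {"status": "Duplicate", "auditRecordId": first["auditRecordId"]}
```

Note that the same key under a different tenant is a fresh write — the dedupe scope is per tenant, never global.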
Error Paths¶
sequenceDiagram
actor Client
participant Gateway as API Gateway
participant Ingestion as Ingestion Service
Client->>Gateway: POST /audit/v1/records
alt Validation error
Gateway->>Ingestion: Append(request)
Ingestion-->>Gateway: ↩ 400 Problem+JSON (action.invalid, payload.tooLarge, ...)
Gateway-->>Client: ↩ 400 Problem+JSON
else Rate limited
Gateway-->>Client: ↩ 429 Problem+JSON + Retry-After
else Storage unavailable
Gateway->>Ingestion: Append(request)
Ingestion-->>Gateway: ↩ 503 Problem+JSON
Gateway-->>Client: ↩ 503 (retry with backoff)
end
Request/Response Specifications¶
Input Requirements¶
| Field | Type | Req. | Description | Validation |
|---|---|---|---|---|
| `Authorization` (header) | string | Y | Bearer JWT | Valid signature; tenant claims |
| `x-tenant-id` (header) | string | Y | Tenant routing key | `^[A-Za-z0-9._-]{1,128}$` |
| `traceparent` (header) | string | Y | W3C trace context | 55-char format |
| `x-idempotency-key` (header) | string | Y | Dedupe key per tenant | ≤128 ASCII visible |
| `tenantId` | string | Y | Tenant id (body) | Must equal header |
| `schemaVersion` | string | Y | Payload schema id | `auditrecord.v1` (or newer) |
| `auditRecordId` | ULID | N | Client- or server-assigned id | ULID pattern |
| `createdAt` | timestamp | Y | Producer time | ≤ `now + 2m`, ms precision |
| `action` | string | Y | `verb` or `verb.noun` | `^[a-z]+(\.[a-z0-9_-]+)?$` |
| `resource.type` | string | Y | PascalCase dotted type | `^[A-Z][A-Za-z0-9]*(\.[A-Z][A-Za-z0-9]*)*$` |
| `resource.id` | string | Y | Opaque id | ≤128, no spaces |
| `resource.path` | string | N | JSON Pointer | ≤512, normalized |
| `actor.id` | string | Y | Actor identifier | ≤128, no spaces |
| `actor.type` | enum | Y | `Unknown`, `User`, `Service`, `Job` | Enum |
| `actor.display` | string | N | Friendly name | Masked on read |
| `decision.outcome` | enum | N | Access verdict | `Allow`, `Deny`, `NotApplicable`, `Indeterminate` |
| `delta.fields` | map | N | Field changes | ≤256 entries |
| `attributes` | map | N | Extra key/values | ≤64 keys; key/val length |
| `correlation.traceId` | hex | N | Trace id | 32 lowercase hex |
| `correlation.requestId` | string | N | Client request id | ≤128 |
Output Specifications¶
| Field | Type | Description | Notes |
|---|---|---|---|
| `auditRecordId` | ULID | Durable id | Server returns original or assigned |
| `status` | string | `Created` or `Duplicate` | Idempotent semantics |
| `observedAt` | timestamp | Ingestion time | ms precision |
| `traceId` | hex32 | Echo for correlation | From `traceparent` |
| `links.self` | string | Record URL | REST locator |
| `links.operation` | string | Idempotency op URL | Stable outcome resource |
Example Payloads¶
Request
{
"tenantId": "splootvets",
"schemaVersion": "auditrecord.v1",
"createdAt": "2025-10-22T12:00:03.100Z",
"action": "appointment.update",
"resource": { "type": "Vetspire.Appointment", "id": "A-9981", "path": "/status" },
"actor": { "id": "user_123", "type": "User", "display": "A. Smith" },
"decision": { "outcome": "Allow" },
"delta": { "fields": { "status": { "before": "Pending", "after": "Booked" } } },
"attributes": { "client.ip": "203.0.113.42", "client.userAgent": "Mozilla/5.0 ..." },
"correlation": { "traceId": "3e1f2d0c9b8a7f6e5d4c3b2a19081716", "requestId": "req-7a9f" }
}
Response — 202 Accepted
{
"auditRecordId": "01JE7K4J9F9D0S6E7X5Q1A3BCP",
"status": "Created",
"observedAt": "2025-10-22T12:00:03.300Z",
"traceId": "3e1f2d0c9b8a7f6e5d4c3b2a19081716",
"links": {
"self": "/audit/v1/records/01JE7K4J9F9D0S6E7X5Q1A3BCP",
"operation": "/audit/v1/operations/prod-ord-9981-v1"
}
}
Error Handling¶
Error Scenarios¶
| Error Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Schema/clock/format invalid | Fix request; follow details/pointers | Do not retry until corrected |
| 401 | Invalid/missing JWT | Acquire valid token | Retry after re-auth |
| 403 | Tenant forbidden | Correct tenant or permissions | Do not retry |
| 409 | Idempotency conflict (rare) | Reuse same key; inspect `operation` link | Safe retry with same key |
| 413 | Payload > 256 KiB | Reduce size / trim delta | Do not retry until reduced |
| 415 | Wrong media type | Use `application/json` | Retry with correct header |
| 429 | Rate limited/backpressure | Respect `Retry-After` | Exponential backoff + jitter |
| 503 | Storage/Policy unavailable | Transient outage | Exponential backoff + jitter; reuse idempotency key |
Failure Modes¶
- Network Failures: Timeouts, TLS issues → client retries with backoff; preserve `x-idempotency-key`.
- Service Unavailability: Return 503 from Gateway; circuit breaker may open.
- Data Corruption: Validation rejects; Problem+JSON details include `errors[].pointer`.
- Policy Violations: Credentials detected → dropped at write; log `redactionHint`.
Recovery Procedures¶
- Inspect Problem+JSON `type`, `detail`, and `errors[]`.
- For transient failures, retry with the same idempotency key using backoff; honor `Retry-After`.
- For validation failures, correct the payload (see rules), then resubmit.
Performance Characteristics¶
Latency Expectations¶
- P50: 15–25 ms
- P95: ≤ 50 ms
- P99: ≤ 120 ms
- P99.9: ≤ 300 ms (under burst control)
Throughput Limits¶
- Per Tenant (sustain): ~500 rps (edition-dependent)
- Per Tenant (burst): up to 2,000 rps for 60 s
- Global Target: ≥ 50k rps across shards
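The burst/sustain limits above suggest a per-tenant token bucket at the Gateway. The following is a sketch under assumed semantics (refill at the sustain rate, capacity equal to the burst size), not the Gateway's actual implementation; the 500/2,000 figures are the edition-dependent defaults quoted above.

```python
class TokenBucket:
    """Per-tenant limiter: `rate` tokens/s sustained, up to `burst` tokens at once."""

    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst
        self.tokens = burst
        self.last = 0.0

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller maps this to 429 + Retry-After

bucket = TokenBucket(rate=500.0, burst=2000.0)
accepted = sum(bucket.allow(now=0.0) for _ in range(2500))
assert accepted == 2000        # burst drained; remaining 500 requests rejected
assert bucket.allow(now=1.0)   # one second later, ~500 tokens have refilled
```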
Resource Requirements¶
- CPU: Ingestion nodes sized for JSON parse + hashing; vectorized canonicalization where available
- Memory: Payload buffers ≤ 256 KiB × concurrency; header maps
- Network: TLS offload at Gateway or service mesh
- Storage: WAL/redo sized for burst × 2 indexes
Scaling Considerations¶
- Horizontal: Scale Gateway/Ingestion statelessly (HPA/KEDA based on rps/CPU/queue depth)
- Vertical: Rarely needed; prefer horizontal
- Auto-scaling Triggers: rps, p95 latency, queue depth, 429 rate, CPU > 75%
Security & Compliance¶
Authentication¶
- Method: JWT (OIDC); short-lived tokens; clock skew ±60s
- Token Requirements: Audience/service match; tenant claims present
- Session Management: Stateless; no cookies
Authorization¶
- Permissions: Producer role allowed to `audit:append` for `x-tenant-id`
- Tenant Isolation: RLS enforced in Storage/Projections; headers validated at edge
- RBAC: Gateway policy + service layer checks
Data Protection¶
- Transit: TLS 1.2+; HSTS at edge
- At Rest: DB/storage encryption; key management via KMS
- PII Handling: Write-time classification/redaction; credentials dropped; personal/sensitive masked/hashed
Compliance¶
- GDPR/HIPAA/SOC2: Audit trail of who appended; immutable WORM; data subject exports via Export flows
Monitoring & Observability¶
Key Metrics¶
| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| `ingest_requests_total` | counter | Count of POSTs | Anomaly vs. baseline |
| `ingest_latency_ms` | histogram | End-to-end latency | p95 > 50 ms (5m) |
| `ingest_payload_bytes` | histogram | Payload sizes | p90 approaching 256 KiB |
| `ingest_rate_limited_total` | counter | 429 responses | Spike > 5% |
| `storage_errors_total` | counter | 5xx from Storage | > 0.5% |
| `policy_eval_latency_ms` | histogram | Policy call latency | p95 > 30 ms |
Logging Requirements¶
- Structured JSON logs; include `tenantId`, `auditRecordId`, `traceId`, `idempotencyKey` (hash)
- Mask personal/sensitive values; never log raw credentials
Distributed Tracing¶
- Propagate `traceparent`; spans: `ingest.request`, `ingest.validate`, `ingest.append`, `policy.evaluate`
- Span attrs: `tenant`, `payloadBytes`, `status`, `dedupe="Created|Duplicate"`
Health Checks¶
- Liveness: process heartbeats
- Readiness: downstream (Policy/Storage) probes with budgets
- Dependency: Registry reachability, KMS if signing on write (rare)
Operational Procedures¶
Deployment¶
- Deploy/roll Gateway and Ingestion behind feature flag `audit.ingest.enabled=false`
- Warm caches (schema, policy); run smoke POST against canary
- Flip flag, ramp traffic using traffic splitting (e.g., 10% → 50% → 100%)
Configuration¶
- Env Vars: `RATE_BURST`, `RATE_SUSTAIN`, `MAX_PAYLOAD_BYTES=262144`
- Config: Policy endpoint base URL; schema registry URL
- Feature Flags: Sealing on write (usually off), request verification levels
Maintenance¶
- Rotate tokens/keys; tune rate limits; review metrics for near-limit payloads
Troubleshooting¶
- High 400s → inspect Problem+JSON pointers
- High 429s → increase tenant buckets or advise producers to back off
- 5xx spikes → check Storage/Policy dependency health, breaker state
Testing Scenarios¶
Happy Path Tests¶
- Accept minimal valid record; returns `202` with `status:"Created"`
- With server-assigned ULID; returns new `auditRecordId`
- Duplicate `x-idempotency-key` returns `status:"Duplicate"`
Error Path Tests¶
- `action.invalid` → 400 with pointer `/action`
- Payload over 256 KiB → 413
- Missing/invalid JWT → 401; forbidden tenant → 403
- Rate limit exceeded → 429 with `Retry-After`
Performance Tests¶
- Sustain 500 rps per tenant; p95 < 50 ms
- Burst 2k rps per tenant for 60s without error inflation
- Large but valid payload near limit; still < 50 ms p95
Security Tests¶
- Credential key in attributes is dropped/redacted
- PII masked on read paths (verify via downstream Query)
- Multi-tenant isolation (no cross-tenant access)
Related Documentation¶
Internal References¶
Related Flows¶
External References¶
- RFC 7807 (Problem Details for HTTP APIs)
- W3C Trace Context (traceparent)
Appendices¶
A. Configuration Examples¶
- NGINX/L7 snippet to pass through `traceparent`, `x-tenant-id`, `x-idempotency-key`
B. Troubleshooting Guide¶
- Decision tree for 4xx vs 5xx vs 429 responses
C. Performance Benchmarks¶
- Latest load test summary attached in CI artifacts
D. Security Checklist¶
- No secrets logged
- Masking rules applied on read
- RLS enforced in all queries
Batch Audit Record Ingestion Flow¶
Efficient bulk ingest of many AuditRecord items using multipart upload or presigned object storage. The Gateway creates a batch job, the client uploads JSONL (optionally gzip), and an Ingestion Batch Worker validates, canonicalizes, and appends records to the WORM store with partial-failure reporting, chunking, and resume tokens.
Overview¶
Purpose: Move large volumes of audit facts into ATP reliably and cost-effectively with resumable uploads and per-record error isolation.
Scope: REST orchestration for batch jobs, uploads (multipart or presigned URLs), background processing, partial failures, status polling, and completion artifacts. Excludes online single-record ingest and streaming bus pipelines.
Context: Preferred for backfills, partner dumps, and nightly loads. Downstream, projections and integrity run asynchronously as with standard ingestion.
Key Participants:
- Client (uploader)
- API Gateway (job control, presigned URLs, limits)
- Object Storage (S3/GCS/Azure Blob; optional path)
- Ingestion Batch Worker (validate/canonicalize/process chunks)
- Storage (Authoritative) (WORM append)
- Integrity Service (hash/segment/block sealing; async)
- Projection Service (read models; async)
Prerequisites¶
System Requirements¶
- API Gateway, Batch Worker, Storage, Integrity, Projection online
- TLS end-to-end; object storage reachable from workers
- IdP configured; JWT audience for Gateway set
- Schema Registry reachable by workers
Business Requirements¶
- Tenant active; residency/edition configured
- Classification/redaction & retention policies published
- Legal holds indexed (affects lifecycle, not write)
Performance Requirements¶
- Chunk size and worker parallelism tuned (defaults below)
- Storage capacity sized for expected peak insert rate
- Backpressure thresholds configured (429/503 policies)
Sequence Flow¶
Happy Path¶
sequenceDiagram
autonumber
actor Client
participant Gateway as API Gateway
participant Store as Object Storage
participant Batch as Ingestion Batch Worker
participant Storage as Storage (Authoritative)
participant Projection as Projection Service
participant Integrity as Integrity Service
Client->>Gateway: POST /audit/v1/batches { manifest, strategy }
Gateway-->>Client: ↩ 202 { batchId, uploadPlan, resumeToken }
alt Presigned strategy
Client->>Store: PUT parts to presigned URLs (JSONL[.gz])
Client->>Gateway: POST /audit/v1/batches/{batchId}:finalize
else Multipart strategy
Client->>Gateway: POST /audit/v1/batches/{batchId}/upload (multipart)
end
Gateway-->>Batch: event Batch.Created { batchId, objectUris }
Batch->>Batch: Plan chunks (e.g., 5k recs or 16 MiB)
loop Each chunk
Batch->>Store: READ chunk bytes (stream)
Batch->>Batch: Validate & canonicalize each JSONL line
Batch->>Storage: INSERT valid AuditRecord rows (idempotent)
Batch-->>Batch: Record per-line status, advance resumeToken
end
par Async fan-out for accepted rows
Storage-->>Projection: AuditRecord.Accepted
Storage-->>Integrity: enqueue leaf → segment
end
Batch-->>Gateway: status { processed, succeeded, failed, resumeToken }
Gateway-->>Client: ↩ 200/202 GET /batches/{id}/status
Note over Batch,Client: Completion → summary + downloadable error report for failed lines
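The chunk-planning step ("e.g., 5k recs or 16 MiB") might look like the following greedy planner. Both caps are the defaults from this flow; the algorithm itself is an assumption — any split that respects both limits would do.

```python
def plan_chunks(line_sizes: list[int], max_records: int = 5000,
                max_bytes: int = 16 * 2**20) -> list[list[int]]:
    """Greedily split JSONL line indexes into chunks that respect both caps."""
    chunks, current, current_bytes = [], [], 0
    for idx, size in enumerate(line_sizes):
        over = len(current) >= max_records or current_bytes + size > max_bytes
        if current and over:
            chunks.append(current)       # close the current chunk
            current, current_bytes = [], 0
        current.append(idx)
        current_bytes += size
    if current:
        chunks.append(current)
    return chunks

# 12,000 equally sized small lines → split by the record cap: 5,000 + 5,000 + 2,000
chunks = plan_chunks([200] * 12_000)
assert [len(c) for c in chunks] == [5000, 5000, 2000]
# For large lines the byte cap kicks in first: 8 MiB lines → 2 per chunk
big = plan_chunks([8 * 2**20] * 5)
assert [len(c) for c in big] == [2, 2, 1]
```

Chunk boundaries double as resume points: the worker advances the `resumeToken` only after a whole chunk is durably recorded.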
Alternative Paths¶
- Continue-on-error: Process full batch; emit per-line errors; job ends `CompletedWithFailures`.
- Halt-on-threshold: Stop when `failed/processed ≥ threshold` (e.g., 5%); job `Aborted`.
- Resume: Client provides `resumeToken`; worker skips processed chunks.
- Single-URL manifest: Gateway returns one upload URL; worker enumerates parts by convention.
Error Paths¶
sequenceDiagram
actor Client
participant Gateway as API Gateway
participant Batch as Ingestion Batch Worker
Client->>Gateway: POST /audit/v1/batches { manifest }
alt Invalid manifest
Gateway-->>Client: ↩ 400 Problem+JSON (manifest.invalid)
else Failure threshold exceeded
Batch-->>Gateway: status { state:"Aborted", reason:"FailureThreshold" }
Gateway-->>Client: ↩ 409 Problem+JSON + link:errorReport
else Storage unavailable
Batch-->>Gateway: status { state:"Retrying", backoff:"exponential" }
Gateway-->>Client: ↩ 503 on status until recovery
end
Request/Response Specifications¶
Input Requirements¶
| Field | Type | Required | Description | Validation |
|---|---|---|---|---|
| `Authorization` (header) | string | Y | Bearer JWT | Valid signature; tenant claim |
| `x-tenant-id` (header) | string | Y | Tenant routing | `^[A-Za-z0-9._-]{1,128}$` |
| `traceparent` (header) | string | Y | W3C trace context | 55-char format |
| `x-idempotency-key` (header) | string | Y | Job creation dedupe | ≤128 ASCII |
| `strategy` | enum | Y | `Presigned` or `Multipart` | Enum |
| `manifest.files[]` | array | Y | Object URIs or file descriptors | ≤256 files |
| `manifest.format` | enum | Y | `Jsonl` or `JsonlGzip` | Enum |
| `manifest.schemaVersion` | string | Y | Expected schema | e.g., `auditrecord.v1` |
| `options.chunk.maxRecords` | int | N | Records per chunk | 1–10,000 (default 5,000) |
| `options.chunk.maxBytes` | int | N | Bytes per chunk | 1–32 MiB (default 16 MiB) |
| `options.failure.mode` | enum | N | `Continue`/`HaltOnThreshold` | Default `Continue` |
| `options.failure.threshold` | number | N | 0.0–1.0 | Default 0.05 |
| `options.parallelism` | int | N | Worker concurrency | 1–32 (edition gated) |
Output Specifications¶
| Field | Type | Description | Notes |
|---|---|---|---|
| `batchId` | ULID | Batch identifier | Returned on create |
| `uploadPlan` | object | Presigned URLs or upload endpoints | May include part sizes |
| `resumeToken` | string | Opaque position token | For resume |
| `state` | enum | `Created`, `Uploading`, `Processing`, `Retrying`, `Completed`, `CompletedWithFailures`, `Aborted`, `Failed` | From status API |
| `counters` | object | `{processed,succeeded,failed,bytesRead}` | Status API |
| `errorReport` | url | Download failed-lines report | On completion/abort |
Example Payloads¶
Create batch (presigned)
{
"strategy": "Presigned",
"manifest": {
"format": "JsonlGzip",
"schemaVersion": "auditrecord.v1",
"files": [
{ "name": "part-0001.jsonl.gz", "sizeBytes": 104857600 },
{ "name": "part-0002.jsonl.gz", "sizeBytes": 83886080 }
]
},
"options": {
"chunk": { "maxRecords": 5000, "maxBytes": 16777216 },
"failure": { "mode": "Continue", "threshold": 0.05 },
"parallelism": 8
}
}
Create response
{
"batchId": "01JE8A3GZ8X0E9K3N5R6V7B8C9",
"uploadPlan": {
"presigned": [
{ "name": "part-0001.jsonl.gz", "method": "PUT", "url": "https://store/..." },
{ "name": "part-0002.jsonl.gz", "method": "PUT", "url": "https://store/..." }
]
},
"resumeToken": "r-01je8a3g-0000"
}
Status response
{
"batchId": "01JE8A3GZ8X0E9K3N5R6V7B8C9",
"state": "CompletedWithFailures",
"counters": { "processed": 180000, "succeeded": 176400, "failed": 3600, "bytesRead": 183500800 },
"resumeToken": "r-01je8a3g-ffff",
"errorReport": "/audit/v1/batches/01JE8A3G.../errors?profile=Safe"
}
Error Handling¶
Error Scenarios¶
| Error Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Invalid manifest/options | Fix payload (schema, limits) | No retry until corrected |
| 401/403 | AuthN/Z failure | Acquire token / permissions | Retry after fix |
| 409 | Duplicate `x-idempotency-key` | Use status endpoint / operation link | Safe to reuse key |
| 413 | Part too large | Reduce part size | Re-upload affected part |
| 422 | Failure threshold exceeded | Inspect error report; fix data | New batch recommended |
| 429 | Gateway/worker backpressure | Honor `Retry-After`; slow uploads | Exponential backoff + jitter |
| 503 | Storage/object store unavailable | Wait for recovery | Workers auto-retry chunks |
Failure Modes¶
- Line-level validation failures: recorded as `{line, pointer, reason}`; good lines continue.
- Chunk retry: transient errors → chunk-level retries with capped attempts.
- Poison lines: after N retries, line written to dead-letter file in the error report.
Recovery Procedures¶
- GET status; if `CompletedWithFailures`, download `errorReport`.
- Fix rejected lines; re-upload as new batch or incremental patch.
- If `Aborted` due to threshold, pre-clean data or lower the threshold; start a new batch.
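The recovery steps can be sketched as a status-polling loop. `poll` below stands in for `GET /audit/v1/batches/{id}/status`; a real client would sleep with backoff between polls rather than spin.

```python
def drive_batch(poll):
    """Poll until a terminal state; return the error report URL when the job
    completed with failures, else None."""
    terminal = {"Completed", "CompletedWithFailures", "Aborted", "Failed"}
    while True:
        status = poll()
        if status["state"] in terminal:
            break
    if status["state"] == "CompletedWithFailures":
        return status["errorReport"]  # caller downloads and triages failed lines
    return None

# Simulated status-API responses for one batch run.
responses = iter([
    {"state": "Processing"},
    {"state": "Retrying"},
    {"state": "CompletedWithFailures",
     "errorReport": "/audit/v1/batches/01JE8A3GZ8X0E9K3N5R6V7B8C9/errors"},
])
assert drive_batch(lambda: next(responses)) == \
    "/audit/v1/batches/01JE8A3GZ8X0E9K3N5R6V7B8C9/errors"
```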
Performance Characteristics¶
Latency Expectations¶
- Job creation: ~10–50 ms
- Per-chunk processing: target ≤ 2 s for 5k records
- End-to-end: proportional to data volume and parallelism
Throughput Limits¶
- Worker ingest: ≥ 3k rps per shard sustained (shared with online writes)
- Per-job parallelism: default 8 chunks in flight (edition gated)
- Upload: presigned PUT up to provider limits; prefer 8–16 MiB parts
Resource Requirements¶
- CPU: JSON parse + hashing; concurrency N × vCPU
- Memory: streaming parse; per-chunk buffers (≤ 16–32 MiB each)
- Network: high egress from object store to workers; colocate where possible
- Storage: WAL sized for burst; keep secondary indexes minimal on authoritative store
Scaling Considerations¶
- Horizontal: scale workers by queue depth and chunk latency
- Auto-scaling triggers: backlog age, running jobs, p95 chunk duration, CPU > 75%
- Backpressure: workers advertise capacity; Gateway throttles create/upload
Security & Compliance¶
Authentication¶
- JWT (OIDC) to create/manage batches; presigned URLs for object store writes (scoped, short-lived).
Authorization¶
- Require `audit:batch:create` for tenant; status and error report scoped to same tenant and batch.
Data Protection¶
- Transit: TLS 1.2+; presigned HTTPS only
- At Rest: object storage + DB encryption; server-side KMS keys
- PII: same write-time classification/redaction as standard ingest (no raw credentials persisted)
Compliance¶
- Batch operations are audited: who created, uploaded, resumed, and downloaded error reports.
Monitoring & Observability¶
Key Metrics¶
| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| `batch_created_total` | counter | Batches created | Anomaly vs baseline |
| `batch_records_processed_total` | counter | Lines processed | Drops or stalls |
| `batch_failures_total` | counter | Per-line rejects | > 2% sustained |
| `batch_chunk_latency_ms` | histogram | Chunk processing time | p95 > 2 s |
| `batch_inflight_jobs` | gauge | Active batches | Capacity saturation |
| `batch_bytes_read` | counter | Input bytes | Sudden spikes |
Logging Requirements¶
- Structured logs with `batchId`, `lineNo`, error `pointer`, `reason`; mask sensitive values.
Distributed Tracing¶
- Root span `batch.create`; child spans per chunk (`batch.process.chunk`) including `chunkId`, `records`, `bytes`.
Health Checks¶
- Readiness includes object store access, Storage connectivity, Schema Registry reachability.
Operational Procedures¶
Deployment¶
- Roll out Batch Worker with feature flag `audit.batch.enabled=false`.
- Validate presigned URL issuance in non-prod.
- Enable flag; ramp per-tenant concurrency caps.
Configuration¶
- Env Vars: `BATCH_MAX_PARALLELISM`, `BATCH_CHUNK_MAX_BYTES`, `BATCH_CHUNK_MAX_RECORDS`, `BATCH_FAILURE_THRESHOLD`
- Storage: connection pools sized for concurrent inserts
- Object Store: bucket/container, lifecycle policy for temp uploads and error reports
Maintenance¶
- Periodic cleanup of stale, incomplete batches and expired presigned URLs.
- Rotate KMS keys as per policy.
Troubleshooting¶
- High `batch_failures_total` → download error report; inspect common `pointer` values.
- Slow chunks → reduce chunk size or increase parallelism; check DB bottlenecks.
- Frequent 503 → verify storage health and worker retry logs.
Testing Scenarios¶
Happy Path Tests¶
- Create presigned batch; upload two parts; completion with zero failures
- Multipart upload success path with server parsing
- Resume from `resumeToken` after intentional worker restart
Error Path Tests¶
- Invalid manifest → 400 with pointer to failing field
- Failure threshold exceeded → job `Aborted`, 409 on finalize
- Object store permission denied → 403 on PUT, recover with new presigned URL
Performance Tests¶
- 100M records across 20 files; verify throughput and stability
- Chunk size sweep (4–32 MiB) to tune p95
- Parallel jobs from multiple tenants without starvation
Security Tests¶
- Presigned URL expiry respected; uploads fail after TTL
- Error report redacts/masks PII appropriately
- Tenant isolation—no cross-tenant batch visibility
Related Documentation¶
Internal References¶
- Standard Audit Record Ingestion Flow
- Data Model
- Performance & Size Budgets
- Validation, Limits & Canonicalization
- REST APIs
Related Flows¶
- gRPC Service Ingestion Flow
- Service Bus (MassTransit) Ingestion Flow
- Audit Record Projection Update Flow
External References¶
- Provider docs for presigned URLs (S3/GCS/Azure Blob)
- RFC 7231 (HTTP semantics) for 202/409/413 usage
Appendices¶
A. Minimal JSONL Example (uncompressed)¶
{"tenantId":"acme","schemaVersion":"auditrecord.v1","createdAt":"2025-10-22T12:00:00.000Z","action":"user.create","resource":{"type":"Iam.User","id":"U-1"},"actor":{"id":"svc_gw","type":"Service"}}
{"tenantId":"acme","schemaVersion":"auditrecord.v1","createdAt":"2025-10-22T12:00:01.000Z","action":"appointment.update","resource":{"type":"Vetspire.Appointment","id":"A-2"},"actor":{"id":"user_123","type":"User"},"delta":{"fields":{"status":{"before":"Pending","after":"Booked"}}}}
B. Error Report Schema (per-line)¶
{
"batchId": "01JE8A3GZ8X0E9K3N5R6V7B8C9",
"summary": { "processed": 100000, "succeeded": 98400, "failed": 1600 },
"errors": [
{ "line": 42, "pointer": "/action", "reason": "regex", "code": "action.invalid", "rawSnippet": "..." }
]
}
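Operators triaging a failed batch typically start from this error report. A minimal sketch of that triage step (the `summarize_error_report` helper is hypothetical, not part of the platform) that groups per-line errors by pointer and code to surface systematic producer bugs:

```python
import json
from collections import Counter

def summarize_error_report(lines):
    """Group per-line errors from a batch error report by JSON pointer.

    `lines` is an iterable of JSON strings shaped like the schema above.
    Returns the most common (pointer, code) pairs, which is usually
    enough to spot a systematic producer bug.
    """
    counts = Counter()
    for line in lines:
        report = json.loads(line)
        for err in report.get("errors", []):
            counts[(err["pointer"], err["code"])] += 1
    return counts.most_common()

report = ('{"batchId":"01JE8A3GZ8X0E9K3N5R6V7B8C9",'
          '"summary":{"processed":3,"succeeded":1,"failed":2},'
          '"errors":[{"line":1,"pointer":"/action","reason":"regex","code":"action.invalid"},'
          '{"line":2,"pointer":"/action","reason":"regex","code":"action.invalid"}]}')
print(summarize_error_report([report]))
# → [(('/action', 'action.invalid'), 2)]
```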
C. Resume Token Example¶
Audit Record Validation & Classification Flow¶
Applies schema/limits validation, canonicalization, and policy-driven classification & redaction before persisting an AuditRecord. Ensures deterministic normalization, consistent privacy posture, and auditable decisions that accompany the record through its lifecycle.
Overview¶
Purpose: Validate and normalize incoming audit facts, classify data sensitivity, and apply redaction actions prior to append.
Scope: Ingestion-time validation/canonicalization, policy evaluation, classification flags, redaction (drop/mask/hash/tokenize), decision auditing. Excludes post-read masking (covered in Query flows) and integrity/projection specifics.
Context: Runs during Standard/Batch ingestion just before the authoritative append. Outputs include normalized payload, DataClass flags, RedactionHints, and a policy decision trail.
Key Participants:
- Ingestion Service (validator/canonicalizer/orchestrator)
- Schema Registry (JSON Schema/contract resolution)
- Policy Service (classification & redaction policy)
- Classification Engine (PII/secret detectors, patterns)
- Redaction Service (hash/mask/tokenize/drop transforms)
- Storage (Authoritative) (WORM append with decision audit)
Prerequisites¶
System Requirements¶
- Ingestion reachable to Schema Registry and Policy endpoints
- Policy/Classification/Redaction services healthy (or cached policy available)
- Clock sync within ±60s (for timestamp validations)
- TLS enabled; service identities trusted
Business Requirements¶
- Tenant active; edition/residency known (affects policy set)
- Current Policy revision published; cache TTL configured
- Data classification catalog aligned with Data Model
Performance Requirements¶
- Validation + policy evaluation p95 ≤ 30 ms per record
- Classification engine p95 ≤ 10 ms for typical payloads
- End-to-end ingest validation budget p95 ≤ 50 ms
Sequence Flow¶
Happy Path¶
sequenceDiagram
autonumber
participant Ingestion as Ingestion Service
participant Registry as Schema Registry
participant Policy as Policy Service
participant Classify as Classification Engine
participant Redact as Redaction Service
participant Storage as Storage (Authoritative)
Ingestion->>Registry: Resolve schema (auditrecord.v1)
Registry-->>Ingestion: ↩ schema (cacheable)
Ingestion->>Ingestion: Structural validate + limits (size, clocks)
Ingestion->>Ingestion: Canonicalize (strings NFC, action, resource.path)
Ingestion->>Policy: Evaluate(tenant, edition, payload summary)
Policy-->>Ingestion: ↩ decision {classes, actions, revision, basis:"Live"}
Ingestion->>Classify: Detect PII/Secrets (hints, patterns)
Classify-->>Ingestion: ↩ findings {keys, types, confidence}
Ingestion->>Redact: Apply(actions, findings) → transform fields
Redact-->>Ingestion: ↩ normalized payload + redactionHints
Ingestion->>Storage: INSERT payload + {classes, redactionHints, policyRevision}
Storage-->>Ingestion: ↩ ack {auditRecordId}
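The canonicalize step in the happy path above can be sketched as follows. This is an illustrative fragment only — the `canonicalize` helper is hypothetical and shows just NFC string normalization, action lowercasing, and resource-id trimming; the real pass also enforces limits and normalizes `resource.path`:

```python
import unicodedata

def canonicalize(record: dict) -> dict:
    """Sketch of the ingestion canonicalization pass (not the real service)."""
    out = dict(record)
    # Normalize to NFC and lowercase the action verb (e.g., "User.Create" -> "user.create").
    out["action"] = unicodedata.normalize("NFC", record["action"]).lower()
    # Trim stray whitespace around opaque resource ids.
    resource = dict(record["resource"])
    resource["id"] = resource["id"].strip()
    out["resource"] = resource
    return out

raw = {"action": "User.Create", "resource": {"type": "Iam.User", "id": " U-1001 "}}
print(canonicalize(raw))
# → {'action': 'user.create', 'resource': {'type': 'Iam.User', 'id': 'U-1001'}}
```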
Alternative Paths¶
- Cached policy: If Policy unavailable, use last-known decision template (`basis:"Cached"`) with TTL; record basis in decision trail.
- Dry-run mode: Apply classification only; annotate recommended actions without mutating payload (used in partner onboarding).
- Producer hints: Producer supplies `dataClass` hints; engine verifies/augments but never downgrades sensitivity.
Error Paths¶
sequenceDiagram
participant Ingestion as Ingestion Service
participant Registry as Schema Registry
participant Policy as Policy Service
Ingestion->>Registry: Resolve schema
alt Schema mismatch/invalid
Registry-->>Ingestion: ↩ error(schema.invalid)
Ingestion-->>Client: ↩ 400 Problem+JSON (pointers)
else Policy hard outage and no cache
Ingestion->>Policy: Evaluate(...)
Policy-->>Ingestion: ↩ 503
Ingestion-->>Client: ↩ 503 Problem+JSON (retry with idempotency)
end
Request/Response Specifications¶
This flow executes inside ingestion. External interfaces (e.g., REST `/audit/v1/records`) are shown for the fields pertinent to validation & classification.
Input Requirements¶
| Field | Type | Required | Description | Validation |
|---|---|---|---|---|
| `schemaVersion` | string | Y | Payload contract id | Known & active in Registry |
| `createdAt` | timestamp | Y | Producer clock | ISO-8601 UTC, ms; ≤ now+2m |
| `effectiveAt` | timestamp | N | Effect time | ≤ `createdAt` |
| `action` | string | Y | verb or `verb.noun` | `^[a-z]+(\.[a-z0-9_-]+)?$` |
| `resource.type` | string | Y | Dotted PascalCase type | `^[A-Z][A-Za-z0-9]*(\.[A-Z][A-Za-z0-9]*)*$` |
| `resource.id` | string | Y | Opaque id | ≤128, visible ASCII |
| `resource.path` | string | N | JSON Pointer | normalized, ≤512 |
| `actor.id` | string | Y | Actor identifier | ≤128 |
| `actor.type` | enum | Y | `Unknown \| User \| Service \| Job` | Enum |
| `attributes.*` | map | N | Extra k/v pairs | ≤64 keys; key≤64, val≤1024 |
| `delta.fields` | map | N | Field-level changes | ≤256 entries |
| `correlation.traceId` | hex | N | Trace correlation | 32 lowercase hex |
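The format rules in the Validation column map directly onto regular-expression checks. A minimal sketch — the `violations` helper is illustrative, returning JSON Pointers in the style used by Problem+JSON error responses:

```python
import re

# Patterns copied from the input-requirements table above.
ACTION_RE = re.compile(r"^[a-z]+(\.[a-z0-9_-]+)?$")
RESOURCE_TYPE_RE = re.compile(r"^[A-Z][A-Za-z0-9]*(\.[A-Z][A-Za-z0-9]*)*$")

def violations(record: dict) -> list:
    """Return JSON Pointers for fields that fail format checks (sketch only)."""
    out = []
    if not ACTION_RE.match(record.get("action", "")):
        out.append("/action")
    if not RESOURCE_TYPE_RE.match(record.get("resource", {}).get("type", "")):
        out.append("/resource/type")
    return out

print(violations({"action": "User.Create", "resource": {"type": "Iam.User"}}))
# → ['/action']
```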
Output Specifications¶
| Field | Type | Description | Notes |
|---|---|---|---|
| `normalizedPayload` | object | Canonical JSON after transforms | JCS canonical form |
| `classes` | bitset/array | DataClass flags | e.g., `Personal \| Sensitive` |
| `redactionHints[]` | array | Where/why redacted | `{ pointer, action }` |
| `policyRevision` | string | Policy rev used | `rev-YYYYMMDD-n` |
| `policyBasis` | enum | `Live \| Cached \| DryRun` | Audit of basis |
| `violations[]` | array | Validation/policy errors | For 4xx generation |
Example Payloads¶
Input (pre-normalization)
{
"schemaVersion": "auditrecord.v1",
"createdAt": "2025-10-22T12:00:03.100Z",
"action": "User.Create",
"resource": { "type": "Iam.User", "id": " U-1001 ", "path": "/name" },
"actor": { "id": "svc_gw", "type": "Service", "display": "ingress-gw" },
"attributes": {
"email": "alice@example.com",
"password": "hunter2",
"client.ip": "2001:db8::1"
}
}
Normalized + decision (stored)
{
"schemaVersion": "auditrecord.v1",
"createdAt": "2025-10-22T12:00:03.100Z",
"action": "user.create",
"resource": { "type": "Iam.User", "id": "U-1001", "path": "/name" },
"actor": { "id": "svc_gw", "type": "Service", "display": "ingress-gw" },
"attributes": {
"email": "sha256:2c26b46b68ffc68ff99b453c1d304134",
"client.ip": "2001:db8::/64"
},
"_decision": {
"classes": ["Personal", "Sensitive"],
"redactionHints": [
{ "pointer": "/attributes/password", "action": "Drop" },
{ "pointer": "/attributes/email", "action": "Hash" },
{ "pointer": "/attributes/client.ip", "action": "Mask" }
],
"policyRevision": "rev-20251022-1",
"policyBasis": "Live"
}
}
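The `_decision.redactionHints` above can be replayed mechanically against the payload. A sketch, assuming hints target keys in the `attributes` map; real JSON Pointer resolution, salt/pepper handling, and the `Mask`/`Tokenize` transforms are elided:

```python
import hashlib

def apply_redactions(attributes: dict, hints: list) -> dict:
    """Apply Drop/Hash hints to an attributes map (illustrative sketch)."""
    out = dict(attributes)
    for hint in hints:
        # e.g. "/attributes/email" -> "email"; real code resolves full pointers.
        key = hint["pointer"].rsplit("/", 1)[-1]
        if key not in out:
            continue
        if hint["action"] == "Drop":
            del out[key]                      # secrets never persisted
        elif hint["action"] == "Hash":
            out[key] = "sha256:" + hashlib.sha256(out[key].encode()).hexdigest()
    return out

attrs = {"email": "alice@example.com", "password": "hunter2"}
hints = [{"pointer": "/attributes/password", "action": "Drop"},
         {"pointer": "/attributes/email", "action": "Hash"}]
redacted = apply_redactions(attrs, hints)
print(sorted(redacted))
# → ['email']
```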
Error Handling¶
Error Scenarios¶
| Error Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Schema/shape invalid | Fix payload per pointers | No retry until corrected |
| 400 | Limits exceeded (size/keys/delta) | Reduce payload size/keys | No retry until corrected |
| 422 | Policy violation (forbidden fields) | Remove/transform offending fields | Retry after fix |
| 503 | Policy/Registry unavailable & no cache | Wait for recovery | Retry with same idempotency key |
| 409 | Policy revision conflict (rare) | Resubmit; server reconciles | Safe retry (idempotent) |
Failure Modes¶
- Secret detected: Field dropped; hint recorded; no write-time failure unless configured “fail-closed”.
- Classifier ambiguity: Lowest-risk action chosen (mask/hash) and flagged for review.
- Cache staleness: Decision marked `basis:"Cached"`; async audit triggers re-eval if needed.
Recovery Procedures¶
- If 4xx, inspect Problem+JSON `errors[].pointer` and correct data.
- If 503, retry with backoff; preserve idempotency key.
- If repeated classifier ambiguities, update policy patterns; redeploy.
Performance Characteristics¶
Latency Expectations¶
- Validation + Canonicalization: p95 ≤ 20 ms
- Policy Evaluation: p95 ≤ 30 ms (local cache hit ≤ 5 ms)
- Classification/Redaction: p95 ≤ 10 ms typical payloads
Throughput Limits¶
- Designed to sustain the same per-tenant ingest targets as Standard Ingestion (e.g., 500 rps), bounded by policy eval capacity.
Resource Requirements¶
- CPU for JSON parsing and pattern matching; memory for small transient field buffers (< 512 KiB).
- Optional vectorized hashing for tokenization.
Scaling Considerations¶
- Scale Ingestion horizontally; cache policy decisions per-tenant.
- Separate classifier pool if heavy patterns enabled.
Security & Compliance¶
Authentication¶
- mTLS/service identity between Ingestion and Policy/Classification/Redaction services.
Authorization¶
- Ingestion authorized to access tenant-scoped policies only.
Data Protection¶
- Secrets never persisted; PII transformed per policy before write.
- Hashing uses approved algorithms (e.g., SHA-256 with salt/pepper policy where applicable).
Compliance¶
- Decision trail persisted (`policyRevision`, `policyBasis`, `redactionHints`) to support audits and DSAR exports.
Monitoring & Observability¶
Key Metrics¶
| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| `validation_failures_total` | counter | Number of 4xx validations | Spike > baseline |
| `policy_eval_latency_ms` | histogram | Policy call latency | p95 > 30 ms |
| `redactions_applied_total` | counter | Actions applied | Sudden drop (policy drift) |
| `classified_records_total` | counter | Records with classes | Monotonic vs ingest |
| `cached_policy_decisions_total` | counter | Cached-basis uses | > 5% sustained |
Logging Requirements¶
- Structured logs include `tenantId`, `auditRecordId` (if available), `policyRevision`, `policyBasis`, and summarized `redactionHints` (no raw data).
Distributed Tracing¶
- Spans: `ingest.validate`, `policy.evaluate`, `classify.detect`, `redact.apply`; attributes: `tenant`, `payloadBytes`, `basis`.
Health Checks¶
- Readiness checks: Registry reachability, Policy cache warmness, classifier models loaded.
Operational Procedures¶
Deployment¶
- Deploy Ingestion with feature flags `policy.eval.enabled=true`, `redaction.apply.enabled=true`.
- Warm policy cache for top tenants; prefetch schema versions.
- Flip traffic gradually and watch latency/4xx/5xx rates.
Configuration¶
- Env Vars: `POLICY_BASE_URL`, `POLICY_CACHE_TTL`, `CLASSIFIER_TIMEOUT_MS`, `REDACTION_MODE` (Apply|DryRun).
- Patterns: versioned classifier pattern sets per tenant/edition.
Maintenance¶
- Rotate hashing salts/peppers per schedule; invalidate caches.
- Refresh classifier patterns as policies evolve.
Troubleshooting¶
- High 400s: inspect pointers; verify schema version drift.
- High cached-basis usage: Policy outage or network; check health and TTLs.
- Unexpected PII in reads: verify redaction applied and read-profile masking.
Testing Scenarios¶
Happy Path Tests¶
- Valid payload normalized; policy `Live`; redactions applied; append succeeds
- Producer hints merged; never downgrade sensitivity
- Cached policy basis used during brief outage; append still succeeds
Error Path Tests¶
- Schema validation failure → 400 with pointers
- Forbidden field by policy → 422 with pointer
- Policy outage with empty cache → 503
Performance Tests¶
- p95 validation+policy ≤ 50 ms at 500 rps/tenant
- Classifier throughput with large attributes maps
Security Tests¶
- Secrets dropped, not logged
- PII hashing/tokenization conforms to policy (golden samples)
- Authorization scoping of policy endpoints
Related Documentation¶
Internal References¶
- Data Model
- Privacy & PII Inventory
- Validation, Limits & Canonicalization
- Standard Audit Record Ingestion Flow
Related Flows¶
- Batch Audit Record Ingestion Flow
- Data Redaction Flow (Read)
External References¶
- RFC 8785 (JSON Canonicalization Scheme)
- W3C Trace Context (for correlation)
Appendices¶
A. Common Validation Rules (excerpt)¶
- No NaN/Infinity; UTF-8, strings normalized to NFC; key set size ≤ 64; payload ≤ 256 KiB.
B. DataClass Examples¶
`Personal`: name, email; `Sensitive`: secrets, tokens; `Operational`: IP/UA.
C. Redaction Actions¶
`Drop` (remove), `Mask` (partial), `Hash` (one-way), `Tokenize` (reversible, vault-backed).
Audit Record Integrity Chain Flow¶
Creates a tamper-evidence chain for accepted audit facts. Each persisted AuditRecord becomes a leaf hash, batched into segments (Merkle trees), then sealed into blocks signed by KMS. Proof artifacts are written to the Evidence Store, a reference is attached to the record, and Integrity.ProofComputed is emitted.
Overview¶
Purpose: Guarantee immutability-at-rest by linking records into signed, verifiable chains with exportable proofs.
Scope: Post-append integrity processing: leaf hashing, segment buffering, Merkle root computation, block sealing/signing, evidence persistence, record back-reference, and event publication. Excludes verify-on-read (covered in a separate flow).
Context: Runs asynchronously after AuditRecord.Accepted. Segments seal on size/age thresholds. Blocks form a forward-only chain with PrevBlockRoot.
Key Participants:
- Storage (Authoritative) — source of accepted records
- Integrity Service — orchestrates hashing, sealing, signing
- KMS — signs block headers; manages key rotation
- Evidence Store — durable proofs (segments/blocks/manifests)
- Projection Service — indexes proof refs for reads/search (optional)
- Event Bus — publishes `Integrity.ProofComputed`
Prerequisites¶
System Requirements¶
- Integrity workers online; access to Storage and Evidence Store
- KMS key (current + optional previous for dual-verify window) available
- Time sync within ±60s across services
- Reliable message delivery from Storage to Integrity
Business Requirements¶
- Tenant configured with integrity policy (segment size/age, edition/residency)
- Retention rules do not remove proofs before data eligibility
- Legal holds respected (proofs retained regardless)
Performance Requirements¶
- Seal latency SLO: p95 ≤ 120s from `Accepted` to `ProofComputed`
- Integrity throughput sized for ingest peak × safety margin (e.g., 1.5×)
- Evidence Store write amplification budgeted
Sequence Flow¶
Happy Path¶
sequenceDiagram
autonumber
participant Storage as Storage (Authoritative)
participant Integrity as Integrity Service
participant KMS as KMS
participant Evidence as Evidence Store
participant Bus as Event Bus
participant Projection as Projection Service
Storage-->>Integrity: AuditRecord.Accepted { auditRecordId, tenantId, canonicalBytesRef }
Integrity->>Integrity: LeafHash = SHA-256(canonicalBytes)
Integrity->>Integrity: Append leaf to SegmentBuffer(tenant, shard)
alt Seal threshold met (size or age)
Integrity->>Integrity: MerkleRoot = merkle(leafHashes)
Integrity->>KMS: Sign(BlockHeader { SegmentId, MerkleRoot, PrevBlockRoot })
KMS-->>Integrity: ↩ Signature { keyId, sig }
Integrity->>Evidence: Store { Segment, BlockHeader, Signature }
Evidence-->>Integrity: ↩ EvidenceRef { segmentUri, blockUri }
Integrity-->>Storage: Write IntegrityRef on records in segment
Integrity-->>Bus: Publish Integrity.ProofComputed { tenantId, segmentId, blockId }
Bus-->>Projection: Event fan-out (optional)
else Buffer continues
Integrity->>Integrity: Wait for more leaves or seal timeout
end
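The seal step in the diagram computes a Merkle root over the segment's leaf hashes. A minimal sketch; duplicating the last node on odd-sized levels is one common convention here, and the Integrity spec defines the authoritative tree rule:

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves: list) -> bytes:
    """Fold leaf hashes pairwise up to a single root (sketch)."""
    assert leaves, "cannot seal an empty segment"
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate last node on odd levels (assumed convention)
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

# Leaf hashes stand in for SHA-256(canonicalBytes) of accepted records.
leaves = [sha256(f"record-{i}".encode()) for i in range(3)]
root = merkle_root(leaves)
print(root.hex()[:8])  # stable for the same ordered leaves
```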
Alternative Paths¶
- Time-based seal: If size threshold not reached within `sealMaxAge`, force seal to bound verification lag.
- Dual-sign window: During key rotation, blocks are signed with the new key, and verifiers accept old or new `keyId`.
- Cross-region catch-up: If a region falls behind, segments seal independently; a later anchor block links chains (see DR flow).
Error Paths¶
sequenceDiagram
participant Integrity as Integrity Service
participant KMS as KMS
participant Evidence as Evidence Store
Integrity->>KMS: Sign(BlockHeader)
alt KMS unavailable
KMS-->>Integrity: ↩ 503
Integrity->>Integrity: Retry with backoff, keep SegmentBuffer open
else Signature reject
KMS-->>Integrity: ↩ error(key.invalid)
Integrity->>Integrity: Quarantine segment, raise alert
end
Integrity->>Evidence: Store proofs
alt Evidence store error
Evidence-->>Integrity: ↩ 503
Integrity->>Integrity: Retry, if max attempts → DLQ & operator action
end
Request/Response Specifications¶
The chain creation is internal, but two public/operational surfaces are relevant: the event and the evidence retrieval API.
Input Requirements (event consumed by Integrity)¶
| Field | Type | Required | Description | Validation |
|---|---|---|---|---|
| `auditRecordId` | ULID | Y | Record identifier | Exists in Storage |
| `tenantId` | string | Y | Tenant scope | Valid tenant |
| `canonicalBytesRef` | uri | Y | Pointer to canonical JSON | Dereferenceable |
| `createdAt` | timestamp | Y | Record time | ISO-8601 UTC |
| `observedAt` | timestamp | Y | Ingestion time | ISO-8601 UTC |
Output Specifications¶
Event: `Integrity.ProofComputed`
| Field | Type | Description | Notes |
|---|---|---|---|
| `tenantId` | string | Tenant | — |
| `segmentId` | ULID | Sealed segment id | — |
| `blockId` | ULID | Block id | — |
| `keyId` | string | Signing key identifier | From KMS |
| `merkleRoot` | hex | Root hash | SHA-256 |
| `recordRange` | object | `{fromId, toId}` | Optional |
| `evidence` | object | `{segmentUri, blockUri}` | Evidence Store refs |
| `sealedAt` | timestamp | Seal time | UTC |
API: `GET /integrity/v1/proofs/{auditRecordId}`
| Field | Type | Description | Notes |
|---|---|---|---|
| `auditRecordId` | path | Record id | ULID |
| `include` | query | `leaf \| segment \| block \| all` | Optional |
Response (200)
{
"auditRecordId": "01JE9C5V6A7B8C9D0E1F2G3H4I",
"leaf": { "hash": "sha256:ab…", "position": 128, "segmentId": "01JE9C6…" },
"segment": { "merkleRoot": "sha256:cd…", "proofPath": ["ef…","01…"] },
"block": { "blockId": "01JE9C7…", "prevBlockRoot": "sha256:12…", "signature": { "keyId": "kms-2025-10", "sig": "MEUCIQ…" } },
"sealedAt": "2025-10-22T12:01:45.120Z"
}
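A client can check such a response by folding the leaf hash up `proofPath` until it reaches the segment's `merkleRoot`. A sketch, assuming sibling order is derived from the leaf position's parity at each level (the Integrity spec defines the actual convention):

```python
import hashlib

def verify_proof(leaf_hash: bytes, position: int, proof_path: list, merkle_root: bytes) -> bool:
    """Recompute the root from a leaf and its sibling path (sketch)."""
    node = leaf_hash
    for sibling in proof_path:
        if position % 2 == 0:      # node is a left child at this level
            node = hashlib.sha256(node + sibling).digest()
        else:                      # node is a right child at this level
            node = hashlib.sha256(sibling + node).digest()
        position //= 2
    return node == merkle_root

# Tiny two-leaf tree: root = H(a + b).
a = hashlib.sha256(b"a").digest()
b = hashlib.sha256(b"b").digest()
root = hashlib.sha256(a + b).digest()
print(verify_proof(a, 0, [b], root))
# → True
```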
Error Handling¶
Error Scenarios¶
| Error Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Bad `include` value / malformed id | Correct request (ULID/enum) | No retry until corrected |
| 404 | Record or proof not found (not yet sealed or purged) | Poll later or verify eligibility | Retry after backoff |
| 409 | Append attempt to sealed segment (internal) | Start new segment; do not mutate sealed | N/A (system fix) |
| 422 | Signature cannot be generated due to key policy mismatch | Adjust policy / rotate properly | Retry after policy fix |
| 429 | Integrity backlog/backpressure | System scales workers | Automatic; client retries evidence GET |
| 503 | KMS/Evidence store unavailable | Wait for recovery | Exponential backoff + jitter |
Failure Modes¶
- Segment overflow beyond configured max leaves: immediate seal and roll to next segment.
- KMS key disabled: seals paused; alert; switch to standby key or rotate.
- Evidence write partial: transactionally retry, or mark segment PendingEvidence.
Recovery Procedures¶
- If KMS/Evidence outage, allow buffers to grow; workers retry with capped backoff.
- If quarantine triggered (signature reject), isolate segment and open incident; re-sign with correct key after root cause.
- Reconcile `PrevBlockRoot` on restart to maintain a single forward chain per `(tenant, shard)`.
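The capped backoff referenced above is typically exponential with full jitter to avoid synchronized retry storms across workers. A sketch with illustrative constants:

```python
import random

def backoff_delays(attempts: int, base: float = 0.5, cap: float = 60.0):
    """Yield capped exponential delays with full jitter (illustrative constants)."""
    for attempt in range(attempts):
        # Full jitter: uniform in [0, min(cap, base * 2^attempt)].
        yield random.uniform(0, min(cap, base * (2 ** attempt)))

delays = list(backoff_delays(5))
print(len(delays))
# → 5
```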
Performance Characteristics¶
Latency Expectations¶
- Leaf→ProofComputed: p50 20–40s; p95 ≤ 120s (time/size thresholds dependent)
Throughput Limits¶
- Leaf hashing ≥ ingest throughput; segment sealing limited by Merkle + I/O (target ≥ 5k leaves/s per worker).
Resource Requirements¶
- CPU for SHA-256/Merkle; memory for SegmentBuffer (bounded by max leaves or bytes).
- Evidence Store IOPS sized for block bursts.
Scaling Considerations¶
- Horizontal scale by tenant/shard queues.
- Auto-seal if buffers exceed memory pressure.
- Backpressure signaled to upstream only in extreme cases (avoid impacting ingest).
Security & Compliance¶
Authentication¶
- mTLS between Integrity and KMS/Evidence Store.
Authorization¶
- Integrity service principal limited to sign and write evidence; read-only for verify endpoints.
Data Protection¶
- Proof artifacts encrypted at rest; signatures cover SegmentId, MerkleRoot, PrevBlockRoot, sealedAt.
Compliance¶
- Proofs retained for at least as long as corresponding records; legal holds pin proofs.
- Audit trail includes `keyId`, `sealedAt`, and `policyRevision` used for sealing thresholds.
Monitoring & Observability¶
Key Metrics¶
| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| `integrity_queue_depth` | gauge | Pending leaves | Rising > 10× baseline |
| `segment_seal_latency_ms` | histogram | Accept→seal delay | p95 > 120s |
| `proof_compute_errors_total` | counter | Failed proof writes | > 0 over 5m |
| `kms_sign_latency_ms` | histogram | KMS call time | p95 > 200ms |
| `segments_sealed_total` | counter | Count per tenant/shard | Trend watch |
Logging Requirements¶
- Log `segmentId`, `blockId`, `keyId`, leaf counts, thresholds used; never log raw record bytes.
Distributed Tracing¶
- Spans: `integrity.hash.leaf`, `integrity.seal.segment`, `kms.sign`, `evidence.write`; attributes include `tenant`, `segmentSize`, `ageSec`.
Health Checks¶
- Readiness: KMS reachable; Evidence Store writable; backlog below watermark.
- Liveness: worker heartbeats; buffer pressure alarms.
Operational Procedures¶
Deployment¶
- Deploy Integrity workers; keep `integrity.enabled=false`.
- Validate KMS permissions and dry-run seal on a test tenant.
- Enable and monitor `queue_depth`, `seal_latency_ms`.
Configuration¶
- Env Vars: `SEAL_MAX_LEAVES`, `SEAL_MAX_AGE_SEC`, `KMS_KEY_ID`, `MAX_BUFFER_BYTES`
- Backoff: `KMS_RETRY_BACKOFF`, `EVIDENCE_RETRY_BACKOFF`
Maintenance¶
- Rotate `keyId` on schedule; run dual-verify window; archive old public keys.
- Periodic integrity audit: random-sample verify segments nightly.
Troubleshooting¶
- High queue depth → add workers; lower seal thresholds temporarily.
- Signature failures → verify KMS policy/alg; check clock skew.
- Missing proofs → check DLQ for segments marked `PendingEvidence`.
Testing Scenarios¶
Happy Path Tests¶
- Given `AuditRecord.Accepted`, then `Integrity.ProofComputed` within SLO and record has `IntegrityRef`.
- Merkle proof verifies for random leaves in sealed segment.
Error Path Tests¶
- KMS outage → seals delayed; proofs catch up after recovery.
- Evidence store 503 → retries; no data loss; segment eventually `Sealed`.
Performance Tests¶
- Seal at size threshold (e.g., 10k leaves) under peak ingest.
- Seal at age threshold (e.g., 60s) with sparse ingest.
Security Tests¶
- Signatures verify with current and previous `keyId` during rotation.
- Unauthorized client cannot fetch proofs from another tenant.
Related Documentation¶
Internal References¶
Related Flows¶
- Verify-On-Read Flow
- Export eDiscovery Flow
- Tamper Detection Flow
External References¶
- Merkle tree concepts (general)
- KMS provider docs for signing APIs
Appendices¶
A. Block Header (conceptual)¶
{
"blockId": "01JE9C7…",
"segmentId": "01JE9C6…",
"merkleRoot": "sha256:cd…",
"prevBlockRoot": "sha256:12…",
"sealedAt": "2025-10-22T12:01:45.120Z",
"keyId": "kms-2025-10",
"signature": "MEQCIF…"
}
B. Leaf Hash Definition¶
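The happy path above defines `LeafHash = SHA-256(canonicalBytes)`. As a sketch, approximating RFC 8785 (JCS) canonicalization with sorted keys and compact separators — full JCS additionally fixes number and string serialization, so this is illustrative only:

```python
import hashlib
import json

def leaf_hash(record: dict) -> str:
    """SHA-256 over an approximation of the record's canonical JSON bytes."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()

h = leaf_hash({"b": 1, "a": 2})
print(h == leaf_hash({"a": 2, "b": 1}))  # key order does not affect the hash
# → True
```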
Audit Record Projection Update Flow¶
Builds query-optimized views from authoritative append-only facts. The Projector consumes accepted records, performs idempotent upserts into read models (AuditEvents timeline, Resource- and Actor-centric projections), updates the Search index, invalidates caches, and advances a checkpoint/watermark to guarantee at-least-once processing without duplication.
Overview¶
Purpose: Materialize fast, tenant-scoped views for queries and search while tracking consistent progress via checkpoints.
Scope: Post-append event consumption, idempotent projection updates, search indexing, cache invalidation, checkpointing, and replay/rebuild controls. Excludes ingestion, redaction policy evaluation, and verify-on-read.
Context: Runs asynchronously after AuditRecord.Accepted; multiple projector shards process per tenant/partition with strict ordering guarantees.
Key Participants:
- Storage (Authoritative) — emits `AuditRecord.Accepted`
- Projector — applies projection logic, maintains idempotency & checkpoints
- Read DB — projection tables (AuditEvents, Resource, Actor)
- Search Index — per-tenant documents for full-text/facets/suggest
- Cache — key-based caches for hot read paths
- Checkpoint Store — durable cursor (offset/watermark)
- Event Bus — transport for `Accepted` and internal signals
Prerequisites¶
System Requirements¶
- Storage → Bus delivery configured; Projector subscribed to `AuditRecord.Accepted`
- Read DB reachable with migrations applied for projection schemas
- Checkpoint Store provisioned (per tenant/shard)
- Search cluster online and tenant indices created (if enabled)
Business Requirements¶
- Tenants activated with edition flags for Search (optional)
- Data minimization rules acknowledged in projection shapes
- Cache TTLs defined per view (timeline/resource/actor)
Performance Requirements¶
- Projection lag SLO: p95 ≤ 5 s from `Accepted` to visible in reads
- Indexing throughput sized to match ingest rate (≥ 1×)
- Checkpoint advance p99 commit ≤ 50 ms
Sequence Flow¶
Happy Path¶
sequenceDiagram
autonumber
participant Storage as Storage (Authoritative)
participant Bus as Event Bus
participant Proj as Projector
participant ReadDB as Read DB (Projections)
participant Search as Search Index
participant Cache as Cache
participant Ckpt as Checkpoint Store
Storage-->>Bus: Publish AuditRecord.Accepted {tenantId, auditRecordId, canonicalRef}
Bus-->>Proj: Deliver event (ordered per partition)
Proj->>Proj: Idempotency check (eventId vs last offset)
Proj->>ReadDB: UPSERT AuditEvents (timeline)
Proj->>ReadDB: UPSERT ResourceProjection (by resource)
Proj->>ReadDB: UPSERT ActorProjection (by actor)
alt Search enabled
Proj->>Search: UPSERT index document(s)
end
Proj->>Cache: Invalidate keys {timeline:tenant, resource:id, actor:id}
Proj->>Ckpt: Commit watermark {offset, auditRecordId, observedAt}
Ckpt-->>Proj: ↩ ack
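The idempotency check and watermark commit above can be sketched with in-memory stand-ins for the Read DB and Checkpoint Store; a duplicate delivery is detected by its offset and skipped without double-writing:

```python
class Projector:
    """At-least-once consumer with idempotent upserts (sketch only)."""

    def __init__(self):
        self.committed_offset = -1
        self.timeline = {}  # (tenantId, auditRecordId) -> row, stand-in for Read DB

    def handle(self, event: dict) -> bool:
        if event["offset"] <= self.committed_offset:
            return False                          # duplicate/replayed event: skip
        key = (event["tenantId"], event["auditRecordId"])
        self.timeline[key] = event                # idempotent upsert by natural key
        self.committed_offset = event["offset"]   # advance the watermark
        return True

p = Projector()
e = {"offset": 7, "tenantId": "acme", "auditRecordId": "01JEA"}
print(p.handle(e), p.handle(e))
# → True False
```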
Alternative Paths¶
- Out-of-order duplicate: Projector detects processed offset and skips; checkpoint remains.
- Rebuild: Admin issues `Rebuild` command → Projector resets checkpoint to origin, clears projections (or writes compaction shadow tables), replays events, then swaps.
- Partial Indexing: If Search is temporarily disabled for a tenant, projector queues index updates and advances DB projections; index will catch up later from a backlog.
Error Paths¶
sequenceDiagram
participant Proj as Projector
participant ReadDB as Read DB
participant Ckpt as Checkpoint Store
participant Search as Search Index
Proj->>ReadDB: UPSERT projections
alt Constraint conflict (unique key)
ReadDB-->>Proj: ↩ 409 conflict
Proj->>Proj: Apply idempotent merge, retry once
else Bad projection payload (schema drift)
ReadDB-->>Proj: ↩ 400 bad request
Proj->>Proj: Quarantine record → DLQ, continue stream
end
Proj->>Ckpt: Commit watermark
alt Not found checkpoint stream
Ckpt-->>Proj: ↩ 404 not found
Proj->>Ckpt: Create stream atomically, retry
end
Proj->>Search: UPSERT doc
alt Index unavailable / rate-limited
Search-->>Proj: ↩ 429/503
Proj->>Proj: Buffer + backoff, do not block DB projections
end
Request/Response Specifications¶
External APIs are operational controls; projections themselves are internal upserts.
Input Requirements (event consumed)¶
| Field | Type | Required | Description | Validation |
|---|---|---|---|---|
| `tenantId` | string | Y | Tenant scope | Known tenant |
| `auditRecordId` | ULID | Y | Record id | Exists in Storage |
| `createdAt` | timestamp | Y | Producer time | ISO-8601 UTC |
| `observedAt` | timestamp | Y | Ingestion time | ISO-8601 UTC |
| `action` | string | Y | Event verb | normalized |
| `resource` | object | Y | `{type,id,path?}` | normalized |
| `actor` | object | Y | `{id,type,display?}` | present |
| `decision` | object | N | Access outcome | enum |
| `attributes` | map | N | extras | bounded |
Output Specifications (projections)¶
| Projection | Key | Shape (summary) | Notes |
|---|---|---|---|
| AuditEvents | `(tenantId, createdAt, auditRecordId)` | timeline row | paginates by cursor |
| ResourceProjection | `(tenantId, resource.type, resource.id)` | latest state + last actions | small, denormalized |
| ActorProjection | `(tenantId, actor.id)` | last actions, resources touched | for actor-centric queries |
| Search Document | `(tenantId, auditRecordId)` | flattened facets + text | edition-gated |
Operational APIs¶
`GET /projections/v1/{tenant}/{name}/status`
Response 200:
{
"tenant": "acme",
"name": "AuditEvents",
"watermark": { "offset": 1203981, "auditRecordId": "01JEA...", "updatedAt": "2025-10-22T12:00:06.100Z" },
"lag": { "seconds": 2.4, "records": 180 },
"state": "Healthy"
}
`POST /projections/v1/{tenant}/{name}/rebuild` → 202 with `{ jobId }`
Error Handling¶
Error Scenarios¶
| Error Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Bad request to ops API (invalid `name`, bad params) | Fix request | No retry until corrected |
| 404 | Status/rebuild for unknown projection or tenant | Validate inputs | No retry |
| 409 | Rebuild already in progress / checkpoint conflict | Use existing job or wait | Retry after completion |
| 422 | Event schema drift vs projection mapper | Quarantine & hotfix mapper | Continue stream; backfill later |
| 429 | Search/index or cache backpressure | Defer indexing; advance DB | Automatic retry/backoff |
| 503 | Read DB/Checkpoint store transient failure | Keep event, retry | Exponential backoff + jitter |
Failure Modes¶
- Poison event: irreconcilable mapping → send to DLQ with pointers; continue stream.
- Cache stampede: cache invalidations batched/coalesced; use jittered TTLs.
- Idempotency race: unique key conflicts resolved via UPSERT with deterministic merge.
Recovery Procedures¶
- If Read DB/Checkpoint outage, pause commits but keep events buffered; resume and commit in order.
- For DLQ items, fix mapper/policy, then replay from saved offset range.
- During rebuild, expose `state:"Rebuilding"`; queries read from shadow tables if configured.
Performance Characteristics¶
Latency Expectations¶
- Accept → Read visible: p95 ≤ 5 s
- Accept → Indexed: p95 ≤ 10 s (if search enabled)
Throughput Limits¶
- Sustains ingest parity; projectors process ≥ 1× ingest rps per shard.
Resource Requirements¶
- CPU for mapping/JSON flatten; DB connections sized for write bursts.
- Search bulkers batch 500–1,000 docs or 5–10 MiB per flush.
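The flush guidance above can be implemented as a count- and size-bounded buffer. A sketch, where `flush_fn` stands in for the search engine's bulk API and the thresholds are the illustrative values from the text:

```python
class BulkBuffer:
    """Flush queued index docs when either a doc-count or byte budget is hit."""

    def __init__(self, flush_fn, max_docs=500, max_bytes=5 * 1024 * 1024):
        self.flush_fn, self.max_docs, self.max_bytes = flush_fn, max_docs, max_bytes
        self.docs, self.bytes = [], 0

    def add(self, doc: bytes):
        self.docs.append(doc)
        self.bytes += len(doc)
        if len(self.docs) >= self.max_docs or self.bytes >= self.max_bytes:
            self.flush()

    def flush(self):
        if self.docs:
            self.flush_fn(self.docs)      # bulk write to the index
            self.docs, self.bytes = [], 0

flushes = []
buf = BulkBuffer(flushes.append, max_docs=2)
for d in (b"doc1", b"doc2", b"doc3"):
    buf.add(d)
buf.flush()  # drain the remainder
print([len(batch) for batch in flushes])
# → [2, 1]
```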
Scaling Considerations¶
- Horizontal scale by tenant/shard.
- HPA/KEDA on queue depth, projection lag, and p95 projector latency.
- Apply backpressure to indexing only; keep DB projections current.
Security & Compliance¶
Authentication¶
- mTLS between Projector and Read DB/Search/Checkpoint.
Authorization¶
- Projector principal has write on projections & checkpoint, write/bulk on Search, no read of other tenants.
Data Protection¶
- Store only minimized fields required for query/search; avoid sensitive raw values.
- Tenant isolation enforced at table/index level (prefix/shard keys).
Compliance¶
- Projection updates logged with `tenant`, `auditRecordId`, and `mapperVersion` for auditability.
Monitoring & Observability¶
Key Metrics¶
| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| `projection_lag_seconds` | gauge | Accept→visible delay | > 5s p95 (5m) |
| `projected_records_total` | counter | Rows upserted | Trend vs ingest |
| `checkpoint_commit_latency_ms` | histogram | Commit time | p95 > 50ms |
| `projection_conflicts_total` | counter | 409 upserts | Rising trend |
| `index_updates_backlog` | gauge | Pending index docs | Growing without drop |
Logging Requirements¶
- Structured logs: `tenant`, `auditRecordId`, `offset`, `mapperVersion`, conflict summaries (no sensitive values).
Distributed Tracing¶
- Spans: `projector.consume`, `mapper.apply`, `readdb.upsert`, `index.bulk`, `checkpoint.commit`.
- Attributes: `tenant`, `offset`, `bulkCount`, `lagMs`.
Health Checks¶
- Readiness: connectivity to Read DB/Search/Checkpoint; lag below threshold.
- Liveness: consumer heartbeats; partition ownership indicator.
Operational Procedures¶
Deployment¶
- Deploy Projector with `projector.enabled=false`.
- Run migrations for projection schemas.
- Enable consumers per tenant/shard; monitor `projection_lag_seconds`.
Configuration¶
- Env Vars: `PROJECTOR_PARALLELISM`, `CHECKPOINT_BATCH`, `INDEX_BULK_BYTES`, `INDEX_BULK_DOCS`
- Flags: `search.enabled`, `rebuild.shadowSwap=true`
Maintenance¶
- Periodic compaction of timeline tables; rotate old index aliases.
- Update `mapperVersion` with schema changes; keep backward compatibility.
Troubleshooting¶
- Rising lag → scale workers or reduce index bulk size; inspect DB write contention.
- Many conflicts → verify UPSERT keys & mapping determinism.
- Backlog in indexing → check cluster health; enable backpressure-only mode.
Testing Scenarios¶
Happy Path Tests¶
- `Accepted` event produces AuditEvents row, Resource & Actor upserts; watermark advances.
- Search document visible; cache invalidated and repopulated on read.
Error Path Tests¶
- Unique key conflict handled idempotently (no duplicate rows).
- Bad ops API request → 400; unknown projection → 404; rebuild in progress → 409.
Performance Tests¶
- Maintain p95 ≤ 5 s at target ingest rps with search enabled/disabled.
- Bulk indexing flush sizes tuned for p95 < 1 s per bulk.
Security Tests¶
- Tenant isolation in projections and index aliases.
- No sensitive fields persisted beyond minimization policy.
Related Documentation¶
Internal References¶
Related Flows¶
- Standard Audit Record Ingestion Flow
- Audit Record Integrity Chain Flow
- Search Query Flow
External References¶
- Bulk indexing guidance for the chosen search engine (vendor docs)
Appendices¶
A. UPSERT Keys (example)¶
- AuditEvents: `(tenantId, createdAt, auditRecordId)`
- ResourceProjection: `(tenantId, resourceType, resourceId)`
- ActorProjection: `(tenantId, actorId)`
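A minimal sketch of how these composite keys make projection upserts idempotent, using SQLite's `ON CONFLICT` clause (table and column names mirror the examples above but are illustrative, not the platform schema):

```python
import sqlite3

# Sketch: idempotent UPSERT keyed on the composite key above.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE ResourceProjection (
        tenantId TEXT, resourceType TEXT, resourceId TEXT,
        lastAction TEXT,
        PRIMARY KEY (tenantId, resourceType, resourceId)
    )""")

def upsert_resource(tenant, rtype, rid, action):
    # Replays of the same event update in place instead of duplicating.
    db.execute("""
        INSERT INTO ResourceProjection VALUES (?, ?, ?, ?)
        ON CONFLICT (tenantId, resourceType, resourceId)
        DO UPDATE SET lastAction = excluded.lastAction
    """, (tenant, rtype, rid, action))

upsert_resource("acme", "Iam.User", "U-1001", "user.create")
upsert_resource("acme", "Iam.User", "U-1001", "user.create")  # replay, no duplicate
count, = db.execute("SELECT COUNT(*) FROM ResourceProjection").fetchone()
```

Because the UPSERT key is deterministic from the event, redelivery and projector restarts converge on the same row.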
B. Checkpoint Record (example)¶
{
"tenant": "acme",
"partition": "p3",
"offset": 1203981,
"auditRecordId": "01JEA…",
"updatedAt": "2025-10-22T12:00:06.100Z",
"mapperVersion": "v7"
}
HTTP REST API Ingestion Flow¶
REST transport for appending a single AuditRecord via API Gateway. Details HTTP method/endpoint, required headers, authentication & rate limiting, header-to-internal mapping (traceparent, x-tenant-id, x-idempotency-key), response behaviors (2xx/4xx/5xx), and concrete request/response examples.
Overview¶
Purpose: Provide a secure, idempotent HTTP interface for producers to append audit facts through the Gateway.
Scope: HTTP semantics (headers, status codes, retries), authN/Z at the edge, rate limiting, payload size/type validation, Problem+JSON errors. Excludes batch/gRPC/bus transports (separate flows) and downstream integrity/projection internals.
Context: Front door for most interactive clients; maps cleanly to the canonical ingestion path.
Key Participants:
- HTTP Client (producer)
- API Gateway (edge policy, authN/Z, limits)
- Ingestion Service (validation/canonicalization)
- Policy Service (classification/redaction hints, invoked by Ingestion)
- Storage (Authoritative) (append/WORM)
Prerequisites¶
System Requirements¶
- TLS 1.2+ enabled on Gateway; valid certificates
- Gateway has JWKS/issuer config to validate JWTs (OIDC)
- Network routes Gateway → Ingestion (and Ingestion → Policy/Storage)
Business Requirements¶
- Tenant exists, active, and mapped to regions/partitions
- Policy/retention configurations present for tenant
- Edition flags set (may influence limits)
Performance Requirements¶
- Gateway rate limits sized per tenant (burst/sustained)
- Max payload ≤ 256 KiB; P95 end-to-end ≤ 50 ms at target RPS
- Idempotency store capacity sized for 24h dedupe window
Sequence Flow¶
Happy Path¶
sequenceDiagram
autonumber
actor Client as HTTP Client
participant Gateway as API Gateway
participant Ingestion as Ingestion Service
participant Storage as Storage (Authoritative)
Client->>Gateway: POST /audit/v1/records<br/>(h: Authorization, x-tenant-id, traceparent, x-idempotency-key)<br/>(b: application/json)
Note right of Gateway: Validate JWT, tenant scope, rate-limit, content-type & size
Gateway->>Ingestion: Append(request) (forward required headers)
Ingestion->>Ingestion: Validate + canonicalize + policy hints
Ingestion->>Storage: INSERT canonical record (WORM)
Storage-->>Ingestion: ack {auditRecordId}
Ingestion-->>Gateway: 202 {auditRecordId, status:"Created"}
Gateway-->>Client: 202 Accepted (rate-limit headers included); errors return Problem+JSON
Alternative Paths¶
- Duplicate idempotency key: 202 with `status:"Duplicate"` and the original `auditRecordId`.
- Server-assigned ULID: omit `auditRecordId` and receive the assigned value in the response.
- CORS/browser clients: preflight `OPTIONS` handled by Gateway; only safelisted headers exposed.
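The Created/Duplicate behavior can be sketched as follows; an in-process dict stands in for the shared idempotency store, and the `append` helper and its return shape are illustrative, not the service API:

```python
import time

# Sketch of the dedupe decision: a first-seen key yields "Created";
# a repeat within the 24h window returns "Duplicate" with the
# original auditRecordId.
WINDOW_S = 24 * 3600
_seen: dict[tuple[str, str], tuple[str, float]] = {}

def append(tenant: str, idem_key: str, new_record_id: str, now=None):
    now = time.time() if now is None else now
    hit = _seen.get((tenant, idem_key))
    if hit and now - hit[1] < WINDOW_S:
        # Replay: echo the original id, do not create a second record.
        return {"auditRecordId": hit[0], "status": "Duplicate"}
    _seen[(tenant, idem_key)] = (new_record_id, now)
    return {"auditRecordId": new_record_id, "status": "Created"}
```

Both outcomes are 202 to the client, so retries are safe and observable via `status`.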
Error Paths¶
sequenceDiagram
actor Client
participant Gateway as API Gateway
Client->>Gateway: POST /audit/v1/records (bad/missing bits)
alt Bad request (shape/size/type)
Gateway-->>Client: 400/413/415 Problem+JSON
else Unauthorized / Forbidden
Gateway-->>Client: 401/403 Problem+JSON
else Not found / wrong route
Gateway-->>Client: 404 Problem+JSON
else Conflict (idempotency anomaly)
Gateway-->>Client: 409 Problem+JSON
else Rate limited
Gateway-->>Client: 429 Problem+JSON (+ Retry-After)
else Upstream unavailable
Gateway-->>Client: 503 Problem+JSON
end
Request/Response Specifications¶
Input Requirements¶
| Field | Type | Required | Description | Validation |
|---|---|---|---|---|
| Method | HTTP | Y | `POST` | `POST /audit/v1/records` |
| Content-Type | header | Y | Body MIME type | `application/json; charset=utf-8` |
| Authorization | header | Y | Bearer JWT | Valid signature, audience, tenant claim |
| x-tenant-id | header | Y | Tenant routing | `^[A-Za-z0-9._-]{1,128}$` |
| traceparent | header | Y | W3C trace context | 55-char format |
| x-idempotency-key | header | Y | Dedupe per tenant (24h) | ≤128 visible ASCII |
| Body | JSON | Y | Canonical `AuditRecord` fields | See Data Model rules |
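The header validations in the table can be sketched as follows; the patterns follow the stated rules (tenant-id regex, 55-char W3C traceparent, printable-ASCII idempotency key), while the function name and return shape are assumptions:

```python
import re

# Sketch of the edge validations for required ingestion headers.
TENANT_RE = re.compile(r"^[A-Za-z0-9._-]{1,128}$")
TRACEPARENT_RE = re.compile(
    r"^[0-9a-f]{2}-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}$")

def validate_headers(h: dict) -> list[str]:
    """Return the names of headers that fail validation."""
    errors = []
    if not TENANT_RE.match(h.get("x-tenant-id", "")):
        errors.append("x-tenant-id")
    tp = h.get("traceparent", "")
    if len(tp) != 55 or not TRACEPARENT_RE.match(tp):
        errors.append("traceparent")
    key = h.get("x-idempotency-key", "")
    if not (0 < len(key) <= 128 and all(33 <= ord(c) <= 126 for c in key)):
        errors.append("x-idempotency-key")
    return errors
```

A non-empty result maps naturally to a 400 Problem+JSON response with one `errors[]` entry per failed header.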
Output Specifications¶
| Field | Type | Description | Notes |
|---|---|---|---|
| `auditRecordId` | ULID | Durable record id | Server returns original or assigned |
| `status` | string | `Created` or `Duplicate` | Idempotent behavior |
| `observedAt` | timestamp | Ingestion observation time | ms precision |
| `traceId` | hex32 | Echo from `traceparent` | Correlation |
| `links.self` | uri | Record locator | Optional operation link |
Example Payloads¶
Request
POST /audit/v1/records HTTP/1.1
Host: api.atp.example
Authorization: Bearer eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9...
x-tenant-id: acme
traceparent: 00-3e1f2d0c9b8a7f6e5d4c3b2a19081716-7f6e5d4c3b2a1908-01
x-idempotency-key: acme-ord-9981-v1
Content-Type: application/json; charset=utf-8
{
"tenantId": "acme",
"schemaVersion": "auditrecord.v1",
"createdAt": "2025-10-22T12:00:03.100Z",
"action": "user.create",
"resource": { "type": "Iam.User", "id": "U-1001" },
"actor": { "id": "svc_ingress", "type": "Service" }
}
Response — 202 Accepted
{
"auditRecordId": "01JEB0V2G7NY5T6Q9KX3M4C8AP",
"status": "Created",
"observedAt": "2025-10-22T12:00:03.280Z",
"traceId": "3e1f2d0c9b8a7f6e5d4c3b2a19081716",
"links": {
"self": "/audit/v1/records/01JEB0V2G7NY5T6Q9KX3M4C8AP"
}
}
Response — 400 Bad Request (Problem+JSON)
{
"type": "urn:connectsoft:errors/validation/action.invalid",
"title": "Invalid action",
"status": 400,
"detail": "Action must match ^[a-z]+(\\.[a-z0-9_-]+)?$",
"errors": [{ "pointer": "/action", "reason": "regex" }],
"traceId": "3e1f2d0c9b8a7f6e5d4c3b2a19081716"
}
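A sketch of producing this Problem+JSON for an invalid `action`; the regex, error type URN, and field pointer are taken from the example above, while the helper name is illustrative:

```python
import re

# Sketch: validate `action` and build the Problem+JSON body on failure.
ACTION_RE = re.compile(r"^[a-z]+(\.[a-z0-9_-]+)?$")

def check_action(action: str, trace_id: str):
    """Return None if valid, else an RFC 7807 problem document."""
    if ACTION_RE.match(action):
        return None
    return {
        "type": "urn:connectsoft:errors/validation/action.invalid",
        "title": "Invalid action",
        "status": 400,
        "detail": "Action must match ^[a-z]+(\\.[a-z0-9_-]+)?$",
        "errors": [{"pointer": "/action", "reason": "regex"}],
        "traceId": trace_id,
    }
```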
Error Handling¶
Status Code Matrix¶
| Class | Code | When | Notes |
|---|---|---|---|
| 2xx | 202 | Accepted (created or deduped) | Body includes `status:"Created"` or `status:"Duplicate"` |
| 4xx | 400 | Shape/field invalid, schema mismatch | Problem+JSON with `errors[].pointer` |
| 4xx | 401 | Missing/invalid JWT | Bearer challenge omitted for APIs; response body explains |
| 4xx | 403 | Tenant/permission forbidden | Token valid but insufficient scope |
| 4xx | 404 | Unknown route/tenant or disabled feature | Useful for wrong base path or edition |
| 4xx | 409 | Idempotency anomaly / conflicting op link | Rare; follow `links.operation` if present |
| 4xx | 413 | Payload exceeds 256 KiB | Include `maxBytes` hint |
| 4xx | 415 | Unsupported media type | Require `application/json` |
| 4xx | 429 | Rate-limited/backpressure | Include `Retry-After` (seconds or HTTP date) |
| 5xx | 503 | Upstream dependency unavailable | Retry with same idempotency key |
Failure Modes¶
- Clock skew: `createdAt > now+2m` → 400 with pointer `/createdAt`.
- Tenant mismatch: body `tenantId` ≠ header `x-tenant-id` → 403.
- Idempotency race: concurrent distinct payloads under same key → 409.
Recovery Procedures¶
- For 4xx, correct payload/headers and resend (new key except for 409).
- For 429/503, retry with exponential backoff + jitter; reuse the same `x-idempotency-key`.
- Track `traceId` from responses to correlate retries.
Performance Characteristics¶
Latency Expectations¶
- Gateway edge: P50 5–10 ms, P95 ≤ 20 ms
- End-to-end to 202: P50 15–25 ms, P95 ≤ 50 ms
Throughput Limits¶
- Default per-tenant: 500 rps sustained, 2k rps burst (60s)
- Global: ≥ 50k rps across shards (capacity dependent)
Resource Requirements¶
- Gateway CPU for JWT validation and header processing; memory for small payload buffers.
Scaling Considerations¶
- Scale Gateway horizontally; HPA on rps & p95.
- Separate rate limit buckets per tenant and per route.
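A per-(tenant, route) token bucket matching the stated defaults (500 rps sustained, 2k burst) might look like the sketch below; the structure is illustrative, not the actual Gateway limiter:

```python
import time

# Sketch: one token bucket per (tenant, route); capacity models the
# burst and the refill rate models the sustained limit.
class TokenBucket:
    def __init__(self, rate=500.0, burst=2000.0, now=time.monotonic):
        self.rate, self.burst, self._now = rate, burst, now
        self.tokens, self.last = burst, now()

    def allow(self, cost=1.0) -> bool:
        t = self._now()
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

buckets: dict[tuple[str, str], TokenBucket] = {}

def allow(tenant: str, route: str) -> bool:
    return buckets.setdefault((tenant, route), TokenBucket()).allow()
```

A denied request maps to 429 with `Retry-After` derived from the refill rate.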
Security & Compliance¶
Authentication¶
- OIDC JWT Bearer; short-lived (≤ 15m), leeway ±60s.
Authorization¶
- Require `audit:append` scoped to `x-tenant-id`; Gateway enforces edition access.
Data Protection¶
- TLS 1.2+; HSTS at edge; CORS preflight for browser-based producers (restrict origins & headers).
Compliance¶
- Log who/when appended; immutable WORM store; Problem+JSON avoids leaking sensitive values.
Monitoring & Observability¶
Key Metrics¶
| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| `http_requests_total{route="/audit/v1/records"}` | counter | Request rate | Anomaly vs baseline |
| `http_request_duration_ms` | histogram | Latency | p95 > 50 ms (5m) |
| `http_responses_total{status=4xx/5xx}` | counter | Error rates | > 1% 4xx (validation spikes), any 5xx |
| `rate_limited_total` | counter | 429 responses | > 5% sustained |
Logging Requirements¶
- Structured logs with `tenantId`, `traceId`, `idempotencyKey` (hashed), `statusCode`; no sensitive payloads.
Distributed Tracing¶
- Propagate `traceparent` to Ingestion; spans `gateway.authz`, `gateway.forward` with attributes `tenant`, `payloadBytes`.
Health Checks¶
- Liveness: process/thread checks; Readiness: JWKS reachable, Ingestion upstream healthy.
Operational Procedures¶
Deployment¶
- Deploy Gateway route behind feature flag `ingest.rest.enabled=false`.
- Smoke test with signed JWT and minimal payload; verify 202 and headers.
- Enable the feature flag and gradually raise rate limits.
Configuration¶
- Env Vars / Config: JWKS URI, audiences, rate limit buckets, max payload bytes, allowed CORS origins/headers.
- Headers to forward: `traceparent`, `x-tenant-id`, `x-idempotency-key`.
Maintenance¶
- Rotate keys/JWKS; cache with TTL; monitor expired/invalid token spikes.
Troubleshooting¶
- Many 401s → check JWKS drift/clock skew.
- Many 415s → clients mis-sending `Content-Type`.
- Elevated 409s → investigate idempotency key collisions in the client.
Testing Scenarios¶
Happy Path Tests¶
- Valid POST returns 202 with `status:"Created"` and `auditRecordId`.
- Duplicate `x-idempotency-key` returns 202 with `status:"Duplicate"`.
- Trace propagation: `traceId` echoed matches `traceparent`.
Error Path Tests¶
- 400 invalid action; pointer `/action`.
- 404 wrong route (e.g., `/audit/v2/...`).
- 409 conflicting idempotency key (distinct payload).
- 415 wrong media type; 413 too large.
- 429 with `Retry-After`; 503 transient outage.
Performance Tests¶
- Sustain 500 rps tenant; p95 ≤ 50 ms.
- Burst to 2k rps without >1% errors.
Security Tests¶
- JWT expiration & audience checks enforced.
- CORS preflight honors allowed origins and headers.
- Tenant mismatch (header vs body) rejected with 403.
Related Documentation¶
Internal References¶
Related Flows¶
- gRPC Service Ingestion Flow
- Service Bus (MassTransit) Ingestion Flow
- Retry Flow
External References¶
- RFC 7807 (Problem Details for HTTP APIs)
- W3C Trace Context (traceparent)
Appendices¶
A. cURL Examples¶
curl -sS -X POST "https://api.atp.example/audit/v1/records" \
-H "Authorization: Bearer $TOKEN" \
-H "x-tenant-id: acme" \
-H "traceparent: 00-$(uuidgen | tr 'A-Z' 'a-z' | tr -d '-')-$(uuidgen | tr 'A-Z' 'a-z' | tr -d '-' | cut -c1-16)-01" \
-H "x-idempotency-key: acme-ord-9981-v1" \
-H "Content-Type: application/json; charset=utf-8" \
--data-binary @record.json
B. Rate Limiting Headers (example)¶
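No canonical header set is mandated here; the response below is an illustrative sketch using the common `X-RateLimit-*` convention (exact header names vary by gateway product and are an assumption):

```http
HTTP/1.1 429 Too Many Requests
Retry-After: 2
X-RateLimit-Limit: 500
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1761134460
Content-Type: application/problem+json
```

On successful 202 responses the same `X-RateLimit-*` headers would be present with a non-zero `Remaining`, letting well-behaved clients throttle before hitting 429.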
gRPC Service Ingestion Flow¶
High-QPS, low-latency transport for appending individual AuditRecord items using gRPC. Clients call a unary Append RPC on the Gateway, passing metadata for tenant, traceparent, idempotency, and authorization. The Gateway authenticates/authorizes and forwards to Ingestion; responses use canonical gRPC status codes with retry/backoff guidance.
Overview¶
Purpose: Provide a high-throughput ingestion path with efficient framing, multiplexing, and connection reuse.
Scope: gRPC method shape, metadata requirements, authN/Z, rate limiting, error code mapping, retries/backoff, and sample code-first contracts. Excludes batch uploads and message bus ingestion.
Context: Preferred for service-to-service producers and heavy internal traffic; functionally equivalent to REST ingestion but with gRPC semantics.
Key Participants:
- gRPC Client (producer)
- gRPC Gateway (edge; authN/Z, limits, metadata mapping)
- Ingestion Service (validate/canonicalize, policy/classification/redaction)
- Storage (Authoritative) (append/WORM)
Prerequisites¶
System Requirements¶
- Gateway and Ingestion expose/accept HTTP/2 with TLS (mTLS optional for internal meshes)
- OIDC/JWKS configured at the Gateway to validate `authorization` metadata
- Network connectivity Gateway ↔ Ingestion ↔ Storage/Policy services
Business Requirements¶
- Tenant active and mapped to partitions/regions
- Policy and retention configured for tenant
- Edition flags (e.g., max RPS) set if applicable
Performance Requirements¶
- Connection pooling enabled; client max concurrent streams tuned (HTTP/2)
- End-to-end p95 ≤ 40 ms at target RPS; message size ≤ 256 KiB
- Idempotency store sized for 24h dedupe window
Sequence Flow¶
Happy Path¶
sequenceDiagram
autonumber
actor Client as gRPC Client
participant GW as gRPC Gateway
participant Ing as Ingestion Service
participant Store as Storage (Authoritative)
Client->>GW: Append(AuditRecord) + metadata{authorization, x-tenant-id, traceparent, x-idempotency-key}
Note right of GW: Validate token, tenant scope, rate limit; map metadata → headers
GW->>Ing: Append(request, forwarded metadata)
Ing->>Ing: Validate + canonicalize + policy/classification/redaction
Ing->>Store: INSERT canonical record (WORM)
Store-->>Ing: ack {auditRecordId}
Ing-->>GW: AppendReply {auditRecordId, status=Created}
GW-->>Client: OK (AppendReply) + trailers {traceId}
Alternative Paths¶
- Duplicate idempotency key: return `OK` with `status=Duplicate` and the original `auditRecordId`.
- Server-assigned ID: client omits `auditRecordId`; service returns assigned ULID.
- Streaming batch (future): optional client- or server-streaming variants reuse the same metadata (not covered here).
Error Paths¶
sequenceDiagram
actor Client
participant GW as gRPC Gateway
Client->>GW: Append(bad or unauthorized)
alt Invalid argument / too large
GW-->>Client: INVALID_ARGUMENT / RESOURCE_EXHAUSTED
else Unauthenticated / permission denied
GW-->>Client: UNAUTHENTICATED / PERMISSION_DENIED
else Not found route / disabled
GW-->>Client: NOT_FOUND
else Idempotency conflict (payload differs)
GW-->>Client: ALREADY_EXISTS
else Rate limited / upstream unavailable
GW-->>Client: RESOURCE_EXHAUSTED / UNAVAILABLE
end
Request/Response Specifications¶
Input Requirements¶
| Field | Type | Required | Description | Validation |
|---|---|---|---|---|
| RPC | unary | Y | `Append(AppendRequest) returns (AppendReply)` | gRPC |
| `authorization` (metadata) | string | Y | `Bearer <JWT>` | Valid signature, audience, tenant claim |
| `x-tenant-id` (metadata) | string | Y | Tenant routing | `^[A-Za-z0-9._-]{1,128}$` |
| `traceparent` (metadata) | string | Y | W3C Trace Context | 55-char format |
| `x-idempotency-key` (metadata) | string | Y | Dedupe per tenant (24h) | ≤128 visible ASCII |
| `AppendRequest.auditRecord` | message | Y | Canonical `AuditRecord` | See Data Model limits (≤ 256 KiB) |
| `AppendRequest.schemaVersion` | string | Y | Contract version | Known & active |
Metadata naming: gRPC metadata keys are lowercase ASCII. Use exactly: `authorization`, `x-tenant-id`, `traceparent`, `x-idempotency-key`.
Output Specifications¶
| Field | Type | Description | Notes |
|---|---|---|---|
| `AppendReply.auditRecordId` | string (ULID) | Durable id | Assigned or echoed |
| `AppendReply.status` | enum | `Created` or `Duplicate` | Idempotent result |
| `AppendReply.observedAt` | timestamp | Ingestion observation | ms precision |
| `trailers:traceid` | hex32 | Correlation id | Derived from `traceparent` |
Example Payloads¶
Proto (illustrative; see code-first C# below)
service AuditIngestion {
rpc Append (AppendRequest) returns (AppendReply);
}
message AppendRequest {
string schemaVersion = 1;
AuditRecord auditRecord = 2;
}
message AppendReply {
string auditRecordId = 1;
string status = 2; // "Created" | "Duplicate"
string observedAt = 3; // ISO-8601 UTC
}
Example grpcurl
grpcurl -d @ \
-H "authorization: Bearer $TOKEN" \
-H "x-tenant-id: acme" \
-H "traceparent: 00-3e1f2d0c9b8a7f6e5d4c3b2a19081716-7f6e5d4c3b2a1908-01" \
-H "x-idempotency-key: acme-ord-9981-v1" \
api.atp.example:443 audit.AuditIngestion/Append <<'JSON'
{
"schemaVersion": "auditrecord.v1",
"auditRecord": {
"tenantId": "acme",
"createdAt": "2025-10-22T12:00:03.100Z",
"action": "user.create",
"resource": { "type": "Iam.User", "id": "U-1001" },
"actor": { "id": "svc_ingress", "type": "Service" }
}
}
JSON
Error Handling¶
Error Scenarios (gRPC ↔ HTTP analogy)¶
| gRPC Code | HTTP Analogy | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|---|
| OK | 202 | Created or Duplicate | — | — |
| INVALID_ARGUMENT | 400 | Schema/shape/limits invalid | Fix per details | No retry until corrected |
| NOT_FOUND | 404 | Unknown service/method or tenant/feature disabled | Check route/tenant | No retry |
| ALREADY_EXISTS | 409 | Idempotency conflict (same key, different payload) | Use new key; reconcile | Do not retry with same key |
| UNAUTHENTICATED | 401 | Missing/invalid token | Acquire valid JWT | Retry after fix |
| PERMISSION_DENIED | 403 | Insufficient scope or tenant mismatch | Adjust perms/tenant | No retry until corrected |
| RESOURCE_EXHAUSTED | 429 | Rate limit/backpressure | Honor retry hints | Exponential backoff + jitter |
| UNAVAILABLE | 503 | Upstream unavailable / transient gateway error | Wait for recovery | Retry with same idempotency key |
| DEADLINE_EXCEEDED | 504 | Client/server deadline hit | Increase deadline if safe | Limited retries |
| INTERNAL | 500 | Unexpected server error | Open incident if persistent | Bounded retries with backoff |
Failure Modes¶
- Metadata missing/uppercase: gRPC metadata keys must be lowercase; missing required keys → `INVALID_ARGUMENT`.
- Clock skew: `createdAt > now+2m` → `INVALID_ARGUMENT` with field pointer.
- Concurrent duplicates: distinct payload under same key → `ALREADY_EXISTS`.
Recovery Procedures¶
- For 4xx analogs (`INVALID_ARGUMENT`, `PERMISSION_DENIED`, `ALREADY_EXISTS`, `NOT_FOUND`), fix request/config before retry.
- For `RESOURCE_EXHAUSTED`/`UNAVAILABLE`/`DEADLINE_EXCEEDED`, back off with jitter; reuse `x-idempotency-key`.
- Log/propagate `traceid` from trailers for correlation.
Performance Characteristics¶
Latency Expectations¶
- P50: 10–20 ms
- P95: ≤ 40 ms
- P99: ≤ 75 ms
Throughput Limits¶
- Per connection: hundreds of concurrent streams (HTTP/2)
- Per tenant: baseline 1k rps sustained, burst 4k rps (edition dependent)
- Global: scales linearly with Gateway instances
Resource Requirements¶
- Persistent HTTP/2 channels; tune client pool size and max streams per connection.
Scaling Considerations¶
- Horizontal scale Gateway on RPS/p95; shard by tenant/region.
- Configure server and client receive/send message size caps (≤ 256 KiB).
Security & Compliance¶
Authentication¶
- `authorization` metadata with OIDC JWT; short-lived (≤ 15m), leeway ±60s; optional mTLS for extra assurance.
Authorization¶
- Require `audit:append` scoped to `x-tenant-id`; Gateway enforces RBAC/ABAC.
Data Protection¶
- TLS 1.2+; no sensitive data in logs; redaction/classification applied by Ingestion before persist.
Compliance¶
- Producer identity, idempotency key hash, and decision trail logged; aligns with privacy/PII policies.
Monitoring & Observability¶
Key Metrics¶
| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| `grpc_server_started_total` | counter | Calls started | Anomaly detection |
| `grpc_server_handled_total{code}` | counter | Calls by status code | Any 5xx analog; spikes in `INVALID_ARGUMENT` |
| `grpc_server_handling_seconds` | histogram | Latency | p95 > 40 ms |
| `rate_limited_total` | counter | `RESOURCE_EXHAUSTED` responses | > 5% sustained |
Logging Requirements¶
- Structured logs: `tenant`, `traceId`, `idempotencyKey` (hashed), `grpc.code`, `latencyMs`; omit payload bodies.
Distributed Tracing¶
- Map `traceparent` to gRPC context; spans: `gateway.authz`, `ingestion.append`. Include attributes `tenant`, `payloadBytes`.
Health Checks¶
- Liveness: process/thread; Readiness: JWKS reachability, upstream Ingestion healthy.
Operational Procedures¶
Deployment¶
- Enable gRPC port/route under flag `ingest.grpc.enabled=false`.
- Smoke test with signed JWT and minimal payload; verify `OK` and trailers.
- Gradually raise per-tenant limits; observe `grpc_server_handled_total{code!="OK"}`.
Configuration¶
- Gateway: JWKS URI, audiences, rate limits, max recv/send message bytes, allowed metadata keys/size.
- Client: channel pool size, per-call deadline (e.g., 2s), retry policy (`UNAVAILABLE`, `RESOURCE_EXHAUSTED`).
Maintenance¶
- Rotate JWKS/keys; monitor token validation failures; tune backoff policies.
Troubleshooting¶
- Many `INVALID_ARGUMENT` → inspect validation pointers; schema drift.
- Many `UNAVAILABLE` → upstream health; check saturation.
- Frequent `ALREADY_EXISTS` → idempotency key collisions; fix client keying.
Testing Scenarios¶
Happy Path Tests¶
- Valid Append returns OK with `status:"Created"` and `auditRecordId`.
- Duplicate `x-idempotency-key` returns OK with `status:"Duplicate"`.
Error Path Tests¶
- Missing `x-tenant-id` → INVALID_ARGUMENT.
- Unknown method/route → NOT_FOUND.
- Conflicting idempotency payload → ALREADY_EXISTS.
- Rate limit → RESOURCE_EXHAUSTED with retry backoff honored.
Performance Tests¶
- Sustain 1k rps/tenant with p95 ≤ 40 ms.
- Connection reuse across 10k calls without reconnect churn.
Security Tests¶
- JWT expiration/audience enforced.
- Tenant mismatch (metadata vs body) → PERMISSION_DENIED.
- Trace propagation verified end-to-end.
Related Documentation¶
Internal References¶
Related Flows¶
- Standard Audit Record Ingestion Flow
- Retry Flow
- Distributed Tracing Flow
External References¶
- gRPC Status Codes guide
- W3C Trace Context
Appendices¶
A. C# gRPC code-first contract (protobuf-net.Grpc style)¶
using System.ServiceModel;
using ProtoBuf.Grpc;
using ProtoBuf.Grpc.Configuration;
[Service]
public interface IAuditIngestionService
{
[Operation]
Task<AppendReply> AppendAsync(AppendRequest request, CallContext context = default);
}
public sealed class AppendRequest
{
public string SchemaVersion { get; set; } = "auditrecord.v1";
public AuditRecord AuditRecord { get; set; } = default!;
}
public sealed class AppendReply
{
public string AuditRecordId { get; set; } = default!;
public string Status { get; set; } = "Created"; // or "Duplicate"
public DateTimeOffset ObservedAt { get; set; }
}
B. C# client stub usage (metadata mapping)¶
using Grpc.Core;
using Grpc.Net.Client;
using ProtoBuf.Grpc;
using ProtoBuf.Grpc.Client;
var channel = GrpcChannel.ForAddress("https://api.atp.example");
var client = channel.CreateGrpcService<IAuditIngestionService>();
var headers = new Metadata {
{ "authorization", $"Bearer {token}" },
{ "x-tenant-id", "acme" },
{ "traceparent", traceparent },
{ "x-idempotency-key", "acme-ord-9981-v1" }
};
var ctx = new CallContext(new CallOptions(headers: headers, deadline: DateTime.UtcNow.AddSeconds(2)));
var reply = await client.AppendAsync(new AppendRequest {
SchemaVersion = "auditrecord.v1",
AuditRecord = record
}, ctx);
C. Recommended client retry policy (pseudocode)¶
retry on: UNAVAILABLE, RESOURCE_EXHAUSTED, DEADLINE_EXCEEDED
backoff: exponential (base 100ms, max 5s), jitter 20%
max attempts: 5
reuse same x-idempotency-key
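The policy above as a runnable sketch: exponential backoff from a 100 ms base capped at 5 s, ±20% jitter, at most 5 attempts, and the same `x-idempotency-key` on every attempt. The `call` parameter stands in for the gRPC Append invocation and is assumed to raise the status name on failure; all names are illustrative:

```python
import random
import time

RETRYABLE = {"UNAVAILABLE", "RESOURCE_EXHAUSTED", "DEADLINE_EXCEEDED"}

def append_with_retry(call, idempotency_key, max_attempts=5,
                      base=0.1, cap=5.0, jitter=0.2, sleep=time.sleep):
    for attempt in range(max_attempts):
        try:
            return call(idempotency_key)   # reuse the SAME key each attempt
        except RuntimeError as err:
            # Non-retryable codes, or the last attempt, propagate.
            if str(err) not in RETRYABLE or attempt == max_attempts - 1:
                raise
            delay = min(cap, base * 2 ** attempt)
            sleep(delay * (1 + random.uniform(-jitter, jitter)))
```

Reusing the key means a retry that races a successful earlier attempt is answered with `Duplicate` rather than creating a second record.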
Service Bus (MassTransit) Ingestion Flow¶
Asynchronous ingestion path using the Outbox → Bus → Inbox pattern. A producer writes to its own Outbox in the same transaction as its business change; an Outbox Dispatcher publishes to the MassTransit bus. The Ingestion Consumer reads messages, performs validation/canonicalization, applies dedupe/idempotency, appends to the WORM store, and emits AuditRecord.Accepted. Poison messages are routed to a DLQ with reprocess tooling.
Overview¶
Purpose: Provide a resilient, high-throughput async ingestion path with exactly-once effects (at-least-once delivery + idempotent consumer).
Scope: Producer outbox semantics, bus delivery (MassTransit), consumer inbox/deduplication, retry/backoff, DLQ handling, and operational reprocessing. Excludes REST/gRPC transports and batch presigned uploads.
Context: Recommended for internal microservices and partner pipelines that already publish domain events.
Key Participants:
- Producer Service (business txn + Outbox write)
- Outbox Dispatcher (background publisher)
- Message Bus (MassTransit over RabbitMQ/Azure SB/Kafka)
- Ingestion Consumer (MassTransit consumer)
- Idempotency Store (consumer-inbox/dedupe keys)
- Storage (Authoritative) (append-only WORM)
- DLQ / Error Queue (quarantine and reprocess)
Prerequisites¶
System Requirements¶
- MassTransit configured with a supported broker and durable queues/topics
- Producer DB migration includes Outbox table (append-only)
- Ingestion Consumer has Idempotency/Inbox store (e.g., table or cache)
- Network connectivity Producer ↔ Broker ↔ Ingestion; TLS enabled end-to-end
Business Requirements¶
- Tenants provisioned; routing keys/partitions defined per tenant
- Policy/retention/classification configured (used by Ingestion)
- DLQ retention meets compliance requirements
Performance Requirements¶
- Producer Outbox dispatch interval (poll/batch size) tuned for target throughput
- Consumer prefetch/concurrency tuned; p95 end-to-append ≤ 100 ms under load
- Broker quotas/partitions sized for expected peak
Sequence Flow¶
Happy Path¶
sequenceDiagram
autonumber
participant Prod as Producer Service
participant DB as Producer DB + Outbox
participant Disp as Outbox Dispatcher
participant Bus as Message Bus (MassTransit)
participant Cons as Ingestion Consumer
participant Idem as Idempotency Store
participant Store as Storage (Authoritative)
Prod->>DB: BEGIN TX: business change + INSERT Outbox{Message, IdempotencyKey, Tenant, Trace}
DB-->>Prod: COMMIT
Disp->>DB: Poll Outbox (unpublished rows)
Disp->>Bus: Publish AuditRecordEnvelope (MessageId, CorrelationId, headers)
Bus-->>Disp: Ack (broker)
Bus-->>Cons: Deliver message
Cons->>Idem: Check/put(idempotencyKey) // atomic get-or-create
alt First delivery
Cons->>Cons: Validate + canonicalize + policy/classification/redaction
Cons->>Store: INSERT canonical record (WORM)
Store-->>Cons: ack {auditRecordId}
Cons->>Idem: Mark completed(auditRecordId)
else Duplicate
Idem-->>Cons: already completed
Cons->>Cons: Skip side effects, ack broker
end
Alternative Paths¶
- Transactional Outbox (in-proc): Outbox insert is in the same DB transaction as business write (recommended).
- Partition affinity: route by `tenantId` (or `resourceId`) to guarantee in-order delivery per key.
- Saga assistance: optional MassTransit saga can coordinate multi-message batches or ensure exactly one finalization event per batch.
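Partition affinity can be sketched as a stable hash of the routing key, so every message for a tenant lands on the same partition; the partition count and hash choice are illustrative:

```python
import hashlib

# Sketch: deterministic partition selection by routing key
# (tenantId or resourceId), independent of process or restart.
def select_partition(routing_key: str, partitions: int = 16) -> int:
    digest = hashlib.sha256(routing_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partitions
```

A cryptographic hash is used here only for its stable, well-distributed output; any consistent non-cryptographic hash would serve.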
Error Paths¶
sequenceDiagram
participant Disp as Outbox Dispatcher
participant Bus as Message Bus
participant Cons as Ingestion Consumer
participant DLQ as Dead Letter Queue
Disp->>Bus: Publish
alt Broker unavailable
Bus-->>Disp: nack/exception
Disp->>Disp: Retry with exponential backoff, do not delete Outbox row
end
Bus-->>Cons: Deliver message
alt Validation fails (poison message)
Cons-->>Bus: reject (no requeue)
Bus-->>DLQ: route
else Transient error (Storage 503)
Cons-->>Bus: nack (requeue)
Bus->>Cons: redeliver with backoff
end
Request/Response Specifications¶
This flow is message-based. The message contract and headers are the stable surface. Operational HTTP endpoints (status, reprocess) are listed for completeness.
Input Requirements (message contract)¶
| Field | Type | Required | Description | Validation |
|---|---|---|---|---|
| `MessageId` | GUID/ULID | Y | Broker message id | Generated by bus |
| `CorrelationId` | GUID/ULID | Y | Correlates with trace/saga | Present |
| `IdempotencyKey` | string | Y | Stable key per producer event | ≤128 ASCII |
| `TenantId` | string | Y | Tenant scope | Header & body match |
| `Traceparent` | string | Y | W3C trace context | 55-char format |
| `SchemaVersion` | string | Y | `auditrecord.v1` | Known |
| `AuditRecord` | object | Y | Canonical fields | ≤ 256 KiB after serialize |
Recommended headers (MassTransit): `tenant-id`, `traceparent`, `idempotency-key`, `schema-version`, `content-type=application/json`
Output Specifications¶
| Field | Type | Description | Notes |
|---|---|---|---|
| `AuditRecord.Accepted` | event | Downstream event from Storage | Async |
| Consumer ack | broker ack | Successful handle | Commit offset / ack message |
| DLQ message | broker dead-letter | On poison / max retries exceeded | Inspect & reprocess |
Example Message (Envelope)¶
{
"SchemaVersion": "auditrecord.v1",
"IdempotencyKey": "acme:order#9981:v1",
"TenantId": "acme",
"AuditRecord": {
"tenantId": "acme",
"createdAt": "2025-10-22T12:00:03.100Z",
"action": "user.create",
"resource": { "type": "Iam.User", "id": "U-1001" },
"actor": { "id": "svc_billing", "type": "Service" }
}
}
Error Handling¶
Error Scenarios (bus & ops APIs)¶
| Code/Outcome | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| INVALID (poison) → DLQ | Schema/shape invalid at consumer | Quarantine; fix mapper or data | Reprocess after fix |
| Requeue | Storage/Policy transient failure | Backoff & retry | Exponential backoff + jitter |
| Duplicate (idempotent skip) | IdempotencyKey already completed | No action | Ack immediately |
| 400 Bad Request (ops API) | Bad reprocess/status request | Correct request | No retry until fixed |
| 404 Not Found (ops API) | Unknown batch/msgId/tenant | Verify identifiers | — |
| 409 Conflict (ops API) | Reprocess while job active | Wait & retry | After completion |
| 503 Service Unavailable | Broker or Storage outage | Wait for recovery | Bounded backoff, circuit-breaker |
Failure Modes¶
- Outbox row deletion before publish: never delete until broker ack; use a `published_at IS NOT NULL` marker.
- Inbox/idempotency race: ensure atomic get-or-create; use a unique index on `(TenantId, IdempotencyKey)`.
- Re-delivery storm: cap retries; move to DLQ after N attempts.
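The atomic get-or-create can be sketched with the unique index doing the work, here in SQLite via `INSERT OR IGNORE` (table and column names are illustrative):

```python
import sqlite3

# Sketch: the unique index on (TenantId, IdempotencyKey) makes
# get-or-create atomic -- a racing second writer's INSERT is a no-op
# and it observes the first writer's row.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE Inbox (
        TenantId TEXT, IdempotencyKey TEXT, AuditRecordId TEXT,
        UNIQUE (TenantId, IdempotencyKey)
    )""")

def get_or_create(tenant: str, key: str, record_id: str) -> tuple[str, bool]:
    """Return (winning record id, True if this call created the row)."""
    cur = db.execute(
        "INSERT OR IGNORE INTO Inbox VALUES (?, ?, ?)",
        (tenant, key, record_id))
    created = cur.rowcount == 1
    row = db.execute(
        "SELECT AuditRecordId FROM Inbox WHERE TenantId=? AND IdempotencyKey=?",
        (tenant, key)).fetchone()
    return row[0], created
```

Only the `created=True` path performs side effects (WORM append, `Accepted` event); redeliveries skip straight to acking the broker.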
Recovery Procedures¶
- Inspect DLQ; download sample and Problem details if present.
- Patch mapper/policy or data; use reprocess API/command to move back to primary queue.
- For stuck Outbox rows, resume dispatcher (no manual delete).
Performance Characteristics¶
Latency Expectations¶
- Outbox write: ~1–2 ms (in-proc tx)
- Dispatch to broker: sub-10 ms typical
- Consume → append: p95 ≤ 100 ms steady state
Throughput Limits¶
- Producer: controlled by Outbox polling batch size (e.g., 500) and dispatch concurrency.
- Consumer: controlled by prefetch (e.g., 256) and concurrency (e.g., 8–32).
- Broker: ensure partitions/queues per tenant or shard.
Resource Requirements¶
- Producer DB IOPS for Outbox; Consumer CPU for JSON + hashing; Idempotency store write IOPS.
Scaling Considerations¶
- Scale by queue/partition per tenant/shard; increase consumer count.
- Use bulk publish from dispatcher; avoid tiny batches.
Security & Compliance¶
Authentication¶
- Broker auth via username/secret/SAS; TLS enabled. MassTransit transport credentials stored securely.
Authorization¶
- Topic/queue ACLs restrict producers/consumers to tenant-scoped routes.
Data Protection¶
- Message payloads encrypted on the wire (TLS); sensitive attributes redacted by Ingestion before persist.
Compliance¶
- Retain DLQ items per policy; operations on DLQ are audited (who/when reprocessed or purged).
Monitoring & Observability¶
Key Metrics¶
| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| `outbox_rows_pending` | gauge | Unpublished rows | Growth > 3× baseline |
| `dispatcher_publish_rate` | counter | Messages/sec to broker | Drop vs ingest |
| `consumer_lag` | gauge | Backlog size/age | Age > 60s |
| `consumer_retry_total` | counter | Redeliveries | Spike indicates transient failures |
| `dlq_messages_total` | counter | DLQ count | > 0 sustained |
Logging Requirements¶
- Include `tenant`, `messageId`, `idempotencyKey` (hashed), `deliveryAttempt`, and the DLQ-vs-retry decision; never log full payloads.
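A minimal sketch of the hashed-key logging convention, in Python for illustration; the SHA-256 digest and 16-character truncation are assumptions, not an ATP-mandated format, and the field values are illustrative:

```python
import hashlib

def idempotency_key_hash(key):
    # A short, stable digest; safe to log because the raw key never appears
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]

# Example structured-log fields (values are illustrative)
log_fields = {
    "tenant": "acme",
    "messageId": "msg-001",
    "idempotencyKey": idempotency_key_hash("acme:order-42:v1"),
    "deliveryAttempt": 1,
    "decision": "retry",  # or "dlq"
}
```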
Distributed Tracing¶
- Propagate `traceparent` via message headers; spans: `outbox.enqueue`, `dispatcher.publish`, `consumer.handle`, `storage.append`.
Health Checks¶
- Producer: DB + broker connectivity; Consumer: broker + Storage/Idempotency write access.
Operational Procedures¶
Deployment¶
- Migrate Producer DB to add Outbox table; enable MassTransit outbox middleware.
- Deploy Ingestion Consumer with inbox/idempotency enabled (unique key index).
- Create queues/topics, bindings, and DLQ; enable TLS and ACLs.
Configuration¶
- Producer: `OutboxPollIntervalMs`, `OutboxBatchSize`, broker connection, TLS certs.
- Consumer: `PrefetchCount`, `ConcurrentMessageLimit`, retry policy (incremental/exponential), idempotency TTL.
- Routing: exchange/topic per `tenantId` or shard key.
Maintenance¶
- Purge published Outbox rows by retention (based on `published_at`).
- DLQ review and reprocess runs; archive old DLQ messages per policy.
Troubleshooting¶
- Rising `outbox_rows_pending` → broker unreachable or dispatch stalled.
- Spiking `consumer_retry_total` → investigate Storage/Policy health.
- Many duplicates → check idempotency unique index and key construction.
Testing Scenarios¶
Happy Path Tests¶
- Business txn writes Outbox; Dispatcher publishes; Consumer appends; `Accepted` observed.
- Duplicate delivery skipped via idempotency store.
Error Path Tests¶
- Poison message → DLQ; reprocess after fix returns success.
- Broker outage → Outbox retains; auto-catchup after recovery.
- Ops API: 400 bad reprocess request; 404 unknown message; 409 reprocess job already running.
Performance Tests¶
- Validate throughput at target RPS with prefetch/concurrency sweeps.
- Backpressure behavior under Storage throttling.
Security Tests¶
- Tenant isolation via routing and ACLs.
- TLS enforcement; credentials rotation without downtime.
Related Documentation¶
Internal References¶
- Standard Audit Record Ingestion Flow
- Validation & Classification Flow
- Retry Flow / Dead Letter Queue Flow
- Data Model
Related Flows¶
- Orleans Actor Ingestion Flow
External References¶
- MassTransit Outbox/Inbox docs for chosen transport
- Broker-specific DLQ and retry policies
Appendices¶
A. Producer Outbox table (example)¶
CREATE TABLE Outbox (
Id bigint IDENTITY PRIMARY KEY,
MessageId uniqueidentifier NOT NULL,
IdempotencyKey nvarchar(128) NOT NULL,
TenantId nvarchar(128) NOT NULL,
Body varbinary(max) NOT NULL,
Traceparent nvarchar(64) NULL,
CreatedAt datetime2 NOT NULL DEFAULT sysutcdatetime(),
PublishedAt datetime2 NULL
);
CREATE UNIQUE INDEX UX_Outbox_Idempotency ON Outbox (TenantId, IdempotencyKey);
B. Consumer Idempotency (Inbox) table (example)¶
CREATE TABLE ConsumerInbox (
TenantId nvarchar(128) NOT NULL,
IdempotencyKey nvarchar(128) NOT NULL,
CompletedAt datetime2 NULL,
AuditRecordId char(26) NULL, -- ULID
PRIMARY KEY (TenantId, IdempotencyKey)
);
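The `(TenantId, IdempotencyKey)` primary key above is what makes get-or-put atomic. Below is a minimal in-memory stand-in for the consumer-side contract, in Python for illustration; a real store would use `INSERT ... ON CONFLICT` or an equivalent conditional write:

```python
class InMemoryInbox:
    """Illustrative stand-in for ConsumerInbox: the (TenantId, IdempotencyKey)
    primary key means only the first delivery creates a row."""

    def __init__(self):
        self._rows = {}

    def get_or_put(self, tenant_id, idempotency_key):
        """Return (first_delivery, existing_audit_record_id)."""
        key = (tenant_id, idempotency_key)
        if key in self._rows:
            return False, self._rows[key]["audit_record_id"]
        self._rows[key] = {"audit_record_id": None, "completed_at": None}
        return True, None

    def mark_completed(self, tenant_id, idempotency_key, audit_record_id):
        row = self._rows[(tenant_id, idempotency_key)]
        row["audit_record_id"] = audit_record_id
        row["completed_at"] = "now"  # real store: sysutcdatetime()
```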
C. C# Contracts (MassTransit)¶
public record AuditRecordEnvelope(
string SchemaVersion,
string IdempotencyKey,
string TenantId,
AuditRecord AuditRecord
);
// Configure send
cfg.Message<AuditRecordEnvelope>(x => x.SetEntityName("audit.ingest"));
cfg.Send<AuditRecordEnvelope>(x => {
x.UseRoutingKeyFormatter(ctx => ctx.Message.TenantId);
});
Orleans Actor Ingestion Flow¶
Actor-to-actor ingestion path using Microsoft Orleans. A producer Grain invokes an Ingestion Grain with an AuditRecord and context (tenant, traceparent, idempotencyKey). The Ingestion Grain enforces at-least-once delivery with idempotent effects, appends to the WORM store, and returns an AppendResult. Notes cover activation, placement, and reentrancy to achieve high concurrency without duplication.
Overview¶
Purpose: Provide a low-latency, in-cluster ingestion path that preserves actor semantics and ordering guarantees per key.
Scope: Orleans grain contract, RequestContext propagation, idempotency/inbox, storage append, reentrancy, activation/placement, and failure handling including DLQ for poison messages. Excludes REST/gRPC and external bus transports.
Context: Used by actor-based services already running on Orleans (e.g., domain aggregates or workflow grains); per-tenant or per-resource sharding maps naturally to grain keys.
Key Participants:
- Producer Grain (domain actor generating audit facts)
- Ingestion Grain (`IAuditIngestionGrain`) — validates, canonicalizes, dedupes, appends
- Idempotency/Inbox Store — per-grain dedupe table or grain state
- Storage (Authoritative) — append-only WORM store
- DLQ (optional) — for poison inputs when configured
Prerequisites¶
System Requirements¶
- Orleans cluster healthy (silos, membership, reminders/timers)
- RequestContext propagation enabled between grains
- Ingestion Grain type registered; access to Storage and Idempotency store
- TLS/mTLS for silo-to-silo traffic if crossing nodes/regions
Business Requirements¶
- Tenants configured; placement strategy keyed by `(tenantId[, shard])`
- Policy/retention/classification active for tenant
- DLQ or operator alerting policy defined for poison records
Performance Requirements¶
- Ingestion Grain reentrancy policy chosen (see below) and tested at target RPS
- Per-grain mailboxes sized; throughput meets ingest parity
- Idempotency lookup p95 ≤ 5 ms (local state or fast store)
Sequence Flow¶
Happy Path¶
sequenceDiagram
autonumber
participant Producer as Producer Grain
participant Ing as Ingestion Grain (IAuditIngestionGrain)
participant Inbox as Idempotency/Inbox Store
participant Store as Storage (Authoritative)
Producer->>Ing: Append(auditRecord, idempotencyKey)<br/>(RequestContext: tenant, traceparent)
Ing->>Ing: Validate + canonicalize + policy/classification/redaction
Ing->>Inbox: GetOrPut(tenant,idempotencyKey)
alt First delivery
Ing->>Store: INSERT canonical record (WORM)
Store-->>Ing: ack {auditRecordId}
Ing->>Inbox: MarkCompleted(auditRecordId)
Ing-->>Producer: AppendResult {auditRecordId, status:"Created"}
else Duplicate
Inbox-->>Ing: Found Completed(auditRecordId)
Ing-->>Producer: AppendResult {auditRecordId, status:"Duplicate"}
end
Alternative Paths¶
- Per-tenant placement: `IAuditIngestionGrain` keys on `tenantId` (or `(tenantId, shard)`), preserving ordering within the key while allowing horizontal scale across tenants/shards.
- Local persistent state inbox: use Orleans `PersistentState` within the grain for the fastest dedupe, or an external table if cross-language consumers also write.
- Reentrant grain: enable reentrancy to allow concurrent requests sharing the same trace id/group; protect critical sections (idempotency write + store append) with coarse-grained serialization.
Error Paths¶
sequenceDiagram
participant Ing as Ingestion Grain
participant Store as Storage
participant Inbox as Idempotency/Inbox
Ing->>Store: INSERT
alt Storage transient
Store-->>Ing: throws transient
Ing->>Ing: Retry with backoff, do not mark inbox completed
else Validation failure (poison)
Ing-->>Ing: throw ValidationException
Ing->>Inbox: MarkFailed(optional) / emit DLQ if configured
end
Request/Response Specifications¶
Input Requirements¶
| Field | Type | Required | Description | Validation |
|---|---|---|---|---|
| `auditRecord` | object | Y | Canonical AuditRecord | Data Model rules; ≤ 256 KiB |
| `idempotencyKey` | string | Y | Unique per submitted record | ≤ 128 ASCII |
| `RequestContext["tenant-id"]` | string | Y | Tenant routing | Must match `auditRecord.tenantId` |
| `RequestContext["traceparent"]` | string | Y | W3C context | 55-char format |
| `RequestContext["schema-version"]` | string | Y | Contract version | Known & active |
Output Specifications¶
| Field | Type | Description | Notes |
|---|---|---|---|
| `AppendResult.auditRecordId` | ULID | Durable id | Assigned or echoed |
| `AppendResult.status` | enum | `Created` or `Duplicate` | Idempotent outcome |
| `AppendResult.observedAt` | timestamp | Ingestion observation | ms precision |
Example Grain Contract (C#)¶
public interface IAuditIngestionGrain : IGrainWithStringKey
{
Task<AppendResult> Append(AuditRecord record, string idempotencyKey);
}
public sealed record AppendResult(string AuditRecordId, string Status, DateTimeOffset ObservedAt);
Producer call
RequestContext.Set("tenant-id", tenantId);
RequestContext.Set("traceparent", traceparent);
RequestContext.Set("schema-version", "auditrecord.v1");
var grain = GrainFactory.GetGrain<IAuditIngestionGrain>(tenantId); // or $"{tenantId}:{shard}"
var result = await grain.Append(record, idempotencyKey);
Error Handling¶
Error Scenarios (Orleans ↔ HTTP analogy)¶
| Orleans Exception/Outcome | HTTP Analogy | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|---|
| — (OK) | 202 Accepted | Created/Duplicate | — | — |
| `ArgumentException` / validation error | 400 Bad Request | Schema/shape/limits invalid | Fix payload | No retry until corrected |
| `GrainReferenceNotFoundException` / unknown key | 404 Not Found | Wrong grain key/tenant or disabled feature | Check routing/tenant | No retry |
| `IdempotencyConflictException` | 409 Conflict | Same key, different payload | Use a new key; reconcile | Do not retry with same key |
| `OrleansException` with `IsTransient` | 503 Service Unavailable | Store or infra transient | Backoff & retry | Exponential backoff + jitter |
| `TimeoutException` | 504 Gateway Timeout | Grain busy or network stall | Increase timeout if safe | Limited retries |
Failure Modes¶
- Reentrancy hazard: racing requests with same key—protect with atomic GetOrPut in inbox and serialize append section.
- Activation churn: hotspot tenants cause frequent activations; use sticky placement and activation warmup.
- Poison record: repeated validation failures—optionally route to DLQ or mark Failed in inbox for operator review.
Recovery Procedures¶
- For transients, retry with jitter; maintain idempotency key.
- For conflict, choose canonical payload and re-attempt with a new key if necessary.
- For poison, capture Problem details and trigger operator workflow or DLQ.
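The "backoff & retry with jitter" strategy above can be sketched as full-jitter exponential backoff; Python for illustration, and the base/cap values are assumptions:

```python
import random

def backoff_delays(base_ms=50, cap_ms=5000, attempts=5, rng=random.random):
    """Full-jitter exponential backoff: delay_n is uniform in
    [0, min(cap_ms, base_ms * 2**n)]. The idempotency key is kept the same
    across retries, so a late duplicate is skipped by the inbox."""
    return [rng() * min(cap_ms, base_ms * 2 ** n) for n in range(attempts)]
```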
Performance Characteristics¶
Latency Expectations¶
- P50: 5–15 ms
- P95: ≤ 35 ms
- P99: ≤ 75 ms
Throughput Limits¶
- Single ingestion grain: thousands of req/s with reentrancy on and critical section minimized.
- Cluster throughput scales linearly with # of silos × # of shards/tenants.
Resource Requirements¶
- CPU for JSON parse/hash; memory for small inbox state.
- Low storage write IOPS per grain; batch commits optional if available in store client.
Scaling Considerations¶
- Placement: prefer hash-based placement by `(tenantId[, shard])`.
- Reentrancy: enable grain reentrancy; serialize only the idempotency + append critical section.
- Backpressure: use `Orleans.Concurrency` limits or custom queue-length monitors to shed load gracefully.
Security & Compliance¶
Authentication¶
- Internal cluster auth (mTLS/IPSec as required); producer identity derived from grain identity and/or tokens in RequestContext if crossing trust boundaries.
Authorization¶
- Validate that the `tenant-id` context matches `auditRecord.tenantId`; enforce RBAC/ABAC as needed for cross-tenant actors.
Data Protection¶
- No sensitive data in logs; redaction/classification applied before persist.
Compliance¶
- Append operations recorded with `tenant`, `grainKey`, `idempotencyKey` (hashed), and `traceId`.
Monitoring & Observability¶
Key Metrics¶
| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| `orleans_grain_queue_length` | gauge | Mailbox depth per ingestion grain | Sustained growth |
| `ingestion_append_latency_ms` | histogram | Grain handle latency | p95 > 35 ms |
| `inbox_getorput_latency_ms` | histogram | Idempotency lookup time | p95 > 5 ms |
| `idempotent_duplicates_total` | counter | Duplicate skips | Track trend |
| `orleans_activations_total` | counter | Activations of ingestion grains | Unexpected spikes |
Logging Requirements¶
- Structured logs: `tenant`, `grainKey`, `traceId`, `idempotencyKey` (hashed), outcome (`Created|Duplicate|Failed`).
Distributed Tracing¶
- Carry `traceparent` in RequestContext; spans: `grain.append`, `inbox.check`, `storage.append`; include `tenant`, `grainKey`.
Health Checks¶
- Silo membership stable; storage reachable; inbox store latency under thresholds.
Operational Procedures¶
Deployment¶
- Register `IAuditIngestionGrain` and storage/idempotency providers; deploy silos.
- Warm hot-tenant grains (pre-activation) to reduce cold-start latency.
- Validate end-to-end append and idempotency in non-prod.
Configuration¶
- Reentrancy: `[Reentrant]` attribute or runtime config as appropriate.
- Placement: consistent hashing or custom placement by tenant.
- Timeouts/Retries: client call timeouts (e.g., 2s) and retry policies for transient exceptions.
Maintenance¶
- Monitor inbox state growth; compact or TTL-complete entries older than dedupe window.
- Rotate cluster certs/keys if mTLS in use.
Troubleshooting¶
- Many `TimeoutException`s → check reentrancy, queue length, storage latency.
- Frequent `IdempotencyConflictException` → investigate client keying logic.
- Activation spikes → adjust placement/keep-alive or increase silos.
Testing Scenarios¶
Happy Path Tests¶
- Append returns `Created` with `auditRecordId`.
- Second call with same `idempotencyKey` returns `Duplicate` without extra writes.
Error Path Tests¶
- Validation error → 400 analog (`ArgumentException`), not persisted.
- Unknown grain key/disabled tenant → 404 analog.
- Conflict on idempotency (different payload) → 409 analog.
- Transient storage failure → retried then succeeds.
Performance Tests¶
- Reentrancy on: sustain target RPS with p95 ≤ 35 ms.
- Critical section profiling (inbox+append) shows minimal blocking.
Security Tests¶
- `tenant-id` in RequestContext matches payload; mismatches rejected.
- Trace propagation visible across grains and storage client.
Related Documentation¶
Internal References¶
Related Flows¶
- gRPC Service Ingestion Flow
- Service Bus (MassTransit) Ingestion Flow
- Retry Flow
External References¶
- Orleans Docs: Grains, Persistence, Reentrancy, RequestContext
Appendices¶
A. Inbox table (if using external store)¶
CREATE TABLE IngestionInbox (
TenantId nvarchar(128) NOT NULL,
IdempotencyKey nvarchar(128) NOT NULL,
Status tinyint NOT NULL, -- 0=Pending,1=Completed,2=Failed
AuditRecordId char(26) NULL,
UpdatedAt datetime2 NOT NULL DEFAULT sysutcdatetime(),
PRIMARY KEY (TenantId, IdempotencyKey)
);
B. Reentrancy pattern (C# sketch)¶
[Reentrant]
public class AuditIngestionGrain : Grain, IAuditIngestionGrain
{
public async Task<AppendResult> Append(AuditRecord record, string key)
{
using var _ = await _criticalSection.EnterAsync(key); // narrow critical region
var (first, existingId) = await _inbox.GetOrPutAsync(record.TenantId, key);
if (!first) return new(existingId, "Duplicate", DateTimeOffset.UtcNow);
var id = await _storage.AppendAsync(record); // may retry internally
await _inbox.MarkCompletedAsync(record.TenantId, key, id);
return new(id, "Created", DateTimeOffset.UtcNow);
}
}
Tenant-Scoped Query Flow¶
Retrieves a tenant’s AuditEvents timeline via the Query Service through the API Gateway. Uses row-level security (RLS) / tenant validation, seek-based pagination (cursor over (createdAt,auditRecordId)), and returns X-Watermark and X-Lag headers indicating projection freshness.
Overview¶
Purpose: Provide a low-latency, read-optimized timeline of audit events for a single tenant with consistent ordering and efficient pagination.
Scope: Gateway authN/Z, tenant scoping (header/path), RLS enforcement in Read DB, timeline query, seek pagination, watermark/lag headers. Excludes full-text search (see Search flow) and on-read PII masking (covered in Data Redaction flow).
Context: Runs against the AuditEvents projection maintained by the Projection Service; consults the Checkpoint Store for the current watermark.
Key Participants:
- Query Client (API consumer)
- API Gateway (authN/Z, rate limiting, header normalization)
- Query Service (query planning, pagination, response shaping)
- Read DB (AuditEvents) (tenant-scoped projection with indexes & RLS)
- Checkpoint Store (per-tenant watermark)
- Cache (optional, key-scoped response caching)
Prerequisites¶
System Requirements¶
- API Gateway reachable with TLS; JWKS configured for JWT validation
- Query Service deployed with network access to Read DB & Checkpoint Store
- Read DB has RLS policies enforcing `tenantId` on `AuditEvents`
- Projection/Checkpoint up and healthy (watermark progressing)
Business Requirements¶
- Tenant exists and is active; edition permits timeline queries
- Data retention/visibility policies do not restrict requested window
- If multi-region, tenant’s home region is routable by Gateway
Performance Requirements¶
- p95 ≤ 150 ms for `limit <= 200` over hot partitions
- Indexes on `(tenantId, createdAt DESC, auditRecordId)` present
- Cache configured (optional) with safe TTL & keying by tenant + params
Sequence Flow¶
Happy Path¶
sequenceDiagram
autonumber
actor Client as Query Client
participant GW as API Gateway
participant Q as Query Service
participant RDB as Read DB (AuditEvents + RLS)
participant CKPT as Checkpoint Store
participant Cache as Cache
Client->>GW: GET /audit/v1/events?limit=100&cursor=... <br/> h:{Authorization,x-tenant-id,traceparent}
Note right of GW: Validate JWT, tenant scope, rate-limit, normalize headers
GW->>Q: Forward request + tenant context + traceparent
Q->>CKPT: Read tenant watermark (offset,timestamp)
alt Cache enabled and hit
Q->>Cache: Lookup by {tenant, params}
Cache-->>Q: Cached page + cursors
else No cache / miss
Q->>RDB: SELECT ... FROM AuditEvents WHERE tenantId=? AND (seek by cursor) ORDER BY createdAt DESC, auditRecordId DESC LIMIT N
RDB-->>Q: rows, next/prev anchors
Q->>Cache: Put page (optional TTL)
end
Q-->>GW: 200 JSON {items, nextCursor, prevCursor} + headers X-Watermark, X-Lag
GW-->>Client: 200 OK
Note over Client,RDB: Seek-based pagination avoids deep OFFSET scans
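The seek (keyset) predicate referenced in the diagram can be sketched as follows. Python for illustration; the row-value comparison syntax assumes a database that supports it (e.g., PostgreSQL):

```python
def seek_clause(cursor, order="desc"):
    """Build the keyset predicate over (createdAt, auditRecordId) for one page.
    Returns (sql_fragment, params); an empty fragment means first page."""
    if cursor is None:
        return "", []
    comparator = "<" if order == "desc" else ">"
    # Row-value comparison keeps the ordering total when timestamps collide
    sql = f"AND (createdAt, auditRecordId) {comparator} (%s, %s)"
    return sql, [cursor["ts"], cursor["id"]]
```

Because the predicate seeks directly to the anchor row via the `(tenantId, createdAt, auditRecordId)` index, page cost stays constant regardless of page depth, unlike `OFFSET`.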
Alternative Paths¶
- Time-bounded query: `from`/`to` timestamps narrow the scan before seek pagination.
- Ascending order: `order=asc` for forward-in-time scans; cursors encode direction.
- Head polling: client sends `If-None-Match: "wmk:<value>"`; Query Service returns `304 Not Modified` if `X-Watermark` is unchanged.
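The head-polling variant can be sketched as a server-side check; Python for illustration, with the `wmk:` ETag format taken from the bullet above:

```python
def head_poll_response(if_none_match, watermark):
    """Serve a head poll: 304 when the client's ETag matches the current
    watermark, else 200 with a fresh ETag and X-Watermark header."""
    etag = f'"wmk:{watermark}"'
    if if_none_match == etag:
        return 304, {"ETag": etag}
    return 200, {"ETag": etag, "X-Watermark": watermark}
```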
Error Paths¶
sequenceDiagram
actor Client
participant GW as API Gateway
participant Q as Query Service
Client->>GW: GET /audit/v1/events?limit=5000&cursor=bad
alt Invalid params / cursor parse fail
GW-->>Client: 400 Bad Request (Problem+JSON)
else Unknown tenant / route
GW-->>Client: 404 Not Found (Problem+JSON)
else Conflicting params (e.g., both cursor & page)
GW-->>Client: 409 Conflict (Problem+JSON)
else Unauthorized / Forbidden
GW-->>Client: 401/403 (Problem+JSON)
else Service backpressure / upstream down
GW-->>Client: 429/503 (Retry-After)
end
Request/Response Specifications¶
Input Requirements¶
| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| Method/Path | HTTP GET | Y | Timeline endpoint: `/audit/v1/events` or `/audit/v1/tenants/{tenantId}/events` | One of header or path must provide tenant |
| `Authorization` | header | Y | `Bearer <JWT>` | Valid signature, audience; not expired |
| `x-tenant-id` | header | Y* | Tenant scope (if not in path) | `^[A-Za-z0-9._-]{1,128}$` |
| `traceparent` | header | O | W3C trace context | 55-char format |
| `limit` | query | O | Max items per page | 1–1000, default 100 |
| `cursor` | query | O | Opaque base64url cursor (`ts,id,dir`) | Valid/owned by tenant |
| `order` | query | O | `desc` (default) or `asc` | enum |
| `from`/`to` | query | O | ISO-8601 UTC time bounds | `from` ≤ `to`, within retention |
| `filter.resourceType` | query | O | Optional type filter | Matches known types |
| `filter.actorId` | query | O | Optional actor filter | ≤ 128 chars |
*Required unless tenant is in path.
Output Specifications¶
| Field | Type | Description | Notes |
|---|---|---|---|
| `items[]` | array | Page of timeline entries | Ordered by `order` |
| `nextCursor` | string? | Opaque cursor for next page | Omitted if no more |
| `prevCursor` | string? | Opaque cursor for reverse page | Omitted on first page |
| `count` | integer | Number of items in this page | ≤ `limit` |
Response Headers
- `X-Watermark`: ISO-8601 UTC of latest committed projection timestamp for the tenant.
- `X-Lag`: seconds behind "now" (`now - X-Watermark`).
- `Cache-Control`: typically `no-store, max-age=0` (or short TTL if allowed).
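`X-Lag` is derived from `X-Watermark`; a minimal sketch of the computation, in Python for illustration, matching the one-decimal format shown in the example response below:

```python
from datetime import datetime, timezone

def lag_seconds(watermark_iso, now=None):
    """Compute X-Lag = now - X-Watermark, in seconds (one decimal place)."""
    watermark = datetime.fromisoformat(watermark_iso.replace("Z", "+00:00"))
    now = now or datetime.now(timezone.utc)
    return round((now - watermark).total_seconds(), 1)
```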
Example Requests/Responses¶
Request
GET /audit/v1/events?limit=100&order=desc&from=2025-10-22T00:00:00Z HTTP/1.1
Host: api.atp.example
Authorization: Bearer eyJhbGciOi...
x-tenant-id: acme
traceparent: 00-9f0c1d2e3a4b5c6d7e8f9a0b1c2d3e4f-1111222233334444-01
200 OK
HTTP/1.1 200 OK
Content-Type: application/json
X-Watermark: 2025-10-22T12:03:05.120Z
X-Lag: 4.8
Cache-Control: no-store
{
"items": [
{
"auditRecordId": "01JEC2A2V7N9M0X1Y2Z3A4B5C6",
"createdAt": "2025-10-22T12:02:59.812Z",
"action": "user.create",
"resource": { "type": "Iam.User", "id": "U-1001" },
"actor": { "id": "svc_ingress", "type": "Service", "display": "ingress-gw" },
"decision": { "result": "Allow" }
}
],
"nextCursor": "eyJ0cyI6IjIwMjUtMTAtMjJUMTI6MDI6NTkuODEyWiIsImlkIjoiMDFK...IiwgImRpciI6ImRlc2MifQ",
"prevellers": null,
"count": 1
}
400 Bad Request (invalid cursor)
{
"type": "urn:connectsoft:errors/query/cursor.invalid",
"title": "Invalid cursor",
"status": 400,
"detail": "Cursor is malformed or expired for this tenant.",
"errors": [{ "pointer": "query.cursor", "reason": "malformed" }],
"traceId": "9f0c1d2e3a4b5c6d7e8f9a0b1c2d3e4f"
}
Error Handling¶
Error Scenarios¶
| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Missing/invalid params; bad cursor; `from>to`; `limit` out of bounds | Correct request; regenerate cursor | No retry until fixed |
| 401 | Missing/invalid/expired JWT | Obtain valid token | Retry after renewal |
| 403 | Token not authorized for `x-tenant-id` | Request proper scope/role | No retry until fixed |
| 404 | Tenant or route not found; tenant disabled | Verify tenant/URL | No retry |
| 409 | Conflicting params (e.g., `cursor` with `from/to` not allowed) or cursor tenant mismatch | Remove conflict; obtain fresh cursor | Retry after fix |
| 429 | Rate limit / query backpressure | Backoff; respect `Retry-After` | Exponential backoff + jitter |
| 503 | Upstream (DB/checkpoint) unavailable | Wait for recovery | Retry with backoff |
| 304 | `If-None-Match` matched watermark | Use cached data | Re-poll later |
Failure Modes¶
- Stale cursor after rebuild/compaction: server returns 409 with `type: .../cursor.stale` and a `resyncFrom` hint.
- RLS misconfiguration: query returns 403/500; health checks should detect missing policy.
- Watermark stale: `X-Lag` grows; alerting should trigger projector scaling.
Recovery Procedures¶
- On 409 `cursor.stale`, drop the cursor and restart from `from=lastSeenTime`.
- On 429/503, back off with jitter; do not increase `limit` to compensate.
- If RLS errors occur, fail closed (no data) and escalate to operations.
Performance Characteristics¶
Latency Expectations¶
- P50 ≤ 60 ms, P95 ≤ 150 ms, P99 ≤ 300 ms for `limit` ≤ 200 over warm cache/index.
Throughput Limits¶
- Per tenant: 200 rps sustained, 800 rps burst (configurable).
- Global: scales with read replicas and cache hit rate.
Resource Requirements¶
- Read DB IOPS proportional to `limit` and filter selectivity; ensure covering indexes.
- Cache memory sized for hot cursors/pages if enabled.
Scaling Considerations¶
- Add read replicas; shard by tenant.
- Use index-only scans with narrow projections to reduce I/O.
- Apply adaptive `limit` caps under load; enable result caching for hot ranges.
Security & Compliance¶
Authentication¶
- OIDC JWT (short-lived); `traceparent` propagated; mTLS between Gateway ↔ Query Service (optional but recommended).
Authorization¶
- Enforce the `audit:read:timeline` scope; verify `sub`/`tenant` claims; apply DB-level RLS on `tenantId`.
Data Protection¶
- Only minimal fields returned; no secret values.
- `X-Watermark` reveals timing only; avoid leaking internal offsets.
Compliance¶
- Access logged with `tenantId`, `subject`, `filters`, and `watermark` for auditability.
Monitoring & Observability¶
Key Metrics¶
| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| `query_latency_ms{route="/audit/v1/events"}` | histogram | End-to-end latency | p95 > 150 ms (5m) |
| `timeline_results_count` | histogram | Items per page | Sudden 0 across tenants |
| `watermark_lag_seconds` | gauge | `now - watermark` | > target (e.g., > 10 s) |
| `query_rate_limited_total` | counter | 429 responses | > 5% sustained |
| `cursor_stale_total` | counter | 409 due to stale/malformed cursor | Spike indicates rebuild issues |
Logging Requirements¶
- Structured logs: `tenantId`, `traceId`, `limit`, `order`, `from`/`to` (if set), `cursorHash`, `resultCount`, `watermark`, `lagSec`. Do not log raw cursor tokens.
Distributed Tracing¶
- Spans: `query.parse`, `db.select.timeline`, `ckpt.read`, `cache.get/set`.
- Attributes: `tenant`, `limit`, `order`, `hasCursor`, `rows`, `lagMs`.
Health Checks¶
- Readiness: DB + checkpoint reachable; RLS policy verified; index present.
- Liveness: threadpool saturation, connection pool usage below thresholds.
Operational Procedures¶
Deployment¶
- Apply/verify `AuditEvents` schema & RLS policies in Read DB.
- Deploy Query Service behind Gateway route `/audit/v1/events`.
- Validate watermark propagation and `X-Lag` accuracy in staging.
Configuration¶
- Env: `QUERY_MAX_LIMIT`, `DEFAULT_LIMIT`, `CACHE_TTL_SECONDS`, `RLS_ENABLED=true`.
- Indexing: `(tenantId, createdAt DESC, auditRecordId)` plus optional partial indexes per tenant.
Maintenance¶
- Periodic VACUUM/ANALYZE (SQL) or compaction (NoSQL).
- Rotate JWT keys; update JWKS URL.
- Monitor and refresh cache layer sizing.
Troubleshooting¶
- High `watermark_lag_seconds` → check projector lag, search bulk backlog.
- Many 409 (`cursor.stale`) → investigate projection rebuilds/compaction.
- Slow queries → examine query plans; add/adjust indexes.
Testing Scenarios¶
Happy Path Tests¶
- GET with valid `x-tenant-id` and `limit=100` returns `200` with ordered items and `X-Watermark`/`X-Lag`.
- `nextCursor` yields the next page; `prevCursor` navigates back without duplication.
Error Path Tests¶
- 400 on malformed `cursor` / invalid `limit` / `from>to`.
- 404 when tenant missing/disabled or route incorrect.
- 409 when `cursor` used with disallowed params or tenant mismatch.
- 429/503 trigger proper backoff behavior.
Performance Tests¶
- p95 ≤ 150 ms for `limit=200` under typical load.
- Index-only scan verified via EXPLAIN plan.
Security Tests¶
- JWT audience/scope enforced; RLS prevents cross-tenant leakage.
- `x-tenant-id` header vs path tenant consistency enforced.
Related Documentation¶
Internal References¶
- Architecture Overview
- Components → Query Service, API Gateway
- Data Model — Read Models & Projections
- API Contracts
Related Flows¶
- Search Query Flow
- Filtered Query Flow (policy/redaction on read)
- Audit Record Projection Update Flow
External References¶
- RFC 9110 (HTTP Semantics, obsoleting RFCs 7231/7233) for header semantics
- W3C Trace Context (traceparent)
Appendices¶
A. Cursor Encoding (example)¶
cursor = base64url( JSON.stringify({ ts:"2025-10-22T12:02:59.812Z", id:"01JEC2A2V7N9M0X1Y2Z3A4B5C6", dir:"desc" }) )
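A round-trip sketch of the cursor encoding above, with basic validation that maps malformed tokens to the 400 `urn:connectsoft:errors/query/cursor.invalid` problem type; Python for illustration, and tenant-ownership checks are omitted:

```python
import base64
import json

def encode_cursor(ts, record_id, direction="desc"):
    """Encode (ts, id, dir) as an unpadded base64url token."""
    payload = json.dumps({"ts": ts, "id": record_id, "dir": direction})
    return base64.urlsafe_b64encode(payload.encode()).rstrip(b"=").decode()

def decode_cursor(token):
    """Decode and validate a cursor; raise ValueError for malformed input,
    which the API surfaces as a 400 cursor.invalid problem."""
    try:
        padded = token + "=" * (-len(token) % 4)  # restore stripped padding
        cursor = json.loads(base64.urlsafe_b64decode(padded))
    except Exception as exc:
        raise ValueError("cursor.invalid") from exc
    if cursor.get("dir") not in ("asc", "desc") or not {"ts", "id"} <= cursor.keys():
        raise ValueError("cursor.invalid")
    return cursor
```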
B. Example RLS Policy (PostgreSQL)¶
ALTER TABLE audit_events ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON audit_events
USING (tenant_id = current_setting('app.tenant_id')::text);
-- Set current_setting('app.tenant_id') per request in the DB session.
Search Query Flow¶
Full-text, facet, and type-ahead search over tenant-scoped indices. The Search Service executes per-tenant queries against a per-tenant alias (or filtered index), returning ranked results, facet aggregations, and optional suggest completions. Responses include X-Index-Watermark and X-Index-Lag to convey indexing freshness.
Overview¶
Purpose: Provide fast, flexible discovery of audit records using full-text, filters, facets, and suggesters.
Scope: Query parsing, tenant isolation via alias/filter, facet execution, pagination, highlights, and freshness reporting. Excludes authoritative reads (timeline) and export; on-read masking follows redaction policy.
Context: Operates on the Search Index projection populated by the Projection Service; eventual consistency vs. authoritative store is expected.
Key Participants:
- Search Client (API consumer)
- Search Service (query planner/executor)
- Search Engine (per-tenant indices/aliases)
- Checkpoint Store (optional: index watermark)
- Cache (optional: hot query caching)
Prerequisites¶
System Requirements¶
- Search cluster reachable with TLS; per-tenant indices/aliases created
- Search Service has network access and service account with read permissions
- Projection → Index pipeline healthy (indexers running)
Business Requirements¶
- Tenant has Search edition/feature enabled
- Data minimization and on-read masking rules configured for Search documents
- Retention and residency policies applied to search indices
Performance Requirements¶
- p95 query latency ≤ 200 ms for `size ≤ 50` and modest facets
- Cluster capacity sized for QPS and aggregation workload
- Index freshness SLO: p95 ≤ 10 s Accept→Indexed
Sequence Flow¶
Happy Path¶
sequenceDiagram
autonumber
actor Client as Search Client
participant Svc as Search Service
participant Engine as Search Engine (Tenant Alias)
participant CKPT as Checkpoint Store
Client->>Svc: POST /search/v1/query<br/>h:{Authorization,x-tenant-id}<br/>{q, filters, facets, size, cursor?}
Svc->>Svc: Validate params, build per-tenant query, apply redaction-on-read
Svc->>Engine: Execute { index: tenant-alias, body: query+aggs }
Engine-->>Svc: Hits, facets, next cursor, took
Svc->>CKPT: Read index watermark (optional)
Svc-->>Client: 200 {results, facets, nextCursor} + X-Index-Watermark + X-Index-Lag
Alternative Paths¶
- Time freshness bias: apply recency boost within a freshness window (e.g., last 24h).
- Filter-only queries (`q` empty): return filtered timeline with facets.
- Suggest endpoint: `/search/v1/suggest` uses completion or n-gram suggesters with `prefix` and filters.
- Read-through cache: cache popular queries for short TTL (exclude personalized filters).
Error Paths¶
sequenceDiagram
actor Client
participant Svc as Search Service
Client->>Svc: POST /search/v1/query (bad params/tenant)
alt Bad request (malformed cursor/invalid facet)
Svc-->>Client: 400 Problem+JSON
else Tenant alias missing / disabled
Svc-->>Client: 404 Problem+JSON
else Conflicting params (both page & cursor, or size>cap)
Svc-->>Client: 409 Problem+JSON
else Rate limited / engine unavailable
Svc-->>Client: 429/503 Problem+JSON (+ Retry-After)
end
Request/Response Specifications¶
Input Requirements¶
| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| Method/Path | HTTP POST | Y | Search endpoint: `/search/v1/query` | JSON body |
| `Authorization` | header | Y | `Bearer <JWT>` | Valid, not expired |
| `x-tenant-id` | header | Y* | Tenant scope | `^[A-Za-z0-9._-]{1,128}$` |
| `q` | string | O | Query string (full-text) | 0–2048 chars |
| `filters` | object | O | `{resourceType?, actorId?, action?, time:{from?,to?}, decision?}` | enums/ISO-8601 |
| `facets` | array | O | Facets to compute (e.g., `["resourceType","action"]`) | Allowlist only |
| `size` | int | O | Page size | 1–100 (default 25) |
| `cursor` | string | O | Opaque search-after token | base64url |
| `highlight` | bool | O | Return snippets | Default false |
| `sort` | enum | O | `relevance` (default) or `createdAt:desc|asc` | Allowlist |
*Required unless tenant is encoded in a dedicated tenant path variant.
Output Specifications¶
| Field | Type | Description | Notes |
|---|---|---|---|
| `results[]` | array | Search hits with essential fields | Redacted as needed |
| `facets` | object | Buckets per requested facet | Top-N buckets |
| `nextCursor` | string? | Token for next page | Omitted if no more |
| `tookMs` | int | Engine execution time | From engine |
| `totalApprox` | int | Approx total matches | Not exact if tracking disabled |
Response Headers
- `X-Index-Watermark`: ISO-8601 UTC timestamp of the latest indexed event for the tenant
- `X-Index-Lag`: seconds behind "now" (`now - X-Index-Watermark`)
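Clients consuming these headers can recompute the lag locally from the watermark; a minimal Python sketch (the helper name is ours, not part of any ATP SDK):

```python
from datetime import datetime, timezone

def index_lag_seconds(watermark_iso, now=None):
    """X-Index-Lag = now - X-Index-Watermark, floored at zero."""
    wm = datetime.fromisoformat(watermark_iso.replace("Z", "+00:00"))
    now = now or datetime.now(timezone.utc)
    return max(0.0, (now - wm).total_seconds())
```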
Example Payloads¶
Request
{
"q": "user create OR signup",
"filters": {
"resourceType": "Iam.User",
"time": { "from": "2025-10-22T00:00:00Z", "to": "2025-10-22T23:59:59Z" }
},
"facets": ["resourceType", "action"],
"size": 25,
"sort": "relevance",
"highlight": true
}
200 OK
{
"results": [
{
"auditRecordId": "01JEC7KX8…",
"createdAt": "2025-10-22T11:58:10.201Z",
"action": "user.create",
"resource": { "type": "Iam.User", "id": "U-1001" },
"actor": { "id": "svc_signup", "type": "Service", "display": "signup-svc" },
"score": 7.42,
"highlights": { "action": ["<em>user</em>.create"] }
}
],
"facets": {
"resourceType": [{ "key": "Iam.User", "count": 128 }],
"action": [{ "key": "user.create", "count": 92 }]
},
"nextCursor": "eyJzZWFyY2hBZnRlciI6WyIxLjIzIiwiMDFK...Il19",
"tookMs": 23,
"totalApprox": 612
}
400 Bad Request (invalid facet)
{
"type": "urn:connectsoft:errors/search/facet.invalid",
"title": "Invalid facet",
"status": 400,
"detail": "Facet 'userEmail' is not allowed.",
"errors": [{ "pointer": "/facets/0", "reason": "allowlist" }]
}
Error Handling¶
Error Scenarios¶
| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Malformed cursor, disallowed facet, bad time range, `size` out of bounds | Fix request | No retry until corrected |
| 401 | Missing/invalid JWT | Acquire valid token | Retry after renewal |
| 403 | Insufficient `audit:search` scope or tenant mismatch | Request proper scope | No retry |
| 404 | Tenant alias/index missing or feature disabled | Verify tenant/feature | No retry |
| 409 | Conflicting params (e.g., cursor with unsupported sort) | Adjust params | Retry after fix |
| 422 | Query too complex (clause limit, wildcard explosion) | Simplify query | No retry until changed |
| 429 | Rate limited/backpressure | Respect `Retry-After` | Exponential backoff + jitter |
| 503 | Engine unavailable / timeout | Wait for recovery | Retry with jitter |
Failure Modes¶
- Stale cursor after reindex/alias swap → 409 `cursor.stale` with a `resyncFrom` hint.
- Facet blow-up (high cardinality) → 422 with guidance to narrow filters.
- Highlight overflow → server truncates snippets to configured limit.
Recovery Procedures¶
- On 409 `cursor.stale`, drop the cursor and re-issue the query without it or with a `from` bound.
- On 429/503, back off; keep the query identical to benefit from caching when enabled.
- Replace disallowed facets with supported ones per schema allowlist.
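The backoff guidance above can be sketched as full-jitter exponential backoff; an illustrative Python helper (names and defaults are ours):

```python
import random

def backoff_schedule(attempts, base=0.5, cap=30.0, rng=None):
    """Full-jitter backoff: the i-th delay is uniform in [0, min(cap, base * 2**i)]."""
    rng = rng or random.Random()
    return [rng.uniform(0, min(cap, base * (2 ** i))) for i in range(attempts)]
```

Callers would sleep for each delay between retries, stopping as soon as a request succeeds or a non-retryable status (4xx other than 429) is returned.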
Performance Characteristics¶
Latency Expectations¶
- P50 ≤ 80 ms, P95 ≤ 200 ms, P99 ≤ 400 ms (moderate facets, size ≤ 50).
Throughput Limits¶
- Per tenant baseline 300 rps sustained; global scales with cluster nodes and shard count.
Resource Requirements¶
- Aggregations demand CPU/heap; ensure shard sizing and circuit breakers for large queries.
Scaling Considerations¶
- Scale by shards/replicas; use per-tenant alias routing.
- Enable result caching and request coalescing for hot queries.
- Apply freshness bias instead of hard refresh to avoid heavy `refresh` calls.
Security & Compliance¶
Authentication¶
- OIDC JWT at Gateway; mTLS between Search Service and engine (optional).
Authorization¶
- Enforce the `audit:search` scope; per-tenant isolation via index alias filter or index-per-tenant.
Data Protection¶
- Documents store minimized fields; sensitive values tokenized or omitted.
- Highlights sanitized; never return dropped/redacted fields.
Compliance¶
- Record search access with `tenant`, `subject`, `queryHash`, `filters`, and `returnedCount`.
Monitoring & Observability¶
Key Metrics¶
| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| `search_latency_ms` | histogram | End-to-end latency | p95 > 200 ms |
| `search_qps` | counter | Requests/sec | Capacity planning |
| `index_freshness_seconds` | gauge | `now - indexWatermark` | > 10 s sustained |
| `search_429_total` | counter | Rate limited count | > 5% sustained |
| `cursor_stale_total` | counter | 409 due to stale cursor | Spike detection |
Logging Requirements¶
- Structured logs: `tenant`, `traceId`, `qHash`, `filtersHash`, `size`, `sort`, `tookMs`, `indexLagSec`.
- Do not log raw queries or highlights.
Distributed Tracing¶
- Spans: `search.plan`, `engine.search`, `engine.aggs`, `cache.get/set`.
- Attributes: `tenant`, `hasCursor`, `facetCount`, `size`, `tookMs`.
Health Checks¶
- Readiness: engine reachable; tenant alias exists; index freshness within SLO.
- Liveness: threadpool/connection pool healthy; circuit breakers closed.
Operational Procedures¶
Deployment¶
- Create index template & per-tenant alias with filter `tenantId=...`.
- Deploy Search Service routes `/search/v1/query` and `/search/v1/suggest`.
- Validate end-to-end queries and index freshness headers.
Configuration¶
- Env: `SEARCH_MAX_SIZE=100`, `DEFAULT_SIZE=25`, `ALLOWED_FACETS=...`, `CURSOR_TTL`, `RECENCY_BOOST_WINDOW`.
- Engine: shard/replica count, analyzers, suggesters, circuit breakers.
Maintenance¶
- Rolling reindex and alias swap; backfill lag tracking.
- Periodic shard rebalancing; optimize/forcemerge as needed off-peak.
Troubleshooting¶
- High `index_freshness_seconds` → inspect projector/indexer lag.
- Many 422 responses → educate clients on query limits; adjust clause caps if safe.
- 429 spikes → scale nodes or adjust rate limits/caching.
Testing Scenarios¶
Happy Path Tests¶
- Keyword query with filters returns ranked hits and requested facets within p95 ≤ 200 ms.
- Pagination via `nextCursor` returns non-overlapping result sets.
- Headers include `X-Index-Watermark` and `X-Index-Lag`.
Error Path Tests¶
- 400 on invalid facet, malformed cursor, or bad time bounds.
- 404 when tenant alias missing/disabled.
- 409 on stale cursor or conflicting params.
- 422 on overly complex query (clause cap).
- 429/503 obey retry/backoff.
Performance Tests¶
- Facet cost under control for typical cardinalities.
- Query load at target QPS with p95 ≤ 200 ms.
Security Tests¶
- RBAC scope `audit:search` enforced; cross-tenant leakage prevented by alias filter.
- Redaction/minimization verified in results and highlights.
Related Documentation¶
Internal References¶
Related Flows¶
- Tenant-Scoped Query Flow
- Audit Record Projection Update Flow
- Data Redaction Flow
External References¶
- Vendor docs for analyzers, aggregations, and suggesters (e.g., ES/OpenSearch)
Appendices¶
A. Example Engine Query (conceptual)¶
{
"query": {
"bool": {
"filter": [{ "term": { "tenantId": "acme" } }],
"must": [{ "simple_query_string": { "query": "user create OR signup", "fields": ["action^3","resource.type","attributes.*"] }}]
}
},
"aggs": {
"resourceType": { "terms": { "field": "resource.type", "size": 10 } },
"action": { "terms": { "field": "action.keyword", "size": 10 } }
},
"sort": ["_score", { "createdAt": "desc" }],
"size": 25,
"search_after": ["1.23", "01JEC7KX8..."]
}
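The opaque cursor wraps the engine's `search_after` anchor. A plausible encode/decode sketch, assuming the token is base64url-encoded JSON (the internal shape shown is illustrative; clients must treat the token as opaque):

```python
import base64
import json

def encode_cursor(search_after):
    """Wrap a search_after anchor in an unpadded base64url token (illustrative shape)."""
    raw = json.dumps({"searchAfter": search_after}, separators=(",", ":")).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()

def decode_cursor(token):
    """Recover the search_after anchor; restores stripped base64 padding first."""
    padded = token + "=" * (-len(token) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))["searchAfter"]
```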
B. Example Suggest Request¶
Filtered Query Flow¶
Policy-aware read path that applies purpose-of-use evaluation, field-level allow/deny, and on-read redaction/masking before returning results. The Query Service consults the Policy Service to compute an effective redaction profile for the caller, then executes a tenant-scoped query and post-processes rows according to the profile.
Overview¶
Purpose: Return tenant-scoped audit results filtered by caller intent and masked according to privacy & PII policies.
Scope: Purpose-of-use signaling, policy evaluation, field projection, masking strategies (hash/mask/tokenize/drop), seek pagination, and response headers indicating applied policy and freshness. Excludes full-text search (see Search flow) and raw timeline (see Tenant-Scoped Query).
Context: Operates on AuditEvents projection; combines pre-index filters with post-fetch masking.
Key Participants:
- Client (API consumer)
- API Gateway (authN/Z, rate limiting)
- Query Service (query + masking orchestrator)
- Policy Service (purpose-of-use, allow/deny, redaction profile)
- Read DB (AuditEvents + RLS) (tenant-isolated projection)
- Checkpoint Store (watermark for freshness)
Prerequisites¶
System Requirements¶
- TLS at Gateway; JWKS configured for JWT verification
- Query Service access to Read DB and Policy Service
- RLS on Read DB enforcing `tenantId`
- Redaction libraries & configs deployed (hash/mask/tokenize/drop)
Business Requirements¶
- Tenant active; privacy/PII classifications configured
- Policy definitions include purpose-of-use to field permissions/masking
- Data residency respected for cross-region reads
Performance Requirements¶
- p95 ≤ 180 ms for `limit ≤ 200` with standard masking
- Policy evaluation cache (per subject+purpose) warmed; TTL tuned
- Indexes support common filter predicates
Sequence Flow¶
Happy Path¶
sequenceDiagram
autonumber
actor Client as Client
participant GW as API Gateway
participant Q as Query Service
participant P as Policy Service
participant R as Read DB (AuditEvents + RLS)
participant C as Checkpoint Store
Client->>GW: POST /query/v1/filtered <br/> h:{Authorization,x-tenant-id,traceparent,x-purpose-of-use}
GW->>Q: Forward request + headers
Q->>P: Evaluate(subject, tenant, purpose, requestedFields, filters)
P-->>Q: RedactionProfile {allowed, denied, maskRules}
Q->>R: SELECT ... WHERE tenantId=? AND <server-validated filters> ORDER BY createdAt DESC LIMIT N
R-->>Q: rows
Q->>Q: Apply RedactionProfile (drop/transform fields) + build cursors
Q->>C: Read tenant watermark
Q-->>GW: 200 {items(masked), nextCursor} + X-Watermark, X-Lag, X-Policy-Decision-Id
GW-->>Client: 200 OK
Alternative Paths¶
- Field projection: client requests `fields=[...]`; the server intersects with `allowed` and masks per rules.
- Explain-only: `dryRun=true` returns the effective RedactionProfile without data.
- Head polling: `If-None-Match: "wmk:<value>"` → `304` if the watermark is unchanged.
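The head-polling variant amounts to a conditional read keyed on the watermark; a minimal sketch of the server-side check (the function name is ours):

```python
def head_poll(if_none_match, watermark):
    """Return (status, etag): 304 when the client's watermark ETag still matches."""
    etag = f'"wmk:{watermark}"'
    if if_none_match == etag:
        return 304, etag
    return 200, etag
```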
Error Paths¶
sequenceDiagram
actor Client
participant GW as API Gateway
participant Q as Query Service
Client->>GW: POST /query/v1/filtered (bad params/conflicts)
alt Bad request (invalid filter/purpose/fields)
GW-->>Client: 400 Problem+JSON
else Tenant/route not found or feature disabled
GW-->>Client: 404 Problem+JSON
else Fields conflict with policy decision
GW-->>Client: 409 Problem+JSON
else Unauthorized / Forbidden
GW-->>Client: 401/403 Problem+JSON
else Backpressure / upstream down
GW-->>Client: 429/503 Problem+JSON (+ Retry-After)
end
Request/Response Specifications¶
Input Requirements¶
| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| Method/Path | HTTP `POST /query/v1/filtered` | Y | Filtered & masked read | JSON body |
| `Authorization` | header | Y | `Bearer <JWT>` | Valid, not expired |
| `x-tenant-id` | header | Y* | Tenant scope | `^[A-Za-z0-9._-]{1,128}$` |
| `traceparent` | header | O | W3C trace context | 55-char |
| `x-purpose-of-use` | header | Y | Caller intent (e.g., `Support`, `SecurityOps`, `Analytics`) | Enum allowlist |
| `limit` | body.int | O | Items per page | 1–200 (default 100) |
| `cursor` | body.string | O | Opaque seek token | base64url |
| `filters` | body.object | O | Server-validated predicates | Allowlist only |
| `fields` | body.array | O | Requested projections | Intersected with policy |
| `dryRun` | body.bool | O | Return policy only | default false |
*Required unless tenant embedded in path variant.
Supported filter keys (allowlist example): createdAt.from/to, action, resource.type, resource.id, actor.id, decision.result.
Output Specifications¶
| Field | Type | Description | Notes |
|---|---|---|---|
| `items[]` | array | Masked rows honoring RedactionProfile | Order: `createdAt DESC` |
| `nextCursor` | string? | Seek token for next page | Omitted if end |
| `policy` | object? | Returned if `dryRun=true` | Effective profile summary |
| `count` | int | Items in this page | ≤ limit |
Response Headers
- `X-Watermark`: tenant projection watermark (ISO-8601 UTC)
- `X-Lag`: seconds behind now
- `X-Policy-Decision-Id`: opaque id of the applied policy decision (for audit)
Example Payloads¶
Request
{
"limit": 50,
"fields": ["auditRecordId","createdAt","action","resource.id","actor.display","client.ip"],
"filters": {
"resource.type": "Iam.User",
"createdAt": { "from": "2025-10-22T00:00:00Z", "to": "2025-10-22T23:59:59Z" }
}
}
Headers:
Authorization: Bearer eyJhbGciOi...
x-tenant-id: acme
x-purpose-of-use: Support
traceparent: 00-9f0c1d2e3a4b5c6d7e8f9a0b1c2d3e4f-1111222233334444-01
200 OK (masked)
{
"items": [
{
"auditRecordId": "01JEC9VX2Z…",
"createdAt": "2025-10-22T11:57:03.200Z",
"action": "user.create",
"resource": { "id": "U-1001" },
"actor": { "display": "signup-svc" },
"client": { "ip": "203.0.113.0/24" } // IP truncated per Support profile
}
],
"nextCursor": "eyJ0cyI6IjIwMjUtMTAtMjJUMTE6NTc6MDMuMjAwWiIsImlkIjoiMDFK...In0",
"count": 1
}
dryRun=true (policy only)
{
"policy": {
"allowed": ["auditRecordId","createdAt","action","resource.id","actor.display","client.ip"],
"denied": ["client.userAgent","geo.location","subject.email"],
"maskRules": {
"client.ip": "truncate_cidr_24",
"subject.email": "mask_localpart"
}
}
}
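The projection step (requested ∩ allowed, 409 on denied fields) can be sketched as follows; the helper name and the choice to raise on denied fields are illustrative:

```python
def project_fields(requested, profile):
    """Intersect requested projections with policy.allowed, preserving request order.

    Fields the policy explicitly denies trigger an error (the 409 field-policy
    conflict in this flow); fields that are neither allowed nor denied are
    silently trimmed.
    """
    denied = [f for f in requested if f in set(profile.get("denied", []))]
    if denied:
        raise ValueError(f"field.policy.conflict: {denied}")
    allowed = set(profile["allowed"])
    return [f for f in requested if f in allowed]
```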
400 Bad Request (conflicting filters)
{
"type": "urn:connectsoft:errors/query/filters.invalid",
"title": "Invalid filters",
"status": 400,
"detail": "Unsupported filter 'subject.email'.",
"errors": [{ "pointer": "/filters/subject.email", "reason": "not-allowed" }]
}
Error Handling¶
Error Scenarios¶
| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Malformed filters/cursor; unknown `x-purpose-of-use`; invalid fields | Fix request; use allowlisted fields | No retry until corrected |
| 401 | Missing/invalid JWT | Acquire valid token | Retry after renewal |
| 403 | Subject lacks `audit:read:filtered` scope or policy denies all fields | Request correct scope; adjust purpose | No retry until fixed |
| 404 | Tenant/route not found; feature disabled | Verify tenant/URL/edition | — |
| 409 | Requested fields conflict with policy (e.g., denied but required) or cursor param conflicts | Remove offending fields/params | Retry after fix |
| 429 | Rate limit/backpressure | Respect `Retry-After` | Exponential backoff + jitter |
| 503 | Policy or DB dependency unavailable | Wait for recovery | Retry with same params |
Failure Modes¶
- Policy cache staleness: returns a stricter profile than expected (safe by design); refreshed on the next call.
- Cursor invalid after rebuild: 409 `cursor.stale` with a `resyncFrom` hint.
- Overbroad projection: requesting many fields increases payload size; the server may trim to `allowed ∩ requested`.
Recovery Procedures¶
- On a 409 field-policy conflict, re-issue the request with the `fields` returned in `policy.allowed`.
- On 429/503, back off with jitter; do not widen `limit`.
- For a stale cursor, restart from the `from` time bound or omit the cursor.
Performance Characteristics¶
Latency Expectations¶
- P50 ≤ 70 ms, P95 ≤ 180 ms (policy cache hit); add 15–30 ms if cache miss.
Throughput Limits¶
- Per tenant: 150 rps sustained, 600 rps burst (configurable).
- Global: scales with read replicas and policy cache hit rate.
Resource Requirements¶
- CPU for masking transforms (e.g., hashing/tokenization); memory for page shaping.
Scaling Considerations¶
- Cache policy decisions keyed by `(tenant, subject, purpose)` with a short TTL (e.g., 60–300 s).
- Pre-compute allowlists per purpose to minimize per-request overhead.
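The policy-decision cache described above can be as simple as a keyed TTL map; an illustrative sketch (the class name and defaults are ours, with an injectable clock for testing):

```python
import time

class TtlCache:
    """Tiny TTL cache for policy decisions keyed by (tenant, subject, purpose)."""

    def __init__(self, ttl_seconds=120, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}

    def get(self, key):
        hit = self._entries.get(key)
        if hit is not None and self.clock() - hit[0] < self.ttl:
            return hit[1]
        self._entries.pop(key, None)  # expired or missing
        return None

    def put(self, key, value):
        self._entries[key] = (self.clock(), value)
```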
Security & Compliance¶
Authentication¶
- OIDC JWT; `traceparent` propagated; optional mTLS Gateway↔Query Service.
Authorization¶
- Require `audit:read:filtered`; validate the `x-tenant-id` claim and RBAC.
- Enforce DB-level RLS and post-query field-level controls from policy.
Data Protection¶
- Apply masking strategies per the Data Model (`truncate_cidr_24`, `mask_localpart`, `hash_sha256`, `drop`).
- Do not return fields marked `denied` by policy; never include raw PII if policy says mask/drop.
Compliance¶
- Emit an access audit: `subject`, `tenant`, `purpose`, `decisionId`, `requestedFields`, `returnedFields`.
Monitoring & Observability¶
Key Metrics¶
| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| `filtered_query_latency_ms` | histogram | End-to-end latency | p95 > 180 ms |
| `policy_eval_latency_ms` | histogram | Policy round-trip | p95 > 30 ms |
| `policy_denied_total` | counter | Requests with any denied fields | Sudden spikes |
| `masked_fields_total` | counter | Count of masked field applications | Trend monitoring |
| `cursor_stale_total` | counter | 409 due to stale cursor | Rebuild detection |
| `query_429_total` | counter | Rate-limited responses | > 5% sustained |
Logging Requirements¶
- Structured logs: `tenant`, `traceId`, `purpose`, `decisionId`, `requestedFieldsHash`, `returnedFieldsHash`, `resultCount`, `watermark`, `lagSec`. Do not log raw PII.
Distributed Tracing¶
- Spans: `policy.evaluate`, `db.select.filtered`, `mask.apply`.
- Attributes: `purpose`, `allowedCount`, `maskedCount`, `deniedCount`.
Health Checks¶
- Readiness: Policy Service reachable; RLS verified; masking config loaded.
- Liveness: threadpool/connection pools healthy.
Operational Procedures¶
Deployment¶
- Deploy/enable the `/query/v1/filtered` route behind the feature flag `query.filtered.enabled=false`.
- Load policy catalogs and masking configuration; warm caches.
- Validate `dryRun` and live calls in staging with test profiles.
Configuration¶
- Env: `QUERY_MAX_LIMIT`, `DEFAULT_LIMIT`, `POLICY_CACHE_TTL`, `MASKING_RULES_PATH`.
- Headers: accept `x-purpose-of-use` values from the allowlist only.
Maintenance¶
- Rotate JWT keys; review policy changes; audit decision logs.
- Monitor masked vs. denied trends to tune rules.
Troubleshooting¶
- Many 409 field conflicts → educate clients to request `dryRun` first or fetch `policy.allowed`.
- High `policy_eval_latency_ms` → investigate Policy Service capacity/caching.
- Data leakage concerns → verify the masking config version & hot reload.
Testing Scenarios¶
Happy Path Tests¶
- Valid request with `x-purpose-of-use: Support` returns masked IP and allowed fields.
- `dryRun=true` returns the expected profile; a subsequent call applies it.
Error Path Tests¶
- 400 on invalid filter key or unknown purpose.
- 404 when tenant missing/disabled.
- 409 when requesting `denied` fields.
- 429/503 obey retry/backoff with unchanged parameters.
Performance Tests¶
- Cache-hit p95 ≤ 180 ms; cache-miss overhead within budget.
- Large page (`limit=200`) still meets p95 under typical load.
Security Tests¶
- RLS prevents cross-tenant access.
- No raw PII fields returned when policy mandates mask/drop.
- Access audit entries include `purpose` and `decisionId`.
Related Documentation¶
Internal References¶
- Data Model — Privacy & PII Inventory
- Data Model — Data Classification & Redaction Rules
- Tenant-Scoped Query Flow
- Search Query Flow
Related Flows¶
- Data Redaction Flow (on-read), Policy & Retention flows
- Compliance Audit Flow
External References¶
- RFC 7807 (Problem Details)
- Organization Privacy/PII policy catalog
Appendices¶
A. Example RedactionProfile (concept)¶
{
"decisionId": "pol_7b3f8d1a",
"purpose": "Support",
"allowed": ["auditRecordId","createdAt","action","resource.id","actor.display","client.ip"],
"denied": ["subject.email","geo.location","client.userAgent"],
"maskRules": {
"client.ip": "truncate_cidr_24",
"subject.email": "mask_localpart"
}
}
B. Masking Rules (summary)¶
- `truncate_cidr_24` → IPv4 `a.b.c.d` → `a.b.c.0/24`
- `mask_localpart` → `name@domain` → `n***@domain`
- `hash_sha256` → irreversible 64-hex digest
- `drop` → remove field from output
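The rules above might be implemented as plain transforms; a minimal Python sketch (the function names mirror the rule identifiers, but the exact ATP implementations may differ):

```python
import hashlib

def truncate_cidr_24(ip):
    """IPv4 a.b.c.d -> a.b.c.0/24."""
    a, b, c, _ = ip.split(".")
    return f"{a}.{b}.{c}.0/24"

def mask_localpart(email):
    """name@domain -> n***@domain (keep only the first character of the local part)."""
    local, domain = email.split("@", 1)
    return f"{local[:1]}***@{domain}"

def hash_sha256(value):
    """Irreversible 64-hex digest of the field value."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

# `drop` has no transform: the field is simply removed from the output document.
```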
Time-Range Query Flow¶
Efficiently retrieves audit events constrained by a time window. The Query Service translates from/to predicates into partition/shard pruning (e.g., daily/monthly tenant partitions), executes seek-paginated scans over the minimal set of partitions, and returns watermark/lag headers to describe projection freshness.
Overview¶
Purpose: Provide fast, predictable retrieval of audit events within a specified time range while minimizing IO via partition/shard pruning.
Scope: Time predicates, partition selection, shard routing, seek-based pagination across multiple partitions, and freshness exposition. Excludes full-text relevance (see Search) and policy-driven masking (see Filtered Query).
Context: Operates on the AuditEvents read model that is physically partitioned by tenant and time; the Projection Service updates these partitions asynchronously.
Key Participants:
- Client (API consumer)
- API Gateway (authN/Z, rate limiting)
- Query Service (planner/executor, paginator)
- Read Store (time-partitioned `AuditEvents` with RLS)
- Partition Catalog (maps time windows → partitions/shards)
- Checkpoint Store (per-tenant watermark)
Prerequisites¶
System Requirements¶
- Gateway with TLS and JWT validation
- Query Service can access Read Store, Partition Catalog, and Checkpoint Store
- Read Store enforces RLS by `tenantId`
- Time partitions (e.g., daily/monthly) exist and are discoverable in the catalog
Business Requirements¶
- Tenant is active and permitted to query historical windows requested
- Retention policy covers the requested `from`/`to` period
- Regional residency honored for multi-region tenants
Performance Requirements¶
- p95 ≤ 160 ms for `limit ≤ 200` and ≤ 14 partitions scanned
- Covering index on `(tenantId, createdAt DESC, auditRecordId)` per partition
- Partition discovery latency p95 ≤ 10 ms
Sequence Flow¶
Happy Path¶
sequenceDiagram
autonumber
actor Client as Client
participant GW as API Gateway
participant Q as Query Service
participant Cat as Partition Catalog
participant R as Read Store (AuditEvents + RLS)
participant Ck as Checkpoint Store
Client->>GW: GET /audit/v1/events/range?from=...&to=...&limit=200&cursor=... <br/> h:{Authorization,x-tenant-id,traceparent}
GW->>Q: Forward request + normalized headers
Q->>Q: Validate time window, normalize [from,to], parse/verify cursor (if any)
Q->>Cat: Resolve partitions/shards for [from,to] + tenant
Cat-->>Q: Ordered partition list (most-recent → oldest)
Q->>R: Query partitions with seek pagination (ORDER BY createdAt DESC, auditRecordId)
R-->>Q: Page of rows + next anchor (ts,id,partitionIdx)
Q->>Ck: Read tenant watermark
Q-->>GW: 200 {items, nextCursor} + X-Watermark + X-Lag + X-Partitions-Scanned
GW-->>Client: 200 OK
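The seek-paginated scan above can be illustrated with a keyset-pagination sketch over `(createdAt DESC, auditRecordId)`; the function below is ours and operates on an in-memory list for clarity (a real implementation pushes the anchor predicate into SQL):

```python
def next_page(rows, anchor, limit):
    """Keyset (seek) pagination: return one page plus the anchor for the next page.

    `anchor` is a (createdAt, auditRecordId) tuple from the previous page, or
    None for the first page. ISO-8601 UTC strings compare lexicographically in
    chronological order, so plain tuple comparison gives the right ordering.
    """
    key = lambda r: (r["createdAt"], r["auditRecordId"])
    ordered = sorted(rows, key=key, reverse=True)          # createdAt DESC, id DESC
    candidates = [r for r in ordered if anchor is None or key(r) < anchor]
    page = candidates[:limit]
    next_anchor = key(page[-1]) if len(candidates) > limit else None
    return page, next_anchor
```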
Alternative Paths¶
- Open-ended range: only `from` provided (defaults `to=now`), or only `to` (backfill).
- Ascending traversal: `order=asc` for forward scans; the cursor encodes direction + partition index.
- Server-side downsampling: for very wide windows, the service may cap `maxPartitions` and advise narrowing via Problem+JSON `type: .../range.too_wide` (422) when appropriate.
Error Paths¶
sequenceDiagram
actor Client
participant GW as API Gateway
Client->>GW: GET /audit/v1/events/range?from=bad&to=2025-10-22T00:00:00Z
alt Bad request (malformed/invalid window)
GW-->>Client: 400 Bad Request (Problem+JSON)
else Tenant route not found / disabled
GW-->>Client: 404 Not Found (Problem+JSON)
else Conflicting params (cursor with changed window/order)
GW-->>Client: 409 Conflict (Problem+JSON)
else Rate limited / store unavailable
GW-->>Client: 429/503 (Problem+JSON + Retry-After)
end
Request/Response Specifications¶
Input Requirements¶
| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| Method/Path | HTTP `GET /audit/v1/events/range` or `/tenants/{tenantId}/events/range` | Y | Time-range endpoint | — |
| `Authorization` | header | Y | `Bearer <JWT>` | Valid, not expired |
| `x-tenant-id` | header | Y* | Tenant scope | `^[A-Za-z0-9._-]{1,128}$` |
| `traceparent` | header | O | W3C trace context | 55-char |
| `from` | query | O* | ISO-8601 UTC lower bound | ≤ `to`; within retention |
| `to` | query | O* | ISO-8601 UTC upper bound | ≥ `from`; not in future+skew |
| `limit` | query | O | Items per page (default 100) | 1–1000 |
| `order` | query | O | `desc` (default) or `asc` | enum |
| `cursor` | query | O | Opaque base64url `(ts,id,partitionIdx,dir)` | Must match current params |
| filters… | query | O | Optional allowlisted filters (e.g., `action`, `resource.type`) | Validated server-side |

- At least one of `from` or `to` is required; if only one is provided, the other defaults to `now` (bounded by retention and skew rules).
Output Specifications¶
| Field | Type | Description | Notes |
|---|---|---|---|
| `items[]` | array | Results in requested order | Seek-paginated |
| `nextCursor` | string? | Encodes next anchor + partition index | Omitted if no more |
| `count` | int | Items in this page | ≤ limit |
Response Headers
- `X-Watermark`: tenant projection watermark (ISO-8601 UTC)
- `X-Lag`: seconds behind now (`now - watermark`)
- `X-Partitions-Scanned`: integer count of partitions touched
- `Cache-Control`: typically `no-store` (or a short TTL where safe)
Example Request¶
GET /audit/v1/events/range?from=2025-10-20T00:00:00Z&to=2025-10-22T23:59:59Z&limit=200&order=desc HTTP/1.1
Host: api.atp.example
Authorization: Bearer eyJhbGciOi...
x-tenant-id: acme
traceparent: 00-3e1f2d0c9b8a7f6e5d4c3b2a19081716-7f6e5d4c3b2a1908-01
200 OK
{
"items": [
{
"auditRecordId": "01JECZ6Y8K1V...",
"createdAt": "2025-10-22T12:02:59.812Z",
"action": "user.create",
"resource": { "type": "Iam.User", "id": "U-1001" },
"actor": { "id": "svc_ingress", "type": "Service" }
}
],
"nextCursor": "eyJ0cyI6IjIwMjUtMTAtMjJUMTE6NTU6MDAuMDAwWiIsImlkIjoiMDFK...IiwicGFydGl0aW9uSW5kZXgiOjEsImRpciI6ImRlc2MifQ",
"count": 1
}
Error Handling¶
Error Scenarios¶
| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Malformed `from`/`to`; `from > to`; window exceeds max span; `limit` out of bounds | Fix params; reduce window | No retry until corrected |
| 401 | Missing/invalid JWT | Acquire valid token | Retry after renewal |
| 403 | Token lacks `audit:read:timeline` for tenant | Request proper scope | No retry |
| 404 | Tenant/route not found; tenant disabled; partitions not present (fully aged out) | Verify tenant/window | — |
| 409 | `cursor` does not match `from`/`to`/`order`; stale cursor after compaction | Drop/refresh cursor, re-issue | Retry after fix |
| 429 | Rate limit/backpressure | Honor `Retry-After` | Exponential backoff + jitter |
| 503 | Read Store/Catalog unavailable | Wait for recovery | Retry with same params |
Failure Modes¶
- Stale cursor after partition compaction/rotation → 409 with `type: .../cursor.stale` and a `resyncFrom` hint.
- Excessive partitions for wide windows → 422 `range.too_wide` with suggested subranges.
- Clock skew: future `to` beyond `now+skew` → clamp or 400 with a pointer to `to`.
Recovery Procedures¶
- For 409 `cursor.stale`, restart without a `cursor` or with `from=lastSeen.createdAt`.
- For 422 `range.too_wide`, split the request by the suggested daily/monthly windows.
- Monitor `X-Partitions-Scanned`; if high, narrow the time window.
Performance Characteristics¶
Latency Expectations¶
- P50 ≤ 70 ms, P95 ≤ 160 ms, P99 ≤ 320 ms when ≤14 partitions scanned.
Throughput Limits¶
- Per tenant: 150 rps sustained, burst 600 rps (configurable).
- Global: scales with number of read replicas and partition cache hit rate.
Resource Requirements¶
- Partition catalog lookup in-memory or fast key-value store; read DB requires covering indexes per partition.
Scaling Considerations¶
- Pruning first: always resolve partitions before issuing any scans.
- Adaptive limits: cap `limit` when many partitions are touched; prefer more pages over wide scans.
- Parallel partition reads (optional): small fan-out with strict per-tenant concurrency to preserve order semantics when stitching results.
Security & Compliance¶
Authentication¶
- OIDC JWT at Gateway; propagate `traceparent`; optional mTLS Gateway↔Query.
Authorization¶
- Enforce `audit:read:timeline`; verify tenant claims; RLS must filter by `tenantId`.
Data Protection¶
- Only return fields allowed by baseline read model; masking/redaction applied in dedicated filtered flow if required.
Compliance¶
- Log access with `tenant`, `from`, `to`, `limit`, `partitionsScanned`, `watermark`, and `lag`.
Monitoring & Observability¶
Key Metrics¶
| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| `range_query_latency_ms` | histogram | End-to-end latency | p95 > 160 ms |
| `partitions_scanned` | histogram | Partitions per request | > 16 median |
| `cursor_stale_total` | counter | 409 due to stale cursor | Spike indicates compaction |
| `range_too_wide_total` | counter | 422 due to excessive span | Trend watch |
| `watermark_lag_seconds` | gauge | `now - watermark` | > target (e.g., > 10 s) |
Logging Requirements¶
- Structured logs: `tenant`, `traceId`, `from`, `to`, `order`, `limit`, `cursorHash`, `partitionsScanned`, `resultCount`, `watermark`, `lagSec`. Do not log the raw cursor.
Distributed Tracing¶
- Spans: `catalog.resolvePartitions`, `db.scan.partition`, `stitch.page`, `ckpt.read`.
- Attributes: `partitionCount`, `limit`, `dir`, `hasCursor`.
Health Checks¶
- Readiness: catalog reachable; partitions for today resolvable; indexes present.
- Liveness: DB/connection pools healthy; threadpool not saturated.
Operational Procedures¶
Deployment¶
- Enable the `/audit/v1/events/range` route; confirm RLS and the partition catalog.
- Smoke-test with a 24 h window and verify `X-Partitions-Scanned`.
- Validate `cursor` stability across partition boundaries.
Configuration¶
- Env: `RANGE_MAX_SPAN_DAYS`, `QUERY_MAX_LIMIT`, `DEFAULT_LIMIT`, `PARTITION_LOOKUP_TTL`.
- Pruning: enable negative caching for empty/aged-out partitions.
Maintenance¶
- Keep partition catalog in sync with DDL/rotation jobs; prune aged partitions per retention.
- Rebuild indexes offline before alias/cutover when rotating partitions.
Troubleshooting¶
- High `partitions_scanned` → check catalog gaps or miscomputed `from`/`to`.
- Frequent 409 `cursor` conflicts → ensure clients don't change window/order between pages.
- Elevated `watermark_lag_seconds` → scale projectors or indexers.
Testing Scenarios¶
Happy Path Tests¶
- Querying a 48 h window returns ordered results with `X-Partitions-Scanned ≤ 3`.
- Pagination crosses a partition boundary without duplicates or gaps.
Error Path Tests¶
- 400 on malformed/invalid time bounds or `from > to`.
- 404 when the tenant/route is disabled or the window is fully aged out.
- 409 when the cursor does not match the current `from`/`to`/`order`.
- 429/503 cause client backoff and retry with the same params.
Performance Tests¶
- p95 ≤ 160 ms for `limit=200`, ≤ 14 partitions.
- Partition discovery p95 ≤ 10 ms under load.
Security Tests¶
- RLS prevents cross-tenant access.
- JWT scope `audit:read:timeline` enforced.
Related Documentation¶
Internal References¶
- Components → Query Service, Read Store
- Data Model — Tenancy Keys & Partitioning
- Data Model — Read Models & Projections
Related Flows¶
- Tenant-Scoped Query Flow
- Filtered Query Flow
- Audit Record Projection Update Flow
External References¶
- RFC 3339 / ISO-8601 for timestamps
- W3C Trace Context (traceparent)
Appendices¶
A. Cursor schema (concept)¶
B. Example partition policy¶
- Key: `(tenantId, yyyymm)` monthly partitions; for high-volume tenants use daily `(tenantId, yyyymmdd)`.
- Pruning: select partitions where `[from,to]` intersects the partition time bounds; query newest-first for `desc`.
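The monthly pruning rule can be sketched as computing the `yyyymm` keys whose month intersects `[from,to]`; an illustrative helper (the name is ours):

```python
from datetime import date

def monthly_partitions(frm, to):
    """All yyyymm partition keys whose month intersects [frm, to], newest first.

    Newest-first ordering matches desc traversal, where scanning starts at the
    most recent partition and works backwards.
    """
    parts = []
    y, m = frm.year, frm.month
    while (y, m) <= (to.year, to.month):
        parts.append(f"{y:04d}{m:02d}")
        m += 1
        if m == 13:
            y, m = y + 1, 1
    return parts[::-1]
```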
Standard Export Flow¶
On-demand eDiscovery export that builds a consistent snapshot of tenant-scoped audit data, runs a scoped query over the read models, streams results in chunked parts (JSONL or Parquet, optionally gzipped), produces a signed ExportManifest (with integrity proofs), delivers via presigned URLs and/or webhook callback, and emits Export.Completed.
Overview¶
Purpose: Enable compliance officers to export audit data for a given tenant/time window with integrity evidence and policy safeguards.
Scope: Job creation, query scoping, chunked packaging, integrity/manifest generation, delivery (URLs/webhook), completion events, and resume/cancel. Excludes continuous/scheduled exports (see Bulk Export Flow).
Context: Runs against the projection/read models (e.g., AuditEvents) and consults Integrity Service for proofs, Policy/Retention/LegalHold for eligibility, and Storage for canonical IDs.
Key Participants:
- Compliance Officer / Client
- API Gateway
- Export Service (job orchestration, packaging)
- Query Service / Read Store (scoped read with seek pagination)
- Integrity Service (Merkle roots / signatures)
- Delivery Backend (object storage for parts, presigned URLs)
- Webhook Receiver (optional callback on completion)
Prerequisites¶
System Requirements¶
- API Gateway with TLS and JWT validation
- Export Service deployed with access to Read Store, Integrity Service, Delivery Backend
- Read Store enforces RLS by `tenantId`; indexes support range scans
- Webhook signing keys configured (if callbacks used)
Business Requirements¶
- Tenant active; retention and residency policies provisioned
- Legal holds registered; export must honor holds and exclusions
- Officer has `audit:export` permission; purpose-of-use recorded
Performance Requirements¶
- Target p95 job time-to-first-part ≤ 30 s for typical scopes
- Per-part target size (e.g., 128–512 MiB) to optimize download throughput
- Concurrency caps per tenant to protect read replicas
Sequence Flow¶
Happy Path¶
sequenceDiagram
autonumber
actor Officer as Compliance Officer
participant GW as API Gateway
participant EXP as Export Service
participant Q as Query Service / Read Store
participant INT as Integrity Service
participant OBJ as Delivery Backend (Object Storage)
participant WH as Webhook Receiver (optional)
Officer->>GW: POST /export/v1/jobs {tenant, range, filters, format, partSize, webhook?}
GW->>EXP: Forward request (authN/Z, x-tenant-id, traceparent)
EXP->>Q: Open scoped cursor (tenant, from/to, filters)
loop Chunk until exhausted
Q-->>EXP: Page of rows + next cursor
EXP->>EXP: Serialize to JSONL/Parquet, gzip if requested
EXP->>INT: Append leaf hashes, update segment/merkle state
EXP->>OBJ: PUT part (presigned upload or service credentials)
OBJ-->>EXP: URL + ETag
EXP->>EXP: Record part metadata, update resumeToken
end
EXP->>INT: Seal block → MerkleRoot + signature
EXP->>EXP: Build ExportManifest {parts, counts, bytes, root, signature, resumeToken}
EXP-->>Officer: 202 Accepted {jobId, status:"Running"} (+ presigned GETs if requested)
EXP-->>Officer: 200 GET /export/v1/jobs/{jobId}/manifest (signed manifest)
alt webhook configured
EXP->>WH: POST /webhook/export {jobId,status:"Completed",manifestUrl,signature}
end
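The serialize → compress → upload loop above can be sketched like this; the row shape, part-size threshold, and function name are illustrative assumptions, not the Export Service's actual code:

```python
import gzip
import io
import json
from typing import Iterable, Iterator

def jsonl_gzip_parts(rows: Iterable[dict], part_size: int) -> Iterator[tuple[bytes, int]]:
    """Yield (gzipped JSONL part, record count). A new part is cut once the
    uncompressed buffer reaches part_size bytes (e.g. partSizeMiB * 2**20)."""
    buf, count = io.BytesIO(), 0
    for row in rows:
        buf.write(json.dumps(row, separators=(",", ":")).encode() + b"\n")
        count += 1
        if buf.tell() >= part_size:
            yield gzip.compress(buf.getvalue()), count
            buf, count = io.BytesIO(), 0
    if count:  # flush the final, possibly short, part
        yield gzip.compress(buf.getvalue()), count
```

Each yielded part would then be `PUT` to the Delivery Backend and its ETag and record count recorded alongside the `resumeToken`.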
Alternative Paths¶
- Presigned download: Service writes parts to bucket and returns read-only presigned URLs.
- Direct upload: Client provides presigned PUT URLs per part (client-managed storage).
- Parquet + schema: Columnar output with embedded schema for analytics workloads.
- Resume: Client `POST /export/v1/jobs/{jobId}:resume` with server-provided `resumeToken`.
Error Paths¶
sequenceDiagram
actor Officer
participant GW as API Gateway
participant EXP as Export Service
Officer->>GW: POST /export/v1/jobs {invalid filters/format}
alt Invalid request
GW-->>Officer: 400 Bad Request (Problem+JSON)
else Tenant not found/feature disabled
GW-->>Officer: 404 Not Found (Problem+JSON)
else Job state conflict (e.g., resume running job)
GW-->>Officer: 409 Conflict (Problem+JSON)
else Rate limited / dependencies down
GW-->>Officer: 429/503 (Retry-After/Problem+JSON)
end
Request/Response Specifications¶
Input Requirements¶
| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| Method/Path | `POST /export/v1/jobs` | Y | Create export job | JSON body |
| `Authorization` | header | Y | `Bearer <JWT>` | Valid, not expired |
| `x-tenant-id` | header | Y* | Tenant scope | Must match `body.tenant` |
| `traceparent` | header | O | W3C trace context | 55-char |
| `tenant` | string | Y | Target tenant | `^[A-Za-z0-9._-]{1,128}$` |
| `range` | object | O | `{from?, to?}` ISO-8601 UTC | `from ≤ to`, within retention |
| `filters` | object | O | Allowlisted filters (action/resource/actor/decision) | Server validated |
| `format` | enum | O | `jsonl` (default), `parquet` | allowlist |
| `compression` | enum | O | `none` (default), `gzip` | allowlist |
| `partSizeMiB` | int | O | Target part size | 16–1024, default 256 |
| `fields` | array | O | Projection/columns | Valid subset of schema |
| `webhook.url` | url | O | Completion callback | HTTPS + signature method |
| `webhook.secretId` | string | O | Key id for HMAC | Must exist in KMS |
| `delivery.mode` | enum | O | `presigned-get` \| `client-presigned-put` | allowlist |

\* Header required unless using the path variant `/tenants/{tenantId}/export/jobs`.
Output Specifications¶
Create Job — 202 Accepted
| Field | Type | Description |
|---|---|---|
| `jobId` | string | Server-assigned id (ULID/GUID) |
| `status` | enum | `Running` |
| `estimation` | object | `{partsApprox, bytesApprox?}` |
| `pollUrl` | url | `GET /export/v1/jobs/{jobId}` |
| `manifestUrl` | url | `GET /export/v1/jobs/{jobId}/manifest` (when ready) |
Get Job — 200 OK
| Field | Type | Description |
|---|---|---|
| `jobId` | string | id |
| `status` | enum | `Queued` \| `Running` \| `Completed` \| `Failed` \| `Canceled` |
| `counts` | object | `{records, parts}` |
| `bytes` | object | `{written}` |
| `parts[]` | array | `{index,url,etag,bytes,records}` (if presigned-get) |
| `resumeToken` | string? | For resume/cancel/retry |
| `startedAt/finishedAt` | timestamp | ISO-8601 UTC |
| `watermark` | timestamp | Consistency snapshot time |
Manifest (JSON)
{
"jobId": "exp_01JECXYZ...",
"tenant": "acme",
"range": {"from":"2025-10-20T00:00:00Z","to":"2025-10-22T23:59:59Z"},
"format": "jsonl",
"compression": "gzip",
"parts": [
{"index":0,"url":"https://.../p0.gz","bytes":268435456,"records":100000,"etag":"\"abc123\""}
],
"counts":{"records":250000,"parts":3},
"bytes":{"written":734003200},
"integrity":{"merkleRoot":"8a4f...","signature":{"alg":"Ed25519","kid":"int-key-2025","sig":"MEQC..."}},
"createdAt":"2025-10-22T12:30:12Z",
"resumeToken":"r:01JEC...",
"watermark":"2025-10-22T12:25:00Z"
}
Error Handling¶
Error Scenarios¶
| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Malformed `range`/`filters`; unsupported `format`/`compression`; invalid `partSizeMiB`; insecure webhook URL | Fix request; use allowlisted values | No retry until corrected |
| 401 | Missing/invalid JWT | Acquire valid token | Retry after renewal |
| 403 | Caller lacks `audit:export` or tenant mismatch | Request proper role/scope | No retry |
| 404 | Tenant/route not found; `GET /jobs/{id}` for unknown id | Verify identifiers/tenant | — |
| 409 | Job state conflict (resume/cancel when not applicable); changing scope on resume | Wait for state; create new job | Retry after fix |
| 413 | Estimated export exceeds max allowed per job | Narrow scope or switch to Bulk Export | — |
| 429 | Per-tenant/global export rate limited | Respect `Retry-After` | Exponential backoff + jitter |
| 503 | Read store/integrity/object storage unavailable | Wait for recovery | Retry create/poll |
Failure Modes¶
- Retention/residency violation: service rejects with 400 `type: .../policy.violation`.
- Legal hold conflict: either enforced inclusion or exclusion per policy; decision id returned via `X-Policy-Decision-Id`.
- Webhook failure: job completes, callback retries with backoff; manifest always retrievable via `GET`.
Recovery Procedures¶
- For 409, poll job until terminal; then retry with new job if needed.
- For 503/429, back off using `Retry-After`; do not alter the request, to preserve idempotency.
- Use `resumeToken` to continue aborted jobs without duplicating parts.
Performance Characteristics¶
Latency Expectations¶
- Time-to-first-part p95 ≤ 30 s for typical 24–48h windows.
- Per-part write steady-state throughput aligned with object storage (100–500 MiB/s aggregate across workers).
Throughput Limits¶
- Per tenant: ≤ 2 concurrent running jobs (configurable).
- Global: bounded by export workers × read replica capacity.
Resource Requirements¶
- Read IOPS proportional to projected records; CPU for serialization/compression; memory for part buffers.
Scaling Considerations¶
- Horizontal worker pool with fair-share per tenant.
- Adaptive `partSizeMiB` and dynamic concurrency to maintain steady throughput.
- Use seek pagination from Query Service to avoid deep offsets.
Security & Compliance¶
Authentication¶
- OIDC JWT at Gateway; mTLS for service-to-service (optional).
Authorization¶
- Require `audit:export` for tenant; enforce RLS in reads; verify `x-tenant-id`.
Data Protection¶
- Parts stored with server-side encryption; presigned URLs time-limited and least-privilege.
- Redaction/minimization applied if using Filtered export mode (optional flag).
Compliance¶
- Enforce retention/residency and legal holds; include decision metadata in manifest.
- Manifest contains integrity proof (Merkle root + signature) for end-to-end verification.
Monitoring & Observability¶
Key Metrics¶
| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| `export_jobs_active` | gauge | Running jobs count | > tenant/global cap |
| `export_bytes_written_total` | counter | Cumulative bytes | Trend/throughput |
| `export_parts_total` | counter | Parts produced | — |
| `export_job_duration_seconds` | histogram | Job runtime | p95 > SLO |
| `export_failures_total` | counter | Failed jobs | > 0 sustained |
| `export_webhook_fail_total` | counter | Callback failures | Spike alerts |
Logging Requirements¶
- Structured logs: `tenant`, `jobId`, `range`, `filtersHash`, `format`, `partIndex`, `bytes`, `records`, `watermark`, `integrity.merkleRoot`, `decisionId` (if policy applied). No raw PII.
Distributed Tracing¶
- Spans: `export.create`, `query.page`, `serialize.chunk`, `compress`, `object.put`, `integrity.seal`, `webhook.post`.
- Attributes: `tenant`, `format`, `partSizeMiB`, `parts`, `bytes`, `lagSec`.
Health Checks¶
- Readiness: access to Read Store, Integrity, Object Storage; signing keys loaded.
- Liveness: worker queue depth within bounds; no stuck jobs.
Operational Procedures¶
Deployment¶
- Provision object storage buckets and KMS keys; configure presign service.
- Deploy Export Service and register `/export/v1/*` routes.
- Validate end-to-end export on a test tenant (JSONL + Parquet).
Configuration¶
- Env: `EXPORT_MAX_CONCURRENCY_PER_TENANT`, `EXPORT_DEFAULT_PART_MIB`, `EXPORT_MAX_PART_MIB`, `WEBHOOK_SIGNING_KID`, `PRESIGN_TTL_SEC`.
- SLOs: define job duration targets per size window.
Maintenance¶
- Rotate signing keys and presign credentials; prune expired parts/manifests.
- Rehearse DR: re-run export from `resumeToken` after worker failover.
Troubleshooting¶
- Slow jobs → check read replica load, part size too small/large, compression CPU bound.
- Frequent 409 conflicts → review client workflow (don’t resume running jobs).
- Webhook failures → verify DNS/TLS; use manual manifest retrieval.
Testing Scenarios¶
Happy Path Tests¶
- Create job with 24h range → parts produced; manifest includes merkle root/signature.
- Presigned URLs download successfully; counts/bytes match manifest.
Error Path Tests¶
- 400 on invalid range/filters/format; 404 on unknown jobId; 409 on resume while running.
- 429/503 lead to client backoff and eventual success.
Performance Tests¶
- Validate time-to-first-part p95 ≤ 30 s under nominal load.
- Confirm linear scaling with worker count up to configured cap.
Security Tests¶
- `audit:export` scope enforced; cross-tenant access blocked.
- Presigned URLs expire and are scoped to objects; encryption at rest verified.
- Manifest signature verifies against Integrity public key.
Related Documentation¶
Internal References¶
Related Flows¶
- Legal Hold Export Flow
- eDiscovery Export Flow
- Bulk Export Flow
- Audit Record Projection Update Flow
External References¶
- RFC 4180 (CSV, if supported), JSON Lines spec, Parquet format spec
- W3C Trace Context; RFC 7807 (Problem Details)
Appendices¶
A. Example Problem+JSON (retention violation)¶
{
"type": "urn:connectsoft:errors/export/policy.violation",
"title": "Retention policy violation",
"status": 400,
"detail": "Requested 'from' precedes tenant retention window.",
"traceId": "9f0c1d2e3a4b5c6d...",
"errors": [{"pointer": "/range/from", "reason": "before-retention-start"}]
}
B. Webhook Payload (HMAC signed)¶
{
"event": "Export.Completed",
"jobId": "exp_01JECXYZ...",
"tenant": "acme",
"manifestUrl": "https://api.../export/v1/jobs/exp_01JECXYZ.../manifest",
"status": "Completed",
"signature": {"alg":"HMAC-SHA256","kid":"wh-2025","ts":"2025-10-22T12:31:02Z","sig":"b64..."}
}
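A receiver can verify such an HMAC-signed payload with a sketch like the following; the exact string-to-sign (here `ts` + `.` + raw body) is an assumption to be confirmed against the webhook contract:

```python
import base64
import hashlib
import hmac

def verify_webhook(raw_body: bytes, ts: str, sig_b64: str, secret: bytes) -> bool:
    """Recompute HMAC-SHA256 over the timestamp and raw request body, then
    compare against the presented base64 signature in constant time."""
    mac = hmac.new(secret, ts.encode() + b"." + raw_body, hashlib.sha256).digest()
    return hmac.compare_digest(base64.b64encode(mac).decode(), sig_b64)
```

Binding the timestamp into the MAC lets the receiver also reject stale deliveries (replay protection) by checking `ts` against a tolerance window.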
Legal Hold Export Flow¶
Export of audit data subject to active Legal Holds. The LegalHold Service validates scope and policy, instructs the Export Service to run a hold-compliant export, embeds proof inclusion (integrity root, hold decision metadata, and optional per-part/record proofs) into a signed manifest, delivers via secure presigned URLs and/or webhook, and emits completion events. Holds continue to block purge, and all actions are themselves audited.
Overview¶
Purpose: Produce a defensible, tamper-evident export of all records covered by one or more active Legal Holds for a tenant (or set of scopes).
Scope: Hold resolution & validation, compliance decision capture, hold-aware query scoping, chunked packaging, integrity & proof inclusion policy, secure delivery, resume/cancel, and auditable completion. Excludes non-hold exports (see Standard Export Flow).
Context: Builds on the Export Service and Integrity Service; queries the Read Store (projections) with server-side filters derived from LegalHold definitions and their current Revision.
Key Participants:
- Legal Team / Client
- API Gateway
- LegalHold Service (hold registry, scope/eligibility, decisioning)
- Export Service (orchestrator, packaging)
- Query Service / Read Store (tenant-scoped reads)
- Integrity Service (Merkle roots, signatures)
- Delivery Backend (object storage, presigned URLs)
- Webhook Receiver (optional callback endpoint)
Prerequisites¶
System Requirements¶
- Gateway with TLS + JWT validation
- LegalHold Service reachable; hold registry & revisioning enabled
- Export Service has access to Read Store, Integrity, Delivery Backend
- Webhook signing keys/KMS available if callbacks are used
Business Requirements¶
- Target LegalHold exists and is Active (not Released)
- Tenant retention/residency policies configured; hold implies purge block
- Operator runbook for evidence requests and key rotation
Performance Requirements¶
- p95 time-to-first-part ≤ 45 s for typical hold scopes
- Concurrency caps per tenant and per hold to avoid read hot spots
- Indexes support hold filters (resource/action/time) efficiently
Sequence Flow¶
Happy Path¶
sequenceDiagram
autonumber
actor Legal as Legal Team
participant GW as API Gateway
participant LHS as LegalHold Service
participant EXP as Export Service
participant Q as Query Service / Read Store
participant INT as Integrity Service
participant OBJ as Delivery Backend
participant WH as Webhook (optional)
Legal->>GW: POST /legal-hold/v1/exports {holdId, format, partSize, proofMode, webhook?}
GW->>LHS: Validate authN/Z, fetch hold(holdId) + current Revision
LHS-->>GW: 200 {holdSnapshot:{id, revision, scope, status:Active}}
GW->>EXP: Create export job (mode: LEGAL_HOLD, holdSnapshot, proofMode)
EXP->>Q: Open scoped cursor using holdSnapshot.scope (tenant, filters, time)
loop Chunk until exhausted
Q-->>EXP: Page of rows + next cursor
EXP->>INT: Add leaves to integrity segment (per-part proofs if requested)
EXP->>OBJ: PUT part (JSONL/Parquet, optional gzip)
EXP->>EXP: Record part metadata + resumeToken
end
EXP->>INT: Seal block → MerkleRoot + signature
EXP->>EXP: Build signed ExportManifest {parts, counts, bytes, holdSnapshot, proofPolicy, merkleRoot, signature}
EXP-->>Legal: 202 Accepted {jobId, status:"Running"}
alt webhook configured
EXP->>WH: POST Export.Completed {jobId, manifestUrl, holdSnapshot, signature}
end
Alternative Paths¶
- Multiple holds: request `{holdIds:[...]}`; LHS returns merged scope (union) and aggregated decision id(s).
- Incremental export: `sinceDecisionId` or `sinceWatermark` to export only new/changed covered records.
- Client-provided storage: `delivery.mode=client-presigned-put` with per-part presigned PUT URLs.
Error Paths¶
sequenceDiagram
actor Legal
participant GW as API Gateway
participant LHS as LegalHold Service
Legal->>GW: POST /legal-hold/v1/exports {holdId:"unknown"}
alt Bad request (malformed payload/params)
GW-->>Legal: 400 Bad Request (Problem+JSON)
else Hold not found or not Active
GW->>LHS: GET hold(holdId)
LHS-->>GW: 404/409 (Released|NotFound)
GW-->>Legal: 404/409 Problem+JSON
else Conflict with hold revision (If-Match mismatch)
GW-->>Legal: 412 Precondition Failed (Problem+JSON)
else Rate limited / dependency down
GW-->>Legal: 429/503 (Retry-After)
end
Request/Response Specifications¶
Input Requirements¶
| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| Method/Path | `POST /legal-hold/v1/exports` | Y | Create a hold-governed export job | JSON body |
| `Authorization` | header | Y | `Bearer <JWT>` | Valid, not expired |
| `x-tenant-id` | header | Y | Tenant scope | Must match hold tenant |
| `traceparent` | header | O | W3C trace context | 55-char |
| `holdId` | string | Y | Target legal hold id | Exists & status=Active |
| `ifMatch` | header | O | Expected `holdRevision` (optimistic) | Matches current revision |
| `format` | enum | O | `jsonl` (default), `parquet` | allowlist |
| `compression` | enum | O | `none`, `gzip` | allowlist |
| `partSizeMiB` | int | O | Target part size | 16–1024, default 256 |
| `proofMode` | enum | O | `manifest-only` \| `per-part` | allowlist |
| `webhook.url/webhook.secretId` | string | O | Completion callback + signing | HTTPS + known KMS key |
| `delivery.mode` | enum | O | `presigned-get` \| `client-presigned-put` | allowlist |
Output Specifications¶
Create — 202 Accepted
| Field | Type | Description |
|---|---|---|
| `jobId` | string | Server-assigned id |
| `status` | enum | `Queued` \| `Running` |
| `holdSnapshot` | object | `{id, revision, scope, decidedAt, decisionId}` |
| `proofPolicy` | object | `{mode, algorithm, keyId}` |
| `pollUrl` / `manifestUrl` | url | Where to poll/fetch manifest |
Manifest (excerpt)
{
"jobId": "exp_01JF3…",
"mode": "LEGAL_HOLD",
"tenant": "acme",
"holdSnapshot": {
"id": "lh_2025_001",
"revision": 7,
"scope": {"resourceTypes":["Case.File","Iam.User"],"time":{"from":"2025-09-01T00:00:00Z"}},
"decidedAt": "2025-10-10T12:01:22Z",
"decisionId": "lhdec_8a12…"
},
"proofPolicy": {"mode":"per-part","algorithm":"Ed25519","keyId":"int-key-2025"},
"integrity": {"merkleRoot":"8a4f…","signature":{"alg":"Ed25519","kid":"int-key-2025","sig":"MEQC…"}},
"parts":[{"index":0,"url":"https://…/p0.gz","bytes":268435456,"records":100000,"etag":"\"abc123\""}]
}
Error Handling¶
Error Scenarios¶
| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Malformed body; unsupported `format`/`proofMode`; invalid `partSizeMiB` | Correct request | No retry until fixed |
| 401 | Missing/invalid JWT | Acquire valid token | Retry after renewal |
| 403 | Missing `audit:legalhold.export` or tenant mismatch | Request proper role/scope | No retry |
| 404 | `holdId` not found (or not in tenant) | Verify hold/tenant | — |
| 409 | Hold status not `Active` (e.g., `Released`); job state conflict on resume/cancel | Activate/select correct hold; create new job | Retry after fix |
| 412 | `If-Match` revision mismatch (hold updated mid-flight) | Re-fetch hold; restart with new revision | Retry with new precondition |
| 429 | Per-tenant/global rate limit | Respect `Retry-After` | Backoff + jitter |
| 503 | Read store/Integrity/Delivery unavailable | Wait for recovery | Retry idempotently |
Failure Modes¶
- Hold mutated during export: precondition fails (412) to ensure defensibility; job halts.
- Policy violation (residency/retention): 400 `.../policy.violation` with `decisionId`.
- Webhook delivery failure: job completes; callback retried with backoff; manifest always retrievable.
Recovery Procedures¶
- On 412, fetch the latest `holdSnapshot` and recreate the job.
- On 503/429, back off; use the server-provided `resumeToken` to continue.
- If policy violation, adjust scope with the Legal team; re-request.
Performance Characteristics¶
Latency Expectations¶
- Time-to-first-part p95 ≤ 45 s for typical holds.
- Steady-state throughput bounded by read replicas and object storage.
Throughput Limits¶
- Per hold: 1–2 concurrent jobs (configurable).
- Per tenant: combined cap across holds/exports to preserve SLOs.
Resource Requirements¶
- CPU for serialization/compression; memory for part buffers; IOPS for scans.
Scaling Considerations¶
- Shard by tenant; sequence chunks with seek pagination.
- Prefer `per-part` proofs for a balance of size vs. verifiability; `per-record` for high-assurance cases only.
Security & Compliance¶
Authentication¶
- OIDC JWT at Gateway; optional mTLS between services.
Authorization¶
- Require `audit:legalhold.read` to resolve holds and `audit:legalhold.export` to create jobs.
- Enforce RLS on reads; verify `x-tenant-id` vs hold tenant.
Data Protection¶
- Parts encrypted at rest; presigned URLs are short-lived, least-privilege; webhook payloads HMAC-signed.
- Redaction/minimization may still apply if configured for hold exports (jurisdictional constraint).
Compliance¶
- Holds block purge throughout job lifetime; export does not weaken hold.
- Manifest includes holdSnapshot (id, revision, decisionId) and integrity proofs per proofPolicy.
- All requests emit audit entries (who, when, purpose, hold ids, decision ids).
Monitoring & Observability¶
Key Metrics¶
| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| `lh_export_jobs_active` | gauge | Running hold exports | > cap per tenant/hold |
| `lh_export_job_duration_seconds` | histogram | Runtime per job | p95 > SLO |
| `lh_hold_revision_conflicts_total` | counter | 412 preconditions hit | Spike indicates frequent edits |
| `lh_export_bytes_written_total` | counter | Bytes exported under holds | Trend/forecast |
| `lh_export_failures_total` | counter | Failed jobs | > 0 sustained |
Logging Requirements¶
- Structured logs: `tenant`, `holdId`, `holdRevision`, `jobId`, `decisionId`, `proofMode`, `partIndex`, `bytes`, `records`, `watermark`. No raw PII.
Distributed Tracing¶
- Spans: `legalhold.resolve`, `export.create`, `query.page`, `integrity.seal`, `object.put`, `webhook.post`.
- Attributes: `holdId`, `revision`, `proofMode`, `parts`, `bytes`.
Health Checks¶
- Readiness: LHS/Read Store/Integrity/Delivery reachable; signing keys loaded.
- Liveness: worker queues healthy; no stuck jobs; purge-block signal latched for hold.
Operational Procedures¶
Deployment¶
- Deploy `LegalHold Service` & the `/legal-hold/v1/exports` route behind Gateway.
- Configure KMS keys for manifest/proof signing and webhook HMAC.
- Validate end-to-end on a test hold (Active → export → Completed).
Configuration¶
- Env: `LH_EXPORT_MAX_CONCURRENCY`, `EXPORT_DEFAULT_PART_MIB`, `PROOF_DEFAULT_MODE`, `PRESIGN_TTL_SEC`, `WEBHOOK_SIGNING_KID`.
- Policy: toggle `allowPerRecordProofs` by edition/regulatory need.
Maintenance¶
- Rotate signing keys; prune expired presigned URLs and old manifests per policy.
- Periodically reconcile hold purge-block flags across stores.
Troubleshooting¶
- 412 spikes → educate counsel/operators to avoid modifying holds during exports; rely on `ifMatch`.
- Slow jobs → check read replica load, part size, compression CPU.
- Webhook failures → review TLS/HMAC configuration; fall back to polling `manifestUrl`.
Testing Scenarios¶
Happy Path Tests¶
- Active `holdId` export produces parts and a manifest with `holdSnapshot`, `merkleRoot`, `signature`.
- Proof policy `per-part` includes per-part proofs; `manifest-only` includes only root/signature.
Error Path Tests¶
- 400 on unsupported `proofMode`/invalid `partSizeMiB`.
- 404 on unknown `holdId`.
- 409 when hold status is `Released`.
- 412 when `ifMatch` revision mismatches.
- 429/503 cause compliant backoff and resume.
Performance Tests¶
- Time-to-first-part p95 ≤ 45 s under nominal load.
- Linear scaling with additional workers up to cap.
Security Tests¶
- RBAC scopes enforced; cross-tenant blocked.
- Presigned URLs expire; webhook HMAC validates.
- Manifest signature verifies with Integrity public key.
Related Documentation¶
Internal References¶
- Data Model — Legal Hold Model
- Data Model — Export Models & Manifests
- Data Model — Integrity Structures
- Retention Policy Model
Related Flows¶
- Standard Export Flow
- Legal Hold Processing Flow
- Compliance Audit Flow
External References¶
- RFC 7807 (Problem Details)
- W3C Trace Context
- Organization Legal Hold & Evidence Handling Policy
Appendices¶
A. Example Problem+JSON (hold released)¶
{
"type": "urn:connectsoft:errors/legalhold/status.invalid",
"title": "Hold is not active",
"status": 409,
"detail": "Legal hold 'lh_2025_001' is Released (rev=7).",
"traceId": "9f0c1d2e3a4b5c6d...",
"errors": [{"pointer": "/holdId", "reason": "released"}]
}
B. Proof Inclusion Policy Options¶
- `manifest-only`: single MerkleRoot + signature in manifest.
- `per-part`: each part contains a subtree root; manifest maps parts→proofs.
- `per-record` (high assurance): each line embeds a leaf hash or side proof; larger output, strongest verification.
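A minimal sketch of the tree computation these modes share; the hash (SHA-256), pairing order, and odd-node promotion are assumptions here — the Integrity spec defines the normative construction:

```python
import hashlib

def merkle_root(leaves: list[bytes]) -> bytes:
    """Binary SHA-256 Merkle root over raw leaf payloads; an odd node at any
    level is promoted unchanged. A per-part subtree root is just this function
    applied to that part's records."""
    if not leaves:
        return hashlib.sha256(b"").digest()
    level = [hashlib.sha256(leaf).digest() for leaf in leaves]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                nxt.append(hashlib.sha256(level[i] + level[i + 1]).digest())
            else:
                nxt.append(level[i])  # odd node promoted unchanged
        level = nxt
    return level[0]
```

Under `per-part`, the manifest would map each part index to its subtree root, so a verifier can check a single downloaded part without fetching the whole export.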
eDiscovery Export Flow¶
Generates a forensically defensible export tailored for eDiscovery: runs a scoped export, computes a signed ExportManifest, invokes KMS/HSM to produce a detached signature over the manifest and Merkle root, and assembles an Integrity Bundle (manifest + proofs + public key material) for delivery.
Overview¶
Purpose: Provide legal/forensic teams with a complete, tamper-evident export that includes a signed manifest and Merkle proofs suitable for independent verification.
Scope: Job creation, scoped read, manifest construction, Merkle tree computation, KMS signing, bundle packaging (ZIP/TAR.GZ), delivery via presigned URLs or webhook, and completion event. Excludes hold-governed constraints (see Legal Hold Export Flow) and generic on-demand exports (see Standard Export Flow).
Context: Builds on Export Service and Integrity Service with KMS/HSM for signing. Reads from Read Store via Query Service.
Key Participants:
- eDiscovery Client (case management/tooling)
- API Gateway
- Export Service (orchestrator, packaging)
- Query Service / Read Store (scoped reads)
- Integrity Service (Merkle computation)
- KMS/HSM (key management, signing)
- Delivery Backend (object storage, presigned URLs)
- Webhook Receiver (optional)
Prerequisites¶
System Requirements¶
- Gateway with TLS + JWT validation
- Export & Integrity Services deployed; integration with KMS/HSM configured (key IDs, policies)
- Read Store accessible with RLS by `tenantId`
- Object storage bucket for parts, manifest, and bundle
Business Requirements¶
- Tenant’s retention/residency policies defined and enforced
- eDiscovery caseId lifecycle managed (optional, but recommended)
- Operator runbook for key rotation & signature verification
Performance Requirements¶
- p95 time-to-manifest ≤ 30 s for typical 24–48h scopes
- Bundle assembly completes ≤ 60 s after final part upload
- Per-tenant export concurrency capped to protect read replicas
Sequence Flow¶
Happy Path¶
sequenceDiagram
autonumber
actor EDC as eDiscovery Client
participant GW as API Gateway
participant EXP as Export Service
participant Q as Query Service / Read Store
participant INT as Integrity Service
participant KMS as KMS/HSM (Signer)
participant OBJ as Delivery Backend
participant WH as Webhook (optional)
EDC->>GW: POST /ediscovery/v1/exports {tenant, caseId, range, filters, format, proofMode, bundle:{type}}
GW->>EXP: Create export job (mode: EDISCOVERY) + params
EXP->>Q: Open scoped cursor (tenant/from-to/filters)
loop Stream pages → parts
Q-->>EXP: Page of rows + next cursor
EXP->>INT: Update Merkle segment with leaf hashes
EXP->>OBJ: PUT part (JSONL/Parquet, optional gzip)
EXP->>EXP: Track part metadata (index, bytes, records, ETag)
end
EXP->>INT: Seal block → {merkleRoot}
EXP->>EXP: Build ExportManifest {parts, counts, bytes, watermarks, merkleRoot}
EXP->>KMS: Sign canonicalized(manifest) + merkleRoot → {signature, kid, alg}
EXP->>OBJ: PUT manifest.json and manifest.sig
EXP->>EXP: Assemble Integrity Bundle (manifest, signature, publicKey/chain, optional proofs)
EXP->>OBJ: PUT bundle (bundle.zip/.tar.gz) → bundleUrl
EXP-->>EDC: 202 Accepted {jobId, status:"Running"}
alt webhook configured
EXP->>WH: POST Export.Completed {jobId, bundleUrl, manifestUrl, signature}
end
Alternative Paths¶
- Proof modes: `manifest-only` (root+sig), `per-part` (subtree proofs), `per-record` (leaf proofs; larger bundle).
- Client-provided storage: `delivery.mode=client-presigned-put` for manifest/parts/bundle.
- Re-sign: `POST /ediscovery/v1/exports/{jobId}:resign {kid}` to reissue the signature with a rotated key (no data rewrite).
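The `Sign canonicalized(manifest)` step in the happy path presumes a deterministic byte serialization before the KMS call; a common sketch (JCS-style sorted-key compact JSON — an assumption, not the platform's specified canonicalization) looks like:

```python
import hashlib
import json

def manifest_digest(manifest: dict) -> bytes:
    """SHA-256 over a canonical JSON form (sorted keys, no whitespace, UTF-8).
    This digest (or the canonical bytes themselves) is what would be submitted
    to KMS/HSM to produce the detached manifest signature."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(canonical).digest()
```

Because the digest depends only on content, not key order or whitespace, `:resign` can reissue a signature over the same bytes with a rotated key and no data rewrite.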
Error Paths¶
sequenceDiagram
actor EDC
participant GW as API Gateway
participant EXP as Export Service
EDC->>GW: POST /ediscovery/v1/exports {invalid params}
alt Malformed request / unsupported proofMode/format
GW-->>EDC: 400 Bad Request (Problem+JSON)
else Unknown tenant / route
GW-->>EDC: 404 Not Found (Problem+JSON)
else Conflict (resign while job running, or bundle requested before complete)
GW-->>EDC: 409 Conflict (Problem+JSON)
else Unauthorized / Forbidden
GW-->>EDC: 401/403 (Problem+JSON)
else Backpressure / dependency down
GW-->>EDC: 429/503 (Retry-After)
end
Request/Response Specifications¶
Input Requirements¶
| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| Method/Path | `POST /ediscovery/v1/exports` | Y | Create eDiscovery export | JSON body |
| `Authorization` | header | Y | `Bearer <JWT>` | Valid, not expired |
| `x-tenant-id` | header | Y* | Tenant scope | Must match `tenant` |
| `traceparent` | header | O | W3C trace context | 55-char |
| `tenant` | string | Y | Target tenant | `^[A-Za-z0-9._-]{1,128}$` |
| `caseId` | string | O | eDiscovery case identifier | ≤ 128 chars |
| `range` | object | O | `{from?, to?}` ISO-8601 UTC | `from ≤ to`, retention bounds |
| `filters` | object | O | Allowlisted predicates | Server validated |
| `format` | enum | O | `jsonl` (default) \| `parquet` | allowlist |
| `compression` | enum | O | `none` \| `gzip` (default) | allowlist |
| `proofMode` | enum | O | `manifest-only` \| `per-part` | allowlist |
| `bundle.type` | enum | O | `zip` (default) \| `tar.gz` | allowlist |
| `kms.kid` | string | O | Key id for signing | Must exist in KMS |
| `delivery.mode` | enum | O | `presigned-get` \| `client-presigned-put` | allowlist |
| `webhook.url/webhook.secretId` | string | O | Completion callback + HMAC key | HTTPS + known key |

\* Header may be omitted if using the path variant `/tenants/{tenantId}/ediscovery/exports`.
Output Specifications¶
Create — 202 Accepted
| Field | Type | Description |
|---|---|---|
| `jobId` | string | Server-assigned id (ULID/GUID) |
| `status` | enum | `Queued` \| `Running` |
| `pollUrl` | url | `GET /ediscovery/v1/exports/{jobId}` |
| `manifestUrl` | url? | Available once ready |
| `bundleUrl` | url? | Available once ready |
Get — 200 OK
| Field | Type | Description |
|---|---|---|
| `jobId` | string | Identifier |
| `status` | enum | `Queued` \| `Running` \| `Sealing` \| `Signing` \| `Bundling` \| `Completed` \| `Failed` \| `Canceled` |
| `counts` | object | `{records, parts}` |
| `bytes` | object | `{written}` |
| `merkleRoot` | string | Hex/base64url root |
| `signature` | object? | `{alg,kid,sig}` once signed |
| `manifestUrl` / `bundleUrl` | url? | Delivery endpoints |
| `resumeToken` | string? | For resume/retry |
| `startedAt/finishedAt` | timestamp | ISO-8601 UTC |
Integrity Bundle Contents (concept)
bundle/
manifest.json
manifest.sig # COSE_Sign1 or JWS (detached)
integrity/
root.json # { merkleRoot, algorithm, createdAt }
proofs/ # per-part or per-record .proof files (optional)
keys/
publicKey.pem # PEM or JWK
key-metadata.json # { kid, alg, issuer, notBefore, notAfter }
README.txt # verification instructions
Manifest (excerpt)
{
"jobId": "exp_01JFG2...",
"mode": "EDISCOVERY",
"tenant": "acme",
"caseId": "CASE-2025-0421",
"range": {"from":"2025-10-01T00:00:00Z","to":"2025-10-22T23:59:59Z"},
"format": "jsonl",
"compression": "gzip",
"parts": [
{"index":0,"url":"https://.../p0.gz","bytes":268435456,"records":100000,"etag":"\"abc123\""}
],
"counts":{"records":250000,"parts":3},
"bytes":{"written":734003200},
"integrity":{"merkleRoot":"8a4f...","algorithm":"sha256","createdAt":"2025-10-22T12:30:12Z"},
"createdAt":"2025-10-22T12:30:12Z",
"watermark":"2025-10-22T12:25:00Z"
}
Error Handling¶
Error Scenarios¶
| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Malformed body; invalid `range`/`filters`; unsupported `proofMode`/`format`/`bundle.type`; unknown `kms.kid` | Correct request/params | No retry until fixed |
| 401 | Missing/invalid JWT | Acquire valid token | Retry after renewal |
| 403 | Missing `audit:ediscovery.export` or tenant mismatch | Request proper scope/role | No retry |
| 404 | Tenant/route not found; `jobId` unknown; manifest/bundle not available | Verify tenant/IDs; wait for completion | — |
| 409 | Bundle requested before job complete; resign while signing; resume on terminal job | Poll until terminal; create new job | Retry after fix |
| 412 | `If-Match` on manifest version failed (re-signed) | Fetch latest manifest; retry | Retry with new ETag |
| 429 | Per-tenant/global export rate limited | Respect `Retry-After` | Exponential backoff + jitter |
| 503 | Read store/Integrity/KMS/object storage unavailable | Wait for recovery | Retry idempotently |
Failure Modes¶
- KMS key disabled/rotated: signing fails → 503; operator selects a new `kid` or uses `:resign`.
- Proof blowup with `per-record` on huge jobs → 413/422 with guidance to switch to `per-part`.
- Residency/retention policy violation → 400 `.../policy.violation` (decision id included).
Recovery Procedures¶
- On 409, poll job status until `Completed`, then fetch `manifestUrl`/`bundleUrl`.
- On 503/429, back off and use `resumeToken` to continue without duplicating parts.
- On signature/key issues, re-run `:resign` with a valid `kms.kid`.
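The backoff guidance above can be condensed into one helper. A minimal sketch assuming full jitter and a 60 s cap; `retry_delay_seconds` is a hypothetical name and the constants are illustrative, not ATP defaults:

```python
import random

def retry_delay_seconds(attempt, retry_after=None, base=0.5, cap=60.0):
    """Compute the wait before retry `attempt` (0-based).
    A server-supplied Retry-After always wins; otherwise use
    exponential backoff with full jitter, capped at `cap` seconds."""
    if retry_after is not None:
        return retry_after
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

Clients would call this after a 429/503, passing the parsed `Retry-After` header when present.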
Performance Characteristics¶
Latency Expectations¶
- Time-to-manifest p95 ≤ 30 s for typical scopes; bundling overhead ≤ 60 s.
Throughput Limits¶
- Per tenant: ≤ 2 concurrent eDiscovery jobs (configurable).
- Global: limited by export workers, KMS QPS, and object storage throughput.
Resource Requirements¶
- CPU for serialization/compression; memory for part buffers and proof generation; KMS signing latency budget (p95 ≤ 100 ms).
Scaling Considerations¶
- Horizontal export workers; bound KMS concurrency; stream proof files to avoid large in-memory structures.
- Prefer `per-part` proofs for a balance of size vs. verifiability.
Security & Compliance¶
Authentication¶
- OIDC JWT at Gateway; optional mTLS for service-to-service calls.
Authorization¶
- Require `audit:ediscovery.export`; enforce RLS on reads; verify `x-tenant-id`.
Data Protection¶
- Object storage encryption at rest; time-limited presigned URLs; webhook payloads HMAC-signed.
- No raw secret material in logs; public keys shipped as JWK/PEM inside bundle only.
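Since webhook payloads are HMAC-signed, receivers should verify the signature before trusting a callback. A minimal sketch assuming HMAC-SHA256 over the raw body with a hex-encoded signature; the actual header name and encoding are defined by the webhook contract, and `verify_webhook` is a hypothetical helper:

```python
import hashlib
import hmac

def verify_webhook(payload: bytes, signature_hex: str, secret: bytes) -> bool:
    """Recompute HMAC-SHA256 over the raw payload and compare in
    constant time to defeat timing attacks."""
    expected = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)
```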
Compliance¶
- Manifest + signature + proofs enable independent verification.
- Include `watermark` (projection snapshot time) and `caseId` in the manifest for chain-of-custody.
- Emit audit entries for create/resume/resign/bundle fetch actions.
Monitoring & Observability¶
Key Metrics¶
| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| `ediscovery_jobs_active` | gauge | Running jobs | > tenant/global cap |
| `manifest_build_duration_ms` | histogram | Build + sign time | p95 > 30 s |
| `kms_sign_latency_ms` | histogram | KMS sign call | p95 > 100 ms |
| `bundle_bytes_total` | counter | Size of bundles | Trend/forecast |
| `ediscovery_failures_total` | counter | Failed jobs | > 0 sustained |
Logging Requirements¶
- Structured logs: `tenant`, `caseId`, `jobId`, `merkleRoot`, `kid`, `proofMode`, `parts`, `bytes`, `watermark`.
- Do not log raw proofs or presigned URLs.
Distributed Tracing¶
- Spans: `export.create`, `query.page`, `integrity.seal`, `kms.sign`, `bundle.pack`, `object.put`, `webhook.post`.
- Attributes: `kid`, `proofMode`, `bundleType`, `parts`, `bytes`.
Health Checks¶
- Readiness: KMS key available, Integrity & Object storage reachable.
- Liveness: worker queues healthy; no stuck `Signing`/`Bundling` states.
Operational Procedures¶
Deployment¶
- Configure KMS key(s) and `kid` mapping; verify the sign/verify path in staging.
- Deploy the `/ediscovery/v1/exports` route; ensure buckets and the presign service are ready.
- Validate end-to-end: create job → manifest signed → bundle downloadable and verifiable.
Configuration¶
- Env: `EXPORT_MAX_CONCURRENCY_PER_TENANT`, `EXPORT_DEFAULT_PART_MIB`, `PROOF_DEFAULT_MODE`, `KMS_DEFAULT_KID`, `PRESIGN_TTL_SEC`.
- Policies: enforce retention/residency on the export scope.
Maintenance¶
- Rotate KMS keys; support `:resign` to reissue signatures.
- Prune expired presigned URLs and old bundles per policy.
Troubleshooting¶
- High `kms_sign_latency_ms` → check KMS limits/region; enable key caching.
- Large bundles/timeouts → switch to `per-part` proofs; increase part size.
- 409 conflicts → ensure clients poll status before requesting bundle/resign.
Testing Scenarios¶
Happy Path Tests¶
- Create an eDiscovery export with `proofMode=per-part` → manifest + signature + bundle available; verification succeeds.
- `:resign` with a new `kid` produces a new `manifest.sig` without rewriting parts.
Error Path Tests¶
- 400 on invalid `proofMode`/`format`/`bundle.type` or a bad `range`.
- 404 on unknown `jobId` or bundle before creation.
- 409 when requesting the bundle before completion or resign during signing.
- 429/503 trigger compliant backoff and resume.
Performance Tests¶
- Time-to-manifest p95 ≤ 30 s; bundling overhead ≤ 60 s under nominal load.
- KMS signing p95 ≤ 100 ms for 95% of signatures.
Security Tests¶
- RBAC scope `audit:ediscovery.export` enforced; cross-tenant access blocked.
- Manifest signature verifies with the exported public key (JWK/PEM).
- Presigned URLs expire and are least-privilege.
Related Documentation¶
Internal References¶
- Data Model — Export Models & Manifests
- Data Model — Integrity Structures
- Standard Export Flow
- Legal Hold Export Flow
External References¶
- COSE (RFC 8152) / JWS (RFC 7515) for signatures
- W3C Trace Context; RFC 7807 (Problem Details)
Appendices¶
A. Example manifest.sig (JWS detached)¶
B. Verification Outline¶
- Download `manifest.json`, `manifest.sig`, and `keys/publicKey.pem`.
- Verify the signature over the canonicalized manifest (UTF-8, no BOM).
- Recompute the Merkle root from all part proofs (if provided) and compare to `manifest.integrity.merkleRoot`.
- Spot-verify a subset of parts/records using `proofs/*.proof`.
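The root recomputation in step 3 can be sketched as a small fold over per-part leaf hashes. This is an illustrative sketch assuming SHA-256 leaves, pairwise concatenation, and promotion of an odd trailing node; the authoritative construction (padding rule, leaf encoding) is defined in the Integrity spec, and `merkle_root` is a hypothetical helper name:

```python
import hashlib

def merkle_root(leaf_hashes: list) -> bytes:
    """Fold a level of hashes pairwise until one root remains.
    When a level has an odd count, the last hash is promoted as-is
    (an assumption; see the Integrity spec for the real rule)."""
    if not leaf_hashes:
        raise ValueError("no leaves")
    level = leaf_hashes
    while len(level) > 1:
        nxt = [hashlib.sha256(level[i] + level[i + 1]).digest()
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            nxt.append(level[-1])
        level = nxt
    return level[0]

# Leaves here stand in for per-part digests taken from proofs/*.proof.
leaves = [hashlib.sha256(f"part-{i}".encode()).digest() for i in range(3)]
root_hex = merkle_root(leaves).hex()  # compare against manifest.integrity.merkleRoot
```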
Bulk Export Flow¶
Scheduled or ad-hoc large-scale exports that split a wide scope into time/key slices, run them in parallel across a controlled worker pool, write results as multiple packages (parts/bundles), and support resume tokens for fault-tolerant continuation. Exposes explicit SLA/throughput metrics and enforces per-tenant/global concurrency limits.
Overview¶
Purpose: Efficiently export very large datasets (days/months of audit events) on a schedule or on demand, with parallelization, resumability, and integrity/manifest generation.
Scope: Scheduler, job creation, slicing strategy (time/partition), parallel workers, packaging (JSONL/Parquet, gzip), resume/cancel, integrity sealing, delivery via presigned URLs/webhook, and metrics. Excludes hold-specific rules (see Legal Hold Export Flow) and eDiscovery signing options (see eDiscovery Export Flow).
Context: Orchestrated by Export Service with a Scheduler; reads from Read Store via Query Service; uses Integrity Service for Merkle roots/signatures and Object Storage for parts/bundles.
Key Participants:
- Scheduler (cron/rrule, “run now”)
- API Gateway
- Export Service (orchestrator, slicer, worker pool)
- Query Service / Read Store (tenant-scoped scans)
- Integrity Service (hash/merkle/seal)
- Object Storage (parts, manifests, bundles)
- Webhook Receiver (optional callbacks)
- Metrics/Tracing Backend
Prerequisites¶
System Requirements¶
- Gateway with TLS + JWT; Export Service reachable by Scheduler
- Read Store with RLS by `tenantId`; seek pagination available
- Integrity Service & Object Storage configured (KMS keys, buckets)
- Clock skew controls; partition catalog available for slicing
Business Requirements¶
- Tenant retention/residency policies configured and enforced
- Export feature/edition enabled; per-tenant concurrency limits defined
- Optional webhook signing keys provisioned
Performance Requirements¶
- Target throughput per worker (e.g., 50–150 MB/s effective)
- Time-to-first-part p95 ≤ 60 s for bulk slice runs
- Slice width chosen to keep slice p95 ≤ 10–20 min under load
Sequence Flow¶
Happy Path¶
```mermaid
sequenceDiagram
  autonumber
  actor Sched as Scheduler
  participant GW as API Gateway
  participant EXP as Export Service (Orchestrator)
  participant SL as Slicer / Planner
  participant WP as Worker Pool
  participant Q as Query Service / Read Store
  participant INT as Integrity Service
  participant OBJ as Object Storage
  participant WH as Webhook (optional)
  Sched->>GW: POST /export/v1/bulk-jobs {tenant, schedule, range, sliceWidth, format, partSize}
  GW->>EXP: Create/Upsert BulkJob
  loop On schedule tick or run-now
    EXP->>SL: Plan slices for window (time/partition)
    SL-->>EXP: [Slice#0..Slice#N] + dependencies
    par N parallel slices (bounded by concurrency caps)
      EXP->>WP: Dispatch Slice#i {cursor, sliceWindow, resumeToken?}
      WP->>Q: Stream pages via seek pagination
      Q-->>WP: Rows + next cursor
      WP->>INT: Append leaf hashes, update merkle segment
      WP->>OBJ: PUT part(s) (JSONL/Parquet, gzip?)
      WP->>EXP: Report progress {bytes, records, partMeta, resumeToken}
    end
    EXP->>INT: Seal slice block → MerkleRoot + signature
    EXP->>OBJ: PUT slice manifest, update BulkJob manifest index
    alt webhook configured
      EXP->>WH: POST Export.SliceCompleted {jobId, sliceId, manifestUrl}
    end
  end
  EXP->>OBJ: PUT final Bulk Manifest (index of slice manifests) + signature
  EXP-->>GW: 200/202 {jobId, status:"Completed", manifestUrl, stats}
```
Alternative Paths¶
- Run now: `POST /export/v1/bulk-jobs/{id}:run-now` triggers an immediate cycle outside the schedule.
- Catch-up mode: planner advances by watermark; only exports new slices since last success.
- Client-managed storage: use presigned PUT per slice/part.
- Dynamic re-slicing: large slices auto-split if runtime exceeds threshold.
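The planner's "Plan slices for window" step can be sketched as a pure function over a half-open time range. `plan_slices` is a hypothetical helper; dynamic re-splitting of slow slices is left to the orchestrator:

```python
from datetime import datetime, timedelta, timezone

def plan_slices(start: datetime, end: datetime, width: timedelta):
    """Split [start, end) into contiguous half-open slice windows.
    The final slice is clipped to `end` when the range is not an
    exact multiple of `width`."""
    slices = []
    cursor = start
    while cursor < end:
        upper = min(cursor + width, end)
        slices.append((cursor, upper))
        cursor = upper
    return slices
```

For example, a three-day catch-up window with `sliceWidth=24h` yields three adjacent slices.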
Error Paths¶
```mermaid
sequenceDiagram
  actor Client
  participant GW as API Gateway
  participant EXP as Export Service
  Client->>GW: POST /export/v1/bulk-jobs {invalid config}
  alt Bad request (bad schedule/sliceWidth/partSize)
    GW-->>Client: 400 Problem+JSON
  else Unknown jobId / tenant route not found
    GW-->>Client: 404 Problem+JSON
  else Conflict (modify running job / duplicate schedule window)
    GW-->>Client: 409 Problem+JSON
  else Backpressure or deps down
    GW-->>Client: 429/503 Problem+JSON (+ Retry-After)
  end
```
Request/Response Specifications¶
Input Requirements¶
| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| Create/Update | `POST /export/v1/bulk-jobs` | Y | Create/Upsert bulk job | JSON body |
| `Authorization` | header | Y | `Bearer <JWT>` | Valid, not expired |
| `x-tenant-id` | header | Y* | Tenant scope | Must match `body.tenant` |
| `tenant` | string | Y | Target tenant | `^[A-Za-z0-9._-]{1,128}$` |
| `range` | object | O | `{from?, to?}` for initial catch-up | ISO-8601 UTC |
| `schedule` | object | O | `{cron:"0 2 * * *"}` or `{rrule:"RRULE:..."}` | Validated |
| `sliceWidth` | string | O | e.g., `24h`, `7d`, `1mo` | Max per policy |
| `format` | enum | O | `jsonl` (default) \| `parquet` | — |
| `compression` | enum | O | `none` \| `gzip` (default) | — |
| `partSizeMiB` | int | O | 16–1024 (default 256) | Bounds checked |
| `maxParallelSlices` | int | O | Per-tenant concurrency cap | ≤ tenant cap |
| `webhook.url`/`secretId` | string | O | Completion callbacks | HTTPS + known key |
| `delivery.mode` | enum | O | `presigned-get` \| `client-presigned-put` | — |
*Header may be omitted for `/tenants/{tenantId}/export/bulk-jobs`.
Control Endpoints

- `POST /export/v1/bulk-jobs/{id}:run-now`
- `POST /export/v1/bulk-jobs/{id}:pause` / `:resume` / `:cancel`
- `GET /export/v1/bulk-jobs/{id}` (status, stats, current window, next run)
- `GET /export/v1/bulk-jobs/{id}/manifest` (bulk manifest index)
Output Specifications¶
| Field | Type | Description | Notes |
|---|---|---|---|
| `jobId` | string | Bulk job identifier | ULID/GUID |
| `status` | enum | `Paused` \| `Scheduled` \| `Running` \| `Completed` \| `Failed` \| `Canceled` | — |
| `currentSlice` | object? | `{sliceId, window, status, resumeToken}` | When running |
| `stats` | object | `{slicesCompleted, bytes, records, parts}` | Cumulative |
| `manifestUrl` | url? | Bulk manifest index | After completion |
| `nextRunAt` | timestamp | Next scheduled tick | ISO-8601 UTC |
Bulk Manifest Index (concept)

```json
{
  "jobId": "bulk_01JH2…",
  "tenant": "acme",
  "schedule": "0 2 * * *",
  "slices": [
    {"sliceId":"s_2025_10_01","from":"2025-10-01T00:00:00Z","to":"2025-10-02T00:00:00Z","manifestUrl":"https://.../s_2025_10_01.manifest.json","merkleRoot":"8a4f...","signature":{"alg":"Ed25519","kid":"int-key-2025","sig":"MEQC..."}}
  ],
  "counts": {"records":12003450,"parts":480},
  "bytes": {"written":358721987654},
  "createdAt": "2025-10-22T02:00:00Z",
  "completedAt": "2025-10-22T09:40:00Z"
}
```
Error Handling¶
Error Scenarios¶
| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Invalid `schedule`/`sliceWidth`/`partSizeMiB`; malformed `range` | Fix config | No retry until corrected |
| 401 | Missing/invalid JWT | Obtain valid token | Retry after renewal |
| 403 | Missing `audit:export.bulk` or tenant mismatch | Request proper scope/role | — |
| 404 | Unknown `jobId` or tenant route disabled | Verify identifiers/tenant | — |
| 409 | Modify/pause/resume conflict; duplicate scheduled window; attempt to run while `Running` | Wait/resolve state; use run-now after idle | Retry after fix |
| 413 | Estimated bulk size exceeds job cap | Reduce scope/`sliceWidth`; increase cap by policy | — |
| 422 | `sliceWidth` too large for SLO; `range` outside retention | Choose smaller slices / valid window | — |
| 429 | Per-tenant/global concurrency limit hit | Honor `Retry-After` | Backoff + jitter |
| 503 | Read store/Integrity/Object storage unavailable | Wait for recovery | Idempotent retry using `resumeToken` |
Failure Modes¶
- Slice timeout → auto reslice into smaller sub-slices; remaining work re-queued.
- Resume after crash → `resumeToken` resumes at the last committed cursor/part.
- Storage throttling → Export Service reduces parallelism; returns 429 to clients.
Recovery Procedures¶
- Use `:resume` with the server-provided `resumeToken` to continue failed slices.
- On 429/503, back off and let the scheduler retry the tick; do not spawn duplicate runs.
- Adjust `sliceWidth`/`maxParallelSlices` to match observed throughput.
Performance Characteristics¶
Latency Expectations¶
- Time-to-first-part p95 ≤ 60 s per run.
- Per-slice runtime p95 within the configured SLO (e.g., ≤ 15 min for a `24h` slice on typical volume).
Throughput Limits¶
- Per worker: target sustained 50–150 MB/s effective write.
- Per tenant: cap `maxParallelSlices` (e.g., ≤ 4).
- Global: orchestrator enforces a cluster-wide max worker count.
Resource Requirements¶
- CPU for serialization/compression; RAM for part buffers; IOPS for wide scans; network to object storage.
Scaling Considerations¶
- Plan then fan-out: precompute slice plan and submit to a bounded queue.
- Fair-share: per-tenant token bucket to avoid noisy neighbors.
- Adaptive concurrency: scale workers based on export QPS, object storage throttling, and read replica load.
- Backpressure: honor `Retry-After`; dynamically shrink `maxParallelSlices`.
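The fair-share point above can be illustrated with a minimal per-tenant token bucket: each tenant refills at a fixed rate, and a slice dispatch consumes one token. The class name and numbers are an illustrative sketch, not ATP defaults:

```python
class TokenBucket:
    """Per-tenant limiter: `capacity` burst tokens, refilled at
    `rate` tokens/second; allow() consumes one token if available."""

    def __init__(self, capacity: float, rate: float):
        self.capacity = capacity
        self.rate = rate
        self.tokens = capacity
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

The orchestrator would keep one bucket per tenant and only dispatch a slice when that tenant's bucket allows it, so a noisy tenant cannot starve the shared worker pool.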
Security & Compliance¶
Authentication¶
- OIDC JWT at Gateway; optional mTLS for service-to-service calls.
Authorization¶
- Require `audit:export.bulk`; enforce RLS on reads; validate `x-tenant-id`.
Data Protection¶
- Server-side encryption at rest; presigned URLs short-lived and scoped.
- Optional on-read masking if bulk job set to filtered mode.
Compliance¶
- Respect retention/residency; include watermarks and integrity proofs per slice.
- Emit audit events for schedule create/update, run, pause/resume/cancel, and completion.
Monitoring & Observability¶
Key Metrics¶
| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| `bulk_jobs_active` | gauge | Running bulk jobs | > global cap |
| `bulk_slices_inflight` | gauge | Concurrent slice executions | > per-tenant cap |
| `bulk_bytes_written_total` | counter | Bytes written across slices | Trend/throughput |
| `bulk_slice_duration_seconds` | histogram | Runtime per slice | p95 > SLO |
| `bulk_failures_total` | counter | Failed slices/jobs | > 0 sustained |
| `resume_events_total` | counter | Resumed slices | Spike indicates instability |
Logging Requirements¶
- Structured logs: `tenant`, `jobId`, `sliceId`, `window`, `resumeToken`, `parts`, `bytes`, `records`, `watermark`, `merkleRoot`, `status`. No raw PII or presigned URLs.
Distributed Tracing¶
- Spans: `bulk.plan`, `slice.run`, `query.page`, `serialize.part`, `object.put`, `integrity.seal`, `webhook.post`.
- Attributes: `sliceWidth`, `parallelism`, `bytes`, `records`, `throttleEvents`.
Health Checks¶
- Readiness: object storage/Integrity/Read Store reachable; scheduler connected.
- Liveness: worker queues draining; no stuck slices beyond timeout.
Operational Procedures¶
Deployment¶
- Deploy the Scheduler and Export Service; register the `/export/v1/bulk-jobs` routes.
- Configure tenant/global concurrency caps and the default `sliceWidth`.
- Run a dry run on a non-prod tenant to validate planning and sealing.
Configuration¶
- Env: `BULK_MAX_PARALLEL_SLICES_PER_TENANT`, `BULK_DEFAULT_SLICE_WIDTH`, `EXPORT_DEFAULT_PART_MIB`, `RESUME_TOKEN_TTL`, `PRESIGN_TTL_SEC`, `SLA_SLICE_P95_SECONDS`.
- Planner: enable dynamic reslicing thresholds (time/size).
Maintenance¶
- Rotate signing keys; prune expired manifests/parts; archive bulk manifest indices per policy.
- Periodically reassess `sliceWidth` vs. observed volumes.
Troubleshooting¶
- Many resume events → check read replica throttling/object storage limits; reduce parallelism.
- Frequent 409 on job ops → ensure clients don’t modify running jobs; use `pause` then `update`.
- Slow slices → inspect filters/indexes and increase part size or reduce masking.
Testing Scenarios¶
Happy Path Tests¶
- Create a bulk job with a `cron` schedule; verify the automatic run creates multiple slices and parts.
- Resume a slice after an induced worker crash using `resumeToken`.
Error Path Tests¶
- 400 on invalid `schedule`/`sliceWidth`/`partSizeMiB`.
- 404 on unknown `jobId`.
- 409 when updating a running job without `pause`.
- 429/503 cause backoff and eventual success without duplication.
Performance Tests¶
- Achieve target throughput per worker and per tenant; slice p95 ≤ SLO.
- Concurrency caps prevent read replica saturation.
Security Tests¶
- RBAC `audit:export.bulk` enforced; cross-tenant isolation verified.
- Presigned URLs expire and are least-privilege.
- Integrity sealing produces valid Merkle roots/signatures per slice.
Related Documentation¶
Internal References¶
- Standard Export Flow
- eDiscovery Export Flow
- Data Model — Export Models & Manifests
- Data Model — Tenancy Keys & Partitioning
External References¶
- RFC 7807 (Problem Details)
- W3C Trace Context
Appendices¶
A. Example Create Bulk Job Request¶
```json
{
  "tenant": "acme",
  "schedule": { "cron": "0 2 * * *" },
  "range": { "from": "2025-09-01T00:00:00Z" },
  "sliceWidth": "24h",
  "format": "parquet",
  "compression": "gzip",
  "partSizeMiB": 256,
  "maxParallelSlices": 3,
  "delivery": { "mode": "presigned-get" },
  "webhook": { "url": "https://hooks.example/exports", "secretId": "wh-2025" }
}
```
B. Resume Token (concept)¶
```json
{
  "sliceId": "s_2025_10_21",
  "cursor": "eyJ0cyI6IjIwMjUtMTAtMjFUMTI6MDA6MDAuMDAwWiIsImlkIjoiMDFK...In0",
  "partIndex": 17,
  "bytesCommitted": 134217728
}
```
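The `cursor` above is unpadded base64url JSON carrying the seek-pagination position (`{ts, id}`). A round-trip sketch; the helper names are hypothetical and the exact cursor layout belongs to the Query Service:

```python
import base64
import json

def encode_cursor(ts: str, record_id: str) -> str:
    """Pack a seek position as unpadded base64url JSON."""
    raw = json.dumps({"ts": ts, "id": record_id}, separators=(",", ":")).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()

def decode_cursor(token: str) -> dict:
    """Restore padding and decode back to the position object."""
    raw = base64.urlsafe_b64decode(token + "=" * (-len(token) % 4))
    return json.loads(raw)
```

Treating the cursor as an opaque token on the client side keeps the server free to change its internal shape.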
Retention Policy Evaluation Flow¶
Computes and records eligibleAt timestamps for purge based on the active Retention Policy. Evaluations run on schedule and on policy change, marking candidates in the retention index and emitting Retention.EligibleComputed events with decision basis (policy id, rule id, revision, window).
Overview¶
Purpose: Determine when audit records (or partitions) become eligible for purge and persist eligibleAt along with decision metadata for defensible lifecycle operations.
Scope: Policy fetch & revision checks, rules evaluation (scopes/windows/exceptions), candidate marking, event emission, and re-evaluation on policy updates or clock ticks. Excludes purge execution (see Data Lifecycle & States / Purge flow).
Context: The Policy Service is the source of truth for Retention Policies and their forward-only revisions. The Lifecycle Evaluator (part of Policy or Lifecycle service) scans read/canonical stores and updates a Retention Index used by purge workers.
Key Participants:
- Scheduler (periodic + on-change trigger)
- API Gateway (for admin endpoints)
- Policy Service (policies, revisions, decisions)
- Lifecycle Evaluator (rules engine, candidate marker)
- Metadata/Retention Index (stores `eligibleAt`, decision basis)
- Event Bus (emits `Retention.EligibleComputed`)
Prerequisites¶
System Requirements¶
- Policy Service reachable; policy registry seeded with tenant policy
- Lifecycle Evaluator has read access to stores and write access to Retention Index
- Event Bus configured for `Retention.*` topics
- Time source synchronized; clock skew guardrails applied
Business Requirements¶
- Tenant has an Active retention policy with forward-only Revision
- Residency constraints configured (region-aware evaluation if required)
- Legal Holds honored (holds block eligibility marking)
Performance Requirements¶
- Evaluation p95 per partition ≤ 3 min for typical volumes
- Index write throughput supports peak daily windows (e.g., midnight marks)
- Backpressure controls on scans and index writes
Sequence Flow¶
Happy Path¶
```mermaid
sequenceDiagram
  autonumber
  participant SCH as Scheduler
  participant POL as Policy Service
  participant LCE as Lifecycle Evaluator
  participant IDX as Retention Index
  participant BUS as Event Bus
  SCH->>POL: GET /policy/v1/retention?tenant=acme (If-None-Match: rev)
  POL-->>SCH: 200 {policyId, revision, rules, windows} or 304 if unchanged
  SCH->>LCE: Trigger evaluate {tenant, policyId, revision, windowHint}
  LCE->>LCE: Enumerate candidate sets (by partition/time/resource)
  LCE->>LCE: For each record/partition: compute eligibleAt = createdAt + window(rule)
  LCE->>IDX: Upsert {recordId/partitionKey, eligibleAt, decisionBasis{policyId,ruleId,revision}}
  IDX-->>LCE: Ack (batched)
  LCE->>BUS: Publish Retention.EligibleComputed {tenant, policyId, revision, stats}
```
Alternative Paths¶
- On-Change Re-eval: a `Policy.Changed` event triggers incremental re-evaluation for affected scopes only.
- Partition-Level Evaluation: compute once per partition boundary and apply to contained records (for WORM append stores).
- Dry Run: evaluation writes to a shadow index and returns a delta report (no marking).
Error Paths¶
```mermaid
sequenceDiagram
  participant GW as API Gateway
  participant POL as Policy Service
  participant LCE as Lifecycle Evaluator
  GW->>POL: POST /policy/v1/retention:evaluate {tenant, revision:999}
  alt Unknown tenant/policy
    POL-->>GW: 404 Not Found (Problem+JSON)
  else Revision conflict (client expects different rev)
    POL-->>GW: 409 Conflict (Problem+JSON)
  else Bad request (invalid window/spec)
    POL-->>GW: 400 Bad Request (Problem+JSON)
  end
```
Request/Response Specifications¶
Input Requirements¶
| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| Method/Path | `POST /policy/v1/retention:evaluate` | Y | Manual/ad-hoc evaluation trigger | JSON body |
| `Authorization` | header | Y | `Bearer <JWT>` | Valid, not expired |
| `x-tenant-id` | header | Y | Tenant scope | `^[A-Za-z0-9._-]{1,128}$` |
| `traceparent` | header | O | W3C trace context | 55-char |
| `policyId` | string | O | Explicit policy to apply | Must belong to tenant |
| `revision` | int | O | Expected policy revision (`If-Match` equivalent) | Mismatch with current causes 409 |
| `scope` | object | O | Limit evaluation to a subset (time/resources) | Server-validated |
| `mode` | enum | O | `normal` (default) \| `dry-run` | — |
Output Specifications¶
202 Accepted

| Field | Type | Description |
|---|---|---|
| `evaluationId` | string | Operation identifier |
| `status` | enum | `Queued` \| `Running` |
| `policy` | object | `{policyId, revision}` |
| `scopeApplied` | object | Effective evaluated scope |
200 OK (dry-run report)

| Field | Type | Description |
|---|---|---|
| `estimatedCandidates` | int | Count that would be marked |
| `sample` | array | Example `{recordId, computedEligibleAt, ruleId}` |
| `diff` | object | Prior vs. new policy impact |
Retention Index (concept row)
```json
{
  "tenantId": "acme",
  "recordId": "01JECZ6Y8K1V...",
  "eligibleAt": "2026-01-21T10:12:00Z",
  "decisionBasis": { "policyId":"ret_001", "ruleId":"r_login_365d", "revision":5 },
  "decidedAt": "2025-10-22T12:00:00Z"
}
```
Event `Retention.EligibleComputed` (summary)

```json
{
  "tenant": "acme",
  "policyId": "ret_001",
  "revision": 5,
  "window": {"from":"2025-10-21T00:00:00Z","to":"2025-10-22T00:00:00Z"},
  "stats": {"marked": 124553, "skippedHeld": 112, "errors": 0}
}
```
Error Handling¶
Error Scenarios¶
| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Invalid policy spec/windows; negative/zero retention; malformed scope | Fix policy/scope | No retry until corrected |
| 401 | Missing/invalid JWT | Acquire valid token | Retry after renewal |
| 403 | Missing `policy:retention.evaluate` permission | Request proper scope/role | — |
| 404 | Tenant/policy not found | Verify tenant/policy id | — |
| 409 | Policy revision conflict; evaluation for same scope already running | Re-fetch policy; wait or cancel prior run | Retry after fix |
| 412 | `If-Match` (revision) mismatch | Fetch latest policy; retry with current rev | Conditional retry |
| 422 | Policy invalid for tenant residency/edition | Adjust policy to constraints | — |
| 429 | Evaluator rate limited | Honor `Retry-After` | Backoff + jitter |
| 503 | Stores/Index/Event bus unavailable | Wait for recovery | Idempotent retry of evaluation step |
Failure Modes¶
- Legal Hold present: candidate skipped; the index notes `skippedHeld` and the basis includes `holdId`.
- Window change shrinks retention: re-evaluation advances `eligibleAt` forward only; it never moves earlier than a prior decision without an explicit re-baseline admin action.
- Clock skew: `eligibleAt` is never set before `now - skew`.
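The forward-only and clock-skew guards above combine into a single clamp over the computed value. A sketch with a hypothetical helper name:

```python
from datetime import datetime, timedelta

def next_eligible_at(computed, prior, now, skew):
    """Clamp a freshly computed eligibleAt so it (a) never falls
    before now - skew and (b) never moves earlier than a prior
    decision. `prior` may be None for a first-time evaluation."""
    candidate = max(computed, now - skew)
    if prior is not None:
        candidate = max(candidate, prior)
    return candidate
```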
Recovery Procedures¶
- On 409 or 412, fetch the current `{policyId, revision}` and re-issue with the updated precondition.
- On 503/429, back off; evaluation jobs are idempotent by `(tenant, policyId, revision, scopeKey)`.
- Use dry-run to assess impact before applying a new revision.
Performance Characteristics¶
Latency Expectations¶
- Partition-sized evaluation p95 ≤ 3 min; small scope ad-hoc p95 ≤ 30 s.
Throughput Limits¶
- Evaluator concurrency limited per tenant to protect read/metadata stores (e.g., ≤ 2 concurrent scopes).
Resource Requirements¶
- Bounded memory for rule evaluation batches; write-optimized Retention Index with bulk upserts.
Scaling Considerations¶
- Batch by partition and time windows; use checkpointing to resume mid-run.
- Prefer set-based updates (partition-level) when rules are uniform (e.g., 365d global).
- Emit periodic progress to avoid long silent runs.
Security & Compliance¶
Authentication¶
- OIDC JWT at Gateway; service-to-service credentials for Evaluator ↔ Index.
Authorization¶
- Enforce `policy:retention.read` and `policy:retention.evaluate`; verify `x-tenant-id`.
Data Protection¶
- Decision basis recorded without copying sensitive payload; only IDs/timestamps stored.
Compliance¶
- Forward-only versions: `revision` monotonically increases; decisions log the basis `{policyId, ruleId, revision, computedAt}` for auditability.
- Residency honored by running evaluation in-region and by scoping reads.
Monitoring & Observability¶
Key Metrics¶
| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| `ret_eval_jobs_active` | gauge | Running evaluations | > tenant/global cap |
| `ret_candidates_marked_total` | counter | Records marked eligible | Trend |
| `ret_eval_duration_seconds` | histogram | Runtime per evaluation | p95 > SLO |
| `ret_eval_skipped_held_total` | counter | Skipped due to Legal Hold | Spike watch |
| `ret_eval_conflicts_total` | counter | 409/412 occurrences | Investigate policy churn |
Logging Requirements¶
- Structured logs: `tenant`, `policyId`, `revision`, `ruleId`, `scopeKey`, `marked`, `skippedHeld`, `durationMs`, `errors`. No PII.
Distributed Tracing¶
- Spans: `policy.fetch`, `eval.scan`, `eval.batch`, `index.upsert`, `event.publish`.
- Attributes: `revision`, `batchSize`, `marked`, `skipped`.
Health Checks¶
- Readiness: Index writable; Policy Service reachable; Event Bus available.
- Liveness: job queue drains; checkpoints advance.
Operational Procedures¶
Deployment¶
- Deploy Lifecycle Evaluator workers; register `/policy/v1/retention:evaluate`.
- Seed policies; verify revisioning and on-change triggers.
- Run a dry-run evaluation in staging; verify index shape and events.
Configuration¶
- Env: `RET_EVAL_BATCH_SIZE`, `RET_EVAL_MAX_CONCURRENCY`, `RET_EVAL_CHECKPOINT_TTL`, `CLOCK_SKEW_SEC`.
- Policy: enforce forward-only revisions; require change-justification metadata.
Maintenance¶
- Compact Retention Index (drop superseded decisions); rotate event topics per retention.
- Re-baseline procedures for exceptional policy rollbacks (administrative only).
Troubleshooting¶
- High conflicts: educate admins to supply the `If-Match` `revision` when triggering evaluations.
- Slow runs: increase batch size carefully; verify index write IOPS; reduce scan scope.
- Skewed results: check time normalization and partition catalog.
Testing Scenarios¶
Happy Path Tests¶
- Evaluate a 24h scope → candidates marked with correct `eligibleAt` and `decisionBasis`.
- Policy change (revision++) triggers incremental re-eval for affected scopes only.
Error Path Tests¶
- 400 for invalid windows/rules; 404 for unknown policy; 409/412 for revision issues.
- 422 when policy violates residency/edition.
- 429/503 lead to compliant backoff and eventual success.
Performance Tests¶
- Partition evaluation p95 ≤ 3 min; throughput meets index SLOs.
- Checkpoint resume after induced worker restart.
Security Tests¶
- RBAC scopes enforced; cross-tenant isolation verified.
- Logs contain decision basis without payload leakage.
Related Documentation¶
Internal References¶
Related Flows¶
- Legal Hold Processing Flow
- Data Lifecycle (Purge Execution) Flow
- Policy Change Propagation Flow
External References¶
- RFC 7807 (Problem Details)
- W3C Trace Context
Appendices¶
A. Example Problem+JSON (revision conflict)¶
```json
{
  "type": "urn:connectsoft:errors/policy/revision.conflict",
  "title": "Policy revision conflict",
  "status": 409,
  "detail": "Requested evaluation with revision 5, current is 6.",
  "traceId": "9f0c1d2e3a4b5c6d...",
  "errors": [{"pointer": "/revision", "reason": "stale"}]
}
```
B. Decision Basis (concept)¶
```json
{
  "policyId": "ret_001",
  "ruleId": "r_login_365d",
  "revision": 6,
  "formula": "eligibleAt = createdAt + P365D",
  "inputs": {"createdAt":"2025-10-21T11:00:00Z"},
  "output": {"eligibleAt":"2026-10-21T11:00:00Z"}
}
```
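The decision basis above can be checked arithmetically: adding `P365D` to the input `createdAt` reproduces the recorded `output.eligibleAt` (the interval spans no leap day). A worked instance in Python; the hand reduction of the ISO-8601 duration `P365D` to a fixed 365 days is an assumption that matches this example:

```python
from datetime import datetime, timedelta, timezone

# eligibleAt = createdAt + P365D, per the decision-basis formula above.
created_at = datetime(2025, 10, 21, 11, 0, 0, tzinfo=timezone.utc)
eligible_at = created_at + timedelta(days=365)

# Matches the recorded output.eligibleAt of 2026-10-21T11:00:00Z.
assert eligible_at == datetime(2026, 10, 21, 11, 0, 0, tzinfo=timezone.utc)
```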
Legal Hold Processing Flow¶
Applies, updates, and releases Legal Holds against tenant data. Resolves scope unambiguously, materializes a holdSnapshot (with forward-only revision), matches target records/partitions, marks them OnHold (purge-block), and emits lifecycle events. Releasing a hold clears blockers and triggers dependent re-evaluations.
Overview¶
Purpose: Provide a defensible mechanism to place and release Legal Holds so that covered records are preserved and exports can reference verifiable hold decisions.
Scope: Create/apply/update/release flows, scope resolution and match indexing, purge-block signaling, event emission, and concurrency controls. Excludes exporting data under hold (see Legal Hold Export Flow).
Context: The LegalHold Service is authoritative for hold definitions and state. It interacts with Read/Projection Stores to match data, the Lifecycle/Purge subsystem to block deletion, and Policy/Retention to re-evaluate eligibility.
Key Participants:
- Legal Team / Client
- API Gateway
- LegalHold Service (registry, matcher, state machine)
- Read/Projection Store (query targets by scope)
- Hold Index / Purge Guard (flags `OnHold`)
- Event Bus (`LegalHold.Applied` | `Updated` | `Released`)
Prerequisites¶
System Requirements¶
- API Gateway with TLS and JWT validation
- LegalHold Service deployed with access to Read/Projection Store and Hold Index
- Event Bus topics configured (`LegalHold.*`)
- Clock/time normalization to UTC; deterministic scope resolvers
Business Requirements¶
- Tenant enabled for Legal Hold; roles and approvals defined
- Case management identifiers available (`caseId`)
- Residency constraints and retention policies configured
Performance Requirements¶
- p95 apply time for typical scopes ≤ 60 s (to first confirmation)
- Hold matching throughput sized to tenant volume (seek pagination)
- Low-latency purge-block propagation (seconds, not minutes)
Sequence Flow¶
Happy Path¶
```mermaid
sequenceDiagram
  autonumber
  actor Legal as Legal Team
  participant GW as API Gateway
  participant LHS as LegalHold Service
  participant RD as Read/Projection Store
  participant HIX as Hold Index / Purge Guard
  participant BUS as Event Bus
  Legal->>GW: POST /legal-hold/v1/holds {tenant, scope, caseId, reason, expiresAt?}
  GW->>LHS: Create+Apply request (authN/Z, x-tenant-id, traceparent)
  LHS->>LHS: Validate scope → normalize ResourceRef/time boundaries
  LHS->>RD: Enumerate targets via cursor (tenant, scope)
  loop Batched match
    RD-->>LHS: Batch of record/partition keys
    LHS->>HIX: Mark OnHold {keys..., holdId, revision}
  end
  LHS->>LHS: Persist holdSnapshot {id, revision, scope, decidedAt}
  LHS->>BUS: Publish LegalHold.Applied {holdId, tenant, revision, scope}
  LHS-->>GW: 201 Created {holdId, status:"Active", snapshot}
```
Alternative Paths¶
- Preview: `mode=preview` returns counts and sample keys without applying.
- Incremental expand: `PATCH /holds/{id}` with additional scope → `revision++`, match only delta.
- Auto-expiry: `expiresAt` schedules automatic Release at timestamp.
- Partition-level hold: mark append partitions instead of individual records for large scopes.
Error Paths¶
sequenceDiagram
actor Legal
participant GW as API Gateway
participant LHS as LegalHold Service
Legal->>GW: POST /legal-hold/v1/holds {invalid scope}
alt Bad request
GW-->>Legal: 400 Bad Request (Problem+JSON)
else Hold not found (read/update/release)
GW-->>Legal: 404 Not Found (Problem+JSON)
else Conflict (apply on already Active, release on Released)
GW-->>Legal: 409 Conflict (Problem+JSON)
else Precondition failed (If-Match revision mismatch)
GW-->>Legal: 412 Precondition Failed (Problem+JSON)
else Rate limited / dependencies down
GW-->>Legal: 429/503 (Retry-After)
end
Request/Response Specifications¶
Input Requirements¶
| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| Create/Apply | `POST /legal-hold/v1/holds` | Y | Create + apply a hold | JSON body |
| `Authorization` | header | Y | `Bearer <JWT>` | Valid, not expired |
| `x-tenant-id` | header | Y | Tenant scope | Must match `body.tenant` |
| `traceparent` | header | O | W3C trace context | 55-char |
| `tenant` | string | Y | Target tenant | `^[A-Za-z0-9._-]{1,128}$` |
| `caseId` | string | Y | Legal case identifier | ≤ 128 chars |
| `reason` | string | Y | Business/legal justification | ≤ 512 chars |
| `scope` | object | Y | Resource/time predicates | Normalized server-side |
| `expiresAt` | timestamp | O | Auto-release time (UTC) | Must be in future |
| `mode` | enum | O | `apply` (default) \| `preview` | — |
Update (expand/restrict)

| Field | Type | Req | Description |
|---|---|---|---|
| `PATCH /legal-hold/v1/holds/{holdId}` | path | Y | Modify scope (forward-only*); requires `If-Match: <rev>` |
| Body: `{scopeDelta}` | json | Y | Additive change preferred; shrink requires admin override |

*Forward-only scope changes recommended; shrinking scope is exceptional and audited.
Release

| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| `POST /legal-hold/v1/holds/{holdId}:release` | path | Y | Release hold | — |
| `If-Match` | header | O | Expected revision | Prevents races |
Output Specifications¶
Create — 201 Created

| Field | Type | Description |
|---|---|---|
| `holdId` | string | Hold identifier (ULID/GUID) |
| `status` | enum | `Active` |
| `revision` | int | Current revision |
| `snapshot` | object | `{scope, decidedAt, caseId, reason, expiresAt?}` |
| `stats` | object | `{matched, partitions, partial?:bool}` |
Release — 200 OK

| Field | Type | Description |
|---|---|---|
| `holdId` | string | Id |
| `status` | enum | `Released` |
| `releasedAt` | timestamp | ISO-8601 UTC |
| `revision` | int | Final revision |
Example Payloads¶
// Create & apply
{
"tenant": "acme",
"caseId": "CASE-2025-099",
"reason": "Regulatory investigation",
"scope": {
"time": {"from": "2025-09-01T00:00:00Z"},
"resourceTypes": ["Iam.User","Case.File"],
"actions": ["Create","Update"]
},
"expiresAt": "2026-03-01T00:00:00Z"
}
Error Handling¶
Error Scenarios¶
| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Malformed scope; invalid `expiresAt`; missing `caseId`/`reason` | Correct request | No retry until fixed |
| 401 | Missing/invalid JWT | Acquire valid token | Retry after renewal |
| 403 | Missing `audit:legalhold.apply\|update\|release` | Request proper role/scope | — |
| 404 | Unknown `holdId` or tenant route not found | Verify ids/tenant | — |
| 409 | Apply on already `Active`; Release on `Released`; concurrent modify | Align state (PATCH or fetch latest) | Retry after fix |
| 412 | `If-Match` revision mismatch | Fetch latest snapshot → retry | Conditional retry |
| 422 | Scope cannot be resolved unambiguously | Adjust scope; use preview | — |
| 429 | Per-tenant/global rate limit | Honor `Retry-After` | Backoff + jitter |
| 503 | Read/Index/Event bus unavailable | Wait for recovery | Idempotent retry (server de-dupes) |
Failure Modes¶
- Partial match (timeouts/limits): `partial=true` in stats; matcher continues asynchronously until complete.
- Residency boundary: cross-region scope split into regional sub-holds to remain compliant.
- Clock skew: time predicates normalized to UTC; inclusive start, exclusive end by convention.
Recovery Procedures¶
- On 412/409, retrieve latest `{holdId, revision, status}` and re-issue with correct preconditions.
- For partial matches, monitor progress events or query stats until `partial=false`.
- If 503/429, back off; the apply operation is idempotent by `(tenant, caseId, normalizedScopeHash)`.
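The idempotency key used above can be illustrated as a canonical hash of the normalized scope. A minimal sketch using only the standard library — the exact normalization rules here (canonical JSON with sorted keys) are an assumption for illustration, not the service's actual algorithm:

```python
import hashlib
import json

def normalized_scope_hash(tenant: str, case_id: str, scope: dict) -> str:
    """Derive a deterministic idempotency key for (tenant, caseId, scope).

    Canonical JSON (sorted keys, no whitespace) makes semantically equal
    scopes hash identically regardless of field order in the request.
    """
    canonical = json.dumps(scope, sort_keys=True, separators=(",", ":"))
    material = f"{tenant}|{case_id}|{canonical}".encode("utf-8")
    return hashlib.sha256(material).hexdigest()

# Two requests with the same scope in different field order de-duplicate.
key1 = normalized_scope_hash("acme", "CASE-2025-099",
                             {"time": {"from": "2025-09-01T00:00:00Z"},
                              "resourceTypes": ["Iam.User", "Case.File"]})
key2 = normalized_scope_hash("acme", "CASE-2025-099",
                             {"resourceTypes": ["Iam.User", "Case.File"],
                              "time": {"from": "2025-09-01T00:00:00Z"}})
assert key1 == key2  # field order does not change the key
```

A server-side de-dupe store keyed by this hash lets 503/429 retries replay safely.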
Performance Characteristics¶
Latency Expectations¶
- Apply confirmation p95 ≤ 60 s for typical scopes; full match completion may continue async.
Throughput Limits¶
- Matcher QPS bounded by read replica capacity; batch size tuned per tenant.
Resource Requirements¶
- CPU for scope normalization; memory for batching keys; I/O for index updates.
Scaling Considerations¶
- Use seek pagination and partition-aware queries.
- Mark partitions `OnHold` when feasible for large contiguous ranges.
- Backpressure from Hold Index updates reduces batch size automatically.
Security & Compliance¶
Authentication¶
- OIDC JWT at Gateway; optional mTLS service-to-service.
Authorization¶
- Require `audit:legalhold.apply`, `audit:legalhold.update`, `audit:legalhold.release`.
- Enforce RLS by `tenantId`; verify `x-tenant-id`.
Data Protection¶
- Store minimal decision basis (ids/timestamps); do not copy payloads.
- All hold state transitions are audited with actor and purpose-of-use.
Compliance¶
- Holds block purge immediately via Purge Guard; Retention Evaluator records `skippedHeld`.
- `holdSnapshot` (id, revision, scope, decidedAt) provides chain-of-custody.
Monitoring & Observability¶
Key Metrics¶
| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
holds_active |
gauge | Active holds per tenant | Sudden spikes |
hold_applied_total |
counter | Holds applied | — |
hold_released_total |
counter | Holds released | — |
hold_match_duration_seconds |
histogram | Matching latency | p95 > SLO |
purge_block_signals_total |
counter | Purge-block updates sent | Drop indicates risk |
Logging Requirements¶
- Structured logs: `tenant`, `holdId`, `revision`, `caseId`, `scopeHash`, `matched`, `partial`, `actor`, `reason`. No PII.
Distributed Tracing¶
- Spans: `hold.apply`, `match.scan`, `index.mark`, `hold.release`, `event.publish`.
- Attributes: `scopeHash`, `batchSize`, `matched`, `partial`.
Health Checks¶
- Readiness: Read/Projection and Hold Index reachable; Event Bus available.
- Liveness: matcher queue drains; no stuck `Applying` holds beyond timeout.
Operational Procedures¶
Deployment¶
- Deploy LegalHold Service and register `/legal-hold/v1/*` routes.
- Initialize Hold Index and Purge Guard hooks.
- Verify preview/apply/release in staging with synthetic scopes.
Configuration¶
- Env: `HOLD_MATCH_BATCH`, `HOLD_APPLY_TIMEOUT`, `HOLD_MAX_SCOPE_SIZE`, `RESIDENCY_MODE`.
- Policy: require `reason` and `caseId`; optional `expiresAt` auto-release.
Maintenance¶
- Compact Hold Index (drop released markers no longer needed).
- Rotate webhook/signing keys if callbacks to external systems are used.
Troubleshooting¶
- High partial rates → increase batch size cautiously; check read replica health.
- Frequent 409/412 → educate clients to use `If-Match` and fetch-latest patterns.
- Purge still running on held data → verify Purge Guard subscription and index state.
Testing Scenarios¶
Happy Path Tests¶
- Apply hold with resource/time scope → `holds_active` increments; purge-block engaged.
- Update scope (additive) → `revision++`, only delta matched; events emitted.
- Release hold → blockers cleared; `LegalHold.Released` published.
Error Path Tests¶
- 400 for malformed scope; 404 for unknown `holdId`; 409 for invalid state transitions; 412 for revision mismatch.
- 429/503 cause compliant backoff; operation remains idempotent.
Performance Tests¶
- Matching completes within SLO for typical tenants; no read replica saturation.
- Purge-block propagation latency within seconds.
Security Tests¶
- RBAC enforced; cross-tenant access blocked.
- Audit log contains actor, purpose, scope hash; no PII leakage.
Related Documentation¶
Internal References¶
- Data Model — Legal Hold Model
- Retention Policy Evaluation Flow
- Standard Export Flow / Legal Hold Export Flow
External References¶
- RFC 7807 (Problem Details)
- W3C Trace Context
Appendices¶
A. Example Problem+JSON (invalid scope)¶
{
"type": "urn:connectsoft:errors/legalhold/scope.invalid",
"title": "Invalid legal hold scope",
"status": 400,
"detail": "Scope must include at least one of resourceTypes or actors, and a bounded time window.",
"traceId": "9f0c1d2e3a4b5c6d...",
"errors": [
{"pointer": "/scope/time", "reason": "missing-or-unbounded"}
]
}
B. Hold Snapshot (concept)¶
{
"id": "lh_2025_001",
"tenant": "acme",
"revision": 3,
"status": "Active",
"caseId": "CASE-2025-099",
"reason": "Regulatory investigation",
"scope": { "resourceTypes": ["Iam.User"], "time": {"from":"2025-09-01T00:00:00Z"} },
"decidedAt": "2025-10-22T11:45:10Z",
"expiresAt": "2026-03-01T00:00:00Z"
}
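The forward-only `revision` in the snapshot above is what backs the `If-Match` precondition on update/release. A minimal sketch of that optimistic-concurrency check (hypothetical helper, not the service's code; RBAC, residency, and audit steps omitted):

```python
class PreconditionFailed(Exception):
    """Maps to HTTP 412 Precondition Failed."""

def patch_hold(hold: dict, scope_delta: dict, if_match: int) -> dict:
    """Apply a forward-only scope change guarded by an expected revision."""
    if hold["revision"] != if_match:
        raise PreconditionFailed(
            f"expected revision {if_match}, current is {hold['revision']}")
    hold["scope"].update(scope_delta)   # additive change preferred
    hold["revision"] += 1               # forward-only revision
    return hold

hold = {"id": "lh_2025_001", "revision": 3,
        "scope": {"resourceTypes": ["Iam.User"]}}
patch_hold(hold, {"actions": ["Create"]}, if_match=3)
assert hold["revision"] == 4
try:
    patch_hold(hold, {"actions": ["Update"]}, if_match=3)  # stale revision
except PreconditionFailed:
    pass  # client fetches the latest snapshot, then retries with its revision
```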
Data Redaction Flow (Read)¶
Applies policy-driven masking to query results at read time. The Query Service consults the Redaction Service to enforce a requested profile (Safe, Support, Investigator, Raw), optionally validates a Just-In-Time (JIT) unmask approval, and returns transformed results. All unmask attempts and approvals are audited.
Overview¶
Purpose: Ensure returned data complies with privacy policy via profile-based masking, with tightly controlled JIT unmask for break-glass scenarios.
Scope: Profile selection, purpose-of-use capture, redaction rules execution, JIT approval verification, response annotation, and auditing. Excludes write-time classification (see Validation & Classification Flow).
Context: Sits on the Query path between Read Models/Search and clients. Uses Data Classification from the model and Redaction Rules (mask/hash/tokenize/drop).
Key Participants:
- Client (consumer of audit data)
- API Gateway
- Query Service (fetch, orchestrate)
- Redaction Service (policy engine, transform)
- Approval Service (JIT unmask token issuance/validation)
- Audit/Event Bus (log read/unmask decisions)
Prerequisites¶
System Requirements¶
- Gateway with TLS + JWT validation
- Query Service can call Redaction & Approval Services
- Read Models/Search indices annotated with `DataClass` metadata
- Clock sync for JIT token TTL enforcement
Business Requirements¶
- Redaction profiles & policy configured per tenant
- Purpose-of-use taxonomy and RBAC scopes defined
- Approver roster & workflow for JIT unmask (with SLA)
Performance Requirements¶
- p95 redaction overhead ≤ 15 ms per page (server-side)
- JIT token verification p95 ≤ 50 ms
- Budget for page sizes (e.g., ≤ 200 records) to maintain SLOs
Sequence Flow¶
Happy Path¶
sequenceDiagram
autonumber
actor C as Client
participant GW as API Gateway
participant Q as Query Service
participant R as Redaction Service
participant A as Approval Service
participant AUD as Audit/Event Bus
C->>GW: GET /query/v1/events?filters…<br/>Headers: x-redaction-profile=Support, x-purpose-of-use=SupportOps
GW->>Q: Forward request (authN/Z, tenant)
Q->>Q: Fetch page from Read Model / Search
Q->>R: ApplyProfile(records, profile=Support, tenant, purpose)
R-->>Q: Redacted(records, redactionMeta)
Q->>AUD: Publish Read.Audited {tenant, profile, purpose, actor, resultCount}
Q-->>GW: 200 OK (masked results + X-Redaction-Profile + X-Watermark)
GW-->>C: 200 OK
Alternative Paths¶
- Investigator profile: broader reveal than Support but still masked for `HighlySensitive`; requires higher RBAC.
- Raw profile with JIT: client supplies `x-jit-approval-token`; Approval Service validates token → Redaction Service bypasses selected fields (field-scoped unmask).
- Field-scoped override: request includes `fields=…` to minimize exposure; redaction runs only on returned fields.
Error Paths¶
sequenceDiagram
actor C as Client
participant GW as API Gateway
participant Q as Query Service
participant A as Approval Service
C->>GW: GET … x-redaction-profile=Raw, x-jit-approval-token=abc
GW->>Q: Forward
Q->>A: ValidateToken(abc)
alt Token invalid/expired/not-for-tenant
A-->>Q: 403 Forbidden (reason)
Q-->>GW: 403 Problem+JSON
GW-->>C: 403 Forbidden
else Bad profile or params
Q-->>GW: 400 Bad Request (Problem+JSON)
GW-->>C: 400
else Record id requested but not found
Q-->>GW: 404 Not Found (Problem+JSON)
GW-->>C: 404
else Conflict (token already consumed / different subject)
Q-->>GW: 409 Conflict (Problem+JSON)
GW-->>C: 409
end
Request/Response Specifications¶
Input Requirements¶
| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| Method/Path | `GET /query/v1/events` | Y | Search/scroll timeline | Query params allowlisted |
| `Authorization` | header | Y | `Bearer <JWT>` | Valid, not expired |
| `x-tenant-id` | header | Y | Tenant scope | Matches JWT/route |
| `x-redaction-profile` | header | O | `Safe` (default) \| `Support` \| `Investigator` \| `Raw` | — |
| `x-purpose-of-use` | header | Y | Business purpose taxonomy | Non-empty, allowlist |
| `x-jit-approval-token` | header | O | Break-glass token for unmask | JIT policy validates |
| `traceparent` | header | O | W3C trace context | 55-char |
| `fields` | query | O | Comma list of fields to return | Minimization applied |
| `page.after` | query | O | Seek cursor | Opaque; server-issued |
| `limit` | query | O | Page size | 1–200, default 100 |
Output Specifications¶
200 OK

| Field | Type | Description | Notes |
|---|---|---|---|
| `items[]` | array | Records with masking applied | See examples |
| `redactionMeta` | object | `{profile, rulesApplied[], jit:{used, reason?}}` | Optional when `Safe` |
| `watermark` | string | Projection snapshot time | Also in header |
Headers

- `X-Redaction-Profile`: effective profile
- `X-Purpose-Of-Use`: echoed purpose
- `X-Watermark`: ISO-8601 UTC projection watermark
Example Payloads¶
// Request (Support profile)
GET /query/v1/events?resourceType=Payment&from=2025-10-01T00:00:00Z
x-redaction-profile: Support
x-purpose-of-use: SupportOps
// Response (masked)
{
"items": [
{
"id": "01JF…",
"actor": {"id":"u_123","displayName":"A**** T****"},
"resource": {"type":"Payment","id":"pay_789"},
"action": "Create",
"createdAt": "2025-10-22T11:01:22Z",
"deltas": {
"after": {
"cardLast4": "****",
"cardBin": "******",
"email": "a***@e***.com",
"amount": 1299
}
}
}
],
"redactionMeta": {
"profile": "Support",
"rulesApplied": [
{"field":"deltas.after.cardLast4","rule":"mask-last4"},
{"field":"deltas.after.cardBin","rule":"drop"},
{"field":"deltas.after.email","rule":"mask-email"}
]
},
"watermark": "2025-10-22T11:05:00Z"
}
// Raw with JIT token (field-scoped unmask)
GET /query/v1/events/{id}
x-redaction-profile: Raw
x-jit-approval-token: jt_01ABC…
x-purpose-of-use: IncidentResponse
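The rule set shown in the masked response above (`mask-last4`, `drop`, `mask-email`) can be sketched as a small transform over dotted field paths. Rule names mirror the example, but the real Redaction Service's engine and rule vocabulary may differ:

```python
import copy

def mask_email(value: str) -> str:
    """'a***@e***.com'-style masking: keep first char of local part and domain."""
    local, _, domain = value.partition("@")
    return f"{local[:1]}***@{domain[:1]}***.{domain.rsplit('.', 1)[-1]}"

RULES = {
    "mask-last4": lambda v: "****",
    "drop": None,            # None means: remove the field entirely
    "mask-email": mask_email,
}

def apply_rules(record: dict, rules_applied: list) -> dict:
    """Apply each {field, rule} entry to a deep copy of the record."""
    out = copy.deepcopy(record)
    for entry in rules_applied:
        *parents, leaf = entry["field"].split(".")
        node = out
        for part in parents:
            node = node[part]
        transform = RULES[entry["rule"]]
        if transform is None:
            node.pop(leaf, None)
        else:
            node[leaf] = transform(node[leaf])
    return out

record = {"deltas": {"after": {"cardLast4": "4242", "cardBin": "424242",
                               "email": "alice@example.com", "amount": 1299}}}
masked = apply_rules(record, [
    {"field": "deltas.after.cardLast4", "rule": "mask-last4"},
    {"field": "deltas.after.cardBin", "rule": "drop"},
    {"field": "deltas.after.email", "rule": "mask-email"},
])
assert masked["deltas"]["after"]["cardLast4"] == "****"
assert "cardBin" not in masked["deltas"]["after"]
assert masked["deltas"]["after"]["email"] == "a***@e***.com"
assert masked["deltas"]["after"]["amount"] == 1299  # non-sensitive field untouched
```

Working on a copy keeps the fetched record immutable, so the same page can be re-served under a different profile without a re-fetch.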
Error Handling¶
Error Scenarios¶
| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Unsupported profile; invalid `limit`/`fields`; bad time filters | Correct request | No retry until fixed |
| 401 | Missing/invalid JWT | Acquire valid token | Retry after renewal |
| 403 | Insufficient RBAC for profile; JIT token invalid/expired; tenant mismatch | Request proper scope or new JIT approval | — |
| 404 | Requested record id not found | Verify id/tenant | — |
| 409 | JIT token subject mismatch or already consumed | Obtain a fresh token | — |
| 422 | Purpose-of-use missing/invalid; policy disallows Raw for tenant | Fix usage/policy | — |
| 429 | Rate limited for sensitive profiles | Honor `Retry-After` | Backoff + jitter |
| 503 | Redaction/Approval service unavailable | Wait for recovery | Idempotent retry (re-run query) |
Failure Modes¶
- Partial redaction (missing DataClass metadata): default to most restrictive (mask/drop) and include warning in `redactionMeta`.
- Policy change mid-request: response includes `X-Policy-Revision-Used`; clients re-issue if needed.
Recovery Procedures¶
- For 403/409, request/refresh JIT approval; ensure subject/resource matches token scope.
- On 503/429, back off; queries are safe to retry with same cursor.
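The backoff-with-jitter retry recommended here (and in the error tables) can be sketched as follows. The `fetch_page` callable is a hypothetical client hook; `Retry-After` handling is omitted for brevity:

```python
import random
import time

def retry_with_backoff(fetch_page, cursor, max_attempts=5,
                       base_delay=0.5, cap=30.0, sleep=time.sleep):
    """Retry a page fetch on 429/503 with full-jitter exponential backoff.

    The same opaque cursor is re-sent on every attempt, so a retried query
    never skips or duplicates results.
    """
    for attempt in range(max_attempts):
        status, body = fetch_page(cursor)
        if status not in (429, 503):
            return status, body
        delay = random.uniform(0, min(cap, base_delay * 2 ** attempt))
        sleep(delay)
    return status, body  # caller surfaces the final error

# Simulate a dependency that recovers on the third attempt.
calls = []
def flaky_fetch(cursor):
    calls.append(cursor)
    return (503, None) if len(calls) < 3 else (200, {"items": [], "cursor": cursor})

status, body = retry_with_backoff(flaky_fetch, "cur_abc", sleep=lambda _: None)
assert status == 200
assert calls == ["cur_abc"] * 3  # identical cursor on every attempt
```

Full jitter (uniform over `[0, cap]`) spreads retry storms better than fixed exponential delays when many clients back off at once.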
Performance Characteristics¶
Latency Expectations¶
- Redaction transform p95 ≤ 15 ms/page; JIT verification p95 ≤ 50 ms.
Throughput Limits¶
- Sensitive profiles (`Investigator`, `Raw`) may be throttled per tenant (token bucket).
Resource Requirements¶
- CPU-bound transforms; memory proportional to page size; minimal I/O overhead.
Scaling Considerations¶
- Cache compiled redaction plans per `{profile, schemaVersion}`.
- Prefer field projection (`fields=…`) to reduce work and exposure.
- Co-locate Redaction Service with Query Service to minimize RPC latency.
Security & Compliance¶
Authentication¶
- OIDC JWT at Gateway; service credentials between Query ↔ Redaction/Approval.
Authorization¶
- RBAC scopes per profile (e.g., `audit:read.support`, `audit:read.investigator`, `audit:read.raw`).
- Enforce tenant RLS; verify `x-tenant-id`.
Data Protection¶
- No raw PII in logs; only masked samples and rule stats.
- JIT tokens are short-lived, single-use, audience- and subject-scoped; signed & time-bounded.
Compliance¶
- All unmask uses are audited with actor, purpose, scope, token id, and fields revealed.
- Profiles & rule sets derived from tenant policy; revision id echoed in responses.
Monitoring & Observability¶
Key Metrics¶
| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| `redaction_requests_total` | counter | Redaction calls by profile | Sudden spikes |
| `redaction_latency_ms` | histogram | Transform latency | p95 > 15 ms |
| `jit_token_validations_total` | counter | Approval checks | Track failures |
| `jit_validation_latency_ms` | histogram | JIT check latency | p95 > 50 ms |
| `unmask_events_total` | counter | Successful JIT unmask | Unusual growth |
Logging Requirements¶
- Structured logs: `tenant`, `profile`, `purpose`, `actorId`, `resultCount`, `policyRevision`, `jit.used`, `fieldsRevealed[]` (names only). No values.
Distributed Tracing¶
- Spans: `query.fetch`, `redaction.apply`, `approval.validate`.
- Attributes: `profile`, `purpose`, `maskedFieldsCount`, `jitUsed`.
Health Checks¶
- Readiness: Redaction & Approval endpoints reachable; policy cache warm.
- Liveness: transform queue drains; token cache not stale.
Operational Procedures¶
Deployment¶
- Deploy Redaction & Approval Services; enable headers in Gateway.
- Prime policy/profile caches; validate with synthetic records.
Configuration¶
- Env: `REDACTION_DEFAULT_PROFILE`, `JIT_TTL_SEC`, `JIT_AUDIENCE`, `PROFILE_RBAC_MAP`, `SENSITIVE_RATE_LIMITS`.
- Policy: map `DataClass` → rule (mask/hash/tokenize/drop) per profile.
Maintenance¶
- Rotate signing keys for JIT tokens; tune rate limits by tenant.
- Review unmask audit reports periodically with compliance.
Troubleshooting¶
- Latency regressions → inspect rule plan caching, page size, co-location.
- Frequent 403/409 → check token issuance workflow and subject scoping.
- Unexpected reveals → verify policy revision and RBAC mapping.
Testing Scenarios¶
Happy Path Tests¶
- `Safe` returns masked payload per policy with correct `redactionMeta`.
- `Support` reveals operational fields but masks `HighlySensitive`.
- `Raw` with valid JIT token reveals requested fields only; audit event emitted.
Error Path Tests¶
- 400 for invalid profile; 422 for missing/invalid purpose-of-use.
- 403/409 for bad/consumed JIT token; 404 for missing record id.
- 429/503 result in compliant backoff and successful retry.
Performance Tests¶
- p95 redaction ≤ 15 ms for 100-record pages.
- JIT validation ≤ 50 ms p95.
Security Tests¶
- RBAC enforced per profile; cross-tenant blocked.
- Logs exclude PII values; unmask audited with token id.
Related Documentation¶
Internal References¶
External References¶
- RFC 7807 (Problem Details)
- W3C Trace Context
Appendices¶
A. Example Problem+JSON (invalid profile)¶
{
"type": "urn:connectsoft:errors/redaction/profile.invalid",
"title": "Unsupported redaction profile",
"status": 400,
"detail": "Profile 'Debug' is not enabled for tenant 'acme'.",
"traceId": "9f0c1d2e3a4b5c6d...",
"errors": [{"pointer": "x-redaction-profile", "reason": "unsupported"}]
}
B. JIT Token (concept)¶
{
"jitId": "jt_01ABC…",
"tenant": "acme",
"subject": {"type":"Payment","id":"pay_789"},
"fields": ["deltas.after.email","actor.displayName"],
"purpose": "IncidentResponse",
"aud": "audit-read",
"nbf": "2025-10-22T11:00:00Z",
"exp": "2025-10-22T11:10:00Z",
"sig": "MEQCI…"
}
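The validity checks the Approval Service performs on a token like the one above can be sketched as follows. Field names follow the concept JSON; signature (`sig`) verification and the persistent consumed-token store are stubbed out, so this is an illustration of the check ordering, not the service's implementation:

```python
from datetime import datetime

def parse(ts: str) -> datetime:
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def validate_jit_token(token, *, tenant, audience, subject, now, consumed_ids):
    """Check a JIT unmask token's scope and time window; single-use on success."""
    if token["jitId"] in consumed_ids:
        return False, "already-consumed"          # maps to 409
    if token["tenant"] != tenant or token["aud"] != audience:
        return False, "wrong-tenant-or-audience"  # maps to 403
    if token["subject"] != subject:
        return False, "subject-mismatch"          # maps to 409
    if not (parse(token["nbf"]) <= now < parse(token["exp"])):
        return False, "expired-or-not-yet-valid"  # maps to 403
    consumed_ids.add(token["jitId"])              # mark single-use token spent
    return True, "ok"

token = {"jitId": "jt_01ABC", "tenant": "acme", "aud": "audit-read",
         "subject": {"type": "Payment", "id": "pay_789"},
         "nbf": "2025-10-22T11:00:00Z", "exp": "2025-10-22T11:10:00Z"}
consumed = set()
now = parse("2025-10-22T11:05:00Z")
ok, _ = validate_jit_token(token, tenant="acme", audience="audit-read",
                           subject={"type": "Payment", "id": "pay_789"},
                           now=now, consumed_ids=consumed)
assert ok
ok, reason = validate_jit_token(token, tenant="acme", audience="audit-read",
                                subject={"type": "Payment", "id": "pay_789"},
                                now=now, consumed_ids=consumed)
assert (ok, reason) == (False, "already-consumed")  # second use rejected
```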
Compliance Audit Flow¶
Generates a defensible compliance report by collecting evidence (records, lifecycle transitions, retention/legal-hold decisions, and integrity proofs), independently verifying tamper-evidence, and assembling a signed report artifact with full end-to-end traceability.
Overview¶
Purpose: Produce an auditable report that demonstrates data integrity, lifecycle adherence, and policy compliance over a defined scope and period.
Scope: Audit job creation, evidence collection, integrity verification (Merkle/Signatures), control checks (retention, legal hold, redaction on read), report assembly/signing, delivery, and audit of the audit. Excludes exporting large datasets (see Export flows) and policy authoring.
Context: Orchestrated by Audit Service; reads from Read Models/Indices, Lifecycle/Retention Index, Legal Hold, and Integrity Service; produces a signed Compliance Report and optional Evidence Bundle.
Key Participants:
- Auditor / Compliance Client
- API Gateway
- Audit Service (orchestrator, verifier, report builder)
- Query Service / Read Store (records, timelines)
- Integrity Service (Merkle & signatures verification)
- Policy/LegalHold/Retention services (decisions & states)
- Delivery Backend (report/evidence URLs)
- Webhook Receiver (optional callbacks)
Prerequisites¶
System Requirements¶
- API Gateway with TLS and JWT validation
- Audit Service with access to Read Store, Integrity, Policy, LegalHold, Retention Index
- Object storage for report artifacts and optional evidence bundle
- KMS/HSM configured for report signing (optional but recommended)
Business Requirements¶
- Tenant compliance profile defined (e.g., GDPR/HIPAA/SOC2 control set)
- Purpose-of-use and auditor role(s) configured
- Time-bound audit scope agreed (from/to, resources, actors)
Performance Requirements¶
- p95 time-to-summary ≤ 60 s for typical 24–48h windows
- Evidence sampling and cap thresholds configured to avoid oversize bundles
- Parallel verification workers sized to volume
Sequence Flow¶
Happy Path¶
sequenceDiagram
autonumber
actor AUD as Auditor
participant GW as API Gateway
participant AS as Audit Service
participant Q as Query Service / Read Store
participant INT as Integrity Service
participant POL as Policy/Retention/LegalHold
participant OBJ as Delivery Backend
participant WH as Webhook (optional)
AUD->>GW: POST /compliance/v1/audits {tenant, scope, frameworks, options{verifyIntegrity, includeEvidence}}
GW->>AS: Create audit job (authN/Z, x-tenant-id, traceparent)
AS->>Q: Collect evidence set (records, lifecycle states, decisions)
AS->>POL: Fetch decisions (retention elig., legal holds, policy revisions)
AS->>INT: Verify integrity (Merkle chain, signatures, sample leaves)
INT-->>AS: Verification results {ok, failures[], merkleRoot, keyIds}
AS->>AS: Compile control checks + traceability map
AS->>OBJ: PUT report.pdf/json + (optional) evidence.zip
AS-->>GW: 202 Accepted {auditId, status:"Running"}
alt webhook configured
AS->>WH: POST Compliance.ReportReady {auditId, reportUrl, summary}
end
Alternative Paths¶
- Lightweight attest-only: `verifyIntegrity=true` with no evidence bundle; report includes verification transcript and pointers.
- Delta audit: `sinceAuditId` to compare changes between two audits.
- Framework-specific: `frameworks=["SOC2"]` limits control set and sections rendered.
Error Paths¶
sequenceDiagram
actor AUD as Auditor
participant GW as API Gateway
participant AS as Audit Service
AUD->>GW: POST /compliance/v1/audits {malformed}
alt 400 Bad Request
GW-->>AUD: 400 Problem+JSON
else 404 Not Found (tenant/route/auditId)
GW-->>AUD: 404 Problem+JSON
else 409 Conflict (modify running audit / duplicate request-id)
GW-->>AUD: 409 Problem+JSON
else 429/503 Backpressure/Dependency down
GW-->>AUD: 429/503 Problem+JSON (+Retry-After)
end
Request/Response Specifications¶
Input Requirements¶
| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| Method/Path | `POST /compliance/v1/audits` | Y | Create a compliance audit job | JSON body |
| `Authorization` | header | Y | `Bearer <JWT>` | Valid, not expired |
| `x-tenant-id` | header | Y | Tenant scope | Matches `body.tenant` |
| `traceparent` | header | O | W3C trace context | 55-char |
| `tenant` | string | Y | Target tenant | `^[A-Za-z0-9._-]{1,128}$` |
| `scope` | object | Y | `{time:{from,to}, resourceTypes?, actors?}` | UTC ISO-8601, bounded |
| `frameworks` | array | O | `["GDPR","HIPAA","SOC2"]` | allowlist |
| `options.verifyIntegrity` | bool | O | Run integrity verification | default: `true` |
| `options.includeEvidence` | enum | O | `none` \| `sampled` \| `full` | — |
| `options.sampleRate` | number | O | 0–1 for sampled proofs | bounds checked |
| `webhook.url/secretId` | string | O | Completion callback + HMAC | HTTPS + known key |
| `idempotency-key` | header | O | De-duplicate create | ≤ 128 chars |
Control & Status

- `GET /compliance/v1/audits/{auditId}`
- `POST /compliance/v1/audits/{auditId}:cancel`
- `GET /compliance/v1/audits/{auditId}/report` (redirect/URL)
- `GET /compliance/v1/audits/{auditId}/evidence` (if produced)
Output Specifications¶
Create — 202 Accepted
| Field | Type | Description |
|---|---|---|
| `auditId` | string | Operation id (ULID/GUID) |
| `status` | enum | `Queued \| Collecting \| Verifying \| Assembling \| Completed \| Failed \| Canceled` |
| `summaryUrl` | url? | Interim human-readable status |
| `reportUrl` | url? | Set when ready |
Status — 200 OK
| Field | Type | Description |
|---|---|---|
| `auditId` | string | Identifier |
| `status` | enum | Terminal or running state |
| `counts` | object | `{records, proofsChecked, holds, eligible}` |
| `verifications` | object | `{merkleRoot, keyIds[], ok, failures[]}` |
| `reportUrl` / `evidenceUrl` | url? | Delivery |
Report (concept outline)
- Executive Summary (scope, date range, frameworks)
- Data Integrity (roots, signatures, verification transcript)
- Lifecycle & Retention (eligibleAt coverage, purge windows)
- Legal Holds (active timeline, affected records/partitions)
- Redaction & Privacy Controls (profiles, sampling of masked fields)
- Exceptions & Findings (severity, impacted scope)
- Appendices (inputs, hashes, timestamps, key ids)
Error Handling¶
Error Scenarios¶
| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Malformed scope/time window; unsupported framework; invalid `sampleRate` | Fix request | No retry until corrected |
| 401 | Missing/invalid JWT | Acquire valid token | Retry after renewal |
| 403 | Missing `audit:compliance.run` or cross-tenant attempt | Request proper scope/role | — |
| 404 | Unknown `auditId`/tenant; route disabled | Verify ids/tenant | — |
| 409 | Modify/cancel while running; duplicate `idempotency-key` | Wait for terminal state or change key | Retry after fix |
| 412 | `If-Match` mismatch on update/cancel | Fetch latest status and retry | Conditional retry |
| 422 | Evidence size would exceed cap; incompatible options (`full` with restricted edition) | Adjust options | — |
| 429 | Per-tenant/global audit concurrency limit | Honor `Retry-After` | Backoff + jitter |
| 503 | Read/Integrity/Policy service unavailable | Wait for recovery | Idempotent retry |
Failure Modes¶
- Proof sampling too low/high: report flags sampling level; enforce min/max per policy.
- Key unavailability: signature verification deferred; report marks inconclusive for specific windows with remediation steps.
- Projection lag: report includes watermark; sections constrained to consistent point-in-time.
Recovery Procedures¶
- Reduce evidence mode to
sampledor raise cap via admin policy if 422. - Re-run verification portion when keys/services recover; re-issue report with new signature.
- For 409/412, poll latest status, then retry control action.
Performance Characteristics¶
Latency Expectations¶
- Time-to-summary p95 ≤ 60 s for 24–48h windows; full verification depends on scope and sampling.
Throughput Limits¶
- Concurrency caps per tenant (e.g., ≤ 2 running audits); global worker pool bounded.
Resource Requirements¶
- CPU for hashing/verification; I/O for evidence fetch; memory for report assembly (streamed).
Scaling Considerations¶
- Parallelize by time/partition slices; verify proofs in worker pool; stream artifact assembly to object storage.
- Use seek pagination and limit evidence to sampled mode for very large scopes.
Security & Compliance¶
Authentication¶
- OIDC JWT at Gateway; optional mTLS for service-to-service.
Authorization¶
- Require `audit:compliance.run` to start; `audit:compliance.read` to fetch results; strict tenant RLS.
Data Protection¶
- Reports/evidence encrypted at rest; presigned URLs short-lived and least-privilege; webhook payloads HMAC-signed.
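HMAC verification on the receiving end of those webhook callbacks can be sketched with the standard library. The header name and hex encoding are assumptions; the secret is the one resolved via the configured `secretId`:

```python
import hashlib
import hmac

def sign_webhook(secret: bytes, payload: bytes) -> str:
    """Producer side: hex-encoded HMAC-SHA256 over the raw payload bytes."""
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()

def verify_webhook(secret: bytes, payload: bytes, received_sig: str) -> bool:
    """Receiver side: constant-time comparison to resist timing attacks."""
    expected = sign_webhook(secret, payload)
    return hmac.compare_digest(expected, received_sig)

secret = b"shared-webhook-secret"  # hypothetical value behind secretId
payload = b'{"auditId":"aud_123","status":"Completed"}'
sig = sign_webhook(secret, payload)
assert verify_webhook(secret, payload, sig)
assert not verify_webhook(secret, b'{"auditId":"tampered"}', sig)
```

Signing the raw bytes (not a re-serialized JSON object) avoids canonicalization mismatches between producer and receiver.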
Compliance¶
- Report is signed (JWS/COSE) with `kid`; includes verification transcript, watermarks, and policy revisions used.
- All audit actions are themselves audited (actor, purpose, scope, outputs).
Monitoring & Observability¶
Key Metrics¶
| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| `compliance_audits_active` | gauge | Running audits | > tenant/global cap |
| `compliance_audit_duration_seconds` | histogram | Runtime per audit | p95 > SLO |
| `integrity_verifications_total` | counter | Proof checks performed | Trend |
| `verification_failures_total` | counter | Failed proof checks | > 0 sustained |
| `report_build_failures_total` | counter | Report assembly/sign failures | > 0 |
Logging Requirements¶
- Structured logs: `tenant`, `auditId`, `scopeHash`, `frameworks[]`, `proofsChecked`, `failures`, `watermark`, `kid`. No PII.
Distributed Tracing¶
- Spans: `audit.collect`, `policy.fetch`, `integrity.verify`, `report.assemble`, `object.put`, `webhook.post`.
- Attributes: `sampleRate`, `evidenceMode`, `bytes`, `records`.
Health Checks¶
- Readiness: Read/Integrity/Policy reachable; KMS key loadable.
- Liveness: job queue draining; no stuck `Verifying`/`Assembling` states.
Operational Procedures¶
Deployment¶
- Deploy Audit Service; expose `/compliance/v1/audits` routes.
- Configure KMS signing keys and buckets for artifacts.
- Validate E2E on staging: create → verify → signed report downloadable.
Configuration¶
- Env: `AUDIT_MAX_CONCURRENCY_PER_TENANT`, `AUDIT_SAMPLE_RATE_DEFAULT`, `AUDIT_EVIDENCE_CAP_BYTES`, `PRESIGN_TTL_SEC`, `REPORT_SIGNING_KID`.
- Policy: min/max sampling, allowed frameworks per edition.
Maintenance¶
- Rotate signing keys; prune expired artifacts; archive reports according to retention.
- Periodic verification health checks against known-good test datasets.
Troubleshooting¶
- Verification failures → inspect key rotation, integrity roots, time window alignment.
- Large artifacts → switch to `sampled` mode; extend caps only if justified.
- Frequent 409/412 → ensure clients poll before modifying audit jobs.
Testing Scenarios¶
Happy Path Tests¶
- Create audit with `verifyIntegrity=true`, `includeEvidence=sampled` → signed report produced; verification transcript included.
- Fetch report/evidence; signature validates with published public key.
Error Path Tests¶
- 400 malformed scope; 404 unknown `auditId`; 409 modify while running.
- 422 evidence exceeds cap triggers clear guidance; 429/503 backoff works.
Performance Tests¶
- p95 time-to-summary ≤ 60 s; verify scaling across parallel slices.
- Sampled proof checks meet throughput targets.
Security Tests¶
- RBAC scopes enforced; presigned URLs expire; webhook HMAC validated.
- Report signature verifies via JWS/COSE with current `kid`.
Related Documentation¶
Internal References¶
- Data Model — Integrity Structures
- Data Lifecycle & States
- Legal Hold Processing Flow
- Retention Policy Evaluation Flow
External References¶
- RFC 7807 (Problem Details)
- W3C Trace Context
- JWS (RFC 7515) / COSE (RFC 8152)
Appendices¶
A. Example Problem+JSON (evidence cap exceeded)¶
{
"type": "urn:connectsoft:errors/compliance/evidence.cap.exceeded",
"title": "Evidence bundle too large",
"status": 422,
"detail": "Estimated evidence size 8.4GB exceeds cap 5GB. Use includeEvidence=sampled or narrow scope.",
"traceId": "9f0c1d2e3a4b5c6d...",
"errors": [{"pointer": "/options/includeEvidence", "reason": "cap-exceeded"}]
}
B. Report Verification (outline)¶
- Download `report.json` and `report.sig` (or signed PDF).
- Verify signature with published JWK/PEM (`kid` in report header).
- Re-run sample integrity proofs listed in the transcript; compare roots.
- Confirm watermarks and policy revision ids match tenant records.
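The non-cryptographic pre-checks in this outline can be sketched as a small helper. The field names used here (`header.kid`, `watermark`, `policyRevision`) are illustrative assumptions about the report and JWKS shapes; the real schema is defined by the Compliance spec, and actual signature verification still requires a JWS/COSE library:

```python
def precheck_report(report: dict, jwks: dict, tenant_records: dict) -> list:
    """Cheap consistency checks to run before cryptographic verification.

    Field names (header.kid, watermark, policyRevision) are hypothetical
    stand-ins for the real report schema.
    """
    problems = []
    kid = report.get("header", {}).get("kid")
    published = {k["kid"] for k in jwks.get("keys", [])}
    if kid not in published:
        problems.append(f"unknown kid: {kid}")
    if report.get("watermark") != tenant_records.get("watermark"):
        problems.append("watermark mismatch")
    if report.get("policyRevision") != tenant_records.get("policyRevision"):
        problems.append("policyRevision mismatch")
    return problems

report = {"header": {"kid": "int-key-2025"},
          "watermark": "2025-10-22T12:00:00Z", "policyRevision": 12}
jwks = {"keys": [{"kid": "int-key-2025"}]}
records = {"watermark": "2025-10-22T12:00:00Z", "policyRevision": 12}
assert precheck_report(report, jwks, records) == []
```

An empty problem list only clears the way for the signature check; it is not itself proof of authenticity.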
Integrity Verification Flow¶
Runs an on-demand proof check for one or more records, validating leaf hash → Merkle path → block/segment root → signature. Produces a per-record evidence report (OK|FAIL|INCONCLUSIVE) and supports degraded mode when some materials (e.g., keys, archived proofs) are unavailable.
Overview¶
Purpose: Allow clients and auditors to independently verify that returned records are authentic and untampered, using stored proofs and signatures.
Scope: Request intake, materialization of proof inputs (leaf, path, roots, signatures), verification pipeline, degraded-mode policies, report generation, and optional evidence bundle. Excludes integrity creation/sealing (see Integrity Chain flow).
Context: The Integrity Service reads Integrity Store/Evidence Store (paths, roots, manifests) and may call KMS/HSM or use public keys to verify signatures.
Key Participants:
- Client (verifier)
- API Gateway
- Integrity Service (verifier/orchestrator)
- Evidence Store / Integrity Store (proofs, roots, manifests)
- KMS/HSM or Key Registry (public keys / verification)
- Object Storage (optional evidence bundles)
Prerequisites¶
System Requirements¶
- API Gateway with TLS and JWT validation
- Integrity Service with read access to Integrity/Evidence stores and key registry
- Object storage bucket for optional per-request evidence bundles
- Time source synchronized; hash and signature algorithms configured
Business Requirements¶
- Tenant integrity policy defines algorithms (e.g., SHA-256, Ed25519) and acceptable degraded modes
- Retention of proofs/manifests meets verification SLAs
- Auditing enabled for verification requests
Performance Requirements¶
- p95 verification latency ≤ 200 ms for single-record checks (cached proofs)
- Batch verification throughput meets SLO (e.g., 2k–10k records/s with precomputed paths)
- Backpressure & rate limits for large batches
Sequence Flow¶
Happy Path¶
sequenceDiagram
autonumber
actor CL as Client
participant GW as API Gateway
participant INT as Integrity Service
participant EVI as Evidence Store / Integrity Store
participant KMS as KMS/HSM or Key Registry
participant OBJ as Object Storage (optional)
CL->>GW: POST /integrity/v1/verify {tenant, items[], mode: "full"}
GW->>INT: Forward (authN/Z, x-tenant-id, traceparent)
INT->>EVI: Fetch materials (leaf hash or record, path, blockRoot, manifest)
INT->>KMS: Load/validate public key (by kid) and verify signature(root)
KMS-->>INT: ok {kid, alg}
INT->>INT: Verify inclusion (leaf→path→blockRoot) and chain(root→segmentRoot?)
alt returnEvidence = "bundle"
INT->>OBJ: PUT evidence.zip (paths, manifest, key metadata)
end
INT-->>GW: 200 OK {perItemResults[], summary, evidenceUrl?}
GW-->>CL: 200 OK
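The inclusion check in step 4 (leaf → path → blockRoot) can be sketched as follows. The `(side, sibling)` path encoding is an assumption for illustration, not the platform's wire format:

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def verify_inclusion(leaf_hash: bytes, path, expected_root: bytes) -> bool:
    """Walk a Merkle path from leaf to root.

    `path` is a list of (side, sibling_hash) pairs; side "L" means the
    sibling sits to the left of the running hash, "R" to the right.
    """
    node = leaf_hash
    for side, sibling in path:
        node = sha256(sibling + node) if side == "L" else sha256(node + sibling)
    return node == expected_root

# Example: two-leaf tree; the proof for leaf A is leaf B's hash on the right.
leaf_a = sha256(b"record-A")
leaf_b = sha256(b"record-B")
root = sha256(leaf_a + leaf_b)
assert verify_inclusion(leaf_a, [("R", leaf_b)], root)
assert not verify_inclusion(leaf_b, [("R", leaf_a)], root)  # wrong order fails
```

The chain link to a segment root (step 4's `chain(root→segmentRoot?)`) is the same walk repeated one level up.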
Alternative Paths¶
- Fast mode: `mode="fast"` skips recomputation of the leaf hash when the caller supplies `leafHash`; verifies path→root→signature only.
- Degraded mode: `allowDegraded=true` permits `INCONCLUSIVE` with reasons (e.g., signature service offline) while still verifying available steps.
- External leaf: caller provides `payload` to hash server-side (canonicalization rules applied).
Error Paths¶
sequenceDiagram
actor CL as Client
participant GW as API Gateway
participant INT as Integrity Service
CL->>GW: POST /integrity/v1/verify {malformed}
alt Bad request (invalid item spec/algorithm)
GW-->>CL: 400 Bad Request (Problem+JSON)
else Not found (record/proof/manifest missing)
GW-->>CL: 404 Not Found (Problem+JSON)
else Conflict (verify while block is resealing/rotating)
GW-->>CL: 409 Conflict (Problem+JSON)
else Unauthorized/Forbidden
GW-->>CL: 401/403 (Problem+JSON)
else Rate limit / dependency down
GW-->>CL: 429/503 (Retry-After)
end
Request/Response Specifications¶
Input Requirements¶
| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| Method/Path | `POST /integrity/v1/verify` | Y | Start verification | JSON body |
| `Authorization` | header | Y | `Bearer <JWT>` | Valid, not expired |
| `x-tenant-id` | header | Y | Tenant scope | Must match `body.tenant` |
| `traceparent` | header | O | W3C trace context | 55-char |
| `tenant` | string | Y | Target tenant | `^[A-Za-z0-9._-]{1,128}$` |
| `mode` | enum | O | `full` (default) \| `fast` | — |
| `allowDegraded` | bool | O | Permit partial verify | default: `false` |
| `returnEvidence` | enum | O | `none` (default) \| `bundle` | — |
| `items[]` | array | Y | Records to verify | 1–10k items |
| `items[].recordId` | string | O* | Record identifier | ULID/GUID |
| `items[].leafHash` | string | O* | Base64url/hex hash | matches algorithm |
| `items[].payload` | object | O* | Canonicalizable payload | size bounded |
| `items[].algorithm` | enum | O | `sha256` (default) | allowlist |
| `items[].expectedRoot` | string | O | Optional asserted root | must match stored |
| `idempotency-key` | header | O | De-dupe request | ≤ 128 chars |

*Provide at least one of `recordId`, `leafHash`, or `payload`.
Output Specifications¶
200 OK
| Field | Type | Description | Notes |
|---|---|---|---|
| `results[]` | array | Per-item verification results | See below |
| `summary` | object | `{ok, fail, inconclusive}` | Counts |
| `evidenceUrl` | url? | If `returnEvidence=bundle` | Presigned, short-lived |
| `policyRevisionUsed` | int | Integrity policy revision | For audit |
Per-item result
{
"input": {"recordId":"01JF…","algorithm":"sha256"},
"steps": {
"leafHash": {"status":"OK","computed":"8a4f..."},
"pathVerify": {"status":"OK","depth":17},
"rootSignature": {"status":"OK","kid":"int-key-2025","alg":"Ed25519"},
"chainLink": {"status":"OK","segment":"seg_2025_10_22"}
},
"status": "OK", // OK | FAIL | INCONCLUSIVE
"degraded": false, // true if allowed and used
"reason": null, // failure/inconclusive reason
"timingsMs": {"total": 42, "leaf": 1, "path": 6, "sig": 8}
}
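As a sketch, the `summary` counts can be derived by folding the per-item statuses shown above; status strings follow the `OK | FAIL | INCONCLUSIVE` convention from the result shape:

```python
from collections import Counter

def summarize(results: list) -> dict:
    """Collapse per-item statuses into the response summary counts."""
    c = Counter(r["status"] for r in results)
    return {"ok": c["OK"], "fail": c["FAIL"], "inconclusive": c["INCONCLUSIVE"]}

results = [
    {"status": "OK"},
    {"status": "OK"},
    {"status": "INCONCLUSIVE", "degraded": True},
]
assert summarize(results) == {"ok": 2, "fail": 0, "inconclusive": 1}
```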
Example Payloads¶
// Full verification by recordId
{
"tenant": "acme",
"mode": "full",
"items": [
{"recordId": "01JF3W8KTR2D3WQF3B9R0KJY9Y", "algorithm": "sha256"}
],
"returnEvidence": "path-only"
}
// Fast verification using supplied leafHash and allowing degraded mode
{
"tenant": "acme",
"mode": "fast",
"allowDegraded": true,
"items": [
{"leafHash": "8a4f...", "expectedRoot": "d1c2...", "algorithm": "sha256"}
]
}
Error Handling¶
Error Scenarios¶
| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Malformed body; none of `recordId`\|`leafHash`\|`payload` provided; unsupported `algorithm` | Fix request | No retry until corrected |
| 401 | Missing/invalid JWT | Acquire valid token | Retry after renewal |
| 403 | Missing `audit:integrity.verify` or tenant mismatch | Request proper scope/role | — |
| 404 | Record/proof/manifest not found | Verify id/scope; ensure proofs retained | — |
| 409 | Verification against block being resealed/rotated | Retry after block state settles | Short backoff |
| 412 | `If-Match` on root version failed | Fetch latest root/manifest; retry | Conditional retry |
| 422 | Payload cannot be canonicalized to leaf hash | Use server-known `recordId` or supply `leafHash` | — |
| 429 | Rate limited for batch or per-tenant | Honor `Retry-After` | Exponential backoff + jitter |
| 503 | Evidence store, key service, or integrity store unavailable | Wait for recovery | Idempotent retry |
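For the 429/503 rows, a minimal client-side delay helper might look like this; the `base` and `cap` defaults are illustrative, not prescribed by the platform:

```python
import random
from typing import Optional

def next_delay(attempt: int, retry_after: Optional[float] = None,
               base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter; an explicit Retry-After wins."""
    if retry_after is not None:
        return retry_after
    return random.uniform(0, min(cap, base * (2 ** attempt)))

# A server-supplied Retry-After header is honored verbatim.
assert next_delay(3, retry_after=7.0) == 7.0
# Otherwise the jittered delay stays within the exponential envelope.
assert 0 <= next_delay(2) <= 2.0
```

Full jitter (uniform over the whole envelope) spreads retries from many clients instead of synchronizing them at the cap.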
Failure Modes¶
- Missing signature key (archived/rotated): inclusion verified, signature step `INCONCLUSIVE` when `allowDegraded=true`.
- Archived proofs (cold tier): request becomes async; 202 with later webhook/report when materials restored.
- Projection drift: record exists but proof not yet sealed; respond `409` until seal completes.
Recovery Procedures¶
- On 409/412, fetch latest block status/root and retry verification.
- If 503/429, back off; request is idempotent by `(tenant, itemsHash, idempotency-key?)`.
- When proofs are archived, re-issue request with `allowDegraded=true` or wait for restoration event.
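The `itemsHash` component of the idempotency key above can be sketched as a digest over a canonicalized request scope. Sorted-key compact JSON is an assumption here; the platform's actual canonicalization rules live in the Integrity spec, and note this sketch is still sensitive to item ordering:

```python
import hashlib
import json

def items_hash(tenant: str, items: list) -> str:
    """Deterministic digest over the verification request scope.

    Sorted keys + compact separators make logically equal requests hash
    identically regardless of JSON field order within each item.
    """
    canonical = json.dumps({"tenant": tenant, "items": items},
                           sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

a = items_hash("acme", [{"recordId": "01JF...", "algorithm": "sha256"}])
b = items_hash("acme", [{"algorithm": "sha256", "recordId": "01JF..."}])
assert a == b  # field order does not change the key
```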
Performance Characteristics¶
Latency Expectations¶
- Single-record, cached materials: p95 ≤ 200 ms.
- Batch with precomputed paths: thousands/sec per verifier instance.
Throughput Limits¶
- Per-tenant verification QPS caps; batch size limits (e.g., ≤ 1k items/request).
Resource Requirements¶
- CPU-bound hashing/path checks; memory proportional to path depth and batch size; small I/O for manifest/path fetch.
Scaling Considerations¶
- Cache recent roots and key material by `kid`.
- Pre-fetch proof paths for hot records; shard verifier workers by tenant/segment.
- Use asynchronous retrieval for cold-storage proofs.
Security & Compliance¶
Authentication¶
- OIDC JWT at Gateway; service credentials for store/key access.
Authorization¶
- Require `audit:integrity.verify`; enforce `x-tenant-id` RLS.
Data Protection¶
- Do not log payloads or raw proofs; only hashes and ids.
- Evidence bundles are encrypted at rest and shared via short-lived presigned URLs.
Compliance¶
- Verification report contains key ids, algorithms, roots, and timestamps for chain-of-custody.
- Degraded-mode decisions are explicit and auditable.
Monitoring & Observability¶
Key Metrics¶
| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| `verify_requests_total` | counter | Verification requests | Trend |
| `verify_latency_ms` | histogram | End-to-end latency | p95 > SLO |
| `verify_failures_total` | counter | Items with `FAIL` | > 0 sustained |
| `verify_inconclusive_total` | counter | Degraded outcomes | Spike watch |
| `proof_cache_hit_ratio` | gauge | Cache effectiveness | < 0.8 sustained |
Logging Requirements¶
- Structured logs: `tenant`, `requestId`, `batchSize`, `ok/fail/inconclusive`, `alg`, `kid`, `degraded`. No PII or payloads.
Distributed Tracing¶
- Spans: `materials.fetch`, `leaf.hash`, `path.verify`, `sig.verify`, `bundle.pack`.
- Attributes: `pathDepth`, `kid`, `mode`, `degraded`.
Health Checks¶
- Readiness: evidence/key stores reachable; cache warmed.
- Liveness: verifier queue drains; no stuck requests beyond timeout.
Operational Procedures¶
Deployment¶
- Deploy Integrity Service; expose `/integrity/v1/verify`.
- Configure key registry/KMS access and algorithm allowlist.
- Warm caches with latest roots and public keys.
Configuration¶
- Env: `VERIFY_MAX_BATCH`, `VERIFY_RATE_LIMITS`, `KEY_CACHE_TTL`, `ROOT_CACHE_TTL`, `EVIDENCE_BUNDLE_TTL`.
- Policy: allowed degraded modes; acceptable algorithms; maximum batch sizes.
Maintenance¶
- Rotate verification keys and update registry; verify legacy roots with retained public keys.
- Periodically test cold-proof restore paths.
Troubleshooting¶
- Rising `INCONCLUSIVE` → check KMS availability and key retention.
- High `FAIL` rates → inspect canonicalization/version mismatches or corrupted paths.
- Latency spikes → verify cache TTLs and storage hot/cold tiering.
Testing Scenarios¶
Happy Path Tests¶
- Verify by `recordId` with full steps → `status=OK`, signature validated.
- Batch verify with provided `leafHash` → `status=OK` for all items; summary counts correct.
Error Path Tests¶
- 400 when no `recordId`/`leafHash`/`payload`; 404 for unknown record/proof.
- 409 when verifying during reseal; 412 when root version mismatches.
- 429/503 induce backoff and successful retry.
Performance Tests¶
- Achieve target throughput with cached proofs; measure p95 latency.
- Stress with 10k items; ensure backpressure and partial progress reporting.
Security Tests¶
- RBAC scopes enforced; cross-tenant blocked.
- Evidence bundle URL expiry honored; keys validated by `kid`.
Related Documentation¶
Internal References¶
External References¶
- RFC 7807 (Problem Details)
- W3C Trace Context
Appendices¶
A. Example Problem+JSON (degraded not allowed)¶
{
"type": "urn:connectsoft:errors/integrity/degraded.disallowed",
"title": "Degraded verification not permitted",
"status": 422,
"detail": "Key service unavailable and allowDegraded is false.",
"traceId": "9f0c1d2e3a4b5c6d..."
}
B. Evidence Bundle (concept)¶
evidence/
item_01JF…/
leaf.txt
path.json
manifest.json
root.sig # JWS/COSE detached signature
key-metadata.json # {kid, alg, issuer, notBefore, notAfter}
README.txt # verification instructions
Tamper Detection Flow¶
Continuously (or on-demand) scans integrity materials to detect anomalies—such as gaps, forks, reseals outside policy, signature/key issues, or out-of-order segments—then alerts and escalates with actionable context. The pipeline emphasizes low false positives through suppression, correlation, and thresholds.
Overview¶
Purpose: Proactively detect and surface potential tampering or integrity regressions before consumers encounter affected data.
Scope: Scheduling, scope planning, chain/segment/manifest checks, anomaly scoring & suppression, alerting/escalation, and case tracking. Excludes remediation (sealing/repair) which is handled by operations runbooks.
Context: Runs within the Integrity Validator component against Integrity/Evidence Stores and Key Registry/KMS; feeds alerts to Observability and Incident Management systems.
Key Participants:
- Scheduler/Detector Orchestrator
- Integrity Validator (check runners, anomaly detector)
- Integrity Store / Evidence Store (roots, manifests, paths)
- Key Registry/KMS (public keys, validity windows)
- Alerting / On-Call (Pager/Email/Webhooks)
- SIEM / Case Manager (ticketing, correlation)
Prerequisites¶
System Requirements¶
- Validator has read access to Integrity/Evidence stores and Key Registry
- Object storage reachable for manifests and archived proofs
- Time synchronization across services; policy cache warm (algorithms, seal cadence)
Business Requirements¶
- Tenant integrity policy defines seal cadence, allowed reseal windows, acceptable algorithms, and escalation paths
- Alert routing configured (webhooks/pager) with on-call schedule
- Compliance logging enabled for anomaly events
Performance Requirements¶
- Chain scan p95 ≤ 2 min per segment; continuous mode amortized to keep staleness ≤ 5 min
- Alert fan-out latency p95 ≤ 30 s
- Bounded load on stores (rate-limited walkers)
Sequence Flow¶
Happy Path¶
sequenceDiagram
autonumber
participant SCH as Scheduler
participant VAL as Integrity Validator
participant IST as Integrity/Evidence Store
participant KMS as Key Registry/KMS
participant ALR as Alerting (Pager/Webhook)
participant SIEM as SIEM/Case Manager
SCH->>VAL: Tick {tenant, window, policyRevision}
VAL->>IST: Enumerate segments/blocks within window
loop For each segment
VAL->>IST: Fetch manifests + roots + metadata
VAL->>KMS: Get key by kid, check validity window
VAL->>VAL: Run checks (gap/fork/order/sig/freshness/seal cadence)
end
VAL->>VAL: Score & suppress duplicates, correlate with recent changes
alt Anomalies found
VAL->>ALR: Create alert {type, severity, evidence pointers}
ALR-->>VAL: Ack alert id
VAL->>SIEM: Open case/ticket {links to evidence}
else No anomalies
VAL->>VAL: Record heartbeat metric & watermark
end
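The gap/fork portion of step 5's check run can be sketched over an index-ordered window of manifests. The `{index, root}` manifest shape is a simplification; real manifests also carry seal times, `kid`, and cadence metadata:

```python
def detect_anomalies(segments: list) -> list:
    """Scan a window of segment manifests for gaps and forks.

    A fork is two distinct roots claimed for the same index; a gap is a
    missing index in the expected contiguous sequence.
    """
    by_index = {}
    for seg in segments:
        by_index.setdefault(seg["index"], set()).add(seg["root"])
    anomalies = []
    for i, roots in sorted(by_index.items()):
        if len(roots) > 1:
            anomalies.append({"type": "fork", "index": i, "roots": sorted(roots)})
    indexes = sorted(by_index)
    for prev, cur in zip(indexes, indexes[1:]):
        if cur != prev + 1:
            anomalies.append({"type": "gap", "after": prev, "before": cur})
    return anomalies

segs = [{"index": 1, "root": "aa"}, {"index": 1, "root": "bb"},
        {"index": 3, "root": "cc"}]
found = detect_anomalies(segs)
assert {"type": "fork", "index": 1, "roots": ["aa", "bb"]} in found
assert {"type": "gap", "after": 1, "before": 3} in found
```

Order, seal-cadence, signature, and freshness checks follow the same per-segment pattern against policy and key-validity metadata.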
Alternative Paths¶
- On-demand scan: operator invokes `POST /integrity/v1/tamper-detection:scan` for a tenant/time range.
- Hot segment watch: watch new blocks; verify seal cadence and signature freshness in near-real-time.
- Degraded verification: if keys unavailable, emit warning with `degraded=true` (no hard alert) depending on policy.
Error Paths¶
sequenceDiagram
participant OP as Operator
participant GW as API Gateway
participant VAL as Integrity Validator
OP->>GW: POST /integrity/v1/tamper-detection:scan {malformed}
alt 400 Bad Request (invalid window/tenant/algo)
GW-->>OP: 400 Problem+JSON
else 404 Not Found (unknown detectorId/tenant)
GW-->>OP: 404 Problem+JSON
else 409 Conflict (scan already running for same scope)
GW-->>OP: 409 Problem+JSON
else 429/503 (rate limit/dependency down)
GW-->>OP: 429/503 Problem+JSON (+Retry-After)
end
Request/Response Specifications¶
Input Requirements¶
| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| Method/Path | `POST /integrity/v1/tamper-detection:scan` | O | On-demand scan trigger | JSON body |
| `Authorization` | header | Y | `Bearer <JWT>` | Valid, not expired |
| `x-tenant-id` | header | Y | Tenant scope | Matches `body.tenant` |
| `tenant` | string | Y | Target tenant | `^[A-Za-z0-9._-]{1,128}$` |
| `window` | object | O | `{from,to}` override | ISO-8601 UTC, bounded |
| `checks` | array | O | Subset (`gap`, `fork`, `order`, `seal`, `sig`, `freshness`) | allowlist |
| `severityThreshold` | enum | O | `info` \| `low` \| `medium` \| `high` \| `critical` | default `medium` |
| `suppressWindow` | string | O | e.g., `10m` duplicate suppression | ≤ policy max |
| `traceparent` | header | O | W3C trace context | 55-char |
| `idempotency-key` | header | O | De-dup create | ≤ 128 chars |
Output Specifications¶
202 Accepted / 200 OK
| Field | Type | Description | Notes |
|---|---|---|---|
| `scanId` | string | Operation id | ULID/GUID |
| `status` | enum | `Queued` \| `Running` \| `Completed` \| `Failed` | — |
| `summary` | object | `{checkedSegments, anomalies, degraded}` | Final on 200 |
| `watermark` | string | Latest segment time examined | ISO-8601 UTC |
Anomaly Event (concept)
{
"tenant": "acme",
"type": "Integrity.ForkDetected",
"severity": "high",
"segment": "seg_2025_10_22",
"policyRevision": 12,
"details": {
"roots": ["9a1c...", "77fb..."],
"firstSeenAt": "2025-10-22T12:00:07Z",
"evidence": {"manifestUrl": "s3://.../seg_2025_10_22.manifest.json"}
}
}
Error Handling¶
Error Scenarios¶
| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Invalid time window/checks list; `from >= to` | Fix request | No retry until corrected |
| 401 | Missing/invalid JWT | Acquire valid token | Retry after renewal |
| 403 | Missing `audit:integrity.tamper.scan` | Request proper role/scope | — |
| 404 | Unknown `detectorId`/tenant | Verify ids/tenant | — |
| 409 | Scan already running for same `{tenant, window}` | Wait for completion or use different scope | Retry after fix |
| 422 | Suppression window exceeds policy | Adjust parameter | — |
| 429 | Rate limited | Honor `Retry-After` | Backoff + jitter |
| 503 | Integrity/Evidence/Key service unavailable | Wait for recovery | Idempotent retry |
Failure Modes¶
- Transient fork (eventual consistency): auto-downgrade to warning unless it persists beyond `stabilityDelay`.
- Key rotation gap: signatures verify with new `kid` but manifests still reference old key; mark `degraded=false`, add remediation hint.
- Late seal: block sealed outside allowed window; alert severity based on policy (`medium` → `high` if repeated).
Recovery Procedures¶
- For 409, query scan status and avoid duplicate runs; use `idempotency-key`.
- For intermittent fork/gap, re-scan after `stabilityDelay`; escalate only if repeated.
- On 503/429, validator backs off automatically; operator may re-issue trigger.
Performance Characteristics¶
Latency Expectations¶
- Segment check p95 ≤ 2 min; near-real-time watch detects issues within ≤ 5 min of occurrence.
Throughput Limits¶
- Bounded walkers per tenant (e.g., ≤ 2 concurrent); global cap to protect stores.
Resource Requirements¶
- CPU for hashing/verification; small read IO for manifests/roots; minimal memory with streaming checks.
Scaling Considerations¶
- Shard by tenant and segment time; cache recent roots and valid `kid`s.
- Use adaptive sampling: deep checks on hot segments; summary checks elsewhere.
- Apply duplicate suppression windows to maintain a low false-positive rate.
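A duplicate suppression window keyed by `{tenant, segment, type}` (the dedupe key suggested later in Troubleshooting) can be sketched as a small in-memory guard; a production detector would persist or shard this state:

```python
class Suppressor:
    """Drop duplicate alerts for the same (tenant, segment, type) within a window."""

    def __init__(self, window_sec: float):
        self.window = window_sec
        self._last = {}  # (tenant, segment, type) -> last alert time

    def should_alert(self, tenant: str, segment: str, atype: str, now: float) -> bool:
        key = (tenant, segment, atype)
        last = self._last.get(key)
        if last is not None and now - last < self.window:
            return False  # duplicate sighting inside the suppression window
        self._last[key] = now
        return True

s = Suppressor(window_sec=600)  # e.g., a 10m suppressWindow
assert s.should_alert("acme", "seg_1", "fork", now=0)
assert not s.should_alert("acme", "seg_1", "fork", now=120)  # suppressed
assert s.should_alert("acme", "seg_1", "fork", now=700)      # window elapsed
```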
Security & Compliance¶
Authentication¶
- OIDC JWT at Gateway; service credentials for store/key access.
Authorization¶
- Require `audit:integrity.tamper.scan` (run) and `audit:integrity.tamper.read` (results).
- Enforce tenant RLS via `x-tenant-id`.
Data Protection¶
- Do not include payloads in alerts; only ids, hashes, URLs to manifests (access-controlled).
- Evidence links shared as short-lived presigned URLs.
Compliance¶
- All anomalies and operator triggers are audited with actor, purpose, scope, and policy revision.
- Detector configuration changes tracked with forward-only revisions.
Monitoring & Observability¶
Key Metrics¶
| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| `tamper_scans_active` | gauge | Running scans | > tenant/global cap |
| `tamper_anomalies_total` | counter | Anomalies by type/severity | Spike indicates issue |
| `tamper_false_positives_total` | counter | Operator-marked FP | > target triggers tuning |
| `tamper_scan_duration_seconds` | histogram | Scan runtime | p95 > SLO |
| `tamper_degraded_checks_total` | counter | Checks in degraded mode | Sustained rise → key/store health |
Logging Requirements¶
- Structured logs: `tenant`, `scanId`, `policyRevision`, `segmentsChecked`, `anomalies[]`, `degraded`, `watermark`. No PII.
Distributed Tracing¶
- Spans: `scan.plan`, `segment.fetch`, `check.run(type)`, `alert.emit`, `case.open`.
- Attributes: `segmentCount`, `checks`, `severity`, `suppressed`.
Health Checks¶
- Readiness: Integrity/Evidence stores and Key Registry reachable; policy cache loaded.
- Liveness: scan queue advancing; no segment stuck beyond timeout.
Operational Procedures¶
Deployment¶
- Deploy Integrity Validator; enable scheduler and on-demand endpoint.
- Configure alert routes (pager/webhook) and SIEM integration.
- Validate with seeded test anomalies (simulated fork/gap).
Configuration¶
- Env: `DETECTOR_MAX_CONCURRENCY`, `DETECTOR_STABILITY_DELAY`, `DETECTOR_SUPPRESS_WINDOW`, `DETECTOR_DEFAULT_CHECKS`.
- Policy: seal cadence, reseal allowances, severity mappings, degraded-mode policy.
Maintenance¶
- Tune thresholds using `tamper_false_positives_total` and incident postmortems.
- Rotate keys and ensure manifests reference valid `kid`s across rotations.
Troubleshooting¶
- Repeated transient forks → increase `stabilityDelay` slightly; verify store replication lag.
- Many degraded checks → investigate Key Registry/KMS availability.
- Alert floods → widen suppression window; confirm dedupe keys include `{tenant, segment, type}`.
Testing Scenarios¶
Happy Path Tests¶
- Continuous scan detects a forced manifest gap and raises a single actionable alert.
- On-demand scan limits to given window and returns summary with watermark.
Error Path Tests¶
- 400 on malformed window/checks; 404 unknown tenant; 409 duplicate scan scope.
- 429/503 produce compliant backoff with no duplicate alerts.
Performance Tests¶
- Segment check p95 ≤ 2 min; scan staleness ≤ 5 min under steady load.
- Suppression prevents duplicate alerts during repeated sightings.
Security Tests¶
- RBAC respected; cross-tenant access blocked.
- Alerts contain no payload data; evidence URLs expire and are scoped.
Related Documentation¶
Internal References¶
External References¶
- RFC 7807 (Problem Details)
- W3C Trace Context
Appendices¶
A. Example Problem+JSON (duplicate scope)¶
{
"type": "urn:connectsoft:errors/detector/scope.conflict",
"title": "Tamper scan already running for scope",
"status": 409,
"detail": "A scan for tenant 'acme' and window 2025-10-22T00:00Z..2025-10-22T12:00Z is already running.",
"traceId": "9f0c1d2e3a4b5c6d...",
"errors": [{"pointer": "/window", "reason": "duplicate-scope"}]
}
B. Anomaly Types (reference)¶
- `gap`: missing block/segment in expected sequence
- `fork`: two different roots for the same segment
- `order`: out-of-order seal time or index
- `seal`: seal outside configured cadence or early reseal
- `sig`: signature invalid/key mismatch/outside validity window
- `freshness`: seal/manifest not produced within SLA
Key Rotation Flow¶
Safely rotates signing keys for integrity sealing and verification. Introduces a new key (kid_new) in KMS, publishes it via the Key Registry, enables a dual-verify window where both kid_old and kid_new are trusted for verification, then transitions the signer to kid_new and retires kid_old without breaking backward verification.
Overview¶
Purpose: Regularly rotate integrity signing keys while ensuring uninterrupted signing and verification, preserving the ability to verify historical signatures.
Scope: Key generation/activation, registry publication, signer switchover, dual-verify window, verifier cache refresh, deactivation/retirement, and audit events. Excludes general IAM/PKI hardening (covered elsewhere).
Context: Security (SecOps) initiates rotation in KMS/HSM. Key Registry (JWKS/COSE keyset) distributes public keys to Integrity Service (signer) and all Verifiers (Verification/Compliance services).
Key Participants:
- Security (SecOps)
- KMS/HSM (key creation, protection, activation windows)
- Key Registry / Publisher (JWKS/COSE sets, versioning)
- Integrity Service (Signer) (seals blocks with active `kid`)
- Verification Services (Integrity Verify, Compliance Audit)
- Event Bus / Observability (`Key.Rotated`, metrics/alerts)
Prerequisites¶
System Requirements¶
- KMS/HSM reachable; policies allow key create/rotate/disable
- Key Registry supports versioned JWKS/COSE publication with cache headers
- Integrity Service can hot-reload signer `kid` without restart
- Verifiers fetch/refresh keys on cache miss or via periodic refresh
Business Requirements¶
- Rotation cadence defined (e.g., 90 days) and emergency rotation runbook approved
- Dual-verify window configured (e.g., 14 days) and documented
- Audit logging enabled for all key lifecycle operations
Performance Requirements¶
- JWKS fetch p95 ≤ 200 ms; cache TTL tuned (e.g., 5–10 min)
- Signer switchover ≤ 1 min between publish and activation
- Verification failure rate due to unknown `kid` < 0.01% during rotation
Sequence Flow¶
Happy Path¶
sequenceDiagram
autonumber
actor SEC as Security (SecOps)
participant KMS as KMS/HSM
participant REG as Key Registry (JWKS/COSE)
participant SIG as Integrity Service (Signer)
participant VER as Verification Services
participant BUS as Event Bus / Observability
SEC->>KMS: CreateKey {alg:Ed25519, usage:sign, tags:{tenant, purpose}}
KMS-->>SEC: KeyMetadata {kid_new, state:PreActive}
SEC->>REG: Publish {kid_new, pubKey, notBefore, notAfter}
REG-->>VER: JWKS {kid_old, kid_new} (cacheable)
SEC->>SIG: Schedule Activate {kid_new, at: T0+5m}
Note over VER,REG: Dual-verify window begins: verifiers trust {kid_old, kid_new}
SEC->>BUS: Emit Key.RotationPlanned {kid_old, kid_new, at:T0+5m}
SIG->>KMS: Load key {kid_new}
SIG->>SIG: Activate signer kid = kid_new (at T0+5m)
SIG->>BUS: Emit Key.Rotated {active:kid_new, retired:kid_old?}
SEC->>KMS: Set kid_old to verify-only (disable sign) at T0+14d
SEC->>REG: Unpublish kid_old (or mark as retiring) at T0+14d
REG-->>VER: JWKS {kid_new} (kid_old removed after window)
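The dual-verify window works because verifiers trust whatever `kid`s the registry currently publishes rather than a single pinned key. A minimal verifier keyring that refreshes once on an unknown `kid` might look like this sketch (the `fetch_jwks` callable stands in for a real HTTP JWKS fetch with caching):

```python
class VerifierKeyring:
    """Trust the kids the registry publishes; refresh on an unknown kid."""

    def __init__(self, fetch_jwks):
        self._fetch = fetch_jwks      # callable returning {kid: public_key}
        self._keys = self._fetch()

    def key_for(self, kid: str):
        if kid not in self._keys:     # unknown_kid → force one JWKS refresh
            self._keys = self._fetch()
        return self._keys.get(kid)    # None → treat item as FAIL/INCONCLUSIVE

published = {"kid_old": "pk-old"}
ring = VerifierKeyring(lambda: dict(published))
assert ring.key_for("kid_old") == "pk-old"
published["kid_new"] = "pk-new"       # registry publishes the new key
assert ring.key_for("kid_new") == "pk-new"  # picked up during dual-verify
```

During the window both `kid_old` and `kid_new` resolve, so seals made before and after the switchover verify without interruption.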
Alternative Paths¶
- Emergency rotation: immediate switch due to suspected compromise; shorten dual-verify window, revoke kid_old for signing at once; maintain verify-only if integrity permits.
- Canary activation: enable `kid_new` for a subset of signers; verify end-to-end before global activation.
- Per-region phased rollout: publish globally, activate region by region with overlap.
Error Paths¶
sequenceDiagram
actor SEC as Security
participant GW as API Gateway
participant KM as KMS/HSM
SEC->>GW: POST /keys/v1/rotate {alg:"foo"} %% unsupported alg
alt 400 Bad Request
GW-->>SEC: 400 Problem+JSON
else 404 Not Found (kid_old)
GW-->>SEC: 404 Problem+JSON
else 409 Conflict (active rotation in progress / multiple active signers)
GW-->>SEC: 409 Problem+JSON
else 503 KMS unavailable
GW-->>SEC: 503 Problem+JSON (Retry-After)
end
Request/Response Specifications¶
Input Requirements¶
| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| Method/Path | `POST /keys/v1/rotate` | Y | Initiate rotation (planned) | JSON body |
| `Authorization` | header | Y | `Bearer <JWT>` (SecOps) | Role: `security:keys.rotate` |
| `x-tenant-id` | header | O | If tenant-scoped keys | Matches policy |
| `algorithm` | enum | O | `Ed25519` (default) \| `ES256` | — |
| `activateAt` | timestamp | O | Planned activation time (UTC) | ≥ now+5m |
| `dualVerifyWindow` | duration | O | e.g., `14d` | policy bounds |
| `reason` | string | O | Rotation rationale | ≤ 256 chars |
| `idempotency-key` | header | O | De-dupe | ≤ 128 chars |
Operations
- `POST /keys/v1/activate` `{kid}` — force activate `kid_new` now (emergency).
- `POST /keys/v1/retire` `{kid}` — set `kid_old` verify-only / disable sign.
- `GET /.well-known/jwks.json` — public keys (Key Registry).
- `GET /keys/v1/status` — signer active `kid`, registry freshness, next rotation date.
Output Specifications¶
202 Accepted / 200 OK
| Field | Type | Description |
|---|---|---|
| `kidOld` | string | Previously active key id |
| `kidNew` | string | New key id to activate |
| `activateAt` | timestamp | Planned activation |
| `dualVerifyWindow` | string | Duration (e.g., `P14D`) |
| `status` | enum | `Planned` \| `Activating` \| `Active` \| `Retiring` \| `Retired` |
Key.Rotated Event (concept)
{
"tenant": "platform",
"kidOld": "int-key-2025-07",
"kidNew": "int-key-2025-10",
"activatedAt": "2025-10-22T11:00:00Z",
"dualVerifyUntil": "2025-11-05T11:00:00Z"
}
Error Handling¶
Error Scenarios¶
| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Unsupported algorithm; invalid `activateAt`/`dualVerifyWindow` | Correct request | No retry until fixed |
| 401 | Missing/invalid JWT | Acquire valid token | Retry after renewal |
| 403 | Missing `security:keys.rotate` | Request proper role | — |
| 404 | `kid_old` not found; JWKS endpoint not available | Verify ids/registry | — |
| 409 | Rotation already in progress; multiple active signers detected | Wait or cancel/prune; ensure single active signer | Retry after fix |
| 412 | `If-Match` on signer version mismatch | Fetch status; retry with latest | Conditional retry |
| 422 | Dual-verify window outside policy bounds | Adjust window | — |
| 429 | Excessive rotation attempts | Honor `Retry-After` | Backoff + jitter |
| 503 | KMS/Registry unavailable | Wait for recovery | Idempotent retry |
Failure Modes¶
- Verifier cache staleness: transient verify failures for `kid_new` until JWKS refreshed; verifiers must re-fetch on `unknown_kid`.
- Key compromise: emergency path; disable signing for `kid_old` immediately, maintain verify-only if proofs still need validation, else revoke and mark proofs inconclusive with remediation guidance.
- Clock skew: activation timestamps are UTC; signer defers switch until `now ≥ activateAt + safetyMargin`.
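The clock-skew guard in the last bullet reduces to a simple predicate on the signer's clock; the 30 s default margin here is illustrative (the real value would come from something like `KEY_ROTATION_SAFETY_MARGIN_SEC`):

```python
def should_activate(now_utc: float, activate_at: float,
                    safety_margin_sec: float = 30.0) -> bool:
    """Signer switches to kid_new only once now >= activateAt + safetyMargin,
    tolerating modest clock skew between scheduler and signer."""
    return now_utc >= activate_at + safety_margin_sec

T0 = 1_700_000_000.0  # epoch seconds, UTC
assert not should_activate(T0 + 10, activate_at=T0)  # still inside the margin
assert should_activate(T0 + 60, activate_at=T0)
```

Deferring rather than racing the timestamp means a slightly fast signer clock can never seal with `kid_new` before verifiers could have fetched it.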
Recovery Procedures¶
- On unknown `kid` verification failures, force JWKS refresh and reprocess.
- If 409 (multiple active signers), demote extras to verify-only and audit the window.
- For 503, pause activation and retry KMS/Registry operations with backoff.
Performance Characteristics¶
Latency Expectations¶
- Signer key load & switchover ≤ 60 s from activation time.
- JWKS refresh propagation to verifiers within TTL (e.g., ≤ 10 min).
Throughput Limits¶
- JWKS endpoint sized for spike during rotation; CDN cache recommended.
Resource Requirements¶
- Minimal CPU; network I/O for JWKS distribution; signer maintains small in-memory key cache.
Scaling Considerations¶
- Stage keys ahead of activation; pre-warm caches by triggering background JWKS fetch on publish.
- Stagger regional activations to limit burst load.
Security & Compliance¶
Authentication¶
- SecOps endpoints protected by OIDC + fine-grained RBAC; service-to-service mTLS optional.
Authorization¶
- Roles: `security:keys.rotate`, `security:keys.activate`, `security:keys.retire`, `security:keys.read`.
Data Protection¶
- Private keys never leave KMS/HSM; signing via KMS APIs or HSM PKCS#11.
- JWKS served over HTTPS with integrity headers; include `kid`, `alg`, `use`.
Compliance¶
- All key lifecycle changes audited (who, when, why, diff).
- Backward verification preserved: historical signatures tied to archived public keys and validity windows.
Monitoring & Observability¶
Key Metrics¶
| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| `signer_active_kid` | gauge (label) | Current signer `kid` | Change outside window |
| `verify_unknown_kid_total` | counter | Verifications failing due to unknown `kid` | > 0 sustained |
| `jwks_cache_age_seconds` | gauge | Age of verifier key cache | > TTL |
| `key_rotation_events_total` | counter | Rotations/emergencies | Annotate releases |
| `sign_failures_total` | counter | Signing errors post-activation | > 0 |
Logging Requirements¶
- Structured logs: `kidOld`, `kidNew`, `activateAt`, `actor`, `status`, `reason`, `region`. No private key material.
Distributed Tracing¶
- Spans: `kms.create`, `registry.publish`, `signer.activate`, `verifier.refresh`.
- Attributes: `kid`, `alg`, `dualVerifyWindow`, `region`.
Health Checks¶
- Readiness: signer can load `kid_new`; registry reachable.
- Liveness: signer reports the active `kid`; verification path succeeds with both keys during the window.
Operational Procedures¶
Deployment¶
- Ensure the signer supports dynamic `kid` reload; deploy the Registry with a JWKS endpoint.
- Test a canary rotation in staging with synthetic seals and verifications.
- Schedule production rotation with maintenance window & comms.
Configuration¶
- Env: `SIGNING_ACTIVE_KID`, `KEY_ROTATION_SAFETY_MARGIN_SEC`, `JWKS_CACHE_TTL_SEC`, `DUAL_VERIFY_WINDOW_DEFAULT`.
- Policy: rotation cadence, emergency procedures, window bounds.
Maintenance¶
- Archive decommissioned public keys and manifests; keep for lifetime of signed data.
- Regularly validate that verifiers honor the `unknown_kid` → refresh path.
Troubleshooting¶
- Spike in `verify_unknown_kid_total` → verify JWKS TTL, CDN invalidation, clock skew.
- Signing failures post-activation → confirm KMS grants, key state, signer reload status.
- Conflicting actives → audit deployment orchestrations; enforce single active signer guard.
Testing Scenarios¶
Happy Path Tests¶
- Plan → publish → activate `kid_new`; verify new seals validate with both keys during the window.
- Post-window, verify historical proofs with `kid_old` and new proofs with `kid_new`.
Error Path Tests¶
- 400 invalid algorithm/time; 404 unknown `kid`; 409 rotation already in progress.
- 503 KMS/Registry outage causes graceful delay and retries.
Performance Tests¶
- JWKS propagation within TTL; negligible signing latency change.
- High verification traffic during rotation does not exceed registry capacity.
Security Tests¶
- Private keys never leave KMS; signer only holds handles.
- Emergency rotation disables signing for `kid_old` immediately; verify-only allowed as policy dictates.
Related Documentation¶
Internal References¶
- Integrity Structures
- Audit Record Integrity Chain Flow
- Integrity Verification Flow
- Tamper Detection Flow
External References¶
- JWS (RFC 7515) / JWKS (RFC 7517)
- COSE (RFC 8152)
Appendices¶
A. Example Problem+JSON (rotation conflict)¶
{
"type": "urn:connectsoft:errors/keys/rotation.conflict",
"title": "Another rotation is already in progress",
"status": 409,
"detail": "Active signer kid is already scheduled to rotate at 2025-10-22T11:00:00Z.",
"traceId": "9f0c1d2e3a4b5c6d..."
}
B. JWKS Example¶
{
"keys": [
{"kty":"OKP","crv":"Ed25519","kid":"int-key-2025-10","use":"sig","alg":"EdDSA","x":"lJp..."},
{"kty":"OKP","crv":"Ed25519","kid":"int-key-2025-07","use":"sig","alg":"EdDSA","x":"h3Q...", "status":"verify-only","notAfter":"2025-11-05T11:00:00Z"}
]
}
Retry Flow¶
Executes resilient retries with exponential backoff + jitter to achieve safe at-least-once delivery semantics. Failed operations are scheduled by the Retry Service, executed when due, and on terminal failure are DLQ-routed with full context. All retryable work must be idempotent via an idempotencyKey.
Overview¶
Purpose: Increase robustness of transient or downstream-dependent operations by automated retries with guardrails, while preventing thundering herds via jitter and honoring tenant backpressure.
Scope: Scheduling, backoff calculation, jitter, execution, success/failure reporting, DLQ routing, observability. Excludes business-specific compensation (see Compensation Flow).
Context: Sits alongside Ingestion, Export, Projection, etc. Services emit retryable tasks to the Retry Service; on success the original workflow continues; on terminal failure the task is routed to DLQ for manual/automated handling.
Key Participants:
- Producer Service (emits retryable work)
- Retry Service (scheduler + executor)
- Target Service (downstream dependency being called)
- DLQ / Review Tool (terminal task handling)
- Event Bus / Metrics
Prerequisites¶
System Requirements¶
- Retry Service deployed with durable queue and time-based scheduling
- Clock synchronized (UTC); stable monotonic timers
- Network egress to Target Services; circuit breaker library available
- Idempotent endpoints or idempotency keys supported by Target Services
Business Requirements¶
- Per-tenant retry policies (maxAttempts, baseDelay, cap, jitter, retryable codes)
- DLQ process and ownership defined (runbook, on-call group)
- Data minimization for task payloads; no sensitive values in logs
Performance Requirements¶
- p95 schedule-to-execute latency within ±1s of due time under nominal load
- Executor throughput sized to peak retry storms; global and per-tenant caps
- Backpressure signals honored (reduce concurrency, extend delays)
Sequence Flow¶
Happy Path¶
sequenceDiagram
autonumber
participant P as Producer Service
participant R as Retry Service (Scheduler/Executor)
participant T as Target Service
participant BUS as Event Bus / Metrics
P->>R: POST /retries/v1/schedule {task, idempotencyKey, policy}
R->>R: Persist task, compute delay = backoff(attempt=1)+jitter
R->>R: Enqueue for due time
R->>T: (when due) Execute task with idempotencyKey
T-->>R: 200 OK (or success code)
R->>BUS: Emit Retry.Succeeded {taskId, attempts}
R-->>P: 201 Created {taskId, status:"Scheduled"}
Alternative Paths¶
- Transient failure: Target returns a retryable error → `attempt++`, recompute delay with jitter → reschedule until success or `maxAttempts`.
- Immediate retry hints: Target returns `Retry-After` → override the computed delay (bounded by policy).
- Work dedupe: if the `idempotencyKey` was seen recently, the executor skips duplicate execution and marks the task Succeeded (idempotent).
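The reschedule logic described above (exponential backoff with full jitter, plus a policy-bounded `Retry-After` override) can be sketched as follows. The policy field names (`baseDelayMs`, `multiplier`, `maxDelayMs`) come from the spec; the function name and defaults are illustrative.

```python
import random

def next_delay_ms(attempt, policy, retry_after_ms=None):
    """Delay before the given attempt (1-based).

    Exponential backoff with full jitter; a Retry-After hint from the
    target overrides the computed delay but is bounded by the policy cap.
    """
    base = policy["baseDelayMs"]
    cap = policy["maxDelayMs"]
    mult = policy.get("multiplier", 2.0)
    if retry_after_ms is not None:
        return min(cap, retry_after_ms)      # honor hint, bounded by policy
    delay = min(cap, base * (mult ** (attempt - 1)))
    return random.uniform(0, delay)          # full jitter in [0, delay]
```

With the example policy (`baseDelayMs=250`, `multiplier=2`, `maxDelayMs=60000`), attempt 4 computes a ceiling of 2000 ms and jitters within it.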
Error Paths¶
sequenceDiagram
participant P as Producer
participant R as Retry Service
participant D as DLQ
P->>R: Schedule task {malformed}
alt 400 Bad Request
R-->>P: 400 Problem+JSON
else Task not found / status query bad id
R-->>P: 404 Not Found (Problem+JSON)
else Update while executing
R-->>P: 409 Conflict (Problem+JSON)
end
R->>R: Execute attempt N (last allowed)
R->>D: Route to DLQ {task, lastError, attempts=N}
Request/Response Specifications¶
Input Requirements¶
| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| Method/Path | `POST /retries/v1/schedule` | Y | Schedule a retryable task | JSON body |
| `Authorization` | header | Y | `Bearer <JWT>` | Valid |
| `x-tenant-id` | header | Y | Tenant scope | Matches policy |
| `traceparent` | header | O | W3C trace context | 55-char |
| `task.type` | string | Y | Logical task kind (e.g., `Export.Callback`) | allowlist |
| `task.payload` | object | Y | Minimal inputs to re-execute | Size ≤ policy cap |
| `idempotencyKey` | string | Y | De-dupes executions | ≤ 128 chars |
| `policy` | object | O | Override defaults | See below |
Policy Overrides (optional)
| Field | Type | Description |
|---|---|---|
| `maxAttempts` | int | e.g., 6 (including the first attempt) |
| `baseDelayMs` | int | e.g., 250 |
| `multiplier` | number | e.g., 2.0 (exponential) |
| `maxDelayMs` | int | Cap, e.g., 60_000 |
| `jitter` | enum/number | e.g., `full` |
| `retryable` | array | Retryable status codes / reasons |
Status/Control
- `GET /retries/v1/tasks/{taskId}` → status, attempts, `nextDueAt`
- `POST /retries/v1/tasks/{taskId}:cancel` (if safe)
- `GET /retries/v1/dlq` → items; `POST /retries/v1/dlq/{id}:replay`
Output Specifications¶
201 Created
| Field | Type | Description |
|---|---|---|
| `taskId` | string | ULID/GUID |
| `status` | enum | `Scheduled` |
| `nextDueAt` | timestamp | First attempt due time |
| `policyEffective` | object | Resolved policy |
| `attempt` | int | 1 |
200 OK (Status)
| Field | Type | Description |
|---|---|---|
| `taskId` | string | Id |
| `attempt` | int | Current attempt |
| `nextDueAt` | timestamp? | Null if running/completed |
| `state` | enum | `Running` \| `Succeeded` \| `Failed` \| `DLQ` |
| `lastError` | object? | `{code, reason, ts}` |
Example Payloads¶
// Schedule with policy override
{
"task": {
"type": "Export.Callback",
"payload": {"url":"https://example.com/hook","exportId":"exp_01JF..."}
},
"idempotencyKey": "exp_01JF...:callback",
"policy": {"maxAttempts": 6, "baseDelayMs": 500, "multiplier": 2, "maxDelayMs": 60000, "jitter":"full"}
}
// Status response
{
"taskId": "rtk_01JF...",
"state": "Running",
"attempt": 3,
"nextDueAt": "2025-10-22T11:14:25Z",
"lastError": {"code":"HTTP_503","reason":"Upstream unavailable","ts":"2025-10-22T11:12:13Z"}
}
Error Handling¶
Error Scenarios¶
| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Malformed task/policy; payload too large | Fix request | — |
| 401 | Missing/invalid JWT | Renew token | Retry after renewal |
| 403 | Caller lacks `retry:schedule` | Acquire role/scope | — |
| 404 | Unknown `taskId` | Verify id | — |
| 409 | Update/cancel during execution window | Wait for state change | Retry after fix |
| 412 | `If-Match` version mismatch on update | Fetch latest, retry | Conditional retry |
| 422 | Non-idempotent target / policy disallowed | Change endpoint/policy | — |
| 429 | Per-tenant/global throttle exceeded | Honor `Retry-After` | Backoff + jitter |
| 503 | Scheduler/Executor dependency down | Wait for recovery | Idempotent reschedule |
Failure Modes¶
- Poison task: repeatedly fails with non-retryable error → immediate DLQ.
- Retry storm: global backoff and concurrency caps applied; jitter widened.
- Clock skew: due times computed in UTC; executor compares with monotonic clock guard.
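The poison-task rule above (non-retryable errors go straight to DLQ, retryable ones reschedule until `maxAttempts`) can be sketched as a small dispatch decision. This is an illustrative Python sketch; the default retryable set is an assumption, not the spec.

```python
def dispose(attempt, status_code, policy):
    """Decide the fate of a failed attempt.

    Non-retryable errors (poison tasks) go to DLQ immediately;
    retryable errors reschedule until maxAttempts is reached.
    """
    retryable = set(policy.get("retryable", [429, 500, 502, 503, 504]))
    if status_code not in retryable:
        return "dlq"    # poison task: non-retryable error
    if attempt >= policy.get("maxAttempts", 6):
        return "dlq"    # retries exhausted
    return "retry"
```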
Recovery Procedures¶
- Inspect DLQ item; fix root cause; replay via DLQ endpoint.
- Adjust policy (raise cap, widen backoff) for transient incidents.
- Use the `idempotencyKey` to ensure safe replays.
Performance Characteristics¶
Latency Expectations¶
- Schedule-to-execute drift p95 ≤ 1s at steady load; may widen under backpressure.
Throughput Limits¶
- Executor concurrency: per-tenant & global caps to protect downstreams.
- Batched scheduling & due-time bucketing for high-volume workloads.
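Due-time bucketing can be sketched as grouping tasks by a floored due timestamp so the executor dequeues whole buckets at once; a minimal illustrative sketch (names and bucket size are assumptions):

```python
from collections import defaultdict

def bucket_key(due_at_ms, bucket_ms=1000):
    """Floor a due time to its scheduling bucket (1 s buckets by default)."""
    return due_at_ms - (due_at_ms % bucket_ms)

def bucket_tasks(tasks, bucket_ms=1000):
    """Group (taskId, dueAtMs) pairs into due-time buckets for batched dequeue."""
    buckets = defaultdict(list)
    for task_id, due_at_ms in tasks:
        buckets[bucket_key(due_at_ms, bucket_ms)].append(task_id)
    return dict(buckets)
```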
Resource Requirements¶
- Lightweight CPU; memory for queues; persistent storage for tasks and attempts.
Scaling Considerations¶
- Shard by tenant/time buckets; use decorrelated jitter to reduce synchronization.
- Propagate Retry-After and circuit-breaker state into backoff calculation.
Security & Compliance¶
Authentication¶
- OIDC JWT at Gateway; service accounts for producers.
Authorization¶
- Roles: `retry:schedule`, `retry:read`, `retry:cancel`, `retry:dlq.read`, `retry:dlq.replay`.
- Enforce tenant RLS via `x-tenant-id`.
Data Protection¶
- Store minimal payloads; encrypt at rest; no secrets in task payloads—use references (e.g., secret ids).
Compliance¶
- All attempts and state transitions audited with actor, reason, and outcomes.
Monitoring & Observability¶
Key Metrics¶
| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| `retry_scheduled_total` | counter | Tasks scheduled | Trend |
| `retry_attempts_total` | counter | Attempts made | Sudden surge |
| `retry_success_total` | counter | Completed via retry | — |
| `retry_dlq_total` | counter | Routed to DLQ | > baseline |
| `retry_delay_applied_ms` | histogram | Backoff + jitter | p95 sanity |
| `executor_concurrency` | gauge | Active workers | Cap breaches |
Logging Requirements¶
- Structured logs per attempt: `taskId`, `tenant`, `attempt`, `delayMs`, `jitterMs`, `code`, `reason`. No sensitive payloads.
Distributed Tracing¶
- Spans: `retry.schedule`, `retry.execute`, `retry.backoff`, `dlq.route`.
- Attributes: `attempt`, `delayMs`, `policyId`, `idempotencyKey`.
Health Checks¶
- Readiness: queue store reachable; scheduler tick healthy.
- Liveness: executor draining; no stuck partitions.
Operational Procedures¶
Deployment¶
- Deploy Scheduler and Executor; configure queues/stores.
- Register retry policies per tenant; validate with synthetic faults.
Configuration¶
- Env: `RETRY_MAX_CONCURRENCY`, `RETRY_DEFAULT_POLICY`, `RETRY_MAX_PAYLOAD_BYTES`, `RETRY_STORM_GUARD_MULTIPLIER`.
- Policy: retryable codes map (HTTP/gRPC), base delays, caps, jitter mode.
Maintenance¶
- Periodically purge completed tasks; archive DLQ with retention.
- Tune jitter/backoff from incident postmortems.
Troubleshooting¶
- DLQ spike → inspect non-retryable reasons; verify idempotency at Target.
- Drift in due execution → check scheduler lag and backpressure controls.
- Duplicate side effects → confirm the Target honors the `idempotencyKey`.
Testing Scenarios¶
Happy Path Tests¶
- Target 503 twice then 200 → attempts increase, success within policy, no DLQ.
- `Retry-After` honored to override the computed delay.
Error Path Tests¶
- 400 malformed schedule; 404 unknown task; 409 modify during run.
- 422 when endpoint marked non-idempotent; 429/503 backoff honored.
Performance Tests¶
- High-volume storm—executor respects caps; jitter spreads load.
- p95 schedule-to-execute ≤ 1s under nominal load.
Security Tests¶
- RBAC enforced; cross-tenant access blocked.
- No secrets in logs/payloads; encryption at rest verified.
Related Documentation¶
Internal References¶
External References¶
- RFC 7807 (Problem Details)
- W3C Trace Context
Appendices¶
A. Backoff Formula (examples)¶
- Exponential: `delay = min(maxDelay, base * (multiplier^(attempt-1))) + jitter`
- Decorrelated jitter: `sleep = min(maxDelay, random(base, sleep*3))`
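The decorrelated-jitter recurrence can be written out directly; a minimal sketch (parameter defaults are illustrative, taken from the example policy values):

```python
import random

def decorrelated_sleep(prev_sleep_ms, base_ms=250, max_delay_ms=60000):
    """One step of decorrelated jitter: sleep = min(maxDelay, random(base, prev*3))."""
    return min(max_delay_ms, random.uniform(base_ms, prev_sleep_ms * 3))
```

Feeding each sleep back in as `prev_sleep_ms` keeps successive delays spread out without synchronizing retries across clients.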
B. Example Problem+JSON (policy violation)¶
{
"type": "urn:connectsoft:errors/retry/policy.invalid",
"title": "Retry policy invalid",
"status": 422,
"detail": "Endpoint requires idempotency but idempotencyKey was not provided.",
"traceId": "9f0c1d2e3a4b5c6d..."
}
Dead Letter Queue Flow¶
Operational path to triage, diagnose, and replay messages that exhausted retries or failed with non-retryable errors. Ensures no duplicate side effects by requiring idempotent targets and preserving the original idempotencyKey during replay. Provides auditability, metrics, and safe deletion/quarantine.
Overview¶
Purpose: Restore messages from failure to success with controlled, observable, and compliant procedures.
Scope: DLQ item listing, inspection, annotation, fix/runbook execution, safe replay (single/bulk), quarantine or delete, and auditing. Excludes business-side compensation (see Compensation Flow).
Context: DLQ is fed by Retry Service and other producers. Replay Tool orchestrates re-submission to the Target Service using at-least-once semantics with idempotency guarantees.
Key Participants:
- Operator / SRE (triage & action)
- API Gateway (authN/Z, tenancy)
- DLQ Store (dead letters, metadata)
- Replay Tool / DLQ Service (orchestrates fix & replay)
- Target Service (original destination)
- Runbook/Knowledge Base (known-error fixes)
- Observability (metrics, logs, alerts)
- Audit/Event Bus (operator actions, outcomes)
Prerequisites¶
System Requirements¶
- DLQ store with durable retention and per-tenant partitioning
- Replay Tool has network access to Target Service(s)
- Original endpoint supports idempotency keys or is side-effect free
- Circuit breaker and rate limits configured for replay traffic
Business Requirements¶
- Runbooks for top failure signatures (e.g., mapping fixes, schema bumps)
- Role-based access for DLQ operations with approvals where needed
- Data minimization policies for viewing payloads (mask PII by default)
Performance Requirements¶
- Listing/inspect p95 ≤ 200 ms per page
- Replay throughput bounded (tenant/global) to protect targets
- Batch replay progress reporting and partial-failure handling
Sequence Flow¶
Happy Path¶
sequenceDiagram
autonumber
actor OP as Operator
participant GW as API Gateway
participant DLQ as DLQ Service/Store
participant RB as Runbook/KB
participant RT as Replay Tool
participant T as Target Service
participant AUD as Audit/Event Bus
OP->>GW: GET /ops/v1/dlq?filters… (search & select item)
GW->>DLQ: Query items (tenant, filters)
DLQ-->>GW: Page of items
OP->>GW: GET /ops/v1/dlq/{id} (inspect, view masked payload, lastError)
GW->>DLQ: Fetch item + metadata
DLQ-->>GW: Item + recommended runbook link
OP->>RB: Follow runbook, apply fix (config/schema/data)
OP->>GW: POST /ops/v1/dlq/{id}:replay {mode:"safe"}
GW->>RT: Orchestrate replay (authZ, tenancy)
RT->>T: Re-submit with original idempotencyKey/payload
T-->>RT: 200 OK (idempotent success)
RT->>DLQ: Mark Resolved, attach replay transcript
RT->>AUD: Emit DLQ.Replayed {id, attempts, actor, outcome}
GW-->>OP: 200 OK {status:"Replayed", transcriptUrl}
Alternative Paths¶
- Bulk replay: the operator selects a query window/signature and triggers `:bulk-replay` with concurrency caps.
- Quarantine: the item is moved to a separate queue to prevent accidental replay while investigation continues.
- Redrive to alternative endpoint: route to a newer API version when the original is deprecated (policy-gated).
Error Paths¶
sequenceDiagram
actor OP as Operator
participant GW as API Gateway
participant DLQ as DLQ Service
participant RT as Replay Tool
OP->>GW: POST /ops/v1/dlq/{id}:replay
alt 400 Bad Request (invalid mode/filters)
GW-->>OP: 400 Problem+JSON
else 404 Not Found (unknown item)
GW-->>OP: 404 Problem+JSON
else 409 Conflict (item locked/by another replay)
GW-->>OP: 409 Problem+JSON
else 422 Unprocessable (target non-idempotent, policy forbids)
GW-->>OP: 422 Problem+JSON
else 429/503 (rate limit/dependency down)
GW-->>OP: 429/503 Problem+JSON (+Retry-After)
end
Request/Response Specifications¶
Input Requirements¶
| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| List | `GET /ops/v1/dlq` | Y | List DLQ items | Pagination with `page.after`, limit ≤ 200 |
| Inspect | `GET /ops/v1/dlq/{id}` | Y | Fetch one item | `{id}` ULID/GUID |
| Replay | `POST /ops/v1/dlq/{id}:replay` | Y | Re-submit safely | JSON body |
| Bulk Replay | `POST /ops/v1/dlq:bulk-replay` | O | Replay by filter | JSON body |
| Quarantine | `POST /ops/v1/dlq/{id}:quarantine` | O | Move to quarantine | — |
| Delete | `DELETE /ops/v1/dlq/{id}` | O | Drop after approval | Policy-gated |
| `Authorization` | header | Y | `Bearer <JWT>` | Role: DLQ ops |
| `x-tenant-id` | header | Y | Tenant scope | RLS enforced |
| `traceparent` | header | O | W3C trace | 55-char |
| `idempotencyKey` | string | O | Override if missing | ≤ 128 chars |
| `mode` | enum | O | `safe` (default) \| `force` | — |
DLQ Item (shape)
| Field | Description |
|---|---|
| `id` | DLQ item id |
| `source` | Producer (service/flow) |
| `target` | Endpoint/service intended |
| `payload` | Masked by default (toggle with RBAC) |
| `idempotencyKey` | Original key (if any) |
| `attempts` | Attempts made |
| `firstSeenAt` / `lastErrorAt` | Timestamps |
| `lastError` | `{code, reason, traceId}` |
| `annotations[]` | Operator notes |
| `status` | `Pending` \| `Quarantined` \| `Replayed` \| `Deleted` |
Output Specifications¶
200 OK (Inspect)
| Field | Type | Description |
|---|---|---|
| `item` | object | DLQ item |
| `recommendedRunbook` | url | Link to doc |
| `replayEligible` | bool | True if idempotent & policy allows |
| `warnings[]` | array | E.g., "missing idempotencyKey" |
200 OK (Replay)
| Field | Type | Description |
|---|---|---|
| `status` | enum | `Replayed` \| `InProgress` \| `Quarantined` |
| `transcriptUrl` | url | Steps & outcomes |
| `attempt` | int | Attempt count after replay |
| `effectiveIdempotencyKey` | string | Key used |
Example Payloads¶
// Replay request (safe)
POST /ops/v1/dlq/01JF...:replay
{
"mode": "safe",
"notes": "Fixed mapping for resourceType=Invoice; re-submitting."
}
// DLQ item (inspect response excerpt)
{
"id": "01JF…",
"source": "Ingestion.Consumer",
"target": "Storage.Append",
"idempotencyKey": "ar:01JF…",
"attempts": 6,
"lastError": {"code":"HTTP_422","reason":"Schema validation failed"},
"replayEligible": true
}
Error Handling¶
Error Scenarios¶
| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Invalid filters, mode, or bulk selection too large | Fix request/trim selection | — |
| 401 | Missing/invalid JWT | Acquire valid token | Retry after renewal |
| 403 | Lacks `dlq:operate` or PII unmask permission | Request proper role | — |
| 404 | DLQ item not found | Refresh list; verify id/tenant | — |
| 409 | Item locked by another operator/replay in progress | Wait or take the lock after TTL | Retry after unlock |
| 412 | `If-Match` version mismatch on annotate/delete | Refetch item; retry with latest | Conditional retry |
| 422 | Replay blocked (non-idempotent target / missing key) | Provide key or route to compensation | — |
| 429 | Replay throughput cap exceeded | Honor `Retry-After` | Backoff + jitter |
| 503 | DLQ store or target unavailable | Wait for recovery | Idempotent replay later |
Failure Modes¶
- Duplicate side effects risk: target not idempotent or key missing → block replay unless `force` with executive approval; log and audit.
- Payload drift: original payload is stale after a schema change → the tool offers an auto-migrate transform preview before replay.
- Replay storm: bulk selection triggers target throttling → tool enforces per-tenant QPS caps and adaptive backoff.
- PII exposure: viewing raw payload requires elevated RBAC; otherwise masked.
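The duplicate-side-effects guard above can be sketched as a replay-eligibility gate; the item fields (`targetIdempotent`, `idempotencyKey`) are an assumed shape for illustration, not the ATP contract.

```python
def replay_decision(item, mode="safe", force_allowed=False):
    """Gate a DLQ replay; returns (allowed, warnings).

    Blocks replay to non-idempotent targets unless force mode is
    explicitly permitted (the 422 path described above).
    """
    warnings = []
    if not item.get("idempotencyKey"):
        warnings.append("missing idempotencyKey")
    idempotent = bool(item.get("targetIdempotent")) and bool(item.get("idempotencyKey"))
    if idempotent:
        return True, warnings
    if mode == "force" and force_allowed:
        warnings.append("force replay of non-idempotent target; audited")
        return True, warnings
    return False, warnings   # replay blocked by policy (422)
```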
Recovery Procedures¶
- If 422, attempt payload migration using versioned transformers; retry in `safe` mode.
- If 409, wait for the lock TTL or coordinate via on-call; avoid parallel replay.
- For 503/429, the tool pauses and resumes respecting backoff and circuit breaker state.
Performance Characteristics¶
Latency Expectations¶
- Inspect/list p95 ≤ 200 ms; single replay end-to-end typically ≤ 2 s (excluding target latency).
Throughput Limits¶
- Default bulk replay ≤ 50 msg/s per tenant (configurable), global cap to protect targets.
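The per-tenant replay cap can be sketched as a token bucket (illustrative Python, not ATP code; the class and parameter names are assumptions):

```python
import time

class TokenBucket:
    """Per-tenant rate cap for bulk replay (e.g., 50 msg/s)."""

    def __init__(self, rate_per_s, burst=None, now=None):
        self.rate = float(rate_per_s)
        self.capacity = float(burst if burst is not None else rate_per_s)
        self.tokens = self.capacity
        self.last = time.monotonic() if now is None else now

    def try_acquire(self, now=None):
        now = time.monotonic() if now is None else now
        # Refill proportionally to elapsed time, capped at bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False   # caller backs off (the replay tool's 429-equivalent)
```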
Resource Requirements¶
- Light CPU/IO for listing; replay workers sized to throughput; encrypted storage for transcripts.
Scaling Considerations¶
- Shard DLQ by tenant and creation time; support cursor-based pagination; parallel workers with per-target concurrency.
Security & Compliance¶
Authentication¶
- OIDC JWT at Gateway; service tokens for replay to targets.
Authorization¶
- Roles: `dlq:read`, `dlq:operate`, `dlq:quarantine`, `dlq:delete`, `dlq:pii.unmask`.
- Fine-grained approvals required for `mode=force` and deletions.
Data Protection¶
- Payloads masked by default; unmask requires explicit action (with purpose-of-use).
- Transcripts and payload snapshots encrypted at rest; presigned URLs short-lived.
Compliance¶
- All DLQ actions are audited (who, what, why, before/after, result).
- Retention for DLQ items and transcripts aligns with tenant policy.
Monitoring & Observability¶
Key Metrics¶
| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| `dlq_items_total` | gauge | Current DLQ size (by tenant) | Rising trend |
| `dlq_oldest_age_seconds` | gauge | Age of oldest item | > SLO |
| `dlq_replay_success_total` | counter | Successful replays | Track rate |
| `dlq_replay_failure_total` | counter | Failed replays | Spike alert |
| `dlq_quarantine_total` | counter | Items quarantined | Investigate |
| `dlq_bulk_replay_inflight` | gauge | Active bulk operations | Cap breaches |
Logging Requirements¶
- Structured logs: `tenant`, `dlqId`, `action`, `actor`, `mode`, `outcome`, `idempotencyKey`, `target`, `attempts`. Do not log payload values.
Distributed Tracing¶
- Spans: `dlq.list`, `dlq.inspect`, `dlq.replay`, `dlq.quarantine`.
- Attributes: `bulkSize`, `replayed`, `failed`, `throttled`, `transformVersion`.
Health Checks¶
- Readiness: DLQ store reachable; replay workers healthy.
- Liveness: no stuck locks; bulk runners progressing.
Operational Procedures¶
Deployment¶
- Deploy DLQ Service and Replay Tool; wire to Gateway with RBAC.
- Configure per-tenant throughput caps and masking defaults.
- Validate end-to-end with seeded poison messages.
Configuration¶
- Env: `DLQ_LIST_PAGE_MAX`, `DLQ_REPLAY_QPS_PER_TENANT`, `DLQ_GLOBAL_QPS_CAP`, `DLQ_LOCK_TTL_SEC`, `TRANSFORMER_DEFAULT_VERSION`.
- Policy: allowed `force` operations, deletion approvals, payload unmask rules.
Maintenance¶
- Periodic purge/archival of resolved items; rotate transcript encryption keys.
- Review top failure signatures and update runbooks/transformers.
Troubleshooting¶
- Duplicates observed → verify target idempotency and keys; disable force path.
- Bulk replay throttled → reduce concurrency or expand caps with approval.
- Payload migration errors → roll back transformer version and fix mapping.
Testing Scenarios¶
Happy Path Tests¶
- Inspect → apply mapping fix → safe replay succeeds; DLQ item resolved.
- Bulk replay with 5,000 items respects QPS caps and completes with transcript.
Error Path Tests¶
- 400 invalid filters; 404 unknown id; 409 locked item; 422 non-idempotent blocked.
- 429/503 backoff honored; operation resumes and completes.
Performance Tests¶
- Listing p95 ≤ 200 ms at 1M items/tenant (indexed).
- Bulk replay maintains target SLOs under cap.
Security Tests¶
- PII masked by default; unmask requires RBAC + purpose-of-use; all actions audited.
- Deletions require multi-party approval when enabled.
Related Documentation¶
Internal References¶
External References¶
- RFC 7807 (Problem Details)
- W3C Trace Context
Appendices¶
A. Example Problem+JSON (non-idempotent target)¶
{
"type": "urn:connectsoft:errors/dlq/replay.disallowed",
"title": "Replay blocked by policy",
"status": 422,
"detail": "Target endpoint is not idempotent and force mode is disabled for this tenant.",
"traceId": "9f0c1d2e3a4b5c6d..."
}
B. Example Annotation¶
POST /ops/v1/dlq/01JF...:annotate
{
"note": "Fixed customer mapping (CUS-123). Verified with runbook RB-42."
}
Circuit Breaker Flow¶
Contains downstream failures and prevents cascading outages by short-circuiting failing calls, routing to fallbacks/queues, and probing recovery via half-open trials. Exposes clear client signals (headers/status) and integrates with Retry/DLQ to preserve at-least-once semantics.
Overview¶
Purpose: Protect services from unstable dependencies using automated open/half-open/closed state transitions, graceful degradation, and recovery probing.
Scope: Policy configuration, failure/latency detection, state transitions, short-circuit responses, fallback and queueing, recovery probes, client signaling. Excludes business-specific compensation (see Compensation Flow).
Context: Libraries/middleware wrap all client calls to downstreams (HTTP/gRPC/bus). Breaker state may be per-tenant, per-endpoint, per-partition.
Key Participants:
- Caller Service (producer of the downstream call)
- Circuit Breaker (in-process or sidecar)
- Target Service (downstream dependency)
- Fallback/Cache (optional read cache or static responses)
- Retry/DLQ Services (for write/side-effect operations)
- Observability/Config (metrics, alerts, ops overrides)
Prerequisites¶
System Requirements¶
- Circuit breaker library enabled for HTTP/gRPC clients with configurable policies
- Sliding windows for failure rate and slow-call rate with min call thresholds
- Central config and runtime override API (ops) with safe defaults
- Correlation/tracing propagation through fallback paths
Business Requirements¶
- Defined fallback strategy per call type (read: cache; write: enqueue → Retry)
- Tenant- and endpoint-level SLOs to tune thresholds
- Runbook for operator overrides (force-open/close, reset)
Performance Requirements¶
- Wrapper overhead p95 ≤ 1 ms per call (fast path, closed)
- Probe batch size and interval sized to recover quickly without stampedes
- Backpressure headers documented for clients
Sequence Flow¶
Happy Path¶
sequenceDiagram
autonumber
participant C as Caller Service
participant CB as Circuit Breaker
participant T as Target Service
participant F as Fallback/Queue (optional)
C->>CB: Invoke downstream operation
alt State = CLOSED
CB->>T: Forward request
T-->>CB: 200/OK (within latency budget)
CB-->>C: Success (propagate response)
else State = HALF-OPEN (probe window)
CB->>T: Limited probes (N% or fixed concurrency=1..k)
T-->>CB: OK responses exceed threshold
CB-->>C: Success, transition → CLOSED
end
Alternative Paths¶
- Fallback (read): CB returns a cached/derived response with `X-ATP-Circuit-State: open` and `X-ATP-Source: cache`.
- Queue (write): CB enqueues to the Retry Service with an `idempotencyKey` and returns `202 Accepted` (optional Problem+JSON alternative body).
- Partitioned breakers: isolate a bad shard/tenant from healthy traffic.
Error Paths¶
sequenceDiagram
participant C as Caller
participant CB as Circuit Breaker
participant T as Target
participant Q as Retry/DLQ
C->>CB: Invoke downstream operation
alt State = OPEN (short-circuit)
CB-->>C: 503 Service Unavailable
Note right of C: Headers: X-ATP-Circuit-State: open, Retry-After: 5
else State = CLOSED but failure/slow-call triggers thresholds
CB->>T: Request
T-->>CB: 5xx/timeout/slow
CB->>CB: Increment counters, if trip threshold → OPEN
CB->>Q: (write ops) enqueue for retry
CB-->>C: 503/504 or 202 (queued) with Problem+JSON
end
Request/Response Specifications¶
The breaker primarily shapes responses; ops endpoints allow safe overrides.
Input Requirements (Ops)¶
| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| Method/Path | `POST /ops/v1/circuits/{id}:override` | O | Force `open` \| `half-open` \| `closed` with TTL | `{id}` exists |
| `Authorization` | header | Y | `Bearer <JWT>` | Role `ops:circuits` |
| `state` | enum | Y | `open` \| `half-open` \| `closed` | allowlist |
| `ttl` | duration | O | Override duration (e.g., `10m`) | ≤ policy max |
| `notes` | string | O | Reason | ≤ 256 chars |
Output Specifications (Client-Facing)¶
- Closed (success): normal 2xx/OK.
- Open (short-circuited read): `503 Service Unavailable`. Headers: `X-ATP-Circuit-State: open`, `Retry-After: <sec>`, `X-ATP-Circuit-Reason: failure-rate|slow-calls|min-calls-not-met`. Body (Problem+JSON example):
{
"type":"urn:connectsoft:errors/circuit/open",
"title":"Dependency temporarily unavailable",
"status":503,
"detail":"Calls short-circuited by circuit breaker (failure rate > 50% over 20s).",
"retryAfterSeconds":5,
"traceId":"9f0c1d2e3a4b5c6d..."
}
- Open (queued write): `202 Accepted` with `Location: /retries/v1/tasks/{taskId}` and the headers above plus `X-ATP-Queued: true`.
Error Handling¶
Error Scenarios¶
| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Ops override payload invalid (state/ttl) | Fix request | — |
| 401 | Missing/invalid JWT (ops) | Acquire valid token | Retry after renewal |
| 403 | Lacks `ops:circuits` | Request proper role | — |
| 404 | Unknown circuit `{id}` | Verify id/scope | — |
| 409 | Conflicting override/state transition | Clear override or wait for TTL | Retry after fix |
| 412 | `If-Match` on circuit version mismatch | Read latest, retry | Conditional retry |
| 422 | TTL or state not permitted by policy | Adjust inputs | — |
| 429 | Too many overrides/changes | Back off | Jittered retry |
| 503/504 | Short-circuited/open or downstream timeout | Respect headers | Exponential backoff + jitter |
Failure Modes¶
- Min-calls not met: insufficient samples → the breaker stays closed but labels responses with `X-ATP-Circuit-Reason: warmup`.
- Stampede on recovery: too many probes → configure half-open concurrency and jitter.
- Cache staleness: fallback exceeds TTL → downgrade to 503 instead of serving stale beyond policy.
Recovery Procedures¶
- When open, allow half-open after cool-down; probe with limited concurrency.
- Tune thresholds based on SLOs and observed metrics (failure/slow-call rate).
- For write paths, confirm idempotencyKey is propagated before enabling queue mode.
Performance Characteristics¶
Latency Expectations¶
- Added wrapper overhead p95 ≤ 1 ms (closed).
- Half-open probes routed immediately; unaffected calls still short-circuited.
Throughput Limits¶
- Limit concurrent probes (e.g., 1–5) per breaker key; cap queued writes per tenant.
Resource Requirements¶
- In-process counters/timers; optional small shared state for cluster coordination.
Scaling Considerations¶
- Key breaker by `{tenant, endpoint, partition}` to avoid global trips.
- Use decorrelated jitter for cool-down and probe scheduling.
- Optional shared state (e.g., Redis) for multi-instance consistency.
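The decorrelated-jitter scheduling mentioned above can be sketched as follows. This is a minimal illustration in Python (the breaker itself is language-agnostic); `base` and `cap` values are chosen for illustration only:

```python
import random

def decorrelated_jitter(prev_sleep: float, base: float = 0.5, cap: float = 30.0) -> float:
    """Next cool-down/probe delay: random in [base, prev_sleep * 3], capped.

    Spreads probe attempts across instances so that recovery does not
    stampede a just-healed dependency (the "decorrelated jitter" scheme).
    """
    return min(cap, random.uniform(base, prev_sleep * 3))

# Example: successive delays grow but stay bounded and randomized.
sleep = 0.5
delays = []
for _ in range(10):
    sleep = decorrelated_jitter(sleep)
    delays.append(sleep)
assert all(0.5 <= d <= 30.0 for d in delays)
```

Because each instance draws its own delay, probes against a recovering backend are naturally staggered without any cross-instance coordination.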
Security & Compliance¶
Authentication¶
- Client requests authenticated as usual; ops overrides require OIDC JWT and RBAC.
Authorization¶
- Ops roles: `ops:circuits.read`, `ops:circuits.override`, `ops:circuits.reset`.
Data Protection¶
- Headers reveal state but not sensitive internals; avoid leaking backend hostnames.
Compliance¶
- All trips, overrides, and recoveries are audited (who, when, why, thresholds, counts).
Monitoring & Observability¶
Key Metrics¶
| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| `circuit_state{key}` | gauge | 0=closed, 1=half-open, 2=open | Open > 0 sustained |
| `circuit_short_circuits_total` | counter | Calls blocked by open state | Spike alert |
| `circuit_failure_rate` | gauge | Recent failure % | > policy trip |
| `circuit_slow_call_rate` | gauge | Recent slow-call % | > policy trip |
| `circuit_probe_success_total` | counter | Half-open successes | Low during recovery |
| `fallback_invocations_total` | counter | Cache/queue usage | Track degradation |
Logging Requirements¶
- Structured logs: `breakerKey`, `state`, `reason`, `window`, `failRate`, `slowRate`, `probe`, `override`, `actor`, `traceId`.
Distributed Tracing¶
- Tag spans with `circuit.state`, `circuit.reason`, `fallback=true`, `queued=true`; include downstream span links when available.
Health Checks¶
- Readiness: breaker config loaded; counters active.
- Liveness: state machine transitions occur; no stuck half-open beyond TTL.
Operational Procedures¶
Deployment¶
- Enable breaker middleware for all outbound clients; set sane defaults.
- Wire ops API and dashboards; define per-tenant keys.
- Validate with chaos testing (inject 5xx/timeouts).
Configuration¶
- Policy: `{window=20s, minCalls=20, failureRate=50%, slowThreshold=1s, slowRate=50%, cooldown=5s, probe=2}`
- Headers: `X-ATP-Circuit-State`, `X-ATP-Circuit-Reason`, `Retry-After`, `X-ATP-Queued`.
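As an illustration of how a policy like the one above could be evaluated, the trip decision combines the min-calls guard, failure rate, and slow-call rate over the rolling window. This is a sketch, not the production implementation; names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Policy:
    min_calls: int = 20
    failure_rate: float = 0.50    # trip when > 50% of calls fail
    slow_threshold_s: float = 1.0
    slow_rate: float = 0.50       # trip when > 50% of calls are slow

def should_trip(samples: list[tuple[bool, float]], p: Policy) -> tuple[bool, str]:
    """samples: (succeeded, duration_s) pairs within the rolling window."""
    if len(samples) < p.min_calls:
        return False, "warmup"    # surfaced as X-ATP-Circuit-Reason: warmup
    failures = sum(1 for ok, _ in samples if not ok) / len(samples)
    slow = sum(1 for _, d in samples if d > p.slow_threshold_s) / len(samples)
    if failures > p.failure_rate:
        return True, "failure-rate"
    if slow > p.slow_rate:
        return True, "slow-calls"
    return False, "healthy"

assert should_trip([(True, 0.1)] * 10, Policy()) == (False, "warmup")
assert should_trip([(False, 0.1)] * 15 + [(True, 0.1)] * 5, Policy()) == (True, "failure-rate")
```

The returned reason string maps directly onto the `X-ATP-Circuit-Reason` header values listed below in the cheatsheet.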
Maintenance¶
- Review trip analytics weekly; adjust thresholds and probe sizes.
- Rotate cache TTLs for fallbacks per freshness requirements.
Troubleshooting¶
- Frequent opens → inspect dependency SLOs, retry storms, and idempotency.
- No recovery → increase probe window or check downstream health checks.
- Client confusion → verify headers are surfaced at Gateway.
Testing Scenarios¶
Happy Path Tests¶
- Closed → success; zero wrapper overhead regressions.
- Half-open with limited probes transitions to closed after consecutive successes.
Error Path Tests¶
- Trip on failure rate > threshold; open cool-down respected; headers set.
- Read fallback returns cached response with correct state headers.
- Write enqueued returns 202 with `Location` and `idempotencyKey`.
Performance Tests¶
- Probe concurrency prevents stampede; short-circuit path p95 ≤ 1 ms.
- High QPS under open state does not overload queue/cache.
Security Tests¶
- Ops override RBAC enforced; audit trail captured.
- Headers do not leak sensitive backend identifiers.
Related Documentation¶
Internal References¶
External References¶
- RFC 7807 (Problem Details)
- W3C Trace Context
Appendices¶
A. Example Ops Override¶
POST /ops/v1/circuits/tenant:search:index:primary:override
{
"state": "open",
"ttl": "10m",
"notes": "Isolate failing shard while indexers recover."
}
B. Client Header Cheatsheet¶
- `X-ATP-Circuit-State`: `closed|half-open|open`
- `X-ATP-Circuit-Reason`: `failure-rate|slow-calls|override|warmup`
- `Retry-After`: seconds until next probe/cooldown ends
- `X-ATP-Queued`: `true` when write queued for retry
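A client consuming these headers might classify responses along these lines. This is a hypothetical helper sketched in Python; only the status code and a header dict are assumed:

```python
def interpret_circuit_response(status: int, headers: dict) -> dict:
    """Classify a response using the ATP circuit headers (sketch)."""
    state = headers.get("X-ATP-Circuit-State", "closed")
    decision = {"state": state, "reason": headers.get("X-ATP-Circuit-Reason")}
    if headers.get("X-ATP-Queued") == "true":
        decision["action"] = "queued"       # write accepted for later retry (202)
    elif status in (503, 504) and state == "open":
        decision["action"] = "backoff"      # respect Retry-After before retrying
        decision["retry_after_s"] = int(headers.get("Retry-After", "5"))
    else:
        decision["action"] = "proceed"
    return decision

d = interpret_circuit_response(503, {"X-ATP-Circuit-State": "open",
                                     "X-ATP-Circuit-Reason": "failure-rate",
                                     "Retry-After": "5"})
assert d["action"] == "backoff" and d["retry_after_s"] == 5
```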
Compensation Flow¶
Repairs partial failures or out-of-order effects by executing a deterministic, idempotent sequence of inverse actions (e.g., projection rewrites, search index corrections, pointer re-links). Produces a complete audit trail and supports dry-run planning before execution.
Overview¶
Purpose: Restore system invariants when a transaction or workflow completed partially (e.g., append succeeded but projection/index update failed).
Scope: Detection/selection of a failed transaction, plan synthesis, dry-run validation, execution of compensating steps, verification, and audit. Excludes business refunds or external systems remediation (covered by domain runbooks).
Context: Invoked by operators or automation (DLQ/alerts). Coordinates with Projection Service, Search Index, Storage, and Integrity to ensure consistency.
Key Participants:
- Operator / Automation (trigger)
- Compensation Service (planner/executor)
- Storage / Projection / Search Index (targets)
- Audit/Event Bus (actions & outcomes)
- Retry/DLQ (feeder, optional post-fix replay)
Prerequisites¶
System Requirements¶
- Compensation Service deployed with access to Storage, Projections, and Indexes
- Idempotency primitives available (step keys, compare-and-set guards)
- Read-only snapshot capability for dry-run planning
- Time-synchronized environment (UTC), consistent tracing
Business Requirements¶
- Catalog of compensable scenarios and their inverse steps
- Approval policy for destructive operations and bulk compensations
- Masking rules for any payloads surfaced to operators
Performance Requirements¶
- p95 plan synthesis ≤ 500 ms for typical cases
- Batched execution with rate limits to protect targets
- Backpressure-aware executor with progress reporting
Sequence Flow¶
Happy Path¶
sequenceDiagram
autonumber
actor OP as Operator/Automation
participant GW as API Gateway
participant CMP as Compensation Service
participant ST as Storage
participant PR as Projection Service
participant IX as Search Index
participant AUD as Audit/Event Bus
OP->>GW: POST /ops/v1/compensations {txnId|recordId,..., dryRun:true}
GW->>CMP: Create Plan (authN/Z, x-tenant-id)
CMP->>ST: Inspect ground truth (append store)
CMP->>PR: Inspect projection state
CMP->>IX: Inspect index documents
CMP->>CMP: Synthesize plan (ordered idempotent steps)
CMP-->>GW: 200 OK {plan, impact, approvals}
OP->>GW: POST /ops/v1/compensations/{id}:run
GW->>CMP: Execute Plan
CMP->>ST: (if needed) no-op or pointer fix
CMP->>PR: Rewrite/repair projections (CAS by watermark)
CMP->>IX: Reindex specific docs (with version guards)
CMP->>CMP: Verify invariants, mark Completed
CMP->>AUD: Emit Compensation.Completed {id, steps, result}
GW-->>OP: 200 OK {status:"Completed", metrics}
Alternative Paths¶
- Auto-compensation from DLQ: DLQ item contains signature; Compensation Service builds & runs plan before replay.
- Partial plan: execute only safe subset; schedule remaining steps via Retry Service.
- Integrity-first: if integrity proofs affected, run Integrity Verification/re-seal checks before projection/index fixes.
Error Paths¶
sequenceDiagram
actor OP as Operator
participant GW as API Gateway
participant CMP as Compensation Service
OP->>GW: POST /ops/v1/compensations {invalid}
alt 400 Bad Request (invalid scope, missing ids)
GW-->>OP: 400 Problem+JSON
else 404 Not Found (unknown txn/record)
GW-->>OP: 404 Problem+JSON
else 409 Conflict (plan already running / step lock held)
GW-->>OP: 409 Problem+JSON
else 422 Unprocessable (scenario not compensable)
GW-->>OP: 422 Problem+JSON
else 429/503 (rate limit/dependency down)
GW-->>OP: 429/503 Problem+JSON (+Retry-After)
end
Request/Response Specifications¶
Input Requirements¶
| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| Method/Path | `POST /ops/v1/compensations` | Y | Create compensation plan | JSON body |
| `Authorization` | header | Y | `Bearer <JWT>` | Role `ops:compensate` |
| `x-tenant-id` | header | Y | Tenant scope | RLS enforced |
| `traceparent` | header | O | W3C trace context | 55-char |
| `txnId` | string | O* | Transaction/workflow id | ULID/GUID |
| `recordId` | string | O* | Affected record id | ULID/GUID |
| `scope` | object | O | `{from, to, filters}` window | ISO-8601 UTC |
| `dryRun` | bool | O | Only produce plan | default true |
| `strategy` | enum | O | `repair` (default) | `replay` |
| `notes` | string | O | Operator context | ≤ 512 chars |
| `idempotency-key` | header | O | De-dupe | ≤ 128 chars |

- Provide at least one of `txnId`, `recordId`, or `scope`.
Control/Status
- `GET /ops/v1/compensations/{id}` → status, steps, metrics
- `POST /ops/v1/compensations/{id}:run` → execute planned steps
- `POST /ops/v1/compensations/{id}:cancel` → cancel if safe
Output Specifications¶
200 OK (Plan)
| Field | Type | Description | Notes |
|---|---|---|---|
| `id` | string | Plan id | ULID/GUID |
| `steps[]` | array | Ordered idempotent steps | See step shape |
| `impact` | object | Counters by target (proj/index/records) | Estimate |
| `approvalsRequired` | bool | Whether approval gate is needed | Policy-driven |
Step (shape)
{
"stepId": "S1",
"type": "Projection.Rewrite",
"target": {"projection":"AuditEvents","key":"01JF..."},
"idempotencyKey": "cmp:proj:AuditEvents:01JF...",
"precondition": {"watermarkAtLeast":"2025-10-22T10:55:00Z"},
"action": {"rewriteFrom": "storage", "schemaVersion": 3},
"verify": {"projectionMatches":"storageHash"}
}
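The step shape above implies an executor that checks preconditions, skips already-applied steps by idempotency key, and verifies outcomes. A simplified sketch, with all state and helper names illustrative:

```python
def execute_step(step: dict, state: dict, applied: set) -> str:
    """Run one compensation step idempotently (sketch).

    state: current target state, e.g. {"watermark": "...", "value": ...}
    applied: idempotency keys of steps already completed.
    """
    key = step["idempotencyKey"]
    if key in applied:
        return "skipped"                    # safe re-run: no double effect
    wm_required = step["precondition"]["watermarkAtLeast"]
    if state["watermark"] < wm_required:    # ISO-8601 UTC compares lexically
        return "precondition-failed"        # surfaces as HTTP 412 → re-plan
    state["value"] = step["action"]["rewriteFrom"]
    applied.add(key)
    return "completed"

state = {"watermark": "2025-10-22T10:56:00Z", "value": None}
applied: set = set()
step = {"idempotencyKey": "cmp:proj:AuditEvents:01JF...",
        "precondition": {"watermarkAtLeast": "2025-10-22T10:55:00Z"},
        "action": {"rewriteFrom": "storage"}}
assert execute_step(step, state, applied) == "completed"
assert execute_step(step, state, applied) == "skipped"   # idempotent re-run
```

The idempotency-key check is what makes re-running a completed plan a no-op, which the security tests below verify explicitly.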
Example Payloads¶
// Create plan (dry-run) by recordId
{
"tenant": "acme",
"recordId": "01JF3W8KTR2D3WQF3B9R0KJY9Y",
"dryRun": true,
"strategy": "repair",
"notes": "Projection missing due to prior outage."
}
// Execute planned compensation
POST /ops/v1/compensations/01K0...:run
{
"approvalToken": "appr_9c1...",
"concurrency": 8
}
Error Handling¶
Error Scenarios¶
| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Invalid scope; both `txnId` and `recordId` missing; bad timestamps | Fix request | — |
| 401 | Missing/invalid JWT | Acquire valid token | Retry after renewal |
| 403 | Lacks `ops:compensate` or approval missing | Request role/approval | — |
| 404 | Transaction/record not found | Verify ids/window | — |
| 409 | Another plan running on same target; step lock held | Wait/cancel existing | Retry after unlock |
| 412 | Precondition (watermark/version) failed | Refresh state; re-plan | Conditional retry |
| 422 | Scenario not compensable or non-idempotent step detected | Route to manual runbook | — |
| 429 | Throttled by target system | Honor `Retry-After` | Backoff + jitter |
| 503 | Dependency unavailable (Projection/Index/Storage) | Wait or partial run | Idempotent retry later |
Failure Modes¶
- Non-idempotent side effect: step flagged and blocked unless operator uses explicit `force` gate.
- Stale projection: CAS/watermark precondition fails → re-plan with updated state.
- Wide impact plan: bulk changes require staged batches with checkpoints to avoid long locks.
Recovery Procedures¶
- On 412, refresh state and regenerate plan; executor resumes from last completed step.
- If 503/429, executor backs off, persists progress, and continues when healthy.
- For 409, inspect running plan and either merge or cancel the conflicting one.
Performance Characteristics¶
Latency Expectations¶
- Plan (single record) typically ≤ 500 ms; execution dominated by target services latencies.
Throughput Limits¶
- Concurrency governed per target (e.g., `proj=16`, `index=8`) and per-tenant caps.
Resource Requirements¶
- Light CPU for planning; executor memory proportional to batch window.
Scaling Considerations¶
- Shard plans by tenant and time window; use watermarks to ensure deterministic ordering.
- Persist checkpoints every N steps; support resume-after-failure.
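Checkpoint-and-resume can be sketched like this; `save` stands in for the plan store and `run_step` for the target-service call (both hypothetical):

```python
def run_plan(steps, checkpoint, run_step, save, every=2):
    """Execute steps in order, persisting progress every `every` steps so a
    crashed executor resumes from the last checkpoint instead of restarting
    the whole plan (sketch)."""
    start = checkpoint.get("done", 0)
    for i, step in enumerate(steps[start:], start=start):
        run_step(step)
        if (i + 1) % every == 0:
            checkpoint["done"] = i + 1
            save(checkpoint)
    checkpoint["done"] = len(steps)
    save(checkpoint)

executed = []
saves = []
run_plan(["s1", "s2", "s3", "s4", "s5"], {"done": 2},
         executed.append, saves.append, every=2)
assert executed == ["s3", "s4", "s5"]   # resumes after the checkpoint
```

Combined with idempotent steps, resuming from a checkpoint is safe even if the crash happened between a step completing and the checkpoint being written.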
Security & Compliance¶
Authentication¶
- OIDC JWT/OAuth at Gateway; service accounts for inter-service calls.
Authorization¶
- Roles: `ops:compensate.plan`, `ops:compensate.run`, `ops:compensate.cancel`, `ops:compensate.read`.
- Approval tokens required for destructive/bulk plans.
Data Protection¶
- Mask PII in operator views; only show necessary diffs.
- Encrypt transcripts and store with short-lived presigned access.
Compliance¶
- Emit `Compensation.Planned|Started|StepCompleted|Completed|Failed` events with actor, reason, and evidence.
- Plans and transcripts retained per tenant retention policy.
Monitoring & Observability¶
Key Metrics¶
| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| `compensation_plans_total` | counter | Plans created | Trend |
| `compensation_steps_completed_total` | counter | Steps done | — |
| `compensation_failures_total` | counter | Failed steps | > 0 sustained |
| `compensation_runtime_seconds` | histogram | End-to-end duration | p95 > SLO |
| `compensation_blocked_total` | counter | Blocked by preconditions/locks | Spike alert |
Logging Requirements¶
- Structured logs include: `planId`, `tenant`, `stepId`, `type`, `idempotencyKey`, `precondition`, `outcome`, `traceId`. No payload values.
Distributed Tracing¶
- Spans: `plan.synthesize`, `step.execute(type)`, `verify`, `checkpoint`.
- Attributes: `concurrency`, `watermark`, `casVersion`, `affectedCount`.
Health Checks¶
- Readiness: access to Storage/Projection/Index; plan store reachable.
- Liveness: executors progressing; no step stuck beyond timeout.
Operational Procedures¶
Deployment¶
- Deploy Compensation Service with plan store and executor.
- Wire RBAC, approval gates, and observability.
- Seed known scenarios and step templates.
Configuration¶
- Env: `COMP_PLAN_MAX_SCOPE`, `COMP_EXEC_CONCURRENCY`, `COMP_STEP_TIMEOUT`, `COMP_APPROVAL_REQUIRED`.
- Policy: destructive-step approvals; per-target QPS caps; retry/backoff settings.
Maintenance¶
- Review top compensation causes; add detectors to prevent recurrence.
- Tune watermark/CAS policies to reduce 412 conflicts.
Troubleshooting¶
- Frequent 412 → stale state; check projection lag and adjust watermarks.
- High blocked_total → missing approvals or non-idempotent steps; refine templates.
- Long runtimes → lower concurrency or break plan into smaller batches.
Testing Scenarios¶
Happy Path Tests¶
- Plan & run for “missing projection” fixes projection and index, verifies equality to storage.
- DLQ-triggered auto-compensation succeeds, then DLQ replay passes.
Error Path Tests¶
- 400 invalid scope; 404 unknown record/txn; 409 conflicting plan; 422 non-compensable scenario.
- 412 precondition failure reruns after re-plan and completes.
Performance Tests¶
- Batch plan (1k records) executes within rate limits; checkpoints allow resume.
- Executor maintains p95 step time within target under load.
Security Tests¶
- RBAC and approvals enforced; transcripts encrypted; PII masked by default.
- Idempotency verified by re-running completed plan → no additional side effects.
Related Documentation¶
Internal References¶
External References¶
- RFC 7807 (Problem Details)
- W3C Trace Context
Appendices¶
A. Example Problem+JSON (precondition failed)¶
{
"type": "urn:connectsoft:errors/compensation/precondition.failed",
"title": "Watermark precondition failed",
"status": 412,
"detail": "Projection watermark 2025-10-22T11:02:10Z is below required 2025-10-22T11:05:00Z.",
"traceId": "9f0c1d2e3a4b5c6d..."
}
B. Step Type Catalog (excerpt)¶
- `Projection.Rewrite` — rebuild from storage by key with CAS
- `Index.Reindex` — single-doc reindex with version guard
- `Pointer.Relink` — fix correlation/resource pointers with invariants check
- `Event.Replay` — re-emit projection events from checkpoint (idempotent)
Metrics Collection Flow¶
Collects and aggregates golden signals and SLO-aligned KPIs from all platform services using OpenTelemetry (OTel) and Prometheus exposition/scrape. Emits standardized counters/gauges/histograms with tenant/shard/region labels, stores them in a scalable TSDB, and drives dashboards & alerts (ingest latency, projection lag, seal lag, queue depth).
Overview¶
Purpose: Provide reliable, low-cardinality telemetry for capacity planning, incident detection, and SLO compliance.
Scope: In-process instrumentation (OTel SDK), export (OTLP gRPC/HTTP or Prom scrape), aggregation, storage, dashboards, alerting. Excludes application logs and traces (covered in other flows).
Context: Every service ships metrics to an OTel Collector (agent/sidecar/daemonset) which forwards to Metrics Backend (Prometheus/Mimir/Thanos). Alert rules and dashboards read from the backend.
Key Participants:
- Service (instrumented application)
- OTel SDK (metrics API + views)
- OTel Collector (receivers/processors/exporters)
- Metrics Backend (TSDB) (Prometheus-compatible)
- Alerting (Alertmanager/Notifications)
- Dashboards (Grafana)
Prerequisites¶
System Requirements¶
- OTel SDK enabled in each service with histograms for latency and gauges for lags
- OTel Collector reachable (4317 gRPC / 4318 HTTP) with TLS/mTLS
- Metrics backend with remote write or federated scrape; retention configured
- Resource attributes set (service.name, service.version, deployment.environment, region)
Business Requirements¶
- SLOs defined per domain: Ingestion latency, Projection lag, Seal lag, Search latency
- Alert routing/ownership documented; runbooks linked from alerts
- Cardinality budgets per tenant and endpoint (guardrails/policies)
Performance Requirements¶
- Metrics export overhead < 1% CPU; payloads ≤ policy size (batching on)
- Scrape intervals tuned (e.g., 15s) without overloading services
- End-to-end telemetry freshness p95 ≤ 30s
Sequence Flow¶
Happy Path¶
sequenceDiagram
autonumber
participant SVC as Service
participant SDK as OTel SDK (Metrics)
participant COL as OTel Collector
participant TSDB as Metrics Backend (Prom/Mimir)
participant ALR as Alerting
participant DB as Dashboards
SVC->>SDK: Record metrics (counters/gauges/histograms)
SDK->>COL: Export (OTLP) with resource attrs & exemplars (traceId)
COL->>TSDB: Remote write / Prom scrape pipeline
TSDB-->>ALR: Rule eval -> alert fire/inhibit
TSDB-->>DB: Power SLO dashboards & drilldowns
Alternative Paths¶
- Prometheus scrape: service exposes `/metrics`; TSDB scrapes directly (no collector) where allowed.
- Edge aggregation: Collector performs histogram downsampling or delta temporality conversion before write.
- Multi-tenant split: per-tenant remote-write endpoints or relabeling to enforce isolation.
Error Paths¶
sequenceDiagram
participant SVC as Service
participant COL as OTel Collector
participant TSDB as Metrics Backend
SVC->>COL: Export (invalid metrics/labels)
alt 400 Bad Request (schema/label violation)
COL-->>SVC: 400 Problem (drop + log)
else 404 Not Found (unknown tenant/series namespace)
TSDB-->>COL: 404, metric rejected
else 409 Conflict (type change for existing metric name)
TSDB-->>COL: 409, reject write
else 429/503 (rate limit/outage)
TSDB-->>COL: 429/503, backoff + retry
end
Request/Response Specifications¶
Input Requirements¶
| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| OTLP Endpoint | url | Y | `grpc://collector:4317` or `https://collector:4318/v1/metrics` | TLS/mTLS |
| `resource.service.name` | string | Y | Logical service name | kebab-case |
| `resource.deployment.environment` | string | Y | `prod` \| `staging` \| `dev` | enum |
| `resource.cloud.region` | string | O | Region/zone | allowlist |
| Metric names | string | Y | `atp_*` prefix + unit suffix | Prom rules |
| Labels | map | Y | `{tenant, shard, region, result, route}` | cardinality caps |
| Views | config | O | Histogram buckets, temporality | per-SLO |
| Exemplars | bool | O | Attach trace links to histograms | sample rate cap |
Output Specifications¶
| Field | Type | Description | Notes |
|---|---|---|---|
| Dashboards | URL | Grafana folders per domain | RBAC enforced |
| Alerts | YAML | Rule groups with SLO burn rates | Routed to on-call |
| Recording Rules | YAML | Pre-agg series by tenant/shard | Reduces cost |
| Telemetry Health | JSON | Collector/TSDB status endpoints | For probes |
Example Payloads¶
.NET OTel setup (C#)
builder.Services.AddOpenTelemetry()
.WithMetrics(m => m
.AddMeter("atp.ingestion","atp.projection","atp.integrity")
.AddRuntimeInstrumentation()
.AddAspNetCoreInstrumentation()
.AddOtlpExporter(o => o.Endpoint = new Uri("http://otel-collector:4317")));
Metric naming & units (examples)
- `atp_ingest_latency_seconds` (histogram) — client→accepted latency
- `atp_projection_lag_seconds` (gauge) — append→projection lag
- `atp_integrity_seal_lag_seconds` (gauge) — append→seal lag
- `atp_ingest_records_total` (counter) — records ingested
- `atp_export_jobs_active` (gauge) — active export jobs
Recommended histogram buckets (seconds)
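The spec leaves the exact boundaries unstated here; the following set is purely illustrative (roughly exponential, in seconds) and not a recommendation from this document. Tune boundaries to each SLO target:

```python
# Illustrative latency bucket boundaries in seconds (not normative).
# A roughly exponential spacing covers sub-10 ms fast paths through
# multi-second outliers without exploding series count.
buckets = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]

# Boundaries must be strictly increasing for a valid histogram definition.
assert buckets == sorted(set(buckets))
```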
Error Handling¶
Error Scenarios¶
| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Invalid metric name/unit; disallowed label; excessive cardinality | Fix SDK config; drop or remap labels | No retry until fixed |
| 401 | Missing/invalid token for remote write | Renew credentials | Retry after renewal |
| 403 | Tenant not authorized to write namespace | Update RBAC/relabeling | — |
| 404 | Unknown tenant/namespace; dashboard id missing | Create namespace / correct link | — |
| 409 | Type conflict (counter→histogram reuse of name) | Rename metric; update dashboards | — |
| 413 | Payload too large | Reduce batch size; increase limits | Retry with smaller batches |
| 429 | Rate limited by TSDB/collector | Honor Retry-After |
Exponential backoff + jitter |
| 503 | Collector/TSDB unavailable | Buffer (within cap) | Bounded retry with drop policy |
Failure Modes¶
- Cardinality explosion (e.g., userId in labels) → automatic label sanitizer drops high-cardinality keys; emit warning counter.
- Type migration (metric renamed without deprecation) → breaks dashboards; use recording rules to bridge.
- Clock skew → out-of-order samples dropped; sync NTP and use server timestamping.
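The label sanitizer behavior described above might look like this sketch; the allowlist and counter names are illustrative, with the counter standing in for `atp_metrics_cardinality_dropped_total`:

```python
ALLOWED_LABELS = {"tenant", "shard", "region", "result", "route"}

dropped_total = 0  # would back atp_metrics_cardinality_dropped_total

def sanitize_labels(labels: dict) -> dict:
    """Drop labels not on the allowlist (e.g., userId, email) before export,
    counting drops so the cardinality metric can alert on spikes."""
    global dropped_total
    kept = {k: v for k, v in labels.items() if k in ALLOWED_LABELS}
    dropped_total += len(labels) - len(kept)
    return kept

clean = sanitize_labels({"tenant": "acme", "userId": "u-123", "route": "/audit"})
assert clean == {"tenant": "acme", "route": "/audit"}
assert dropped_total == 1
```

In a real pipeline this runs in the collector's processor chain, so misbehaving SDKs cannot blow the cardinality budget of the TSDB.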
Recovery Procedures¶
- Enable views to aggregate/drop labels causing explosion; redeploy with safe config.
- Roll out metric renames via dual-publish window + recording rules → retire old names.
- During TSDB outage, buffer with caps; after recovery, drain at limited QPS.
Performance Characteristics¶
Latency Expectations¶
- Exporter p95 < 50 ms per batch; end-to-end metric freshness p95 ≤ 30 s.
Throughput Limits¶
- Default 10k samples/s per pod (configurable); per-tenant write QPS caps at the collector.
Resource Requirements¶
- SDK minimal CPU; Collector memory sized for queues; backend disk/retention sized to SLO analytics.
Scaling Considerations¶
- Shard TSDB by tenant/region; use recording rules to pre-aggregate; leverage remote write to long-term store (Thanos/Mimir).
Security & Compliance¶
Authentication¶
- OTLP with mTLS; Prom scrape secured by service mesh identities or basic auth over TLS.
Authorization¶
- Per-tenant write tokens; relabeling at collector enforces tenant isolation.
Data Protection¶
- No PII in labels; label sanitizer strips ids, emails, IPs unless explicitly allowlisted.
Compliance¶
- Alert acknowledgments/audits stored; SLO reports preserved per retention policy.
Monitoring & Observability¶
Key Metrics¶
| Metric Name | Type | Description | Alert Threshold |
|---|---|---|---|
| `atp_ingest_latency_seconds` | histogram | Client→accepted latency | Burn rate on p95/p99 |
| `atp_projection_lag_seconds` | gauge | Append→projection lag | > 60s sustained |
| `atp_integrity_seal_lag_seconds` | gauge | Append→seal lag | > 120s sustained |
| `otelcol_exporter_queue_size` | gauge | Collector queue depth | > 80% capacity |
| `prom_remote_write_requests_failed_total` | counter | Failed writes | Rising trend |
| `atp_metrics_cardinality_dropped_total` | counter | Dropped label pairs | Spike → investigate |
Logging Requirements¶
- Collector structured logs for drops/backpressure; include `tenant`, `series`, `reason`.
Distributed Tracing¶
- Exemplars: attach `traceId` to latency histogram buckets for drill-down.
- Trace spans for exporter/collector with attributes: `seriesCount`, `dropped`, `retry`.
Health Checks¶
- Collector readiness (receivers/exporters live); TSDB scrape targets up; dashboard datasource healthy.
Operational Procedures¶
Deployment¶
- Ship OTel SDK across services; configure default meters and views.
- Deploy OTel Collector (agent/daemonset) with TLS and remote write.
- Provision dashboards and alert rules from GitOps repo.
Configuration¶
- Env: `OTEL_EXPORTER_OTLP_ENDPOINT`, `OTEL_METRIC_EXPORT_INTERVAL`, `OTEL_RESOURCE_ATTRIBUTES`.
- Collector: processors (batch, memory_limiter), exporters (prometheusremotewrite).
- Backend: retention, compaction, ruler/alertmanager endpoints.
Maintenance¶
- Periodic review of cardinality budget; prune unused metrics.
- Tune histogram buckets as traffic patterns evolve.
Troubleshooting¶
- Missing metrics → check SDK meter enabled, service.name correct, collector pipelines.
- High drops → inspect label sanitizer logs; remove high-cardinality labels.
- Alert noise → adjust SLO burn-rate windows and inhibit rules.
Testing Scenarios¶
Happy Path Tests¶
- Ingestion service publishes latency histogram; dashboard shows p95/p99; alerts fire under synthetic slowness.
- Projection lag gauge reflects backlog; alert triggers and clears after recovery.
Error Path Tests¶
- 400 invalid label name → collector drops with warning counter incremented.
- 404 unknown tenant namespace → write rejected; dashboards unaffected.
- 409 type conflict on metric rename → dual-publish + recording rule bridges.
Performance Tests¶
- 10k samples/s sustained without exporter backpressure; queue sizes stable.
- TSDB outage → buffered then drained within limits; no OOM.
Security Tests¶
- mTLS enforced; cross-tenant writes denied.
- No PII observed in labels; sanitizer counters remain near zero.
Related Documentation¶
Internal References¶
External References¶
- OpenTelemetry Metrics Spec
- Prometheus Best Practices
Appendices¶
A. Example Alert (Projection Lag SLO)¶
groups:
- name: projection-lag
rules:
- alert: ProjectionLagHigh
expr: atp_projection_lag_seconds{environment="prod"} > 60
for: 5m
labels: {severity: page, team: projections}
annotations:
summary: "Projection lag high (>{{ $value }}s)"
runbook: "https://runbooks/projection-lag"
B. Collector Pipeline (excerpt)¶
receivers:
otlp:
protocols: { grpc: {}, http: {} }
processors:
batch: {}
memory_limiter: { check_interval: 1s, limit_mib: 512 }
exporters:
prometheusremotewrite:
endpoint: https://mimir.remote/api/v1/push
service:
pipelines:
metrics:
receivers: [otlp]
processors: [memory_limiter, batch]
exporters: [prometheusremotewrite]
Distributed Tracing Flow¶
Correlates requests across all hops using W3C Trace Context (traceparent, tracestate) and OpenTelemetry spans. Propagates baggage (e.g., tenant, edition) with strict guardrails to enable per-tenant analytics without leaking PII. All spans are exported to a trace store for query and troubleshooting.
Overview¶
Purpose: Provide end-to-end visibility of a request from Gateway → Ingestion → Storage → Integrity → Projection → Search/Export, enabling root-cause analysis and SLO burn tracking.
Scope: Context propagation (HTTP/gRPC/bus), span creation and attributes, sampling (head/tail), baggage policy, export via OTel → Collector → Trace Backend, and trace query UX. Excludes logs/metrics (covered elsewhere).
Context: Each service uses OTel SDK. The API Gateway starts/continues a trace, forwards context, and attaches safe baggage (tenant, edition). Downstream services create child spans. Collector batches/exports to a Jaeger/Tempo-compatible backend.
Key Participants:
- Client / Producer
- API Gateway
- Ingestion Service
- Storage Service
- Integrity Service
- Projection Service
- Search / Export Services
- OTel Collector
- Trace Backend (Jaeger/Tempo)
Prerequisites¶
System Requirements¶
- OTel SDK enabled for HTTP, gRPC, DB instrumentation (server & client)
- W3C Trace Context and Baggage propagators registered
- OTel Collector reachable with TLS (gRPC 4317 / HTTP 4318)
- Trace backend available (Tempo/Jaeger) with retention & indexing
Business Requirements¶
- Baggage policy allowlist: `tenant`, `edition`, optional `purpose` (no PII)
- Sampling policy defined (head: rate/parent; tail: error/latency based)
- SRE runbooks for “missing span”, “broken parent”, and “dropped export”
Performance Requirements¶
- Tracing overhead < 3% CPU at default sample rates
- Export latency hidden via batching; queue backpressure bounded
- Query p95 ≤ 3 s for recent traces
Sequence Flow¶
Happy Path¶
sequenceDiagram
autonumber
participant CL as Client
participant GW as API Gateway
participant ING as Ingestion Service
participant ST as Storage Service
participant INT as Integrity Service
participant PR as Projection Service
participant COL as OTel Collector
participant TR as Trace Backend
CL->>GW: HTTP/gRPC request (+traceparent?, +baggage: tenant,edition)
Note right of GW: Start/continue root span, enforce baggage allowlist
GW->>ING: Forward request (+traceparent,+baggage)
ING->>ST: Append audit (child span)
ST-->>ING: Ack (db client/server spans)
ING->>INT: Enqueue/compute integrity (child span)
INT-->>ING: Proof computed
ING->>PR: Emit projection event (child span)
PR-->>ING: Projected
par Export spans
GW-->>COL: OTLP export (batched)
ING-->>COL: OTLP export (batched)
ST-->>COL: OTLP export (batched)
INT-->>COL: OTLP export (batched)
PR-->>COL: OTLP export (batched)
end
COL->>TR: Push spans
TR-->>GW: Trace available for query
Note over GW,PR: Baggage {tenant,edition} available on all spans
Alternative Paths¶
- Message bus propagation: inject `traceparent`/`baggage` into message headers; consumers extract and create linked spans if processing is async.
- Tail sampling: collector performs tail-based sampling (error/latency heuristics) for high-value traces while keeping head sampling low.
- Gateway as root: if client sends no `traceparent`, Gateway creates the root span; otherwise, it joins the provided context.
Error Paths¶
sequenceDiagram
participant CL as Client
participant GW as API Gateway
participant COL as OTel Collector
participant TR as Trace Backend
CL->>GW: Request (malformed trace headers)
alt 400 Bad Request (invalid traceparent format)
GW-->>CL: 400 Problem+JSON (with new trace id for error handling)
else Backend query for traceId
GW->>TR: GET /traces/{traceId}
alt 404 Not Found (expired/unknown)
TR-->>GW: 404 Not Found
GW-->>CL: 404 Problem+JSON
else 409 Conflict (concurrent sampling policy change)
TR-->>GW: 409 Conflict
GW-->>CL: 409 Problem+JSON
end
end
Request/Response Specifications¶
Input Requirements (Propagation & Policy)¶
| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| `traceparent` | header/metadata | O | W3C Trace Context | 55-char format |
| `tracestate` | header/metadata | O | Vendor/state hints | size ≤ 512B |
| `baggage` | header/metadata | O | `tenant=acme,edition=enterprise` | allowlist keys; total ≤ 1024B |
| `x-tenant-id` | header | Y | Tenant RLS (also echoed in baggage) | must match |
| `trace-flags` | bitfield | O | Sampling decision (head) | 0/1 |
| `idempotency-key` | header | O | For write flows (not tracing but correlated) | ≤ 128 chars |
Ops / Query
- `GET /traces/{traceId}` → rendered trace
- `GET /traces/search?tenant=&error=true&latencyMs>…` → find traces
- `POST /ops/v1/tracing/sampling` `{headRate, tailPolicies[]}` → update sampling (RBAC)
Output Specifications¶
- Spans include attributes (examples):
    - Common: `tenant`, `edition`, `environment`, `region`, `trace.sampled`
    - Gateway: `route`, `status_code`, `client.ip_hash`
    - Ingestion: `audit.schemaVersion`, `payload.bytes`, `validation.result`
    - Storage: `db.system`, `db.operation=append`, `db.statement?=off`
    - Integrity: `integrity.blockId`, `segment`, `proof.kid`
    - Projection: `watermark`, `lag.ms`
    - Search/Export: `query.kind`, `result.count`, `package.id`
Example HTTP with headers
POST /audit/v1/records HTTP/1.1
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
tracestate: atp=gw;ver=1
baggage: tenant=acme,edition=enterprise
x-tenant-id: acme
content-type: application/json
Example gRPC metadata (pseudo)
:authority: ingestion.atp
traceparent: 00-4bf92f3577b34...-00f067aa0b...-01
baggage: tenant=acme,edition=enterprise
Error Handling¶
Error Scenarios¶
| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Malformed traceparent/baggage |
Drop/regen context; return Problem+JSON if strict | No retry until fixed |
| 401 | Querying traces without auth | Acquire token | Retry after renewal |
| 403 | Cross-tenant trace access | Enforce RLS; deny | — |
| 404 | Trace id not found/expired | Verify id/retention window | — |
| 409 | Sampling policy update conflicts | Re-fetch policy; retry op | Conditional retry |
| 413 | Oversized baggage | Trim to policy; drop disallowed keys | Resend with smaller baggage |
| 429 | Collector/back-end rate limit | Honor Retry-After | Exponential backoff + jitter |
| 503 | Collector/back-end unavailable | Buffer within caps | Bounded retry, drop oldest if over cap |
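The "exponential backoff + jitter" strategy in the table can be sketched as follows (full-jitter variant; base and cap values are illustrative, not platform defaults):

```python
import random

def backoff_delays(base_ms: int = 100, cap_ms: int = 30_000, attempts: int = 5):
    """Full-jitter exponential backoff: delay_n is uniform in [0, min(cap, base * 2^n)]."""
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap_ms, base_ms * (2 ** attempt))
        delays.append(random.uniform(0, ceiling))
    return delays
```

When a `Retry-After` header is present (429), it takes precedence over the computed delay.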
Failure Modes¶
- Broken parentage: services that don’t extract context create new roots → detectable by orphan span metric.
- Baggage misuse: high-cardinality/PII snuck into baggage → sanitizer drops keys and emits policy violations.
- Excess sampling: high head sampling inflates overhead → shift to tail sampling for error/slow traces.
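The baggage sanitizer mentioned above can be sketched like this (the allowlist contents and the drop-everything-on-oversize choice are assumptions for illustration):

```python
ALLOWED_BAGGAGE_KEYS = {"tenant", "edition", "environment", "region"}  # illustrative allowlist
MAX_BAGGAGE_BYTES = 1024  # matches the ≤ 1024B limit in the header table

def sanitize_baggage(header: str):
    """Keep only allowlisted keys and enforce the total size cap; report drops."""
    kept, dropped = [], []
    for member in filter(None, (m.strip() for m in header.split(","))):
        key = member.split("=", 1)[0].strip()
        (kept if key in ALLOWED_BAGGAGE_KEYS else dropped).append(member)
    out = ",".join(kept)
    if len(out.encode()) > MAX_BAGGAGE_BYTES:
        # Oversize even after filtering: drop everything rather than truncate mid-pair.
        out, dropped = "", dropped + kept
    return out, dropped
```

Each entry in `dropped` would feed the policy-violation counter referenced above.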
Recovery Procedures¶
- Enable/verify propagators in all client/server middleware.
- Turn on tail sampling policies (e.g., `error=true`, `latency>500ms`).
- Inspect "orphan span" dashboards; fix missing extract/inject in specific services.
Performance Characteristics¶
Latency Expectations¶
- Instrumentation overhead p95 ≤ 1 ms per hop (sampled), near-zero when unsampled.
Throughput Limits¶
- Collector queue sized for burst N× steady state; backpressure triggers temporary head sampling reductions.
Resource Requirements¶
- Small CPU for SDK; Collector memory for queues; backend disk for retention (e.g., 7–14 days).
Scaling Considerations¶
- Shard collectors per region/tenant; enable tail sampling at edge; compress exports; prefer OTLP gRPC.
Security & Compliance¶
Authentication¶
- Query/UI protected by OIDC; service-to-collector via mTLS.
Authorization¶
- Enforce tenant isolation on trace queries (filter by baggage `tenant` and RLS).
Data Protection¶
- No PII in baggage or span attributes; hash IPs/UAs; redact payloads; disable SQL/body capture by default.
Compliance¶
- Retention adheres to tenant policy; trace access is audited with actor and purpose-of-use.
Monitoring & Observability¶
Key Metrics¶
| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| `otel_traces_exported_total` | counter | Spans successfully exported | Sudden drop |
| `otel_traces_dropped_total` | counter | Dropped spans (queue/limits) | > baseline |
| `trace_orphan_spans_total` | counter | Spans without valid parent | Spike alert |
| `collector_queue_size` | gauge | Export queue depth | > 80% capacity |
| `trace_tail_sampled_total` | counter | Tail-sampled traces | Track ratio |
| `trace_query_latency_seconds` | histogram | UI/API query latency | p95 > SLO |
Logging Requirements¶
- Structured logs: `traceId`, `spanId`, `dropReason`, `policyId`, `tenant`, `edition`. No payload values.
Distributed Tracing¶
- (Meta) link exporter spans to service spans; include exemplars on latency histograms (metrics flow).
Health Checks¶
- Collector readiness/liveness; backend ingestion status; UI availability.
Operational Procedures¶
Deployment¶
- Enable OTel SDKs with HTTP/gRPC/DB instrumentation and W3C propagators.
- Deploy OTel Collector (batch, memory_limiter, tail_sampling processors).
- Wire trace backend and provision dashboards.
Configuration¶
- Env: `OTEL_EXPORTER_OTLP_ENDPOINT`, `OTEL_TRACES_SAMPLER`, `OTEL_RESOURCE_ATTRIBUTES`.
- Tail sampling (examples): `error=true`, `status_code>=500`, `latency_ms>500`; selective by `tenant`.
Maintenance¶
- Adjust sampling as traffic patterns evolve; rotate retention; prune noisy attributes.
Troubleshooting¶
- Missing links → check inject/extract middleware order.
- High drops → increase collector queues or reduce sampling; inspect backpressure.
- Cross-tenant leak alerts → confirm baggage sanitizer & RLS.
Testing Scenarios¶
Happy Path Tests¶
- End-to-end trace spans present across Gateway→Ingestion→Storage→Integrity→Projection.
- Baggage (`tenant=acme,edition=enterprise`) visible on all spans.
Error Path Tests¶
- 400 invalid `traceparent` handled; new trace created for error path.
- 404 unknown trace id query returns Problem+JSON, no data leakage.
- 409 sampling change during export handled without crash.
Performance Tests¶
- Sampled high-QPS traffic keeps overhead < 3%.
- Collector withstands burst without dropping (or drops < policy).
Security Tests¶
- No PII in spans/baggage; sanitizer counters near zero.
- Trace queries scoped to tenant via RLS.
Related Documentation¶
Internal References¶
External References¶
- W3C Trace Context & Baggage
- OpenTelemetry Specification
Appendices¶
A. Example Problem+JSON (invalid trace headers)¶
{
"type": "urn:connectsoft:errors/tracing/traceparent.invalid",
"title": "Invalid W3C traceparent header",
"status": 400,
"detail": "Trace ID length not 16 bytes (hex).",
"traceId": "9f0c1d2e3a4b5c6d..."
}
B. Suggested Span Attribute Keys (allowlist)¶
`tenant`, `edition`, `environment`, `region`, `route`, `status_code`, `db.system`, `db.operation`, `integrity.blockId`, `projection.watermark`, `search.query.kind`, `export.package.id`
Health Check Flow¶
Implements liveness, readiness, and startup probes with per-component dependency checks and an aggregated status that signals deploy orchestrators (e.g., Kubernetes) for safe rollouts and traffic routing. Probes are budgeted and isolated to avoid noisy-neighbor effects; timeouts and intervals are tuned to service SLOs.
Overview¶
Purpose: Provide reliable health signaling for deployment safety, traffic gating, and fast failure detection without causing additional load or false negatives.
Scope: Local process liveness, startup warmup, dependency readiness (DB, queue, cache, search, integrity, policy), aggregation, export via HTTP endpoints, and ops overrides (maintenance mode).
Context: Orchestrators consume /health/liveness, /health/readiness, /health/startup. Readiness reflects dependencies & backpressure, not just process up. Liveness is crash/lock detection only.
Key Participants:
- Service (with HealthCheck library)
- Dependency Probers (DB/Cache/Queue/Search/Integrity/Policy)
- Aggregator (health manager + budgeter)
- Orchestrator (Kubernetes/Service Mesh/Gateway)
- Ops UI / API (maintenance & overrides)
- Observability (metrics/logs)
Prerequisites¶
System Requirements¶
- HealthCheck middleware/library enabled with endpoints: `/health/liveness`, `/health/readiness`, `/health/startup`
- Per-dependency prober with timeouts, concurrency caps, and circuit-break-aware checks
- Clock synchronized (UTC) for timestamps; structured logging enabled
- Network policies allow orchestrator-to-service health traffic
Business Requirements¶
- Defined maintenance mode procedure (drain → mark NotReady → perform ops)
- Per-tenant/edition readiness policies when dependencies are multi-tenant
- Runbooks for common failure signatures (DB degraded, queue backlog, index lag)
Performance Requirements¶
- Probe p95 ≤ 50 ms for local checks, ≤ 200 ms for remote deps
- Readiness interval typically 10s–30s; liveness interval 5s–10s
- Probe CPU overhead < 1%; IO bounded with concurrency limits
Sequence Flow¶
Happy Path¶
sequenceDiagram
autonumber
participant ORCH as Orchestrator (K8s)
participant SVC as Service
participant AGG as Health Aggregator
participant DB as Database
participant Q as Queue
participant C as Cache
ORCH->>SVC: GET /health/startup
SVC->>AGG: Run startup checks (one-time warmups)
AGG-->>SVC: status: Up
SVC-->>ORCH: 200 OK {status:"Up"}
ORCH->>SVC: GET /health/readiness
SVC->>AGG: Parallel probers (DB/Q/Cache) with budgets
AGG->>DB: ping (timeout ≤ 150ms)
AGG->>Q: depth/head check
AGG->>C: get/set key
DB-->>AGG: OK
Q-->>AGG: OK
C-->>AGG: OK
AGG-->>SVC: Ready
SVC-->>ORCH: 200 OK {status:"Ready", components:[...]}
ORCH->>SVC: GET /health/liveness
SVC-->>ORCH: 200 OK {status:"Alive"}
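The "parallel probers with budgets" step in the diagram can be sketched as below. This is a minimal illustration, not the actual HealthCheck library; prober names, the shared budget, and the latency accounting are assumptions:

```python
import concurrent.futures as cf
import time

def run_readiness(probers, budget_s: float = 0.2):
    """Run dependency probers in parallel under a per-probe budget.

    probers: {name: zero-arg callable returning True/False}.
    Returns (ready, components) shaped like the readiness JSON below.
    """
    components, ready = [], True
    with cf.ThreadPoolExecutor(max_workers=max(1, len(probers))) as pool:
        futures = {name: pool.submit(fn) for name, fn in probers.items()}
        for name, fut in futures.items():
            start = time.monotonic()
            try:
                ok = bool(fut.result(timeout=budget_s))
                status = "Up" if ok else "Down"
            except Exception:  # timeout budget exceeded or prober raised
                ok, status = False, "Timeout"
            components.append({
                "name": name,
                "status": status,
                # Approximate wait time observed by the aggregator, not wire latency.
                "latencyMs": int((time.monotonic() - start) * 1000),
            })
            ready = ready and ok
    return ready, components
```

Any `Down`/`Timeout` component flips the aggregate to NotReady, which the service surfaces as the 503 in the error path below.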
Alternative Paths¶
- Maintenance mode: ops toggles → service returns `503` on readiness with `Retry-After`, keeps liveness `200` to avoid restarts during planned work.
- Degraded-but-serving: non-critical dependency fails; readiness remains `200` with `warnings[]`; traffic allowed but autoscaler informed via metrics.
- Backpressure-aware readiness: if queue depth/backlog exceeds threshold, respond `429 Too Many Requests` (optionally) or `503` with reason to trigger traffic shifting.
Error Paths¶
sequenceDiagram
participant ORCH as Orchestrator
participant SVC as Service
participant AGG as Health Aggregator
participant DB as Database
ORCH->>SVC: GET /health/readiness
SVC->>AGG: Run checks
AGG->>DB: ping
DB-->>AGG: timeout
AGG-->>SVC: NotReady {db:"Timeout"}
alt Not Ready
SVC-->>ORCH: 503 Service Unavailable (Problem+JSON)
end
Request/Response Specifications¶
Input Requirements¶
| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| GET /health/liveness | http | Y | Process health (no deps) | Always lightweight |
| GET /health/startup | http | Y | Warmup complete? | One-time gates |
| GET /health/readiness | http | Y | Dependency/traffic readiness | Budgeted checks |
| POST /ops/v1/health:maintenance | http | O | Enter/exit maintenance | AuthZ required |
| `Authorization` (ops) | header | O | `Bearer <JWT>` | Role `ops:health` |
| `traceparent` | header | O | Trace exemplar correlation | Optional |
| Query: `full=true` | bool | O | Include per-component detail | RBAC for PII masking |
Output Specifications¶
200 OK (Readiness/Liveness/Startup)
{
"status": "Ready",
"service": "ingestion",
"time": "2025-10-27T08:21:45Z",
"warnings": [],
"components": [
{"name":"db", "type":"postgres", "status":"Up", "latencyMs": 32},
{"name":"queue", "type":"rabbitmq", "status":"Up", "latencyMs": 18},
{"name":"cache", "type":"redis", "status":"Up", "latencyMs": 4}
]
}
503 Service Unavailable (Not Ready)
{
"type": "urn:connectsoft:errors/health/not-ready",
"title": "Readiness check failed",
"status": 503,
"detail": "postgres timeout; queue connecting",
"retryAfterSeconds": 10
}
Maintenance Mode Toggle
Error Handling¶
Error Scenarios¶
| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Invalid maintenance payload (negative TTL/unknown field) | Fix request | — |
| 401 | Missing/invalid JWT for ops endpoint | Obtain token | Retry after renewal |
| 403 | Caller lacks `ops:health` role | Request access | — |
| 404 | Unknown component in `?component=` query | Remove/rename | — |
| 409 | Conflicting state change (maintenance enabled while drain in progress) | Wait or cancel prior op | Retry after resolution |
| 429 | Health endpoint rate-limited (human/automation abuse) | Back off | Jittered retry |
| 503 | Not Ready (dependency down/backpressure) | Remediate dependency | Retry after Retry-After |
| 504 | Probe exceeded timeout budget | Increase timeout if justified | Backoff; verify load |
Failure Modes¶
- Noisy-neighbor probes: too-frequent or heavy checks cause dependency load → enforce intervals, timeouts, and read-only probes.
- Coupled liveness/readiness: using dependency checks for liveness causes restarts → separate strictly.
- Flapping readiness: thresholds too tight → add stabilization window and hysteresis.
- Leaky details: exposing internal hostnames/errors externally → sanitize messages.
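The stabilization window and hysteresis mentioned above can be sketched as a small state machine (thresholds are illustrative assumptions):

```python
class ReadinessHysteresis:
    """Dampen flapping: require N consecutive failures to go NotReady
    and M consecutive successes to go Ready again."""

    def __init__(self, fall_after: int = 3, rise_after: int = 2):
        self.fall_after, self.rise_after = fall_after, rise_after
        self.ready = True
        self._fails = self._oks = 0

    def observe(self, probe_ok: bool) -> bool:
        """Feed one probe result; return the stabilized readiness state."""
        if probe_ok:
            self._oks += 1
            self._fails = 0
            if not self.ready and self._oks >= self.rise_after:
                self.ready = True
        else:
            self._fails += 1
            self._oks = 0
            if self.ready and self._fails >= self.fall_after:
                self.ready = False
        return self.ready
```

A single slow probe no longer flips the pod NotReady, which directly reduces the `health_flaps_total` metric tracked below.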
Recovery Procedures¶
- Enter maintenance mode → drain traffic (readiness `503`), keep liveness `200`, perform remediation.
- Enable degraded mode for non-critical deps; keep serving with warnings.
- Increase probe intervals/timeouts cautiously; verify impact via metrics.
Performance Characteristics¶
Latency Expectations¶
- Liveness: p95 ≤ 5 ms; Start-up: first success within warmup target; Readiness: p95 ≤ 150–200 ms.
Throughput Limits¶
- Cap concurrent dependency checks (e.g., max 2 per dep per instance).
- Global RPS limit on health endpoints to prevent abuse.
Resource Requirements¶
- Minimal CPU; network usage proportional to dependency checks; cache results for stabilization window (e.g., 2–5s).
Scaling Considerations¶
- Shard readiness by tenant/shard if dependencies are partitioned; expose `components[].partition`.
- Push passive signals (e.g., queue depth) from dependencies to reduce active probing.
Security & Compliance¶
Authentication¶
- Health endpoints for orchestrator may be anonymous inside cluster (network-policy protected). Ops endpoints require OIDC JWT.
Authorization¶
- Roles: `ops:health.read`, `ops:health.maintain`.
Data Protection¶
- Mask error details in public readiness; full component diagnostics behind RBAC. No secrets in responses.
Compliance¶
- Health state transitions and maintenance toggles audited with actor, reason, and duration.
Monitoring & Observability¶
Key Metrics¶
| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| `health_readiness_status` | gauge | 1=Ready, 0=NotReady | 0 for >1m |
| `health_probe_latency_ms{component}` | histogram | Per-component probe latency | p95 breach |
| `health_notready_total{reason}` | counter | Fail events by reason | Spike alert |
| `health_maintenance_mode` | gauge | 1 when enabled | Unexpected >0 |
| `health_flaps_total` | counter | Ready↔NotReady transitions | >N/hour |
Logging Requirements¶
- Structured logs: `probe`, `component`, `latencyMs`, `result`, `timeout`, `reason`, `traceId`.
Distributed Tracing¶
- Health endpoints are not traced by default (to reduce noise); ops toggles may emit spans with attribute `maintenance=true`.
Health Checks¶
- Internal self-check (threadpool saturation, GC, disk space).
- Dependency checks with budgeted timeouts and circuit-breaker awareness.
Operational Procedures¶
Deployment¶
- Expose `/health/liveness`, `/health/readiness`, `/health/startup`.
- Configure orchestrator probes and thresholds (see Appendix).
- Register metrics and alerts; link runbooks.
Configuration¶
- Env: `HEALTH_READINESS_TIMEOUT_MS`, `HEALTH_PROBE_INTERVAL_S`, `HEALTH_STABILIZATION_WINDOW_S`, `HEALTH_MAX_CONCURRENCY`, `HEALTH_MAINTENANCE_TTL_S`.
- Policy: which dependencies are critical vs. advisory for readiness.
Maintenance¶
- Use ops endpoint to enable maintenance → drain → operate → disable → verify readiness.
Troubleshooting¶
- Frequent flaps → extend stabilization, review dependency SLOs.
- Probes time out → check network/circuit breaker; raise timeout only with evidence.
- Orchestrator killing pods unexpectedly → confirm liveness is local-only.
Testing Scenarios¶
Happy Path Tests¶
- Startup becomes `Up` after caches warmed; readiness `200`.
- All components return `Up`; status JSON includes latencies.
Error Path Tests¶
- DB timeout triggers readiness `503` with sanitized Problem+JSON.
- 400 invalid maintenance payload rejected; 404 unknown component; 409 conflicting state change handled.
Performance Tests¶
- Probe p95 ≤ 200 ms under load; intervals respected; no excess CPU/IO.
- High RPS to health endpoints remains within rate limits.
Security Tests¶
- Public readiness hides internals; full diagnostics gated by RBAC.
- Audit records for maintenance toggles captured.
Related Documentation¶
Internal References¶
External References¶
- Kubernetes probe guidance (liveness/readiness/startup)
Appendices¶
A. Example Kubernetes Probes¶
livenessProbe:
  httpGet: { path: /health/liveness, port: 8080 }
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 2
readinessProbe:
  httpGet: { path: /health/readiness, port: 8080 }
  initialDelaySeconds: 20
  periodSeconds: 15
  timeoutSeconds: 2
  successThreshold: 1
  failureThreshold: 3
startupProbe:
  httpGet: { path: /health/startup, port: 8080 }
  failureThreshold: 30
  periodSeconds: 5
B. Example Problem+JSON (Not Ready)¶
{
"type": "urn:connectsoft:errors/health/not-ready",
"title": "Readiness check failed",
"status": 503,
"detail": "queue backlog > threshold; integrity service degraded",
"retryAfterSeconds": 15
}
Alert Generation Flow¶
Turns signals into action: evaluates thresholds and SLO burn rates, fires alerts, routes to pager/chat/webhook, opens a ticket, and auto-closes on recovery. Noise is controlled via grouping, inhibition, dedup windows, silences, and maintenance calendars. Escalation paths are explicit and auditable.
Overview¶
Purpose: Deliver timely, actionable notifications with clear ownership and escalation while minimizing false positives.
Scope: Rule evaluation, grouping/dedup, routing, paging/notifications, ticket creation, auto-resolve, silencing and inhibition controls.
Context: Metrics and events feed a Rule Engine (e.g., Prometheus Ruler). Alerts traverse a Router (Alertmanager-like) to destinations: PagerDuty/On-call, Chat (Slack/Teams), Webhook (runbooks/automation), and Ticketing (Jira/ServiceNow).
Key Participants:
- Metrics Backend / Rule Engine
- Alert Router (grouping, dedup, silences, inhibition)
- Destinations: Pager, Chat, Webhook, Ticketing
- On-call Engineer / Team
- Ops API/UI (manage silences, ack, routes)
- Runbooks (linked from alerts)
Prerequisites¶
System Requirements¶
- Metrics and logs published with low-cardinality labels (`tenant`, `shard`, `region`, `service`)
- Rule Engine with multi-window SLO burn capability and dependency-aware inhibition
- Alert Router HA with persistent silences and dedup state
- Integrations to pager/chat/ticket with retry & backoff
Business Requirements¶
- Defined ownership map: service → team → escalation policy
- Runbooks per alert with clear first actions and diagnostics links
- Maintenance windows / change freeze calendars integrated
Performance Requirements¶
- End-to-end alerting latency p95 ≤ 30s from breach to page
- Router throughput sized for peak fan-out; delivery retries with backoff
- Dedup window defaults (e.g., 5m) to limit paging storms
Sequence Flow¶
Happy Path¶
sequenceDiagram
autonumber
participant MET as Metrics/Rule Engine
participant RTR as Alert Router
participant PD as Pager (On-call)
participant CHAT as Chat (Slack/Teams)
participant TKT as Ticketing (Jira/SNOW)
participant OPS as On-call Engineer
MET->>RTR: Alert{labels, annotations, status="firing"}
RTR->>RTR: Group & dedup (fingerprint), apply inhibition/silences
RTR->>PD: Page (severity=page, service=ingestion)
RTR->>CHAT: Notify #oncall-ingestion (runbook link)
RTR->>TKT: Create ticket (P1) with alert context
PD-->>OPS: Page delivered (push/phone/SMS)
OPS->>TKT: Acknowledge ticket, start mitigation
MET-->>RTR: Alert{status="resolved"}
RTR->>PD: Auto-resolve page
RTR->>TKT: Auto-close with resolution note
RTR->>CHAT: Post recovery message
Alternative Paths¶
- Warning-only: severity `warn` → chat/webhook only, no page.
- Escalation: no ack within 10m → escalate to secondary, then manager-on-call.
- Bulk correlation: many shard alerts collapse into one parent incident with children inhibited.
- Auto-remediation: webhook triggers safe runbook; success posts to thread and downgrades severity.
Error Paths¶
sequenceDiagram
participant MET as Metrics/Rule Engine
participant RTR as Alert Router
participant PD as Pager
MET->>RTR: Alert firing
alt 400 Bad Request (invalid labels/size)
RTR-->>MET: 400 drop + audit
else 404 Destination not configured
RTR-->>MET: 404, fallback to default route
else 409 Conflict (duplicate route update)
RTR-->>MET: 409, keep last-good config
else 429/503 Pager API throttled/outage
RTR-->>PD: retry with backoff, queue locally
end
Request/Response Specifications¶
Input Requirements (Alert Payload to Router)¶
| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| `status` | enum | Y | `firing` \| `resolved` | — |
| `labels` | map | Y | `{alertname, service, tenant, shard, severity}` | size ≤ 50, allowlist keys |
| `annotations` | map | O | `{summary, description, runbook, dashboard}` | ≤ 4KB |
| `startsAt` / `endsAt` | RFC3339 | Y/O | When firing/resolved | UTC |
| `generatorURL` | url | O | Link to rule source | valid URL |
| `fingerprint` | string | O | Stable dedup key | computed if missing |
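The "computed if missing" fingerprint can be sketched as a hash over sorted label pairs. This is an illustration of the idea, not the router's actual algorithm (Alertmanager, for instance, uses its own FNV-based label hash):

```python
import hashlib

def alert_fingerprint(labels: dict) -> str:
    """Stable dedup key: identical label sets yield identical fingerprints,
    regardless of insertion order."""
    # Unit/record separators avoid collisions between e.g. {"a":"b,c"} and {"a":"b","c":""}.
    canonical = "\x1f".join(f"{k}\x1e{v}" for k, v in sorted(labels.items()))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]
```

The same value is reused downstream as the pager `dedup_key`, so repeated firings of one alert collapse into one incident.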
Output Specifications (Destinations)¶
- Pager: payload includes `service`, `severity`, `routing_key`, `dedup_key=fingerprint`, `links` (runbook/dashboards).
- Chat: message with summary, top labels, graph image link, ack emoji workflow.
- Ticket: fields `summary`, `description`, `priority`, `labels`, `customFields` (tenant/shard), plus auto-close comment on resolve.
- Webhook: signed POST with HMAC; body includes current status, last N samples, silence suggestions.
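The HMAC signing of webhook deliveries can be sketched as follows (the header name and hex encoding are assumptions; the source only specifies "signed POST with HMAC"):

```python
import hashlib
import hmac

def sign_webhook(secret: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 over the raw request body, sent e.g. in an
    X-Signature header (header name assumed)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()

def verify_webhook(secret: bytes, body: bytes, signature: str) -> bool:
    """Constant-time comparison; a mismatch maps to the 412 row below."""
    return hmac.compare_digest(sign_webhook(secret, body), signature)
```

Receivers must verify over the raw bytes before JSON parsing; re-serializing the payload first would change whitespace and break the signature.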
Example Payloads¶
// Alert to Router (condensed)
{
"status": "firing",
"labels": {
"alertname": "ProjectionLagHigh",
"service": "projection",
"tenant": "acme",
"severity": "page",
"region": "eu-west-1"
},
"annotations": {
"summary": "Projection lag > 60s",
"description": "Watermark delay crossing SLO for tenant=acme.",
"runbook": "https://runbooks/projection-lag",
"dashboard": "https://grafana/d/lag"
},
"startsAt": "2025-10-27T08:15:00Z",
"generatorURL": "prom://ruler/expr/123"
}
# Burn-rate rule example (SLO 99.9% over 30d)
- alert: IngestSLOBurnHigh
  expr: |
    (sum(rate(atp_ingest_errors_total[5m])) by (service,tenant)
      / sum(rate(atp_ingest_requests_total[5m])) by (service,tenant))
    > (0.001 * 14.4)
  for: 5m
  labels: {severity: page, service: ingestion}
  annotations:
    summary: "Ingest SLO fast burn (5m)"
    runbook: "https://runbooks/ingest-slo"
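The `14.4` multiplier above is the standard fast-burn factor: it is the rate at which spending 2% of a 30-day error budget in 1 hour would consume the budget. The arithmetic behind the threshold:

```python
def burn_rate_threshold(slo: float, budget_fraction: float,
                        window_h: float, period_h: float = 30 * 24) -> float:
    """Error-rate threshold = burn factor × (1 - SLO).

    burn factor = budget_fraction / (window / period);
    e.g. 2% of a 30-day budget in 1 hour: 0.02 / (1/720) = 14.4.
    """
    error_budget = 1.0 - slo
    factor = budget_fraction / (window_h / period_h)
    return factor * error_budget
```

For the 99.9% SLO rule above, `burn_rate_threshold(0.999, 0.02, 1)` reproduces the `0.001 * 14.4` expression.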
Error Handling¶
Error Scenarios¶
| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Invalid alert payload (missing labels/oversized) | Fix rule/labels; drop event | No retry until fixed |
| 401 | Webhook/Pager auth failed | Rotate tokens/keys | Retry after renewal |
| 403 | Route not permitted for tenant/edition | Update RBAC/route policy | — |
| 404 | Route/destination not found | Use default route; fix config | — |
| 409 | Concurrent route config updates | Apply last-write-wins or CAS | Retry after fetch |
| 412 | HMAC signature mismatch (webhook) | Recalculate with correct secret | — |
| 429 | Destination rate-limiting | Honor vendor backoff | Exponential backoff + jitter |
| 503 | Destination outage | Queue & retry within TTL | Progressive backoff, failover route |
Failure Modes¶
- Alert storms: ungrouped high-cardinality labels → enable grouping keys and label sanitization.
- Flapping: thresholds too tight → add `for:` windows and hysteresis.
- Cascading pages: child alerts page alongside parent → add inhibition until parent resolves.
- Silent failures: misconfigured routes → periodic synthetic alerts verify end-to-end.
Recovery Procedures¶
- Activate global silence or maintenance mode during planned incidents.
- Expand grouping and increase `group_wait`/`group_interval` to dampen bursts.
- Fail over to secondary pager provider if primary remains 503/429 beyond SLO.
Performance Characteristics¶
Latency Expectations¶
- Signal-to-page p95 ≤ 30s; chat/webhook p95 ≤ 15s; ticket creation ≤ 60s.
Throughput Limits¶
- Router handles thousands of alerts/min with grouping; per-destination QPS caps and queues.
Resource Requirements¶
- Router memory for dedup store and silence registry; HA storage (e.g., S3/object store or DB) for persistence.
Scaling Considerations¶
- Partition routes by `region` and `service`; replicate router HA; shard rules by domain.
Security & Compliance¶
Authentication¶
- Mutual TLS for webhook receivers; OAuth tokens/keys for pager/ticket/chat APIs.
Authorization¶
- Route policies per tenant/edition; ops roles to create silences and modify routes (`ops:alerts.*`).
Data Protection¶
- Do not include PII in labels/annotations; link dashboards instead of embedding raw data.
Compliance¶
- All alert lifecycle actions (fire/route/ack/resolve/silence) audited with actor, reason, and timestamps.
Monitoring & Observability¶
Key Metrics¶
| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| `alerts_firing_total` | gauge | Active firing alerts | Trend by service |
| `alerts_notifications_sent_total` | counter | Deliveries by destination | Sudden drop |
| `alerts_delivery_failures_total` | counter | Failed sends by dest | Spike alert |
| `alerts_routing_latency_seconds` | histogram | Router processing latency | p95 breach |
| `alerts_silences_active` | gauge | Current silences | Unexpected growth |
| `alerts_inhibited_total` | counter | Child alerts inhibited | Track correlation |
Logging Requirements¶
- Structured logs: `alertname`, `fingerprint`, `status`, `route`, `destination`, `deliveryId`, `retry`, `actor` (for silences/acks).
Distributed Tracing¶
- Trace Router pipeline (ingest→group→deliver); attach exemplars to routing latency histograms.
Health Checks¶
- Router readiness includes destination probes (token check, rate-limit status); synthetic canaries validate end-to-end.
Operational Procedures¶
Deployment¶
- Deploy Rule Engine & Router HA; configure storage for silences/dedup.
- Create base routes (page/warn/info) and default receivers.
- Set up synthetic alerts per region/service.
Configuration¶
- Router: `group_by: [alertname, service, tenant]`, `group_wait: 10s`, `group_interval: 5m`, `repeat_interval: 2h`.
- Escalation: ack timeout 10m, primary → secondary → manager.
- Webhook HMAC secret rotation schedule.
Maintenance¶
- Review top talkers weekly; reduce cardinality; tune thresholds and `for:` windows.
- Validate runbook links and dashboard IDs quarterly.
Troubleshooting¶
- No pages received → check destination quotas, auth, and router queue depths.
- Excess noise → increase grouping, add inhibition rules, widen hysteresis.
- Auto-close not working → verify `resolved` events flow and ticket webhooks.
Testing Scenarios¶
Happy Path Tests¶
- Fire `ProjectionLagHigh` → page + chat + ticket created; resolves and auto-closes on recovery.
- Warning-only alert posts to chat without paging.
Error Path Tests¶
- 400/404 misrouted alerts handled; default route used.
- 429/503 destination throttling triggers retries and eventual delivery/failover.
Performance Tests¶
- Burst of 10k alerts grouped to ≤ 100 pages; router p95 latency within SLO.
- Dedup prevents duplicate pages across replicas.
Security Tests¶
- Webhook HMAC verified; invalid signature (412) rejected.
- No PII in labels/annotations; audits present for silences/acks.
Related Documentation¶
Internal References¶
External References¶
- SRE Workbook: Multi-window, multi-burn-rate alerts
- Vendor APIs: PagerDuty/Slack/Jira
Appendices¶
A. Router Route Snippet (YAML)¶
route:
  group_by: ['alertname', 'service', 'tenant']
  group_wait: 10s
  group_interval: 5m
  repeat_interval: 2h
  receiver: 'default'
  routes:
    - match: {severity: 'page'}
      receiver: 'pager'
      continue: true
    - match: {severity: 'page'}
      receiver: 'chat'
    - match_re: {severity: 'warn|info'}
      receiver: 'chat'
receivers:
  - name: pager
    pagerduty_configs:
      - routing_key: ${PAGERDUTY_KEY}
        dedup_key: '{{ .GroupLabels.fingerprint }}'
  - name: chat
    slack_configs:
      - channel: '#oncall-{{ .GroupLabels.service }}'
        title: '{{ .CommonLabels.alertname }}'
        text: '{{ .CommonAnnotations.summary }}\n{{ .CommonAnnotations.runbook }}'
B. Example Silence (API)¶
POST /ops/v1/alerts/silences
{
"matchers": [{"name":"service","value":"projection","isRegex":false}],
"startsAt": "2025-10-27T08:00:00Z",
"endsAt": "2025-10-27T10:00:00Z",
"createdBy": "deploy-bot",
"comment": "Planned projection migration"
}
Tenant Onboarding Flow¶
Provisions and activates a new tenant with IdP linkage, policy defaults, partitioned storage & indexes, per-tenant KMS keys and residency settings. Ensures strict isolation (RLS) and emits onboarding welcome/events. All steps are idempotent and fully audited.
Overview¶
Purpose: Safely create a tenant boundary (identity, data, policy, encryption, residency) and make it ready for use.
Scope: Intake → validation → IdP linkage → resource provisioning (storage/projections/search) → policy/key/residency setup → activation → welcome events. Excludes billing system specifics.
Context: Orchestrated by Onboarding Service with calls to Identity/IdP, Policy, Storage/Projection/Search, KMS/Secrets, and Notifications.
Key Participants:
- Tenant Admin / Operator
- Onboarding Service (orchestrator)
- Identity/SSO (SAML/OIDC, optional SCIM)
- Policy Service (defaults: retention, redaction)
- Storage Service (append store partitions)
- Projection/Search Services (read models, index aliases)
- KMS / Secrets (per-tenant keys/creds)
- Notification/Webhooks
Prerequisites¶
System Requirements¶
- Onboarding API enabled with RBAC and idempotency support
- KMS, Storage, Projection DB, and Search clusters reachable and quota available
- DNS/Domain verification service (for SAML domains)
- OTel tracing/metrics active for step diagnostics
Business Requirements¶
- Approved edition/plan matrix (limits, features)
- Default policy bundles per edition/region (retention, redaction profiles)
- Residency catalog (allowed regions per tenant)
Performance Requirements¶
- Synchronous intake p95 ≤ 300 ms; async provisioning target < 2 min
- Parallelizable steps (keys/indexes) with bounded concurrency
- Backpressure handling when cluster capacity is constrained
Sequence Flow¶
Happy Path¶
sequenceDiagram
autonumber
actor TA as Tenant Admin
participant GW as API Gateway
participant ONB as Onboarding Service
participant IDP as Identity/IdP
participant POL as Policy Service
participant KMS as KMS/Secrets
participant ST as Storage (Append)
participant PR as Projection DB
participant IX as Search Index
participant NTF as Notifications/Webhooks
TA->>GW: POST /tenants/v1 (tenantSlug, region, edition, idpConfig, adminEmails)
GW->>ONB: CreateTenant (idempotency-key)
ONB->>ONB: Validate & reserve tenantId/slug (unique)
ONB->>IDP: Link IdP / Verify domain (SAML/OIDC/SCIM)
ONB->>POL: Apply default policies (retention/redaction)
par Provision resources
ONB->>KMS: Create tenant key + alias (kid)
ONB->>ST: Create partition/shard & RLS bindings
ONB->>PR: Create schemas (namespaced) & watermarks
ONB->>IX: Create per-tenant index alias/mappings
end
ONB->>ONB: Health checks (readiness of resources)
ONB->>GW: 202 Accepted {tenantId, status:"Provisioning", resumeToken}
ONB->>NTF: Emit Tenant.Provisioned
ONB->>GW: POST /tenants/v1/{tenantId}:activate
ONB->>GW: 200 OK {status:"Active"}
ONB->>NTF: Emit Tenant.Activated + Welcome
Alternative Paths¶
- Deferred IdP linkage: create tenant with local admin; link IdP later via `/link-idp`.
- Pre-provisioned resources: BYO KMS key or existing index namespace accepted when validated.
- Staged activation: keep `status="Provisioned"` until external readiness checks pass.
Error Paths¶
sequenceDiagram
participant TA as Tenant Admin
participant GW as API Gateway
participant ONB as Onboarding Service
participant IDP as Identity/IdP
TA->>GW: POST /tenants/v1 {invalid payload or duplicate slug}
alt 400 Bad Request (invalid/unsupported fields)
GW-->>TA: 400 Problem+JSON
else 409 Conflict (slug/domain already in use)
GW-->>TA: 409 Problem+JSON
else 422 Unprocessable (IdP metadata invalid, domain not verified)
ONB-->>GW: 422 Problem+JSON
GW-->>TA: 422 Problem+JSON
else 503 Dependency unavailable (KMS/Search/DB)
GW-->>TA: 503 Problem+JSON (+Retry-After)
end
Request/Response Specifications¶
Input Requirements¶
| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| Method/Path | `POST /tenants/v1` | Y | Create tenant | JSON body |
| `Authorization` | header | Y | Admin/ops JWT | Role `tenants:create` |
| `idempotency-key` | header | O | De-dupe create | ≤128 chars |
| `tenantSlug` | string | Y | Human slug (`acme`) | `^[a-z0-9-]{3,40}$`, unique |
| `displayName` | string | Y | Tenant display name | 3–100 chars |
| `edition` | enum | Y | `free` \| `standard` \| `enterprise` | allowlist |
| `region` | enum | Y | Residency region | allowlist |
| `idpConfig` | object | O | SAML/OIDC metadata/urls | schema-validated |
| `adminEmails[]` | array | Y | Initial admins | valid emails |
| `webhooks[]` | array | O | Event targets (HMAC) | URL + secret |
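The validation rules in the table can be sketched as a request-body check (the email regex is deliberately coarse, and region validation is omitted because the residency allowlist is defined elsewhere):

```python
import re

SLUG_RE = re.compile(r"^[a-z0-9-]{3,40}$")       # matches the tenantSlug rule above
EDITIONS = {"free", "standard", "enterprise"}
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # coarse check, illustration only

def validate_create_tenant(body: dict) -> list:
    """Return the list of invalid fields; a non-empty list maps to a 400 response."""
    errors = []
    if not SLUG_RE.match(body.get("tenantSlug", "")):
        errors.append("tenantSlug")
    if not 3 <= len(body.get("displayName", "")) <= 100:
        errors.append("displayName")
    if body.get("edition") not in EDITIONS:
        errors.append("edition")
    admins = body.get("adminEmails", [])
    if not admins or not all(EMAIL_RE.match(e) for e in admins):
        errors.append("adminEmails")
    return errors
```

Slug uniqueness and domain binding are server-side checks (409), so they are out of scope for this payload-level sketch.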
Control
- `GET /tenants/v1/{tenantId}` → status (`Provisioning|Provisioned|Active|Failed`), components health
- `POST /tenants/v1/{tenantId}:activate` → promote to `Active`
- `POST /tenants/v1/{tenantId}:link-idp` → attach/replace IdP config
- `POST /tenants/v1/{tenantId}:rotate-keys` → new KMS key version (dual-read window)
Output Specifications¶
| Field | Type | Description | Notes |
|---|---|---|---|
| `tenantId` | string (ULID/GUID) | System identifier | Immutable |
| `tenantSlug` | string | Human label | Unique, mutable with policy |
| `status` | enum | Lifecycle status | see above |
| `kid` | string | Current KMS key id | For integrity/signing |
| `residency` | object | Region/data classes | PII routing policy |
| `policyBundle` | object | Defaults applied | versioned |
| `endpoints` | object | Tenant endpoints/aliases | for SDK setup |
Example Payloads¶
Create Tenant
{
"tenantSlug": "acme",
"displayName": "Acme Corp",
"edition": "enterprise",
"region": "eu-west",
"idpConfig": {
"type": "saml",
"metadataUrl": "https://idp.acme.com/metadata.xml",
"domains": ["acme.com"]
},
"adminEmails": ["secops@acme.com","platform@acme.com"]
}
Create Response (202)
{
"tenantId": "01JF6V3A6W1T6E2TB1C2N2YV9Q",
"tenantSlug": "acme",
"status": "Provisioning",
"resumeToken": "onb_7b2d..."
}
Activate
Error Handling¶
Error Scenarios¶
| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Invalid slug, edition, region; missing admins | Fix payload | — |
| 401 | Missing/invalid admin JWT | Authenticate | Retry after renewal |
| 403 | Plan/edition not allowed for region | Choose allowed combo | — |
| 404 | Unknown `tenantId` (status/activate/link) | Verify id | — |
| 409 | `tenantSlug` or domain already bound to another tenant | Pick new slug / release domain | — |
| 412 | Activation preconditions unmet (resources not healthy) | Wait for ready; fix failing component | Conditional retry |
| 422 | IdP metadata invalid, DNS TXT not verified | Correct & re-submit | — |
| 429 | Create rate-limited | Back off | Exponential backoff + jitter |
| 503 | KMS/Storage/Search unavailable | Retry later | Respect `Retry-After` |
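The 429 and 503 rows above both expect client-side backoff. A minimal sketch of a retry helper that honors `Retry-After` and otherwise falls back to capped exponential backoff with full jitter (the `call` shape returning `(status, headers, body)` is an illustrative assumption, not an SDK contract):

```python
import random
import time


def retry_with_backoff(call, max_attempts=5, base_delay=0.5, cap=30.0):
    """Retry `call` on 429/503, honoring Retry-After when present.

    `call` must return (status, headers, body); header access is
    illustrative of whatever HTTP client is in use.
    """
    for attempt in range(max_attempts):
        status, headers, body = call()
        if status not in (429, 503):
            return status, body
        # Prefer the server's Retry-After hint; otherwise use capped
        # exponential backoff with full jitter.
        retry_after = headers.get("Retry-After")
        if retry_after is not None:
            delay = float(retry_after)
        else:
            delay = random.uniform(0, min(cap, base_delay * 2 ** attempt))
        time.sleep(delay)
    return status, body
```

This sketch assumes `Retry-After` arrives as delta-seconds; a production client would also accept the HTTP-date form.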
Failure Modes¶
- Partial provisioning: some resources created; idempotent reruns resume from checkpoints.
- Cross-tenant leakage risk: misbound index alias or RLS → automated sanity checks block activation.
- IdP domain hijack: require DNS TXT proof + admin email domain match.
Recovery Procedures¶
- Use status API to inspect failing step; rerun with the same `idempotency-key`.
- Roll back or repair mis-provisioned resources (Compensation flow) before activation.
- Re-verify domain/IdP, then call `:activate`.
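Rerunning with the same `idempotency-key` only works if each provisioning step checkpoints on success. A sketch of such a step runner (step names and the in-memory checkpoint store are illustrative; a real implementation would persist checkpoints durably):

```python
def run_provisioning(idempotency_key, steps, checkpoints):
    """Execute provisioning steps, skipping those already checkpointed.

    `checkpoints` maps idempotency_key -> set of completed step names,
    so a rerun with the same key resumes where the last attempt stopped.
    """
    done = checkpoints.setdefault(idempotency_key, set())
    for name, action in steps:
        if name in done:
            continue  # idempotent rerun: step already completed
        action()
        done.add(name)  # checkpoint after each successful step
    return done
```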
Performance Characteristics¶
Latency Expectations¶
- `POST /tenants/v1`: p95 ≤ 300 ms (enqueue & reserve).
- Background provisioning: typically 30–120 s (parallelized steps).
- Activation p95 ≤ 200 ms after readiness.
Throughput Limits¶
- Controlled by cluster quotas; default ≤ 5 concurrent onboardings per region.
Resource Requirements¶
- Onboarding workers sized for parallel KMS/DB/Index operations; cautious with index creation.
Scaling Considerations¶
- Shard provisioning queues by region; backpressure from dependent clusters pauses new starts.
- Pre-create warm pools (schemas/aliases) for popular editions.
Security & Compliance¶
Authentication¶
- Admin/ops endpoints require OIDC JWT; service-to-service with mTLS.
Authorization¶
- Roles: `tenants:create|read|activate|link-idp|rotate-keys`.
- Least-privilege service identities for each provisioning step.
Data Protection¶
- Tenant KMS key per tenant; secrets stored encrypted; residency enforced across storage/search/projections.
- No PII stored beyond admin contacts; audit all operations.
Compliance¶
- Emit `Tenant.Provisioned|Activated|Failed|IdpLinked` events with actor, reason, evidence.
- Residency and key policies attached to tenant record for audits.
Monitoring & Observability¶
Key Metrics¶
| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| `tenant_onboard_started_total` | counter | Onboarding requests | Anomaly trend |
| `tenant_onboard_completed_total` | counter | Successful onboardings | Drop vs start |
| `tenant_onboard_duration_seconds` | histogram | End-to-end time | p95 > 180s |
| `tenant_onboard_step_failures_total{step}` | counter | Failures per step | Spike alert |
| `tenant_activation_gates_open` | gauge | Waiting for readiness | Stuck > 10m |
Logging Requirements¶
- Structured logs with `tenantId`, `tenantSlug`, `step`, `result`, `component`, `retry`, `traceId`. Mask secrets/metadata.
Distributed Tracing¶
- Span per step: `idp.link`, `kms.key.create`, `storage.partition.create`, `projection.schema.create`, `index.alias.create`, `activate`. Include `tenantSlug`, `region`, `edition`.
Health Checks¶
- Readiness depends on KMS, Storage, DB, Search; onboarding worker queue depth monitored.
Operational Procedures¶
Deployment¶
- Deploy Onboarding Service with worker pool and step registry.
- Configure RBAC, KMS access policies, and cluster credentials per region.
- Register default policy bundles and residency maps.
Configuration¶
- Env: `ONB_MAX_CONCURRENCY`, `ONB_REGION_ALLOWLIST`, `ONB_IDP_DOMAIN_TTL`, `ONB_PROVISION_TIMEOUT_S`.
- Policies: default retention/redaction per edition; index templates per region.
Maintenance¶
- Rotate service credentials; rotate default index templates; verify domain verification CA chains.
- Periodic dry runs in staging.
Troubleshooting¶
- 409 slug/domain → list bindings, confirm ownership.
- 422 IdP → validate metadata XML/JWKS, DNS TXT ownership.
- Activation stuck → inspect failing component health; run targeted repair.
Testing Scenarios¶
Happy Path Tests¶
- Create → provision all components → activate → welcome events emitted.
- IdP linked and login works for admin users.
Error Path Tests¶
- 400 invalid payload; 409 duplicate slug/domain; 404 unknown tenant.
- 412 activation blocked until readiness passes; succeeds after fix.
- 422 invalid IdP metadata rejected with clear Problem+JSON.
Performance Tests¶
- Parallel onboardings (N=5) complete within target; no cluster saturation.
- Index/schema creation time within SLO per region.
Security Tests¶
- RLS verified—tenant cannot query others’ data.
- Residency enforced—data and indexes created only in chosen region.
- Audit events present for all steps; secrets never logged.
Related Documentation¶
Internal References¶
- Architecture Overview
- Components
- Data Model — Tenancy Keys & Partitioning
- Multitenancy
- Privacy & Policies
External References¶
- SAML / OIDC specs (metadata, JWKS)
- Regional residency regulations (org policy)
Appendices¶
A. DNS TXT Verification (example)¶
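A hedged illustration of what the onboarding flow expects the tenant to publish (the record label and token format here are assumptions for illustration, not the fixed contract):

```
; TXT record proving control of acme.com before IdP domain binding
_atp-challenge.acme.com.  300  IN  TXT  "atp-domain-verification=<token>"
```

Once the resolver confirms the token, `:link-idp` accepts the domain and `Tenant.IdpLinked` is emitted.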
B. Example Events¶
{
"type": "Tenant.Provisioned",
"tenantId": "01JF6V3A6W1T6E2TB1C2N2YV9Q",
"region": "eu-west",
"kid": "kms:eu-west:acme:v1",
"time": "2025-10-27T08:05:21Z"
}
{
"type": "Tenant.Activated",
"tenantId": "01JF6V3A6W1T6E2TB1C2N2YV9Q",
"time": "2025-10-27T08:06:41Z",
"endpoints": {
"ingest": "https://eu-west.api.atp/ingest/acme",
"query": "https://eu-west.api.atp/query/acme"
}
}
Schema Evolution Flow¶
Rolls out safe, additive schema changes across write and read paths. Publishes vNext to the Schema Registry, advertises availability via SDK/Gateway announcements, runs a dual-write / tolerant-read window (projectors, search), and executes a sunset plan for deprecated fields. Enforces a compatibility matrix to prevent breaking consumers.
Overview¶
Purpose: Introduce new fields/enums without breaking existing producers/consumers; coordinate rollout and rollback with clear guardrails.
Scope: Registry publish & validation → SDK/Gateway announcement → producer feature flag/canary → dual-write (events, projections) → tolerant-read (unknown fields) → metrics/alerts → deprecation & sunset. Excludes large-scale data migrations (covered by backfill runbooks).
Context: Works with Ingestion, Projection, Search, Export, and SDKs. Contracts defined in JSON Schema / Protobuf; REST/gRPC negotiate schema version via headers/metadata.
Key Participants:
- Schema Author (engineer)
- Schema Registry (validation, compatibility rules)
- API Gateway / SDKs (announce, negotiate)
- Producers (writers; may dual-write)
- Consumers (readers; tolerant-read)
- Projection/Search Services (tolerant/readers)
- Ops/Release (flags, canaries)
Prerequisites¶
System Requirements¶
- Schema Registry online with compatibility checks and artifact signing
- CI pipeline to lint/validate schemas (JSON Schema/Protobuf)
- Gateway supports version advertisement headers & graceful negotiation
- Services compiled with tolerant parsers (ignore unknowns; default enums)
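A tolerant parser of the kind required above can be small: drop unknown keys, default unknown enum values. The field and enum names below are illustrative, mirroring the `AuditRecord` examples in this document:

```python
KNOWN_FIELDS = {"Id", "Actor", "Decision", "Geo"}   # this reader's view
KNOWN_DECISIONS = {"Allow", "Deny"}                 # enum values it understands


def tolerant_read(record):
    """Parse a record written by any newer schema version.

    Unknown fields are ignored and unknown enum values default, so an
    older reader keeps working when producers start dual-writing vNext.
    """
    parsed = {k: v for k, v in record.items() if k in KNOWN_FIELDS}
    if parsed.get("Decision") not in KNOWN_DECISIONS:
        parsed["Decision"] = "Unknown"  # default instead of failing
    return parsed
```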
Business Requirements¶
- Compatibility matrix approved (e.g., `vN` write requires readers ≥ `vN-1`)
- Rollout plan (tenants/regions/canaries) and rollback criteria defined
- Deprecation timeline communicated to stakeholders
Performance Requirements¶
- Registry publish p95 ≤ 300 ms; lookup cache TTL tuned
- Dual-write overhead ≤ 10% QPS/egress during window
- No more than 1 additional index refresh per change in Search
Sequence Flow¶
Happy Path¶
sequenceDiagram
autonumber
actor SA as Schema Author
participant CI as CI/CD
participant REG as Schema Registry
participant GW as API Gateway
participant PRD as Producer (Service/SDK)
participant PRJ as Projection Service
participant IDX as Search Index
participant CSM as Consumer (Query/Export)
SA->>CI: Open PR with vNext (add fields/enums)
CI->>REG: Validate & publish draft vNext (compatibility=FORWARD+BACKWARD)
REG-->>CI: OK (artifactId, version=v3, signature)
CI->>GW: Deploy Gateway/SDK announcement (X-Schema-Latest: v3)
PRD->>PRD: Enable canary flag (tenant subset)
PRD->>GW: Writes (dual-write: v2 + v3 metadata)
GW-->>PRD: 202 Accepted (X-Schema-Active: v3)
PRJ->>PRJ: Read tolerant (unknown fields ignored, defaults applied)
IDX->>IDX: Mapping updated (add new fields as optional)
CSM->>GW: Reads (request v2, receives v2) / (request v3, receives v3)
CI->>REG: Promote v3 to stable, start deprecation clock for v1
Alternative Paths¶
- Canary-by-tenant: enable v3 only for `tenant in {acme,beta}`; expand after burn-in.
- Header-only announce: Gateway advertises `X-Schema-Latest` before any producer dual-writes (readers prep first).
- Soft-fail: Producer emits v3-only but Gateway downgrades to v2 for legacy consumers via a transformation map (temporary).
Error Paths¶
sequenceDiagram
participant CI as CI/CD
participant REG as Schema Registry
participant GW as API Gateway
participant PRD as Producer
CI->>REG: Publish vNext (breaking removal/rename)
alt 400 Bad Request (invalid schema)
REG-->>CI: 400 Problem+JSON
else 409 Conflict (compatibility violation)
REG-->>CI: 409 Problem+JSON (matrix failed)
end
PRD->>GW: Write with v3 before announce
GW-->>PRD: 412 Precondition Failed (X-Required-Schema: v2)
Request/Response Specifications¶
Input Requirements (Key Endpoints & Headers)¶
| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| `POST /registry/v1/schemas/{name}/versions` | http | Y | Publish schema vNext | Signed commit |
| `compatibility` | enum | Y | `BACKWARD`, `FORWARD`, `FULL` | policy |
| `X-Schema-Write-Version` | header | O | Producer-declared write version | int ≥ 1 |
| `X-Schema-Read-Version` | header | O | Consumer requested read version | int ≥ 1 |
| `Accept` | header | O | `application/json;profile="#v3"` | negotiated |
| gRPC metadata: `schema-version` | meta | O | Read/write hint | int |
| `idempotency-key` | header | O | Dual-write de-dupe | ≤128 chars |
Output Specifications¶
| Field | Type | Description | Notes |
|---|---|---|---|
| `artifactId` | string | Registry id of version | immutable |
| `version` | int | Published version (e.g., 3) | monotonic |
| `X-Schema-Latest` | header | Latest stable version | set by Gateway |
| `X-Schema-Active` | header | Version currently served | per route/tenant |
| `downgrade` | flag | Whether Gateway transformed response | temporary only |
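The header contract above implies a simple negotiation rule at the Gateway. A sketch (the exact precedence is an assumption; the Gateway spec is authoritative):

```python
def negotiate(requested, latest, downgrade_enabled=True):
    """Pick the version the Gateway serves for one read.

    Returns (served_version, downgraded): `requested` mirrors
    X-Schema-Read-Version, `latest` is the current stable, and
    `downgraded` mirrors the temporary `downgrade` flag above.
    """
    if requested is None:
        return latest, False   # no hint: serve latest stable
    if requested >= latest:
        return latest, False   # cannot serve newer than stable
    if downgrade_enabled:
        return requested, True  # transform response down for legacy reader
    return latest, False        # legacy reader must tolerate latest
```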
Example Payloads¶
Publish vNext (JSON Schema)
POST /registry/v1/schemas/auditrecord/versions
{
"version": 3,
"compatibility": "FULL",
"schema": {
"$id": "urn:atp:auditrecord:v3",
"type": "object",
"properties": {
"Id": {"type":"string"},
"Actor": {"$ref":"urn:atp:actor:v2"},
"Decision": {"$ref":"urn:atp:decision:v1"},
"Geo": {"type":"object","properties":{"Country":{"type":"string"}}}
},
"additionalProperties": false
}
}
Write (dual-write hint)
POST /audit/v1/records
X-Schema-Write-Version: 3
Idempotency-Key: wr_01JF...
Content-Type: application/json
Read (negotiate v2)
Error Handling¶
Error Scenarios¶
| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Invalid schema JSON/Proto; unknown fields without defaults | Fix schema; re-validate | — |
| 401 | Unauthorized schema publish | Authenticate | Retry after renewal |
| 403 | Caller lacks `schemas:publish` or tenant attempting global change | Request access | — |
| 404 | Unknown schema name/version; consumer requests non-existent schema version | Request supported version; update client | — |
| 409 | Compatibility violation vs matrix; mapping collision in search | Adjust change or update matrix; run reindex plan | — |
| 412 | Producer writing vNext before Gateway/Registry mark active | Wait for announce; enable flag after | Conditional |
| 422 | Enum narrowing or field type change detected | Redesign as additive; use new field name | — |
| 429 | Publish rate-limited | Back off | Jittered backoff |
| 503 | Registry/Gateway dependency unavailable | Retry later | Exponential backoff |
Failure Modes¶
- Breaking removal/rename: rejected by Registry; use add + deprecate pattern.
- Dual-write drift: v2 & v3 diverge → enable consistency checkers and fail fast on mismatch.
- Search mapping conflicts: new field analyzer mismatches existing index → create new index alias v3 and reindex.
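The dual-write drift failure mode above calls for a consistency checker that fails fast. A sketch that compares the fields both versions define while tolerating v3's additive extras (field names illustrative):

```python
def check_dual_write(v2_record, v3_record,
                     shared_fields=("Id", "Actor", "Decision")):
    """Fail fast when a dual-written pair diverges on shared fields.

    v3 may carry additive fields (e.g. Geo); only fields defined by
    both versions must match exactly. A real checker would increment
    dual_write_mismatch_total before raising.
    """
    mismatches = [f for f in shared_fields
                  if v2_record.get(f) != v3_record.get(f)]
    if mismatches:
        raise ValueError(f"dual-write mismatch on: {mismatches}")
    return True
```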
Recovery Procedures¶
- Roll back producer flag to v2-only; keep Registry v3 published but inactive.
- If search mapping conflict, cut over to v3 alias after backfill; keep reads tolerant.
- Use Compensation Flow to repair projections that missed new fields during early canary.
Performance Characteristics¶
Latency Expectations¶
- Version negotiation adds ≤ 1 ms at Gateway (header processing/cache).
- Registry lookup cached; cache miss p95 ≤ 50 ms.
Throughput Limits¶
- Dual-write increases write amp; restrict to canary tenants initially.
- Reindex/backfill throttled per-tenant to protect cluster SLOs.
Resource Requirements¶
- Registry store for versions & metadata; small footprint per artifact.
- Backfill/reindex workers sized to edition limits.
Scaling Considerations¶
- Per-tenant activation gates; gradual region waves.
- Keep old readers working via tolerant-read and optional downgrade transforms (temporary only).
Security & Compliance¶
Authentication¶
- OIDC/JWT for publish & toggle APIs; mTLS service-to-service.
Authorization¶
- Roles: `schemas:publish`, `schemas:promote`, `schemas:deprecate`, `schemas:read`.
- Only release managers can promote to stable or start sunset.
Data Protection¶
- Signed artifacts; checksum headers; registry enforces immutability.
- No PII stored in schema metadata beyond author id.
Compliance¶
- Audit events: `Schema.Published|Promoted|Activated|Deprecated|SunsetCompleted` with actor & diff.
- Backward/forward compatibility reports attached to change record.
Monitoring & Observability¶
Key Metrics¶
| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| `schema_publish_total` | counter | Versions published | Spike analysis |
| `schema_compat_fail_total` | counter | Registry rejects | >0 sustained |
| `schema_negotiations_total` | counter | Gateway negotiations | Trend |
| `dual_write_mismatch_total` | counter | v2 vs v3 mismatch | Any > 0 |
| `reader_unknown_field_rate` | counter | Unknowns seen by readers | Spike |
| `search_reindex_progress` | gauge | Backfill completion | Stalls |
Logging Requirements¶
- Structured logs: `schema`, `fromVersion`, `toVersion`, `tenant`, `compatMode`, `result`, `traceId`.
Distributed Tracing¶
- Spans: `registry.validate`, `gateway.negotiate`, `producer.dualwrite`, `projection.tolerant-read`, `search.mapping.update`.
Health Checks¶
- Registry readiness (DB/object store); Gateway cache health; index template availability.
Operational Procedures¶
Deployment¶
- Deploy/upgrade Registry with compatibility policies.
- Enable Gateway negotiation & headers; roll SDKs with version awareness.
- Register CI checks (lint/compat) and block merges on failure.
Configuration¶
- Env: `SCHEMA_COMPAT_MODE=FULL`, `SCHEMA_CACHE_TTL=300s`, `SCHEMA_DOWNGRADE_ENABLED=true` (temporary).
- Flags: `feature.auditrecord.v3.enabled`, `feature.search.mapping.v3.enabled`.
Maintenance¶
- Periodic cleanup of deprecated versions after sunset window.
- Rotate registry signing keys; verify artifact signatures in CI.
Troubleshooting¶
- 409 compatibility failures → inspect matrix report; adjust plan to additive-only.
- Reader errors on unknown fields → ensure tolerant-read; verify SDK versions.
- Search failures → create new alias with updated template; reindex flow.
Testing Scenarios¶
Happy Path Tests¶
- Publish v3 (additive); Gateway advertises; producer dual-writes; readers tolerant; promote to stable.
- Search mapping updated; index accepts new field; dashboards reflect new attribute.
Error Path Tests¶
- 400 invalid schema rejected; 404 unknown version on read; 409 matrix violation blocked.
- 412 write blocked before announce; passes after activation.
Performance Tests¶
- Dual-write adds ≤ 10% overhead; Gateway negotiation ≤ 1 ms p95.
- Reindex completes within planned window without SLO breach.
Security Tests¶
- Only the `schemas:promote` role can activate vNext; artifacts signed/verified.
- Audit events emitted for publish/promote/deprecate.
Related Documentation¶
Internal References¶
- Data Model — Schema Evolution & Compatibility
- Event Contracts
- Search Index Schema
- Compensation Flow
External References¶
- JSON Schema / Protobuf compatibility guides
Appendices¶
A. Compatibility Matrix (excerpt)¶
| Change Type | Backward | Forward | Allowed |
|---|---|---|---|
| Add optional field | ✓ | ✓ | Yes |
| Add enum value | ✓* | ✓ | Yes (readers must default) |
| Remove field | ✗ | ✗ | No (use deprecate) |
| Change type (string→int) | ✗ | ✗ | No (new field) |
| Widen type (int32→int64) | ✓* | ✓ | Yes with defaults |
\* Requires tolerant-read or defaulting behavior.
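The matrix above can be enforced mechanically at publish time. A sketch for flat JSON-Schema-style property maps, emitting violations in the same shape as this document's Problem+JSON example (the additive-only policy is simplified; real Protobuf/JSON Schema checks are richer):

```python
def check_compatibility(old_props, new_props):
    """Reject removals and type changes; allow additive optional fields.

    `old_props`/`new_props` map property name -> type string.
    Returns a list of violations; an empty list means compatible.
    """
    violations = []
    for name, old_type in old_props.items():
        if name not in new_props:
            violations.append({"path": f"$.{name}", "rule": "field-removal"})
        elif new_props[name] != old_type:
            violations.append({"path": f"$.{name}", "rule": "type-change"})
    return violations
```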
B. Problem+JSON (compatibility violation)¶
{
"type": "urn:connectsoft:errors/schema/compatibility-violation",
"title": "Schema change is not backward compatible",
"status": 409,
"detail": "Removing field 'Decision' breaks existing consumers.",
"violations": [
{"path":"$.Decision", "rule":"field-removal"}
]
}
Configuration Update Flow¶
Safely rolls out configuration changes using validation (dry-run), staged rollout (feature flags/canaries), hot reload in services, and automatic verification / rollback with strict blast-radius controls. Separates config from secrets; every change is audited and idempotent.
Overview¶
Purpose: Apply config changes without disrupting tenants, maintaining SLOs and isolation.
Scope: Propose → validate (schema & semantic) → stage → canary rollout → service reload → verify (metrics/health) → promote or rollback. Excludes secret rotation (covered elsewhere).
Context: Config is stored in a Config Registry/Repo, announced via Config Service, consumed by Gateway/Ingestion/Projection/Search/Export at runtime with hot reload or restart on failure.
Key Participants:
- Operator / CI/CD
- Config Registry/Repo (GitOps or API)
- Config Service (distribution, versioning, audits)
- Feature Flag Service (progressive exposure)
- Target Services (Gateway / Ingestion / …)
- Observability (metrics/logs/traces)
- Orchestrator (deploy hooks for restarts if needed)
Prerequisites¶
System Requirements¶
- Config schemas (JSON Schema/Protobuf) with server-side validation and dry-run execution
- Feature flag platform for canary/percentage/segment rollouts
- Services implement hot reload endpoint or SIGHUP handler and config guards (shadow config)
- Config Service supports versioning, idempotency, and RBAC
Business Requirements¶
- Change approval workflow (CAB) with blast-radius assessment
- Runbooks & rollback plans linked to config keys
- Tenant/edition-aware defaults to prevent cross-tenant leakage
Performance Requirements¶
- Validation p95 ≤ 200 ms; distribution to all pods p95 ≤ 60 s
- Hot reload p95 ≤ 250 ms per service; zero-downtime guarantee
- Verification window (post-change) default 5–15 min with auto-rollback gates
Sequence Flow¶
Happy Path¶
sequenceDiagram
autonumber
actor OP as Operator/CI
participant REG as Config Registry/Repo
participant CFG as Config Service
participant FF as Feature Flag Service
participant SVC as Target Services
participant OBS as Observability
OP->>REG: Propose config vNext (PR/ChangeSet)
REG->>CFG: Validate (schema + semantic dry-run)
CFG-->>REG: OK (change-id, version=v17)
OP->>FF: Stage flag "cfg.v17.enabled=false" (guard)
OP->>CFG: Apply v17 (scope: canary tenants/perc=5%)
CFG->>SVC: Distribute v17 (signed, If-None-Match)
SVC->>SVC: Hot reload, shadow compare, begin verification
SVC-->>OBS: Emit KPIs (errors/latency/health)
OBS-->>CFG: Verification passed (within SLO)
OP->>FF: Ramp to 50% → 100%
CFG->>SVC: Finalize v17 (active for all)
CFG-->>REG: Promote v17 to Active, close change
Alternative Paths¶
- Flag-only change: no new config payload; toggle flag segments to roll out behavior changes.
- Tenant-staged rollout: enable by region/tenant/edition gates before global activation.
- Restart-required: services lacking hot reload receive orchestrated rolling restart with readiness guards.
Error Paths¶
sequenceDiagram
participant OP as Operator
participant REG as Config Registry
participant CFG as Config Service
participant SVC as Target Services
OP->>REG: Submit invalid config (schema fail)
REG-->>OP: 400 Bad Request (Problem+JSON)
OP->>CFG: Apply v17 (unknown key/scope)
CFG-->>OP: 404 Not Found (key/scope)
OP->>CFG: Apply while v16 rollout in-progress
CFG-->>OP: 409 Conflict (change in progress)
CFG->>SVC: Distribute v17
SVC-->>CFG: 503 Service Unavailable (reload guard failed)
CFG->>CFG: Auto-rollback to v16, raise alert
Request/Response Specifications¶
Input Requirements (APIs)¶
| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| POST /ops/v1/config/validate | http | Y | Dry-run schema & semantic checks | JSON body |
| POST /ops/v1/config/apply | http | Y | Apply version with scope/strategy | RBAC + idempotent |
| `changeId` | string | Y | Unique change identifier | ULID/UUID |
| `version` | int | Y | Candidate version | monotonic |
| `scope` | object | O | `{tenants, regions, editions, percent}` | allowlists |
| `strategy` | object | O | `{mode: canary \| all, ramp: [5,50,100], verifyMins: 10}` | sane ranges |
| `preconditions.etag` | string | O | CAS guard | matches head |
| `reason` | string | Y | Change reason | 1–256 chars |
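A `scope.percent` rollout needs a sticky, deterministic assignment of tenants to the canary cohort. One common sketch hashes the tenant id salted with the change id (hash choice and bucket math are assumptions, not the ATP contract):

```python
import hashlib


def in_canary(tenant_id, change_id, percent):
    """Deterministically place a tenant in the first `percent` buckets.

    Salting with change_id keeps cohorts independent across changes,
    so the same tenants are not always the canary.
    """
    digest = hashlib.sha256(f"{change_id}:{tenant_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent
```

Ramping 5 → 50 → 100 then just raises `percent`; tenants already in the cohort stay in it.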
Output Specifications¶
| Field | Type | Description | Notes |
|---|---|---|---|
| `status` | enum | `Validated \| Applying \| Partial \| RolledBack \| Active \| Failed` | lifecycle |
| `activeVersion` | int | Current active config version | — |
| `appliedTo` | object | Effective scope (tenants/percent) | resolved |
| `verification` | object | KPIs & window state | pass/fail |
| `rollbackToken` | string | Token to execute rollback | TTL-bound |
Example Payloads¶
Validate
POST /ops/v1/config/validate
{
"changeId": "chg_01JF8C6Q...",
"version": 17,
"payload": { "Ingestion": { "MaxBatchBytes": 1048576 } }
}
Apply (canary 5%)
POST /ops/v1/config/apply
{
"changeId": "chg_01JF8C6Q...",
"version": 17,
"scope": { "percent": 5, "regions": ["eu-west"] },
"strategy": { "mode": "canary", "ramp": [5,50,100], "verifyMins": 10 },
"preconditions": { "etag": "v16-etag" },
"reason": "Lower ingest batch size to reduce p99"
}
Service Hot Reload Contract
Error Handling¶
Error Scenarios¶
| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Schema/semantic validation failed | Fix payload; re-validate | — |
| 401 | Missing/invalid token | Authenticate | Retry after renewal |
| 403 | Caller lacks `config:apply` | Request access | — |
| 404 | Unknown config key/version/scope | Verify ids; fetch latest | — |
| 409 | Concurrent change in progress; ETag mismatch | Wait; retry with latest ETag | Conditional retry |
| 412 | Preconditions failed (guardrails) | Adjust scope/strategy | — |
| 422 | Semantic violation (unsafe value range) | Choose safe value | — |
| 429 | Apply rate-limited | Back off | Exponential + jitter |
| 503 | Target service not ready/reload failure | Auto-rollback; investigate | Retry after health OK |
Failure Modes¶
- Blast radius: global apply without canary → guarded by policy (requires staged rollout).
- Config drift: some pods on v16, others v17 → Config Service reconciles until convergence.
- Hot reload hazards: partial initialization using new values → shadow config & atomic swap.
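The shadow-config-and-atomic-swap guard mentioned above can be sketched in a few lines: the candidate is built and validated off to the side, then made live with a single reference assignment, so readers never observe a half-initialized config:

```python
import threading


class ConfigHolder:
    """Hot reload via shadow build + atomic swap (illustrative).

    If validation fails, the active version stays untouched, which is
    the rollback behavior described above.
    """

    def __init__(self, initial):
        self._lock = threading.Lock()
        self._active = dict(initial)

    def get(self):
        return self._active  # single reference read

    def reload(self, candidate, validate):
        shadow = dict(candidate)   # build fully before exposing
        validate(shadow)           # reject before any reader sees it
        with self._lock:
            self._active = shadow  # atomic swap
```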
Recovery Procedures¶
- Trigger auto-rollback via policy gate failure; restore `activeVersion` to previous.
- Freeze changes (global mute) and open incident; evaluate metrics & logs.
- Re-run apply with reduced scope or adjusted values after fix.
Performance Characteristics¶
Latency Expectations¶
- Validation p95 ≤ 200 ms; distribution to pods ≤ 60 s; hot reload ≤ 250 ms.
Throughput Limits¶
- Max N parallel applies per region (e.g., 1); queue subsequent changes.
Resource Requirements¶
- Config Service cache/ETag store; signed bundles; modest CPU for validation.
Scaling Considerations¶
- Shard config topics per service/region; CDN or sidecar cache for large payloads.
- Prefer delta distribution over full bundle for frequent small tweaks.
Security & Compliance¶
Authentication¶
- OIDC JWT for ops APIs; mTLS service-to-service.
Authorization¶
- Roles: `config:validate`, `config:apply`, `config:rollback`, `config:read`.
- Tenant/edition scoping enforced at apply time.
Data Protection¶
- No secrets in config; secrets managed via dedicated Secrets Service/KMS.
- Signed config bundles (checksum, signature) verified by services.
Compliance¶
- Audit events: `Config.Validated|Applied|Promoted|RolledBack` with actor, diff, scope, reason.
- Change records linked to incident/ticket ids.
Monitoring & Observability¶
Key Metrics¶
| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| `config_apply_total` | counter | Applies by result | Spike in failures |
| `config_active_version` | gauge | Current active version | Unexpected regress |
| `config_rollbacks_total` | counter | Auto/manual rollbacks | >0 sustained |
| `config_distribution_lag_seconds` | histogram | Registry→pod lag | p95 > 60s |
| `service_config_reload_failures_total` | counter | Reload errors | Any > 0 |
Logging Requirements¶
- Structured logs: `changeId`, `version`, `service`, `scope`, `strategy`, `result`, `traceId`, `rollbackToken`.
Distributed Tracing¶
- Spans: `config.validate`, `config.apply`, `service.reload`, `verify.window`. Include `changeId` & `version`.
Health Checks¶
- Readiness includes config freshness (expected vs actual version).
- Synthetic probes after apply to confirm behavior.
Operational Procedures¶
Deployment¶
- Deploy Config Service (HA) with schema validators & signing keys.
- Enable hot reload endpoints in services; wire feature flag SDK.
- Configure GitOps or Ops API pipeline with approval gates.
Configuration¶
- Env: `CFG_APPLY_CONCURRENCY=1`, `CFG_VERIFY_WINDOW=10m`, `CFG_MAX_SCOPE_PERCENT=10`, `CFG_REQUIRE_FLAG_GUARD=true`.
- Policies: mandatory canary for high-risk keys; deny global applies during peak.
Maintenance¶
- Rotate signing keys; prune deprecated keys; rehearse rollback drills quarterly.
Troubleshooting¶
- Apply stuck → check distribution lag metrics & queue; verify RBAC/ETag.
- Errors spike post-apply → auto-rollback should trigger; confirm guardrail worked.
- Only subset updated → reconcile loop; investigate failing pods’ reload logs.
Testing Scenarios¶
Happy Path Tests¶
- Validate → apply to 5% → verify → ramp to 100% with no SLO breach.
- Hot reload succeeds across services; config version converges.
Error Path Tests¶
- 400 invalid payload rejected; 404 unknown key; 409 concurrent apply guarded.
- 503 reload failure triggers automatic rollback.
Performance Tests¶
- Distribution completes ≤ 60 s across 200 pods; reload p95 ≤ 250 ms.
- Multiple small deltas do not exceed CPU/network budgets.
Security Tests¶
- Only the `config:apply` role can promote; signatures verified; audits present.
- No secrets present in config payloads.
Related Documentation¶
Internal References¶
External References¶
- Progressive Delivery / Feature Flags best practices
Appendices¶
A. Canary Strategy (YAML)¶
strategy:
mode: canary
ramp: [5, 25, 50, 100]
verify:
window: 10m
guards:
- metric: atp_ingest_errors_ratio
threshold: "< 0.5%"
- metric: atp_projection_lag_seconds
threshold: "< 60"
- metric: health_readiness_status
threshold: "== 1"
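The guards above are plain comparison expressions over sampled metrics. A sketch of how a verification window might evaluate them (the `"< 0.5%"` threshold grammar is an assumption drawn from the YAML above):

```python
import operator

OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt,
       ">=": operator.ge, "==": operator.eq}


def guard_passes(metric_value, threshold):
    """Evaluate a guard like "< 0.5%" or "== 1" against one sample."""
    op_token, raw = threshold.split(None, 1)
    limit = float(raw.rstrip("%"))
    if raw.endswith("%"):
        limit /= 100.0  # "0.5%" -> 0.005
    return OPS[op_token](metric_value, limit)


def verify_window(samples, guards):
    """All guards must hold for the rollout step to be promoted."""
    return all(guard_passes(samples[g["metric"]], g["threshold"])
               for g in guards)
```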
B. Problem+JSON Examples¶
{
"type": "urn:connectsoft:errors/config/invalid",
"title": "Invalid configuration payload",
"status": 400,
"detail": "Ingestion.MaxBatchBytes exceeds allowed maximum."
}
{
"type": "urn:connectsoft:errors/config/conflict",
"title": "Change conflict",
"status": 409,
"detail": "Another change chg_01JF8B... is applying.",
"currentChangeId": "chg_01JF8B..."
}
Backup & Recovery Flow¶
Implements durable backups (snapshots/exports) with integrity verification and WORM-secure storage, plus periodic recovery drills that prove RPO/RTO objectives are met. Covers append store, projections, and search indexes with consistent cutover points and tenant-aware restores. Evidence of successful restore is captured and audited.
Overview¶
Purpose: Guarantee recoverability of tenant data with defined RPO/RTO and cryptographic proof of integrity.
Scope: Scheduled/on-demand backups → snapshot/export → sign/verify → store in immutable object storage → catalog → recovery drills (sandbox restore + validation) → reporting. Excludes hot replicas (covered by HA).
Context: Orchestrated by Backup Service. Sources: Storage (Append/WORM), Projection DB, Search Index. Targets: Object Store (WORM/Object Lock) with tenant/region prefixes and KMS encryption.
Key Participants:
- Backup Scheduler/Service (orchestrator)
- Storage (Append Store) / Projection DB / Search Index
- Integrity Service (hash/Merkle proofs)
- Object Store (WORM) with KMS
- Drill Runner (restore validator)
- Ops / Compliance (approvals, reports)
Prerequisites¶
System Requirements¶
- Snapshot/backup endpoints enabled for all data planes (append/projection/index)
- Object store with WORM/Object Lock & lifecycle policies; mTLS + signed URLs
- Integrity Service available for proof computation/verification
- Catalog/Manifest registry with index of recovery points
Business Requirements¶
- Tenant residency & encryption policies mapped to backup targets
- Defined RPO (e.g., ≤ 15 min) and RTO (e.g., ≤ 60 min) per edition
- Drill cadence (e.g., monthly per region; quarterly per tenant sample) and evidence requirements
Performance Requirements¶
- Backup windows avoid peak hours; bandwidth caps per region/tenant
- Incremental backups preferred; fulls on weekly cadence
- Verification completes within a bounded fraction of the backup duration (target ≤ 30%)
Sequence Flow¶
Happy Path¶
sequenceDiagram
autonumber
participant SCH as Scheduler
participant BAK as Backup Service
participant ST as Storage (Append)
participant PR as Projection DB
participant IX as Search Index
participant INT as Integrity Service
participant OBJ as Object Store (WORM)
participant CAT as Catalog/Manifest
participant DR as Drill Runner
SCH->>BAK: Trigger Backup (policy, scope, type=incremental)
BAK->>ST: Consistent snapshot/export (cutover @ T)
BAK->>PR: Projection dump @ watermark<=T
BAK->>IX: Index snapshot (optional or template)
BAK->>INT: Compute hashes/Merkle root + sign (kid)
INT-->>BAK: Proof bundle {root, signature, kid}
BAK->>OBJ: Upload packages (JSONL/Parquet/SQL), proofs, manifest (WORM)
BAK->>CAT: Register Recovery Point (RP-2025-10-27T08:00Z)
BAK-->>SCH: Success {recoveryPointId, sizes, proof}
SCH->>DR: Schedule recovery drill (sandbox)
DR->>OBJ: Fetch packages + manifest
DR->>INT: Verify proofs/signatures
DR->>ST: Restore append, reproject read models
DR-->>SCH: Drill report (RPO/RTO met, sample checks OK)
Alternative Paths¶
- On-demand tenant backup: operator requests scoped backup for a single tenant; catalog marks it tenant-scoped.
- Warm-standby region: ship encrypted copies to secondary region with residency-allowed classes only.
- Indexless restore: restore append store and rebuild projections/search from facts to reduce backup volume.
Error Paths¶
sequenceDiagram
participant BAK as Backup Service
participant OBJ as Object Store
participant INT as Integrity
participant CAT as Catalog
BAK->>OBJ: PUT package (network issue)
alt 503 Storage unavailable
BAK-->>BAK: Retry with backoff, pause schedule if persistent
else 409 Conflict (WORM retention/exists)
BAK-->>BAK: Switch to new key (timestamped), update manifest
end
BAK->>INT: Compute proof
alt Proof mismatch
INT-->>BAK: 422 Unprocessable (hash mismatch)
BAK-->>CAT: Mark recovery point FAILED, alert
end
Request/Response Specifications¶
Input Requirements (APIs)¶
| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| POST /ops/v1/backups | http | Y | Start backup | RBAC `backup:start` |
| `scope` | object | Y | `{tenants:[], regions:[], dataClasses:[]}` | allowlists/residency |
| `type` | enum | Y | `full \| incremental` | — |
| `cutover` | RFC3339 | O | Desired snapshot time | ≤ now |
| `retentionDays` | int | O | Override default retention | ≤ policy max |
| POST /ops/v1/restores | http | Y | Start restore/drill | RBAC `backup:restore` |
| `recoveryPointId` | string | Y | Catalog id | exists |
| `mode` | enum | Y | `sandbox \| production` | — |
| `target` | object | O | `{tenantId?, region}` | valid & empty slot |
| `verifyPolicy` | object | O | sampling, row-counts, checksums | schema |
Output Specifications¶
| Field | Type | Description | Notes |
|---|---|---|---|
| `recoveryPointId` | string | Unique id for backup | sortable by time |
| `manifestUrl` | url | Signed URL to manifest | time-limited |
| `proof` | object | `{merkleRoot, signature, kid}` | integrity |
| `sizes` | object | bytes per package | budgeting |
| `restoreJobId` | string | Track restore/drill | status API |
Example Payloads¶
Start Backup
POST /ops/v1/backups
{
"scope": { "regions": ["eu-west"], "tenants": ["acme"] },
"type": "incremental",
"retentionDays": 30
}
Catalog Manifest (excerpt)
{
"recoveryPointId": "RP-2025-10-27T08:00:00Z-eu-west-acme",
"time": "2025-10-27T08:00:00Z",
"packages": [
{"name":"append-0001.jsonl","sha256":"...","bytes": 73482910},
{"name":"projection.sql","sha256":"...","bytes": 2183412}
],
"merkleRoot": "b3f3…",
"signature": "MEUCIQ…",
"kid": "kms:eu-west:tenant/acme:v3",
"watermark": "2025-10-27T07:59:58Z"
}
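The `merkleRoot` in the manifest is derived from the per-package `sha256` values. The sketch below is one plausible construction, assuming SHA-256 over sorted leaf hashes with pairwise concatenation and duplication of the last node on odd levels; the authoritative tree rules live in the Integrity spec.

```python
import hashlib

def merkle_root(leaf_hashes: list[str]) -> str:
    """Fold hex-encoded package hashes into a single Merkle root (hex)."""
    if not leaf_hashes:
        raise ValueError("manifest must list at least one package")
    # Sort leaves so the root is independent of package listing order (assumption).
    level = [bytes.fromhex(h) for h in sorted(leaf_hashes)]
    while len(level) > 1:
        if len(level) % 2 == 1:          # duplicate last node on odd-sized levels
            level.append(level[-1])
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()

# Placeholder leaf hashes standing in for the manifest's package sha256 values.
packages = [
    hashlib.sha256(b"append-0001.jsonl").hexdigest(),
    hashlib.sha256(b"projection.sql").hexdigest(),
]
root = merkle_root(packages)
```

The Backup Service would pass this root to the Integrity Service for signing (`kid`), and the Drill Runner recomputes it from the downloaded packages to verify the proof bundle.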
Start Restore (Sandbox)
POST /ops/v1/restores
{
"recoveryPointId": "RP-2025-10-27T08:00:00Z-eu-west-acme",
"mode": "sandbox",
"target": { "region": "eu-west" },
"verifyPolicy": { "rowCounts": true, "samplePercent": 5, "proofs": true }
}
Error Handling¶
Error Scenarios¶
| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Invalid scope/type/cutover; residency mismatch | Correct payload/policy | — |
| 401 | Missing/invalid token | Authenticate | Retry after renewal |
| 403 | Caller lacks `backup:*` or `restore:*` | Request access | — |
| 404 | Unknown `recoveryPointId` or package missing | Choose valid point; investigate catalog | — |
| 409 | Restore already in progress for target / resource lock | Wait or choose new target | Conditional retry |
| 412 | Preconditions not met (sandbox not empty; legal hold prevents overwrite) | Satisfy preconditions / choose sandbox | — |
| 423 | Target locked (admin lock/maintenance) | Release lock | Retry |
| 429 | Region throughput/backups rate-limited | Back off | Exponential + jitter |
| 503 | Object store/Integrity service unavailable | Retry later | Bounded retries with backoff |
Failure Modes¶
- Inconsistent cutover: sources not frozen → use watermark T and quiesce writes for snapshot window.
- WORM conflict: attempting overwrite before retention expires → versioned keys; never mutate existing.
- Silent corruption: block-level issues → end-to-end checksums + Merkle proofs required; drill detects.
Recovery Procedures¶
- Re-run backup with quiesce (short write pause or log-based incremental with LSN).
- For failed proof, invalidate recovery point and alert; run full backup next window.
- During restore, rebuild projections and search from append facts if projection package absent or stale.
Performance Characteristics¶
Latency Expectations¶
- Catalog publish p95 ≤ 1 s; proof computation bounded by package size (parallelizable).
- Drill restore: RTO target (e.g., ≤ 60 min for medium tenants) including re-projection.
Throughput Limits¶
- Per-region bandwidth caps (e.g., ≤ 200 MB/s aggregate); per-tenant rate caps to avoid noisy neighbors.
Resource Requirements¶
- Temporary staging disk for package creation; CPU for hashing; memory for buffering; KMS for signing.
Scaling Considerations¶
- Incremental forever + periodic synth full to limit restore chains.
- Shard backups by tenant/shard and time slots to flatten I/O.
Security & Compliance¶
Authentication¶
- Ops endpoints via OIDC; service-to-object store via mTLS and scoped IAM roles.
Authorization¶
- Roles: `backup:start|read|restore|drill|approve`. Production restore requires two-person approval.
Data Protection¶
- KMS encryption at rest; WORM/Object Lock with retention & legal hold support; signed manifests/proofs.
- Residency: copy only to regions allowed for each data class; encryption at rest means additional PII masking is not required, but residency and handling policies still apply.
Compliance¶
- Evidence pack: drill reports, manifest, proof verification, timing → archived for audits.
- Legal holds honored—restore does not violate purge blocks; backups include hold metadata.
Monitoring & Observability¶
Key Metrics¶
| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| `backup_runs_total{result}` | counter | Backups by result | Failures > baseline |
| `backup_bytes_total` | counter | Total bytes uploaded | Sudden drop/spike |
| `backup_duration_seconds` | histogram | Backup wall time | p95 > SLO |
| `restore_duration_seconds` | histogram | Drill/restore time | p95 > RTO |
| `backup_proof_failures_total` | counter | Integrity verification failures | Any > 0 |
| `rpo_effective_seconds` | gauge | Now − last successful cutover | > target |
| `rto_drill_pass_rate` | gauge | % drills meeting RTO | < target |
Logging Requirements¶
- Structured logs: `recoveryPointId`, `tenant`, `region`, `sizes`, `hash`, `kid`, `result`, `traceId`, `rpo`, `rto`.
Distributed Tracing¶
- Spans for `snapshot`, `package.upload`, `proof.compute`, `proof.verify`, `restore.apply`, `reprojection.run`.
Health Checks¶
- Readiness of object store, KMS, Integrity; catalog consistency checks (manifest ↔ objects).
Operational Procedures¶
Deployment¶
- Deploy Backup Service (HA) with schedulers and workers per region.
- Configure object store buckets with Object Lock (compliance mode) and lifecycle.
- Register policies (cadence, scope, RPO/RTO) per edition.
Configuration¶
- Env: `BACKUP_WINDOW=02:00-05:00`, `BACKUP_MAX_BW_MBPS`, `BACKUP_TYPE=incremental`, `BACKUP_VERIFY=true`.
- Policies: weekly full, daily incremental; monthly drill per region.
Maintenance¶
- Rotate KMS keys; test restore runbooks quarterly; refresh lifecycle policies and retention.
Troubleshooting¶
- Missing package → verify catalog vs. object listing; re-upload if upload was interrupted.
- Proof mismatch → recalc locally; if persistent, invalidate RP and run full backup.
- RTO miss → profile slow steps (download bandwidth, reprojection speed) and optimize.
Testing Scenarios¶
Happy Path Tests¶
- Scheduled incremental backup creates catalog entry with valid proofs.
- Monthly drill restores to sandbox, reprojects, and meets RTO.
Error Path Tests¶
- 400 invalid scope rejected; 404 unknown `recoveryPointId`; 409 concurrent restore blocked.
- 503 object store outage triggers retries and eventual success/fail with alert.
Performance Tests¶
- Backup completes within window; verify overhead does not breach SLOs.
- Drill on medium tenant completes within RTO under load.
Security Tests¶
- WORM enforced—no overwrite/delete within retention; manifests signed & verified.
- Access controls prevent cross-tenant reads of backup artifacts.
Related Documentation¶
Internal References¶
External References¶
- Object Lock/WORM (vendor docs)
- NIST SP 800-34 (Contingency Planning)
Appendices¶
A. Example Object Store Bucket Policy (WORM)¶
{
"ObjectLockEnabled": "Enabled",
"Rules": [{
"DefaultRetention": { "Mode": "COMPLIANCE", "Days": 30 }
}]
}
B. Recovery Drill Checklist¶
- Select latest valid `recoveryPointId` for target region/tenant.
- Provision isolated sandbox (no outbound webhooks).
- Restore append → reproject → (optional) reindex.
- Verify counts (rows/events) & sample diffs; verify proofs.
- Capture RTO and evidence; archive report; clean up sandbox.
C. Problem+JSON (example)¶
{
"type": "urn:connectsoft:errors/backup/recovery-point-not-found",
"title": "Recovery point not found",
"status": 404,
"detail": "RP-2025-10-27T08:00:00Z-eu-west-acme does not exist or is invalid."
}
Load Balancing Flow¶
Distributes incoming traffic fairly across healthy service instances using L7/L4 load balancing, with optional affinity (cookie/hash) for sticky paths and standard stateless routing for idempotent calls. Includes multi-region routing (geo/DNS/anycast) with residency and failover policies. Integrates with health checks, rate limiting, and circuit breakers.
Overview¶
Purpose: Balance requests to healthy backends, maximize utilization, and minimize latency while enforcing tenant isolation and residency.
Scope: Edge routing (DNS/anycast) → Regional LB/Ingress (L7) → per-service pools with health/affinity → response path and headers. Excludes per-tenant throttling logic (covered by Gateway rate limiting).
Context: Client enters via Global LB (GSLB/Anycast), then Regional L7 LB/Ingress/Gateway (Envoy/Nginx/API GW) that selects a backend (Ingestion/Query/Export).
Key Participants:
- Client
- Global Traffic Manager (GTM) (GeoDNS/Anycast)
- Regional L7 LB / API Gateway
- Target Service Pool (Ingestion / Query / Export)
- Health Check / Discovery
- Observability (metrics/logs/traces)
Prerequisites¶
System Requirements¶
- Edge TLS termination with modern ciphers; optional end-to-end mTLS to services
- Active+passive health checks (HTTP/gRPC/TCP) with outlier detection
- Service discovery (EDS/SD) with instance metadata:
{region, shard, edition} - Circuit Breaker and connection pools configured per service/route
Business Requirements¶
- Residency policy maps tenants → allowed regions
- Edition/plan may influence weights (e.g., enterprise canary lanes)
- Documented sticky vs stateless routes (e.g., Query=stateless, Export job UI=sticky)
Performance Requirements¶
- End-to-end added LB latency p95 ≤ 5 ms (regional), ≤ 20 ms (global routing)
- Per-service concurrency/connection limits defined; surge queue bounded
- Balancing algorithm chosen per route: least-request, weighted RR, ring-hash (affinity)
Sequence Flow¶
Happy Path¶
sequenceDiagram
autonumber
participant C as Client
participant GTM as Global Traffic Manager (GeoDNS/Anycast)
participant L7 as Regional L7 LB / API Gateway
participant S as Service Pool (e.g., Ingestion)
participant HC as Health/Discovery
C->>GTM: Resolve api.atp.example (Geo/latency policy)
GTM-->>C: Regional VIP (eu-west)
C->>L7: HTTPS request (Host: api.atp.example)
L7->>HC: Get healthy endpoints & weights
L7->>S: Route to least-loaded healthy instance (affinity if provided)
S-->>L7: 200 OK (payload)
L7-->>C: 200 OK + headers (X-Region, X-Backend-Id, Server-Timing)
Alternative Paths¶
- Sticky (affinity) routing: LB sets `atp_affinity` cookie or uses ring-hash on `X-Sticky-Key`/`tenantId` for session locality.
- Multi-region: GTM favors the closest allowed region; on regional brownout, fail over to the next policy region.
- Canary/weighted: subset traffic (5%) routed to canary pool via header or flag for progressive delivery.
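A deterministic way to implement the weighted canary split above is to hash a stable request key onto `[0, 100)` so a given client consistently lands in the same pool. The pool names and 5% weight follow the example above; the hash choice is an assumption:

```python
import hashlib

def pick_pool(sticky_key: str, canary_weight_pct: int = 5) -> str:
    """Map a sticky key onto [0, 100) and send the low slice to the canary pool."""
    h = int.from_bytes(hashlib.sha256(sticky_key.encode()).digest()[:8], "big")
    return "query-canary" if h % 100 < canary_weight_pct else "query-primary"

# Over many distinct keys, roughly 5% should land on the canary.
pools = [pick_pool(f"tenant-{i}") for i in range(1000)]
canary_share = pools.count("query-canary") / len(pools)
```

Hash-based assignment (rather than per-request random choice) keeps canary exposure sticky per tenant, which makes canary regressions easier to attribute.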
Error Paths¶
sequenceDiagram
participant C as Client
participant L7 as Regional L7 LB
participant S as Service Pool
participant HC as Health/Discovery
C->>L7: Request /ingest
L7->>HC: Endpoints?
alt No healthy backends
L7-->>C: 503 Service Unavailable (Retry-After)
else Backend times out
L7->>S: Forward
S-->>L7: (timeout)
L7-->>C: 504 Gateway Timeout
end
Request/Response Specifications¶
Input Requirements¶
| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| `Host` / SNI | header | Y | Virtual host routing | Matches configured domain |
| `Authorization` | header | O | Propagated to Gateway | If present, well-formed |
| `traceparent` | header | O | Trace propagation | W3C format |
| `X-Tenant-Id` | header | O | Residency/affinity hint | ULID/UUID |
| `X-Region-Hint` | header | O | Client preferred region | Allowlist |
| `X-Sticky-Key` | header | O | Consistent hashing key | ≤128 chars |
| `Cookie: atp_affinity` | cookie | O | LB-issued sticky cookie | Signed |
| `Accept` / `Content-Type` | header | O | Protocol negotiation | Valid MIME |
| `Idempotency-Key` | header | O | For retries across LB | ≤128 chars |
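The `traceparent` validation called out above follows the W3C Trace Context format (`version-traceid-parentid-flags`). A minimal check, which also rejects the all-zero trace and parent ids the spec forbids (it does not reject the reserved version `ff` — a simplification):

```python
import re

# version(2 hex) - trace-id(32 hex, not all zeros) - parent-id(16 hex, not all zeros) - flags(2 hex)
TRACEPARENT = re.compile(
    r"^[0-9a-f]{2}-(?!0{32})[0-9a-f]{32}-(?!0{16})[0-9a-f]{16}-[0-9a-f]{2}$"
)

def valid_traceparent(value: str) -> bool:
    return TRACEPARENT.fullmatch(value) is not None
```

An LB that receives an invalid header should start a fresh trace rather than propagate garbage downstream.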
Output Specifications¶
| Field | Type | Description | Notes |
|---|---|---|---|
| `X-Region` | header | Region that served the request | e.g., `eu-west` |
| `X-Backend-Id` | header | Instance/pod identifier | For debugging |
| `X-Served-By` | header | LB node identifier | Optional |
| `Server-Timing` | header | `lb;dur=...` | Perf insights |
| `Retry-After` | header | Sent on 429/503 | Seconds or HTTP date |
Example Payloads¶
GET /query/v1/records?tenant=acme HTTP/1.1
Host: api.atp.example
X-Tenant-Id: 01HZXM0...
X-Region-Hint: eu-west
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
HTTP/1.1 200 OK
X-Region: eu-west
X-Backend-Id: proj-7f9c6bd9d8-2m4sx
Server-Timing: lb;dur=3, gw;dur=6
Content-Type: application/json
Error Handling¶
Error Scenarios¶
| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Invalid host/SNI, malformed headers (`X-Region-Hint`) | Correct request | — |
| 401 | Auth failure (if L7 does authN) | Re-authenticate | Retry after renewal |
| 403 | Region not allowed by residency | Remove hint / use allowed region | — |
| 404 | Route/service not found | Verify path/host | — |
| 409 | Sticky key conflicts with pool policy | Clear cookie/change key | — |
| 429 | LB/Gateway rate limit | Back off | Exponential + jitter |
| 502 | Bad gateway (abrupt upstream close) | Investigate upstream | Retry idempotent |
| 503 | No healthy backends / brownout | Failover or wait | Respect Retry-After |
| 504 | Upstream timeout | Tune timeouts or retry | Idempotent only |
Failure Modes¶
- Hot spotting: poor hash key → use ring-hash on tenantId and minimum healthy hosts.
- Sticky drift: deleted pod but cookie persists → cookie TTL/clearing and outlier ejection.
- Cross-region leakage: missing residency guard → enforce allowlist at GTM and L7.
Recovery Procedures¶
- Drain failing instances (connection draining) and eject outliers.
- Flip traffic weights away from impaired pool; enable canary disable flag.
- Trigger regional failover at GTM if health below threshold.
Performance Characteristics¶
Latency Expectations¶
- Added L7 overhead p95 ≤ 5 ms; GTM selection ≤ 20 ms additional.
Throughput Limits¶
- Tune per-service max connections/requests; queue length capped (e.g., 100) to prevent head-of-line blocking.
Resource Requirements¶
- LB nodes sized for TLS termination (ECDSA), HTTP/2, and gRPC fan-in/out; enable connection reuse.
Scaling Considerations¶
- Horizontal scale LB nodes; shard by region; enable Autoscaling based on RPS and CPU.
- Prefer least-request for spiky traffic; ring-hash for affinity; weighted RR for canaries.
Security & Compliance¶
Authentication¶
- TLS 1.2+ at edge; optional mTLS to backends; ALPN for HTTP/2/gRPC.
Authorization¶
- If Gateway performs authZ, L7 forwards identity context; deny routes without matching policies.
Data Protection¶
- No PII in LB logs; mask headers; use HSTS; secure cookies (`HttpOnly`, `Secure`, `SameSite=Lax`).
Compliance¶
- Residency honored at GTM/L7; all decisions auditable (who changed routes/weights).
Monitoring & Observability¶
Key Metrics¶
| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| `lb_requests_total{route,region}` | counter | Requests by route | Trend |
| `lb_latency_seconds` | histogram | Added LB latency | p95 breach |
| `lb_upstream_5xx_total` | counter | Backend errors | Spike |
| `lb_no_healthy_backends_total` | counter | Routing failures | Any > 0 |
| `lb_active_connections` | gauge | Concurrent conns | Saturation |
| `lb_outlier_ejections_total` | counter | Ejected hosts | Investigate |
Logging Requirements¶
- Access logs with `region`, `backendId`, `status`, `bytes`, `durationMs`, `traceId`; redact sensitive headers.
Distributed Tracing¶
- Start or propagate `traceparent`; add span attributes `lb.region`, `lb.backend_id`, `policy`.
Health Checks¶
- Active (HTTP/gRPC) + passive checks; outlier detection (consecutive 5xx/latency) with ejection & recovery.
Operational Procedures¶
Deployment¶
- Deploy GTM records (Geo/latency policy + failover).
- Roll out L7 LB/Ingress with routes, TLS certs, and backends.
- Enable discovery (EDS) and health checks; validate with synthetic probes.
Configuration¶
- Algorithms: `least_request`, `ring_hash` (key=`X-Sticky-Key`|`tenantId`), `weighted_round_robin`.
- Timeouts: `connect=1s`, `request=5s` (per route), `idle=60s`.
- Headers: set `X-Region`, `X-Backend-Id`, and propagate `traceparent`.
Maintenance¶
- Rotate TLS certs; tune weights during canaries; routinely test failover.
- Drain nodes before upgrades (connection draining, readiness gates).
Troubleshooting¶
- Elevated 5xx → check outlier ejections, backend health, circuit breaker trips.
- High latency → verify least-request and connection pool sizes; inspect Nagle/HTTP/2 settings.
- Sticky anomalies → clear cookies, verify ring-hash seed and host set stability.
Testing Scenarios¶
Happy Path Tests¶
- Requests distributed evenly under steady load (Gini coefficient within target).
- Sticky session remains on same backend across N requests.
Error Path Tests¶
- 503 when all backends unhealthy; 504 on upstream timeout; 404 on unknown route.
- 409 when sticky key conflicts with policy handled gracefully.
Performance Tests¶
- p95 LB overhead ≤ 5 ms at target RPS; no queue growth beyond cap.
- Failover to secondary region within SLA (< 60s) under regional outage.
Security Tests¶
- TLS and cipher policy enforced; mTLS to backends verified.
- Residency blocks cross-region routing attempts; logs contain no PII.
Related Documentation¶
Internal References¶
External References¶
- Load balancing algorithms (least-request, ring-hash) and best practices
Appendices¶
A. Example Envoy Route (weighted + ring-hash)¶
route:
match: { prefix: "/query" }
route:
hash_policy:
- header: { header_name: "X-Sticky-Key" }
- cookie: { name: "atp_affinity", ttl: 3600s, path: "/" }
weighted_clusters:
clusters:
- name: query-primary
weight: 95
- name: query-canary
weight: 5
timeout: 5s
idle_timeout: 60s
B. Problem+JSON (example 503)¶
{
"type": "urn:connectsoft:errors/lb/no-healthy-backends",
"title": "No healthy backends available",
"status": 503,
"detail": "All instances for route '/ingest' are out of service.",
"retryAfterSeconds": 10
}
Caching Flow¶
Reduces read latency and load on backing stores via tenant-scoped caches with L1 (in-process) and L2 (distributed) tiers. Supports read-through + stale-while-revalidate (SWR), with projection-driven invalidation and export-safe cache bypass when strong freshness is required. Consistency model and TTLs are explicit per resource.
Overview¶
Purpose: Serve query responses quickly while honoring tenant isolation and documented freshness guarantees.
Scope: Cache lookup → hit/miss handling → read-through fill → TTL/SWR behavior → projector/exports invalidations → observability. Excludes CDN/public caching.
Context: Query Service fronts Projection DB/Search with L1/L2 caches; Projection Update Flow emits invalidations; Export may request bypass/lock.
Key Participants:
- Client
- API Gateway / Query Service
- Cache L1 (per-pod)
- Cache L2 (Redis/Memcache)
- Projection DB / Search Index
- Invalidation Bus (events from Projector/Export)
Prerequisites¶
System Requirements¶
- L1 in-process cache with bounded memory and eviction (LRU/LFU)
- L2 distributed cache with multi-tenant namespaces, TLS, and ACLs
- Invalidation channel (pub/sub or stream) from Projector & Export
- Strong hashing for keys; serialization with versioned schema
Business Requirements¶
- Documented consistency choices per endpoint:
strong,bounded-staleness, oreventual - Per-edition TTLs and max object sizes; negative-caching policy
- Clear semantics for export and legal-hold reads (bypass or SWR disabled)
Performance Requirements¶
- p95 cache hit latency: L1 ≤ 1 ms, L2 ≤ 3 ms
- Target hit ratio: ≥ 85% for hot keys; ≥ 60% overall for query endpoints
- Fill amplification bounded (parallel request coalescing)
Sequence Flow¶
Happy Path¶
sequenceDiagram
autonumber
participant C as Client
participant GW as API Gateway / Query Service
participant L1 as Cache L1 (in-process)
participant L2 as Cache L2 (Redis)
participant DB as Projection DB / Search
participant BUS as Invalidation Bus
C->>GW: GET /query/v1/records?tenant=acme&from=... (Cache-Mode: default)
GW->>L1: GET cache[key(Tenant,QueryHash)]
alt L1 hit (fresh)
L1-->>GW: value, meta{ttl,freshness}
GW-->>C: 200 OK (X-Cache: L1-HIT, X-Cache-Freshness: fresh)
else L1 miss
GW->>L2: GET key
alt L2 hit (fresh or SWR-eligible)
L2-->>GW: value, meta
GW-->>C: 200 OK (X-Cache: L2-HIT, X-Cache-Freshness: fresh|stale)
opt SWR revalidate in background if stale
GW->>DB: Query (If-None-Match: etag)
DB-->>GW: 304 or 200 + new value
GW->>L2: SET key (ttl)
GW->>L1: SET key (ttl)
end
else L2 miss
GW->>DB: Query
DB-->>GW: 200 result (etag)
GW->>L2: SET key (ttl, etag)
GW->>L1: SET key (ttl, etag)
GW-->>C: 200 OK (X-Cache: MISS)
end
end
BUS-->>L2: Invalidation(key or tag) on projection update
L2-->>L1: Fan-out eviction notice
Alternative Paths¶
- Bypass: header `Cache-Mode: bypass` → skip L1/L2 for strict reads (e.g., export) and optionally refresh the cache.
- Write-around: projector writes DB then publishes tag-based invalidations (e.g., `tenant:acme`, `resource:order:123`).
- Coalesced fills: the first request holds a per-key mutex; subsequent misses wait, avoiding a stampede.
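The coalesced-fill behavior can be sketched as a single-flight wrapper: concurrent misses for the same key share one backend call. This is a simplified in-process sketch; a production version would also handle fetch errors and timeouts for waiting followers.

```python
import threading
import time

class SingleFlight:
    """Coalesce concurrent loads of the same key into one backend call."""
    def __init__(self):
        self._lock = threading.Lock()
        self._inflight: dict[str, threading.Event] = {}
        self._results: dict[str, object] = {}

    def load(self, key: str, fetch):
        with self._lock:
            ev = self._inflight.get(key)
            leader = ev is None
            if leader:
                ev = threading.Event()
                self._inflight[key] = ev
        if leader:
            try:
                self._results[key] = fetch()   # only the leader hits the backend
            finally:
                with self._lock:
                    del self._inflight[key]
                ev.set()                       # release waiting followers
        else:
            ev.wait()                          # followers reuse the leader's result
        return self._results[key]

sf = SingleFlight()
calls = []
def fetch():
    calls.append(1)
    time.sleep(0.2)                            # simulate a slow projection query
    return "value"

barrier = threading.Barrier(8)
out = []
def worker():
    barrier.wait()                             # fire all 8 requests together
    out.append(sf.load("k", fetch))

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads: t.start()
for t in threads: t.join()
```

With eight simultaneous misses, far fewer than eight backend fetches occur (typically one), which is exactly the amplification bound the performance requirements call for.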
Error Paths¶
sequenceDiagram
participant GW as Query Service
participant L2 as Cache L2
participant DB as Projection DB
GW->>L2: GET key
alt 503 L2 unavailable
GW->>DB: Fallback to DB
DB-->>GW: 200
GW->>L2: (skip SET) or queue async warm
else 409 CAS/ETag conflict on SET
L2-->>GW: 409 Conflict
GW->>L2: GET latest → retry SET (backoff)
end
Request/Response Specifications¶
Input Requirements (Headers & Query)¶
| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| `X-Tenant-Id` | header | Y | Tenant namespace for cache | ULID/UUID |
| `Cache-Mode` | header | O | `default` \| `bypass` \| `refresh` \| `swr-only` | enum |
| `Cache-Control` | header | O | `max-age`, `stale-while-revalidate`, `no-store` | RFC 7234 |
| `If-None-Match` | header | O | Revalidation with ETag | string |
| `X-Consistency` | header | O | `strong` \| `bounded` \| `eventual` | per-route |
| Query params | query | O | Affect key hash | canonicalized order |
Output Specifications (Response & Meta)¶
| Field | Type | Description | Notes |
|---|---|---|---|
| `X-Cache` | header | `L1-HIT` \| `L2-HIT` \| `MISS` \| `BYPASS` \| `STALE` | observability |
| `ETag` | header | Entity tag for revalidation | stable per value |
| `Cache-Control` | header | Response caching directives | includes `max-age` |
| `X-Cache-Key` | header | Debug key (hashed/short) | no PII |
| `X-Cache-Freshness` | header | `fresh` \| `stale(<sec>)` | SWR info |
| `X-Watermark` | header | Projection watermark time | freshness signal |
Example Payloads¶
Bounded-staleness read with SWR
GET /query/v1/records?tenant=acme&from=2025-10-27T08:00Z HTTP/1.1
X-Tenant-Id: 01JF...
Cache-Mode: default
X-Consistency: bounded
HTTP/1.1 200 OK
X-Cache: L2-HIT
Cache-Control: max-age=30, stale-while-revalidate=60
ETag: "recset:acme:ab12"
X-Cache-Freshness: stale(12)
X-Watermark: 2025-10-27T08:05:30Z
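The freshness decision behind `X-Cache-Freshness` can be sketched as a pure function of the entry's age and the route's TTL/SWR windows. The values mirror the `Cache-Control: max-age=30, stale-while-revalidate=60` example; a real implementation would also apply the per-route max-staleness cap.

```python
def classify_freshness(age_s: float, max_age_s: float = 30,
                       swr_s: float = 60) -> str:
    """Classify a cache entry's age against max-age and the SWR window."""
    if age_s <= max_age_s:
        return "fresh"                               # serve directly
    if age_s <= max_age_s + swr_s:
        # SWR window: serve stale now, refresh in the background.
        return f"stale({int(age_s - max_age_s)})"
    return "expired"                                 # must fetch synchronously
```

An entry that is 42 s old under `max-age=30` yields `stale(12)`, matching the `X-Cache-Freshness` header in the example response above.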
Error Handling¶
Error Scenarios¶
| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Invalid `Cache-Mode`/`X-Consistency` value; oversized key | Fix headers/params | — |
| 401 | Missing tenant header for cached endpoints | Add `X-Tenant-Id` | Retry after fix |
| 403 | Tenant not allowed on this region/cache | Correct region or policy | — |
| 404 | Cache management API: unknown key/tag on purge | No-op; verify key | — |
| 409 | CAS/ETag conflict on concurrent SET | Retry with backoff; re-GET latest | Jittered backoff |
| 412 | Revalidation precondition failed (ETag mismatch) | Fetch full object | Conditional retry |
| 429 | Cache rate limit (management ops) | Back off | Exponential |
| 503 | L2 unavailable | Fallback to DB; degrade to L1-only | Bounded retries |
Failure Modes¶
- Cache stampede: thundering herd on popular key → request coalescing, jittered TTLs, SWR background refresh.
- Stale reads too old: misconfigured `stale-while-revalidate` → enforce a max-staleness cap per route.
- Cross-tenant leakage: missing tenant in key → mandatory `X-Tenant-Id` + namespace prefixes.
- Oversized entries: evictions/fragmentation → cap object size, compress payloads, or avoid caching.
Recovery Procedures¶
- Disable SWR temporarily for problematic routes; set shorter TTLs.
- Purge by tag (`tenant:acme`, `resource:order:123`) after projection anomalies.
- Route around L2 failures (feature flag) while keeping the read path via DB.
Performance Characteristics¶
Latency Expectations¶
- L1 ≤ 1 ms p95; L2 ≤ 3 ms p95; read-through to DB ≤ endpoint SLO.
Throughput Limits¶
- L2 QPS sized for peak miss + revalidation; keyspace cardinality controlled via hashing and tag strategy.
Resource Requirements¶
- Memory budgets per pod (L1) and per cluster (L2); eviction policy tuned (LFU for skewed traffic).
Scaling Considerations¶
- Partition L2 by region and shard; enable replica readers; avoid cross-AZ chatter.
- Use compressed values (e.g., zstd) for large result sets with CPU tradeoff.
Security & Compliance¶
Authentication¶
- mTLS between services and L2; signed purge APIs.
Authorization¶
- RBAC for cache management (`cache:purge|inspect`); tenant-scoped purge only.
Data Protection¶
- No PII in keys; values encrypted at rest if L2 supports; TLS in transit.
Compliance¶
- Audit cache management actions (purge/warm) with actor, scope, reason.
Monitoring & Observability¶
Key Metrics¶
| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| `cache_hit_total{tier}` | counter | Hits by tier | Drop signals issues |
| `cache_miss_total` | counter | Misses (cold + reval) | Spike alert |
| `cache_hit_ratio` | gauge | Hits / (hits+misses) | < target |
| `cache_evictions_total` | counter | Evictions by reason | Unexpected growth |
| `cache_swr_served_total` | counter | Stale responses served | Excess indicates lag |
| `cache_fill_duration_seconds` | histogram | Miss→filled latency | p95 breach |
| `cache_invalidation_total{tag}` | counter | Invalidation events | Monitor volume |
Logging Requirements¶
- Include `tenantId`, short `cacheKey`, `tier`, `freshness`, `hit/miss`, `fillMs`, `traceId`. No payloads/PII in logs.
Distributed Tracing¶
- Child spans for `cache.l1.get`, `cache.l2.get/set`, `cache.swr.revalidate`, with attributes `key_hash`, `tier`.
Health Checks¶
- L2 readiness probes; replication lag; pub/sub connectivity for invalidations.
Operational Procedures¶
Deployment¶
- Deploy L2 cache cluster (HA) with TLS and ACL; configure namespaces per region.
- Enable L1 caches in services with bounds and eviction settings.
- Wire projector → invalidation bus → L2 pub/sub fan-out.
Configuration¶
- Defaults: `TTL=30s`, `stale-while-revalidate=60s`, `max-staleness=90s`, `negativeTTL=3s`.
- Enable request coalescing and per-key mutex; cap value size (e.g., 512 KB).
Maintenance¶
- Periodic warm-up for hot keys post-deploy; tune TTLs using hit/miss analytics.
- Rotate L2 credentials; defragment and scale nodes as keyspace grows.
Troubleshooting¶
- Low hit ratio → verify key canonicalization and tenant scoping.
- Stampedes → increase jitter, enable SWR, and coalescing.
- Staleness complaints → reduce TTL or require `Cache-Mode: bypass` for affected endpoints.
Testing Scenarios¶
Happy Path Tests¶
- L1/L2 hits return within target latencies and correct headers.
- Revalidation updates cache while serving stale safely (SWR).
Error Path Tests¶
- 503 L2 outage falls back to DB with acceptable latency.
- 409 CAS conflict on SET resolves with retry and no corruption.
- 400 invalid `Cache-Mode` rejected.
Performance Tests¶
- Hit ratio meets targets under production-like skew (Zipfian).
- Thundering herd prevented under bursty traffic.
Security Tests¶
- No cross-tenant cache bleed; purge is tenant-scoped and audited.
- TLS and ACLs enforced for L2 connections.
Related Documentation¶
Internal References¶
- Read Models & Projections (Query Path)
- Audit Record Projection Update Flow
- Search Query Flow
- Export Flows
External References¶
- RFC 7234 (HTTP Caching), SWR patterns; Redis best practices
Appendices¶
A. Cache Key Schema (canonicalized)¶
Key = sha256(
"tenant=" + TenantId +
"&route=" + RouteId +
"&params=" + CanonicalQueryString +
"&version=" + SchemaVersion
)
Namespace = "atp:{region}:{edition}"
Final = Namespace + ":q:" + KeyPrefix
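A sketch of the key derivation above. Canonicalization here is a simple sort-by-name of query parameters, and `KeyPrefix` is taken as the first 16 hex chars of the digest — an assumption, since the schema above does not pin the prefix length.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode

def cache_key(tenant_id: str, route_id: str, query: str,
              schema_version: str, region: str, edition: str) -> str:
    # Canonicalize: parse, sort by parameter name, re-encode for a stable order.
    canonical = urlencode(sorted(parse_qsl(query)))
    material = (f"tenant={tenant_id}&route={route_id}"
                f"&params={canonical}&version={schema_version}")
    digest = hashlib.sha256(material.encode()).hexdigest()
    return f"atp:{region}:{edition}:q:{digest[:16]}"   # prefix length is illustrative

# Same parameters in a different order must hash to the same key.
k1 = cache_key("acme", "records", "to=B&from=A", "v1", "eu-west", "ent")
k2 = cache_key("acme", "records", "from=A&to=B", "v1", "eu-west", "ent")
```

Tenant and schema version inside the hash material give tenant isolation and automatic invalidation on schema bumps; region/edition in the namespace keep keyspaces partitioned.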
B. Problem+JSON Examples¶
{
"type": "urn:connectsoft:errors/cache/invalid-mode",
"title": "Invalid Cache-Mode",
"status": 400,
"detail": "Allowed values are default|bypass|refresh|swr-only."
}
{
"type": "urn:connectsoft:errors/cache/conflict",
"title": "Concurrent cache update conflict",
"status": 409,
"detail": "ETag mismatch during SET. Value updated by another request."
}
Partitioning Flow¶
Routes traffic and data by tenant / shard / region using a deterministic partition strategy (e.g., TenantId + TimeBucket) mapped onto a consistent-hash ring. Ensures RLS enforcement at the data plane and honors residency flags so data stays within allowed regions. Supports shard pruning on reads and smooth ring changes with minimal rebalancing.
Overview¶
Purpose: Achieve scalable, cost-efficient storage and query performance by distributing load across shards while preserving strict tenant isolation and residency.
Scope: Partition key derivation → ring lookup → write placement (append store & indexes) → read-time shard pruning → ring change management (add/remove/move) → RLS enforcement. Excludes cross-region replication (covered elsewhere).
Context: Ingestion and Query paths use the Placement Service and Partition Catalog to route writes/reads. Storage (Append), Projection DB, and Search Index expose per-shard/tenant namespaces.
Key Participants:
- API Gateway / Ingestion Service
- Placement Service (ring lookup)
- Partition Catalog (tenants, shards, regions)
- Storage (Append) / Projection DB / Search Index
- RLS/Policy Engine
Prerequisites¶
System Requirements¶
- Global Partition Catalog with tenant → region/edition → shard mapping
- Consistent-hash ring with virtual nodes; gossip or control-plane updates
- Time bucketing policy (e.g., `hour|day`) for hot-key spreading and pruning
- RLS enabled in all data planes (tenant-scoped schemas/aliases)
Business Requirements¶
- Residency policy per tenant/edition with allowed regions and data classes
- Hot-tenant isolation rules (dedicated shards/weighting)
- Ring change governance (approvals, maintenance windows for big moves)
Performance Requirements¶
- Target shard load imbalance (P95) ≤ 1.5× average
- Read pruning effectiveness ≥ 90% of shards skipped for typical time windows
- Partition lookup p95 ≤ 1 ms (cached in-process)
Sequence Flow¶
Happy Path¶
sequenceDiagram
autonumber
participant C as Client
participant GW as API Gateway
participant ING as Ingestion Service
participant PLC as Placement Service (Ring)
participant ST as Storage (Append / Shard)
participant PR as Projection DB (Shard)
participant IX as Search Index (Tenant Alias)
C->>GW: POST /audit/v1/records (X-Tenant-Id, time=2025-10-27T08:05Z)
GW->>ING: Canonicalized record (TenantId, OccurredAt)
ING->>PLC: ResolvePartition(TenantId, TimeBucket=2025-10-27:08)
PLC-->>ING: {region: eu-west, shard: s-17, keyspace: k_acme}
ING->>ST: Append to s-17 (RLS=TenantId)
ST-->>ING: ack (offset, partitionId)
ING-->>GW: 202 Accepted (X-Partition: s-17, X-Region: eu-west)
C->>GW: GET /query/v1/records?tenant=acme&from=08:00&to=08:10
GW->>PLC: PlanQuery(TenantId, Range)
PLC-->>GW: {prunedShards:[s-17,s-18], watermark}
GW->>PR: Read from pruned shards (RLS=TenantId)
PR-->>GW: results
GW-->>C: 200 OK (X-Shards: s-17,s-18)
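The `ResolvePartition` step above can be sketched as a consistent-hash lookup over a ring of virtual nodes, keyed by `TenantId` plus the hour bucket of `OccurredAt`. The vnode count and hash choice are illustrative; the real Placement Service additionally consults residency and hot-tenant rules.

```python
import bisect
import hashlib
from datetime import datetime, timezone

def _h(s: str) -> int:
    """64-bit ring position from SHA-256."""
    return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

class Ring:
    def __init__(self, shards: list[str], vnodes: int = 64):
        # Each shard owns `vnodes` points; more vnodes → smoother balance.
        points = sorted((_h(f"{s}#{i}"), s) for s in shards for i in range(vnodes))
        self._keys = [p[0] for p in points]
        self._shards = [p[1] for p in points]

    def resolve(self, tenant_id: str, occurred_at: datetime) -> str:
        bucket = occurred_at.astimezone(timezone.utc).strftime("%Y-%m-%d:%H")
        # First ring point clockwise of the key's hash (wrapping around).
        idx = bisect.bisect(self._keys, _h(f"{tenant_id}|{bucket}")) % len(self._keys)
        return self._shards[idx]

ring = Ring([f"s-{i}" for i in range(20)])
shard = ring.resolve("acme", datetime(2025, 10, 27, 8, 5, tzinfo=timezone.utc))
```

Because only the removed shard's arcs change owners, dropping one of twenty shards should move roughly 5% of keys — the "smooth ring changes with minimal rebalancing" property claimed above.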
Alternative Paths¶
- Hot-tenant isolation: Placement pins tenant to a dedicated shard set (higher vNode weight) to prevent noisy neighbors.
- Multi-bucket fanout: Large ranges map to multiple time buckets → pruned shard list per bucket, executed in parallel with bounded concurrency.
- Search path: Query uses per-tenant alias → resolves to index shards in allowed region only (no cross-region hits).
Error Paths¶
sequenceDiagram
participant ING as Ingestion
participant PLC as Placement
participant ST as Storage
ING->>PLC: ResolvePartition(TenantId=T?, TimeBucket=?)
alt 400 Bad Request (invalid tenant/time)
PLC-->>ING: 400 Problem+JSON
else 403 Residency violation (region hint not allowed)
PLC-->>ING: 403 Problem+JSON
else 404 Not Found (tenant or shard mapping missing)
PLC-->>ING: 404 Problem+JSON
else 409 Conflict (ring update in progress, epoch mismatch)
PLC-->>ING: 409 Problem+JSON (retry with new epoch)
else 503 Service Unavailable (catalog/ring unavailable)
PLC-->>ING: 503 Problem+JSON (Retry-After)
end
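A client handling the 409/503 branches above would refetch the ring epoch on an epoch mismatch and back off with jitter when the catalog is unavailable. A sketch under those assumptions (`call` and `refresh_epoch` are hypothetical callables, not ATP SDK APIs):

```python
import random
import time

def resolve_with_retry(call, refresh_epoch, max_attempts=4, base_delay=0.05,
                       sleep=time.sleep):
    """Retry loop for the placement error paths.

    `call(epoch)` returns (status, body); 409 means ring epoch drift
    (refetch epoch and redo), 503 means catalog unavailable (back off).
    Other statuses are treated as non-retryable client errors.
    """
    epoch = refresh_epoch()
    for attempt in range(max_attempts):
        status, body = call(epoch)
        if status == 200:
            return body
        if status == 409:            # ring update in progress, epoch mismatch
            epoch = refresh_epoch()  # redo resolve with the new epoch
        elif status == 503:          # catalog/ring unavailable
            pass                     # keep epoch; honor backoff below
        else:
            raise ValueError(f"non-retryable placement error: {status}")
        # Exponential backoff with jitter, per the error-handling table.
        sleep(base_delay * (2 ** attempt) * (1 + random.random()))
    raise TimeoutError("placement resolve exhausted retries")
```

In production the 503 branch should also respect a `Retry-After` header when present.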
Request/Response Specifications¶
Input Requirements¶
| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| `X-Tenant-Id` | header | Y | Tenant identity for RLS and partitioning | ULID/UUID |
| `X-Region-Hint` | header | O | Preferred region (must be allowed) | Residency allowlist |
| `OccurredAt` | body field | Y | Event time used for time bucket | RFC3339 UTC |
| `Partition-Key` | header | O | Override hash key (advanced) | Controlled via policy |
| `Range` | query | O | `from`/`to` time for reads | `from ≤ to`, bounded span |
| `X-Ring-Epoch` | header | O | Client-observed ring epoch | Monotonic int |
Output Specifications¶
| Field | Type | Description | Notes |
|---|---|---|---|
| `X-Partition` | header | Chosen shard id | For debugging |
| `X-Region` | header | Serving region | Residency proof |
| `X-Shards` | header | Pruned shard list for reads | Comma-separated |
| `X-Watermark` | header | Lowest consistent time served | For staleness checks |
| `X-Ring-Epoch` | header | Ring epoch used for routing | Detect drift |
Example Payloads¶
Resolve Partition (internal)
POST /placement/v1/resolve
{
"tenantId": "01JF6V3A6W1T6E2TB1C2N2YV9Q",
"occurredAt": "2025-10-27T08:05:12Z"
}
Response
{
  "region": "eu-west",
  "shard": "s-17",
  "keyspace": "k_acme"
}
Query Plan (pruning)
POST /placement/v1/plan-query
{
"tenantId": "01JF6V3A6W1T6E2TB1C2N2YV9Q",
"from": "2025-10-27T08:00:00Z",
"to": "2025-10-27T08:10:00Z"
}
Response
{
"region": "eu-west",
"shards": ["s-17","s-18"],
"watermark": "2025-10-27T08:09:58Z",
"epoch": 42
}
Error Handling¶
Error Scenarios¶
| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Missing/invalid `X-Tenant-Id`, bad time window | Fix request headers/params | — |
| 401 | Unauthenticated request to placement APIs | Authenticate | Retry after renewal |
| 403 | Residency/edition violation (region not allowed) | Choose allowed region | — |
| 404 | Tenant or shard mapping not found | Re-sync catalog / onboard tenant | — |
| 409 | Ring epoch mismatch during write/read | Fetch latest epoch; redo resolve | Jittered retry |
| 412 | Preconditions (RLS context) not present | Include tenant scope | — |
| 429 | Placement lookups rate-limited | Back off | Exponential + jitter |
| 503 | Placement/Catalog unavailable | Degrade to cached hint or fail | Bounded retries |
Failure Modes¶
- Hot shard: skewed hash or burst tenant → adjust vNode weights, or isolate tenant to dedicated shard set.
- Ring churn: frequent membership changes cause 409s → stage updates and epoch gating with drain.
- Cross-region spill: misconfigured residency → hard deny at placement and gateway.
Recovery Procedures¶
- Enable skew mitigations (weighting, pinning) and backfill if rebalancing moved ranges.
- Roll back ring change to prior epoch if error rate spikes; drain and retry in controlled batches.
- Rebuild tenant alias in Search/Projection if shard move required index re-aliasing.
Performance Characteristics¶
Latency Expectations¶
- Placement cache lookup ≤ 1 ms p95; cold fetch ≤ 10 ms p95.
- Pruned read fanout limited to ≤ 4 shards for typical query windows.
Throughput Limits¶
- Placement QPS sized for all writes + planning; use edge caches in services to reduce calls.
Resource Requirements¶
- Small in-memory partition maps per service; watch stream for updates; compact ring representation with virtual nodes.
Scaling Considerations¶
- Multi-ring design (per-region) to avoid cross-region chatter.
- Add shards by adding vNodes (smooth rebalance ≤ 10% key movement).
- Time buckets control hot partitions; tune bucket size by workload.
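The vNode-based rebalance claim can be demonstrated with a minimal consistent-hash ring. A sketch (illustrative; `blake2b` stands in for the `fnv1a`/`xxhash` functions named under Configuration, and shard names are hypothetical):

```python
import bisect
import hashlib

def _h(s):
    # Stable 64-bit hash; stand-in for fnv1a/xxhash in the real ring.
    return int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8).digest(), "big")

class Ring:
    """Minimal consistent-hash ring with virtual nodes (vNodes)."""

    def __init__(self, shards, vnodes=256):
        # Each shard owns `vnodes` points on the ring.
        self._points = sorted((_h(f"{s}#{i}"), s)
                              for s in shards for i in range(vnodes))
        self._keys = [p for p, _ in self._points]

    def shard_for(self, hash_input):
        # First ring point clockwise of the key's hash; wrap at the end.
        i = bisect.bisect(self._keys, _h(hash_input)) % len(self._points)
        return self._points[i][1]
```

Comparing a 10-shard ring against an 11-shard ring, only the keys claimed by the new shard's vNodes move (expected fraction ≈ 1/(n+1)), which is the "smooth rebalance" property the bullet above relies on.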
Security & Compliance¶
Authentication¶
- mTLS between services and Placement/Catalog; OIDC for ops.
Authorization¶
- Roles: `placement:read`, `placement:update`; only platform ops can alter ring/vNodes.
Data Protection¶
- Enforce RLS at DB and index layers; per-tenant schemas/aliases; no PII in partition keys.
Compliance¶
- Residency enforced at plan/placement and audited; changes to ring membership are recorded as `Partition.RingUpdated` events.
Monitoring & Observability¶
Key Metrics¶
| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| `partition_lookup_latency_seconds` | histogram | Placement latency | p95 > 10 ms |
| `partition_skew_ratio` | gauge | Max shard load / avg | > 1.5 |
| `ring_epoch_mismatch_total` | counter | 409s due to epoch drift | Spike |
| `reads_shards_scanned` | histogram | Shards touched per query | p95 > target |
| `residency_denied_total` | counter | 403s due to residency | Any sustained |
| `hot_tenant_isolations_total` | counter | Isolation activations | Trend |
Logging Requirements¶
- Include `tenantId`, `region`, `epoch`, `shardId`, `bucket`, `planId`, `traceId`; never log plaintext PII.
Distributed Tracing¶
- Spans: `placement.resolve`, `placement.planQuery`; attributes `epoch`, `shard_list`, `bucket_count`.
Health Checks¶
- Catalog freshness (last update time), ring convergence across nodes, RLS guard status.
Operational Procedures¶
Deployment¶
- Deploy Placement Service (HA) and Catalog with watch streams.
- Configure per-region rings; seed vNodes; warm caches.
Configuration¶
- Hash: `fnv1a`/`xxhash` on `TenantId + BucketKey`.
- Bucket: daily/hourly; configurable per tenant/class.
- Ring: `vNodes=256` default; `epoch` increments on changes.
Maintenance¶
- Quarterly ring review; rebalance heavy shards; rotate ring secrets.
- Simulate ring changes in staging with shadow placement before production.
Troubleshooting¶
- High shard scan count → check time bucket tuning and secondary predicates.
- 409 spikes → ensure services refresh epoch quickly; increase push frequency.
- Residency denials → verify tenant policy and region hint.
Testing Scenarios¶
Happy Path Tests¶
- Ingest routes to correct shard/region with proper headers.
- Query pruning selects minimal shards and returns correct results.
Error Path Tests¶
- 400/404 invalid tenant/mapping rejected; 409 epoch mismatch handled by retry.
- 403 residency violations blocked decisively.
Performance Tests¶
- Placement p95 ≤ 1 ms cached; shard skew ratio ≤ 1.5× under load.
- Query scans ≤ target shards for standard ranges.
Security Tests¶
- RLS enforced on all reads/writes; no cross-tenant leakage.
- Residency never violated even under failover.
Related Documentation¶
Internal References¶
- Tenancy Keys & Partitioning
- Authoritative Stores (Write Path)
- Read Models & Projections
- Load Balancing Flow
External References¶
- Consistent hashing & virtual nodes best practices
Appendices¶
A. Partition Key Derivation¶
BucketKey = floor(to_unix(OccurredAt) / BucketSizeSeconds)
HashInput = TenantId || ":" || BucketKey
Shard = Ring(hash(HashInput))
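The derivation above can be written out directly; a sketch assuming hourly buckets (3600 s, matching the daily/hourly bucket configuration), with the final `Ring(...)` step left as a placeholder:

```python
import datetime

BUCKET_SIZE_SECONDS = 3600  # hourly buckets; configurable per tenant/class

def bucket_key(occurred_at: datetime.datetime,
               bucket_size: int = BUCKET_SIZE_SECONDS) -> int:
    # BucketKey = floor(to_unix(OccurredAt) / BucketSizeSeconds)
    return int(occurred_at.timestamp()) // bucket_size

def hash_input(tenant_id: str, occurred_at: datetime.datetime) -> str:
    # HashInput = TenantId || ":" || BucketKey
    return f"{tenant_id}:{bucket_key(occurred_at)}"

# Shard = Ring(hash(HashInput)) -- the ring lookup itself lives in the
# Placement Service and is omitted here.
```

Two events for the same tenant inside the same hour produce the same `HashInput`, so they route to the same shard.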
B. Problem+JSON Examples¶
{
"type": "urn:connectsoft:errors/partition/epoch-mismatch",
"title": "Ring epoch mismatch",
"status": 409,
"detail": "Client epoch 41 != current epoch 42."
}
{
"type": "urn:connectsoft:errors/partition/residency-violation",
"title": "Region not allowed by residency policy",
"status": 403,
"detail": "Tenant 'acme' is restricted to eu-west."
}
Auto-Scaling Flow¶
Scales services safely with load using policy-driven HPA/KEDA decisions, proactive warmup/readiness gates, and cost guardrails. Prevents thrash via stabilization windows, rate limits, and deliberate scale-in. Maintains SLOs while distributing load across newly ready instances.
Overview¶
Purpose: Automatically add/remove capacity to meet SLOs while controlling cost and avoiding oscillation.
Scope: Signal collection → scaling decision → resource provisioning → service scale-out/in → warmup/readiness → load distribution → verification/rollback. Excludes manual capacity planning.
Context: Metrics from Observability and Queue/Bus feed Autoscaler (HPA/KEDA). Kubernetes (orchestrator) applies replica changes. Gateway/LB route traffic only to ready pods.
Key Participants:
- Load Monitor (Prometheus/OTel, Queue metrics)
- Autoscaler (HPA/KEDA controller)
- Orchestrator (Kubernetes API Server)
- Target Service (e.g., Ingestion/Query/Export)
- Warmup Manager (init tasks, cache warm)
- API Gateway / L7 LB
- Cost Guard (budget policy evaluator)
Prerequisites¶
System Requirements¶
- Metrics (CPU, memory, RPS, p95 latency, queue depth/lag) exported and scraped
- HPA/KEDA installed with stabilization windows & scale rate limits
- Readiness/Startup probes and graceful shutdown configured
- Optional Warm Pool or pre-provisioned nodes for burst traffic
Business Requirements¶
- SLOs defined per service (latency/error budget)
- Cost guardrails (min/max replicas, monthly budget caps, per-tenant limits)
- Change approvals for autoscaling policy updates
Performance Requirements¶
- Scale-out reaction time ≤ 30–60s for CPU/RPS, ≤ 10s for queue lag (event-driven)
- Scale-in conservatively; error budget burn must stay within targets
- No oscillation: replica changes limited by stabilization (e.g., 300s down, 60s up)
Sequence Flow¶
Happy Path¶
sequenceDiagram
autonumber
participant LM as Load Monitor (Metrics/Queue)
participant AS as Autoscaler (HPA/KEDA)
participant OR as Orchestrator (K8s API)
participant SVC as Target Service
participant WM as Warmup Manager
participant LB as API Gateway / L7 LB
participant CG as Cost Guard
LM-->>AS: Signals {cpu=78%, rps=1.8k, p95=230ms, queueLag=high}
AS->>CG: Check policy & budget (min/max, cost caps)
CG-->>AS: OK (within budget)
AS->>OR: Patch Deployment replicas +3 (rate-limited)
OR->>SVC: Create Pods (Pending→Init→Running)
SVC->>WM: Warmup (JIT cache, connection pools)
SVC-->>OR: Readiness=TRUE (startup probe passed)
OR-->>LB: Endpoint added to ready set
LB-->>SVC: Start routing a % of traffic (ramp-up)
LM-->>AS: Metrics improve (p95→140ms, queueLag→normal)
AS->>OR: Hold steady (stabilization window active)
Alternative Paths¶
- Predictive/Scheduled: pre-scale based on calendar or forecast (e.g., top-of-hour export).
- Event-driven (KEDA): scale on queue depth/lag or webhook events (spikes).
- Per-tenant partitions: scale labeled shard Deployments independently to isolate hot tenants.
Error Paths¶
sequenceDiagram
participant AS as Autoscaler
participant OR as Orchestrator
participant CG as Cost Guard
participant SVC as Target Service
AS->>CG: Request scale beyond max
CG-->>AS: 409 Conflict (budget cap)
AS-->>AS: Clamp to max, raise alert
AS->>OR: Scale to N
OR-->>AS: 503 API unavailable / quota exceeded
AS-->>AS: Retry w/ backoff, keep stabilization timer
OR->>SVC: Start pods
SVC-->>OR: Readiness FAILED (startup)
OR-->>AS: Scale not effective
AS-->>AS: Pause scale-in, open incident, hold window
Request/Response Specifications¶
Input Requirements (Autoscaling Policy APIs)¶
| Field | Type | Req | Description | Validation |
|---|---|---|---|---|
| POST /ops/v1/autoscale/policies | http | Y | Create/update policy | RBAC |
| `service` | string | Y | Target service name | existing |
| `minReplicas` / `maxReplicas` | int | Y | Bounds | 1 ≤ min ≤ max |
| `targets` | object | O | e.g., `cpu=70`, `rps=200`, `p95Ms=180`, `queueLag=5s` | sane ranges |
| `scaleUpPolicy` | object | O | `stabilizationSec`, `maxIncreasePercent`, `step` | limits |
| `scaleDownPolicy` | object | O | `stabilizationSec`, `maxDecreasePercent`, `idleWindowSec` | limits |
| `costGuardrails` | object | O | `{maxMonthlyCents, maxNodes, burstAllowance}` | non-negative |
| `predictive` | object | O | schedule/cron or model id | valid cron |
Output Specifications¶
| Field | Type | Description | Notes |
|---|---|---|---|
| `policyId` | string | Identifier | immutable |
| `status` | enum | `Active` \| `Pending` \| `Error` | — |
| `effectiveAt` | time | Activation time | RFC3339 |
| `reason` | string | Policy validation result | optional |
Example Payloads¶
Create Policy
POST /ops/v1/autoscale/policies
{
"service": "query",
"minReplicas": 4,
"maxReplicas": 40,
"targets": { "cpu": 70, "p95Ms": 180, "rps": 250 },
"scaleUpPolicy": { "stabilizationSec": 60, "maxIncreasePercent": 100, "step": 4 },
"scaleDownPolicy": { "stabilizationSec": 300, "maxDecreasePercent": 33, "idleWindowSec": 600 },
"costGuardrails": { "maxMonthlyCents": 250000, "maxNodes": 60 }
}
Decision Record (emit)
{
"decisionId": "asd_01JF9A...",
"service": "query",
"from": 16,
"to": 24,
"reason": "p95>180ms and rps>target",
"window": "60s",
"guardrailsApplied": false,
"timestamp": "2025-10-27T08:06:30Z"
}
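A decision record like the one above is the output of one evaluation pass. An illustrative sketch of that pass, with hedged assumptions: this is not the HPA/KEDA algorithm, and `decide_scale`, its signal names, and policy fields merely mirror the example payloads above.

```python
import math

def decide_scale(current, signals, policy, last_change_age_sec):
    """One autoscale evaluation: targets, stabilization, guardrail clamp.

    Returns (target_replicas, guardrails_applied, reason).
    """
    up = policy["scaleUpPolicy"]
    want = current
    reasons = []
    if signals.get("p95Ms", 0) > policy["targets"]["p95Ms"]:
        want = current + up["step"]
        reasons.append("p95>target")
    if signals.get("rps", 0) > policy["targets"]["rps"] * current:
        # Size replicas so per-replica RPS returns to target.
        want = max(want, math.ceil(signals["rps"] / policy["targets"]["rps"]))
        reasons.append("rps>target")
    # Stabilization window: hold steady if we changed too recently.
    if want != current and last_change_age_sec < up["stabilizationSec"]:
        return current, False, "stabilization window active"
    # Rate limit per window, then clamp to policy bounds (guardrails).
    cap = min(current * (1 + up["maxIncreasePercent"] / 100),
              policy["maxReplicas"])
    clamped = max(min(want, int(cap)), policy["minReplicas"])
    return clamped, clamped != want, " and ".join(reasons) or "steady"
```

When the clamp fires, the emitted decision record would carry `guardrailsApplied: true`, matching the 409 guardrail branch in the error paths.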
Error Handling¶
Error Scenarios¶
| HTTP Code | Scenario | Recovery Action | Retry Strategy |
|---|---|---|---|
| 400 | Invalid policy (min>max, bad targets) | Fix payload | — |
| 401 | Missing/invalid token to ops API | Authenticate | Retry after renewal |
| 403 | Caller lacks `autoscale:write` | Request access | — |
| 404 | Policy/service not found | Verify name; create first | — |
| 409 | Policy conflicts with cost guardrails or active rollout | Adjust bounds or wait | Conditional retry |
| 412 | Preconditions failed (budget exceeded) | Increase budget or reduce target | — |
| 429 | Throttled ops updates | Back off | Exponential + jitter |
| 503 | Orchestrator unavailable/quota exhausted | Retry; open incident | Backoff; clamp to safe min |
Failure Modes¶
- Thrashing: rapid up/down changes → increase stabilization windows; lower sensitivity; coarser steps.
- Cold-start latency: new pods routed too early → enforce readiness gates and ramp-up percentage.
- Exceeding budget: forecast misses → cost guard clamps, triggers graceful degradation plans.
Recovery Procedures¶
- Freeze scale-down; hold steady at current replicas; widen windows.
- Enable predictive pre-scale during known peaks; warm caches.
- If quota hit, divert traffic (multi-region) or shed load (429) with idempotency keys.
Performance Characteristics¶
Latency Expectations¶
- Scale-out decision path (signal→ready) ≤ 60–90s typical; ≤ 15s for KEDA on lag spikes.
- No SLO breach during scale-in; drain connections before termination.
Throughput Limits¶
- Max scale step per window (e.g., +100% up, −33% down).
- Node autoscaler pre-warms to ensure pods schedule within target.
Resource Requirements¶
- Metrics store sized for scrape interval and cardinality; autoscaler controller HA.
- Warm pool (optional) sized to absorb N minutes of surge.
Scaling Considerations¶
- Separate control plane autoscaler resources from workloads.
- Partition by service/shard for isolation; avoid global contention.
- Use pod disruption budgets (PDBs) to protect capacity on rollouts.
Security & Compliance¶
Authentication¶
- OIDC for ops APIs; mTLS between autoscaler and cluster API.
Authorization¶
- RBAC: `autoscale:read`, `autoscale:write`, `autoscale:admin`. Least privilege for controllers.
Data Protection¶
- No PII in scaling logs/metrics; scrub tenant identifiers or hash.
Compliance¶
- Emit audited events: `Autoscale.PolicyUpdated|DecisionMade|ScaleApplied|GuardrailClamped` with reason and actor.
Monitoring & Observability¶
Key Metrics¶
| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| `autoscale_desired_replicas` | gauge | Desired vs current | Large sustained delta |
| `autoscale_decisions_total{reason}` | counter | Scale events | Spike analysis |
| `autoscale_thrash_total` | counter | Up/down flips within window | > 0 sustained |
| `service_slo_latency_p95_ms` | gauge | p95 latency | > target |
| `queue_lag_seconds` | gauge | Event backlog | > target |
| `cost_estimated_monthly_cents` | gauge | Spend projection | > budget |
Logging Requirements¶
- Decision logs: `decisionId`, `from→to`, `reasons`, `signals`, `guardrailsApplied`, `traceId`.
Distributed Tracing¶
- Spans: `autoscale.evaluate`, `autoscale.apply`; link to service load spans via `traceparent`.
Health Checks¶
- Controller health, permission checks, K8s API latency; synthetic scale probe in staging.
Operational Procedures¶
Deployment¶
- Install HPA/KEDA; configure metrics adapters.
- Enable readiness/startup probes and graceful draining (preStop hooks).
- Apply baseline policies per service; verify guardrails.
Configuration¶
- Example defaults: `min=2`, `max=40`, `cpu=70%`, `p95=180ms`, `queueLag=5s`.
- Stabilization: `scaleUpStabilization=60s`, `scaleDownStabilization=300s`, `maxIncrease=100%`, `maxDecrease=33%`.
- Cost guard: `maxMonthlyCents`, `maxNodes`, `burstAllowance`.
Maintenance¶
- Quarterly policy review vs. observed traffic.
- Load tests before peak seasons; adjust predictive schedules.
Troubleshooting¶
- Oscillation → widen stabilization, reduce sensitivity, increase step size.
- Pods not becoming ready → inspect warmup dependencies, increase startupProbe timeouts.
- Budget clamp events → validate forecasts; consider reserved capacity.
Testing Scenarios¶
Happy Path Tests¶
- Sustained load triggers scale-out within target time; SLO met.
- Post-peak scale-in occurs after stabilization; no SLO regressions.
Error Path Tests¶
- 409 guardrail clamp logged; system holds safe capacity.
- 503 orchestrator outage handled by retries without thrash.
Performance Tests¶
- Burst load with KEDA (queue lag) scales within ≤ 15s to clear backlog.
- Scale-in preserves error budget and maintains p95 latency.
Security Tests¶
- Only authorized roles can modify policies; all changes audited.
- No PII in autoscale logs/metrics.
Related Documentation¶
Internal References¶
External References¶
- HPA/KEDA best practices; SRE guides on autoscaling and error budgets
Appendices¶
A. Example HPA (CPU + custom p95 latency via metrics adapter)¶
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: query-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: query
minReplicas: 4
maxReplicas: 40
behavior:
scaleUp:
stabilizationWindowSeconds: 60
policies:
- type: Percent
value: 100
periodSeconds: 60
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 33
periodSeconds: 300
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Pods
pods:
metric:
name: service_latency_p95_ms
target:
type: AverageValue
averageValue: "180"
B. Example KEDA ScaledObject (queue lag)¶
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: export-worker
spec:
scaleTargetRef:
name: export-worker
minReplicaCount: 2
maxReplicaCount: 60
cooldownPeriod: 300
triggers:
- type: redis
metadata:
address: REDIS_ADDR
listName: export-jobs
listLength: "100" # target backlog
C. Problem+JSON (policy conflict)¶
{
"type": "urn:connectsoft:errors/autoscale/policy-conflict",
"title": "Autoscale policy conflicts with guardrails",
"status": 409,
"detail": "Requested maxReplicas 120 exceeds maxNodes budget."
}