
Multitenancy & Tenancy Guards - Audit Trail Platform (ATP)

Tenant isolation as a first-class invariant — every audit event, policy, and export is scoped, enforced, and verifiable per tenant.


Purpose & Scope

This document defines how the Audit Trail Platform (ATP) enforces tenant isolation end-to-end and how Tenancy Guards are designed, implemented, and verified across all surfaces.

Objectives

  • Isolation by design: Specify guarantees and mechanisms that ensure every operation (ingest, store, query, export) is tenant-scoped and provably isolated.
  • Tenancy Guards: Define cross-cutting guardrails (policy + runtime checks) that prevent cross-tenant access and data leakage and provide clear failure modes.
  • Operational clarity: Describe operator procedures and SRE hooks for tenant onboarding/offboarding, incident response, backup/restore, and eDiscovery in a tenant-first manner.
  • Consistency & evolution: Align with ConnectSoft tenets — security-first, policy-as-code, observability-driven, additive evolution — and outline how isolation evolves without breaking tenants.

In Scope

  • Ingestion path: Context propagation, header/claim validation, idempotency within {tenantId,…} scope.
  • Persistence & indexing: Partitioning strategies (per-tenant vs shared-with-key), RLS predicates, integrity chains per tenant, key management and rotation.
  • Query & export: Mandatory tenant predicates, tenant-safe pagination, export packaging with per-tenant manifests and signatures.
  • Policy enforcement: Residency/retention/redaction evaluated on write/read/export, with deterministic decisions and change auditability.
  • Observability: Tenant-labeled logs/traces/metrics; dashboards and SLOs per tenant/tier.
  • Runbooks: Tenant lifecycle automation, break-glass patterns, DSAR/legal hold, backfill/replay safety.
  • Testing & verification: Contract tests, chaos/fault injection for tenancy failures, continuous policy verification.

Out of Scope

Business-domain tenancy rules inside producer services (beyond required propagation), and pricing/editions billing details (see SaaS Factory documentation).

Audience & Responsibilities

  • Platform Engineers / Service Owners: Implement guards, partitioning, and policy hooks in ATP services and SDKs.
  • SRE / Operations: Run tenant onboarding/offboarding, perform incident response, execute backups/restores and exports.
  • Security & Compliance: Approve guard policies, review evidence packs, monitor break-glass usage.
  • Integrators / App Teams: Use SDKs and follow contracts to propagate tenant context correctly.

Success Criteria

  • Mandatory context: 100% of ATP write/read/export requests carry validated tenantId; requests missing context are rejected or quarantined and audited.
  • Isolation proof: For any tenant, we can produce a verifiable evidence pack (manifests, signatures, chain proofs) showing no cross-tenant data exposure.
  • Policy determinism: For the same input and policy version, guard decisions are deterministic and reproducible.
  • Operational readiness: Runbooks enable safe tenant onboarding/offboarding; incident drills pass with measurable MTTR and zero cross-tenant blast radius.
  • Continuous verification: Automated tests (contract + chaos) run in CI/CD, with gates preventing regressions in tenant isolation.

Constraints & Assumptions

  • TenantId is opaque and stable; no business meaning assumed by ATP.
  • Isolation must hold across multi-region deployments; residency rules may restrict cross-region flows.
  • Backwards compatibility: schema evolution is additive; older producers/consumers continue to function with guards intact.
  • All guard outcomes (allow/reject/redact/quarantine) are themselves audited.

Cross-References

  • Security & Compliance: masking/redaction policies, break-glass governance.
  • Persistence & Storage: partitioning/indexing options and RLS enforcement.
  • Guides / Quickstart — Tenant Onboarding: operator procedures and checks.

Tenancy Model Overview

Tenant identity & hierarchy

  • TenantId (opaque, stable). Never encode business meaning or environment into the identifier.
  • Optional DataSiloId (regional/sovereignty partition) used for residency-aware routing.
  • Optional sub-scopes:
    • workspace/project — for product teams or lines of business under the same tenant.
    • environment — dev, test, staging, prod.
  • No implicit inheritance between hierarchy levels. All APIs require an explicit scope on each call; services must not “guess” a tenant from environment or hostnames.

Canonical ID rules

  • Max 128 characters; case-insensitive comparison; printable, URL-safe characters only.
  • Treat as opaque: only equality/inequality operations allowed.
  • Stable for the lifetime of the tenant; splits/merges use a mapping table (covered in Migration & Evolution).
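The canonical-ID rules above can be sketched as a small validation helper. This is an illustrative sketch, not the ATP SDK's actual API; the character class is an assumption consistent with "printable URL-safe".

```python
import re

# Assumed URL-safe character set (unreserved characters per RFC 3986), max 128 chars.
_TENANT_ID_RE = re.compile(r"^[A-Za-z0-9_.~-]{1,128}$")

def canonicalize_tenant_id(raw):
    """Normalize a TenantId for case-insensitive comparison; reject non-canonical input."""
    if raw is None or not _TENANT_ID_RE.fullmatch(raw):
        raise ValueError("InvalidTenantFormat")
    # Opaque identifier: lowercasing enables equality checks, nothing more.
    return raw.lower()

def same_tenant(a, b):
    """Only equality/inequality is permitted on TenantIds."""
    return canonicalize_tenant_id(a) == canonicalize_tenant_id(b)
```

Anything beyond equality (prefix matching, parsing out environments) would violate the opacity rule above.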

Tenant context propagation

  • Edge → ATP: context flows in JWT claims and HTTP headers.
  • Service → service: propagate via gRPC metadata and OpenTelemetry baggage; never recompute/replace TenantId.

Accepted keys (normalized by the Gateway/SDKs)

| Purpose | Claim/Header keys (any) | Notes |
|---|---|---|
| Tenant identifier | tid, tenant, tenant_id, x-tenant-id | Mapped to canonical TenantId; must match the body payload if present. |
| Data silo/region | data_silo, x-data-silo | Drives residency routing and key selection. |
| Edition/plan | edition, x-edition | Drives quotas/policies; informational for ATP core. |
| Correlation | traceparent, x-correlation-id | W3C Trace Context preferred; X-Correlation-Id accepted for legacy. |

OpenTelemetry/baggage hints

  • Set tenantId and edition as baggage on entry; propagate on all outgoing calls.
  • Add tenantId as a log/metric attribute (resource or scope).

Header example

Authorization: Bearer <jwt … tid="splootvets" …>
X-Tenant-Id: splootvets
X-Data-Silo: us
traceparent: 00-3e1f2d0c9b8a7f6e5d4c3b2a19081716-7f6e5d4c3b2a1908-01

SDK normalization (client-side)

  • Auto-detect TenantId from configured source (tenant registry, app config, per-request context).
  • Inject headers + set OTel baggage.
  • Validate parity between headers and payload auditRecord.tenantId when present.

Trust boundaries & responsibilities

  • Producer services (tenanted apps)

    • Must provide valid tenant context on every request.
    • Use SDK request builders to avoid missing/invalid headers.
    • Avoid mixing tenant data in a single request body; one TenantId per call.
  • Gateway / ATP services

    • Validate presence and admissible form of TenantId; map known claim aliases.
    • Enforce: inject mandatory WHERE tenantId = :ctx on queries; apply RLS/predicates on storage and search.
    • Record & prove: persist TenantId with the audit record; add to integrity chains; emit GuardDecision events on anomalies.
  • SDKs

    • Provide canonical middleware for context propagation.
    • Offer a problem taxonomy (MissingTenant, MismatchTenant, ForeignResidency) with actionable messages.
    • Surface diagnostics (logs/metrics/traces) with tenantId attributes by default.

Minimal guard pseudocode (service boundary)

// Executed at each ingress (HTTP/gRPC) before business logic
var ctx = ResolveTenantContext(http.Headers, jwt.Claims, otel.Baggage);

if (!ctx.HasTenantId) return Problem(MissingTenant);
if (!IsCanonical(ctx.TenantId)) return Problem(InvalidTenantFormat);
if (Body.TenantId is not null && Body.TenantId != ctx.TenantId) 
    return Quarantine(MismatchTenant, evidence: { headers, body });

otel.Baggage.Set("tenantId", ctx.TenantId);
otel.Activity.SetTag("tenantId", ctx.TenantId);

// Downstream calls automatically carry baggage/metadata
Proceed();

End-to-end view

flowchart LR
Client-->|JWT+headers|Gateway
Gateway-->|tenant context validated|Ingestion
Ingestion-->|TenantId persisted|Storage
Storage-->Projection
Projection-->Query
Query-->|Tenant-scoped results|Client

Key properties

  • Single source of truth: TenantId captured at ingress → stamped on every persisted row/event.
  • Deterministic routing: TenantId + DataSiloId decide region/shard, never application heuristics.
  • Auditable chain: Every hop carries tenantId in trace baggage and logs for post-hoc verification.

Isolation Guarantees

Data plane

  • Partitioning
    • Per-tenant physical (index/bucket/table prefix) or logical (RLS/predicate) partitions.
    • Invariant: every persisted row/document/event includes a non-null tenantId and passes a storage-side predicate tenantId = :ctxTenant.
    • Dedicated/high-assurance tenants may opt into separate indices/buckets and isolated retention windows.
  • Encryption
    • Per-tenant KEKs in KMS; envelope encryption with rotating DEKs per segment/chunk.
    • Key scope is aligned to residency (DataSiloId) and tenant; KMS policies prevent cross-tenant unwrap.
    • Crypto-shred supported: revoking a tenant KEK renders sealed data unreadable while integrity evidence remains verifiable.
  • Integrity
    • Tenant-scoped hash chains/segments; each segment seals with (tenantId, segmentId, prevHash, rootHash, algoSuite, signedAt).
    • Proofs are verifiable without cross-tenant material; replay/verification tools require only the target tenant’s manifests.

Record envelope (conceptual)

{
  "tenantId": "splootvets",
  "recordId": "a1c…",
  "createdAt": "2025-10-28T07:00:00Z",
  "segmentId": "seg-2025-10-28T07",
  "prevHash": "9f2…",
  "payload": "…",            // encrypted with DEK
  "dekRef": "dek:seg-…",     // wrapped by tenant KEK
  "sig": "MEQCIF…",          // segment seal signature
  "algo": "AES-GCM + SHA-256 + Ed25519"
}

Control plane

  • Tenant policy namespaces: residency, retention, redaction, export permissions versioned as tenant:<id>@<ver>.
  • Edition/feature gates: enforcement uses edition + ABAC attributes to enable/disable features per tenant.
  • Fairness controls: rate limits (req/s), quotas (bytes/day, concurrency), and backpressure (429/deferral) applied per tenant/tier.
  • Decisions are auditable: allow/reject/quarantine outcomes emit GuardDecision with policy/version basis.

Guard decision (shape)

{
  "tenantId": "splootvets",
  "operation": "Query",
  "decision": "Allow",
  "basis": { "policy": "tenant:splootvets@42", "rules": ["RLS.Tenant", "Masking.Investigator"] }
}
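The per-tenant fairness controls above (rate limits with 429/backpressure) are commonly implemented as a token bucket keyed by tenant. The sketch below is illustrative only; real ATP limits come from tier configuration, and the class name is hypothetical.

```python
import time

class TenantRateLimiter:
    """Per-tenant token bucket: each tenant refills independently, so one
    noisy tenant cannot exhaust another tenant's budget."""

    def __init__(self, rate_per_s, burst):
        self.rate = rate_per_s
        self.burst = burst
        self._state = {}  # tenant_id -> (tokens, last_timestamp)

    def allow(self, tenant_id, now=None):
        now = time.monotonic() if now is None else now
        tokens, last = self._state.get(tenant_id, (self.burst, now))
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens < 1.0:
            self._state[tenant_id] = (tokens, now)
            return False  # caller responds 429 or defers (backpressure)
        self._state[tenant_id] = (tokens - 1.0, now)
        return True
```

A denial here would also emit a GuardDecision event, per the shape shown above.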

Observability

  • Strict labeling: all traces/logs/metrics carry tenantId (and edition) as attributes; ingestion adds tags at entry and propagates via OTel baggage.
  • Dashboards: tenant-scoped by default; any cross-tenant or fleet-wide view is admin-only and redacts tenant identifiers where not required.
  • Cardinality guardrails: per-tenant metric series; sampling/aggregation avoid cross-tenant joins; log routing keeps tenant partitions isolated.
  • SLOs per tier: p95 ingest latency, projector lag, DLQ depth, export lead time — computed per tenant and alerted independently.

Invariants (must hold)

  • Every persisted artifact includes tenantId and integrity linkage; queries are executed with an injected tenantId predicate.
  • Keys, chains, and policies are tenant-scoped; no cross-tenant material is required to decrypt or to verify integrity.
  • Observability is segregated; operators cannot accidentally view or join data across tenants without explicit privileged access.

Tenancy Guards (Concept)

Definition. A Tenancy Guard is a policy + runtime mechanism that makes tenant context mandatory, valid, and enforced on every operation (ingest, query, export, admin). Guards are deny-by-default and produce auditable evidence for each decision.


Evaluation order & placement

  • Ingress: Gateway/edge middleware validates headers/claims and normalizes context.
  • Service boundary: Filters enforce predicates, residency, and idempotency scope.
  • Persistence layer: RLS/predicate injection and integrity stamping.
  • Egress/Exports: Packaging/signing within tenant scope; foreign-region blocks.
  • Ops/Admin: Break-glass evaluators with dual-control and short TTL.
flowchart LR
Ingress["Ingress Guard"] --> Svc["Service Guard"]
Svc --> Persist["Persistence Guard"]
Svc --> Query["Query Guard"]
Svc --> Export["Export Guard"]
Admin["Admin/Ops Guard"] --> Svc

Guard types

Ingestion guards

  • Require tenantId (canonical) and parity with body payload when present.
  • Enforce residency (dataSilo) and edition/tier checks.
  • Ensure idempotency scope {tenantId, idempotencyKey} to prevent cross-tenant dedupe collisions.
  • Stamp tenantId and integrity metadata on persisted records.
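The idempotency-scope rule above can be made concrete by namespacing the dedupe key with the tenant before it reaches the store. A minimal sketch, assuming a SHA-256-keyed dedupe store (the function name is illustrative):

```python
import hashlib

def idempotency_scope_key(tenant_id, idempotency_key):
    """Namespace the producer-supplied idempotency key by tenant so identical
    keys from different tenants can never collide in the dedupe store."""
    # Unit separator (0x1F) prevents ambiguity between ("ab","c") and ("a","bc").
    material = f"{tenant_id}\x1f{idempotency_key}"
    return hashlib.sha256(material.encode("utf-8")).hexdigest()
```

Without this namespacing, two tenants reusing the same client-side key would dedupe against each other, which is exactly the cross-tenant collision the guard forbids.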

Query guards

  • Inject mandatory predicate tenantId = :ctxTenant.
  • Forbid cross-tenant joins, aggregations, or UNIONs; restrict multi-tenant cursors.
  • Apply role-based masking profiles and ABAC attributes derived from claims.

Export guards

  • Force per-tenant object prefixes and signed manifests.
  • Validate requested time range/purpose against policy bundle.
  • Block cross-region exports unless explicitly granted (residency policy).

Admin/Ops guards

  • Break-glass only: time-bound, least-privilege scopes with dual approvals.
  • Emit enriched ComplianceEvent and attach evidence (ticket, justification, approvers).
  • Auto-revoke on TTL expiry or incident closure.

Failure modes (auditable)

| Condition | Action | Outcome (Problem+JSON) | Notes/Evidence |
|---|---|---|---|
| Missing or invalid tenantId | Reject | 400/401 MissingTenant | GuardDecision{basis:"Schema"} |
| Header/body tenantId mismatch | Quarantine | 202 Accepted (isolation lane) | Evidence includes headers + payload |
| Cross-tenant query/join detected | Reject | 403 CrossTenantForbidden | Query plan snapshot |
| Foreign-region export | Reject | 409 ResidencyConflict | Policy & residency tags |
| Policy cache stale/unknown | Degrade/Log | 200 with basis:"Cached" tag | Alert; force refresh asynchronously |

All guard outcomes emit GuardDecision events with {tenantId, operation, decision, policyVersion, evidenceRef}.


Minimal guard middleware (sketch)

var ctx = ContextResolver.From(headers: req.Headers, claims: user.Claims, baggage: Activity.Current?.Baggage);

if (!ctx.HasTenantId) return Problem(MissingTenant);
if (!TenantId.IsCanonical(ctx.TenantId)) return Problem(InvalidTenantFormat);

if (req.Body?.TenantId is string bodyTid && bodyTid != ctx.TenantId)
    return Quarantine(MismatchTenant, evidence: Capture(req));

if (!Residency.Allows(ctx.TenantId, ctx.DataSilo))
    return Problem(ResidencyConflict);

InjectMandatoryPredicate(req, tenantId: ctx.TenantId);
StampTrace(ctx.TenantId);
return next();

Policy expression (example, Rego)

package atp.tenancy

default allow = false

# Allow when the request body omits tenantId, or when it matches the context tenant.
allow {
  input.tenantId != ""
  input.bodyTenantId == ""
}

allow {
  input.tenantId != ""
  input.tenantId == input.bodyTenantId
}

deny["CrossTenantQuery"] {
  some t
  t := input.query.detectedTenants[_]
  t != input.tenantId
}

Design guarantees

  • Mandatory context: no operation proceeds without a validated tenantId.
  • Scoped enforcement: predicates, keys, and proofs are tenant-scoped end-to-end.
  • Provable behavior: every decision is reproducible by policy/version and accompanied by machine-readable evidence.

Identity & Access Integration

IdP → ATP mapping

  • Goal: turn IdP-issued tokens (OIDC/JWT) into tenant-scoped context and enforce RBAC/ABAC consistently.
  • Mapping rules
    • Resolve a canonical TenantId from one of the accepted claim aliases (see table). If multiple present, apply priority and normalize (case-insensitive, URL-safe).
    • Extract subject identity (human or workload), roles/scopes, and attributes (edition, residency, org-unit) for ABAC.
    • Do not conflate the IdP’s directory tenant (tid in Azure AD) with our SaaS TenantId unless explicitly configured to map 1:1.
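The priority-ordered alias resolution described above can be sketched as follows. The claim priority list and header aliases are illustrative per-IdP configuration, not a fixed ATP API; header keys are assumed to be pre-lowercased.

```python
# Hypothetical mapping-profile configuration: claims win over headers,
# and custom:tenantId deliberately outranks the directory-style aliases.
CLAIM_PRIORITY = ["custom:tenantId", "tenant_id", "tenant", "app_tid"]
HEADER_ALIASES = ["x-tenant-id", "tenant"]

def resolve_tenant_id(claims, headers):
    """Return the canonical (lowercased) TenantId, or None for MissingTenant."""
    for key in CLAIM_PRIORITY:
        if claims.get(key):
            return claims[key].lower()
    for key in HEADER_ALIASES:
        if headers.get(key):
            return headers[key].lower()
    return None  # caller rejects with MissingTenant
```

Note the directory tenant claim (tid in Azure AD) is intentionally absent from the default priority list, matching the rule above.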

Claim/Header alias map (normalized into ATP context)

| ATP Context | Preferred Claim(s) | Header Aliases | Notes |
|---|---|---|---|
| TenantId | tenant, tenant_id, app_tid, custom:tenantId | x-tenant-id, tenant | Treat as opaque; priority order is configurable. |
| SubjectId | sub, oid (AAD user), client_id (workload) | | Stable principal identifier. |
| ClientId | azp (authorized party), client_id | | Useful for workload tokens and audit. |
| Scopes | scp (space-delimited) | | For OAuth2 scope-based auth. |
| Roles | roles, groups (when mapped) | | For role-based checks. |
| Edition | custom:edition, app_metadata.edition | x-edition | Drives policy/quotas. |
| DataSilo | custom:data_silo, region | x-data-silo | Residency-aware routing. |
| Correlation | jti (token id, optional) | traceparent, x-correlation-id | Prefer W3C traceparent. |

Configure per-IdP mapping profiles (e.g., azuread, okta, auth0) to avoid accidental use of directory tid as SaaS TenantId.


Required claims & headers

  • Must-have
    • Valid JWT (signature/issuer/audience/time) with SubjectId.
    • TenantId present via claim or X-Tenant-Id header (normalized).
    • Correlation via traceparent (preferred) or X-Correlation-Id.
  • Nice-to-have (enables richer policy)
    • Scopes or Roles (both supported).
    • Edition, DataSilo for ABAC and residency routing.

Ingress contract (HTTP)

Authorization: Bearer <jwt>
X-Tenant-Id: splootvets
X-Data-Silo: us
traceparent: 00-...

Ingress contract (gRPC metadata)

authorization: Bearer <jwt>
x-tenant-id: splootvets
x-data-silo: us

Token classes & authorization model

  • Human tokens (Authorization Code / OIDC)
    • sub identifies the user; roles (e.g., Audit.Reader) and optional org/group attributes drive ABAC.
  • Workload tokens (Client Credentials)
    • client_id is the principal; scopes like audit.ingest, audit.export:read.
    • Must still carry/resolve a TenantId (per-tenant service principals or explicit header).
  • Job tokens (signed job runner)
    • Narrow scopes (e.g., audit.backfill:run); always time-bound and tenant-bound.

Canonical roles (RBAC)

  • Audit.Reader, Audit.Investigator, Audit.Admin
  • Export.Operator, Export.Admin
  • Compliance.Reviewer, Security.Admin
  • SRE.Admin (privileged; mostly for break-glass flows)

Scopes (examples)

  • audit.write, audit.read, audit.export, audit.policy.read, audit.policy.write

ABAC attributes

  • edition, dataSilo, orgUnit, purpose (if provided). Policies can enforce, e.g., Investigator view allowed only in prod org-unit.

Cross-tenant admin (break-glass) with evidence

  • When used: incident response, legal discovery under supervision.
  • Controls
    • Dual approval (separate approvers), least privilege, short TTL (≤ 60 min), IP allowlist optional.
    • Generate an ephemeral grant (JWT or capability token) with narrowed scopes and explicit allowedTenants list.
  • Evidence trail
    • Persist ComplianceEvent{ type:"BreakGlass.Granted", tenantIds, approvers[], ticketRef, reason, issuedAt, expiresAt }.
    • Mirror all actions under break-glass to a separate evidence stream; auto-expire/revoke grant and emit BreakGlass.Revoked.

Workflow (outline)

  1. Operator requests access → attaches ticket & reason.
  2. Approvers sign off (2FA) → system mints ephemeral grant.
  3. All requests with this grant tagged access:break-glass=true.
  4. TTL expiry or manual revoke → grant invalidated; evidence pack sealed.
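The break-glass controls above (tenant allowlist, TTL ≤ 60 minutes, expiry) can be checked with a small validator. The grant shape and function name below are illustrative, consistent with the ephemeral-grant description, not a defined ATP interface.

```python
import time

MAX_TTL_S = 3600  # policy above: TTL <= 60 minutes

def check_break_glass(grant, tenant_id, now=None):
    """Validate an ephemeral break-glass grant against the requested tenant.

    grant: {"allowedTenants": [...], "issuedAt": epoch_s, "expiresAt": epoch_s}
    Returns an outcome string; anything but "Allow" is rejected and audited.
    """
    now = time.time() if now is None else now
    if tenant_id not in grant["allowedTenants"]:
        return "CrossTenantForbidden"   # grant never widens beyond its allowlist
    if grant["expiresAt"] - grant["issuedAt"] > MAX_TTL_S:
        return "GrantTtlTooLong"        # minted grant violates the TTL ceiling
    if now >= grant["expiresAt"]:
        return "GrantExpired"           # auto-revoke on TTL expiry
    return "Allow"
```

In practice each outcome would also emit the corresponding ComplianceEvent described above.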

Validation & enrichment (pseudocode)

var token = await JwtValidator.ValidateAsync(req.Authorization);
var ctx = new TenantContext();

ctx.TenantId = ResolveTenantId(token.Claims, req.Headers); // mapping profile with priority
if (string.IsNullOrEmpty(ctx.TenantId)) return Problem(MissingTenant);

ctx.SubjectId = token.Subject ?? token.ClientId ?? "unknown";
ctx.ClientId  = token.ClientId;
ctx.Scopes    = token.GetScopes();
ctx.Roles     = token.GetRoles();

ctx.Edition   = token.Get("custom:edition") ?? req.Headers["X-Edition"];
ctx.DataSilo  = token.Get("custom:data_silo") ?? req.Headers["X-Data-Silo"];

if (IsBreakGlass(token))
{
  if (!token.AllowedTenants.Contains(ctx.TenantId)) return Problem(CrossTenantForbidden);
  ctx.Flags.Add("breakGlass", true);
}

AttachToTrace(ctx); // set baggage/tags: tenantId, edition, subjectId
return next(ctx);

Example (JWT payload excerpts)

{
  "iss": "https://login.example/idp",
  "aud": "connectsoft-atp",
  "sub": "00u1abc23",
  "azp": "svc-audit-writer",
  "scp": "audit.write audit.export",
  "roles": ["Audit.Reader"],
  "custom:tenantId": "splootvets",
  "custom:data_silo": "us",
  "custom:edition": "enterprise",
  "exp": 1766923200
}

Guarantees

  • Tenant-first resolution: a canonical TenantId is resolved or the call is rejected/quarantined.
  • Principle-of-least-privilege: roles/scopes and ABAC restrict access within the tenant.
  • Provable operations: each decision includes who/what acted (SubjectId/ClientId), for which tenant, and under which policy version, producing an auditable trail.

Data Partitioning Strategy

Topologies

  • Dedicated per-tenant
    • Separate indices/buckets/tables per tenant (and optionally per region).
    • Pros: strongest blast-radius control, tailored retention/SLOs, simpler deletes/exports.
    • Cons: higher ops overhead, more shards/handles, index skew for small tenants.
  • Shared with tenant key
    • One logical store with TenantId as the first-class partition/routing key and strict RLS/predicates.
    • Pros: efficient for many small tenants, easier capacity pooling.
    • Cons: stricter guardrails needed, hot-spot risk without careful keying.

Decision matrix (rule of thumb)

| Tenants | Per-tenant write rate | Assurance | Recommendation |
|---|---|---|---|
| ≤ 50 | High (≥ 1k ev/s) | High | Dedicated per-tenant indices/buckets |
| 50–500 | Mixed | Medium | Hybrid: large tenants dedicated; small tenants shared |
| 500+ | Low (≤ 50 ev/s) | Medium | Shared with tenant key (+ time bucketing) |

Technology mappings

  • Azure SQL/PostgreSQL
    • Dedicated: schema/table per tenant (e.g., audit_<tenant>).
    • Shared: single table with composite key (TenantId, CreatedAt, RecordId) and RLS: USING (tenant_id = current_setting('app.tenant_id')::uuid).
    • Cluster/index on (TenantId, CreatedAt DESC, RecordId).
  • OpenSearch/Elasticsearch
    • Dedicated: atp-audit-{tenant}-{yyyyMM} (rollover by size/time).
    • Shared: index with routing key tenantId; use index templates and ILM by tier.
  • Cosmos DB / Table-like stores
    • Partition key TenantId, row key (ts#RecordId); add hash/bucket suffix to avoid hot partitions.
  • Object storage (Blob/S3)
    • Prefix: tenants/{tenantId}/streams/{stream}/dt={YYYY}/{MM}/{DD}/…; encryption scope per tenant.
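A minimal PostgreSQL sketch of the shared-with-tenant-key mapping above; table, policy, and setting names are illustrative, assuming the session variable convention from the RLS predicate shown earlier.

```sql
-- Shared table keyed by tenant; RLS injects the tenant predicate server-side.
CREATE TABLE audit_records (
  tenant_id  uuid        NOT NULL,
  created_at timestamptz NOT NULL,
  record_id  text        NOT NULL,
  payload    jsonb       NOT NULL,
  PRIMARY KEY (tenant_id, created_at, record_id)
);

ALTER TABLE audit_records ENABLE ROW LEVEL SECURITY;
ALTER TABLE audit_records FORCE ROW LEVEL SECURITY;  -- applies to the owner too

CREATE POLICY tenant_isolation ON audit_records
  USING (tenant_id = current_setting('app.tenant_id')::uuid);

-- Per request, before any query:
--   SELECT set_config('app.tenant_id', '<ctx tenant>', true);  -- transaction-scoped
```

With FORCE ROW LEVEL SECURITY, even a query that forgets its WHERE clause cannot see rows outside the session tenant, which is the deny-by-default behavior the guards require.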

Sharding, keys, and hot-spot avoidance

  • Primary/partition key
    • Always include TenantId first. Prefer time-ordered second key: CreatedAt.
    • Add stability & uniqueness with RecordId (ULID/KSUID preferred for sortability).
  • Sort/secondary keys
    • (CreatedAt DESC, RecordId) for seek pagination and range scans.
    • For heavy tenants, add time buckets (bucket = floor(epoch/300s)) to distribute bursts.
  • Hot-spot strategies
    • Salted routing key: partition = hash(TenantId) % N (kept stable per tenant).
    • Staggered rollovers (indices) and batching writes with jitter.
    • Cap shard size; auto-split on p95 latency or shard size thresholds.
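The salted routing key and time-bucket formulas above can be sketched directly. This is an illustrative sketch: a cryptographic digest is used instead of a language-native hash so the partition stays stable across processes.

```python
import hashlib
import math

def stable_partition(tenant_id, n_partitions):
    """partition = hash(TenantId) % N, using SHA-256 so the result is stable
    per tenant across processes (Python's built-in hash() is randomized)."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_partitions

def time_bucket(epoch_seconds, width_s=300):
    """bucket = floor(epoch/300s): 5-minute buckets spread write bursts."""
    return math.floor(epoch_seconds / width_s)
```

A heavy tenant's routing key then becomes, e.g., (tenant_id, stable_partition(...), time_bucket(...)), keeping bursts spread while all keys remain tenant-first.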

Seek-cursor example (tenant-safe)

-- WHERE tenant_id = :tid AND (created_at, record_id) < (:cursor_ts, :cursor_id)
SELECT *
FROM audit_records
WHERE tenant_id = :tid
  AND (created_at, record_id) < (:ts, :id)
ORDER BY created_at DESC, record_id DESC
LIMIT :page_size;

Re-indexing & migration (splits/merges)

  • Shadow build → cutover
    1. Stand up shadow index/table with new mapping (e.g., new analyzers/keys or per-tenant split).
    2. Dual-write new traffic to both old and shadow.
    3. Backfill historical data (bounded by tenant/time windows).
    4. Verify counts/hashes (Merkle/segment proofs) → flip read alias → retire old.
  • Tenant split
    • Issue new TenantIds; maintain mapping table {oldTid → [newTid…]}.
    • Re-key historical rows using mapping; keep idempotency keys namespaced by new tenant.
  • Tenant merge
    • Choose a survivor TenantId; remap the others to the survivor; re-index with the survivor as partition key.
    • Emit Tenant.Remapped events; keep a read-compatibility window via aliases/views.
  • Residency moves
    • Replay to target DataSiloId; block cross-region reads during move unless break-glass.
    • Re-wrap DEKs under the new tenant KEK for the destination region.

Naming, retention, and evidence

  • Index/bucket naming
    • atp-audit-{tenantId}-{region}-{yyyyMM} (dedicated) or atp-audit-shared-{region}-{yyyyMM} (shared).
  • Per-tenant retention
    • Apply ILM/TTL by tenant tier/edition; legal holds suspend deletes at the prefix/index level.
  • Backups
    • Label artifacts with TenantId and chain epoch; restore into tenant-isolated sandboxes for verification.

Invariants

  • TenantId is part of every primary/partition key and predicate.
  • No cross-tenant compound indices or shared cache keys.
  • Re-index/cutovers are zero-downtime and provable (counts, hashes, sampled diffs).

Per-Tenant Cryptography

Key management

  • KEK per tenant (and per region/data silo). Each tenant gets its own Key Encryption Key stored in a managed KMS/HSM; KEKs are never shared across tenants or regions.
  • Rotation cadence. Time-based (e.g., 90/180 days, tier-dependent) plus on-demand rotation for incidents. Rotation is non-disruptive via key versioning.
  • Dual control. Destructive operations (purge, disable, export) require two approvers and a linked ticket; all KMS actions are audited.
  • Access policy. Only ATP encrypt/decrypt service principals within the same DataSiloId may use the tenant KEK; cross-region use is blocked by policy.
  • BYOK/CSEK (optional). Tenants may supply a KMS key reference; platform validates liveness and permissions at onboarding and continuously thereafter.

KEK metadata (example)

{
  "keyId": "kms://us/tenants/splootvets/kek/v7",
  "tenantId": "splootvets",
  "dataSilo": "us",
  "purpose": "encryption",
  "rotation": { "cadenceDays": 180, "nextRotationOn": "2026-04-01" },
  "createdBy": "atp-kms-operator",
  "tags": { "edition": "enterprise", "compliance": "hipaa,soc2" }
}

Envelope encryption

  • DEK per segment/chunk. Payloads are encrypted with short-lived Data Encryption Keys (e.g., AES-GCM 256).
  • Wrapping. DEK is wrapped by the tenant KEK (KMS WrapKey/UnwrapKey). The wrapped dek and KEK version are stored alongside the record/segment.
  • Versioning. Reads select the correct KEK version from metadata; rotations do not require re-encrypting historical data.
  • Crypto-shred. Disabling or destroying a KEK version renders data sealed with that version unreadable while leaving integrity evidence verifiable.

Record/segment envelope

{
  "tenantId": "splootvets",
  "segmentId": "seg-2025-10-28T07",
  "algo": "AES-256-GCM",
  "ciphertext": "…",
  "dekRef": {
    "kek": "kms://us/tenants/splootvets/kek",
    "kekVersion": 7,
    "wrappedDek": "base64url(…)"
  }
}

Integrity (hash chains & signatures)

  • Tenant-scoped chains. Each tenant maintains independent hash chains (or Merkle segments) over ordered records.
  • Signing. Segment roots are signed with a tenant integrity key (Ed25519/ECDSA) kept in KMS/HSM and separate from encryption KEKs.
  • Self-contained proofs. Verifiers need only the tenant’s manifests and public keys; no cross-tenant material is required.

Segment seal (concept)

{
  "tenantId": "splootvets",
  "segmentRoot": "sha256(…)",
  "prevRoot": "sha256(…)",
  "signedAt": "2025-10-28T07:05:00Z",
  "sigAlgo": "Ed25519",
  "signature": "base64url(…)",
  "keyId": "kms://us/tenants/splootvets/signing/v3"
}
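The tenant-scoped chain described above can be sketched as a recomputable root over ordered records. This is a simplified illustration (no Merkle tree, and the KMS signing step is omitted); function names and the hashing layout are assumptions, not the ATP chain format.

```python
import hashlib
import json

def seal_segment(tenant_id, records, prev_root):
    """Illustrative chain link: root = SHA-256 over the previous root, the
    tenant scope, and the ordered per-record hashes."""
    h = hashlib.sha256()
    h.update(prev_root.encode("utf-8"))
    h.update(tenant_id.encode("utf-8"))  # binds the segment to one tenant
    for record in records:
        canonical = json.dumps(record, sort_keys=True).encode("utf-8")
        h.update(hashlib.sha256(canonical).digest())
    return h.hexdigest()

def verify_chain(tenant_id, segments, genesis="0" * 64):
    """Recompute each root from its predecessor; any tampered record or
    reordered segment breaks verification. Needs no cross-tenant material."""
    prev = genesis
    for seg in segments:
        if seal_segment(tenant_id, seg["records"], prev) != seg["root"]:
            return False
        prev = seg["root"]
    return True
```

Note the verifier's only inputs are the target tenant's records and roots, matching the self-contained-proof property above.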

Key tagging & deletion safeguards

  • Tags drive policy. tenantId, dataSilo, purpose (encrypt/sign), edition, legalHold inform automation and alerts.
  • Soft-delete windows. Keys enter a recoverable state (e.g., 7–30 days) before purge; purge requires dual-approval and no active legal holds.
  • Pre-flight checks. Before revocation/purge, the platform verifies export-before-delete toggles and evidence packs have been produced.

Residency & availability

  • Per-silo keys only. Keys never leave their DataSiloId. Multi-region tenants receive silo-local KEKs and signing keys.
  • Failure modes. If KMS is unavailable in a silo:
    • Write path: buffer to durable queue with exponential backoff; no plaintext writes.
    • Read path: fail closed with KeyUnavailable; emit GuardDecision and SRE alerts.

BYOK/CSEK specifics

  • Tenant responsibilities. Maintain uptime of external KMS, rotate keys, and provide continuity for incident response.
  • Health checks. ATP continuously validates Wrap/Unwrap/Sign on a canary object per tenant to detect drift or permission revokes.
  • Impact statement. If the customer revokes a KEK, historical data becomes unreadable; integrity proofs and manifests remain verifiable.

Encrypt/Decrypt (pseudocode)

// Encrypt
var dek = Crypto.GenerateDek();                               // AES-256 key
var ct  = Crypto.AesGcmEncrypt(dek, plaintext, aad: tenantId);
var wrappedDek = Kms.WrapKey(kek: tenant.KekRef, dek);

return new Envelope(ct, dekRef: (tenant.KekRef, tenant.KekVersion, wrappedDek));

// Decrypt
var dek = Kms.UnwrapKey(kek: env.DekRef.Kek, wrapped: env.DekRef.WrappedDek);
var pt  = Crypto.AesGcmDecrypt(dek, env.Ciphertext, aad: tenantId);

Invariants

  • Encryption, key usage, and integrity always occur within tenant scope and data silo.
  • Encryption and signing keys are logically separated; compromise of one does not imply compromise of the other.
  • All KMS operations are audited and linkable to requests via correlation ids and ComplianceEvents.

Policy-as-Code Enforcement

Tenant policy bundles

  • Scope & contents
    • Residency: allowed dataSilo/regions, failover posture.
    • Retention: per-stream TTL, grace windows, purge schedules, legal-hold overrides.
    • Redaction & masking: class→rule maps, role-based masking profiles, PII/PHI classifiers.
    • Export permissions: who/what may export, allowed fields/formats, purpose restrictions.
    • Edition/entitlements: feature gates and quotas tied to tenant edition/tier.
  • Form & delivery
    • Bundled as signed documents (YAML/JSON) with semantic version (tenant:<id>@<ver>).
    • Distributed via a policy registry; services fetch and cache with ETag per tenant.
    • Backed by Git/ADR lineage; changes require review and dual-approval for sensitive tenants.

Bundle manifest (example)

tenant: splootvets
version: 42
signingKey: kms://us/policy/signing/v2
residency:
  allowedSilos: [us]
  writeFailover: forbidden
retention:
  streams:
    audit.default: { ttlDays: 365, graceDays: 7 }
    audit.security: { ttlDays: 2555, graceDays: 14 } # ~7y
redaction:
  defaultByField:
    email: { kind: HASH, params: { algo: SHA256, saltRef: "tenant-salt" } }
    phone: { kind: MASK, params: { showLast: 4 } }
  maskingProfiles:
    Investigator: { email: MASK, phone: MASK }
    Reader:       { email: HASH, phone: HASH }
export:
  allowedActors: ["Export.Operator","Export.Admin"]
  formats: ["parquet","ndjson"]
  purposeRequired: true
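The defaultByField rules in the manifest above (HASH with a tenant salt, MASK keeping the last N characters) can be applied with a small write-time redactor. A minimal sketch under assumed rule shapes; the function name and salt handling are illustrative.

```python
import hashlib

def apply_redaction(record, rules, salt="tenant-salt"):
    """Apply per-field redaction before persist: HASH replaces the value with a
    salted SHA-256 digest; MASK keeps only the last `showLast` characters."""
    out = dict(record)
    for field, rule in rules.items():
        if field not in out:
            continue
        value = str(out[field])
        if rule["kind"] == "HASH":
            out[field] = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
        elif rule["kind"] == "MASK":
            keep = rule["params"]["showLast"]
            out[field] = "*" * max(len(value) - keep, 0) + value[-keep:]
    return out
```

Because HASH is salted per tenant, equal values still correlate within one tenant (useful for investigations) without being comparable across tenants.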

Evaluation lifecycle

  • On write (ingest)
    • Classify and apply redaction before persist; attach policyVersion to each record/segment.
    • Enforce residency and reject writes to foreign silos unless break-glass.
    • Compute idempotency scope with {tenantId, policyVersion} to ensure consistent outcomes.
  • On read (query)
    • Inject mandatory tenant predicate; select masking profile based on role/scope/intent.
    • Deny if the requested projection violates export/masking constraints even for in-app reads.
  • On export
    • Evaluate purpose-limited view; include signed manifest with policyVersion, filters, and field-level transformations.
    • Verify legal holds and DSAR precedence before generating artifacts.

Determinism guarantees

  • Same inputs + same policyVersion ⇒ same decision/outcome.
  • Policy lookups are version-pinned per request; cache hits must match ETag or be re-fetched.

Drift detection & auditability

  • Signed versions. Every bundle is signed; services verify signature and issuer before use.
  • Cache discipline. Tenant policy cache uses ETag; if stale/unknown, degrade with basis:"Cached" tag and raise alert.
  • Change evidence. Policy updates emit Policy.Updated events with diff, approvers, and ADR link.
  • Runtime monitors. Canary evaluations compare current vs. next policy in shadow mode; discrepancies are reported before rollout.

Policy decision record (shape)

{
  "tenantId": "splootvets",
  "operation": "Read",
  "decision": "Allow",
  "policyVersion": "tenant:splootvets@42",
  "maskingProfile": "Investigator",
  "residency": "us",
  "basis": ["RLS.Tenant", "Mask.Investigator", "Retention.Active"],
  "correlationId": "6f5c…",
  "ts": "2025-10-28T07:40:12Z"
}

Example rules (OPA/Rego)

package atp.policy

import future.keywords

default allow := false
default profile := "Reader"

allow if {
  input.tenantId != ""
  input.residency in allowed_silos[input.tenantId]
  input.operation == "Write"
}

allow if {
  input.operation == "Read"
  input.tenantId == input.context.tenantId
}

profile := "Investigator" if {
  "Audit.Investigator" in input.context.roles
}

allowed_silos := {"splootvets": {"us"}}

Enforcement sketch (pseudocode)

var bundle = PolicyCache.Get(tenantId, etagHint: req.PolicyEtag);
VerifySignature(bundle);

var decision = PolicyEvaluator.Evaluate(bundle, req); // write/read/export
if (!decision.Allowed) return Problem(decision.ProblemCode, decision.Basis);

if (req.Operation == Write)
    req.Record = Redactor.Apply(bundle, req.Record);

if (req.Operation == Read)
    resp.Payload = Masker.Apply(decision.MaskingProfile, resp.Payload);

Annotate(resp, policyVersion: bundle.Version, basis: decision.Basis);

Invariants

  • Policies are tenant-scoped, signed, and versioned; services must pin evaluation to a specific version per request.
  • Write-time redaction ensures sensitive data is never stored in clear; read/export masking adjusts to role/intent without re-persisting.
  • Any fallback (stale cache) is explicitly tagged and alerted; all policy changes and decisions are auditable end-to-end.

SDK & Integration Contracts

Mandatory fields in AuditRecord

Goal: guarantee tenant context and replay-safe ingestion.

Shape (JSON, minimal)

{
  "tenantId": "splootvets",
  "idempotencyKey": "tid:splootvets|ulid:01J9ZC5K5Q2Z6S0Z2G1WZQW5Q4",
  "createdAt": "2025-10-28T07:55:12.345Z",
  "action": "Appointment.Booked",
  "resource": { "type": "Appointment", "id": "apt_123" },
  "actor": { "type": "User", "id": "00u1abc23" },
  "correlation": { "traceId": "3e1f…1716", "spanId": "7f6e…1908" }
}

Required fields

  • tenantId (string, opaque, must match header/claims)
  • idempotencyKey (string, globally unique within tenant; recommend ULID/KSUID with tenant prefix)
  • createdAt (RFC 3339/UTC, millisecond precision)
  • action (string, namespaced; see Unknown=0 note below)
  • resource.type (string) and resource.id (string)
  • actor.type (enum-ish string: User|Service|System|Device|External)
  • actor.id (string; user id, client id, or device id)
  • correlation.traceId (W3C trace id; SDK fills from traceparent)

Optional but encouraged

  • context (free-form map; never duplicate tenant identity here)
  • purpose (string; used by policy on export)
  • labels (map of low-cardinality tags; auto-prefixed by SDK if needed)
  • schemaVersion (string; pinned by SDK)

JSON Schema (excerpt)

{
  "$id": "https://connectsoft.dev/atp/schemas/audit-record.schema.json",
  "type": "object",
  "required": ["tenantId", "idempotencyKey", "createdAt", "action", "resource", "actor", "correlation"],
  "properties": {
    "tenantId": { "type": "string", "maxLength": 128 },
    "idempotencyKey": { "type": "string", "maxLength": 256 },
    "createdAt": { "type": "string", "format": "date-time" },
    "action": { "type": "string", "maxLength": 200 },
    "resource": {
      "type": "object",
      "required": ["type", "id"],
      "properties": { "type": { "type": "string" }, "id": { "type": "string" } }
    },
    "actor": {
      "type": "object",
      "required": ["type", "id"],
      "properties": { "type": { "type": "string" }, "id": { "type": "string" } }
    },
    "correlation": {
      "type": "object",
      "required": ["traceId"],
      "properties": { "traceId": { "type": "string" }, "spanId": { "type": "string" } }
    },
    "context": { "type": "object", "additionalProperties": true },
    "labels": { "type": "object", "additionalProperties": { "type": "string" } },
    "purpose": { "type": "string" },
    "schemaVersion": { "type": "string" }
  },
  "additionalProperties": false
}

Evolution (additive & backward compatible)

  • Add-only: new fields are optional; never repurpose or remove existing fields.
  • Enum pattern: when modeling enums client-side, include an Unknown = 0 member; treat unknown values as no-op for business logic but preserve on round-trip.
  • Meta bag: place experimental fields under context.* or labels.*; promote to first-class only after stabilization.
  • Version pinning: SDK sets schemaVersion and sends Accept-Schema: audit-record;v=1. Server negotiates up to supported version; returns 406 Not Acceptable if incompatible.
  • Deprecation: server returns Warning: 299 atp "field XYZ will be removed in vN" with a 90-day minimum window; SDK logs and surfaces telemetry.

Client enforcement (SDK behaviors)

Request builders & middleware

  • Header injection: ensure X-Tenant-Id, traceparent, Idempotency-Key are present; prefer Authorization: Bearer <jwt>.
  • Parity check: if body has tenantId, it must equal the header/claim; otherwise quarantine (SDK can opt-in to hard reject).
  • Time normalization: convert timestamps to UTC RFC 3339 with millisecond precision.
  • Idempotency: Idempotency-Key = "tid:{tenantId}|{ULID}"; retries must reuse the same key; server returns the original result on duplicates.
  • Retry/backoff: on 429|503|599, apply exponential backoff with jitter, respect Retry-After, and cap at ≤ 30s per attempt; total budget ≤ 2 minutes.
  • Telemetry: attach tenantId, action, resource.type as OTEL attributes; propagate baggage.
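The key format and backoff rules above can be sketched as SDK helpers. This is illustrative; the `ulid()` stub stands in for a real ULID generator:

```typescript
// Stub ULID: a real SDK would use a proper monotonic ULID implementation.
const ulid = (): string =>
  Date.now().toString(36).toUpperCase() +
  Math.random().toString(36).slice(2, 12).toUpperCase();

// Idempotency-Key format from the contract: "tid:{tenantId}|{ULID}".
export function idempotencyKeyFor(tenantId: string): string {
  return `tid:${tenantId}|${ulid()}`;
}

// Exponential backoff with full jitter, capped at 30s per attempt;
// a server-supplied Retry-After always wins.
export function backoffMs(attempt: number, retryAfterMs?: number): number {
  if (retryAfterMs !== undefined) return retryAfterMs;
  const cap = 30_000;
  const base = Math.min(cap, 1_000 * 2 ** attempt);
  return Math.floor(Math.random() * base); // full jitter in [0, base)
}
```

Retries must reuse the key produced by `idempotencyKeyFor` for the original attempt; only the delay between attempts is recomputed.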

Problem+JSON error taxonomy (client mapping)

| type | HTTP | Meaning / Client action |
| --- | --- | --- |
| https://atp.connectsoft.dev/errors/missing-tenant | 400/401 | Add/resolve tenantId (header/claim); do not retry |
| …/tenant-mismatch | 202 | Body/header mismatch → quarantined; escalate |
| …/residency-conflict | 409 | Wrong region/silo; route to correct endpoint |
| …/rate-limited | 429 | Retry with backoff; respect Retry-After |
| …/idempotency-conflict | 409 | Reused key with different payload; fix client bug |
| …/policy-stale | 200 | Served under cached policy; log + alert |

SDKs surface these as typed exceptions (MissingTenantException, ResidencyConflictException, …) with the original Problem+JSON attached.
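A minimal sketch of that mapping, operating on the parsed Problem+JSON body. Class names mirror the typed exceptions mentioned above but are illustrative, and the full error URIs are assumed to follow the `missing-tenant` pattern:

```typescript
// Problem+JSON shape (RFC 9457-style) and typed error hierarchy (illustrative).
type Problem = { type: string; title?: string; status?: number; detail?: string };

export class AtpError extends Error {
  constructor(public readonly problem: Problem) {
    super(problem.title ?? problem.type); // original Problem+JSON stays attached
  }
}
export class MissingTenantError extends AtpError {}
export class ResidencyConflictError extends AtpError {}
export class IdempotencyConflictError extends AtpError {}

const byType: Record<string, new (p: Problem) => AtpError> = {
  "https://atp.connectsoft.dev/errors/missing-tenant": MissingTenantError,
  "https://atp.connectsoft.dev/errors/residency-conflict": ResidencyConflictError,
  "https://atp.connectsoft.dev/errors/idempotency-conflict": IdempotencyConflictError,
};

export function mapProblem(problem: Problem): AtpError {
  const Ctor = byType[problem.type] ?? AtpError; // unknown types fall back to the base class
  return new Ctor(problem);
}
```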


HTTP contract (ingest)

POST /v1/audit/records
Authorization: Bearer <jwt>
X-Tenant-Id: splootvets
Idempotency-Key: tid:splootvets|ulid:01J9ZC5K…
Content-Type: application/json
Accept-Schema: audit-record;v=1
traceparent: 00-3e1f…-7f6e…-01
  • 200 OK with { recordId, policyVersion, segmentId, integrity } on success.
  • 202 Accepted for quarantine with problem+json and evidenceRef.
  • 409 Conflict for idempotency mismatch (same key, different payload hash).

Response (success, excerpt)

{
  "recordId": "01J9ZC5K…",
  "tenantId": "splootvets",
  "policyVersion": "tenant:splootvets@42",
  "segmentId": "seg-2025-10-28T07",
  "integrity": { "root": "sha256:9f2…", "prev": "sha256:a1c…", "sig": "MEQCI…" }
}

C# SDK sketch

public sealed record AuditRecord(
    string TenantId,
    string IdempotencyKey,
    DateTimeOffset CreatedAt,
    string Action,
    ResourceRef Resource,
    ActorRef Actor,
    CorrelationRef Correlation,
    IDictionary<string, string>? Labels = null,
    IDictionary<string, object>? Context = null,
    string? Purpose = null,
    string? SchemaVersion = "1.0");

public static class AuditClientExtensions
{
    public static async Task<IngestResult> WriteAsync(this IAuditClient client, AuditRecord record, CancellationToken ct = default)
    {
        record = record with
        {
            CreatedAt = record.CreatedAt.ToUniversalTime(),
            IdempotencyKey = record.IdempotencyKey ?? IdempotencyKey.For(record.TenantId)
        };

        ClientGuards.ValidateTenantParity(record);              // header vs body
        using var act = Telemetry.StartActivity("audit.write", record);
        return await client.PostAsync("/v1/audit/records", record, headers =>
        {
            headers["X-Tenant-Id"] = record.TenantId;
            headers["Idempotency-Key"] = record.IdempotencyKey!;
        }, ct);
    }
}

TypeScript SDK sketch

type AuditRecord = {
  tenantId: string;
  idempotencyKey: string;
  createdAt: string; // RFC3339 UTC
  action: string;
  resource: { type: string; id: string };
  actor: { type: string; id: string };
  correlation: { traceId: string; spanId?: string };
  labels?: Record<string,string>;
  context?: Record<string,unknown>;
  purpose?: string;
  schemaVersion?: string;
};

export async function writeAudit(rec: AuditRecord, token: string) {
  const body = { ...rec, createdAt: new Date(rec.createdAt).toISOString() };
  const res = await fetch("/v1/audit/records", {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${token}`,
      "X-Tenant-Id": body.tenantId,
      "Idempotency-Key": body.idempotencyKey,
      "Content-Type": "application/json"
    },
    body: JSON.stringify(body)
  });
  if (!res.ok) throw await mapProblem(res); // 202 (quarantine) is in the ok range; inspect the body for problem details
  return res.json();
}

Idempotency & payload hash

  • Server computes payloadHash = sha256(canonicalize(body minus idempotencyKey)).
  • If the same Idempotency-Key arrives with a different payloadHash, respond 409 idempotency-conflict with both hashes for debugging.
  • SDKs compute and log the same hash to correlate issues.
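The hash rule might look like the following sketch. Canonicalization here is simple key-sorted JSON for illustration; a production build would pin a JCS (RFC 8785) implementation:

```typescript
import { createHash } from "node:crypto";

// Deterministic, key-sorted JSON serialization (illustrative canonical form).
export function canonicalize(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(canonicalize).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const entries = Object.entries(value as Record<string, unknown>)
      .sort(([a], [b]) => (a < b ? -1 : 1))
      .map(([k, v]) => `${JSON.stringify(k)}:${canonicalize(v)}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value);
}

// payloadHash = sha256(canonicalize(body minus idempotencyKey))
export function payloadHash(body: Record<string, unknown>): string {
  const { idempotencyKey, ...rest } = body; // exclude the key itself
  return createHash("sha256").update(canonicalize(rest)).digest("hex");
}
```

Because the key is excluded and keys are sorted, two submissions with the same logical payload hash identically regardless of field order, which is what makes the 409 mismatch check reliable.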

Telemetry & diagnostics

  • SDK emits counters: audit.write.attempts, audit.write.success, audit.write.retry, audit.write.quarantine.
  • Attributes: tenantId, action, resource.type, edition (if available).
  • Logs include correlation.traceId and Idempotency-Key for evidence linkage.

Invariants

  • Client must send tenantId and a stable idempotencyKey; server must inject tenant predicates and integrity linkages.
  • Schemas evolve additively; unknown fields are preserved and ignored by the server’s validator.
  • Errors are returned as Problem+JSON with stable type URIs; SDKs map them to typed exceptions and actionable guidance.

Query & Export Isolation

Authorized query filters

  • Mandatory predicates
    • Server injects tenantId = <token.tenantId> into every query; clients cannot override or widen it.
    • Additional ABAC predicates may be added from roles/scopes (e.g., maskingProfile, env in {"prod"}).
  • Field allow-list
    • Only specific fields are filterable/sortable to avoid side-channel leaks:
      • Filters: tenantId (server-set), createdAt, action, resource.type, resource.id, actor.type, actor.id, labels.*
      • Sorts: createdAt, recordId (tie-breaker)
    • No free-text filters on sensitive fields post-redaction; use dedicated projections.
  • Guarded query plan
    • Queries are analyzed for cross-tenant joins/UNIONs or missing tenant predicate and rejected.

SQL example (server-assembled)

SELECT cols
FROM audit_records
WHERE tenant_id = :ctx_tenant
  AND created_at BETWEEN :from AND :to
  AND (resource_type = COALESCE(:r_type, resource_type))
ORDER BY created_at DESC, record_id DESC
LIMIT :page_size;

OpenSearch/ES example (server-assembled)

{
  "index": "atp-audit-shared-us-*",
  "routing": "splootvets",
  "query": {
    "bool": {
      "filter": [
        { "term": { "tenantId": "splootvets" } },
        { "range": { "createdAt": { "gte": "2025-10-01T00:00:00Z", "lte": "2025-10-31T23:59:59Z" } } },
        { "term": { "resource.type": "Appointment" } }
      ]
    }
  },
  "sort": [{ "createdAt": "desc" }, { "recordId": "desc" }]
}

Tenant-safe pagination (seek cursors)

  • Seek/keyset only. No offset paging across tenants. Cursors encode a watermark of (tenantId, createdAt, recordId) and are bound to the caller’s tenantId.
  • Stable sort. Always sort by createdAt DESC, recordId DESC to guarantee deterministic continuation.
  • Opaque cursors. Server issues nextCursor = base64url(json) with:
{ "tenantId":"splootvets", "ts":"2025-10-28T08:00:01.234Z", "rid":"01J9ZC…", "sig":"MEQCI…" }

The sig prevents tampering; server validates cursor.tenantId == token.tenantId.
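A hedged sketch of cursor issuance and validation, using HMAC-SHA256 as a stand-in for the server's actual signing scheme:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

type Cursor = { tenantId: string; ts: string; rid: string };

// Canonical payload so signing is independent of object key order.
const payloadOf = (c: Cursor) =>
  JSON.stringify({ tenantId: c.tenantId, ts: c.ts, rid: c.rid });

export function issueCursor(c: Cursor, secret: string): string {
  const sig = createHmac("sha256", secret).update(payloadOf(c)).digest("base64url");
  return Buffer.from(JSON.stringify({ ...c, sig })).toString("base64url");
}

// Rejects tampered cursors and cursors minted for another tenant.
export function openCursor(token: string, ctxTenantId: string, secret: string): Cursor {
  const parsed = JSON.parse(Buffer.from(token, "base64url").toString()) as Cursor & { sig: string };
  const expect = createHmac("sha256", secret).update(payloadOf(parsed)).digest("base64url");
  if (parsed.sig.length !== expect.length ||
      !timingSafeEqual(Buffer.from(parsed.sig), Buffer.from(expect))) {
    throw new Error("invalid-cursor");
  }
  if (parsed.tenantId !== ctxTenantId) throw new Error("invalid-cursor"); // cursor.tenantId must equal token tenant
  return { tenantId: parsed.tenantId, ts: parsed.ts, rid: parsed.rid };
}
```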

SQL keyset pattern

SELECT cols
FROM audit_records
WHERE tenant_id = :ctx_tenant
  AND (created_at, record_id) < (:cursor_ts, :cursor_id)
ORDER BY created_at DESC, record_id DESC
LIMIT :page_size;

OpenSearch search_after

{ "search_after": ["2025-10-28T08:00:01.234Z", "01J9ZC…"] }

Search/index access control

  • Routing & aliases
    • Shared indices use routing = tenantId and a mandatory { "term": {"tenantId": …} } filter.
    • Per-tenant aliases (or filtered aliases) may be used for high-assurance tenants.
  • Document storage
    • Persist tenantId in the primary key/routing; no cross-tenant compound indices.
  • Capabilities
    • Disallow listing raw indices to non-admin principals; expose only tenant-scoped search APIs.
  • Caching
    • Query result caches are keyed by tenantId + query hash; never shared across tenants.

Exports (packaging & evidence)

  • Per-tenant packaging

    • Object prefixes: tenants/{tenantId}/exports/{stream}/{yyyy}/{mm}/{dd}/…
    • Each export produces a signed manifest with parameters and proofs; artifacts are encrypted under the tenant KEK in the target dataSilo.
  • Manifest (example)

{
  "tenantId": "splootvets",
  "stream": "audit.default",
  "timeRange": { "from": "2025-10-01T00:00:00Z", "to": "2025-10-31T23:59:59Z" },
  "recordCount": 124_532,
  "chunks": [
    { "path": "part-0001.parquet", "sha256": "9f2…", "size": 134217728 },
    { "path": "part-0002.parquet", "sha256": "a1c…", "size": 128774231 }
  ],
  "policyVersion": "tenant:splootvets@42",
  "integrity": { "segmentRoots": ["sha256:…","sha256:…"], "prevRoot": "sha256:…", "sigKey": "kms://us/…/signing/v3", "signature": "MEQCI…" },
  "generatedAt": "2025-10-31T23:59:59Z",
  "tool": { "name": "atp-exporter", "version": "1.8.3" }
}
  • Verify-on-download

    • HEAD on each object returns ETag and x-atp-sha256. Client compares to manifest.
    • A verify endpoint can re-compute checksums and validate segment proofs and manifest signature.
  • Evidence bundle

    • Includes: manifest, signature, key ids, policy version, query filters, and trace linkage (correlationId).
    • Optional chain proof to anchor export to a tenant’s integrity segment.
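Verify-on-download reduces to a chunk-by-chunk comparison against the manifest. A sketch, where `readChunk` is a hypothetical abstraction over the object download:

```typescript
import { createHash } from "node:crypto";

// Manifest chunk entry, matching the example manifest above.
type Chunk = { path: string; sha256: string; size: number };

// One chunk: size must match and sha256 must equal the manifest entry.
export function verifyChunk(bytes: Buffer, expected: Chunk): boolean {
  if (bytes.length !== expected.size) return false;
  const digest = createHash("sha256").update(bytes).digest("hex");
  return digest === expected.sha256;
}

// All chunks: returns the paths that failed (empty array = export verified).
export function verifyAll(chunks: Chunk[], readChunk: (path: string) => Buffer): string[] {
  return chunks.filter((c) => !verifyChunk(readChunk(c.path), c)).map((c) => c.path);
}
```

Manifest-signature and segment-proof validation would layer on top of this; the chunk check alone already catches truncated or corrupted downloads.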

Error taxonomy (Problem+JSON)

| type | HTTP | Meaning / Action |
| --- | --- | --- |
| …/missing-tenant-predicate | 400 | Client attempted query without tenant scope; fix call |
| …/cross-tenant-query | 403 | Detected join/union across tenants; disallowed |
| …/unsupported-filter | 400 | Field not in allow-list; remove/replace filter |
| …/invalid-cursor | 400 | Cursor tampered/expired/wrong tenant; restart query |
| …/residency-conflict | 409 | Export target region violates residency |

Pseudocode (server guard)

var tid = ctx.TenantId ?? throw Problem(MissingTenant);
var plan = Analyze(query);

if (!plan.HasTenantPredicate || plan.ReferencesForeignTenant)
    throw Problem(CrossTenantForbidden);

var allowed = new[] {"createdAt","action","resource.type","resource.id","actor.type","actor.id","labels.*"};
if (plan.Filters.Any(f => !allowed.Contains(f.Field))) 
    throw Problem(UnsupportedFilter);

var stableSort = new[] {("createdAt","desc"),("recordId","desc")};
query = query.WithMandatoryTenant(tid)
             .WithSort(stableSort)
             .WithKeysetCursor(cursor => cursor.BindToTenant(tid));
return Execute(query);

Invariants

  • All queries must include (or receive) tenantId = <ctx> and use seek pagination bound to that tenant.
  • Search/index access is route-scoped to tenant; listing or joining across tenants is forbidden to non-admin principals.
  • Every export is per-tenant, signed, and verifiable with an evidence bundle; foreign-region exports are blocked by residency policy.

Operational Runbooks (Tenant Lens)

Onboarding / Offboarding

Onboarding — checklist

  • Create tenant record in the Registry with tenantId, displayName, edition, dataSilo, tags, and contacts.
  • Provision KMS keys: per-tenant KEK (+ signing key) in the target dataSilo; record key ids on the tenant record.
  • Attach policy bundle: residency, retention, redaction, export permissions; verify signature and pin policyVersion.
  • Quotas & limits: set per-tenant throughput/storage/export-concurrency; initialize rate-limit buckets.
  • Warm caches: policy, registry, routing tables; verify ETag/version coherence.
  • Smoke tests (tenant-scoped):
    • Ingest a signed canary record → project → query with masking profile → export 1 small chunk.
    • Validate integrity chain advanced and segment seal present.
  • Dashboards: create per-tenant SLO panels (ingest p95, projector lag, DLQ depth, export lead time).
  • Emit Tenant.Created compliance event with the evidence bundle (policyVersion, key ids, dashboards URLs).

Offboarding — checklist

  • Suspend ingestion for the tenant (Write = Deny, Read = Allow by default) and broadcast a “sunset” window.
  • Export-before-delete: offer DSAR/retention-aware final export; record acceptance/rejection.
  • Check legal holds: block finalization while any hold is active.
  • Draining: allow projectors/ETL to catch up; DLQ must be 0 for this tenant.
  • PendingDeletion state with a timer (e.g., 30 days) and reminder notifications.
  • Finalize:
    • Revoke/rotate tenant KEK (crypto-shred where allowed).
    • Remove per-tenant indices/buckets or apply TTL/ILM for shared stores.
    • Tear down dashboards/alerts.
  • Emit Tenant.Deleted with references to the last integrity segment and key revocation ids.

State machine (simplified)

stateDiagram-v2
    [*] --> Provisioning
    Provisioning --> Active: smoke passed
    Active --> Suspended: operator action / billing
    Suspended --> Active: operator reinstate
    Suspended --> PendingDeletion: offboard approved
    PendingDeletion --> Deleted: window elapsed & no holds
Hold "Alt" / "Option" to enable pan & zoom

Incident Response: Suspected Cross-Tenant Access

Trigger conditions

  • Query plan without mandatory tenant predicate.
  • Cross-tenant join/union detected by planner.
  • Integrity chain discrepancy involving foreign tenant id.
  • Alert from policy engine: “foreign-region export attempted”.

Response SLOs

  • Acknowledge: ≤ 15 minutes.
  • Freeze path: ≤ 10 minutes from detection.
  • Evidence pack: first-cut ≤ 60 minutes.

Step-by-step

  1. Freeze & contain
    • Toggle Guard.KillSwitch(tenantId) for read/write paths suspected.
    • Switch projectors for the tenant to read-only.
    • Block exports for the tenant and for principals involved.
  2. Snapshot & preserve
    • Capture current policy bundle, routing tables, key versions.
    • Snapshot relevant indices/partitions with X-ATP-Freeze-Tag.
    • Pin trace sampling to 100% for the tenant.
  3. Enable break-glass reviewers
    • Issue time-bound, dual-approved ephemeral grants limited to evidence-only scopes.
    • Emit BreakGlass.Granted event.
  4. Differential queries
    • Run server-side diffs: tenantId == victim vs. any leakage candidates.
    • Validate integrity segments for the time window; recompute roots.
  5. Evidence pack
    • Export: offending requests (redacted), query plans, policy versions & signatures, integrity proofs, KMS key ids (no secrets), trace bundle.
    • Store under tenants/<id>/incidents/<ticket>/evidence/… with signed manifest.
  6. Remediate
    • Patch guard rules (e.g., stricter allow-list, planner checks).
    • Add regression tests; enable shadow policy to validate.
  7. Unfreeze & monitor
    • Gradually reopen read → write → export with elevated logging.
    • Configure short-term heightened alerts.
  8. Post-mortem & governance
    • File ADR with root cause, corrective actions, test references.
    • Emit Incident.Closed and attach the final evidence manifest.

Operator toggles (pseudocode)

await GuardApi.SetAsync(new GuardToggle {
  TenantId = "splootvets",
  DisableWrites = true,
  DisableReads = false,
  DisableExports = true,
  Reason = "CrossTenantSuspect#INC-2025-1031",
  Ttl = TimeSpan.FromHours(2)
});

Backup / Restore / eDiscovery (Tenant-Scoped)

Backups

  • Labeling: every snapshot labeled with tenantId, dataSilo, segmentEpoch, policyVersion.
  • Cadence: stream-tier dependent (e.g., audit.security daily + weekly full); verify checksums and segment roots post-backup.
  • Encryption: backups encrypted under tenant KEK (or dedicated backup KEK chained to tenant).

Restore (sandbox-first)

  • Request: operator specifies tenantId, time range, and target sandbox environment.
  • Isolation: restore into sandbox/<tenantId>/<timestamp> with read-only flags and private alias.
  • Verification:
    • Recompute integrity roots; compare with manifest.
    • Run sample queries; validate policy enforcement and masking profiles.
  • Promote (if needed): use controlled cutover (alias swap) with rollback point; record Restore.Performed event.

eDiscovery / DSAR

  • Workflow:
    1. Create case with subject filters (ids/emails/time range).
    2. Apply masking and purpose-limited profile.
    3. Generate per-tenant export (Parquet/NDJSON) with signed manifest and provenance.
    4. Route to review lane; require approval before release.
  • Legal hold:
    • Immutable, scoped at tenant/stream/predicate.
    • legal hold > dsar delete > retention precedence.
    • Holds pause TTL/ILM at the index/prefix level.
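The precedence rule (legal hold > DSAR delete > retention) reduces to a small decision function. A sketch with an illustrative `Controls` shape:

```typescript
// Active controls on a record; field names are illustrative.
type Controls = { legalHold: boolean; dsarDelete: boolean; retentionExpired: boolean };
type Disposition = "Retain" | "Delete";

export function disposition(c: Controls): Disposition {
  if (c.legalHold) return "Retain";   // holds trump everything, including DSAR deletes
  if (c.dsarDelete) return "Delete";  // DSAR delete beats normal retention
  return c.retentionExpired ? "Delete" : "Retain";
}
```

Keeping the rule as a pure function makes the precedence directly unit-testable in CI alongside the policy suites.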

Restore command (conceptual)

atpctl restore start \
  --tenant splootvets \
  --from 2025-10-01T00:00:00Z --to 2025-10-31T23:59:59Z \
  --target-sandbox us-sbx-01 \
  --verify-integrity --read-only

Runbook Acceptance Criteria

  • A new tenant can be onboarded end-to-end (including smoke, dashboards, and evidence) in ≤ 30 minutes.
  • Offboarding completes with export-before-delete honored, and no residual cross-tenant references.
  • Incident response can freeze a tenant’s risky paths within 10 minutes, and produce a first evidence pack within 60 minutes.
  • Backup → sandbox restore → integrity verification is repeatable and yields a signed verification report.
  • All operational actions emit ComplianceEvents linked by correlationId and are tenant-scoped.

Invariants

  • All operational procedures act within the tenant boundary and respect residency tags.
  • Evidence (manifests, integrity roots, policy versions, key ids) is signed and stored under the tenant’s prefix.
  • Break-glass access is time-bound, dual-controlled, and mirrored to a separate evidence stream.

Limits, Quotas, and Fairness

Per-tenant quotas

Goal: guarantee predictable performance and cost control by enforcing per-tenant limits at ingest, query, and export.

  • Throughput
    • Ingest QPS (requests/s) per tenant and per route (e.g., /v1/audit/records, /v1/exports).
    • Burst: token bucket with burst = multiplier × steady_rate (tier-dependent).
  • Payload size
    • Max payload bytes per record (hard cap) and rolling bytes/min per tenant (soft cap → throttle).
  • Storage
    • Bytes/day growth quota per tenant/tier; auto-alerts when 80/90/100% thresholds are crossed.
  • Export concurrency
    • Concurrent jobs per tenant and per region; distinct pools for ad-hoc vs DSAR/legal-hold.
  • Indexing/refresh
    • Max index refresh rate per tenant in shared indices; high-assurance tenants on dedicated indices bypass shared caps.

Tier blueprint (example)

| Tier | Ingest QPS (steady/burst) | Bytes/min | Storage/day | Exports (concurrent) | Notes |
| --- | --- | --- | --- | --- | --- |
| Bronze | 50 / 100 | 50 MB | 10 GB | 1 | Shared index only |
| Silver | 150 / 300 | 150 MB | 50 GB | 2 | Priority 2 |
| Gold | 500 / 1000 | 500 MB | 250 GB | 4 | Priority 1 |
| Enterprise | Contractual | Contract | Contract | Contract | Dedicated index optional |

Quotas are region-scoped (per dataSilo) and enforced hierarchically: global → region → shard → tenant.


Protection against noisy neighbors

  • Backpressure
    • HTTP 429 with Retry-After on live endpoints; SDKs apply exponential backoff with jitter.
    • Queue deferral for async paths (ingest buffers, projectors) using per-tenant partitions; overflow → tenant DLQ.
  • Prioritization
    • Weighted Fair Queuing (WFQ) per route with weights derived from tier and credit balance.
    • Separate pools: realtime (higher weight) vs backfill (lower weight, preemption allowed).
  • Cost-aware throttling
    • Each tenant has a credit budget (cost units/min). Dispatchers reduce credits proportional to payload size and CPU/IO estimate.
    • When credits deplete, reduce effective QPS to the floor rate for the tier until the next refill tick.
  • Shard rebalancing
    • Hot tenants trigger auto-split (time-bucketing or routing salt) and HPA/KEDA scale-outs with per-tenant backpressure maintained during warmup.
  • Circuit breakers (per tenant)
    • Trip on sustained p95 > SLO or DLQ growth; escalate to operator and throttle exports first to preserve ingest SLAs.

flowchart LR
Req[Incoming request] --> TB["Token Bucket (tenant/tier)"]
TB -->|tokens ok| WFQ[WFQ Scheduler]
TB -->|depleted| Rej["HTTP 429 + Retry-After"]
WFQ --> Exec[Route Handler]
Exec --> BP{Shard busy?}
BP -->|yes| Def["Queue Defer (tenant partition)"]
BP -->|no| Ok[Process]
Hold "Alt" / "Option" to enable pan & zoom

Alerting & SLOs (per tier)

  • Golden signals (per tenant)
    • ingest_p95_latency (ms)
    • projector_lag_seconds
    • tenant_dlq_depth
    • export_lead_time_seconds (request→first byte)
    • quota_utilization (throughput, bytes/min, storage/day)

SLO targets (illustrative)

| Metric | Bronze | Silver | Gold | Enterprise |
| --- | --- | --- | --- | --- |
| Ingest p95 latency (ms) | ≤ 350 | ≤ 250 | ≤ 150 | Contract |
| Projector lag (steady) | ≤ 120s | ≤ 60s | ≤ 20s | Contract |
| Export lead time (P50, 1 GB) | ≤ 900s | ≤ 600s | ≤ 300s | Contract |
| DLQ depth (sustained) | 0 | 0 | 0 | 0 |

Alert thresholds (PromQL-style examples)

# p95 ingest latency breach for 5m
histogram_quantile(0.95, sum(rate(atp_ingest_latency_ms_bucket{tenantId="$TID"}[5m])) by (le))
  > tier_slo_ms{tenantId="$TID"}

# DLQ growing for 10m
increase(atp_tenant_dlq_depth{tenantId="$TID"}[10m]) > 0

# Quota nearing exhaustion (bytes/min > 90% for 3m)
rate(atp_bytes_ingested_total{tenantId="$TID"}[3m]) > 0.9 * tier_bytes_per_min{tenantId="$TID"}

Operator actions on alerts

  • Latency breach → enable throttling of backfill pool; increase projector replicas for tenant shard; evaluate index rollover.
  • DLQ growth → pause non-critical exports; enable quarantine lane for violating producers; notify integrator contacts.
  • Quota exhaustion → return 429 with IETF RateLimit header fields (draft-ietf-httpapi-ratelimit-headers: RateLimit-Limit, RateLimit-Remaining, RateLimit-Reset) and raise proactive ticket.

Enforcement sketch (pseudocode)

var q = Quotas.For(tid, tier);
var tb = TokenBucket.For(tid, route: "ingest");
if (!tb.TryConsume(costUnits: Cost.Estimate(req))) 
    return RateLimitProblem(retryAfter: tb.RetryAfter);

var weight = Priority.For(tier, pool: req.Intent == "backfill" ? "backfill" : "realtime");
await WfqScheduler.EnqueueAsync(tid, weight, req);

using var _ = Metrics.Scope(tid).TrackLatency("ingest");
var res = await Next(req);

Credits.Debit(tid, Cost.Actual(res));
if (Credits.Balance(tid) < 0) tb.SetFloorRate(q.FloorQps);

return res;

Export scheduler (fair-share)

  • Global pool split into N slots per region; each tenant receives max(1, floor(weight / totalWeight × N)) slots, with work-stealing when idle.
  • Jobs carry size estimates; large jobs may be chunked and interleaved to avoid starvation.
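Slot allocation might be sketched as a weight-proportional split with a one-slot floor per active tenant; names are illustrative:

```typescript
// Fair-share export slots per region: weight-proportional allocation with a
// floor of one slot per active tenant (illustrative sketch).
export function allocateSlots(
  weights: Record<string, number>,
  totalSlots: number
): Record<string, number> {
  const total = Object.values(weights).reduce((a, b) => a + b, 0);
  const out: Record<string, number> = {};
  for (const [tenant, w] of Object.entries(weights)) {
    out[tenant] = Math.max(1, Math.floor((w / total) * totalSlots));
  }
  return out; // idle slots are redistributed by work-stealing at runtime
}
```

The floor prevents low-weight tenants from starving; over-allocation relative to the pool is absorbed by the runtime work-stealing described above.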

Job manifest (excerpt)

{
  "tenantId":"splootvets",
  "intent":"dsar",
  "estimatedBytes": 1073741824,
  "priority": 90, 
  "chunks": 8
}

Error taxonomy (Problem+JSON)

| type | HTTP | Meaning / Client Action |
| --- | --- | --- |
| …/rate-limited | 429 | Honor Retry-After / RateLimit-* headers |
| …/quota-storage-exceeded | 409 | Reduce retention or request quota increase |
| …/export-concurrency-exceeded | 409 | Wait for slots or lower concurrency |
| …/backfill-throttled | 429 | Reschedule with backoff; split into smaller jobs |

Invariants

  • Quotas are per tenant, per region, and enforced hierarchically without violating tenant isolation.
  • Backpressure never writes plaintext or bypasses integrity/tenancy guards; it defers or rejects safely.
  • Fairness honors tier weights while preventing starvation via floor rates and work-stealing.
  • All throttling and quota decisions are audited with {tenantId, route, decision, basis, correlationId}.

Testing & Verification

Goals

  • Prove that tenant context is mandatory and enforced end-to-end.
  • Detect regressions via contract tests, chaos experiments, and continuous policy verification before they hit production.
  • Produce auditable evidence (logs, traces, manifests) for every guard decision.

Contract & conformance tests

What we verify

  • SDK → Gateway propagation: headers/claims/baggage set, normalized, and equal to body tenantId.
  • Idempotency scope: {tenantId, idempotencyKey} uniqueness; duplicate with equal payload returns same result; mismatch ⇒ 409.
  • Mandatory predicates: server injects tenantId = :ctx into queries; cross-tenant joins rejected.
  • Masking & policy pinning: policyVersion stamped on writes; reads apply correct masking profile.

MSTest (C#) — example

[TestMethod]
public async Task Write_MissingTenantHeader_ShouldReturnMissingTenantProblem()
{
    var rec = Fixtures.ValidRecord with { TenantId = null! };
    var (status, problem) = await Client.PostAsync("/v1/audit/records", rec, headers: h => h.Remove("X-Tenant-Id"));
    Assert.AreEqual(HttpStatusCode.BadRequest, status);
    Assert.AreEqual("https://atp.connectsoft.dev/errors/missing-tenant", problem.Type);
}

[TestMethod]
public async Task Query_ShouldInjectTenantPredicate_AndDisallowCrossTenant()
{
    var me = TestContext.Get("tenantId");
    var q = new { filters = new { resource = new { type = "Appointment" } } };
    var plan = await Client.PostExplainAsync("/v1/query/plan", q, headers: h => h["X-Tenant-Id"] = me);
    StringAssert.Contains(plan.Sql, "WHERE tenant_id = :ctx_tenant");
    Assert.IsFalse(plan.DetectedTenants.Any(t => t != me));
}

SpecFlow (Gherkin) — parity & idempotency

Feature: Tenancy parity & idempotency
  Scenario: Body/header tenant mismatch is quarantined
    Given a valid JWT with tenant "splootvets"
    And an AuditRecord with tenantId "other-tenant"
    When I POST to /v1/audit/records
    Then the response status should be 202
    And the problem type should be "…/tenant-mismatch"
    And an evidenceRef should be returned

  Scenario: Idempotent retry returns the same result
    Given a valid AuditRecord with idempotencyKey "tid:splootvets|ulid:01ABC…"
    When I POST it twice
    Then both responses have the same recordId and segmentId

Static/lint checks

  • Lint server PRs for unsafe queries lacking tenantId predicate.
  • Validate Problem+JSON types remain stable (contract tests).

Chaos & fault injection

Scenarios

  • Claim tampering: JWT with forged tenant/tid vs header X-Tenant-Id; expect reject/quarantine and GuardDecision.
  • Missing headers: no X-Tenant-Id/no traceparent; expect 400/401 and evidence log.
  • Stale policy cache: freeze policy registry; service must tag responses with basis:"Cached" and raise alert.
  • KMS rotation lag/unavailable: simulate Wrap/Unwrap failures → buffer writes (no plaintext), fail closed on reads with KeyUnavailable.
  • Residency violation: export to foreign region → 409 ResidencyConflict.
  • Noisy neighbor: spike one tenant to trip token bucket; assert 429 for that tenant only; others unaffected.

Fault toggles (pseudocode)

await Chaos.EnableAsync(new ChaosSpec {
  TenantId = "splootvets",
  Faults = new[] {
    Fault.JwtClaimOverride("tenant", "evil-corp"),
    Fault.PolicyRegistry.StaleFor(TimeSpan.FromMinutes(10)),
    Fault.KmsUnavailable(region: "us", probability: 0.3),
    Fault.RateLimiter.ConsumeAllTokens(route: "ingest")
  }
});

Success criteria

  • Guard outcomes match the decision table; no cross-tenant leakage under faults.
  • All faults generate structured evidence (GuardDecision, ComplianceEvent, enriched traces).

Continuous verification

OPA/Rego unit suites

  • Compile policy bundles; run unit tests per tenant offline in CI.
  • Validate allow/deny/masking decisions under known fixtures.
package test.atp.tenancy

import data.atp.tenancy as p

test_allow_write_when_tenant_present {
  req := {"tenantId":"splootvets","operation":"Write","context":{"roles":["Audit.Writer"]}}
  p.allow with input as req
}

Canary policies

  • Deploy next policy version in shadow mode:
    • Evaluate both {current, next} against live traffic.
    • Log diffs when decisions diverge; block rollout if divergence > threshold.
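The shadow-mode comparison above can be sketched as follows; the decision values, request shapes, and 1% threshold are illustrative assumptions, not ATP's actual policy interface:

```python
def shadow_compare(requests, current_policy, next_policy, block_threshold=0.01):
    """Evaluate both policy versions on the same traffic; log diffs and gate rollout."""
    diffs = [{"request": r, "current": current_policy(r), "next": next_policy(r)}
             for r in requests if current_policy(r) != next_policy(r)]
    divergence = len(diffs) / max(len(requests), 1)
    return {"divergence": divergence, "diffs": diffs,
            "block_rollout": divergence > block_threshold}

# hypothetical current/next policies for demonstration
current = lambda r: "allow" if r.get("tenantId") else "deny"
candidate = lambda r: "allow" if r.get("tenantId") and r.get("roles") else "deny"
traffic = [{"tenantId": "t1", "roles": ["Audit.Writer"]}, {"tenantId": "t1"}]
report = shadow_compare(traffic, current, candidate)
assert report["divergence"] == 0.5 and report["block_rollout"]
```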

Shadow reads across projections

  • Periodically re-read a sample of records via:
    (1) the canonical store (segment replay) and (2) the projected indices.
  • Compare field sets, masking, counts, and integrity; drift ⇒ alert.
flowchart LR
Sampler-->ReadRaw["Read from segments"]
Sampler-->ReadProj["Read from index"]
ReadRaw-->Diff
ReadProj-->Diff
Diff-->Alert{Mismatch?}
Alert-->|yes| CreateIncident
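The diff step in the flow above amounts to comparing record counts, field sets, and identity between the two read paths; a minimal comparator sketch (record shapes are illustrative):

```python
def shadow_read_diff(canonical, projected):
    """Compare a sample of records read from segments vs. the projected index."""
    issues = []
    if len(canonical) != len(projected):
        issues.append(f"count mismatch: {len(canonical)} vs {len(projected)}")
    proj_by_id = {r["recordId"]: r for r in projected}
    for rec in canonical:
        other = proj_by_id.get(rec["recordId"])
        if other is None:
            issues.append(f"missing in projection: {rec['recordId']}")
        elif set(rec) != set(other):
            issues.append(f"field-set drift: {rec['recordId']}")
    return issues  # non-empty => raise the mismatch alert

segments = [{"recordId": "r1", "action": "read", "actor": "a"}]
index = [{"recordId": "r1", "action": "read"}]  # projection dropped a field
assert shadow_read_diff(segments, index) == ["field-set drift: r1"]
```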

Synthetic tenants & canaries

  • Maintain at least one synthetic tenant per region/tier:
    • Hourly ingest→project→query→export end-to-end with known fixtures.
    • Verify SLOs and integrity roots; publish a “green badge” to ops dashboard.

Nightly evidence pack

  • For each active tenant, produce a small evidence bundle (manifests, chain roots, policy versions) and validate signatures.

CI/CD gates

  • Block deploy if:
    • Tenancy contract tests fail,
    • Policy signature check fails,
    • Shadow-policy divergence > X%,
    • Synthetic flow SLOs breached in the last 30m.

Telemetry checks (PromQL-like)

# Guard decisions must exist for all rejects/quarantines
sum(rate(atp_guard_decisions_total{tenantId!=""}[5m])) by (decision)

# No cross-tenant joins
sum(rate(atp_query_cross_tenant_detected_total[5m])) by (tenantId) == 0

# Shadow diff rate below threshold
rate(atp_shadow_diff_total[15m]) < 0.01

Artifacts & evidence

  • Test fixtures: canonical AuditRecord sets per stream (PII/PHI variants).
  • Golden manifests: signed reference manifests for export verification.
  • Trace bundles: sampled traces with tenantId, policyVersion, GuardDecision.

Invariants

  • Every test and synthetic flow is tenant-scoped and respects residency.
  • Faults never bypass encryption or tenancy guards; they defer or reject safely.
  • Verification results are auditable and link to CI runs, policy versions, and integrity roots.

Compliance Mapping

GDPR

  • Data minimization & purpose limitation
    • Write-time redaction applies policy-defined rules before persistence, ensuring only minimized data is stored.
    • Purpose-bound exports: export profiles require an explicit purpose; manifests record it for later review.
  • Data subject rights (DSAR)
    • Tenant-scoped discovery filters (subject identifiers, time windows).
    • Standardized per-tenant export bundles (Parquet/NDJSON) with signed manifest and integrity proofs.
    • Reviewer approval lane and evidence pack (who requested, purpose, policy version).
  • Residency & transfers
    • Residency-aware routing honors dataSilo at ingest/query/export; cross-region access is blocked unless break-glass is granted and logged.
    • Per-tenant KEKs are region-local; no key material leaves the silo.
  • Erasure & retention
    • Retention policies per stream; legal hold supersedes deletion.
    • Cryptographic shred via KEK revocation where legally permissible and technically safe.
  • Records of processing
    • Policy bundles are signed & versioned; changes emit Policy.Updated with diff, approvers, and ADR link.
    • Each guard decision produces a machine-readable record (GuardDecision, ComplianceEvent) for audit trails.

Evidence produced

  • Signed policy bundle (tenantId, version, residency/retention/redaction rules).
  • DSAR export manifest + integrity proofs + approval trail.
  • Residency routing logs (region, alias/index used) and key IDs proving per-silo encryption.

HIPAA

  • Access control (ABAC/RBAC)
    • JWT-derived roles/scopes and attributes (e.g., org unit, edition) govern read masks and export permissions.
    • Break-glass access: dual approval, short TTL, least privilege, and explicit allowedTenants; all mirrored in a separate evidence stream.
  • Audit controls
    • Immutable audit records with tenant-scoped integrity segments (hash chains/Merkle roots) and signed segment seals.
    • GuardDecision logs for every allow/reject/quarantine, linked to correlation IDs and policy versions.
  • Integrity & transmission security
    • Envelope encryption: DEK per segment, wrapped by tenant KEK; segment root signed with a separate integrity key.
    • All service-to-service calls use mTLS; traceparent/baggage carry tenantId in-band for verifiability (not for auth).
  • Minimum necessary & masking
    • Read-time masking profiles per role/intent; investigators vs reader views enforced by policy.
    • Export profiles remove or pseudonymize PHI fields as configured.
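The role-based read masks above can be sketched as a simple field-visibility map; the profile contents and the `***` marker are illustrative assumptions:

```python
MASKING_PROFILES = {
    # hypothetical profiles: which fields each role may see in the clear
    "Audit.Reader":       {"recordId", "action", "ts"},
    "Audit.Investigator": {"recordId", "action", "ts", "subjectId", "details"},
}

def apply_read_mask(record, role):
    """Return the record with fields outside the role's profile replaced by a redaction marker."""
    visible = MASKING_PROFILES.get(role, set())  # unknown role sees nothing in the clear
    return {k: (v if k in visible else "***") for k, v in record.items()}

rec = {"recordId": "r1", "action": "read", "ts": "2025-10-28", "subjectId": "patient-9"}
assert apply_read_mask(rec, "Audit.Reader")["subjectId"] == "***"
assert apply_read_mask(rec, "Audit.Investigator")["subjectId"] == "patient-9"
```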

Evidence produced

  • Sampling of segment seals and verification report (recomputed roots).
  • Access review: who saw what, under which role/mask, with correlation to ticket/purpose.
  • KEK configuration (IDs, rotation cadence, dual-control proof) and mTLS posture.

SOC 2 / ISO 27001

  • Logical separation
    • Tenant isolation at data plane (partitioning, RLS), control plane (namespaced policies, quotas), and observability (tenant-labeled metrics/logs/traces; cross-tenant views are admin-only).
  • Key management
    • Per-tenant KEKs with rotation, dual control, soft-delete windows, and continuous health checks (wrap/unwrap canaries).
    • Optional BYOK/CSEK with documented responsibilities and availability monitoring.
  • Change management & SDLC
    • Policy-as-code in VCS with reviews, ADRs, and signed releases.
    • CI/CD gates: tenancy contract tests, chaos checks (KMS/policy staleness), shadow-policy divergence thresholds.
  • Monitoring & alerting
    • Golden signals per tenant/tier (ingest p95, projector lag, DLQ depth, export lead time).
    • Rate limit & quota telemetry with IETF RateLimit response headers; alerts on threshold breaches.
  • Incident response & evidence
    • Freeze → snapshot → evidence pack → remediate → unfreeze runbook with timeline and decision log.
    • All operational toggles emit ComplianceEvent with who/why/when.

Evidence produced

  • Change log with linked PRs/ADRs, policy diffs, and deployment hashes.
  • Rate-limit/quotas configuration and alert history per tenant.
  • Incident evidence bundles (query plans, manifests, chain proofs, approvals).

Traceability matrix (excerpt)

Requirement Mechanism/Control Evidence Artifacts
Logical tenant separation RLS predicates; per-tenant partitioning & routing Query plans with injected tenantId; index alias configs
Data minimization (GDPR) Write-time redaction; masking profiles Policy bundle (redaction); sample records before/after; redaction logs
Residency enforcement dataSilo routing; per-silo KEKs Routing logs; KEK IDs per region; export manifest region fields
Access control & least privilege (HIPAA/SOC2) RBAC/ABAC; break-glass with dual approval JWT claim mapping; grant logs; BreakGlass.Granted/Revoked events
Integrity (HIPAA, SOC2) Tenant hash chains; signed segment roots Verification report; seal signatures & key IDs
Key management (SOC2/ISO) Per-tenant KEKs; rotation; dual control KMS audit logs; key metadata (tags, cadence, last-rotated)
Change control (SOC2/ISO) Policy-as-code; signed versions; CI/CD gates ADR links; policy signatures; CI runs; shadow-policy diff reports
DSAR & eDiscovery (GDPR) Tenant-scoped exports; reviewer lane DSAR request ticket; signed export manifest; approval trail
Incident handling (all) Freeze/snapshot/evidence runbook Incident timeline; GuardDecision series; evidence bundle manifests

Auditor playbook (how to demonstrate quickly)

  1. Show isolation: run a query as tenant A and prove the server-injected predicate and absence of cross-tenant results (plan + result).
  2. Verify integrity: pick a segment, recompute root, verify signature and key ID; compare with stored manifest.
  3. Prove residency: execute an export; show object prefix in the tenant’s region + per-silo KEK in the manifest.
  4. Access controls: present a read under Audit.Reader vs Audit.Investigator and show differing masking outcomes.
  5. Change control: open the policy PR with signatures, ADR link, CI checks, and shadow-policy diff report.
  6. DSAR demo: run a subject export end-to-end, show approval record, and validate integrity of the bundle.
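Step 2 of the playbook, recomputing a segment root, reduces to replaying the chain and comparing against the stored manifest. ATP's exact construction (Merkle vs. linear chain) is defined elsewhere; this linear-chain sketch shows only the recompute-and-compare step:

```python
import hashlib

def chain_root(payload_hashes):
    """Recompute a linear hash chain: each link hashes the previous link plus the next payload hash."""
    link = "0" * 64  # genesis value for an empty segment
    for h in payload_hashes:
        link = hashlib.sha256((link + h).encode()).hexdigest()
    return link

hashes = [hashlib.sha256(p).hexdigest() for p in (b"evt-1", b"evt-2", b"evt-3")]
root = chain_root(hashes)
assert root == chain_root(hashes)        # recomputation is deterministic
assert root != chain_root(hashes[::-1])  # any reorder or tamper changes the root
```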

Invariants

  • All compliance evidence is tenant-scoped, signed, and replayable (manifests, proofs, policy versions).
  • Residency, access control, and masking are policy-driven and deterministic per policyVersion.
  • No cross-tenant data or keys are required to verify a tenant’s integrity or exports.

Migration & Evolution

Isolation tightening (shared → dedicated) with zero downtime

When to move: hot tenants, regulatory posture change, or noisy-neighbor mitigation.

Pattern: shadow topology → dual-write → backfill → validate → alias cutover → retire.

  1. Plan & prepare
    • Open ADR with scope, target indices/buckets/tables, residency constraints, and rollback plan.
    • Add feature flags:
      • iso.shadow.enabled (enable dual-write),
      • iso.read.preferShadow,
      • iso.cutover.commit.
  2. Shadow build
    • Create per-tenant destination (e.g., atp-audit-{tenant}-{region}-{yyyyMM}) with desired mappings/ILM.
    • Warm policy cache and KMS keys for the destination.
  3. Dual-write
    • Ingestion writes to both shared and dedicated stores.
    • Stamp records with identical recordId, segmentId, policyVersion, and per-store integrity.
  4. Historical backfill
    • Reproject historical segments by tenant/time window into the destination.
    • Throttle by tenant credits; use seek cursors to avoid offset drift.
  5. Validation
    • Compare counts, min/max timestamps, sampled payload hashes, and segment root parity.
    • Run shadow reads (serve from destination but verify against source).
  6. Cutover
    • Flip read alias to destination; keep source as hot-standby for N hours.
    • Enable iso.cutover.commit to stop reading from shared; keep dual-write until stand-down.
  7. Retire & clean up
    • Disable dual-write; seal final segment in shared; freeze for retention window or ILM purge.
    • Update runbooks/dashboards to target destination.
flowchart LR
Ingest-->Shared
Ingest-->Dedicated["Dedicated (shadow)"]
Backfill["Backfill (historical)"]-->Dedicated
Validate["Counts + roots parity"]-->Cutover["Alias cutover"]
Cutover-->Retire["Retire shared (freeze/ILM)"]

Rollback: If validation fails or SLOs regress, flip alias back to shared; keep dual-write active until resolved.


Tenant split / merge

Split (one → many)

  • Issue new ids: TenantId' ∈ {T1', T2', …}; keep mapping table with effective timestamp.
  • Re-key rules
    • tenantId := map(oldTid) by predicate (e.g., workspace/project, resource.id prefix).
    • Idempotency scope changes to {newTenantId, key}; maintain a translation registry for dedup during transition.
  • Process
    1. Create mapping {oldTid → [newTid…]} with immutable version.
    2. Dual-write: new events routed to newTid based on routing rule; legacy still to oldTid until producers updated.
    3. Backfill historical data by applying mapping predicates; seal new segments per newTid.
    4. Publish Tenant.Remapped events; update SDK configs.
    5. Decommission oldTid after grace window; keep read-only alias for audit.

Merge (many → one)

  • Survivor id: choose TenantId*; others map {oldTid → TenantId*}.
  • Re-index with survivor as partition key; maintain idempotency redirect for old keys until TTL expires.
  • Emit Tenant.Merged with evidence of counts and segment continuity.

Mapping table (schema)

CREATE TABLE tenant_id_map (
  old_tenant_id   text primary key,
  new_tenant_ids  jsonb not null,      -- ["tA","tB"] for split, or ["t*"] for merge
  effective_from  timestamptz not null,
  reason          text,
  adr_link        text,
  version         int not null,
  etag            text not null
);

Read-path compatibility (pseudocode)

var tid = ctx.TenantId;
var map = TenantMap.Resolve(tid, at: now);
var targetTids = map?.NewTenantIds ?? new[] { tid };

query = query.WithMandatoryTenants(targetTids); // IN (...)
results = MergeAndDeDup(results, key: r => (r.RecordId, r.CreatedAt)); // stable merge
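The same resolution logic as a runnable sketch; the in-memory `TENANT_MAP` mirrors the `tenant_id_map` table, and the record shapes are illustrative:

```python
from datetime import datetime, timezone

# illustrative in-memory mirror of the tenant_id_map table
TENANT_MAP = {
    "old-tenant": {"new_tenant_ids": ["tA", "tB"],
                   "effective_from": datetime(2025, 10, 1, tzinfo=timezone.utc)},
}

def resolve_target_tenants(tenant_id, at):
    """Map a legacy tenant id to its split/merge successors once the mapping is effective."""
    entry = TENANT_MAP.get(tenant_id)
    if entry and at >= entry["effective_from"]:
        return entry["new_tenant_ids"]
    return [tenant_id]  # unmapped tenants read only their own partition

now = datetime(2025, 11, 1, tzinfo=timezone.utc)
assert resolve_target_tenants("old-tenant", now) == ["tA", "tB"]
assert resolve_target_tenants("unmapped", now) == ["unmapped"]
```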

Export & evidence

  • Exports during the window include mapping manifest to explain lineage.
  • Chain proofs reference pre/post segment roots; provide a bridge proof for auditors.

Residency moves (region/silo migration)

  • Constraints: no cross-region reads unless break-glass; data must be re-encrypted under destination KEK.
  • Process: shadow region → dual-write in dest → backfill by time window → verify → cutover DNS/alias → retire source.
  • Key handling: wrap DEKs under destination KEK; keep source KEK until retirement; record key lineage in evidence pack.

Header/claim deprecations

Goal: evolve identity mapping additively without breaking integrators.

  • Accept old+new claim/header forms during a grace window:
    • Old: tid, tenant; New: custom:tenantId, x-tenant-id.
  • Gateway mapping profile
    • Priority list: custom:tenantId → tenant_id → tenant → tid.
    • Emit a Deprecation warning (Warning: 299 atp "claim 'tid' deprecated; use 'custom:tenantId' by 2026-06-30").
  • SDK linting
    • Build-time analyzers flag usage of deprecated fields.
    • CI fails if new code introduces old keys after cutoff date.
  • Events & docs
    • Publish Policy.DeprecationAnnounced and Policy.DeprecationEnforced with dates; link ADR and migration guide.

Gateway normalization (sketch)

string ResolveTenantId(ClaimsPrincipal p, IHeaderDictionary h) =>
    p.Get("custom:tenantId")
 ?? p.Get("tenant_id")
 ?? p.Get("tenant")
 ?? p.Get("tid")
 ?? h["X-Tenant-Id"]
 ?? throw new MissingTenantException();

Compatibility windows & flags

  • Windows: typical 90–180 days; longer for critical integrators.
  • Flags: producers can opt-in early (useNewTenantId=true) and receive stricter guardrails.
  • Shadow mode: run both mappings and compare guard decisions; alert on divergence > threshold.

Validation & evidence

  • Parity reports: counts and integrity roots for source vs destination per time bucket.
  • Divergence dashboards: show % differences, top actions/resources affected.
  • Signed migration manifest: tenant ids, time ranges, key lineage, policy versions, and ADR hash.

Invariants

  • Migrations are tenant-scoped, reproducible, and reversible until final commit.
  • Dual-write/backfill never bypass tenancy guards, encryption, or integrity stamping.
  • Header/claim changes are additive first; removals only after a documented grace window with telemetry proving safe adoption.

Tenant Metadata Model

Identifiers

  • TenantId — opaque, stable, URL-safe; max 128 chars; no embedded business meaning.
  • DataSiloId (optional) — residency/sovereignty placement (e.g., us, eu-we).
  • ExternalRef (optional) — upstream system handle(s) (CRM/billing/IdP), stored as a typed list (e.g., {system:"okta", id:"00o1…"}).
  • Slug (optional, display/UX) — human-friendly alias; never used for authorization.

Rules

  • TenantId is immutable after Active; splits/merges use a mapping table (see Migration & Evolution).
  • Compare identifiers case-insensitively; preserve original casing for display.

Attributes

  • Names: legalName, displayName
  • Contacts: ownerEmail, securityEmail, billingEmail
  • Residency: dataSilo (single primary); optional allowedSilos[] for read replicas
  • Edition/Entitlements: edition (e.g., gold, enterprise), optional features[]
  • Tags: low-cardinality labels (tier, vertical, region, costCenter)
  • Lifecycle: Provisioning | Active | Suspended | PendingDeletion | Deleted
  • Policy Pointers: current policyVersion, retentionProfile, maskingProfile
  • Keys: references to KEK/signing key ids (no secrets stored here)
  • Quotas: defaults per edition, overridable per tenant (ingest QPS, bytes/day, export concurrency)

Registry responsibilities

  • System of Record for tenant metadata used by ATP routers/guards.
  • Event source: emits Tenant.Created|Updated|Suspended|Reinstated|PendingDeletion|Deleted|Remapped|Merged.
  • Cache discipline: strong ETag/version semantics; consumers must use If-None-Match and handle 304.
  • Warmup webhooks: notify ATP services & SDK config endpoints on change for cache refresh.
  • Validation & governance: enforce identifier rules, residency compatibility, edition/feature matrix, and contact requirements.

Shape (JSON preview)

{
  "tenantId": "splootvets",
  "displayName": "Sploot Veterinary Care",
  "legalName": "Sploot Veterinary Care, Inc.",
  "dataSilo": "us",
  "allowedSilos": ["us"],
  "edition": "enterprise",
  "features": ["byok", "advanced-exports"],
  "tags": { "tier": "gold", "vertical": "healthcare" },
  "contacts": {
    "ownerEmail": "owner@splootvets.com",
    "securityEmail": "secops@splootvets.com",
    "billingEmail": "ap@splootvets.com"
  },
  "policyVersion": "tenant:splootvets@42",
  "keyRefs": {
    "kek": "kms://us/tenants/splootvets/kek/v7",
    "signing": "kms://us/tenants/splootvets/signing/v3"
  },
  "quotas": { "ingestQps": 500, "bytesPerDay": 268435456000, "exportConcurrency": 4 },
  "lifecycle": "Active",
  "externalRefs": [{ "system": "okta", "id": "00o1abc23" }],
  "version": 7,
  "updatedAt": "2025-10-28T08:30:00Z",
  "etag": "W/\"ten-splootvets-v7\""
}

API & caching (registry)

  • GET /tenants/{tenantId}
    • Returns JSON + ETag. Clients should call with If-None-Match and accept 304 Not Modified.
  • POST /tenants (provision) & PATCH /tenants/{id} (partial update)
    • Require preconditions via If-Match: <ETag> to guard against lost updates.
  • WATCH /tenants/stream
    • Server-sent events (SSE) or webhook subscription; payload includes {tenantId, version, etag, changeSet}.

HTTP example

GET /tenants/splootvets
If-None-Match: W/"ten-splootvets-v6"
→ 200 OK (ETag: W/"ten-splootvets-v7") or 304 Not Modified
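The client side of this exchange can be sketched as a small cache keyed by ETag; the `server` stub below stands in for the registry endpoint and its shapes are illustrative:

```python
def fetch_tenant(client_cache, server, tenant_id):
    """Conditional GET: send If-None-Match; on 304 reuse the cached record."""
    cached = client_cache.get(tenant_id)
    headers = {"If-None-Match": cached["etag"]} if cached else {}
    status, etag, body = server(tenant_id, headers)
    if status == 304:
        return cached["body"]  # registry unchanged; no stale-decision risk
    client_cache[tenant_id] = {"etag": etag, "body": body}
    return body

def server(tenant_id, headers, _etag='W/"ten-splootvets-v7"'):
    # illustrative registry endpoint: replies 304 when the client's ETag matches
    if headers.get("If-None-Match") == _etag:
        return 304, _etag, None
    return 200, _etag, {"tenantId": tenant_id, "version": 7}

cache = {}
first = fetch_tenant(cache, server, "splootvets")   # 200, fills the cache
second = fetch_tenant(cache, server, "splootvets")  # 304, served from cache
assert first == second == {"tenantId": "splootvets", "version": 7}
```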

Data model & constraints (relational sketch)

CREATE TABLE tenants (
  tenant_id     text PRIMARY KEY,
  display_name  text NOT NULL,
  legal_name    text,
  data_silo     text NOT NULL,
  edition       text NOT NULL,
  tags          jsonb NOT NULL DEFAULT '{}',
  contacts      jsonb NOT NULL,
  policy_version text NOT NULL,
  key_refs      jsonb NOT NULL,
  quotas        jsonb NOT NULL,
  lifecycle     text NOT NULL CHECK (lifecycle IN ('Provisioning','Active','Suspended','PendingDeletion','Deleted')),
  external_refs jsonb NOT NULL DEFAULT '[]',
  version       int  NOT NULL DEFAULT 1,
  updated_at    timestamptz NOT NULL DEFAULT now(),
  etag          text NOT NULL
);

CREATE UNIQUE INDEX ux_tenants_slug ON tenants ((lower((tags->>'slug')))) WHERE (tags ? 'slug');

Validation rules (pseudocode)

Ensure(IsOpaqueId(tenantId) && tenantId.Length <= 128);
Require(contacts.ownerEmail && contacts.securityEmail);
Ensure(IsValidSilo(dataSilo) && allowedSilos.Contains(dataSilo));
Guard(edition in Editions.Matrix && features ⊆ Editions[edition].Features);
If(lifecycle == "PendingDeletion") Require(flags.exportBeforeDeleteAcknowledged);

Events (examples)

{
  "type": "Tenant.Updated",
  "tenantId": "splootvets",
  "version": 8,
  "changeSet": ["edition:+advanced-exports", "quotas.bytesPerDay: 250GB→300GB"],
  "policyVersion": "tenant:splootvets@43",
  "correlationId": "3e1f…",
  "ts": "2025-10-28T08:35:18Z"
}

Security & privacy notes

  • No secrets in the registry; only key references/ids.
  • Access governed by admin-only roles; read-only scoped tokens may fetch their own tenant record.
  • PII in contacts limited to business emails; avoid personal data beyond necessity.

Invariants

  • The registry is the authoritative source for TenantId, residency, edition, quotas, and policy pointer.
  • All consumers cache by ETag/version and handle 304 to avoid stale decisions.
  • Any change in residency/edition/quotas triggers warmup webhooks and emits Tenant.* events for downstream reconciliation.

Tenant Lifecycle Automation

States & transitions

  • States: Provisioning → Active ⇄ Suspended; either may move to PendingDeletion → Deleted (reinstatement returns Suspended to Active).
  • Guards: transitions are policy-checked, tenant-scoped, and auditable; illegal transitions are rejected.
stateDiagram-v2
    [*] --> Provisioning
    Provisioning --> Active: onboard.ok
    Active --> Suspended: suspend.requested
    Suspended --> Active: reinstate.approved
    Active --> PendingDeletion: offboard.approved
    Suspended --> PendingDeletion: offboard.approved
    PendingDeletion --> Deleted: window.elapsed && holds==0 && exports.done
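The "illegal transitions are rejected" guard can be sketched as a lookup over the legal edges of the state diagram; the table below encodes the diagram's transitions and signals:

```python
# Legal transitions and their triggering signals, per the state diagram.
TRANSITIONS = {
    ("Provisioning", "Active"):        "onboard.ok",
    ("Active", "Suspended"):           "suspend.requested",
    ("Suspended", "Active"):           "reinstate.approved",
    ("Active", "PendingDeletion"):     "offboard.approved",
    ("Suspended", "PendingDeletion"):  "offboard.approved",
    ("PendingDeletion", "Deleted"):    "window.elapsed",
}

def guard_transition(current, target):
    """Return the required signal for a legal transition; raise for an illegal one."""
    signal = TRANSITIONS.get((current, target))
    if signal is None:
        raise ValueError(f"illegal transition {current} -> {target}")
    return signal

assert guard_transition("Active", "Suspended") == "suspend.requested"
try:
    guard_transition("Deleted", "Active")  # resurrection is never legal
except ValueError:
    pass
```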

Signals & side-effects

Every transition:

  • Appends a ComplianceEvent with {from,to,who,why,correlationId,policyVersion,evidenceRef}.
  • Updates caches (registry ETag/version) and pushes warmup webhooks to ATP services.
  • Refreshes quotas/guards (e.g., set Write=Denied on Suspended/PendingDeletion).
  • Emits operator notifications (email/Teams/Slack) to tenant contacts.

ComplianceEvent (example)

{
  "type": "Tenant.Transition",
  "tenantId": "splootvets",
  "from": "Active",
  "to": "Suspended",
  "reason": "billing:overdue",
  "who": "ops@connectsoft.dev",
  "correlationId": "c-01J9ZK…",
  "policyVersion": "tenant:splootvets@42",
  "ts": "2025-10-28T09:05:00Z"
}

Controls & safeguards

  • Export-before-delete toggle: operator must capture tenant decision; the platform enforces a final export window before deletion.
  • Holds precedence: legalHold > dsarDelete > retention. Hard delete only when no active holds, DSARs closed, and retention satisfied.
  • Residency respect: lifecycle automation runs in-silo (DataSiloId); no cross-region reads without break-glass.
  • Crypto-shred: finalization rotates/revokes tenant KEK versions as permitted; integrity evidence remains verifiable.

Idempotent provisioning

Goal: safe retries without duplicate tenants or side-effects.

Flow (Provision → Active)

  1. Upsert registry record (tenantId, dataSilo, edition, contacts, default quotas).
  2. Create per-tenant KEK & signing keys (idempotent by tags).
  3. Attach policy bundle; pin policyVersion; warm policy/router caches.
  4. Smoke test: ingest canary → project → query (masked) → mini-export.
  5. Emit Tenant.Created + transition to Active.

Pseudocode

public async Task OnboardAsync(TenantSpec spec) {
  await Registry.UpsertAsync(spec, idempotent: true);                    // ETag guarded
  await Kms.EnsureKeysAsync(spec.TenantId, spec.DataSilo);               // idempotent by tags
  var pv = await Policy.AttachAsync(spec.TenantId, spec.PolicyVersion);
  await Caches.WarmAsync(spec.TenantId, pv);
  await Probes.SmokeAsync(spec.TenantId);                                // canary e2e
  await Transitions.GoAsync(spec.TenantId, "Active", reason:"onboard.ok");
}

Suspend / Reinstate

  • Suspend sets Write=Denied, Read=Allow, Export=Allow|Deny per policy; projectors drain; DLQ must be zero.
  • Reinstate re-enables quotas/limits, refreshes keys/policy caches, and runs a quick smoke test before flipping.

Suspend sketch

await Guards.SetAsync(tid, disableWrites:true, disableExports:true);
await Projectors.DrainAsync(tid, timeout: TimeSpan.FromMinutes(10));
await Transitions.GoAsync(tid, "Suspended", reason:"ops:request");

Offboard → PendingDeletion → Deleted

PendingDeletion entry (checks):

  • exportBeforeDelete acknowledged or explicitly declined.
  • No active legal holds; otherwise block with Problem+JSON.
  • All jobs drained; DLQ depth = 0; projections up-to-date.

Deletion window:

  • Configurable (e.g., 30 days). During window:
    • Ingestion denied, reads allowed for review, exports limited to final bundles.
    • Timer issues reminders at T-7/T-1 days; any hold triggers auto-extend.

Finalize (hard delete):

  • Revoke/rotate tenant KEK (crypto-shred).
  • Remove dedicated indices/buckets; for shared stores rely on TTL/ILM.
  • Tear down dashboards/alerts; emit Tenant.Deleted with last integrity segment and key lineage.

Automation surfaces

API (operator-facing)

  • POST /tenants/{id}:suspend {reason}
  • POST /tenants/{id}:reinstate {reason}
  • POST /tenants/{id}:offboard {exportBeforeDelete:true|false}
  • POST /tenants/{id}:finalize-delete (pre-flight validates holds/retention)

Background jobs

  • LifecycleOrchestrator (saga): manages timers, retries, evidence assembly.
  • EvidenceBuilder: compiles manifests, integrity roots, policy snapshots per transition.
  • Notifier: routes events to contacts and ops channels.

Timers & retries

  • Transitions are retryable with backoff; orchestrator is idempotent (transition compare-and-swap on current state).
  • Window timers are persisted (e.g., durable scheduling) and survive restarts/region failover.
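The compare-and-swap idempotency above can be sketched as follows; the class is illustrative, standing in for whatever durable store backs the orchestrator:

```python
import threading

class TransitionStore:
    """Compare-and-swap on the current state makes orchestrator retries idempotent."""
    def __init__(self, state):
        self._state = state
        self._lock = threading.Lock()

    def cas_transition(self, expected, target):
        with self._lock:
            if self._state != expected:
                return False  # an earlier attempt already won; retry is a no-op
            self._state = target
            return True

store = TransitionStore("Active")
assert store.cas_transition("Active", "Suspended")      # first attempt wins
assert not store.cas_transition("Active", "Suspended")  # duplicate retry does nothing
assert store._state == "Suspended"
```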

Evidence & dashboards

  • Per-transition evidence pack: {transition, who, why, when, policyVersion, keyRefs, guardsState, probeResults}.
  • Lifecycle dashboard: state, window remaining, holds status, export acknowledgments, last smoke tests, guard toggles.

Acceptance criteria

  • Provisioning to Active completes within the target SLO and is idempotent under retries.
  • Suspension/Reinstate flips without data loss, preserving integrity chains and guard posture.
  • Finalization enforces export-before-delete and holds; deletion is irreversible and leaves a signed evidence trail.

Invariants

  • All lifecycle actions are tenant-scoped, policy-checked, and audited.
  • Caches and guards are updated atomically with state (or immediately after, with compensating retries).
  • Hard delete occurs only when retention and hold constraints are met; otherwise the orchestrator refuses to finalize.

Privileged / Break-Glass Access

Controls

  • Dual approval: two distinct approvers (security + business/owner) must authorize each grant.
  • Least privilege: scopes narrowed to specific operations, specific tenants, and (optionally) specific streams/time ranges.
  • Time-bound: short TTL (default ≤ 60 minutes; max configurable per tier). No refresh; re-request required.
  • Context bounds: optional IP allowlists / device posture; region-locked to tenant’s dataSilo.
  • Rate & scope caps: separate low QPS ceilings and export size caps for break-glass sessions.

Grant workflow (operator runbook)

  1. Request: operator submits {tenantId(s), operations, purpose, ticketRef, proposedTTL}.
  2. Review: two approvers validate purpose & scope; system evaluates policy (holds/residency).
  3. Issue: platform mints an ephemeral grant (JWT or capability token) tagged breakGlass=true.
  4. Use: all requests with this grant are tagged, mirrored to evidence stream, and throttled per break-glass policy.
  5. Expire/revoke: the grant auto-expires at exp; it can be revoked early; a sweeping job cleans up stragglers.
  6. Evidence pack: upon expiry, the system assembles the post-hoc evidence bundle and sends notifications.
flowchart LR
Request-->Review
Review-->Issue[Issue Ephemeral Grant]
Issue-->Use[Tagged Access]
Use-->Expire[Auto-Expire/Revoke]
Expire-->Evidence[Evidence Pack Sealed]

Grant shape (JWT payload example)

{
  "iss": "https://atp.connectsoft.dev/breakglass",
  "aud": "connectsoft-atp",
  "sub": "00u1ops42",
  "azp": "ops-console",
  "iat": 1766930000,
  "exp": 1766933600,
  "break_glass": true,
  "allowedTenants": ["splootvets"],
  "ops": ["audit.read", "export.read"],               // allowed operations
  "scope": "audit.read export.read",
  "purpose": "INC-2025-1031 containment review",
  "ticketRef": "INC-2025-1031",
  "approvedBy": ["secops@connectsoft.dev","owner@splootvets.com"],
  "dataSilo": "us",
  "limits": { "qps": 2, "maxExportBytes": 1073741824 }
}

Server enforcement rules

  • break_glass == true AND tenantId ∈ allowedTenants AND now < exp.
  • Operations restricted to ops/scope; write-paths remain denied unless explicitly granted.
  • Residency check: dataSilo in grant must match tenant residency.

Evidence (ComplianceEvents)

  • Granted

{
  "type": "BreakGlass.Granted",
  "tenantIds": ["splootvets"],
  "purpose": "INC-2025-1031 containment review",
  "ticketRef": "INC-2025-1031",
  "approvedBy": ["secops@connectsoft.dev","owner@splootvets.com"],
  "grantId": "bg-01J9ZK…",
  "issuedAt": "2025-10-28T09:20:00Z",
  "expiresAt": "2025-10-28T10:20:00Z"
}

  • Used (emitted per request)

{
  "type": "BreakGlass.Used",
  "grantId": "bg-01J9ZK…",
  "tenantId": "splootvets",
  "operation": "Query",
  "correlationId": "3e1f…",
  "ts": "2025-10-28T09:35:12Z"
}

  • Revoked/Expired

{
  "type": "BreakGlass.Revoked",
  "grantId": "bg-01J9ZK…",
  "reason": "ttl-expired",
  "ts": "2025-10-28T10:20:01Z"
}

All events are mirrored into a separate evidence stream and linked to the incident ticket.


Guard middleware (enforcement sketch)

var g = ParseGrant(token);

if (!g.BreakGlass) return Problem(Forbidden);
if (DateTimeOffset.UtcNow >= g.ExpiresAt) return Problem(ExpiredGrant);
if (!g.AllowedTenants.Contains(ctx.TenantId)) return Problem(CrossTenantForbidden);
if (!g.Ops.Contains(req.Operation)) return Problem(OperationNotAllowed);
if (ctx.DataSilo != g.DataSilo) return Problem(ResidencyConflict);

ApplyRateCaps(g); // low QPS, export byte ceilings
TagTrace("breakGlass", true);
EmitCompliance("BreakGlass.Used", g, ctx, req);
return next();

Monitoring & alerts

  • Immediate notifications on Granted, Used, Revoked to security channel + tenant owner.
  • Anomaly detection:
    • Use outside requested time window, tenant set, or operation set → auto-revoke + page SecOps.
    • More than N break-glass grants per week for a tenant → trigger review.
  • Dashboards: active grants, time remaining, operations executed, bytes exported, geo/IP distribution.

Auto-revocation & sweeping

  • Short TTL by default; grants cannot be refreshed—new approval required.
  • Sweeper job:
    • Runs every minute; revokes expired; terminates active sessions; emits BreakGlass.Revoked.
  • Webhooks: notify stakeholders (tenant contacts, SecOps) on revoke/expiry with a link to the evidence pack.

Export & write constraints (defaults)

  • Read-only unless reviewers explicitly grant audit.write/export.write for a narrow purpose.
  • Exports:
    • Forced per-tenant packaging with signed manifest; purpose copied from grant.
    • Size caps (maxExportBytes) and no background bulk jobs allowed.

API surface (operator)

  • POST /breakglass/grants:request → returns requestId
  • POST /breakglass/grants/{requestId}:approve (twice, distinct approvers) → issues grantId
  • POST /breakglass/grants/{grantId}:revoke
  • GET /breakglass/grants/{grantId}/evidence

All endpoints are admin-only and tenant-agnostic, but they emit tenant-scoped evidence.


Invariants

  • Break-glass access is exceptional, time-boxed, least-privilege, and region-bound.
  • Every action under a break-glass grant is tagged, throttled, and mirrored to ComplianceEvent streams.
  • Grants auto-expire; early revoke is always possible; evidence packs are sealed and signed for audit.

BYOK / CSEK Options

Customer-managed keys (BYOK/CSEK)

  • Model: tenant supplies a KMS key reference; ATP performs Wrap/Unwrap/Sign via the tenant’s KMS under tenant-granted permissions.
  • Scope: keys are per tenant, per data silo/region; references must not cross silos.
  • Rotation: cadence defined by tenant; ATP tolerates rotation via versioned key IDs and metadata stored with each segment.
  • Availability contract: platform operates only when Wrap/Unwrap/Sign succeed; no plaintext fallback.

Key reference examples

{
  "tenantId": "splootvets",
  "dataSilo": "us",
  "encryptionKey": "azure-kv://kv-us/keys/splootvets-kek/7",
  "signingKey":    "azure-kv://kv-us/keys/splootvets-sign/3"
}

Supported schemes (examples): azure-kv://…, aws-kms://…, gcp-kms://…. The key material never leaves tenant KMS/HSM.
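The silo-binding rule ("references must not cross silos") and the scheme allowlist can be checked mechanically at registration time; `validate_byok_entry` is an illustrative helper over the key-reference shape shown above:

```python
ALLOWED_SCHEMES = ("azure-kv://", "aws-kms://", "gcp-kms://")

def validate_byok_entry(entry: dict, expected_silo: str) -> None:
    """Reject cross-silo registrations and unknown KMS schemes."""
    if entry["dataSilo"] != expected_silo:
        raise ValueError("key reference must not cross data silos")
    for field in ("encryptionKey", "signingKey"):
        ref = entry[field]
        if not ref.startswith(ALLOWED_SCHEMES):
            raise ValueError(f"unsupported KMS scheme in {field}: {ref}")
```

Key material itself is never handled here; only references are validated.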


Health & validation

  • Onboarding checks
    • Verify WrapKey/UnwrapKey and, if configured, Sign/Verify with a canary envelope.
    • Validate AAG (assurance & algorithm) compatibility (e.g., AES-GCM-256, Ed25519/ECDSA P-256).
    • Confirm RBAC policies (least privilege service principal / IAM role) and audit logging enabled on tenant KMS.
  • Continuous probes
    • Minute-level canary ops per tenant/silo (low cost; cached).
    • Alert on latency spikes, permission failures, disabled/rotated keys without new version activation.

Escrow & recovery

  • Escrow (optional): tenant may register an escrow policy (e.g., second HSM-protected copy or external KMS alias) for emergency rotation.
  • Runbook references stored in registry: contact routes, change windows, and authorization steps.
  • Recovery drills: quarterly simulated key rotation + failover using escrow alias; evidence report produced.

Failure modes & impact statement

  • Key revoked/disabled: Reads fail closed (KeyUnavailable), writes buffer (durable queue) until KMS recovers; integrity chains continue (signing may pause if the same KMS is used for signatures).
  • Key destroyed: Crypto-shred of encrypted content; data becomes unreadable. Integrity manifests/seals remain verifiable if signing key is separate and intact.
  • Regions out of sync: BYOK reference must exist per silo; cross-region unwrap is rejected by policy.

Impact statement (template)

If customer-managed KEK is revoked or destroyed:
- Decrypt/read operations: denied (no plaintext fallback).
- Write operations: queued; may be dropped after retention-of-queue window.
- Integrity verification: continues if signing key unaffected.
Operator action required: restore key access or rotate to a new key version.

High assurance options

  • HSM-backed keys: require FIPS 140-2/140-3 validated modules for KEK and signing keys.
  • Split-key/threshold signing: m-of-n shares for integrity signing (e.g., 2-of-3) to reduce single-key risk.
  • Certificate pinning for KMS endpoints and mutual TLS between ATP and tenant KMS where supported.
  • Dual-key separation: encryption KEK and integrity signing key must be logically separate and tagged for distinct purposes.

Onboarding & runtime (operator checklist)

  • Record BYOK references in Tenant Registry (encryptionKey, signingKey) per silo.
  • Grant least-privilege IAM/RBAC role to ATP service principal; enable KMS audit logs.
  • Run canary wrap/unwrap/sign; store probe result and AAG in evidence.
  • Configure rotation webhook: tenant notifies ATP prior to rotation; ATP pre-warms caches.
  • Enable continuous probes and alerts (permission denied, disabled key, latency anomaly).
  • Document escrow policy and escalation contacts.

Evidence & audit

  • Key metadata (IDs, versions, purposes, tags) — no secrets.
  • Probe logs: last success time, latency percentiles, error codes.
  • Rotation events: Key.Rotated with old/new versions, approvers, and change ticket.
  • Separation proof: encryption and signing keys’ distinct IDs and policies.

Event (example)

{
  "type": "Key.Rotated",
  "tenantId": "splootvets",
  "dataSilo": "us",
  "purpose": "encryption",
  "oldVersion": 7,
  "newVersion": 8,
  "approvedBy": ["secops@tenant.com", "owner@tenant.com"],
  "ticketRef": "CHG-2025-1142",
  "ts": "2025-10-28T09:45:00Z"
}

Guardrails (policy snippets)

encryption:
  mode: BYOK
  requireHsm: true
  allowSplitKey: false
  rotation:
    minDays: 90
    graceOverlapDays: 14   # both versions valid during cutover
failure:
  writes: buffer
  reads: failClosed
  queueMaxAge: 6h

Health check pseudocode

var key = Tenant.Byok.EncryptionKey(tid, silo:"us");
var dek = Crypto.GenerateDek();
var wrapped = await Kms.WrapAsync(key, dek);         // proves wrap permission & availability
var unwrapped = await Kms.UnwrapAsync(key, wrapped); // proves unwrap correctness
Assert.SequenceEqual(dek, unwrapped);

var sig = await Kms.SignAsync(Tenant.Byok.SigningKey(tid), data: CanaryDigest);
var ok  = await Kms.VerifyAsync(Tenant.Byok.SigningKey(tid), data: CanaryDigest, sig);
if (!ok) Alert("SigningKeyVerifyFailed", tid);

Invariants

  • BYOK/CSEK never weakens tenancy: keys are per tenant, per silo, and access is strictly least-privilege.
  • Platform fails closed on KMS unavailability; no plaintext writes.
  • Encryption and integrity keys are separate; destroying encryption keys does not invalidate integrity proofs.
  • All BYOK actions (onboarding, rotation, failure, recovery) are auditable and tenant-scoped.

Residency-Aware Routing

Routing (ingest/query/export)

  • Tenant-first resolution
    • Resolve dataSilo from the Tenant Registry (authoritative), not from caller headers.
    • Route ingest, query, and export to endpoints in that silo only.
  • Strict boundaries
    • Cross-region reads/writes are forbidden unless an approved break-glass grant explicitly allows it.
    • Endpoints validate ctx.dataSilo == registry.dataSilo on every call; mismatch ⇒ 409 ResidencyConflict.

Routing table (example)

routers:
  us:
    ingest: https://us.atp.connectsoft.dev/v1/audit/records
    query:  https://us.atp.connectsoft.dev/v1/query
    export: https://us.atp.connectsoft.dev/v1/exports
  eu-we:
    ingest: https://eu-we.atp.connectsoft.dev/v1/audit/records
    query:  https://eu-we.atp.connectsoft.dev/v1/query
    export: https://eu-we.atp.connectsoft.dev/v1/exports

Gateway decision (pseudocode)

var t = TenantRegistry.Get(tenantId);               // cache w/ ETag
var targetSilo = t.DataSilo;
if (ctx.DataSiloHeader is {} hdr && hdr != targetSilo) Tag("callerSiloMismatch", true);

var ep = Router.For(targetSilo, intent: req.Intent); // ingest/query/export
if (ep is null) return Problem(ResidencyConflict);

if (IsBreakGlass(token) && token.DataSilo == targetSilo && token.Ops.Contains(req.Intent))
    Tag("breakGlass", true);
else if (ctx.OriginSilo != targetSilo)
    return Problem(ResidencyConflict);

return ProxyTo(ep);

flowchart LR
Client -- JWT --> Gateway
Gateway --> Registry[(Tenant Registry)]
Registry --> Gateway
Gateway -->|route to silo| SiloEP[(US/EU endpoints)]
SiloEP --> Services
Hold "Alt" / "Option" to enable pan & zoom

Failover (respecting residency)

  • Read-only fallback (default)
    • If primary in-silo services degrade, route queries to read replicas within the same silo.
    • Writes buffer to durable queues in-silo; no cross-region write unless policy allows disaster write-overrides.
  • Write failover (policy-gated, exceptional)
    • Allowed only when tenant policy residency.writeFailover = "emergency" and dual approval is recorded.
    • Traffic re-routed to a designated secondary silo; all segments rewrapped under secondary KEK; evidence bundle produced.
    • On primary recovery: reconciliation job replays deltas and anchors segment proofs in both silos.

Policy snippet (residency)

residency:
  primarySilo: us
  readFailover: ["us"]          # in-silo replicas only
  writeFailover: forbidden      # "forbidden" | "emergency"
  emergencyTarget: eu-we        # used only if writeFailover = emergency

Failover toggle (operator)

await Residency.Guard.EnableWriteFailoverAsync(
  tenantId: "splootvets",
  toSilo: "eu-we",
  ttl: TimeSpan.FromHours(2),
  approvals: ["secops@…","owner@splootvets.com"],
  ticket: "INC-2025-1102");

Health & triggers

  • Trip RO fallback only when replicas are usable and the primary is degraded:
    • ingest p95 > SLO for 5m, or the primary write path is unhealthy, and
    • replica lag < threshold (replicas are safe to serve reads).
  • Never auto-enable cross-region writes; must be manual + approved.
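One reading of the trip conditions as a predicate; the grouping (degraded primary AND usable replicas) is an interpretation of the bullets above, and the five-minute window comes directly from them:

```python
def should_trip_read_only(ingest_p95_ms: float, slo_p95_ms: float,
                          breach_minutes: int, replica_lag_s: float,
                          lag_threshold_s: float, write_path_healthy: bool) -> bool:
    """RO fallback needs usable replicas plus a degraded primary:
    (sustained ingest SLO breach OR unhealthy write path) AND low replica lag."""
    replicas_usable = replica_lag_s < lag_threshold_s
    sustained_breach = ingest_p95_ms > slo_p95_ms and breach_minutes >= 5
    return replicas_usable and (sustained_breach or not write_path_healthy)
```

Cross-region write failover is deliberately absent here: it is never automatic.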

Data gravity (analytics & joins)

  • Analytics in place
    • ETL, projections, and ad-hoc analytics run within the tenant’s silo.
    • Federated queries across silos require:
      • Explicit grant with purpose, allowedTenants, allowedSilos[],
      • Read-only views with masking applied,
      • Temporary dataset scoped to the case and auto-expired.
  • No cross-context joins in operational APIs
    • Query service forbids JOIN/UNION that span silos/tenants; analytics must export per-tenant bundles and combine outside of ATP, or via approved federated job.

Federated job manifest (example)

{
  "jobId": "fed-01J9ZM…",
  "purpose": "RiskTrend-Q4",
  "allowedTenants": ["splootvets","vetco"],
  "allowedSilos": ["us","eu-we"],
  "readOnly": true,
  "ttlHours": 6,
  "outputs": ["s3://analytics/fed/fed-01J9ZM…/result.parquet"]
}

Export routing & verification

  • Exports write to tenants/{tenantId}/exports/... in the tenant’s silo and are encrypted under the silo-local KEK.
  • Verify-on-download endpoint runs in-silo; cross-region downloads require break-glass and are rate-capped.

Export request (server-enforced fields)

{
  "tenantId": "splootvets",
  "dataSilo": "us",
  "timeRange": { "from": "2025-10-01T00:00:00Z", "to": "2025-10-31T23:59:59Z" },
  "purpose": "DSAR-241"
}

Observability & alerts

  • Routing tags on traces/logs: tenantId, dataSilo.source, dataSilo.target, routeDecision.
  • Alerts
    • residency_conflict_total > 0 (5m) → investigate misrouted clients.
    • replica_lag_seconds > threshold → block RO fallback.
    • cross_region_attempt_total > 0 without break-glass → page SecOps.

Evidence artifacts

  • Routing manifest per tenant: endpoints, current silo, failover posture, last verification timestamp.
  • Failover pack (if invoked): approvals, timerange, key lineage (KEK IDs), reconciliation report, and segment proof anchors.
  • Federated job evidence: grant payload, masking profile, dataset TTL, and signed outputs manifest.

Invariants

  • Routing is driven by the Tenant Registry’s dataSilo, not caller-provided hints.
  • Cross-region access is denied by default; only allowed under break-glass with strict scope and TTL.
  • All artifacts (exports, manifests, proofs) are produced and verified in-silo, with silo-local keys.
  • Failover never bypasses tenancy guards, encryption, or integrity; write failover is explicitly approved and fully auditable.

Legal Holds & DSAR

Hold semantics

  • Purpose: suspend deletion under retention/TTL for specific tenant/stream/predicate/time-window.
  • Scope:
    • tenantId (required)
    • stream (e.g., audit.default, audit.security) or *
    • predicate (field filters, e.g., action in ["Login","Export.*"], resource.id = "apt_123")
    • timeRange (from/to, optional → open-ended)
  • Immutability: once Activated, holds are append-only (extend time/predicate) or Revoked via dual approval; no in-place narrowing.
  • Effect: retention jobs and delete requests for matching records must skip purges; export reads still apply masking.
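A sketch of hold-scope matching over the shapes above; wildcard actions use glob-style patterns, and timestamps are compared as UTC ISO-8601 strings (lexicographic order is safe for this fixed format):

```python
from fnmatch import fnmatchcase

def hold_matches(hold: dict, record: dict) -> bool:
    """True when an Active hold's scope (stream, action predicate,
    time range) covers the record."""
    if hold.get("state") != "Active":
        return False
    if hold["stream"] not in ("*", record["stream"]):
        return False
    actions = hold.get("predicate", {}).get("action")
    if actions and not any(fnmatchcase(record["action"], pat) for pat in actions):
        return False
    tr = hold.get("timeRange") or {}       # absent → open-ended
    ts = record["createdAt"]
    if "from" in tr and ts < tr["from"]:
        return False
    if "to" in tr and ts > tr["to"]:
        return False
    return True
```

The retention worker and delete endpoint would apply this predicate before any purge.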

Policy pointer (bundle snippet)

legalHold:
  approval: dual
  allowGlobalStreamHold: true     # tenant-wide stream holds
  defaultTtlDays: 90              # auto-expire unless extended (optional)

Hold record (shape)

{
  "holdId": "lh-01J9ZN5W",
  "tenantId": "splootvets",
  "stream": "audit.default",
  "predicate": { "action": ["Export.Requested","Export.Completed"] },
  "timeRange": { "from": "2025-10-01T00:00:00Z", "to": "2025-10-31T23:59:59Z" },
  "state": "Active",
  "approvers": ["secops@connectsoft.dev","owner@splootvets.com"],
  "createdAt": "2025-10-28T09:55:00Z",
  "evidenceRef": "tenants/splootvets/legalholds/lh-01J9ZN5W/manifest.json"
}

Events

  • LegalHold.Requested → LegalHold.Activated → (LegalHold.Extended)* → LegalHold.Revoked
  • Each event includes {tenantId, holdId, scope, approvers, reason, ts} and a signed manifest.

DSAR / subject rights workflow

  • Case lifecycle: Opened → Discovery → Review → Approved → Exported → Closed
  • Discovery filters: tenant-scoped predicates on subject identifiers (e.g., emails, phones, subjectIds, custom IDs), time window, and actions/resources.
    • Support hash-aware matching (e.g., emails hashed at write-time): the DSAR engine accepts clear values and transforms to the stored representation (HASH/MASK/TOKEN) using tenant redaction config.
  • Masking profiles: export uses purpose-limited masking (minimized view) even for the subject data, unless policy allows full copy.
  • Reviewer lane: at least one reviewer (not the submitter) must approve before export.
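Hash-aware matching can be sketched as a transform from clear values to their stored representation; the `HASH:`/`MASK:` prefixes and the pre-hash normalization (trim + lowercase) are illustrative assumptions, not the platform's fixed encoding:

```python
import hashlib

def to_stored(value: str, transform: str) -> str:
    """Map a clear subject identifier to its stored form so DSAR discovery
    can match hashed/masked fields without persisting the clear value."""
    if transform == "HASH":
        return "HASH:" + hashlib.sha256(value.strip().lower().encode()).hexdigest()
    if transform == "MASK":  # showLast4-style mask
        return "MASK:" + "*" * max(len(value) - 4, 0) + value[-4:]
    raise ValueError(f"unknown transform: {transform}")
```

The clear value exists only in memory during transformation, matching the PII-handling note later in this section.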

DSAR case (shape)

{
  "caseId": "dsar-241",
  "tenantId": "splootvets",
  "subjects": { "email": ["john@example.com"], "phone": ["+13035551234"] },
  "timeRange": { "from": "2025-09-01T00:00:00Z", "to": "2025-10-31T23:59:59Z" },
  "filters": { "actions": ["Appointment.*","Export.*"] },
  "purpose": "Data Subject Access Request",
  "state": "Review",
  "reviewers": ["privacy@splootvets.com"],
  "policyVersion": "tenant:splootvets@42"
}

Export bundle

  • Artifacts: Parquet/NDJSON chunks + signed manifest (policy version, filters, masking profile, record counts, checksums, chain proofs).
  • Residency: generated in-silo, encrypted under tenant KEK.
  • TTL: temporary staging objects auto-expire (e.g., 7 days) unless extended.

Manifest (excerpt)

{
  "tenantId": "splootvets",
  "caseId": "dsar-241",
  "timeRange": { "from": "2025-09-01T00:00:00Z", "to": "2025-10-31T23:59:59Z" },
  "policyVersion": "tenant:splootvets@42",
  "maskingProfile": "DSAR-Minimal",
  "recordCount": 1832,
  "chunks": [{ "path": "part-0001.ndjson", "sha256": "9f2…"}],
  "integrity": { "segmentRoots": ["sha256:…"], "signature": "MEQCI…" },
  "generatedAt": "2025-10-28T10:10:00Z"
}

Sequence (overview)

sequenceDiagram
  participant Req as Requester
  participant Console as Privacy Console
  participant Policy as Policy Engine
  participant Export as Exporter (in-silo)

  Req->>Console: Open DSAR case (subjects, window, purpose)
  Console->>Policy: Validate purpose & scope (tenant policyVersion)
  Policy-->>Console: Ok (masking profile, filters)
  Console->>Export: Generate DSAR export (tenant-scoped)
  Export-->>Console: Signed manifest + artifacts
  Console-->>Req: Download link (time-bound); log reviewer approval
Hold "Alt" / "Option" to enable pan & zoom

Precedence & enforcement

  • Precedence: legal hold > DSAR delete > retention purge
    • If a record matches an Active legal hold, it cannot be deleted (by retention or DSAR-delete).
    • DSAR delete (if supported by policy) applies only when no active hold covers the record.
  • Enforcement points:
    • Retention worker: filters out held records.
    • Delete endpoint: checks holds before performing subject erasure; returns 409 LegalHoldConflict when blocked.
    • Policy evaluator: resolves masking vs. deletion paths deterministically based on policyVersion.

Retention worker (pseudocode)

var holds = HoldIndex.ActiveFor(tid, stream);
var candidates = SelectExpired(tid, stream, before: now - ttl);
var deletable = candidates.Where(r => !holds.Any(h => h.Matches(r)));
Delete(deletable);  // deletions are logged with chain updates
Skip(candidates.Except(deletable)); // emit HoldSkip events

APIs (operator & privacy)

Legal holds

  • POST /tenants/{id}/holds → request hold (requires reason + approvers)
  • POST /tenants/{id}/holds/{holdId}:activate (dual approval)
  • POST /tenants/{id}/holds/{holdId}:extend (widen time window or predicate)
  • POST /tenants/{id}/holds/{holdId}:revoke (dual approval, reason required)
  • GET /tenants/{id}/holds?state=Active

DSAR

  • POST /tenants/{id}/dsar/cases (open)
  • POST /tenants/{id}/dsar/{caseId}:review (approve/deny)
  • POST /tenants/{id}/dsar/{caseId}:export (generate)
  • GET /tenants/{id}/dsar/{caseId}/manifest
  • POST /tenants/{id}/dsar/{caseId}:delete (subject erasure; policy-gated)

Error taxonomy (Problem+JSON)

| type | HTTP | Meaning / Action |
|---|---|---|
| …/legal-hold-conflict | 409 | Active hold blocks delete; provide hold details |
| …/hold-invalid-scope | 400 | Bad predicate/window; fix and retry |
| …/dsar-approval-required | 403 | Reviewer approval missing |
| …/residency-conflict | 409 | DSAR export attempted to foreign region |
| …/subject-identifier-unsupported | 400 | Unknown identifier type for DSAR discovery |

Evidence & audit

  • Hold manifest: scope, approvals, policyVersion, activation/revocation times, signed.
  • DSAR case log: requester, reviewers, purpose, filters, masking profile, export manifest, download access logs.
  • Retention decisions: counts of deleted vs hold-skipped records, with sample ids and corresponding holds.
  • All events mirrored into ComplianceEvent streams, tenant-scoped.

Security & performance notes

  • In-silo only: discovery, export, and verification run in the tenant’s dataSilo.
  • Resource caps: DSAR jobs use fair-share scheduler; large exports are chunked with seek cursors.
  • PII handling: plain identifiers accepted only in memory for transformation; never persisted unredacted.

Invariants

  • Legal holds are immutable in effect and override retention and DSAR-delete.
  • DSAR exports are purpose-bound, tenant-scoped, signed, and masking-aware.
  • All decisions (holds, DSAR, deletions) are deterministic per policyVersion and auditable end-to-end.

Backfill & Reprocessing Safety

Goals

  • Rebuild projections/indices or re-evaluate policy outcomes without violating immutability and never crossing tenant boundaries.
  • Bound scope, dedupe safely, and provide auditable evidence for every change.

Replay scope & deduplication

  • Scope is mandatory: every job declares tenantId and a time window (from/to). Cross-tenant windows are forbidden.
  • Seek pagination: iterate via (createdAt, recordId); no offset paging.
  • Dedup rules:
    • Primary: idempotencyKey (within tenant).
    • Secondary: (recordId) if present, or sha256(canonicalPayload) to detect payload drift.
  • Immutability: raw segments are append-only. Reprocessing does not modify sealed segments; it (re)builds projections and materialized indices. Corrections are appended as Correction metadata events when needed.
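The dedup rules can be sketched with a canonical-JSON payload hash; `classify` is a hypothetical helper distinguishing true duplicates (safe to skip) from payload drift (quarantine):

```python
import hashlib
import json

def canonical_hash(payload: dict) -> str:
    """sha256 over canonical JSON: sorted keys, no whitespace."""
    canon = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canon.encode()).hexdigest()

def classify(seen: dict, idem_key: str, payload: dict) -> str:
    """'new' | 'duplicate' (same key, same hash) | 'quarantine' (drift)."""
    h = canonical_hash(payload)
    prev = seen.get(idem_key)
    if prev is None:
        seen[idem_key] = h
        return "new"
    return "duplicate" if prev == h else "quarantine"
```

Canonicalization makes the hash insensitive to key order, so only real payload drift triggers quarantine.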

Job manifest (example)

{
  "jobId": "reproc-01J9ZQ…",
  "tenantId": "splootvets",
  "intent": "reindex|policy-reclassify|export-rebuild",
  "timeRange": { "from": "2025-10-01T00:00:00Z", "to": "2025-10-31T23:59:59Z" },
  "policyVersion": "tenant:splootvets@42",   // pin; may also use @next for shadow
  "maxRecords": 2_000_000,
  "dryRun": true,
  "sampleRate": 0.05                          // 5% sample for first pass
}

Quarantine lane & operator gates

  • Quarantine triggers:
    • Tenant mismatch (body/header/segment).
    • Duplicate idempotencyKey with different payload hash.
    • Policy evaluation divergence beyond threshold (current vs next).
  • Operator gates:
    • Dry-run diffs required before execution; show counts, sample changes, and impact by index/shard.
    • Sampled rollout: 1% → 10% → 100% by shards/partitions with automatic pause on breach.
    • Approval check: dual approval for destructive projections (e.g., re-redaction of derived stores).

Quarantine record (shape)

{
  "tenantId": "splootvets",
  "reason": "IdempotencyMismatch",
  "originalKey": "tid:splootvets|ulid:01J9…",
  "originalHash": "a1c…",
  "newHash": "9f2…",
  "time": "2025-10-28T10:30:00Z",
  "evidenceRef": "tenants/splootvets/quarantine/reproc-01J9ZQ…/…"
}

Blast-radius controls

  • Per-tenant ceilings: maxRecords, maxBytes, maxDuration (wall clock) per job.
  • Circuit breakers:
    • Trip if projector lag grows above threshold (e.g., > 60s sustained) or error rate > X%.
    • Auto-pause if diff rate (changed records / processed) exceeds expected bound.
  • Throughput shaping: low-priority pool with WFQ; backoff on shared hot shards.
  • Residency lock: jobs execute in-silo; cross-region reprocessing is forbidden unless break-glass.

Dry-run diffs & evidence

  • Diff outputs (per index/projection):
    • added, updated, unchanged, skipped (held|quarantine|error) counts.
    • Sample documents with field-level before/after (masked).
    • Policy versions compared (current vs target).
  • Signed report stored under tenants/<id>/reproc/<jobId>/manifest.json with checksums and trace links.
  • Promotion rule: execution allowed only if updated/processed ≤ threshold and errors == 0 (configurable per tenant).
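The promotion rule as a gate function; the 1% default threshold is an assumed per-tenant setting, not a platform constant:

```python
def may_promote(counts: dict, max_update_ratio: float = 0.01) -> bool:
    """Dry-run promotion gate: zero errors, and updated/processed must not
    exceed the (tenant-configurable) threshold."""
    if counts.get("errors", 0) != 0:
        return False
    processed = counts.get("processed", 0)
    if processed <= 0:
        return False
    return counts.get("updated", 0) / processed <= max_update_ratio
```

Applied to the diff-report excerpt below this would pass: 423 / 125000 ≈ 0.34% changed, zero errors.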

Diff report (excerpt)

{
  "tenantId": "splootvets",
  "jobId": "reproc-01J9ZQ…",
  "projection": "audit.search",
  "counts": { "processed": 125000, "updated": 423, "added": 0, "skipped": 37, "errors": 0 },
  "policyCompare": { "current": "…@41", "target": "…@42", "divergencePct": 0.34 },
  "sample": [
    { "recordId": "01J9…", "field": "email", "before": "HASH:…", "after": "MASK:****1234" }
  ],
  "generatedAt": "2025-10-28T10:40:12Z",
  "signature": "MEQCIB…"
}

Execution flow

flowchart LR
Seed[Seed by tenant+time] --> Page[Seek paginate]
Page --> Eval["Evaluate policy (pinned)"]
Eval --> Dedup{Seen idempotency?}
Dedup -->|same hash| Skip[Skip]
Dedup -->|hash mismatch| Q[Quarantine]
Eval --> Apply[Apply to projections]
Apply --> Lag{Projector lag OK?}
Lag -->|no| Pause[Pause/Backoff]
Lag -->|yes| Next[Continue]
Hold "Alt" / "Option" to enable pan & zoom

Server algorithm (pseudocode)

await foreach (var batch in Segments.ReadKeyset(tid, from, to, pageSize: 10_000))
{
    foreach (var rec in batch)
    {
        if (!Idemp.TryAdd(rec.IdempotencyKey, rec.PayloadHash, out var conflict))
        {
            if (conflict.PayloadHash != rec.PayloadHash) { Quarantine(rec, "IdempotencyMismatch"); continue; }
            continue; // true duplicate, skip
        }

        var decision = Policy.EvaluatePinned(targetPolicy, rec);
        if (!decision.Allowed) { Quarantine(rec, decision.Basis); continue; }

        if (DryRun) { Diff(rec, decision); continue; }

        Projections.Apply(rec, decision); // write-only to derived stores
        Metrics.Inc("processed", tid);
    }

    if (LagMonitor.ProjectorLag(tid) > SLO.ProjectorLag) { Backoff(); continue; }
    if (Breaker.Tripped) { PauseJob("Breaker"); break; }
}

Problem+JSON (reprocessing)

| type | HTTP | Meaning / Action |
|---|---|---|
| …/reproc-scope-missing | 400 | Require tenantId + time window |
| …/reproc-cross-tenant | 403 | Attempt to process multiple tenants in one job |
| …/idempotency-mismatch | 202 | Quarantined; needs operator review |
| …/policy-divergence-high | 409 | Diff > threshold in dry-run; adjust policy/scope or approve |
| …/projector-lag-breach | 429 | Circuit breaker tripped; retry later |

Scheduling & fairness

  • Jobs tagged intent: backfill|reindex|policy-reclassify run in low-priority pools with floor rates per tenant.
  • Large jobs are chunked by time buckets (e.g., hourly) and interleaved with realtime.
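Time-bucket chunking can be sketched as a simple window splitter (hourly here, per the example above):

```python
from datetime import datetime, timedelta

def hourly_buckets(frm: datetime, to: datetime):
    """Split a reprocessing window into hourly chunks so large jobs can be
    interleaved with realtime traffic; the final bucket is clipped to `to`."""
    buckets = []
    cur = frm
    while cur < to:
        nxt = min(cur + timedelta(hours=1), to)
        buckets.append((cur, nxt))
        cur = nxt
    return buckets
```

The scheduler can then dispatch buckets through the low-priority pool one at a time.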

Observability

  • Metrics (per tenant/job): processed, updated, quarantined, skipped, errors, lag_seconds.
  • Traces include tenantId, jobId, policyVersion(current/target), and diff counts as span attributes.
  • Logs emit evidence refs for each quarantine and final report.

Invariants

  • Reprocessing is tenant-scoped, in-silo, and append-only to derived stores.
  • Idempotency and payload hashing prevent duplicates and flag drifts.
  • Execution is gated by dry-run diffs, quotas, and circuit breakers; operators retain explicit control over rollout.

Performance Isolation & Tests

Targets (by tier)

Goal: guarantee predictable performance under multi-tenant load and validate isolation against noisy neighbors.

SLOs (illustrative)

| Tier | Ingest p50 / p95 (ms) | Query p50 / p95 (ms) | Export throughput (MB/s) | Projector lag steady (s) | Error budget (30d) |
|---|---|---|---|---|---|
| Bronze | 50 / 350 | 120 / 600 | 2 | ≤ 120 | 99.5% |
| Silver | 35 / 250 | 90 / 400 | 5 | ≤ 60 | 99.9% |
| Gold | 20 / 150 | 60 / 250 | 10 | ≤ 20 | 99.95% |
| Enterprise | Contractual | Contractual | Contractual | Contractual | Contractual |

Error budgets are tracked per tenant and per route (ingest/query/export). Breaches trigger auto-throttling for non-critical pools (backfill, ad-hoc export) before real-time paths.


Load test design

Traffic model

  • Realtime ingest: Poisson arrivals; action mix by stream (e.g., audit.default: 80%, audit.security: 20%).
  • Query: think-time 1–3s; fan-in to hot resources (to exercise caches) and cold reads (to test index seeks).
  • Export: periodic bursts (DSAR, monthly closes) sized by tenant tier.

Tenants under test

  • T_hot: high QPS (Gold)
  • T_norm: steady baseline (Silver)
  • T_backfill: long-running reindex (low priority)
  • T_vip: Enterprise with dedicated index

flowchart LR
Gen[Load Generators] -->|T_hot| GW[Gateway]
Gen -->|T_norm| GW
Gen -->|T_backfill| GW
Gen -->|T_vip| GW
GW --> Ingest
GW --> Query
GW --> Export
Ingest --> Projectors --> Index
Query --> Index
Export --> ObjectStore
Hold "Alt" / "Option" to enable pan & zoom

Noisy-neighbor scenarios

  1. Burst storm (T_hot): spike ingest 5× for 10 minutes → expect per-tenant backpressure (429 + Retry-After) with no SLO regression for T_norm/T_vip.
  2. Backfill hammer (T_backfill): sustained reprocessing at max allowed credits → WFQ must deprioritize it when realtime rises.
  3. Shard hotspot: skew resource.id to collide on one shard → observe auto-split/rebalancing and preserved SLOs for other tenants.

Success criteria

  • Only the overloaded tenant receives throttling; others stay within SLO.
  • Export scheduler interleaves large T_backfill jobs with T_norm small exports (no starvation).
  • Projector lag remains under tier targets after transient spikes.

Test harness & tools

  • k6 (HTTP) for ingest/query; k6-experimental/grpc for gRPC paths.
  • Custom reproc driver for backfill (seeks by (createdAt, recordId)).
  • Locust/Gatling optional for long-haul runs.
  • Fault toggles to inject partial failures (rate-limit, slow shard, KMS latency) while measuring isolation.

k6 sketch (ingest + throttling awareness)

import http from 'k6/http';
import { sleep, check } from 'k6';

export const options = { vus: 200, duration: '10m' };

export default function () {
  const tid = __ENV.TENANT_ID; // e.g., T_hot
  const rec = makeAuditRecord(tid);
  const res = http.post(`${__ENV.BASE}/v1/audit/records`, JSON.stringify(rec), {
    headers: { 'Content-Type': 'application/json', 'X-Tenant-Id': tid, 'Idempotency-Key': rec.idempotencyKey, 'Authorization': `Bearer ${__ENV.TOKEN}` }
  });

  check(res, { '200|202|429': r => [200,202,429].includes(r.status) });
  if (res.status === 429) sleep(Math.random() * 0.2 + 0.1); // jittered backoff
  else sleep(Math.random() * 0.05);
}

Reproc driver knobs

  • --tenant T_backfill --from 2025-10-01 --to 2025-10-31 --sample 0.1 --pool low --max-qps 50

Metrics, dashboards & alarms

Per tenant

  • ingest_p50/p95_latency_ms, query_p50/p95_latency_ms, export_lead_time_seconds
  • projector_lag_seconds, tenant_dlq_depth
  • rate_limit_hits_total, 429_ratio
  • quota_utilization (qps, bytes/min, storage/day)
  • cache_hit_ratio (query)

PromQL-style checks

# Tenant isolation: no SLO bleed when others burst
histogram_quantile(0.95, sum(rate(atp_ingest_latency_ms_bucket{tenantId="T_norm"}[5m])) by (le))
  < slo_ingest_p95_ms{tenantId="T_norm"}

# Backpressure scoped to noisy neighbor
sum(rate(atp_rate_limit_hits_total{tenantId="T_hot"}[5m])) > 0
sum(rate(atp_rate_limit_hits_total{tenantId!="T_hot"}[5m])) == 0

# Projector lag guard
max_over_time(atp_projector_lag_seconds{tenantId=~"T_.*"}[10m]) < tier_projector_lag_slo{tenantId=~"T_.*"}

Dashboards

  • Tenant Overview: SLO dials, rate limits, credits, DLQ, lag, export queue.
  • Shard Heatmap: request density & p95 per shard; hot-split events.
  • Scheduler View: WFQ weights, queues (realtime vs backfill), active export slots.

Synthetic tenants (perf canaries)

  • Always-on tenants per region/tier generating:
    • Trickle ingest (baseline QPS), periodic queries, hourly micro-export.
    • Signatures: known distributions for actions/resources to exercise caches and projections.

Definition (example)

{
  "tenantId": "perf-gold-us",
  "tier": "gold",
  "dataSilo": "us",
  "workload": {
    "ingestQps": 120,
    "queryRps": 40,
    "exportEveryMinutes": 60,
    "mix": { "Appointment.Booked": 0.5, "Login": 0.3, "Export.Requested": 0.2 }
  }
}

Gates

  • CI perf smoke on each deploy (5–10 min).
  • Nightly 1h endurance with noisy-neighbor injection.
  • Fails pipeline if: any canary breaches SLO for ≥ 10 consecutive minutes.
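The canary gate ("breach for ≥ 10 consecutive minutes") as a predicate over per-minute breach samples; the sampling cadence and limit are taken from the bullets above:

```python
def pipeline_fails(breached_by_minute: list[bool], limit: int = 10) -> bool:
    """Fail the pipeline when any canary breaches its SLO for `limit`
    consecutive minutes; isolated breaches reset the streak."""
    run = 0
    for breached in breached_by_minute:
        run = run + 1 if breached else 0
        if run >= limit:
            return True
    return False
```

Intermittent breaches therefore never fail a deploy; only sustained regressions do.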

Shard rebalancing & cache behavior

  • Validate auto-split triggers once shard p95 exceeds threshold with skew > X%.
  • Ensure routing salt or time-bucket expansion reduces p95 within 10 minutes.
  • Verify caches are per-tenant keyed and do not leak across tenants; cold-start penalties remain within SLO budgets.

Acceptance criteria

  • Under a 5× burst from T_hot, T_norm and T_vip remain within their SLOs; only T_hot shows 429s.
  • Backfill from T_backfill never increases T_norm’s projector lag beyond its SLO.
  • Export scheduler prevents starvation; T_norm gets a slot within SLO lead time even when T_backfill runs.
  • Synthetic tenants stay green for 7 days rolling; any regression opens an incident.

Invariants

  • Performance controls (rate-limit, WFQ, credits) are per-tenant, per-region; no global throttle that breaks isolation.
  • Load tests never bypass tenancy guards or residency; all artifacts tagged with tenantId, edition, dataSilo.
  • Results are repeatable: same workload → same SLO outcomes within defined variance bands.

Data Portability & Manifests

Standardized export

  • Formats: Parquet (columnar; compression on) or NDJSON (one-record-per-line).
  • Per-tenant only: artifacts live under tenants/{tenantId}/exports/{stream}/{yyyy}/{mm}/{dd}/….
  • Deterministic ordering: (createdAt ASC, recordId ASC); stable across retries.
  • Chunking: target 128–256 MB objects; include sha256 for each chunk; whole-bundle manifest is signed.
  • Schemas: versioned (schemaVersion); additive evolution only. A data dictionary is embedded.

Object naming

tenants/{tenantId}/exports/{stream}/{from}-{to}/{exportId}/
  part-00000.parquet
  part-00001.parquet
  manifest.json
  manifest.sig
  README-verify.md
  dictionary.json

Manifest spec (signed)

  • Purpose: single source of truth for bundle contents, filters, policies, and integrity.
  • Signature: detached manifest.sig (JWS/COSE) using silo-local signing key; key id recorded.

{
  "manifestVersion": "1.2",
  "exportId": "exp-01J9ZS4GW7X3S3M8N6WJ7N6M2V",
  "tenantId": "splootvets",
  "stream": "audit.default",
  "format": "parquet",
  "compression": "zstd",
  "timeRange": { "from": "2025-10-01T00:00:00Z", "to": "2025-10-31T23:59:59Z" },
  "filters": { "resource.type": ["Appointment","Payment"] },
  "schemaVersion": "audit.v5",
  "recordCount": 124532,
  "chunks": [
    { "path": "part-00000.parquet", "rows": 65536, "bytes": 134217728, "sha256": "9f2…" },
    { "path": "part-00001.parquet", "rows": 65536, "bytes": 133901234, "sha256": "a1c…" }
  ],
  "provenance": {
    "tool": { "name": "atp-exporter", "version": "1.8.3", "build": "e3c9a4f" },
    "policyVersion": "tenant:splootvets@42",
    "maskingProfile": "Export-Standard",
    "generatorHost": "us-exp-03",
    "generatedAt": "2025-10-31T23:59:59Z",
    "correlationId": "c-01J9ZS…"
  },
  "integrity": {
    "segmentRoots": ["sha256:…","sha256:…"],
    "prevRoot": "sha256:…",
    "signingKeyId": "kms://us/tenants/splootvets/signing/v3"
  },
  "residency": { "dataSilo": "us", "objectStore": "s3://us-atp/…" },
  "dictionaryRef": "dictionary.json",
  "readmeRef": "README-verify.md"
}

Provenance & evidence

  • Tooling: exporter name/version/build hash; generator host identity.
  • Policies: policyVersion, maskingProfile, and any purpose string included.
  • Integrity: current and previous tenant segment roots; detached signature and signing key id.
  • Routing: silo/region recorded; object store location captured for residency proof.

Verification (third-party / auditor)

Included README-verify.md

  • Checksums: recompute sha256 for each chunk and compare with manifest.
  • Signature: verify manifest.sig using published tenant signing key (JWK/JWKS URL in evidence channel or shared out-of-band).
  • Integrity chain:
    1. For each chunk, compute Merkle leaf hashes (if provided) or trust segment roots.
    2. Recompute the tenant segment root for the covered window.
    3. Compare with manifest.integrity.segmentRoots and validate signature.
  • Schema & policy: cross-check schemaVersion against dictionary.json; confirm policyVersion matches the tenant’s registry at generatedAt.

CLI sketch

atp-verify \
  --manifest manifest.json \
  --signature manifest.sig \
  --jwks jwks.json \
  --check-sha256 \
  --check-segment-roots

Programmatic (pseudocode)

var man = Manifest.Load("manifest.json");
Signature.Verify("manifest.sig", man, jwks);
foreach (var c in man.Chunks) Assert.Equal(Sha256(File.ReadAllBytes(c.Path)), c.Sha256);
Integrity.VerifySegmentRoots(man.Integrity, window: man.TimeRange);
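The per-chunk checksum step of the pseudocode above can be made concrete. The sketch below (Python rather than C#, and not ATP source) recomputes each chunk's sha256 against the manifest layout shown earlier, assuming chunk files sit next to manifest.json:

```python
import hashlib
import json
from pathlib import Path

def verify_chunks(manifest_path: str) -> list[str]:
    """Recompute sha256 for each chunk listed in the manifest and
    return the paths of any chunks whose digest does not match."""
    manifest = json.loads(Path(manifest_path).read_text())
    base = Path(manifest_path).parent
    failures = []
    for chunk in manifest["chunks"]:
        digest = hashlib.sha256((base / chunk["path"]).read_bytes()).hexdigest()
        if digest != chunk["sha256"]:
            failures.append(chunk["path"])
    return failures
```

A non-empty return maps to the …/checksum-mismatch error in the taxonomy below: the bundle should not be trusted and the export reissued.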

Packaging & security

  • Encryption: objects encrypted at rest using tenant KEK (silo-local); presigned links are time-bound and tenant-scoped.
  • Access model: only principals with tenant read/export scopes; break-glass required for cross-region download.
  • PII safety: exported fields already redacted/masked per maskingProfile; README lists any pseudonymization steps applied.

Data dictionary (excerpt)

{
  "schemaVersion": "audit.v5",
  "fields": [
    { "name": "recordId", "type": "string", "desc": "ULID" },
    { "name": "createdAt", "type": "timestamp", "desc": "UTC ISO-8601" },
    { "name": "tenantId", "type": "string", "desc": "Opaque; constant per bundle" },
    { "name": "action", "type": "string", "desc": "Audit action verb" },
    { "name": "actor.id", "type": "string", "desc": "Masked/hashed per policy" }
  ],
  "masks": { "email": "HASH(sha256)", "phone": "MASK(showLast4)" }
}

API surfaces

  • POST /exports Body (server-enforced fields): { tenantId, dataSilo, stream, timeRange, format, purpose }
  • GET /exports/{exportId}/manifest → signed manifest + checksum list
  • GET /exports/{exportId}/verify → returns machine-verifiable integrity report (optional)

Error taxonomy (Problem+JSON)

type HTTP Meaning / Action
…/export-unsupported-format 400 Choose parquet or ndjson
…/manifest-signature-invalid 422 Signature mismatch; do not trust bundle
…/checksum-mismatch 409 Chunk corrupted; reissue export
…/residency-conflict 409 Cross-silo request; use tenant’s primary silo
…/policy-version-stale 409 Retry with current policyVersion or re-run export

Invariants

  • Every export bundle is per-tenant, in-silo, signed, and checksummed.
  • Manifests are canonical and sufficient for independent verification without access to other tenants’ data.
  • Provenance (tool/policy/filters/timerange) is always included, enabling reproducibility and auditability.

Tenant-Aware Caching & Indexing

Caching (principles & keys)

  • Per-tenant scoping only
    • All cache entries (memory, Redis, CDN) must include tenantId in the key; no shared/global entries for tenant-sensitive data.
    • Include version/ETag (from Tenant Registry/policy bundle) so policy or metadata changes naturally bust stale entries.
  • Key composition (sketch)
<cacheNamespace>:
  v1:
    <tenantId>:
      <policyVersion>@<registryEtag>:
        query:<hash(filters, projection, maskProfile)>
        record:<recordId>
        export:<exportId>
  • Layered caches
    • L1 (in-proc): tiny TTL (100–500ms) for hot paths (policy lookups, router decisions).
    • L2 (distributed): Redis/memory grid with per-tenant namespaces; TTL minutes.
    • Edge/CDN (optional): tenant-aware keys only; forbid caching authenticated responses without tenantId keying.
  • Poisoning defenses
    • Keys must include masking profile and caller role/scope where output differs by authorization.
    • Deny cache insert for responses marked no-store (e.g., break-glass, evidence endpoints).

C# helper (example)
string MakeKey(TenantCtx t, string ns, string kind, ReadIntent intent, object shape) =>
  $"{ns}:v1:{t.TenantId}:{t.PolicyVersion}@{t.RegistryEtag}:{kind}:{intent}:{StableHash(shape)}";

Cache invalidation & warming

  • Invalidation channels
    • Per-tenant pub/sub topic: cache-inv/<tenantId>. Producers (Registry, Policy Engine) publish ETag/policyVersion updates.
    • Consumers drop only entries with older {policyVersion, registryEtag} to avoid shotgun clears.
  • Triggers
    • Tenant.Updated ⇒ invalidate {registryEtag←old}.
    • Policy.Updated ⇒ invalidate {policyVersion←old} and any masking-dependent query caches.
    • BreakGlass.Granted ⇒ bypass/no-store for duration; clear on Revoked.
  • Warming
    • On onboarding or policy rotation, prefetch:
      • Tenant Registry record,
      • Current policy bundle & compiled evaluators,
      • Router endpoints & per-tenant routing hints.
    • Emit Warmup.Completed evidence with hit-ratio baseline.
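The "drop only older entries" rule above can be sketched as an invalidation consumer. This is illustrative only (names like `Entry` and `apply_invalidation` are hypothetical): entries tagged with the tenant's old policy version are dropped surgically, while current entries and other tenants survive:

```python
from dataclasses import dataclass

@dataclass
class Entry:
    tenant_id: str
    policy_version: int
    value: object

def apply_invalidation(cache: dict, tenant_id: str, new_policy_version: int) -> int:
    """Apply a Cache.Invalidate event: drop only this tenant's entries
    tagged with a policy version older than the new one. Returns the
    number of entries dropped; other tenants are never touched."""
    stale = [k for k, e in cache.items()
             if e.tenant_id == tenant_id and e.policy_version < new_policy_version]
    for k in stale:
        del cache[k]
    return len(stale)
```

Because eviction is predicate-based rather than a namespace flush, a policy rotation for one tenant cannot degrade hit ratios for the rest.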

Config (YAML)

cache:
  l1:
    ttlMs: 250
    maxItems: 5000
  l2:
    ttlSeconds:
      policy: 300
      registry: 120
      queryResult: 60
    namespaceByTenant: true
  invalidateOn:
    events: ["Tenant.Updated","Policy.Updated","BreakGlass.Granted","BreakGlass.Revoked"]

Query result caching (tenant-safe)

  • Keys include: tenantId, policyVersion, maskProfile, normalized filters, projection, and seek cursor.

  • Never cache responses that:

    • Cross page boundaries with offset paging (we use seek cursors only).
    • Depend on ephemeral guard toggles (e.g., incident freeze) unless TTL ≤ 1s with guard version in key.
  • Planner integration
    • Planner injects mandatory tenantId = ctx predicate; its plan hash is part of the cache key to prevent stale reuse after index changes.
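The key rules above can be sketched as a builder (illustrative, not the production helper; the plan hash is passed in per the planner-integration note, and filters are normalized so semantically equal queries share one entry):

```python
import hashlib
import json

def query_cache_key(tenant_id: str, policy_version: int, etag: str,
                    mask_profile: str, plan_hash: str,
                    filters: dict, seek_cursor: str) -> str:
    """Compose a tenant-safe query-result cache key. Filters are
    serialized with sorted keys so ordering differences collapse to
    the same entry; every mandatory component is part of the key."""
    norm = json.dumps(filters, sort_keys=True, separators=(",", ":"))
    fhash = hashlib.sha256(norm.encode()).hexdigest()[:12]
    return (f"qr:v2:tenant={tenant_id}|pv={policy_version}|etag={etag}"
            f"|mask={mask_profile}|plan={plan_hash}"
            f"|filters={fhash}|seek={seek_cursor}")
```

Any change to policy version, mask profile, or plan hash produces a different key, so stale reuse after a policy or index change is structurally impossible.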

Indexing (partitioning & aliases)

  • Partition strategy
    • Primary: by tenant, then time (e.g., ULID/TS bucketing).
    • Avoid cross-tenant compound indices; if shared store is required, enforce RLS and prefix partition keys.
  • Aliases & rollover

    • Per-tenant, time-bucketed aliases:

    audit-{tenantId}-write   → active write index
    audit-{tenantId}-read-*  → read aliases across recent buckets
    
    • ILM/rollover by size/time (e.g., 50 GB or 7 days); dedicated tenants may have separate shards.

  • Search schema

    • Low-cardinality fields (tenantId, stream, action) indexed; high-cardinality (resource.id) hashed or keyword-indexed by profile.
    • Avoid global analyzers that ignore tenantId; analyzers must be per-namespace where supported.

Index map (example)

{
  "index": "audit-splootvets-2025-10",
  "partitionKeys": ["tenantId", "createdAt:month"],
  "routing": "tenantId",
  "aliases": ["audit-splootvets-read", "audit-splootvets-write"]
}

Eviction & TTLs

  • Per-tenant TTLs

    • Edition-aware: higher tiers get longer L2 TTL for query results (e.g., Gold 120s vs Bronze 30s) to improve hit ratio.
    • Workload-aware: analytical endpoints may use longer TTL for identical queries; realtime endpoints stay short.
  • Safe eviction rules

    • Evict only entries matching (tenantId, version/etag) predicates.
    • On policy change, evict masking-sensitive caches first; others lazily expire.
  • Pressure response

    • When L2 pressure rises, LRU per tenant with min-share to prevent large tenants evicting small ones (fairness).
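The min-share fairness rule can be sketched as a victim-selection helper (hypothetical names; assumes one LRU `OrderedDict` per tenant, oldest entry first):

```python
from collections import OrderedDict

def pick_eviction_victim(per_tenant_lru: dict, min_share: int):
    """Under cache pressure, choose an eviction victim fairly: only
    tenants holding more than `min_share` entries are eligible, and
    the largest eligible tenant loses its least-recently-used item.
    Returns (tenant, key) or None if no tenant exceeds min-share."""
    eligible = {t: lru for t, lru in per_tenant_lru.items() if len(lru) > min_share}
    if not eligible:
        return None
    tenant = max(eligible, key=lambda t: len(eligible[t]))
    key, _ = eligible[tenant].popitem(last=False)  # pop the LRU end
    return (tenant, key)
```

Small tenants below the min-share floor are never chosen, so a single large tenant cannot evict everyone else out of L2.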

Observability

  • Metrics (tagged by tenantId):
    • cache_l1_hit_ratio, cache_l2_hit_ratio, cache_evict_total, invalidations_total
    • index_rollover_events_total, query_plan_cache_hits_total
  • Traces include cache hit/miss and key parts (redacted) to diagnose leaks.
  • Alerts:
    • Hit ratio drops > X% for tenant → warmup or adjust TTLs.
    • Unexpected cross-tenant cache key (detected via static analyzer) → fail CI.

Static analysis & CI guards

  • Lint rules enforce tenantId & version/etag presence in all cache key builders.
  • Unit tests verify no cache reuse when policyVersion or maskProfile changes.
  • Build fails if any index DDL introduces cross-tenant compound keys.
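The second guard above — no cache reuse when policyVersion or maskProfile changes — reads naturally as a plain unit test. `make_key` below is a stand-in for whatever key builder the service actually uses:

```python
def make_key(tenant_id: str, policy_version: int, mask_profile: str, query_hash: str) -> str:
    # Stand-in key builder; any builder under test must vary its
    # output with every one of these inputs.
    return f"qr:v1:{tenant_id}:{policy_version}:{mask_profile}:{query_hash}"

def test_no_reuse_across_policy_or_mask_change():
    base = make_key("splootvets", 41, "Reader", "abc")
    assert make_key("splootvets", 42, "Reader", "abc") != base   # policy bump busts
    assert make_key("splootvets", 41, "Export", "abc") != base   # mask change busts
    assert make_key("other", 41, "Reader", "abc") != base        # tenant isolation
```

Running this in CI alongside the lint rules catches regressions where a refactor drops one of the mandatory key components.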

Examples

Query cache key (normalized)

qr:v2:tenant=splootvets|pv=42|etag=W/"v7"|mask=Reader|
plan=91f0…|filters=action:Export.*,resource.type:Appointment|
seek=2025-10-28T10:00:00Z|01J9…

Invalidation event

{
  "type": "Cache.Invalidate",
  "tenantId": "splootvets",
  "policyVersionOld": "…@41",
  "policyVersionNew": "…@42",
  "targets": ["queryResult","projectionCache"],
  "ts": "2025-10-28T10:50:00Z"
}

Invariants

  • Cache keys always include tenantId and change when policy/registry versions change.
  • Indexes never co-locate multiple tenants in the same compound key; routing is by tenant.
  • Invalidation is surgical and per-tenant; eviction fairness prevents one tenant from degrading another.

Alignment with SaaS DDD Blueprints

Tenant Management

  • Role: authoritative source for TenantId, dataSilo, edition/entitlements, lifecycle, contacts, quotas.
  • Consumed by ATP: guards, routers, residency resolver, policy evaluator, schedulers.
  • Contracts
    • Reads: GET /tenants/{tenantId} (ETag/Version).
    • Events in: Tenant.Created|Updated|Suspended|Reinstated|PendingDeletion|Deleted|Remapped|Merged.
    • Side-effects: cache warmup, quota/limit refresh, routing table refresh.

Guard dependency (pseudocode)

var t = TenantRegistry.Get(tid, ifNoneMatch: etag);
Guards.Context = new GuardCtx { TenantId = t.Id, DataSilo = t.DataSilo, Edition = t.Edition, Quotas = t.Quotas };

Identity

  • Role: issues tokens with tenant context (tenant/tid), roles/scopes, subject (sub), and optional ABAC attributes (org unit, region).
  • ATP enforcement: RBAC/ABAC on top of mandatory tenant predicate; break-glass tokens carry explicit allowedTenants + TTL.
  • Claims mapping (illustrative)
Claim/Field Meaning ATP Use
tenant / tid Tenant identifier (opaque) Must match TenantId
scope / roles Allowed operations Route/endpoint allow lists
sub Subject (human/service) actor.id on writes; audit trail
department/ou ABAC attribute Masking/filters on read/export
break_glass Emergency grant flag Throttled access path

SaaS Core Metadata (Editions & Entitlements)

  • Role: defines edition (e.g., Bronze/Silver/Gold/Enterprise) and feature flags per tenant.
  • ATP effect: selects policy bundle, default masking profile, quotas (ingest QPS, bytes/day, export concurrency), fair-share weights.
  • Edition → Guard defaults (example)
Edition Policy Bundle Quotas (ingest qps / bytes/day) Export Concurrency Fair-Share Weight
Bronze policy.bronze@X 50 / 10 GB 1 1
Silver policy.silver@X 150 / 50 GB 2 2
Gold policy.gold@X 500 / 250 GB 4 4
Enterprise policy.enterprise@X Contractual Contractual Contractual
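As an illustration of how the edition defaults above might seed per-tenant rate limiting, the sketch below (hypothetical `TenantTokenBucket`; the 1-second burst capacity is an assumption, and Enterprise is omitted because its limits are contractual) maps the table's ingest QPS to a token bucket:

```python
import time

# Edition defaults from the table above; Enterprise limits would come
# from the Tenant Registry rather than a static map.
EDITION_INGEST_QPS = {"bronze": 50, "silver": 150, "gold": 500}

class TenantTokenBucket:
    """Per-tenant token bucket seeded from the tenant's edition."""
    def __init__(self, edition: str, now=None):
        self.rate = EDITION_INGEST_QPS[edition]
        self.capacity = self.rate  # allow a 1s burst (assumption)
        self.tokens = float(self.capacity)
        self.last = now if now is not None else time.monotonic()

    def try_acquire(self, now=None) -> bool:
        now = now if now is not None else time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller maps this to 429 …/rate-limit + Retry-After
```

A denied acquire corresponds to the per-tenant Backpressure row in the guard decision table: one tenant's burst is throttled without touching others.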

Billing / Metering

  • Role: aggregates usage events and enforces plan limits; feeds cost dashboards and overage workflows.
  • From ATP: emits tenant-scoped usage: {bytesIngested, recordsIngested, storageBytes, queries, exportsBytes, dlqDepth, rateLimitHits}.
  • To ATP: provides quota overrides / credit budgets (per tenant & period) that drive WFQ and token buckets.

Usage event (excerpt)

{
  "event": "Usage.Reported",
  "tenantId": "splootvets",
  "window": "2025-10-28T09:00:00Z/2025-10-28T10:00:00Z",
  "ingest": { "records": 125432, "bytes": 184320000 },
  "query":  { "count": 9821, "p95ms": 210 },
  "export": { "bytes": 734003200, "jobs": 2 },
  "rateLimitHits": 37
}

Config / Flags

  • Role: dynamic toggles and parameters evaluated per tenant (via attributes like edition, residency, tags).
  • ATP usage:
    • Kill switches: guards.ingest.enabled=false (per tenant).
    • Migration: iso.shadow.enabled, iso.read.preferShadow.
    • Residency: residency.writeFailover="emergency" (policy-gated).
    • Backfill throttles: reproc.maxQpsPerTenant.

Evaluation context (policy/flags)

{
  "tenant": { "id": "splootvets", "edition": "enterprise", "tags": { "tier": "gold", "vertical": "healthcare" }, "dataSilo": "us" },
  "actor": { "sub": "00u1ops42", "roles": ["Audit.Reader"], "attrs": { "department": "Compliance" } },
  "intent": "query",
  "route": "/v1/query",
  "correlationId": "c-01J9…"
}

Context map (alignment)

flowchart LR
IdP[Identity & Auth] -- tokens/claims --> Gateway
TenantSvc[Tenant Management] -- registry/events --> Gateway
CoreMeta[SaaS Core Metadata] -- edition/entitlements --> Policy
Billing[Billing & Metering] -- quotas/credits --> Guards
Flags[Config/Flags] -- toggles --> Guards
Gateway --> Guards
Guards --> Policy[Policy Engine]
Policy --> Ingest/Query/Export

Integration matrix (who reads/writes what)

Bounded Context Reads from ATP Writes to ATP Events from ATP
Tenant Management Registry (authoritative) Tenant.* consumed by ATP services
Identity Token issuance (claims) GuardDecision (audit of access)
SaaS Core Metadata Edition/feature catalogs Policy.Updated co-authored
Billing/Metering Usage feeds (metrics/export manifests) Quota overrides / credit budgets Usage.Reported
Config/Flags Health/telemetry Flag values / dynamic params Flag.Changed

Failure modes & guardrails (cross-context)

  • Registry stale → ATP serves with last ETag; critical changes (residency/edition) trigger warmup + invalidate before accepting writes.
  • Token missing/invalid → request rejected; GuardDecision recorded.
  • Quota service unavailable → fall back to last known credit budget; never disable tenant isolation.
  • Flag misconfig → shadow evaluation first; fail-safe defaults for guards (deny on uncertainty).

Invariants

  • ATP never derives tenant or residency from caller hints alone; it binds to Tenant Management as SoR.
  • Identity determines who can do what, Tenant Management determines where, SaaS Core Metadata determines how much, Billing determines how often, Config/Flags determine how now.
  • All integrations are tenant-scoped, additive-first, and auditable via shared event contracts.

Appendix A — Example Headers/Claims

HTTP (preferred)

# Auth & tenancy
Authorization: Bearer <JWT-with-tenant-claims>
X-Tenant-Id: <opaque-tenant-id>          # must match JWT tenant claim
X-Idempotency-Key: <tid:...|ulid:...>    # required on writes

# Correlation & telemetry
traceparent: 00-<traceId>-<spanId>-01    # W3C Trace Context (preferred)
baggage: tenantId=<id>,edition=<ed>,policyVersion=<pv>
X-Correlation-Id: <uuid>                 # optional legacy, kept for logs

# Residency (diagnostic only; server resolves from registry)
X-Data-Silo: <us|eu-we>                  # ignored for auth; logged if mismatched

JWT (claims — illustrative)

{
  "iss": "https://idp.example.com",
  "aud": "connectsoft-atp",
  "sub": "00u1abc42",
  "azp": "my-service-client",
  "exp": 1766933600,
  "iat": 1766930000,

  "tenant": "splootvets",                // or "tid"
  "scope": "audit.write audit.query",
  "roles": ["Audit.Writer","Audit.Reader"],

  "break_glass": false,                  // true for emergency grants
  "allowedTenants": ["splootvets"],      // required if break_glass = true
  "dataSilo": "us",                      // residency lock for break-glass

  "attrs": { "department": "Compliance", "ou": "Ops" }  // ABAC attributes
}

gRPC metadata (service → service)

authorization: Bearer <JWT>
x-tenant-id: <id>
traceparent: 00-<traceId>-<spanId>-01
baggage: tenantId=<id>,edition=<ed>,policyVersion=<pv>

Notes

  • Server binds residency and edition from the Tenant Registry, not from headers.
  • X-Idempotency-Key scope is per tenant. Duplicate with different payload ⇒ quarantine.
  • Break-glass tokens must include allowedTenants, dataSilo, short TTL, and are heavily rate-capped.
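The idempotency note above can be sketched as a per-tenant store (illustrative only; a real implementation would live behind the ingest guard and persist beyond process memory):

```python
import hashlib

class IdempotencyStore:
    """Per-tenant idempotency check: the same key with the same payload
    is a replay; the same key with a different payload is quarantined."""
    def __init__(self):
        self._seen = {}  # (tenant_id, key) -> payload sha256

    def check(self, tenant_id: str, key: str, payload: bytes) -> str:
        digest = hashlib.sha256(payload).hexdigest()
        prior = self._seen.get((tenant_id, key))
        if prior is None:
            self._seen[(tenant_id, key)] = digest
            return "accept"
        return "replay" if prior == digest else "quarantine"
```

Because the store is keyed by (tenantId, key), identical idempotency keys from different tenants never collide — the scope is strictly per tenant.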

Appendix B — Guard Decision Table (Sketch)

Condition Action Outcome (Problem+JSON) Evidence Emitted
Missing X-Tenant-Id and no tenant claim Reject 400 …/missing-tenant GuardDecision:MissingTenant
Tenant in header ≠ tenant in JWT/body Quarantine 202 …/tenant-mismatch GuardDecision:TenantMismatch
Cross-tenant query (predicate would touch another tenant) Reject 403 …/cross-tenant GuardDecision:CrossTenant
Residency conflict (caller/suggested silo ≠ registry dataSilo) Reject 409 …/residency-conflict GuardDecision:ResidencyConflict
Missing X-Idempotency-Key on write Reject 400 …/idempotency-required GuardDecision:IdempotencyMissing
Duplicate idempotency key with different payload hash Quarantine 202 …/idempotency-mismatch GuardDecision:IdempotencyDrift
Token invalid/expired Reject 401 …/invalid-token GuardDecision:AuthFailed
Break-glass token used outside allowedTenants/TTL/ops/silo Reject + Revoke 403 …/breakglass-scope BreakGlass.Used/Revoked
Rate limit/Quota exceeded (per tenant) Backpressure 429 …/rate-limit + Retry-After GuardDecision:Throttled
Policy version stale/unknown Reject or Cached 409 …/policy-version-stale or proceed basis:Cached Policy.CacheStale
BYOK key unavailable Fail closed (R) 503 …/key-unavailable (reads); writes buffered Key.Unavailable
Admin-only endpoint without role Reject 403 …/forbidden GuardDecision:Forbidden

Legend

  • Quarantine ⇒ accept to a review lane; no user-visible write until operator approves.
  • Backpressure ⇒ per-tenant throttling; other tenants unaffected.
  • Fail closed (R) ⇒ read denied; writes buffered in-silo (never plaintext).
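A few rows of the decision table can be sketched as a first-match evaluator (illustrative Python with hypothetical names; real guards also consult the Tenant Registry and Policy Engine before allowing a request):

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class GuardRequest:
    header_tenant: Optional[str]   # X-Tenant-Id, if present
    claim_tenant: Optional[str]    # tenant / tid claim from the JWT
    has_idempotency_key: bool
    is_write: bool

def guard_decide(req: GuardRequest) -> Tuple[str, int, str]:
    """Evaluate a few decision-table rows in order; returns
    (action, http_status, problem_type_suffix)."""
    if req.header_tenant is None and req.claim_tenant is None:
        return ("Reject", 400, "missing-tenant")
    if (req.header_tenant and req.claim_tenant
            and req.header_tenant != req.claim_tenant):
        return ("Quarantine", 202, "tenant-mismatch")
    if req.is_write and not req.has_idempotency_key:
        return ("Reject", 400, "idempotency-required")
    return ("Allow", 200, "")
```

Each outcome would also emit the corresponding GuardDecision evidence event and tag the trace, per the emissions section below.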

Emissions & tracing

  • Every decision emits a structured GuardDecision{tenantId, decision, reason, correlationId, policyVersion} and tags the trace (tenantId, decision, basis, route).
  • Problem responses include type, title, correlationId, and do not leak tenant identities across boundaries.

Related documentation


  • Platform → Security & Compliance — guard policies, masking profiles, break-glass, legal hold & DSAR (see security/compliance section in architecture.md and hld.md).
  • Implementation → Persistence & Storage — partitioning keys, rollover/ILM, integrity chains (see storage sections in architecture.md, deployment-views.md).
  • Guides → Quickstart — Tenant Onboarding — operator procedures, smoke checks, evidence (see tutorials/getting-started.md#tenant-onboarding).
  • See also: hld.md (Architecture/HLD), components.md (service responsibilities), data-model.md (entities & contracts), sequence-flows.md (ingest/query/export paths), deployment-views.md (regions, failover), use-cases.md (SRE & compliance scenarios).