
Deployment Views - Audit Trail Platform

Purpose, Scope & Reader Map

Azure-first assumption. Every deployment view and example in this page targets Microsoft Azure as the primary platform (AKS, ACR, Key Vault, Service Bus, Front Door/WAF, Entra ID Workload Identity, Azure Monitor). Portability tips may be noted inline, but Azure is the baseline.

What a Deployment View is (and isn’t)

A deployment view shows how the Audit Trail Platform (ATP) actually runs in the cloud: the runtime topology of clusters, namespaces, services, data stores, networks, identities, and control planes per environment/region. It focuses on where components live, how they’re wired (ingress/egress, identity, secrets, policies), and what non-functional controls are enforced (observability, resilience, security, compliance).

A deployment view is not:

  • A business or domain model (see Architecture/Components/Data Model).
  • A sequence-of-calls or behavior spec (see Sequence Flows).
  • A CI/YAML how-to or runbook (see CI/CD & Ops docs referenced later).

Reader Map — “Read this if you…”

  • SRE / DevOps: Need to know where services run (AKS/namespace), how traffic enters (Front Door/WAF → APIM/Ingress), how they scale (HPA/KEDA), which queues/topics they use (ASB), and how to observe them (OTel → Azure Monitor/Grafana).
  • Security / Compliance: Need trust boundaries, mTLS & RBAC/ABAC points, secret & key custody (Key Vault, KMS), WORM/immutability for evidence, tenancy isolation, and residency.
  • Solution / Enterprise Architects: Need environment overlays (dev/qa/stage/prod), regional variants, shared services (APIM, ASB, KV, ACR), and DR/failover posture.
  • Backend Developers: Need service placements, env vars/secret sources, message contracts via topics/queues, and feature flag attachment points.
  • Data/Analytics: Need to see authoritative stores vs. projections, export surfaces, and retention/immutability constraints that impact analytics jobs.

Scope of this Page

  • Included: Environment/region topologies; edge and networking paths; identity & secrets; data plane tiers; messaging/DLQ/replay; scale policies; observability hooks; security controls; DR; cost guardrails.
  • Excluded: Detailed API definitions, domain models, end-to-end behavior flows, and step-by-step runbooks (these are cross-linked).

Diagram Legend & Conventions (used across all deployment views)

Abbreviations

| Abbrev | Azure Service / Concept | Notes |
|---|---|---|
| AKS | Azure Kubernetes Service | Primary compute/orchestration plane |
| APIM | Azure API Management | Optional/edge API gateway (alt: NGINX Ingress) |
| AFD | Azure Front Door + WAF | Global edge, WAF rules, TLS |
| ASB | Azure Service Bus | Topics/queues/DLQ, idempotent replay |
| KV | Azure Key Vault | Secrets/keys; CSI driver in pods |
| ACR | Azure Container Registry | Signed images/SBOMs |
| AAD/WI | Entra ID & Workload Identity | Pod-level identity (no secrets in env) |
| OTel | OpenTelemetry | Traces/metrics/logs export |
| AM/LA/AppI | Azure Monitor / Log Analytics / App Insights | Observability backends |
| BLOB (IMM) | Azure Blob Storage (immutability) | WORM for evidence/segments/exports |
| SQL/COS/RED | Azure SQL / Cosmos DB / Redis | Authoritative stores, projections, caching |
| KEDA/HPA | Event/metrics-based autoscaling | Queue depth, CPU, RPS, projection lag |
| NP/PSS | NetworkPolicy / Pod Security Standards | East-west control & hardening |

Notation & Styling

  • Namespaces: atp-<domain>-ns (e.g., atp-ingest-ns, atp-query-ns).
  • Resource naming: atp-<svc>-<env>-<region> (e.g., atp-ingest-prod-eus).
  • Trust boundaries: large boxes with a bold border labeled “Edge”, “Cluster”, “Data Plane”.
  • Control-plane vs data-plane: dashed lines for control-plane calls (e.g., metrics/identity), solid for data paths.
  • Secrets/keys: key/lock glyphs near pods using KV CSI; identities labeled with a WI: prefix.
  • Tenancy markers: tenantId tag shown on storage/index/messaging resources; partitioning called out explicitly.
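As a quick illustration, the naming convention can be checked mechanically. The regex below is a hypothetical validator — the environment and region alternatives are examples from this page, not an authoritative list.

```python
import re
from typing import Optional

# Hypothetical validator for the atp-<svc>-<env>-<region> naming
# convention described above; env/region alternatives are illustrative.
NAME_RE = re.compile(
    r"^atp-(?P<svc>[a-z]+)-(?P<env>preview|dev|qa|staging|prod)-(?P<region>[a-z]{2,3})$"
)

def parse_resource_name(name: str) -> Optional[dict]:
    """Return the name's components, or None if it does not conform."""
    m = NAME_RE.match(name)
    return m.groupdict() if m else None
```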

Cross-References (read these alongside Deployment Views)

  • High-Level Design (HLD): overall capabilities and quality attributes → hld.md
  • Architecture Overview: logical components and bounded contexts → architecture.md
  • Components: per-service responsibilities & contracts → components.md
  • Sequence Flows: hot paths, back-pressure & replay steps → sequence-flows.md
  • Data Model: entities, partitions, residency flags → data-model.md
  • Use Cases: operator/compliance scenarios this topology must satisfy → use-cases.md

With these conventions in place, subsequent sections (Environments & Release Trains, Base Topology, Networking & Edge, etc.) will reuse the same legend and Azure-first primitives to keep every diagram and table consistent.


Environments & Release Trains (Azure-first)

Environments

  • preview — short-lived per-PR environments (AKS namespace or isolated RG) for UI/UX review and early e2e checks.
  • dev — shared integration playground; fast iteration; feature flags on; relaxed quotas.
  • qa — system verification; stable datasets; cross-service tests; load and chaos rehearsals off-hours.
  • staging — prod-like (same SKUs/quotas); change-freeze windows enforced; canary rehearsals; DR drills.
  • prod — multi-region, compliance & observability hardening on; WORM/immutability fully enforced.

Azure baseline: AKS + ACR + Key Vault (CSI), Front Door/WAF → APIM/Ingress, Service Bus, Azure Monitor (App Insights/Log Analytics), Entra ID Workload Identity.

Promotion lanes

  • Mainline train: main → dev → qa → staging → prod. Standard feature flow with automated checks at every hop and staged rollout (canary → region → global).
  • Hotfix train: hotfix/* → staging (ring-0) → prod. Minimal blast radius, expedited checks (security + smoke + SLO guardrails), post-deploy follow-up to qa/dev.

Regional variants

  • Target codes (examples): us (eastus), eu (westeurope), il (israelcentral).
    • dev/qa: typically single region (us) to contain cost.
    • staging/prod: per-region with identical topology; prod may run active-active for ingest/query.

Risk gates (examples)

  • Pre-deploy: image signing (ACR + Cosign), Defender for Cloud scan, IaC policy check (Bicep/OPA), SBOM presence.
  • Deploy-time: Azure DevOps Environment checks (required reviewers, change ticket link), maintenance window tag, migration dry-run.
  • Post-deploy: synthetic smoke (App Insights), SLO burn-rate guardrails (p95 ingest latency/error rate), auto-rollback on breach, DLQ drift watch.

Promotion overview (lanes & rings)

flowchart LR
  subgraph Mainline
    A[main commit] --> P[preview]
    P --> D[dev]
    D --> Q[qa]
    Q --> S[staging]
    S --> C1{canary 5–10%}
    C1 -->|pass| R1[prod us]
    R1 --> R2[prod eu]
    R2 --> R3[prod il]
    C1 -->|fail| Rb1[rollback]
  end

  subgraph Hotfix
    H[hotfix/*] --> S2[staging - ring-0]
    S2 --> C2{canary 5–10%}
    C2 -->|pass| P1[prod target region]
    C2 -->|fail| Rb2[rollback]
  end

Matrix — Environment × Region × Release Train × Approval Gates

| Environment | Regions in scope | Release train(s) | Approval & checks (Azure DevOps Environments) |
|---|---|---|---|
| preview | Same region as target cluster (usually us) | Mainline (per PR) | Auto only: build + unit/integration tests, image signed, IaC policy ok, ephemeral namespace/RG cleanup registered |
| dev | us | Mainline | Auto: build + tests + vulnerability scan + KV/CSI mount check + Service Bus topic reachability; no manual approvers |
| qa | us (optionally eu) | Mainline | Auto: contract tests, e2e workflows, load smoke; Manual: QA owner if schema migration present; SLO guardrails simulated |
| staging | us, eu, il | Mainline & Hotfix | Manual: Release Manager + Security; Auto: canary rehearsal, synthetic smoke, data-migrations dry-run, feature flags staged |
| prod | us, eu, il | Mainline & Hotfix | Manual: RM + On-call SRE (2-eyeballs); Auto: WAF policy sync, key roll check, canary 5–10% + SLO burn-rate; auto-rollback + incident stub if breached |

Notes

  • SLO guardrails (examples): ingest p95 ≤ 300 ms; query p95 ≤ 500 ms; error rate ≤ 1%; consumer lag ≤ 30 s. Breach during canary → automatic rollback and DLQ snapshot.
  • Change types: data-shape changes require staging soak ≥ 24h; security policy changes require Security reviewer even on hotfix.
  • Regional cadence: promote us → eu → il with observation windows; emergency hotfix may target a single region first.
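The example SLO guardrails above reduce to a simple breach check during canary. The thresholds mirror the bullets; the types and names are illustrative, not a real pipeline API.

```python
from dataclasses import dataclass

# Canary guardrail sketch using the example SLOs listed above.
# An empty breach list means the canary can promote; any breach
# triggers rollback (and a DLQ snapshot, per the notes).
@dataclass
class CanaryMetrics:
    ingest_p95_ms: float
    query_p95_ms: float
    error_rate: float       # fraction, e.g. 0.004 == 0.4%
    consumer_lag_s: float

def guardrail_breaches(m: CanaryMetrics) -> list:
    breaches = []
    if m.ingest_p95_ms > 300:
        breaches.append("ingest p95 > 300 ms")
    if m.query_p95_ms > 500:
        breaches.append("query p95 > 500 ms")
    if m.error_rate > 0.01:
        breaches.append("error rate > 1%")
    if m.consumer_lag_s > 30:
        breaches.append("consumer lag > 30 s")
    return breaches
```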

This section defines where code can land, how it moves, which regions participate, and what gates enforce safety, so the following topology sections can reference these lanes without re-explaining the mechanics.


Base Topology (Kubernetes + Mesh) — Azure-first

This section anchors the runtime topology of the Audit Trail Platform (ATP) on AKS with a service mesh (mTLS-by-default), Azure edge, and core Azure dependencies. It acts as the base layer that later sections (networking, data plane, scaling, security) reference.

Cluster overview (rings, pools, namespaces)

  • AKS rings:
    • system ring (managed components, mesh control plane, OTel collector).
    • user ring (all ATP workloads).
  • Node pools (typical):
    • np-system (small, reserved for control/system).
    • np-generic (stateless web/API pods: Gateway, Query, Admin).
    • np-io (I/O-heavy: Ingestion, Projection, Export, Integrity).
    • Optional np-jobs (cron/maintenance/export windows).
  • Namespaces (examples):
    • atp-gateway-ns, atp-ingest-ns, atp-policy-ns, atp-projection-ns, atp-query-ns, atp-integrity-ns, atp-export-ns, atp-admin-ns.
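The namespace conventions above can be sketched as a manifest. The Pod Security Standards labels use the standard pod-security.kubernetes.io keys; the atp.io/domain label is a hypothetical addition for policy selection.

```yaml
# Sketch of an ATP namespace following the naming conventions above,
# with NP/PSS hardening applied via standard PSS labels.
apiVersion: v1
kind: Namespace
metadata:
  name: atp-ingest-ns
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
    atp.io/domain: ingest   # hypothetical label for NetworkPolicy selectors
```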

C4-style deployment diagram (edge → AKS → Azure services)

flowchart TB
  %% Edge
  subgraph EDGE["Azure Edge"]
    AFD["AFD + WAF"]
    APIM["API Management (optional)"]
  end

  %% AKS Cluster
  subgraph AKS["AKS Cluster (mTLS via Mesh)"]
    direction TB

    subgraph SYS["system ring / namespaces"]
      OTL["OTel Collector (DaemonSet)"]
      MESHCP["Mesh control plane"]
    end

    subgraph GWNS["ns: atp-gateway-ns"]
      GW["Gateway Pod<br/>(sidecars: Envoy, KV-CSI, OTel)"]
    end

    subgraph INGNS["ns: atp-ingest-ns"]
      ING["Ingestion Pod<br/>(sidecars: Envoy, KV-CSI, OTel)"]
    end

    subgraph POLNS["ns: atp-policy-ns"]
      POL["Policy Pod<br/>(sidecars: Envoy, KV-CSI, OTel)"]
    end

    subgraph PRJNS["ns: atp-projection-ns"]
      PRJ["Projection Pod<br/>(sidecars: Envoy, KV-CSI, OTel)"]
    end

    subgraph QRYNS["ns: atp-query-ns"]
      QRY["Query Pod<br/>(sidecars: Envoy, KV-CSI, OTel)"]
    end

    subgraph INTNS["ns: atp-integrity-ns"]
      INT["Integrity Pod<br/>(sidecars: Envoy, KV-CSI, OTel)"]
    end

    subgraph EXSNS["ns: atp-export-ns"]
      EXP["Export Pod<br/>(sidecars: Envoy, KV-CSI, OTel)"]
    end

    subgraph ADMNS["ns: atp-admin-ns"]
      ADM["Admin Pod<br/>(sidecars: Envoy, KV-CSI, OTel)"]
    end
  end

  %% Azure backing services
  subgraph AZSVC["Azure Services"]
    ASB["Service Bus (topics/queues/DLQ)"]
    KV["Key Vault<br/>+ CSI driver"]
    BLOB["Blob Storage (WORM)"]
    DB["Azure SQL / Cosmos DB"]
    MON["Azure Monitor / App Insights / Log Analytics"]
  end

  %% Traffic & dependencies
  AFD --> APIM
  APIM --> GW
  GW --> ING
  GW --> QRY
  ING --> ASB
  ASB --> PRJ
  QRY --> DB
  PRJ --> DB
  INT --> BLOB
  EXP --> BLOB
  GW --> MON
  ING --> MON
  QRY --> MON
  PRJ --> MON
  INT --> MON
  EXP --> MON
  ADM --> MON
  GW -. secrets .-> KV
  ING -. secrets .-> KV
  POL -. secrets .-> KV
  PRJ -. secrets .-> KV
  QRY -. secrets .-> KV
  INT -. secrets .-> KV
  EXP -. secrets .-> KV
  ADM -. secrets .-> KV

Core pods and responsibilities

| Pod | Role in topology | Primary dependencies |
|---|---|---|
| Gateway | Public/API entry; authN/Z; request shaping; tenancy guards; version routing | AFD/WAF → APIM/Ingress, Key Vault (certs), Azure Monitor |
| Ingestion | Append-only intake; schema validation; outbox → ASB | Service Bus (topics), Key Vault, OTel |
| Policy | Policy resolution/decisions (classification, retention, redaction plans) | DB (policy store), KV, OTel |
| Projection | Build read models & search indexes from events | ASB (subscribe), DB (projections), OTel |
| Query | Tenant-scoped queries with verify-on-read options | DB (authoritative/projections), BLOB (evidence, when applicable) |
| Integrity | Seal/verify segments, hash-chains, Merkle roots; publish proofs | BLOB (WORM), KV (signing keys), OTel |
| Export | Egress pipelines (signed exports, legal holds, redaction applied) | BLOB (export), ASB (jobs), OTel |
| Admin | Ops/Config UX, feature flags, maintenance hooks | DB (config), KV, OTel |

Sidecars and daemons

| Component | Placement | Purpose |
|---|---|---|
| Envoy/mesh sidecar | Every app pod | mTLS, retries/timeouts, policy enforcement, telemetry taps |
| KV CSI driver | Every app pod needing secrets/keys | Mount short-lived secrets; avoid env var secrets; rotation-friendly |
| OTel agent | Sidecar (per pod) or node DaemonSet | Trace/metric/log export to Azure Monitor backends |
| Log shipper (optional) | Sidecar/DaemonSet | Structured logs to Log Analytics with tenant/edition tags |
| Mesh control plane | System namespace | Certificate issuance, identity, traffic policy distribution |
| OTel Collector | System namespace (DaemonSet) | Centralize/transform telemetry; batching and export |

Ingress path and mesh

  • External ingress: AFD + WAF → APIM (or direct NGINX/Envoy Ingress) → Gateway.
  • East-west: all service-to-service flows run inside the mesh with mTLS, RBAC/ABAC at the Gateway and per-service boundaries.
  • Egress: deny-by-default, egress policies only to ASB/KV/DB/BLOB/Monitor endpoints.

Stateful anchors (data plane)

  • Authoritative stores: Azure SQL / Cosmos DB (tenant-partitioned).
  • Event transport: ASB topics/queues with DLQ per subscription; idempotent replay.
  • Evidence & segments: Blob WORM containers with lifecycle policies and legal holds.
  • Observability: OTel → Azure Monitor/App Insights/Log Analytics; dashboards in Grafana (optional).
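The KEDA triggers mentioned above (queue depth, projection lag) can be sketched as a ScaledObject. Topic, subscription, and authentication names are illustrative; the referenced TriggerAuthentication (workload-identity based) is assumed to exist separately.

```yaml
# Sketch: scale Projection workers on Service Bus subscription backlog
# via KEDA's azure-servicebus scaler. Names and limits are illustrative.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: atp-projection-scaler
  namespace: atp-projection-ns
spec:
  scaleTargetRef:
    name: projection            # Deployment to scale
  minReplicaCount: 2
  maxReplicaCount: 20
  triggers:
    - type: azure-servicebus
      metadata:
        topicName: atp.audit.appended.v1
        subscriptionName: projection-sub
        messageCount: "500"     # target backlog per replica
      authenticationRef:
        name: atp-asb-trigger-auth   # hypothetical TriggerAuthentication
```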

This base topology is the “map” subsequent sections will annotate with networking controls, scaling triggers, security policies, DR paths, and cost guardrails.


Networking & Edge

This section codifies the public front-door, API edge, and east–west policies for ATP on Azure. It assumes Azure Front Door (AFD) + WAF at the edge, API Management (APIM) (or direct Ingress), and a service mesh with mTLS inside AKS.

Deployment diagram — Internet → Edge → Gateway → Mesh

flowchart TB
  user[Internet Clients]

  subgraph EDGE["Azure Front Door (AFD) + WAF"]
    waf[Managed rules + custom rules<br/>HSTS, TLS, geo/IP filters]
    dns[Azure DNS<br/> - Apex/CNAME to AFD]
  end

  subgraph APIEDGE["API Edge"]
    apim[API Management<br/> rate-limit-by-key, version routing,<br/>JWT validation, request shaping]
    or[(or)]
    ing[Ingress Controller - NGINX/Envoy<br/>TLS passthrough/termination]
  end

  subgraph AKS["AKS (Service Mesh mTLS)"]
    gw[Gateway Pod]
    subgraph eastwest["East–West (mTLS, RBAC/ABAC, NetworkPolicies)"]
      ingest[Ingestion]
      policy[Policy]
      proj[Projection]
      query[Query]
      integ[Integrity]
      export[Export]
      admin[Admin]
    end
  end

  user -->|HTTPS :443| dns --> waf --> apim
  waf -.->|CNAME| dns
  apim -->|mTLS/TLS| gw
  user -. alt path .-> waf --> ing
  ing --> gw
  gw --> ingest
  gw --> query
  gw --> policy
  ingest <--> proj
  proj --> query
  gw --> admin
  gw --> export
  gw --> integ

Edge (public) controls

  • WAF at AFD: Managed rule set + custom rules (blocklists, geo-IP), bot protection, anomaly scoring; headers normalized at edge.
  • TLS: Terminate at AFD with managed/Key Vault-backed certs; re-encrypt to APIM/Ingress; mTLS enforced inside mesh.
  • HSTS: max-age=31536000; includeSubDomains; preload at the edge.
  • DNS: Azure DNS apex → AFD (CNAME). Use CAA records for allowed CAs. Region-specific subdomains optional (e.g., api-eu.example.com).
  • CORS stance: Default deny. Explicit allow-list per SPA/portal origin; short preflight cache (60–120s); no wildcard with credentials.
  • Rate limiting (public):
    • At APIM: rate-limit-by-key on {tenantId|clientId|subscriptionKey}; burst & sustained windows; 429 with Retry-After.
    • Optional at AFD: simple per-IP throttles for volumetric abuse before APIM.
  • Version routing:
    • Header (x-api-version), path (/v1/…), or revision (APIM); map to canary rings (see deployment lanes).

API Gateway & request shaping

  • AuthN/Z: JWT verification at APIM (client apps) and tenancy guards at Gateway (ABAC on tenantId, edition).
  • Request policies: schema/size limits, idempotency-key normalization, anti-replay nonce for append endpoints.
  • Canary routing: APIM or Ingress splits to vNext Gateway by header/percentage; ringed rollout (5–10% → 50% → 100%).
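As one concrete (non-APIM) option, the header/percentage split can be expressed with NGINX ingress canary annotations. Host, service name, and weight below are illustrative.

```yaml
# Sketch: route 10% of traffic (or any request carrying the opt-in
# header) to the vNext Gateway service via an NGINX canary Ingress.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: gateway-canary
  namespace: atp-gateway-ns
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"          # 5–10% ring
    nginx.ingress.kubernetes.io/canary-by-header: "x-canary" # opt-in header
spec:
  ingressClassName: nginx
  rules:
    - host: api.example.com        # illustrative hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: gateway-vnext
                port:
                  number: 8080
```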

East–west policies (inside AKS + mesh)

  • mTLS-by-default: All pod-to-pod traffic via mesh Envoy; strong identities (Entra Workload Identity/SPIFFE).
  • NetworkPolicies:
    • Default deny per namespace; allow only from Gateway, required system pods, and specific producer/consumer pairs.
    • Cross-namespace traffic is allow-listed (e.g., atp-gateway-ns → atp-query-ns on service ports only).
  • Service exposure: ClusterIP internally; no NodePort/LoadBalancer for app services (ingress-only).
  • Egress controls:
    • Deny-by-default; allow only to Private Link endpoints for Service Bus, Key Vault, Storage, SQL/Cosmos, Monitor.
    • Optional Azure Firewall/NVA with FQDN tags; egress proxy for audited outbound HTTP if required.

Example policy snippets

APIM (rate limit by tenant)

<inbound>
  <validate-jwt header-name="Authorization" failed-validation-httpcode="401" />
  <rate-limit-by-key calls="300" renewal-period="60" counter-key="@(context.Request.Headers.GetValueOrDefault("x-tenant-id","anon"))" />
  <set-header name="x-request-id" exists-action="override">
    <value>@(Guid.NewGuid().ToString())</value>
  </set-header>
</inbound>

Kubernetes NetworkPolicy (deny all, allow Gateway → Query)

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-gateway-to-query
  namespace: atp-query-ns
spec:
  podSelector:
    matchLabels:
      app: query
  policyTypes: [Ingress, Egress]
  ingress:
  - from:
    - namespaceSelector:
        matchLabels: { name: atp-gateway-ns }
      podSelector:
        matchLabels: { app: gateway }
    ports:
    - protocol: TCP
      port: 8080
  egress: [] # default deny; mesh handles sidecar-to-sidecar
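The allow rule above presumes a default-deny baseline in the namespace; a minimal sketch of that baseline policy:

```yaml
# Baseline default-deny policy that the allow-gateway-to-query rule
# layers on; with no ingress/egress rules listed, all traffic is
# denied unless another policy explicitly allows it.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: atp-query-ns
spec:
  podSelector: {}               # selects every pod in the namespace
  policyTypes: [Ingress, Egress]
```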

Traffic paths (summarized)

  • Internet → Edge: Client → Azure DNS → AFD/WAF (TLS, HSTS, WAF rules).
  • Edge → API: AFD → APIM (rate limit, JWT, version routing) or AFD → Ingress (TLS passthrough/termination).
  • API → Gateway: APIM/Ingress → Gateway (mTLS at mesh boundary, tenancy guards).
  • Gateway → Services: Gateway → {Ingestion, Query, Policy, …} over mTLS, constrained by NetworkPolicies.
  • Service egress: Only to Private Link PaaS endpoints (ASB/KV/Storage/DB/Monitor) under explicit egress policies.

These controls ensure a hardened, deterministic path from the public Internet to tenant-scoped services with layered defense: WAF → APIM/Ingress → Gateway → mTLS mesh, with default-deny network policies and Private Link egress.


Identity, Secrets & KMS (Azure-first)

This section defines how workloads authenticate (no long-lived secrets), how secrets are delivered (ephemeral mounts, rotation), and how signing keys (integrity proofs, export packaging) are owned, rotated, and backed up.

Principles

  • Identity over secrets. Prefer Entra ID Workload Identity (federated OIDC) and RBAC to access PaaS (Service Bus, Storage, SQL/Cosmos) — no connection strings or SAS where possible.
  • Short-lived & mount-only. If a secret is unavoidable, deliver via Key Vault CSI Driver as files (tmpfs), not env vars; keep TTL short and rotate automatically.
  • HSM for signing keys. Integrity/export signing keys live in Azure Key Vault Managed HSM with versioned rotation and backup/restore packages.
  • Purge protection & immutable evidence. KV has soft-delete + purge protection; signed artifacts land in Blob WORM with legal holds.

Workload Identity (pods → Entra ID)

  • Each AKS ServiceAccount is mapped to an Entra application with a federated credential (issuer = cluster OIDC, subject = SA).
  • Pod retrieves an OIDC token → exchanges for an Entra access token → calls Azure APIs (Key Vault, Service Bus, Storage) using RBAC role assignments.
  • The mesh handles mTLS between pods; mesh identities can optionally follow SPIFFE-style IDs for service-to-service policy.
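A minimal sketch of the ServiceAccount side of this federation, assuming AKS workload identity is enabled on the cluster; the client-id GUID is a placeholder for the Entra application whose federated credential trusts this SA.

```yaml
# Sketch: ServiceAccount mapped to an Entra application via workload
# identity. Pods using this SA also need the label
# azure.workload.identity/use: "true" on the pod template.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: atp-integrity-sa
  namespace: atp-integrity-ns
  annotations:
    azure.workload.identity/client-id: "00000000-0000-0000-0000-000000000000"
```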

Secrets delivery (Key Vault + CSI)

  • Source: Azure Key Vault (KV) secrets/certs; consumption: mounted via CSI as files into pods.
  • Rotation:
    • App secrets (HMAC, webhooks, OAuth client secrets): 60–90 days.
    • TLS certs (Ingress/APIM/AFD): managed renewal (CA) or ≤90 days if BYOC.
    • Per-tenant salts/keys (tokenization): 90–180 days with overlap window.
  • Reload: Sidecars or apps watch mounted paths and reload without restart where possible (SIGHUP/hot reload); otherwise rolling restart on secret version change.
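A sketch of the CSI mount source described above, using the Secrets Store CSI driver's SecretProviderClass. Vault, tenant, and object names are placeholders.

```yaml
# Sketch: deliver a webhook HMAC secret from Key Vault as a
# tmpfs-mounted file (no env vars), authenticated via workload identity.
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: atp-gateway-secrets
  namespace: atp-gateway-ns
spec:
  provider: azure
  parameters:
    clientID: "00000000-0000-0000-0000-000000000000"   # workload identity
    keyvaultName: kv-atp-prod-eus                      # placeholder vault
    tenantId: "11111111-1111-1111-1111-111111111111"
    objects: |
      array:
        - |
          objectName: webhook-hmac
          objectType: secret
```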

Integrity & export signing keys (Managed HSM)

  • Key types: RSA-3072 (or ECC P-256) signing keys stored in Managed HSM; operations performed in-HSM.
  • Rotation strategy: Staged dual-sign (old+new) for N days → flip trust root → revoke old. Maintain key versions and kid in proofs.
  • Backup: HSM backup package to secured Storage Account with private endpoint, encrypted at rest with CMK. Tested restore drills.
  • Access: Integrity and Export services get sign permission via role assignments; no get key material.
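The dual-sign rotation can be sketched from the verifier's side. This is illustrative Python only: HMAC stands in for the in-HSM RSA/ECC signature so the sketch is runnable; real verification would check each kid's public key, with key material never leaving the HSM.

```python
import hashlib
import hmac

# Dual-sign rotation sketch: during the overlap window, verifiers trust
# both old and new key versions (kid); afterwards the old kid is revoked.
TRUSTED_KIDS = {
    "v2025-04-01": b"old-key-material",   # being retired
    "v2025-10-01": b"new-key-material",   # current
}

def sign(digest: bytes, kid: str) -> bytes:
    # Stand-in for an in-HSM sign operation.
    return hmac.new(TRUSTED_KIDS[kid], digest, hashlib.sha256).digest()

def verify(digest: bytes, signature: bytes, kid: str) -> bool:
    """Accept a proof only if its kid is still inside the trust window."""
    key = TRUSTED_KIDS.get(kid)
    if key is None:
        return False  # kid revoked after the overlap window
    expected = hmac.new(key, digest, hashlib.sha256).digest()
    return hmac.compare_digest(expected, signature)
```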

Secret/Key catalog

| Secret/Key | Owner | Rotation | Consumer(s) | Storage Class |
|---|---|---|---|---|
| Workload Identity (pod → Entra) | Platform/SRE | N/A (token TTL ≤60m) | All pods | Entra ID federated credential; RBAC on target resources |
| Gateway TLS cert (edge) | Platform/Sec | ≤90d (managed renewal) | AFD/APIM/Ingress/Gateway | Key Vault Certificate (soft-delete + purge protect); delivered to edge/ingress |
| Mesh mTLS certs | Platform/SRE | 30–90d (auto by mesh CA) | All app pods (Envoy) | Mesh CA store (control plane); root anchored in KV/MHSM if BYOC |
| Webhook HMAC secrets | App Team | 60–90d | Gateway/Export | KV Secret → CSI mount (tmpfs); no env vars |
| Per-tenant tokenization salt/key | Security | 90–180d (overlap window) | Ingestion/Redaction | KV Secret (scoped per tenant/edition) → CSI |
| Integrity signing key | Security/Compliance | 180–365d (dual-sign rollout) | Integrity service | Managed HSM Key (sign only); HSM backup package |
| Export package signing key | Security/Compliance | 180–365d | Export service | Managed HSM Key (sign only); HSM backup package |
| Storage Account CMK (SSE-CMK) | Security | 365d (auto-rotate) | Blob (WORM) / archival stores | KV Key bound to Storage; rotation via key version |
| DB access | Platform/SRE | N/A | Services using SQL/Cosmos | AAD auth (no secrets); RBAC roles, Private Link |
| Service Bus access | Platform/SRE | N/A | Ingestion/Projection/Export | AAD RBAC (no SAS); fallback SAS in KV if required (≤30d) |

Storage Class legend: Entra (federated identity + RBAC), KV Secret/Certificate (soft-delete & purge protection), Managed HSM Key (in-HSM ops only), CSI mount (ephemeral files, tmpfs), SSE-CMK (storage encryption with customer-managed key).


Signer usage flow (Integrity service)

sequenceDiagram
  autonumber
  participant INT as Integrity Pod
  participant WI as Entra Workload Identity
  participant KV as Key Vault / Managed HSM
  participant BLOB as Blob Storage (WORM)
  participant MON as Azure Monitor

  INT->>WI: Request AAD token (federated OIDC from SA)
  WI-->>INT: Access token (scope: https://vault.azure.net)
  INT->>INT: Compute segment root (Merkle) & digest payload
  INT->>KV: Sign(digest) with keyId (kid=v2025-10-01)
  KV-->>INT: Signed blob (signature)
  INT->>BLOB: Write {manifest, root, signature, kid} to WORM container
  INT->>MON: Emit audit/trace (sign op id, kid, segmentId)
  Note over INT,KV: Key material never leaves the HSM; the sign operation runs in-HSM

Operational policies & runbooks (highlights)

  • Emergency rotation (signing keys): generate new version, enable dual-sign immediately, update trust config in Gateway/Query verification, revoke old after N days; publish advisory event on ASB.
  • KV hygiene: soft-delete and purge protection must be on; require Defender for Cloud checks before deploy; access policies managed via RBAC, not legacy ACLs.
  • Secret sprawl control: quarterly scan for env var secrets, SAS tokens, or connection strings; replace with AAD RBAC patterns.
  • Backups: HSM backup/restore drills quarterly; verify WORM containers’ retention & legal hold settings align with compliance.

With these controls, workloads authenticate using identity, secrets are short-lived and mounted, and critical signing keys are protected by Managed HSM with auditable rotation and backup.


Data Plane (Hot/Warm/Cold) & Residency (Azure-first)

This section places the authoritative stores and tiers (hot/warm/cold), explains tenancy partitioning and residency, and sets backup/restore objectives that align with compliance (immutability, legal hold) and SLOs.

Tiered storage diagram

flowchart LR
  subgraph REGION_US["Region: US (example)"]
    direction TB

    subgraph HOT["HOT — Append & Evidence (Immutable)"]
      HOTBLOB["Blob Storage (WORM)<br/>container: atp-{tenant}-hot<br/>objects: segments, manifests, roots<br/>Retention: time-based + legal hold"]
      HOTIDX["Hot Index (SQL/Cosmos)<br/>segment catalog & pointers<br/>RLS/PK=tenantId"]
    end

    subgraph WARM["WARM — Projections & Query"]
      PROJDB["Azure SQL / Cosmos DB<br/>projections/read models<br/>RLS/PK=tenantId + time"]
      SEARCH["Azure AI Search (optional)<br/>full-text/index aliases per region"]
      CACHE["Redis (optional)<br/>query cache, TTL-scoped"]
    end

    subgraph COLD["COLD — Exports & Archives"]
      XPORT["Blob Storage (Archive/Cool)<br/>Signed export packages<br/>Legal hold capable"]
      META["SQL/Cosmos<br/>export registry (hash, kid, location)"]
    end

    ING["Ingestion Service"]
    INT["Integrity Service"]
    QRY["Query Service"]
    EXP["Export Service"]

    ING --> HOTBLOB
    ING --> HOTIDX
    INT --> HOTBLOB
    HOTIDX --> PROJDB
    QRY --> PROJDB
    PROJDB --> SEARCH
    EXP --> XPORT
    EXP --> META
  end

  note1((Private Link)):::note
  HOTBLOB --- note1
  PROJDB --- note1
  XPORT --- note1

  classDef note fill:#fff,stroke:#999,stroke-dasharray:5 5,color:#666

Tiers & stores

  • HOT (authoritative, immutable)

    • What: Append-only segments, manifests, Merkle roots, signatures written by Ingestion/Integrity.
    • Where: Azure Blob Storage containers with immutability (WORM) and time-based retention; legal hold supported.
    • Access: AAD RBAC via Private Link; no SAS; signed artifacts include kid and digest.
    • Index: Minimal hot index (SQL/Cosmos) for pointers: {tenantId, window, segmentId, blobUrl, hash, kid}.
  • WARM (operational read models)

    • What: Projections/read models derived from hot segments; optionally search for text facets.
    • Where: Azure SQL (rowstore/columnstore per table) or Cosmos DB (/tenantId partition, time bucketing).
    • Access: Query service with RLS (SQL) or tenant-scoped queries (Cosmos). Redis optional for hot keys.
    • Rebuild: Deterministic replay from HOT via Projection workers.
  • COLD (egress & long-term)

    • What: Signed export packages (ZIP/TAR + manifest + signature), audit bundles for eDiscovery.
    • Where: Blob in Cool/Archive tier; per-tenant containers if legal holds vary.
    • Index: Export registry in SQL/Cosmos (hashes, time, requester, hold flags).

Residency & RLS at each boundary

  • Regional scoping

    • Per-region accounts: us, eu, il have separate Storage/DB/Search to enforce residency.
    • Prod topology: active-active ingest/query per region; no cross-region replication for EU/IL data unless explicitly allowed by policy.
    • DR: use ZRS in-region; optional same-jurisdiction DR account (EU→EU pair). Avoid GRS across residency boundaries.
  • Tenancy partitioning

    • Blob: container per tenant (e.g., atp-{tenantId}-hot) to enable independent legal holds and retention policies.
    • SQL: Row-Level Security with predicate on tenantId; partitioning by (tenantId, eventMonth) for pruning.
    • Cosmos: Partition key /tenantId; composite indexes (tenantId, ts); per-tenant RU baselines (autoscale).
    • Search: index aliases per region, fields include tenantId; optional per-tenant index for isolation at scale.
  • Access controls

    • AAD RBAC only; Private Link endpoints for Storage/SQL/Cosmos/Search.
    • No direct client reads from HOT; reads via Query with verify-on-read (rehash + signature check) when enabled.
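Verify-on-read reduces to "rehash the evidence bytes, then compare against the hash recorded in the hot index pointer" — a minimal sketch, with a hypothetical pointer record shape (the full path would also check the signature and kid):

```python
import hashlib

# Verify-on-read sketch: Query rehashes the blob and compares against
# the hot-index pointer before serving. The pointer dict shape
# ({segmentId, hash}) is illustrative.
def verify_on_read(blob_bytes: bytes, pointer: dict) -> bool:
    return hashlib.sha256(blob_bytes).hexdigest() == pointer["hash"]

def read_segment(blob_bytes: bytes, pointer: dict) -> bytes:
    if not verify_on_read(blob_bytes, pointer):
        raise ValueError("integrity check failed for " + pointer["segmentId"])
    return blob_bytes
```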

Backup & restore objectives (RPO/RTO)

| Tier | Data set | Backup/Recovery approach | Target RPO | Target RTO |
|---|---|---|---|---|
| HOT – Blob (WORM) | Segments, manifests, roots, signatures | Immutability + versioning; secondary immutable account in same region (periodic copy); integrity sweeps (hash verify) | ~0 (append is authoritative) | ≤ 2h (account/container re-point + key trust check) |
| HOT Index – SQL/Cosmos | Segment catalog pointers | PITR (SQL), Continuous backup (Cosmos) + rebuild from HOT if needed | ≤ 5 min (or rebuild) | ≤ 4h (replay pointers) |
| WARM – Projections | Read models | PITR + replay from HOT via Projection workers | ≤ 15 min (source available) | ≤ 8h (depends on replay volume) |
| SEARCH | Full-text index | Rehydrate from WARM/HOT; store only config | N/A (derived) | ≤ 12h (reindex window) |
| COLD – Exports | Signed export packages | Copy to Archive with immutability; registry DB PITR | ≤ 24h (batch) | ≤ 24–48h (rehydrate + verify signatures) |

Notes

  • Rebuild-first strategy: WARM/SEARCH are derivative; prefer replay from HOT to ensure integrity and reduce backup cost.
  • Legal holds: Applied at container level per-tenant; holds block deletion regardless of retention end; tracked in export/hold registries.
  • Key continuity: During HOT recovery or cross-account promotion, Integrity service validates kid chain and republishes current trust roots.

Operational checks

  • Daily: HOT integrity sweep (sampled segment rehash) and pointer consistency (HOT↔Index).
  • Weekly: Projection backlog/lag SLO and replay dry-run from a checkpoint.
  • Quarterly: DR exercise: restore HOT to a fresh account in-region, replay WARM, rebuild SEARCH, and verify RPO/RTO attainment.

With HOT/WARM/COLD clearly separated, tenancy & residency enforced at storage and query layers, and replay-centric recovery, the platform maintains tamper-evidence while meeting practical RPO/RTO targets.


Messaging, DLQ & Replay (Azure-first)

This section specifies the Azure Service Bus (ASB) topology, the Outbox/Inbox deployment pattern (exactly-once processing semantics over at-least-once transport), and the replay guardrails used to rebuild projections safely.

Broker topology (topics, subscriptions, DLQs)

Namespace & connectivity

  • Namespace: sb-atp-<env>-<region> with Private Link; access via Entra ID RBAC (no SAS in app paths).
  • Partitioning: Enabled on topics to spread load; duplicate detection On (10–30 min window).
  • Message shape (conventions):
    • MessageId = stable eventId (UUID v7).
    • CorrelationId = requestId from Gateway.
    • SessionId (only where strict ordering is needed) = tenantId.
    • ApplicationProperties: tenantId, edition, schema, occurredAt, idempotencyKey.
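These conventions can be sketched as a plain envelope builder. With the azure-servicebus SDK these fields would map onto ServiceBusMessage, but no SDK is assumed here; the function and its shape are illustrative.

```python
# Envelope sketch for the message-shape conventions above:
# MessageId = stable eventId (dedup key), CorrelationId = requestId,
# SessionId = tenantId only where strict ordering is needed.
def build_envelope(event_id: str, request_id: str, tenant_id: str,
                   ordered: bool = False, **props) -> dict:
    return {
        "MessageId": event_id,
        "CorrelationId": request_id,
        "SessionId": tenant_id if ordered else None,
        "ApplicationProperties": {"tenantId": tenant_id, **props},
    }
```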

Core topics

| Topic | Purpose | Sessions | Typical producers | Typical consumers (subscriptions) |
|---|---|---|---|---|
| atp.audit.appended.v1 | Append confirmations + minimal segment metadata | Off | Ingestion | projection-sub, integrity-sub |
| atp.projection.work.v1 | Fanout work units for projection (batch/slice) | Off | Ingestion/Orchestrator | projection-sub |
| atp.policy.updated.v1 | Policy/plan changes (classification/redaction/retention) | On (tenantId) | Admin/Policy | gateway-sub, ingestion-sub, projection-sub, query-sub |
| atp.export.requested.v1 | Export job requests | Off | Gateway/Admin | export-sub |
| atp.export.completed.v1 | Export job status updates | Off | Export | admin-sub, gateway-sub |
| atp.alerts.ops.v1 | Ops/compliance events (e.g., verification anomaly) | Off | Integrity/Query | admin-sub, siem-sub |

Each subscription has its own DLQ (subqueue) with MaxDeliveryCount (5–10) and LockDuration (60–120s). Poison messages land in the subscription DLQ—not the topic.

Example subscriptions

  • projection-sub on atp.audit.appended.v1 (filter by tenantId IN (…) if needed).
  • integrity-sub on atp.audit.appended.v1 (all tenants, lower concurrency).
  • export-sub on atp.export.requested.v1 (per-tenant concurrency caps).

Deployment diagram — topics, subscriptions, DLQs

flowchart LR
  ING[Ingestion Service] -->|publish| T1[(Topic: atp.audit.appended.v1)]
  T1 --> S1[Sub: projection-sub] --> Q1[[DLQ: projection-sub/$DeadLetterQueue]]
  T1 --> S2[Sub: integrity-sub] --> Q2[[DLQ: integrity-sub/$DeadLetterQueue]]

  POL[Policy/Admin] -->|publish| T2[(Topic: atp.policy.updated.v1)]
  T2 --> S3[Sub: gateway-sub] --> Q3[[DLQ: gateway-sub/$DeadLetterQueue]]
  T2 --> S4[Sub: ingestion-sub] --> Q4[[DLQ: ingestion-sub/$DeadLetterQueue]]
  T2 --> S5[Sub: projection-sub] --> Q5[[DLQ: projection-sub/$DeadLetterQueue]]
  T2 --> S6[Sub: query-sub] --> Q6[[DLQ: query-sub/$DeadLetterQueue]]

  GW[Gateway] -->|publish| T3[(Topic: atp.export.requested.v1)]
  T3 --> S7[Sub: export-sub] --> Q7[[DLQ: export-sub/$DeadLetterQueue]]

  EXP[Export Service] -->|publish| T4[(Topic: atp.export.completed.v1)]
  T4 --> S8[Sub: admin-sub] --> Q8[[DLQ: admin-sub/$DeadLetterQueue]]

  INTEGRITY[Integrity] -->|publish alerts| T5[(Topic: atp.alerts.ops.v1)]
  T5 --> S9[Sub: admin-sub] --> Q9[[DLQ: admin-sub/$DeadLetterQueue]]
  T5 --> S10[Sub: siem-sub] --> Q10[[DLQ: siem-sub/$DeadLetterQueue]]

Outbox/Inbox pattern (per service)

Outbox (publish)

  • Each service that emits events writes to a local Outbox table within the same transaction as its state change (e.g., Ingestion → hot index pointer + outbox row).
  • A background publisher (idempotent) forwards Outbox rows to ASB, sets MessageId = eventId, and marks them dispatched on success.
  • Duplicate detection on the topic ensures “at-least-once on the wire” becomes effectively once downstream.

Inbox (consume)

  • Consumers record a ProcessedMessages (Inbox) entry keyed by (subscription, MessageId).
  • Before handling, they check Inbox; if seen, skip. After successful handling, they upsert Inbox and complete the message.
  • Long-running handlers use deferral or saga state (in SQL/Cosmos) to persist progress and avoid timeouts.
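The Inbox check can be sketched as follows, with SQLite standing in for the service's SQL/Cosmos store (the table and helper names here are illustrative, not the platform's actual schema):

```python
import sqlite3

# In-memory stand-in for the ProcessedMessages (Inbox) table.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE inbox (
    subscription TEXT NOT NULL,
    message_id   TEXT NOT NULL,
    PRIMARY KEY (subscription, message_id))""")

handled = []

def handle_once(subscription: str, message_id: str, payload: dict) -> bool:
    """Process a message at most once per (subscription, MessageId).
    Returns True if the handler ran, False for a duplicate (skip + complete)."""
    seen = db.execute(
        "SELECT 1 FROM inbox WHERE subscription = ? AND message_id = ?",
        (subscription, message_id)).fetchone()
    if seen:
        return False                 # already processed: skip, then complete
    handled.append(payload)          # the real business handler would run here
    db.execute("INSERT INTO inbox VALUES (?, ?)", (subscription, message_id))
    db.commit()
    return True                      # success: upsert Inbox, complete message

first = handle_once("projection-sub", "evt-1", {"n": 1})
dup = handle_once("projection-sub", "evt-1", {"n": 1})
```

Redelivery (prefetch, lock expiry, broker retry) then becomes harmless: the second delivery short-circuits on the Inbox row.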

Message handling guardrails

  • MaxDeliveryCount tuned per subscription (5–10); exceed → DLQ with diagnostic properties.
  • Prefetch: enable with 100–500 messages to reduce round trips (handlers must be idempotent).
  • Concurrency: enforce per-tenant limits if needed (sessions for policy.updated by tenantId).

Replay controls & guardrails

When to replay

  • Projection rebuild after schema change or data correction.
  • Integrity re-verify after key rotation or algorithm bump.
  • Tenant-scoped repair after incident.

Replay sources

  • Primary: HOT (Blob WORM) is the source of truth. Rebuild projections by reading segments and re-emitting work items (atp.projection.work.v1) rather than replaying historical broker traffic.
  • Secondary: For short gaps, resubmit from DLQ or deferred messages after remediation.

Idempotency & checkpoints

  • Idempotency key: eventId (MessageId) + handler name; persisted in Inbox.
  • Checkpoints: For bulk replays, keep progress records (tenantId, fromTs, toTs, lastSegmentId) to allow restarts.
  • Rate guards: throttle replays (KEDA scaler on backlog) to preserve SLOs for live traffic.
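Checkpointed replay can be sketched as follows: a progress record advances per segment, so a restart skips everything at or below `lastSegmentId` (illustrative helper, not the platform's job runner):

```python
from dataclasses import dataclass

@dataclass
class ReplayCheckpoint:
    tenant_id: str
    from_ts: int
    to_ts: int
    last_segment_id: int = -1   # -1 means nothing processed yet

def replay(segments, checkpoint: ReplayCheckpoint, fail_after=None):
    """Process segments in order, updating the checkpoint so a crashed
    replay restarts where it stopped. `fail_after` simulates a crash."""
    done = []
    for seg_id in segments:
        if seg_id <= checkpoint.last_segment_id:
            continue                        # already done before the restart
        if fail_after is not None and len(done) >= fail_after:
            raise RuntimeError("simulated crash")
        done.append(seg_id)                 # emit projection work for this segment
        checkpoint.last_segment_id = seg_id
    return done

cp = ReplayCheckpoint("tenant-42", from_ts=0, to_ts=100)
try:
    replay([1, 2, 3, 4], cp, fail_after=2)  # crashes after two segments
except RuntimeError:
    pass
resumed = replay([1, 2, 3, 4], cp)          # restart resumes at segment 3
```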

Ordering

  • Topics that require strict intra-tenant ordering (e.g., policy.updated) use sessions with SessionId=tenantId.
  • Projections from append confirmations do not assume strict order; they reconstruct order from segment metadata.

Runbook pointers

DLQ drain (subscription DLQ)

  1. Stabilize: Pause/scale down the consumer to prevent churn; capture metrics & sample traces.
  2. Peek a batch from sub/$DeadLetterQueue; classify by DeadLetterReason / properties (tenantId, schema).
  3. Fix root cause (e.g., bad mapping, missing policy, transient dependency).
  4. Resubmit:
    • For one-off: Requeue to active (clone message with original MessageId, preserve headers).
    • For bulk: run the DLQ Resubmitter function (tags: resubmittedFrom=DLQ, dlqSequenceNumber) with bounded rate.
  5. Verify: Watch error budget burn-rate, DLQ depth, consumer lag; confirm Inbox dedupe prevents dup effects.
  6. Close: Write incident note (root cause, counts, timestamps), keep sample messages for 7–14 days.
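Step 4's resubmission logic can be sketched as a classifier that clones only messages whose `DeadLetterReason` matches a remediated cause, tags provenance, and parks the rest for triage (illustrative; the field names mirror the conventions above, but the helper itself is hypothetical):

```python
def drain_dlq(dlq_messages, fixed_reasons, max_per_run=100):
    """Resubmit DLQ messages whose DeadLetterReason is known to be fixed,
    cloning them with the original MessageId and provenance tags.
    Unmatched (or over-budget) messages stay parked for manual triage."""
    resubmitted, parked = [], []
    for msg in dlq_messages:
        if msg["DeadLetterReason"] in fixed_reasons and len(resubmitted) < max_per_run:
            clone = dict(msg)                       # preserves MessageId/headers
            clone["resubmittedFrom"] = "DLQ"
            clone["dlqSequenceNumber"] = msg["sequenceNumber"]
            resubmitted.append(clone)
        else:
            parked.append(msg)                      # parking-lot candidate
    return resubmitted, parked

dlq = [
    {"MessageId": "evt-1", "DeadLetterReason": "BadMapping", "sequenceNumber": 10},
    {"MessageId": "evt-2", "DeadLetterReason": "Unknown", "sequenceNumber": 11},
]
ok, parked = drain_dlq(dlq, fixed_reasons={"BadMapping"})
```

Preserving the original `MessageId` means duplicate detection and the Inbox both stay effective if a resubmitted message races a surviving copy.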

Projection replay (from HOT)

  1. Scope: Select tenantId and time window; freeze schema version if needed.
  2. Generate work items from HOT segments (batch size tuned); publish to atp.projection.work.v1 with isReplay=true.
  3. Scale Projection workers via KEDA on backlog; enforce tenant rate caps.
  4. Verify: Compare sample queries vs. baseline; ensure counts & hashes match.
  5. Finalize: Mark checkpoint complete, emit ops event to atp.alerts.ops.v1.
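Step 2 (generating replay work items from HOT segments) can be sketched as a simple batching helper (illustrative, not the platform's actual orchestrator):

```python
def make_work_items(segment_ids, tenant_id, batch_size=50):
    """Slice HOT segment ids into work items for atp.projection.work.v1,
    each flagged isReplay=true so consumers apply replay rate caps."""
    items = []
    for i in range(0, len(segment_ids), batch_size):
        items.append({
            "tenantId": tenant_id,
            "segmentIds": segment_ids[i:i + batch_size],
            "isReplay": True,
        })
    return items

items = make_work_items(list(range(120)), "tenant-42", batch_size=50)
```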

Integrity re-verify

  1. Trigger integrity job with target kid/algorithm.
  2. Read manifests from HOT; verify signatures/roots in-HSM; log deltas.
  3. Emit anomalies to atp.alerts.ops.v1; open case if needed.

Broker configuration defaults

  • Duplicate detection: PT15M on all topics.
  • Lock duration: 60–120s; Auto-renew for long handlers.
  • Max delivery: 5–10; DLQ enabled everywhere.
  • Poison quarantine: Pattern “parking-lot” queue for manual triage if resubmission causes loops.
  • Schema versioning: applicationProperties.schema = atp.v1 (bump on breaking change).
  • Observability: Emit tenantId, edition, correlationId, messageId, subscription tags on every handler span.

With this topology and runbooks, the platform achieves fault isolation (per-subscription DLQs), idempotent processing (Outbox/Inbox + duplicate detection), and safe rebuilds driven from the authoritative HOT store rather than unreliable historical broker replay.


Scaling & Capacity (HPA/KEDA, Queues, Partitions)

This section defines how each service scales on AKS using HPA (resource/custom metrics) and KEDA (event/backlog), how we partition hot tenants/shards, and how we handle warmup/readiness and node placement for heavy jobs in an Azure-first topology.

Partitioning strategy (hot tenants & shards)

  • Detection: Continuously rank tenants by ingest RPS, projection backlog share, and query CPU/time; flag tenants at or above the 95th percentile as “hot”.
  • Message-plane isolation:
    • Create dedicated subscriptions or session partitions per hot tenant (e.g., projection-sub-tenant-<id>).
    • Bind KEDA scalers per subscription to cap cross-tenant contention.
  • Data-plane isolation:
    • SQL: partition by (tenantId, eventMonth); hot tenants get separate filegroups or partition ranges.
    • Cosmos: standard /tenantId partition key; raise autoscale RU floor for hot tenants.
    • Search: optional per-tenant index for extreme hotspots; otherwise filter by tenantId with index alias.
  • Concurrency guardrails: Per-tenant max concurrent handlers (sessions) to protect SLOs for the long tail.
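Hot-tenant detection can be sketched as a nearest-rank percentile cut over a single signal (here ingest RPS; the real ranking combines backlog share and query cost as described above, and the helper is illustrative):

```python
def flag_hot_tenants(rps_by_tenant: dict, quantile: float = 0.95) -> set:
    """Flag tenants whose ingest RPS is at or above the given quantile
    of the tenant population (simple nearest-rank percentile)."""
    values = sorted(rps_by_tenant.values())
    if not values:
        return set()
    rank = max(0, int(quantile * len(values)) - 1)  # nearest-rank index
    threshold = values[rank]
    return {t for t, rps in rps_by_tenant.items() if rps >= threshold}

rps = {"t1": 5, "t2": 7, "t3": 6, "t4": 900, "t5": 4,
       "t6": 8, "t7": 5, "t8": 6, "t9": 7, "t10": 850}
hot = flag_hot_tenants(rps)   # two outliers dominate this sample
```

Tenants flagged this way are candidates for dedicated subscriptions, session partitions, and raised RU floors per the isolation bullets above.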

Warmup, readiness & node placement

  • Readiness gates: Mesh sidecar ready, Key Vault CSI mount version present, Service Bus reachable, DB pool healthy, feature flags fetched.
  • Warmup: Pre-open DB/ASB connections, optionally prime cache (Query), load policy snapshot (Policy), JIT or precompile hotspots.
  • Rollout budgets:
    • Stateless web APIs: maxSurge=30%, maxUnavailable=0, minReadySeconds=10.
    • Heavy workers (Projection/Export): maxSurge=1, maxUnavailable=0.
  • Node pools & taints:
    • np-generic for Gateway/Query/Admin.
    • np-io (taint: workload=io:NoSchedule) for Ingestion/Projection/Export/Integrity with tolerations; higher IOPS and memory.

Service scaling matrix

| Service | Scale metric(s) (HPA/KEDA) | Min / Max | Readiness gate (examples) | Cost notes |
|---|---|---|---|---|
| Gateway | HPA on CPU 60% and p95 latency (custom metric via Azure Monitor); optional KEDA HTTP (requests) | 2 / 20 | Mesh ready; KV CSI certs; APIM/Ingress route health | Keep low idle; prefer header-based canary over full fleet surge |
| Ingestion | HPA on CPU 60% + custom rps_ingest; 429 rate guard | 2 / 20 | KV CSI (webhook HMAC), HOT Blob reachability, hot-index write check | Scale up only within storage IOPS budget to avoid throttling |
| Policy | HPA on CPU + cache miss rate | 1 / 5 | Policy snapshot loaded; DB reachable | Keep small; cache policies per tenant to reduce DB hits |
| Projection | KEDA on ASB projection-sub* backlog; target 100–500 msgs/replica | 0 / 50 | DB write test; inbox/outbox tables live; SB lock renew path | Activation to 0 saves cost; cap per-tenant concurrency |
| Query | HPA on CPU + p95 query latency; optional KEDA on queue of verify-on-read jobs | 2 / 30 | DB/read models ready; cache connected; search reachable | Prefer cache TTLs and result caching to control spend |
| Integrity | KEDA on verification job queue or CronScaledJob for windows | 0 / 10 | HSM sign op test; HOT Blob read | Run in off-peak windows; throttle to protect data plane |
| Export | KEDA on export-sub backlog + bandwidth cap (custom metric) | 0 / 10 | Blob write SAS-less path; KV/HSM sign op; temp volume space | Pin to np-io; enforce egress budgets to control costs |
| Admin | HPA on CPU (low) | 1 / 2 | DB ready | Keep minimal; no autoscale to large counts |

For KEDA activation to zero, ensure startup probes are tolerant (slow cold-start) and that work item visibility is preserved during scale-from-zero.

Example KEDA spec (Projection on Service Bus topic subscription)

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: projection-worker
  namespace: atp-projection-ns
spec:
  scaleTargetRef:
    name: projection-deployment
  pollingInterval: 10            # seconds
  cooldownPeriod: 120
  minReplicaCount: 0
  maxReplicaCount: 50
  advanced:
    restoreToOriginalReplicaCount: false
  triggers:
  - type: azure-servicebus
    metadata:
      namespace: sb-atp-prod-us
      topicName: atp.audit.appended.v1
      subscriptionName: projection-sub
      messageCount: "400"        # ~ messages per replica target
      activationMessageCount: "1"
      # Use AAD auth; no conn string
      cloud: AzurePublicCloud
    authenticationRef:
      name: keda-auth-asb-aad
---
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: keda-auth-asb-aad
  namespace: atp-projection-ns
spec:
  podIdentity:
    provider: azure-workload

Example rollout strategy (heavy worker)

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0
minReadySeconds: 20
tolerations:
- key: "workload"
  operator: "Equal"
  value: "io"
  effect: "NoSchedule"
nodeSelector:
  agentpool: np-io

Sizing heuristics & guards

  • Backlog targets: Projection ~400 msgs/replica, Export 20–50 jobs/replica (depending on payload size), Integrity 100–200 segments/replica.
  • Queue-driven scale: Prefer backlog + age (lag) over raw backlog to prioritize stale tenants.
  • Throttle replays: When isReplay=true, apply lower maxReplicaCount and tenant rate caps.
  • Budget enforcement: Track cost-per-tenant (storage + compute + egress). Auto-reduce max replicas when budgets near limits.
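The backlog targets above translate into a simple replica-sizing rule, mirroring what the KEDA scaler computes (illustrative; the replay cap value is an assumption, not a platform constant):

```python
import math

def desired_replicas(backlog: int, target_per_replica: int,
                     min_replicas: int, max_replicas: int,
                     is_replay: bool = False, replay_cap: int = 10) -> int:
    """KEDA-style sizing: one replica per `target_per_replica` backlog
    messages, clamped to [min, max]; replays get a lower ceiling."""
    cap = min(max_replicas, replay_cap) if is_replay else max_replicas
    if backlog <= 0:
        return min_replicas   # activation to zero when min is 0
    return max(min_replicas, min(cap, math.ceil(backlog / target_per_replica)))

live = desired_replicas(8000, target_per_replica=400, min_replicas=0, max_replicas=50)
replay = desired_replicas(8000, 400, 0, 50, is_replay=True)
idle = desired_replicas(0, 400, 0, 50)
```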

With HPA/KEDA triggers tied to real demand (RPS, CPU, backlog/lag), hot-tenant partitioning at both message and data planes, and strict warmup/readiness plus node placement, the platform scales safely and cost-aware while protecting shared SLOs.


Observability & SLO Enforcement (Azure-first)

This section makes golden signals deployable across ATP with OpenTelemetry → Azure Monitor (App Insights/Log Analytics) and optional Managed Grafana. It defines mandatory telemetry, SLOs with burn-rate alerts, and dashboards wired from the in-cluster OTel Collector.

Telemetry wiring (standard)

  • SDKs: OpenTelemetry (traces, metrics, logs) in every service.
  • Resource attributes (must be on every span/metric/log): service.name, service.version, deployment.environment, cloud.region, tenantId, edition, correlationId, messageId, subscription, http.route (where applicable).
  • Collector (DaemonSet + sidecars where needed):
    • Receivers: otlp gRPC :4317, HTTP :4318; prometheus scrape for kube/mesh.
    • Exporters: azuremonitor (traces/metrics/logs), optional prometheusremotewrite to Managed Prometheus, and logging for debug.
  • Availability tests: App Insights Ping/Multistep against public APIs (Gateway) + internal probes (mesh VIPs).

Example (collector exporter fragment):

exporters:
  azuremonitor:
    connection_string: "InstrumentationKey=${APPINSIGHTS_KEY}"
service:
  pipelines:
    metrics:
      receivers: [otlp, prometheus]
      processors: [batch, resource]
      exporters: [azuremonitor]
    traces:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [azuremonitor]
    logs:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [azuremonitor]
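As a guardrail, the mandatory resource attributes can be verified mechanically (in CI or in a collector processor). A minimal sketch; the helper is hypothetical, and only the unconditional attributes are checked since messageId, subscription, and http.route apply “where applicable”:

```python
# Unconditional attributes every span/metric/log must carry.
MANDATORY = {"service.name", "service.version", "deployment.environment",
             "cloud.region", "tenantId", "edition", "correlationId"}

def missing_attributes(resource: dict) -> set:
    """Return the mandatory resource attributes absent from a telemetry item."""
    return MANDATORY - resource.keys()

attrs = {"service.name": "atp-query", "service.version": "1.4.2",
         "deployment.environment": "prod", "cloud.region": "eastus",
         "tenantId": "tenant-42", "edition": "enterprise"}
gaps = missing_attributes(attrs)   # this sample forgot correlationId
```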

Mandatory signals by service (golden set)

| Service | Metrics (examples) | Traces (must-have spans) | Logs (structured fields) |
|---|---|---|---|
| Gateway | http.server.duration (histogram), http.server.requests (by status), rate_limit_drops_total | Gateway/Authorize, Gateway/Route, Gateway/TenancyGuard | tenantId, userId, route, status, requestId |
| Ingestion | ingest.requests_total, ingest.p95_ms, ingest.rejects_total, outbox.pending, asb.publish_duration_ms | Ingestion/Validate, Ingestion/Outbox/Commit, ASB/Publish | eventId, schema, tenantId, idempotencyKey |
| Policy | policy.cache_hit_ratio, policy.load_ms, policy.errors_total | Policy/Resolve, Policy/CacheLoad | policyVersion, tenantId |
| Projection | projection.backlog (by sub/tenant), projection.p95_ms, inbox.dupes_total, db.write_ms | Projection/Handle, DB/BulkUpsert, ASB/Complete | subscription, messageId, tenantId, isReplay |
| Query | query.requests_total, query.p95_ms, verify_on_read_ms, cache.hit_ratio | Query/Execute, DB/Read, VerifyOnRead | route, tenantId, filters, resultSize |
| Integrity | integrity.verify.count, integrity.sign.ms, integrity.anomalies_total | Integrity/SealSegment, HSM/Sign, Blob/Write | segmentId, kid, hashAlgo |
| Export | export.jobs_queued, export.duration_ms, export.failure_total, egress.bytes | Export/Assemble, HSM/Sign, Blob/Write | exportId, format, tenantId |
| Admin | admin.ops_total, featureflag.toggle_total | Admin/Action | actor, action, target |

All latencies are histograms with p50/p95/p99; all errors include exception.type, stack, and correlationId.


SLO catalog (targets & SLIs)

| Capability | SLI (how measured) | SLO target (per region) | Notes |
|---|---|---|---|
| Ingest success | (2xx+3xx) / all gateway ingest requests | 99.5% over 30 days | Excludes planned maintenance windows |
| Ingest latency | p95 http.server.duration on the append route | ≤ 300 ms | Under nominal load in business hours |
| Projection freshness | 95th pct of backlog age per tenant | ≤ 60 s | Measured from append to projection visible |
| Query success | 2xx / all query requests | 99.5% | Gateway- or service-level errors count against |
| Query latency | p95 http.server.duration on query routes | ≤ 500 ms | For typical filters and page size |
| DLQ rate | DLQ messages / total consumed per subscription | ≤ 0.5% | Excluding intentional parking-lot ops |
| Export completion | Share of jobs finishing < 15 min | ≥ 95% | Per 24 h rolling window |
| Integrity verification | Segments verified within window | ≥ 99% within policy window | Window defined by sealing cadence |

Alert policies (burn-rate & thresholds)

Error-budget burn-rate (applies to Ingest and Query success SLOs):

  • Let SLO = 99.5% ⇒ budget = 0.5%.
  • Page when both windows breach:
    • Fast: 5-min error rate > 7.2× budget (i.e., > 3.6% errors)
    • Slow: 1-hr error rate > 3× budget (i.e., > 1.5% errors)
  • Ticket (non-page):
    • 6-hr error rate > budget (≥ 0.5%) or 24-hr > 0.5× budget.
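The paging policy above can be expressed as a small evaluator (an illustrative sketch; thresholds follow the numbers above, with the 1-hr slow window at 1.5% errors and the 6-hr ticket window at budget):

```python
def alert_action(err_5m: float, err_1h: float, err_6h: float,
                 slo: float = 0.995) -> str:
    """Multiwindow burn-rate policy: page only when both the fast (5-min)
    and slow (1-hr) windows burn budget too quickly; ticket on slower burn."""
    budget = 1.0 - slo                           # 0.5% for a 99.5% SLO
    if err_5m > 7.2 * budget and err_1h > 3 * budget:
        return "page"                            # fast + sustained burn
    if err_6h > budget:
        return "ticket"                          # slow burn, non-paging
    return "ok"

page = alert_action(0.05, 0.02, 0.01)    # fast burn in both windows
ticket = alert_action(0.01, 0.005, 0.008)  # slow 6-hr burn only
quiet = alert_action(0.001, 0.001, 0.001)
```

Requiring both windows keeps short error spikes from paging while still catching burns that would exhaust the budget within hours.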

Latency:

  • Warn: p95 ingest > 300 ms for 10 min; Page if > 500 ms for 5 min.
  • Warn: p95 query > 500 ms for 10 min; Page if > 800 ms for 5 min.

Projection freshness & DLQ:

  • Warn: backlog age p95 > 60 s for 15 min; Page if > 180 s for 10 min.
  • Warn: DLQ rate > 0.5% for 30 min; Page if > 2% for 15 min.

Export & Integrity:

  • Warn: export 95th duration > 15 min for 60 min; Ticket if > 30 min.
  • Warn: integrity verify coverage < 99% at window end; Ticket + runbook link.

Canary checks (post-deploy):

  • Synthetic Availability test success ≥ 99% over 15 min.
  • No SLO burn in canary slice for 15–30 min.
  • Auto-rollback if fast burn-rate page triggers during canary.

PromQL examples (Managed Prometheus):

# Error rate (Gateway ingest)
sum(rate(http_server_requests_seconds_count{route="/append",status!~"2..|3.."}[5m]))
/
sum(rate(http_server_requests_seconds_count{route="/append"}[5m]))

# p95 query latency
histogram_quantile(0.95, sum(rate(http_server_duration_seconds_bucket{route=~"/query.*"}[5m])) by (le))

KQL examples (Log Analytics / App Insights):

requests
| where customDimensions["route"] == "/append"
| summarize errors = countif(success == false), total = count() by bin(timestamp, 5m)
| extend error_rate = todouble(errors) / todouble(total)

requests
| where customDimensions["route"] startswith "/query"
| summarize p95=percentile(duration, 95) by bin(timestamp, 5m)

Dashboards (UID placeholders)

Grafana (Managed)

  • ATP — Overview uid: g_atp_overview
  • Edge & Gateway uid: g_atp_edge
  • Ingestion & ASB uid: g_atp_ingest
  • Projection Health uid: g_atp_projection
  • Query Performance uid: g_atp_query
  • Integrity & Evidence uid: g_atp_integrity
  • Export Pipeline uid: g_atp_export
  • SLO Heatmap (tenants) uid: g_atp_tenant_slos

Azure Workbooks

  • ATP Overview workbook: LA-ATP-Overview
  • Gateway & Edge workbook: LA-ATP-Edge
  • Messaging & DLQ workbook: LA-ATP-ASB
  • Projections & Freshness workbook: LA-ATP-Projection
  • Query & Cache workbook: LA-ATP-Query

With SDK + Collector standardized, SLIs codified into SLOs, burn-rate alerts enforced across short/long windows, and dashboards pre-wired, observability becomes a deployable artifact rather than ad-hoc instrumentation.


CI/CD Overlays & Config Strategy (Azure-first)

This section shows how we parameterize deployments per environment/region/edition, wire feature flags, and enforce supply-chain controls (signing, SBOM, verification) across Azure DevOps pipelines.

Overlay layout (Helm + optional Kustomize)

deploy/
├─ charts/
│  └─ atp/
│     ├─ Chart.yaml
│     ├─ values.yaml                    # sane defaults (non-secret)
│     ├─ values.dev.yaml                # env overlays
│     ├─ values.qa.yaml
│     ├─ values.staging.yaml
│     ├─ values.prod.yaml
│     ├─ values.us.yaml                 # region overlays
│     ├─ values.eu.yaml
│     ├─ values.il.yaml
│     ├─ values.edition.default.yaml    # edition overlays
│     ├─ values.edition.enterprise.yaml
│     └─ templates/**.yaml
└─ kustomize/                           # optional, if you prefer Kustomize
   ├─ base/ (rendered helm, then kustomize)
   └─ overlays/
      ├─ dev-us/
      ├─ qa-us/
      ├─ staging-us/
      ├─ prod-us/
      ├─ prod-eu/
      └─ prod-il/

Secrets never live in values files. Names/IDs (e.g., KV, ASB namespaces, Storage accounts) are acceptable; access uses Entra Workload Identity + Key Vault CSI.

Example values overlays

values.prod.yaml

global:
  environment: prod
  replicaDefaults:
    min: 2
    max: 10
features:
  sealing:
    enabled: true
    cadence: "PT5M"
    mode: "merkle+hsm"
  verifyOnRead:
    default: true
    allowOverridePerRequest: false
  export:
    caps:
      maxConcurrentPerTenant: 2
      maxBytesPerJob: "10Gi"
    windows:
      allowedCron: "0 2 * * *"  # 02:00 local window
network:
  ingress:
    wafMode: "prevention"
observability:
  otelEndpoint: "http://otel-collector.atp-system.svc:4317"

values.us.yaml

azure:
  region: eastus
  keyVaultName: kv-atp-prod-us
  serviceBusNamespace: sb-atp-prod-us
  storage:
    hotAccount: stathotprodus
    coldAccount: statcoldprodus
  sql:
    server: sql-atp-prod-us.database.windows.net
    db: atp_prod_us
residency:
  regionCode: "us"
  allowCrossRegionReplication: false

values.edition.enterprise.yaml

features:
  verifyOnRead:
    default: true
  export:
    caps:
      maxConcurrentPerTenant: 5
      maxBytesPerJob: "50Gi"
pricing:
  skuHint: "enterprise"

Compose overlays at deploy time, e.g.: helm upgrade --install atp ./charts/atp -f values.prod.yaml -f values.eu.yaml -f values.edition.enterprise.yaml
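Composition follows Helm's value precedence: later -f files win, and nested maps merge key-by-key. A simplified model of that merge (illustrative; Helm's actual semantics have additional rules, e.g., around null values):

```python
def merge_values(base: dict, overlay: dict) -> dict:
    """Helm-style value merging: overlay keys win, nested maps merge
    recursively instead of being replaced wholesale."""
    out = dict(base)
    for key, val in overlay.items():
        if isinstance(val, dict) and isinstance(out.get(key), dict):
            out[key] = merge_values(out[key], val)  # deep-merge maps
        else:
            out[key] = val                          # scalars/lists: overlay wins
    return out

# values.prod.yaml fragment, then values.edition.enterprise.yaml on top.
prod = {"features": {"export": {
    "caps": {"maxConcurrentPerTenant": 2, "maxBytesPerJob": "10Gi"},
    "windows": {"allowedCron": "0 2 * * *"}}}}
enterprise = {"features": {"export": {
    "caps": {"maxConcurrentPerTenant": 5, "maxBytesPerJob": "50Gi"}}}}
effective = merge_values(prod, enterprise)
```

Note how the enterprise overlay raises the caps while the prod-only `windows` block survives untouched, which is exactly why overlay ordering on the helm command line matters.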

Pipeline stages (Azure DevOps) with checks & gates

stages:
- stage: Build
  jobs:
  - job: build
    steps:
    - script: dotnet build --configuration Release
    - script: dotnet test --collect:"XPlat Code Coverage"
    - task: TrivyScan@1              # image & deps scan (or Defender for DevOps)
    - script: syft packages -o cyclonedx-json > sbom.json

- stage: Package
  dependsOn: Build
  jobs:
  - job: containerize
    steps:
    - script: docker build -t $(ACR)/atp-gateway:$(Build.SourceVersion) .
    - script: docker push $(ACR)/atp-gateway:$(Build.SourceVersion)
    - script: cosign sign --key $(COSIGN_KEY) $(ACR)/atp-gateway:$(Build.SourceVersion)
    - script: cosign attach sbom --sbom sbom.json $(ACR)/atp-gateway:$(Build.SourceVersion)
    - script: cosign attest --predicate provenance.json --type slsaprovenance $(ACR)/atp-gateway:$(Build.SourceVersion)

- stage: Verify_Manifests
  dependsOn: Package
  jobs:
  - job: policy
    steps:
    - script: helm template atp ./deploy/charts/atp -f values.dev.yaml > rendered.yaml
    - script: conftest test rendered.yaml         # OPA policies: no NodePort, PSS, resources
    - script: ratify verify --subject $(ACR)/atp-gateway:$(Build.SourceVersion) # signature/SBOM

- stage: Dev
  dependsOn: Verify_Manifests
  variables:
    HELM_VALUES: "-f values.dev.yaml -f values.us.yaml -f values.edition.default.yaml"
  jobs:
  - deployment: dev
    environment: atp-dev-us   # Env checks: owners, approvals optional
    strategy:
      runOnce:
        deploy:
          steps:
          - script: helm upgrade --install atp ./deploy/charts/atp $(HELM_VALUES) --set image.tag=$(Build.SourceVersion)

- stage: QA
  dependsOn: Dev
  jobs:
  - deployment: qa
    environment: atp-qa-us
    strategy:
      runOnce:
        deploy:
          steps:
          - script: helm upgrade --install atp ./deploy/charts/atp -f values.qa.yaml -f values.us.yaml

- stage: Staging
  dependsOn: QA
  jobs:
  - deployment: staging
    environment: atp-staging-us
    strategy:
      runOnce:
        preDeploy:
          steps:
          - task: ManualValidation@0     # Release Manager + Security approval
          - script: ./scripts/canary-enable.sh 10   # 10% canary
        deploy:
          steps:
          - script: helm upgrade --install atp ./deploy/charts/atp -f values.staging.yaml -f values.us.yaml
        routeTraffic:
          steps:
          - script: ./scripts/canary-promote.sh 100 # promote if SLO OK

- stage: Prod
  dependsOn: Staging
  jobs:
  - deployment: prod_us
    environment: atp-prod-us
    strategy:
      runOnce:
        preDeploy:
          steps:
          - task: ManualValidation@0     # RM + On-call SRE (2-eyes)
          - task: AzureCLI@2             # Policy check: Gatekeeper/Kyverno/Ratify status
        deploy:
          steps:
          - script: helm upgrade --install atp ./deploy/charts/atp -f values.prod.yaml -f values.us.yaml
        routeTraffic:
          steps:
          - script: ./scripts/canary-promote.sh 50
          - script: ./scripts/canary-promote.sh 100
  - deployment: prod_eu
    environment: atp-prod-eu
    dependsOn: prod_us
    strategy:
      runOnce:
        deploy:
          steps:
          - script: helm upgrade --install atp ./deploy/charts/atp -f values.prod.yaml -f values.eu.yaml
  - deployment: prod_il
    environment: atp-prod-il
    dependsOn: prod_eu
    strategy:
      runOnce:
        deploy:
          steps:
          - script: helm upgrade --install atp ./deploy/charts/atp -f values.prod.yaml -f values.il.yaml

Environment checks & gates

  • Build/Package: unit/integration tests, vulnerability scan, SBOM generation, Cosign sign/attest.
  • Verify_Manifests: OPA/Conftest policy, Ratify validation (signature + SBOM) as an admission gate in AKS.
  • Staging: manual approval, canary + SLO guard, auto-rollback on burn-rate page.
  • Prod: manual approval (RM + SRE), admission policy green, progressive traffic (10%/50%/100%) with checks in between.

Feature flags (centralized; wired via values)

features:
  sealing:
    enabled: true
    cadence: "PT5M"
    mode: "merkle+hsm"
  verifyOnRead:
    default: false   # dev/qa
  export:
    caps:
      maxConcurrentPerTenant: 1
      maxBytesPerJob: "5Gi"
    windows:
      allowedCron: "0 3 * * *"  # dev window

Flip per environment/edition by merging overlays. Flags are exposed to services through config maps or typed options, not environment variables with secrets.

Artifact promotion & immutability

  • Tags: Every image has :sha-<gitsha> (immutable), :vX.Y.Z, and :channel-<env> tags.
  • Promotion: Retag only (no rebuild) from ACR dev → qa → staging → prod.
  • ACR policies: Immutable tags on semver; quarantine repo for images pending scan/sign.
  • Verification: Admission plugin (Ratify/Kyverno) requires cosign signature + SBOM before scheduling.

“Overlay tree” (at deploy time)

  • Env selects base values (values.{env}.yaml).
  • Region adds resource names and residency (values.{region}.yaml).
  • Edition sets features and caps (values.edition.{kind}.yaml).
  • Command combines three: helm upgrade --install atp ./charts/atp -f values.prod.yaml -f values.eu.yaml -f values.edition.enterprise.yaml --set image.tag=$(Build.SourceVersion)

With this strategy, the same artifact is promoted across rings, overlays capture reality (env/region/edition), and supply-chain policies (signing/SBOM/verification) are enforced by the pipeline and the cluster—turning CI/CD into a governed control plane, not just a copy step.


Security Controls & Zero-Trust Map (Azure-first)

This section renders the zero-trust control map as it’s deployed on Azure: edge hardening, mesh mTLS, identity-first access, default-deny networking, supply-chain enforcement, and least-privilege IAM across data and messaging planes.

Trust boundaries diagram

flowchart TB
  user[Public Internet]

  subgraph EDGE["Trust Boundary: Edge (Azure Front Door + WAF)"]
    AFD["AFD + WAF (TLS/HSTS, managed+custom rules)"]
    APIM["API Management (JWT, rate limit, version routing)"]
  end

  subgraph AKS["Trust Boundary: AKS Mesh (mTLS, default-deny)"]
    ADMISSION["Admission: Ratify (sig/SBOM)<br/>+ Kyverno/Gatekeeper (PSS, policies)"]
    GW["Gateway (ABAC tenancy guard)"]
    SVC["App services: Ingestion • Policy • Projection • Query • Integrity • Export • Admin<br/>(Envoy sidecars, KV-CSI, OTel)"]
    NP["NetworkPolicies: namespace default-deny<br/>allow-list east–west"]
  end

  subgraph PAAS["Trust Boundary: PaaS (Private Link only)"]
    ASB["Service Bus (RBAC)"]
    KV["Key Vault + Managed HSM (sign only)"]
    STG["Blob (WORM, CMK)"]
    DB["Azure SQL/Cosmos (AAD auth, RLS/PK)"]
    MON["Azure Monitor / App Insights / Log Analytics"]
  end

  user -->|HTTPS :443| AFD --> APIM -->|re-encrypt TLS| GW
  GW -->|mTLS via mesh| SVC
  SVC -->|egress allow-list| ASB
  SVC --> KV
  SVC --> STG
  SVC --> DB
  SVC --> MON
  ADMISSION -.enforces.-> GW
  ADMISSION -.enforces.-> SVC
  NP -.limits.-> SVC

Zero-trust pillars (how they apply here)

  • Identity > secrets: Entra Workload Identity for every pod; PaaS via RBAC, not connection strings/SAS.
  • Encrypt & authenticate everywhere: TLS at edge, mTLS inside mesh, HSM-backed signing for evidence/exports.
  • Default-deny networking: NetworkPolicies for east–west, egress deny except Private Link PaaS.
  • Least privilege: Narrow RBAC roles (DB/ASB/Storage/KV). HSM keys: sign permission only.
  • Supply chain integrity: Images scanned, signed (Cosign), SBOM attached, verified at admission (Ratify).
  • Hardened runtime: Pod Security Standards (restricted), policy guardrails (Kyverno/Gatekeeper), no privileged pods/capabilities.
  • Edge hardening: AFD+WAF (managed+custom rules), APIM (JWT, rate limit, schema), HSTS, CORS allow-list.

Control map — “Control → Layer → Enforced by → Evidence”

| Control | Layer | Enforced by | Evidence (where to check) |
|---|---|---|---|
| TLS 1.2+ & HSTS | Edge | AFD + WAF config | Azure Diagnostics (FrontDoorWebApplicationFirewallLog), SSL report, AFD policy export |
| JWT validation & version routing | Edge/API | APIM inbound policies | APIM trace, policy repo, App Insights requests with clientPrincipalId |
| Global rate limiting, abuse throttles | Edge/API | APIM rate-limit-by-key, AFD rules | APIM analytics, WAF logs (RuleAction=Block), 429 counters |
| mTLS service-to-service | Cluster | Service mesh (Envoy) | Mesh policy dump, Envoy cert stats, OTel spans with tls=true |
| Tenancy ABAC at Gateway | App | Gateway policy middleware | AuthZ logs (tenantId, edition, outcome), unit tests of guards |
| Namespace default-deny | Network | Kubernetes NetworkPolicies | kubectl get netpol, denied connection tests, Cilium/Calico flow logs |
| Egress deny + Private Link only | Network/PaaS | Egress policies + Private Endpoints | NSG/Firewall logs, Private Link endpoints, failed non-PL egress |
| Pod Security Standards (restricted) | Runtime | Kyverno/Gatekeeper | Admission audit, Kyverno policy reports, kubectl auth can-i checks |
| Image signing & SBOM | Supply chain | Cosign + Ratify | Ratify admission results, cosign verify, SBOM (CycloneDX) attached |
| Container/base image scanning | Supply chain | Defender for DevOps/Trivy | Security center findings, pipeline scan artifacts |
| Secrets delivery (no env vars) | Secrets | Key Vault + CSI driver | Pod mounts, KV audit logs (SecretGet), env var scans (should be 0) |
| Managed HSM signing (no export) | KMS | AKV Managed HSM | HSM audit logs, sign ops only, no get on key material |
| SQL/Cosmos least-privilege access | Data | AAD auth + custom roles/RLS | SQL audit FAILED_LOGIN_GROUP, RLS predicate tests, Cosmos RBAC |
| Storage immutability & legal holds | Data | Blob WORM + legal hold | Storage immutability policy, object legal hold flags, deletion attempts denied |
| Service Bus RBAC (no SAS) | Messaging | Entra RBAC | ASB access audits, absence of SAS in configs, role assignment list |
| WAF rules (managed+custom) | Edge | AFD WAF | Rule hit metrics, blocked IP/geo logs |
| DDoS protection scope | Edge/IP | AFD global network (and Azure DDoS for public IPs if used) | DDoS metrics, mitigation reports (if any) |
| Admission conformance (no NodePort, PSS, resources) | Cluster | Kyverno/Gatekeeper OPA policies | Policy tests, admission denials, Conftest in CI |
| Observability integrity | Telemetry | OTel → Azure Monitor | Traces/logs/metrics with tenantId, correlationId; export health alerts |

“Evidence” refers to the artifacts auditors and SREs inspect to prove a control exists and is active, both at runtime and during deployment.

Pod hardening (high-value defaults)

  • PSS restricted: runAsNonRoot, readOnlyRootFilesystem, seccomp=RuntimeDefault, drop ALL capabilities (allow only explicit minimal set), no host* (PID/IPC/Network), no privileged, no hostPath volumes.
  • Network: no NodePort, internal ClusterIP only; Ingress terminates TLS and hands off to Gateway; egress through Private Link endpoints.
  • Secrets: only via KV CSI mounts (tmpfs), short TTL, rollover without restart where possible.

Least-privilege IAM (typical assignments)

  • Ingestion/Projection: Azure Service Bus Data Sender/Receiver on specific topics/subs; no namespace-wide rights.
  • Query/Projection: AAD contained users/roles to specific schemas; RLS on tenantId; no SQL logins.
  • Export/Integrity: sign on specific HSM keys; no get/list keys; Storage Blob Data Contributor scoped to tenant containers.
  • Gateway: Key Vault Secrets User (read certs/secrets) + per-resource read roles; no write on data plane.

Example policy fragments

Kyverno — deny privileged/host networking

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: pss-restricted-core
spec:
  validationFailureAction: enforce
  rules:
  - name: deny-privileged-host
    match: { resources: { kinds: ["Pod"] } }
    validate:
      message: "Privileged/host* disallowed"
      pattern:
        spec:
          securityContext:
            runAsNonRoot: true
          hostNetwork: false
          hostPID: false
          hostIPC: false
          containers:
          - securityContext:
              privileged: false
              allowPrivilegeEscalation: false
              readOnlyRootFilesystem: true

Ratify — require cosign signature & SBOM (conceptual)

apiVersion: constraints.gatekeeper.sh/v1beta1
kind: RatifyVerificationPolicy
metadata:
  name: require-signed-and-sbom
spec:
  artifacts:
    - pattern: "*"
      validations:
        - name: cosign-signature
        - name: sbom-cyclonedx

Outcome: The map above ties each security objective to a deployment control and a concrete evidence source. With mTLS in mesh, JWT + rate limits at the edge, ABAC/RBAC inside, PSS/admission policies, deny-by-default networking, and least-privilege IAM over Private Link, the platform maintains a practical, auditable zero-trust posture.


DR, Backups & Region Failover (Azure-first)

This section documents how ATP survives zonal/region outages while preserving tamper-evidence and residency. Azure-first assumptions: AFD/WAF at edge, APIM/Ingress per region, AKS per region, Blob (WORM) for HOT, SQL/Cosmos for WARM, ASB for messaging, Key Vault/Managed HSM for keys.

Strategy overview

  • Active–active user plane for Gateway, Ingestion, Query across allowed regions; traffic steered by AFD health probes and origin groups.
  • Active–passive control jobs (Integrity sealing sweeps, heavy Exports) with pilot-light capacity in secondary regions; scale up on failover (KEDA/HPA).
  • Residency-first: EU/IL tenants fail over only to same-jurisdiction paired regions. US may use cross-region pairs.
  • Authoritative truth is HOT WORM; WARM (projections/search) are rebuildable.

Backup plan (cadence, verify, retention)

| Asset | Technique | Cadence / Retention | Integrity & Verify |
|---|---|---|---|
| HOT (Blob WORM) | ZRS in-region; Object Replication to same-jurisdiction DR account; legal hold where required | Retention per tenant policy (e.g., 1–7+ years) | Scheduled hash re-verify samples daily; manifest/root checks; quarterly promote-DR-copy dry-run |
| HOT index (SQL/Cosmos) | SQL PITR (14–35d) / LTR weekly (months); Cosmos continuous backup (7–30d) | PITR 14–35d; LTR 6–12 mo | Nightly integrity job compares pointer counts vs HOT manifests |
| WARM projections | PITR + replay from HOT (preferred) | PITR 7–14d | Weekly replay smoke against subset tenants |
| Search | Reindex from WARM/HOT (no backup) | N/A | Post-restore consistency spot checks |
| ASB (Service Bus) | Geo-DR alias (metadata replication); premium namespace pairs | Test quarterly; alias failover exercised | DLQ drift checks before/after failover |
| Key Vault / Managed HSM | KV: soft-delete + purge protection; HSM backup package to private Storage | KV continuous; HSM backup weekly and before rotation | Monthly restore test of HSM backup to isolated vault |
| APIM / AFD config | IaC in repo; config export snapshots | With every release | Diff-and-apply check on DR region |
| Configs (Helm values) | Git as source of truth | With every release | Admission policy conformance in DR cluster |

DR patterns & failover posture

  • Zonal failure (within region)
    • AKS uses zonal node pools; storage ZRS; AFD/APIM keep routing within region.
    • RTO: ≤ 15 min, RPO: ~ 0 (authoritative writes to HOT continue).
  • Regional outage (allowed cross-region)
    • AFD marks region unhealthy → shifts to healthy origin group.
    • ASB Geo-DR alias switch if namespace is down.
    • Scale pilot-light workloads (Projection/Export/Integrity) in DR region.
    • RTO: 15–60 min, RPO: ≤ 5 min (HOT replicated + outbox/inbox re-drain).
  • Regional outage (strict residency, same-jurisdiction only)
    • Read/write remain within jurisdiction pair (e.g., EU-pair).
    • If replication is asynchronous, accept small RPO gap; rebuild WARM from HOT in DR.
    • RTO: 1–4 h, RPO: ≤ 15 min (depends on HOT replication lag).
  • Control-plane impairment (AKS only)
    • Keep PaaS healthy; re-point traffic to sibling cluster in same region if available.
    • RTO: 30–60 min, RPO: ~ 0.

Integrity ledger continuity (sealing across failover)

  • Key continuity: Integrity/Export sign with Managed HSM; DR vault has restored key version (kid) before cutover.
  • Chain anchoring: Each sealed segment root contains previous root hash. On region failover, first DR seal anchors to the last confirmed root (by kid + hash) to avoid forks; publish “chain-continuation” event.
  • Dual-sign window (optional): Temporarily accept old+new kid to bridge any trust gaps; later revoke old.
  • Watermarks: Integrity keeps a sealing watermark (per tenant/region) in HOT index; DR resumes from that watermark to prevent duplicates or gaps.

DR run sequence (Mermaid)

flowchart TB
  A[Detect outage via AFD/APIM/SLI] --> B{Scope?}
  B -->|Zonal| C[Keep region; scale unaffected zones]
  B -->|Region| D[Mark region unhealthy in AFD]
  D --> E[Shift traffic to DR origin group]
  E --> F[Switch ASB Geo-DR alias if needed]
  F --> G[Scale pilot-light: Ingestion/Query/Projection/Integrity/Export]
  G --> H[Restore HSM key version in DR - if not pre-staged]
  H --> I[Run health & smoke tests; enable canary]
  I --> J[Resume sealing; publish chain-continuation event]
  J --> K[Monitor SLOs & burn-rate; adjust capacity]

DR checklist (operator-facing)

Before an incident (readiness)

  • DR region origin registered in AFD; health probe green.
  • ASB Geo-DR pairing healthy; alias tested last quarter.
  • HSM backup restored in DR vault (pre-staged kid current).
  • DR AKS cluster passes admission policies; minimal pilot-light replicas deployed.
  • HOT object replication policies green; last replication lag < 15 min.
  • Runbooks and feature-flag toggles (e.g., reduced export concurrency) reviewed.

At incident

  • Confirm scope (zonal vs region).
  • AFD/APIM show primary unhealthy → cutover to DR origin.
  • If ASB down: fail over alias to DR namespace.
  • Scale KEDA/HPA targets for Projection/Query/Export/Integrity.
  • Validate HSM key available; if needed, restore from latest backup.
  • Execute smoke tests; enable canary; observe SLOs.
  • Resume sealing; publish chain-continuation event.

After incident

  • Backfill: replay projections from HOT for gaps.
  • Compare pointer counts HOT↔WARM; validate signatures.
  • Post-mortem: RTO/RPO actuals, DLQ rate, error budget spent.
  • Rotate temporary flags back; scale down pilot-light if appropriate.

RTO/RPO per scenario

| Scenario | RTO target | RPO target | Notes |
|---|---|---|---|
| Zonal failure (within region) | ≤ 15 min | ~ 0 | ZRS + multi-zone node pools; AFD keeps region |
| Region outage (cross-region allowed) | 15–60 min | ≤ 5 min | AFD cutover + ASB alias + HOT replication lag |
| Region outage (strict residency pair) | 1–4 h | ≤ 15 min | Same-jurisdiction DR restore + replay WARM |
| AKS control-plane impairment | 30–60 min | ~ 0 | Shift to sibling cluster or re-create node pools |
| Key Vault/HSM incident | 1–2 h | ~ 0 | Restore HSM backup; dual-sign window if needed |

Drill cadence

  • Monthly: Partial replay from HOT in DR; verify WARM parity.
  • Quarterly: Full region failover exercise (AFD, ASB alias, HSM restore, sealing continuation).
  • After major changes: Re-validate object replication, SLO alerts, runbook steps.

With AFD-driven cutover, ASB Geo-DR, HOT WORM replication, and HSM key continuity, the platform maintains tamper-evident chains and meets practical RTO/RPO targets without violating residency constraints.


Cost & Capacity Guardrails (Azure-first)

This section bakes cost discipline into the deployment by making cost drivers visible, enforcing tenant/edition quotas, using export windows, and automating shrink/retention workflows—without sacrificing SLOs or tamper-evidence.

Per-service cost drivers

| Plane/Service | Primary cost drivers | Secondary drivers | Guardrails to apply |
|---|---|---|---|
| Gateway (AFD/APIM/Ingress) | APIM reqs/sec, AFD egress | TLS cert mgmt, WAF rules eval | Tight CORS; cacheable 4xx/5xx bodies; low-cardinality labels in logs |
| Ingestion | AKS CPU/mem; HOT Blob writes; ASB publishes | Azure Monitor ingestion | Cap RPS by tenant; batch writes; structured logs sampling |
| Policy | AKS CPU/mem; DB reads | KV/CSI mounts | Cache policy snapshots; low-TTL metrics histograms |
| Projection | AKS CPU/mem; DB writes; ASB consumes | Cosmos RU/SQL DTUs; Monitor | KEDA on backlog; batch upserts; throttle replay workloads |
| Query | DB reads; AKS CPU; cache misses | Search queries; Monitor | Result caching; read replicas/partition pruning; cap payload size |
| Integrity | HSM sign ops; HOT reads/writes | AKS CPU; Monitor | Off-peak schedules; batch sealing; reduce verification frequency under load (never below policy min) |
| Export | Blob egress + archive storage; HSM signs | AKS mem/disk; network | Export windows (nightly); max bytes/job; compress; per-tenant concurrency caps |
| Observability | Logs/Traces/Metrics ingestion & retention | Managed Grafana | Drop noisy fields; 90-day default retention; sampling for DEBUG |
| Messaging (ASB) | Topic/subscription ops; Premium MUs | DLQ depth | Right-size MUs; dedupe window; consumer prefetch tuning |
| Storage (HOT/WARM/COLD) | HOT: GB stored × replicas; WARM: DB size; COLD: Archive GB | Transactions | WORM retention by policy; lifecycle to Cool/Archive; partitioning for pruning |

Quotas & limits (per tenant/edition) + export windows

Config (values overlay)

quotas:
  tenant:
    defaults:
      maxIngestRps: 50
      maxConcurrentExports: 1
      maxDailyExportBytes: "10Gi"
      maxQueryRps: 30
    enterprise:
      maxIngestRps: 200
      maxConcurrentExports: 3
      maxDailyExportBytes: "100Gi"
      maxQueryRps: 100
  hardStops:
    exportJobMaxBytes: "50Gi"   # absolute cap
    queryMaxResponseBytes: "25Mi"

exportWindows:
  # Run heavy egress when bandwidth is cheap and user traffic low
  allowedCronLocal: "0 2 * * *"   # 02:00 local region time
  perTenantConcurrency: 1
  bandwidthBudgetMibps: 200       # cluster-wide cap

Enforcement points

  • Gateway: rate limit by x-tenant-id; reject over-limit with 429 + Retry-After.
  • Export: scheduler checks bytes/day & concurrency; defers outside window.
  • Query: enforce payload caps and pagination; optional per-tenant query RPS.
  • Projection: KEDA per-subscription scaler to isolate hot tenants.
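As a sketch of the gateway enforcement point, a per-tenant token bucket that answers 429 with a Retry-After hint once `maxIngestRps` is exhausted (illustrative Python; `TenantRateLimiter` is a hypothetical class, and real enforcement would live in an APIM policy or the ingress controller):

```python
import time

class TenantRateLimiter:
    """Token bucket keyed by x-tenant-id; over-limit calls get
    HTTP 429 plus a Retry-After hint (maps to quotas.tenant.maxIngestRps)."""

    def __init__(self, max_rps):
        self.max_rps = float(max_rps)
        self.tokens = {}  # tenant -> remaining tokens
        self.last = {}    # tenant -> last refill timestamp

    def allow(self, tenant_id, now=None):
        """Return (status, headers) for one request from tenant_id."""
        now = time.monotonic() if now is None else now
        tokens = self.tokens.get(tenant_id, self.max_rps)
        elapsed = now - self.last.get(tenant_id, now)
        tokens = min(self.max_rps, tokens + elapsed * self.max_rps)  # refill
        self.last[tenant_id] = now
        if tokens >= 1.0:
            self.tokens[tenant_id] = tokens - 1.0
            return 200, {}
        self.tokens[tenant_id] = tokens
        retry_after = max(1, int((1.0 - tokens) / self.max_rps + 0.999))
        return 429, {"Retry-After": str(retry_after)}
```

A hot tenant exhausts only its own bucket, so noisy neighbors cannot starve others at the gateway tier.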

Auto-shrink & retention compaction

  • WARM compaction: roll partitions by (tenantId, eventMonth), compress historical tables, and drop derived columns that can be rehydrated.
  • Lifecycle policies (Blob):
    • HOT → Cool after N days (policy-driven), COLD exports to Archive immediately.
    • Auto-expire temporary export staging containers after M days.
  • Observability retention:
    • Logs 30–90d (by env), metrics 90d, traces 7–30d; keep exemplars for P1 incidents tagged for 180d.
    • Sampling: 0% DEBUG, 10–20% INFO, 100% WARN/ERROR (server-side).
  • Scale-to-zero workers (Projection/Export/Integrity) when idle; cron-scaled for windows.
  • Search: prefer reindex on demand over long retention of large indexes; elastic alias swaps to limit downtime.
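The server-side sampling rule above can be sketched as (illustrative Python; the 15% INFO rate is an assumed value inside the stated 10–20% band):

```python
import random

# Assumed server-side rates from the retention bullets:
# 0% DEBUG, 15% INFO (inside the 10-20% band), 100% WARN/ERROR.
SAMPLE_RATES = {"DEBUG": 0.0, "INFO": 0.15, "WARN": 1.0, "ERROR": 1.0}

def should_emit(level, rng=None):
    """Decide whether one log record is exported to Azure Monitor."""
    rate = SAMPLE_RATES.get(level, 1.0)  # unknown levels pass through
    if rate >= 1.0:
        return True
    if rate <= 0.0:
        return False
    return (rng or random).random() < rate
```

Applying the decision server-side (before export) is what actually cuts Azure Monitor ingestion cost; client-side filters only reduce network chatter.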

Top 10 cost levers

  1. Export egress — biggest surprise line item: constrain by windows, compression, and bytes/day caps.
  2. Azure Monitor ingestion — prune log fields; sample non-errors; avoid high-cardinality labels (tenant-safe but minimal).
  3. Projection batch size — fewer DB round-trips; tune upsert bulk size before hitting RU/DTU throttles.
  4. ASB Premium MUs — right-size namespaces; consolidate topics; dedupe window to reduce dup processing.
  5. HOT retention & replication — align WORM duration with true policy; replicate only within jurisdiction; avoid unnecessary cross-region copies.
  6. Search index refresh — reduce refresh frequency; index only searchable fields; keep analyzers simple.
  7. Query cache — cache common queries; cap response size; push down filters to partitions.
  8. Node pools — run heavy jobs on np-io only; scale to zero off-window; right-size VM SKUs.
  9. APIM tiers — pick per-region capacity that matches real QPS; shift low-risk limits to Gateway to avoid APIM overage.
  10. Feature flags — disable expensive features (e.g., verify-on-read) for tiers/editions where not required.

Quick cost calculator (inputs & formulas)

Use this to estimate monthly order-of-magnitude. Plug into a spreadsheet with your actual Azure rates.

Inputs

  • T = tenants
  • R_d = records (events) per day (all tenants)
  • B_e = avg event bytes (raw payload)
  • O_hot = HOT overhead factor (manifest, hashes, signatures; e.g., 1.25)
  • Rep_hot = HOT replication multiplier (e.g., 2 for primary + DR)
  • P_proj = projection amplification (derived rows per event × avg bytes; e.g., 0.5 × B_e)
  • Q_d = queries per day; B_q = avg query response bytes
  • X_d = export bytes per day (post-compression)
  • Log_b = avg log bytes per request/event traced (after sampling)

Derived

  • HOT monthly GB: HOT_GB = (R_d × B_e × O_hot × Rep_hot × 30) / (1024^3)
  • WARM monthly GB (projections): WARM_GB = (R_d × P_proj × 30) / (1024^3)
  • Query egress monthly GB: Q_GB = (Q_d × B_q × 30) / (1024^3)
  • Exports monthly TB: X_TB = (X_d × 30) / (1024^4)
  • Observability GB: OBS_GB = ((R_d + Q_d) × Log_b × 30) / (1024^3)
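The derived formulas translate directly into a small calculator you can drop into a script or spreadsheet (illustrative Python; `monthly_estimates` is a hypothetical helper, and `P_proj_bytes` is the per-event projection bytes, i.e. P_proj already multiplied out):

```python
GiB, TiB = 1024 ** 3, 1024 ** 4

def monthly_estimates(R_d, B_e, O_hot, Rep_hot, P_proj_bytes, Q_d, B_q, X_d, Log_b):
    """Monthly order-of-magnitude volumes; multiply by your Azure unit rates."""
    return {
        "HOT_GB":  R_d * B_e * O_hot * Rep_hot * 30 / GiB,   # authoritative store
        "WARM_GB": R_d * P_proj_bytes * 30 / GiB,            # projections
        "Q_GB":    Q_d * B_q * 30 / GiB,                     # query egress
        "X_TB":    X_d * 30 / TiB,                           # export egress
        "OBS_GB":  (R_d + Q_d) * Log_b * 30 / GiB,           # telemetry ingestion
    }
```

For example, 1M events/day at 2 KiB each with O_hot = 1.25 and Rep_hot = 2 yields roughly 143 GB of HOT per month.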

Rules of thumb

  • Aim for O_hot ≤ 1.25 (keep manifests/metadata lean).
  • Keep P_proj as small as needed for SLOs (derive on read when possible).
  • Keep OBS_GB under a fixed budget; enforce sampling at Gateway and worker services.
  • Validate X_TB stays inside your egress budget; increase compression or shift to on-site pickup if exceeded.

Budget automation (signals → actions)

  • Budget breach early-warning: If 7-day projected cost > 90% of monthly budget, tighten quotas by edition (maxIngestRps, maxDailyExportBytes) and reduce KEDA maxReplicaCount for replay workloads.
  • Autoshrink trigger: When WARM size growth > 20% MoM, automatically compact old partitions and reduce index refresh.
  • Observability clamp: If OBS_GB trend exceeds budget, raise sampling, truncate overly long log fields, and disable verbose query logging.
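A sketch of the signal-to-action mapping (illustrative Python; thresholds mirror the bullets above, and the returned action names are hypothetical):

```python
def budget_actions(projected_7d_cost, monthly_budget,
                   warm_growth_mom, obs_gb, obs_budget_gb):
    """Map early-warning signals to guardrail actions (thresholds from the bullets)."""
    actions = []
    if projected_7d_cost > 0.9 * monthly_budget:
        actions.append("tighten-quotas")       # lower maxIngestRps / maxDailyExportBytes
    if warm_growth_mom > 0.20:                 # WARM growth > 20% month-over-month
        actions.append("compact-warm")         # compact old partitions, reduce index refresh
    if obs_gb > obs_budget_gb:
        actions.append("clamp-observability")  # raise sampling, truncate long fields
    return actions
```

Wiring this into a scheduled job (or an Azure Monitor action group) keeps the response automatic rather than a manual FinOps escalation.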

With explicit per-service drivers, tenant/edition quotas, lifecycle-based auto-shrink, and a simple calculator to size HOT/WARM/COLD and egress, you can keep ATP within a predictable spend envelope—even during growth or replay/DR events.


Ops Runbook Hooks & Change Management (Azure-first)

This section connects the deployment views to day-2 operations: how we verify a rollout, flip features safely, and govern changes to the topology.

Health checks, readiness, and smoke tests (post-rollout)

Kubernetes probes (standard)

livenessProbe:
  httpGet: { path: /healthz, port: 8080 }
  initialDelaySeconds: 20
  periodSeconds: 10
readinessProbe:
  httpGet: { path: /readyz, port: 8080 }   # mesh sidecar ready, KV-CSI mounted, SB/DB reachable
  initialDelaySeconds: 10
  periodSeconds: 5
startupProbe:
  httpGet: { path: /startupz, port: 8080 }
  failureThreshold: 30
  periodSeconds: 5

Post-deploy smoke (automated)

  • External: App Insights Availability Tests (Ping & Multistep) against Gateway canary routes.
  • Internal: synthetic append → project → query tracer (small tenant sandbox), asserts p95s and zero DLQ for 15–30 min.
  • Data plane sanity: HOT container write/read, ASB publish/consume, DB read/write minimal cycle.
  • Go/No-Go: promote traffic from 10% → 50% → 100% only if SLO burn-rate and smoke pass.
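The Go/No-Go gate for the synthetic tracer can be sketched as (illustrative Python; nearest-rank p95 with no interpolation):

```python
def go_no_go(latencies_ms, dlq_count, p95_budget_ms):
    """Promotion gate: synthetic-tracer p95 within budget and zero DLQ."""
    ordered = sorted(latencies_ms)
    rank = max(1, round(0.95 * len(ordered)))  # nearest-rank p95
    p95 = ordered[rank - 1]
    return p95 <= p95_budget_ms and dlq_count == 0
```

The same predicate runs at each promotion step (10% → 50% → 100%), so a regression halts the ramp instead of reaching full traffic.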

Change windows, feature flags, dark-launch toggles

  • Windows
    • staging: daily, region-local off-peak; prod: weekly window per region; emergency hotfixes allowed with RM+SRE approval.
  • Feature flags (examples)
    • features.sealing.enabled, features.verifyOnRead.default, features.export.caps.* — set via values overlays; read by services at startup and on refresh signal.
  • Dark-launch
    • Route a header-scoped slice (e.g., x-atp-experiment: vNext) through APIM/Ingress to a preview deployment.
    • Observe SLO deltas and logs; ramp only after stability window.
  • Runbook quick links
    • DLQ Drain & Resubmit → docs/runbooks/dlq-replay.md
    • Projection Replay from HOT → docs/runbooks/projection-replay.md
    • DR Cutover (AFD, ASB Alias, HSM Restore) → docs/runbooks/dr-failover.md
    • Integrity Key Rotation & Dual-Sign Window → docs/runbooks/key-rotation.md
    • Hotfix Rollback → docs/runbooks/rollback.md
    • Quota Breach Response (Cost Guardrails) → docs/runbooks/quotas.md

ADRs and proposing a topology change

  • ADR location: docs/adr/ (use log4brains or similar).
  • When an ADR is required: new region, new PaaS SKU/tier, mesh policy changes, egress to a new external service, storage/retention policy changes, or any control affecting security/DR/SLO/cost.
  • ADR template essentials
    • Context & goals; security & residency impact; SLO impact; cost delta; migration plan; rollback; monitoring plan; affected diagrams.
  • Process
    1. Draft ADR with diagrams and overlays changed.
    2. Add threat model note and cost estimate.
    3. Open PR tagged Architecture, Security, SRE for review.
    4. Pilot in staging with canary + smoke; attach results to ADR.
    5. Merge ADR; schedule production change window.

PR checklist for infra changes (mini)

  • OPA/Kyverno policy pass (PSS: restricted, no NodePort, resource limits set).
  • Ratify: image signature + SBOM verified.
  • Private Link endpoints in place for new PaaS; egress allow-list updated.
  • IAM least-privilege reviewed (RBAC roles scoped to resource/namespace).
  • Residency: data stays within allowed region/jurisdiction.
  • SLO impact assessed; canary plan + rollback defined.
  • Cost: estimate and budget tag updated (FinOps label).
  • Observability: dashboards/alerts updated; new signals documented.
  • Runbooks: added/updated; DR implications noted.
  • Security sign-off for edge/WAF/JWT/policy changes.
  • Backout plan verified (previous helm release or traffic revert script).
  • Communication: change ticket, stakeholder notice, and on-call briefed.

Deployment annotations & traceability

metadata:
  annotations:
    atp.io/change-id: "$(Build.BuildNumber)"
    atp.io/commit: "$(Build.SourceVersion)"
    atp.io/adr: "ADR-0021-topology-change"
    atp.io/runbook: "docs/runbooks/rollback.md"

  • Emit a deployment annotation event to Azure Monitor/Grafana so SLO charts show the change marker.

Rollback triggers (guardrails)

  • Fast burn-rate page during canary (e.g., error rate > 3.6% for 5 min) → auto-rollback.
  • p95 regressions (ingest > 500 ms or query > 800 ms for 5 min) → rollback.
  • Projection freshness p95 > 180 s for 10 min with no remediation → rollback.
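These guardrails reduce to a simple predicate the canary controller can evaluate each window (illustrative Python; the metric-window keys are hypothetical names for the aggregated 5–10 min canary metrics):

```python
def should_rollback(window):
    """Evaluate one aggregated canary window against the rollback guardrails."""
    return (
        window.get("error_rate", 0.0) > 0.036           # fast burn-rate page
        or window.get("ingest_p95_ms", 0) > 500         # ingest latency regression
        or window.get("query_p95_ms", 0) > 800          # query latency regression
        or window.get("proj_freshness_p95_s", 0) > 180  # projection staleness
    )
```

Missing metrics default to healthy here; a production controller would more likely treat absent telemetry itself as a rollback signal.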

Outcome: after each rollout you have verifiable health, controlled feature exposure via flags/dark-launch, governed changes through ADRs, and a repeatable PR checklist that keeps security, SLOs, cost, and DR aligned with the deployment views.