Deployment Views - Audit Trail Platform¶
Purpose, Scope & Reader Map¶
Azure-first assumption. Every deployment view and example in this page targets Microsoft Azure as the primary platform (AKS, ACR, Key Vault, Service Bus, Front Door/WAF, Entra ID Workload Identity, Azure Monitor). Portability tips may be noted inline, but Azure is the baseline.
What a Deployment View is (and isn’t)¶
A deployment view shows how the Audit Trail Platform (ATP) actually runs in the cloud: the runtime topology of clusters, namespaces, services, data stores, networks, identities, and control planes per environment/region. It focuses on where components live, how they’re wired (ingress/egress, identity, secrets, policies), and what non-functional controls are enforced (observability, resilience, security, compliance).
A deployment view is not:
- A business or domain model (see Architecture/Components/Data Model).
- A sequence-of-calls or behavior spec (see Sequence Flows).
- A CI/YAML how-to or runbook (see CI/CD & Ops docs referenced later).
Reader Map — “Read this if you…”¶
- SRE / DevOps: Need to know where services run (AKS/namespace), how traffic enters (Front Door/WAF → APIM/Ingress), how they scale (HPA/KEDA), which queues/topics they use (ASB), and how to observe them (OTel → Azure Monitor/Grafana).
- Security / Compliance: Need trust boundaries, mTLS & RBAC/ABAC points, secret & key custody (Key Vault, KMS), WORM/immutability for evidence, tenancy isolation, and residency.
- Solution / Enterprise Architects: Need environment overlays (dev/qa/stage/prod), regional variants, shared services (APIM, ASB, KV, ACR), and DR/failover posture.
- Backend Developers: Need service placements, env vars/secret sources, message contracts via topics/queues, and feature flag attachment points.
- Data/Analytics: Need to see authoritative stores vs. projections, export surfaces, and retention/immutability constraints that impact analytics jobs.
Scope of this Page¶
- Included: Environment/region topologies; edge and networking paths; identity & secrets; data plane tiers; messaging/DLQ/replay; scale policies; observability hooks; security controls; DR; cost guardrails.
- Excluded: Detailed API definitions, domain models, end-to-end behavior flows, and step-by-step runbooks (these are cross-linked).
Diagram Legend & Conventions (used across all deployment views)¶
Abbreviations
| Abbrev | Azure Service / Concept | Notes |
|---|---|---|
| AKS | Azure Kubernetes Service | Primary compute/orchestration plane |
| APIM | Azure API Management | Optional/edge API gateway (alt: NGINX Ingress) |
| AFD | Azure Front Door + WAF | Global edge, WAF rules, TLS |
| ASB | Azure Service Bus | Topics/queues/DLQ, idempotent replay |
| KV | Azure Key Vault | Secrets/keys; CSI driver in pods |
| ACR | Azure Container Registry | Signed images/SBOMs |
| AAD/WI | Entra ID & Workload Identity | Pod-level identity (no secrets in env) |
| OTel | OpenTelemetry | Traces/metrics/logs export |
| AM/LA/AppI | Azure Monitor / Log Analytics / App Insights | Observability backends |
| BLOB (IMM) | Azure Blob Storage (immutability) | WORM for evidence/segments/exports |
| SQL/COS/RED | Azure SQL / Cosmos DB / Redis | Authoritative stores, projections, caching |
| KEDA/HPA | Event/metrics-based autoscaling | Queue depth, CPU, RPS, projection lag |
| NP/PSS | NetworkPolicy / Pod Security Standards | East-west control & hardening |
Notation & Styling
- Namespaces: atp-<domain>-ns (e.g., atp-ingest-ns, atp-query-ns).
- Resource naming: atp-<svc>-<env>-<region> (e.g., atp-ingest-prod-eus).
- Trust boundaries: large boxes with a bold border labeled “Edge”, “Cluster”, “Data Plane”.
- Control-plane vs data-plane: dashed lines for control-plane calls (e.g., metrics/identity), solid for data paths.
- Secrets/keys: key/lock glyphs near pods using KV CSI; identities labeled WI:.
- Tenancy markers: tenantId tag shown on storage/index/messaging resources; partitioning called out explicitly.
Cross-References (read these alongside Deployment Views)¶
- High-Level Design (HLD): overall capabilities and quality attributes → hld.md
- Architecture Overview: logical components and bounded contexts → architecture.md
- Components: per-service responsibilities & contracts → components.md
- Sequence Flows: hot paths, back-pressure & replay steps → sequence-flows.md
- Data Model: entities, partitions, residency flags → data-model.md
- Use Cases: operator/compliance scenarios this topology must satisfy → use-cases.md
With these conventions in place, subsequent sections (Environments & Release Trains, Base Topology, Networking & Edge, etc.) will reuse the same legend and Azure-first primitives to keep every diagram and table consistent.
Environments & Release Trains (Azure-first)¶
Environments¶
- preview — short-lived per-PR environments (AKS namespace or isolated RG) for UI/UX review and early e2e checks.
- dev — shared integration playground; fast iteration; feature flags on; relaxed quotas.
- qa — system verification; stable datasets; cross-service tests; load and chaos rehearsals off-hours.
- staging — prod-like (same SKUs/quotas); change-freeze windows enforced; canary rehearsals; DR drills.
- prod — multi-region, compliance & observability hardening on; WORM/immutability fully enforced.
Azure baseline: AKS + ACR + Key Vault (CSI), Front Door/WAF → APIM/Ingress, Service Bus, Azure Monitor (App Insights/Log Analytics), Entra ID Workload Identity.
Promotion lanes¶
- Mainline train — main → dev → qa → staging → prod. Standard feature flow with automated checks at every hop and staged rollout (canary → region → global).
- Hotfix train — hotfix/* → staging (ring-0) → prod. Minimal blast radius, expedited checks (security + smoke + SLO guardrails), post-deploy follow-up to qa/dev.
Regional variants¶
- Target codes (examples): us (eastus), eu (westeurope), il (israelcentral).
- dev/qa: typically single region (us) to contain cost.
- staging/prod: per-region with identical topology; prod may run active-active for ingest/query.
Risk gates (examples)¶
- Pre-deploy: image signing (ACR + Cosign), Defender for Cloud scan, IaC policy check (Bicep/OPA), SBOM presence.
- Deploy-time: Azure DevOps Environment checks (required reviewers, change ticket link), maintenance window tag, migration dry-run.
- Post-deploy: synthetic smoke (App Insights), SLO burn-rate guardrails (p95 ingest latency/error rate), auto-rollback on breach, DLQ drift watch.
Promotion overview (lanes & rings)¶
flowchart LR
subgraph Mainline
A[main commit] --> P[preview]
P --> D[dev]
D --> Q[qa]
Q --> S[staging]
S --> C1{canary 5–10%}
C1 -->|pass| R1[prod us]
R1 --> R2[prod eu]
R2 --> R3[prod il]
C1 -->|fail| Rb1[rollback]
end
subgraph Hotfix
H[hotfix/*] --> S2[staging - ring-0]
S2 --> C2{canary 5–10%}
C2 -->|pass| P1[prod target region]
C2 -->|fail| Rb2[rollback]
end
Matrix — Environment × Region × Release Train × Approval Gates¶
| Environment | Regions in scope | Release train(s) | Approval & checks (Azure DevOps Environments) |
|---|---|---|---|
| preview | Same region as target cluster (usually us) | Mainline (per PR) | Auto only: build + unit/integration tests, image signed, IaC policy ok, ephemeral namespace/RG cleanup registered |
| dev | us | Mainline | Auto: build + tests + vulnerability scan + KV/CSI mount check + Service Bus topic reachability; no manual approvers |
| qa | us (optionally eu) | Mainline | Auto: contract tests, e2e workflows, load smoke; Manual: QA owner if schema migration present; SLO guardrails simulated |
| staging | us, eu, il | Mainline & Hotfix | Manual: Release Manager + Security; Auto: canary rehearsal, synthetic smoke, data-migrations dry-run, feature flags staged |
| prod | us, eu, il | Mainline & Hotfix | Manual: RM + On-call SRE (2-eyeballs); Auto: WAF policy sync, key roll check, canary 5–10% + SLO burn-rate; auto-rollback + incident stub if breached |
Notes
- SLO guardrails (examples): ingest p95 ≤ 300 ms; query p95 ≤ 500 ms; error rate ≤ 1%; consumer lag ≤ 30 s. Breach during canary → automatic rollback and DLQ snapshot.
- Change types: data-shape changes require staging soak ≥ 24h; security policy changes require Security reviewer even on hotfix.
- Regional cadence: promote us → eu → il with observation windows; emergency hotfix may target a single region first.
This section defines where code can land, how it moves, which regions participate, and what gates enforce safety, so the following topology sections can reference these lanes without re-explaining the mechanics.
Base Topology (Kubernetes + Mesh) — Azure-first¶
This section anchors the runtime topology of the Audit Trail Platform (ATP) on AKS with a service mesh (mTLS-by-default), Azure edge, and core Azure dependencies. It acts as the base layer that later sections (networking, data plane, scaling, security) reference.
Cluster overview (rings, pools, namespaces)¶
- AKS rings:
- system ring (managed components, mesh control plane, OTel collector).
- user ring (all ATP workloads).
- Node pools (typical):
  - np-system (small, reserved for control/system).
  - np-generic (stateless web/API pods: Gateway, Query, Admin).
  - np-io (I/O-heavy: Ingestion, Projection, Export, Integrity).
  - Optional np-jobs (cron/maintenance/export windows).
- Namespaces (examples): atp-gateway-ns, atp-ingest-ns, atp-policy-ns, atp-projection-ns, atp-query-ns, atp-integrity-ns, atp-export-ns, atp-admin-ns.
C4-style deployment diagram (edge → AKS → Azure services)¶
flowchart TB
%% Edge
subgraph EDGE["Azure Edge"]
AFD["AFD + WAF"]
APIM["API Management (optional)"]
end
%% AKS Cluster
subgraph AKS["AKS Cluster (mTLS via Mesh)"]
direction TB
subgraph SYS["system ring / namespaces"]
OTL["OTel Collector (DaemonSet)"]
MESHCP["Mesh control plane"]
end
subgraph GWNS["ns: atp-gateway-ns"]
GW["Gateway Pod<br/>(sidecars: Envoy, KV-CSI, OTel)"]
end
subgraph INGNS["ns: atp-ingest-ns"]
ING["Ingestion Pod<br/>(sidecars: Envoy, KV-CSI, OTel)"]
end
subgraph POLNS["ns: atp-policy-ns"]
POL["Policy Pod<br/>(sidecars: Envoy, KV-CSI, OTel)"]
end
subgraph PRJNS["ns: atp-projection-ns"]
PRJ["Projection Pod<br/>(sidecars: Envoy, KV-CSI, OTel)"]
end
subgraph QRYNS["ns: atp-query-ns"]
QRY["Query Pod<br/>(sidecars: Envoy, KV-CSI, OTel)"]
end
subgraph INTNS["ns: atp-integrity-ns"]
INT["Integrity Pod<br/>(sidecars: Envoy, KV-CSI, OTel)"]
end
subgraph EXSNS["ns: atp-export-ns"]
EXP["Export Pod<br/>(sidecars: Envoy, KV-CSI, OTel)"]
end
subgraph ADMNS["ns: atp-admin-ns"]
ADM["Admin Pod<br/>(sidecars: Envoy, KV-CSI, OTel)"]
end
end
%% Azure backing services
subgraph AZSVC["Azure Services"]
ASB["Service Bus (topics/queues/DLQ)"]
KV["Key Vault<br/>+ CSI driver"]
BLOB["Blob Storage (WORM)"]
DB["Azure SQL / Cosmos DB"]
MON["Azure Monitor / App Insights / Log Analytics"]
end
%% Traffic & dependencies
AFD --> APIM
APIM --> GW
GW --> ING
GW --> QRY
ING --> ASB
ASB --> PRJ
QRY --> DB
PRJ --> DB
INT --> BLOB
EXP --> BLOB
GW --> MON
ING --> MON
QRY --> MON
PRJ --> MON
INT --> MON
EXP --> MON
ADM --> MON
GW -. secrets .-> KV
ING -. secrets .-> KV
POL -. secrets .-> KV
PRJ -. secrets .-> KV
QRY -. secrets .-> KV
INT -. secrets .-> KV
EXP -. secrets .-> KV
ADM -. secrets .-> KV
Core pods and responsibilities¶
| Pod | Role in topology | Primary dependencies |
|---|---|---|
| Gateway | Public/API entry; authN/Z; request shaping; tenancy guards; version routing | AFD/WAF → APIM/Ingress, Key Vault (certs), Azure Monitor |
| Ingestion | Append-only intake; schema validation; outbox → ASB | Service Bus (topics), Key Vault, OTel |
| Policy | Policy resolution/decisions (classification, retention, redaction plans) | DB (policy store), KV, OTel |
| Projection | Build/read models & search indexes from events | ASB (subscribe), DB (projections), OTel |
| Query | Tenant-scoped queries with verify-on-read options | DB (authoritative/projections), BLOB (evidence, when applicable) |
| Integrity | Seal/verify segments, hash-chains, Merkle roots; publish proofs | BLOB (WORM), KV (signing keys), OTel |
| Export | Egress pipelines (signed exports, legal holds, redaction applied) | BLOB (export), ASB (jobs), OTel |
| Admin | Ops/Config UX, feature flags, maintenance hooks | DB (config), KV, OTel |
Sidecars and daemons¶
| Component | Placement | Purpose |
|---|---|---|
| Envoy/mesh sidecar | Every app pod | mTLS, retries/timeouts, policy enforcement, telemetry taps |
| KV CSI driver | Every app pod needing secrets/keys | Mount short-lived secrets; avoid env var secrets; rotation-friendly |
| OTel agent | Sidecar (per pod) or node DaemonSet | Trace/metric/log export to Azure Monitor backends |
| Log shipper (optional) | Sidecar/DaemonSet | Structured logs to Log Analytics with tenant/edition tags |
| Mesh control plane | System namespace | Certificate issuance, identity, traffic policy distribution |
| OTel Collector | System namespace (DaemonSet) | Centralize/transform telemetry; batching and export |
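If the Collector is managed by the OpenTelemetry Operator, a minimal DaemonSet-mode sketch could look like the following (the Operator itself and the azuremonitor exporter from the contrib distribution are assumptions; the atp-system namespace matches the collector endpoint referenced later in this page):
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector
  namespace: atp-system
spec:
  mode: daemonset              # one collector instance per node
  config: |
    receivers:
      otlp:
        protocols:
          grpc: {}
          http: {}
    processors:
      batch: {}
    exporters:
      azuremonitor:
        connection_string: "${APPLICATIONINSIGHTS_CONNECTION_STRING}"
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [azuremonitor]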
Ingress path and mesh¶
- External ingress: AFD + WAF → APIM (or direct NGINX/Envoy Ingress) → Gateway.
- East-west: all service-to-service flows run inside the mesh with mTLS, RBAC/ABAC at the Gateway and per-service boundaries.
- Egress: deny-by-default, egress policies only to ASB/KV/DB/BLOB/Monitor endpoints.
Stateful anchors (data plane)¶
- Authoritative stores: Azure SQL / Cosmos DB (tenant-partitioned).
- Event transport: ASB topics/queues with DLQ per subscription; idempotent replay.
- Evidence & segments: Blob WORM containers with lifecycle policies and legal holds.
- Observability: OTel → Azure Monitor/App Insights/Log Analytics; dashboards in Grafana (optional).
This base topology is the “map” subsequent sections will annotate with networking controls, scaling triggers, security policies, DR paths, and cost guardrails.
Networking & Edge¶
This section codifies the public front-door, API edge, and east–west policies for ATP on Azure. It assumes Azure Front Door (AFD) + WAF at the edge, API Management (APIM) (or direct Ingress), and a service mesh with mTLS inside AKS.
Deployment diagram — Internet → Edge → Gateway → Mesh¶
flowchart TB
user[Internet Clients]
subgraph EDGE["Azure Front Door (AFD) + WAF"]
waf[Managed rules + custom rules<br/>HSTS, TLS, geo/IP filters]
dns[Azure DNS<br/> - Apex/CNAME to AFD]
end
subgraph APIEDGE["API Edge"]
apim[API Management<br/> rate-limit-by-key, version routing,<br/>JWT validation, request shaping]
or[(or)]
ing[Ingress Controller - NGINX/Envoy<br/>TLS passthrough/termination]
end
subgraph AKS["AKS (Service Mesh mTLS)"]
gw[Gateway Pod]
subgraph eastwest["East–West (mTLS, RBAC/ABAC, NetworkPolicies)"]
ingest[Ingestion]
policy[Policy]
proj[Projection]
query[Query]
integ[Integrity]
export[Export]
admin[Admin]
end
end
user -->|HTTPS :443| dns --> waf --> apim
waf -.->|CNAME| dns
apim -->|mTLS/TLS| gw
user -. alt path .-> waf --> ing
ing --> gw
gw --> ingest
gw --> query
gw --> policy
ingest <--> proj
proj --> query
gw --> admin
gw --> export
gw --> integ
Edge (public) controls¶
- WAF at AFD: Managed rule set + custom rules (blocklists, geo-IP), bot protection, anomaly scoring; headers normalized at edge.
- TLS: Terminate at AFD with managed/Key Vault-backed certs; re-encrypt to APIM/Ingress; mTLS enforced inside mesh.
- HSTS: max-age=31536000; includeSubDomains; preload at the edge.
- DNS: Azure DNS apex → AFD (CNAME). Use CAA records for allowed CAs. Region-specific subdomains optional (e.g., api-eu.example.com).
- CORS stance: Default deny. Explicit allow-list per SPA/portal origin; short preflight cache (60–120s); no wildcard with credentials.
- Rate limiting (public):
  - At APIM: rate-limit-by-key on {tenantId|clientId|subscriptionKey}; burst & sustained windows; 429 with Retry-After.
  - Optional at AFD: simple per-IP throttles for volumetric abuse before APIM.
- Version routing:
  - Header (x-api-version), path (/v1/…), or revision (APIM); map to canary rings (see deployment lanes).
API Gateway & request shaping¶
- AuthN/Z: JWT verification at APIM (client apps) and tenancy guards at Gateway (ABAC on tenantId, edition).
- Request policies: schema/size limits, idempotency-key normalization, anti-replay nonce for append endpoints.
- Canary routing: APIM or Ingress splits to vNext Gateway by header/percentage; ringed rollout (5–10% → 50% → 100%); see the Ingress canary sketch below.
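A minimal sketch of header/percentage canary splitting with the NGINX Ingress controller, assuming that alternative is in use in front of the Gateway (host, service name, and the 10% weight are illustrative):
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: gateway-canary
  namespace: atp-gateway-ns
  annotations:
    # Mark this Ingress as the canary twin of the primary gateway Ingress
    nginx.ingress.kubernetes.io/canary: "true"
    # Route ~10% of traffic to the vNext Gateway service
    nginx.ingress.kubernetes.io/canary-weight: "10"
    # Header match takes precedence over the weight, allowing explicit opt-in
    nginx.ingress.kubernetes.io/canary-by-header: "x-canary"
spec:
  ingressClassName: nginx
  rules:
    - host: api.example.com                 # illustrative host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: gateway-vnext          # hypothetical vNext Gateway service
                port:
                  number: 8080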
East–west policies (inside AKS + mesh)¶
- mTLS-by-default: All pod-to-pod traffic via mesh Envoy; strong identities (Entra Workload Identity/SPIFFE).
- NetworkPolicies:
- Default deny per namespace; allow only from Gateway, required system pods, and specific producer/consumer pairs.
  - Cross-namespace traffic is allow-listed (e.g., atp-gateway-ns → atp-query-ns on service ports only).
- Service exposure: ClusterIP internally; no NodePort/LoadBalancer for app services (ingress-only).
- Egress controls:
- Deny-by-default; allow only to Private Link endpoints for Service Bus, Key Vault, Storage, SQL/Cosmos, Monitor.
- Optional Azure Firewall/NVA with FQDN tags; egress proxy for audited outbound HTTP if required.
Example policy snippets¶
APIM (rate limit by tenant)
<inbound>
<validate-jwt header-name="Authorization" failed-validation-httpcode="401" />
<rate-limit-by-key calls="300" renewal-period="60" counter-key="@(context.Request.Headers.GetValueOrDefault("x-tenant-id","anon"))" />
<set-header name="x-request-id" exists-action="override">
<value>@(Guid.NewGuid().ToString())</value>
</set-header>
</inbound>
Kubernetes NetworkPolicy (deny all, allow Gateway → Query)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-gateway-to-query
namespace: atp-query-ns
spec:
podSelector:
matchLabels:
app: query
policyTypes: [Ingress, Egress]
ingress:
- from:
- namespaceSelector:
matchLabels: { name: atp-gateway-ns }
podSelector:
matchLabels: { app: gateway }
ports:
- protocol: TCP
port: 8080
egress: [] # default deny; mesh handles sidecar-to-sidecar
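A companion namespace-wide default-deny policy can back the allow-list above; a minimal sketch follows (the DNS egress allowance is an assumption so that permitted egress targets can still resolve):
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: atp-query-ns
spec:
  podSelector: {}               # applies to every pod in the namespace
  policyTypes: [Ingress, Egress]
  ingress: []                   # nothing allowed until an explicit allow policy matches
  egress:
    - to:
        - namespaceSelector: {}  # permit cluster DNS so allowed egress can resolve names
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
Because NetworkPolicies are additive, the allow-gateway-to-query rule above still opens that specific path on top of this default deny.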
Traffic paths (summarized)¶
- Internet → Edge: Client → Azure DNS → AFD/WAF (TLS, HSTS, WAF rules).
- Edge → API: AFD → APIM (rate limit, JWT, version routing) or AFD → Ingress (TLS passthrough/termination).
- API → Gateway: APIM/Ingress → Gateway (mTLS at mesh boundary, tenancy guards).
- Gateway → Services: Gateway → {Ingestion, Query, Policy, …} over mTLS, constrained by NetworkPolicies.
- Service egress: Only to Private Link PaaS endpoints (ASB/KV/Storage/DB/Monitor) under explicit egress policies.
These controls ensure a hardened, deterministic path from the public Internet to tenant-scoped services with layered defense: WAF → APIM/Ingress → Gateway → mTLS mesh, with default-deny network policies and Private Link egress.
Identity, Secrets & KMS (Azure-first)¶
This section defines how workloads authenticate (no long-lived secrets), how secrets are delivered (ephemeral mounts, rotation), and how signing keys (integrity proofs, export packaging) are owned, rotated, and backed up.
Principles¶
- Identity over secrets. Prefer Entra ID Workload Identity (federated OIDC) and RBAC to access PaaS (Service Bus, Storage, SQL/Cosmos) — no connection strings or SAS where possible.
- Short-lived & mount-only. If a secret is unavoidable, deliver via Key Vault CSI Driver as files (tmpfs), not env vars; keep TTL short and rotate automatically.
- HSM for signing keys. Integrity/export signing keys live in Azure Key Vault Managed HSM with versioned rotation and backup/restore packages.
- Purge protection & immutable evidence. KV has soft-delete + purge protection; signed artifacts land in Blob WORM with legal holds.
Workload Identity (pods → Entra ID)¶
- Each AKS ServiceAccount is mapped to an Entra application with a federated credential (issuer = cluster OIDC, subject = SA).
- Pod retrieves an OIDC token → exchanges for an Entra access token → calls Azure APIs (Key Vault, Service Bus, Storage) using RBAC role assignments.
- The mesh handles mTLS between pods; mesh identities can optionally follow SPIFFE-style IDs for service-to-service policy.
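A minimal sketch of the Kubernetes side of this wiring (the client ID, ServiceAccount name, and image reference are placeholders; the federated credential itself is created on the Entra application against the cluster's OIDC issuer):
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ingestion-sa                       # hypothetical SA for the Ingestion pods
  namespace: atp-ingest-ns
  annotations:
    # Client ID of the Entra application holding the federated credential
    azure.workload.identity/client-id: "00000000-0000-0000-0000-000000000000"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ingestion
  namespace: atp-ingest-ns
spec:
  selector:
    matchLabels: { app: ingestion }
  template:
    metadata:
      labels:
        app: ingestion
        # Opts the pod into token projection by the workload identity webhook
        azure.workload.identity/use: "true"
    spec:
      serviceAccountName: ingestion-sa
      containers:
        - name: ingestion
          image: acr.example/atp-ingestion:latest   # illustrative image reference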
Secrets delivery (Key Vault + CSI)¶
- Source: Azure Key Vault (KV) secrets/certs; consumption: mounted via CSI as files into pods.
- Rotation:
- App secrets (HMAC, webhooks, OAuth client secrets): 60–90 days.
- TLS certs (Ingress/APIM/AFD): managed renewal (CA) or ≤90 days if BYOC.
- Per-tenant salts/keys (tokenization): 90–180 days with overlap window.
- Reload: Sidecars or apps watch mounted paths and reload without restart where possible (SIGHUP/hot reload); otherwise rolling restart on secret version change.
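A minimal SecretProviderClass sketch for the Key Vault CSI delivery described above (vault name, tenant ID, and the secret name are placeholders; access is via the pod's workload identity rather than a connection string):
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: atp-gateway-secrets
  namespace: atp-gateway-ns
spec:
  provider: azure
  parameters:
    clientID: "00000000-0000-0000-0000-000000000000"   # workload identity client ID
    keyvaultName: "kv-atp-prod-us"
    tenantId: "11111111-1111-1111-1111-111111111111"
    objects: |
      array:
        - |
          objectName: webhook-hmac            # hypothetical secret name
          objectType: secret
Pods then reference it with a csi volume (driver secrets-store.csi.k8s.io, volumeAttributes.secretProviderClass: atp-gateway-secrets) and a read-only volumeMount, so the secret appears as a file on tmpfs rather than an environment variable.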
Integrity & export signing keys (Managed HSM)¶
- Key types: RSA-3072 (or ECC P-256) signing keys stored in Managed HSM; operations performed in-HSM.
- Rotation strategy: Staged dual-sign (old+new) for N days → flip trust root → revoke old. Maintain key versions and kid in proofs.
- Backup: HSM backup package to secured Storage Account with private endpoint, encrypted at rest with CMK. Tested restore drills.
- Access: Integrity and Export services get sign permission via role assignments; no get access to key material.
Secret/Key catalog¶
| Secret/Key | Owner | Rotation | Consumer(s) | Storage Class |
|---|---|---|---|---|
| Workload Identity (pod → Entra) | Platform/SRE | N/A (token TTL ≤60m) | All pods | Entra ID federated credential; RBAC on target resources |
| Gateway TLS cert (edge) | Platform/Sec | ≤90d (managed renewal) | AFD/APIM/Ingress/Gateway | Key Vault Certificate (soft-delete + purge protect); delivered to edge/ingress |
| Mesh mTLS certs | Platform/SRE | 30–90d (auto by mesh CA) | All app pods (Envoy) | Mesh CA store (control plane); root anchored in KV/MHSM if BYOC |
| Webhook HMAC secrets | App Team | 60–90d | Gateway/Export | KV Secret → CSI mount (tmpfs), no env vars |
| Per-tenant tokenization salt/key | Security | 90–180d (overlap window) | Ingestion/Redaction | KV Secret (scoped per tenant/edition) → CSI |
| Integrity signing key | Security/Compliance | 180–365d (dual-sign rollout) | Integrity service | Managed HSM Key (sign only); HSM backup package |
| Export package signing key | Security/Compliance | 180–365d | Export service | Managed HSM Key (sign only); HSM backup package |
| Storage Account CMK (SSE-CMK) | Security | 365d (auto-rotate) | Blob (WORM) / archival stores | KV Key bound to Storage; rotation via key version |
| DB access | Platform/SRE | N/A | Services using SQL/Cosmos | AAD auth (no secrets); RBAC roles, Private Link |
| Service Bus access | Platform/SRE | N/A | Ingestion/Projection/Export | AAD RBAC (no SAS); fallback SAS in KV if required (≤30d) |
Storage Class legend: Entra (federated identity + RBAC), KV Secret/Certificate (soft-delete & purge protection), Managed HSM Key (in-HSM ops only), CSI mount (ephemeral files, tmpfs), SSE-CMK (storage encryption with customer-managed key).
Signer usage flow (Integrity service)¶
sequenceDiagram
autonumber
participant INT as Integrity Pod
participant WI as Entra Workload Identity
participant KV as Key Vault / Managed HSM
participant BLOB as Blob Storage (WORM)
participant MON as Azure Monitor
INT->>WI: Request AAD token (federated OIDC from SA)
WI-->>INT: Access token (scope: https://vault.azure.net)
INT->>INT: Compute segment root (Merkle) & digest payload
INT->>KV: Sign(digest) with keyId (kid=v2025-10-01)
KV-->>INT: Signed blob (signature)
INT->>BLOB: Write {manifest, root, signature, kid} to WORM container
INT->>MON: Emit audit/trace (sign op id, kid, segmentId)
Note over INT,KV: Key material never leaves the HSM; the sign operation runs in-HSM
Operational policies & runbooks (highlights)¶
- Emergency rotation (signing keys): generate new version, enable dual-sign immediately, update trust config in Gateway/Query verification, revoke old after N days; publish advisory event on ASB.
- KV hygiene: soft-delete and purge protection must be on; require Defender for Cloud checks before deploy; access policies managed via RBAC, not legacy ACLs.
- Secret sprawl control: quarterly scan for env var secrets, SAS tokens, or connection strings; replace with AAD RBAC patterns.
- Backups: HSM backup/restore drills quarterly; verify WORM containers’ retention & legal hold settings align with compliance.
With these controls, workloads authenticate using identity, secrets are short-lived and mounted, and critical signing keys are protected by Managed HSM with auditable rotation and backup.
Data Plane (Hot/Warm/Cold) & Residency (Azure-first)¶
This section places the authoritative stores and tiers (hot/warm/cold), explains tenancy partitioning and residency, and sets backup/restore objectives that align with compliance (immutability, legal hold) and SLOs.
Tiered storage diagram¶
flowchart LR
subgraph REGION_US["Region: US (example)"]
direction TB
subgraph HOT["HOT — Append & Evidence (Immutable)"]
HOTBLOB["Blob Storage (WORM)<br/>container: atp-{tenant}-hot<br/>objects: segments, manifests, roots<br/>Retention: time-based + legal hold"]
HOTIDX["Hot Index (SQL/Cosmos)<br/>segment catalog & pointers<br/>RLS/PK=tenantId"]
end
subgraph WARM["WARM — Projections & Query"]
PROJDB["Azure SQL / Cosmos DB<br/>projections/read models<br/>RLS/PK=tenantId + time"]
SEARCH["Azure AI Search (optional)<br/>full-text/index aliases per region"]
CACHE["Redis (optional)<br/>query cache, TTL-scoped"]
end
subgraph COLD["COLD — Exports & Archives"]
XPORT["Blob Storage (Archive/Cool)<br/>Signed export packages<br/>Legal hold capable"]
META["SQL/Cosmos<br/>export registry (hash, kid, location)"]
end
ING["Ingestion Service"]
INT["Integrity Service"]
QRY["Query Service"]
EXP["Export Service"]
ING --> HOTBLOB
ING --> HOTIDX
INT --> HOTBLOB
HOTIDX --> PROJDB
QRY --> PROJDB
PROJDB --> SEARCH
EXP --> XPORT
EXP --> META
end
note1((Private Link)):::note
HOTBLOB --- note1
PROJDB --- note1
XPORT --- note1
classDef note fill:#fff,stroke:#999,stroke-dasharray:5 5,color:#666
Tiers & stores¶
- HOT (authoritative, immutable)
  - What: Append-only segments, manifests, Merkle roots, signatures written by Ingestion/Integrity.
  - Where: Azure Blob Storage containers with immutability (WORM) and time-based retention; legal hold supported.
  - Access: AAD RBAC via Private Link; no SAS; signed artifacts include kid and digest.
  - Index: Minimal hot index (SQL/Cosmos) for pointers: {tenantId, window, segmentId, blobUrl, hash, kid}.
- WARM (operational read models)
  - What: Projections/read models derived from hot segments; optionally search for text facets.
  - Where: Azure SQL (rowstore/columnstore per table) or Cosmos DB (/tenantId partition, time bucketing).
  - Access: Query service with RLS (SQL) or tenant-scoped queries (Cosmos). Redis optional for hot keys.
  - Rebuild: Deterministic replay from HOT via Projection workers.
- COLD (egress & long-term)
  - What: Signed export packages (ZIP/TAR + manifest + signature), audit bundles for eDiscovery.
  - Where: Blob in Cool/Archive tier; per-tenant containers if legal holds vary.
  - Index: Export registry in SQL/Cosmos (hashes, time, requester, hold flags).
Residency & RLS at each boundary¶
- Regional scoping
  - Per-region accounts: us, eu, il have separate Storage/DB/Search to enforce residency.
  - Prod topology: active-active ingest/query per region; no cross-region replication for EU/IL data unless explicitly allowed by policy.
  - DR: use ZRS in-region; optional same-jurisdiction DR account (EU→EU pair). Avoid GRS across residency boundaries.
- Tenancy partitioning
  - Blob: container per tenant (e.g., atp-{tenantId}-hot) to enable independent legal holds and retention policies.
  - SQL: Row-Level Security with predicate on tenantId; partitioning by (tenantId, eventMonth) for pruning.
  - Cosmos: Partition key /tenantId; composite indexes (tenantId, ts); per-tenant RU baselines (autoscale).
  - Search: index aliases per region, fields include tenantId; optional per-tenant index for isolation at scale.
- Access controls
- AAD RBAC only; Private Link endpoints for Storage/SQL/Cosmos/Search.
- No direct client reads from HOT; reads via Query with verify-on-read (rehash + signature check) when enabled.
Backup & restore objectives (RPO/RTO)¶
| Tier | Data set | Backup/Recovery approach | Target RPO | Target RTO |
|---|---|---|---|---|
| HOT – Blob (WORM) | Segments, manifests, roots, signatures | Immutability + versioning; secondary immutable account in same region (periodic copy); integrity sweeps (hash verify) | ~0 (append is authoritative) | ≤ 2h (account/container re-point + key trust check) |
| HOT Index – SQL/Cosmos | Segment catalog pointers | PITR (SQL), Continuous backup (Cosmos) + rebuild from HOT if needed | ≤ 5 min (or rebuild) | ≤ 4h (replay pointers) |
| WARM – Projections | Read models | PITR + replay from HOT via Projection workers | ≤ 15 min (source available) | ≤ 8h (depends on replay volume) |
| SEARCH | Full-text index | Rehydrate from WARM/HOT; store only config | N/A (derived) | ≤ 12h (reindex window) |
| COLD – Exports | Signed export packages | Copy to Archive with immutability; registry DB PITR | ≤ 24h (batch) | ≤ 24–48h (rehydrate + verify signatures) |
Notes
- Rebuild-first strategy: WARM/SEARCH are derivative; prefer replay from HOT to ensure integrity and reduce backup cost.
- Legal holds: Applied at container level per-tenant; holds block deletion regardless of retention end; tracked in export/hold registries.
- Key continuity: During HOT recovery or cross-account promotion, Integrity service validates the kid chain and republishes current trust roots.
Operational checks¶
- Daily: HOT integrity sweep (sampled segment rehash) and pointer consistency (HOT↔Index).
- Weekly: Projection backlog/lag SLO and replay dry-run from a checkpoint.
- Quarterly: DR exercise: restore HOT to a fresh account in-region, replay WARM, rebuild SEARCH, and verify RPO/RTO attainment.
With HOT/WARM/COLD clearly separated, tenancy & residency enforced at storage and query layers, and replay-centric recovery, the platform maintains tamper-evidence while meeting practical RPO/RTO targets.
Messaging, DLQ & Replay (Azure-first)¶
This section pinpoints the Azure Service Bus (ASB) topology, the Outbox/Inbox deployment pattern for exactly-once semantics over at-least-once transport, and the replay guardrails used to rebuild projections safely.
Broker topology (topics, subscriptions, DLQs)¶
Namespace & connectivity
- Namespace: sb-atp-<env>-<region> with Private Link; access via Entra ID RBAC (no SAS in app paths).
- Partitioning: Enabled on topics to spread load; duplicate detection On (10–30 min window).
- Message shape (conventions; illustrated below):
  - MessageId = stable eventId (UUID v7).
  - CorrelationId = requestId from Gateway.
  - SessionId (only where strict ordering is needed) = tenantId.
  - ApplicationProperties: tenantId, edition, schema, occurredAt, idempotencyKey.
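An illustrative message following these conventions, rendered as YAML (all values are examples, not real tenants or events):
MessageId: "018f3c2a-7b1e-7c3a-9b2e-2f6f0a1d4e11"      # stable eventId (UUID v7)
CorrelationId: "req-7f2d9c"                             # requestId from the Gateway
SessionId: "tenant-4711"                                # only on session-enabled topics
ApplicationProperties:
  tenantId: "tenant-4711"
  edition: "enterprise"
  schema: "atp.v1"
  occurredAt: "2025-10-01T12:34:56Z"
  idempotencyKey: "append-9a8b7c"
Body: { segmentId: "seg-000123", count: 250 }           # minimal segment metadata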
Core topics
| Topic | Purpose | Sessions | Typical producers | Typical consumers (subscriptions) |
|---|---|---|---|---|
| atp.audit.appended.v1 | Append confirmations + minimal segment metadata | Off | Ingestion | projection-sub, integrity-sub |
| atp.projection.work.v1 | Fanout work units for projection (batch/slice) | Off | Ingestion/Orchestrator | projection-sub |
| atp.policy.updated.v1 | Policy/plan changes (classification/redaction/retention) | On (tenantId) | Admin/Policy | gateway-sub, ingestion-sub, projection-sub, query-sub |
| atp.export.requested.v1 | Export job requests | Off | Gateway/Admin | export-sub |
| atp.export.completed.v1 | Export job status updates | Off | Export | admin-sub, gateway-sub |
| atp.alerts.ops.v1 | Ops/compliance events (e.g., verification anomaly) | Off | Integrity/Query | admin-sub, siem-sub |
Each subscription has its own DLQ (subqueue) with MaxDeliveryCount (5–10) and LockDuration (60–120s). Poison messages land in the subscription DLQ, not the topic.
Example subscriptions
- projection-sub on atp.audit.appended.v1 (filter by tenantId IN (…) if needed).
- integrity-sub on atp.audit.appended.v1 (all tenants, lower concurrency).
- export-sub on atp.export.requested.v1 (per-tenant concurrency caps).
Deployment diagram — topics, subscriptions, DLQs¶
flowchart LR
ING[Ingestion Service] -->|publish| T1[(Topic: atp.audit.appended.v1)]
T1 --> S1[Sub: projection-sub] --> Q1[[DLQ: projection-sub/$DeadLetterQueue]]
T1 --> S2[Sub: integrity-sub] --> Q2[[DLQ: integrity-sub/$DeadLetterQueue]]
POL[Policy/Admin] -->|publish| T2[(Topic: atp.policy.updated.v1)]
T2 --> S3[Sub: gateway-sub] --> Q3[[DLQ: gateway-sub/$DeadLetterQueue]]
T2 --> S4[Sub: ingestion-sub] --> Q4[[DLQ: ingestion-sub/$DeadLetterQueue]]
T2 --> S5[Sub: projection-sub] --> Q5[[DLQ: projection-sub/$DeadLetterQueue]]
T2 --> S6[Sub: query-sub] --> Q6[[DLQ: query-sub/$DeadLetterQueue]]
GW[Gateway] -->|publish| T3[(Topic: atp.export.requested.v1)]
T3 --> S7[Sub: export-sub] --> Q7[[DLQ: export-sub/$DeadLetterQueue]]
EXP[Export Service] -->|publish| T4[(Topic: atp.export.completed.v1)]
T4 --> S8[Sub: admin-sub] --> Q8[[DLQ: admin-sub/$DeadLetterQueue]]
INTEGRITY[Integrity] -->|publish alerts| T5[(Topic: atp.alerts.ops.v1)]
T5 --> S9[Sub: admin-sub] --> Q9[[DLQ: admin-sub/$DeadLetterQueue]]
T5 --> S10[Sub: siem-sub] --> Q10[[DLQ: siem-sub/$DeadLetterQueue]]
Outbox/Inbox pattern (per service)¶
Outbox (publish)
- Each service that emits events writes to a local Outbox table within the same transaction as its state change (e.g., Ingestion → hot index pointer + outbox row).
- A background publisher (idempotent) forwards Outbox rows to ASB, sets MessageId = eventId, and marks them dispatched on success.
- Duplicate detection on the topic ensures “at-least-once on the wire” becomes effectively once downstream.
Inbox (consume)
- Consumers record a ProcessedMessages (Inbox) entry keyed by (subscription, MessageId).
- Long-running handlers use deferral or saga state (in SQL/Cosmos) to persist progress and avoid timeouts.
Message handling guardrails
- MaxDeliveryCount tuned per subscription (5–10); exceed → DLQ with diagnostic properties.
- Prefetch: enable 100–500 (handlers must be idempotent).
- Concurrency: enforce per-tenant limits if needed (sessions for policy.updated by tenantId).
Replay controls & guardrails¶
When to replay
- Projection rebuild after schema change or data correction.
- Integrity re-verify after key rotation or algorithm bump.
- Tenant-scoped repair after incident.
Replay sources
- Primary: HOT (Blob WORM) is the source of truth. Rebuild projections by reading segments and re-emitting work items (atp.projection.work.v1) rather than replaying historical broker traffic.
- Secondary: For short gaps, resubmit from DLQ or deferred messages after remediation.
Idempotency & checkpoints
- Idempotency key: eventId (MessageId) + handler name; persisted in Inbox.
- Checkpoints: For bulk replays, keep progress records (tenantId, fromTs, toTs, lastSegmentId) to allow restarts.
- Rate guards: throttle replays (KEDA scaler on backlog) to preserve SLOs for live traffic.
Ordering
- Topics that require strict intra-tenant ordering (e.g., policy.updated) use sessions with SessionId = tenantId.
- Projections from append confirmations do not assume strict order; they reconstruct order from segment metadata.
Runbook pointers¶
DLQ drain (subscription DLQ)
- Stabilize: Pause/scale down the consumer to prevent churn; capture metrics & sample traces.
- Peek a batch from sub/$DeadLetterQueue; classify by DeadLetterReason / properties (tenantId, schema).
- Resubmit:
- For one-off: Requeue to active (clone message with original
MessageId, preserve headers). - For bulk: run the DLQ Resubmitter function (tags:
resubmittedFrom=DLQ,dlqSequenceNumber) with bounded rate.
- For one-off: Requeue to active (clone message with original
- Verify: Watch error budget burn-rate, DLQ depth, consumer lag; confirm Inbox dedupe prevents dup effects.
- Close: Write incident note (root cause, counts, timestamps), keep sample messages for 7–14 days.
Projection replay (from HOT)
- Scope: Select tenantId and time window; freeze schema version if needed.
- Generate work items from HOT segments (batch size tuned); publish to atp.projection.work.v1 with isReplay=true.
- Scale Projection workers via KEDA on backlog; enforce tenant rate caps.
- Verify: Compare sample queries vs. baseline; ensure counts & hashes match.
- Finalize: Mark checkpoint complete, emit ops event to atp.alerts.ops.v1.
Integrity re-verify
- Trigger integrity job with target kid/algorithm.
- Read manifests from HOT; verify signatures/roots in-HSM; log deltas.
- Emit anomalies to atp.alerts.ops.v1; open a case if needed.
Operational defaults (recommended)¶
- Duplicate detection: PT15M on all topics.
- Lock duration: 60–120s; auto-renew for long handlers.
- Max delivery: 5–10; DLQ enabled everywhere.
- Poison quarantine: “parking-lot” queue pattern for manual triage if resubmission causes loops.
- Schema versioning: applicationProperties.schema = atp.v1 (bump on breaking change).
- Observability: Emit tenantId, edition, correlationId, messageId, subscription tags on every handler span.
With this topology and runbooks, the platform achieves fault isolation (per-subscription DLQs), idempotent processing (Outbox/Inbox + duplicate detection), and safe rebuilds driven from the authoritative HOT store rather than unreliable historical broker replay.
Scaling & Capacity (HPA/KEDA, Queues, Partitions)¶
This section defines how each service scales on AKS using HPA (resource/custom metrics) and KEDA (event/backlog), how we partition hot tenants/shards, and how we handle warmup/readiness and node placement for heavy jobs in an Azure-first topology.
Partitioning strategy (hot tenants & shards)¶
- Detection: Continuously rank tenants by ingest RPS, projection backlog share, and query CPU/time. Flag the top p95 as “hot”.
- Message-plane isolation:
  - Create dedicated subscriptions or session partitions per hot tenant (e.g., projection-sub-tenant-<id>).
  - Bind KEDA scalers per subscription to cap cross-tenant contention.
- Data-plane isolation:
  - SQL: partition by (tenantId, eventMonth); hot tenants get separate filegroups or partition ranges.
  - Cosmos: standard /tenantId partition key; raise autoscale RU floor for hot tenants.
  - Search: optional per-tenant index for extreme hotspots; otherwise filter by tenantId with index alias.
- Concurrency guardrails: Per-tenant max concurrent handlers (sessions) to protect SLOs for the long tail.
Warmup, readiness & node placement¶
- Readiness gates: Mesh sidecar ready, Key Vault CSI mount version present, Service Bus reachable, DB pool healthy, feature flags fetched.
- Warmup: Pre-open DB/ASB connections, optionally prime cache (Query), load policy snapshot (Policy), JIT or precompile hotspots.
- Rollout budgets:
  - Stateless web APIs: maxSurge=30%, maxUnavailable=0, minReadySeconds=10.
  - Heavy workers (Projection/Export): maxSurge=1, maxUnavailable=0.
- Node pools & taints:
  - np-generic for Gateway/Query/Admin.
  - np-io (taint: workload=io:NoSchedule) for Ingestion/Projection/Export/Integrity with tolerations; higher IOPS and memory.
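A minimal probe sketch for a heavy worker, illustrating the readiness gates and cold-start tolerance described above (probe paths, port, and thresholds are illustrative assumptions):
# Pod template fragment for a Projection worker
containers:
  - name: projection
    image: acr.example/atp-projection:latest   # illustrative image reference
    startupProbe:                 # tolerant of slow cold start before readiness kicks in
      httpGet: { path: /healthz/startup, port: 8080 }
      periodSeconds: 5
      failureThreshold: 30        # allows ~150s of warmup
    readinessProbe:               # gates traffic on KV mount, ASB and DB health checks
      httpGet: { path: /healthz/ready, port: 8080 }
      periodSeconds: 10
      failureThreshold: 3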
Service scaling matrix¶
| Service | Scale metric(s) (HPA/KEDA) | Min / Max | Readiness Gate (examples) | Cost notes |
|---|---|---|---|---|
| Gateway | HPA on CPU 60% and p95 latency (custom metric via Azure Monitor); optional KEDA HTTP (requests) | 2 / 20 | Mesh ready; KV CSI certs; APIM/Ingress route health | Keep low idle; prefer header-based canary over full fleet surge |
| Ingestion | HPA on CPU 60% + custom rps_ingest; 429 rate guard | 2 / 20 | KV CSI (webhook HMAC), HOT Blob reachability, hot-index write check | Scale up only within storage IOPS budget to avoid throttling |
| Policy | HPA on CPU + cache miss rate | 1 / 5 | Policy snapshot loaded; DB reachable | Keep small; cache policies per tenant to reduce DB hits |
| Projection | KEDA on ASB projection-sub* backlog; target 100–500 msgs/replica | 0 / 50 | DB write test; inbox/outbox tables live; SB lock renew path | Activation to 0 saves cost; cap per-tenant concurrency |
| Query | HPA on CPU + p95 query latency; optional KEDA on queue of verify-on-read jobs | 2 / 30 | DB/read models ready; cache connected; search reachable | Prefer cache TTLs and result caching to control spend |
| Integrity | KEDA on verification job queue or CronScaledJob for windows | 0 / 10 | HSM sign op test; HOT Blob read | Run in off-peak windows; throttle to protect data plane |
| Export | KEDA on export-sub backlog + bandwidth cap (custom metric) | 0 / 10 | Blob write SAS-less path; KV/HSM sign op; temp volume space | Pin to np-io; enforce egress budgets to control costs |
| Admin | HPA on CPU (low) | 1 / 2 | DB ready | Keep minimal; no autoscale to large counts |
For KEDA activation to zero, ensure startup probes are tolerant (slow cold-start) and that work item visibility is preserved during scale-from-zero.
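For the HPA-driven services in the matrix, a minimal autoscaling/v2 sketch for the Gateway on CPU (the custom p95-latency metric mentioned above would be added via a metrics adapter and is omitted here; the deployment name and targets are illustrative):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gateway
  namespace: atp-gateway-ns
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gateway-deployment        # hypothetical deployment name
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60    # matches the "CPU 60%" target in the matrix
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # avoid flapping after short bursts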
Example KEDA spec (Projection on Service Bus topic subscription)¶
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: projection-worker
namespace: atp-projection-ns
spec:
scaleTargetRef:
name: projection-deployment
pollingInterval: 10 # seconds
cooldownPeriod: 120
minReplicaCount: 0
maxReplicaCount: 50
advanced:
restoreToOriginalReplicaCount: false
triggers:
- type: azure-servicebus
metadata:
namespace: sb-atp-prod-us
topicName: atp.audit.appended.v1
subscriptionName: projection-sub
messageCount: "400" # ~ messages per replica target
activationMessageCount: "1"
# Use AAD auth; no conn string
cloud: AzurePublicCloud
authenticationRef:
name: keda-auth-asb-aad
---
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
name: keda-auth-asb-aad
namespace: atp-projection-ns
spec:
podIdentity:
provider: azure-workload
Example rollout strategy (heavy worker)¶
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 0
minReadySeconds: 20
tolerations:
- key: "workload"
operator: "Equal"
value: "io"
effect: "NoSchedule"
nodeSelector:
agentpool: np-io
Sizing heuristics & guards¶
- Backlog targets: Projection ~400 msgs/replica, Export 20–50 jobs/replica (depending on payload size), Integrity 100–200 segments/replica.
- Queue-driven scale: Prefer backlog + age (lag) over raw backlog to prioritize stale tenants.
- Throttle replays: When isReplay=true, apply a lower maxReplicaCount and tenant rate caps.
- Budget enforcement: Track cost-per-tenant (storage + compute + egress). Auto-reduce max replicas when budgets near limits.
With HPA/KEDA triggers tied to real demand (RPS, CPU, backlog/lag), hot-tenant partitioning at both message and data planes, and strict warmup/readiness plus node placement, the platform scales safely and cost-aware while protecting shared SLOs.
Observability & SLO Enforcement (Azure-first)¶
This section makes golden signals deployable across ATP with OpenTelemetry → Azure Monitor (App Insights/Log Analytics) and optional Managed Grafana. It defines mandatory telemetry, SLOs with burn-rate alerts, and dashboards wired from the in-cluster OTel Collector.
Telemetry wiring (standard)¶
- SDKs: OpenTelemetry (traces, metrics, logs) in every service.
- Resource attributes (must be on every span/metric/log): service.name, service.version, deployment.environment, cloud.region, tenantId, edition, correlationId, messageId, subscription, http.route (where applicable).
- Collector (DaemonSet + sidecars where needed):
  - Receivers: otlp (gRPC :4317, HTTP :4318); prometheus scrape for kube/mesh.
  - Exporters: azuremonitor (traces/metrics/logs), optional prometheusremotewrite to Managed Prometheus, and logging for debug.
- Availability tests: App Insights Ping/Multistep against public APIs (Gateway) + internal probes (mesh VIPs).
Example (collector exporter fragment):
exporters:
azuremonitor:
connection_string: "InstrumentationKey=${APPINSIGHTS_KEY}"
service:
pipelines:
metrics:
receivers: [otlp, prometheus]
processors: [batch, resource]
exporters: [azuremonitor]
traces:
receivers: [otlp]
processors: [batch, resource]
exporters: [azuremonitor]
logs:
receivers: [otlp]
processors: [batch, resource]
exporters: [azuremonitor]
Mandatory signals by service (golden set)¶
| Service | Metrics (examples) | Traces (must-have spans) | Logs (structured fields) |
|---|---|---|---|
| Gateway | http.server.duration (histogram), http.server.requests (by status), rate_limit_drops_total | Gateway/Authorize, Gateway/Route, Gateway/TenancyGuard | tenantId, userId, route, status, requestId |
| Ingestion | ingest.requests_total, ingest.p95_ms, ingest.rejects_total, outbox.pending, asb.publish_duration_ms | Ingestion/Validate, Ingestion/Outbox/Commit, ASB/Publish | eventId, schema, tenantId, idempotencyKey |
| Policy | policy.cache_hit_ratio, policy.load_ms, policy.errors_total | Policy/Resolve, Policy/CacheLoad | policyVersion, tenantId |
| Projection | projection.backlog (by sub/tenant), projection.p95_ms, inbox.dupes_total, db.write_ms | Projection/Handle, DB/BulkUpsert, ASB/Complete | subscription, messageId, tenantId, isReplay |
| Query | query.requests_total, query.p95_ms, verify_on_read_ms, cache.hit_ratio | Query/Execute, DB/Read, VerifyOnRead | route, tenantId, filters, resultSize |
| Integrity | integrity.verify.count, integrity.sign.ms, integrity.anomalies_total | Integrity/SealSegment, HSM/Sign, Blob/Write | segmentId, kid, hashAlgo |
| Export | export.jobs_queued, export.duration_ms, export.failure_total, egress.bytes | Export/Assemble, HSM/Sign, Blob/Write | exportId, format, tenantId |
| Admin | admin.ops_total, featureflag.toggle_total | Admin/Action | actor, action, target |
All latencies are histograms with p50/p95/p99; all errors include exception.type, stack, and correlationId.
SLO catalog (targets & SLIs)¶
| Capability | SLI (how measured) | SLO target (per region) | Notes |
|---|---|---|---|
| Ingest success | (2xx+3xx) / all gateway ingest requests | 99.5% over 30 days | Excludes planned maintenance windows |
| Ingest latency | p95 http.server.duration on append route | ≤ 300 ms | Under nominal load in business hours |
| Projection freshness | 95th pct of backlog age per tenant | ≤ 60 s | Measured from append to projection visible |
| Query success | (2xx) / all query requests | 99.5% | Gateway- or service-level errors count against |
| Query latency | p95 http.server.duration on query routes | ≤ 500 ms | For typical filters and page size |
| DLQ rate | DLQ messages / total consumed per sub | ≤ 0.5% | Excluding intentional parking-lot ops |
| Export completion | 95% jobs finish < 15 min | ≥ 95% | Per 24h rolling |
| Integrity verification | segments verified within window | ≥ 99% within policy window | Window defined by sealing cadence |
Alert policies (burn-rate & thresholds)¶
Error-budget burn-rate (applies to Ingest and Query success SLOs):
- Let SLO = 99.5% ⇒ budget = 0.5%.
- Page when both windows breach:
- Fast: 5-min error rate > 7.2× budget (i.e., > 3.6% errors)
- Slow: 1-hr error rate > 3× budget (i.e., > 1.5% errors)
- Ticket (non-page):
- 6-hr error rate > 1× budget (≥ 0.5%) or 24-hr > 0.5× budget.
Latency:
- Warn: p95 ingest > 300 ms for 10 min; Page if > 500 ms for 5 min.
- Warn: p95 query > 500 ms for 10 min; Page if > 800 ms for 5 min.
Projection freshness & DLQ:
- Warn: backlog age p95 > 60 s for 15 min; Page if > 180 s for 10 min.
- Warn: DLQ rate > 0.5% for 30 min; Page if > 2% for 15 min.
Export & Integrity:
- Warn: export 95th duration > 15 min for 60 min; Ticket if > 30 min.
- Warn: integrity verify coverage < 99% at window end; Ticket + runbook link.
Canary checks (post-deploy):
- Synthetic Availability test success ≥ 99% over 15 min.
- No SLO burn in canary slice for 15–30 min.
- Auto-rollback if fast burn-rate page triggers during canary.
PromQL examples (Managed Prometheus):
# Error rate (Gateway ingest)
sum(rate(http_server_requests_seconds_count{route="/append",status!~"2..|3.."}[5m]))
/
sum(rate(http_server_requests_seconds_count{route="/append"}[5m]))
# p95 query latency
histogram_quantile(0.95, sum(rate(http_server_duration_seconds_bucket{route=~"/query.*"}[5m])) by (le))
KQL examples (Log Analytics / App Insights):
requests
| where customDimensions["route"] == "/append"
| summarize errors = countif(success == false), total = count() by bin(timestamp, 5m)
| extend error_rate = todouble(errors) / todouble(total)
requests
| where customDimensions["route"] startswith "/query"
| summarize p95=percentile(duration, 95) by bin(timestamp, 5m)
Dashboards (UID placeholders)¶
Grafana (Managed)
- ATP — Overview (uid: g_atp_overview)
- Edge & Gateway (uid: g_atp_edge)
- Ingestion & ASB (uid: g_atp_ingest)
- Projection Health (uid: g_atp_projection)
- Query Performance (uid: g_atp_query)
- Integrity & Evidence (uid: g_atp_integrity)
- Export Pipeline (uid: g_atp_export)
- SLO Heatmap (tenants) (uid: g_atp_tenant_slos)
Azure Workbooks
- ATP Overview (workbook: LA-ATP-Overview)
- Gateway & Edge (workbook: LA-ATP-Edge)
- Messaging & DLQ (workbook: LA-ATP-ASB)
- Projections & Freshness (workbook: LA-ATP-Projection)
- Query & Cache (workbook: LA-ATP-Query)
With SDK + Collector standardized, SLIs codified into SLOs, burn-rate alerts enforced across short/long windows, and dashboards pre-wired, observability becomes a deployable artifact rather than ad-hoc instrumentation.
CI/CD Overlays & Config Strategy (Azure-first)¶
This section shows how we parameterize deployments per environment/region/edition, wire feature flags, and enforce supply-chain controls (signing, SBOM, verification) across Azure DevOps pipelines.
Overlay layout (Helm + optional Kustomize)¶
deploy/
├─ charts/
│ └─ atp/
│ ├─ Chart.yaml
│ ├─ values.yaml # sane defaults (non-secret)
│ ├─ values.dev.yaml # env overlays
│ ├─ values.qa.yaml
│ ├─ values.staging.yaml
│ ├─ values.prod.yaml
│ ├─ values.us.yaml # region overlays
│ ├─ values.eu.yaml
│ ├─ values.il.yaml
│ ├─ values.edition.default.yaml # edition overlays
│ ├─ values.edition.enterprise.yaml
│ └─ templates/**.yaml
└─ kustomize/ # optional, if you prefer Kustomize
├─ base/ (rendered helm, then kustomize)
└─ overlays/
├─ dev-us/
├─ qa-us/
├─ staging-us/
├─ prod-us/
├─ prod-eu/
└─ prod-il/
Secrets never live in values files. Names/IDs (e.g., KV, ASB namespaces, Storage accounts) are acceptable; access uses Entra Workload Identity + Key Vault CSI.
Example values overlays¶
values.prod.yaml
global:
environment: prod
replicaDefaults:
min: 2
max: 10
features:
sealing:
enabled: true
cadence: "PT5M"
mode: "merkle+hsm"
verifyOnRead:
default: true
allowOverridePerRequest: false
export:
caps:
maxConcurrentPerTenant: 2
maxBytesPerJob: "10Gi"
windows:
allowedCron: "0 2 * * *" # 02:00 local window
network:
ingress:
wafMode: "prevention"
observability:
otelEndpoint: "http://otel-collector.atp-system.svc:4317"
values.us.yaml
azure:
region: eastus
keyVaultName: kv-atp-prod-us
serviceBusNamespace: sb-atp-prod-us
storage:
hotAccount: stathotprodus
coldAccount: statcoldprodus
sql:
server: sql-atp-prod-us.database.windows.net
db: atp_prod_us
residency:
regionCode: "us"
allowCrossRegionReplication: false
values.edition.enterprise.yaml
features:
verifyOnRead:
default: true
export:
caps:
maxConcurrentPerTenant: 5
maxBytesPerJob: "50Gi"
pricing:
skuHint: "enterprise"
Compose overlays at deploy time, e.g.:
helm upgrade --install atp ./charts/atp -f values.prod.yaml -f values.eu.yaml -f values.edition.enterprise.yaml
Pipeline stages (Azure DevOps) with checks & gates¶
stages:
- stage: Build
jobs:
- job: build
steps:
- script: dotnet build --configuration Release
- script: dotnet test --collect:"XPlat Code Coverage"
- task: TrivyScan@1 # image & deps scan (or Defender for DevOps)
- script: syft packages -o cyclonedx-json > sbom.json
- stage: Package
dependsOn: Build
jobs:
- job: containerize
steps:
- script: docker build -t $(ACR)/atp-gateway:$(Build.SourceVersion) .
- script: docker push $(ACR)/atp-gateway:$(Build.SourceVersion)
- script: cosign sign --key $(COSIGN_KEY) $(ACR)/atp-gateway:$(Build.SourceVersion)
- script: cosign attach sbom --sbom sbom.json $(ACR)/atp-gateway:$(Build.SourceVersion)
- script: cosign attest --predicate provenance.json --type slsaprovenance $(ACR)/atp-gateway:$(Build.SourceVersion)
- stage: Verify_Manifests
dependsOn: Package
jobs:
- job: policy
steps:
- script: helm template atp ./deploy/charts/atp -f values.dev.yaml > rendered.yaml
- script: conftest test rendered.yaml # OPA policies: no NodePort, PSS, resources
- script: ratify verify --subject $(ACR)/atp-gateway:$(Build.SourceVersion) # signature/SBOM
- stage: Dev
dependsOn: Verify_Manifests
variables:
HELM_VALUES: "-f values.dev.yaml -f values.us.yaml -f values.edition.default.yaml"
jobs:
- deployment: dev
environment: atp-dev-us # Env checks: owners, approvals optional
strategy:
runOnce:
deploy:
steps:
- script: helm upgrade --install atp ./deploy/charts/atp $(HELM_VALUES) --set image.tag=$(Build.SourceVersion)
- stage: QA
dependsOn: Dev
jobs:
- deployment: qa
environment: atp-qa-us
strategy:
runOnce:
deploy:
steps:
- script: helm upgrade --install atp ./deploy/charts/atp -f values.qa.yaml -f values.us.yaml
- stage: Staging
dependsOn: QA
jobs:
- deployment: staging
environment: atp-staging-us
strategy:
runOnce:
preDeploy:
steps:
- task: ManualValidation@0 # Release Manager + Security approval
- script: ./scripts/canary-enable.sh 10 # 10% canary
deploy:
steps:
- script: helm upgrade --install atp ./deploy/charts/atp -f values.staging.yaml -f values.us.yaml
routeTraffic:
steps:
- script: ./scripts/canary-promote.sh 100 # promote if SLO OK
- stage: Prod
dependsOn: Staging
jobs:
- deployment: prod_us
environment: atp-prod-us
strategy:
runOnce:
preDeploy:
steps:
- task: ManualValidation@0 # RM + On-call SRE (2-eyes)
- task: AzureCLI@2 # Policy check: Gatekeeper/Kyverno/Ratify status
deploy:
steps:
- script: helm upgrade --install atp ./deploy/charts/atp -f values.prod.yaml -f values.us.yaml
routeTraffic:
steps:
- script: ./scripts/canary-promote.sh 50
- script: ./scripts/canary-promote.sh 100
- deployment: prod_eu
environment: atp-prod-eu
dependsOn: prod_us
strategy:
runOnce:
deploy:
steps:
- script: helm upgrade --install atp ./deploy/charts/atp -f values.prod.yaml -f values.eu.yaml
- deployment: prod_il
environment: atp-prod-il
dependsOn: prod_eu
strategy:
runOnce:
deploy:
steps:
- script: helm upgrade --install atp ./deploy/charts/atp -f values.prod.yaml -f values.il.yaml
Environment checks & gates
- Build/Package: unit/integration tests, vulnerability scan, SBOM generation, Cosign sign/attest.
- Verify_Manifests: OPA/Conftest policy, Ratify validation (signature + SBOM) as an admission gate in AKS.
- Staging: manual approval, canary + SLO guard, auto-rollback on burn-rate page.
- Prod: manual approval (RM + SRE), admission policy green, progressive traffic (10%/50%/100%) with checks in between.
Feature flags (centralized; wired via values)¶
features:
sealing:
enabled: true
cadence: "PT5M"
mode: "merkle+hsm"
verifyOnRead:
default: false # dev/qa
export:
caps:
maxConcurrentPerTenant: 1
maxBytesPerJob: "5Gi"
windows:
allowedCron: "0 3 * * *" # dev window
Flip per environment/edition by merging overlays. Flags are exposed to services through config maps or typed options, not environment variables with secrets.
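A minimal sketch of how the merged flag values could surface to a service as a ConfigMap (rendered by the Helm chart; the ConfigMap name and data key are assumptions):
apiVersion: v1
kind: ConfigMap
metadata:
  name: atp-feature-flags
  namespace: atp-gateway-ns
data:
  features.yaml: |
    sealing:
      enabled: true
      cadence: "PT5M"
      mode: "merkle+hsm"
    verifyOnRead:
      default: false
    export:
      caps:
        maxConcurrentPerTenant: 1
        maxBytesPerJob: "5Gi"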
Artifact promotion & immutability¶
- Tags: Every image has :sha-<gitsha> (immutable), :vX.Y.Z, and :channel-<env> tags.
- Promotion: Retag only (no rebuild) from ACR dev → qa → staging → prod.
- ACR policies: Immutable tags on semver; quarantine repo for images pending scan/sign.
- Verification: Admission plugin (Ratify/Kyverno) requires cosign signature + SBOM before scheduling.
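One possible shape of that admission gate, sketched as a Kyverno verifyImages policy (registry path and public key are placeholders; a Ratify/Gatekeeper setup would enforce the equivalent check):
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-signed-atp-images
spec:
  validationFailureAction: Enforce      # block unsigned images at admission
  webhookTimeoutSeconds: 30
  rules:
    - name: verify-cosign-signature
      match:
        any:
          - resources:
              kinds: [Pod]
      verifyImages:
        - imageReferences:
            - "acratp.azurecr.io/atp-*"   # illustrative ACR repository pattern
          attestors:
            - entries:
                - keys:
                    publicKeys: |-
                      -----BEGIN PUBLIC KEY-----
                      ...                  # cosign public key (placeholder)
                      -----END PUBLIC KEY-----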
“Overlay tree” (at deploy time)¶
- Env selects base values (values.{env}.yaml).
- Region adds resource names and residency (values.{region}.yaml).
- Edition sets feature flags and caps (values.edition.{kind}.yaml).
- Command combines all three:
helm upgrade --install atp ./charts/atp -f values.prod.yaml -f values.eu.yaml -f values.edition.enterprise.yaml --set image.tag=$(Build.SourceVersion)
With this strategy, the same artifact is promoted across rings, overlays capture reality (env/region/edition), and supply-chain policies (signing/SBOM/verification) are enforced by the pipeline and the cluster—turning CI/CD into a governed control plane, not just a copy step.
Security Controls & Zero-Trust Map (Azure-first)¶
This section renders the zero-trust control map as it’s deployed on Azure: edge hardening, mesh mTLS, identity-first access, default-deny networking, supply-chain enforcement, and least-privilege IAM across data and messaging planes.
Trust boundaries diagram¶
flowchart TB
user[Public Internet]
subgraph EDGE["Trust Boundary: Edge (Azure Front Door + WAF)"]
AFD["AFD + WAF (TLS/HSTS, managed+custom rules)"]
APIM["API Management (JWT, rate limit, version routing)"]
end
subgraph AKS["Trust Boundary: AKS Mesh (mTLS, default-deny)"]
ADMISSION["Admission: Ratify (sig/SBOM)<br/>+ Kyverno/Gatekeeper (PSS, policies)"]
GW["Gateway (ABAC tenancy guard)"]
SVC["App services: Ingestion • Policy • Projection • Query • Integrity • Export • Admin<br/>(Envoy sidecars, KV-CSI, OTel)"]
NP["NetworkPolicies: namespace default-deny<br/>allow-list east–west"]
end
subgraph PAAS["Trust Boundary: PaaS (Private Link only)"]
ASB["Service Bus (RBAC)"]
KV["Key Vault + Managed HSM (sign only)"]
STG["Blob (WORM, CMK)"]
DB["Azure SQL/Cosmos (AAD auth, RLS/PK)"]
MON["Azure Monitor / App Insights / Log Analytics"]
end
user -->|HTTPS :443| AFD --> APIM -->|re-encrypt TLS| GW
GW -->|mTLS via mesh| SVC
SVC -->|egress allow-list| ASB
SVC --> KV
SVC --> STG
SVC --> DB
SVC --> MON
ADMISSION -.enforces.-> GW
ADMISSION -.enforces.-> SVC
NP -.limits.-> SVC
Zero-trust pillars (how they apply here)¶
- Identity > secrets: Entra Workload Identity for every pod; PaaS via RBAC, not connection strings/SAS.
- Encrypt & authenticate everywhere: TLS at edge, mTLS inside mesh, HSM-backed signing for evidence/exports.
- Default-deny networking: NetworkPolicies for east–west, egress deny except Private Link PaaS.
- Least privilege: Narrow RBAC roles (DB/ASB/Storage/KV). HSM keys: sign permission only.
- Supply chain integrity: Images scanned, signed (Cosign), SBOM attached, verified at admission (Ratify).
- Hardened runtime: Pod Security Standards (restricted), policy guardrails (Kyverno/Gatekeeper), no privileged pods/capabilities.
- Edge hardening: AFD+WAF (managed+custom rules), APIM (JWT, rate limit, schema), HSTS, CORS allow-list.
Control map — “Control → Layer → Enforced by → Evidence”¶
| Control | Layer | Enforced by | Evidence (where to check) |
|---|---|---|---|
| TLS 1.2+ & HSTS | Edge | AFD + WAF config | Azure Diagnostics (FrontDoorWebApplicationFirewallLog), SSL report, AFD policy export |
| JWT validation & version routing | Edge/API | APIM inbound policies | APIM trace, policy repo, App Insights requests with clientPrincipalId |
| Global rate limiting, abuse throttles | Edge/API | APIM rate-limit-by-key, AFD rules | APIM analytics, WAF logs (RuleAction=Block), 429 counters |
| mTLS service-to-service | Cluster | Service mesh (Envoy) | Mesh policy dump, Envoy cert stats, OTel spans with tls=true |
| Tenancy ABAC at Gateway | App | Gateway policy middleware | AuthZ logs (tenantId, edition, outcome), unit tests of guards |
| Namespace default-deny | Network | Kubernetes NetworkPolicies | kubectl get netpol, denied connection tests, Cilium/Calico flow logs |
| Egress deny + Private Link only | Network/PaaS | Egress policies + Private Endpoints | NSG/Firewall logs, Private Link endpoints, failed non-PL egress |
| Pod Security Standards (restricted) | Runtime | Kyverno/Gatekeeper | Admission audit, Kyverno policy reports, kubectl auth can-i checks |
| Image signing & SBOM | Supply chain | Cosign + Ratify | Ratify admission results, cosign verify, SBOM (CycloneDX) attached |
| Container/base image scanning | Supply chain | Defender for DevOps/Trivy | Security center findings, pipeline scan artifacts |
| Secrets delivery (no env vars) | Secrets | Key Vault + CSI driver | Pod mounts, KV audit logs (SecretGet), env var scans (should be 0) |
| Managed HSM signing (no export) | KMS | AKV Managed HSM | HSM audit logs, sign ops only, no get on key material |
| SQL/Cosmos least-privilege access | Data | AAD auth + custom roles/RLS | SQL audit FAILED_LOGIN_GROUP, RLS predicate tests, Cosmos RBAC |
| Storage immutability & legal holds | Data | Blob WORM + legal hold | Storage immutability policy, object legal hold flags, deletion attempts denied |
| Service Bus RBAC (no SAS) | Messaging | Entra RBAC | ASB access audits, absence of SAS in configs, role assignment list |
| WAF rules (managed+custom) | Edge | AFD WAF | Rule hit metrics, blocked IP/geo logs |
| DDoS protection scope | Edge/IP | AFD global network (and Azure DDoS for public IPs if used) | DDoS metrics, mitigation reports (if any) |
| Admission conformance (no NodePort, PSS, resources) | Cluster | Kyverno/Gatekeeper OPA policies | Policy tests, admission denials, Conftest in CI |
| Observability integrity | Telemetry | OTel → Azure Monitor | Traces/logs/metrics with tenantId, correlationId; export health alerts |
“Evidence” refers to the artifacts auditors and SREs inspect to prove that a control exists and is active, both at runtime and during deployment.
Pod hardening (high-value defaults)¶
- PSS `restricted`: `runAsNonRoot`, `readOnlyRootFilesystem`, `seccomp=RuntimeDefault`, drop `ALL` capabilities (allow only an explicit minimal set), no host* (PID/IPC/Network), no privileged pods, no hostPath volumes.
- Network: no NodePort, internal ClusterIP only; Ingress terminates TLS and hands off to the Gateway; egress only through Private Link endpoints.
- Secrets: delivered only via KV CSI mounts (tmpfs), short TTL, rollover without restart where possible (a minimal sketch follows this list).
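A sketch of the KV CSI wiring, assuming the Azure Key Vault provider for the Secrets Store CSI Driver with Workload Identity; the class name, vault name, and secret object are illustrative:

apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: atp-ingest-kv                          # hypothetical class name
  namespace: atp-ingest-ns
spec:
  provider: azure
  parameters:
    usePodIdentity: "false"
    clientID: "<workload-identity-client-id>"  # Entra Workload Identity of the pod
    tenantId: "<entra-tenant-id>"
    keyvaultName: "atp-kv-prod-eus"            # assumed vault, per naming convention
    objects: |
      array:
        - |
          objectName: ingestion-db-connection  # hypothetical secret
          objectType: secret
# Pods consume the class through a CSI volume (mounted as tmpfs), e.g.:
#   volumes:
#     - name: kv-secrets
#       csi:
#         driver: secrets-store.csi.k8s.io
#         readOnly: true
#         volumeAttributes:
#           secretProviderClass: atp-ingest-kv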
Least-privilege IAM (typical assignments)¶
- Ingestion/Projection: `Azure Service Bus Data Sender`/`Receiver` on specific topics/subscriptions; no namespace-wide rights (see the role-assignment sketch after this list).
- Query/Projection: AAD contained users/roles scoped to specific schemas; RLS on `tenantId`; no SQL logins.
- Export/Integrity: `sign` permission on specific HSM keys only; no get/list of key material; `Storage Blob Data Contributor` scoped to tenant containers.
- Gateway: `Key Vault Secrets User` (read certs/secrets) plus per-resource read roles; no write access on the data plane.
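For illustration, a role assignment scoped to a single topic might look like the step below; the subscription ID, resource names, and the principal variable are placeholders, not values from this page:

- task: AzureCLI@2
  displayName: Grant Ingestion send rights on one topic only
  inputs:
    azureSubscription: atp-iam-operations      # assumed service connection
    scriptType: bash
    scriptLocation: inlineScript
    inlineScript: |
      # Scope is the single topic resource, not the Service Bus namespace.
      az role assignment create \
        --assignee-object-id "$INGESTION_IDENTITY_OBJECT_ID" \
        --assignee-principal-type ServicePrincipal \
        --role "Azure Service Bus Data Sender" \
        --scope "/subscriptions/<sub-id>/resourceGroups/atp-rg-prod-eus/providers/Microsoft.ServiceBus/namespaces/atp-asb-prod-eus/topics/audit-events"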
Example policy fragments¶
Kyverno — deny privileged/host networking
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: pss-restricted-core
spec:
  validationFailureAction: enforce
  rules:
    - name: deny-privileged-host
      match: { resources: { kinds: ["Pod"] } }
      validate:
        message: "Privileged/host* disallowed"
        pattern:
          spec:
            securityContext:
              runAsNonRoot: true
            hostNetwork: false
            hostPID: false
            hostIPC: false
            containers:
              - securityContext:
                  privileged: false
                  allowPrivilegeEscalation: false
                  readOnlyRootFilesystem: true
Ratify — require cosign signature & SBOM (conceptual)
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: RatifyVerificationPolicy
metadata:
  name: require-signed-and-sbom
spec:
  artifacts:
    - pattern: "*"
      validations:
        - name: cosign-signature
        - name: sbom-cyclonedx
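NetworkPolicy — namespace default-deny plus explicit allow (sketch). This illustrates the default-deny pillar and control-map row above; the query namespace follows the naming convention, while the gateway namespace, pod label, and port are assumptions.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: atp-query-ns
spec:
  podSelector: {}                              # every pod in the namespace
  policyTypes: ["Ingress", "Egress"]
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-gateway-to-query
  namespace: atp-query-ns
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: atp-query        # hypothetical pod label
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: atp-gateway-ns   # assumed namespace
      ports:
        - protocol: TCP
          port: 8080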
Outcome: The map above ties each security objective to a deployment control and a concrete evidence source. With mTLS in mesh, JWT + rate limits at the edge, ABAC/RBAC inside, PSS/admission policies, deny-by-default networking, and least-privilege IAM over Private Link, the platform maintains a practical, auditable zero-trust posture.
DR, Backups & Region Failover (Azure-first)¶
This section documents how ATP survives zonal/region outages while preserving tamper-evidence and residency. Azure-first assumptions: AFD/WAF at edge, APIM/Ingress per region, AKS per region, Blob (WORM) for HOT, SQL/Cosmos for WARM, ASB for messaging, Key Vault/Managed HSM for keys.
Strategy overview¶
- Active–active user plane for Gateway, Ingestion, Query across allowed regions; traffic steered by AFD health probes and origin groups.
- Active–passive control jobs (Integrity sealing sweeps, heavy Exports) with pilot-light capacity in secondary regions; scale up on failover (KEDA/HPA).
- Residency-first: EU/IL tenants fail over only to same-jurisdiction paired regions. US may use cross-region pairs.
- Authoritative truth is HOT WORM; WARM (projections/search) are rebuildable.
Backup plan (cadence, verify, retention)¶
| Asset | Technique | Cadence / Retention | Integrity & Verify |
|---|---|---|---|
| HOT (Blob WORM) | ZRS in-region; Object Replication to same-jurisdiction DR account; legal hold where required | Retention per tenant policy (e.g., 1–7+ years) | Scheduled hash re-verify samples daily; manifest/root checks; quarterly promote DR copy dry-run |
| HOT index (SQL/Cosmos) | SQL PITR (14–35d) / LTR weekly (months); Cosmos Continuous backup (7–30d) | PITR 14–35d; LTR 6–12 mo | Nightly integrity job compares pointer counts vs HOT manifests |
| WARM projections | PITR + replay from HOT (preferred) | PITR 7–14d | Weekly replay smoke against subset tenants |
| Search | Reindex from WARM/HOT (no backup) | N/A | Post-restore consistency spot checks |
| ASB (Service Bus) | Geo-DR alias (metadata replication); premium namespace pairs | Test quarterly; alias failover exercised | DLQ drift checks before/after failover |
| Key Vault / Managed HSM | KV: soft-delete + purge-protect; HSM backup package to private Storage | KV continuous; HSM backup weekly and before rotation | Monthly restore test of HSM backup to isolated vault |
| APIM / AFD config | IaC in repo; config export snapshots | With every release | Diff-and-apply check on DR region |
| Configs (Helm values) | Git as source of truth | With every release | Admission policy conformance in DR cluster |
DR patterns & failover posture¶
- Zonal failure (within region)
- AKS uses zonal node pools; storage ZRS; AFD/APIM keep routing within region.
- RTO: ≤ 15 min, RPO: ~ 0 (authoritative writes to HOT continue).
- Regional outage (allowed cross-region)
- AFD marks region unhealthy → shifts to healthy origin group.
- ASB Geo-DR alias switch if namespace is down.
- Scale pilot-light workloads (Projection/Export/Integrity) in DR region.
- RTO: 15–60 min, RPO: ≤ 5 min (HOT replicated + outbox/inbox re-drain).
- Regional outage (strict residency, same-jurisdiction only)
- Read/write remain within jurisdiction pair (e.g., EU-pair).
- If replication is asynchronous, accept small RPO gap; rebuild WARM from HOT in DR.
- RTO: 1–4 h, RPO: ≤ 15 min (depends on HOT replication lag).
- Control-plane impairment (AKS only)
- Keep PaaS healthy; re-point traffic to sibling cluster in same region if available.
- RTO: 30–60 min, RPO: ~ 0.
Integrity ledger continuity (sealing across failover)¶
- Key continuity: Integrity/Export sign with Managed HSM; the DR vault has the restored key version (`kid`) before cutover.
- Chain anchoring: Each sealed segment root contains the previous root hash. On region failover, the first DR seal anchors to the last confirmed root (by `kid` + hash) to avoid forks; publish a "chain-continuation" event.
- Dual-sign window (optional): Temporarily accept old+new `kid` to bridge any trust gaps; revoke the old one later.
- Watermarks: Integrity keeps a sealing watermark (per tenant/region) in the HOT index; DR resumes from that watermark to prevent duplicates or gaps.
DR run sequence (Mermaid)¶
flowchart TB
A[Detect outage via AFD/APIM/SLI] --> B{Scope?}
B -->|Zonal| C[Keep region; scale unaffected zones]
B -->|Region| D[Mark region unhealthy in AFD]
D --> E[Shift traffic to DR origin group]
E --> F[Switch ASB Geo-DR alias if needed]
F --> G[Scale pilot-light: Ingestion/Query/Projection/Integrity/Export]
G --> H[Restore HSM key version in DR - if not pre-staged]
H --> I[Run health & smoke tests; enable canary]
I --> J[Resume sealing; publish chain-continuation event]
J --> K[Monitor SLOs & burn-rate; adjust capacity]
DR checklist (operator-facing)¶
Before an incident (readiness)
- DR region origin registered in AFD; health probe green.
- ASB Geo-DR pairing healthy; alias tested last quarter.
- HSM backup restored in the DR vault (pre-staged, current `kid`).
- DR AKS cluster passes admission policies; minimal pilot-light replicas deployed.
- HOT object replication policies green; last replication lag < 15 min.
- Runbooks and feature-flag toggles (e.g., reducing export concurrency) reviewed.
At incident
- Confirm scope (zonal vs region).
- AFD/APIM show primary unhealthy → cutover to DR origin.
- If ASB is down: fail over the alias to the DR namespace (see the sketch after this list).
- Scale KEDA/HPA targets for Projection/Query/Export/Integrity.
- Validate HSM key available; if needed, restore from latest backup.
- Execute smoke tests; enable canary; observe SLOs.
- Resume sealing; publish chain-continuation event.
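A hedged example of the alias cutover step; the resource group, namespace, alias, and service connection names are placeholders:

- task: AzureCLI@2
  displayName: Fail over Service Bus Geo-DR alias to the DR namespace
  inputs:
    azureSubscription: atp-dr-operations       # assumed service connection
    scriptType: bash
    scriptLocation: inlineScript
    inlineScript: |
      # Run against the SECONDARY (DR) namespace; it becomes the primary.
      az servicebus georecovery-alias fail-over \
        --resource-group atp-rg-dr \
        --namespace-name atp-asb-dr-namespace \
        --alias atp-asb-alias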
After incident
- Backfill: replay projections from HOT for gaps.
- Compare pointer counts HOT↔WARM; validate signatures.
- Post-mortem: RTO/RPO actuals, DLQ rate, error budget spent.
- Rotate temporary flags back; scale down pilot-light if appropriate.
RTO/RPO per scenario¶
| Scenario | RTO target | RPO target | Notes |
|---|---|---|---|
| Zonal failure (within region) | ≤ 15 min | ~ 0 | ZRS + multi-zone node pools; AFD keeps region |
| Region outage (cross-region allowed) | 15–60 min | ≤ 5 min | AFD cutover + ASB alias + HOT replication lag |
| Region outage (strict residency pair) | 1–4 h | ≤ 15 min | Same-jurisdiction DR restore + replay WARM |
| AKS control-plane impairment | 30–60 min | ~ 0 | Shift to sibling cluster or re-create node pools |
| Key Vault/HSM incident | 1–2 h | ~ 0 | Restore HSM backup; dual-sign window if needed |
Drill cadence¶
- Monthly: Partial replay from HOT in DR; verify WARM parity.
- Quarterly: Full region failover exercise (AFD, ASB alias, HSM restore, sealing continuation).
- After major changes: Re-validate object replication, SLO alerts, runbook steps.
With AFD-driven cutover, ASB Geo-DR, HOT WORM replication, and HSM key continuity, the platform maintains tamper-evident chains and meets practical RTO/RPO targets without violating residency constraints.
Cost & Capacity Guardrails (Azure-first)¶
This section bakes cost discipline into the deployment by making cost drivers visible, enforcing tenant/edition quotas, using export windows, and automating shrink/retention workflows—without sacrificing SLOs or tamper-evidence.
Per-service cost drivers¶
| Plane/Service | Primary cost drivers | Secondary drivers | Guardrails to apply |
|---|---|---|---|
| Gateway (AFD/APIM/Ingress) | APIM reqs/sec, AFD egress | TLS cert mgmt, WAF rules eval | Tight CORS; cacheable 4xx/5xx bodies; low-cardinality labels in logs |
| Ingestion | AKS CPU/mem; HOT Blob writes; ASB publishes | Azure Monitor ingestion | Cap RPS by tenant; batch writes; structured logs sampling |
| Policy | AKS CPU/mem; DB reads | KV/CSI mounts | Cache policy snapshots; low TTL metrics histograms |
| Projection | AKS CPU/mem; DB writes; ASB consumes | Cosmos RU/SQL DTUs; Monitor | KEDA on backlog; batch upserts; throttle replay workloads |
| Query | DB reads; AKS CPU; cache misses | Search queries; Monitor | Result caching; read replicas/partition pruning; cap payload size |
| Integrity | HSM sign ops; HOT reads/writes | AKS CPU; Monitor | Off-peak schedules; batch sealing; reduce verification frequency under load (never below policy min) |
| Export | Blob egress + archive storage; HSM signs | AKS mem/disk; network | Export windows (nightly); max bytes/job; compress; per-tenant concurrency caps |
| Observability | Logs/Traces/Metrics ingestion & retention | Managed Grafana | Drop noisy fields; 90-day default retention; sampling for DEBUG |
| Messaging (ASB) | Topic/subscription ops; Premium MUs | DLQ depth | Right-size MUs; dedupe window; consumer prefetch tuning |
| Storage (HOT/WARM/COLD) | HOT: GB stored × replicas; WARM: DB size; COLD: Archive GB | Transactions | WORM retention by policy; lifecycle to Cool/Archive; partitioning for pruning |
Quotas & limits (per tenant/edition) + export windows¶
Config (values overlay)
quotas:
  tenant:
    defaults:
      maxIngestRps: 50
      maxConcurrentExports: 1
      maxDailyExportBytes: "10Gi"
      maxQueryRps: 30
    enterprise:
      maxIngestRps: 200
      maxConcurrentExports: 3
      maxDailyExportBytes: "100Gi"
      maxQueryRps: 100
  hardStops:
    exportJobMaxBytes: "50Gi"          # absolute cap
    queryMaxResponseBytes: "25Mi"
exportWindows:
  # Run heavy egress when bandwidth is cheap and user traffic low
  allowedCronLocal: "0 2 * * *"        # 02:00 local region time
  perTenantConcurrency: 1
  bandwidthBudgetMibps: 200            # cluster-wide cap
Enforcement points
- Gateway: rate limit by `x-tenant-id`; reject over-limit requests with `429` + `Retry-After`.
- Export: scheduler checks bytes/day and concurrency; defers jobs that fall outside the window.
- Query: enforce payload caps and pagination; optional per-tenant query RPS.
- Projection: KEDA per-subscription scaler to isolate hot tenants (sketch below).
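A minimal KEDA sketch for the Projection bullet, assuming a Service Bus subscription per hot tenant and a TriggerAuthentication backed by Workload Identity; every name below is illustrative:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: atp-projection-hot-tenant              # hypothetical
  namespace: atp-projection-ns
spec:
  scaleTargetRef:
    name: atp-projection                       # assumed Deployment name
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
    - type: azure-servicebus
      metadata:
        namespace: atp-asb-prod-eus            # Service Bus namespace (assumed)
        topicName: audit-events                # assumed topic
        subscriptionName: projection-tenant-hot   # assumed per-tenant subscription
        messageCount: "500"                    # backlog per replica before scaling out
      authenticationRef:
        name: atp-asb-trigger-auth             # TriggerAuthentication (azure-workload)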
Auto-shrink & retention compaction¶
- WARM compaction: roll partitions by `(tenantId, eventMonth)`, compress historical tables, and drop derived columns that can be rehydrated.
- Lifecycle policies (Blob):
  - HOT → Cool after N days (policy-driven); COLD exports go to Archive immediately.
  - Auto-expire temporary export staging containers after M days.
- Observability retention:
- Logs 30–90d (by env), metrics 90d, traces 7–30d; keep exemplars for P1 incidents tagged for 180d.
- Sampling: 0% DEBUG, 10–20% INFO, 100% WARN/ERROR (server-side).
- Scale-to-zero workers (Projection/Export/Integrity) when idle; cron-scaled for windows.
- Search: prefer reindex on demand over long retention of large indexes; elastic alias swaps to limit downtime.
Top 10 cost levers¶
- Export egress — biggest surprise line item: constrain by windows, compression, and bytes/day caps.
- Azure Monitor ingestion — prune log fields; sample non-errors; avoid high-cardinality labels (tenant-safe but minimal).
- Projection batch size — fewer DB round-trips; tune upsert bulk size before hitting RU/DTU throttles.
- ASB Premium MUs — right-size namespaces; consolidate topics; dedupe window to reduce dup processing.
- HOT retention & replication — align WORM duration with true policy; replicate only within jurisdiction; avoid unnecessary cross-region copies.
- Search index refresh — reduce refresh frequency; index only searchable fields; keep analyzers simple.
- Query cache — cache common queries; cap response size; push down filters to partitions.
- Node pools — run heavy jobs on np-io only; scale to zero off-window; right-size VM SKUs.
- APIM tiers — pick per-region capacity that matches real QPS; shift low-risk limits to Gateway to avoid APIM overage.
- Feature flags — disable expensive features (e.g., verify-on-read) for tiers/editions where not required.
Quick cost calculator (inputs & formulas)¶
Use this to estimate monthly cost to order-of-magnitude accuracy; plug the inputs into a spreadsheet with your actual Azure rates.
Inputs
- `T` = tenants
- `R_d` = records (events) per day (all tenants)
- `B_e` = avg event bytes (raw payload)
- `O_hot` = HOT overhead factor (manifest, hashes, signatures; e.g., 1.25)
- `Rep_hot` = HOT replication multiplier (e.g., 2 for primary + DR)
- `P_proj` = projection amplification (derived rows per event × avg bytes; e.g., 0.5 × B_e)
- `Q_d` = queries per day; `B_q` = avg query response bytes
- `X_d` = export bytes per day (post-compression)
- `Log_b` = avg log bytes per request/event traced (after sampling)
Derived
- HOT monthly GB: `HOT_GB = (R_d × B_e × O_hot × Rep_hot × 30) / (1024^3)`
- WARM monthly GB (projections): `WARM_GB = (R_d × P_proj × 30) / (1024^3)`
- Query egress monthly GB: `Q_GB = (Q_d × B_q × 30) / (1024^3)`
- Exports monthly TB: `X_TB = (X_d × 30) / (1024^4)`
- Observability GB: `OBS_GB = ((R_d + Q_d) × Log_b × 30) / (1024^3)`
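Worked example (illustrative inputs, not platform figures): with `R_d = 10,000,000` events/day, `B_e = 2,048` bytes, `O_hot = 1.25`, and `Rep_hot = 2`, new HOT storage is `HOT_GB = (10,000,000 × 2,048 × 1.25 × 2 × 30) / 1024^3 ≈ 1,430 GB`, i.e. roughly 1.4 TB of WORM growth per month to price against the storage budget.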
Rules of thumb
- Aim for `O_hot` ≤ 1.25 (keep manifests/metadata lean).
- Keep `P_proj` as small as SLOs allow (derive on read when possible).
- Keep `OBS_GB` under a fixed budget; enforce sampling at the Gateway and worker services.
- Validate that `X_TB` stays inside your egress budget; increase compression or shift to on-site pickup if exceeded.
Budget automation (signals → actions)¶
- Budget breach early warning: if the 7-day projected cost exceeds 90% of the monthly budget, tighten quotas by edition (`maxIngestRps`, `maxDailyExportBytes`) and reduce KEDA `maxReplicaCount` for replay workloads.
- Auto-shrink trigger: when WARM size growth exceeds 20% month over month, automatically compact old partitions and reduce index refresh frequency.
- Observability clamp: if the `OBS_GB` trend exceeds budget, raise sampling, truncate overly long log fields, and disable verbose query logging.
With explicit per-service drivers, tenant/edition quotas, lifecycle-based auto-shrink, and a simple calculator to size HOT/WARM/COLD and egress, you can keep ATP within a predictable spend envelope—even during growth or replay/DR events.
Ops Runbook Hooks & Change Management (Azure-first)¶
This section connects the deployment views to day-2 operations: how we verify a rollout, flip features safely, and govern changes to the topology.
Health checks, readiness, and smoke tests (post-rollout)¶
Kubernetes probes (standard)
livenessProbe:
  httpGet: { path: /healthz, port: 8080 }
  initialDelaySeconds: 20
  periodSeconds: 10
readinessProbe:
  httpGet: { path: /readyz, port: 8080 }    # mesh sidecar ready, KV-CSI mounted, SB/DB reachable
  initialDelaySeconds: 10
  periodSeconds: 5
startupProbe:
  httpGet: { path: /startupz, port: 8080 }
  failureThreshold: 30
  periodSeconds: 5
Post-deploy smoke (automated)
- External: App Insights Availability Tests (Ping & Multistep) against Gateway canary routes.
- Internal: synthetic `append → project → query` tracer (small tenant sandbox) asserts p95s and zero DLQ for 15–30 min.
- Data plane sanity: HOT container write/read, ASB publish/consume, minimal DB read/write cycle.
- Go/No-Go: promote traffic from 10% → 50% → 100% only if the SLO burn-rate guard and smoke tests pass.
Change windows, feature flags, dark-launch toggles¶
- Windows
  - Staging: daily, region-local off-peak; prod: weekly window per region; emergency hotfixes allowed with RM+SRE approval.
- Feature flags (examples)
  - `features.sealing.enabled`, `features.verifyOnRead.default`, `features.export.caps.*` — set via values overlays; read by services at startup and on a refresh signal.
- Dark-launch
  - Route a header-scoped slice (e.g., `x-atp-experiment: vNext`) through APIM/Ingress to a preview deployment (see the sketch after this list).
  - Observe SLO deltas and logs; ramp only after a stability window.
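If NGINX Ingress is the in-cluster edge, the header-scoped slice can be expressed with canary annotations as below (APIM would use an inbound policy instead); the host, namespace, Service name, and port are assumptions:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: atp-gateway-preview                  # hypothetical preview route
  namespace: atp-gateway-ns                  # assumed namespace
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-by-header: "x-atp-experiment"
    nginx.ingress.kubernetes.io/canary-by-header-value: "vNext"
spec:
  ingressClassName: nginx
  rules:
    - host: api.atp.example.com              # assumed host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: atp-gateway-preview    # assumed preview Service
                port:
                  number: 8080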
Runbook hooks (link placeholders)¶
- DLQ Drain & Resubmit → `docs/runbooks/dlq-replay.md`
- Projection Replay from HOT → `docs/runbooks/projection-replay.md`
- DR Cutover (AFD, ASB Alias, HSM Restore) → `docs/runbooks/dr-failover.md`
- Integrity Key Rotation & Dual-Sign Window → `docs/runbooks/key-rotation.md`
- Hotfix Rollback → `docs/runbooks/rollback.md`
- Quota Breach Response (Cost Guardrails) → `docs/runbooks/quotas.md`
ADRs and proposing a topology change¶
- ADR location: `docs/adr/` (use log4brains or similar).
- When an ADR is required: new region, new PaaS SKU/tier, mesh policy changes, egress to a new external service, storage/retention policy changes, or any control affecting security/DR/SLO/cost.
- ADR template essentials
- Context & goals; security & residency impact; SLO impact; cost delta; migration plan; rollback; monitoring plan; affected diagrams.
- Process
- Draft ADR with diagrams and overlays changed.
- Add threat model note and cost estimate.
- Open PR tagged Architecture, Security, SRE for review.
- Pilot in staging with canary + smoke; attach results to ADR.
- Merge ADR; schedule production change window.
PR checklist for infra changes (mini)¶
- OPA/Kyverno policy pass (PSS: restricted, no NodePort, resource limits set).
- Ratify: image signature + SBOM verified.
- Private Link endpoints in place for new PaaS; egress allow-list updated.
- IAM least-privilege reviewed (RBAC roles scoped to resource/namespace).
- Residency: data stays within allowed region/jurisdiction.
- SLO impact assessed; canary plan + rollback defined.
- Cost: estimate and budget tag updated (FinOps label).
- Observability: dashboards/alerts updated; new signals documented.
- Runbooks: added/updated; DR implications noted.
- Security sign-off for edge/WAF/JWT/policy changes.
- Backout plan verified (previous helm release or traffic revert script).
- Communication: change ticket, stakeholder notice, and on-call briefed.
Deployment annotations & traceability¶
metadata:
  annotations:
    atp.io/change-id: "$(Build.BuildNumber)"
    atp.io/commit: "$(Build.SourceVersion)"
    atp.io/adr: "ADR-0021-topology-change"
    atp.io/runbook: "docs/runbooks/rollback.md"
- Emit a deployment annotation event to Azure Monitor/Grafana so SLO charts show the change marker.
Rollback triggers (guardrails)¶
- Fast burn-rate page during canary (e.g., error rate > 3.6% for 5 min) → auto-rollback.
- p95 regressions (ingest > 500 ms or query > 800 ms for 5 min) → rollback.
- Projection freshness p95 > 180 s for 10 min with no remediation → rollback.
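For illustration, the fast burn-rate trigger above could be encoded as a Prometheus-style alert rule (e.g., Azure Monitor managed Prometheus feeding the rollback automation); the metric and label names are hypothetical, not defined by this page:

groups:
  - name: atp-canary-guards
    rules:
      - alert: CanaryErrorBurnRate
        expr: |
          sum(rate(atp_http_requests_total{track="canary",status=~"5.."}[5m]))
            / sum(rate(atp_http_requests_total{track="canary"}[5m])) > 0.036
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Canary error rate above 3.6% for 5 minutes - trigger auto-rollback"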
Outcome: after each rollout you have verifiable health, controlled feature exposure via flags/dark-launch, governed changes through ADRs, and a repeatable PR checklist that keeps security, SLOs, cost, and DR aligned with the deployment views.