
Architecture Overview — Audit Trail Platform (ATP)

This document provides the top-level architectural narrative for ATP. It explains why the platform exists, the guiding principles that shape every decision, and how readers should navigate the rest of the architecture set. Deep dives live in sibling docs (HLD, Components, Data Model, Sequence Flows, Deployment Views) per the table of contents.


Purpose

ATP is a secure, multi‑tenant audit and evidence platform that ingests, classifies, and stores immutable events from heterogeneous systems, making them queryable, exportable, and verifiable under strict compliance and data‑residency constraints. The architecture aims to:

  • Provide tamper‑evident append‑only storage with verifiable integrity signals.
  • Support high‑throughput ingestion and low‑latency query paths across tenants and editions.
  • Embed privacy, classification, and retention controls by design (not as an afterthought).
  • Offer clear integration contracts (REST/gRPC, events, webhooks) and tested SDKs.
  • Expose operational transparency (OTel traces/logs/metrics) with SLO‑backed reliability.
  • Remain cost‑aware and scalable, balancing hot/warm/cold storage and export patterns.

Architectural Principles

The following principles are non‑negotiable guardrails. Each downstream decision (interfaces, storage, indexing, deployment) should cite the relevant principle(s).

  • Security‑First & Zero‑Trust — Strong identity between workloads; least privilege; per‑tenant and per‑operation authorization; secrets/keys managed via KMS; encryption in transit and at rest.

  • Multi‑Tenant Isolation by Default — Tenant context is explicit in every interface and persisted form. Isolation uses a layered model (routing, authZ, RLS/filters, quotas, rate‑limits). No “best‑effort” multi‑tenancy.

  • Event‑Driven by Design — Ingestion, projection, and export are choreographed via durable messaging. Outbox/inbox and idempotency keys are mandatory on all critical paths.

  • Tamper‑Evidence & Integrity — Append‑only semantics; chain‑of‑hash/signature strategies; evidence manifests that can be independently verified at export/eDiscovery time.

  • Compliance‑by‑Design — Data classification, minimization, and retention are enforced at write time; residency controls and subject‑rights operations (DSR) are planned and testable.

  • Observability‑First — Every request and message is traceable end‑to‑end with correlation/tenant/edition tags. Golden signals and error budgets are defined per service and reflected in SLOs.

  • Resilience & Back‑Pressure — Timeouts, retries with jitter, bulkheads, DLQs, and circuit breakers are applied consistently. Components are idempotent and safe to replay.

  • API‑First & Contract‑Driven — REST/gRPC schemas and event contracts are versioned, linted, and backward compatible; producers/consumers are validated in CI with contract tests.

  • Scalability with Cost Discipline — Scale out hot paths; project read models fit for purpose; apply storage tiering (hot/warm/cold) and export batching windows to protect SLOs and cost envelopes.

  • Simplicity & Paved Roads — Prefer standard templates, libraries, and platform “paved roads” over bespoke solutions. Documentation and examples are treated as part of the product.

  • Governed Change & Traceability — Architectural decisions are captured as ADRs; artifacts and flows are versioned; changes reference the driving SLOs, risks, and compliance requirements.
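
The tamper-evidence principle above hinges on a chain of hashes: each append commits to everything before it, so altering any historical record invalidates every later link. A minimal illustrative sketch (Python for brevity; the platform itself runs on .NET, and the real Integrity service also signs checkpoints via KMS — function names here are hypothetical):

```python
import hashlib
import json

GENESIS = "0" * 64  # well-known starting link

def chain_hash(prev_hash: str, record: dict) -> str:
    """Hash of the previous link concatenated with the canonical record bytes."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(prev_hash.encode() + canonical).hexdigest()

def build_chain(records: list[dict]) -> list[str]:
    """Append-only: link i commits to records[0..i]."""
    hashes, prev = [], GENESIS
    for rec in records:
        prev = chain_hash(prev, rec)
        hashes.append(prev)
    return hashes

def verify_chain(records: list[dict], hashes: list[str]) -> bool:
    """Recompute from genesis; any tampered record breaks all later links."""
    return build_chain(records) == hashes
```

A verifier only needs the records and the published chain, which is what makes export-time independent verification possible.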

Audience & Reading Map


System Context (C4 L1)

ATP sits between event producers (first- and third-party systems) and consumers (operators, auditors, investigators, export/eDiscovery tooling). Clear trust boundaries ensure identity, tenancy, privacy, and integrity guarantees are preserved end-to-end.

Primary Actors

  • Event Producers — product microservices, backends, frontends, partner systems emitting audit events (REST/gRPC, webhooks).
  • Human Users — Operators/SRE (runbooks, dashboards), Auditors/Investigators (query & export), Tenant Admins (policy & access).
  • Automation — CI/CD agents and scheduled jobs exercising administrative APIs (policy/config rotation, projector rebuilds).

External Systems

  • Identity Provider (IdP) — OIDC/OAuth2 for users; workload identity for service-to-service.
  • Key Management (KMS) — encryption and signing keys; rotation; optional customer-managed keys.
  • Observability Stack — OTel collector, metrics/logs/traces backends, alerting.
  • Export Destinations — object storage targets, legal hold archives, SIEM/DLP, ticketing (optional webhooks).
  • Admin Surfaces — configuration/feature flags, policy repositories, schema registry.

Trust Boundaries & Zones

  • Ingress Boundary (Gateway) — authentication, tenancy resolution, request limits, schema validation, input sanitation.
  • Processing Boundary (Messaging/EDA) — durable delivery, outbox/inbox, idempotency keys, replay safety, DLQ isolation.
  • Data Boundary (Storage/Indexes) — encryption at rest, tenant-scoped access (RLS/filters), classification & retention enforcement.
  • Admin/Control Plane — privileged operations, break-glass workflows, audited changes and ADR-linked approvals.

Interface Surface (at a glance)

  • Inbound
    • REST/gRPC: /api/v{n}/audit/append, /api/v{n}/query/..., /api/v{n}/export/...
    • Webhooks: signed callbacks for event ingestion (optional connectors)
  • Outbound
    • Events: AuditRecord.Appended, AuditRecord.Accepted, Projection.Updated, Export.Requested|Completed
    • Webhooks: export completion, verification results (optional)
  • Admin
    • Policies: classification/retention APIs, edition flags
    • Ops: projector lag, replay tooling, DLQ management

Tenancy & Identity Propagation

  • Tenancy is explicit on every call/message (tenantId claim/header); enforced in the gateway and persisted in all stores.
  • Authorization is tenant-scoped (RBAC/ABAC) with edition gates; tokens are short-lived and audience-bound.
  • Observability carries tenantId, edition, traceId, and correlationId across boundaries.

System Context (mermaid)

flowchart LR
  subgraph External
    EP[(Event Producers)]
    AUD[(Auditors/Investigators)]
    OPS[(Operators/SRE)]
    IDP[(IdP/OIDC)]
    KMS[(KMS)]
    OBS[(OTel/Logs/Metrics)]
    DST[(Exports / eDiscovery Destinations)]
  end

  GW[API Gateway<br/>AuthZ • Tenancy • Rate Limit • Schema]
  BUS[(Event Bus)]
  ING[Ingestion Service<br/>validate • classify • retain]
  INT[Integrity Service<br/>hash chains • signatures]
  PROJ[Projection Service<br/>read models • lag control]
  QRY[Query Service<br/>tenant-scoped filters]
  EXP[Export Service<br/>packages • manifests]

  EP -->|REST/gRPC/Webhooks| GW
  GW --> BUS
  BUS --> ING
  ING --> INT
  ING -->|append| STORE[(Append Store)]
  ING --> PROJ
  PROJ --> QRY
  QRY -->|results| AUD
  EXP -->|packages| DST

  GW --- IDP
  ING --- KMS
  INT --- KMS
  GW --- OBS
  ING --- OBS
  PROJ --- OBS
  QRY --- OBS
  EXP --- OBS

  AUD -->|requests| GW
  OPS -->|dashboards/runbooks| QRY
  QRY --> EXP

Quality Attributes Anchored Here

  • Security & Privacy — zero-trust ingress, least-privilege access, data minimization at write.
  • Integrity — append-only semantics with verifiable chains/signatures.
  • Scalability — bursty producers absorbed via durable messaging and back-pressure.
  • Operability — golden signals and SLO budgets per service; traceability across all hops.
  • Cost Awareness — hot/warm/cold tiers; export windows; per-tenant quotas and rate limits.


Bounded Contexts & Context Map

The ATP domain is split into cohesive bounded contexts that collaborate via well-defined contracts (REST/gRPC, events, webhooks). Each context owns its model, persistence, and decision logic; integration relies on Published Language and Open Host Service patterns, with Anti-Corruption Layers at external edges.

Bounded Contexts (overview)

  • Gateway
    Responsibility — Ingress, authentication/authorization, tenancy resolution, rate limiting, schema validation, versioning.
    Contracts — REST/gRPC (OHS), request/response schemas.
    Notes — Enforces tenantId propagation and edition gates.

  • Ingestion
    Responsibility — Canonicalize & validate events, apply classification/retention at write, append to immutable store, emit acceptance signals.
    Contracts — Consumes REST/gRPC/webhooks from Gateway (OHS); publishes AuditRecord.Appended|Accepted.
    Notes — ACL for producer-specific payloads → canonical schema.

  • Policy
    Responsibility — Provide decisions for classification, retention, redaction; edition feature flags.
    Contracts — Decision API (sync) + policy change events.
    Notes — Customer–Supplier to Ingestion/Query/Export; Policy is the supplier.

  • Integrity
    Responsibility — Compute/verify hash chains & signatures; issue attestations and evidence manifests.
    Contracts — Subscribes to append pipeline; exposes verify endpoints; emits Integrity.Verified.
    Notes — Keys/rotation via KMS.

  • Projection
    Responsibility — Build/maintain read models and search indexes; track projector lag/watermarks; rebuild strategies.
    Contracts — Subscribes to AuditRecord.Accepted (and deltas); publishes Projection.Updated.
    Notes — Strict idempotency and replay safety.

  • Query
    Responsibility — Authorized, tenant-scoped retrieval over read models; policy-aware filtering/redaction.
    Contracts — REST/gRPC; optional GraphQL facade (internal).
    Notes — Surfaces selection for exports.

  • Search (optional)
    Responsibility — Full-text/time-range queries aligned with policy & tenancy.
    Contracts — Reads projection feed; exposes search API.
    Notes — May be disabled in small editions.

  • Export
    Responsibility — Package selections (from Query/Search) with manifests, signatures; manage delivery & legal hold.
    Contracts — REST to request/stream; events Export.Requested|Completed; optional webhooks.
    Notes — Throttled, resumable flows; long-running operations.

  • Admin/Control Plane
    Responsibility — Policies, schemas, feature flags, projector controls, replay/DLQ tooling.
    Contracts — Admin APIs; audit of changes; ADR links in metadata.
    Notes — Break-glass procedures with strict logging.

Relationship Patterns

  • Gateway → Ingestion — Open Host Service with a versioned Published Language.
  • Ingestion → Policy — Customer–Supplier (Ingestion conforms to Policy’s decisions).
  • Ingestion → Integrity — Conformist to integrity calculation rules; emits materials for verification.
  • Ingestion → Projection — Event choreography via durable topics.
  • Projection → Query/Search — Published Language for read models/index projections.
  • Query/Search → Export — Customer–Supplier (Export depends on Query’s selection semantics).
  • External Producers → Gateway — Anti-Corruption Layer at Ingestion to canonicalize.

Context Map (mermaid)

flowchart LR
  subgraph External Producers
    P1[(Product Services)]
    P2[(3rd-Party Systems)]
  end

  GW[Gateway<br />OHS + Published Language]
  ING[Ingestion<br />ACL + Canonicalizer]
  POL[Policy<br />Decisions: classify/retain/redact]
  INT[Integrity<br />Hash Chains/Signatures]
  PROJ[Projection<br />Read Models/Indexes]
  QRY[Query<br />Tenant-Scoped Retrieval]
  SRCH[Search<br />Optional Index API]
  EXP[Export<br />Packages + Manifests]
  ADM[Admin/Control Plane]

  P1 -->|REST/gRPC/Webhooks| GW
  P2 -->|REST/gRPC/Webhooks| GW
  GW --> ING

  ING -->|decisions| POL
  POL -->|policy changes| ING
  POL --> QRY
  POL --> EXP

  ING -->|append accepted| INT
  ING -->|events| PROJ
  PROJ --> QRY
  PROJ --> SRCH

  QRY -->|selection| EXP
  SRCH -->|selection| EXP

  ADM --- POL
  ADM --- PROJ
  ADM --- EXP

  classDef c fill:#F4F7FF,stroke:#5B6,stroke-width:1px,rx:6,ry:6;
  class GW,ING,POL,INT,PROJ,QRY,SRCH,EXP,ADM c;

Contract Snapshots (at a glance)

  • Events (Published Language)

    • AuditRecord.Appended → canonical event submitted (pre-commit checks passed)
    • AuditRecord.Accepted → persisted + classified + retained; integrity material ready
    • Projection.Updated → read model/index segment advanced (with watermark)
    • Export.Requested | Export.Completed → export lifecycle
    • Integrity.Verified → attestations for records/segments/packages
  • APIs (Open Host Service)

    • POST /api/v{n}/audit/append — Gateway → Ingestion (tenant/edition required)
    • GET /api/v{n}/query/... — tenant-scoped search & retrieval
    • POST /api/v{n}/export/... — create/stream export packages
    • POST /api/v{n}/policy/evaluate — (internal) policy decision snapshot
    • POST /api/v{n}/integrity/verify — verify record/segment/export

Modeling Notes

  • Ubiquitous Language — “audit record”, “evidence chain”, “manifest”, “projection lag”, “selection set” are domain terms; see domain language.
  • Idempotency Keys — (tenantId, sourceId, sequence|hash) for all ingestion paths.
  • Tenancy — Always explicit; persisted on write; filtered on read; included in traces.
  • Edition Awareness — Feature gates in Gateway/Policy; never UI-only gates.


Core Services & Containers (C4 L2)

This section presents the container view of ATP: the runtime building blocks (services and infrastructure) and their responsibilities, interfaces, data boundaries, and operational concerns. It complements the domain view by focusing on how capabilities are realized in deployable components.

Container View (diagram)

flowchart LR
  subgraph Edge
    G[API Gateway<br />AuthN/Z • Tenancy • RL • Versioning]
  end

  subgraph App Plane
    ING[Ingestion Service<br />validate • classify • retain]
    POL[Policy Service<br />classify • retain • redact • edition]
    INT[Integrity Service<br />hash chains • signatures • attest]
    PROJ[Projection Service<br />read models • indexes • lag]
    QRY[Query Service<br />search • filters • redaction]
    SRCH[Search Service - optional<br />full-text • time-range]
    EXP[Export Service<br />packages • manifests • deliver]
    ADM[Admin/Control Plane<br />schemas • flags • replay • DLQ]
  end

  subgraph Data Plane
    APPEND[(Append Store<br />append-only)]
    READ[(Read Models / Indexes)]
    COLD[(Cold Archive / eDiscovery)]
  end

  subgraph Platform
    BUS[(Event Bus / Topics)]
    KMS[(KMS / Secrets)]
    OTL[(OTel Collector / Logs / Metrics / Traces)]
    IDP[(IdP / OIDC)]
  end

  G -->|REST/gRPC| ING
  G -->|REST/gRPC| QRY
  QRY -->|select| EXP
  ING -->|decisions| POL
  POL --> QRY
  POL --> EXP
  ING -->|append| APPEND
  ING -->|events| PROJ
  PROJ --> READ
  SRCH --> READ
  QRY --> READ
  EXP -->|packages| COLD

  ING -.->|materials| INT
  INT -.->|verify| EXP

  G --- IDP
  ING --- BUS
  PROJ --- BUS
  EXP --- BUS
  ING --- KMS
  INT --- KMS
  G --- OTL
  ING --- OTL
  PROJ --- OTL
  QRY --- OTL
  EXP --- OTL

  classDef svc fill:#F4F7FF,stroke:#7aa6ff,stroke-width:1px,rx:6,ry:6;
  classDef plat fill:#fafafa,stroke:#c8c8c8,stroke-width:1px,rx:6,ry:6;
  class G,ING,POL,INT,PROJ,QRY,SRCH,EXP,ADM svc;
  class BUS,KMS,OTL,IDP,APPEND,READ,COLD plat;

Containers (one-liners)

  • API Gateway — Central ingress; AuthN/Z, tenancy resolution, rate limiting, schema & versioning.
  • Ingestion — Validates/canonicalizes events, applies classification/retention, writes append-only, emits acceptance.
  • Policy — Synchronously answers classification/retention/redaction queries; manages edition feature gates.
  • Integrity — Maintains hash chains and signatures; exposes verification APIs; issues evidence manifests.
  • Projection — Builds read models and indexes; tracks watermarks/lag; supports rebuilds.
  • Query — Tenant-scoped retrieval with policy-aware filtering and redaction; selection for exports.
  • Search (optional) — Full-text/time-range over projected data while respecting policy/tenancy.
  • Export — Packages selections + manifests, supports resumable streaming, and optional webhooks.
  • Admin/Control Plane — Config, feature flags, schema registry, DLQ/replay, break-glass ops.

Responsibilities Matrix

| Container | Purpose | Key Interfaces | Data Boundary | Scaling / Resilience | SLO Hints |
|---|---|---|---|---|---|
| API Gateway | Ingress, authZ, tenancy, RL, versioning | REST/gRPC; JWT/mTLS | N/A (stateless) | HPA on RPS; 429 for back-pressure | p95 auth+route ≤ X ms |
| Ingestion | Validate, classify, retain, append | REST/gRPC from Gateway; publishes events | Writes to Append Store | Outbox; consumer concurrency; DLQ | p95 append ≤ X ms; accept rate |
| Policy | Decisions: classify/retain/redact; edition | Sync decision API; policy change events | Policy store (read-mostly) | Cache + TTL; fallback modes | p95 decision ≤ Y ms |
| Integrity | Hash/sign; attest/verify | Verify API; subscribes to append | Integrity material store | Idempotent; replay-safe verify | p95 ≤ Z ms |
| Projection | Build read models / indexes | Subscribes to accepted events | Read Models/Indexes | Rebuild tooling; lag caps | projector lag ≤ N s |
| Query | Tenant-scoped retrieval, redaction | REST/gRPC; (opt) GraphQL | Read-only over Read Models | Cache; rate-limit; bulkheads | query p95/p99 targets |
| Search (opt) | Full-text/time queries | REST/gRPC | Search index | Async refresh; ISR windows | query p95 ≤ T ms |
| Export | Package & deliver w/ manifests | REST stream; events; (opt) webhook | Streams from Read/Cold | Resumable; batch windows | completion p95 for M rec |
| Admin | Policies, flags, replay, DLQ, schemas | Admin REST/gRPC | Control metadata | Strong audit; approvals | admin ops audited |

Data Containers

  • Append Store (hot) — Append-only write path; short retention for high-QPS ingestion and near-term verification.
  • Read Models / Indexes (warm) — Denormalized projections tailored to query/search; rebuildable from events.
  • Cold Archive (cold) — Long-term eDiscovery/export storage; immutability and legal-hold compatible.

Data tiering and shapes are detailed in Data Model (data-model.md) and Deployment Views (deployment-views.md).

Interface Summary

  • Inbound: POST /api/v{n}/audit/append, GET /api/v{n}/query/..., POST /api/v{n}/export/...
  • Events: AuditRecord.Appended, AuditRecord.Accepted, Projection.Updated, Export.Requested|Completed, Integrity.Verified
  • Admin: .../policy/*, .../admin/replay, .../admin/dlq, .../admin/schema

Cross-Cutting (applies to all services)

  • Tenancy: tenantId is mandatory at ingress and persisted end-to-end; enforced by gateway/middleware and data filters.
  • Security: Zero-trust defaults; mTLS in mesh; KMS-managed keys; short-lived tokens with audience/scope.
  • Observability: OTel traces/logs/metrics with tenantId, edition, traceId, correlationId; golden signals per service.
  • Resilience: Outbox/inbox, idempotent consumers, configured retries with jitter, circuit breakers, DLQ + replay tooling.
  • Cost: Rate-limits/quotas per tenant; projection/index retention; export batching windows.


Component Boundaries (C4 L3)

This section details the internal structure of each service using a Clean Architecture variant:

  • API (adapters) — HTTP/gRPC endpoints, webhook receivers.
  • Application (use-cases) — orchestration, policies, idempotency, transactions.
  • Domain (model) — aggregates, value objects, domain services.
  • Infrastructure (adapters) — persistence, messaging, cache, KMS, observability.

Dependency Rule: API → Application → Domain, with Domain independent of frameworks. Infrastructure depends inward via ports (interfaces) declared in Application/Domain.

Reference Component Map (Ingestion service)

flowchart LR
  subgraph API Layer
    Ctrl[AppendController<br />REST/gRPC]
    Hook[WebhookReceiver]
  end

  subgraph Application Layer
    UC[AppendAuditRecordUseCase]
    Pol[PolicyClient Port]
    Repo[AuditRecordRepository Port]
    Outb[Outbox Port]
    Idem[IdempotencyService]
    Val[SchemaValidator]
  end

  subgraph Domain Layer
    Agg[AuditRecord Aggregate]
    Clas[Classification VO]
    Ret[RetentionPolicy VO]
    Sig[IntegrityMaterial VO]
    DS[Domain Services]
  end

  subgraph Infrastructure Adapters
    RepoImpl[(NHibernate Repo)]
    Bus[(Message Broker Adapter)]
    KMS[(KMS Adapter)]
    Cache[(Cache Adapter)]
    OTel[(OTel Adapter)]
  end

  Ctrl --> UC
  Hook --> UC
  UC --> Val
  UC --> Pol
  UC --> Repo
  UC --> Outb
  UC --> Idem
  UC --> Agg
  Agg --> Clas
  Agg --> Ret
  Agg --> Sig

  RepoImpl -.implements.-> Repo
  Bus -.implements.-> Outb
  KMS -.used by.-> DS
  OTel -.used by.-> UC
  Cache -.used by.-> Pol

Flow (happy path)

  1. AppendController validates auth/tenancy → calls AppendAuditRecordUseCase.
  2. Use-case validates schema, checks idempotency, queries PolicyClient for classification/retention.
  3. Aggregate enforces invariants, produces domain events.
  4. Repository persists append-only record + outbox entry.
  5. Outbox relays an AuditRecord.Accepted event; OTel spans/logs recorded.
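
Steps 2–5 can be sketched end-to-end with an in-memory stand-in for the append store and transactional outbox. This is a hedged illustration (Python for brevity; names like `append_audit_record` are hypothetical — the real use-case is C# against the ports below), showing the idempotency check and the atomic record-plus-outbox write:

```python
from dataclasses import dataclass, field

@dataclass
class InMemoryStore:
    """Stand-in for the append store + outbox persisted in one transaction."""
    records: dict = field(default_factory=dict)
    outbox: list = field(default_factory=list)

def append_audit_record(store: InMemoryStore, ctx: dict, record: dict) -> str:
    key = ctx["idempotencyKey"]
    if key in store.records:                       # step 2: idempotency check
        return "noop"                              # duplicate delivery, no side effects
    enriched = {**record,                          # steps 2-3: policy decision applied
                "tenantId": ctx["tenantId"],
                "classification": ctx["policyDecision"]}
    # step 4: record + outbox entry written together (atomic in a real transaction)
    store.records[key] = enriched
    store.outbox.append({"subject": "audit.accepted", "idempotencyKey": key})
    return "accepted"
```

A duplicate submission returns "noop" and leaves exactly one outbox entry, which is what lets the relay in step 5 publish at-least-once without double-counting.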

Per-Service Component Boundaries

| Service | API Adapters | Application (Use-cases) | Domain (Aggregates/VO) | Outbound Ports | Infra Adapters |
|---|---|---|---|---|---|
| Gateway | Minimal APIs/gRPC, auth filters, tenancy middleware | RouteResolution, RateLimitCheck, SchemaGate | N/A (edge orchestration) | PolicyDecision, TokenIntrospection, RateLimit | OIDC/JWT, RateLimiter, Schema Registry |
| Ingestion | AppendController, WebhookReceiver | AppendAuditRecord, ClassifyAndRetain, AcceptEvent | AuditRecord, Classification, Retention, IntegrityMaterial | PolicyClient, AuditRecordRepository, Outbox | NHibernate, Broker, KMS, OTel |
| Policy | PolicyController | EvaluateClassification, EvaluateRetention, EvaluateRedaction | PolicySet, Rule, Decision | PolicyChangePublisher | PolicyStore (read-mostly), Broker, Cache |
| Integrity | VerifyController | ComputeChain, SignSegment, VerifyEvidence | EvidenceChain, Signature, Checkpoint | IntegrityRepo, KmsSigner, VerifyResultPublisher | NHibernate, KMS, Broker |
| Projection | (Internal) Health/Control endpoints | ApplyDelta, RebuildProjection, ManageLag | ReadModelSegment, Watermark | ProjectionRepo, LagMetrics | NHibernate/Read-store, Broker, OTel |
| Query | QueryController, (opt) GraphQL | ExecuteQuery, ApplyPolicyFilters, RedactFields | SelectionSet, RedactionPlan | ReadModelRepo, PolicyClient | Read-store adapters, Cache |
| Search (opt.) | SearchController | ExecuteSearch, BuildQueryPlan | SearchQuery, Facet, TimeRange | IndexRepo | Search index adapter |
| Export | ExportController, Webhook for completion | CreatePackage, StreamPackage, BuildManifest | ExportPackage, Manifest, SelectionRef | ReadModelRepo, ColdStoreSink, IntegrityVerifier | Object storage, Broker, KMS |
| Admin/Control | AdminController | UpdatePolicy, ReplayDLQ, Rebuild, FeatureToggle | AdminAction, Approval | DLQClient, FeatureFlagRepo | Broker Mgmt, Flags Store |

Ports live in Application/Domain, implemented by Infrastructure. Tests use in-memory or fakes against ports to keep fast, hermetic boundaries.


Ports & Adapters (canonical interfaces)

// Application ports (examples)
public interface IPolicyClient {
    Task<PolicyDecision> EvaluateAsync(AppendContext ctx, CancellationToken ct);
}
public interface IAuditRecordRepository {
    Task AppendAsync(AuditRecord record, CancellationToken ct);
}
public interface IOutbox {
    Task EnqueueAsync<T>(T evt, CancellationToken ct);
}
public interface IReadModelRepository {
    Task<QueryResult> ExecuteAsync(QuerySpec spec, CancellationToken ct);
}
public interface IIntegrityVerifier {
    Task<VerifyResult> VerifyAsync(VerifyTarget target, CancellationToken ct);
}
  • Inbound adapters: Minimal APIs, gRPC services, webhook receivers.
  • Outbound adapters: NHibernate repos, MassTransit/Azure Service Bus producers/consumers, KMS wrappers, cache providers, OTel exporters.

Boundary Rules (enforced)

  • Tenancy is a first-class parameter in all ports and aggregate constructors (no ambient singletons).
  • Idempotency: all write use-cases accept an IdempotencyKey; repositories guarantee upsert-or-noop semantics.
  • Transactions: application layer coordinates a transactional outbox; domain emits events, infra persists record + outbox atomically.
  • Validation: API does syntactic checks; Application performs semantic validation and policy evaluation; Domain enforces invariants only.
  • Error Mapping:
    • Domain validation → 400 / INVALID_ARGUMENT
    • AuthZ/tenancy violations → 403
    • Rate-limit/back-pressure → 429 (with retry hints)
    • Transient infra → retried with jitter; eventual DLQ after N attempts
  • No framework leakage into Domain (no HTTP types, no EF/NHibernate entities, no broker types).
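
The error-mapping rules above are mechanical enough to centralize in one table-driven translator. An illustrative sketch (Python for brevity; error kinds and the `map_error` helper are hypothetical names, not the platform's actual types):

```python
# Status codes follow the boundary rules above.
ERROR_TO_STATUS = {
    "DomainValidation": (400, "INVALID_ARGUMENT"),
    "TenancyViolation": (403, "PERMISSION_DENIED"),
    "RateLimited":      (429, "RESOURCE_EXHAUSTED"),  # client should honor retry hints
}

def map_error(kind: str, attempt: int = 0, max_attempts: int = 5):
    """Transient infra errors retry (with jitter) and only dead-letter after N attempts."""
    if kind == "TransientInfra":
        return ("retry", None) if attempt < max_attempts else ("dlq", None)
    status, code = ERROR_TO_STATUS.get(kind, (500, "INTERNAL"))
    return ("respond", (status, code))
```

Keeping the mapping in one place ensures every service signals the same protocol semantics for the same failure class.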

Concurrency & Scaling

  • API: stateless; scale on RPS; protect with rate limits and request budgets.
  • Consumers/Projectors: concurrency tuned per partition/shard; back-pressure from queue length & lag metrics.
  • Export: long-running streams; resumable with checkpoints; separated pool to avoid starving Query.

Observability & Policies (per boundary)

  • Span model: gateway → use-case → repo/outbox → broker → projector/query/export.
  • Attributes: tenantId, edition, idempotencyKey, policyVersion, watermark.
  • Logs: structured; sensitive fields redacted by classification tags.
  • Metrics: use-case latency, outbox age, projector lag, export TTFB/completion.
  • Policies as code: policy version stamped on write; propagated in events; evaluated again on read/export.
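
Redaction by classification tag, as required for logs above, amounts to filtering each field through the caller's visibility set. A minimal sketch (Python for brevity; `redact_for_log` and the tag names are illustrative assumptions):

```python
def redact_for_log(record: dict, tags: dict[str, str],
                   visible: frozenset = frozenset({"PUBLIC"})) -> dict:
    """Mask any field whose classification tag is not in the visible set.

    Fields without an explicit tag default to PUBLIC here; a stricter
    deployment would default to the most restrictive class instead.
    """
    return {k: (v if tags.get(k, "PUBLIC") in visible else "[REDACTED]")
            for k, v in record.items()}
```

Because the classification tags are stamped at write time, the same function can serve log sinks, traces, and export previews consistently.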

Technology Anchors

  • Runtime: .NET 9, REST APIs/gRPC.
  • Persistence: NHibernate (write/read models), repositories per aggregate.
  • Messaging: MassTransit over Azure Service Bus (topics/queues, DLQ).
  • Security: OIDC, mTLS (mesh), KMS envelope encryption & signing.
  • Telemetry: OpenTelemetry traces/logs/metrics; dashboards per service.
  • Testing: unit tests against ports; contract tests for REST/events; projector/replay integration tests.

Example: Error-to-Protocol Mapping

| Boundary | Failure | Handling | Client Signal |
|---|---|---|---|
| API ingress | Schema invalid | 400 + error codes | X-Request-Id |
| Application | Policy denies write | 403 | Problem+JSON body |
| Repository | Unique-key duplicate (idempotent) | 200 (noop) | Idempotent: true |
| Broker | Publish timeout | Retry + jitter → outbox relay | n/a (internal) |
| Export stream | Client disconnect | Checkpoint + resume | Range support |


Event-Driven Communication Plan

ATP is event-driven by design. Events are the backbone for ingestion acceptance, integrity processing, projection, and exports. We target at-least-once delivery with exactly-once effects via idempotency keys, transactional outbox/inbox, and idempotent consumers. Ordering is scoped, never global.

Patterns & Guarantees

  • Delivery: at-least-once from broker; consumers must be idempotent.
  • Exactly-once intent: (tenantId, sourceId, sequence|hash) as the IdempotencyKey; repository upsert/no-op semantics.
  • Ordering: guaranteed only within a partition key (e.g., tenantId:sourceId). Do not assume global order.
  • Back-pressure: rate limits at gateway, consumer concurrency caps, queue depth alerts, retry with jitter, DLQs.
  • Replay safety: projectors/exporters are replay-tolerant; watermarks control catch-up.
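
The at-least-once/exactly-once-effects pairing above reduces to one consumer-side pattern: record a receipt per idempotency key, and drop anything already seen. An illustrative sketch (Python for brevity; a production inbox stores receipts durably for M days rather than in memory):

```python
processed: set[str] = set()  # inbox receipts; durable and time-bounded in production

def handle(event: dict, apply) -> bool:
    """Idempotent consumer: apply side effects at most once per key."""
    key = event["idempotencyKey"]
    if key in processed:
        return False            # duplicate delivery from the broker: drop
    apply(event)                # real side effect (projection update, export step, ...)
    processed.add(key)          # receipt written in the same transaction as the effect
    return True
```

Writing the receipt and the side effect in one transaction is what turns at-least-once delivery into exactly-once effects.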

Topic & Subscription Topology (logical)

| Subject (topic) | Producers | Consumers | Purpose |
|---|---|---|---|
| audit.appended | Ingestion | Integrity, Projection | Raw append accepted for downstream processing |
| audit.accepted | Ingestion | Projection, Search (opt) | Persisted + classified + retained |
| projection.updated | Projection | Query, Search (opt), Export | Read models advanced; watermark/lag hints |
| export.requested | Query, API | Export | Start packaging workflow |
| export.completed | Export | API/Webhooks, Integrity (opt verify) | Notify completion; attach manifest |
| integrity.verified | Integrity | API, Export (attach to packages) | Attest records/segments/packages |
| policy.changed | Admin/Policy | Ingestion, Query, Export | Cache bust + version pin updates |

Each topic has named subscriptions per service (e.g., projection-svc, export-svc) and a DLQ (<topic>.<subscription>.dlq).

Message Contracts (snapshot)

We publish a Published Language with clear evolution rules (see Message Schemas).

{
  "eventId": "01J8X3TB2Z9WQ6M9P3E2A4K7QG",       // ULID preferred
  "subject": "audit.accepted",
  "schemaVersion": "1.2.0",
  "occurredAt": "2025-10-22T05:12:31Z",
  "tenantId": "t-9d8e...",
  "edition": "enterprise",
  "producer": {
    "service": "ingestion",
    "instance": "ingestion-7fcb9f",
    "region": "eus2"
  },
  "correlation": {
    "traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
    "correlationId": "c-8e7f...",
    "causationId": "evt-01J8X3..."
  },
  "idempotencyKey": "t-9d8e:src-abc:seq-00000045",
  "policy": {
    "classification": "SENSITIVE",
    "retentionPolicyId": "rp-2025-01",
    "policyVersion": 17
  },
  "payload": {
    "recordId": "rec-01H...",
    "sourceId": "src-abc",
    "hash": "sha256:...",
    "sizeBytes": 5342,
    "attributes": { "actorId": "u-123", "action": "UPDATE", "resource": "Order/4711" }
  }
}

Headers (transport)

  • content-type: application/json; charset=utf-8
  • x-tenant-id, x-edition
  • traceparent, tracestate
  • x-idempotency-key, x-schema-version
  • x-classification, x-policy-version

No PII in headers. Payload fields carrying PII are classified and redacted on sinks/logs.

Evolution & Compatibility

  • Additive changes → bump minor (1.1 → 1.2); fields are optional by default.
  • Breaking changes → new subject or major bump with side-by-side consumers.
  • Schema registry: producers/consumers validated in CI; contract tests block incompatible change.
  • Deprecations: announce N releases ahead; dual-publish during migration windows.
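
Because fields are optional by default, the compatibility rule above collapses to a major-version check. A minimal sketch (Python for brevity; `compatible` is an illustrative helper, not the schema registry's actual API):

```python
def compatible(consumer_version: str, producer_version: str) -> bool:
    """Minor/patch bumps are additive (fields optional by default); majors break."""
    consumer_major = int(consumer_version.split(".")[0])
    producer_major = int(producer_version.split(".")[0])
    return consumer_major == producer_major
```

A CI contract test would run this check (plus field-level diffing) for every producer/consumer pair and block merges on a mismatch.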

Choreography (happy path)

sequenceDiagram
  autonumber
  participant P as Producer
  participant GW as API Gateway
  participant ING as Ingestion
  participant BUS as Event Bus
  participant INT as Integrity
  participant PROJ as Projection
  participant QRY as Query
  participant EXP as Export

  P->>GW: POST /audit/append (tenant, idempotencyKey)
  GW->>ING: Append command
  ING->>ING: validate + classify + retain + persist
  ING->>BUS: publish audit.appended
  ING->>BUS: publish audit.accepted
  BUS-->>INT: audit.appended
  INT->>BUS: integrity.verified (optional later)
  BUS-->>PROJ: audit.accepted
  PROJ->>BUS: projection.updated (watermark)
  QRY->>QRY: read models current for queries
  QRY->>BUS: export.requested (on demand)
  BUS-->>EXP: export.requested
  EXP->>BUS: export.completed (manifest, links)

Partitioning & Ordering

  • Partition key: tenantId (default). For high-volume producers, prefer tenantId:sourceId.
  • Ordering guarantees: within partition, best-effort FIFO; consumers must handle reordering and duplicates.
  • Large tenants: increase consumer concurrency; shard by sourceId to avoid hot partitions.

Reliability & Retry

  • Producers: exponential backoff with jitter; bounded retries; overflow to outbox for relay.
  • Consumers: N (e.g., 5) attempts with backoff → DLQ; DLQ triage dashboards and replay tools.
  • Inbox de-dup: store (eventId|idempotencyKey) receipt for M days; drop duplicates.
  • Poison messages: dead-letter with diagnostic envelope (error code, stack, schema version, payload hash).
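
The producer retry policy above is the standard "full jitter" exponential backoff: each attempt waits a random delay drawn from a window that doubles per attempt, capped. A minimal sketch (Python for brevity; `base` and `cap` values are illustrative defaults, not platform configuration):

```python
import random

def backoff_with_jitter(attempt: int, base: float = 0.2, cap: float = 30.0) -> float:
    """Full jitter: random delay in [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

The randomness spreads retries from many producers apart in time, avoiding the synchronized retry storms that fixed backoff creates after a broker blip.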

Security & Privacy

  • Authz on topics: least privilege per service identity; publish/subscribe explicit.
  • PII discipline: only classified fields in payload; never in subjects/headers/topic names.
  • Encryption: in transit (TLS) and at rest (broker & stores); sensitive attachments use KMS envelope encryption.
  • Signature material: events may carry content digests; Integrity service signs chains and manifests.

Observability

  • Tracing: propagate traceparent/correlationId; spans from GW→ING→BUS→INT/PROJ→QRY/EXP.
  • Metrics: publish rate, error rate, outbox age, queue depth, consumer lag, DLQ count, replay count.
  • Logs: structured, redacted by classification; include tenantId, edition, eventId.

Replay & Backfills

  • Controlled replay: per-tenant/window; watermark caps; idempotent processing mandated.
  • Runbook: identify root cause, drain DLQ, run replay job with dry-run, monitor lag & budgets.
  • Projection rebuilds: snapshot + replay strategy; export paused/throttled during rebuild windows.


Multitenancy & Tenancy Guards (overview)

ATP is multi-tenant by default. Tenancy must be explicit, verifiable, and enforced at every boundary: ingress → messaging → storage → observability → exports. Edition flags refine capability exposure per tenant without weakening isolation.

Tenancy Model & Terms

  • Tenant — an organizational boundary; primary key tenantId present in every API call, event, and persisted row.
  • Edition — capability set for a tenant (e.g., Standard/Enterprise); evaluated at Gateway and Policy.
  • Tenant Context — normalized, immutable tuple carried across layers: { tenantId, edition, authz, policyVersion, correlationId }.
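
The Tenant Context can be modeled as an immutable value object. The field names below follow the tuple above; the concrete types (and the scope set for `authz`) are assumptions for this sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: the context is normalized once and never mutated downstream
class TenantContext:
    tenant_id: str
    edition: str             # e.g., "standard" | "enterprise"
    authz: frozenset[str]    # granted scopes/roles, tenant-scoped (assumed shape)
    policy_version: int
    correlation_id: str

ctx = TenantContext("acme", "enterprise", frozenset({"audit:append"}), 7, "corr-123")
```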

Isolation Layers (defense-in-depth)

  1. Ingress — require tenantId; validate edition gates; enforce per-tenant rate limits/quotas and payload limits.
  2. AuthZ — tenant-scoped RBAC/ABAC; tokens must include tenant claim; cross-tenant operations are rejected.
  3. Messaging — partition keys include tenantId (or tenantId:sourceId); topic ACLs scoped to service identity.
  4. Storage — tenantId is part of every key; row-level filters (or RLS) applied in repositories; encryption scope may be per tenant.
  5. Cache — keys are tenant-scoped; no global caches for sensitive projections.
  6. Observability — traces/logs/metrics labeled with tenantId and edition; logs redact PII.
  7. Operations — per-tenant runbooks: suspend, throttle, replay, export, legal hold, and delete (where allowed).

Tenancy Propagation (canonical)

  • Authenticate at Gateway; extract tenantId from token (preferred). Accept X-Tenant-Id only for trusted workload identities; normalize into Tenant Context.
  • Attach Tenant Context to trace and message headers; persist with write operations.
  • Verify Tenant Context at service boundary (middleware) and inside each use-case.
  • Refuse requests if tenant is missing/mismatched; never infer tenant from data.
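
The resolution rules above can be sketched as a small gateway helper. The `TRUSTED_WORKLOADS` allow-list is a hypothetical stand-in for real workload-identity verification.

```python
class TenantMismatch(Exception):
    """Raised when tenant is missing or header/token tenants disagree."""

TRUSTED_WORKLOADS = {"svc-ingestion"}  # hypothetical identities allowed to use X-Tenant-Id

def resolve_tenant(token_claims: dict, headers: dict, caller_identity: str) -> str:
    claim = token_claims.get("tenantId")
    header = headers.get("X-Tenant-Id")
    if claim:
        if header and header != claim:
            raise TenantMismatch("header/token tenant mismatch")
        return claim  # token claim is the preferred source
    # header-only resolution is reserved for trusted workload identities
    if header and caller_identity in TRUSTED_WORKLOADS:
        return header
    raise TenantMismatch("tenant missing")  # never infer tenant from data
```
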
```mermaid
sequenceDiagram
  autonumber
  participant C as Client
  participant GW as API Gateway
  participant ING as Ingestion (Use-case)
  participant DB as Append Store
  Note over C: JWT contains tenantId=acme<br />edition=enterprise
  C->>GW: POST /audit/append (JWT, X-Idempotency-Key)
  GW->>GW: validate JWT, resolve tenant/edition, rate-limit
  GW->>ING: Append(cmd, TenantContext)
  ING->>ING: check authZ & edition gates
  ING->>DB: write (tenantId, record, policyVersion)
  ING-->>GW: 202 Accepted (traceId, tenantId)
```

Data Isolation Patterns

  • Single DB, tenant-partitioned — default; tenantId in composite keys, enforced repository filters, and query specs.
  • Per-tenant schema — optional for very large tenants; keep API/contracts identical; infra complexity increases.
  • Encryption — at rest; keys can be rotated globally or per tenant; evidence manifests do not expose keys.
  • Backups/Restore — support scoped restore by tenant or time window; integrate with legal hold.
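
The default single-DB, tenant-partitioned pattern can be illustrated with a repository that fails closed on tenant-less queries; the in-memory row list is a stand-in for the real store.

```python
class AuditRepository:
    """Sketch of an enforced repository filter: tenantId is mandatory on every read."""

    def __init__(self):
        self._rows: list[dict] = []

    def append(self, tenant_id: str, record: dict) -> None:
        self._rows.append({"tenantId": tenant_id, **record})

    def query(self, tenant_id: str, **filters) -> list[dict]:
        if not tenant_id:
            raise ValueError("tenantId is required")  # fail closed, never scan globally
        return [
            r for r in self._rows
            if r["tenantId"] == tenant_id
            and all(r.get(k) == v for k, v in filters.items())
        ]

repo = AuditRepository()
repo.append("acme", {"recordId": "01A", "action": "CREATE"})
repo.append("globex", {"recordId": "01B", "action": "DELETE"})
```

A cross-tenant query simply returns zero rows, which matches the "No rows (isolation)" failure signal in the guardrails checklist below.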

Messaging & Events

  • Partitioning by tenantId (or tenantId:sourceId) ensures localized ordering and hot-shard control.
  • Headers carry tenantId, edition, traceparent; payload contains tenantId for de-dup/inbox.
  • No cross-tenant joins in consumers or projectors; selection for export is tenant-scoped.

Edition Gating

  • Gateway rejects routes not enabled for the tenant’s edition.
  • Policy returns decisions (e.g., advanced retention) only if edition allows; Query/Export re-evaluate gates to prevent confused-deputy issues.

Guardrails (checklist)

| Boundary | Guard | Enforced by | Failure signal |
|---|---|---|---|
| Ingress | tenantId required + edition check | Gateway middleware | 400/403 |
| Ingress | Per-tenant RL/quotas | Gateway | 429 (+ retry hints) |
| Use-case | Tenant/edition validation | Application layer | 403 |
| Repo | Tenant filter / RLS | Repository/ORM | No rows (isolation) |
| Events | Tenant partition key | Producer/Outbox | Reject publish if missing |
| Cache | Tenant-scoped keys | Cache adapter | N/A (internal) |
| Logs | Redact PII; add tenant labels | Logging pipeline | N/A |
| Export | Tenant-scoped selection & manifest | Export service | 403 on cross-tenant |

Threats & Mitigations

  • Tenant spoofing → accept tenant only from validated token/identity; normalize once at Gateway.
  • Confused deputy → re-evaluate edition/ABAC on every read/export; never trust upstream UI.
  • Noisy neighbor → per-tenant limits on RPS, storage, concurrent exports, and projector throughput.
  • Data bleed → repository filters + contract tests; synthetic cross-tenant tests in CI.

Testable Controls

  • Contract tests: every API requires tenantId and rejects mismatches.
  • Repo tests: queries without tenantId fail build/lint; cross-tenant fixtures return 0 rows.
  • Messaging tests: publish fails without tenant headers; consumer rejects orphan events.
  • Observability tests: traces/metrics include tenant labels; redaction verified in logs.


Security Architecture (Zero Trust)

ATP adopts a Zero Trust posture end-to-end: never trust, always verify; strong identity for users and workloads, least privilege at every hop, and continuous policy evaluation (tenant + edition + data classification). Controls are layered across ingress, mesh, messaging, storage, observability, and exports.

Security Objectives

  • Confidentiality — prevent unauthorized data access across tenants and tiers.
  • Integrity — tamper-evident writes and verifiable exports, with cryptographic proofs.
  • Availability — resilient controls that degrade safely (e.g., deny on policy fetch failures), with back-pressure and circuit breakers.
  • Accountability — high-fidelity audit trails of admin and data actions, correlatable across services.

Identity & Access (Users and Workloads)

  • Human users — OIDC/OAuth2, MFA, SSO; RBAC/ABAC scoped to tenantId and edition; short-lived tokens; refresh via secure flows.
  • Workloads — workload identity (SPIFFE-like or equivalent); service-to-service mTLS in mesh; audience/scope-bound JWTs when used.
  • Token hygiene — short expirations, clock-skew tolerance, replay protection (nonce/PKCE where applicable), proof-of-possession optional for high-risk flows.
  • Authorization — gateway performs coarse checks; services apply fine-grained ABAC (tenant/edition/policy decision) at each use-case.

Boundary Controls

  • API Gateway / WAF
    • JWT validation (issuer/audience/exp), tenancy resolution, schema & size limits, rate limiting/quotas per tenant, IP allow-lists for admin APIs.
    • Threat mitigation: SQL/JSON injection filters, deserialization guards, content-type validation, CORS/CSRF protections for console UIs.
  • Service Mesh / Network
    • Mutual TLS by default; L7 authorization for service identities; namespace/network policies; egress allow-lists and DNS pinning for critical deps.
  • Messaging
    • Per-service publish/subscribe ACLs; topic-level encryption at rest; headers sanitized (no PII); DLQ isolation with audited replay.
  • Storage
    • Row-level isolation (tenant filters / RLS); encryption at rest; audit tables for admin actions; classified fields drive redaction and masking.

Data Protection

  • In transit — TLS 1.2+ everywhere; HSTS on public edges; mTLS inside the mesh; secure cipher suites; TLS secrets managed in KMS/secret store.
  • At rest — envelope encryption with KMS-managed keys; per-env key hierarchy; optional per-tenant keys for high-assurance tenants; key versioning on rotation.
  • Field/classification aware — classification labels at write time control: persistence, logging, projections, and export redaction.

Keys, Secrets & Rotation

  • Secrets in secret manager (never in images/YAML); use managed identities over static credentials.
  • Key rotation schedules for signing/encryption and TLS certs; dual-key windows for smooth transitions; automated provenance of rotations.
  • HSM-backed keys optional; audit every read of high-value secrets.

Integrity & Cryptography (platform interplay)

  • Integrity service maintains hash chains (per tenant/range) and digital signatures for segments and export manifests.
  • Canonicalization of records before hashing; recorded policyVersion and chain checkpoints to simplify offline verification.
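
The canonicalize-then-hash step can be sketched as follows, assuming sorted-key compact JSON as the canonical form (the actual canonicalization rules belong to the Integrity service), together with a per-tenant hash chain where each link commits to its predecessor.

```python
import hashlib
import json

def canonical_digest(record: dict) -> str:
    """Hash a canonical serialization so semantically equal records hash equally."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def extend_chain(prev_chain_hash: str, record_digest: str) -> str:
    """Append one link: the new chain hash commits to both the previous link and the record."""
    return "sha256:" + hashlib.sha256((prev_chain_hash + record_digest).encode("utf-8")).hexdigest()
```

Because each link depends on the previous one, tampering with any record invalidates every subsequent chain hash, which is what the rolling checkpoints verify offline.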

Supply Chain Security

  • SBOM generation on every build; dependency scan/block on critical CVEs.
  • Image signing and provenance (e.g., attestations) enforced at admission; non-root, read-only FS, dropped Linux capabilities, seccomp/apparmor profiles.
  • Infrastructure as Code scanning; drift detection; locked registries and private base images.

Threat Model (snapshot) & Mitigations

| Threat | Vector | Control |
|---|---|---|
| Tenant impersonation | Forged headers, token substitution | Accept tenant only from validated token; normalize at gateway; bind token audience/scope |
| Data exfiltration | Over-permissive roles, broad exports | ABAC with least privilege; export selection gated; watermarking; per-tenant quotas & approvals |
| Injection / deserialization | Untrusted payloads | Strict content-type; schema validation; JSON size caps; safe parsers |
| Replay / duplication | Message re-delivery | Idempotency keys; inbox receipts; event ULIDs; consumer de-dup |
| Side-channel / noisy neighbor | Resource contention | Per-tenant rate limits/quotas; bulkheads; isolated exporter pools |
| Secret leakage | Misplaced configs/logs | Secret store; redaction pipeline; zero PII in headers; structured logs with classifiers |
| Supply chain compromise | Tainted deps/images | SBOM+scans; signed images; verified provenance; gated deploys |
| Stale keys/certs | Missed rotation | Rotation SLOs; dual-publish keys; monitors/alerts; break-glass runbook |

Security Telemetry & Auditing

  • Traces: carry tenantId, edition, policyVersion, traceId, correlationId across gateway→service→broker→store.
  • Logs: structured, classification-aware redaction; admin actions logged with actor, scope, before/after diff (where safe).
  • Metrics: auth failures, rate-limit hits, policy denies, DLQ volume, export anomalies, KMS latency/errors.
  • Alerts: token validation spikes, cross-tenant access attempts, chain verification failures, unexpected export surges.

Incident Response & Break-Glass

  • Playbooks: per-tenant isolation, revoke tokens, rotate affected keys, pause exports, enable heightened logging, notify stakeholders.
  • Forensics: immutable log retention; chain checkpoints; snapshot policies; export manifest verification.
  • Containment: gateway blocks by tenant/route; mesh denies by service identity; selective projector/export throttling.

Secure Defaults & Hardening Checklist

  • Non-root containers; read-only FS; minimal capabilities; pinned distroless bases.
  • Egress restricted; DNS allow-lists; outbound proxies where required.
  • HTTP security headers (HSTS, CSP, X-Content-Type-Options, Referrer-Policy) on public UIs.
  • Least privilege IAM for cloud resources; scoped service identities; deny-by-default policies.
  • Console/admin endpoints behind SSO + IP allow-lists + step-up auth.


Compliance & Privacy (GDPR/HIPAA/SOC2) — Overview

ATP is designed as a privacy-by-design, compliance-by-default platform. Controls are embedded in the write path (classification/minimization/retention), enforced across read/export, and evidenced via immutable audit trails and integrity proofs.

Roles & Responsibilities

  • Typical role: ATP operates as a Processor for tenant Controllers; some admin/telemetry data may make ATP a limited Controller (documented in DPA).
  • Sub-processors: declared per environment; contracts require equivalent safeguards.
  • Agreements: DPA (GDPR), BAA (HIPAA) for PHI workloads, SOC 2 reporting for trust criteria.

Regulatory Anchors (focus)

  • GDPR: lawful basis, transparency, DSR (access/erasure/export), 72-hour breach notice to the supervisory authority where required, minimization, storage limitation, data transfers/residency.
  • HIPAA: PHI protection, minimum necessary, access controls, audit controls, integrity, transmission security, breach notification ≤ 60 days to affected individuals; BAA in place.
  • SOC 2: Trust Service Criteria — Security, Availability, Processing Integrity, Confidentiality, Privacy — evidenced through technical and procedural controls.

Controls Matrix (snapshot)

| Requirement | Control (design) | Where enforced | Evidence / Artifacts |
|---|---|---|---|
| Data minimization | Canonical schema + policy evaluation at write | Ingestion + Policy | Schema registry, policy version in events, unit/contract tests |
| Classification & redaction | Field tags drive redaction/masking | Ingestion, Query, Logs/Exports | Redaction library, log scrubbing tests, sample redacted exports |
| Retention & deletion | Retention stamped on write; lifecycle jobs | Ingestion, Lifecycle jobs | Retention policy catalog, job logs, deletion attestations |
| Residency/sovereignty | Region-aware routing, per-tenant storage map | Gateway, Storage | Tenant residency map, deploy topology diagrams |
| Access control | Tenant-scoped RBAC/ABAC, edition gates | Gateway, Services | AuthZ policy files, access logs, ABAC tests |
| Auditability | Immutable admin/data action logs | All services | Append logs, admin trails, correlation/traces |
| Export & portability | Package + manifest + signatures | Export, Integrity | Export manifests, signatures, hash chains |
| Incident response | Runbooks, alerts, forensics snapshots | Ops/Runbook | IR playbooks, alert policies, drill reports |

Data Lifecycle (end-to-end)

  1. Collect — strictly necessary attributes; reject unknown fields by default.
  2. Classify — tag sensitivity at write; bind policy version.
  3. Store — encrypted at rest; tenant-scoped keys (optional per tenant).
  4. Project — read models exclude disallowed fields; derived data tracked.
  5. Retain — timers enforce storage limitation; legal hold overrides tracked.
  6. Delete/Anonymize — policy-driven purge/anonymization with proofs.
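
Steps 2 and 5 can be sketched together: classification tags and a retention policy are stamped onto the record at write time. The field map, sensitivity ordering, and policy identifiers below are invented for this sketch.

```python
POLICY_VERSION = 7  # hypothetical current policy set version
SENSITIVITY = ["PUBLIC", "INTERNAL", "PERSONAL", "SENSITIVE"]

FIELD_CLASSIFICATION = {  # illustrative excerpt of a classification catalog
    "actorId": "PERSONAL",
    "attributes.email": "PERSONAL",
    "resource": "INTERNAL",
}

def _present(record: dict, path: str) -> bool:
    """Walk a dotted field path (e.g., attributes.email) into nested dicts."""
    node = record
    for part in path.split("."):
        if not isinstance(node, dict) or part not in node:
            return False
        node = node[part]
    return True

def stamp(record: dict, retention_policy_id: str = "std-365d") -> dict:
    """Return a copy of the record with classification, retention, and policy stamps."""
    highest = "PUBLIC"
    for field_path, cls in FIELD_CLASSIFICATION.items():
        if _present(record, field_path) and SENSITIVITY.index(cls) > SENSITIVITY.index(highest):
            highest = cls
    return {
        **record,
        "classification": highest,
        "retentionPolicyId": retention_policy_id,
        "policyVersion": POLICY_VERSION,  # immutable once stamped
    }
```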

Data Subject Requests (DSR) — Workflow

```mermaid
sequenceDiagram
  autonumber
  participant U as User/Tenant Admin
  participant GW as API Gateway
  participant Q as Query Service
  participant EXP as Export Service
  participant L as Lifecycle/Retention
  U->>GW: Submit DSR (access/export/erasure)
  GW->>Q: AuthZ + tenant scope, locate records
  Q-->>GW: Result set / pointers
  alt Access/Export
    GW->>EXP: Create export package
    EXP-->>U: Download + manifest (portable)
  else Erasure
    GW->>L: Schedule policy-compliant deletion (legal holds respected)
    L-->>GW: Deletion attestation
    GW-->>U: Completion notice + evidence
  end
```

SLA guidance

  • GDPR DSR response: typically ≤ 30 days (track in runbooks).
  • Breach notifications: GDPR 72h (to the supervisory authority where required); HIPAA ≤ 60 days to affected individuals.

Privacy by Design (architectural hooks)

  • Policies as code: versioned policy sets; decisions stamped on write and re-evaluated on read/export.
  • No PII in headers: classification prevents leakage to logs/metrics; sensitive fields redacted.
  • Least privilege: tenant RBAC/ABAC at every use-case; exporter isolation/bulkheads.
  • DPIA triggers: new data categories, cross-region transfers, novel large-scale processing — require review/ADR.

Residency & Transfers

  • Region binding: tenant → region mapping; data stays in region unless contractually permitted.
  • Cross-border: blocked by default; explicit policy + contractual basis required.
  • Backups/restore: region-scoped; tenant-targeted restore supported.

Monitoring & Evidence

  • Signals: policy deny rates, retention job failures, export volume anomalies, cross-region attempts.
  • Evidence pack: policy catalog, schema registry snapshots, export manifests, chain checkpoints, IR drill reports.
  • Periodic attestations: automated reports feed SOC 2 control testing.

Testable Controls

  • CI checks for classification tags on new fields; rejection if missing.
  • Synthetic DSR tests (access/export/erasure) per environment.
  • Retention dry-run reports; deletion requires attestation artifacts.
  • Policy evolution contract tests (additive vs breaking).


Data Architecture Overview

ATP’s data layer is optimized for append-heavy writes, policy-aware reads, and provable integrity. We separate hot append storage, warm read models/indexes, and cold archival to balance performance, cost, and compliance.

Data Primitives (canonical types)

  • AuditRecord — canonicalized event with tenant/actor/resource, timestamps, attributes (flattened JSON), and policy stamps.
  • PolicyDecision — classification, retention, redaction directives (versioned).
  • EvidenceMaterial — content digests, chain IDs, signatures, checkpoints.
  • ProjectionSegment — denormalized slices for query/search; watermark & lag metadata.
  • ExportPackage — immutable bundle manifest with hashes, signatures, and lineage.

Write Path (append-only)

```mermaid
sequenceDiagram
  autonumber
  participant GW as Gateway
  participant ING as Ingestion
  participant APP as Append Store (hot)
  participant INT as Integrity
  participant BUS as Event Bus
  GW->>ING: Append(cmd + TenantContext)
  ING->>ING: Canonicalize + Validate + Policy.Evaluate
  ING->>APP: Append(AuditRecord + PolicyDecision)
  ING->>BUS: audit.appended / audit.accepted
  ING->>INT: Provide digest material (async)
```

Consistency: write path is strong for a single record; projections are eventually consistent (seconds-level lag budgets).

Keys, Partitions, and Time

| Concept | Strategy |
|---|---|
| Primary key | ULID recordId for monotonic ordering within time partitions |
| Idempotency key | (tenantId, sourceId, sequence \| hash) |
| Partitioning | by time bucket (e.g., day) and tenantId; hot shards can further include sourceId |
| Timestamps | occurredAt (source), receivedAt (gateway), committedAt (store); all UTC, ISO-8601 |

Canonical Schema (fields snapshot)

| Field | Type | Notes |
|---|---|---|
| recordId | ULID | unique; sortable |
| tenantId | string | required on every record/index |
| sourceId | string | producer/system origin |
| actorId | string | user/service (classified) |
| action | string | verb (CREATE/UPDATE/DELETE/…) |
| resource | string | dotted path (e.g., Order/4711) |
| attributes | object | flat/normalized JSON; unknowns rejected unless whitelisted |
| occurredAt / receivedAt / committedAt | datetime | UTC |
| policyVersion | int | immutable once stamped |
| classification | enum | PUBLIC/INTERNAL/… (used for redaction) |
| retentionPolicyId | string | determines lifecycle |
| digest | string | content hash (sha256:…) |
| chainId / chainIndex | string / int | integrity chain placement |

Full schema lives in Message Schemas (../domain/contracts/message-schemas.md). Schema changes follow additive-first rules and contract tests.

Read Models & Indexing (warm tier)

  • Timeline model (by tenant/resource/actor, time-range).
  • Facet/aggregation model (counts by action/resource).
  • Lookup model (by recordId, sourceId, correlation).
  • Search index (optional) for full-text and fast range scans.

Rebuild strategy: snapshot + replay from append logs; watermarks track projector progress; lag SLO drives autoscaling.
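
The rebuild strategy depends on projectors being idempotent under replay. A minimal sketch, assuming ULID recordIds sort lexicographically and events arrive in partition order:

```python
class TimelineProjector:
    """Re-applying events at or below the watermark is a no-op, so replays are re-run safe."""

    def __init__(self):
        self.watermark = ""           # highest recordId applied (ULIDs sort lexically)
        self.timeline: list[str] = []

    def apply(self, record_id: str) -> bool:
        if record_id <= self.watermark:
            return False              # duplicate or replayed event: skip
        self.timeline.append(record_id)
        self.watermark = record_id    # advance only after the read model is updated
        return True
```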

Classification, Redaction & Logs

  • Fields carry classification tags at write; tags drive:
    • storage shape (e.g., tokenization),
    • query redaction (field masking or omission),
    • log scrubbing (no sensitive data in logs/metrics),
    • export filtering (respect tenant’s data handling rules).

Example policy map (excerpt)

| Field | Classification | Redaction (query/export) |
|---|---|---|
| actorId | PERSONAL | mask last 4 |
| attributes.email | PERSONAL | hash (sha256) |
| attributes.cardLast4 | SENSITIVE | allow if role:Auditor & scope:PII |
| resource | INTERNAL | none |

See PII Redaction & Classification (../platform/pii-redaction-classification.md).
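
A hypothetical redaction pass over a flattened record, mirroring the excerpt above; "mask last 4" is interpreted here as revealing only the last four characters, and the rule table is invented for this sketch.

```python
import hashlib

REDACTION_RULES = {  # field -> (rule kind, required scope or None)
    "actorId": ("mask_last4", None),
    "attributes.email": ("hash", None),
    "attributes.cardLast4": ("require_scope", "PII"),
}

def redact(flat_record: dict, scopes: set[str]) -> dict:
    """Apply classification-driven redaction; fields without a rule pass through."""
    out = {}
    for field, value in flat_record.items():
        rule = REDACTION_RULES.get(field)
        if rule is None:
            out[field] = value
            continue
        kind, needed_scope = rule
        if kind == "mask_last4":
            out[field] = "*" * max(len(value) - 4, 0) + value[-4:]
        elif kind == "hash":
            out[field] = "sha256:" + hashlib.sha256(value.encode("utf-8")).hexdigest()
        elif kind == "require_scope" and needed_scope in scopes:
            out[field] = value   # e.g., role:Auditor with scope:PII
        # require_scope without the scope: omit the field entirely
    return out
```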

Retention & Lifecycle

  • Stamped on write: retentionPolicyId + policyVersion.
  • Lifecycle jobs: tiering hot→warm→cold, legal hold awareness, deletion/anonymization windows.
  • Attestations: deletion manifests & job logs stored immutably.

See Data Residency & Retention (../platform/data-residency-retention.md).

Integrity Materials

  • Chain-of-hash per tenant and time-range; rolling checkpoints.
  • Signatures minted by Integrity svc; referenced in projections and exports.
  • Verification APIs accept a record/segment/package and return proofs.

Storage Tiers

| Tier | Workload | Technology / Shape | Notes |
|---|---|---|---|
| Hot | Append path, near-term verify | OLTP append store | SSD, high IOPS, short retention |
| Warm | Queries & aggregations | Read models / columnar | Denormalized, rebuildable |
| Cold | Archival/eDiscovery | Object storage (+ immutability) | Legal hold, cheap, manifest-signed |

Schema Evolution

  • Additive: new optional fields or enums → minor version bump; dual readers.
  • Breaking: new subject or major version; side-by-side projections.
  • Registry & tests: schemas linted; CI contract tests for producers/consumers/projectors.

ER Snapshot (logical)

```mermaid
erDiagram
  TENANT ||--o{ AUDIT_RECORD : owns
  AUDIT_RECORD ||--o| EVIDENCE_MATERIAL : "has"
  AUDIT_RECORD ||--o{ PROJECTION_SEGMENT : "appears_in"
  EXPORT_PACKAGE ||--o{ AUDIT_RECORD : "contains"
  TENANT {
    string tenantId PK
    string edition
  }
  AUDIT_RECORD {
    string recordId PK
    string tenantId FK
    string sourceId
    string actorId
    string action
    string resource
    json   attributes
    datetime occurredAt
    datetime committedAt
    string classification
    string retentionPolicyId
    string digest
    string chainId
    int    chainIndex
    int    policyVersion
  }
  EVIDENCE_MATERIAL {
    string recordId FK
    string segmentId
    string signature
    string algo
  }
  PROJECTION_SEGMENT {
    string segmentId PK
    string tenantId FK
    string key
    json   payload
    datetime watermark
  }
  EXPORT_PACKAGE {
    string packageId PK
    string tenantId FK
    string manifestHash
    datetime createdAt
  }
```

Sizing & Capacity Hints (initial)

  • Record size: median 1–5 KB (flattened JSON); avoid unbounded attributes.
  • Throughput: design for burst QPS B per tenant; apply ingest RL + back-pressure.
  • Indexes: time-range first, then tenant/resource/actor; avoid cross-tenant joins.
  • Cold costs: batch exports; prefer delta-based packages for repeats.

Testable Controls

  • Lints reject schemas without classification tags for new fields.
  • CI ensures queries require tenant filters; cross-tenant fixtures return zero rows.
  • Projections verified idempotent (re-run safe) with watermark assertions.
  • Retention dry-run reports produced; deletions emit attestations.


Storage Strategy (Summary)

Our storage approach optimizes for append-heavy writes, policy-aware queries, provable integrity, and cost control. We separate concerns across Hot (append), Warm (read models/indexes), and Cold (archival/eDiscovery) tiers, governed by policy-stamped retention and lifecycle jobs.

Objectives

  • Performance where it matters: low-latency appends and queries; projections provide fit-for-purpose shapes.
  • Compliance by default: classification-aware storage, retention stamped on write, legal hold and residency honored.
  • Provable integrity: digest chains and signatures persist with data lineage and manifests.
  • Cost discipline: data tiering, compression, batching exports, and quotas per tenant.

Tiering at a Glance

| Tier | Workload | Typical Retention | Consistency | Durability & Encryption | Notes |
|---|---|---|---|---|---|
| Hot (Append Store) | ingest path, near-term verification | hours–days (policy) | strong per-write | multi-AZ/zone; at-rest encryption (KMS) | time/tenant partitions, high IOPS, small indexes |
| Warm (Read Models / Indexes) | query/search/aggregations | days–months (policy) | eventually consistent (lag SLO) | multi-AZ/zone; at-rest encryption (KMS) | rebuildable from events; denormalized projections |
| Cold (Archive / eDiscovery) | long-term retention, legal hold | months–years (policy) | N/A (immutable) | object store immutability + KMS; legal hold | manifests + signatures; cost-efficient, slower access |

Detailed shapes live in Data Model (data-model.md) and infra specifics in Deployment Views (deployment-views.md).

Lifecycle & Policy Enforcement

```mermaid
flowchart LR
  subgraph HOT[Hot / Append]
    A[Append-only segments]
  end
  subgraph WARM[Warm / Read Models]
    R[Projections & indexes]
  end
  subgraph COLD[Cold / Archive]
    C[Immutable objects + manifests]
  end
  A -->|policy window reached| WARM
  WARM -->|tiering job| COLD
  A -->|legal hold? keep| A
  WARM -->|legal hold? pin| WARM
  C -->|export/eDiscovery| C
```
  • On write: retentionPolicyId + policyVersion stamped; classification guides storage and logs.
  • Lifecycle jobs: move eligible segments from Hot→Warm→Cold; respect legal hold and residency maps.
  • Deletion/anonymization: performed per policy window; produce attestations and job logs.

Partitioning, Compaction & Index Hygiene

  • Partitions: time (e.g., day) × tenant; optional sourceId for hot-shard control.
  • Compaction: roll small segments into bounded files (size/time thresholds) to control file counts and seek cost.
  • Index hygiene: projector lag SLOs guide autoscaling; background vacuum/merge jobs keep read paths predictable.

Integrity Material Persistence

  • Digest chains (per tenant/range) stored alongside append metadata; checkpoints for fast verification.
  • Signatures & manifests for exported bundles persisted in Cold; Verify APIs reference chain/manifest IDs.

Backup, Restore & eDiscovery

  • Backups: scheduled snapshots of Warm (and necessary Hot metadata) with region-scoped policies.
  • Restore: tenant- or time-scoped restores; rebuild read models from append logs when possible.
  • eDiscovery: selections from Query → Export packages → signed manifests in Cold; immutable retention with legal hold support.

Residency & Encryption

  • Residency map: tenant → region binding; lifecycle never crosses region without contractual/policy basis.
  • Encryption: at rest via KMS; optional per-tenant keys for high-assurance tenants; rotation windows supported.
  • Secrets: no keys in payloads; policy and classification prevent sensitive leakage to logs/headers.

Capacity & Cost Levers

  • Hot: cap record size, enforce schema limits, rate-limit bursty tenants.
  • Warm: projection granularity tuned to query needs; compress wide models; expire stale indexes.
  • Cold: batch exports, dedupe repeated selections, prefer incremental/delta packages.
  • Global: per-tenant quotas, export concurrency caps, storage alerts on growth velocity.

SLO Hints (storage-facing)

  • Ingest commit p95 ≤ X ms (Hot).
  • Projector lagN s median; p95 ≤ M s (Warm).
  • Export TTFB p95 ≤ Z s for packages up to K MB (Cold).

Failure Modes & Guardrails

  • Hot saturation → back-pressure at Gateway; temporary queueing; projector throttle lifts once lag ≤ budget.
  • Projection failure → DLQ + replay tooling; queries fall back to last consistent watermark.
  • Cold unavailability → exports paused; already created packages remain downloadable via signed URLs.
  • Residency mismatch → hard fail with audit; no cross-region copies without policy/contract.

Testable Controls

  • Lifecycle dry-run reports (what would tier/delete) per tenant.
  • CI checks ensure new tables/indexes include tenantId and time partitioning.
  • Synthetic restores: periodic tenant/time-window drills.
  • Export verification: random sampling of packages against manifests/signatures.


Sequence Flows (append/query/export)

This section captures the three canonical end-to-end flows in ATP. These flows are reference-grade and map directly to our containers, bounded contexts, and policies. Detailed, step-by-step variants (timeouts, retries, failure drills) live in Sequence Flows (sequence-flows.md).


Append (happy path)

```mermaid
sequenceDiagram
  autonumber
  participant P as Producer
  participant GW as API Gateway
  participant ING as Ingestion (Use-cases)
  participant POL as Policy (Decision API)
  participant APP as Append Store (Hot)
  participant OB as Outbox (Tx)
  participant BUS as Event Bus
  participant INT as Integrity
  participant PROJ as Projection
  participant OTL as OTel/Obs

  P->>GW: POST /api/v{n}/audit/append<br />JWT (tenant), X-Idempotency-Key, body (canonical)
  GW->>GW: AuthN (OIDC) + tenant/edition + schema + rate-limit
  GW->>ING: AppendCommand(TenantContext, IdempotencyKey, Payload)
  ING->>POL: Evaluate(classification, retention) [short TTL cache]
  POL-->>ING: PolicyDecision(version, labels, retentionPolicyId)
  ING->>ING: Canonicalize + Validate + Apply Policy
  ING->>APP: Append(AuditRecord + PolicyDecision) [atomic]
  ING->>OB: Enqueue(audit.appended, audit.accepted) [same tx]
  OB->>BUS: publish (relay)
  BUS-->>INT: audit.appended | accepted
  INT->>INT: Digest/Chain/Sign (async)
  BUS-->>PROJ: audit.accepted
  PROJ->>PROJ: Update read models / set watermark
  ING->>OTL: spans/logs/metrics (tenantId, edition, policyVersion, idemKey)
  ING-->>GW: 202 Accepted { recordId, traceId }
```
  • Headers (ingress): Authorization: Bearer <JWT>, X-Idempotency-Key, Content-Type: application/json
  • Guarantees: at-least-once delivery; exactly-once intent via (tenantId, sourceId, sequence|hash); order within partition (tenantId[:sourceId])
  • SLO cues: p95 append ≤ X ms; policy eval p95 ≤ Y ms; projector lag p95 ≤ N s

Query (authorized read)

```mermaid
sequenceDiagram
  autonumber
  participant C as Client (Ops/Auditor)
  participant GW as API Gateway
  participant Q as Query Service
  participant POL as Policy (Decision API)
  participant RM as Read Models (Warm)
  participant RED as Redaction Plan
  participant OTL as OTel/Obs

  C->>GW: GET /api/v{n}/query?tenant=...&filters=...<br />JWT (tenant)
  GW->>GW: AuthN + AuthZ (RBAC/ABAC), edition gates, rate-limit
  GW->>Q: QueryRequest(TenantContext, Filters, Page)
  Q->>RM: Fetch(ReadModel slice, watermark)
  Q->>POL: Evaluate(read constraints) [cached]
  POL-->>Q: Decision(redaction/deny/allow)
  Q->>RED: Apply redaction per classification/policy
  Q-->>GW: 200 OK { results, page, watermark, redactionHints }
  Q->>OTL: record spans/metrics (p95/p99, filtered-out)
```
  • Filters: time-range, actor/resource, action, attributes (whitelisted)
  • Redaction: field-level masking/hashing per classification; no PII in logs/headers
  • SLO cues: p95 latency ≤ Y ms at Q RPS; cache hit ratio ≥ H%; watermark drift ≤ D s

Export (selection → package → verify)

```mermaid
sequenceDiagram
  autonumber
  participant C as Client (Auditor/Legal)
  participant GW as API Gateway
  participant Q as Query Service
  participant EXP as Export Service
  participant INT as Integrity
  participant COLD as Cold Store (Immutable)
  participant WH as Webhook (optional)

  C->>GW: POST /api/v{n}/export { selectionSpec | queryId , format }
  GW->>Q: Validate selection (tenant/edition/ABAC)
  Q-->>GW: Selection OK (token/manifest draft)
  GW->>EXP: CreateExport(TenantContext, selectionToken, format)
  EXP->>Q: Stream records (paged, resumable)
  EXP->>INT: Request signatures/chain refs (batch)
  INT-->>EXP: Evidence (chain checkpoints, signatures)
  EXP->>COLD: Write package parts + manifest (signed)
  EXP-->>GW: 202 Accepted { exportId, pollUrl, ttfbHint }
  loop Client poll or webhook
    C->>GW: GET /api/v{n}/export/{exportId}
    GW-->>C: 303 See Other → signed download URL
    EXP-->>WH: POST /on-export-completed (optional)
  end
```

  • Semantics: resumable streaming; throttled to protect query SLIs; immutable artifacts with signed manifests
  • SLO cues: TTFB p95 ≤ Z s for ≤ K MB outputs; completion p95 ≤ M min for N records
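
Downloaded packages can be verified offline by recomputing each part's digest against the signed manifest; a minimal sketch (signature validation over the manifest itself is omitted here, and the part-naming scheme is invented):

```python
import hashlib

def verify_package(parts: dict[str, bytes], manifest: dict[str, str]) -> bool:
    """True iff the part set matches the manifest exactly and every digest agrees."""
    if set(parts) != set(manifest):
        return False  # missing or unexpected parts are a verification failure
    return all(
        "sha256:" + hashlib.sha256(data).hexdigest() == manifest[name]
        for name, data in parts.items()
    )

parts = {"part-000.jsonl": b"records...", "part-001.jsonl": b"more records..."}
manifest = {name: "sha256:" + hashlib.sha256(data).hexdigest() for name, data in parts.items()}
```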


Failure & Back-pressure (extract)

| Flow | Condition | Behavior | Client Signal |
|---|---|---|---|
| Append | schema invalid / policy deny | 400 / 403 | Problem+JSON with code & trace |
| Append | hot partition saturation | 429 (Gateway), retry-after | Retry-After, X-Rate-Limit-* |
| Append | transient store/broker | retry with jitter → DLQ after N | 202 Accepted (eventual), trace |
| Query | projector lag beyond budget | serve with last watermark + warn | X-Watermark, X-Lag |
| Query | authorization fails | 403 | Problem+JSON |
| Export | package > concurrency quota | 429 + backoff | Retry schedule |
| Export | integrity service slow | continue buffering; partial manifest; retry | Polling continues; final manifest on completion |

Observability: spans across GW→ING→BUS→INT/PROJ→Q/EXP with tenantId, edition, traceId, correlationId; metrics for outbox age, consumer lag, export queue depth, policy deny rate.



Deployment Views (baseline)

This section describes the cloud-native baseline for ATP across environments, regions, and failure domains. It maps our containers to runtime substrates (AKS/ACA), messaging, data stores, and the observability/security planes. Deeper infra specifics live in (deployment-views.md).

Environments & Promotion Model

| Env | Purpose | Data | Change Rate | Protections |
|---|---|---|---|---|
| dev | rapid iteration, PR validation | synthetic | highest | permissive RBAC, ephemeral namespaces |
| test | integration, contract & replay tests | masked | high | seeded tenants, DLQ/replay drills |
| staging | prod-like validation, chaos drills | masked or opt-in | medium | WAF rules, HPA parity, approvals |
| prod | customer traffic | real | controlled | SLO-backed autoscaling, break-glass, approvals |

Promotion: build once, deploy many (signed image → dev → test → staging → prod) with policy gates and environment-specific overlays.

Regional Topology & Residency

  • Tenants are bound to a home region; data never crosses regions unless contractually allowed.
  • All planes are multi-AZ/zone within a region.
  • Optional multi-region active/standby for DR (RPO/RTO declared per edition).
flowchart LR
  subgraph Region[Cloud Region - e.g., East US 2]
    subgraph Net[Virtual Network]
      subgraph Edge[Ingress/WAF Subnet]
        GW[API Gateway / Ingress]
      end
      subgraph App[App Plane - AKS/ACA]
        mesh[Service Mesh - mTLS/L7]
        ING[Ingestion]
        POL[Policy]
        INT[Integrity]
        PROJ[Projection]
        QRY[Query]
        SRCH[Search - optional]
        EXP[Export]
        ADM[Admin/Control]
      end
      subgraph Msg[Messaging]
        BUS[(Topics/Queues + DLQ)]
      end
      subgraph Data[Data Plane]
        HOT[(Append Store)]
        WARM[(Read Models/Indexes)]
        COLD[(Object Store / Archive)]
      end
      subgraph Obs[Observability]
        OTL[(OTel Collector)]
        LOG[(Logs)]
        MET[(Metrics)]
        TRC[(Traces)]
      end
      subgraph Sec[Security]
        KMS[(KMS/Keys)]
        SEC[(Secret Manager)]
      end
    end
  end

  GW --> mesh
  mesh --> ING & POL & INT & PROJ & QRY & SRCH & EXP & ADM
  ING --> HOT
  PROJ --> WARM
  EXP --> COLD
  ING --- BUS
  PROJ --- BUS
  EXP --- BUS
  GW --- OTL
  ING --- OTL
  PROJ --- OTL
  QRY --- OTL
  EXP --- OTL
  INT --- KMS
  HOT --- SEC
  WARM --- SEC
  COLD --- SEC

Kubernetes / Container Apps Mapping

| Namespace | Workloads | HPA Signals | Notes |
|---|---|---|---|
| gateway | ingress/gateway, auth filters | RPS, p95 route latency, 429 ratio | WAF rules, IP allow-lists for admin |
| ingestion | append API, webhook receiver, outbox relay | CPU, QPS, pending outbox, 5xx | strict idempotency; schema guard |
| policy | decision API, cache | p95 decision latency, hit ratio | warm cache w/ TTL + circuit breaker |
| integrity | chain/sign/verify workers | queue depth, worker CPU | HSM/KMS integration |
| projection | projectors, rebuild jobs | lag (sec), consumer lag, DLQ | watermarks, replay safety |
| query | query API, (opt) GraphQL | p95/p99 latency, cache hit | redaction at boundary |
| search (opt) | search API, indexers | queue depth, index refresh | can be disabled per edition |
| export | export API, packagers | concurrent exports, TTFB | resumable, throttled |
| admin | policy mgmt, DLQ/replay, feature flags | N/A | break-glass guarded |
| observability | OTel collector, dashboards | N/A | multi-tenant labeling & scrubbing |

Network, Mesh & Access

  • Ingress: Public → WAF/Ingress → Gateway. Admin surfaces optionally behind IP allow-lists + SSO.
  • Mesh: mTLS everywhere; L7 authorization by service identity; timeouts/retries/circuit breakers standardized.
  • Egress: deny-by-default, allow-lists for KMS, messaging, object store, and IdP.
  • DNS/Service discovery: mesh-native, with identity-bound policies.

Config, Secrets & Keys

  • Config: per-env overlays; feature flags for edition gates; config maps for non-sensitive settings.
  • Secrets: stored in secret manager; mounted/injected at runtime; rotation SLOs enforced.
  • Keys: KMS-backed envelope encryption; key IDs & versions recorded in manifests; dual-key windows during rotation.

Scaling & SLO Budgets

  • Gateway: scale on RPS and auth latency; enforce per-tenant rate limits.
  • Ingestion: scale on incoming QPS and outbox age; shed load via 429 when Hot saturation detected.
  • Projection: scale to respect lag SLO; auto-tune consumer concurrency.
  • Query/Search: scale on p95/p99 latency; cache enabled; bulkhead against Export.
  • Export: separate worker pools; cap concurrent packages per tenant to protect read SLOs.

Failure Domains, HA & DR

  • Intra-region HA: multi-zone deployments; stateless pods across zones; storage with zone-redundancy where supported.
  • DLQ & Replay: standard triage tools; replay by window/tenant; dry-run mode.
  • Backups/Restore: scheduled backups of warm stores + metadata; tenant/time-window restore drills.
  • DR (optional): async replication of cold artifacts; RPO/RTO declared per edition; failover runbooks.

CI/CD & Supply Chain

  • Pipelines: build → test → sign image (SBOM, vuln scan) → push to registry → deploy via GitOps/Argo or pipelines.
  • Policies: admission requires signed images, baseline pod security, resource limits/requests.
  • Observability: dashboards per service; alerts for SLO breaches, DLQ growth, export anomalies, projector lag.

Cost Controls

  • Per-tenant quotas (RPS/storage/exports), export batch windows, storage tiering rules, and autoscaling floors/ceilings tuned for cost envelopes.

Testable Controls

  • Policy tests: deny unsanctioned egress; block unsigned images at admission.
  • Residency tests: enforce region binding per tenant.
  • Chaos drills: periodic pod/node loss, message broker hiccups, object-store slowdown.
  • SLO checks: synthetic probes for append/query/export; alert on budget burn.


API Gateway & Connectivity

The API Gateway is the single ingress for tenant traffic and the control point for identity, tenancy, versioning, rate limits, schema validation, and egress discipline. It fronts REST/gRPC APIs and optional webhooks, and propagates Zero Trust signals (identity, tenant, edition, trace) into the mesh.

Objectives

  • Protect: strong AuthN/Z, input validation, per-tenant quotas/limits, DDoS/WAF.
  • Standardize: versioning, headers, error shapes, retry semantics.
  • Propagate: tenantId, edition, traceparent, correlationId to downstream services.
  • Observe: request metrics, saturation (429s), error taxonomy, and schema failure rates.

Ingress Architecture (L7)

flowchart LR
  Client[Producers / Clients / UIs]
  WAF[WAF / Ingress Controller]
  GW[API Gateway<br />AuthN/Z • Tenancy • Rate Limits • Schema]
  Mesh[Service Mesh - mTLS/L7 AuthZ]
  ING[Ingestion]
  QRY[Query]
  EXP[Export]
  POL[Policy]
  ADM[Admin/Control]
  WH[(Webhooks - optional)]
  Client --> WAF --> GW --> Mesh
  Mesh --> ING & QRY & EXP & POL & ADM
  GW --> WH
  • TLS termination at edge; mTLS inside the mesh.
  • Admin surfaces (policy/flags/replay) can be IP allow-listed + SSO.

Versioning & Deprecation

  • URI or header versioning: GET /api/v{n}/... or X-Api-Version: n.
  • Compatibility windows announced in changelog; deprecation headers: Deprecation: true, Sunset: <rfc1123>, Link: <url>; rel="deprecation".

Tenancy Propagation (canonical)

  • Gateway requires tenant at ingress (JWT claim preferred; X-Tenant-Id only for trusted workloads).
  • Inject standardized headers to downstream:
    • x-tenant-id, x-edition, traceparent, tracestate, x-correlation-id, x-policy-version.
  • Reject requests missing or conflicting tenant signals (400/403).
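
The propagation contract above can be sketched as a small helper; the `TenantContext` shape and error strings are illustrative assumptions, not the Gateway's actual types:

```typescript
// Sketch: build the standardized downstream headers from an authenticated
// request context. Header names follow the canonical list above; the
// TenantContext shape is an assumption for illustration.
export interface TenantContext {
  tenantId: string;
  edition: string;
  traceparent: string;
  correlationId: string;
  policyVersion: string;
}

export function downstreamHeaders(ctx: TenantContext): Record<string, string> {
  return {
    "x-tenant-id": ctx.tenantId,
    "x-edition": ctx.edition,
    "traceparent": ctx.traceparent,
    "x-correlation-id": ctx.correlationId,
    "x-policy-version": ctx.policyVersion,
  };
}

// Reject requests missing or conflicting tenant signals (400/403 at the edge).
export function assertTenantConsistent(
  jwtTenant: string | undefined,
  headerTenant: string | undefined,
): string {
  if (!jwtTenant && !headerTenant) throw new Error("400: missing tenant signal");
  if (jwtTenant && headerTenant && jwtTenant !== headerTenant) {
    throw new Error("403: conflicting tenant signals");
  }
  return (jwtTenant ?? headerTenant)!;
}
```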

Authentication & Authorization

  • Users: OIDC/OAuth2 (short-lived tokens, MFA), scopes/roles mapped to tenant.
  • Workloads: workload identity; audience/scope-bound JWTs or mesh L7 policies.
  • Coarse checks at Gateway; fine-grained ABAC in services (use-case level).

Rate Limits, Quotas, & Back-Pressure

  • Per-tenant burst/sustained rate limits and concurrent export caps.
  • Global safeties on hot routes (append, export create).
  • Signal back-pressure with 429 + Retry-After; include limit headers: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset.
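
On the client side, the 429 contract can be honored with a small helper; header names match the list above, while the epoch-seconds reset convention and the 1 s fallback are assumptions:

```typescript
// Sketch: derive a client back-off delay from Retry-After / X-RateLimit-Reset.
// Supports Retry-After as delta-seconds or an HTTP-date; X-RateLimit-Reset is
// assumed to be epoch seconds (a common, but not universal, convention).
export function backoffDelayMs(
  headers: Record<string, string | undefined>,
  nowEpochSec: number,
): number {
  const retryAfter = headers["retry-after"];
  if (retryAfter !== undefined) {
    const secs = Number(retryAfter);
    if (!Number.isNaN(secs)) return Math.max(0, secs * 1000); // delta-seconds form
    const at = Date.parse(retryAfter); // HTTP-date form
    if (!Number.isNaN(at)) return Math.max(0, at - nowEpochSec * 1000);
  }
  const reset = Number(headers["x-ratelimit-reset"]);
  if (!Number.isNaN(reset) && reset > nowEpochSec) return (reset - nowEpochSec) * 1000;
  return 1000; // conservative default when no hint is present
}
```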

Schema & Payload Safeguards

  • Content-type & size limits; JSON schema validation on critical routes.
  • Reject unknown fields (strict mode) unless whitelisted in schema registry.
  • Enforce PII discipline: no PII in headers; payloads classified for redaction downstream.

Connectivity Matrix (inbound/outbound)

| Surface | Protocol | Auth | Tenancy | Notes |
|---|---|---|---|---|
| Append | REST (POST) / gRPC | Bearer (OIDC) or workload JWT | Required | X-Idempotency-Key mandatory |
| Query | REST (GET) / gRPC | Bearer | Required | watermark headers on responses |
| Export | REST (POST/GET stream) | Bearer | Required | resumable download; signed URLs |
| Webhooks (ingest) | HTTPS (POST) | HMAC/signature | In payload | Signature verification, replays detected |
| Admin | REST (POST/GET) | SSO + IP allow-list | N/A | break-glass logged & approved |

Standard Headers (selected)

  • Ingress: Authorization, X-Idempotency-Key, Content-Type
  • Propagated: x-tenant-id, x-edition, traceparent, tracestate, x-correlation-id, x-policy-version
  • Responses: X-Watermark, X-Lag, X-RateLimit-*, X-Request-Id

Error Model (Problem+JSON)

{
  "type": "https://errors.atp.example/validation",
  "title": "Invalid request payload",
  "status": 400,
  "detail": "Field 'attributes' failed schema validation",
  "instance": "urn:trace:01J9...-req-7f3a",
  "tenantId": "t-acme",
  "code": "SCHEMA_VALIDATION_FAILED"
}
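
Clients can narrow unknown error bodies against this shape; a minimal type guard (illustrative, not the SDK's) might look like:

```typescript
// Sketch: the Problem+JSON shape from the error model above, plus a guard
// that checks the three RFC-required fields before trusting the rest.
export interface AtpProblem {
  type: string;
  title: string;
  status: number;
  detail?: string;
  instance?: string;
  tenantId?: string;
  code?: string;
}

export function isAtpProblem(body: unknown): body is AtpProblem {
  if (typeof body !== "object" || body === null) return false;
  const p = body as Record<string, unknown>;
  return (
    typeof p.type === "string" &&
    typeof p.title === "string" &&
    typeof p.status === "number"
  );
}
```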

Egress & Network Policy

  • Deny-by-default egress; allow only IdP, KMS, messaging, object store, email/webhook domains (when used).
  • DNS pinning/allow-lists for critical dependencies; outbound proxies if required.
  • Mesh L7 AuthZ: only permitted service→service calls; no lateral “surprises”.

Streaming, Downloads & Large Payloads

  • Append encourages bounded payloads (size caps).
  • Export uses chunked transfer and signed URLs; resilience via range requests.
  • Timeouts and read/write budgets enforced per route; client hints for retry/backoff.

CORS & Browser Clients

  • Strict CORS: allow specific origins for tenant consoles; SameSite and CSRF tokens for state-changing routes in UIs.

Observability @ Gateway

  • Spans: route, tenant, edition, status, bytes in/out, auth latency, schema check time.
  • Metrics: RPS, p95/99 per route, 4xx/5xx ratios, 429s, rejected schemas.
  • Logs: structured; PII scrubbed; include x-request-id and correlation keys.

Failure Modes & Signals

| Condition | Behavior | Client Signal |
|---|---|---|
| Missing/invalid tenant | Reject | 400/403 with problem+json |
| Rate-limit exceeded | Shed | 429 + Retry-After |
| Schema invalid | Reject | 400 with problem+json & error path |
| AuthN failed | Reject | 401 |
| AuthZ/edition denied | Reject | 403 |
| Upstream saturation | Back-pressure | 503 (retryable) with Retry-After |

Testable Controls

  • Contract tests: tenant required on protected routes; headers propagated.
  • Negative tests: cross-tenant attempts return 403; unknown JSON fields rejected.
  • Synthetic load: verify 429 behavior, limit headers, and stable p95 under burst.


Observability & SLOs

Observability in ATP is first-class: every request and message carries tenantId, edition, traceId, and correlationId. We instrument traces, metrics, and logs using OpenTelemetry, define SLIs/SLOs per service, and manage reliability via error budgets with multi-window burn alerts. PII is never logged; redaction follows classification tags.

Telemetry Standards

  • Traces: Gateway → Service → Outbox/Bus → Consumer/Projector → Store/Export.
    • Resource attrs: service.name, service.version, deployment.environment, cloud.region.
    • Span attrs: tenant.id, tenant.edition, http.route, messaging.operation, messaging.destination, db.system, db.operation, policy.version, idempotency.key.
  • Metrics (OTel with exemplars): use histograms for latency; avoid high-cardinality labels.
    • Common labels: service.name, route|operation, tenant.class (small cardinality bucket), result.
    • Examples: http.server.duration, messaging.consumer.lag, outbox.relay.age, export.queue.depth.
  • Logs: structured JSON; fields include timestamp, level, message, tenantId, edition, traceId, correlationId, eventId, code.
    • No PII in logs or headers; classified fields are masked or hashed.

Sampling: head-based baseline, tail-based for slow/error spans. Retention: traces short, metrics medium, logs per compliance policy.

Golden Signals (platform-wide)

  • Traffic (RPS, throughput), Latency (p95/p99), Errors (4xx/5xx, policy denies), Saturation (CPU/mem, queue depth, projector lag), plus Back-pressure (429s, retry counts).

SLIs per Service

| Service | Primary SLIs | Supporting SLIs |
|---|---|---|
| Gateway | route latency p95/p99; 4xx/5xx ratio; 429 rate | auth latency, schema failure rate |
| Ingestion | append accepted latency p95; accept success rate | outbox relay age p99, request size distribution |
| Policy | decision latency p95; cache hit ratio | decision error rate, fallback activations |
| Integrity | verify latency p95 (by target size) | chain build queue depth, signer error rate |
| Projection | projector lag median/p95; DLQ rate | replay duration, consumer throughput |
| Query | query latency p95/p99; success rate | cache hit ratio, redacted-field count, watermark drift |
| Search (opt) | search latency p95; success rate | index refresh age, queue depth |
| Export | TTFB p95; completion time p95 | package queue depth, resumptions, webhook success |
| Admin | action success rate | time-to-approve, break-glass invocations |

SLO Targets (initial placeholders)

Tune these during load testing; record in [Alerts & SLOs](../operations/alerts-slos.md).

  • Gateway: route p95 ≤ X ms; 5xx ≤ Y ppm.
  • Ingestion: append accepted p95 ≤ X ms; outbox age p99 ≤ N s.
  • Projection: lag median ≤ 5 s, p95 ≤ 30 s.
  • Query: p95 ≤ Y ms at baseline RPS; success rate ≥ 99.95% (excl. client 4xx).
  • Export: TTFB p95 ≤ Z s for ≤ K MB; completion p95 ≤ M min for N records.
  • Integrity: verify p95 ≤ T s for S records; failures ≤ E ppm.
  • Policy: decision p95 ≤ Q ms; hit ratio ≥ H%.

Error Budgets & Burn Alerts

  • Budget = 1 - target_availability. Example: SLO 99.9% ⇒ budget 0.1%.
  • Multi-window, multi-burn alerts (fast + slow):
    • Page if burn ≥ 14× over 1h or ≥ 6× over 6h.
    • Ticket if burn ≥ 1× over 24h.
  • Pair with auto-suppression during planned maintenance (annotations).
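
The budget math can be sketched directly; the paging thresholds below are illustrative placeholders to be tuned alongside the SLO targets above:

```typescript
// Sketch: burn rate = observed error ratio / error budget.
// SLO 99.9% => budget 0.001; burning at exactly budget pace gives burn = 1.
export function burnRate(errorRatio: number, sloTarget: number): number {
  const budget = 1 - sloTarget;
  return budget <= 0 ? Infinity : errorRatio / budget;
}

// Multi-window check: a fast window catches sharp burns, a slow window
// sustained ones. The default thresholds are assumptions, not commitments.
export function shouldPage(
  burn1h: number,
  burn6h: number,
  fast = 14,
  slow = 6,
): boolean {
  return burn1h >= fast || burn6h >= slow;
}
```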

Dashboards (minimum set)

  • Service: latency histograms, error ratios, throughput, saturation, dependency health.
  • Flow: Append (GW→Ingestion→Outbox→Bus→Projection), Query (GW→Query→Read), Export (GW→Export→Cold).
  • Tenancy: top tenants by usage; quota headroom; noisy-neighbor detection.
  • Reliability: projector lag, outbox age, DLQ depth, export queue depth, policy deny rate.
  • Security: auth failures, cross-tenant attempts, signature failures.

Watermarks, Idempotency & Headers

  • Query responses include X-Watermark & X-Lag to expose freshness.
  • Append requires X-Idempotency-Key; success logs include Idempotent:true on dedupe.
  • Rate headers return budget signals: X-RateLimit-*, Retry-After.

Alert Policies (extract)

| Condition | Threshold | Action |
|---|---|---|
| Projector lag p95 > 30s | 15m sustained | Page on-call; auto-scale consumers; evaluate DLQ |
| Outbox age p99 > 10s | 15m | Page ingestion; check broker health |
| Query p95 > Y ms | 30m | Page API/runtime; enable cache protection |
| Export queue depth > Q | 30m | Ticket; throttle export concurrency per tenant |
| 5xx ratio > R ppm | 10m | Page owning team; roll back last deploy guardrail |
| Token validation failures spike | 10m | Page security on-call; investigate IdP/clock skew |

Cardinality & Cost Guardrails

  • Cap high-cardinality labels (e.g., raw tenantId)—aggregate into tenant.class (e.g., S/M/L).
  • Use RED metrics (Rate, Errors, Duration) per route/use-case.
  • Histograms with controlled buckets; exemplars sampled from traces.
  • Drop verbose logs in hot paths; sample at INFO, keep ERROR always.
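
The tenant.class bucketing can be as simple as a volume-based mapping; the thresholds below are invented for illustration:

```typescript
// Sketch: collapse raw tenantId into a small-cardinality tenant.class label
// (S/M/L buckets by request volume). Threshold values are assumptions.
export function tenantClass(requestsPerDay: number): "S" | "M" | "L" {
  if (requestsPerDay < 10_000) return "S";
  if (requestsPerDay < 1_000_000) return "M";
  return "L";
}
```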

Health & Probes

  • Liveness: process healthy; Readiness: deps reachable and policy cache warm; Startup: migrations/keys loaded.
  • Expose /healthz endpoints; aggregate in [Health Checks](../operations/health-checks.md).

Synthetic Probes & Canaries

  • Tenant-scoped synthetics for append/query/export; publish probe artifacts with tenantId=probe.
  • Canary releases gated by SLO trend; rollback if burn > threshold.

Testable Controls

  • CI checks for OTel exporters present; span/metric naming lint.
  • Contract tests assert presence of tenantId, edition, traceId on key spans.
  • E2E tests validate X-Watermark/X-Lag headers and redaction hints on query.


Reliability & Resilience (retries, outbox, DLQ)

ATP targets graceful degradation under failure: shed load early, retry safely with idempotency, confine faults with bulkheads, and preserve work via transactional outbox/inbox and DLQs with audited replay. Policies are tuned to protect SLOs and tenant isolation.

Principles

  • Fail fast at the edge (schema/tenancy/rate) and retry inside only when it’s safe.
  • Exactly-once intent through idempotency keys; consumers are idempotent by construction.
  • Back-pressure before meltdown: 429s at the Gateway; bounded concurrency in workers.
  • Isolate & contain: bulkheads, circuit breakers, DLQs per subscription, exporter pools separate from query.
  • Observable by default: retries, drops, DLQ, and replays are fully traceable.

Timeouts, Retries, Backoff

| Boundary | Timeout (budget) | Retry Policy | Max Attempts | Notes |
|---|---|---|---|---|
| Client → Gateway | short (route p95 + margin) | No (client retry on 429/503 only) | 0 | Gateway signals back-off via headers |
| Gateway → Service | route-specific (p95 × 1.2) | No (propagate) | 0 | Avoid retry storms |
| Service → Policy/IdP/KMS | short | Yes, exp. backoff + jitter | 3–5 | Only on transient errors/timeouts |
| Service → Broker (publish) | short | Yes, then Outbox relay | bounded | Never drop; relay ensures delivery |
| Consumer → Repo/Store | medium | Yes, exp. backoff + jitter → DLQ | 5 | Idempotent upsert/no-op required |
| Integrity/Export → Object Store | medium/long | Yes, exp. backoff + resume | bounded | Resume via range requests |

Budgeting: Timeouts derived from SLOs; p95 + headroom. Retries use exponential backoff with jitter (e.g., base 100–250ms, cap 5–10s). No retries on validation/authorization errors.
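
A sketch of exponential backoff with decorrelated jitter, using the starter values from this document (base 100 ms, cap 10 s); the helper is illustrative, not SDK code:

```typescript
// Sketch: decorrelated jitter — sleep = min(cap, random_between(base, prev * 3)).
// Each delay is drawn from a range anchored to the previous delay, which
// spreads retries out without synchronized thundering herds.
export function nextDelayMs(
  prevDelayMs: number,
  rand: () => number = Math.random, // injectable for testing
  baseMs = 100,
  capMs = 10_000,
): number {
  const upper = Math.max(baseMs, prevDelayMs * 3);
  const jittered = baseMs + rand() * (upper - baseMs);
  return Math.min(capMs, jittered);
}
```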


Outbox / Inbox Semantics

Outbox guarantees durable publication of domain events without two-phase commit.

sequenceDiagram
  autonumber
  participant API as API/Use-case
  participant DB as Append Store (tx)
  participant OB as Outbox (tx)
  participant RL as Relay Worker
  participant BUS as Broker

  API->>DB: Persist aggregate
  API->>OB: Persist event (same tx)
  DB-->>API: Commit (record + outbox)
  RL->>OB: Poll due events
  RL->>BUS: Publish (with headers/trace)
  RL->>OB: Mark as delivered (idempotent)

Inbox (consumer de-dup) stores (eventId|idempotencyKey) receipts for M days to drop duplicates safely.

  • Idempotency Key: (tenantId, sourceId, sequence|hash)
  • Headers: x-idempotency-key, traceparent, x-tenant-id, x-schema-version
  • Guarantee: at-least-once delivery; exactly-once effects via idempotent handlers.
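
The key shape and consumer de-dup can be sketched as below; `buildIdempotencyKey` and the in-memory `Inbox` are illustrative (a real inbox persists receipts for M days):

```typescript
// Sketch: (tenantId, sourceId, sequence|hash) idempotency key, plus an inbox
// that drops duplicates. An in-memory Set stands in for a durable receipt store.
export function buildIdempotencyKey(
  tenantId: string,
  sourceId: string,
  seqOrHash: number | string,
): string {
  return `${tenantId}:${sourceId}:${seqOrHash}`;
}

export class Inbox {
  private seen = new Set<string>();

  /** True if the event is new and should be handled; false drops a duplicate. */
  accept(key: string): boolean {
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    return true;
  }
}
```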

DLQ, Replay & Triage

Every subscription has a DLQ. We never silently drop.

flowchart LR
  subgraph Processing
    C[Consumer] --> H[Handler  - Idempotent]
    H --> OK[Commit]
    H -->|on failure after N retries| DLQ[(DLQ)]
  end
  subgraph Triage & Replay
    DLQ --> UI[DLQ Console/Runbook]
    UI --> DRY[Dry-run Replay]
    DRY --> REP[Replay Window/Tenant]
    REP --> C
  end

Triage metadata: error code, exception type, payload hash, headers, schema/version, attempt count, trace links.

Replay rules

  • Scoped by tenant and time window; require approval for large replays.
  • Dry-run first: count would-be successes/failures; cap concurrency; respect watermarks.
  • Immutability: handlers must be replay-safe; only side effects that are idempotent.

Bulkheads, Circuit Breakers, Quotas

  • Bulkheads: separate pools for Export vs Query; limit projector concurrency by partition; per-tenant worker caps.
  • Circuit breakers: open on dependency failures (Policy/IdP/KMS/Store); fail closed for risky decisions (deny if policy unavailable), fail open for non-critical enrichments.
  • Quotas: per-tenant caps on RPS, concurrent exports, selection size, and storage growth.

Failure Taxonomy & Handling

| Class | Examples | Handling | Client Signal |
|---|---|---|---|
| Validation | schema invalid, unknown fields | Reject (no retry) | 400 (Problem+JSON) |
| Authorization | missing/invalid tenant, ABAC deny | Reject (no retry) | 401/403 |
| Resource limits | rate/quotas exceeded | Shed load | 429 + Retry-After |
| Transient infra | broker timeout, object store 503 | Retry with jitter; back-off; DLQ after N | 202/503 (server) |
| Hot partition | tenant/source spike | Throttle producer; bounded consumer concurrency | 429 (edge), lag dashboards |
| Dependency outage | Policy/KMS down | Breaker; safe defaults; degrade features | 503 (server) |
| Data poison | schema/version mismatch | Route to DLQ; fix schema/consumer; replay | N/A (internal) |

Back-Pressure & Lag Management

  • Gateway: 429 with tenant-scoped limits; communicate X-RateLimit-*.
  • Consumers: dynamic concurrency (lower when lag or error rate climbs).
  • Projection: track watermarks; alert when lag p95 > SLO; autoscale consumers on lag & queue depth.
  • Export: queue depth guards; resumable streams; per-tenant concurrency caps.

Chaos & Resilience Drills

  • Inject faults in staging: broker hiccups, object-store slowdown, KMS latency, projector crashes.
  • Verify: error budget burn alerts, DLQ accumulation, replay throughput, circuit behavior, user-facing latency.
  • Record drill outcomes and update runbooks and SLO thresholds.

Observability for Reliability

  • Metrics: outbox.relay.age, consumer.lag, dlq.depth, dlq.replay.rate, retry.count, 429.rate, breaker.open.count.
  • Traces: link original request → outbox publish → consumer handle → DLQ/replay spans.
  • Logs: structured error codes; payload hashes (not values); tenant-safe redaction.

Configuration Defaults (starters)

  • Retries: 3–5 attempts, exp. backoff + full jitter (Decorrelated Jitter), cap 10s.
  • Concurrency: start low (e.g., 2–4 per partition); auto-tune toward SLO.
  • DLQ retention: 7–30 days per environment; immutable audit of purges.
  • Replay: require change ticket/approval for >N msgs or cross-tenant windows.

Testable Controls

  • Unit tests: consumers idempotent; repositories upsert-or-noop.
  • Contract tests: no PII in headers; tenant headers present.
  • E2E: inject transient failures and verify DLQ path, replay correctness, and watermark recovery.
  • Synthetic hot-partition tests: verify 429s at edge, stable p95, and bounded lag.


Integrity & Tamper-Evidence

ATP provides cryptographic assurance that audit data is unchanged since write and that exports are authentic. We implement append-only hash chains, digitally signed checkpoints/manifests, and verification APIs usable online (service-backed) and offline (air-gapped).

Objectives

  • Tamper-evidence: any modification, insertion, deletion, or re-ordering becomes detectable.
  • Provenance: every package/export carries a signed, reproducible manifest.
  • Verifiability: tenants and auditors can verify individual records, ranges, or full exports without trusting ATP.
  • Agility: algorithm and key rotation without rewriting historical data (versioned metadata).

Integrity Model (at a glance)

  • Canonicalization at write → compute a content digest of the canonical record.
  • Hash Chains per {tenantId, time-slice} with rolling checkpoints.
    • Chain node i: H_i = Hash(H_{i-1} || digest(record_i) || meta_i)
    • meta_i includes recordId, committedAt, policyVersion, and chain coordinates.
  • Checkpoints (e.g., hourly/daily) are digitally signed; contain H_last, span, and count.
  • Exports produce signed manifests listing content digests and chain/checkpoint references.
  • Verification APIs return proof objects; offline tools can validate with published keys.
flowchart LR
  subgraph TS["Tenant Slice: t-acme - 2025-10-22"]
    R1[rec#1 digest] --> CH1[H1]
    R2[rec#2 digest] --> CH2[H2]
    R3[rec#3 digest] --> CH3[H3]
    CH1 --> CH2 --> CH3
    CH3 --> CKP[Checkpoint Σ<br /> signed - H_last, span, count]
  end
  CKP --> MAN[Export Manifest<br /> signed - digests, ranges, refs]

We prefer linear hash chains with signed periodic checkpoints. Optionally, a Merkle tree can be built per checkpoint window for batch verification without changing the public surface.
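
The chain rule above can be exercised directly; the meta encoding and genesis value in this sketch are illustrative, not the stored format:

```typescript
import { createHash } from "node:crypto";

// Sketch of H_i = Hash(H_{i-1} || digest(record_i) || meta_i); the "|"-joined
// meta encoding is an assumption for illustration.
export function chainNode(
  prevHash: string,
  recordDigest: string,
  meta: { recordId: string; committedAt: string },
): string {
  const h = createHash("sha256");
  h.update(prevHash);
  h.update(recordDigest);
  h.update(`${meta.recordId}|${meta.committedAt}`);
  return "sha256:" + h.digest("hex");
}

// Verifying a slice recomputes every node; any tamper, insertion, deletion,
// or re-order changes H_last and fails against the signed checkpoint.
export function chainHead(
  genesis: string,
  records: Array<{ digest: string; recordId: string; committedAt: string }>,
): string {
  return records.reduce(
    (prev, r) => chainNode(prev, r.digest, { recordId: r.recordId, committedAt: r.committedAt }),
    genesis,
  );
}
```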


Data Elements

| Artifact | Purpose | Immutable Fields (excerpt) |
|---|---|---|
| Record digest | Per-record integrity | recordId, canonical payload, occurredAt, committedAt, policyVersion |
| Chain node | Links records in order | chainId, chainIndex, prevHash, recordDigest, meta |
| Checkpoint | Signed summary of a range | chainId, fromIndex, toIndex, H_last, spanStart/End, count, algo, keyId, sig |
| Manifest | Export integrity & lineage | package metadata, list of {recordId, digest}, checkpoint refs, algo, keyId, sig |

Canonicalization rules live alongside message schemas; they are whitespace- and order-independent, normalize numerics/booleans, and reject unknown fields (unless whitelisted).
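
A minimal sorted-key canonicalizer shows the intent; the real rules live with the schemas, and numeric normalization plus unknown-field rejection are omitted here:

```typescript
import { createHash } from "node:crypto";

// Sketch: order-independent canonicalization (sorted object keys) followed by
// a SHA-256 content digest — the "canonicalization at write → digest" step.
export function canonicalize(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(canonicalize).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const entries = Object.entries(value as Record<string, unknown>)
      .sort(([a], [b]) => (a < b ? -1 : 1))
      .map(([k, v]) => `${JSON.stringify(k)}:${canonicalize(v)}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value); // numbers, booleans, strings, null
}

export function contentDigest(value: unknown): string {
  return "sha256:" + createHash("sha256").update(canonicalize(value)).digest("hex");
}
```

Two payloads that differ only in key order produce the same digest, which is what makes the per-record digests reproducible by external verifiers.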


Algorithms & Keys (guidance)

  • Hash: SHA-256 (default); allow algorithm versioning (e.g., sha256, sha512/256).
  • Signatures: Ed25519 / ECDSA P-256 (configurable).
  • Key management: KMS-backed; key IDs and versions stamped in checkpoints/manifests.
  • Rotation: introduce new algo/keyId at checkpoint boundaries; historical proofs remain valid.

We maintain algorithm agility: verification reads metadata and dispatches appropriate verifier.


Write-time Flow (with integrity material)

sequenceDiagram
  autonumber
  participant ING as Ingestion
  participant CAN as Canonicalizer
  participant APP as Append Store
  participant INT as Integrity
  ING->>CAN: Normalize(payload)
  CAN-->>ING: canonical bytes + digest(payload)
  ING->>APP: Append(record, digest, policyVersion)
  ING->>INT: Update chain (tenant, time-slice, recordDigest)
  INT->>INT: H_i = Hash(H_{i-1} || recordDigest || meta)
  alt checkpoint boundary
    INT->>KMS: Sign(H_last, span, count, algo)
    KMS-->>INT: signature(keyId, version)
    INT->>APP: Persist checkpoint
  end

Verification Surfaces

APIs

  • POST /api/v{n}/integrity/verify/record → input: { tenantId, recordId } → output: { ok, recordDigest, chainProof, checkpointRef }
  • POST /api/v{n}/integrity/verify/range → input: { tenantId, from, to } → output: { ok, H_last, span, checkpointSig }
  • POST /api/v{n}/integrity/verify/export → input: { exportId } → output: { ok, manifestDigest, signature, chainRefs[] }

CLI/Offline (reference implementation)

  • Verify with only: manifest, exported data, published public keys, and (optional) checkpoint bundle.

Response Example (export verify)

{
  "ok": true,
  "exportId": "exp-01JAX...",
  "manifest": {
    "algo": "sha256",
    "keyId": "kms:key/ed25519:v4",
    "digest": "sha256:7f...c1",
    "signature": "base64:MEUCIQ..."
  },
  "chainRefs": [
    { "chainId": "t-acme:2025-10-22", "toIndex": 85123, "H_last": "sha256:ab..ef", "checkpointSig": "base64:..." }
  ]
}

Export Manifest (canonical shape)

{
  "schemaVersion": "1.0.0",
  "packageId": "exp-01JAX...",
  "tenantId": "t-acme",
  "createdAt": "2025-10-22T12:04:55Z",
  "algo": "sha256",
  "keyId": "kms:key/ed25519:v4",
  "items": [
    { "recordId": "01H...", "digest": "sha256:...", "occurredAt": "2025-10-22T10:12:00Z" }
  ],
  "chainRefs": [
    { "chainId": "t-acme:2025-10-22", "fromIndex": 84000, "toIndex": 85123, "H_last": "sha256:...", "checkpointId": "ckp-2025-10-22T12:00:00Z" }
  ],
  "signature": "base64:..."
}
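
Offline verification walks `items` and compares recomputed digests against the manifest; this sketch checks digests only and leaves signature verification (keyId + published public key) to a real verifier:

```typescript
// Sketch: compare manifest item digests against digests recomputed from the
// exported data. The lookup callback stands in for "recompute from package".
export interface ManifestItem {
  recordId: string;
  digest: string;
}

export function verifyItems(
  items: ManifestItem[],
  digestOf: (recordId: string) => string | undefined,
): { ok: boolean; failed: string[] } {
  const failed = items
    .filter((i) => digestOf(i.recordId) !== i.digest)
    .map((i) => i.recordId);
  return { ok: failed.length === 0, failed };
}
```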

Failure Modes & Mitigations

| Scenario | Detection | Mitigation |
|---|---|---|
| Record tampered | digest mismatch | reject read/export; raise integrity alert |
| Chain gap or re-order | recompute H_i; mismatch vs stored | mark slice compromised; rebuild from trustworthy boundary; alert |
| Checkpoint key rotated | keyId mismatch | verify with previous public key set; publish rotation bundle |
| Manifest altered | signature invalid | reject download; regenerate; investigate |
| Clock skew | timestamp sanity checks | rely on committedAt from ATP; include in proofs |
| Hot shard loss | missing nodes | replay from outbox/event log; regenerate chain; issue new checkpoint noting incident (transparent) |

Rebuild policy: Regenerate affected chain segment without rewriting record contents; checkpoint notes remediation; previous proofs remain for unaffected ranges.


Observability & SLOs (integrity)

  • SLIs: verify latency p95 (by size), chain build queue depth, checkpoint issuance delay, signature error rate.
  • Alerts: chain mismatch, signature failure spikes, delayed checkpoints beyond window.
  • Trace: link append → chain update → checkpoint sign → export manifest.

Testable Controls

  • Deterministic canonicalization tests (golden vectors).
  • Property tests: random order tampering must fail verification.
  • Rotation tests: verify old data with previous keys; new data with new keys.
  • Cross-check: export manifest round-trip verification in CI (sampled).
  • Negative tests: header-only PII prohibition (integrity material never includes PII).

Operational Practices

  • Key custody: least-privilege KMS roles; HSM-backed keys optional; dual-control for rotations.
  • Publishing: make public verification keys and checkpoint bundles available per tenant/region (read-only).
  • eDiscovery: export includes manifest and optional checkpoint pack to enable offline verification.
  • Incident handling: freeze exports for affected slices; publish advisory with affected chainId ranges.


SDK & Integration Guidance (overview)

This section orients integrators to the official SDKs and the minimum set of practices to publish, query, and export audit data safely. Deep dives and runnable samples live under SDK and Guides.

Official SDKs

| Language | Package | Status | Target Runtimes |
|---|---|---|---|
| C# | ConnectSoft.Atp | GA (preferred) | .NET 8/9 |
| JavaScript/TypeScript | @connectsoft/atp | GA (preferred) | Node 18+/20+, modern browsers (query only) |

Common features: automatic tenancy propagation, idempotency key helpers, schema validation, built-in retry with jitter (safe verbs only), OTel instrumentation hooks, and Problem+JSON error mapping.


Minimal Client Configuration

# Pseudoconfig (both SDKs support env vars and code-based config)
ATP_BASE_URL:   https://api.atp.example
ATP_TENANT_ID:  t-acme
ATP_EDITION:    enterprise
ATP_CLIENT_ID:  <oidc client id>
ATP_CLIENT_SECRET: <secret>         # or workload identity
ATP_TIMEOUT_MS: 3000
  • Auth: OIDC client credentials or workload identity.
  • Tenancy: tenantId required for all calls; SDK sets x-tenant-id header.
  • Tracing: pass OTel tracer/provider to auto-attach traceparent.

Publish (Append) — Quick Start

C#

var client = new AtpClient(new AtpOptions {
  BaseUrl = new Uri(Environment.GetEnvironmentVariable("ATP_BASE_URL")!),
  TenantId = "t-acme",
  Auth = AtpAuth.ClientCredentials("clientId","clientSecret"),
});

var record = new AuditRecord {
  SourceId = "order-svc",
  ActorId = "u-123",
  Action = "UPDATE",
  Resource = "Order/4711",
  Attributes = new { status = "Shipped", carrier = "DHL" },
  OccurredAt = DateTimeOffset.UtcNow
};

var idem = IdempotencyKey.From("t-acme", "order-svc", sequence:4711);
await client.AppendAsync(record, idem, ct);

TypeScript

import { AtpClient, idempotencyKey } from "@connectsoft/atp";

const client = new AtpClient({
  baseUrl: process.env.ATP_BASE_URL!,
  tenantId: "t-acme",
  auth: { kind: "clientCredentials", clientId: "...", clientSecret: "..." },
});

const record = {
  sourceId: "order-svc",
  actorId: "u-123",
  action: "UPDATE",
  resource: "Order/4711",
  attributes: { status: "Shipped", carrier: "DHL" },
  occurredAt: new Date().toISOString(),
};

await client.append(record, idempotencyKey("t-acme","order-svc",4711));

Contract reminders

  • Headers (SDK-managed): Authorization, x-tenant-id, X-Idempotency-Key, Content-Type.
  • Schema: unknown fields rejected unless whitelisted; use SDK types to avoid drift.
  • PII discipline: never place PII in headers; carry it only in attributes, using classification-aware schema fields.

Query — Authorized Read

C#

var page = await client.QueryAsync(new QuerySpec {
  TimeRange = TimeRange.LastHours(24),
  Filters = new() { Resource = "Order/4711" },
  Page = new PageSpec(size: 100)
}, ct);

// Headers surfaced on the response object
Console.WriteLine($"Watermark={page.Watermark}, Lag={page.LagSeconds}s");

TypeScript

const res = await client.query({
  timeRange: { lastHours: 24 },
  filters: { resource: "Order/4711" },
  page: { size: 100 }
});
console.log(res.headers["x-watermark"], res.headers["x-lag"]);
  • Redaction: SDKs expose redactionHints where fields were masked/hidden.
  • Pagination: cursor-based; pass next token; default page size 100 (configurable).
  • Rate limits: watch X-RateLimit-* and Retry-After.
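
The cursor contract above can be drained with a small helper. The { items, next } page shape and the fetchPage signature below are assumptions for illustration, not the SDK's actual response type:

```typescript
// Sketch: iterate all records across cursor-based pages.
// In practice fetchPage would wrap client.query with the next token.
interface Page<T> { items: T[]; next?: string }

async function* allRecords<T>(
  fetchPage: (cursor?: string) => Promise<Page<T>>
): AsyncGenerator<T> {
  let cursor: string | undefined;
  do {
    const page = await fetchPage(cursor); // pass the cursor token onward
    yield* page.items;
    cursor = page.next;                   // undefined when exhausted
  } while (cursor);
}
```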

Export — Selection → Package

C#

var exportId = await client.Export.CreateAsync(new ExportRequest {
  Query = new QuerySpec { Filters = new() { Resource = "Order/4711" } },
  Format = "jsonl"
}, ct);

var stream = await client.Export.DownloadAsync(exportId, ct); // resumable via Range

TypeScript

const { exportId } = await client.export.create({
  query: { filters: { resource: "Order/4711" } },
  format: "jsonl"
});
const file = await client.export.download(exportId); // supports range & resume
  • Resumable downloads (HTTP Range).
  • Integrity: response includes manifest digest/signature; use SDK verifyExport() wrapper for convenience.
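
A minimal sketch of the digest half of that verification, assuming the manifest carries a hex-encoded SHA-256 of the downloaded JSONL bytes (verifyExport() would also check the signature); the field name and encoding are assumptions:

```typescript
// Sketch: recompute the export digest and compare it to the manifest's,
// using a constant-time comparison.
import { createHash, timingSafeEqual } from "node:crypto";

function digestMatchesManifest(body: Buffer, manifestDigestHex: string): boolean {
  const actual = createHash("sha256").update(body).digest();
  const expected = Buffer.from(manifestDigestHex, "hex");
  // Length check first: timingSafeEqual throws on unequal lengths.
  return actual.length === expected.length && timingSafeEqual(actual, expected);
}
```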

Webhooks (optional) — Ingest & Completion

Verification

import { verifyWebhook } from "@connectsoft/atp/webhooks";
const ok = verifyWebhook(headers, rawBody, secret); // HMAC or signature
if (!ok) return res.status(401).end();
  • Retry-safe: your handler must be idempotent; deduplicate on the event's eventId or the x-idempotency-key header.
  • Security: require HTTPS; rotate secrets; deny unsigned deliveries.
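
For intuition, an HMAC-based verifyWebhook amounts to roughly the following; the hex signature encoding and the exact header handling are assumptions about the scheme, not the SDK's documented internals:

```typescript
// Sketch: verify an HMAC-SHA256 webhook signature over the raw body.
import { createHmac, timingSafeEqual } from "node:crypto";

function verifyHmacSignature(rawBody: Buffer, signatureHex: string, secret: string): boolean {
  const expected = createHmac("sha256", secret).update(rawBody).digest();
  const given = Buffer.from(signatureHex, "hex");
  // Length check first: timingSafeEqual throws on unequal lengths.
  // Constant-time compare prevents timing side channels.
  return given.length === expected.length && timingSafeEqual(given, expected);
}
```

Always verify against the raw request bytes, before any JSON parsing, since re-serialization can change the payload.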

Retries, Timeouts, and Idempotency

  • SDK retries: exponential backoff with jitter on transient errors only (5xx/timeouts).
  • Do not auto-retry on 4xx (validation, auth, policy deny).
  • Idempotency key builder helpers express exactly-once intent (delivery is at-least-once; the server deduplicates on the key):
    • Conventional form: `${tenantId}:${sourceId}:${sequence|hash}`.
  • Timeouts: defaults ~3s (configurable per method); align with route SLOs.
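
A sketch of the conventional key form; the real IdempotencyKey.From / idempotencyKey helpers encapsulate this, and the ':' guard below is an illustrative safety check, not documented behavior:

```typescript
// Sketch: build an idempotency key in the conventional
// tenantId:sourceId:sequence form.
function buildIdempotencyKey(
  tenantId: string,
  sourceId: string,
  sequenceOrHash: number | string
): string {
  for (const part of [tenantId, sourceId, String(sequenceOrHash)]) {
    // Reject the separator inside parts so keys stay unambiguous.
    if (part.includes(":")) throw new Error("key parts must not contain ':'");
  }
  return `${tenantId}:${sourceId}:${sequenceOrHash}`;
}
```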

Error Model (Problem+JSON)

HTTP | Code (example)           | Meaning                    | Action
400  | SCHEMA_VALIDATION_FAILED | Payload invalid            | Fix schema
401  | UNAUTHENTICATED          | Missing/invalid token      | Refresh/reauth
403  | AUTHORIZATION_DENIED     | RBAC/ABAC/edition gate     | Request access
429  | RATE_LIMITED             | Per-tenant quota exceeded  | Back off using Retry-After
503  | UPSTREAM_UNAVAILABLE     | Transient infra            | SDK retries (jitter)

All SDK exceptions include traceId; logs are safe for sharing (no PII).
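
The retry policy above can be sketched as a delay chooser that honors Retry-After on 429 and falls back to full-jitter exponential backoff on 5xx/timeouts; the base and cap values below are illustrative defaults, not the SDK's:

```typescript
// Sketch: pick the next retry delay in milliseconds.
function retryDelayMs(
  attempt: number,             // 1-based retry attempt
  retryAfterSeconds?: number,  // parsed Retry-After header, if present
  baseMs = 200,
  capMs = 10_000
): number {
  // The server's Retry-After always takes precedence.
  if (retryAfterSeconds !== undefined) return retryAfterSeconds * 1000;
  // Full jitter: uniform over [0, min(cap, base * 2^(attempt-1))).
  const ceiling = Math.min(capMs, baseMs * 2 ** (attempt - 1));
  return Math.floor(Math.random() * ceiling);
}
```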


Versioning & Compatibility

  • Semantic Versioning for SDKs; APIs/events are additive-first.
  • Deprecation: SDK surfaces Deprecation/Sunset headers; consult Changelog.
  • Pin major versions in production; run contract tests in CI.

Security Considerations

  • Tokens are short-lived; enable MFA/workload identity.
  • Never hardcode secrets; use secret managers.
  • No PII in headers; classification-driven redaction in logs/exports.
  • Export manifests are signed; verify before processing.

Observability Hooks

  • Pass your tracer/provider to the SDK to attach spans (gateway → service → broker → store).
  • Metrics: per-call latency histograms, retry counts, 4xx/5xx ratios.
  • Correlate with traceId from Problem+JSON on errors.

Common Patterns & Anti-patterns

Do:

  • Use IdempotencyKey for every append.
  • Batch appends with bounded payloads; avoid oversized attributes.
  • Respect Retry-After and X-RateLimit-*.
  • Propagate tenantId consistently.

Don’t:

  • Put secrets/PII into headers.
  • Bypass SDK schema types with “raw” posts.
  • Retry on validation/auth errors.
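
The "bounded payloads" guidance above can be implemented with a simple size-based batcher; the 1 MiB cap below is an assumed example, not a documented ATP limit:

```typescript
// Sketch: split records into batches whose serialized size stays
// under a byte budget, so no append request grows oversized.
function batchBySize<T>(
  records: T[],
  maxBytes = 1_048_576, // assumed 1 MiB example budget
  sizeOf: (r: T) => number = (r) => Buffer.byteLength(JSON.stringify(r))
): T[][] {
  const batches: T[][] = [];
  let current: T[] = [];
  let currentBytes = 0;
  for (const r of records) {
    const size = sizeOf(r);
    // Flush when adding this record would exceed the budget
    // (a single oversized record still gets its own batch).
    if (current.length > 0 && currentBytes + size > maxBytes) {
      batches.push(current);
      current = [];
      currentBytes = 0;
    }
    current.push(r);
    currentBytes += size;
  }
  if (current.length > 0) batches.push(current);
  return batches;
}
```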


Risks & Mitigations

This section catalogs the top 8 architectural risks for ATP with signals, mitigations, and contingency playbooks. Each risk has an owner, testable controls, and traceability to SLOs/ADRs.

Severity (S): Low/Med/High. Likelihood (L): Low/Med/High.

Summary Matrix

ID    | Risk                                                  | S    | L   | Primary Owner            | Key Signals / SLIs
R-001 | Scale spikes on write (ingestion surge)               | High | Med | Solution + DevOps        | 429 rate, outbox age p99, broker queue depth, projector lag p95
R-002 | Cost overruns (storage/exports/telemetry)             | Med  | Med | Cloud + Finance (FinOps) | Hot/Warm/Cold growth velocity, export concurrency, metrics/logs ingestion cost
R-003 | Data residency / sovereignty violation                | High | Low | Security + Data + Cloud  | Cross-region traffic events, storage location drift, restore target checks
R-004 | Schema drift / incompatible change                    | Med  | Med | Application + Data       | Contract test failures, schema registry diffs, consumer DLQ spike
R-005 | Tight coupling between services/contracts             | Med  | Med | Enterprise + Solution    | Change ripple count per deploy, cross-service failure blast radius
R-006 | Vendor lock-in (broker/DB/cloud features)             | Med  | Med | Enterprise + Infra       | Adapter coverage gaps, portability test failures
R-007 | Noisy neighbor (multi-tenant contention)              | High | Med | Enterprise + SRE         | Per-tenant 429s, query p95 regressions, export queue depth by tenant
R-008 | Compliance drift (controls stale vs GDPR/HIPAA/SOC2)  | High | Low | Security + Compliance    | Policy deny rates, retention job failures, audit finding backlog

R-001 — Scale spikes on write (ingestion surge)

  • Signals/SLIs: 429 rate ↑, outbox.relay.age ↑, broker depth ↑, projector lag p95 > SLO.
  • Mitigations:
    • Per-tenant rate limits/quotas at Gateway; 429 + Retry-After.
    • Transactional outbox, consumer concurrency autoscaling, partition by tenantId[:sourceId].
    • Back-pressure: bounded worker pools; bulkheads to protect Query/Export.
  • Contingency:
    • Enable surge mode: increase broker partitions/consumer concurrency; temporarily narrow schema acceptance windows.
    • Throttle exporters globally; prioritize ingestion SLIs.
  • Tests: synthetic hot-partition load; verify stable p95 and bounded lag.
  • Trace: SLOs in Observability & SLOs.
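
The per-tenant rate-limit mitigation is typically a token bucket at the Gateway; a minimal sketch, with illustrative parameters (the actual Gateway implementation is not specified here):

```typescript
// Sketch: per-tenant token bucket. A denied consume maps to
// 429 + Retry-After at the Gateway.
class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private capacity: number,     // burst size
    private refillPerSec: number, // sustained requests/second
    now = Date.now()
  ) {
    this.tokens = capacity;
    this.lastRefill = now;
  }

  tryConsume(now = Date.now()): boolean {
    const elapsedSec = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false; // caller answers 429 with Retry-After
  }
}
```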

R-002 — Cost overruns (storage/exports/telemetry)

  • Signals: Hot/Warm/Cold growth velocity; export concurrency; log/metrics ingestion cost.
  • Mitigations:
    • Tiering policy (Hot→Warm→Cold), compression, projection granularity reviews.
    • Export batch windows, per-tenant export caps, dedupe repeated selections.
    • Telemetry sampling (tail-based traces), cardinality caps (tenant.class buckets).
  • Contingency:
    • Apply emergency retention adjustments (policy-backed), reduce export concurrency; enable log down-sampling.
  • Tests: monthly FinOps review; cost regression checks in perf env.
    See Storage Strategy & Data Residency & Retention.

R-003 — Data residency / sovereignty violation

  • Signals: cross-region writes/reads, restore to wrong region, misconfigured buckets.
  • Mitigations:
    • Tenant→region map enforcement at Gateway; region-scoped storage and backups.
    • Infrastructure policies (deny cross-region replication by default).
    • Residency checks in CI/CD and synthetic restore drills (tenant/time-scoped).
  • Contingency:
    • Freeze affected tenant exports; relocate data; notify per contractual/DPA terms.
  • Tests: residency unit tests; periodic restore drills per region.
    See Privacy (GDPR/HIPAA/SOC2).

R-004 — Schema drift / incompatible change

  • Signals: contract test failures, consumer DLQ spikes, projector failures.
  • Mitigations:
    • Schema registry, additive-first evolution, versioned Published Language.
    • Producer/consumer contract tests in CI; dual-write/read during migrations.
    • Strict schema validation at Gateway; unknown fields rejected unless whitelisted.
  • Contingency:
    • Pin consumers; roll back producers; replay DLQ after schema fix.
  • Tests: golden schema vectors; migration rehearsals in staging with replay.
    See Message Schemas & REST APIs.

R-005 — Tight coupling between services/contracts

  • Signals: high change ripple; deploys blocked by downstream readiness; wide blast radius.
  • Mitigations:
    • Open Host Service at Gateway; asynchronous EDA between services.
    • Ports-and-adapters; domain isolated from frameworks; backward-compatible contracts.
    • Feature flags & canary releases to localize risk.
  • Contingency:
    • Breaker open on failing dependency; serve with last watermark; postpone non-critical enrichments.
  • Tests: chaos drills removing downstreams; verify graceful degradation.
    See Component Boundaries & Event-Driven Plan.

R-006 — Vendor lock-in (broker/DB/cloud features)

  • Signals: adapter gaps, reliance on proprietary features without abstraction, migration blockers.
  • Mitigations:
    • Abstraction layers for broker/index/persistence; avoid provider-specific payloads in domain.
    • Keep export/verify formats open (JSONL + signed manifests).
    • ADRs record trade-offs; periodic portability tests (alt broker/index in CI).
  • Contingency:
    • Side-by-side pilot on alternate provider; maintain dual adapters for a window.
  • Tests: contract suite against alt adapters; export/verify runs offline.
    See ADRs & Governance.

R-007 — Noisy neighbor (multi-tenant contention)

  • Signals: per-tenant 429s, p95/p99 regressions, export queue depth spikes, projector lag localized to a tenant.
  • Mitigations:
    • Per-tenant quotas for RPS/storage/exports; separate exporter pools; projector bulkheads per partition.
    • Cache protection and query timeouts; circuit breakers on read amplification paths.
  • Contingency:
    • Temporarily lower limits for offending tenants; schedule off-peak exports; enable sharding by tenantId:sourceId.
  • Tests: synthetic contention runs; verify isolation and SLO stability.
    See Multitenancy & Tenancy Guards.

R-008 — Compliance drift (controls stale vs GDPR/HIPAA/SOC2)

  • Signals: policy deny spikes, retention failures, missed rotations, audit finding backlog.
  • Mitigations:
    • Policies as code, versioned; CI checks for classification tags on new fields.
    • Automated retention jobs with attestations; scheduled key rotations; admin action audit trails.
    • Quarterly control attestation packs; DSR synthetics.
  • Contingency:
    • Freeze risky exports; hotfix policy sets; trigger IR playbooks and stakeholder comms.
  • Tests: DPIA gates for new data classes; DSR rehearsal; rotation drills.
    See Compliance & Privacy & Security & Compliance.

Governance & Traceability

  • Each risk maps to ADRs (decision logs), linked mitigations, and SLOs (error budgets).
  • Risk review cadence: monthly in Ops/Architecture, quarterly for Compliance/Exec.
  • Changes to risk posture require an ADR update and CI policy updates.


ADR Index & Governance

Architecture decisions are captured as ADRs (Architecture Decision Records) to make trade-offs explicit, auditable, and traceable to roadmap items and SLOs. This section defines where ADRs live, how they’re authored/reviewed, and a starter index of the key decisions for ATP.

Where ADRs Live

  • Repository path: /docs/adrs/ (one file per decision).
  • Naming: ADR-<YYYY>-<NNN>-<kebab-title>.md (e.g., ADR-2025-001-tenancy-model.md).
  • Status taxonomy: Proposed → Accepted → Deprecated → Superseded.

Each PR that changes contracts, data models, or deployment posture must reference an ADR (existing or new).
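
The naming convention lends itself to a CI lint; the regex below is one illustrative interpretation of the ADR-<YYYY>-<NNN>-<kebab-title>.md pattern, not an official check:

```typescript
// Sketch: validate ADR filenames against the naming convention,
// e.g. ADR-2025-001-tenancy-model.md.
const ADR_NAME = /^ADR-\d{4}-\d{3}-[a-z0-9]+(?:-[a-z0-9]+)*\.md$/;

function isValidAdrFilename(name: string): boolean {
  return ADR_NAME.test(name);
}
```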

Decision Lifecycle

flowchart LR
  P[Propose ADR<br/>draft PR] --> R[Review<br/>arch council + owners]
  R -->|approve| A[Accepted<br/>merge + tag]
  R -->|revise| P
  A --> I[Implement<br/>code + docs + SLOs]
  I --> E[Evaluate<br/>metrics + post-deploy]
  A --> S[Supersede/Deprecate<br/>new ADR links prior]

RACI

  • Responsible: Proposal author (feature owner)
  • Accountable: Enterprise Architect (final sign-off)
  • Consulted: Solution, Security, Data, Cloud/Infra, DevOps Architects
  • Informed: PM/Delivery, Compliance, SRE

ADR Template (use for new decisions)

---
adr: ADR-2025-XXX
title: <Concise decision title>
status: Proposed | Accepted | Deprecated | Superseded by ADR-YYYY-NNN
owners:
  - <role/name>
date: 2025-..-..
links:
  issues: [ <link(s) to epics/issues> ]
  docs:
    - ../architecture/architecture.md#<section-anchor>
    - ../domain/contracts/...
slo_impact:
  - <which SLOs/SLIs are affected>
risk:
  severity: Low|Med|High
  mitigations: [ <refs to sections/runbooks> ]
---

## Context
<Problem statement, constraints, alternatives considered, why now>

## Decision
<The option chosen and why; scope and boundaries; tenant/edition impact>

## Consequences
<Positive/negative trade-offs, operational impacts, cost and complexity, migration notes>

## Implementation Notes
<High-level tasks, rollout plan, feature flags, compatibility windows>

## Verification
<How we’ll verify success: metrics, tests, drills, acceptance gates>

## References
<Links to PoCs, benchmarks, standards, prior ADRs>

ADR Index (starter set)

ADR          | Title                                                                                              | Status   | Links
ADR-2025-001 | Tenancy Model & Guards (explicit tenant on every boundary; RLS; edition gating at Gateway/Policy)  | Accepted | Multitenancy
ADR-2025-002 | Event Bus & EDA Guarantees (at-least-once, idem keys, outbox/inbox, partitioning by tenantId[:sourceId]) | Accepted | Event-Driven Plan
ADR-2025-003 | Hash Chains + Signed Checkpoints (export manifests with proofs)                                    | Accepted | Integrity
ADR-2025-004 | Storage Tiering (Hot/Warm/Cold) & lifecycle enforcement                                            | Accepted | Storage Strategy
ADR-2025-005 | Schema Registry & Evolution Policy (additive-first, breaking via new subject/major)                | Accepted | Message Schemas
ADR-2025-006 | Gateway Versioning & Rate Limits (Problem+JSON, 429 with budgets)                                  | Proposed | API Gateway
ADR-2025-007 | Per-Tenant Export Concurrency Caps (bulkheads to protect Query SLOs)                               | Proposed | Reliability & Resilience
ADR-2025-008 | Policy as Code (classification/retention/redaction versioned; stamped on write)                    | Accepted | Compliance & Privacy

When an ADR is superseded, update the index and add a Superseded by ADR-… line at the top of the older record.

Governance Rules (what requires an ADR)

  • Domain contracts: REST/gRPC schemas, message subjects/schemas, webhook signatures.
  • Persistence/Index: new tables/indices, partitioning strategy changes, retention rules.
  • Security: identity model changes, key/crypto algorithms, breakout from Zero Trust defaults.
  • Platform: event bus/provider changes, region/residency posture, ingress/WAF changes.
  • SLO/Cost: material shifts in error budgets, telemetry retention, cost levers.

Minor refactors that do not alter contracts, SLOs, or posture can proceed without a new ADR, but must reference related ADRs in PRs.

Quality Gates (CI/CD)

  • Lint: ADR front-matter required (status, owners, links), broken links fail build.
  • Contract tests: PRs that touch /domain/contracts must reference an ADR.
  • Docs check: architecture/ pages must not link to Deprecated ADRs without also linking the successor.
  • Changelog: /reference/changelog.md automatically includes ADR titles on merge.

Traceability

  • Each section of this document references one or more ADRs; ADRs link back here via anchors.
  • Roadmap epics reference ADR IDs; production incidents include ADR references in the post-mortem template.

Cadence & Forums

  • Weekly architecture sync (triage new ADRs, status reviews).
  • Monthly risk/governance review aligning with Risks & Mitigations.
  • Quarterly compliance/controls review (SOC 2 evidence packs, DPIA triggers).

Testable Controls

  • Pipeline fails if a change to contracts/index/storage lacks an ADR reference.
  • Docs link checker: ADR anchors valid; “superseded” graph has no orphans.
  • Synthetic audit: sampled PRs verify that Problem+JSON responses include traceId and that release notes link the relevant ADRs.


Traceability to Roadmap

A compact matrix mapping this document's sections to roadmap Epics/Features and their artifact locations under /docs.

Section                                       | Roadmap Epic / Feature                           | Artifact (under /docs)
Purpose & Principles                          | AUD-ARC-001 / GOV                                | /docs/architecture/architecture.md#purpose
System Context (C4 L1)                        | AUD-ARC-001 / HLD                                | /docs/architecture/hld.md
Bounded Contexts & Context Map                | AUD-ARC-001 / DDD                                | /docs/architecture/context-map.md
Core Services & Containers (C4 L2)            | AUD-ARC-001 / HLD                                | /docs/architecture/hld.md
Component Boundaries (C4 L3)                  | AUD-ARC-001 / HLD                                | /docs/architecture/components.md
C06 – Event-Driven Communication Plan         | AUD-ARC-001 / HLD-T002                           | /docs/domain/events-catalog.md
C07 – Multitenancy & Tenancy Guards           | AUD-TENANT-001                                   | /docs/platform/multitenancy-tenancy.md
C08 – Security Architecture (Zero Trust)      | AUD-SECURITY-001, AUD-IDENTITY-001               | /docs/platform/security-compliance.md
C09 – Compliance & Privacy (GDPR/HIPAA/SOC2)  | AUD-COMPLIANCE-001                               | /docs/platform/privacy-gdpr-hipaa-soc2.md
C10 – Data Architecture Overview              | AUD-STORAGE-001, AUD-QUERY-001                   | /docs/architecture/data-model.md
C11 – Storage Strategy (summary)              | AUD-STORAGE-001                                  | /docs/implementation/persistence.md
C12 – Sequence Flows (append/query/export)    | AUD-INGEST-001, AUD-QUERY-001, AUD-EXPORT-001    | /docs/architecture/sequence-flows.md
C13 – Deployment Views (baseline)             | AUD-OPS-001 (DevOps & Envs)                      | /docs/architecture/deployment-views.md
C14 – API Gateway & Connectivity              | AUD-GATEWAY-001                                  | /docs/architecture/architecture.md#api-gateway-connectivity
C15 – Observability & SLOs                    | AUD-OTEL-001                                     | /docs/operations/observability.md
C16 – Reliability & Resilience                | AUD-CHAOS-001                                    | /docs/implementation/outbox-inbox-idempotency.md
C17 – Integrity & Tamper-Evidence             | AUD-INTEGRITY-001                                | /docs/hardening/tamper-evidence.md
C18 – SDK & Integration Guidance              | AUD-SDK-001                                      | /docs/sdk/
C19 – Risks & Mitigations                     | Governance cadence                               | /docs/architecture/architecture.md#risks-mitigations
C20 – ADR Index & Governance                  | ADR process                                      | /docs/adrs/
C21 – Traceability to Roadmap                 | Plan baseline                                    | /docs/planning/index.md