# Architecture Overview — Audit Trail Platform (ATP)
This document provides the top-level architectural narrative for ATP. It explains why the platform exists, the guiding principles that shape every decision, and how readers should navigate the rest of the architecture set. Deep dives live in sibling docs (HLD, Components, Data Model, Sequence Flows, Deployment Views) per the table of contents.
## Purpose
ATP is a secure, multi‑tenant audit and evidence platform that ingests, classifies, and stores immutable events from heterogeneous systems, making them queryable, exportable, and verifiable under strict compliance and data‑residency constraints. The architecture aims to:
- Provide tamper‑evident append‑only storage with verifiable integrity signals.
- Support high‑throughput ingestion and low‑latency query paths across tenants and editions.
- Embed privacy, classification, and retention controls by design (not as an afterthought).
- Offer clear integration contracts (REST/gRPC, events, webhooks) and tested SDKs.
- Expose operational transparency (OTel traces/logs/metrics) with SLO‑backed reliability.
- Remain cost‑aware and scalable, balancing hot/warm/cold storage and export patterns.
## Architectural Principles
The following principles are non‑negotiable guardrails. Each downstream decision (interfaces, storage, indexing, deployment) should cite the relevant principle(s).
- **Security-First & Zero-Trust** — Strong identity between workloads; least privilege; per-tenant and per-operation authorization; secrets/keys managed via KMS; encryption in transit and at rest.
- **Multi-Tenant Isolation by Default** — Tenant context is explicit in every interface and persisted form. Isolation uses a layered model (routing, authZ, RLS/filters, quotas, rate limits). No "best-effort" multi-tenancy.
- **Event-Driven by Design** — Ingestion, projection, and export are choreographed via durable messaging. Outbox/inbox and idempotency keys are mandatory on all critical paths.
- **Tamper-Evidence & Integrity** — Append-only semantics; chain-of-hash/signature strategies; evidence manifests that can be independently verified at export/eDiscovery time.
- **Compliance-by-Design** — Data classification, minimization, and retention are enforced at write time; residency controls and subject-rights operations (DSR) are planned and testable.
- **Observability-First** — Every request and message is traceable end-to-end with correlation/tenant/edition tags. Golden signals and error budgets are defined per service and reflected in SLOs.
- **Resilience & Back-Pressure** — Timeouts, retries with jitter, bulkheads, DLQs, and circuit breakers are applied consistently. Components are idempotent and safe to replay.
- **API-First & Contract-Driven** — REST/gRPC schemas and event contracts are versioned, linted, and backward compatible; producers and consumers are validated in CI with contract tests.
- **Scalability with Cost Discipline** — Scale out hot paths; project read models fit for purpose; apply storage tiering (hot/warm/cold) and export batching windows to protect SLOs and cost envelopes.
- **Simplicity & Paved Roads** — Prefer standard templates, libraries, and platform "paved roads" over bespoke solutions. Documentation and examples are treated as part of the product.
- **Governed Change & Traceability** — Architectural decisions are captured as ADRs; artifacts and flows are versioned; changes reference the driving SLOs, risks, and compliance requirements.
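The "Tamper-Evidence & Integrity" principle above can be illustrated with a minimal hash-chain sketch. This is a language-agnostic illustration in Python, not the actual Integrity service; the genesis value and the JSON canonicalization are assumptions for the example:

```python
import hashlib
import json

def chain_hash(prev_hash: str, record: dict) -> str:
    """Hash the previous link together with the canonicalized record."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + canonical).encode("utf-8")).hexdigest()

def build_chain(records: list[dict], genesis: str = "0" * 64) -> list[str]:
    """Each record's hash depends on every record before it."""
    hashes, prev = [], genesis
    for rec in records:
        prev = chain_hash(prev, rec)
        hashes.append(prev)
    return hashes

def verify_chain(records: list[dict], hashes: list[str], genesis: str = "0" * 64) -> bool:
    """Recompute every link; editing any earlier record breaks all later links."""
    return build_chain(records, genesis) == hashes

records = [{"recordId": "rec-1", "action": "UPDATE"},
           {"recordId": "rec-2", "action": "DELETE"}]
hashes = build_chain(records)
assert verify_chain(records, hashes)
records[0]["action"] = "READ"             # tamper with an earlier record...
assert not verify_chain(records, hashes)  # ...and verification fails
```

In practice the chain head would be periodically signed (via KMS) and included in evidence manifests so exports can be verified independently.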
## Audience & Reading Map
- Product/Delivery — Start here, then read High-Level Design.
- Developers — After this overview, see Components & Services and Contracts.
- Data/Sec/Compliance — Review Data Model, PII/Classification, and Privacy.
- SRE/Operations — See Observability, Alerts & SLOs, and Runbook.
## System Context (C4 L1)
ATP sits between event producers (first- and third-party systems) and consumers (operators, auditors, investigators, export/eDiscovery tooling). Clear trust boundaries ensure identity, tenancy, privacy, and integrity guarantees are preserved end-to-end.
### Primary Actors
- Event Producers — product microservices, backends, frontends, partner systems emitting audit events (REST/gRPC, webhooks).
- Human Users — Operators/SRE (runbooks, dashboards), Auditors/Investigators (query & export), Tenant Admins (policy & access).
- Automation — CI/CD agents and scheduled jobs exercising administrative APIs (policy/config rotation, projector rebuilds).
### External Systems
- Identity Provider (IdP) — OIDC/OAuth2 for users; workload identity for service-to-service.
- Key Management (KMS) — encryption and signing keys; rotation; optional customer-managed keys.
- Observability Stack — OTel collector, metrics/logs/traces backends, alerting.
- Export Destinations — object storage targets, legal hold archives, SIEM/DLP, ticketing (optional webhooks).
- Admin Surfaces — configuration/feature flags, policy repositories, schema registry.
### Trust Boundaries & Zones
- Ingress Boundary (Gateway) — authentication, tenancy resolution, request limits, schema validation, input sanitation.
- Processing Boundary (Messaging/EDA) — durable delivery, outbox/inbox, idempotency keys, replay safety, DLQ isolation.
- Data Boundary (Storage/Indexes) — encryption at rest, tenant-scoped access (RLS/filters), classification & retention enforcement.
- Admin/Control Plane — privileged operations, break-glass workflows, audited changes and ADR-linked approvals.
### Interface Surface (at a glance)

- Inbound
  - REST/gRPC: `/api/v{n}/audit/append`, `/api/v{n}/query/...`, `/api/v{n}/export/...`
  - Webhooks: signed callbacks for event ingestion (optional connectors)
- Outbound
  - Events: `AuditRecord.Appended`, `AuditRecord.Accepted`, `Projection.Updated`, `Export.Requested|Completed`
  - Webhooks: export completion, verification results (optional)
- Admin
  - Policies: classification/retention APIs, edition flags
  - Ops: projector lag, replay tooling, DLQ management
### Tenancy & Identity Propagation

- Tenancy is explicit on every call/message (`tenantId` claim/header); enforced in the gateway and persisted in all stores.
- Authorization is tenant-scoped (RBAC/ABAC) with edition gates; tokens are short-lived and audience-bound.
- Observability carries `tenantId`, `edition`, `traceId`, and `correlationId` across boundaries.
```mermaid
flowchart LR
    subgraph External
        EP[(Event Producers)]
        AUD[(Auditors/Investigators)]
        OPS[(Operators/SRE)]
        IDP[(IdP/OIDC)]
        KMS[(KMS)]
        OBS[(OTel/Logs/Metrics)]
        DST[(Exports / eDiscovery Destinations)]
    end

    GW[API Gateway<br/>AuthZ • Tenancy • Rate Limit • Schema]
    BUS[(Event Bus)]
    ING[Ingestion Service<br/>validate • classify • retain]
    INT[Integrity Service<br/>hash chains • signatures]
    PROJ[Projection Service<br/>read models • lag control]
    QRY[Query Service<br/>tenant-scoped filters]
    EXP[Export Service<br/>packages • manifests]

    EP -->|REST/gRPC/Webhooks| GW
    GW --> BUS
    BUS --> ING
    ING --> INT
    ING -->|append| STORE[(Append Store)]
    ING --> PROJ
    PROJ --> QRY
    QRY -->|results| AUD
    EXP -->|packages| DST
    GW --- IDP
    ING --- KMS
    INT --- KMS
    GW --- OBS
    ING --- OBS
    PROJ --- OBS
    QRY --- OBS
    EXP --- OBS
    AUD -->|requests| GW
    OPS -->|dashboards/runbooks| QRY
    QRY --> EXP
```
## Quality Attributes Anchored Here
- Security & Privacy — zero-trust ingress, least-privilege access, data minimization at write.
- Integrity — append-only semantics with verifiable chains/signatures.
- Scalability — bursty producers absorbed via durable messaging and back-pressure.
- Operability — golden signals and SLO budgets per service; traceability across all hops.
- Cost Awareness — hot/warm/cold tiers; export windows; per-tenant quotas and rate limits.
## Links

- → Context Map (`context-map.md`)
- → Components & Services (`components.md`)
- → Sequence Flows (`sequence-flows.md`)
- → Deployment Views (`deployment-views.md`)
## Bounded Contexts & Context Map
The ATP domain is split into cohesive bounded contexts that collaborate via well-defined contracts (REST/gRPC, events, webhooks). Each context owns its model, persistence, and decision logic; integration relies on Published Language and Open Host Service patterns, with Anti-Corruption Layers at external edges.
### Bounded Contexts (overview)
- **Gateway**
  - Responsibility — Ingress, authentication/authorization, tenancy resolution, rate limiting, schema validation, versioning.
  - Contracts — REST/gRPC (OHS), request/response schemas.
  - Notes — Enforces `tenantId` propagation and edition gates.
- **Ingestion**
  - Responsibility — Canonicalize & validate events, apply classification/retention at write, append to immutable store, emit acceptance signals.
  - Contracts — Consumes REST/gRPC/webhooks from Gateway (OHS); publishes `AuditRecord.Appended|Accepted`.
  - Notes — ACL for producer-specific payloads → canonical schema.
- **Policy**
  - Responsibility — Provide decisions for classification, retention, redaction; edition feature flags.
  - Contracts — Decision API (sync) + policy change events.
  - Notes — Customer–Supplier to Ingestion/Query/Export; Policy is the supplier.
- **Integrity**
  - Responsibility — Compute/verify hash chains & signatures; issue attestations and evidence manifests.
  - Contracts — Subscribes to append pipeline; exposes verify endpoints; emits `Integrity.Verified`.
  - Notes — Keys/rotation via KMS.
- **Projection**
  - Responsibility — Build/maintain read models and search indexes; track projector lag/watermarks; rebuild strategies.
  - Contracts — Subscribes to `AuditRecord.Accepted` (and deltas); publishes `Projection.Updated`.
  - Notes — Strict idempotency and replay safety.
- **Query**
  - Responsibility — Authorized, tenant-scoped retrieval over read models; policy-aware filtering/redaction.
  - Contracts — REST/gRPC; optional GraphQL facade (internal).
  - Notes — Surfaces selection for exports.
- **Search (optional)**
  - Responsibility — Full-text/time-range queries aligned with policy & tenancy.
  - Contracts — Reads projection feed; exposes search API.
  - Notes — May be disabled in small editions.
- **Export**
  - Responsibility — Package selections (from Query/Search) with manifests and signatures; manage delivery & legal hold.
  - Contracts — REST to request/stream; events `Export.Requested|Completed`; optional webhooks.
  - Notes — Throttled, resumable flows; long-running operations.
- **Admin/Control Plane**
  - Responsibility — Policies, schemas, feature flags, projector controls, replay/DLQ tooling.
  - Contracts — Admin APIs; audit of changes; ADR links in metadata.
  - Notes — Break-glass procedures with strict logging.
### Relationship Patterns
- Gateway → Ingestion — Open Host Service with a versioned Published Language.
- Ingestion → Policy — Customer–Supplier (Ingestion conforms to Policy’s decisions).
- Ingestion → Integrity — Conformist to integrity calculation rules; emits materials for verification.
- Ingestion → Projection — Event choreography via durable topics.
- Projection → Query/Search — Published Language for read models/index projections.
- Query/Search → Export — Customer–Supplier (Export depends on Query’s selection semantics).
- External Producers → Gateway — Anti-Corruption Layer at Ingestion to canonicalize.
### Context Map (mermaid)

```mermaid
flowchart LR
    subgraph External Producers
        P1[(Product Services)]
        P2[(3rd-Party Systems)]
    end

    GW[Gateway<br />OHS + Published Language]
    ING[Ingestion<br />ACL + Canonicalizer]
    POL[Policy<br />Decisions: classify/retain/redact]
    INT[Integrity<br />Hash Chains/Signatures]
    PROJ[Projection<br />Read Models/Indexes]
    QRY[Query<br />Tenant-Scoped Retrieval]
    SRCH[Search<br />Optional Index API]
    EXP[Export<br />Packages + Manifests]
    ADM[Admin/Control Plane]

    P1 -->|REST/gRPC/Webhooks| GW
    P2 -->|REST/gRPC/Webhooks| GW
    GW --> ING
    ING -->|decisions| POL
    POL -->|policy changes| ING
    POL --> QRY
    POL --> EXP
    ING -->|append accepted| INT
    ING -->|events| PROJ
    PROJ --> QRY
    PROJ --> SRCH
    QRY -->|selection| EXP
    SRCH -->|selection| EXP
    ADM --- POL
    ADM --- PROJ
    ADM --- EXP

    classDef c fill:#F4F7FF,stroke:#5B6,stroke-width:1px,rx:6,ry:6;
    class GW,ING,POL,INT,PROJ,QRY,SRCH,EXP,ADM c;
```
### Contract Snapshots (at a glance)

- **Events (Published Language)**
  - `AuditRecord.Appended` → canonical event submitted (pre-commit checks passed)
  - `AuditRecord.Accepted` → persisted + classified + retained; integrity material ready
  - `Projection.Updated` → read model/index segment advanced (with watermark)
  - `Export.Requested | Export.Completed` → export lifecycle
  - `Integrity.Verified` → attestations for records/segments/packages
- **APIs (Open Host Service)**
  - `POST /api/v{n}/audit/append` — Gateway → Ingestion (tenant/edition required)
  - `GET /api/v{n}/query/...` — tenant-scoped search & retrieval
  - `POST /api/v{n}/export/...` — create/stream export packages
  - `POST /api/v{n}/policy/evaluate` — (internal) policy decision snapshot
  - `POST /api/v{n}/integrity/verify` — verify record/segment/export
### Modeling Notes

- Ubiquitous Language — “audit record”, “evidence chain”, “manifest”, “projection lag”, “selection set” are domain terms; see domain language.
- Idempotency Keys — `(tenantId, sourceId, sequence|hash)` for all ingestion paths.
- Tenancy — Always explicit; persisted on write; filtered on read; included in traces.
- Edition Awareness — Feature gates live in Gateway/Policy; never UI-only gates.
## Links

- → Context Map (detailed) (`context-map.md`)
- → Events Catalog (`../domain/events-catalog.md`)
- → Message Schemas (`../domain/contracts/message-schemas.md`)
- → REST APIs (`../domain/contracts/rest-apis.md`)
- → Ubiquitous Language (`../domain/ubiquitous-language.md`)
## Core Services & Containers (C4 L2)
This section presents the container view of ATP: the runtime building blocks (services and infrastructure) and their responsibilities, interfaces, data boundaries, and operational concerns. It complements the domain view by focusing on how capabilities are realized in deployable components.
### Container View (diagram)

```mermaid
flowchart LR
    subgraph Edge
        G[API Gateway<br />AuthN/Z • Tenancy • RL • Versioning]
    end
    subgraph App Plane
        ING[Ingestion Service<br />validate • classify • retain]
        POL[Policy Service<br />classify • retain • redact • edition]
        INT[Integrity Service<br />hash chains • signatures • attest]
        PROJ[Projection Service<br />read models • indexes • lag]
        QRY[Query Service<br />search • filters • redaction]
        SRCH[Search Service - optional<br />full-text • time-range]
        EXP[Export Service<br />packages • manifests • deliver]
        ADM[Admin/Control Plane<br />schemas • flags • replay • DLQ]
    end
    subgraph Data Plane
        APPEND[(Append Store<br />append-only)]
        READ[(Read Models / Indexes)]
        COLD[(Cold Archive / eDiscovery)]
    end
    subgraph Platform
        BUS[(Event Bus / Topics)]
        KMS[(KMS / Secrets)]
        OTL[(OTel Collector / Logs / Metrics / Traces)]
        IDP[(IdP / OIDC)]
    end

    G -->|REST/gRPC| ING
    G -->|REST/gRPC| QRY
    QRY -->|select| EXP
    ING -->|decisions| POL
    POL --> QRY
    POL --> EXP
    ING -->|append| APPEND
    ING -->|events| PROJ
    PROJ --> READ
    SRCH --> READ
    QRY --> READ
    EXP -->|packages| COLD
    ING -.->|materials| INT
    INT -.->|verify| EXP
    G --- IDP
    ING --- BUS
    PROJ --- BUS
    EXP --- BUS
    ING --- KMS
    INT --- KMS
    G --- OTL
    ING --- OTL
    PROJ --- OTL
    QRY --- OTL
    EXP --- OTL

    classDef svc fill:#F4F7FF,stroke:#7aa6ff,stroke-width:1px,rx:6,ry:6;
    classDef plat fill:#fafafa,stroke:#c8c8c8,stroke-width:1px,rx:6,ry:6;
    class G,ING,POL,INT,PROJ,QRY,SRCH,EXP,ADM svc;
    class BUS,KMS,OTL,IDP,APPEND,READ,COLD plat;
```
### Containers (one-liners)
- API Gateway — Central ingress; AuthN/Z, tenancy resolution, rate limiting, schema & versioning.
- Ingestion — Validates/canonicalizes events, applies classification/retention, writes append-only, emits acceptance.
- Policy — Synchronously answers classification/retention/redaction queries; manages edition feature gates.
- Integrity — Maintains hash chains and signatures; exposes verification APIs; issues evidence manifests.
- Projection — Builds read models and indexes; tracks watermarks/lag; supports rebuilds.
- Query — Tenant-scoped retrieval with policy-aware filtering and redaction; selection for exports.
- Search (optional) — Full-text/time-range over projected data while respecting policy/tenancy.
- Export — Packages selections + manifests, supports resumable streaming, and optional webhooks.
- Admin/Control Plane — Config, feature flags, schema registry, DLQ/replay, break-glass ops.
### Responsibilities Matrix
| Container | Purpose | Key Interfaces | Data Boundary | Scaling / Resilience | SLO Hints |
|---|---|---|---|---|---|
| API Gateway | Ingress, authZ, tenancy, RL, versioning | REST/gRPC; JWT/mTLS | N/A (stateless) | HPA on RPS; 429 for back-pressure | p95 auth+route ≤ X ms |
| Ingestion | Validate, classify, retain, append | REST/gRPC from Gateway; publishes events | Writes to Append Store | Outbox; consumer concurrency; DLQ | p95 append ≤ X ms; accept rate |
| Policy | Decisions: classify/retain/redact; edition | Sync decision API; policy change events | Policy store (read-mostly) | Cache + TTL; fallback modes | p95 decision ≤ Y ms |
| Integrity | Hash/sign; attest/verify | Verify API; subscribes to append | Integrity material store | Idempotent; replay-safe | verify p95 ≤ Z ms |
| Projection | Build read models / indexes | Subscribes to accepted events | Read Models/Indexes | Rebuild tooling; lag caps | projector lag ≤ N s |
| Query | Tenant-scoped retrieval, redaction | REST/gRPC; (opt) GraphQL | Read-only over Read Models | Cache; rate-limit; bulkheads | query p95/p99 targets |
| Search (opt) | Full-text/time queries | REST/gRPC | Search index | Async refresh; ISR windows | query p95 ≤ T ms |
| Export | Package & deliver w/ manifests | REST stream; events; (opt) webhook | Streams from Read/Cold | Resumable; batch windows | completion p95 for M rec |
| Admin | Policies, flags, replay, DLQ, schemas | Admin REST/gRPC | Control metadata | Strong audit; approvals | admin ops audited |
### Data Containers
- Append Store (hot) — Append-only write path; short retention for high-QPS ingestion and near-term verification.
- Read Models / Indexes (warm) — Denormalized projections tailored to query/search; rebuildable from events.
- Cold Archive (cold) — Long-term eDiscovery/export storage; immutability and legal-hold compatible.
Data tiering and shapes are detailed in Data Model (`data-model.md`) and Deployment Views (`deployment-views.md`).
### Interface Summary

- Inbound: `POST /api/v{n}/audit/append`, `GET /api/v{n}/query/...`, `POST /api/v{n}/export/...`
- Events: `AuditRecord.Appended`, `AuditRecord.Accepted`, `Projection.Updated`, `Export.Requested|Completed`, `Integrity.Verified`
- Admin: `.../policy/*`, `.../admin/replay`, `.../admin/dlq`, `.../admin/schema`
### Cross-Cutting (applies to all services)

- Tenancy: `tenantId` is mandatory at ingress and persisted end-to-end; enforced by gateway/middleware and data filters.
- Security: Zero-trust defaults; mTLS in mesh; KMS-managed keys; short-lived tokens with audience/scope.
- Observability: OTel traces/logs/metrics with `tenantId`, `edition`, `traceId`, `correlationId`; golden signals per service.
- Resilience: Outbox/inbox, idempotent consumers, configured retries with jitter, circuit breakers, DLQ + replay tooling.
- Cost: Rate limits/quotas per tenant; projection/index retention; export batching windows.
## Links

- → Components & Services (deep-dive) (`components.md`)
- → Data Model (`data-model.md`)
- → Deployment Views (`deployment-views.md`)
- → REST APIs (`../domain/contracts/rest-apis.md`)
- → Message Schemas (`../domain/contracts/message-schemas.md`)
## Component Boundaries (C4 L3)
This section details the internal structure of each service using a Clean Architecture variant:
- API (adapters) — HTTP/gRPC endpoints, webhook receivers.
- Application (use-cases) — orchestration, policies, idempotency, transactions.
- Domain (model) — aggregates, value objects, domain services.
- Infrastructure (adapters) — persistence, messaging, cache, KMS, observability.
Dependency Rule: API → Application → Domain, with Domain independent of frameworks. Infrastructure depends inward via ports (interfaces) declared in Application/Domain.
### Reference Component Map (Ingestion service)

```mermaid
flowchart LR
    subgraph API Layer
        Ctrl[AppendController<br />REST/gRPC]
        Hook[WebhookReceiver]
    end
    subgraph Application Layer
        UC[AppendAuditRecordUseCase]
        Pol[PolicyClient Port]
        Repo[AuditRecordRepository Port]
        Outb[Outbox Port]
        Idem[IdempotencyService]
        Val[SchemaValidator]
    end
    subgraph Domain Layer
        Agg[AuditRecord Aggregate]
        Clas[Classification VO]
        Ret[RetentionPolicy VO]
        Sig[IntegrityMaterial VO]
        DS[Domain Services]
    end
    subgraph Infrastructure Adapters
        RepoImpl[(NHibernate Repo)]
        Bus[(Message Broker Adapter)]
        KMS[(KMS Adapter)]
        Cache[(Cache Adapter)]
        OTel[(OTel Adapter)]
    end

    Ctrl --> UC
    Hook --> UC
    UC --> Val
    UC --> Pol
    UC --> Repo
    UC --> Outb
    UC --> Idem
    UC --> Agg
    Agg --> Clas
    Agg --> Ret
    Agg --> Sig
    RepoImpl -.implements.-> Repo
    Bus -.implements.-> Outb
    KMS -.used by.-> DS
    OTel -.used by.-> UC
    Cache -.used by.-> Pol
```
**Flow (happy path)**

1. `AppendController` validates auth/tenancy → calls `AppendAuditRecordUseCase`.
2. The use-case validates the schema, checks idempotency, and queries `PolicyClient` for classification/retention.
3. The aggregate enforces invariants and produces domain events.
4. `Repository` persists the append-only record + outbox entry.
5. The outbox relays an `AuditRecord.Accepted` event; OTel spans/logs are recorded.
### Per-Service Component Boundaries
| Service | API Adapters | Application (Use-cases) | Domain (Aggregates/VO) | Outbound Ports | Infra Adapters |
|---|---|---|---|---|---|
| Gateway | Minimal APIs/gRPC, auth filters, tenancy middleware | RouteResolution, RateLimitCheck, SchemaGate | N/A (edge orchestration) | PolicyDecision, TokenIntrospection, RateLimit | OIDC/JWT, RateLimiter, Schema Registry |
| Ingestion | AppendController, WebhookReceiver | AppendAuditRecord, ClassifyAndRetain, AcceptEvent | AuditRecord, Classification, Retention, IntegrityMaterial | PolicyClient, AuditRecordRepository, Outbox | NHibernate, Broker, KMS, OTel |
| Policy | PolicyController | EvaluateClassification, EvaluateRetention, EvaluateRedaction | PolicySet, Rule, Decision | PolicyChangePublisher | PolicyStore (read-mostly), Broker, Cache |
| Integrity | VerifyController | ComputeChain, SignSegment, VerifyEvidence | EvidenceChain, Signature, Checkpoint | IntegrityRepo, KmsSigner, VerifyResultPublisher | NHibernate, KMS, Broker |
| Projection | (Internal) Health/Control endpoints | ApplyDelta, RebuildProjection, ManageLag | ReadModelSegment, Watermark | ProjectionRepo, LagMetrics | NHibernate/Read-store, Broker, OTel |
| Query | QueryController, (opt) GraphQL | ExecuteQuery, ApplyPolicyFilters, RedactFields | SelectionSet, RedactionPlan | ReadModelRepo, PolicyClient | Read-store adapters, Cache |
| Search (opt.) | SearchController | ExecuteSearch, BuildQueryPlan | SearchQuery, Facet, TimeRange | IndexRepo | Search index adapter |
| Export | ExportController, Webhook for completion | CreatePackage, StreamPackage, BuildManifest | ExportPackage, Manifest, SelectionRef | ReadModelRepo, ColdStoreSink, IntegrityVerifier | Object storage, Broker, KMS |
| Admin/Control | AdminController | UpdatePolicy, ReplayDLQ, Rebuild, FeatureToggle | AdminAction, Approval | DLQClient, FeatureFlagRepo | Broker Mgmt, Flags Store |
Ports live in Application/Domain and are implemented by Infrastructure. Tests use in-memory fakes against these ports to keep boundaries fast and hermetic.
### Ports & Adapters (canonical interfaces)

```csharp
// Application ports (examples)
public interface IPolicyClient {
    Task<PolicyDecision> EvaluateAsync(AppendContext ctx, CancellationToken ct);
}

public interface IAuditRecordRepository {
    Task AppendAsync(AuditRecord record, CancellationToken ct);
}

public interface IOutbox {
    Task EnqueueAsync<T>(T evt, CancellationToken ct);
}

public interface IReadModelRepository {
    Task<QueryResult> ExecuteAsync(QuerySpec spec, CancellationToken ct);
}

public interface IIntegrityVerifier {
    Task<VerifyResult> VerifyAsync(VerifyTarget target, CancellationToken ct);
}
```
- Inbound adapters: Minimal APIs, gRPC services, webhook receivers.
- Outbound adapters: NHibernate repos, MassTransit/Azure Service Bus producers/consumers, KMS wrappers, cache providers, OTel exporters.
### Boundary Rules (enforced)
- Tenancy is a first-class parameter in all ports and aggregate constructors (no ambient singletons).
- Idempotency: all write use-cases accept an `IdempotencyKey`; repositories guarantee upsert-or-noop semantics.
- Transactions: the application layer coordinates a transactional outbox; the domain emits events, and infrastructure persists the record + outbox entry atomically.
- Validation: API does syntactic checks; Application performs semantic validation and policy evaluation; Domain enforces invariants only.
- Error Mapping:
  - Domain validation → `400`/`INVALID_ARGUMENT`
  - AuthZ/tenancy violations → `403`
  - Rate-limit/back-pressure → `429` (with retry hints)
  - Transient infra → retried with jitter; eventual DLQ after N attempts
- No framework leakage into Domain (no HTTP types, no EF/NHibernate entities, no broker types).
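The transactional-outbox rule above can be sketched as follows. This is an illustrative Python sketch, not the platform's C# implementation: `FakeTransaction` stands in for a real database transaction, and every name here is an assumption made for the example.

```python
class FakeTransaction:
    """Stand-in for a DB transaction: record and outbox entry commit atomically."""
    def __init__(self, store: list, outbox: list) -> None:
        self.store, self.outbox = store, outbox
        self._staged_records, self._staged_events = [], []

    def append(self, record: dict) -> None:
        self._staged_records.append(record)

    def enqueue(self, event: dict) -> None:
        self._staged_events.append(event)

    def commit(self) -> None:
        # both writes land together, or neither does
        self.store.extend(self._staged_records)
        self.outbox.extend(self._staged_events)

def append_audit_record(record: dict, store: list, outbox: list) -> None:
    tx = FakeTransaction(store, outbox)
    tx.append(record)  # append-only write
    tx.enqueue({"subject": "audit.accepted", "recordId": record["recordId"]})
    tx.commit()        # atomic: record + outbox entry

store, outbox = [], []
append_audit_record({"recordId": "rec-1", "tenantId": "t-9d8e"}, store, outbox)
# a separate relay process later publishes outbox entries to the broker
assert len(store) == 1 and outbox[0]["subject"] == "audit.accepted"
```

The point of the pattern: the broker publish happens *after* commit, from the persisted outbox row, so a crash between "persist" and "publish" can never lose or invent an event.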
### Concurrency & Scaling
- API: stateless; scale on RPS; protect with rate limits and request budgets.
- Consumers/Projectors: concurrency tuned per partition/shard; back-pressure from queue length & lag metrics.
- Export: long-running streams; resumable with checkpoints; separated pool to avoid starving Query.
### Observability & Policies (per boundary)

- Span model: `gateway → use-case → repo/outbox → broker → projector/query/export`.
- Attributes: `tenantId`, `edition`, `idempotencyKey`, `policyVersion`, `watermark`.
- Logs: structured; sensitive fields redacted by classification tags.
- Metrics: use-case latency, outbox age, projector lag, export TTFB/completion.
- Policies as code: policy version stamped on write; propagated in events; evaluated again on read/export.
### Technology Anchors

- Runtime: .NET 9, REST APIs/gRPC.
- Persistence: NHibernate (write/read models), repositories per aggregate.
- Messaging: MassTransit over Azure Service Bus (topics/queues, DLQ).
- Security: OIDC, mTLS (mesh), KMS envelope encryption & signing.
- Telemetry: OpenTelemetry traces/logs/metrics; dashboards per service.
- Testing: unit tests against ports; contract tests for REST/events; projector/replay integration tests.
### Example: Error-to-Protocol Mapping
| Boundary | Failure | Handling | Client Signal |
|---|---|---|---|
| API ingress | Schema invalid | 400 + error codes | X-Request-Id |
| Application | Policy denies write | 403 | Problem+JSON body |
| Repository | Unique duplicate (idempotent) | 200 (noop) | Idempotent: true |
| Broker | Publish timeout | retry+jitter → outbox relay | n/a (internal) |
| Export stream | Client disconnect | checkpoint + resume | Range support |
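The table above can be condensed into a small mapping sketch. The exception class names here are hypothetical illustrations (the real services are C#); the status codes come from the table:

```python
class DomainValidationError(Exception): ...
class TenancyViolation(Exception): ...
class BackPressure(Exception): ...
class DuplicateIdempotentWrite(Exception): ...

# boundary failure → client signal, per the mapping table
STATUS_MAP = {
    DomainValidationError: 400,    # schema/invariant failures
    TenancyViolation: 403,         # authZ / cross-tenant access denied
    BackPressure: 429,             # rate limit; include retry hints
    DuplicateIdempotentWrite: 200, # idempotent replay is a successful no-op
}

def to_status(exc: Exception) -> int:
    for exc_type, status in STATUS_MAP.items():
        if isinstance(exc, exc_type):
            return status
    return 500  # transient/unknown infra errors surface as 500 after retries

assert to_status(TenancyViolation()) == 403
assert to_status(BackPressure()) == 429
assert to_status(RuntimeError()) == 500
```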
## Links

- → Components & Services (deep-dive) (`components.md`)
- → Messaging & Outbox (`../implementation/messaging.md`)
- → Outbox/Inbox & Idempotency (`../implementation/outbox-inbox-idempotency.md`)
- → Data Model (`data-model.md`)
- → REST APIs (`../domain/contracts/rest-apis.md`)
- → Message Schemas (`../domain/contracts/message-schemas.md`)
- → Security & Compliance (`../platform/security-compliance.md`)
## Event-Driven Communication Plan
ATP is event-driven by design. Events are the backbone for ingestion acceptance, integrity processing, projection, and exports. We target at-least-once delivery with exactly-once effects via idempotency keys, transactional outbox/inbox, and idempotent consumers. Ordering is scoped, never global.
### Patterns & Guarantees

- Delivery: at-least-once from the broker; consumers must be idempotent.
- Exactly-once intent: `(tenantId, sourceId, sequence|hash)` as the IdempotencyKey; repository upsert/no-op semantics.
- Ordering: guaranteed only within a partition key (e.g., `tenantId:sourceId`). Do not assume global order.
- Back-pressure: rate limits at the gateway, consumer concurrency caps, queue depth alerts, retry with jitter, DLQs.
- Replay safety: projectors/exporters are replay-tolerant; watermarks control catch-up.
### Topic & Subscription Topology (logical)

| Subject (topic) | Producers | Consumers | Purpose |
|---|---|---|---|
| `audit.appended` | Ingestion | Integrity, Projection | Raw append accepted for downstream processing |
| `audit.accepted` | Ingestion | Projection, Search (opt) | Persisted + classified + retained |
| `projection.updated` | Projection | Query, Search (opt), Export | Read models advanced; watermark/lag hints |
| `export.requested` | Query, API | Export | Start packaging workflow |
| `export.completed` | Export | API/Webhooks, Integrity (opt verify) | Notify completion; attach manifest |
| `integrity.verified` | Integrity | API, Export (attach to packages) | Attest records/segments/packages |
| `policy.changed` | Admin/Policy | Ingestion, Query, Export | Cache bust + version pin updates |

Each topic has named subscriptions per service (e.g., `projection-svc`, `export-svc`) and a DLQ (`<topic>.<subscription>.dlq`).
### Message Contracts (snapshot)
We publish a Published Language with clear evolution rules (see Message Schemas).
```json
{
  "eventId": "01J8X3TB2Z9WQ6M9P3E2A4K7QG",
  "subject": "audit.accepted",
  "schemaVersion": "1.2.0",
  "occurredAt": "2025-10-22T05:12:31Z",
  "tenantId": "t-9d8e...",
  "edition": "enterprise",
  "producer": {
    "service": "ingestion",
    "instance": "ingestion-7fcb9f",
    "region": "eus2"
  },
  "correlation": {
    "traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
    "correlationId": "c-8e7f...",
    "causationId": "evt-01J8X3..."
  },
  "idempotencyKey": "t-9d8e:src-abc:seq-00000045",
  "policy": {
    "classification": "SENSITIVE",
    "retentionPolicyId": "rp-2025-01",
    "policyVersion": 17
  },
  "payload": {
    "recordId": "rec-01H...",
    "sourceId": "src-abc",
    "hash": "sha256:...",
    "sizeBytes": 5342,
    "attributes": { "actorId": "u-123", "action": "UPDATE", "resource": "Order/4711" }
  }
}
```

ULIDs are preferred for `eventId`.
**Headers (transport)**

- `content-type: application/json; charset=utf-8`
- `x-tenant-id`, `x-edition`
- `traceparent`, `tracestate`
- `x-idempotency-key`, `x-schema-version`
- `x-classification`, `x-policy-version`

No PII in headers. Payload fields carrying PII are classified and redacted on sinks/logs.
### Evolution & Compatibility
- Additive changes → bump minor (1.1 → 1.2); fields are optional by default.
- Breaking changes → new subject or major bump with side-by-side consumers.
- Schema registry: producers/consumers validated in CI; contract tests block incompatible change.
- Deprecations: announce N releases ahead; dual-publish during migration windows.
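A consumer-side compatibility check implied by these rules might look like the following sketch (simplified semver parsing; the "same major required, additive minors accepted" policy follows the rules above, and the function names are assumptions):

```python
def parse(version: str) -> tuple[int, int]:
    """Split 'major.minor.patch' into (major, minor); patch is ignored here."""
    major, minor, *_ = version.split(".")
    return int(major), int(minor)

def can_consume(consumer_knows: str, message_version: str) -> bool:
    """Same major is required; a newer minor is additive, so unknown fields are ignored."""
    c_major, _ = parse(consumer_knows)
    m_major, _ = parse(message_version)
    return c_major == m_major

assert can_consume("1.1.0", "1.2.0")      # additive minor bump: compatible
assert not can_consume("1.2.0", "2.0.0")  # breaking major bump: side-by-side consumers needed
```

In CI, the schema registry enforces the same rule structurally (contract tests), so this runtime check is a last line of defense, not the primary gate.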
### Choreography (happy path)

```mermaid
sequenceDiagram
    autonumber
    participant P as Producer
    participant GW as API Gateway
    participant ING as Ingestion
    participant BUS as Event Bus
    participant INT as Integrity
    participant PROJ as Projection
    participant QRY as Query
    participant EXP as Export

    P->>GW: POST /audit/append (tenant, idempotencyKey)
    GW->>ING: Append command
    ING->>ING: validate + classify + retain + persist
    ING->>BUS: publish audit.appended
    ING->>BUS: publish audit.accepted
    BUS-->>INT: audit.appended
    INT->>BUS: integrity.verified (optional, later)
    BUS-->>PROJ: audit.accepted
    PROJ->>BUS: projection.updated (watermark)
    QRY->>QRY: read models current for queries
    QRY->>BUS: export.requested (on demand)
    BUS-->>EXP: export.requested
    EXP->>BUS: export.completed (manifest, links)
```
### Partitioning & Ordering

- Partition key: `tenantId` (default). For high-volume producers, prefer `tenantId:sourceId`.
- Ordering guarantees: within a partition, best-effort FIFO; consumers must handle reordering and duplicates.
- Large tenants: increase consumer concurrency; shard by `sourceId` to avoid hot partitions.
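A sketch of stable partition assignment for the keys above (Python for illustration; the hash choice and the partition count are assumptions, not what the broker actually does):

```python
import hashlib

def partition_for(key: str, partitions: int) -> int:
    """Stable partition assignment; all events with the same key stay ordered."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partitions

tenant, source = "t-9d8e", "src-abc"
default_key = tenant                # default: per-tenant ordering
sharded_key = f"{tenant}:{source}"  # high-volume tenant: per-source shards

# same key → same partition, every time; different sources may land on
# different partitions, which is exactly what spreads a hot tenant out
assert partition_for(sharded_key, 16) == partition_for(sharded_key, 16)
```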
### Reliability & Retry

- Producers: exponential backoff with jitter; bounded retries; overflow to outbox for relay.
- Consumers: N (e.g., 5) attempts with backoff → DLQ; DLQ triage dashboards and replay tools.
- Inbox de-dup: store an `(eventId|idempotencyKey)` receipt for M days; drop duplicates.
- Poison messages: dead-letter with a diagnostic envelope (error code, stack, schema version, payload hash).
### Security & Privacy
- Authz on topics: least privilege per service identity; publish/subscribe explicit.
- PII discipline: only classified fields in payload; never in subjects/headers/topic names.
- Encryption: in transit (TLS) and at rest (broker & stores); sensitive attachments use KMS envelope encryption.
- Signature material: events may carry content digests; Integrity service signs chains and manifests.
### Observability

- Tracing: propagate `traceparent`/`correlationId`; spans from GW→ING→BUS→INT/PROJ→QRY/EXP.
- Metrics: publish rate, error rate, outbox age, queue depth, consumer lag, DLQ count, replay count.
- Logs: structured, redacted by classification; include `tenantId`, `edition`, `eventId`.
Replay & Backfills¶
- Controlled replay: per-tenant/window; watermark caps; idempotent processing mandated.
- Runbook: identify root cause, drain DLQ, run replay job with dry-run, monitor lag & budgets.
- Projection rebuilds: snapshot + replay strategy; export paused/throttled during rebuild windows.
Links¶
- → Events Catalog (`../domain/events-catalog.md`)
- → Message Schemas (`../domain/contracts/message-schemas.md`)
- → REST APIs (`../domain/contracts/rest-apis.md`)
- → Messaging (impl) (`../implementation/messaging.md`)
- → Outbox/Inbox & Idempotency (impl) (`../implementation/outbox-inbox-idempotency.md`)
- → Observability (`../operations/observability.md`)
- → Runbook (`../operations/runbook.md`)
Multitenancy & Tenancy Guards (overview)¶
ATP is multi-tenant by default. Tenancy must be explicit, verifiable, and enforced at every boundary: ingress → messaging → storage → observability → exports. Edition flags refine capability exposure per tenant without weakening isolation.
Tenancy Model & Terms¶
- Tenant — an organizational boundary; primary key `tenantId` present in every API call, event, and persisted row.
- Edition — capability set for a tenant (e.g., Standard/Enterprise); evaluated at Gateway and Policy.
- Tenant Context — normalized, immutable tuple carried across layers: `{ tenantId, edition, authz, policyVersion, correlationId }`.
Isolation Layers (defense-in-depth)¶
- Ingress — require `tenantId`; validate edition gates; enforce per-tenant rate limits/quotas and payload limits.
- AuthZ — tenant-scoped RBAC/ABAC; tokens must include a tenant claim; cross-tenant operations are rejected.
- Messaging — partition keys include `tenantId` (or `tenantId:sourceId`); topic ACLs scoped to service identity.
- Storage — `tenantId` is part of every key; row-level filters (or RLS) applied in repositories; encryption scope may be per tenant.
- Cache — keys are tenant-scoped; no global caches for sensitive projections.
- Observability — traces/logs/metrics labeled with `tenantId` and `edition`; logs redact PII.
- Operations — per-tenant runbooks: suspend, throttle, replay, export, legal hold, and delete (where allowed).
Tenancy Propagation (canonical)¶
- Authenticate at the Gateway; extract `tenantId` from the token (preferred). Accept `X-Tenant-Id` only for trusted workload identities; normalize into the Tenant Context.
- Attach the Tenant Context to trace and message headers; persist it with write operations.
- Verify the Tenant Context at the service boundary (middleware) and inside each use-case.
- Refuse requests if the tenant is missing or mismatched; never infer the tenant from data.
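The propagation rules above can be condensed into a boundary guard. A hedged sketch (the type, field, and exception names are illustrative, not ATP's actual API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TenantContext:
    """Immutable tuple carried across layers (illustrative shape)."""
    tenant_id: str
    edition: str
    policy_version: int
    correlation_id: str

class TenantMismatch(Exception):
    """Raised when the tenant is missing or does not match the resource."""

def require_tenant(ctx: TenantContext, resource_tenant_id: str) -> TenantContext:
    """Boundary guard: refuse missing/mismatched tenants; never infer
    the tenant from the data being accessed."""
    if not ctx.tenant_id:
        raise TenantMismatch("missing tenantId in context")
    if ctx.tenant_id != resource_tenant_id:
        raise TenantMismatch("cross-tenant access rejected")
    return ctx
```

The same check runs in middleware and again inside each use-case, matching the defense-in-depth layering above.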
sequenceDiagram
autonumber
participant C as Client
participant GW as API Gateway
participant ING as Ingestion (Use-case)
participant DB as Append Store
Note over C: JWT contains tenantId=acme<br />edition=enterprise
C->>GW: POST /audit/append (JWT, X-Idempotency-Key)
GW->>GW: validate JWT, resolve tenant/edition, rate-limit
GW->>ING: Append(cmd, TenantContext)
ING->>ING: check authZ & edition gates
ING->>DB: write (tenantId, record, policyVersion)
ING-->>GW: 202 Accepted (traceId, tenantId)
Data Isolation Patterns¶
- Single DB, tenant-partitioned — default; `tenantId` in composite keys, enforced repository filters, and query specs.
- Per-tenant schema — optional for very large tenants; keep API/contracts identical; infra complexity increases.
- Encryption — at rest; keys can be rotated globally or per tenant; evidence manifests do not expose keys.
- Backups/Restore — support scoped restore by tenant or time window; integrate with legal hold.
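As an illustration of the default pattern, a repository helper that applies the tenant filter unconditionally; in a real deployment this would be a SQL/RLS predicate, so the in-memory form below is a sketch with hypothetical names:

```python
def tenant_scoped_query(records: list[dict], tenant_id: str, **filters) -> list[dict]:
    """Repository helper: the tenant filter is applied before any other
    predicate, so a wrong tenant simply returns zero rows (isolation),
    and a missing tenant is rejected outright."""
    if not tenant_id:
        raise ValueError("queries without tenantId are rejected")
    scoped = [r for r in records if r.get("tenantId") == tenant_id]
    for key, value in filters.items():
        scoped = [r for r in scoped if r.get(key) == value]
    return scoped
```

Cross-tenant fixtures returning zero rows is exactly the "failure signal" the guardrails checklist below expects.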
Messaging & Events¶
- Partitioning by `tenantId` (or `tenantId:sourceId`) ensures localized ordering and hot-shard control.
- Headers carry `tenantId`, `edition`, `traceparent`; the payload contains `tenantId` for de-dup/inbox.
- No cross-tenant joins in consumers or projectors; selection for export is tenant-scoped.
Edition Gating¶
- Gateway rejects routes not enabled for the tenant’s edition.
- Policy returns decisions (e.g., advanced retention) only if edition allows; Query/Export re-evaluate gates to prevent confused-deputy issues.
Guardrails (checklist)¶
| Boundary | Guard | Enforced by | Failure signal |
|---|---|---|---|
| Ingress | `tenantId` required + edition check | Gateway middleware | 400/403 |
| Ingress | Per-tenant RL/quotas | Gateway | 429 (+ retry hints) |
| Use-case | Tenant/edition validation | Application layer | 403 |
| Repo | Tenant filter / RLS | Repository/ORM | No rows (isolation) |
| Events | Tenant partition key | Producer/Outbox | Reject publish if missing |
| Cache | Tenant-scoped keys | Cache adapter | N/A (internal) |
| Logs | Redact PII; add tenant labels | Logging pipeline | N/A |
| Export | Tenant-scoped selection & manifest | Export service | 403 on cross-tenant |
Threats & Mitigations¶
- Tenant spoofing → accept tenant only from validated token/identity; normalize once at Gateway.
- Confused deputy → re-evaluate edition/ABAC on every read/export; never trust upstream UI.
- Noisy neighbor → per-tenant limits on RPS, storage, concurrent exports, and projector throughput.
- Data bleed → repository filters + contract tests; synthetic cross-tenant tests in CI.
Testable Controls¶
- Contract tests: every API requires `tenantId` and rejects mismatches.
- Repo tests: queries without `tenantId` fail build/lint; cross-tenant fixtures return 0 rows.
- Messaging tests: publish fails without tenant headers; consumers reject orphan events.
- Observability tests: traces/metrics include tenant labels; redaction verified in logs.
Links¶
- → Multitenancy & Tenancy Guards (platform) (`../platform/multitenancy-tenancy.md`)
- → Security & Compliance (`../platform/security-compliance.md`)
- → Data Residency & Retention (`../platform/data-residency-retention.md`)
- → PII Redaction & Classification (`../platform/pii-redaction-classification.md`)
Security Architecture (Zero Trust)¶
ATP adopts a Zero Trust posture end-to-end: never trust, always verify; strong identity for users and workloads, least privilege at every hop, and continuous policy evaluation (tenant + edition + data classification). Controls are layered across ingress, mesh, messaging, storage, observability, and exports.
Security Objectives¶
- Confidentiality — prevent unauthorized data access across tenants and tiers.
- Integrity — tamper-evident writes and verifiable exports, with cryptographic proofs.
- Availability — resilient controls that degrade safely (e.g., deny on policy fetch failures), with back-pressure and circuit breakers.
- Accountability — high-fidelity audit trails of admin and data actions, correlatable across services.
Identity & Access (Users and Workloads)¶
- Human users — OIDC/OAuth2, MFA, SSO; RBAC/ABAC scoped to `tenantId` and edition; short-lived tokens; refresh via secure flows.
- Workloads — workload identity (SPIFFE-like or equivalent); service-to-service mTLS in the mesh; audience/scope-bound JWTs where used.
- Token hygiene — short expirations, clock-skew tolerance, replay protection (nonce/PKCE where applicable), proof-of-possession optional for high-risk flows.
- Authorization — gateway performs coarse checks; services apply fine-grained ABAC (tenant/edition/policy decision) at each use-case.
Boundary Controls¶
- API Gateway / WAF
- JWT validation (issuer/audience/exp), tenancy resolution, schema & size limits, rate limiting/quotas per tenant, IP allow-lists for admin APIs.
- Threat mitigation: SQL/JSON injection filters, deserialization guards, content-type validation, CORS/CSRF protections for console UIs.
- Service Mesh / Network
- Mutual TLS by default; L7 authorization for service identities; namespace/network policies; egress allow-lists and DNS pinning for critical deps.
- Messaging
- Per-service publish/subscribe ACLs; topic-level encryption at rest; headers sanitized (no PII); DLQ isolation with audited replay.
- Storage
- Row-level isolation (tenant filters / RLS); encryption at rest; audit tables for admin actions; classified fields drive redaction and masking.
Data Protection¶
- In transit — TLS 1.2+ everywhere; HSTS on public edges; mTLS inside the mesh; secure cipher suites; TLS secrets managed in KMS/secret store.
- At rest — envelope encryption with KMS-managed keys; per-env key hierarchy; optional per-tenant keys for high-assurance tenants; key versioning on rotation.
- Field/classification aware — classification labels at write time control: persistence, logging, projections, and export redaction.
Keys, Secrets & Rotation¶
- Secrets in secret manager (never in images/YAML); use managed identities over static credentials.
- Key rotation schedules for signing/encryption and TLS certs; dual-key windows for smooth transitions; automated provenance of rotations.
- HSM-backed keys optional; audit every read of high-value secrets.
Integrity & Cryptography (platform interplay)¶
- Integrity service maintains hash chains (per tenant/range) and digital signatures for segments and export manifests.
- Canonicalization of records before hashing; recorded policyVersion and chain checkpoints to simplify offline verification.
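A minimal sketch of the canonicalize-then-hash chaining described above; the concrete choices here (JSON canonicalization via sorted keys, SHA-256, an empty genesis link) are illustrative assumptions, and the real Integrity service additionally signs checkpoints:

```python
import hashlib
import json

def canonical_digest(record: dict) -> str:
    """Canonicalize (sorted keys, compact separators) before hashing so
    logically equal records yield identical digests."""
    canon = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canon.encode()).hexdigest()

def extend_chain(prev_link: str, record: dict) -> str:
    """Each link hashes (previous link || record digest); altering any
    earlier record invalidates every later link."""
    return hashlib.sha256((prev_link + canonical_digest(record)).encode()).hexdigest()

def chain_head(records: list[dict], genesis: str = "") -> str:
    """Fold a record range into its chain head, comparable against a
    stored checkpoint for offline verification."""
    link = genesis
    for record in records:
        link = extend_chain(link, record)
    return link
```

Because the chain head depends on every record in order, a verifier only needs the checkpoint and the record range to detect tampering.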
Supply Chain Security¶
- SBOM generation on every build; dependency scan/block on critical CVEs.
- Image signing and provenance (e.g., attestations) enforced at admission; non-root, read-only FS, dropped Linux capabilities, seccomp/apparmor profiles.
- Infrastructure as Code scanning; drift detection; locked registries and private base images.
Threat Model (snapshot) & Mitigations¶
| Threat | Vector | Control |
|---|---|---|
| Tenant impersonation | Forged headers, token substitution | Accept tenant only from validated token; normalize at gateway; bind token audience/scope |
| Data exfiltration | Over-permissive roles, broad exports | ABAC with least privilege; export selection gated; watermarking; per-tenant quotas & approvals |
| Injection / deserialization | Untrusted payloads | Strict content-type; schema validation; JSON size caps; safe parsers |
| Replay / duplication | Message re-delivery | Idempotency keys; inbox receipts; event ULIDs; consumer de-dup |
| Side-channel / noisy neighbor | Resource contention | Per-tenant rate limits/quotas; bulkheads; isolated exporter pools |
| Secret leakage | Misplaced configs/logs | Secret store; redaction pipeline; zero PII in headers; structured logs with classifiers |
| Supply chain compromise | Tainted deps/images | SBOM+scans; signed images; verified provenance; gated deploys |
| Stale keys/certs | Missed rotation | Rotation SLOs; dual-publish keys; monitors/alerts; break-glass runbook |
Security Telemetry & Auditing¶
- Traces: carry `tenantId`, `edition`, `policyVersion`, `traceId`, `correlationId` across gateway→service→broker→store.
- Logs: structured, classification-aware redaction; admin actions logged with actor, scope, and before/after diff (where safe).
- Metrics: auth failures, rate-limit hits, policy denies, DLQ volume, export anomalies, KMS latency/errors.
- Alerts: token validation spikes, cross-tenant access attempts, chain verification failures, unexpected export surges.
Incident Response & Break-Glass¶
- Playbooks: per-tenant isolation, revoke tokens, rotate affected keys, pause exports, enable heightened logging, notify stakeholders.
- Forensics: immutable log retention; chain checkpoints; snapshot policies; export manifest verification.
- Containment: gateway blocks by tenant/route; mesh denies by service identity; selective projector/export throttling.
Secure Defaults & Hardening Checklist¶
- Non-root containers; read-only FS; minimal capabilities; pinned distroless bases.
- Egress restricted; DNS allow-lists; outbound proxies where required.
- HTTP security headers (HSTS, CSP, X-Content-Type-Options, Referrer-Policy) on public UIs.
- Least privilege IAM for cloud resources; scoped service identities; deny-by-default policies.
- Console/admin endpoints behind SSO + IP allow-lists + step-up auth.
Links¶
- → Security & Compliance (`../platform/security-compliance.md`)
- → Privacy (GDPR/HIPAA/SOC2) (`../platform/privacy-gdpr-hipaa-soc2.md`)
- → PII Redaction & Classification (`../platform/pii-redaction-classification.md`)
- → Zero Trust (hardening) (`../hardening/zero-trust.md`)
- → Key Rotation (hardening) (`../hardening/key-rotation.md`)
- → Tamper Evidence (hardening) (`../hardening/tamper-evidence.md`)
Compliance & Privacy (GDPR/HIPAA/SOC2) — Overview¶
ATP is designed as a privacy-by-design, compliance-by-default platform. Controls are embedded in the write path (classification/minimization/retention), enforced across read/export, and evidenced via immutable audit trails and integrity proofs.
Roles & Responsibilities¶
- Typical role: ATP operates as a Processor for tenant Controllers; some admin/telemetry data may make ATP a limited Controller (documented in DPA).
- Sub-processors: declared per environment; contracts require equivalent safeguards.
- Agreements: DPA (GDPR), BAA (HIPAA) for PHI workloads, SOC 2 reporting for trust criteria.
Regulatory Anchors (focus)¶
- GDPR: lawful basis, transparency, DSR (access/erasure/export), 72h breach notice to SA where required, minimization, storage limitation, data transfers/residency.
- HIPAA: PHI protection, minimum necessary, access controls, audit controls, integrity, transmission security, breach notification ≤ 60 days to affected individuals; BAA in place.
- SOC 2: Trust Service Criteria — Security, Availability, Processing Integrity, Confidentiality, Privacy — evidenced through technical and procedural controls.
Controls Matrix (snapshot)¶
| Requirement | Control (design) | Where enforced | Evidence / Artifacts |
|---|---|---|---|
| Data minimization | Canonical schema + policy evaluation at write | Ingestion + Policy | Schema registry, policy version in events, unit/contract tests |
| Classification & redaction | Field tags drive redaction/masking | Ingestion, Query, Logs/Exports | Redaction library, log scrubbing tests, sample redacted exports |
| Retention & deletion | Retention stamped on write; lifecycle jobs | Ingestion, Lifecycle jobs | Retention policy catalog, job logs, deletion attestations |
| Residency/sovereignty | Region-aware routing, per-tenant storage map | Gateway, Storage | Tenant residency map, deploy topology diagrams |
| Access control | Tenant-scoped RBAC/ABAC, edition gates | Gateway, Services | AuthZ policy files, access logs, ABAC tests |
| Auditability | Immutable admin/data action logs | All services | Append logs, admin trails, correlation/traces |
| Export & portability | Package + manifest + signatures | Export, Integrity | Export manifests, signatures, hash chains |
| Incident response | Runbooks, alerts, forensics snapshots | Ops/Runbook | IR playbooks, alert policies, drill reports |
Data Lifecycle (end-to-end)¶
- Collect — strictly necessary attributes; reject unknown fields by default.
- Classify — tag sensitivity at write; bind policy version.
- Store — encrypted at rest; tenant-scoped keys (optional per tenant).
- Project — read models exclude disallowed fields; derived data tracked.
- Retain — timers enforce storage limitation; legal hold overrides tracked.
- Delete/Anonymize — policy-driven purge/anonymization with proofs.
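The retain and delete/anonymize steps above can be sketched as a policy-driven purge selection; the policy catalog, field names, and hold model below are illustrative assumptions, not ATP's actual schema:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical policy catalog: retentionPolicyId -> retention window.
RETENTION = {"std-90d": timedelta(days=90), "audit-7y": timedelta(days=7 * 365)}

def purge_candidates(records: list[dict], now: datetime, legal_holds: set[str]) -> list[dict]:
    """Select records whose retention window has elapsed; a legal hold
    on the tenant always overrides the retention timer."""
    due = []
    for r in records:
        if r["tenantId"] in legal_holds:
            continue  # pinned: legal hold wins
        window = RETENTION[r["retentionPolicyId"]]
        if now - r["committedAt"] >= window:
            due.append(r)
    return due
```

In production the selection would run as a dry-run first, and the actual purge would emit the deletion attestations mentioned above.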
Data Subject Requests (DSR) — Workflow¶
sequenceDiagram
autonumber
participant U as User/Tenant Admin
participant GW as API Gateway
participant Q as Query Service
participant EXP as Export Service
participant L as Lifecycle/Retention
U->>GW: Submit DSR (access/export/erasure)
GW->>Q: AuthZ + tenant scope, locate records
Q-->>GW: Result set / pointers
alt Access/Export
GW->>EXP: Create export package
EXP-->>U: Download + manifest (portable)
else Erasure
GW->>L: Schedule policy-compliant deletion (legal holds respected)
L-->>GW: Deletion attestation
GW-->>U: Completion notice + evidence
end
SLA guidance
- GDPR DSR response: typically ≤ 30 days (track in runbooks).
- Breach notifications: GDPR 72h (to SA where required); HIPAA ≤ 60 days to affected individuals.
Privacy by Design (architectural hooks)¶
- Policies as code: versioned policy sets; decisions stamped on write and re-evaluated on read/export.
- No PII in headers: classification prevents leakage to logs/metrics; sensitive fields redacted.
- Least privilege: tenant RBAC/ABAC at every use-case; exporter isolation/bulkheads.
- DPIA triggers: new data categories, cross-region transfers, novel large-scale processing — require review/ADR.
Residency & Transfers¶
- Region binding: tenant → region mapping; data stays in region unless contractually permitted.
- Cross-border: blocked by default; explicit policy + contractual basis required.
- Backups/restore: region-scoped; tenant-targeted restore supported.
Monitoring & Evidence¶
- Signals: policy deny rates, retention job failures, export volume anomalies, cross-region attempts.
- Evidence pack: policy catalog, schema registry snapshots, export manifests, chain checkpoints, IR drill reports.
- Periodic attestations: automated reports feed SOC 2 control testing.
Testable Controls¶
- CI checks for classification tags on new fields; rejection if missing.
- Synthetic DSR tests (access/export/erasure) per environment.
- Retention dry-run reports; deletion requires attestation artifacts.
- Policy evolution contract tests (additive vs breaking).
Links¶
- → Privacy (GDPR/HIPAA/SOC2) (`../platform/privacy-gdpr-hipaa-soc2.md`)
- → Data Residency & Retention (`../platform/data-residency-retention.md`)
- → PII Redaction & Classification (`../platform/pii-redaction-classification.md`)
- → Runbook (`../operations/runbook.md`)
- → Alerts & SLOs (`../operations/alerts-slos.md`)
Data Architecture Overview¶
ATP’s data layer is optimized for append-heavy writes, policy-aware reads, and provable integrity. We separate hot append storage, warm read models/indexes, and cold archival to balance performance, cost, and compliance.
Data Primitives (canonical types)¶
- AuditRecord — canonicalized event with tenant/actor/resource, timestamps, attributes (flattened JSON), and policy stamps.
- PolicyDecision — classification, retention, redaction directives (versioned).
- EvidenceMaterial — content digests, chain IDs, signatures, checkpoints.
- ProjectionSegment — denormalized slices for query/search; watermark & lag metadata.
- ExportPackage — immutable bundle manifest with hashes, signatures, and lineage.
Write Path (append-only)¶
sequenceDiagram
autonumber
participant GW as Gateway
participant ING as Ingestion
participant APP as Append Store (hot)
participant INT as Integrity
participant BUS as Event Bus
GW->>ING: Append(cmd + TenantContext)
ING->>ING: Canonicalize + Validate + Policy.Evaluate
ING->>APP: Append(AuditRecord + PolicyDecision)
ING->>BUS: audit.appended / audit.accepted
ING->>INT: Provide digest material (async)
Consistency: write path is strong for a single record; projections are eventually consistent (seconds-level lag budgets).
Keys, Partitions, and Time¶
| Concept | Strategy |
|---|---|
| Primary key | ULID `recordId` for monotonic ordering within time partitions |
| Idempotency key | `(tenantId, sourceId, sequence\|hash)` |
| Partitioning | by time bucket (e.g., day) and `tenantId`; hot shards can further include `sourceId` |
| Timestamps | `occurredAt` (source), `receivedAt` (gateway), `committedAt` (store); all UTC, ISO-8601 |
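The idempotency-key strategy in the table can be sketched as a small helper (the function name and key layout are hypothetical): prefer the producer's sequence number, and fall back to a payload hash when none exists.

```python
import hashlib

def idempotency_key(tenant_id: str, source_id: str, sequence=None, payload: bytes = b"") -> str:
    """Derive (tenantId, sourceId, sequence|hash): the sequence number
    is used when the producer supplies one; otherwise a content hash
    makes retries of the same payload collapse to one key."""
    tail = str(sequence) if sequence is not None else hashlib.sha256(payload).hexdigest()
    return f"{tenant_id}:{source_id}:{tail}"
```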
Canonical Schema (fields snapshot)¶
| Field | Type | Notes |
|---|---|---|
| `recordId` | ULID | unique; sortable |
| `tenantId` | string | required on every record/index |
| `sourceId` | string | producer/system origin |
| `actorId` | string | user/service (classified) |
| `action` | string | verb (CREATE/UPDATE/DELETE/…) |
| `resource` | string | dotted path (e.g., `Order/4711`) |
| `attributes` | object | flat/normalized JSON; unknowns rejected unless whitelisted |
| `occurredAt`/`receivedAt`/`committedAt` | datetime | UTC |
| `policyVersion` | int | immutable once stamped |
| `classification` | enum | PUBLIC/INTERNAL/… (used for redaction) |
| `retentionPolicyId` | string | determines lifecycle |
| `digest` | string | content hash (`sha256:…`) |
| `chainId`/`chainIndex` | string/int | integrity chain placement |
Full schema lives in Message Schemas (`../domain/contracts/message-schemas.md`). Schema changes follow additive-first rules and contract tests.
Read Models & Indexing (warm tier)¶
- Timeline model (by tenant/resource/actor, time-range).
- Facet/aggregation model (counts by action/resource).
- Lookup model (by `recordId`, `sourceId`, correlation).
- Search index (optional) for full-text and fast range scans.
Rebuild strategy: snapshot + replay from append logs; watermarks track projector progress; lag SLO drives autoscaling.
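The watermark mechanics behind rebuilds can be sketched with an idempotent projector; the offset scheme and the facet-count model below are illustrative:

```python
def apply_events(state: dict, events: list[dict]) -> dict:
    """Idempotent projector: events at or below the watermark are
    skipped, so replays and redeliveries are safe to re-run."""
    for event in sorted(events, key=lambda e: e["offset"]):
        if event["offset"] <= state["watermark"]:
            continue  # already projected
        counts = state["counts"]
        counts[event["action"]] = counts.get(event["action"], 0) + 1
        state["watermark"] = event["offset"]
    return state
```

Re-running the same batch leaves the read model unchanged, which is what makes snapshot + replay a safe rebuild strategy.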
Classification, Redaction & Logs¶
- Fields carry classification tags at write; tags drive:
- storage shape (e.g., tokenization),
- query redaction (field masking or omission),
- log scrubbing (no sensitive data in logs/metrics),
- export filtering (respect tenant’s data handling rules).
Example policy map (excerpt)
| Field | Classification | Redaction (query/export) |
|---|---|---|
| `actorId` | PERSONAL | mask last 4 |
| `attributes.email` | PERSONAL | hash (sha256) |
| `attributes.cardLast4` | SENSITIVE | allow if `role:Auditor` & `scope:PII` |
| `resource` | INTERNAL | none |
See PII Redaction & Classification (`../platform/pii-redaction-classification.md`).
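A hedged sketch of applying such a policy map to a flattened record; the directive names, helper functions, and one-level dotted-key shape are illustrative assumptions:

```python
import hashlib

# Illustrative policy map (field -> redaction directive), mirroring the excerpt.
POLICY = {"actorId": "mask_last4", "attributes.email": "hash_sha256", "resource": "none"}

def redact_value(value: str, directive: str) -> str:
    if directive == "mask_last4":    # keep only the last 4 characters visible
        return "*" * max(len(value) - 4, 0) + value[-4:]
    if directive == "hash_sha256":   # stable pseudonym; not reversible
        return "sha256:" + hashlib.sha256(value.encode()).hexdigest()
    return value

def redact(flat_record: dict) -> dict:
    """Field-level redaction over a flattened record (dotted keys).
    Unknown fields pass through here; a stricter default would drop them."""
    return {k: redact_value(str(v), POLICY.get(k, "none")) for k, v in flat_record.items()}
```

The same map drives query responses, exports, and log scrubbing, so a field classified once is redacted everywhere.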
Retention & Lifecycle¶
- Stamped on write: `retentionPolicyId` + `policyVersion`.
- Lifecycle jobs: tiering hot→warm→cold, legal hold awareness, deletion/anonymization windows.
- Attestations: deletion manifests & job logs stored immutably.
See Data Residency & Retention (`../platform/data-residency-retention.md`).
Integrity Materials¶
- Chain-of-hash per tenant and time-range; rolling checkpoints.
- Signatures minted by Integrity svc; referenced in projections and exports.
- Verification APIs accept a record/segment/package and return proofs.
Storage Tiers¶
| Tier | Workload | Technology Shape | Notes |
|---|---|---|---|
| Hot | Append path, near-term verify | OLTP append store | SSD, high IOPS, short retention |
| Warm | Queries & aggregations | Read models / columnar | Denormalized, rebuildable |
| Cold | Archival/eDiscovery | Object storage (+ immutability) | Legal hold, cheap, manifest-signed |
Schema Evolution¶
- Additive: new optional fields or enums → minor version bump; dual readers.
- Breaking: new subject or major version; side-by-side projections.
- Registry & tests: schemas linted; CI contract tests for producers/consumers/projectors.
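The additive-vs-breaking rule can be expressed as a simple registry check; a sketch over an illustrative field-spec shape (name → `{type, required}`), not the actual schema-registry API:

```python
def is_additive(old_fields: dict, new_fields: dict) -> bool:
    """Additive-first rule: no removed fields, no type changes, and any
    new field must be optional; anything else is a breaking change that
    requires a new subject or major version."""
    for name, spec in old_fields.items():
        if name not in new_fields or new_fields[name]["type"] != spec["type"]:
            return False
    return all(
        not spec.get("required", False)
        for name, spec in new_fields.items()
        if name not in old_fields
    )
```

A CI contract test can run this check between the registered schema and each proposed change, gating merges on the result.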
ER Snapshot (logical)¶
erDiagram
TENANT ||--o{ AUDIT_RECORD : owns
AUDIT_RECORD ||--o| EVIDENCE_MATERIAL : "has"
AUDIT_RECORD ||--o{ PROJECTION_SEGMENT : "appears_in"
EXPORT_PACKAGE ||--o{ AUDIT_RECORD : "contains"
TENANT {
string tenantId PK
string edition
}
AUDIT_RECORD {
string recordId PK
string tenantId FK
string sourceId
string actorId
string action
string resource
json attributes
datetime occurredAt
datetime committedAt
string classification
string retentionPolicyId
string digest
string chainId
int chainIndex
int policyVersion
}
EVIDENCE_MATERIAL {
string recordId FK
string segmentId
string signature
string algo
}
PROJECTION_SEGMENT {
string segmentId PK
string tenantId FK
string key
json payload
datetime watermark
}
EXPORT_PACKAGE {
string packageId PK
string tenantId FK
string manifestHash
datetime createdAt
}
Sizing & Capacity Hints (initial)¶
- Record size: median 1–5 KB (flattened JSON); avoid unbounded attributes.
- Throughput: design for burst QPS `B` per tenant; apply ingest rate limits + back-pressure.
- Indexes: time-range first, then tenant/resource/actor; avoid cross-tenant joins.
- Cold costs: batch exports; prefer delta-based packages for repeats.
Testable Controls¶
- Lints reject schemas without classification tags for new fields.
- CI ensures queries require tenant filters; cross-tenant fixtures return zero rows.
- Projections verified idempotent (re-run safe) with watermark assertions.
- Retention dry-run reports produced; deletions emit attestations.
Links¶
- → Data Model (deep-dive) (`data-model.md`)
- → Message Schemas (`../domain/contracts/message-schemas.md`)
- → Query Views & Indexing (impl) (`../implementation/query-views-indexing.md`)
- → PII Redaction & Classification (`../platform/pii-redaction-classification.md`)
- → Data Residency & Retention (`../platform/data-residency-retention.md`)
Storage Strategy (Summary)¶
Our storage approach optimizes for append-heavy writes, policy-aware queries, provable integrity, and cost control. We separate concerns across Hot (append), Warm (read models/indexes), and Cold (archival/eDiscovery) tiers, governed by policy-stamped retention and lifecycle jobs.
Objectives¶
- Performance where it matters: low-latency appends and queries; projections provide fit-for-purpose shapes.
- Compliance by default: classification-aware storage, retention stamped on write, legal hold and residency honored.
- Provable integrity: digest chains and signatures persist with data lineage and manifests.
- Cost discipline: data tiering, compression, batching exports, and quotas per tenant.
Tiering at a Glance¶
| Tier | Workload | Typical Retention | Consistency | Durability & Encryption | Notes |
|---|---|---|---|---|---|
| Hot (Append Store) | ingest path, near-term verification | hours–days (policy) | strong per-write | multi-AZ/zone; at-rest encryption (KMS) | time/tenant partitions, high IOPS, small indexes |
| Warm (Read Models / Indexes) | query/search/aggregations | days–months (policy) | eventually consistent (lag SLO) | multi-AZ/zone; at-rest encryption (KMS) | rebuildable from events; denormalized projections |
| Cold (Archive / eDiscovery) | long-term retention, legal hold | months–years (policy) | N/A (immutable) | object store immutability + KMS; legal hold | manifests + signatures; cost-efficient, slower access |
Detailed shapes live in Data Model (`data-model.md`) and infra specifics in Deployment Views (`deployment-views.md`).
Lifecycle & Policy Enforcement¶
flowchart LR
subgraph HOT[Hot / Append]
A[Append-only segments]
end
subgraph WARM[Warm / Read Models]
R[Projections & indexes]
end
subgraph COLD[Cold / Archive]
C[Immutable objects + manifests]
end
A -->|policy window reached| WARM
WARM -->|tiering job| COLD
A -->|legal hold? keep| A
WARM -->|legal hold? pin| WARM
C -->|export/eDiscovery| C
- On write: `retentionPolicyId` + `policyVersion` stamped; classification guides storage and logs.
- Lifecycle jobs: move eligible segments from Hot→Warm→Cold; respect legal hold and residency maps.
- Deletion/anonymization: performed per policy window; produce attestations and job logs.
Partitioning, Compaction & Index Hygiene¶
- Partitions: time (e.g., day) × tenant; optional `sourceId` for hot-shard control.
- Compaction: roll small segments into bounded files (size/time thresholds) to control file counts and seek cost.
- Index hygiene: projector lag SLOs guide autoscaling; background vacuum/merge jobs keep read paths predictable.
Integrity Material Persistence¶
- Digest chains (per tenant/range) stored alongside append metadata; checkpoints for fast verification.
- Signatures & manifests for exported bundles persisted in Cold; Verify APIs reference chain/manifest IDs.
Backup, Restore & eDiscovery¶
- Backups: scheduled snapshots of Warm (and necessary Hot metadata) with region-scoped policies.
- Restore: tenant- or time-scoped restores; rebuild read models from append logs when possible.
- eDiscovery: selections from Query → Export packages → signed manifests in Cold; immutable retention with legal hold support.
Residency & Encryption¶
- Residency map: tenant → region binding; lifecycle never crosses region without contractual/policy basis.
- Encryption: at rest via KMS; optional per-tenant keys for high-assurance tenants; rotation windows supported.
- Secrets: no keys in payloads; policy and classification prevent sensitive leakage to logs/headers.
Capacity & Cost Levers¶
- Hot: cap record size, enforce schema limits, rate-limit bursty tenants.
- Warm: projection granularity tuned to query needs; compress wide models; expire stale indexes.
- Cold: batch exports, dedupe repeated selections, prefer incremental/delta packages.
- Global: per-tenant quotas, export concurrency caps, storage alerts on growth velocity.
SLO Hints (storage-facing)¶
- Ingest commit p95 ≤ X ms (Hot).
- Projector lag ≤ N s median; p95 ≤ M s (Warm).
- Export TTFB p95 ≤ Z s for packages up to K MB (Cold).
Failure Modes & Guardrails¶
- Hot saturation → back-pressure at Gateway; temporary queueing; projector throttle lifts once lag ≤ budget.
- Projection failure → DLQ + replay tooling; queries fall back to last consistent watermark.
- Cold unavailability → exports paused; already created packages remain downloadable via signed URLs.
- Residency mismatch → hard fail with audit; no cross-region copies without policy/contract.
Testable Controls¶
- Lifecycle dry-run reports (what would tier/delete) per tenant.
- CI checks ensure new tables/indexes include `tenantId` and time partitioning.
- Synthetic restores: periodic tenant/time-window drills.
- Export verification: random sampling of packages against manifests/signatures.
Links¶
- → Data Residency & Retention (`../platform/data-residency-retention.md`)
- → Tamper Evidence (`../hardening/tamper-evidence.md`)
- → Backups, Restore & eDiscovery (`../operations/backups-restore-ediscovery.md`)
- → Persistence & Storage (impl) (`../implementation/persistence.md`)
- → Deployment Views (`deployment-views.md`)
Sequence Flows (append/query/export)¶
This section captures the three canonical end-to-end flows in ATP. These flows are reference-grade and map directly to our containers, bounded contexts, and policies. Detailed, step-by-step variants (timeouts, retries, failure drills) live in Sequence Flows (`sequence-flows.md`).
Append (happy path)¶
sequenceDiagram
autonumber
participant P as Producer
participant GW as API Gateway
participant ING as Ingestion (Use-cases)
participant POL as Policy (Decision API)
participant APP as Append Store (Hot)
participant OB as Outbox (Tx)
participant BUS as Event Bus
participant INT as Integrity
participant PROJ as Projection
participant OTL as OTel/Obs
P->>GW: POST /api/v{n}/audit/append<br />JWT (tenant), X-Idempotency-Key, body (canonical)
GW->>GW: AuthN (OIDC) + tenant/edition + schema + rate-limit
GW->>ING: AppendCommand(TenantContext, IdempotencyKey, Payload)
ING->>POL: Evaluate(classification, retention) [short TTL cache]
POL-->>ING: PolicyDecision(version, labels, retentionPolicyId)
ING->>ING: Canonicalize + Validate + Apply Policy
ING->>APP: Append(AuditRecord + PolicyDecision) [atomic]
ING->>OB: Enqueue(audit.appended, audit.accepted) [same tx]
OB->>BUS: publish (relay)
BUS-->>INT: audit.appended | accepted
INT->>INT: Digest/Chain/Sign (async)
BUS-->>PROJ: audit.accepted
PROJ->>PROJ: Update read models / set watermark
ING->>OTL: spans/logs/metrics (tenantId, edition, policyVersion, idemKey)
ING-->>GW: 202 Accepted { recordId, traceId }
- Headers (ingress): `Authorization: Bearer <JWT>`, `X-Idempotency-Key`, `Content-Type: application/json`
- Guarantees: at-least-once delivery; exactly-once intent via `(tenantId, sourceId, sequence|hash)`; ordering within partition (`tenantId[:sourceId]`)
- SLO cues: p95 append ≤ X ms; policy eval p95 ≤ Y ms; projector lag p95 ≤ N s
Query (authorized read)¶
sequenceDiagram
autonumber
participant C as Client (Ops/Auditor)
participant GW as API Gateway
participant Q as Query Service
participant POL as Policy (Decision API)
participant RM as Read Models (Warm)
participant RED as Redaction Plan
participant OTL as OTel/Obs
C->>GW: GET /api/v{n}/query?tenant=...&filters=...<br />JWT (tenant)
GW->>GW: AuthN + AuthZ (RBAC/ABAC), edition gates, rate-limit
GW->>Q: QueryRequest(TenantContext, Filters, Page)
Q->>RM: Fetch(ReadModel slice, watermark)
Q->>POL: Evaluate(read constraints) [cached]
POL-->>Q: Decision(redaction/deny/allow)
Q->>RED: Apply redaction per classification/policy
Q-->>GW: 200 OK { results, page, watermark, redactionHints }
Q->>OTL: record spans/metrics (p95/p99, filtered-out)
- Filters: time-range, actor/resource, action, attributes (whitelisted)
- Redaction: field-level masking/hashing per classification; no PII in logs/headers
- SLO cues: p95 latency ≤ Y ms at Q RPS; cache hit ratio ≥ H%; watermark drift ≤ D s
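The field-level redaction rule ("masking/hashing per classification") can be sketched as follows. The classification map and rule names are hypothetical; unknown fields default to masking (default-deny):

```python
import hashlib

# Hypothetical classification map: which fields are hashed, masked, or public.
CLASSIFICATION = {"email": "hash", "ssn": "mask", "action": "public"}

def redact(record: dict, classification: dict = CLASSIFICATION) -> dict:
    """Apply field-level redaction per classification; 'public' passes through."""
    out = {}
    for field, value in record.items():
        rule = classification.get(field, "mask")  # default-deny: mask unknowns
        if rule == "public":
            out[field] = value
        elif rule == "hash":
            # Hashing preserves joinability without exposing the value.
            out[field] = "sha256:" + hashlib.sha256(str(value).encode()).hexdigest()[:16]
        else:  # mask
            out[field] = "***"
    return out

redacted = redact({"action": "login", "email": "a@example.com", "ssn": "123-45-6789"})
```

Hashing (rather than masking) lets an auditor correlate records by the same actor without ever seeing the raw PII.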
Export (selection → package → verify)¶
sequenceDiagram
autonumber
participant C as Client (Auditor/Legal)
participant GW as API Gateway
participant Q as Query Service
participant EXP as Export Service
participant INT as Integrity
participant COLD as Cold Store (Immutable)
participant WH as Webhook (optional)
C->>GW: POST /api/v{n}/export { selectionSpec | queryId , format }
GW->>Q: Validate selection (tenant/edition/ABAC)
Q-->>GW: Selection OK (token/manifest draft)
GW->>EXP: CreateExport(TenantContext, selectionToken, format)
EXP->>Q: Stream records (paged, resumable)
EXP->>INT: Request signatures/chain refs (batch)
INT-->>EXP: Evidence (chain checkpoints, signatures)
EXP->>COLD: Write package parts + manifest (signed)
EXP-->>GW: 202 Accepted { exportId, pollUrl, ttfbHint }
loop Client poll or webhook
C->>GW: GET /api/v{n}/export/{exportId}
GW-->>C: 303 See Other → signed download URL
EXP-->>WH: POST /on-export-completed (optional)
end
- Semantics: resumable streaming; throttled to protect query SLIs; immutable artifacts with signed manifests
- SLO cues: TTFB p95 ≤ Z s for ≤ K MB outputs; completion p95 ≤ M min for N records
Failure & Back-pressure (extract)¶
| Flow | Condition | Behavior | Client Signal |
|---|---|---|---|
| Append | schema invalid / policy deny | 400 / 403 | Problem+JSON with code & trace |
| Append | hot partition saturation | 429 (Gateway), retry-after | Retry-After, X-Rate-Limit-* |
| Append | transient store/broker | retry with jitter → DLQ after N | 202 Accepted (eventual), trace |
| Query | projector lag beyond budget | serve with last watermark + warn | X-Watermark, X-Lag |
| Query | authorization fails | 403 | Problem+JSON |
| Export | package > concurrency quota | 429 + backoff | Retry schedule |
| Export | integrity service slow | continue buffering; partial manifest; retry | Polling continues; final manifest on completion |
Observability: spans across GW→ING→BUS→INT/PROJ→Q/EXP with tenantId, edition, traceId, correlationId; metrics for outbox age, consumer lag, export queue depth, policy deny rate.
Links¶
- → [Sequence Flows (detailed)](sequence-flows.md)
- → [REST APIs](../domain/contracts/rest-apis.md)
- → [Message Schemas](../domain/contracts/message-schemas.md)
- → [Outbox/Inbox & Idempotency](../implementation/outbox-inbox-idempotency.md)
- → [Backups, Restore & eDiscovery](../operations/backups-restore-ediscovery.md)
Deployment Views (baseline)¶
This section describes the cloud-native baseline for ATP across environments, regions, and failure domains. It maps our containers to runtime substrates (AKS/ACA), messaging, data stores, and the observability/security planes. Deeper infra specifics live in [deployment-views.md](deployment-views.md).
Environments & Promotion Model¶
| Env | Purpose | Data | Change Rate | Protections |
|---|---|---|---|---|
| dev | rapid iteration, PR validation | synthetic | highest | permissive RBAC, ephemeral namespaces |
| test | integration, contract & replay tests | masked | high | seeded tenants, DLQ/replay drills |
| staging | prod-like validation, chaos drills | masked or opt-in | medium | WAF rules, HPA parity, approvals |
| prod | customer traffic | real | controlled | SLO-backed autoscaling, break-glass, approvals |
Promotion: build once, deploy many (signed image → dev → test → staging → prod) with policy gates and environment-specific overlays.
Regional Topology & Residency¶
- Tenants are bound to a home region; data never crosses regions unless contractually allowed.
- All planes are multi-AZ/zone within a region.
- Optional multi-region active/standby for DR (RPO/RTO declared per edition).
flowchart LR
subgraph Region[Cloud Region - e.g., East US 2]
subgraph Net[Virtual Network]
subgraph Edge[Ingress/WAF Subnet]
GW[API Gateway / Ingress]
end
subgraph App[App Plane - AKS/ACA]
mesh[Service Mesh - mTLS/L7]
ING[Ingestion]
POL[Policy]
INT[Integrity]
PROJ[Projection]
QRY[Query]
SRCH[Search - optional]
EXP[Export]
ADM[Admin/Control]
end
subgraph Msg[Messaging]
BUS[(Topics/Queues + DLQ)]
end
subgraph Data[Data Plane]
HOT[(Append Store)]
WARM[(Read Models/Indexes)]
COLD[(Object Store / Archive)]
end
subgraph Obs[Observability]
OTL[(OTel Collector)]
LOG[(Logs)]
MET[(Metrics)]
TRC[(Traces)]
end
subgraph Sec[Security]
KMS[(KMS/Keys)]
SEC[(Secret Manager)]
end
end
end
GW --> mesh
mesh --> ING & POL & INT & PROJ & QRY & SRCH & EXP & ADM
ING --> HOT
PROJ --> WARM
EXP --> COLD
ING --- BUS
PROJ --- BUS
EXP --- BUS
GW --- OTL
ING --- OTL
PROJ --- OTL
QRY --- OTL
EXP --- OTL
INT --- KMS
HOT --- SEC
WARM --- SEC
COLD --- SEC
Kubernetes / Container Apps Mapping¶
| Namespace | Workloads | HPA Signals | Notes |
|---|---|---|---|
| `gateway` | ingress/gateway, auth filters | RPS, p95 route latency, 429 ratio | WAF rules, IP allow-lists for admin |
| `ingestion` | append API, webhook receiver, outbox relay | CPU, QPS, pending outbox, 5xx | strict idempotency; schema guard |
| `policy` | decision API, cache | p95 decision latency, hit ratio | warm cache w/ TTL + circuit breaker |
| `integrity` | chain/sign/verify workers | queue depth, worker CPU | HSM/KMS integration |
| `projection` | projectors, rebuild jobs | lag (sec), consumer lag, DLQ | watermarks, replay safety |
| `query` | query API, (opt) GraphQL | p95/p99 latency, cache hit | redaction at boundary |
| `search` (opt) | search API, indexers | queue depth, index refresh | can be disabled per edition |
| `export` | export API, packagers | concurrent exports, TTFB | resumable, throttled |
| `admin` | policy mgmt, DLQ/replay, feature flags | N/A | break-glass guarded |
| `observability` | OTel collector, dashboards | N/A | multi-tenant labeling & scrubbing |
Network, Mesh & Access¶
- Ingress: Public → WAF/Ingress → Gateway. Admin surfaces optionally behind IP allow-lists + SSO.
- Mesh: mTLS everywhere; L7 authorization by service identity; timeouts/retries/circuit breakers standardized.
- Egress: deny-by-default, allow-lists for KMS, messaging, object store, and IdP.
- DNS/Service discovery: mesh-native, with identity-bound policies.
Config, Secrets & Keys¶
- Config: per-env overlays; feature flags for edition gates; config maps for non-sensitive settings.
- Secrets: stored in secret manager; mounted/injected at runtime; rotation SLOs enforced.
- Keys: KMS-backed envelope encryption; key IDs & versions recorded in manifests; dual-key windows during rotation.
Scaling & SLO Budgets¶
- Gateway: scale on RPS and auth latency; enforce per-tenant rate limits.
- Ingestion: scale on incoming QPS and outbox age; shed load via 429 when Hot saturation detected.
- Projection: scale to respect lag SLO; auto-tune consumer concurrency.
- Query/Search: scale on p95/p99 latency; cache enabled; bulkhead against Export.
- Export: separate worker pools; cap concurrent packages per tenant to protect read SLOs.
Failure Domains, HA & DR¶
- Intra-region HA: multi-zone deployments; stateless pods across zones; storage with zone-redundancy where supported.
- DLQ & Replay: standard triage tools; replay by window/tenant; dry-run mode.
- Backups/Restore: scheduled backups of warm stores + metadata; tenant/time-window restore drills.
- DR (optional): async replication of cold artifacts; RPO/RTO declared per edition; failover runbooks.
CI/CD & Supply Chain¶
- Pipelines: build → test → sign image (SBOM, vuln scan) → push to registry → deploy via GitOps/Argo or pipelines.
- Policies: admission requires signed images, baseline pod security, resource limits/requests.
- Observability: dashboards per service; alerts for SLO breaches, DLQ growth, export anomalies, projector lag.
Cost Controls¶
- Per-tenant quotas (RPS/storage/exports), export batch windows, storage tiering rules, and autoscaling floors/ceilings tuned for cost envelopes.
Testable Controls¶
- Policy tests: deny unsanctioned egress; block unsigned images at admission.
- Residency tests: enforce region binding per tenant.
- Chaos drills: periodic pod/node loss, message broker hiccups, object-store slowdown.
- SLO checks: synthetic probes for append/query/export; alert on budget burn.
Links¶
- → [Deployment Views (deep-dive)](deployment-views.md)
- → [Environments](../ci-cd/environments.md)
- → [Azure Pipelines](../ci-cd/azure-pipelines.md)
- → [Observability (OTel/Logs/Metrics)](../operations/observability.md)
- → [Security & Compliance](../platform/security-compliance.md)
- → [Backups, Restore & eDiscovery](../operations/backups-restore-ediscovery.md)
API Gateway & Connectivity¶
The API Gateway is the single ingress for tenant traffic and the control point for identity, tenancy, versioning, rate limits, schema validation, and egress discipline. It fronts REST/gRPC APIs and optional webhooks, and propagates Zero Trust signals (identity, tenant, edition, trace) into the mesh.
Objectives¶
- Protect: strong AuthN/Z, input validation, per-tenant quotas/limits, DDoS/WAF.
- Standardize: versioning, headers, error shapes, retry semantics.
- Propagate: `tenantId`, `edition`, `traceparent`, `correlationId` to downstream services.
- Observe: request metrics, saturation (429s), error taxonomy, and schema failure rates.
Ingress Architecture (L7)¶
flowchart LR
Client[Producers / Clients / UIs]
WAF[WAF / Ingress Controller]
GW[API Gateway<br />AuthN/Z • Tenancy • Rate Limits • Schema]
Mesh[Service Mesh - mTLS/L7 AuthZ]
ING[Ingestion]
QRY[Query]
EXP[Export]
POL[Policy]
ADM[Admin/Control]
WH[(Webhooks - optional)]
Client --> WAF --> GW --> Mesh
Mesh --> ING & QRY & EXP & POL & ADM
GW --> WH
- TLS termination at edge; mTLS inside the mesh.
- Admin surfaces (policy/flags/replay) can be IP allow-listed + SSO.
Versioning & Deprecation¶
- URI or header versioning: `GET /api/v{n}/...` or `X-Api-Version: n`.
- Compatibility windows announced in changelog; deprecation headers: `Deprecation: true`, `Sunset: <rfc1123>`, `Link: <url>; rel="deprecation"`.
Tenancy Propagation (canonical)¶
- Gateway requires tenant at ingress (JWT claim preferred; `X-Tenant-Id` only for trusted workloads).
- Inject standardized headers downstream: `x-tenant-id`, `x-edition`, `traceparent`, `tracestate`, `x-correlation-id`, `x-policy-version`.
- Reject requests with missing or conflicting tenant signals (`400`/`403`).
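A minimal sketch of the tenancy-propagation rules above. Function, claim, and header names are assumptions for illustration; the real gateway enforces this in its filter chain:

```python
class TenantError(Exception):
    """Raised when tenant signals are missing or conflict (maps to 400/403)."""

def propagate_tenant(jwt_claims: dict, ingress_headers: dict) -> dict:
    """Validate ingress tenant signals and build the downstream header set."""
    claim_tenant = jwt_claims.get("tenant_id")       # preferred source
    header_tenant = ingress_headers.get("X-Tenant-Id")  # trusted workloads only
    if claim_tenant is None and header_tenant is None:
        raise TenantError("400: missing tenant signal")
    if claim_tenant and header_tenant and claim_tenant != header_tenant:
        raise TenantError("403: conflicting tenant signals")
    tenant = claim_tenant or header_tenant
    return {
        "x-tenant-id": tenant,
        "x-edition": jwt_claims.get("edition", "standard"),
        "x-correlation-id": ingress_headers.get("X-Correlation-Id", "generated"),
    }

headers = propagate_tenant({"tenant_id": "t-acme", "edition": "pro"}, {})
```

The key property: a request can never reach a downstream service with an ambiguous tenant identity.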
Authentication & Authorization¶
- Users: OIDC/OAuth2 (short-lived tokens, MFA), scopes/roles mapped to tenant.
- Workloads: workload identity; audience/scope-bound JWTs or mesh L7 policies.
- Coarse checks at Gateway; fine-grained ABAC in services (use-case level).
Rate Limits, Quotas, & Back-Pressure¶
- Per-tenant burst/sustained rate limits and concurrent export caps.
- Global safeties on hot routes (append, export create).
- Signal back-pressure with `429` + `Retry-After`; include limit headers: `X-RateLimit-Limit`, `X-RateLimit-Remaining`, `X-RateLimit-Reset`.
Schema & Payload Safeguards¶
- Content-type & size limits; JSON schema validation on critical routes.
- Reject unknown fields (strict mode) unless whitelisted in schema registry.
- Enforce PII discipline: no PII in headers; payloads classified for redaction downstream.
Connectivity Matrix (inbound/outbound)¶
| Surface | Protocol | Auth | Tenancy | Notes |
|---|---|---|---|---|
| Append | REST (POST) / gRPC | Bearer (OIDC) or workload JWT | Required | X-Idempotency-Key mandatory |
| Query | REST (GET) / gRPC | Bearer | Required | watermark headers on responses |
| Export | REST (POST/GET stream) | Bearer | Required | resumable download; signed URLs |
| Webhooks (ingest) | HTTPS (POST) | HMAC/signature | In payload | Signature verification, replays detected |
| Admin | REST (POST/GET) | SSO + IP allow-list | N/A | break-glass logged & approved |
Standard Headers (selected)¶
- Ingress: `Authorization`, `X-Idempotency-Key`, `Content-Type`
- Propagated: `x-tenant-id`, `x-edition`, `traceparent`, `tracestate`, `x-correlation-id`, `x-policy-version`
- Responses: `X-Watermark`, `X-Lag`, `X-RateLimit-*`, `X-Request-Id`
Error Model (Problem+JSON)¶
{
"type": "https://errors.atp.example/validation",
"title": "Invalid request payload",
"status": 400,
"detail": "Field 'attributes' failed schema validation",
"instance": "urn:trace:01J9...-req-7f3a",
"tenantId": "t-acme",
"code": "SCHEMA_VALIDATION_FAILED"
}
Egress & Network Policy¶
- Deny-by-default egress; allow only IdP, KMS, messaging, object store, email/webhook domains (when used).
- DNS pinning/allow-lists for critical dependencies; outbound proxies if required.
- Mesh L7 AuthZ: only permitted service→service calls; no lateral “surprises”.
Streaming, Downloads & Large Payloads¶
- Append encourages bounded payloads (size caps).
- Export uses chunked transfer and signed URLs; resilience via range requests.
- Timeouts and read/write budgets enforced per route; client hints for retry/backoff.
CORS & Browser Clients¶
- Strict CORS: allow specific origins for tenant consoles; `SameSite` and CSRF tokens for state-changing routes in UIs.
Observability @ Gateway¶
- Spans: route, tenant, edition, status, bytes in/out, auth latency, schema check time.
- Metrics: RPS, p95/99 per route, 4xx/5xx ratios, 429s, rejected schemas.
- Logs: structured; PII scrubbed; include `x-request-id` and correlation keys.
Failure Modes & Signals¶
| Condition | Behavior | Client Signal |
|---|---|---|
| Missing/invalid tenant | Reject | 400/403 with problem+json |
| Rate-limit exceeded | Shed | 429 + Retry-After |
| Schema invalid | Reject | 400 with problem+json & error path |
| AuthN failed | Reject | 401 |
| AuthZ/edition denied | Reject | 403 |
| Upstream saturation | Back-pressure | 503 (retryable) with Retry-After |
Testable Controls¶
- Contract tests: tenant required on protected routes; headers propagated.
- Negative tests: cross-tenant attempts return `403`; unknown JSON fields rejected.
- Synthetic load: verify 429 behavior, limit headers, and stable p95 under burst.
Links¶
- → [REST APIs](../domain/contracts/rest-apis.md)
- → [Webhooks](../domain/contracts/webhooks.md)
- → [Message Schemas](../domain/contracts/message-schemas.md)
- → [Multitenancy & Tenancy Guards](../platform/multitenancy-tenancy.md)
- → [Security & Compliance](../platform/security-compliance.md)
- → [Health Checks](../operations/health-checks.md)
Observability & SLOs¶
Observability in ATP is first-class: every request and message carries tenantId, edition, traceId, and correlationId. We instrument traces, metrics, and logs using OpenTelemetry, define SLIs/SLOs per service, and manage reliability via error budgets with multi-window burn alerts. PII is never logged; redaction follows classification tags.
Telemetry Standards¶
- Traces: Gateway → Service → Outbox/Bus → Consumer/Projector → Store/Export.
  - Resource attrs: `service.name`, `service.version`, `deployment.environment`, `cloud.region`.
  - Span attrs: `tenant.id`, `tenant.edition`, `http.route`, `messaging.operation`, `messaging.destination`, `db.system`, `db.operation`, `policy.version`, `idempotency.key`.
- Metrics (OTel with exemplars): use histograms for latency; avoid high-cardinality labels.
  - Common labels: `service.name`, `route|operation`, `tenant.class` (small-cardinality bucket), `result`.
  - Examples: `http.server.duration`, `messaging.consumer.lag`, `outbox.relay.age`, `export.queue.depth`.
- Logs: structured JSON; fields include `timestamp`, `level`, `message`, `tenantId`, `edition`, `traceId`, `correlationId`, `eventId`, `code`.
  - No PII in logs or headers; classified fields are masked or hashed.
Sampling: head-based baseline, tail-based for slow/error spans. Retention: traces short, metrics medium, logs per compliance policy.
Golden Signals (platform-wide)¶
- Traffic (RPS, throughput), Latency (p95/p99), Errors (4xx/5xx, policy denies), Saturation (CPU/mem, queue depth, projector lag), plus Back-pressure (429s, retry counts).
SLIs per Service¶
| Service | Primary SLIs | Supporting SLIs |
|---|---|---|
| Gateway | route latency p95/p99; 4xx/5xx ratio; 429 rate | auth latency, schema failure rate |
| Ingestion | append accepted latency p95; accept success rate | outbox relay age p99, request size distribution |
| Policy | decision latency p95; cache hit ratio | decision error rate, fallback activations |
| Integrity | verify latency p95 (by target size) | chain build queue depth, signer error rate |
| Projection | projector lag median/p95; DLQ rate | replay duration, consumer throughput |
| Query | query latency p95/p99; success rate | cache hit ratio, redacted-field count, watermark drift |
| Search (opt) | search latency p95; success rate | index refresh age, queue depth |
| Export | TTFB p95; completion time p95 | package queue depth, resumptions, webhook success |
| Admin | action success rate | time-to-approve, break-glass invocations |
SLO Targets (initial placeholders)¶
Tune these during load testing; record in [Alerts & SLOs](../operations/alerts-slos.md).
- Gateway: route p95 ≤ X ms; 5xx ≤ Y ppm.
- Ingestion: append accepted p95 ≤ X ms; outbox age p99 ≤ N s.
- Projection: lag median ≤ 5 s, p95 ≤ 30 s.
- Query: p95 ≤ Y ms at baseline RPS; success rate ≥ 99.95% (excl. client 4xx).
- Export: TTFB p95 ≤ Z s for ≤ K MB; completion p95 ≤ M min for N records.
- Integrity: verify p95 ≤ T s for S records; failures ≤ E ppm.
- Policy: decision p95 ≤ Q ms; hit ratio ≥ H%.
Error Budgets & Burn Alerts¶
- Budget = `1 - target_availability`. Example: SLO 99.9% ⇒ budget 0.1%.
- Multi-window, multi-burn alerts (fast + slow):
- Page if burn ≥ 14× over 1h or ≥ 6× over 6h.
- Ticket if burn ≥ 2× over 24h.
- Pair with auto-suppression during planned maintenance (annotations).
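The burn-rate arithmetic behind the multi-window alerts can be made concrete. A sketch using the thresholds from the bullets above (function names are illustrative):

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Burn rate = observed error ratio / allowed error-budget ratio.

    Example: SLO 99.9% -> budget 0.1%; 1.4% errors -> burn rate ~14x,
    i.e. the monthly budget would be exhausted in ~1/14 of the window.
    """
    budget = 1.0 - slo_target
    return error_ratio / budget

def should_page(burn_1h: float, burn_6h: float) -> bool:
    # Multi-window: fast burn over 1h OR sustained burn over 6h pages on-call.
    return burn_1h >= 14.0 or burn_6h >= 6.0

def should_ticket(burn_24h: float) -> bool:
    return burn_24h >= 2.0
```

Pairing a fast window with a slow one avoids paging on short blips while still catching slow, steady budget erosion.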
Dashboards (minimum set)¶
- Service: latency histograms, error ratios, throughput, saturation, dependency health.
- Flow: Append (GW→Ingestion→Outbox→Bus→Projection), Query (GW→Query→Read), Export (GW→Export→Cold).
- Tenancy: top tenants by usage; quota headroom; noisy-neighbor detection.
- Reliability: projector lag, outbox age, DLQ depth, export queue depth, policy deny rate.
- Security: auth failures, cross-tenant attempts, signature failures.
Watermarks, Idempotency & Headers¶
- Query responses include `X-Watermark` & `X-Lag` to expose freshness.
- Append requires `X-Idempotency-Key`; success logs include `Idempotent:true` on dedupe.
- Rate headers return budget signals: `X-RateLimit-*`, `Retry-After`.
Alert Policies (extract)¶
| Condition | Threshold | Action |
|---|---|---|
| Projector lag p95 > 30s for 15m | sustained | Page on-call; auto-scale consumers; evaluate DLQ |
| Outbox age p99 > 10s | 15m | Page ingestion; check broker health |
| Query p95 > Y ms | 30m | Page API/runtime; enable cache protection |
| Export queue depth > Q | 30m | Ticket; throttle export concurrency per tenant |
| 5xx ratio > R ppm | 10m | Page owning team; roll back last deploy guardrail |
| Token validation failures spike | 10m | Page security on-call; investigate IdP/clock skew |
Cardinality & Cost Guardrails¶
- Cap high-cardinality labels (e.g., raw `tenantId`); aggregate into `tenant.class` (e.g., `S`/`M`/`L`).
- Use RED metrics (Rate, Errors, Duration) per route/use-case.
- Histograms with controlled buckets; exemplars sampled from traces.
- Drop verbose logs in hot paths; sample at INFO, keep ERROR always.
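Bucketing raw `tenantId` into a small-cardinality `tenant.class` label might look like the following; the thresholds are placeholders to be tuned per deployment, not platform-defined values:

```python
def tenant_class(monthly_events: int) -> str:
    """Map a tenant's traffic volume to a small-cardinality metrics label.

    Keeps metrics storage bounded: three label values instead of one per tenant.
    Thresholds below are illustrative placeholders.
    """
    if monthly_events < 1_000_000:
        return "S"
    if monthly_events < 100_000_000:
        return "M"
    return "L"
```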
Health & Probes¶
- Liveness: process healthy; Readiness: deps reachable and policy cache warm; Startup: migrations/keys loaded.
- Expose `/healthz` endpoints; aggregate in [Health Checks](../operations/health-checks.md).
Synthetic Probes & Canaries¶
- Tenant-scoped synthetics for append/query/export; publish probe artifacts with `tenantId=probe`.
- Canary releases gated by SLO trend; rollback if burn > threshold.
Testable Controls¶
- CI checks for OTel exporters present; span/metric naming lint.
- Contract tests assert presence of `tenantId`, `edition`, `traceId` on key spans.
- E2E tests validate `X-Watermark`/`X-Lag` headers and redaction hints on query.
Links¶
- → [Observability (OTel/Logs/Metrics)](../operations/observability.md)
- → [Alerts & SLOs](../operations/alerts-slos.md)
- → [Health Checks](../operations/health-checks.md)
- → [Runbook](../operations/runbook.md)
- → [Outbox/Inbox & Idempotency](../implementation/outbox-inbox-idempotency.md)
Reliability & Resilience (retries, outbox, DLQ)¶
ATP targets graceful degradation under failure: shed load early, retry safely with idempotency, confine faults with bulkheads, and preserve work via transactional outbox/inbox and DLQs with audited replay. Policies are tuned to protect SLOs and tenant isolation.
Principles¶
- Fail fast at the edge (schema/tenancy/rate) and retry inside only when it’s safe.
- Exactly-once intent through idempotency keys; consumers are idempotent by construction.
- Back-pressure before meltdown: 429s at the Gateway; bounded concurrency in workers.
- Isolate & contain: bulkheads, circuit breakers, DLQs per subscription, exporter pools separate from query.
- Observable by default: retries, drops, DLQ, and replays are fully traceable.
Timeouts, Retries, Backoff¶
| Boundary | Timeout (budget) | Retry Policy | Max Attempts | Notes |
|---|---|---|---|---|
| Client → Gateway | short (route p95 + margin) | No (client retry on 429/503 only) | 0 | Gateway signals back-off via headers |
| Gateway → Service | route-specific (p95 × 1.2) | No (propagate) | 0 | Avoid retry storms |
| Service → Policy/IdP/KMS | short | Yes, exp. backoff + jitter | 3–5 | Only on transient errors/timeouts |
| Service → Broker (publish) | short | Yes, then Outbox relay | bounded | Never drop; relay ensures delivery |
| Consumer → Repo/Store | medium | Yes, exp. backoff + jitter → DLQ | 5 | Idempotent upsert/no-op required |
| Integrity/Export → Object Store | medium/long | Yes, exp. backoff + resume | bounded | Resume via range requests |
Budgeting: Timeouts derived from SLOs; p95 + headroom. Retries use exponential backoff with jitter (e.g., base 100–250ms, cap 5–10s). No retries on validation/authorization errors.
Outbox / Inbox Semantics¶
Outbox guarantees durable publication of domain events without two-phase commit.
sequenceDiagram
autonumber
participant API as API/Use-case
participant DB as Append Store (tx)
participant OB as Outbox (tx)
participant RL as Relay Worker
participant BUS as Broker
API->>DB: Persist aggregate
API->>OB: Persist event (same tx)
DB-->>API: Commit (record + outbox)
RL->>OB: Poll due events
RL->>BUS: Publish (with headers/trace)
RL->>OB: Mark as delivered (idempotent)
Inbox (consumer de-dup) stores (eventId|idempotencyKey) receipts for M days to drop duplicates safely.
- Idempotency Key: `(tenantId, sourceId, sequence|hash)`
- Headers: `x-idempotency-key`, `traceparent`, `x-tenant-id`, `x-schema-version`
- Guarantee: at-least-once delivery; exactly-once effects via idempotent handlers.
DLQ, Replay & Triage¶
Every subscription has a DLQ. We never silently drop.
flowchart LR
subgraph Processing
C[Consumer] --> H[Handler - Idempotent]
H --> OK[Commit]
H -->|on failure after N retries| DLQ[(DLQ)]
end
subgraph Triage & Replay
DLQ --> UI[DLQ Console/Runbook]
UI --> DRY[Dry-run Replay]
DRY --> REP[Replay Window/Tenant]
REP --> C
end
Triage metadata: error code, exception type, payload hash, headers, schema/version, attempt count, trace links.
Replay rules
- Scoped by tenant and time window; require approval for large replays.
- Dry-run first: count would-be successes/failures; cap concurrency; respect watermarks.
- Immutability: handlers must be replay-safe; only side effects that are idempotent.
Bulkheads, Circuit Breakers, Quotas¶
- Bulkheads: separate pools for Export vs Query; limit projector concurrency by partition; per-tenant worker caps.
- Circuit breakers: open on dependency failures (Policy/IdP/KMS/Store); fail closed for risky decisions (deny if policy unavailable), fail open for non-critical enrichments.
- Quotas: per-tenant caps on RPS, concurrent exports, selection size, and storage growth.
Failure Taxonomy & Handling¶
| Class | Examples | Handling | Client Signal |
|---|---|---|---|
| Validation | schema invalid, unknown fields | Reject (no retry) | 400 (Problem+JSON) |
| Authorization | missing/invalid tenant, ABAC deny | Reject (no retry) | 401/403 |
| Resource limits | rate/quotas exceeded | Shed load | 429 + Retry-After |
| Transient infra | broker timeout, object store 503 | Retry with jitter; back-off; DLQ after N | 202/503 (server) |
| Hot partition | tenant/source spike | Throttle producer; bounded consumer concurrency | 429 (edge), lag dashboards |
| Dependency outage | Policy/KMS down | Breaker; safe defaults; degrade features | 503 (server) |
| Data poison | schema/version mismatch | Route to DLQ; fix schema/consumer; replay | N/A (internal) |
Back-Pressure & Lag Management¶
- Gateway: 429 with tenant-scoped limits; communicate `X-RateLimit-*`.
- Consumers: dynamic concurrency (lower when lag or error rate climbs).
- Projection: track watermarks; alert when lag p95 > SLO; autoscale consumers on lag & queue depth.
- Export: queue depth guards; resumable streams; per-tenant concurrency caps.
Chaos & Resilience Drills¶
- Inject faults in staging: broker hiccups, object-store slowdown, KMS latency, projector crashes.
- Verify: error budget burn alerts, DLQ accumulation, replay throughput, circuit behavior, user-facing latency.
- Record drill outcomes and update runbooks and SLO thresholds.
Observability for Reliability¶
- Metrics: `outbox.relay.age`, `consumer.lag`, `dlq.depth`, `dlq.replay.rate`, `retry.count`, `429.rate`, `breaker.open.count`.
- Traces: link original request → outbox publish → consumer handle → DLQ/replay spans.
- Logs: structured error codes; payload hashes (not values); tenant-safe redaction.
Configuration Defaults (starters)¶
- Retries: 3–5 attempts, exponential backoff with jitter (full or decorrelated), cap 10s.
- Concurrency: start low (e.g., 2–4 per partition); auto-tune toward SLO.
- DLQ retention: 7–30 days per environment; immutable audit of purges.
- Replay: require change ticket/approval for >N msgs or cross-tenant windows.
Testable Controls¶
- Unit tests: consumers idempotent; repositories upsert-or-noop.
- Contract tests: no PII in headers; tenant headers present.
- E2E: inject transient failures and verify DLQ path, replay correctness, and watermark recovery.
- Synthetic hot-partition tests: verify 429s at edge, stable p95, and bounded lag.
Links¶
- → [Outbox/Inbox & Idempotency (impl)](../implementation/outbox-inbox-idempotency.md)
- → [Messaging (impl)](../implementation/messaging.md)
- → [Alerts & SLOs](../operations/alerts-slos.md)
- → [Health Checks](../operations/health-checks.md)
- → [Runbook](../operations/runbook.md)
- → [Chaos Drills](../hardening/chaos-drills.md)
Integrity & Tamper-Evidence¶
ATP provides cryptographic assurance that audit data is unchanged since write and that exports are authentic. We implement append-only hash chains, digitally signed checkpoints/manifests, and verification APIs usable online (service-backed) and offline (air-gapped).
Objectives¶
- Tamper-evidence: any modification, insertion, deletion, or re-ordering becomes detectable.
- Provenance: every package/export carries a signed, reproducible manifest.
- Verifiability: tenants and auditors can verify individual records, ranges, or full exports without trusting ATP.
- Agility: algorithm and key rotation without rewriting historical data (versioned metadata).
Integrity Model (at a glance)¶
- Canonicalization at write → compute a content digest of the canonical record.
- Hash Chains per `{tenantId, time-slice}` with rolling checkpoints.
  - Chain node `i`: `H_i = Hash(H_{i-1} || digest(record_i) || meta_i)`
  - `meta_i` includes `recordId`, `committedAt`, `policyVersion`, and chain coordinates.
- Checkpoints (e.g., hourly/daily) are digitally signed; contain `H_last`, span, and count.
- Verification APIs return proof objects; offline tools can validate with published keys.
flowchart LR
subgraph Slice["Tenant Slice: t-acme - 2025-10-22"]
R1["rec #1 digest"] --> CH1[H1]
R2["rec #2 digest"] --> CH2[H2]
R3["rec #3 digest"] --> CH3[H3]
CH1 --> CH2 --> CH3
CH3 --> CKP[Checkpoint Σ<br /> signed - H_last, span, count]
end
CKP --> MAN[Export Manifest<br /> signed - digests, ranges, refs]
We prefer linear hash chains with signed periodic checkpoints. Optionally, a Merkle tree can be built per checkpoint window for batch verification without changing the public surface.
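The chain recurrence `H_i = Hash(H_{i-1} || digest(record_i) || meta_i)` can be sketched directly. The genesis value and the `meta` canonicalization below are assumptions for illustration, not the production wire format:

```python
import hashlib
import json

GENESIS = b"\x00" * 32  # assumed chain seed for an empty slice

def chain_append(prev_hash: bytes, record_digest: bytes, meta: dict) -> bytes:
    """H_i = Hash(H_{i-1} || digest(record_i) || meta_i), meta canonicalized."""
    meta_bytes = json.dumps(meta, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(prev_hash + record_digest + meta_bytes).digest()

def build_chain(digests_and_meta) -> bytes:
    """Fold records into the chain; the result is H_last for the checkpoint."""
    h = GENESIS
    for digest, meta in digests_and_meta:
        h = chain_append(h, digest, meta)
    return h

records = [(hashlib.sha256(b"rec1").digest(), {"recordId": "r1", "chainIndex": 0}),
           (hashlib.sha256(b"rec2").digest(), {"recordId": "r2", "chainIndex": 1})]
h_last = build_chain(records)
# Any modification or re-ordering changes H_last, so the signed checkpoint
# over H_last makes tampering detectable.
tampered = build_chain([(hashlib.sha256(b"recX").digest(), records[0][1]),
                        records[1]])
```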
Data Elements¶
| Artifact | Purpose | Immutable Fields (excerpt) |
|---|---|---|
| Record digest | Per-record integrity | recordId, canonical payload, occurredAt, committedAt, policyVersion |
| Chain node | Links records in order | chainId, chainIndex, prevHash, recordDigest, meta |
| Checkpoint | Signed summary of a range | chainId, fromIndex, toIndex, H_last, spanStart/End, count, algo, keyId, sig |
| Manifest | Export integrity & lineage | package metadata, list of {recordId, digest}, checkpoint refs, algo, keyId, sig |
Canonicalization rules live alongside message schemas; whitespace/order-independent; numeric/boolean normalization; rejected if unknown fields (unless whitelisted).
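A minimal canonicalization-plus-digest sketch consistent with the rules above (sorted keys, compact separators, strict rejection of unknown fields). The exact canonical form is defined alongside the message schemas; this is an illustration, not that specification:

```python
import hashlib
import json

def canonical_digest(payload: dict, allowed_fields: set[str]) -> str:
    """Whitespace/order-independent digest of a record payload.

    Unknown fields are rejected (strict mode) so two producers can never
    disagree about what bytes were hashed.
    """
    unknown = set(payload) - allowed_fields
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=False)
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()

FIELDS = {"actor", "action", "resource"}
# Key order must not matter: both spellings yield the same digest.
d1 = canonical_digest({"actor": "u1", "action": "read", "resource": "doc/9"}, FIELDS)
d2 = canonical_digest({"resource": "doc/9", "actor": "u1", "action": "read"}, FIELDS)
```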
Algorithms & Keys (guidance)¶
- Hash: SHA-256 (default); allow algorithm versioning (e.g., `sha256`, `sha512/256`).
- Signatures: Ed25519 / ECDSA P-256 (configurable).
- Key management: KMS-backed; key IDs and versions stamped in checkpoints/manifests.
- Rotation: introduce a new `algo`/`keyId` at checkpoint boundaries; historical proofs remain valid.
We maintain algorithm agility: verification reads metadata and dispatches appropriate verifier.
Write-time Flow (with integrity material)¶
sequenceDiagram
autonumber
participant ING as Ingestion
participant CAN as Canonicalizer
participant APP as Append Store
participant INT as Integrity
participant KMS as KMS
ING->>CAN: Normalize(payload)
CAN-->>ING: canonical bytes + digest(payload)
ING->>APP: Append(record, digest, policyVersion)
ING->>INT: Update chain (tenant, time-slice, recordDigest)
INT->>INT: H_i = Hash(H_{i-1} || recordDigest || meta)
alt checkpoint boundary
INT->>KMS: Sign(H_last, span, count, algo)
KMS-->>INT: signature(keyId, version)
INT->>APP: Persist checkpoint
end
Verification Surfaces¶
APIs
- `POST /api/v{n}/integrity/verify/record` → input: `{ tenantId, recordId }` → output: `{ ok, recordDigest, chainProof, checkpointRef }`
- `POST /api/v{n}/integrity/verify/range` → `{ tenantId, from, to }` → `{ ok, H_last, span, checkpointSig }`
- `POST /api/v{n}/integrity/verify/export` → `{ exportId }` → `{ ok, manifestDigest, signature, chainRefs[] }`
CLI/Offline (reference implementation)
- Verify with only: manifest, exported data, published public keys, and (optional) checkpoint bundle.
Response Example (export verify)
{
"ok": true,
"exportId": "exp-01JAX...",
"manifest": {
"algo": "sha256",
"keyId": "kms:key/ed25519:v4",
"digest": "sha256:7f...c1",
"signature": "base64:MEUCIQ..."
},
"chainRefs": [
{ "chainId": "t-acme:2025-10-22", "toIndex": 85123, "H_last": "sha256:ab..ef", "checkpointSig": "base64:..." }
]
}
Export Manifest (canonical shape)¶
{
"schemaVersion": "1.0.0",
"packageId": "exp-01JAX...",
"tenantId": "t-acme",
"createdAt": "2025-10-22T12:04:55Z",
"algo": "sha256",
"keyId": "kms:key/ed25519:v4",
"items": [
{ "recordId": "01H...", "digest": "sha256:...", "occurredAt": "2025-10-22T10:12:00Z" }
],
"chainRefs": [
{ "chainId": "t-acme:2025-10-22", "fromIndex": 84000, "toIndex": 85123, "H_last": "sha256:...", "checkpointId": "ckp-2025-10-22T12:00:00Z" }
],
"signature": "base64:..."
}
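Offline verification of manifest items reduces to recomputing digests over the exported bytes. This sketch checks per-record digests only; verifying the signature over the manifest itself (against the published public key) is deliberately left out, and the function name is hypothetical:

```python
import hashlib

def verify_manifest_items(manifest: dict, records: dict) -> list[str]:
    """Recompute each exported record's digest and compare against the
    manifest's `items` list. Returns the recordIds that fail verification.

    `records` maps recordId -> canonical exported bytes; a missing record
    fails its digest check and is reported like a tampered one.
    """
    failures = []
    for item in manifest["items"]:
        data = records.get(item["recordId"], b"")
        digest = "sha256:" + hashlib.sha256(data).hexdigest()
        if digest != item["digest"]:
            failures.append(item["recordId"])
    return failures

records = {"r1": b"payload-1"}
manifest = {"items": [{"recordId": "r1",
                       "digest": "sha256:" + hashlib.sha256(b"payload-1").hexdigest()}]}
ok = verify_manifest_items(manifest, records)
bad = verify_manifest_items(manifest, {"r1": b"tampered"})
```

Because the manifest is signed, an auditor who trusts only the published keys can run this check air-gapped, without trusting ATP.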
Failure Modes & Mitigations¶
| Scenario | Detection | Mitigation |
|---|---|---|
| Record tampered | digest mismatch | reject read/export; raise integrity alert |
| Chain gap or re-order | recompute `H_i`; mismatch vs stored | mark slice compromised; rebuild from trustworthy boundary; alert |
| Checkpoint key rotated | `keyId` mismatch | verify with previous public key set; publish rotation bundle |
| Manifest altered | signature invalid | reject download; regenerate; investigate |
| Clock skew | timestamp sanity checks | rely on committedAt from ATP; include in proofs |
| Hot shard loss | missing nodes | replay from outbox/event log; regenerate chain; issue new checkpoint noting incident (transparent) |
Rebuild policy: Regenerate affected chain segment without rewriting record contents; checkpoint notes remediation; previous proofs remain for unaffected ranges.
Observability & SLOs (integrity)¶
- SLIs: verify latency p95 (by size), chain build queue depth, checkpoint issuance delay, signature error rate.
- Alerts: chain mismatch, signature failure spikes, delayed checkpoints beyond window.
- Trace: link `append → chain update → checkpoint sign → export manifest`.
Testable Controls¶
- Deterministic canonicalization tests (golden vectors).
- Property tests: random order tampering must fail verification.
- Rotation tests: verify old data with previous keys; new data with new keys.
- Cross-check: export manifest round-trip verification in CI (sampled).
- Negative tests: header-only PII prohibition (integrity material never includes PII).
Operational Practices¶
- Key custody: least-privilege KMS roles; HSM-backed keys optional; dual-control for rotations.
- Publishing: make public verification keys and checkpoint bundles available per tenant/region (read-only).
- eDiscovery: export includes manifest and optional checkpoint pack to enable offline verification.
- Incident handling: freeze exports for affected slices; publish advisory with affected `chainId` ranges.
Links¶
- → Tamper Evidence (hardening) (../hardening/tamper-evidence.md)
- → Security & Compliance (../platform/security-compliance.md)
- → Backups, Restore & eDiscovery (../operations/backups-restore-ediscovery.md)
- → Events Catalog (../domain/events-catalog.md)
- → Message Schemas (../domain/contracts/message-schemas.md)
SDK & Integration Guidance (overview)¶
This section orients integrators to the official SDKs and the minimum set of practices to publish, query, and export audit data safely. Deep dives and runnable samples live under SDK and Guides.
Official SDKs¶
| Language | Package | Status | Target Runtimes |
|---|---|---|---|
| C# | ConnectSoft.Atp | GA (preferred) | .NET 8/9 |
| JavaScript/TypeScript | @connectsoft/atp | GA (preferred) | Node 18+/20+, modern browsers (for query only) |
Common features: automatic tenancy propagation, idempotency key helpers, schema validation, built-in retry with jitter (safe verbs only), OTel instrumentation hooks, and Problem+JSON error mapping.
Minimal Client Configuration¶
# Pseudoconfig (both SDKs support env vars and code-based config)
ATP_BASE_URL: https://api.atp.example
ATP_TENANT_ID: t-acme
ATP_EDITION: enterprise
ATP_CLIENT_ID: <oidc client id>
ATP_CLIENT_SECRET: <secret> # or workload identity
ATP_TIMEOUT_MS: 3000
- Auth: OIDC client credentials or workload identity.
- Tenancy: `tenantId` required for all calls; SDK sets the `x-tenant-id` header.
- Tracing: pass an OTel tracer/provider to auto-attach `traceparent`.
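The environment variables above can be resolved with a small loader. A sketch only — `configFromEnv` and its shape are illustrative, not part of the SDK, which ships its own config resolution:

```typescript
// Resolve client settings from the environment. Names mirror the
// pseudoconfig above; required values throw early, ATP_TIMEOUT_MS
// defaults to 3000 (aligned with the ~3s route SLO default).
type ClientConfig = { baseUrl: string; tenantId: string; timeoutMs: number };

function configFromEnv(env: Record<string, string | undefined>): ClientConfig {
  const required = (key: string): string => {
    const v = env[key];
    if (!v) throw new Error(`missing required env var ${key}`);
    return v;
  };
  return {
    baseUrl: required("ATP_BASE_URL"),
    tenantId: required("ATP_TENANT_ID"),
    timeoutMs: Number(env["ATP_TIMEOUT_MS"] ?? "3000"),
  };
}
```

Failing fast on missing required settings keeps misconfiguration visible at startup rather than at first call.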
Publish (Append) — Quick Start¶
C#
var client = new AtpClient(new AtpOptions {
BaseUrl = new Uri(Environment.GetEnvironmentVariable("ATP_BASE_URL")!),
TenantId = "t-acme",
Auth = AtpAuth.ClientCredentials("clientId","clientSecret"),
});
var record = new AuditRecord {
SourceId = "order-svc",
ActorId = "u-123",
Action = "UPDATE",
Resource = "Order/4711",
Attributes = new { status = "Shipped", carrier = "DHL" },
OccurredAt = DateTimeOffset.UtcNow
};
var idem = IdempotencyKey.From("t-acme", "order-svc", sequence:4711);
await client.AppendAsync(record, idem, ct); // ct: a CancellationToken from your host
TypeScript
import { AtpClient, idempotencyKey } from "@connectsoft/atp";
const client = new AtpClient({
baseUrl: process.env.ATP_BASE_URL!,
tenantId: "t-acme",
auth: { kind: "clientCredentials", clientId: "...", clientSecret: "..." },
});
const record = {
sourceId: "order-svc",
actorId: "u-123",
action: "UPDATE",
resource: "Order/4711",
attributes: { status: "Shipped", carrier: "DHL" },
occurredAt: new Date().toISOString(),
};
await client.append(record, idempotencyKey("t-acme","order-svc",4711));
Contract reminders
- Headers (SDK-managed): `Authorization`, `x-tenant-id`, `X-Idempotency-Key`, `Content-Type`.
- Schema: unknown fields rejected unless whitelisted; use SDK types to avoid drift.
- PII discipline: never place PII in headers; use `attributes` with classification-aware fields in schema.
Query — Authorized Read¶
C#
var page = await client.QueryAsync(new QuerySpec {
TimeRange = TimeRange.LastHours(24),
Filters = new() { Resource = "Order/4711" },
Page = new PageSpec(size: 100)
}, ct);
// Headers surfaced on the response object
Console.WriteLine($"Watermark={page.Watermark}, Lag={page.LagSeconds}s");
TypeScript
const res = await client.query({
timeRange: { lastHours: 24 },
filters: { resource: "Order/4711" },
page: { size: 100 }
});
console.log(res.headers["x-watermark"], res.headers["x-lag"]);
- Redaction: SDKs expose `redactionHints` where fields were masked/hidden.
- Pagination: cursor-based; pass the `next` token; default page size 100 (configurable).
- Rate limits: watch `X-RateLimit-*` and `Retry-After`.
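Cursor pagination typically loops until the next token is absent. A minimal sketch — `fetchPage` below is a stand-in for the SDK's query call, not its actual API:

```typescript
// A page carries items plus an optional cursor for the next page.
type Page<T> = { items: T[]; next?: string };

// Drain a cursor-paginated result set: fetch, accumulate, follow `next`.
async function drain<T>(fetchPage: (cursor?: string) => Promise<Page<T>>): Promise<T[]> {
  const all: T[] = [];
  let cursor: string | undefined;
  do {
    const page = await fetchPage(cursor); // cursor undefined → first page
    all.push(...page.items);
    cursor = page.next;                   // absent on the last page
  } while (cursor !== undefined);
  return all;
}
```

In production, bound the loop (max pages or max items) and honor `Retry-After` between fetches rather than draining unconditionally.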
Export — Selection → Package¶
C#
var exportId = await client.Export.CreateAsync(new ExportRequest {
Query = new QuerySpec { Filters = new() { Resource = "Order/4711" } },
Format = "jsonl"
}, ct);
var stream = await client.Export.DownloadAsync(exportId, ct); // resumable via Range
TypeScript
const { exportId } = await client.export.create({
query: { filters: { resource: "Order/4711" } },
format: "jsonl"
});
const file = await client.export.download(exportId); // supports range & resume
- Resumable downloads (HTTP Range).
- Integrity: response includes manifest digest/signature; use the SDK `verifyExport()` wrapper for convenience.
Webhooks (optional) — Ingest & Completion¶
Verification
import { verifyWebhook } from "@connectsoft/atp/webhooks";
const ok = verifyWebhook(headers, rawBody, secret); // HMAC or signature
if (!ok) return res.status(401).end();
- Retry-safe: your handler must be idempotent; key on the event `eventId` or `x-idempotency-key`.
- Security: require HTTPS; rotate secrets; deny unsigned deliveries.
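Under the hood, HMAC verification amounts to recomputing the signature over the raw body and comparing in constant time. A sketch under assumptions (hex-encoded HMAC-SHA256; the SDK's `verifyWebhook` encapsulates the documented scheme and header names):

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Recompute HMAC-SHA256 over the raw (unparsed) body and compare it to
// the delivered signature in constant time to avoid timing side channels.
function verifySignature(rawBody: string, signatureHex: string, secret: string): boolean {
  const expected = createHmac("sha256", secret).update(rawBody).digest();
  const given = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws on length mismatch, so check length first.
  return given.length === expected.length && timingSafeEqual(given, expected);
}
```

Always verify the raw body bytes, not a re-serialized parse: JSON round-tripping can reorder keys and change whitespace, which breaks the HMAC.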
Retries, Timeouts, and Idempotency¶
- SDK retries: exponential backoff with jitter on transient errors only (5xx/timeouts).
- Do not auto-retry on 4xx (validation, auth, policy deny).
- Idempotency key builder helpers ensure exactly-once intent:
  - Conventional form: `${tenantId}:${sourceId}:${sequence|hash}`.
- Timeouts: defaults ~3s (configurable per method); align with route SLOs.
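The conventional key form and a full-jitter backoff can be sketched as follows (the SDK helpers are the source of truth; `backoffMs` and its defaults are illustrative):

```typescript
// Conventional idempotency key: `${tenantId}:${sourceId}:${sequence|hash}`.
// The same (tenant, source, sequence) triple always yields the same key,
// so server-side dedupe makes retried appends safe.
function idempotencyKey(tenantId: string, sourceId: string, seqOrHash: number | string): string {
  return `${tenantId}:${sourceId}:${seqOrHash}`;
}

// Full-jitter exponential backoff: a delay drawn uniformly from
// [0, min(cap, base * 2^attempt)]. The cap bounds growth; the jitter
// decorrelates concurrent retriers so they don't stampede together.
function backoffMs(attempt: number, baseMs = 100, capMs = 30_000): number {
  return Math.random() * Math.min(capMs, baseMs * 2 ** attempt);
}
```

Pair the two: compute the key once per logical append, then reuse it unchanged across every retry attempt.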
Error Model (Problem+JSON)¶
| HTTP | Code (example) | Meaning | Action |
|---|---|---|---|
| 400 | SCHEMA_VALIDATION_FAILED | Payload invalid | fix schema |
| 401 | UNAUTHENTICATED | Missing/invalid token | refresh/reauth |
| 403 | AUTHORIZATION_DENIED | RBAC/ABAC/edition gate | request access |
| 429 | RATE_LIMITED | Per-tenant quota exceeded | backoff using Retry-After |
| 503 | UPSTREAM_UNAVAILABLE | Transient infra | SDK retries (jitter) |
All SDK exceptions include traceId; logs are safe for sharing (no PII).
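The retry guidance above condenses into a small predicate. A sketch only — the SDKs ship their own error classification:

```typescript
// Retry only transient statuses: 429 (honoring Retry-After) and 5xx.
// 4xx validation/auth/policy errors are deterministic and never retried.
function isRetryable(status: number): boolean {
  return status === 429 || (status >= 500 && status < 600);
}
```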
Versioning & Compatibility¶
- Semantic Versioning for SDKs; APIs/events are additive-first.
- Deprecation: SDK surfaces `Deprecation`/`Sunset` headers; consult the Changelog.
- Pin major versions in production; run contract tests in CI.
Security Considerations¶
- Tokens are short-lived; enable MFA/workload identity.
- Never hardcode secrets; use secret managers.
- No PII in headers; classification-driven redaction in logs/exports.
- Export manifests are signed; verify before processing.
Observability Hooks¶
- Pass your tracer/provider to the SDK to attach spans (gateway → service → broker → store).
- Metrics: per-call latency histograms, retry counts, 4xx/5xx ratios.
- Correlate with the `traceId` from Problem+JSON on errors.
Common Patterns & Anti-patterns¶
Do:
- Use an `IdempotencyKey` for every append.
- Batch appends with bounded payloads; avoid oversized `attributes`.
- Respect `Retry-After` and `X-RateLimit-*`.
- Propagate `tenantId` consistently.
Don’t:
- Put secrets/PII into headers.
- Bypass SDK schema types with “raw” posts.
- Retry on validation/auth errors.
Links¶
- → C# SDK (../sdk/csharp.md)
- → JavaScript SDK (../sdk/javascript.md)
- → .NET Publisher Example (../sdk/examples/dotnet-publisher.md)
- → JS Publisher Example (../sdk/examples/js-publisher.md)
- → Query Examples (../sdk/examples/query-examples.md)
- → Guides: Produce Audit Events (../guides/producing-audit-events.md)
- → Guides: Query Audit Logs (../guides/querying-audit-logs.md)
- → Guides: Export & eDiscovery (../guides/export-and-ediscovery.md)
Risks & Mitigations¶
This section catalogs the top 8 architectural risks for ATP with signals, mitigations, and contingency playbooks. Each risk has an owner, testable controls, and traceability to SLOs/ADRs.
Severity (S): Low/Med/High. Likelihood (L): Low/Med/High.
Summary Matrix¶
| ID | Risk | S | L | Primary Owner | Key Signals / SLIs |
|---|---|---|---|---|---|
| R-001 | Scale spikes on write (ingestion surge) | High | Med | Solution + DevOps | 429 rate, outbox age p99, broker queue depth, projector lag p95 |
| R-002 | Cost overruns (storage/exports/telemetry) | Med | Med | Cloud + Finance (FinOps) | Hot/Warm/Cold growth velocity, export concurrency, metrics/logs ingestion cost |
| R-003 | Data residency / sovereignty violation | High | Low | Security + Data + Cloud | Cross-region traffic events, storage location drift, restore target checks |
| R-004 | Schema drift / incompatible change | Med | Med | Application + Data | Contract test failures, schema registry diffs, consumer DLQ spike |
| R-005 | Tight coupling between services/contracts | Med | Med | Enterprise + Solution | Change ripple count per deploy, cross-service failure blast radius |
| R-006 | Vendor lock-in (broker/DB/cloud features) | Med | Med | Enterprise + Infra | Adapter coverage gaps, portability test failures |
| R-007 | Noisy neighbor (multi-tenant contention) | High | Med | Enterprise + SRE | Per-tenant 429s, query p95 regressions, export queue depth by tenant |
| R-008 | Compliance drift (controls stale vs GDPR/HIPAA/SOC2) | High | Low | Security + Compliance | Policy deny rates, retention job failures, audit finding backlog |
R-001 — Scale spikes on write (ingestion surge)¶
- Signals/SLIs: 429 rate ↑, outbox.relay.age ↑, broker depth ↑, projector lag p95 > SLO.
- Mitigations:
- Per-tenant rate limits/quotas at Gateway; 429 + `Retry-After`.
- Transactional outbox, consumer concurrency autoscaling, partition by `tenantId[:sourceId]`.
- Back-pressure: bounded worker pools; bulkheads to protect Query/Export.
- Contingency:
- Enable surge mode: increase broker partitions/consumer concurrency; temporarily narrow schema acceptance windows.
- Throttle exporters globally; prioritize ingestion SLIs.
- Tests: synthetic hot-partition load; verify stable p95 and bounded lag.
- Trace: SLOs in Observability & SLOs.
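One way to picture the per-tenant limits in R-001 is a token bucket: refill at a steady rate, admit up to a burst, reject (429) beyond it. Illustrative only — the Gateway's production limiter is its own component:

```typescript
// Per-tenant token bucket: tokens refill at `ratePerSec` up to `capacity`.
// A request is admitted if a whole token is available; otherwise the
// caller should respond 429 with a Retry-After hint.
class TokenBucket {
  private tokens: number;
  private last: number;

  constructor(private capacity: number, private ratePerSec: number, now = 0) {
    this.tokens = capacity; // start full: allow an initial burst
    this.last = now;
  }

  tryTake(now: number): boolean {
    // Lazily refill based on elapsed time since the last call.
    this.tokens = Math.min(this.capacity, this.tokens + (now - this.last) * this.ratePerSec);
    this.last = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```

One bucket per `tenantId` (or `tenantId:sourceId`) gives the isolation property: a surging tenant exhausts only its own bucket.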
R-002 — Cost overruns (storage/exports/telemetry)¶
- Signals: Hot/Warm/Cold growth velocity; export concurrency; log/metrics ingestion cost.
- Mitigations:
- Tiering policy (Hot→Warm→Cold), compression, projection granularity reviews.
- Export batch windows, per-tenant export caps, dedupe repeated selections.
- Telemetry sampling (tail-based traces), cardinality caps (`tenant.class` buckets).
- Contingency:
- Apply emergency retention adjustments (policy-backed), reduce export concurrency; enable log down-sampling.
- Tests: monthly FinOps review; cost regression checks in perf env.
See Storage Strategy & Data Residency & Retention.
R-003 — Data residency / sovereignty violation¶
- Signals: cross-region writes/reads, restore to wrong region, misconfigured buckets.
- Mitigations:
- Tenant→region map enforcement at Gateway; region-scoped storage and backups.
- Infrastructure policies (deny cross-region replication by default).
- Residency checks in CI/CD and synthetic restore drills (tenant/time-scoped).
- Contingency:
- Freeze affected tenant exports; relocate data; notify per contractual/DPA terms.
- Tests: residency unit tests; periodic restore drills per region.
See Privacy (GDPR/HIPAA/SOC2).
R-004 — Schema drift / incompatible change¶
- Signals: contract test failures, consumer DLQ spikes, projector failures.
- Mitigations:
- Schema registry, additive-first evolution, versioned Published Language.
- Producer/consumer contract tests in CI; dual-write/read during migrations.
- Strict schema validation at Gateway; unknown fields rejected unless whitelisted.
- Contingency:
- Pin consumers; roll back producers; replay DLQ after schema fix.
- Tests: golden schema vectors; migration rehearsals in staging with replay.
See Message Schemas & REST APIs.
R-005 — Tight coupling between services/contracts¶
- Signals: high change ripple; deploys blocked by downstream readiness; wide blast radius.
- Mitigations:
- Open Host Service at Gateway; asynchronous EDA between services.
- Ports-and-adapters; domain isolated from frameworks; backward-compatible contracts.
- Feature flags & canary releases to localize risk.
- Contingency:
- Breaker open on failing dependency; serve with last watermark; postpone non-critical enrichments.
- Tests: chaos drills removing downstreams; verify graceful degradation.
See Component Boundaries & Event-Driven Plan.
R-006 — Vendor lock-in (broker/DB/cloud features)¶
- Signals: adapter gaps, reliance on proprietary features without abstraction, migration blockers.
- Mitigations:
- Abstraction layers for broker/index/persistence; avoid provider-specific payloads in domain.
- Keep export/verify formats open (JSONL + signed manifests).
- ADRs record trade-offs; periodic portability tests (alt broker/index in CI).
- Contingency:
- Side-by-side pilot on alternate provider; maintain dual adapters for a window.
- Tests: contract suite against alt adapters; export/verify runs offline.
See ADRs & Governance.
R-007 — Noisy neighbor (multi-tenant contention)¶
- Signals: per-tenant 429s, p95/p99 regressions, export queue depth spikes, projector lag localized to a tenant.
- Mitigations:
- Per-tenant quotas for RPS/storage/exports; separate exporter pools; projector bulkheads per partition.
- Cache protection and query timeouts; circuit breakers on read amplification paths.
- Contingency:
- Temporarily lower limits for offending tenants; schedule off-peak exports; enable shard by `tenantId:sourceId`.
- Tests: synthetic contention runs; verify isolation and SLO stability.
See Multitenancy & Tenancy Guards.
R-008 — Compliance drift (controls stale vs GDPR/HIPAA/SOC2)¶
- Signals: policy deny spikes, retention failures, missed rotations, audit finding backlog.
- Mitigations:
- Policies as code, versioned; CI checks for classification tags on new fields.
- Automated retention jobs with attestations; scheduled key rotations; admin action audit trails.
- Quarterly control attestation packs; DSR synthetics.
- Contingency:
- Freeze risky exports; hotfix policy sets; trigger IR playbooks and stakeholder comms.
- Tests: DPIA gates for new data classes; DSR rehearsal; rotation drills.
See Compliance & Privacy & Security & Compliance.
Governance & Traceability¶
- Each risk maps to ADRs (decision logs), linked mitigations, and SLOs (error budgets).
- Risk review cadence: monthly in Ops/Architecture, quarterly for Compliance/Exec.
- Changes to risk posture require an ADR update and CI policy updates.
Links¶
- → Alerts & SLOs (
../operations/alerts-slos.md) - → Runbook (
../operations/runbook.md) - → Messaging & Outbox (
../implementation/messaging.md) - → Outbox/Inbox & Idempotency (
../implementation/outbox-inbox-idempotency.md) - → Data Residency & Retention (
../platform/data-residency-retention.md) - → Privacy (GDPR/HIPAA/SOC2) (
../platform/privacy-gdpr-hipaa-soc2.md) - → REST APIs (
../domain/contracts/rest-apis.md) - → Message Schemas (
../domain/contracts/message-schemas.md)
ADR Index & Governance¶
Architecture decisions are captured as ADRs (Architecture Decision Records) to make trade-offs explicit, auditable, and traceable to roadmap items and SLOs. This section defines where ADRs live, how they’re authored/reviewed, and a starter index of the key decisions for ATP.
Where ADRs Live¶
- Repository path: /docs/adrs/ (one file per decision).
- Naming: `ADR-<YYYY>-<NNN>-<kebab-title>.md` (e.g., `ADR-2025-001-tenancy-model.md`).
- Status taxonomy: Proposed → Accepted → Deprecated → Superseded.
Each PR that changes contracts, data models, or deployment posture must reference an ADR (existing or new).
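The naming convention above is mechanically checkable in CI. A sketch of such a lint rule (the regex is illustrative; the actual docs-lint step may differ):

```typescript
// Validate ADR filenames against ADR-<YYYY>-<NNN>-<kebab-title>.md:
// four-digit year, three-digit sequence, lowercase kebab-case title.
const ADR_NAME = /^ADR-\d{4}-\d{3}-[a-z0-9]+(?:-[a-z0-9]+)*\.md$/;

function isValidAdrName(name: string): boolean {
  return ADR_NAME.test(name);
}
```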
Decision Lifecycle¶
flowchart LR
  P[Propose ADR<br/>draft PR] --> R[Review<br/>arch council + owners]
  R -->|approve| A[Accepted<br/>merge + tag]
  R -->|revise| P
  A --> I[Implement<br/>code + docs + SLOs]
  I --> E[Evaluate<br/>metrics + post-deploy]
  A --> S[Supersede/Deprecate<br/>new ADR links prior]
RACI
- Responsible: Proposal author (feature owner)
- Accountable: Enterprise Architect (final sign-off)
- Consulted: Solution, Security, Data, Cloud/Infra, DevOps Architects
- Informed: PM/Delivery, Compliance, SRE
ADR Template (use for new decisions)¶
---
adr: ADR-2025-XXX
title: <Concise decision title>
status: Proposed | Accepted | Deprecated | Superseded by ADR-YYYY-NNN
owners:
- <role/name>
date: 2025-..-..
links:
issues: [ <link(s) to epics/issues> ]
docs:
- ../architecture/architecture.md#<section-anchor>
- ../domain/contracts/...
slo_impact:
- <which SLOs/SLIs are affected>
risk:
severity: Low|Med|High
mitigations: [ <refs to sections/runbooks> ]
---
## Context
<Problem statement, constraints, alternatives considered, why now>
## Decision
<The option chosen and why; scope and boundaries; tenant/edition impact>
## Consequences
<Positive/negative trade-offs, operational impacts, cost and complexity, migration notes>
## Implementation Notes
<High-level tasks, rollout plan, feature flags, compatibility windows>
## Verification
<How we’ll verify success: metrics, tests, drills, acceptance gates>
## References
<Links to PoCs, benchmarks, standards, prior ADRs>
ADR Index (starter set)¶
| ADR | Title | Status | Links |
|---|---|---|---|
| ADR-2025-001 | Tenancy Model & Guards (explicit tenant on every boundary; RLS; edition gating at Gateway/Policy) | Accepted | Multitenancy |
| ADR-2025-002 | Event Bus & EDA Guarantees (at-least-once, idem keys, outbox/inbox, partitioning by tenantId[:sourceId]) | Accepted | Event-Driven Plan |
| ADR-2025-003 | Hash Chains + Signed Checkpoints (export manifests with proofs) | Accepted | Integrity |
| ADR-2025-004 | Storage Tiering (Hot/Warm/Cold) & lifecycle enforcement | Accepted | Storage Strategy |
| ADR-2025-005 | Schema Registry & Evolution Policy (additive-first, breaking via new subject/major) | Accepted | Message Schemas |
| ADR-2025-006 | Gateway Versioning & Rate Limits (Problem+JSON, 429 with budgets) | Proposed | API Gateway |
| ADR-2025-007 | Per-Tenant Export Concurrency Caps (bulkheads to protect Query SLOs) | Proposed | Reliability & Resilience |
| ADR-2025-008 | Policy as Code (classification/retention/redaction versioned; stamped on write) | Accepted | Compliance & Privacy |
When an ADR is superseded, update the index and add a `Superseded by ADR-…` line at the top of the older record.
Governance Rules (what requires an ADR)¶
- Domain contracts: REST/gRPC schemas, message subjects/schemas, webhook signatures.
- Persistence/Index: new tables/indices, partitioning strategy changes, retention rules.
- Security: identity model changes, key/crypto algorithms, breakout from Zero Trust defaults.
- Platform: event bus/provider changes, region/residency posture, ingress/WAF changes.
- SLO/Cost: material shifts in error budgets, telemetry retention, cost levers.
Minor refactors that do not alter contracts, SLOs, or posture can proceed without a new ADR, but must reference related ADRs in PRs.
Quality Gates (CI/CD)¶
- Lint: ADR front-matter required (`status`, `owners`, `links`); broken links fail the build.
- Contract tests: PRs that touch /domain/contracts must reference an ADR.
- Docs check: architecture/ pages must not link to Deprecated ADRs without also linking the successor.
- Changelog: /reference/changelog.md automatically includes ADR titles on merge.
Traceability¶
- Each section of this document references one or more ADRs; ADRs link back here via anchors.
- Roadmap epics reference ADR IDs; production incidents include ADR references in the post-mortem template.
Cadence & Forums¶
- Weekly architecture sync (triage new ADRs, status reviews).
- Monthly risk/governance review aligning with Risks & Mitigations.
- Quarterly compliance/controls review (SOC 2 evidence packs, DPIA triggers).
Testable Controls¶
- Pipeline fails if a change to contracts/index/storage lacks an ADR reference.
- Docs link checker: ADR anchors valid; “superseded” graph has no orphans.
- Synthetic audit: sample PRs verify Problem+JSON includes `traceId` + linked ADR in release notes.
Links¶
- → Quality Gates (../ci-cd/quality-gates.md)
- → Planned Work (Epics & Features) (../planning/index.md)
- → Changelog (../reference/changelog.md)
- → Message Schemas (../domain/contracts/message-schemas.md)
- → REST APIs (../domain/contracts/rest-apis.md)
Traceability to Roadmap¶
Tiny matrix mapping this document’s sections to roadmap Epics/Features and the artifact location under /docs.
| Section | Roadmap Epic / Feature | Artifact (under /docs) |
|---|---|---|
| Purpose & Principles | AUD-ARC-001 / GOV | /docs/architecture/architecture.md#purpose |
| System Context (C4 L1) | AUD-ARC-001 / HLD | /docs/architecture/hld.md |
| Bounded Contexts & Context Map | AUD-ARC-001 / DDD | /docs/architecture/context-map.md |
| Core Services & Containers (C4 L2) | AUD-ARC-001 / HLD | /docs/architecture/hld.md |
| Component Boundaries (C4 L3) | AUD-ARC-001 / HLD | /docs/architecture/components.md |
| C06 – Event-Driven Communication Plan | AUD-ARC-001 / HLD-T002 | /docs/domain/events-catalog.md |
| C07 – Multitenancy & Tenancy Guards | AUD-TENANT-001 | /docs/platform/multitenancy-tenancy.md |
| C08 – Security Architecture (Zero Trust) | AUD-SECURITY-001, AUD-IDENTITY-001 | /docs/platform/security-compliance.md |
| C09 – Compliance & Privacy (GDPR/HIPAA/SOC2) | AUD-COMPLIANCE-001 | /docs/platform/privacy-gdpr-hipaa-soc2.md |
| C10 – Data Architecture Overview | AUD-STORAGE-001, AUD-QUERY-001 | /docs/architecture/data-model.md |
| C11 – Storage Strategy (summary) | AUD-STORAGE-001 | /docs/implementation/persistence.md |
| C12 – Sequence Flows (append/query/export) | AUD-INGEST-001, AUD-QUERY-001, AUD-EXPORT-001 | /docs/architecture/sequence-flows.md |
| C13 – Deployment Views (baseline) | AUD-OPS-001 (DevOps & Envs) | /docs/architecture/deployment-views.md |
| C14 – API Gateway & Connectivity | AUD-GATEWAY-001 | /docs/architecture/architecture.md#api-gateway-connectivity |
| C15 – Observability & SLOs | AUD-OTEL-001 | /docs/operations/observability.md |
| C16 – Reliability & Resilience | AUD-CHAOS-001 | /docs/implementation/outbox-inbox-idempotency.md |
| C17 – Integrity & Tamper-Evidence | AUD-INTEGRITY-001 | /docs/hardening/tamper-evidence.md |
| C18 – SDK & Integration Guidance | AUD-SDK-001 | /docs/sdk/ |
| C19 – Risks & Mitigations | Governance cadence | /docs/architecture/architecture.md#risks-mitigations |
| C20 – ADR Index & Governance | ADR process | /docs/adrs/ |
| C21 – Traceability to Roadmap | Plan baseline | /docs/planning/index.md |
Links¶
- → Roadmap (Epics & Features) (../planning/index.md)
- → Quality Gates (../ci-cd/quality-gates.md)
- → Changelog (../reference/changelog.md)