
Architecture Overview — Audit Trail Platform (ATP)

This document provides the top-level architectural narrative for ATP. It explains why the platform exists, the guiding principles that shape every decision, and how readers should navigate the rest of the architecture set. Deep dives live in sibling docs (HLD, Components, Data Model, Sequence Flows, Deployment Views) per the table of contents.


Purpose

ATP is a secure, multi‑tenant audit and evidence platform that ingests, classifies, and stores immutable events from heterogeneous systems, making them queryable, exportable, and verifiable under strict compliance and data‑residency constraints. The architecture aims to:

  • Provide tamper‑evident append‑only storage with verifiable integrity signals.
  • Support high‑throughput ingestion and low‑latency query paths across tenants and editions.
  • Embed privacy, classification, and retention controls by design (not as an afterthought).
  • Offer clear integration contracts (REST/gRPC, events, webhooks) and tested SDKs.
  • Expose operational transparency (OTel traces/logs/metrics) with SLO‑backed reliability.
  • Remain cost‑aware and scalable, balancing hot/warm/cold storage and export patterns.

Architectural Principles

The following principles are non‑negotiable guardrails. Each downstream decision (interfaces, storage, indexing, deployment) should cite the relevant principle(s).

  • Security‑First & Zero‑Trust — Strong identity between workloads; least privilege; per‑tenant and per‑operation authorization; secrets/keys managed via KMS; encryption in transit and at rest.

  • Multi‑Tenant Isolation by Default — Tenant context is explicit in every interface and persisted form. Isolation uses a layered model (routing, authZ, RLS/filters, quotas, rate‑limits). No “best‑effort” multi‑tenancy.

  • Event‑Driven by Design — Ingestion, projection, and export are choreographed via durable messaging. Outbox/inbox and idempotency keys are mandatory on all critical paths.

  • Tamper‑Evidence & Integrity — Append‑only semantics; chain‑of‑hash/signature strategies; evidence manifests that can be independently verified at export/eDiscovery time.

  • Compliance‑by‑Design — Data classification, minimization, and retention are enforced at write time; residency controls and subject‑rights operations (DSR) are planned and testable.

  • Observability‑First — Every request and message is traceable end‑to‑end with correlation/tenant/edition tags. Golden signals and error budgets are defined per service and reflected in SLOs.

  • Resilience & Back‑Pressure — Timeouts, retries with jitter, bulkheads, DLQs, and circuit breakers are applied consistently. Components are idempotent and safe to replay.

  • API‑First & Contract‑Driven — REST/gRPC schemas and event contracts are versioned, linted, and backward compatible; producers/consumers are validated in CI with contract tests.

  • Scalability with Cost Discipline — Scale out hot paths; project read models fit for purpose; apply storage tiering (hot/warm/cold) and export batching windows to protect SLOs and cost envelopes.

  • Simplicity & Paved Roads — Prefer standard templates, libraries, and platform “paved roads” over bespoke solutions. Documentation and examples are treated as part of the product.

  • Governed Change & Traceability — Architectural decisions are captured as ADRs; artifacts and flows are versioned; changes reference the driving SLOs, risks, and compliance requirements.
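
The tamper-evidence principle above hinges on a chain of hashes: each append commits to everything before it, so altering any historical record invalidates every later link. A minimal illustrative sketch (Python for brevity; the platform itself runs on .NET, and the real Integrity service also signs checkpoints via KMS — function names here are hypothetical):

```python
import hashlib
import json

GENESIS = "0" * 64  # well-known starting link

def chain_hash(prev_hash: str, record: dict) -> str:
    """Hash of the previous link concatenated with the canonical record bytes."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(prev_hash.encode() + canonical).hexdigest()

def build_chain(records: list[dict]) -> list[str]:
    """Append-only: link i commits to records[0..i]."""
    hashes, prev = [], GENESIS
    for rec in records:
        prev = chain_hash(prev, rec)
        hashes.append(prev)
    return hashes

def verify_chain(records: list[dict], hashes: list[str]) -> bool:
    """Recompute from genesis; any tampered record breaks all later links."""
    return build_chain(records) == hashes
```

A verifier only needs the records and the published chain, which is what makes export-time independent verification possible.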

Audience & Reading Map


System Context (C4 L1)

ATP sits between event producers (first- and third-party systems) and consumers (operators, auditors, investigators, export/eDiscovery tooling). Clear trust boundaries ensure identity, tenancy, privacy, and integrity guarantees are preserved end-to-end.

Primary Actors

  • Event Producers — product microservices, backends, frontends, partner systems emitting audit events (REST/gRPC, webhooks).
  • Human Users — Operators/SRE (runbooks, dashboards), Auditors/Investigators (query & export), Tenant Admins (policy & access).
  • Automation — CI/CD agents and scheduled jobs exercising administrative APIs (policy/config rotation, projector rebuilds).

External Systems

  • Identity Provider (IdP) — OIDC/OAuth2 for users; workload identity for service-to-service.
  • Key Management (KMS) — encryption and signing keys; rotation; optional customer-managed keys.
  • Observability Stack — OTel collector, metrics/logs/traces backends, alerting.
  • Export Destinations — object storage targets, legal hold archives, SIEM/DLP, ticketing (optional webhooks).
  • Admin Surfaces — configuration/feature flags, policy repositories, schema registry.

Trust Boundaries & Zones

  • Ingress Boundary (Gateway) — authentication, tenancy resolution, request limits, schema validation, input sanitation.
  • Processing Boundary (Messaging/EDA) — durable delivery, outbox/inbox, idempotency keys, replay safety, DLQ isolation.
  • Data Boundary (Storage/Indexes) — encryption at rest, tenant-scoped access (RLS/filters), classification & retention enforcement.
  • Admin/Control Plane — privileged operations, break-glass workflows, audited changes and ADR-linked approvals.

Interface Surface (at a glance)

  • Inbound
    • REST/gRPC: /api/v{n}/audit/append, /api/v{n}/query/..., /api/v{n}/export/...
    • Webhooks: signed callbacks for event ingestion (optional connectors)
  • Outbound
    • Events: AuditRecord.Appended, AuditRecord.Accepted, Projection.Updated, Export.Requested|Completed
    • Webhooks: export completion, verification results (optional)
  • Admin
    • Policies: classification/retention APIs, edition flags
    • Ops: projector lag, replay tooling, DLQ management

Tenancy & Identity Propagation

  • Tenancy is explicit on every call/message (tenantId claim/header); enforced in the gateway and persisted in all stores.
  • Authorization is tenant-scoped (RBAC/ABAC) with edition gates; tokens are short-lived and audience-bound.
  • Observability carries tenantId, edition, traceId, and correlationId across boundaries.

System Context (mermaid)

flowchart LR
  subgraph External
    EP[(Event Producers)]
    AUD[(Auditors/Investigators)]
    OPS[(Operators/SRE)]
    IDP[(IdP/OIDC)]
    KMS[(KMS)]
    OBS[(OTel/Logs/Metrics)]
    DST[(Exports / eDiscovery Destinations)]
  end

  GW[API Gateway<br/>AuthZ • Tenancy • Rate Limit • Schema]
  BUS[(Event Bus)]
  ING[Ingestion Service<br/>validate • classify • retain]
  INT[Integrity Service<br/>hash chains • signatures]
  PROJ[Projection Service<br/>read models • lag control]
  QRY[Query Service<br/>tenant-scoped filters]
  EXP[Export Service<br/>packages • manifests]

  EP -->|REST/gRPC/Webhooks| GW
  GW --> BUS
  BUS --> ING
  ING --> INT
  ING -->|append| STORE[(Append Store)]
  ING --> PROJ
  PROJ --> QRY
  QRY -->|results| AUD
  EXP -->|packages| DST

  GW --- IDP
  ING --- KMS
  INT --- KMS
  GW --- OBS
  ING --- OBS
  PROJ --- OBS
  QRY --- OBS
  EXP --- OBS

  AUD -->|requests| GW
  OPS -->|dashboards/runbooks| QRY
  QRY --> EXP

Quality Attributes Anchored Here

  • Security & Privacy — zero-trust ingress, least-privilege access, data minimization at write.
  • Integrity — append-only semantics with verifiable chains/signatures.
  • Scalability — bursty producers absorbed via durable messaging and back-pressure.
  • Operability — golden signals and SLO budgets per service; traceability across all hops.
  • Cost Awareness — hot/warm/cold tiers; export windows; per-tenant quotas and rate limits.


Bounded Contexts & Context Map

The ATP domain is split into cohesive bounded contexts that collaborate via well-defined contracts (REST/gRPC, events, webhooks). Each context owns its model, persistence, and decision logic; integration relies on Published Language and Open Host Service patterns, with Anti-Corruption Layers at external edges.

Bounded Contexts (overview)

  • Gateway
    Responsibility — Ingress, authentication/authorization, tenancy resolution, rate limiting, schema validation, versioning.
    Contracts — REST/gRPC (OHS), request/response schemas.
    Notes — Enforces tenantId propagation and edition gates.

  • Ingestion
    Responsibility — Canonicalize & validate events, apply classification/retention at write, append to immutable store, emit acceptance signals.
    Contracts — Consumes REST/gRPC/webhooks from Gateway (OHS); publishes AuditRecord.Appended|Accepted.
    Notes — ACL for producer-specific payloads → canonical schema.

  • Policy
    Responsibility — Provide decisions for classification, retention, redaction; edition feature flags.
    Contracts — Decision API (sync) + policy change events.
    Notes — Customer–Supplier to Ingestion/Query/Export; Policy is the supplier.

  • Integrity
    Responsibility — Compute/verify hash chains & signatures; issue attestations and evidence manifests.
    Contracts — Subscribes to append pipeline; exposes verify endpoints; emits Integrity.Verified.
    Notes — Keys/rotation via KMS.

  • Projection
    Responsibility — Build/maintain read models and search indexes; track projector lag/watermarks; rebuild strategies.
    Contracts — Subscribes to AuditRecord.Accepted (and deltas); publishes Projection.Updated.
    Notes — Strict idempotency and replay safety.

  • Query
    Responsibility — Authorized, tenant-scoped retrieval over read models; policy-aware filtering/redaction.
    Contracts — REST/gRPC; optional GraphQL facade (internal).
    Notes — Surfaces selection for exports.

  • Search (optional)
    Responsibility — Full-text/time-range queries aligned with policy & tenancy.
    Contracts — Reads projection feed; exposes search API.
    Notes — May be disabled in small editions.

  • Export
    Responsibility — Package selections (from Query/Search) with manifests, signatures; manage delivery & legal hold.
    Contracts — REST to request/stream; events Export.Requested|Completed; optional webhooks.
    Notes — Throttled, resumable flows; long-running operations.

  • Admin/Control Plane
    Responsibility — Policies, schemas, feature flags, projector controls, replay/DLQ tooling.
    Contracts — Admin APIs; audit of changes; ADR links in metadata.
    Notes — Break-glass procedures with strict logging.

Relationship Patterns

  • Gateway → Ingestion — Open Host Service with a versioned Published Language.
  • Ingestion → Policy — Customer–Supplier (Ingestion conforms to Policy’s decisions).
  • Ingestion → Integrity — Conformist to integrity calculation rules; emits materials for verification.
  • Ingestion → Projection — Event choreography via durable topics.
  • Projection → Query/Search — Published Language for read models/index projections.
  • Query/Search → Export — Customer–Supplier (Export depends on Query’s selection semantics).
  • External Producers → Gateway — Anti-Corruption Layer at Ingestion to canonicalize.

Context Map (mermaid)

flowchart LR
  subgraph External Producers
    P1[(Product Services)]
    P2[(3rd-Party Systems)]
  end

  GW[Gateway<br />OHS + Published Language]
  ING[Ingestion<br />ACL + Canonicalizer]
  POL[Policy<br />Decisions: classify/retain/redact]
  INT[Integrity<br />Hash Chains/Signatures]
  PROJ[Projection<br />Read Models/Indexes]
  QRY[Query<br />Tenant-Scoped Retrieval]
  SRCH[Search<br />Optional Index API]
  EXP[Export<br />Packages + Manifests]
  ADM[Admin/Control Plane]

  P1 -->|REST/gRPC/Webhooks| GW
  P2 -->|REST/gRPC/Webhooks| GW
  GW --> ING

  ING -->|decisions| POL
  POL -->|policy changes| ING
  POL --> QRY
  POL --> EXP

  ING -->|append accepted| INT
  ING -->|events| PROJ
  PROJ --> QRY
  PROJ --> SRCH

  QRY -->|selection| EXP
  SRCH -->|selection| EXP

  ADM --- POL
  ADM --- PROJ
  ADM --- EXP

  classDef c fill:#F4F7FF,stroke:#5B6,stroke-width:1px,rx:6,ry:6;
  class GW,ING,POL,INT,PROJ,QRY,SRCH,EXP,ADM c;

Contract Snapshots (at a glance)

  • Events (Published Language)

    • AuditRecord.Appended → canonical event submitted (pre-commit checks passed)
    • AuditRecord.Accepted → persisted + classified + retained; integrity material ready
    • Projection.Updated → read model/index segment advanced (with watermark)
    • Export.Requested | Export.Completed → export lifecycle
    • Integrity.Verified → attestations for records/segments/packages
  • APIs (Open Host Service)

    • POST /api/v{n}/audit/append — Gateway → Ingestion (tenant/edition required)
    • GET /api/v{n}/query/... — tenant-scoped search & retrieval
    • POST /api/v{n}/export/... — create/stream export packages
    • POST /api/v{n}/policy/evaluate — (internal) policy decision snapshot
    • POST /api/v{n}/integrity/verify — verify record/segment/export

Modeling Notes

  • Ubiquitous Language — “audit record”, “evidence chain”, “manifest”, “projection lag”, “selection set” are domain terms; see domain language.
  • Idempotency Keys — (tenantId, sourceId, sequence|hash) for all ingestion paths.
  • Tenancy — Always explicit; persisted on write; filtered on read; included in traces.
  • Edition Awareness — Feature gates in Gateway/Policy; never UI-only gates.


Core Services & Containers (C4 L2)

This section presents the container view of ATP: the runtime building blocks (services and infrastructure) and their responsibilities, interfaces, data boundaries, and operational concerns. It complements the domain view by focusing on how capabilities are realized in deployable components.

Container View (diagram)

flowchart LR
  subgraph Edge
    G[API Gateway<br />AuthN/Z • Tenancy • RL • Versioning]
  end

  subgraph App Plane
    ING[Ingestion Service<br />validate • classify • retain]
    POL[Policy Service<br />classify • retain • redact • edition]
    INT[Integrity Service<br />hash chains • signatures • attest]
    PROJ[Projection Service<br />read models • indexes • lag]
    QRY[Query Service<br />search • filters • redaction]
    SRCH[Search Service - optional<br />full-text • time-range]
    EXP[Export Service<br />packages • manifests • deliver]
    ADM[Admin/Control Plane<br />schemas • flags • replay • DLQ]
  end

  subgraph Data Plane
    APPEND[(Append Store<br />append-only)]
    READ[(Read Models / Indexes)]
    COLD[(Cold Archive / eDiscovery)]
  end

  subgraph Platform
    BUS[(Event Bus / Topics)]
    KMS[(KMS / Secrets)]
    OTL[(OTel Collector / Logs / Metrics / Traces)]
    IDP[(IdP / OIDC)]
  end

  G -->|REST/gRPC| ING
  G -->|REST/gRPC| QRY
  QRY -->|select| EXP
  ING -->|decisions| POL
  POL --> QRY
  POL --> EXP
  ING -->|append| APPEND
  ING -->|events| PROJ
  PROJ --> READ
  SRCH --> READ
  QRY --> READ
  EXP -->|packages| COLD

  ING -.->|materials| INT
  INT -.->|verify| EXP

  G --- IDP
  ING --- BUS
  PROJ --- BUS
  EXP --- BUS
  ING --- KMS
  INT --- KMS
  G --- OTL
  ING --- OTL
  PROJ --- OTL
  QRY --- OTL
  EXP --- OTL

  classDef svc fill:#F4F7FF,stroke:#7aa6ff,stroke-width:1px,rx:6,ry:6;
  classDef plat fill:#fafafa,stroke:#c8c8c8,stroke-width:1px,rx:6,ry:6;
  class G,ING,POL,INT,PROJ,QRY,SRCH,EXP,ADM svc;
  class BUS,KMS,OTL,IDP,APPEND,READ,COLD plat;

Containers (one-liners)

  • API Gateway — Central ingress; AuthN/Z, tenancy resolution, rate limiting, schema & versioning.
  • Ingestion — Validates/canonicalizes events, applies classification/retention, writes append-only, emits acceptance.
  • Policy — Synchronously answers classification/retention/redaction queries; manages edition feature gates.
  • Integrity — Maintains hash chains and signatures; exposes verification APIs; issues evidence manifests.
  • Projection — Builds read models and indexes; tracks watermarks/lag; supports rebuilds.
  • Query — Tenant-scoped retrieval with policy-aware filtering and redaction; selection for exports.
  • Search (optional) — Full-text/time-range over projected data while respecting policy/tenancy.
  • Export — Packages selections + manifests, supports resumable streaming, and optional webhooks.
  • Admin/Control Plane — Config, feature flags, schema registry, DLQ/replay, break-glass ops.

Responsibilities Matrix

| Container | Purpose | Key Interfaces | Data Boundary | Scaling / Resilience | SLO Hints |
|---|---|---|---|---|---|
| API Gateway | Ingress, authZ, tenancy, RL, versioning | REST/gRPC; JWT/mTLS | N/A (stateless) | HPA on RPS; 429 for back-pressure | p95 auth+route ≤ X ms |
| Ingestion | Validate, classify, retain, append | REST/gRPC from Gateway; publishes events | Writes to Append Store | Outbox; consumer concurrency; DLQ | p95 append ≤ X ms; accept rate |
| Policy | Decisions: classify/retain/redact; edition | Sync decision API; policy change events | Policy store (read-mostly) | Cache + TTL; fallback modes | p95 decision ≤ Y ms |
| Integrity | Hash/sign; attest/verify | Verify API; subscribes to append | Integrity material store | Idempotent; replay-safe verify | p95 ≤ Z ms |
| Projection | Build read models / indexes | Subscribes to accepted events | Read Models/Indexes | Rebuild tooling; lag caps | projector lag ≤ N s |
| Query | Tenant-scoped retrieval, redaction | REST/gRPC; (opt) GraphQL | Read-only over Read Models | Cache; rate-limit; bulkheads | query p95/p99 targets |
| Search (opt) | Full-text/time queries | REST/gRPC | Search index | Async refresh; ISR windows | query p95 ≤ T ms |
| Export | Package & deliver w/ manifests | REST stream; events; (opt) webhook | Streams from Read/Cold | Resumable; batch windows | completion p95 for M rec |
| Admin | Policies, flags, replay, DLQ, schemas | Admin REST/gRPC | Control metadata | Strong audit; approvals | admin ops audited |

Data Containers

  • Append Store (hot) — Append-only write path; short retention for high-QPS ingestion and near-term verification.
  • Read Models / Indexes (warm) — Denormalized projections tailored to query/search; rebuildable from events.
  • Cold Archive (cold) — Long-term eDiscovery/export storage; immutability and legal-hold compatible.

Data tiering and shapes are detailed in Data Model (data-model.md) and Deployment Views (deployment-views.md).

Interface Summary

  • Inbound: POST /api/v{n}/audit/append, GET /api/v{n}/query/..., POST /api/v{n}/export/...
  • Events: AuditRecord.Appended, AuditRecord.Accepted, Projection.Updated, Export.Requested|Completed, Integrity.Verified
  • Admin: .../policy/*, .../admin/replay, .../admin/dlq, .../admin/schema

Cross-Cutting (applies to all services)

  • Tenancy: tenantId is mandatory at ingress and persisted end-to-end; enforced by gateway/middleware and data filters.
  • Security: Zero-trust defaults; mTLS in mesh; KMS-managed keys; short-lived tokens with audience/scope.
  • Observability: OTel traces/logs/metrics with tenantId, edition, traceId, correlationId; golden signals per service.
  • Resilience: Outbox/inbox, idempotent consumers, configured retries with jitter, circuit breakers, DLQ + replay tooling.
  • Cost: Rate-limits/quotas per tenant; projection/index retention; export batching windows.


Component Boundaries (C4 L3)

This section details the internal structure of each service using a Clean Architecture variant:

  • API (adapters) — HTTP/gRPC endpoints, webhook receivers.
  • Application (use-cases) — orchestration, policies, idempotency, transactions.
  • Domain (model) — aggregates, value objects, domain services.
  • Infrastructure (adapters) — persistence, messaging, cache, KMS, observability.

Dependency Rule: API → Application → Domain, with Domain independent of frameworks. Infrastructure depends inward via ports (interfaces) declared in Application/Domain.

Reference Component Map (Ingestion service)

flowchart LR
  subgraph API Layer
    Ctrl[AppendController<br />REST/gRPC]
    Hook[WebhookReceiver]
  end

  subgraph Application Layer
    UC[AppendAuditRecordUseCase]
    Pol[PolicyClient Port]
    Repo[AuditRecordRepository Port]
    Outb[Outbox Port]
    Idem[IdempotencyService]
    Val[SchemaValidator]
  end

  subgraph Domain Layer
    Agg[AuditRecord Aggregate]
    Clas[Classification VO]
    Ret[RetentionPolicy VO]
    Sig[IntegrityMaterial VO]
    DS[Domain Services]
  end

  subgraph Infrastructure Adapters
    RepoImpl[(NHibernate Repo)]
    Bus[(Message Broker Adapter)]
    KMS[(KMS Adapter)]
    Cache[(Cache Adapter)]
    OTel[(OTel Adapter)]
  end

  Ctrl --> UC
  Hook --> UC
  UC --> Val
  UC --> Pol
  UC --> Repo
  UC --> Outb
  UC --> Idem
  UC --> Agg
  Agg --> Clas
  Agg --> Ret
  Agg --> Sig

  RepoImpl -.implements.-> Repo
  Bus -.implements.-> Outb
  KMS -.used by.-> DS
  OTel -.used by.-> UC
  Cache -.used by.-> Pol

Flow (happy path)

  1. AppendController validates auth/tenancy → calls AppendAuditRecordUseCase.
  2. Use-case validates schema, checks idempotency, queries PolicyClient for classification/retention.
  3. Aggregate enforces invariants, produces domain events.
  4. Repository persists append-only record + outbox entry.
  5. Outbox relays an AuditRecord.Accepted event; OTel spans/logs recorded.
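
Steps 2–5 can be sketched end-to-end with an in-memory stand-in for the append store and transactional outbox. This is a hedged illustration (Python for brevity; names like `append_audit_record` are hypothetical — the real use-case is C# against the ports below), showing the idempotency check and the atomic record-plus-outbox write:

```python
from dataclasses import dataclass, field

@dataclass
class InMemoryStore:
    """Stand-in for the append store + outbox persisted in one transaction."""
    records: dict = field(default_factory=dict)
    outbox: list = field(default_factory=list)

def append_audit_record(store: InMemoryStore, ctx: dict, record: dict) -> str:
    key = ctx["idempotencyKey"]
    if key in store.records:                       # step 2: idempotency check
        return "noop"                              # duplicate delivery, no side effects
    enriched = {**record,                          # steps 2-3: policy decision applied
                "tenantId": ctx["tenantId"],
                "classification": ctx["policyDecision"]}
    # step 4: record + outbox entry written together (atomic in a real transaction)
    store.records[key] = enriched
    store.outbox.append({"subject": "audit.accepted", "idempotencyKey": key})
    return "accepted"
```

A duplicate submission returns "noop" and leaves exactly one outbox entry, which is what lets the relay in step 5 publish at-least-once without double-counting.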

Per-Service Component Boundaries

| Service | API Adapters | Application (Use-cases) | Domain (Aggregates/VO) | Outbound Ports | Infra Adapters |
|---|---|---|---|---|---|
| Gateway | Minimal APIs/gRPC, auth filters, tenancy middleware | RouteResolution, RateLimitCheck, SchemaGate | N/A (edge orchestration) | PolicyDecision, TokenIntrospection, RateLimit | OIDC/JWT, RateLimiter, Schema Registry |
| Ingestion | AppendController, WebhookReceiver | AppendAuditRecord, ClassifyAndRetain, AcceptEvent | AuditRecord, Classification, Retention, IntegrityMaterial | PolicyClient, AuditRecordRepository, Outbox | NHibernate, Broker, KMS, OTel |
| Policy | PolicyController | EvaluateClassification, EvaluateRetention, EvaluateRedaction | PolicySet, Rule, Decision | PolicyChangePublisher | PolicyStore (read-mostly), Broker, Cache |
| Integrity | VerifyController | ComputeChain, SignSegment, VerifyEvidence | EvidenceChain, Signature, Checkpoint | IntegrityRepo, KmsSigner, VerifyResultPublisher | NHibernate, KMS, Broker |
| Projection | (Internal) Health/Control endpoints | ApplyDelta, RebuildProjection, ManageLag | ReadModelSegment, Watermark | ProjectionRepo, LagMetrics | NHibernate/Read-store, Broker, OTel |
| Query | QueryController, (opt) GraphQL | ExecuteQuery, ApplyPolicyFilters, RedactFields | SelectionSet, RedactionPlan | ReadModelRepo, PolicyClient | Read-store adapters, Cache |
| Search (opt.) | SearchController | ExecuteSearch, BuildQueryPlan | SearchQuery, Facet, TimeRange | IndexRepo | Search index adapter |
| Export | ExportController, Webhook for completion | CreatePackage, StreamPackage, BuildManifest | ExportPackage, Manifest, SelectionRef | ReadModelRepo, ColdStoreSink, IntegrityVerifier | Object storage, Broker, KMS |
| Admin/Control | AdminController | UpdatePolicy, ReplayDLQ, Rebuild, FeatureToggle | AdminAction, Approval | DLQClient, FeatureFlagRepo | Broker Mgmt, Flags Store |

Ports live in Application/Domain, implemented by Infrastructure. Tests use in-memory or fakes against ports to keep fast, hermetic boundaries.


Ports & Adapters (canonical interfaces)

// Application ports (examples)
public interface IPolicyClient {
    Task<PolicyDecision> EvaluateAsync(AppendContext ctx, CancellationToken ct);
}
public interface IAuditRecordRepository {
    Task AppendAsync(AuditRecord record, CancellationToken ct);
}
public interface IOutbox {
    Task EnqueueAsync<T>(T evt, CancellationToken ct);
}
public interface IReadModelRepository {
    Task<QueryResult> ExecuteAsync(QuerySpec spec, CancellationToken ct);
}
public interface IIntegrityVerifier {
    Task<VerifyResult> VerifyAsync(VerifyTarget target, CancellationToken ct);
}
  • Inbound adapters: Minimal APIs, gRPC services, webhook receivers.
  • Outbound adapters: NHibernate repos, MassTransit/Azure Service Bus producers/consumers, KMS wrappers, cache providers, OTel exporters.

Boundary Rules (enforced)

  • Tenancy is a first-class parameter in all ports and aggregate constructors (no ambient singletons).
  • Idempotency: all write use-cases accept an IdempotencyKey; repositories guarantee upsert-or-noop semantics.
  • Transactions: application layer coordinates a transactional outbox; domain emits events, infra persists record + outbox atomically.
  • Validation: API does syntactic checks; Application performs semantic validation and policy evaluation; Domain enforces invariants only.
  • Error Mapping:
    • Domain validation → 400 / INVALID_ARGUMENT
    • AuthZ/tenancy violations → 403
    • Rate-limit/back-pressure → 429 (with retry hints)
    • Transient infra → retried with jitter; eventual DLQ after N attempts
  • No framework leakage into Domain (no HTTP types, no EF/NHibernate entities, no broker types).
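
The error-mapping rules above are mechanical enough to centralize in one table-driven translator. An illustrative sketch (Python for brevity; error kinds and the `map_error` helper are hypothetical names, not the platform's actual types):

```python
# Status codes follow the boundary rules above.
ERROR_TO_STATUS = {
    "DomainValidation": (400, "INVALID_ARGUMENT"),
    "TenancyViolation": (403, "PERMISSION_DENIED"),
    "RateLimited":      (429, "RESOURCE_EXHAUSTED"),  # client should honor retry hints
}

def map_error(kind: str, attempt: int = 0, max_attempts: int = 5):
    """Transient infra errors retry (with jitter) and only dead-letter after N attempts."""
    if kind == "TransientInfra":
        return ("retry", None) if attempt < max_attempts else ("dlq", None)
    status, code = ERROR_TO_STATUS.get(kind, (500, "INTERNAL"))
    return ("respond", (status, code))
```

Keeping the mapping in one place ensures every service signals the same protocol semantics for the same failure class.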

Concurrency & Scaling

  • API: stateless; scale on RPS; protect with rate limits and request budgets.
  • Consumers/Projectors: concurrency tuned per partition/shard; back-pressure from queue length & lag metrics.
  • Export: long-running streams; resumable with checkpoints; separated pool to avoid starving Query.

Observability & Policies (per boundary)

  • Span model: gateway → use-case → repo/outbox → broker → projector/query/export.
  • Attributes: tenantId, edition, idempotencyKey, policyVersion, watermark.
  • Logs: structured; sensitive fields redacted by classification tags.
  • Metrics: use-case latency, outbox age, projector lag, export TTFB/completion.
  • Policies as code: policy version stamped on write; propagated in events; evaluated again on read/export.
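
Redaction by classification tag, as required for logs above, amounts to filtering each field through the caller's visibility set. A minimal sketch (Python for brevity; `redact_for_log` and the tag names are illustrative assumptions):

```python
def redact_for_log(record: dict, tags: dict[str, str],
                   visible: frozenset = frozenset({"PUBLIC"})) -> dict:
    """Mask any field whose classification tag is not in the visible set.

    Fields without an explicit tag default to PUBLIC here; a stricter
    deployment would default to the most restrictive class instead.
    """
    return {k: (v if tags.get(k, "PUBLIC") in visible else "[REDACTED]")
            for k, v in record.items()}
```

Because the classification tags are stamped at write time, the same function can serve log sinks, traces, and export previews consistently.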

Technology Anchors

  • Runtime: .NET 9, REST APIs/gRPC.
  • Persistence: NHibernate (write/read models), repositories per aggregate.
  • Messaging: MassTransit over Azure Service Bus (topics/queues, DLQ).
  • Security: OIDC, mTLS (mesh), KMS envelope encryption & signing.
  • Telemetry: OpenTelemetry traces/logs/metrics; dashboards per service.
  • Testing: unit tests against ports; contract tests for REST/events; projector/replay integration tests.

Example: Error-to-Protocol Mapping

| Boundary | Failure | Handling | Client Signal |
|---|---|---|---|
| API ingress | Schema invalid | 400 + error codes | X-Request-Id |
| Application | Policy denies write | 403 | Problem+JSON body |
| Repository | Unique-key duplicate (idempotent) | 200 (noop) | Idempotent: true |
| Broker | Publish timeout | Retry + jitter → outbox relay | n/a (internal) |
| Export stream | Client disconnect | Checkpoint + resume | Range support |


Event-Driven Communication Plan

ATP is event-driven by design. Events are the backbone for ingestion acceptance, integrity processing, projection, and exports. We target at-least-once delivery with exactly-once effects via idempotency keys, transactional outbox/inbox, and idempotent consumers. Ordering is scoped, never global.

Patterns & Guarantees

  • Delivery: at-least-once from broker; consumers must be idempotent.
  • Exactly-once intent: (tenantId, sourceId, sequence|hash) as the IdempotencyKey; repository upsert/no-op semantics.
  • Ordering: guaranteed only within a partition key (e.g., tenantId:sourceId). Do not assume global order.
  • Back-pressure: rate limits at gateway, consumer concurrency caps, queue depth alerts, retry with jitter, DLQs.
  • Replay safety: projectors/exporters are replay-tolerant; watermarks control catch-up.
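
The at-least-once/exactly-once-effects pairing above reduces to one consumer-side pattern: record a receipt per idempotency key, and drop anything already seen. An illustrative sketch (Python for brevity; a production inbox stores receipts durably for M days rather than in memory):

```python
processed: set[str] = set()  # inbox receipts; durable and time-bounded in production

def handle(event: dict, apply) -> bool:
    """Idempotent consumer: apply side effects at most once per key."""
    key = event["idempotencyKey"]
    if key in processed:
        return False            # duplicate delivery from the broker: drop
    apply(event)                # real side effect (projection update, export step, ...)
    processed.add(key)          # receipt written in the same transaction as the effect
    return True
```

Writing the receipt and the side effect in one transaction is what turns at-least-once delivery into exactly-once effects.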

Topic & Subscription Topology (logical)

| Subject (topic) | Producers | Consumers | Purpose |
|---|---|---|---|
| audit.appended | Ingestion | Integrity, Projection | Raw append accepted for downstream processing |
| audit.accepted | Ingestion | Projection, Search (opt) | Persisted + classified + retained |
| projection.updated | Projection | Query, Search (opt), Export | Read models advanced; watermark/lag hints |
| export.requested | Query, API | Export | Start packaging workflow |
| export.completed | Export | API/Webhooks, Integrity (opt verify) | Notify completion; attach manifest |
| integrity.verified | Integrity | API, Export (attach to packages) | Attest records/segments/packages |
| policy.changed | Admin/Policy | Ingestion, Query, Export | Cache bust + version pin updates |

Each topic has named subscriptions per service (e.g., projection-svc, export-svc) and a DLQ (<topic>.<subscription>.dlq).

Message Contracts (snapshot)

We publish a Published Language with clear evolution rules (see Message Schemas).

{
  "eventId": "01J8X3TB2Z9WQ6M9P3E2A4K7QG",       // ULID preferred
  "subject": "audit.accepted",
  "schemaVersion": "1.2.0",
  "occurredAt": "2025-10-22T05:12:31Z",
  "tenantId": "t-9d8e...",
  "edition": "enterprise",
  "producer": {
    "service": "ingestion",
    "instance": "ingestion-7fcb9f",
    "region": "eus2"
  },
  "correlation": {
    "traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
    "correlationId": "c-8e7f...",
    "causationId": "evt-01J8X3..."
  },
  "idempotencyKey": "t-9d8e:src-abc:seq-00000045",
  "policy": {
    "classification": "SENSITIVE",
    "retentionPolicyId": "rp-2025-01",
    "policyVersion": 17
  },
  "payload": {
    "recordId": "rec-01H...",
    "sourceId": "src-abc",
    "hash": "sha256:...",
    "sizeBytes": 5342,
    "attributes": { "actorId": "u-123", "action": "UPDATE", "resource": "Order/4711" }
  }
}

Headers (transport)

  • content-type: application/json; charset=utf-8
  • x-tenant-id, x-edition
  • traceparent, tracestate
  • x-idempotency-key, x-schema-version
  • x-classification, x-policy-version

No PII in headers. Payload fields carrying PII are classified and redacted on sinks/logs.

Evolution & Compatibility

  • Additive changes → bump minor (1.1 → 1.2); fields are optional by default.
  • Breaking changes → new subject or major bump with side-by-side consumers.
  • Schema registry: producers/consumers validated in CI; contract tests block incompatible change.
  • Deprecations: announce N releases ahead; dual-publish during migration windows.
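
Because fields are optional by default, the compatibility rule above collapses to a major-version check. A minimal sketch (Python for brevity; `compatible` is an illustrative helper, not the schema registry's actual API):

```python
def compatible(consumer_version: str, producer_version: str) -> bool:
    """Minor/patch bumps are additive (fields optional by default); majors break."""
    consumer_major = int(consumer_version.split(".")[0])
    producer_major = int(producer_version.split(".")[0])
    return consumer_major == producer_major
```

A CI contract test would run this check (plus field-level diffing) for every producer/consumer pair and block merges on a mismatch.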

Choreography (happy path)

sequenceDiagram
  autonumber
  participant P as Producer
  participant GW as API Gateway
  participant ING as Ingestion
  participant BUS as Event Bus
  participant INT as Integrity
  participant PROJ as Projection
  participant QRY as Query
  participant EXP as Export

  P->>GW: POST /audit/append (tenant, idempotencyKey)
  GW->>ING: Append command
  ING->>ING: validate + classify + retain + persist
  ING->>BUS: publish audit.appended
  ING->>BUS: publish audit.accepted
  BUS-->>INT: audit.appended
  INT->>BUS: integrity.verified (optional later)
  BUS-->>PROJ: audit.accepted
  PROJ->>BUS: projection.updated (watermark)
  QRY->>QRY: read models current for queries
  QRY->>BUS: export.requested (on demand)
  BUS-->>EXP: export.requested
  EXP->>BUS: export.completed (manifest, links)

Partitioning & Ordering

  • Partition key: tenantId (default). For high-volume producers, prefer tenantId:sourceId.
  • Ordering guarantees: within partition, best-effort FIFO; consumers must handle reordering and duplicates.
  • Large tenants: increase consumer concurrency; shard by sourceId to avoid hot partitions.

Reliability & Retry

  • Producers: exponential backoff with jitter; bounded retries; overflow to outbox for relay.
  • Consumers: N (e.g., 5) attempts with backoff → DLQ; DLQ triage dashboards and replay tools.
  • Inbox de-dup: store (eventId|idempotencyKey) receipt for M days; drop duplicates.
  • Poison messages: dead-letter with diagnostic envelope (error code, stack, schema version, payload hash).
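
The producer retry policy above is the standard "full jitter" exponential backoff: each attempt waits a random delay drawn from a window that doubles per attempt, capped. A minimal sketch (Python for brevity; `base` and `cap` values are illustrative defaults, not platform configuration):

```python
import random

def backoff_with_jitter(attempt: int, base: float = 0.2, cap: float = 30.0) -> float:
    """Full jitter: random delay in [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

The randomness spreads retries from many producers apart in time, avoiding the synchronized retry storms that fixed backoff creates after a broker blip.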

Security & Privacy

  • Authz on topics: least privilege per service identity; publish/subscribe explicit.
  • PII discipline: only classified fields in payload; never in subjects/headers/topic names.
  • Encryption: in transit (TLS) and at rest (broker & stores); sensitive attachments use KMS envelope encryption.
  • Signature material: events may carry content digests; Integrity service signs chains and manifests.

Observability

  • Tracing: propagate traceparent/correlationId; spans from GW→ING→BUS→INT/PROJ→QRY/EXP.
  • Metrics: publish rate, error rate, outbox age, queue depth, consumer lag, DLQ count, replay count.
  • Logs: structured, redacted by classification; include tenantId, edition, eventId.

Replay & Backfills

  • Controlled replay: per-tenant/window; watermark caps; idempotent processing mandated.
  • Runbook: identify root cause, drain DLQ, run replay job with dry-run, monitor lag & budgets.
  • Projection rebuilds: snapshot + replay strategy; export paused/throttled during rebuild windows.


Multitenancy & Tenancy Guards (overview)

ATP is multi-tenant by default. Tenancy must be explicit, verifiable, and enforced at every boundary: ingress → messaging → storage → observability → exports. Edition flags refine capability exposure per tenant without weakening isolation.

Tenancy Model & Terms

  • Tenant — an organizational boundary; primary key tenantId present in every API call, event, and persisted row.
  • Edition — capability set for a tenant (e.g., Standard/Enterprise); evaluated at Gateway and Policy.
  • Tenant Context — normalized, immutable tuple carried across layers: { tenantId, edition, authz, policyVersion, correlationId }.
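
The Tenant Context can be modeled as an immutable value object. The field names below follow the tuple above; the concrete types (and the scope set for `authz`) are assumptions for this sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: the context is normalized once and never mutated downstream
class TenantContext:
    tenant_id: str
    edition: str             # e.g., "standard" | "enterprise"
    authz: frozenset[str]    # granted scopes/roles, tenant-scoped (assumed shape)
    policy_version: int
    correlation_id: str

ctx = TenantContext("acme", "enterprise", frozenset({"audit:append"}), 7, "corr-123")
```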

Isolation Layers (defense-in-depth)

  1. Ingress — require tenantId; validate edition gates; enforce per-tenant rate limits/quotas and payload limits.
  2. AuthZ — tenant-scoped RBAC/ABAC; tokens must include tenant claim; cross-tenant operations are rejected.
  3. Messaging — partition keys include tenantId (or tenantId:sourceId); topic ACLs scoped to service identity.
  4. Storage — tenantId is part of every key; row-level filters (or RLS) applied in repositories; encryption scope may be per tenant.
  5. Cache — keys are tenant-scoped; no global caches for sensitive projections.
  6. Observability — traces/logs/metrics labeled with tenantId and edition; logs redact PII.
  7. Operations — per-tenant runbooks: suspend, throttle, replay, export, legal hold, and delete (where allowed).

Tenancy Propagation (canonical)

  • Authenticate at Gateway; extract tenantId from token (preferred). Accept X-Tenant-Id only for trusted workload identities; normalize into Tenant Context.
  • Attach Tenant Context to trace and message headers; persist with write operations.
  • Verify Tenant Context at service boundary (middleware) and inside each use-case.
  • Refuse requests if tenant is missing/mismatched; never infer tenant from data.
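
The resolution rules above can be sketched as a small gateway helper. The `TRUSTED_WORKLOADS` allow-list is a hypothetical stand-in for real workload-identity verification.

```python
class TenantMismatch(Exception):
    """Raised when tenant is missing or header/token tenants disagree."""

TRUSTED_WORKLOADS = {"svc-ingestion"}  # hypothetical identities allowed to use X-Tenant-Id

def resolve_tenant(token_claims: dict, headers: dict, caller_identity: str) -> str:
    claim = token_claims.get("tenantId")
    header = headers.get("X-Tenant-Id")
    if claim:
        if header and header != claim:
            raise TenantMismatch("header/token tenant mismatch")
        return claim  # token claim is the preferred source
    # header-only resolution is reserved for trusted workload identities
    if header and caller_identity in TRUSTED_WORKLOADS:
        return header
    raise TenantMismatch("tenant missing")  # never infer tenant from data
```
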
```mermaid
sequenceDiagram
  autonumber
  participant C as Client
  participant GW as API Gateway
  participant ING as Ingestion (Use-case)
  participant DB as Append Store
  Note over C: JWT contains tenantId=acme<br />edition=enterprise
  C->>GW: POST /audit/append (JWT, X-Idempotency-Key)
  GW->>GW: validate JWT, resolve tenant/edition, rate-limit
  GW->>ING: Append(cmd, TenantContext)
  ING->>ING: check authZ & edition gates
  ING->>DB: write (tenantId, record, policyVersion)
  ING-->>GW: 202 Accepted (traceId, tenantId)
```

Data Isolation Patterns

  • Single DB, tenant-partitioned — default; tenantId in composite keys, enforced repository filters, and query specs.
  • Per-tenant schema — optional for very large tenants; keep API/contracts identical; infra complexity increases.
  • Encryption — at rest; keys can be rotated globally or per tenant; evidence manifests do not expose keys.
  • Backups/Restore — support scoped restore by tenant or time window; integrate with legal hold.
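
The default single-DB, tenant-partitioned pattern can be illustrated with a repository that fails closed on tenant-less queries; the in-memory row list is a stand-in for the real store.

```python
class AuditRepository:
    """Sketch of an enforced repository filter: tenantId is mandatory on every read."""

    def __init__(self):
        self._rows: list[dict] = []

    def append(self, tenant_id: str, record: dict) -> None:
        self._rows.append({"tenantId": tenant_id, **record})

    def query(self, tenant_id: str, **filters) -> list[dict]:
        if not tenant_id:
            raise ValueError("tenantId is required")  # fail closed, never scan globally
        return [
            r for r in self._rows
            if r["tenantId"] == tenant_id
            and all(r.get(k) == v for k, v in filters.items())
        ]

repo = AuditRepository()
repo.append("acme", {"recordId": "01A", "action": "CREATE"})
repo.append("globex", {"recordId": "01B", "action": "DELETE"})
```

A cross-tenant query simply returns zero rows, which matches the "No rows (isolation)" failure signal in the guardrails checklist below.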

Messaging & Events

  • Partitioning by tenantId (or tenantId:sourceId) ensures localized ordering and hot-shard control.
  • Headers carry tenantId, edition, traceparent; payload contains tenantId for de-dup/inbox.
  • No cross-tenant joins in consumers or projectors; selection for export is tenant-scoped.

Edition Gating

  • Gateway rejects routes not enabled for the tenant’s edition.
  • Policy returns decisions (e.g., advanced retention) only if edition allows; Query/Export re-evaluate gates to prevent confused-deputy issues.

Guardrails (checklist)

| Boundary | Guard | Enforced by | Failure signal |
|---|---|---|---|
| Ingress | tenantId required + edition check | Gateway middleware | 400/403 |
| Ingress | Per-tenant RL/quotas | Gateway | 429 (+ retry hints) |
| Use-case | Tenant/edition validation | Application layer | 403 |
| Repo | Tenant filter / RLS | Repository/ORM | No rows (isolation) |
| Events | Tenant partition key | Producer/Outbox | Reject publish if missing |
| Cache | Tenant-scoped keys | Cache adapter | N/A (internal) |
| Logs | Redact PII; add tenant labels | Logging pipeline | N/A |
| Export | Tenant-scoped selection & manifest | Export service | 403 on cross-tenant |

Threats & Mitigations

  • Tenant spoofing → accept tenant only from validated token/identity; normalize once at Gateway.
  • Confused deputy → re-evaluate edition/ABAC on every read/export; never trust upstream UI.
  • Noisy neighbor → per-tenant limits on RPS, storage, concurrent exports, and projector throughput.
  • Data bleed → repository filters + contract tests; synthetic cross-tenant tests in CI.

Testable Controls

  • Contract tests: every API requires tenantId and rejects mismatches.
  • Repo tests: queries without tenantId fail build/lint; cross-tenant fixtures return 0 rows.
  • Messaging tests: publish fails without tenant headers; consumer rejects orphan events.
  • Observability tests: traces/metrics include tenant labels; redaction verified in logs.


Security Architecture (Zero Trust)

ATP adopts a Zero Trust posture end-to-end: never trust, always verify; strong identity for users and workloads, least privilege at every hop, and continuous policy evaluation (tenant + edition + data classification). Controls are layered across ingress, mesh, messaging, storage, observability, and exports.

Security Objectives

  • Confidentiality — prevent unauthorized data access across tenants and tiers.
  • Integrity — tamper-evident writes and verifiable exports, with cryptographic proofs.
  • Availability — resilient controls that degrade safely (e.g., deny on policy fetch failures), with back-pressure and circuit breakers.
  • Accountability — high-fidelity audit trails of admin and data actions, correlatable across services.

Identity & Access (Users and Workloads)

  • Human users — OIDC/OAuth2, MFA, SSO; RBAC/ABAC scoped to tenantId and edition; short-lived tokens; refresh via secure flows.
  • Workloads — workload identity (SPIFFE-like or equivalent); service-to-service mTLS in mesh; audience/scope-bound JWTs when used.
  • Token hygiene — short expirations, clock-skew tolerance, replay protection (nonce/PKCE where applicable), proof-of-possession optional for high-risk flows.
  • Authorization — gateway performs coarse checks; services apply fine-grained ABAC (tenant/edition/policy decision) at each use-case.

Boundary Controls

  • API Gateway / WAF
    • JWT validation (issuer/audience/exp), tenancy resolution, schema & size limits, rate limiting/quotas per tenant, IP allow-lists for admin APIs.
    • Threat mitigation: SQL/JSON injection filters, deserialization guards, content-type validation, CORS/CSRF protections for console UIs.
  • Service Mesh / Network
    • Mutual TLS by default; L7 authorization for service identities; namespace/network policies; egress allow-lists and DNS pinning for critical deps.
  • Messaging
    • Per-service publish/subscribe ACLs; topic-level encryption at rest; headers sanitized (no PII); DLQ isolation with audited replay.
  • Storage
    • Row-level isolation (tenant filters / RLS); encryption at rest; audit tables for admin actions; classified fields drive redaction and masking.

Data Protection

  • In transit — TLS 1.2+ everywhere; HSTS on public edges; mTLS inside the mesh; secure cipher suites; TLS secrets managed in KMS/secret store.
  • At rest — envelope encryption with KMS-managed keys; per-env key hierarchy; optional per-tenant keys for high-assurance tenants; key versioning on rotation.
  • Field/classification aware — classification labels at write time control: persistence, logging, projections, and export redaction.

Keys, Secrets & Rotation

  • Secrets in secret manager (never in images/YAML); use managed identities over static credentials.
  • Key rotation schedules for signing/encryption and TLS certs; dual-key windows for smooth transitions; automated provenance of rotations.
  • HSM-backed keys optional; audit every read of high-value secrets.

Integrity & Cryptography (platform interplay)

  • Integrity service maintains hash chains (per tenant/range) and digital signatures for segments and export manifests.
  • Canonicalization of records before hashing; recorded policyVersion and chain checkpoints to simplify offline verification.
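
The canonicalize-then-hash step can be sketched as follows, assuming sorted-key compact JSON as the canonical form (the actual canonicalization rules belong to the Integrity service), together with a per-tenant hash chain where each link commits to its predecessor.

```python
import hashlib
import json

def canonical_digest(record: dict) -> str:
    """Hash a canonical serialization so semantically equal records hash equally."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def extend_chain(prev_chain_hash: str, record_digest: str) -> str:
    """Append one link: the new chain hash commits to both the previous link and the record."""
    return "sha256:" + hashlib.sha256((prev_chain_hash + record_digest).encode("utf-8")).hexdigest()
```

Because each link depends on the previous one, tampering with any record invalidates every subsequent chain hash, which is what the rolling checkpoints verify offline.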

Supply Chain Security

  • SBOM generation on every build; dependency scan/block on critical CVEs.
  • Image signing and provenance (e.g., attestations) enforced at admission; non-root, read-only FS, dropped Linux capabilities, seccomp/apparmor profiles.
  • Infrastructure as Code scanning; drift detection; locked registries and private base images.

Threat Model (snapshot) & Mitigations

| Threat | Vector | Control |
|---|---|---|
| Tenant impersonation | Forged headers, token substitution | Accept tenant only from validated token; normalize at gateway; bind token audience/scope |
| Data exfiltration | Over-permissive roles, broad exports | ABAC with least privilege; export selection gated; watermarking; per-tenant quotas & approvals |
| Injection / deserialization | Untrusted payloads | Strict content-type; schema validation; JSON size caps; safe parsers |
| Replay / duplication | Message re-delivery | Idempotency keys; inbox receipts; event ULIDs; consumer de-dup |
| Side-channel / noisy neighbor | Resource contention | Per-tenant rate limits/quotas; bulkheads; isolated exporter pools |
| Secret leakage | Misplaced configs/logs | Secret store; redaction pipeline; zero PII in headers; structured logs with classifiers |
| Supply chain compromise | Tainted deps/images | SBOM+scans; signed images; verified provenance; gated deploys |
| Stale keys/certs | Missed rotation | Rotation SLOs; dual-publish keys; monitors/alerts; break-glass runbook |

Security Telemetry & Auditing

  • Traces: carry tenantId, edition, policyVersion, traceId, correlationId across gateway→service→broker→store.
  • Logs: structured, classification-aware redaction; admin actions logged with actor, scope, before/after diff (where safe).
  • Metrics: auth failures, rate-limit hits, policy denies, DLQ volume, export anomalies, KMS latency/errors.
  • Alerts: token validation spikes, cross-tenant access attempts, chain verification failures, unexpected export surges.

Incident Response & Break-Glass

  • Playbooks: per-tenant isolation, revoke tokens, rotate affected keys, pause exports, enable heightened logging, notify stakeholders.
  • Forensics: immutable log retention; chain checkpoints; snapshot policies; export manifest verification.
  • Containment: gateway blocks by tenant/route; mesh denies by service identity; selective projector/export throttling.

Secure Defaults & Hardening Checklist

  • Non-root containers; read-only FS; minimal capabilities; pinned distroless bases.
  • Egress restricted; DNS allow-lists; outbound proxies where required.
  • HTTP security headers (HSTS, CSP, X-Content-Type-Options, Referrer-Policy) on public UIs.
  • Least privilege IAM for cloud resources; scoped service identities; deny-by-default policies.
  • Console/admin endpoints behind SSO + IP allow-lists + step-up auth.


Compliance & Privacy (GDPR/HIPAA/SOC2) — Overview

ATP is designed as a privacy-by-design, compliance-by-default platform. Controls are embedded in the write path (classification/minimization/retention), enforced across read/export, and evidenced via immutable audit trails and integrity proofs.

Roles & Responsibilities

  • Typical role: ATP operates as a Processor for tenant Controllers; some admin/telemetry data may make ATP a limited Controller (documented in DPA).
  • Sub-processors: declared per environment; contracts require equivalent safeguards.
  • Agreements: DPA (GDPR), BAA (HIPAA) for PHI workloads, SOC 2 reporting for trust criteria.

Regulatory Anchors (focus)

  • GDPR: lawful basis, transparency, DSR (access/erasure/export), 72-hour breach notice to the supervisory authority where required, minimization, storage limitation, data transfers/residency.
  • HIPAA: PHI protection, minimum necessary, access controls, audit controls, integrity, transmission security, breach notification ≤ 60 days to affected individuals; BAA in place.
  • SOC 2: Trust Service Criteria — Security, Availability, Processing Integrity, Confidentiality, Privacy — evidenced through technical and procedural controls.

Controls Matrix (snapshot)

| Requirement | Control (design) | Where enforced | Evidence / Artifacts |
|---|---|---|---|
| Data minimization | Canonical schema + policy evaluation at write | Ingestion + Policy | Schema registry, policy version in events, unit/contract tests |
| Classification & redaction | Field tags drive redaction/masking | Ingestion, Query, Logs/Exports | Redaction library, log scrubbing tests, sample redacted exports |
| Retention & deletion | Retention stamped on write; lifecycle jobs | Ingestion, Lifecycle jobs | Retention policy catalog, job logs, deletion attestations |
| Residency/sovereignty | Region-aware routing, per-tenant storage map | Gateway, Storage | Tenant residency map, deploy topology diagrams |
| Access control | Tenant-scoped RBAC/ABAC, edition gates | Gateway, Services | AuthZ policy files, access logs, ABAC tests |
| Auditability | Immutable admin/data action logs | All services | Append logs, admin trails, correlation/traces |
| Export & portability | Package + manifest + signatures | Export, Integrity | Export manifests, signatures, hash chains |
| Incident response | Runbooks, alerts, forensics snapshots | Ops/Runbook | IR playbooks, alert policies, drill reports |

Data Lifecycle (end-to-end)

  1. Collect — strictly necessary attributes; reject unknown fields by default.
  2. Classify — tag sensitivity at write; bind policy version.
  3. Store — encrypted at rest; tenant-scoped keys (optional per tenant).
  4. Project — read models exclude disallowed fields; derived data tracked.
  5. Retain — timers enforce storage limitation; legal hold overrides tracked.
  6. Delete/Anonymize — policy-driven purge/anonymization with proofs.
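
Steps 2 and 5 can be sketched together: classification tags and a retention policy are stamped onto the record at write time. The field map, sensitivity ordering, and policy identifiers below are invented for this sketch.

```python
POLICY_VERSION = 7  # hypothetical current policy set version
SENSITIVITY = ["PUBLIC", "INTERNAL", "PERSONAL", "SENSITIVE"]

FIELD_CLASSIFICATION = {  # illustrative excerpt of a classification catalog
    "actorId": "PERSONAL",
    "attributes.email": "PERSONAL",
    "resource": "INTERNAL",
}

def _present(record: dict, path: str) -> bool:
    """Walk a dotted field path (e.g., attributes.email) into nested dicts."""
    node = record
    for part in path.split("."):
        if not isinstance(node, dict) or part not in node:
            return False
        node = node[part]
    return True

def stamp(record: dict, retention_policy_id: str = "std-365d") -> dict:
    """Return a copy of the record with classification, retention, and policy stamps."""
    highest = "PUBLIC"
    for field_path, cls in FIELD_CLASSIFICATION.items():
        if _present(record, field_path) and SENSITIVITY.index(cls) > SENSITIVITY.index(highest):
            highest = cls
    return {
        **record,
        "classification": highest,
        "retentionPolicyId": retention_policy_id,
        "policyVersion": POLICY_VERSION,  # immutable once stamped
    }
```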

Data Subject Requests (DSR) — Workflow

```mermaid
sequenceDiagram
  autonumber
  participant U as User/Tenant Admin
  participant GW as API Gateway
  participant Q as Query Service
  participant EXP as Export Service
  participant L as Lifecycle/Retention
  U->>GW: Submit DSR (access/export/erasure)
  GW->>Q: AuthZ + tenant scope, locate records
  Q-->>GW: Result set / pointers
  alt Access/Export
    GW->>EXP: Create export package
    EXP-->>U: Download + manifest (portable)
  else Erasure
    GW->>L: Schedule policy-compliant deletion (legal holds respected)
    L-->>GW: Deletion attestation
    GW-->>U: Completion notice + evidence
  end
```

SLA guidance

  • GDPR DSR response: typically ≤ 30 days (track in runbooks).
  • Breach notifications: GDPR 72h (to the supervisory authority where required); HIPAA ≤ 60 days to affected individuals.

Privacy by Design (architectural hooks)

  • Policies as code: versioned policy sets; decisions stamped on write and re-evaluated on read/export.
  • No PII in headers: classification prevents leakage to logs/metrics; sensitive fields redacted.
  • Least privilege: tenant RBAC/ABAC at every use-case; exporter isolation/bulkheads.
  • DPIA triggers: new data categories, cross-region transfers, novel large-scale processing — require review/ADR.

Residency & Transfers

  • Region binding: tenant → region mapping; data stays in region unless contractually permitted.
  • Cross-border: blocked by default; explicit policy + contractual basis required.
  • Backups/restore: region-scoped; tenant-targeted restore supported.

Monitoring & Evidence

  • Signals: policy deny rates, retention job failures, export volume anomalies, cross-region attempts.
  • Evidence pack: policy catalog, schema registry snapshots, export manifests, chain checkpoints, IR drill reports.
  • Periodic attestations: automated reports feed SOC 2 control testing.

Testable Controls

  • CI checks for classification tags on new fields; rejection if missing.
  • Synthetic DSR tests (access/export/erasure) per environment.
  • Retention dry-run reports; deletion requires attestation artifacts.
  • Policy evolution contract tests (additive vs breaking).


Data Architecture Overview

ATP’s data layer is optimized for append-heavy writes, policy-aware reads, and provable integrity. We separate hot append storage, warm read models/indexes, and cold archival to balance performance, cost, and compliance.

Data Primitives (canonical types)

  • AuditRecord — canonicalized event with tenant/actor/resource, timestamps, attributes (flattened JSON), and policy stamps.
  • PolicyDecision — classification, retention, redaction directives (versioned).
  • EvidenceMaterial — content digests, chain IDs, signatures, checkpoints.
  • ProjectionSegment — denormalized slices for query/search; watermark & lag metadata.
  • ExportPackage — immutable bundle manifest with hashes, signatures, and lineage.

Write Path (append-only)

```mermaid
sequenceDiagram
  autonumber
  participant GW as Gateway
  participant ING as Ingestion
  participant APP as Append Store (hot)
  participant INT as Integrity
  participant BUS as Event Bus
  GW->>ING: Append(cmd + TenantContext)
  ING->>ING: Canonicalize + Validate + Policy.Evaluate
  ING->>APP: Append(AuditRecord + PolicyDecision)
  ING->>BUS: audit.appended / audit.accepted
  ING->>INT: Provide digest material (async)
```

Consistency: write path is strong for a single record; projections are eventually consistent (seconds-level lag budgets).

Keys, Partitions, and Time

| Concept | Strategy |
|---|---|
| Primary key | ULID recordId for monotonic ordering within time partitions |
| Idempotency key | (tenantId, sourceId, sequence \| hash) |
| Partitioning | by time bucket (e.g., day) and tenantId; hot shards can further include sourceId |
| Timestamps | occurredAt (source), receivedAt (gateway), committedAt (store); all UTC, ISO-8601 |

Canonical Schema (fields snapshot)

| Field | Type | Notes |
|---|---|---|
| recordId | ULID | unique; sortable |
| tenantId | string | required on every record/index |
| sourceId | string | producer/system origin |
| actorId | string | user/service (classified) |
| action | string | verb (CREATE/UPDATE/DELETE/…) |
| resource | string | dotted path (e.g., Order/4711) |
| attributes | object | flat/normalized JSON; unknowns rejected unless whitelisted |
| occurredAt / receivedAt / committedAt | datetime | UTC |
| policyVersion | int | immutable once stamped |
| classification | enum | PUBLIC/INTERNAL/… (used for redaction) |
| retentionPolicyId | string | determines lifecycle |
| digest | string | content hash (sha256:…) |
| chainId / chainIndex | string / int | integrity chain placement |

Full schema lives in Message Schemas (../domain/contracts/message-schemas.md). Schema changes follow additive-first rules and contract tests.

Read Models & Indexing (warm tier)

  • Timeline model (by tenant/resource/actor, time-range).
  • Facet/aggregation model (counts by action/resource).
  • Lookup model (by recordId, sourceId, correlation).
  • Search index (optional) for full-text and fast range scans.

Rebuild strategy: snapshot + replay from append logs; watermarks track projector progress; lag SLO drives autoscaling.
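
The rebuild strategy depends on projectors being idempotent under replay. A minimal sketch, assuming ULID recordIds sort lexicographically and events arrive in partition order:

```python
class TimelineProjector:
    """Re-applying events at or below the watermark is a no-op, so replays are re-run safe."""

    def __init__(self):
        self.watermark = ""           # highest recordId applied (ULIDs sort lexically)
        self.timeline: list[str] = []

    def apply(self, record_id: str) -> bool:
        if record_id <= self.watermark:
            return False              # duplicate or replayed event: skip
        self.timeline.append(record_id)
        self.watermark = record_id    # advance only after the read model is updated
        return True
```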

Classification, Redaction & Logs

  • Fields carry classification tags at write; tags drive:
    • storage shape (e.g., tokenization),
    • query redaction (field masking or omission),
    • log scrubbing (no sensitive data in logs/metrics),
    • export filtering (respect tenant’s data handling rules).

Example policy map (excerpt)

| Field | Classification | Redaction (query/export) |
|---|---|---|
| actorId | PERSONAL | mask last 4 |
| attributes.email | PERSONAL | hash (sha256) |
| attributes.cardLast4 | SENSITIVE | allow if role:Auditor & scope:PII |
| resource | INTERNAL | none |

See PII Redaction & Classification (../platform/pii-redaction-classification.md).
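
A hypothetical redaction pass over a flattened record, mirroring the excerpt above; "mask last 4" is interpreted here as revealing only the last four characters, and the rule table is invented for this sketch.

```python
import hashlib

REDACTION_RULES = {  # field -> (rule kind, required scope or None)
    "actorId": ("mask_last4", None),
    "attributes.email": ("hash", None),
    "attributes.cardLast4": ("require_scope", "PII"),
}

def redact(flat_record: dict, scopes: set[str]) -> dict:
    """Apply classification-driven redaction; fields without a rule pass through."""
    out = {}
    for field, value in flat_record.items():
        rule = REDACTION_RULES.get(field)
        if rule is None:
            out[field] = value
            continue
        kind, needed_scope = rule
        if kind == "mask_last4":
            out[field] = "*" * max(len(value) - 4, 0) + value[-4:]
        elif kind == "hash":
            out[field] = "sha256:" + hashlib.sha256(value.encode("utf-8")).hexdigest()
        elif kind == "require_scope" and needed_scope in scopes:
            out[field] = value   # e.g., role:Auditor with scope:PII
        # require_scope without the scope: omit the field entirely
    return out
```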

Retention & Lifecycle

  • Stamped on write: retentionPolicyId + policyVersion.
  • Lifecycle jobs: tiering hot→warm→cold, legal hold awareness, deletion/anonymization windows.
  • Attestations: deletion manifests & job logs stored immutably.

See Data Residency & Retention (../platform/data-residency-retention.md).

Integrity Materials

  • Chain-of-hash per tenant and time-range; rolling checkpoints.
  • Signatures minted by Integrity svc; referenced in projections and exports.
  • Verification APIs accept a record/segment/package and return proofs.

Storage Tiers

| Tier | Workload | Technology / Shape | Notes |
|---|---|---|---|
| Hot | Append path, near-term verify | OLTP append store | SSD, high IOPS, short retention |
| Warm | Queries & aggregations | Read models / columnar | Denormalized, rebuildable |
| Cold | Archival/eDiscovery | Object storage (+ immutability) | Legal hold, cheap, manifest-signed |

Schema Evolution

  • Additive: new optional fields or enums → minor version bump; dual readers.
  • Breaking: new subject or major version; side-by-side projections.
  • Registry & tests: schemas linted; CI contract tests for producers/consumers/projectors.

ER Snapshot (logical)

```mermaid
erDiagram
  TENANT ||--o{ AUDIT_RECORD : owns
  AUDIT_RECORD ||--o| EVIDENCE_MATERIAL : "has"
  AUDIT_RECORD ||--o{ PROJECTION_SEGMENT : "appears_in"
  EXPORT_PACKAGE ||--o{ AUDIT_RECORD : "contains"
  TENANT {
    string tenantId PK
    string edition
  }
  AUDIT_RECORD {
    string recordId PK
    string tenantId FK
    string sourceId
    string actorId
    string action
    string resource
    json   attributes
    datetime occurredAt
    datetime committedAt
    string classification
    string retentionPolicyId
    string digest
    string chainId
    int    chainIndex
    int    policyVersion
  }
  EVIDENCE_MATERIAL {
    string recordId FK
    string segmentId
    string signature
    string algo
  }
  PROJECTION_SEGMENT {
    string segmentId PK
    string tenantId FK
    string key
    json   payload
    datetime watermark
  }
  EXPORT_PACKAGE {
    string packageId PK
    string tenantId FK
    string manifestHash
    datetime createdAt
  }
```

Sizing & Capacity Hints (initial)

  • Record size: median 1–5 KB (flattened JSON); avoid unbounded attributes.
  • Throughput: design for burst QPS B per tenant; apply ingest RL + back-pressure.
  • Indexes: time-range first, then tenant/resource/actor; avoid cross-tenant joins.
  • Cold costs: batch exports; prefer delta-based packages for repeats.

Testable Controls

  • Lints reject schemas without classification tags for new fields.
  • CI ensures queries require tenant filters; cross-tenant fixtures return zero rows.
  • Projections verified idempotent (re-run safe) with watermark assertions.
  • Retention dry-run reports produced; deletions emit attestations.


Storage Strategy (Summary)

Our storage approach optimizes for append-heavy writes, policy-aware queries, provable integrity, and cost control. We separate concerns across Hot (append), Warm (read models/indexes), and Cold (archival/eDiscovery) tiers, governed by policy-stamped retention and lifecycle jobs.

Objectives

  • Performance where it matters: low-latency appends and queries; projections provide fit-for-purpose shapes.
  • Compliance by default: classification-aware storage, retention stamped on write, legal hold and residency honored.
  • Provable integrity: digest chains and signatures persist with data lineage and manifests.
  • Cost discipline: data tiering, compression, batching exports, and quotas per tenant.

Tiering at a Glance

| Tier | Workload | Typical Retention | Consistency | Durability & Encryption | Notes |
|---|---|---|---|---|---|
| Hot (Append Store) | ingest path, near-term verification | hours–days (policy) | strong per-write | multi-AZ/zone; at-rest encryption (KMS) | time/tenant partitions, high IOPS, small indexes |
| Warm (Read Models / Indexes) | query/search/aggregations | days–months (policy) | eventually consistent (lag SLO) | multi-AZ/zone; at-rest encryption (KMS) | rebuildable from events; denormalized projections |
| Cold (Archive / eDiscovery) | long-term retention, legal hold | months–years (policy) | N/A (immutable) | object store immutability + KMS; legal hold | manifests + signatures; cost-efficient, slower access |

Detailed shapes live in Data Model (data-model.md) and infra specifics in Deployment Views (deployment-views.md).

Lifecycle & Policy Enforcement

```mermaid
flowchart LR
  subgraph HOT[Hot / Append]
    A[Append-only segments]
  end
  subgraph WARM[Warm / Read Models]
    R[Projections & indexes]
  end
  subgraph COLD[Cold / Archive]
    C[Immutable objects + manifests]
  end
  A -->|policy window reached| WARM
  WARM -->|tiering job| COLD
  A -->|legal hold? keep| A
  WARM -->|legal hold? pin| WARM
  C -->|export/eDiscovery| C
```
  • On write: retentionPolicyId + policyVersion stamped; classification guides storage and logs.
  • Lifecycle jobs: move eligible segments from Hot→Warm→Cold; respect legal hold and residency maps.
  • Deletion/anonymization: performed per policy window; produce attestations and job logs.

Partitioning, Compaction & Index Hygiene

  • Partitions: time (e.g., day) × tenant; optional sourceId for hot-shard control.
  • Compaction: roll small segments into bounded files (size/time thresholds) to control file counts and seek cost.
  • Index hygiene: projector lag SLOs guide autoscaling; background vacuum/merge jobs keep read paths predictable.

Integrity Material Persistence

  • Digest chains (per tenant/range) stored alongside append metadata; checkpoints for fast verification.
  • Signatures & manifests for exported bundles persisted in Cold; Verify APIs reference chain/manifest IDs.

Backup, Restore & eDiscovery

  • Backups: scheduled snapshots of Warm (and necessary Hot metadata) with region-scoped policies.
  • Restore: tenant- or time-scoped restores; rebuild read models from append logs when possible.
  • eDiscovery: selections from Query → Export packages → signed manifests in Cold; immutable retention with legal hold support.

Residency & Encryption

  • Residency map: tenant → region binding; lifecycle never crosses region without contractual/policy basis.
  • Encryption: at rest via KMS; optional per-tenant keys for high-assurance tenants; rotation windows supported.
  • Secrets: no keys in payloads; policy and classification prevent sensitive leakage to logs/headers.

Capacity & Cost Levers

  • Hot: cap record size, enforce schema limits, rate-limit bursty tenants.
  • Warm: projection granularity tuned to query needs; compress wide models; expire stale indexes.
  • Cold: batch exports, dedupe repeated selections, prefer incremental/delta packages.
  • Global: per-tenant quotas, export concurrency caps, storage alerts on growth velocity.

SLO Hints (storage-facing)

  • Ingest commit p95 ≤ X ms (Hot).
  • Projector lagN s median; p95 ≤ M s (Warm).
  • Export TTFB p95 ≤ Z s for packages up to K MB (Cold).

Failure Modes & Guardrails

  • Hot saturation → back-pressure at Gateway; temporary queueing; projector throttle lifts once lag ≤ budget.
  • Projection failure → DLQ + replay tooling; queries fall back to last consistent watermark.
  • Cold unavailability → exports paused; already created packages remain downloadable via signed URLs.
  • Residency mismatch → hard fail with audit; no cross-region copies without policy/contract.

Testable Controls

  • Lifecycle dry-run reports (what would tier/delete) per tenant.
  • CI checks ensure new tables/indexes include tenantId and time partitioning.
  • Synthetic restores: periodic tenant/time-window drills.
  • Export verification: random sampling of packages against manifests/signatures.


Sequence Flows (append/query/export)

This section captures the three canonical end-to-end flows in ATP. These flows are reference-grade and map directly to our containers, bounded contexts, and policies. Detailed, step-by-step variants (timeouts, retries, failure drills) live in Sequence Flows (sequence-flows.md).


Append (happy path)

```mermaid
sequenceDiagram
  autonumber
  participant P as Producer
  participant GW as API Gateway
  participant ING as Ingestion (Use-cases)
  participant POL as Policy (Decision API)
  participant APP as Append Store (Hot)
  participant OB as Outbox (Tx)
  participant BUS as Event Bus
  participant INT as Integrity
  participant PROJ as Projection
  participant OTL as OTel/Obs

  P->>GW: POST /api/v{n}/audit/append<br />JWT (tenant), X-Idempotency-Key, body (canonical)
  GW->>GW: AuthN (OIDC) + tenant/edition + schema + rate-limit
  GW->>ING: AppendCommand(TenantContext, IdempotencyKey, Payload)
  ING->>POL: Evaluate(classification, retention) [short TTL cache]
  POL-->>ING: PolicyDecision(version, labels, retentionPolicyId)
  ING->>ING: Canonicalize + Validate + Apply Policy
  ING->>APP: Append(AuditRecord + PolicyDecision) [atomic]
  ING->>OB: Enqueue(audit.appended, audit.accepted) [same tx]
  OB->>BUS: publish (relay)
  BUS-->>INT: audit.appended | accepted
  INT->>INT: Digest/Chain/Sign (async)
  BUS-->>PROJ: audit.accepted
  PROJ->>PROJ: Update read models / set watermark
  ING->>OTL: spans/logs/metrics (tenantId, edition, policyVersion, idemKey)
  ING-->>GW: 202 Accepted { recordId, traceId }
```
  • Headers (ingress): Authorization: Bearer <JWT>, X-Idempotency-Key, Content-Type: application/json
  • Guarantees: at-least-once delivery; exactly-once intent via (tenantId, sourceId, sequence|hash); order within partition (tenantId[:sourceId])
  • SLO cues: p95 append ≤ X ms; policy eval p95 ≤ Y ms; projector lag p95 ≤ N s

Query (authorized read)

```mermaid
sequenceDiagram
  autonumber
  participant C as Client (Ops/Auditor)
  participant GW as API Gateway
  participant Q as Query Service
  participant POL as Policy (Decision API)
  participant RM as Read Models (Warm)
  participant RED as Redaction Plan
  participant OTL as OTel/Obs

  C->>GW: GET /api/v{n}/query?tenant=...&filters=...<br />JWT (tenant)
  GW->>GW: AuthN + AuthZ (RBAC/ABAC), edition gates, rate-limit
  GW->>Q: QueryRequest(TenantContext, Filters, Page)
  Q->>RM: Fetch(ReadModel slice, watermark)
  Q->>POL: Evaluate(read constraints) [cached]
  POL-->>Q: Decision(redaction/deny/allow)
  Q->>RED: Apply redaction per classification/policy
  Q-->>GW: 200 OK { results, page, watermark, redactionHints }
  Q->>OTL: record spans/metrics (p95/p99, filtered-out)
```
  • Filters: time-range, actor/resource, action, attributes (whitelisted)
  • Redaction: field-level masking/hashing per classification; no PII in logs/headers
  • SLO cues: p95 latency ≤ Y ms at Q RPS; cache hit ratio ≥ H%; watermark drift ≤ D s

Export (selection → package → verify)

```mermaid
sequenceDiagram
  autonumber
  participant C as Client (Auditor/Legal)
  participant GW as API Gateway
  participant Q as Query Service
  participant EXP as Export Service
  participant INT as Integrity
  participant COLD as Cold Store (Immutable)
  participant WH as Webhook (optional)

  C->>GW: POST /api/v{n}/export { selectionSpec | queryId , format }
  GW->>Q: Validate selection (tenant/edition/ABAC)
  Q-->>GW: Selection OK (token/manifest draft)
  GW->>EXP: CreateExport(TenantContext, selectionToken, format)
  EXP->>Q: Stream records (paged, resumable)
  EXP->>INT: Request signatures/chain refs (batch)
  INT-->>EXP: Evidence (chain checkpoints, signatures)
  EXP->>COLD: Write package parts + manifest (signed)
  EXP-->>GW: 202 Accepted { exportId, pollUrl, ttfbHint }
  loop Client poll or webhook
    C->>GW: GET /api/v{n}/export/{exportId}
    GW-->>C: 303 See Other → signed download URL
    EXP-->>WH: POST /on-export-completed (optional)
  end
```

  • Semantics: resumable streaming; throttled to protect query SLIs; immutable artifacts with signed manifests
  • SLO cues: TTFB p95 ≤ Z s for ≤ K MB outputs; completion p95 ≤ M min for N records
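
Downloaded packages can be verified offline by recomputing each part's digest against the signed manifest; a minimal sketch (signature validation over the manifest itself is omitted here, and the part-naming scheme is invented):

```python
import hashlib

def verify_package(parts: dict[str, bytes], manifest: dict[str, str]) -> bool:
    """True iff the part set matches the manifest exactly and every digest agrees."""
    if set(parts) != set(manifest):
        return False  # missing or unexpected parts are a verification failure
    return all(
        "sha256:" + hashlib.sha256(data).hexdigest() == manifest[name]
        for name, data in parts.items()
    )

parts = {"part-000.jsonl": b"records...", "part-001.jsonl": b"more records..."}
manifest = {name: "sha256:" + hashlib.sha256(data).hexdigest() for name, data in parts.items()}
```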


Failure & Back-pressure (extract)

| Flow | Condition | Behavior | Client Signal |
|---|---|---|---|
| Append | schema invalid / policy deny | 400 / 403 | Problem+JSON with code & trace |
| Append | hot partition saturation | 429 (Gateway), retry-after | Retry-After, X-Rate-Limit-* |
| Append | transient store/broker | retry with jitter → DLQ after N | 202 Accepted (eventual), trace |
| Query | projector lag beyond budget | serve with last watermark + warn | X-Watermark, X-Lag |
| Query | authorization fails | 403 | Problem+JSON |
| Export | package > concurrency quota | 429 + backoff | Retry schedule |
| Export | integrity service slow | continue buffering; partial manifest; retry | Polling continues; final manifest on completion |

Observability: spans across GW→ING→BUS→INT/PROJ→Q/EXP with tenantId, edition, traceId, correlationId; metrics for outbox age, consumer lag, export queue depth, policy deny rate.



Deployment Views (baseline)

This section describes the cloud-native baseline for ATP across environments, regions, and failure domains. It maps our containers to runtime substrates (AKS/ACA), messaging, data stores, and the observability/security planes. Deeper infra specifics live in (deployment-views.md).

Environments & Promotion Model

| Env | Purpose | Data | Change Rate | Protections |
|---|---|---|---|---|
| dev | rapid iteration, PR validation | synthetic | highest | permissive RBAC, ephemeral namespaces |
| test | integration, contract & replay tests | masked | high | seeded tenants, DLQ/replay drills |
| staging | prod-like validation, chaos drills | masked or opt-in | medium | WAF rules, HPA parity, approvals |
| prod | customer traffic | real | controlled | SLO-backed autoscaling, break-glass, approvals |

Promotion: build once, deploy many (signed image → dev → test → staging → prod) with policy gates and environment-specific overlays.

Regional Topology & Residency

  • Tenants are bound to a home region; data never crosses regions unless contractually allowed.
  • All planes are multi-AZ/zone within a region.
  • Optional multi-region active/standby for DR (RPO/RTO declared per edition).
flowchart LR
  subgraph Region[Cloud Region - e.g., East US 2]
    subgraph Net[Virtual Network]
      subgraph Edge[Ingress/WAF Subnet]
        GW[API Gateway / Ingress]
      end
      subgraph App[App Plane - AKS/ACA]
        mesh[Service Mesh - mTLS/L7]
        ING[Ingestion]
        POL[Policy]
        INT[Integrity]
        PROJ[Projection]
        QRY[Query]
        SRCH[Search - optional]
        EXP[Export]
        ADM[Admin/Control]
      end
      subgraph Msg[Messaging]
        BUS[(Topics/Queues + DLQ)]
      end
      subgraph Data[Data Plane]
        HOT[(Append Store)]
        WARM[(Read Models/Indexes)]
        COLD[(Object Store / Archive)]
      end
      subgraph Obs[Observability]
        OTL[(OTel Collector)]
        LOG[(Logs)]
        MET[(Metrics)]
        TRC[(Traces)]
      end
      subgraph Sec[Security]
        KMS[(KMS/Keys)]
        SEC[(Secret Manager)]
      end
    end
  end

  GW --> mesh
  mesh --> ING & POL & INT & PROJ & QRY & SRCH & EXP & ADM
  ING --> HOT
  PROJ --> WARM
  EXP --> COLD
  ING --- BUS
  PROJ --- BUS
  EXP --- BUS
  GW --- OTL
  ING --- OTL
  PROJ --- OTL
  QRY --- OTL
  EXP --- OTL
  INT --- KMS
  HOT --- SEC
  WARM --- SEC
  COLD --- SEC

Kubernetes / Container Apps Mapping

| Namespace | Workloads | HPA Signals | Notes |
|---|---|---|---|
| gateway | ingress/gateway, auth filters | RPS, p95 route latency, 429 ratio | WAF rules, IP allow-lists for admin |
| ingestion | append API, webhook receiver, outbox relay | CPU, QPS, pending outbox, 5xx | strict idempotency; schema guard |
| policy | decision API, cache | p95 decision latency, hit ratio | warm cache w/ TTL + circuit breaker |
| integrity | chain/sign/verify workers | queue depth, worker CPU | HSM/KMS integration |
| projection | projectors, rebuild jobs | lag (sec), consumer lag, DLQ | watermarks, replay safety |
| query | query API, (opt) GraphQL | p95/p99 latency, cache hit | redaction at boundary |
| search (opt) | search API, indexers | queue depth, index refresh | can be disabled per edition |
| export | export API, packagers | concurrent exports, TTFB | resumable, throttled |
| admin | policy mgmt, DLQ/replay, feature flags | N/A | break-glass guarded |
| observability | OTel collector, dashboards | N/A | multi-tenant labeling & scrubbing |

Network, Mesh & Access

  • Ingress: Public → WAF/Ingress → Gateway. Admin surfaces optionally behind IP allow-lists + SSO.
  • Mesh: mTLS everywhere; L7 authorization by service identity; timeouts/retries/circuit breakers standardized.
  • Egress: deny-by-default, allow-lists for KMS, messaging, object store, and IdP.
  • DNS/Service discovery: mesh-native, with identity-bound policies.

Config, Secrets & Keys

  • Config: per-env overlays; feature flags for edition gates; config maps for non-sensitive settings.
  • Secrets: stored in secret manager; mounted/injected at runtime; rotation SLOs enforced.
  • Keys: KMS-backed envelope encryption; key IDs & versions recorded in manifests; dual-key windows during rotation.

Scaling & SLO Budgets

  • Gateway: scale on RPS and auth latency; enforce per-tenant rate limits.
  • Ingestion: scale on incoming QPS and outbox age; shed load via 429 when Hot saturation detected.
  • Projection: scale to respect lag SLO; auto-tune consumer concurrency.
  • Query/Search: scale on p95/p99 latency; cache enabled; bulkhead against Export.
  • Export: separate worker pools; cap concurrent packages per tenant to protect read SLOs.

Failure Domains, HA & DR

  • Intra-region HA: multi-zone deployments; stateless pods across zones; storage with zone-redundancy where supported.
  • DLQ & Replay: standard triage tools; replay by window/tenant; dry-run mode.
  • Backups/Restore: scheduled backups of warm stores + metadata; tenant/time-window restore drills.
  • DR (optional): async replication of cold artifacts; RPO/RTO declared per edition; failover runbooks.

CI/CD & Supply Chain

  • Pipelines: build → test → sign image (SBOM, vuln scan) → push to registry → deploy via GitOps/Argo or pipelines.
  • Policies: admission requires signed images, baseline pod security, resource limits/requests.
  • Observability: dashboards per service; alerts for SLO breaches, DLQ growth, export anomalies, projector lag.

Cost Controls

  • Per-tenant quotas (RPS/storage/exports), export batch windows, storage tiering rules, and autoscaling floors/ceilings tuned for cost envelopes.

Testable Controls

  • Policy tests: deny unsanctioned egress; block unsigned images at admission.
  • Residency tests: enforce region binding per tenant.
  • Chaos drills: periodic pod/node loss, message broker hiccups, object-store slowdown.
  • SLO checks: synthetic probes for append/query/export; alert on budget burn.


API Gateway & Connectivity

The API Gateway is the single ingress for tenant traffic and the control point for identity, tenancy, versioning, rate limits, schema validation, and egress discipline. It fronts REST/gRPC APIs and optional webhooks, and propagates Zero Trust signals (identity, tenant, edition, trace) into the mesh.

Objectives

  • Protect: strong AuthN/Z, input validation, per-tenant quotas/limits, DDoS/WAF.
  • Standardize: versioning, headers, error shapes, retry semantics.
  • Propagate: tenantId, edition, traceparent, correlationId to downstream services.
  • Observe: request metrics, saturation (429s), error taxonomy, and schema failure rates.

Ingress Architecture (L7)

flowchart LR
  Client[Producers / Clients / UIs]
  WAF[WAF / Ingress Controller]
  GW[API Gateway<br />AuthN/Z • Tenancy • Rate Limits • Schema]
  Mesh[Service Mesh - mTLS/L7 AuthZ]
  ING[Ingestion]
  QRY[Query]
  EXP[Export]
  POL[Policy]
  ADM[Admin/Control]
  WH[(Webhooks - optional)]
  Client --> WAF --> GW --> Mesh
  Mesh --> ING & QRY & EXP & POL & ADM
  GW --> WH
  • TLS termination at edge; mTLS inside the mesh.
  • Admin surfaces (policy/flags/replay) can be IP allow-listed + SSO.

Versioning & Deprecation

  • URI or header versioning: GET /api/v{n}/... or X-Api-Version: n.
  • Compatibility windows announced in changelog; deprecation headers: Deprecation: true, Sunset: <rfc1123>, Link: <url>; rel="deprecation".

Tenancy Propagation (canonical)

  • Gateway requires tenant at ingress (JWT claim preferred; X-Tenant-Id only for trusted workloads).
  • Inject standardized headers to downstream:
    • x-tenant-id, x-edition, traceparent, tracestate, x-correlation-id, x-policy-version.
  • Reject requests missing or conflicting tenant signals (400/403).
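
The propagation contract above can be sketched as a small helper; the `TenantContext` shape and error strings are illustrative assumptions, not the Gateway's actual types:

```typescript
// Sketch: build the standardized downstream headers from an authenticated
// request context. Header names follow the canonical list above; the
// TenantContext shape is an assumption for illustration.
export interface TenantContext {
  tenantId: string;
  edition: string;
  traceparent: string;
  correlationId: string;
  policyVersion: string;
}

export function downstreamHeaders(ctx: TenantContext): Record<string, string> {
  return {
    "x-tenant-id": ctx.tenantId,
    "x-edition": ctx.edition,
    "traceparent": ctx.traceparent,
    "x-correlation-id": ctx.correlationId,
    "x-policy-version": ctx.policyVersion,
  };
}

// Reject requests missing or conflicting tenant signals (400/403 at the edge).
export function assertTenantConsistent(
  jwtTenant: string | undefined,
  headerTenant: string | undefined,
): string {
  if (!jwtTenant && !headerTenant) throw new Error("400: missing tenant signal");
  if (jwtTenant && headerTenant && jwtTenant !== headerTenant) {
    throw new Error("403: conflicting tenant signals");
  }
  return (jwtTenant ?? headerTenant)!;
}
```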

Authentication & Authorization

  • Users: OIDC/OAuth2 (short-lived tokens, MFA), scopes/roles mapped to tenant.
  • Workloads: workload identity; audience/scope-bound JWTs or mesh L7 policies.
  • Coarse checks at Gateway; fine-grained ABAC in services (use-case level).

Rate Limits, Quotas, & Back-Pressure

  • Per-tenant burst/sustained rate limits and concurrent export caps.
  • Global safeties on hot routes (append, export create).
  • Signal back-pressure with 429 + Retry-After; include limit headers: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset.
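
On the client side, the 429 contract can be honored with a small helper; header names match the list above, while the epoch-seconds reset convention and the 1 s fallback are assumptions:

```typescript
// Sketch: derive a client back-off delay from Retry-After / X-RateLimit-Reset.
// Supports Retry-After as delta-seconds or an HTTP-date; X-RateLimit-Reset is
// assumed to be epoch seconds (a common, but not universal, convention).
export function backoffDelayMs(
  headers: Record<string, string | undefined>,
  nowEpochSec: number,
): number {
  const retryAfter = headers["retry-after"];
  if (retryAfter !== undefined) {
    const secs = Number(retryAfter);
    if (!Number.isNaN(secs)) return Math.max(0, secs * 1000); // delta-seconds form
    const at = Date.parse(retryAfter); // HTTP-date form
    if (!Number.isNaN(at)) return Math.max(0, at - nowEpochSec * 1000);
  }
  const reset = Number(headers["x-ratelimit-reset"]);
  if (!Number.isNaN(reset) && reset > nowEpochSec) return (reset - nowEpochSec) * 1000;
  return 1000; // conservative default when no hint is present
}
```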

Schema & Payload Safeguards

  • Content-type & size limits; JSON schema validation on critical routes.
  • Reject unknown fields (strict mode) unless whitelisted in schema registry.
  • Enforce PII discipline: no PII in headers; payloads classified for redaction downstream.

Connectivity Matrix (inbound/outbound)

| Surface | Protocol | Auth | Tenancy | Notes |
|---|---|---|---|---|
| Append | REST (POST) / gRPC | Bearer (OIDC) or workload JWT | Required | X-Idempotency-Key mandatory |
| Query | REST (GET) / gRPC | Bearer | Required | watermark headers on responses |
| Export | REST (POST/GET stream) | Bearer | Required | resumable download; signed URLs |
| Webhooks (ingest) | HTTPS (POST) | HMAC/signature | In payload | Signature verification, replays detected |
| Admin | REST (POST/GET) | SSO + IP allow-list | N/A | break-glass logged & approved |

Standard Headers (selected)

  • Ingress: Authorization, X-Idempotency-Key, Content-Type
  • Propagated: x-tenant-id, x-edition, traceparent, tracestate, x-correlation-id, x-policy-version
  • Responses: X-Watermark, X-Lag, X-RateLimit-*, X-Request-Id

Error Model (Problem+JSON)

{
  "type": "https://errors.atp.example/validation",
  "title": "Invalid request payload",
  "status": 400,
  "detail": "Field 'attributes' failed schema validation",
  "instance": "urn:trace:01J9...-req-7f3a",
  "tenantId": "t-acme",
  "code": "SCHEMA_VALIDATION_FAILED"
}
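
Clients can narrow unknown error bodies against this shape; a minimal type guard (illustrative, not the SDK's) might look like:

```typescript
// Sketch: the Problem+JSON shape from the error model above, plus a guard
// that checks the three RFC-required fields before trusting the rest.
export interface AtpProblem {
  type: string;
  title: string;
  status: number;
  detail?: string;
  instance?: string;
  tenantId?: string;
  code?: string;
}

export function isAtpProblem(body: unknown): body is AtpProblem {
  if (typeof body !== "object" || body === null) return false;
  const p = body as Record<string, unknown>;
  return (
    typeof p.type === "string" &&
    typeof p.title === "string" &&
    typeof p.status === "number"
  );
}
```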

Egress & Network Policy

  • Deny-by-default egress; allow only IdP, KMS, messaging, object store, email/webhook domains (when used).
  • DNS pinning/allow-lists for critical dependencies; outbound proxies if required.
  • Mesh L7 AuthZ: only permitted service→service calls; no lateral “surprises”.

Streaming, Downloads & Large Payloads

  • Append encourages bounded payloads (size caps).
  • Export uses chunked transfer and signed URLs; resilience via range requests.
  • Timeouts and read/write budgets enforced per route; client hints for retry/backoff.

CORS & Browser Clients

  • Strict CORS: allow specific origins for tenant consoles; SameSite and CSRF tokens for state-changing routes in UIs.

Observability @ Gateway

  • Spans: route, tenant, edition, status, bytes in/out, auth latency, schema check time.
  • Metrics: RPS, p95/99 per route, 4xx/5xx ratios, 429s, rejected schemas.
  • Logs: structured; PII scrubbed; include x-request-id and correlation keys.

Failure Modes & Signals

| Condition | Behavior | Client Signal |
|---|---|---|
| Missing/invalid tenant | Reject | 400/403 with problem+json |
| Rate-limit exceeded | Shed | 429 + Retry-After |
| Schema invalid | Reject | 400 with problem+json & error path |
| AuthN failed | Reject | 401 |
| AuthZ/edition denied | Reject | 403 |
| Upstream saturation | Back-pressure | 503 (retryable) with Retry-After |

Testable Controls

  • Contract tests: tenant required on protected routes; headers propagated.
  • Negative tests: cross-tenant attempts return 403; unknown JSON fields rejected.
  • Synthetic load: verify 429 behavior, limit headers, and stable p95 under burst.


Observability & SLOs

Observability in ATP is first-class: every request and message carries tenantId, edition, traceId, and correlationId. We instrument traces, metrics, and logs using OpenTelemetry, define SLIs/SLOs per service, and manage reliability via error budgets with multi-window burn alerts. PII is never logged; redaction follows classification tags.

Telemetry Standards

  • Traces: Gateway → Service → Outbox/Bus → Consumer/Projector → Store/Export.
    • Resource attrs: service.name, service.version, deployment.environment, cloud.region.
    • Span attrs: tenant.id, tenant.edition, http.route, messaging.operation, messaging.destination, db.system, db.operation, policy.version, idempotency.key.
  • Metrics (OTel with exemplars): use histograms for latency; avoid high-cardinality labels.
    • Common labels: service.name, route|operation, tenant.class (small cardinality bucket), result.
    • Examples: http.server.duration, messaging.consumer.lag, outbox.relay.age, export.queue.depth.
  • Logs: structured JSON; fields include timestamp, level, message, tenantId, edition, traceId, correlationId, eventId, code.
    • No PII in logs or headers; classified fields are masked or hashed.

Sampling: head-based baseline, tail-based for slow/error spans. Retention: traces short, metrics medium, logs per compliance policy.

Golden Signals (platform-wide)

  • Traffic (RPS, throughput), Latency (p95/p99), Errors (4xx/5xx, policy denies), Saturation (CPU/mem, queue depth, projector lag), plus Back-pressure (429s, retry counts).

SLIs per Service

| Service | Primary SLIs | Supporting SLIs |
|---|---|---|
| Gateway | route latency p95/p99; 4xx/5xx ratio; 429 rate | auth latency, schema failure rate |
| Ingestion | append accepted latency p95; accept success rate | outbox relay age p99, request size distribution |
| Policy | decision latency p95; cache hit ratio | decision error rate, fallback activations |
| Integrity | verify latency p95 (by target size) | chain build queue depth, signer error rate |
| Projection | projector lag median/p95; DLQ rate | replay duration, consumer throughput |
| Query | query latency p95/p99; success rate | cache hit ratio, redacted-field count, watermark drift |
| Search (opt) | search latency p95; success rate | index refresh age, queue depth |
| Export | TTFB p95; completion time p95 | package queue depth, resumptions, webhook success |
| Admin | action success rate | time-to-approve, break-glass invocations |

SLO Targets (initial placeholders)

Tune these during load testing; record in [Alerts & SLOs](../operations/alerts-slos.md).

  • Gateway: route p95 ≤ X ms; 5xx ≤ Y ppm.
  • Ingestion: append accepted p95 ≤ X ms; outbox age p99 ≤ N s.
  • Projection: lag median ≤ 5 s, p95 ≤ 30 s.
  • Query: p95 ≤ Y ms at baseline RPS; success rate ≥ 99.95% (excl. client 4xx).
  • Export: TTFB p95 ≤ Z s for ≤ K MB; completion p95 ≤ M min for N records.
  • Integrity: verify p95 ≤ T s for S records; failures ≤ E ppm.
  • Policy: decision p95 ≤ Q ms; hit ratio ≥ H%.

Error Budgets & Burn Alerts

  • Budget = 1 - target_availability. Example: SLO 99.9% ⇒ budget 0.1%.
  • Multi-window, multi-burn alerts (fast + slow):
    • Page if burn ≥ 14× over 1h or ≥ 6× over 6h.
    • Ticket if burn ≥ 1× over 24h.
  • Pair with auto-suppression during planned maintenance (annotations).
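
The budget math can be sketched directly; the paging thresholds below are illustrative placeholders to be tuned alongside the SLO targets above:

```typescript
// Sketch: burn rate = observed error ratio / error budget.
// SLO 99.9% => budget 0.001; burning at exactly budget pace gives burn = 1.
export function burnRate(errorRatio: number, sloTarget: number): number {
  const budget = 1 - sloTarget;
  return budget <= 0 ? Infinity : errorRatio / budget;
}

// Multi-window check: a fast window catches sharp burns, a slow window
// sustained ones. The default thresholds are assumptions, not commitments.
export function shouldPage(
  burn1h: number,
  burn6h: number,
  fast = 14,
  slow = 6,
): boolean {
  return burn1h >= fast || burn6h >= slow;
}
```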

Dashboards (minimum set)

  • Service: latency histograms, error ratios, throughput, saturation, dependency health.
  • Flow: Append (GW→Ingestion→Outbox→Bus→Projection), Query (GW→Query→Read), Export (GW→Export→Cold).
  • Tenancy: top tenants by usage; quota headroom; noisy-neighbor detection.
  • Reliability: projector lag, outbox age, DLQ depth, export queue depth, policy deny rate.
  • Security: auth failures, cross-tenant attempts, signature failures.

Watermarks, Idempotency & Headers

  • Query responses include X-Watermark & X-Lag to expose freshness.
  • Append requires X-Idempotency-Key; success logs include Idempotent:true on dedupe.
  • Rate headers return budget signals: X-RateLimit-*, Retry-After.

Alert Policies (extract)

| Condition | Threshold | Action |
|---|---|---|
| Projector lag p95 > 30s | 15m sustained | Page on-call; auto-scale consumers; evaluate DLQ |
| Outbox age p99 > 10s | 15m | Page ingestion; check broker health |
| Query p95 > Y ms | 30m | Page API/runtime; enable cache protection |
| Export queue depth > Q | 30m | Ticket; throttle export concurrency per tenant |
| 5xx ratio > R ppm | 10m | Page owning team; roll back last deploy guardrail |
| Token validation failures spike | 10m | Page security on-call; investigate IdP/clock skew |

Cardinality & Cost Guardrails

  • Cap high-cardinality labels (e.g., raw tenantId)—aggregate into tenant.class (e.g., S/M/L).
  • Use RED metrics (Rate, Errors, Duration) per route/use-case.
  • Histograms with controlled buckets; exemplars sampled from traces.
  • Drop verbose logs in hot paths; sample at INFO, keep ERROR always.
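
The tenant.class bucketing can be as simple as a volume-based mapping; the thresholds below are invented for illustration:

```typescript
// Sketch: collapse raw tenantId into a small-cardinality tenant.class label
// (S/M/L buckets by request volume). Threshold values are assumptions.
export function tenantClass(requestsPerDay: number): "S" | "M" | "L" {
  if (requestsPerDay < 10_000) return "S";
  if (requestsPerDay < 1_000_000) return "M";
  return "L";
}
```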

Health & Probes

  • Liveness: process healthy; Readiness: deps reachable and policy cache warm; Startup: migrations/keys loaded.
  • Expose /healthz endpoints; aggregate in [Health Checks](../operations/health-checks.md).

Synthetic Probes & Canaries

  • Tenant-scoped synthetics for append/query/export; publish probe artifacts with tenantId=probe.
  • Canary releases gated by SLO trend; rollback if burn > threshold.

Testable Controls

  • CI checks for OTel exporters present; span/metric naming lint.
  • Contract tests assert presence of tenantId, edition, traceId on key spans.
  • E2E tests validate X-Watermark/X-Lag headers and redaction hints on query.


Reliability & Resilience (retries, outbox, DLQ)

ATP targets graceful degradation under failure: shed load early, retry safely with idempotency, confine faults with bulkheads, and preserve work via transactional outbox/inbox and DLQs with audited replay. Policies are tuned to protect SLOs and tenant isolation.

Principles

  • Fail fast at the edge (schema/tenancy/rate) and retry inside only when it’s safe.
  • Exactly-once intent through idempotency keys; consumers are idempotent by construction.
  • Back-pressure before meltdown: 429s at the Gateway; bounded concurrency in workers.
  • Isolate & contain: bulkheads, circuit breakers, DLQs per subscription, exporter pools separate from query.
  • Observable by default: retries, drops, DLQ, and replays are fully traceable.

Timeouts, Retries, Backoff

| Boundary | Timeout (budget) | Retry Policy | Max Attempts | Notes |
|---|---|---|---|---|
| Client → Gateway | short (route p95 + margin) | No (client retry on 429/503 only) | 0 | Gateway signals back-off via headers |
| Gateway → Service | route-specific (p95 × 1.2) | No (propagate) | 0 | Avoid retry storms |
| Service → Policy/IdP/KMS | short | Yes, exp. backoff + jitter | 3–5 | Only on transient errors/timeouts |
| Service → Broker (publish) | short | Yes, then Outbox relay | bounded | Never drop; relay ensures delivery |
| Consumer → Repo/Store | medium | Yes, exp. backoff + jitter → DLQ | 5 | Idempotent upsert/no-op required |
| Integrity/Export → Object Store | medium/long | Yes, exp. backoff + resume | bounded | Resume via range requests |

Budgeting: Timeouts derived from SLOs; p95 + headroom. Retries use exponential backoff with jitter (e.g., base 100–250ms, cap 5–10s). No retries on validation/authorization errors.
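
A sketch of exponential backoff with decorrelated jitter, using the starter values from this document (base 100 ms, cap 10 s); the helper is illustrative, not SDK code:

```typescript
// Sketch: decorrelated jitter — sleep = min(cap, random_between(base, prev * 3)).
// Each delay is drawn from a range anchored to the previous delay, which
// spreads retries out without synchronized thundering herds.
export function nextDelayMs(
  prevDelayMs: number,
  rand: () => number = Math.random, // injectable for testing
  baseMs = 100,
  capMs = 10_000,
): number {
  const upper = Math.max(baseMs, prevDelayMs * 3);
  const jittered = baseMs + rand() * (upper - baseMs);
  return Math.min(capMs, jittered);
}
```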


Outbox / Inbox Semantics

Outbox guarantees durable publication of domain events without two-phase commit.

sequenceDiagram
  autonumber
  participant API as API/Use-case
  participant DB as Append Store (tx)
  participant OB as Outbox (tx)
  participant RL as Relay Worker
  participant BUS as Broker

  API->>DB: Persist aggregate
  API->>OB: Persist event (same tx)
  DB-->>API: Commit (record + outbox)
  RL->>OB: Poll due events
  RL->>BUS: Publish (with headers/trace)
  RL->>OB: Mark as delivered (idempotent)

Inbox (consumer de-dup) stores (eventId|idempotencyKey) receipts for M days to drop duplicates safely.

  • Idempotency Key: (tenantId, sourceId, sequence|hash)
  • Headers: x-idempotency-key, traceparent, x-tenant-id, x-schema-version
  • Guarantee: at-least-once delivery; exactly-once effects via idempotent handlers.
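
The key shape and consumer de-dup can be sketched as below; `buildIdempotencyKey` and the in-memory `Inbox` are illustrative (a real inbox persists receipts for M days):

```typescript
// Sketch: (tenantId, sourceId, sequence|hash) idempotency key, plus an inbox
// that drops duplicates. An in-memory Set stands in for a durable receipt store.
export function buildIdempotencyKey(
  tenantId: string,
  sourceId: string,
  seqOrHash: number | string,
): string {
  return `${tenantId}:${sourceId}:${seqOrHash}`;
}

export class Inbox {
  private seen = new Set<string>();

  /** True if the event is new and should be handled; false drops a duplicate. */
  accept(key: string): boolean {
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    return true;
  }
}
```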

DLQ, Replay & Triage

Every subscription has a DLQ. We never silently drop.

flowchart LR
  subgraph Processing
    C[Consumer] --> H[Handler  - Idempotent]
    H --> OK[Commit]
    H -->|on failure after N retries| DLQ[(DLQ)]
  end
  subgraph Triage & Replay
    DLQ --> UI[DLQ Console/Runbook]
    UI --> DRY[Dry-run Replay]
    DRY --> REP[Replay Window/Tenant]
    REP --> C
  end

Triage metadata: error code, exception type, payload hash, headers, schema/version, attempt count, trace links.

Replay rules

  • Scoped by tenant and time window; require approval for large replays.
  • Dry-run first: count would-be successes/failures; cap concurrency; respect watermarks.
  • Immutability: handlers must be replay-safe; only side effects that are idempotent.

Bulkheads, Circuit Breakers, Quotas

  • Bulkheads: separate pools for Export vs Query; limit projector concurrency by partition; per-tenant worker caps.
  • Circuit breakers: open on dependency failures (Policy/IdP/KMS/Store); fail closed for risky decisions (deny if policy unavailable), fail open for non-critical enrichments.
  • Quotas: per-tenant caps on RPS, concurrent exports, selection size, and storage growth.

Failure Taxonomy & Handling

| Class | Examples | Handling | Client Signal |
|---|---|---|---|
| Validation | schema invalid, unknown fields | Reject (no retry) | 400 (Problem+JSON) |
| Authorization | missing/invalid tenant, ABAC deny | Reject (no retry) | 401/403 |
| Resource limits | rate/quotas exceeded | Shed load | 429 + Retry-After |
| Transient infra | broker timeout, object store 503 | Retry with jitter; back-off; DLQ after N | 202/503 (server) |
| Hot partition | tenant/source spike | Throttle producer; bounded consumer concurrency | 429 (edge), lag dashboards |
| Dependency outage | Policy/KMS down | Breaker; safe defaults; degrade features | 503 (server) |
| Data poison | schema/version mismatch | Route to DLQ; fix schema/consumer; replay | N/A (internal) |

Back-Pressure & Lag Management

  • Gateway: 429 with tenant-scoped limits; communicate X-RateLimit-*.
  • Consumers: dynamic concurrency (lower when lag or error rate climbs).
  • Projection: track watermarks; alert when lag p95 > SLO; autoscale consumers on lag & queue depth.
  • Export: queue depth guards; resumable streams; per-tenant concurrency caps.

Chaos & Resilience Drills

  • Inject faults in staging: broker hiccups, object-store slowdown, KMS latency, projector crashes.
  • Verify: error budget burn alerts, DLQ accumulation, replay throughput, circuit behavior, user-facing latency.
  • Record drill outcomes and update runbooks and SLO thresholds.

Observability for Reliability

  • Metrics: outbox.relay.age, consumer.lag, dlq.depth, dlq.replay.rate, retry.count, 429.rate, breaker.open.count.
  • Traces: link original request → outbox publish → consumer handle → DLQ/replay spans.
  • Logs: structured error codes; payload hashes (not values); tenant-safe redaction.

Configuration Defaults (starters)

  • Retries: 3–5 attempts, exp. backoff + full jitter (Decorrelated Jitter), cap 10s.
  • Concurrency: start low (e.g., 2–4 per partition); auto-tune toward SLO.
  • DLQ retention: 7–30 days per environment; immutable audit of purges.
  • Replay: require change ticket/approval for >N msgs or cross-tenant windows.

Testable Controls

  • Unit tests: consumers idempotent; repositories upsert-or-noop.
  • Contract tests: no PII in headers; tenant headers present.
  • E2E: inject transient failures and verify DLQ path, replay correctness, and watermark recovery.
  • Synthetic hot-partition tests: verify 429s at edge, stable p95, and bounded lag.


Integrity & Tamper-Evidence

ATP provides cryptographic assurance that audit data is unchanged since write and that exports are authentic. We implement append-only hash chains, digitally signed checkpoints/manifests, and verification APIs usable online (service-backed) and offline (air-gapped).

Objectives

  • Tamper-evidence: any modification, insertion, deletion, or re-ordering becomes detectable.
  • Provenance: every package/export carries a signed, reproducible manifest.
  • Verifiability: tenants and auditors can verify individual records, ranges, or full exports without trusting ATP.
  • Agility: algorithm and key rotation without rewriting historical data (versioned metadata).

Integrity Model (at a glance)

  • Canonicalization at write → compute a content digest of the canonical record.
  • Hash Chains per {tenantId, time-slice} with rolling checkpoints.
    • Chain node i: H_i = Hash(H_{i-1} || digest(record_i) || meta_i)
    • meta_i includes recordId, committedAt, policyVersion, and chain coordinates.
  • Checkpoints (e.g., hourly/daily) are digitally signed; contain H_last, span, and count.
  • Exports produce signed manifests listing content digests and chain/checkpoint references.
  • Verification APIs return proof objects; offline tools can validate with published keys.
flowchart LR
  subgraph TS["Tenant Slice: t-acme - 2025-10-22"]
    R1[rec#1 digest] --> CH1[H1]
    R2[rec#2 digest] --> CH2[H2]
    R3[rec#3 digest] --> CH3[H3]
    CH1 --> CH2 --> CH3
    CH3 --> CKP[Checkpoint Σ<br /> signed - H_last, span, count]
  end
  CKP --> MAN[Export Manifest<br /> signed - digests, ranges, refs]

We prefer linear hash chains with signed periodic checkpoints. Optionally, a Merkle tree can be built per checkpoint window for batch verification without changing the public surface.
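
The chain rule above can be exercised directly; the meta encoding and genesis value in this sketch are illustrative, not the stored format:

```typescript
import { createHash } from "node:crypto";

// Sketch of H_i = Hash(H_{i-1} || digest(record_i) || meta_i); the "|"-joined
// meta encoding is an assumption for illustration.
export function chainNode(
  prevHash: string,
  recordDigest: string,
  meta: { recordId: string; committedAt: string },
): string {
  const h = createHash("sha256");
  h.update(prevHash);
  h.update(recordDigest);
  h.update(`${meta.recordId}|${meta.committedAt}`);
  return "sha256:" + h.digest("hex");
}

// Verifying a slice recomputes every node; any tamper, insertion, deletion,
// or re-order changes H_last and fails against the signed checkpoint.
export function chainHead(
  genesis: string,
  records: Array<{ digest: string; recordId: string; committedAt: string }>,
): string {
  return records.reduce(
    (prev, r) => chainNode(prev, r.digest, { recordId: r.recordId, committedAt: r.committedAt }),
    genesis,
  );
}
```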


Data Elements

| Artifact | Purpose | Immutable Fields (excerpt) |
|---|---|---|
| Record digest | Per-record integrity | recordId, canonical payload, occurredAt, committedAt, policyVersion |
| Chain node | Links records in order | chainId, chainIndex, prevHash, recordDigest, meta |
| Checkpoint | Signed summary of a range | chainId, fromIndex, toIndex, H_last, spanStart/End, count, algo, keyId, sig |
| Manifest | Export integrity & lineage | package metadata, list of {recordId, digest}, checkpoint refs, algo, keyId, sig |

Canonicalization rules live alongside message schemas; they are whitespace- and order-independent, normalize numerics/booleans, and reject unknown fields (unless whitelisted).
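
A minimal sorted-key canonicalizer shows the intent; the real rules live with the schemas, and numeric normalization plus unknown-field rejection are omitted here:

```typescript
import { createHash } from "node:crypto";

// Sketch: order-independent canonicalization (sorted object keys) followed by
// a SHA-256 content digest — the "canonicalization at write → digest" step.
export function canonicalize(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(canonicalize).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const entries = Object.entries(value as Record<string, unknown>)
      .sort(([a], [b]) => (a < b ? -1 : 1))
      .map(([k, v]) => `${JSON.stringify(k)}:${canonicalize(v)}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value); // numbers, booleans, strings, null
}

export function contentDigest(value: unknown): string {
  return "sha256:" + createHash("sha256").update(canonicalize(value)).digest("hex");
}
```

Two payloads that differ only in key order produce the same digest, which is what makes the per-record digests reproducible by external verifiers.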


Algorithms & Keys (guidance)

  • Hash: SHA-256 (default); allow algorithm versioning (e.g., sha256, sha512/256).
  • Signatures: Ed25519 / ECDSA P-256 (configurable).
  • Key management: KMS-backed; key IDs and versions stamped in checkpoints/manifests.
  • Rotation: introduce new algo/keyId at checkpoint boundaries; historical proofs remain valid.

We maintain algorithm agility: verification reads metadata and dispatches appropriate verifier.


Write-time Flow (with integrity material)

sequenceDiagram
  autonumber
  participant ING as Ingestion
  participant CAN as Canonicalizer
  participant APP as Append Store
  participant INT as Integrity
  ING->>CAN: Normalize(payload)
  CAN-->>ING: canonical bytes + digest(payload)
  ING->>APP: Append(record, digest, policyVersion)
  ING->>INT: Update chain (tenant, time-slice, recordDigest)
  INT->>INT: H_i = Hash(H_{i-1} || recordDigest || meta)
  alt checkpoint boundary
    INT->>KMS: Sign(H_last, span, count, algo)
    KMS-->>INT: signature(keyId, version)
    INT->>APP: Persist checkpoint
  end

Verification Surfaces

APIs

  • POST /api/v{n}/integrity/verify/record → input: { tenantId, recordId } → output: { ok, recordDigest, chainProof, checkpointRef }
  • POST /api/v{n}/integrity/verify/range → input: { tenantId, from, to } → output: { ok, H_last, span, checkpointSig }
  • POST /api/v{n}/integrity/verify/export → input: { exportId } → output: { ok, manifestDigest, signature, chainRefs[] }

CLI/Offline (reference implementation)

  • Verify with only: manifest, exported data, published public keys, and (optional) checkpoint bundle.

Response Example (export verify)

{
  "ok": true,
  "exportId": "exp-01JAX...",
  "manifest": {
    "algo": "sha256",
    "keyId": "kms:key/ed25519:v4",
    "digest": "sha256:7f...c1",
    "signature": "base64:MEUCIQ..."
  },
  "chainRefs": [
    { "chainId": "t-acme:2025-10-22", "toIndex": 85123, "H_last": "sha256:ab..ef", "checkpointSig": "base64:..." }
  ]
}

Export Manifest (canonical shape)

{
  "schemaVersion": "1.0.0",
  "packageId": "exp-01JAX...",
  "tenantId": "t-acme",
  "createdAt": "2025-10-22T12:04:55Z",
  "algo": "sha256",
  "keyId": "kms:key/ed25519:v4",
  "items": [
    { "recordId": "01H...", "digest": "sha256:...", "occurredAt": "2025-10-22T10:12:00Z" }
  ],
  "chainRefs": [
    { "chainId": "t-acme:2025-10-22", "fromIndex": 84000, "toIndex": 85123, "H_last": "sha256:...", "checkpointId": "ckp-2025-10-22T12:00:00Z" }
  ],
  "signature": "base64:..."
}
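
Offline verification walks `items` and compares recomputed digests against the manifest; this sketch checks digests only and leaves signature verification (keyId + published public key) to a real verifier:

```typescript
// Sketch: compare manifest item digests against digests recomputed from the
// exported data. The lookup callback stands in for "recompute from package".
export interface ManifestItem {
  recordId: string;
  digest: string;
}

export function verifyItems(
  items: ManifestItem[],
  digestOf: (recordId: string) => string | undefined,
): { ok: boolean; failed: string[] } {
  const failed = items
    .filter((i) => digestOf(i.recordId) !== i.digest)
    .map((i) => i.recordId);
  return { ok: failed.length === 0, failed };
}
```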

Failure Modes & Mitigations

| Scenario | Detection | Mitigation |
|---|---|---|
| Record tampered | digest mismatch | reject read/export; raise integrity alert |
| Chain gap or re-order | recompute H_i; mismatch vs stored | mark slice compromised; rebuild from trustworthy boundary; alert |
| Checkpoint key rotated | keyId mismatch | verify with previous public key set; publish rotation bundle |
| Manifest altered | signature invalid | reject download; regenerate; investigate |
| Clock skew | timestamp sanity checks | rely on committedAt from ATP; include in proofs |
| Hot shard loss | missing nodes | replay from outbox/event log; regenerate chain; issue new checkpoint noting incident (transparent) |

Rebuild policy: Regenerate affected chain segment without rewriting record contents; checkpoint notes remediation; previous proofs remain for unaffected ranges.


Observability & SLOs (integrity)

  • SLIs: verify latency p95 (by size), chain build queue depth, checkpoint issuance delay, signature error rate.
  • Alerts: chain mismatch, signature failure spikes, delayed checkpoints beyond window.
  • Trace: link append → chain update → checkpoint sign → export manifest.

Testable Controls

  • Deterministic canonicalization tests (golden vectors).
  • Property tests: random order tampering must fail verification.
  • Rotation tests: verify old data with previous keys; new data with new keys.
  • Cross-check: export manifest round-trip verification in CI (sampled).
  • Negative tests: header-only PII prohibition (integrity material never includes PII).

Operational Practices

  • Key custody: least-privilege KMS roles; HSM-backed keys optional; dual-control for rotations.
  • Publishing: make public verification keys and checkpoint bundles available per tenant/region (read-only).
  • eDiscovery: export includes manifest and optional checkpoint pack to enable offline verification.
  • Incident handling: freeze exports for affected slices; publish advisory with affected chainId ranges.


SDK & Integration Guidance (overview)

This section orients integrators to the official SDKs and the minimum set of practices to publish, query, and export audit data safely. Deep dives and runnable samples live under SDK and Guides.

Official SDKs

| Language | Package | Status | Target Runtimes |
|---|---|---|---|
| C# | ConnectSoft.Atp | GA (preferred) | .NET 8/9 |
| JavaScript/TypeScript | @connectsoft/atp | GA (preferred) | Node 18+/20+, modern browsers (query only) |

Common features: automatic tenancy propagation, idempotency key helpers, schema validation, built-in retry with jitter (safe verbs only), OTel instrumentation hooks, and Problem+JSON error mapping.


Minimal Client Configuration

# Pseudoconfig (both SDKs support env vars and code-based config)
ATP_BASE_URL:   https://api.atp.example
ATP_TENANT_ID:  t-acme
ATP_EDITION:    enterprise
ATP_CLIENT_ID:  <oidc client id>
ATP_CLIENT_SECRET: <secret>         # or workload identity
ATP_TIMEOUT_MS: 3000
  • Auth: OIDC client credentials or workload identity.
  • Tenancy: tenantId required for all calls; SDK sets x-tenant-id header.
  • Tracing: pass OTel tracer/provider to auto-attach traceparent.

Publish (Append) — Quick Start

C#

var client = new AtpClient(new AtpOptions {
  BaseUrl = new Uri(Environment.GetEnvironmentVariable("ATP_BASE_URL")!),
  TenantId = "t-acme",
  Auth = AtpAuth.ClientCredentials("clientId","clientSecret"),
});

var record = new AuditRecord {
  SourceId = "order-svc",
  ActorId = "u-123",
  Action = "UPDATE",
  Resource = "Order/4711",
  Attributes = new { status = "Shipped", carrier = "DHL" },
  OccurredAt = DateTimeOffset.UtcNow
};

var idem = IdempotencyKey.From("t-acme", "order-svc", sequence:4711);
await client.AppendAsync(record, idem, ct);

TypeScript

import { AtpClient, idempotencyKey } from "@connectsoft/atp";

const client = new AtpClient({
  baseUrl: process.env.ATP_BASE_URL!,
  tenantId: "t-acme",
  auth: { kind: "clientCredentials", clientId: "...", clientSecret: "..." },
});

const record = {
  sourceId: "order-svc",
  actorId: "u-123",
  action: "UPDATE",
  resource: "Order/4711",
  attributes: { status: "Shipped", carrier: "DHL" },
  occurredAt: new Date().toISOString(),
};

await client.append(record, idempotencyKey("t-acme","order-svc",4711));

Contract reminders

  • Headers (SDK-managed): Authorization, x-tenant-id, X-Idempotency-Key, Content-Type.
  • Schema: unknown fields rejected unless whitelisted; use SDK types to avoid drift.
  • PII discipline: never place PII in headers; carry it only in attributes, using classification-aware schema fields.

Query — Authorized Read

C#

var page = await client.QueryAsync(new QuerySpec {
  TimeRange = TimeRange.LastHours(24),
  Filters = new() { Resource = "Order/4711" },
  Page = new PageSpec(size: 100)
}, ct);

// Headers surfaced on the response object
Console.WriteLine($"Watermark={page.Watermark}, Lag={page.LagSeconds}s");

TypeScript

const res = await client.query({
  timeRange: { lastHours: 24 },
  filters: { resource: "Order/4711" },
  page: { size: 100 }
});
console.log(res.headers["x-watermark"], res.headers["x-lag"]);
  • Redaction: SDKs expose redactionHints where fields were masked/hidden.
  • Pagination: cursor-based; pass next token; default page size 100 (configurable).
  • Rate limits: watch X-RateLimit-* and Retry-After.
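
The cursor contract above can be drained with a small helper. The { items, next } page shape and the fetchPage signature below are assumptions for illustration, not the SDK's actual response type:

```typescript
// Sketch: iterate all records across cursor-based pages.
// In practice fetchPage would wrap client.query with the next token.
interface Page<T> { items: T[]; next?: string }

async function* allRecords<T>(
  fetchPage: (cursor?: string) => Promise<Page<T>>
): AsyncGenerator<T> {
  let cursor: string | undefined;
  do {
    const page = await fetchPage(cursor); // pass the cursor token onward
    yield* page.items;
    cursor = page.next;                   // undefined when exhausted
  } while (cursor);
}
```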

Export — Selection → Package

C#

var exportId = await client.Export.CreateAsync(new ExportRequest {
  Query = new QuerySpec { Filters = new() { Resource = "Order/4711" } },
  Format = "jsonl"
}, ct);

var stream = await client.Export.DownloadAsync(exportId, ct); // resumable via Range

TypeScript

const { exportId } = await client.export.create({
  query: { filters: { resource: "Order/4711" } },
  format: "jsonl"
});
const file = await client.export.download(exportId); // supports range & resume
  • Resumable downloads (HTTP Range).
  • Integrity: response includes manifest digest/signature; use SDK verifyExport() wrapper for convenience.
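
A minimal sketch of the digest half of that verification, assuming the manifest carries a hex-encoded SHA-256 of the downloaded JSONL bytes (verifyExport() would also check the signature); the field name and encoding are assumptions:

```typescript
// Sketch: recompute the export digest and compare it to the manifest's,
// using a constant-time comparison.
import { createHash, timingSafeEqual } from "node:crypto";

function digestMatchesManifest(body: Buffer, manifestDigestHex: string): boolean {
  const actual = createHash("sha256").update(body).digest();
  const expected = Buffer.from(manifestDigestHex, "hex");
  // Length check first: timingSafeEqual throws on unequal lengths.
  return actual.length === expected.length && timingSafeEqual(actual, expected);
}
```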

Webhooks (optional) — Ingest & Completion

Verification

import { verifyWebhook } from "@connectsoft/atp/webhooks";
const ok = verifyWebhook(headers, rawBody, secret); // HMAC or signature
if (!ok) return res.status(401).end();
  • Retry-safe: your handler must be idempotent; deduplicate on the event's eventId or the x-idempotency-key header.
  • Security: require HTTPS; rotate secrets; deny unsigned deliveries.
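
For intuition, an HMAC-based verifyWebhook amounts to roughly the following; the hex signature encoding and the exact header handling are assumptions about the scheme, not the SDK's documented internals:

```typescript
// Sketch: verify an HMAC-SHA256 webhook signature over the raw body.
import { createHmac, timingSafeEqual } from "node:crypto";

function verifyHmacSignature(rawBody: Buffer, signatureHex: string, secret: string): boolean {
  const expected = createHmac("sha256", secret).update(rawBody).digest();
  const given = Buffer.from(signatureHex, "hex");
  // Length check first: timingSafeEqual throws on unequal lengths.
  // Constant-time compare prevents timing side channels.
  return given.length === expected.length && timingSafeEqual(given, expected);
}
```

Always verify against the raw request bytes, before any JSON parsing, since re-serialization can change the payload.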

Retries, Timeouts, and Idempotency

  • SDK retries: exponential backoff with jitter on transient errors only (5xx/timeouts).
  • Do not auto-retry on 4xx (validation, auth, policy deny).
  • Idempotency key builder helpers express exactly-once intent (delivery is at-least-once; the server deduplicates on the key):
    • Conventional form: `${tenantId}:${sourceId}:${sequence|hash}`.
  • Timeouts: defaults ~3s (configurable per method); align with route SLOs.
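
A sketch of the conventional key form; the real IdempotencyKey.From / idempotencyKey helpers encapsulate this, and the ':' guard below is an illustrative safety check, not documented behavior:

```typescript
// Sketch: build an idempotency key in the conventional
// tenantId:sourceId:sequence form.
function buildIdempotencyKey(
  tenantId: string,
  sourceId: string,
  sequenceOrHash: number | string
): string {
  for (const part of [tenantId, sourceId, String(sequenceOrHash)]) {
    // Reject the separator inside parts so keys stay unambiguous.
    if (part.includes(":")) throw new Error("key parts must not contain ':'");
  }
  return `${tenantId}:${sourceId}:${sequenceOrHash}`;
}
```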

Error Model (Problem+JSON)

HTTP | Code (example)           | Meaning                    | Action
400  | SCHEMA_VALIDATION_FAILED | Payload invalid            | Fix schema
401  | UNAUTHENTICATED          | Missing/invalid token      | Refresh/reauth
403  | AUTHORIZATION_DENIED     | RBAC/ABAC/edition gate     | Request access
429  | RATE_LIMITED             | Per-tenant quota exceeded  | Back off using Retry-After
503  | UPSTREAM_UNAVAILABLE     | Transient infra            | SDK retries (jitter)

All SDK exceptions include traceId; logs are safe for sharing (no PII).
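
The retry policy above can be sketched as a delay chooser that honors Retry-After on 429 and falls back to full-jitter exponential backoff on 5xx/timeouts; the base and cap values below are illustrative defaults, not the SDK's:

```typescript
// Sketch: pick the next retry delay in milliseconds.
function retryDelayMs(
  attempt: number,             // 1-based retry attempt
  retryAfterSeconds?: number,  // parsed Retry-After header, if present
  baseMs = 200,
  capMs = 10_000
): number {
  // The server's Retry-After always takes precedence.
  if (retryAfterSeconds !== undefined) return retryAfterSeconds * 1000;
  // Full jitter: uniform over [0, min(cap, base * 2^(attempt-1))).
  const ceiling = Math.min(capMs, baseMs * 2 ** (attempt - 1));
  return Math.floor(Math.random() * ceiling);
}
```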


Versioning & Compatibility

  • Semantic Versioning for SDKs; APIs/events are additive-first.
  • Deprecation: SDK surfaces Deprecation/Sunset headers; consult Changelog.
  • Pin major versions in production; run contract tests in CI.

Security Considerations

  • Tokens are short-lived; enable MFA/workload identity.
  • Never hardcode secrets; use secret managers.
  • No PII in headers; classification-driven redaction in logs/exports.
  • Export manifests are signed; verify before processing.

Observability Hooks

  • Pass your tracer/provider to the SDK to attach spans (gateway → service → broker → store).
  • Metrics: per-call latency histograms, retry counts, 4xx/5xx ratios.
  • Correlate with traceId from Problem+JSON on errors.

Common Patterns & Anti-patterns

Do:

  • Use IdempotencyKey for every append.
  • Batch appends with bounded payloads; avoid oversized attributes.
  • Respect Retry-After and X-RateLimit-*.
  • Propagate tenantId consistently.

Don’t:

  • Put secrets/PII into headers.
  • Bypass SDK schema types with “raw” posts.
  • Retry on validation/auth errors.
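
The "bounded payloads" guidance above can be implemented with a simple size-based batcher; the 1 MiB cap below is an assumed example, not a documented ATP limit:

```typescript
// Sketch: split records into batches whose serialized size stays
// under a byte budget, so no append request grows oversized.
function batchBySize<T>(
  records: T[],
  maxBytes = 1_048_576, // assumed 1 MiB example budget
  sizeOf: (r: T) => number = (r) => Buffer.byteLength(JSON.stringify(r))
): T[][] {
  const batches: T[][] = [];
  let current: T[] = [];
  let currentBytes = 0;
  for (const r of records) {
    const size = sizeOf(r);
    // Flush when adding this record would exceed the budget
    // (a single oversized record still gets its own batch).
    if (current.length > 0 && currentBytes + size > maxBytes) {
      batches.push(current);
      current = [];
      currentBytes = 0;
    }
    current.push(r);
    currentBytes += size;
  }
  if (current.length > 0) batches.push(current);
  return batches;
}
```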


Risks & Mitigations

This section catalogs the top 8 architectural risks for ATP with signals, mitigations, and contingency playbooks. Each risk has an owner, testable controls, and traceability to SLOs/ADRs.

Severity (S): Low/Med/High. Likelihood (L): Low/Med/High.

Summary Matrix

ID    | Risk                                                  | S    | L   | Primary Owner            | Key Signals / SLIs
R-001 | Scale spikes on write (ingestion surge)               | High | Med | Solution + DevOps        | 429 rate, outbox age p99, broker queue depth, projector lag p95
R-002 | Cost overruns (storage/exports/telemetry)             | Med  | Med | Cloud + Finance (FinOps) | Hot/Warm/Cold growth velocity, export concurrency, metrics/logs ingestion cost
R-003 | Data residency / sovereignty violation                | High | Low | Security + Data + Cloud  | Cross-region traffic events, storage location drift, restore target checks
R-004 | Schema drift / incompatible change                    | Med  | Med | Application + Data       | Contract test failures, schema registry diffs, consumer DLQ spike
R-005 | Tight coupling between services/contracts             | Med  | Med | Enterprise + Solution    | Change ripple count per deploy, cross-service failure blast radius
R-006 | Vendor lock-in (broker/DB/cloud features)             | Med  | Med | Enterprise + Infra       | Adapter coverage gaps, portability test failures
R-007 | Noisy neighbor (multi-tenant contention)              | High | Med | Enterprise + SRE         | Per-tenant 429s, query p95 regressions, export queue depth by tenant
R-008 | Compliance drift (controls stale vs GDPR/HIPAA/SOC2)  | High | Low | Security + Compliance    | Policy deny rates, retention job failures, audit finding backlog

R-001 — Scale spikes on write (ingestion surge)

  • Signals/SLIs: 429 rate ↑, outbox.relay.age ↑, broker depth ↑, projector lag p95 > SLO.
  • Mitigations:
    • Per-tenant rate limits/quotas at Gateway; 429 + Retry-After.
    • Transactional outbox, consumer concurrency autoscaling, partition by tenantId[:sourceId].
    • Back-pressure: bounded worker pools; bulkheads to protect Query/Export.
  • Contingency:
    • Enable surge mode: increase broker partitions/consumer concurrency; temporarily narrow schema acceptance windows.
    • Throttle exporters globally; prioritize ingestion SLIs.
  • Tests: synthetic hot-partition load; verify stable p95 and bounded lag.
  • Trace: SLOs in Observability & SLOs.
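
The per-tenant rate-limit mitigation is typically a token bucket at the Gateway; a minimal sketch, with illustrative parameters (the actual Gateway implementation is not specified here):

```typescript
// Sketch: per-tenant token bucket. A denied consume maps to
// 429 + Retry-After at the Gateway.
class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private capacity: number,     // burst size
    private refillPerSec: number, // sustained requests/second
    now = Date.now()
  ) {
    this.tokens = capacity;
    this.lastRefill = now;
  }

  tryConsume(now = Date.now()): boolean {
    const elapsedSec = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false; // caller answers 429 with Retry-After
  }
}
```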

R-002 — Cost overruns (storage/exports/telemetry)

  • Signals: Hot/Warm/Cold growth velocity; export concurrency; log/metrics ingestion cost.
  • Mitigations:
    • Tiering policy (Hot→Warm→Cold), compression, projection granularity reviews.
    • Export batch windows, per-tenant export caps, dedupe repeated selections.
    • Telemetry sampling (tail-based traces), cardinality caps (tenant.class buckets).
  • Contingency:
    • Apply emergency retention adjustments (policy-backed), reduce export concurrency; enable log down-sampling.
  • Tests: monthly FinOps review; cost regression checks in perf env.
    See Storage Strategy & Data Residency & Retention.

R-003 — Data residency / sovereignty violation

  • Signals: cross-region writes/reads, restore to wrong region, misconfigured buckets.
  • Mitigations:
    • Tenant→region map enforcement at Gateway; region-scoped storage and backups.
    • Infrastructure policies (deny cross-region replication by default).
    • Residency checks in CI/CD and synthetic restore drills (tenant/time-scoped).
  • Contingency:
    • Freeze affected tenant exports; relocate data; notify per contractual/DPA terms.
  • Tests: residency unit tests; periodic restore drills per region.
    See Privacy (GDPR/HIPAA/SOC2).

R-004 — Schema drift / incompatible change

  • Signals: contract test failures, consumer DLQ spikes, projector failures.
  • Mitigations:
    • Schema registry, additive-first evolution, versioned Published Language.
    • Producer/consumer contract tests in CI; dual-write/read during migrations.
    • Strict schema validation at Gateway; unknown fields rejected unless whitelisted.
  • Contingency:
    • Pin consumers; roll back producers; replay DLQ after schema fix.
  • Tests: golden schema vectors; migration rehearsals in staging with replay.
    See Message Schemas & REST APIs.

R-005 — Tight coupling between services/contracts

  • Signals: high change ripple; deploys blocked by downstream readiness; wide blast radius.
  • Mitigations:
    • Open Host Service at Gateway; asynchronous EDA between services.
    • Ports-and-adapters; domain isolated from frameworks; backward-compatible contracts.
    • Feature flags & canary releases to localize risk.
  • Contingency:
    • Breaker open on failing dependency; serve with last watermark; postpone non-critical enrichments.
  • Tests: chaos drills removing downstreams; verify graceful degradation.
    See Component Boundaries & Event-Driven Plan.

R-006 — Vendor lock-in (broker/DB/cloud features)

  • Signals: adapter gaps, reliance on proprietary features without abstraction, migration blockers.
  • Mitigations:
    • Abstraction layers for broker/index/persistence; avoid provider-specific payloads in domain.
    • Keep export/verify formats open (JSONL + signed manifests).
    • ADRs record trade-offs; periodic portability tests (alt broker/index in CI).
  • Contingency:
    • Side-by-side pilot on alternate provider; maintain dual adapters for a window.
  • Tests: contract suite against alt adapters; export/verify runs offline.
    See ADRs & Governance.

R-007 — Noisy neighbor (multi-tenant contention)

  • Signals: per-tenant 429s, p95/p99 regressions, export queue depth spikes, projector lag localized to a tenant.
  • Mitigations:
    • Per-tenant quotas for RPS/storage/exports; separate exporter pools; projector bulkheads per partition.
    • Cache protection and query timeouts; circuit breakers on read amplification paths.
  • Contingency:
    • Temporarily lower limits for offending tenants; schedule off-peak exports; enable sharding by tenantId:sourceId.
  • Tests: synthetic contention runs; verify isolation and SLO stability.
    See Multitenancy & Tenancy Guards.

R-008 — Compliance drift (controls stale vs GDPR/HIPAA/SOC2)

  • Signals: policy deny spikes, retention failures, missed rotations, audit finding backlog.
  • Mitigations:
    • Policies as code, versioned; CI checks for classification tags on new fields.
    • Automated retention jobs with attestations; scheduled key rotations; admin action audit trails.
    • Quarterly control attestation packs; DSR synthetics.
  • Contingency:
    • Freeze risky exports; hotfix policy sets; trigger IR playbooks and stakeholder comms.
  • Tests: DPIA gates for new data classes; DSR rehearsal; rotation drills.
    See Compliance & Privacy & Security & Compliance.

Governance & Traceability

  • Each risk maps to ADRs (decision logs), linked mitigations, and SLOs (error budgets).
  • Risk review cadence: monthly in Ops/Architecture, quarterly for Compliance/Exec.
  • Changes to risk posture require an ADR update and CI policy updates.


ADR Index & Governance

Architecture decisions are captured as ADRs (Architecture Decision Records) to make trade-offs explicit, auditable, and traceable to roadmap items and SLOs. This section defines where ADRs live, how they’re authored/reviewed, and a starter index of the key decisions for ATP.

Where ADRs Live

  • Repository path: /docs/adrs/ (one file per decision).
  • Naming: ADR-<YYYY>-<NNN>-<kebab-title>.md (e.g., ADR-2025-001-tenancy-model.md).
  • Status taxonomy: Proposed → Accepted → Deprecated → Superseded.

Each PR that changes contracts, data models, or deployment posture must reference an ADR (existing or new).
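
The naming convention lends itself to a CI lint; the regex below is one illustrative interpretation of the ADR-<YYYY>-<NNN>-<kebab-title>.md pattern, not an official check:

```typescript
// Sketch: validate ADR filenames against the naming convention,
// e.g. ADR-2025-001-tenancy-model.md.
const ADR_NAME = /^ADR-\d{4}-\d{3}-[a-z0-9]+(?:-[a-z0-9]+)*\.md$/;

function isValidAdrFilename(name: string): boolean {
  return ADR_NAME.test(name);
}
```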

Decision Lifecycle

flowchart LR
  P[Propose ADR<br/>draft PR] --> R[Review<br/>arch council + owners]
  R -->|approve| A[Accepted<br/>merge + tag]
  R -->|revise| P
  A --> I[Implement<br/>code + docs + SLOs]
  I --> E[Evaluate<br/>metrics + post-deploy]
  A --> S[Supersede/Deprecate<br/>new ADR links prior]

RACI

  • Responsible: Proposal author (feature owner)
  • Accountable: Enterprise Architect (final sign-off)
  • Consulted: Solution, Security, Data, Cloud/Infra, DevOps Architects
  • Informed: PM/Delivery, Compliance, SRE

ADR Template (use for new decisions)

---
adr: ADR-2025-XXX
title: <Concise decision title>
status: Proposed | Accepted | Deprecated | Superseded by ADR-YYYY-NNN
owners:
  - <role/name>
date: 2025-..-..
links:
  issues: [ <link(s) to epics/issues> ]
  docs:
    - ../architecture/architecture.md#<section-anchor>
    - ../domain/contracts/...
slo_impact:
  - <which SLOs/SLIs are affected>
risk:
  severity: Low|Med|High
  mitigations: [ <refs to sections/runbooks> ]
---

## Context
<Problem statement, constraints, alternatives considered, why now>

## Decision
<The option chosen and why; scope and boundaries; tenant/edition impact>

## Consequences
<Positive/negative trade-offs, operational impacts, cost and complexity, migration notes>

## Implementation Notes
<High-level tasks, rollout plan, feature flags, compatibility windows>

## Verification
<How we’ll verify success: metrics, tests, drills, acceptance gates>

## References
<Links to PoCs, benchmarks, standards, prior ADRs>

ADR Index (starter set)

ADR          | Title                                                                                              | Status   | Links
ADR-2025-001 | Tenancy Model & Guards (explicit tenant on every boundary; RLS; edition gating at Gateway/Policy)  | Accepted | Multitenancy
ADR-2025-002 | Event Bus & EDA Guarantees (at-least-once, idem keys, outbox/inbox, partitioning by tenantId[:sourceId]) | Accepted | Event-Driven Plan
ADR-2025-003 | Hash Chains + Signed Checkpoints (export manifests with proofs)                                    | Accepted | Integrity
ADR-2025-004 | Storage Tiering (Hot/Warm/Cold) & lifecycle enforcement                                            | Accepted | Storage Strategy
ADR-2025-005 | Schema Registry & Evolution Policy (additive-first, breaking via new subject/major)                | Accepted | Message Schemas
ADR-2025-006 | Gateway Versioning & Rate Limits (Problem+JSON, 429 with budgets)                                  | Proposed | API Gateway
ADR-2025-007 | Per-Tenant Export Concurrency Caps (bulkheads to protect Query SLOs)                               | Proposed | Reliability & Resilience
ADR-2025-008 | Policy as Code (classification/retention/redaction versioned; stamped on write)                    | Accepted | Compliance & Privacy

When an ADR is superseded, update the index and add a Superseded by ADR-… line at the top of the older record.

Governance Rules (what requires an ADR)

  • Domain contracts: REST/gRPC schemas, message subjects/schemas, webhook signatures.
  • Persistence/Index: new tables/indices, partitioning strategy changes, retention rules.
  • Security: identity model changes, key/crypto algorithms, breakout from Zero Trust defaults.
  • Platform: event bus/provider changes, region/residency posture, ingress/WAF changes.
  • SLO/Cost: material shifts in error budgets, telemetry retention, cost levers.

Minor refactors that do not alter contracts, SLOs, or posture can proceed without a new ADR, but must reference related ADRs in PRs.

Quality Gates (CI/CD)

  • Lint: ADR front-matter required (status, owners, links), broken links fail build.
  • Contract tests: PRs that touch /domain/contracts must reference an ADR.
  • Docs check: architecture/ pages must not link to Deprecated ADRs without also linking the successor.
  • Changelog: /reference/changelog.md automatically includes ADR titles on merge.

Traceability

  • Each section of this document references one or more ADRs; ADRs link back here via anchors.
  • Roadmap epics reference ADR IDs; production incidents include ADR references in the post-mortem template.

Cadence & Forums

  • Weekly architecture sync (triage new ADRs, status reviews).
  • Monthly risk/governance review aligning with Risks & Mitigations.
  • Quarterly compliance/controls review (SOC 2 evidence packs, DPIA triggers).

Testable Controls

  • Pipeline fails if a change to contracts/index/storage lacks an ADR reference.
  • Docs link checker: ADR anchors valid; “superseded” graph has no orphans.
  • Synthetic audit: sampled PRs verify that Problem+JSON responses include traceId and that release notes link the relevant ADRs.


Traceability to Roadmap

A compact matrix mapping this document's sections to roadmap Epics/Features and their artifact locations under /docs.

Section                                       | Roadmap Epic / Feature                           | Artifact (under /docs)
Purpose & Principles                          | AUD-ARC-001 / GOV                                | /docs/architecture/architecture.md#purpose
System Context (C4 L1)                        | AUD-ARC-001 / HLD                                | /docs/architecture/hld.md
Bounded Contexts & Context Map                | AUD-ARC-001 / DDD                                | /docs/architecture/context-map.md
Core Services & Containers (C4 L2)            | AUD-ARC-001 / HLD                                | /docs/architecture/hld.md
Component Boundaries (C4 L3)                  | AUD-ARC-001 / HLD                                | /docs/architecture/components.md
C06 – Event-Driven Communication Plan         | AUD-ARC-001 / HLD-T002                           | /docs/domain/events-catalog.md
C07 – Multitenancy & Tenancy Guards           | AUD-TENANT-001                                   | /docs/platform/multitenancy-tenancy.md
C08 – Security Architecture (Zero Trust)      | AUD-SECURITY-001, AUD-IDENTITY-001               | /docs/platform/security-compliance.md
C09 – Compliance & Privacy (GDPR/HIPAA/SOC2)  | AUD-COMPLIANCE-001                               | /docs/platform/privacy-gdpr-hipaa-soc2.md
C10 – Data Architecture Overview              | AUD-STORAGE-001, AUD-QUERY-001                   | /docs/architecture/data-model.md
C11 – Storage Strategy (summary)              | AUD-STORAGE-001                                  | /docs/implementation/persistence.md
C12 – Sequence Flows (append/query/export)    | AUD-INGEST-001, AUD-QUERY-001, AUD-EXPORT-001    | /docs/architecture/sequence-flows.md
C13 – Deployment Views (baseline)             | AUD-OPS-001 (DevOps & Envs)                      | /docs/architecture/deployment-views.md
C14 – API Gateway & Connectivity              | AUD-GATEWAY-001                                  | /docs/architecture/architecture.md#api-gateway-connectivity
C15 – Observability & SLOs                    | AUD-OTEL-001                                     | /docs/operations/observability.md
C16 – Reliability & Resilience                | AUD-CHAOS-001                                    | /docs/implementation/outbox-inbox-idempotency.md
C17 – Integrity & Tamper-Evidence             | AUD-INTEGRITY-001                                | /docs/hardening/tamper-evidence.md
C18 – SDK & Integration Guidance              | AUD-SDK-001                                      | /docs/sdk/
C19 – Risks & Mitigations                     | Governance cadence                               | /docs/architecture/architecture.md#risks-mitigations
C20 – ADR Index & Governance                  | ADR process                                      | /docs/adrs/
C21 – Traceability to Roadmap                 | Plan baseline                                    | /docs/planning/index.md