
Audit & Compliance (Audit Trail) — HLD & DDD Blueprint

Executive Introduction

Every serious SaaS needs a trustworthy memory: a tamper-evident, tenant-scoped Audit Trail that records who did what, to which resource, when, where, and why/how—without leaking sensitive data. This blueprint defines a reusable Audit & Compliance bounded context you can drop into any ConnectSoft-style solution (microservices or modular monoliths). It is append-only, idempotent, multi-tenant, and compliance-ready (classification, redaction, retention, legal hold). Unlike typical one-off audit logs, this BC formalizes protocol-agnostic ingest (HTTP, gRPC, Service Bus / MassTransit, and Orleans actors) while keeping a single canonical domain model.

This cycle establishes the scope, non-goals, boundaries, and the Ubiquitous Language that downstream teams will rely on in subsequent cycles.


Scope (What this context does)

  • Immutable, append-only audit ingestion of domain-significant actions and access decisions.
  • Correlation across services (trace/request/causation IDs) with strict per-tenant isolation.
  • Data classification at write and redaction by policy at both write and read.
  • Retention policies & legal holds for compliance and eDiscovery.
  • Integrity proofing (hash chain / segment sealing) for tamper-evidence.
  • Export for auditors/eDiscovery with signed manifests and integrity proofs.
  • Multi-transport ingest via HTTP, gRPC, Service Bus (MassTransit), and Orleans actors, all normalized into a single Inbox/Dedupe pipeline.

Non-Goals (Deliberate exclusions)

  • Not an analytics warehouse (we provide filtered queries & exports, not OLAP).
  • Not a SIEM replacement (can feed SIEM via streams/exports).
  • Not a general event store (only audit-grade records).
  • Not user activity tracking for growth marketing (belongs to Product Analytics BC).

Context in the Platform (high-level)

flowchart LR
  subgraph Producers - Domain BCs
    A[Identity & Access]
    B[Billing & Payments]
    C[Config & Feature Flags]
    D[Core Business Domains]
  end

  A --> H
  B --> H
  C --> H
  D --> H

  subgraph Inbound Adapters
    H[HTTP Ingest API]
    G[gRPC Ingest Service]
    M[Service Bus - MassTransit]
    O[Orleans Actor Ingest]
  end

  H --> I
  G --> I
  M --> I
  O --> I

  subgraph AU[Audit & Compliance BC]
    I[(Inbox + Dedupe)]
    S[(Append-Only Store)]
    CL[Classification/Redaction]
    RP[Retention/Legal Hold]
    IL[Integrity Ledger]
    EX[Export Service]
    I --> S
    S <--> CL
    S <--> RP
    S <--> IL
    S --> EX
  end

  EX --> E[(Auditors / eDiscovery)]
  AU --> SIEM[(SIEM / SOC)]
  AU --> OBS[(Observability & Compliance Dashboards)]

Boundary & Interfaces (concise)

Inbound (multiple transports; same canonical envelope)

  • HTTP: POST /audit/records (single/batch) with Idempotency-Key, Tenant-Id, correlation headers.
  • gRPC: AuditIngest.Append(AuditRecordBatch); metadata carries Tenant-Id, Idempotency-Key, correlation.
  • Service Bus (MassTransit): Topic/queue audit.records.v1; message header requirements identical to HTTP/gRPC.
  • Orleans: IAuditIngestGrain.Append(AuditRecordBatch, IngestMetadata) for high-trust in-cluster producers.

All transports first land in Inbox + Dedupe, then the canonical Append-Only Store. No path bypasses the inbox.

Outbound

  • Query APIs for timelines/filters (redacted by default).
  • ExportJob for controlled extractions (signed manifests + integrity proofs).
  • Optional stream/webhook for high-value audit events to external tools.

Transport Normalization Matrix

| Dimension | HTTP | gRPC | Service Bus (MassTransit) | Orleans |
| --- | --- | --- | --- | --- |
| Delivery | Sync (request/response) | Sync (unary/stream) | Async (brokered) | Sync intra-cluster |
| Envelope | JSON | Protobuf | Protobuf/JSON (contracted) | Protobuf/POCO |
| Required Meta | Tenant-Id, Idempotency-Key, Trace-Id | Same via metadata | Same via headers | Same via args |
| Dedupe Key | Tenant-Id + Idempotency-Key | Same | Same | Same |
| Backpressure | HTTP 429 + retry | gRPC status + retry | Broker DLQ/deferral | Grain throttling |

Ubiquitous Language (UL)

Use these terms exactly in code, docs, tests, and conversations. Treat them as canonical.

| Term | Type | Concise Definition | Notes / Examples |
| --- | --- | --- | --- |
| AuditStream | Aggregate Root | Logical, tenant-scoped channel of audit records, often partitioned by category/domain. | e.g., tenant:abc / category:identity |
| AuditRecord | Entity | Immutable, append-only fact (actor, action, resource, decision, metadata). | Idempotent, classified, correlated. |
| DataClass | Enum | Sensitivity tags for fields. | Public, Internal, Personal, Sensitive, Credential, PHI |
| RedactionRule | VO/Policy | Field-level handling by class: hash, mask, drop, tokenize. | Applies at write and read. |
| RetentionPolicy | Aggregate Root | Rules for duration/purge per tenant/category/stream. | e.g., 365d standard, HIPAA overlay longer. |
| LegalHold | Aggregate Root | Suspension of purge for matching records/segments. | Blocks retention purge; fully audited. |
| ExportJob | Aggregate Root | Purpose-limited extraction with manifest & proofs. | JSONL + signed manifest. |
| IntegrityProof | Value Object | Tamper-evidence for segments/exports (hash chain/Merkle). | Externally verifiable. |
| Correlation | Concept | End-to-end linkage (traceId, requestId, causationId). | Mandatory in ingest metadata. |

Canonical AuditRecord — Shape (preview; formal schema later)

{
  "id": "ULID/UUID (immutable)",
  "tenantId": "string",
  "occurredAtUtc": "ISO-8601",
  "actor": { "type": "user|service", "id": "string", "display": "optional" },
  "action": "string",
  "resource": { "type": "string", "id": "string", "path": "optional" },
  "decision": { "outcome": "allow|deny|n/a", "reason": "optional" },
  "context": { "ip": "optional", "userAgent": "optional", "clientApp": "optional" },
  "before": "optional (diff-safe, redaction-aware)",
  "after": "optional (diff-safe, redaction-aware)",
  "classes": ["DataClass"],
  "correlation": { "traceId": "string", "requestId": "string", "causationId": "string" },
  "idempotencyKey": "string (required for online ingest)",
  "integrity": { "segmentId": "string", "hash": "string" }
}

Required Metadata & Idempotency (applies to all transports)

  • Tenant-Id — required (header/metadata/param depending on transport).
  • Idempotency-Key — required for online ingest; dedupe per tenant.
  • Trace-Id / Request-Id — strongly recommended (platform default: required).
  • Producer — service name/version for provenance.

Duplicates (same Tenant-Id + Idempotency-Key) are acknowledged as success with no second write.
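The duplicate-handling rule above can be sketched with a minimal in-memory dedupe store (hypothetical stand-in; production backs this with the persistent (tenantId, idempotencyKey) → recordId mapping):

```csharp
using System;
using System.Collections.Concurrent;

// Minimal sketch: the first write wins; replays are acknowledged with the original record id.
public sealed class DedupeStore
{
    private readonly ConcurrentDictionary<(string TenantId, string Key), string> _seen = new();

    public (bool IsDuplicate, string RecordId) TryRegister(
        string tenantId, string idempotencyKey, string candidateRecordId)
    {
        // GetOrAdd returns the previously stored id when the key already exists.
        var stored = _seen.GetOrAdd((tenantId, idempotencyKey), candidateRecordId);
        return (stored != candidateRecordId, stored);
    }
}
```

A duplicate therefore surfaces as a success response carrying the original id, never as a second append.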


Tenancy & Isolation (principles)

  • Every write/read is scoped to a single tenant by boundary policy.
  • Cross-tenant reads are disallowed except platform-admin with explicit justification and enhanced auditing.
  • Storage, indexes, and exports are partitioned by tenant and/or category.

Data Classification & Redaction (principles)

  • Classification defaults may be inferred; producers can hint, but the Audit service decides.
  • Redaction applies at write (minimize risk) and on read (least privilege).
  • Secrets/credentials are never stored in cleartext; hash/tokenize only.

Retention & Legal Hold (principles)

  • Purge eligibility derives from RetentionPolicy; purge operations are themselves audited.
  • LegalHold blocks purge of matching records/segments until explicitly released.

Integrity & Tamper-Evidence (principles)

  • Records link into integrity segments; sealed segments produce IntegrityProofs.
  • Exports include signed manifests with segment proofs for third-party verification.
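As an illustration of the hash-chain idea (not the formal sealing algorithm), each record hash can be folded into a running segment head; the sealed head then serves as an externally replayable proof:

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Illustrative segment hash chain: head' = SHA256(head || recordHash).
public sealed class SegmentHashChain
{
    private string _head = new string('0', 64); // genesis head (hypothetical convention)

    public string Head => _head;

    public string Append(string recordHash)
    {
        var input = Encoding.UTF8.GetBytes(_head + recordHash);
        _head = Convert.ToHexString(SHA256.HashData(input)).ToLowerInvariant();
        return _head;
    }

    // Verification replays the chain; the sealed head is the segment's proof.
    public static bool Verify(IEnumerable<string> recordHashes, string expectedSeal)
    {
        var chain = new SegmentHashChain();
        foreach (var h in recordHashes) chain.Append(h);
        return chain.Head == expectedSeal;
    }
}
```

Because each head depends on the previous one, reordering or altering any record changes the final seal, which is what makes tampering evident.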

Seed Glossary (for downstream teams)

  • Actor — The principal performing the action (human user, service, job).
  • Action — Verb describing the operation (e.g., User.Login, Payment.Refunded).
  • Resource — Business object acted upon (type + ID; optional path).
  • Decision — Access/policy outcome with optional reason.
  • Context — Environmental info aiding investigations (IP, UA, client).
  • Timeline — Time-ordered view of audit records per tenant/resource.
  • Segment — Contiguous span of records used for integrity sealing.
  • Backfill — Trusted bulk ingestion path for historical records.
  • eDiscovery — Formal process to collect/export records for legal/audit.
  • Purpose Limitation — Access granted only for a declared, auditable purpose.

Example Narratives (to align intent)

  • “As an Auditor, I need a redacted timeline of User Management actions in Tenant A between two dates, filterable by actor and resource.”
  • “As a Security Engineer, I need to verify integrity proofs for a specific export to confirm immutability.”
  • “As a Compliance Officer, I must place a Legal Hold on all records that mention Case #12345.”

Aggregate Map (first pass)


Context Map

flowchart LR
  subgraph AUD[AuditStream BC - Microservice]
    AUD_AR[Aggregate: AuditStream]
  end

  subgraph RET[RetentionPolicy BC - Microservice]
    RET_AR[Aggregate: RetentionPolicy]
  end

  subgraph HOLD[LegalHold BC - Microservice]
    HOLD_AR[Aggregate: LegalHold]
  end

  subgraph EXP[Export BC - Microservice]
    EXP_AR[Aggregate: ExportJob]
  end

  subgraph CLASS[ClassificationPolicy BC - Microservice]
    CLASS_AR[Aggregate: ClassificationPolicy]
  end

  subgraph INTG[IntegrityLedger BC - Microservice]
    INTG_AR[Aggregate: IntegrityLedger]
  end

  AUD -->|audit.record_appended| RET
  AUD -->|audit.record_appended| INTG
  HOLD -->|legalhold.applied| RET
  RET -->|retention.window_elapsed| AUD
  EXP -->|export.completed| INTG
  CLASS -->|classification.policy_changed| AUD
  AUD -->|audit.timeline.requested| EXP

Aggregates & Microservice Boundaries

AuditStream (BC / Aggregate Root) The append-only, tenant-scoped stream of AuditRecord facts. Owns write-path invariants, classification hooks, and correlation.

Invariants

  • Append-only; no updates or deletes.
  • Tenant isolation by tenantId.
  • Idempotency on (tenantId, idempotencyKey).
  • Correlation metadata required.

Commands

  • AppendAuditRecord / BatchAppendRecords
  • SealCurrentSegment

Events

  • audit.record_appended
  • audit.segment_seal_requested

RetentionPolicy (BC / Aggregate Root) Defines purge windows and scope per tenant/category; computes eligibility and orchestrates purges.

Invariants

  • Cannot shorten windows below compliance minimums.
  • Versioned, forward-dated revisions.

Commands

  • PutRetentionPolicy
  • EvaluatePurgeEligibility
  • ExecuteRetentionPurge

Events

  • retention.policy_changed
  • retention.window_elapsed
  • retention.purged

LegalHold (BC / Aggregate Root) Places and releases legal holds on matching record sets.

Invariants

  • Must specify scope and caseId.
  • Blocks purge until release.

Commands

  • PlaceLegalHold
  • ReleaseLegalHold

Events

  • legalhold.applied
  • legalhold.released

ExportJob (BC / Aggregate Root) Purpose-limited extraction of audit data with signed manifest and proofs.

Invariants

  • Must declare purpose, requestor, redaction level, and scope.
  • Immutable post-completion.

Commands

  • StartExport
  • FinalizeExport
  • CancelExport

Events

  • export.started
  • export.chunk_ready
  • export.completed
  • export.failed

ClassificationPolicy (BC / Aggregate Root) Manages DataClass → RedactionRule mappings and overrides.

Invariants

  • Policies are versioned, forward-only.
  • Reserved classes must not allow cleartext.

Commands

  • DefineClassificationPolicy
  • UpdateRedactionRules

Events

  • classification.policy_changed

IntegrityLedger (BC / Aggregate Root) Maintains tamper-evidence via hash chains/Merkle segments.

Invariants

  • Sealed segments cannot be reopened.
  • Canonical, order-stable hashing inputs.

Commands

  • SealSegment
  • ComputeIntegrityProof
  • AttachExportProof

Events

  • integrity.segment_sealed
  • integrity.proof_computed

Cross-Context Policies/Sagas

  • Segment Sealing Policy Triggered by audit.record_appended thresholds; leads to SealSegment.

  • Retention Enforcement Saga Triggered by retention.window_elapsed; evaluates eligibility, purges, emits retention.purged.

  • Legal Hold Guard Policy Triggered by legalhold.applied or released; updates indexes to block or allow purge.

  • Export Proof Attachment Policy Triggered by export.completed; attaches proof to manifest and verification endpoint.


Justification

  • One command → one aggregate ensures strong invariants without distributed transactions.
  • Cross-context effects expressed as events; sagas enforce coordination.
  • Separation allows scaling, isolation, and compliance overlays per BC.

Event Inventory

  • audit.record_appended
  • audit.segment_seal_requested
  • classification.policy_changed
  • retention.policy_changed
  • retention.window_elapsed
  • retention.purged
  • legalhold.applied
  • legalhold.released
  • export.started
  • export.chunk_ready
  • export.completed
  • export.failed
  • integrity.segment_sealed
  • integrity.proof_computed

AuditRecord model (VO + entity shape)


Purpose

Define the canonical, append-only AuditRecord used across all inbound transports (HTTP, gRPC, Service Bus/MassTransit, Orleans). The model preserves investigative value while minimizing exposure through classification and redaction hooks. It enforces tenant isolation, idempotency, and correlation on the write path.


Conceptual Model

classDiagram
  class AuditRecord {
    +string id
    +string tenantId
    +Instant occurredAtUtc
    +Actor actor
    +string action
    +ResourceRef resource
    +Decision decision
    +Context context
    +Delta before
    +Delta after
    +string[] classes
    +Correlation correlation
    +string idempotencyKey
    +IntegrityRef integrity
  }

  class Actor {
    +ActorType type  // user|service|job
    +string id
    +string display [optional]
    +string[] roles [optional]
  }

  class ResourceRef {
    +string type
    +string id
    +string path [optional]
    +string tenantScopedId [optional]
  }

  class Decision {
    +DecisionOutcome outcome // allow|deny|na
    +string reason [optional]
    +map<string,string> attributes [optional]
  }

  class Context {
    +string ip [optional]
    +string userAgent [optional]
    +string clientApp [optional]
    +string location [optional] // ISO-3166 + geo-hint (redacted)
  }

  class Delta {
    +map<string,any> fields
    +map<string,RedactionHint> hints
  }

  class Correlation {
    +string traceId
    +string requestId
    +string causationId [optional]
    +string producer // service/version
  }

  class IntegrityRef {
    +string segmentId [optional]
    +string hash [optional]
  }

Field Inventory & Rules

| Field | Type | Required | Rules / Invariants |
| --- | --- | --- | --- |
| id | ULID/UUID | Yes | Immutable; generated server-side if absent. |
| tenantId | String | Yes | Must match authenticated tenant; used for partitioning/isolation. |
| occurredAtUtc | ISO-8601 | Yes | Server-normalized to UTC; must not be in the future beyond tolerance (e.g., 10m). |
| actor | Actor | Yes | type ∈ {user,service,job}; id non-empty. |
| action | String | Yes | Namespaced verb, e.g., User.Login, Payment.Refunded. |
| resource | ResourceRef | Yes | type and id required; tenantScopedId optional helper. |
| decision | Decision | No | If present, outcome ∈ {allow,deny,na}. |
| context | Context | No | PII-safe; IP normalization (IPv4/IPv6), UA truncation. |
| before / after | Delta | No | Diff-safe; classification/redaction rules applied per-field. |
| classes | Array | No | Sensitivity tags; server may add/override. |
| correlation | Correlation | Yes | traceId and requestId required; producer required. |
| idempotencyKey | String | Yes | Scoped by tenantId; duplicate writes are no-ops. |
| integrity | IntegrityRef | No | Populated by sealing pipeline (post-append). |

JSON Canonical Envelope (ingest/at-rest)

{
  "id": "01J9Z7B19X4P8ZQ7H6M4V6GQWY",
  "tenantId": "t-9c8f1",
  "occurredAtUtc": "2025-09-30T12:34:56Z",
  "actor": { "type": "user", "id": "u-12345", "display": "Jane Admin", "roles": ["admin"] },
  "action": "User.PasswordChanged",
  "resource": { "type": "User", "id": "u-12345" },
  "decision": { "outcome": "allow", "reason": "MFA_OK" },
  "context": { "ip": "203.0.113.42", "userAgent": "Chrome/140", "clientApp": "Portal" },
  "before": { "fields": { "password.lastChangedAt": "2025-06-01T10:00:00Z" } },
  "after": { "fields": { "password.lastChangedAt": "2025-09-30T12:34:56Z" } },
  "classes": ["Internal"],
  "correlation": { "traceId": "tr-abc", "requestId": "rq-xyz", "producer": "iam-service@1.12.3" },
  "idempotencyKey": "iam:pwd-change:u-12345:2025-09-30T12:34:56Z",
  "integrity": null
}

Protobuf (transport-friendly core)

syntax = "proto3";
package audit.v1;

enum ActorType { ACTOR_TYPE_UNSPECIFIED = 0; USER = 1; SERVICE = 2; JOB = 3; }
enum DecisionOutcome { DECISION_OUTCOME_UNSPECIFIED = 0; ALLOW = 1; DENY = 2; NA = 3; }

message Actor { ActorType type = 1; string id = 2; string display = 3; repeated string roles = 4; }
message ResourceRef { string type = 1; string id = 2; string path = 3; string tenant_scoped_id = 4; }
message Decision { DecisionOutcome outcome = 1; string reason = 2; map<string,string> attributes = 3; }
message Context { string ip = 1; string user_agent = 2; string client_app = 3; string location = 4; }
message RedactionHint { string rule = 1; } // hash|mask|drop|tokenize (+params later)
message Delta { map<string,string> fields = 1; map<string,RedactionHint> hints = 2; }
message Correlation { string trace_id = 1; string request_id = 2; string causation_id = 3; string producer = 4; }
message IntegrityRef { string segment_id = 1; string hash = 2; }

message AuditRecord {
  string id = 1;
  string tenant_id = 2;
  string occurred_at_utc = 3;
  Actor actor = 4;
  string action = 5;
  ResourceRef resource = 6;
  Decision decision = 7;
  Context context = 8;
  Delta before = 9;
  Delta after = 10;
  repeated string classes = 11;
  Correlation correlation = 12;
  string idempotency_key = 13;
  IntegrityRef integrity = 14;
}

Write-Path Invariants

  • Append-only

    • No updates/deletes to an existing AuditRecord.
    • Server rejects requests attempting to mutate existing IDs.
  • Tenant isolation

    • tenantId derived from auth/boundary must equal payload tenantId.
    • Records are partitioned by tenantId and (optionally) category/resource.type.
  • Correlation required

    • correlation.traceId, correlation.requestId, correlation.producer must be present.
    • If missing, the gateway injects them; producers should still send.
  • Idempotency

    • (tenantId, idempotencyKey) uniquely identifies a logical write.
    • On duplicate, server returns 200/OK with the original id and skips a second append.
  • Classification & redaction on write

    • Server applies ClassificationPolicy and transforms fields in before/after/context as needed (hash/mask/drop/tokenize).
    • Producers may send hints; the server’s policy is authoritative.
  • Clock skew tolerance

    • occurredAtUtc may differ within a small window (e.g., ±10 minutes) from server time; beyond that is rejected or flagged.

Validation & Normalization

  • Action naming: Domain.Verb[.PastParticiple] (e.g., User.Login, Payment.Refunded).
  • Resource identity: canonical type + id; optional hierarchical path (e.g., /orgs/12/users/u-12345).
  • IP: normalized textual form; strip ports; IPv6 compressed.
  • User agent: truncated to max length; known sensitive tokens removed.
  • Delta: only diff fields; large payloads must be summarized or tokenized.
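The rules above might be expressed as small validators like these (patterns, limits, and the clock-skew tolerance are illustrative defaults, not the formal schema):

```csharp
using System;
using System.Text.RegularExpressions;

public static class IngestValidation
{
    // Domain.Verb[.PastParticiple], e.g. User.Login or Payment.Refunded
    private static readonly Regex ActionPattern =
        new(@"^[A-Z][A-Za-z0-9]*(\.[A-Z][A-Za-z0-9]*){1,2}$");

    public static bool IsValidAction(string action) => ActionPattern.IsMatch(action);

    // Clock-skew gate used by the write path (tolerance configurable, e.g. 10 minutes).
    public static bool IsWithinSkew(DateTimeOffset occurredAtUtc, DateTimeOffset serverUtc,
                                    TimeSpan tolerance)
        => (serverUtc - occurredAtUtc).Duration() <= tolerance;

    // User agent truncation to a bounded stored length.
    public static string TruncateUserAgent(string userAgent, int maxLength = 256)
        => userAgent.Length <= maxLength ? userAgent : userAgent[..maxLength];
}
```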

Indexing & Storage Hints (at-rest)

Primary clustering/partitioning:

  • (tenantId, occurredAtUtc) for time-ordered scans.

Secondary indexes:

  • (tenantId, resource.type, resource.id, occurredAtUtc)
  • (tenantId, actor.id, occurredAtUtc)
  • (tenantId, action, occurredAtUtc)

Dedupe store:

  • (tenantId, idempotencyKey) → recordId

Segments (for integrity sealing):

  • Rolling segment map keyed by (tenantId, category/resource.type) with counters and byte budgets.

Security & Privacy Considerations

  • No secrets in cleartext (API keys, tokens, credentials) — must be hashed or dropped per policy.
  • Minimal context — only data needed for investigations (PII minimization).
  • Purpose-limited access — elevated, less-redacted views require dedicated scopes and are fully audited.
  • Backfill path — bulk/historical ingestion requires elevated trust and is isolated from online ingest.

Example Idempotency Keys (guidance)

  • Deterministic pattern: producer:domain:resourceId:action:occurredAtUtc
    • iam:pwd-change:u-12345:User.PasswordChanged:2025-09-30T12:34:56Z
  • For batch jobs, include batch ID and item ordinal if time is not unique.
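A helper following this deterministic pattern could look like the sketch below (the delimiter and field order are conventions taken from the examples above, not a mandated format):

```csharp
using System;

public static class IdempotencyKeys
{
    // producer:domain:resourceId:action:occurredAtUtc
    public static string For(string producer, string domain, string resourceId,
                             string action, DateTimeOffset occurredAtUtc)
        => string.Join(':', producer, domain, resourceId, action,
                       occurredAtUtc.UtcDateTime.ToString("s") + "Z");

    // Batch jobs: append batch id and item ordinal when timestamps are not unique.
    public static string ForBatchItem(string baseKey, string batchId, int ordinal)
        => $"{baseKey}:{batchId}:{ordinal}";
}
```

Determinism matters: a retrying producer that rebuilds the key from the same inputs will dedupe against its own earlier attempt.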

Error Handling (ingest)

  • 409 Conflict for cross-tenant mismatch or immutable-field violation.
  • 422 Unprocessable for schema/validation/classification rejections.
  • 200 OK (or 202 Accepted) for idempotent duplicates with original id.
  • Transport-specific equivalents in gRPC/Service Bus/Orleans with consistent error codes/messages.
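One way to keep those codes consistent across transports is a single outcome-to-status mapping; the gRPC names below are assumptions drawn from the standard status set, not a finalized contract:

```csharp
using System;

// Outcome names here are illustrative, mirroring the HTTP contract above.
public enum IngestOutcome { Duplicate, TenantMismatch, ValidationFailed }

public static class StatusMapping
{
    public static int ToHttp(IngestOutcome o) => o switch
    {
        IngestOutcome.Duplicate => 200,          // idempotent replay: success with original id
        IngestOutcome.TenantMismatch => 409,     // cross-tenant mismatch / immutable-field violation
        IngestOutcome.ValidationFailed => 422,   // schema/validation/classification rejection
        _ => throw new ArgumentOutOfRangeException(nameof(o))
    };

    public static string ToGrpc(IngestOutcome o) => o switch
    {
        IngestOutcome.Duplicate => "OK",                      // duplicates remain a success
        IngestOutcome.TenantMismatch => "PERMISSION_DENIED",
        IngestOutcome.ValidationFailed => "INVALID_ARGUMENT",
        _ => throw new ArgumentOutOfRangeException(nameof(o))
    };
}
```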

Data Classification & Redaction


Purpose

Establish a consistent, enforceable model for sensitivity classification and field-level redaction that applies uniformly to all ingested audit data—across transports and producers. The goals are to (1) minimize exposure at the moment of write and (2) enforce least-privilege at read time while preserving investigative value and integrity.


Conceptual Model

classDiagram
  class ClassificationPolicy {
    +string id
    +int version
    +map<string,DataClass> defaultByField
    +map<string,RedactionRule> rulesByClass
    +map<string,RedactionRule> overridesByField
    +Instant effectiveFromUtc
    +string author
  }

  class RedactionRule {
    +RedactionKind kind        // HASH | MASK | DROP | TOKENIZE | NONE
    +map<string,string> params // e.g., hash=SHA256, mask=showLast=4
  }

  class RedactionPlan {
    +map<string,RedactionRule> byField
    +int policyVersion
  }

  class DataClass {
    <<Enumeration>>
    PUBLIC
    INTERNAL
    PERSONAL      // PII lite
    SENSITIVE     // PII/financial
    CREDENTIAL    // secrets/tokens
    PHI           // health data
  }

  class RedactionKind {
    <<Enumeration>>
    NONE
    HASH
    MASK
    DROP
    TOKENIZE
  }
  • ClassificationPolicy (Aggregate Root, separate BC): authoritative mapping from field → DataClass and DataClass → RedactionRule, with optional field-level overrides.
  • RedactionPlan: a compiled, per-write snapshot derived from the active policy version; stored alongside the record (policy version only) to make transformations reproducible.

Principles

  • Classify at Write: every candidate field (context, before/after deltas, selected headers) is classified and transformed immediately according to the active policy.
  • Redact at Write & Read: write-time transformation minimizes stored risk; read-time gates enforce least-privilege (default redacted views; elevated scopes may reveal less-redacted data, never more than policy allows).
  • Server Is Authoritative: producers may attach hints; the service computes final classification and redaction.
  • Forward-Only Versioning: policies are versioned; a new version never “unmasks” previously dropped data. Historical records keep their write-time transformations.

Default DataClass & Rule Set (baseline)

| DataClass | Default Redaction Rule | Typical Fields |
| --- | --- | --- |
| PUBLIC | NONE | resource type, action name, coarse timestamps |
| INTERNAL | MASK(showLast=4) | internal IDs, non-PII metadata |
| PERSONAL | HASH(SHA256+salt) | emails, phone numbers (normalized before hash) |
| SENSITIVE | MASK(showLast=2) | partial addresses, last4 PAN (never full PAN) |
| CREDENTIAL | DROP | API keys, auth tokens, passwords |
| PHI | TOKENIZE(namespace=phi) | diagnosis codes, lab values (token reference stored) |

Jurisdictional overlays (e.g., HIPAA/GDPR) are implemented as policy presets and constraints; CREDENTIAL must never be stored in cleartext.
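The MASK/DROP/TOKENIZE rules in the baseline table can be illustrated as follows (parameter names such as showLast follow the table; the TOKEN reference format is a hypothetical convention):

```csharp
using System;

public static class Redactor
{
    // MASK(showLast=n): keep the last n characters, mask the rest.
    public static string Mask(string value, int showLast, char maskChar = '*')
        => value.Length <= showLast
            ? new string(maskChar, value.Length)
            : new string(maskChar, value.Length - showLast) + value[^showLast..];

    // DROP: the value is never persisted (CREDENTIAL class).
    public static string? Drop(string _) => null;

    // TOKENIZE(namespace=...): store a stable reference, resolvable only via the vault.
    public static string Tokenize(string ns, Guid token) => $"TOKEN:{ns}:{token:N}";
}
```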


Write Path (classification then transform)

  1. Normalize: standardize field formats (emails lowercased, phones E.164, IP canonical, UA trimmed).
  2. Classify: map fields → DataClass using defaultByField, producer hints, and heuristics.
  3. Plan: compile RedactionPlan (field → RedactionRule) using class rules + field overrides.
  4. Transform: apply rules in-place to before, after, context, and selected headers; attach policyVersion.
  5. Persist: store transformed payload only; do not persist secrets in cleartext.

Pseudocode (illustrative)

var policy = _policyProvider.GetActive(tenantId);
var plan = policy.Compile(record);

foreach (var (field, value) in record.TargetableFields())
{
    var rule = plan.For(field);
    record[field] = _redactor.Apply(rule, value);
}

record.PolicyVersion = policy.Version;
_appendOnlyStore.Write(record);

Read Path (least-privilege by default)

  • Default view (redacted): returns the stored (already transformed) values.
  • Elevated view (e.g., Auditor scope): may apply less masking (e.g., show more digits), but never reconstructs dropped/hashed/tokenized data.
  • Justification logging: elevated reads require purpose scope; each access is itself audited.

Field Targeting

  • Always subject to policy:
    • context.ip, context.userAgent, context.clientApp, resource.path
    • before.fields[*], after.fields[*]
  • Header/meta candidates: values mirrored from inbound headers that could include PII (e.g., email in actor display) must be classified too.
  • Non-targeted: structural fields (id, tenantId, occurredAtUtc, action, resource.type/id) remain visible unless specifically overridden by policy.

Tokenization & Hashing

  • Hashing: one-way (e.g., SHA256) with per-tenant salt kept in HSM/Key Vault; compare-only scenarios (e.g., “does this email appear?”) are supported via normalized input + same salt.
  • Tokenization: exchange sensitive value for stable token stored in a separate vault; only authorized services can resolve tokens, and resolution events are audited.
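A compare-only hash per the above might look like this sketch (salt retrieval from HSM/Key Vault is out of scope; the HASH:sha256: prefix mirrors the stored-payload examples):

```csharp
using System;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

public static class PersonalDataHasher
{
    // Compare-only: identical normalized input + identical tenant salt => identical hash.
    public static string HashEmail(string email, byte[] tenantSalt)
    {
        var normalized = email.Trim().ToLowerInvariant(); // same normalization on every compare
        var input = tenantSalt.Concat(Encoding.UTF8.GetBytes(normalized)).ToArray();
        var digest = SHA256.HashData(input);
        return "HASH:sha256:" + Convert.ToHexString(digest).ToLowerInvariant();
    }
}
```

Queries like "does this email appear?" hash the normalized probe with the same tenant salt and compare ciphertexts; the raw value is never recoverable.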

Policy Commands & Events (ClassificationPolicy BC)

  • Commands

    • DefineClassificationPolicy(policyDraft)
    • UpdateRedactionRules(patch)
    • DeprecatePolicy(policyId) (optional)
  • Events

    • classification.policy_changed (includes version, effective window)
    • classification.policy_deprecated

Constraints

  • Cannot weaken rules for classes like CREDENTIAL once established.
  • effectiveFromUtc must be ≥ now + safety window; no retroactive “unmasking.”

Transport-Agnostic Enforcement

  • HTTP/gRPC: service enforces in write pipeline; producers may include classificationHints.
  • Service Bus (MassTransit): consumer applies policy at consume time before append.
  • Orleans: ingest grain uses the same policy provider; compiled plan cached per policy version.

Testing & Verification

  • Golden cases: fixture payloads for each DataClass; snapshot expected redactions.
  • Property tests: secrets never appear in logs/DB; fuzz inputs for leakage.
  • Perf checks: rule application stays sub-millisecond per field; caching for policy and compiled plans.
  • Compliance checks: CI gate ensuring reserved classes (CREDENTIAL, PHI) have non-weakened rules.

Example

Input (producer hints email as PERSONAL):

{
  "context": { "ip": "2001:db8::1", "userAgent": "Chrome/140", "clientApp": "Portal" },
  "before": { "fields": { "email": "Alice@example.com" } },
  "after":  { "fields": { "email": "Alice@example.com" } },
  "classes": [],
  "hints": { "before.fields.email": "PERSONAL", "after.fields.email": "PERSONAL" }
}

Stored (write-time transformed; policy v7):

{
  "context": { "ip": "2001:db8::1", "userAgent": "Chrome/140", "clientApp": "Portal" },
  "before": { "fields": { "email": "HASH:sha256:7b1d..."} },
  "after":  { "fields": { "email": "HASH:sha256:7b1d..."} },
  "policyVersion": 7
}

Read (default): returns hashed values. Read (auditor scope): still hashed; may reveal additional metadata (e.g., normalization method), but not raw email.


Operational Notes

  • Caching: cache ClassificationPolicy by (tenantId, version); invalidate on policy_changed.
  • Observability: emit counters for rule application (hash, mask, drop, tokenize) and rejections.
  • Migrations: new policies apply only to future writes; optional re-scrub jobs can further restrict historical data (never weaken).
  • Docs & SDKs: client SDKs expose classificationHints to help the server classify, but do not implement redaction logic.
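The caching note above can be sketched as a per-tenant cache keyed by policy version, dropped when classification.policy_changed announces a new version (type and member names are illustrative):

```csharp
using System;
using System.Collections.Concurrent;

// Compiled redaction plans are cached per (tenantId, version).
public sealed class PolicyCache<TPlan>
{
    private readonly ConcurrentDictionary<string, (int Version, TPlan Plan)> _byTenant = new();

    public TPlan GetOrCompile(string tenantId, int activeVersion, Func<TPlan> compile)
    {
        // Cache hit only when the cached entry matches the active policy version.
        if (_byTenant.TryGetValue(tenantId, out var entry) && entry.Version == activeVersion)
            return entry.Plan;
        var plan = compile();
        _byTenant[tenantId] = (activeVersion, plan);
        return plan;
    }

    // Handler for classification.policy_changed.
    public void Invalidate(string tenantId) => _byTenant.TryRemove(tenantId, out _);
}
```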

Commands & Invariants (write path)


Purpose

Define the command surface for all write paths across the Audit & Compliance bounded contexts and codify the invariants enforced on each append or policy change. These rules are transport-agnostic (HTTP, gRPC code-first, Service Bus/MassTransit, Orleans); all ingress normalizes through the Inbox + Dedupe pipeline before hitting aggregates.


Command Catalog (by BC)

AuditStream BC

  • AppendAuditRecord(record)
  • BatchAppend(records[])
  • SealCurrentSegment() (policy-triggered; typically raised as an event request)

ClassificationPolicy BC

  • DefineClassificationPolicy(policyDraft)
  • UpdateRedactionRules(patch)

RetentionPolicy BC

  • PutRetentionPolicy(policySpec)
  • EvaluatePurgeEligibility(scope)
  • ExecuteRetentionPurge(scope)

LegalHold BC

  • PlaceLegalHold(holdSpec)
  • ReleaseLegalHold(holdId)

Export BC

  • StartExport(criteria)
  • FinalizeExport(jobId)
  • CancelExport(jobId)

IntegrityLedger BC

  • ComputeIntegrityProof(segmentId)
  • SealSegment(segmentKey)
  • AttachExportProof(jobId, proofRef)

Commands are handled one aggregate instance at a time. Cross-context effects happen via domain events and policies/sagas.


Global Invariants (applies to all write paths)

  • Append-only
    • AuditRecord writes are immutable. No updates, no deletes.
  • Per-tenant isolation
    • tenantId is authoritative from the boundary; payload tenantId must match.
  • Idempotency for online ingest
    • idempotencyKey is required for online writes; uniqueness on (tenantId, idempotencyKey).
  • Correlation
    • traceId and requestId required (gateway may inject). producer must be present.
  • Classification & redaction on write
    • Apply active ClassificationPolicy to targetable fields before persistence.
  • Legal hold precedence
    • Any operation that would purge or mutate visibility of held records is blocked.
  • Retention locks
    • A record/segment not yet retention-eligible cannot be destroyed; only redaction overlays allowed per policy.
  • Clock-skew tolerance
    • occurredAtUtc accepted within configured window; outliers rejected or flagged.
  • Transport-agnostic consistency
    • Same validation and invariants regardless of HTTP/gRPC/Bus/Orleans.

Write Path (sequence)

sequenceDiagram
  participant P as Producer (any BC)
  participant IN as Inbound Adapter (HTTP/gRPC/Bus/Orleans)
  participant ID as Inbox+Dedupe
  participant CP as ClassificationPolicy BC
  participant AS as AuditStream BC
  participant IL as IntegrityLedger BC
  participant RP as RetentionPolicy BC
  participant LH as LegalHold BC

  P->>IN: Command: AppendAuditRecord / BatchAppend
  IN->>ID: Normalize + headers (tenant, trace, idempotency)
  ID-->>IN: Duplicate? If yes -> ACK with original id
  IN->>CP: Resolve active policy + compile plan
  CP-->>IN: Policy(version) + redaction plan
  IN->>AS: AppendAuditRecord (transformed payload)
  AS-->>IN: Event: audit.record_appended
  IN->>IL: (Async) segment threshold check -> Seal request
  IN->>RP: (Async) notify retention indexes
  IN->>LH: (Async) check or update hold indexes

Validation Rules (selected highlights)

AppendAuditRecord / BatchAppend

  • Require: tenantId, occurredAtUtc, actor, action, resource, correlation, idempotencyKey (online).
  • Reject if: tenant mismatch, correlation missing, immutable field mutation detected, legal hold violated, occurredAtUtc out of tolerance, payload after classification still contains forbidden classes (e.g., CREDENTIAL cleartext).
  • Transform: apply classification/redaction; normalize IP/UA; compute deterministic id if absent (ULID/UUIDv7).
  • Emit: audit.record_appended; optionally audit.segment_seal_requested when thresholds hit.
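
The occurredAtUtc tolerance rule above can be sketched as follows; the window sizes and the helper name are illustrative assumptions, not the real configuration surface.

```csharp
using System;

// Illustrative clock-skew gate for AppendAuditRecord (window sizes assumed).
TimeSpan pastTolerance = TimeSpan.FromMinutes(10);   // accept slightly stale events
TimeSpan futureTolerance = TimeSpan.FromMinutes(2);  // tolerate minor clock drift

bool IsWithinSkewWindow(DateTime occurredAtUtc, DateTime nowUtc) =>
    occurredAtUtc >= nowUtc - pastTolerance &&
    occurredAtUtc <= nowUtc + futureTolerance;

var now = new DateTime(2025, 9, 30, 12, 0, 0, DateTimeKind.Utc);
Console.WriteLine(IsWithinSkewWindow(now.AddMinutes(-5), now)); // True  -> accept
Console.WriteLine(IsWithinSkewWindow(now.AddHours(-2), now));   // False -> reject or flag
```

Whether an out-of-window event is rejected outright or flagged for review is a policy decision; the gate itself stays the same across transports.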

Define/Update ClassificationPolicy

  • Require forward-only versioning; cannot weaken reserved classes (CREDENTIAL, PHI).
  • Effective time ≥ now + safety window.
  • Emit classification.policy_changed(version).

PutRetentionPolicy / ExecuteRetentionPurge

  • Retention window cannot be below compliance minimums.
  • Purge must exclude any scope under active LegalHold.
  • Emit retention.policy_changed, retention.window_elapsed, retention.purged.

Place/Release LegalHold

  • Require explicit scope (tenant, categories, time window) and caseId.
  • Hold blocks purge until release; all checks are index-assisted and auditable.
  • Emit legalhold.applied, legalhold.released.

Start/Finalize/Cancel Export

  • Require: purpose, requestor, time-bounded scope, redaction level.
  • Exports are immutable after finalize; artifacts signed and referenced in manifest.
  • Emit export.started, export.chunk_ready, export.completed/export.failed.

ComputeIntegrityProof / SealSegment

  • Sealed segments cannot be reopened.
  • Hash inputs must be canonical, order-stable.
  • Emit integrity.segment_sealed, integrity.proof_computed.
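
A minimal sketch of an order-stable hash chain for sealing, assuming each record has already been reduced to a canonical string; the genesis seed and separator are illustrative choices, not the actual sealing format.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Order-stable hash chain: fold record hashes in ascending sequence order.
string Sha256Hex(string input) =>
    Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(input))).ToLowerInvariant();

string ChainHash(IEnumerable<string> canonicalRecords)
{
    var running = Sha256Hex("genesis");          // illustrative segment seed
    foreach (var record in canonicalRecords)
        running = Sha256Hex(running + "|" + Sha256Hex(record));
    return running;                              // candidate RootHash
}

var rootHash = ChainHash(new[] { "rec-1", "rec-2", "rec-3" });
Console.WriteLine(rootHash.Length); // 64 hex characters (sha256)
```

Because every link depends on the previous one, reordering or mutating any record changes the root hash, which is what makes a sealed segment tamper-evident.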

Command Contracts (code-first examples)

gRPC code-first (C#)

public record AppendAuditRecordCommand(AuditRecord Record, string TenantId, string IdempotencyKey);
public record BatchAppendCommand(IReadOnlyList<AuditRecord> Records, string TenantId);

public interface IAuditIngestService // exposed via gRPC code-first & HTTP
{
    Task<AppendResult> AppendAsync(AppendAuditRecordCommand cmd, CancellationToken ct);
    Task<BatchAppendResult> BatchAppendAsync(BatchAppendCommand cmd, CancellationToken ct);
}

MassTransit (message contracts)

public interface AppendAuditRecord
{
    string TenantId { get; }
    string IdempotencyKey { get; }
    AuditRecord Record { get; }
    Correlation Correlation { get; }
}

Orleans (grain)

public interface IAuditIngestGrain : IGrainWithStringKey
{
    Task<AppendResult> Append(AppendAuditRecordCommand cmd);
    Task<BatchAppendResult> BatchAppend(BatchAppendCommand cmd);
}

All adapters call a shared write pipeline that enforces invariants and emits the same domain events.


Idempotency Semantics

  • Key scope: (tenantId, idempotencyKey) → recordId.
  • Duplicates: return success with original identifiers; never write a second time.
  • Batch: dedupe per item; partial duplicates return a mixed result with per-item statuses.
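
The duplicate rule can be sketched with an in-memory map standing in for the dedupe store (the real implementation persists to the idempotency table):

```csharp
using System;
using System.Collections.Generic;

// (tenantId, idempotencyKey) -> recordId; duplicates return the original id.
var dedupe = new Dictionary<(string Tenant, string Key), string>();

(string RecordId, bool Duplicate) Append(string tenant, string key, string candidateRecordId)
{
    if (dedupe.TryGetValue((tenant, key), out var existing))
        return (existing, true);                  // never write a second time
    dedupe[(tenant, key)] = candidateRecordId;
    return (candidateRecordId, false);
}

var first = Append("t-1", "k-1", "rec-A");
var retry = Append("t-1", "k-1", "rec-B");        // same key: original id wins
Console.WriteLine($"{first.RecordId} {retry.RecordId} {retry.Duplicate}"); // rec-A rec-A True
```

Note that the key is scoped per tenant: the same idempotency key under a different tenant is a fresh write, not a duplicate.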

Error Model (canonical)

  • 409 Conflict: tenant mismatch; immutable field mutation; legal hold violation.
  • 422 Unprocessable: schema/validation/classification failure.
  • 429 Too Many Requests: backpressure (retryable with jitter).
  • 5xx: unexpected; retried with exponential backoff and idempotency preserved.
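
For the retryable cases (429 and 5xx), a backoff helper along these lines keeps retries polite while the preserved Idempotency-Key prevents double-writes; the cap and jitter range are illustrative.

```csharp
using System;

// Exponential backoff with jitter; pair each retry with the same Idempotency-Key.
TimeSpan Backoff(int attempt, Random rng)
{
    var exponential = TimeSpan.FromSeconds(Math.Pow(2, attempt));      // 1s, 2s, 4s, ...
    var capped = exponential > TimeSpan.FromSeconds(30)
        ? TimeSpan.FromSeconds(30)                                     // illustrative cap
        : exponential;
    return capped + TimeSpan.FromMilliseconds(rng.Next(0, 1000));      // jitter
}

var rng = new Random();
for (var attempt = 0; attempt < 4; attempt++)
    Console.WriteLine(Backoff(attempt, rng));
```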

Observability (write path)

  • Metrics: append throughput, dedupe hit rate, classification ops/field, seal latency, purge counts.
  • Traces: span attributes include tenantId, action, resource.type/id, idempotencyKey.
  • Logs: structured; never contain secrets; include command name and result code.

Events (domain & integration)


Purpose

Define a clear, event-first contract across the Audit & Compliance bounded contexts. Events are the only mechanism for cross-context coordination (segment sealing, retention, legal hold, exports). Publishing is post-commit via outbox, and all consumers are idempotent. Transports are code-first: MassTransit (Service Bus) for inter-service messaging, optional signed webhooks for external sinks, and Orleans streams for intra-cluster reactions where applicable.


Event Strategy

  • Event-first + Outbox: domain events are appended to an outbox within the same transaction as the aggregate change; a relay publishes them to the bus.
  • Inbox/Dedupe: every consumer keeps a dedupe store keyed by (consumer, eventId) (or (consumer, tenantId, aggregateId, sequence)), ensuring at-least-once → exactly-once processing.
  • Per-aggregate ordering: ordering is guaranteed within a partition key (aggregateId or (tenantId, category)), not globally.
  • Schema versioning: version embedded in the envelope; additive changes preferred.
  • Security: internal bus authN/Z; webhooks signed (HMAC) with replay protection.

Event Catalog (by BC)

AuditStream BC

  • audit.record_appended — a new immutable record was appended.
  • audit.segment_seal_requested — thresholds hit; request sealing (internal integration).

RetentionPolicy BC

  • retention.policy_changed — new policy version activated.
  • retention.window_elapsed — a scope became retention-eligible.
  • retention.purged — purge completed for a scope.

LegalHold BC

  • legalhold.applied — hold in effect for scope/case.
  • legalhold.released — hold released.

Export BC

  • export.started — export accepted and queued.
  • export.chunk_ready — partial artifact available (streaming exports).
  • export.completed — export finished; manifest + signed URLs.
  • export.failed — export failed.

IntegrityLedger BC

  • integrity.segment_sealed — segment finalized with proof.
  • integrity.proof_computed — verification material available.

Canonical Event Envelope (code-first)

public sealed record DomainEventEnvelope<TPayload>(
    string EventId,                 // ULID/UUIDv7
    string Name,                    // e.g., "audit.record_appended"
    int SchemaVersion,              // e.g., 1
    string TenantId,                // partition & auth
    string AggregateType,           // e.g., "AuditStream"
    string AggregateId,             // stream or policy id
    long AggregateSequence,         // monotonic within aggregate (if applicable)
    DateTime OccurredAtUtc,         // business time
    DateTime EmittedAtUtc,          // publish time
    CorrelationMeta Correlation,    // traceId, requestId, causationId, producer
    TPayload Payload);

public sealed record CorrelationMeta(
    string TraceId,
    string RequestId,
    string? CausationId,
    string Producer);
  • Naming: domain.action_pastparticiple (dot-delimited).
  • Partition key: AggregateId (or (TenantId, Category) for streams).
  • Dedupe key: EventId (plus TenantId if desired).

Payload Shapes (selected)

audit.record_appended (v1)

public sealed record AuditRecordAppendedV1(
    string RecordId,
    string Category,                     // e.g., "identity", "billing"
    string Action,                       // e.g., "User.PasswordChanged"
    string ResourceType, string ResourceId,
    string ActorType, string ActorId,
    DateTime OccurredAtUtc,
    int ClassificationPolicyVersion);

retention.policy_changed (v1)

public sealed record RetentionPolicyChangedV1(
    string PolicyId,
    int Version,
    DateTime EffectiveFromUtc,
    IReadOnlyDictionary<string,int> WindowsDaysByCategory);

legalhold.applied / released (v1)

public sealed record LegalHoldChangedV1(
    string HoldId,
    string CaseId,
    string ScopeTenantId,
    string[] Categories,
    DateTime FromUtc,
    DateTime ToUtc,
    string Change); // "applied" | "released"

export.completed (v1)

public sealed record ExportCompletedV1(
    string JobId,
    string Purpose,
    DateTime RangeFromUtc,
    DateTime RangeToUtc,
    string RedactionLevel,
    Uri ManifestUrl,
    string IntegrityProofId);

integrity.segment_sealed (v1)

public sealed record SegmentSealedV1(
    string SegmentId,
    string TenantId,
    string Category,
    long FirstSequence,
    long LastSequence,
    string HashAlgorithm,
    string RootHash,
    Uri ProofUrl);

Publishing Flow (post-commit with outbox)

sequenceDiagram
  participant AG as Aggregate
  participant DB as DB (Tx)
  participant OB as Outbox Table
  participant RL as Outbox Relay
  participant BUS as Service Bus
  participant C1 as Consumer BC (e.g., IntegrityLedger)
  participant C2 as Consumer BC (e.g., RetentionPolicy)

  AG->>DB: Save aggregate state (commit)
  AG->>OB: Append DomainEvent (same Tx)
  Note over DB,OB: State + Outbox are atomic
  RL->>OB: Poll/Read un-published rows
  RL->>BUS: Publish events (Name, Version, Envelope, Payload)
  BUS-->>C1: Deliver (at-least-once)
  C1->>C1: Inbox dedupe (idempotent handler)
  C1->>C1: Apply reaction / emit own events
  BUS-->>C2: Deliver (at-least-once)
  C2->>C2: Inbox dedupe (idempotent handler)

Consumer Contract (idempotency & ordering)

  • Idempotency: persist (consumer, EventId). If seen, ack and skip.
  • Ordering: rely on partition keys; if you must enforce sequence, validate AggregateSequence continuity; otherwise apply commutative logic (recommended).
  • Retries: transient errors → retry with backoff/jitter; poison messages → DLQ.
  • Side effects: only after dedupe is recorded.
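
Where strict per-aggregate ordering matters, the AggregateSequence continuity check can be sketched like this (an in-memory map stands in for durable consumer state):

```csharp
using System;
using System.Collections.Generic;

// aggregateId -> last applied sequence; gaps are parked until redelivery fills them.
var lastApplied = new Dictionary<string, long>();

bool TryApply(string aggregateId, long sequence)
{
    var last = lastApplied.TryGetValue(aggregateId, out var s) ? s : 0L;
    if (sequence != last + 1)
        return false;                 // gap or replay: do not apply yet
    lastApplied[aggregateId] = sequence;
    return true;
}

Console.WriteLine(TryApply("agg-1", 1)); // True
Console.WriteLine(TryApply("agg-1", 3)); // False (sequence 2 missing)
Console.WriteLine(TryApply("agg-1", 2)); // True
```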

MassTransit configuration (sketch)

cfg.ReceiveEndpoint("integrity.segment.events", e =>
{
    e.UseMessageRetry(r => r.Exponential(5, TimeSpan.FromSeconds(1), TimeSpan.FromSeconds(30), TimeSpan.FromSeconds(2)));
    e.UseInMemoryOutbox(); // local idempotency for side-effects within consumer scope
    e.Consumer<SegmentSealedConsumer>();
});

For persistent idempotency across restarts, maintain a ConsumerOffset table keyed by (consumer, EventId) (and optionally (TenantId, AggregateId, Sequence)).
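
A minimal sketch of that persistent inbox dedupe, with an in-memory set standing in for the ConsumerOffset table:

```csharp
using System;
using System.Collections.Generic;

// (consumer, eventId) inbox; side effects run only after dedupe is recorded.
var inbox = new HashSet<(string Consumer, string EventId)>();
var applied = new List<string>();

void Handle(string consumer, string eventId, Action sideEffect)
{
    if (!inbox.Add((consumer, eventId)))
        return;                        // seen before: ack and skip
    sideEffect();
}

Handle("IntegrityLedger", "evt-1", () => applied.Add("evt-1"));
Handle("IntegrityLedger", "evt-1", () => applied.Add("evt-1")); // redelivery: skipped
Console.WriteLine(applied.Count); // 1
```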


Transport Guidance

  • Service Bus (MassTransit) — primary; topics per event name or per BC with filters. Headers carry envelope fields for routing; payload is JSON.
  • Webhooks (optional) — only for external systems; signed with HMAC; include Event-Id, Event-Timestamp; 5-minute age window, retries with backoff.
  • Orleans Streams (internal) — for intra-cluster consumers needing low latency; still publish to the bus for inter-service integration.
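
The signed-webhook scheme above might look like the following; the exact payload layout is an assumption, but it covers the pieces listed: HMAC over the event identity and body, plus the 5-minute age window for replay protection.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// HMAC over (Event-Id, Event-Timestamp, body) with a 5-minute replay window.
string Sign(string secret, string eventId, DateTime tsUtc, string body)
{
    var payload = $"{eventId}.{tsUtc:O}.{body}";
    using var hmac = new HMACSHA256(Encoding.UTF8.GetBytes(secret));
    return Convert.ToHexString(hmac.ComputeHash(Encoding.UTF8.GetBytes(payload)));
}

bool Verify(string secret, string eventId, DateTime tsUtc, string body,
            string signature, DateTime nowUtc)
{
    if (nowUtc - tsUtc > TimeSpan.FromMinutes(5)) return false;   // stale: replay risk
    var expected = Sign(secret, eventId, tsUtc, body);
    return CryptographicOperations.FixedTimeEquals(
        Encoding.UTF8.GetBytes(expected), Encoding.UTF8.GetBytes(signature));
}

var ts = new DateTime(2025, 9, 30, 12, 0, 0, DateTimeKind.Utc);
var sig = Sign("shared-secret", "evt-1", ts, "{}");
Console.WriteLine(Verify("shared-secret", "evt-1", ts, "{}", sig, ts.AddMinutes(1)));  // True
Console.WriteLine(Verify("shared-secret", "evt-1", ts, "{}", sig, ts.AddMinutes(10))); // False
```

FixedTimeEquals avoids timing side-channels when comparing signatures, which a plain string comparison would leak.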

Versioning & Evolution

  • SchemaVersion increments on breaking changes.
  • Prefer additive payloads; consumers ignore unknown fields.
  • For unavoidable breaks, publish both ...v1 and ...v2 until all consumers migrate.

Observability

  • Metrics: publish rate, DLQ count, consumer lag, dedupe hits, handler latency.
  • Tracing: propagate TraceId/RequestId; create spans for publish and handle.
  • Auditing the auditor: event publishing/consumption is itself audited in minimal form (no secrets).

Example End-to-End

  1. AppendAuditRecord commits in AuditStream → outbox writes audit.record_appended.
  2. IntegrityLedger consumes, updates segment counters; if threshold reached, emits integrity.segment_seal_requested and then integrity.segment_sealed.
  3. RetentionPolicy updates eligibility indexes; at window, emits retention.window_elapsed and eventually retention.purged.
  4. Export listens for queries/timeline requests, runs job, emits export.completed with manifest + proof.

Repositories & Read Models


Purpose

Define the write model (append-only) and the read models that power fast investigative queries:

  • Write: append-only log partitioned by tenantId and category (or resource.type).
  • Reads:
    • AuditTimeline — chronological, filterable stream
    • AccessDecisionsView — focused allow/deny audit
    • ChangeDeltasView — before/after changes
    • IntegrityProofsView — segment/proof lookup and verification

All projections are idempotent, tenant-scoped, and updated asynchronously from domain events.


Architecture (high level)

flowchart LR
  IN[Inbox+Dedupe] --> WR[(Append-Only Store)]
  WR -->|events| PJ[Projectors]
  PJ --> TL[AuditTimeline]
  PJ --> AD[AccessDecisionsView]
  PJ --> CD[ChangeDeltasView]
  PJ --> IP[IntegrityProofsView]
  WR --> SG[Integrity Segments/Map]
  SG --> IP

Write Model (append-only)

Table: audit.Records (Azure SQL, JSON for flexible fields)

CREATE TABLE audit.Records (
  RecordId            VARCHAR(26)  NOT NULL,     -- ULID/UUIDv7 (string)
  TenantId            VARCHAR(64)  NOT NULL,
  Category            VARCHAR(64)  NOT NULL,     -- e.g., identity, billing
  OccurredAtUtc       DATETIME2(3) NOT NULL,
  ActorType           VARCHAR(16)  NOT NULL,     -- user|service|job
  ActorId             VARCHAR(128) NOT NULL,
  Action              VARCHAR(128) NOT NULL,     -- e.g., User.PasswordChanged
  ResourceType        VARCHAR(64)  NOT NULL,
  ResourceId          VARCHAR(128) NOT NULL,
  DecisionOutcome     VARCHAR(8)   NULL,         -- allow|deny|na
  DecisionReason      VARCHAR(256) NULL,
  ContextJson         NVARCHAR(MAX) NULL,        -- PII-safe (post-redaction)
  BeforeJson          NVARCHAR(MAX) NULL,        -- diff-safe (post-redaction)
  AfterJson           NVARCHAR(MAX) NULL,        -- diff-safe (post-redaction)
  ClassesJson         NVARCHAR(512) NULL,        -- ["Personal","Internal"]
  CorrelationJson     NVARCHAR(512) NOT NULL,    -- traceId, requestId, producer
  PolicyVersion       INT          NOT NULL,
  SegmentId           VARCHAR(40)  NULL,
  IntegrityHash       VARCHAR(128) NULL,
  InsertedAtUtc       DATETIME2(3) NOT NULL CONSTRAINT DF_audit_Records_InsertedAt DEFAULT SYSUTCDATETIME(),
  CONSTRAINT PK_audit_Records PRIMARY KEY CLUSTERED (TenantId, Category, OccurredAtUtc, RecordId)
);

Hot indexes (nonclustered):

-- Resource drill-down (timeline for a resource)
CREATE INDEX IX_audit_Records_Resource
  ON audit.Records (TenantId, ResourceType, ResourceId, OccurredAtUtc)
  INCLUDE (Action, ActorId, DecisionOutcome);

-- Actor drill-down
CREATE INDEX IX_audit_Records_Actor
  ON audit.Records (TenantId, ActorId, OccurredAtUtc)
  INCLUDE (Action, ResourceType, ResourceId, DecisionOutcome);

-- Action/verb timelines
CREATE INDEX IX_audit_Records_Action
  ON audit.Records (TenantId, Action, OccurredAtUtc)
  INCLUDE (ResourceType, ResourceId, ActorId, DecisionOutcome);

Dedupe (idempotency) store:

CREATE TABLE audit.Idempotency (
  TenantId        VARCHAR(64) NOT NULL,
  IdempotencyKey  VARCHAR(256) NOT NULL,
  RecordId        VARCHAR(26) NOT NULL,
  CreatedAtUtc    DATETIME2(3) NOT NULL DEFAULT SYSUTCDATETIME(),
  CONSTRAINT PK_audit_Idempotency PRIMARY KEY (TenantId, IdempotencyKey)
);

Integrity segment map (for sealing):

CREATE TABLE audit.Segments (
  SegmentId       VARCHAR(40)  NOT NULL,
  TenantId        VARCHAR(64)  NOT NULL,
  Category        VARCHAR(64)  NOT NULL,
  FirstOccurredAt DATETIME2(3) NOT NULL,
  LastOccurredAt  DATETIME2(3) NOT NULL,
  FirstRecordId   VARCHAR(26)  NOT NULL,
  LastRecordId    VARCHAR(26)  NOT NULL,
  RootHash        VARCHAR(128) NULL,
  HashAlgorithm   VARCHAR(16)  NULL,   -- e.g., sha256
  SealedAtUtc     DATETIME2(3) NULL,
  CONSTRAINT PK_audit_Segments PRIMARY KEY (TenantId, Category, SegmentId)
);

Read Models & Projections

AuditTimeline

  • Purpose: fast, paged, time-ordered audit browsing with common filters.
  • Source: audit.record_appended events.
  • Schema: largely mirrors audit.Records, but stored separately for denormalized filters & pagination tokens.
  • Suggested partition: (TenantId, OccurredAtUtc); support from/to, actor, resource, action, category.
CREATE TABLE audit_rm.Timeline (
  TenantId        VARCHAR(64)  NOT NULL,
  OccurredAtUtc   DATETIME2(3) NOT NULL,
  RecordId        VARCHAR(26)  NOT NULL,
  Category        VARCHAR(64)  NOT NULL,
  Action          VARCHAR(128) NOT NULL,
  ActorId         VARCHAR(128) NOT NULL,
  ResourceType    VARCHAR(64)  NOT NULL,
  ResourceId      VARCHAR(128) NOT NULL,
  DecisionOutcome VARCHAR(8)   NULL,
  SummaryJson     NVARCHAR(1024) NULL, -- concise payload for UI cards
  CONSTRAINT PK_audit_rm_Timeline PRIMARY KEY (TenantId, OccurredAtUtc, RecordId)
);

AccessDecisionsView

  • Purpose: focused allow/deny trails for security reviews and access investigations.
  • Source: audit.record_appended with DecisionOutcome IS NOT NULL.
  • Partition: (TenantId, OccurredAtUtc); secondary on (TenantId, DecisionOutcome).
CREATE TABLE audit_rm.AccessDecisions (
  TenantId        VARCHAR(64)  NOT NULL,
  OccurredAtUtc   DATETIME2(3) NOT NULL,
  RecordId        VARCHAR(26)  NOT NULL,
  ActorId         VARCHAR(128) NOT NULL,
  ResourceType    VARCHAR(64)  NOT NULL,
  ResourceId      VARCHAR(128) NOT NULL,
  Action          VARCHAR(128) NOT NULL,
  DecisionOutcome VARCHAR(8)   NOT NULL,
  Reason          VARCHAR(256) NULL,
  CONSTRAINT PK_audit_rm_Access PRIMARY KEY (TenantId, OccurredAtUtc, RecordId)
);

ChangeDeltasView

  • Purpose: “what changed?” — quick access to before/after deltas for resources.
  • Source: audit.record_appended where BeforeJson or AfterJson is present.
  • Partition: (TenantId, ResourceType, ResourceId, OccurredAtUtc).
CREATE TABLE audit_rm.ChangeDeltas (
  TenantId        VARCHAR(64)  NOT NULL,
  ResourceType    VARCHAR(64)  NOT NULL,
  ResourceId      VARCHAR(128) NOT NULL,
  OccurredAtUtc   DATETIME2(3) NOT NULL,
  RecordId        VARCHAR(26)  NOT NULL,
  BeforeJson      NVARCHAR(MAX) NULL,
  AfterJson       NVARCHAR(MAX) NULL,
  Action          VARCHAR(128) NOT NULL,
  ActorId         VARCHAR(128) NOT NULL,
  CONSTRAINT PK_audit_rm_Deltas PRIMARY KEY (TenantId, ResourceType, ResourceId, OccurredAtUtc, RecordId)
);

IntegrityProofsView

  • Purpose: locate proof material for segments and exports; drive verification UI/API.
  • Source: integrity.segment_sealed, export.completed.
  • Partition: (TenantId, Category); lookup by SegmentId or JobId.
CREATE TABLE audit_rm.IntegrityProofs (
  TenantId      VARCHAR(64)  NOT NULL,
  Category      VARCHAR(64)  NOT NULL,
  SegmentId     VARCHAR(40)  NULL,
  JobId         VARCHAR(40)  NULL,
  RootHash      VARCHAR(128) NOT NULL,
  HashAlgorithm VARCHAR(16)  NOT NULL,
  ProofUrl      NVARCHAR(512) NULL,
  SealedAtUtc   DATETIME2(3) NOT NULL,
  CONSTRAINT PK_audit_rm_Proofs PRIMARY KEY (TenantId, Category, SealedAtUtc, RootHash)
);

Projection Rules

  • Idempotent handlers using consumer offsets; UPSERT by (TenantId, ...).
  • Eventual consistency with small lag; publish X-Projected-At headers in query responses if needed.
  • Redaction-preserving: project post-redaction fields; never attempt to reverse transformations.
  • Backfill support: projectors can rebuild from audit.Records (source of truth) if event history is pruned.
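
The idempotent-UPSERT rule can be sketched with a map keyed the same way as the read-model primary keys; a redelivered event simply overwrites the row with identical data.

```csharp
using System;
using System.Collections.Generic;

// (TenantId, RecordId) -> projected row; UPSERT makes event redelivery harmless.
var timeline = new Dictionary<(string Tenant, string RecordId), string>();

void ProjectAppended(string tenant, string recordId, string summary) =>
    timeline[(tenant, recordId)] = summary;    // insert or overwrite

ProjectAppended("t-1", "rec-A", "User.PasswordChanged");
ProjectAppended("t-1", "rec-A", "User.PasswordChanged"); // redelivery: no net effect
Console.WriteLine(timeline.Count); // 1
```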

Hot Path & Caching Guidance

  • Query patterns

    • Timeline (time range, paging): primary clustered key (TenantId, OccurredAtUtc) → seek + range scan.
    • Resource drill-down: index (TenantId, ResourceType, ResourceId, OccurredAtUtc).
    • Actor investigations: index (TenantId, ActorId, OccurredAtUtc).
    • Decision-only: index on (TenantId, DecisionOutcome, OccurredAtUtc) in AccessDecisionsView.
  • Caching

    • Edge cache (short TTL, e.g., 5–15s) for repeated dashboard queries.
    • Redis for hot resource/actor timelines (keyed by hash of filter, page token; TTL 30–60s).
    • ETag/If-None-Match on timeline pages (tokenized OccurredAtUtc + RecordId).
  • Pagination

    • Prefer seek pagination via (OccurredAtUtc, RecordId) cursor over OFFSET/LIMIT.
    • Max page size 200–500 rows; enforce time-bounded queries to protect hot partitions.
  • Partitioning

    • Monthly or quarterly partitions on OccurredAtUtc for very large tenants.
    • Move cold partitions to cheaper tier; keep last N partitions in premium tier.
  • Compression & storage

    • Enable row/page compression; consider columnstore for audit_rm tables with heavy scans.
    • Avoid huge JSON blobs; keep Before/After diffs concise and redacted.
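
The seek-pagination cursor above can be sketched as (OccurredAtUtc, RecordId) round-tripped through Base64; the separator and encoding are illustrative choices for an otherwise opaque token.

```csharp
using System;
using System.Globalization;
using System.Text;

// Opaque seek cursor: encode/decode (OccurredAtUtc, RecordId).
string EncodeCursor(DateTime occurredAtUtc, string recordId) =>
    Convert.ToBase64String(Encoding.UTF8.GetBytes($"{occurredAtUtc:O}|{recordId}"));

(DateTime OccurredAtUtc, string RecordId) DecodeCursor(string cursor)
{
    var parts = Encoding.UTF8.GetString(Convert.FromBase64String(cursor)).Split('|', 2);
    return (DateTime.Parse(parts[0], null, DateTimeStyles.RoundtripKind), parts[1]);
}

var ts = new DateTime(2025, 9, 30, 12, 34, 56, DateTimeKind.Utc);
var cursor = EncodeCursor(ts, "01J9Z7B19X");
var decoded = DecodeCursor(cursor);
Console.WriteLine(decoded.RecordId);            // 01J9Z7B19X
Console.WriteLine(decoded.OccurredAtUtc == ts); // True
```

The server resumes with `WHERE (OccurredAtUtc, RecordId) > (cursor.OccurredAtUtc, cursor.RecordId)` semantics, which stays fast regardless of page depth, unlike OFFSET/LIMIT.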

Repository Interfaces (sketch)

public interface IAuditRecordRepository
{
    Task AppendAsync(AuditRecord record, CancellationToken ct);
    Task<bool> TryDeduplicateAsync(string tenantId, string idempotencyKey, string recordId, CancellationToken ct);
    IAsyncEnumerable<AuditRecord> ReadRangeAsync(string tenantId, DateTime fromUtc, DateTime toUtc, AuditFilter filter, PageCursor cursor, int pageSize, CancellationToken ct);
}

public interface IAuditTimelineReader
{
    IAsyncEnumerable<TimelineRow> QueryAsync(TimelineQuery q, PageCursor cursor, int pageSize, CancellationToken ct);
}

public interface IIntegrityProofsReader
{
    Task<IntegrityProof?> GetSegmentProofAsync(string tenantId, string category, string segmentId, CancellationToken ct);
    Task<IReadOnlyList<IntegrityProof>> ListRecentAsync(string tenantId, string category, int take, CancellationToken ct);
}

Security & Multitenancy

  • Row-level authorization by TenantId at repository boundary; never cross-tenant joins.
  • Purpose-limited reads: decision and delta views may require elevated scopes; all access is itself audited.
  • No PII in projections beyond already-transformed data; classification policy enforced on the write model before projection.

Operational Notes

  • SLA knobs: keep p95 projector lag under 2s for the timeline; the proof view can tolerate higher lag.
  • Observability: measure projector latency, lag, UPSERT counts, cache hit ratio, top filters.
  • Maintenance: rolling segment sealing compacts integrity metadata; retention purges operate against write model and cascade to views via events.

Contracts: Ingest APIs & Streams


Purpose

Provide transport-agnostic ingest contracts for writing audit data into the AuditStream BC:

  • HTTP (single & batch) with idempotency, tenant scoping, and classification hints.
  • Backfill endpoint for trusted, high-volume historical imports.
  • Internal streams (audit.records.v1) for conformist downstreams and inter-service ingestion.
  • Consistent envelope, headers/metadata, and error model across transports.

HTTP Ingest (online)

Base URL: /audit/records

Headers (required)

  • Authorization: Bearer <token> — scope: audit.ingest
  • Tenant-Id: <string> — tenant isolation key (alias: tenantId)
  • Idempotency-Key: <string> — required for online writes
  • Trace-Id, Request-Id — correlation (gateway may inject)

Single record

  • POST /audit/records
  • Content-Type: application/json
  • Body: canonical AuditRecord (as defined in the model cycle); classification hints optional.
{
  "record": { /* AuditRecord … */ },
  "classificationHints": {
    "before.fields.email": "PERSONAL",
    "after.fields.token": "CREDENTIAL"
  }
}

Responses

  • 201 Created + body { "id": "<RecordId>", "status": "created" }
  • 200 OK (idempotent duplicate) + { "id": "<RecordId>", "status": "duplicate" }
  • 4xx/5xx with problem+json

Batch

  • POST /audit/records:batch
  • Body:
{
  "items": [
    {
      "idempotencyKey": "…",
      "record": { /* AuditRecord */ },
      "classificationHints": { /* optional */ }
    }
  ]
}
  • Responses:
    • 202 Accepted with per-item result array (success/duplicate/error, message, recordId)
    • Use 207 Multi-Status semantics inside payload; HTTP code remains 202 for mixed outcomes

Rate limits & size

  • Max body: 5–10 MB (configurable)
  • Max items per batch: 500 (configurable)
  • Recommend gzip: Content-Encoding: gzip

Error model (canonical)

  • 409 Conflict — tenant mismatch, legal hold/retention lock violation
  • 422 Unprocessable Entity — schema/validation/classification failure
  • 429 Too Many Requests — backpressure; retry with jitter
  • 5xx — transient; safe to retry with same Idempotency-Key

cURL (single)

curl -X POST https://api.example.com/audit/records \
 -H "Authorization: Bearer $TOKEN" \
 -H "Tenant-Id: t-9c8f1" \
 -H "Idempotency-Key: iam:pwd-change:u-12345:2025-09-30T12:34:56Z" \
 -H "Trace-Id: tr-abc" -H "Request-Id: rq-xyz" \
 -H "Content-Type: application/json" \
 -d @record.json

Backfill Ingest (trusted)

For historical imports under elevated controls.

Endpoint

  • POST /audit/records:backfill
  • Scopes: audit.backfill
  • Headers:
    • Tenant-Id: <string>
    • Optional Idempotency-Key per item (recommended when deduping across replays)

Body formats

  • application/x-ndjson (preferred for scale) — one canonical AuditRecord envelope per line
  • application/json with { "items": [...] }
  • Support Content-Encoding: gzip

Rules

  • Enforces all write invariants (append-only, tenant isolation, legal hold, classification).
  • OccurredAtUtc may precede the current time arbitrarily (the online clock-skew window does not apply).
  • Throughput quotas separate from online ingest; processed async → 202 Accepted + job id.

Response

{ "jobId": "bf-20250930-00123", "accepted": 100000, "rejected": 0 }

Telemetry

  • Emits audit.record_appended for each stored record (same as online), preserving event-first posture.

Classification Hints (optional)

Hints help the server classify fields; the server remains authoritative.

Schema (inline or separate object)

{
  "classificationHints": {
    "context.userAgent": "INTERNAL",
    "before.fields.email": "PERSONAL",
    "after.fields.token": "CREDENTIAL"
  }
}

Pathing

  • Dot-notation aligned to the canonical JSON (context.*, before.fields.<k>, after.fields.<k>)

Streams: audit.records.v1 (internal bus)

For conformist producers/consumers using MassTransit/Service Bus.

Topic: audit.records.v1

Message (code-first contract)

public interface IngestAuditRecord // message name: audit.records.v1
{
    string TenantId { get; }
    string IdempotencyKey { get; } // required for online-like producers
    AuditRecord Record { get; }
    IDictionary<string,string>? ClassificationHints { get; }
    CorrelationMeta Correlation { get; }
}

Headers (required)

  • Tenant-Id, Idempotency-Key, Trace-Id, Request-Id, Producer

Semantics

  • Consumers in the AuditStream service apply the same write pipeline (inbox+dedupe → classify/redact → append).
  • Messages missing TenantId are rejected (moved to DLQ with reason).
  • Partitioning: route by TenantId (and optionally Category) to preserve per-tenant ordering.

Retries & idempotency

  • At-least-once delivery; inbox dedupe keyed by (TenantId,IdempotencyKey) → exactly-once write.

Other Transports (ingest parity)

  • gRPC (code-first)

    • Service: IAuditIngestService.AppendAsync(record, tenantId, idempotencyKey, classificationHints, correlation)
    • Metadata carries Tenant-Id, Trace-Id, Request-Id; same invariants and responses.
  • Orleans

    • Grain: IAuditIngestGrain.Append(AppendAuditRecordCommand)
    • For in-cluster high-trust producers; still passes through the shared write pipeline.

Security

  • OAuth2 bearer tokens with audience audit and scopes: audit.ingest, audit.backfill.
  • mTLS between internal services.
  • Mandatory tenant scoping at the gateway; no cross-tenant writes.
  • All ingest calls themselves produce minimal audit entries (who/what/purpose).

Diagram

sequenceDiagram
  participant P as Producer
  participant API as HTTP/gRPC/Bus/Orleans
  participant IN as Inbox + Dedupe
  participant CL as ClassificationPolicy
  participant ST as Append-only Store
  participant BUS as Internal Topic (audit.records.v1)

  P->>API: POST /audit/records (+headers, hints)
  API->>IN: Normalize + Tenant + Idempotency + Correlation
  IN-->>API: Duplicate? yes -> return 200/duplicate
  IN->>CL: Resolve policy + compile plan
  CL-->>IN: Policy(version), plan
  IN->>ST: Append (post-redaction)
  ST-->>BUS: Emit event audit.record_appended (outbox → bus)
  API-->>P: 201/Created (record id)

Validation Summary (applies to all ingest)

  • Reject if missing Tenant-Id or mismatched tenant.
  • Require Idempotency-Key (except backfill jobs, where recommended).
  • Enforce classification & redaction before persistence.
  • Block writes that violate LegalHold/Retention locks.
  • Preserve correlation (Trace-Id, Request-Id, Producer) end-to-end.

Contracts: Query APIs


Purpose

Expose tenant-scoped, purpose-limited HTTP query surfaces for investigative and compliance use-cases:

  • GET /audit/timeline — time-ordered browsing with rich filters.
  • GET /audit/decision-log — allow/deny focus.
  • POST /audit/exports + GET /audit/exports/{id} — controlled eDiscovery exports with signed artifacts.
  • (Added for completeness) GET /audit/proofs — lookup integrity segment/export proofs.

All endpoints enforce tenant isolation and purpose-based access (scopes + justification). Responses default to redacted data; elevated views require auditor scopes.


Security & Headers (applies to all)

Required headers

  • Authorization: Bearer <token> — scopes vary per endpoint (see below)
  • Tenant-Id: <string>
  • Trace-Id, Request-Id (gateway may inject)

Purpose & redaction controls

  • X-Purpose: <string> — short justification (“security-investigation:INC-12345”)
  • X-Redaction-Level: default|elevated — optional; elevated requires auditor scope
  • Responses carry ETag (page content hash) and optional X-Projected-At (projection timestamp)

Scopes

  • audit.read.timeline
  • audit.read.decisions
  • audit.export.start
  • audit.export.read
  • audit.read.proofs
  • audit.read.elevated (to allow X-Redaction-Level: elevated)

GET /audit/timeline

Description: Paged, time-ordered audit records for a tenant.

Query parameters

  • from (ISO-8601, required) — inclusive
  • to (ISO-8601, required) — exclusive; enforce max range (e.g., 31 days)
  • Filters (all optional; ANDed unless noted):
    • actor — exact actor.id or prefix
    • resource — type:id (e.g., User:u-12345); resourceType and resourceId also supported
    • action — exact or prefix (e.g., User.)
    • category — e.g., identity, billing
    • class — DataClass value (PERSONAL, PHI, …)
    • decision — allow|deny|na
  • Pagination:
    • cursor — opaque seek token from previous page
    • limit — default 100, max 500

Response 200 OK

{
  "items": [
    {
      "recordId": "01J9Z7B19X4P8ZQ7H6M4V6GQWY",
      "occurredAtUtc": "2025-09-30T12:34:56Z",
      "category": "identity",
      "action": "User.PasswordChanged",
      "actor": { "type": "user", "id": "u-12345" },
      "resource": { "type": "User", "id": "u-12345" },
      "decision": { "outcome": "allow", "reason": "MFA_OK" },
      "summary": { "policyVersion": 7 }  // compact, redacted view
    }
  ],
  "nextCursor": "eyJ0IjoiMjAyNS0wOS0zMFQxMjozNDo1NloiLCJyIjoiMDFKOS...\"",
  "projectedAtUtc": "2025-09-30T12:35:02Z"
}

Errors

  • 400 invalid range/filters
  • 403 missing scopes or elevated redaction without audit.read.elevated
  • 422 malformed params

GET /audit/decision-log

Description: Access decision-centric view (allow/deny), optimized for security reviews.

Query parameters

  • Same base as /audit/timeline plus:

  • outcome — allow|deny (required unless action is specified)

  • actor, resource, action for narrowing
  • Pagination via cursor / limit as above

Response 200 OK

{
  "items": [
    {
      "occurredAtUtc": "2025-09-30T09:02:15Z",
      "recordId": "01J9Z6...",
      "actorId": "u-987",
      "resource": { "type": "Document", "id": "doc-55" },
      "action": "Document.Downloaded",
      "outcome": "deny",
      "reason": "Policy:IP_Allowlist"
    }
  ],
  "nextCursor": null
}

Errors

  • 400 missing outcome or invalid values
  • 403 insufficient scope

Scope required: audit.read.decisions (and audit.read.elevated if requesting elevated redaction)


POST /audit/exports

Description: Starts a controlled ExportJob for eDiscovery/audit.

Scope required: audit.export.start (and audit.read.elevated if requesting elevated redaction)

Body

{
  "purpose": "ediscovery:case-12345",
  "range": { "from": "2025-09-01T00:00:00Z", "to": "2025-09-30T23:59:59Z" },
  "filters": {
    "actor": "u-12345",
    "resource": "User:u-12345",
    "action": "User.",
    "category": "identity",
    "decision": "allow"
  },
  "redactionLevel": "default|elevated",
  "format": "jsonl|csv",
  "delivery": { "kind": "signed-url" }  // future: bucket/prefix
}

Response

  • 202 Accepted
{
  "jobId": "exp-01J9Z7E8AVB2M7",
  "state": "queued",
  "estimatedReadyAtUtc": "2025-09-30T13:05:00Z"
}

Errors

  • 400 invalid range/filters/format
  • 403 insufficient scope or elevated redaction without auditor scope
  • 409 active LegalHold blocks request if scope violates policy

GET /audit/exports/{id}

Description: Polls export status and returns signed artifact manifest on completion.

Scope required: audit.export.read

Response 200 OK

{
  "jobId": "exp-01J9Z7E8AVB2M7",
  "state": "completed", // queued|running|completed|failed|canceled
  "count": 12840,
  "format": "jsonl",
  "redactionLevel": "default",
  "manifest": {
    "url": "https://signed.example.com/exp-.../manifest.json?sig=...",
    "expiresAtUtc": "2025-09-30T15:10:00Z",
    "integrityProofId": "proof-01J9Z7KQ...",
    "hash": "sha256:abcd..."
  }
}

Errors

  • 404 unknown job or tenant mismatch
  • 409 job not in a terminal state for download
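The start-then-poll flow across POST /audit/exports and GET /audit/exports/{id} can be sketched as a small client loop. This is an illustrative Python sketch, not a prescribed client; the `post`/`get` callables are hypothetical injected transports (e.g., wrappers over an HTTP library) so the control flow stays the focus:

```python
import time

def run_export(post, get, body, poll_interval_s=2.0, timeout_s=600.0):
    """Submit an ExportJob and poll until it reaches a terminal state.

    `post`/`get` are injected callables returning (status_code, json_dict),
    keeping the sketch transport-agnostic.
    """
    status, job = post("/audit/exports", body)
    if status != 202:
        raise RuntimeError(f"export start rejected: {status}")
    job_id = job["jobId"]

    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status, state = get(f"/audit/exports/{job_id}")
        if status == 200 and state["state"] in ("completed", "failed", "canceled"):
            return state                      # terminal; manifest present only when completed
        time.sleep(poll_interval_s)           # a production client would add jitter/backoff
    raise TimeoutError(f"export {job_id} not terminal after {timeout_s}s")
```

A caller would then download via the signed manifest URL only when `state == "completed"`.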

GET /audit/proofs

Description: Lookup integrity proofs for segments or exports.

Scope required: audit.read.proofs

Query parameters

  • category — e.g., identity (required for segment lookups)
  • segmentId — to fetch a specific segment’s proof
  • jobId — to fetch an export’s proof
  • take — number of recent proofs (default 20, max 200) when segmentId/jobId not provided

Responses

  • Single proof:
{
  "type": "segment",
  "tenantId": "t-9c8f1",
  "category": "identity",
  "segmentId": "seg-2025-09-30-12",
  "sealedAtUtc": "2025-09-30T12:40:00Z",
  "hashAlgorithm": "sha256",
  "rootHash": "3f88...",
  "proofUrl": "https://signed.example.com/proofs/seg-...json?sig=..."
}
  • Recent list:
{ "items": [ { "type":"segment", "segmentId":"seg-...", "sealedAtUtc":"..." }, ... ] }

Errors

  • 400 missing category for segment queries
  • 404 unknown proof

Pagination & Caching

  • Seek pagination: use cursor (opaque token including OccurredAtUtc + RecordId).
  • Cache: allow Cache-Control: private, max-age=15 for dashboard queries; support If-None-Match with ETag.
  • Rate limits: per-tenant and per-token quotas; return 429 + Retry-After.
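The opaque seek cursor (OccurredAtUtc + RecordId) can be realized, for example, as base64url-encoded JSON. A minimal sketch; the short field names `o`/`r` are illustrative, not part of the contract:

```python
import base64
import json

def encode_cursor(occurred_at_utc: str, record_id: str) -> str:
    """Pack the seek position into an opaque, URL-safe token."""
    raw = json.dumps({"o": occurred_at_utc, "r": record_id}, separators=(",", ":"))
    return base64.urlsafe_b64encode(raw.encode()).decode().rstrip("=")

def decode_cursor(token: str) -> tuple[str, str]:
    """Unpack a token back into the (OccurredAtUtc, RecordId) seek pair."""
    pad = "=" * (-len(token) % 4)
    data = json.loads(base64.urlsafe_b64decode(token + pad))
    return data["o"], data["r"]
```

Because the token embeds the last-seen sort key, the server resumes with a range seek (`WHERE (OccurredAtUtc, RecordId) > (@o, @r)`) instead of counting offsets.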

Error Model (Problem Details)

All error responses use RFC 7807:

{
  "type": "https://docs.example.com/errors/validation",
  "title": "One or more validation errors occurred.",
  "status": 422,
  "traceId": "tr-abc",
  "errors": { "from": ["Must be before 'to'"], "limit": ["Max 500"] }
}

Examples

Timeline (default redaction)

curl -G https://api.example.com/audit/timeline \
 -H "Authorization: Bearer $TOKEN" \
 -H "Tenant-Id: t-9c8f1" \
 --data-urlencode "from=2025-09-01T00:00:00Z" \
 --data-urlencode "to=2025-09-30T23:59:59Z" \
 --data-urlencode "actor=u-12345" \
 --data-urlencode "limit=200"

Start export (elevated)

curl -X POST https://api.example.com/audit/exports \
 -H "Authorization: Bearer $AUDITOR" \
 -H "Tenant-Id: t-9c8f1" \
 -H "X-Purpose: ediscovery:case-12345" \
 -H "X-Redaction-Level: elevated" \
 -H "Content-Type: application/json" \
 -d @export.json

Fetch proof

curl -G https://api.example.com/audit/proofs \
 -H "Authorization: Bearer $TOKEN" \
 -H "Tenant-Id: t-9c8f1" \
 --data-urlencode "category=identity" \
 --data-urlencode "segmentId=seg-2025-09-30-12"

Security & Tenancy Model


Purpose

Guarantee hard tenant isolation, least-privilege access, and confidentiality across all transports (HTTP, gRPC code-first, Service Bus/MassTransit, Orleans). Enforce no cross-tenant queries except tightly-controlled, justified, audited platform-admin flows. Minimize sensitive data at the source (classification/redaction), and protect exports with integrity and signature guarantees.


Threat Model & Goals (concise)

  • Isolation: A tenant’s data is never visible to another tenant by design, not convention.
  • Defense-in-depth: AuthN/Z at gateway, service, repository, and database (RLS) layers.
  • Transport security: mTLS service-to-service; OAuth2/JWT for callers; key rotation.
  • Data minimization: No secrets in events/logs; classification & redaction on write.
  • Controlled exfiltration: Exports are purpose-limited, signed, and time-boxed.
  • Audit of the auditor: All elevated reads/exports are themselves audited.

Identity & Authentication

  • Human callers: OAuth2/OIDC bearer tokens (short-lived), scopes tied to endpoints.
  • Service callers: client-credentials tokens with narrow scopes; per-service audiences.
  • Intra-cluster: mTLS with SPIFFE/SVID or mesh-issued certs; authorize by SAN/service identity.

Required claims / headers

  • tenant_id (claim) ⇄ Tenant-Id (header) — must match boundary routing.
  • scope — action-specific (e.g., audit.read.timeline, audit.export.start).
  • sub / client_id — principal identity.
  • Correlation: Trace-Id, Request-Id, Producer (header/metadata).

Authorization (scopes & purpose-limiting)

| Capability | Scope(s) | Notes |
| --- | --- | --- |
| Timeline read (redacted) | audit.read.timeline | Default redaction level. |
| Decision log | audit.read.decisions | Allow/deny focus. |
| Elevated read | audit.read.elevated | Requires X-Purpose justification; audited. |
| Start export | audit.export.start | Also requires audit.read.elevated if elevated redaction requested. |
| Export status/download | audit.export.read | Returns signed manifest/URLs. |
| Proof lookup | audit.read.proofs | Integrity verification. |
| Policy admin | audit.admin.policy | Classification/retention/holds. |
| Platform-admin cross-tenant (break-glass) | platform.admin.audit | Dual-control + time-boxed + justification. |

Purpose header

  • X-Purpose: <short string> (e.g., security-investigation:INC-12345) is required for elevated reads and admin operations.

Tenancy Enforcement (boundary → repository → DB)

  • At gateway: route requests by Tenant-Id; reject if header missing or mismatched with token tenant_id.
  • At service boundary: all commands/queries must carry a tenant; no tenant-less APIs.
  • At repository: every query/mutation requires tenantId; repositories disallow cross-tenant filters/joins.
  • At database: Row-Level Security (RLS) ensures server-side isolation.

RLS pattern (SQL Server/Azure SQL)

-- Set session context per connection
EXEC sp_set_session_context @key=N'TenantId', @value=@tenantId;

-- Predicate
CREATE SCHEMA security;
GO
CREATE FUNCTION security.fn_TenantPredicate(@TenantId sysname)
RETURNS TABLE WITH SCHEMABINDING
AS RETURN SELECT 1 AS fn_access
WHERE @TenantId = CAST(SESSION_CONTEXT(N'TenantId') AS sysname);
GO

-- Apply policy to tables
CREATE SECURITY POLICY security.TenantPolicy
ADD FILTER PREDICATE security.fn_TenantPredicate(TenantId) ON audit.Records,
ADD FILTER PREDICATE security.fn_TenantPredicate(TenantId) ON audit_rm.Timeline,
ADD FILTER PREDICATE security.fn_TenantPredicate(TenantId) ON audit_rm.AccessDecisions,
ADD FILTER PREDICATE security.fn_TenantPredicate(TenantId) ON audit_rm.ChangeDeltas
WITH (STATE = ON);

Application code must set SESSION_CONTEXT at connection open and reuse pooled connections per tenant to avoid leakage.


Transport Security

  • mTLS inside the cluster (gateway ↔ services, services ↔ services). Certs auto-rotated (short TTL).
  • HTTP: TLS 1.2+, HSTS.
  • gRPC (code-first): TLS + auth metadata; validate SAN/service identity.
  • Service Bus/MassTransit: SAS/AAD with per-queue/topic credentials; message-level authZ by headers (tenant, scopes).
  • Orleans: silo-to-silo TLS; grain calls restricted to trusted identities.

Data Minimization & Event Hygiene

  • On write: classification applies; tokens/passwords/keys → DROP; emails/phones → HASH with per-tenant salt; PHI → TOKENIZE.
  • Domain events: include only investigative fields (no raw PII/secrets). Example:
    • recordId, action, resource.type/id, actor.id, occurredAtUtc, policyVersion.
    • ❌ raw before/after contents for sensitive fields.
  • Logs: structured; never log secrets or full payloads. Use event IDs, hashes, counts.
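The on-write hygiene rules (DROP credentials, HASH personal data with a per-tenant salt, TOKENIZE PHI) can be sketched as a classification-driven filter. This is an illustrative Python sketch under stated assumptions: the `POLICY` map is a hypothetical stand-in for the real ClassificationPolicy, and tokenization is shown as a deterministic HMAC surrogate where a production system would use a vault-backed tokenizer:

```python
import hashlib
import hmac

# Hypothetical field -> classification map; the real source is ClassificationPolicy.
POLICY = {"password": "CREDENTIAL", "email": "PERSONAL", "diagnosis": "PHI"}

def redact(payload: dict, tenant_salt: bytes, token_key: bytes) -> dict:
    out = {}
    for field, value in payload.items():
        cls = POLICY.get(field, "NONE")
        if cls == "CREDENTIAL":
            continue  # DROP: credentials are never stored
        if cls == "PERSONAL":
            # HASH with per-tenant salt: joinable within a tenant, not across tenants
            out[field] = "sha256:" + hashlib.sha256(tenant_salt + value.encode()).hexdigest()
        elif cls == "PHI":
            # TOKENIZE: deterministic surrogate (vault-backed in practice)
            out[field] = "tok_" + hmac.new(token_key, value.encode(), "sha256").hexdigest()[:16]
        else:
            out[field] = value
    return out
```

Because hashing is salted per tenant, the same email yields different hashes in different tenants, which blocks cross-tenant correlation by design.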

Exports (controlled eDiscovery)

  • Start requires scope audit.export.start (+ audit.read.elevated if elevated view).
  • Manifests include record counts, hash lists, and integrity proof IDs (Merkle root/segment hash).
  • Delivery: signed URLs (short expiry) or customer-managed bucket integration.
  • Signing: manifest and chunk checksums signed with per-tenant signing keys in HSM/Key Vault (kid included).
  • Watermarking (optional): embed export context (tenant, purpose, jobId) into manifest for traceability.
  • Revocation: immediate by invalidating SAS/presigned URLs; manifests remain for audit trail.

Platform-Admin Exceptions (break-glass)

  • Allowed only with:
    • Scope platform.admin.audit + dual approval (two distinct admins).
    • Time-boxed tokens (e.g., 15 minutes).
    • Mandatory X-Purpose; reason captured.
  • Flow produces additional meta-audit entries:
    • who approved, timestamps, filters used, counts returned, artifacts accessed.

Key Management & Secrets

  • Per-tenant salt (hashing) and tokenization keys stored in HSM/Key Vault; rotated on schedule.
  • mTLS certificates rotated automatically (mesh/PKI).
  • Export signing keys versioned; manifests include kid to verify against JWKS.
  • No secrets in configuration files; use managed identities for vault access.

Rate-Limiting & Abuse Controls

  • Per-tenant quotas for read/write; adaptive throttling based on p95 latency and queue depth.
  • 429 with Retry-After headers; clients apply exponential backoff + jitter.
  • Backfill channel isolated from online ingest to protect SLOs.

Observability & “Audit of the Auditor”

  • Emit security-focused metrics: denied requests by reason, cross-tenant attempts, elevated-read counts, export volume by purpose.
  • Trace spans include tenant, scope, purpose, redaction level (never PII).
  • Every elevated read/export → a minimal audit record (AuditorAccess.Requested/Granted) with justification and principal.

Diagram — End-to-End Enforcement

sequenceDiagram
  participant C as Caller (User/Service)
  participant GW as API Gateway
  participant AUD as AuditStream Service
  participant REPO as Repositories
  participant DB as DB (RLS)
  participant BUS as Service Bus

  C->>GW: HTTP/gRPC + JWT (tenant_id, scopes)
  GW->>C: 401/403 if invalid; else forward with Tenant-Id, Trace-Id
  GW->>AUD: mTLS request (tenant scoped)
  AUD->>AUD: AuthZ (scope + purpose) + policy checks
  AUD->>REPO: Set SESSION_CONTEXT('TenantId')
  REPO->>DB: Query/Append
  DB-->>REPO: Enforced by RLS (tenant rows only)
  AUD-->>BUS: Outbox publish (minimized event)
  AUD-->>C: Response (redacted/elevated per scope)

Hard Rules Recap

  • No cross-tenant queries. Ever—except formal break-glass with dual control, justification, and exhaustive audit.
  • Least-privilege tokens only; scopes map 1:1 to actions.
  • mTLS everywhere inside the cluster; TLS to clients.
  • Data minimization by default; classification on write; signed, time-boxed exports.

Observability of the Auditor


Purpose

Instrument the Audit & Compliance services so investigators and operators can see: how fast data arrives (ingest lag), whether writes succeed, where dedupe is triggering, how long exports take, and whether any consumer or saga is falling behind—without ever leaking PII. We adopt OpenTelemetry (OTel) for traces/metrics/logs, enforced PII redaction, and event-first correlation across HTTP, gRPC (code-first), Service Bus/MassTransit, and Orleans.


Telemetry Stack (reference)

  • Tracing: OpenTelemetry SDK + auto-instrumentation (HTTP client/server, gRPC, MassTransit, ADO.NET).
  • Metrics: OTel Meter; export to Prometheus/OTLP.
  • Logs: structured JSON via Serilog (or ILogger) → OTLP/Elastic.
  • Correlation: W3C trace-context; propagate traceparent and our domain correlation (Trace-Id, Request-Id, Producer).
  • Exemplars: attach trace IDs to metric data points for deep linking from dashboards.

Tracing Model (span taxonomy)

Create consistent span names and attributes across services. All spans must include: tenant.id, audit.type (publish|consume|saga.step|query|export|purge|seal), audit.key (hashed idempotency key), delivery.attempt (for consumers).

Common span names

  • audit.ingest.append — HTTP/gRPC/Bus/Orleans ingress
  • audit.outbox.publish — outbox relay publishes domain events
  • audit.bus.consume — MassTransit consumer handler
  • audit.saga.step — retention/export/integrity orchestration step
  • audit.export.run — an ExportJob execution window
  • audit.retention.purge — purge execution
  • audit.integrity.seal — segment sealing
  • audit.query.timeline / audit.query.decisions / audit.query.proofs — read APIs

Span attributes (selected)

| Attribute | Example | Notes |
| --- | --- | --- |
| tenant.id | t-9c8f1 | required |
| audit.category | identity | optional (if available) |
| audit.action | User.PasswordChanged | optional |
| audit.record.id | 01J9Z7B... | avoid logging payload |
| audit.key.hash | sha256:ab12… | hash of Idempotency-Key |
| correlation.request_id | rq-xyz | from envelope |
| correlation.producer | iam-service@1.12.3 | |
| messaging.system | azure.servicebus | for bus spans |
| messaging.operation | process | consume spans |
| delivery.attempt | 3 | incremented by consumer |
| db.operation | INSERT, SELECT | for storage spans |

Sampling

  • Always-on sampling for write-path spans (audit.ingest.append, audit.outbox.publish, audit.bus.consume) whenever the span ends in error.
  • Parent-based 10–20% for success traces in hot tenants; enable tail-based sampling by high latency/lag.
  • PII guard: deny sampling decisions that would capture PII payloads—payloads aren’t logged anyway.

Metrics (SLIs & diagnostic)

Define stable instruments with clear units and labels. Tenants are labels; avoid high-cardinality labels (hash IDs when needed).

Ingest & Write

  • audit_ingest_messages_total{tenant} — Counter
  • audit_ingest_lag_seconds{tenant} — Histogram (p50/p90/p95/p99)
  • audit_append_success_total{tenant} — Counter
  • audit_append_failures_total{tenant,reason} — Counter
  • audit_dedupe_hits_total{tenant} / audit_dedupe_ratio{tenant} — Counter/Gauge
  • audit_outbox_backlog{service} — Gauge (rows in outbox not yet published)
  • audit_inbox_backlog{consumer} — Gauge

Consumers & Sagas

  • audit_consumer_delivery_attempts{consumer} — Histogram
  • audit_consumer_lag_seconds{consumer,tenant} — Histogram (event time → now)
  • audit_projector_lag_seconds{view,tenant} — Histogram
  • audit_saga_duration_seconds{saga} — Histogram
  • audit_dlq_messages_total{consumer} — Counter

Exports & Integrity

  • audit_export_duration_seconds{tenant,format,redaction} — Histogram
  • audit_export_records_total{tenant} — Counter
  • audit_integrity_seal_duration_seconds{tenant,category} — Histogram
  • audit_integrity_segments_sealed_total{tenant,category} — Counter

Queries

  • audit_query_timeline_latency_seconds{tenant} — Histogram
  • audit_query_rate_limited_total{tenant,endpoint} — Counter
  • audit_cache_hit_ratio{endpoint} — Gauge

Attach exemplars to audit_ingest_lag_seconds and audit_export_duration_seconds to point to slow traces.


Logs (PII-safe, policy-enforced)

  • Structure only; no raw payloads.
  • Required fields: timestamp, level, messageTemplate, tenant.id, trace.id, request.id, component, event.name, result.
  • Redaction middleware: apply the same ClassificationPolicy-based redaction to log scopes that process potentially sensitive strings (e.g., exceptions that echo inputs).
  • Drop fields classified as CREDENTIAL; hash PERSONAL; tokenize PHI.
  • Compact error events with codes (e.g., AUDIT_422_CLASSIFICATION, AUDIT_409_LEGALHOLD).

Examples

{ "level":"Information","event":"audit.append.ok","tenant.id":"t-9c8f1",
  "trace.id":"3b5...","audit.category":"identity","record.id":"01J9Z7B...","dedupe":"miss" }

{ "level":"Warning","event":"audit.append.duplicate","tenant.id":"t-9c8f1",
  "trace.id":"3b5...","audit.key.hash":"sha256:ab12...","dedupe":"hit" }

{ "level":"Error","event":"audit.export.failed","tenant.id":"t-9c8f1",
  "trace.id":"3b5...","job.id":"exp-01J9Z7...","reason":"Timeout","duration.ms":90000 }

Dashboards (operator & auditor views)

  1. Ingest Health
    • Ingest rate, ingest lag p95, append success vs failures by reason, dedupe ratio, outbox/inbox backlog.
  2. Consumers & Projections
    • Consumer lag & delivery attempts; projector lag per view; DLQ counts; top failing handlers.
  3. Exports & Integrity
    • Export durations, records per export, failure rate, proofs sealed over time, seal durations.
  4. Queries (Usage & Performance)
    • Timeline latency, cache hit ratio, rate-limited requests, elevated reads by purpose (counts only).
  5. Security Signals
    • Elevated-read counts by purpose, cross-tenant denial attempts, break-glass usage with approver identities.

Each panel links to traces via exemplars and to filtered logs.


Alerts & SLOs (suggested starting points)

  • Ingest SLO: p95 audit_ingest_lag_seconds < 5s (warn at 10s, page at 30s).
  • Append Error Budget: audit_append_failures_total / (append_success + failures) < 0.5% (rolling 1h).
  • Consumer Lag: p95 audit_consumer_lag_seconds < 15s (warn), < 60s (page).
  • Projector Lag: p95 audit_projector_lag_seconds < 5s (warn), < 20s (page).
  • Export Duration: p95 audit_export_duration_seconds < target SLA per size (e.g., 10 min for 10M records).
  • DLQ: growth rate > N/min → page (sticky failure).
  • Break-Glass: any event → immediate page + incident ticket.
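The append error budget above is a simple ratio over a rolling window. A minimal sketch of the alert condition, assuming counters sampled over the same 1h window (the 0.5% budget is the suggested starting point from the list above):

```python
def append_error_ratio(success: int, failures: int) -> float:
    """failures / (success + failures); 0.0 when no traffic."""
    total = success + failures
    return failures / total if total else 0.0

def breaches_budget(success: int, failures: int, budget: float = 0.005) -> bool:
    """True when the rolling error ratio meets or exceeds the 0.5% budget."""
    return append_error_ratio(success, failures) >= budget
```

In practice this would be expressed as a recording/alerting rule in the metrics backend rather than application code.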

Code Sketch (C# OTel wiring)

services.AddOpenTelemetry()
    .WithTracing(b => b
        .AddAspNetCoreInstrumentation()
        .AddGrpcClientInstrumentation()
        .AddSqlClientInstrumentation(o => o.SetDbStatementForText = false) // avoid payloads
        .AddMassTransitInstrumentation()
        .AddSource("ConnectSoft.Audit.*")
        .SetSampler(new ParentBasedSampler(new TraceIdRatioBasedSampler(0.2)))
        .AddOtlpExporter())
    .WithMetrics(b => b
        .AddRuntimeInstrumentation()
        .AddAspNetCoreInstrumentation()
        .AddMeter("ConnectSoft.Audit.Metrics")
        .AddOtlpExporter());

Span creation example

using var activity = _activitySource.StartActivity("audit.bus.consume");
activity?.SetTag("tenant.id", tenantId);
activity?.SetTag("audit.type", "consume");
activity?.SetTag("delivery.attempt", attempt);
activity?.SetTag("audit.key.hash", keyHash);

Diagram — Telemetry Flow

flowchart LR
  subgraph Services
    IN[Ingest API] -->|Spans+Metrics+Logs| TE[OTel SDK]
    CON[Consumers/Sagas] -->|Spans+Metrics+Logs| TE
    QRY[Query APIs] -->|Spans+Metrics+Logs| TE
  end
  TE --> EXP[OTLP Exporter]
  EXP --> APM[Tracing Backend]
  EXP --> PRM[Metrics Backend]
  EXP --> LOGS[Logs Backend]
  PRM --> DASH[Dashboards & Alerts]
  APM --> DASH
  LOGS --> DASH

Governance

  • PII Security Review on any new log field or span attribute.
  • Telemetry ADRs document every new metric/alert.
  • Synthetic probes (canaries) continuously exercise ingest, query, and export paths and verify SLIs.

Notes

  • Keep cardinality bounded: hash keys/IDs in attributes; avoid free-form labels.
  • Prefer seek-based pagination metrics over counting offsets.
  • Always audit the auditor: elevated reads/exports emit minimal audit records with purpose and principal.

Storage Strategy (HLD)


Purpose

Design a default, cost-aware storage layout that provides fast append, predictable reads, and long-term retention with legal-hold guarantees. The baseline is Azure SQL for hot/warm data (append-only), complemented by Blob Storage (ADLS Gen2) for cold/archival tiers, with optional query-through via serverless analytics.


Tiered Layout (hot → warm → cold)

flowchart LR
  IN[Inbox + Dedupe] --> SQL[(Azure SQL<br/>audit.*)]
  SQL --> RM[(Azure SQL<br/>audit_rm.*)]
  SQL --> SEG[(Azure SQL<br/>audit.Segments)]
  SQL -- Retention window elapsed --> ADLS[(ADLS Gen2 / Blob<br/>JSONL or Parquet)]
  ADLS --> MAN[(Manifests & Indexes in SQL)]
  MAN -. optional query-through .-> ANA[(Serverless SQL / Fabric)]
  • Hot (0–90 days): Azure SQL, append-only, minimal indexes, row/page compression off or row-only.
  • Warm (90–365 days): Azure SQL, same tables but older partitions compressed (PAGE/columnstore for read models).
  • Cold (≥ retention window): ADLS Gen2 (Blob) in JSONL or Parquet, plus SQL index rows to locate artifacts; optional serverless SQL/Fabric for ad-hoc queries.

Actual thresholds are policy-driven; HIPAA/GDPR overlays may extend hot/warm windows.


Logical Schema (recap)

  • Write model (append-only): audit.Records, audit.Idempotency, audit.Segments.
  • Read models (projections): audit_rm.Timeline, audit_rm.AccessDecisions, audit_rm.ChangeDeltas, audit_rm.IntegrityProofs.
  • All tables are tenant-scoped and enforce RLS (see Security & Tenancy).

Partitioning & Clustering

Records table (primary store)

  • Clustered key: (TenantId, Category, OccurredAtUtc, RecordId)
  • Partition function: monthly on OccurredAtUtc (or weekly for hot tenants).
  • Benefits: predictable range scans by time; easy partition management for retention/sealing.

Sample (illustrative):

-- Partition monthly by UTC date
CREATE PARTITION FUNCTION pf_MonthlyDate (DATETIME2(3))
AS RANGE RIGHT FOR VALUES (
  '2025-01-01','2025-02-01','2025-03-01','2025-04-01','2025-05-01','2025-06-01',
  '2025-07-01','2025-08-01','2025-09-01','2025-10-01','2025-11-01','2025-12-01'
);

CREATE PARTITION SCHEME ps_MonthlyDate
AS PARTITION pf_MonthlyDate ALL TO ([PRIMARY]);

-- Clustered index aligned with partitioning
CREATE CLUSTERED INDEX CIX_Records_TenantCatDateId
ON audit.Records (TenantId, Category, OccurredAtUtc, RecordId)
ON ps_MonthlyDate(OccurredAtUtc);

Nonclustered hot-path indexes (seek/prefix-friendly)

  • (TenantId, ResourceType, ResourceId, OccurredAtUtc) INCLUDE (Action, ActorId, DecisionOutcome)
  • (TenantId, ActorId, OccurredAtUtc) INCLUDE (Action, ResourceType, ResourceId, DecisionOutcome)
  • (TenantId, Action, OccurredAtUtc) INCLUDE (ResourceType, ResourceId, ActorId)

Create only what queries need; rely on projections for complex filters.


Compression & Storage Modes

  • Write model: keep ROW compression on hot partitions; switch PAGE on warm partitions via rolling jobs.
  • Read models with heavy scans: consider clustered columnstore for audit_rm.Timeline and audit_rm.ChangeDeltas (validate cardinality & segment elimination first).
  • LOB JSON fields kept in NVARCHAR(MAX) post-redaction; enforce max sizes and summarize large diffs.

Ingest & Append (performance)

  • Batching: use TVPs / SqlBulkCopy for BatchAppend; target batches of 500–2,000 rows depending on payload size.
  • Minimal per-row work: precompute normalized fields and hashes in the service; avoid scalar SQL UDFs on hot paths.
  • Idempotency: write to audit.Idempotency (TenantId, IdempotencyKey → RecordId) first; short-circuit duplicates.
  • Locking: favor ordered inserts by (TenantId, Category, OccurredAtUtc); avoid random keys.
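The idempotency short-circuit (write the key first, skip duplicates) can be sketched with an in-memory stand-in for the audit.Idempotency table. Illustrative only; `IdempotencyIndex` and `append` are hypothetical names, and a real implementation relies on a unique constraint on (TenantId, IdempotencyKey) for atomicity:

```python
class IdempotencyIndex:
    """In-memory stand-in for audit.Idempotency: (TenantId, IdempotencyKey) -> RecordId."""
    def __init__(self):
        self._seen: dict[tuple[str, str], str] = {}

    def try_claim(self, tenant_id: str, key: str, record_id: str) -> tuple[bool, str]:
        existing = self._seen.get((tenant_id, key))
        if existing is not None:
            return False, existing          # duplicate: surface the original RecordId
        self._seen[(tenant_id, key)] = record_id
        return True, record_id

def append(index: IdempotencyIndex, tenant_id: str, key: str, record_id: str, store: list) -> str:
    fresh, rid = index.try_claim(tenant_id, key, record_id)
    if fresh:
        store.append(rid)                   # only the first delivery reaches the append path
    return rid
```

Redelivered messages thus return the original RecordId without touching audit.Records, which keeps BatchAppend safe under at-least-once transports.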

Integrity Segments

  • Segment map: audit.Segments (TenantId, Category, SegmentId, First/LastOccurredAt, First/LastRecordId, RootHash, SealedAtUtc)
  • Sealing: on thresholds, compute hash/Merkle in service, update segment row atomically, emit integrity.segment_sealed.
  • Verification: proofs stored alongside exports (Blob) and indexed in audit_rm.IntegrityProofs.

Retention & Purge

  • RetentionPolicy marks eligible partitions/row ranges; never purge rows covered by LegalHold.
  • Purge mechanics:
    1. Export & stage eligible rows to ADLS Gen2 in JSONL/Parquet (with manifest & hash list).
    2. Verify counts/hashes; write manifest row to SQL (tenant, category, range, URLs, hashes).
    3. Soft-delete or hard-delete in SQL per jurisdictional overlay (prefer hard-delete for payload with redacted tombstones left only if required).
  • Legal hold overlay: exclusion lists maintained in SQL; any held segment/row range is skipped.

Cold Storage (ADLS Gen2)

  • Format:
    • JSONL: simple, line-delimited canonical audit envelopes.
    • Parquet: columnar, efficient for analytics (recommended for large volumes).
  • Layout: /tenantId=<tenant>/category=<category>/year=<yyyy>/month=<mm>/day=<dd>/part-*.parquet (Hive-style for partition pruning).
  • Security: per-tenant containers or prefixes, RBAC + SAS.
  • Index-in-SQL: table audit.ArchiveManifests (TenantId, Category, FromUtc, ToUtc, Format, Uri, Hash, CreatedAtUtc) enables discovery and integrity verification.
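Assuming the Hive-style layout described above, the archive prefix for a given tenant/category/day can be derived deterministically, which is what makes partition pruning and manifest lookups cheap. A small illustrative sketch (`archive_prefix` is a hypothetical helper name):

```python
from datetime import datetime, timezone

def archive_prefix(tenant_id: str, category: str, day: datetime) -> str:
    """Hive-style prefix so query engines can prune by tenant/category/date."""
    d = day.astimezone(timezone.utc)
    return (f"tenantId={tenant_id}/category={category}/"
            f"year={d.year:04d}/month={d.month:02d}/day={d.day:02d}/")
```

Rows in audit.ArchiveManifests would then store the full URI built from this prefix alongside the hash used for integrity verification.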

Optional query-through

  • Serverless SQL / Fabric external tables over ADLS (for ad-hoc).
  • Keep primary APIs on SQL hot/warm to protect SLOs; cold queries are explicitly slower and rate-limited.

HA/DR & Backups

  • HA: Azure SQL Business Critical or Premium for hot tenants; zone redundancy on.
  • DR: Active Geo-Replication to paired region; failover runbook tested quarterly.
  • Backups: PITR enabled (e.g., 7–35 days hot DB); long-term backup for compliance (weeks–years per policy).
  • ADLS: RA-GRS (geo-redundant) with immutability policies (Time-based retention, Legal hold flags on containers).

Encryption & Keys

  • Azure SQL: TDE on; per-tenant salts/keys for hashing/tokenization in Key Vault; rotation schedule.
  • ADLS: SSE with Microsoft- or customer-managed keys (CMK).
  • Export/Manifest signing: keys in HSM/Key Vault; manifests include kid.

Capacity & Sizing (rules of thumb)

  • Row size (post-redaction, compact JSON deltas): 300–1200 bytes typical; plan for ~1 KB avg.
  • Daily volume: rows/day = events_per_sec × 86,400.
  • Hot partition size: keep below 200–300 GB per table for comfortable maintenance on P-sku; split tenants/categories if needed.
  • Read models: size ~40–70% of write model depending on projection density.
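The sizing arithmetic above is straightforward but worth making explicit. A sketch using the ~1 KB average-row planning figure from the list (the 500 events/sec input below is an arbitrary worked example):

```python
def daily_rows(events_per_sec: float) -> int:
    """rows/day = events_per_sec x 86,400 seconds."""
    return int(events_per_sec * 86_400)

def daily_gb(events_per_sec: float, avg_row_bytes: int = 1_024) -> float:
    """Planning estimate for write-model growth per day, in GiB."""
    return daily_rows(events_per_sec) * avg_row_bytes / 1_024**3
```

At 500 events/sec this yields 43.2M rows and roughly 41 GiB per day, so a 200–300 GB hot-partition ceiling is reached within about a week — a signal to split by tenant/category or shorten the partition grain.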

Maintenance Jobs

  • Partition management: monthly partition create/switch/merge automation.
  • Compression: roll older partitions to PAGE/columnstore weekly.
  • Index care: monitor fragmentation (>30% → rebuild/reorg off-peak).
  • Vacuum (if soft-deletes used): periodic cleanup.
  • Checksum audit: re-verify segment hashes; surface drift alarms.

Cost Controls

  • Right-size SKUs: scale up for backfill, scale down after.
  • Move warm partitions to cheaper compute tiers (if environment allows); keep only last N partitions in high performance tier.
  • Blob tiering: Hot → Cool → Archive based on access.
  • Serverless analytics: pay-per-query for cold investigations.

Access Patterns → Storage Mapping

| Query | Storage | Index/Technique |
| --- | --- | --- |
| Tenant timeline for last 30 days | Azure SQL (hot) | Clustered seek on (TenantId, OccurredAtUtc) |
| Resource drill-down | Azure SQL (hot/warm) | IX_Resource nonclustered index |
| Actor investigations | Azure SQL (hot/warm) | IX_Actor nonclustered index |
| Bulk eDiscovery export | Azure SQL → ADLS | Range scan + export job; manifest in SQL |
| Legacy period review (> retention) | ADLS Gen2 | External table / download & verify via manifest |

Notes

  • Keep append-only discipline: no in-place updates; use events and projections to evolve views.
  • Treat SQL as source of truth for active windows; treat ADLS + manifests as source of truth for archived windows.
  • All movement between tiers is audited and integrity-checked (hash lists + proofs).

Integrity & Tamper-evidence


Purpose

Provide cryptographic tamper-evidence for the audit trail. The IntegrityLedger maintains rolling segments of records per (tenantId, category), computes proofs on seal (hash-chain + Merkle root), and emits integrity.segment_sealed. Proofs are embedded in exports and can be independently verified. No rewrites are permitted; authorized redactions produce tombstones + redacted copies linked by proof references.


Model

classDiagram
  class IntegrityLedger {
    +SealPolicy sealPolicy
    +StartSegment(tenantId, category)
    +AppendLeaf(tenantId, category, leaf)
    +SealSegment(tenantId, category) : SegmentProof
    +GetProof(segmentId) : SegmentProof
  }

  class Segment {
    +string segmentId
    +string tenantId
    +string category
    +datetime openedAtUtc
    +long firstSequence
    +long lastSequence
    +HashChain chain
    +MerkleBuilder merkle
    +bool sealed
    +datetime? sealedAtUtc
    +string rootHash
    +string hashAlgorithm
  }

  class HashChain {
    +string lastHash
    +Append(leafHash) : string
  }

  class SegmentProof {
    +string segmentId
    +string tenantId
    +string category
    +string hashAlgorithm   // sha256
    +string rootHash        // Merkle root
    +string chainTipHash    // rolling hash at seal
    +Uri    proofUrl        // blob manifest
    +datetime sealedAtUtc
  }
  • Leaf = hash of a canonicalized audit record (post-redaction), plus minimal stable metadata.
  • Chain: sequential rolling hash provides ordering proof; Merkle provides membership proof.
  • Per (tenant, category): each has an open segment until sealed.

Canonicalization & Hashing

Leaf canonical form (JSON with deterministic ordering, UTF-8, no whitespace):

{
  "v":1,
  "recordId":"01J9Z7B19X4P8ZQ7H6M4V6GQWY",
  "tenantId":"t-9c8f1",
  "category":"identity",
  "occurredAtUtc":"2025-09-30T12:34:56Z",
  "action":"User.PasswordChanged",
  "actorId":"u-12345",
  "resource":{"type":"User","id":"u-12345"},
  "policyVersion":7
}
  • Values are those stored at rest (already redacted).
  • Hash: leafHash = SHA-256(UTF8(canonicalJson)).
  • Chain step: chainTip = SHA-256(chainTip || leafHash) (with empty tip for the first leaf).
  • Merkle: standard binary Merkle tree with left-right concatenation; odd leaves are duplicated for pairing.
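The three primitives above (leaf hash, rolling chain, Merkle root) can be sketched compactly. An illustrative Python sketch: `json.dumps(..., sort_keys=True)` stands in for whatever deterministic field ordering the canonical form actually fixes, and the odd-leaf duplication matches the pairing rule stated above:

```python
import hashlib
import json

def leaf_hash(record: dict) -> bytes:
    """Hash the canonical form: deterministic ordering, UTF-8, no whitespace."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).digest()

def chain_append(chain_tip: bytes, leaf: bytes) -> bytes:
    """chainTip = SHA-256(chainTip || leafHash); empty tip before the first leaf."""
    return hashlib.sha256(chain_tip + leaf).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    """Binary Merkle tree with left-right concatenation; odd leaves are duplicated."""
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])         # duplicate the odd leaf for pairing
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]
```

The chain tip proves ordering (any reordering changes every subsequent tip), while the Merkle root supports compact per-record membership proofs.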

Sealing Policy

Sealing triggers when any threshold is reached (configurable, per tenant/category):

  • maxRecords (e.g., 10,000)
  • maxBytes (e.g., 100 MB accumulated leaf inputs)
  • maxDuration (e.g., 5 minutes)

On seal:

  1. Finalize Merkle root (rootHash) and capture chainTipHash.
  2. Persist segment metadata (atomic update).
  3. Write proof bundle (JSON) to Blob and record URL.
  4. Publish integrity.segment_sealed.
  5. Open a new segment.
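The "any threshold" trigger can be expressed as a single predicate evaluated on append and on a timer. A minimal sketch of the SealPolicy check, using the example thresholds above (names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class SealPolicy:
    max_records: int = 10_000
    max_bytes: int = 100 * 1024 * 1024   # accumulated leaf-input bytes
    max_duration_s: float = 300.0        # 5 minutes

def should_seal(policy: SealPolicy, records: int, byte_count: int, age_s: float) -> bool:
    """Seal when ANY threshold is reached (per tenant/category segment)."""
    return (records >= policy.max_records
            or byte_count >= policy.max_bytes
            or age_s >= policy.max_duration_s)
```

The duration threshold bounds proof latency on quiet tenants; the record/byte thresholds bound proof size on busy ones.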

Proof Bundle (stored in Blob; referenced in SQL)

{
  "type": "audit.segment-proof",
  "schemaVersion": 1,
  "segment": {
    "segmentId": "seg-2025-09-30-identity-0012",
    "tenantId": "t-9c8f1",
    "category": "identity",
    "firstSequence": 8123400,
    "lastSequence": 8133399,
    "openedAtUtc": "2025-09-30T12:00:00Z",
    "sealedAtUtc": "2025-09-30T12:05:00Z"
  },
  "hash": {
    "algorithm": "sha256",
    "root": "f0c4…",
    "chainTip": "9ab7…"
  },
  "leaves": {
    "count": 10000,
    "mapping": "omitted",             // mapping is not required; provided on demand
    "leafHashFormat": "sha256(hex)"
  },
  "signing": {
    "kid": "key-tenant-t-9c8f1-2025-09",
    "signature": "base64url(sha256withRSA(manifest-bytes))"
  }
}
  • Signing: proof bundle is signed with a per-tenant key (Key Vault/HSM).
  • SQL index: audit_rm.IntegrityProofs stores segmentId, rootHash, sealedAtUtc, proofUrl.

Inclusion Proofs (per record)

When a client requests proof for a specific record:

  • Compute/lookup the leafHash for recordId.
  • Return Merkle path + rootHash + hashAlgorithm + segmentId:
{
  "type": "audit.membership-proof",
  "segmentId": "seg-2025-09-30-identity-0012",
  "recordId": "01J9Z7B19X4P8ZQ7H6M4V6GQWY",
  "hashAlgorithm": "sha256",
  "leafHash": "1c0d…",
  "merklePath": ["b713…","004f…","9aa1…"],   // left/right indicated via structured tuples if needed
  "rootHash": "f0c4…",
  "sealedAtUtc": "2025-09-30T12:05:00Z"
}

Verification (offline):

  1. Rebuild leaf from canonicalized stored record; hash → leafHash.
  2. Fold merklePath to rootHash.
  3. Compare to bundle’s rootHash and signature.

Redaction & “No Rewrites”

  • No rewrites: once a record is written, it is never modified.
  • Authorized redaction (rare, policy-driven) produces two new facts:
    1. Tombstone: AuditRecordRedacted referencing the original recordId (reason, authority, case).
    2. RedactedCopy: a fully redacted replacement record with a new recordId, linked to the original.
  • Both facts are appended normally and become part of subsequent segments.
  • Queries default to the redacted copy; exports include the tombstone + redacted copy and their proofs.

Linking (schema additions)

ALTER TABLE audit.Records ADD
  OriginalRecordId VARCHAR(26) NULL,  -- present only for redacted copies
  IsTombstone BIT NOT NULL DEFAULT 0;
  • Integrity: because the original record remains unchanged and included in past sealed segments, any redaction is additive and auditable.

Export Integration

  • Each ExportJob manifest lists the segments covered and embeds:
    • the segment proof bundle references (URLs, hashes, signatures), and
    • optional membership proofs for all included records (or on-demand endpoint to retrieve them).
  • This enables an external investigator to verify off-platform that:
    • the exported records belong to sealed segments, and
    • the segment proofs are authentic (signature) and consistent (hashes).

Events & Flow

sequenceDiagram
  participant AS as AuditStream
  participant IL as IntegrityLedger
  participant SQL as SQL (Segments)
  participant BLOB as Blob (Proofs)
  participant BUS as Service Bus

  AS->>IL: audit.segment_seal_requested (threshold hit)
  IL->>IL: Finalize Merkle + ChainTip
  IL->>SQL: Update audit.Segments (sealedAt, rootHash, chainTip)
  IL->>BLOB: Upload proof-bundle.json (signed)
  IL->>BUS: Publish integrity.segment_sealed {segmentId, rootHash, proofUrl}

Event payload (code-first)

public sealed record SegmentSealedV1(
  string SegmentId,
  string TenantId,
  string Category,
  long FirstSequence,
  long LastSequence,
  string HashAlgorithm,
  string RootHash,
  string ChainTipHash,
  Uri ProofUrl,
  DateTime SealedAtUtc);

APIs (server-side verification helpers)

  • GET /audit/proofs?category=...&segmentId=... — returns SegmentProof (already defined in Query APIs cycle).
  • GET /audit/proofs/record/{recordId} — returns membership proof for a single record (tenant-scoped).
  • POST /audit/proofs/verify — accepts { segmentProof, membershipProof?, record? }, returns { valid: true|false, reasons:[] } (best-effort convenience; offline verification is authoritative).

Data Structures (SQL recap)

  • audit.Segments (already defined): add ChainTipHash VARCHAR(128).
  • audit_rm.IntegrityProofs (read model): lists segment/export proofs, URLs, hash algorithms.
ALTER TABLE audit.Segments ADD ChainTipHash VARCHAR(128) NULL;

Performance & Operations

  • Streaming Merkle: build incrementally; avoid holding all leaves in memory (use rolling bottom layer + disk spill if needed).
  • Parallelization: seal in a background worker per (tenant, category) to avoid cross-tenant contention.
  • Sizing: target segments that seal in 1–5 minutes under typical load to keep proofs bite-sized.
  • Metrics: audit_integrity_segments_sealed_total, audit_integrity_seal_duration_seconds, audit_integrity_proof_size_bytes.
  • Backfill: sealing operates the same; thresholds tuned for bulk jobs.
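The rolling-bottom-layer idea can be sketched as an incremental builder that holds at most one pending subtree root per level (Python; sha256 to match the proof bundles; the odd-subtree promotion convention is an assumption, not the canonical one):

```python
import hashlib

class StreamingMerkle:
    """Incremental Merkle root builder with O(log n) memory.

    Keeps one pending subtree root per tree height, so sealing a
    segment never materializes all leaves at once.
    """
    def __init__(self):
        self.levels = []  # one pending subtree root (or None) per height

    def add_leaf(self, leaf_hash: bytes) -> None:
        node, height = leaf_hash, 0
        while True:
            if height == len(self.levels):
                self.levels.append(node)
                return
            if self.levels[height] is None:
                self.levels[height] = node
                return
            # Pair with the pending node at this height and carry upward.
            node = hashlib.sha256(self.levels[height] + node).digest()
            self.levels[height] = None
            height += 1

    def root(self) -> bytes:
        # Promote any leftover odd subtrees bottom-up.
        node = None
        for pending in self.levels:
            if pending is None:
                continue
            node = pending if node is None else hashlib.sha256(pending + node).digest()
        return node
```

For a power-of-two leaf count this reproduces the textbook binary Merkle root while never buffering more than one node per level.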

Optional External Anchoring

Periodically (e.g., hourly) anchor the latest set of segment root hashes into an external, time-stamping authority (or public blockchain) by publishing a super-root (Merkle of roots). Store the anchor receipt in Blob and index it in SQL to strengthen non-repudiation.


Security Notes

  • Proof bundles are signed; keys live in HSM/Key Vault; manifests include kid.
  • Verification requires only the stored record (redacted), the membership proof, and the segment proof bundle—no secret keys.
  • All proof access is audited (purpose-limited).


Purpose

Protect tenants with time-bounded data retention while guaranteeing that legal holds can suspend purge for investigations or litigation. Retention is enforced per tenant + category; purge operates on eligible segments first (row-level fallback when necessary). Every action—policy change, eligibility evaluation, purge, hold apply/release—is audited (“audit of the auditor”).


Model

classDiagram
  class RetentionPolicy {
    +string policyId
    +string tenantId
    +int version
    +map~string,int~ daysByCategory        // e.g., identity:365, billing:2555
    +RetentionMode mode                   // PURGE|REDACT_TOMBSTONE|ARCHIVE_THEN_PURGE
    +datetime effectiveFromUtc
    +string author
    +bool lockedByCompliance              // prevents weakening below baseline
  }

  class LegalHold {
    +string holdId
    +string tenantId
    +string caseId
    +string[] categories
    +datetime fromUtc
    +datetime toUtc
    +string query                         // optional: actor/resource/action filter
    +string reason
    +string createdBy
    +datetime createdAtUtc
    +bool released
    +datetime? releasedAtUtc
  }

  class RetentionMode {
    <<Enumeration>>
    PURGE
    REDACT_TOMBSTONE
    ARCHIVE_THEN_PURGE
  }

Precedence

  1. LegalHold > Retention — any held scope is excluded from purge.
  2. Compliance baselines — minima per category can’t be reduced (e.g., HIPAA).
  3. Mode semantics — determine how eligibility is acted upon:
    • PURGE: delete eligible rows (no rewrites; events notify projections).
    • REDACT_TOMBSTONE: replace payload with redacted tombstone entries (additive).
    • ARCHIVE_THEN_PURGE: export eligible rows to ADLS (with manifest & proofs) before deleting.

Data Structures (SQL)

Retention policies (versioned)

CREATE TABLE retention.Policies (
  PolicyId        VARCHAR(40)  NOT NULL,
  TenantId        VARCHAR(64)  NOT NULL,
  Version         INT          NOT NULL,
  DaysByCategory  NVARCHAR(1024) NOT NULL, -- JSON { "identity":365, "billing":2555 }
  Mode            VARCHAR(24)  NOT NULL,   -- PURGE|REDACT_TOMBSTONE|ARCHIVE_THEN_PURGE
  EffectiveFromUtc DATETIME2(3) NOT NULL,
  Author          VARCHAR(128) NOT NULL,
  LockedByCompliance BIT NOT NULL DEFAULT 0,
  CreatedAtUtc    DATETIME2(3) NOT NULL DEFAULT SYSUTCDATETIME(),
  CONSTRAINT PK_retention_Policies PRIMARY KEY (TenantId, PolicyId, Version)
);

Legal holds

CREATE TABLE retention.LegalHolds (
  HoldId         VARCHAR(40)  NOT NULL,
  TenantId       VARCHAR(64)  NOT NULL,
  CaseId         VARCHAR(64)  NOT NULL,
  CategoriesJson NVARCHAR(512) NOT NULL,
  FromUtc        DATETIME2(3) NOT NULL,
  ToUtc          DATETIME2(3) NOT NULL,
  Query          NVARCHAR(1024) NULL,  -- optional filter DSL
  Reason         NVARCHAR(512) NOT NULL,
  CreatedBy      VARCHAR(128) NOT NULL,
  CreatedAtUtc   DATETIME2(3) NOT NULL DEFAULT SYSUTCDATETIME(),
  Released       BIT NOT NULL DEFAULT 0,
  ReleasedAtUtc  DATETIME2(3) NULL,
  CONSTRAINT PK_retention_LegalHolds PRIMARY KEY (TenantId, HoldId)
);

Purge ledger (idempotent, auditable)

CREATE TABLE retention.PurgeLedger (
  JobId          VARCHAR(40)  NOT NULL,
  TenantId       VARCHAR(64)  NOT NULL,
  Category       VARCHAR(64)  NOT NULL,
  FromUtc        DATETIME2(3) NOT NULL,
  ToUtc          DATETIME2(3) NOT NULL,
  Mode           VARCHAR(24)  NOT NULL,
  Status         VARCHAR(16)  NOT NULL,   -- started|completed|failed|canceled
  RecordsAffected BIGINT      NOT NULL DEFAULT 0,
  SegmentsTouched INT         NOT NULL DEFAULT 0,
  ManifestUrl    NVARCHAR(512) NULL,      -- when ARCHIVE_THEN_PURGE
  StartedAtUtc   DATETIME2(3) NOT NULL DEFAULT SYSUTCDATETIME(),
  CompletedAtUtc DATETIME2(3) NULL,
  Error          NVARCHAR(1024) NULL,
  CONSTRAINT PK_retention_PurgeLedger PRIMARY KEY (JobId)
);

Eligibility & Purge Algorithm

Eligibility rule. A record is eligible if:

  • occurredAtUtc <= (nowUtc - daysByCategory[category]), and
  • the record (or its segment) is not covered by any active legal hold (time/category/query match).
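Sketched as a pure predicate (Python; the inline hold check is a stand-in for the real time/category/query matcher):

```python
from datetime import datetime, timedelta, timezone

def is_eligible(record: dict, days_by_category: dict,
                active_holds: list, now_utc: datetime) -> bool:
    """A record is eligible for purge iff its retention window has
    elapsed AND no active legal hold covers it."""
    days = days_by_category.get(record["category"])
    if days is None:
        return False  # no policy for this category: never purge implicitly
    window_elapsed = record["occurredAtUtc"] <= now_utc - timedelta(days=days)
    held = any(h["fromUtc"] <= record["occurredAtUtc"] <= h["toUtc"]
               and record["category"] in h["categories"]
               for h in active_holds)  # stand-in for the full matcher
    return window_elapsed and not held
```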

Segment-first strategy

  • Prefer whole-segment purge for efficiency when the entire segment is eligible and not intersecting any active hold.
  • When holds intersect a segment’s range, fall back to row-level filtering for the non-held subset.

Pseudocode (service layer)

var policy = policies.ActiveFor(tenantId, nowUtc);
var windows = policy.ToWindows(nowUtc); // per category (from, to)

foreach (var cat in windows.Keys)
{
    var (from, to) = windows[cat];
    var held = holds.ActiveScopes(tenantId, cat, from, to);

    var candidates = segments.Range(tenantId, cat, upTo: to);

    foreach (var seg in candidates)
    {
        if (held.Intersects(seg))
            PurgeRows(seg, exclude: held.RowFilters);
        else
            PurgeSegment(seg);
    }
}

Idempotency. Each purge job writes an entry to PurgeLedger. Re-running with the same (TenantId, Category, FromUtc, ToUtc) produces no duplicated effects.
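One way to get that idempotency, assuming the JobId is derived from the purge scope rather than generated randomly, is a deterministic hash so the PurgeLedger primary key rejects a re-run (Python sketch):

```python
import hashlib

def purge_job_id(tenant_id: str, category: str, from_utc: str, to_utc: str) -> str:
    """Deterministic JobId: the same purge scope always maps to the same
    key, so an INSERT into retention.PurgeLedger (PK JobId) dedupes re-runs."""
    material = "|".join((tenant_id, category, from_utc, to_utc))
    return "purge-" + hashlib.sha256(material.encode("utf-8")).hexdigest()[:32]
```

A re-submitted job then hits a primary-key violation (or an upsert no-op) instead of purging twice.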


Saga & Events

sequenceDiagram
  participant SCH as Scheduler
  participant RP as RetentionPolicy Service
  participant LH as LegalHold Service
  participant AS as AuditStream
  participant IL as IntegrityLedger
  participant BUS as Service Bus

  SCH->>RP: Tick (~15m)
  RP->>LH: Query active holds per tenant/category
  RP->>AS: Evaluate eligible segments/rows (excludes holds)
  AS-->>RP: Eligible set (segments|rows)
  RP->>AS: Execute purge (mode-dependent)
  AS-->>BUS: retention.purged {tenant, category, range, count, mode}
  RP-->>BUS: retention.window_elapsed {tenant, category, toUtc}
  IL-->>AS: (optional) seal small tail segments before purge

Emitted events

  • retention.policy_changed — on policy upsert/activate
  • retention.window_elapsed — category window crossed (informational)
  • retention.purged — includes { jobId, tenantId, category, fromUtc, toUtc, mode, records, segments }
  • legalhold.applied / legalhold.released — from LegalHold BC

All events are published via outbox; consumers (projections, dashboards) are idempotent.


Commands & APIs (admin)

RetentionPolicy BC

  • Commands: PutRetentionPolicy, EvaluatePurgeEligibility, ExecuteRetentionPurge.
  • HTTP (admin):
    • PUT /audit/admin/retention-policies — body: policy draft (versioned, forward-only).
    • POST /audit/admin/retention/purge — body: { category?, from?, to?, mode? } to force/limit scope.
  • Scopes: audit.admin.policy. Elevated actions require X-Purpose.

LegalHold BC

  • Commands: PlaceLegalHold, ReleaseLegalHold.
  • HTTP (admin):
    • POST /audit/admin/legal-holds — create hold (caseId, categories, range, query, reason).
    • POST /audit/admin/legal-holds/{holdId}:release.
  • Scopes: audit.admin.policy; justification required.

Mode Semantics & Effects

| Mode | Storage effect | Read models | Exports | Integrity |
|---|---|---|---|---|
| PURGE | Physical delete of eligible rows | Projectors receive retention.purged and remove rows | N/A | Past proofs remain valid (affected records were in prior sealed segments) |
| REDACT_TOMBSTONE | Replace payload with minimal tombstone rows | Views show redacted entries with reason | Exports include tombstones | Additive; no rewrites of originals |
| ARCHIVE_THEN_PURGE | Export to ADLS (manifest, hashes), then delete | Views show nothing after purge (unless “archived” badge) | Signed manifest + URIs included | Segment proofs referenced in manifest |

Archival step (for ARCHIVE_THEN_PURGE):

  • Export writes manifest to Blob (with counts, hashes, referenced segment proofs).
  • PurgeLedger.ManifestUrl populated; downstream can verify off-platform.

A hold can specify any combination of:

  • Time window (fromUtc → toUtc)
  • Categories (e.g., identity, billing)
  • Query (optional DSL): by actor.id, resource.type/id, action prefix—compiled to segment/time filters + secondary filters on rows.

Operational rules

  • Holds are created once and can be released; they are never edited retroactively.
  • A released hold stays in history for audit purposes.
  • Applying/releasing a hold emits events and updates indexes used by the retention saga.

Auditing (“audit of the auditor”)

Every administrative action creates an audit record:

  • Retention.PolicyChanged (policyId, version, author, diff)
  • Retention.PurgeStarted / Retention.PurgeCompleted (jobId, counts, mode)
  • LegalHold.Applied / LegalHold.Released (holdId, caseId, scope)

These are tenant-scoped (or platform-admin scoped for cross-tenant operations) and appear in timelines and exports.


Metrics & Alerts

  • retention_eligible_segments{tenant,category} — Gauge
  • retention_purge_duration_seconds{mode} — Histogram
  • retention_records_purged_total{tenant,category} — Counter
  • legalholds_active_total{tenant} — Gauge
  • Alerts:
    • Purge failure rate > 1% over 15m
    • Purge duration p95 > SLO (e.g., 10m per 1M rows)
    • Active holds overlapping > 80% of eligible data (investigate policy vs legal load)

Edge Cases & Safeguards

  • Backfill: backfilled historical records honor current legal holds but are evaluated against policy windows relative to their occurredAtUtc.
  • Segment tails: tiny tails can be sealed before purge to simplify eligibility evaluation.
  • Partial-hold segments: row-level purge with strict filters; if too fragmented, defer to next window.
  • Break-glass: platform-admin can inspect retention state across tenants only with dual-control + justification; actions are fully audited.

Diagram — Retention Decision Flow

flowchart TD
  A[Policy Window Elapsed] --> B[Load Active Legal Holds]
  B --> C{Held Overlap?}
  C -- No --> D[Select Eligible Segments]
  C -- Yes --> E["Row-level Filter (exclude holds)"]
  D --> F{Mode}
  E --> F
  F -- PURGE --> G[Delete rows/segments]
  F -- REDACT_TOMBSTONE --> H[Insert tombstones]
  F -- ARCHIVE_THEN_PURGE --> I[Export → Manifest] --> G
  G --> J[Emit retention.purged]
  H --> J

Backfill, Migration & Replay


Purpose

Enable safe historical imports (backfill), schema evolution without downtime, and deterministic rebuilds of read models via an event replay harness. Preserve append-only guarantees, integrity proofs, and tenant isolation throughout.


Modes

  • Backfill — ingest historical records under elevated controls (NDJSON/Parquet → canonical AuditRecord), honoring classification, legal holds, and retention.
  • Replay — rebuild projections (audit_rm.*) from the source of truth (audit.Records) or from archived files when SQL is pruned.
  • Migration — introduce new schemas/indices/views using blue/green read models; evolve contracts with versioned exporters/importers.

Architecture Overview

flowchart LR
  SRC[(Sources<br/>NDJSON/Parquet/Legacy DB)] --> IMP["Importer (versioned)"]
  IMP --> BF[Backfill API]
  BF --> WR[(audit.Records - append-only)]
  WR --> OUTBOX[(Outbox Events)]
  OUTBOX --> BUS[[Service Bus]]
  BUS --> PROJ[Projectors]
  PROJ --> RM[(audit_rm.* v1)]
  PROJ -.replay.-> RM2[(audit_rm.* v2)]
  ARCH[(ADLS Gen2 Archives)] --> RPY[Replay Harness] --> PROJ
  • Backfill and live ingest both land in the same append-only store, publish the same events, and feed the same projectors.
  • Replay can read from audit.Records or ADLS archives (JSONL/Parquet) when rebuilding.

Backfill

Transport

  • POST /audit/records:backfill (NDJSON/JSON; gzip).
  • Throughput-isolated from online ingest; runs with audit.backfill scope and mandatory X-Purpose.

Importer (versioned)

  • Importer@v1, Importer@v2, … map legacy payloads → canonical AuditRecord.
  • Each importer declares: sourceVersion, normalizers, classificationHints, idempotencyStrategy.
  • Validation: run dry-run to compute accept/reject counts and classification outcomes before commit.

Idempotency

  • Recommend deterministic keys: sourceName:sourceId:occurredAtUtc.
  • Duplicate lines across replays safely dedupe on (tenantId, idempotencyKey).

Integrity

  • Historical loads create historical segments per (tenant, category) with compressed thresholds; immediately sealed upon chunk completion to keep proofs compact and time-bounded.
  • Segment labels include mode=historical for audit transparency.

Replay Harness

Goals

  • Deterministically rebuild audit_rm.* projections and caches without touching the write model.
  • Support point-in-time (PIT) rebuilds and full rebuilds per tenant/category.

Sources

  • Primary: audit.Records (clustered seek by time).
  • Secondary: ADLS (JSONL/Parquet) via streaming readers when SQL windows were purged/archived.

Controls

  • Tenant-scoped, time-bounded, category-filtered runs.
  • Resume with high-water marks (HWM): (OccurredAtUtc, RecordId) stored per projection.
  • Parallelism tuned per tenant/category; avoid cross-tenant contention.

Mechanics

  1. Disable or shadow existing consumer for the target projection (blue/green).
  2. Stream records (or archived files) in order; synthesize audit.record_appended envelopes (or re-consume bus) into new projection tables audit_rm_v2.*.
  3. Validate counts/hashes; swap atomically (synonyms/view switch), keep old (v1) for rollback.

CLI Sketch (code-first)

audit-replay run \
  --tenant t-9c8f1 \
  --projection TimelineV2 \
  --from 2025-09-01T00:00:00Z --to 2025-09-30T23:59:59Z \
  --source sql --degree-of-parallelism 4 \
  --snapshot resume

Idempotent Upserts

  • Projections write with MERGE/UPSERT keyed by the target PK; handlers record ConsumerOffset(eventId) for safety even during replay.

Snapshot Strategy (large streams)

Why

  • Reduce replay time for very large tenants/resources.

Snapshot types

  • Projection Snapshots: periodic materialized state (e.g., last cursor for Timeline, last known view aggregates).
  • Segment Checkpoints: (segmentId, lastRecordId) to allow resuming mid-segment on replays.
  • Cursor Snapshots: per-projection HWM (OccurredAtUtc, RecordId).

Policy

  • Snapshot every N million records or M minutes; store in audit_snap.* (tenant-scoped).
  • Snapshots are append-only with metadata: projection version, policy version, tool version.

Restore

  • Replay starts from the latest compatible snapshot; verifies hash of the first delta batch before proceeding.

Migration Playbook (blue/green)

  1. Design new projection schema audit_rm_v2.* (additive fields, new indexes).
  2. Shadow build: run replay into v2 while v1 serves traffic.
  3. Dual-write (optional): during bake-in, live events update both v1 and v2 projectors.
  4. Cutover: swap synonyms/views or route queries via feature flag to v2.
  5. Monitor: compare v1 vs v2 counts/latencies for a window.
  6. Decommission: retire v1 after retention period.

DB Aid

CREATE SYNONYM audit_rm.Timeline FOR audit_rm_v2.Timeline;
-- Swap to v1 or v2 by changing synonym target inside a transaction.

Contract Evolution (exporters/importers)

  • Exporters

    • Exporter@v1: JSONL canonical envelope.
    • Exporter@v2: adds record.schemaVersion, policyVersion, optional membership proofs inline.
    • Manifests include exporterVersion, schemaVersion, kid, hashes.
  • Importers

    • Each importer version declares supported source schemas and a validation matrix.
    • On ingest, record importerVersion into an internal meta table for lineage.

Compatibility Rules

  • Additive-first: new fields optional, consumers ignore unknowns.
  • Breaking change: publish new topic name or path (e.g., /v2), maintain v1 until all consumers migrated.
  • Policy invariants: cannot weaken redaction for CREDENTIAL/PHI.

Consistency Guarantees

  • Append-only: neither backfill nor replay modifies existing records; they only append to write models or recompute read models.
  • Ordering: replay reads in (OccurredAtUtc, RecordId) order per (tenant, category).
  • Integrity: proof segments created during backfill are sealed and referenced; replay does not alter proofs.

Operational Controls

  • Throttling: replay & backfill respect per-tenant QPS and shared DB concurrency budgets.
  • Isolation: run heavy replays in off-peak windows; prioritize hot-tenant live traffic.
  • Observability:
    • replay_progress{tenant,projection} — Gauge (0–100).
    • replay_rate_records_per_s{tenant} — Gauge.
    • replay_errors_total{projection,reason} — Counter.
    • Trace spans: audit.replay.batch with fromCursor/toCursor.

Failure Recovery

  • Store HWM every N records; on crash, resume from last committed HWM.
  • Poison batches → quarantine file with first/last IDs; continue with next range; open incident.
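A minimal resume loop under these rules (Python; `read_batch`, `apply`, and `save_hwm` are hypothetical stand-ins for the projection-specific I/O):

```python
def replay(read_batch, apply, save_hwm, start_hwm, checkpoint_every=1000):
    """Stream records after start_hwm in (OccurredAtUtc, RecordId) order,
    committing a high-water mark every `checkpoint_every` records so a
    crash resumes from the last committed HWM."""
    hwm, since_checkpoint = start_hwm, 0
    while True:
        batch = read_batch(after=hwm)
        if not batch:
            save_hwm(hwm)   # final checkpoint
            return hwm
        for record in batch:
            apply(record)   # idempotent upsert into the projection
            hwm = (record["occurredAtUtc"], record["recordId"])
            since_checkpoint += 1
            if since_checkpoint >= checkpoint_every:
                save_hwm(hwm)
                since_checkpoint = 0
```

Because `apply` is an idempotent upsert, re-processing the few records between the last checkpoint and the crash point is harmless.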

Example: Rebuild Timeline from ADLS

audit-replay run \
  --tenant t-9c8f1 \
  --projection TimelineV2 \
  --source adls \
  --path "abfss://audit/tenantId=t-9c8f1/category=identity/year=2025/month=09" \
  --from 2025-09-01T00:00:00Z --to 2025-09-30T23:59:59Z \
  --degree-of-parallelism 8 \
  --snapshot none
  • Reader streams Parquet row groups → canonical leaf records → projectors.
  • On completion, tool outputs counts, min/max timestamps, and verification hashes matching the export manifest.

Guardrails

  • LegalHold aware: backfill & replay must not bypass active holds (even for historical data).
  • Retention aware: do not re-introduce data beyond retention windows into hot SQL; keep in ADLS only.
  • No PII leakage: importers run the server-side classification; logs contain only counts and hashed identifiers.

APIs (admin)

  • POST /audit/admin/replay — body: { tenantId, projection, from, to, source: 'sql'|'adls', path?, snapshot?: 'resume'|'none' }
  • GET /audit/admin/replay/{runId} — progress and metrics
  • POST /audit/admin/migrations/activate — cutover v2 projections (scope audit.admin.policy, purpose required)

Diagram — Replay Control Flow

sequenceDiagram
  participant ADM as Admin
  participant RPY as Replay Controller
  participant SRC as Source (SQL/ADLS)
  participant PJ as Projector Host (v2)
  participant DB as audit_rm_v2.*

  ADM->>RPY: Start replay (tenant, projection, from/to)
  RPY->>SRC: Open reader (ordered stream)
  SRC-->>PJ: Records (in order)
  PJ->>DB: UPSERT rows (idempotent)
  PJ-->>RPY: HWM checkpoints
  RPY-->>ADM: Progress (%, counts, ETA)

Checklist (quick start)

  • Define/import mapping with Importer@vN.
  • Dry-run backfill (classify/validate).
  • Run backfill with idempotency keys and historical segments sealing.
  • Spin up replay to audit_rm_v2.* (shadow build).
  • Compare v1/v2, cut over via synonym/view swap.
  • Archive manifests & proofs; audit everything (who/what/why).

Scale & Performance


Purpose

Achieve predictable low-latency ingest, fast queries, and stable exports under bursty multi-tenant load. Scale horizontally based on queue depth/lag, tune prefetch & concurrency per handler, and isolate hot tenants with dedicated partitions/queues. Provide concrete SLO targets and batch sizing guidance.


Scaling Strategy (overview)

flowchart LR
  API[HTTP/gRPC Gateways] --> IN[(Inbox+Dedupe Queues)]
  IN --> AS[AuditStream Workers]
  IN --> CL[Classification Workers]
  IN --> IL[Integrity Workers]
  IN --> RP[Retention Workers]
  IN --> EX[Export Workers]

  subgraph KEDA
    KD1[KEDA Scalers: queue length/lag]
  end

  KD1 -.-> AS
  KD1 -.-> IL
  KD1 -.-> RP
  KD1 -.-> EX
  • KEDA-style autoscaling on queue length and time-in-queue (lag).
  • Workers are stateless and scale out; DB & caches scale vertically/horizontally as needed.
  • Per-tenant partitions/queues available to isolate “noisy neighbors”.

Queues & Partitions

  • Default: shard by TenantId hash across N queues (balanced).
  • Hot tenants: promote to dedicated queue/partition with its own consumer group and scaling policy.
  • Ordering: per-tenant ordering guaranteed within a partition (keyed by (tenantId, category) when required).
  • DLQ: one per queue; monitored and drained by a quarantine worker.

KEDA Scalers (Azure Service Bus example)

  • Triggers:
    • messageCount > target (e.g., > 1,000 per partition)
    • queueLagSeconds > target (e.g., > 5s p95)
  • Scaling policy:
    • Min replicas = 1 (or 0 for cold paths like exports)
    • Max replicas determined per environment (e.g., 40 for ingest)
    • Cooldown 60–120s to avoid thrash

Sketch

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: auditstream-consumer
spec:
  scaleTargetRef:
    name: auditstream-worker
  pollingInterval: 5
  cooldownPeriod: 90
  minReplicaCount: 2
  maxReplicaCount: 40
  triggers:
    - type: azure-servicebus
      metadata:
        queueName: audit-ingress-q
        messageCount: "1000"
        activationMessageCount: "10"
        namespace: "***"
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: audit_consumer_lag_seconds
        threshold: "5"

Consumer Tuning (MassTransit / Service Bus)

  • Prefetch: start at 512–2048 messages per consumer; tune by payload size and handler time.
  • Concurrency:
    • ConcurrentMessageLimit: start at env(cores) × 2 per instance.
    • Ensure handler is CPU-light (I/O bound); otherwise lower concurrency to protect DB.
  • Lock management: use PeekLock; MaxAutoRenewDuration >= handler_p95 × 2.
  • Idempotency: maintain a compact dedupe store keyed by (tenantId, idempotencyKey) → recordId.

MassTransit setup (sketch)

cfg.ReceiveEndpoint("audit-ingress-q", e =>
{
    e.PrefetchCount = 1024;
    e.ConcurrentMessageLimit = Environment.ProcessorCount * 2;
    e.UseMessageRetry(r => r.Exponential(5, TimeSpan.FromSeconds(1), TimeSpan.FromSeconds(30), TimeSpan.FromSeconds(2)));
    e.UseInMemoryOutbox(); // local idempotent side-effects
    e.Consumer<AppendAuditRecordConsumer>();
});

Orleans & gRPC Gateways

  • Orleans: cap activations per grain; route by tenant key; use stateless workers for CPU-bound transforms; backpressure via maxActiveRequests.
  • gRPC (code-first): enable HTTP/2, set min thread pool for bursty loads, and bound request body size to protect memory.
  • Gateways: implement token-bucket rate limiting per tenant and circuit breakers to shed load under DB pressure.
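The per-tenant token bucket can be as small as this sketch (Python; capacity and refill numbers are illustrative, and the clock is injected for testability):

```python
import time

class TokenBucket:
    """Token-bucket limiter: allows bursts up to `capacity`, refills at
    `rate` tokens per second; the gateway keeps one bucket per tenant."""
    def __init__(self, capacity: float, rate: float, now=time.monotonic):
        self.capacity, self.rate, self.now = capacity, rate, now
        self.tokens, self.last = capacity, now()

    def try_acquire(self, cost: float = 1.0) -> bool:
        t = self.now()
        self.tokens = min(self.capacity, self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A request that fails `try_acquire` would be rejected with 429 at the gateway before it ever touches the queues or DB.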

Storage Throughput

  • Write path:
    • Use SqlBulkCopy/TVPs for BatchAppend (500–2,000 rows/batch).
    • Keep inserts ordered by (TenantId, Category, OccurredAtUtc) to minimize page splits.
    • Avoid per-row UDFs; compute hashes/redaction in the app.
  • Read path:
    • Projections serve 95% of queries; place hot views on columnstore when scans dominate.
    • Seek pagination only; never use OFFSET on hot tables.

SLOs & Guardrails

| Area | Target (p95 unless noted) | Notes |
|---|---|---|
| Ingest lag (produce → record_appended) | 5 s | Autoscale on lag; alert > 30 s |
| Append latency (API 201) | ≤ 50–100 ms | Single record; batch returns 202 |
| Consumer backlog | < 1× minute volume | Warn when > 5× |
| Export duration | ≤ 10 min / 10M rows | Scales with shards & format |
| Projector lag | ≤ 2–5 s | Timeline view |
| DLQ rate | ~0 | Page if sustained growth |

Reconfirm SLOs per environment/tenant class; attach exemplars to lag metrics for deep links.


Batch & Payload Sizing

  • BatchAppend:
    • 500–2,000 items per call; cap request body at 5–10 MB (gzip recommended).
    • Prefer many medium batches over a few giant ones to keep tail latency low.
  • Queue messages: keep payloads small; if > ~200 KB per item, consider reference + fetch pattern (store large deltas in Blob, send URI).

Hot Tenant Isolation

  • Promotion criteria: sustained >10% of total traffic or >50% of a shard’s backlog.
  • Actions:
    1. Dedicated queue/partition for that tenant.
    2. Per-tenant consumer deployment with its own scaler limits.
    3. Optional DB pool/read replica priority if available.

Backpressure & Load Shedding

  • HTTP/gRPC: 429 with jittered Retry-After when queues or DB lag exceed thresholds.
  • Queue consumers: temporarily lower concurrency or pause pulling when DB CPU > 80% or log write waits spike.
  • Exports: queue to a separate tier with strict concurrency caps; never compete with ingest.
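For the 429 path, computing a jittered Retry-After might look like this (Python sketch; the base and cap values are illustrative):

```python
import random

def retry_after_seconds(lag_seconds: float, base: float = 1.0, cap: float = 60.0) -> int:
    """Scale the hint with observed lag and apply jitter so retries from
    many clients do not synchronize into a thundering herd."""
    hint = min(cap, base + lag_seconds)
    return max(1, int(random.uniform(0.5 * hint, hint)))
```

The returned value goes into the Retry-After header alongside the 429 response.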

Capacity Planning (rules of thumb)

  • Compute: start at 1 core per ~1–3k msgs/s for light transforms; double for heavy classification.
  • Memory: size for prefetch window + concurrent batches; avoid large in-memory buffers.
  • DB: target < 70% CPU at peak; keep P99 log flush < 10ms; add tempdb and log throughput headroom.

Performance Testing

  • Synthetic load: mix (70% small with decisions, 20% with deltas, 10% export triggers).
  • Scenarios: steady-state, bursts (10× for 2 min), and soak (24–72h).
  • Checks: ingest lag, append latency, dedupe ratio, projector lag, export SLA, segment seal time, DLQ growth.
  • Chaos: inject transient DB errors and broker disconnects; verify retries and idempotency.

Example: Export Concurrency

  • Limit export workers to 1–2 per tenant; global cap (e.g., 8).
  • Chunk size: 50–200k records per artifact; stream to Blob; push manifests incrementally.
  • Throttle when ingest lag > target to favor writes over exports.

Caching

  • Edge cache: 5–15s TTL for dashboard timeline pages.
  • Redis: cache hot resource/actor timelines keyed by filter hash + cursor (TTL 30–60s).
  • ETag + If-None-Match on read APIs to cut egress.

Operational Settings (quick reference)

  • PrefetchCount: 512–2048
  • ConcurrentMessageLimit: cores × 2 (I/O bound) or cores (CPU bound)
  • BatchAppend: 500–2,000 items, ≤ 10 MB gz
  • Segment sealing: 10k records or 100 MB or 5 min (whichever first)
  • KEDA polling: 5s; cooldown 60–120s
  • Rate limits: tenant token-bucket sized to SLO + bursting 2–3×

Notes

  • Always prefer horizontal scale on workers and queues before vertical DB scale; then revisit indexes and partition strategy.
  • Keep idempotency cheap and O(1): write-through cache + SQL upsert on (tenantId, idempotencyKey).
  • Measure, don’t guess: tie autoscaling to lag rather than CPU alone.

Idempotency & Dedupe


Purpose

Guarantee exactly-once effects at the domain boundary despite at-least-once delivery across HTTP, gRPC (code-first), Service Bus/MassTransit, and Orleans. Enforce dedupe keyed by tenantId + idempotencyKey (or a deterministic content hash) and make all consumers idempotent. Define a clear conflict policy for exact vs. semantic duplicates.


Patterns

  • Outbox on write: aggregate state change and outbox event are committed atomically; a relay publishes later.
  • Inbox on consume: each consumer records processed eventId (or (aggregateId, sequence)), skipping repeats.
  • Idempotency gate on ingest: first persistent check keyed by (tenantId, idempotencyKey or contentHash).
sequenceDiagram
  participant Client
  participant Ingest as Ingest Adapter
  participant Gate as Idempotency Gate
  participant Store as Append-only Store
  participant Outbox as Outbox
  Client->>Ingest: AppendAuditRecord (Idempotency-Key)
  Ingest->>Gate: TryReserve(tenantId, key)
  alt duplicate
    Gate-->>Ingest: { status: duplicate, recordId }
    Ingest-->>Client: 200 OK (duplicate, recordId)
  else first-seen
    Gate-->>Ingest: { status: reserved }
    Ingest->>Store: Append(record)
    Store->>Outbox: Save event (same tx)
    Ingest-->>Client: 201 Created (recordId)
  end

Keys & Deterministic Hashing

  • Primary: client-provided Idempotency-Key (scoped to tenantId).
  • Fallback: deterministic content hash when clients can’t provide keys:
    • Canonicalize the material fields of the record (tenantId, occurredAtUtc, actor, action, resource, decision, before/after, policyVersion) — exclude transient fields (traceId, requestId, producer).
    • contentHash = SHA-256(UTF8(canonicalJson)).
  • Storage: only store hashed keys (sha256) with per-tenant salt to reduce leakage.

Examples

  • iam:pwd-change:u-12345:2025-09-30T12:34:56Z → sha256:ab12…
  • contentHash for the canonicalized payload → sha256:7f9c…
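The fallback is only deterministic if canonicalization is; a Python sketch using sorted-key JSON as the assumed canonical form and the material-field list above:

```python
import hashlib
import json

MATERIAL_FIELDS = ("tenantId", "occurredAtUtc", "actor", "action",
                   "resource", "decision", "before", "after", "policyVersion")

def content_hash(record: dict) -> str:
    """SHA-256 over a canonical JSON projection of the material fields;
    transient fields (traceId, requestId, producer) never participate."""
    material = {k: record[k] for k in MATERIAL_FIELDS if k in record}
    canonical = json.dumps(material, sort_keys=True,
                           separators=(",", ":"), ensure_ascii=False)
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two submissions that differ only in correlation fields hash identically, so retries dedupe even without a client-provided key.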

Conflict Policy

| Situation | Detection | Outcome |
|---|---|---|
| Exact duplicate (same material content) | Same (tenantId, idempotencyKey) or same contentHash and same stored payload hash | 200 OK, return recordId of the first write; increment dedupe metric |
| Semantic duplicate (same key but different material content) | Same (tenantId, idempotencyKey) paired with mismatched payload hash | 409 Conflict with problem details; no second write; log “idempotency_conflict” |
| Key reuse after TTL | Key TTL expired, but content matches stored recordId (via contentHash) | Treat as exact duplicate if content hash matches; else 409 Conflict |

Material comparison mask (server-defined): ignores correlation.*, preserves all fields that affect the business meaning.


Data Structures (SQL)

Idempotency gate

CREATE TABLE audit.Idempotency (
  TenantId           VARCHAR(64)   NOT NULL,
  KeyHash            CHAR(64)      NOT NULL,   -- sha256 of (tenant-salted key)
  RecordId           VARCHAR(26)   NOT NULL,
  PayloadHash        CHAR(64)      NOT NULL,   -- sha256 of canonical material content
  CreatedAtUtc       DATETIME2(3)  NOT NULL DEFAULT SYSUTCDATETIME(),
  ExpiresAtUtc       DATETIME2(3)  NOT NULL,   -- TTL window
  CONSTRAINT PK_Idempotency PRIMARY KEY (TenantId, KeyHash)
);
CREATE INDEX IX_Idempotency_Expires ON audit.Idempotency(ExpiresAtUtc);

Consumer inbox (per service)

CREATE TABLE audit.InboxOffsets (
  Consumer          VARCHAR(64)   NOT NULL,
  EventId           VARCHAR(50)   NOT NULL,   -- ULID/UUIDv7
  ProcessedAtUtc    DATETIME2(3)  NOT NULL DEFAULT SYSUTCDATETIME(),
  CONSTRAINT PK_InboxOffsets PRIMARY KEY (Consumer, EventId)
);

Outbox (per service)

CREATE TABLE audit.Outbox (
  EventId           VARCHAR(50)   NOT NULL,
  AggregateType     VARCHAR(64)   NOT NULL,
  AggregateId       VARCHAR(64)   NOT NULL,
  TenantId          VARCHAR(64)   NOT NULL,
  Name              VARCHAR(128)  NOT NULL,
  SchemaVersion     INT           NOT NULL,
  OccurredAtUtc     DATETIME2(3)  NOT NULL,
  PayloadJson       NVARCHAR(MAX) NOT NULL,
  PublishedAtUtc    DATETIME2(3)  NULL,
  CONSTRAINT PK_Outbox PRIMARY KEY (EventId)
);

Algorithms

TryReserve (idempotency gate)

-- Attempt to upsert the reservation; the caller distinguishes outcomes from
-- the OUTPUT row. (T-SQL permits only one WHEN MATCHED ... UPDATE clause, so
-- the duplicate/conflict branch is folded into a CASE.)
MERGE audit.Idempotency AS tgt
USING (SELECT @TenantId AS TenantId, @KeyHash AS KeyHash) AS src
ON (tgt.TenantId = src.TenantId AND tgt.KeyHash = src.KeyHash)
WHEN NOT MATCHED THEN
  INSERT (TenantId, KeyHash, RecordId, PayloadHash, ExpiresAtUtc)
  VALUES (@TenantId, @KeyHash, @RecordId, @PayloadHash, DATEADD(day, @TtlDays, SYSUTCDATETIME()))
WHEN MATCHED THEN
  UPDATE SET ExpiresAtUtc =
    CASE WHEN tgt.PayloadHash = @PayloadHash
         THEN DATEADD(day, @TtlDays, SYSUTCDATETIME())  -- exact duplicate: refresh TTL
         ELSE tgt.ExpiresAtUtc                           -- conflict: no write, no TTL refresh
    END
OUTPUT $action, inserted.RecordId, inserted.PayloadHash; -- PayloadHash <> @PayloadHash on a MATCHED row => conflict

Consumer Idempotency

  • On message receipt:
    1. If (consumer, eventId) exists → ACK & skip.
    2. Else BEGIN: handle message, write side-effects, then insert (consumer, eventId).
    3. COMMIT; any failure → retry (at-least-once) but inbox prevents duplicates.

Batch semantics

  • Reserve per-item; compute per-item statuses: created | duplicate | conflict | error.
  • Return 202 with multi-status body.

TTL & Windows

  • Key TTL: 7 days (≥ max client retry horizon); configurable per tenant.
  • Inbox TTL: 7–30 days depending on bus re-delivery SLAs and recovery plans.
  • Background jobs purge expired rows (ExpiresAtUtc < now).

Transports & Gateways

  • HTTP: require Idempotency-Key header for online writes; 200 (duplicate) vs. 409 (conflict).
  • gRPC (code-first): metadata carries key; same semantics.
  • Service Bus/MassTransit: key in headers; inbox & gate run in the consumer.
  • Orleans: ingest grain uses the shared gate; grain replays are safe due to reserve-first semantics.

Edge Cases

  • Clock skew: identical keys but slightly different occurredAtUtc; treat as semantic conflict if outside skew tolerance.
  • Producer bugs (random keys): fall back to contentHash gate to suppress floods of accidental duplicates.
  • Backfill: use deterministic keys derived from source record identifiers; disable per-request idempotency header requirement, but keep gate active.

Observability

  • Metrics:
    • audit_dedupe_hits_total{tenant}
    • audit_idempotency_conflicts_total{tenant}
    • audit_inbox_skips_total{consumer}
    • audit_outbox_backlog{service}
  • Logs (PII-safe): audit.append.duplicate, audit.append.conflict, with hashed keys.
  • Traces: tag spans with audit.key.hash and dedupe="hit|miss|conflict".

Code Sketches

Adapter pipeline (C#)

// Precomputed
var keyHash = HashKey(tenantId, idempotencyKey, tenantSalt);
var payloadHash = HashCanonicalMaterial(record);

var reserve = await _idempotency.TryReserveAsync(tenantId, keyHash, payloadHash, ct);
switch (reserve.Status)
{
    case ReserveStatus.Duplicate:
        return Ok(new { id = reserve.RecordId, status = "duplicate" });
    case ReserveStatus.Conflict:
        return Conflict(Problem("Idempotency key reused with different content."));
    case ReserveStatus.Reserved:
        var id = await _records.AppendAsync(record, ct);   // writes outbox in same tx
        await _idempotency.BindRecordAsync(tenantId, keyHash, id, ct); // if needed
        return Created(id);
}

Consumer (MassTransit)

public class AuditEventConsumer : IConsumer<DomainEventEnvelope<AuditRecordAppendedV1>>
{
    // Illustrative abstractions over the inbox table, the transactional
    // connection, and the projection writer (injected via DI).
    private readonly IInbox _inbox;
    private readonly IAuditDb _db;
    private readonly IAuditProjection _projection;

    public AuditEventConsumer(IInbox inbox, IAuditDb db, IAuditProjection projection)
        => (_inbox, _db, _projection) = (inbox, db, projection);

    public async Task Consume(ConsumeContext<DomainEventEnvelope<AuditRecordAppendedV1>> ctx)
    {
        if (await _inbox.SeenAsync("audit-projection", ctx.Message.EventId)) return;

        using var tx = await _db.BeginTransactionAsync();
        await _projection.UpsertAsync(ctx.Message, tx);
        await _inbox.MarkAsync("audit-projection", ctx.Message.EventId, tx);
        await tx.CommitAsync();
    }
}

Guidance

  • Prefer client keys for online paths; use contentHash for bulk/backfill or to guard against missing keys.
  • Keep the comparison mask stable and documented; breaking changes to “material fields” must bump its version.
  • Do not leak raw keys; hash with tenant salt before storing or logging.
  • Handle semantic conflicts early and loudly (409 + alert) — they usually indicate a producer bug.
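The “hash with tenant salt” guidance (the HashKey call in the adapter pipeline above) can be sketched with HMAC-SHA256, using the per-tenant salt as the key; the `KeyHasher` name and input framing are illustrative:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class KeyHasher
{
    // Tenant-salted keyed hash: raw idempotency keys never reach storage or
    // logs. HMAC-SHA256 yields a fixed-width value (64 hex chars) that fits
    // the CHAR(64) KeyHash column.
    public static string HashKey(string tenantId, string idempotencyKey, byte[] tenantSalt)
    {
        using var hmac = new HMACSHA256(tenantSalt);
        var bytes = hmac.ComputeHash(Encoding.UTF8.GetBytes($"{tenantId}:{idempotencyKey}"));
        return Convert.ToHexString(bytes).ToLowerInvariant();
    }
}
```

Rotating the tenant salt invalidates stored key hashes, so rotation should be coordinated with the idempotency TTL window.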

Classification/Redaction Rules Engine


Purpose

Implement a deterministic, high-throughput rules engine that classifies fields and applies hash/mask/drop/tokenize transforms before persistence, and optionally on read for permitted elevated views. The engine is server-authoritative, tenant-aware, versioned, and designed so that no PII ever reaches logs or events.


Architecture

flowchart LR
  IN["Ingress (HTTP/gRPC/Bus/Orleans)"] --> MAP["Field Mappers<br/>(pre-classification)"]
  MAP --> CLS["Classifier<br/>(policy + heuristics)"]
  CLS --> PLAN["Plan Compiler<br/>(field→rule)"]
  PLAN --> RDX["Redaction Engine<br/>(write-time)"]
  RDX --> APPEND[("Append-only Store")]

  QRY["Query APIs"] --> READPLAN["Read Filter Plan<br/>(profile)"]
  READPLAN --> RDX2["Redaction Engine<br/>(read-time)"]
  RDX2 --> OUT["Response (default redacted)"]
  • Field Mappers extract candidate fields (e.g., before.fields.email, context.ip).
  • Classifier assigns DataClass using policy rules + hints + heuristics.
  • Plan Compiler resolves effective rule per field (class rule + field override).
  • Redaction Engine applies transforms:
    • Write-time (mandatory, persistent)
    • Read-time (optional, profile-based: default, elevated)

Data Structures

Policy (recap, per-tenant, versioned)

{
  "policyId": "pol-abc",
  "version": 7,
  "effectiveFromUtc": "2025-09-30T12:00:00Z",
  "defaultByField": {
    "before.fields.email": "PERSONAL",
    "after.fields.email": "PERSONAL",
    "context.ip": "INTERNAL",
    "after.fields.apiKey": "CREDENTIAL"
  },
  "rulesByClass": {
    "PUBLIC":    {"kind":"NONE"},
    "INTERNAL":  {"kind":"MASK","params":{"showLast":4}},
    "PERSONAL":  {"kind":"HASH","params":{"algo":"SHA256"}},
    "SENSITIVE": {"kind":"MASK","params":{"showLast":2}},
    "CREDENTIAL":{"kind":"DROP"},
    "PHI":       {"kind":"TOKENIZE","params":{"ns":"phi"}}
  },
  "overridesByField": {
    "context.userAgent": {"kind":"MASK","params":{"showFirst":10}},
    "after.fields.apiKey": {"kind":"DROP"}
  }
}

Rule kinds

  • NONE — keep as is.
  • MASK — retain portions (showFirst, showLast, maskChar, keepSeparators).
  • HASH — one-way hash; per-tenant salt (Key Vault/HSM) + algorithm (e.g., SHA256).
  • DROP — remove value entirely.
  • TOKENIZE — store token; resolution requires explicit privileged service & is fully audited.

Field Mapping

  • Targets: context.*, before.fields[*], after.fields[*], headers mirrored into payload (if any).
  • Paths: dot-notation with wildcards, e.g., before.fields.*, after.fields.password*.
  • Heuristics (best-effort, server-side): detect emails, E.164 phones, credit-card-like numbers, secrets (common prefixes), IPs. Heuristics never override an explicit CREDENTIAL/PHI rule to a weaker rule.

Evaluation & Precedence

  1. Explicit field override (overridesByField)
  2. Class rule via defaultByField or heuristic + hint
  3. Fallback class: INTERNAL
  4. Read profile may apply additional masking only (cannot reverse HASH/DROP; TOKENIZE resolution requires privileged path)

Server computes RedactionPlan { byField, policyVersion } and stores only the transformed values + policyVersion.


Write-Time Pipeline (pseudocode)

var policy = _policyProvider.GetActive(tenantId);            // v7
var mapped = _mapper.MapTargets(record);                     // yields (path,value)
var classes = _classifier.Classify(mapped, policy, hints);   // path -> DataClass
var plan = _planner.Compile(policy, classes);                // path -> RedactionRule

foreach (var t in mapped)
{
    var rule = plan.For(t.Path);
    var norm = _normalizer.Normalize(t.Path, t.Value);       // emails lowercased, IP canonical
    var red  = _redactor.Apply(rule, norm, tenantSalt);      // PII-safe
    record = _writer.Write(record, t.Path, red);
}

record.PolicyVersion = policy.Version;

Normalization ensures deterministic hashing/masking:

  • Email: lowercased, Unicode normalized.
  • Phone: E.164.
  • IP: RFC-compressed IPv6, no port.
  • UA: trimmed to max length.
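A minimal normalizer sketch for the cases above (the `Normalizer` name and the 256-char UA cap are assumptions; IPv6 canonicalization leans on the framework's parser):

```csharp
using System;
using System.Net;
using System.Text;

public static class Normalizer
{
    public static string Email(string value) =>
        value.Trim().Normalize(NormalizationForm.FormC).ToLowerInvariant();

    // IPAddress round-trips IPv6 in compressed form; port stripping is
    // assumed to happen earlier, in the field mapper.
    public static string Ip(string value) => IPAddress.Parse(value).ToString();

    public static string UserAgent(string value, int maxLen = 256) =>
        value.Length <= maxLen ? value : value[..maxLen];
}
```

Normalizing before hashing is what makes HASH outputs stable, so these functions must be versioned with the policy.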

Read-Time Profiles

| Class | Default profile | Elevated profile |
| --- | --- | --- |
| PUBLIC/INTERNAL | NONE/MASK(…4) | may relax to MASK(…8) |
| PERSONAL | HASH | stays HASH (non-reversible) |
| SENSITIVE | MASK(…2) | may relax to MASK(…4) with auditor scope |
| CREDENTIAL | DROP | stays DROP |
| PHI | TOKENIZE | return token only; resolution via dedicated PHI service & separate policy |
  • Read profile chosen by API layer (X-Redaction-Level: default|elevated) + scopes.
  • Engine never reconstructs dropped or hashed values.

Plug-in Model

Interfaces

public interface IFieldMapper { IEnumerable<Target> Map(AuditRecord r); }
public interface IClassifier  { DataClass Classify(Target t, Policy p, Hints h); }
public interface IPlanner     { RedactionPlan Compile(Policy p, IDictionary<string,DataClass> classes); }
public interface IRedactor    { object Apply(RedactionRule rule, object value, byte[]? tenantSalt=null); }
public interface IReadFilter  { object Filter(object storedValue, Rule rule, RedactionProfile profile); }
  • Register mappers for custom producers (e.g., billing deltas, IAM extras).
  • Add classifier enrichers (regex, ML heuristics if ever needed) behind the policy.
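As a concrete plug-in example, the core of a MASK transform might look like this (a standalone sketch; the `MaskRedactor` name is illustrative, and an `IRedactor` implementation would wrap it after extracting showFirst/showLast/maskChar from the rule params):

```csharp
public static class MaskRedactor
{
    // MASK with showFirst/showLast/maskChar as in the policy params above,
    // e.g. "2001:db8::1" with showLast=4 -> "*******8::1".
    public static string Apply(string value, int showFirst = 0, int showLast = 0, char maskChar = '*')
    {
        // Too short to reveal anything safely: mask everything.
        if (showFirst + showLast >= value.Length)
            return new string(maskChar, value.Length);

        return value[..showFirst]
             + new string(maskChar, value.Length - showFirst - showLast)
             + value[^showLast..];
    }
}
```

Failing closed (full mask) on short values keeps the transform safe when a rule was tuned for longer inputs.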

Performance & Caching

  • Cache policy by (tenantId, version); TTL aligned to policy change events.
  • Memoize compiled plans for frequent field sets (e.g., same resource schema).
  • Vectorize hashing by batch (SIMD) when applying HASH to large lists.
  • Target < 1 ms / record for typical 10–30 field transforms; benchmark under P95 load.

Observability Guardrails

  • No PII in logs: the redaction engine exposes counters only:
    • rules_applied_total{tenant,kind} (HASH, MASK, DROP, TOKENIZE)
    • classification_fallback_total{tenant} (heuristic used)
  • Errors contain paths, rule kinds, never raw values.
  • Traces include policy.version, targets.count, rules.masked.count, not the data.

Testing

  • Golden tests: fixtures per class and per rule; assert exact outputs (hashes for given salt).
  • Property tests: secrets never leak in logs/events; masking length consistent.
  • Compatibility tests: same input + same policy → identical stored outputs across versions.
  • Performance tests: throughput and P95 latency on mixed payloads.

Example

Input (hints)

{
  "context": { "ip": "2001:db8::1", "userAgent": "Mozilla/5.0 ..." },
  "before": { "fields": { "email": "Alice@Example.com", "apiKey": "sk_live_123" } },
  "after":  { "fields": { "email": "Alice@Example.com", "apiKey": "sk_live_123" } },
  "hints": { "before.fields.email": "PERSONAL", "after.fields.apiKey": "CREDENTIAL" }
}

Stored (policy v7)

{
  "context": { "ip": "*******8::1", "userAgent": "Mozilla/5..." },
  "before": { "fields": { "email": "HASH:sha256:7b1d...", "apiKey": null } },
  "after":  { "fields": { "email": "HASH:sha256:7b1d...", "apiKey": null } },
  "policyVersion": 7
}

Read (elevated) shows the same for PERSONAL (hash) and CREDENTIAL (dropped), but may show a longer UA slice (per override).


Failure Modes & Safeguards

  • Rule missing: default to the stronger class (SENSITIVE → MASK); never NONE.
  • Oversized values: fall back to HASH or DROP to avoid memory pressure.
  • Unsupported encodings: normalize or DROP with diagnostic code (no value in logs).
  • Policy change mid-batch: lock to policy version at batch start (store policyVersion per record).

Admin & DSL (optional convenience)

Define field rules via a compact DSL in admin UI:

INTERNAL: context.userAgent -> MASK(showFirst=12)
PERSONAL: before.fields.email -> HASH(algo=SHA256)
CREDENTIAL: after.fields.apiKey -> DROP
PHI: after.fields.diagnosisCode -> TOKENIZE(ns=phi)
  • Compiler translates DSL → overridesByField + rulesByClass.
  • Validation prevents weakening CREDENTIAL/PHI.

Security Notes

  • Tenant salt sourced from HSM/Key Vault; rotated; hash results are stable per version.
  • Tokenization vault isolated; resolutions require separate privileged service with strong auditing.
  • Hash outputs and tokens are treated as sensitive-but-safe; still excluded from logs unless necessary (and then truncated).

Integration Points

  • Ingest adapters (HTTP/gRPC/Bus/Orleans) call IClassificationPipeline.ProcessAsync before persistence.
  • Query layer requests a read profile and applies IReadFilter to the stored payload.
  • Event payloads use minimized fields; never include raw sensitive values.

Metrics (quick list)

  • classification_decisions_total{tenant,class}
  • redaction_rules_applied_total{tenant,kind}
  • redaction_write_duration_ms (histogram)
  • redaction_read_duration_ms (histogram)
  • Alerts on any DROP/CREDENTIAL occurrences > 0 for fields not in allowlist (configuration drift).

Access Control & Purpose Limiting


Purpose

Enforce least privilege and purpose-limited access to audit data and admin capabilities. Admin actions (e.g., exports, legal holds) require higher scopes, MFA, and optionally dual-control (“four eyes”). Policy evaluation plugs into the platform’s RBAC/ABAC engine and uses the existing Access Type = Audit category in the platform catalog.


Control Points

  • PEP @ Gateway

    • Verifies OAuth2/OIDC token, MFA signals, tenant binding, scopes.
    • Requires Tenant-Id header and enforces no cross-tenant unless break-glass.
    • Accepts X-Purpose and X-Redaction-Level headers; validates they match scopes.
  • PEP @ Service Boundary

    • Re-evaluates RBAC/ABAC against Audit access type.
    • Fetches policy decision from PDP (cached); logs minimal decision info.
    • Stamps the effective purpose, scopes, and decision into a minimal audit entry (“audit of the auditor”).
  • PDP (Policy Decision Point)

    • Evaluates RBAC (role/entitlement) and ABAC (attributes: tenant, category, time, purpose, action prefix, resource).
    • Supports dual-control requirements and “break-glass” overlays.

Token & Request Requirements

  • Claims (must be present/validated):

    • sub / client_id — caller identity
    • tenant_id — must match Tenant-Id header
    • scope — action-specific (see matrix)
    • amr or acr — indicates MFA (e.g., amr contains otp/hwk) for admin/elevated actions
    • auth_time — recent sign-in check for sensitive ops (≤ N minutes)
    • jti — for replay protection (logged)
    • optional cnf — key-binding for stronger tokens
  • Headers

    • Tenant-Id: <id> — required
    • X-Purpose: <reason> — required for elevated reads/admin (e.g., security-investigation:INC-12345)
    • X-Redaction-Level: default|elevatedelevated requires audit.read.elevated

Scope & Capability Matrix (Access Type = Audit)

| Capability | Endpoint(s) | Required Scope(s) | Extras |
| --- | --- | --- | --- |
| Timeline (redacted) | GET /audit/timeline | audit.read.timeline | ABAC: tenant match; optional category filter |
| Decision log | GET /audit/decision-log | audit.read.decisions | ABAC: tenant match; outcome filter |
| Elevated read | any read with X-Redaction-Level: elevated | audit.read.elevated + one of the read scopes | MFA + X-Purpose required |
| Start export | POST /audit/exports | audit.export.start | MFA + X-Purpose; optional dual-control |
| Export status/download | GET /audit/exports/{id} | audit.export.read | Tenant & job ownership check |
| Legal hold apply/release | /audit/admin/legal-holds* | audit.admin.policy | MFA + X-Purpose; optional dual-control |
| Retention policy change | /audit/admin/retention-* | audit.admin.policy | MFA + X-Purpose; compliance locks respected |
| Proof lookup | GET /audit/proofs* | audit.read.proofs | Tenant check |
| Platform break-glass (cross-tenant) | any | platform.admin.audit | Dual-control, time-boxed token, mandatory purpose |

ABAC Inputs (evaluated by PDP)

  • Subject: user/service identity, roles, groups, scopes, amr/acr, auth_time.
  • Resource: { tenantId, category, actionPrefix, resourceType, resourceId }.
  • Context: { purpose, redactionLevel, requestTime, callerIp, clientApp }.
  • Policy examples:
    • Allow elevated read only if purpose starts with security- or compliance- and amr indicates MFA.
    • Deny export if requested time range exceeds max window for role AuditorLite.
    • Require dual-control for legalhold.apply when category in ('billing','phi').

Decisions are cached short-term (e.g., 60s) keyed by {sub, tenant, capability, purpose} with purpose in the cache key.


Dual-Control (optional policy)

  • When: StartExport(elevated), PlaceLegalHold, ReleaseLegalHold, break-glass.
  • Flow:
    1. Initiator calls endpoint → PendingApproval created (approvalId, policy, purpose, ttl).
    2. Approver (different principal) confirms with approvalId + MFA.
    3. PEP checks both approvals; PDP returns Permit; operation proceeds.
sequenceDiagram
  participant I as Initiator
  participant API as Audit API
  participant WF as Approval Service
  participant A as Approver
  I->>API: StartExport (elevated, X-Purpose)
  API->>WF: Create PendingApproval
  API-->>I: 202 Accepted (approvalId)
  A->>WF: Approve(approvalId) + MFA
  WF-->>API: Approved
  API->>API: Execute Export
  API-->>I: 201 Created (jobId)
  • Constraints: approver must be in allowed group, not equal to initiator, approval expires (e.g., 15 min).

Purpose Limiting

  • Mandatory for elevated/admin via X-Purpose.

  • Stored alongside the request metadata; echoed in the minimal self-audit entries (who/what/why).

  • Purpose catalog (examples):

    • security-investigation:<ticket>
    • compliance-audit:<period>
    • legal-ediscovery:<caseId>
    • operational-troubleshooting:<incident> (not eligible for elevated PII views)
  • Redaction coupling: purpose + scope gate whether X-Redaction-Level: elevated is honored.


Break-Glass (cross-tenant)

  • Requires scope platform.admin.audit and dual-control.
  • Issues time-boxed token (≤ 15 min), logs reason, approver, exact filters used, row counts returned.
  • Responses remain redacted by default; elevated requires explicit approval and is audited separately.

Decision Model (pseudo-policy)

# RBAC (entitlements)
roles:
  Auditor:
    scopes: [audit.read.timeline, audit.read.decisions, audit.read.proofs]
  AuditorElevated:
    scopes: [audit.read.elevated]
  AuditAdmin:
    scopes: [audit.admin.policy, audit.export.start, audit.export.read]
  PlatformAdmin:
    scopes: [platform.admin.audit]

# ABAC (Obligations)
rules:
  - when: request.capability == "read:elevated"
    require:
      - token.scopes includes "audit.read.elevated"
      - token.amr contains any ["otp","hwk"]
      - request.purpose matches ^(security|compliance|legal)-.*
  - when: request.capability == "export:start"
    require:
      - token.scopes includes "audit.export.start"
      - dual_control == true if request.redactionLevel == "elevated"
  - when: request.capability in ["legalhold:apply","legalhold:release"]
    require:
      - token.scopes includes "audit.admin.policy"
      - dual_control == true if request.categories intersects ["billing","phi"]
  - when: request.crossTenant == true
    require:
      - token.scopes includes "platform.admin.audit"
      - dual_control == true
      - token.ttl <= 15m

Failure Responses (problem+json)

  • 403 Insufficient scope — missing required scope for capability.
  • 403 MFA required — token lacks strong amr for elevated/admin.
  • 403 Purpose requiredX-Purpose header missing or disallowed for capability.
  • 409 Approval pending — dual-control not completed (approvalId returned).
  • 423 Compliance lock — policy forbids weakening or category is under legal restrictions.

All denials emit a minimal self-audit (no PII) with reason codes.


Operational Guidance

  • Cache PDP decisions (≤ 60s) with tenant & purpose keys; bust cache on role/entitlement updates.
  • Enforce freshness for admin actions: require auth_time within N minutes (e.g., 15).
  • Rotate admin access logs to a secure enclave; restrict visibility.
  • Provide admin dashboard listing: active purposes, pending approvals, recent elevated reads/exports.

Minimal Self-Audit (examples)

{ "event":"AuditorAccess.Requested","tenantId":"t-9c8f1","actor":"u-42",
  "capability":"read:elevated","purpose":"security-investigation:INC-12345","decision":"allow" }

{ "event":"DualControl.Approved","tenantId":"t-9c8f1","initiator":"u-42",
  "approver":"u-77","operation":"StartExport","expiresAtUtc":"2025-09-30T13:05:00Z" }

{ "event":"BreakGlass.Used","actor":"admin-u1","scope":"platform.admin.audit",
  "tenants":["t-9c8f1","t-7df2a"],"purpose":"legal-ediscovery:CASE-555","durationSec":600 }

Notes

  • Access checks are performed before reading payloads; default redacted responses unless elevated passes policy.
  • “Audit” access type in the platform catalog should map these scopes/roles to standard entitlements (e.g., Audit.Read, Audit.Read.Elevated, Audit.Export.Manage, Audit.Policy.Manage, Audit.Platform.BreakGlass).
  • All admin/elevated paths require MFA and are audited with purpose and outcome.

Exports & eDiscovery


Purpose

Provide a controlled, purpose-limited way to extract tenant-scoped audit data for investigations and regulatory requests. The ExportJob aggregate takes a query snapshot, runs a time-bounded extraction, produces content hashes, delivers artifacts via signed URLs, and emits a manifest that references integrity proofs (segment roots, optional membership proofs). Every export is itself audited with who/when/why.


Aggregate & Lifecycle

stateDiagram-v2
  [*] --> Requested
  Requested --> SnapshotTaken : freeze filters + redactionLevel + policyVersion
  SnapshotTaken --> Running : plan shards/chunks
  Running --> ChunkReady : stream chunk → blob (repeat)
  Running --> Completed : all chunks uploaded + manifest signed
  Running --> Failed : error (retriable or terminal)
  Requested --> Canceled
  Running --> Canceled

Invariants

  • Tenant-scoped; no cross-tenant data.
  • Snapshot captures: filters, time range, policyVersion, redactionProfile, classification overrides, and purpose.
  • Results are immutable once completed; new export required for changes.
  • Legal hold/retention rules apply; held records are included (they motivated the export), but purge never occurs as part of export.

Events

  • export.started, export.chunk_ready, export.completed, export.failed.

Query Snapshot

Frozen at job start to ensure repeatability:

{
  "tenantId": "t-9c8f1",
  "range": { "from": "2025-09-01T00:00:00Z", "to": "2025-09-30T23:59:59Z" },
  "filters": {
    "actor": "u-12345",
    "resource": "User:u-12345",
    "action": "User.",
    "category": "identity",
    "decision": "allow"
  },
  "redactionLevel": "default|elevated",
  "policyVersion": 7,
  "purpose": "ediscovery:case-12345",
  "createdBy": "u-42",
  "createdAtUtc": "2025-09-30T12:40:00Z"
}

Manifest (signed)

Each completed export publishes a manifest describing the job, artifacts, and verifiability metadata.

{
  "type": "audit.export-manifest",
  "schemaVersion": 2,
  "job": {
    "jobId": "exp-01J9Z7E8AVB2M7",
    "tenantId": "t-9c8f1",
    "state": "completed",
    "snapshot": { /* query snapshot from above */ },
    "recordCount": 12840,
    "format": "jsonl",
    "createdAtUtc": "2025-09-30T12:40:00Z",
    "completedAtUtc": "2025-09-30T12:55:10Z"
  },
  "artifacts": [
    {
      "name": "part-00001.jsonl.gz",
      "url": "https://signed.example.com/.../part-00001.jsonl.gz?sig=...",
      "expiresAtUtc": "2025-09-30T15:10:00Z",
      "bytes": 52428800,
      "hash": { "algo": "sha256", "value": "ab12..." }
    }
  ],
  "integrity": {
    "segments": [
      {
        "category": "identity",
        "segmentId": "seg-2025-09-30-identity-0012",
        "rootHash": "f0c4...",
        "chainTipHash": "9ab7...",
        "hashAlgorithm": "sha256",
        "proofUrl": "https://signed.example.com/proofs/seg-...json?sig=..."
      }
    ],
    "membershipProofs": "on-demand"  // or "embedded"
  },
  "signing": { "kid": "key-tenant-t-9c8f1-2025-09", "signature": "base64url(...)" }
}
  • Signed URLs expire quickly; rehydration requires re-authorization.
  • Signing uses per-tenant keys in HSM/Key Vault; kid allows offline verification.
  • Artifacts are chunked (e.g., 50–200k records per file) to support resumable downloads.

Formats

JSONL (canonical, recommended) One canonical record per line (already post-redaction):

{"recordId":"01J9Z7B...","occurredAtUtc":"2025-09-30T12:34:56Z","category":"identity","action":"User.PasswordChanged","actor":{"type":"user","id":"u-12345"},"resource":{"type":"User","id":"u-12345"},"decision":{"outcome":"allow"},"context":{"ip":"2001:db8::1","userAgent":"Mozilla/5..."},"before":{"fields":{"email":"HASH:sha256:7b1d..."}},"after":{"fields":{"email":"HASH:sha256:7b1d..."}},"policyVersion":7}

CSV (safe subset) Flattened columns only; nested before/after deltas summarized or excluded depending on policy. Example columns:

recordId,occurredAtUtc,category,action,actorType,actorId,resourceType,resourceId,decisionOutcome,policyVersion

CSV never contains raw PII; any sensitive columns are hashed/masked or omitted per policy.


API & Contracts (code-first)

Start export

POST /audit/exports
Authorization: Bearer <token with audit.export.start [+ audit.read.elevated if requested]>
Tenant-Id: t-9c8f1
X-Purpose: ediscovery:case-12345
Content-Type: application/json

Body = snapshot fields; response 202 Accepted with jobId.

Poll status / fetch manifest

GET /audit/exports/{jobId}
Authorization: Bearer <audit.export.read>
Tenant-Id: t-9c8f1

Returns state, counts, and manifest URL when completed.

Server interfaces (C#)

public record StartExportCommand(ExportSnapshot Snapshot, RedactionLevel Level);
public interface IExportService {
  Task<StartExportResult> StartAsync(StartExportCommand cmd, CancellationToken ct);
  Task<ExportStatus> GetAsync(string jobId, CancellationToken ct);
}

Planning & Execution

  • Shard plan: split query window into time slices (e.g., 1–5 minutes) or row-count slices; parallelize per slice with tenant-safe concurrency.
  • Chunking: stream rows → gzip → upload; compute per-chunk sha256 and append to manifest.
  • Idempotency: StartExport is idempotent on (tenantId, snapshotHash) within a TTL to avoid duplicate runs.
  • Resumability: on retry, already-uploaded chunk hashes are validated and skipped.
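The chunking step (stream rows → gzip → hash) can be sketched as follows; this hashes the compressed bytes so the manifest hash can be checked against the artifact exactly as downloaded (the `ChunkWriter` name and in-memory buffering are illustrative — a real worker streams to blob storage):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Security.Cryptography;
using System.Text;

public static class ChunkWriter
{
    public static (byte[] GzipBytes, string Sha256Hex) WriteChunk(IEnumerable<string> jsonlRecords)
    {
        using var buffer = new MemoryStream();
        using (var gz = new GZipStream(buffer, CompressionLevel.Fastest, leaveOpen: true))
        using (var writer = new StreamWriter(gz, new UTF8Encoding(false)))
        {
            foreach (var line in jsonlRecords)
                writer.WriteLine(line); // one canonical (post-redaction) record per line
        }
        var bytes = buffer.ToArray();
        var hash = Convert.ToHexString(SHA256.HashData(bytes)).ToLowerInvariant();
        return (bytes, hash);
    }
}
```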

Integrity & Verification

  • Manifest references the segment proofs covering the exported records.
  • Optionally embed membership proofs per record; otherwise provide an endpoint:
    • GET /audit/proofs/record/{recordId} → Merkle path + root.
  • Offline verification:
    1. Check manifest signature (kid).
    2. Verify each artifact hash.
    3. Verify segment proof signatures and root hashes.
    4. Optionally verify record membership with Merkle paths.
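Step 2 of offline verification (recompute each artifact hash and compare against the manifest) reduces to a sketch like this; the `ArtifactVerifier` name is illustrative:

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class ArtifactVerifier
{
    // Hashes the downloaded artifact and compares with the manifest entry
    // (hex, case-insensitive). Signature and proof checks are separate steps.
    public static bool VerifyArtifact(Stream artifact, string expectedSha256Hex)
    {
        using var sha = SHA256.Create();
        var actual = Convert.ToHexString(sha.ComputeHash(artifact));
        return string.Equals(actual, expectedSha256Hex, StringComparison.OrdinalIgnoreCase);
    }
}
```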

Security & Governance

  • Purpose-limited: X-Purpose required; logged in minimal self-audit entries.
  • Scopes: audit.export.start / audit.export.read; elevated redaction requires audit.read.elevated and MFA (see Access Control cycle).
  • Dual-control (optional policy): exports of sensitive categories or elevated views require a second approver.
  • Watermarking: manifest includes tenant, jobId, purpose, and time window; artifacts contain headers with the same meta.
  • Retention: export artifacts have TTL; manifests remain for audit. Revocation = URL expiry or rotation.

Operational SLOs

  • Throughput: ≥ 1M records/minute per worker under columnstore projections (environment-dependent).
  • Export duration: ≤ 10 min per 10M records (target p95).
  • Manifest availability: immediate after last chunk upload (< 60s).
  • Backpressure: pause or slow exports if ingest lag > SLO (favor writes).

Observability

  • Metrics:
    • audit_export_started_total{tenant}
    • audit_export_records_total{tenant}
    • audit_export_duration_seconds{tenant,format,redaction} (histogram)
    • audit_export_chunk_bytes_total{tenant}
    • audit_export_failures_total{reason}
  • Traces: audit.export.run, audit.export.chunk, attributes include tenant.id, slice, chunkIndex, records, hash.
  • Logs: no PII; include jobId, purpose, counts, durations, and error codes.

Failure Modes & Recovery

  • Partial upload: verify chunk hash → skip re-upload; otherwise re-stream.
  • Timeouts: resume from last successful chunk; keep shard cursors.
  • Policy/hold changes mid-run: job executes against frozen snapshot; new export required for new policy.
  • Quota exceeded: throttle and surface 429; job remains running with backoff.

“Audit of the Auditor”

Every export produces minimal audit entries:

  • Export.Requested — actor, purpose, snapshot hash, redaction level.
  • Export.Completed — jobId, counts, duration, manifest hash, proof IDs.
  • Export.Accessed — when artifacts or manifest are downloaded (subject, purpose).

Diagram — End-to-End Export

sequenceDiagram
  participant U as Auditor (caller)
  participant API as Export API
  participant Q as Planner/Workers
  participant SQL as Read Models
  participant B as Blob (Artifacts/Manifest)
  participant IL as IntegrityLedger
  participant BUS as Service Bus

  U->>API: POST /audit/exports (snapshot + purpose)
  API-->>BUS: export.started
  API->>Q: Plan(shards, chunks)
  loop per shard
    Q->>SQL: Stream rows (snapshot policyVersion, redacted)
    Q->>B: Upload chunk (hash)
    Q-->>BUS: export.chunk_ready
  end
  Q->>IL: Collect segment proofs
  Q->>B: Write signed manifest (hashes + proofs)
  Q-->>BUS: export.completed
  API-->>U: GET /audit/exports/{id} → manifest URL

Admin Knobs

  • Max range/window per job, max concurrent jobs per tenant, chunk size, URL expiry, inclusion mode for membership proofs (embedded|on-demand), and pause-on-ingest-lag threshold.

Edge Integrations & Webhooks


Purpose

Expose signed webhooks for high-value audit events (e.g., admin actions, elevated reads, legal holds, exports) to external consumers (SIEM/SOAR/GRC). Preserve tenant isolation, event-first guarantees, and PII minimization. Enforce HMAC signatures, idempotency (Event-Id), replay protection (age limit), and exponential backoff retries. Any inbound third-party callbacks terminate at the gateway/inbox and pass the same dedupe and security checks.


Event Selection (high-value only)

Default outbound set (extensible via subscription filters):

  • AuditorAccess.Requested|Granted|Denied
  • Export.Requested|Completed|Accessed
  • LegalHold.Applied|Released
  • Retention.PolicyChanged|PurgeCompleted
  • BreakGlass.Used
  • Integrity.SegmentSealed (metadata only, no leaf data)

Payloads are minimal and redacted by policy; no secrets/PII.


Delivery Semantics

  • At-least-once delivery with outbox at source, inbox/dedupe at subscriber.
  • Ordering per subscriber is best-effort and preserved within a partition key (tenantId); do not assume global order.
  • Idempotency via Event-Id—subscribers must store and skip repeats.

Security

  • HMAC signature over the exact HTTP body + timestamp using a per-subscription secret (rotatable).
  • Headers (all required):
    • X-Webhook-Id: subscription id
    • X-Webhook-Event: event name (e.g., export.completed)
    • X-Webhook-Tenant: tenant id
    • X-Webhook-Timestamp: UNIX epoch seconds (server time)
    • X-Webhook-Signature: t=<ts>,v1=<hex(HMAC_SHA256(secret, t + "." + body))>
  • Replay protection: receivers must reject if |now - X-Webhook-Timestamp| > 300s (configurable).
  • mTLS (optional): per-destination certificate pinning for regulated tenants.
  • IP allowlist: optional; published egress ranges documented per environment.

Signature example (C#)

// Requires: using System.Security.Cryptography; using System.Text;
static bool Verify(string body, string ts, string sig, byte[] secret)
{
    using var h = new HMACSHA256(secret);
    var computed = h.ComputeHash(Encoding.UTF8.GetBytes($"{ts}.{body}"));
    byte[] provided;
    try { provided = Convert.FromHexString(sig); }
    catch (FormatException) { return false; }   // malformed hex → reject
    return CryptographicOperations.FixedTimeEquals(provided, computed);
}

Envelope & Payload

Envelope (outer)

{
  "eventId": "01J9Z8C6MSSQ7E9KZQW2C4KQ1V",
  "name": "export.completed",
  "schemaVersion": 1,
  "tenantId": "t-9c8f1",
  "occurredAtUtc": "2025-09-30T12:55:10Z",
  "emittedAtUtc": "2025-09-30T12:55:12Z",
  "correlation": { "traceId": "tr-abc", "requestId": "rq-xyz", "producer": "audit-export@2.4.1" },
  "payload": { /* minimal per-event data (redacted) */ }
}

Payload examples (minimal, PII-safe)

  • export.completed
{
  "jobId": "exp-01J9Z7E8AVB2M7",
  "purpose": "ediscovery:case-12345",
  "range": { "from": "2025-09-01T00:00:00Z", "to": "2025-09-30T23:59:59Z" },
  "recordCount": 12840,
  "manifestUrl": "SIGNED_URL_REDACTED",
  "integrityProofId": "proof-01J9Z7KQ..."
}
  • legalhold.applied
{
  "holdId": "lh-01J9Z8...",
  "caseId": "CASE-555",
  "categories": ["identity","billing"],
  "fromUtc": "2025-09-01T00:00:00Z",
  "toUtc": "2025-10-01T00:00:00Z"
}

Registration & Management (HTTP)

Create subscription

POST /audit/webhooks
Authorization: Bearer <audit.webhooks.manage>
Tenant-Id: t-9c8f1
Content-Type: application/json

Body:

{
  "destination": "https://siem.example.com/hooks/audit",
  "events": ["export.completed","legalhold.applied","breakglass.used"],
  "filters": { "category": ["identity","billing"] },
  "secret": "auto-generate",                // or supply CMK-wrapped secret
  "retries": { "maxAttempts": 12, "maxAgeSeconds": 86400 },
  "mTLS": { "enabled": false }
}

Rotate secret

POST /audit/webhooks/{id}:rotate-secret

Returns newSecretId; old secret remains valid for a grace window (dual-signing accepted).

Challenge (handshake)

POST /audit/webhooks/{id}:test

Sends a challenge event; receiver must echo challenge token and verify signature on their side.

Pause/Resume

POST /audit/webhooks/{id}:pause
POST /audit/webhooks/{id}:resume

Retry & Backoff

  • Exponential with jitter: 1s → 2s → 4s … up to max backoff (e.g., 5 min).
  • Max attempts per event (default 12) and max age (e.g., 24h).
  • Terminal failure: move to per-subscription DLQ with last error; subscription may auto-pause after sustained failures.
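The retry schedule above can be sketched as a pure delay function plus a give-up predicate. This is an illustrative sketch (the function and parameter names are not part of the spec); it uses full-range jitter in [delay/2, delay] and the documented defaults of 12 attempts / 24h max age:

```typescript
// Sketch of the documented retry policy: exponential growth from a base delay
// up to a max backoff, with jitter. The rng is injectable for deterministic tests.
export function backoffDelayMs(
  attempt: number,                 // 1-based attempt number
  baseMs = 1_000,                  // 1s first retry
  maxMs = 300_000,                 // 5 min cap
  rng: () => number = Math.random,
): number {
  const exp = Math.min(maxMs, baseMs * 2 ** (attempt - 1)); // 1s, 2s, 4s, ...
  // Jitter in [exp/2, exp] avoids synchronized retry storms across subscribers.
  return Math.floor(exp / 2 + (exp / 2) * rng());
}

export function shouldRetry(
  attempt: number, firstAttemptAtMs: number, nowMs: number,
  maxAttempts = 12, maxAgeMs = 86_400_000,
): boolean {
  return attempt < maxAttempts && nowMs - firstAttemptAtMs < maxAgeMs;
}
```

Once `shouldRetry` returns false, the event is moved to the per-subscription DLQ as described below.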

Receiver responsibilities

  • Return 2xx on success only after durable write to their store.
  • Use Event-Id for idempotent processing.
  • Support HMAC verification and age check before processing.

Delivery Flow

sequenceDiagram
  participant SRC as Audit Service (Outbox)
  participant PUB as Webhook Publisher
  participant EXT as External Receiver

  SRC->>PUB: Outbox row (event)
  PUB->>PUB: Build JSON body (minimal)
  PUB->>PUB: Sign body (HMAC) + set timestamp
  PUB->>EXT: POST /hook  (headers: X-Webhook-*)
  alt 2xx
    EXT-->>PUB: OK
    PUB->>PUB: Mark delivered
  else non-2xx/timeout
    PUB->>PUB: Retry with backoff+jitter
    opt exceeded
      PUB->>PUB: DLQ + auto-pause (optional)
    end
  end

Filtering & Throttling

  • Per-subscription filters on event, category, optional action prefix.
  • Rate caps per subscription (tokens/sec, burst). Over-cap events are queued; if backlog exceeds threshold, subscription is paused and operator alerted.
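A minimal token-bucket sketch of the per-subscription rate cap (tokens/sec with burst). The class and parameter names are illustrative, not the service's actual implementation; the clock is injected so behavior is deterministic:

```typescript
// Token bucket: refills continuously at ratePerSec, capped at burst capacity.
export class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;

  constructor(
    private ratePerSec: number,  // steady-state tokens per second
    private burst: number,       // bucket capacity (max burst)
    nowMs: number,
  ) {
    this.tokens = burst;
    this.lastRefillMs = nowMs;
  }

  /** Try to consume one token; false means the event must be queued. */
  tryConsume(nowMs: number): boolean {
    const elapsedSec = (nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.burst, this.tokens + elapsedSec * this.ratePerSec);
    this.lastRefillMs = nowMs;
    if (this.tokens >= 1) { this.tokens -= 1; return true; }
    return false;
  }
}
```

Events that fail `tryConsume` go to the backlog; when the backlog exceeds its threshold the subscription is paused, as described above.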

Inbound Callbacks (if any)

  • All inbound third-party callbacks (e.g., challenge responses, approval webhooks in dual-control flows) terminate at the API Gateway and enter the Inbox:
    • mTLS (optional) + OAuth2 (client credentials) or signed challenge.
    • Dedupe via X-Callback-Id (same semantics as Event-Id).
    • Age limit enforced; drop stale requests.
    • Minimal payloads; no secrets in logs; results audited.

Observability

  • Metrics:
    • webhook_delivered_total{tenant,subscription}
    • webhook_retry_total{tenant,subscription,reason}
    • webhook_delivery_latency_seconds{tenant} (histogram)
    • webhook_dropped_total{tenant,reason=age|sig|410}
    • webhook_active_subscriptions{tenant}
  • Logs (no PII): subscription id, event name, attempt, latency, status code.
  • Traces: audit.webhook.publish with tags tenant.id, subscription.id, event.name, attempt, age.

Failure Modes & Policies

  • 410 Gone from receiver: immediately pause subscription.
  • 401/403: treat as terminal until operator action (likely secret/mTLS drift).
  • Signature mismatch: drop and alert (security signal).
  • Age limit exceeded: drop silently to avoid storming receivers with stale events.

Tenant Isolation & Governance

  • Subscriptions are tenant-scoped (Tenant-Id required to manage); no cross-tenant webhooks.
  • Break-glass (platform-admin) cannot create cross-tenant webhooks; only read events across tenants with dual-control (as per Access Control).
  • All management operations (create, rotate, pause, delete) are audited with who/when/why.

Receiver Quickstart (verification)

Node.js

const crypto = require('crypto');
function verify({body, ts, sig, secret}) {
  const mac = crypto.createHmac('sha256', Buffer.from(secret, 'utf8'))
                    .update(`${ts}.${body}`).digest('hex');
  const provided = Buffer.from(sig, 'hex');
  const computed = Buffer.from(mac, 'hex');
  // timingSafeEqual throws on length mismatch, so guard first.
  return provided.length === computed.length
      && crypto.timingSafeEqual(provided, computed);
}

Recommended receiver steps

  1. Check X-Webhook-Timestamp age ≤ window.
  2. Recompute v1 HMAC; compare in constant time.
  3. Ensure Event-Id unseen → process → persist → respond 2xx.
  4. Store minimal receipt (eventId, ts, outcome) for your audit.
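Steps 1–3 above can be sketched as one transport-agnostic handler. Header and envelope names follow this document; the function shape, return codes, and the in-memory `Set` standing in for a durable inbox store are illustrative assumptions:

```typescript
// Receiver pipeline sketch: age check → constant-time HMAC check → Event-Id dedupe.
import * as crypto from "crypto";

export function signV1(secret: string, ts: string, body: string): string {
  return crypto.createHmac("sha256", Buffer.from(secret, "utf8"))
               .update(`${ts}.${body}`).digest("hex");
}

export function handleWebhook(opts: {
  body: string;
  headers: Record<string, string>;   // X-Webhook-Timestamp / X-Webhook-Signature
  secret: string;
  seenEventIds: Set<string>;         // stand-in for a durable inbox store
  nowEpochSec: number;
  maxAgeSec?: number;
}): { status: number; reason: string } {
  const { body, headers, secret, seenEventIds, nowEpochSec, maxAgeSec = 300 } = opts;

  // 1. Age check before any crypto work.
  const ts = headers["X-Webhook-Timestamp"];
  if (!ts || Math.abs(nowEpochSec - Number(ts)) > maxAgeSec)
    return { status: 401, reason: "stale" };

  // 2. Recompute v1 HMAC and compare in constant time.
  const m = /t=(\d+),v1=([0-9a-f]+)/.exec(headers["X-Webhook-Signature"] ?? "");
  if (!m || m[1] !== ts) return { status: 401, reason: "bad-signature-header" };
  const expected = Buffer.from(signV1(secret, ts, body), "hex");
  const provided = Buffer.from(m[2], "hex");
  if (provided.length !== expected.length ||
      !crypto.timingSafeEqual(provided, expected))
    return { status: 401, reason: "signature-mismatch" };

  // 3. Dedupe on Event-Id; repeats are acknowledged, not reprocessed.
  const eventId = (JSON.parse(body) as { eventId: string }).eventId;
  if (seenEventIds.has(eventId)) return { status: 200, reason: "duplicate" };
  seenEventIds.add(eventId);        // durable write first, then ack
  return { status: 200, reason: "processed" };
}
```

A real receiver should only return 2xx after the durable write succeeds, per the receiver responsibilities above.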

Admin Knobs

  • Age window (default 300s), max attempts (default 12), max backlog per subscription, rate caps, mTLS toggle, secret rotation grace (dual-sign), per-tenant event filters, and pause-on-failure thresholds.

Compliance Profiles


Purpose

Provide configurable overlays for common frameworks—SOC 2, GDPR, HIPAA—that translate regulatory intent into concrete defaults for this bounded context: retention windows, PII handling presets, consent & legal basis checks, and breach-notification hooks. Profiles are tenant-scoped, versioned, and map to our ClassificationPolicy, RetentionPolicy, LegalHold, ExportJob, and Edge Webhooks without changing producer contracts.

Guidance only, not legal advice. Profiles encode operational defaults and guardrails; tenants can harden, never weaken.


Model & Scope

flowchart LR
  CP[Compliance Profile] --> RP[RetentionPolicy]
  CP --> CL[ClassificationPolicy]
  CP --> AC[Access Control & Purpose]
  CP --> EX[ExportJob Defaults]
  CP --> WH["Webhooks (Breach/Reg)"]
  CP --> INT[IntegrityLedger constraints]
  CP --> LOG[Observability Guardrails]
  • Attachment: per-tenant (tenantId) with optional category overrides (e.g., billing, phi).
  • Layering: multiple profiles can apply; the stricter rule wins (minimize exposure, maximize retention where required).
  • Effective window: effectiveFromUtc + forward-only versioning; profile changes never unmask data already transformed.
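A sketch of the "stricter rule wins" layering. The merge rules here are assumptions for illustration, not defined by this document: retention takes the longer window, and redaction kinds are ordered by exposure (DROP > TOKENIZE > HASH > MASK):

```typescript
// Merge two profile overlays so the result never weakens either one.
type RedactionKind = "MASK" | "HASH" | "TOKENIZE" | "DROP";
const STRICTNESS: Record<RedactionKind, number> = { MASK: 0, HASH: 1, TOKENIZE: 2, DROP: 3 };

interface ProfileDefaults {
  retentionDays: Record<string, number>;     // per category
  redaction: Record<string, RedactionKind>;  // per data class
}

export function mergeProfiles(a: ProfileDefaults, b: ProfileDefaults): ProfileDefaults {
  const retentionDays: Record<string, number> = { ...a.retentionDays };
  for (const [cat, days] of Object.entries(b.retentionDays))
    retentionDays[cat] = Math.max(retentionDays[cat] ?? 0, days); // maximize retention
  const redaction: Record<string, RedactionKind> = { ...a.redaction };
  for (const [cls, kind] of Object.entries(b.redaction)) {
    const cur = redaction[cls];
    redaction[cls] = cur && STRICTNESS[cur] >= STRICTNESS[kind] ? cur : kind; // minimize exposure
  }
  return { retentionDays, redaction };
}
```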

Configuration Schema (admin)

profileId: "gdpr-defaults"
version: 3
effectiveFromUtc: "2025-10-01T00:00:00Z"
jurisdictions: ["EU","EEA","UK"]
appliesTo:
  categories: ["identity","billing"]
  actors: ["user","service"]
defaults:
  retentionDays:
    identity: 365
    billing: 2555               # example; tenant may harden
  redaction:
    PERSONAL:  { kind: HASH, params: { algo: SHA256 } }
    SENSITIVE: { kind: MASK, params: { showLast: 2 } }
    CREDENTIAL:{ kind: DROP }
    PHI:       { kind: TOKENIZE, params: { ns: "phi" } }
  consent:
    requirePurposeHeader: true
    legalBases: ["consent","contract","legal_obligation"]  # allowed for elevated reads/exports
  export:
    format: "jsonl"
    requireManifestSignature: true
    includeMembershipProofs: "on-demand"
  breach:
    notify:
      webhookId: "sec-ops-1"
      ageSeconds: 60
      subjects: ["export.failed","breakglass.used","integrity.segment_sealed"] # extend with detectors
  observability:
    blockPIIInLogs: true
    redactUserAgentTo: 64
  residency:
    restrictCrossRegionExports: true
    allowedRegions: ["EU"]

Policy Mapping (how a profile shapes the BC)

| Control | Effect |
| --- | --- |
| RetentionPolicy | Sets per-category retention windows & mode (PURGE/ARCHIVE_THEN_PURGE), enforces baselines; triggers retention.policy_changed. |
| ClassificationPolicy | Seeds rulesByClass and field overrides; ensures CREDENTIAL → DROP, PERSONAL → HASH, PHI → TOKENIZE. |
| Access Control & Purpose | Requires X-Purpose and a permitted legal basis for elevated reads/exports; MFA flags enforced for admin actions. |
| ExportJob | Defaults to JSONL, signed manifest, region checks, and proof linkage; dual-control optional by profile. |
| Edge Webhooks | Enables signed notifications to Security/Privacy teams for defined events (breach candidates, break-glass, export completion). |
| IntegrityLedger | Minimum seal cadence (e.g., ≤5m) and mandatory proof anchoring (optional) for regulated profiles. |
| Observability | Enforces PII-free logs, truncation limits, and blocks unsafe diagnostic fields. |

Profile Presets (examples)

Values are illustrative and must be validated by your compliance team.

| Profile | Retention Defaults | PII Presets | Consent & Purpose | Exports | Other Overlays |
| --- | --- | --- | --- | --- | --- |
| SOC 2 | identity: 365d; billing: 2555d | PERSONAL → HASH, CREDENTIAL → DROP | Purpose header recommended; MFA for admin | JSONL + signed manifest | Proof seal ≤ 10m; basic breach hooks |
| GDPR | identity: 365d (minimize); billing: policy-driven | PERSONAL → HASH, PHI → TOKENIZE, CREDENTIAL → DROP | Purpose required; legal-basis whitelist; deny elevated if basis missing | JSONL; region-restricted; DSAR helpers | Data residency restricted; erase requests honored via retention/tombstone |
| HIPAA | phi: 2555d (policy-driven) | PHI → TOKENIZE (vault), PERSONAL → HASH | Purpose limited to treatment/payment/operations; MFA mandatory | JSONL; membership proofs on-demand; dual-control for exports | Seal cadence ≤ 5m; additional access-logging flags |

Runtime Enforcement

  • Write path: classification/redaction uses profile-backed ClassificationPolicy vN; secrets dropped before persist.
  • Read path: default redacted views; elevated requires audit.read.elevated, MFA, X-Purpose, legal basis ∈ profile’s allowlist.
  • Export: snapshot validated against residency & purpose; artifacts signed; proofs included/referenced.
  • Retention: scheduler uses profile windows; LegalHold always overrides.
  • Breach hooks: profile subscribes specific events to webhooks; receivers verify HMAC and age.

  • Integrates with the platform PDP (RBAC/ABAC). Profile declares allowed legal bases for Audit access.
  • Request must carry X-Purpose (e.g., legal-ediscovery:CASE-555) and, when applicable, the legal basis claim in token or header (e.g., X-Legal-Basis: consent).
  • Deny elevated operations that lack a permitted basis; emit minimal denied audit entry.

DSAR / Subject Rights (GDPR-focused helpers)

  • Discovery: GET /audit/timeline with actor/resource filters + X-Purpose: gdpr-dsar:<ticket>.
  • Packaging: POST /audit/exports with DSAR preset (profile-defined minimal fields).
  • Erasure: Implemented via retention/tombstone—never rewrites; profile may mark categories eligible for REDACT_TOMBSTONE mode.

Breach Notification Hooks

  • Profile lists trigger events and destinations:
    • breakglass.used, export.failed, unusual access:elevated spikes (from metrics), integrity drift detections.
  • Publisher sends HMAC-signed webhooks with age-limit; receivers are expected to open incidents per playbook.

Data Residency & Region Controls

  • For GDPR/HIPAA profiles, enforce region fences during export and proof access.
  • If restrictCrossRegionExports=true, StartExport fails with 403 if the requested delivery region is outside allowedRegions.

Versioning & Change Control

  • Profiles are forward-only; new version schedules at effectiveFromUtc.
  • A profile change emits compliance.profile_changed (internal) and recomputes policy caches; does not retroactively unmask historic data.

Evidence & Auditability

  • Automatic evidence bundle endpoints for audits:
    • Current profile JSON (signed), active policy versions, retention windows, webhook subscriptions, and last 30 days of admin/auditor access entries.
  • Exportable as a ZIP with hashes and a signed manifest for SOC2/HIPAA audits.

Testing & Validation

  • Conformance tests per profile:
    • Golden redaction fixtures (hash/mask/token rules)
    • Residency denial tests
    • Export manifest signature presence
    • Purpose/legal basis gating for elevated reads
  • Drills: simulate webhook delivery failures, DSAR export within SLA, and break-glass usage (must page on use).

Admin APIs

PUT /audit/admin/compliance-profiles/{profileId}
Body: { version, effectiveFromUtc, defaults... }

POST /audit/admin/tenants/{tenantId}/attach-profile
Body: { profileId, version, categories? }

GET  /audit/admin/compliance-profiles/{profileId}/effective?tenantId=...
  • Attaching a profile triggers regeneration of compiled policy caches and emits a minimal audit entry.

Safe Defaults (when no profile attached)

  • Apply platform baseline: CREDENTIAL → DROP, PERSONAL → HASH, purpose header optional, exports signed, seal ≤ 10m, PII-free logs, and retention 365d (identity) unless tenant overrides.

SDKs & Adapters


Purpose

Provide thin, ergonomic client SDKs (C#, JS/TS) to emit standardized AuditRecords with correlation and idempotency baked in. Supply anti-corruption layers for popular upstreams (identity tokens → Actor, HTTP context → Context) and adapters for HTTP, Service Bus/MassTransit, and Orleans. Conformist downstreams can adopt schemas verbatim for events and read models.


Design Principles

  • Zero business logic: SDKs package the canonical shapes, headers, and retry/idempotency helpers.
  • Server-authoritative: classification/redaction happen server-side; clients may add hints only.
  • Portable correlation: propagate W3C traceparent plus domain headers (Trace-Id, Request-Id, Producer, Tenant-Id).
  • Idempotent by default: deterministic key helpers; safe retries.
  • PII-safe: optional mappers normalize/minimize before send; SDK logs never include payloads.

Package Layout

  • .NET: ConnectSoft.Audit
    • AuditClient (HTTP / gRPC code-first)
    • MassTransitAuditPublisher (message contract)
    • AspNetCoreAuditMiddleware (actor/context mappers)
    • Idempotency & Correlation utilities
  • JS/TS: @connectsoft/audit
    • AuditClient (HTTP)
    • expressAudit() middleware
    • correlation() helpers, idempotency() helpers
    • Types for AuditRecord, Actor, ResourceRef, etc.

Canonical Types (JS/TS)

export type ActorType = "user" | "service" | "job";

export interface Actor {
  type: ActorType;
  id: string;
  display?: string;
  roles?: string[];
}

export interface ResourceRef {
  type: string;
  id: string;
  path?: string;
  tenantScopedId?: string;
}

export type DecisionOutcome = "allow" | "deny" | "na";
export interface Decision { outcome: DecisionOutcome; reason?: string; attributes?: Record<string,string>; }

export interface Context { ip?: string; userAgent?: string; clientApp?: string; location?: string; }

export interface Delta { fields?: Record<string, unknown>; hints?: Record<string, { rule: string }>; }

export interface Correlation { traceId: string; requestId: string; causationId?: string; producer: string; }

export interface AuditRecord {
  id?: string;
  tenantId: string;
  occurredAtUtc: string;
  actor: Actor;
  action: string;
  resource: ResourceRef;
  decision?: Decision;
  context?: Context;
  before?: Delta;
  after?: Delta;
  classes?: string[];
  correlation: Correlation;
  idempotencyKey: string;
  integrity?: { segmentId?: string; hash?: string } | null;
}

The C# SDK mirrors these types as records with nullable optionals.


Correlation & Headers

SDKs set/propagate:

  • Tenant-Id (required)
  • Idempotency-Key (required for online)
  • Trace-Id, Request-Id (or W3C traceparent)
  • Producer (service@version)

Generation rules

  • traceId: reuse ambient trace; else ULID/UUIDv7
  • requestId: from framework request; else ULID/UUIDv7
  • producer: <serviceName>@<semver> from app config

Idempotency Helpers

  • Idempotency.Key.forAction(tenantId, resourceId, action, occurredAtUtc[, extra]) → deterministic key
  • Fallback Idempotency.Key.fromContent(record) → canonical JSON → SHA-256
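The two helper shapes above can be sketched as follows. The function names and key layout are illustrative (the SDK's actual helpers may differ); the point is that keys are deterministic for retries and stable under property reordering:

```typescript
// Deterministic idempotency keys: coordinate-based primary, content-hash fallback.
import * as crypto from "crypto";

export function keyForAction(
  tenantId: string, resourceId: string, action: string,
  occurredAtUtc: string, extra = "",
): string {
  // Join stable coordinates with a separator that cannot appear in them ambiguously.
  return crypto.createHash("sha256")
    .update([tenantId, resourceId, action, occurredAtUtc, extra].join("\n"))
    .digest("hex");
}

// Canonical JSON: keys emitted in sorted order so equal content always hashes
// to the same key regardless of property insertion order.
export function canonicalJson(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(canonicalJson).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const entries = Object.keys(value as object).sort()
      .map((k) => `${JSON.stringify(k)}:${canonicalJson((value as Record<string, unknown>)[k])}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value);
}

export function keyFromContent(record: object): string {
  return crypto.createHash("sha256").update(canonicalJson(record)).digest("hex");
}
```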

Anti-Corruption Mappers

Identity token → Actor

  • Prefer sub as actor.id
  • Map azp/client_id → service actors
  • Map roles from roles/groups claims
  • If email present, do not embed raw email; SDK can attach classification hint (PERSONAL) or hash locally (optional)

HTTP context → Context

  • ip: from X-Forwarded-For (first hop) or connection remote IP
  • userAgent: trimmed to max length (e.g., 128)
  • clientApp: from custom header (e.g., X-Client-App)

ASP.NET Core

app.UseMiddleware<AspNetCoreAuditMiddleware>(new AuditMiddlewareOptions {
  Actor = ctx => ActorFromJwt(ctx.User),
  Context = ctx => new Context { ip = ctx.GetClientIp(), userAgent = ctx.GetUserAgent(), clientApp = "Portal" }
});

Express (Node)

import { expressAudit } from "@connectsoft/audit/mw";
app.use(expressAudit({
  actor: req => ({ type: "user", id: req.user.sub, roles: req.user.roles }),
  context: req => ({ ip: req.ip, userAgent: req.headers["user-agent"] as string, clientApp: "Portal" })
}));

HTTP Client (C#)

var client = new AuditClient(new AuditClientOptions {
  BaseAddress = new Uri("https://api.example.com"),
  Producer = "iam-service@1.12.3",
  HttpClient = httpClient
});

var now = DateTimeOffset.UtcNow;
var record = new AuditRecord {
  tenantId = "t-9c8f1",
  occurredAtUtc = now.ToString("O"),
  actor = new Actor(ActorType.User, "u-12345") { roles = new[] { "admin" } },
  action = "User.PasswordChanged",
  resource = new ResourceRef("User", "u-12345"),
  correlation = Correlation.CurrentOrNew("iam-service@1.12.3"),
  idempotencyKey = Idempotency.ForAction("t-9c8f1","u-12345","User.PasswordChanged", now)
};

await client.AppendAsync(record, new AppendOptions {
  ClassificationHints = new Dictionary<string,string> {
    ["before.fields.email"] = "PERSONAL",
    ["after.fields.token"]  = "CREDENTIAL"
  }
}, ct);

Retry & error handling

  • Retries with exponential backoff + jitter on 429/5xx.
  • 200 OK with status=duplicate treated as success (surfacing original recordId).
  • 409 Conflict for semantic duplicates → raise IdempotencyConflictException.
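The retry policy above can be sketched as a generic executor. This is an illustrative shape, not the SDK's actual API: the retryable predicate stands in for the 429/5xx check, and the sleep function is injectable so tests run instantly:

```typescript
// Retry an async operation with capped exponential backoff + jitter.
export async function retryAsync<T>(
  fn: () => Promise<T>,
  opts: {
    maxAttempts?: number;
    isRetryable?: (err: unknown) => boolean;  // e.g. 429/5xx in the HTTP client
    sleep?: (ms: number) => Promise<void>;
    baseMs?: number;
  } = {},
): Promise<T> {
  const { maxAttempts = 5, isRetryable = () => true, baseMs = 100,
          sleep = (ms) => new Promise<void>((r) => setTimeout(r, ms)) } = opts;
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxAttempts || !isRetryable(err)) throw err;
      const delay = baseMs * 2 ** (attempt - 1);
      await sleep(delay + Math.random() * delay); // jitter
    }
  }
}
```

Non-retryable outcomes like the 409 semantic conflict above should fail the predicate so they surface immediately instead of being retried.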

HTTP Client (JS/TS)

import { AuditClient, idempotency, correlation } from "@connectsoft/audit";

const ac = new AuditClient({
  baseUrl: "https://api.example.com",
  producer: "frontend@3.4.0",
  getToken: async () => `Bearer ${await auth.getAccessToken()}`
});

const now = new Date().toISOString();
await ac.append({
  tenantId: "t-9c8f1",
  occurredAtUtc: now,
  actor: { type: "user", id: auth.userId, roles: auth.roles },
  action: "Session.Login",
  resource: { type: "User", id: auth.userId },
  correlation: correlation.currentOrNew("frontend@3.4.0"),
  idempotencyKey: idempotency.forAction("t-9c8f1", auth.userId, "Session.Login", now),
  context: { ip: undefined, userAgent: navigator.userAgent, clientApp: "Web" }
}, {
  classificationHints: { "after.fields.email": "PERSONAL" }
});

MassTransit & Orleans Adapters

MassTransit producer

await _bus.Publish<IngestAuditRecord>(new {
  TenantId = "t-9c8f1",
  IdempotencyKey = Idempotency.ForAction(...),
  Record = record,
  ClassificationHints = new Dictionary<string,string> { /* optional */ },
  Correlation = CorrelationMeta.Current()
});

Orleans

var grain = _client.GetGrain<IAuditIngestGrain>("t-9c8f1");
await grain.Append(new AppendAuditRecordCommand(record, "t-9c8f1", record.idempotencyKey));

Both adapters reuse the same canonical contracts and correlation model.


Classification Hints (optional)

SDKs allow attaching hints by path:

{ "before.fields.email": "PERSONAL", "after.fields.apiKey": "CREDENTIAL" }

Server remains authoritative; hints merely guide classification.


Middleware Quick-Emit

ASP.NET Core action filter / Express handler helpers to emit common actions (login, access decisions, config changes) with one-liners:

await _audit.EmitDecision("Document.Downloaded", resource, allow ? "allow" : "deny", reason);
await audit.emitDecision("Document.Downloaded", resource, outcome, reason);

Conformist Downstreams

  • Events: consumers can adopt DomainEventEnvelope<T> as-is (same names, versions).
  • Read models: schemas in audit_rm.* are documented for analytics (avoid cross-tenant access).
  • Idempotency: a ConsumerTemplate is provided (offset store + handler skeleton).

Serialization & Canonicalization

  • JSON serializer configured to:
    • Stable property order (for deterministic hashes)
    • ISO-8601 UTC for timestamps
    • Trim/normalize known fields (emails lowercased, IPv6 compressed)
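The serializer configuration above can be sketched as two small functions. The normalization rules shown (email lowercasing, Date → ISO-8601 UTC) are examples of the "known fields" policy; the exact field list is an assumption:

```typescript
// Field normalization for known fields before canonical serialization.
export function normalizeRecord(record: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = { ...record };
  if (typeof out.email === "string") out.email = out.email.trim().toLowerCase();
  if (out.occurredAtUtc instanceof Date) out.occurredAtUtc = out.occurredAtUtc.toISOString();
  return out;
}

// Deterministic stringify: keys sorted recursively so equal content yields
// byte-identical JSON, which in turn yields stable hashes.
export function stableStringify(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(stableStringify).join(",")}]`;
  if (value !== null && typeof value === "object")
    return `{${Object.keys(value as object).sort()
      .map((k) => `${JSON.stringify(k)}:${stableStringify((value as Record<string, unknown>)[k])}`)
      .join(",")}}`;
  return JSON.stringify(value);
}
```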

Security & Privacy Guardrails

  • SDK never logs payloads; structured logs carry only status, IDs, and hashed keys.
  • Built-in secret scrubbing for accidental sensitive strings in exceptions (e.g., patterns like sk_live_...).
  • Large deltas discouraged on clients; prefer summaries and let server tokenize as needed.

Versioning & Compatibility

  • SemVer packages; additive first.
  • Feature flags allow server capabilities detection (e.g., supportsBatchHints, supportsBackfill).
  • Breaking wire changes publish new HTTP paths (/v2) or message contracts (topic ...v2).

Testing Utilities

  • Fixture builder for AuditRecord with sensible defaults.
  • Golden corpus for redaction expectations (client-side normalizations only).
  • Contract tests: spin a stub gateway validating headers, idempotency behavior, and retry semantics.

Diagram

sequenceDiagram
  participant APP as App/Service
  participant SDK as SDK (C#/TS)
  participant GW as API Gateway
  participant AUD as Audit Ingest

  APP->>SDK: Build AuditRecord (+hints)
  SDK->>SDK: Correlation + Idempotency key
  SDK->>GW: POST /audit/records (Tenant-Id, Idempotency-Key, Trace-Id)
  GW->>AUD: mTLS forward
  AUD-->>SDK: 201 Created | 200 Duplicate
  SDK-->>APP: AppendResult(recordId)

Quick Reference

  • Always set Tenant-Id, Idempotency-Key, correlation.
  • Prefer deterministic keys; retries are safe.
  • Send hints when easy; server decides redaction.
  • Never log payloads; rely on SDK’s structured outcome logs.
  • Adopt event/read-model schemas directly for conformist consumers.

Evolution & Versioning


Purpose

Evolve the Audit & Compliance BC safely without breaking producers/consumers. Favor additive-first changes; when a break is unavoidable, publish a new channel (/v2 HTTP path, ...v2 topic/webhook) and keep mapping layers to bridge old producers. Version exporters/importers independently from APIs and events.


Version Dimensions (what can change)

| Axis | Examples | Strategy |
| --- | --- | --- |
| HTTP APIs | /audit/timeline, /audit/exports | Path versioning (/v2), optional media-type version (Accept: application/vnd.connectsoft.audit+json;version=2) |
| Message bus | audit.records.v1 | New topics per major (.v2); keep v1 until migration completes |
| Domain events | audit.record_appended payload shape | Envelope SchemaVersion increments; additive by default; publish both v1 & v2 during migration |
| Read models | audit_rm.* | Blue/green: create audit_rm_v2.*, swap via synonyms/views |
| Webhooks | export.completed | Header X-Webhook-Schema: 1\|2; or path /hooks/v2/... |
| Storage schema | indexes/columns | Add columns/indexes; avoid type changes; use new tables for big breaks |
| Exporters/Importers | manifest, file layout | Version independently: Exporter@v2, Importer@v3 |
| Policies | ClassificationPolicy, RetentionPolicy | Versioned inside the policy domain; never unmask historic data |

Additive-First Rules

  • Only add optional fields; keep defaults stable.
  • Never rename/remove existing fields; mark deprecated in docs.
  • Enums: add new values; consumers must ignore unknowns.
  • Semantics: do not change meaning of existing fields—breaking by semantics is a major.
  • Unknown ≠ error: all conformist consumers must tolerate unknown fields/headers.

When a Breaking Change Is Required

  • Structural changes (type swap, field removal, required field added).
  • Privacy hardening that reduces visibility (e.g., Decision.reason removed for policy).
  • Contract shifts (different pagination model, cursor format).
  • Transport framing change.

Action: publish /v2 (HTTP), ...v2 (topics/webhooks); keep /v1 read-only or frozen until decommission.


Channel Versioning Patterns

HTTP

  • Paths: /v1/audit/timeline → /v2/audit/timeline.
  • Headers: responses include X-API-Version: 1|2, Deprecation, Sunset, Link: <url>; rel="deprecation".
  • Negotiation (optional): Accept: application/vnd.connectsoft.audit+json;version=2.

Bus/Webhooks

  • Topics: audit.records.v2, export.completed.v2.
  • Webhooks: add X-Webhook-Schema: 2. Existing endpoints can opt into v2 by re-subscribing.

Code-first gRPC

  • New service/contract names (code-first, no .proto files): IAuditIngestServiceV2. Keep the V1 interface while migrating.

Mapping Layers (anti-corruption)

Bridge older producers/consumers transparently:

flowchart LR
  P1[Producer v1] --> AC1[Ingest Mapper v1→v2] --> Core[AuditStream v2]
  Core --> AC2[Event Mapper v2→v1] --> C1[Consumer v1]
  • Ingest Mapper v1→v2: accepts v1 payloads (headers/fields), emits canonical v2 AuditRecord.
  • Event Mapper v2→v1: for critical consumers stuck on v1, derive best-effort v1 payloads from v2 events (temporary).

Mappers live at the boundary; core domain remains on the newest model.
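A boundary-mapper sketch for the Ingest Mapper v1→v2. The flat v1 shape below is hypothetical (this document does not define the legacy payload); the v2 target mirrors the canonical AuditRecord from the SDK section. A real mapper would also validate and reject, not just transform:

```typescript
// Anti-corruption mapper: hypothetical flat v1 payload → canonical v2 record.
interface LegacyV1Record {            // hypothetical v1 producer payload
  tenant: string;
  userId: string;
  action: string;
  resourceType: string;
  resourceId: string;
  timestamp: string;                  // already ISO-8601 UTC
  traceId?: string;
}

interface AuditRecordV2 {
  tenantId: string;
  occurredAtUtc: string;
  actor: { type: "user" | "service" | "job"; id: string };
  action: string;
  resource: { type: string; id: string };
  correlation: { traceId: string; requestId: string; producer: string };
  idempotencyKey: string;
}

export function mapV1ToV2(v1: LegacyV1Record, requestId: string): AuditRecordV2 {
  return {
    tenantId: v1.tenant,
    occurredAtUtc: v1.timestamp,
    actor: { type: "user", id: v1.userId },
    action: v1.action,
    resource: { type: v1.resourceType, id: v1.resourceId },
    correlation: {
      traceId: v1.traceId ?? requestId,       // reuse ambient trace if present
      requestId,
      producer: "ingest-mapper-v1@1.0.0",     // mapper identifies itself
    },
    // Deterministic key from stable v1 coordinates keeps replays idempotent.
    idempotencyKey: [v1.tenant, v1.resourceId, v1.action, v1.timestamp].join(":"),
  };
}
```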


Exporters & Importers (independent versioning)

  • Exporter@vN: defines manifest schema + artifact layout. Manifest includes schemaVersion and exporterVersion.
  • Importer@vN: supports a matrix of source versions; validates and maps to canonical ingest shape.
  • Upgrade either without touching API/event versions; record tool versions in manifests and lineage tables.

Deprecation Policy & Timeline

  1. Announce with docs + headers (Deprecation, Sunset) and dashboard banner.
  2. Dual-publish: run v1 and v2 in parallel; mirror events to both topics.
  3. Measure: per-version usage metrics; reach out to stragglers.
  4. Freeze: stop adding features to v1; security fixes only.
  5. Sunset: after ≥ 6–12 months (profile-driven), disable writes/reads on v1.

Response headers (HTTP)

Deprecation: true
Sunset: Tue, 30 Sep 2026 00:00:00 GMT
Link: <https://docs.example.com/audit/v1-deprecation>; rel="deprecation"

Capability Discovery

  • GET /audit/capabilities returns supported API versions, event schemas, formats, and limits.
{
  "api": { "versions": ["1","2"], "default": "2" },
  "events": { "audit.record_appended": [1,2] },
  "exports": { "formats": ["jsonl","csv"], "exporterVersions": [1,2] }
}

SDKs cache this to auto-select the highest compatible version.
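The auto-selection can be sketched as an intersection of server-advertised versions with what the SDK build supports. The function name and the SDK-side list are illustrative; the capabilities shape mirrors the JSON above:

```typescript
// Pick the highest API version supported by both server and SDK.
interface Capabilities {
  api: { versions: string[]; default: string };
}

export function selectApiVersion(caps: Capabilities, sdkSupports: string[]): string {
  const common = caps.api.versions.filter((v) => sdkSupports.includes(v));
  if (common.length === 0)
    throw new Error(`no compatible API version (server: ${caps.api.versions}, sdk: ${sdkSupports})`);
  // Numeric compare so "10" sorts above "9".
  return common.sort((a, b) => Number(a) - Number(b)).at(-1)!;
}
```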


Contracts (stable envelope)

All events keep a stable envelope even when payload evolves:

public sealed record DomainEventEnvelope<TPayload>(
  string EventId, string Name, int SchemaVersion, string TenantId,
  string AggregateType, string AggregateId, long AggregateSequence,
  DateTime OccurredAtUtc, DateTime EmittedAtUtc, CorrelationMeta Correlation,
  TPayload Payload);

Only TPayload’s schema version increments; envelope stays compatible.


Storage & Read Model Evolution

  • Create audit_rm_v2.* tables; project with v2 handlers.
  • Swap via database synonyms or API layer routing.
  • Keep backfill & replay tools able to target a specific projection version:
audit-replay run --projection TimelineV2 --source sql --from ... --to ...

Observability per Version

  • Tag metrics/logs/traces with api.version, event.schema, exporter.version, importer.version.
  • Dashboards: version adoption curves, error rate by version, throughput by channel/version.

CI/CD & Testing

  • Contract tests (consumer-driven): publish Pacts for v1 and v2.
  • Golden payloads for v1/v2 with expected redactions.
  • Canary: route 1–5% to v2 endpoints before full cutover; compare result counts/latencies.
  • Fail-fast gates: any semantic divergence on canary blocks rollout.

Examples

HTTP v2 path

POST /v2/audit/exports
Accept: application/json
X-API-Version: 2

Bus topic

Topic: audit.records.v2
Payload: AuditRecordAppendedV2 { addedFields..., same core ids }

Exporter manifest bump

{ "schemaVersion": 2, "exporterVersion": "2.1.0", ... }

Guardrails

  • Privacy never loosens via versioning—older records remain with their original transforms; mappers do not unmask.
  • One-way: v2 readers must accept v1 data (with defaults), but v1 readers must not be fed v2 unless mapped.
  • Roll-forward: on failure, prefer rolling forward with a mapper fix rather than reverting core schemas.

Quick Checklist

  • Additive first; break → new path/topic.
  • Keep boundary mappers for legacy until sunset.
  • Version exporters/importers separately; record tool versions.
  • Publish capability discovery; instrument by version.
  • Plan deprecation with Deprecation/Sunset headers and a measured migration.

Operational Runbook & Quality Gates


Purpose

Codify day-2 operations and release safeguards for the Audit & Compliance BC:

  • DLQ remediation + safe replay (idempotent consumers).
  • SLOs & alerts focused on lag, export queues, and dedupe health.
  • Incident drill for “tamper suspected.”
  • Quality gates in CI to block schema-breaking or redaction regressions.
  • Platform checklists emphasizing outbox everywhere and idempotency at every boundary.

DLQ Workflow & Safe Replay

What lands in DLQ? Poison messages (schema drift, authorization failure, handler bug), age-outs, or repeatedly failing webhooks.

Golden rules

  • Never replay straight to handlers that cause side effects without the inbox check.
  • Treat DLQ as evidence: do not mutate payloads; annotate with metadata.
flowchart TD
  DLQ[(DLQ Queue/Topic)]
  INSPECT[Inspect & Classify]
  FIX[Fix Config/Code/Secrets]
  SANDBOX[Sandbox Reproduce]
  REPLAY[Replay via Ingest<br/>or Dedicated Replayer]
  SKIP[Quarantine/Skip]
  METRICS[Update Runbook Ticket + Metrics]

  DLQ --> INSPECT -->|auth/mTLS/secret| FIX
  INSPECT -->|schema mismatch| FIX
  INSPECT -->|one-off| REPLAY
  FIX --> SANDBOX --> REPLAY
  INSPECT -->|invalid or malicious| SKIP
  REPLAY --> METRICS
  SKIP --> METRICS

Procedure

  1. Triage
    • Pull last N DLQ entries; group by reason, event.name, tenantId.
    • If signature/age failures on webhooks → rotate secret or widen clock skew; do not replay until fixed.
  2. Root cause
    • Schema: check event SchemaVersion vs consumer.
    • Auth: verify scopes/tenant headers on inbound message.
    • Data: verify classification/redaction plan present.
  3. Fix
    • Config drift (secrets, endpoints) → rotate/patch; code bug → hotfix branch.
    • Add mapping layer if producer is on v1 while consumer expects v2.
  4. Sandbox reproduce
    • Use Replay Harness against non-prod to prove handler idempotency and success path.
  5. Replay
    • Prefer republish to the original topic with Event-Id intact; consumer inbox guarantees idempotency.
    • For ingest failures, re-enqueue through the Inbox (same dedupe gate).
  6. Quarantine
    • Messages confirmed malicious or irreparably invalid → keep in quarantine store with reason; mark skipped.
  7. Close
    • Update incident ticket with counts, cause, and prevention. Capture playbook deltas if needed.

Command (example)

audit-ops dlq list --consumer integrity-projection --since 1h
audit-ops dlq replay --consumer integrity-projection --filter event=audit.record_appended --max 500
audit-ops dlq quarantine --consumer webhooks --reason "sig_mismatch"
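The first golden rule (replay must pass through the same inbox/dedupe gate as normal ingest, keyed on the original Event-Id) can be sketched as follows. This is a minimal in-memory model; `InboxStore`, `replay_dlq_batch`, and the message shape are illustrative assumptions, not the service's actual API:

```python
# Hypothetical sketch: replay DLQ messages through the consumer's inbox
# (dedupe) gate so already-processed events are skipped, never re-handled.

def replay_dlq_batch(dlq_messages, inbox, publish):
    """Republish DLQ messages; skip any the consumer already processed."""
    replayed, skipped = 0, 0
    for msg in dlq_messages:
        event_id = msg["event_id"]     # original Event-Id kept intact
        if inbox.seen(event_id):       # inbox checked before any side effect
            skipped += 1
            continue
        publish(msg)                   # republish to the original topic
        replayed += 1
    return replayed, skipped


class InboxStore:
    """Minimal in-memory stand-in for the consumer's durable inbox."""
    def __init__(self, processed=()):
        self._processed = set(processed)

    def seen(self, event_id):
        return event_id in self._processed
```

In production the inbox lookup and the side effect would share a transaction; the sketch only shows the ordering the golden rule demands.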

SLOs & Alerts (operations dashboard)

| Signal | SLO / Threshold | Alerting | Notes |
| --- | --- | --- | --- |
| Ingest lag p95 (produce→record_appended) | ≤ 5s; warn @10s, page @30s | PagerDuty (critical) | KEDA scales on lag |
| Projector lag p95 (timeline) | ≤ 5s; warn @10s, page @20s | PagerDuty | Idempotent rebuild path |
| Export queue age p95 | ≤ 2m; warn @5m, page @10m | Ops channel | Exports pause if ingest lag breached |
| Outbox backlog | ≤ 10k pending/service | Ops channel | Relay health |
| DLQ rate | ~0; page if continuous > 1/min over 5m | PagerDuty | Investigate consumer |
| Dedupe ratio (hits/total) | Baseline per tenant; alert on ±30% drift | Ops channel | Spikes may indicate producer bug |
| Idempotency conflicts | 0; page if > 10/min | PagerDuty | Semantic conflicts (same key, diff payload) |
| Integrity seal duration p95 | ≤ 2s; warn @5s | Ops channel | Hash/Merkle compute health |
| Webhook drop (age/signature) | 0; warn on any | Security channel | Possible clock drift or key issue |

Attach exemplars to lag & duration metrics to deep-link into traces.
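The Dedupe-ratio signal above (alert on ±30% drift from the tenant baseline) reduces to a small check. A minimal sketch, where the function name and thresholds are illustrative:

```python
# Illustrative drift check: alert when the observed dedupe ratio
# (hits/total) leaves a ±30% band around the per-tenant baseline.

def dedupe_drift_alert(hits, total, baseline_ratio, tolerance=0.30):
    """Return True when the observed ratio drifts beyond the tolerance."""
    if total == 0 or baseline_ratio == 0:
        return False                    # no traffic or no baseline yet
    observed = hits / total
    drift = abs(observed - baseline_ratio) / baseline_ratio
    return drift > tolerance
```

In a real setup this would run per tenant against a rolling baseline, since absolute dedupe rates vary widely between producers.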


Incident Drill — “Tamper Suspected”

Scope: An operator or external auditor reports suspected tampering of the audit trail.

sequenceDiagram
  participant R as Reporter
  participant OC as On-Call
  participant IL as IntegrityLedger
  participant SQL as SQL/Segments
  participant ADLS as Proofs/Manifests
  participant SEC as Security/Compliance

  R->>OC: Tamper suspected (record/segment)
  OC->>IL: Fetch SegmentProof + ChainTip
  OC->>SQL: Get stored record(s) (read-only)
  OC->>ADLS: Download proof bundle & manifest
  OC->>OC: Offline verify membership + root signature
  alt mismatch
    OC->>SEC: Escalate SEV-1, freeze purge/exports
    OC->>IL: Seal current open segments
    OC->>OC: Forensic triage (access logs, approvals)
  else valid
    OC-->>R: Evidence OK (report)
  end

Checklist

  1. Stabilize
    • Flip read-only flag for purge/export if warranted (tenant-scoped).
    • Trigger immediate segment sealing for tails.
  2. Evidence capture
    • Export SegmentProof (signed), manifest, and membership proof for the record IDs in question.
    • Snapshot relevant logs (PII-free) and outbox/inbox offsets.
  3. Verification (offline)
    • Rebuild leaf from stored (redacted) record; recompute hash.
    • Fold Merkle path → root; verify signature (kid) from Key Vault JWKS.
    • Compare chainTip sequence for ordering consistency.
  4. Triage
    • If mismatch: SEV-1, rotate relevant signing keys, audit privileged access, confirm no DB writes beyond append-only path (check RLS + triggers).
    • Inspect DLQ for related anomalies.
  5. Communication
    • Trigger profile-defined breach hooks (if mandated).
    • Record minimal AuditorAccess entries for every verification step.
  6. Post-mortem
    • Root cause, scope, data at risk, timeline, remediations (e.g., stricter sealing cadence, stronger anchoring).
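The offline verification in step 3 (rebuild the leaf, fold the Merkle path to the root, compare against the signed root from the SegmentProof) can be sketched as follows. The domain-separation prefixes (`b"\x00"` for leaves, `b"\x01"` for interior nodes) and the path encoding are assumptions for the sketch, not the service's actual wire format:

```python
# Hedged sketch of offline Merkle membership verification.
import hashlib

def leaf_hash(record_bytes: bytes) -> bytes:
    # Assumed leaf encoding: domain-separated SHA-256 over the stored
    # (redacted) record bytes.
    return hashlib.sha256(b"\x00" + record_bytes).digest()

def fold_path(leaf: bytes, path) -> bytes:
    """path: list of (sibling_hash, sibling_is_right) tuples, leaf -> root."""
    node = leaf
    for sibling, sibling_is_right in path:
        pair = node + sibling if sibling_is_right else sibling + node
        node = hashlib.sha256(b"\x01" + pair).digest()
    return node

def verify_membership(record_bytes, path, expected_root) -> bool:
    # Signature verification of expected_root against the Key Vault JWKS
    # key (kid) would happen before this comparison; omitted here.
    return fold_path(leaf_hash(record_bytes), path) == expected_root
```

The point of the sketch is that verification needs only the stored record, the membership path, and the signed root: no live access to the ledger, which is what makes it usable as offline evidence.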

Quality Gates (CI/CD)

Contract & Compatibility

  • OpenAPI/HTTP: run breaking-change detector (/v2 required on majors); reject renames/removals.
  • Event schemas: JSON Schema + version bump check; ensure additive-first; publish to schema registry.
  • Pact/CDC: consumer-driven contracts for key consumers (projections, webhooks).

Redaction & Privacy

  • Golden redaction tests: fixtures assert HASH/MASK/DROP/TOKENIZE outputs per policy version and tenant salt.
  • Leak scanner: static & runtime log scans for PII patterns; fail build on findings.
  • Event hygiene: linter ensures no sensitive fields enter event payloads.
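A golden redaction test needs a deterministic fixture for each action. A minimal sketch of the four actions named above; the concrete hashing, masking, and tokenization rules here are assumptions for illustration, not the production policy:

```python
# Illustrative redaction actions keyed by policy: HASH / MASK / DROP /
# TOKENIZE, salted per tenant so outputs are stable within a tenant
# but unlinkable across tenants.
import hashlib
import hmac

def redact(value, action, tenant_salt):
    if action == "HASH":        # salted, deterministic per tenant
        return hmac.new(tenant_salt, value.encode(), hashlib.sha256).hexdigest()
    if action == "MASK":        # keep only the last 4 characters
        return "*" * max(len(value) - 4, 0) + value[-4:]
    if action == "DROP":        # field removed entirely
        return None
    if action == "TOKENIZE":    # stable surrogate from the salted hash
        digest = hmac.new(tenant_salt, value.encode(), hashlib.sha256)
        return "tok_" + digest.hexdigest()[:12]
    raise ValueError(f"unknown redaction action: {action}")
```

Golden tests then pin the exact outputs per policy version and tenant salt, so any change to the redaction code or policy shows up as a fixture diff in CI.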

Idempotency & Outbox

  • Arch tests (unit): every command handler must write the outbox in the same transaction as its state changes.
  • Ingest path tests: must call idempotency gate before append; simulate duplicate & semantic conflict.
  • Consumer skeleton: requires inbox write before side effects.
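The ingest-path test in the second bullet hinges on three outcomes from the idempotency gate: first delivery (append allowed), exact duplicate (no-op), and semantic conflict (same key, different payload). A minimal sketch; `IdempotencyGate` and `SemanticConflict` are illustrative names, not the service's types:

```python
# Illustrative idempotency gate: dedupe on key + payload hash, and
# surface semantic conflicts instead of silently overwriting.
import hashlib

class SemanticConflict(Exception):
    """Same idempotency key re-used with a different payload."""

class IdempotencyGate:
    def __init__(self):
        self._seen = {}                 # key -> payload digest

    def check(self, key, payload):
        """True = first delivery (append allowed); False = duplicate no-op."""
        digest = hashlib.sha256(payload).hexdigest()
        if key not in self._seen:
            self._seen[key] = digest
            return True
        if self._seen[key] != digest:   # same key, different payload
            raise SemanticConflict(key)
        return False
```

The CI test then drives all three paths: deliver once, redeliver the identical payload, and redeliver a mutated payload under the same key.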

Performance & Scale

  • Load smoke: N=10k records/min for 3 min; assert p95 ingest lag < 10s, no DLQ growth.
  • Seal timing: Merkle compute p95 < threshold on synthetic batch.

Migrations

  • Blue/green readiness: migrations create audit_rm_v2.* tables; no in-place destructive ops.
  • Backfill targetability: replay tool can target v2 projection; rehearsal job passes.

Release gates

  • Block if any:
    • contract break without /v2
    • redaction golden test diff
    • missing outbox write in a new aggregate handler
    • ingest path bypasses idempotency
    • DLQ rate > baseline after canary (5%)
    • SLO regression vs last release
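The block list above amounts to a simple any-of gate: one violated condition blocks the release. A minimal sketch, where the gate names and the `blocking_reasons` helper are illustrative:

```python
# Illustrative release gate: report every violated condition; an empty
# result means the release may proceed.

def blocking_reasons(checks):
    """checks maps gate name -> bool (True = violation present)."""
    return [name for name, violated in checks.items() if violated]
```

Emitting the full list of reasons, rather than a single boolean, makes the CI failure message actionable in one pass.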

Operational Checklists

Pre-deploy

  • ✅ Migrations applied (no destructive in-place changes).
  • ✅ Feature flags default safe; canary slice configured.
  • ✅ Outbox relay healthy; schema registry updated.
  • ✅ Webhook secrets rotated if expiring; mTLS pins validated.

Post-deploy (30–60 min)

  • ✅ Ingest lag p95 within SLO.
  • ✅ Outbox backlog stable/declining.
  • ✅ DLQ stable at baseline.
  • ✅ Dedupe ratio within ±10% of yesterday.
  • ✅ Canary v2 metrics match v1 (counts/latency).

On-call quick actions

  • 🔧 Scale consumers via KEDA override if lag spikes.
  • 🔧 Pause exports if ingest lag pages.
  • 🔧 Quarantine toxic subscriptions (webhooks) on repeated 401/403/signature failures.
  • 🔧 Trigger replay for known fixed DLQ batches.

Runbook Artifacts & Commands

# Lag overview
audit-ops metrics show --tenant t-9c8f1 --since 15m --signals ingest_lag_p95,projector_lag_p95,outbox_backlog

# Force seal (per tenant/category)
audit-ops integrity seal --tenant t-9c8f1 --category identity

# Start tamper verification for record
audit-ops integrity verify --tenant t-9c8f1 --record 01J9Z7B19X4P8ZQ7H6M4V6GQWY --output evidence.zip

# Pause exports (tenant)
audit-ops export pause --tenant t-9c8f1

Governance Notes

  • Keep all runbook actions audited with minimal self-audit entries (who/what/why).
  • Review this playbook quarterly; run game days (DLQ storm, seal failure, webhook outage, tamper drill).
  • Ensure every new boundary adheres to outbox everywhere and idempotency—no exceptions.