Audit & Compliance (Audit Trail) — HLD & DDD Blueprint¶
Executive Introduction¶
Every serious SaaS needs a trustworthy memory: a tamper-evident, tenant-scoped Audit Trail that records who did what, to which resource, when, where, and why/how—without leaking sensitive data. This blueprint defines a reusable Audit & Compliance bounded context you can drop into any ConnectSoft-style solution (microservices or modular monoliths). It is append-only, idempotent, multi-tenant, and compliance-ready (classification, redaction, retention, legal hold). Unlike typical one-off audit logs, this BC formalizes protocol-agnostic ingest (HTTP, gRPC, Service Bus / MassTransit, and Orleans actors) while keeping a single canonical domain model.
This cycle establishes the scope, non-goals, boundaries, and the Ubiquitous Language that downstream teams will rely on in subsequent cycles.
Scope (What this context does)¶
- Immutable, append-only audit ingestion of domain-significant actions and access decisions.
- Correlation across services (trace/request/causation IDs) with strict per-tenant isolation.
- Data classification at write and redaction by policy at both write and read.
- Retention policies & legal holds for compliance and eDiscovery.
- Integrity proofing (hash chain / segment sealing) for tamper-evidence.
- Export for auditors/eDiscovery with signed manifests and integrity proofs.
- Multi-transport ingest via HTTP, gRPC, Service Bus (MassTransit), and Orleans actors, all normalized into a single Inbox/Dedupe pipeline.
Non-Goals (Deliberate exclusions)¶
- Not an analytics warehouse (we provide filtered queries & exports, not OLAP).
- Not a SIEM replacement (can feed SIEM via streams/exports).
- Not a general event store (only audit-grade records).
- Not user activity tracking for growth marketing (belongs to Product Analytics BC).
Context in the Platform (high-level)¶
flowchart LR
subgraph Producers - Domain BCs
A[Identity & Access]
B[Billing & Payments]
C[Config & Feature Flags]
D[Core Business Domains]
end
A --> H
B --> H
C --> H
D --> H
subgraph Inbound Adapters
H[HTTP Ingest API]
G[gRPC Ingest Service]
M[Service Bus - MassTransit]
O[Orleans Actor Ingest]
end
H --> I
G --> I
M --> I
O --> I
subgraph AU[Audit & Compliance BC]
I[(Inbox + Dedupe)]
S[(Append-Only Store)]
CL[Classification/Redaction]
RP[Retention/Legal Hold]
IL[Integrity Ledger]
EX[Export Service]
I --> S
S <--> CL
S <--> RP
S <--> IL
S --> EX
end
EX --> E[(Auditors / eDiscovery)]
AU --> SIEM[(SIEM / SOC)]
AU --> OBS[(Observability & Compliance Dashboards)]
Boundary & Interfaces (concise)¶
Inbound (multiple transports; same canonical envelope)¶
- HTTP: `POST /audit/records` (single/batch) with `Idempotency-Key`, `Tenant-Id`, correlation headers.
- gRPC: `AuditIngest.Append(AuditRecordBatch)`; metadata carries `Tenant-Id`, `Idempotency-Key`, correlation.
- Service Bus (MassTransit): topic/queue `audit.records.v1`; message header requirements identical to HTTP/gRPC.
- Orleans: `IAuditIngestGrain.Append(AuditRecordBatch, IngestMetadata)` for high-trust in-cluster producers.
All transports first land in Inbox + Dedupe, then the canonical Append-Only Store. No path bypasses the inbox.
Outbound¶
- Query APIs for timelines/filters (redacted by default).
- `ExportJob` for controlled extractions (signed manifests + integrity proofs).
- Optional stream/webhook for high-value audit events to external tools.
Transport Normalization Matrix¶
| Dimension | HTTP | gRPC | Service Bus (MassTransit) | Orleans |
|---|---|---|---|---|
| Delivery | Sync (request/response) | Sync (unary/stream) | Async (brokered) | Sync intra-cluster |
| Envelope | JSON | Protobuf | Protobuf/JSON (contracted) | Protobuf/POCO |
| Required Meta | `Tenant-Id`, `Idempotency-Key`, `Trace-Id` | Same via metadata | Same via headers | Same via args |
| Dedupe Key | `Tenant-Id` + `Idempotency-Key` | Same | Same | Same |
| Backpressure | HTTP 429 + retry | gRPC status + retry | Broker DLQ/deferral | Grain throttling |
Ubiquitous Language (UL)¶
Use these terms exactly in code, docs, tests, and conversations. Treat them as canonical.
| Term | Type | Concise Definition | Notes / Examples |
|---|---|---|---|
| AuditStream | Aggregate Root | Logical, tenant-scoped channel of audit records, often partitioned by category/domain. | e.g., tenant:abc / category:identity |
| AuditRecord | Entity | Immutable, append-only fact (actor, action, resource, decision, metadata). | Idempotent, classified, correlated. |
| DataClass | Enum | Sensitivity tags for fields. | Public, Internal, Personal, Secret, Credential, PHI |
| RedactionRule | VO/Policy | Field-level handling by class: hash, mask, drop, tokenize. | Applies at write and read. |
| RetentionPolicy | Aggregate Root | Rules for duration/purge per tenant/category/stream. | E.g., 365d standard, HIPAA overlay longer. |
| LegalHold | Aggregate Root | Suspension of purge for matching records/segments. | Blocks retention purge; fully audited. |
| ExportJob | Aggregate Root | Purpose-limited extraction with manifest & proofs. | JSONL + signed manifest. |
| IntegrityProof | Value Object | Tamper-evidence for segments/exports (hash chain/Merkle). | Externally verifiable. |
| Correlation | Concept | End-to-end linkage (traceId, requestId, causationId). | Mandatory in ingest metadata. |
Canonical AuditRecord — Shape (preview; formal schema later)¶
{
"id": "ULID/UUID (immutable)",
"tenantId": "string",
"occurredAtUtc": "ISO-8601",
"actor": { "type": "user|service", "id": "string", "display": "optional" },
"action": "string",
"resource": { "type": "string", "id": "string", "path": "optional" },
"decision": { "outcome": "allow|deny|n/a", "reason": "optional" },
"context": { "ip": "optional", "userAgent": "optional", "clientApp": "optional" },
"before": "optional (diff-safe, redaction-aware)",
"after": "optional (diff-safe, redaction-aware)",
"classes": ["DataClass"],
"correlation": { "traceId": "string", "requestId": "string", "causationId": "string" },
"idempotencyKey": "string (required for online ingest)",
"integrity": { "segmentId": "string", "hash": "string" }
}
Required Metadata & Idempotency (applies to all transports)¶
- `Tenant-Id` — required (header/metadata/param depending on transport).
- `Idempotency-Key` — required for online ingest; dedupe per tenant.
- `Trace-Id` / `Request-Id` — strongly recommended (platform default: required).
- `Producer` — service name/version for provenance.

Duplicates (same `Tenant-Id` + `Idempotency-Key`) are acknowledged as success with no second write.
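The dedupe rule above can be sketched as a tiny in-memory inbox. This is illustrative only: `InboxDedupe` and its dict-backed store are assumptions for the sketch, not the real Inbox/Dedupe pipeline.

```python
# Illustrative in-memory sketch of the Inbox + Dedupe rule: a write is keyed
# by (tenant_id, idempotency_key); duplicates ACK with the original record id.
import uuid


class InboxDedupe:
    def __init__(self):
        self._seen = {}   # (tenant_id, idempotency_key) -> record id
        self.store = []   # stand-in for the append-only store

    def append(self, tenant_id, idempotency_key, record):
        """Return (record_id, created). Duplicate writes are no-ops."""
        key = (tenant_id, idempotency_key)
        if key in self._seen:
            return self._seen[key], False        # acknowledged, no second write
        record_id = str(uuid.uuid4())
        self.store.append({**record, "id": record_id, "tenantId": tenant_id})
        self._seen[key] = record_id
        return record_id, True
```

Note that the key is scoped per tenant: the same `Idempotency-Key` from two different tenants produces two distinct writes.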
Tenancy & Isolation (principles)¶
- Every write/read is scoped to a single tenant by boundary policy.
- Cross-tenant reads are disallowed except platform-admin with explicit justification and enhanced auditing.
- Storage, indexes, and exports are partitioned by tenant and/or category.
Data Classification & Redaction (principles)¶
- Classification defaults may be inferred; producers can hint, but the auditor decides.
- Redaction applies at write (minimize risk) and on read (least privilege).
- Secrets/credentials are never stored in cleartext; hash/tokenize only.
Retention & Legal Hold (principles)¶
- Purge eligibility derives from RetentionPolicy; purge operations are themselves audited.
- LegalHold blocks purge of matching records/segments until explicitly released.
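A minimal sketch of how these two principles compose, assuming a simple per-tenant, per-category hold scope (`purge_eligible` and its field names are hypothetical):

```python
# Hypothetical eligibility check: a record may be purged only when its
# retention window has elapsed AND no active legal hold matches its scope.
from datetime import datetime, timedelta, timezone


def purge_eligible(record, retention_days, active_holds, now=None):
    now = now or datetime.now(timezone.utc)
    occurred = datetime.fromisoformat(record["occurredAtUtc"].replace("Z", "+00:00"))
    if occurred + timedelta(days=retention_days) > now:
        return False  # retention window not yet elapsed
    for hold in active_holds:  # LegalHold blocks purge until released
        if (hold["tenantId"] == record["tenantId"]
                and hold["category"] == record.get("category")):
            return False
    return True
```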
Integrity & Tamper-Evidence (principles)¶
- Records link into integrity segments; sealed segments produce IntegrityProofs.
- Exports include signed manifests with segment proofs for third-party verification.
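The hash-chain principle can be illustrated in a few lines. This is a sketch of the idea, not the IntegrityLedger implementation: each record's hash covers a canonical serialization plus the previous hash, so altering any record breaks every hash after it.

```python
# Sketch of a hash chain over audit records: canonical, order-stable inputs
# (sorted keys, fixed separators) feed SHA-256 together with the prior hash.
import hashlib
import json


def chain_records(records, prev_hash="0" * 64):
    hashes = []
    for rec in records:
        canonical = json.dumps(rec, sort_keys=True, separators=(",", ":"))
        prev_hash = hashlib.sha256((prev_hash + canonical).encode()).hexdigest()
        hashes.append(prev_hash)
    return hashes  # hashes[-1] can serve as the sealed segment's proof root
```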
Seed Glossary (for downstream teams)¶
- Actor — The principal performing the action (human user, service, job).
- Action — Verb describing the operation (e.g., `User.Login`, `Payment.Refunded`).
- Resource — Business object acted upon (type + ID; optional path).
- Decision — Access/policy outcome with optional reason.
- Context — Environmental info aiding investigations (IP, UA, client).
- Timeline — Time-ordered view of audit records per tenant/resource.
- Segment — Contiguous span of records used for integrity sealing.
- Backfill — Trusted bulk ingestion path for historical records.
- eDiscovery — Formal process to collect/export records for legal/audit.
- Purpose Limiting — Access granted only for a declared, auditable purpose.
Example Narratives (to align intent)¶
- “As an Auditor, I need a redacted timeline of `User Management` actions in Tenant `A` between two dates, filterable by actor and resource.”
- “As a Security Engineer, I need to verify integrity proofs for a specific export to confirm immutability.”
- “As a Compliance Officer, I must place a Legal Hold on all records that mention Case `#12345`.”
Aggregate Map (first pass)¶
Context Map¶
flowchart LR
subgraph AUD[AuditStream BC - Microservice]
AUD_AR[Aggregate: AuditStream]
end
subgraph RET[RetentionPolicy BC - Microservice]
RET_AR[Aggregate: RetentionPolicy]
end
subgraph HOLD[LegalHold BC - Microservice]
HOLD_AR[Aggregate: LegalHold]
end
subgraph EXP[Export BC - Microservice]
EXP_AR[Aggregate: ExportJob]
end
subgraph CLASS[ClassificationPolicy BC - Microservice]
CLASS_AR[Aggregate: ClassificationPolicy]
end
subgraph INTG[IntegrityLedger BC - Microservice]
INTG_AR[Aggregate: IntegrityLedger]
end
AUD -->|audit.record_appended| RET
AUD -->|audit.record_appended| INTG
HOLD -->|legalhold.applied| RET
RET -->|retention.window_elapsed| AUD
EXP -->|export.completed| INTG
CLASS -->|classification.policy_changed| AUD
AUD -->|audit.timeline.requested| EXP
Aggregates & Microservice Boundaries¶
AuditStream (BC / Aggregate Root)
The append-only, tenant-scoped stream of AuditRecord facts. Owns write-path invariants, classification hooks, and correlation.
Invariants
- Append-only; no updates or deletes.
- Tenant isolation by `tenantId`.
- Idempotency on (`tenantId`, `idempotencyKey`).
- Correlation metadata required.
Commands
- `AppendAuditRecord` / `BatchAppendRecords`
- `SealCurrentSegment`
Events
- `audit.record_appended`
- `audit.segment_seal_requested`
RetentionPolicy (BC / Aggregate Root) Defines purge windows and scope per tenant/category; computes eligibility and orchestrates purges.
Invariants
- Cannot shorten windows below compliance minimums.
- Versioned, forward-dated revisions.
Commands
- `PutRetentionPolicy`
- `EvaluatePurgeEligibility`
- `ExecuteRetentionPurge`
Events
- `retention.policy_changed`
- `retention.window_elapsed`
- `retention.purged`
LegalHold (BC / Aggregate Root) Places and releases legal holds on matching record sets.
Invariants
- Must specify scope and caseId.
- Blocks purge until release.
Commands
- `PlaceLegalHold`
- `ReleaseLegalHold`
Events
- `legalhold.applied`
- `legalhold.released`
ExportJob (BC / Aggregate Root) Purpose-limited extraction of audit data with signed manifest and proofs.
Invariants
- Must declare purpose, requestor, redaction level, and scope.
- Immutable post-completion.
Commands
- `StartExport`
- `FinalizeExport`
- `CancelExport`
Events
- `export.started`
- `export.chunk_ready`
- `export.completed`
- `export.failed`
ClassificationPolicy (BC / Aggregate Root)
Manages DataClass → RedactionRule mappings and overrides.
Invariants
- Policies are versioned, forward-only.
- Reserved classes must not allow cleartext.
Commands
- `DefineClassificationPolicy`
- `UpdateRedactionRules`
Events
- `classification.policy_changed`
IntegrityLedger (BC / Aggregate Root) Maintains tamper-evidence via hash chains/Merkle segments.
Invariants
- Sealed segments cannot be reopened.
- Canonical, order-stable hashing inputs.
Commands
- `SealSegment`
- `ComputeIntegrityProof`
- `AttachExportProof`
Events
- `integrity.segment_sealed`
- `integrity.proof_computed`
Cross-Context Policies/Sagas¶
- Segment Sealing Policy: triggered by `audit.record_appended` thresholds; leads to `SealSegment`.
- Retention Enforcement Saga: triggered by `retention.window_elapsed`; evaluates eligibility, purges, emits `retention.purged`.
- Legal Hold Guard Policy: triggered by `legalhold.applied` or `legalhold.released`; updates indexes to block or allow purge.
- Export Proof Attachment Policy: triggered by `export.completed`; attaches proof to manifest and verification endpoint.
Justification¶
- One command → one aggregate ensures strong invariants without distributed transactions.
- Cross-context effects expressed as events; sagas enforce coordination.
- Separation allows scaling, isolation, and compliance overlays per BC.
Event Inventory¶
- `audit.record_appended`
- `audit.segment_seal_requested`
- `classification.policy_changed`
- `retention.policy_changed`
- `retention.window_elapsed`
- `retention.purged`
- `legalhold.applied`
- `legalhold.released`
- `export.started`
- `export.chunk_ready`
- `export.completed`
- `export.failed`
- `integrity.segment_sealed`
- `integrity.proof_computed`
AuditRecord model (VO + entity shape)¶
Purpose¶
Define the canonical, append-only AuditRecord used across all inbound transports (HTTP, gRPC, Service Bus/MassTransit, Orleans). The model preserves investigative value while minimizing exposure through classification and redaction hooks. It enforces tenant isolation, idempotency, and correlation on the write path.
Conceptual Model¶
classDiagram
class AuditRecord {
+string id
+string tenantId
+Instant occurredAtUtc
+Actor actor
+string action
+ResourceRef resource
+Decision decision
+Context context
+Delta before
+Delta after
+string[] classes
+Correlation correlation
+string idempotencyKey
+IntegrityRef integrity
}
class Actor {
+ActorType type // user|service|job
+string id
+string display [optional]
+string[] roles [optional]
}
class ResourceRef {
+string type
+string id
+string path [optional]
+string tenantScopedId [optional]
}
class Decision {
+DecisionOutcome outcome // allow|deny|na
+string reason [optional]
+map<string,string> attributes [optional]
}
class Context {
+string ip [optional]
+string userAgent [optional]
+string clientApp [optional]
+string location [optional] // ISO-3166 + geo-hint (redacted)
}
class Delta {
+map<string,any> fields
+map<string,RedactionHint> hints
}
class Correlation {
+string traceId
+string requestId
+string causationId [optional]
+string producer // service/version
}
class IntegrityRef {
+string segmentId [optional]
+string hash [optional]
}
Field Inventory & Rules¶
| Field | Type | Required | Rules / Invariants |
|---|---|---|---|
| `id` | ULID/UUID | Yes | Immutable; generated server-side if absent. |
| `tenantId` | String | Yes | Must match authenticated tenant; used for partitioning/isolation. |
| `occurredAtUtc` | ISO-8601 | Yes | Server-normalized to UTC; must not be in the future beyond tolerance (e.g., 10m). |
| `actor` | `Actor` | Yes | `type ∈ {user, service, job}`; `id` non-empty. |
| `action` | String | Yes | Namespaced verb, e.g., `User.Login`, `Payment.Refunded`. |
| `resource` | `ResourceRef` | Yes | `type` and `id` required; `tenantScopedId` optional helper. |
| `decision` | `Decision` | No | If present, `outcome ∈ {allow, deny, na}`. |
| `context` | `Context` | No | PII-safe; IP normalization (IPv4/IPv6), UA truncation. |
| `before` / `after` | `Delta` | No | Diff-safe; classification/redaction rules applied per-field. |
| `classes` | Array | No | Sensitivity tags; server may add/override. |
| `correlation` | `Correlation` | Yes | `traceId`, `requestId`, and `producer` required. |
| `idempotencyKey` | String | Yes | Scoped by `tenantId`; duplicate writes are no-ops. |
| `integrity` | `IntegrityRef` | No | Populated by sealing pipeline (post-append). |
JSON Canonical Envelope (ingest/at-rest)¶
{
"id": "01J9Z7B19X4P8ZQ7H6M4V6GQWY",
"tenantId": "t-9c8f1",
"occurredAtUtc": "2025-09-30T12:34:56Z",
"actor": { "type": "user", "id": "u-12345", "display": "Jane Admin", "roles": ["admin"] },
"action": "User.PasswordChanged",
"resource": { "type": "User", "id": "u-12345" },
"decision": { "outcome": "allow", "reason": "MFA_OK" },
"context": { "ip": "203.0.113.42", "userAgent": "Chrome/140", "clientApp": "Portal" },
"before": { "fields": { "password.lastChangedAt": "2025-06-01T10:00:00Z" } },
"after": { "fields": { "password.lastChangedAt": "2025-09-30T12:34:56Z" } },
"classes": ["Internal"],
"correlation": { "traceId": "tr-abc", "requestId": "rq-xyz", "producer": "iam-service@1.12.3" },
"idempotencyKey": "iam:pwd-change:u-12345:2025-09-30T12:34:56Z",
"integrity": null
}
Protobuf (transport-friendly core)¶
syntax = "proto3";
package audit.v1;
enum ActorType { ACTOR_TYPE_UNSPECIFIED = 0; USER = 1; SERVICE = 2; JOB = 3; }
enum DecisionOutcome { DECISION_OUTCOME_UNSPECIFIED = 0; ALLOW = 1; DENY = 2; NA = 3; }
message Actor { ActorType type = 1; string id = 2; string display = 3; repeated string roles = 4; }
message ResourceRef { string type = 1; string id = 2; string path = 3; string tenant_scoped_id = 4; }
message Decision { DecisionOutcome outcome = 1; string reason = 2; map<string,string> attributes = 3; }
message Context { string ip = 1; string user_agent = 2; string client_app = 3; string location = 4; }
message RedactionHint { string rule = 1; } // hash|mask|drop|tokenize (+params later)
message Delta { map<string,string> fields = 1; map<string,RedactionHint> hints = 2; }
message Correlation { string trace_id = 1; string request_id = 2; string causation_id = 3; string producer = 4; }
message IntegrityRef { string segment_id = 1; string hash = 2; }
message AuditRecord {
string id = 1;
string tenant_id = 2;
string occurred_at_utc = 3;
Actor actor = 4;
string action = 5;
ResourceRef resource = 6;
Decision decision = 7;
Context context = 8;
Delta before = 9;
Delta after = 10;
repeated string classes = 11;
Correlation correlation = 12;
string idempotency_key = 13;
IntegrityRef integrity = 14;
}
Write-Path Invariants¶
- Append-only
  - No updates/deletes to an existing `AuditRecord`.
  - Server rejects requests attempting to mutate existing IDs.
- Tenant isolation
  - `tenantId` derived from auth/boundary must equal payload `tenantId`.
  - Records are partitioned by `tenantId` and (optionally) `category` / `resource.type`.
- Correlation required
  - `correlation.traceId`, `correlation.requestId`, `correlation.producer` must be present.
  - If missing, the gateway injects them; producers should still send.
- Idempotency
  - (`tenantId`, `idempotencyKey`) uniquely identifies a logical write.
  - On duplicate, server returns 200/OK with the original `id` and skips a second append.
- Classification & redaction on write
  - Server applies `ClassificationPolicy` and transforms fields in `before`/`after`/`context` as needed (hash/mask/drop/tokenize).
  - Producers may send hints; the server’s policy is authoritative.
- Clock skew tolerance
  - `occurredAtUtc` may differ within a small window (e.g., ±10 minutes) from server time; beyond that is rejected or flagged.
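The clock-skew rule reduces to a simple window check. This sketch assumes the ±10 minute default named above; real validation would also distinguish "reject" from "flag".

```python
# Sketch of the clock-skew check: accept occurredAtUtc only when it falls
# within a configurable tolerance (default ±10 minutes) of server time.
from datetime import datetime, timedelta


def within_skew(occurred_at_utc, now, tolerance_minutes=10):
    occurred = datetime.fromisoformat(occurred_at_utc.replace("Z", "+00:00"))
    return abs(occurred - now) <= timedelta(minutes=tolerance_minutes)
```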
Validation & Normalization¶
- Action naming: `Domain.Verb[.PastParticiple]` (e.g., `User.Login`, `Payment.Refunded`).
- Resource identity: canonical `type` + `id`; optional hierarchical `path` (e.g., `/orgs/12/users/u-12345`).
- IP: normalized textual form; strip ports; IPv6 compressed.
- User agent: truncated to max length; known sensitive tokens removed.
- Delta: only diff fields; large payloads must be summarized or tokenized.
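The action-naming convention can be enforced with a simple pattern check. The regex below is an assumption about what counts as a valid segment (PascalCase identifiers), not a spec from this document.

```python
# Hypothetical validator for the Domain.Verb[.PastParticiple] convention:
# two or three dot-separated PascalCase segments.
import re

ACTION_PATTERN = re.compile(
    r"^[A-Z][A-Za-z0-9]*\.[A-Z][A-Za-z0-9]*(\.[A-Z][A-Za-z0-9]*)?$"
)


def is_valid_action(action):
    return ACTION_PATTERN.fullmatch(action) is not None
```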
Indexing & Storage Hints (at-rest)¶
Primary clustering/partitioning: `(tenantId, occurredAtUtc)` for time-ordered scans.

Secondary indexes:

- `(tenantId, resource.type, resource.id, occurredAtUtc)`
- `(tenantId, actor.id, occurredAtUtc)`
- `(tenantId, action, occurredAtUtc)`

Dedupe store: `(tenantId, idempotencyKey) → recordId`

Segments (for integrity sealing):

- Rolling segment map keyed by `(tenantId, category/resource.type)` with counters and byte budgets.
Security & Privacy Considerations¶
- No secrets in cleartext (API keys, tokens, credentials) — must be hashed or dropped per policy.
- Minimal context — only data needed for investigations (PII minimization).
- Purpose-limited access — elevated, less-redacted views require dedicated scopes and are fully audited.
- Backfill path — bulk/historical ingestion requires elevated trust and is isolated from online ingest.
Example Idempotency Keys (guidance)¶
- Deterministic pattern: `producer:domain:resourceId:action:occurredAtUtc`, e.g., `iam:pwd-change:u-12345:User.PasswordChanged:2025-09-30T12:34:56Z`
- For batch jobs, include batch ID and item ordinal if time is not unique.
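The deterministic pattern can be sketched as a small helper; the function names are hypothetical, and the batch variant simply appends batch ID and item ordinal as the guidance suggests.

```python
# Hypothetical helpers for deterministic idempotency keys following
# producer:domain:resourceId:action:occurredAtUtc.
def idempotency_key(producer, domain, resource_id, action, occurred_at_utc):
    return ":".join([producer, domain, resource_id, action, occurred_at_utc])


def batch_idempotency_key(base, batch_id, ordinal):
    # For batch jobs where the timestamp alone is not unique.
    return f"{base}:{batch_id}:{ordinal}"
```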
Error Handling (ingest)¶
- `409 Conflict` for cross-tenant mismatch or immutable-field violation.
- `422 Unprocessable` for schema/validation/classification rejections.
- `200 OK` (or `202 Accepted`) for idempotent duplicates with original `id`.
- Transport-specific equivalents in gRPC/Service Bus/Orleans with consistent error codes/messages.
Data Classification & Redaction¶
Purpose¶
Establish a consistent, enforceable model for sensitivity classification and field-level redaction that applies uniformly to all ingested audit data—across transports and producers. The goals are to (1) minimize exposure at the moment of write and (2) enforce least-privilege at read time while preserving investigative value and integrity.
Conceptual Model¶
classDiagram
class ClassificationPolicy {
+string id
+int version
+map<string,DataClass> defaultByField
+map<string,RedactionRule> rulesByClass
+map<string,RedactionRule> overridesByField
+Instant effectiveFromUtc
+string author
}
class RedactionRule {
+RedactionKind kind // HASH | MASK | DROP | TOKENIZE | NONE
+map<string,string> params // e.g., hash=SHA256, mask=showLast=4
}
class RedactionPlan {
+map<string,RedactionRule> byField
+int policyVersion
}
class DataClass {
<<Enumeration>>
PUBLIC
INTERNAL
PERSONAL // PII lite
SENSITIVE // PII/financial
CREDENTIAL // secrets/tokens
PHI // health data
}
class RedactionKind {
<<Enumeration>>
NONE
HASH
MASK
DROP
TOKENIZE
}
- ClassificationPolicy (Aggregate Root, separate BC): authoritative mapping from field → DataClass and DataClass → RedactionRule, with optional field-level overrides.
- RedactionPlan: a compiled, per-write snapshot derived from the active policy version; stored alongside the record (policy version only) to make transformations reproducible.
Principles¶
- Classify at Write: every candidate field (context, before/after deltas, selected headers) is classified and transformed immediately according to the active policy.
- Redact at Write & Read: write-time transformation minimizes stored risk; read-time gates enforce least-privilege (default redacted views; elevated scopes may reveal less-redacted data, never more than policy allows).
- Server Is Authoritative: producers may attach hints; the service computes final classification and redaction.
- Forward-Only Versioning: policies are versioned; a new version never “unmasks” previously dropped data. Historical records keep their write-time transformations.
Default DataClass & Rule Set (baseline)¶
| DataClass | Default Redaction Rule | Typical Fields |
|---|---|---|
| `PUBLIC` | `NONE` | resource type, action name, coarse timestamps |
| `INTERNAL` | `MASK(showLast=4)` | internal IDs, non-PII metadata |
| `PERSONAL` | `HASH(SHA256+salt)` | emails, phone numbers (normalized before hash) |
| `SENSITIVE` | `MASK(showLast=2)` | partial addresses, last4 PAN (never full PAN) |
| `CREDENTIAL` | `DROP` | API keys, auth tokens, passwords |
| `PHI` | `TOKENIZE(namespace=phi)` | diagnosis codes, lab values (token reference stored) |
Jurisdictional overlays (e.g., HIPAA/GDPR) are implemented as policy presets and constraints; `CREDENTIAL` must never be stored in cleartext.
Write Path (classification then transform)¶
1. Normalize: standardize field formats (emails lowercased, phones E.164, IP canonical, UA trimmed).
2. Classify: map fields → `DataClass` using `defaultByField`, producer hints, and heuristics.
3. Plan: compile `RedactionPlan` (field → `RedactionRule`) using class rules + field overrides.
4. Transform: apply rules in-place to `before`, `after`, `context`, and selected headers; attach `policyVersion`.
5. Persist: store transformed payload only; do not persist secrets in cleartext.
Pseudocode (illustrative)
var policy = _policyProvider.GetActive(tenantId);
var plan = policy.Compile(record);
foreach (var (field, value) in record.TargetableFields())
{
var rule = plan.For(field);
record[field] = _redactor.Apply(rule, value);
}
record.PolicyVersion = policy.Version;
_appendOnlyStore.Write(record);
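The four redaction kinds can be made concrete with a small sketch. This is not the production redactor: the `redact` function and parameter names like `show_last` are assumptions mirroring the baseline rule table, and the `HASH:sha256:` prefix follows the stored-example convention used later in this section.

```python
# Illustrative implementations of NONE / HASH / MASK / DROP / TOKENIZE.
import hashlib


def redact(kind, value, salt="", show_last=4, namespace="tok"):
    if kind == "NONE":
        return value
    if kind == "HASH":      # one-way; real salt would come from HSM/Key Vault
        return "HASH:sha256:" + hashlib.sha256((salt + value).encode()).hexdigest()
    if kind == "MASK":      # keep only the trailing characters visible
        kept = value[-show_last:] if show_last else ""
        return "*" * max(len(value) - show_last, 0) + kept
    if kind == "DROP":      # field removed entirely (e.g., CREDENTIAL)
        return None
    if kind == "TOKENIZE":  # stable reference; real impl vaults the value
        digest = hashlib.sha256((namespace + value).encode()).hexdigest()[:12]
        return f"{namespace}:{digest}"
    raise ValueError(f"unknown redaction kind: {kind}")
```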
Read Path (least-privilege by default)¶
- Default view (redacted): returns the stored (already transformed) values.
- Elevated view (e.g., Auditor scope): may further mask less (e.g., show more digits), but never reconstruct dropped/hashed/tokenized data.
- Justification logging: elevated reads require purpose scope; each access is itself audited.
Field Targeting¶
- Always subject to policy: `context.ip`, `context.userAgent`, `context.clientApp`, `resource.path`, `before.fields[*]`, `after.fields[*]`.
- Header/meta candidates: values mirrored from inbound headers that could include PII (e.g., email in actor display) must be classified too.
- Non-targeted: structural fields (`id`, `tenantId`, `occurredAtUtc`, `action`, `resource.type`/`id`) remain visible unless specifically overridden by policy.
Tokenization & Hashing¶
- Hashing: one-way (e.g., `SHA256`) with per-tenant salt kept in HSM/Key Vault; compare-only scenarios (e.g., “does this email appear?”) are supported via normalized input + same salt.
- Tokenization: exchange sensitive value for a stable token stored in a separate vault; only authorized services can resolve tokens, and resolution events are audited.
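The compare-only scenario works because both sides normalize the input identically before hashing. A sketch with a hypothetical per-tenant salt (in production the salt lives in an HSM/Key Vault, never in code):

```python
# Sketch of compare-only matching: normalize exactly as the write path did,
# hash with the same per-tenant salt, then compare digests.
import hashlib


def hash_email(email, tenant_salt):
    normalized = email.strip().lower()  # same normalization as the write path
    return hashlib.sha256((tenant_salt + normalized).encode()).hexdigest()


def email_appears(stored_digest, candidate_email, tenant_salt):
    return hash_email(candidate_email, tenant_salt) == stored_digest
```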
Policy Commands & Events (ClassificationPolicy BC)¶
Commands

- `DefineClassificationPolicy(policyDraft)`
- `UpdateRedactionRules(patch)`
- `DeprecatePolicy(policyId)` (optional)

Events

- `classification.policy_changed` (includes `version`, effective window)
- `classification.policy_deprecated`

Constraints

- Cannot weaken rules for classes like `CREDENTIAL` once established.
- `effectiveFromUtc` must be ≥ now + safety window; no retroactive “unmasking.”
Transport-Agnostic Enforcement¶
- HTTP/gRPC: service enforces in write pipeline; producers may include `classificationHints`.
- Service Bus (MassTransit): consumer applies policy at consume time before append.
- Orleans: ingest grain uses the same policy provider; compiled plan cached per policy version.
Testing & Verification¶
- Golden cases: fixture payloads for each DataClass; snapshot expected redactions.
- Property tests: secrets never appear in logs/DB; fuzz inputs for leakage.
- Perf checks: rule application stays sub-millisecond per field; caching for policy and compiled plans.
- Compliance checks: CI gate ensuring reserved classes (`CREDENTIAL`, `PHI`) have non-weakened rules.
Example¶
Input (producer hints email as PERSONAL):
{
"context": { "ip": "2001:db8::1", "userAgent": "Chrome/140", "clientApp": "Portal" },
"before": { "fields": { "email": "Alice@example.com" } },
"after": { "fields": { "email": "Alice@example.com" } },
"classes": [],
"hints": { "before.fields.email": "PERSONAL", "after.fields.email": "PERSONAL" }
}
Stored (write-time transformed; policy v7):
{
"context": { "ip": "2001:db8::1", "userAgent": "Chrome/140", "clientApp": "Portal" },
"before": { "fields": { "email": "HASH:sha256:7b1d..."} },
"after": { "fields": { "email": "HASH:sha256:7b1d..."} },
"policyVersion": 7
}
Read (default): returns hashed values. Read (auditor scope): still hashed; may reveal additional metadata (e.g., normalization method), but not raw email.
Operational Notes¶
- Caching: cache `ClassificationPolicy` by `(tenantId, version)`; invalidate on `policy_changed`.
- Observability: emit counters for rule application (`hash`, `mask`, `drop`, `tokenize`) and rejections.
- Migrations: new policies apply only to future writes; optional re-scrub jobs can further restrict historical data (never weaken).
- Docs & SDKs: client SDKs expose `classificationHints` to help the server classify, but do not implement redaction logic.
Commands & Invariants (write path)¶
Purpose¶
Define the command surface for all write paths across the Audit & Compliance bounded contexts and codify the invariants enforced on each append or policy change. These rules are transport-agnostic (HTTP, gRPC code-first, Service Bus/MassTransit, Orleans); all ingress normalizes through the Inbox + Dedupe pipeline before hitting aggregates.
Command Catalog (by BC)¶
AuditStream BC
- `AppendAuditRecord(record)`
- `BatchAppend(records[])`
- `SealCurrentSegment()` (policy-triggered; typically raised as an event request)
ClassificationPolicy BC
- `DefineClassificationPolicy(policyDraft)`
- `UpdateRedactionRules(patch)`
RetentionPolicy BC
- `PutRetentionPolicy(policySpec)`
- `EvaluatePurgeEligibility(scope)`
- `ExecuteRetentionPurge(scope)`
LegalHold BC
- `PlaceLegalHold(holdSpec)`
- `ReleaseLegalHold(holdId)`
Export BC
- `StartExport(criteria)`
- `FinalizeExport(jobId)`
- `CancelExport(jobId)`
IntegrityLedger BC
- `ComputeIntegrityProof(segmentId)`
- `SealSegment(segmentKey)`
- `AttachExportProof(jobId, proofRef)`
Commands are handled one aggregate instance at a time. Cross-context effects happen via domain events and policies/sagas.
Global Invariants (applies to all write paths)¶
- Append-only
  - `AuditRecord` writes are immutable. No updates, no deletes.
- Per-tenant isolation
  - `tenantId` is authoritative from the boundary; payload `tenantId` must match.
- Idempotency for online ingest
  - `idempotencyKey` is required for online writes; uniqueness on (`tenantId`, `idempotencyKey`).
- Correlation
  - `traceId` and `requestId` required (gateway may inject); `producer` must be present.
- Classification & redaction on write
  - Apply the active `ClassificationPolicy` to targetable fields before persistence.
- Legal hold precedence
  - Any operation that would purge or mutate visibility of held records is blocked.
- Retention locks
  - A record/segment not yet retention-eligible cannot be destroyed; only redaction overlays allowed per policy.
- Clock-skew tolerance
  - `occurredAtUtc` accepted within configured window; outliers rejected or flagged.
- Transport-agnostic consistency
  - Same validation and invariants regardless of HTTP/gRPC/Bus/Orleans.
Write Path (sequence)¶
sequenceDiagram
participant P as Producer (any BC)
participant IN as Inbound Adapter (HTTP/gRPC/Bus/Orleans)
participant ID as Inbox+Dedupe
participant CP as ClassificationPolicy BC
participant AS as AuditStream BC
participant IL as IntegrityLedger BC
participant RP as RetentionPolicy BC
participant LH as LegalHold BC
P->>IN: Command: AppendAuditRecord / BatchAppend
IN->>ID: Normalize + headers (tenant, trace, idempotency)
ID-->>IN: Duplicate? If yes -> ACK with original id
IN->>CP: Resolve active policy + compile plan
CP-->>IN: Policy(version) + redaction plan
IN->>AS: AppendAuditRecord (transformed payload)
AS-->>IN: Event: audit.record_appended
IN->>IL: (Async) segment threshold check -> Seal request
IN->>RP: (Async) notify retention indexes
IN->>LH: (Async) check or update hold indexes
Validation Rules (selected highlights)¶
AppendAuditRecord / BatchAppend
- Require: `tenantId`, `occurredAtUtc`, `actor`, `action`, `resource`, `correlation`, `idempotencyKey` (online).
- Reject if: tenant mismatch, correlation missing, immutable-field mutation detected, legal hold violated, `occurredAtUtc` out of tolerance, or payload after classification still contains forbidden classes (e.g., `CREDENTIAL` cleartext).
- Transform: apply classification/redaction; normalize IP/UA; compute deterministic `id` if absent (ULID/UUIDv7).
- Emit: `audit.record_appended`; optionally `audit.segment_seal_requested` when thresholds hit.
Define/Update ClassificationPolicy
- Require forward-only versioning; cannot weaken reserved classes (`CREDENTIAL`, `PHI`).
- Effective time ≥ now + safety window.
- Emit `classification.policy_changed` (version).
PutRetentionPolicy / ExecuteRetentionPurge
- Retention window cannot be below compliance minimums.
- Purge must exclude any scope under active `LegalHold`.
- Emit `retention.policy_changed`, `retention.window_elapsed`, `retention.purged`.
Place/Release LegalHold
- Require explicit scope (tenant, categories, time window) and `caseId`.
- Hold blocks purge until release; all checks are index-assisted and auditable.
- Emit `legalhold.applied`, `legalhold.released`.
Start/Finalize/Cancel Export
- Require: purpose, requestor, time-bounded scope, redaction level.
- Exports are immutable after finalize; artifacts signed and referenced in manifest.
- Emit `export.started`, `export.chunk_ready`, `export.completed` / `export.failed`.
ComputeIntegrityProof / SealSegment
- Sealed segments cannot be reopened.
- Hash inputs must be canonical, order-stable.
- Emit `integrity.segment_sealed`, `integrity.proof_computed`.
Command Contracts (code-first examples)¶
gRPC code-first (C#)
public record AppendAuditRecordCommand(AuditRecord Record, string TenantId, string IdempotencyKey);
public record BatchAppendCommand(IReadOnlyList<AuditRecord> Records, string TenantId);
public interface IAuditIngestService // exposed via gRPC code-first & HTTP
{
Task<AppendResult> AppendAsync(AppendAuditRecordCommand cmd, CancellationToken ct);
Task<BatchAppendResult> BatchAppendAsync(BatchAppendCommand cmd, CancellationToken ct);
}
MassTransit (message contracts)
public interface AppendAuditRecord
{
string TenantId { get; }
string IdempotencyKey { get; }
AuditRecord Record { get; }
Correlation Correlation { get; }
}
Orleans (grain)
public interface IAuditIngestGrain : IGrainWithStringKey
{
Task<AppendResult> Append(AppendAuditRecordCommand cmd);
Task<BatchAppendResult> BatchAppend(BatchAppendCommand cmd);
}
All adapters call a shared write pipeline that enforces invariants and emits the same domain events.
Idempotency Semantics¶
- Key scope: (tenantId, idempotencyKey) → recordId.
- Duplicates: return success with the original identifiers; never write a second time.
- Batch: dedupe per item; partial duplicates return a mixed result with per-item statuses.
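The key-scope, duplicate, and batch semantics above can be illustrated with a small sketch. The in-memory dict stands in for the real dedupe store, and the IdempotencyStore name is an assumption for this example:

```python
class IdempotencyStore:
    """In-memory stand-in for the dedupe store: (tenantId, idempotencyKey) -> recordId."""
    def __init__(self):
        self._seen = {}

    def append(self, tenant_id, idempotency_key, record_id):
        """First write wins; a duplicate returns the original id and never writes again."""
        key = (tenant_id, idempotency_key)
        if key in self._seen:
            return {"id": self._seen[key], "status": "duplicate"}
        self._seen[key] = record_id
        return {"id": record_id, "status": "created"}

    def batch_append(self, tenant_id, items):
        """items: iterable of (idempotency_key, record_id); returns mixed per-item statuses."""
        return [self.append(tenant_id, k, rid) for k, rid in items]
```

Note that the key is tenant-scoped: the same idempotencyKey under a different tenant is a fresh write, which is exactly the isolation the (tenantId, idempotencyKey) scope demands.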
Error Model (canonical)¶
- 409 Conflict: tenant mismatch; immutable field mutation; legal hold violation.
- 422 Unprocessable Entity: schema/validation/classification failure.
- 429 Too Many Requests: backpressure (retryable with jitter).
- 5xx: unexpected; retried with exponential backoff and idempotency preserved.
Observability (write path)¶
- Metrics: append throughput, dedupe hit rate, classification ops/field, seal latency, purge counts.
- Traces: span attributes include tenantId, action, resource.type/id, idempotencyKey.
- Logs: structured; never contain secrets; include command name and result code.
Events (domain & integration)¶
Purpose¶
Define a clear, event-first contract across the Audit & Compliance bounded contexts. Events are the only mechanism for cross-context coordination (segment sealing, retention, legal hold, exports). Publishing is post-commit via outbox, and all consumers are idempotent. Transports are code-first: MassTransit (Service Bus) for inter-service messaging, optional signed webhooks for external sinks, and Orleans streams for intra-cluster reactions where applicable.
Event Strategy¶
- Event-first + Outbox: domain events are appended to an outbox within the same transaction as the aggregate change; a relay publishes them to the bus.
- Inbox/Dedupe: every consumer keeps a dedupe store keyed by (consumer, eventId), or (consumer, tenantId, aggregateId, sequence), turning at-least-once delivery into exactly-once processing.
- Per-aggregate ordering: ordering is guaranteed within a partition key (aggregateId or (tenantId, category)), not globally.
- Schema versioning: version embedded in the envelope; additive changes preferred.
- Security: internal bus authN/Z; webhooks signed (HMAC) with replay protection.
Event Catalog (by BC)¶
AuditStream BC
- audit.record_appended — a new immutable record was appended.
- audit.segment_seal_requested — thresholds hit; request sealing (internal integration).
RetentionPolicy BC
- retention.policy_changed — new policy version activated.
- retention.window_elapsed — a scope became retention-eligible.
- retention.purged — purge completed for a scope.
LegalHold BC
- legalhold.applied — hold in effect for scope/case.
- legalhold.released — hold released.
Export BC
- export.started — export accepted and queued.
- export.chunk_ready — partial artifact available (streaming exports).
- export.completed — export finished; manifest + signed URLs.
- export.failed — export failed.
IntegrityLedger BC
- integrity.segment_sealed — segment finalized with proof.
- integrity.proof_computed — verification material available.
Canonical Event Envelope (code-first)¶
public sealed record DomainEventEnvelope<TPayload>(
string EventId, // ULID/UUIDv7
string Name, // e.g., "audit.record_appended"
int SchemaVersion, // e.g., 1
string TenantId, // partition & auth
string AggregateType, // e.g., "AuditStream"
string AggregateId, // stream or policy id
long AggregateSequence, // monotonic within aggregate (if applicable)
DateTime OccurredAtUtc, // business time
DateTime EmittedAtUtc, // publish time
CorrelationMeta Correlation, // traceId, requestId, causationId, producer
TPayload Payload);
public sealed record CorrelationMeta(
string TraceId,
string RequestId,
string? CausationId,
string Producer);
- Naming: domain.action_pastparticiple (dot-delimited).
- Partition key: AggregateId (or (TenantId, Category) for streams).
- Dedupe key: EventId (plus TenantId if desired).
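The naming, partition-key, and dedupe-key conventions above can be sketched as small helpers. The function names and the name-validation regex are illustrative assumptions for this sketch, not part of the contract:

```python
import re

# domain.action_pastparticiple: lower-case, underscore words, dot-delimited
EVENT_NAME = re.compile(r"^[a-z]+(\.[a-z][a-z_]*)+$")

def valid_event_name(name: str) -> bool:
    return EVENT_NAME.fullmatch(name) is not None

def partition_key(envelope: dict) -> str:
    """Route by AggregateId; fall back to (TenantId, Category) for stream-style events."""
    return envelope.get("AggregateId") or f'{envelope["TenantId"]}:{envelope["Category"]}'

def dedupe_key(consumer: str, envelope: dict) -> tuple:
    """Inbox key: (consumer, EventId); widen with TenantId if desired."""
    return (consumer, envelope["EventId"])
```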
Payload Shapes (selected)¶
audit.record_appended (v1)
public sealed record AuditRecordAppendedV1(
string RecordId,
string Category, // e.g., "identity", "billing"
string Action, // e.g., "User.PasswordChanged"
string ResourceType, string ResourceId,
string ActorType, string ActorId,
DateTime OccurredAtUtc,
int ClassificationPolicyVersion);
retention.policy_changed (v1)
public sealed record RetentionPolicyChangedV1(
string PolicyId,
int Version,
DateTime EffectiveFromUtc,
IReadOnlyDictionary<string,int> WindowsDaysByCategory);
legalhold.applied / released (v1)
public sealed record LegalHoldChangedV1(
string HoldId,
string CaseId,
string ScopeTenantId,
string[] Categories,
DateTime FromUtc,
DateTime ToUtc,
string Change); // "applied" | "released"
export.completed (v1)
public sealed record ExportCompletedV1(
string JobId,
string Purpose,
DateTime RangeFromUtc,
DateTime RangeToUtc,
string RedactionLevel,
Uri ManifestUrl,
string IntegrityProofId);
integrity.segment_sealed (v1)
public sealed record SegmentSealedV1(
string SegmentId,
string TenantId,
string Category,
long FirstSequence,
long LastSequence,
string HashAlgorithm,
string RootHash,
Uri ProofUrl);
Publishing Flow (post-commit with outbox)¶
sequenceDiagram
participant AG as Aggregate
participant DB as DB (Tx)
participant OB as Outbox Table
participant RL as Outbox Relay
participant BUS as Service Bus
participant C1 as Consumer BC (e.g., IntegrityLedger)
participant C2 as Consumer BC (e.g., RetentionPolicy)
AG->>DB: Save aggregate state (commit)
AG->>OB: Append DomainEvent (same Tx)
Note over DB,OB: State + Outbox are atomic
RL->>OB: Poll/Read un-published rows
RL->>BUS: Publish events (Name, Version, Envelope, Payload)
BUS-->>C1: Deliver (at-least-once)
C1->>C1: Inbox dedupe (idempotent handler)
C1->>C1: Apply reaction / emit own events
BUS-->>C2: Deliver (at-least-once)
C2->>C2: Inbox dedupe (idempotent handler)
Consumer Contract (idempotency & ordering)¶
- Idempotency: persist (consumer, EventId). If seen, ack and skip.
- Ordering: rely on partition keys; if you must enforce sequence, validate AggregateSequence continuity; otherwise apply commutative logic (recommended).
- Retries: transient errors → retry with backoff/jitter; poison messages → DLQ.
- Side effects: only after dedupe is recorded.
MassTransit configuration (sketch)
cfg.ReceiveEndpoint("integrity.segment.events", e =>
{
e.UseMessageRetry(r => r.Exponential(5, TimeSpan.FromSeconds(1), TimeSpan.FromSeconds(30), TimeSpan.FromSeconds(2)));
e.UseInMemoryOutbox(); // local idempotency for side-effects within consumer scope
e.Consumer<SegmentSealedConsumer>();
});
For persistent idempotency across restarts, maintain a ConsumerOffset table keyed by (consumer, EventId) and optionally (TenantId, AggregateId, Sequence).
Transport Guidance¶
- Service Bus (MassTransit) — primary; topics per event name or per BC with filters. Headers carry envelope fields for routing; payload is JSON.
- Webhooks (optional) — only for external systems; signed with HMAC; include Event-Id and Event-Timestamp; enforce a 5-minute age window; retries with backoff.
- Orleans Streams (internal) — for intra-cluster consumers needing low latency; still publish to the bus for inter-service integration.
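The HMAC signing with replay protection described for webhooks can be sketched as below. The message layout (Event-Id and timestamp bound into the signed bytes) is an assumption of this sketch; any scheme that binds both to the body works:

```python
import hashlib
import hmac
import time

MAX_AGE_SECONDS = 300  # the 5-minute age window mentioned above

def sign_webhook(secret: bytes, event_id: str, timestamp: int, body: bytes) -> str:
    """Bind Event-Id and Event-Timestamp into the signature so neither can be
    replayed or swapped independently of the body."""
    msg = f"{event_id}.{timestamp}.".encode() + body
    return hmac.new(secret, msg, hashlib.sha256).hexdigest()

def verify_webhook(secret, event_id, timestamp, body, signature, now=None):
    now = int(time.time()) if now is None else now
    if abs(now - timestamp) > MAX_AGE_SECONDS:
        return False  # stale (or future-dated) delivery: reject as a possible replay
    expected = sign_webhook(secret, event_id, timestamp, body)
    return hmac.compare_digest(expected, signature)  # constant-time comparison
```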
Versioning & Evolution¶
- SchemaVersion increments on breaking changes.
- Prefer additive payloads; consumers ignore unknown fields.
- For unavoidable breaks, publish both ...v1 and ...v2 until all consumers migrate.
Observability¶
- Metrics: publish rate, DLQ count, consumer lag, dedupe hits, handler latency.
- Tracing: propagate TraceId / RequestId; create spans for publish and handle.
- Auditing the auditor: event publishing/consumption is itself audited in minimal form (no secrets).
Example End-to-End¶
- AppendAuditRecord commits in AuditStream → outbox writes audit.record_appended.
- IntegrityLedger consumes, updates segment counters; if a threshold is reached, emits integrity.segment_seal_requested and then integrity.segment_sealed.
- RetentionPolicy updates eligibility indexes; at the window, emits retention.window_elapsed and eventually retention.purged.
- Export listens for queries/timeline requests, runs the job, emits export.completed with manifest + proof.
Repositories & Read Models¶
Purpose¶
Define the write model (append-only) and the read models that power fast investigative queries:
- Write: append-only log partitioned by tenantId and category (or resource.type).
- Reads:
  - AuditTimeline — chronological, filterable stream
  - AccessDecisionsView — focused allow/deny audit
  - ChangeDeltasView — before/after changes
  - IntegrityProofsView — segment/proof lookup and verification
All projections are idempotent, tenant-scoped, and updated asynchronously from domain events.
Architecture (high level)¶
flowchart LR
IN[Inbox+Dedupe] --> WR[(Append-Only Store)]
WR -->|events| PJ[Projectors]
PJ --> TL[AuditTimeline]
PJ --> AD[AccessDecisionsView]
PJ --> CD[ChangeDeltasView]
PJ --> IP[IntegrityProofsView]
WR --> SG[Integrity Segments/Map]
SG --> IP
Write Model (append-only)¶
Table: audit.Records (Azure SQL, JSON for flexible fields)
CREATE TABLE audit.Records (
RecordId VARCHAR(26) NOT NULL, -- ULID/UUIDv7 (string)
TenantId VARCHAR(64) NOT NULL,
Category VARCHAR(64) NOT NULL, -- e.g., identity, billing
OccurredAtUtc DATETIME2(3) NOT NULL,
ActorType VARCHAR(16) NOT NULL, -- user|service|job
ActorId VARCHAR(128) NOT NULL,
Action VARCHAR(128) NOT NULL, -- e.g., User.PasswordChanged
ResourceType VARCHAR(64) NOT NULL,
ResourceId VARCHAR(128) NOT NULL,
DecisionOutcome VARCHAR(8) NULL, -- allow|deny|na
DecisionReason VARCHAR(256) NULL,
ContextJson NVARCHAR(MAX) NULL, -- PII-safe (post-redaction)
BeforeJson NVARCHAR(MAX) NULL, -- diff-safe (post-redaction)
AfterJson NVARCHAR(MAX) NULL, -- diff-safe (post-redaction)
ClassesJson NVARCHAR(512) NULL, -- ["Personal","Internal"]
CorrelationJson NVARCHAR(512) NOT NULL, -- traceId, requestId, producer
PolicyVersion INT NOT NULL,
SegmentId VARCHAR(40) NULL,
IntegrityHash VARCHAR(128) NULL,
InsertedAtUtc DATETIME2(3) NOT NULL CONSTRAINT DF_audit_Records_InsertedAt DEFAULT SYSUTCDATETIME(),
CONSTRAINT PK_audit_Records PRIMARY KEY CLUSTERED (TenantId, Category, OccurredAtUtc, RecordId)
);
Hot indexes (nonclustered):
-- Resource drill-down (timeline for a resource)
CREATE INDEX IX_audit_Records_Resource
ON audit.Records (TenantId, ResourceType, ResourceId, OccurredAtUtc)
INCLUDE (Action, ActorId, DecisionOutcome);
-- Actor drill-down
CREATE INDEX IX_audit_Records_Actor
ON audit.Records (TenantId, ActorId, OccurredAtUtc)
INCLUDE (Action, ResourceType, ResourceId, DecisionOutcome);
-- Action/verb timelines
CREATE INDEX IX_audit_Records_Action
ON audit.Records (TenantId, Action, OccurredAtUtc)
INCLUDE (ResourceType, ResourceId, ActorId, DecisionOutcome);
Dedupe (idempotency) store:
CREATE TABLE audit.Idempotency (
TenantId VARCHAR(64) NOT NULL,
IdempotencyKey VARCHAR(256) NOT NULL,
RecordId VARCHAR(26) NOT NULL,
CreatedAtUtc DATETIME2(3) NOT NULL DEFAULT SYSUTCDATETIME(),
CONSTRAINT PK_audit_Idempotency PRIMARY KEY (TenantId, IdempotencyKey)
);
Integrity segment map (for sealing):
CREATE TABLE audit.Segments (
SegmentId VARCHAR(40) NOT NULL,
TenantId VARCHAR(64) NOT NULL,
Category VARCHAR(64) NOT NULL,
FirstOccurredAt DATETIME2(3) NOT NULL,
LastOccurredAt DATETIME2(3) NOT NULL,
FirstRecordId VARCHAR(26) NOT NULL,
LastRecordId VARCHAR(26) NOT NULL,
RootHash VARCHAR(128) NULL,
HashAlgorithm VARCHAR(16) NULL, -- e.g., sha256
SealedAtUtc DATETIME2(3) NULL,
CONSTRAINT PK_audit_Segments PRIMARY KEY (TenantId, Category, SegmentId)
);
Read Models & Projections¶
AuditTimeline
- Purpose: fast, paged, time-ordered audit browsing with common filters.
- Source: audit.record_appended events.
- Schema: often the same as audit.Records, but stored separately for denormalized filters and pagination tokens.
- Suggested partition: (TenantId, OccurredAtUtc); support from/to, actor, resource, action, and category filters.
CREATE TABLE audit_rm.Timeline (
TenantId VARCHAR(64) NOT NULL,
OccurredAtUtc DATETIME2(3) NOT NULL,
RecordId VARCHAR(26) NOT NULL,
Category VARCHAR(64) NOT NULL,
Action VARCHAR(128) NOT NULL,
ActorId VARCHAR(128) NOT NULL,
ResourceType VARCHAR(64) NOT NULL,
ResourceId VARCHAR(128) NOT NULL,
DecisionOutcome VARCHAR(8) NULL,
SummaryJson NVARCHAR(1024) NULL, -- concise payload for UI cards
CONSTRAINT PK_audit_rm_Timeline PRIMARY KEY (TenantId, OccurredAtUtc, RecordId)
);
AccessDecisionsView
- Purpose: focused allow/deny trails for security reviews and access investigations.
- Source: audit.record_appended with DecisionOutcome IS NOT NULL.
- Partition: (TenantId, OccurredAtUtc); secondary on (TenantId, DecisionOutcome).
CREATE TABLE audit_rm.AccessDecisions (
TenantId VARCHAR(64) NOT NULL,
OccurredAtUtc DATETIME2(3) NOT NULL,
RecordId VARCHAR(26) NOT NULL,
ActorId VARCHAR(128) NOT NULL,
ResourceType VARCHAR(64) NOT NULL,
ResourceId VARCHAR(128) NOT NULL,
Action VARCHAR(128) NOT NULL,
DecisionOutcome VARCHAR(8) NOT NULL,
Reason VARCHAR(256) NULL,
CONSTRAINT PK_audit_rm_Access PRIMARY KEY (TenantId, OccurredAtUtc, RecordId)
);
ChangeDeltasView
- Purpose: “what changed?” — quick access to before/after deltas for resources.
- Source: audit.record_appended where BeforeJson or AfterJson is present.
- Partition: (TenantId, ResourceType, ResourceId, OccurredAtUtc).
CREATE TABLE audit_rm.ChangeDeltas (
TenantId VARCHAR(64) NOT NULL,
ResourceType VARCHAR(64) NOT NULL,
ResourceId VARCHAR(128) NOT NULL,
OccurredAtUtc DATETIME2(3) NOT NULL,
RecordId VARCHAR(26) NOT NULL,
BeforeJson NVARCHAR(MAX) NULL,
AfterJson NVARCHAR(MAX) NULL,
Action VARCHAR(128) NOT NULL,
ActorId VARCHAR(128) NOT NULL,
CONSTRAINT PK_audit_rm_Deltas PRIMARY KEY (TenantId, ResourceType, ResourceId, OccurredAtUtc, RecordId)
);
IntegrityProofsView
- Purpose: locate proof material for segments and exports; drive verification UI/API.
- Source: integrity.segment_sealed, export.completed.
- Partition: (TenantId, Category); lookup by SegmentId or JobId.
CREATE TABLE audit_rm.IntegrityProofs (
TenantId VARCHAR(64) NOT NULL,
Category VARCHAR(64) NOT NULL,
SegmentId VARCHAR(40) NULL,
JobId VARCHAR(40) NULL,
RootHash VARCHAR(128) NOT NULL,
HashAlgorithm VARCHAR(16) NOT NULL,
ProofUrl NVARCHAR(512) NULL,
SealedAtUtc DATETIME2(3) NOT NULL,
CONSTRAINT PK_audit_rm_Proofs PRIMARY KEY (TenantId, Category, SealedAtUtc, RootHash)
);
Projection Rules¶
- Idempotent handlers using consumer offsets; UPSERT by (TenantId, ...).
- Eventual consistency with small lag; publish X-Projected-At headers in query responses if needed.
- Redaction-preserving: project post-redaction fields; never attempt to reverse transformations.
- Backfill support: projectors can rebuild from audit.Records (the source of truth) if event history is pruned.
Hot Path & Caching Guidance¶
- Query patterns
  - Timeline (time range, paging): primary cluster (TenantId, OccurredAtUtc) → seek + range scan.
  - Resource drill-down: index (TenantId, ResourceType, ResourceId, OccurredAtUtc).
  - Actor investigations: index (TenantId, ActorId, OccurredAtUtc).
  - Decision-only: index on (TenantId, DecisionOutcome, OccurredAtUtc) in AccessDecisionsView.
- Caching
  - Edge cache (short TTL, e.g., 5–15 s) for repeated dashboard queries.
  - Redis for hot resource/actor timelines (keyed by a hash of the filter and page token; TTL 30–60 s).
  - ETag/If-None-Match on timeline pages (tokenized OccurredAtUtc + RecordId).
- Pagination
  - Prefer seek pagination via an (OccurredAtUtc, RecordId) cursor over OFFSET/LIMIT.
  - Max page size 200–500 rows; enforce time-bounded queries to protect hot partitions.
- Partitioning
  - Monthly or quarterly partitions on OccurredAtUtc for very large tenants.
  - Move cold partitions to a cheaper tier; keep the last N partitions in the premium tier.
- Compression & storage
  - Enable row/page compression; consider columnstore for audit_rm tables with heavy scans.
  - Avoid huge JSON blobs; keep Before/After diffs concise and redacted.
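The seek-pagination guidance can be illustrated with a small cursor sketch. The opaque token shape (base64-encoded JSON carrying OccurredAtUtc and RecordId) is an assumption of this sketch; any stable encoding works:

```python
import base64
import json

def encode_cursor(occurred_at_utc: str, record_id: str) -> str:
    token = json.dumps({"t": occurred_at_utc, "r": record_id}, separators=(",", ":"))
    return base64.urlsafe_b64encode(token.encode()).decode()

def decode_cursor(cursor: str):
    data = json.loads(base64.urlsafe_b64decode(cursor.encode()))
    return data["t"], data["r"]

def seek_page(rows, cursor, page_size):
    """rows: (occurred_at_utc, record_id, ...) tuples sorted ascending.
    Seek strictly past the cursor instead of using OFFSET/LIMIT, so deep pages
    cost the same as page one."""
    if cursor:
        after = decode_cursor(cursor)
        rows = [row for row in rows if (row[0], row[1]) > after]
    page = rows[:page_size]
    next_cursor = encode_cursor(page[-1][0], page[-1][1]) if len(rows) > page_size else None
    return page, next_cursor
```

In SQL this filter becomes a composite-key predicate, e.g. WHERE (OccurredAtUtc, RecordId) > (@t, @r), which the clustered index answers with a seek.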
Repository Interfaces (sketch)¶
public interface IAuditRecordRepository
{
Task AppendAsync(AuditRecord record, CancellationToken ct);
Task<bool> TryDeduplicateAsync(string tenantId, string idempotencyKey, string recordId, CancellationToken ct);
IAsyncEnumerable<AuditRecord> ReadRangeAsync(string tenantId, DateTime fromUtc, DateTime toUtc, AuditFilter filter, PageCursor cursor, int pageSize, CancellationToken ct);
}
public interface IAuditTimelineReader
{
IAsyncEnumerable<TimelineRow> QueryAsync(TimelineQuery q, PageCursor cursor, int pageSize, CancellationToken ct);
}
public interface IIntegrityProofsReader
{
Task<IntegrityProof?> GetSegmentProofAsync(string tenantId, string category, string segmentId, CancellationToken ct);
Task<IReadOnlyList<IntegrityProof>> ListRecentAsync(string tenantId, string category, int take, CancellationToken ct);
}
Security & Multitenancy¶
- Row-level authorization by TenantId at the repository boundary; never cross-tenant joins.
- Purpose-limited reads: decision and delta views may require elevated scopes; all access is itself audited.
- No PII in projections beyond already-transformed data; the classification policy is enforced on the write model before projection.
Operational Notes¶
- SLA knobs: keep p95 projector lag under 2 s for the timeline; the proof view can tolerate higher lag.
- Observability: measure projector latency, lag, UPSERT counts, cache hit ratio, top filters.
- Maintenance: rolling segment sealing compacts integrity metadata; retention purges operate against write model and cascade to views via events.
Contracts: Ingest APIs & Streams¶
Purpose¶
Provide transport-agnostic ingest contracts for writing audit data into the AuditStream BC:
- HTTP (single & batch) with idempotency, tenant scoping, and classification hints.
- Backfill endpoint for trusted, high-volume historical imports.
- Internal streams (audit.records.v1) for conformist downstreams and inter-service ingestion.
- Consistent envelope, headers/metadata, and error model across transports.
HTTP Ingest (online)¶
Base URL: /audit/records
Headers (required)
- Authorization: Bearer <token> — scope: audit.ingest
- Tenant-Id: <string> — tenant isolation key (alias: tenantId)
- Idempotency-Key: <string> — required for online writes
- Trace-Id, Request-Id — correlation (gateway may inject)
Single record
- POST /audit/records
- Content-Type: application/json
- Body: canonical AuditRecord (as defined in the model cycle); classification hints optional.
{
"record": { /* AuditRecord … */ },
"classificationHints": {
"before.fields.email": "PERSONAL",
"after.fields.token": "CREDENTIAL"
}
}
Responses
- 201 Created + body { "id": "<RecordId>", "status": "created" }
- 200 OK (idempotent duplicate) + { "id": "<RecordId>", "status": "duplicate" }
- 4xx/5xx with problem+json
Batch
- POST /audit/records:batch
- Body:
{
"items": [
{
"idempotencyKey": "…",
"record": { /* AuditRecord */ },
"classificationHints": { /* optional */ }
}
]
}
- Responses:
  - 202 Accepted with a per-item result array (success/duplicate/error, message, recordId)
  - Use 207 Multi-Status semantics inside the payload; the HTTP code remains 202 for mixed outcomes
Rate limits & size
- Max body: 5–10 MB (configurable)
- Max items per batch: 500 (configurable)
- Recommend gzip:
Content-Encoding: gzip
Error model (canonical)
- 409 Conflict — tenant mismatch, legal hold/retention lock violation
- 422 Unprocessable Entity — schema/validation/classification failure
- 429 Too Many Requests — backpressure; retry with jitter
- 5xx — transient; safe to retry with the same Idempotency-Key
cURL (single)
curl -X POST https://api.example.com/audit/records \
-H "Authorization: Bearer $TOKEN" \
-H "Tenant-Id: t-9c8f1" \
-H "Idempotency-Key: iam:pwd-change:u-12345:2025-09-30T12:34:56Z" \
-H "Trace-Id: tr-abc" -H "Request-Id: rq-xyz" \
-H "Content-Type: application/json" \
-d @record.json
Backfill Ingest (trusted)¶
For historical imports under elevated controls.
Endpoint
- POST /audit/records:backfill
- Scopes: audit.backfill
- Headers: Tenant-Id: <string>; optional Idempotency-Key per item (recommended when deduping across replays)
Body formats
- application/x-ndjson (preferred for scale) — one canonical AuditRecord envelope per line
- application/json with { "items": [...] }
- Support Content-Encoding: gzip
Rules
- Enforces all write invariants (append-only, tenant isolation, legal hold, classification).
- OccurredAtUtc may precede current time arbitrarily (no small skew limit).
- Throughput quotas are separate from online ingest; processed asynchronously → 202 Accepted + job id.
Response
Telemetry
- Emits audit.record_appended for each stored record (same as online), preserving the event-first posture.
Classification Hints (optional)¶
Hints help the server classify fields; the server remains authoritative.
Schema (inline or separate object)
{
"classificationHints": {
"context.userAgent": "INTERNAL",
"before.fields.email": "PERSONAL",
"after.fields.token": "CREDENTIAL"
}
}
Pathing
- Dot-notation aligned to the canonical JSON (context.*, before.fields.<k>, after.fields.<k>)
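Applying dot-notation hints to the canonical JSON can be sketched as below. This is illustrative only: the server remains authoritative, the forbidden-class set here is a placeholder, and apply_classification_hints is an assumed name, not a real API:

```python
import copy

def apply_classification_hints(record, hints, forbidden=frozenset({"CREDENTIAL"})):
    """Walk each dot-notation hint path; values hinted with a forbidden class are
    replaced before persistence. Unknown or non-dict paths are ignored."""
    out = copy.deepcopy(record)
    for path, data_class in hints.items():
        if data_class not in forbidden:
            continue
        *parents, leaf = path.split(".")
        node = out
        for key in parents:
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break
        if isinstance(node, dict) and leaf in node:
            node[leaf] = "[REDACTED]"
    return out
```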
Streams: audit.records.v1 (internal bus)¶
For conformist producers/consumers using MassTransit/Service Bus.
Topic: audit.records.v1
Message (code-first contract)
public interface IngestAuditRecord // message name: audit.records.v1
{
string TenantId { get; }
string IdempotencyKey { get; } // required for online-like producers
AuditRecord Record { get; }
IDictionary<string,string>? ClassificationHints { get; }
CorrelationMeta Correlation { get; }
}
Headers (required)
Tenant-Id, Idempotency-Key, Trace-Id, Request-Id, Producer
Semantics
- Consumers in the AuditStream service apply the same write pipeline (inbox+dedupe → classify/redact → append).
- Messages missing TenantId are rejected (moved to DLQ with reason).
- Partitioning: route by TenantId (and optionally Category) to preserve per-tenant ordering.
Retries & idempotency
- At-least-once delivery; inbox dedupe keyed by (TenantId, IdempotencyKey) → exactly-once write.
Other Transports (ingest parity)¶
- gRPC (code-first)
  - Service: IAuditIngestService.AppendAsync(record, tenantId, idempotencyKey, classificationHints, correlation)
  - Metadata carries Tenant-Id, Trace-Id, Request-Id; same invariants and responses.
- Orleans
  - Grain: IAuditIngestGrain.Append(AppendAuditRecordCommand)
  - For in-cluster high-trust producers; still passes through the shared write pipeline.
Security¶
- OAuth2 bearer tokens with audience audit and scopes: audit.ingest, audit.backfill.
- mTLS between internal services.
- Mandatory tenant scoping at the gateway; no cross-tenant writes.
- All ingest calls themselves produce minimal audit entries (who/what/purpose).
Diagram¶
sequenceDiagram
participant P as Producer
participant API as HTTP/gRPC/Bus/Orleans
participant IN as Inbox + Dedupe
participant CL as ClassificationPolicy
participant ST as Append-only Store
participant BUS as Internal Topic (audit.records.v1)
P->>API: POST /audit/records (+headers, hints)
API->>IN: Normalize + Tenant + Idempotency + Correlation
IN-->>API: Duplicate? yes -> return 200/duplicate
IN->>CL: Resolve policy + compile plan
CL-->>IN: Policy(version), plan
IN->>ST: Append (post-redaction)
ST-->>BUS: Emit event audit.record_appended (outbox → bus)
API-->>P: 201/Created (record id)
Validation Summary (applies to all ingest)¶
- Reject if missing Tenant-Id or mismatched tenant.
- Require Idempotency-Key (except backfill jobs, where it is recommended).
- Enforce classification & redaction before persistence.
- Block writes that violate LegalHold/Retention locks.
- Preserve correlation (Trace-Id, Request-Id, Producer) end-to-end.
Contracts: Query APIs¶
Purpose¶
Expose tenant-scoped, purpose-limited HTTP query surfaces for investigative and compliance use-cases:
- GET /audit/timeline — time-ordered browsing with rich filters.
- GET /audit/decision-log — allow/deny focus.
- POST /audit/exports + GET /audit/exports/{id} — controlled eDiscovery exports with signed artifacts.
- GET /audit/proofs (added for completeness) — lookup integrity segment/export proofs.
All endpoints enforce tenant isolation and purpose-based access (scopes + justification). Responses default to redacted data; elevated views require auditor scopes.
Security & Headers (applies to all)¶
Required headers
- Authorization: Bearer <token> — scopes vary per endpoint (see below)
- Tenant-Id: <string>
- Trace-Id, Request-Id (gateway may inject)
Purpose & redaction controls
- X-Purpose: <string> — short justification (e.g., "security-investigation:INC-12345")
- X-Redaction-Level: default|elevated — optional; elevated requires auditor scope
- Responses carry ETag (page content hash) and optional X-Projected-At (projection timestamp)
Scopes
- audit.read.timeline
- audit.read.decisions
- audit.export.start
- audit.export.read
- audit.read.proofs
- audit.read.elevated (to allow X-Redaction-Level: elevated)
GET /audit/timeline¶
Description: Paged, time-ordered audit records for a tenant.
Query parameters
- from (ISO-8601, required) — inclusive
- to (ISO-8601, required) — exclusive; enforce a max range (e.g., 31 days)
- Filters (all optional; ANDed unless noted):
  - actor — exact actor.id or prefix
  - resource — type:id (e.g., User:u-12345); resourceType, resourceId also supported
  - action — exact or prefix (e.g., User.)
  - category — e.g., identity, billing
  - class — DataClass value (PERSONAL, PHI, …)
  - decision — allow|deny|na
- Pagination:
  - cursor — opaque seek token from the previous page
  - limit — default 100, max 500
Response 200 OK
{
"items": [
{
"recordId": "01J9Z7B19X4P8ZQ7H6M4V6GQWY",
"occurredAtUtc": "2025-09-30T12:34:56Z",
"category": "identity",
"action": "User.PasswordChanged",
"actor": { "type": "user", "id": "u-12345" },
"resource": { "type": "User", "id": "u-12345" },
"decision": { "outcome": "allow", "reason": "MFA_OK" },
"summary": { "policyVersion": 7 } // compact, redacted view
}
],
"nextCursor": "eyJ0IjoiMjAyNS0wOS0zMFQxMjozNDo1NloiLCJyIjoiMDFKOS...\"",
"projectedAtUtc": "2025-09-30T12:35:02Z"
}
Errors
- 400 invalid range/filters
- 403 missing scopes, or elevated redaction without audit.read.elevated
- 422 malformed params
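The range and limit rules for this endpoint can be sketched as a small validator. The return shape (status code plus a field-keyed error map, mirroring the Problem Details errors object used later in this document) is an assumption of this sketch:

```python
from datetime import datetime, timedelta

MAX_RANGE = timedelta(days=31)
MAX_LIMIT = 500

def validate_timeline_query(frm: datetime, to: datetime, limit: int = 100):
    """Mirror of the 400-level rules: inclusive 'from', exclusive 'to', bounded range,
    and a hard page-size cap."""
    errors = {}
    if frm >= to:
        errors["from"] = ["Must be before 'to'"]
    elif to - frm > MAX_RANGE:
        errors["to"] = ["Range exceeds 31 days"]
    if not 1 <= limit <= MAX_LIMIT:
        errors["limit"] = ["Max 500"]
    return (400, errors) if errors else (200, {})
```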
GET /audit/decision-log¶
Description: Access decision-centric view (allow/deny), optimized for security reviews.
Query parameters
- Same base as /audit/timeline, plus:
- outcome — allow|deny (required unless action is specified)
- actor, resource, action for narrowing
- Pagination via cursor/limit as above
{
"items": [
{
"occurredAtUtc": "2025-09-30T09:02:15Z",
"recordId": "01J9Z6...",
"actorId": "u-987",
"resource": { "type": "Document", "id": "doc-55" },
"action": "Document.Downloaded",
"outcome": "deny",
"reason": "Policy:IP_Allowlist"
}
],
"nextCursor": null
}
Errors
- 400 missing outcome or invalid values
- 403 insufficient scope
Scope required: audit.read.decisions (and audit.read.elevated if requesting elevated redaction)
POST /audit/exports¶
Description: Starts a controlled ExportJob for eDiscovery/audit.
Scope required: audit.export.start (and audit.read.elevated if requesting elevated redaction)
Body
{
"purpose": "ediscovery:case-12345",
"range": { "from": "2025-09-01T00:00:00Z", "to": "2025-09-30T23:59:59Z" },
"filters": {
"actor": "u-12345",
"resource": "User:u-12345",
"action": "User.",
"category": "identity",
"decision": "allow"
},
"redactionLevel": "default|elevated",
"format": "jsonl|csv",
"delivery": { "kind": "signed-url" } // future: bucket/prefix
}
Response
202 Accepted
Errors
- 400 invalid range/filters/format
- 403 insufficient scope, or elevated redaction without the auditor scope
- 409 an active LegalHold blocks the request if its scope violates policy
GET /audit/exports/{id}¶
Description: Polls export status and returns signed artifact manifest on completion.
Scope required: audit.export.read
Response 200 OK
{
"jobId": "exp-01J9Z7E8AVB2M7",
"state": "completed", // queued|running|completed|failed|canceled
"count": 12840,
"format": "jsonl",
"redactionLevel": "default",
"manifest": {
"url": "https://signed.example.com/exp-.../manifest.json?sig=...",
"expiresAtUtc": "2025-09-30T15:10:00Z",
"integrityProofId": "proof-01J9Z7KQ...",
"hash": "sha256:abcd..."
}
}
Errors
- 404 unknown job or tenant mismatch
- 409 job not in a terminal state for download
GET /audit/proofs¶
Description: Lookup integrity proofs for segments or exports.
Scope required: audit.read.proofs
Query parameters
- category — e.g., identity (required for segment lookups)
- segmentId — to fetch a specific segment's proof
- jobId — to fetch an export's proof
- take — number of recent proofs (default 20, max 200) when segmentId/jobId is not provided
Responses
- Single proof:
{
"type": "segment",
"tenantId": "t-9c8f1",
"category": "identity",
"segmentId": "seg-2025-09-30-12",
"sealedAtUtc": "2025-09-30T12:40:00Z",
"hashAlgorithm": "sha256",
"rootHash": "3f88...",
"proofUrl": "https://signed.example.com/proofs/seg-...json?sig=..."
}
- Recent list:
Errors
- 400 missing category for segment queries
- 404 unknown proof
Pagination & Caching¶
- Seek pagination: use cursor (an opaque token including OccurredAtUtc + RecordId).
- Cache: allow Cache-Control: private, max-age=15 for dashboard queries; support If-None-Match with ETag.
- Rate limits: per-tenant and per-token quotas; return 429 + Retry-After.
Error Model (Problem Details)¶
All error responses use RFC 7807:
{
"type": "https://docs.example.com/errors/validation",
"title": "One or more validation errors occurred.",
"status": 422,
"traceId": "tr-abc",
"errors": { "from": ["Must be before 'to'"], "limit": ["Max 500"] }
}
Examples¶
Timeline (default redaction)
curl -G https://api.example.com/audit/timeline \
-H "Authorization: Bearer $TOKEN" \
-H "Tenant-Id: t-9c8f1" \
--data-urlencode "from=2025-09-01T00:00:00Z" \
--data-urlencode "to=2025-09-30T23:59:59Z" \
--data-urlencode "actor=u-12345" \
--data-urlencode "limit=200"
Start export (elevated)
curl -X POST https://api.example.com/audit/exports \
-H "Authorization: Bearer $AUDITOR" \
-H "Tenant-Id: t-9c8f1" \
-H "X-Purpose: ediscovery:case-12345" \
-H "X-Redaction-Level: elevated" \
-H "Content-Type: application/json" \
-d @export.json
Fetch proof
curl -G https://api.example.com/audit/proofs \
-H "Authorization: Bearer $TOKEN" \
-H "Tenant-Id: t-9c8f1" \
--data-urlencode "category=identity" \
--data-urlencode "segmentId=seg-2025-09-30-12"
Security & Tenancy Model¶
Purpose¶
Guarantee hard tenant isolation, least-privilege access, and confidentiality across all transports (HTTP, gRPC code-first, Service Bus/MassTransit, Orleans). Enforce no cross-tenant queries except tightly-controlled, justified, audited platform-admin flows. Minimize sensitive data at the source (classification/redaction), and protect exports with integrity and signature guarantees.
Threat Model & Goals (concise)¶
- Isolation: A tenant’s data is never visible to another tenant by design, not convention.
- Defense-in-depth: AuthN/Z at gateway, service, repository, and database (RLS) layers.
- Transport security: mTLS service-to-service; OAuth2/JWT for callers; key rotation.
- Data minimization: No secrets in events/logs; classification & redaction on write.
- Controlled exfiltration: Exports are purpose-limited, signed, and time-boxed.
- Audit of the auditor: All elevated reads/exports are themselves audited.
Identity & Authentication¶
- Human callers: OAuth2/OIDC bearer tokens (short-lived), scopes tied to endpoints.
- Service callers: client-credentials tokens with narrow scopes; per-service audiences.
- Intra-cluster: mTLS with SPIFFE/SVID or mesh-issued certs; authorize by SAN/service identity.
Required claims / headers
- tenant_id (claim) ⇄ Tenant-Id (header) — must match boundary routing.
- scope — action-specific (e.g., audit.read.timeline, audit.export.start).
- sub / client_id — principal identity.
- Correlation: Trace-Id, Request-Id, Producer (header/metadata).
Authorization (scopes & purpose-limiting)¶
| Capability | Scope(s) | Notes |
|---|---|---|
| Timeline read (redacted) | audit.read.timeline | Default redaction level. |
| Decision log | audit.read.decisions | Allow/deny focus. |
| Elevated read | audit.read.elevated | Requires X-Purpose justification; audited. |
| Start export | audit.export.start | Also requires audit.read.elevated if elevated redaction is requested. |
| Export status/download | audit.export.read | Returns signed manifest/URLs. |
| Proof lookup | audit.read.proofs | Integrity verification. |
| Policy admin | audit.admin.policy | Classification/retention/holds. |
| Platform-admin cross-tenant (break-glass) | platform.admin.audit | Dual-control + time-boxed + justification. |
Purpose header
`X-Purpose: <short string>` (e.g., `security-investigation:INC-12345`) is required for elevated reads and admin operations.
Tenancy Enforcement (boundary → repository → DB)¶
- At gateway: route requests by `Tenant-Id`; reject if the header is missing or mismatched with the token's `tenant_id`.
- At service boundary: all commands/queries must carry a tenant; no tenant-less APIs.
- At repository: every query/mutation requires `tenantId`; repositories disallow cross-tenant filters/joins.
- At database: Row-Level Security (RLS) ensures server-side isolation.
RLS pattern (SQL Server/Azure SQL)
-- Set session context per connection
EXEC sp_set_session_context @key=N'TenantId', @value=@tenantId;
-- Predicate
CREATE SCHEMA security;
GO
CREATE FUNCTION security.fn_TenantPredicate(@TenantId VARCHAR(64))
RETURNS TABLE WITH SCHEMABINDING
AS RETURN SELECT 1 AS fn_access
WHERE @TenantId = CAST(SESSION_CONTEXT(N'TenantId') AS VARCHAR(64));
GO
-- Apply policy to tables
CREATE SECURITY POLICY security.TenantPolicy
ADD FILTER PREDICATE security.fn_TenantPredicate(TenantId) ON audit.Records,
ADD FILTER PREDICATE security.fn_TenantPredicate(TenantId) ON audit_rm.Timeline,
ADD FILTER PREDICATE security.fn_TenantPredicate(TenantId) ON audit_rm.AccessDecisions,
ADD FILTER PREDICATE security.fn_TenantPredicate(TenantId) ON audit_rm.ChangeDeltas
WITH (STATE = ON);
Application code must set `SESSION_CONTEXT` at connection open and either reset it on every pooled checkout or partition pools per tenant, so a reused connection never carries another tenant's context.
Transport Security¶
- mTLS inside the cluster (gateway ↔ services, services ↔ services). Certs auto-rotated (short TTL).
- HTTP: TLS 1.2+, HSTS.
- gRPC (code-first): TLS + auth metadata; validate SAN/service identity.
- Service Bus/MassTransit: SAS/AAD with per-queue/topic credentials; message-level authZ by headers (tenant, scopes).
- Orleans: silo-to-silo TLS; grain calls restricted to trusted identities.
Data Minimization & Event Hygiene¶
- On write: classification applies; tokens/passwords/keys → `DROP`; emails/phones → `HASH` with per-tenant salt; PHI → `TOKENIZE`.
- Domain events: include only investigative fields (no raw PII/secrets). Example:
  - ✅ `recordId`, `action`, `resource.type/id`, `actor.id`, `occurredAtUtc`, `policyVersion`.
  - ❌ raw `before`/`after` contents for sensitive fields.
- Logs: structured; never log secrets or full payloads. Use event IDs, hashes, counts.
Exports (controlled eDiscovery)¶
- Start requires scope `audit.export.start` (+ `audit.read.elevated` if an elevated view is requested).
- Manifests include record counts, hash lists, and integrity proof IDs (Merkle root/segment hash).
- Delivery: signed URLs (short expiry) or customer-managed bucket integration.
- Signing: manifest and chunk checksums signed with per-tenant signing keys in HSM/Key Vault (kid included).
- Watermarking (optional): embed export context (tenant, purpose, jobId) into manifest for traceability.
- Revocation: immediate by invalidating SAS/presigned URLs; manifests remain for audit trail.
Platform-Admin Exceptions (break-glass)¶
- Allowed only with:
  - Scope `platform.admin.audit` + dual approval (two distinct admins).
  - Time-boxed tokens (e.g., 15 minutes).
  - Mandatory `X-Purpose`; reason captured.
- Flow produces additional meta-audit entries:
- who approved, timestamps, filters used, counts returned, artifacts accessed.
Key Management & Secrets¶
- Per-tenant salt (hashing) and tokenization keys stored in HSM/Key Vault; rotated on schedule.
- mTLS certificates rotated automatically (mesh/PKI).
- Export signing keys versioned; manifests include `kid` to verify against JWKS.
- No secrets in configuration files; use managed identities for vault access.
Rate-Limiting & Abuse Controls¶
- Per-tenant quotas for read/write; adaptive throttling based on p95 latency and queue depth.
- `429` with `Retry-After` headers; clients apply exponential backoff + jitter.
- Backfill channel isolated from online ingest to protect SLOs.
Observability & “Audit of the Auditor”¶
- Emit security-focused metrics: denied requests by reason, cross-tenant attempts, elevated-read counts, export volume by purpose.
- Trace spans include tenant, scope, purpose, redaction level (never PII).
- Every elevated read/export → a minimal audit record (`AuditorAccess.Requested`/`Granted`) with justification and principal.
Diagram — End-to-End Enforcement¶
sequenceDiagram
participant C as Caller (User/Service)
participant GW as API Gateway
participant AUD as AuditStream Service
participant REPO as Repositories
participant DB as DB (RLS)
participant BUS as Service Bus
C->>GW: HTTP/gRPC + JWT (tenant_id, scopes)
GW->>C: 401/403 if invalid; else forward with Tenant-Id, Trace-Id
GW->>AUD: mTLS request (tenant scoped)
AUD->>AUD: AuthZ (scope + purpose) + policy checks
AUD->>REPO: Set SESSION_CONTEXT('TenantId')
REPO->>DB: Query/Append
DB-->>REPO: Enforced by RLS (tenant rows only)
AUD-->>BUS: Outbox publish (minimized event)
AUD-->>C: Response (redacted/elevated per scope)
Hard Rules Recap¶
- No cross-tenant queries. Ever—except formal break-glass with dual control, justification, and exhaustive audit.
- Least-privilege tokens only; scopes map 1:1 to actions.
- mTLS everywhere inside the cluster; TLS to clients.
- Data minimization by default; classification on write; signed, time-boxed exports.
Observability of the Auditor¶
Purpose¶
Instrument the Audit & Compliance services so investigators and operators can see: how fast data arrives (ingest lag), whether writes succeed, where dedupe is triggering, how long exports take, and whether any consumer or saga is falling behind—without ever leaking PII. We adopt OpenTelemetry (OTel) for traces/metrics/logs, enforced PII redaction, and event-first correlation across HTTP, gRPC (code-first), Service Bus/MassTransit, and Orleans.
Telemetry Stack (reference)¶
- Tracing: OpenTelemetry SDK + auto-instrumentation (HTTP client/server, gRPC, MassTransit, ADO.NET).
- Metrics: OTel Meter; export to Prometheus/OTLP.
- Logs: structured JSON via Serilog (or ILogger) → OTLP/Elastic.
- Correlation: W3C trace-context; propagate `traceparent` and our domain correlation (`Trace-Id`, `Request-Id`, `Producer`).
- Exemplars: attach trace IDs to metric data points for deep linking from dashboards.
Tracing Model (span taxonomy)¶
Create consistent span names and attributes across services. All spans must include: tenant.id, audit.type (publish|consume|saga.step|query|export|purge|seal), audit.key (hashed idempotency key), delivery.attempt (for consumers).
Common span names
- `audit.ingest.append` — HTTP/gRPC/Bus/Orleans ingress
- `audit.outbox.publish` — outbox relay publishes domain events
- `audit.bus.consume` — MassTransit consumer handler
- `audit.saga.step` — retention/export/integrity orchestration step
- `audit.export.run` — an ExportJob execution window
- `audit.retention.purge` — purge execution
- `audit.integrity.seal` — segment sealing
- `audit.query.timeline` / `audit.query.decisions` / `audit.query.proofs` — read APIs
Span attributes (selected)
| Attribute | Example | Notes |
|---|---|---|
| `tenant.id` | `t-9c8f1` | required |
| `audit.category` | `identity` | optional (if available) |
| `audit.action` | `User.PasswordChanged` | optional |
| `audit.record.id` | `01J9Z7B...` | avoid logging payload |
| `audit.key.hash` | `sha256:ab12…` | hash of `Idempotency-Key` |
| `correlation.request_id` | `rq-xyz` | from envelope |
| `correlation.producer` | `iam-service@1.12.3` | |
| `messaging.system` | `azure.servicebus` | for bus spans |
| `messaging.operation` | `process` | consume spans |
| `delivery.attempt` | `3` | incremented by consumer |
| `db.operation` | `INSERT`, `SELECT` | for storage spans |
Sampling
- AlwaysOn for the write path (`audit.ingest.append`, `audit.outbox.publish`, `audit.bus.consume`) on error.
- Parent-based 10–20% for success traces in hot tenants; enable tail-based sampling by high latency/lag.
- PII guard: deny sampling decisions that would capture PII payloads—payloads aren’t logged anyway.
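The `audit.key.hash` span attribute can be produced without ever emitting the raw `Idempotency-Key`; a minimal sketch (helper name is illustrative):

```python
import hashlib

def span_key_hash(idempotency_key: str) -> str:
    # Emit only a digest of the key, prefixed with the algorithm name,
    # matching the "sha256:..." format used in span attributes.
    digest = hashlib.sha256(idempotency_key.encode("utf-8")).hexdigest()
    return f"sha256:{digest}"
```

The same helper works for any high-cardinality identifier that must be correlatable (same input → same attribute value) but never readable from telemetry.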
Metrics (SLIs & diagnostic)¶
Define stable instruments with clear units and labels. Tenants are labels; avoid high-cardinality labels (hash IDs when needed).
Ingest & Write
- `audit_ingest_messages_total{tenant}` — Counter
- `audit_ingest_lag_seconds{tenant}` — Histogram (p50/p90/p95/p99)
- `audit_append_success_total{tenant}` — Counter
- `audit_append_failures_total{tenant,reason}` — Counter
- `audit_dedupe_hits_total{tenant}` / `audit_dedupe_ratio{tenant}` — Counter/Gauge
- `audit_outbox_backlog{service}` — Gauge (rows in outbox not yet published)
- `audit_inbox_backlog{consumer}` — Gauge
Consumers & Sagas
- `audit_consumer_delivery_attempts{consumer}` — Histogram
- `audit_consumer_lag_seconds{consumer,tenant}` — Histogram (event time → now)
- `audit_projector_lag_seconds{view,tenant}` — Histogram
- `audit_saga_duration_seconds{saga}` — Histogram
- `audit_dlq_messages_total{consumer}` — Counter
Exports & Integrity
- `audit_export_duration_seconds{tenant,format,redaction}` — Histogram
- `audit_export_records_total{tenant}` — Counter
- `audit_integrity_seal_duration_seconds{tenant,category}` — Histogram
- `audit_integrity_segments_sealed_total{tenant,category}` — Counter
Queries
- `audit_query_timeline_latency_seconds{tenant}` — Histogram
- `audit_query_rate_limited_total{tenant,endpoint}` — Counter
- `audit_cache_hit_ratio{endpoint}` — Gauge
Attach exemplars to `audit_ingest_lag_seconds` and `audit_export_duration_seconds` to point to slow traces.
Logs (PII-safe, policy-enforced)¶
- Structure only; no raw payloads.
- Required fields: `timestamp`, `level`, `messageTemplate`, `tenant.id`, `trace.id`, `request.id`, `component`, `event.name`, `result`.
- Redaction middleware: apply the same `ClassificationPolicy`-based redaction to log scopes that process potentially sensitive strings (e.g., exceptions that echo inputs).
- Drop fields classified as `CREDENTIAL`; hash `PERSONAL`; tokenize `PHI`.
- Compact error events with codes (e.g., `AUDIT_422_CLASSIFICATION`, `AUDIT_409_LEGALHOLD`).
Examples
{ "level":"Information","event":"audit.append.ok","tenant.id":"t-9c8f1",
"trace.id":"3b5...","audit.category":"identity","record.id":"01J9Z7B...","dedupe":"miss" }
{ "level":"Warning","event":"audit.append.duplicate","tenant.id":"t-9c8f1",
"trace.id":"3b5...","audit.key.hash":"sha256:ab12...","dedupe":"hit" }
{ "level":"Error","event":"audit.export.failed","tenant.id":"t-9c8f1",
"trace.id":"3b5...","job.id":"exp-01J9Z7...","reason":"Timeout","duration.ms":90000 }
Dashboards (operator & auditor views)¶
- Ingest Health
- Ingest rate, ingest lag p95, append success vs failures by reason, dedupe ratio, outbox/inbox backlog.
- Consumers & Projections
- Consumer lag & delivery attempts; projector lag per view; DLQ counts; top failing handlers.
- Exports & Integrity
- Export durations, records per export, failure rate, proofs sealed over time, seal durations.
- Queries (Usage & Performance)
- Timeline latency, cache hit ratio, rate-limited requests, elevated reads by purpose (counts only).
- Security Signals
- Elevated-read counts by purpose, cross-tenant denial attempts, break-glass usage with approver identities.
Each panel links to traces via exemplars and to filtered logs.
Alerts & SLOs (suggested starting points)¶
- Ingest SLO: p95 `audit_ingest_lag_seconds` < 5s (warn at 10s, page at 30s).
- Append Error Budget: `audit_append_failures_total / (append_success + failures)` < 0.5% (rolling 1h).
- Consumer Lag: p95 `audit_consumer_lag_seconds` < 15s (warn), < 60s (page).
- Projector Lag: p95 `audit_projector_lag_seconds` < 5s (warn), < 20s (page).
- Export Duration: p95 `audit_export_duration_seconds` < target SLA per size (e.g., 10 min for 10M records).
- DLQ: growth rate > N/min → page (sticky failure).
- Break-Glass: any event → immediate page + incident ticket.
Code Sketch (C# OTel wiring)¶
services.AddOpenTelemetry()
.WithTracing(b => b
.AddAspNetCoreInstrumentation()
.AddGrpcClientInstrumentation()
.AddSqlClientInstrumentation(o => o.SetDbStatementForText = false) // avoid payloads
.AddSource("MassTransit") // MassTransit v8 emits via its own ActivitySource
.AddSource("ConnectSoft.Audit.*")
.SetSampler(new ParentBasedSampler(new TraceIdRatioBasedSampler(0.2)))
.AddOtlpExporter())
.WithMetrics(b => b
.AddRuntimeInstrumentation()
.AddAspNetCoreInstrumentation()
.AddMeter("ConnectSoft.Audit.Metrics")
.AddOtlpExporter());
Span creation example
using var activity = _activitySource.StartActivity("audit.bus.consume");
activity?.SetTag("tenant.id", tenantId);
activity?.SetTag("audit.type", "consume");
activity?.SetTag("delivery.attempt", attempt);
activity?.SetTag("audit.key.hash", keyHash);
Diagram — Telemetry Flow¶
flowchart LR
subgraph Services
IN[Ingest API] -->|Spans+Metrics+Logs| TE[OTel SDK]
CON[Consumers/Sagas] -->|Spans+Metrics+Logs| TE
QRY[Query APIs] -->|Spans+Metrics+Logs| TE
end
TE --> EXP[OTLP Exporter]
EXP --> APM[Tracing Backend]
EXP --> PRM[Metrics Backend]
EXP --> LOGS[Logs Backend]
PRM --> DASH[Dashboards & Alerts]
APM --> DASH
LOGS --> DASH
Governance¶
- PII Security Review on any new log field or span attribute.
- Telemetry ADRs document every new metric/alert.
- Synthetic probes (canaries) continuously exercise ingest, query, and export paths and verify SLIs.
Notes¶
- Keep cardinality bounded: hash keys/IDs in attributes; avoid free-form labels.
- Prefer seek-based pagination metrics over counting offsets.
- Always audit the auditor: elevated reads/exports emit minimal audit records with purpose and principal.
Storage Strategy (HLD)¶
Purpose¶
Design a default, cost-aware storage layout that provides fast append, predictable reads, and long-term retention with legal-hold guarantees. The baseline is Azure SQL for hot/warm data (append-only), complemented by Blob Storage (ADLS Gen2) for cold/archival tiers, with optional query-through via serverless analytics.
Tiered Layout (hot → warm → cold)¶
flowchart LR
IN[Inbox + Dedupe] --> SQL[(Azure SQL<br/>audit.*)]
SQL --> RM[(Azure SQL<br/>audit_rm.*)]
SQL --> SEG[(Azure SQL<br/>audit.Segments)]
SQL -- Retention window elapsed --> ADLS[(ADLS Gen2 / Blob<br/>JSONL or Parquet)]
ADLS --> MAN[(Manifests & Indexes in SQL)]
MAN -. optional query-through .-> ANA[(Serverless SQL / Fabric)]
- Hot (0–90 days): Azure SQL, append-only, minimal indexes, row/page compression off or row-only.
- Warm (90–365 days): Azure SQL, same tables but older partitions compressed (PAGE/columnstore for read models).
- Cold (≥ retention window): ADLS Gen2 (Blob) in JSONL or Parquet, plus SQL index rows to locate artifacts; optional serverless SQL/Fabric for ad-hoc queries.
Actual thresholds are policy-driven; HIPAA/GDPR overlays may extend hot/warm windows.
Logical Schema (recap)¶
- Write model (append-only): `audit.Records`, `audit.Idempotency`, `audit.Segments`.
- Read models (projections): `audit_rm.Timeline`, `audit_rm.AccessDecisions`, `audit_rm.ChangeDeltas`, `audit_rm.IntegrityProofs`.
- All tables are tenant-scoped and enforce RLS (see Security & Tenancy).
Partitioning & Clustering¶
Records table (primary store)
- Clustered key: `(TenantId, Category, OccurredAtUtc, RecordId)`
- Partition function: monthly on `OccurredAtUtc` (or weekly for hot tenants).
- Benefits: predictable range scans by time; easy partition management for retention/sealing.
Sample (illustrative):
-- Partition monthly by UTC date
CREATE PARTITION FUNCTION pf_MonthlyDate (DATETIME2(3))
AS RANGE RIGHT FOR VALUES (
'2025-01-01','2025-02-01','2025-03-01','2025-04-01','2025-05-01','2025-06-01',
'2025-07-01','2025-08-01','2025-09-01','2025-10-01','2025-11-01','2025-12-01'
);
CREATE PARTITION SCHEME ps_MonthlyDate
AS PARTITION pf_MonthlyDate ALL TO ([PRIMARY]);
-- Clustered index aligned with partitioning
CREATE CLUSTERED INDEX CIX_Records_TenantCatDateId
ON audit.Records (TenantId, Category, OccurredAtUtc, RecordId)
ON ps_MonthlyDate(OccurredAtUtc);
Nonclustered hot-path indexes (seek/prefix-friendly)
- `(TenantId, ResourceType, ResourceId, OccurredAtUtc) INCLUDE (Action, ActorId, DecisionOutcome)`
- `(TenantId, ActorId, OccurredAtUtc) INCLUDE (Action, ResourceType, ResourceId, DecisionOutcome)`
- `(TenantId, Action, OccurredAtUtc) INCLUDE (ResourceType, ResourceId, ActorId)`
Create only what queries need; rely on projections for complex filters.
Compression & Storage Modes¶
- Write model: keep ROW compression on hot partitions; switch to PAGE on warm partitions via rolling jobs.
- Read models with heavy scans: consider clustered columnstore for `audit_rm.Timeline` and `audit_rm.ChangeDeltas` (validate cardinality & segment elimination first).
- LOB JSON fields kept in `NVARCHAR(MAX)` post-redaction; enforce max sizes and summarize large diffs.
Ingest & Append (performance)¶
- Batching: use TVPs / `SqlBulkCopy` for `BatchAppend`; target batches of 500–2,000 rows depending on payload size.
- Minimal per-row work: precompute normalized fields and hashes in the service; avoid scalar SQL UDFs on hot paths.
- Idempotency: write to `audit.Idempotency (TenantId, IdempotencyKey → RecordId)` first; short-circuit duplicates.
- Locking: favor ordered inserts by `(TenantId, Category, OccurredAtUtc)`; avoid random keys.
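The idempotency short-circuit can be sketched with an in-memory stand-in for `audit.Idempotency` (illustrative only; a real implementation inserts the key row inside the same transaction as the append, relying on the primary key to reject duplicates):

```python
class IdempotentAppender:
    """In-memory stand-in for the audit.Idempotency short-circuit."""

    def __init__(self):
        # Maps (tenant_id, idempotency_key) -> record_id of the first append.
        self._seen: dict[tuple[str, str], str] = {}
        self.records: list[tuple[str, dict]] = []

    def append(self, tenant_id: str, idempotency_key: str,
               record_id: str, payload: dict) -> tuple[str, bool]:
        key = (tenant_id, idempotency_key)
        if key in self._seen:
            # Duplicate: return the original record id, write nothing.
            return self._seen[key], False
        self._seen[key] = record_id
        self.records.append((record_id, payload))
        return record_id, True
```

Returning the original `recordId` on a duplicate lets producers treat retries as successes, which is what makes multi-transport ingest (HTTP, bus redeliveries, saga retries) safe to replay.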
Integrity Segments¶
- Segment map: `audit.Segments (TenantId, Category, SegmentId, First/LastOccurredAt, First/LastRecordId, RootHash, SealedAtUtc)`
- Sealing: on thresholds, compute hash/Merkle in service, update segment row atomically, emit `integrity.segment_sealed`.
- Verification: proofs stored alongside exports (Blob) and indexed in `audit_rm.IntegrityProofs`.
Retention, Legal Hold, and Archival¶
- RetentionPolicy marks eligible partitions/row ranges; never purge rows covered by LegalHold.
- Purge mechanics:
  1. Export & stage eligible rows to ADLS Gen2 in JSONL/Parquet (with manifest & hash list).
  2. Verify counts/hashes; write a manifest row to SQL (tenant, category, range, URLs, hashes).
  3. Soft-delete or hard-delete in SQL per jurisdictional overlay (prefer hard-delete for payloads, leaving redacted tombstones only if required).
- Legal hold overlay: exclusion lists maintained in SQL; any held segment/row range is skipped.
Cold Storage (ADLS Gen2)¶
- Format:
- JSONL: simple, line-delimited canonical audit envelopes.
- Parquet: columnar, efficient for analytics (recommended for large volumes).
- Layout: `/tenantId=/category=/year=/month=/day=/part-*.parquet` (Hive-style for partition pruning).
- Security: per-tenant containers or prefixes, RBAC + SAS.
- Index-in-SQL: table `audit.ArchiveManifests (TenantId, Category, FromUtc, ToUtc, Format, Uri, Hash, CreatedAtUtc)` enables discovery and integrity verification.
Optional query-through
- Serverless SQL / Fabric external tables over ADLS (for ad-hoc).
- Keep primary APIs on SQL hot/warm to protect SLOs; cold queries are explicitly slower and rate-limited.
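A builder for the Hive-style archive prefix might look like this (purely illustrative; the exact template and placeholder values are policy-defined per tenant):

```python
from datetime import date

def archive_prefix(tenant_id: str, category: str, day: date) -> str:
    # Hive-style key=value path segments let serverless SQL / Fabric
    # external tables prune partitions by tenant, category, and date.
    return (f"tenantId={tenant_id}/category={category}/"
            f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/")
```

Keeping tenant first in the prefix also aligns with per-tenant containers/prefixes and RBAC + SAS scoping.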
HA/DR & Backups¶
- HA: Azure SQL Business Critical or Premium for hot tenants; zone redundancy on.
- DR: Active Geo-Replication to paired region; failover runbook tested quarterly.
- Backups: PITR enabled (e.g., 7–35 days hot DB); long-term backup for compliance (weeks–years per policy).
- ADLS: RA-GRS (geo-redundant) with immutability policies (`Time-based retention`, `Legal hold` flags on containers).
Encryption & Keys¶
- Azure SQL: TDE on; per-tenant salts/keys for hashing/tokenization in Key Vault; rotation schedule.
- ADLS: SSE with Microsoft- or customer-managed keys (CMK).
- Export/Manifest signing: keys in HSM/Key Vault; manifests include `kid`.
Capacity & Sizing (rules of thumb)¶
- Row size (post-redaction, compact JSON deltas): 300–1200 bytes typical; plan for ~1 KB avg.
- Daily volume: `rows/day = events_per_sec × 86,400`.
- Hot partition size: keep below 200–300 GB per table for comfortable maintenance on P-SKU; split tenants/categories if needed.
- Read models: size ~40–70% of write model depending on projection density.
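Applying the rules of thumb with illustrative numbers:

```python
def daily_rows(events_per_sec: float) -> int:
    # rows/day = events_per_sec × 86,400 (seconds per day)
    return int(events_per_sec * 86_400)

def daily_bytes(events_per_sec: float, avg_row_bytes: int = 1024) -> int:
    # Plan for ~1 KB average post-redaction row size.
    return daily_rows(events_per_sec) * avg_row_bytes

# At 1,000 events/s: ~86.4M rows/day, ~88 GB/day at ~1 KB/row —
# a monthly hot partition would blow past the 200–300 GB comfort zone,
# so that tenant/category should be split or partitioned weekly.
```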
Maintenance Jobs¶
- Partition management: monthly partition create/switch/merge automation.
- Compression: roll older partitions to PAGE/columnstore weekly.
- Index care: monitor fragmentation (>30% → rebuild/reorg off-peak).
- Vacuum (if soft-deletes used): periodic cleanup.
- Checksum audit: re-verify segment hashes; surface drift alarms.
Cost Controls¶
- Right-size SKUs: scale up for backfill, scale down after.
- Move warm partitions to cheaper compute tiers (if environment allows); keep only last N partitions in high performance tier.
- Blob tiering: Hot → Cool → Archive based on access.
- Serverless analytics: pay-per-query for cold investigations.
Access Patterns → Storage Mapping¶
| Query | Storage | Index/Technique |
|---|---|---|
| Tenant timeline for last 30 days | Azure SQL (hot) | Clustered seek on (TenantId, OccurredAtUtc) |
| Resource drill-down | Azure SQL (hot/warm) | IX_Resource nonclustered index |
| Actor investigations | Azure SQL (hot/warm) | IX_Actor nonclustered index |
| Bulk eDiscovery export | Azure SQL → ADLS | Range scan + export job; manifest in SQL |
| Legacy period review (> retention) | ADLS Gen2 | External table / download & verify via manifest |
Notes¶
- Keep append-only discipline: no in-place updates; use events and projections to evolve views.
- Treat SQL as source of truth for active windows; treat ADLS + manifests as source of truth for archived windows.
- All movement between tiers is audited and integrity-checked (hash lists + proofs).
Integrity & Tamper-evidence¶
Purpose¶
Provide cryptographic tamper-evidence for the audit trail. The IntegrityLedger maintains rolling segments of records per (tenantId, category), computes proofs on seal (hash-chain + Merkle root), and emits integrity.segment_sealed. Proofs are embedded in exports and can be independently verified. No rewrites are permitted; authorized redactions produce tombstones + redacted copies linked by proof references.
Model¶
classDiagram
class IntegrityLedger {
+SealPolicy sealPolicy
+StartSegment(tenantId, category)
+AppendLeaf(tenantId, category, leaf)
+SealSegment(tenantId, category) : SegmentProof
+GetProof(segmentId) : SegmentProof
}
class Segment {
+string segmentId
+string tenantId
+string category
+datetime openedAtUtc
+long firstSequence
+long lastSequence
+HashChain chain
+MerkleBuilder merkle
+bool sealed
+datetime? sealedAtUtc
+string rootHash
+string hashAlgorithm
}
class HashChain {
+string lastHash
+Append(leafHash) : string
}
class SegmentProof {
+string segmentId
+string tenantId
+string category
+string hashAlgorithm // sha256
+string rootHash // Merkle root
+string chainTipHash // rolling hash at seal
+Uri proofUrl // blob manifest
+datetime sealedAtUtc
}
- Leaf = hash of a canonicalized audit record (post-redaction), plus minimal stable metadata.
- Chain: sequential rolling hash provides ordering proof; Merkle provides membership proof.
- Per (tenant, category): each has an open segment until sealed.
Canonicalization & Hashing¶
Leaf canonical form (JSON with deterministic ordering, UTF-8, no whitespace):
{
"v":1,
"recordId":"01J9Z7B19X4P8ZQ7H6M4V6GQWY",
"tenantId":"t-9c8f1",
"category":"identity",
"occurredAtUtc":"2025-09-30T12:34:56Z",
"action":"User.PasswordChanged",
"actorId":"u-12345",
"resource":{"type":"User","id":"u-12345"},
"policyVersion":7
}
- Values are those stored at rest (already redacted).
- Hash: `leafHash = SHA-256(UTF8(canonicalJson))`.
- Chain step: `chainTip = SHA-256(chainTip || leafHash)` (with an empty tip for the first leaf).
- Merkle: standard binary Merkle tree with left-right concatenation; odd leaves are duplicated for pairing.
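The three primitives above can be sketched as follows. This is a minimal illustration: `json.dumps(sort_keys=True)` stands in for the service's canonical field ordering (which is positional, not alphabetical), and all names are illustrative.

```python
import hashlib
import json

def leaf_hash(record: dict) -> bytes:
    # Canonical form: deterministic key order, no whitespace, UTF-8.
    # (Stand-in ordering; the real canonicalization fixes field positions.)
    canonical = json.dumps(record, sort_keys=True,
                           separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(canonical).digest()

def chain_append(chain_tip: bytes, leaf: bytes) -> bytes:
    # chainTip = SHA-256(chainTip || leafHash); pass b"" for the first leaf.
    return hashlib.sha256(chain_tip + leaf).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    # Binary Merkle tree; odd leaves are duplicated for pairing.
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]
```

The rolling chain proves ordering (each tip commits to every prior leaf), while the Merkle root supports compact per-record membership proofs.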
Sealing Policy¶
Sealing triggers when any threshold is reached (configurable, per tenant/category):
- `maxRecords` (e.g., 10,000)
- `maxBytes` (e.g., 100 MB accumulated leaf inputs)
- `maxDuration` (e.g., 5 minutes)
On seal:
- Finalize Merkle root (`rootHash`) and capture `chainTipHash`.
- Persist segment metadata (atomic update).
- Write proof bundle (JSON) to Blob and record URL.
- Publish `integrity.segment_sealed`.
- Open a new segment.
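The any-of trigger semantics can be sketched as a pure predicate (names and defaults are illustrative; real thresholds are configured per tenant/category):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class SealPolicy:
    max_records: int = 10_000
    max_bytes: int = 100 * 1024 * 1024   # accumulated leaf input bytes
    max_duration: timedelta = timedelta(minutes=5)

def should_seal(policy: SealPolicy, records: int, accumulated_bytes: int,
                opened_at_utc: datetime, now_utc: datetime) -> bool:
    # Seal when ANY threshold is reached.
    return (records >= policy.max_records
            or accumulated_bytes >= policy.max_bytes
            or now_utc - opened_at_utc >= policy.max_duration)
```

Evaluating the predicate on every append (and on a timer, for the duration threshold) keeps segments bounded even when a tenant's traffic stalls mid-segment.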
Proof Bundle (stored in Blob; referenced in SQL)¶
{
"type": "audit.segment-proof",
"schemaVersion": 1,
"segment": {
"segmentId": "seg-2025-09-30-identity-0012",
"tenantId": "t-9c8f1",
"category": "identity",
"firstSequence": 8123400,
"lastSequence": 8133399,
"openedAtUtc": "2025-09-30T12:00:00Z",
"sealedAtUtc": "2025-09-30T12:05:00Z"
},
"hash": {
"algorithm": "sha256",
"root": "f0c4…",
"chainTip": "9ab7…"
},
"leaves": {
"count": 10000,
"mapping": "omitted", // mapping is not required; provided on demand
"leafHashFormat": "sha256(hex)"
},
"signing": {
"kid": "key-tenant-t-9c8f1-2025-09",
"signature": "base64url(sha256withRSA(manifest-bytes))"
}
}
- Signing: proof bundle is signed with a per-tenant key (Key Vault/HSM).
- SQL index: `audit_rm.IntegrityProofs` stores `segmentId`, `rootHash`, `sealedAtUtc`, `proofUrl`.
Inclusion Proofs (per record)¶
When a client requests proof for a specific record:
- Compute/lookup the `leafHash` for `recordId`.
- Return Merkle path + `rootHash` + `hashAlgorithm` + `segmentId`:
{
"type": "audit.membership-proof",
"segmentId": "seg-2025-09-30-identity-0012",
"recordId": "01J9Z7B19X4P8ZQ7H6M4V6GQWY",
"hashAlgorithm": "sha256",
"leafHash": "1c0d…",
"merklePath": ["b713…","004f…","9aa1…"], // left/right indicated via structured tuples if needed
"rootHash": "f0c4…",
"sealedAtUtc": "2025-09-30T12:05:00Z"
}
Verification (offline):
- Rebuild leaf from the canonicalized stored record; hash → `leafHash`.
- Fold `merklePath` to `rootHash`.
- Compare to the bundle's `rootHash` and signature.
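The fold step can be sketched with sided path elements, matching the note that left/right is indicated via structured tuples (names are illustrative; signature checking is omitted):

```python
import hashlib

def fold_merkle_path(leaf_hash: bytes,
                     path: list[tuple[str, bytes]]) -> bytes:
    # Each element is (side, sibling_hash); side says where the sibling
    # sits relative to the running hash ("L" = sibling on the left).
    h = leaf_hash
    for side, sibling in path:
        pair = sibling + h if side == "L" else h + sibling
        h = hashlib.sha256(pair).digest()
    return h

def verify_membership(leaf_hash: bytes, path: list[tuple[str, bytes]],
                      root_hash: bytes) -> bool:
    return fold_merkle_path(leaf_hash, path) == root_hash
```

Because the fold needs only hashes, an external investigator can run it offline against the proof bundle without any access to the platform or its keys.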
Redaction & “No Rewrites”¶
- No rewrites: once a record is written, it is never modified.
- Authorized redaction (rare, policy-driven) produces two new facts:
  - Tombstone: `AuditRecordRedacted` referencing the original `recordId` (reason, authority, case).
  - RedactedCopy: a fully redacted replacement record with a new `recordId`, linked to the original.
- Both facts are appended normally and become part of subsequent segments.
- Queries default to the redacted copy; exports include the tombstone + redacted copy and their proofs.
Linking (schema additions)
ALTER TABLE audit.Records ADD
OriginalRecordId VARCHAR(26) NULL, -- present only for redacted copies
IsTombstone BIT NOT NULL DEFAULT 0;
- Integrity: because the original record remains unchanged and included in past sealed segments, any redaction is additive and auditable.
Export Integration¶
- Each ExportJob manifest lists the segments covered and embeds:
- the segment proof bundle references (URLs, hashes, signatures), and
- optional membership proofs for all included records (or on-demand endpoint to retrieve them).
- This enables an external investigator to verify off-platform that:
- the exported records belong to sealed segments, and
- the segment proofs are authentic (signature) and consistent (hashes).
Events & Flow¶
sequenceDiagram
participant AS as AuditStream
participant IL as IntegrityLedger
participant SQL as SQL (Segments)
participant BLOB as Blob (Proofs)
participant BUS as Service Bus
AS->>IL: audit.segment_seal_requested (threshold hit)
IL->>IL: Finalize Merkle + ChainTip
IL->>SQL: Update audit.Segments (sealedAt, rootHash, chainTip)
IL->>BLOB: Upload proof-bundle.json (signed)
IL->>BUS: Publish integrity.segment_sealed {segmentId, rootHash, proofUrl}
Event payload (code-first)
public sealed record SegmentSealedV1(
string SegmentId,
string TenantId,
string Category,
long FirstSequence,
long LastSequence,
string HashAlgorithm,
string RootHash,
string ChainTipHash,
Uri ProofUrl,
DateTime SealedAtUtc);
APIs (server-side verification helpers)¶
- `GET /audit/proofs?category=...&segmentId=...` — returns `SegmentProof` (already defined in the Query APIs cycle).
- `GET /audit/proofs/record/{recordId}` — returns the membership proof for a single record (tenant-scoped).
- `POST /audit/proofs/verify` — accepts `{ segmentProof, membershipProof?, record? }`, returns `{ valid: true|false, reasons: [] }` (best-effort convenience; offline verification is authoritative).
Data Structures (SQL recap)¶
- `audit.Segments` (already defined): add `ChainTipHash VARCHAR(128)`.
- `audit_rm.IntegrityProofs` (read model): lists segment/export proofs, URLs, hash algorithms.
Performance & Operations¶
- Streaming Merkle: build incrementally; avoid holding all leaves in memory (use rolling bottom layer + disk spill if needed).
- Parallelization: seal in a background worker per `(tenant, category)` to avoid cross-tenant contention.
- Sizing: target segments that seal in 1–5 minutes under typical load to keep proofs bite-sized.
- Metrics: `audit_integrity_segments_sealed_total`, `audit_integrity_seal_duration_seconds`, `audit_integrity_proof_size_bytes`.
- Backfill: sealing operates the same; thresholds tuned for bulk jobs.
Optional External Anchoring¶
Periodically (e.g., hourly) anchor the latest set of segment root hashes into an external, time-stamping authority (or public blockchain) by publishing a super-root (Merkle of roots). Store the anchor receipt in Blob and index it in SQL to strengthen non-repudiation.
Security Notes¶
- Proof bundles are signed; keys live in HSM/Key Vault; manifests include `kid`.
- Verification requires only the stored record (redacted), the membership proof, and the segment proof bundle—no secret keys.
- All proof access is audited (purpose-limited).
Retention & Legal Hold¶
Purpose¶
Protect tenants with time-bounded data retention while guaranteeing that legal holds can suspend purge for investigations or litigation. Retention is enforced per tenant + category; purge operates on eligible segments first (row-level fallback when necessary). Every action—policy change, eligibility evaluation, purge, hold apply/release—is audited (“audit of the auditor”).
Model¶
classDiagram
class RetentionPolicy {
+string policyId
+string tenantId
+int version
+map<string,int> daysByCategory // e.g., identity:365, billing:2555
+RetentionMode mode // PURGE|REDACT_TOMBSTONE|ARCHIVE_THEN_PURGE
+datetime effectiveFromUtc
+string author
+bool lockedByCompliance // prevents weakening below baseline
}
class LegalHold {
+string holdId
+string tenantId
+string caseId
+string[] categories
+datetime fromUtc
+datetime toUtc
+string query // optional: actor/resource/action filter
+string reason
+string createdBy
+datetime createdAtUtc
+bool released
+datetime? releasedAtUtc
}
class RetentionMode {
<<Enumeration>>
PURGE
REDACT_TOMBSTONE
ARCHIVE_THEN_PURGE
}
Precedence
- LegalHold > Retention — any held scope is excluded from purge.
- Compliance baselines — minima per category can’t be reduced (e.g., HIPAA).
- Mode semantics — determine how eligibility is acted upon:
  - `PURGE`: delete eligible rows (no rewrites; events notify projections).
  - `REDACT_TOMBSTONE`: replace payload with redacted tombstone entries (additive).
  - `ARCHIVE_THEN_PURGE`: export eligible rows to ADLS (with manifest & proofs) before deleting.
Data Structures (SQL)¶
Retention policies (versioned)
CREATE TABLE retention.Policies (
PolicyId VARCHAR(40) NOT NULL,
TenantId VARCHAR(64) NOT NULL,
Version INT NOT NULL,
DaysByCategory NVARCHAR(1024) NOT NULL, -- JSON { "identity":365, "billing":2555 }
Mode VARCHAR(24) NOT NULL, -- PURGE|REDACT_TOMBSTONE|ARCHIVE_THEN_PURGE
EffectiveFromUtc DATETIME2(3) NOT NULL,
Author VARCHAR(128) NOT NULL,
LockedByCompliance BIT NOT NULL DEFAULT 0,
CreatedAtUtc DATETIME2(3) NOT NULL DEFAULT SYSUTCDATETIME(),
CONSTRAINT PK_retention_Policies PRIMARY KEY (TenantId, PolicyId, Version)
);
Legal holds
CREATE TABLE retention.LegalHolds (
HoldId VARCHAR(40) NOT NULL,
TenantId VARCHAR(64) NOT NULL,
CaseId VARCHAR(64) NOT NULL,
CategoriesJson NVARCHAR(512) NOT NULL,
FromUtc DATETIME2(3) NOT NULL,
ToUtc DATETIME2(3) NOT NULL,
Query NVARCHAR(1024) NULL, -- optional filter DSL
Reason NVARCHAR(512) NOT NULL,
CreatedBy VARCHAR(128) NOT NULL,
CreatedAtUtc DATETIME2(3) NOT NULL DEFAULT SYSUTCDATETIME(),
Released BIT NOT NULL DEFAULT 0,
ReleasedAtUtc DATETIME2(3) NULL,
CONSTRAINT PK_retention_LegalHolds PRIMARY KEY (TenantId, HoldId)
);
Purge ledger (idempotent, auditable)
CREATE TABLE retention.PurgeLedger (
JobId VARCHAR(40) NOT NULL,
TenantId VARCHAR(64) NOT NULL,
Category VARCHAR(64) NOT NULL,
FromUtc DATETIME2(3) NOT NULL,
ToUtc DATETIME2(3) NOT NULL,
Mode VARCHAR(24) NOT NULL,
Status VARCHAR(16) NOT NULL, -- started|completed|failed|canceled
RecordsAffected BIGINT NOT NULL DEFAULT 0,
SegmentsTouched INT NOT NULL DEFAULT 0,
ManifestUrl NVARCHAR(512) NULL, -- when ARCHIVE_THEN_PURGE
StartedAtUtc DATETIME2(3) NOT NULL DEFAULT SYSUTCDATETIME(),
CompletedAtUtc DATETIME2(3) NULL,
Error NVARCHAR(1024) NULL,
CONSTRAINT PK_retention_PurgeLedger PRIMARY KEY (JobId)
);
Eligibility & Purge Algorithm¶
Eligibility rule — a record is eligible if:
- occurredAtUtc <= (nowUtc - daysByCategory[category]), and
- the record (or its segment) is not covered by any active legal hold (time/category/query match).
Segment-first strategy
- Prefer whole-segment purge for efficiency when the entire segment is eligible and not intersecting any active hold.
- When holds intersect a segment’s range, fallback to row-level filtering for the non-held subset.
Pseudocode (service layer)
var policy = policies.ActiveFor(tenantId, now);
var windows = policy.ToWindows(nowUtc); // per category (from,to)
foreach (var cat in windows.Keys)
{
var (from, to) = windows[cat];
var held = holds.ActiveScopes(tenantId, cat, from, to);
var candidates = segments.Range(tenantId, cat, upTo: to);
foreach (var seg in candidates)
{
if (held.Intersects(seg))
PurgeRows(seg, exclude: held.RowFilters);
else
PurgeSegment(seg);
}
}
Idempotency
Each purge job writes an entry to PurgeLedger. Re-running with the same (TenantId, Category, FromUtc, ToUtc) produces no duplicated effects.
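The ledger-keyed idempotency above can be sketched in a few lines. This is a minimal, illustrative Python stand-in (the platform itself is C#/SQL); `PurgeLedger` and `try_start` are hypothetical names. The deterministic job key derived from `(TenantId, Category, FromUtc, ToUtc)` is what makes re-runs no-ops:

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class PurgeLedger:
    """In-memory stand-in for retention.PurgeLedger, keyed by purge scope."""
    _jobs: dict = field(default_factory=dict)

    def job_key(self, tenant_id: str, category: str, from_utc: str, to_utc: str) -> str:
        # Deterministic JobId: the same scope always maps to the same job.
        raw = f"{tenant_id}|{category}|{from_utc}|{to_utc}"
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:40]

    def try_start(self, tenant_id, category, from_utc, to_utc):
        """Return (job_id, started); started is False when the scope already ran."""
        key = self.job_key(tenant_id, category, from_utc, to_utc)
        if key in self._jobs:
            return key, False          # duplicate run: no second purge effect
        self._jobs[key] = {"status": "started"}
        return key, True

ledger = PurgeLedger()
j1, first = ledger.try_start("t-1", "identity", "2024-01-01", "2025-01-01")
j2, again = ledger.try_start("t-1", "identity", "2024-01-01", "2025-01-01")
assert j1 == j2 and first and not again  # re-run produces no duplicated effects
```

In the real system the same check is a conditional insert against `retention.PurgeLedger` inside the purge transaction.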
Saga & Events¶
sequenceDiagram
participant SCH as Scheduler
participant RP as RetentionPolicy Service
participant LH as LegalHold Service
participant AS as AuditStream
participant IL as IntegrityLedger
participant BUS as Service Bus
SCH->>RP: Tick (~15m)
RP->>LH: Query active holds per tenant/category
RP->>AS: Evaluate eligible segments/rows (excludes holds)
AS-->>RP: Eligible set (segments|rows)
RP->>AS: Execute purge (mode-dependent)
AS-->>BUS: retention.purged {tenant, category, range, count, mode}
RP-->>BUS: retention.window_elapsed {tenant, category, toUtc}
IL-->>AS: (optional) seal small tail segments before purge
Emitted events
- retention.policy_changed — on policy upsert/activate
- retention.window_elapsed — category window crossed (informational)
- retention.purged — includes { jobId, tenantId, category, fromUtc, toUtc, mode, records, segments }
- legalhold.applied / legalhold.released — from LegalHold BC
All events are published via outbox; consumers (projections, dashboards) are idempotent.
Commands & APIs (admin)¶
RetentionPolicy BC
- Commands: PutRetentionPolicy, EvaluatePurgeEligibility, ExecuteRetentionPurge.
- HTTP (admin):
  - PUT /audit/admin/retention-policies — body: policy draft (versioned, forward-only).
  - POST /audit/admin/retention/purge — body: { category?, from?, to?, mode? } to force/limit scope.
- Scopes: audit.admin.policy. Elevated actions require X-Purpose.
LegalHold BC
- Commands: PlaceLegalHold, ReleaseLegalHold.
- HTTP (admin):
  - POST /audit/admin/legal-holds — create hold (caseId, categories, range, query, reason).
  - POST /audit/admin/legal-holds/{holdId}:release.
- Scopes: audit.admin.policy; justification required.
Mode Semantics & Effects¶
| Mode | Storage effect | Read models | Exports | Integrity |
|---|---|---|---|---|
| PURGE | Physical delete of eligible rows | Projectors receive retention.purged and remove rows | N/A | Past proofs remain valid (affected records were in prior sealed segments) |
| REDACT_TOMBSTONE | Replace payload with minimal tombstone rows | Views show redacted entries with reason | Exports include tombstones | Additive; no rewrites of originals |
| ARCHIVE_THEN_PURGE | Export to ADLS (manifest, hashes), then delete | Views show nothing after purge (unless an "archived" badge) | Signed manifest + URIs included | Segment proofs referenced in manifest |
Archival step (for ARCHIVE_THEN_PURGE):
- Export writes a manifest to Blob (with counts, hashes, referenced segment proofs).
- PurgeLedger.ManifestUrl is populated; downstream can verify off-platform.
Legal Hold Scope¶
A hold can specify any combination of:
- Time window (fromUtc–toUtc)
- Categories (e.g., identity, billing)
- Query (optional DSL): by actor.id, resource.type/id, action prefix — compiled to segment/time filters plus secondary row-level filters.
Operational rules
- Holds are created once and can be released; they are never edited retroactively.
- A released hold stays in history for audit purposes.
- Applying/releasing a hold emits events and updates indexes used by the retention saga.
Auditing (“audit of the auditor”)¶
Every administrative action creates an audit record:
- Retention.PolicyChanged (policyId, version, author, diff)
- Retention.PurgeStarted / Retention.PurgeCompleted (jobId, counts, mode)
- LegalHold.Applied / LegalHold.Released (holdId, caseId, scope)
These are tenant-scoped (or platform-admin scoped for cross-tenant operations) and appear in timelines and exports.
Metrics & Alerts¶
- retention_eligible_segments{tenant,category} — Gauge
- retention_purge_duration_seconds{mode} — Histogram
- retention_records_purged_total{tenant,category} — Counter
- legalholds_active_total{tenant} — Gauge
- Alerts:
  - Purge failure rate > 1% over 15m
  - Purge duration p95 > SLO (e.g., 10m per 1M rows)
  - Active holds overlapping > 80% of eligible data (investigate policy vs. legal load)
Edge Cases & Safeguards¶
- Backfill: backfilled historical records honor current legal holds but are evaluated against policy windows relative to their occurredAtUtc.
- Segment tails: tiny tails can be sealed before purge to simplify eligibility evaluation.
- Partial-hold segments: row-level purge with strict filters; if too fragmented, defer to the next window.
- Break-glass: platform admins can inspect retention state across tenants only with dual control + justification; actions are fully audited.
Diagram — Retention Decision Flow¶
flowchart TD
A[Policy Window Elapsed] --> B[Load Active Legal Holds]
B --> C{Held Overlap?}
C -- No --> D[Select Eligible Segments]
C -- Yes --> E["Row-level Filter (exclude holds)"]
D --> F{Mode}
E --> F
F -- PURGE --> G[Delete rows/segments]
F -- REDACT_TOMBSTONE --> H[Insert tombstones]
F -- ARCHIVE_THEN_PURGE --> I[Export → Manifest] --> G
G --> J[Emit retention.purged]
H --> J
Backfill, Migration & Replay¶
Purpose¶
Enable safe historical imports (backfill), schema evolution without downtime, and deterministic rebuilds of read models via an event replay harness. Preserve append-only guarantees, integrity proofs, and tenant isolation throughout.
Modes¶
- Backfill — ingest historical records under elevated controls (NDJSON/Parquet → canonical AuditRecord), honoring classification, legal holds, and retention.
- Replay — rebuild projections (audit_rm.*) from the source of truth (audit.Records) or from archived files when SQL is pruned.
- Migration — introduce new schemas/indices/views using blue/green read models; evolve contracts with versioned exporters/importers.
Architecture Overview¶
flowchart LR
SRC[(Sources<br/>NDJSON/Parquet/Legacy DB)] --> IMP["Importer (versioned)"]
IMP --> BF[Backfill API]
BF --> WR[(audit.Records - append-only)]
WR --> OUTBOX[(Outbox Events)]
OUTBOX --> BUS[[Service Bus]]
BUS --> PROJ[Projectors]
PROJ --> RM[(audit_rm.* v1)]
PROJ -.replay.-> RM2[(audit_rm.* v2)]
ARCH[(ADLS Gen2 Archives)] --> RPY[Replay Harness] --> PROJ
- Backfill and live ingest both land in the same append-only store, publish the same events, and feed the same projectors.
- Replay can read from audit.Records or ADLS archives (JSONL/Parquet) when rebuilding.
Backfill¶
Transport
- POST /audit/records:backfill (NDJSON/JSON; gzip).
- Throughput-isolated from online ingest; runs with the audit.backfill scope and a mandatory X-Purpose.
Importer (versioned)
- Importer@v1, Importer@v2, … map legacy payloads → canonical AuditRecord.
- Each importer declares: sourceVersion, normalizers, classificationHints, idempotencyStrategy.
- Validation: run a dry-run to compute accept/reject counts and classification outcomes before commit.
Idempotency
- Recommended deterministic keys: sourceName:sourceId:occurredAtUtc.
- Duplicate lines across replays safely dedupe on (tenantId, idempotencyKey).
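The deterministic-key convention can be illustrated with a short sketch (Python here for illustration; the production gate is the SQL `audit.Idempotency` table described later). `backfill_key`, `key_hash`, and `dedupe` are hypothetical names, and the salt value is a placeholder:

```python
import hashlib

def backfill_key(source_name: str, source_id: str, occurred_at_utc: str) -> str:
    """Deterministic idempotency key per the sourceName:sourceId:occurredAtUtc convention."""
    return f"{source_name}:{source_id}:{occurred_at_utc}"

def key_hash(tenant_id: str, key: str, tenant_salt: bytes = b"demo-salt") -> str:
    # Only the salted hash is ever stored or logged, never the raw key.
    return hashlib.sha256(tenant_salt + f"{tenant_id}:{key}".encode("utf-8")).hexdigest()

seen: set = set()  # stand-in for the (tenantId, idempotencyKey) gate

def dedupe(tenant_id: str, key: str) -> bool:
    """Return True when the record is new for (tenantId, idempotencyKey)."""
    h = key_hash(tenant_id, key)
    if h in seen:
        return False
    seen.add(h)
    return True

k = backfill_key("legacy-iam", "row-9912", "2024-06-01T08:00:00Z")
assert dedupe("t-1", k) is True    # first import
assert dedupe("t-1", k) is False   # replayed line is suppressed
assert dedupe("t-2", k) is True    # different tenant, independent scope
```

Because the key is a pure function of the source record, re-running an entire import file is safe by construction.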
Integrity
- Historical loads create historical segments per (tenant, category) with compressed thresholds; each chunk is sealed immediately on completion to keep proofs compact and time-bounded.
- Segment labels include mode=historical for audit transparency.
Replay Harness¶
Goals
- Deterministically rebuild audit_rm.* projections and caches without touching the write model.
- Support point-in-time (PIT) rebuilds and full rebuilds per tenant/category.
Sources
- Primary: audit.Records (clustered seek by time).
- Secondary: ADLS (JSONL/Parquet) via streaming readers when SQL windows were purged/archived.
Controls
- Tenant-scoped, time-bounded, category-filtered runs.
- Resume with high-water marks (HWM): (OccurredAtUtc, RecordId) stored per projection.
- Parallelism tuned per tenant/category; avoid cross-tenant contention.
Mechanics
- Disable or shadow existing consumer for the target projection (blue/green).
- Stream records (or archived files) in order; synthesize audit.record_appended envelopes (or re-consume the bus) into new projection tables audit_rm_v2.*.
- Validate counts/hashes; swap atomically (synonym/view switch); keep the old (v1) tables for rollback.
CLI Sketch (code-first)
audit-replay run \
--tenant t-9c8f1 \
--projection TimelineV2 \
--from 2025-09-01T00:00:00Z --to 2025-09-30T23:59:59Z \
--source sql --degree-of-parallelism 4 \
--snapshot resume
Idempotent Upserts
- Projections write with MERGE/UPSERT keyed by the target PK; handlers record ConsumerOffset(eventId) for safety even during replay.
Snapshot Strategy (large streams)¶
Why
- Reduce replay time for very large tenants/resources.
Snapshot types
- Projection Snapshots: periodic materialized state (e.g., last cursor for Timeline, last known view aggregates).
- Segment Checkpoints: (segmentId, lastRecordId) to allow resuming mid-segment on replays.
- Cursor Snapshots: per-projection HWM (OccurredAtUtc, RecordId).
Policy
- Snapshot every N million records or M minutes; store in audit_snap.* (tenant-scoped).
- Snapshots are append-only with metadata: projection version, policy version, tool version.
Restore
- Replay starts from the latest compatible snapshot; verifies hash of the first delta batch before proceeding.
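The restore rule (latest compatible snapshot, else full replay) is simple enough to sketch. This is an illustrative Python fragment, not the replay harness itself; `resume_cursor` and the snapshot field names are assumptions:

```python
def resume_cursor(snapshots: list, projection_version: int) -> tuple:
    """Pick the latest snapshot compatible with the projection version, else replay from origin."""
    compatible = [s for s in snapshots if s["projectionVersion"] == projection_version]
    if not compatible:
        # No compatible snapshot: full rebuild from the beginning of the stream.
        return ("1970-01-01T00:00:00Z", "")
    # Cursor order matches the replay order: (OccurredAtUtc, RecordId).
    latest = max(compatible, key=lambda s: (s["occurredAtUtc"], s["recordId"]))
    return (latest["occurredAtUtc"], latest["recordId"])

snaps = [
    {"projectionVersion": 2, "occurredAtUtc": "2025-09-10T00:00:00Z", "recordId": "01J"},
    {"projectionVersion": 2, "occurredAtUtc": "2025-09-20T00:00:00Z", "recordId": "01K"},
    {"projectionVersion": 1, "occurredAtUtc": "2025-09-25T00:00:00Z", "recordId": "01L"},
]
assert resume_cursor(snaps, 2) == ("2025-09-20T00:00:00Z", "01K")  # v1 snapshot ignored
assert resume_cursor(snaps, 3) == ("1970-01-01T00:00:00Z", "")     # no v3 snapshot: full replay
```

The hash check on the first delta batch (mentioned above) would run after this cursor is chosen, before bulk processing resumes.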
Migration Playbook (blue/green)¶
- Design the new projection schema audit_rm_v2.* (additive fields, new indexes).
- Shadow build: run replay into v2 while v1 serves traffic.
- Dual-write (optional): during bake-in, live events update both v1 and v2 projectors.
- Cutover: swap synonyms/views or route queries via feature flag to v2.
- Monitor: compare v1 vs v2 counts/latencies for a window.
- Decommission: retire v1 after retention period.
DB Aid
CREATE SYNONYM audit_rm.Timeline FOR audit_rm_v2.Timeline;
-- Swap to v1 or v2 by changing synonym target inside a transaction.
Contract Evolution (exporters/importers)¶
- Exporters
  - Exporter@v1: JSONL canonical envelope.
  - Exporter@v2: adds record.schemaVersion, policyVersion, and optional inline membership proofs.
  - Manifests include exporterVersion, schemaVersion, kid, hashes.
- Importers
  - Each importer version declares supported source schemas and a validation matrix.
  - On ingest, record importerVersion into an internal meta table for lineage.
Compatibility Rules
- Additive-first: new fields optional, consumers ignore unknowns.
- Breaking change: publish a new topic name or path (e.g., /v2); maintain v1 until all consumers migrate.
- Policy invariants: redaction for CREDENTIAL/PHI can never be weakened.
Consistency Guarantees¶
- Append-only: neither backfill nor replay modifies existing records; they only append to write models or recompute read models.
- Ordering: replay reads in (OccurredAtUtc, RecordId) order per (tenant, category).
- Integrity: proof segments created during backfill are sealed and referenced; replay does not alter proofs.
Operational Controls¶
- Throttling: replay & backfill respect per-tenant QPS and shared DB concurrency budgets.
- Isolation: run heavy replays in off-peak windows; prioritize hot-tenant live traffic.
- Observability:
  - replay_progress{tenant,projection} — Gauge (0–100).
  - replay_rate_records_per_s{tenant} — Gauge.
  - replay_errors_total{projection,reason} — Counter.
  - Trace spans: audit.replay.batch with fromCursor/toCursor.
Failure Recovery
- Store HWM every N records; on crash, resume from last committed HWM.
- Poison batches → quarantine file with first/last IDs; continue with next range; open incident.
Example: Rebuild Timeline from ADLS¶
audit-replay run \
--tenant t-9c8f1 \
--projection TimelineV2 \
--source adls \
--path "abfss://audit/tenantId=t-9c8f1/category=identity/year=2025/month=09" \
--from 2025-09-01T00:00:00Z --to 2025-09-30T23:59:59Z \
--degree-of-parallelism 8 \
--snapshot none
- Reader streams Parquet row groups → canonical leaf records → projectors.
- On completion, tool outputs counts, min/max timestamps, and verification hashes matching the export manifest.
Guardrails¶
- LegalHold aware: backfill & replay must not bypass active holds (even for historical data).
- Retention aware: do not re-introduce data beyond retention windows into hot SQL; keep in ADLS only.
- No PII leakage: importers run the server-side classification; logs contain only counts and hashed identifiers.
APIs (admin)¶
- POST /audit/admin/replay — body: { tenantId, projection, from, to, source: 'sql'|'adls', path?, snapshot?: 'resume'|'none' }
- GET /audit/admin/replay/{runId} — progress and metrics
- POST /audit/admin/migrations/activate — cutover to v2 projections (scope audit.admin.policy, purpose required)
Diagram — Replay Control Flow¶
sequenceDiagram
participant ADM as Admin
participant RPY as Replay Controller
participant SRC as Source (SQL/ADLS)
participant PJ as Projector Host (v2)
participant DB as audit_rm_v2.*
ADM->>RPY: Start replay (tenant, projection, from/to)
RPY->>SRC: Open reader (ordered stream)
SRC-->>PJ: Records (in order)
PJ->>DB: UPSERT rows (idempotent)
PJ-->>RPY: HWM checkpoints
RPY-->>ADM: Progress (%, counts, ETA)
Checklist (quick start)¶
- Define/import mapping with Importer@vN.
- Dry-run the backfill (classify/validate).
- Run the backfill with idempotency keys and historical segment sealing.
- Spin up replay to audit_rm_v2.* (shadow build).
- Compare v1/v2; cut over via synonym/view swap.
- Archive manifests & proofs; audit everything (who/what/why).
Scale & Performance¶
Purpose¶
Achieve predictable low-latency ingest, fast queries, and stable exports under bursty multi-tenant load. Scale horizontally based on queue depth/lag, tune prefetch & concurrency per handler, and isolate hot tenants with dedicated partitions/queues. Provide concrete SLO targets and batch sizing guidance.
Scaling Strategy (overview)¶
flowchart LR
API[HTTP/gRPC Gateways] --> IN[(Inbox+Dedupe Queues)]
IN --> AS[AuditStream Workers]
IN --> CL[Classification Workers]
IN --> IL[Integrity Workers]
IN --> RP[Retention Workers]
IN --> EX[Export Workers]
subgraph KEDA
KD1["KEDA Scalers: queue length/lag"]
end
KD1 -.-> AS
KD1 -.-> IL
KD1 -.-> RP
KD1 -.-> EX
- KEDA-style autoscaling on queue length and time-in-queue (lag).
- Workers are stateless and scale out; DB & caches scale vertically/horizontally as needed.
- Per-tenant partitions/queues available to isolate “noisy neighbors”.
Queues & Partitions¶
- Default: shard by TenantId hash across N queues (balanced).
- Hot tenants: promote to a dedicated queue/partition with its own consumer group and scaling policy.
- Ordering: per-tenant ordering guaranteed within a partition (keyed by (tenantId, category) when required).
- DLQ: one per queue; monitored and drained by a quarantine worker.
KEDA Scalers (Azure Service Bus example)¶
- Triggers:
  - messageCount > target (e.g., > 1,000 per partition)
  - queueLagSeconds > target (e.g., > 5s p95)
- Scaling policy:
- Min replicas = 1 (or 0 for cold paths like exports)
- Max replicas determined per environment (e.g., 40 for ingest)
- Cooldown 60–120s to avoid thrash
Sketch
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: auditstream-consumer
spec:
scaleTargetRef:
name: auditstream-worker
pollingInterval: 5
cooldownPeriod: 90
minReplicaCount: 2
maxReplicaCount: 40
triggers:
- type: azure-servicebus
metadata:
queueName: audit-ingress-q
messageCount: "1000"
activationMessageCount: "10"
namespace: "***"
- type: prometheus
metadata:
serverAddress: http://prometheus:9090
metricName: audit_consumer_lag_seconds
threshold: "5"
Consumer Tuning (MassTransit / Service Bus)¶
- Prefetch: start at 512–2048 messages per consumer; tune by payload size and handler time.
- Concurrency: ConcurrentMessageLimit — start at env(cores) × 2 per instance. Ensure the handler is CPU-light (I/O bound); otherwise lower concurrency to protect the DB.
- Lock management: use PeekLock; MaxAutoRenewDuration >= handler_p95 × 2.
- Idempotency: maintain a compact dedupe store keyed by (tenantId, idempotencyKey) → recordId.
MassTransit setup (sketch)
cfg.ReceiveEndpoint("audit-ingress-q", e =>
{
e.PrefetchCount = 1024;
e.ConcurrentMessageLimit = Environment.ProcessorCount * 2;
e.UseMessageRetry(r => r.Exponential(5, TimeSpan.FromSeconds(1), TimeSpan.FromSeconds(30), TimeSpan.FromSeconds(2)));
e.UseInMemoryOutbox(); // local idempotent side-effects
e.Consumer<AppendAuditRecordConsumer>();
});
Orleans & gRPC Gateways¶
- Orleans: cap activations per grain; route by tenant key; use stateless workers for CPU-bound transforms; apply backpressure via maxActiveRequests.
- gRPC (code-first): enable HTTP/2, set a minimum thread pool for bursty loads, and bound request body size to protect memory.
- Gateways: implement token-bucket rate limiting per tenant and circuit breakers to shed load under DB pressure.
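The per-tenant token bucket mentioned above works like this: capacity bounds the burst, and the refill rate bounds sustained throughput. A minimal sketch (Python for illustration; `TokenBucket` and its parameters are assumed names, and a production version would need per-tenant state and thread safety):

```python
class TokenBucket:
    """Per-tenant token bucket: capacity = burst allowance, refill_rate = sustained QPS."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity      # start full so initial bursts are allowed
        self.last = 0.0

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller responds 429 with a jittered Retry-After

b = TokenBucket(capacity=3, refill_rate=1.0)  # burst of 3, 1 req/s sustained
assert [b.allow(t) for t in (0, 0, 0, 0)] == [True, True, True, False]
assert b.allow(2.0) is True  # two seconds of refill restore tokens
```

Sizing follows the quick-reference table later in this section: capacity at 2–3× the tenant's SLO rate, refill at the sustained rate.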
Storage Throughput¶
- Write path:
  - Use SqlBulkCopy/TVPs for BatchAppend (500–2,000 rows/batch).
  - Keep inserts ordered by (TenantId, Category, OccurredAtUtc) to minimize page splits.
  - Avoid per-row UDFs; compute hashes/redaction in the app.
- Read path:
  - Projections serve 95% of queries; place hot views on columnstore when scans dominate.
  - Seek pagination only; never use OFFSET on hot tables.
SLOs & Guardrails¶
| Area | Target (p95 unless noted) | Notes |
|---|---|---|
| Ingest lag (produce → record_appended) | ≤ 5s | Autoscale on lag; alert > 30s |
| Append latency (API 201) | ≤ 50–100 ms | Single record; batch returns 202 |
| Consumer backlog | < 1× minute volume | Warn when > 5× |
| Export duration | ≤ 10 min / 10M rows | Scales with shards & format |
| Projector lag | ≤ 2–5 s | Timeline view |
| DLQ rate | ~0 | Page if sustained growth |
Reconfirm SLOs per environment/tenant class; attach exemplars to lag metrics for deep links.
Batch & Payload Sizing¶
- BatchAppend:
  - 500–2,000 items per call; cap request body at 5–10 MB (gzip recommended).
  - Prefer many medium batches over a few giant ones to keep tail latency low.
- Queue messages: keep payloads small; if > ~200 KB per item, consider reference + fetch pattern (store large deltas in Blob, send URI).
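The reference + fetch pattern (often called claim-check) can be sketched briefly. This is an illustrative Python fragment; the `BLOB` dict, `enqueue_payload`, and the `blob://` URI scheme are hypothetical stand-ins for Blob storage and the real message shape:

```python
import json

BLOB: dict = {}  # stand-in for Blob storage

def enqueue_payload(item: dict, max_inline_bytes: int = 200_000) -> dict:
    """Reference + fetch: large payloads go to Blob; the queue message carries only a URI."""
    body = json.dumps(item).encode("utf-8")
    if len(body) <= max_inline_bytes:
        return {"inline": item}                 # small item travels on the bus as-is
    uri = f"blob://audit-deltas/{len(BLOB)}"    # hypothetical URI scheme
    BLOB[uri] = body
    return {"ref": uri}                         # consumer fetches the body by URI

small = enqueue_payload({"action": "login"})
big = enqueue_payload({"delta": "x" * 300_000})
assert "inline" in small
assert "ref" in big and big["ref"] in BLOB
```

Keeping message bodies small preserves broker throughput and prefetch efficiency at the cost of one extra fetch for oversized items.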
Hot Tenant Isolation¶
- Promotion criteria: sustained >10% of total traffic or >50% of a shard’s backlog.
- Actions:
- Dedicated queue/partition for that tenant.
- Per-tenant consumer deployment with its own scaler limits.
- Optional DB pool/read replica priority if available.
Backpressure & Load Shedding¶
- HTTP/gRPC: return 429 with a jittered Retry-After when queues or DB lag exceed thresholds.
- Queue consumers: temporarily lower concurrency or pause pulling when DB CPU > 80% or log write waits spike.
- Exports: queue to a separate tier with strict concurrency caps; never compete with ingest.
Capacity Planning (rules of thumb)¶
- Compute: start at 1 core per ~1–3k msgs/s for light transforms; double for heavy classification.
- Memory: size for prefetch window + concurrent batches; avoid large in-memory buffers.
- DB: target < 70% CPU at peak; keep P99 log flush < 10ms; add tempdb and log throughput headroom.
Performance Testing¶
- Synthetic load: mix (70% small with decisions, 20% with deltas, 10% export triggers).
- Scenarios: steady-state, bursts (10× for 2 min), and soak (24–72h).
- Checks: ingest lag, append latency, dedupe ratio, projector lag, export SLA, segment seal time, DLQ growth.
- Chaos: inject transient DB errors and broker disconnects; verify retries and idempotency.
Example: Export Concurrency¶
- Limit export workers to 1–2 per tenant; global cap (e.g., 8).
- Chunk size: 50–200k records per artifact; stream to Blob; push manifests incrementally.
- Throttle when ingest lag > target to favor writes over exports.
Caching¶
- Edge cache: 5–15s TTL for dashboard timeline pages.
- Redis: cache hot resource/actor timelines keyed by filter hash + cursor (TTL 30–60s).
- ETag + If-None-Match on read APIs to cut egress.
Operational Settings (quick reference)¶
- PrefetchCount: 512–2048
- ConcurrentMessageLimit: cores × 2 (I/O bound) or cores (CPU bound)
- BatchAppend: 500–2,000 items, ≤ 10 MB gz
- Segment sealing: 10k records or 100 MB or 5 min (whichever first)
- KEDA polling: 5s; cooldown 60–120s
- Rate limits: tenant token-bucket sized to SLO + bursting 2–3×
Notes¶
- Always prefer horizontal scale on workers and queues before vertical DB scale; then revisit indexes and partition strategy.
- Keep idempotency cheap and O(1): write-through cache + SQL upsert on (tenantId, idempotencyKey).
- Measure, don't guess: tie autoscaling to lag rather than CPU alone.
Idempotency & Dedupe¶
Purpose¶
Guarantee exactly-once effects at the domain boundary despite at-least-once delivery across HTTP, gRPC (code-first), Service Bus/MassTransit, and Orleans. Enforce dedupe keyed by tenantId + idempotencyKey (or a deterministic content hash) and make all consumers idempotent. Define a clear conflict policy for exact vs. semantic duplicates.
Patterns¶
- Outbox on write: aggregate state change and outbox event are committed atomically; a relay publishes later.
- Inbox on consume: each consumer records the processed eventId (or (aggregateId, sequence)), skipping repeats.
- Idempotency gate on ingest: a first persistent check keyed by (tenantId, idempotencyKey or contentHash).
sequenceDiagram
participant Client
participant Ingest as Ingest Adapter
participant Gate as Idempotency Gate
participant Store as Append-only Store
participant Outbox as Outbox
Client->>Ingest: AppendAuditRecord (Idempotency-Key)
Ingest->>Gate: TryReserve(tenantId, key)
alt duplicate
Gate-->>Ingest: { status: duplicate, recordId }
Ingest-->>Client: 200 OK (duplicate, recordId)
else first-seen
Gate-->>Ingest: { status: reserved }
Ingest->>Store: Append(record)
Store->>Outbox: Save event (same tx)
Ingest-->>Client: 201 Created (recordId)
end
Keys & Deterministic Hashing¶
- Primary: client-provided Idempotency-Key (scoped to tenantId).
- Fallback: deterministic content hash when clients can't provide keys:
  - Canonicalize the material fields of the record (tenantId, occurredAtUtc, actor, action, resource, decision, before/after, policyVersion) — exclude transient fields (traceId, requestId, producer).
  - contentHash = SHA-256(UTF8(canonicalJson)).
- Storage: store only hashed keys (sha256) with a per-tenant salt to reduce leakage.
Examples
- iam:pwd-change:u-12345:2025-09-30T12:34:56Z → sha256:ab12…
- contentHash for the canonicalized payload → sha256:7f9c…
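Canonicalization is what makes the fallback hash deterministic: material fields only, stable key order, compact separators. A minimal Python sketch (the material/transient field sets below mirror the list above; `content_hash` is an assumed name):

```python
import hashlib
import json

MATERIAL_FIELDS = {"tenantId", "occurredAtUtc", "actor", "action", "resource",
                   "decision", "before", "after", "policyVersion"}
# Transient fields (traceId, requestId, producer) are simply not in MATERIAL_FIELDS,
# so they never influence the hash.

def content_hash(record: dict) -> str:
    """SHA-256 over the canonical JSON of material fields only."""
    material = {k: v for k, v in record.items() if k in MATERIAL_FIELDS}
    # sort_keys + compact separators make the serialization deterministic.
    canonical = json.dumps(material, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()

a = {"tenantId": "t-1", "action": "iam.pwd.change", "traceId": "abc"}
b = {"action": "iam.pwd.change", "tenantId": "t-1", "traceId": "xyz"}  # reordered, new trace
assert content_hash(a) == content_hash(b)  # transient fields and key order do not matter
```

Nested values (actor, before/after) would also need canonical ordering in a full implementation; `json.dumps(sort_keys=True)` covers nested dicts as well.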
Conflict Policy¶
| Situation | Detection | Outcome |
|---|---|---|
| Exact duplicate (same material content) | Same (tenantId, idempotencyKey) or same contentHash, and same stored payload hash | 200 OK; return the recordId of the first write; increment dedupe metric |
| Semantic duplicate (same key, different material content) | Same (tenantId, idempotencyKey) paired with a mismatched payload hash | 409 Conflict with problem details; no second write; log "idempotency_conflict" |
| Key reuse after TTL | Key TTL expired, but content matches the stored recordId (via contentHash) | Treat as exact duplicate if the content hash matches; else 409 Conflict |
Material comparison mask (server-defined): ignores correlation.*, preserves all fields that affect the business meaning.
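The decision logic behind the conflict table reduces to a small function. This Python sketch is illustrative (the real gate is the `TryReserve` MERGE shown below); `Outcome` and `decide` are assumed names:

```python
from enum import Enum

class Outcome(Enum):
    CREATED = "201 Created"
    DUPLICATE = "200 OK (duplicate)"
    CONFLICT = "409 Conflict"

def decide(existing: dict, payload_hash: str) -> Outcome:
    """Conflict policy: same key + same content -> duplicate; same key + different content -> conflict."""
    if existing is None:
        return Outcome.CREATED
    if existing["payload_hash"] == payload_hash:
        return Outcome.DUPLICATE   # return the first recordId, bump the dedupe metric
    return Outcome.CONFLICT        # log "idempotency_conflict", write nothing

gate = {"payload_hash": "sha256:7f9c", "record_id": "rec-1"}
assert decide(None, "sha256:7f9c") is Outcome.CREATED
assert decide(gate, "sha256:7f9c") is Outcome.DUPLICATE
assert decide(gate, "sha256:beef") is Outcome.CONFLICT
```

The TTL-expiry row of the table is the same function applied after the expired row has been evicted: a content-hash match degrades to DUPLICATE, anything else to CONFLICT.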
Data Structures (SQL)¶
Idempotency gate
CREATE TABLE audit.Idempotency (
TenantId VARCHAR(64) NOT NULL,
KeyHash CHAR(64) NOT NULL, -- sha256 of (tenant-salted key)
RecordId VARCHAR(26) NOT NULL,
PayloadHash CHAR(64) NOT NULL, -- sha256 of canonical material content
CreatedAtUtc DATETIME2(3) NOT NULL DEFAULT SYSUTCDATETIME(),
ExpiresAtUtc DATETIME2(3) NOT NULL, -- TTL window
CONSTRAINT PK_Idempotency PRIMARY KEY (TenantId, KeyHash)
);
CREATE INDEX IX_Idempotency_Expires ON audit.Idempotency(ExpiresAtUtc);
Consumer inbox (per service)
CREATE TABLE audit.InboxOffsets (
Consumer VARCHAR(64) NOT NULL,
EventId VARCHAR(50) NOT NULL, -- ULID/UUIDv7
ProcessedAtUtc DATETIME2(3) NOT NULL DEFAULT SYSUTCDATETIME(),
CONSTRAINT PK_InboxOffsets PRIMARY KEY (Consumer, EventId)
);
Outbox (per service)
CREATE TABLE audit.Outbox (
EventId VARCHAR(50) NOT NULL,
AggregateType VARCHAR(64) NOT NULL,
AggregateId VARCHAR(64) NOT NULL,
TenantId VARCHAR(64) NOT NULL,
Name VARCHAR(128) NOT NULL,
SchemaVersion INT NOT NULL,
OccurredAtUtc DATETIME2(3) NOT NULL,
PayloadJson NVARCHAR(MAX) NOT NULL,
PublishedAtUtc DATETIME2(3) NULL,
CONSTRAINT PK_Outbox PRIMARY KEY (EventId)
);
Algorithms¶
TryReserve (idempotency gate)
-- Attempt to upsert reservation; if exists, compare hashes
MERGE audit.Idempotency AS tgt
USING (SELECT @TenantId AS TenantId, @KeyHash AS KeyHash) AS src
ON (tgt.TenantId = src.TenantId AND tgt.KeyHash = src.KeyHash)
WHEN NOT MATCHED THEN
INSERT (TenantId, KeyHash, RecordId, PayloadHash, ExpiresAtUtc)
VALUES (@TenantId, @KeyHash, @RecordId, @PayloadHash, DATEADD(day, @TtlDays, SYSUTCDATETIME()))
WHEN MATCHED AND tgt.PayloadHash = @PayloadHash THEN
UPDATE SET ExpiresAtUtc = DATEADD(day, @TtlDays, SYSUTCDATETIME()) -- refresh TTL
WHEN MATCHED AND tgt.PayloadHash <> @PayloadHash THEN
-- Signal conflict (no write). Surface via @@ROWCOUNT or OUTPUT.
UPDATE SET RecordId = tgt.RecordId -- no-op to produce OUTPUT
OUTPUT $action, inserted.RecordId;
Consumer Idempotency
- On message receipt:
  - If (consumer, eventId) exists → ACK & skip.
  - Else BEGIN: handle the message, write side-effects, then insert (consumer, eventId).
  - COMMIT; any failure → retry (at-least-once), but the inbox prevents duplicates.
Batch semantics
- Reserve per item; compute per-item statuses: created | duplicate | conflict | error.
- Return 202 with a multi-status body.
TTL & Windows¶
- Key TTL: 7 days (≥ max client retry horizon); configurable per tenant.
- Inbox TTL: 7–30 days depending on bus re-delivery SLAs and recovery plans.
- Background jobs purge expired rows (ExpiresAtUtc < now).
Transports & Gateways¶
- HTTP: require the Idempotency-Key header for online writes; 200 (duplicate) vs. 409 (conflict).
- gRPC (code-first): metadata carries the key; same semantics.
- Service Bus/MassTransit: key in headers; inbox & gate run in the consumer.
- Orleans: the ingest grain uses the shared gate; grain replays are safe due to reserve-first semantics.
Edge Cases¶
- Clock skew: identical keys but slightly different occurredAtUtc; treat as a semantic conflict if outside skew tolerance.
- Producer bugs (random keys): fall back to the contentHash gate to suppress floods of accidental duplicates.
- Backfill: use deterministic keys derived from source record identifiers; relax the per-request idempotency header requirement, but keep the gate active.
Observability¶
- Metrics:
  - audit_dedupe_hits_total{tenant}
  - audit_idempotency_conflicts_total{tenant}
  - audit_inbox_skips_total{consumer}
  - audit_outbox_backlog{service}
- Logs (PII-safe): audit.append.duplicate, audit.append.conflict, with hashed keys.
- Traces: tag spans with audit.key.hash and dedupe="hit|miss|conflict".
Code Sketches¶
Adapter pipeline (C#)
// Precomputed
var keyHash = HashKey(tenantId, idempotencyKey, tenantSalt);
var payloadHash = HashCanonicalMaterial(record);
var reserve = await _idempotency.TryReserveAsync(tenantId, keyHash, payloadHash, ct);
switch (reserve.Status)
{
    case ReserveStatus.Duplicate:
        return Ok(new { id = reserve.RecordId, status = "duplicate" });
    case ReserveStatus.Conflict:
        return Conflict(Problem("Idempotency key reused with different content."));
    case ReserveStatus.Reserved:
        var id = await _records.AppendAsync(record, ct);               // writes outbox in same tx
        await _idempotency.BindRecordAsync(tenantId, keyHash, id, ct); // bind reservation to record
        return Created(id);
    default:
        throw new InvalidOperationException($"Unknown reserve status: {reserve.Status}");
}
Consumer (MassTransit)
public class AuditEventConsumer : IConsumer<DomainEventEnvelope<AuditRecordAppendedV1>>
{
public async Task Consume(ConsumeContext<DomainEventEnvelope<AuditRecordAppendedV1>> ctx)
{
if (await _inbox.SeenAsync("audit-projection", ctx.Message.EventId)) return;
using var tx = await _db.BeginTransactionAsync();
await _projection.UpsertAsync(ctx.Message, tx);
await _inbox.MarkAsync("audit-projection", ctx.Message.EventId, tx);
await tx.CommitAsync();
}
}
Guidance¶
- Prefer client keys for online paths; use contentHash for bulk/backfill or to guard against missing keys.
- Keep the comparison mask stable and documented; breaking changes to “material fields” must bump its version.
- Do not leak raw keys; hash with tenant salt before storing or logging.
- Handle semantic conflicts early and loudly (409 + alert) — they usually indicate a producer bug.
Classification/Redaction Rules Engine¶
Purpose¶
Implement a deterministic, high-throughput rules engine that classifies fields and applies hash/mask/drop/tokenize transforms before persistence, and optionally on read for permitted elevated views. The engine is server-authoritative, tenant-aware, versioned, and designed so that no PII ever reaches logs or events.
Architecture¶
flowchart LR
    IN["Ingress (HTTP/gRPC/Bus/Orleans)"] --> MAP["Field Mappers<br/>(pre-classification)"]
    MAP --> CLS["Classifier<br/>(policy + heuristics)"]
    CLS --> PLAN["Plan Compiler<br/>(field→rule)"]
    PLAN --> RDX["Redaction Engine<br/>(write-time)"]
    RDX --> APPEND[(Append-only Store)]
    QRY[Query APIs] --> READPLAN["Read Filter Plan<br/>(profile)"]
    READPLAN --> RDX2["Redaction Engine<br/>(read-time)"]
    RDX2 --> OUT["Response (default redacted)"]
- Field Mappers extract candidate fields (e.g., before.fields.email, context.ip).
- Classifier assigns a DataClass using policy rules + hints + heuristics.
- Plan Compiler resolves the effective rule per field (class rule + field override).
- Redaction Engine applies transforms:
  - Write-time (mandatory, persistent)
  - Read-time (optional, profile-based: default, elevated)
Data Structures¶
Policy (recap, per-tenant, versioned)
{
"policyId": "pol-abc",
"version": 7,
"effectiveFromUtc": "2025-09-30T12:00:00Z",
"defaultByField": {
"before.fields.email": "PERSONAL",
"after.fields.email": "PERSONAL",
"context.ip": "INTERNAL",
"after.fields.apiKey": "CREDENTIAL"
},
"rulesByClass": {
"PUBLIC": {"kind":"NONE"},
"INTERNAL": {"kind":"MASK","params":{"showLast":4}},
"PERSONAL": {"kind":"HASH","params":{"algo":"SHA256"}},
"SENSITIVE": {"kind":"MASK","params":{"showLast":2}},
"CREDENTIAL":{"kind":"DROP"},
"PHI": {"kind":"TOKENIZE","params":{"ns":"phi"}}
},
"overridesByField": {
"context.userAgent": {"kind":"MASK","params":{"showFirst":10}},
"after.fields.apiKey": {"kind":"DROP"}
}
}
Rule kinds
- NONE — keep as is.
- MASK — retain portions (showFirst, showLast, maskChar, keepSeparators).
- HASH — one-way hash; per-tenant salt (Key Vault/HSM) + algorithm (e.g., SHA256).
- DROP — remove the value entirely.
- TOKENIZE — store a token; resolution requires an explicit privileged service and is fully audited.
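The transforms behave differently enough that a small worked sketch helps. This Python fragment illustrates a subset of the rule kinds (NONE, MASK with showLast, HASH, DROP); `apply_rule` is an assumed name, the salt is a placeholder, and TOKENIZE is omitted because it needs the privileged token service:

```python
import hashlib

def apply_rule(rule: dict, value: str, tenant_salt: bytes = b"demo") -> str:
    """Apply one redaction rule kind to an already-normalized value (illustrative subset)."""
    kind = rule["kind"]
    if kind == "NONE":
        return value
    if kind == "DROP":
        return None                       # value removed entirely; cannot be recovered
    if kind == "HASH":
        # One-way, per-tenant-salted: equal inputs stay joinable, originals do not leak.
        return "sha256:" + hashlib.sha256(tenant_salt + value.encode("utf-8")).hexdigest()
    if kind == "MASK":
        show = rule.get("params", {}).get("showLast", 0)
        keep = value[-show:] if show else ""
        return "*" * (len(value) - len(keep)) + keep
    raise ValueError(f"unsupported rule kind: {kind}")

assert apply_rule({"kind": "MASK", "params": {"showLast": 4}}, "10.20.30.40") == "*******0.40"
assert apply_rule({"kind": "DROP"}, "sk-live-secret") is None
assert apply_rule({"kind": "HASH"}, "a@b.com").startswith("sha256:")
```

Note that HASH preserves equality joins (the same email always hashes the same within a tenant) while MASK preserves partial readability; the policy picks the trade-off per class.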
Field Mapping¶
- Targets: context.*, before.fields[*], after.fields[*], and headers mirrored into the payload (if any).
- Paths: dot-notation with wildcards, e.g., before.fields.*, after.fields.password*.
- Heuristics (best-effort, server-side): detect emails, E.164 phones, credit-card-like numbers, secrets (common prefixes), and IPs. Heuristics never downgrade an explicit CREDENTIAL/PHI rule to a weaker rule.
Evaluation & Precedence¶
1. Explicit field override (`overridesByField`)
2. Class rule via `defaultByField` or heuristic + hint
3. Fallback class: `INTERNAL`
4. Read profile may apply additional masking only (cannot reverse `HASH`/`DROP`; `TOKENIZE` resolution requires the privileged path)

The server computes `RedactionPlan { byField, policyVersion }` and stores only the transformed values + `policyVersion`.
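The precedence order can be sketched as a small resolver over the policy shape shown in the recap above (the `resolveRule` function and its return shape are illustrative, not part of the BC's contracts):

```javascript
// Illustrative resolver for the effective rule of one field path.
function resolveRule(policy, path, heuristicClass) {
  // 1. Explicit field override wins outright.
  const override = policy.overridesByField[path];
  if (override) return { rule: override, source: 'override' };

  // 2. Class from defaultByField, else from heuristics/hints;
  // 3. fall back to INTERNAL when nothing matches.
  const cls = policy.defaultByField[path] || heuristicClass || 'INTERNAL';
  return { rule: policy.rulesByClass[cls], source: cls };
}
```

Read profiles are intentionally absent here: they can only add masking on top of the resolved write-time rule.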
Write-Time Pipeline (pseudocode)¶
var policy = _policyProvider.GetActive(tenantId); // v7
var mapped = _mapper.MapTargets(record); // yields (path,value)
var classes = _classifier.Classify(mapped, policy, hints); // path -> DataClass
var plan = _planner.Compile(policy, classes); // path -> RedactionRule
foreach (var t in mapped)
{
var rule = plan.For(t.Path);
var norm = _normalizer.Normalize(t.Path, t.Value); // emails lowercased, IP canonical
var red = _redactor.Apply(rule, norm, tenantSalt); // PII-safe
record = _writer.Write(record, t.Path, red);
}
record.PolicyVersion = policy.Version;
Normalization ensures deterministic hashing/masking:
- Email: lowercased, Unicode normalized.
- Phone: E.164.
- IP: RFC-compressed IPv6, no port.
- UA: trimmed to max length.
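Two of these normalizers can be sketched in a few lines (the function names and the default max UA length are assumptions for illustration):

```javascript
// Illustrative normalizers: deterministic inputs for hashing/masking.
function normalizeEmail(raw) {
  // Lowercased and Unicode-normalized (NFC) so equal addresses hash equally.
  return raw.trim().normalize('NFC').toLowerCase();
}

function normalizeUserAgent(raw, maxLen = 256) {
  // Trimmed to a max length before any masking is applied.
  return raw.trim().slice(0, maxLen);
}
```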
Read-Time Profiles¶
| Class | Default | Elevated |
|---|---|---|
| `PUBLIC`/`INTERNAL` | `NONE`/`MASK(…4)` | may relax to `MASK(…8)` |
| `PERSONAL` | `HASH` | stays `HASH` (non-reversible) |
| `SENSITIVE` | `MASK(…2)` | may relax to `MASK(…4)` with auditor scope |
| `CREDENTIAL` | `DROP` | stays `DROP` |
| `PHI` | `TOKENIZE` | returns token only; resolution via dedicated PHI service & separate policy |
- Read profile is chosen by the API layer (`X-Redaction-Level: default|elevated`) + scopes.
- The engine never reconstructs dropped or hashed values.
Plug-in Model¶
Interfaces
public interface IFieldMapper { IEnumerable<Target> Map(AuditRecord r); }
public interface IClassifier { DataClass Classify(Target t, Policy p, Hints h); }
public interface IPlanner { RedactionPlan Compile(Policy p, IDictionary<string,DataClass> classes); }
public interface IRedactor { object Apply(RedactionRule rule, object value, byte[]? tenantSalt=null); }
public interface IReadFilter { object Filter(object storedValue, Rule rule, RedactionProfile profile); }
- Register mappers for custom producers (e.g., billing deltas, IAM extras).
- Add classifier enrichers (regex, ML heuristics if ever needed) behind the policy.
Performance & Caching¶
- Cache policy by `(tenantId, version)`; TTL aligned to policy change events.
- Memoize compiled plans for frequent field sets (e.g., same resource schema).
- Vectorize hashing by batch (SIMD) when applying `HASH` to large lists.
- Target < 1 ms/record for typical 10–30 field transforms; benchmark under P95 load.
Observability Guardrails¶
- No PII in logs: the redaction engine exposes counters only:
  - `rules_applied_total{tenant,kind}` (`HASH`, `MASK`, `DROP`, `TOKENIZE`)
  - `classification_fallback_total{tenant}` (heuristic used)
- Errors contain paths and rule kinds, never raw values.
- Traces include `policy.version`, `targets.count`, `rules.masked.count`, not the data.
Testing¶
- Golden tests: fixtures per class and per rule; assert exact outputs (hashes for given salt).
- Property tests: secrets never leak in logs/events; masking length consistent.
- Compatibility tests: same input + same policy → identical stored outputs across versions.
- Performance tests: throughput and P95 latency on mixed payloads.
Example¶
Input (hints)
{
"context": { "ip": "2001:db8::1", "userAgent": "Mozilla/5.0 ..." },
"before": { "fields": { "email": "Alice@Example.com", "apiKey": "sk_live_123" } },
"after": { "fields": { "email": "Alice@Example.com", "apiKey": "sk_live_123" } },
"hints": { "before.fields.email": "PERSONAL", "after.fields.apiKey": "CREDENTIAL" }
}
Stored (policy v7)
{
"context": { "ip": "2001:db8::1", "userAgent": "Mozilla/5..." },
"before": { "fields": { "email": "HASH:sha256:7b1d...", "apiKey": null } },
"after": { "fields": { "email": "HASH:sha256:7b1d...", "apiKey": null } },
"policyVersion": 7
}
Read (elevated) shows the same for PERSONAL (hash) and CREDENTIAL (dropped), but may show a longer UA slice (per override).
Failure Modes & Safeguards¶
- Rule missing: default to the stronger class (`SENSITIVE` → `MASK`); never `NONE`.
- Oversized values: fall back to `HASH` or `DROP` to avoid memory pressure.
- Unsupported encodings: normalize or `DROP` with a diagnostic code (no value in logs).
- Policy change mid-batch: lock to the policy version at batch start (store `policyVersion` per record).
Admin & DSL (optional convenience)¶
Define field rules via a compact DSL in admin UI:
INTERNAL: context.userAgent -> MASK(showFirst=12)
PERSONAL: before.fields.email -> HASH(algo=SHA256)
CREDENTIAL: after.fields.apiKey -> DROP
PHI: after.fields.diagnosisCode -> TOKENIZE(ns=phi)
- The compiler translates DSL → `overridesByField` + `rulesByClass`.
- Validation prevents weakening `CREDENTIAL`/`PHI`.
Security Notes¶
- Tenant salt sourced from HSM/Key Vault; rotated; hash results are stable per version.
- Tokenization vault isolated; resolutions require separate privileged service with strong auditing.
- Hash outputs and tokens are treated as sensitive-but-safe; still excluded from logs unless necessary (and then truncated).
Integration Points¶
- Ingest adapters (HTTP/gRPC/Bus/Orleans) call `IClassificationPipeline.ProcessAsync` before persistence.
- The query layer requests a read profile and applies `IReadFilter` to the stored payload.
- Event payloads use minimized fields; never include raw sensitive values.
Metrics (quick list)¶
- `classification_decisions_total{tenant,class}`
- `redaction_rules_applied_total{tenant,kind}`
- `redaction_write_duration_ms` (histogram)
- `redaction_read_duration_ms` (histogram)
- Alerts on any `DROP`/`CREDENTIAL` occurrences > 0 for fields not in the allowlist (configuration drift).
Access Control & Purpose Limiting¶
Purpose¶
Enforce least privilege and purpose-limited access to audit data and admin capabilities. Admin actions (e.g., exports, legal holds) require higher scopes, MFA, and optionally dual-control (“four eyes”). Policy evaluation plugs into the platform’s RBAC/ABAC engine and uses the existing Access Type = Audit category in the platform catalog.
Control Points¶
- **PEP @ Gateway**
  - Verifies OAuth2/OIDC token, MFA signals, tenant binding, scopes.
  - Requires the `Tenant-Id` header and enforces no cross-tenant access unless break-glass.
  - Accepts `X-Purpose` and `X-Redaction-Level` headers; validates they match scopes.
- **PEP @ Service Boundary**
  - Re-evaluates RBAC/ABAC against the `Audit` access type.
  - Fetches the policy decision from the PDP (cached); logs minimal decision info.
  - Stamps the effective purpose, scopes, and decision into a minimal audit entry (“audit of the auditor”).
- **PDP (Policy Decision Point)**
  - Evaluates RBAC (role/entitlement) and ABAC (attributes: tenant, category, time, purpose, action prefix, resource).
  - Supports dual-control requirements and “break-glass” overlays.
Token & Request Requirements¶
- **Claims** (must be present/validated):
  - `sub`/`client_id` — caller identity
  - `tenant_id` — must match the `Tenant-Id` header
  - `scope` — action-specific (see matrix)
  - `amr` or `acr` — indicates MFA (e.g., `amr` contains `otp`/`hwk`) for admin/elevated actions
  - `auth_time` — recent sign-in check for sensitive ops (≤ N minutes)
  - `jti` — for replay protection (logged)
  - optional `cnf` — key-binding for stronger tokens
- **Headers**
  - `Tenant-Id: <id>` — required
  - `X-Purpose: <reason>` — required for elevated reads/admin (e.g., `security-investigation:INC-12345`)
  - `X-Redaction-Level: default|elevated` — `elevated` requires `audit.read.elevated`
Scope & Capability Matrix (Access Type = Audit)¶
| Capability | Endpoint(s) | Required Scope(s) | Extras |
|---|---|---|---|
| Timeline (redacted) | `GET /audit/timeline` | `audit.read.timeline` | ABAC: tenant match; optional category filter |
| Decision log | `GET /audit/decision-log` | `audit.read.decisions` | ABAC: tenant match; outcome filter |
| Elevated read | any read with `X-Redaction-Level: elevated` | `audit.read.elevated` + one of the read scopes | MFA + `X-Purpose` required |
| Start export | `POST /audit/exports` | `audit.export.start` | MFA + `X-Purpose`; optional dual-control |
| Export status/download | `GET /audit/exports/{id}` | `audit.export.read` | Tenant & job ownership check |
| Legal hold apply/release | `/audit/admin/legal-holds*` | `audit.admin.policy` | MFA + `X-Purpose`; optional dual-control |
| Retention policy change | `/audit/admin/retention-*` | `audit.admin.policy` | MFA + `X-Purpose`; compliance locks respected |
| Proof lookup | `GET /audit/proofs*` | `audit.read.proofs` | Tenant check |
| Platform break-glass (cross-tenant) | any | `platform.admin.audit` | Dual-control, time-boxed token, mandatory purpose |
ABAC Inputs (evaluated by PDP)¶
- Subject: user/service identity, roles, groups, scopes, `amr`/`acr`, `auth_time`.
- Resource: `{ tenantId, category, actionPrefix, resourceType, resourceId }`.
- Context: `{ purpose, redactionLevel, requestTime, callerIp, clientApp }`.
- Policy examples:
  - Allow elevated read only if `purpose` starts with `security-` or `compliance-` and `amr` indicates MFA.
  - Deny export if the requested time range exceeds the max window for role `AuditorLite`.
  - Require dual-control for `legalhold.apply` when `category in ('billing','phi')`.

Decisions are cached short-term (e.g., 60s), keyed by `{sub, tenant, capability, purpose}` with purpose in the cache key.
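A minimal sketch of such a short-TTL decision cache; the `DecisionCache` class and its injectable clock are hypothetical, only the key shape comes from the text above:

```javascript
// Illustrative PDP decision cache keyed by {sub, tenant, capability, purpose}.
class DecisionCache {
  constructor(ttlMs = 60_000, now = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now; // injectable clock for testing
    this.map = new Map();
  }
  key({ sub, tenant, capability, purpose }) {
    return `${sub}|${tenant}|${capability}|${purpose}`; // purpose is part of the key
  }
  get(req) {
    const e = this.map.get(this.key(req));
    if (!e || e.expires <= this.now()) return undefined; // miss or expired
    return e.decision;
  }
  set(req, decision) {
    this.map.set(this.key(req), { decision, expires: this.now() + this.ttlMs });
  }
}
```

Because purpose is in the key, the same subject asking for the same capability under a different purpose always goes back to the PDP.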
Dual-Control (optional policy)¶
- When: `StartExport` (elevated), `PlaceLegalHold`, `ReleaseLegalHold`, break-glass.
- Flow:
  1. Initiator calls the endpoint → a PendingApproval is created (`approvalId`, policy, purpose, ttl).
  2. Approver (a different principal) confirms with `approvalId` + MFA.
  3. PEP checks both approvals; PDP returns Permit; the operation proceeds.
sequenceDiagram
participant I as Initiator
participant API as Audit API
participant WF as Approval Service
participant A as Approver
I->>API: StartExport (elevated, X-Purpose)
API->>WF: Create PendingApproval
API-->>I: 202 Accepted (approvalId)
A->>WF: Approve(approvalId) + MFA
WF-->>API: Approved
API->>API: Execute Export
API-->>I: 201 Created (jobId)
- Constraints: the approver must belong to an allowed group, must not equal the initiator, and the approval expires (e.g., after 15 min).
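Those constraints can be sketched as a simple guard (hypothetical names; `expiresAtUtc` is shown as a numeric timestamp for brevity):

```javascript
// Illustrative dual-control check: approver must differ from the initiator,
// belong to an allowed group, and approve before the approval expires.
function canExecute(approval, approver, nowUtc) {
  if (approver.id === approval.initiatorId)
    return { ok: false, reason: 'self-approval' };
  if (!approval.allowedGroups.some(g => approver.groups.includes(g)))
    return { ok: false, reason: 'approver-not-in-group' };
  if (nowUtc > approval.expiresAtUtc)
    return { ok: false, reason: 'approval-expired' };
  return { ok: true };
}
```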
Purpose Limiting¶
- **Mandatory** for elevated/admin via `X-Purpose`.
- **Stored** alongside the request metadata; echoed in the minimal self-audit entries (who/what/why).
- **Purpose catalog** (examples):
  - `security-investigation:<ticket>`
  - `compliance-audit:<period>`
  - `legal-ediscovery:<caseId>`
  - `operational-troubleshooting:<incident>` (not eligible for elevated PII views)
- **Redaction coupling**: `purpose` + scope gate whether `X-Redaction-Level: elevated` is honored.
Break-Glass (cross-tenant)¶
- Requires scope `platform.admin.audit` and dual-control.
- Issues a time-boxed token (≤ 15 min); logs the reason, approver, exact filters used, and row counts returned.
- Responses remain redacted by default; elevated requires explicit approval and is audited separately.
Decision Model (pseudo-policy)¶
# RBAC (entitlements)
roles:
Auditor:
scopes: [audit.read.timeline, audit.read.decisions, audit.read.proofs]
AuditorElevated:
scopes: [audit.read.elevated]
AuditAdmin:
scopes: [audit.admin.policy, audit.export.start, audit.export.read]
PlatformAdmin:
scopes: [platform.admin.audit]
# ABAC (Obligations)
rules:
- when: request.capability == "read:elevated"
require:
- token.scopes includes "audit.read.elevated"
- token.amr contains any ["otp","hwk"]
- request.purpose matches ^(security|compliance|legal)-.*
- when: request.capability == "export:start"
require:
- token.scopes includes "audit.export.start"
- dual_control == true if request.redactionLevel == "elevated"
- when: request.capability in ["legalhold:apply","legalhold:release"]
require:
- token.scopes includes "audit.admin.policy"
- dual_control == true if request.categories intersects ["billing","phi"]
- when: request.crossTenant == true
require:
- token.scopes includes "platform.admin.audit"
- dual_control == true
- token.ttl <= 15m
Failure Responses (problem+json)¶
- `403` Insufficient scope — missing a required scope for the capability.
- `403` MFA required — token lacks a strong `amr` for elevated/admin.
- `403` Purpose required — `X-Purpose` missing or disallowed for the capability.
- `409` Approval pending — dual-control not completed (`approvalId` returned).
- `423` Compliance lock — policy forbids weakening, or the category is under legal restrictions.
All denials emit a minimal self-audit (no PII) with reason codes.
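For illustration, a purpose-required denial might be shaped like this; `type`, `title`, `status`, and `detail` are standard problem+json members, while `reasonCode` and `traceId` are assumed extensions (the URI is a placeholder):

```json
{
  "type": "https://errors.example.com/audit/purpose-required",
  "title": "Purpose required",
  "status": 403,
  "detail": "X-Purpose header is missing or not allowed for this capability.",
  "reasonCode": "PURPOSE_REQUIRED",
  "traceId": "tr-abc"
}
```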
Operational Guidance¶
- Cache PDP decisions (≤ 60s) with tenant & purpose keys; bust cache on role/entitlement updates.
- Enforce freshness for admin actions: require `auth_time` within N minutes (e.g., 15).
- Rotate admin access logs to a secure enclave; restrict visibility.
- Provide an admin dashboard listing active purposes, pending approvals, and recent elevated reads/exports.
Minimal Self-Audit (examples)¶
{ "event":"AuditorAccess.Requested","tenantId":"t-9c8f1","actor":"u-42",
"capability":"read:elevated","purpose":"security-investigation:INC-12345","decision":"allow" }
{ "event":"DualControl.Approved","tenantId":"t-9c8f1","initiator":"u-42",
"approver":"u-77","operation":"StartExport","expiresAtUtc":"2025-09-30T13:05:00Z" }
{ "event":"BreakGlass.Used","actor":"admin-u1","scope":"platform.admin.audit",
"tenants":["t-9c8f1","t-7df2a"],"purpose":"legal-ediscovery:CASE-555","durationSec":600 }
Notes¶
- Access checks are performed before reading payloads; responses default to redacted unless `elevated` passes policy.
- The “Audit” access type in the platform catalog should map these scopes/roles to standard entitlements (e.g., `Audit.Read`, `Audit.Read.Elevated`, `Audit.Export.Manage`, `Audit.Policy.Manage`, `Audit.Platform.BreakGlass`).
- All admin/elevated paths require MFA and are audited with purpose and outcome.
Exports & eDiscovery¶
Purpose¶
Provide a controlled, purpose-limited way to extract tenant-scoped audit data for investigations and regulatory requests. The ExportJob aggregate takes a query snapshot, runs a time-bounded extraction, produces content hashes, delivers artifacts via signed URLs, and emits a manifest that references integrity proofs (segment roots, optional membership proofs). Every export is itself audited with who/when/why.
Aggregate & Lifecycle¶
stateDiagram-v2
[*] --> Requested
Requested --> SnapshotTaken : freeze filters + redactionLevel + policyVersion
SnapshotTaken --> Running : plan shards/chunks
Running --> ChunkReady : stream chunk → blob (repeat)
Running --> Completed : all chunks uploaded + manifest signed
Running --> Failed : error (retriable or terminal)
Requested --> Canceled
Running --> Canceled
Invariants
- Tenant-scoped; no cross-tenant data.
- Snapshot captures: filters, time range, policyVersion, redactionProfile, classification overrides, and purpose.
- Results are immutable once completed; new export required for changes.
- Legal hold/retention rules apply; held records are included (they motivated the export), but purge never occurs as part of export.
Events
`export.started`, `export.chunk_ready`, `export.completed`, `export.failed`.
Query Snapshot¶
Frozen at job start to ensure repeatability:
{
"tenantId": "t-9c8f1",
"range": { "from": "2025-09-01T00:00:00Z", "to": "2025-09-30T23:59:59Z" },
"filters": {
"actor": "u-12345",
"resource": "User:u-12345",
"action": "User.",
"category": "identity",
"decision": "allow"
},
"redactionLevel": "default|elevated",
"policyVersion": 7,
"purpose": "ediscovery:case-12345",
"createdBy": "u-42",
"createdAtUtc": "2025-09-30T12:40:00Z"
}
Manifest (signed)¶
Each completed export publishes a manifest describing the job, artifacts, and verifiability metadata.
{
"type": "audit.export-manifest",
"schemaVersion": 2,
"job": {
"jobId": "exp-01J9Z7E8AVB2M7",
"tenantId": "t-9c8f1",
"state": "completed",
"snapshot": { /* query snapshot from above */ },
"recordCount": 12840,
"format": "jsonl",
"createdAtUtc": "2025-09-30T12:40:00Z",
"completedAtUtc": "2025-09-30T12:55:10Z"
},
"artifacts": [
{
"name": "part-00001.jsonl.gz",
"url": "https://signed.example.com/.../part-00001.jsonl.gz?sig=...",
"expiresAtUtc": "2025-09-30T15:10:00Z",
"bytes": 52428800,
"hash": { "algo": "sha256", "value": "ab12..." }
}
],
"integrity": {
"segments": [
{
"category": "identity",
"segmentId": "seg-2025-09-30-identity-0012",
"rootHash": "f0c4...",
"chainTipHash": "9ab7...",
"hashAlgorithm": "sha256",
"proofUrl": "https://signed.example.com/proofs/seg-...json?sig=..."
}
],
"membershipProofs": "on-demand" // or "embedded"
},
"signing": { "kid": "key-tenant-t-9c8f1-2025-09", "signature": "base64url(...)" }
}
- Signed URLs expire quickly; rehydration requires re-authorization.
- Signing uses per-tenant keys in HSM/Key Vault; `kid` allows offline verification.
- Artifacts are chunked (e.g., 50–200k records per file) to support resumable downloads.
Formats¶
JSONL (canonical, recommended)
One canonical record per line (already post-redaction):
{"recordId":"01J9Z7B...","occurredAtUtc":"2025-09-30T12:34:56Z","category":"identity","action":"User.PasswordChanged","actor":{"type":"user","id":"u-12345"},"resource":{"type":"User","id":"u-12345"},"decision":{"outcome":"allow"},"context":{"ip":"2001:db8::1","userAgent":"Mozilla/5..."},"before":{"fields":{"email":"HASH:sha256:7b1d..."}},"after":{"fields":{"email":"HASH:sha256:7b1d..."}},"policyVersion":7}
CSV (safe subset)
Flattened columns only; nested before/after deltas summarized or excluded depending on policy. Example columns:
recordId,occurredAtUtc,category,action,actorType,actorId,resourceType,resourceId,decisionOutcome,policyVersion
CSV never contains raw PII; any sensitive columns are hashed/masked or omitted per policy.
API & Contracts (code-first)¶
Start export
POST /audit/exports
Authorization: Bearer <token with audit.export.start [+ audit.read.elevated if requested]>
Tenant-Id: t-9c8f1
X-Purpose: ediscovery:case-12345
Content-Type: application/json
Body = snapshot fields; response 202 Accepted with jobId.
Poll status / fetch manifest
Returns state, counts, and manifest URL when completed.
Server interfaces (C#)
public record StartExportCommand(ExportSnapshot Snapshot, RedactionLevel Level);
public interface IExportService {
Task<StartExportResult> StartAsync(StartExportCommand cmd, CancellationToken ct);
Task<ExportStatus> GetAsync(string jobId, CancellationToken ct);
}
Planning & Execution¶
- Shard plan: split query window into time slices (e.g., 1–5 minutes) or row-count slices; parallelize per slice with tenant-safe concurrency.
- Chunking: stream rows → gzip → upload; compute per-chunk `sha256` and append it to the manifest.
- Idempotency: `StartExport` is idempotent on `(tenantId, snapshotHash)` within a TTL to avoid duplicate runs.
- Resumability: on retry, already-uploaded chunk hashes are validated and skipped.
Integrity & Verification¶
- Manifest references the segment proofs covering the exported records.
- Optionally embed membership proofs per record; otherwise provide an endpoint: `GET /audit/proofs/record/{recordId}` → Merkle path + root.
- Offline verification:
  1. Check the manifest signature (`kid`).
  2. Verify each artifact hash.
  3. Verify segment proof signatures and root hashes.
  4. Optionally verify record membership with Merkle paths.
Security & Governance¶
- Purpose-limited: `X-Purpose` required; logged in minimal self-audit entries.
- Scopes: `audit.export.start`/`audit.export.read`; elevated redaction requires `audit.read.elevated` and MFA (see the Access Control cycle).
- Dual-control (optional policy): exports of sensitive categories or elevated views require a second approver.
- Watermarking: the manifest includes tenant, jobId, purpose, and time window; artifacts carry headers with the same metadata.
- Retention: export artifacts have a TTL; manifests remain for audit. Revocation = URL expiry or rotation.
Operational SLOs¶
- Throughput: ≥ 1M records/minute per worker under columnstore projections (environment-dependent).
- Export duration: ≤ 10 min per 10M records (target p95).
- Manifest availability: immediate after last chunk upload (< 60s).
- Backpressure: pause or slow exports if ingest lag > SLO (favor writes).
Observability¶
- Metrics:
  - `audit_export_started_total{tenant}`
  - `audit_export_records_total{tenant}`
  - `audit_export_duration_seconds{tenant,format,redaction}` (histogram)
  - `audit_export_chunk_bytes_total{tenant}`
  - `audit_export_failures_total{reason}`
- Traces: `audit.export.run`, `audit.export.chunk`; attributes include `tenant.id`, `slice`, `chunkIndex`, `records`, `hash`.
- Logs: no PII; include jobId, purpose, counts, durations, and error codes.
Failure Modes & Recovery¶
- Partial upload: verify chunk hash → skip re-upload; otherwise re-stream.
- Timeouts: resume from last successful chunk; keep shard cursors.
- Policy/hold changes mid-run: job executes against frozen snapshot; new export required for new policy.
- Quota exceeded: throttle and surface `429`; the job remains `running` with backoff.
“Audit of the Auditor”¶
Every export produces minimal audit entries:
- `Export.Requested` — actor, purpose, snapshot hash, redaction level.
- `Export.Completed` — jobId, counts, duration, manifest hash, proof IDs.
- `Export.Accessed` — when artifacts or the manifest are downloaded (subject, purpose).
Diagram — End-to-End Export¶
sequenceDiagram
participant U as Auditor (caller)
participant API as Export API
participant Q as Planner/Workers
participant SQL as Read Models
participant B as Blob (Artifacts/Manifest)
participant IL as IntegrityLedger
participant BUS as Service Bus
U->>API: POST /audit/exports (snapshot + purpose)
API-->>BUS: export.started
API->>Q: Plan(shards, chunks)
loop per shard
Q->>SQL: Stream rows (snapshot policyVersion, redacted)
Q->>B: Upload chunk (hash)
Q-->>BUS: export.chunk_ready
end
Q->>IL: Collect segment proofs
Q->>B: Write signed manifest (hashes + proofs)
Q-->>BUS: export.completed
API-->>U: GET /audit/exports/{id} → manifest URL
Admin Knobs¶
- Max range/window per job, max concurrent jobs per tenant, chunk size, URL expiry, inclusion mode for membership proofs (`embedded|on-demand`), and the pause-on-ingest-lag threshold.
Edge Integrations & Webhooks¶
Purpose¶
Expose signed webhooks for high-value audit events (e.g., admin actions, elevated reads, legal holds, exports) to external consumers (SIEM/SOAR/GRC). Preserve tenant isolation, event-first guarantees, and PII minimization. Enforce HMAC signatures, idempotency (Event-Id), replay protection (age limit), and exponential backoff retries. Any inbound third-party callbacks terminate at the gateway/inbox and pass the same dedupe and security checks.
Event Selection (high-value only)¶
Default outbound set (extensible via subscription filters):
- `AuditorAccess.Requested|Granted|Denied`
- `Export.Requested|Completed|Accessed`
- `LegalHold.Applied|Released`
- `Retention.PolicyChanged|PurgeCompleted`
- `BreakGlass.Used`
- `Integrity.SegmentSealed` (metadata only, no leaf data)
Payloads are minimal and redacted by policy; no secrets/PII.
Delivery Semantics¶
- At-least-once delivery with outbox at source, inbox/dedupe at subscriber.
- Ordering per subscriber is best-effort and preserved within a partition key (`tenantId`); do not assume global order.
- Idempotency via `Event-Id` — subscribers must store and skip repeats.
Security¶
- HMAC signature over the exact HTTP body + timestamp using a per-subscription secret (rotatable).
- Headers (all required):
  - `X-Webhook-Id`: subscription id
  - `X-Webhook-Event`: event name (e.g., `export.completed`)
  - `X-Webhook-Tenant`: tenant id
  - `X-Webhook-Timestamp`: UNIX epoch seconds (server time)
  - `X-Webhook-Signature`: `t=<ts>,v1=<hex(HMAC_SHA256(secret, t + "." + body))>`
- Replay protection: receivers must reject if `|now - X-Webhook-Timestamp| > 300s` (configurable).
- mTLS (optional): per-destination certificate pinning for regulated tenants.
- IP allowlist: optional; published egress ranges are documented per environment.
Signature example (C#)
using System.Security.Cryptography;
using System.Text;

static bool Verify(string body, string ts, string sig, byte[] secret)
{
    using var h = new HMACSHA256(secret);
    var computed = h.ComputeHash(Encoding.UTF8.GetBytes($"{ts}.{body}"));
    return CryptographicOperations.FixedTimeEquals(
        Convert.FromHexString(sig), computed);
}
Envelope & Payload¶
Envelope (outer)
{
"eventId": "01J9Z8C6MSSQ7E9KZQW2C4KQ1V",
"name": "export.completed",
"schemaVersion": 1,
"tenantId": "t-9c8f1",
"occurredAtUtc": "2025-09-30T12:55:10Z",
"emittedAtUtc": "2025-09-30T12:55:12Z",
"correlation": { "traceId": "tr-abc", "requestId": "rq-xyz", "producer": "audit-export@2.4.1" },
"payload": { /* minimal per-event data (redacted) */ }
}
Payload examples (minimal, PII-safe)
export.completed
{
"jobId": "exp-01J9Z7E8AVB2M7",
"purpose": "ediscovery:case-12345",
"range": { "from": "2025-09-01T00:00:00Z", "to": "2025-09-30T23:59:59Z" },
"recordCount": 12840,
"manifestUrl": "SIGNED_URL_REDACTED",
"integrityProofId": "proof-01J9Z7KQ..."
}
legalhold.applied
{
"holdId": "lh-01J9Z8...",
"caseId": "CASE-555",
"categories": ["identity","billing"],
"fromUtc": "2025-09-01T00:00:00Z",
"toUtc": "2025-10-01T00:00:00Z"
}
Registration & Management (HTTP)¶
Create subscription
POST /audit/webhooks
Authorization: Bearer <audit.webhooks.manage>
Tenant-Id: t-9c8f1
Content-Type: application/json
Body:
{
"destination": "https://siem.example.com/hooks/audit",
"events": ["export.completed","legalhold.applied","breakglass.used"],
"filters": { "category": ["identity","billing"] },
"secret": "auto-generate", // or supply CMK-wrapped secret
"retries": { "maxAttempts": 12, "maxAgeSeconds": 86400 },
"mTLS": { "enabled": false }
}
Rotate secret
Returns `newSecretId`; the old secret remains valid for a grace window (dual-signing accepted).
Challenge (handshake)
Sends a challenge event; receiver must echo challenge token and verify signature on their side.
Pause/Resume
Retry & Backoff¶
- Exponential with jitter: 1s → 2s → 4s … up to max backoff (e.g., 5 min).
- Max attempts per event (default 12) and max age (e.g., 24h).
- Terminal failure: move to per-subscription DLQ with last error; subscription may auto-pause after sustained failures.
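The schedule above can be sketched with full jitter and the two give-up conditions (hypothetical helper names; the defaults mirror the bullets above):

```javascript
// Illustrative delivery backoff: exponential with full jitter, capped.
function backoffDelayMs(attempt, { baseMs = 1000, capMs = 300_000, rand = Math.random } = {}) {
  const exp = Math.min(capMs, baseMs * 2 ** (attempt - 1)); // 1s, 2s, 4s, ... capped at 5 min
  return Math.floor(rand() * exp); // full jitter in [0, exp)
}

// Stop retrying after max attempts (default 12) or max age (default 24h).
function shouldGiveUp(attempt, firstAttemptAtMs, nowMs,
                      { maxAttempts = 12, maxAgeMs = 86_400_000 } = {}) {
  return attempt > maxAttempts || nowMs - firstAttemptAtMs > maxAgeMs;
}
```

Full jitter spreads retries out so a briefly unavailable receiver is not hit by synchronized waves of redelivery.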
Receiver responsibilities
- Return `2xx` on success only after a durable write to their store.
- Use `Event-Id` for idempotent processing.
- Support HMAC verification and the age check before processing.
Delivery Flow¶
sequenceDiagram
participant SRC as Audit Service (Outbox)
participant PUB as Webhook Publisher
participant EXT as External Receiver
SRC->>PUB: Outbox row (event)
PUB->>PUB: Build JSON body (minimal)
PUB->>PUB: Sign body (HMAC) + set timestamp
PUB->>EXT: POST /hook (headers: X-Webhook-*)
alt 2xx
EXT-->>PUB: OK
PUB->>PUB: Mark delivered
else non-2xx/timeout
PUB->>PUB: Retry with backoff+jitter
opt exceeded
PUB->>PUB: DLQ + auto-pause (optional)
end
end
Filtering & Throttling¶
- Per-subscription filters on `event`, `category`, and an optional `action` prefix.
- Rate caps per subscription (tokens/sec, burst). Over-cap events are queued; if the backlog exceeds a threshold, the subscription is paused and an operator is alerted.
Inbound Callbacks (if any)¶
- All inbound third-party callbacks (e.g., challenge responses, approval webhooks in dual-control flows) terminate at the API Gateway and enter the Inbox:
  - mTLS (optional) + OAuth2 (client credentials) or a signed challenge.
  - Dedupe via `X-Callback-Id` (same semantics as `Event-Id`).
  - Age limit enforced; stale requests are dropped.
  - Minimal payloads; no secrets in logs; results are audited.
Observability¶
- Metrics:
  - `webhook_delivered_total{tenant,subscription}`
  - `webhook_retry_total{tenant,subscription,reason}`
  - `webhook_delivery_latency_seconds{tenant}` (histogram)
  - `webhook_dropped_total{tenant,reason=age|sig|410}`
  - `webhook_active_subscriptions{tenant}`
- Logs (no PII): subscription id, event name, attempt, latency, status code.
- Traces: `audit.webhook.publish` with tags `tenant.id`, `subscription.id`, `event.name`, `attempt`, `age`.
Failure Modes & Policies¶
- `410 Gone` from the receiver: immediately pause the subscription.
- `401`/`403`: treat as terminal until operator action (likely secret/mTLS drift).
- Signature mismatch: drop and alert (security signal).
- Age limit exceeded: drop silently to avoid storming receivers with stale events.
Tenant Isolation & Governance¶
- Subscriptions are tenant-scoped (`Tenant-Id` required to manage); no cross-tenant webhooks.
- Break-glass (platform-admin) cannot create cross-tenant webhooks; it can only read events across tenants with dual-control (per Access Control).
- All management operations (`create`, `rotate`, `pause`, `delete`) are audited with who/when/why.
Receiver Quickstart (verification)¶
Node.js
const crypto = require('crypto');
function verify({ body, ts, sig, secret }) {
  const mac = crypto.createHmac('sha256', Buffer.from(secret, 'utf8'))
    .update(`${ts}.${body}`).digest('hex');
  const a = Buffer.from(sig, 'hex');
  const b = Buffer.from(mac, 'hex');
  // timingSafeEqual throws on length mismatch, so guard the lengths first.
  return a.length === b.length && crypto.timingSafeEqual(a, b);
}
Recommended receiver steps
- Check `X-Webhook-Timestamp` age ≤ window.
- Recompute the `v1` HMAC; compare in constant time.
- Ensure `Event-Id` is unseen → process → persist → respond `2xx`.
- Store a minimal receipt (eventId, ts, outcome) for your own audit.
Admin Knobs¶
- Age window (default 300s), max attempts (default 12), max backlog per subscription, rate caps, mTLS toggle, secret rotation grace (dual-sign), per-tenant event filters, and pause-on-failure thresholds.
Compliance Profiles¶
Purpose¶
Provide configurable overlays for common frameworks—SOC 2, GDPR, HIPAA—that translate regulatory intent into concrete defaults for this bounded context: retention windows, PII handling presets, consent & legal basis checks, and breach-notification hooks. Profiles are tenant-scoped, versioned, and map to our ClassificationPolicy, RetentionPolicy, LegalHold, ExportJob, and Edge Webhooks without changing producer contracts.
Guidance only, not legal advice. Profiles encode operational defaults and guardrails; tenants can harden, never weaken.
Model & Scope¶
flowchart LR
CP[Compliance Profile] --> RP[RetentionPolicy]
CP --> CL[ClassificationPolicy]
CP --> AC[Access Control & Purpose]
CP --> EX[ExportJob Defaults]
CP --> WH["Webhooks (Breach/Reg)"]
CP --> INT[IntegrityLedger constraints]
CP --> LOG[Observability Guardrails]
- Attachment: per-tenant (`tenantId`) with optional category overrides (e.g., `billing`, `phi`).
- Layering: multiple profiles can apply; the stricter rule wins (minimize exposure, maximize retention where required).
- Effective window: `effectiveFromUtc` + forward-only versioning; profile changes never unmask data already transformed.
Configuration Schema (admin)¶
profileId: "gdpr-defaults"
version: 3
effectiveFromUtc: "2025-10-01T00:00:00Z"
jurisdictions: ["EU","EEA","UK"]
appliesTo:
categories: ["identity","billing"]
actors: ["user","service"]
defaults:
retentionDays:
identity: 365
billing: 2555 # example; tenant may harden
redaction:
PERSONAL: { kind: HASH, params: { algo: SHA256 } }
SENSITIVE: { kind: MASK, params: { showLast: 2 } }
CREDENTIAL:{ kind: DROP }
PHI: { kind: TOKENIZE, params: { ns: "phi" } }
consent:
requirePurposeHeader: true
legalBases: ["consent","contract","legal_obligation"] # allowed for elevated reads/exports
export:
format: "jsonl"
requireManifestSignature: true
includeMembershipProofs: "on-demand"
breach:
notify:
webhookId: "sec-ops-1"
ageSeconds: 60
subjects: ["export.failed","breakglass.used","integrity.segment_sealed"] # extend with detectors
observability:
blockPIIInLogs: true
redactUserAgentTo: 64
residency:
restrictCrossRegionExports: true
allowedRegions: ["EU"]
Policy Mapping (how a profile shapes the BC)¶
| Control | Effect |
|---|---|
| RetentionPolicy | Sets per-category retention windows & mode (PURGE/ARCHIVE_THEN_PURGE), enforces baselines; triggers retention.policy_changed. |
| ClassificationPolicy | Seeds rulesByClass and field overrides; ensures CREDENTIAL → DROP, PERSONAL → HASH, PHI → TOKENIZE. |
| Access Control & Purpose | Requires X-Purpose and permitted legal basis for elevated reads/exports; MFA flags enforced for admin actions. |
| ExportJob | Defaults to JSONL, signed manifest, region checks, and proof linkage; dual-control optional by profile. |
| Edge Webhooks | Enables signed notifications to Security/Privacy teams for defined events (breach candidates, break-glass, export completion). |
| IntegrityLedger | Minimum seal cadence (e.g., ≤5m) and mandatory proof anchoring (optional) per regulated profiles. |
| Observability | Enforces PII-free logs, truncation limits, and blocks unsafe diagnostic fields. |
Profile Presets (examples)¶
Values are illustrative and must be validated by your compliance team.
| Profile | Retention Defaults | PII Presets | Consent & Purpose | Exports | Other Overlays |
|---|---|---|---|---|---|
| SOC 2 | identity: 365d; billing: 2555d | `PERSONAL → HASH`, `CREDENTIAL → DROP` | Purpose header recommended; MFA for admin | JSONL + signed manifest | Proof seal ≤ 10m; basic breach hooks |
| GDPR | identity: 365d (minimize); billing: policy-driven | `PERSONAL → HASH`, `PHI → TOKENIZE`, `CREDENTIAL → DROP` | Purpose required; legal-basis allowlist; deny elevated if basis missing | JSONL; region-restricted; DSAR helpers | Data residency restricted; erasure requests honored via retention/tombstone |
| HIPAA | phi: 2555d (policy-driven) | `PHI → TOKENIZE` (vault), `PERSONAL → HASH` | Purpose limited to treatment/payment/operations; MFA mandatory | JSONL; membership proofs on-demand; dual-control for exports | Seal cadence ≤ 5m; additional access-logging flags |
Runtime Enforcement¶
- Write path: classification/redaction uses the profile-backed ClassificationPolicy vN; secrets are dropped before persist.
- Read path: default redacted views; elevated requires audit.read.elevated, MFA, X-Purpose, and a legal basis in the profile’s allowlist.
- Export: snapshot validated against residency & purpose; artifacts signed; proofs included/referenced.
- Retention: scheduler uses profile windows; LegalHold always overrides.
- Breach hooks: the profile subscribes specific events to webhooks; receivers verify HMAC and age.
Consent & Legal Basis¶
- Integrates with the platform PDP (RBAC/ABAC). The profile declares the legal bases allowed for Audit access.
- Requests must carry X-Purpose (e.g., legal-ediscovery:CASE-555) and, when applicable, the legal-basis claim in the token or a header (e.g., X-Legal-Basis: consent).
- Elevated operations lacking a permitted basis are denied; a minimal denied audit entry is emitted.
DSAR / Subject Rights (GDPR-focused helpers)¶
- Discovery: GET /audit/timeline with actor/resource filters + X-Purpose: gdpr-dsar:<ticket>.
- Packaging: POST /audit/exports with the DSAR preset (profile-defined minimal fields).
- Erasure: implemented via retention/tombstone—never rewrites; the profile may mark categories eligible for REDACT_TOMBSTONE mode.
Breach Notification Hooks¶
- The profile lists trigger events and destinations: breakglass.used, export.failed, unusual access:elevated spikes (from metrics), integrity-drift detections.
- The publisher sends HMAC-signed webhooks with an age limit; receivers are expected to open incidents per playbook.
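On the receiving side, the HMAC-plus-age check can be sketched as below. The header layout and the timestamp-prefixed signing string are assumptions for illustration, not the platform's published webhook contract:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Verify an HMAC-signed breach webhook: the delivery must be fresh
// (age limit) and the signature must match the shared secret.
export function verifyWebhook(
  body: string,
  signatureHex: string,
  timestampIso: string,
  secret: string,
  maxAgeSeconds: number,
  nowMs: number = Date.now()
): boolean {
  // Reject stale or future-dated deliveries first (replay protection).
  const ageSeconds = (nowMs - Date.parse(timestampIso)) / 1000;
  if (!Number.isFinite(ageSeconds) || ageSeconds < 0 || ageSeconds > maxAgeSeconds) return false;

  // Recompute the HMAC over timestamp + body so the timestamp is covered too.
  const expected = createHmac("sha256", secret).update(`${timestampIso}.${body}`).digest();
  const given = Buffer.from(signatureHex, "hex");
  return given.length === expected.length && timingSafeEqual(given, expected);
}
```

Using `timingSafeEqual` instead of `===` avoids leaking the signature through timing differences.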
Data Residency & Region Controls¶
- For GDPR/HIPAA profiles, enforce region fences during export and proof access.
- If restrictCrossRegionExports=true, StartExport fails with 403 when the requested delivery region is outside allowedRegions.
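The export-time region fence described above reduces to a small gate; a minimal sketch, assuming the policy shape from the profile YAML:

```typescript
// Residency settings as carried by a compliance profile.
interface ResidencyPolicy {
  restrictCrossRegionExports: boolean;
  allowedRegions: string[];
}

// Returns the HTTP status StartExport would answer with: 202 when the
// delivery region passes the fence, 403 when it falls outside allowedRegions.
export function checkExportRegion(policy: ResidencyPolicy, deliveryRegion: string): 202 | 403 {
  if (policy.restrictCrossRegionExports && !policy.allowedRegions.includes(deliveryRegion)) {
    return 403;
  }
  return 202;
}
```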
Versioning & Change Control¶
- Profiles are forward-only; a new version schedules at effectiveFromUtc.
- A profile change emits compliance.profile_changed (internal) and recomputes policy caches; it does not retroactively unmask historic data.
Evidence & Auditability¶
- Automatic evidence bundle endpoints for audits:
- Current profile JSON (signed), active policy versions, retention windows, webhook subscriptions, and last 30 days of admin/auditor access entries.
- Exportable as a ZIP with hashes and a signed manifest for SOC2/HIPAA audits.
Testing & Validation¶
- Conformance tests per profile:
- Golden redaction fixtures (hash/mask/token rules)
- Residency denial tests
- Export manifest signature presence
- Purpose/legal basis gating for elevated reads
- Drills: simulate webhook delivery failures, DSAR export within SLA, and break-glass usage (must page on use).
Admin APIs¶
PUT /audit/admin/compliance-profiles/{profileId}
Body: { version, effectiveFromUtc, defaults... }
POST /audit/admin/tenants/{tenantId}/attach-profile
Body: { profileId, version, categories? }
GET /audit/admin/compliance-profiles/{profileId}/effective?tenantId=...
- Attaching a profile triggers regeneration of compiled policy caches and emits a minimal audit entry.
Safe Defaults (when no profile attached)¶
- Apply the platform baseline: CREDENTIAL → DROP, PERSONAL → HASH, purpose header optional, exports signed, seal ≤ 10m, PII-free logs, and retention 365d (identity) unless the tenant overrides.
SDKs & Adapters¶
Purpose¶
Provide thin, ergonomic client SDKs (C#, JS/TS) to emit standardized AuditRecords with correlation and idempotency baked in. Supply anti-corruption layers for popular upstreams (identity tokens → Actor, HTTP context → Context) and adapters for HTTP, Service Bus/MassTransit, and Orleans. Conformist downstreams can adopt schemas verbatim for events and read models.
Design Principles¶
- Zero business logic: SDKs package the canonical shapes, headers, and retry/idempotency helpers.
- Server-authoritative: classification/redaction happen server-side; clients may add hints only.
- Portable correlation: propagate W3C traceparent plus domain headers (Trace-Id, Request-Id, Producer, Tenant-Id).
- Idempotent by default: deterministic key helpers; safe retries.
- PII-safe: optional mappers normalize/minimize before send; SDK logs never include payloads.
Package Layout¶
- .NET: ConnectSoft.Audit
  - AuditClient (HTTP / gRPC code-first)
  - MassTransitAuditPublisher (message contract)
  - AspNetCoreAuditMiddleware (actor/context mappers)
  - Idempotency & Correlation utilities
- JS/TS: @connectsoft/audit
  - AuditClient (HTTP)
  - expressAudit() middleware
  - correlation() helpers, idempotency() helpers
  - Types for AuditRecord, Actor, ResourceRef, etc.
Canonical Types (JS/TS)¶
export type ActorType = "user" | "service" | "job";
export interface Actor {
type: ActorType;
id: string;
display?: string;
roles?: string[];
}
export interface ResourceRef {
type: string;
id: string;
path?: string;
tenantScopedId?: string;
}
export type DecisionOutcome = "allow" | "deny" | "na";
export interface Decision { outcome: DecisionOutcome; reason?: string; attributes?: Record<string,string>; }
export interface Context { ip?: string; userAgent?: string; clientApp?: string; location?: string; }
export interface Delta { fields?: Record<string, unknown>; hints?: Record<string, { rule: string }>; }
export interface Correlation { traceId: string; requestId: string; causationId?: string; producer: string; }
export interface AuditRecord {
id?: string;
tenantId: string;
occurredAtUtc: string;
actor: Actor;
action: string;
resource: ResourceRef;
decision?: Decision;
context?: Context;
before?: Delta;
after?: Delta;
classes?: string[];
correlation: Correlation;
idempotencyKey: string;
integrity?: { segmentId?: string; hash?: string } | null;
}
The C# SDK mirrors these types as records with nullable optionals.
Correlation & Headers¶
SDKs set/propagate:
- Tenant-Id (required)
- Idempotency-Key (required for online)
- Trace-Id, Request-Id (or W3C traceparent)
- Producer (service@version)
Generation rules
- traceId: reuse the ambient trace; else ULID/UUIDv7
- requestId: from the framework request; else ULID/UUIDv7
- producer: <serviceName>@<semver> from app config
Idempotency Helpers¶
- Idempotency.Key.forAction(tenantId, resourceId, action, occurredAtUtc[, extra]) → deterministic key
- Fallback: Idempotency.Key.fromContent(record) → canonical JSON → SHA-256
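The two helpers can be sketched as free functions; a minimal version, assuming SHA-256 over pipe-joined parts for the action key (the SDK's exact key material is an implementation detail):

```typescript
import { createHash } from "node:crypto";

// Deterministic idempotency key: the same action on the same resource at the
// same instant always yields the same key, so retries are safe.
export function forAction(
  tenantId: string, resourceId: string, action: string, occurredAtUtc: string, extra?: string
): string {
  const material = [tenantId, resourceId, action, occurredAtUtc, extra ?? ""].join("|");
  return createHash("sha256").update(material).digest("hex");
}

// Content-derived fallback: canonical JSON (top-level keys sorted, for brevity)
// hashed with SHA-256.
export function fromContent(record: Record<string, unknown>): string {
  const canonical = JSON.stringify(record, Object.keys(record).sort());
  return createHash("sha256").update(canonical).digest("hex");
}
```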
Anti-Corruption Mappers¶
Identity token → Actor
- Prefer sub as actor.id
- Map azp/client_id → service actors
- Map roles from roles/groups claims
- If email is present, do not embed the raw email; the SDK can attach a classification hint (PERSONAL) or hash it locally (optional)
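A simplified sketch of that mapping (claim names vary by identity provider; the shapes here are illustrative and deliberately omit email handling):

```typescript
// Hypothetical JWT claim subset; real tokens differ per IdP.
interface TokenClaims {
  sub?: string;
  azp?: string;
  client_id?: string;
  roles?: string[];
  groups?: string[];
}

type Actor = { type: "user" | "service" | "job"; id: string; roles?: string[] };

// Map identity-token claims to a canonical Actor: prefer `sub` for users,
// fall back to azp/client_id for service principals. Raw emails never flow in.
export function actorFromClaims(claims: TokenClaims): Actor {
  const roles = claims.roles ?? claims.groups;
  if (claims.sub) return { type: "user", id: claims.sub, roles };
  const svc = claims.azp ?? claims.client_id;
  if (svc) return { type: "service", id: svc, roles };
  throw new Error("token carries neither sub nor a client identity");
}
```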
HTTP context → Context
- ip: from X-Forwarded-For (first hop) or the connection remote IP
- userAgent: trimmed to a max length (e.g., 128)
- clientApp: from a custom header (e.g., X-Client-App)
ASP.NET Core
app.UseMiddleware<AspNetCoreAuditMiddleware>(new AuditMiddlewareOptions {
Actor = ctx => ActorFromJwt(ctx.User),
Context = ctx => new Context { ip = ctx.GetClientIp(), userAgent = ctx.GetUserAgent(), clientApp = "Portal" }
});
Express (Node)
import { expressAudit } from "@connectsoft/audit/mw";
app.use(expressAudit({
actor: req => ({ type: "user", id: req.user.sub, roles: req.user.roles }),
context: req => ({ ip: req.ip, userAgent: req.headers["user-agent"] as string, clientApp: "Portal" })
}));
HTTP Client (C#)¶
var client = new AuditClient(new AuditClientOptions {
BaseAddress = new Uri("https://api.example.com"),
Producer = "iam-service@1.12.3",
HttpClient = httpClient
});
var now = DateTimeOffset.UtcNow;
var record = new AuditRecord {
tenantId = "t-9c8f1",
occurredAtUtc = now.ToString("O"),
actor = new Actor(ActorType.User, "u-12345") { roles = new[] { "admin" } },
action = "User.PasswordChanged",
resource = new ResourceRef("User", "u-12345"),
correlation = Correlation.CurrentOrNew("iam-service@1.12.3"),
idempotencyKey = Idempotency.ForAction("t-9c8f1","u-12345","User.PasswordChanged", now)
};
await client.AppendAsync(record, new AppendOptions {
ClassificationHints = new Dictionary<string,string> {
["before.fields.email"] = "PERSONAL",
["after.fields.token"] = "CREDENTIAL"
}
}, ct);
Retry & error handling
- Retries with exponential backoff + jitter on 429/5xx.
- 200 OK with status=duplicate is treated as success (surfacing the original recordId).
- 409 Conflict for semantic duplicates → raise IdempotencyConflictException.
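A minimal sketch of the backoff and retry-classification rules above (helper names are illustrative, not the SDK's real internals; full jitter is one common choice):

```typescript
// Exponential backoff with full jitter: the ceiling doubles per attempt up
// to a cap, then a random fraction of it is taken so concurrent retries
// spread out instead of stampeding.
export function backoffMs(
  attempt: number,              // 0-based retry attempt
  baseMs = 200,
  capMs = 30_000,
  random: () => number = Math.random // injectable for testing
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Retry only on throttling and server errors; 409 must surface to the caller
// as an idempotency conflict, never be retried.
export function isRetryable(status: number): boolean {
  return status === 429 || (status >= 500 && status < 600);
}
```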
HTTP Client (JS/TS)¶
import { AuditClient, idempotency, correlation } from "@connectsoft/audit";
const ac = new AuditClient({
baseUrl: "https://api.example.com",
producer: "frontend@3.4.0",
getToken: async () => `Bearer ${await auth.getAccessToken()}`
});
const now = new Date().toISOString();
await ac.append({
tenantId: "t-9c8f1",
occurredAtUtc: now,
actor: { type: "user", id: auth.userId, roles: auth.roles },
action: "Session.Login",
resource: { type: "User", id: auth.userId },
correlation: correlation.currentOrNew("frontend@3.4.0"),
idempotencyKey: idempotency.forAction("t-9c8f1", auth.userId, "Session.Login", now),
context: { ip: undefined, userAgent: navigator.userAgent, clientApp: "Web" }
}, {
classificationHints: { "after.fields.email": "PERSONAL" }
});
MassTransit & Orleans Adapters¶
MassTransit producer
await _bus.Publish<IngestAuditRecord>(new {
TenantId = "t-9c8f1",
IdempotencyKey = Idempotency.ForAction(...),
Record = record,
ClassificationHints = new Dictionary<string,string> { /* optional */ },
Correlation = CorrelationMeta.Current()
});
Orleans
var grain = _client.GetGrain<IAuditIngestGrain>("t-9c8f1");
await grain.Append(new AppendAuditRecordCommand(record, "t-9c8f1", record.idempotencyKey));
Both adapters reuse the same canonical contracts and correlation model.
Classification Hints (optional)¶
SDKs allow attaching hints by path:
Server remains authoritative; hints merely guide classification.
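For illustration, hints map dotted paths into the record's deltas to a suggested class (the paths and field names here are samples, not a fixed contract):

```typescript
// Sample hint map: dotted path into before/after deltas → suggested class.
// The server's ClassificationPolicy remains authoritative and may override.
const classificationHints: Record<string, string> = {
  "before.fields.email": "PERSONAL",
  "after.fields.token": "CREDENTIAL",
  "after.fields.ssn": "SENSITIVE",
};

export function hintFor(path: string): string | undefined {
  return classificationHints[path];
}
```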
Middleware Quick-Emit¶
ASP.NET Core action filter / Express handler helpers to emit common actions (login, access decisions, config changes) with one-liners:
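As a sketch of what such one-liners could look like on the JS/TS side (the helper names and the in-memory sink are assumptions; a real helper would call the SDK's append):

```typescript
// Hypothetical quick-emit helpers layered over the SDK. Here they push into
// an array so the shape is observable; in practice they would append records.
const emitted: string[] = [];
const audit = {
  login: (userId: string) => emitted.push(`Session.Login:${userId}`),
  accessDecision: (resource: string, outcome: "allow" | "deny") =>
    emitted.push(`Access.Decision:${resource}:${outcome}`),
  configChanged: (key: string) => emitted.push(`Config.Changed:${key}`),
};

// Call sites stay one line each:
audit.login("u-12345");
audit.accessDecision("Invoice/inv-9", "deny");
audit.configChanged("feature.exports");
```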
Conformist Downstreams¶
- Events: consumers can adopt DomainEventEnvelope<T> as-is (same names, versions).
- Read models: schemas in audit_rm.* are documented for analytics (avoid cross-tenant access).
- Idempotency: a ConsumerTemplate is provided (offset store + handler skeleton).
Serialization & Canonicalization¶
- JSON serializer configured to:
- Stable property order (for deterministic hashes)
- ISO-8601 UTC for timestamps
- Trim/normalize known fields (emails lowercased, IPv6 compressed)
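The stable-property-order rule can be sketched as a recursive canonicalizer (a minimal version; field normalization such as email lowercasing would run before this step):

```typescript
// Canonical JSON for deterministic hashing: recursively sort object keys so
// two semantically equal records serialize to byte-identical strings.
export function canonicalJson(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(canonicalJson).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const entries = Object.entries(value as Record<string, unknown>)
      .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
      .map(([k, v]) => `${JSON.stringify(k)}:${canonicalJson(v)}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value);
}
```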
Security & Privacy Guardrails¶
- SDK never logs payloads; structured logs carry only status, IDs, and hashed keys.
- Built-in secret scrubbing for accidental sensitive strings in exceptions (e.g., patterns like sk_live_...).
- Large deltas are discouraged on clients; prefer summaries and let the server tokenize as needed.
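The exception scrubbing can be sketched with a small pattern list (the patterns beyond sk_live_ are examples; real deployments maintain a vetted list):

```typescript
// Scrub accidental secrets from exception messages before they reach logs.
const SECRET_PATTERNS: RegExp[] = [
  /sk_live_[A-Za-z0-9]+/g,       // e.g. payment-provider live keys
  /Bearer\s+[A-Za-z0-9._-]+/g,   // bearer tokens copied from headers
];

export function scrubSecrets(message: string): string {
  return SECRET_PATTERNS.reduce((m, re) => m.replace(re, "[REDACTED]"), message);
}
```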
Versioning & Compatibility¶
- SemVer packages; additive first.
- Feature flags allow detection of server capabilities (e.g., supportsBatchHints, supportsBackfill).
- Breaking wire changes publish new HTTP paths (/v2) or message contracts (topic ...v2).
Testing Utilities¶
- Fixture builder for AuditRecord with sensible defaults.
- Golden corpus for redaction expectations (client-side normalizations only).
- Contract tests: spin up a stub gateway validating headers, idempotency behavior, and retry semantics.
Diagram¶
sequenceDiagram
participant APP as App/Service
participant SDK as SDK (C#/TS)
participant GW as API Gateway
participant AUD as Audit Ingest
APP->>SDK: Build AuditRecord (+hints)
SDK->>SDK: Correlation + Idempotency key
SDK->>GW: POST /audit/records (Tenant-Id, Idempotency-Key, Trace-Id)
GW->>AUD: mTLS forward
AUD-->>SDK: 201 Created | 200 Duplicate
SDK-->>APP: AppendResult(recordId)
Quick Reference¶
- Always set Tenant-Id, Idempotency-Key, and correlation.
- Prefer deterministic keys; retries are safe.
- Send hints when easy; server decides redaction.
- Never log payloads; rely on SDK’s structured outcome logs.
- Adopt event/read-model schemas directly for conformist consumers.
Evolution & Versioning¶
Purpose¶
Evolve the Audit & Compliance BC safely without breaking producers/consumers. Favor additive-first changes; when a break is unavoidable, publish a new channel (/v2 HTTP path, ...v2 topic/webhook) and keep mapping layers to bridge old producers. Version exporters/importers independently from APIs and events.
Version Dimensions (what can change)¶
| Axis | Examples | Strategy |
|---|---|---|
| HTTP APIs | /audit/timeline, /audit/exports | Path versioning (/v2); optional media-type version (Accept: application/vnd.connectsoft.audit+json;version=2) |
| Message bus | audit.records.v1 | New topics per major (.v2); keep v1 until migration completes |
| Domain events | audit.record_appended payload shape | Envelope SchemaVersion increments; additive by default; publish both v1 & v2 during migration |
| Read models | audit_rm.* | Blue/green: create audit_rm_v2.*, swap via synonyms/views |
| Webhooks | export.completed | Header X-Webhook-Schema: 1 → 2; or path /hooks/v2/... |
| Storage schema | indexes/columns | Add columns/indexes; avoid type changes; use new tables for big breaks |
| Exporters/Importers | manifest, file layout | Version independently: Exporter@v2, Importer@v3 |
| Policies | ClassificationPolicy, RetentionPolicy | Versioned inside the policy domain; never unmask historic data |
Additive-First Rules¶
- Only add optional fields; keep defaults stable.
- Never rename/remove existing fields; mark deprecated in docs.
- Enums: add new values; consumers must ignore unknowns.
- Semantics: do not change meaning of existing fields—breaking by semantics is a major.
- Unknown ≠ error: all conformist consumers must tolerate unknown fields/headers.
When a Breaking Change Is Required¶
- Structural changes (type swap, field removal, required field added).
- Privacy hardening that reduces visibility (e.g., Decision.reason removed by policy).
- Contract shifts (different pagination model, cursor format).
- Transport framing change.
Action: publish /v2 (HTTP), ...v2 (topics/webhooks); keep /v1 read-only or frozen until decommission.
Channel Versioning Patterns¶
HTTP
- Paths: /v1/audit/timeline → /v2/audit/timeline.
- Headers: responses include X-API-Version: 1|2, Deprecation, Sunset, Link: <url>; rel="deprecation".
- Negotiation (optional): Accept: application/vnd.connectsoft.audit+json;version=2.
Bus/Webhooks
- Topics: audit.records.v2, export.completed.v2.
- Webhooks: add X-Webhook-Schema: 2. Existing endpoints can opt into v2 by re-subscribing.
Code-first gRPC
- New service/contract names (no .proto files): IAuditIngestServiceV2. Keep the V1 interface while migrating.
Mapping Layers (anti-corruption)¶
Bridge older producers/consumers transparently:
flowchart LR
P1[Producer v1] --> AC1[Ingest Mapper v1→v2] --> Core[AuditStream v2]
Core --> AC2[Event Mapper v2→v1] --> C1[Consumer v1]
- Ingest Mapper v1→v2: accepts v1 payloads (headers/fields), emits the canonical v2 AuditRecord.
- Event Mapper v2→v1: for critical consumers stuck on v1, derives best-effort v1 payloads from v2 events (temporary).
Mappers live at the boundary; core domain remains on the newest model.
Exporters & Importers (independent versioning)¶
- Exporter@vN: defines the manifest schema + artifact layout. The manifest includes schemaVersion and exporterVersion.
- Importer@vN: supports a matrix of source versions; validates and maps to the canonical ingest shape.
- Upgrade either without touching API/event versions; record tool versions in manifests and lineage tables.
Deprecation Policy & Timeline¶
- Announce with docs + headers (Deprecation, Sunset) and a dashboard banner.
- Dual-publish: run v1 and v2 in parallel; mirror events to both topics.
- Measure: per-version usage metrics; reach out to stragglers.
- Freeze: stop adding features to v1; security fixes only.
- Sunset: after ≥ 6–12 months (profile-driven), disable writes/reads on v1.
Response headers (HTTP)
Deprecation: true
Sunset: Tue, 30 Sep 2026 00:00:00 GMT
Link: <https://docs.example.com/audit/v1-deprecation>; rel="deprecation"
Capability Discovery¶
- GET /audit/capabilities returns supported API versions, event schemas, formats, and limits.
{
"api": { "versions": ["1","2"], "default": "2" },
"events": { "audit.record_appended": [1,2] },
"exports": { "formats": ["jsonl","csv"], "exporterVersions": [1,2] }
}
SDKs cache this to auto-select the highest compatible version.
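That auto-selection can be sketched as an intersection over the capability document (shapes abridged; the selection policy is an assumption about SDK behavior):

```typescript
// Abridged shape of GET /audit/capabilities.
interface Capabilities {
  api: { versions: string[]; default: string };
}

// Pick the highest API version supported by both the server and this SDK.
export function selectApiVersion(caps: Capabilities, sdkSupports: string[]): string {
  const common = caps.api.versions.filter(v => sdkSupports.includes(v));
  if (common.length === 0) throw new Error("no compatible API version");
  return common.sort((a, b) => Number(b) - Number(a))[0];
}
```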
Contracts (stable envelope)¶
All events keep a stable envelope even when payload evolves:
public sealed record DomainEventEnvelope<TPayload>(
string EventId, string Name, int SchemaVersion, string TenantId,
string AggregateType, string AggregateId, long AggregateSequence,
DateTime OccurredAtUtc, DateTime EmittedAtUtc, CorrelationMeta Correlation,
TPayload Payload);
Only TPayload’s schema version increments; envelope stays compatible.
Storage & Read Model Evolution¶
- Create audit_rm_v2.* tables; project with v2 handlers.
- Swap via database synonyms or API-layer routing.
- Keep backfill & replay tools able to target a specific projection version.
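For example, a replay invocation targeting a projection version might look like this (the subcommand and the --projection-version flag are hypothetical; the actual audit-ops surface may differ):

```
audit-ops projections replay --tenant t-9c8f1 --projection timeline --projection-version v2 --from 2026-01-01
```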
Observability per Version¶
- Tag metrics/logs/traces with api.version, event.schema, exporter.version, importer.version.
- Dashboards: version-adoption curves, error rate by version, throughput by channel/version.
CI/CD & Testing¶
- Contract tests (consumer-driven): publish Pacts for v1 and v2.
- Golden payloads for v1/v2 with expected redactions.
- Canary: route 1–5% to v2 endpoints before full cutover; compare result counts/latencies.
- Fail-fast gates: any semantic divergence on canary blocks rollout.
Examples¶
HTTP v2 path
Bus topic
Exporter manifest bump
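Illustrative values for the three cases above (the identifiers follow the conventions in this section; treat them as samples, not fixed names):

```
# HTTP v2 path
GET /v2/audit/timeline?actor=u-12345

# Bus topic
audit.records.v2

# Exporter manifest bump
{ "schemaVersion": 2, "exporterVersion": "2.0.0" }
```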
Guardrails¶
- Privacy never loosens via versioning—older records remain with their original transforms; mappers do not unmask.
- One-way: v2 readers must accept v1 data (with defaults), but v1 readers must not be fed v2 unless mapped.
- Roll-forward: on failure, prefer rolling forward with a mapper fix rather than reverting core schemas.
Quick Checklist¶
- Additive first; break → new path/topic.
- Keep boundary mappers for legacy until sunset.
- Version exporters/importers separately; record tool versions.
- Publish capability discovery; instrument by version.
- Plan deprecation with Deprecation/Sunset headers and a measured migration.
Operational Runbook & Quality Gates¶
Purpose¶
Codify day-2 operations and release safeguards for the Audit & Compliance BC:
- DLQ remediation + safe replay (idempotent consumers).
- SLOs & alerts focused on lag, export queues, and dedupe health.
- Incident drill for “tamper suspected.”
- Quality gates in CI to block schema-breaking or redaction regressions.
- Platform checklists emphasizing outbox everywhere and idempotency at every boundary.
DLQ Workflow & Safe Replay¶
What lands in DLQ? Poison messages (schema drift, authorization failure, handler bug), age-outs, or repeatedly failing webhooks.
Golden rules
- Never replay straight to handlers that cause side effects without the inbox check.
- Treat DLQ as evidence: do not mutate payloads; annotate with metadata.
flowchart TD
DLQ[(DLQ Queue/Topic)]
INSPECT[Inspect & Classify]
FIX[Fix Config/Code/Secrets]
SANDBOX[Sandbox Reproduce]
REPLAY[Replay via Ingest<br/>or Dedicated Replayer]
SKIP[Quarantine/Skip]
METRICS[Update Runbook Ticket + Metrics]
DLQ --> INSPECT -->|auth/mTLS/secret| FIX
INSPECT -->|schema mismatch| FIX
INSPECT -->|one-off| REPLAY
FIX --> SANDBOX --> REPLAY
INSPECT -->|invalid or malicious| SKIP
REPLAY --> METRICS
SKIP --> METRICS
Procedure
- Triage
  - Pull the last N DLQ entries; group by reason, event.name, tenantId.
  - If signature/age failures appear on webhooks → rotate the secret or widen the clock-skew allowance; do not replay until fixed.
- Root cause
  - Schema: check the event SchemaVersion against the consumer.
  - Auth: verify scopes/tenant headers on the inbound message.
  - Data: verify the classification/redaction plan is present.
- Fix
  - Config drift (secrets, endpoints) → rotate/patch; code bug → hotfix branch.
  - Add a mapping layer if the producer is on v1 while the consumer expects v2.
- Sandbox reproduce
  - Use the Replay Harness against non-prod to prove handler idempotency and the success path.
- Replay
  - Prefer republishing to the original topic with Event-Id intact; the consumer inbox guarantees idempotency.
  - For ingest failures, re-enqueue through the Inbox (same dedupe gate).
- Quarantine
  - Messages confirmed malicious or irreparably invalid → keep in the quarantine store with a reason; mark skipped.
- Close
  - Update the incident ticket with counts, cause, and prevention. Capture playbook deltas if needed.
Command (example)
audit-ops dlq list --consumer integrity-projection --since 1h
audit-ops dlq replay --consumer integrity-projection --filter event=audit.record_appended --max 500
audit-ops dlq quarantine --consumer webhooks --reason "sig_mismatch"
SLOs & Alerts (operations dashboard)¶
| Signal | SLO / Threshold | Alerting | Notes |
|---|---|---|---|
| Ingest lag p95 (produce→record_appended) | ≤ 5s; warn @10s, page @30s | PagerDuty (critical) | KEDA scales on lag |
| Projector lag p95 (timeline) | ≤ 5s; warn @10s, page @20s | PagerDuty | Idempotent rebuild path |
| Export queue age p95 | ≤ 2m; warn @5m, page @10m | Ops channel | Exports pause if ingest lag breached |
| Outbox backlog | ≤ 10k pending/service | Ops channel | Relay health |
| DLQ rate | ~0; page if continuously > 1/min over 5m | PagerDuty | Investigate consumer |
| Dedupe ratio (hits/total) | Baseline per tenant; alert on ±30% drift | Ops channel | Spikes may indicate a producer bug |
| Idempotency conflicts | 0; page if > 10/min | PagerDuty | Semantic conflicts (same key, different payload) |
| Integrity seal duration p95 | ≤ 2s; warn @5s | Ops channel | Hash/Merkle compute health |
| Webhook drop (age/signature) | 0; warn on any | Security channel | Possible clock drift or key issue |
Attach exemplars to lag & duration metrics to deep-link into traces.
Incident Drill — “Tamper Suspected”¶
Scope: An operator or external auditor reports suspected tampering of the audit trail.
sequenceDiagram
participant R as Reporter
participant OC as On-Call
participant IL as IntegrityLedger
participant SQL as SQL/Segments
participant ADLS as Proofs/Manifests
participant SEC as Security/Compliance
R->>OC: Tamper suspected (record/segment)
OC->>IL: Fetch SegmentProof + ChainTip
OC->>SQL: Get stored record(s) (read-only)
OC->>ADLS: Download proof bundle & manifest
OC->>OC: Offline verify membership + root signature
alt mismatch
OC->>SEC: Escalate SEV-1, freeze purge/exports
OC->>IL: Seal current open segments
OC->>OC: Forensic triage (access logs, approvals)
else valid
OC-->>R: Evidence OK (report)
end
Checklist
- Stabilize
- Flip read-only flag for purge/export if warranted (tenant-scoped).
- Trigger immediate segment sealing for tails.
- Evidence capture
- Export SegmentProof (signed), manifest, and membership proof for the record IDs in question.
- Snapshot relevant logs (PII-free) and outbox/inbox offsets.
- Verification (offline)
- Rebuild leaf from stored (redacted) record; recompute hash.
- Fold the Merkle path → root; verify the signature (kid) against the Key Vault JWKS.
- Compare the chainTip sequence for ordering consistency.
- Triage
- If mismatch: SEV-1, rotate relevant signing keys, audit privileged access, confirm no DB writes beyond append-only path (check RLS + triggers).
- Inspect DLQ for related anomalies.
- Communication
- Trigger profile-defined breach hooks (if mandated).
- Record minimal AuditorAccess entries for every verification step.
- Post-mortem
- Root cause, scope, data at risk, timeline, remediations (e.g., stricter sealing cadence, stronger anchoring).
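The offline verification steps (rebuild the leaf, fold the Merkle path, compare against the sealed root) can be sketched as follows; the data shapes are assumptions, and root-signature verification against the Key Vault JWKS is a separate step omitted here:

```typescript
import { createHash } from "node:crypto";

const sha256 = (b: Buffer): Buffer => createHash("sha256").update(b).digest();

// One step of a Merkle audit path: the sibling hash and which side it sits on.
interface PathStep { sibling: Buffer; siblingOnLeft: boolean; }

// Fold a leaf hash up the Merkle path and compare against the sealed root.
export function verifyMembership(leaf: Buffer, path: PathStep[], expectedRoot: Buffer): boolean {
  const root = path.reduce(
    (acc, step) =>
      step.siblingOnLeft
        ? sha256(Buffer.concat([step.sibling, acc]))   // sibling hashes first
        : sha256(Buffer.concat([acc, step.sibling])),  // our node hashes first
    leaf
  );
  return root.equals(expectedRoot);
}
```

Any mismatch at this point is what escalates the drill to SEV-1 in the sequence above.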
Quality Gates (CI/CD)¶
Contract & Compatibility
- OpenAPI/HTTP: run a breaking-change detector (/v2 required on majors); reject renames/removals.
- Event schemas: JSON Schema + version-bump check; ensure additive-first; publish to the schema registry.
- Pact/CDC: consumer-driven contracts for key consumers (projections, webhooks).
Redaction & Privacy
- Golden redaction tests: fixtures assert HASH/MASK/DROP/TOKENIZE outputs per policy version and tenant salt.
- Leak scanner: static & runtime log scans for PII patterns; fail build on findings.
- Event hygiene: linter ensures no sensitive fields enter event payloads.
Idempotency & Outbox
- Arch tests (unit): every command handler must write outbox in same transaction as state.
- Ingest path tests: must call idempotency gate before append; simulate duplicate & semantic conflict.
- Consumer skeleton: requires inbox write before side effects.
Performance & Scale
- Load smoke: N=10k records/min for 3 min; assert p95 ingest lag < 10s, no DLQ growth.
- Seal timing: Merkle compute p95 < threshold on synthetic batch.
Migrations
- Blue/green readiness: migrations create audit_rm_v2.* tables; no in-place destructive ops.
- Backfill targetability: the replay tool can target the v2 projection; the rehearsal job passes.
Release gates
- Block if any:
  - contract break without /v2
  - redaction golden-test diff
  - missing outbox write in a new aggregate handler
  - ingest path bypasses idempotency
  - DLQ rate > baseline after canary (5%)
  - SLO regression vs the last release
Operational Checklists¶
Pre-deploy
- ✅ Migrations applied (no destructive in-place changes).
- ✅ Feature flags default safe; canary slice configured.
- ✅ Outbox relay healthy; schema registry updated.
- ✅ Webhook secrets rotated if expiring; mTLS pins validated.
Post-deploy (30–60 min)
- ✅ Ingest lag p95 within SLO.
- ✅ Outbox backlog stable/declining.
- ✅ DLQ stable at baseline.
- ✅ Dedupe ratio within ±10% of yesterday.
- ✅ Canary v2 metrics match v1 (counts/latency).
On-call quick actions
- 🔧 Scale consumers via KEDA override if lag spikes.
- 🔧 Pause exports if ingest lag pages.
- 🔧 Quarantine toxic subscriptions (webhooks) on repeated 401/403/signature failures.
- 🔧 Trigger replay for known fixed DLQ batches.
Runbook Artifacts & Commands¶
# Lag overview
audit-ops metrics show --tenant t-9c8f1 --since 15m --signals ingest_lag_p95,projector_lag_p95,outbox_backlog
# Force seal (per tenant/category)
audit-ops integrity seal --tenant t-9c8f1 --category identity
# Start tamper verification for record
audit-ops integrity verify --tenant t-9c8f1 --record 01J9Z7B19X4P8ZQ7H6M4V6GQWY --output evidence.zip
# Pause exports (tenant)
audit-ops export pause --tenant t-9c8f1
Governance Notes¶
- Keep all runbook actions audited with minimal self-audit entries (who/what/why).
- Review this playbook quarterly; run game days (DLQ storm, seal failure, webhook outage, tamper drill).
- Ensure every new boundary adheres to outbox everywhere and idempotency—no exceptions.