
Data Architecture - Audit Trail Platform (ATP)

Append-optimized, tamper-evident, tenant-isolated — ATP's data architecture separates hot append storage, warm read models, and cold archives with provable integrity, multi-tenant isolation, and compliance-driven lifecycle management across Azure SQL, Cosmos DB, and Blob Storage (WORM).


📋 Documentation Generation Plan

This document will be generated in 14 cycles. Current progress:

| Cycle | Topics | Estimated Lines | Status |
|-------|--------|-----------------|--------|
| Cycle 1 | Data Architecture Overview & Principles (1-2) | ~2,500 | ⏳ Not Started |
| Cycle 2 | Storage Tiers: Hot/Warm/Cold (3-4) | ~3,000 | ⏳ Not Started |
| Cycle 3 | Multi-Tenant Data Isolation (5-6) | ~2,500 | ⏳ Not Started |
| Cycle 4 | Authoritative Store: AuditRecord Model (7-8) | ~3,000 | ⏳ Not Started |
| Cycle 5 | Event Sourcing & CQRS Patterns (9-10) | ~2,500 | ⏳ Not Started |
| Cycle 6 | Read Models & Projections (11-12) | ~3,000 | ⏳ Not Started |
| Cycle 7 | Azure SQL Implementation (13-14) | ~2,500 | ⏳ Not Started |
| Cycle 8 | Azure Cosmos DB Implementation (15-16) | ~2,500 | ⏳ Not Started |
| Cycle 9 | Azure Blob Storage (WORM) (17-18) | ~3,000 | ⏳ Not Started |
| Cycle 10 | Data Lifecycle & Retention (19-20) | ~2,500 | ⏳ Not Started |
| Cycle 11 | Data Residency & Compliance (21-22) | ~2,500 | ⏳ Not Started |
| Cycle 12 | Backup, Recovery & DR (23-24) | ~2,500 | ⏳ Not Started |
| Cycle 13 | Performance & Optimization (25-26) | ~2,500 | ⏳ Not Started |
| Cycle 14 | Operations & Best Practices (27-30) | ~3,000 | ⏳ Not Started |

Total Estimated Lines: ~37,000


Purpose & Scope

This document defines the complete data layer architecture for the Audit Trail Platform (ATP), covering data models, storage strategies, access patterns, multi-tenant isolation, event sourcing, CQRS projections, lifecycle management, and compliance enforcement across all Azure storage technologies.

Key Architectural Drivers

  • Append-Only Semantics: Immutable audit trail with tamper-evidence
  • Multi-Tenant Isolation: Hard tenant boundaries at storage, query, and compliance layers
  • Event Sourcing & CQRS: Authoritative append store + derived read models
  • Storage Tiering: Hot (write-optimized) → Warm (query-optimized) → Cold (archive)
  • Compliance-Driven: WORM, retention, legal hold, data residency, right-to-erasure
  • Performance & Scale: Partition strategies, indexing, caching, denormalization
  • Provable Integrity: Hash chains, Merkle trees, digital signatures, verification protocols


Detailed Cycle Plan

CYCLE 1: Data Architecture Overview & Principles (~2,500 lines)

Topic 1: Data Architecture Foundations

What will be covered:

  • ATP Data Architecture Vision
    • Append-optimized for audit trail workloads
    • Separation of concerns: write vs. read paths
    • Event-driven data flow
    • Compliance-first design (immutability, auditability)

  • Core Design Principles
  • Append-Only Semantics
    • No updates or deletes (except lifecycle purges)
    • WORM (Write-Once-Read-Many) storage
    • Immutability at multiple layers
  • Event Sourcing
    • Events as source of truth
    • Complete audit history reconstruction
    • Time-travel queries
  • CQRS (Command Query Responsibility Segregation)
    • Separate write and read models
    • Optimized projections for queries
    • Eventually consistent read models
  • Multi-Tenant Isolation
    • TenantId as first-class partition key
    • Row-Level Security (RLS) enforcement
    • Physical and logical isolation strategies
  • Compliance & Privacy

    • Classification-driven access control
    • Retention policies and lifecycle management
    • Data residency enforcement
    • Right-to-erasure (GDPR Article 17)
  • Data Flow Architecture

  • Write path: Gateway → Ingestion → Append Store → Outbox → Event Bus
  • Read path: Query → Read Models (Projections) ← Projection Workers ← Event Bus
  • Integrity path: Integrity Service → Blob Storage (WORM) → Verification
  • Export path: Export Service → Read Models → Cold Storage (packages)
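
The write path's "Append Store → Outbox" step can be sketched as a transactional outbox: the audit record and its outgoing event are committed in one transaction, so `audit.appended` is published if and only if the append succeeded. This is a minimal illustration using SQLite; the table and column names are assumptions, not ATP's actual schema.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE AuditRecords (AuditRecordId TEXT PRIMARY KEY, TenantId TEXT, PayloadJson TEXT)"
)
conn.execute(
    "CREATE TABLE Outbox (MessageId INTEGER PRIMARY KEY AUTOINCREMENT, "
    "Topic TEXT, Body TEXT, Dispatched INTEGER DEFAULT 0)"
)

def append_audit_record(record_id: str, tenant_id: str, payload: dict) -> None:
    """Insert the record and its outbox message atomically: a relay later
    dispatches undelivered Outbox rows to the event bus."""
    with conn:  # sqlite3 context manager = one transaction, commit on success
        conn.execute(
            "INSERT INTO AuditRecords VALUES (?, ?, ?)",
            (record_id, tenant_id, json.dumps(payload)),
        )
        conn.execute(
            "INSERT INTO Outbox (Topic, Body) VALUES (?, ?)",
            ("audit.appended", json.dumps({"auditRecordId": record_id, "tenantId": tenant_id})),
        )

append_audit_record("01HZX5M9QK3TPVN0J8R2W4Y6ZC", "tenant-a", {"action": "appointment.update"})
pending = conn.execute("SELECT Topic FROM Outbox WHERE Dispatched = 0").fetchall()
print(pending)
```

A background dispatcher would poll (or tail) the Outbox table, publish each row to the event bus, and mark it dispatched, giving at-least-once delivery downstream.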

Code Examples: - High-level data flow sequence diagram - TenantContext propagation pattern - Event sourcing command/event example - CQRS write vs. read model comparison

Diagrams: - ATP data architecture overview (layered) - Write vs. Read path separation - Event-driven data flow - Storage tiers topology

Deliverables: - Data architecture vision document - Design principles catalog - Data flow documentation - Architectural decision records (ADRs)


Topic 2: Storage Technologies & Technology Stack

What will be covered:

  • Azure Storage Technologies
  • Azure SQL Database
    • Purpose: Authoritative append store, projections, metadata
    • Features: RLS, temporal tables, columnstore indexes
    • Edition: Business Critical or Hyperscale for scale
  • Azure Cosmos DB
    • Purpose: Optional for global-scale, multi-region writes
    • Partition strategy: /tenantId
    • Consistency: Session or Bounded Staleness
  • Azure Blob Storage
    • Purpose: WORM storage for segments, exports, evidence
    • Features: Immutability policies, legal hold, lifecycle tiers
    • Access: Private endpoint, Managed Identity
  • Azure Cache for Redis (optional)
    • Purpose: Query result caching, session state
    • Eviction: LRU with TTL
  • Azure AI Search (optional)
    • Purpose: Full-text search, faceted queries
    • Indexing: Per-tenant index aliases

  • Technology Stack Rationale
  • Why Azure SQL for append store?
    • ACID guarantees for write path
    • RLS for tenant isolation
    • Proven scalability with partitioning
  • Why Blob Storage for WORM?
    • Native immutability support
    • Cost-effective for large volumes
    • Lifecycle management (hot→cool→archive)
  • When to use Cosmos DB?

    • Global distribution requirements
    • Extreme scale (millions of events/sec)
    • Multi-region write scenarios
  • Hybrid Storage Strategy

  • SQL for metadata and hot index
  • Blob for bulk event payloads
  • Cosmos DB for global tenants (optional)
  • Search for full-text queries (optional)

Code Examples: - Connection string management (Key Vault) - NHibernate configuration for SQL - Cosmos DB client setup - Blob Storage client with Managed Identity

Diagrams: - Storage technology mapping - Hybrid storage topology - Technology decision tree

Deliverables: - Storage technology selection guide - Technology stack documentation - Hybrid strategy design - Connection management patterns


CYCLE 2: Storage Tiers - Hot/Warm/Cold (~3,000 lines)

Topic 3: Hot Tier - Append Store

What will be covered:

  • Hot Tier Purpose & Characteristics
    • Authoritative source of truth
    • Optimized for high-throughput writes
    • Short to medium retention (days to months)
    • Immutable after write
    • Low-latency access for recent data

  • Azure SQL Hot Store Design
  • Schema Design
    • AuditRecords table (authoritative)
    • Clustered columnstore index for write optimization
    • Partitioning by (TenantId, CreatedMonth)
    • Minimal indexes (write performance priority)
  • Row-Level Security (RLS)
    • Predicate function: TenantId = SESSION_CONTEXT('tenant_id')
    • Applied to all queries automatically
    • Block and filter predicates
  • Append-Only Enforcement

    • INSTEAD OF UPDATE/DELETE triggers
    • Deny schema modification permissions
    • Audit log for DML attempts
  • Azure Blob Hot Store Design

  • Container Structure
    • atp-{tenantId}-hot: Append-only segments
    • atp-{tenantId}-hot-manifests: Segment metadata
    • atp-{tenantId}-hot-integrity: Hash chains, signatures
  • WORM Configuration
    • Time-based retention policy (e.g., 90 days)
    • Version-level immutability
    • Compliance mode (vs. governance mode)
  • Segment Strategy

    • Fixed-size segments (e.g., 10MB or 10K events)
    • Segment sealing and hash computation
    • Manifest generation with Merkle root
  • Hot Index (SQL Metadata)

  • Purpose: Fast segment lookup, pointer to blob
  • Schema: (TenantId, SegmentId, BlobUrl, Hash, CreatedAt, RecordCount)
  • Query pattern: Find segments for tenant + time range

  • Write Performance Optimization

  • Batching: Accumulate 100-1000 events before commit
  • Async writes to blob storage
  • Outbox pattern for messaging
  • Connection pooling and prepared statements
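
The segment sealing step above (hash each record, compute a Merkle root, emit a manifest) can be sketched as follows. The leaf/node hash layout and manifest fields here are illustrative assumptions; ATP's actual encoding may differ.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def merkle_root(leaf_hashes: list[str]) -> str:
    """Hash adjacent pairs level by level; an odd leftover node is promoted
    unchanged to the next level (one common convention, assumed here)."""
    level = leaf_hashes
    while len(level) > 1:
        nxt = [
            sha256_hex((level[i] + level[i + 1]).encode())
            for i in range(0, len(level) - 1, 2)
        ]
        if len(level) % 2 == 1:
            nxt.append(level[-1])
        level = nxt
    return level[0]

def seal_segment(records: list[bytes]) -> dict:
    """Seal a segment: per-record hashes plus the Merkle root for the manifest."""
    leaves = [sha256_hex(r) for r in records]
    return {"recordCount": len(records), "merkleRoot": merkle_root(leaves), "leafHashes": leaves}

manifest = seal_segment([b'{"action":"a"}', b'{"action":"b"}', b'{"action":"c"}'])
print(manifest["recordCount"], manifest["merkleRoot"][:8])
```

With the manifest stored in WORM blob storage, a verifier can later recompute leaf hashes from segment contents and confirm they roll up to the signed Merkle root.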

Code Examples: - AuditRecords SQL table DDL with partitioning - RLS policy implementation (T-SQL) - INSTEAD OF trigger for append-only - Blob segment write with WORM configuration - Hot index schema and queries - Batch insert performance pattern

Diagrams: - Hot tier architecture (SQL + Blob) - Segment structure and sealing - Write path data flow - Partitioning strategy visualization

Deliverables: - Hot tier design specification - SQL schema scripts - Blob storage configuration - Performance tuning guide


Topic 4: Warm Tier - Read Models & Projections

What will be covered:

  • Warm Tier Purpose & Characteristics
    • Query-optimized read models
    • Derived from hot tier via projections
    • Medium to long retention (months to years)
    • Denormalized for fast reads
    • Eventually consistent with hot tier

  • Projection Models
  • AuditEvents Projection
    • Flat event stream for list/search
    • Fields: AuditRecordId, TenantId, CreatedAt, Action, ResourceType, ResourceId, ActorId, etc.
    • Indexes: (TenantId, CreatedAt), (TenantId, ResourceType, ResourceId), (TenantId, ActorId)
  • ResourceEvents Projection
    • Per-resource timeline
    • Sequence number for ordered history
    • Fast lookups: "Show me all changes to Order #12345"
  • ActorEvents Projection
    • Per-actor activity log
    • Compliance use case: "What did user X do?"
  • Classification Metadata

    • Pre-computed classification flags
    • Redaction plan references
    • Policy version stamps
  • Projection Build Process

  • Event-driven: Subscribe to audit.appended events
  • Idempotent upserts keyed by (TenantId, AuditRecordId)
  • Watermark tracking: HighWaterRecordId, HighWaterObservedAt
  • Lag monitoring: Alert if lag > 30 seconds
  • Rebuild from scratch: Replay all events from hot tier

  • Projection Storage Options

  • Azure SQL
    • Best for structured queries, JOINs, aggregations
    • RLS for tenant isolation
    • Columnstore for large scans
  • Azure Cosmos DB

    • Best for global distribution, high throughput
    • Partition key: /tenantId
    • Composite indexes: (tenantId, createdAt)
  • Caching Layer (Redis)

  • Cache frequently accessed queries
  • TTL: 5-60 minutes depending on lag tolerance
  • Cache key pattern: query:{tenantId}:{queryHash}
  • Invalidation: On policy changes or manual purge
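
The projection build process above (idempotent upsert keyed by (TenantId, AuditRecordId) plus a watermark checkpoint) can be sketched with SQLite upserts; the schema is a simplified assumption, and ULIDs are relied on to sort lexicographically in record order.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE AuditEvents (TenantId TEXT, AuditRecordId TEXT, Action TEXT, "
    "PRIMARY KEY (TenantId, AuditRecordId))"
)
db.execute("CREATE TABLE ProjectionCheckpoint (Projection TEXT PRIMARY KEY, HighWaterRecordId TEXT)")

def apply_event(tenant_id: str, record_id: str, action: str) -> None:
    """Apply one audit.appended event: redeliveries overwrite the same row,
    and the watermark only ever moves forward."""
    with db:
        db.execute(
            "INSERT OR REPLACE INTO AuditEvents VALUES (?, ?, ?)",
            (tenant_id, record_id, action),
        )
        db.execute(
            "INSERT INTO ProjectionCheckpoint VALUES ('AuditEvents', ?) "
            "ON CONFLICT(Projection) DO UPDATE SET "
            "HighWaterRecordId = MAX(HighWaterRecordId, excluded.HighWaterRecordId)",
            (record_id,),
        )

apply_event("t1", "01B", "order.update")
apply_event("t1", "01B", "order.update")  # redelivery: no duplicate row
apply_event("t1", "01A", "order.create")  # out-of-order replay: watermark holds
count = db.execute("SELECT COUNT(*) FROM AuditEvents").fetchone()[0]
hw = db.execute("SELECT HighWaterRecordId FROM ProjectionCheckpoint").fetchone()[0]
print(count, hw)
```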

Code Examples: - AuditEvents projection schema (SQL) - ResourceEvents projection with sequence numbers - Projection worker consumer pattern (MassTransit) - Idempotent upsert logic - Watermark checkpoint management - Redis caching pattern

Diagrams: - Warm tier architecture - Projection build flow - Read model schemas - Caching strategy

Deliverables: - Warm tier design specification - Projection schemas (SQL and Cosmos DB) - Projection worker implementation guide - Caching patterns


CYCLE 3: Multi-Tenant Data Isolation (~2,500 lines)

Topic 5: Tenant Isolation Patterns

What will be covered:

  • Multi-Tenancy Model
  • Logical Multi-Tenancy (default)
    • Shared database, tenant-partitioned tables
    • TenantId in every row
    • Row-Level Security (RLS) enforcement
    • Cost-effective, operationally simple
  • Physical Multi-Tenancy (optional for large tenants)
    • Dedicated database per tenant
    • Complete isolation, custom SLAs
    • Higher operational complexity

  • TenantId as First-Class Citizen
  • TenantId Requirements
    • Opaque ASCII token: ^[A-Za-z0-9._-]{1,128}$
    • Immutable for tenant lifetime
    • Case-sensitive (treat as opaque)
    • Globally unique within platform
  • Propagation Pattern

    • Extracted from JWT at Gateway
    • Stored in TenantContext object
    • Propagated across all service calls
    • Included in all database queries
    • Present in all events and logs
  • Row-Level Security (RLS) in Azure SQL

  • Implementation
    CREATE FUNCTION dbo.fn_tenantPredicate(@TenantId NVARCHAR(128))
    RETURNS TABLE
    WITH SCHEMABINDING
    AS RETURN
    SELECT 1 AS IsAuthorized
    WHERE @TenantId = CONVERT(NVARCHAR(128), SESSION_CONTEXT(N'tenant_id'));
    
    CREATE SECURITY POLICY dbo.TenantSecurityPolicy
    ADD FILTER PREDICATE dbo.fn_tenantPredicate(TenantId) ON dbo.AuditRecords,
    ADD BLOCK PREDICATE dbo.fn_tenantPredicate(TenantId) ON dbo.AuditRecords
    WITH (STATE = ON);
    
  • Session Context Setup

    • Set SESSION_CONTEXT('tenant_id') on every connection
    • Validate before any query execution
    • Audit RLS bypasses (should never happen)
  • Partitioning Strategies

  • SQL Server Partitioning
    • Partition function: (TenantId, CreatedMonth)
    • Partition scheme: One filegroup per month (optional)
    • Benefits: Pruning, parallel queries, lifecycle management
  • Cosmos DB Partitioning
    • Partition key: /tenantId
    • Hot partition detection and mitigation
    • Composite keys for large tenants: /tenantId-{shard}
  • Blob Storage Partitioning

    • Container per tenant: atp-{tenantId}-{tier}
    • Benefits: Independent legal holds, retention policies
    • Access control: Per-container RBAC
  • Cross-Tenant Query Prevention

  • No JOINs across tenant boundaries
  • Explicit tenant filter in all queries
  • Query analyzer to detect violations
  • Integration tests for isolation
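
The TenantId rules above (opaque, case-sensitive, `^[A-Za-z0-9._-]{1,128}$`, immutable for the request's lifetime) can be sketched as a validated, frozen TenantContext. The class shape is an illustration, not ATP's actual C# implementation.

```python
import re
from dataclasses import dataclass

TENANT_ID_PATTERN = re.compile(r"^[A-Za-z0-9._-]{1,128}$")

@dataclass(frozen=True)  # frozen: the tenant identity cannot be mutated mid-request
class TenantContext:
    tenant_id: str

    def __post_init__(self) -> None:
        # Validate once at construction; everything downstream can trust the value.
        if not TENANT_ID_PATTERN.match(self.tenant_id):
            raise ValueError(f"Invalid TenantId: {self.tenant_id!r}")

ctx = TenantContext("acme-vet_01")
print(ctx.tenant_id)
```

In the real pipeline this object would be built once at the Gateway from the JWT and passed to every repository, which in turn sets `SESSION_CONTEXT('tenant_id')` before issuing queries.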

Code Examples: - TenantContext class definition - RLS policy implementation (complete) - SESSION_CONTEXT setup in connection middleware - SQL partitioning DDL - Cosmos DB partition key configuration - Tenant isolation test cases

Diagrams: - Multi-tenancy model comparison - TenantId propagation flow - RLS enforcement architecture - Partitioning strategies

Deliverables: - Multi-tenancy design guide - RLS implementation scripts - Partitioning strategy document - Tenant isolation testing guide


Topic 6: Data Residency & Regional Isolation

What will be covered:

  • Data Residency Requirements
    • GDPR: EU data must stay in EU
    • CCPA: California resident data (US only)
    • Custom: Per-tenant residency policies
    • Enforcement: At write and read time

  • Regional Topology
  • Separate Accounts per Region
    • US: atp-us-prod (Azure SQL, Blob Storage)
    • EU: atp-eu-prod (separate subscription)
    • IL: atp-il-prod (Israel data center)
  • No Cross-Region Replication (by default)

    • Except explicit tenant opt-in
    • Geo-redundancy within region only (ZRS, GZRS)
  • Tenant → Region Binding

  • Home Region Assignment
    • Set during tenant provisioning
    • Stored in Tenant metadata
    • Immutable (requires migration to change)
  • Allowed Regions

    • Multi-region tenants: List of approved regions
    • Enforcement at Gateway (routing)
    • Enforcement at storage (write rejection)
  • Residency Enforcement

  • Gateway Layer
    • Resolve tenant home region
    • Route request to correct regional cluster
    • Reject if region not allowed
  • Storage Layer
    • Tenant data written only to home region account
    • No cross-region queries
    • Export packages honor residency
  • Compliance Audit

    • Daily scan: Verify all data in correct region
    • Alert on violations
    • Remediation: Move or purge violating data
  • Migration Procedures

  • Tenant region change process
  • Data export from source region
  • Import to target region
  • Verification and cutover
  • Source data purge

Code Examples: - Tenant residency policy model - Region resolution logic - Storage account selection by region - Cross-region prevention validation - Migration workflow (Pseudocode)

Diagrams: - Regional topology (multi-region ATP deployment) - Tenant → Region binding model - Residency enforcement flow - Migration sequence diagram

Deliverables: - Data residency design - Regional deployment architecture - Enforcement implementation guide - Migration runbook


CYCLE 4: Authoritative Store - AuditRecord Model (~3,000 lines)

Topic 7: AuditRecord Canonical Schema

What will be covered:

  • AuditRecord Entity (Authoritative)
    • Core aggregate root in DDD model
    • Single source of truth for audit events
    • Immutable after creation
    • Complete lineage and correlation

  • Field Specification

    public class AuditRecord {
      // Identity
      public string AuditRecordId { get; set; }      // ULID (26 chars)
      public string TenantId { get; set; }           // Opaque tenant identifier
      public string? CorrelationTraceId { get; set; } // W3C trace ID (hex32)
      public string? IdempotencyKey { get; set; }    // Client-provided dedup key
    
      // Timestamps
      public DateTime CreatedAt { get; set; }        // Platform record creation (UTC)
      public DateTime ObservedAt { get; set; }       // Platform observation time (UTC)
      public DateTime? EffectiveAt { get; set; }     // Business effective time (UTC)
    
      // Event Core
      public string Action { get; set; }             // e.g., "appointment.update"
      public string? ResourceType { get; set; }      // e.g., "Appointment"
      public string? ResourceId { get; set; }        // Business entity ID
      public string? ActorId { get; set; }           // Who performed action
      public ActorType ActorType { get; set; }       // User, Service, System
      public string? SubjectId { get; set; }         // Whom action was performed on
      public DecisionOutcome? DecisionOutcome { get; set; } // Allow, Deny, NotApplicable
    
      // Payload
      public string PayloadJson { get; set; }        // Full JSON event payload
      public string? PayloadSchemaUri { get; set; }  // Schema reference
      public int PayloadBytes { get; set; }          // Size in bytes
    
      // Classification & Policy
      public DataClassification Classification { get; set; } // PUBLIC, SENSITIVE, PII, etc.
      public string PolicyVersion { get; set; }      // Policy version stamp
      public string? RetentionPolicyId { get; set; } // Retention rule reference
      public DateTime? RetentionUntil { get; set; }  // Computed retention date
      public bool OnHold { get; set; }               // Legal hold flag
    
      // Integrity
      public string? IntegrityBlockId { get; set; }  // Merkle tree block reference
      public string? ContentHash { get; set; }       // SHA-256 of PayloadJson
    
      // Metadata
      public string? SourceSystem { get; set; }      // Origin system name
      public string? SessionId { get; set; }         // User session ID
      public string? IpAddress { get; set; }         // Client IP (hashed or redacted)
      public string? UserAgent { get; set; }         // Client user agent
    }
    

  • Field Validation Rules

  • TenantId: Required, max 128 chars, regex pattern
  • CreatedAt ≤ ObservedAt ≤ now(): Timestamp ordering
  • Action: Required, lowercase, dot-separated (e.g., entity.verb)
  • ResourceType: PascalCase, namespace-qualified (e.g., Vetspire.Appointment)
  • ActorId: Required for user actions, optional for system events
  • PayloadJson: Valid JSON, max 1MB (configurable)
  • ContentHash: SHA-256, hex-encoded (64 chars)

  • Append-Only Guarantees

  • No UPDATE statements allowed
  • No DELETE statements (except lifecycle purges)
  • Database triggers enforce append-only
  • Violation attempts are audited

  • Idempotency Handling

  • Unique index: (TenantId, IdempotencyKey) (where not null)
  • Duplicate inserts return existing record
  • Client receives 200 OK with idempotent: true flag
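
The idempotency handling above, together with the ContentHash rule (SHA-256 of the canonical payload, hex-encoded), can be sketched with an in-memory store standing in for the unique index on (TenantId, IdempotencyKey); all names here are illustrative.

```python
import hashlib
import json

_store: dict[tuple[str, str], dict] = {}  # stands in for the unique index

def append(tenant_id: str, idempotency_key: str, payload: dict) -> dict:
    """A duplicate (TenantId, IdempotencyKey) returns the existing record with
    idempotent=True instead of inserting a second row."""
    key = (tenant_id, idempotency_key)
    if key in _store:
        return {**_store[key], "idempotent": True}
    # Canonical JSON (sorted keys, no whitespace) so the hash is reproducible.
    payload_json = json.dumps(payload, separators=(",", ":"), sort_keys=True)
    record = {
        "contentHash": hashlib.sha256(payload_json.encode()).hexdigest(),  # 64 hex chars
        "payloadJson": payload_json,
    }
    _store[key] = record
    return {**record, "idempotent": False}

first = append("tenant-a", "req-1", {"action": "order.create"})
second = append("tenant-a", "req-1", {"action": "order.create"})
print(second["idempotent"], first["contentHash"] == second["contentHash"])
```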

Code Examples: - AuditRecord C# class (complete) - SQL DDL for AuditRecords table - Validation logic (FluentValidation) - Idempotency check repository method - Append-only trigger implementation

Diagrams: - AuditRecord entity relationship diagram - Timestamp semantics - Idempotency flow

Deliverables: - AuditRecord schema specification - Field validation rules - SQL DDL scripts - Domain model implementation


Topic 8: Supporting Data Models

What will be covered:

  • PolicyDecision
    • Classification result
    • Retention policy reference
    • Redaction plan (if applicable)
    • Policy version stamp

  • IntegrityBlock
  • Block ID (ULID)
  • Segment references
  • Merkle root hash
  • Digital signature
  • Signing key ID (kid)
  • Sealed timestamp

  • Segment

  • Segment ID (ULID)
  • Tenant ID
  • Blob storage URL
  • Record count
  • Start/End timestamps
  • Content hash (SHA-256)
  • Manifest reference

  • ExportPackage

  • Job ID (ULID)
  • Tenant ID
  • Filter criteria
  • Package format (JSONL, Parquet)
  • Manifest with record list
  • Digital signature
  • Export timestamp

  • ProjectionCheckpoint

  • Projection name (e.g., "AuditEvents")
  • Tenant ID
  • High water mark (last processed AuditRecordId)
  • Checkpoint timestamp
  • Lag metrics (seconds behind hot tier)

  • LegalHold

  • Hold ID (ULID)
  • Tenant ID
  • Reason (e.g., "Litigation #2025-0042")
  • Applied timestamp
  • Released timestamp (nullable)
  • Scope: entire tenant or filtered (date range, resource type)

Code Examples: - PolicyDecision class - IntegrityBlock class - Segment metadata class - ExportPackage manifest structure (JSON) - ProjectionCheckpoint table schema - LegalHold table schema

Diagrams: - Supporting entities ERD - Integrity chain structure - Export package composition

Deliverables: - Supporting data model specifications - SQL/Cosmos DB schemas - Domain model classes - Serialization formats


CYCLE 5: Event Sourcing & CQRS Patterns (~2,500 lines)

Topic 9: Event Sourcing Implementation

What will be covered:

  • Event Sourcing Fundamentals
    • Events as immutable facts
    • Complete audit history
    • Time-travel queries
    • Replay capability
    • Event store as source of truth

  • ATP Event Store Design
  • AuditRecords table IS the event store
    • Each row is an immutable event
    • Ordered by (TenantId, CreatedAt, AuditRecordId)
    • Never updated or deleted (except lifecycle)
  • Event Envelope

    • Correlation IDs for traceability
    • Causation IDs for event chains
    • Schema versioning
    • Metadata (timestamps, actor, etc.)
  • Event Replay

  • Use Cases
    • Rebuild projections after corruption
    • Create new projection types
    • Audit investigations (replay specific tenant/time range)
    • Disaster recovery (replay from backup)
  • Replay Process
    1. Set projection checkpoint to beginning (or specific point)
    2. Stream events from hot tier in order
    3. Apply events to projection (idempotent upsert)
    4. Advance checkpoint periodically
    5. Monitor lag and throughput
  • Replay Performance

    • Batch processing: 1000-10000 events per batch
    • Parallel workers per tenant (up to partition count)
    • Throttling to avoid overwhelming read models
  • Event Versioning

  • Schema evolution strategy
  • Backward compatibility requirements
  • Upcasting old events to new schema
  • Versioned payload deserializers

  • Event Correlation & Causation

  • Correlation ID: Links events in same business transaction
  • Causation ID: Parent event that caused this event
  • Trace ID: W3C-compliant distributed tracing
  • Graph of causality for investigation
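
The five-step replay process above can be sketched as a batched loop that resumes from a checkpoint and applies events idempotently. This relies on the source's note that events are ordered by AuditRecordId (ULIDs sort lexicographically); batch sizes here are small for illustration.

```python
def replay(events, apply_batch, start_after: str = "", batch_size: int = 2) -> str:
    """events: iterable of (audit_record_id, payload) in ascending ID order.
    apply_batch must be idempotent, since a crash can re-deliver a batch."""
    checkpoint = start_after
    batch = []
    for record_id, payload in events:
        if record_id <= checkpoint:  # already applied before the restart
            continue
        batch.append((record_id, payload))
        if len(batch) >= batch_size:
            apply_batch(batch)
            checkpoint = batch[-1][0]  # advance only after the batch lands
            batch = []
    if batch:  # flush the trailing partial batch
        apply_batch(batch)
        checkpoint = batch[-1][0]
    return checkpoint

applied = []
cp = replay([("01A", {}), ("01B", {}), ("01C", {})], applied.extend, start_after="01A")
print(cp, len(applied))  # 01C 2
```

A production worker would persist the returned checkpoint after each batch and add throttling, as noted above, so replay does not overwhelm the read models.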

Code Examples: - Event store query patterns (get events for tenant + time range) - Replay worker implementation (consumer) - Batch processing logic - Idempotent event application - Correlation/causation tracking - Event upcasting example

Diagrams: - Event sourcing architecture - Event replay flow - Correlation graph visualization - Event versioning timeline

Deliverables: - Event sourcing design guide - Replay procedures and runbooks - Event versioning strategy - Correlation implementation


Topic 10: CQRS Implementation

What will be covered:

  • CQRS Pattern in ATP
  • Command Side (Write Path)
    • Commands: AppendAuditRecord, CreateExport, ApplyPolicy
    • Write to authoritative store only
    • Emit events on success
    • Strong consistency within command
  • Query Side (Read Path)
    • Queries: ListEvents, GetResourceHistory, SearchActor
    • Read from optimized projections
    • Eventually consistent with write side
    • No side effects

  • Separation Benefits
  • Independent scaling (write-heavy vs. read-heavy)
  • Optimized data models per use case
  • Flexibility to add new projections
  • Reduced contention (no read locks on writes)

  • Command Handlers

  • Validation and authorization
  • Aggregate load (if needed)
  • Business logic execution
  • Event generation
  • Atomic write + outbox

  • Query Handlers

  • Parameter validation
  • Projection selection
  • Policy-based filtering (classification, redaction)
  • Pagination and cursors
  • Result transformation

  • Eventual Consistency Management

  • Lag Monitoring
    • Track projection lag (seconds behind write)
    • Alert if lag > SLO threshold (e.g., 30s)
    • Dashboard showing lag per projection
  • Read-After-Write Consistency
    • Option 1: Poll projection until record appears
    • Option 2: Return "processing" status, client polls
    • Option 3: Accept eventual consistency (preferred)
  • Consistency Boundary

    • Strong consistency: Within single command/aggregate
    • Eventual consistency: Across aggregates, projections, exports
  • Projection Synchronization

  • Event-driven updates via message bus
  • Idempotent handlers to tolerate replays
  • Checkpoint-based resume after failures
  • Monitoring and alerting on projection health
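
The lag monitoring above reduces to a small calculation: lag is the wall clock minus the observed time of the last record the projection applied, alerted against the SLO threshold (30 seconds in the example above). A minimal sketch:

```python
from datetime import datetime, timedelta, timezone

LAG_SLO = timedelta(seconds=30)  # alert threshold from the SLO above

def projection_lag(high_water_observed_at: datetime, now: datetime) -> timedelta:
    """Seconds the projection trails the write side, derived from its checkpoint."""
    return now - high_water_observed_at

def should_alert(lag: timedelta) -> bool:
    return lag > LAG_SLO

now = datetime(2025, 1, 1, 12, 0, 45, tzinfo=timezone.utc)
checkpoint_time = datetime(2025, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
lag = projection_lag(checkpoint_time, now)
print(lag.total_seconds(), should_alert(lag))  # 45.0 True
```

In practice this runs per projection per tenant, feeding the lag dashboard; an idle tenant needs a heartbeat or last-event comparison so silence is not mistaken for lag.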

Code Examples: - Command handler structure (AppendAuditRecordCommand) - Query handler structure (ListEventsQuery) - CQRS mediator pattern (MediatR) - Projection lag calculation - Read-after-write consistency pattern - Projection synchronization worker

Diagrams: - CQRS architecture overview - Command vs. Query path comparison - Eventual consistency timeline - Projection synchronization flow

Deliverables: - CQRS implementation guide - Command and query patterns - Consistency management strategy - Monitoring and alerting setup


CYCLE 6: Read Models & Projections (~3,000 lines)

Topic 11: AuditEvents Projection

What will be covered:

  • AuditEvents Purpose
    • Flat event stream for list/search/filter operations
    • Most common query: "Show me all events for tenant X"
    • Denormalized for read performance
    • Indexed for fast seeks and scans

  • Schema Design

    CREATE TABLE dbo.AuditEvents (
      TenantId NVARCHAR(128) NOT NULL,
      AuditRecordId CHAR(26) NOT NULL,
      CreatedAt DATETIME2(3) NOT NULL,
      ObservedAt DATETIME2(3) NOT NULL,
      Action VARCHAR(255) NOT NULL,
      ResourceType VARCHAR(255) NULL,
      ResourceId VARCHAR(255) NULL,
      ActorId VARCHAR(255) NULL,
      ActorType TINYINT NOT NULL,
      DecisionOutcome TINYINT NULL,
      ChangedFields NVARCHAR(MAX) NULL, -- JSON array of field names
      DataClassFlags SMALLINT NOT NULL,
      CorrelationTraceId CHAR(32) NOT NULL,
      IntegrityBlockId CHAR(26) NULL,
      PayloadBytes INT NOT NULL,
      PRIMARY KEY (TenantId, AuditRecordId),
      INDEX IX_AuditEvents_Tenant_CreatedAt (TenantId, CreatedAt DESC, AuditRecordId DESC),
      INDEX IX_AuditEvents_Tenant_Resource (TenantId, ResourceType, ResourceId, CreatedAt DESC),
      INDEX IX_AuditEvents_Tenant_Actor (TenantId, ActorId, CreatedAt DESC)
    );
    

  • Projection Build Logic

  • Subscribe to audit.appended events
  • Extract relevant fields from AuditRecord
  • Compute derived fields (ChangedFields, DataClassFlags)
  • Upsert into AuditEvents (idempotent by AuditRecordId)
  • Update checkpoint

  • Query Patterns

  • List all events for tenant (paginated)
  • Filter by date range
  • Filter by action, resource type, actor
  • Search by correlation trace ID
  • Order by CreatedAt DESC (most recent first)

  • Pagination Strategy

  • Cursor-based (not offset-based for stability)
  • Cursor = base64url({CreatedAt}:{AuditRecordId})
  • Query: WHERE (CreatedAt, AuditRecordId) < cursor ORDER BY ... LIMIT N

  • Performance Optimization

  • Covering indexes for common queries
  • Partitioning by (TenantId, CreatedMonth) for pruning
  • Columnstore index for large scans (analytics)
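
The cursor format above, base64url of `{CreatedAt}:{AuditRecordId}`, can be sketched as an encode/decode pair; the decoded tuple feeds the keyset predicate `(CreatedAt, AuditRecordId) < cursor`. Timestamp and ULID values below are illustrative.

```python
import base64

def encode_cursor(created_at: str, audit_record_id: str) -> str:
    raw = f"{created_at}:{audit_record_id}".encode()
    # Strip '=' padding so the cursor is URL-safe without percent-encoding.
    return base64.urlsafe_b64encode(raw).decode().rstrip("=")

def decode_cursor(cursor: str) -> tuple[str, str]:
    padded = cursor + "=" * (-len(cursor) % 4)  # restore base64 padding
    decoded = base64.urlsafe_b64decode(padded).decode()
    # The timestamp itself contains ':', so split on the LAST colon only;
    # the ULID on the right never contains one.
    created_at, record_id = decoded.rsplit(":", 1)
    return created_at, record_id

cursor = encode_cursor("2025-01-01T00:00:00.000Z", "01HZX5M9QK3TPVN0J8R2W4Y6ZC")
print(decode_cursor(cursor))
```

Unlike an offset, this cursor stays stable while new events are appended: the next page is always "strictly older than the last row seen", regardless of inserts at the head.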

Code Examples: - AuditEvents SQL DDL (complete) - Projection worker logic (event handler) - Idempotent upsert pattern - Cursor-based pagination query - Index tuning for specific queries

Diagrams: - AuditEvents schema diagram - Projection build flow - Query execution plan

Deliverables: - AuditEvents projection specification - SQL DDL and indexes - Projection worker implementation - Query pattern examples


Topic 12: Resource & Actor Projections

What will be covered:

  • ResourceEvents Projection
    • Purpose: Per-resource audit timeline
    • Use Case: "Show me all changes to Order #12345"
    • Schema:

CREATE TABLE dbo.ResourceEvents (
  TenantId NVARCHAR(128) NOT NULL,
  ResourceType VARCHAR(255) NOT NULL,
  ResourceId VARCHAR(255) NOT NULL,
  Seq BIGINT NOT NULL, -- Sequence within resource
  AuditRecordId CHAR(26) NOT NULL,
  CreatedAt DATETIME2(3) NOT NULL,
  Action VARCHAR(255) NOT NULL,
  ActorId VARCHAR(255) NULL,
  -- ... other fields
  PRIMARY KEY (TenantId, ResourceType, ResourceId, Seq),
  INDEX IX_ResourceEvents_AuditRecordId (AuditRecordId)
);
  • Sequence Management
    • Seq increments per (TenantId, ResourceType, ResourceId)
    • Ensures ordered timeline per resource
    • Idempotent: If AuditRecordId exists, skip or update Seq

  • ActorEvents Projection
  • Purpose: Per-actor activity log
  • Use Case: "What did user alice@example.com do?"
  • Schema: Similar to ResourceEvents, keyed by (TenantId, ActorId, Seq)
  • Compliance: GDPR access requests, internal investigations

  • Projection Build Logic

  • For each audit.appended event:
    1. Upsert into AuditEvents (all events)
    2. If ResourceType+ResourceId present, upsert into ResourceEvents
    3. If ActorId present, upsert into ActorEvents
  • Handle missing resources gracefully (some events are resource-less)

  • Query Patterns

  • Resource timeline: SELECT * FROM ResourceEvents WHERE TenantId = ? AND ResourceType = ? AND ResourceId = ? ORDER BY Seq
  • Actor activity: SELECT * FROM ActorEvents WHERE TenantId = ? AND ActorId = ? ORDER BY Seq
  • Cross-reference: Use AuditRecordId to join with AuditEvents for full details

  • Performance Considerations

  • Highly selective queries (tenant + resource/actor)
  • Indexes on (TenantId, ResourceType, ResourceId) and (TenantId, ActorId)
  • Potential for hot resources (many updates) → monitoring
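
The fanout above (one `audit.appended` event feeding AuditEvents, ResourceEvents, and ActorEvents, with Seq incremented per resource key) can be sketched with in-memory lists standing in for the three tables; field names follow the schemas above but the storage is illustrative.

```python
from collections import defaultdict

audit_events, resource_events, actor_events = [], [], []
_seq: dict[tuple, int] = defaultdict(int)  # last Seq per (TenantId, ResourceType, ResourceId)

def project(event: dict) -> None:
    """Fan one audit event out to up to three projections."""
    audit_events.append(event)  # every event lands in the flat stream
    if event.get("resourceType") and event.get("resourceId"):
        key = (event["tenantId"], event["resourceType"], event["resourceId"])
        _seq[key] += 1  # ordered timeline per resource
        resource_events.append({**event, "seq": _seq[key]})
    if event.get("actorId"):  # resource-less or system events may lack an actor
        actor_events.append(event)

project({"tenantId": "t1", "resourceType": "Order", "resourceId": "12345", "actorId": "alice"})
project({"tenantId": "t1", "resourceType": "Order", "resourceId": "12345", "actorId": None})
print(len(audit_events), [e["seq"] for e in resource_events], len(actor_events))
```

A durable implementation would also make the Seq assignment idempotent per AuditRecordId, as noted above, so redelivered events do not inflate the timeline.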

Code Examples: - ResourceEvents SQL DDL - ActorEvents SQL DDL - Projection build logic with sequence generation - Resource timeline query - Actor activity query

Diagrams: - ResourceEvents schema and relationships - ActorEvents schema - Projection fanout (1 audit event → 3 projections)

Deliverables: - Resource and Actor projection specifications - SQL DDL scripts - Projection worker implementation - Query examples


CYCLE 7: Azure SQL Implementation (~2,500 lines)

Topic 13: Azure SQL Configuration & Optimization

What will be covered:

  • Azure SQL Database Tier Selection
  • General Purpose
    • Use case: Dev, test, small to medium workloads
    • Storage: Remote storage (Azure Standard SSD)
    • IOPS: Variable based on vCore count
  • Business Critical
    • Use case: Production, low-latency requirements
    • Storage: Local SSD (higher IOPS)
    • Read replicas: Built-in (readable secondaries)
  • Hyperscale
    • Use case: Massive scale (>1TB databases)
    • Storage: Tiered, nearly infinite (100TB+)
    • Rapid scale-up/down

  • Elastic Pool Strategy
  • One elastic pool per environment
  • Multiple tenant databases sharing pool resources
  • DTU or vCore model
  • Cost optimization for multi-tenant workloads

  • High Availability & Disaster Recovery

  • HA: Built-in (automatic failover replicas)
  • Geo-Replication: Active geo-replication for DR
  • Backup: Automated backups (PITR: 7-35 days)
  • Long-Term Retention (LTR): Up to 10 years

  • Performance Tuning

  • Indexing Strategy
    • Clustered columnstore for large tables (append workloads)
    • Nonclustered indexes for query paths
    • Covering indexes for hot queries
    • Index maintenance jobs (rebuild/reorganize)
  • Partitioning
    • Partition by (TenantId, CreatedMonth)
    • Benefits: Pruning, parallel queries, sliding window purges
    • Partition function and scheme setup
  • Statistics
    • Automatic statistics updates
    • Manual statistics for critical queries
  • Query Store

    • Enable query store for performance monitoring
    • Identify regression and optimization opportunities
  • Security Configuration

  • Authentication: Azure AD only (disable SQL auth)
  • Network: Private endpoint, no public access
  • Encryption: TDE (Transparent Data Encryption) always on
  • Auditing: Enable auditing to Log Analytics
  • Advanced Threat Protection: Enable for production

Code Examples: - Azure SQL provisioning (ARM/Bicep or Pulumi) - Elastic pool configuration - Partitioning DDL (partition function, scheme, table) - Index creation scripts - Query Store setup - Security configuration checklist

Diagrams: - Azure SQL topology (pool, databases, replicas) - Partitioning visualization - HA/DR architecture

Deliverables: - Azure SQL configuration guide - Performance tuning playbook - HA/DR design - Security hardening checklist


Topic 14: Row-Level Security & Temporal Tables

What will be covered:

  • Row-Level Security (RLS) Deep Dive
    • Predicate Functions
      • Filter predicate: Silently filters rows on SELECT
      • Block predicate: Prevents INSERT/UPDATE/DELETE if condition fails
    • SESSION_CONTEXT Usage
      • Set tenant_id in session context on every connection
      • Middleware pattern for automatic setup
      • Validation before any queries
    • RLS Bypass Prevention
      • Deny ALTER ANY SECURITY POLICY permission
      • Audit users with CONTROL permission
      • Monitor for RLS bypass attempts
    • Performance Impact
      • Predicate pushdown (filter early in execution plan)
      • Minimal overhead for well-indexed queries
      • Test query plans with RLS enabled
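The SESSION_CONTEXT middleware flow — bind the tenant on every connection, validate before any query, fail closed if nothing is bound — can be sketched language-agnostically. ATP's real middleware is C# + NHibernate with the filter running inside SQL Server; this Python sketch uses an in-memory row list and hypothetical names (`set_session_context`, `query_audit_records`) purely to illustrate the pattern:

```python
import contextvars

# Illustrative sketch only; the real filter runs inside SQL Server via
# SESSION_CONTEXT plus an RLS filter predicate.
_tenant_id = contextvars.ContextVar("tenant_id", default=None)

ROWS = [
    {"AuditRecordId": "01A", "TenantId": "acme"},
    {"AuditRecordId": "01B", "TenantId": "globex"},
]

def set_session_context(tenant_id: str) -> None:
    """Middleware step: bind the tenant before any query runs."""
    if not tenant_id:
        raise ValueError("tenant_id must be non-empty")
    _tenant_id.set(tenant_id)

def query_audit_records() -> list:
    """Mimics the RLS filter predicate: rows outside the bound tenant vanish."""
    tenant = _tenant_id.get()
    if tenant is None:  # fail closed when no tenant is bound
        raise PermissionError("no tenant in session context")
    return [r for r in ROWS if r["TenantId"] == tenant]
```

The key property mirrored here is fail-closed behavior: a query attempted before the middleware has bound a tenant is rejected rather than returning unfiltered rows.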

  • Temporal Tables (System-Versioned)
  • Use Case: Audit history of metadata tables (policies, configs)
  • Not for AuditRecords (already immutable)
  • Setup:
    CREATE TABLE dbo.Policies (
      PolicyId INT PRIMARY KEY,
      TenantId NVARCHAR(128) NOT NULL,
      PolicyName NVARCHAR(255) NOT NULL,
      PolicyJson NVARCHAR(MAX) NOT NULL,
      ValidFrom DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL,
      ValidTo DATETIME2 GENERATED ALWAYS AS ROW END NOT NULL,
      PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
    )
    WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.PoliciesHistory));
    
  • Querying History:

    SELECT * FROM dbo.Policies
    FOR SYSTEM_TIME AS OF '2025-10-01T00:00:00'
    WHERE TenantId = 'acme';
    

  • Combining RLS + Temporal Tables

  • Apply RLS to both current and history tables
  • Ensure consistent tenant isolation across time

  • Testing & Validation

  • Unit tests for RLS predicates
  • Integration tests for multi-tenant queries
  • Performance tests with large datasets
  • Security tests (attempt cross-tenant access)

Code Examples: - Complete RLS setup (function, policy, permissions) - SESSION_CONTEXT middleware (C# + NHibernate) - Temporal table DDL - Temporal query examples - RLS test suite

Diagrams: - RLS architecture and execution flow - Temporal table versioning timeline - Combined RLS + Temporal model

Deliverables: - RLS implementation guide (complete) - Temporal tables usage guide - Testing strategy and test cases - Performance validation results


CYCLE 8: Azure Cosmos DB Implementation (~2,500 lines)

Topic 15: Cosmos DB for Global Scale (Optional)

What will be covered:

  • When to Use Cosmos DB
    • Global distribution requirements (multi-region writes)
    • Extreme scale (millions of events/sec per tenant)
    • Need for low-latency reads worldwide
    • Elastic scalability (RU/s autoscale)

  • Cosmos DB Configuration
  • API Selection: SQL API (native)
  • Consistency Level: Session (default) or Bounded Staleness
  • Partition Key: /tenantId
  • Throughput: Autoscale (scales between 10% and 100% of the configured max RU/s)
  • Indexing Policy: Automatic (tune for specific queries)
  • Backup: Continuous (PITR) or Periodic

  • AuditRecord Model in Cosmos DB

    {
      "id": "01JE6KQQD0Q0J5VQ8WJ6T1S9FX", // AuditRecordId
      "tenantId": "acme",
      "createdAt": "2025-10-22T15:43:11.281Z",
      "observedAt": "2025-10-22T15:43:11.500Z",
      "action": "appointment.update",
      "resourceType": "Appointment",
      "resourceId": "A-9981",
      "actorId": "user_123",
      "actorType": "User",
      "payload": { ... },
      "classification": "SENSITIVE",
      "policyVersion": 17,
      "integrityBlockId": "01JE6KQQD0Q0J5VQ8WJ6T1S9GX",
      "_ts": 1729619591 // Cosmos DB internal timestamp
    }
    

  • Partition Strategy

  • Single Tenant per Partition (default)
    • Partition key: /tenantId
    • Works for most tenants (< 20GB per logical partition, the Cosmos DB limit)
  • Hot Partition Mitigation
    • For large tenants: Synthetic partition key /tenantId-{hash % N}
    • Requires query fanout across N partitions
  • Composite Partition Keys (future)

    • Hierarchical: /tenantId/resourceType
    • Requires Cosmos DB Hierarchical Partition Keys feature
  • Query Patterns

  • Point Reads: Get single record by id + tenantId (1 RU)
  • Range Queries: Filter by createdAt, action, actorId (cross-partition if needed)
  • Pagination: Use continuation tokens (built-in)
  • Aggregations: Use server-side aggregates or Azure Synapse Link
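The synthetic partition key scheme above can be sketched as follows; the bucket count (`N_BUCKETS = 8`) and the SHA-256 bucket hash are illustrative assumptions, not platform values:

```python
import hashlib

N_BUCKETS = 8  # assumed bucket count for illustration

def synthetic_partition_key(tenant_id: str, record_id: str) -> str:
    """Spread a large tenant across N logical partitions: tenantId-{hash % N}."""
    bucket = int(hashlib.sha256(record_id.encode("utf-8")).hexdigest(), 16) % N_BUCKETS
    return f"{tenant_id}-{bucket}"

def fanout_partition_keys(tenant_id: str) -> list:
    """Queries that lack the record id must fan out across all N buckets."""
    return [f"{tenant_id}-{i}" for i in range(N_BUCKETS)]
```

Point reads that know the record id hit exactly one bucket; tenant-wide range queries pay the fanout cost across all N buckets, which is the trade-off the outline calls out.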

  • Consistency & Performance

  • Session Consistency: Read-your-writes within same session
  • Bounded Staleness: Max staleness lag (time or operations)
  • Trade-offs: Stronger consistency = higher latency + cost
  • RU Consumption Monitoring: Alert on high RU/s usage

Code Examples: - Cosmos DB provisioning (Pulumi) - AuditRecord document insert (C# SDK) - Query with partition key - Cross-partition query with continuation token - Change feed consumer for projections - RU consumption tracking

Diagrams: - Cosmos DB global distribution topology - Partition strategy visualization - Consistency level comparison - Change feed processing flow

Deliverables: - Cosmos DB design guide - Partition strategy for ATP - Query optimization patterns - Cost estimation and monitoring


Topic 16: Cosmos DB Change Feed & Projections

What will be covered:

  • Change Feed Overview
    • Real-time feed of all inserts and updates
    • Used to build projections (similar to event sourcing)
    • Checkpointing via lease container
    • At-least-once delivery

  • Projection Worker Pattern
  • Subscribe to change feed
  • Process each document change
  • Upsert into projection (Azure SQL or separate Cosmos container)
  • Update checkpoint in lease container
  • Idempotent processing
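A minimal sketch of the idempotent projection worker loop, with in-memory dicts standing in for the projection store and the lease container:

```python
projection = {}        # stand-in for the projection store (Azure SQL or a Cosmos container)
lease_checkpoint = {}  # stand-in for the lease container's per-partition checkpoint

def process_changes(partition: str, changes: list) -> None:
    """Upsert each changed document by id, then advance the checkpoint."""
    for doc in changes:
        projection[doc["id"]] = doc  # upsert: replaying the same change is a no-op
    if changes:
        lease_checkpoint[partition] = changes[-1]["_ts"]

batch = [{"id": "01A", "_ts": 100, "action": "appointment.update"}]
process_changes("acme-0", batch)
process_changes("acme-0", batch)  # at-least-once redelivery leaves state unchanged
```

Because the upsert is keyed by document id, redelivered batches (the at-least-once case above) converge to the same projection state.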

  • Lease Container

  • Stores checkpoint per partition
  • Enables multiple workers (scale-out)
  • Automatic load balancing

  • Projection Examples

  • AuditEvents: Denormalized view for queries
  • ResourceEvents: Per-resource timeline
  • Materialized aggregates: Count of events per tenant/day

  • Performance & Scalability

  • One worker per partition (or multiples)
  • Monitor lag between source and projection
  • Alert if lag > SLO threshold
  • Throttle to avoid overwhelming projection store

  • Comparison: Cosmos DB vs. Azure SQL

  • When to use Cosmos for hot tier?
    • Global distribution
    • Massive scale (>100K writes/sec)
    • Low-latency reads worldwide
  • When to use Azure SQL?
    • Regional deployment (simpler)
    • Complex joins and transactions
    • Mature tooling and expertise
    • Cost-effective for moderate scale

Code Examples: - Change feed processor setup (C# SDK) - Projection worker implementation - Lease container configuration - Idempotent upsert pattern - Lag monitoring query

Diagrams: - Change feed architecture - Projection worker topology - Lease container and checkpointing - Cosmos vs. SQL decision tree

Deliverables: - Change feed implementation guide - Projection worker templates - Performance tuning recommendations - Cosmos vs. SQL selection criteria


CYCLE 9: Azure Blob Storage (WORM) (~3,000 lines)

Topic 17: Blob Storage for Immutable Evidence

What will be covered:

  • WORM (Write-Once-Read-Many) Requirements
    • SEC Rule 17a-4: Financial records immutability
    • HIPAA: Healthcare audit logs
    • GDPR Article 5(1)(f): Integrity and confidentiality
    • Tamper-evidence for forensic analysis

  • Azure Blob Immutability Policies
  • Time-Based Retention
    • Set retention period (e.g., 7 years)
    • Blobs cannot be deleted or modified until period expires
    • Can extend retention, but not shorten (in compliance mode)
  • Legal Hold
    • Independent of retention policy
    • Applied per blob or container
    • Must be explicitly released
    • Used for litigation, investigations
  • Compliance Mode vs. Governance Mode

    • Compliance: Cannot be overridden (even by admins)
    • Governance: Can be overridden with special permission
    • ATP uses Compliance Mode for production
  • Container Structure

  • Per-Tenant Containers
    • atp-{tenantId}-hot: Active append segments
    • atp-{tenantId}-cold: Export packages
    • atp-{tenantId}-integrity: Merkle roots, signatures
  • Benefits:

    • Independent retention policies per tenant
    • Per-tenant legal holds
    • Per-container access control (RBAC)
    • Cost allocation and reporting
  • Segment Storage Pattern

  • Segment Creation
    • Accumulate events until threshold (e.g., 10K events or 10MB)
    • Seal segment (no more appends)
    • Compute SHA-256 hash of entire segment
    • Write segment to blob storage
    • Write manifest (metadata + hash) to separate blob
  • Segment Naming
    • Pattern: segments/{year}/{month}/{segmentId}.jsonl.gz
    • JSONL: One JSON object per line (easy streaming)
    • GZIP: Compression (5-10x reduction)
  • Manifest Naming

    • Pattern: manifests/{year}/{month}/{segmentId}.manifest.json
    • Contains: segment hash, record count, time range, blob URL
  • Access Control

  • Managed Identity Only (no SAS tokens in application)
  • RBAC Roles:
    • Ingestion service: Storage Blob Data Contributor (write segments)
    • Query service: Storage Blob Data Reader (read segments)
    • Admin: Storage Blob Data Owner (manage containers)
  • Private Endpoint: All access via VNet, no public access

  • Lifecycle Management

  • Hot to Cool: After 90 days
  • Cool to Archive: After 365 days
  • Cost Optimization: Archive tier costs roughly a tenth of the Cool tier (and far less than Hot)
  • Rehydration: Archive blobs must be rehydrated before they can be read (can take several hours)
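The segment sealing steps described above (accumulate, seal, hash, write segment + manifest) can be sketched as follows; the blob names follow the documented patterns, and `mtime=0` is passed only so the GZIP output, and therefore the hash, is deterministic:

```python
import gzip
import hashlib
import json

def seal_segment(segment_id: str, year: int, month: int, events: list) -> dict:
    """Seal a batch of events: JSONL, GZIP, SHA-256 hash, and a manifest."""
    jsonl = "\n".join(json.dumps(e, sort_keys=True) for e in events) + "\n"
    blob = gzip.compress(jsonl.encode("utf-8"), mtime=0)  # deterministic bytes
    times = [e["createdAt"] for e in events]
    return {
        "segmentBlob": f"segments/{year}/{month:02d}/{segment_id}.jsonl.gz",
        "manifestBlob": f"manifests/{year}/{month:02d}/{segment_id}.manifest.json",
        "blobBytes": blob,
        "manifest": {
            "segmentId": segment_id,
            "segmentHash": hashlib.sha256(blob).hexdigest(),
            "recordCount": len(events),
            "timeRange": [min(times), max(times)],
        },
    }
```

In the real flow the `blobBytes` and the manifest would be uploaded to the immutable containers; anyone holding the manifest can later re-download the segment and recompute the hash to detect tampering.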

Code Examples: - Blob container creation with immutability policy (Pulumi) - Segment write with GZIP compression (C#) - Manifest generation and upload - Legal hold application - Managed Identity authentication for Blob access - Lifecycle policy configuration (ARM template)

Diagrams: - WORM blob storage architecture - Segment and manifest structure - Immutability policy timeline - Access control model - Lifecycle tier transitions

Deliverables: - Blob storage WORM design specification - Segment storage implementation guide - Immutability policy configuration - Access control setup - Lifecycle management policies


Topic 18: Integrity Verification & Merkle Trees

What will be covered:

  • Tamper-Evidence Design
    • Each segment has content hash (SHA-256)
    • Segments organized into Merkle trees
    • Merkle root signed by HSM-backed key
    • Chain of Merkle roots for historical integrity

  • Merkle Tree Construction
  • Leaf Nodes: Hash of each event (SHA-256)
  • Internal Nodes: Hash of concatenation of child hashes
  • Root: Top-level hash representing entire tree
  • Benefits:

    • Efficient verification (log(N) proofs)
    • Detect any tampering (single bit change invalidates root)
    • Cryptographic proof of inclusion
  • Integrity Block Structure

    {
      "blockId": "01JE6KQQD0Q0J5VQ8WJ6T1S9GX",
      "tenantId": "acme",
      "sealedAt": "2025-10-22T16:00:00Z",
      "segmentIds": ["seg-001", "seg-002", "seg-003"],
      "recordCount": 30000,
      "merkleRoot": "a3f5b9c2...",
      "prevBlockRoot": "2d8e1f4a...",
      "signature": {
        "algorithm": "Ed25519",
        "keyId": "kid-2025-10",
        "value": "5e8f2c9a..."
      }
    }
    

  • Signature Creation

  • Construct message: {blockId}|{merkleRoot}|{prevBlockRoot}|{sealedAt}
  • Sign with HSM-backed signing key (Ed25519 or RSA-PSS)
  • Store signature in integrity block
  • Write to blob storage (immutable)

  • Verification Protocol

  • Verify Signature: Use public key to verify signature on Merkle root
  • Verify Merkle Proof: Given event ID, verify it's included in Merkle root
  • Verify Chain: Check prevBlockRoot links form valid chain
  • Detect Tampering: Any hash mismatch = tampering detected

  • Verification Use Cases

  • Export Verification: Verify all events in export package
  • Forensic Analysis: Prove event was recorded at specific time
  • Compliance Audit: Demonstrate integrity of audit trail
  • Litigation: Provide cryptographic evidence

Code Examples: - Merkle tree construction algorithm - Integrity block creation and signing - Signature verification (public key) - Merkle proof generation and verification - Chain integrity check - Verification report generation

Diagrams: - Merkle tree structure - Integrity block chain - Signature creation flow - Verification protocol sequence

Deliverables: - Tamper-evidence design specification - Merkle tree implementation - Signature protocol documentation - Verification procedures and tools


CYCLE 10: Data Lifecycle & Retention (~2,500 lines)

Topic 19: Retention Policies & Lifecycle Management

What will be covered:

  • Retention Policy Model
    • Policy Definition
      • Policy ID (e.g., "rp-2025-standard")
      • Retention period (e.g., 7 years)
      • Triggers: Event type, classification, resource type
      • Actions: Mark eligible, purge, archive
    • Policy Assignment
      • Default policy per tenant
      • Override policies for specific event types
      • Classification-based policies (PII = longer retention)

  • Retention Calculation
  • Retention Start Date: CreatedAt or EffectiveAt
  • Retention End Date: Start + Retention Period
  • Eligibility: Event becomes eligible for deletion after end date
  • Grace Period: Additional buffer before purge (e.g., 30 days)
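The retention calculation above reduces to simple date arithmetic; the retention period and the 30-day grace window are the document's examples, not fixed values:

```python
from datetime import date, timedelta

def retention_dates(start: date, retention_days: int, grace_days: int = 30):
    """Return (retention end date, earliest purge date after the grace period)."""
    end = start + timedelta(days=retention_days)
    return end, end + timedelta(days=grace_days)

def is_purge_eligible(today: date, start: date, retention_days: int, grace_days: int = 30) -> bool:
    """An event only becomes purge-eligible after both retention and grace elapse."""
    _, purge_after = retention_dates(start, retention_days, grace_days)
    return today >= purge_after
```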

  • Lifecycle States

  • Active: Within retention period, accessible
  • Eligible: Retention expired, eligible for purge (but not yet deleted)
  • OnHold: Legal hold applied, cannot be deleted regardless of retention
  • Archived: Moved to cold tier, rehydration required
  • Purged: Permanently deleted

  • Lifecycle Management Process

  • Daily Job: Scan for eligible records
  • Purge Decision:
    • Check: Retention period expired?
    • Check: Not on legal hold?
    • Check: No active export referencing this record?
  • Purge Execution:

    • Delete from hot tier (Azure SQL)
    • Delete from warm tier (projections)
    • Delete blob segments (or move to Archive tier first)
    • Record purge event for audit
  • Legal Hold Management

  • Apply Hold: Set OnHold = true, record reason
  • Hold Scope: Entire tenant, date range, or specific records
  • Release Hold: Manual approval, set OnHold = false
  • Hold Audit Trail: All hold operations logged
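The purge decision checks above (retention expired, no legal hold, no active export reference) collapse into a single guard; the `record` field names are illustrative:

```python
def can_purge(record: dict) -> bool:
    """All three purge checks must pass; a legal hold always wins over retention."""
    return (record["retentionExpired"]
            and not record["onHold"]
            and record["activeExportRefs"] == 0)
```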

  • Data Minimization

  • Redact PII after N days (per classification)
  • Pseudonymize identifiers (replace with hash)
  • Aggregate old data (keep summary statistics, delete details)

Code Examples: - RetentionPolicy class definition - Retention calculation logic - Lifecycle state machine - Daily purge job (background worker) - Legal hold application and release - Data minimization pseudocode

Diagrams: - Retention policy model - Lifecycle state machine - Purge decision flowchart - Legal hold management workflow

Deliverables: - Retention policy specification - Lifecycle management design - Purge procedures and safety checks - Legal hold operational guide


Topic 20: Right to Erasure (GDPR Article 17)

What will be covered:

  • GDPR Right to Erasure
    • Users can request deletion of their personal data
    • Must be executed "without undue delay"
    • Exceptions: Legal obligations, litigation holds

  • ATP Erasure Strategy
  • Challenge: Audit logs are immutable and legally required
  • Solution: Pseudonymization + Cryptographic Erasure

    • Encrypt PII with tenant-specific key
    • When erasure requested, delete encryption key
    • PII becomes permanently unreadable (cryptographic erasure)
    • Audit trail structure remains intact (compliance)
  • Erasure Process

  • Step 1: Identify all records containing subject's PII
  • Step 2: Verify no legal holds or active litigation
  • Step 3: Delete or pseudonymize PII fields
  • Step 4: Delete encryption key (if using cryptographic erasure)
  • Step 5: Update projections (remove PII)
  • Step 6: Generate erasure certificate
  • Step 7: Notify user of completion

  • Pseudonymization Techniques

  • Replace user ID with HMAC-SHA256(userId, secret)
  • Replace email with "user-{hash}@redacted.example.com"
  • Replace name with "[REDACTED-{hash}]"
  • Preserve data structure and relationships
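The pseudonymization rules above map directly onto HMAC-SHA256; the `SECRET` constant is a placeholder — in practice the per-tenant key would come from Key Vault, never from source code:

```python
import hashlib
import hmac

SECRET = b"placeholder-secret"  # placeholder: the real key lives in Key Vault

def pseudonym(value: str) -> str:
    """Stable pseudonym: HMAC-SHA256(value, secret), truncated for readability."""
    return hmac.new(SECRET, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

def redact_actor(record: dict) -> dict:
    """Apply the documented substitution rules while preserving structure."""
    h = pseudonym(record["actorId"])
    return {
        **record,
        "actorId": h,
        "email": f"user-{h}@redacted.example.com",
        "name": f"[REDACTED-{h}]",
    }
```

Because HMAC with a fixed key is deterministic, the same subject maps to the same pseudonym everywhere, which preserves relationships across records without exposing the identity.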

  • Cryptographic Erasure

  • Encrypt PII fields with DEK (Data Encryption Key)
  • Wrap DEK with KEK (Key Encryption Key) in Key Vault
  • Store encrypted PII in PayloadJson
  • On erasure: Delete DEK from Key Vault
  • Result: PII permanently unrecoverable

  • Audit Trail of Erasure

  • Log erasure request (subject, timestamp, requestor)
  • Log erasure execution (affected records, method)
  • Generate cryptographic proof of erasure
  • Retain erasure audit log (separate from main audit trail)

Code Examples: - Erasure request API endpoint - PII identification query - Pseudonymization functions - Cryptographic erasure (DEK deletion) - Erasure certificate generation - Erasure audit log entry

Diagrams: - GDPR erasure workflow - Cryptographic erasure architecture - Pseudonymization vs. deletion comparison - Erasure audit trail

Deliverables: - Right to erasure implementation guide - Pseudonymization and cryptographic erasure patterns - Erasure procedures and checklists - Compliance documentation templates


CYCLE 11: Data Residency & Compliance (~2,500 lines)

Topic 21: Data Residency Enforcement

What will be covered:

  • Data Residency Requirements
    • GDPR (EU): Personal data of EU residents must stay in the EU
    • Data Protection Act (UK): Post-Brexit UK requirements
    • CCPA (California): California resident data (US-based)
    • LGPD (Brazil): Brazilian data protection law
    • Custom Contractual: Per-tenant residency requirements

  • Regional Deployment Architecture
  • Separate Subscriptions/Accounts per Region
    • atp-us-prod: US East 2, US West 2
    • atp-eu-prod: West Europe, North Europe
    • atp-il-prod: Israel Central
  • No Cross-Region Replication (default)

    • Tenants bound to home region
    • Data never crosses boundaries
    • Exports stay in home region
  • Tenant → Region Binding

  • Provisioning Time: Assign home region based on tenant location
  • Metadata:
    {
      "tenantId": "eucorp",
      "homeRegion": "eu-west-1",
      "allowedRegions": ["eu-west-1", "eu-north-1"],
      "crossRegionReplication": false,
      "dataResidencyPolicy": "GDPR-EU"
    }
    
  • Immutability: Home region cannot be changed (requires migration)
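A gateway-layer sketch of the binding and enforcement described above; the metadata shape mirrors the example document, and the region names are illustrative:

```python
# Stand-in for the tenant metadata store described above.
TENANTS = {
    "eucorp": {"homeRegion": "westeurope", "allowedRegions": ["westeurope", "northeurope"]},
}

def resolve_storage_region(tenant_id: str, serving_region: str) -> str:
    """Gateway check: reject disallowed regions, then pin storage to the home region."""
    meta = TENANTS[tenant_id]
    if serving_region not in meta["allowedRegions"]:
        raise PermissionError(f"{tenant_id} may not be served from {serving_region}")
    return meta["homeRegion"]  # all reads and writes target the home region's storage
```

The two properties enforced here match the outline: requests from a disallowed region are rejected outright, and even allowed secondary regions still direct storage access to the immutable home region.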

  • Enforcement Mechanisms

  • Gateway Layer
    • Resolve tenant's home region from metadata
    • Route request to correct regional API endpoint
    • Reject request if accessing from disallowed region
  • Storage Layer
    • Write to home region storage account only
    • Read from home region storage account only
    • Projection workers run in home region
  • Export Layer

    • Export packages stored in home region
    • Download links use home region blob URLs
    • No cross-region staging
  • Compliance Validation

  • Daily Scan: Verify all data in correct region
  • Alert on Violations: Any data found in wrong region
  • Automated Remediation: Move or purge violating data
  • Audit Report: Monthly residency compliance report

  • Migration Procedures

  • When tenant needs to change regions (rare)
  • Export all data from source region
  • Import to target region with new home region
  • Verify integrity and completeness
  • Purge data from source region
  • Update tenant metadata

Code Examples: - Tenant residency policy model - Region resolution logic (Gateway) - Storage account selection based on region - Cross-region prevention checks - Compliance validation query - Migration workflow pseudocode

Diagrams: - Multi-region deployment topology - Tenant → Region binding model - Residency enforcement flow - Migration sequence diagram

Deliverables: - Data residency architecture - Enforcement implementation guide - Compliance validation procedures - Migration runbooks


Topic 22: Compliance Certifications & Audits

What will be covered:

  • Target Compliance Frameworks
    • SOC 2 Type II: Trust Services Criteria
    • ISO 27001: Information Security Management
    • GDPR: EU data protection regulation
    • HIPAA: Healthcare data protection (US)
    • PCI DSS: Payment card data (if applicable)

  • Data Architecture Compliance Requirements
  • SOC 2
    • Logical and physical access controls (RLS, RBAC)
    • Change management (immutable audit trail)
    • Data backup and recovery (PITR, replicas)
    • Encryption at rest and in transit (TDE, TLS)
  • ISO 27001
    • Asset management (data classification)
    • Access control (multi-factor, least privilege)
    • Cryptography (Key Vault, HSM)
    • Logging and monitoring (Azure Monitor)
  • GDPR
    • Data minimization (retention policies)
    • Right to access (query APIs)
    • Right to erasure (cryptographic erasure)
    • Data portability (export APIs)
    • Breach notification (alerting, runbooks)
  • HIPAA

    • Access controls (RLS, authentication)
    • Audit controls (immutable audit trail)
    • Integrity controls (hash chains, signatures)
    • Transmission security (TLS, mTLS)
  • Audit Evidence Generation

  • Access Logs: All data access logged to immutable store
  • Configuration Logs: All schema/policy changes logged
  • Compliance Reports: Automated monthly reports
  • Retention Proofs: Demonstrate retention policy enforcement
  • Erasure Certificates: Proof of GDPR erasure execution
  • Integrity Proofs: Cryptographic verification results

  • Auditor Support

  • Read-Only Access: Provide auditors read-only query access
  • Sample Data Sets: Generate representative data samples
  • Control Matrices: Map ATP controls to framework requirements
  • Evidence Packages: Pre-assembled evidence for audits

  • Continuous Compliance Monitoring

  • Automated Checks: Daily validation of compliance controls
  • Alerting: Immediate alerts on control failures
  • Dashboard: Real-time compliance posture dashboard
  • Remediation: Automated or semi-automated fixes

Code Examples: - Access log generation (all queries logged) - Compliance report query (retention adherence) - Erasure certificate generation - Integrity proof generation (Merkle verification) - Compliance dashboard query (Kusto/KQL)

Diagrams: - Compliance framework mapping - Audit evidence generation flow - Continuous compliance monitoring architecture

Deliverables: - Compliance requirements matrix - Audit evidence generation guide - Auditor support procedures - Continuous compliance monitoring setup


CYCLE 12: Backup, Recovery & DR (~2,500 lines)

Topic 23: Backup Strategies

What will be covered:

  • Backup Requirements
    • RPO (Recovery Point Objective): Maximum acceptable data loss
      • Hot tier: ≤ 5 minutes
      • Warm tier: ≤ 15 minutes
      • Cold tier: ≤ 24 hours
    • RTO (Recovery Time Objective): Maximum acceptable downtime
      • Hot tier: ≤ 2 hours
      • Warm tier: ≤ 8 hours
      • Cold tier: ≤ 48 hours

  • Azure SQL Backup
  • Automated Backups
    • Full backup: Weekly
    • Differential backup: Every 12-24 hours
    • Transaction log backup: Every 5-10 minutes
  • Point-in-Time Restore (PITR)
    • Restore to any point within retention period (7-35 days)
    • Use case: Recover from accidental deletion (non-audit data)
  • Long-Term Retention (LTR)
    • Weekly/Monthly/Yearly backups retained up to 10 years
    • Use case: Compliance, archival
    • Cost: Significantly cheaper than primary storage
  • Geo-Redundant Backup

    • Backups replicated to paired region
    • Use case: Regional disaster recovery
  • Azure Blob Storage Backup

  • Immutability as Backup
    • WORM blobs are inherently protected
    • No deletion or modification possible
  • Soft Delete
    • Deleted blobs retained for 7-30 days
    • Recover from accidental deletion
  • Versioning
    • Every write creates new version
    • Old versions retained per policy
  • Snapshot Strategy

    • Daily snapshots of critical containers
    • Stored in separate geo-redundant account
  • Backup Validation

  • Monthly Restore Test: Restore sample data, verify integrity
  • Quarterly DR Drill: Full region failover test
  • Integrity Check: Verify backup content hashes match originals

Code Examples: - Azure SQL PITR restore command (T-SQL/PowerShell) - LTR backup configuration (ARM/Bicep) - Blob snapshot creation (C# SDK) - Backup validation script - Integrity verification (hash comparison)

Diagrams: - Backup architecture (SQL + Blob) - PITR timeline - Geo-redundant backup topology - Restore workflow

Deliverables: - Backup strategy document - RPO/RTO definitions - Backup configuration guide - Validation and testing procedures


Topic 24: Disaster Recovery (DR)

What will be covered:

  • DR Scenarios
    • Regional outage (Azure region down)
    • Data corruption (accidental or malicious)
    • Compliance breach (unauthorized access)
    • Ransomware attack (mitigated by immutable storage)

  • DR Architecture
  • Active-Active (optional, expensive)
    • Two regions serving traffic simultaneously
    • Data replicated in near real-time
    • Automatic failover (Traffic Manager, Front Door)
  • Active-Passive (recommended)

    • Primary region serves all traffic
    • Secondary region: Standby with replicated data
    • Manual or automatic failover on primary failure
  • Failover Procedures

  • Detection: Azure Monitor alerts on regional outage
  • Decision: SRE team evaluates severity, decides on failover
  • DNS Update: Switch Traffic Manager to secondary region
  • Data Sync: Verify secondary data is current (check replication lag)
  • Activation: Activate secondary region services (scale up)
  • Verification: Health checks, smoke tests
  • Communication: Notify customers of DR activation

  • Failback Procedures

  • Primary Recovery: Verify primary region is healthy
  • Data Sync: Replicate changes from secondary back to primary
  • Cutover: Switch traffic back to primary
  • Deactivation: Scale down secondary region (standby mode)
  • Post-Mortem: Document lessons learned

  • DR Data Consistency

  • Hot Tier: Geo-replicated (async), expect small data loss (RPO)
  • Warm Tier: Rebuild from hot tier events (eventual consistency)
  • Cold Tier: Geo-redundant blob storage (GRS or GZRS)
  • Integrity Verification: Recompute hashes, verify signatures

  • DR Testing

  • Quarterly DR Drills: Full failover test (non-production)
  • Annual DR Exercise: Full failover with production traffic (scheduled)
  • Metrics: Measure actual RPO/RTO achieved
  • Continuous Improvement: Update DR plan based on results

Code Examples: - Traffic Manager failover configuration (ARM/Bicep) - Geo-replication setup (Azure SQL) - Failover script (PowerShell/CLI) - Failback synchronization logic - DR drill runbook

Diagrams: - Active-Passive DR topology - Failover sequence diagram - Data replication flow - Failback process

Deliverables: - DR architecture design - Failover and failback runbooks - DR testing plan - Communication templates


CYCLE 13: Performance & Optimization (~2,500 lines)

Topic 25: Query Performance Optimization

What will be covered:

  • Query Patterns & Optimization
    • Common Query Types
      • List events for tenant (paginated)
      • Get single event by ID
      • Filter by date range, action, resource, actor
      • Full-text search (via Azure AI Search)
      • Aggregations (count, sum, group by)

  • Indexing Strategy
  • Primary Indexes
    • Clustered index: (TenantId, AuditRecordId)
    • Covering indexes for hot queries:
      • (TenantId, CreatedAt DESC, AuditRecordId DESC) INCLUDE (Action, ResourceType, ResourceId, ActorId)
      • (TenantId, ResourceType, ResourceId, CreatedAt DESC)
      • (TenantId, ActorId, CreatedAt DESC)
  • Columnstore Indexes
    • For large scans and analytics queries
    • Applied to historical data (after N months)
    • Significant compression (5-10x)
  • Filtered Indexes

    • Index only specific classifications (e.g., PII events)
    • Smaller index size, faster updates
  • Partitioning for Performance

  • Partition by (TenantId, CreatedMonth)
  • Partition Pruning: Queries with date filter scan only relevant partitions
  • Parallel Queries: SQL Server can scan partitions in parallel
  • Sliding Window: Drop old partitions (after retention expires)

  • Caching Layer

  • Redis Cache: Cache frequent query results
  • Cache Key: query:{tenantId}:{queryHash}
  • TTL: 5-60 minutes (based on lag tolerance)
  • Invalidation: On policy changes, manual purge, or TTL expiration
  • Cache Hit Ratio: Monitor and optimize (target >80%)
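The cache-aside pattern above, with the documented key shape query:{tenantId}:{queryHash}, can be sketched with an in-memory dict standing in for Redis:

```python
import hashlib
import json
import time

_cache = {}  # key -> (expires_at, result); stand-in for Redis

def cache_key(tenant_id: str, query: dict) -> str:
    """Documented key shape: query:{tenantId}:{queryHash}."""
    qhash = hashlib.sha256(json.dumps(query, sort_keys=True).encode()).hexdigest()[:16]
    return f"query:{tenant_id}:{qhash}"

def get_or_compute(tenant_id: str, query: dict, compute, ttl_seconds: int = 300):
    """Cache-aside: serve a fresh hit, otherwise compute and store with a TTL."""
    key = cache_key(tenant_id, query)
    hit = _cache.get(key)
    if hit is not None and hit[0] > time.monotonic():
        return hit[1]
    result = compute()
    _cache[key] = (time.monotonic() + ttl_seconds, result)
    return result
```

Hashing the canonicalized (sorted-keys) query JSON keeps logically identical queries on the same cache entry, and scoping the key by tenant means invalidation and hit-ratio metrics can be tracked per tenant.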

  • Query Optimization Techniques

  • Avoid SELECT *: Select only needed columns
  • Use Covering Indexes: Avoid key lookups
  • Limit Result Sets: Always paginate with OFFSET/FETCH or cursors
  • Parameterize Queries: Prevent SQL injection, enable plan reuse
  • Analyze Execution Plans: Identify missing indexes, expensive operations
  • Update Statistics: Keep query optimizer informed

Code Examples: - Index creation scripts (covering, filtered, columnstore) - Partitioned table setup (SQL) - Redis caching pattern (C#) - Optimized query examples (with execution plans) - Query performance analysis (DMVs, Query Store)

Diagrams: - Index strategy visualization - Partitioning and pruning - Caching architecture - Query execution flow

Deliverables: - Query performance optimization guide - Indexing best practices - Caching strategy - Performance monitoring setup


Topic 26: Write Performance & Batching

What will be covered:

  • Write Performance Challenges
    • High-throughput ingestion (thousands of events/sec)
    • Transactional consistency (append + outbox atomicity)
    • Index maintenance overhead
    • Network latency (API → DB)

  • Batching Strategy
  • In-Memory Accumulation
    • Accumulate events in memory buffer
    • Flush when:
      • Buffer reaches N events (e.g., 1000)
      • Time threshold (e.g., 5 seconds)
      • Memory pressure
  • Bulk Insert

    • Use SqlBulkCopy (SQL Server) or the Cosmos DB SDK's bulk mode
    • 10-100x faster than individual inserts
    • Minimizes round trips and transaction overhead
  • Outbox Pattern for Reliability

  • Transactional Outbox
    • Atomically insert event + outbox entry in single transaction
    • Outbox worker reads outbox table, publishes to message bus
    • Ensures no lost events (even if message bus is down)
  • Deduplication in Outbox

    • Idempotency key prevents duplicate publishes
    • Outbox entries marked as processed after successful publish
  • Connection Pooling

  • Reuse database connections
  • Configuration: Min pool size = 10, Max pool size = 100
  • Monitor pool exhaustion (alerts if all connections in use)

  • Asynchronous Processing

  • Non-blocking writes (async/await in C#)
  • Background workers for blob uploads (don't block API response)
  • Fire-and-forget for non-critical operations (e.g., analytics)

  • Performance Monitoring

  • Metrics:
    • Write throughput (events/sec)
    • Write latency (p50, p95, p99)
    • Batch size distribution
    • Connection pool utilization
    • Index maintenance time
  • Alerts:
    • Throughput drops below threshold
    • Latency exceeds SLO (e.g., p95 > 500ms)
    • Connection pool exhaustion
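The in-memory accumulation strategy above (flush on count or age) can be sketched as a small accumulator; the thresholds are the document's examples and `bulk_insert` stands in for SqlBulkCopy:

```python
import time

class BatchAccumulator:
    """Buffer events and flush when the batch is full or too old."""

    def __init__(self, bulk_insert, max_events: int = 1000, max_age_seconds: float = 5.0):
        self._bulk_insert = bulk_insert      # stand-in for SqlBulkCopy / bulk mode
        self._max_events = max_events
        self._max_age = max_age_seconds
        self._buffer = []
        self._opened_at = time.monotonic()

    def add(self, event: dict) -> None:
        if not self._buffer:
            self._opened_at = time.monotonic()  # age is measured from first buffered event
        self._buffer.append(event)
        if (len(self._buffer) >= self._max_events
                or time.monotonic() - self._opened_at >= self._max_age):
            self.flush()

    def flush(self) -> None:
        if self._buffer:
            self._bulk_insert(self._buffer)  # one round trip for the whole batch
            self._buffer = []
```

A real implementation would also flush on memory pressure and on shutdown; the final explicit `flush()` in the usage below covers the partial last batch.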

Code Examples: - Batching accumulator pattern (C#) - SqlBulkCopy usage (bulk insert) - Transactional outbox implementation - Connection pooling configuration - Performance monitoring queries (DMVs) - Alert rule definitions (Azure Monitor)

Diagrams: - Write path with batching - Transactional outbox architecture - Connection pooling - Performance metrics dashboard

Deliverables: - Write performance optimization guide - Batching implementation patterns - Outbox pattern documentation - Performance monitoring setup


CYCLE 14: Operations & Best Practices (~3,000 lines)

Topic 27: Data Operations & Maintenance

What will be covered:

  • Routine Maintenance Tasks
  • Index Maintenance
    • Rebuild fragmented indexes (weekly)
    • Update statistics (daily for hot tables)
    • Monitor index usage, drop unused indexes
  • Partition Management
    • Create new partitions for upcoming months
    • Drop old partitions (after retention expires)
    • Sliding window partition maintenance
  • Backup Verification
    • Monthly restore test (sample data)
    • Verify backup integrity (checksums)
    • Test DR failover procedures

  • Data Quality Monitoring
  • Validation Checks
    • Orphaned records (projections without hot tier source)
    • Duplicate records (idempotency failures)
    • Timestamp anomalies (CreatedAt > ObservedAt)
    • Missing integrity blocks (unsealed segments)
  • Automated Remediation

    • Rebuild orphaned projections from hot tier
    • Deduplicate using idempotency key
    • Alert on timestamp anomalies (investigate)
    • Seal pending segments (scheduled job)
  • Capacity Planning

  • Growth Projections
    • Forecast storage growth based on ingestion rate
    • Estimate database size for next 12 months
    • Plan for elastic pool or Hyperscale migration
  • Threshold Monitoring
    • Alert when database reaches 80% capacity
    • Alert when blob storage reaches quota
    • Alert when partition gets too large (> 50GB)
  • Scaling Triggers

    • Auto-scale elastic pool (DTU or vCore)
    • Add new partitions proactively
    • Increase blob storage tier (Standard to Premium)
  • Cost Optimization

  • Storage Tiering
    • Move old segments to Cool tier (after 90 days)
    • Move archive exports to Archive tier (after 30 days)
    • Monitor lifecycle policy execution
  • Database Optimization
    • Archive old partitions to cheaper storage
    • Use columnstore for historical data (compression)
    • Right-size elastic pool (avoid over-provisioning)
  • Query Optimization
    • Cache hot queries (Redis)
    • Reduce full table scans
    • Optimize expensive queries (Query Store)
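
The growth projections and threshold alerts above reduce to simple arithmetic; a minimal sketch follows (Python for illustration — the plan's monitoring queries will be KQL/Azure Monitor). The linear-growth assumption and the 80% threshold are taken from the bullets above; all function names are hypothetical.

```python
def forecast_storage_gb(current_gb, daily_ingest_gb, days_ahead):
    """Linear growth projection: current size plus daily ingestion over the horizon."""
    return current_gb + daily_ingest_gb * days_ahead

def days_until_threshold(current_gb, daily_ingest_gb, quota_gb,
                         threshold=0.80, horizon_days=365):
    """Return the first day (0..horizon) on which projected usage crosses
    threshold * quota, or None if it stays below for the whole horizon."""
    for day in range(horizon_days + 1):
        if forecast_storage_gb(current_gb, daily_ingest_gb, day) >= threshold * quota_gb:
            return day
    return None
```

In practice the ingestion rate would come from observed metrics (e.g., a 30-day rolling average) rather than a constant, and the result would feed the 80%-capacity alert rule described above.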

Code Examples: - Index rebuild script (T-SQL) - Partition sliding window procedure (T-SQL) - Data quality validation queries - Capacity monitoring query (Kusto/KQL) - Cost optimization script (Azure CLI) - Lifecycle policy configuration (ARM template)

Diagrams: - Maintenance workflow (scheduled jobs) - Data quality monitoring architecture - Capacity planning dashboard - Cost optimization strategies

Deliverables: - Operations manual - Maintenance runbooks - Data quality monitoring setup - Capacity planning guide - Cost optimization strategies


Topic 28: Troubleshooting & Diagnostics

What will be covered:

  • Common Issues
  • Slow Queries
    • Symptom: Query latency exceeds SLO
    • Diagnosis: Check execution plan, missing indexes
    • Resolution: Add covering index, update statistics
  • High Write Latency
    • Symptom: Ingestion throughput drops
    • Diagnosis: Check connection pool, index maintenance
    • Resolution: Increase pool size, batch writes, optimize indexes
  • Projection Lag
    • Symptom: Read models behind write tier (> 30 seconds)
    • Diagnosis: Check projection worker health, message bus backlog
    • Resolution: Scale out workers, increase consumer concurrency
  • Data Corruption
    • Symptom: Integrity verification fails
    • Diagnosis: Check segment hashes, Merkle roots
    • Resolution: Restore from backup, replay events from hot tier

  • Diagnostic Tools
  • Azure SQL
    • Dynamic Management Views (DMVs)
    • Query Store (execution stats, query plans)
    • Extended Events (detailed tracing)
  • Azure Blob Storage
    • Storage Analytics logs
    • Metrics (transactions, latency, throttling)
    • Diagnostic settings (Log Analytics)
  • Azure Cosmos DB
    • Request Unit (RU) consumption metrics
    • Partition hot spot detection
    • Slow query log
  • Application Logs

    • Structured logging (Serilog, Application Insights)
    • Correlation IDs for tracing
    • Error aggregation and alerting
  • Performance Troubleshooting Workflow

  1. Identify Symptom: User report, alert, or monitoring dashboard
  2. Isolate Component: API, database, message bus, blob storage?
  3. Gather Metrics: Query DMVs, check dashboards, review logs
  4. Analyze Root Cause: Slow query? Missing index? Resource contention?
  5. Apply Fix: Index, configuration change, scaling
  6. Verify Resolution: Monitor metrics, confirm performance restored
  7. Document: Update runbook, create postmortem

  • Emergency Procedures

  • Database Failover: If primary DB unresponsive
  • Read Replica Activation: If query load too high
  • Circuit Breaker: Temporarily disable non-critical features
  • Rate Limiting: Throttle ingestion if storage overwhelmed
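
Projection lag, used in the "Projection Lag" issue above, is simply the gap between the newest event in the write tier and the newest event the read model has applied. A hedged sketch of the calculation (Python; the plan's actual example will be a SQL query):

```python
from datetime import datetime, timedelta, timezone

# Assumed SLO from the symptom above: read models within 30 seconds of writes.
PROJECTION_LAG_SLO = timedelta(seconds=30)

def projection_lag(latest_written_at: datetime,
                   latest_projected_at: datetime) -> timedelta:
    """Lag = newest write-tier event timestamp minus newest projected timestamp.
    Clamped at zero in case clocks or checkpoints make the projection look ahead."""
    return max(latest_written_at - latest_projected_at, timedelta(0))

def lag_breaches_slo(latest_written_at: datetime,
                     latest_projected_at: datetime,
                     slo: timedelta = PROJECTION_LAG_SLO) -> bool:
    return projection_lag(latest_written_at, latest_projected_at) > slo
```

The two input timestamps would come from the hot tier's latest `CreatedAt` and the projection checkpoint, respectively; a breach triggers the scale-out resolution described above.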

Code Examples: - DMV queries for performance diagnostics - Query Store analysis queries - Projection lag calculation query - Integrity verification script - Emergency failover script (PowerShell)

Diagrams: - Troubleshooting decision tree - Diagnostic tool architecture - Emergency response workflow

Deliverables: - Troubleshooting guide - Diagnostic playbooks - Emergency response procedures - Root cause analysis templates


Topic 29: Data Governance & Compliance

What will be covered:

  • Data Governance Framework
  • Data Stewardship
    • Roles: Data Owner, Data Custodian, Data Steward
    • Responsibilities: Classification, retention, access control
  • Data Catalog
    • Document all data entities (AuditRecord, projections, etc.)
    • Schema registry with versioning
    • Data lineage tracking (source → transformation → destination)
  • Data Quality Standards
    • Completeness: All required fields populated
    • Accuracy: Data matches source system
    • Consistency: Same data across projections
    • Timeliness: Lag within SLO thresholds

  • Access Governance
  • Least Privilege
    • Application service accounts: Read/write scoped to specific tables
    • Analyst accounts: Read-only, tenant-scoped (RLS)
    • Admin accounts: Just-In-Time (JIT) elevation for operations
  • Access Reviews
    • Quarterly review of all database access
    • Revoke unused or expired permissions
    • Audit trail of access grants/revokes
  • Privileged Access Management (PAM)

    • Break-glass accounts for emergencies
    • Multi-factor authentication required
    • All actions logged and reviewed
  • Change Management

  • Schema Changes
    • All changes via FluentMigrator (versioned migrations)
    • Peer review for all schema changes
    • Backward compatibility required
    • Deployment via CI/CD pipeline
  • Configuration Changes

    • All changes in Git (Infrastructure as Code)
    • Approval workflow (pull request + review)
    • Automated testing before production
    • Rollback plan for every change
  • Audit & Compliance

  • Continuous Auditing
    • All data access logged (Read/Write/Delete)
    • Anomaly detection (unusual access patterns)
    • Alert on suspicious activity
  • Compliance Reporting
    • Monthly compliance dashboard
    • Evidence packages for auditors
    • Control attestation (SOC 2, ISO 27001)
  • Breach Response
    • Detection: Automated alerts on unauthorized access
    • Containment: Revoke access, isolate affected systems
    • Investigation: Root cause analysis, forensics
    • Notification: Notify the supervisory authority within 72 hours (GDPR Article 33) and affected users without undue delay (Article 34)
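
The anomaly-detection bullet above amounts to comparing each account's current access volume against its historical baseline. A minimal sketch (Python; the planned example is a SQL query, and the threshold factor and minimum-volume cutoff here are illustrative assumptions):

```python
def flag_anomalous_access(daily_counts, baselines, factor=3.0, min_events=20):
    """Flag accounts whose access count today exceeds `factor` times their
    historical daily baseline, ignoring very low-volume accounts to cut noise.

    daily_counts: {account: access count today}
    baselines:    {account: average daily access count}
    """
    flagged = []
    for account, count in daily_counts.items():
        baseline = baselines.get(account, 0)
        if count >= min_events and count > factor * baseline:
            flagged.append(account)
    return sorted(flagged)
```

Flagged accounts would feed the "alert on suspicious activity" path above; a production version would likely use a statistical baseline (e.g., mean plus N standard deviations) rather than a fixed multiplier.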

Code Examples: - Data catalog schema (metadata tables) - Access review query (list all permissions) - Schema migration example (FluentMigrator) - Audit log query (all access in last N days) - Anomaly detection query (unusual access patterns)

Diagrams: - Data governance framework - Access governance model - Change management workflow - Breach response playbook

Deliverables: - Data governance policy - Access control matrix - Change management procedures - Compliance reporting templates - Breach response plan


Topic 30: Best Practices & Architectural Patterns

What will be covered:

  • Data Architecture Best Practices
  • Immutability First
    • Design for append-only from day one
    • Avoid temptation to add UPDATE logic
    • Use projections for derived views
  • Tenant Isolation
    • TenantId in every table, every row
    • RLS for defense-in-depth
    • Test cross-tenant isolation regularly
  • Event Sourcing
    • Events as source of truth
    • Projections are derived and rebuildable
    • Time-travel queries for investigations
  • CQRS
    • Separate write and read models
    • Optimize each for its use case
    • Embrace eventual consistency
  • Compliance by Design
    • Classification at write time
    • Retention policies enforced automatically
    • Right to erasure via cryptographic methods

  • Performance Best Practices
  • Indexing: Cover hot queries, avoid over-indexing
  • Partitioning: By tenant + time for scalability
  • Caching: Redis for hot queries, TTL-based invalidation
  • Batching: Bulk inserts, reduce round trips
  • Async: Non-blocking I/O, background workers

  • Operational Best Practices

  • Monitoring: Golden signals (latency, throughput, errors, saturation)
  • Alerting: Alert on SLO violations, not arbitrary thresholds
  • Runbooks: Document common procedures (failover, restore, purge)
  • DR Testing: Quarterly drills, measure actual RPO/RTO
  • Automation: Infrastructure as Code, automated testing, CI/CD

  • Security Best Practices

  • Zero Trust: Assume breach, verify every access
  • Encryption: TDE for at-rest, TLS for in-transit, KMS for keys
  • RBAC: Least privilege, JIT elevation for admins
  • Auditing: Log all access, immutable audit trail
  • Secrets: Azure Key Vault, Managed Identity (no credentials in code)

  • Anti-Patterns to Avoid

  • Mutable Audit Logs: Never allow updates/deletes on audit records
  • Cross-Tenant Queries: Never query across tenants (except audited admin operations)
  • Missing Idempotency Keys: Always provide on critical operations
  • Over-Normalization: Projections should be denormalized for reads
  • Synchronous Everything: Use async/messaging for non-critical paths
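
Two of the patterns above — "Immutability First" and "Missing Idempotency Keys" — combine naturally in the append path. The sketch below (Python; an in-memory stand-in for the real store) shows the shape of the guard; in production the `_seen_keys` set would be a unique index on the idempotency key column.

```python
class AppendOnlyStore:
    """Append-only event store with idempotency-key deduplication.
    No update or delete operations are exposed, by design."""

    def __init__(self):
        self._events = []          # immutable history: append is the only mutation
        self._seen_keys = set()    # idempotency keys already applied

    def append(self, idempotency_key: str, event: dict) -> bool:
        """Append the event; return False (a safe no-op) on duplicate delivery."""
        if idempotency_key in self._seen_keys:
            return False
        self._seen_keys.add(idempotency_key)
        self._events.append(event)
        return True

    def events(self) -> tuple:
        return tuple(self._events)  # read-only view; projections derive from this
```

Because `append` is idempotent, at-least-once delivery from the message bus is safe, and projections can always be rebuilt by replaying `events()` from the start.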

Code Examples: - Immutable entity pattern (C#) - RLS policy template (reusable) - Event sourcing aggregate pattern - CQRS command/query handlers - Monitoring dashboard query (Kusto) - Security configuration checklist

Diagrams: - Best practices reference architecture - Anti-patterns to avoid (before/after) - Security-hardened data layer - Operational excellence framework

Deliverables: - Data architecture best practices guide - Pattern catalog (with examples) - Anti-pattern avoidance checklist - Security hardening guide - Operational excellence checklist


Summary of Deliverables

Across all 14 cycles, this documentation will provide:

  1. Foundational Knowledge
     • Data architecture vision and principles
     • Storage technology selection guide
     • Storage tiers (Hot/Warm/Cold) design

  2. Multi-Tenant Isolation
     • Tenant isolation patterns (logical and physical)
     • Row-Level Security implementation
     • Data residency enforcement

  3. Data Models
     • AuditRecord canonical schema
     • Supporting entities (policies, integrity, exports)
     • Projection models (AuditEvents, ResourceEvents, ActorEvents)

  4. Event Sourcing & CQRS
     • Event store design and replay procedures
     • CQRS command/query separation
     • Projection synchronization patterns

  5. Technology Implementations
     • Azure SQL (schema, RLS, partitioning, performance)
     • Azure Cosmos DB (partition strategy, change feed)
     • Azure Blob Storage (WORM, segments, integrity)

  6. Lifecycle Management
     • Retention policies and lifecycle states
     • Right to erasure (GDPR Article 17)
     • Legal hold management

  7. Compliance & Security
     • Data residency architecture
     • Compliance certifications (SOC 2, ISO 27001, GDPR, HIPAA)
     • Backup, recovery, and disaster recovery

  8. Performance & Operations
     • Query and write optimization
     • Monitoring and alerting
     • Troubleshooting and diagnostics
     • Best practices and anti-patterns

Next Steps

  1. Review & Approval: Validate cycle plan with architecture and data teams
  2. Cycle 1 Generation: Begin content generation for data architecture overview
  3. Schema Scripts: Prepare SQL DDL scripts for all tables
  4. Code Samples: Develop C# implementations for data access patterns
  5. Pilot Tenant: Implement full data layer for dev environment tenant


This documentation plan covers the complete data architecture for ATP, from foundational principles to operational best practices, ensuring a secure, scalable, compliant, and high-performance data layer for the Audit Trail Platform.