
Data Architecture - Audit Trail Platform (ATP)

Append-optimized, tamper-evident, tenant-isolated — ATP's data architecture separates hot append storage, warm read models, and cold archives with provable integrity, multi-tenant isolation, and compliance-driven lifecycle management across Azure SQL, Cosmos DB, and Blob Storage (WORM).


📋 Documentation Generation Plan

This document will be generated in 14 cycles. Current progress:

| Cycle | Topics | Estimated Lines | Status |
|-------|--------|-----------------|--------|
| Cycle 1 | Data Architecture Overview & Principles (1-2) | ~2,500 | ⏳ Not Started |
| Cycle 2 | Storage Tiers: Hot/Warm/Cold (3-4) | ~3,000 | ⏳ Not Started |
| Cycle 3 | Multi-Tenant Data Isolation (5-6) | ~2,500 | ⏳ Not Started |
| Cycle 4 | Authoritative Store: AuditRecord Model (7-8) | ~3,000 | ⏳ Not Started |
| Cycle 5 | Event Sourcing & CQRS Patterns (9-10) | ~2,500 | ⏳ Not Started |
| Cycle 6 | Read Models & Projections (11-12) | ~3,000 | ⏳ Not Started |
| Cycle 7 | Azure SQL Implementation (13-14) | ~2,500 | ⏳ Not Started |
| Cycle 8 | Azure Cosmos DB Implementation (15-16) | ~2,500 | ⏳ Not Started |
| Cycle 9 | Azure Blob Storage (WORM) (17-18) | ~3,000 | ⏳ Not Started |
| Cycle 10 | Data Lifecycle & Retention (19-20) | ~2,500 | ⏳ Not Started |
| Cycle 11 | Data Residency & Compliance (21-22) | ~2,500 | ⏳ Not Started |
| Cycle 12 | Backup, Recovery & DR (23-24) | ~2,500 | ⏳ Not Started |
| Cycle 13 | Performance & Optimization (25-26) | ~2,500 | ⏳ Not Started |
| Cycle 14 | Operations & Best Practices (27-30) | ~3,000 | ⏳ Not Started |

Total Estimated Lines: ~37,000


Purpose & Scope

This document defines the complete data layer architecture for the Audit Trail Platform (ATP), covering data models, storage strategies, access patterns, multi-tenant isolation, event sourcing, CQRS projections, lifecycle management, and compliance enforcement across all Azure storage technologies.

Key Architectural Drivers

  • Append-Only Semantics: Immutable audit trail with tamper-evidence
  • Multi-Tenant Isolation: Hard tenant boundaries at storage, query, and compliance layers
  • Event Sourcing & CQRS: Authoritative append store + derived read models
  • Storage Tiering: Hot (write-optimized) → Warm (query-optimized) → Cold (archive)
  • Compliance-Driven: WORM, retention, legal hold, data residency, right-to-erasure
  • Performance & Scale: Partition strategies, indexing, caching, denormalization
  • Provable Integrity: Hash chains, Merkle trees, digital signatures, verification protocols


Detailed Cycle Plan

CYCLE 1: Data Architecture Overview & Principles (~2,500 lines)

Topic 1: Data Architecture Foundations

What will be covered:

  • ATP Data Architecture Vision
    • Append-optimized for audit trail workloads
    • Separation of concerns: write vs. read paths
    • Event-driven data flow
    • Compliance-first design (immutability, auditability)

  • Core Design Principles
  • Append-Only Semantics
    • No updates or deletes (except lifecycle purges)
    • WORM (Write-Once-Read-Many) storage
    • Immutability at multiple layers
  • Event Sourcing
    • Events as source of truth
    • Complete audit history reconstruction
    • Time-travel queries
  • CQRS (Command Query Responsibility Segregation)
    • Separate write and read models
    • Optimized projections for queries
    • Eventually consistent read models
  • Multi-Tenant Isolation
    • TenantId as first-class partition key
    • Row-Level Security (RLS) enforcement
    • Physical and logical isolation strategies
  • Compliance & Privacy

    • Classification-driven access control
    • Retention policies and lifecycle management
    • Data residency enforcement
    • Right-to-erasure (GDPR Article 17)
  • Data Flow Architecture

  • Write path: Gateway → Ingestion → Append Store → Outbox → Event Bus
  • Read path: Query → Read Models (Projections) ← Projection Workers ← Event Bus
  • Integrity path: Integrity Service → Blob Storage (WORM) → Verification
  • Export path: Export Service → Read Models → Cold Storage (packages)
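
The write path's "Append Store → Outbox" step can be sketched as a transactional outbox: the audit record and its outgoing event are committed in one transaction, so `audit.appended` is published if and only if the append succeeded. This is a minimal illustration using SQLite; the table and column names are assumptions, not ATP's actual schema.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE AuditRecords (AuditRecordId TEXT PRIMARY KEY, TenantId TEXT, PayloadJson TEXT)"
)
conn.execute(
    "CREATE TABLE Outbox (MessageId INTEGER PRIMARY KEY AUTOINCREMENT, "
    "Topic TEXT, Body TEXT, Dispatched INTEGER DEFAULT 0)"
)

def append_audit_record(record_id: str, tenant_id: str, payload: dict) -> None:
    """Insert the record and its outbox message atomically: a relay later
    dispatches undelivered Outbox rows to the event bus."""
    with conn:  # sqlite3 context manager = one transaction, commit on success
        conn.execute(
            "INSERT INTO AuditRecords VALUES (?, ?, ?)",
            (record_id, tenant_id, json.dumps(payload)),
        )
        conn.execute(
            "INSERT INTO Outbox (Topic, Body) VALUES (?, ?)",
            ("audit.appended", json.dumps({"auditRecordId": record_id, "tenantId": tenant_id})),
        )

append_audit_record("01HZX5M9QK3TPVN0J8R2W4Y6ZC", "tenant-a", {"action": "appointment.update"})
pending = conn.execute("SELECT Topic FROM Outbox WHERE Dispatched = 0").fetchall()
print(pending)
```

A background dispatcher would poll (or tail) the Outbox table, publish each row to the event bus, and mark it dispatched, giving at-least-once delivery downstream.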

Code Examples: - High-level data flow sequence diagram - TenantContext propagation pattern - Event sourcing command/event example - CQRS write vs. read model comparison

Diagrams: - ATP data architecture overview (layered) - Write vs. Read path separation - Event-driven data flow - Storage tiers topology

Deliverables: - Data architecture vision document - Design principles catalog - Data flow documentation - Architectural decision records (ADRs)


Topic 2: Storage Technologies & Technology Stack

What will be covered:

  • Azure Storage Technologies
  • Azure SQL Database
    • Purpose: Authoritative append store, projections, metadata
    • Features: RLS, temporal tables, columnstore indexes
    • Edition: Business Critical or Hyperscale for scale
  • Azure Cosmos DB
    • Purpose: Optional for global-scale, multi-region writes
    • Partition strategy: /tenantId
    • Consistency: Session or Bounded Staleness
  • Azure Blob Storage
    • Purpose: WORM storage for segments, exports, evidence
    • Features: Immutability policies, legal hold, lifecycle tiers
    • Access: Private endpoint, Managed Identity
  • Azure Cache for Redis (optional)
    • Purpose: Query result caching, session state
    • Eviction: LRU with TTL
  • Azure AI Search (optional)
    • Purpose: Full-text search, faceted queries
    • Indexing: Per-tenant index aliases

  • Technology Stack Rationale
  • Why Azure SQL for append store?
    • ACID guarantees for write path
    • RLS for tenant isolation
    • Proven scalability with partitioning
  • Why Blob Storage for WORM?
    • Native immutability support
    • Cost-effective for large volumes
    • Lifecycle management (hot→cool→archive)
  • When to use Cosmos DB?

    • Global distribution requirements
    • Extreme scale (millions of events/sec)
    • Multi-region write scenarios
  • Hybrid Storage Strategy

  • SQL for metadata and hot index
  • Blob for bulk event payloads
  • Cosmos DB for global tenants (optional)
  • Search for full-text queries (optional)

Code Examples: - Connection string management (Key Vault) - NHibernate configuration for SQL - Cosmos DB client setup - Blob Storage client with Managed Identity

Diagrams: - Storage technology mapping - Hybrid storage topology - Technology decision tree

Deliverables: - Storage technology selection guide - Technology stack documentation - Hybrid strategy design - Connection management patterns


CYCLE 2: Storage Tiers - Hot/Warm/Cold (~3,000 lines)

Topic 3: Hot Tier - Append Store

What will be covered:

  • Hot Tier Purpose & Characteristics
    • Authoritative source of truth
    • Optimized for high-throughput writes
    • Short to medium retention (days to months)
    • Immutable after write
    • Low-latency access for recent data

  • Azure SQL Hot Store Design
  • Schema Design
    • AuditRecords table (authoritative)
    • Clustered columnstore index for write optimization
    • Partitioning by (TenantId, CreatedMonth)
    • Minimal indexes (write performance priority)
  • Row-Level Security (RLS)
    • Predicate function: TenantId = SESSION_CONTEXT('tenant_id')
    • Applied to all queries automatically
    • Block and filter predicates
  • Append-Only Enforcement

    • INSTEAD OF UPDATE/DELETE triggers
    • Deny schema modification permissions
    • Audit log for DML attempts
  • Azure Blob Hot Store Design

  • Container Structure
    • atp-{tenantId}-hot: Append-only segments
    • atp-{tenantId}-hot-manifests: Segment metadata
    • atp-{tenantId}-hot-integrity: Hash chains, signatures
  • WORM Configuration
    • Time-based retention policy (e.g., 90 days)
    • Version-level immutability
    • Compliance mode (vs. governance mode)
  • Segment Strategy

    • Fixed-size segments (e.g., 10MB or 10K events)
    • Segment sealing and hash computation
    • Manifest generation with Merkle root
  • Hot Index (SQL Metadata)

  • Purpose: Fast segment lookup, pointer to blob
  • Schema: (TenantId, SegmentId, BlobUrl, Hash, CreatedAt, RecordCount)
  • Query pattern: Find segments for tenant + time range

  • Write Performance Optimization

  • Batching: Accumulate 100-1000 events before commit
  • Async writes to blob storage
  • Outbox pattern for messaging
  • Connection pooling and prepared statements
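
The segment sealing step above (hash each record, compute a Merkle root, emit a manifest) can be sketched as follows. The leaf/node hash layout and manifest fields here are illustrative assumptions; ATP's actual encoding may differ.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def merkle_root(leaf_hashes: list[str]) -> str:
    """Hash adjacent pairs level by level; an odd leftover node is promoted
    unchanged to the next level (one common convention, assumed here)."""
    level = leaf_hashes
    while len(level) > 1:
        nxt = [
            sha256_hex((level[i] + level[i + 1]).encode())
            for i in range(0, len(level) - 1, 2)
        ]
        if len(level) % 2 == 1:
            nxt.append(level[-1])
        level = nxt
    return level[0]

def seal_segment(records: list[bytes]) -> dict:
    """Seal a segment: per-record hashes plus the Merkle root for the manifest."""
    leaves = [sha256_hex(r) for r in records]
    return {"recordCount": len(records), "merkleRoot": merkle_root(leaves), "leafHashes": leaves}

manifest = seal_segment([b'{"action":"a"}', b'{"action":"b"}', b'{"action":"c"}'])
print(manifest["recordCount"], manifest["merkleRoot"][:8])
```

With the manifest stored in WORM blob storage, a verifier can later recompute leaf hashes from segment contents and confirm they roll up to the signed Merkle root.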

Code Examples: - AuditRecords SQL table DDL with partitioning - RLS policy implementation (T-SQL) - INSTEAD OF trigger for append-only - Blob segment write with WORM configuration - Hot index schema and queries - Batch insert performance pattern

Diagrams: - Hot tier architecture (SQL + Blob) - Segment structure and sealing - Write path data flow - Partitioning strategy visualization

Deliverables: - Hot tier design specification - SQL schema scripts - Blob storage configuration - Performance tuning guide


Topic 4: Warm Tier - Read Models & Projections

What will be covered:

  • Warm Tier Purpose & Characteristics
    • Query-optimized read models
    • Derived from hot tier via projections
    • Medium to long retention (months to years)
    • Denormalized for fast reads
    • Eventually consistent with hot tier

  • Projection Models
  • AuditEvents Projection
    • Flat event stream for list/search
    • Fields: AuditRecordId, TenantId, CreatedAt, Action, ResourceType, ResourceId, ActorId, etc.
    • Indexes: (TenantId, CreatedAt), (TenantId, ResourceType, ResourceId), (TenantId, ActorId)
  • ResourceEvents Projection
    • Per-resource timeline
    • Sequence number for ordered history
    • Fast lookups: "Show me all changes to Order #12345"
  • ActorEvents Projection
    • Per-actor activity log
    • Compliance use case: "What did user X do?"
  • Classification Metadata

    • Pre-computed classification flags
    • Redaction plan references
    • Policy version stamps
  • Projection Build Process

  • Event-driven: Subscribe to audit.appended events
  • Idempotent upserts keyed by (TenantId, AuditRecordId)
  • Watermark tracking: HighWaterRecordId, HighWaterObservedAt
  • Lag monitoring: Alert if lag > 30 seconds
  • Rebuild from scratch: Replay all events from hot tier

  • Projection Storage Options

  • Azure SQL
    • Best for structured queries, JOINs, aggregations
    • RLS for tenant isolation
    • Columnstore for large scans
  • Azure Cosmos DB

    • Best for global distribution, high throughput
    • Partition key: /tenantId
    • Composite indexes: (tenantId, createdAt)
  • Caching Layer (Redis)

  • Cache frequently accessed queries
  • TTL: 5-60 minutes depending on lag tolerance
  • Cache key pattern: query:{tenantId}:{queryHash}
  • Invalidation: On policy changes or manual purge
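
The projection build process above (idempotent upsert keyed by (TenantId, AuditRecordId) plus a watermark checkpoint) can be sketched with SQLite upserts; the schema is a simplified assumption, and ULIDs are relied on to sort lexicographically in record order.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE AuditEvents (TenantId TEXT, AuditRecordId TEXT, Action TEXT, "
    "PRIMARY KEY (TenantId, AuditRecordId))"
)
db.execute("CREATE TABLE ProjectionCheckpoint (Projection TEXT PRIMARY KEY, HighWaterRecordId TEXT)")

def apply_event(tenant_id: str, record_id: str, action: str) -> None:
    """Apply one audit.appended event: redeliveries overwrite the same row,
    and the watermark only ever moves forward."""
    with db:
        db.execute(
            "INSERT OR REPLACE INTO AuditEvents VALUES (?, ?, ?)",
            (tenant_id, record_id, action),
        )
        db.execute(
            "INSERT INTO ProjectionCheckpoint VALUES ('AuditEvents', ?) "
            "ON CONFLICT(Projection) DO UPDATE SET "
            "HighWaterRecordId = MAX(HighWaterRecordId, excluded.HighWaterRecordId)",
            (record_id,),
        )

apply_event("t1", "01B", "order.update")
apply_event("t1", "01B", "order.update")  # redelivery: no duplicate row
apply_event("t1", "01A", "order.create")  # out-of-order replay: watermark holds
count = db.execute("SELECT COUNT(*) FROM AuditEvents").fetchone()[0]
hw = db.execute("SELECT HighWaterRecordId FROM ProjectionCheckpoint").fetchone()[0]
print(count, hw)
```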

Code Examples: - AuditEvents projection schema (SQL) - ResourceEvents projection with sequence numbers - Projection worker consumer pattern (MassTransit) - Idempotent upsert logic - Watermark checkpoint management - Redis caching pattern

Diagrams: - Warm tier architecture - Projection build flow - Read model schemas - Caching strategy

Deliverables: - Warm tier design specification - Projection schemas (SQL and Cosmos DB) - Projection worker implementation guide - Caching patterns


CYCLE 3: Multi-Tenant Data Isolation (~2,500 lines)

Topic 5: Tenant Isolation Patterns

What will be covered:

  • Multi-Tenancy Model
  • Logical Multi-Tenancy (default)
    • Shared database, tenant-partitioned tables
    • TenantId in every row
    • Row-Level Security (RLS) enforcement
    • Cost-effective, operationally simple
  • Physical Multi-Tenancy (optional for large tenants)
    • Dedicated database per tenant
    • Complete isolation, custom SLAs
    • Higher operational complexity

  • TenantId as First-Class Citizen
  • TenantId Requirements
    • Opaque ASCII token: ^[A-Za-z0-9._-]{1,128}$
    • Immutable for tenant lifetime
    • Case-sensitive (treat as opaque)
    • Globally unique within platform
  • Propagation Pattern

    • Extracted from JWT at Gateway
    • Stored in TenantContext object
    • Propagated across all service calls
    • Included in all database queries
    • Present in all events and logs
  • Row-Level Security (RLS) in Azure SQL

  • Implementation
    CREATE FUNCTION dbo.fn_tenantPredicate(@TenantId NVARCHAR(128))
    RETURNS TABLE
    WITH SCHEMABINDING
    AS RETURN
    SELECT 1 AS IsAuthorized
    WHERE @TenantId = CONVERT(NVARCHAR(128), SESSION_CONTEXT(N'tenant_id'));
    
    CREATE SECURITY POLICY dbo.TenantSecurityPolicy
    ADD FILTER PREDICATE dbo.fn_tenantPredicate(TenantId) ON dbo.AuditRecords,
    ADD BLOCK PREDICATE dbo.fn_tenantPredicate(TenantId) ON dbo.AuditRecords
    WITH (STATE = ON);
    
  • Session Context Setup

    • Set SESSION_CONTEXT('tenant_id') on every connection
    • Validate before any query execution
    • Audit RLS bypasses (should never happen)
  • Partitioning Strategies

  • SQL Server Partitioning
    • Partition function: (TenantId, CreatedMonth)
    • Partition scheme: One filegroup per month (optional)
    • Benefits: Pruning, parallel queries, lifecycle management
  • Cosmos DB Partitioning
    • Partition key: /tenantId
    • Hot partition detection and mitigation
    • Composite keys for large tenants: /tenantId-{shard}
  • Blob Storage Partitioning

    • Container per tenant: atp-{tenantId}-{tier}
    • Benefits: Independent legal holds, retention policies
    • Access control: Per-container RBAC
  • Cross-Tenant Query Prevention

  • No JOINs across tenant boundaries
  • Explicit tenant filter in all queries
  • Query analyzer to detect violations
  • Integration tests for isolation
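
The TenantId rules above (opaque, case-sensitive, `^[A-Za-z0-9._-]{1,128}$`, immutable for the request's lifetime) can be sketched as a validated, frozen TenantContext. The class shape is an illustration, not ATP's actual C# implementation.

```python
import re
from dataclasses import dataclass

TENANT_ID_PATTERN = re.compile(r"^[A-Za-z0-9._-]{1,128}$")

@dataclass(frozen=True)  # frozen: the tenant identity cannot be mutated mid-request
class TenantContext:
    tenant_id: str

    def __post_init__(self) -> None:
        # Validate once at construction; everything downstream can trust the value.
        if not TENANT_ID_PATTERN.match(self.tenant_id):
            raise ValueError(f"Invalid TenantId: {self.tenant_id!r}")

ctx = TenantContext("acme-vet_01")
print(ctx.tenant_id)
```

In the real pipeline this object would be built once at the Gateway from the JWT and passed to every repository, which in turn sets `SESSION_CONTEXT('tenant_id')` before issuing queries.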

Code Examples: - TenantContext class definition - RLS policy implementation (complete) - SESSION_CONTEXT setup in connection middleware - SQL partitioning DDL - Cosmos DB partition key configuration - Tenant isolation test cases

Diagrams: - Multi-tenancy model comparison - TenantId propagation flow - RLS enforcement architecture - Partitioning strategies

Deliverables: - Multi-tenancy design guide - RLS implementation scripts - Partitioning strategy document - Tenant isolation testing guide


Topic 6: Data Residency & Regional Isolation

What will be covered:

  • Data Residency Requirements
    • GDPR: EU data must stay in EU
    • CCPA: California resident data (US only)
    • Custom: Per-tenant residency policies
    • Enforcement: At write and read time

  • Regional Topology
  • Separate Accounts per Region
    • US: atp-us-prod (Azure SQL, Blob Storage)
    • EU: atp-eu-prod (separate subscription)
    • IL: atp-il-prod (Israel data center)
  • No Cross-Region Replication (by default)

    • Except explicit tenant opt-in
    • Geo-redundancy within region only (ZRS, GZRS)
  • Tenant → Region Binding

  • Home Region Assignment
    • Set during tenant provisioning
    • Stored in Tenant metadata
    • Immutable (requires migration to change)
  • Allowed Regions

    • Multi-region tenants: List of approved regions
    • Enforcement at Gateway (routing)
    • Enforcement at storage (write rejection)
  • Residency Enforcement

  • Gateway Layer
    • Resolve tenant home region
    • Route request to correct regional cluster
    • Reject if region not allowed
  • Storage Layer
    • Tenant data written only to home region account
    • No cross-region queries
    • Export packages honor residency
  • Compliance Audit

    • Daily scan: Verify all data in correct region
    • Alert on violations
    • Remediation: Move or purge violating data
  • Migration Procedures

  • Tenant region change process
  • Data export from source region
  • Import to target region
  • Verification and cutover
  • Source data purge

Code Examples: - Tenant residency policy model - Region resolution logic - Storage account selection by region - Cross-region prevention validation - Migration workflow (Pseudocode)

Diagrams: - Regional topology (multi-region ATP deployment) - Tenant → Region binding model - Residency enforcement flow - Migration sequence diagram

Deliverables: - Data residency design - Regional deployment architecture - Enforcement implementation guide - Migration runbook


CYCLE 4: Authoritative Store - AuditRecord Model (~3,000 lines)

Topic 7: AuditRecord Canonical Schema

What will be covered:

  • AuditRecord Entity (Authoritative)
    • Core aggregate root in DDD model
    • Single source of truth for audit events
    • Immutable after creation
    • Complete lineage and correlation

  • Field Specification

    public class AuditRecord {
      // Identity
      public string AuditRecordId { get; set; }      // ULID (26 chars)
      public string TenantId { get; set; }           // Opaque tenant identifier
      public string? CorrelationTraceId { get; set; } // W3C trace ID (hex32)
      public string? IdempotencyKey { get; set; }    // Client-provided dedup key
    
      // Timestamps
      public DateTime CreatedAt { get; set; }        // Platform record creation (UTC)
      public DateTime ObservedAt { get; set; }       // Platform observation time (UTC)
      public DateTime? EffectiveAt { get; set; }     // Business effective time (UTC)
    
      // Event Core
      public string Action { get; set; }             // e.g., "appointment.update"
      public string? ResourceType { get; set; }      // e.g., "Appointment"
      public string? ResourceId { get; set; }        // Business entity ID
      public string? ActorId { get; set; }           // Who performed action
      public ActorType ActorType { get; set; }       // User, Service, System
      public string? SubjectId { get; set; }         // Whom action was performed on
      public DecisionOutcome? DecisionOutcome { get; set; } // Allow, Deny, NotApplicable
    
      // Payload
      public string PayloadJson { get; set; }        // Full JSON event payload
      public string? PayloadSchemaUri { get; set; }  // Schema reference
      public int PayloadBytes { get; set; }          // Size in bytes
    
      // Classification & Policy
      public DataClassification Classification { get; set; } // PUBLIC, SENSITIVE, PII, etc.
      public string PolicyVersion { get; set; }      // Policy version stamp
      public string? RetentionPolicyId { get; set; } // Retention rule reference
      public DateTime? RetentionUntil { get; set; }  // Computed retention date
      public bool OnHold { get; set; }               // Legal hold flag
    
      // Integrity
      public string? IntegrityBlockId { get; set; }  // Merkle tree block reference
      public string? ContentHash { get; set; }       // SHA-256 of PayloadJson
    
      // Metadata
      public string? SourceSystem { get; set; }      // Origin system name
      public string? SessionId { get; set; }         // User session ID
      public string? IpAddress { get; set; }         // Client IP (hashed or redacted)
      public string? UserAgent { get; set; }         // Client user agent
    }
    

  • Field Validation Rules

  • TenantId: Required, max 128 chars, regex pattern
  • CreatedAt ≤ ObservedAt ≤ now(): Timestamp ordering
  • Action: Required, lowercase, dot-separated (e.g., entity.verb)
  • ResourceType: PascalCase, namespace-qualified (e.g., Vetspire.Appointment)
  • ActorId: Required for user actions, optional for system events
  • PayloadJson: Valid JSON, max 1MB (configurable)
  • ContentHash: SHA-256, hex-encoded (64 chars)

  • Append-Only Guarantees

  • No UPDATE statements allowed
  • No DELETE statements (except lifecycle purges)
  • Database triggers enforce append-only
  • Violation attempts are audited

  • Idempotency Handling

  • Unique index: (TenantId, IdempotencyKey) (where not null)
  • Duplicate inserts return existing record
  • Client receives 200 OK with idempotent: true flag
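
The idempotency handling above, together with the ContentHash rule (SHA-256 of the canonical payload, hex-encoded), can be sketched with an in-memory store standing in for the unique index on (TenantId, IdempotencyKey); all names here are illustrative.

```python
import hashlib
import json

_store: dict[tuple[str, str], dict] = {}  # stands in for the unique index

def append(tenant_id: str, idempotency_key: str, payload: dict) -> dict:
    """A duplicate (TenantId, IdempotencyKey) returns the existing record with
    idempotent=True instead of inserting a second row."""
    key = (tenant_id, idempotency_key)
    if key in _store:
        return {**_store[key], "idempotent": True}
    # Canonical JSON (sorted keys, no whitespace) so the hash is reproducible.
    payload_json = json.dumps(payload, separators=(",", ":"), sort_keys=True)
    record = {
        "contentHash": hashlib.sha256(payload_json.encode()).hexdigest(),  # 64 hex chars
        "payloadJson": payload_json,
    }
    _store[key] = record
    return {**record, "idempotent": False}

first = append("tenant-a", "req-1", {"action": "order.create"})
second = append("tenant-a", "req-1", {"action": "order.create"})
print(second["idempotent"], first["contentHash"] == second["contentHash"])
```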

Code Examples: - AuditRecord C# class (complete) - SQL DDL for AuditRecords table - Validation logic (FluentValidation) - Idempotency check repository method - Append-only trigger implementation

Diagrams: - AuditRecord entity relationship diagram - Timestamp semantics - Idempotency flow

Deliverables: - AuditRecord schema specification - Field validation rules - SQL DDL scripts - Domain model implementation


Topic 8: Supporting Data Models

What will be covered:

  • PolicyDecision
    • Classification result
    • Retention policy reference
    • Redaction plan (if applicable)
    • Policy version stamp

  • IntegrityBlock
  • Block ID (ULID)
  • Segment references
  • Merkle root hash
  • Digital signature
  • Signing key ID (kid)
  • Sealed timestamp

  • Segment

  • Segment ID (ULID)
  • Tenant ID
  • Blob storage URL
  • Record count
  • Start/End timestamps
  • Content hash (SHA-256)
  • Manifest reference

  • ExportPackage

  • Job ID (ULID)
  • Tenant ID
  • Filter criteria
  • Package format (JSONL, Parquet)
  • Manifest with record list
  • Digital signature
  • Export timestamp

  • ProjectionCheckpoint

  • Projection name (e.g., "AuditEvents")
  • Tenant ID
  • High water mark (last processed AuditRecordId)
  • Checkpoint timestamp
  • Lag metrics (seconds behind hot tier)

  • LegalHold

  • Hold ID (ULID)
  • Tenant ID
  • Reason (e.g., "Litigation #2025-0042")
  • Applied timestamp
  • Released timestamp (nullable)
  • Scope: entire tenant or filtered (date range, resource type)

Code Examples: - PolicyDecision class - IntegrityBlock class - Segment metadata class - ExportPackage manifest structure (JSON) - ProjectionCheckpoint table schema - LegalHold table schema

Diagrams: - Supporting entities ERD - Integrity chain structure - Export package composition

Deliverables: - Supporting data model specifications - SQL/Cosmos DB schemas - Domain model classes - Serialization formats


CYCLE 5: Event Sourcing & CQRS Patterns (~2,500 lines)

Topic 9: Event Sourcing Implementation

What will be covered:

  • Event Sourcing Fundamentals
    • Events as immutable facts
    • Complete audit history
    • Time-travel queries
    • Replay capability
    • Event store as source of truth

  • ATP Event Store Design
  • AuditRecords table IS the event store
    • Each row is an immutable event
    • Ordered by (TenantId, CreatedAt, AuditRecordId)
    • Never updated or deleted (except lifecycle)
  • Event Envelope

    • Correlation IDs for traceability
    • Causation IDs for event chains
    • Schema versioning
    • Metadata (timestamps, actor, etc.)
  • Event Replay

  • Use Cases
    • Rebuild projections after corruption
    • Create new projection types
    • Audit investigations (replay specific tenant/time range)
    • Disaster recovery (replay from backup)
  • Replay Process
    1. Set projection checkpoint to beginning (or specific point)
    2. Stream events from hot tier in order
    3. Apply events to projection (idempotent upsert)
    4. Advance checkpoint periodically
    5. Monitor lag and throughput
  • Replay Performance

    • Batch processing: 1000-10000 events per batch
    • Parallel workers per tenant (up to partition count)
    • Throttling to avoid overwhelming read models
  • Event Versioning

  • Schema evolution strategy
  • Backward compatibility requirements
  • Upcasting old events to new schema
  • Versioned payload deserializers

  • Event Correlation & Causation

  • Correlation ID: Links events in same business transaction
  • Causation ID: Parent event that caused this event
  • Trace ID: W3C-compliant distributed tracing
  • Graph of causality for investigation
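
The five-step replay process above can be sketched as a batched loop that resumes from a checkpoint and applies events idempotently. This relies on the source's note that events are ordered by AuditRecordId (ULIDs sort lexicographically); batch sizes here are small for illustration.

```python
def replay(events, apply_batch, start_after: str = "", batch_size: int = 2) -> str:
    """events: iterable of (audit_record_id, payload) in ascending ID order.
    apply_batch must be idempotent, since a crash can re-deliver a batch."""
    checkpoint = start_after
    batch = []
    for record_id, payload in events:
        if record_id <= checkpoint:  # already applied before the restart
            continue
        batch.append((record_id, payload))
        if len(batch) >= batch_size:
            apply_batch(batch)
            checkpoint = batch[-1][0]  # advance only after the batch lands
            batch = []
    if batch:  # flush the trailing partial batch
        apply_batch(batch)
        checkpoint = batch[-1][0]
    return checkpoint

applied = []
cp = replay([("01A", {}), ("01B", {}), ("01C", {})], applied.extend, start_after="01A")
print(cp, len(applied))  # 01C 2
```

A production worker would persist the returned checkpoint after each batch and add throttling, as noted above, so replay does not overwhelm the read models.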

Code Examples: - Event store query patterns (get events for tenant + time range) - Replay worker implementation (consumer) - Batch processing logic - Idempotent event application - Correlation/causation tracking - Event upcasting example

Diagrams: - Event sourcing architecture - Event replay flow - Correlation graph visualization - Event versioning timeline

Deliverables: - Event sourcing design guide - Replay procedures and runbooks - Event versioning strategy - Correlation implementation


Topic 10: CQRS Implementation

What will be covered:

  • CQRS Pattern in ATP
  • Command Side (Write Path)
    • Commands: AppendAuditRecord, CreateExport, ApplyPolicy
    • Write to authoritative store only
    • Emit events on success
    • Strong consistency within command
  • Query Side (Read Path)
    • Queries: ListEvents, GetResourceHistory, SearchActor
    • Read from optimized projections
    • Eventually consistent with write side
    • No side effects

  • Separation Benefits
  • Independent scaling (write-heavy vs. read-heavy)
  • Optimized data models per use case
  • Flexibility to add new projections
  • Reduced contention (no read locks on writes)

  • Command Handlers

  • Validation and authorization
  • Aggregate load (if needed)
  • Business logic execution
  • Event generation
  • Atomic write + outbox

  • Query Handlers

  • Parameter validation
  • Projection selection
  • Policy-based filtering (classification, redaction)
  • Pagination and cursors
  • Result transformation

  • Eventual Consistency Management

  • Lag Monitoring
    • Track projection lag (seconds behind write)
    • Alert if lag > SLO threshold (e.g., 30s)
    • Dashboard showing lag per projection
  • Read-After-Write Consistency
    • Option 1: Poll projection until record appears
    • Option 2: Return "processing" status, client polls
    • Option 3: Accept eventual consistency (preferred)
  • Consistency Boundary

    • Strong consistency: Within single command/aggregate
    • Eventual consistency: Across aggregates, projections, exports
  • Projection Synchronization

  • Event-driven updates via message bus
  • Idempotent handlers to tolerate replays
  • Checkpoint-based resume after failures
  • Monitoring and alerting on projection health
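
The lag monitoring above reduces to a small calculation: lag is the wall clock minus the observed time of the last record the projection applied, alerted against the SLO threshold (30 seconds in the example above). A minimal sketch:

```python
from datetime import datetime, timedelta, timezone

LAG_SLO = timedelta(seconds=30)  # alert threshold from the SLO above

def projection_lag(high_water_observed_at: datetime, now: datetime) -> timedelta:
    """Seconds the projection trails the write side, derived from its checkpoint."""
    return now - high_water_observed_at

def should_alert(lag: timedelta) -> bool:
    return lag > LAG_SLO

now = datetime(2025, 1, 1, 12, 0, 45, tzinfo=timezone.utc)
checkpoint_time = datetime(2025, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
lag = projection_lag(checkpoint_time, now)
print(lag.total_seconds(), should_alert(lag))  # 45.0 True
```

In practice this runs per projection per tenant, feeding the lag dashboard; an idle tenant needs a heartbeat or last-event comparison so silence is not mistaken for lag.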

Code Examples: - Command handler structure (AppendAuditRecordCommand) - Query handler structure (ListEventsQuery) - CQRS mediator pattern (MediatR) - Projection lag calculation - Read-after-write consistency pattern - Projection synchronization worker

Diagrams: - CQRS architecture overview - Command vs. Query path comparison - Eventual consistency timeline - Projection synchronization flow

Deliverables: - CQRS implementation guide - Command and query patterns - Consistency management strategy - Monitoring and alerting setup


CYCLE 6: Read Models & Projections (~3,000 lines)

Topic 11: AuditEvents Projection

What will be covered:

  • AuditEvents Purpose
    • Flat event stream for list/search/filter operations
    • Most common query: "Show me all events for tenant X"
    • Denormalized for read performance
    • Indexed for fast seeks and scans

  • Schema Design

    CREATE TABLE dbo.AuditEvents (
      TenantId NVARCHAR(128) NOT NULL,
      AuditRecordId CHAR(26) NOT NULL,
      CreatedAt DATETIME2(3) NOT NULL,
      ObservedAt DATETIME2(3) NOT NULL,
      Action VARCHAR(255) NOT NULL,
      ResourceType VARCHAR(255) NULL,
      ResourceId VARCHAR(255) NULL,
      ActorId VARCHAR(255) NULL,
      ActorType TINYINT NOT NULL,
      DecisionOutcome TINYINT NULL,
      ChangedFields NVARCHAR(MAX) NULL, -- JSON array of field names
      DataClassFlags SMALLINT NOT NULL,
      CorrelationTraceId CHAR(32) NOT NULL,
      IntegrityBlockId CHAR(26) NULL,
      PayloadBytes INT NOT NULL,
      PRIMARY KEY (TenantId, AuditRecordId),
      INDEX IX_AuditEvents_Tenant_CreatedAt (TenantId, CreatedAt DESC, AuditRecordId DESC),
      INDEX IX_AuditEvents_Tenant_Resource (TenantId, ResourceType, ResourceId, CreatedAt DESC),
      INDEX IX_AuditEvents_Tenant_Actor (TenantId, ActorId, CreatedAt DESC)
    );
    

  • Projection Build Logic

  • Subscribe to audit.appended events
  • Extract relevant fields from AuditRecord
  • Compute derived fields (ChangedFields, DataClassFlags)
  • Upsert into AuditEvents (idempotent by AuditRecordId)
  • Update checkpoint

  • Query Patterns

  • List all events for tenant (paginated)
  • Filter by date range
  • Filter by action, resource type, actor
  • Search by correlation trace ID
  • Order by CreatedAt DESC (most recent first)

  • Pagination Strategy

  • Cursor-based (not offset-based for stability)
  • Cursor = base64url({CreatedAt}:{AuditRecordId})
  • Query: WHERE (CreatedAt, AuditRecordId) < cursor ORDER BY ... LIMIT N

  • Performance Optimization

  • Covering indexes for common queries
  • Partitioning by (TenantId, CreatedMonth) for pruning
  • Columnstore index for large scans (analytics)
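
The cursor format above, base64url of `{CreatedAt}:{AuditRecordId}`, can be sketched as an encode/decode pair; the decoded tuple feeds the keyset predicate `(CreatedAt, AuditRecordId) < cursor`. Timestamp and ULID values below are illustrative.

```python
import base64

def encode_cursor(created_at: str, audit_record_id: str) -> str:
    raw = f"{created_at}:{audit_record_id}".encode()
    # Strip '=' padding so the cursor is URL-safe without percent-encoding.
    return base64.urlsafe_b64encode(raw).decode().rstrip("=")

def decode_cursor(cursor: str) -> tuple[str, str]:
    padded = cursor + "=" * (-len(cursor) % 4)  # restore base64 padding
    decoded = base64.urlsafe_b64decode(padded).decode()
    # The timestamp itself contains ':', so split on the LAST colon only;
    # the ULID on the right never contains one.
    created_at, record_id = decoded.rsplit(":", 1)
    return created_at, record_id

cursor = encode_cursor("2025-01-01T00:00:00.000Z", "01HZX5M9QK3TPVN0J8R2W4Y6ZC")
print(decode_cursor(cursor))
```

Unlike an offset, this cursor stays stable while new events are appended: the next page is always "strictly older than the last row seen", regardless of inserts at the head.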

Code Examples: - AuditEvents SQL DDL (complete) - Projection worker logic (event handler) - Idempotent upsert pattern - Cursor-based pagination query - Index tuning for specific queries

Diagrams: - AuditEvents schema diagram - Projection build flow - Query execution plan

Deliverables: - AuditEvents projection specification - SQL DDL and indexes - Projection worker implementation - Query pattern examples


Topic 12: Resource & Actor Projections

What will be covered:

  • ResourceEvents Projection
    • Purpose: Per-resource audit timeline
    • Use Case: "Show me all changes to Order #12345"
    • Schema:

CREATE TABLE dbo.ResourceEvents (
  TenantId NVARCHAR(128) NOT NULL,
  ResourceType VARCHAR(255) NOT NULL,
  ResourceId VARCHAR(255) NOT NULL,
  Seq BIGINT NOT NULL, -- Sequence within resource
  AuditRecordId CHAR(26) NOT NULL,
  CreatedAt DATETIME2(3) NOT NULL,
  Action VARCHAR(255) NOT NULL,
  ActorId VARCHAR(255) NULL,
  -- ... other fields
  PRIMARY KEY (TenantId, ResourceType, ResourceId, Seq),
  INDEX IX_ResourceEvents_AuditRecordId (AuditRecordId)
);
  • Sequence Management
    • Seq increments per (TenantId, ResourceType, ResourceId)
    • Ensures ordered timeline per resource
    • Idempotent: If AuditRecordId exists, skip or update Seq

  • ActorEvents Projection
  • Purpose: Per-actor activity log
  • Use Case: "What did user alice@example.com do?"
  • Schema: Similar to ResourceEvents, keyed by (TenantId, ActorId, Seq)
  • Compliance: GDPR access requests, internal investigations

  • Projection Build Logic

  • For each audit.appended event:
    1. Upsert into AuditEvents (all events)
    2. If ResourceType+ResourceId present, upsert into ResourceEvents
    3. If ActorId present, upsert into ActorEvents
  • Handle missing resources gracefully (some events are resource-less)

  • Query Patterns

  • Resource timeline: SELECT * FROM ResourceEvents WHERE TenantId = ? AND ResourceType = ? AND ResourceId = ? ORDER BY Seq
  • Actor activity: SELECT * FROM ActorEvents WHERE TenantId = ? AND ActorId = ? ORDER BY Seq
  • Cross-reference: Use AuditRecordId to join with AuditEvents for full details

  • Performance Considerations

  • Highly selective queries (tenant + resource/actor)
  • Indexes on (TenantId, ResourceType, ResourceId) and (TenantId, ActorId)
  • Potential for hot resources (many updates) → monitoring
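
The fanout above (one `audit.appended` event feeding AuditEvents, ResourceEvents, and ActorEvents, with Seq incremented per resource key) can be sketched with in-memory lists standing in for the three tables; field names follow the schemas above but the storage is illustrative.

```python
from collections import defaultdict

audit_events, resource_events, actor_events = [], [], []
_seq: dict[tuple, int] = defaultdict(int)  # last Seq per (TenantId, ResourceType, ResourceId)

def project(event: dict) -> None:
    """Fan one audit event out to up to three projections."""
    audit_events.append(event)  # every event lands in the flat stream
    if event.get("resourceType") and event.get("resourceId"):
        key = (event["tenantId"], event["resourceType"], event["resourceId"])
        _seq[key] += 1  # ordered timeline per resource
        resource_events.append({**event, "seq": _seq[key]})
    if event.get("actorId"):  # resource-less or system events may lack an actor
        actor_events.append(event)

project({"tenantId": "t1", "resourceType": "Order", "resourceId": "12345", "actorId": "alice"})
project({"tenantId": "t1", "resourceType": "Order", "resourceId": "12345", "actorId": None})
print(len(audit_events), [e["seq"] for e in resource_events], len(actor_events))
```

A durable implementation would also make the Seq assignment idempotent per AuditRecordId, as noted above, so redelivered events do not inflate the timeline.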

Code Examples: - ResourceEvents SQL DDL - ActorEvents SQL DDL - Projection build logic with sequence generation - Resource timeline query - Actor activity query

Diagrams: - ResourceEvents schema and relationships - ActorEvents schema - Projection fanout (1 audit event → 3 projections)

Deliverables: - Resource and Actor projection specifications - SQL DDL scripts - Projection worker implementation - Query examples


CYCLE 7: Azure SQL Implementation (~2,500 lines)

Topic 13: Azure SQL Configuration & Optimization

What will be covered:

  • Azure SQL Database Tier Selection
  • General Purpose
    • Use case: Dev, test, small to medium workloads
    • Storage: Remote storage (Azure Standard SSD)
    • IOPS: Variable based on vCore count
  • Business Critical
    • Use case: Production, low-latency requirements
    • Storage: Local SSD (higher IOPS)
    • Read replicas: Built-in (readable secondaries)
  • Hyperscale
    • Use case: Massive scale (>1TB databases)
    • Storage: Tiered, nearly infinite (100TB+)
    • Rapid scale-up/down

  • Elastic Pool Strategy
  • One elastic pool per environment
  • Multiple tenant databases sharing pool resources
  • DTU or vCore model
  • Cost optimization for multi-tenant workloads

  • High Availability & Disaster Recovery

  • HA: Built-in (automatic failover replicas)
  • Geo-Replication: Active geo-replication for DR
  • Backup: Automated backups (PITR: 7-35 days)
  • Long-Term Retention (LTR): Up to 10 years

  • Performance Tuning

  • Indexing Strategy
    • Clustered columnstore for large tables (append workloads)
    • Nonclustered indexes for query paths
    • Covering indexes for hot queries
    • Index maintenance jobs (rebuild/reorganize)
  • Partitioning
    • Partition by (TenantId, CreatedMonth)
    • Benefits: Pruning, parallel queries, sliding window purges
    • Partition function and scheme setup
  • Statistics
    • Automatic statistics updates
    • Manual statistics for critical queries
  • Query Store

    • Enable query store for performance monitoring
    • Identify regression and optimization opportunities
  • Security Configuration

  • Authentication: Azure AD only (disable SQL auth)
  • Network: Private endpoint, no public access
  • Encryption: TDE (Transparent Data Encryption) always on
  • Auditing: Enable auditing to Log Analytics
  • Advanced Threat Protection: Enable for production

Code Examples: - Azure SQL provisioning (ARM/Bicep or Pulumi) - Elastic pool configuration - Partitioning DDL (partition function, scheme, table) - Index creation scripts - Query Store setup - Security configuration checklist

Diagrams: - Azure SQL topology (pool, databases, replicas) - Partitioning visualization - HA/DR architecture

Deliverables: - Azure SQL configuration guide - Performance tuning playbook - HA/DR design - Security hardening checklist


Topic 14: Row-Level Security & Temporal Tables

What will be covered:

  • Row-Level Security (RLS) Deep Dive
    • Predicate Functions
      • Filter predicate: Silently filters rows on SELECT
      • Block predicate: Prevents INSERT/UPDATE/DELETE if condition fails
    • SESSION_CONTEXT Usage
      • Set tenant_id in session context on every connection
      • Middleware pattern for automatic setup
      • Validation before any queries
    • RLS Bypass Prevention
      • Deny ALTER ANY SECURITY POLICY permission
      • Audit users with CONTROL permission
      • Monitor for RLS bypass attempts
    • Performance Impact
      • Predicate pushdown (filter early in execution plan)
      • Minimal overhead for well-indexed queries
      • Test query plans with RLS enabled
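The SESSION_CONTEXT middleware flow — bind the tenant on every connection, validate before any query, fail closed if nothing is bound — can be sketched language-agnostically. ATP's real middleware is C# + NHibernate with the filter running inside SQL Server; this Python sketch uses an in-memory row list and hypothetical names (`set_session_context`, `query_audit_records`) purely to illustrate the pattern:

```python
import contextvars

# Illustrative sketch only; the real filter runs inside SQL Server via
# SESSION_CONTEXT plus an RLS filter predicate.
_tenant_id = contextvars.ContextVar("tenant_id", default=None)

ROWS = [
    {"AuditRecordId": "01A", "TenantId": "acme"},
    {"AuditRecordId": "01B", "TenantId": "globex"},
]

def set_session_context(tenant_id: str) -> None:
    """Middleware step: bind the tenant before any query runs."""
    if not tenant_id:
        raise ValueError("tenant_id must be non-empty")
    _tenant_id.set(tenant_id)

def query_audit_records() -> list:
    """Mimics the RLS filter predicate: rows outside the bound tenant vanish."""
    tenant = _tenant_id.get()
    if tenant is None:  # fail closed when no tenant is bound
        raise PermissionError("no tenant in session context")
    return [r for r in ROWS if r["TenantId"] == tenant]
```

The key property mirrored here is fail-closed behavior: a query attempted before the middleware has bound a tenant is rejected rather than returning unfiltered rows.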

  • Temporal Tables (System-Versioned)
  • Use Case: Audit history of metadata tables (policies, configs)
  • Not for AuditRecords (already immutable)
  • Setup:
    CREATE TABLE dbo.Policies (
      PolicyId INT PRIMARY KEY,
      TenantId NVARCHAR(128) NOT NULL,
      PolicyName NVARCHAR(255) NOT NULL,
      PolicyJson NVARCHAR(MAX) NOT NULL,
      ValidFrom DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL,
      ValidTo DATETIME2 GENERATED ALWAYS AS ROW END NOT NULL,
      PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
    )
    WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.PoliciesHistory));
    
  • Querying History:

    SELECT * FROM dbo.Policies
    FOR SYSTEM_TIME AS OF '2025-10-01T00:00:00'
    WHERE TenantId = 'acme';
    

  • Combining RLS + Temporal Tables

  • Apply RLS to both current and history tables
  • Ensure consistent tenant isolation across time

  • Testing & Validation

  • Unit tests for RLS predicates
  • Integration tests for multi-tenant queries
  • Performance tests with large datasets
  • Security tests (attempt cross-tenant access)

Code Examples: - Complete RLS setup (function, policy, permissions) - SESSION_CONTEXT middleware (C# + NHibernate) - Temporal table DDL - Temporal query examples - RLS test suite

Diagrams: - RLS architecture and execution flow - Temporal table versioning timeline - Combined RLS + Temporal model

Deliverables: - RLS implementation guide (complete) - Temporal tables usage guide - Testing strategy and test cases - Performance validation results


CYCLE 8: Azure Cosmos DB Implementation (~2,500 lines)

Topic 15: Cosmos DB for Global Scale (Optional)

What will be covered:

  • When to Use Cosmos DB
    • Global distribution requirements (multi-region writes)
    • Extreme scale (millions of events/sec per tenant)
    • Need for low-latency reads worldwide
    • Elastic scalability (RU/s autoscale)

  • Cosmos DB Configuration
  • API Selection: SQL API (native)
  • Consistency Level: Session (default) or Bounded Staleness
  • Partition Key: /tenantId
  • Throughput: Autoscale (scales between 10% and 100% of the configured max RU/s)
  • Indexing Policy: Automatic (tune for specific queries)
  • Backup: Continuous (PITR) or Periodic

  • AuditRecord Model in Cosmos DB

    {
      "id": "01JE6KQQD0Q0J5VQ8WJ6T1S9FX", // AuditRecordId
      "tenantId": "acme",
      "createdAt": "2025-10-22T15:43:11.281Z",
      "observedAt": "2025-10-22T15:43:11.500Z",
      "action": "appointment.update",
      "resourceType": "Appointment",
      "resourceId": "A-9981",
      "actorId": "user_123",
      "actorType": "User",
      "payload": { ... },
      "classification": "SENSITIVE",
      "policyVersion": 17,
      "integrityBlockId": "01JE6KQQD0Q0J5VQ8WJ6T1S9GX",
      "_ts": 1729619591 // Cosmos DB internal timestamp
    }
    

  • Partition Strategy

  • Single Tenant per Partition (default)
    • Partition key: /tenantId
    • Works for most tenants (< 20GB per logical partition, the Cosmos DB limit)
  • Hot Partition Mitigation
    • For large tenants: Synthetic partition key /tenantId-{hash % N}
    • Requires query fanout across N partitions
  • Composite Partition Keys (future)

    • Hierarchical: /tenantId/resourceType
    • Requires Cosmos DB Hierarchical Partition Keys feature
  • Query Patterns

  • Point Reads: Get single record by id + tenantId (1 RU)
  • Range Queries: Filter by createdAt, action, actorId (cross-partition if needed)
  • Pagination: Use continuation tokens (built-in)
  • Aggregations: Use server-side aggregates or Azure Synapse Link
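The synthetic partition key scheme above can be sketched as follows; the bucket count (`N_BUCKETS = 8`) and the SHA-256 bucket hash are illustrative assumptions, not platform values:

```python
import hashlib

N_BUCKETS = 8  # assumed bucket count for illustration

def synthetic_partition_key(tenant_id: str, record_id: str) -> str:
    """Spread a large tenant across N logical partitions: tenantId-{hash % N}."""
    bucket = int(hashlib.sha256(record_id.encode("utf-8")).hexdigest(), 16) % N_BUCKETS
    return f"{tenant_id}-{bucket}"

def fanout_partition_keys(tenant_id: str) -> list:
    """Queries that lack the record id must fan out across all N buckets."""
    return [f"{tenant_id}-{i}" for i in range(N_BUCKETS)]
```

Point reads that know the record id hit exactly one bucket; tenant-wide range queries pay the fanout cost across all N buckets, which is the trade-off the outline calls out.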

  • Consistency & Performance

  • Session Consistency: Read-your-writes within same session
  • Bounded Staleness: Max staleness lag (time or operations)
  • Trade-offs: Stronger consistency = higher latency + cost
  • RU Consumption Monitoring: Alert on high RU/s usage

Code Examples: - Cosmos DB provisioning (Pulumi) - AuditRecord document insert (C# SDK) - Query with partition key - Cross-partition query with continuation token - Change feed consumer for projections - RU consumption tracking

Diagrams: - Cosmos DB global distribution topology - Partition strategy visualization - Consistency level comparison - Change feed processing flow

Deliverables: - Cosmos DB design guide - Partition strategy for ATP - Query optimization patterns - Cost estimation and monitoring


Topic 16: Cosmos DB Change Feed & Projections

What will be covered:

  • Change Feed Overview
    • Real-time feed of all inserts and updates
    • Used to build projections (similar to event sourcing)
    • Checkpointing via lease container
    • At-least-once delivery

  • Projection Worker Pattern
  • Subscribe to change feed
  • Process each document change
  • Upsert into projection (Azure SQL or separate Cosmos container)
  • Update checkpoint in lease container
  • Idempotent processing
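A minimal sketch of the idempotent projection worker loop, with in-memory dicts standing in for the projection store and the lease container:

```python
projection = {}        # stand-in for the projection store (Azure SQL or a Cosmos container)
lease_checkpoint = {}  # stand-in for the lease container's per-partition checkpoint

def process_changes(partition: str, changes: list) -> None:
    """Upsert each changed document by id, then advance the checkpoint."""
    for doc in changes:
        projection[doc["id"]] = doc  # upsert: replaying the same change is a no-op
    if changes:
        lease_checkpoint[partition] = changes[-1]["_ts"]

batch = [{"id": "01A", "_ts": 100, "action": "appointment.update"}]
process_changes("acme-0", batch)
process_changes("acme-0", batch)  # at-least-once redelivery leaves state unchanged
```

Because the upsert is keyed by document id, redelivered batches (the at-least-once case above) converge to the same projection state.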

  • Lease Container

  • Stores checkpoint per partition
  • Enables multiple workers (scale-out)
  • Automatic load balancing

  • Projection Examples

  • AuditEvents: Denormalized view for queries
  • ResourceEvents: Per-resource timeline
  • Materialized aggregates: Count of events per tenant/day

  • Performance & Scalability

  • One worker per partition (or multiples)
  • Monitor lag between source and projection
  • Alert if lag > SLO threshold
  • Throttle to avoid overwhelming projection store

  • Comparison: Cosmos DB vs. Azure SQL

  • When to use Cosmos for hot tier?
    • Global distribution
    • Massive scale (>100K writes/sec)
    • Low-latency reads worldwide
  • When to use Azure SQL?
    • Regional deployment (simpler)
    • Complex joins and transactions
    • Mature tooling and expertise
    • Cost-effective for moderate scale

Code Examples: - Change feed processor setup (C# SDK) - Projection worker implementation - Lease container configuration - Idempotent upsert pattern - Lag monitoring query

Diagrams: - Change feed architecture - Projection worker topology - Lease container and checkpointing - Cosmos vs. SQL decision tree

Deliverables: - Change feed implementation guide - Projection worker templates - Performance tuning recommendations - Cosmos vs. SQL selection criteria


CYCLE 9: Azure Blob Storage (WORM) (~3,000 lines)

Topic 17: Blob Storage for Immutable Evidence

What will be covered:

  • WORM (Write-Once-Read-Many) Requirements
    • SEC Rule 17a-4: Financial records immutability
    • HIPAA: Healthcare audit logs
    • GDPR Article 5(1)(f): Integrity and confidentiality
    • Tamper-evidence for forensic analysis

  • Azure Blob Immutability Policies
  • Time-Based Retention
    • Set retention period (e.g., 7 years)
    • Blobs cannot be deleted or modified until period expires
    • Can extend retention, but not shorten (in compliance mode)
  • Legal Hold
    • Independent of retention policy
    • Applied per blob or container
    • Must be explicitly released
    • Used for litigation, investigations
  • Compliance Mode vs. Governance Mode

    • Compliance: Cannot be overridden (even by admins)
    • Governance: Can be overridden with special permission
    • ATP uses Compliance Mode for production
  • Container Structure

  • Per-Tenant Containers
    • atp-{tenantId}-hot: Active append segments
    • atp-{tenantId}-cold: Export packages
    • atp-{tenantId}-integrity: Merkle roots, signatures
  • Benefits:

    • Independent retention policies per tenant
    • Per-tenant legal holds
    • Per-container access control (RBAC)
    • Cost allocation and reporting
  • Segment Storage Pattern

  • Segment Creation
    • Accumulate events until threshold (e.g., 10K events or 10MB)
    • Seal segment (no more appends)
    • Compute SHA-256 hash of entire segment
    • Write segment to blob storage
    • Write manifest (metadata + hash) to separate blob
  • Segment Naming
    • Pattern: segments/{year}/{month}/{segmentId}.jsonl.gz
    • JSONL: One JSON object per line (easy streaming)
    • GZIP: Compression (5-10x reduction)
  • Manifest Naming

    • Pattern: manifests/{year}/{month}/{segmentId}.manifest.json
    • Contains: segment hash, record count, time range, blob URL
  • Access Control

  • Managed Identity Only (no SAS tokens in application)
  • RBAC Roles:
    • Ingestion service: Storage Blob Data Contributor (write segments)
    • Query service: Storage Blob Data Reader (read segments)
    • Admin: Storage Blob Data Owner (manage containers)
  • Private Endpoint: All access via VNet, no public access

  • Lifecycle Management

  • Hot to Cool: After 90 days
  • Cool to Archive: After 365 days
  • Cost Optimization: Archive tier costs roughly a tenth of the Cool tier (and far less than Hot)
  • Rehydration: Archive blobs must be rehydrated before they can be read (can take several hours)
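The segment sealing steps described above (accumulate, seal, hash, write segment + manifest) can be sketched as follows; the blob names follow the documented patterns, and `mtime=0` is passed only so the GZIP output, and therefore the hash, is deterministic:

```python
import gzip
import hashlib
import json

def seal_segment(segment_id: str, year: int, month: int, events: list) -> dict:
    """Seal a batch of events: JSONL, GZIP, SHA-256 hash, and a manifest."""
    jsonl = "\n".join(json.dumps(e, sort_keys=True) for e in events) + "\n"
    blob = gzip.compress(jsonl.encode("utf-8"), mtime=0)  # deterministic bytes
    times = [e["createdAt"] for e in events]
    return {
        "segmentBlob": f"segments/{year}/{month:02d}/{segment_id}.jsonl.gz",
        "manifestBlob": f"manifests/{year}/{month:02d}/{segment_id}.manifest.json",
        "blobBytes": blob,
        "manifest": {
            "segmentId": segment_id,
            "segmentHash": hashlib.sha256(blob).hexdigest(),
            "recordCount": len(events),
            "timeRange": [min(times), max(times)],
        },
    }
```

In the real flow the `blobBytes` and the manifest would be uploaded to the immutable containers; anyone holding the manifest can later re-download the segment and recompute the hash to detect tampering.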

Code Examples: - Blob container creation with immutability policy (Pulumi) - Segment write with GZIP compression (C#) - Manifest generation and upload - Legal hold application - Managed Identity authentication for Blob access - Lifecycle policy configuration (ARM template)

Diagrams: - WORM blob storage architecture - Segment and manifest structure - Immutability policy timeline - Access control model - Lifecycle tier transitions

Deliverables: - Blob storage WORM design specification - Segment storage implementation guide - Immutability policy configuration - Access control setup - Lifecycle management policies


Topic 18: Integrity Verification & Merkle Trees

What will be covered:

  • Tamper-Evidence Design
    • Each segment has content hash (SHA-256)
    • Segments organized into Merkle trees
    • Merkle root signed by HSM-backed key
    • Chain of Merkle roots for historical integrity

  • Merkle Tree Construction
  • Leaf Nodes: Hash of each event (SHA-256)
  • Internal Nodes: Hash of concatenation of child hashes
  • Root: Top-level hash representing entire tree
  • Benefits:

    • Efficient verification (log(N) proofs)
    • Detect any tampering (single bit change invalidates root)
    • Cryptographic proof of inclusion
  • Integrity Block Structure

    {
      "blockId": "01JE6KQQD0Q0J5VQ8WJ6T1S9GX",
      "tenantId": "acme",
      "sealedAt": "2025-10-22T16:00:00Z",
      "segmentIds": ["seg-001", "seg-002", "seg-003"],
      "recordCount": 30000,
      "merkleRoot": "a3f5b9c2...",
      "prevBlockRoot": "2d8e1f4a...",
      "signature": {
        "algorithm": "Ed25519",
        "keyId": "kid-2025-10",
        "value": "5e8f2c9a..."
      }
    }
    

  • Signature Creation

  • Construct message: {blockId}|{merkleRoot}|{prevBlockRoot}|{sealedAt}
  • Sign with HSM-backed signing key (Ed25519 or RSA-PSS)
  • Store signature in integrity block
  • Write to blob storage (immutable)

  • Verification Protocol

  • Verify Signature: Use public key to verify signature on Merkle root
  • Verify Merkle Proof: Given event ID, verify it's included in Merkle root
  • Verify Chain: Check prevBlockRoot links form valid chain
  • Detect Tampering: Any hash mismatch = tampering detected

  • Verification Use Cases

  • Export Verification: Verify all events in export package
  • Forensic Analysis: Prove event was recorded at specific time
  • Compliance Audit: Demonstrate integrity of audit trail
  • Litigation: Provide cryptographic evidence

Code Examples: - Merkle tree construction algorithm - Integrity block creation and signing - Signature verification (public key) - Merkle proof generation and verification - Chain integrity check - Verification report generation

Diagrams: - Merkle tree structure - Integrity block chain - Signature creation flow - Verification protocol sequence

Deliverables: - Tamper-evidence design specification - Merkle tree implementation - Signature protocol documentation - Verification procedures and tools


CYCLE 10: Data Lifecycle & Retention (~2,500 lines)

Topic 19: Retention Policies & Lifecycle Management

What will be covered:

  • Retention Policy Model
    • Policy Definition
      • Policy ID (e.g., "rp-2025-standard")
      • Retention period (e.g., 7 years)
      • Triggers: Event type, classification, resource type
      • Actions: Mark eligible, purge, archive
    • Policy Assignment
      • Default policy per tenant
      • Override policies for specific event types
      • Classification-based policies (PII = longer retention)

  • Retention Calculation
  • Retention Start Date: CreatedAt or EffectiveAt
  • Retention End Date: Start + Retention Period
  • Eligibility: Event becomes eligible for deletion after end date
  • Grace Period: Additional buffer before purge (e.g., 30 days)
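The retention calculation above reduces to simple date arithmetic; the retention period and the 30-day grace window are the document's examples, not fixed values:

```python
from datetime import date, timedelta

def retention_dates(start: date, retention_days: int, grace_days: int = 30):
    """Return (retention end date, earliest purge date after the grace period)."""
    end = start + timedelta(days=retention_days)
    return end, end + timedelta(days=grace_days)

def is_purge_eligible(today: date, start: date, retention_days: int, grace_days: int = 30) -> bool:
    """An event only becomes purge-eligible after both retention and grace elapse."""
    _, purge_after = retention_dates(start, retention_days, grace_days)
    return today >= purge_after
```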

  • Lifecycle States

  • Active: Within retention period, accessible
  • Eligible: Retention expired, eligible for purge (but not yet deleted)
  • OnHold: Legal hold applied, cannot be deleted regardless of retention
  • Archived: Moved to cold tier, rehydration required
  • Purged: Permanently deleted

  • Lifecycle Management Process

  • Daily Job: Scan for eligible records
  • Purge Decision:
    • Check: Retention period expired?
    • Check: Not on legal hold?
    • Check: No active export referencing this record?
  • Purge Execution:

    • Delete from hot tier (Azure SQL)
    • Delete from warm tier (projections)
    • Delete blob segments (or move to Archive tier first)
    • Record purge event for audit
  • Legal Hold Management

  • Apply Hold: Set OnHold = true, record reason
  • Hold Scope: Entire tenant, date range, or specific records
  • Release Hold: Manual approval, set OnHold = false
  • Hold Audit Trail: All hold operations logged
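The purge decision checks above (retention expired, no legal hold, no active export reference) collapse into a single guard; the `record` field names are illustrative:

```python
def can_purge(record: dict) -> bool:
    """All three purge checks must pass; a legal hold always wins over retention."""
    return (record["retentionExpired"]
            and not record["onHold"]
            and record["activeExportRefs"] == 0)
```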

  • Data Minimization

  • Redact PII after N days (per classification)
  • Pseudonymize identifiers (replace with hash)
  • Aggregate old data (keep summary statistics, delete details)

Code Examples: - RetentionPolicy class definition - Retention calculation logic - Lifecycle state machine - Daily purge job (background worker) - Legal hold application and release - Data minimization pseudocode

Diagrams: - Retention policy model - Lifecycle state machine - Purge decision flowchart - Legal hold management workflow

Deliverables: - Retention policy specification - Lifecycle management design - Purge procedures and safety checks - Legal hold operational guide


Topic 20: Right to Erasure (GDPR Article 17)

What will be covered:

  • GDPR Right to Erasure
    • Users can request deletion of their personal data
    • Must be executed "without undue delay"
    • Exceptions: Legal obligations, litigation holds

  • ATP Erasure Strategy
  • Challenge: Audit logs are immutable and legally required
  • Solution: Pseudonymization + Cryptographic Erasure

    • Encrypt PII with tenant-specific key
    • When erasure requested, delete encryption key
    • PII becomes permanently unreadable (cryptographic erasure)
    • Audit trail structure remains intact (compliance)
  • Erasure Process

  • Step 1: Identify all records containing subject's PII
  • Step 2: Verify no legal holds or active litigation
  • Step 3: Delete or pseudonymize PII fields
  • Step 4: Delete encryption key (if using cryptographic erasure)
  • Step 5: Update projections (remove PII)
  • Step 6: Generate erasure certificate
  • Step 7: Notify user of completion

  • Pseudonymization Techniques

  • Replace user ID with HMAC-SHA256(userId, secret)
  • Replace email with "user-{hash}@redacted.example.com"
  • Replace name with "[REDACTED-{hash}]"
  • Preserve data structure and relationships
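The pseudonymization rules above map directly onto HMAC-SHA256; the `SECRET` constant is a placeholder — in practice the per-tenant key would come from Key Vault, never from source code:

```python
import hashlib
import hmac

SECRET = b"placeholder-secret"  # placeholder: the real key lives in Key Vault

def pseudonym(value: str) -> str:
    """Stable pseudonym: HMAC-SHA256(value, secret), truncated for readability."""
    return hmac.new(SECRET, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

def redact_actor(record: dict) -> dict:
    """Apply the documented substitution rules while preserving structure."""
    h = pseudonym(record["actorId"])
    return {
        **record,
        "actorId": h,
        "email": f"user-{h}@redacted.example.com",
        "name": f"[REDACTED-{h}]",
    }
```

Because HMAC with a fixed key is deterministic, the same subject maps to the same pseudonym everywhere, which preserves relationships across records without exposing the identity.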

  • Cryptographic Erasure

  • Encrypt PII fields with DEK (Data Encryption Key)
  • Wrap DEK with KEK (Key Encryption Key) in Key Vault
  • Store encrypted PII in PayloadJson
  • On erasure: Delete DEK from Key Vault
  • Result: PII permanently unrecoverable

  • Audit Trail of Erasure

  • Log erasure request (subject, timestamp, requestor)
  • Log erasure execution (affected records, method)
  • Generate cryptographic proof of erasure
  • Retain erasure audit log (separate from main audit trail)

Code Examples: - Erasure request API endpoint - PII identification query - Pseudonymization functions - Cryptographic erasure (DEK deletion) - Erasure certificate generation - Erasure audit log entry

Diagrams: - GDPR erasure workflow - Cryptographic erasure architecture - Pseudonymization vs. deletion comparison - Erasure audit trail

Deliverables: - Right to erasure implementation guide - Pseudonymization and cryptographic erasure patterns - Erasure procedures and checklists - Compliance documentation templates


CYCLE 11: Data Residency & Compliance (~2,500 lines)

Topic 21: Data Residency Enforcement

What will be covered:

  • Data Residency Requirements
    • GDPR (EU): Personal data of EU residents must stay in the EU
    • Data Protection Act (UK): Post-Brexit UK requirements
    • CCPA (California): California resident data (US-based)
    • LGPD (Brazil): Brazilian data protection law
    • Custom Contractual: Per-tenant residency requirements

  • Regional Deployment Architecture
  • Separate Subscriptions/Accounts per Region
    • atp-us-prod: US East 2, US West 2
    • atp-eu-prod: West Europe, North Europe
    • atp-il-prod: Israel Central
  • No Cross-Region Replication (default)

    • Tenants bound to home region
    • Data never crosses boundaries
    • Exports stay in home region
  • Tenant → Region Binding

  • Provisioning Time: Assign home region based on tenant location
  • Metadata:
    {
      "tenantId": "eucorp",
      "homeRegion": "eu-west-1",
      "allowedRegions": ["eu-west-1", "eu-north-1"],
      "crossRegionReplication": false,
      "dataResidencyPolicy": "GDPR-EU"
    }
    
  • Immutability: Home region cannot be changed (requires migration)
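A gateway-layer sketch of the binding and enforcement described above; the metadata shape mirrors the example document, and the region names are illustrative:

```python
# Stand-in for the tenant metadata store described above.
TENANTS = {
    "eucorp": {"homeRegion": "westeurope", "allowedRegions": ["westeurope", "northeurope"]},
}

def resolve_storage_region(tenant_id: str, serving_region: str) -> str:
    """Gateway check: reject disallowed regions, then pin storage to the home region."""
    meta = TENANTS[tenant_id]
    if serving_region not in meta["allowedRegions"]:
        raise PermissionError(f"{tenant_id} may not be served from {serving_region}")
    return meta["homeRegion"]  # all reads and writes target the home region's storage
```

The two properties enforced here match the outline: requests from a disallowed region are rejected outright, and even allowed secondary regions still direct storage access to the immutable home region.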

  • Enforcement Mechanisms

  • Gateway Layer
    • Resolve tenant's home region from metadata
    • Route request to correct regional API endpoint
    • Reject request if accessing from disallowed region
  • Storage Layer
    • Write to home region storage account only
    • Read from home region storage account only
    • Projection workers run in home region
  • Export Layer

    • Export packages stored in home region
    • Download links use home region blob URLs
    • No cross-region staging
  • Compliance Validation

  • Daily Scan: Verify all data in correct region
  • Alert on Violations: Any data found in wrong region
  • Automated Remediation: Move or purge violating data
  • Audit Report: Monthly residency compliance report

  • Migration Procedures

  • When tenant needs to change regions (rare)
  • Export all data from source region
  • Import to target region with new home region
  • Verify integrity and completeness
  • Purge data from source region
  • Update tenant metadata

Code Examples: - Tenant residency policy model - Region resolution logic (Gateway) - Storage account selection based on region - Cross-region prevention checks - Compliance validation query - Migration workflow pseudocode

Diagrams: - Multi-region deployment topology - Tenant → Region binding model - Residency enforcement flow - Migration sequence diagram

Deliverables: - Data residency architecture - Enforcement implementation guide - Compliance validation procedures - Migration runbooks


Topic 22: Compliance Certifications & Audits

What will be covered:

  • Target Compliance Frameworks
    • SOC 2 Type II: Trust Services Criteria
    • ISO 27001: Information Security Management
    • GDPR: EU data protection regulation
    • HIPAA: Healthcare data protection (US)
    • PCI DSS: Payment card data (if applicable)

  • Data Architecture Compliance Requirements
  • SOC 2
    • Logical and physical access controls (RLS, RBAC)
    • Change management (immutable audit trail)
    • Data backup and recovery (PITR, replicas)
    • Encryption at rest and in transit (TDE, TLS)
  • ISO 27001
    • Asset management (data classification)
    • Access control (multi-factor, least privilege)
    • Cryptography (Key Vault, HSM)
    • Logging and monitoring (Azure Monitor)
  • GDPR
    • Data minimization (retention policies)
    • Right to access (query APIs)
    • Right to erasure (cryptographic erasure)
    • Data portability (export APIs)
    • Breach notification (alerting, runbooks)
  • HIPAA

    • Access controls (RLS, authentication)
    • Audit controls (immutable audit trail)
    • Integrity controls (hash chains, signatures)
    • Transmission security (TLS, mTLS)
  • Audit Evidence Generation

  • Access Logs: All data access logged to immutable store
  • Configuration Logs: All schema/policy changes logged
  • Compliance Reports: Automated monthly reports
  • Retention Proofs: Demonstrate retention policy enforcement
  • Erasure Certificates: Proof of GDPR erasure execution
  • Integrity Proofs: Cryptographic verification results

  • Auditor Support

  • Read-Only Access: Provide auditors read-only query access
  • Sample Data Sets: Generate representative data samples
  • Control Matrices: Map ATP controls to framework requirements
  • Evidence Packages: Pre-assembled evidence for audits

  • Continuous Compliance Monitoring

  • Automated Checks: Daily validation of compliance controls
  • Alerting: Immediate alerts on control failures
  • Dashboard: Real-time compliance posture dashboard
  • Remediation: Automated or semi-automated fixes

Code Examples: - Access log generation (all queries logged) - Compliance report query (retention adherence) - Erasure certificate generation - Integrity proof generation (Merkle verification) - Compliance dashboard query (Kusto/KQL)

Diagrams: - Compliance framework mapping - Audit evidence generation flow - Continuous compliance monitoring architecture

Deliverables: - Compliance requirements matrix - Audit evidence generation guide - Auditor support procedures - Continuous compliance monitoring setup


CYCLE 12: Backup, Recovery & DR (~2,500 lines)

Topic 23: Backup Strategies

What will be covered:

  • Backup Requirements
    • RPO (Recovery Point Objective): Maximum acceptable data loss
      • Hot tier: ≤ 5 minutes
      • Warm tier: ≤ 15 minutes
      • Cold tier: ≤ 24 hours
    • RTO (Recovery Time Objective): Maximum acceptable downtime
      • Hot tier: ≤ 2 hours
      • Warm tier: ≤ 8 hours
      • Cold tier: ≤ 48 hours

  • Azure SQL Backup
  • Automated Backups
    • Full backup: Weekly
    • Differential backup: Every 12-24 hours
    • Transaction log backup: Every 5-10 minutes
  • Point-in-Time Restore (PITR)
    • Restore to any point within retention period (7-35 days)
    • Use case: Recover from accidental deletion (non-audit data)
  • Long-Term Retention (LTR)
    • Weekly/Monthly/Yearly backups retained up to 10 years
    • Use case: Compliance, archival
    • Cost: Significantly cheaper than primary storage
  • Geo-Redundant Backup

    • Backups replicated to paired region
    • Use case: Regional disaster recovery
  • Azure Blob Storage Backup

  • Immutability as Backup
    • WORM blobs are inherently protected
    • No deletion or modification possible
  • Soft Delete
    • Deleted blobs retained for 7-30 days
    • Recover from accidental deletion
  • Versioning
    • Every write creates new version
    • Old versions retained per policy
  • Snapshot Strategy

    • Daily snapshots of critical containers
    • Stored in separate geo-redundant account
  • Backup Validation

  • Monthly Restore Test: Restore sample data, verify integrity
  • Quarterly DR Drill: Full region failover test
  • Integrity Check: Verify backup content hashes match originals

Code Examples: - Azure SQL PITR restore command (T-SQL/PowerShell) - LTR backup configuration (ARM/Bicep) - Blob snapshot creation (C# SDK) - Backup validation script - Integrity verification (hash comparison)

Diagrams: - Backup architecture (SQL + Blob) - PITR timeline - Geo-redundant backup topology - Restore workflow

Deliverables: - Backup strategy document - RPO/RTO definitions - Backup configuration guide - Validation and testing procedures


Topic 24: Disaster Recovery (DR)

What will be covered:

  • DR Scenarios
    • Regional outage (Azure region down)
    • Data corruption (accidental or malicious)
    • Compliance breach (unauthorized access)
    • Ransomware attack (mitigated by immutable storage)

  • DR Architecture
  • Active-Active (optional, expensive)
    • Two regions serving traffic simultaneously
    • Data replicated in near real-time
    • Automatic failover (Traffic Manager, Front Door)
  • Active-Passive (recommended)

    • Primary region serves all traffic
    • Secondary region: Standby with replicated data
    • Manual or automatic failover on primary failure
  • Failover Procedures

  • Detection: Azure Monitor alerts on regional outage
  • Decision: SRE team evaluates severity, decides on failover
  • DNS Update: Switch Traffic Manager to secondary region
  • Data Sync: Verify secondary data is current (check replication lag)
  • Activation: Activate secondary region services (scale up)
  • Verification: Health checks, smoke tests
  • Communication: Notify customers of DR activation

  • Failback Procedures

  • Primary Recovery: Verify primary region is healthy
  • Data Sync: Replicate changes from secondary back to primary
  • Cutover: Switch traffic back to primary
  • Deactivation: Scale down secondary region (standby mode)
  • Post-Mortem: Document lessons learned

  • DR Data Consistency

  • Hot Tier: Geo-replicated (async), expect small data loss (RPO)
  • Warm Tier: Rebuild from hot tier events (eventual consistency)
  • Cold Tier: Geo-redundant blob storage (GRS or GZRS)
  • Integrity Verification: Recompute hashes, verify signatures

  • DR Testing

  • Quarterly DR Drills: Full failover test (non-production)
  • Annual DR Exercise: Full failover with production traffic (scheduled)
  • Metrics: Measure actual RPO/RTO achieved
  • Continuous Improvement: Update DR plan based on results

Code Examples: - Traffic Manager failover configuration (ARM/Bicep) - Geo-replication setup (Azure SQL) - Failover script (PowerShell/CLI) - Failback synchronization logic - DR drill runbook

Diagrams: - Active-Passive DR topology - Failover sequence diagram - Data replication flow - Failback process

Deliverables: - DR architecture design - Failover and failback runbooks - DR testing plan - Communication templates


CYCLE 13: Performance & Optimization (~2,500 lines)

Topic 25: Query Performance Optimization

What will be covered:

  • Query Patterns & Optimization
    • Common Query Types
      • List events for tenant (paginated)
      • Get single event by ID
      • Filter by date range, action, resource, actor
      • Full-text search (via Azure AI Search)
      • Aggregations (count, sum, group by)

  • Indexing Strategy
  • Primary Indexes
    • Clustered index: (TenantId, AuditRecordId)
    • Covering indexes for hot queries:
      • (TenantId, CreatedAt DESC, AuditRecordId DESC) INCLUDE (Action, ResourceType, ResourceId, ActorId)
      • (TenantId, ResourceType, ResourceId, CreatedAt DESC)
      • (TenantId, ActorId, CreatedAt DESC)
  • Columnstore Indexes
    • For large scans and analytics queries
    • Applied to historical data (after N months)
    • Significant compression (5-10x)
  • Filtered Indexes

    • Index only specific classifications (e.g., PII events)
    • Smaller index size, faster updates
  • Partitioning for Performance

  • Partition by (TenantId, CreatedMonth)
  • Partition Pruning: Queries with date filter scan only relevant partitions
  • Parallel Queries: SQL Server can scan partitions in parallel
  • Sliding Window: Drop old partitions (after retention expires)

  • Caching Layer

  • Redis Cache: Cache frequent query results
  • Cache Key: query:{tenantId}:{queryHash}
  • TTL: 5-60 minutes (based on lag tolerance)
  • Invalidation: On policy changes, manual purge, or TTL expiration
  • Cache Hit Ratio: Monitor and optimize (target >80%)
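The cache-aside pattern above, with the documented key shape query:{tenantId}:{queryHash}, can be sketched with an in-memory dict standing in for Redis:

```python
import hashlib
import json
import time

_cache = {}  # key -> (expires_at, result); stand-in for Redis

def cache_key(tenant_id: str, query: dict) -> str:
    """Documented key shape: query:{tenantId}:{queryHash}."""
    qhash = hashlib.sha256(json.dumps(query, sort_keys=True).encode()).hexdigest()[:16]
    return f"query:{tenant_id}:{qhash}"

def get_or_compute(tenant_id: str, query: dict, compute, ttl_seconds: int = 300):
    """Cache-aside: serve a fresh hit, otherwise compute and store with a TTL."""
    key = cache_key(tenant_id, query)
    hit = _cache.get(key)
    if hit is not None and hit[0] > time.monotonic():
        return hit[1]
    result = compute()
    _cache[key] = (time.monotonic() + ttl_seconds, result)
    return result
```

Hashing the canonicalized (sorted-keys) query JSON keeps logically identical queries on the same cache entry, and scoping the key by tenant means invalidation and hit-ratio metrics can be tracked per tenant.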

  • Query Optimization Techniques

  • Avoid SELECT *: Select only needed columns
  • Use Covering Indexes: Avoid key lookups
  • Limit Result Sets: Always paginate with OFFSET/FETCH or cursors
  • Parameterize Queries: Prevent SQL injection, enable plan reuse
  • Analyze Execution Plans: Identify missing indexes, expensive operations
  • Update Statistics: Keep query optimizer informed

Code Examples: - Index creation scripts (covering, filtered, columnstore) - Partitioned table setup (SQL) - Redis caching pattern (C#) - Optimized query examples (with execution plans) - Query performance analysis (DMVs, Query Store)

Diagrams: - Index strategy visualization - Partitioning and pruning - Caching architecture - Query execution flow

Deliverables: - Query performance optimization guide - Indexing best practices - Caching strategy - Performance monitoring setup


Topic 26: Write Performance & Batching

What will be covered:

  • Write Performance Challenges
    • High-throughput ingestion (thousands of events/sec)
    • Transactional consistency (append + outbox atomicity)
    • Index maintenance overhead
    • Network latency (API → DB)

  • Batching Strategy
  • In-Memory Accumulation
    • Accumulate events in memory buffer
    • Flush when:
      • Buffer reaches N events (e.g., 1000)
      • Time threshold (e.g., 5 seconds)
      • Memory pressure
  • Bulk Insert

    • Use SqlBulkCopy (SQL Server) or the Cosmos DB SDK's bulk mode
    • 10-100x faster than individual inserts
    • Minimizes round trips and transaction overhead
  • Outbox Pattern for Reliability

  • Transactional Outbox
    • Atomically insert event + outbox entry in single transaction
    • Outbox worker reads outbox table, publishes to message bus
    • Ensures no lost events (even if message bus is down)
  • Deduplication in Outbox

    • Idempotency key prevents duplicate publishes
    • Outbox entries marked as processed after successful publish
  • Connection Pooling

  • Reuse database connections
  • Configuration: Min pool size = 10, Max pool size = 100
  • Monitor pool exhaustion (alerts if all connections in use)

  • Asynchronous Processing

  • Non-blocking writes (async/await in C#)
  • Background workers for blob uploads (don't block API response)
  • Fire-and-forget for non-critical operations (e.g., analytics)

  • Performance Monitoring

  • Metrics:
    • Write throughput (events/sec)
    • Write latency (p50, p95, p99)
    • Batch size distribution
    • Connection pool utilization
    • Index maintenance time
  • Alerts:
    • Throughput drops below threshold
    • Latency exceeds SLO (e.g., p95 > 500ms)
    • Connection pool exhaustion
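The in-memory accumulation strategy above (flush on count or age) can be sketched as a small accumulator; the thresholds are the document's examples and `bulk_insert` stands in for SqlBulkCopy:

```python
import time

class BatchAccumulator:
    """Buffer events and flush when the batch is full or too old."""

    def __init__(self, bulk_insert, max_events: int = 1000, max_age_seconds: float = 5.0):
        self._bulk_insert = bulk_insert      # stand-in for SqlBulkCopy / bulk mode
        self._max_events = max_events
        self._max_age = max_age_seconds
        self._buffer = []
        self._opened_at = time.monotonic()

    def add(self, event: dict) -> None:
        if not self._buffer:
            self._opened_at = time.monotonic()  # age is measured from first buffered event
        self._buffer.append(event)
        if (len(self._buffer) >= self._max_events
                or time.monotonic() - self._opened_at >= self._max_age):
            self.flush()

    def flush(self) -> None:
        if self._buffer:
            self._bulk_insert(self._buffer)  # one round trip for the whole batch
            self._buffer = []
```

A real implementation would also flush on memory pressure and on shutdown; the final explicit `flush()` in the usage below covers the partial last batch.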

Code Examples: - Batching accumulator pattern (C#) - SqlBulkCopy usage (bulk insert) - Transactional outbox implementation - Connection pooling configuration - Performance monitoring queries (DMVs) - Alert rule definitions (Azure Monitor)

Diagrams: - Write path with batching - Transactional outbox architecture - Connection pooling - Performance metrics dashboard

Deliverables: - Write performance optimization guide - Batching implementation patterns - Outbox pattern documentation - Performance monitoring setup


CYCLE 14: Operations & Best Practices (~3,000 lines)

Topic 27: Data Operations & Maintenance

What will be covered:

  • Routine Maintenance Tasks
  • Index Maintenance
    • Rebuild fragmented indexes (weekly)
    • Update statistics (daily for hot tables)
    • Monitor index usage, drop unused indexes
  • Partition Management
    • Create new partitions for upcoming months
    • Drop old partitions (after retention expires)
    • Sliding window partition maintenance
  • Backup Verification
    • Monthly restore test (sample data)
    • Verify backup integrity (checksums)
    • Test DR failover procedures

  • Data Quality Monitoring
  • Validation Checks
    • Orphaned records (projections without hot tier source)
    • Duplicate records (idempotency failures)
    • Timestamp anomalies (CreatedAt > ObservedAt)
    • Missing integrity blocks (unsealed segments)
  • Automated Remediation

    • Rebuild orphaned projections from hot tier
    • Deduplicate using idempotency key
    • Alert on timestamp anomalies (investigate)
    • Seal pending segments (scheduled job)
  • Capacity Planning

  • Growth Projections
    • Forecast storage growth based on ingestion rate
    • Estimate database size for next 12 months
    • Plan for elastic pool or Hyperscale migration
  • Threshold Monitoring
    • Alert when database reaches 80% capacity
    • Alert when blob storage reaches quota
    • Alert when partition gets too large (> 50GB)
  • Scaling Triggers

    • Auto-scale elastic pool (DTU or vCore)
    • Add new partitions proactively
    • Increase blob storage tier (Standard to Premium)
  • Cost Optimization

  • Storage Tiering
    • Move old segments to Cool tier (after 90 days)
    • Move archive exports to Archive tier (after 30 days)
    • Monitor lifecycle policy execution
  • Database Optimization
    • Archive old partitions to cheaper storage
    • Use columnstore for historical data (compression)
    • Right-size elastic pool (avoid over-provisioning)
  • Query Optimization
    • Cache hot queries (Redis)
    • Reduce full table scans
    • Optimize expensive queries (Query Store)
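
The growth projections and threshold alerts above reduce to simple arithmetic; a minimal sketch follows (Python for illustration — the plan's monitoring queries will be KQL/Azure Monitor). The linear-growth assumption and the 80% threshold are taken from the bullets above; all function names are hypothetical.

```python
def forecast_storage_gb(current_gb, daily_ingest_gb, days_ahead):
    """Linear growth projection: current size plus daily ingestion over the horizon."""
    return current_gb + daily_ingest_gb * days_ahead

def days_until_threshold(current_gb, daily_ingest_gb, quota_gb,
                         threshold=0.80, horizon_days=365):
    """Return the first day (0..horizon) on which projected usage crosses
    threshold * quota, or None if it stays below for the whole horizon."""
    for day in range(horizon_days + 1):
        if forecast_storage_gb(current_gb, daily_ingest_gb, day) >= threshold * quota_gb:
            return day
    return None
```

In practice the ingestion rate would come from observed metrics (e.g., a 30-day rolling average) rather than a constant, and the result would feed the 80%-capacity alert rule described above.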

Code Examples: - Index rebuild script (T-SQL) - Partition sliding window procedure (T-SQL) - Data quality validation queries - Capacity monitoring query (Kusto/KQL) - Cost optimization script (Azure CLI) - Lifecycle policy configuration (ARM template)

Diagrams: - Maintenance workflow (scheduled jobs) - Data quality monitoring architecture - Capacity planning dashboard - Cost optimization strategies

Deliverables: - Operations manual - Maintenance runbooks - Data quality monitoring setup - Capacity planning guide - Cost optimization strategies


Topic 28: Troubleshooting & Diagnostics

What will be covered:

  • Common Issues
  • Slow Queries
    • Symptom: Query latency exceeds SLO
    • Diagnosis: Check execution plan, missing indexes
    • Resolution: Add covering index, update statistics
  • High Write Latency
    • Symptom: Ingestion throughput drops
    • Diagnosis: Check connection pool, index maintenance
    • Resolution: Increase pool size, batch writes, optimize indexes
  • Projection Lag
    • Symptom: Read models behind write tier (> 30 seconds)
    • Diagnosis: Check projection worker health, message bus backlog
    • Resolution: Scale out workers, increase consumer concurrency
  • Data Corruption
    • Symptom: Integrity verification fails
    • Diagnosis: Check segment hashes, Merkle roots
    • Resolution: Restore from backup, replay events from hot tier

  • Diagnostic Tools
  • Azure SQL
    • Dynamic Management Views (DMVs)
    • Query Store (execution stats, query plans)
    • Extended Events (detailed tracing)
  • Azure Blob Storage
    • Storage Analytics logs
    • Metrics (transactions, latency, throttling)
    • Diagnostic settings (Log Analytics)
  • Azure Cosmos DB
    • Request Unit (RU) consumption metrics
    • Partition hot spot detection
    • Slow query log
  • Application Logs

    • Structured logging (Serilog, Application Insights)
    • Correlation IDs for tracing
    • Error aggregation and alerting
  • Performance Troubleshooting Workflow

  1. Identify Symptom: User report, alert, or monitoring dashboard
  2. Isolate Component: API, database, message bus, blob storage?
  3. Gather Metrics: Query DMVs, check dashboards, review logs
  4. Analyze Root Cause: Slow query? Missing index? Resource contention?
  5. Apply Fix: Index, configuration change, scaling
  6. Verify Resolution: Monitor metrics, confirm performance restored
  7. Document: Update runbook, create postmortem

  • Emergency Procedures

  • Database Failover: If primary DB unresponsive
  • Read Replica Activation: If query load too high
  • Circuit Breaker: Temporarily disable non-critical features
  • Rate Limiting: Throttle ingestion if storage overwhelmed
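
Projection lag, used in the "Projection Lag" issue above, is simply the gap between the newest event in the write tier and the newest event the read model has applied. A hedged sketch of the calculation (Python; the plan's actual example will be a SQL query):

```python
from datetime import datetime, timedelta, timezone

# Assumed SLO from the symptom above: read models within 30 seconds of writes.
PROJECTION_LAG_SLO = timedelta(seconds=30)

def projection_lag(latest_written_at: datetime,
                   latest_projected_at: datetime) -> timedelta:
    """Lag = newest write-tier event timestamp minus newest projected timestamp.
    Clamped at zero in case clocks or checkpoints make the projection look ahead."""
    return max(latest_written_at - latest_projected_at, timedelta(0))

def lag_breaches_slo(latest_written_at: datetime,
                     latest_projected_at: datetime,
                     slo: timedelta = PROJECTION_LAG_SLO) -> bool:
    return projection_lag(latest_written_at, latest_projected_at) > slo
```

The two input timestamps would come from the hot tier's latest `CreatedAt` and the projection checkpoint, respectively; a breach triggers the scale-out resolution described above.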

Code Examples: - DMV queries for performance diagnostics - Query Store analysis queries - Projection lag calculation query - Integrity verification script - Emergency failover script (PowerShell)

Diagrams: - Troubleshooting decision tree - Diagnostic tool architecture - Emergency response workflow

Deliverables: - Troubleshooting guide - Diagnostic playbooks - Emergency response procedures - Root cause analysis templates


Topic 29: Data Governance & Compliance

What will be covered:

  • Data Governance Framework
  • Data Stewardship
    • Roles: Data Owner, Data Custodian, Data Steward
    • Responsibilities: Classification, retention, access control
  • Data Catalog
    • Document all data entities (AuditRecord, projections, etc.)
    • Schema registry with versioning
    • Data lineage tracking (source → transformation → destination)
  • Data Quality Standards
    • Completeness: All required fields populated
    • Accuracy: Data matches source system
    • Consistency: Same data across projections
    • Timeliness: Lag within SLO thresholds

  • Access Governance
  • Least Privilege
    • Application service accounts: Read/write scoped to specific tables
    • Analyst accounts: Read-only, tenant-scoped (RLS)
    • Admin accounts: Just-In-Time (JIT) elevation for operations
  • Access Reviews
    • Quarterly review of all database access
    • Revoke unused or expired permissions
    • Audit trail of access grants/revokes
  • Privileged Access Management (PAM)

    • Break-glass accounts for emergencies
    • Multi-factor authentication required
    • All actions logged and reviewed
  • Change Management

  • Schema Changes
    • All changes via FluentMigrator (versioned migrations)
    • Peer review for all schema changes
    • Backward compatibility required
    • Deployment via CI/CD pipeline
  • Configuration Changes

    • All changes in Git (Infrastructure as Code)
    • Approval workflow (pull request + review)
    • Automated testing before production
    • Rollback plan for every change
  • Audit & Compliance

  • Continuous Auditing
    • All data access logged (Read/Write/Delete)
    • Anomaly detection (unusual access patterns)
    • Alert on suspicious activity
  • Compliance Reporting
    • Monthly compliance dashboard
    • Evidence packages for auditors
    • Control attestation (SOC 2, ISO 27001)
  • Breach Response
    • Detection: Automated alerts on unauthorized access
    • Containment: Revoke access, isolate affected systems
    • Investigation: Root cause analysis, forensics
    • Notification: Notify the supervisory authority within 72 hours (GDPR Article 33) and affected users without undue delay (Article 34)
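
The anomaly-detection bullet above amounts to comparing each account's current access volume against its historical baseline. A minimal sketch (Python; the planned example is a SQL query, and the threshold factor and minimum-volume cutoff here are illustrative assumptions):

```python
def flag_anomalous_access(daily_counts, baselines, factor=3.0, min_events=20):
    """Flag accounts whose access count today exceeds `factor` times their
    historical daily baseline, ignoring very low-volume accounts to cut noise.

    daily_counts: {account: access count today}
    baselines:    {account: average daily access count}
    """
    flagged = []
    for account, count in daily_counts.items():
        baseline = baselines.get(account, 0)
        if count >= min_events and count > factor * baseline:
            flagged.append(account)
    return sorted(flagged)
```

Flagged accounts would feed the "alert on suspicious activity" path above; a production version would likely use a statistical baseline (e.g., mean plus N standard deviations) rather than a fixed multiplier.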

Code Examples: - Data catalog schema (metadata tables) - Access review query (list all permissions) - Schema migration example (FluentMigrator) - Audit log query (all access in last N days) - Anomaly detection query (unusual access patterns)

Diagrams: - Data governance framework - Access governance model - Change management workflow - Breach response playbook

Deliverables: - Data governance policy - Access control matrix - Change management procedures - Compliance reporting templates - Breach response plan


Topic 30: Best Practices & Architectural Patterns

What will be covered:

  • Data Architecture Best Practices
  • Immutability First
    • Design for append-only from day one
    • Avoid temptation to add UPDATE logic
    • Use projections for derived views
  • Tenant Isolation
    • TenantId in every table, every row
    • RLS for defense-in-depth
    • Test cross-tenant isolation regularly
  • Event Sourcing
    • Events as source of truth
    • Projections are derived and rebuildable
    • Time-travel queries for investigations
  • CQRS
    • Separate write and read models
    • Optimize each for its use case
    • Embrace eventual consistency
  • Compliance by Design
    • Classification at write time
    • Retention policies enforced automatically
    • Right to erasure via cryptographic methods

  • Performance Best Practices
  • Indexing: Cover hot queries, avoid over-indexing
  • Partitioning: By tenant + time for scalability
  • Caching: Redis for hot queries, TTL-based invalidation
  • Batching: Bulk inserts, reduce round trips
  • Async: Non-blocking I/O, background workers

  • Operational Best Practices

  • Monitoring: Golden signals (latency, throughput, errors, saturation)
  • Alerting: Alert on SLO violations, not arbitrary thresholds
  • Runbooks: Document common procedures (failover, restore, purge)
  • DR Testing: Quarterly drills, measure actual RPO/RTO
  • Automation: Infrastructure as Code, automated testing, CI/CD

  • Security Best Practices

  • Zero Trust: Assume breach, verify every access
  • Encryption: TDE for at-rest, TLS for in-transit, KMS for keys
  • RBAC: Least privilege, JIT elevation for admins
  • Auditing: Log all access, immutable audit trail
  • Secrets: Azure Key Vault, Managed Identity (no credentials in code)

  • Anti-Patterns to Avoid

  • Mutable Audit Logs: Never allow updates/deletes on audit records
  • Cross-Tenant Queries: Never query across tenants (except audited admin operations)
  • Missing Idempotency Keys: Always provide on critical operations
  • Over-Normalization: Projections should be denormalized for reads
  • Synchronous Everything: Use async/messaging for non-critical paths
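
Two of the patterns above — "Immutability First" and "Missing Idempotency Keys" — combine naturally in the append path. The sketch below (Python; an in-memory stand-in for the real store) shows the shape of the guard; in production the `_seen_keys` set would be a unique index on the idempotency key column.

```python
class AppendOnlyStore:
    """Append-only event store with idempotency-key deduplication.
    No update or delete operations are exposed, by design."""

    def __init__(self):
        self._events = []          # immutable history: append is the only mutation
        self._seen_keys = set()    # idempotency keys already applied

    def append(self, idempotency_key: str, event: dict) -> bool:
        """Append the event; return False (a safe no-op) on duplicate delivery."""
        if idempotency_key in self._seen_keys:
            return False
        self._seen_keys.add(idempotency_key)
        self._events.append(event)
        return True

    def events(self) -> tuple:
        return tuple(self._events)  # read-only view; projections derive from this
```

Because `append` is idempotent, at-least-once delivery from the message bus is safe, and projections can always be rebuilt by replaying `events()` from the start.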

Code Examples: - Immutable entity pattern (C#) - RLS policy template (reusable) - Event sourcing aggregate pattern - CQRS command/query handlers - Monitoring dashboard query (Kusto) - Security configuration checklist

Diagrams: - Best practices reference architecture - Anti-patterns to avoid (before/after) - Security-hardened data layer - Operational excellence framework

Deliverables: - Data architecture best practices guide - Pattern catalog (with examples) - Anti-pattern avoidance checklist - Security hardening guide - Operational excellence checklist


Summary of Deliverables

Across all 14 cycles, this documentation will provide:

  1. Foundational Knowledge
     • Data architecture vision and principles
     • Storage technology selection guide
     • Storage tiers (Hot/Warm/Cold) design

  2. Multi-Tenant Isolation
     • Tenant isolation patterns (logical and physical)
     • Row-Level Security implementation
     • Data residency enforcement

  3. Data Models
     • AuditRecord canonical schema
     • Supporting entities (policies, integrity, exports)
     • Projection models (AuditEvents, ResourceEvents, ActorEvents)

  4. Event Sourcing & CQRS
     • Event store design and replay procedures
     • CQRS command/query separation
     • Projection synchronization patterns

  5. Technology Implementations
     • Azure SQL (schema, RLS, partitioning, performance)
     • Azure Cosmos DB (partition strategy, change feed)
     • Azure Blob Storage (WORM, segments, integrity)

  6. Lifecycle Management
     • Retention policies and lifecycle states
     • Right to erasure (GDPR Article 17)
     • Legal hold management

  7. Compliance & Security
     • Data residency architecture
     • Compliance certifications (SOC 2, ISO 27001, GDPR, HIPAA)
     • Backup, recovery, and disaster recovery

  8. Performance & Operations
     • Query and write optimization
     • Monitoring and alerting
     • Troubleshooting and diagnostics
     • Best practices and anti-patterns

Next Steps

  1. Review & Approval: Validate cycle plan with architecture and data teams
  2. Cycle 1 Generation: Begin content generation for data architecture overview
  3. Schema Scripts: Prepare SQL DDL scripts for all tables
  4. Code Samples: Develop C# implementations for data access patterns
  5. Pilot Tenant: Implement full data layer for dev environment tenant


This documentation plan covers the complete data architecture for ATP, from foundational principles to operational best practices, ensuring a secure, scalable, compliant, and high-performance data layer for the Audit Trail Platform.