
Environments - Audit Trail Platform (ATP)

Environment isolation by design — ATP enforces separation across dev, test, staging, and production with graduated controls and approval workflows.


Purpose & Scope

This document defines the multi-environment deployment strategy for the ConnectSoft Audit Trail Platform (ATP), establishing how applications progress from development through production with graduated controls, environment isolation, and compliance-aware configuration management at each stage.

ATP operates across six distinct environment tiers — each with specific purposes, data handling requirements, approval workflows, and infrastructure characteristics. This separation ensures safe experimentation in lower environments while maintaining production stability, security, and regulatory compliance in higher tiers.

What this document covers

  • Establish ATP's environment topology across all deployment tiers: Preview (ephemeral PR environments), Dev, Test, Staging, Production, and Hotfix with clear boundaries and characteristics.
  • Define environment-specific infrastructure using Infrastructure as Code (IaC) overlays: SKU tiers, scaling policies, networking configurations, and regional deployment patterns.
  • Specify configuration management hierarchy: appsettings.json layering, Azure App Configuration integration, environment variables, and Key Vault secret references.
  • Detail secrets and key management per environment: Key Vault organization, secret categories, rotation policies, and managed identity patterns.
  • Describe promotion workflows and approval gates: automated progression (Dev → Test), manual approvals (Staging, Production), Change Advisory Board (CAB) processes, and rollback procedures.
  • Outline networking and security boundaries: VNet isolation, private endpoints, NSG rules, egress controls, and environment-specific access policies.
  • Document data management strategies per environment: synthetic data generation (Dev/Test), production-like datasets (Staging), live tenant data with compliance controls (Production).
  • Specify observability and monitoring configurations: telemetry sampling rates, log retention policies, Application Insights settings, and alerting thresholds.
  • Define cost management and optimization: environment budgets, SKU selection rationale, auto-shutdown policies, reserved instances, and cost alerts.
  • Detail disaster recovery and high availability: RPO/RTO targets per environment, multi-region topology, failover procedures, and DR drill cadence.
  • Outline compliance and audit requirements: environment-specific policy enforcement (encryption, immutability, access reviews, audit logging).
  • Describe testing strategies per environment: unit/integration (Dev), smoke/regression (Test), load/chaos (Staging), synthetic monitors/canary (Production).

Out of scope (referenced elsewhere)

  • CI/CD pipeline implementation details and template structure (see azure-pipelines.md).
  • Quality gate policies, code coverage thresholds, and security scanning rules (see quality-gates.md).
  • Detailed security controls, threat model, and compliance framework mappings (see security-compliance.md).
  • Data residency rules, retention policies, and legal hold procedures (see data-residency-retention.md).
  • Service-specific business logic, domain models, or API contracts (see service repositories and hld.md).
  • Operational runbooks for incident response, on-call procedures, and troubleshooting (see runbook.md).

Readers & ownership

  • Platform Engineering/DevOps (owners): Environment topology, IaC overlays, promotion workflows, infrastructure provisioning, and cost optimization.
  • SRE/Operations: Disaster recovery planning, failover procedures, environment health monitoring, capacity planning, and incident response coordination.
  • Security/Compliance: Environment-specific security controls, access policies, secret management, compliance enforcement, and audit evidence collection.
  • Service Teams: Environment-specific configuration, feature flag management, deployment validation, and testing strategies.
  • QA/Test Engineering: Test environment maintenance, test data management, regression testing, and quality validation.
  • Finance/FinOps: Environment cost budgets, resource optimization, cost allocation, and financial forecasting.

Artifacts produced

  • Infrastructure as Code (IaC): Pulumi stacks per environment with C# code defining Azure resources, networking, security configurations, and observability integrations.
  • Environment Configurations: appsettings.json overlays, Azure App Configuration feature flags, environment-specific variable groups in Azure DevOps.
  • Secret Management: Key Vault per environment with organized secret categories, access policies, rotation schedules, and audit logs.
  • Deployment Manifests: Environment-specific deployment receipts with version history, configuration snapshots, and rollback points.
  • Network Topology Diagrams: VNet/subnet layouts, private endpoint configurations, NSG rule sets, and cross-environment isolation boundaries.
  • Observability Configurations: Application Insights instrumentation keys, Log Analytics workspaces, sampling policies, and alert rules per environment.
  • Cost Reports: Monthly environment cost breakdowns, budget alerts, optimization recommendations, and resource utilization dashboards.
  • DR Plans: Environment-specific disaster recovery procedures, failover runbooks, RPO/RTO validation reports, and drill evidence.
  • Compliance Evidence: Environment audit trails, access reviews, encryption verification, immutability proofs, and policy enforcement logs.

Acceptance (done when)

  • All six environment tiers (Preview, Dev, Test, Staging, Production, Hotfix) are provisioned with appropriate infrastructure, networking, and security configurations.
  • Promotion workflows are operational with automated gates (Dev → Test), manual approvals (Test → Staging → Production), and rollback procedures validated.
  • Configuration management hierarchy is established with clear precedence (appsettings → App Configuration → environment variables → Key Vault) and documented for all services.
  • Secrets management is operational with Key Vault per environment, managed identity access, automatic rotation for Production, and no plaintext secrets in code or configurations.
  • Environment isolation is verified with network segmentation (separate VNets/subscriptions), private endpoints (Staging/Production), and cross-environment access denied by default.
  • Observability is configured with appropriate telemetry levels (100% sampling Dev, 10% Production), log retention policies, dashboards, and alerts per environment.
  • Cost management is active with environment budgets, cost alerts at 80% threshold, auto-shutdown policies (Dev/Test), and monthly cost reviews scheduled.
  • Disaster recovery procedures are documented and tested with DR drills per environment (quarterly Staging/Production), failover automation validated, and RPO/RTO targets met.
  • Compliance controls are enforced appropriately per environment (relaxed Dev/Test, production-like Staging, full compliance Production) with audit evidence collection operational.
  • Testing strategies are implemented per environment with appropriate test suites, automation, and validation criteria defined and operational.
  • Documentation is complete, with comprehensive examples, runbooks, troubleshooting guides, and cross-references to related documents.

Environment Topology Overview

ATP's environment strategy follows the graduated control principle: lower environments prioritize developer velocity and rapid iteration, while higher environments enforce stability, security, and compliance. Each tier serves a distinct purpose in the software delivery lifecycle, with increasing levels of control, approval requirements, and production-like characteristics.

This topology balances innovation speed (developers can experiment freely in Dev/Preview) with risk management (Production changes require multiple approvals and validation). Environment progression acts as a quality funnel — defects are caught early, performance is validated under load, and compliance controls are verified before reaching live tenants.

Environment Tiers

ATP operates six environment tiers, each with specific characteristics, deployment triggers, and control requirements:

Standard Environment Tiers

| Environment | Purpose | Data | Change Frequency | Approval | Uptime SLA |
|---|---|---|---|---|---|
| Preview | Per-PR ephemeral | Synthetic/mock | Continuous (per commit) | None | Best-effort |
| Dev | Integration playground | Synthetic + sample | Multiple per day | None | 95% |
| Test (QA) | System verification | Stable test datasets | 1-2 per day | None | 98% |
| Staging | Pre-production validation | Production-like | 1-2 per week | 1 approver | 99.5% |
| Production | Live tenant traffic | Real tenant data | 1-2 per month | 2 approvers + CAB | 99.9% |
| Hotfix | Emergency patches | Production clone | As needed | 2 approvers (expedited) | 99.9% |

Environment Characteristics

Isolation:

  • Network: Separate VNets (Staging/Production) or shared VNet with subnet isolation (Dev/Test).
  • Subscriptions: Dedicated Azure subscriptions for Production; shared subscriptions for lower environments with resource group separation.
  • Access: No cross-environment access; developers cannot access Staging/Production data or secrets.
  • Blast Radius: Failures in Dev/Test do not impact Production; environment failures are contained.

Tenancy:

  • Dev/Test: Shared synthetic tenants (tenant-001, tenant-002, etc.); non-production data only.
  • Staging: Production-like synthetic tenants with realistic data volumes and obfuscated PII.
  • Production: Isolated per real tenant with strict tenant boundaries and compliance controls.

Immutability:

  • Dev/Test: Disabled; data can be modified/deleted for testing scenarios.
  • Staging: Enabled; mimics production WORM storage and tamper-evidence.
  • Production: Fully enforced with WORM storage, hash-chained segments, legal holds, and audit trails.

Observability:

  • Dev/Test: Verbose logging (Debug level), 100% trace sampling, local Seq containers, 7-14 day retention.
  • Staging: Production logging levels (Warning), 25% sampling, Azure Log Analytics, 30-day retention.
  • Production: Optimized logging (Warning/Error), 10% adaptive (intelligent) sampling, 90-day hot retention + 7-year archive.

Environment Progression Model

Deployment Flow:

flowchart LR
    DEV[Development] -->|Auto| TEST[Test/QA]
    TEST -->|Manual Approval| STAGE[Staging]
    STAGE -->|CAB + 2 Approvals| PROD[Production]

    PR[Feature Branch] -.->|Ephemeral| PREVIEW[Preview Environment]

    HOTFIX[Hotfix Branch] -.->|Expedited| PROD

    style DEV fill:#90EE90
    style TEST fill:#FFD700
    style STAGE fill:#FFA500
    style PROD fill:#FF6347
    style PREVIEW fill:#87CEEB
    style HOTFIX fill:#FF69B4

Promotion Gates:

  • Dev → Test: Automated (CI pipeline success, smoke tests pass, no critical bugs).
  • Test → Staging: Manual approval (Lead Engineer), regression tests green, no P1/P2 bugs, performance benchmarks met.
  • Staging → Production: Manual approval (2 approvers: Architect + SRE), CAB approval, change window scheduled, no active incidents, load tests pass.
  • Hotfix → Production: Expedited approval (2 approvers), incident ticket linked, limited scope, mandatory post-deployment validation.
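In Azure DevOps, these gates map to approvals and checks attached to the target environments rather than to pipeline YAML. A minimal sketch, assuming stages and environments named `Deploy_Test`, `ATP-Staging`, and `ATP-Production` (only `ATP-Dev` is confirmed elsewhere in this document):

```yaml
# Stages reference the environments; approver lists and CAB checks live on the
# environments themselves (Pipelines > Environments > Approvals and checks).
- stage: Deploy_Staging
  dependsOn: Deploy_Test
  jobs:
  - deployment: DeployToStaging
    environment: ATP-Staging      # 1 approver (Lead Engineer) configured on this environment
    strategy:
      runOnce:
        deploy:
          steps:
          - script: echo "Deploy build to Staging"

- stage: Deploy_Production
  dependsOn: Deploy_Staging
  jobs:
  - deployment: DeployToProduction
    environment: ATP-Production   # 2 approvers + CAB check configured on this environment
    strategy:
      runOnce:
        deploy:
          steps:
          - script: echo "Deploy build to Production"
```

Because approvals are environment-scoped, promoting a service to Production automatically inherits the CAB gate without per-pipeline configuration.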

Rollback Strategy:

  • Dev/Test: Redeploy previous version (5 minutes).
  • Staging: Blue-green slot swap (2 minutes).
  • Production: Canary rollback or blue-green swap (1-3 minutes); automatic rollback on health check failures or error rate thresholds.
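The Staging and Production blue-green paths rely on App Service deployment slot swaps. A minimal sketch that builds (but does not execute) the swap command so it can be reviewed or fed to a runbook; the resource group and app names are illustrative, not taken from this document:

```shell
#!/bin/bash
# Build the slot-swap rollback command for a blue-green App Service setup.
# Swapping "staging" back into "production" restores the previously live bits.
build_rollback_cmd() {
  local rg="$1" app="$2"
  echo "az webapp deployment slot swap --resource-group ${rg} --name ${app} --slot staging --target-slot production"
}

build_rollback_cmd atp-rg-staging-eus atp-gateway-staging-eus
```

Echoing rather than executing keeps the sketch safe to run anywhere; in a real runbook the generated command would be executed by an identity with deploy permissions.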

Environment Lifecycle

Preview Environments (Ephemeral):

  • Creation: Automatically provisioned when PR is opened against master/main branch.
  • Lifespan: Exists while PR is open; auto-deleted 4 hours after PR merge/close.
  • Purpose: Validate feature changes in isolation before merging to main branch.
  • Infrastructure: Lightweight Azure Container Instances (ACI) or AKS namespaces; minimal dependencies.
  • Cost Optimization: Pay-per-second billing; serverless SQL (auto-pause); shared infrastructure where possible.
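The 4-hour auto-delete above can be enforced by a scheduled sweep pipeline. A sketch, assuming a hypothetical helper script (`scripts/list-stale-previews.sh`) that reports preview stacks whose PR has been merged or closed for longer than the TTL:

```yaml
# Hourly sweep that destroys preview stacks past their 4-hour TTL.
schedules:
- cron: "0 * * * *"
  displayName: Preview TTL sweep
  branches:
    include: [main]
  always: true

steps:
- script: |
    # 'scripts/list-stale-previews.sh' is an assumed helper, not part of this document.
    for stack in $(./scripts/list-stale-previews.sh --ttl-hours 4); do
      pulumi stack select "$stack"
      pulumi destroy --yes
      pulumi stack rm --yes
    done
  displayName: 'Destroy stale preview stacks'
```

The sweep complements the PR-triggered `Cleanup_Preview` stage shown later, catching stacks whose cleanup stage failed or never ran.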

Persistent Environments (Dev, Test, Staging, Production, Hotfix):

  • Provisioning: Created once via IaC; updated through infrastructure pipelines.
  • Maintenance: Regular updates to match production topology; quarterly infrastructure refreshes.
  • Decommissioning: Only with approval; data exported and archived before deletion.

Environment Naming & Tagging

Resource Naming Pattern:

atp-{service}-{env}-{region}

Examples:
- atp-ingestion-dev-eus
- atp-query-prod-weu
- atp-gateway-staging-apse
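The naming rule can be captured in a small helper. A sketch with a guard for the environment codes used in this document:

```shell
#!/bin/bash
# Compose a resource name per the atp-{service}-{env}-{region} pattern,
# rejecting environment codes outside the documented tiers.
atp_name() {
  local service="$1" env="$2" region="$3"
  case "$env" in
    preview|dev|test|staging|prod|hotfix) ;;
    *) echo "unknown env: $env" >&2; return 1 ;;
  esac
  echo "atp-${service}-${env}-${region}"
}

atp_name ingestion dev eus    # → atp-ingestion-dev-eus
atp_name query prod weu       # → atp-query-prod-weu
```

Centralizing the pattern in IaC code (Pulumi) or a shared script prevents drift between services.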

Required Tags (All Resources):

| Tag | Example Value | Purpose |
|---|---|---|
| Environment | dev, test, staging, prod | Environment identification |
| Service | ingestion, query, gateway | Service ownership |
| CostCenter | ATP-Platform, ATP-Services | Cost allocation |
| Owner | platform-team@connectsoft.example | Responsible team |
| Compliance | gdpr, hipaa, soc2 | Compliance scope |
| DataClassification | public, internal, restricted, secret | Data sensitivity |
| ManagedBy | pulumi, bicep, manual | IaC tool |
| BackupRequired | true, false | Backup policy |
| DR-Tier | critical, important, standard | DR priority |

Tag Enforcement:

  • Azure Policy: Enforces required tags on resource creation; denies resources without tags.
  • Cost Management: Tags enable cost breakdowns by environment, service, and team.
  • Compliance: Tags identify resources requiring specific controls (encryption, audit, backup).
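Before a deployment even reaches the Azure Policy deny rule, the same requirement can be pre-checked client-side. A sketch that validates a `key=value` tag list against the required-tag set above:

```shell
#!/bin/bash
# Pre-flight check mirroring the Azure Policy tag rule: every required tag
# must appear in the key=value list passed as arguments.
REQUIRED_TAGS=(Environment Service CostCenter Owner Compliance DataClassification ManagedBy BackupRequired DR-Tier)

check_tags() {
  local tags=" $* "
  local t
  for t in "${REQUIRED_TAGS[@]}"; do
    case "$tags" in
      *" ${t}="*) ;;
      *) echo "missing required tag: ${t}" >&2; return 1 ;;
    esac
  done
  echo "all required tags present"
}

check_tags Environment=dev Service=ingestion CostCenter=ATP-Platform \
  Owner=platform-team@connectsoft.example Compliance=soc2 \
  DataClassification=internal ManagedBy=pulumi BackupRequired=true DR-Tier=standard
```

Running this in CI before `pulumi up` fails fast with an actionable message instead of a policy deny at resource creation.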

Environment Health & Status

Health Indicators (Per Environment):

environmentHealth:
  dev:
    status: healthy | degraded | down
    deployments:
      lastSuccessful: 2025-10-30T08:15:00Z
      lastFailed: 2025-10-29T14:22:00Z
      successRate24h: 94.2%
    services:
      online: 7/7
      degraded: 0
    alerts:
      active: 2
      severity: [warning, warning]

  production:
    status: healthy
    deployments:
      lastSuccessful: 2025-10-28T02:30:00Z
      successRate24h: 100%
    services:
      online: 7/7
      degraded: 0
    sla:
      current: 99.97%
      target: 99.9%
    alerts:
      active: 0

Dashboard Integration:

  • Azure DevOps: Environment health widgets on team dashboards.
  • Application Insights: Environment-specific workbooks with health trends.
  • Status Page: Public status page showing Production health (no sensitive details).

Detailed Environment Specifications

Each environment tier has specific infrastructure configurations, data management policies, deployment patterns, and operational characteristics tailored to its purpose in the software delivery lifecycle. This section provides comprehensive specifications for all six environments.

Preview Environment (Ephemeral)

Purpose: Provide isolated, full-stack environments for pull request validation before code merges to the main branch. Preview environments enable developers to test features with realistic dependencies without impacting shared Dev/Test environments.

Lifecycle:

  • Creation: Automatically triggered when PR is opened against master/main branch.
  • Provisioning Time: 5-10 minutes (full ATP stack with 7 services).
  • Lifespan: Active while PR is open; auto-deleted 4 hours after PR merge/close (prevents resource leaks).
  • Naming: atp-preview-pr-{PR-ID}-{region} (e.g., atp-preview-pr-1234-eus).

Infrastructure:

# Preview Environment (Lightweight, Ephemeral)
compute:
  type: Azure Container Instances (ACI)
  services:  # representative subset; the remaining ATP services follow the same pattern
    - atp-gateway-preview-pr-{ID}
    - atp-ingestion-preview-pr-{ID}
    - atp-query-preview-pr-{ID}
  instances: 1 per service
  sku: 1 vCPU, 1.5 GB RAM

storage:
  sql:
    type: Azure SQL Database (Serverless)
    tier: General Purpose
    vCores: 2
    autoPauseDelay: 60 minutes  # Auto-pause when inactive
  redis:
    type: Azure Cache for Redis
    sku: Basic C0 (250 MB)
  serviceBus:
    type: Azure Service Bus
    tier: Basic
  blobStorage:
    type: Azure Blob Storage
    tier: Hot
    retention: 7 days

networking:
  vnet: Shared Preview VNet (10.10.0.0/16)
  subnet: Dynamic allocation per PR (10.10.{PR-ID}.0/24)
  privateEndpoints: false  # Public endpoints for cost savings
  nsg: Allow HTTP/HTTPS from CI/CD agents

observability:
  logging:
    level: Debug
    sink: Ephemeral Seq container
  tracing:
    sampling: 100%
  metrics:
    enabled: true
    retention: PR lifespan only

costOptimization:
  budget: $10 per PR per day
  autoShutdown: 4 hours after last activity
  cleanup: Aggressive (delete on PR close)

Data Management:

  • Data Source: Minimal synthetic data seeded at creation (10 sample audit events per tenant).
  • Tenants: 2 synthetic tenants (preview-tenant-1, preview-tenant-2).
  • Immutability: Disabled (not needed for short-lived validation).
  • Backup: None (ephemeral data; recreate from seed scripts if needed).

Deployment:

# azure-pipelines.yml - PR trigger
pr:
  branches:
    include: [master, main]

stages:
- stage: Provision_Preview
  jobs:
  - job: Create_Ephemeral_Stack
    steps:
    - script: |
        # Create Pulumi stack for this PR
        pulumi stack select atp-preview-pr-$(System.PullRequest.PullRequestId) --create
        pulumi config set environment preview
        pulumi config set prId $(System.PullRequest.PullRequestId)
        pulumi up --yes

        # Capture service URLs
        echo "##vso[task.setvariable variable=GatewayUrl;isOutput=true]$(pulumi stack output GatewayUrl)"
      name: provision  # step must be named so its output variable can be referenced across stages
      displayName: 'Provision Preview Environment'

- stage: Test_Preview
  dependsOn: Provision_Preview
  jobs:
  - job: Integration_Tests
    variables:
      GatewayUrl: $[ stageDependencies.Provision_Preview.Create_Ephemeral_Stack.outputs['provision.GatewayUrl'] ]
    steps:
    - script: |
        dotnet test tests/Integration.Tests.csproj \
          --environment:GatewayUrl=$(GatewayUrl) \
          --filter Category=PullRequest
      displayName: 'Run Integration Tests Against Preview'

- stage: Cleanup_Preview
  condition: always()  # Always cleanup, even on failure
  jobs:
  - job: Destroy_Ephemeral_Stack
    steps:
    - script: |
        pulumi stack select atp-preview-pr-$(System.PullRequest.PullRequestId)
        pulumi destroy --yes
        pulumi stack rm --yes
      displayName: 'Destroy Preview Environment'

Use Cases:

  • Feature Validation: Test new features against full ATP stack before merge.
  • Integration Testing: Validate service-to-service communication with realistic dependencies.
  • Breaking Change Detection: Ensure API contract compatibility across services.
  • Performance Baseline: Quick performance validation (not comprehensive load testing).

Limitations:

  • No Production Data: Only synthetic data; cannot validate against real tenant scenarios.
  • Limited Scale: Single instance per service; not suitable for load testing.
  • Short-Lived: 4-hour maximum; not for long-running testing.
  • Cost Constraints: Minimal infrastructure; simplified topology.

Dev Environment (Integration Playground)

Purpose: Primary environment for continuous integration and developer experimentation. Dev environment receives deployments on every commit to main branch, enabling rapid feedback loops and integration testing.

Lifecycle:

  • Always-On: Persistent environment; never deleted.
  • Deployment Frequency: Multiple times per day (every main branch commit).
  • Uptime Target: 95% (tolerates brief downtime for infrastructure updates).
  • Maintenance Window: Weeknights 10 PM - 6 AM local time.

Infrastructure:

# Dev Environment (Cost-Optimized, Shared)
compute:
  type: Azure App Service (Linux)
  services:
    - atp-gateway-dev-eus
    - atp-ingestion-dev-eus
    - atp-query-dev-eus
    - atp-integrity-dev-eus
    - atp-export-dev-eus
    - atp-policy-dev-eus
    - atp-search-dev-eus
  sku: B1 (Basic)
  instances: 1 per service
  autoscale: false

storage:
  sql:
    name: atp-sql-dev-eus
    tier: Basic
    dtu: 5
    maxSizeGB: 2
    geoReplication: false
    backupRetention: 7 days
  cosmos:
    name: atp-cosmos-dev-eus
    tier: Standard
    throughput: 400 RU/s (manual)
    consistency: Session
  redis:
    name: atp-redis-dev-eus
    sku: Basic C0
    capacity: 250 MB
    persistence: false
  serviceBus:
    name: atp-servicebus-dev-eus
    tier: Basic
    messaging: Standard queues/topics
  blobStorage:
    name: atpstoragedeveus
    tier: Hot
    replication: LRS (Locally Redundant)
    retention: 30 days

networking:
  vnet: Shared ATP VNet (10.0.0.0/16)
  subnet: Dev Subnet (10.0.1.0/24)
  nsg: 
    - Allow HTTPS from developer IPs
    - Allow RDP/SSH from VPN (jumpbox access)
    - Allow all within subnet (service-to-service)
  publicEndpoints: 
    enabled: true
    ipWhitelist: 
      - Developer IPs
      - CI/CD agents
      - VPN gateway range
  privateEndpoints: false  # Cost optimization

identity:
  managedIdentity: System-assigned per App Service
  keyVault: atp-keyvault-dev-eus
  rbac:
    - Developers: Contributor on resource group
    - Service Principals: Reader on secrets

observability:
  appInsights: atp-appinsights-dev-eus
  logAnalytics: atp-loganalytics-dev-eus
  logging:
    level: Debug
    sinks: [Console, AppInsights, Seq]
  tracing:
    sampling: 100%
    exportInterval: 5 seconds
  metrics:
    all: true
    customDimensions: [tenantId, service, operation]
  retention:
    logs: 7 days (hot)
    traces: 7 days
    metrics: 30 days

costManagement:
  monthlyBudget: $500
  autoShutdown:
    enabled: true
    schedule: "Weeknights 8 PM - 6 AM, Weekends"
    timezone: Eastern
  alerts:
    - threshold: 80% of budget
    - anomaly: >50% daily spike
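The auto-shutdown schedule above ultimately stops each Dev App Service. A sketch that emits one stop command per service; commands are echoed for review, and the resource group name is illustrative:

```shell
#!/bin/bash
# Emit an "az webapp stop" command for every ATP service in Dev.
# Echoed (not executed) so the list can be reviewed or piped to a scheduler job.
SERVICES=(gateway ingestion query integrity export policy search)

emit_stop_cmds() {
  local svc
  for svc in "${SERVICES[@]}"; do
    echo "az webapp stop --resource-group atp-rg-dev-eus --name atp-${svc}-dev-eus"
  done
}

emit_stop_cmds
```

A matching `az webapp start` loop run at 6 AM restores the environment for the workday.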

Configuration (appsettings.Development.json):

{
  "Logging": {
    "LogLevel": {
      "Default": "Debug",
      "Microsoft": "Information",
      "Microsoft.EntityFrameworkCore": "Information",
      "System": "Information"
    },
    "Console": {
      "IncludeScopes": true
    }
  },
  "ConnectionStrings": {
    "DefaultConnection": "Server=atp-sql-dev-eus.database.windows.net;Database=ATP_Dev;Authentication=Active Directory Managed Identity;",
    "Redis": "atp-redis-dev-eus.redis.cache.windows.net:6380,ssl=True,abortConnect=False",
    "ServiceBus": "Endpoint=sb://atp-servicebus-dev-eus.servicebus.windows.net/;Authentication=Managed Identity",
    "CosmosDb": "AccountEndpoint=https://atp-cosmos-dev-eus.documents.azure.com:443/"  // account key omitted; CosmosClient authenticates in code via managed identity
  },
  "Audit": {
    "EnableImmutability": false,
    "EnableTamperEvidence": false,
    "EnableHashChaining": false,
    "RetentionDays": 30,
    "WormStorage": false,
    "SegmentSize": 1000  // Smaller segments for faster testing
  },
  "Compliance": {
    "StrictInDevelopment": true,
    "EnableLoggingRedaction": true,  // Practice redaction even in dev
    "SimulateComplianceChecks": true,
    "Profile": "development"
  },
  "OpenTelemetry": {
    "ServiceName": "atp-{service}-dev",
    "ExporterEndpoint": "http://otel-collector-dev:4317",
    "SamplingRatio": 1.0,  // 100% sampling
    "ExportIntervalSeconds": 5
  },
  "FeatureManagement": {
    "TamperEvidenceV2": true,
    "AdvancedQueryFilters": true,
    "AIAssistedAnomalyDetection": true,  // Enable all features in dev
    "ExperimentalFeatures": true
  },
  "RateLimiting": {
    "Enabled": false,  // No rate limits in dev
    "PermitLimit": 0,
    "Window": 0
  }
}

Data Management:

// Dev Data Seeding (Example: C# Seeder)
public class DevDataSeeder
{
    private readonly IAuditDbContext _context;

    public DevDataSeeder(IAuditDbContext context) => _context = context;

    public async Task SeedDevEnvironmentAsync()
    {
        // Create 10 synthetic tenants (materialized so the same instances are reused below)
        var tenants = Enumerable.Range(1, 10)
            .Select(i => new Tenant
            {
                TenantId = $"dev-tenant-{i:D3}",
                Name = $"Development Tenant {i}",
                Edition = i <= 3 ? "Standard" : (i <= 7 ? "Business" : "Enterprise"),
                Region = i % 2 == 0 ? "US" : "EU",
                CreatedAt = DateTime.UtcNow.AddDays(-30)
            })
            .ToList();

        await _context.Tenants.AddRangeAsync(tenants);

        // Seed 1000 audit events per tenant
        foreach (var tenant in tenants)
        {
            var events = Enumerable.Range(1, 1000)
                .Select(i => new AuditEvent
                {
                    EventId = $"evt-{tenant.TenantId}-{i:D6}",
                    TenantId = tenant.TenantId,
                    Timestamp = DateTime.UtcNow.AddHours(-i),
                    Actor = $"user-{i % 10}@{tenant.Name}",
                    Action = GetRandomAction(),
                    Resource = $"resource-{i % 50}",
                    Outcome = i % 20 == 0 ? "Denied" : "Allowed",
                    Metadata = GenerateSyntheticMetadata()
                })
                .ToList();

            await _context.AuditEvents.AddRangeAsync(events);
        }

        await _context.SaveChangesAsync();
    }
}

Bash Seeding Script:

#!/bin/bash
# seed-dev-environment.sh

echo "Seeding Dev Environment with synthetic data..."

# Seed database
dotnet run --project tools/DataSeeder \
  --environment Development \
  --tenants 10 \
  --events-per-tenant 1000 \
  --start-date "2025-09-01" \
  --end-date "2025-10-30"

# Seed Redis cache
redis-cli -h atp-redis-dev-eus.redis.cache.windows.net -p 6380 -a $(az keyvault secret show --name RedisPassword --vault-name atp-keyvault-dev-eus --query value -o tsv) --tls <<EOF
SET session:dev-tenant-001 "{\\"userId\\":\\"dev-user-1\\",\\"expiresAt\\":\\"2025-10-31T00:00:00Z\\"}"
SET session:dev-tenant-002 "{\\"userId\\":\\"dev-user-2\\",\\"expiresAt\\":\\"2025-10-31T00:00:00Z\\"}"
EOF

echo "✅ Dev environment seeded successfully"

Access Control:

  • Developers: Full access (Contributor role on resource group).
  • CI/CD: Managed identity with deploy permissions.
  • External Access: IP-whitelisted (developer office IPs, VPN range).
  • Key Vault: Developers can read secrets (for local debugging).

Use Cases:

  1. Continuous Integration: Every main branch commit deploys automatically; developers see changes within 10 minutes.
  2. Integration Testing: Service-to-service communication validated with real dependencies (Redis, SQL, Service Bus).
  3. Local Development Parity: Developers can test against Dev environment to debug integration issues.
  4. Experimental Features: New features enabled by default; breaking changes tested before Test environment.
  5. SDK Testing: Client SDK developers test against Dev APIs with realistic responses.

Deployment Pattern:

# Deployment: Automated on every main commit
trigger:
  branches:
    include: [master, main]

stages:
- stage: CI_Stage
  # ... build, test, security scans

- stage: Deploy_Dev
  dependsOn: CI_Stage
  condition: succeeded()
  jobs:
  - deployment: DeployToDev
    environment: ATP-Dev  # No approval required
    strategy:
      runOnce:
        deploy:
          steps:
          - template: deploy/deploy-microservice-to-azure-web-site.yaml@templates
            parameters:
              azureSubscription: $(azureSubscription)
              appName: atp-ingestion-dev-eus
              package: $(Pipeline.Workspace)/drop/*.zip
              appSettings: |
                -ASPNETCORE_ENVIRONMENT "Development"
                -Audit__EnableImmutability "false"

Monitoring & Alerts:

  • Health Checks: Every 5 minutes; alert on 3 consecutive failures.
  • Error Rate: Alert if >10% (relaxed threshold).
  • Deployment Failures: Notify #dev-team Slack channel.
  • Cost Alerts: Notify if approaching $500/month budget.
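The "3 consecutive failures" health-check rule can be sketched as follows; `probe` is a stand-in for the real HTTP call against the service health endpoint:

```shell
#!/bin/bash
# Demonstrate the "alert on 3 consecutive failures" rule.
# 'probe' stubs the HTTP health check (e.g. curl -fsS .../healthz) and is
# wired to fail here so the alert path is exercised.
probe() { return 1; }

consecutive_failures=0
alert_fired=0
for attempt in 1 2 3 4 5; do
  if probe; then
    consecutive_failures=0           # any success resets the streak
  else
    consecutive_failures=$((consecutive_failures + 1))
  fi
  if [ "$consecutive_failures" -ge 3 ] && [ "$alert_fired" -eq 0 ]; then
    echo "ALERT: health check failed ${consecutive_failures} times in a row"
    alert_fired=1
  fi
done
```

Resetting the counter on success is what makes the rule "consecutive" rather than cumulative, avoiding alerts on isolated blips.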

Test Environment (System Verification)

Purpose: Dedicated environment for automated regression testing, QA validation, and integration verification with stable datasets. Test environment acts as the quality gate before promoting to Staging.

Lifecycle:

  • Always-On: Persistent environment with controlled refresh cycles.
  • Deployment Frequency: 1-2 times per day (after Dev soak period).
  • Uptime Target: 98% (planned downtime for test data refresh).
  • Maintenance Window: Nightly 2 AM - 4 AM local time.

Infrastructure:

# Test Environment (QA-Grade, Stable)
compute:
  type: Azure App Service (Linux)
  services: All 7 ATP services
  sku: S1 (Standard)
  instances: 2 per service (for blue-green testing)
  autoscale: false
  alwaysOn: true

storage:
  sql:
    name: atp-sql-test-eus
    tier: Standard S1
    dtu: 20
    maxSizeGB: 10
    geoReplication: false
    backupRetention: 14 days
    pointInTimeRestore: true
  cosmos:
    name: atp-cosmos-test-eus
    tier: Standard
    throughput: 1000 RU/s (manual)
    consistency: BoundedStaleness
  redis:
    name: atp-redis-test-eus
    sku: Standard C1
    capacity: 1 GB
    persistence: RDB (snapshots every 15 min)
  serviceBus:
    name: atp-servicebus-test-eus
    tier: Standard
    messaging: Standard + Topics
  blobStorage:
    name: atpstoragetesteus
    tier: Hot
    replication: GRS (Geo-Redundant for DR testing)
    retention: 60 days
  elasticsearch:
    name: atp-search-test-eus
    tier: Basic
    nodes: 2
    storage: 100 GB

networking:
  vnet: Shared ATP VNet (10.0.0.0/16)
  subnet: Test Subnet (10.0.2.0/24)
  nsg:
    - Allow HTTPS from CI/CD agents
    - Allow test automation IPs
    - Deny all other inbound
  publicEndpoints:
    enabled: true
    ipWhitelist:
      - CI/CD agent pool IPs
      - QA team IPs
  privateEndpoints: false

identity:
  managedIdentity: System-assigned
  keyVault: atp-keyvault-test-eus
  rbac:
    - QA Team: Reader + Test Data Manager
    - CI/CD: Contributor (deploy only)
    - Developers: Reader (view-only for debugging)

observability:
  appInsights: atp-appinsights-test-eus
  logAnalytics: atp-loganalytics-test-eus
  logging:
    level: Information
    sinks: [AppInsights, Seq]
    structuredLogging: true
  tracing:
    sampling: 50%  # Reduced from Dev
    adaptiveSampling: true
  metrics:
    all: true
    aggregationInterval: 60 seconds
  retention:
    logs: 14 days
    traces: 14 days
    metrics: 60 days

costManagement:
  monthlyBudget: $1,000
  autoShutdown:
    enabled: true
    schedule: "Weekends only"
  reservedInstances: false
  alerts:
    - threshold: 80%
    - anomaly: >50% spike

Configuration (appsettings.Test.json):

{
  "Logging": {
    "LogLevel": {
      "Default": "Information",
      "Microsoft": "Warning",
      "Microsoft.EntityFrameworkCore.Database.Command": "Information"
    }
  },
  "ConnectionStrings": {
    "DefaultConnection": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-test-eus.vault.azure.net/secrets/SqlConnectionString)",
    "Redis": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-test-eus.vault.azure.net/secrets/RedisConnectionString)",
    "ServiceBus": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-test-eus.vault.azure.net/secrets/ServiceBusConnectionString)"
  },
  "Audit": {
    "EnableImmutability": false,
    "EnableTamperEvidence": true,  // Test tamper evidence logic
    "EnableHashChaining": true,
    "RetentionDays": 90,
    "WormStorage": false,
    "SegmentSize": 10000
  },
  "Compliance": {
    "StrictInDevelopment": false,
    "EnableLoggingRedaction": true,
    "SimulateComplianceChecks": true,
    "Profile": "test",
    "EnforceGDPR": true,  // Test GDPR workflows
    "EnforceHIPAA": false
  },
  "OpenTelemetry": {
    "ServiceName": "atp-{service}-test",
    "ExporterEndpoint": "http://otel-collector-test:4317",
    "SamplingRatio": 0.5,  // 50% sampling
    "ExportIntervalSeconds": 10
  },
  "FeatureManagement": {
    "TamperEvidenceV2": true,
    "AdvancedQueryFilters": true,
    "AIAssistedAnomalyDetection": false,  // Stable features only
    "ExperimentalFeatures": false
  },
  "RateLimiting": {
    "Enabled": true,
    "PermitLimit": 1000,  // Higher than prod for load testing
    "Window": 60
  }
}

Data Management:

// Test Data Management (Stable Datasets)
public class TestDataManager
{
    public async Task RefreshTestDataAsync()
    {
        // Backup current data (for rollback)
        var timestamp = DateTime.UtcNow.ToString("yyyyMMddHHmmss");
        await BackupAsync($"test-data-backup-{timestamp}");

        // Clear existing data
        await TruncateTablesAsync();

        // Load stable test datasets
        await LoadFixturesAsync("fixtures/test-tenants.json");
        await LoadFixturesAsync("fixtures/test-events-stable.json");

        // Verify data integrity (fail fast if the fixtures have drifted)
        var counts = await VerifyDataCountsAsync();
        if (counts.Tenants != 20 || counts.AuditEvents != 50_000)
        {
            throw new InvalidOperationException(
                $"Unexpected fixture counts: {counts.Tenants} tenants, {counts.AuditEvents} audit events.");
        }

    // Stable test datasets (version-controlled)
    private async Task LoadFixturesAsync(string fixturePath)
    {
        var json = await File.ReadAllTextAsync(fixturePath);
        var fixtures = JsonSerializer.Deserialize<TestFixture>(json);

        foreach (var tenant in fixtures.Tenants)
        {
            await _context.Tenants.AddAsync(tenant);
        }

        await _context.SaveChangesAsync();
    }
}

Test Dataset Characteristics:

  • Tenants: 20 synthetic tenants with varied profiles (Standard, Business, Enterprise editions).
  • Audit Events: 50,000 stable events (version-controlled in fixtures/test-events-stable.json).
  • Time Range: Events spanning last 90 days (predictable date ranges for test assertions).
  • Scenarios: Pre-defined test scenarios (compliance violation, high-volume ingestion, tamper detection, export workflows).
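A dataset with these characteristics stays stable most easily when it is produced by a seeded generator rather than hand-edited fixtures. A minimal sketch of the idea — field names, per-tenant counts, and the fixed anchor date are illustrative, not the actual fixture schema:

```python
import random
from datetime import datetime, timedelta, timezone

def generate_fixture(tenant_count=20, events_per_tenant=2500, seed=42):
    """Deterministically generate synthetic tenants and audit events.

    Seeding the RNG means every refresh produces identical fixtures,
    so assertions on counts and date ranges never drift between runs.
    """
    rng = random.Random(seed)
    anchor = datetime(2024, 1, 1, tzinfo=timezone.utc)  # fixed anchor date
    editions = ["Standard", "Business", "Enterprise"]

    tenants = [
        {"id": f"tenant-{i:03d}", "edition": rng.choice(editions)}
        for i in range(tenant_count)
    ]
    events = []
    for t in tenants:
        for j in range(events_per_tenant):
            offset = rng.uniform(0, 90 * 24 * 3600)  # within the last 90 days
            events.append({
                "tenantId": t["id"],
                "eventId": f"{t['id']}-evt-{j:05d}",
                "timestamp": (anchor - timedelta(seconds=offset)).isoformat(),
            })
    return {"Tenants": tenants, "AuditEvents": events}

fixture = generate_fixture()
print(len(fixture["Tenants"]), len(fixture["AuditEvents"]))  # 20 50000
```

With 20 tenants at 2,500 events each, the generator reproduces the documented 20 / 50,000 counts on every run.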

Access Control:

  • QA Team: Full access (read/write on data, deploy permissions).
  • Developers: Read-only (view logs, query data, no deployments).
  • CI/CD: Deploy and test execution permissions.
  • External Access: IP-whitelisted (CI/CD agents, QA team IPs).

Use Cases:

  1. Automated Regression Testing: Nightly full regression suite (all tests, all services).
  2. API Contract Validation: OpenAPI spec generation and breaking change detection.
  3. Integration Testing: Multi-service workflows (ingestion → integrity verification → query → export).
  4. Performance Baseline: Response time benchmarks under normal load.
  5. Data Migration Testing: Test database migration scripts before Staging.

Deployment Pattern:

# Deployment: Automated after Dev soak (24 hours)
stages:
- stage: Deploy_Test
  dependsOn: Deploy_Dev
  # Requires Dev stable for a 24-hour soak on the main branch
  # (expression conditions do not support inline comments)
  condition: |
    and(
      succeeded(),
      eq(variables['Dev.SoakHours'], '24'),
      eq(variables['Build.SourceBranch'], 'refs/heads/main')
    )
  jobs:
  - deployment: DeployToTest
    environment: ATP-Test  # No approval
    strategy:
      runOnce:
        deploy:
          steps:
          - template: deploy/deploy-microservice-to-azure-web-site.yaml@templates
            parameters:
              azureSubscription: $(azureSubscription)
              appName: atp-ingestion-test-eus
              package: $(Pipeline.Workspace)/drop/*.zip

        postRouteTraffic:
          steps:
          - script: |
              # Run smoke tests
              dotnet test tests/Smoke.Tests.csproj \
                --environment Test \
                --logger "trx;LogFileName=smoke-results.trx"

              # Run regression suite
              dotnet test tests/Regression.Tests.csproj \
                --environment Test \
                --logger "trx;LogFileName=regression-results.trx" \
                --settings test.runsettings
            displayName: 'Post-Deployment Test Suite'

Monitoring & Alerts:

  • Health Checks: Every 2 minutes; alert on 2 consecutive failures.
  • Test Suite Failures: Alert QA team on regression failures.
  • Error Rate: Alert if >5% (stricter than Dev).
  • Deployment Success Rate: Alert if <95% over 7 days.
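The deployment success-rate alert above reduces to a simple trailing-window calculation. A sketch of the alert predicate — the 95% floor comes from the bullet; how deployment outcomes are collected from the pipeline is left abstract:

```python
def deployment_success_rate(outcomes):
    """Fraction of successful deployments; `outcomes` is a list of
    booleans covering the trailing 7-day window."""
    if not outcomes:
        return 1.0  # no deployments in the window -> nothing to alert on
    return sum(outcomes) / len(outcomes)

def should_alert(outcomes, threshold=0.95):
    """Alert when the success rate drops below the documented floor."""
    return deployment_success_rate(outcomes) < threshold

# 19 of 20 deployments succeeded -> exactly 95%, no alert
print(should_alert([True] * 19 + [False]))        # False
# 18 of 20 -> 90%, below the 95% floor -> alert
print(should_alert([True] * 18 + [False] * 2))    # True
```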

Staging Environment (Pre-Production Validation)

Purpose: Production-equivalent environment for final validation before Production deployment. Staging mirrors Production infrastructure, security controls, and compliance policies, enabling realistic load testing, chaos engineering, and stakeholder acceptance.

Lifecycle:

  • Always-On: Persistent environment; critical for go-live decisions.
  • Deployment Frequency: 1-2 times per week (after Test validation).
  • Uptime Target: 99.5% (near-production SLA).
  • Maintenance Window: Scheduled changes only during approved windows.

Infrastructure:

# Staging Environment (Production-Equivalent)
compute:
  type: Azure App Service (Linux) or AKS
  services: All 7 ATP services
  sku: P1v2 (Premium)
  instances: 2 per service (blue-green deployment slots)
  autoscale: 
    enabled: true
    min: 2
    max: 5
    rules:
      - metric: CPU Percentage
        threshold: 70%
        scaleOut: 1
        scaleIn: 1
  alwaysOn: true
  slots:
    - production (active)
    - blue (deployment staging)

storage:
  sql:
    name: atp-sql-staging-eus
    tier: Premium P2
    dtu: 250
    maxSizeGB: 100
    geoReplication: true
    secondaryRegion: West Europe
    backupRetention: 35 days
    pointInTimeRestore: true
    encryption: TDE (Transparent Data Encryption)
  cosmos:
    name: atp-cosmos-staging-eus
    tier: Standard
    throughput: 5000 RU/s (autoscale 500-5000)
    consistency: Session
    multiRegion: 
      - eastus (primary)
      - westeurope (read replica)
  redis:
    name: atp-redis-staging-eus
    sku: Premium P1
    capacity: 6 GB
    clustering: true
    persistence: AOF (Append-Only File)
    geoReplication: 
      enabled: true
      secondary: atp-redis-staging-weu
  serviceBus:
    name: atp-servicebus-staging-eus
    tier: Premium
    messagingUnits: 1
    geoDisasterRecovery: true
    secondaryNamespace: atp-servicebus-staging-weu
  blobStorage:
    name: atpstoragestgeus
    tier: Hot
    replication: GZRS (Geo-Zone-Redundant)
    retention: 180 days
    immutability:
      enabled: true
      policy: time-based (90 days)
    encryption:
      type: Microsoft-managed keys (testing BYOK patterns)

networking:
  vnet: Dedicated Staging VNet (10.1.0.0/16)
  subnets:
    - Gateway Subnet: 10.1.1.0/24
    - Services Subnet: 10.1.2.0/24
    - Data Subnet: 10.1.3.0/24 (private endpoints)
  nsg:
    - Deny all inbound by default
    - Allow HTTPS from API Gateway subnet
    - Allow service-to-service within Services subnet
  privateEndpoints:
    enabled: true
    resources:
      - SQL Database (10.1.3.4)
      - Storage Account (10.1.3.5)
      - Key Vault (10.1.3.6)
      - Service Bus (10.1.3.7)
  applicationGateway:
    enabled: true
    waf: WAF_v2 (OWASP 3.2)
    sslPolicy: AppGwSslPolicy20220101

identity:
  managedIdentity: System-assigned + User-assigned
  keyVault: atp-keyvault-staging-eus
  rbac:
    - SRE Team: Reader + Deployment Operator
    - Platform Team: Contributor
    - Developers: No access (production-like restrictions)
  conditionalAccess:
    - Require MFA for all human access
    - Device compliance required

observability:
  appInsights: atp-appinsights-staging-eus
  logAnalytics: atp-loganalytics-staging-eus
  logging:
    level: Warning
    sinks: [AppInsights, LogAnalytics]
    structuredLogging: true
    sensitiveDataMasking: true
  tracing:
    sampling: 25%
    adaptiveSampling: true
    dependencies: true
  metrics:
    all: true
    customMetrics: true
    aggregationInterval: 60 seconds
  retention:
    logs: 30 days (hot) + 90 days (archive)
    traces: 30 days
    metrics: 90 days
  alerts:
    - Service health degradation
    - Error rate >2%
    - p95 latency >1000ms
    - Failed deployments

costManagement:
  monthlyBudget: $3,000
  autoShutdown:
    enabled: false  # Always-on for production parity
  reservedInstances: 
    enabled: true
    term: 1 year (App Services)
  alerts:
    - threshold: 80% of budget
    - anomaly: >30% spike

Configuration (appsettings.Staging.json):

{
  "Logging": {
    "LogLevel": {
      "Default": "Warning",
      "Microsoft": "Error",
      "Microsoft.EntityFrameworkCore": "Warning"
    }
  },
  "ConnectionStrings": {
    "DefaultConnection": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-staging-eus.vault.azure.net/secrets/SqlConnectionString)",
    "Redis": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-staging-eus.vault.azure.net/secrets/RedisConnectionString)",
    "ServiceBus": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-staging-eus.vault.azure.net/secrets/ServiceBusConnectionString)",
    "CosmosDb": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-staging-eus.vault.azure.net/secrets/CosmosConnectionString)"
  },
  "Audit": {
    "EnableImmutability": true,  // Production-like
    "EnableTamperEvidence": true,
    "EnableHashChaining": true,
    "RetentionDays": 2555,  // 7 years (production setting)
    "WormStorage": true,
    "SegmentSize": 100000,
    "SealInterval": "PT15M"
  },
  "Compliance": {
    "StrictInDevelopment": false,
    "EnableLoggingRedaction": true,
    "SimulateComplianceChecks": false,  // Real compliance checks
    "Profile": "staging",
    "EnforceGDPR": true,
    "EnforceHIPAA": true,
    "EnforceSOC2": true
  },
  "OpenTelemetry": {
    "ServiceName": "atp-{service}-staging",
    "ExporterEndpoint": "https://otel-collector-staging.connectsoft.local:4317",
    "SamplingRatio": 0.25,  // 25% sampling
    "ExportIntervalSeconds": 30
  },
  "FeatureManagement": {
    "TamperEvidenceV2": {
      "EnabledFor": [
        { "Name": "Percentage", "Parameters": { "Value": 50 } }  // 50% canary
      ]
    },
    "AdvancedQueryFilters": true,
    "AIAssistedAnomalyDetection": {
      "EnabledFor": [
        { "Name": "TargetingFilter", "Parameters": { "Audience": { "Users": ["staging-tenant-001"] } } }
      ]
    },
    "ExperimentalFeatures": false
  },
  "RateLimiting": {
    "Enabled": true,
    "PermitLimit": 500,  // Production-like limits
    "Window": 60,
    "QueueLimit": 100
  },
  "Security": {
    "RequireHttps": true,
    "HstsEnabled": true,
    "HstsMaxAge": 31536000,
    "ContentSecurityPolicy": "default-src 'self'; script-src 'self' 'unsafe-inline'; style-src 'self' 'unsafe-inline'"
  }
}
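The 50% canary in the Staging feature flags above relies on a percentage filter. One common way to make such a rollout sticky per tenant is to hash the tenant ID into a bucket — a hedged sketch of the bucketing semantics, not the Microsoft.FeatureManagement implementation:

```python
import hashlib

def in_rollout(feature: str, tenant_id: str, percentage: float) -> bool:
    """Deterministically bucket a tenant into [0, 100) by hashing the
    feature name and tenant id; the decision is stable across evaluations."""
    digest = hashlib.sha256(f"{feature}:{tenant_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10000 / 100.0
    return bucket < percentage

# Roughly half of 10,000 hypothetical tenants land in a 50% rollout
enabled = sum(
    in_rollout("TamperEvidenceV2", f"tenant-{i:04d}", 50)
    for i in range(10_000)
)
print(enabled)
```

Because the bucket is derived from a hash rather than a random draw, a tenant that sees the new code path keeps seeing it for the life of the rollout, which is what makes staged percentage increases observable.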

Data Management:

// Staging Data (Production-Like Synthetic)
public class StagingDataManager
{
    public async Task RefreshStagingDataAsync()
    {
        // Option 1: Anonymized Production Snapshot
        await RestoreFromAnonymizedProductionAsync();

        // Option 2: Generate Production-Scale Synthetic Data
        await GenerateProductionScaleDataAsync();
    }

    private async Task RestoreFromAnonymizedProductionAsync()
    {
        // Download anonymized backup from Production
        var backupUri = await GetLatestAnonymizedBackupAsync();

        // Restore to Staging SQL
        await RestoreDatabaseAsync(
            sourceUri: backupUri,
            targetDatabase: "ATP_Staging",
            overwriteExisting: true);

        // Verify PII redaction
        await VerifyNoPIIAsync();
    }

    private async Task GenerateProductionScaleDataAsync()
    {
        // 50 synthetic tenants (mimic real tenant distribution)
        var tenants = GenerateSyntheticTenants(count: 50);

        // 5 million audit events (realistic volume)
        var events = await GenerateRealisticEventsAsync(
            tenants: tenants,
            totalEvents: 5_000_000,
            timeRange: TimeSpan.FromDays(180));

        // Insert in batches (efficient bulk insert)
        await BulkInsertAsync(tenants, batchSize: 1000);
        await BulkInsertAsync(events, batchSize: 10000);
    }
}

Access Control:

  • SRE Team: Full access (deployment, troubleshooting, configuration).
  • Platform Team: Contributor (infrastructure changes, monitoring).
  • QA Team: Test execution permissions only.
  • Developers: No access (production-like restrictions).
  • Stakeholders: Read-only access for acceptance validation.

Use Cases:

  1. Load Testing: Simulate production traffic (50-80% of expected peak load).
  2. Chaos Engineering: Inject failures (pod restarts, network latency, database throttling).
  3. Security Testing: OWASP ZAP scans, penetration testing, vulnerability validation.
  4. Disaster Recovery Drills: Practice failover to secondary region; validate RPO/RTO.
  5. Stakeholder Acceptance: Product owners and compliance teams validate features before Production.
  6. Feature Flag Testing: Validate percentage rollouts and targeting filters before Production.
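A disaster recovery drill only validates RTO if the failover time is actually measured. A minimal sketch of the timing loop — the `probe` callable stands in for a real health-endpoint check against the secondary region:

```python
def measure_rto(probe, interval_s=30, timeout_s=3600):
    """Poll a health probe until it reports healthy; return the elapsed
    seconds (the observed RTO), or None if the timeout is exceeded.
    `probe` abstracts the real HTTPS health check so drills can be
    rehearsed against a stub."""
    elapsed = 0
    while elapsed <= timeout_s:
        if probe():
            return elapsed
        elapsed += interval_s
    return None

# Stub: secondary region becomes healthy on the 5th probe (~2 minutes in)
responses = iter([False, False, False, False, True])
print(measure_rto(lambda: next(responses)))  # 120
```

Comparing the returned value against the RTO target turns the drill into a pass/fail check rather than an anecdote.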

Deployment Pattern:

# Deployment: Manual approval required
stages:
- stage: Deploy_Staging
  dependsOn: Deploy_Test
  condition: |
    and(
      succeeded(),
      eq(variables['Build.SourceBranch'], 'refs/heads/main')
    )
  jobs:
  - deployment: DeployToStaging
    environment: ATP-Staging  # Requires 1 manual approval (Lead Engineer)
    strategy:
      runOnce:
        preDeploy:
          steps:
          - script: |
              # Pre-deployment validation
              echo "Verifying Test environment stability..."

              # Check Test error rate (last 24 hours)
              ERROR_RATE=$(az monitor metrics list \
                --resource atp-appinsights-test-eus \
                --metric "requests/failed" \
                --aggregation avg \
                --interval PT24H \
                --query "value[0].timeseries[0].data[-1].average")

              if (( $(echo "$ERROR_RATE > 0.02" | bc -l) )); then
                echo "❌ Test error rate too high: $ERROR_RATE (threshold 0.02)"
                exit 1
              fi

              echo "✅ Test environment stable; proceeding to Staging"
            displayName: 'Pre-Deploy Validation'

        deploy:
          steps:
          - template: deploy/deploy-microservice-to-azure-web-site.yaml@templates
            parameters:
              azureSubscription: $(azureSubscription)
              appName: atp-ingestion-staging-eus
              package: $(Pipeline.Workspace)/drop/*.zip
              slotName: blue  # Deploy to blue slot first

        routeTraffic:
          steps:
          - task: AzureAppServiceManage@0
            displayName: 'Swap Blue → Production Slot'
            inputs:
              azureSubscription: $(azureSubscription)
              action: 'Swap Slots'
              webAppName: atp-ingestion-staging-eus
              sourceSlot: blue
              targetSlot: production

        postRouteTraffic:
          steps:
          - script: |
              # Post-deployment validation
              echo "Running post-deployment checks..."

              # Health checks
              curl -f https://atp-gateway-staging-eus.azurewebsites.net/health || exit 1

              # Smoke tests
              dotnet test tests/Smoke.Tests.csproj --environment Staging

              # Load test (light validation)
              k6 run --vus 50 --duration 5m tests/load/basic-load.js
            displayName: 'Post-Deployment Validation'

Monitoring & Alerts:

  • Health Checks: Every 1 minute; alert immediately on failure.
  • Error Rate: Alert if >1% (production threshold).
  • Latency: Alert if p95 >500ms (production SLO).
  • Deployment Validation: Alert on slot swap failures or post-deployment test failures.
  • Security: WAF blocks, failed auth attempts, ABAC denials.

Production Environment (Live Tenant Traffic)

Purpose: Live environment serving real tenant traffic with full compliance enforcement, high availability, disaster recovery, and 24/7 monitoring. Production is the most controlled environment with strict approval workflows and change management.

Lifecycle:

  • Mission-Critical: Always-on with multi-region redundancy.
  • Deployment Frequency: 1-2 times per month (conservative change cadence).
  • Uptime Target: 99.9% (SLA-backed; ~43 minutes downtime/month).
  • Maintenance Window: Approved CAB windows only; typically Friday nights.
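The ~43 minutes/month figure follows directly from the SLA arithmetic:

```python
def downtime_budget_minutes(slo: float, days: int = 30) -> float:
    """Minutes of allowed downtime per period for an availability SLO."""
    return (1 - slo) * days * 24 * 60

print(round(downtime_budget_minutes(0.999)))  # 43 (Production's 99.9%)
print(round(downtime_budget_minutes(0.995)))  # 216 (Staging's 99.5%)
```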

Infrastructure:

# Production Environment (Enterprise-Grade, Multi-Region)
compute:
  type: Azure Kubernetes Service (AKS)
  cluster:
    name: atp-aks-prod-eus
    nodeCount: 6 (2 per zone)
    vmSize: Standard_D4s_v5 (4 vCPU, 16 GB RAM)
    zones: [1, 2, 3]  # Zone-redundant
    autoscale:
      enabled: true
      min: 6
      max: 20
      profile: production
  services:
    replicas: 3 per service (distributed across zones)
    resources:
      requests: { cpu: "500m", memory: "1Gi" }
      limits: { cpu: "2000m", memory: "4Gi" }
    probes:
      liveness: /health/live
      readiness: /health/ready
      startup: /health/startup
    hpa:  # Horizontal Pod Autoscaler
      enabled: true
      minReplicas: 3
      maxReplicas: 10
      targetCPU: 70%
      targetMemory: 80%

storage:
  sql:
    name: atp-sql-prod-eus
    tier: Premium P6
    dtu: 1000
    maxSizeGB: 500
    zoneRedundant: true
    geoReplication: 
      enabled: true
      secondaryRegion: West Europe
      failoverPolicy: Automatic
    backupRetention: 35 days
    longTermRetention: 7 years
    encryption: 
      type: TDE with CMK (Customer-Managed Key)
      keyVault: atp-keyvault-prod-eus
      keyName: sql-tde-key
    advancedThreatProtection: true
  cosmos:
    name: atp-cosmos-prod-eus
    tier: Standard
    throughput: 20000 RU/s (autoscale 2000-20000)
    consistency: BoundedStaleness
    multiRegion:
      - eastus (write)
      - westeurope (read)
      - southeastasia (read)
    automaticFailover: true
    multipleWriteLocations: false  # Single writer
  redis:
    name: atp-redis-prod-eus
    sku: Premium P3
    capacity: 26 GB
    clustering: true
    shardCount: 3
    persistence: AOF + RDB
    geoReplication:
      enabled: true
      secondary: atp-redis-prod-weu
    encryption: TLS 1.2+
  serviceBus:
    name: atp-servicebus-prod-eus
    tier: Premium
    messagingUnits: 2
    geoDisasterRecovery: true
    secondaryNamespace: atp-servicebus-prod-weu
    zones: [1, 2, 3]
  blobStorage:
    name: atpstorageprodeus
    tier: Hot (recent) + Cool (30-90 days) + Archive (90+ days)
    replication: GZRS (Geo-Zone-Redundant)
    retention: 7 years
    immutability:
      enabled: true
      policy: WORM (time-based, 7 years)
    legalHold: supported
    encryption:
      type: Customer-managed keys (BYOK)
      keyVault: atp-keyvault-prod-eus
      keyName: storage-cmk-key
      rotation: Automatic (90 days)
    advancedThreatProtection: true
    softDelete:
      enabled: true
      retentionDays: 30

networking:
  vnet: Dedicated Production VNet (10.2.0.0/16)
  subnets:
    - AKS Nodes: 10.2.1.0/24
    - Application Gateway: 10.2.2.0/27
    - Private Endpoints: 10.2.3.0/24
    - Azure Firewall: 10.2.4.0/26
  nsg:
    - Deny all by default (Zero Trust)
    - Allow inbound HTTPS (443) to App Gateway only
    - Allow AKS → Private Endpoints (SQL, Storage, KV)
  privateEndpoints:
    enabled: true
    dnsIntegration: true
    resources:
      - SQL Database (10.2.3.4)
      - Storage Account (10.2.3.5)
      - Key Vault (10.2.3.6)
      - Service Bus (10.2.3.7)
      - Cosmos DB (10.2.3.8)
      - Container Registry (10.2.3.9)
  applicationGateway:
    name: atp-appgw-prod-eus
    tier: WAF_v2
    capacity: 2-10 (autoscale)
    waf:
      enabled: true
      mode: Prevention
      ruleSet: OWASP 3.2
      exclusions: []
    sslPolicy: AppGwSslPolicy20220101S
    httpListeners:
      - HTTPS only
      - TLS 1.2+
      - Custom domain with managed certificate
  azureFirewall:
    name: atp-fw-prod-eus
    tier: Premium
    threatIntel: Alert and deny
    dnsProxy: enabled
    outboundRules:
      - Allow HTTPS to approved FQDNs (NuGet, Docker Hub, Azure services)
      - Deny all other egress

identity:
  managedIdentity: 
    - System-assigned (per AKS node pool)
    - User-assigned (for Key Vault access)
  keyVault: atp-keyvault-prod-eus
  rbac:
    - Production Operators: Custom role (deploy only, no data access)
    - SRE On-Call: Reader + Incident Responder
    - Platform Security: Key Vault Administrator
    - No Developer Access: Prohibited
  conditionalAccess:
    - Require MFA
    - Require compliant device
    - Require trusted location (corporate network or VPN)
    - Block legacy authentication
  privilegedIdentityManagement:
    enabled: true
    justInTime: true
    maxDuration: 4 hours
    approvalRequired: true

observability:
  appInsights: atp-appinsights-prod-eus
  logAnalytics: atp-loganalytics-prod-eus
  logging:
    level: Warning
    sinks: [AppInsights, LogAnalytics, Seq (centralized)]
    structuredLogging: true
    sensitiveDataMasking: true
    piiRedaction: enforced
  tracing:
    sampling: 10%
    adaptiveSampling: true
    intelligentSampling:
      enabled: true
      prioritize: [errors, slowRequests, dependencies]
  metrics:
    all: true
    customMetrics: true
    dimensionality: 
      - tenantId
      - region
      - service
      - operation
    aggregationInterval: 60 seconds
  retention:
    logs: 90 days (hot) + 7 years (archive to Blob Storage)
    traces: 90 days
    metrics: 1 year
  alerts:
    - SLO breaches (error budget)
    - Security events (failed auth, ABAC denials)
    - Performance degradation (p95, p99)
    - Cost anomalies

costManagement:
  monthlyBudget: $10,000
  autoShutdown:
    enabled: false  # Never shutdown Production
  reservedInstances:
    enabled: true
    term: 3 years (maximum savings)
    resources: [AKS nodes, SQL Database, App Services]
  costOptimization:
    - Spot instances for non-critical batch jobs
    - Storage lifecycle policies (Hot → Cool → Archive)
    - Autoscaling based on traffic patterns
  alerts:
    - threshold: 85% of budget
    - anomaly: >20% spike
    - forecast: Projected to exceed budget
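The forecast alert above needs a projection of month-end spend. A simple linear run-rate sketch — the $10,000 budget is from the config above, but the projection method itself is an assumption (Azure Cost Management applies its own forecasting model):

```python
def projected_month_end_spend(spend_to_date: float, day_of_month: int,
                              days_in_month: int = 30) -> float:
    """Linear run-rate projection: extrapolate spend-to-date to month end."""
    return spend_to_date / day_of_month * days_in_month

def forecast_alert(spend_to_date: float, day_of_month: int,
                   budget: float = 10_000) -> bool:
    """Alert when the projected month-end spend exceeds the budget."""
    return projected_month_end_spend(spend_to_date, day_of_month) > budget

# $4,000 spent by day 10 projects to $12,000 against a $10,000 budget
print(forecast_alert(4_000, 10))  # True
print(forecast_alert(3_000, 10))  # False (projects to $9,000)
```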

Configuration (appsettings.Production.json):

{
  "Logging": {
    "LogLevel": {
      "Default": "Warning",
      "Microsoft": "Error",
      "Microsoft.EntityFrameworkCore": "Error",
      "ConnectSoft": "Warning"
    },
    "ApplicationInsights": {
      "LogLevel": {
        "Default": "Warning"
      }
    }
  },
  "ConnectionStrings": {
    "DefaultConnection": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-prod-eus.vault.azure.net/secrets/SqlConnectionString)",
    "Redis": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-prod-eus.vault.azure.net/secrets/RedisConnectionString)",
    "ServiceBus": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-prod-eus.vault.azure.net/secrets/ServiceBusConnectionString)",
    "CosmosDb": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-prod-eus.vault.azure.net/secrets/CosmosConnectionString)"
  },
  "Audit": {
    "EnableImmutability": true,
    "EnableTamperEvidence": true,
    "EnableHashChaining": true,
    "RetentionDays": 2555,  // 7 years
    "WormStorage": true,
    "SegmentSize": 100000,
    "SealInterval": "PT15M",
    "IntegrityVerification": {
      "OnRead": true,
      "Scheduled": true,
      "ScheduleCron": "0 2 * * *",  // Daily at 2 AM
      "SampleRate": 0.1
    }
  },
  "Compliance": {
    "StrictInDevelopment": false,
    "EnableLoggingRedaction": true,
    "SimulateComplianceChecks": false,
    "Profile": "production",
    "EnforceGDPR": true,
    "EnforceHIPAA": true,
    "EnforceSOC2": true,
    "AuditTrail": {
      "Enabled": true,
      "RetentionYears": 7,
      "ImmutableStorage": true
    }
  },
  "OpenTelemetry": {
    "ServiceName": "atp-{service}-prod",
    "ExporterEndpoint": "https://otel-collector-prod.connectsoft.local:4317",
    "SamplingRatio": 0.1,  // 10% sampling
    "ExportIntervalSeconds": 60,
    "AdaptiveSampling": {
      "Enabled": true,
      "MaxTelemetryItemsPerSecond": 10
    }
  },
  "FeatureManagement": {
    "TamperEvidenceV2": true,  // Stable features only
    "AdvancedQueryFilters": true,
    "AIAssistedAnomalyDetection": {
      "EnabledFor": [
        { "Name": "Percentage", "Parameters": { "Value": 10 } }  // Conservative rollout
      ]
    },
    "ExperimentalFeatures": false  // Never in Production
  },
  "RateLimiting": {
    "Enabled": true,
    "PermitLimit": 100,  // Per client per minute
    "Window": 60,
    "QueueLimit": 50,
    "ByTenant": true,
    "ByIPAddress": true
  },
  "Security": {
    "RequireHttps": true,
    "HstsEnabled": true,
    "HstsMaxAge": 31536000,
    "HstsIncludeSubdomains": true,
    "HstsPreload": true,
    "ContentSecurityPolicy": "default-src 'self'; script-src 'self'; style-src 'self'; img-src 'self' data: https:; font-src 'self'; connect-src 'self'; frame-ancestors 'none'",
    "XFrameOptions": "DENY",
    "XContentTypeOptions": "nosniff",
    "ReferrerPolicy": "strict-origin-when-cross-origin"
  },
  "HighAvailability": {
    "MultiRegion": true,
    "PrimaryRegion": "eastus",
    "SecondaryRegion": "westeurope",
    "TrafficDistribution": "80-20",  // 80% East US, 20% West Europe
    "FailoverMode": "Automatic",
    "HealthCheckInterval": 30
  }
}
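The `RateLimiting` block above (100 permits per 60-second window, keyed by tenant and client IP) can be illustrated with a fixed-window counter. This sketch shows only the counting semantics; the actual enforcement would be ASP.NET Core rate-limiting middleware:

```python
class FixedWindowLimiter:
    """Per-key fixed-window counter: at most `permit_limit` requests per
    `window_s` seconds, keyed by tenant (and/or client IP)."""

    def __init__(self, permit_limit=100, window_s=60):
        self.permit_limit = permit_limit
        self.window_s = window_s
        self.counters = {}  # (key, window index) -> request count

    def allow(self, key: str, now_s: float) -> bool:
        bucket = (key, int(now_s // self.window_s))
        count = self.counters.get(bucket, 0)
        if count >= self.permit_limit:
            return False  # over the limit in this window
        self.counters[bucket] = count + 1
        return True

limiter = FixedWindowLimiter(permit_limit=100, window_s=60)
allowed = sum(limiter.allow("tenant-001", now_s=5.0) for _ in range(150))
print(allowed)  # 100 permitted; the remaining 50 are rejected
print(limiter.allow("tenant-001", now_s=61.0))  # True: new window resets
```

Keying the counter per tenant is what prevents one noisy tenant from exhausting another tenant's budget, matching the `ByTenant` setting above.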

Data Management:

  • Data Source: Live tenant audit records with real PII (classified and protected).
  • Volume: Millions of audit events per day across all tenants.
  • Tenancy: Strict tenant isolation with compliance controls per tenant's residency profile.
  • Retention: 7 years default (configurable per tenant; legal holds override).
  • Immutability: Full WORM enforcement; hash-chained segments; tamper-evidence with HSM-signed anchors.
  • Backup: Daily incremental + weekly full; geo-replicated; 7-year retention.
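The hash-chained segments in the bullets above can be modeled as each segment sealing its events together with the previous segment's hash, so altering any sealed event invalidates every later segment. A toy sketch — SHA-256 over canonical JSON is an assumed encoding; ATP's real segment format, seal interval, and HSM-signed anchors are not shown:

```python
import hashlib
import json

def seal_segment(events, prev_hash: str) -> dict:
    """Seal a segment by hashing the canonical event payload together
    with the previous segment's hash, forming a tamper-evident chain."""
    payload = json.dumps(events, sort_keys=True).encode()
    digest = hashlib.sha256(prev_hash.encode() + payload).hexdigest()
    return {"events": events, "prevHash": prev_hash, "hash": digest}

def verify_chain(segments, genesis="0" * 64) -> bool:
    """Recompute every segment hash; any mismatch breaks verification."""
    prev = genesis
    for seg in segments:
        expected = seal_segment(seg["events"], prev)["hash"]
        if seg["prevHash"] != prev or seg["hash"] != expected:
            return False
        prev = seg["hash"]
    return True

chain, prev = [], "0" * 64
for batch in ([{"id": 1}], [{"id": 2}], [{"id": 3}]):
    seg = seal_segment(batch, prev)
    chain.append(seg)
    prev = seg["hash"]

print(verify_chain(chain))        # True
chain[0]["events"][0]["id"] = 99  # tamper with a sealed event
print(verify_chain(chain))        # False: the whole chain is invalidated
```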

Access Control (Highly Restricted):

# Production Access Policy (Least Privilege)
roles:
  productionOperators:
    permissions:
      - Deploy (via approved pipelines only)
      - View logs (PII-redacted)
      - Restart services (break-glass only)
    restrictions:
      - No data access
      - No secret read
      - No manual configuration changes

  sreOnCall:
    permissions:
      - Read logs (PII-redacted)
      - Execute runbooks
      - Trigger failover (with approval)
    justInTime: true
    maxDuration: 4 hours
    approvalRequired: true

  platformSecurity:
    permissions:
      - Key Vault administration
      - Security policy updates
      - Compliance report generation
    mfaRequired: true
    auditLogging: comprehensive

  developers:
    permissions: []  # Zero access to Production
    exceptions: None

accessReview:
  cadence: Weekly
  approver: Security Officer
  autoRemove: Inactive for 30 days

Use Cases:

  1. Live Tenant Traffic: Serve production audit trail ingestion, queries, and exports.
  2. Compliance Evidence: Generate SOC 2, GDPR, HIPAA compliance artifacts.
  3. SLA Monitoring: Track and maintain 99.9% uptime commitment.
  4. Security Monitoring: Real-time threat detection and incident response.
  5. Performance Optimization: Continuous performance tuning based on real traffic patterns.

Deployment Pattern:

# Deployment: Strict approval + canary rollout
stages:
- stage: Deploy_Production
  dependsOn: Deploy_Staging
  # Manual runs from the main branch only
  # (expression conditions do not support inline comments)
  condition: |
    and(
      succeeded(),
      eq(variables['Build.Reason'], 'Manual'),
      eq(variables['Build.SourceBranch'], 'refs/heads/main')
    )
  jobs:
  - deployment: DeployToProduction
    environment: ATP-Production  # Requires 2 approvals + CAB
    strategy:
      canary:
        increments: [5, 20, 50]  # 5% → 20% → 50% → 100%

        preDeploy:
          steps:
          - script: |
              # Verify Staging stability (48 hours)
              echo "Verifying Staging has been stable for 48 hours..."

              STAGING_INCIDENTS=$(az monitor activity-log list \
                --resource-group ATP-Staging-RG \
                --offset 48h \
                --query "[?contains(status.value, 'Failed')] | length(@)")

              if [ "$STAGING_INCIDENTS" -gt "0" ]; then
                echo "❌ Staging has active incidents; blocking Production deployment"
                exit 1
              fi

              echo "✅ Staging stable; proceeding with canary deployment"
            displayName: 'Pre-Deploy Safety Checks'

        deploy:
          steps:
          - task: Kubernetes@1
            displayName: 'Deploy Canary ($(strategy.increment)%)'
            inputs:
              connectionType: 'Azure Resource Manager'
              azureSubscription: $(azureSubscription)
              azureResourceGroup: 'ATP-Prod-RG'
              kubernetesCluster: 'atp-aks-prod-eus'
              command: 'apply'
              arguments: '-f k8s/canary-$(strategy.increment).yaml'

        routeTraffic:
          steps:
          - script: |
              echo "Routing $(strategy.increment)% traffic to new version..."
              kubectl apply -f k8s/istio-traffic-split-$(strategy.increment).yaml
            displayName: 'Route Traffic to Canary'

        postRouteTraffic:
          steps:
          - script: |
              echo "Monitoring canary for 15 minutes..."
              sleep 900  # 15 minutes soak

              # Query Application Insights metrics
              ERROR_RATE=$(az monitor app-insights metrics show \
                --app atp-appinsights-prod-eus \
                --metrics "requests/failed" \
                --aggregation avg \
                --offset 15m \
                --filter "cloud_RoleName eq 'atp-ingestion-canary'" \
                --query "value.segments[0].segments[0]['requests/failed'].avg")

              if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
                echo "❌ Canary error rate exceeded threshold 0.01: $ERROR_RATE"
                exit 1  # Triggers automatic rollback
              fi

              LATENCY_P95=$(az monitor app-insights metrics show \
                --app atp-appinsights-prod-eus \
                --metrics "requests/duration" \
                --aggregation percentile95 \
                --offset 15m \
                --query "value.segments[0].segments[0]['requests/duration'].percentile95")

              if (( $(echo "$LATENCY_P95 > 1000" | bc -l) )); then
                echo "❌ Canary p95 latency exceeded 1000ms: ${LATENCY_P95}ms"
                exit 1
              fi

              echo "✅ Canary metrics healthy; proceeding to next increment"
            displayName: 'Validate Canary Metrics'

        on:
          failure:
            steps:
            - script: |
                echo "🔴 Canary deployment failed; rolling back..."

                # Revert traffic to stable version
                kubectl apply -f k8s/istio-traffic-split-stable.yaml

                # Notify on-call
                curl -X POST https://hooks.slack.com/services/PROD_ONCALL \
                  -d '{"text":"Production canary rollback triggered for build $(Build.BuildNumber)"}'

                # Create incident ticket
                az boards work-item create \
                  --title "Production Canary Rollback - Build $(Build.BuildNumber)" \
                  --type "Incident" \
                  --assigned-to "SRE-Team"
              displayName: 'Automatic Rollback'

Monitoring & Alerts (24/7 On-Call):

  • Health Checks: Every 30 seconds; PagerDuty alert immediately on failure.
  • Error Rate: Alert if >0.5% (strict SLO).
  • Latency: Alert if p95 >500ms or p99 >1000ms.
  • Security Events: Alert on failed auth >10/min, ABAC denials spike, WAF blocks.
  • Compliance: Alert on immutability violations, retention policy failures.
  • Cost: Alert on unexpected spending (>20% above forecast).
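SLO breach alerting of this kind is commonly implemented as multi-window burn-rate checks. A sketch against the 99.9% SLO — the 14.4 and 6.0 thresholds are conventional values from SRE practice, not ATP settings:

```python
def burn_rate(error_rate: float, slo: float = 0.999) -> float:
    """How fast the error budget is being consumed: 1.0 means exactly
    on budget; 14.4 sustained for an hour burns ~2% of a 30-day budget."""
    return error_rate / (1 - slo)

def page_on_call(short_window_er: float, long_window_er: float,
                 fast: float = 14.4, slow: float = 6.0) -> bool:
    """Page only when both a short and a long window burn fast, which
    filters out brief blips (multi-window burn-rate alerting)."""
    return burn_rate(short_window_er) >= fast and burn_rate(long_window_er) >= slow

print(page_on_call(0.02, 0.01))   # True: 20x and 10x burn, page now
print(page_on_call(0.02, 0.002))  # False: the long window has recovered
```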

Hotfix Environment (Emergency Patches)

Purpose: Fast-track environment for critical production fixes that cannot wait for the normal release cycle. The Hotfix environment is a Production clone with expedited approval workflows.

Lifecycle:

  • On-Demand: Created when P0/P1 incident requires immediate fix.
  • Deployment Frequency: As needed (incident-driven).
  • Uptime Target: 99.9% (same as Production).
  • Lifespan: Active during incident; decommissioned after successful Production hotfix.

Infrastructure:

# Hotfix Environment (Production Clone)
# Infrastructure mirrors Production exactly
# Created from Production IaC with hotfix overlay

compute:
  # Same as Production (AKS with same SKUs)
  # Deployed in separate namespace: atp-hotfix

storage:
  # Fresh databases (not cloned from Prod for safety)
  # Seeded with anonymized Production data if needed

  sql:
    name: atp-sql-hotfix-eus
    tier: Premium P6 (same as Prod)
    # Data: Anonymized snapshot from Production

  # Other storage: Same tiers as Production

networking:
  vnet: Dedicated Hotfix VNet (10.3.0.0/16)
  # Network topology mirrors Production
  # Separate to prevent any Production impact

Deployment Pattern:

# Deployment: Expedited approval (2 approvers within 2 hours)
stages:
- stage: Deploy_Hotfix_Validation
  jobs:
  - deployment: DeployToHotfix
    environment: ATP-Hotfix  # Requires 2 approvals (expedited SLA: 2 hours)
    strategy:
      runOnce:
        deploy:
          steps:
          - script: |
              # Deploy hotfix to Hotfix environment first
              kubectl apply -f k8s/hotfix-deployment.yaml
            displayName: 'Deploy to Hotfix Environment'

          - script: |
              # Run targeted tests (hotfix validation only)
              dotnet test tests/Hotfix.Validation.csproj \
                --filter "Category=Hotfix" \
                --environment Hotfix
            displayName: 'Validate Hotfix'

- stage: Deploy_Production_Hotfix
  dependsOn: Deploy_Hotfix_Validation
  condition: succeeded()
  jobs:
  - deployment: HotfixProduction
    environment: ATP-Production  # Additional Production approval
    strategy:
      runOnce:
        deploy:
          steps:
          - script: |
              # Apply hotfix to Production with minimal scope
              kubectl set image deployment/atp-ingestion-prod \
                atp-ingestion=connectsoft.azurecr.io/atp/ingestion:hotfix-$(Build.BuildNumber)

              # Monitor rollout
              kubectl rollout status deployment/atp-ingestion-prod --timeout=10m
            displayName: 'Apply Hotfix to Production'

        postRouteTraffic:
          steps:
          - script: |
              # Immediate validation
              curl -f https://atp-gateway-prod.connectsoft.com/health || exit 1

              # Monitor for 30 minutes
              python scripts/monitor-hotfix.py \
                --duration 30 \
                --error-threshold 0.01 \
                --latency-threshold 1000
            displayName: 'Post-Hotfix Monitoring'

Hotfix Workflow:

flowchart TD
    A[P0/P1 Incident Detected] --> B[Create Hotfix Branch]
    B --> C[Develop Fix]
    C --> D[Deploy to Hotfix Environment]
    D --> E{Validation Pass?}
    E -->|No| C
    E -->|Yes| F[Request Expedited Approvals]
    F --> G[2 Approvers + Incident Commander]
    G --> H[Deploy to Production]
    H --> I[Monitor 30 Minutes]
    I --> J{Metrics Healthy?}
    J -->|No| K[Rollback + Escalate]
    J -->|Yes| L[Merge to Main + Decommission Hotfix]

Approval Requirements:

  • Hotfix Environment: 2 approvers (SRE Lead + Platform Architect) within 2 hours.
  • Production Deployment: Same 2 approvers + Incident Commander confirmation.
  • Post-Deployment: Mandatory 30-minute monitoring before incident closure.

Environment Comparison Matrix

| Characteristic | Preview | Dev | Test | Staging | Production | Hotfix |
|---|---|---|---|---|---|---|
| Compute SKU | ACI (1 vCPU) | B1 Basic | S1 Standard | P1v2 Premium | P3v3 Premium (AKS) | Prod clone (AKS) |
| Instances | 1 | 1 | 2 | 2-5 (autoscale) | 3-10 (autoscale) | 3 (fixed) |
| SQL Tier | Serverless | Basic (5 DTU) | Standard S1 | Premium P2 | Premium P6 | Premium P6 |
| Redis SKU | Basic C0 | Basic C0 | Standard C1 | Premium P1 | Premium P3 | Premium P3 |
| Zone Redundancy | No | No | No | No | Yes | Yes |
| Geo-Replication | No | No | No | Yes | Yes | Yes |
| Private Endpoints | No | No | No | Yes | Yes | Yes |
| VNet Isolation | Shared | Shared | Shared | Dedicated | Dedicated | Dedicated |
| Managed Identity | No | System | System | System + User | User (prod keys) | User |
| Log Retention | PR lifetime | 7 days | 14 days | 30 days | 90d + 7yr | 90d + 7yr |
| Trace Sampling | 100% | 100% | 50% | 25% | 10% | 10% |
| Deployment Approvals | 0 | 0 | 0 | 1 | 2 + CAB | 2 (expedited) |
| Deployment Frequency | Per PR commit | Multiple/day | 1-2/day | 1-2/week | 1-2/month | As needed |
| Cost/Month | $10/PR | $500 | $1,000 | $3,000 | $10,000 | $500 (short-lived) |
| Uptime SLA | Best-effort | 95% | 98% | 99.5% | 99.9% | 99.9% |
| Data Type | Synthetic | Synthetic | Stable fixtures | Prod-like synthetic | Live tenant data | Anonymized Prod snapshot |
| Immutability | No | No | No | Yes | Yes (WORM) | Yes (WORM) |
| Security Level | Basic | Standard | Enhanced | Production-like | Maximum | Maximum |

Environment Selection Decision Tree

Use this flowchart to determine which environment to use for specific scenarios:

flowchart TD
    START[Need to test/deploy?] --> Q1{What are you testing?}

    Q1 -->|Feature in isolation| PR[Create Preview Environment]
    Q1 -->|Integration changes| DEV[Deploy to Dev]
    Q1 -->|Regression validation| TEST[Deploy to Test]
    Q1 -->|Load/chaos testing| STAGE[Deploy to Staging]
    Q1 -->|Production release| PROD_Q{Is it urgent?}

    PROD_Q -->|P0/P1 Incident| HOTFIX[Use Hotfix Path]
    PROD_Q -->|Normal release| PROD[Deploy to Production via CAB]

    PR --> PR_VALID{Tests pass?}
    PR_VALID -->|Yes| MERGE[Merge PR → Dev]
    PR_VALID -->|No| FIX[Fix in branch]

    DEV --> DEV_STABLE{Stable 24h?}
    DEV_STABLE -->|Yes| TEST
    DEV_STABLE -->|No| WAIT_DEV[Monitor Dev]

    TEST --> TEST_PASS{All tests green?}
    TEST_PASS -->|Yes| STAGE
    TEST_PASS -->|No| FIX_TEST[Fix issues]

    STAGE --> STAGE_APPROVE{1 Approval + Tests?}
    STAGE_APPROVE -->|Yes| PROD
    STAGE_APPROVE -->|No| WAIT_STAGE[Address feedback]

    style PR fill:#87CEEB
    style DEV fill:#90EE90
    style TEST fill:#FFD700
    style STAGE fill:#FFA500
    style PROD fill:#FF6347
    style HOTFIX fill:#FF69B4

Azure Topology & Resource Naming

ATP's Azure resource organization follows a hierarchical naming convention that enables clear resource identification, cost allocation, and automated management across all environments. This section defines the resource group structure, naming patterns, and Azure-specific topology considerations.

Resource Group Structure

Each environment deploys to a dedicated resource group containing all ATP services and their dependencies. The resource group acts as the lifecycle boundary — all resources are provisioned, updated, and decommissioned together.

Standard Resource Group Layout:

ConnectSoft-ATP-{Env}-{Region}-RG
├── atp-gateway-{env}-{region}       # API Gateway (App Service or AKS pod)
├── atp-ingestion-{env}-{region}     # Ingestion Service (App Service or AKS pod)
├── atp-query-{env}-{region}         # Query Service (App Service or AKS pod)
├── atp-integrity-{env}-{region}     # Integrity Verification (App Service or AKS pod)
├── atp-export-{env}-{region}        # Export Service (App Service or AKS pod)
├── atp-policy-{env}-{region}        # Policy Engine (App Service or AKS pod)
├── atp-search-{env}-{region}        # Search Service (App Service or AKS pod)
├── atp-sql-{env}-{region}           # Azure SQL Database (primary audit store)
├── atp-cosmos-{env}-{region}        # Cosmos DB or PostgreSQL (metadata store)
├── atp-storage-{env}-{region}       # Blob Storage (WORM in prod; immutable audit logs)
├── atp-servicebus-{env}-{region}    # Service Bus namespace (async messaging)
├── atp-redis-{env}-{region}         # Redis Cache (session state, distributed cache)
├── atp-keyvault-{env}-{region}      # Key Vault (secrets, certificates, keys)
├── atp-appinsights-{env}-{region}   # Application Insights (telemetry)
└── atp-loganalytics-{env}-{region}  # Log Analytics workspace (centralized logs)

Multi-Region Resource Groups:

For Production and Staging (geo-replicated):

# Primary Region (East US)
ConnectSoft-ATP-Prod-EUS-RG
├── atp-gateway-prod-eus
├── atp-ingestion-prod-eus
├── ... (all services)
├── atp-sql-prod-eus (primary write)
├── atp-storage-prod-eus (GZRS replication to WEU)
└── atp-redis-prod-eus (geo-replicated to WEU)

# Secondary Region (West Europe)
ConnectSoft-ATP-Prod-WEU-RG
├── atp-gateway-prod-weu
├── atp-ingestion-prod-weu
├── ... (all services)
├── atp-sql-prod-weu (read replica)
├── atp-storage-prod-weu (geo-replicated secondary)
└── atp-redis-prod-weu (geo-replicated secondary)

Shared Infrastructure Resource Group:

Some resources are shared across environments for cost optimization:

ConnectSoft-ATP-Shared-EUS-RG
├── atp-acr-shared-eus               # Azure Container Registry (shared Docker images)
├── atp-vnet-shared-eus              # Shared VNet for Dev/Test (10.0.0.0/16)
├── atp-bastion-shared-eus           # Azure Bastion (secure RDP/SSH access)
├── atp-devops-agents-eus            # Self-hosted Azure DevOps agents
└── atp-monitoring-shared-eus        # Shared monitoring infrastructure

Naming Conventions

Pattern: atp-{service}-{env}-{region}

Components:

  • atp: Project prefix (Audit Trail Platform).
  • {service}: Service identifier (e.g., gateway, ingestion, query).
  • {env}: Environment abbreviation (lowercase).
  • {region}: Azure region abbreviation (lowercase).
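
Because the pattern composes mechanically, names can be generated in tooling rather than typed by hand. A small sketch (the helper functions are illustrative, not ATP utilities):

```python
def resource_name(service: str, env: str, region: str) -> str:
    """Compose the standard ATP resource name: atp-{service}-{env}-{region}."""
    return f"atp-{service}-{env}-{region}".lower()

def storage_account_name(env: str, region: str) -> str:
    """Storage accounts forbid hyphens, so the pattern is atpstorage{env}{region}."""
    return f"atpstorage{env}{region}".lower()

print(resource_name("gateway", "prod", "eus"))  # atp-gateway-prod-eus
print(storage_account_name("prod", "eus"))      # atpstorageprodeus
```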

Environment Abbreviations:

| Environment | Abbreviation | Example |
|---|---|---|
| Preview (Ephemeral) | preview | atp-gateway-preview-pr-1234-eus |
| Development | dev | atp-ingestion-dev-eus |
| Test/QA | test | atp-query-test-eus |
| Staging | staging | atp-integrity-staging-eus |
| Production | prod | atp-export-prod-eus |
| Hotfix | hotfix | atp-policy-hotfix-eus |

Region Abbreviations:

| Azure Region | Abbreviation | Example |
|---|---|---|
| East US | eus | atp-sql-prod-eus |
| West Europe | weu | atp-sql-prod-weu |
| Southeast Asia | apse | atp-cosmos-prod-apse |
| Central US | cus | atp-storage-staging-cus |
| North Europe | neu | atp-redis-dev-neu |

Service Abbreviations:

| Service | Abbreviation | Resource Type | Example |
|---|---|---|---|
| API Gateway | gateway | App Service / AKS | atp-gateway-prod-eus |
| Ingestion Service | ingestion | App Service / AKS | atp-ingestion-prod-eus |
| Query Service | query | App Service / AKS | atp-query-prod-eus |
| Integrity Verification | integrity | App Service / AKS | atp-integrity-prod-eus |
| Export Service | export | App Service / AKS | atp-export-prod-eus |
| Policy Engine | policy | App Service / AKS | atp-policy-prod-eus |
| Search Service | search | App Service / AKS | atp-search-prod-eus |
| SQL Database | sql | Azure SQL | atp-sql-prod-eus |
| Cosmos DB | cosmos | Cosmos DB | atp-cosmos-prod-eus |
| Blob Storage | storage | Storage Account | atpstorageprodeus (no hyphens)* |
| Service Bus | servicebus | Service Bus | atp-servicebus-prod-eus |
| Redis Cache | redis | Redis Cache | atp-redis-prod-eus |
| Key Vault | keyvault | Key Vault | atp-keyvault-prod-eus |
| Application Insights | appinsights | App Insights | atp-appinsights-prod-eus |
| Log Analytics | loganalytics | Log Analytics | atp-loganalytics-prod-eus |
| Container Registry | acr | ACR | atpacrsharedeus (no hyphens)* |

* Storage Account names (3-24 characters) and Container Registry names (5-50 characters) follow stricter rules: alphanumeric only, no hyphens; Storage Accounts additionally require lowercase.

Complete Naming Examples

Dev Environment (East US):

Resource Group: ConnectSoft-ATP-Dev-EUS-RG
├── atp-gateway-dev-eus
├── atp-ingestion-dev-eus
├── atp-query-dev-eus
├── atp-integrity-dev-eus
├── atp-export-dev-eus
├── atp-policy-dev-eus
├── atp-search-dev-eus
├── atp-sql-dev-eus
├── atp-cosmos-dev-eus
├── atpstoragedeveus
├── atp-servicebus-dev-eus
├── atp-redis-dev-eus
├── atp-keyvault-dev-eus
├── atp-appinsights-dev-eus
└── atp-loganalytics-dev-eus

Production Environment (Multi-Region):

# Primary (East US)
Resource Group: ConnectSoft-ATP-Prod-EUS-RG
├── atp-aks-prod-eus                 # AKS cluster
├── atp-sql-prod-eus                 # SQL primary
├── atp-cosmos-prod-eus              # Cosmos primary
├── atpstorageprodeus                # GZRS storage
├── atp-servicebus-prod-eus          # Service Bus primary
├── atp-redis-prod-eus               # Redis primary
├── atp-keyvault-prod-eus            # Key Vault (geo-backed up)
├── atp-appgw-prod-eus               # Application Gateway
├── atp-fw-prod-eus                  # Azure Firewall
├── atp-appinsights-prod-eus
└── atp-loganalytics-prod-eus

# Secondary (West Europe)
Resource Group: ConnectSoft-ATP-Prod-WEU-RG
├── atp-aks-prod-weu                 # AKS cluster (standby)
├── atp-sql-prod-weu                 # SQL read replica
├── atp-cosmos-prod-weu              # Cosmos read replica
├── atpstorageprodweu                # GZRS storage secondary
├── atp-servicebus-prod-weu          # Service Bus secondary (GDR)
├── atp-redis-prod-weu               # Redis geo-replicated
├── atp-appgw-prod-weu               # Application Gateway
└── atp-appinsights-prod-weu

Azure-Specific Resource Constraints

Storage Account Naming (Most Restrictive):

  • Length: 3-24 characters
  • Characters: Lowercase letters and numbers only (no hyphens, underscores, or uppercase)
  • Uniqueness: Globally unique across all Azure
  • Pattern: atpstorage{env}{region} (e.g., atpstorageprodeus)

Container Registry Naming:

  • Length: 5-50 characters
  • Characters: Alphanumeric only (no special characters)
  • Uniqueness: Globally unique
  • Pattern: atpacr{env}{region} (e.g., atpacrprodeus)

Key Vault Naming:

  • Length: 3-24 characters
  • Characters: Alphanumeric and hyphens (must start/end with alphanumeric)
  • Uniqueness: Globally unique
  • Pattern: atp-keyvault-{env}-{region} or atp-kv-{env}-{region} (shortened)

App Service Naming:

  • Length: 2-60 characters
  • Characters: Alphanumeric and hyphens
  • Uniqueness: Globally unique (*.azurewebsites.net subdomain)
  • Pattern: atp-{service}-{env}-{region}
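
These per-service limits can be checked automatically before a deployment ever reaches Azure. A hedged sketch that encodes the length and character-set rules listed above (the function and rule names are made up for illustration):

```python
import re

# (pattern, min_len, max_len) per resource kind, from the constraints above.
NAME_RULES = {
    "storage":     (re.compile(r"^[a-z0-9]+$"), 3, 24),
    "acr":         (re.compile(r"^[a-zA-Z0-9]+$"), 5, 50),
    "keyvault":    (re.compile(r"^[a-zA-Z0-9][a-zA-Z0-9-]*[a-zA-Z0-9]$"), 3, 24),
    "app_service": (re.compile(r"^[a-zA-Z0-9-]+$"), 2, 60),
}

def is_valid_name(kind: str, name: str) -> bool:
    """Check a proposed name against the length and charset rules for its kind."""
    pattern, lo, hi = NAME_RULES[kind]
    return lo <= len(name) <= hi and bool(pattern.fullmatch(name))

print(is_valid_name("storage", "atpstorageprodeus"))       # True
print(is_valid_name("storage", "atp-storage-prod-eus"))    # False (hyphens)
print(is_valid_name("keyvault", "atp-keyvault-prod-eus"))  # True
```

A check like this belongs in the IaC pipeline so that names exceeding a limit (e.g. a Key Vault name pushed past 24 characters by a long environment label) fail fast.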

DNS and Domain Strategy

App Service URLs (Default):

# Dev
https://atp-gateway-dev-eus.azurewebsites.net
https://atp-ingestion-dev-eus.azurewebsites.net

# Production
https://atp-gateway-prod-eus.azurewebsites.net (behind Application Gateway)

Custom Domains (Production):

# Public API (via Application Gateway)
https://api.atp.connectsoft.com → Application Gateway → AKS Ingress

# Services (internal, private endpoints)
https://ingestion.atp.internal → Private endpoint
https://query.atp.internal → Private endpoint

Private DNS Zones (Production):

privatelink.database.windows.net    # Azure SQL private endpoints
privatelink.blob.core.windows.net   # Storage private endpoints
privatelink.vaultcore.azure.net     # Key Vault private endpoints
privatelink.servicebus.windows.net  # Service Bus private endpoints
atp.internal                        # Custom private zone for ATP services

Infrastructure as Code (IaC) Naming in Pulumi

Pulumi Stack Naming:

// Pattern: {organization}/{project}/{environment}
var stackName = $"connectsoft/atp/{environment}";

// Examples:
// connectsoft/atp/dev
// connectsoft/atp/prod-eus
// connectsoft/atp/prod-weu

Pulumi Resource Naming (C# Example):

// Resource Group
var resourceGroup = new ResourceGroup($"atp-{environment}-{region}-rg", new ResourceGroupArgs
{
    ResourceGroupName = $"ConnectSoft-ATP-{environment.ToUpper()}-{region.ToUpper()}-RG",
    Location = region
});

// App Service
var appService = new WebApp($"atp-ingestion-{environment}-{region}", new WebAppArgs
{
    Name = $"atp-ingestion-{environment}-{region}",
    ResourceGroupName = resourceGroup.Name,
    Location = region,
    Tags = new InputMap<string>
    {
        ["Environment"] = environment,
        ["Service"] = "ingestion",
        ["ManagedBy"] = "pulumi"
    }
});

// Storage Account (handle naming constraints)
var storageAccount = new StorageAccount($"atp-storage-{environment}-{region}", new StorageAccountArgs
{
    AccountName = $"atpstorage{environment}{region}".Replace("-", "").ToLower(), // Remove hyphens
    ResourceGroupName = resourceGroup.Name,
    Location = region,
    Sku = new SkuArgs { Name = SkuName.Standard_GRS },
    Tags = appService.Tags
});

Tagging Strategy

Required Tags (Enforced via Azure Policy):

{
  "Environment": "dev | test | staging | prod | hotfix",
  "Service": "gateway | ingestion | query | integrity | export | policy | search",
  "Owner": "platform-team@connectsoft.example",
  "CostCenter": "ATP-Platform | ATP-Services",
  "Compliance": "gdpr | hipaa | soc2",
  "DataClassification": "public | internal | restricted | secret",
  "ManagedBy": "pulumi | bicep | terraform | manual",
  "BackupRequired": "true | false",
  "DR-Tier": "critical | important | standard",
  "CreatedDate": "2025-10-30T00:00:00Z",
  "ExpiryDate": "2026-10-30T00:00:00Z" // For Preview/Hotfix only
}

Tag Application Example (Pulumi):

var commonTags = new InputMap<string>
{
    ["Environment"] = config.Require("environment"),
    ["Owner"] = "platform-team@connectsoft.example",
    ["CostCenter"] = "ATP-Platform",
    ["Compliance"] = "gdpr,hipaa,soc2",
    ["DataClassification"] = config.Require("dataClassification"), // env-specific
    ["ManagedBy"] = "pulumi",
    ["BackupRequired"] = config.RequireBoolean("backupRequired").ToString(),
    ["DR-Tier"] = config.Require("drTier"),
    ["CreatedDate"] = DateTime.UtcNow.ToString("o"),
    ["Project"] = "ATP",
    ["Version"] = "1.0"
};

// Apply to all resources
var resourceGroup = new ResourceGroup("atp-rg", new ResourceGroupArgs
{
    Tags = commonTags
});

Resource Naming Validation

Azure Policy Enforcement (Custom Policy):

{
  "policyRule": {
    "if": {
      "allOf": [
        {
          "field": "type",
          "in": [
            "Microsoft.Web/sites",
            "Microsoft.Sql/servers",
            "Microsoft.Storage/storageAccounts"
          ]
        },
        {
          "not": {
            "anyOf": [
              { "field": "name", "like": "atp*dev*" },
              { "field": "name", "like": "atp*test*" },
              { "field": "name", "like": "atp*staging*" },
              { "field": "name", "like": "atp*prod*" },
              { "field": "name", "like": "atp*hotfix*" }
            ]
          }
        }
      ]
    },
    "then": {
      "effect": "deny"
    }
  }
}

Azure Policy's match operator supports only single-character wildcards, so the convention is expressed as one like pattern per environment; the patterns omit hyphens so that hyphen-free Storage Account names (e.g. atpstorageprodeus) also pass, and strict service/region checks are left to the validation script below. Because the deny effect carries no message field, the guidance string "Resource name must follow ATP naming convention: atp-{service}-{env}-{region}" should be attached as a non-compliance message on the policy assignment.

Automated Validation Script (PowerShell):

# validate-naming.ps1
param(
    [string]$ResourceGroupName
)

$resources = Get-AzResource -ResourceGroupName $ResourceGroupName

foreach ($resource in $resources) {
    $name = $resource.Name

    # Validate naming pattern (excluding storage accounts)
    if ($resource.Type -notlike "*Storage*") {
        if ($name -notmatch "^atp-[\w]+-(?:dev|test|staging|prod|hotfix)-(?:eus|weu|apse)$") {
            Write-Warning "❌ Invalid name: $name (Type: $($resource.Type))"
        } else {
            Write-Host "✅ Valid name: $name"
        }
    }

    # Validate required tags
    $requiredTags = @("Environment", "Service", "Owner", "ManagedBy")
    foreach ($tag in $requiredTags) {
        if (-not $resource.Tags -or -not $resource.Tags.ContainsKey($tag)) {
            Write-Warning "❌ Missing tag '$tag' on resource: $name"
        }
    }
}

Cross-Environment Resource Dependencies

Shared Resources (Accessible by Multiple Environments):

# Azure Container Registry (shared across all environments)
atpacrsharedeus

# Shared VNet (Dev + Test only)
atp-vnet-shared-eus (10.0.0.0/16)
├── Dev Subnet: 10.0.1.0/24
└── Test Subnet: 10.0.2.0/24

# Dedicated VNets (Staging, Production)
atp-vnet-staging-eus (10.1.0.0/16)
atp-vnet-prod-eus (10.2.0.0/16)
atp-vnet-prod-weu (10.2.0.0/16) # Same CIDR (different regions)
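
The CIDR plan above can be verified with the standard library's ipaddress module: each shared subnet must sit inside the shared VNet, and the dedicated VNets must not collide with it. A sketch using the ranges listed above (note that prod-eus and prod-weu intentionally reuse 10.2.0.0/16 in different regions, so overlap checks only apply within a region):

```python
import ipaddress

shared_vnet = ipaddress.ip_network("10.0.0.0/16")  # Dev + Test
dev_subnet  = ipaddress.ip_network("10.0.1.0/24")
test_subnet = ipaddress.ip_network("10.0.2.0/24")
staging     = ipaddress.ip_network("10.1.0.0/16")
prod_eus    = ipaddress.ip_network("10.2.0.0/16")

# Subnets must live inside their VNet ...
assert dev_subnet.subnet_of(shared_vnet)
assert test_subnet.subnet_of(shared_vnet)

# ... and the dedicated VNets must not collide with the shared one or each other.
assert not staging.overlaps(shared_vnet)
assert not prod_eus.overlaps(staging)
print("address plan consistent")
```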

Resource Group Locking:

# Production resource groups: ReadOnly lock (blocks all control-plane changes, including deletion)
az lock create --name prod-delete-lock \
  --resource-group ConnectSoft-ATP-Prod-EUS-RG \
  --lock-type ReadOnly \
  --notes "Prevent accidental deletion of production resources"

# Staging: CanNotDelete lock (allow updates, prevent deletion)
az lock create --name staging-delete-lock \
  --resource-group ConnectSoft-ATP-Staging-EUS-RG \
  --lock-type CanNotDelete

Summary

  • Naming Pattern: atp-{service}-{env}-{region} ensures clarity and automation compatibility.
  • Resource Groups: One per environment/region combination; acts as lifecycle boundary.
  • Tagging: Required tags enable cost allocation, compliance tracking, and automated management.
  • Validation: Azure Policy and scripts enforce naming conventions and tag requirements.
  • Multi-Region: Production uses dedicated resource groups per region with geo-replication.
  • Shared Infrastructure: Cost optimization via shared ACR, VNets, and monitoring for lower environments.

Configuration Management & Hierarchy

ATP employs a layered configuration strategy that balances developer convenience (defaults, local development) with production security (secrets in Key Vault, dynamic feature flags). Configuration precedence follows the ASP.NET Core standard with ATP-specific extensions for Azure App Configuration and Key Vault integration.

This approach keeps secrets out of application code (no hardcoded credentials), makes environment-specific overrides explicit, and ensures production secrets are never stored in source control or deployment artifacts.

Configuration Hierarchy

ATP configurations are resolved in the following precedence order (later sources override earlier ones):

1. appsettings.json                    # Base defaults (checked into source control)
2. appsettings.{Environment}.json      # Environment-specific overrides (checked in)
3. Azure App Configuration (optional)  # Dynamic feature flags, A/B testing configs
4. Environment Variables               # Runtime overrides, container orchestration
5. Key Vault References                # Secrets, connection strings, certificates
6. Command-Line Arguments              # Override for debugging/testing
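
The precedence rule — later sources override earlier ones — is just a left-to-right merge over flattened keys. A minimal language-agnostic sketch (the sample keys and layers are illustrative):

```python
def resolve(*layers: dict) -> dict:
    """Merge configuration layers; keys in later layers override earlier ones."""
    merged: dict = {}
    for layer in layers:
        merged.update(layer)
    return merged

base     = {"Audit:RetentionDays": 90, "Logging:Default": "Information"}
env_file = {"Audit:RetentionDays": 2555}   # appsettings.{Environment}.json
env_vars = {"Logging:Default": "Warning"}  # ATP_-prefixed environment variables
cli_args = {"Logging:Default": "Debug"}    # highest priority

config = resolve(base, env_file, env_vars, cli_args)
print(config["Audit:RetentionDays"], config["Logging:Default"])  # 2555 Debug
```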

Configuration Resolution Example:

// Program.cs - Configuration loading order
public class Program
{
    public static IHostBuilder CreateHostBuilder(string[] args) =>
        Host.CreateDefaultBuilder(args)
            .ConfigureAppConfiguration((context, config) =>
            {
                var env = context.HostingEnvironment;

                // 1. Base configuration
                config.AddJsonFile("appsettings.json", optional: false, reloadOnChange: true);

                // 2. Environment-specific overrides
                config.AddJsonFile($"appsettings.{env.EnvironmentName}.json", optional: true, reloadOnChange: true);

                // 3. Azure App Configuration (Production, Staging only)
                if (env.IsProduction() || env.IsStaging())
                {
                    var settings = config.Build();
                    var appConfigConnection = settings["AppConfig:ConnectionString"];

                    config.AddAzureAppConfiguration(options =>
                    {
                        options
                            .Connect(appConfigConnection)
                            .Select(KeyFilter.Any, LabelFilter.Null)
                            .Select(KeyFilter.Any, env.EnvironmentName)
                            .ConfigureRefresh(refresh =>
                            {
                                refresh.Register("Sentinel", refreshAll: true)
                                       .SetCacheExpiration(TimeSpan.FromMinutes(5));
                            })
                            .UseFeatureFlags(featureFlagOptions =>
                            {
                                featureFlagOptions.CacheExpirationInterval = TimeSpan.FromMinutes(5);
                            });
                    });
                }

                // 4. Environment variables (override via ASPNETCORE_* or custom prefix)
                config.AddEnvironmentVariables(prefix: "ATP_");

                // 5. Key Vault (Production, Staging only)
                if (env.IsProduction() || env.IsStaging())
                {
                    var builtConfig = config.Build();
                    var keyVaultEndpoint = builtConfig["KeyVault:Endpoint"];

                    config.AddAzureKeyVault(
                        new Uri(keyVaultEndpoint),
                        new DefaultAzureCredential());
                }

                // 6. Command-line arguments (highest priority)
                config.AddCommandLine(args);
            })
            .ConfigureWebHostDefaults(webBuilder =>
            {
                webBuilder.UseStartup<Startup>();
            });
}

Base Configuration (appsettings.json)

The base configuration contains safe defaults suitable for local development and general application structure. No secrets or environment-specific values should be in this file.

{
  "Logging": {
    "LogLevel": {
      "Default": "Information",
      "Microsoft": "Warning",
      "Microsoft.Hosting.Lifetime": "Information"
    },
    "Console": {
      "IncludeScopes": false,
      "TimestampFormat": "yyyy-MM-dd HH:mm:ss "
    }
  },

  "AllowedHosts": "*",

  "Audit": {
    "ServiceName": "ATP",
    "Version": "1.0.0",
    "EnableImmutability": false,
    "EnableTamperEvidence": false,
    "EnableHashChaining": false,
    "RetentionDays": 90,
    "WormStorage": false,
    "SegmentSize": 10000,
    "SealInterval": "PT1H",
    "MaxBatchSize": 1000,
    "BatchTimeoutSeconds": 30
  },

  "Compliance": {
    "StrictInDevelopment": false,
    "EnableLoggingRedaction": false,
    "SimulateComplianceChecks": false,
    "Profile": "default",
    "EnforceGDPR": false,
    "EnforceHIPAA": false,
    "EnforceSOC2": false
  },

  "OpenTelemetry": {
    "ServiceName": "atp-service",
    "ServiceVersion": "1.0.0",
    "ExporterEndpoint": "http://localhost:4317",
    "SamplingRatio": 1.0,
    "ExportIntervalSeconds": 5,
    "EnableConsoleExporter": false,
    "EnableJaegerExporter": false
  },

  "RateLimiting": {
    "Enabled": false,
    "PermitLimit": 100,
    "Window": 60,
    "QueueLimit": 0
  },

  "Caching": {
    "DefaultSlidingExpiration": "00:05:00",
    "DefaultAbsoluteExpiration": "01:00:00"
  },

  "HealthChecks": {
    "Enabled": true,
    "PollingIntervalSeconds": 30
  },

  "KeyVault": {
    "Endpoint": ""
  },

  "AppConfig": {
    "ConnectionString": ""
  }
}
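
The "no secrets in base settings" rule can be enforced in CI with a simple scan of appsettings.json for suspicious values. A hedged sketch (the patterns and function name are assumptions, not an ATP tool):

```python
import json
import re

# Heuristic patterns for values that should never appear in base settings.
SECRET_VALUE_PATTERNS = [
    re.compile(r"AccountKey=", re.IGNORECASE),
    re.compile(r"SharedAccessKey=", re.IGNORECASE),
    re.compile(r"Password=", re.IGNORECASE),
]

def find_suspect_values(settings: dict, path: str = "") -> list[str]:
    """Recursively collect config paths whose values look like embedded secrets."""
    hits = []
    for key, value in settings.items():
        child = f"{path}:{key}" if path else key
        if isinstance(value, dict):
            hits += find_suspect_values(value, child)
        elif isinstance(value, str) and any(p.search(value) for p in SECRET_VALUE_PATTERNS):
            hits.append(child)
    return hits

base = json.loads('{"KeyVault": {"Endpoint": ""}, '
                  '"ConnectionStrings": {"Sql": "Server=x;Password=hunter2;"}}')
print(find_suspect_values(base))  # ['ConnectionStrings:Sql']
```

Wired into the build, a non-empty result fails the pipeline before a leaked connection string ever ships.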

Dev Environment (appsettings.Development.json)

Purpose: Local development and continuous integration with verbose logging, synthetic data, and disabled compliance controls for rapid iteration.

{
  "Logging": {
    "LogLevel": {
      "Default": "Debug",
      "Microsoft": "Information",
      "Microsoft.EntityFrameworkCore": "Information",
      "Microsoft.EntityFrameworkCore.Database.Command": "Information",
      "System": "Information",
      "ConnectSoft": "Debug"
    },
    "Console": {
      "IncludeScopes": true,
      "TimestampFormat": "HH:mm:ss.fff "
    }
  },

  "ConnectionStrings": {
    "DefaultConnection": "Server=atp-sql-dev-eus.database.windows.net;Database=ATP_Dev;Authentication=Active Directory Managed Identity;",
    "Redis": "atp-redis-dev-eus.redis.cache.windows.net:6380,ssl=True,abortConnect=False",
    "ServiceBus": "Endpoint=sb://atp-servicebus-dev-eus.servicebus.windows.net/;Authentication=Managed Identity",
    "CosmosDb": "AccountEndpoint=https://atp-cosmos-dev-eus.documents.azure.com:443/;AuthKeyOrResourceToken=ManagedIdentity"
  },

  "Audit": {
    "EnableImmutability": false,
    "EnableTamperEvidence": false,
    "EnableHashChaining": false,
    "RetentionDays": 30,
    "WormStorage": false,
    "SegmentSize": 1000,
    "SealInterval": "PT24H",
    "MaxBatchSize": 100,
    "IntegrityVerification": {
      "OnRead": false,
      "Scheduled": false
    }
  },

  "Compliance": {
    "StrictInDevelopment": true,
    "EnableLoggingRedaction": true,
    "SimulateComplianceChecks": true,
    "Profile": "development",
    "EnforceGDPR": false,
    "EnforceHIPAA": false,
    "EnforceSOC2": false,
    "AllowTestData": true
  },

  "OpenTelemetry": {
    "ServiceName": "atp-ingestion-dev",
    "ServiceVersion": "1.0.0",
    "ExporterEndpoint": "http://otel-collector-dev:4317",
    "SamplingRatio": 1.0,
    "ExportIntervalSeconds": 5,
    "EnableConsoleExporter": true,
    "EnableJaegerExporter": false,
    "Attributes": {
      "environment": "dev",
      "region": "eus"
    }
  },

  "FeatureManagement": {
    "TamperEvidenceV2": true,
    "AdvancedQueryFilters": true,
    "AIAssistedAnomalyDetection": true,
    "ExperimentalFeatures": true,
    "PerformanceOptimizations": true
  },

  "RateLimiting": {
    "Enabled": false,
    "PermitLimit": 0,
    "Window": 0
  },

  "Caching": {
    "DefaultSlidingExpiration": "00:01:00",
    "DefaultAbsoluteExpiration": "00:05:00",
    "Enabled": true
  },

  "HealthChecks": {
    "Enabled": true,
    "PollingIntervalSeconds": 60,
    "IncludeDependencies": true
  },

  "Cors": {
    "AllowedOrigins": ["http://localhost:3000", "http://localhost:5173"],
    "AllowCredentials": true
  },

  "Swagger": {
    "Enabled": true,
    "IncludeXmlComments": true
  }
}

Test Environment (appsettings.Test.json)

Purpose: Automated testing and QA validation with stable datasets, moderate logging, and selective compliance enforcement for test validation.

{
  "Logging": {
    "LogLevel": {
      "Default": "Information",
      "Microsoft": "Warning",
      "Microsoft.EntityFrameworkCore": "Information",
      "Microsoft.EntityFrameworkCore.Database.Command": "Information",
      "ConnectSoft": "Information"
    }
  },

  "ConnectionStrings": {
    "DefaultConnection": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-test-eus.vault.azure.net/secrets/SqlConnectionString)",
    "Redis": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-test-eus.vault.azure.net/secrets/RedisConnectionString)",
    "ServiceBus": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-test-eus.vault.azure.net/secrets/ServiceBusConnectionString)",
    "CosmosDb": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-test-eus.vault.azure.net/secrets/CosmosConnectionString)"
  },

  "Audit": {
    "EnableImmutability": false,
    "EnableTamperEvidence": true,
    "EnableHashChaining": true,
    "RetentionDays": 90,
    "WormStorage": false,
    "SegmentSize": 10000,
    "SealInterval": "PT1H",
    "IntegrityVerification": {
      "OnRead": true,
      "Scheduled": true,
      "ScheduleCron": "0 2 * * *",
      "SampleRate": 0.5
    }
  },

  "Compliance": {
    "StrictInDevelopment": false,
    "EnableLoggingRedaction": true,
    "SimulateComplianceChecks": true,
    "Profile": "test",
    "EnforceGDPR": true,
    "EnforceHIPAA": false,
    "EnforceSOC2": false,
    "AllowTestData": true
  },

  "OpenTelemetry": {
    "ServiceName": "atp-ingestion-test",
    "ExporterEndpoint": "http://otel-collector-test:4317",
    "SamplingRatio": 0.5,
    "ExportIntervalSeconds": 10,
    "EnableConsoleExporter": false,
    "Attributes": {
      "environment": "test",
      "region": "eus"
    }
  },

  "FeatureManagement": {
    "TamperEvidenceV2": true,
    "AdvancedQueryFilters": true,
    "AIAssistedAnomalyDetection": false,
    "ExperimentalFeatures": false
  },

  "RateLimiting": {
    "Enabled": true,
    "PermitLimit": 1000,
    "Window": 60,
    "QueueLimit": 100
  },

  "KeyVault": {
    "Endpoint": "https://atp-keyvault-test-eus.vault.azure.net/"
  },

  "Swagger": {
    "Enabled": true
  }
}

Staging Environment (appsettings.Staging.json)

Purpose: Pre-production validation with production-equivalent configuration, full compliance enforcement, and realistic load testing capabilities.

{
  "Logging": {
    "LogLevel": {
      "Default": "Warning",
      "Microsoft": "Error",
      "Microsoft.EntityFrameworkCore": "Warning",
      "ConnectSoft": "Warning"
    }
  },

  "ConnectionStrings": {
    "DefaultConnection": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-staging-eus.vault.azure.net/secrets/SqlConnectionString)",
    "Redis": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-staging-eus.vault.azure.net/secrets/RedisConnectionString)",
    "ServiceBus": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-staging-eus.vault.azure.net/secrets/ServiceBusConnectionString)",
    "CosmosDb": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-staging-eus.vault.azure.net/secrets/CosmosConnectionString)"
  },

  "Audit": {
    "EnableImmutability": true,
    "EnableTamperEvidence": true,
    "EnableHashChaining": true,
    "RetentionDays": 2555,
    "WormStorage": true,
    "SegmentSize": 100000,
    "SealInterval": "PT15M",
    "IntegrityVerification": {
      "OnRead": true,
      "Scheduled": true,
      "ScheduleCron": "0 */6 * * *",
      "SampleRate": 0.2
    }
  },

  "Compliance": {
    "StrictInDevelopment": false,
    "EnableLoggingRedaction": true,
    "SimulateComplianceChecks": false,
    "Profile": "staging",
    "EnforceGDPR": true,
    "EnforceHIPAA": true,
    "EnforceSOC2": true,
    "AllowTestData": false,
    "AuditTrail": {
      "Enabled": true,
      "RetentionYears": 7,
      "ImmutableStorage": true
    }
  },

  "OpenTelemetry": {
    "ServiceName": "atp-ingestion-staging",
    "ExporterEndpoint": "https://otel-collector-staging.connectsoft.local:4317",
    "SamplingRatio": 0.25,
    "ExportIntervalSeconds": 30,
    "EnableConsoleExporter": false,
    "AdaptiveSampling": {
      "Enabled": true,
      "MaxTelemetryItemsPerSecond": 50
    },
    "Attributes": {
      "environment": "staging",
      "region": "eus"
    }
  },

  "FeatureManagement": {
    "TamperEvidenceV2": {
      "EnabledFor": [
        {
          "Name": "Percentage",
          "Parameters": {
            "Value": 50
          }
        }
      ]
    },
    "AdvancedQueryFilters": true,
    "AIAssistedAnomalyDetection": {
      "EnabledFor": [
        {
          "Name": "TargetingFilter",
          "Parameters": {
            "Audience": {
              "Users": ["staging-tenant-001"]
            }
          }
        }
      ]
    },
    "ExperimentalFeatures": false
  },

  "RateLimiting": {
    "Enabled": true,
    "PermitLimit": 500,
    "Window": 60,
    "QueueLimit": 100,
    "ByTenant": true,
    "ByIPAddress": true
  },

  "Security": {
    "RequireHttps": true,
    "HstsEnabled": true,
    "HstsMaxAge": 31536000,
    "ContentSecurityPolicy": "default-src 'self'; script-src 'self' 'unsafe-inline'; style-src 'self' 'unsafe-inline'"
  },

  "KeyVault": {
    "Endpoint": "https://atp-keyvault-staging-eus.vault.azure.net/"
  },

  "AppConfig": {
    "ConnectionString": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-staging-eus.vault.azure.net/secrets/AppConfigConnectionString)",
    "RefreshInterval": "00:05:00"
  },

  "Swagger": {
    "Enabled": false
  }
}
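
The `EnableHashChaining` and `IntegrityVerification` settings above refer to ATP's tamper-evidence scheme. As an illustrative sketch (not ATP's actual implementation), hash chaining links each audit record to the digest of its predecessor, so any retroactive edit invalidates every subsequent link:

```python
import hashlib
import json

GENESIS = "0" * 64  # hash seed for the first record in a segment

def chain_hash(prev_hash: str, record: dict) -> str:
    """Digest covering the previous link's hash plus the canonicalized record."""
    payload = prev_hash + json.dumps(record, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def build_chain(records):
    """Return (record, hash) links; each hash transitively covers all prior records."""
    links, prev = [], GENESIS
    for record in records:
        prev = chain_hash(prev, record)
        links.append((record, prev))
    return links

def verify_chain(links) -> bool:
    """Recompute every link; a tampered record invalidates it and all links after it."""
    prev = GENESIS
    for record, stored in links:
        prev = chain_hash(prev, record)
        if prev != stored:
            return False
    return True
```

This is the property the scheduled integrity verification samples for: modifying any sealed record changes its digest and every digest downstream of it.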

Production Environment (appsettings.Production.json)

Purpose: Live tenant traffic with maximum security, full compliance enforcement, optimized performance, and strict monitoring.

{
  "Logging": {
    "LogLevel": {
      "Default": "Warning",
      "Microsoft": "Error",
      "Microsoft.EntityFrameworkCore": "Error",
      "ConnectSoft": "Warning"
    },
    "ApplicationInsights": {
      "LogLevel": {
        "Default": "Warning",
        "ConnectSoft": "Warning"
      }
    }
  },

  "ConnectionStrings": {
    "DefaultConnection": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-prod-eus.vault.azure.net/secrets/SqlConnectionString)",
    "Redis": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-prod-eus.vault.azure.net/secrets/RedisConnectionString)",
    "ServiceBus": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-prod-eus.vault.azure.net/secrets/ServiceBusConnectionString)",
    "CosmosDb": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-prod-eus.vault.azure.net/secrets/CosmosConnectionString)"
  },

  "Audit": {
    "EnableImmutability": true,
    "EnableTamperEvidence": true,
    "EnableHashChaining": true,
    "RetentionDays": 2555,
    "WormStorage": true,
    "SegmentSize": 100000,
    "SealInterval": "PT15M",
    "MaxBatchSize": 10000,
    "BatchTimeoutSeconds": 60,
    "IntegrityVerification": {
      "OnRead": true,
      "Scheduled": true,
      "ScheduleCron": "0 2 * * *",
      "SampleRate": 0.1,
      "ParallelVerification": true,
      "MaxDegreeOfParallelism": 4
    }
  },

  "Compliance": {
    "StrictInDevelopment": false,
    "EnableLoggingRedaction": true,
    "SimulateComplianceChecks": false,
    "Profile": "production",
    "EnforceGDPR": true,
    "EnforceHIPAA": true,
    "EnforceSOC2": true,
    "AllowTestData": false,
    "AuditTrail": {
      "Enabled": true,
      "RetentionYears": 7,
      "ImmutableStorage": true,
      "EncryptionAtRest": true,
      "EncryptionInTransit": true
    }
  },

  "OpenTelemetry": {
    "ServiceName": "atp-ingestion-prod",
    "ServiceVersion": "1.0.0",
    "ExporterEndpoint": "https://otel-collector-prod.connectsoft.local:4317",
    "SamplingRatio": 0.1,
    "ExportIntervalSeconds": 60,
    "EnableConsoleExporter": false,
    "EnableJaegerExporter": false,
    "AdaptiveSampling": {
      "Enabled": true,
      "MaxTelemetryItemsPerSecond": 10,
      "SamplingPercentage": {
        "Default": 10,
        "OnError": 100,
        "SlowRequests": 100
      }
    },
    "Attributes": {
      "environment": "prod",
      "region": "eus",
      "datacenter": "azure-eastus"
    }
  },

  "FeatureManagement": {
    "TamperEvidenceV2": true,
    "AdvancedQueryFilters": true,
    "AIAssistedAnomalyDetection": {
      "EnabledFor": [
        {
          "Name": "Percentage",
          "Parameters": {
            "Value": 10
          }
        }
      ]
    },
    "ExperimentalFeatures": false,
    "PerformanceOptimizations": true
  },

  "RateLimiting": {
    "Enabled": true,
    "PermitLimit": 100,
    "Window": 60,
    "QueueLimit": 50,
    "ByTenant": true,
    "ByIPAddress": true,
    "ByUser": true,
    "Strategy": "TokenBucket"
  },

  "Caching": {
    "DefaultSlidingExpiration": "00:15:00",
    "DefaultAbsoluteExpiration": "01:00:00",
    "Enabled": true,
    "DistributedCache": true,
    "CompressionEnabled": true
  },

  "Security": {
    "RequireHttps": true,
    "HstsEnabled": true,
    "HstsMaxAge": 31536000,
    "HstsIncludeSubdomains": true,
    "HstsPreload": true,
    "ContentSecurityPolicy": "default-src 'self'; script-src 'self'; style-src 'self'; img-src 'self' data: https:; font-src 'self'; connect-src 'self'; frame-ancestors 'none'",
    "XFrameOptions": "DENY",
    "XContentTypeOptions": "nosniff",
    "ReferrerPolicy": "strict-origin-when-cross-origin",
    "PermissionsPolicy": "geolocation=(), microphone=(), camera=()"
  },

  "HighAvailability": {
    "MultiRegion": true,
    "PrimaryRegion": "eastus",
    "SecondaryRegion": "westeurope",
    "TrafficDistribution": "80-20",
    "FailoverMode": "Automatic",
    "HealthCheckInterval": 30,
    "HealthCheckTimeout": 5,
    "UnhealthyThreshold": 3
  },

  "HealthChecks": {
    "Enabled": true,
    "PollingIntervalSeconds": 30,
    "TimeoutSeconds": 10,
    "IncludeDependencies": true,
    "PublishToApplicationInsights": true
  },

  "KeyVault": {
    "Endpoint": "https://atp-keyvault-prod-eus.vault.azure.net/",
    "CacheExpirationMinutes": 60,
    "ReloadInterval": "00:15:00"
  },

  "AppConfig": {
    "ConnectionString": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-prod-eus.vault.azure.net/secrets/AppConfigConnectionString)",
    "RefreshInterval": "00:05:00",
    "UseFeatureFlags": true,
    "FeatureFlagRefreshInterval": "00:01:00"
  },

  "Swagger": {
    "Enabled": false
  },

  "Cors": {
    "AllowedOrigins": ["https://portal.connectsoft.com"],
    "AllowCredentials": false
  }
}
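
The production `RateLimiting` block selects a token-bucket strategy. A minimal sketch of the algorithm (an illustration, not the ASP.NET Core rate limiter itself), treating `PermitLimit` as the bucket capacity refilled over the `Window`:

```python
class TokenBucket:
    """Token bucket: a bucket of `capacity` tokens refilled at a steady rate;
    each request spends one token or is rejected, allowing short bursts
    while enforcing the average rate."""

    def __init__(self, capacity: int, refill_per_second: float):
        self.capacity = capacity
        self.tokens = float(capacity)
        self.refill_per_second = refill_per_second
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_second)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# PermitLimit=100 over a 60-second Window ≈ capacity 100, refilled at 100/60 tokens/s
bucket = TokenBucket(capacity=100, refill_per_second=100 / 60)
```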

Azure App Configuration Integration

Purpose: Centralized dynamic configuration and feature flags that can be updated without redeployment. Used in Staging and Production only.

Configuration Structure (Azure App Configuration):

# Key-value pairs (keys share the ATP: prefix; environments are distinguished by label)
ATP:Ingestion:MaxBatchSize = 10000 (label: prod)
ATP:Ingestion:MaxBatchSize = 1000 (label: staging)

ATP:RateLimiting:PermitLimit = 100 (label: prod)
ATP:RateLimiting:PermitLimit = 500 (label: staging)

# Feature Flags
TamperEvidenceV2 = true (label: prod)
AIAssistedAnomalyDetection = true (label: prod, percentage: 10%)

C# Integration Example:

// Program.cs — connect to Azure App Configuration with the environment label
public static IHostBuilder CreateHostBuilder(string[] args) =>
    Host.CreateDefaultBuilder(args)
        .ConfigureAppConfiguration((context, config) =>
        {
            var settings = config.Build();
            config.AddAzureAppConfiguration(options =>
                options.Connect(settings["AppConfig:ConnectionString"])
                       .Select(KeyFilter.Any,
                               labelFilter: context.HostingEnvironment.IsProduction() ? "prod" : "staging")
                       .UseFeatureFlags());
        })
        .ConfigureWebHostDefaults(webBuilder => webBuilder.UseStartup<Startup>());

// Startup.cs
public void ConfigureServices(IServiceCollection services)
{
    // Register App Configuration services (required by the refresh middleware)
    services.AddAzureAppConfiguration();

    // Add Feature Management with the filters used by the flags above
    services.AddFeatureManagement()
            .AddFeatureFilter<PercentageFilter>()
            .AddFeatureFilter<TargetingFilter>()
            .AddFeatureFilter<TimeWindowFilter>();

    // Bind configuration sections to strongly typed options
    services.Configure<AuditOptions>(Configuration.GetSection("Audit"));
    services.Configure<ComplianceOptions>(Configuration.GetSection("Compliance"));
    services.Configure<OpenTelemetryOptions>(Configuration.GetSection("OpenTelemetry"));
}

// Middleware that refreshes App Configuration values on incoming requests
public void Configure(IApplicationBuilder app)
{
    app.UseAzureAppConfiguration();
}

Feature Flag Usage:

// Service implementation with feature flag
public class AuditService : IAuditService
{
    private readonly IFeatureManager _featureManager;

    public AuditService(IFeatureManager featureManager)
    {
        _featureManager = featureManager;
    }

    public async Task<AuditResult> RecordEventAsync(AuditEvent auditEvent)
    {
        // Check if new tamper evidence is enabled
        if (await _featureManager.IsEnabledAsync("TamperEvidenceV2"))
        {
            return await RecordWithTamperEvidenceV2Async(auditEvent);
        }
        else
        {
            return await RecordWithLegacyTamperEvidenceAsync(auditEvent);
        }
    }
}
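
The Percentage filter used for `TamperEvidenceV2` in Staging gates a flag for a fraction of evaluations; sticky bucketing, as the targeting filter's audience percentage provides, instead hashes a stable identifier into a bucket so the same subject always gets the same answer. A sketch of the sticky approach, with hypothetical names:

```python
import hashlib

def in_rollout(flag: str, subject_id: str, percentage: int) -> bool:
    """Deterministically bucket subject_id into 0-99 for this flag.
    A fixed subject always lands in the same bucket, so raising the
    percentage only ever adds subjects to the rollout."""
    digest = hashlib.sha256(f"{flag}:{subject_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percentage
```

Keying the hash on both flag name and subject keeps rollout populations independent across flags.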

Environment Variables & Container Overrides

Purpose: Override configuration at runtime without modifying appsettings.json. Useful for containerized deployments (AKS, Docker Compose) and CI/CD pipelines.

Kubernetes ConfigMap Example:

# atp-ingestion-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: atp-ingestion-config
  namespace: atp-prod
data:
  ASPNETCORE_ENVIRONMENT: "Production"
  ATP_Audit__SegmentSize: "100000"
  ATP_OpenTelemetry__SamplingRatio: "0.1"
  ATP_RateLimiting__PermitLimit: "100"
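
The `ATP_` prefix and double underscores follow ASP.NET Core's environment-variable configuration convention: `__` maps to the `:` section separator, and a custom prefix is stripped when the host registers it (e.g. `config.AddEnvironmentVariables(prefix: "ATP_")`). The mapping itself is simple:

```python
def env_to_config_key(env_name: str, prefix: str = "ATP_") -> str:
    """Mirror ASP.NET Core's mapping: strip the registered prefix, then
    translate '__' into the ':' configuration section separator."""
    if env_name.startswith(prefix):
        env_name = env_name[len(prefix):]
    return env_name.replace("__", ":")
```

So `ATP_Audit__SegmentSize` overrides the `Audit:SegmentSize` value from appsettings.json.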

Kubernetes Deployment with ConfigMap:

# atp-ingestion-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
  namespace: atp-prod
spec:
  replicas: 3
  template:
    spec:
      containers:
      - name: atp-ingestion
        image: connectsoft.azurecr.io/atp/ingestion:1.0.0
        envFrom:
        - configMapRef:
            name: atp-ingestion-config
        env:
        - name: ConnectionStrings__DefaultConnection
          valueFrom:
            secretKeyRef:
              name: atp-sql-secret
              key: connectionString

Docker Compose Example (Local Development):

version: '3.8'
services:
  atp-ingestion:
    image: atp-ingestion:dev
    environment:
      - ASPNETCORE_ENVIRONMENT=Development
      - ATP_Audit__EnableImmutability=false
      - ATP_ConnectionStrings__Redis=redis:6379
      - ATP_ConnectionStrings__ServiceBus=amqp://rabbitmq:5672  # RabbitMQ stand-in, reached by service name
    depends_on:
      - redis
      - rabbitmq

Key Vault Secret References

Purpose: Store sensitive configuration (connection strings, API keys, certificates) in Azure Key Vault with managed identity access.

Key Vault Secret Naming Convention:

# Vault names already encode environment and region (atp-keyvault-{env}-{region}),
# so secret names identify only the secret type:
SqlConnectionString
RedisConnectionString
ServiceBusConnectionString
CosmosConnectionString
AppConfigConnectionString
StorageAccountConnectionString
JwtSigningKey
EncryptionKey
CertificatePassword
ExternalApiKey

Reference in appsettings.json (Production):

{
  "ConnectionStrings": {
    "DefaultConnection": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-prod-eus.vault.azure.net/secrets/SqlConnectionString)",
    "Redis": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-prod-eus.vault.azure.net/secrets/RedisConnectionString)"
  }
}

Programmatic Key Vault Access:

// Program.cs - Manual Key Vault integration
public static IHostBuilder CreateHostBuilder(string[] args) =>
    Host.CreateDefaultBuilder(args)
        .ConfigureAppConfiguration((context, config) =>
        {
            if (context.HostingEnvironment.IsProduction())
            {
                var builtConfig = config.Build();
                var keyVaultEndpoint = builtConfig["KeyVault:Endpoint"];

                // Use Managed Identity for authentication
                var credential = new DefaultAzureCredential();

                config.AddAzureKeyVault(
                    new Uri(keyVaultEndpoint),
                    credential,
                    new AzureKeyVaultConfigurationOptions
                    {
                        ReloadInterval = TimeSpan.FromMinutes(15)
                    });
            }
        });

Managed Identity Configuration (Azure App Service):

# Enable system-assigned managed identity
az webapp identity assign \
  --name atp-ingestion-prod-eus \
  --resource-group ConnectSoft-ATP-Prod-EUS-RG

# Grant Key Vault access to managed identity
az keyvault set-policy \
  --name atp-keyvault-prod-eus \
  --object-id <managed-identity-object-id> \
  --secret-permissions get list

Configuration Validation

Purpose: Validate configuration at startup to prevent runtime errors from misconfigured settings.

Options Validation Example:

// AuditOptions.cs
public class AuditOptions : IValidatableObject
{
    public bool EnableImmutability { get; set; }
    public bool EnableTamperEvidence { get; set; }
    public int RetentionDays { get; set; }
    public bool WormStorage { get; set; }
    public int SegmentSize { get; set; }

    public IEnumerable<ValidationResult> Validate(ValidationContext validationContext)
    {
        if (RetentionDays < 1)
        {
            yield return new ValidationResult(
                "RetentionDays must be at least 1 day",
                new[] { nameof(RetentionDays) });
        }

        if (EnableImmutability && !WormStorage)
        {
            yield return new ValidationResult(
                "WormStorage must be enabled when EnableImmutability is true",
                new[] { nameof(WormStorage) });
        }

        if (SegmentSize < 100 || SegmentSize > 1000000)
        {
            yield return new ValidationResult(
                "SegmentSize must be between 100 and 1,000,000",
                new[] { nameof(SegmentSize) });
        }
    }
}

// Startup.cs
public void ConfigureServices(IServiceCollection services)
{
    services.AddOptions<AuditOptions>()
            .Bind(Configuration.GetSection("Audit"))
            .ValidateDataAnnotations()
            .ValidateOnStart(); // Fail fast on startup if invalid
}

Startup Configuration Validation:

// Program.cs - Validate critical configuration
public static void Main(string[] args)
{
    var host = CreateHostBuilder(args).Build();

    // Validate configuration before starting
    var logger = host.Services.GetRequiredService<ILogger<Program>>();

    try
    {
        var auditOptions = host.Services.GetRequiredService<IOptions<AuditOptions>>().Value;
        var complianceOptions = host.Services.GetRequiredService<IOptions<ComplianceOptions>>().Value;

        logger.LogInformation("✅ Configuration validated successfully");
        logger.LogInformation("Audit - Immutability: {Immutability}, Retention: {Retention} days",
            auditOptions.EnableImmutability, auditOptions.RetentionDays);
    }
    catch (OptionsValidationException ex)
    {
        logger.LogCritical(ex, "❌ Configuration validation failed");
        throw;
    }

    host.Run();
}

Configuration Best Practices

Security:

  1. Never commit secrets: Use .gitignore to exclude appsettings.*.local.json files.
  2. Use Managed Identity: Avoid connection strings with passwords; use Managed Identity authentication.
  3. Rotate secrets: Implement automated secret rotation (90-day cycle for Production).
  4. Encrypt sensitive data: Use Data Protection API for configuration encryption at rest.

Maintainability:

  1. Environment parity: Staging should mirror Production configuration (except scale/cost).
  2. Configuration as code: Store appsettings.json files in source control; manage Key Vault secrets via IaC (Pulumi/Bicep).
  3. Validation: Use Options pattern with validation to fail fast on misconfiguration.
  4. Documentation: Comment complex configuration sections; link to ADRs for architectural decisions.

Performance:

  1. Cache configuration: Reload Key Vault secrets every 15 minutes (not every request).
  2. Minimize App Configuration calls: Use 5-minute refresh interval; sentinel key pattern for bulk refresh.
  3. Local caching: Cache expensive computations (feature flag evaluations, connection string parsing).

Configuration Deployment Workflow

Pipeline Integration (Azure DevOps):

# azure-pipelines.yml - Deploy with environment-specific config
- task: AzureKeyVault@2
  displayName: 'Fetch secrets from Key Vault'
  inputs:
    azureSubscription: $(azureSubscription)
    KeyVaultName: 'atp-keyvault-$(environment)-eus'
    SecretsFilter: '*'
    RunAsPreJob: true

- task: FileTransform@1
  displayName: 'Transform appsettings.json'
  inputs:
    folderPath: '$(System.DefaultWorkingDirectory)/publish'
    fileType: 'json'
    targetFiles: '**/appsettings.$(environment).json'

- task: AzureWebApp@1
  displayName: 'Deploy to Azure App Service'
  inputs:
    azureSubscription: $(azureSubscription)
    appName: 'atp-ingestion-$(environment)-eus'
    package: '$(System.DefaultWorkingDirectory)/publish/*.zip'
    appSettings: |
      -ASPNETCORE_ENVIRONMENT "$(environment)"
      -KeyVault__Endpoint "https://atp-keyvault-$(environment)-eus.vault.azure.net/"

Summary

  • Configuration Hierarchy: Layered approach from base defaults → environment overrides → Azure App Configuration → environment variables → Key Vault.
  • Environment-Specific: Each environment has tailored configuration balancing developer productivity (Dev) with production security (Prod).
  • Secret Management: All sensitive data stored in Key Vault with Managed Identity access; no secrets in source control.
  • Dynamic Configuration: Azure App Configuration enables feature flags and runtime config changes without redeployment.
  • Validation: Options pattern with startup validation ensures misconfiguration is caught early.
  • Best Practices: Security-first approach with encrypted secrets, managed identities, and automated rotation.

Secrets & Key Management

ATP implements a defense-in-depth approach to secrets management with Azure Key Vault as the centralized secret store, Managed Identities for authentication, and environment-specific access controls. This strategy keeps secrets out of source control, enforces the principle of least privilege, and automates rotation for production environments.

Secrets management follows the zero-trust security model: lower environments (Dev/Test) balance developer productivity with basic security, while higher environments (Staging/Production) enforce strict controls with zero human access to production secrets.

Key Vault Per Environment

Each environment has a dedicated Key Vault with appropriate access controls, audit logging, and compliance configurations tailored to its security requirements.

Dev Environment Key Vault

Name: atp-keyvault-dev-eus

Purpose: Developer-accessible secrets for local development and CI/CD testing with relaxed security for rapid iteration.

Configuration:

# Dev Key Vault (atp-keyvault-dev-eus)
properties:
  sku: Standard
  tenantId: <tenant-id>

  accessPolicies:
    - tenantId: <tenant-id>
      objectId: <developers-aad-group-id>
      permissions:
        secrets: [get, list, set, delete]  # Full access for developers
        keys: [get, list]
        certificates: [get, list]

    - tenantId: <tenant-id>
      objectId: <ci-cd-service-principal-id>
      permissions:
        secrets: [get, list]  # Read-only for pipelines

  enabledForDeployment: true
  enabledForTemplateDeployment: true
  enabledForDiskEncryption: false

  enableSoftDelete: true
  softDeleteRetentionInDays: 7  # Minimum retention
  enablePurgeProtection: false  # Allow purge for cleanup

  networkAcls:
    bypass: AzureServices
    defaultAction: Allow  # Public access for developer convenience
    ipRules: []
    virtualNetworkRules: []

  publicNetworkAccess: Enabled

Access Control:

  • Developers: Full access (get, list, set, delete secrets) for local debugging.
  • CI/CD Pipelines: Read-only access (get, list secrets) for automated deployments.
  • No MFA Required: Developer convenience prioritized over strict security.

Secret Characteristics:

  • Rotation: Manual (on-demand when compromised).
  • Audit Logging: 30-day retention in Log Analytics.
  • Backup: Not required (ephemeral development secrets).

Example Secrets (Dev):

# Connection strings (non-production databases)
az keyvault secret set \
  --vault-name atp-keyvault-dev-eus \
  --name SqlConnectionString \
  --value "Server=atp-sql-dev-eus.database.windows.net;Database=ATP_Dev;User Id=devuser;Password=DevP@ss123!"

# Shared API keys (development tier)
az keyvault secret set \
  --vault-name atp-keyvault-dev-eus \
  --name ExternalApiKey \
  --value "dev-api-key-12345"

# JWT signing key (fixed for dev)
az keyvault secret set \
  --vault-name atp-keyvault-dev-eus \
  --name JwtSigningKey \
  --value "dev-jwt-secret-key-do-not-use-in-prod"

Test Environment Key Vault

Name: atp-keyvault-test-eus

Purpose: Test automation with service principal access and moderate security controls for QA validation.

Configuration:

# Test Key Vault (atp-keyvault-test-eus)
properties:
  sku: Standard
  tenantId: <tenant-id>

  accessPolicies:
    - tenantId: <tenant-id>
      objectId: <qa-team-aad-group-id>
      permissions:
        secrets: [get, list]  # Read-only for QA
        keys: [get, list]
        certificates: [get, list]

    - tenantId: <tenant-id>
      objectId: <test-automation-service-principal-id>
      permissions:
        secrets: [get, list]  # Read-only for test automation

    - tenantId: <tenant-id>
      objectId: <atp-test-managed-identity-id>
      permissions:
        secrets: [get, list]  # Managed identity for Test services

  enabledForDeployment: true
  enabledForTemplateDeployment: true
  enabledForDiskEncryption: false

  enableSoftDelete: true
  softDeleteRetentionInDays: 30
  enablePurgeProtection: false

  networkAcls:
    bypass: AzureServices
    defaultAction: Deny
    ipRules:
      - value: <ci-cd-agent-ip>  # CI/CD agent pool
      - value: <qa-team-office-ip>  # QA team office
    virtualNetworkRules:
      - id: /subscriptions/<sub-id>/resourceGroups/ATP-Test-RG/providers/Microsoft.Network/virtualNetworks/atp-vnet-shared-eus/subnets/Test-Subnet

  publicNetworkAccess: Enabled

Access Control:

  • QA Team: Read-only access (view secrets for troubleshooting).
  • Test Automation: Read-only service principal access.
  • Managed Identity: Test environment services use managed identity (no keys in config).

Secret Characteristics:

  • Rotation: Quarterly (90-day cycle).
  • Audit Logging: 90-day retention in Log Analytics.
  • Backup: Daily automated backups (30-day retention).

Example Secrets (Test):

# Connection strings (with Key Vault references in appsettings.Test.json)
az keyvault secret set \
  --vault-name atp-keyvault-test-eus \
  --name SqlConnectionString \
  --value "Server=atp-sql-test-eus.database.windows.net;Database=ATP_Test;User Id=testuser;Password=$(Generate-SecurePassword)"

# Test-specific certificates
az keyvault certificate import \
  --vault-name atp-keyvault-test-eus \
  --name MtlsClientCertificate \
  --file test-client-cert.pfx \
  --password <pfx-password>

Staging Environment Key Vault

Name: atp-keyvault-staging-eus

Purpose: Production-like security with restricted access, private endpoints, and compliance controls for pre-production validation.

Configuration:

# Staging Key Vault (atp-keyvault-staging-eus)
properties:
  sku: Premium  # HSM-backed keys
  tenantId: <tenant-id>

  accessPolicies:
    - tenantId: <tenant-id>
      objectId: <platform-team-aad-group-id>
      permissions:
        secrets: [get, list]  # Read-only for platform team
        keys: [get, list]
        certificates: [get, list]

    - tenantId: <tenant-id>
      objectId: <atp-staging-managed-identity-id>
      permissions:
        secrets: [get, list]  # Managed identity only
        keys: [get, unwrapKey, wrapKey]  # For encryption operations
        certificates: [get]

  enabledForDeployment: false  # Prevent VM deployments
  enabledForTemplateDeployment: true  # IaC deployments only
  enabledForDiskEncryption: false

  enableSoftDelete: true
  softDeleteRetentionInDays: 90
  enablePurgeProtection: true  # Cannot purge deleted secrets

  networkAcls:
    bypass: AzureServices
    defaultAction: Deny
    ipRules: []  # No public IP access
    virtualNetworkRules:
      - id: /subscriptions/<sub-id>/resourceGroups/ATP-Staging-RG/providers/Microsoft.Network/virtualNetworks/atp-vnet-staging-eus/subnets/Services-Subnet
      - id: /subscriptions/<sub-id>/resourceGroups/ATP-Staging-RG/providers/Microsoft.Network/virtualNetworks/atp-vnet-staging-eus/subnets/Data-Subnet

  publicNetworkAccess: Disabled  # Private endpoint only

  privateEndpointConnections:
    - privateLinkServiceConnectionState:
        status: Approved
      privateEndpoint:
        id: /subscriptions/<sub-id>/resourceGroups/ATP-Staging-RG/providers/Microsoft.Network/privateEndpoints/atp-kv-staging-pe

Access Control:

  • Platform Team: Read-only access (view secrets for troubleshooting) with MFA required.
  • Managed Identity: Only authentication method for services (no service principals).
  • No Developer Access: Prohibited.

Secret Characteristics:

  • Rotation: Monthly (30-day cycle with automated rotation).
  • Audit Logging: 90-day retention with Azure Sentinel integration.
  • Backup: Daily automated backups with geo-redundancy (365-day retention).
  • HSM-Backed Keys: Premium tier with hardware security module protection.

Example Secrets (Staging):

# Connection strings (production-equivalent)
az keyvault secret set \
  --vault-name atp-keyvault-staging-eus \
  --name SqlConnectionString \
  --value "Server=atp-sql-staging-eus.database.windows.net;Database=ATP_Staging;Authentication=Active Directory Managed Identity;" \
  --description "Staging SQL Connection (Managed Identity)" \
  --tags Environment=Staging Compliance=GDPR,HIPAA,SOC2

# Per-tenant encryption keys
az keyvault key create \
  --vault-name atp-keyvault-staging-eus \
  --name TenantKEK-staging-tenant-001 \
  --kty RSA-HSM \
  --size 4096 \
  --ops wrapKey unwrapKey \
  --protection hsm

Production Environment Key Vault

Name: atp-keyvault-prod-eus

Purpose: Maximum security with zero human access, private endpoints only, HSM-backed keys, and full compliance enforcement.

Configuration:

# Production Key Vault (atp-keyvault-prod-eus)
properties:
  sku: Premium  # HSM-backed keys
  tenantId: <tenant-id>

  accessPolicies:
    - tenantId: <tenant-id>
      objectId: <atp-prod-managed-identity-id>
      permissions:
        secrets: [get, list]  # Managed identity ONLY
        keys: [get, unwrapKey, wrapKey, decrypt, encrypt]
        certificates: [get]

    - tenantId: <tenant-id>
      objectId: <break-glass-emergency-group-id>
      permissions:
        secrets: [get]  # Break-glass read-only (audited)

      # Conditional Access Policy Required:
      # - MFA enforced
      # - Compliant device required
      # - Trusted location (VPN) required
      # - Just-in-Time access (max 4 hours)

  enabledForDeployment: false
  enabledForTemplateDeployment: false  # Prevent any template deployments
  enabledForDiskEncryption: false

  enableSoftDelete: true
  softDeleteRetentionInDays: 90
  enablePurgeProtection: true

  networkAcls:
    bypass: None  # Strict: no bypasses
    defaultAction: Deny
    ipRules: []
    virtualNetworkRules:
      - id: /subscriptions/<sub-id>/resourceGroups/ATP-Prod-RG/providers/Microsoft.Network/virtualNetworks/atp-vnet-prod-eus/subnets/AKS-Nodes
      - id: /subscriptions/<sub-id>/resourceGroups/ATP-Prod-RG/providers/Microsoft.Network/virtualNetworks/atp-vnet-prod-eus/subnets/Private-Endpoints

  publicNetworkAccess: Disabled

  privateEndpointConnections:
    - privateLinkServiceConnectionState:
        status: Approved
      privateEndpoint:
        id: /subscriptions/<sub-id>/resourceGroups/ATP-Prod-RG/providers/Microsoft.Network/privateEndpoints/atp-kv-prod-pe

Access Control:

  • Managed Identity Only: No human access under normal operations.
  • Break-Glass Access: Emergency access group with strict conditional access (MFA, compliant device, VPN, JIT approval, max 4-hour access).
  • Zero Developer Access: Absolutely prohibited.
  • Zero Service Principal Access: Only managed identities allowed.

Secret Characteristics:

  • Rotation: Monthly automated rotation (30-day cycle with zero-downtime).
  • Audit Logging: 365-day retention with Azure Sentinel + SIEM integration; real-time alerting on all access.
  • Backup: Continuous backup with geo-redundancy (7-year retention for compliance).
  • HSM-Backed Keys: All keys stored in FIPS 140-2 Level 3 HSM.
  • Private Endpoint Only: No public internet access.

Example Secrets (Production):

# Connection strings (managed identity only)
az keyvault secret set \
  --vault-name atp-keyvault-prod-eus \
  --name SqlConnectionString \
  --value "Server=atp-sql-prod-eus.database.windows.net;Database=ATP_Prod;Authentication=Active Directory Managed Identity;" \
  --description "Production SQL Connection (Managed Identity Only)" \
  --tags Environment=Production Compliance=GDPR,HIPAA,SOC2 Criticality=High AutoRotate=true

# Per-tenant encryption keys (HSM-backed)
az keyvault key create \
  --vault-name atp-keyvault-prod-eus \
  --name TenantKEK-tenant-12345 \
  --kty RSA-HSM \
  --size 4096 \
  --ops wrapKey unwrapKey \
  --protection hsm \
  --tags TenantId=tenant-12345 KeyType=KEK AutoRotate=true RotationDays=90

# JWT signing keys (auto-rotated)
az keyvault key create \
  --vault-name atp-keyvault-prod-eus \
  --name JwtSigningKey \
  --kty RSA-HSM \
  --size 2048 \
  --ops sign verify \
  --protection hsm \
  --tags KeyType=JwtSigning AutoRotate=true RotationDays=30

Secret Categories

ATP manages five primary secret categories, each with specific security controls, rotation policies, and access patterns.

| Secret Type | Example | Dev/Test | Staging/Prod | Rotation | Storage |
|---|---|---|---|---|---|
| Connection Strings | SQL, Redis, Service Bus, Cosmos DB | Plaintext in appsettings.json | Key Vault reference | Quarterly (Test), Monthly (Prod) | Standard Key Vault Secret |
| API Keys | Third-party integrations, webhooks | Shared dev key | Unique per environment | On-demand (Dev), Monthly (Prod) | Standard Key Vault Secret |
| Certificates | mTLS, code signing, SSL/TLS | Self-signed, long-lived | Managed certificate from KV | Annual renewal | Key Vault Certificate |
| Encryption Keys | Per-tenant KEK, DEK | Single shared key | Per-tenant HSM-backed key | 90-day rotation | Key Vault Key (HSM) |
| JWT Signing Keys | Auth tokens, API tokens | Fixed dev key | Auto-rotated RSA key | Never (Dev), 30-day (Prod) | Key Vault Key (HSM) |

Connection Strings

Purpose: Database, cache, message queue, and storage connections.

Dev/Test Pattern:

// appsettings.Development.json (plaintext for convenience)
{
  "ConnectionStrings": {
    "DefaultConnection": "Server=atp-sql-dev-eus.database.windows.net;Database=ATP_Dev;User Id=devuser;Password=DevP@ss123!",
    "Redis": "atp-redis-dev-eus.redis.cache.windows.net:6380,password=dev-redis-key,ssl=True"
  }
}

Staging/Production Pattern:

// appsettings.Production.json (Key Vault references)
{
  "ConnectionStrings": {
    "DefaultConnection": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-prod-eus.vault.azure.net/secrets/SqlConnectionString)",
    "Redis": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-prod-eus.vault.azure.net/secrets/RedisConnectionString)"
  }
}

Managed Identity Authentication (Preferred for Prod):

{
  "ConnectionStrings": {
    "DefaultConnection": "Server=atp-sql-prod-eus.database.windows.net;Database=ATP_Prod;Authentication=Active Directory Managed Identity;"
  }
}

Rotation Strategy:

  • Test: Quarterly (every 90 days).
  • Staging/Production: Monthly (every 30 days) with automated rotation via Azure Functions.

API Keys

Purpose: Third-party API integrations, webhook endpoints, external service authentication.

Dev/Test Pattern:

# Shared development key (low sensitivity by design)
az keyvault secret set \
  --vault-name atp-keyvault-dev-eus \
  --name ThirdPartyApiKey \
  --value "dev-api-key-shared-12345" \
  --tags Environment=Dev Shared=true

Production Pattern:

# Unique per-environment key with metadata
az keyvault secret set \
  --vault-name atp-keyvault-prod-eus \
  --name ThirdPartyApiKey-Prod \
  --value $(Generate-SecureApiKey) \
  --description "Production API key for ExternalServiceX" \
  --tags Environment=Production AutoRotate=true RotationDays=30 Criticality=High \
  --expires $(date -d "+90 days" +%Y-%m-%dT%H:%M:%SZ)

Rotation Strategy:

  • Dev: On-demand (when compromised).
  • Test: Quarterly.
  • Production: Monthly automated rotation with overlap period (old key valid for 7 days during rotation).
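The 7-day overlap implies the receiving side must accept both the current and the previous key until the window closes. A minimal validator sketch (the class and field names are illustrative assumptions, not the ATP implementation):

```python
from datetime import datetime, timedelta, timezone


class ApiKeyValidator:
    """Accepts the current key always, and the previous key only during a
    fixed overlap window after rotation (7 days, matching the policy above)."""

    OVERLAP = timedelta(days=7)

    def __init__(self, current_key, previous_key=None, rotated_at=None):
        self.current_key = current_key
        self.previous_key = previous_key
        self.rotated_at = rotated_at  # UTC timestamp of the last rotation

    def is_valid(self, presented_key, now=None):
        now = now or datetime.now(timezone.utc)
        if presented_key == self.current_key:
            return True
        # Previous key is honored only while the overlap window is open.
        in_overlap = (self.previous_key is not None
                      and self.rotated_at is not None
                      and now - self.rotated_at < self.OVERLAP)
        return in_overlap and presented_key == self.previous_key
```

Once the window closes, the previous key is rejected even if it is still held by a slow-to-update client, which is exactly the failure the 7-day grace period is sized to avoid.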

Certificates

Purpose: mTLS client/server authentication, code signing, SSL/TLS certificates.

Dev/Test Pattern:

# Self-signed certificate (long-lived, 1 year)
az keyvault certificate create \
  --vault-name atp-keyvault-dev-eus \
  --name DevMtlsCertificate \
  --policy @self-signed-policy.json

Production Pattern:

# Managed certificate with auto-renewal
az keyvault certificate create \
  --vault-name atp-keyvault-prod-eus \
  --name ProdMtlsCertificate \
  --policy @cert-policy.json

# cert-policy.json (issuer may be DigiCert or an internal CA)
{
  "issuerParameters": {
    "name": "DigiCert",
    "certificateType": "OV-SSL"
  },
  "x509CertificateProperties": {
    "subject": "CN=atp-ingestion-prod.connectsoft.com",
    "subjectAlternativeNames": {
      "dnsNames": [
        "atp-ingestion-prod-eus.azurewebsites.net",
        "atp-ingestion-prod.connectsoft.com"
      ]
    },
    "validityInMonths": 12
  },
  "lifetimeActions": [
    {
      "trigger": {
        "daysBeforeExpiry": 30
      },
      "action": {
        "actionType": "AutoRenew"
      }
    },
    {
      "trigger": {
        "daysBeforeExpiry": 60
      },
      "action": {
        "actionType": "EmailContacts"
      }
    }
  ]
}

Rotation Strategy:

  • Dev/Test: Annual (365 days).
  • Production: Annual with automated renewal 30 days before expiry.

Encryption Keys (KEK/DEK)

Purpose: Per-tenant encryption keys for data-at-rest encryption, envelope encryption pattern.

Dev/Test Pattern:

# Single shared key (software-protected)
az keyvault key create \
  --vault-name atp-keyvault-dev-eus \
  --name SharedEncryptionKey \
  --kty RSA \
  --size 2048 \
  --ops encrypt decrypt wrapKey unwrapKey

Production Pattern:

# Per-tenant HSM-backed KEK
az keyvault key create \
  --vault-name atp-keyvault-prod-eus \
  --name TenantKEK-tenant-12345 \
  --kty RSA-HSM \
  --size 4096 \
  --ops wrapKey unwrapKey \
  --protection hsm \
  --tags TenantId=tenant-12345 KeyType=KEK AutoRotate=true RotationDays=90

# Data Encryption Key (DEK) wrapped by KEK
# DEK is generated per audit segment and stored encrypted in database

Envelope Encryption Pattern:

// Encryption service using a Key Vault KEK
// (wrap/unwrap operations are performed by CryptographyClient;
// KeyClient only manages keys)
public class TenantEncryptionService
{
    private readonly KeyClient _keyClient;

    public async Task<byte[]> EncryptAuditSegmentAsync(string tenantId, byte[] plaintext)
    {
        // 1. Generate ephemeral DEK (AES-256)
        var dek = GenerateDataEncryptionKey();

        // 2. Encrypt plaintext with DEK
        var ciphertext = AesEncrypt(plaintext, dek);

        // 3. Wrap DEK with tenant's KEK from Key Vault
        var kekName = $"TenantKEK-{tenantId}";
        var cryptoClient = _keyClient.GetCryptographyClient(kekName);
        var wrapResult = await cryptoClient.WrapKeyAsync(KeyWrapAlgorithm.RsaOaep256, dek);

        // 4. Store ciphertext + wrapped DEK together
        return CombineCiphertextAndWrappedKey(ciphertext, wrapResult.EncryptedKey);
    }
}

Rotation Strategy:

  • Dev/Test: Never (static key for testing).
  • Production: Quarterly (90 days) with zero-downtime rotation (new KEK version; old version remains valid for decryption).
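The zero-downtime property relies on Key Vault keeping old key versions available for unwrap while new wraps use the latest version. A toy Python model of that versioning (XOR with a random pad stands in for RSA-OAEP wrapping; this is an illustration of the version bookkeeping, not real cryptography):

```python
import secrets


class VersionedKek:
    """Toy model of Key Vault key versions: wrap with the latest version,
    unwrap with whichever version the blob was originally wrapped under."""

    def __init__(self):
        self._versions = {}   # version id -> 32-byte key material
        self._current = None
        self.rotate()         # create the initial version

    def rotate(self):
        """Create a new key version and make it current; old versions
        remain available for unwrap (decrypt-only)."""
        vid = f"v{len(self._versions) + 1}"
        self._versions[vid] = secrets.token_bytes(32)
        self._current = vid

    def wrap(self, dek: bytes):
        key = self._versions[self._current]
        return self._current, bytes(a ^ b for a, b in zip(dek, key))

    def unwrap(self, version: str, wrapped: bytes) -> bytes:
        key = self._versions[version]  # old versions stay resolvable
        return bytes(a ^ b for a, b in zip(wrapped, key))
```

Because every wrapped DEK is stored alongside its KEK version id, rotation never requires re-encrypting existing audit segments; they are lazily re-wrapped (or left as-is) while new segments use the new version.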

JWT Signing Keys

Purpose: Cryptographic signing keys for JWT tokens, API authentication, service-to-service communication.

Dev/Test Pattern:

# Fixed symmetric key (HS256)
az keyvault secret set \
  --vault-name atp-keyvault-dev-eus \
  --name JwtSigningKey \
  --value "dev-jwt-secret-key-do-not-use-in-production-256bit"

Production Pattern:

# Asymmetric RSA key pair (RS256, HSM-backed)
az keyvault key create \
  --vault-name atp-keyvault-prod-eus \
  --name JwtSigningKey \
  --kty RSA-HSM \
  --size 2048 \
  --ops sign verify \
  --protection hsm \
  --tags KeyType=JwtSigning AutoRotate=true RotationDays=30

# Public key published to JWKS endpoint for verification

JWT Signing Implementation:

// JWT signing with Key Vault RSA key
public class JwtTokenService
{
    private readonly CryptographyClient _cryptoClient;

    public async Task<string> GenerateTokenAsync(ClaimsPrincipal user)
    {
        var header = new { alg = "RS256", typ = "JWT", kid = "JwtSigningKey" };
        var payload = new { sub = user.Identity.Name, exp = DateTimeOffset.UtcNow.AddHours(1).ToUnixTimeSeconds() };

        var headerEncoded = Base64UrlEncode(JsonSerializer.Serialize(header));
        var payloadEncoded = Base64UrlEncode(JsonSerializer.Serialize(payload));
        var message = $"{headerEncoded}.{payloadEncoded}";

        // Sign with Key Vault RSA key
        var signature = await _cryptoClient.SignDataAsync(SignatureAlgorithm.RS256, Encoding.UTF8.GetBytes(message));
        var signatureEncoded = Base64UrlEncode(signature.Signature);

        return $"{message}.{signatureEncoded}";
    }
}

Rotation Strategy:

  • Dev/Test: Never (fixed key for consistency).
  • Production: Monthly (30 days) with overlap period (old key remains valid for verification for 7 days).
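The overlap period means a verifier must keep accepting tokens signed under the previous key id until the window closes. A minimal Python sketch using the dev-style HS256 pattern (helper names are mine, not ATP code); the same `kid`-based lookup applies to RS256 keys published via JWKS:

```python
import base64
import hashlib
import hmac
import json


def b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def sign_jwt(payload: dict, key: bytes, kid: str) -> str:
    """HS256-sign a JWT, stamping the signing key id into the header."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT", "kid": kid}).encode())
    body = b64url(json.dumps(payload).encode())
    msg = f"{header}.{body}"
    sig = hmac.new(key, msg.encode(), hashlib.sha256).digest()
    return f"{msg}.{b64url(sig)}"


def verify_jwt(token: str, keys: dict) -> bool:
    """`keys` maps kid -> key bytes; during rotation it holds both the new
    and the still-valid previous key, so tokens signed just before rotation
    keep verifying until the overlap window closes."""
    header_b64, body_b64, sig_b64 = token.split(".")
    pad = lambda s: s + "=" * (-len(s) % 4)
    header = json.loads(base64.urlsafe_b64decode(pad(header_b64)))
    key = keys.get(header.get("kid"))
    if key is None:
        return False  # unknown or retired key id
    expected = hmac.new(key, f"{header_b64}.{body_b64}".encode(), hashlib.sha256).digest()
    return hmac.compare_digest(b64url(expected), sig_b64)
```

Removing the old `kid` from the verification set after 7 days is what actually ends the overlap: signing switches to the new key immediately, verification trails by the window.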

Secret Rotation Policies

Automated Rotation (Production):

// Azure Function for automated secret rotation
[FunctionName("RotateSecrets")]
public async Task RunAsync(
    [TimerTrigger("0 0 2 1 * *")] TimerInfo timer,  // Monthly on 1st at 2 AM
    ILogger log)
{
    var secretsToRotate = new[]
    {
        "SqlConnectionString",
        "RedisConnectionString",
        "ServiceBusConnectionString",
        "JwtSigningKey"
    };

    foreach (var secretName in secretsToRotate)
    {
        log.LogInformation($"Rotating secret: {secretName}");

        // 1. Generate new secret value
        var newSecretValue = await GenerateNewSecretAsync(secretName);

        // 2. Create new version in Key Vault
        await _secretClient.SetSecretAsync(secretName, newSecretValue);

        // 3. Wait for services to pick up the new secret
        //    (15-minute config cache expiration plus a 5-minute safety margin)
        await Task.Delay(TimeSpan.FromMinutes(20));

        // 4. Verify all services using new secret
        var healthCheckPassed = await VerifyServicesHealthAsync();

        if (!healthCheckPassed)
        {
            log.LogError($"Health check failed after rotating {secretName}. Rolling back...");
            await RollbackSecretAsync(secretName);
            throw new Exception($"Secret rotation failed for {secretName}");
        }

        log.LogInformation($"✅ Successfully rotated secret: {secretName}");
    }
}

Rotation Schedule:

| Environment | Connection Strings  | API Keys            | Certificates        | Encryption Keys       | JWT Keys            |
|-------------|---------------------|---------------------|---------------------|-----------------------|---------------------|
| Dev         | On-demand           | On-demand           | Annual              | Never                 | Never               |
| Test        | Quarterly           | Quarterly           | Annual              | Never                 | Never               |
| Staging     | Monthly             | Monthly             | Annual              | Quarterly             | Monthly             |
| Production  | Monthly (automated) | Monthly (automated) | Annual (auto-renew) | Quarterly (automated) | Monthly (automated) |

Managed Identity Access Patterns

App Service Managed Identity (Staging/Production):

// Program.cs - Managed Identity for Key Vault access
public static IHostBuilder CreateHostBuilder(string[] args) =>
    Host.CreateDefaultBuilder(args)
        .ConfigureAppConfiguration((context, config) =>
        {
            if (context.HostingEnvironment.IsProduction())
            {
                var builtConfig = config.Build();
                var keyVaultEndpoint = builtConfig["KeyVault:Endpoint"];

                // DefaultAzureCredential: tries Managed Identity first
                var credential = new DefaultAzureCredential();

                config.AddAzureKeyVault(
                    new Uri(keyVaultEndpoint),
                    credential,
                    new AzureKeyVaultConfigurationOptions
                    {
                        ReloadInterval = TimeSpan.FromMinutes(15)
                    });
            }
        });

AKS Pod Identity (Production):

# Azure AD Pod Identity (deprecated; use Workload Identity)
apiVersion: aadpodidentity.k8s.io/v1
kind: AzureIdentity
metadata:
  name: atp-prod-identity
  namespace: atp-prod
spec:
  type: 0  # Managed Identity
  resourceID: /subscriptions/<sub-id>/resourceGroups/ATP-Prod-RG/providers/Microsoft.ManagedIdentity/userAssignedIdentities/atp-prod-mi
  clientID: <managed-identity-client-id>
---
apiVersion: aadpodidentity.k8s.io/v1
kind: AzureIdentityBinding
metadata:
  name: atp-prod-identity-binding
  namespace: atp-prod
spec:
  azureIdentity: atp-prod-identity
  selector: atp-prod-pods  # Pod label selector

AKS Workload Identity (Recommended for Production):

# ServiceAccount with Workload Identity
apiVersion: v1
kind: ServiceAccount
metadata:
  name: atp-prod-sa
  namespace: atp-prod
  annotations:
    azure.workload.identity/client-id: <managed-identity-client-id>
---
# Deployment using Workload Identity
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
  namespace: atp-prod
spec:
  template:
    metadata:
      labels:
        azure.workload.identity/use: "true"
    spec:
      serviceAccountName: atp-prod-sa
      containers:
      - name: atp-ingestion
        image: connectsoft.azurecr.io/atp/ingestion:1.0.0
        env:
        - name: AZURE_CLIENT_ID
          value: <managed-identity-client-id>
        - name: KEY_VAULT_ENDPOINT
          value: https://atp-keyvault-prod-eus.vault.azure.net/

Key Vault CSI Integration (AKS)

Purpose: Mount Key Vault secrets as volumes in Kubernetes pods for seamless secret injection without environment variables.

SecretProviderClass (Production):

apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: atp-prod-secrets
  namespace: atp-prod
spec:
  provider: azure
  parameters:
    usePodIdentity: "false"  # Use Workload Identity instead
    useVMManagedIdentity: "false"
    clientID: <managed-identity-client-id>
    keyvaultName: atp-keyvault-prod-eus
    tenantId: <tenant-id>
    objects: |
      array:
        - objectName: SqlConnectionString
          objectType: secret
          objectAlias: sql-connection-string
        - objectName: RedisConnectionString
          objectType: secret
          objectAlias: redis-connection-string
        - objectName: ServiceBusConnectionString
          objectType: secret
          objectAlias: servicebus-connection-string
        - objectName: TenantKEK-tenant-12345
          objectType: key
          objectAlias: tenant-kek-12345
        - objectName: JwtSigningKey
          objectType: key
          objectAlias: jwt-signing-key
        - objectName: MtlsClientCertificate
          objectType: cert
          objectAlias: mtls-client-cert
  secretObjects:
    - secretName: atp-sql-secret
      type: Opaque
      data:
        - objectName: sql-connection-string
          key: connectionString
    - secretName: atp-redis-secret
      type: Opaque
      data:
        - objectName: redis-connection-string
          key: connectionString

Deployment with CSI Secrets (Production):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
  namespace: atp-prod
spec:
  replicas: 3
  selector:
    matchLabels:
      app: atp-ingestion
  template:
    metadata:
      labels:
        app: atp-ingestion
        azure.workload.identity/use: "true"
    spec:
      serviceAccountName: atp-prod-sa
      containers:
      - name: atp-ingestion
        image: connectsoft.azurecr.io/atp/ingestion:1.0.0
        volumeMounts:
        - name: secrets-store
          mountPath: "/mnt/secrets"
          readOnly: true
        env:
        - name: ConnectionStrings__DefaultConnection
          valueFrom:
            secretKeyRef:
              name: atp-sql-secret
              key: connectionString
        - name: ConnectionStrings__Redis
          valueFrom:
            secretKeyRef:
              name: atp-redis-secret
              key: connectionString
      volumes:
      - name: secrets-store
        csi:
          driver: secrets-store.csi.k8s.io
          readOnly: true
          volumeAttributes:
            secretProviderClass: atp-prod-secrets

Benefits of CSI Integration:

  1. Automatic Secret Refresh: Secrets updated in Key Vault are automatically synced to pods (polling interval: 2 minutes).
  2. No Environment Variables: Secrets never exposed in environment variables (more secure).
  3. File-Based Access: Applications read secrets from mounted files (/mnt/secrets/sql-connection-string).
  4. Atomic Updates: Secret updates are atomic (no partial reads during rotation).
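In practice the application simply reads the mounted file on demand; because the CSI driver rewrites the file atomically on each sync, a fresh read after rotation returns the new value. A minimal reader sketch (the path layout mirrors the SecretProviderClass aliases above; the helper name is illustrative):

```python
from pathlib import Path


def read_mounted_secret(name: str, mount_dir: str = "/mnt/secrets") -> str:
    """Read a Key Vault secret synced into the pod by the CSI driver.
    Re-reading the file after the ~2-minute sync interval observes
    rotations without restarting the pod."""
    return Path(mount_dir, name).read_text().strip()
```

Reading on each use (or caching briefly and re-reading on auth failure) is what lets the automatic refresh in point 1 take effect without a restart.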

Secret Management Best Practices

Security:

  1. Never commit secrets: Enforce pre-commit hooks to detect secrets in source control.
  2. Managed Identity only (Prod): No service principals, passwords, or API keys for production services.
  3. Rotate regularly: Automated monthly rotation for production with overlap periods.
  4. Least privilege: Grant minimum required permissions (get, list only; never set, delete).
  5. Private endpoints: Production Key Vaults accessible only via private network.

Operational:

  1. Audit logging: Enable diagnostic logs for all Key Vault operations; integrate with Azure Sentinel.
  2. Soft delete + purge protection: Prevent accidental deletion; 90-day retention for recovery.
  3. Backup: Automated daily backups with geo-redundancy for Staging/Production.
  4. Health checks: Validate Key Vault connectivity and secret retrieval in application health checks.
  5. Break-glass procedures: Document emergency access procedures with strict approval workflows.

Compliance:

  1. Encryption at rest: All secrets encrypted with Microsoft-managed keys; HSM-backed for Prod.
  2. Encryption in transit: TLS 1.2+ for all Key Vault API calls.
  3. Compliance tags: Tag secrets with compliance scope (GDPR, HIPAA, SOC2).
  4. Audit retention: 7-year audit log retention for compliance evidence.
  5. Access reviews: Quarterly review of Key Vault access policies; remove stale permissions.

Emergency Access Procedures

Break-Glass Access (Production):

# Break-Glass Access Policy (Production Key Vault)
accessPolicies:
  - tenantId: <tenant-id>
    objectId: <break-glass-emergency-group-id>
    permissions:
      secrets: [get]  # Read-only

# Conditional Access Requirements:
# - Multi-Factor Authentication: Required
# - Compliant Device: Required
# - Trusted Location: VPN only
# - Just-in-Time Access: PIM activation (4-hour max)
# - Approval: 2 approvers (Security Officer + Incident Commander)
# - Audit: Real-time alert to SIEM; Slack notification to #security-alerts

Emergency Access Workflow:

flowchart TD
    A[P0 Incident Requires Secret Access] --> B[Request PIM Elevation]
    B --> C[2 Approvers Review]
    C --> D{Approved?}
    D -->|No| E[Access Denied + Audit Log]
    D -->|Yes| F[4-Hour JIT Access Granted]
    F --> G[Access Key Vault via VPN]
    G --> H[Retrieve Secret]
    H --> I[Real-Time Alert to SIEM]
    I --> J[Incident Resolution]
    J --> K[PIM Access Expires]
    K --> L[Post-Incident Review]

Break-Glass Secret Retrieval (Azure CLI):

# 1. Activate PIM role (requires approval)
#    (illustrative command; PIM activation is typically performed via the
#    Azure portal or the ARM REST API, and az CLI support varies by version)
az ad pim role-assignment request create \
  --resource-id /subscriptions/<sub-id>/resourceGroups/ATP-Prod-RG \
  --role-definition-id <key-vault-secrets-user-role-id> \
  --principal-id <my-user-object-id> \
  --duration PT4H \
  --reason "P0 Incident #12345: Production SQL connection failure"

# 2. Wait for approval (~5-10 minutes)
# 3. Connect via VPN (required by Conditional Access)
# 4. Retrieve secret (audited)
az keyvault secret show \
  --vault-name atp-keyvault-prod-eus \
  --name SqlConnectionString \
  --query value -o tsv

# 5. Use secret for incident resolution
# 6. PIM access expires after 4 hours

Summary

  • Key Vault Per Environment: Dedicated Key Vault for each environment with graduated security controls (Dev: developer-accessible → Prod: zero human access).
  • Secret Categories: Five primary categories (Connection Strings, API Keys, Certificates, Encryption Keys, JWT Keys) with environment-specific patterns.
  • Managed Identity Only (Prod): No service principals or passwords; all production access via managed identities.
  • Automated Rotation: Monthly automated rotation for production with zero-downtime and overlap periods.
  • Key Vault CSI: AKS integration for seamless secret injection as mounted volumes with automatic refresh.
  • Break-Glass Access: Emergency access procedures with strict conditional access, JIT approval, and comprehensive auditing.

Environment Promotion Workflow

ATP's environment promotion strategy implements a graduated deployment pipeline where code progresses through increasingly production-like environments with escalating quality gates and approval requirements. This approach ensures defects are caught early in lower environments while maintaining production stability through rigorous validation and controlled change management.

The promotion workflow balances deployment velocity (automated Dev/Test) with risk mitigation (manual approvals for Staging/Production) and provides fast rollback mechanisms at every tier to minimize incident impact.

Promotion Lanes

ATP uses a multi-lane promotion strategy supporting both regular feature releases and emergency hotfixes:

flowchart TD
    subgraph Feature Development
        A[Feature Branch] --> B[Open Pull Request]
        B --> C[Preview Environment Created]
        C --> D{PR Tests Pass?}
        D -->|No| E[Fix in Branch]
        E --> C
        D -->|Yes| F[Merge to Main]
    end

    F --> G[CI Build Pipeline]

    subgraph Continuous Integration
        G --> H[Dev Environment]
        H --> I{Smoke Tests Pass?}
        I -->|No| J[Alert Dev Team]
        I -->|Yes| K[24-Hour Soak]
    end

    subgraph Continuous Delivery
        K --> L[Test Environment]
        L --> M{Regression Tests Pass?}
        M -->|No| N[Block Promotion]
        M -->|Yes| O[Manual Approval Required]
        O --> P[Staging Environment]
        P --> Q{Load + Chaos Tests Pass?}
        Q -->|No| R[Fix Issues]
        Q -->|Yes| S[CAB Approval + 2 Approvers]
        S --> T[Production Environment]
    end

    subgraph Hotfix Lane
        U[P0/P1 Incident] --> V[Hotfix Branch]
        V --> W[Hotfix Environment]
        W --> X{Validation Pass?}
        X -->|No| V
        X -->|Yes| Y[Expedited Approvals]
        Y --> T
    end

    T --> Z[Post-Deployment Monitoring]
    Z --> AA{Metrics Healthy?}
    AA -->|No| AB[Automated Rollback]
    AA -->|Yes| AC[Deployment Complete]

    style A fill:#87CEEB
    style H fill:#90EE90
    style L fill:#FFD700
    style P fill:#FFA500
    style T fill:#FF6347
    style W fill:#FF69B4

Standard Promotion Lane (Feature Releases):

1. feature-branch → Pull Request → Preview Environment (ephemeral)
   ↓ (PR approved + tests pass)
2. Merge to main → CI Build
   ↓ (build + tests + security scans pass)
3. main → Dev Environment (auto-deploy)
   ↓ (smoke tests pass + 24-hour soak)
4. Dev → Test Environment (auto-deploy)
   ↓ (regression tests pass)
5. Test → Staging Environment (manual approval: 1 Lead Engineer)
   ↓ (load tests + chaos tests pass)
6. Staging → Production Environment (manual approval: 2 approvers + CAB)
   ↓ (canary deployment with metrics validation)
7. Production (stable) → Monitoring & Observability
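The gate logic enforced by the pipelines can be restated as a small predicate. This sketch expresses the automated Dev to Test thresholds (24-hour soak, zero critical incidents, error rate at or under 0.10%, main branch only) as plain code for clarity; the function name is illustrative:

```python
def can_promote_to_test(soak_hours: float, critical_incidents: int,
                        error_rate_pct: float, branch: str) -> bool:
    """Automated Dev -> Test gate: every condition must hold, since a
    single failed check blocks promotion outright."""
    return (soak_hours >= 24
            and critical_incidents == 0
            and error_rate_pct <= 0.10
            and branch == "refs/heads/main")
```

Keeping the gate a pure conjunction makes failures easy to diagnose: the pipeline logs which individual condition was false rather than a composite score.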

Hotfix Lane (Emergency Fixes):

1. P0/P1 Incident → Hotfix Branch
   ↓ (fix developed)
2. Hotfix Branch → Hotfix Environment (on-demand provision)
   ↓ (targeted tests pass)
3. Hotfix Environment → Production (expedited approval: 2 approvers within 2 hours)
   ↓ (30-minute intensive monitoring)
4. Production (stable) → Merge to main + Decommission Hotfix Environment

Approval Gates

Each environment transition has specific approval requirements, validation criteria, and SLA commitments to ensure quality while maintaining deployment velocity.

Dev → Test (Automated)

Approvers: None (fully automated)

Requirements:

  1. Smoke Tests: All critical path smoke tests pass (health checks, basic API calls).
  2. Dev Soak Period: Deployed to Dev for minimum 24 hours without critical incidents.
  3. Build Artifacts: All artifacts published successfully (binaries, Docker images, SBOM).
  4. Security Scans: No critical/high vulnerabilities in SAST, dependency, or container scans.

Implementation (Azure Pipelines):

# Dev → Test promotion (automated)
- stage: Deploy_Test
  displayName: 'Promote to Test Environment'
  dependsOn: Deploy_Dev
  # Gate: 24h+ Dev soak, zero critical Dev incidents, main branch only
  # (comments must stay outside the expression block; '#' is not valid
  # inside pipeline expressions)
  condition: |
    and(
      succeeded(),
      ge(variables['Dev.SoakHours'], 24),
      eq(variables['Dev.CriticalIncidents'], '0'),
      eq(variables['Build.SourceBranch'], 'refs/heads/main')
    )
  jobs:
  - deployment: PromoteToTest
    environment: ATP-Test  # No manual approval
    strategy:
      runOnce:
        preDeploy:
          steps:
          - script: |
              echo "Validating Dev environment stability..."

              # Query Dev error rate (last 24 hours)
              DEV_ERROR_RATE=$(az monitor app-insights metrics show \
                --app atp-appinsights-dev-eus \
                --metric "requests/failed" \
                --aggregation avg \
                --offset 24h \
                --query "value.segments[0]['requests/failed'].avg")

              if (( $(echo "$DEV_ERROR_RATE > 0.10" | bc -l) )); then
                echo "❌ Dev error rate too high: $DEV_ERROR_RATE%"
                exit 1
              fi

              # Query Dev incidents
              DEV_INCIDENTS=$(az monitor activity-log list \
                --resource-group ATP-Dev-RG \
                --offset 24h \
                --query "[?level=='Critical'] | length(@)")

              if [ "$DEV_INCIDENTS" -gt "0" ]; then
                echo "❌ Critical incidents detected in Dev"
                exit 1
              fi

              echo "✅ Dev environment stable; promoting to Test"
            displayName: 'Validate Dev Stability'

        deploy:
          steps:
          - template: deploy/deploy-microservice-to-azure-web-site.yaml@templates
            parameters:
              azureSubscription: $(azureSubscription)
              appName: atp-ingestion-test-eus
              package: $(Pipeline.Workspace)/drop/*.zip

        postRouteTraffic:
          steps:
          - script: |
              # Run smoke tests
              dotnet test tests/Smoke.Tests.csproj \
                --environment Test \
                --logger "trx;LogFileName=test-smoke-results.trx" \
                --filter "Priority=1"

              if [ $? -ne 0 ]; then
                echo "❌ Smoke tests failed; rolling back Test deployment"
                exit 1
              fi

              echo "✅ Test deployment successful"
            displayName: 'Post-Deployment Smoke Tests'

SLA: Immediate (automated promotion within 5 minutes of Dev validation).

Rollback: Automatic if post-deployment smoke tests fail.


Test → Staging (Manual Approval)

Approvers: 1 Lead Engineer

Requirements:

  1. Regression Tests: Full regression test suite passes (100% pass rate required).
  2. No P1/P2 Bugs: Zero high-priority bugs in Test environment.
  3. Performance Benchmarks: Response time p95 within acceptable thresholds.
  4. Test Soak Period: Minimum 48 hours in Test without incidents.
  5. Database Migrations: All migrations tested successfully in Test.

Implementation (Azure DevOps Environment):

# Azure DevOps Environment Configuration
# Environment: ATP-Staging
approvals:
  - type: manual
    approvers:
      - lead-engineer@connectsoft.example
    minRequired: 1
    timeoutInMinutes: 240  # 4-hour approval window
    instructions: |
      ## Staging Promotion Checklist

      Before approving, verify:
      - [ ] All regression tests passed (100% green)
      - [ ] No P1/P2 bugs in Test environment
      - [ ] Performance benchmarks met (p95 < 500ms)
      - [ ] Test deployed for 48+ hours without incidents
      - [ ] Database migrations validated
      - [ ] Security scans passed (no critical/high vulnerabilities)
      - [ ] Change ticket created and approved

      **Approval SLA**: 4 hours

Pipeline Stage (Test → Staging):

- stage: Deploy_Staging
  displayName: 'Promote to Staging Environment'
  dependsOn: Deploy_Test
  # Gate: 48h+ Test soak, regression suite green, zero P1/P2 bugs, main branch only
  condition: |
    and(
      succeeded(),
      ge(variables['Test.SoakHours'], 48),
      eq(variables['Test.RegressionTestsPass'], 'true'),
      eq(variables['Test.P1P2BugCount'], '0'),
      eq(variables['Build.SourceBranch'], 'refs/heads/main')
    )
  jobs:
  - deployment: PromoteToStaging
    environment: ATP-Staging  # Requires 1 manual approval
    timeoutInMinutes: 300  # 5-hour timeout (includes approval wait)
    strategy:
      runOnce:
        preDeploy:
          steps:
          - script: |
              echo "Pre-Staging Validation Checks..."

              # Verify Test regression tests
              TEST_PASS_RATE=$(az devops test runs list \
                --project ATP \
                --query "[?state=='Completed' && startDate > '$(date -d '48 hours ago' --iso-8601)'].passRate" \
                --output tsv | awk '{sum+=$1; count++} END {print sum/count}')

              if (( $(echo "$TEST_PASS_RATE < 100" | bc -l) )); then
                echo "❌ Test pass rate below 100%: $TEST_PASS_RATE%"
                exit 1
              fi

              # Verify no high-priority bugs
              P1_P2_BUGS=$(az boards query --wiql "SELECT [System.Id] FROM WorkItems WHERE [System.State] = 'Active' AND [Microsoft.VSTS.Common.Priority] <= 2 AND [System.Tags] CONTAINS 'ATP-Test'" --output tsv | wc -l)

              if [ "$P1_P2_BUGS" -gt "0" ]; then
                echo "❌ Found $P1_P2_BUGS P1/P2 bugs in Test"
                exit 1
              fi

              echo "✅ Test environment ready for Staging promotion"
            displayName: 'Pre-Staging Validation'

        deploy:
          steps:
          - template: deploy/deploy-microservice-to-azure-web-site.yaml@templates
            parameters:
              azureSubscription: $(azureSubscription)
              appName: atp-ingestion-staging-eus
              package: $(Pipeline.Workspace)/drop/*.zip
              slotName: blue  # Blue-green deployment

        routeTraffic:
          steps:
          - task: AzureAppServiceManage@0
            displayName: 'Swap Blue → Production Slot'
            inputs:
              azureSubscription: $(azureSubscription)
              action: 'Swap Slots'
              webAppName: atp-ingestion-staging-eus
              sourceSlot: blue
              targetSlot: production

        postRouteTraffic:
          steps:
          - script: |
              # Post-deployment validation
              echo "Running Staging post-deployment tests..."

              # Health checks
              for i in {1..10}; do
                curl -f https://atp-gateway-staging-eus.azurewebsites.net/health && break || sleep 30
              done

              # Smoke tests
              dotnet test tests/Smoke.Tests.csproj --environment Staging

              # Light load test
              k6 run --vus 50 --duration 5m tests/load/staging-validation.js

              echo "✅ Staging deployment validated"
            displayName: 'Post-Staging Validation'

SLA: 4 hours (time from approval request to approver response).

Notifications:

# Azure DevOps Service Hook (Slack notification)
- trigger: Approval Pending
  action: Send Slack message
  target: "#staging-approvals"  # Slack channel
  message: |
    🟡 Staging Approval Required

    Build: $(Build.BuildNumber)
    Branch: $(Build.SourceBranch)
    Requester: $(Build.RequestedFor)

    Approve: $(Environment.ApprovalUrl)

    Checklist:
    - Regression tests: $(Test.RegressionTestsPass)
    - P1/P2 bugs: $(Test.P1P2BugCount)
    - Soak period: $(Test.SoakHours) hours

Staging → Production (CAB + 2 Approvers)

Approvers: 2 required (Platform Architect + SRE Lead)

Requirements:

  1. CAB Approval: Change Advisory Board approval with documented change ticket.
  2. Staging Soak: Minimum 48 hours in Staging without incidents.
  3. Load Tests: Load tests pass at 80% expected peak production load.
  4. Chaos Tests: Chaos engineering tests pass (pod failures, network latency, database throttling).
  5. Security Review: Final security review completed; no pending vulnerabilities.
  6. Change Window: Deployment scheduled during approved maintenance window.
  7. Rollback Plan: Documented rollback procedure with RTO < 5 minutes.

Implementation (Azure DevOps Environment):

# Azure DevOps Environment Configuration
# Environment: ATP-Production
approvals:
  - type: manual
    approvers:
      - platform-architect@connectsoft.example
      - sre-lead@connectsoft.example
    minRequired: 2  # Both approvers must approve
    timeoutInMinutes: 1440  # 24-hour approval window
    instructions: |
      ## Production Promotion Checklist (CAB Required)

      **Prerequisites**:
      - [ ] CAB approval obtained (change ticket: CHG-XXXXX)
      - [ ] Staging deployed for 48+ hours (no incidents)
      - [ ] Load tests passed (80% peak load)
      - [ ] Chaos tests passed (pod failures, network latency)
      - [ ] Security review completed (no critical/high vulnerabilities)
      - [ ] Change window scheduled (approved maintenance slot)
      - [ ] Rollback plan documented and reviewed
      - [ ] On-call rotation confirmed (SRE coverage)
      - [ ] Stakeholder notification sent
      - [ ] Database backup verified (< 24 hours old)

      **Deployment Details**:
      - Build: $(Build.BuildNumber)
      - Commit: $(Build.SourceVersion)
      - Requester: $(Build.RequestedFor)
      - Scheduled Window: [Specify date/time]

      **Approval SLA**: 24 hours

      **Rollback RTO**: < 5 minutes (canary rollback or slot swap)

checks:
  - type: gate
    displayName: 'Verify CAB Approval'
    evaluationMode: Sequential
    timeout: 1440
    gates:
      - task: InvokeRESTAPI@1
        inputs:
          serviceConnection: 'ChangeManagementAPI'
          method: 'GET'
          urlSuffix: '/api/change-tickets/$(ChangeTicketId)/status'
          successCriteria: 'eq(root.status, "Approved")'

  - type: gate
    displayName: 'Verify No Active Incidents'
    evaluationMode: Sequential
    timeout: 1440
    gates:
      - task: AzureCLI@2
        inputs:
          scriptType: 'bash'
          scriptLocation: 'inlineScript'
          inlineScript: |
            INCIDENTS=$(az monitor activity-log list \
              --resource-group ATP-Staging-RG \
              --offset 48h \
              --query "[?level=='Critical' || level=='Error'] | length(@)")

            if [ "$INCIDENTS" -gt "0" ]; then
              echo "Active incidents detected: $INCIDENTS"
              exit 1
            fi

Pipeline Stage (Staging → Production):

- stage: Deploy_Production
  displayName: 'Promote to Production Environment'
  dependsOn: Deploy_Staging
  # Gate: manual trigger only, 48h+ Staging soak, CAB approved, main branch only
  condition: |
    and(
      succeeded(),
      eq(variables['Build.Reason'], 'Manual'),
      ge(variables['Staging.SoakHours'], 48),
      eq(variables['CAB.Approved'], 'true'),
      eq(variables['Build.SourceBranch'], 'refs/heads/main')
    )
  jobs:
  - deployment: PromoteToProduction
    environment: ATP-Production  # Requires 2 approvals + CAB gate
    timeoutInMinutes: 600  # 10-hour timeout (includes approval + canary deployment)
    strategy:
      canary:
        increments: [5, 20, 50]  # 5% → 20% → 50% → 100%

        preDeploy:
          steps:
          - script: |
              echo "Pre-Production Safety Checks..."

              # Verify Staging stability (48 hours)
              STAGING_ERROR_RATE=$(az monitor app-insights metrics show \
                --app atp-appinsights-staging-eus \
                --metric "requests/failed" \
                --aggregation avg \
                --offset 48h \
                --query "value.segments[0]['requests/failed'].avg")

              if (( $(echo "$STAGING_ERROR_RATE > 0.01" | bc -l) )); then
                echo "❌ Staging error rate too high: $STAGING_ERROR_RATE%"
                exit 1
              fi

              # Verify load tests passed
              LOAD_TEST_STATUS=$(az load test show \
                --name staging-load-test-latest \
                --query "status" \
                --output tsv)

              if [ "$LOAD_TEST_STATUS" != "Passed" ]; then
                echo "❌ Load tests did not pass"
                exit 1
              fi

              # Verify CAB approval
              CAB_STATUS=$(curl -s https://changemanagement.connectsoft.local/api/tickets/$(ChangeTicketId) | jq -r '.status')

              if [ "$CAB_STATUS" != "Approved" ]; then
                echo "❌ CAB approval not obtained"
                exit 1
              fi

              echo "✅ All pre-production checks passed"
            displayName: 'Pre-Production Validation'

          - task: AzureCLI@2
            displayName: 'Create Production Backup Snapshot'
            inputs:
              scriptType: 'bash'
              scriptLocation: 'inlineScript'
              inlineScript: |
                # Create database backup before deployment
                az sql db copy \
                  --name ATP_Prod \
                  --resource-group ATP-Prod-RG \
                  --server atp-sql-prod-eus \
                  --dest-name ATP_Prod_Backup_$(Build.BuildId) \
                  --dest-resource-group ATP-Prod-Backups-RG \
                  --dest-server atp-sql-backup-eus

        deploy:
          steps:
          - task: Kubernetes@1
            displayName: 'Deploy Canary ($(strategy.increment)%)'
            inputs:
              connectionType: 'Azure Resource Manager'
              azureSubscription: $(azureSubscription)
              azureResourceGroup: 'ATP-Prod-RG'
              kubernetesCluster: 'atp-aks-prod-eus'
              command: 'apply'
              arguments: '-f k8s/canary/atp-ingestion-canary-$(strategy.increment).yaml'

        routeTraffic:
          steps:
          - script: |
              echo "Routing $(strategy.increment)% traffic to canary..."

              # Update Istio VirtualService for traffic splitting
              kubectl apply -f - <<EOF
              apiVersion: networking.istio.io/v1beta1
              kind: VirtualService
              metadata:
                name: atp-ingestion-traffic
                namespace: atp-prod
              spec:
                hosts:
                - atp-ingestion.atp-prod.svc.cluster.local
                http:
                - match:
                  - headers:
                      canary:
                        exact: "true"
                  route:
                  - destination:
                      host: atp-ingestion-canary
                      subset: canary
                    weight: 100
                - route:
                  - destination:
                      host: atp-ingestion
                      subset: stable
                    weight: $((100 - $(strategy.increment)))
                  - destination:
                      host: atp-ingestion-canary
                      subset: canary
                    weight: $(strategy.increment)
              EOF
            displayName: 'Configure Istio Traffic Split'

        postRouteTraffic:
          steps:
          - script: |
              echo "Monitoring canary deployment ($(strategy.increment)% traffic)..."

              # Monitor for 15 minutes
              sleep 900

              # Query Application Insights for canary metrics
              CANARY_ERROR_RATE=$(az monitor app-insights metrics show \
                --app atp-appinsights-prod-eus \
                --metric "requests/failed" \
                --aggregation avg \
                --offset 15m \
                --filter "cloud_RoleName eq 'atp-ingestion-canary'" \
                --query "value.segments[0]['requests/failed'].avg")

              if (( $(echo "$CANARY_ERROR_RATE > 0.01" | bc -l) )); then
                echo "❌ Canary error rate exceeded 1%: $CANARY_ERROR_RATE%"
                exit 1
              fi

              CANARY_LATENCY_P95=$(az monitor app-insights metrics show \
                --app atp-appinsights-prod-eus \
                --metric "requests/duration" \
                --aggregation percentile95 \
                --offset 15m \
                --filter "cloud_RoleName eq 'atp-ingestion-canary'" \
                --query "value.segments[0]['requests/duration'].percentile95")

              if (( $(echo "$CANARY_LATENCY_P95 > 1000" | bc -l) )); then
                echo "❌ Canary p95 latency exceeded 1000ms: ${CANARY_LATENCY_P95}ms"
                exit 1
              fi

              # Check for canary-specific errors in logs
              CANARY_EXCEPTIONS=$(az monitor app-insights query \
                --app atp-appinsights-prod-eus \
                --analytics-query "exceptions | where cloud_RoleName == 'atp-ingestion-canary' and timestamp > ago(15m) | count" \
                --query "tables[0].rows[0][0]")

              if [ "$CANARY_EXCEPTIONS" -gt "10" ]; then
                echo "❌ Canary has $CANARY_EXCEPTIONS exceptions"
                exit 1
              fi

              echo "✅ Canary metrics healthy at $(strategy.increment)% traffic"
            displayName: 'Validate Canary Metrics (15-min soak)'

        on:
          failure:
            steps:
            - script: |
                echo "🔴 Canary deployment failed; initiating rollback..."

                # Revert traffic to stable version (100% to stable)
                kubectl apply -f k8s/canary/atp-ingestion-stable-only.yaml

                # Notify on-call team
                curl -X POST $(SlackWebhookUrl) \
                  -H 'Content-Type: application/json' \
                  -d '{
                    "text": "🚨 Production Canary Rollback",
                    "attachments": [{
                      "color": "danger",
                      "fields": [
                        {"title": "Build", "value": "$(Build.BuildNumber)", "short": true},
                        {"title": "Increment", "value": "$(strategy.increment)%", "short": true},
                        {"title": "Reason", "value": "Metrics threshold exceeded", "short": false}
                      ]
                    }]
                  }'

                # Create incident ticket
                az boards work-item create \
                  --title "Production Canary Rollback - Build $(Build.BuildNumber)" \
                  --type "Incident" \
                  --description "Canary deployment failed at $(strategy.increment)% traffic. Metrics exceeded thresholds. Automatic rollback executed." \
                  --assigned-to "sre-team@connectsoft.example" \
                  --area "ATP/Production" \
                  --iteration "ATP/Current" \
                  --fields Priority=1 Severity="1 - Critical"

                # Send PagerDuty alert
                curl -X POST https://events.pagerduty.com/v2/enqueue \
                  -H 'Content-Type: application/json' \
                  -d '{
                    "routing_key": "$(PagerDutyRoutingKey)",
                    "event_action": "trigger",
                    "payload": {
                      "summary": "Production canary rollback for build $(Build.BuildNumber)",
                      "severity": "critical",
                      "source": "Azure DevOps"
                    }
                  }'
              displayName: 'Rollback + Incident Response'

SLA: 24 hours (CAB approval + deployment scheduling).

Rollback: Automatic if canary metrics exceed thresholds; manual option available.
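The traffic-split arithmetic behind the canary stage is simple: at each increment, the canary subset receives that share of traffic and the stable subset receives the complement, so the Istio route weights always sum to 100. A minimal sketch (the increment values mirror the `increments: [5, 20, 50]` configured above):

```python
def canary_weights(increment: int) -> dict:
    """Return the Istio route weights for a given canary traffic share.

    The stable subset receives the complement, so weights always sum to 100.
    """
    if not 0 <= increment <= 100:
        raise ValueError(f"increment must be 0-100, got {increment}")
    return {"stable": 100 - increment, "canary": increment}

# The increments used in the pipeline above, plus the final full cutover:
for pct in (5, 20, 50, 100):
    w = canary_weights(pct)
    assert w["stable"] + w["canary"] == 100
```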


Hotfix → Production (Expedited)

Approvers: 2 (SRE Lead + Platform Architect)

Requirements:

  1. Incident Ticket: Linked P0/P1 incident ticket with root cause analysis.
  2. Hotfix Validation: Targeted tests pass in Hotfix environment.
  3. Expedited CAB: Emergency CAB approval (within 2 hours).
  4. Limited Scope: Changes limited to specific service/component; no breaking changes.
  5. Rollback Ready: Immediate rollback plan with < 2-minute RTO.

Implementation:

- stage: Deploy_Production_Hotfix
  displayName: 'Emergency Hotfix to Production'
  dependsOn: Deploy_Hotfix_Validation
  condition: |
    and(
      succeeded(),
      eq(variables['Hotfix.Validated'], 'true'),
      ne(variables['IncidentTicketId'], '')
    )
  jobs:
  - deployment: HotfixProduction
    environment: ATP-Production  # Requires 2 approvals (expedited)
    timeoutInMinutes: 180  # 3-hour timeout (expedited approval SLA)
    strategy:
      runOnce:
        preDeploy:
          steps:
          - script: |
              echo "Hotfix Pre-Deployment Validation..."

              # Verify incident ticket exists and is P0/P1
              INCIDENT_PRIORITY=$(az boards work-item show \
                --id $(IncidentTicketId) \
                --query "fields['Microsoft.VSTS.Common.Priority']" \
                --output tsv)

              if [ "$INCIDENT_PRIORITY" -gt "1" ]; then
                echo "❌ Hotfix only allowed for P0/P1 incidents (found P$INCIDENT_PRIORITY)"
                exit 1
              fi

              # Verify Hotfix environment tests passed
              HOTFIX_TESTS=$(az pipelines runs list \
                --pipeline-id $(HotfixPipelineId) \
                --query "[0].result" \
                --output tsv)

              if [ "$HOTFIX_TESTS" != "succeeded" ]; then
                echo "❌ Hotfix validation tests did not pass"
                exit 1
              fi

              echo "✅ Hotfix validated; proceeding to Production"
            displayName: 'Verify Hotfix Prerequisites'

        deploy:
          steps:
          - task: Kubernetes@1
            displayName: 'Apply Hotfix to Production'
            inputs:
              connectionType: 'Azure Resource Manager'
              azureSubscription: $(azureSubscription)
              azureResourceGroup: 'ATP-Prod-RG'
              kubernetesCluster: 'atp-aks-prod-eus'
              command: 'set'
              arguments: 'image deployment/atp-ingestion atp-ingestion=connectsoft.azurecr.io/atp/ingestion:hotfix-$(Build.BuildNumber)'

          - script: |
              # Monitor rollout (10-minute timeout)
              kubectl rollout status deployment/atp-ingestion -n atp-prod --timeout=10m

              if [ $? -ne 0 ]; then
                echo "❌ Hotfix rollout failed"
                kubectl rollout undo deployment/atp-ingestion -n atp-prod
                exit 1
              fi
            displayName: 'Monitor Hotfix Rollout'

        postRouteTraffic:
          steps:
          - script: |
              echo "Intensive post-hotfix monitoring (30 minutes)..."

              # Monitor for 30 minutes with 1-minute intervals
              for i in {1..30}; do
                echo "Monitoring minute $i/30..."

                # Health check
                curl -f https://atp-gateway-prod.connectsoft.com/health || {
                  echo "❌ Health check failed at minute $i"
                  exit 1
                }

                # Error rate check
                ERROR_RATE=$(az monitor app-insights metrics show \
                  --app atp-appinsights-prod-eus \
                  --metric "requests/failed" \
                  --aggregation avg \
                  --offset 1m \
                  --query "value.segments[0]['requests/failed'].avg")

                if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
                  echo "❌ Error rate spike detected: $ERROR_RATE%"
                  exit 1
                fi

                sleep 60
              done

              echo "✅ Hotfix monitoring complete; deployment stable"
            displayName: 'Post-Hotfix Monitoring (30 min)'

          - task: AzureCLI@2
            displayName: 'Update Incident Ticket'
            inputs:
              scriptType: 'bash'
              scriptLocation: 'inlineScript'
              inlineScript: |
                # Update incident with deployment details
                az boards work-item update \
                  --id $(IncidentTicketId) \
                  --state "Resolved" \
                  --discussion "Hotfix deployed to Production: Build $(Build.BuildNumber). 30-minute monitoring complete. No issues detected."

SLA: 2 hours (expedited approval from request to deployment).

Rollback: Immediate (< 2 minutes) via kubectl rollout undo or traffic reversion.


Approval Gate Comparison

| Transition | Approvers | Requirements Summary | Approval SLA | Rollback RTO |
|---|---|---|---|---|
| Dev → Test | None (auto) | Smoke tests green, 24h soak | Immediate | 5 minutes |
| Test → Staging | 1 (Lead Engineer) | All tests green, no P1/P2 bugs, 48h soak | 4 hours | 2 minutes (slot swap) |
| Staging → Production | 2 (Architect + SRE) | CAB approval, load tests, chaos tests, change window | 24 hours | < 5 minutes (canary rollback) |
| Hotfix → Production | 2 (SRE Lead + Platform Architect) | Expedited CAB, incident ticket, limited scope | 2 hours | < 2 minutes (rollout undo) |
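For tooling that validates promotion requests against this matrix, the gate policy can be captured as data. A hypothetical sketch (the structure and names are illustrative, not an existing ATP API):

```python
# Hypothetical policy table mirroring the approval-gate comparison above.
APPROVAL_GATES = {
    ("dev", "test"):           {"approvers": 0, "sla_hours": 0,  "rollback_rto_min": 5},
    ("test", "staging"):       {"approvers": 1, "sla_hours": 4,  "rollback_rto_min": 2},
    ("staging", "production"): {"approvers": 2, "sla_hours": 24, "rollback_rto_min": 5},
    ("hotfix", "production"):  {"approvers": 2, "sla_hours": 2,  "rollback_rto_min": 2},
}

def approvals_required(source: str, target: str) -> int:
    """Number of human approvals needed for a promotion; KeyError if the
    transition is not an allowed promotion path."""
    return APPROVAL_GATES[(source.lower(), target.lower())]["approvers"]
```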

Rollback Triggers & Procedures

ATP implements multi-layered rollback strategies with both automated triggers (metric-based) and manual operator control to minimize incident impact and restore service quickly.

Automated Rollback Triggers

Metric-Based Triggers (Production Canary):

# Automated rollback conditions
rollbackTriggers:
  errorRate:
    threshold: 0.05  # 5% error rate
    window: 5 minutes
    action: Immediate rollback

  latencyP95:
    threshold: 1000  # 1000ms
    baselineMultiplier: 2.0  # 2x baseline
    window: 10 minutes
    action: Immediate rollback

  healthChecks:
    consecutiveFailures: 3
    interval: 30 seconds
    action: Immediate rollback

  exceptionRate:
    threshold: 100  # exceptions per minute
    window: 5 minutes
    action: Immediate rollback

  customMetrics:
    - name: AuditIngestionFailureRate
      threshold: 0.02  # 2% failure rate
      window: 10 minutes
      action: Immediate rollback

Implementation (Monitoring Script):

# monitor-canary.py
import time
import sys
from azure.monitor.query import MetricsQueryClient
from azure.identity import DefaultAzureCredential

def monitor_canary_deployment(duration_minutes, error_threshold, latency_threshold):
    credential = DefaultAzureCredential()
    client = MetricsQueryClient(credential)

    start_time = time.time()
    end_time = start_time + (duration_minutes * 60)

    while time.time() < end_time:
        elapsed = int((time.time() - start_time) / 60)
        print(f"Monitoring canary: {elapsed}/{duration_minutes} minutes...")

        # Query error rate
        error_rate = query_error_rate(client, window_minutes=5)
        if error_rate > error_threshold:
            print(f"❌ ERROR: Canary error rate {error_rate}% exceeds threshold {error_threshold}%")
            sys.exit(1)

        # Query latency p95
        latency_p95 = query_latency_p95(client, window_minutes=5)
        if latency_p95 > latency_threshold:
            print(f"❌ ERROR: Canary p95 latency {latency_p95}ms exceeds threshold {latency_threshold}ms")
            sys.exit(1)

        # Query health checks
        health_status = query_health_checks(client)
        if health_status != "Healthy":
            print(f"❌ ERROR: Canary health check failed: {health_status}")
            sys.exit(1)

        print(f"✅ Canary healthy: Error={error_rate}%, P95={latency_p95}ms")
        time.sleep(60)  # Check every minute

    print(f"✅ Canary monitoring complete: {duration_minutes} minutes passed")
    sys.exit(0)

if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument("--duration", type=int, required=True)
    parser.add_argument("--error-threshold", type=float, required=True)
    parser.add_argument("--latency-threshold", type=int, required=True)
    args = parser.parse_args()

    monitor_canary_deployment(args.duration, args.error_threshold, args.latency_threshold)
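The `query_error_rate`, `query_latency_p95`, and `query_health_checks` helpers referenced above are left to the reader; they wrap Azure Monitor metric queries. The threshold arithmetic they feed is simple, and can be sketched dependency-free (assuming the metric query returns failed and total request counts for the window):

```python
def compute_error_rate(failed_requests: float, total_requests: float) -> float:
    """Failure ratio in percent for a metrics window; 0.0 when no traffic."""
    if total_requests <= 0:
        return 0.0
    return 100.0 * failed_requests / total_requests

def breaches_threshold(error_rate_pct: float, threshold_pct: float) -> bool:
    """True when the canary should be rolled back for this window."""
    return error_rate_pct > threshold_pct
```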

Manual Rollback Triggers

Operator-Initiated Rollback:

# Manual rollback via Azure DevOps CLI
az pipelines run \
  --name "ATP-Rollback-Pipeline" \
  --parameters \
    environment=production \
    targetVersion=1.0.42 \
    reason="Manual rollback due to [REASON]" \
    initiatedBy=$(az account show --query user.name -o tsv)

Rollback Reasons (Documented):

  1. Functional Regression: Feature not working as expected; user-reported issues.
  2. Performance Degradation: Latency increase not caught by automated thresholds.
  3. Business Decision: Stakeholder request to revert feature.
  4. Security Vulnerability: Newly discovered vulnerability in deployed version.
  5. Data Corruption: Audit data integrity issues detected.

Rollback Procedures

Dev/Test Rollback (Redeploy Previous Version):

# rollback-dev.yaml
- stage: Rollback_Dev
  jobs:
  - deployment: RollbackToPreviousVersion
    environment: ATP-Dev
    strategy:
      runOnce:
        deploy:
          steps:
          - script: |
              # Identify last-known-good build
              LAST_GOOD_BUILD=$(az pipelines runs list \
                --pipeline-id $(PipelineId) \
                --status completed \
                --result succeeded \
                --top 2 \
                --query "[1].id" \
                --output tsv)

              echo "Rolling back to build: $LAST_GOOD_BUILD"

              # Download previous build artifacts
              az pipelines runs artifact download \
                --run-id $LAST_GOOD_BUILD \
                --artifact-name drop \
                --path $(Pipeline.Workspace)/rollback
            displayName: 'Download Previous Build'

          - template: deploy/deploy-microservice-to-azure-web-site.yaml@templates
            parameters:
              azureSubscription: $(azureSubscription)
              appName: atp-ingestion-dev-eus
              package: $(Pipeline.Workspace)/rollback/*.zip

RTO: 5 minutes (download + redeploy).
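The `--top 2 --query "[1].id"` trick above takes the second-most-recent successful run, assuming the newest successful run is the one being rolled back. The selection logic, sketched for clarity (a hypothetical helper, not an existing ATP utility):

```python
def last_known_good(run_ids: list, current_run_id=None):
    """Pick the rollback target from run IDs sorted newest-first.

    If the newest successful run is the one being rolled back, skip it;
    otherwise the newest successful run is already the rollback target.
    Returns None when no candidate exists.
    """
    if not run_ids:
        return None
    if current_run_id is not None and run_ids[0] == current_run_id:
        return run_ids[1] if len(run_ids) > 1 else None
    return run_ids[0]
```

Note the edge case this makes explicit: if the bad build's run failed outright, it never appears in the succeeded-runs list, and index 0 is already the correct target.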


Staging Rollback (Blue-Green Slot Swap):

# rollback-staging.yaml
- stage: Rollback_Staging
  jobs:
  - deployment: RollbackToStable
    environment: ATP-Staging
    strategy:
      runOnce:
        deploy:
          steps:
          - task: AzureAppServiceManage@0
            displayName: 'Swap Back to Previous Slot'
            inputs:
              azureSubscription: $(azureSubscription)
              action: 'Swap Slots'
              webAppName: atp-ingestion-staging-eus
              sourceSlot: production
              targetSlot: blue  # Swap back

          - script: |
              # Verify rollback successful
              curl -f https://atp-gateway-staging-eus.azurewebsites.net/health

              # Update deployment tracking
              echo "Rollback completed at $(date -u +%Y-%m-%dT%H:%M:%SZ)"
            displayName: 'Verify Rollback'

RTO: 2 minutes (slot swap is nearly instantaneous).


Production Rollback (Canary Traffic Reversion):

# rollback-production.yaml
- stage: Rollback_Production
  jobs:
  - deployment: RollbackCanary
    environment: ATP-Production  # May require 1 approver depending on policy
    strategy:
      runOnce:
        deploy:
          steps:
          - script: |
              echo "🔴 Rolling back Production canary deployment..."

              # Revert Istio traffic to 100% stable
              kubectl apply -f k8s/canary/atp-ingestion-stable-only.yaml

              # Verify traffic shift
              kubectl get virtualservice atp-ingestion-traffic -n atp-prod -o yaml

              # Wait 2 minutes for traffic to drain
              sleep 120

              # Delete canary deployment
              kubectl delete deployment atp-ingestion-canary -n atp-prod

              echo "✅ Rollback complete; 100% traffic on stable version"
            displayName: 'Revert Traffic to Stable'

          - script: |
              # Verify stable version health; fail if it never reports healthy
              HEALTHY=false
              for i in {1..10}; do
                HEALTH=$(curl -s https://atp-gateway-prod.connectsoft.com/health | jq -r '.status')
                if [ "$HEALTH" == "Healthy" ]; then
                  echo "✅ Stable version healthy"
                  HEALTHY=true
                  break
                fi
                sleep 30
              done

              if [ "$HEALTHY" != "true" ]; then
                echo "❌ Stable version did not report healthy after rollback"
                exit 1
              fi

              # Verify error rate returned to normal
              ERROR_RATE=$(az monitor app-insights metrics show \
                --app atp-appinsights-prod-eus \
                --metric "requests/failed" \
                --aggregation avg \
                --offset 5m \
                --query "value.segments[0]['requests/failed'].avg")

              echo "Post-rollback error rate: $ERROR_RATE%"
            displayName: 'Post-Rollback Validation'

          - task: AzureCLI@2
            displayName: 'Create Post-Incident Review Task'
            inputs:
              scriptType: 'bash'
              scriptLocation: 'inlineScript'
              inlineScript: |
                # Create work item for RCA
                az boards work-item create \
                  --title "Post-Incident Review: Production Rollback - Build $(Build.BuildNumber)" \
                  --type "Task" \
                  --description "Conduct RCA for production rollback. Analyze canary metrics, identify root cause, implement preventive measures." \
                  --assigned-to "platform-architect@connectsoft.example" \
                  --area "ATP/Production" \
                  --fields Priority=1 \
                  --discussion "Rollback executed at $(date -u +%Y-%m-%dT%H:%M:%SZ). Incident ticket: $(IncidentTicketId)"

RTO: < 5 minutes (traffic reversion is near-instantaneous; cleanup takes additional time).


Promotion Workflow Automation

Automated Promotion Script (Dev → Test):

#!/bin/bash
# promote-to-test.sh

set -e

echo "Automated Promotion: Dev → Test"

# 1. Verify Dev stability
echo "Checking Dev environment stability..."

DEV_ERROR_RATE=$(az monitor app-insights metrics show \
  --app atp-appinsights-dev-eus \
  --metric "requests/failed" \
  --aggregation avg \
  --offset 24h \
  --query "value.segments[0]['requests/failed'].avg")

if (( $(echo "$DEV_ERROR_RATE > 0.10" | bc -l) )); then
  echo "❌ Dev error rate too high: $DEV_ERROR_RATE%"
  exit 1
fi

# 2. Verify smoke tests
echo "Running Dev smoke tests..."
if ! dotnet test tests/Smoke.Tests.csproj --environment Dev --filter "Priority=1"; then
  # With `set -e` above, an unguarded failing command exits the script before
  # a separate `$?` check could run, so the failure is handled inline
  echo "❌ Dev smoke tests failed"
  exit 1
fi

# 3. Trigger Test deployment pipeline
# PIPELINE_ID must be supplied via the environment; Azure DevOps macros such
# as $(PipelineId) are not expanded inside checked-in script files
echo "Triggering Test deployment..."
az pipelines run \
  --name "ATP-Ingestion-Pipeline" \
  --branch main \
  --variables \
    targetEnvironment=test \
    sourceEnvironment=dev \
    buildId=$(az pipelines runs list --pipeline-id "${PIPELINE_ID:?}" --top 1 --query "[0].id" -o tsv)

echo "✅ Promotion to Test initiated"

Promotion Tracking (Azure DevOps):

// Promotion tracking service
public class PromotionTracker
{
    public async Task<PromotionResult> TrackPromotionAsync(string buildId, string sourceEnv, string targetEnv)
    {
        var promotion = new Promotion
        {
            BuildId = buildId,
            SourceEnvironment = sourceEnv,
            TargetEnvironment = targetEnv,
            InitiatedAt = DateTime.UtcNow,
            InitiatedBy = _context.User.Identity.Name,
            Status = PromotionStatus.Pending
        };

        await _dbContext.Promotions.AddAsync(promotion);
        await _dbContext.SaveChangesAsync();

        // Emit telemetry event
        _telemetry.TrackEvent("EnvironmentPromotion", new Dictionary<string, string>
        {
            ["BuildId"] = buildId,
            ["SourceEnvironment"] = sourceEnv,
            ["TargetEnvironment"] = targetEnv,
            ["PromotionId"] = promotion.Id.ToString()
        });

        return new PromotionResult { PromotionId = promotion.Id, Status = "Initiated" };
    }
}

Change Advisory Board (CAB) Process

Purpose: Provide governance oversight for Production deployments with cross-functional review of changes, risks, and rollback plans.

CAB Composition:

  • Platform Architect (chair)
  • SRE Lead
  • Security Officer
  • Product Owner
  • Compliance Officer (for regulatory changes)

CAB Meeting Cadence:

  • Regular CAB: Weekly (Wednesdays 2 PM); reviews all Staging → Production promotions.
  • Emergency CAB: On-demand (within 2 hours); reviews P0/P1 hotfixes.

Change Ticket Template:

# Change Ticket: CHG-2025-1030-001

## Change Summary
Deploy ATP Ingestion Service v1.0.50 to Production

## Justification
- New feature: AI-assisted anomaly detection (10% canary rollout)
- Bug fix: Query performance regression (20% improvement expected)
- Security patch: Update dependency with CVE-2025-12345 fix

## Impact Assessment
- **Services Affected**: ATP Ingestion, ATP Query
- **Downtime Expected**: None (canary deployment)
- **User Impact**: Minimal (10% of tenants see new features)
- **Data Impact**: None (backward-compatible schema)

## Testing Evidence
- [x] All regression tests passed (Test environment)
- [x] Load tests passed at 80% peak load (Staging)
- [x] Chaos tests passed (pod failures, network latency)
- [x] Security scan: 0 critical/high vulnerabilities
- [x] Staging soak: 72 hours (no incidents)

## Deployment Plan
- **Date**: Friday, November 1, 2025
- **Time**: 10 PM - 12 AM EST (approved maintenance window)
- **Method**: Canary deployment (5% → 20% → 50% → 100%)
- **Duration**: 3 hours (including monitoring)

## Rollback Plan
- **Method**: Istio traffic reversion to stable version
- **RTO**: < 5 minutes
- **Trigger**: Error rate > 1% OR p95 latency > 1000ms OR manual abort
- **Communication**: PagerDuty alert to SRE on-call

## Approvals
- [x] Platform Architect: Approved
- [x] SRE Lead: Approved
- [x] Security Officer: Approved (no security concerns)
- [ ] CAB: **Pending Review**

## Communication Plan
- **Pre-Deployment**: Email to stakeholders (24 hours before)
- **During Deployment**: Slack updates in #production-deployments
- **Post-Deployment**: Status page update + email confirmation

CAB Approval Workflow:

flowchart LR
    A[Change Ticket Created] --> B[CAB Review]
    B --> C{Approved?}
    C -->|No| D[Modify Change + Resubmit]
    C -->|Yes| E[Schedule Deployment]
    E --> F[Pre-Deployment Notification]
    F --> G[Execute Deployment]
    G --> H[Post-Deployment Review]
    H --> I[Close Change Ticket]

    D --> B

Deployment Scheduling & Change Windows

Production Change Windows (Approved Times):

| Day | Window | Type | Use Case |
|---|---|---|---|
| Tuesday | 10 PM - 12 AM EST | Standard | Minor updates, feature rollouts |
| Friday | 10 PM - 2 AM EST | Extended | Major releases, infrastructure changes |
| Saturday | 2 AM - 6 AM EST | Extended | Database migrations, breaking changes |
| Any Day | Emergency | Hotfix | P0/P1 incident resolution only |

Blackout Periods (No Production Deployments):

  • End of Quarter: Last 3 days of Q1, Q2, Q3, Q4 (business-critical reporting).
  • Major Holidays: December 24-26, December 31 - January 2.
  • Tenant Peak Hours: Monday-Friday 8 AM - 6 PM EST.

Change Window Validation (Pipeline):

- stage: Validate_Change_Window
  jobs:
  - job: CheckChangeWindow
    steps:
    - script: |
        CURRENT_DAY=$(date +%A)
        CURRENT_HOUR=$(date +%H)
        CURRENT_DATE=$(date +%Y-%m-%d)

        # Check if current time is within an approved window
        if [ "$CURRENT_DAY" == "Tuesday" ] && [ "$CURRENT_HOUR" -ge 22 ]; then
          echo "✅ Within approved change window (Tuesday 10 PM - 12 AM)"
        elif [ "$CURRENT_DAY" == "Friday" ] && [ "$CURRENT_HOUR" -ge 22 ]; then
          echo "✅ Within approved change window (Friday 10 PM - 2 AM)"
        elif [ "$CURRENT_DAY" == "Saturday" ] && [ "$CURRENT_HOUR" -lt 2 ]; then
          echo "✅ Within approved change window (Friday 10 PM - 2 AM)"
        elif [ "$CURRENT_DAY" == "Saturday" ] && [ "$CURRENT_HOUR" -ge 2 ] && [ "$CURRENT_HOUR" -lt 6 ]; then
          echo "✅ Within approved change window (Saturday 2 AM - 6 AM)"
        else
          # Check if it's an emergency hotfix
          if [ "$(IsHotfix)" == "true" ]; then
            echo "⚠️ Emergency hotfix; bypassing change window"
          else
            echo "❌ Deployment outside approved change window"
            echo "Current: $CURRENT_DAY $CURRENT_HOUR:00"
            exit 1
          fi
        fi

        # Check blackout periods (2025 dates hardcoded; approximates the
        # last three days of each quarter for both 30- and 31-day months)
        if [[ "$CURRENT_DATE" =~ 2025-(03|06|09|12)-(2[89]|30|31) ]]; then
          echo "❌ Blackout period: End of quarter"
          exit 1
        fi
      displayName: 'Validate Change Window'
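The same window table can be modeled as a pure function, which is easier to unit-test than inline bash. A sketch (hypothetical helper; it also covers the early-Saturday tail of the Friday window, which a naive day-of-week check misses; hours are local EST):

```python
def in_change_window(day: str, hour: int, is_hotfix: bool = False) -> bool:
    """True when a Production deployment may proceed at the given local time.

    Approved windows (EST): Tue 22-24, Fri 22-24 plus Sat 0-2 (the Friday
    window running past midnight), and Sat 2-6 (extended). Emergency
    hotfixes bypass the window entirely.
    """
    if is_hotfix:
        return True
    if day == "Tuesday" and 22 <= hour < 24:
        return True
    if day == "Friday" and 22 <= hour < 24:
        return True
    if day == "Saturday" and 0 <= hour < 2:   # tail of the Friday window
        return True
    if day == "Saturday" and 2 <= hour < 6:   # extended Saturday window
        return True
    return False
```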

Rollback Procedures by Environment

Dev Environment Rollback

Method: Redeploy previous build artifacts

RTO: 5 minutes

Procedure:

#!/bin/bash
# rollback-dev.sh

# 1. Identify last-known-good build (PIPELINE_ID supplied via the environment;
# pipeline macros are not expanded inside checked-in script files)
LAST_GOOD_BUILD=$(az pipelines runs list \
  --pipeline-id "${PIPELINE_ID:?}" \
  --branch main \
  --status completed \
  --result succeeded \
  --top 2 \
  --query "[1].id" \
  --output tsv)

echo "Rolling back Dev to build: $LAST_GOOD_BUILD"

# 2. Download artifacts
az pipelines runs artifact download \
  --run-id $LAST_GOOD_BUILD \
  --artifact-name drop \
  --path ./rollback

# 3. Deploy to Dev
az webapp deployment source config-zip \
  --resource-group ATP-Dev-RG \
  --name atp-ingestion-dev-eus \
  --src ./rollback/drop.zip

# 4. Verify rollback
sleep 30
curl -f https://atp-ingestion-dev-eus.azurewebsites.net/health

echo "✅ Dev rollback complete"

Test Environment Rollback

Method: Redeploy previous build + restore test data

RTO: 10 minutes

Procedure:

#!/bin/bash
# rollback-test.sh

# 1. Backup current test data (in case rollback fails)
az sql db copy \
  --name ATP_Test \
  --resource-group ATP-Test-RG \
  --server atp-sql-test-eus \
  --dest-name ATP_Test_Rollback_Backup_$(date +%Y%m%d%H%M%S) \
  --dest-resource-group ATP-Test-Backups-RG

# 2. Identify last-known-good build (PIPELINE_ID supplied via the environment)
LAST_GOOD_BUILD=$(az pipelines runs list \
  --pipeline-id "${PIPELINE_ID:?}" \
  --branch main \
  --status completed \
  --result succeeded \
  --top 2 \
  --query "[1].id" \
  --output tsv)

# 3. Deploy previous version
az webapp deployment source config-zip \
  --resource-group ATP-Test-RG \
  --name atp-ingestion-test-eus \
  --src ./rollback/drop.zip

# 4. Restore test data (if schema changed; SCHEMA_CHANGED supplied via the environment)
if [ "${SCHEMA_CHANGED:-false}" == "true" ]; then
  echo "Restoring test data from stable snapshot..."
  # az sql db restore expects an ISO-8601 point-in-time, not a relative phrase
  az sql db restore \
    --resource-group ATP-Test-RG \
    --server atp-sql-test-eus \
    --name ATP_Test \
    --dest-name ATP_Test_Restored \
    --time "$(date -u -d '48 hours ago' +%Y-%m-%dT%H:%M:%SZ)"
fi

echo "✅ Test rollback complete"

Staging Environment Rollback

Method: Blue-green slot swap

RTO: 2 minutes

Procedure:

# Rollback via slot swap (instant)
- task: AzureAppServiceManage@0
  displayName: 'Rollback Staging (Slot Swap)'
  inputs:
    azureSubscription: $(azureSubscription)
    action: 'Swap Slots'
    webAppName: atp-ingestion-staging-eus
    sourceSlot: production  # Current (bad) version
    targetSlot: blue  # Previous (good) version

Post-Rollback Validation:

# Verify rollback successful
curl -f https://atp-gateway-staging-eus.azurewebsites.net/health

# Run quick smoke tests
dotnet test tests/Smoke.Tests.csproj \
  --environment Staging \
  --filter "Category=Critical"

# Check error rate
ERROR_RATE=$(az monitor app-insights metrics show \
  --app atp-appinsights-staging-eus \
  --metric "requests/failed" \
  --aggregation avg \
  --offset 5m \
  --query "value.segments[0]['requests/failed'].avg")

if (( $(echo "$ERROR_RATE > 0.02" | bc -l) )); then
  echo "⚠️ Warning: Error rate still elevated: $ERROR_RATE%"
else
  echo "✅ Staging rollback successful; error rate normal"
fi

Production Environment Rollback

Method: Canary traffic reversion (< 1 minute) or Kubernetes rollout undo (< 5 minutes)

RTO: < 5 minutes

Automated Rollback (Triggered by Metrics):

# Automatic rollback on metric threshold breach
on:
  failure:
    steps:
    - script: |
        echo "🔴 AUTOMATIC ROLLBACK TRIGGERED"

        # Log rollback reason
        ROLLBACK_REASON=$(cat <<EOF
        {
          "buildId": "$(Build.BuildId)",
          "buildNumber": "$(Build.BuildNumber)",
          "timestamp": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
          "trigger": "Automated",
          "reason": "Canary metrics exceeded thresholds",
          "canaryIncrement": "$(strategy.increment)%"
        }
        EOF
        )

        echo "$ROLLBACK_REASON" | tee rollback-log.json

        # Revert Istio traffic to 100% stable
        kubectl apply -f k8s/canary/atp-ingestion-stable-only.yaml

        # Wait for traffic drain
        sleep 60

        # Delete canary deployment
        kubectl delete deployment atp-ingestion-canary -n atp-prod --ignore-not-found=true

        # Verify stable version healthy
        kubectl rollout status deployment/atp-ingestion -n atp-prod

        echo "✅ Automatic rollback complete"
      displayName: 'Automatic Canary Rollback'

    - task: PublishBuildArtifacts@1
      displayName: 'Publish Rollback Log'
      inputs:
        PathtoPublish: 'rollback-log.json'
        ArtifactName: 'rollback-evidence'

    - script: |
        # Notify on-call team
        curl -X POST https://events.pagerduty.com/v2/enqueue \
          -H 'Content-Type: application/json' \
          -d '{
            "routing_key": "$(PagerDutyRoutingKey)",
            "event_action": "trigger",
            "payload": {
              "summary": "Production canary automatic rollback - Build $(Build.BuildNumber)",
              "severity": "critical",
              "source": "Azure Pipelines",
              "custom_details": {
                "buildId": "$(Build.BuildId)",
                "canaryIncrement": "$(strategy.increment)%",
                "reason": "Metrics threshold exceeded"
              }
            }
          }'

        # Slack notification
        curl -X POST $(SlackWebhookUrl) \
          -H 'Content-Type: application/json' \
          -d '{
            "text": "🚨 Production Automatic Rollback",
            "attachments": [{
              "color": "danger",
              "title": "Build $(Build.BuildNumber) - Canary Rollback",
              "fields": [
                {"title": "Increment", "value": "$(strategy.increment)%", "short": true},
                {"title": "Trigger", "value": "Automated (metrics)", "short": true},
                {"title": "RTO", "value": "< 5 minutes", "short": true},
                {"title": "Incident", "value": "Created: INC-AUTO-$(Build.BuildId)", "short": true}
              ]
            }]
          }'
      displayName: 'Notify Stakeholders'

Manual Rollback (Operator-Initiated):

#!/bin/bash
# manual-rollback-production.sh

read -p "Confirm Production rollback (yes/no): " CONFIRM

if [ "$CONFIRM" != "yes" ]; then
  echo "Rollback cancelled"
  exit 0
fi

read -p "Enter rollback reason: " REASON
read -p "Enter target build ID (last-known-good): " TARGET_BUILD_ID

echo "Initiating Production rollback..."
echo "Reason: $REASON"
echo "Target Build: $TARGET_BUILD_ID"

# 1. Revert traffic to stable version
kubectl apply -f k8s/canary/atp-ingestion-stable-only.yaml

# 2. Wait for traffic to drain from canary
sleep 120

# 3. Delete canary deployment
kubectl delete deployment atp-ingestion-canary -n atp-prod

# 4. If needed, rollback stable deployment to previous version
if [ -n "$TARGET_BUILD_ID" ]; then
  echo "Rolling back stable deployment to build $TARGET_BUILD_ID..."

  kubectl set image deployment/atp-ingestion -n atp-prod \
    atp-ingestion=connectsoft.azurecr.io/atp/ingestion:$TARGET_BUILD_ID

  kubectl rollout status deployment/atp-ingestion -n atp-prod --timeout=5m
fi

# 5. Verify rollback
curl -f https://atp-gateway-prod.connectsoft.com/health

# 6. Create incident ticket
az boards work-item create \
  --title "Manual Production Rollback - $(date -u +%Y-%m-%d) (target build $TARGET_BUILD_ID)" \
  --type "Incident" \
  --description "Reason: $REASON. Rolled back to: $TARGET_BUILD_ID" \
  --assigned-to "sre-team@connectsoft.example" \
  --fields Priority=1

echo "✅ Production rollback complete"

Post-Deployment Verification

Verification Checklist (All Environments):

# Post-deployment verification template
- stage: Post_Deployment_Verification
  jobs:
  - job: VerifyDeployment
    steps:
    # 1. Health checks
    - script: |
        for service in gateway ingestion query integrity export policy search; do
          echo "Health check: atp-$service-$(environment)-eus"
          curl -f https://atp-$service-$(environment)-eus.azurewebsites.net/health || exit 1
        done
      displayName: 'Verify Service Health'

    # 2. Smoke tests
    - task: DotNetCoreCLI@2
      displayName: 'Run Smoke Tests'
      inputs:
        command: 'test'
        projects: 'tests/Smoke.Tests.csproj'
        arguments: '--environment $(environment) --filter "Priority=1"'

    # 3. Metrics validation
    - script: |
        # Wait for metrics to stabilize
        sleep 300  # 5 minutes

        ERROR_RATE=$(az monitor app-insights metrics show \
          --app atp-appinsights-$(environment)-eus \
          --metric "requests/failed" \
          --aggregation avg \
          --offset 5m \
          --query "value.segments[0]['requests/failed'].avg")

        echo "Post-deployment failed-request rate: $ERROR_RATE"

        THRESHOLD=$([ "$(environment)" == "prod" ] && echo "0.01" || echo "0.05")

        if (( $(echo "$ERROR_RATE > $THRESHOLD" | bc -l) )); then
          echo "❌ Error rate exceeds threshold for $(environment)"
          exit 1
        fi
      displayName: 'Validate Metrics'

    # 4. Database migration verification
    - script: |
        # Verify database schema version matches deployment
        DB_VERSION=$(sqlcmd -S atp-sql-$(environment)-eus.database.windows.net \
          -d ATP_$(environment) \
          -Q "SET NOCOUNT ON; SELECT TOP 1 MigrationId FROM __EFMigrationsHistory ORDER BY MigrationId DESC" \
          -h -1)

        EXPECTED_VERSION=$(grep 'MigrationVersion' version.txt | cut -d'=' -f2)

        if [ "$DB_VERSION" != "$EXPECTED_VERSION" ]; then
          echo "❌ Database version mismatch. Expected: $EXPECTED_VERSION, Actual: $DB_VERSION"
          exit 1
        fi

        echo "✅ Database schema version verified"
      displayName: 'Verify Database Migrations'

    # 5. Configuration validation
    - script: |
        # Verify environment configuration loaded correctly
        CONFIG_ENV=$(curl -s https://atp-gateway-$(environment)-eus.azurewebsites.net/api/diagnostics/config | jq -r '.environment')

        if [ "$CONFIG_ENV" != "$(environment)" ]; then
          echo "❌ Configuration environment mismatch"
          exit 1
        fi

        echo "✅ Configuration validated"
      displayName: 'Verify Configuration'

Promotion Metrics & Reporting

Metrics Tracked (Per Promotion):

promotionMetrics:
  deployment:
    initiatedAt: 2025-10-30T22:00:00Z
    completedAt: 2025-10-30T23:45:00Z
    duration: 105 minutes

  approvals:
    requestedAt: 2025-10-30T14:00:00Z
    approvedAt: 2025-10-30T15:30:00Z
    approvalDuration: 90 minutes
    approvers: [architect@example.com, sre@example.com]

  validation:
    smokeTests: passed
    regressionTests: passed (100%)
    loadTests: passed (p95: 450ms)
    chaosTests: passed
    securityScans: passed (0 critical)

  rollback:
    triggered: false
    reason: null
    duration: null

  outcome: success

Promotion Dashboard (Power BI / Azure Monitor Workbook):

  • Promotion Frequency: Deployments per environment per week.
  • Approval Duration: Time from approval request to approval granted.
  • Success Rate: Percentage of successful promotions (no rollback).
  • DORA Metrics: Deployment frequency, lead time, MTTR, change failure rate.
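As a concrete example of one dashboard figure, change failure rate can be derived directly from per-promotion outcome records. The CSV shape below is illustrative, not ATP's actual export format:

```shell
# Compute DORA change failure rate from promotion outcomes.
# The file layout here is an assumption for illustration only.
cat > /tmp/promotions.csv <<'EOF'
buildId,environment,outcome
101,prod,success
102,prod,rolled_back
103,prod,success
104,prod,success
EOF

CFR=$(awk -F, 'NR > 1 && $2 == "prod" {
  total++
  if ($3 == "rolled_back") failed++
} END { printf "%.0f", 100 * failed / total }' /tmp/promotions.csv)

echo "change failure rate: ${CFR}%"
```

With one rollback in four production promotions, the rate is 25%.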

Summary

  • Promotion Lanes: Standard lane (feature releases) and hotfix lane (emergency fixes) with clearly defined progression.
  • Approval Gates: Graduated approvals from zero (Dev → Test) to 2 + CAB (Staging → Production) with specific validation requirements.
  • Automated Promotion: Dev → Test fully automated with 24-hour soak and metric validation.
  • Manual Approvals: Test → Staging (1 approver, 4-hour SLA), Staging → Production (2 approvers + CAB, 24-hour SLA).
  • Rollback Triggers: Automated (metric thresholds) and manual (operator-initiated) with environment-specific RTOs.
  • Change Windows: Approved deployment times with blackout periods for business-critical operations.
  • CAB Process: Weekly governance meetings for Production changes with comprehensive change ticket template.

Data Management Per Environment

ATP's data management strategy ensures each environment has appropriate data characteristics for its purpose: synthetic data for rapid testing (Dev/Test), production-like datasets for validation (Staging), and live tenant data with full compliance controls (Production). This approach balances testing realism with privacy protection and compliance requirements.

Data management policies vary significantly across environments to support different use cases while maintaining data sovereignty, retention compliance, and disaster recovery capabilities appropriate to each tier's criticality.

Data Management Overview

| Environment | Data Source | Volume | PII/Sensitivity | Retention | Immutability | Backups | Compliance |
|---|---|---|---|---|---|---|---|
| Preview | Minimal synthetic | 100 events | None (synthetic) | PR lifetime | No | No | None |
| Dev | Synthetic generators | 10K events | None (synthetic) | 30 days rolling | No | No | Basic redaction testing |
| Test | Stable fixtures | 50K events | None (synthetic) | 90 days | No | Daily (30-day retention) | GDPR/HIPAA simulation |
| Staging | Prod-like synthetic | 5M events | Obfuscated PII | 180 days | Yes | Weekly (4-week retention) | Full enforcement |
| Production | Live tenant data | Millions/day | Real PII (classified) | 7 years | Yes (WORM) | Daily + weekly (7-year retention) | Full enforcement |
| Hotfix | Prod clone (anonymized) | Subset of Prod | Anonymized PII | Incident duration | Yes | No | Full enforcement |

Dev Environment Data Management

Purpose: Synthetic data for rapid development iteration with no PII and minimal compliance constraints.

Data Source:

  • Synthetic Data Generators: C# data generator libraries (Bogus, AutoFixture) create realistic but fake tenant data.
  • Seeding Scripts: Version-controlled scripts regenerate consistent dev datasets.
  • Volume: 10 synthetic tenants, 1,000 audit events per tenant (10,000 total events).

Data Characteristics:

// Dev Data Generator (C# with Bogus library)
public class DevDataGenerator
{
    private readonly Faker<Tenant> _tenantFaker;
    private readonly Faker<AuditEvent> _auditEventFaker;

    public DevDataGenerator()
    {
        // Tenant generator
        _tenantFaker = new Faker<Tenant>()
            .RuleFor(t => t.TenantId, f => $"dev-tenant-{f.IndexFaker:000}")
            .RuleFor(t => t.Name, f => $"{f.Company.CompanyName()} (Dev)")
            .RuleFor(t => t.Edition, f => f.PickRandom("Standard", "Business", "Enterprise"))
            .RuleFor(t => t.Region, f => f.PickRandom("US", "EU", "APAC"))
            .RuleFor(t => t.CreatedAt, f => f.Date.Past(1))
            .RuleFor(t => t.MaxRetentionDays, f => f.Random.Int(30, 365))
            .RuleFor(t => t.ComplianceProfile, f => f.PickRandom("gdpr", "hipaa", "soc2", "none"));

        // Audit event generator
        _auditEventFaker = new Faker<AuditEvent>()
            .RuleFor(e => e.EventId, f => $"evt-{Guid.NewGuid()}")
            .RuleFor(e => e.Timestamp, f => f.Date.Recent(30))
            .RuleFor(e => e.Actor, f => f.Internet.Email())
            .RuleFor(e => e.Action, f => f.PickRandom("Create", "Read", "Update", "Delete", "Login", "Logout"))
            .RuleFor(e => e.Resource, f => $"/api/{f.PickRandom("users", "documents", "settings")}/{f.Random.Int(1, 100)}")
            .RuleFor(e => e.Outcome, f => f.Random.WeightedRandom(new[] { "Allowed", "Denied" }, new[] { 0.9f, 0.1f }))
            .RuleFor(e => e.IpAddress, f => f.Internet.Ip())
            .RuleFor(e => e.UserAgent, f => f.Internet.UserAgent())
            .RuleFor(e => e.Metadata, f => new Dictionary<string, object>
            {
                ["duration"] = f.Random.Int(10, 5000),
                ["statusCode"] = f.Random.Int(200, 500),
                ["region"] = f.PickRandom("eastus", "westeurope", "southeastasia")
            });
    }

    public IEnumerable<Tenant> GenerateTenants(int count)
    {
        return _tenantFaker.Generate(count);
    }

    public IEnumerable<AuditEvent> GenerateEvents(string tenantId, int count)
    {
        return _auditEventFaker
            .RuleFor(e => e.TenantId, _ => tenantId)
            .Generate(count);
    }
}

Data Seeding Script (Dev):

#!/bin/bash
# seed-dev-environment.sh

echo "Seeding Dev Environment with synthetic data..."

# 1. Clear existing data
dotnet run --project tools/DataSeeder -- --clear --environment Development

# 2. Generate synthetic tenants and events
dotnet run --project tools/DataSeeder -- \
  --environment Development \
  --tenants 10 \
  --events-per-tenant 1000 \
  --start-date "2025-09-01" \
  --end-date "2025-10-30" \
  --seed 42  # Fixed seed for reproducibility

# 3. Seed Redis cache with sessions
redis-cli -h atp-redis-dev-eus.redis.cache.windows.net -p 6380 -a $(az keyvault secret show --name RedisPassword --vault-name atp-keyvault-dev-eus --query value -o tsv) --tls <<EOF
SET session:dev-tenant-001:user-1 "{\"userId\":\"user-1\",\"tenantId\":\"dev-tenant-001\",\"expiresAt\":\"2025-10-31T00:00:00Z\"}"
SET session:dev-tenant-002:user-2 "{\"userId\":\"user-2\",\"tenantId\":\"dev-tenant-002\",\"expiresAt\":\"2025-10-31T00:00:00Z\"}"
EXPIRE session:dev-tenant-001:user-1 86400
EXPIRE session:dev-tenant-002:user-2 86400
EOF

# 4. Verify data counts
TENANT_COUNT=$(sqlcmd -S atp-sql-dev-eus.database.windows.net -d ATP_Dev -Q "SELECT COUNT(*) FROM Tenants" -h -1)
EVENT_COUNT=$(sqlcmd -S atp-sql-dev-eus.database.windows.net -d ATP_Dev -Q "SELECT COUNT(*) FROM AuditEvents" -h -1)

echo "Tenants: $TENANT_COUNT"
echo "Events: $EVENT_COUNT"

if [ "$TENANT_COUNT" -eq 10 ] && [ "$EVENT_COUNT" -eq 10000 ]; then
  echo "✅ Dev environment seeded successfully"
else
  echo "❌ Seeding verification failed"
  exit 1
fi

Retention Policy (Dev):

  • Audit Events: 30-day rolling retention; events older than 30 days automatically purged.
  • Tenants: Persist until manual cleanup.
  • Logs/Traces: 7-day retention in Application Insights.

Retention Enforcement (Automated Cleanup Job):

// Dev data cleanup job (Azure Function)
[FunctionName("CleanupDevData")]
public async Task RunAsync(
    [TimerTrigger("0 0 2 * * *")] TimerInfo timer,  // Daily at 2 AM
    ILogger log)
{
    log.LogInformation("Dev data cleanup started");

    var cutoffDate = DateTime.UtcNow.AddDays(-30);

    // Delete events older than 30 days
    var deletedCount = await _dbContext.AuditEvents
        .Where(e => e.Timestamp < cutoffDate)
        .ExecuteDeleteAsync();

    log.LogInformation($"Deleted {deletedCount} events older than 30 days");

    // Shrink database to reclaim space
    await _dbContext.Database.ExecuteSqlRawAsync("DBCC SHRINKDATABASE (ATP_Dev, 10)");

    log.LogInformation("✅ Dev data cleanup complete");
}

Immutability: Disabled (data can be freely modified/deleted for testing scenarios).

Backups: None (ephemeral dev data; recreate from seeding scripts if needed).
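Because Dev keeps no backups, reproducible seeding is the recovery mechanism: the fixed `--seed 42` guarantees an identical dataset on every run. The same property in miniature:

```shell
# Fixed-seed generation is deterministic: the same seed always yields the
# same sequence, so a wiped Dev database can be rebuilt identically.
gen() {
  awk -v seed="$1" 'BEGIN {
    srand(seed)
    for (i = 0; i < 3; i++) printf "%d ", int(rand() * 1000)
  }'
}

run1=$(gen 42)
run2=$(gen 42)
[ "$run1" = "$run2" ] && echo "reproducible: $run1"
```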


Test Environment Data Management

Purpose: Stable test datasets with version-controlled fixtures for consistent regression testing and QA validation.

Data Source:

  • Stable Fixtures: JSON/Parquet files version-controlled in tests/fixtures/ directory.
  • Reproducible: Same data across test runs; enables predictable test assertions.
  • Volume: 20 synthetic tenants, 50,000 audit events (stable over time).

Data Characteristics:

// tests/fixtures/test-tenants.json
[
  {
    "tenantId": "test-tenant-001",
    "name": "Acme Corporation (Test)",
    "edition": "Enterprise",
    "region": "US",
    "complianceProfile": "gdpr,hipaa",
    "createdAt": "2025-01-01T00:00:00Z",
    "maxRetentionDays": 2555
  },
  {
    "tenantId": "test-tenant-002",
    "name": "Global Industries (Test)",
    "edition": "Business",
    "region": "EU",
    "complianceProfile": "gdpr",
    "createdAt": "2025-01-15T00:00:00Z",
    "maxRetentionDays": 365
  }
  // ... 18 more tenants
]

Test Data Seeding (C#):

// Test data loader (loads from version-controlled fixtures)
public class TestDataSeeder
{
    private readonly IAuditDbContext _context;

    public async Task SeedTestEnvironmentAsync()
    {
        // 1. Clear existing data
        await _context.Database.ExecuteSqlRawAsync("TRUNCATE TABLE AuditEvents");
        await _context.Database.ExecuteSqlRawAsync("DELETE FROM Tenants");

        // 2. Load tenant fixtures
        var tenantsJson = await File.ReadAllTextAsync("fixtures/test-tenants.json");
        var tenants = JsonSerializer.Deserialize<List<Tenant>>(tenantsJson);
        await _context.Tenants.AddRangeAsync(tenants);
        await _context.SaveChangesAsync();

        // 3. Load event fixtures (Parquet for efficient storage)
        var eventsParquet = await LoadParquetAsync("fixtures/test-events-stable.parquet");

        // Batch insert for performance
        const int batchSize = 1000;
        for (int i = 0; i < eventsParquet.Count; i += batchSize)
        {
            var batch = eventsParquet.Skip(i).Take(batchSize).ToList();
            await _context.AuditEvents.AddRangeAsync(batch);
            await _context.SaveChangesAsync();
        }

        // 4. Verify data integrity
        var tenantCount = await _context.Tenants.CountAsync();
        var eventCount = await _context.AuditEvents.CountAsync();

        if (tenantCount != 20 || eventCount != 50000)
        {
            throw new Exception($"Data seeding verification failed. Expected: 20 tenants, 50K events. Actual: {tenantCount} tenants, {eventCount} events");
        }

        Console.WriteLine("✅ Test environment seeded successfully");
    }
}

Data Refresh Script (Test):

#!/bin/bash
# refresh-test-data.sh

echo "Refreshing Test environment data..."

# 1. Backup current data (safety net)
az sql db export \
  --name ATP_Test \
  --resource-group ATP-Test-RG \
  --server atp-sql-test-eus \
  --admin-user testadmin \
  --admin-password $(az keyvault secret show --vault-name atp-keyvault-test-eus --name SqlAdminPassword --query value -o tsv) \
  --storage-key $(az storage account keys list --account-name atpstoragetesteus --query "[0].value" -o tsv) \
  --storage-key-type StorageAccessKey \
  --storage-uri "https://atpstoragetesteus.blob.core.windows.net/backups/test-backup-$(date +%Y%m%d).bacpac"

# 2. Run seeding tool
dotnet run --project tools/DataSeeder -- \
  --environment Test \
  --clear \
  --load-fixtures tests/fixtures/test-tenants.json \
  --load-fixtures tests/fixtures/test-events-stable.parquet

# 3. Verify data integrity
TEST_TENANT_COUNT=$(sqlcmd -S atp-sql-test-eus.database.windows.net -d ATP_Test -Q "SELECT COUNT(*) FROM Tenants" -h -1)
TEST_EVENT_COUNT=$(sqlcmd -S atp-sql-test-eus.database.windows.net -d ATP_Test -Q "SELECT COUNT(*) FROM AuditEvents" -h -1)

if [ "$TEST_TENANT_COUNT" -eq 20 ] && [ "$TEST_EVENT_COUNT" -eq 50000 ]; then
  echo "✅ Test data refresh successful"
else
  echo "❌ Test data verification failed"
  exit 1
fi

# 4. Update test data version metadata
az storage blob metadata update \
  --account-name atpstoragetesteus \
  --container-name fixtures \
  --name test-data-version.txt \
  --metadata version=$(date +%Y%m%d) refreshedAt=$(date -u +%Y-%m-%dT%H:%M:%SZ)

Retention Policy (Test):

  • Audit Events: 90-day retention (matches Test compliance profile).
  • Automated Purge: Monthly cleanup job removes events older than 90 days.
  • Test Data Refresh: Weekly refresh from fixtures (every Sunday 2 AM).
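The purge selection reduces to a timestamp comparison against a rolling cutoff. A sketch assuming GNU `date` (ISO 8601 UTC timestamps compare correctly as plain strings):

```shell
# Select events older than the 90-day Test retention window.
CUTOFF=$(date -u -d '90 days ago' +%Y-%m-%dT%H:%M:%SZ)

cat > /tmp/sample-events.txt <<EOF
evt-old 2024-01-01T00:00:00Z
evt-new $(date -u +%Y-%m-%dT%H:%M:%SZ)
EOF

while read -r id ts; do
  if [[ "$ts" < "$CUTOFF" ]]; then
    echo "purge $id"
  else
    echo "keep $id"
  fi
done < /tmp/sample-events.txt
```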

Immutability: Disabled (test data can be modified for scenario testing).

Backups:

  • Frequency: Daily automated backups.
  • Retention: 30 days.
  • Purpose: Disaster recovery if test data corrupted; not for compliance.

Staging Environment Data Management

Purpose: Production-like datasets with obfuscated PII for realistic load testing, chaos engineering, and stakeholder acceptance validation.

Data Source (Two Options):

Option 1: Anonymized Production Snapshot

// Anonymize Production data for Staging
public class ProductionDataAnonymizer
{
    public async Task CreateAnonymizedSnapshotAsync()
    {
        // 1. Export production data (read-only replica)
        var prodEvents = await _prodDbContext.AuditEvents
            .Where(e => e.Timestamp > DateTime.UtcNow.AddDays(-180))
            .OrderBy(e => e.Timestamp)
            .Take(5_000_000)
            .AsNoTracking()
            .ToListAsync();

        // 2. Anonymize PII fields
        var anonymizedEvents = prodEvents.Select(e => new AuditEvent
        {
            EventId = e.EventId,
            TenantId = AnonymizeTenantId(e.TenantId),  // tenant-12345 → staging-tenant-001
            Timestamp = e.Timestamp,
            Actor = AnonymizeEmail(e.Actor),  // john.doe@example.com → user-123@staging.local
            Action = e.Action,  // Preserve (no PII)
            Resource = e.Resource,  // Preserve (no PII)
            Outcome = e.Outcome,  // Preserve
            IpAddress = AnonymizeIpAddress(e.IpAddress),  // 192.168.1.1 → 10.0.X.X
            UserAgent = e.UserAgent,  // Preserve
            Metadata = AnonymizeMetadata(e.Metadata)  // Remove any PII in JSON
        });

        // 3. Export to Parquet (efficient format)
        await ExportToParquetAsync(anonymizedEvents, "anonymized-prod-snapshot.parquet");

        // 4. Verify no PII remains
        await VerifyNoPIIAsync("anonymized-prod-snapshot.parquet");
    }

    private string AnonymizeTenantId(string realTenantId)
    {
        // Consistent mapping: real tenant → staging tenant
        var hash = ComputeHash(realTenantId);
        var index = Math.Abs(hash) % 50 + 1;
        return $"staging-tenant-{index:D3}";
    }

    private string AnonymizeEmail(string email)
    {
        // Hash email to consistent fake email
        var hash = ComputeHash(email);
        return $"user-{Math.Abs(hash) % 10000:D4}@staging.local";
    }

    private string AnonymizeIpAddress(string ipAddress)
    {
        // Replace with private IP range
        var hash = ComputeHash(ipAddress);
        return $"10.0.{Math.Abs(hash) % 256}.{Math.Abs(hash >> 8) % 256}";
    }
}
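The key property of Option 1 is determinism: the same real identifier always maps to the same pseudonym, so cross-event correlations survive anonymization. The same idea sketched in shell with `sha256sum` (the 10,000-bucket modulus mirrors the email anonymizer above; a real deployment would use a keyed or salted hash so pseudonyms cannot be reversed by dictionary attack):

```shell
# Deterministic pseudonymization: hash the identifier, bucket it into a
# stable fake address. Same input -> same output across refreshes.
anonymize_email() {
  local h
  h=$(printf '%s' "$1" | sha256sum | cut -c1-8)  # first 32 bits of the digest
  echo "user-$(( 0x$h % 10000 ))@staging.local"
}

a=$(anonymize_email "john.doe@example.com")
b=$(anonymize_email "john.doe@example.com")

[ "$a" = "$b" ] && echo "stable mapping: $a"
```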

Option 2: Production-Scale Synthetic Data

// Generate production-scale synthetic data
public class StagingDataGenerator
{
    public async Task GenerateProductionScaleDataAsync()
    {
        // 50 synthetic tenants (mimics real tenant distribution)
        var tenants = GenerateSyntheticTenants(count: 50, distribution: new
        {
            Standard = 20,   // 40%
            Business = 20,   // 40%
            Enterprise = 10  // 20%
        });

        await _context.Tenants.AddRangeAsync(tenants);
        await _context.SaveChangesAsync();

        // 5 million audit events (realistic production volume)
        var startDate = DateTime.UtcNow.AddDays(-180);
        var endDate = DateTime.UtcNow;

        var eventsPerDay = 5_000_000 / 180;  // ~27,778 events/day across all tenants

        // Normalize edition weights so the per-tenant volumes sum to the daily target
        var totalWeight = tenants.Sum(t => GetTenantWeightFactor(t.Edition));

        foreach (var tenant in tenants)
        {
            var tenantEventsPerDay = (int)(eventsPerDay * GetTenantWeightFactor(tenant.Edition) / totalWeight);

            var events = await GenerateRealisticEventsAsync(
                tenantId: tenant.TenantId,
                startDate: startDate,
                endDate: endDate,
                eventsPerDay: tenantEventsPerDay,
                patterns: new[] 
                {
                    "BusinessHours",    // More activity 9 AM - 5 PM
                    "WeekdayBias",      // Less activity on weekends
                    "SeasonalSpikes"    // Occasional high-volume days
                });

            // Batch insert (efficient)
            await BulkInsertAsync(events, batchSize: 10000);
        }
    }

    private double GetTenantWeightFactor(string edition)
    {
        return edition switch
        {
            "Standard" => 0.5,    // Lower activity
            "Business" => 1.0,    // Average activity
            "Enterprise" => 2.0,  // Higher activity
            _ => 1.0
        };
    }
}
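For scale intuition, the 5M-event target works out to roughly the following per-tenant daily volumes once the edition weights are normalized over the 50-tenant mix (20 Standard, 20 Business, 10 Enterprise):

```shell
# Back-of-envelope: 5M events / 180 days, shared across 50 tenants in
# proportion to edition weight (0.5 / 1.0 / 2.0).
awk 'BEGIN {
  base    = 5000000 / 180              # ~27,778 events/day platform-wide
  total_w = 20*0.5 + 20*1.0 + 10*2.0   # = 50
  printf "Standard:   %d events/day/tenant\n", base * 0.5 / total_w
  printf "Business:   %d events/day/tenant\n", base * 1.0 / total_w
  printf "Enterprise: %d events/day/tenant\n", base * 2.0 / total_w
}'
```

An Enterprise tenant thus generates about four times the daily volume of a Standard tenant, which is what the realistic-pattern load tests in Staging are sized against.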

Data Refresh Workflow (Staging):

#!/bin/bash
# refresh-staging-data.sh

echo "Refreshing Staging environment with production-like data..."

# Option 1: Restore from anonymized production snapshot
if [ "$USE_PROD_SNAPSHOT" == "true" ]; then
  echo "Restoring from anonymized production snapshot..."

  # Download latest anonymized snapshot
  az storage blob download \
    --account-name atpstorageprodeus \
    --container-name anonymized-snapshots \
    --name "anonymized-prod-snapshot-latest.parquet" \
    --file ./anonymized-snapshot.parquet

  # Import into Staging database
  dotnet run --project tools/DataImporter -- \
    --environment Staging \
    --clear \
    --import ./anonymized-snapshot.parquet \
    --verify-no-pii

else
  # Option 2: Generate production-scale synthetic data
  echo "Generating production-scale synthetic data..."

  dotnet run --project tools/DataSeeder -- \
    --environment Staging \
    --clear \
    --tenants 50 \
    --events-total 5000000 \
    --start-date "2025-04-01" \
    --end-date "2025-10-30" \
    --use-realistic-patterns \
    --seed 2025
fi

# Verify data volume
STAGING_TENANT_COUNT=$(sqlcmd -S atp-sql-staging-eus.database.windows.net -d ATP_Staging -Q "SELECT COUNT(*) FROM Tenants" -h -1)
STAGING_EVENT_COUNT=$(sqlcmd -S atp-sql-staging-eus.database.windows.net -d ATP_Staging -Q "SELECT COUNT(*) FROM AuditEvents" -h -1)

echo "Staging - Tenants: $STAGING_TENANT_COUNT, Events: $STAGING_EVENT_COUNT"

if [ "$STAGING_TENANT_COUNT" -ge 50 ] && [ "$STAGING_EVENT_COUNT" -ge 5000000 ]; then
  echo "✅ Staging data refresh successful"
else
  echo "❌ Staging data verification failed"
  exit 1
fi

Retention Policy (Staging):

  • Audit Events: 180-day retention (production-like; validates retention policies).
  • Legal Holds: Test legal hold workflows (place/release holds on specific events).
  • Automated Purge: Weekly job purges events older than 180 days.

Retention Enforcement (Staging):

// Staging retention enforcement (mirrors production logic)
[FunctionName("EnforceRetentionPolicy")]
public async Task RunAsync(
    [TimerTrigger("0 0 3 * * 0")] TimerInfo timer,  // Weekly on Sunday at 3 AM
    ILogger log)
{
    log.LogInformation("Enforcing retention policy for Staging");

    // Query tenants with custom retention settings
    var tenants = await _context.Tenants.ToListAsync();

    foreach (var tenant in tenants)
    {
        var retentionCutoff = DateTime.UtcNow.AddDays(-tenant.MaxRetentionDays);

        // Find events eligible for purge (excluding legal holds)
        var eventsToDelete = await _context.AuditEvents
            .Where(e => e.TenantId == tenant.TenantId)
            .Where(e => e.Timestamp < retentionCutoff)
            .Where(e => !e.LegalHold)  // Never delete events under legal hold
            .ToListAsync();

        if (eventsToDelete.Any())
        {
            log.LogInformation($"Purging {eventsToDelete.Count} events for tenant {tenant.TenantId} (retention: {tenant.MaxRetentionDays} days)");

            _context.AuditEvents.RemoveRange(eventsToDelete);
            await _context.SaveChangesAsync();
        }
    }

    log.LogInformation("✅ Retention policy enforcement complete");
}

Immutability: Enabled (tests WORM storage, hash chaining, and tamper-evidence workflows).
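What Staging exercises can be illustrated in a few lines: each record's hash folds in the previous record's hash, so altering any earlier event changes every hash after it. A sketch with `sha256sum` (the real chain additionally signs segment roots with HSM-backed keys):

```shell
# Minimal hash chain: tampering with any event changes the chain head.
chain_head() {
  local prev="genesis" evt
  for evt in "$@"; do
    prev=$(printf '%s|%s' "$evt" "$prev" | sha256sum | cut -c1-16)
  done
  echo "$prev"
}

good=$(chain_head "evt-1:login" "evt-2:export")
bad=$(chain_head "evt-1:LOGIN" "evt-2:export")   # first event tampered

[ "$good" != "$bad" ] && echo "tamper detected: chain head changed"
```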

Backups:

  • Frequency: Weekly automated backups.
  • Retention: 4 weeks (enables DR drills with realistic data).
  • Geo-Replication: Enabled (tests failover procedures).

Backup Script (Staging):

#!/bin/bash
# backup-staging.sh

echo "Creating Staging database backup..."

# Export to BACPAC (includes schema + data)
az sql db export \
  --name ATP_Staging \
  --resource-group ATP-Staging-RG \
  --server atp-sql-staging-eus \
  --admin-user $(az keyvault secret show --vault-name atp-keyvault-staging-eus --name SqlAdminUser --query value -o tsv) \
  --admin-password $(az keyvault secret show --vault-name atp-keyvault-staging-eus --name SqlAdminPassword --query value -o tsv) \
  --storage-key $(az storage account keys list --account-name atpstoragestgeus --query "[0].value" -o tsv) \
  --storage-key-type StorageAccessKey \
  --storage-uri "https://atpstoragestgeus.blob.core.windows.net/backups/staging-weekly-$(date +%Y%m%d).bacpac"

# Verify backup
BACKUP_SIZE=$(az storage blob show \
  --account-name atpstoragestgeus \
  --container-name backups \
  --name "staging-weekly-$(date +%Y%m%d).bacpac" \
  --query properties.contentLength -o tsv)

if [ "$BACKUP_SIZE" -gt 1000000 ]; then
  echo "✅ Staging backup created: $(($BACKUP_SIZE / 1024 / 1024)) MB"
else
  echo "❌ Backup verification failed"
  exit 1
fi

Production Environment Data Management

Purpose: Live tenant audit records with real PII, full compliance enforcement, and 7-year retention with WORM storage and tamper-evidence.

Data Source:

  • Live Tenant Traffic: Real audit events ingested from production tenant applications.
  • Volume: Millions of events per day across all tenants.
  • PII Classification: Full PII classification with data sensitivity labels (see pii-redaction-classification.md).

Data Characteristics:

// Production audit event with PII classification
public class AuditEvent
{
    public string EventId { get; set; }
    public string TenantId { get; set; }

    public DateTime Timestamp { get; set; }

    [PersonalData]  // PII classification
    public string Actor { get; set; }  // john.doe@customer.com

    public string Action { get; set; }
    public string Resource { get; set; }
    public string Outcome { get; set; }

    [PersonalData]
    public string IpAddress { get; set; }  // Client IP

    public string UserAgent { get; set; }

    [SensitiveData]
    public Dictionary<string, object> Metadata { get; set; }  // May contain PII

    // Immutability fields
    public string Hash { get; set; }  // SHA-256 hash of event
    public string PreviousHash { get; set; }  // Hash chain
    public string SegmentId { get; set; }
    public bool Sealed { get; set; }
    public DateTime? SealedAt { get; set; }

    // Compliance fields
    public bool LegalHold { get; set; }
    public string LegalHoldReason { get; set; }
    public DateTime? LegalHoldPlacedAt { get; set; }
    public int RetentionDays { get; set; }  // Per-tenant retention
    public DateTime PurgeEligibleAt { get; set; }  // Timestamp + RetentionDays

    // Soft-delete fields (set by retention enforcement; hard purge follows after 30 days)
    public bool MarkedForDeletion { get; set; }
    public DateTime? MarkedForDeletionAt { get; set; }
}

Retention Policy (Production):

  • Default Retention: 7 years (2,555 days) for all tenants.
  • Tenant-Specific Retention: Configurable per tenant (minimum 1 year, maximum 10 years).
  • Legal Holds: Override retention; events never purged while under legal hold.
  • Regulatory Compliance: GDPR storage-limitation principle (retain no longer than necessary for the stated purpose), HIPAA (6-year minimum), SOC 2 (audit logs commonly retained 7 years).
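PurgeEligibleAt is pure date arithmetic: the event timestamp plus the tenant's retention window. With the 7-year default (2,555 days) and GNU `date`:

```shell
# Compute the purge-eligibility timestamp for an event under the default
# 2,555-day (7-year) retention window.
EVENT_TS="2025-10-30T22:00:00Z"
RETENTION_DAYS=2555

event_epoch=$(date -u -d "$EVENT_TS" +%s)
purge_at=$(date -u -d "@$(( event_epoch + RETENTION_DAYS * 86400 ))" +%Y-%m-%dT%H:%M:%SZ)

echo "purge eligible at: $purge_at"
```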

Retention Configuration (Per Tenant):

// Tenant retention configuration
public class Tenant
{
    public string TenantId { get; set; }
    public string Name { get; set; }

    // Retention settings
    public int RetentionDays { get; set; } = 2555;  // 7 years default
    public bool AllowCustomRetention { get; set; } = false;
    public int MinRetentionDays { get; set; } = 365;  // 1 year minimum
    public int MaxRetentionDays { get; set; } = 3650;  // 10 years maximum

    // Compliance profile determines retention requirements
    public string ComplianceProfile { get; set; }  // "gdpr,hipaa,soc2"

    // Legal hold management
    public bool HasActiveLegalHolds { get; set; }
    public List<LegalHold> LegalHolds { get; set; }
}

// Legal hold entity
public class LegalHold
{
    public string LegalHoldId { get; set; }
    public string TenantId { get; set; }
    public string Reason { get; set; }
    public DateTime PlacedAt { get; set; }
    public string PlacedBy { get; set; }
    public DateTime? ReleasedAt { get; set; }
    public string ReleasedBy { get; set; }
    public string CaseNumber { get; set; }
}

Retention Enforcement (Production):

// Production retention enforcement (Azure Function)
[FunctionName("EnforceProductionRetention")]
public async Task RunAsync(
    [TimerTrigger("0 0 2 * * *")] TimerInfo timer,  // Daily at 2 AM
    ILogger log)
{
    log.LogInformation("Enforcing production retention policy...");

    var tenants = await _context.Tenants.ToListAsync();

    foreach (var tenant in tenants)
    {
        var retentionCutoff = DateTime.UtcNow.AddDays(-tenant.RetentionDays);

        // Find events eligible for purge
        var eligibleEvents = await _context.AuditEvents
            .Where(e => e.TenantId == tenant.TenantId)
            .Where(e => e.Timestamp < retentionCutoff)
            .Where(e => !e.LegalHold)  // Never delete events under legal hold
            .Where(e => e.Sealed)  // Only delete sealed (immutable) events
            .ToListAsync();

        if (eligibleEvents.Any())
        {
            log.LogInformation($"Purging {eligibleEvents.Count} events for tenant {tenant.TenantId}");

            // Mark as deleted (soft delete; actual purge happens after 30 days)
            foreach (var evt in eligibleEvents)
            {
                evt.MarkedForDeletion = true;
                evt.MarkedForDeletionAt = DateTime.UtcNow;
            }

            await _context.SaveChangesAsync();

            // Audit the purge operation
            await _auditLogger.LogRetentionPurgeAsync(tenant.TenantId, eligibleEvents.Count, retentionCutoff);
        }
    }

    log.LogInformation("✅ Production retention enforcement complete");
}

Immutability: Fully Enabled (WORM storage, hash chains, tamper-evidence):

// Production immutability implementation
public class ImmutableAuditService
{
    public async Task<AuditSegment> SealSegmentAsync(string segmentId)
    {
        var segment = await _context.AuditSegments
            .Include(s => s.Events)
            .FirstOrDefaultAsync(s => s.SegmentId == segmentId);

        if (segment is null)
        {
            throw new KeyNotFoundException($"Segment '{segmentId}' not found");
        }

        if (segment.Sealed)
        {
            throw new InvalidOperationException("Segment already sealed");
        }

        // 1. Calculate Merkle tree hash
        var merkleRoot = CalculateMerkleTreeHash(segment.Events);

        // 2. Sign hash with HSM-backed key
        var signature = await _cryptoClient.SignDataAsync(
            SignatureAlgorithm.RS256,
            Encoding.UTF8.GetBytes(merkleRoot));

        // 3. Seal segment
        segment.Sealed = true;
        segment.SealedAt = DateTime.UtcNow;
        segment.MerkleRoot = merkleRoot;
        segment.Signature = Convert.ToBase64String(signature.Signature);
        segment.Immutable = true;

        await _context.SaveChangesAsync();

        // 4. Store segment hash in blockchain/DLT (optional; skipped when no anchor is configured)
        if (_blockchainAnchor is not null)
        {
            await _blockchainAnchor.AnchorHashAsync(segmentId, merkleRoot, signature.Signature);
        }

        return segment;
    }
}
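`CalculateMerkleTreeHash` is not shown above. One conventional construction is pairwise SHA-256 over the leaf hashes, duplicating the last node on odd-sized levels; this is an illustrative sketch, and ATP's exact tree layout may differ:

```csharp
using System;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Pairwise SHA-256 Merkle root over serialized events (assumes at least one
// leaf); odd-sized levels duplicate their last node. Any single changed leaf
// changes the root, which is what the HSM signature then seals.
string MerkleRoot(string[] leaves)
{
    var level = leaves.Select(l => SHA256.HashData(Encoding.UTF8.GetBytes(l))).ToArray();
    while (level.Length > 1)
    {
        var next = new byte[(level.Length + 1) / 2][];
        for (int i = 0; i < next.Length; i++)
        {
            var left = level[2 * i];
            var right = 2 * i + 1 < level.Length ? level[2 * i + 1] : left;
            next[i] = SHA256.HashData(left.Concat(right).ToArray());
        }
        level = next;
    }
    return Convert.ToHexString(level[0]);
}

Console.WriteLine(MerkleRoot(new[] { "evt-1", "evt-2", "evt-3" }));
```

Verifiers can recompute the root from the stored events and compare it against the signed value; a mismatch is tamper evidence.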

Backups (Production):

  • Incremental Backups: Daily incremental backups (capture changes since last full).
  • Full Backups: Weekly full database backups.
  • Geo-Replication: All backups replicated to secondary region (West Europe).
  • Retention: 7 years (matches audit retention requirements).
  • Encryption: All backups encrypted with customer-managed keys (CMK).

Backup Strategy (Production):

#!/bin/bash
# production-backup-strategy.sh

DAY_OF_WEEK=$(date +%u)  # 1 = Monday, 7 = Sunday

if [ "$DAY_OF_WEEK" -eq 7 ]; then
  echo "Sunday: Creating weekly full backup..."

  # Full backup (BACPAC export)
  az sql db export \
    --name ATP_Prod \
    --resource-group ATP-Prod-RG \
    --server atp-sql-prod-eus \
    --admin-user $(az keyvault secret show --vault-name atp-keyvault-prod-eus --name SqlAdminUser --query value -o tsv) \
    --admin-password $(az keyvault secret show --vault-name atp-keyvault-prod-eus --name SqlAdminPassword --query value -o tsv) \
    --storage-key $(az storage account keys list --account-name atpstorageprodeus --query "[0].value" -o tsv) \
    --storage-key-type StorageAccessKey \
    --storage-uri "https://atpstorageprodeus.blob.core.windows.net/backups/weekly/prod-full-$(date +%Y%m%d).bacpac"

  # Copy to geo-redundant storage (automatic with GZRS)
  echo "✅ Weekly full backup created"

else
  echo "Weekday: Creating daily incremental backup..."

  # Azure SQL takes automated full/differential/log backups itself; this
  # command only ensures that backup storage is geo-redundant
  az sql db update \
    --name ATP_Prod \
    --resource-group ATP-Prod-RG \
    --server atp-sql-prod-eus \
    --backup-storage-redundancy Geo

  # Point-in-time restore enabled (automatic)
  echo "✅ Daily incremental backup configured"
fi

# Verify backup retention
BACKUP_COUNT=$(az storage blob list \
  --account-name atpstorageprodeus \
  --container-name backups \
  --prefix "weekly/" \
  --query "length([?properties.createdOn > '$(date -d '7 years ago' --iso-8601)'])")

echo "Production backups (7-year retention): $BACKUP_COUNT"

Immutability: Fully Enforced (WORM storage with policy lock):

# Configure WORM (Write-Once-Read-Many) storage
az storage container immutability-policy create \
  --account-name atpstorageprodeus \
  --container-name audit-events \
  --period 2555 \
  --allow-protected-append-writes false

# Lock immutability policy (irreversible)
az storage container immutability-policy lock \
  --account-name atpstorageprodeus \
  --container-name audit-events \
  --if-match <etag>

# Place legal hold (for litigation support)
az storage container legal-hold set \
  --account-name atpstorageprodeus \
  --container-name audit-events \
  --tags case2025001 litigationhold  # legal hold tags must be 3-23 alphanumeric characters

Data Refresh & Maintenance Windows

Dev Environment:

  • Refresh Frequency: On-demand (developers trigger when needed).
  • Downtime: Acceptable (no SLA).
  • Method: Drop database + re-seed from scripts.

Test Environment:

  • Refresh Frequency: Weekly (every Sunday 2 AM).
  • Downtime: 2-hour maintenance window (2 AM - 4 AM).
  • Method: Truncate tables + reload fixtures.

Staging Environment:

  • Refresh Frequency: Monthly (first Saturday of month, 2 AM).
  • Downtime: 4-hour maintenance window (2 AM - 6 AM).
  • Method: Restore from anonymized production snapshot or regenerate synthetic data.

Production Environment:

  • Refresh Frequency: Never (live data only).
  • Maintenance: Continuous (online operations; no downtime for data management).
  • Method: N/A (live ingestion only).
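Refresh jobs can gate themselves on these windows. For example, the Test window above (Sundays 02:00-04:00, assumed to be UTC) reduces to a small predicate:

```csharp
using System;

// True only inside the Test environment's weekly maintenance window:
// Sundays between 02:00 (inclusive) and 04:00 (exclusive), taken as UTC.
bool InTestMaintenanceWindow(DateTime utc)
    => utc.DayOfWeek == DayOfWeek.Sunday && utc.Hour >= 2 && utc.Hour < 4;

Console.WriteLine(InTestMaintenanceWindow(new DateTime(2025, 10, 26, 3, 0, 0)));  // a Sunday at 03:00
Console.WriteLine(InTestMaintenanceWindow(new DateTime(2025, 10, 27, 3, 0, 0)));  // the following Monday
```

The Staging window (first Saturday of the month, 02:00-06:00) would add a day-of-month check; Production has no such gate because its operations are online.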

Data Anonymization & Obfuscation

Purpose: Enable realistic testing in Staging without exposing production PII.

Anonymization Techniques:

  1. Email Addresses: Hash-based mapping to fake emails (consistent across snapshots).
  2. IP Addresses: Replace with private IP ranges (10.0.0.0/8).
  3. User Names: Replace with generated pseudonyms (user-0001, user-0002).
  4. Tenant IDs: Map to staging tenant IDs (tenant-12345 → staging-tenant-001).
  5. Metadata: Redact PII fields; preserve structure and data types.

Anonymization Pipeline:

// Production → Staging anonymization pipeline
public class DataAnonymizationPipeline
{
    public async Task<string> CreateAnonymizedSnapshotAsync()
    {
        // 1. Extract from production (read replica to avoid impact)
        var prodConnectionString = await GetReadReplicaConnectionStringAsync();
        var prodEvents = await ExtractProductionDataAsync(prodConnectionString, days: 180);

        // 2. Anonymize PII fields
        var anonymizedEvents = prodEvents.Select(AnonymizeEvent).ToList();

        // 3. Verify no PII remains
        await VerifyNoPIIAsync(anonymizedEvents);

        // 4. Export to Parquet (efficient columnar format)
        var outputPath = $"anonymized-prod-snapshot-{DateTime.UtcNow:yyyyMMdd}.parquet";
        await ExportToParquetAsync(anonymizedEvents, outputPath);

        // 5. Upload to Staging storage
        await UploadToAzureStorageAsync(outputPath, "atpstoragestgeus", "anonymized-snapshots");

        return outputPath;
    }

    private AuditEvent AnonymizeEvent(AuditEvent prodEvent)
    {
        return new AuditEvent
        {
            EventId = prodEvent.EventId,  // Preserve ID (no PII)
            TenantId = _tenantMapper.MapToStagingTenant(prodEvent.TenantId),
            Timestamp = prodEvent.Timestamp,  // Preserve timestamp

            // Anonymize PII fields
            Actor = AnonymizeEmail(prodEvent.Actor),
            IpAddress = AnonymizeIp(prodEvent.IpAddress),

            // Preserve non-PII fields
            Action = prodEvent.Action,
            Resource = prodEvent.Resource,
            Outcome = prodEvent.Outcome,
            UserAgent = prodEvent.UserAgent,

            // Anonymize metadata (recursive PII removal)
            Metadata = AnonymizeMetadata(prodEvent.Metadata)
        };
    }

    private string AnonymizeEmail(string email)
    {
        // Deterministic hashing (same email → same fake email)
        var hash = SHA256.HashData(Encoding.UTF8.GetBytes(email + _salt));
        var hashInt = BitConverter.ToInt32(hash, 0);
        return $"user-{Math.Abs(hashInt) % 10000:D4}@staging.local";
    }

    private string AnonymizeIp(string ipAddress)
    {
        var hash = SHA256.HashData(Encoding.UTF8.GetBytes(ipAddress + _salt));
        return $"10.0.{hash[0]}.{hash[1]}";
    }

    private Dictionary<string, object> AnonymizeMetadata(Dictionary<string, object> metadata)
    {
        var anonymized = new Dictionary<string, object>();

        foreach (var kvp in metadata)
        {
            // Redact known PII fields
            if (IsPIIField(kvp.Key))
            {
                anonymized[kvp.Key] = "[REDACTED]";
            }
            else
            {
                anonymized[kvp.Key] = kvp.Value;
            }
        }

        return anonymized;
    }
}

PII Verification (Automated):

// Verify no PII in anonymized dataset
public class PIIVerifier
{
    private readonly List<Regex> _piiPatterns = new()
    {
        // Emails (anonymized @staging.local addresses are excluded so they don't trip verification)
        new Regex(@"\b[A-Za-z0-9._%+-]+@(?!staging\.local\b)[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
        new Regex(@"\b\d{3}-\d{2}-\d{4}\b"),  // SSN
        // Public IPs only; private ranges are excluded so anonymized 10.x.x.x addresses pass
        new Regex(@"\b(?!10\.|192\.168\.|172\.(?:1[6-9]|2\d|3[01])\.)(?:\d{1,3}\.){3}\d{1,3}\b"),
        new Regex(@"\b\d{3}-\d{3}-\d{4}\b")  // Phone number
    };

    public async Task<bool> VerifyNoPIIAsync(IEnumerable<AuditEvent> events)
    {
        var piiFound = false;

        foreach (var evt in events.Take(1000))  // Sample 1000 events
        {
            var serialized = JsonSerializer.Serialize(evt);

            foreach (var pattern in _piiPatterns)
            {
                if (pattern.IsMatch(serialized))
                {
                    Console.WriteLine($"⚠️ Potential PII found in event {evt.EventId}");
                    piiFound = true;
                }
            }
        }

        if (piiFound)
        {
            throw new Exception("PII verification failed; anonymization incomplete");
        }

        Console.WriteLine("✅ PII verification passed (no PII detected in sample)");
        return true;
    }
}

Data Seeding & Generation Tools

Data Seeder CLI Tool

// tools/DataSeeder/Program.cs
public class Program
{
    public static async Task<int> Main(string[] args)
    {
        var rootCommand = new RootCommand("ATP Data Seeder - Generate synthetic audit data");

        var environmentOption = new Option<string>("--environment", "Target environment (Development, Test, Staging)");
        var clearOption = new Option<bool>("--clear", "Clear existing data before seeding");
        var tenantsOption = new Option<int>("--tenants", () => 10, "Number of tenants to generate");
        var eventsPerTenantOption = new Option<int>("--events-per-tenant", () => 1000, "Events per tenant");
        var eventsTotalOption = new Option<int>("--events-total", "Total events (overrides events-per-tenant)");
        var startDateOption = new Option<DateTime>("--start-date", () => DateTime.UtcNow.AddDays(-30), "Start date for events");
        var endDateOption = new Option<DateTime>("--end-date", () => DateTime.UtcNow, "End date for events");
        var seedOption = new Option<int>("--seed", () => 42, "Random seed for reproducibility");
        var loadFixturesOption = new Option<string>("--load-fixtures", "Path to fixture file (JSON/Parquet)");

        rootCommand.AddOption(environmentOption);
        rootCommand.AddOption(clearOption);
        rootCommand.AddOption(tenantsOption);
        rootCommand.AddOption(eventsPerTenantOption);
        rootCommand.AddOption(eventsTotalOption);
        rootCommand.AddOption(startDateOption);
        rootCommand.AddOption(endDateOption);
        rootCommand.AddOption(seedOption);
        rootCommand.AddOption(loadFixturesOption);

        rootCommand.SetHandler(async (context) =>
        {
            var environment = context.ParseResult.GetValueForOption(environmentOption);
            var clear = context.ParseResult.GetValueForOption(clearOption);
            var tenants = context.ParseResult.GetValueForOption(tenantsOption);
            var eventsPerTenant = context.ParseResult.GetValueForOption(eventsPerTenantOption);
            var eventsTotal = context.ParseResult.GetValueForOption(eventsTotalOption);
            var startDate = context.ParseResult.GetValueForOption(startDateOption);
            var endDate = context.ParseResult.GetValueForOption(endDateOption);
            var seed = context.ParseResult.GetValueForOption(seedOption);
            var fixturesPath = context.ParseResult.GetValueForOption(loadFixturesOption);

            var seeder = new DataSeeder(environment);

            if (clear)
            {
                await seeder.ClearDataAsync();
            }

            if (!string.IsNullOrEmpty(fixturesPath))
            {
                await seeder.LoadFixturesAsync(fixturesPath);
            }
            else
            {
                var totalEvents = eventsTotal > 0 ? eventsTotal : (tenants * eventsPerTenant);
                await seeder.GenerateSyntheticDataAsync(tenants, totalEvents, startDate, endDate, seed);
            }
        });

        return await rootCommand.InvokeAsync(args);
    }
}

Usage Examples:

# Dev: Generate 10 tenants with 1000 events each
dotnet run --project tools/DataSeeder -- \
  --environment Development \
  --clear \
  --tenants 10 \
  --events-per-tenant 1000 \
  --start-date "2025-09-01" \
  --end-date "2025-10-30"

# Test: Load stable fixtures
dotnet run --project tools/DataSeeder -- \
  --environment Test \
  --clear \
  --load-fixtures tests/fixtures/test-tenants.json \
  --load-fixtures tests/fixtures/test-events-stable.parquet

# Staging: Generate production-scale data
dotnet run --project tools/DataSeeder -- \
  --environment Staging \
  --clear \
  --tenants 50 \
  --events-total 5000000 \
  --start-date "2025-04-01" \
  --end-date "2025-10-30" \
  --seed 2025

Data Migration & Schema Updates

Purpose: Manage database schema changes across environments with zero-downtime migrations and rollback capabilities.

Migration Workflow:

flowchart LR
    A[EF Core Migration Created] --> B[Dev: Apply Migration]
    B --> C{Tests Pass?}
    C -->|No| D[Fix Migration]
    C -->|Yes| E[Test: Apply Migration]
    E --> F{Regression Tests Pass?}
    F -->|No| D
    F -->|Yes| G[Staging: Apply Migration]
    G --> H{Production-Like Tests Pass?}
    H -->|No| D
    H -->|Yes| I[Production: Apply Migration]
    I --> J[Verify Migration Success]

Migration Script (Entity Framework Core):

#!/bin/bash
# apply-migration.sh

ENVIRONMENT=$1
MIGRATION_NAME=$2

echo "Applying migration '$MIGRATION_NAME' to $ENVIRONMENT..."

# 1. Backup database before migration
az sql db copy \
  --name ATP_$ENVIRONMENT \
  --resource-group ATP-$ENVIRONMENT-RG \
  --server atp-sql-$ENVIRONMENT-eus \
  --dest-name ATP_${ENVIRONMENT}_PreMigration_$(date +%Y%m%d) \
  --dest-resource-group ATP-Backups-RG

# 2. Apply migration
dotnet ef database update \
  --project src/ConnectSoft.ATP.Infrastructure \
  --connection "Server=atp-sql-$ENVIRONMENT-eus.database.windows.net;Database=ATP_$ENVIRONMENT;Authentication=Active Directory Default;" \
  --verbose

if [ $? -ne 0 ]; then
  echo "❌ Migration failed; rolling back..."

  # Rollback migration (apply previous migration)
  dotnet ef database update <PreviousMigration> \
    --project src/ConnectSoft.ATP.Infrastructure \
    --connection "Server=atp-sql-$ENVIRONMENT-eus.database.windows.net;Database=ATP_$ENVIRONMENT;Authentication=Active Directory Default;"

  exit 1
fi

# 3. Verify migration applied
APPLIED_MIGRATION=$(sqlcmd -S atp-sql-$ENVIRONMENT-eus.database.windows.net \
  -d ATP_$ENVIRONMENT \
  -Q "SELECT TOP 1 MigrationId FROM __EFMigrationsHistory ORDER BY MigrationId DESC" \
  -h -1)

if [[ "$APPLIED_MIGRATION" == *"$MIGRATION_NAME"* ]]; then
  echo "✅ Migration '$MIGRATION_NAME' applied successfully to $ENVIRONMENT"
else
  echo "❌ Migration verification failed"
  exit 1
fi

Data Compliance & Privacy

GDPR Compliance (Data Subject Rights):

// Right to erasure (GDPR Article 17)
public class GdprDataService
{
    public async Task ErasePersonalDataAsync(string tenantId, string userId)
    {
        // 1. Find all events for user
        var userEvents = await _context.AuditEvents
            .Where(e => e.TenantId == tenantId)
            .Where(e => e.Actor == userId)
            .ToListAsync();

        // 2. Pseudonymize (cannot delete immutable audit events)
        foreach (var evt in userEvents)
        {
            evt.Actor = $"[ERASED-{ComputeHash(userId).Substring(0, 8)}]";
            evt.IpAddress = "0.0.0.0";
            evt.Metadata = new Dictionary<string, object>
            {
                ["erasedAt"] = DateTime.UtcNow,
                ["reason"] = "GDPR Right to Erasure"
            };
        }

        await _context.SaveChangesAsync();

        // 3. Audit the erasure operation
        await _auditLogger.LogDataErasureAsync(tenantId, userId, userEvents.Count);
    }
}

HIPAA Compliance (Minimum Necessary Rule):

  • Dev/Test: No PHI (Protected Health Information); synthetic data only.
  • Staging: Obfuscated data; no real patient information.
  • Production: Real PHI with encryption at rest, encryption in transit, access controls, and audit logging.

Data Seeding Scripts (Comprehensive)

Dev Environment Seeding:

#!/bin/bash
# comprehensive-dev-seed.sh

echo "=== Dev Environment Data Seeding ==="

# 1. Database seeding
echo "Seeding SQL database..."
dotnet run --project tools/DataSeeder -- \
  --environment Development \
  --clear \
  --tenants 10 \
  --events-per-tenant 1000 \
  --start-date "2025-09-01" \
  --end-date "2025-10-30" \
  --seed 42

# 2. Redis cache seeding
echo "Seeding Redis cache..."
REDIS_HOST="atp-redis-dev-eus.redis.cache.windows.net"
REDIS_PASSWORD=$(az keyvault secret show --vault-name atp-keyvault-dev-eus --name RedisPassword --query value -o tsv)

redis-cli -h $REDIS_HOST -p 6380 -a $REDIS_PASSWORD --tls <<EOF
FLUSHDB
SET session:dev-tenant-001:user-1 "{\"userId\":\"user-1\",\"tenantId\":\"dev-tenant-001\",\"expiresAt\":\"2025-10-31T00:00:00Z\"}"
SET session:dev-tenant-002:user-2 "{\"userId\":\"user-2\",\"tenantId\":\"dev-tenant-002\",\"expiresAt\":\"2025-10-31T00:00:00Z\"}"
SET cache:tenant-config:dev-tenant-001 "{\"maxRetention\":30,\"enableImmutability\":false}"
EXPIRE session:dev-tenant-001:user-1 86400
EXPIRE session:dev-tenant-002:user-2 86400
EOF

# 3. Cosmos DB seeding (metadata store)
echo "Seeding Cosmos DB..."
az cosmosdb sql container item upsert \
  --account-name atp-cosmos-dev-eus \
  --database-name ATP \
  --container-name TenantMetadata \
  --partition-key "dev-tenant-001" \
  --body '{
    "id": "dev-tenant-001",
    "tenantId": "dev-tenant-001",
    "settings": {
      "retentionDays": 30,
      "enableImmutability": false
    }
  }'

# 4. Service Bus seeding (seed queues with sample messages)
echo "Seeding Service Bus..."
SB_CONNECTION=$(az keyvault secret show --vault-name atp-keyvault-dev-eus --name ServiceBusConnectionString --query value -o tsv)

# Send sample messages to ingestion queue
for i in {1..10}; do
  az servicebus queue message send \
    --namespace-name atp-servicebus-dev-eus \
    --queue-name ingestion-queue \
    --body "{\"eventId\":\"seed-$i\",\"tenantId\":\"dev-tenant-001\",\"action\":\"Create\"}"
done

echo "✅ Dev environment seeding complete"

Staging Environment Data Refresh (From Anonymized Prod Snapshot):

#!/bin/bash
# refresh-staging-from-prod-snapshot.sh

echo "=== Staging Data Refresh (Anonymized Production Snapshot) ==="

# 1. Generate anonymized snapshot in Production (via Azure Function)
echo "Triggering anonymization pipeline..."
SNAPSHOT_PATH=$(az functionapp function invoke \
  --resource-group ATP-Prod-RG \
  --name atp-functions-prod-eus \
  --function-name CreateAnonymizedSnapshot \
  --query "outputBindings.snapshotPath" -o tsv)

echo "Anonymized snapshot created: $SNAPSHOT_PATH"

# 2. Download snapshot to staging
echo "Downloading anonymized snapshot..."
az storage blob download \
  --account-name atpstorageprodeus \
  --container-name anonymized-snapshots \
  --name "$SNAPSHOT_PATH" \
  --file ./staging-snapshot.parquet \
  --auth-mode login

# 3. Verify no PII in snapshot
echo "Verifying no PII in snapshot..."
dotnet run --project tools/PIIVerifier -- \
  --input ./staging-snapshot.parquet \
  --strict

if [ $? -ne 0 ]; then
  echo "❌ PII verification failed; aborting refresh"
  exit 1
fi

# 4. Clear staging database
echo "Clearing Staging database..."
sqlcmd -S atp-sql-staging-eus.database.windows.net -d ATP_Staging \
  -Q "TRUNCATE TABLE AuditEvents; DELETE FROM Tenants;" \
  -U $(az keyvault secret show --vault-name atp-keyvault-staging-eus --name SqlAdminUser --query value -o tsv) \
  -P $(az keyvault secret show --vault-name atp-keyvault-staging-eus --name SqlAdminPassword --query value -o tsv)

# 5. Import anonymized data
echo "Importing anonymized data to Staging..."
dotnet run --project tools/DataImporter -- \
  --environment Staging \
  --import ./staging-snapshot.parquet \
  --batch-size 10000 \
  --parallel-workers 4

# 6. Verify data counts
STAGING_TENANT_COUNT=$(sqlcmd -S atp-sql-staging-eus.database.windows.net -d ATP_Staging -Q "SELECT COUNT(*) FROM Tenants" -h -1)
STAGING_EVENT_COUNT=$(sqlcmd -S atp-sql-staging-eus.database.windows.net -d ATP_Staging -Q "SELECT COUNT(*) FROM AuditEvents" -h -1)

echo "Staging - Tenants: $STAGING_TENANT_COUNT, Events: $STAGING_EVENT_COUNT"

if [ "$STAGING_TENANT_COUNT" -ge 50 ] && [ "$STAGING_EVENT_COUNT" -ge 5000000 ]; then
  echo "✅ Staging data refresh from production snapshot successful"
else
  echo "❌ Staging data verification failed"
  exit 1
fi

# 7. Cleanup temporary files
rm ./staging-snapshot.parquet

echo "✅ Staging refresh complete"

Data Backup & Restore Procedures

Automated Backup (Production)

# Azure DevOps Pipeline: Production Backups
trigger: none  # Scheduled only

schedules:
- cron: "0 2 * * 0"  # Weekly on Sunday at 2 AM
  displayName: Weekly Full Backup
  branches:
    include:
    - main
  always: true

- cron: "0 3 * * 1-6"  # Daily at 3 AM (Mon-Sat)
  displayName: Daily Incremental Backup
  branches:
    include:
    - main
  always: true

jobs:
- job: BackupProduction
  pool:
    vmImage: 'ubuntu-latest'
  steps:
  - task: AzureCLI@2
    displayName: 'Create Production Backup'
    inputs:
      scriptType: 'bash'
      scriptLocation: 'inlineScript'
      inlineScript: |
        DAY_OF_WEEK=$(date +%u)

        if [ "$DAY_OF_WEEK" -eq 7 ]; then
          # Full backup (Sunday)
          echo "Creating weekly full backup..."

          az sql db export \
            --name ATP_Prod \
            --resource-group ATP-Prod-RG \
            --server atp-sql-prod-eus \
            --admin-user $(SqlAdminUser) \
            --admin-password $(SqlAdminPassword) \
            --storage-key $(StorageAccountKey) \
            --storage-key-type StorageAccessKey \
            --storage-uri "https://atpstorageprodeus.blob.core.windows.net/backups/weekly/prod-full-$(date +%Y%m%d).bacpac"

          # Tag backup with metadata
          az storage blob metadata update \
            --account-name atpstorageprodeus \
            --container-name backups \
            --name "weekly/prod-full-$(date +%Y%m%d).bacpac" \
            --metadata type=full createdAt=$(date -u +%Y-%m-%dT%H:%M:%SZ) retentionYears=7

        else
          # Incremental backup (Mon-Sat) - automatic via Azure SQL
          echo "Daily incremental backup (automatic point-in-time restore)"
        fi

        # Verify backup exists
        LATEST_BACKUP=$(az storage blob list \
          --account-name atpstorageprodeus \
          --container-name backups \
          --prefix "weekly/" \
          --query "sort_by([?properties.createdOn > '$(date -d '7 days ago' --iso-8601)'], &properties.createdOn)[-1].name" \
          -o tsv)

        if [ -n "$LATEST_BACKUP" ]; then
          echo "✅ Latest backup verified: $LATEST_BACKUP"
        else
          echo "❌ Backup verification failed"
          exit 1
        fi

Disaster Recovery Restore (Production)

#!/bin/bash
# disaster-recovery-restore.sh

echo "=== DISASTER RECOVERY: Restore Production from Backup ==="

read -p "Confirm DR restore to Production (yes/no): " CONFIRM
if [ "$CONFIRM" != "yes" ]; then
  echo "DR restore cancelled"
  exit 0
fi

read -p "Enter backup file name (e.g., prod-full-20251030.bacpac): " BACKUP_FILE

echo "Restoring Production from backup: $BACKUP_FILE"

# 1. Create temporary database for verification
echo "Creating temporary restore database..."
az sql db import \
  --resource-group ATP-Prod-RG \
  --server atp-sql-prod-eus \
  --name ATP_Prod_Restore_Temp \
  --admin-user $(az keyvault secret show --vault-name atp-keyvault-prod-eus --name SqlAdminUser --query value -o tsv) \
  --admin-password $(az keyvault secret show --vault-name atp-keyvault-prod-eus --name SqlAdminPassword --query value -o tsv) \
  --storage-key $(az storage account keys list --account-name atpstorageprodeus --query "[0].value" -o tsv) \
  --storage-key-type StorageAccessKey \
  --storage-uri "https://atpstorageprodeus.blob.core.windows.net/backups/weekly/$BACKUP_FILE"

# 2. Verify restored database integrity
echo "Verifying restored database..."
RESTORE_TENANT_COUNT=$(sqlcmd -S atp-sql-prod-eus.database.windows.net -d ATP_Prod_Restore_Temp -Q "SELECT COUNT(*) FROM Tenants" -h -1)
RESTORE_EVENT_COUNT=$(sqlcmd -S atp-sql-prod-eus.database.windows.net -d ATP_Prod_Restore_Temp -Q "SELECT COUNT(*) FROM AuditEvents" -h -1)

echo "Restored - Tenants: $RESTORE_TENANT_COUNT, Events: $RESTORE_EVENT_COUNT"

read -p "Counts look correct? Proceed with swap (yes/no): " PROCEED
if [ "$PROCEED" != "yes" ]; then
  echo "Aborting DR restore"
  az sql db delete --name ATP_Prod_Restore_Temp --resource-group ATP-Prod-RG --server atp-sql-prod-eus --yes
  exit 0
fi

# 3. Rename databases (swap)
echo "Swapping databases..."

# Capture a single timestamp so the rename and the final message agree
BACKUP_NAME="ATP_Prod_Backup_$(date +%Y%m%d%H%M%S)"

# Rename current prod to backup
az sql db rename \
  --resource-group ATP-Prod-RG \
  --server atp-sql-prod-eus \
  --name ATP_Prod \
  --new-name "$BACKUP_NAME"

# Rename restored to prod
az sql db rename \
  --resource-group ATP-Prod-RG \
  --server atp-sql-prod-eus \
  --name ATP_Prod_Restore_Temp \
  --new-name ATP_Prod

echo "✅ DR restore complete; Production database swapped"
echo "⚠️ Old production database preserved as: $BACKUP_NAME"

Data Volume & Growth Projections

Expected Data Growth:

| Environment | Current Volume | Growth Rate | 1-Year Projection | Storage Tier |
|---|---|---|---|---|
| Dev | 10K events | 0 (static) | 10K events | Hot |
| Test | 50K events | 0 (static fixtures) | 50K events | Hot |
| Staging | 5M events | Refreshed monthly | 5M events (static) | Hot |
| Production | 100M events | 500K/day | 280M events | Hot (0-30d), Cool (30-90d), Archive (90d+) |
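The production projection is plain arithmetic over the stated ingest rate:

```csharp
using System;

// Linear growth: current volume plus a steady per-day ingest rate.
long ProjectEvents(long current, long perDay, int days) => current + perDay * (long)days;

// 100M baseline + 500K/day over a year = 282.5M, i.e. the ~280M figure tabled above.
Console.WriteLine(ProjectEvents(100_000_000, 500_000, 365));  // 282500000
```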

Storage Lifecycle Policy (Production):

{
  "rules": [
    {
      "enabled": true,
      "name": "MoveToCool",
      "type": "Lifecycle",
      "definition": {
        "filters": {
          "blobTypes": ["blockBlob"],
          "prefixMatch": ["audit-events/"]
        },
        "actions": {
          "baseBlob": {
            "tierToCool": {
              "daysAfterModificationGreaterThan": 30
            },
            "tierToArchive": {
              "daysAfterModificationGreaterThan": 90
            },
            "delete": {
              "daysAfterModificationGreaterThan": 2555
            }
          }
        }
      }
    }
  ]
}
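Read as a function of blob age, the rules above place each audit blob in exactly one tier. A sketch, with boundaries matching `daysAfterModificationGreaterThan` (a blob transitions only once its age exceeds the threshold):

```csharp
using System;

// Tier implied by the lifecycle policy, by days since last modification:
// Hot through day 30, Cool through day 90, Archive through day 2555 (7 years), then deleted.
string TierForAge(int days) => days switch
{
    <= 30   => "Hot",
    <= 90   => "Cool",
    <= 2555 => "Archive",
    _       => "Deleted"
};

Console.WriteLine(TierForAge(10));    // Hot
Console.WriteLine(TierForAge(45));    // Cool
Console.WriteLine(TierForAge(400));   // Archive
Console.WriteLine(TierForAge(3000));  // Deleted
```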

Summary

  • Data Sources: Synthetic generators (Dev/Test), production-like synthetic or anonymized snapshots (Staging), live tenant data (Production).
  • Retention Policies: 30 days (Dev), 90 days (Test), 180 days (Staging), 7 years (Production) with legal hold overrides.
  • Immutability: Disabled (Dev/Test), enabled (Staging for testing), fully enforced with WORM (Production).
  • Backups: None (Dev), daily (Test), weekly (Staging), daily incremental + weekly full with geo-replication (Production).
  • Data Anonymization: Production → Staging anonymization pipeline with PII verification ensures no real PII in lower environments.
  • Data Seeding Tools: Comprehensive CLI tools, C# generators, and automated refresh scripts for all environments.
  • Compliance: GDPR/HIPAA compliance with data subject rights, minimum necessary rule, and audit evidence.

Infrastructure as Code (IaC) Overlays

ATP uses Infrastructure as Code with Pulumi (C# preferred) and Bicep to provision and manage Azure resources across all environments. The overlay pattern separates base infrastructure definitions (common across all environments) from environment-specific configurations (SKU tiers, scaling, networking), enabling consistent infrastructure with graduated controls.

This approach ensures infrastructure reproducibility, configuration drift detection, and environment parity (Staging mirrors Production) while optimizing costs (Dev/Test use lower SKUs).
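In miniature, the overlay pattern is one base shape plus per-environment overrides; the SKU values below are illustrative only, not ATP's actual matrix:

```csharp
using System;

// One base shape, overridden per environment: cheaper SKUs and looser network
// rules in Dev, production-like settings in Staging, full scale in Prod.
(string Sku, int Instances, bool PublicNetworkAccess) ForEnvironment(string env) => env switch
{
    "dev"     => ("B1",   1, true),
    "test"    => ("S1",   1, false),
    "staging" => ("P1v3", 2, false),   // mirrors prod shape at reduced capacity
    "prod"    => ("P2v3", 3, false),
    _ => throw new ArgumentException($"Unknown environment: {env}")
};

Console.WriteLine(ForEnvironment("staging"));
```

Because the base definition is shared, environment parity reduces to diffing the overlays: Staging differs from Production only in the values it overrides.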

IaC Strategy Overview

Tools:

  • Pulumi (C#): Primary IaC tool; preferred for type-safety, IDE support, and C# team familiarity.
  • Bicep: Alternative for Azure-native declarative templates; used for simple resource provisioning.

Repository Structure:

ConnectSoft.ATP.Infrastructure/
├── Pulumi.yaml                      # Pulumi project metadata
├── Pulumi.dev.yaml                  # Dev stack configuration
├── Pulumi.test.yaml                 # Test stack configuration
├── Pulumi.staging.yaml              # Staging stack configuration
├── Pulumi.prod.yaml                 # Production stack configuration
├── Program.cs                       # Base infrastructure (Pulumi C#)
├── Overlays/
│   ├── DevOverlay.cs                # Dev-specific overrides
│   ├── TestOverlay.cs               # Test-specific overrides
│   ├── StagingOverlay.cs            # Staging-specific overrides
│   └── ProductionOverlay.cs         # Production-specific overrides
├── Resources/
│   ├── AppServiceResources.cs       # App Service definitions
│   ├── DatabaseResources.cs         # SQL, Cosmos DB
│   ├── StorageResources.cs          # Blob, Service Bus, Redis
│   ├── NetworkingResources.cs       # VNet, NSG, Private Endpoints
│   └── ObservabilityResources.cs   # App Insights, Log Analytics
└── bicep/                           # Bicep templates (alternative)
    ├── main.bicep                   # Base template
    └── overlays/
        ├── dev.bicepparam           # Dev parameters
        ├── test.bicepparam          # Test parameters
        ├── staging.bicepparam       # Staging parameters
        └── prod.bicepparam          # Production parameters

Base Infrastructure (Pulumi C#)

Base infrastructure defines the common resources required for all ATP environments, with parameterized values that are overridden by environment-specific overlays.

// Program.cs - Base infrastructure
using Pulumi;
using Pulumi.AzureNative.Resources;
using Pulumi.AzureNative.Web;
using Pulumi.AzureNative.Sql;
using Pulumi.AzureNative.Cache;
using Pulumi.AzureNative.ServiceBus;
using Pulumi.AzureNative.Storage;

class ATPInfrastructure : Stack
{
    public ATPInfrastructure()
    {
        // Read configuration
        var config = new Config();
        var environment = config.Require("environment");  // dev, test, staging, prod
        var region = config.Get("region") ?? "eastus";

        // Load environment-specific overlay
        var overlay = LoadOverlay(environment);

        // Common tags
        var tags = new InputMap<string>
        {
            ["Environment"] = environment,
            ["Project"] = "ATP",
            ["ManagedBy"] = "pulumi",
            ["Owner"] = "platform-team@connectsoft.example",
            ["CostCenter"] = "ATP-Platform",
            ["Compliance"] = "gdpr,hipaa,soc2"
        };

        // Resource Group
        var resourceGroup = new ResourceGroup($"atp-{environment}-{region}-rg", new ResourceGroupArgs
        {
            ResourceGroupName = $"ConnectSoft-ATP-{environment.ToUpper()}-{region.ToUpper()}-RG",
            Location = region,
            Tags = tags
        });

        // App Service Plan
        var appServicePlan = new AppServicePlan($"atp-plan-{environment}-{region}", new AppServicePlanArgs
        {
            Name = $"atp-plan-{environment}-{region}",
            ResourceGroupName = resourceGroup.Name,
            Location = region,
            Sku = new SkuDescriptionArgs
            {
                Name = overlay.AppServiceSku,
                Tier = overlay.AppServiceTier,
                Capacity = overlay.AppServiceInstances
            },
            Kind = "linux",
            Reserved = true,  // Linux
            Tags = tags
        });

        // Azure SQL Server
        var sqlServer = new Server($"atp-sql-{environment}-{region}", new ServerArgs
        {
            ServerName = $"atp-sql-{environment}-{region}",
            ResourceGroupName = resourceGroup.Name,
            Location = region,
            AdministratorLogin = config.RequireSecret("sqlAdminUser"),
            AdministratorLoginPassword = config.RequireSecret("sqlAdminPassword"),
            Version = "12.0",
            PublicNetworkAccess = overlay.EnablePublicNetworkAccess ? "Enabled" : "Disabled",
            Tags = tags
        });

        // Azure SQL Database
        var database = new Database($"atp-db-{environment}-{region}", new DatabaseArgs
        {
            DatabaseName = $"ATP_{environment}",
            ResourceGroupName = resourceGroup.Name,
            ServerName = sqlServer.Name,
            Location = region,
            Sku = new Pulumi.AzureNative.Sql.Inputs.SkuArgs
            {
                Name = overlay.SqlSku,
                Tier = overlay.SqlTier,
                Capacity = overlay.SqlCapacity
            },
            MaxSizeBytes = overlay.SqlMaxSizeBytes,
            ZoneRedundant = overlay.EnableZoneRedundancy,
            Tags = tags
        });

        // Redis Cache
        var redis = new Redis($"atp-redis-{environment}-{region}", new RedisArgs
        {
            Name = $"atp-redis-{environment}-{region}",
            ResourceGroupName = resourceGroup.Name,
            Location = region,
            Sku = new Pulumi.AzureNative.Cache.Inputs.SkuArgs
            {
                Name = overlay.RedisSku,
                Family = overlay.RedisFamily,
                Capacity = overlay.RedisCapacity
            },
            EnableNonSslPort = false,
            MinimumTlsVersion = "1.2",
            Tags = tags
        });

        // Service Bus Namespace
        var serviceBus = new Namespace($"atp-servicebus-{environment}-{region}", new NamespaceArgs
        {
            NamespaceName = $"atp-servicebus-{environment}-{region}",
            ResourceGroupName = resourceGroup.Name,
            Location = region,
            Sku = new Pulumi.AzureNative.ServiceBus.Inputs.SBSkuArgs
            {
                Name = overlay.ServiceBusSku,
                Tier = overlay.ServiceBusTier
            },
            Tags = tags
        });

        // Storage Account
        var storageAccount = new StorageAccount($"atp-storage-{environment}-{region}", new StorageAccountArgs
        {
            AccountName = $"atpstorage{environment}{region}".Replace("-", "").ToLower(),
            ResourceGroupName = resourceGroup.Name,
            Location = region,
            Sku = new Pulumi.AzureNative.Storage.Inputs.SkuArgs
            {
                Name = overlay.StorageReplication
            },
            Kind = "StorageV2",
            EnableHttpsTrafficOnly = true,
            MinimumTlsVersion = "TLS1_2",
            AllowBlobPublicAccess = false,
            Tags = tags
        });

        // Application Insights
        var appInsights = new Component($"atp-appinsights-{environment}-{region}", new ComponentArgs
        {
            ResourceName = $"atp-appinsights-{environment}-{region}",
            ResourceGroupName = resourceGroup.Name,
            Location = region,
            Kind = "web",
            ApplicationType = "web",
            RetentionInDays = overlay.LogRetentionDays,
            SamplingPercentage = overlay.TelemetrySamplingPercentage,
            Tags = tags
        });

        // Export outputs
        this.ResourceGroupName = resourceGroup.Name;
        this.AppServicePlanId = appServicePlan.Id;
        this.SqlServerName = sqlServer.Name;
        this.AppInsightsInstrumentationKey = appInsights.InstrumentationKey;
    }

    private EnvironmentOverlay LoadOverlay(string environment)
    {
        return environment.ToLower() switch
        {
            "dev" => new DevOverlay(),
            "test" => new TestOverlay(),
            "staging" => new StagingOverlay(),
            "prod" => new ProductionOverlay(),
            _ => throw new ArgumentException($"Unknown environment: {environment}")
        };
    }

    [Output] public Output<string> ResourceGroupName { get; set; }
    [Output] public Output<string> AppServicePlanId { get; set; }
    [Output] public Output<string> SqlServerName { get; set; }
    [Output] public Output<string> AppInsightsInstrumentationKey { get; set; }
}
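All resources above follow a consistent `atp-{service}-{environment}-{region}` naming convention; the storage account is the one exception, since Azure storage account names disallow hyphens and cap out at 24 lowercase characters. A small Python sketch of that convention (the helper names are hypothetical, for illustration only):

```python
def resource_name(service: str, environment: str, region: str) -> str:
    """Standard ATP naming: atp-{service}-{environment}-{region}."""
    return f"atp-{service}-{environment}-{region}"

def storage_account_name(environment: str, region: str) -> str:
    """Storage accounts disallow hyphens and cap at 24 lowercase characters."""
    name = f"atpstorage{environment}{region}".replace("-", "").lower()
    return name[:24]

print(resource_name("sql", "dev", "eastus"))      # → atp-sql-dev-eastus
print(storage_account_name("staging", "eastus"))  # → atpstoragestagingeastus
```

Keeping the convention in one place avoids the drift that creeps in when each resource definition builds its own name string.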

Environment Overlay Base Class:

// EnvironmentOverlay.cs
public abstract class EnvironmentOverlay
{
    // App Service configuration
    public abstract string AppServiceSku { get; }
    public abstract string AppServiceTier { get; }
    public abstract int AppServiceInstances { get; }
    public abstract bool EnableAutoscale { get; }

    // SQL Database configuration
    public abstract string SqlSku { get; }
    public abstract string SqlTier { get; }
    public abstract int SqlCapacity { get; }
    public abstract long SqlMaxSizeBytes { get; }
    public abstract bool EnableZoneRedundancy { get; }
    public abstract bool EnableGeoReplication { get; }

    // Redis configuration
    public abstract string RedisSku { get; }
    public abstract string RedisFamily { get; }
    public abstract int RedisCapacity { get; }
    public abstract bool EnableRedisPersistence { get; }

    // Service Bus configuration
    public abstract string ServiceBusSku { get; }
    public abstract string ServiceBusTier { get; }

    // Storage configuration
    public abstract string StorageReplication { get; }
    public abstract bool EnableImmutability { get; }

    // Networking configuration
    public abstract bool EnablePublicNetworkAccess { get; }
    public abstract bool EnablePrivateEndpoints { get; }
    public abstract string VNetAddressSpace { get; }

    // Observability configuration
    public abstract int LogRetentionDays { get; }
    public abstract double TelemetrySamplingPercentage { get; }

    // Cost management
    public abstract int MonthlyBudget { get; }
    public abstract bool EnableAutoShutdown { get; }
}

Dev Overlay

Purpose: Cost-optimized infrastructure with minimal SKUs and public access for developer convenience.

// Overlays/DevOverlay.cs
public class DevOverlay : EnvironmentOverlay
{
    // App Service: Basic tier, single instance
    public override string AppServiceSku => "B1";
    public override string AppServiceTier => "Basic";
    public override int AppServiceInstances => 1;
    public override bool EnableAutoscale => false;

    // SQL Database: Basic tier, 5 DTU
    public override string SqlSku => "Basic";
    public override string SqlTier => "Basic";
    public override int SqlCapacity => 5;
    public override long SqlMaxSizeBytes => 2L * 1024 * 1024 * 1024;  // 2 GB
    public override bool EnableZoneRedundancy => false;
    public override bool EnableGeoReplication => false;

    // Redis: Basic C0 (250 MB)
    public override string RedisSku => "Basic";
    public override string RedisFamily => "C";
    public override int RedisCapacity => 0;
    public override bool EnableRedisPersistence => false;

    // Service Bus: Basic tier
    public override string ServiceBusSku => "Basic";
    public override string ServiceBusTier => "Basic";

    // Storage: LRS (Locally Redundant)
    public override string StorageReplication => "Standard_LRS";
    public override bool EnableImmutability => false;

    // Networking: Public access enabled
    public override bool EnablePublicNetworkAccess => true;
    public override bool EnablePrivateEndpoints => false;
    public override string VNetAddressSpace => "10.0.0.0/16";  // Shared VNet

    // Observability: Short retention, verbose sampling
    public override int LogRetentionDays => 7;
    public override double TelemetrySamplingPercentage => 100.0;  // 100% sampling

    // Cost: $500/month budget
    public override int MonthlyBudget => 500;
    public override bool EnableAutoShutdown => true;  // Shutdown weeknights/weekends
}

Dev Stack Configuration (Pulumi.dev.yaml):

config:
  azure-native:location: eastus
  atp-infrastructure:environment: dev
  atp-infrastructure:region: eastus  # Must be a valid Azure location (used for Location and resource names)
  atp-infrastructure:sqlAdminUser:
    secure: AQAAANCMnd8BFdERjHoAwE/Cl+sBAAAA...  # Encrypted
  atp-infrastructure:sqlAdminPassword:
    secure: AQAAANCMnd8BFdERjHoAwE/Cl+sBAAAA...  # Encrypted
  atp-infrastructure:enableAutoShutdown: true
  atp-infrastructure:monthlyBudget: 500

Dev Deployment (Pulumi CLI):

#!/bin/bash
# deploy-dev-infrastructure.sh
set -euo pipefail

echo "Deploying Dev infrastructure..."

# Select Dev stack
pulumi stack select connectsoft/atp/dev --create

# Set configuration
pulumi config set azure-native:location eastus
pulumi config set atp-infrastructure:environment dev
pulumi config set atp-infrastructure:region eastus

# Set secrets (from Key Vault)
pulumi config set --secret atp-infrastructure:sqlAdminUser devadmin
pulumi config set --secret atp-infrastructure:sqlAdminPassword $(az keyvault secret show --vault-name atp-keyvault-shared-eus --name DevSqlPassword --query value -o tsv)

# Deploy (preview first)
pulumi preview

# Deploy (auto-approved; drop --yes to require interactive confirmation)
pulumi up --yes

echo "✅ Dev infrastructure deployed"

Test Overlay

Purpose: QA-grade infrastructure with moderate SKUs and IP-restricted access for automated testing.

// Overlays/TestOverlay.cs
public class TestOverlay : EnvironmentOverlay
{
    // App Service: Standard tier, 2 instances (blue-green testing)
    public override string AppServiceSku => "S1";
    public override string AppServiceTier => "Standard";
    public override int AppServiceInstances => 2;
    public override bool EnableAutoscale => false;

    // SQL Database: Standard S1, 20 DTU
    public override string SqlSku => "S1";
    public override string SqlTier => "Standard";
    public override int SqlCapacity => 20;
    public override long SqlMaxSizeBytes => 10L * 1024 * 1024 * 1024;  // 10 GB
    public override bool EnableZoneRedundancy => false;
    public override bool EnableGeoReplication => false;

    // Redis: Standard C1 (1 GB)
    public override string RedisSku => "Standard";
    public override string RedisFamily => "C";
    public override int RedisCapacity => 1;
    public override bool EnableRedisPersistence => true;  // RDB snapshots

    // Service Bus: Standard tier
    public override string ServiceBusSku => "Standard";
    public override string ServiceBusTier => "Standard";

    // Storage: GRS (Geo-Redundant for DR testing)
    public override string StorageReplication => "Standard_GRS";
    public override bool EnableImmutability => false;

    // Networking: IP-restricted public access
    public override bool EnablePublicNetworkAccess => true;
    public override bool EnablePrivateEndpoints => false;
    public override string VNetAddressSpace => "10.0.0.0/16";  // Shared VNet

    // Observability: Moderate retention, 50% sampling
    public override int LogRetentionDays => 14;
    public override double TelemetrySamplingPercentage => 50.0;

    // Cost: $1,000/month budget
    public override int MonthlyBudget => 1000;
    public override bool EnableAutoShutdown => true;  // Weekends only
}

Staging Overlay

Purpose: Production-equivalent infrastructure with premium SKUs, private endpoints, and geo-replication for pre-production validation.

// Overlays/StagingOverlay.cs
public class StagingOverlay : EnvironmentOverlay
{
    // App Service: Premium tier, autoscale 2-5 instances
    public override string AppServiceSku => "P1v2";
    public override string AppServiceTier => "PremiumV2";
    public override int AppServiceInstances => 2;
    public override bool EnableAutoscale => true;

    // SQL Database: Premium P2, 250 DTU
    public override string SqlSku => "P2";
    public override string SqlTier => "Premium";
    public override int SqlCapacity => 250;
    public override long SqlMaxSizeBytes => 100L * 1024 * 1024 * 1024;  // 100 GB
    public override bool EnableZoneRedundancy => false;
    public override bool EnableGeoReplication => true;  // Test failover

    // Redis: Premium P1 (6 GB, clustering)
    public override string RedisSku => "Premium";
    public override string RedisFamily => "P";
    public override int RedisCapacity => 1;
    public override bool EnableRedisPersistence => true;  // AOF

    // Service Bus: Premium tier
    public override string ServiceBusSku => "Premium";
    public override string ServiceBusTier => "Premium";

    // Storage: GZRS (Geo-Zone-Redundant)
    public override string StorageReplication => "Standard_GZRS";
    public override bool EnableImmutability => true;  // Test WORM

    // Networking: Private endpoints enabled
    public override bool EnablePublicNetworkAccess => false;
    public override bool EnablePrivateEndpoints => true;
    public override string VNetAddressSpace => "10.1.0.0/16";  // Dedicated VNet

    // Observability: Production-like retention, 25% sampling
    public override int LogRetentionDays => 30;
    public override double TelemetrySamplingPercentage => 25.0;

    // Cost: $3,000/month budget
    public override int MonthlyBudget => 3000;
    public override bool EnableAutoShutdown => false;  // Always-on
}

Staging Private Endpoints (Pulumi C#):

// Add private endpoints for Staging
if (overlay.EnablePrivateEndpoints)
{
    var privateEndpointSubnet = new Subnet($"atp-pe-subnet-{environment}-{region}", new SubnetArgs
    {
        SubnetName = "PrivateEndpoints-Subnet",
        ResourceGroupName = resourceGroup.Name,
        VirtualNetworkName = vnet.Name,
        AddressPrefix = "10.1.3.0/24",
        PrivateEndpointNetworkPolicies = "Disabled"
    });

    // SQL Private Endpoint
    var sqlPrivateEndpoint = new PrivateEndpoint($"atp-sql-pe-{environment}-{region}", new PrivateEndpointArgs
    {
        PrivateEndpointName = $"atp-sql-pe-{environment}-{region}",
        ResourceGroupName = resourceGroup.Name,
        Location = region,
        Subnet = new Pulumi.AzureNative.Network.Inputs.SubnetArgs
        {
            Id = privateEndpointSubnet.Id
        },
        PrivateLinkServiceConnections = new[]
        {
            new Pulumi.AzureNative.Network.Inputs.PrivateLinkServiceConnectionArgs
            {
                Name = "sql-connection",
                PrivateLinkServiceId = sqlServer.Id,
                GroupIds = new[] { "sqlServer" }
            }
        },
        Tags = tags
    });

    // Storage Private Endpoint
    var storagePrivateEndpoint = new PrivateEndpoint($"atp-storage-pe-{environment}-{region}", new PrivateEndpointArgs
    {
        PrivateEndpointName = $"atp-storage-pe-{environment}-{region}",
        ResourceGroupName = resourceGroup.Name,
        Location = region,
        Subnet = new Pulumi.AzureNative.Network.Inputs.SubnetArgs
        {
            Id = privateEndpointSubnet.Id
        },
        PrivateLinkServiceConnections = new[]
        {
            new Pulumi.AzureNative.Network.Inputs.PrivateLinkServiceConnectionArgs
            {
                Name = "storage-connection",
                PrivateLinkServiceId = storageAccount.Id,
                GroupIds = new[] { "blob" }
            }
        },
        Tags = tags
    });
}

Production Overlay

Purpose: Enterprise-grade infrastructure with AKS, zone redundancy, multi-region, and maximum security.

// Overlays/ProductionOverlay.cs
public class ProductionOverlay : EnvironmentOverlay
{
    // App Service: Premium tier (or AKS for containerized)
    public override string AppServiceSku => "P3v3";
    public override string AppServiceTier => "PremiumV3";
    public override int AppServiceInstances => 3;
    public override bool EnableAutoscale => true;  // 3-10 instances

    // SQL Database: Premium P6, 1000 DTU
    public override string SqlSku => "P6";
    public override string SqlTier => "Premium";
    public override int SqlCapacity => 1000;
    public override long SqlMaxSizeBytes => 500L * 1024 * 1024 * 1024;  // 500 GB
    public override bool EnableZoneRedundancy => true;
    public override bool EnableGeoReplication => true;  // Multi-region

    // Redis: Premium P3 (26 GB, clustering)
    public override string RedisSku => "Premium";
    public override string RedisFamily => "P";
    public override int RedisCapacity => 3;
    public override bool EnableRedisPersistence => true;  // AOF + RDB

    // Service Bus: Premium tier
    public override string ServiceBusSku => "Premium";
    public override string ServiceBusTier => "Premium";

    // Storage: GZRS with WORM
    public override string StorageReplication => "Standard_GZRS";
    public override bool EnableImmutability => true;

    // Networking: Private endpoints only
    public override bool EnablePublicNetworkAccess => false;
    public override bool EnablePrivateEndpoints => true;
    public override string VNetAddressSpace => "10.2.0.0/16";  // Dedicated VNet

    // Observability: Long retention, 10% sampling
    public override int LogRetentionDays => 90;
    public override double TelemetrySamplingPercentage => 10.0;

    // Cost: $10,000/month budget
    public override int MonthlyBudget => 10000;
    public override bool EnableAutoShutdown => false;  // Always-on
}
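Laid side by side, the four overlays form a monotonic escalation: budget, log retention, and database capacity all increase toward production, while telemetry sampling decreases. The key values, transcribed from the overlay classes above and sanity-checked:

```python
# Values transcribed from the Dev/Test/Staging/Production overlay classes.
overlays = {
    "dev":     {"budget": 500,   "retention_days": 7,  "sampling_pct": 100.0, "sql_max_gb": 2},
    "test":    {"budget": 1000,  "retention_days": 14, "sampling_pct": 50.0,  "sql_max_gb": 10},
    "staging": {"budget": 3000,  "retention_days": 30, "sampling_pct": 25.0,  "sql_max_gb": 100},
    "prod":    {"budget": 10000, "retention_days": 90, "sampling_pct": 10.0,  "sql_max_gb": 500},
}

order = ["dev", "test", "staging", "prod"]
for lower, higher in zip(order, order[1:]):
    a, b = overlays[lower], overlays[higher]
    assert a["budget"] < b["budget"], "budgets must escalate toward prod"
    assert a["retention_days"] < b["retention_days"], "retention must escalate"
    assert a["sampling_pct"] > b["sampling_pct"], "sampling must decrease"
    assert a["sql_max_gb"] < b["sql_max_gb"], "database size must escalate"

# SqlMaxSizeBytes in the overlays is GiB expressed in bytes:
assert 2 * 1024**3 == 2_147_483_648  # DevOverlay.SqlMaxSizeBytes
```

Encoding the escalation as an assertion makes an accidental inversion (say, staging retention exceeding production) fail loudly in review.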

Production AKS Cluster (Pulumi C#):

// Production uses AKS instead of App Service
if (environment == "prod")
{
    var aksCluster = new ManagedCluster($"atp-aks-{environment}-{region}", new ManagedClusterArgs
    {
        ResourceName = $"atp-aks-{environment}-{region}",
        ResourceGroupName = resourceGroup.Name,
        Location = region,

        // Identity
        Identity = new ManagedClusterIdentityArgs
        {
            Type = ResourceIdentityType.SystemAssigned
        },

        // Kubernetes version
        KubernetesVersion = "1.28.3",

        // DNS prefix
        DnsPrefix = $"atp-{environment}",

        // Node pools
        AgentPoolProfiles = new[]
        {
            new ManagedClusterAgentPoolProfileArgs
            {
                Name = "system",
                Count = 3,
                VmSize = "Standard_D4s_v5",  // 4 vCPU, 16 GB RAM
                Mode = "System",
                OsType = "Linux",
                OsDiskSizeGB = 128,
                VnetSubnetID = aksSubnet.Id,
                EnableAutoScaling = true,
                MinCount = 3,
                MaxCount = 10,
                AvailabilityZones = new[] { "1", "2", "3" }  // Zone-redundant
            },
            new ManagedClusterAgentPoolProfileArgs
            {
                Name = "user",
                Count = 6,
                VmSize = "Standard_D4s_v5",
                Mode = "User",
                OsType = "Linux",
                OsDiskSizeGB = 128,
                VnetSubnetID = aksSubnet.Id,
                EnableAutoScaling = true,
                MinCount = 6,
                MaxCount = 20,
                AvailabilityZones = new[] { "1", "2", "3" }
            }
        },

        // Networking
        NetworkProfile = new ContainerServiceNetworkProfileArgs
        {
            NetworkPlugin = "azure",
            NetworkPolicy = "azure",
            LoadBalancerSku = "Standard",
            ServiceCidr = "10.2.10.0/24",
            DnsServiceIP = "10.2.10.10"
        },

        // Add-ons
        AddonProfiles = new InputMap<ManagedClusterAddonProfileArgs>
        {
            ["azureKeyvaultSecretsProvider"] = new ManagedClusterAddonProfileArgs
            {
                Enabled = true,
                Config = new InputMap<string>
                {
                    ["enableSecretRotation"] = "true",
                    ["rotationPollInterval"] = "2m"
                }
            },
            ["omsagent"] = new ManagedClusterAddonProfileArgs  // Container Insights
            {
                Enabled = true,
                Config = new InputMap<string>
                {
                    ["logAnalyticsWorkspaceResourceID"] = logAnalyticsWorkspace.Id.Apply(id => id)
                }
            }
        },

        // Security
        AadProfile = new ManagedClusterAADProfileArgs
        {
            Managed = true,
            EnableAzureRBAC = true
        },

        Tags = tags
    });

    this.AksClusterName = aksCluster.Name;
}
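With both pools autoscaling on Standard_D4s_v5 nodes (4 vCPU, 16 GB RAM each), the cluster's compute envelope is easy to bound. The arithmetic, using the pool profiles defined above:

```python
# Standard_D4s_v5: 4 vCPU, 16 GB RAM per node.
VCPU_PER_NODE, GB_PER_NODE = 4, 16

# Autoscale bounds from the two AgentPoolProfiles above.
pools = {
    "system": {"min": 3, "max": 10},
    "user":   {"min": 6, "max": 20},
}

min_nodes = sum(p["min"] for p in pools.values())  # 9
max_nodes = sum(p["max"] for p in pools.values())  # 30

print(f"vCPU range: {min_nodes * VCPU_PER_NODE}-{max_nodes * VCPU_PER_NODE}")     # 36-120
print(f"Memory range: {min_nodes * GB_PER_NODE}-{max_nodes * GB_PER_NODE} GB")    # 144-480 GB
```

These bounds feed directly into the $10,000/month budget check: the cluster can never scale past 30 nodes regardless of load.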

Bicep Alternative (Base Template)

Purpose: Azure-native declarative IaC for teams preferring ARM/Bicep over Pulumi.

// main.bicep - Base infrastructure template
@description('Environment name (dev, test, staging, prod)')
param environment string

@description('Azure region')
param location string = resourceGroup().location

@description('App Service SKU')
param appServiceSku string

@description('SQL Database SKU')
param sqlSku string

@description('Redis SKU')
param redisSku string

@description('Storage replication')
param storageReplication string

@description('Enable private endpoints')
param enablePrivateEndpoints bool = false

// Variables
var resourcePrefix = 'atp'
var regionAbbr = location == 'eastus' ? 'eus' : (location == 'westeurope' ? 'weu' : 'apse')

var commonTags = {
  Environment: environment
  Project: 'ATP'
  ManagedBy: 'bicep'
  Owner: 'platform-team@connectsoft.example'
}

// App Service Plan
resource appServicePlan 'Microsoft.Web/serverfarms@2022-09-01' = {
  name: '${resourcePrefix}-plan-${environment}-${regionAbbr}'
  location: location
  kind: 'linux'
  sku: {
    name: appServiceSku
  }
  properties: {
    reserved: true
  }
  tags: commonTags
}

// SQL administrator password (supply at deploy time, e.g. via a Key Vault reference)
@secure()
param sqlAdminPassword string

// SQL Server
resource sqlServer 'Microsoft.Sql/servers@2023-05-01-preview' = {
  name: '${resourcePrefix}-sql-${environment}-${regionAbbr}'
  location: location
  properties: {
    administratorLogin: 'sqladmin'
    administratorLoginPassword: sqlAdminPassword
    version: '12.0'
    publicNetworkAccess: enablePrivateEndpoints ? 'Disabled' : 'Enabled'
  }
  tags: commonTags
}

// SQL Database
resource database 'Microsoft.Sql/servers/databases@2023-05-01-preview' = {
  parent: sqlServer
  name: 'ATP_${environment}'
  location: location
  sku: {
    name: sqlSku
  }
  properties: {
    collation: 'SQL_Latin1_General_CP1_CI_AS'
    maxSizeBytes: 2147483648 // 2 GB
  }
  tags: commonTags
}

// Redis Cache
resource redis 'Microsoft.Cache/redis@2023-08-01' = {
  name: '${resourcePrefix}-redis-${environment}-${regionAbbr}'
  location: location
  properties: {
    sku: {
      name: redisSku
      family: redisSku == 'Premium' ? 'P' : 'C'
      capacity: redisSku == 'Basic' ? 0 : 1
    }
    enableNonSslPort: false
    minimumTlsVersion: '1.2'
  }
  tags: commonTags
}

// Storage Account
resource storageAccount 'Microsoft.Storage/storageAccounts@2023-01-01' = {
  name: '${resourcePrefix}storage${environment}${regionAbbr}'
  location: location
  sku: {
    name: storageReplication
  }
  kind: 'StorageV2'
  properties: {
    supportsHttpsTrafficOnly: true
    minimumTlsVersion: 'TLS1_2'
    allowBlobPublicAccess: false
  }
  tags: commonTags
}

// Application Insights
resource appInsights 'Microsoft.Insights/components@2020-02-02' = {
  name: '${resourcePrefix}-appinsights-${environment}-${regionAbbr}'
  location: location
  kind: 'web'
  properties: {
    Application_Type: 'web'
    RetentionInDays: environment == 'prod' ? 90 : 30
    SamplingPercentage: environment == 'prod' ? 10 : 100
  }
  tags: commonTags
}

// Outputs
output resourceGroupName string = resourceGroup().name
output appServicePlanId string = appServicePlan.id
output sqlServerName string = sqlServer.name
output appInsightsInstrumentationKey string = appInsights.properties.InstrumentationKey
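The `regionAbbr` ternary in the template maps full Azure region names to the short codes used in resource names. The same mapping as a lookup table (a sketch mirroring the Bicep expression above; note the expression's catch-all fallback is `apse` for any region not listed):

```python
# Mirror of the Bicep regionAbbr ternary expression.
REGION_ABBREVIATIONS = {
    "eastus": "eus",
    "westeurope": "weu",
}

def region_abbr(location: str) -> str:
    """Unknown regions fall through to 'apse', matching the Bicep ternary."""
    return REGION_ABBREVIATIONS.get(location, "apse")

print(region_abbr("eastus"))         # → eus
print(region_abbr("southeastasia"))  # → apse
```

A lookup table is easier to extend than nested ternaries as new regions come online; the Bicep equivalent would be a `var` object indexed with `regionMap[location]`.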

Bicep Parameter Files (Per Environment):

// overlays/dev.bicepparam
using './main.bicep'

param environment = 'dev'
param location = 'eastus'
param appServiceSku = 'B1'
param sqlSku = 'Basic'
param redisSku = 'Basic'
param storageReplication = 'Standard_LRS'
param enablePrivateEndpoints = false

// overlays/prod.bicepparam
using './main.bicep'

param environment = 'prod'
param location = 'eastus'
param appServiceSku = 'P3v3'
param sqlSku = 'P6'
param redisSku = 'Premium'
param storageReplication = 'Standard_GZRS'
param enablePrivateEndpoints = true

Bicep Deployment (Azure CLI):

#!/bin/bash
# deploy-bicep-infrastructure.sh

set -euo pipefail

ENVIRONMENT=${1:?Usage: $0 <environment> [region]}
REGION=${2:-eastus}

echo "Deploying $ENVIRONMENT infrastructure using Bicep..."

# Create resource group
az group create \
  --name "ConnectSoft-ATP-${ENVIRONMENT^^}-${REGION^^}-RG" \
  --location $REGION

# Deploy with environment-specific parameters
az deployment group create \
  --resource-group "ConnectSoft-ATP-${ENVIRONMENT^^}-${REGION^^}-RG" \
  --template-file bicep/main.bicep \
  --parameters bicep/overlays/$ENVIRONMENT.bicepparam \
  --parameters location=$REGION

echo "✅ $ENVIRONMENT infrastructure deployed via Bicep"

IaC Deployment Workflow (Azure Pipelines)

Purpose: Automate infrastructure provisioning via CI/CD pipelines with validation, preview, and approval gates.

# infrastructure-pipeline.yaml
name: Infrastructure-$(environment)-$(Date:yyyyMMdd)$(Rev:.r)

parameters:
- name: environment
  displayName: 'Target Environment'
  type: string
  values:
  - dev
  - test
  - staging
  - prod

- name: action
  displayName: 'Deployment Action'
  type: string
  default: 'preview'
  values:
  - preview
  - deploy
  - destroy

trigger: none  # Manual trigger only

pool:
  vmImage: 'ubuntu-latest'

stages:
- stage: Validate_IaC
  displayName: 'Validate Infrastructure Code'
  jobs:
  - job: Validate
    steps:
    - task: UseDotNet@2
      inputs:
        version: '8.x'

    - script: |
        # Install Pulumi
        curl -fsSL https://get.pulumi.com | sh
        export PATH=$PATH:$HOME/.pulumi/bin

        # Select stack
        pulumi stack select connectsoft/atp/${{ parameters.environment }}

        # Validate configuration
        pulumi config

        # Run preview
        pulumi preview --non-interactive
      displayName: 'Pulumi Preview'
      env:
        PULUMI_ACCESS_TOKEN: $(PulumiAccessToken)
        AZURE_CLIENT_ID: $(AzureClientId)
        AZURE_CLIENT_SECRET: $(AzureClientSecret)
        AZURE_TENANT_ID: $(AzureTenantId)

- stage: Deploy_Infrastructure
  displayName: 'Deploy Infrastructure to ${{ parameters.environment }}'
  dependsOn: Validate_IaC
  condition: eq('${{ parameters.action }}', 'deploy')
  jobs:
  - deployment: DeployInfrastructure
    environment: ATP-Infrastructure-${{ parameters.environment }}
    strategy:
      runOnce:
        deploy:
          steps:
          - script: |
              # Install Pulumi
              curl -fsSL https://get.pulumi.com | sh
              export PATH=$PATH:$HOME/.pulumi/bin

              # Select stack
              pulumi stack select connectsoft/atp/${{ parameters.environment }}

              # Deploy infrastructure
              pulumi up --yes --non-interactive --skip-preview

              # Export outputs
              pulumi stack output --json > infrastructure-outputs.json
            displayName: 'Pulumi Deploy'
            env:
              PULUMI_ACCESS_TOKEN: $(PulumiAccessToken)
              AZURE_CLIENT_ID: $(AzureClientId)
              AZURE_CLIENT_SECRET: $(AzureClientSecret)
              AZURE_TENANT_ID: $(AzureTenantId)

          - task: PublishBuildArtifacts@1
            inputs:
              PathtoPublish: 'infrastructure-outputs.json'
              ArtifactName: 'infrastructure-outputs-${{ parameters.environment }}'

          - script: |
              # Verify deployment (Pulumi was installed in the previous step; re-add to PATH)
              export PATH=$PATH:$HOME/.pulumi/bin
              RESOURCE_GROUP=$(pulumi stack output ResourceGroupName)

              RESOURCE_COUNT=$(az resource list \
                --resource-group $RESOURCE_GROUP \
                --query "length([])")

              echo "Resources deployed: $RESOURCE_COUNT"

              if [ "$RESOURCE_COUNT" -lt 10 ]; then
                echo "❌ Expected at least 10 resources"
                exit 1
              fi

              echo "✅ Infrastructure deployment verified"
            displayName: 'Verify Deployment'

Configuration Drift Detection

Purpose: Detect and alert on manual changes to infrastructure that deviate from IaC definitions.

Drift Detection Script (Pulumi):

#!/bin/bash
# detect-drift.sh

ENVIRONMENT=$1

echo "Detecting infrastructure drift for $ENVIRONMENT..."

# Select stack
pulumi stack select connectsoft/atp/$ENVIRONMENT

# Refresh state from Azure
pulumi refresh --yes --non-interactive

# Preview to detect drift
DRIFT_OUTPUT=$(pulumi preview --diff --non-interactive 2>&1)

# Check if drift was detected (updates, creates, or deletes after refresh)
if echo "$DRIFT_OUTPUT" | grep -qE "~ update|\+ create|- delete"; then
  echo "⚠️ DRIFT DETECTED in $ENVIRONMENT"
  echo "$DRIFT_OUTPUT"

  # Create work item for drift resolution
  az boards work-item create \
    --title "Infrastructure Drift Detected: $ENVIRONMENT" \
    --type "Task" \
    --description "Drift detected in $ENVIRONMENT infrastructure. Review and resolve.\n\n$DRIFT_OUTPUT" \
    --assigned-to "platform-team@connectsoft.example" \
    --area "ATP/Infrastructure"

  exit 1
elif echo "$DRIFT_OUTPUT" | grep -q "no changes"; then
  echo "✅ No drift detected; infrastructure matches code"
else
  echo "⚠️ Unable to determine drift status"
fi
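The grep in the script keys off the operation markers `pulumi preview --diff` prints at the start of each changed resource (`~` update, `+` create, `-` delete). A Python sketch of the same classification, useful when post-processing preview output in a pipeline (the sample output lines below are illustrative, not captured from a real run):

```python
import re

# Operation markers emitted by `pulumi preview --diff` for changed resources.
DRIFT_PATTERN = re.compile(r"^\s*(~|\+|-)\s")

def has_drift(preview_output: str) -> bool:
    """Return True if any resource would be updated, created, or deleted."""
    return any(DRIFT_PATTERN.match(line) for line in preview_output.splitlines())

clean = "Previewing update (prod)\n\nResources:\n    12 unchanged\n"
drifted = "Previewing update (prod)\n  ~ azure-native:sql:Server atp-sql-prod-eastus update\n"

print(has_drift(clean))    # → False
print(has_drift(drifted))  # → True
```

For machine consumption, `pulumi preview --json` is the sturdier interface; pattern-matching human-readable output is a convenience for quick alerting.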

Scheduled Drift Detection (Azure DevOps):

# drift-detection-pipeline.yaml
trigger: none

schedules:
- cron: "0 6 * * *"  # Daily at 6 AM
  displayName: Daily Drift Detection
  branches:
    include:
    - main
  always: true

jobs:
- job: DetectDrift
  pool:
    vmImage: 'ubuntu-latest'
  strategy:
    matrix:
      dev:
        environment: dev
      test:
        environment: test
      staging:
        environment: staging
      prod:
        environment: prod
  steps:
  - script: |
      # Detect drift
      ./scripts/detect-drift.sh $(environment)
    displayName: 'Detect Drift: $(environment)'
    continueOnError: true  # Don't fail pipeline; just alert

GitOps for Configuration Management

Purpose: Manage application configuration (feature flags, connection strings) via Git with automated sync to Azure App Configuration.

GitOps Workflow:

flowchart LR
    A[Update config/prod.yaml] --> B[Commit to main]
    B --> C[GitHub Action Triggered]
    C --> D[Validate Config Schema]
    D --> E{Valid?}
    E -->|No| F[Fail CI]
    E -->|Yes| G[Sync to Azure App Configuration]
    G --> H[Services Auto-Refresh]

    style F fill:#FF6347
    style H fill:#90EE90

Configuration Repository (config/prod.yaml):

# config/prod.yaml - GitOps-managed configuration
environment: prod
region: eastus

featureFlags:
  TamperEvidenceV2:
    enabled: true
    description: "V2 tamper evidence with Merkle trees"

  AIAssistedAnomalyDetection:
    enabled: true
    targetingRules:
      - name: PercentageRollout
        percentage: 10
    description: "AI-based anomaly detection (canary rollout)"

appSettings:
  Audit:
    MaxBatchSize: 10000
    SealInterval: "PT15M"

  RateLimiting:
    PermitLimit: 100
    Window: 60

  OpenTelemetry:
    SamplingRatio: 0.1
    ExportIntervalSeconds: 60

GitOps Sync Script (GitHub Action):

# .github/workflows/sync-config.yaml
name: Sync Configuration to Azure App Config

on:
  push:
    branches: [main]
    paths:
      - 'config/prod.yaml'
      - 'config/staging.yaml'

jobs:
  sync-config:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3

    - name: Validate Configuration Schema
      run: |
        # JSON Schema validation
        npx ajv-cli validate \
          -s schemas/config-schema.json \
          -d config/prod.yaml

    - name: Sync to Azure App Configuration
      run: |
        # Install Azure CLI
        curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash

        # Login with service principal
        az login --service-principal \
          -u ${{ secrets.AZURE_CLIENT_ID }} \
          -p ${{ secrets.AZURE_CLIENT_SECRET }} \
          --tenant ${{ secrets.AZURE_TENANT_ID }}

        # Sync configuration
        az appconfig kv import \
          --name atp-appconfig-prod-eus \
          --source file \
          --path config/prod.yaml \
          --format yaml \
          --label prod \
          --yes

        # Update sentinel key (triggers app refresh)
        az appconfig kv set \
          --name atp-appconfig-prod-eus \
          --key Sentinel \
          --value "$(date +%s)" \
          --label prod \
          --yes

        echo "✅ Configuration synced to Azure App Configuration"
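
The workflow above bumps a single `Sentinel` key after every sync. Services watch only that key and reload their full configuration when its value changes, which keeps polling cheap. A Python sketch of the client-side pattern, where `fetch_sentinel` and `reload_config` are hypothetical stand-ins for App Configuration SDK calls:

```python
# Sketch of the sentinel-key refresh pattern: watch one "Sentinel" key and
# reload the whole configuration only when its value changes.

class ConfigRefresher:
    def __init__(self, fetch_sentinel, reload_config):
        self._fetch_sentinel = fetch_sentinel   # hypothetical SDK call
        self._reload_config = reload_config     # hypothetical full reload
        self._last_sentinel = None

    def poll(self) -> bool:
        """Check the sentinel; reload config if it changed. True on reload."""
        current = self._fetch_sentinel()
        if current == self._last_sentinel:
            return False  # unchanged; skip the expensive full reload
        self._last_sentinel = current
        self._reload_config()
        return True

reloads = []
sentinel = {"value": "1730000000"}
r = ConfigRefresher(lambda: sentinel["value"], lambda: reloads.append("reloaded"))
r.poll()                          # first poll always reloads
r.poll()                          # sentinel unchanged -> no reload
sentinel["value"] = "1730000600"  # GitOps sync bumped the sentinel
r.poll()
print(len(reloads))
```

In .NET this is what `ConfigureRefresh` with a registered sentinel key does for you; the sketch just shows why a single changing key is enough to trigger a coherent reload of many settings.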

Environment Comparison (IaC Configurations)

| Resource | Dev | Test | Staging | Production |
| --- | --- | --- | --- | --- |
| App Service SKU | B1 (Basic) | S1 (Standard) | P1v2 (Premium) | P3v3 (Premium) or AKS |
| App Service Instances | 1 | 2 | 2-5 (autoscale) | 3-10 (autoscale) or AKS nodes |
| SQL SKU | Basic (5 DTU) | S1 (20 DTU) | P2 (125 DTU) | P6 (16 vCores) |
| SQL Max Size | 2 GB | 10 GB | 100 GB | 500 GB |
| SQL Zone Redundancy | No | No | No | Yes |
| SQL Geo-Replication | No | No | Yes | Yes (multi-region) |
| Redis SKU | Basic C0 (250 MB) | Standard C1 (1 GB) | Premium P1 (6 GB) | Premium P3 (26 GB) |
| Redis Clustering | No | No | Yes | Yes |
| Redis Persistence | No | RDB | AOF | AOF + RDB |
| Service Bus SKU | Basic | Standard | Premium | Premium |
| Storage Replication | LRS | GRS | GZRS | GZRS |
| Storage Immutability | No | No | Yes (time-based) | Yes (WORM locked) |
| VNet | Shared (10.0.0.0/16) | Shared (10.0.0.0/16) | Dedicated (10.1.0.0/16) | Dedicated (10.2.0.0/16) |
| Private Endpoints | No | No | Yes | Yes |
| Public Network Access | Yes | Yes (IP-restricted) | No | No |
| Log Retention | 7 days | 14 days | 30 days | 90 days + archive |
| Telemetry Sampling | 100% | 50% | 25% | 10% |
| Monthly Budget | $500 | $1,000 | $3,000 | $10,000 |

IaC Best Practices

Security:

  1. Secrets in Key Vault: Never hardcode secrets in IaC; reference Key Vault.
  2. Least Privilege: Grant minimal RBAC roles for IaC service principals.
  3. State File Security: Encrypt Pulumi state files; restrict access to state storage.
  4. Private Endpoints: Use private endpoints for Production/Staging (no public access).

Maintainability:

  1. DRY Principle: Use overlays to avoid duplicating base infrastructure across environments.
  2. Version Control: Tag IaC releases; pin Production to stable tags.
  3. Documentation: Comment complex resources; link to ADRs for architectural decisions.
  4. Modular Design: Separate concerns (compute, storage, networking) into reusable modules.

Operational:

  1. Preview Before Deploy: Always run pulumi preview or az deployment group validate before applying changes.
  2. Drift Detection: Schedule daily drift detection; alert on manual changes.
  3. Incremental Updates: Use pulumi up for incremental updates (not pulumi destroy + recreate).
  4. Backup State: Back up Pulumi state regularly; test state restore procedures.

Cost Optimization:

  1. Right-Sizing: Use overlays to assign appropriate SKUs per environment (Dev: cheapest, Prod: optimized for performance).
  2. Auto-Shutdown: Enable auto-shutdown for Dev/Test (nights/weekends).
  3. Reserved Instances: Use reserved instances for Production (3-year commitment for maximum savings).
  4. Storage Tiers: Use lifecycle policies (Hot → Cool → Archive) for Production.
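
The lifecycle policy in the last point is age-based tiering: blobs move from Hot to Cool to Archive as they age. A small sketch with illustrative thresholds (the 30/180-day cutoffs are assumptions, not ATP's actual policy):

```python
# Illustrative age-based storage tiering (thresholds are assumptions):
# Hot -> Cool after 30 days, Cool -> Archive after 180 days.

def storage_tier(age_days: int) -> str:
    if age_days < 30:
        return "Hot"
    if age_days < 180:
        return "Cool"
    return "Archive"

print([storage_tier(d) for d in (5, 45, 400)])
```

In Azure this logic is declared, not coded: a Storage Account lifecycle management rule with `tierToCool` / `tierToArchive` conditions keyed on `daysAfterModificationGreaterThan`.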

Multi-Region Deployment (Production)

Purpose: Deploy Production infrastructure to multiple Azure regions for high availability and disaster recovery.

// Production multi-region deployment
class ProductionMultiRegionStack : Stack
{
    public ProductionMultiRegionStack()
    {
        var config = new Config();
        var primaryRegion = "eastus";
        var secondaryRegion = "westeurope";

        // Deploy to primary region
        var primaryInfra = DeployRegionalInfrastructure(primaryRegion, isPrimary: true);

        // Deploy to secondary region
        var secondaryInfra = DeployRegionalInfrastructure(secondaryRegion, isPrimary: false);

        // Configure geo-replication
        ConfigureGeoReplication(primaryInfra, secondaryInfra);

        // Configure Traffic Manager (multi-region routing)
        var trafficManager = new Profile("atp-traffic-manager-prod", new ProfileArgs
        {
            ProfileName = "atp-prod",
            ResourceGroupName = primaryInfra.ResourceGroupName,
            TrafficRoutingMethod = "Performance",  // Route to nearest region
            Endpoints = new[]
            {
                new EndpointArgs
                {
                    Name = "primary-eastus",
                    Type = "azureEndpoints",
                    TargetResourceId = primaryInfra.AppGatewayId,
                    Priority = 1  // Preferred endpoint if routing falls back to priority order
                },
                new EndpointArgs
                {
                    Name = "secondary-westeurope",
                    Type = "azureEndpoints",
                    TargetResourceId = secondaryInfra.AppGatewayId,
                    Priority = 2  // Failover endpoint
                }
            }
            // Note: with Performance routing, clients hit the lowest-latency region;
            // endpoint weights apply only to the Weighted routing method.
        });
    }
}

Summary

  • IaC Tools: Pulumi (C# preferred) and Bicep (Azure-native alternative) for infrastructure provisioning and management.
  • Overlay Pattern: Base infrastructure + environment-specific overlays enable consistent resources with graduated controls.
  • Environment Overlays: Dev (cost-optimized), Test (QA-grade), Staging (production-equivalent), Production (enterprise-grade with AKS).
  • Deployment Workflows: Automated CI/CD pipelines with validation, preview, and approval gates for infrastructure changes.
  • Drift Detection: Daily automated drift detection with alerts and work item creation for manual changes.
  • GitOps: Configuration management via Git with automated sync to Azure App Configuration.
  • Multi-Region: Production deploys to multiple regions with Traffic Manager routing and geo-replication.

Feature Flags & Runtime Configuration

ATP leverages Azure App Configuration for dynamic feature management and runtime configuration changes without requiring code redeployment. Feature flags enable gradual rollouts, A/B testing, kill switches, and environment-specific feature enablement, supporting safe experimentation in lower environments and controlled production releases.

This strategy ensures feature toggles are managed centrally, flag states are audited, and rollback is instantaneous (toggle flag off) without deploying previous code versions.

Azure App Configuration Per Environment

Each environment has tailored feature flag policies balancing innovation (Dev: all features on) with stability (Production: conservative rollouts).

Dev Environment Feature Flags

Purpose: Enable all features including experimental capabilities for rapid development and integration testing.

Configuration (atp-appconfig-dev-eus):

{
  "featureManagement": {
    "TamperEvidenceV2": true,
    "AdvancedQueryFilters": true,
    "AIAssistedAnomalyDetection": true,
    "ExperimentalFeatures": true,
    "PerformanceOptimizations": true,
    "NewExportFormats": true,
    "BlockchainAnchoring": true
  },

  "appSettings": {
    "Audit:MaxBatchSize": 100,
    "Audit:SealInterval": "PT24H",
    "RateLimiting:Enabled": false,
    "Caching:DefaultSlidingExpiration": "00:01:00"
  }
}

Feature Flag Policy (Dev):

  • All Features: Enabled by default (including experimental).
  • No Targeting: No user/tenant filters; everyone gets all features.
  • No Time Windows: Features always available.
  • Purpose: Test feature interactions; validate new capabilities early.

Test Environment Feature Flags

Purpose: Stable features only with integration test-specific flags for automated validation.

Configuration (atp-appconfig-test-eus):

{
  "featureManagement": {
    "TamperEvidenceV2": true,
    "AdvancedQueryFilters": true,
    "AIAssistedAnomalyDetection": false,  // Not stable yet
    "ExperimentalFeatures": false,  // Never in Test
    "PerformanceOptimizations": true,
    "NewExportFormats": {
      "EnabledFor": [
        {
          "Name": "TargetingFilter",
          "Parameters": {
            "Audience": {
              "Users": ["test-tenant-001", "test-tenant-002"]
            }
          }
        }
      ]
    }
  },

  "appSettings": {
    "Audit:MaxBatchSize": 1000,
    "Audit:SealInterval": "PT1H",
    "RateLimiting:Enabled": true,
    "RateLimiting:PermitLimit": 1000,
    "Caching:DefaultSlidingExpiration": "00:05:00"
  }
}

Feature Flag Policy (Test):

  • Stable Features: Enabled for regression testing.
  • Beta Features: Disabled (not ready for QA validation).
  • Integration Test Flags: Targeting specific test tenants for feature validation.
  • Purpose: Validate features with predictable stable behavior.

Staging Environment Feature Flags

Purpose: Production feature set with canary flags for validating new features at scale before Production rollout.

Configuration (atp-appconfig-staging-eus):

{
  "featureManagement": {
    "TamperEvidenceV2": true,

    "AdvancedQueryFilters": true,

    "AIAssistedAnomalyDetection": {
      "EnabledFor": [
        {
          "Name": "Percentage",
          "Parameters": {
            "Value": 50
          }
        }
      ]
    },

    "ExperimentalFeatures": false,

    "NewExportFormats": {
      "EnabledFor": [
        {
          "Name": "TargetingFilter",
          "Parameters": {
            "Audience": {
              "Users": ["staging-tenant-001", "staging-tenant-003"],
              "Groups": ["beta-testers"]
            }
          }
        }
      ]
    },

    "BlockchainAnchoring": {
      "EnabledFor": [
        {
          "Name": "TimeWindow",
          "Parameters": {
            "Start": "2025-11-01T00:00:00Z",
            "End": "2025-11-30T23:59:59Z"
          }
        }
      ]
    }
  },

  "appSettings": {
    "Audit:MaxBatchSize": 10000,
    "Audit:SealInterval": "PT15M",
    "RateLimiting:Enabled": true,
    "RateLimiting:PermitLimit": 500,
    "Caching:DefaultSlidingExpiration": "00:15:00"
  }
}

Feature Flag Policy (Staging):

  • Production Features: Enabled (mirrors production).
  • Canary Features: 50% rollout for load testing impact.
  • Beta Features: Targeted rollout to specific tenants for acceptance validation.
  • Time-Windowed Features: Test time-based feature activation for planned releases.
  • Purpose: Validate production feature configuration; test rollout strategies.

Production Environment Feature Flags

Purpose: Stable features only with conservative gradual rollouts and targeting for early access tenants.

Configuration (atp-appconfig-prod-eus):

{
  "featureManagement": {
    "TamperEvidenceV2": true,  // Fully rolled out

    "AdvancedQueryFilters": true,  // Fully rolled out

    "AIAssistedAnomalyDetection": {
      "EnabledFor": [
        {
          "Name": "Percentage",
          "Parameters": {
            "Value": 10,
            "Seed": "consistent-seed-123"
          }
        }
      ]
    },

    "ExperimentalFeatures": false,  // Never in Production

    "NewExportFormats": {
      "EnabledFor": [
        {
          "Name": "TargetingFilter",
          "Parameters": {
            "Audience": {
              "Users": ["tenant-12345", "tenant-67890"],
              "Groups": ["early-access", "enterprise-tier"],
              "DefaultRolloutPercentage": 5
            }
          }
        }
      ]
    },

    "PerformanceOptimizations": {
      "EnabledFor": [
        {
          "Name": "Percentage",
          "Parameters": {
            "Value": 100  // Fully rolled out after successful canary
          }
        }
      ]
    }
  },

  "appSettings": {
    "Audit:MaxBatchSize": 10000,
    "Audit:SealInterval": "PT15M",
    "RateLimiting:Enabled": true,
    "RateLimiting:PermitLimit": 100,
    "RateLimiting:Window": 60,
    "Caching:DefaultSlidingExpiration": "00:15:00",
    "Caching:DistributedCache": true,
    "OpenTelemetry:SamplingRatio": 0.1
  }
}

Feature Flag Policy (Production):

  • Stable Features: Fully enabled (100% rollout).
  • New Features: Conservative percentage rollout (5-10% initially).
  • Beta Features: Targeting-based rollout to early access tenants only.
  • Experimental Features: Absolutely prohibited.
  • Purpose: Minimize production risk; enable data-driven rollout decisions.

Feature Flag Patterns & Filters

ATP uses the Microsoft Feature Management library with five filter types — boolean, percentage, targeting, time window, and custom — which can also be composed for sophisticated feature control.

Boolean Filter (Simple On/Off)

Usage: Feature is either enabled or disabled for all users.

{
  "featureManagement": {
    "TamperEvidenceV2": true
  }
}

C# Implementation:

// Check if feature enabled
if (await _featureManager.IsEnabledAsync("TamperEvidenceV2"))
{
    return await RecordWithTamperEvidenceV2Async(auditEvent);
}
else
{
    return await RecordWithLegacyTamperEvidenceAsync(auditEvent);
}

Percentage Filter (Gradual Rollout)

Usage: Enable feature for random percentage of users/requests (canary deployments, A/B testing).

{
  "featureManagement": {
    "AIAssistedAnomalyDetection": {
      "EnabledFor": [
        {
          "Name": "Percentage",
          "Parameters": {
            "Value": 10,
            "Seed": "consistent-seed-123"  // Deterministic hashing
          }
        }
      ]
    }
  }
}

C# Implementation:

// Percentage filter implementation (deterministic)
public class PercentageFilter : IFeatureFilter
{
    private readonly IHttpContextAccessor _httpContextAccessor;

    public PercentageFilter(IHttpContextAccessor httpContextAccessor)
    {
        _httpContextAccessor = httpContextAccessor;
    }

    public Task<bool> EvaluateAsync(FeatureFilterEvaluationContext context)
    {
        var parameters = context.Parameters.Get<PercentageFilterSettings>();

        // Get a stable identifier (tenant ID) for consistent results
        var tenantId = _httpContextAccessor.HttpContext?.User?.FindFirst("tenantId")?.Value;

        if (string.IsNullOrEmpty(tenantId))
        {
            return Task.FromResult(false);
        }

        // Deterministic hash (same tenant always gets same result)
        var hash = ComputeHash($"{tenantId}{parameters.Seed}");
        var percentage = Math.Abs(hash) % 100;

        var enabled = percentage < parameters.Value;

        return Task.FromResult(enabled);
    }

    // Stable hash (string.GetHashCode is randomized per process in .NET Core)
    private static int ComputeHash(string input)
    {
        var bytes = System.Security.Cryptography.SHA256.HashData(
            System.Text.Encoding.UTF8.GetBytes(input));
        return BitConverter.ToInt32(bytes, 0);
    }
}

Rollout Strategy (Production):

Week 1: 5%  (50 tenants out of 1000)
Week 2: 10% (100 tenants)
Week 3: 25% (250 tenants)
Week 4: 50% (500 tenants)
Week 5: 100% (all tenants)
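
Because the bucket is a deterministic hash of tenant ID plus seed, raising the percentage only ever adds tenants: a tenant enabled at 10% stays enabled at 25%. A Python sketch of this bucketing (mirroring the filter above; the seed value is illustrative):

```python
# Deterministic percentage rollout: hash tenant ID + seed into a 0-99 bucket;
# a tenant is enabled when its bucket is below the rollout percentage.
import hashlib

def bucket(tenant_id: str, seed: str) -> int:
    digest = hashlib.sha256(f"{tenant_id}{seed}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % 100

def is_enabled(tenant_id: str, percentage: int, seed: str = "consistent-seed-123") -> bool:
    return bucket(tenant_id, seed) < percentage

tenants = [f"tenant-{i:05d}" for i in range(1000)]
wave_10 = {t for t in tenants if is_enabled(t, 10)}
wave_25 = {t for t in tenants if is_enabled(t, 25)}
print(wave_10 <= wave_25)  # earlier waves are subsets of later ones
```

Changing the seed reshuffles every tenant into a new bucket, which is why the seed must stay fixed for the lifetime of a rollout.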

Targeting Filter (User/Tenant-Specific)

Usage: Enable feature for specific users, tenants, or groups (early access programs, beta testing).

{
  "featureManagement": {
    "NewExportFormats": {
      "EnabledFor": [
        {
          "Name": "TargetingFilter",
          "Parameters": {
            "Audience": {
              "Users": [
                "tenant-12345",
                "tenant-67890",
                "tenant-11111"
              ],
              "Groups": [
                "early-access",
                "enterprise-tier",
                "beta-testers"
              ],
              "DefaultRolloutPercentage": 5,
              "Exclusion": {
                "Users": ["tenant-99999"],
                "Groups": ["opt-out"]
              }
            }
          }
        }
      ]
    }
  }
}

C# Implementation:

// Targeting filter implementation
public class TargetingFilter : IFeatureFilter
{
    private readonly IHttpContextAccessor _httpContextAccessor;

    public TargetingFilter(IHttpContextAccessor httpContextAccessor)
    {
        _httpContextAccessor = httpContextAccessor;
    }

    public Task<bool> EvaluateAsync(FeatureFilterEvaluationContext context)
    {
        var parameters = context.Parameters.Get<TargetingFilterSettings>();
        var tenantId = _httpContextAccessor.HttpContext?.User?.FindFirst("tenantId")?.Value;
        var groups = _httpContextAccessor.HttpContext?.User?.FindAll("group").Select(c => c.Value).ToList();

        if (string.IsNullOrEmpty(tenantId))
        {
            return Task.FromResult(false);
        }

        // Check exclusions first
        if (parameters.Audience.Exclusion?.Users?.Contains(tenantId) == true)
        {
            return Task.FromResult(false);
        }

        if (groups != null && parameters.Audience.Exclusion?.Groups?.Any(g => groups.Contains(g)) == true)
        {
            return Task.FromResult(false);
        }

        // Check explicit targeting
        if (parameters.Audience.Users?.Contains(tenantId) == true)
        {
            return Task.FromResult(true);
        }

        if (groups != null && parameters.Audience.Groups?.Any(g => groups.Contains(g)) == true)
        {
            return Task.FromResult(true);
        }

        // Fall back to default rollout percentage
        if (parameters.Audience.DefaultRolloutPercentage > 0)
        {
            var hash = ComputeHash(tenantId);
            var percentage = Math.Abs(hash) % 100;
            return Task.FromResult(percentage < parameters.Audience.DefaultRolloutPercentage);
        }

        return Task.FromResult(false);
    }

    // Stable hash (string.GetHashCode is randomized per process in .NET Core)
    private static int ComputeHash(string input)
    {
        var bytes = System.Security.Cryptography.SHA256.HashData(
            System.Text.Encoding.UTF8.GetBytes(input));
        return BitConverter.ToInt32(bytes, 0);
    }
}
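
The evaluation order matters: exclusions win over everything, then explicit users, then groups, then the default rollout percentage. A compact Python sketch of that decision chain (the audience shape mirrors the JSON above; the bucketing is illustrative):

```python
# Targeting evaluation order: exclusions -> explicit users -> groups -> default %.
import hashlib

def bucket(tenant_id: str) -> int:
    digest = hashlib.sha256(tenant_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % 100

def targeted(tenant_id: str, groups: list, audience: dict) -> bool:
    excl = audience.get("Exclusion", {})
    if tenant_id in excl.get("Users", []) or set(groups) & set(excl.get("Groups", [])):
        return False  # exclusions always win
    if tenant_id in audience.get("Users", []):
        return True   # explicit user targeting
    if set(groups) & set(audience.get("Groups", [])):
        return True   # group targeting
    return bucket(tenant_id) < audience.get("DefaultRolloutPercentage", 0)

audience = {
    "Users": ["tenant-12345"],
    "Groups": ["early-access"],
    "DefaultRolloutPercentage": 5,
    "Exclusion": {"Users": ["tenant-99999"], "Groups": ["opt-out"]},
}
print(targeted("tenant-12345", [], audience))                # explicit user
print(targeted("tenant-99999", ["early-access"], audience))  # exclusion wins
```

Putting exclusions first gives tenants a hard opt-out that no rollout percentage can override.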

Time Window Filter (Scheduled Features)

Usage: Enable feature between specific dates/times (planned feature launches, limited-time offers, maintenance windows).

{
  "featureManagement": {
    "BlockchainAnchoring": {
      "EnabledFor": [
        {
          "Name": "TimeWindow",
          "Parameters": {
            "Start": "2025-11-01T00:00:00Z",
            "End": "2025-11-30T23:59:59Z"
          }
        }
      ]
    },

    "MaintenanceMode": {
      "EnabledFor": [
        {
          "Name": "TimeWindow",
          "Parameters": {
            "Start": "2025-10-31T02:00:00Z",
            "End": "2025-10-31T04:00:00Z",
            "Recurrence": {
              "Pattern": "Weekly",
              "DaysOfWeek": ["Sunday"]
            }
          }
        }
      ]
    }
  }
}

C# Implementation:

// Time window filter implementation
public class TimeWindowFilter : IFeatureFilter
{
    public Task<bool> EvaluateAsync(FeatureFilterEvaluationContext context)
    {
        var parameters = context.Parameters.Get<TimeWindowFilterSettings>();
        var now = DateTime.UtcNow;

        // Check if within time window
        var enabled = now >= parameters.Start && now <= parameters.End;

        // Check recurrence pattern (e.g., every Sunday)
        if (!enabled && parameters.Recurrence != null)
        {
            if (parameters.Recurrence.Pattern == "Weekly")
            {
                var currentDay = now.DayOfWeek.ToString();
                enabled = parameters.Recurrence.DaysOfWeek?.Contains(currentDay) == true;

                if (enabled)
                {
                    // Check if within daily time window
                    var startTime = parameters.Start.TimeOfDay;
                    var endTime = parameters.End.TimeOfDay;
                    enabled = now.TimeOfDay >= startTime && now.TimeOfDay <= endTime;
                }
            }
        }

        return Task.FromResult(enabled);
    }
}
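
The recurrence case above treats the configured Start/End as a time-of-day template on the matching weekday. A Python sketch of that evaluation (simplified to a single weekly pattern; day names and thresholds are illustrative):

```python
# Time-window evaluation with weekly recurrence: outside the literal window,
# a recurring feature is on when today matches a listed weekday and the
# current time falls inside the window's time-of-day range.
from datetime import datetime

def in_window(now, start, end, weekly_days=None):
    if start <= now <= end:
        return True
    if weekly_days and now.strftime("%A") in weekly_days:
        return start.time() <= now.time() <= end.time()
    return False

start = datetime(2025, 10, 31, 2, 0)  # window template: 02:00-04:00 UTC
end = datetime(2025, 10, 31, 4, 0)
# 2025-11-09 is a Sunday; 2025-11-10 is a Monday
print(in_window(datetime(2025, 11, 9, 3, 0), start, end, weekly_days=["Sunday"]))
print(in_window(datetime(2025, 11, 10, 3, 0), start, end, weekly_days=["Sunday"]))
```

Note this simple form assumes the window does not cross midnight; a window like 23:00-01:00 would need the time comparison split into two ranges.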

Custom Filter (Business Logic)

Usage: Complex business rules for feature enablement (tenant tier, compliance profile, region).

{
  "featureManagement": {
    "AdvancedAnalytics": {
      "EnabledFor": [
        {
          "Name": "TenantTierFilter",
          "Parameters": {
            "RequiredTier": "Enterprise",
            "RequiredCompliance": ["soc2", "hipaa"]
          }
        }
      ]
    }
  }
}

C# Implementation:

// Custom filter: Tenant tier + compliance requirements
public class TenantTierFilter : IFeatureFilter
{
    private readonly ITenantService _tenantService;
    private readonly IHttpContextAccessor _httpContextAccessor;

    public TenantTierFilter(ITenantService tenantService, IHttpContextAccessor httpContextAccessor)
    {
        _tenantService = tenantService;
        _httpContextAccessor = httpContextAccessor;
    }

    public async Task<bool> EvaluateAsync(FeatureFilterEvaluationContext context)
    {
        var parameters = context.Parameters.Get<TenantTierFilterSettings>();
        var tenantId = _httpContextAccessor.HttpContext?.User?.FindFirst("tenantId")?.Value;

        if (string.IsNullOrEmpty(tenantId))
        {
            return false;
        }

        // Fetch tenant details
        var tenant = await _tenantService.GetTenantAsync(tenantId);

        // Check tier requirement
        if (tenant.Edition != parameters.RequiredTier)
        {
            return false;
        }

        // Check compliance profile
        var tenantCompliance = tenant.ComplianceProfile.Split(',');
        var hasRequiredCompliance = parameters.RequiredCompliance
            .All(req => tenantCompliance.Contains(req));

        return hasRequiredCompliance;
    }
}

// Settings class
public class TenantTierFilterSettings
{
    public string RequiredTier { get; set; }  // Standard, Business, Enterprise
    public List<string> RequiredCompliance { get; set; }  // gdpr, hipaa, soc2
}

Composite Filters (Multiple Conditions)

Usage: Combine multiple filters with AND/OR logic for complex scenarios.

{
  "featureManagement": {
    "PremiumFeature": {
      "RequirementType": "All",  // AND logic
      "EnabledFor": [
        {
          "Name": "TenantTierFilter",
          "Parameters": {
            "RequiredTier": "Enterprise"
          }
        },
        {
          "Name": "Percentage",
          "Parameters": {
            "Value": 25
          }
        },
        {
          "Name": "TimeWindow",
          "Parameters": {
            "Start": "2025-11-01T00:00:00Z",
            "End": "2025-12-31T23:59:59Z"
          }
        }
      ]
    }
  }
}

Evaluation Logic:

  • RequirementType: All (AND): All filters must return true.
  • RequirementType: Any (OR): At least one filter must return true.
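
The two modes reduce to AND/OR over the individual filter results. A minimal Python sketch of that combination:

```python
# Composite filter evaluation: "All" ANDs the filter results, "Any" ORs them.

def evaluate_composite(requirement_type: str, filter_results: list) -> bool:
    if requirement_type == "All":
        return all(filter_results)
    if requirement_type == "Any":
        return any(filter_results)
    raise ValueError(f"Unknown RequirementType: {requirement_type}")

# Enterprise tier (True), inside 25% cohort (True), within time window (False):
print(evaluate_composite("All", [True, True, False]))  # -> False
print(evaluate_composite("Any", [True, True, False]))  # -> True
```

Under "All", adding a time window to the PremiumFeature example above narrows the audience; under "Any", it would widen it, so the `RequirementType` choice changes the meaning of every filter in the list.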

Feature Flag Precedence & Evaluation

Evaluation Order (First match wins):

1. Explicit User/Tenant Targeting (highest priority)
2. Group Targeting
3. Percentage Rollout
4. Time Window
5. Custom Business Logic Filters
6. Environment Default (Boolean true/false)
7. Global Default (false if not specified)

Precedence Example:

// Feature flag evaluation with precedence
public class FeatureEvaluationService
{
    private readonly IFeatureManager _featureManager;
    private readonly TelemetryClient _telemetry;
    private readonly ILogger<FeatureEvaluationService> _logger;

    public async Task<bool> IsFeatureEnabledAsync(string featureName, string tenantId)
    {
        // 1. Check if feature exists (GetFeatureNamesAsync returns IAsyncEnumerable<string>)
        var featureExists = false;
        await foreach (var name in _featureManager.GetFeatureNamesAsync())
        {
            if (name == featureName) { featureExists = true; break; }
        }

        if (!featureExists)
        {
            _logger.LogWarning("Feature '{FeatureName}' not defined; defaulting to false", featureName);
            return false;
        }

        // 2. Evaluate filters (in precedence order)
        var context = new TargetingContext
        {
            UserId = tenantId,
            Groups = await GetTenantGroupsAsync(tenantId)
        };

        var enabled = await _featureManager.IsEnabledAsync(featureName, context);

        // 3. Log feature flag evaluation for audit trail
        _logger.LogInformation(
            "Feature flag evaluated: {FeatureName} = {Enabled} for tenant {TenantId}",
            featureName, enabled, tenantId);

        // 4. Emit telemetry for feature usage analytics
        _telemetry.TrackEvent("FeatureFlagEvaluation", new Dictionary<string, string>
        {
            ["FeatureName"] = featureName,
            ["Enabled"] = enabled.ToString(),
            ["TenantId"] = tenantId,
            ["Timestamp"] = DateTime.UtcNow.ToString("o")
        });

        return enabled;
    }
}

Feature Flag Management Operations

Creating Feature Flags

Via Azure CLI:

#!/bin/bash
# create-feature-flag.sh

FEATURE_NAME=$1
ENVIRONMENT=$2
ENABLED=${3:-false}

echo "Creating feature flag '$FEATURE_NAME' for $ENVIRONMENT..."

az appconfig feature set \
  --name "atp-appconfig-$ENVIRONMENT-eus" \
  --feature "$FEATURE_NAME" \
  --label "$ENVIRONMENT" \
  --description "Feature: $FEATURE_NAME" \
  --yes

if [ "$ENABLED" == "true" ]; then
  az appconfig feature enable \
    --name "atp-appconfig-$ENVIRONMENT-eus" \
    --feature "$FEATURE_NAME" \
    --label "$ENVIRONMENT" \
    --yes
fi

echo "✅ Feature flag '$FEATURE_NAME' created"

Via Pulumi (IaC):

// Create feature flag in Azure App Configuration
var featureFlag = new ConfigurationStoreKeyValue($"feature-{featureName}", new ConfigurationStoreKeyValueArgs
{
    ConfigStoreName = appConfigStore.Name,
    ResourceGroupName = resourceGroup.Name,
    Key = $".appconfig.featureflag/{featureName}",
    Label = environment,
    ContentType = "application/vnd.microsoft.appconfig.ff+json;charset=utf-8",
    Value = JsonSerializer.Serialize(new
    {
        id = featureName,
        description = $"{featureName} feature flag",
        enabled = true,
        conditions = new
        {
            client_filters = new[]
            {
                new
                {
                    name = "Percentage",
                    parameters = new
                    {
                        Value = 10,
                        Seed = "consistent-seed"
                    }
                }
            }
        }
    })
});

Updating Feature Flags

Gradual Rollout (Production):

#!/bin/bash
# rollout-feature.sh

FEATURE_NAME="AIAssistedAnomalyDetection"
STAGES=(5 10 25 50 100)

for PERCENTAGE in "${STAGES[@]}"; do
  echo "Rolling out $FEATURE_NAME to $PERCENTAGE%..."

  # Update percentage filter
  az appconfig feature filter add \
    --name atp-appconfig-prod-eus \
    --feature "$FEATURE_NAME" \
    --label prod \
    --filter-name Percentage \
    --filter-parameters Value=$PERCENTAGE \
    --yes

  # Update sentinel key (trigger app refresh)
  az appconfig kv set \
    --name atp-appconfig-prod-eus \
    --key Sentinel \
    --value "$(date +%s)" \
    --label prod \
    --yes

  echo "Waiting 7 days for monitoring..."

  # In production, wait 7 days between rollout stages
  # (simulated here; actual implementation would be manual or scheduled)

  # Monitor metrics for this stage
  ERROR_RATE=$(az monitor app-insights metrics show \
    --app atp-appinsights-prod-eus \
    --metric "requests/failed" \
    --aggregation avg \
    --offset 24h \
    --query "value.segments[0]['requests/failed'].avg")

  if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
    echo "❌ Error rate too high: $ERROR_RATE%"
    echo "Rolling back feature flag..."

    # Disable feature
    az appconfig feature disable \
      --name atp-appconfig-prod-eus \
      --feature "$FEATURE_NAME" \
      --label prod \
      --yes

    exit 1
  fi

  echo "✅ $PERCENTAGE% rollout successful; proceeding to next stage"
done

echo "✅ Feature $FEATURE_NAME rolled out to 100%"

Disabling Feature Flags (Kill Switch)

Immediate Rollback:

#!/bin/bash
# kill-switch.sh

FEATURE_NAME=$1
REASON=$2

echo "🚨 KILL SWITCH: Disabling feature '$FEATURE_NAME'"
echo "Reason: $REASON"

# Disable feature in Production
az appconfig feature disable \
  --name atp-appconfig-prod-eus \
  --feature "$FEATURE_NAME" \
  --label prod \
  --yes

# Update sentinel (trigger immediate app refresh)
az appconfig kv set \
  --name atp-appconfig-prod-eus \
  --key Sentinel \
  --value "$(date +%s)" \
  --label prod \
  --yes

# Create incident ticket
az boards work-item create \
  --title "Feature Kill Switch Activated: $FEATURE_NAME" \
  --type "Incident" \
  --description "Feature '$FEATURE_NAME' disabled via kill switch.\n\nReason: $REASON\n\nTimestamp: $(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --assigned-to "platform-team@connectsoft.example" \
  --fields Priority=1

# Notify team (SLACK_WEBHOOK_URL supplied via environment)
curl -X POST "$SLACK_WEBHOOK_URL" \
  -H 'Content-Type: application/json' \
  -d "{
    \"text\": \"🚨 Feature Kill Switch Activated\",
    \"attachments\": [{
      \"color\": \"danger\",
      \"fields\": [
        {\"title\": \"Feature\", \"value\": \"$FEATURE_NAME\", \"short\": true},
        {\"title\": \"Reason\", \"value\": \"$REASON\", \"short\": false}
      ]
    }]
  }"

echo "✅ Feature '$FEATURE_NAME' disabled; apps refreshing within 5 minutes"

Feature Flag Testing Strategies

Unit Testing with Feature Flags

// Unit test with feature flag mocking
[Fact]
public async Task RecordEvent_WhenTamperEvidenceV2Enabled_UsesNewAlgorithm()
{
    // Arrange
    var featureManager = new Mock<IFeatureManager>();
    featureManager
        .Setup(fm => fm.IsEnabledAsync("TamperEvidenceV2"))
        .ReturnsAsync(true);

    var service = new AuditService(featureManager.Object);

    // Act
    var result = await service.RecordEventAsync(new AuditEvent { /* ... */ });

    // Assert
    Assert.NotNull(result.TamperEvidence);
    Assert.Equal("V2", result.TamperEvidence.Algorithm);
}

[Fact]
public async Task RecordEvent_WhenTamperEvidenceV2Disabled_UsesLegacyAlgorithm()
{
    // Arrange
    var featureManager = new Mock<IFeatureManager>();
    featureManager
        .Setup(fm => fm.IsEnabledAsync("TamperEvidenceV2"))
        .ReturnsAsync(false);

    var service = new AuditService(featureManager.Object);

    // Act
    var result = await service.RecordEventAsync(new AuditEvent { /* ... */ });

    // Assert
    Assert.NotNull(result.TamperEvidence);
    Assert.Equal("V1", result.TamperEvidence.Algorithm);
}

Integration Testing (Test Environment)

// Integration test with actual Azure App Configuration
[Collection("IntegrationTests")]
public class FeatureFlagIntegrationTests
{
    private readonly IFeatureManager _featureManager;

    public FeatureFlagIntegrationTests()
    {
        // Connect to Test App Configuration
        var configuration = new ConfigurationBuilder()
            .AddAzureAppConfiguration(options =>
            {
                options.Connect(TestEnvironment.AppConfigConnectionString)
                       .Select(KeyFilter.Any, "test")
                       .UseFeatureFlags();
            })
            .Build();

        var services = new ServiceCollection();
        services.AddSingleton<IConfiguration>(configuration);
        services.AddFeatureManagement();

        var provider = services.BuildServiceProvider();
        _featureManager = provider.GetRequiredService<IFeatureManager>();
    }

    [Fact]
    public async Task FeatureFlag_TamperEvidenceV2_EnabledInTest()
    {
        // Act
        var enabled = await _featureManager.IsEnabledAsync("TamperEvidenceV2");

        // Assert
        Assert.True(enabled, "TamperEvidenceV2 should be enabled in Test environment");
    }

    [Fact]
    public async Task FeatureFlag_ExperimentalFeatures_DisabledInTest()
    {
        // Act
        var enabled = await _featureManager.IsEnabledAsync("ExperimentalFeatures");

        // Assert
        Assert.False(enabled, "ExperimentalFeatures should be disabled in Test environment");
    }
}

Feature Flag Monitoring & Analytics

Purpose: Track feature flag usage metrics, performance impact, and rollout success for data-driven decisions.

Telemetry Integration:

// Feature flag usage tracking
public class TelemetryFeatureManagerSnapshot : IFeatureManagerSnapshot
{
    private readonly IFeatureManagerSnapshot _inner;
    private readonly TelemetryClient _telemetry;

    public TelemetryFeatureManagerSnapshot(
        IFeatureManagerSnapshot inner,
        TelemetryClient telemetry)
    {
        _inner = inner;
        _telemetry = telemetry;
    }

    // Forward feature enumeration to the inner snapshot
    // (required member of IFeatureManagerSnapshot via IFeatureManager)
    public IAsyncEnumerable<string> GetFeatureNamesAsync() => _inner.GetFeatureNamesAsync();

    public async Task<bool> IsEnabledAsync(string feature)
    {
        var enabled = await _inner.IsEnabledAsync(feature);

        // Track feature flag evaluation
        _telemetry.TrackEvent("FeatureFlagEvaluated", new Dictionary<string, string>
        {
            ["FeatureName"] = feature,
            ["Enabled"] = enabled.ToString(),
            ["Timestamp"] = DateTime.UtcNow.ToString("o")
        });

        return enabled;
    }

    public async Task<bool> IsEnabledAsync<TContext>(string feature, TContext context)
    {
        var enabled = await _inner.IsEnabledAsync(feature, context);

        // Track with context
        var properties = new Dictionary<string, string>
        {
            ["FeatureName"] = feature,
            ["Enabled"] = enabled.ToString(),
            ["Timestamp"] = DateTime.UtcNow.ToString("o")
        };

        if (context is TargetingContext targetingContext)
        {
            properties["UserId"] = targetingContext.UserId;
            properties["Groups"] = string.Join(",", targetingContext.Groups ?? new List<string>());
        }

        _telemetry.TrackEvent("FeatureFlagEvaluated", properties);

        return enabled;
    }
}

Feature Usage Dashboard (Application Insights Query):

// Feature flag usage over last 7 days
customEvents
| where timestamp > ago(7d)
| where name == "FeatureFlagEvaluated"
| extend FeatureName = tostring(customDimensions.FeatureName)
| extend Enabled = tostring(customDimensions.Enabled)
| summarize 
    TotalEvaluations = count(),
    EnabledCount = countif(Enabled == "true"),
    DisabledCount = countif(Enabled == "false"),
    EnabledPercentage = 100.0 * countif(Enabled == "true") / count()
  by FeatureName
| order by TotalEvaluations desc

Performance Impact Analysis:

// Compare performance with/without feature flag
requests
| where timestamp > ago(24h)
| extend FeatureFlagEnabled = tostring(customDimensions.TamperEvidenceV2)
| summarize 
    AvgDuration = avg(duration),
    P95Duration = percentile(duration, 95),
    P99Duration = percentile(duration, 99),
    Count = count()
  by FeatureFlagEnabled
| project FeatureFlagEnabled, AvgDuration, P95Duration, P99Duration, Count

Feature Flag Lifecycle Management

Feature Flag States:

stateDiagram-v2
    [*] --> Development: Feature created
    Development --> Testing: Feature ready
    Testing --> Canary: Tests pass
    Canary --> Rollout: Metrics healthy
    Rollout --> General_Availability: 100% rollout
    General_Availability --> Deprecated: Feature superseded
    Deprecated --> Removed: Cleanup old code

    Canary --> Disabled: Metrics unhealthy
    Rollout --> Disabled: Issues detected
    General_Availability --> Disabled: Kill switch

    Disabled --> Canary: Issues resolved

Feature Flag Cleanup (Remove Old Flags):

// Identify stale feature flags (fully rolled out >90 days)
[FunctionName("IdentifyStaleFeatureFlags")]
public async Task RunAsync(
    [TimerTrigger("0 0 1 * * 0")] TimerInfo timer,  // Weekly on Sunday at 1 AM
    ILogger log)
{
    log.LogInformation("Identifying stale feature flags...");

    // ConfigurationClient takes either a connection string or an endpoint + TokenCredential, not both
    var appConfigClient = new ConfigurationClient(
        Environment.GetEnvironmentVariable("AppConfig:ConnectionString"));

    var staleFlags = new List<string>();

    await foreach (var setting in appConfigClient.GetConfigurationSettingsAsync(
        new SettingSelector { KeyFilter = ".appconfig.featureflag/*", LabelFilter = "prod" }))
    {
        var featureFlag = JsonSerializer.Deserialize<FeatureFlag>(setting.Value);

        // A boolean-true flag with no client filters is fully rolled out
        if (featureFlag.Enabled && (featureFlag.Conditions?.ClientFilters?.Count ?? 0) == 0)
        {
            // Check how long it has been at 100% (LastModified is a DateTimeOffset?)
            var lastModified = setting.LastModified ?? DateTimeOffset.UtcNow;
            var daysSinceFullRollout = (DateTimeOffset.UtcNow - lastModified).Days;

            if (daysSinceFullRollout > 90)
            {
                staleFlags.Add(featureFlag.Id);
                log.LogWarning($"Stale flag: {featureFlag.Id} (fully rolled out for {daysSinceFullRollout} days)");
            }
        }
    }

    if (staleFlags.Any())
    {
        // Create work item for cleanup
        await CreateCleanupTaskAsync(staleFlags);
    }

    log.LogInformation($"✅ Identified {staleFlags.Count} stale feature flags");
}

Runtime Configuration Refresh

Purpose: Enable configuration updates without restarting applications using Azure App Configuration refresh.

Configuration Refresh Implementation:

// Program.cs - Configure App Configuration refresh
public static IHostBuilder CreateHostBuilder(string[] args) =>
    Host.CreateDefaultBuilder(args)
        .ConfigureAppConfiguration((context, config) =>
        {
            var env = context.HostingEnvironment;

            if (env.IsProduction() || env.IsStaging())
            {
                var settings = config.Build();
                var appConfigConnection = settings["AppConfig:ConnectionString"];

                config.AddAzureAppConfiguration(options =>
                {
                    options
                        .Connect(appConfigConnection)
                        .Select(KeyFilter.Any, LabelFilter.Null)
                        .Select(KeyFilter.Any, env.EnvironmentName)

                        // Configure refresh behavior
                        .ConfigureRefresh(refresh =>
                        {
                            // Sentinel key pattern: refresh all when sentinel changes
                            refresh.Register("Sentinel", refreshAll: true)
                                   .SetCacheExpiration(TimeSpan.FromMinutes(5));

                            // Refresh specific keys independently
                            refresh.Register("Audit:MaxBatchSize", refreshAll: false)
                                   .SetCacheExpiration(TimeSpan.FromMinutes(15));
                        })

                        // Feature flags refresh
                        .UseFeatureFlags(featureFlagOptions =>
                        {
                            featureFlagOptions.CacheExpirationInterval = TimeSpan.FromMinutes(5);
                            featureFlagOptions.Label = env.EnvironmentName;
                        });
                });
            }
        })
        .ConfigureWebHostDefaults(webBuilder =>
        {
            webBuilder.UseStartup<Startup>();
        });

// Startup.cs - Add middleware
public void Configure(IApplicationBuilder app)
{
    // Azure App Configuration refresh middleware
    app.UseAzureAppConfiguration();

    // ... other middleware
}

Sentinel Key Pattern (Trigger Full Refresh):

# Update sentinel key to trigger full app configuration refresh
az appconfig kv set \
  --name atp-appconfig-prod-eus \
  --key Sentinel \
  --value "$(date +%s)" \
  --label prod \
  --yes

echo "Sentinel updated; apps will refresh within 5 minutes"

Feature Flag Best Practices

Development:

  1. Feature Flag Naming: Use descriptive names with version suffixes (e.g., TamperEvidenceV2, QueryOptimizationV3).
  2. Default to Off: New features default to false; explicitly enable per environment.
  3. Short-Lived Flags: Remove flags once features are fully rolled out and old code paths deleted.
  4. Documentation: Document feature flags with purpose, rollout plan, and cleanup date.

Testing:

  1. Test Both Paths: Test feature enabled AND disabled in unit/integration tests.
  2. Percentage Testing: Test percentage filter with various values (0%, 50%, 100%).
  3. Targeting Testing: Validate targeting filters work correctly for specific tenants.
  4. Performance Testing: Measure performance impact of new features during canary rollout.
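
The "Test Both Paths" rule can be followed without a live App Configuration store by stubbing the flag source. A minimal sketch (the `ISimpleFlagSource` interface, `StubFlagSource`, and `ExportFormatSelector` below are hypothetical illustrations, not ATP code or the real `IFeatureManager`):

```csharp
using System;
using System.Collections.Generic;

// Hypothetical minimal stand-in for a feature flag source (not the real IFeatureManager)
public interface ISimpleFlagSource
{
    bool IsEnabled(string feature);
}

// Dictionary-backed stub for exercising both code paths in tests
public sealed class StubFlagSource : ISimpleFlagSource
{
    private readonly Dictionary<string, bool> _flags;

    public StubFlagSource(Dictionary<string, bool> flags) => _flags = flags;

    public bool IsEnabled(string feature) => _flags.TryGetValue(feature, out var on) && on;
}

// Example logic under test: export format selection guarded by a flag
public static class ExportFormatSelector
{
    public static string Choose(ISimpleFlagSource flags) =>
        flags.IsEnabled("ExportFormat_Parquet") ? "Parquet" : "JSON";
}
```

A unit test then asserts both outcomes: once with the flag on (expect Parquet) and once with it off (expect JSON), so neither branch ships untested.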

Production:

  1. Conservative Rollouts: Start at 5-10%; monitor for 48+ hours before increasing.
  2. Monitor Metrics: Track error rate, latency, feature usage, and business metrics.
  3. Kill Switch Ready: Document rollback procedure; test kill switch in Staging.
  4. Audit Trail: Log all feature flag evaluations for compliance and debugging.

Operational:

  1. Centralized Management: Use Azure App Configuration UI or CLI; avoid hardcoded flags.
  2. Version Control: Store feature flag configurations in Git; sync via GitOps.
  3. Access Control: Restrict production feature flag changes to platform team only.
  4. Cleanup Policy: Remove feature flags >90 days after 100% rollout; delete legacy code paths.

Example Feature Flag Scenarios

Scenario 1: Gradual Feature Rollout

Feature: AI-Assisted Anomaly Detection

Rollout Plan:

Week 1: 5% (targeting: early-access group)
Week 2: 10% (percentage rollout)
Week 3: 25% (percentage rollout)
Week 4: 50% (percentage rollout)
Week 5: 100% (fully rolled out)
Week 12: Remove flag; delete legacy code

Implementation:

// Week 1: Targeting early access
{
  "AIAssistedAnomalyDetection": {
    "EnabledFor": [
      {
        "Name": "Microsoft.Targeting",
        "Parameters": {
          "Audience": {
            "Groups": [
              { "Name": "early-access", "RolloutPercentage": 100 }
            ],
            "DefaultRolloutPercentage": 0
          }
        }
      }
    ]
  }
}

// Week 2: 10% rollout
{
  "AIAssistedAnomalyDetection": {
    "EnabledFor": [
      {
        "Name": "Microsoft.Percentage",
        "Parameters": { "Value": 10 }
      }
    ]
  }
}

// Week 5: Fully rolled out
{
  "AIAssistedAnomalyDetection": true
}

Scenario 2: A/B Testing

Feature: New Export Format (JSON vs Parquet)

Test Setup:

// 50/50 split; the targeting filter gives each user a sticky variant
{
  "ExportFormat_Parquet": {
    "EnabledFor": [
      {
        "Name": "Microsoft.Targeting",
        "Parameters": {
          "Audience": {
            "DefaultRolloutPercentage": 50
          }
        }
      }
    ]
  }
}

Usage:

// A/B test: Export format selection
public async Task<ExportResult> ExportAuditEventsAsync(ExportRequest request)
{
    // Check which variant user gets
    if (await _featureManager.IsEnabledAsync("ExportFormat_Parquet"))
    {
        // Variant A: Parquet format
        _telemetry.TrackEvent("ExportFormat", new Dictionary<string, string>
        {
            ["Format"] = "Parquet",
            ["TenantId"] = request.TenantId,
            ["EventCount"] = request.EventCount.ToString()
        });

        return await ExportAsParquetAsync(request);
    }
    else
    {
        // Variant B: JSON format (control)
        _telemetry.TrackEvent("ExportFormat", new Dictionary<string, string>
        {
            ["Format"] = "JSON",
            ["TenantId"] = request.TenantId,
            ["EventCount"] = request.EventCount.ToString()
        });

        return await ExportAsJsonAsync(request);
    }
}
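
For the A/B test to produce clean data, variant assignment must be sticky: the same user should see the same format on every request. One common way to get stickiness is to hash the user id together with an experiment seed and bucket the result; a minimal sketch of that idea (illustrative only — `StickyBucket` is a hypothetical helper, not the built-in filter's implementation):

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class StickyBucket
{
    // Deterministically map (userId, seed) to a bucket in [0, 100)
    public static int Bucket(string userId, string seed)
    {
        using var sha = SHA256.Create();
        var hash = sha.ComputeHash(Encoding.UTF8.GetBytes($"{seed}:{userId}"));
        // Take the first 4 bytes as an unsigned integer, then reduce to a percentage bucket
        var value = BitConverter.ToUInt32(hash, 0);
        return (int)(value % 100);
    }

    // True if the caller falls into the enabled percentage for this experiment
    public static bool IsInRollout(string userId, string seed, int percentage) =>
        Bucket(userId, seed) < percentage;
}
```

Because the bucket depends only on the user id and seed, repeated calls always return the same variant, and changing the seed reshuffles users for a new experiment.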

Analysis (After 2 weeks):

// Compare export performance by format
customEvents
| where name == "ExportFormat"
| extend Format = tostring(customDimensions.Format)
| extend EventCount = toint(customDimensions.EventCount)
| join kind=inner (
    dependencies
    | where name == "ExportAuditEvents"
) on operation_Id
| summarize 
    AvgDuration = avg(duration),
    P95Duration = percentile(duration, 95),
    TotalExports = count()
  by Format
| project Format, AvgDuration, P95Duration, TotalExports

Scenario 3: Maintenance Mode

Feature: Enable read-only mode during maintenance

Configuration:

{
  "MaintenanceMode": {
    "EnabledFor": [
      {
        "Name": "Microsoft.TimeWindow",
        "Parameters": {
          "Start": "2025-11-02T02:00:00Z",
          "End": "2025-11-02T04:00:00Z",
          "Recurrence": {
            "Pattern": {
              "Type": "Weekly",
              "DaysOfWeek": ["Sunday"]
            },
            "Range": { "Type": "NoEnd" }
          }
        }
      }
    ]
  }
}

Usage:

// Maintenance mode middleware
public class MaintenanceModeMiddleware
{
    private readonly RequestDelegate _next;
    private readonly IFeatureManager _featureManager;

    public MaintenanceModeMiddleware(RequestDelegate next, IFeatureManager featureManager)
    {
        _next = next;
        _featureManager = featureManager;
    }

    public async Task InvokeAsync(HttpContext context)
    {
        if (await _featureManager.IsEnabledAsync("MaintenanceMode"))
        {
            // Only allow read operations
            if (context.Request.Method != "GET" && context.Request.Method != "HEAD")
            {
                context.Response.StatusCode = 503;
                await context.Response.WriteAsJsonAsync(new
                {
                    error = "Service Unavailable",
                    message = "System is in maintenance mode. Only read operations are allowed.",
                    retryAfter = 3600  // Seconds, per Retry-After semantics; avoids hardcoding a calendar date
                });
                return;
            }
        }

        await _next(context);
    }
}
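
The weekly window logic can be reasoned about with plain date arithmetic. A simplified sketch of the evaluation (UTC only, illustrative — not the actual time-window filter implementation):

```csharp
using System;

public static class WeeklyWindow
{
    // True when 'now' (UTC) falls on the given weekday between startTime and endTime
    public static bool IsInWindow(DateTimeOffset now, DayOfWeek day, TimeSpan startTime, TimeSpan endTime)
    {
        if (now.DayOfWeek != day) return false;
        var t = now.TimeOfDay;
        return t >= startTime && t < endTime;
    }
}
```

For a Sunday 02:00–04:00 UTC window, a request at Sunday 03:00 is inside the window and one at Sunday 05:00 is not; the middleware flips to read-only exactly for that span each week.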

Feature Flag Security & Compliance

Access Control (Azure App Configuration):

# Dev: Developers can modify feature flags
accessControl:
  - principal: developers-aad-group
    role: App Configuration Data Owner
    scope: /subscriptions/<sub-id>/resourceGroups/ATP-Dev-RG/providers/Microsoft.AppConfiguration/configurationStores/atp-appconfig-dev-eus

# Test: QA team read-only
accessControl:
  - principal: qa-team-aad-group
    role: App Configuration Data Reader
    scope: /subscriptions/<sub-id>/resourceGroups/ATP-Test-RG/providers/Microsoft.AppConfiguration/configurationStores/atp-appconfig-test-eus

# Production: Platform team only (no developers)
accessControl:
  - principal: platform-team-aad-group
    role: App Configuration Data Owner
    scope: /subscriptions/<sub-id>/resourceGroups/ATP-Prod-RG/providers/Microsoft.AppConfiguration/configurationStores/atp-appconfig-prod-eus

  - principal: atp-prod-managed-identity
    role: App Configuration Data Reader
    scope: /subscriptions/<sub-id>/resourceGroups/ATP-Prod-RG/providers/Microsoft.AppConfiguration/configurationStores/atp-appconfig-prod-eus

Audit Logging (App Configuration Diagnostic Settings):

# Enable diagnostic logging for Production App Configuration
az monitor diagnostic-settings create \
  --name atp-appconfig-prod-audit \
  --resource /subscriptions/<sub-id>/resourceGroups/ATP-Prod-RG/providers/Microsoft.AppConfiguration/configurationStores/atp-appconfig-prod-eus \
  --logs '[
    {
      "category": "HttpRequest",
      "enabled": true,
      "retentionPolicy": {
        "enabled": true,
        "days": 365
      }
    },
    {
      "category": "Audit",
      "enabled": true,
      "retentionPolicy": {
        "enabled": true,
        "days": 365
      }
    }
  ]' \
  --workspace /subscriptions/<sub-id>/resourceGroups/ATP-Prod-RG/providers/Microsoft.OperationalInsights/workspaces/atp-loganalytics-prod-eus

echo "✅ App Configuration audit logging enabled"

Summary

  • Azure App Configuration Per Environment: Dev (all features on), Test (stable only), Staging (canary testing), Production (conservative rollouts).
  • Feature Flag Filters: Boolean, Percentage, Targeting, Time Window, Custom filters for flexible feature control.
  • Precedence Rules: User targeting → Group targeting → Percentage → Time Window → Custom logic → Environment default.
  • Feature Management: Create, update, disable (kill switch), and cleanup stale flags with automated workflows.
  • Testing Strategies: Unit tests with mocking, integration tests with actual App Configuration, A/B testing with telemetry.
  • Monitoring & Analytics: Track feature flag evaluations, usage metrics, and performance impact for data-driven decisions.
  • Lifecycle Management: Feature progression from Development → Testing → Canary → General Availability → Deprecated → Removed.
  • Security & Compliance: Access control per environment, audit logging with 365-day retention for Production.

Networking & Security Boundaries

ATP enforces strict network isolation and graduated security controls across environments to ensure developer productivity in lower tiers while maintaining zero-trust security in Production. Network boundaries prevent cross-environment access, protect sensitive data, and enforce least-privilege network access aligned with each environment's security requirements.

This approach implements defense-in-depth networking with VNet isolation, Network Security Groups (NSGs), private endpoints, and Azure Firewall to create security zones that match the criticality of each environment.

Network Isolation Strategy

ATP uses a hybrid VNet strategy: lower environments (Dev/Test) share a VNet with subnet isolation, while higher environments (Staging/Production) have dedicated VNets with full network segmentation and private endpoint enforcement.

Network Topology Overview

graph TB
    subgraph SharedVNet["Shared VNet - Dev/Test"]
        DevSubnet[Dev Subnet<br/>10.0.1.0/24]
        TestSubnet[Test Subnet<br/>10.0.2.0/24]
        SharedServices[Shared Services<br/>10.0.3.0/24]
    end

    subgraph StagingVNet["Dedicated VNet - Staging"]
        StagingGateway[Gateway Subnet<br/>10.1.1.0/24]
        StagingServices[Services Subnet<br/>10.1.2.0/24]
        StagingData[Data Subnet<br/>10.1.3.0/24]
    end

    subgraph ProdVNet["Dedicated VNet - Production Primary"]
        ProdGateway[Gateway Subnet<br/>10.2.1.0/24]
        ProdAKS[AKS Subnet<br/>10.2.2.0/23]
        ProdData[Data Subnet<br/>10.2.3.0/24]
        ProdFirewall[Firewall Subnet<br/>10.2.4.0/26]
    end

    Internet((Internet)) --> DevSubnet
    Internet --> TestSubnet
    Internet -.X.-> StagingGateway
    Internet -.X.-> ProdGateway

    DevSubnet <--> SharedServices
    TestSubnet <--> SharedServices

    style DevSubnet fill:#90EE90
    style TestSubnet fill:#FFD700
    style StagingServices fill:#FFA500
    style ProdAKS fill:#FF6347

Network Isolation Per Environment

| Environment | VNet | Address Space | Subnets | NSG Rules | Public Access | Private Endpoints |
|---|---|---|---|---|---|---|
| Preview | Shared Preview VNet | 10.10.0.0/16 | Dynamic per PR (10.10.{PR-ID}.0/24) | Allow CI/CD agents | Yes | No |
| Dev | Shared ATP VNet | 10.0.0.0/16 | Dev: 10.0.1.0/24 | Allow developers, VPN | Yes (IP-whitelisted) | No |
| Test | Shared ATP VNet | 10.0.0.0/16 | Test: 10.0.2.0/24 | Allow CI/CD agents, QA | Yes (IP-whitelisted) | No |
| Staging | Dedicated Staging VNet | 10.1.0.0/16 | Gateway: 10.1.1.0/24<br>Services: 10.1.2.0/24<br>Data: 10.1.3.0/24 | Deny all by default | No | Yes (all data resources) |
| Production | Dedicated Production VNet | 10.2.0.0/16 | Gateway: 10.2.1.0/24<br>AKS: 10.2.2.0/23<br>Data: 10.2.3.0/24<br>Firewall: 10.2.4.0/26 | Zero-trust (deny all) | No | Yes (all resources) |
| Hotfix | Dedicated Hotfix VNet | 10.3.0.0/16 | Same as Production | Zero-trust | No | Yes |
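
The address plan above can be checked mechanically — for example, that every subnet prefix actually falls inside its VNet's address space and that environments never share a prefix. A minimal sketch using plain integer math on IPv4 prefixes (`Cidr` is a hypothetical helper, not part of the ATP codebase):

```csharp
using System;
using System.Net;

public static class Cidr
{
    // Parse "a.b.c.d/len" into (address as uint, prefix length)
    public static (uint Network, int Length) Parse(string cidr)
    {
        var parts = cidr.Split('/');
        var bytes = IPAddress.Parse(parts[0]).GetAddressBytes();
        var addr = ((uint)bytes[0] << 24) | ((uint)bytes[1] << 16) | ((uint)bytes[2] << 8) | bytes[3];
        return (addr, int.Parse(parts[1]));
    }

    // True when 'inner' is fully contained in 'outer'
    public static bool Contains(string outer, string inner)
    {
        var (oAddr, oLen) = Parse(outer);
        var (iAddr, iLen) = Parse(inner);
        if (iLen < oLen) return false;  // a wider prefix cannot fit inside a narrower one
        var mask = oLen == 0 ? 0u : uint.MaxValue << (32 - oLen);
        return (iAddr & mask) == (oAddr & mask);
    }
}
```

A check like this can run in CI against the IaC outputs, catching a mistyped subnet prefix before it reaches Azure.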

Dev Environment Networking

Purpose: Convenient access for developers with IP-whitelisted public endpoints and shared VNet for cost optimization.

VNet Configuration (Pulumi):

// Shared VNet for Dev + Test environments
var sharedVNet = new VirtualNetwork("atp-vnet-shared-eus", new VirtualNetworkArgs
{
    VirtualNetworkName = "atp-vnet-shared-eus",
    ResourceGroupName = sharedResourceGroup.Name,
    Location = "eastus",
    AddressSpace = new AddressSpaceArgs
    {
        AddressPrefixes = new[] { "10.0.0.0/16" }
    },
    Subnets = new[]
    {
        new SubnetArgs
        {
            Name = "Dev-Subnet",
            AddressPrefix = "10.0.1.0/24"
        },
        new SubnetArgs
        {
            Name = "Test-Subnet",
            AddressPrefix = "10.0.2.0/24"
        },
        new SubnetArgs
        {
            Name = "Shared-Services-Subnet",
            AddressPrefix = "10.0.3.0/24"
        }
    },
    Tags = tags
});

Network Security Group (Dev Subnet):

// Dev NSG - Allow developer access
var devNsg = new NetworkSecurityGroup("atp-nsg-dev-eus", new NetworkSecurityGroupArgs
{
    NetworkSecurityGroupName = "atp-nsg-dev-eus",
    ResourceGroupName = resourceGroup.Name,
    Location = "eastus",
    SecurityRules = new[]
    {
        // Allow HTTPS from developer IPs
        new SecurityRuleArgs
        {
            Name = "AllowDeveloperHTTPS",
            Priority = 100,
            Direction = "Inbound",
            Access = "Allow",
            Protocol = "Tcp",
            SourcePortRange = "*",
            DestinationPortRange = "443",
            SourceAddressPrefixes = new[]
            {
                "203.0.113.0/24",  // Developer office IP range
                "198.51.100.0/24",  // VPN gateway range
                "192.0.2.0/24"  // Home office IPs
            },
            DestinationAddressPrefix = "10.0.1.0/24"
        },

        // Allow SSH/RDP from VPN (jumpbox access)
        new SecurityRuleArgs
        {
            Name = "AllowVPNManagement",
            Priority = 110,
            Direction = "Inbound",
            Access = "Allow",
            Protocol = "Tcp",
            SourcePortRange = "*",
            DestinationPortRanges = new[] { "22", "3389" },
            SourceAddressPrefix = "198.51.100.0/24",  // VPN range
            DestinationAddressPrefix = "10.0.1.0/24"
        },

        // Allow all within subnet (service-to-service)
        new SecurityRuleArgs
        {
            Name = "AllowWithinSubnet",
            Priority = 120,
            Direction = "Inbound",
            Access = "Allow",
            Protocol = "*",
            SourcePortRange = "*",
            DestinationPortRange = "*",
            SourceAddressPrefix = "10.0.1.0/24",
            DestinationAddressPrefix = "10.0.1.0/24"
        },

        // Allow Azure Load Balancer health probes
        new SecurityRuleArgs
        {
            Name = "AllowAzureLoadBalancer",
            Priority = 130,
            Direction = "Inbound",
            Access = "Allow",
            Protocol = "*",
            SourcePortRange = "*",
            DestinationPortRange = "*",
            SourceAddressPrefix = "AzureLoadBalancer",
            DestinationAddressPrefix = "*"
        },

        // Deny all other inbound
        new SecurityRuleArgs
        {
            Name = "DenyAllInbound",
            Priority = 4096,
            Direction = "Inbound",
            Access = "Deny",
            Protocol = "*",
            SourcePortRange = "*",
            DestinationPortRange = "*",
            SourceAddressPrefix = "*",
            DestinationAddressPrefix = "*"
        }
    },
    Tags = tags
});

Public Network Access (Dev):

  • Enabled: Yes (IP-whitelisted for developer convenience).
  • Allowed IPs: Developer office IPs, VPN gateway, individual developer home IPs.
  • Purpose: Enable remote development and troubleshooting.
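
NSG rules are evaluated in ascending priority order and the first matching rule decides the outcome, which is why the DenyAllInbound rule sits at the lowest priority (4096). A simplified sketch of that semantics (an illustrative model, not Azure's implementation — `SecurityRule` here matches with an arbitrary predicate, whereas real NSGs match CIDR ranges, ports, protocols, and service tags):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Simplified NSG model: rules sorted by priority, first match decides
public sealed record SecurityRule(int Priority, bool Allow, Func<string, int, bool> Matches);

public static class NsgEvaluator
{
    public static bool IsAllowed(IEnumerable<SecurityRule> rules, string sourceIp, int port) =>
        rules.OrderBy(r => r.Priority)
             .FirstOrDefault(r => r.Matches(sourceIp, port))
             ?.Allow ?? false;  // no match: modeled here as deny
}
```

With the Dev rules above, an HTTPS request from the developer office range matches AllowDeveloperHTTPS at priority 100 and is admitted, while an arbitrary internet source falls through to DenyAllInbound.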

Test Environment Networking

Purpose: Controlled access for CI/CD agents and QA team with IP restrictions and shared VNet with Dev.

Network Security Group (Test Subnet):

// Test NSG - Allow test automation access
var testNsg = new NetworkSecurityGroup("atp-nsg-test-eus", new NetworkSecurityGroupArgs
{
    NetworkSecurityGroupName = "atp-nsg-test-eus",
    ResourceGroupName = resourceGroup.Name,
    Location = "eastus",
    SecurityRules = new[]
    {
        // Allow HTTPS from CI/CD agents
        new SecurityRuleArgs
        {
            Name = "AllowCICDAgents",
            Priority = 100,
            Direction = "Inbound",
            Access = "Allow",
            Protocol = "Tcp",
            SourcePortRange = "*",
            DestinationPortRange = "443",
            SourceAddressPrefixes = new[]
            {
                "20.62.134.0/24",  // Azure DevOps agent pool IP range
                "13.107.6.0/24"    // GitHub Actions runners
            },
            DestinationAddressPrefix = "10.0.2.0/24"
        },

        // Allow HTTPS from QA team IPs
        new SecurityRuleArgs
        {
            Name = "AllowQATeam",
            Priority = 110,
            Direction = "Inbound",
            Access = "Allow",
            Protocol = "Tcp",
            SourcePortRange = "*",
            DestinationPortRange = "443",
            SourceAddressPrefix = "203.0.113.0/24",  // QA team office
            DestinationAddressPrefix = "10.0.2.0/24"
        },

        // Allow test automation tools (Selenium Grid, API testing)
        new SecurityRuleArgs
        {
            Name = "AllowTestAutomation",
            Priority = 120,
            Direction = "Inbound",
            Access = "Allow",
            Protocol = "Tcp",
            SourcePortRange = "*",
            DestinationPortRange = "443",
            SourceAddressPrefix = "10.0.3.0/24",  // Shared services subnet (test runners)
            DestinationAddressPrefix = "10.0.2.0/24"
        },

        // Allow within subnet
        new SecurityRuleArgs
        {
            Name = "AllowWithinSubnet",
            Priority = 130,
            Direction = "Inbound",
            Access = "Allow",
            Protocol = "*",
            SourcePortRange = "*",
            DestinationPortRange = "*",
            SourceAddressPrefix = "10.0.2.0/24",
            DestinationAddressPrefix = "10.0.2.0/24"
        },

        // Deny all other inbound
        new SecurityRuleArgs
        {
            Name = "DenyAllInbound",
            Priority = 4096,
            Direction = "Inbound",
            Access = "Deny",
            Protocol = "*",
            SourcePortRange = "*",
            DestinationPortRange = "*",
            SourceAddressPrefix = "*",
            DestinationAddressPrefix = "*"
        }
    },
    Tags = tags
});

Public Network Access (Test):

  • Enabled: Yes (IP-whitelisted for CI/CD agents and QA team).
  • Allowed IPs: Azure DevOps agent pool IPs, GitHub Actions runners, QA team office.
  • Purpose: Enable automated testing and QA validation.

Staging Environment Networking

Purpose: Production-like security with dedicated VNet, private endpoints, and no public access for realistic security validation.

VNet Configuration (Pulumi):

// Dedicated VNet for Staging
var stagingVNet = new VirtualNetwork("atp-vnet-staging-eus", new VirtualNetworkArgs
{
    VirtualNetworkName = "atp-vnet-staging-eus",
    ResourceGroupName = stagingResourceGroup.Name,
    Location = "eastus",
    AddressSpace = new AddressSpaceArgs
    {
        AddressPrefixes = new[] { "10.1.0.0/16" }
    },
    Subnets = new[]
    {
        new SubnetArgs
        {
            Name = "Gateway-Subnet",
            AddressPrefix = "10.1.1.0/24",
            Delegations = new[]
            {
                new DelegationArgs
                {
                    Name = "AppGatewayDelegation",
                    ServiceName = "Microsoft.Network/applicationGateways"
                }
            }
        },
        new SubnetArgs
        {
            Name = "Services-Subnet",
            AddressPrefix = "10.1.2.0/24",
            ServiceEndpoints = new[]
            {
                new ServiceEndpointPropertiesFormatArgs { Service = "Microsoft.Sql" },
                new ServiceEndpointPropertiesFormatArgs { Service = "Microsoft.Storage" },
                new ServiceEndpointPropertiesFormatArgs { Service = "Microsoft.KeyVault" }
            }
        },
        new SubnetArgs
        {
            Name = "Data-Subnet",
            AddressPrefix = "10.1.3.0/24",
            PrivateEndpointNetworkPolicies = "Disabled",  // Required for private endpoints
            PrivateLinkServiceNetworkPolicies = "Enabled"
        }
    },
    Tags = tags
});

Network Security Group (Staging Services Subnet):

// Staging NSG - Deny all by default
var stagingNsg = new NetworkSecurityGroup("atp-nsg-staging-services-eus", new NetworkSecurityGroupArgs
{
    NetworkSecurityGroupName = "atp-nsg-staging-services-eus",
    ResourceGroupName = stagingResourceGroup.Name,
    Location = "eastus",
    SecurityRules = new[]
    {
        // Allow HTTPS from Application Gateway subnet only
        new SecurityRuleArgs
        {
            Name = "AllowAppGatewayHTTPS",
            Priority = 100,
            Direction = "Inbound",
            Access = "Allow",
            Protocol = "Tcp",
            SourcePortRange = "*",
            DestinationPortRange = "443",
            SourceAddressPrefix = "10.1.1.0/24",  // Gateway subnet
            DestinationAddressPrefix = "10.1.2.0/24"  // Services subnet
        },

        // Allow service-to-service within Services subnet
        new SecurityRuleArgs
        {
            Name = "AllowServiceToService",
            Priority = 110,
            Direction = "Inbound",
            Access = "Allow",
            Protocol = "Tcp",
            SourcePortRange = "*",
            DestinationPortRanges = new[] { "80", "443", "5672", "6379" },  // HTTP, HTTPS, RabbitMQ, Redis
            SourceAddressPrefix = "10.1.2.0/24",
            DestinationAddressPrefix = "10.1.2.0/24"
        },

        // Allow Services → Data subnet (private endpoints)
        new SecurityRuleArgs
        {
            Name = "AllowServicesToData",
            Priority = 120,
            Direction = "Inbound",
            Access = "Allow",
            Protocol = "Tcp",
            SourcePortRange = "*",
            DestinationPortRanges = new[] { "1433", "443", "5432" },  // SQL, HTTPS, PostgreSQL
            SourceAddressPrefix = "10.1.2.0/24",
            DestinationAddressPrefix = "10.1.3.0/24"
        },

        // Allow Azure Load Balancer
        new SecurityRuleArgs
        {
            Name = "AllowAzureLoadBalancer",
            Priority = 130,
            Direction = "Inbound",
            Access = "Allow",
            Protocol = "*",
            SourcePortRange = "*",
            DestinationPortRange = "*",
            SourceAddressPrefix = "AzureLoadBalancer",
            DestinationAddressPrefix = "*"
        },

        // Deny all other inbound (Zero-trust)
        new SecurityRuleArgs
        {
            Name = "DenyAllInbound",
            Priority = 4096,
            Direction = "Inbound",
            Access = "Deny",
            Protocol = "*",
            SourcePortRange = "*",
            DestinationPortRange = "*",
            SourceAddressPrefix = "*",
            DestinationAddressPrefix = "*"
        }
    },
    Tags = tags
});

Private Endpoints (Staging):

// Private endpoint for SQL Database
var sqlPrivateEndpoint = new PrivateEndpoint("atp-sql-pe-staging-eus", new PrivateEndpointArgs
{
    PrivateEndpointName = "atp-sql-pe-staging-eus",
    ResourceGroupName = stagingResourceGroup.Name,
    Location = "eastus",
    Subnet = new SubnetArgs
    {
        Id = dataSubnet.Id  // 10.1.3.0/24
    },
    PrivateLinkServiceConnections = new[]
    {
        new PrivateLinkServiceConnectionArgs
        {
            Name = "sql-connection",
            PrivateLinkServiceId = sqlServer.Id,
            GroupIds = new[] { "sqlServer" },
            PrivateLinkServiceConnectionState = new PrivateLinkServiceConnectionStateArgs
            {
                Status = "Approved",
                Description = "Auto-approved by Pulumi"
            }
        }
    },
    Tags = tags
});

// Private DNS Zone for SQL
var sqlPrivateDnsZone = new PrivateZone("privatelink-database-windows-net", new PrivateZoneArgs
{
    PrivateZoneName = "privatelink.database.windows.net",
    ResourceGroupName = stagingResourceGroup.Name,
    Location = "global",
    Tags = tags
});

// Link private DNS zone to VNet
var dnsZoneLink = new VirtualNetworkLink("sql-dns-link-staging", new VirtualNetworkLinkArgs
{
    VirtualNetworkLinkName = "atp-vnet-staging-link",
    ResourceGroupName = stagingResourceGroup.Name,
    PrivateZoneName = sqlPrivateDnsZone.Name,
    VirtualNetwork = new SubResourceArgs { Id = stagingVNet.Id },
    RegistrationEnabled = false,
    Location = "global",
    Tags = tags
});

// DNS record for private endpoint
var dnsRecord = new RecordSet("sql-dns-record-staging", new RecordSetArgs
{
    RecordSetName = sqlServer.Name,
    ResourceGroupName = stagingResourceGroup.Name,
    PrivateZoneName = sqlPrivateDnsZone.Name,
    RecordType = "A",
    Ttl = 3600,
    ARecords = new[]
    {
        new ARecordArgs
        {
            Ipv4Address = sqlPrivateEndpoint.CustomDnsConfigs.Apply(configs => configs[0].IpAddresses[0])
        }
    }
});

Public Network Access (Staging):

  • Disabled: All resources accessible only via private endpoints.
  • Exception: the Application Gateway retains a public IP for load-testing access.
  • Purpose: Validate production security posture; test private endpoint connectivity.
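
The "disabled public access" posture above can also be enforced at provisioning time rather than by policy alone. A minimal sketch in the same Pulumi style — the server arguments other than PublicNetworkAccess are illustrative placeholders, not the actual ATP resource definition:

```csharp
// Sketch: staging SQL server provisioned with public network access disabled,
// so the private endpoint above is the only ingress path.
// All argument values besides PublicNetworkAccess are illustrative.
var sqlServer = new Server("atp-sql-staging-eus", new ServerArgs
{
    ServerName = "atp-sql-staging-eus",
    ResourceGroupName = stagingResourceGroup.Name,
    Location = "eastus",
    PublicNetworkAccess = "Disabled",  // All access flows through the private endpoint
    MinimalTlsVersion = "1.2",
    Tags = tags
});
```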

Production Environment Networking

Purpose: Maximum security with dedicated VNet, Azure Firewall, private endpoints only, and zero-trust network access.

VNet Configuration (Pulumi):

// Dedicated VNet for Production
var prodVNet = new VirtualNetwork("atp-vnet-prod-eus", new VirtualNetworkArgs
{
    VirtualNetworkName = "atp-vnet-prod-eus",
    ResourceGroupName = prodResourceGroup.Name,
    Location = "eastus",
    AddressSpace = new AddressSpaceArgs
    {
        AddressPrefixes = new[] { "10.2.0.0/16" }
    },
    Subnets = new[]
    {
        new SubnetArgs
        {
            Name = "Gateway-Subnet",
            AddressPrefix = "10.2.1.0/24",
            Delegations = new[]
            {
                new DelegationArgs
                {
                    Name = "AppGatewayDelegation",
                    ServiceName = "Microsoft.Network/applicationGateways"
                }
            }
        },
        new SubnetArgs
        {
            Name = "AKS-Subnet",
            AddressPrefix = "10.2.2.0/23",  // /23 for AKS node scaling
            ServiceEndpoints = new[]
            {
                new ServiceEndpointPropertiesFormatArgs { Service = "Microsoft.ContainerRegistry" },
                new ServiceEndpointPropertiesFormatArgs { Service = "Microsoft.KeyVault" }
            }
        },
        new SubnetArgs
        {
            Name = "Data-Subnet",
            AddressPrefix = "10.2.3.0/24",
            PrivateEndpointNetworkPolicies = "Disabled",
            PrivateLinkServiceNetworkPolicies = "Enabled"
        },
        new SubnetArgs
        {
            Name = "AzureFirewallSubnet",  // Must be named exactly this
            AddressPrefix = "10.2.4.0/26"
        }
    },
    Tags = tags
});

Network Security Group (Production AKS Subnet):

// Production NSG - Zero-trust (deny all by default)
var prodNsg = new NetworkSecurityGroup("atp-nsg-prod-aks-eus", new NetworkSecurityGroupArgs
{
    NetworkSecurityGroupName = "atp-nsg-prod-aks-eus",
    ResourceGroupName = prodResourceGroup.Name,
    Location = "eastus",
    SecurityRules = new[]
    {
        // Allow inbound HTTPS from Application Gateway only
        new SecurityRuleArgs
        {
            Name = "AllowAppGatewayHTTPS",
            Priority = 100,
            Direction = "Inbound",
            Access = "Allow",
            Protocol = "Tcp",
            SourcePortRange = "*",
            DestinationPortRange = "443",
            SourceAddressPrefix = "10.2.1.0/24",  // App Gateway subnet
            DestinationAddressPrefix = "10.2.2.0/23"  // AKS subnet
        },

        // Allow AKS → Data subnet (private endpoints)
        new SecurityRuleArgs
        {
            Name = "AllowAKSToData",
            Priority = 110,
            Direction = "Outbound",
            Access = "Allow",
            Protocol = "Tcp",
            SourcePortRange = "*",
            DestinationPortRanges = new[] { "1433", "443", "5432", "6379" },
            SourceAddressPrefix = "10.2.2.0/23",
            DestinationAddressPrefix = "10.2.3.0/24"
        },

        // Allow AKS → Azure services (via service endpoints)
        new SecurityRuleArgs
        {
            Name = "AllowAKSToAzureServices",
            Priority = 120,
            Direction = "Outbound",
            Access = "Allow",
            Protocol = "Tcp",
            SourcePortRange = "*",
            DestinationPortRange = "443",
            SourceAddressPrefix = "10.2.2.0/23",
            DestinationAddressPrefixes = new[]
            {
                "AzureContainerRegistry",
                "AzureKeyVault",
                "AzureActiveDirectory"
            }
        },

        // Allow AKS internal communication (Kubernetes API server)
        new SecurityRuleArgs
        {
            Name = "AllowAKSInternal",
            Priority = 130,
            Direction = "Inbound",
            Access = "Allow",
            Protocol = "*",
            SourcePortRange = "*",
            DestinationPortRange = "*",
            SourceAddressPrefix = "10.2.2.0/23",
            DestinationAddressPrefix = "10.2.2.0/23"
        },

        // Deny all other traffic (Zero-trust)
        new SecurityRuleArgs
        {
            Name = "DenyAllInbound",
            Priority = 4096,
            Direction = "Inbound",
            Access = "Deny",
            Protocol = "*",
            SourcePortRange = "*",
            DestinationPortRange = "*",
            SourceAddressPrefix = "*",
            DestinationAddressPrefix = "*"
        },

        // Deny all outbound except explicitly allowed
        new SecurityRuleArgs
        {
            Name = "DenyAllOutbound",
            Priority = 4096,
            Direction = "Outbound",
            Access = "Deny",
            Protocol = "*",
            SourcePortRange = "*",
            DestinationPortRange = "*",
            SourceAddressPrefix = "*",
            DestinationAddressPrefix = "*"
        }
    },
    Tags = tags
});

Summary

  • Network Isolation: Shared VNet (Dev/Test) with subnet isolation, dedicated VNets (Staging/Production) with full segmentation.
  • VNet Topology: Dev (10.0.1.0/24), Test (10.0.2.0/24), Staging (10.1.0.0/16), Production (10.2.0.0/16) with gateway, AKS, data, and firewall subnets.
  • NSG Rules: Graduated from allow-developer-IPs (Dev) to deny-all-by-default (Production) with explicit allowlists.
  • Private Endpoints: None (Dev/Test), all data resources (Staging/Production) with private DNS zones for name resolution.
  • Public Access: Enabled with IP allowlisting (Dev/Test), disabled entirely (Staging/Production) except Application Gateway.
  • Azure Firewall: Production egress filtering with approved FQDN allowlist and threat intelligence.
  • Zero-Trust: Production enforces verify-explicitly, least-privilege, assume-breach with Istio mTLS and Kubernetes Network Policies.
  • Monitoring: NSG flow logs, firewall logs, traffic analytics, and Azure Sentinel for security visibility.
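
The egress-filtering bullet above relies on Azure Firewall application rules with an approved FQDN allowlist. As a hedged sketch of what that allowlist looks like in the same Pulumi style — the resource name, FQDNs, and exact argument types are assumptions, and the real rule set is maintained with the production firewall definition:

```csharp
// Sketch: Production egress allowlist via Azure Firewall application rules.
// FQDNs and names are illustrative; the args shape may differ by provider version.
var firewall = new AzureFirewall("atp-fw-prod-eus", new AzureFirewallArgs
{
    AzureFirewallName = "atp-fw-prod-eus",
    ResourceGroupName = prodResourceGroup.Name,
    Location = "eastus",
    ApplicationRuleCollections = new[]
    {
        new AzureFirewallApplicationRuleCollectionArgs
        {
            Name = "allow-approved-fqdns",
            Priority = 100,
            Action = new AzureFirewallRCActionArgs { Type = "Allow" },
            Rules = new[]
            {
                new AzureFirewallApplicationRuleArgs
                {
                    Name = "azure-services",
                    SourceAddresses = new[] { "10.2.2.0/23" },  // AKS subnet
                    TargetFqdns = new[] { "*.azurecr.io", "*.vault.azure.net", "login.microsoftonline.com" },
                    Protocols = new[]
                    {
                        new AzureFirewallApplicationRuleProtocolArgs { ProtocolType = "Https", Port = 443 }
                    }
                }
            }
        }
    },
    Tags = tags
});
```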

Observability Per Environment

ATP implements graduated observability across environments, balancing developer debugging needs (verbose logs, high trace sampling) with production cost optimization (warning-level logs, adaptive sampling). Each environment has tailored telemetry levels, sampling rates, and retention policies aligned with its operational requirements and budget constraints.

This strategy ensures comprehensive visibility for troubleshooting in lower environments while maintaining cost-effective observability in Production with intelligent sampling and long-term cold storage for compliance.

Observability Strategy Overview

graph LR
    subgraph Dev Environment
        DevApp[App Service] --> DevSeq[Seq Container<br/>Debug logs<br/>100% traces<br/>7-day retention]
    end

    subgraph Test Environment
        TestApp[App Service] --> TestSeq[Seq Container<br/>Info logs<br/>50% traces<br/>14-day retention]
    end

    subgraph Staging Environment
        StagingApp[App Service] --> StagingOtel[OTel Collector<br/>Warning logs<br/>25% traces]
        StagingOtel --> StagingLA[Log Analytics<br/>30-day hot]
        StagingOtel --> StagingAI[Application Insights<br/>Adaptive sampling]
    end

    subgraph Production Environment
        ProdPods[AKS Pods] --> ProdOtel[OTel Collector<br/>Warning logs<br/>10% traces]
        ProdOtel --> ProdProm[Prometheus<br/>Metrics aggregation]
        ProdOtel --> ProdLA[Log Analytics<br/>90-day hot]
        ProdLA --> ProdBlob[Blob Storage<br/>7-year cold]
        ProdOtel --> ProdAI[Application Insights<br/>Adaptive sampling]
        ProdProm --> ProdGrafana[Grafana<br/>Dashboards + Alerts]
    end

    style DevSeq fill:#90EE90
    style TestSeq fill:#FFD700
    style StagingLA fill:#FFA500
    style ProdGrafana fill:#FF6347

Telemetry Levels Per Environment

ATP uses graduated telemetry verbosity from verbose debugging (Dev) to optimized production observability.

| Environment | Log Level   | Trace Sampling      | Metric Collection | Retention (Hot)         | Retention (Cold) |
|-------------|-------------|---------------------|-------------------|-------------------------|------------------|
| Preview     | Debug       | 100%                | All metrics       | Ephemeral (PR lifetime) | None             |
| Dev         | Debug       | 100%                | All metrics       | 7 days                  | None             |
| Test        | Information | 50%                 | All metrics       | 14 days                 | None             |
| Staging     | Warning     | 25% (adaptive)      | All metrics       | 30 days                 | None             |
| Production  | Warning     | 10% (adaptive)      | All metrics       | 90 days                 | 7 years (blob)   |
| Hotfix      | Warning     | 10% (same as Prod)  | All metrics       | 90 days                 | 7 years          |

Rationale:

  • Dev (Debug, 100% sampling): Full visibility for rapid debugging; cost is minimal (low traffic).
  • Test (Info, 50% sampling): Sufficient for test validation; reduce noise from automated tests.
  • Staging (Warning, 25% sampling): Production-like telemetry; catch errors/warnings only; representative sampling.
  • Production (Warning, 10% sampling): Cost-optimized; adaptive sampling adjusts based on traffic; long-term cold storage for compliance.
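
The graduated rates above map directly onto the stock OpenTelemetry .NET samplers. A minimal sketch — the rates mirror the table, while the selection helper itself is illustrative rather than ATP's actual wiring:

```csharp
// Sketch: per-environment sampler selection for the rates in the table above.
// AlwaysOnSampler, TraceIdRatioBasedSampler, and ParentBasedSampler are the
// standard OpenTelemetry .NET samplers; the helper class is illustrative.
using OpenTelemetry.Trace;

public static class AtpSampling
{
    public static Sampler ForEnvironment(string environmentName) => environmentName switch
    {
        "Preview" or "Development" => new AlwaysOnSampler(),                      // 100%
        "Test"    => new TraceIdRatioBasedSampler(0.50),                          // 50%
        "Staging" => new ParentBasedSampler(new TraceIdRatioBasedSampler(0.25)),  // 25% base
        _         => new ParentBasedSampler(new TraceIdRatioBasedSampler(0.10))   // Production/Hotfix
    };
}

// Applied during tracer setup:
//   tracing.SetSampler(AtpSampling.ForEnvironment(env.EnvironmentName));
```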

Dev Environment Observability

Purpose: Maximum visibility for developers with verbose logging, 100% trace sampling, and immediate feedback.

Log Level: Debug (most verbose)

Telemetry Configuration (appsettings.Development.json):

{
  "Logging": {
    "LogLevel": {
      "Default": "Debug",
      "Microsoft": "Information",
      "Microsoft.Hosting.Lifetime": "Information",
      "System": "Information"
    }
  },

  "Seq": {
    "ServerUrl": "http://localhost:5341",
    "ApiKey": "",  // No auth in Dev
    "MinimumLevel": "Trace",
    "LevelOverride": {
      "Microsoft": "Warning",
      "System": "Warning"
    }
  },

  "OpenTelemetry": {
    "ServiceName": "atp-ingestion-dev",
    "ServiceVersion": "1.0.0-dev",
    "ExporterEndpoint": "http://localhost:4317",  // Local OTel collector (optional)
    "TracingSampler": {
      "Type": "AlwaysOn",  // 100% sampling
      "Probability": 1.0
    },
    "MetricsExportInterval": 10  // 10 seconds (frequent for dev)
  },

  "ApplicationInsights": {
    "ConnectionString": "",  // Disabled in Dev (use Seq instead)
    "EnableAdaptiveSampling": false,
    "EnableDependencyTracking": true,
    "EnablePerformanceCounterCollectionModule": true
  }
}

Seq Container (Docker Compose for Dev):

# docker-compose.dev.yml
version: '3.8'

services:
  seq:
    image: datalust/seq:latest
    container_name: atp-seq-dev
    ports:
      - "5341:80"
    environment:
      ACCEPT_EULA: "Y"
      SEQ_FIRSTRUN_ADMINUSERNAME: admin
      SEQ_FIRSTRUN_ADMINPASSWORDHASH: <bcrypt-hash>  # Change in production
    volumes:
      - seq-data:/data
    restart: unless-stopped

  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.97.0  # contrib build: includes the attributes processor used below
    container_name: atp-otel-dev
    command: ["--config=/etc/otel/config.yaml"]
    ports:
      - "4317:4317"  # OTLP gRPC
      - "4318:4318"  # OTLP HTTP
      - "8888:8888"  # Prometheus metrics (collector itself)
      - "13133:13133"  # Health check
    volumes:
      - ./otel-config-dev.yaml:/etc/otel/config.yaml
    restart: unless-stopped

volumes:
  seq-data:

OpenTelemetry Collector Configuration (Dev):

# otel-config-dev.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 10s
    send_batch_size: 1024

  # No sampling in Dev (100%)
  attributes:
    actions:
      - key: environment
        value: dev
        action: insert

exporters:
  logging:
    loglevel: debug  # Console output for dev debugging

  # Export to Seq (optional, for centralized logs)
  otlphttp/seq:
    endpoint: http://seq:80/ingest/otlp
    headers:
      X-Seq-ApiKey: ""

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, attributes]
      exporters: [logging]

    metrics:
      receivers: [otlp]
      processors: [batch, attributes]
      exporters: [logging]

    logs:
      receivers: [otlp]
      processors: [batch, attributes]
      exporters: [logging, otlphttp/seq]

C# Logging Configuration (Dev):

// Program.cs - Dev logging setup
public static IHostBuilder CreateHostBuilder(string[] args) =>
    Host.CreateDefaultBuilder(args)
        .ConfigureLogging((context, logging) =>
        {
            var env = context.HostingEnvironment;

            logging.ClearProviders();
            logging.AddConsole();  // Console output for local debugging

            if (env.IsDevelopment())
            {
                logging.SetMinimumLevel(LogLevel.Debug);

                // Add Seq for structured logging
                logging.AddSeq(context.Configuration.GetSection("Seq"));
            }
        })
        .ConfigureServices((context, services) =>
        {
            // OpenTelemetry instrumentation
            services.AddOpenTelemetry()
                .WithTracing(builder =>
                {
                    builder
                        .SetResourceBuilder(ResourceBuilder.CreateDefault()
                            .AddService("atp-ingestion-dev", "1.0.0-dev"))
                        .AddAspNetCoreInstrumentation(options =>
                        {
                            options.RecordException = true;
                            options.Filter = (httpContext) => true;  // Capture all requests
                        })
                        .AddHttpClientInstrumentation()
                        .AddSqlClientInstrumentation(options =>
                        {
                            options.SetDbStatementForText = true;  // Include SQL in traces
                            options.RecordException = true;
                        })
                        .AddOtlpExporter(options =>
                        {
                            options.Endpoint = new Uri("http://localhost:4317");
                            options.Protocol = OtlpExportProtocol.Grpc;
                        });
                })
                .WithMetrics(builder =>
                {
                    builder
                        .SetResourceBuilder(ResourceBuilder.CreateDefault()
                            .AddService("atp-ingestion-dev"))
                        .AddAspNetCoreInstrumentation()
                        .AddHttpClientInstrumentation()
                        .AddRuntimeInstrumentation()
                        .AddProcessInstrumentation()
                        .AddOtlpExporter((exporterOptions, readerOptions) =>
                        {
                            exporterOptions.Endpoint = new Uri("http://localhost:4317");
                            exporterOptions.Protocol = OtlpExportProtocol.Grpc;
                            // Export interval is a metric-reader setting, not an exporter setting
                            readerOptions.PeriodicExportingMetricReaderOptions.ExportIntervalMilliseconds = 10000;  // 10 seconds
                        });
                });
        })
        .ConfigureWebHostDefaults(webBuilder =>
        {
            webBuilder.UseStartup<Startup>();
        });

Dev Observability Benefits:

  • Instant Feedback: Console logs + Seq UI for real-time log viewing.
  • 100% Traces: No sampling; every request traced for debugging.
  • SQL Query Visibility: Full SQL statements in traces for query optimization.
  • No Cost Constraints: Local containers; unlimited logs/traces.

Test Environment Observability

Purpose: Balanced visibility for QA validation with Info-level logs and 50% sampling to reduce test automation noise.

Log Level: Information

Telemetry Configuration (appsettings.Test.json):

{
  "Logging": {
    "LogLevel": {
      "Default": "Information",
      "Microsoft": "Warning",
      "Microsoft.Hosting.Lifetime": "Information",
      "System": "Warning"
    }
  },

  "Seq": {
    "ServerUrl": "http://seq-test.connectsoft.local:5341",
    "ApiKey": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-test-eus.vault.azure.net/secrets/SeqApiKey)",
    "MinimumLevel": "Information",
    "LevelOverride": {
      "Microsoft": "Warning",
      "System": "Warning"
    }
  },

  "OpenTelemetry": {
    "ServiceName": "atp-ingestion-test",
    "ServiceVersion": "1.0.0-test",
    "ExporterEndpoint": "http://otel-collector-test.connectsoft.local:4317",
    "TracingSampler": {
      "Type": "TraceIdRatioBased",  // Consistent sampling
      "Probability": 0.5  // 50% sampling
    },
    "MetricsExportInterval": 30  // 30 seconds
  },

  "ApplicationInsights": {
    "ConnectionString": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-test-eus.vault.azure.net/secrets/AppInsightsConnectionString)",
    "EnableAdaptiveSampling": false,
    "SamplingPercentage": 50,
    "EnableDependencyTracking": true,
    "EnablePerformanceCounterCollectionModule": true
  }
}

OpenTelemetry Collector Configuration (Test):

# otel-config-test.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 30s
    send_batch_size: 2048

  # 50% sampling for Test
  probabilistic_sampler:
    sampling_percentage: 50

  attributes:
    actions:
      - key: environment
        value: test
        action: insert

  # Filter out health check traces
  filter:
    traces:
      span:
        - 'attributes["http.target"] == "/health"'
        - 'attributes["http.target"] == "/ready"'

exporters:
  logging:
    loglevel: info

  otlphttp/seq:
    endpoint: http://seq-test:80/ingest/otlp
    headers:
      X-Seq-ApiKey: ${SEQ_API_KEY}

  # Optional: Export to Application Insights
  azuremonitor:
    connection_string: ${APPLICATIONINSIGHTS_CONNECTION_STRING}

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, probabilistic_sampler, attributes, filter]
      exporters: [logging, azuremonitor]

    metrics:
      receivers: [otlp]
      processors: [batch, attributes]
      exporters: [logging, azuremonitor]

    logs:
      receivers: [otlp]
      processors: [batch, attributes]
      exporters: [logging, otlphttp/seq]

Test Observability Benefits:

  • QA Validation: Info-level logs sufficient for test pass/fail analysis.
  • 50% Sampling: Reduce telemetry volume from automated test runs.
  • Seq Integration: Centralized logs for test result analysis.
  • Application Insights: Optional export for trend analysis.

Staging Environment Observability

Purpose: Production-like observability with Warning-level logs, adaptive sampling (25%), and Azure Log Analytics for production validation.

Log Level: Warning

Telemetry Configuration (appsettings.Staging.json):

{
  "Logging": {
    "LogLevel": {
      "Default": "Warning",
      "Microsoft": "Error",
      "Microsoft.Hosting.Lifetime": "Information",
      "System": "Error"
    },
    "ApplicationInsights": {
      "LogLevel": {
        "Default": "Warning",
        "Microsoft": "Error"
      }
    }
  },

  "OpenTelemetry": {
    "ServiceName": "atp-ingestion-staging",
    "ServiceVersion": "${BUILD_VERSION}",
    "ExporterEndpoint": "http://otel-collector-staging.connectsoft.local:4317",
    "TracingSampler": {
      "Type": "ParentBased",  // Respect upstream sampling decisions
      "RootSampler": {
        "Type": "TraceIdRatioBased",
        "Probability": 0.25  // 25% base sampling
      }
    },
    "MetricsExportInterval": 60  // 60 seconds
  },

  "ApplicationInsights": {
    "ConnectionString": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-staging-eus.vault.azure.net/secrets/AppInsightsConnectionString)",
    "EnableAdaptiveSampling": true,
    "SamplingSettings": {
      "IsEnabled": true,
      "MaxTelemetryItemsPerSecond": 10,
      "EvaluationInterval": "00:00:15",
      "AdaptiveSamplingSettings": {
        "MaxTelemetryItemsPerSecond": 10,
        "InitialSamplingPercentage": 25,
        "MinSamplingPercentage": 10,
        "MaxSamplingPercentage": 50,
        "MovingAverageRatio": 0.25
      }
    },
    "EnableDependencyTracking": true,
    "EnablePerformanceCounterCollectionModule": true,
    "EnableEventCounterCollectionModule": true
  }
}

Azure Log Analytics Workspace (Pulumi):

// Staging Log Analytics Workspace
var stagingLogAnalytics = new Workspace("atp-loganalytics-staging-eus", new WorkspaceArgs
{
    WorkspaceName = "atp-loganalytics-staging-eus",
    ResourceGroupName = stagingResourceGroup.Name,
    Location = "eastus",
    Sku = new WorkspaceSkuArgs
    {
        Name = "PerGB2018"  // Pay-per-GB pricing
    },
    RetentionInDays = 30,  // 30-day hot retention
    PublicNetworkAccessForIngestion = "Enabled",
    PublicNetworkAccessForQuery = "Enabled",
    Tags = tags
});

// Application Insights for Staging
var stagingAppInsights = new Component("atp-appinsights-staging-eus", new ComponentArgs
{
    ResourceName = "atp-appinsights-staging-eus",
    ResourceGroupName = stagingResourceGroup.Name,
    Location = "eastus",
    ApplicationType = "web",
    Kind = "web",
    WorkspaceResourceId = stagingLogAnalytics.Id,
    RetentionInDays = 30,
    SamplingPercentage = 25,  // 25% sampling
    DisableIpMasking = false,  // Mask IP addresses for privacy
    Tags = tags
});

OpenTelemetry Collector Configuration (Staging):

# otel-config-staging.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 60s
    send_batch_size: 4096

  # Adaptive sampling (25% base)
  probabilistic_sampler:
    sampling_percentage: 25

  attributes:
    actions:
      - key: environment
        value: staging
        action: insert
      - key: deployment.environment
        value: staging
        action: insert

  # Filter health checks
  filter:
    traces:
      span:
        - 'attributes["http.target"] == "/health"'
        - 'attributes["http.target"] == "/ready"'
        - 'attributes["http.target"] == "/metrics"'

  # Resource detection (cloud environment)
  resourcedetection:
    detectors: [env, azure]
    timeout: 5s

exporters:
  logging:
    loglevel: warn  # Only warnings/errors to collector logs

  azuremonitor:
    connection_string: ${APPLICATIONINSIGHTS_CONNECTION_STRING}
    maxbatchsize: 1024
    maxbatchinterval: 10s

  # Export to Log Analytics workspace
  azureloganalytics:
    workspace_id: ${LOG_ANALYTICS_WORKSPACE_ID}
    workspace_key: ${LOG_ANALYTICS_WORKSPACE_KEY}
    resource_type: "Custom-OpenTelemetry"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, probabilistic_sampler, resourcedetection, attributes, filter]
      exporters: [azuremonitor]

    metrics:
      receivers: [otlp]
      processors: [batch, resourcedetection, attributes]
      exporters: [azuremonitor]

    logs:
      receivers: [otlp]
      processors: [batch, resourcedetection, attributes]
      exporters: [azureloganalytics, azuremonitor]

Staging Observability Benefits:

  • Production-Like: Same Warning log level and adaptive-sampling mechanism as Production (at a 25% vs. 10% base rate) for realistic testing.
  • Azure Log Analytics: Centralized log aggregation with KQL queries.
  • Application Insights: APM for performance profiling and dependency tracking.
  • Adaptive Sampling: Automatically adjusts based on traffic volume.
  • 30-Day Retention: Sufficient for staging validation and troubleshooting.

Production Environment Observability

Purpose: Cost-optimized observability with Warning-level logs, 10% adaptive sampling, Prometheus/Grafana for metrics, and long-term cold storage for compliance.

Log Level: Warning

Telemetry Configuration (appsettings.Production.json):

{
  "Logging": {
    "LogLevel": {
      "Default": "Warning",
      "Microsoft": "Error",
      "Microsoft.Hosting.Lifetime": "Warning",
      "System": "Error",
      "ConnectSoft.ATP": "Warning"
    },
    "ApplicationInsights": {
      "LogLevel": {
        "Default": "Warning",
        "Microsoft": "Error"
      }
    }
  },

  "OpenTelemetry": {
    "ServiceName": "atp-ingestion-prod",
    "ServiceVersion": "${BUILD_VERSION}",
    "ServiceInstanceId": "${HOSTNAME}",  // Pod name in AKS
    "ExporterEndpoint": "http://otel-collector.atp-prod.svc.cluster.local:4317",
    "TracingSampler": {
      "Type": "ParentBased",
      "RootSampler": {
        "Type": "TraceIdRatioBased",
        "Probability": 0.1  // 10% base sampling
      }
    },
    "MetricsExportInterval": 60,  // 60 seconds
    "Attributes": {
      "deployment.environment": "production",
      "service.namespace": "atp",
      "k8s.cluster.name": "atp-aks-prod-eus",
      "k8s.namespace.name": "atp-prod"
    }
  },

  "ApplicationInsights": {
    "ConnectionString": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-prod-eus.vault.azure.net/secrets/AppInsightsConnectionString)",
    "EnableAdaptiveSampling": true,
    "SamplingSettings": {
      "IsEnabled": true,
      "MaxTelemetryItemsPerSecond": 5,
      "EvaluationInterval": "00:00:15",
      "AdaptiveSamplingSettings": {
        "MaxTelemetryItemsPerSecond": 5,
        "InitialSamplingPercentage": 10,
        "MinSamplingPercentage": 5,
        "MaxSamplingPercentage": 25,
        "MovingAverageRatio": 0.25
      },
      "ExcludedTypes": "Event,Trace",  // Only sample requests/dependencies
      "IncludedTypes": "Request,Dependency,Exception"
    },
    "EnableDependencyTracking": true,
    "EnablePerformanceCounterCollectionModule": false,  // Use OTel metrics instead
    "EnableEventCounterCollectionModule": true,
    "EnableHeartbeat": true,
    "HeartbeatInterval": "00:15:00",  // 15 minutes
    "DisableIpMasking": false,  // Mask PII
    "EnableAuthenticationTrackingJavaScript": false  // Privacy
  },

  "Prometheus": {
    "Enabled": true,
    "Port": 9090,
    "Path": "/metrics",
    "ScrapeInterval": 15  // 15 seconds
  }
}
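
The "Prometheus" settings block above is served by the application itself. A hedged sketch using the OpenTelemetry Prometheus exporter for ASP.NET Core (package OpenTelemetry.Exporter.Prometheus.AspNetCore) — binding the settings section into these calls is left illustrative:

```csharp
// Sketch: exposing the Prometheus scrape endpoint described by the settings
// block above. The minimal-hosting shape here is illustrative; the exporter
// and scraping-endpoint calls are the standard OpenTelemetry ASP.NET Core APIs.
var builder = WebApplication.CreateBuilder(args);

builder.Services.AddOpenTelemetry()
    .WithMetrics(metrics => metrics
        .AddAspNetCoreInstrumentation()
        .AddRuntimeInstrumentation()
        .AddPrometheusExporter());  // Serves metrics in Prometheus exposition format

var app = builder.Build();

// Default scrape path is /metrics, matching "Path" in the settings block;
// Prometheus scrapes it on the configured 15-second interval.
app.UseOpenTelemetryPrometheusScrapingEndpoint();

app.Run();
```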

Azure Log Analytics Workspace (Production):

// Production Log Analytics Workspace (90-day hot + long-term cold)
var prodLogAnalytics = new Workspace("atp-loganalytics-prod-eus", new WorkspaceArgs
{
    WorkspaceName = "atp-loganalytics-prod-eus",
    ResourceGroupName = prodResourceGroup.Name,
    Location = "eastus",
    Sku = new WorkspaceSkuArgs
    {
        Name = "CapacityReservation",
        CapacityReservationLevel = 100  // 100 GB/day commitment pricing
    },
    RetentionInDays = 90,  // 90-day hot retention
    PublicNetworkAccessForIngestion = "Disabled",  // Private endpoint only
    PublicNetworkAccessForQuery = "Disabled",
    Tags = tags
});

// Data Export to Blob Storage (7-year cold retention)
var dataExportRule = new DataExport("atp-loganalytics-export-prod", new DataExportArgs
{
    DataExportName = "export-to-blob",
    ResourceGroupName = prodResourceGroup.Name,
    WorkspaceName = prodLogAnalytics.Name,
    TableNames = new[]
    {
        "AppTraces",
        "AppRequests",
        "AppDependencies",
        "AppExceptions",
        "AppMetrics"
    },
    Destination = new DestinationArgs
    {
        ResourceId = coldStorageAccount.Id,
        MetaData = new DestinationMetaDataArgs
        {
            EventHubName = ""  // Export to Storage Account, not Event Hub
        }
    },
    Enable = true
});

// Application Insights (Production)
var prodAppInsights = new Component("atp-appinsights-prod-eus", new ComponentArgs
{
    ResourceName = "atp-appinsights-prod-eus",
    ResourceGroupName = prodResourceGroup.Name,
    Location = "eastus",
    ApplicationType = "web",
    Kind = "web",
    WorkspaceResourceId = prodLogAnalytics.Id,
    RetentionInDays = 90,
    SamplingPercentage = 10,  // 10% sampling
    DisableIpMasking = false,  // Keep IP masking enabled for GDPR
    DisableLocalAuth = false,  // Allow instrumentation key (legacy)
    IngestionMode = "LogAnalytics",  // Route to Log Analytics workspace
    PublicNetworkAccessForIngestion = "Disabled",  // Private endpoint
    PublicNetworkAccessForQuery = "Disabled",
    Tags = tags
});

OpenTelemetry Collector Configuration (Production AKS):

# otel-collector-config-prod.yaml (Kubernetes ConfigMap)
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: atp-prod
data:
  otel-config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318

      # Prometheus scrape for service metrics
      prometheus:
        config:
          scrape_configs:
            - job_name: 'atp-services'
              kubernetes_sd_configs:
                - role: pod
                  namespaces:
                    names: [atp-prod]
              relabel_configs:
                - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
                  action: keep
                  regex: true
                - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
                  action: replace
                  target_label: __address__
                  regex: ([^:]+)(?::\d+)?;(\d+)
                  replacement: $1:$2

    processors:
      batch:
        timeout: 60s
        send_batch_size: 8192

      # Head-based sampling (10% base); the traces pipeline below applies tail_sampling instead
      probabilistic_sampler:
        hash_seed: 42
        sampling_percentage: 10

      # Tail sampling (keep all errors, sample successes)
      tail_sampling:
        policies:
          - name: errors-policy
            type: status_code
            status_code: {status_codes: [ERROR]}
          - name: slow-requests-policy
            type: latency
            latency: {threshold_ms: 2000}  # Keep requests >2s
          - name: random-ok-policy
            type: probabilistic
            probabilistic: {sampling_percentage: 5}  # 5% of successful requests

      attributes:
        actions:
          - key: environment
            value: production
            action: insert
          - key: deployment.environment
            value: production
            action: insert
          - key: k8s.cluster.name
            value: atp-aks-prod-eus
            action: insert

      # Filter health checks and metrics endpoints
      filter:
        traces:
          span:
            - 'attributes["http.target"] == "/health"'
            - 'attributes["http.target"] == "/ready"'
            - 'attributes["http.target"] == "/metrics"'
            - 'attributes["http.target"] == "/livez"'

      # Resource detection (Kubernetes, Azure)
      resourcedetection:
        detectors: [env, system, docker, azure, aks]
        timeout: 10s
        override: true

      # Add Kubernetes metadata
      k8sattributes:
        auth_type: "serviceAccount"
        passthrough: false
        extract:
          metadata:
            - k8s.namespace.name
            - k8s.deployment.name
            - k8s.pod.name
            - k8s.pod.uid
            - k8s.node.name
          labels:
            - tag_name: app
              key: app
              from: pod
            - tag_name: version
              key: version
              from: pod

    exporters:
      logging:
        loglevel: error  # Only errors in collector logs (reduce noise)

      # Export to Application Insights
      azuremonitor:
        connection_string: ${APPLICATIONINSIGHTS_CONNECTION_STRING}
        maxbatchsize: 2048
        maxbatchinterval: 30s

      # Export to Prometheus (for Grafana)
      prometheusremotewrite:
        endpoint: http://prometheus-server.atp-prod.svc.cluster.local:9090/api/v1/write
        tls:
          insecure: true
        retry_on_failure:
          enabled: true
          initial_interval: 5s
          max_interval: 30s

      # Export logs to Log Analytics
      azureloganalytics:
        workspace_id: ${LOG_ANALYTICS_WORKSPACE_ID}
        workspace_key: ${LOG_ANALYTICS_WORKSPACE_KEY}
        resource_type: "Custom-OpenTelemetry-Prod"

    # Define the extensions referenced below (service.extensions)
    extensions:
      health_check:
        endpoint: 0.0.0.0:13133
      pprof:
      zpages:

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch, tail_sampling, k8sattributes, resourcedetection, attributes, filter]
          exporters: [azuremonitor]

        metrics:
          receivers: [otlp, prometheus]
          processors: [batch, k8sattributes, resourcedetection, attributes]
          exporters: [azuremonitor, prometheusremotewrite]

        logs:
          receivers: [otlp]
          processors: [k8sattributes, resourcedetection, attributes, batch]
          exporters: [azureloganalytics, azuremonitor]

      extensions: [health_check, pprof, zpages]
      telemetry:
        logs:
          level: warn
        metrics:
          level: detailed
          address: 0.0.0.0:8888

OTel Collector Deployment (Kubernetes):

# otel-collector-deployment-prod.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
  namespace: atp-prod
spec:
  replicas: 3  # High availability
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
        version: 0.97.0
    spec:
      serviceAccountName: otel-collector
      containers:
      - name: otel-collector
        image: otel/opentelemetry-collector-contrib:0.97.0
        args: ["--config=/etc/otel/config.yaml"]  # use args (not command) so the image's entrypoint binary still runs
        ports:
        - containerPort: 4317  # OTLP gRPC
          name: otlp-grpc
        - containerPort: 4318  # OTLP HTTP
          name: otlp-http
        - containerPort: 8888  # Prometheus metrics (collector itself)
          name: metrics
        - containerPort: 13133  # Health check
          name: health
        env:
        - name: APPLICATIONINSIGHTS_CONNECTION_STRING
          valueFrom:
            secretKeyRef:
              name: atp-secrets
              key: applicationInsightsConnectionString
        - name: LOG_ANALYTICS_WORKSPACE_ID
          valueFrom:
            secretKeyRef:
              name: atp-secrets
              key: logAnalyticsWorkspaceId
        - name: LOG_ANALYTICS_WORKSPACE_KEY
          valueFrom:
            secretKeyRef:
              name: atp-secrets
              key: logAnalyticsWorkspaceKey
        volumeMounts:
        - name: otel-config
          mountPath: /etc/otel
          readOnly: true
        resources:
          requests:
            memory: "512Mi"
            cpu: "250m"
          limits:
            memory: "2Gi"
            cpu: "1000m"
        livenessProbe:
          httpGet:
            path: /
            port: 13133
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /
            port: 13133
          initialDelaySeconds: 10
          periodSeconds: 5
      volumes:
      - name: otel-config
        configMap:
          name: otel-collector-config
---
apiVersion: v1
kind: Service
metadata:
  name: otel-collector
  namespace: atp-prod
spec:
  selector:
    app: otel-collector
  ports:
  - name: otlp-grpc
    port: 4317
    targetPort: 4317
    protocol: TCP
  - name: otlp-http
    port: 4318
    targetPort: 4318
    protocol: TCP
  - name: metrics
    port: 8888
    targetPort: 8888
    protocol: TCP
  type: ClusterIP

Prometheus & Grafana (Production)

Prometheus Server (Kubernetes):

# prometheus-deployment-prod.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: atp-prod
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
      external_labels:
        cluster: atp-aks-prod-eus
        environment: production

    # Alertmanager configuration
    alerting:
      alertmanagers:
        - static_configs:
            - targets: ['alertmanager:9093']

    # Scrape configurations
    scrape_configs:
      # OTel Collector metrics
      - job_name: 'otel-collector'
        kubernetes_sd_configs:
          - role: service
            namespaces:
              names: [atp-prod]
        relabel_configs:
          - source_labels: [__meta_kubernetes_service_name]
            action: keep
            regex: otel-collector

      # ATP microservices (Prometheus endpoint)
      - job_name: 'atp-services'
        kubernetes_sd_configs:
          - role: pod
            namespaces:
              names: [atp-prod]
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            target_label: __address__
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)

      # Kubernetes node metrics
      - job_name: 'kubernetes-nodes'
        kubernetes_sd_configs:
          - role: node
        relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)

      # Kubernetes pod metrics
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
            namespaces:
              names: [atp-prod]

    # Alerting rules
    rule_files:
      - /etc/prometheus/alerts/*.yml
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: prometheus-server
  namespace: atp-prod
spec:
  serviceName: prometheus-server
  replicas: 1
  selector:
    matchLabels:
      app: prometheus-server
  template:
    metadata:
      labels:
        app: prometheus-server
    spec:
      serviceAccountName: prometheus
      containers:
      - name: prometheus
        image: prom/prometheus:v2.48.0
        args:
          - '--config.file=/etc/prometheus/prometheus.yml'
          - '--storage.tsdb.path=/prometheus'
          - '--storage.tsdb.retention.time=90d'  # 90-day retention
          - '--storage.tsdb.retention.size=100GB'
          - '--web.enable-lifecycle'
        ports:
        - containerPort: 9090
          name: web
        volumeMounts:
        - name: prometheus-config
          mountPath: /etc/prometheus
          readOnly: true
        - name: prometheus-storage
          mountPath: /prometheus
        resources:
          requests:
            memory: "4Gi"
            cpu: "1000m"
          limits:
            memory: "8Gi"
            cpu: "2000m"
      volumes:
      - name: prometheus-config
        configMap:
          name: prometheus-config
  volumeClaimTemplates:
  - metadata:
      name: prometheus-storage
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: "managed-premium"
      resources:
        requests:
          storage: 100Gi

Grafana Deployment (Kubernetes):

# grafana-deployment-prod.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: atp-prod
spec:
  replicas: 2
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
      - name: grafana
        image: grafana/grafana:10.2.0
        ports:
        - containerPort: 3000
          name: web
        env:
        - name: GF_SERVER_ROOT_URL
          value: "https://grafana.atp.connectsoft.com"
        - name: GF_AUTH_GENERIC_OAUTH_ENABLED
          value: "true"
        - name: GF_AUTH_GENERIC_OAUTH_CLIENT_ID
          valueFrom:
            secretKeyRef:
              name: grafana-secrets
              key: oauth-client-id
        - name: GF_AUTH_GENERIC_OAUTH_CLIENT_SECRET
          valueFrom:
            secretKeyRef:
              name: grafana-secrets
              key: oauth-client-secret
        - name: GF_DATABASE_TYPE
          value: "postgres"
        - name: GF_DATABASE_HOST
          valueFrom:
            secretKeyRef:
              name: grafana-secrets
              key: db-host
        volumeMounts:
        - name: grafana-storage
          mountPath: /var/lib/grafana
        - name: grafana-datasources
          mountPath: /etc/grafana/provisioning/datasources
        - name: grafana-dashboards
          mountPath: /etc/grafana/provisioning/dashboards
        resources:
          requests:
            memory: "512Mi"
            cpu: "250m"
          limits:
            memory: "2Gi"
            cpu: "1000m"
      volumes:
      - name: grafana-storage
        persistentVolumeClaim:
          claimName: grafana-pvc
      - name: grafana-datasources
        configMap:
          name: grafana-datasources
      - name: grafana-dashboards
        configMap:
          name: grafana-dashboards

Grafana Datasources (ConfigMap):

# grafana-datasources-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: atp-prod
data:
  datasources.yaml: |
    apiVersion: 1

    datasources:
      - name: Prometheus
        type: prometheus
        access: proxy
        url: http://prometheus-server:9090
        isDefault: true
        editable: false
        jsonData:
          timeInterval: "15s"

      - name: Azure Log Analytics
        type: grafana-azure-monitor-datasource
        access: proxy
        jsonData:
          subscriptionId: ${AZURE_SUBSCRIPTION_ID}
          tenantId: ${AZURE_TENANT_ID}
          clientId: ${AZURE_CLIENT_ID}
          cloudName: azuremonitor
          azureLogAnalyticsSameAs: true
          logAnalyticsDefaultWorkspace: ${LOG_ANALYTICS_WORKSPACE_ID}
        secureJsonData:
          clientSecret: ${AZURE_CLIENT_SECRET}

      - name: Application Insights
        type: grafana-azure-monitor-datasource
        access: proxy
        jsonData:
          subscriptionId: ${AZURE_SUBSCRIPTION_ID}
          tenantId: ${AZURE_TENANT_ID}
          clientId: ${AZURE_CLIENT_ID}
          cloudName: azuremonitor
          appInsightsAppId: ${APPLICATIONINSIGHTS_APP_ID}
        secureJsonData:
          appInsightsApiKey: ${APPLICATIONINSIGHTS_API_KEY}

Log Aggregation & Retention

Dev/Test: Seq Containers

Purpose: Local log viewing with structured logging and ephemeral retention.

  • Dev: Local Docker Compose; 7-day retention.
  • Test: Shared Seq instance; 14-day retention; API key authentication.

Staging: Azure Log Analytics

Purpose: Production-like log aggregation with KQL queries and 30-day retention.

  • Log Analytics Workspace: Centralized log store.
  • Retention: 30 days (hot storage).
  • Cost: ~$2.30/GB ingested.

Production: Multi-Tier Retention

Purpose: Cost-optimized log retention with 90-day hot storage and 7-year cold storage for compliance.

Retention Tiers:

Hot Storage (Log Analytics): 90 days
  ↓ (automated export)
Cold Storage (Blob - Cool tier): 7 years
  ↓ (automated lifecycle policy)
Archive Storage (Blob - Archive tier): Indefinite (compliance hold)

Blob Storage Lifecycle Policy:

{
  "rules": [
    {
      "enabled": true,
      "name": "MoveLogsToCoolAfter90Days",
      "type": "Lifecycle",
      "definition": {
        "filters": {
          "blobTypes": ["blockBlob"],
          "prefixMatch": ["logs/"]
        },
        "actions": {
          "baseBlob": {
            "tierToCool": {
              "daysAfterModificationGreaterThan": 90
            },
            "tierToArchive": {
              "daysAfterModificationGreaterThan": 2555  // 7 years
            }
          }
        }
      }
    }
  ]
}

Cold Storage Cost (Production):

Hot (Log Analytics): ~$2.30/GB/month (90 days)
Cool (Blob): ~$0.01/GB/month (7 years)
Archive (Blob): ~$0.002/GB/month (indefinite)

Example: 10 GB/day logs
  - Hot: 900 GB × $2.30 = $2,070/month
  - Cool: ~25,550 GB × $0.01 ≈ $256/month
  - Archive: accumulates indefinitely at $0.002/GB/month (negligible for years)
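The tier math above can be reproduced with a short calculation (an illustrative sketch; the per-GB rates are the approximate list prices quoted in this section, and a 365-day year is assumed):

```python
# Sketch: monthly log-retention cost across hot/cool tiers,
# using the approximate per-GB rates quoted above.

def retention_costs(gb_per_day: float,
                    hot_days: int = 90,
                    cool_years: int = 7,
                    hot_rate: float = 2.30,    # Log Analytics, $/GB/month
                    cool_rate: float = 0.01):  # Blob Cool tier, $/GB/month
    hot_gb = gb_per_day * hot_days
    cool_gb = gb_per_day * 365 * cool_years
    return hot_gb * hot_rate, cool_gb * cool_rate

hot, cool = retention_costs(10)  # 10 GB/day, as in the example above
print(f"Hot: ${hot:,.0f}/month, Cool: ${cool:,.0f}/month")
```

Archive-tier cost is omitted because it depends on how much history has accumulated.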

Summary

  • Telemetry Levels: Graduated from Debug/100% sampling (Dev) to Warning/10% adaptive sampling (Production).
  • Dev: Seq containers, 100% traces, 7-day retention, unlimited logs for debugging.
  • Test: Seq + Application Insights, 50% sampling, 14-day retention, QA validation focus.
  • Staging: Application Insights + Log Analytics, 25% adaptive sampling, 30-day retention, production-like observability.
  • Production: OTel Collector → Prometheus/Grafana + Application Insights + Log Analytics, 10% adaptive sampling with tail sampling (errors/slow requests), 90-day hot + 7-year cold retention.
  • Prometheus & Grafana: Production metrics aggregation with custom dashboards and alerting.
  • Cost Optimization: Adaptive sampling, tail sampling (keep errors), multi-tier retention (hot/cool/archive).

Cost Management & Optimization

ATP implements FinOps principles across all environments to balance operational requirements with cost efficiency. Each environment has tailored cost profiles, optimization strategies, and cost governance controls aligned with its criticality and usage patterns.

This approach ensures predictable spending, cost transparency, and continuous optimization through automated shutdown schedules, reserved capacity, and granular cost tracking per environment, service, and team.

Environment Cost Profiles

ATP's cost model is graduated by environment with Dev/Test optimized for minimal cost and Production optimized for reliability within budget constraints.

| Environment | Monthly Budget | Primary Compute | SKU Tier | Scaling Strategy | Monitoring Cost | Total Est. Monthly |
|---|---|---|---|---|---|---|
| Preview | $100 | Azure Container Instances | Dynamic | Per-PR ephemeral | N/A | $50-150 (variable) |
| Dev | $500 | App Service | Basic B1 (1 vCPU, 1.75 GB) | Fixed (1 instance) | Basic alerts | $400-600 |
| Test | $1,000 | App Service | Standard S1 (1 vCPU, 1.75 GB) | Fixed (2 instances) | Basic alerts | $900-1,200 |
| Staging | $3,000 | App Service | Premium P1v2 (1 vCPU, 3.5 GB) | Autoscale (2-5) | Enhanced alerts | $2,500-3,500 |
| Production | $10,000 | AKS (3-10 nodes) | Standard_D4s_v5 (4 vCPU, 16 GB) | Autoscale (3-10 nodes) | Full observability | $8,000-12,000 |
| Hotfix | $500 | App Service (on-demand) | Premium P1v3 (2 vCPU, 8 GB) | Fixed (1 instance) | Basic alerts | $0-500 (as-needed) |

Cost Profile Rationale:

  • Dev ($500): Cost-minimized with Basic SKU; shutdown evenings/weekends (40% savings).
  • Test ($1,000): Standard SKU for stable performance; 2 instances for load testing; shutdown nights.
  • Staging ($3,000): Premium SKU for production-like validation; autoscaling; always-on.
  • Production ($10,000): AKS for enterprise-grade scalability; reserved instances (20% savings); 99.9% SLA.
  • Hotfix ($500): On-demand deployment only when needed; deleted after hotfix deployment.

Detailed Cost Breakdown

Dev Environment Monthly Costs

Compute (App Service Basic B1 × 1):          $13/month × 0.6 (shutdown discount assumption) ≈ $8/month
SQL Database (Basic - 5 DTU):                $5/month
Redis Cache (Basic C0 - 250 MB):             $16/month
Service Bus (Basic):                         $5/month
Storage (LRS - 100 GB):                      $2/month
Key Vault (transactions):                    $1/month
Bandwidth:                                   $5/month
---
Subtotal:                                    $42/month

Actual with shutdown automation:             ~$25/month per service × 7 services = $175/month
Dev shared infrastructure:                   $300/month (VNet, NSG, Seq, etc.)
---
Total Dev Environment:                       $475/month

Cost Optimization (Dev):

  • Shutdown Schedule: Stop 6 PM - 8 AM weekdays, all weekend → ~40% compute savings.
  • Shared Resources: VNet, NSG, Seq shared between Dev/Test → split cost.
  • Basic SKUs: Minimum viable performance for development.


Test Environment Monthly Costs

Compute (App Service Standard S1 × 2):       $70/month × 0.7 (70% uptime) × 2 = $98/month
SQL Database (Standard S1 - 20 DTU):         $30/month
Redis Cache (Standard C1 - 1 GB):            $75/month
Service Bus (Standard):                      $10/month
Storage (LRS - 500 GB):                      $10/month
Key Vault (transactions):                    $2/month
Application Insights (5 GB/month):           $12/month
Bandwidth:                                   $10/month
---
Subtotal:                                    $247/month

Actual with shutdown automation:             ~$35/month per service × 7 services = $245/month
Test shared infrastructure:                  $400/month
CI/CD agent costs:                           $200/month
---
Total Test Environment:                      $845/month

Cost Optimization (Test):

  • Shutdown Schedule: Stop 10 PM - 6 AM weekdays, all weekend → ~30% compute savings.
  • Standard SKUs: Sufficient for automated testing; reliable performance.
  • Shared Seq Instance: Single Seq container for all test services.


Staging Environment Monthly Costs

Compute (App Service Premium P1v2 × 3):      $146/month × 3 = $438/month
SQL Database (Premium P1 - 125 DTU):         $465/month
Redis Cache (Premium P1 - 6 GB):             $250/month
Service Bus (Premium - 1 messaging unit):    $677/month
Storage (GRS - 1 TB):                        $40/month
Key Vault (HSM - 10 keys):                   $125/month
Application Insights (15 GB/month):          $35/month
Log Analytics (30-day retention):            $70/month
Private Endpoints (5 × $7):                  $35/month
Application Gateway (v2):                    $125/month
Bandwidth:                                   $50/month
---
Total Staging Environment:                   $2,310/month
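As a sanity check, the staging estimate is simply the sum of its line items (a small sketch using the figures above):

```python
# Sketch: verify the staging monthly total is the sum of its line items.
staging = {
    "App Service Premium P1v2 x3": 438,
    "SQL Database Premium P1": 465,
    "Redis Cache Premium P1": 250,
    "Service Bus Premium": 677,
    "Storage (GRS, 1 TB)": 40,
    "Key Vault (HSM)": 125,
    "Application Insights": 35,
    "Log Analytics": 70,
    "Private Endpoints": 35,
    "Application Gateway v2": 125,
    "Bandwidth": 50,
}
total = sum(staging.values())
print(f"Staging total: ${total:,}/month")
```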

Cost Optimization (Staging):

  • Always-On: No shutdown (production validation requires 24/7 availability).
  • Reserved Instances: 1-year commitment → 20% savings (~$460/year).
  • Private Endpoints: Shared across data resources.
  • Geo-Redundant Storage: Balance cost vs. reliability for production-like testing.


Production Environment Monthly Costs

AKS Cluster (3-10 nodes, Standard_D4s_v5):
  - System pool (3 nodes, always on):        $200/month × 3 = $600/month
  - User pool (3-7 nodes, autoscale):        $200/month × 5 avg = $1,000/month
SQL Database (Premium P4 - 500 DTU):         $1,860/month
Cosmos DB (1000 RU/s provisioned):           $730/month
Redis Cache (Premium P3 - 26 GB):            $1,037/month
Service Bus (Premium - 4 messaging units):   $2,708/month
Storage (GRS + WORM - 10 TB):                $500/month (hot) + $100/month (cool)
Key Vault (HSM - 50 keys):                   $625/month
Application Insights (50 GB/month):          $115/month
Log Analytics (90-day retention, 30 GB/day): $900/month
Private Endpoints (10 × $7):                 $70/month
Application Gateway (v2 with WAF):           $250/month
Azure Firewall (Premium):                    $625/month
DDoS Protection Standard:                    $2,944/month
Bandwidth (outbound - 1 TB):                 $90/month
ACR (Premium - geo-replication):             $30/month
Prometheus + Grafana (self-hosted on AKS):   $50/month (storage only)
---
Total Production Environment:                $14,234/month

Reserved Instance Savings (1-year):          -$2,400/month (20% on compute/database)
---
Net Production Monthly Cost:                 $11,834/month

Cost Optimization (Production):

  • Reserved Instances: 1-year commitment for AKS nodes, SQL, Cosmos DB → 20-30% savings.
  • Autoscaling: Scale down to 3 nodes during low-traffic hours → save ~$400/month.
  • Storage Lifecycle: Auto-transition to cool tier after 90 days → save ~$300/month.
  • DDoS Protection: Shared across all public endpoints in subscription.
  • Application Insights Sampling: 10% adaptive sampling → reduce ingestion by 90%.


Cost Optimization Strategies

ATP implements automated cost optimization across all environments using Azure Policy, automation scripts, and IaC overlays.

Automated Shutdown Schedules

Purpose: Reduce compute costs in Dev/Test environments by shutting down resources during non-business hours.

Dev Environment Shutdown Schedule:

# Shutdown Schedule (Dev)
schedule:
  weekdays:
    shutdown: "18:00"  # 6 PM local time
    startup: "08:00"   # 8 AM local time
    timezone: "Eastern Standard Time"
  weekends:
    shutdown: "18:00 Friday"
    startup: "08:00 Monday"
    timezone: "Eastern Standard Time"

expectedUptime: ~30%   # 50 of 168 hours/week (8 AM-6 PM, weekdays only)
estimatedSavings: ~40%  # ~$8/month vs $13/month per App Service plan
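The wall-clock uptime implied by this schedule can be checked in a few lines (a sketch assuming the 8 AM-6 PM weekday window above):

```python
# Sketch: wall-clock uptime implied by a weekday-only running window.
RUN_HOURS_PER_DAY = 18 - 8     # startup 08:00, shutdown 18:00
WEEKDAYS = 5                   # fully stopped on weekends
weekly_uptime_hours = RUN_HOURS_PER_DAY * WEEKDAYS
uptime_fraction = weekly_uptime_hours / (24 * 7)
print(f"{weekly_uptime_hours} h/week running "
      f"({uptime_fraction:.0%} wall-clock uptime)")
```

Note that a stopped App Service still accrues App Service plan charges, so realized billing savings are lower than the wall-clock downtime alone would suggest.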

Azure Automation Runbook (Shutdown):

<#
.SYNOPSIS
    Shutdown ATP Dev environment resources during non-business hours.
.DESCRIPTION
    Stops App Services, VMs, and SQL Databases in Dev environment to reduce costs.
    Scheduled to run at 6 PM ET weekdays and 6 PM Friday for weekend.
#>

param(
    [Parameter(Mandatory=$false)]
    [string]$ResourceGroupName = "ConnectSoft-ATP-Dev-EUS-RG"
)

# Authenticate using Managed Identity
Connect-AzAccount -Identity

Write-Output "Starting shutdown sequence for Dev environment: $ResourceGroupName"

# Stop all App Services
$appServices = Get-AzWebApp -ResourceGroupName $ResourceGroupName
foreach ($app in $appServices) {
    Write-Output "Stopping App Service: $($app.Name)"
    Stop-AzWebApp -ResourceGroupName $ResourceGroupName -Name $app.Name
}

# Pause SQL Databases
# NOTE: Suspend-AzSqlDatabase applies only to Data Warehouse (dedicated SQL pool)
# editions; Basic/Standard DTU databases cannot be paused (serverless tiers auto-pause).
$sqlServers = Get-AzSqlServer -ResourceGroupName $ResourceGroupName
foreach ($server in $sqlServers) {
    $databases = Get-AzSqlDatabase -ServerName $server.ServerName -ResourceGroupName $ResourceGroupName
    foreach ($db in $databases) {
        if ($db.DatabaseName -ne "master") {
            Write-Output "Pausing SQL Database: $($db.DatabaseName)"
            Suspend-AzSqlDatabase -ResourceGroupName $ResourceGroupName -ServerName $server.ServerName -DatabaseName $db.DatabaseName
        }
    }
}

# Stop VMs (if any for Dev jumpbox)
$vms = Get-AzVM -ResourceGroupName $ResourceGroupName
foreach ($vm in $vms) {
    Write-Output "Stopping VM: $($vm.Name)"
    Stop-AzVM -ResourceGroupName $ResourceGroupName -Name $vm.Name -Force
}

# Stop ACI instances (Preview environments)
$containers = Get-AzContainerGroup -ResourceGroupName $ResourceGroupName
foreach ($container in $containers) {
    Write-Output "Stopping Container Instance: $($container.Name)"
    Stop-AzContainerGroup -ResourceGroupName $ResourceGroupName -Name $container.Name
}

Write-Output "Dev environment shutdown complete. Estimated monthly savings: ~40%"

Azure Automation Runbook (Startup):

<#
.SYNOPSIS
    Startup ATP Dev environment resources during business hours.
.DESCRIPTION
    Starts App Services, VMs, and SQL Databases in Dev environment.
    Scheduled to run at 8 AM ET weekdays.
#>

param(
    [Parameter(Mandatory=$false)]
    [string]$ResourceGroupName = "ConnectSoft-ATP-Dev-EUS-RG"
)

Connect-AzAccount -Identity

Write-Output "Starting startup sequence for Dev environment: $ResourceGroupName"

# Start App Services
$appServices = Get-AzWebApp -ResourceGroupName $ResourceGroupName
foreach ($app in $appServices) {
    Write-Output "Starting App Service: $($app.Name)"
    Start-AzWebApp -ResourceGroupName $ResourceGroupName -Name $app.Name
}

# Resume SQL Databases (Data Warehouse/dedicated SQL pool editions only)
$sqlServers = Get-AzSqlServer -ResourceGroupName $ResourceGroupName
foreach ($server in $sqlServers) {
    $databases = Get-AzSqlDatabase -ServerName $server.ServerName -ResourceGroupName $ResourceGroupName
    foreach ($db in $databases) {
        if ($db.DatabaseName -ne "master" -and $db.Status -eq "Paused") {
            Write-Output "Resuming SQL Database: $($db.DatabaseName)"
            Resume-AzSqlDatabase -ResourceGroupName $ResourceGroupName -ServerName $server.ServerName -DatabaseName $db.DatabaseName
        }
    }
}

# Start VMs
$vms = Get-AzVM -ResourceGroupName $ResourceGroupName -Status
foreach ($vm in $vms | Where-Object {$_.PowerState -eq "VM deallocated"}) {
    Write-Output "Starting VM: $($vm.Name)"
    Start-AzVM -ResourceGroupName $ResourceGroupName -Name $vm.Name
}

Write-Output "Dev environment startup complete."

Azure Automation Schedule:

# Create Automation Account
az automation account create \
  --name "atp-automation-eus" \
  --resource-group "ConnectSoft-ATP-Shared-RG" \
  --location "eastus" \
  --sku "Basic" \
  --tags Environment=Shared Purpose=CostOptimization

# Enable Managed Identity
az automation account update \
  --name "atp-automation-eus" \
  --resource-group "ConnectSoft-ATP-Shared-RG" \
  --assign-identity

# Create Shutdown Schedule (Weekdays 6 PM)
az automation schedule create \
  --automation-account-name "atp-automation-eus" \
  --resource-group "ConnectSoft-ATP-Shared-RG" \
  --name "Dev-Shutdown-Weekdays" \
  --frequency "Week" \
  --interval 1 \
  --start-time "2025-01-01T18:00:00-05:00" \
  --time-zone "Eastern Standard Time" \
  --week-days Monday Tuesday Wednesday Thursday Friday

# Create Startup Schedule (Weekdays 8 AM)
az automation schedule create \
  --automation-account-name "atp-automation-eus" \
  --resource-group "ConnectSoft-ATP-Shared-RG" \
  --name "Dev-Startup-Weekdays" \
  --frequency "Week" \
  --interval 1 \
  --start-time "2025-01-01T08:00:00-05:00" \
  --time-zone "Eastern Standard Time" \
  --week-days Monday Tuesday Wednesday Thursday Friday

echo "✅ Automation schedules created; Dev environment will shutdown/startup automatically"

Test Environment Shutdown Schedule (10 PM - 6 AM):

schedule:
  weekdays:
    shutdown: "22:00"  # 10 PM (after automated test runs)
    startup: "06:00"   # 6 AM (before CI/CD starts)
  weekends:
    shutdown: "22:00 Friday"
    startup: "06:00 Monday"

expectedUptime: ~70%  # billing assumption; wall-clock running time is ~48% (80 of 168 hours/week)
estimatedSavings: ~30%  # ~$49/month vs $70/month per App Service

Reserved Instances & Savings Plans

Purpose: Long-term cost savings for predictable workloads in Staging/Production with 1-3 year commitments.

Reserved Instance Strategy:

environment: Staging
commitment: 1-year
resources:
  - type: App Service Premium P1v2
    quantity: 2
    monthlyCost: $146 × 2 = $292
    reservedCost: $117 × 2 = $234 (20% savings)
    annualSavings: ~$700

  - type: SQL Database Premium P1
    quantity: 1
    monthlyCost: $465
    reservedCost: $372 (20% savings)
    annualSavings: $1,116

totalAnnualSavings: ~$1,816 (Staging)

Production Reserved Instance Plan:

environment: Production
commitment: 1-year (renew annually)
resources:
  - type: AKS Standard_D4s_v5
    quantity: 3 (system pool, always on)
    monthlyCost: $600
    reservedCost: $480 (20% savings)
    annualSavings: $1,440

  - type: SQL Database Premium P4
    quantity: 1
    monthlyCost: $1,860
    reservedCost: $1,395 (25% savings)
    annualSavings: $5,580

  - type: Cosmos DB (1000 RU/s)
    quantity: 1
    monthlyCost: $730
    reservedCost: $511 (30% savings)
    annualSavings: $2,628

  - type: Redis Cache Premium P3
    quantity: 1
    monthlyCost: $1,037
    reservedCost: $830 (20% savings)
    annualSavings: $2,484

totalAnnualSavings: $12,132 (Production)
totalATPReservedInstanceSavings: ~$13,948/year
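The plan's arithmetic follows one pattern: monthly cost × discount × 12 (a sketch using the quoted discount rates; the plan above rounds reserved monthly prices to whole dollars, so its figures differ by a few dollars):

```python
# Sketch: annual savings from a reserved-instance discount.
def ri_annual_savings(monthly_cost: float, discount: float) -> float:
    """discount is the fractional reduction, e.g. 0.20 for 20%."""
    return monthly_cost * discount * 12

savings = {
    "AKS system pool (20%)": ri_annual_savings(600, 0.20),
    "SQL Premium P4 (25%)": ri_annual_savings(1860, 0.25),
    "Cosmos DB 1000 RU/s (30%)": ri_annual_savings(730, 0.30),
    "Redis Premium P3 (20%)": ri_annual_savings(1037, 0.20),
}
total = sum(savings.values())
print(f"Production total: ${total:,.0f}/year")
```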

Purchase Reserved Instances (Azure CLI):

#!/bin/bash
# purchase-reserved-instances.sh

SUBSCRIPTION_ID="<azure-subscription-id>"
REGION="eastus"

echo "Purchasing Reserved Instances for ATP Production..."

# AKS Nodes (Standard_D4s_v5 × 3) - requires the Azure CLI 'reservations' extension
az reservations reservation-order purchase \
  --reserved-resource-type "VirtualMachines" \
  --sku "Standard_D4s_v5" \
  --location "$REGION" \
  --quantity 3 \
  --term "P1Y" \
  --billing-plan "Monthly" \
  --display-name "ATP-Prod-AKS-RI-2025"

# SQL Database (Premium P4) - configures the tier; SQL/Cosmos reserved capacity
# is purchased separately (Azure portal > Reservations, or the Reservations API)
az sql db update \
  --resource-group "ConnectSoft-ATP-Prod-EUS-RG" \
  --server "atp-sql-prod-eus" \
  --name "ATP_Prod" \
  --compute-model "Provisioned" \
  --service-objective "P4" \
  --backup-storage-redundancy "Geo" \
  --zone-redundant true \
  --read-scale "Enabled"

# Cosmos DB - set provisioned throughput to 1000 RU/s (reserved capacity is purchased separately)
az cosmosdb sql container throughput update \
  --resource-group "ConnectSoft-ATP-Prod-EUS-RG" \
  --account-name "atp-cosmos-prod-eus" \
  --database-name "ATP" \
  --name "AuditEvents" \
  --throughput 1000

echo "✅ Reserved Instances purchased; savings will appear in next billing cycle"

Spot Instances (Preview Environments)

Purpose: 90% cost savings for ephemeral Preview environments using Azure Spot VMs.

Spot Instance Configuration (ACI with Spot pricing):

# Azure Container Instances with Spot pricing
apiVersion: '2021-09-01'
location: eastus
name: atp-preview-pr-123
properties:
  containers:
    - name: atp-gateway-preview
      properties:
        image: connectsoft.azurecr.io/atp-gateway:pr-123
        resources:
          requests:
            memoryInGB: 1.5
            cpu: 1
  osType: Linux
  priority: Spot  # Use Spot pricing (90% cheaper than regular)
  restartPolicy: Never
  sku: Standard
tags:
  Environment: Preview
  PullRequest: PR-123
  CostCenter: Engineering
  AutoDelete: "24h"  # Delete after 24 hours

Cost Comparison (Preview Environment):

Regular ACI Pricing: $0.0000125/second × 3600s × 24h × 30 days = $32.40/month
Spot ACI Pricing: $0.00000125/second × 3600s × 24h × 30 days = $3.24/month
Savings: $29.16/month per container (90% savings)

Typical Preview Environment (5 containers × 24 hours):
  - Regular: $5.40
  - Spot: $0.54
  - Savings per PR: $4.86 (90%)

Estimated Monthly (20 PRs/month):
  - Regular: $108
  - Spot: $10.80
  - Total Savings: $97.20/month
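The comparison above is straightforward to reproduce (a sketch using the per-second rates quoted here; actual Spot pricing fluctuates with regional capacity):

```python
# Sketch: Spot vs regular ACI cost for a preview environment,
# using the per-second rates quoted above.
REGULAR_PER_SEC = 0.0000125
SPOT_PER_SEC = REGULAR_PER_SEC / 10   # ~90% discount

def aci_cost(per_sec: float, containers: int, hours: float) -> float:
    # per-second rate × seconds × container count
    return per_sec * 3600 * hours * containers

regular = aci_cost(REGULAR_PER_SEC, containers=5, hours=24)
spot = aci_cost(SPOT_PER_SEC, containers=5, hours=24)
print(f"Per PR: regular ${regular:.2f}, spot ${spot:.2f}, "
      f"saved ${regular - spot:.2f}")
```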

Storage Lifecycle Management

Purpose: Automated tiering from hot → cool → archive based on retention policies to minimize storage costs.

Storage Lifecycle Policy (Production):

{
  "rules": [
    {
      "enabled": true,
      "name": "MoveLogsToArchive",
      "type": "Lifecycle",
      "definition": {
        "filters": {
          "blobTypes": ["blockBlob"],
          "prefixMatch": ["logs/", "traces/", "metrics/"]
        },
        "actions": {
          "baseBlob": {
            "tierToCool": {
              "daysAfterModificationGreaterThan": 90
            },
            "tierToArchive": {
              "daysAfterModificationGreaterThan": 2555  // 7 years
            },
            "delete": {
              "daysAfterModificationGreaterThan": 3650  // 10 years (optional cleanup)
            }
          },
          "snapshot": {
            "tierToCool": {
              "daysAfterCreationGreaterThan": 30
            },
            "delete": {
              "daysAfterCreationGreaterThan": 90
            }
          }
        }
      }
    },
    {
      "enabled": true,
      "name": "MoveBackupsToArchive",
      "type": "Lifecycle",
      "definition": {
        "filters": {
          "blobTypes": ["blockBlob"],
          "prefixMatch": ["backups/"]
        },
        "actions": {
          "baseBlob": {
            "tierToCool": {
              "daysAfterModificationGreaterThan": 30  // Weekly backups to cool after 30 days
            },
            "tierToArchive": {
              "daysAfterModificationGreaterThan": 180  // Archive after 6 months
            }
          }
        }
      }
    },
    {
      "enabled": true,
      "name": "DeleteTempBlobs",
      "type": "Lifecycle",
      "definition": {
        "filters": {
          "blobTypes": ["blockBlob"],
          "prefixMatch": ["temp/", "preview/"]
        },
        "actions": {
          "baseBlob": {
            "delete": {
              "daysAfterModificationGreaterThan": 7  // Delete temp/preview after 7 days
            }
          }
        }
      }
    }
  ]
}

Storage Cost Savings (Production):

Hot Storage (0-90 days): 900 GB × $0.0184/GB = $16.56/month
Cool Storage (91 days - 7 years): 25,200 GB × $0.01/GB = $252/month
Archive Storage (7+ years): 100,000 GB × $0.002/GB = $200/month

Without Lifecycle Policy (all hot): 126,100 GB × $0.0184/GB = $2,320/month
With Lifecycle Policy: $16.56 + $252 + $200 = $468.56/month
Total Savings: $1,851.44/month (80% savings on storage)
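The savings figure can be verified from the tier sizes and per-GB rates above (an illustrative sketch):

```python
# Sketch: monthly storage cost with and without lifecycle tiering,
# using the per-GB rates quoted above.
TIERS = {  # tier -> (GB stored, $/GB/month)
    "hot": (900, 0.0184),
    "cool": (25_200, 0.01),
    "archive": (100_000, 0.002),
}
tiered = sum(gb * rate for gb, rate in TIERS.values())
all_hot = sum(gb for gb, _ in TIERS.values()) * 0.0184
print(f"Tiered: ${tiered:,.2f}  All-hot: ${all_hot:,.2f}  "
      f"Savings: {1 - tiered / all_hot:.0%}")
```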

Cost Alerts & Monitoring

Purpose: Proactive cost visibility with alerts when environments exceed budget thresholds or exhibit anomalous spending patterns.

Azure Cost Management Budgets (Pulumi):

// Dev Environment Budget
var devBudget = new Budget("atp-budget-dev", new BudgetArgs
{
    BudgetName = "atp-budget-dev",
    ResourceGroupName = devResourceGroup.Name,
    Amount = 500,  // $500/month
    TimeGrain = "Monthly",
    TimePeriod = new BudgetTimePeriodArgs
    {
        StartDate = "2025-01-01",
        EndDate = "2025-12-31"
    },
    Category = "Cost",
    Notifications = new InputMap<NotificationArgs>
    {
        ["Alert80Percent"] = new NotificationArgs
        {
            Enabled = true,
            Operator = "GreaterThanOrEqualTo",
            Threshold = 80,
            ContactEmails = new[] { "platform-team@connectsoft.example" },
            ContactRoles = new[] { "Owner", "Contributor" },
            ThresholdType = "Actual"
        },
        ["Alert100Percent"] = new NotificationArgs
        {
            Enabled = true,
            Operator = "GreaterThanOrEqualTo",
            Threshold = 100,
            ContactEmails = new[] { "platform-team@connectsoft.example", "finance@connectsoft.example" },
            ContactRoles = new[] { "Owner" },
            ThresholdType = "Actual"
        },
        ["Forecast120Percent"] = new NotificationArgs
        {
            Enabled = true,
            Operator = "GreaterThanOrEqualTo",
            Threshold = 120,
            ContactEmails = new[] { "platform-team@connectsoft.example" },
            ThresholdType = "Forecasted"
        }
    },
    Filter = new BudgetFilterArgs
    {
        Tags = new InputList<BudgetFilterTagsArgs>
        {
            new BudgetFilterTagsArgs
            {
                Name = "Environment",
                Operator = "In",
                Values = new[] { "dev" }
            }
        }
    }
});

// Production Environment Budget
var prodBudget = new Budget("atp-budget-prod", new BudgetArgs
{
    BudgetName = "atp-budget-prod",
    ResourceGroupName = prodResourceGroup.Name,
    Amount = 10000,  // $10,000/month
    TimeGrain = "Monthly",
    TimePeriod = new BudgetTimePeriodArgs
    {
        StartDate = "2025-01-01",
        EndDate = "2025-12-31"
    },
    Category = "Cost",
    Notifications = new InputMap<NotificationArgs>
    {
        ["Alert50Percent"] = new NotificationArgs
        {
            Enabled = true,
            Operator = "GreaterThanOrEqualTo",
            Threshold = 50,
            ContactEmails = new[] { "platform-team@connectsoft.example" },
            ThresholdType = "Actual"
        },
        ["Alert80Percent"] = new NotificationArgs
        {
            Enabled = true,
            Operator = "GreaterThanOrEqualTo",
            Threshold = 80,
            ContactEmails = new[] { "platform-team@connectsoft.example", "finance@connectsoft.example" },
            ContactRoles = new[] { "Owner" },
            ThresholdType = "Actual"
        },
        ["Alert100Percent"] = new NotificationArgs
        {
            Enabled = true,
            Operator = "GreaterThanOrEqualTo",
            Threshold = 100,
            ContactEmails = new[] { "cfo@connectsoft.example", "platform-team@connectsoft.example" },
            ContactRoles = new[] { "Owner" },
            ThresholdType = "Actual",
            ContactGroups = new[] { costAnomalyActionGroup.Id }  // Action group wired to auto-create a P1 incident
        }
    }
});
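The notification semantics above are straightforward: each rule compares either actual or forecasted spend against a percentage of the budget. A minimal Python sketch (function and rule names are illustrative, mirroring the Dev budget's thresholds):

```python
def alerts_fired(budget: float, actual: float, forecast: float,
                 rules: list) -> list:
    """Return names of budget notification rules whose threshold is breached."""
    fired = []
    for rule in rules:
        # "Actual" rules look at spend to date; "Forecasted" at month-end projection
        spend = actual if rule["type"] == "Actual" else forecast
        if spend >= budget * rule["threshold"] / 100:
            fired.append(rule["name"])
    return fired

rules = [
    {"name": "Alert80Percent",     "threshold": 80,  "type": "Actual"},
    {"name": "Alert100Percent",    "threshold": 100, "type": "Actual"},
    {"name": "Forecast120Percent", "threshold": 120, "type": "Forecasted"},
]

# Dev: $500 budget, $430 spent so far, $620 month-end forecast
print(alerts_fired(500, 430, 620, rules))
# ['Alert80Percent', 'Forecast120Percent']
```

Note that forecast-based rules can fire before any actual threshold is crossed, which is why the 120% forecast alert exists: it warns while there is still time to act.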

Cost Anomaly Detection (Azure Monitor):

// Cost anomaly alert (50% spike in single day)
var costAnomalyAlert = new MetricAlert("atp-cost-anomaly-alert-prod", new MetricAlertArgs
{
    AlertRuleName = "atp-cost-anomaly-prod",
    ResourceGroupName = prodResourceGroup.Name,
    Location = "global",
    Description = "Alert when Production environment cost spikes >50% in 24 hours",
    Severity = 1,  // High severity
    Enabled = true,
    Scopes = new[] { prodResourceGroup.Id },
    EvaluationFrequency = "PT1H",  // Evaluate every hour
    WindowSize = "PT24H",  // 24-hour window
    Criteria = new MetricAlertMultipleResourceMultipleMetricCriteriaArgs
    {
        OdataType = "Microsoft.Azure.Monitor.MultipleResourceMultipleMetricCriteria",
        AllOf = new[]
        {
            new DynamicMetricCriteriaArgs
            {
                Name = "CostAnomaly",
                MetricName = "ActualCost",
                MetricNamespace = "Microsoft.CostManagement/budgets",
                Operator = "GreaterThan",
                AlertSensitivity = "Medium",
                DynamicThresholdFailingPeriods = new DynamicThresholdFailingPeriodsArgs
                {
                    NumberOfEvaluationPeriods = 4,
                    MinFailingPeriodsToAlert = 2
                },
                TimeAggregation = "Total"
            }
        }
    },
    Actions = new[]
    {
        new MetricAlertActionArgs
        {
            ActionGroupId = costAnomalyActionGroup.Id
        }
    }
});

// Action Group for cost alerts (declare before the alert above that references it)
var costAnomalyActionGroup = new ActionGroup("atp-cost-anomaly-action-group", new ActionGroupArgs
{
    ActionGroupName = "atp-cost-anomaly",
    ResourceGroupName = sharedResourceGroup.Name,
    Location = "global",
    Enabled = true,
    ShortName = "CostAlert",
    EmailReceivers = new[]
    {
        new EmailReceiverArgs
        {
            Name = "PlatformTeam",
            EmailAddress = "platform-team@connectsoft.example",
            UseCommonAlertSchema = true
        },
        new EmailReceiverArgs
        {
            Name = "Finance",
            EmailAddress = "finance@connectsoft.example",
            UseCommonAlertSchema = true
        }
    },
    WebhookReceivers = new[]
    {
        new WebhookReceiverArgs
        {
            Name = "SlackNotification",
            ServiceUri = "https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXX",
            UseCommonAlertSchema = true
        }
    }
});
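The DynamicThresholdFailingPeriods setting above (4 evaluation periods, 2 minimum failing) means the alert fires only when at least 2 of the last 4 hourly evaluations breached the dynamic threshold, so a single noisy hour does not page anyone. A sketch of that windowing logic:

```python
def should_alert(breaches: list, evaluation_periods: int = 4,
                 min_failing: int = 2) -> bool:
    """True when enough of the most recent evaluations breached the threshold."""
    window = breaches[-evaluation_periods:]
    return sum(window) >= min_failing

print(should_alert([False, False, False, True]))  # False: one isolated spike
print(should_alert([True, False, True, False]))   # True: 2 of last 4 breached
```

Raising MinFailingPeriodsToAlert trades detection latency for fewer false positives.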

Monthly Cost Review Automation:

<#
.SYNOPSIS
    Generate monthly cost report for ATP environments.
.DESCRIPTION
    Queries Azure Cost Management API for current month spending per environment.
    Sends report to finance and platform teams.
#>

param(
    [Parameter(Mandatory=$false)]
    [string]$SubscriptionId = "<subscription-id>"
)

Connect-AzAccount -Identity

$startDate = (Get-Date -Day 1).ToString("yyyy-MM-dd")
$endDate = (Get-Date).ToString("yyyy-MM-dd")

Write-Output "Generating ATP cost report for $startDate to $endDate"

# Query costs per environment
$environments = @("dev", "test", "staging", "prod", "hotfix")
$costReport = @()

foreach ($env in $environments) {
    $query = @{
        type = "ActualCost"
        timeframe = "Custom"
        timePeriod = @{
            from = $startDate
            to = $endDate
        }
        dataset = @{
            granularity = "Monthly"
            aggregation = @{
                totalCost = @{
                    name = "Cost"
                    function = "Sum"
                }
            }
            grouping = @(
                @{
                    type = "Dimension"
                    name = "ResourceGroupName"
                }
            )
            filter = @{
                tags = @{
                    name = "Environment"
                    operator = "In"
                    values = @($env)
                }
            }
        }
    } | ConvertTo-Json -Depth 10

    $result = Invoke-AzRestMethod `
        -Path "/subscriptions/$SubscriptionId/providers/Microsoft.CostManagement/query?api-version=2021-10-01" `
        -Method POST `
        -Payload $query

    $cost = ($result.Content | ConvertFrom-Json).properties.rows |
        ForEach-Object { $_[0] } | Measure-Object -Sum

    $budget = switch ($env) {
        "dev"     { 500 }
        "test"    { 1000 }
        "staging" { 3000 }
        "prod"    { 10000 }
        "hotfix"  { 500 }
    }

    $costReport += [PSCustomObject]@{
        Environment = $env.ToUpper()
        Cost        = [math]::Round($cost.Sum, 2)
        Budget      = $budget
        Utilization = [math]::Round(($cost.Sum / $budget) * 100, 1)
    }
}

# Generate HTML report
$htmlReport = @"
<html>
<head>
    <style>
        body { font-family: Arial; }
        table { border-collapse: collapse; width: 100%; }
        th, td { border: 1px solid #ddd; padding: 8px; text-align: left; }
        th { background-color: #4CAF50; color: white; }
        .over-budget { background-color: #ffcccc; }
        .near-budget { background-color: #ffffcc; }
    </style>
</head>
<body>
    <h2>ATP Monthly Cost Report - $(Get-Date -Format 'MMMM yyyy')</h2>
    <table>
        <tr>
            <th>Environment</th>
            <th>Current Cost</th>
            <th>Monthly Budget</th>
            <th>Utilization</th>
        </tr>
"@

foreach ($item in $costReport) {
    $rowClass = if ($item.Utilization -gt 100) { "over-budget" } 
                elseif ($item.Utilization -gt 80) { "near-budget" } 
                else { "" }

    $htmlReport += @"
        <tr class="$rowClass">
            <td>$($item.Environment)</td>
            <td>`$$($item.Cost)</td>
            <td>`$$($item.Budget)</td>
            <td>$($item.Utilization)%</td>
        </tr>
"@
}

$totalCost = ($costReport | Measure-Object -Property Cost -Sum).Sum
$totalBudget = ($costReport | Measure-Object -Property Budget -Sum).Sum

$htmlReport += @"
        <tr style="font-weight: bold;">
            <td>TOTAL</td>
            <td>`$$totalCost</td>
            <td>`$$totalBudget</td>
            <td>$([math]::Round(($totalCost / $totalBudget) * 100, 1))%</td>
        </tr>
    </table>

    <h3>Cost Optimization Recommendations</h3>
    <ul>
        <li>Dev/Test shutdown automation active: Estimated savings ~`$200/month</li>
        <li>Reserved Instances active: Saving ~`$1,163/month</li>
        <li>Storage lifecycle policies active: Saving ~`$1,851/month</li>
        <li>Total Estimated Savings: ~`$3,214/month</li>
    </ul>
</body>
</html>
"@

# Send email report
Send-MailMessage `
    -From "platform-team@connectsoft.example" `
    -To "finance@connectsoft.example", "platform-team@connectsoft.example" `
    -Subject "ATP Monthly Cost Report - $(Get-Date -Format 'MMMM yyyy')" `
    -Body $htmlReport `
    -BodyAsHtml `
    -SmtpServer "smtp.office365.com" `
    -UseSsl `
    -Credential (Get-AutomationPSCredential -Name "EmailCredential")

Write-Output "✅ Cost report sent successfully"
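The report's row-highlighting rule (over 100% utilization is red, over 80% yellow) is simple enough to state as a sketch, with the same class names the HTML template uses:

```python
def row_class(utilization_pct: float) -> str:
    """CSS class for a budget row, matching the report's thresholds."""
    if utilization_pct > 100:
        return "over-budget"
    if utilization_pct > 80:
        return "near-budget"
    return ""

print([row_class(u) for u in (35.0, 84.5, 112.0)])
# ['', 'near-budget', 'over-budget']
```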

Cost Tagging Strategy

Purpose: Granular cost attribution per environment, service, team, and cost center for accurate chargeback/showback.

Tagging Policy (Azure Policy):

{
  "properties": {
    "displayName": "Require tags on ATP resources",
    "policyType": "Custom",
    "mode": "Indexed",
    "description": "Enforces required tags on all ATP resources for cost tracking",
    "metadata": {
      "category": "Cost Management"
    },
    "parameters": {
      "requiredTags": {
        "type": "Object",
        "defaultValue": {
          "Environment": ["dev", "test", "staging", "prod", "hotfix", "preview"],
          "Service": ["gateway", "ingestion", "query", "integrity", "export", "policy", "search"],
          "CostCenter": ["Engineering", "Platform", "Security"],
          "Owner": ["platform-team@connectsoft.example"],
          "Project": ["ATP"]
        }
      }
    },
    "policyRule": {
      "if": {
        "allOf": [
          {
            "field": "type",
            "notIn": [
              "Microsoft.Resources/subscriptions",
              "Microsoft.Resources/subscriptions/resourceGroups"
            ]
          },
          {
            "anyOf": [
              {
                "field": "tags['Environment']",
                "exists": "false"
              },
              {
                "field": "tags['Service']",
                "exists": "false"
              },
              {
                "field": "tags['CostCenter']",
                "exists": "false"
              },
              {
                "field": "tags['Owner']",
                "exists": "false"
              },
              {
                "field": "tags['Project']",
                "exists": "false"
              }
            ]
          }
        ]
      },
      "then": {
        "effect": "deny"
      }
    }
  }
}
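The policy's anyOf/exists structure above reduces to a set check: a resource is denied when any one of the five required tag keys is absent. A sketch of that evaluation (function name hypothetical):

```python
# The five tag keys the Azure Policy above requires on every resource
REQUIRED_TAGS = {"Environment", "Service", "CostCenter", "Owner", "Project"}

def policy_effect(tags: dict) -> str:
    """Return 'deny' when any required tag key is missing, else 'allow'."""
    return "deny" if REQUIRED_TAGS - tags.keys() else "allow"

print(policy_effect({
    "Environment": "prod", "Service": "gateway",
    "CostCenter": "Engineering",
    "Owner": "platform-team@connectsoft.example", "Project": "ATP",
}))                                    # allow
print(policy_effect({"Environment": "dev"}))  # deny
```

Note the policy checks only that the keys exist; validating tag *values* against the allowed lists in `requiredTags` would need a separate policy rule.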

Pulumi Tagging (Consistent across all resources):

// Global tags applied to all ATP resources
var globalTags = new InputMap<string>
{
    ["Project"] = "ATP",
    ["ManagedBy"] = "Pulumi",
    ["CostCenter"] = "Engineering",
    ["Owner"] = "platform-team@connectsoft.example"
};

// Environment-specific tags
var devTags = globalTags.Concat(new InputMap<string>
{
    ["Environment"] = "dev",
    ["CostOptimization"] = "ShutdownSchedule",
    ["BackupRequired"] = "false"
}).ToInputMap();

var prodTags = globalTags.Concat(new InputMap<string>
{
    ["Environment"] = "prod",
    ["CostOptimization"] = "ReservedInstances",
    ["BackupRequired"] = "true",
    ["Compliance"] = "SOC2,HIPAA,GDPR"
}).ToInputMap();

// Apply to resources
var prodAppService = new WebApp("atp-gateway-prod-eus", new WebAppArgs
{
    // ... resource configuration ...
    Tags = prodTags.Concat(new InputMap<string>
    {
        ["Service"] = "gateway",
        ["Tier"] = "Premium"
    }).ToInputMap()
});

Cost Attribution Query (KQL):

// Cost breakdown by Environment and Service
AzureCostManagement
| where TimeGenerated >= startofmonth(now())
| extend Environment = tostring(Tags["Environment"])
| extend Service = tostring(Tags["Service"])
| extend CostCenter = tostring(Tags["CostCenter"])
| summarize TotalCost = sum(Cost) by Environment, Service, CostCenter
| order by TotalCost desc

FinOps Best Practices

ATP FinOps Principles:

  1. Visibility: Tag all resources; enable Cost Management; monthly reviews.
  2. Optimization: Shutdown schedules, reserved instances, autoscaling, storage lifecycle.
  3. Governance: Budget alerts, Azure Policy enforcement, approval workflows for cost increases.
  4. Culture: Cost awareness in development; cost-per-feature metrics; regular optimization sprints.

Cost per Tenant (Production):

Total Production Monthly Cost: $9,834
Active Tenants (production): 50
Cost per Tenant: $9,834 / 50 = $196.68/month

Target Cost per Tenant (with 500 tenants): $9,834 / 500 = $19.67/month
Required Optimization: 90% reduction through economies of scale and multi-tenancy
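The per-tenant arithmetic above, as a sketch: because most of the production spend is fixed infrastructure, unit cost falls roughly linearly as tenant count grows:

```python
def cost_per_tenant(monthly_cost: float, tenants: int) -> float:
    """Monthly platform cost amortized per tenant, in dollars."""
    return round(monthly_cost / tenants, 2)

current = cost_per_tenant(9_834, 50)    # today's 50 production tenants
target  = cost_per_tenant(9_834, 500)   # target at 500 tenants
print(current, target, f"{1 - target / current:.0%} reduction required")
# 196.68 19.67 90% reduction required
```

This assumes total cost stays flat at $9,834/month; in practice some components (storage, ingestion) scale with tenant count, so the real curve is flatter than pure amortization.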

Cost Governance Workflow:

# Approval required for resources exceeding cost thresholds
costGovernance:
  thresholds:
    - resource: App Service Premium
      monthlyCost: $150
      approver: Lead Architect

    - resource: SQL Database Premium
      monthlyCost: $500
      approver: CTO

    - resource: AKS Node Pool
      monthlyCost: $1000
      approver: CFO

  process:
    1. Engineer submits Pulumi PR with new resource
    2. CI/CD calculates estimated monthly cost (Infracost)
    3. If cost > threshold, require approval
    4. Approver reviews cost justification
    5. If approved, Pulumi deploys resource with cost tags
    6. Monthly review of actual vs estimated costs

Summary

  • Environment Cost Profiles: Graduated from $500/month (Dev) to $10,000/month (Production) with tailored SKUs and scaling.
  • Dev Optimization: Shutdown evenings/weekends (60% uptime) → 40% savings ($175/month actual cost).
  • Test Optimization: Shutdown nights (70% uptime) → 30% savings ($845/month actual cost).
  • Reserved Instances: 1-year commitments for Staging/Production → 20-30% savings ($13,956/year total).
  • Spot Instances: Preview environments use Spot pricing → 90% savings ($97/month).
  • Storage Lifecycle: Automated hot → cool → archive transitions → 80% storage savings ($1,851/month).
  • Cost Alerts: Budget thresholds (80%, 100%) and anomaly detection (50% spike) with automated notifications.
  • Tagging Strategy: Granular cost attribution per environment, service, team, and cost center.
  • FinOps Culture: Monthly cost reviews, cost-per-feature metrics, governance workflows for cost increases.

Disaster Recovery & High Availability

ATP implements graduated disaster recovery and high availability strategies aligned with each environment's criticality and business requirements. Production operates in active-active multi-region configuration with 15-minute RPO and 30-minute RTO, while lower environments use cost-effective recreate-from-IaC or backup restore strategies.

This approach ensures business continuity for critical production workloads while maintaining cost efficiency in non-production environments through infrastructure-as-code recovery instead of expensive geo-replication.

RPO/RTO Targets Per Environment

ATP defines graduated recovery objectives from best-effort Dev recovery to mission-critical Production multi-region failover.

| Environment | RPO (Recovery Point) | RTO (Recovery Time) | HA Strategy | DR Strategy | DR Test Cadence |
|---|---|---|---|---|---|
| Preview | N/A | N/A | None | None (ephemeral) | N/A |
| Dev | 24 hours | 4 hours | Single instance | Recreate from IaC + Git | Quarterly |
| Test | 12 hours | 2 hours | 2 instances (no LB) | Restore from last backup | Quarterly |
| Staging | 1 hour | 1 hour | Blue-green slots | Slot swap + backup restore | Monthly |
| Production | 15 minutes | 30 minutes | Multi-region active-active (80/20 split) | Automated regional failover | Weekly (non-disruptive) |
| Hotfix | 15 minutes | 30 minutes | Same as Production | Same as Production | N/A (mirrors Prod) |

RPO/RTO Rationale:

  • Dev (24h RPO, 4h RTO): Acceptable data loss; recreate from Git + IaC in half-day.
  • Test (12h RPO, 2h RTO): Daily backups sufficient; restore within business hours.
  • Staging (1h RPO, 1h RTO): Production-like; validate blue-green failover.
  • Production (15min RPO, 30min RTO): Mission-critical; continuous geo-replication; automated failover.

Multi-Region Production Topology

ATP Production operates in active-active mode across two Azure regions (East US primary, West Europe secondary) with automated failover and geo-replicated data.

Regional Architecture

graph TB
    subgraph Internet
        Users[Global Users]
        AFD[Azure Front Door<br/>Global Load Balancer]
    end

    subgraph Primary Region - East US
        EUS_AppGw[App Gateway WAF<br/>10.2.1.0/24]
        EUS_AKS[AKS Cluster<br/>3-10 nodes<br/>10.2.2.0/23]
        EUS_SQL[(SQL Primary<br/>Geo-Replication Enabled)]
        EUS_Cosmos[(Cosmos DB Primary<br/>Multi-region writes)]
        EUS_Storage[(Blob Storage GRS<br/>Auto-replicate to WEU)]
        EUS_ServiceBus[(Service Bus Premium<br/>Geo-DR paired)]
    end

    subgraph Secondary Region - West Europe
        WEU_AppGw[App Gateway WAF<br/>10.4.1.0/24]
        WEU_AKS[AKS Cluster<br/>2-6 nodes<br/>10.4.2.0/23]
        WEU_SQL[(SQL Secondary<br/>Read replica)]
        WEU_Cosmos[(Cosmos DB Secondary<br/>Multi-region writes)]
        WEU_Storage[(Blob Storage GRS<br/>Replica)]
        WEU_ServiceBus[(Service Bus Premium<br/>Geo-DR paired)]
    end

    Users --> AFD
    AFD -->|80% traffic| EUS_AppGw
    AFD -->|20% traffic| WEU_AppGw

    EUS_AppGw --> EUS_AKS
    WEU_AppGw --> WEU_AKS

    EUS_AKS --> EUS_SQL
    EUS_AKS --> EUS_Cosmos
    EUS_AKS --> EUS_Storage
    EUS_AKS --> EUS_ServiceBus

    WEU_AKS --> WEU_SQL
    WEU_AKS --> WEU_Cosmos
    WEU_AKS --> WEU_Storage
    WEU_AKS --> WEU_ServiceBus

    EUS_SQL -.->|Geo-Replication| WEU_SQL
    EUS_Cosmos <-.->|Multi-write| WEU_Cosmos
    EUS_Storage -.->|GRS Replication| WEU_Storage
    EUS_ServiceBus <-.->|Geo-DR Pairing| WEU_ServiceBus

    style EUS_AKS fill:#90EE90
    style WEU_AKS fill:#FFD700
    style AFD fill:#4CAF50

Regional Resource Naming:

Primary (East US):   atp-{service}-prod-eus
Secondary (West EU): atp-{service}-prod-weu

Dev Environment DR Strategy

RPO: 24 hours | RTO: 4 hours

Strategy: Recreate from Infrastructure as Code (no backups; Git is source of truth).

Recovery Procedure (Dev):

#!/bin/bash
# dr-recovery-dev.sh

ENVIRONMENT="dev"
REGION="eastus"

echo "🔄 Starting DR recovery for Dev environment..."

# Step 1: Recreate infrastructure via Pulumi
cd infrastructure/

pulumi stack select atp-dev-eus
pulumi up --yes --skip-preview

if [ $? -ne 0 ]; then
  echo "❌ Pulumi infrastructure deployment failed"
  exit 1
fi

echo "✅ Infrastructure recreated"

# Step 2: Redeploy latest main branch code
cd ../

# Trigger latest CI/CD pipeline
az pipelines run \
  --name "ATP.Ingestion" \
  --branch "main" \
  --organization "https://dev.azure.com/ConnectSoft" \
  --project "ATP"

echo "✅ CI/CD pipeline triggered; Dev environment will be ready in ~15 minutes"

# Step 3: Seed synthetic data
dotnet run --project tools/DataSeeder \
  --environment Dev \
  --tenants 10 \
  --records-per-tenant 1000

echo "✅ Dev environment recovery complete"
echo "Total Recovery Time: ~4 hours (infrastructure + deployment + data seeding)"

Dev DR Testing (Quarterly):

drTestProcedure:
  frequency: Quarterly
  steps:
    1. Delete Dev resource group
    2. Run dr-recovery-dev.sh script
    3. Validate all services healthy
    4. Validate synthetic data seeded
    5. Document actual RTO achieved

  acceptanceCriteria:
    - All services pass health checks
    - RTO < 4 hours
    - No data corruption

Test Environment DR Strategy

RPO: 12 hours | RTO: 2 hours

Strategy: Restore from daily backups (SQL, Redis snapshots).

Backup Configuration (Test):

// Test SQL Database automated backups
var testSqlDatabase = new Database("atp-sql-test-eus", new DatabaseArgs
{
    DatabaseName = "ATP_Test",
    ServerName = testSqlServer.Name,
    ResourceGroupName = testResourceGroup.Name,
    Location = "eastus",
    Sku = new SkuArgs
    {
        Name = "S1",
        Tier = "Standard"
    },

    // Automated backups
    BackupRetentionPolicyInDays = 14,  // 14-day retention
    BackupStorageRedundancy = "Local",  // LRS (cheaper than GRS)

    // Long-term retention (optional)
    LongTermRetentionPolicy = new DatabaseLongTermRetentionPolicyArgs
    {
        WeeklyRetention = "P4W",  // 4 weeks
        MonthlyRetention = "P0M",  // Disabled
        YearlyRetention = "P0Y"    // Disabled
    },

    Tags = testTags
});

Recovery Procedure (Test):

<#
.SYNOPSIS
    Restore ATP Test environment from last good backup.
.DESCRIPTION
    Restores SQL Database and Redis Cache from automated backups.
    RTO target: 2 hours
#>

param(
    [Parameter(Mandatory=$false)]
    [DateTime]$RestorePointInTime = (Get-Date).AddHours(-1)  # Default: 1 hour ago
)

Connect-AzAccount -Identity

$StartTime = Get-Date  # Used to report the actual RTO at the end
$resourceGroup = "ConnectSoft-ATP-Test-EUS-RG"
$region = "eastus"

Write-Output "Starting Test environment DR recovery..."
Write-Output "Restore point: $RestorePointInTime"

# Step 1: Restore SQL Database from automated backup
$sqlServer = "atp-sql-test-eus"
$originalDb = "ATP_Test"
$restoredDb = "ATP_Test_Restored_$(Get-Date -Format 'yyyyMMddHHmmss')"

Write-Output "Restoring SQL Database from point-in-time: $RestorePointInTime"

Restore-AzSqlDatabase `
    -ResourceGroupName $resourceGroup `
    -ServerName $sqlServer `
    -TargetDatabaseName $restoredDb `
    -ServiceObjectiveName "S1" `
    -Edition "Standard" `
    -PointInTime $RestorePointInTime `
    -ResourceId "/subscriptions/<sub-id>/resourceGroups/$resourceGroup/providers/Microsoft.Sql/servers/$sqlServer/databases/$originalDb"

Write-Output "✅ SQL Database restored to $restoredDb"

# Step 2: Rename databases (swap restored → active)
Set-AzSqlDatabase `
    -ResourceGroupName $resourceGroup `
    -ServerName $sqlServer `
    -DatabaseName $originalDb `
    -NewName "${originalDb}_OLD"

Set-AzSqlDatabase `
    -ResourceGroupName $resourceGroup `
    -ServerName $sqlServer `
    -DatabaseName $restoredDb `
    -NewName $originalDb

Write-Output "✅ Database swap complete"

# Step 3: Restore Redis Cache (from RDB snapshot)
# Note: Azure Redis Cache automated backups (Premium tier only)
# Test uses Standard tier, so recreate cache instead

Write-Output "Recreating Redis Cache (no backups in Standard tier)..."
# App Services will reconnect and rebuild cache from database

# Step 4: Restart App Services (pick up new connection strings)
$appServices = Get-AzWebApp -ResourceGroupName $resourceGroup
foreach ($app in $appServices) {
    Write-Output "Restarting App Service: $($app.Name)"
    Restart-AzWebApp -ResourceGroupName $resourceGroup -Name $app.Name
}

# Step 5: Run smoke tests
Write-Output "Running smoke tests..."
$healthCheckUrl = "https://atp-gateway-test.azurewebsites.net/health"
$response = Invoke-RestMethod -Uri $healthCheckUrl -Method Get

if ($response.status -ne "Healthy") {
    Write-Error "❌ Health check failed after DR recovery"
    exit 1
}

Write-Output "✅ Test environment DR recovery complete"
Write-Output "Actual RTO: $((Get-Date) - $StartTime | Select-Object -ExpandProperty TotalMinutes) minutes"

Test DR Testing (Quarterly):

drTestProcedure:
  frequency: Quarterly
  steps:
    1. Simulate failure (delete database)
    2. Run dr-recovery-test.ps1 script
    3. Validate data integrity
    4. Validate RTO < 2 hours
    5. Document lessons learned

  acceptanceCriteria:
    - Database restored successfully
    - All tests green post-recovery
    - RTO < 2 hours

Staging Environment DR Strategy

RPO: 1 hour | RTO: 1 hour

Strategy: Blue-Green deployment slots for instant failover with hourly backups for data recovery.

Blue-Green Configuration (Staging):

// Staging App Service with blue-green slots
var stagingAppService = new WebApp("atp-gateway-staging-eus", new WebAppArgs
{
    Name = "atp-gateway-staging-eus",
    ResourceGroupName = stagingResourceGroup.Name,
    Location = "eastus",
    ServerFarmId = stagingAppServicePlan.Id,

    SiteConfig = new SiteConfigArgs
    {
        AlwaysOn = true,
        Http20Enabled = true,
        MinTlsVersion = "1.2"
    },

    Tags = stagingTags
});

// Blue slot (active)
var blueSlot = new WebAppSlot("atp-gateway-staging-blue", new WebAppSlotArgs
{
    Name = "atp-gateway-staging-eus/blue",
    ResourceGroupName = stagingResourceGroup.Name,
    Location = "eastus",
    ServerFarmId = stagingAppServicePlan.Id,

    SiteConfig = new SiteConfigArgs
    {
        AlwaysOn = true,
        AppSettings = new[]
        {
            new NameValuePairArgs { Name = "Slot", Value = "Blue" },
            new NameValuePairArgs { Name = "HealthCheckPath", Value = "/health" }
        }
    },

    Tags = stagingTags
});

// Green slot (standby)
var greenSlot = new WebAppSlot("atp-gateway-staging-green", new WebAppSlotArgs
{
    Name = "atp-gateway-staging-eus/green",
    ResourceGroupName = stagingResourceGroup.Name,
    Location = "eastus",
    ServerFarmId = stagingAppServicePlan.Id,

    SiteConfig = new SiteConfigArgs
    {
        AlwaysOn = true,
        AppSettings = new[]
        {
            new NameValuePairArgs { Name = "Slot", Value = "Green" },
            new NameValuePairArgs { Name = "HealthCheckPath", Value = "/health" }
        }
    },

    Tags = stagingTags
});

Failover Procedure (Staging - Slot Swap):

#!/bin/bash
# failover-staging-blue-green.sh

RESOURCE_GROUP="ConnectSoft-ATP-Staging-EUS-RG"
APP_NAME="atp-gateway-staging-eus"
SOURCE_SLOT="green"   # Standby slot being promoted
TARGET_SLOT="production"

echo "🔄 Starting blue-green failover for Staging..."

# Step 1: Validate green slot health
GREEN_HEALTH=$(curl -s https://$APP_NAME-green.azurewebsites.net/health | jq -r '.status')

if [ "$GREEN_HEALTH" != "Healthy" ]; then
  echo "❌ Green slot unhealthy; aborting failover"
  exit 1
fi

echo "✅ Green slot healthy; proceeding with swap"

# Step 2: Perform slot swap
az webapp deployment slot swap \
  --name $APP_NAME \
  --resource-group $RESOURCE_GROUP \
  --slot $SOURCE_SLOT \
  --target-slot $TARGET_SLOT \
  --action swap

if [ $? -ne 0 ]; then
  echo "❌ Slot swap failed"
  exit 1
fi

echo "✅ Slot swap complete; green is now production"

# Step 3: Validate production health
sleep 30  # Allow the swapped slot to warm up before probing

PROD_HEALTH=$(curl -s https://$APP_NAME.azurewebsites.net/health | jq -r '.status')

if [ "$PROD_HEALTH" != "Healthy" ]; then
  echo "❌ Production unhealthy after swap; rolling back..."

  # Rollback: Swap back to blue
  az webapp deployment slot swap \
    --name $APP_NAME \
    --resource-group $RESOURCE_GROUP \
    --slot $SOURCE_SLOT \
    --target-slot $TARGET_SLOT \
    --action swap

  exit 1
fi

echo "✅ Staging failover complete; RTO: ~5 minutes"

Staging DR Testing (Monthly):

drTestProcedure:
  frequency: Monthly
  steps:
    1. Deploy known-good version to green slot
    2. Run failover-staging-blue-green.sh
    3. Validate production slot serving traffic
    4. Run regression tests
    5. Swap back to blue (or keep green as new production)

  acceptanceCriteria:
    - Swap completes in < 2 minutes
    - Zero downtime during swap
    - All tests pass post-swap

Production Environment DR Strategy

RPO: 15 minutes | RTO: 30 minutes

Strategy: Active-active multi-region with Azure Front Door global load balancing and automated regional failover.

Multi-Region Infrastructure (Pulumi)

Primary Region (East US):

// Primary Production Region (East US)
var prodEUSResourceGroup = new ResourceGroup("atp-prod-eus-rg", new ResourceGroupArgs
{
    ResourceGroupName = "ConnectSoft-ATP-Prod-EUS-RG",
    Location = "eastus",
    Tags = prodTags
});

var prodEUSVNet = new VirtualNetwork("atp-vnet-prod-eus", new VirtualNetworkArgs
{
    VirtualNetworkName = "atp-vnet-prod-eus",
    ResourceGroupName = prodEUSResourceGroup.Name,
    Location = "eastus",
    AddressSpace = new AddressSpaceArgs
    {
        AddressPrefixes = new[] { "10.2.0.0/16" }
    },
    // ... subnets configuration (see previous cycle)
    Tags = prodTags
});

// Primary AKS Cluster (3-10 nodes)
var prodEUSAKS = new ManagedCluster("atp-aks-prod-eus", new ManagedClusterArgs
{
    ResourceName = "atp-aks-prod-eus",
    ResourceGroupName = prodEUSResourceGroup.Name,
    Location = "eastus",
    // ... AKS configuration (see Infrastructure as Code cycle)
    AgentPoolProfiles = new[]
    {
        new ManagedClusterAgentPoolProfileArgs
        {
            Name = "system",
            Count = 3,  // Always 3 system nodes
            MinCount = 3,
            MaxCount = 10,
            EnableAutoScaling = true,
            VmSize = "Standard_D4s_v5",
            AvailabilityZones = new[] { "1", "2", "3" }  // Zone-redundant in region
        }
    },
    Tags = prodTags
});

// Primary SQL Database with geo-replication
var prodEUSSQL = new Server("atp-sql-prod-eus", new ServerArgs
{
    ServerName = "atp-sql-prod-eus",
    ResourceGroupName = prodEUSResourceGroup.Name,
    Location = "eastus",
    AdministratorLogin = "sqladmin",
    AdministratorLoginPassword = sqlAdminPassword,
    Version = "12.0",
    PublicNetworkAccess = "Disabled",  // Private endpoint only
    Tags = prodTags
});

var prodEUSDatabase = new Database("atp-db-prod-eus", new DatabaseArgs
{
    DatabaseName = "ATP_Prod",
    ServerName = prodEUSSQL.Name,
    ResourceGroupName = prodEUSResourceGroup.Name,
    Location = "eastus",
    Sku = new SkuArgs
    {
        Name = "P4",  // Premium 500 DTU
        Tier = "Premium"
    },
    MaxSizeBytes = 1073741824000,  // 1 TB
    ZoneRedundant = true,  // Zone-redundant within region
    ReadScale = "Enabled",  // Read replicas for load distribution
    BackupRetentionPolicyInDays = 35,  // 35-day retention
    BackupStorageRedundancy = "Geo",  // Geo-redundant backups
    Tags = prodTags
});

Secondary Region (West Europe):

// Secondary Production Region (West Europe)
var prodWEUResourceGroup = new ResourceGroup("atp-prod-weu-rg", new ResourceGroupArgs
{
    ResourceGroupName = "ConnectSoft-ATP-Prod-WEU-RG",
    Location = "westeurope",
    Tags = prodTags
});

var prodWEUVNet = new VirtualNetwork("atp-vnet-prod-weu", new VirtualNetworkArgs
{
    VirtualNetworkName = "atp-vnet-prod-weu",
    ResourceGroupName = prodWEUResourceGroup.Name,
    Location = "westeurope",
    AddressSpace = new AddressSpaceArgs
    {
        AddressPrefixes = new[] { "10.4.0.0/16" }  // Different address space
    },
    // ... subnets configuration
    Tags = prodTags
});

// Secondary AKS Cluster (2-6 nodes, smaller than primary)
var prodWEUAKS = new ManagedCluster("atp-aks-prod-weu", new ManagedClusterArgs
{
    ResourceName = "atp-aks-prod-weu",
    ResourceGroupName = prodWEUResourceGroup.Name,
    Location = "westeurope",
    AgentPoolProfiles = new[]
    {
        new ManagedClusterAgentPoolProfileArgs
        {
            Name = "system",
            Count = 2,  // Smaller secondary region
            MinCount = 2,
            MaxCount = 6,
            EnableAutoScaling = true,
            VmSize = "Standard_D4s_v5",
            AvailabilityZones = new[] { "1", "2", "3" }
        }
    },
    Tags = prodTags
});

// SQL Geo-Replication (East US → West Europe): the geo-replica is created
// as a secondary database on the partner server (replication links
// themselves are read-only in the Azure API)
var prodWEUSQL = new Server("atp-sql-prod-weu", new ServerArgs
{
    ServerName = "atp-sql-prod-weu",
    ResourceGroupName = prodWEUResourceGroup.Name,
    Location = "westeurope",
    AdministratorLogin = "sqladmin",
    AdministratorLoginPassword = sqlAdminPassword,
    Version = "12.0",
    PublicNetworkAccess = "Disabled",  // Private endpoint only
    Tags = prodTags
});

var prodWEUDatabase = new Database("atp-db-prod-weu", new DatabaseArgs
{
    DatabaseName = "ATP_Prod",
    ServerName = prodWEUSQL.Name,
    ResourceGroupName = prodWEUResourceGroup.Name,
    Location = "westeurope",
    CreateMode = "Secondary",               // Creates the async geo-replica
    SourceDatabaseId = prodEUSDatabase.Id,  // Primary database in East US
    Tags = prodTags
});

Geo-Replication Setup:

# SQL Geo-Replication (Primary → Secondary)
primary:
  region: East US
  role: Primary
  readWrite: true
  replicationLag: < 5 seconds (typically)

secondary:
  region: West Europe
  role: Secondary (readable)
  readWrite: false (read-only until failover)
  replicationLag: < 15 seconds (RPO guarantee)

# Cosmos DB Multi-Region Writes
cosmosDB:
  writeRegions:
    - East US (primary)
    - West Europe (secondary)
  readRegions:
    - East US
    - West Europe
    - Southeast Asia (read-only)
  consistencyLevel: Session  # Balance consistency vs. availability
  automaticFailover: true
  multiRegionWrites: true

# Blob Storage GRS (Geo-Redundant Storage)
storage:
  primaryRegion: East US
  secondaryRegion: West Europe
  replicationType: GRS (Geo-Redundant Storage)
  readAccess: RA-GRS (Read-Access Geo-Redundant)
  replicationLag: < 15 minutes (RPO guarantee)

# Service Bus Geo-DR
serviceBus:
  primaryNamespace: atp-servicebus-prod-eus
  secondaryNamespace: atp-servicebus-prod-weu
  alias: atp-servicebus-prod  # Abstraction over primary/secondary
  failoverType: Automated (metadata only)
  messageReplication: Manual application-level replication
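
Across these services, the effective cross-region RPO for a full regional failover is bounded by the slowest replication path. A minimal sketch of that arithmetic, using the per-service figures from the configuration above (the helper function is hypothetical, not part of the platform):

```python
# Worst-case RPO for a full regional failover is dominated by the slowest
# replication path among the services that fail over together.
# Per-service values are the guarantees from the configuration above.
RPO_SECONDS = {
    "sql_geo_replication": 15,      # < 15 s replication lag guarantee
    "cosmos_multi_write": 1,        # near-zero; conservatively 1 s
    "blob_ra_grs": 15 * 60,         # < 15 minutes
    "service_bus_metadata": 0,      # Geo-DR replicates metadata only
}

def composite_rpo_seconds(rpos: dict) -> int:
    """Effective RPO = max over services carrying unreplicated state."""
    return max(rpos.values())

print(composite_rpo_seconds(RPO_SECONDS))  # → 900 (blob storage is the bottleneck)
```

This is why the DR runbooks treat blob storage's 15-minute window as the platform-wide RPO bound, even though SQL and Cosmos DB replicate far faster.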

Azure Front Door Configuration (Global Load Balancing)

Purpose: Distribute traffic across regions (80% primary, 20% secondary) with automatic failover on regional outage.

// Azure Front Door for global traffic distribution
var frontDoor = new FrontDoor("atp-frontdoor-prod", new FrontDoorArgs
{
    FrontDoorName = "atp-prod",
    ResourceGroupName = sharedResourceGroup.Name,
    Location = "global",

    EnabledState = "Enabled",

    FrontendEndpoints = new[]
    {
        new FrontendEndpointArgs
        {
            Name = "atp-frontend",
            HostName = "atp-prod.azurefd.net",
            SessionAffinityEnabledState = "Disabled",
            WebApplicationFirewallPolicyLink = new FrontendEndpointUpdateParametersWebApplicationFirewallPolicyLinkArgs
            {
                Id = wafPolicy.Id
            }
        }
    },

    BackendPools = new[]
    {
        new BackendPoolArgs
        {
            Name = "atp-backend-pool",
            LoadBalancingSettings = new SubResourceArgs { Id = "loadBalancingSettings1" },
            HealthProbeSettings = new SubResourceArgs { Id = "healthProbeSettings1" },
            Backends = new[]
            {
                // Primary region (East US) - Weight 8 (80% traffic)
                new BackendArgs
                {
                    Address = "atp-appgw-prod-eus.eastus.cloudapp.azure.com",
                    HttpPort = 80,
                    HttpsPort = 443,
                    Priority = 1,
                    Weight = 8,  // 80% traffic
                    BackendHostHeader = "api.atp.connectsoft.com",
                    EnabledState = "Enabled"
                },

                // Secondary region (West Europe) - Weight 2 (20% traffic)
                new BackendArgs
                {
                    Address = "atp-appgw-prod-weu.westeurope.cloudapp.azure.com",
                    HttpPort = 80,
                    HttpsPort = 443,
                    Priority = 1,
                    Weight = 2,  // 20% traffic
                    BackendHostHeader = "api.atp.connectsoft.com",
                    EnabledState = "Enabled"
                }
            }
        }
    },

    LoadBalancingSettings = new[]
    {
        new LoadBalancingSettingsModelArgs
        {
            Name = "loadBalancingSettings1",
            SampleSize = 4,
            SuccessfulSamplesRequired = 2,
            AdditionalLatencyMilliseconds = 50  // Prefer lower latency
        }
    },

    HealthProbeSettings = new[]
    {
        new HealthProbeSettingsModelArgs
        {
            Name = "healthProbeSettings1",
            Path = "/health",
            Protocol = "Https",
            IntervalInSeconds = 30,
            HealthProbeMethod = "GET",
            EnabledState = "Enabled"
        }
    },

    RoutingRules = new[]
    {
        new RoutingRuleArgs
        {
            Name = "atp-routing-rule",
            FrontendEndpoints = new[] { new SubResourceArgs { Id = "atp-frontend" } },
            AcceptedProtocols = new[] { "Https" },
            PatternsToMatch = new[] { "/*" },
            RouteConfiguration = new ForwardingConfigurationArgs
            {
                OdataType = "#Microsoft.Azure.FrontDoor.Models.FrontdoorForwardingConfiguration",
                BackendPool = new SubResourceArgs { Id = "atp-backend-pool" }
            },
            EnabledState = "Enabled"
        }
    },

    Tags = prodTags
});

Traffic Distribution (Normal Operation):

Global Users
Azure Front Door (atp-prod.azurefd.net)
  ↓ 80% traffic → East US (Primary)
  ↓ 20% traffic → West Europe (Secondary)
Both regions serve traffic (active-active)

Traffic Distribution (East US Failure):

Global Users
Azure Front Door (health probe detects East US down)
  ↓ 100% traffic → West Europe (Secondary promoted)
West Europe serves all traffic (failover)
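
The two distributions above fall directly out of Front Door's weighted routing: each backend's share is its weight divided by the sum of the weights of all enabled backends. A quick sketch of that arithmetic (the function is illustrative, not an Azure API):

```python
def traffic_shares(backends: dict) -> dict:
    """Traffic share per backend = weight / sum of enabled backends' weights.
    `backends` maps name -> (weight, enabled)."""
    enabled = {name: w for name, (w, on) in backends.items() if on}
    total = sum(enabled.values())
    return {name: w / total for name, w in enabled.items()}

# Normal operation: both regions enabled (weights 8 and 2 from the config above)
print(traffic_shares({"eastus": (8, True), "westeurope": (2, True)}))
# → {'eastus': 0.8, 'westeurope': 0.2}

# East US disabled after failed health probes: all traffic shifts to West Europe
print(traffic_shares({"eastus": (8, False), "westeurope": (2, True)}))
# → {'westeurope': 1.0}
```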

Automated Failover Mechanism

Health Probe Configuration:

# Azure Front Door health probe settings
healthProbe:
  path: /health
  protocol: HTTPS
  interval: 30 seconds
  method: GET
  expectedStatusCodes: [200]

  failureThreshold:
    consecutiveFailures: 3  # 3 consecutive failures (~90 seconds) trigger failover

  action:
    - Mark backend as unhealthy
    - Remove from load balancing pool
    - Route 100% traffic to healthy region
    - Send alert to platform team
    - Create incident ticket (P1)
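
With a 30-second probe interval and a threshold of 3 consecutive failures, the worst-case detection window before traffic is rerouted can be sketched as (a simplified model; it ignores probe timeout and DNS propagation):

```python
def detection_window_seconds(interval_s: int, consecutive_failures: int) -> int:
    """Approximate time from outage onset to the backend being marked
    unhealthy: one probe per interval, all of which must fail."""
    return interval_s * consecutive_failures

print(detection_window_seconds(30, 3))  # → 90
```

Tightening either parameter shortens detection but raises the risk of flapping on transient probe failures.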

Automated Failover Logic (Azure Monitor Alert):

// Azure Monitor alert for regional failure
var regionalFailureAlert = new MetricAlert("atp-regional-failure-alert", new MetricAlertArgs
{
    AlertRuleName = "atp-regional-failure-prod",
    ResourceGroupName = sharedResourceGroup.Name,
    Location = "global",
    Description = "Alert when primary region (East US) is unavailable",
    Severity = 0,  // Critical
    Enabled = true,
    Scopes = new[] { frontDoor.Id },
    EvaluationFrequency = "PT1M",  // Evaluate every minute
    WindowSize = "PT5M",  // 5-minute window
    Criteria = new MetricAlertMultipleResourceMultipleMetricCriteriaArgs
    {
        OdataType = "Microsoft.Azure.Monitor.MultipleResourceMultipleMetricCriteria",
        AllOf = new[]
        {
            new MetricCriteriaArgs
            {
                Name = "BackendHealthPercentage",
                MetricName = "BackendHealthPercentage",
                MetricNamespace = "Microsoft.Network/frontdoors",
                Operator = "LessThan",
                Threshold = 50,  // Less than 50% backends healthy
                TimeAggregation = "Average",
                Dimensions = new[]
                {
                    new MetricDimensionArgs
                    {
                        Name = "Backend",
                        Operator = "Include",
                        Values = new[] { "atp-appgw-prod-eus.eastus.cloudapp.azure.com" }
                    }
                }
            }
        }
    },
    Actions = new[]
    {
        new MetricAlertActionArgs
        {
            ActionGroupId = regionalFailoverActionGroup.Id
        }
    }
});

// Action Group for regional failover (in compiled code, declare this before the alert above, which references its Id)
var regionalFailoverActionGroup = new ActionGroup("atp-regional-failover-action-group", new ActionGroupArgs
{
    ActionGroupName = "atp-regional-failover",
    ResourceGroupName = sharedResourceGroup.Name,
    Location = "global",
    Enabled = true,
    ShortName = "RegFailover",
    EmailReceivers = new[]
    {
        new EmailReceiverArgs
        {
            Name = "PlatformTeam",
            EmailAddress = "platform-team@connectsoft.example",
            UseCommonAlertSchema = true
        },
        new EmailReceiverArgs
        {
            Name = "IncidentManager",
            EmailAddress = "incident-manager@connectsoft.example",
            UseCommonAlertSchema = true
        }
    },
    SmsReceivers = new[]
    {
        new SmsReceiverArgs
        {
            Name = "OnCallEngineer",
            CountryCode = "1",
            PhoneNumber = "+1234567890"
        }
    },
    WebhookReceivers = new[]
    {
        new WebhookReceiverArgs
        {
            Name = "PagerDuty",
            ServiceUri = "https://events.pagerduty.com/integration/<key>/enqueue",
            UseCommonAlertSchema = true
        },
        new WebhookReceiverArgs
        {
            Name = "RunFailoverRunbook",
            ServiceUri = "https://atp-automation-eus.azure-automation.net/webhooks/<webhook-token>",
            UseCommonAlertSchema = false
        }
    }
});

Failover Runbook (Automated):

<#
.SYNOPSIS
    Automated regional failover for ATP Production.
.DESCRIPTION
    Triggered by Azure Monitor when primary region (East US) is unavailable.
    Promotes secondary region (West Europe) to primary.
.NOTES
    Target RTO: 30 minutes
#>

param(
    [Parameter(Mandatory=$true)]
    [string]$FailedRegion,  # "eastus" or "westeurope"

    [Parameter(Mandatory=$false)]
    [string]$Reason = "Automated health probe failure"
)

$StartTime = Get-Date
Connect-AzAccount -Identity

Write-Output "🚨 REGIONAL FAILOVER INITIATED"
Write-Output "Failed Region: $FailedRegion"
Write-Output "Reason: $Reason"
Write-Output "Start Time: $StartTime"

# Determine target region
$targetRegion = if ($FailedRegion -eq "eastus") { "westeurope" } else { "eastus" }
$targetSuffix = if ($targetRegion -eq "eastus") { "EUS" } else { "WEU" }
$targetRG = "ConnectSoft-ATP-Prod-$targetSuffix-RG"

# Step 1: Promote SQL secondary to primary (if East US failed)
if ($FailedRegion -eq "eastus") {
    Write-Output "Promoting SQL secondary (West Europe) to primary..."

    $failoverGroup = Get-AzSqlDatabaseFailoverGroup `
        -ResourceGroupName "ConnectSoft-ATP-Prod-WEU-RG" `
        -ServerName "atp-sql-prod-weu" `
        -FailoverGroupName "atp-sql-failover-group"

    # Forced failover (allow data loss if primary completely unavailable)
    $failoverGroup | Switch-AzSqlDatabaseFailoverGroup -AllowDataLoss

    Write-Output "✅ SQL failover complete (RPO: < 15 minutes)"
}

# Step 2: Update Azure Front Door weights (route 100% to healthy region)
Write-Output "Updating Front Door backend weights..."

$frontDoorName = "atp-prod"
$backendPoolName = "atp-backend-pool"

# Disable the failed region's backend (map the region name to the short suffix used in resource names)
$failedSuffix = if ($FailedRegion -eq "eastus") { "eus" } else { "weu" }
az network front-door backend-pool backend update `
    --front-door-name $frontDoorName `
    --pool-name $backendPoolName `
    --resource-group "ConnectSoft-ATP-Shared-RG" `
    --address "atp-appgw-prod-$failedSuffix.$FailedRegion.cloudapp.azure.com" `
    --enabled-state Disabled

Write-Output "✅ Failed region removed from load balancing pool"

# Step 3: Scale up secondary region AKS (handle 100% traffic)
if ($targetRegion -eq "westeurope") {
    Write-Output "Scaling up West Europe AKS to handle full traffic..."

    az aks nodepool scale `
        --resource-group $targetRG `
        --cluster-name "atp-aks-prod-weu" `
        --name "user" `
        --node-count 10  # Scale to maximum capacity

    Write-Output "✅ Secondary region scaled to 10 nodes"
}

# Step 4: Validate secondary region health
Write-Output "Validating secondary region health..."

$healthCheckUrl = "https://api.atp.connectsoft.com/health"  # Front Door routes to healthy region
$response = Invoke-RestMethod -Uri $healthCheckUrl -Method Get

if ($response.status -ne "Healthy") {
    Write-Error "❌ Secondary region unhealthy; manual intervention required"
    exit 1
}

Write-Output "✅ Secondary region healthy and serving traffic"

# Step 5: Create incident ticket
$incident = az boards work-item create `
    --title "Regional Failover: $FailedRegion Unavailable" `
    --type "Incident" `
    --description "Automated failover executed from $FailedRegion to $targetRegion.`n`nReason: $Reason`n`nStart Time: $StartTime`n`nStatus: Failover complete; $targetRegion serving 100% traffic" `
    --assigned-to "platform-team@connectsoft.example" `
    --fields Priority=1 Severity="1 - Critical" `
    --output json | ConvertFrom-Json

Write-Output "✅ Incident ticket created: $($incident.id)"

# Step 6: Update status page
$statusPageUpdate = @{
    status = "Degraded"
    message = "ATP experienced a regional outage in $FailedRegion. Traffic has been automatically rerouted to $targetRegion. All services are operational."
    timestamp = (Get-Date).ToUniversalTime().ToString("o")
} | ConvertTo-Json

Invoke-RestMethod `
    -Uri "https://status.atp.connectsoft.com/api/incidents" `
    -Method POST `
    -Body $statusPageUpdate `
    -ContentType "application/json" `
    -Headers @{ "Authorization" = "Bearer $(Get-AutomationVariable -Name 'StatusPageApiKey')" }

# Step 7: Send notifications
$emailBody = @"
ATP Regional Failover Notification

Failed Region: $FailedRegion
Target Region: $targetRegion
Start Time: $StartTime
Completion Time: $(Get-Date)
RTO Achieved: $((Get-Date) - $StartTime | Select-Object -ExpandProperty TotalMinutes) minutes

All services are operational. No action required from tenants.

Incident Ticket: https://dev.azure.com/ConnectSoft/ATP/_workitems/edit/$($incident.id)
Status Page: https://status.atp.connectsoft.com
"@

Send-MailMessage `
    -From "platform-team@connectsoft.example" `
    -To "cto@connectsoft.example", "platform-team@connectsoft.example" `
    -Subject "🚨 ATP Regional Failover: $FailedRegion → $targetRegion" `
    -Body $emailBody `
    -SmtpServer "smtp.office365.com" `
    -UseSsl `
    -Credential (Get-AutomationPSCredential -Name "EmailCredential")

$elapsed = (Get-Date) - $StartTime
Write-Output "✅ REGIONAL FAILOVER COMPLETE"
Write-Output "Total RTO: $($elapsed.TotalMinutes) minutes (Target: 30 minutes)"

Failback Procedure (Return to Primary Region)

Purpose: Restore normal operations to primary region (East US) after recovery.

<#
.SYNOPSIS
    Failback to primary region after DR event.
.DESCRIPTION
    Restores primary region (East US) and rebalances traffic.
.NOTES
    Execute only after primary region fully recovered and validated.
#>

param(
    # A default value is ignored when Mandatory=$true, so this parameter is optional
    [Parameter(Mandatory=$false)]
    [string]$PrimaryRegion = "eastus"
)

$StartTime = Get-Date
Connect-AzAccount -Identity

Write-Output "🔄 REGIONAL FAILBACK INITIATED"
Write-Output "Primary Region: $PrimaryRegion"

# Step 1: Validate primary region infrastructure health
Write-Output "Validating primary region infrastructure..."

$aksCluster = Get-AzAksCluster `
    -ResourceGroupName "ConnectSoft-ATP-Prod-EUS-RG" `
    -Name "atp-aks-prod-eus"

if ($aksCluster.PowerState.Code -ne "Running") {
    Write-Error "❌ Primary AKS cluster not running; aborting failback"
    exit 1
}

Write-Output "✅ Primary infrastructure healthy"

# Step 2: Synchronize SQL databases (secondary → primary)
Write-Output "Synchronizing databases..."

# Force sync from secondary (now primary) to original primary (now secondary)
$failoverGroup = Get-AzSqlDatabaseFailoverGroup `
    -ResourceGroupName "ConnectSoft-ATP-Prod-WEU-RG" `
    -ServerName "atp-sql-prod-weu" `
    -FailoverGroupName "atp-sql-failover-group"

# Planned failover (no data loss)
$failoverGroup | Switch-AzSqlDatabaseFailoverGroup

Write-Output "✅ SQL databases synchronized; East US is now primary again"

# Step 3: Gradually shift traffic back to primary (phased approach)
Write-Output "Phasing traffic back to primary region..."

# Phase 1: 20% to primary, 80% to secondary
az network front-door backend-pool backend update `
    --front-door-name "atp-prod" `
    --pool-name "atp-backend-pool" `
    --resource-group "ConnectSoft-ATP-Shared-RG" `
    --address "atp-appgw-prod-eus.eastus.cloudapp.azure.com" `
    --weight 2 `
    --enabled-state Enabled

Start-Sleep -Seconds 600  # Monitor for 10 minutes

# Phase 2: 50% to primary, 50% to secondary
az network front-door backend-pool backend update `
    --front-door-name "atp-prod" `
    --pool-name "atp-backend-pool" `
    --resource-group "ConnectSoft-ATP-Shared-RG" `
    --address "atp-appgw-prod-eus.eastus.cloudapp.azure.com" `
    --weight 5

Start-Sleep -Seconds 600  # Monitor for 10 minutes

# Phase 3: 80% to primary, 20% to secondary (normal state)
az network front-door backend-pool backend update `
    --front-door-name "atp-prod" `
    --pool-name "atp-backend-pool" `
    --resource-group "ConnectSoft-ATP-Shared-RG" `
    --address "atp-appgw-prod-eus.eastus.cloudapp.azure.com" `
    --weight 8

az network front-door backend-pool backend update `
    --front-door-name "atp-prod" `
    --pool-name "atp-backend-pool" `
    --resource-group "ConnectSoft-ATP-Shared-RG" `
    --address "atp-appgw-prod-weu.westeurope.cloudapp.azure.com" `
    --weight 2

Write-Output "✅ Traffic rebalanced to normal state (80/20)"

# Step 4: Scale down secondary region (cost optimization)
az aks nodepool scale `
    --resource-group "ConnectSoft-ATP-Prod-WEU-RG" `
    --cluster-name "atp-aks-prod-weu" `
    --name "user" `
    --node-count 3  # Return to normal capacity

Write-Output "✅ Secondary region scaled back to normal"

# Step 5: Close incident ticket
$incidentId = (az boards work-item query `
    --wiql "SELECT [System.Id] FROM WorkItems WHERE [System.Title] CONTAINS 'Regional Failover' AND [System.State] = 'Active'" `
    --output json | ConvertFrom-Json).workItems[0].id

az boards work-item update `
    --id $incidentId `
    --state "Resolved" `
    --fields "Microsoft.VSTS.Common.ResolvedReason=Fixed" `
             "Microsoft.VSTS.Common.ResolvedBy=automation@connectsoft.example"

$elapsed = (Get-Date) - $StartTime
Write-Output "✅ REGIONAL FAILBACK COMPLETE"
Write-Output "Total Failback Time: $($elapsed.TotalMinutes) minutes"

DR Testing Schedule & Procedures

Production DR Testing (Weekly, Non-Disruptive):

drTestingStrategy:
  frequency: Weekly (every Sunday 2 AM ET)
  type: Non-disruptive (traffic shifting only)

  procedure:
    1. Gradually shift traffic from primary → secondary (10% increments)
    2. Monitor metrics for 1 hour (error rate, latency, throughput)
    3. If metrics healthy, continue; if degraded, rollback
    4. Once 100% traffic on secondary, validate for 30 minutes
    5. Shift traffic back to primary (reverse process)
    6. Document actual RTO/RPO achieved

  acceptanceCriteria:
    - Zero errors during traffic shift
    - Latency p95 < 1.2x baseline
    - Successful validation queries in secondary region
    - RTO < 30 minutes (measured)

  rollback:
    - Automatic if error rate > 1%
    - Automatic if latency p99 > 3x baseline
    - Manual abort via Azure Portal
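
The automatic rollback triggers above reduce to a simple predicate evaluated against live metrics during each shift phase. A minimal sketch (the function and thresholds mirror the acceptance criteria; names are illustrative):

```python
def should_rollback(error_rate: float, latency_p99_ms: float,
                    baseline_p99_ms: float) -> bool:
    """Automatic rollback per the DR test criteria above:
    error rate above 1%, or p99 latency above 3x the baseline."""
    return error_rate > 0.01 or latency_p99_ms > 3 * baseline_p99_ms

print(should_rollback(0.002, 450, 200))  # healthy shift → False
print(should_rollback(0.030, 450, 200))  # error budget blown → True
print(should_rollback(0.002, 700, 200))  # latency regression → True
```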

DR Testing Automation (Azure Function):

// Automated DR testing (weekly)
[FunctionName("WeeklyDRTest")]
public async Task RunAsync(
    [TimerTrigger("0 0 2 * * 0")] TimerInfo timer,  // Every Sunday 02:00 (UTC unless WEBSITE_TIME_ZONE is set)
    ILogger log)
{
    log.LogInformation("Starting weekly DR test...");

    var startTime = DateTime.UtcNow;
    var frontDoorClient = new FrontDoorManagementClient(new DefaultAzureCredential());
    var testResults = new List<string>();

    try
    {
        // Phase 1: Shift 20% traffic to secondary
        await ShiftTrafficAsync(frontDoorClient, primaryWeight: 80, secondaryWeight: 20);
        await Task.Delay(TimeSpan.FromMinutes(10));
        await ValidateMetricsAsync(log);

        // Phase 2: Shift 50% traffic to secondary
        await ShiftTrafficAsync(frontDoorClient, primaryWeight: 50, secondaryWeight: 50);
        await Task.Delay(TimeSpan.FromMinutes(10));
        await ValidateMetricsAsync(log);

        // Phase 3: Shift 100% traffic to secondary
        await ShiftTrafficAsync(frontDoorClient, primaryWeight: 0, secondaryWeight: 100);
        await Task.Delay(TimeSpan.FromMinutes(30));  // Validate for 30 minutes
        await ValidateMetricsAsync(log);

        // Phase 4: Shift back to normal (80/20)
        await ShiftTrafficAsync(frontDoorClient, primaryWeight: 80, secondaryWeight: 20);

        var elapsed = DateTime.UtcNow - startTime;
        log.LogInformation($"✅ DR test complete. Duration: {elapsed.TotalMinutes:F1} minutes");

        // Create test report
        await CreateDRTestReportAsync(elapsed, testResults, success: true);
    }
    catch (Exception ex)
    {
        log.LogError(ex, "❌ DR test failed");

        // Rollback to normal state
        await ShiftTrafficAsync(frontDoorClient, primaryWeight: 80, secondaryWeight: 20);

        await CreateDRTestReportAsync(DateTime.UtcNow - startTime, testResults, success: false);
        throw;
    }
}

private async Task ValidateMetricsAsync(ILogger log)
{
    var appInsightsClient = new ApplicationInsightsDataClient(new DefaultAzureCredential());

    // Query error rate
    var errorRate = await appInsightsClient.Metrics.GetAsync(
        appId: "atp-appinsights-prod-eus",
        metricId: "requests/failed",
        timespan: "PT10M",  // Last 10 minutes
        aggregation: new[] { "avg" }
    );

    var avgErrorRate = errorRate.Value.Segments[0].Metrics["requests/failed"].Avg;

    if (avgErrorRate > 0.01)  // >1% error rate
    {
        throw new Exception($"Error rate too high during DR test: {avgErrorRate:P2}");
    }

    log.LogInformation($"✅ Metrics healthy: Error rate {avgErrorRate:P2}");
}

Geo-Replication Configuration

SQL Database Geo-Replication

// SQL Failover Group (automatic failover)
var sqlFailoverGroup = new FailoverGroup("atp-sql-failover-group", new FailoverGroupArgs
{
    FailoverGroupName = "atp-sql-failover-group",
    ResourceGroupName = prodEUSResourceGroup.Name,
    ServerName = prodEUSSQL.Name,

    // Partner server (secondary region)
    PartnerServers = new[]
    {
        new PartnerInfoArgs
        {
            Id = prodWEUSQL.Id
        }
    },

    // Databases to replicate
    Databases = new[]
    {
        prodEUSDatabase.Id
    },

    // Read-write listener endpoint
    ReadWriteEndpoint = new FailoverGroupReadWriteEndpointArgs
    {
        FailoverPolicy = "Automatic",
        FailoverWithDataLossGracePeriodMinutes = 60  // Allow 1 hour for primary to recover before forcing failover
    },

    // Read-only listener endpoint (route to nearest replica)
    ReadOnlyEndpoint = new FailoverGroupReadOnlyEndpointArgs
    {
        FailoverPolicy = "Disabled"  // Read-only queries don't failover
    },

    Tags = prodTags
});

SQL Connection String (Failover Group Aware):

{
  "ConnectionStrings": {
    "DefaultConnection": "Server=atp-sql-failover-group.database.windows.net;Database=ATP_Prod;User Id=$(DbUser);Password=$(DbPassword);MultipleActiveResultSets=true;ApplicationIntent=ReadWrite"
  }
}

Explanation: The application connects to the failover group endpoint (atp-sql-failover-group.database.windows.net), which automatically routes to the current primary region. Upon failover, DNS is updated to point to the new primary (West Europe) with zero application code changes.
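
During the failover-group DNS switch, clients see a short window of transient connection failures, so callers should retry with backoff rather than surface errors. A minimal, database-agnostic sketch (the helper is hypothetical, not part of any driver):

```python
import time

def with_failover_retry(operation, attempts: int = 5, base_delay_s: float = 1.0):
    """Retry a callable that raises ConnectionError on transient failures,
    with exponential backoff (1s, 2s, 4s, ...). Re-raises after the last
    attempt. Models client behavior during a failover-group DNS switch."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay_s * (2 ** attempt))
```

In .NET, the equivalent is typically handled by enabling connection resiliency (e.g. a retry-on-transient-failure execution strategy) rather than hand-rolled loops.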


Cosmos DB Multi-Region Configuration

// Cosmos DB with multi-region writes (active-active)
var cosmosAccount = new DatabaseAccount("atp-cosmos-prod", new DatabaseAccountArgs
{
    AccountName = "atp-cosmos-prod",
    ResourceGroupName = sharedResourceGroup.Name,
    Location = "eastus",  // Primary location

    // Multi-region write configuration
    Locations = new[]
    {
        new LocationArgs
        {
            LocationName = "eastus",
            FailoverPriority = 0,  // Primary
            IsZoneRedundant = true
        },
        new LocationArgs
        {
            LocationName = "westeurope",
            FailoverPriority = 1,  // Secondary
            IsZoneRedundant = true
        },
        new LocationArgs
        {
            LocationName = "southeastasia",
            FailoverPriority = 2,  // Read-only tertiary
            IsZoneRedundant = false
        }
    },

    // Consistency level (balance consistency vs. availability)
    ConsistencyPolicy = new ConsistencyPolicyArgs
    {
        DefaultConsistencyLevel = "Session",  // Session consistency for ATP
        MaxIntervalInSeconds = 5,
        MaxStalenessPrefix = 100
    },

    // Enable multi-region writes
    EnableMultipleWriteLocations = true,
    EnableAutomaticFailover = true,

    // Backup configuration
    BackupPolicy = new ContinuousModeBackupPolicyArgs
    {
        Type = "Continuous",  // Continuous backup (point-in-time restore)
        ContinuousModeProperties = new ContinuousModePropertiesArgs
        {
            Tier = "Continuous7Days"  // 7-day continuous backup
        }
    },

    Tags = prodTags
});

Cosmos DB Failover (Automated):

  • Health Monitoring: Azure monitors Cosmos DB availability per region.
  • Automatic Failover: If primary region unavailable >5 minutes, promote secondary to primary.
  • Multi-Write: Both regions accept writes simultaneously (conflict resolution via last-write-wins).
  • RPO: Near-zero in practice (cross-region replication is asynchronous but typically completes within seconds).
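
With multi-region writes, concurrent updates to the same document in different regions produce conflicts; Cosmos DB's default policy resolves them last-write-wins on a numeric conflict-resolution path (`_ts` by default). A simplified sketch of that resolution (tie-breaking on region name is an assumption here to keep the example deterministic):

```python
def last_write_wins(versions: list) -> dict:
    """Pick the version with the highest conflict-resolution timestamp (_ts),
    mirroring Cosmos DB's default LWW policy; ties break on region name
    (an illustrative choice, not Cosmos DB behavior)."""
    return max(versions, key=lambda v: (v["_ts"], v["region"]))

doc_eus = {"id": "audit-1", "_ts": 1710000000, "region": "eastus", "status": "sealed"}
doc_weu = {"id": "audit-1", "_ts": 1710000005, "region": "westeurope", "status": "amended"}
print(last_write_wins([doc_eus, doc_weu])["status"])  # → amended
```

For audit records, where losing a write is unacceptable, a custom conflict-resolution stored procedure or conflict feed processing may be preferable to plain LWW.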

Blob Storage Geo-Replication

// Blob Storage with Read-Access Geo-Redundant Storage (RA-GRS)
var prodStorage = new StorageAccount("atpstorageprodeus", new StorageAccountArgs
{
    AccountName = "atpstorageprodeus",
    ResourceGroupName = prodEUSResourceGroup.Name,
    Location = "eastus",

    Sku = new SkuArgs
    {
        Name = "Standard_RAGRS"  // Read-Access Geo-Redundant Storage
    },

    Kind = "StorageV2",

    // Blob configuration
    BlobServices = new BlobServicesArgs
    {
        // ... (WORM, versioning, etc.)
    },

    // Geo-replication to the paired region (West Europe) is automatic with the
    // Standard_RAGRS SKU; GeoReplicationStats is a read-only property of the
    // deployed account, not a configurable input.

    Tags = prodTags
});

Storage Failover (Manual):

# Initiate storage account failover to secondary region
az storage account failover \
  --name atpstorageprodeus \
  --resource-group ConnectSoft-ATP-Prod-EUS-RG \
  --yes

# Note: This makes West Europe the new primary
# Geo-replication will re-establish to a new secondary region after failover

Service Bus Geo-Disaster Recovery

// Service Bus Geo-DR pairing
var serviceBusGeoAlias = new DisasterRecoveryConfig("atp-servicebus-geo-dr", new DisasterRecoveryConfigArgs
{
    Alias = "atp-servicebus-prod",  // Stable endpoint name
    ResourceGroupName = prodEUSResourceGroup.Name,
    NamespaceName = prodEUSServiceBus.Name,

    // Partner namespace (secondary region)
    PartnerNamespace = prodWEUServiceBus.Id,

    Tags = prodTags
});

Service Bus Connection String (Geo-DR Aware):

{
  "ConnectionStrings": {
    "ServiceBus": "Endpoint=sb://atp-servicebus-prod.servicebus.windows.net/;SharedAccessKeyName=RootManageSharedAccessKey;SharedAccessKey=$(ServiceBusKey)"
  }
}

Explanation: The application connects to the Geo-DR alias (atp-servicebus-prod.servicebus.windows.net), which routes to the active region. Upon failover, the alias DNS switches to the secondary region within minutes.


High Availability Within Region (Zone Redundancy)

Purpose: Protect against single datacenter failures within a region using Azure Availability Zones.

Zone-Redundant Resources (Production East US):

// AKS with zone-redundant node pools
var aksNodePool = new ManagedClusterAgentPoolProfileArgs
{
    Name = "system",
    Count = 3,
    VmSize = "Standard_D4s_v5",
    AvailabilityZones = new[] { "1", "2", "3" },  // Spread across 3 zones
    EnableAutoScaling = true,
    MinCount = 3,
    MaxCount = 10
};

// SQL Database with zone redundancy
var sqlDatabase = new Database("atp-db-prod-eus", new DatabaseArgs
{
    DatabaseName = "ATP_Prod",
    ZoneRedundant = true,  // Synchronous replication across 3 zones
    ReadScale = "Enabled",  // Read replicas in each zone
    // ...
});

// Application Gateway with zone redundancy
var appGateway = new ApplicationGateway("atp-appgw-prod-eus", new ApplicationGatewayArgs
{
    // ...
    Zones = new[] { "1", "2", "3" },  // Deploy instances in all zones
});

// Cosmos DB zone-redundant
var cosmosAccount = new DatabaseAccount("atp-cosmos-prod", new DatabaseAccountArgs
{
    Locations = new[]
    {
        new LocationArgs
        {
            LocationName = "eastus",
            IsZoneRedundant = true  // Zone-redundant within region
        }
    }
});

Zone Redundancy Benefits:

  • SLA: 99.99% (zone-redundant) vs. 99.95% (single zone), a 5x smaller annual downtime budget (~53 vs. ~263 minutes/year).
  • Failure Isolation: Single datacenter outage → automatic failover to other zones within seconds.
  • No RPO: Synchronous replication within region (zero data loss).
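
The SLA comparison above translates into concrete downtime budgets; the arithmetic is just the complement of availability over a year:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def allowed_downtime_minutes(sla: float) -> float:
    """Annual downtime budget implied by an availability SLA."""
    return MINUTES_PER_YEAR * (1 - sla)

single_zone = allowed_downtime_minutes(0.9995)     # 99.95% SLA
zone_redundant = allowed_downtime_minutes(0.9999)  # 99.99% SLA
print(round(single_zone, 1), round(zone_redundant, 1), round(single_zone / zone_redundant))
# → 262.8 52.6 5
```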

Backup Strategy Per Environment

Dev Environment Backups

Strategy: No backups (recreate from Git + IaC).

backups:
  enabled: false
  rationale: Git is source of truth; recreate faster than restore
  costSavings: ~$50/month (no backup storage)

Test Environment Backups

Strategy: Daily automated backups with 14-day retention.

backups:
  sqlDatabase:
    frequency: Daily (automated)
    retention: 14 days
    type: Automated point-in-time restore
    cost: Included in SQL Database cost

  redis:
    frequency: None (Standard tier doesn't support persistence)
    recreate: Rebuild cache from database on restore

  storage:
    frequency: None (test data is synthetic)
    recreate: Re-run data seeding scripts

Staging Environment Backups

Strategy: Hourly snapshots with 7-day retention + weekly long-term backups.

// Staging SQL Database backup configuration
var stagingSqlDatabase = new Database("atp-db-staging-eus", new DatabaseArgs
{
    DatabaseName = "ATP_Staging",
    BackupRetentionPolicyInDays = 35,  // 35-day short-term retention
    BackupStorageRedundancy = "Geo",  // Geo-redundant backups

    // Long-term retention
    LongTermRetentionPolicy = new DatabaseLongTermRetentionPolicyArgs
    {
        WeeklyRetention = "P4W",   // 4 weeks
        MonthlyRetention = "P3M",  // 3 months
        YearlyRetention = "P0Y"    // Disabled
    },

    // ... other configuration
});

// Redis Cache persistence (Staging uses Premium tier)
var stagingRedis = new RedisResource("atp-redis-staging-eus", new RedisResourceArgs
{
    Name = "atp-redis-staging-eus",
    ResourceGroupName = stagingResourceGroup.Name,
    Location = "eastus",
    Sku = new SkuArgs
    {
        Name = "Premium",
        Family = "P",
        Capacity = 1  // P1 (6 GB)
    },

    // RDB persistence (snapshots to storage account)
    RedisConfiguration = new RedisCommonPropertiesRedisConfigurationArgs
    {
        RdbBackupEnabled = "true",
        RdbBackupFrequency = "60",  // 60 minutes
        RdbBackupMaxSnapshotCount = "1",
        RdbStorageConnectionString = stagingStorageConnectionString
    },

    Tags = stagingTags
});

Production Environment Backups

Strategy: Continuous backups with point-in-time restore (35 days for SQL, 7 days for Cosmos DB) + long-term retention (7 years).

backups:
  sqlDatabase:
    frequency: Continuous (transaction log backups every 5-10 minutes)
    retention: 
      - Short-term: 35 days (point-in-time restore)
      - Long-term: 7 years (weekly full + monthly)
    redundancy: Geo-redundant (replicated to secondary region)
    RPO: < 5 minutes

  cosmosDB:
    frequency: Continuous (change feed based)
    retention: 7 days (continuous backup mode)
    redundancy: Multi-region (automatic)
    RPO: < 1 minute

  redis:
    frequency: RDB snapshots every 15 minutes
    retention: 7 days
    redundancy: Replicated to geo-paired region
    RPO: < 15 minutes

  blobStorage:
    frequency: Real-time geo-replication
    retention: 7 years (with WORM)
    redundancy: RA-GRS (Read-Access Geo-Redundant)
    RPO: < 15 minutes

  serviceBus:
    frequency: N/A (Geo-DR replicates metadata, not messages)
    strategy: Application-level message replication
    RPO: 0 (messages in-flight may be lost on failover)
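
Because Service Bus Geo-DR replicates only metadata, in-flight messages are lost on failover unless the application mirrors them itself. One common pattern is dual-publish to both namespaces with consumer-side de-duplication on message id; a minimal sketch under that assumption (the sender callables stand in for real queue clients, which is a hypothetical interface, not the Azure SDK):

```python
class DualPublisher:
    """Publish each message to both the primary and secondary namespace;
    a secondary outage must not block the primary path."""

    def __init__(self, primary_send, secondary_send):
        self.primary_send = primary_send      # callable(message_id, body)
        self.secondary_send = secondary_send  # callable(message_id, body)

    def publish(self, message_id: str, body: bytes) -> None:
        self.primary_send(message_id, body)
        try:
            self.secondary_send(message_id, body)  # best-effort mirror
        except ConnectionError:
            pass  # degrade to single-region publishing

def deduplicate(messages):
    """Consumer-side de-dup: keep the first occurrence of each message id."""
    seen, unique = set(), []
    for mid, body in messages:
        if mid not in seen:
            seen.add(mid)
            unique.append((mid, body))
    return unique
```

After a failover, consumers drain both queues and discard duplicates, which trades extra publish cost for a message-level RPO of zero.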

Production Backup Validation (Daily):

#!/bin/bash
# validate-backups-prod.sh

echo "Validating Production backups..."

# Step 1: Verify the SQL point-in-time restore window exists
EARLIEST_SQL_RESTORE=$(az sql db show \
  --name ATP_Prod \
  --server atp-sql-prod-eus \
  --resource-group ConnectSoft-ATP-Prod-EUS-RG \
  --query "earliestRestoreDate" -o tsv)

if [ -z "$EARLIEST_SQL_RESTORE" ]; then
  echo "❌ No SQL restore point available"
  exit 1
fi

echo "✅ Earliest SQL restore point: $EARLIEST_SQL_RESTORE"

# Step 2: Verify Cosmos DB continuous backup mode
COSMOS_BACKUP_MODE=$(az cosmosdb show \
  --name atp-cosmos-prod \
  --resource-group ConnectSoft-ATP-Shared-RG \
  --query "backupPolicy.type" -o tsv)

if [ "$COSMOS_BACKUP_MODE" != "Continuous" ]; then
  echo "❌ Cosmos DB not in continuous backup mode"
  exit 1
fi

echo "✅ Cosmos DB continuous backup enabled"

# Step 3: Verify blob geo-replication status
GEO_REPL_STATUS=$(az storage account show \
  --name atpstorageprodeus \
  --resource-group ConnectSoft-ATP-Prod-EUS-RG \
  --query "geoReplicationStats.status" -o tsv)

if [ "$GEO_REPL_STATUS" != "Live" ]; then
  echo "❌ Blob geo-replication not live"
  exit 1
fi

echo "✅ Blob geo-replication status: Live"

# Step 4: Test restore operation (non-disruptive)
# Restore to a test database to validate backup integrity
az sql db restore \
  --dest-name "ATP_Prod_BackupTest_$(date +%Y%m%d)" \
  --resource-group ConnectSoft-ATP-Prod-EUS-RG \
  --server atp-sql-prod-eus \
  --time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --name "ATP_Prod" \
  --service-objective "P1" \
  --edition "Premium"

echo "✅ Backup restore test successful"

# Clean up test database after validation
az sql db delete \
  --name "ATP_Prod_BackupTest_$(date +%Y%m%d)" \
  --resource-group ConnectSoft-ATP-Prod-EUS-RG \
  --server atp-sql-prod-eus \
  --yes

echo "✅ Production backup validation complete"

RTO/RPO Validation & Reporting

Purpose: Measure and validate actual RTO/RPO achieved during DR tests and incidents.

DR Metrics Tracking:

// Track DR test results in Application Insights
public class DRTestMetrics
{
    private readonly TelemetryClient _telemetry;

    public DRTestMetrics(TelemetryClient telemetry) => _telemetry = telemetry;

    public async Task RecordDRTestAsync(DRTestResult result)
    {
        var properties = new Dictionary<string, string>
        {
            ["TestType"] = result.TestType,  // "Automated" or "Manual"
            ["Environment"] = result.Environment,
            ["SourceRegion"] = result.SourceRegion,
            ["TargetRegion"] = result.TargetRegion,
            ["Success"] = result.Success.ToString(),
            ["FailureReason"] = result.FailureReason ?? "N/A"
        };

        var metrics = new Dictionary<string, double>
        {
            ["RTOMinutes"] = result.RTOMinutes,
            ["RPOMinutes"] = result.RPOMinutes,
            ["TargetRTOMinutes"] = result.TargetRTOMinutes,
            ["TargetRPOMinutes"] = result.TargetRPOMinutes,
            ["DataLossRecords"] = result.DataLossRecords
        };

        _telemetry.TrackEvent("DRTestCompleted", properties, metrics);

        // Create work item if RTO/RPO targets not met
        if (result.RTOMinutes > result.TargetRTOMinutes || result.RPOMinutes > result.TargetRPOMinutes)
        {
            await CreateDRImprovementTaskAsync(result);
        }
    }
}

DR Dashboard (Application Insights KQL):

// DR test success rate over last 6 months
customEvents
| where name == "DRTestCompleted"
| where timestamp > ago(180d)
| extend Environment = tostring(customDimensions.Environment)
| extend Success = tobool(customDimensions.Success)
| extend RTOMinutes = todouble(customMeasurements.RTOMinutes)
| extend TargetRTOMinutes = todouble(customMeasurements.TargetRTOMinutes)
| extend RTOMet = RTOMinutes <= TargetRTOMinutes
| summarize 
    TotalTests = count(),
    SuccessfulTests = countif(Success == true),
    SuccessRate = 100.0 * countif(Success == true) / count(),
    AvgRTO = avg(RTOMinutes),
    TargetRTO = max(TargetRTOMinutes),
    RTOMetPercentage = 100.0 * countif(RTOMet == true) / count()
  by Environment
| order by SuccessRate asc
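Outside of Application Insights, the same roll-up can be computed from raw DR test records. A hedged Python equivalent of the KQL summarize above (the `summarize_dr_tests` helper is illustrative; field names mirror the custom event schema):

```python
def summarize_dr_tests(tests: list) -> dict:
    """Aggregate DR test results the way the KQL query does:
    success rate, average RTO, and percentage of tests meeting the RTO target.
    Assumes a non-empty list of test records."""
    total = len(tests)
    successful = sum(1 for t in tests if t["Success"])
    rto_met = sum(1 for t in tests if t["RTOMinutes"] <= t["TargetRTOMinutes"])
    return {
        "TotalTests": total,
        "SuccessRate": 100.0 * successful / total,
        "AvgRTO": sum(t["RTOMinutes"] for t in tests) / total,
        "RTOMetPercentage": 100.0 * rto_met / total,
    }

tests = [
    {"Success": True, "RTOMinutes": 22, "TargetRTOMinutes": 30},
    {"Success": True, "RTOMinutes": 35, "TargetRTOMinutes": 30},
    {"Success": False, "RTOMinutes": 90, "TargetRTOMinutes": 30},
]
print(summarize_dr_tests(tests))
```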

Business Continuity Plan (BCP)

ATP Business Continuity Objectives:

businessContinuity:
  missionCriticalServices:
    - Audit Event Ingestion (100% uptime required)
    - Audit Event Query (99.9% uptime required)
    - Tamper-Evidence Validation (99.9% uptime required)

  tolerableDowntime:
    - Ingestion: 0 minutes (cannot lose audit events)
    - Query: 30 minutes (tenants can retry)
    - Export: 4 hours (background process)

  dataCriticality:
    - Audit Events: Mission-critical (immutable, compliance)
    - Configuration: High (backed up hourly)
    - Logs/Metrics: Medium (7-day recovery acceptable)

Incident Response Runbook:

# P0 Incident: Regional Outage
incidentResponse:
  detection:
    - Azure Monitor health probe failures (3 consecutive)
    - PagerDuty alert to on-call engineer
    - Automated failover triggered (if configured)

  response:
    1. Validate automated failover executed
    2. Confirm secondary region serving traffic
    3. Check error rates, latency, throughput
    4. Update status page (Degraded: Regional outage)
    5. Notify tenants via email/webhook

  recovery:
    1. Investigate primary region root cause
    2. Validate primary region health
    3. Execute gradual failback (phased traffic shift)
    4. Close incident ticket
    5. Post-mortem analysis within 48 hours

  communication:
    - Status page: https://status.atp.connectsoft.com
    - Tenant notifications: Email + webhook
    - Internal: Slack #incidents channel
    - Executive: Email to CTO/CEO if downtime > 1 hour

Disaster Scenarios & Response

Scenario 1: Regional Azure Outage (East US)

Detection: Front Door health probes fail for East US backends (3 consecutive failures over 90 seconds).

Automated Response:

T+0:00 - Health probes detect East US unavailable
T+0:01 - Front Door removes East US from pool
T+0:01 - 100% traffic routed to West Europe
T+0:02 - Azure Monitor alert fires
T+0:02 - PagerDuty pages on-call engineer
T+0:02 - Incident ticket auto-created (P0)
T+0:03 - Status page updated (Degraded)
T+0:05 - Tenant notification emails sent

Manual Response (On-Call Engineer):

T+0:05 - Engineer acknowledges PagerDuty alert
T+0:10 - Validate West Europe serving traffic correctly
T+0:15 - Check Azure status page for East US outage
T+0:20 - Scale up West Europe AKS to handle 100% traffic
T+0:30 - Validate metrics (error rate, latency, throughput)
T+0:35 - Update status page with ETA
T+1:00 - Monitor for next hour; coordinate with Azure support

RTO Achieved: ~5 minutes (automated failover)
RPO Achieved: ~15 minutes (geo-replication lag)


Scenario 2: SQL Database Corruption

Detection: Application errors; integrity check failures.

Response Procedure:

# Restore SQL Database from point-in-time (before corruption)
$corruptionTime = (Get-Date).AddHours(-2)  # A point in time known to be before the corruption

# Step 1: Restore to a new database
Restore-AzSqlDatabase `
    -ResourceGroupName "ConnectSoft-ATP-Prod-EUS-RG" `
    -ServerName "atp-sql-prod-eus" `
    -TargetDatabaseName "ATP_Prod_Restored" `
    -ServiceObjectiveName "P4" `
    -Edition "Premium" `
    -PointInTime $corruptionTime

# Step 2: Validate restored database integrity
$integrityCheck = Invoke-Sqlcmd `
    -ServerInstance "atp-sql-prod-eus.database.windows.net" `
    -Database "ATP_Prod_Restored" `
    -Query "EXEC sp_ATP_ValidateIntegrity" `
    -Username "sqladmin" `
    -Password $(Get-AutomationVariable -Name "SqlAdminPassword")

if ($integrityCheck.IntegrityValid -ne $true) {
    Write-Error "❌ Restored database integrity check failed"
    exit 1
}

# Step 3: Swap databases (minimal downtime)
# Use failover group to switch active database

RTO: ~30 minutes (restore + validation)
RPO: ~2 hours (restore to before corruption time)


Scenario 3: AKS Cluster Failure

Detection: All pods in primary region unhealthy; Kubernetes API unreachable.

Automated Response:

T+0:00 - Health checks fail for all pods
T+0:01 - Front Door marks East US unhealthy
T+0:01 - Traffic routed to West Europe AKS
T+0:02 - Azure Monitor alert (AKS cluster down)
T+0:05 - Autoscale West Europe AKS from 3 → 10 nodes
T+0:10 - All pods running in West Europe
T+0:15 - Traffic fully served from West Europe

Manual Recovery:

# Diagnose AKS cluster failure
az aks show \
  --name atp-aks-prod-eus \
  --resource-group ConnectSoft-ATP-Prod-EUS-RG \
  --query "powerState"

# If cluster stopped, restart
az aks start \
  --name atp-aks-prod-eus \
  --resource-group ConnectSoft-ATP-Prod-EUS-RG

# If cluster corrupted, recreate from IaC
pulumi up --target urn:pulumi:atp-prod-eus::atp-infrastructure::azure-native:containerservice:ManagedCluster::atp-aks-prod-eus

RTO: ~15 minutes (automated failover to West Europe)
RPO: 0 (stateless AKS pods; data in geo-replicated databases)


Summary

  • RPO/RTO Targets: Graduated from 24h/4h (Dev recreate) to 15min/30min (Production multi-region).
  • Dev: No backups; recreate from IaC + Git in 4 hours.
  • Test: Daily backups with 12h RPO, 2h RTO via restore.
  • Staging: Blue-green slots for instant failover (1h RPO/RTO), hourly backups.
  • Production: Active-active multi-region (East US 80%, West Europe 20%) with automated failover, 15min RPO, 30min RTO.
  • Multi-Region Topology: Primary (East US) + Secondary (West Europe) with VNet peering, SQL geo-replication, Cosmos multi-write, Blob RA-GRS, Service Bus Geo-DR.
  • Azure Front Door: Global load balancing with health probes (30s interval, 3 failure threshold) and automated traffic rerouting.
  • Zone Redundancy: Production resources spread across 3 availability zones (99.99% SLA).
  • DR Testing: Weekly non-disruptive production tests, monthly staging tests, quarterly dev/test tests.
  • Backup Strategy: Continuous backups (Production) with 35-day PITR for SQL (7-day for Cosmos DB) + 7-year long-term retention.
  • Incident Response: Automated failover (5 minutes) with manual validation and phased failback.

Compliance & Audit Per Environment

ATP implements graduated compliance controls across environments, balancing developer productivity with regulatory requirements. Production enforces full GDPR, HIPAA, and SOC 2 compliance with continuous monitoring, while Dev/Test environments use relaxed policies with synthetic data to accelerate development without compliance overhead.

This approach ensures regulatory alignment in production while maintaining rapid iteration in lower environments through environment-specific policies, automated compliance scanning, and comprehensive audit evidence collection.

Compliance Enforcement Matrix

ATP enforces progressive compliance controls, from relaxed (Dev) to strict (Production), aligned with data sensitivity and regulatory requirements.

| Control Category | Dev/Test | Staging | Production |
|------------------|----------|---------|------------|
| Encryption at Rest | Optional (TDE off) | Required (TDE on) | Required + BYOK (Customer-Managed Keys) |
| Encryption in Transit | TLS 1.2+ | TLS 1.2+ | TLS 1.3 enforced |
| Immutability | Disabled | Enabled (validation) | Enabled + WORM storage |
| Tamper-Evidence | Disabled | Enabled (testing) | Enabled + HSM signatures |
| Audit Logging | Basic (7 days) | Enhanced (30 days) | Full + 7-year retention |
| PII Handling | Synthetic data only | Anonymized production data | Real PII + classification |
| Data Retention | 30 days | 90 days | 7 years (compliance) |
| Access Reviews | Quarterly | Monthly | Weekly + JIT access |
| Penetration Testing | Annually | Quarterly | Quarterly + post-change |
| Vulnerability Scanning | Weekly | Daily | Continuous (real-time) |
| Secret Rotation | Manual (90 days) | Automated (60 days) | Automated (30 days) |
| Privileged Access | Developer accounts | Restricted (Lead+ only) | Zero standing access (PIM) |
| Change Management | None | Manual approval | CAB + change ticket |
| Incident Response SLA | Best-effort | 4 hours | 30 minutes |
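For deployment or IaC code that needs to branch on these graduated controls, the matrix collapses into a single lookup. A minimal Python sketch restating a few rows of the table (the dictionary and field names are illustrative, not an ATP API):

```python
# Selected rows of the compliance matrix above, keyed by environment tier
COMPLIANCE_PROFILES = {
    "dev":        {"tde_required": False, "immutability": False, "secret_rotation_days": 90, "retention_days": 30},
    "test":       {"tde_required": False, "immutability": False, "secret_rotation_days": 90, "retention_days": 30},
    "staging":    {"tde_required": True,  "immutability": True,  "secret_rotation_days": 60, "retention_days": 90},
    "production": {"tde_required": True,  "immutability": True,  "secret_rotation_days": 30, "retention_days": 7 * 365},
}

def profile_for(environment: str) -> dict:
    """Resolve the graduated compliance profile for an environment tier."""
    try:
        return COMPLIANCE_PROFILES[environment.lower()]
    except KeyError:
        raise ValueError(f"Unknown environment tier: {environment}") from None

print(profile_for("Staging")["secret_rotation_days"])  # 60
```

Centralizing the profile keeps environment-specific values out of individual resource definitions, so tightening a control means editing one table rather than many templates.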

Environment-Specific Compliance Policies

Dev/Test Compliance Policies

Purpose: Minimal compliance overhead with synthetic data and relaxed controls to maximize developer velocity.

Data Handling (Dev/Test):

# Dev/Test Compliance Profile
piiHandling:
  realPII: Prohibited  # Only synthetic/anonymized data
  piiClassification: Not required
  redaction: Optional (for testing redaction logic)
  dataResidency: No restrictions (US-only for simplicity)

encryption:
  atRest: Optional (TDE disabled for performance)
  inTransit: TLS 1.2+ required
  customerManagedKeys: No

immutability:
  enabled: false
  rationale: Support data mutations for testing

auditLogging:
  level: Basic
  retention: 7 days (Dev), 14 days (Test)
  destinations: Local Seq container
  piiRedaction: Optional

accessControl:
  authentication: Azure AD
  authorization: Developer role (full access)
  mfa: Recommended (not enforced)
  conditionalAccess: No

complianceScanning:
  frequency: Weekly
  frameworks: None (development only)
  findings: Informational (no blocking)

Azure Policy (Dev/Test - Audit Mode):

{
  "policyDefinitionId": "/providers/Microsoft.Authorization/policyDefinitions/dev-test-compliance",
  "displayName": "Dev/Test Environment Compliance (Audit Mode)",
  "policyRule": {
    "if": {
      "field": "tags['Environment']",
      "in": ["dev", "test"]
    },
    "then": {
      "effect": "auditIfNotExists",
      "details": {
        "type": "Microsoft.Sql/servers/databases",
        "existenceCondition": {
          "field": "Microsoft.Sql/servers/databases/transparentDataEncryption.status",
          "equals": "Enabled"
        }
      }
    }
  },
  "metadata": {
    "category": "Compliance",
    "severity": "Low"
  }
}

Staging Compliance Policies

Purpose: Production-like compliance for validating compliance controls before production deployment.

Data Handling (Staging):

# Staging Compliance Profile (mirrors Production)
piiHandling:
  realPII: Prohibited (anonymized prod snapshots only)
  piiClassification: Required (test classification logic)
  redaction: Enforced (validate redaction)
  dataResidency: EU data in EU region (GDPR simulation)

encryption:
  atRest: Required (TDE enabled)
  inTransit: TLS 1.2+ enforced
  customerManagedKeys: Optional (test BYOK integration)

immutability:
  enabled: true
  wormStorage: Enabled (test compliance workflows)
  tamperEvidence: Enabled (validate hash chains)

auditLogging:
  level: Enhanced
  retention: 30 days (hot) + 180 days (cool)
  destinations: Azure Log Analytics
  piiRedaction: Enforced

accessControl:
  authentication: Azure AD + MFA
  authorization: Least privilege (RBAC)
  mfa: Enforced for all users
  conditionalAccess: Location-based (office/VPN only)
  justInTimeAccess: Enabled (PIM)

complianceScanning:
  frequency: Daily
  frameworks: GDPR, HIPAA, SOC 2
  findings: Blocking for critical/high severity
  remediation: SLA 7 days (critical), 30 days (high)

Azure Policy (Staging - Deny Mode):

{
  "policyDefinitionId": "/providers/Microsoft.Authorization/policyDefinitions/staging-compliance",
  "displayName": "Staging Environment Compliance (Deny Mode)",
  "policyRule": {
    "if": {
      "allOf": [
        {
          "field": "tags['Environment']",
          "equals": "staging"
        },
        {
          "anyOf": [
            {
              "field": "type",
              "equals": "Microsoft.Sql/servers/databases"
            },
            {
              "field": "type",
              "equals": "Microsoft.Storage/storageAccounts"
            }
          ]
        },
        {
          "anyOf": [
            {
              "field": "Microsoft.Sql/servers/databases/transparentDataEncryption.status",
              "notEquals": "Enabled"
            },
            {
              "field": "Microsoft.Storage/storageAccounts/encryption.services.blob.enabled",
              "notEquals": "true"
            }
          ]
        }
      ]
    },
    "then": {
      "effect": "deny"
    }
  },
  "metadata": {
    "category": "Compliance",
    "severity": "High",
    "description": "Staging resources must have encryption at rest enabled"
  }
}
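The nested allOf/anyOf structure evaluates as ordinary boolean composition. A toy evaluator helps when reasoning about why a given resource is denied; this is a simplification of Azure Policy's real engine (which also supports aliases, `in`, `notIn`, `like`, and more):

```python
def evaluate(rule: dict, resource: dict) -> bool:
    """Evaluate a simplified Azure Policy 'if' condition against a flat resource dict."""
    if "allOf" in rule:
        return all(evaluate(r, resource) for r in rule["allOf"])
    if "anyOf" in rule:
        return any(evaluate(r, resource) for r in rule["anyOf"])
    value = resource.get(rule["field"])
    if "equals" in rule:
        return value == rule["equals"]
    if "notEquals" in rule:
        return value != rule["notEquals"]
    raise ValueError("Unsupported condition")

# A staging SQL database without TDE matches the deny rule's condition
rule = {"allOf": [
    {"field": "tags['Environment']", "equals": "staging"},
    {"anyOf": [{"field": "type", "equals": "Microsoft.Sql/servers/databases"}]},
    {"anyOf": [{"field": "tde.status", "notEquals": "Enabled"}]},
]}
resource = {"tags['Environment']": "staging",
            "type": "Microsoft.Sql/servers/databases",
            "tde.status": "Disabled"}
print(evaluate(rule, resource))  # True → effect 'deny' applies
```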

Production Compliance Policies

Purpose: Full regulatory compliance with GDPR, HIPAA, and SOC 2 controls enforced at infrastructure and application layers.

Data Handling (Production):

# Production Compliance Profile (strict enforcement)
piiHandling:
  realPII: Allowed (with classification)
  piiClassification: Required (automated scanning)
  redaction: Enforced (logs, telemetry, exports)
  dataResidency: Enforced (EU data stays in EU region)
  rightToErasure: Automated GDPR deletion workflow

encryption:
  atRest: Required (TDE + BYOK)
  inTransit: TLS 1.3 enforced (no TLS 1.2 fallback)
  customerManagedKeys: Required (HSM-backed)
  keyRotation: Automated (90-day rotation)

immutability:
  enabled: true
  wormStorage: Enforced (Legal Hold + Time-based Retention)
  tamperEvidence: Enabled + HSM signatures
  hashChains: Merkle trees with periodic sealing

auditLogging:
  level: Full (all API calls, database queries, access events)
  retention: 90 days (hot) + 7 years (cold archive)
  destinations: Log Analytics + Blob Storage (immutable)
  piiRedaction: Enforced with automated scanning
  logIntegrity: Signed with HSM

accessControl:
  authentication: Azure AD + MFA + Conditional Access
  authorization: Zero standing access (PIM required)
  mfa: Enforced (no exceptions)
  conditionalAccess: Device compliance + location + risk-based
  justInTimeAccess: Enforced (max 8-hour elevation)
  privilegedAccess: Break-glass accounts only

complianceScanning:
  frequency: Continuous (real-time)
  frameworks: GDPR, HIPAA, SOC 2, ISO 27001
  findings: Blocking (auto-remediate or escalate)
  remediation: SLA 24 hours (critical), 7 days (high)

regulatoryReporting:
  frequency: Quarterly
  reports: GDPR compliance, HIPAA attestation, SOC 2 audit trail
  auditor: External auditor access (read-only, time-limited)
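The tamper-evidence controls above rest on hash chaining: each stored audit record carries a hash that covers its payload and its predecessor's hash, so any in-place mutation invalidates every subsequent link. A minimal illustrative sketch of the idea, not ATP's actual sealing implementation (which uses Merkle trees and HSM signatures):

```python
import hashlib
import json

GENESIS = "0" * 64  # sentinel hash for the first link

def chain_events(events: list) -> list:
    """Attach a hash to each event covering its payload and the previous hash."""
    prev_hash = GENESIS
    chained = []
    for event in events:
        payload = json.dumps(event, sort_keys=True)
        digest = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
        chained.append({"event": event, "hash": digest, "prev": prev_hash})
        prev_hash = digest
    return chained

def verify_chain(chained: list) -> bool:
    """Recompute every link; returns False if any event or hash was altered."""
    prev_hash = GENESIS
    for record in chained:
        payload = json.dumps(record["event"], sort_keys=True)
        expected = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
        if record["hash"] != expected or record["prev"] != prev_hash:
            return False
        prev_hash = expected
    return True

chained = chain_events([{"id": 1, "action": "login"}, {"id": 2, "action": "export"}])
assert verify_chain(chained)
chained[0]["event"]["action"] = "delete"  # tamper with the first event
assert not verify_chain(chained)
```

Staging validates exactly this property ("validate hash chains") before the same logic runs against production WORM storage.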

Azure Policy (Production - Strict Enforcement):

{
  "policySetDefinitionId": "/providers/Microsoft.Authorization/policySetDefinitions/production-compliance-initiative",
  "displayName": "Production Environment Compliance Initiative",
  "policyDefinitions": [
    {
      "policyDefinitionId": "/providers/Microsoft.Authorization/policyDefinitions/deny-public-network-access",
      "parameters": {
        "effect": "deny"
      }
    },
    {
      "policyDefinitionId": "/providers/Microsoft.Authorization/policyDefinitions/require-encryption-at-rest",
      "parameters": {
        "effect": "deny",
        "requireCustomerManagedKeys": true
      }
    },
    {
      "policyDefinitionId": "/providers/Microsoft.Authorization/policyDefinitions/require-tls-1-3",
      "parameters": {
        "effect": "deny",
        "minimumTlsVersion": "1.3"
      }
    },
    {
      "policyDefinitionId": "/providers/Microsoft.Authorization/policyDefinitions/require-diagnostic-logs",
      "parameters": {
        "effect": "deployIfNotExists",
        "logAnalyticsWorkspaceId": "/subscriptions/<sub-id>/resourceGroups/ATP-Prod-RG/providers/Microsoft.OperationalInsights/workspaces/atp-loganalytics-prod-eus",
        "retentionDays": 90
      }
    },
    {
      "policyDefinitionId": "/providers/Microsoft.Authorization/policyDefinitions/require-immutable-storage",
      "parameters": {
        "effect": "deny",
        "requireWORM": true
      }
    }
  ],
  "metadata": {
    "category": "Regulatory Compliance",
    "version": "1.0.0"
  }
}

Encryption Controls Per Environment

Dev/Test Encryption

Encryption at Rest: Optional (disabled for cost/performance)

// Dev SQL Database (TDE disabled)
var devSqlDatabase = new Database("atp-sql-dev-eus", new DatabaseArgs
{
    DatabaseName = "ATP_Dev",
    TransparentDataEncryption = new TransparentDataEncryptionArgs
    {
        Status = "Disabled"  // Optional in Dev
    }
});

Encryption in Transit: TLS 1.2+ (enforced)

{
  "Kestrel": {
    "EndpointDefaults": {
      "Protocols": "Http2",
      "SslProtocols": ["Tls12", "Tls13"]
    }
  }
}

Staging/Production Encryption

Encryption at Rest: Required with Customer-Managed Keys (BYOK)

// Production SQL Database (TDE with BYOK)
var prodSqlDatabase = new Database("atp-db-prod-eus", new DatabaseArgs
{
    DatabaseName = "ATP_Prod",

    // Transparent Data Encryption with Customer-Managed Key
    TransparentDataEncryption = new TransparentDataEncryptionArgs
    {
        Status = "Enabled",
        KeyVaultKeyUri = "https://atp-keyvault-prod-eus.vault.azure.net/keys/TDE-Key/latest"
    }
});

// Storage Account (encryption with BYOK)
var prodStorage = new StorageAccount("atpstorageprodeus", new StorageAccountArgs
{
    AccountName = "atpstorageprodeus",

    Encryption = new EncryptionArgs
    {
        Services = new EncryptionServicesArgs
        {
            Blob = new EncryptionServiceArgs { Enabled = true, KeyType = "Account" },
            File = new EncryptionServiceArgs { Enabled = true, KeyType = "Account" },
            Queue = new EncryptionServiceArgs { Enabled = true, KeyType = "Service" },
            Table = new EncryptionServiceArgs { Enabled = true, KeyType = "Service" }
        },
        KeySource = "Microsoft.Keyvault",
        KeyVaultProperties = new KeyVaultPropertiesArgs
        {
            KeyName = "StorageEncryptionKey",
            KeyVersion = "",  // Use latest version
            KeyVaultUri = "https://atp-keyvault-prod-eus.vault.azure.net"
        },
        RequireInfrastructureEncryption = true  // Double encryption
    }
});

// Cosmos DB (encryption with BYOK)
var prodCosmos = new DatabaseAccount("atp-cosmos-prod", new DatabaseAccountArgs
{
    AccountName = "atp-cosmos-prod",

    KeyVaultKeyUri = "https://atp-keyvault-prod-eus.vault.azure.net/keys/CosmosEncryptionKey/latest",

    // Default encryption key policy
    DefaultIdentity = "SystemAssigned"
});

Encryption Key Rotation (Production):

// Automated key rotation (Azure Function)
[FunctionName("RotateEncryptionKeys")]
public async Task RunAsync(
    [TimerTrigger("0 0 0 1 */3 *")] TimerInfo timer,  // First day of every third month (~90 days)
    ILogger log)
{
    log.LogInformation("Starting encryption key rotation...");

    var keyVaultClient = new KeyClient(
        vaultUri: new Uri("https://atp-keyvault-prod-eus.vault.azure.net"),
        credential: new DefaultAzureCredential());

    // Rotate TDE key
    var newTdeKey = await keyVaultClient.CreateRsaKeyAsync(new CreateRsaKeyOptions($"TDE-Key-{DateTime.UtcNow:yyyyMMdd}")
    {
        KeySize = 4096,
        KeyOperations = { KeyOperation.WrapKey, KeyOperation.UnwrapKey },
        ExpiresOn = DateTimeOffset.UtcNow.AddDays(365)
    });

    log.LogInformation($"Created new TDE key: {newTdeKey.Value.Name}");

    // Update SQL Database TDE key
    await UpdateSqlTdeKeyAsync("atp-sql-prod-eus", "ATP_Prod", newTdeKey.Value.Id.ToString());

    // Rotate Storage encryption key
    var newStorageKey = await keyVaultClient.CreateRsaKeyAsync(new CreateRsaKeyOptions($"StorageEncryptionKey-{DateTime.UtcNow:yyyyMMdd}")
    {
        KeySize = 4096
    });

    await UpdateStorageEncryptionKeyAsync("atpstorageprodeus", newStorageKey.Value.Id.ToString());

    // Disable old keys (retain for 30 days for decryption of old data)
    var oldKeys = await keyVaultClient.GetPropertiesOfKeyVersionsAsync("TDE-Key").ToListAsync();
    foreach (var oldKey in oldKeys.Where(k => k.CreatedOn < DateTimeOffset.UtcNow.AddDays(-30)))
    {
        oldKey.Enabled = false;
        await keyVaultClient.UpdateKeyPropertiesAsync(oldKey);
        log.LogInformation($"Disabled old key: {oldKey.Name}");
    }

    log.LogInformation("✅ Encryption key rotation complete");
}
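The retention rule inside the rotation function (disable key versions older than 30 days, keep newer ones so existing ciphertext stays decryptable) is a small pure function. A Python sketch with hypothetical key metadata, for illustration only:

```python
from datetime import datetime, timedelta, timezone

def keys_to_disable(key_versions: list, now: datetime, retain_days: int = 30) -> list:
    """Return names of key versions created before the retention cutoff.
    Mirrors the rotation logic above: old versions are disabled, not deleted,
    so data encrypted under them remains decryptable during the window."""
    cutoff = now - timedelta(days=retain_days)
    return [k["name"] for k in key_versions if k["created_on"] < cutoff]

now = datetime(2025, 6, 1, tzinfo=timezone.utc)
versions = [
    {"name": "TDE-Key-20250101", "created_on": datetime(2025, 1, 1, tzinfo=timezone.utc)},
    {"name": "TDE-Key-20250520", "created_on": datetime(2025, 5, 20, tzinfo=timezone.utc)},
]
print(keys_to_disable(versions, now))  # ['TDE-Key-20250101']
```

Keeping this rule pure makes it easy to unit-test the cutoff behavior separately from the Key Vault calls.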

Encryption in Transit (Production - TLS 1.3 Only):

// Enforce TLS 1.3 in Production
public void ConfigureServices(IServiceCollection services)
{
    services.Configure<KestrelServerOptions>(options =>
    {
        options.ConfigureHttpsDefaults(httpsOptions =>
        {
            httpsOptions.SslProtocols = SslProtocols.Tls13;  // TLS 1.3 only
            httpsOptions.ClientCertificateMode = ClientCertificateMode.RequireCertificate;  // mTLS
            httpsOptions.CheckCertificateRevocation = true;  // Validate cert revocation
        });
    });
}

Immutability & WORM Storage

Staging (Immutability Testing):

// Staging blob container with immutability policy
var stagingImmutableContainer = new BlobContainer("audit-events-staging", new BlobContainerArgs
{
    ContainerName = "audit-events",
    AccountName = stagingStorage.Name,
    ResourceGroupName = stagingResourceGroup.Name,

    ImmutableStorageWithVersioning = new ImmutableStorageWithVersioningArgs
    {
        Enabled = true
    },

    // Default encryption scope for the container (the retention policy is defined below)
    DefaultEncryptionScope = "$account-encryption-key",
    DenyEncryptionScopeOverride = true
});

// Immutability policy (90-day retention)
var immutabilityPolicy = new ImmutabilityPolicy("staging-immutability-policy", new ImmutabilityPolicyArgs
{
    ImmutabilityPolicyName = "default",
    AccountName = stagingStorage.Name,
    ResourceGroupName = stagingResourceGroup.Name,
    ContainerName = stagingImmutableContainer.Name,
    ImmutabilityPeriodSinceCreationInDays = 90,
    AllowProtectedAppendWrites = false  // No appends allowed
});

Production (WORM + Legal Hold):

// Production blob container with WORM storage
var prodImmutableContainer = new BlobContainer("audit-events-prod", new BlobContainerArgs
{
    ContainerName = "audit-events",
    AccountName = prodStorage.Name,
    ResourceGroupName = prodEUSResourceGroup.Name,

    ImmutableStorageWithVersioning = new ImmutableStorageWithVersioningArgs
    {
        Enabled = true
    }
});

// Immutability policy (7-year retention + legal hold)
var prodImmutabilityPolicy = new ImmutabilityPolicy("prod-immutability-policy", new ImmutabilityPolicyArgs
{
    ImmutabilityPolicyName = "default",
    AccountName = prodStorage.Name,
    ResourceGroupName = prodEUSResourceGroup.Name,
    ContainerName = prodImmutableContainer.Name,
    ImmutabilityPeriodSinceCreationInDays = 2555,  // 7 years
    AllowProtectedAppendWrites = false
});

// Lock immutability policy (cannot be reduced or deleted)
var policyLock = new ManagementLock("prod-immutability-lock", new ManagementLockArgs
{
    LockName = "immutability-policy-lock",
    LockLevel = "CanNotDelete",
    ResourceGroupName = prodEUSResourceGroup.Name,
    Notes = "Prevents deletion of immutability policy for compliance"
});

Legal Hold (Production):

# Apply legal hold to blob container
az storage container legal-hold set \
  --account-name atpstorageprodeus \
  --container-name audit-events \
  --tags "litigation-2025-001" "regulatory-investigation-SEC-456" \
  --allow-protected-append-writes-all false

echo "✅ Legal hold applied; blobs cannot be deleted or modified"

Audit Logging & Evidence Collection

Production Audit Logging

Comprehensive Logging (All layers):

auditLogs:
  infrastructure:
    - Azure Activity Log (control plane operations)
    - NSG Flow Logs (network traffic)
    - Azure Firewall Logs (egress traffic)
    - Key Vault Audit Logs (secret access)

  application:
    - API Gateway Logs (all HTTP requests)
    - Application Logs (business events)
    - Database Audit Logs (all SQL queries)
    - Authentication Logs (login attempts, MFA)

  security:
    - Azure AD Sign-in Logs (user authentication)
    - Conditional Access Logs (access policy decisions)
    - PIM Activation Logs (privileged access elevation)
    - Azure Defender Alerts (security findings)

Azure SQL Auditing (Production):

// SQL Server auditing (production)
var sqlAuditing = new ServerBlobAuditingPolicy("atp-sql-audit-prod", new ServerBlobAuditingPolicyArgs
{
    BlobAuditingPolicyName = "default",
    ResourceGroupName = prodEUSResourceGroup.Name,
    ServerName = prodEUSSQL.Name,
    State = "Enabled",

    // Storage account for audit logs
    StorageEndpoint = prodStorage.PrimaryEndpoints.Apply(endpoints => endpoints.Blob),
    StorageAccountAccessKey = prodStorage.PrimaryAccessKey,
    StorageAccountSubscriptionId = subscriptionId,

    RetentionDays = 90,  // 90-day hot retention
    IsStorageSecondaryKeyInUse = false,
    IsAzureMonitorTargetEnabled = true,  // Also send to Log Analytics

    // Audit action groups (comprehensive)
    AuditActionsAndGroups = new[]
    {
        "SUCCESSFUL_DATABASE_AUTHENTICATION_GROUP",
        "FAILED_DATABASE_AUTHENTICATION_GROUP",
        "BATCH_COMPLETED_GROUP",
        "SCHEMA_OBJECT_CHANGE_GROUP",
        "SCHEMA_OBJECT_ACCESS_GROUP",
        "DATABASE_OBJECT_CHANGE_GROUP",
        "DATABASE_OBJECT_OWNERSHIP_CHANGE_GROUP",
        "DATABASE_OBJECT_PERMISSION_CHANGE_GROUP",
        "DATABASE_PERMISSION_CHANGE_GROUP",
        "DATABASE_PRINCIPAL_CHANGE_GROUP",
        "DATABASE_ROLE_MEMBER_CHANGE_GROUP"
    }
});

// Database-level auditing
var dbAuditing = new DatabaseBlobAuditingPolicy("atp-db-audit-prod", new DatabaseBlobAuditingPolicyArgs
{
    BlobAuditingPolicyName = "default",
    ResourceGroupName = prodEUSResourceGroup.Name,
    ServerName = prodEUSSQL.Name,
    DatabaseName = prodEUSDatabase.Name,
    State = "Enabled",

    // Inherit from server-level + add database-specific actions
    AuditActionsAndGroups = new[]
    {
        "SELECT",
        "INSERT",
        "UPDATE",
        "DELETE",
        "EXECUTE",
        "RECEIVE",
        "REFERENCES"
    }
});

Key Vault Audit Logging (Production):

# Enable Key Vault diagnostic logging
az monitor diagnostic-settings create \
  --name atp-keyvault-prod-audit \
  --resource /subscriptions/<sub-id>/resourceGroups/ATP-Prod-RG/providers/Microsoft.KeyVault/vaults/atp-keyvault-prod-eus \
  --logs '[
    {
      "category": "AuditEvent",
      "enabled": true,
      "retentionPolicy": {
        "enabled": true,
        "days": 365
      }
    }
  ]' \
  --metrics '[
    {
      "category": "AllMetrics",
      "enabled": true,
      "retentionPolicy": {
        "enabled": true,
        "days": 90
      }
    }
  ]' \
  --workspace /subscriptions/<sub-id>/resourceGroups/ATP-Prod-RG/providers/Microsoft.OperationalInsights/workspaces/atp-loganalytics-prod-eus

echo "✅ Key Vault audit logging enabled (365-day retention)"

API Gateway Logging (Production):

// Application Gateway diagnostic settings
var appGatewayDiagnostics = new DiagnosticSetting("atp-appgw-diag-prod", new DiagnosticSettingArgs
{
    Name = "atp-appgw-prod-diagnostics",
    ResourceUri = applicationGateway.Id,
    WorkspaceId = prodLogAnalytics.Id,

    Logs = new[]
    {
        new LogSettingsArgs
        {
            Category = "ApplicationGatewayAccessLog",
            Enabled = true,
            RetentionPolicy = new RetentionPolicyArgs
            {
                Enabled = true,
                Days = 90
            }
        },
        new LogSettingsArgs
        {
            Category = "ApplicationGatewayPerformanceLog",
            Enabled = true,
            RetentionPolicy = new RetentionPolicyArgs
            {
                Enabled = true,
                Days = 90
            }
        },
        new LogSettingsArgs
        {
            Category = "ApplicationGatewayFirewallLog",
            Enabled = true,
            RetentionPolicy = new RetentionPolicyArgs
            {
                Enabled = true,
                Days = 365  // WAF logs retained for 1 year
            }
        }
    },

    Metrics = new[]
    {
        new MetricSettingsArgs
        {
            Category = "AllMetrics",
            Enabled = true,
            RetentionPolicy = new RetentionPolicyArgs
            {
                Enabled = true,
                Days = 90
            }
        }
    }
});

Access Control & Privileged Access Management

Just-In-Time (JIT) Access (Staging/Production):

// Azure PIM role assignment (time-limited elevation)
public class PrivilegedAccessService
{
    private readonly GraphServiceClient _graphClient;
    private readonly ILogger<PrivilegedAccessService> _logger;

    public PrivilegedAccessService(GraphServiceClient graphClient, ILogger<PrivilegedAccessService> logger)
    {
        _graphClient = graphClient;
        _logger = logger;
    }

    public async Task RequestProductionAccessAsync(string userId, string role, TimeSpan duration)
    {
        // Maximum elevation: 8 hours
        if (duration > TimeSpan.FromHours(8))
        {
            throw new InvalidOperationException("Maximum elevation period is 8 hours");
        }

        // Create PIM role assignment request
        var roleAssignmentScheduleRequest = new RoleAssignmentScheduleRequest
        {
            PrincipalId = userId,
            RoleDefinitionId = role,  // role definition ID (a GUID), not a display name
            DirectoryScopeId = "/subscriptions/<sub-id>/resourceGroups/ConnectSoft-ATP-Prod-EUS-RG",
            Justification = "Production troubleshooting - Incident #12345",
            ScheduleInfo = new RequestSchedule
            {
                StartDateTime = DateTimeOffset.UtcNow,
                Expiration = new ExpirationPattern
                {
                    Type = "AfterDuration",
                    Duration = duration
                }
            }
        };

        await _graphClient.RoleManagement.Directory.RoleAssignmentScheduleRequests
            .PostAsync(roleAssignmentScheduleRequest);

        // Log PIM elevation for audit trail
        _logger.LogWarning(
            "PIM elevation granted: User {UserId} elevated to {Role} for {Duration}",
            userId, role, duration);
    }
}

Access Review (Production - Weekly):

<#
.SYNOPSIS
    Weekly access review for Production environment.
.DESCRIPTION
    Reviews all role assignments; identifies stale permissions; sends report.
#>

Connect-AzAccount -Identity

$resourceGroup = "ConnectSoft-ATP-Prod-EUS-RG"
$reportDate = Get-Date -Format "yyyy-MM-dd"

Write-Output "Starting weekly access review for Production..."

# Get all role assignments
$roleAssignments = Get-AzRoleAssignment -ResourceGroupName $resourceGroup

$accessReport = @()

foreach ($assignment in $roleAssignments) {
    $principal = Get-AzADUser -ObjectId $assignment.ObjectId -ErrorAction SilentlyContinue

    if ($principal) {
        # Check last sign-in (requires the Microsoft.Graph module and AuditLog.Read.All)
        $lastSignIn = (Get-MgUser -UserId $assignment.ObjectId -Property SignInActivity).SignInActivity.LastSignInDateTime
        $daysSinceSignIn = if ($lastSignIn) { ((Get-Date) - $lastSignIn).Days } else { [int]::MaxValue }

        $accessReport += [PSCustomObject]@{
            User = $principal.UserPrincipalName
            Role = $assignment.RoleDefinitionName
            Scope = $assignment.Scope
            LastSignIn = $lastSignIn
            DaysSinceSignIn = $daysSinceSignIn
            Recommendation = if ($daysSinceSignIn -gt 90) { "Remove" } elseif ($daysSinceSignIn -gt 30) { "Review" } else { "Keep" }
        }
    }
}

# Export report
$accessReport | Export-Csv -Path "access-review-$reportDate.csv" -NoTypeInformation

# Send to security team
Send-MailMessage `
    -From "platform-team@connectsoft.example" `
    -To "security-team@connectsoft.example" `
    -Subject "Weekly Production Access Review - $reportDate" `
    -Body "Attached: Access review for Production environment" `
    -Attachments "access-review-$reportDate.csv" `
    -SmtpServer "smtp.office365.com"

Write-Output "✅ Access review complete; report sent to security team"

Penetration Testing Per Environment

| Environment | Frequency | Scope | Approval | Findings SLA |
|---|---|---|---|---|
| Dev | Annually | Basic OWASP scan | None | Informational |
| Test | Annually | Full automated scan | None | 30 days |
| Staging | Quarterly | Full manual + automated | Lead Architect | 14 days (high), 7 days (critical) |
| Production | Quarterly + post-change | Full manual pentest | CISO | 48 hours (critical), 7 days (high) |

Production Penetration Testing Procedure:

pentestProcedure:
  frequency: Quarterly + after major changes
  vendor: External security firm (pre-approved)

  scope:
    inScope:
      - Public-facing Application Gateway
      - API endpoints (all microservices)
      - Authentication/authorization flows
      - Data encryption verification
      - Network segmentation validation

    outOfScope:
      - Physical Azure datacenter testing
      - Social engineering
      - Denial of Service attacks

  notification:
    azure: Notify Azure 7 days before pentest
    internal: CISO approval required
    tenants: No notification (production isolation)

  deliverables:
    - Executive summary
    - Detailed findings report
    - Proof-of-concept exploits
    - Remediation recommendations
    - Retest validation (after fixes)

  remediationSLA:
    critical: 48 hours
    high: 7 days
    medium: 30 days
    low: 90 days
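
To make the SLAs above actionable in tooling, a finding's severity can be mapped to its remediation deadline. This is an illustrative Python sketch (the `REMEDIATION_SLA` table mirrors the YAML above; function and table names are hypothetical):

```python
from datetime import datetime, timedelta

# Remediation SLAs from the production pentest procedure above
REMEDIATION_SLA = {
    "critical": timedelta(hours=48),
    "high": timedelta(days=7),
    "medium": timedelta(days=30),
    "low": timedelta(days=90),
}

def remediation_due_date(severity: str, found_at: datetime) -> datetime:
    """Return the remediation deadline for a pentest finding."""
    try:
        return found_at + REMEDIATION_SLA[severity.lower()]
    except KeyError:
        raise ValueError(f"Unknown severity: {severity}")
```

A helper like this keeps SLA arithmetic out of ticket-creation code paths such as the findings tracker below.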

Pentest Findings Tracking:

// Track pentest findings in Azure DevOps
// (WorkItemTrackingHttpClient creates work items from a JSON patch document)
public async Task CreatePentestFindingAsync(PentestFinding finding)
{
    var fields = new Dictionary<string, object>
    {
        ["System.Title"] = $"Pentest Finding: {finding.Title}",
        ["System.State"] = "New",
        ["Microsoft.VSTS.Common.Priority"] = finding.Severity switch
        {
            "Critical" => 1,
            "High" => 2,
            "Medium" => 3,
            "Low" => 4,
            _ => 3
        },
        ["Microsoft.VSTS.Common.Severity"] = finding.Severity,
        ["System.Description"] = $@"
            <b>Finding:</b> {finding.Description}<br/>
            <b>Severity:</b> {finding.Severity}<br/>
            <b>CVSS Score:</b> {finding.CvssScore}<br/>
            <b>Remediation:</b> {finding.Remediation}<br/>
            <b>Due Date:</b> {finding.DueDate:yyyy-MM-dd}
        ",
        ["System.Tags"] = "Pentest; Security; Compliance; Q1-2025",
        ["Custom.ComplianceFramework"] = "SOC2-CC6.1",
        ["System.AreaPath"] = "ATP\\Security",
        ["System.IterationPath"] = "ATP\\2025\\Q1"
    };

    var patchDocument = new JsonPatchDocument();
    foreach (var (name, value) in fields)
    {
        patchDocument.Add(new JsonPatchOperation
        {
            Operation = Operation.Add,
            Path = $"/fields/{name}",
            Value = value
        });
    }

    // The work item type is passed as an argument, not as a field
    await _devOpsClient.CreateWorkItemAsync(patchDocument, project: "ATP", type: "Security Bug");
}

Compliance Scanning & Monitoring

Microsoft Defender for Cloud (Production):

# Enable Defender for Cloud (Standard tier) for all Production resource types
for plan in VirtualMachines SqlServers AppServices StorageAccounts \
            KubernetesService ContainerRegistry KeyVaults; do
  az security pricing create --name "$plan" --tier "Standard"
done

# Enable regulatory compliance dashboard
az security regulatory-compliance-standard list \
  --query "[].{Name:name, State:state}"

echo "✅ Defender for Cloud enabled; compliance dashboard available"

Continuous Compliance Monitoring (Azure Policy Compliance):

#!/bin/bash
# compliance-scan-prod.sh

ENVIRONMENT="prod"
RESOURCE_GROUP="ConnectSoft-ATP-Prod-EUS-RG"

echo "Running compliance scan for Production..."

# Get policy compliance state
COMPLIANCE=$(az policy state summarize \
  --resource-group "$RESOURCE_GROUP" \
  --query "results.policyAssignments[].{Policy:policyAssignmentId, Compliant:results.resourceDetails.compliantResources, NonCompliant:results.resourceDetails.nonCompliantResources}" \
  --output json)

# Sum non-compliant counts; default to 0 if the summary is empty
NON_COMPLIANT_COUNT=$(echo "$COMPLIANCE" | jq '[.[].NonCompliant] | add // 0')

if [ "$NON_COMPLIANT_COUNT" -gt 0 ]; then
  echo "❌ Non-compliant resources detected: $NON_COMPLIANT_COUNT"

  # List non-compliant resources
  az policy state list \
    --resource-group $RESOURCE_GROUP \
    --filter "complianceState eq 'NonCompliant'" \
    --query "[].{Resource:resourceId, Policy:policyDefinitionName, Reason:policyDefinitionAction}" \
    --output table

  # Create compliance violation ticket
  az boards work-item create \
    --title "Compliance Violation Detected in Production" \
    --type "Bug" \
    --description "Azure Policy compliance scan detected $NON_COMPLIANT_COUNT non-compliant resources.\n\nSee attached compliance report." \
    --assigned-to "security-team@connectsoft.example" \
    --fields Priority=1 Severity="1 - Critical"

  exit 1
else
  echo "✅ All Production resources compliant"
fi

# Generate compliance report
az policy state summarize \
  --resource-group $RESOURCE_GROUP \
  --output json > "compliance-report-prod-$(date +%Y%m%d).json"

echo "✅ Compliance scan complete"

Regulatory Compliance Evidence

SOC 2 Evidence Collection:

soc2Evidence:
  CC6.1_LogicalAndPhysicalAccessControls:
    - Azure AD sign-in logs (authentication)
    - PIM activation logs (privileged access)
    - Key Vault access logs (secret retrieval)
    - NSG flow logs (network access)

  CC6.6_LogicalAccessRemoval:
    - Access review reports (weekly)
    - Offboarding automation logs
    - Role assignment change logs

  CC7.2_DetectionAndMonitoring:
    - Azure Defender alerts (security findings)
    - Application Insights exceptions
    - Azure Monitor alerts (health/performance)

  CC8.1_ChangeManagement:
    - Azure DevOps pipeline logs (deployments)
    - Git commit history (code changes)
    - CAB meeting minutes (approval records)
    - Change ticket logs (ServiceNow integration)

GDPR Evidence Collection:

gdprEvidence:
  Article30_RecordsOfProcessing:
    - Data classification inventory (PII fields)
    - Data flow diagrams (tenant → ATP → storage)
    - Retention policy documentation

  Article32_SecurityMeasures:
    - Encryption certificates (TDE, TLS 1.3)
    - Penetration test reports (quarterly)
    - Vulnerability scan results (continuous)

  Article33_BreachNotification:
    - Incident response logs (P0/P1 incidents)
    - Breach notification templates
    - Tenant communication records

  Article17_RightToErasure:
    - GDPR deletion logs (tenant data purge)
    - Deletion verification reports
    - Backup retention override logs

HIPAA Evidence Collection:

hipaaEvidence:
  164.308_AdministrativeSafeguards:
    - Security training completion records
    - Risk assessment reports (annual)
    - Access review logs (weekly)

  164.310_PhysicalSafeguards:
    - Azure datacenter compliance certificates
    - Physical access logs (Azure-managed)

  164.312_TechnicalSafeguards:
    - Encryption key rotation logs
    - Audit trail reports (SQL, API, Key Vault)
    - Access control logs (authentication/authorization)

  164.316_PoliciesAndProcedures:
    - Incident response runbooks
    - DR test reports (weekly)
    - Compliance policy documentation

Evidence Retention & Archival

Production Evidence Retention Policy:

evidenceRetention:
  pipelineLogs:
    retention: 1 year (SOC 2 minimum)
    storage: Azure DevOps + exported to Blob Storage
    format: JSON (machine-readable)

  deploymentArtifacts:
    retention: 7 years (match audit data retention)
    storage: Azure Artifacts + Blob Archive
    artifacts:
      - SBOM (Software Bill of Materials)
      - Security scan reports (SAST, dependency, secrets)
      - Test results (unit, integration, regression)
      - ADR snapshots (architecture decisions)

  accessLogs:
    retention: 90 days (hot) + 7 years (cold archive)
    storage: Log Analytics + Blob Storage (immutable)
    logs:
      - Azure AD sign-in logs
      - PIM activation logs
      - Key Vault audit logs
      - SQL audit logs
      - API Gateway access logs

  incidentRecords:
    retention: 7 years (regulatory requirement)
    storage: Azure DevOps Work Items + exported to Blob
    records:
      - Incident tickets (P0/P1)
      - Post-mortem reports
      - Corrective action tracking
      - Communication logs (tenant notifications)

  complianceReports:
    retention: 7 years (audit requirement)
    storage: Blob Storage (WORM + Legal Hold)
    reports:
      - Quarterly SOC 2 attestation
      - Annual HIPAA assessment
      - GDPR compliance reports
      - Penetration test reports
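
The hot/cold split for access logs reduces to a simple age-based tiering rule. A hypothetical Python sketch, with thresholds taken from the policy above (the 7-year figure approximated in days, ignoring leap years):

```python
from datetime import date

HOT_DAYS = 90                    # queryable in Log Analytics
TOTAL_RETENTION_DAYS = 7 * 365   # 7-year cold archive (approximate)

def access_log_tier(log_date: date, today: date) -> str:
    """Classify an access log record as hot, archive, or expired."""
    age = (today - log_date).days
    if age < 0:
        raise ValueError("log_date is in the future")
    if age <= HOT_DAYS:
        return "hot"        # Log Analytics hot tier
    if age <= TOTAL_RETENTION_DAYS:
        return "archive"    # immutable Blob Storage
    return "expired"        # eligible for deletion
```

In practice the transitions are enforced by Log Analytics retention settings and Blob lifecycle management policies; this sketch only illustrates the boundaries.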

Automated Evidence Export (Azure Function):

// Export compliance evidence to immutable storage
[FunctionName("ExportComplianceEvidence")]
public async Task RunAsync(
    [TimerTrigger("0 0 1 1 */3 *")] TimerInfo timer,  // Quarterly on 1st at 1 AM
    ILogger log)
{
    log.LogInformation("Starting quarterly compliance evidence export...");

    var quarter = $"Q{(DateTime.UtcNow.Month - 1) / 3 + 1}-{DateTime.UtcNow.Year}";
    var exportPath = $"compliance-evidence/{quarter}";

    var blobClient = new BlobServiceClient(
        connectionString: Environment.GetEnvironmentVariable("ComplianceStorageConnectionString"));

    var containerClient = blobClient.GetBlobContainerClient("compliance-evidence");

    // Export Azure DevOps pipeline logs
    var pipelineLogs = await ExportPipelineLogsAsync(quarter);
    await UploadEvidenceAsync(containerClient, $"{exportPath}/pipeline-logs.json", pipelineLogs);

    // Export deployment artifacts (SBOM, security scans)
    var deploymentArtifacts = await ExportDeploymentArtifactsAsync(quarter);
    await UploadEvidenceAsync(containerClient, $"{exportPath}/deployment-artifacts.zip", deploymentArtifacts);

    // Export access logs from Log Analytics
    var accessLogs = await ExportAccessLogsAsync(quarter);
    await UploadEvidenceAsync(containerClient, $"{exportPath}/access-logs.json", accessLogs);

    // Export incident records
    var incidents = await ExportIncidentRecordsAsync(quarter);
    await UploadEvidenceAsync(containerClient, $"{exportPath}/incidents.json", incidents);

    // Generate compliance summary report
    var summaryReport = await GenerateComplianceSummaryAsync(quarter);
    await UploadEvidenceAsync(containerClient, $"{exportPath}/compliance-summary.pdf", summaryReport);

    // Container-level WORM (time-based retention, 7 years) is configured on the
    // compliance-evidence container via IaC; per-blob legal holds or immutability
    // policies can additionally be applied with BlobClient.SetImmutabilityPolicyAsync.

    log.LogInformation($"✅ Compliance evidence exported to {exportPath}");
    log.LogInformation("Evidence is now immutable (7-year retention)");
}
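
The quarter label used for the export path follows the `(Month - 1) / 3 + 1` integer division in the function above, which is easy to get wrong off-by-one. The same mapping in a quick Python sketch:

```python
def quarter_label(month: int, year: int) -> str:
    """Map a calendar month to the quarter label used for export paths."""
    if not 1 <= month <= 12:
        raise ValueError("month must be 1-12")
    return f"Q{(month - 1) // 3 + 1}-{year}"
```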

Compliance Dashboard & Reporting

Azure Policy Compliance Dashboard (KQL):

// Compliance posture over time
// (assumes Azure Policy compliance states are exported to the Log Analytics workspace)
PolicyInsights
| where TimeGenerated > ago(90d)
| where ResourceGroup contains "ATP-Prod"
| extend IsCompliant = ComplianceState == "Compliant"
| summarize 
    TotalResources = dcount(ResourceId),
    CompliantResources = dcountif(ResourceId, IsCompliant),
    NonCompliantResources = dcountif(ResourceId, not(IsCompliant)),
    CompliancePercentage = 100.0 * dcountif(ResourceId, IsCompliant) / dcount(ResourceId)
  by bin(TimeGenerated, 1d), PolicyDefinitionName
| order by TimeGenerated desc
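
The aggregation the KQL performs can be sanity-checked offline. An illustrative Python sketch computing the same summary from a mapping of resource IDs to compliance states (names are hypothetical):

```python
def compliance_summary(states: dict[str, str]) -> dict:
    """Summarize policy compliance the way the KQL query above does.

    `states` maps resource IDs to "Compliant" or "NonCompliant".
    """
    total = len(states)
    compliant = sum(1 for s in states.values() if s == "Compliant")
    return {
        "TotalResources": total,
        "CompliantResources": compliant,
        "NonCompliantResources": total - compliant,
        "CompliancePercentage": 100.0 * compliant / total if total else 0.0,
    }
```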

Compliance Scorecard (Monthly Report):

// Generate monthly compliance scorecard
public class ComplianceReportService
{
    public async Task<ComplianceScorecard> GenerateMonthlyScorecard(string environment)
    {
        var scorecard = new ComplianceScorecard
        {
            Environment = environment,
            ReportDate = DateTime.UtcNow,
            Period = $"{DateTime.UtcNow:MMMM yyyy}"
        };

        // SOC 2 Controls
        scorecard.SOC2 = new SOC2Scorecard
        {
            CC6_1_AccessControls = await ValidateAccessControlsAsync(),
            CC6_6_AccessRemoval = await ValidateAccessRemovalAsync(),
            CC7_2_Monitoring = await ValidateMonitoringAsync(),
            CC8_1_ChangeManagement = await ValidateChangeManagementAsync(),
            OverallCompliance = CalculateOverallCompliance(new[] { /* ... */ })
        };

        // GDPR Controls
        scorecard.GDPR = new GDPRScorecard
        {
            Article30_RecordsOfProcessing = await ValidateDataInventoryAsync(),
            Article32_SecurityMeasures = await ValidateSecurityMeasuresAsync(),
            Article33_BreachNotification = await ValidateBreachNotificationAsync(),
            Article17_RightToErasure = await ValidateRightToErasureAsync(),
            OverallCompliance = CalculateOverallCompliance(new[] { /* ... */ })
        };

        // HIPAA Controls
        scorecard.HIPAA = new HIPAAScorecard
        {
            Safeguard_164_308_Administrative = await ValidateAdministrativeSafeguardsAsync(),
            Safeguard_164_310_Physical = await ValidatePhysicalSafeguardsAsync(),
            Safeguard_164_312_Technical = await ValidateTechnicalSafeguardsAsync(),
            Safeguard_164_316_Policies = await ValidatePoliciesAsync(),
            OverallCompliance = CalculateOverallCompliance(new[] { /* ... */ })
        };

        // Overall environment compliance
        scorecard.OverallScore = (scorecard.SOC2.OverallCompliance + 
                                   scorecard.GDPR.OverallCompliance + 
                                   scorecard.HIPAA.OverallCompliance) / 3;

        return scorecard;
    }
}

PII Redaction & Data Classification

Automated PII Detection (Production):

// PII detection and redaction middleware
public class PiiRedactionMiddleware
{
    private readonly RequestDelegate _next;
    private readonly IPiiDetectionService _piiDetector;
    private readonly ILogger<PiiRedactionMiddleware> _logger;

    public PiiRedactionMiddleware(
        RequestDelegate next,
        IPiiDetectionService piiDetector,
        ILogger<PiiRedactionMiddleware> logger)
    {
        _next = next;
        _piiDetector = piiDetector;
        _logger = logger;
    }

    public async Task InvokeAsync(HttpContext context)
    {
        // Capture response body
        var originalBody = context.Response.Body;
        using var newBody = new MemoryStream();
        context.Response.Body = newBody;

        await _next(context);

        // Read response
        newBody.Seek(0, SeekOrigin.Begin);
        var responseBody = await new StreamReader(newBody).ReadToEndAsync();

        // Detect and redact PII
        var piiDetected = _piiDetector.DetectPii(responseBody);
        if (piiDetected.Any())
        {
            _logger.LogWarning(
                "PII detected in API response: {PiiTypes}. Redacting...",
                string.Join(", ", piiDetected.Select(p => p.Type)));

            // Redact PII fields
            responseBody = _piiDetector.RedactPii(responseBody, piiDetected);

            // Log PII exposure incident (for compliance)
            await LogPiiExposureAsync(context.Request.Path, piiDetected);
        }

        // Write the (possibly redacted) body back to the original stream;
        // copying the buffered stream here would return the unredacted content
        context.Response.Body = originalBody;
        var outputBytes = Encoding.UTF8.GetBytes(responseBody);
        context.Response.ContentLength = outputBytes.Length;
        await originalBody.WriteAsync(outputBytes);
    }
}

// PII detection service
public class PiiDetectionService : IPiiDetectionService
{
    private static readonly Regex EmailRegex = new Regex(@"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b", RegexOptions.Compiled);
    private static readonly Regex SsnRegex = new Regex(@"\b\d{3}-\d{2}-\d{4}\b", RegexOptions.Compiled);
    private static readonly Regex CreditCardRegex = new Regex(@"\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b", RegexOptions.Compiled);

    public List<PiiDetection> DetectPii(string content)
    {
        var detections = new List<PiiDetection>();

        // Email addresses
        var emailMatches = EmailRegex.Matches(content);
        detections.AddRange(emailMatches.Select(m => new PiiDetection
        {
            Type = "Email",
            Value = m.Value,
            StartIndex = m.Index,
            Length = m.Length
        }));

        // SSN
        var ssnMatches = SsnRegex.Matches(content);
        detections.AddRange(ssnMatches.Select(m => new PiiDetection
        {
            Type = "SSN",
            Value = m.Value,
            StartIndex = m.Index,
            Length = m.Length
        }));

        // Credit card numbers
        var ccMatches = CreditCardRegex.Matches(content);
        detections.AddRange(ccMatches.Select(m => new PiiDetection
        {
            Type = "CreditCard",
            Value = m.Value,
            StartIndex = m.Index,
            Length = m.Length
        }));

        return detections;
    }

    public string RedactPii(string content, List<PiiDetection> detections)
    {
        foreach (var detection in detections.OrderByDescending(d => d.StartIndex))
        {
            var redacted = detection.Type switch
            {
                "Email" => MaskEmail(detection.Value),
                "SSN" => "***-**-****",
                "CreditCard" => "**** **** **** " + detection.Value.Substring(detection.Value.Length - 4),
                _ => "[REDACTED]"
            };

            content = content.Remove(detection.StartIndex, detection.Length)
                             .Insert(detection.StartIndex, redacted);
        }

        return content;
    }

    private string MaskEmail(string email)
    {
        var parts = email.Split('@');
        if (parts.Length != 2) return "[REDACTED]";

        var localPart = parts[0];
        var maskedLocal = localPart.Length > 3 
            ? localPart.Substring(0, 3) + "***" 
            : "***";

        return $"{maskedLocal}@{parts[1]}";
    }
}

Data Classification (Automated Tagging):

// Automated data classification for SQL tables
public class DataClassificationService
{
    public async Task ClassifyDatabaseAsync(string databaseName)
    {
        var classifications = new List<DataClassification>
        {
            new DataClassification
            {
                Schema = "dbo",
                Table = "AuditEvents",
                Column = "UserId",
                InformationType = "Person.Name",
                SensitivityLabel = "Confidential",
                SensitivityRank = "High"
            },
            new DataClassification
            {
                Schema = "dbo",
                Table = "AuditEvents",
                Column = "IPAddress",
                InformationType = "Network.IPAddress",
                SensitivityLabel = "Confidential",
                SensitivityRank = "Medium"
            },
            new DataClassification
            {
                Schema = "dbo",
                Table = "Tenants",
                Column = "ContactEmail",
                InformationType = "Contact.EmailAddress",
                SensitivityLabel = "Confidential - GDPR",
                SensitivityRank = "High"
            }
        };

        foreach (var classification in classifications)
        {
            await ApplyClassificationAsync(databaseName, classification);
        }
    }

    private async Task ApplyClassificationAsync(string database, DataClassification classification)
    {
        var sql = $@"
            ADD SENSITIVITY CLASSIFICATION TO 
            [{classification.Schema}].[{classification.Table}].[{classification.Column}]
            WITH (
                LABEL = '{classification.SensitivityLabel}',
                INFORMATION_TYPE = '{classification.InformationType}',
                RANK = {classification.SensitivityRank}
            )
        ";

        // Execute via SQL connection
        await ExecuteSqlAsync(database, sql);

        _logger.LogInformation(
            "Applied data classification: {Schema}.{Table}.{Column} = {Label}",
            classification.Schema, classification.Table, classification.Column, classification.SensitivityLabel);
    }
}

Vulnerability Management Per Environment

Vulnerability Scanning Strategy:

vulnerabilityScanning:
  dev:
    frequency: Weekly
    tools:
      - OWASP Dependency Check (NuGet packages)
      - Trivy (Docker images)
    severity: Informational only
    remediation: Best-effort

  test:
    frequency: Daily (as part of CI/CD)
    tools:
      - OWASP Dependency Check
      - Trivy
      - SonarQube (SAST)
    severity: Block critical vulnerabilities
    remediation: 30 days

  staging:
    frequency: Continuous
    tools:
      - Azure Defender for Cloud
      - OWASP Dependency Check
      - Trivy
      - SonarQube
    severity: Block critical/high
    remediation: 7 days (critical), 14 days (high)

  production:
    frequency: Continuous (real-time)
    tools:
      - Azure Defender for Cloud (Advanced Threat Protection)
      - Microsoft Defender for Containers
      - Azure Sentinel (SIEM)
      - OWASP Dependency Check
    severity: Block all critical/high
    remediation: 24 hours (critical), 7 days (high)
    responseProcedure: Incident ticket + emergency patching

Vulnerability Response Workflow:

flowchart TD
    A[Vulnerability Detected] --> B{Severity?}
    B -->|Critical| C[Create P0 Incident]
    B -->|High| D[Create P1 Bug]
    B -->|Medium/Low| E[Create P2/P3 Bug]

    C --> F[Emergency Patching<br/>SLA: 24 hours]
    D --> G[Scheduled Patching<br/>SLA: 7 days]
    E --> H[Backlog Grooming<br/>SLA: 30 days]

    F --> I[Deploy Hotfix to Prod]
    G --> J[Deploy via Regular Release]
    H --> J

    I --> K[Validate Fix]
    J --> K

    K --> L{Fixed?}
    L -->|Yes| M[Close Ticket]
    L -->|No| N[Escalate to Security Team]

    N --> O[Manual Investigation]
    O --> K

    M --> P[Post-Mortem if P0]
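
The triage branches in the flowchart reduce to a severity-to-ticket mapping. An illustrative Python sketch (ticket names and SLA hours taken from the diagram; the function name is hypothetical):

```python
# Ticket routing from the vulnerability response flowchart above
TRIAGE = {
    "critical": ("P0 Incident", 24),    # emergency patching
    "high": ("P1 Bug", 7 * 24),         # scheduled patching
    "medium": ("P2 Bug", 30 * 24),      # backlog grooming
    "low": ("P3 Bug", 30 * 24),
}

def triage_vulnerability(severity: str) -> tuple[str, int]:
    """Return (ticket type, remediation SLA in hours) for a severity."""
    key = severity.lower()
    if key not in TRIAGE:
        raise ValueError(f"Unknown severity: {severity}")
    return TRIAGE[key]
```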

Compliance Automation & Guardrails

Infrastructure Compliance Validation (CI/CD Pipeline):

# Compliance validation stage in infrastructure pipeline
- stage: ComplianceValidation
  displayName: 'Validate Compliance Before Deployment'
  dependsOn: PulumiPreview
  jobs:
  - job: ComplianceScan
    displayName: 'Run Compliance Scans'
    steps:
    # Checkov - Infrastructure as Code compliance
    - task: Bash@3
      displayName: 'Run Checkov IaC Scan'
      inputs:
        targetType: 'inline'
        script: |
          pip install checkov
          # Checkov has no dedicated Pulumi framework; let it auto-detect
          # scannable files (e.g., Dockerfiles, Kubernetes manifests) in the tree
          checkov --directory infrastructure/ \
            --output junitxml \
            --output-file-path checkov-results.xml \
            --soft-fail  # Don't fail the build, just report

    - task: PublishTestResults@2
      inputs:
        testResultsFormat: 'JUnit'
        testResultsFiles: 'checkov-results.xml'
        testRunTitle: 'Checkov Compliance Scan'

    # Azure Policy What-If Analysis
    - task: AzureCLI@2
      displayName: 'Azure Policy What-If'
      inputs:
        azureSubscription: $(azureSubscription)
        scriptType: 'bash'
        scriptLocation: 'inlineScript'
        inlineScript: |
          # Simulate policy compliance for proposed changes
          az policy state trigger-scan \
            --resource-group $(resourceGroup)

          # Get compliance state
          az policy state list \
            --resource-group $(resourceGroup) \
            --filter "complianceState eq 'NonCompliant'" \
            --output table

    # Terraform Compliance (if using Terraform instead of Pulumi)
    - task: Bash@3
      displayName: 'Run Terraform Compliance'
      inputs:
        targetType: 'inline'
        script: |
          pip install terraform-compliance
          terraform-compliance \
            --features compliance/ \
            --planfile tfplan.json

Application Compliance Validation (C# Build):

// Unit test: Validate PII redaction
[Fact]
public void ApiResponse_WhenContainsPII_RedactsEmail()
{
    // Arrange
    var middleware = new PiiRedactionMiddleware(/* ... */);
    var response = new
    {
        tenantId = "tenant-123",
        contactEmail = "user@example.com",  // PII
        auditEventCount = 1000
    };

    // Act
    var json = JsonSerializer.Serialize(response);
    var redacted = _piiDetector.RedactPii(json, _piiDetector.DetectPii(json));

    // Assert
    Assert.DoesNotContain("user@example.com", redacted);
    Assert.Contains("use***@example.com", redacted);  // Masked email
}

// Integration test: Validate encryption at rest
[Fact]
public async Task SqlDatabase_InProduction_HasTdeEnabled()
{
    // Arrange
    var environment = Environment.GetEnvironmentVariable("ASPNETCORE_ENVIRONMENT");

    if (environment != "Production")
    {
        return;  // Skip test in non-production
    }

    // Act
    var tdeStatus = await _sqlClient.ExecuteScalarAsync<string>(
        "SELECT encryption_state_desc FROM sys.dm_database_encryption_keys WHERE database_id = DB_ID()");

    // Assert
    Assert.Equal("ENCRYPTED", tdeStatus);
}

Audit Trail Integrity

Log Signing (Production):

// Sign audit logs with HSM for tamper-evidence
public class AuditLogSigner
{
    private readonly CryptographyClient _cryptoClient;

    public AuditLogSigner(string keyVaultUrl, string keyName)
    {
        var keyClient = new KeyClient(new Uri(keyVaultUrl), new DefaultAzureCredential());
        var key = keyClient.GetKey(keyName);
        _cryptoClient = new CryptographyClient(key.Value.Id, new DefaultAzureCredential());
    }

    public async Task<SignedAuditLog> SignLogBatchAsync(List<AuditLogEntry> logs)
    {
        // Serialize log batch
        var logJson = JsonSerializer.Serialize(logs);
        var logBytes = Encoding.UTF8.GetBytes(logJson);

        // Compute SHA-256 hash
        var hash = SHA256.HashData(logBytes);

        // Sign with HSM key (RSASSA-PKCS1-v1_5)
        var signResult = await _cryptoClient.SignAsync(SignatureAlgorithm.RS256, hash);

        return new SignedAuditLog
        {
            Logs = logs,
            Hash = Convert.ToBase64String(hash),
            Signature = Convert.ToBase64String(signResult.Signature),
            SignedAt = DateTime.UtcNow,
            SigningKeyId = _cryptoClient.KeyId.ToString(),
            Algorithm = "RS256"
        };
    }

    public async Task<bool> VerifyLogBatchAsync(SignedAuditLog signedLog)
    {
        // Recompute hash
        var logJson = JsonSerializer.Serialize(signedLog.Logs);
        var logBytes = Encoding.UTF8.GetBytes(logJson);
        var hash = SHA256.HashData(logBytes);

        // Verify signature
        var signature = Convert.FromBase64String(signedLog.Signature);
        var verifyResult = await _cryptoClient.VerifyAsync(
            SignatureAlgorithm.RS256,
            hash,
            signature);

        return verifyResult.IsValid;
    }
}
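
The verification path above boils down to recomputing the hash of the serialized batch and comparing it to the value recorded at signing time: any edit to the logs changes the hash. A minimal Python sketch of that tamper check using SHA-256 (signature verification against the HSM key is omitted; this is illustrative only):

```python
import hashlib
import json

def batch_hash(logs: list[dict]) -> str:
    """Compute the SHA-256 hash of a canonically serialized log batch."""
    payload = json.dumps(logs, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def is_tampered(logs: list[dict], recorded_hash: str) -> bool:
    """True if the batch no longer matches the hash recorded at signing time."""
    return batch_hash(logs) != recorded_hash
```

Canonical serialization (sorted keys, fixed separators) matters: the recomputed bytes must match the signed bytes exactly, or verification fails even without tampering.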

Log Integrity Validation (Daily):

#!/bin/bash
# validate-log-integrity-prod.sh

echo "Validating audit log integrity for Production..."

# Download signed log batches from immutable storage
az storage blob download-batch \
  --account-name atpstorageprodeus \
  --source "audit-logs" \
  --destination "./log-validation/" \
  --pattern "*.signed.json"

# Validate signatures using Azure Key Vault
for LOG_FILE in ./log-validation/*.signed.json; do
  echo "Validating: $LOG_FILE"

  # Extract signature and hash
  SIGNATURE=$(jq -r '.signature' "$LOG_FILE")
  HASH=$(jq -r '.hash' "$LOG_FILE")

  # Verify signature using Key Vault
  az keyvault key verify \
    --vault-name atp-keyvault-prod-eus \
    --name AuditLogSigningKey \
    --algorithm RS256 \
    --digest $HASH \
    --signature $SIGNATURE

  if [ $? -eq 0 ]; then
    echo "✅ $LOG_FILE signature valid"
  else
    echo "❌ $LOG_FILE signature INVALID - potential tampering detected"

    # Create security incident
    az boards work-item create \
      --title "Log Tampering Detected: $LOG_FILE" \
      --type "Incident" \
      --description "Audit log signature validation failed. Potential tampering detected.\n\nFile: $LOG_FILE" \
      --assigned-to "security-team@connectsoft.example" \
      --fields Priority=1 Severity="1 - Critical"

    exit 1
  fi
done

echo "✅ All audit logs validated successfully"

Compliance Reporting & Attestation

Quarterly SOC 2 Report:

// Generate SOC 2 Type II attestation report
[FunctionName("GenerateSOC2Report")]
public async Task RunAsync(
    [TimerTrigger("0 0 1 1 */3 *")] TimerInfo timer,  // Quarterly
    ILogger log)
{
    log.LogInformation("Generating SOC 2 Type II report...");

    var quarter = $"Q{(DateTime.UtcNow.Month - 1) / 3 + 1}-{DateTime.UtcNow.Year}";

    var report = new SOC2Report
    {
        Quarter = quarter,
        ReportDate = DateTime.UtcNow,
        Scope = "ATP Production Environment (East US + West Europe)"
    };

    // CC6.1: Logical and Physical Access Controls
    report.CC6_1 = new ControlEvidence
    {
        ControlName = "CC6.1 - Logical and Physical Access Controls",
        EvidenceCollected = new List<string>
        {
            "Azure AD sign-in logs (90 days)",
            "PIM activation logs (all elevations)",
            "Key Vault access logs (all secret retrievals)",
            "NSG flow logs (all network access attempts)",
            "Access review reports (weekly)"
        },
        TestingPerformed = "Sampled 100 access requests; validated MFA and PIM enforcement",
        Exceptions = new List<string>(),  // No exceptions
        OpinionDate = DateTime.UtcNow,
        Opinion = "Operating Effectively"
    };

    // CC8.1: Change Management Controls
    report.CC8_1 = new ControlEvidence
    {
        ControlName = "CC8.1 - Change Management",
        EvidenceCollected = new List<string>
        {
            "Azure DevOps pipeline logs (all deployments)",
            "Git commit history (code changes)",
            "CAB meeting minutes (approval records)",
            "Deployment artifacts (SBOM, security scans)",
            "Rollback procedures (documented and tested)"
        },
        TestingPerformed = "Sampled 50 production deployments; validated CAB approval and SBOM generation",
        Exceptions = new List<string> { "1 deployment without SBOM (remediated within 24h)" },
        OpinionDate = DateTime.UtcNow,
        Opinion = "Operating Effectively (1 exception noted)"
    };

    // ... (remaining controls)

    // Generate PDF report
    var pdf = await GeneratePdfReportAsync(report);

    // Upload to compliance evidence storage
    await UploadToImmutableStorageAsync($"soc2-reports/{quarter}/SOC2-Report-{quarter}.pdf", pdf);

    // Send to auditor
    await SendToAuditorAsync(pdf, "auditor@example-auditing-firm.com");

    log.LogInformation($"✅ SOC 2 report generated for {quarter}");
}

External Auditor Access

Time-Limited Read-Only Access (Production):

#!/bin/bash
# grant-auditor-access.sh

AUDITOR_EMAIL=$1
DURATION_HOURS=${2:-8}  # Default: 8 hours
JUSTIFICATION=$3

echo "Granting auditor access to Production..."
echo "Auditor: $AUDITOR_EMAIL"
echo "Duration: $DURATION_HOURS hours"
echo "Justification: $JUSTIFICATION"

# Look up the auditor in Azure AD; create a guest account if one does not exist
AUDITOR_ID=$(az ad user show --id "$AUDITOR_EMAIL" --query id -o tsv 2>/dev/null)

if [ -z "$AUDITOR_ID" ]; then
  echo "Creating guest user..."
  # az ad user create requires an initial password; convey it to the auditor out of band
  az ad user create \
    --display-name "External Auditor" \
    --user-principal-name "$AUDITOR_EMAIL" \
    --password "$(openssl rand -base64 24)" \
    --account-enabled true \
    --force-change-password-next-sign-in true

  AUDITOR_ID=$(az ad user show --id "$AUDITOR_EMAIL" --query id -o tsv)
fi

# Grant the built-in Reader role via Azure PIM (time-limited assignment, ARM API)
SCOPE="/subscriptions/<sub-id>/resourceGroups/ConnectSoft-ATP-Prod-EUS-RG"
READER_ROLE_ID="acdd72a7-3385-48ef-bd42-f606fba81ae7"  # Well-known built-in Reader role definition ID

az rest \
  --method PUT \
  --url "https://management.azure.com${SCOPE}/providers/Microsoft.Authorization/roleAssignmentScheduleRequests/$(uuidgen)?api-version=2020-10-01" \
  --body "{
    \"properties\": {
      \"principalId\": \"$AUDITOR_ID\",
      \"roleDefinitionId\": \"${SCOPE}/providers/Microsoft.Authorization/roleDefinitions/${READER_ROLE_ID}\",
      \"requestType\": \"AdminAssign\",
      \"justification\": \"$JUSTIFICATION\",
      \"scheduleInfo\": {
        \"startDateTime\": \"$(date -u +%Y-%m-%dT%H:%M:%SZ)\",
        \"expiration\": {
          \"type\": \"AfterDuration\",
          \"duration\": \"PT${DURATION_HOURS}H\"
        }
      }
    }
  }"

# Log the access grant for compliance (role-assignment writes from the last 15 minutes)
az monitor activity-log list \
  --resource-group ConnectSoft-ATP-Prod-EUS-RG \
  --start-time "$(date -u -d '-15 minutes' +%Y-%m-%dT%H:%M:%SZ)" \
  --query "[?authorization.action=='Microsoft.Authorization/roleAssignments/write']"

echo "✅ Auditor access granted (expires in $DURATION_HOURS hours)"
echo "Auditor can access: Azure Portal (read-only), Log Analytics, Compliance Reports"

Summary

  • Compliance Enforcement: Graduated from optional (Dev) to required+BYOK (Production) for encryption, immutability, and audit logging.
  • Environment Policies: Dev/Test (synthetic data, relaxed controls), Staging (production-like, full simulation), Production (GDPR/HIPAA/SOC2 enforced).
  • Encryption: TDE optional (Dev), TDE required (Staging), TDE+BYOK+HSM (Production) with 90-day automated key rotation.
  • Immutability: Disabled (Dev/Test), enabled for testing (Staging), WORM+Legal Hold (Production) with 7-year retention.
  • Audit Logging: Basic/7-day (Dev), Enhanced/30-day (Staging), Full/90-day hot + 7-year cold (Production).
  • Access Controls: Developer full access (Dev), MFA+PIM (Staging), Zero standing access+PIM+Conditional Access (Production).
  • Penetration Testing: Annually (Dev/Test), Quarterly (Staging/Production), Post-Change (Production).
  • Vulnerability Scanning: Weekly (Dev), Daily (Test), Continuous (Staging/Production) with 24-hour critical remediation SLA.
  • PII Redaction: Optional (Dev/Test), enforced with automated detection/masking (Staging/Production).
  • Evidence Collection: Pipeline logs (1 year), deployment artifacts (7 years), access logs (7 years), incident records (7 years).
  • Compliance Reporting: Quarterly SOC 2/GDPR/HIPAA reports with automated evidence export to immutable storage.
  • Auditor Access: Time-limited read-only access (8-hour max) via PIM with full audit trail.
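The graduated controls in this summary amount to a per-environment policy table that configuration code can consult. The sketch below (plain Python; values mirror the bullets above, structure and names are illustrative) shows how such a lookup might drive environment provisioning:

```python
# Per-environment compliance policy lookup (values mirror the summary above)
POLICIES = {
    "dev":        {"tde": "optional",       "immutability": False, "audit_retention_days": 7},
    "staging":    {"tde": "required",       "immutability": True,  "audit_retention_days": 30},
    "production": {"tde": "required+byok",  "immutability": True,
                   "audit_retention_days": 90, "cold_retention_years": 7},
}

def policy_for(environment: str) -> dict:
    """Fail closed: unknown environments get the strictest (production) policy."""
    return POLICIES.get(environment, POLICIES["production"])

print(policy_for("dev")["audit_retention_days"])   # 7
print(policy_for("unknown")["tde"])                # required+byok
```

Failing closed (defaulting to the production policy) keeps a typo in an environment name from silently relaxing controls.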

Testing Strategies Per Environment

ATP implements progressive testing strategies across environments, from rapid unit testing in Dev to production synthetic monitoring with canary deployments. Each environment has tailored testing approaches that balance development velocity (fast feedback loops) with production confidence (comprehensive validation).

This strategy ensures high code quality through comprehensive testing in lower environments while maintaining production stability through gradual rollouts, chaos engineering, and continuous synthetic monitoring.

Testing Pyramid Per Environment

graph TB
    subgraph Dev Environment
        DevUnit[Unit Tests<br/>1000+ tests<br/>< 2 minutes]
        DevInteg[Integration Tests<br/>100+ tests<br/>< 5 minutes]
        DevE2E[Local E2E Tests<br/>10+ flows<br/>Developer-run]

        DevUnit --> DevInteg
        DevInteg --> DevE2E
    end

    subgraph Test Environment
        TestSmoke[Smoke Tests<br/>Critical paths<br/>Post-deployment]
        TestRegression[Regression Suite<br/>500+ tests<br/>Nightly]
        TestContract[API Contract Tests<br/>OpenAPI validation<br/>Breaking change detection]

        TestSmoke --> TestRegression
        TestRegression --> TestContract
    end

    subgraph Staging Environment
        StagingLoad[Load Tests<br/>50% prod traffic<br/>60-minute runs]
        StagingChaos[Chaos Tests<br/>Failure injection<br/>Weekly]
        StagingSecurity[Security Tests<br/>OWASP ZAP<br/>Pre-production]
        StagingDR[DR Drills<br/>Failover validation<br/>Monthly]

        StagingLoad --> StagingChaos
        StagingChaos --> StagingSecurity
        StagingSecurity --> StagingDR
    end

    subgraph Production Environment
        ProdSynthetic[Synthetic Monitors<br/>Multi-region<br/>Every 5 minutes]
        ProdCanary[Canary Tests<br/>10% rollout<br/>24-hour validation]
        ProdAB[A/B Tests<br/>Feature flags<br/>Statistical analysis]

        ProdSynthetic --> ProdCanary
        ProdCanary --> ProdAB
    end

    style DevUnit fill:#90EE90
    style TestRegression fill:#FFD700
    style StagingChaos fill:#FFA500
    style ProdSynthetic fill:#FF6347

Dev Environment Testing

Purpose: Rapid feedback through comprehensive unit and integration tests executed on every commit.

Unit Tests

Execution: Every commit (pre-push hook + CI pipeline)

Configuration (ConnectSoft.ATP.Ingestion.Tests.csproj):

<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <TargetFramework>net8.0</TargetFramework>
    <IsPackable>false</IsPackable>
    <IsTestProject>true</IsTestProject>
  </PropertyGroup>

  <ItemGroup>
    <PackageReference Include="xunit" Version="2.6.2" />
    <PackageReference Include="xunit.runner.visualstudio" Version="2.5.4" />
    <PackageReference Include="Moq" Version="4.20.70" />
    <PackageReference Include="FluentAssertions" Version="6.12.0" />
    <PackageReference Include="Microsoft.NET.Test.Sdk" Version="17.8.0" />
    <PackageReference Include="coverlet.collector" Version="6.0.0" />
  </ItemGroup>

  <ItemGroup>
    <ProjectReference Include="..\ConnectSoft.ATP.Ingestion\ConnectSoft.ATP.Ingestion.csproj" />
  </ItemGroup>
</Project>

Example Unit Test (ATP Ingestion Service):

// Unit test: Validate audit event ingestion
public class AuditIngestionServiceTests
{
    private readonly Mock<IAuditRepository> _mockRepository;
    private readonly Mock<ITamperEvidenceService> _mockTamperEvidence;
    private readonly Mock<ILogger<AuditIngestionService>> _mockLogger;
    private readonly AuditIngestionService _service;

    public AuditIngestionServiceTests()
    {
        _mockRepository = new Mock<IAuditRepository>();
        _mockTamperEvidence = new Mock<ITamperEvidenceService>();
        _mockLogger = new Mock<ILogger<AuditIngestionService>>();

        _service = new AuditIngestionService(
            _mockRepository.Object,
            _mockTamperEvidence.Object,
            _mockLogger.Object);
    }

    [Fact]
    public async Task IngestEvent_ValidEvent_ReturnsSuccessWithEventId()
    {
        // Arrange
        var auditEvent = new AuditEvent
        {
            TenantId = "tenant-123",
            EventType = "UserLogin",
            Timestamp = DateTime.UtcNow,
            Payload = "{\"userId\": \"user-456\"}"
        };

        _mockRepository
            .Setup(r => r.InsertAsync(It.IsAny<AuditEvent>()))
            .ReturnsAsync(new AuditEvent { EventId = Guid.NewGuid() });

        _mockTamperEvidence
            .Setup(t => t.GenerateEvidenceAsync(It.IsAny<AuditEvent>()))
            .ReturnsAsync(new TamperEvidence { Hash = "abc123", Signature = "def456" });

        // Act
        var result = await _service.IngestEventAsync(auditEvent);

        // Assert
        result.Should().NotBeNull();
        result.EventId.Should().NotBeEmpty();
        result.TamperEvidence.Should().NotBeNull();

        _mockRepository.Verify(r => r.InsertAsync(It.IsAny<AuditEvent>()), Times.Once);
        _mockTamperEvidence.Verify(t => t.GenerateEvidenceAsync(It.IsAny<AuditEvent>()), Times.Once);
    }

    [Fact]
    public async Task IngestEvent_NullEvent_ThrowsArgumentNullException()
    {
        // Act & Assert
        await Assert.ThrowsAsync<ArgumentNullException>(() => 
            _service.IngestEventAsync(null));
    }

    [Theory]
    [InlineData(null)]
    [InlineData("")]
    [InlineData("   ")]
    public async Task IngestEvent_InvalidTenantId_ThrowsValidationException(string tenantId)
    {
        // Arrange
        var auditEvent = new AuditEvent { TenantId = tenantId };

        // Act & Assert
        await Assert.ThrowsAsync<ValidationException>(() => 
            _service.IngestEventAsync(auditEvent));
    }
}

Unit Test Execution (Pre-Push Hook):

#!/bin/bash
# .git/hooks/pre-push

echo "Running unit tests before push..."

dotnet test ConnectSoft.ATP.Ingestion.sln \
  --configuration Debug \
  --filter Category=Unit \
  --logger "console;verbosity=minimal" \
  --no-build

if [ $? -ne 0 ]; then
  echo "❌ Unit tests failed; push aborted"
  exit 1
fi

echo "✅ Unit tests passed; proceeding with push"

Coverage Threshold (Dev):

<!-- runsettings file -->
<RunSettings>
  <DataCollectionRunSettings>
    <DataCollectors>
      <DataCollector friendlyName="XPlat Code Coverage">
        <Configuration>
          <Format>cobertura</Format>
          <Exclude>[*]*.Program,[*]*.Startup</Exclude>
          <IncludeTestAssembly>false</IncludeTestAssembly>
        </Configuration>
      </DataCollector>
    </DataCollectors>
  </DataCollectionRunSettings>

  <RunConfiguration>
    <TargetFramework>net8.0</TargetFramework>
    <TestSessionTimeout>300000</TestSessionTimeout>
    <CollectSourceInformation>true</CollectSourceInformation>
  </RunConfiguration>
</RunSettings>
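The runsettings file above configures Cobertura collection but does not itself fail the build when coverage drops; a separate gate has to read the produced report. A minimal sketch of such a gate (plain Python; the file path and the 80% Dev target are assumptions, not values from this document):

```python
import sys
import xml.etree.ElementTree as ET

THRESHOLD = 0.80  # Assumed Dev line-coverage target

def check_coverage(cobertura_path: str, threshold: float = THRESHOLD) -> float:
    """Read the Cobertura root element's line-rate attribute; exit non-zero if below threshold."""
    line_rate = float(ET.parse(cobertura_path).getroot().get("line-rate"))
    if line_rate < threshold:
        sys.exit(f"Coverage {line_rate:.1%} is below the {threshold:.0%} threshold")
    return line_rate
```

In a pipeline this would run right after `dotnet test --collect:"XPlat Code Coverage"`, pointed at the generated `coverage.cobertura.xml`.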

Integration Tests

Execution: Every CI build (with service containers)

Service Container Setup (azure-pipelines.yml):

# ATP Ingestion pipeline with service containers
resources:
  containers:
    - container: redis
      image: redis:7-alpine
      ports: ['6379:6379']
      options: --health-cmd "redis-cli ping" --health-interval 10s

    - container: rabbitmq
      image: rabbitmq:3-management-alpine
      ports: ['5672:5672', '15672:15672']
      env:
        RABBITMQ_DEFAULT_USER: guest
        RABBITMQ_DEFAULT_PASS: guest
      options: --health-cmd "rabbitmq-diagnostics -q ping" --health-interval 10s

    - container: mssql
      image: mcr.microsoft.com/mssql/server:2022-latest
      ports: ['1433:1433']
      env:
        ACCEPT_EULA: Y
        SA_PASSWORD: P@ssw0rd123!
      options: --health-cmd "/opt/mssql-tools18/bin/sqlcmd -C -S localhost -U sa -P 'P@ssw0rd123!' -Q 'SELECT 1'" --health-interval 10s

stages:
  - stage: CI_Stage
    jobs:
    - job: Integration_Tests
      services:
        redis: redis
        rabbitmq: rabbitmq
        mssql: mssql

      steps:
      - task: DotNetCoreCLI@2
        displayName: 'Run Integration Tests'
        inputs:
          command: 'test'
          projects: '**/*Tests.Integration.csproj'
          arguments: '--configuration Debug --filter Category=Integration --collect:"XPlat Code Coverage"'

Example Integration Test (Redis Integration):

[Collection("IntegrationTests")]
[Trait("Category", "Integration")]
public class RedisCacheIntegrationTests : IAsyncLifetime
{
    private readonly IConnectionMultiplexer _redis;
    private readonly IDatabase _db;

    public RedisCacheIntegrationTests()
    {
        // Connect to the service-container Redis; allowAdmin permits FLUSHALL during cleanup
        _redis = ConnectionMultiplexer.Connect("localhost:6379,allowAdmin=true");
        _db = _redis.GetDatabase();
    }

    [Fact]
    public async Task CacheService_WhenEventCached_RetrievesFromRedis()
    {
        // Arrange
        var cacheKey = "tenant-123:event-456";
        var auditEvent = new AuditEvent
        {
            EventId = Guid.NewGuid(),
            TenantId = "tenant-123",
            EventType = "UserLogin"
        };

        var cacheService = new CacheService(_redis);

        // Act
        await cacheService.SetAsync(cacheKey, auditEvent, TimeSpan.FromMinutes(5));
        var retrieved = await cacheService.GetAsync<AuditEvent>(cacheKey);

        // Assert
        retrieved.Should().NotBeNull();
        retrieved.EventId.Should().Be(auditEvent.EventId);
        retrieved.TenantId.Should().Be("tenant-123");
    }

    public Task InitializeAsync() => Task.CompletedTask;

    public async Task DisposeAsync()
    {
        await _db.ExecuteAsync("FLUSHALL");  // Clean up after tests
        _redis?.Dispose();
    }
}

Example Integration Test (SQL Database):

[Collection("IntegrationTests")]
[Trait("Category", "Integration")]
public class AuditRepositoryIntegrationTests : IAsyncLifetime
{
    private readonly SqlConnection _connection;
    private readonly AuditRepository _repository;

    public AuditRepositoryIntegrationTests()
    {
        var connectionString = "Server=localhost,1433;Database=ATP_Test;User Id=sa;Password=P@ssw0rd123!;TrustServerCertificate=true";
        _connection = new SqlConnection(connectionString);
        _repository = new AuditRepository(_connection);
    }

    [Fact]
    public async Task InsertAsync_ValidEvent_PersistsToDatabase()
    {
        // Arrange
        var auditEvent = new AuditEvent
        {
            TenantId = "tenant-integration-test",
            EventType = "TestEvent",
            Timestamp = DateTime.UtcNow,
            Payload = "{\"test\": true}"
        };

        // Act
        var inserted = await _repository.InsertAsync(auditEvent);
        var retrieved = await _repository.GetByIdAsync(inserted.EventId);

        // Assert
        retrieved.Should().NotBeNull();
        retrieved.EventId.Should().Be(inserted.EventId);
        retrieved.TenantId.Should().Be("tenant-integration-test");
        retrieved.EventType.Should().Be("TestEvent");
    }

    public async Task InitializeAsync()
    {
        await _connection.OpenAsync();

        // Run database migrations
        await _connection.ExecuteAsync(@"
            IF NOT EXISTS (SELECT * FROM sys.tables WHERE name = 'AuditEvents')
            BEGIN
                CREATE TABLE AuditEvents (
                    EventId UNIQUEIDENTIFIER PRIMARY KEY,
                    TenantId NVARCHAR(50) NOT NULL,
                    EventType NVARCHAR(100) NOT NULL,
                    Timestamp DATETIME2 NOT NULL,
                    Payload NVARCHAR(MAX)
                )
            END
        ");
    }

    public async Task DisposeAsync()
    {
        // Clean up test data
        await _connection.ExecuteAsync("DELETE FROM AuditEvents WHERE TenantId = 'tenant-integration-test'");
        await _connection.CloseAsync();
        _connection.Dispose();
    }
}

Local End-to-End Tests

Execution: Developer-run (manual or via VS Code tasks)

Postman Collection (ATP.Ingestion.postman_collection.json):

{
  "info": {
    "name": "ATP Ingestion API - Dev",
    "description": "Local E2E tests for ATP Ingestion service",
    "schema": "https://schema.getpostman.com/json/collection/v2.1.0/collection.json"
  },
  "item": [
    {
      "name": "Health Check",
      "request": {
        "method": "GET",
        "header": [],
        "url": {
          "raw": "https://localhost:7001/health",
          "protocol": "https",
          "host": ["localhost"],
          "port": "7001",
          "path": ["health"]
        }
      },
      "event": [
        {
          "listen": "test",
          "script": {
            "exec": [
              "pm.test(\"Status is 200\", function () {",
              "    pm.response.to.have.status(200);",
              "});",
              "",
              "pm.test(\"Health status is Healthy\", function () {",
              "    var jsonData = pm.response.json();",
              "    pm.expect(jsonData.status).to.eql('Healthy');",
              "});"
            ]
          }
        }
      ]
    },
    {
      "name": "Ingest Audit Event",
      "request": {
        "method": "POST",
        "header": [
          {
            "key": "Content-Type",
            "value": "application/json"
          },
          {
            "key": "X-Tenant-Id",
            "value": "tenant-dev-001"
          }
        ],
        "body": {
          "mode": "raw",
          "raw": "{\n  \"eventType\": \"UserLogin\",\n  \"timestamp\": \"{{$isoTimestamp}}\",\n  \"userId\": \"user-123\",\n  \"ipAddress\": \"203.0.113.10\",\n  \"metadata\": {\n    \"browser\": \"Chrome\",\n    \"os\": \"Windows\"\n  }\n}"
        },
        "url": {
          "raw": "https://localhost:7001/api/v1/audit/ingest",
          "protocol": "https",
          "host": ["localhost"],
          "port": "7001",
          "path": ["api", "v1", "audit", "ingest"]
        }
      },
      "event": [
        {
          "listen": "test",
          "script": {
            "exec": [
              "pm.test(\"Status is 201 Created\", function () {",
              "    pm.response.to.have.status(201);",
              "});",
              "",
              "pm.test(\"Response contains eventId\", function () {",
              "    var jsonData = pm.response.json();",
              "    pm.expect(jsonData.eventId).to.exist;",
              "    pm.environment.set('lastEventId', jsonData.eventId);",
              "});",
              "",
              "pm.test(\"Tamper evidence generated\", function () {",
              "    var jsonData = pm.response.json();",
              "    pm.expect(jsonData.tamperEvidence).to.exist;",
              "    pm.expect(jsonData.tamperEvidence.hash).to.not.be.empty;",
              "});"
            ]
          }
        }
      ]
    },
    {
      "name": "Query Audit Event",
      "request": {
        "method": "GET",
        "header": [
          {
            "key": "X-Tenant-Id",
            "value": "tenant-dev-001"
          }
        ],
        "url": {
          "raw": "https://localhost:7002/api/v1/audit/events/{{lastEventId}}",
          "protocol": "https",
          "host": ["localhost"],
          "port": "7002",
          "path": ["api", "v1", "audit", "events", "{{lastEventId}}"]
        }
      },
      "event": [
        {
          "listen": "test",
          "script": {
            "exec": [
              "pm.test(\"Event retrieved successfully\", function () {",
              "    pm.response.to.have.status(200);",
              "    var jsonData = pm.response.json();",
              "    pm.expect(jsonData.eventId).to.eql(pm.environment.get('lastEventId'));",
              "});"
            ]
          }
        }
      ]
    }
  ]
}

Run Postman Tests (Newman CLI):

#!/bin/bash
# run-e2e-tests-dev.sh

echo "Starting ATP services locally..."

# Start services (Docker Compose)
docker-compose -f docker-compose.dev.yml up -d

# Wait for the ingestion service to report healthy (up to ~60s) instead of a fixed sleep
for i in $(seq 1 30); do
  curl -ksf https://localhost:7001/health > /dev/null && break
  sleep 2
done

# Run Postman collection
newman run ATP.Ingestion.postman_collection.json \
  --environment ATP.Dev.postman_environment.json \
  --reporters cli,json,html \
  --reporter-html-export e2e-results.html

EXIT_CODE=$?

# Stop services
docker-compose -f docker-compose.dev.yml down

if [ $EXIT_CODE -eq 0 ]; then
  echo "✅ E2E tests passed"
else
  echo "❌ E2E tests failed"
  exit 1
fi

Test Environment Testing

Purpose: Automated verification via post-deployment smoke tests, a nightly regression suite, and API contract checks.

Smoke Tests

Execution: Post-deployment (automated via Azure Pipelines)

Configuration (CI/CD Pipeline):

# Smoke tests after deployment to Test
- stage: CD_Test
  dependsOn: CI_Stage
  jobs:
  - deployment: DeployToTest
    environment: ATP-Test
    strategy:
      runOnce:
        deploy:
          steps:
          - template: deploy/deploy-microservice-to-azure-web-site.yaml@templates
            parameters:
              azureSubscription: $(azureSubscription)
              appName: atp-ingestion-test

          # Post-deployment smoke tests
          - task: PowerShell@2
            displayName: 'Run Smoke Tests'
            inputs:
              targetType: 'inline'
              script: |
                $services = @(
                    @{ Name = "Gateway"; Url = "https://atp-gateway-test.azurewebsites.net/health" },
                    @{ Name = "Ingestion"; Url = "https://atp-ingestion-test.azurewebsites.net/health" },
                    @{ Name = "Query"; Url = "https://atp-query-test.azurewebsites.net/health" }
                )

                $allHealthy = $true

                foreach ($service in $services) {
                    Write-Host "Checking $($service.Name) health..."

                    $response = Invoke-RestMethod -Uri $service.Url -Method Get -TimeoutSec 30

                    if ($response.status -eq "Healthy") {
                        Write-Host "✅ $($service.Name) healthy"
                    } else {
                        Write-Error "❌ $($service.Name) unhealthy: $($response.status)"
                        $allHealthy = $false
                    }
                }

                if (-not $allHealthy) {
                    Write-Error "Smoke tests failed; deployment may need rollback"
                    exit 1
                }

                Write-Host "✅ All smoke tests passed"

Regression Tests

Execution: Nightly (full suite)

Regression Test Suite (SpecFlow/BDD):

// Feature: Audit event ingestion regression tests
[Binding]
public class AuditIngestionSteps
{
    private readonly HttpClient _httpClient;
    private HttpResponseMessage _response;
    private AuditEvent _auditEvent;

    public AuditIngestionSteps()
    {
        _httpClient = new HttpClient
        {
            BaseAddress = new Uri("https://atp-gateway-test.azurewebsites.net")
        };
    }

    [Given(@"I have a valid audit event for tenant ""(.*)""")]
    public void GivenIHaveAValidAuditEvent(string tenantId)
    {
        _auditEvent = new AuditEvent
        {
            TenantId = tenantId,
            EventType = "UserLogin",
            Timestamp = DateTime.UtcNow,
            Payload = "{\"userId\": \"test-user\"}"
        };
    }

    [When(@"I submit the event to the ingestion API")]
    public async Task WhenISubmitTheEvent()
    {
        _httpClient.DefaultRequestHeaders.Add("X-Tenant-Id", _auditEvent.TenantId);

        var json = JsonSerializer.Serialize(_auditEvent);
        var content = new StringContent(json, Encoding.UTF8, "application/json");

        _response = await _httpClient.PostAsync("/api/v1/audit/ingest", content);
    }

    [Then(@"the event should be accepted with status (.*)")]
    public void ThenTheEventShouldBeAccepted(int expectedStatus)
    {
        Assert.Equal(expectedStatus, (int)_response.StatusCode);
    }

    [Then(@"the response should contain an event ID")]
    public async Task ThenTheResponseShouldContainEventId()
    {
        var responseBody = await _response.Content.ReadAsStringAsync();
        var result = JsonSerializer.Deserialize<IngestResponse>(responseBody);

        Assert.NotNull(result);
        Assert.NotEqual(Guid.Empty, result.EventId);
    }

    [Then(@"tamper evidence should be generated")]
    public async Task ThenTamperEvidenceShouldBeGenerated()
    {
        var responseBody = await _response.Content.ReadAsStringAsync();
        var result = JsonSerializer.Deserialize<IngestResponse>(responseBody);

        Assert.NotNull(result.TamperEvidence);
        Assert.NotEmpty(result.TamperEvidence.Hash);
        Assert.NotEmpty(result.TamperEvidence.Signature);
    }
}
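The step bindings above map to a Gherkin feature file; a plausible companion (an illustrative scenario, not taken from the actual suite) would look like:

```gherkin
Feature: Audit event ingestion
  Regression coverage for the ATP ingestion API

  Scenario: Valid event is ingested with tamper evidence
    Given I have a valid audit event for tenant "tenant-regression-001"
    When I submit the event to the ingestion API
    Then the event should be accepted with status 201
    And the response should contain an event ID
    And tamper evidence should be generated
```

SpecFlow matches each step text against the `[Given]`, `[When]`, and `[Then]` regular expressions in the binding class, so scenarios stay readable by non-developers while exercising the real Test-environment API.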

Nightly Regression Pipeline:

# Scheduled nightly regression tests
schedules:
- cron: "0 2 * * *"  # 2 AM daily
  displayName: Nightly Regression Tests
  branches:
    include:
    - main
  always: true  # Run even if no code changes

stages:
  - stage: Regression_Tests
    displayName: 'Nightly Regression Suite'
    jobs:
    - job: Full_Regression
      timeoutInMinutes: 60
      steps:
      - task: DotNetCoreCLI@2
        displayName: 'Run Full Regression Suite'
        inputs:
          command: 'test'
          projects: '**/*Tests.Regression.csproj'
          arguments: '--configuration Release --filter Category=Regression --logger "trx;LogFileName=regression-results.trx"'

      - task: PublishTestResults@2
        inputs:
          testResultsFormat: 'VSTest'
          testResultsFiles: '**/regression-results.trx'
          testRunTitle: 'Nightly Regression Tests'
          failTaskOnFailedTests: true

      # Send results to team
      - task: PowerShell@2
        condition: always()
        displayName: 'Send Test Results Email'
        inputs:
          targetType: 'inline'
          script: |
            $testResults = [xml](Get-Content regression-results.trx -Raw)
            $passed = $testResults.TestRun.ResultSummary.Counters.passed
            $failed = $testResults.TestRun.ResultSummary.Counters.failed

            $subject = if ($failed -eq 0) { "✅ Nightly Regression PASSED" } else { "❌ Nightly Regression FAILED" }

            Send-MailMessage `
              -From "atp-ci@connectsoft.example" `
              -To "dev-team@connectsoft.example" `
              -Subject $subject `
              -Body "Passed: $passed | Failed: $failed"

API Contract Tests

Execution: Every deployment (validate OpenAPI spec)

OpenAPI Spec Validation:

// Validate OpenAPI spec matches implementation
[Fact]
public async Task OpenApiSpec_MatchesImplementation()
{
    // Arrange
    var httpClient = new HttpClient
    {
        BaseAddress = new Uri("https://atp-gateway-test.azurewebsites.net")
    };

    // Act: Fetch and parse the generated OpenAPI spec (Microsoft.OpenApi.Readers)
    var specJson = await httpClient.GetStringAsync("/swagger/v1/swagger.json");
    var spec = new OpenApiStringReader().Read(specJson, out var diagnostic);

    // Assert: Validate critical endpoints exist
    Assert.True(spec.Paths.ContainsKey("/api/v1/audit/ingest"));
    Assert.True(spec.Paths.ContainsKey("/api/v1/audit/events/{eventId}"));

    // Validate request/response schemas
    var ingestEndpoint = spec.Paths["/api/v1/audit/ingest"].Operations[OperationType.Post];
    Assert.NotNull(ingestEndpoint.RequestBody);
    Assert.NotNull(ingestEndpoint.Responses["201"]);
}

// Breaking change detection
[Fact]
public async Task OpenApiSpec_NoBreakingChanges()
{
    // Arrange: Load previous version OpenAPI spec
    var previousSpec = await LoadOpenApiSpecAsync("v1.0.0");
    var currentSpec = await FetchCurrentOpenApiSpecAsync();

    // Act: Detect breaking changes
    var breakingChanges = OpenApiDiff.Compare(previousSpec, currentSpec)
        .Where(c => c.IsBreaking)
        .ToList();

    // Assert: No breaking changes allowed
    Assert.Empty(breakingChanges);
}
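Independently of the `OpenApiDiff` helper used above, the core of breaking-change detection — flagging paths or operations that existed in the previous spec but are gone from the current one — can be sketched in a few lines (plain Python over minimal hypothetical spec dicts, not the full OpenAPI model):

```python
def breaking_changes(previous: dict, current: dict) -> list[str]:
    """Report removed paths and removed operations: both break existing clients."""
    changes = []
    for path, ops in previous.get("paths", {}).items():
        if path not in current.get("paths", {}):
            changes.append(f"removed path: {path}")
            continue
        for verb in ops:
            if verb not in current["paths"][path]:
                changes.append(f"removed operation: {verb.upper()} {path}")
    return changes

v1 = {"paths": {"/api/v1/audit/ingest": {"post": {}},
                "/api/v1/audit/events/{eventId}": {"get": {}}}}
v2 = {"paths": {"/api/v1/audit/ingest": {"post": {}}}}

print(breaking_changes(v1, v2))  # ['removed path: /api/v1/audit/events/{eventId}']
```

A production-grade check would also flag newly required parameters and narrowed response schemas, which is what dedicated diff tooling adds over this sketch.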

OpenAPI Diff Tool (CI/CD Pipeline):

# API contract validation stage
- stage: API_Contract_Validation
  dependsOn: CD_Test
  jobs:
  - job: Validate_API_Contract
    steps:
    - task: Bash@3
      displayName: 'Download Previous OpenAPI Spec'
      inputs:
        targetType: 'inline'
        script: |
          # Download last known-good spec from Azure Artifacts
          az artifacts universal download \
            --organization https://dev.azure.com/ConnectSoft \
            --feed ConnectSoft-Artifacts \
            --name openapi-spec-atp-ingestion \
            --version $(PreviousVersion) \
            --path ./previous-spec/

    - task: Bash@3
      displayName: 'Fetch Current OpenAPI Spec'
      inputs:
        targetType: 'inline'
        script: |
          mkdir -p ./current-spec
          curl -sf https://atp-gateway-test.azurewebsites.net/swagger/v1/swagger.json \
            -o ./current-spec/swagger.json

    - task: Bash@3
      displayName: 'Detect Breaking Changes'
      inputs:
        targetType: 'inline'
        script: |
          npx openapi-diff \
            ./previous-spec/swagger.json \
            ./current-spec/swagger.json \
            --format markdown \
            --output api-diff-report.md

          # Check for breaking changes
          if grep -q "Breaking changes detected" api-diff-report.md; then
            echo "❌ Breaking API changes detected"
            cat api-diff-report.md
            exit 1
          fi

          echo "✅ No breaking API changes"

Staging Environment Testing

Purpose: Production validation with load testing, chaos engineering, security testing, and DR drills before production deployment.

Load Tests

Execution: Pre-production deployment (validate scalability)

Apache JMeter Test Plan:

<!-- ATP-LoadTest.jmx -->
<jmeterTestPlan version="1.2">
  <hashTree>
    <TestPlan guiclass="TestPlanGui" testclass="TestPlan" testname="ATP Ingestion Load Test">
      <elementProp name="TestPlan.user_defined_variables" elementType="Arguments">
        <collectionProp name="Arguments.arguments">
          <elementProp name="TARGET_URL" elementType="Argument">
            <stringProp name="Argument.name">TARGET_URL</stringProp>
            <stringProp name="Argument.value">atp-gateway-staging.azurewebsites.net</stringProp>  <!-- Host only; the scheme belongs in HTTPSampler.protocol -->
          </elementProp>
          <elementProp name="THREAD_COUNT" elementType="Argument">
            <stringProp name="Argument.name">THREAD_COUNT</stringProp>
            <stringProp name="Argument.value">500</stringProp>  <!-- 500 concurrent users -->
          </elementProp>
          <elementProp name="RAMP_UP_TIME" elementType="Argument">
            <stringProp name="Argument.name">RAMP_UP_TIME</stringProp>
            <stringProp name="Argument.value">300</stringProp>  <!-- 5 minutes ramp-up -->
          </elementProp>
          <elementProp name="DURATION" elementType="Argument">
            <stringProp name="Argument.name">DURATION</stringProp>
            <stringProp name="Argument.value">3600</stringProp>  <!-- 60 minutes -->
          </elementProp>
        </collectionProp>
      </elementProp>
    </TestPlan>

    <hashTree>
      <!-- Thread Group -->
      <ThreadGroup guiclass="ThreadGroupGui" testclass="ThreadGroup" testname="Ingestion Load">
        <stringProp name="ThreadGroup.num_threads">${THREAD_COUNT}</stringProp>
        <stringProp name="ThreadGroup.ramp_time">${RAMP_UP_TIME}</stringProp>
        <stringProp name="ThreadGroup.duration">${DURATION}</stringProp>
        <boolProp name="ThreadGroup.scheduler">true</boolProp>
      </ThreadGroup>

      <hashTree>
        <!-- HTTP Request: Ingest Event -->
        <HTTPSamplerProxy guiclass="HttpTestSampleGui" testclass="HTTPSamplerProxy" testname="POST Ingest Event">
          <stringProp name="HTTPSampler.protocol">https</stringProp>
          <stringProp name="HTTPSampler.domain">atp-gateway-staging.azurewebsites.net</stringProp>  <!-- host only; JMeter takes protocol and domain separately -->
          <stringProp name="HTTPSampler.path">/api/v1/audit/ingest</stringProp>
          <stringProp name="HTTPSampler.method">POST</stringProp>
          <boolProp name="HTTPSampler.use_keepalive">true</boolProp>
          <elementProp name="HTTPsampler.Arguments" elementType="Arguments">
            <collectionProp name="Arguments.arguments">
              <elementProp name="" elementType="HTTPArgument">
                <stringProp name="Argument.value">{
                  "eventType": "UserLogin",
                  "timestamp": "${__time(yyyy-MM-dd'T'HH:mm:ss'Z')}",
                  "userId": "load-test-user-${__Random(1,10000)}",
                  "ipAddress": "203.0.113.${__Random(1,255)}"
                }</stringProp>
                <stringProp name="Argument.metadata">=</stringProp>
              </elementProp>
            </collectionProp>
          </elementProp>
        </HTTPSamplerProxy>

        <!-- Response Assertion -->
        <ResponseAssertion guiclass="AssertionGui" testclass="ResponseAssertion" testname="Assert 201 Created">
          <stringProp name="Assertion.test_field">Assertion.response_code</stringProp>
          <stringProp name="Assertion.test_string">201</stringProp>
          <intProp name="Assertion.test_type">8</intProp>
        </ResponseAssertion>
      </hashTree>
    </hashTree>
  </hashTree>
</jmeterTestPlan>

Run Load Tests (Azure Pipelines):

- stage: Load_Tests_Staging
  dependsOn: CD_Staging
  jobs:
  - job: JMeter_Load_Test
    pool:
      vmImage: 'ubuntu-latest'
    steps:
    - task: Bash@3
      displayName: 'Install JMeter'
      inputs:
        targetType: 'inline'
        script: |
          wget https://archive.apache.org/dist/jmeter/binaries/apache-jmeter-5.6.2.tgz
          tar -xzf apache-jmeter-5.6.2.tgz

    - task: Bash@3
      displayName: 'Run Load Test (500 users, 60 minutes)'
      inputs:
        targetType: 'inline'
        script: |
          ./apache-jmeter-5.6.2/bin/jmeter \
            -n -t ATP-LoadTest.jmx \
            -l load-test-results.jtl \
            -e -o load-test-report/

    - task: PublishBuildArtifacts@1
      inputs:
        PathtoPublish: 'load-test-report/'
        ArtifactName: 'LoadTestReport'

    - task: Bash@3
      displayName: 'Validate Load Test Results'
      inputs:
        targetType: 'inline'
        script: |
          # Parse JMeter CSV results (default JTL layout: elapsed is field 2, success is field 8)
          AVG_RESPONSE_TIME=$(awk -F',' 'NR>1 {sum+=$2; count++} END {print sum/count}' load-test-results.jtl)
          ERROR_RATE=$(awk -F',' 'NR>1 {if ($8 != "true") errors++; total++} END {print (errors/total)*100}' load-test-results.jtl)

          echo "Average Response Time: ${AVG_RESPONSE_TIME}ms"
          echo "Error Rate: ${ERROR_RATE}%"

          # Validate against thresholds
          if (( $(echo "$AVG_RESPONSE_TIME > 1000" | bc -l) )); then
            echo "❌ Average response time exceeds 1000ms threshold"
            exit 1
          fi

          if (( $(echo "$ERROR_RATE > 1" | bc -l) )); then
            echo "❌ Error rate exceeds 1% threshold"
            exit 1
          fi

          echo "✅ Load test passed all thresholds"
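The validation step above gates only on mean response time and error rate, while the staging SLA also caps p95 latency at 1 s. A p95 gate can be added with a sort-based percentile over the same JTL file (a sketch assuming the default CSV layout with elapsed milliseconds in field 2; the tiny inline sample file stands in for real results):

```shell
# Sketch: p95 gate over a JMeter CSV results file (elapsed ms is field 2
# in the default JTL layout). A synthetic file stands in for real results.
printf 'timeStamp,elapsed\n1,120\n2,240\n3,310\n4,450\n5,980\n' > sample-results.jtl

P95=$(awk -F',' 'NR>1 {print $2}' sample-results.jtl | sort -n \
  | awk '{v[NR]=$1} END {idx=int(NR*0.95); if (idx<1) idx=1; print v[idx]}')

echo "P95 Response Time: ${P95}ms"

if [ "$P95" -gt 1000 ]; then
  echo "❌ P95 latency exceeds 1000ms threshold"
  exit 1
fi
echo "✅ P95 within the 1s threshold"
```

With the full 500-user run, the same two `awk` lines apply unchanged to `load-test-results.jtl`.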

Chaos Tests

Execution: Weekly (inject failures to validate resilience)

Chaos Mesh Configuration (Kubernetes):

# Network delay chaos experiment
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay-experiment
  namespace: atp-staging
spec:
  action: delay
  mode: one  # Affect one pod at a time
  selector:
    namespaces:
      - atp-staging
    labelSelectors:
      app: atp-ingestion
  delay:
    latency: "500ms"
    correlation: "50"  # correlation with the previous delay value (not a packet percentage)
    jitter: "100ms"
  duration: "5m"  # 5-minute experiment
  scheduler:
    cron: "@weekly"  # Run weekly
---
# Pod failure chaos experiment
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-failure-experiment
  namespace: atp-staging
spec:
  action: pod-failure
  mode: fixed-percent
  value: "10"  # Kill 10% of pods
  selector:
    namespaces:
      - atp-staging
    labelSelectors:
      app: atp-query
  duration: "2m"
  scheduler:
    cron: "@weekly"
---
# Storage I/O chaos experiment
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: storage-delay-experiment
  namespace: atp-staging
spec:
  action: latency
  mode: one
  selector:
    namespaces:
      - atp-staging
    labelSelectors:
      app: atp-ingestion
  volumePath: /data
  path: /data/**/*
  delay: "1s"  # 1-second I/O delay
  percent: 50  # 50% of I/O operations
  duration: "5m"
  scheduler:
    cron: "@weekly"

Chaos Test Validation:

#!/bin/bash
# run-chaos-tests-staging.sh

echo "Starting chaos engineering tests on Staging..."

# Install Chaos Mesh into the cluster (one-time setup; the script installs the operator)
curl -sSL https://mirrors.chaos-mesh.org/latest/install.sh | bash

# Apply chaos experiments
kubectl apply -f chaos-experiments/ -n atp-staging

echo "Chaos experiments running for 10 minutes..."
sleep 600

# Monitor metrics during chaos (query the Prometheus HTTP API, which returns JSON)
kubectl -n atp-staging port-forward deploy/prometheus-server 9090:9090 &
PF_PID=$!
sleep 5

ERROR_RATE=$(curl -s -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100' \
  | jq -r '.data.result[0].value[1]')

kill $PF_PID

echo "Error rate during chaos: ${ERROR_RATE}%"

# Validate resilience
if (( $(echo "$ERROR_RATE > 5" | bc -l) )); then
  echo "❌ Error rate too high during chaos; system not resilient"
  exit 1
fi

echo "✅ Chaos tests passed; system resilient to injected failures"

# Clean up chaos experiments
kubectl delete -f chaos-experiments/ -n atp-staging

Security Tests

Execution: Pre-production deployment (OWASP ZAP scan)

OWASP ZAP Scan (Azure Pipelines):

- stage: Security_Tests_Staging
  dependsOn: CD_Staging
  jobs:
  - job: OWASP_ZAP_Scan
    pool:
      vmImage: 'ubuntu-latest'
    steps:
    - task: Bash@3
      displayName: 'Run OWASP ZAP Baseline Scan'
      inputs:
        targetType: 'inline'
        script: |
          docker run --rm \
            -v $(pwd):/zap/wrk/:rw \
            -t ghcr.io/zaproxy/zaproxy:stable \
            zap-baseline.py \
            -t https://atp-gateway-staging.azurewebsites.net \
            -r zap-baseline-report.html \
            -w zap-baseline-report.md

    - task: PublishBuildArtifacts@1
      inputs:
        PathtoPublish: 'zap-baseline-report.html'
        ArtifactName: 'ZAPReport'

    - task: Bash@3
      displayName: 'Validate ZAP Scan Results'
      inputs:
        targetType: 'inline'
        script: |
          # Check for high-risk findings (grep -c already prints 0 on no match; `|| true` only guards the non-zero exit)
          HIGH_RISK=$(grep -c "High (High)" zap-baseline-report.md || true)

          if [ "$HIGH_RISK" -gt 0 ]; then
            echo "❌ High-risk vulnerabilities detected: $HIGH_RISK"
            exit 1
          fi

          echo "✅ OWASP ZAP scan passed"
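Medium-risk findings are usually surfaced as pipeline warnings rather than hard failures. A sketch of that extension, mirroring the high-risk check above (it assumes the markdown report labels risks as `Medium (Medium)`; the inline report stands in for a real scan):

```shell
# Count medium-risk ZAP findings and emit an Azure Pipelines warning
# without failing the gate (high-risk findings still block, as above).
printf 'WARN-NEW: X-Frame-Options [Medium (Medium)]\nPASS: CSP [Low (Low)]\n' > sample-zap-report.md

MEDIUM_RISK=$(grep -c "Medium (Medium)" sample-zap-report.md || true)

echo "Medium-risk findings: $MEDIUM_RISK"

if [ "$MEDIUM_RISK" -gt 0 ]; then
  # ##vso[...] is the Azure Pipelines logging-command syntax for warnings
  echo "##vso[task.logissue type=warning]ZAP reported $MEDIUM_RISK medium-risk finding(s)"
fi
```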

DR Drills

Execution: Monthly (validate failover procedures)

DR Drill Script (Staging):

#!/bin/bash
# dr-drill-staging.sh

echo "Starting monthly DR drill for Staging..."

START_TIME=$(date +%s)

# Step 1: Simulate primary (blue) failure by swapping the green slot into production
echo "Simulating blue slot failure; failing over to green..."

az webapp deployment slot swap \
  --name atp-gateway-staging-eus \
  --resource-group ConnectSoft-ATP-Staging-EUS-RG \
  --slot green \
  --target-slot production

# Step 2: Validate green slot serving traffic
sleep 30
HEALTH=$(curl -s https://atp-gateway-staging-eus.azurewebsites.net/health | jq -r '.status')

if [ "$HEALTH" != "Healthy" ]; then
  echo "❌ DR drill failed; green slot unhealthy"
  exit 1
fi

# Step 3: Run full regression suite
dotnet test ConnectSoft.ATP.Tests.Regression.sln \
  --configuration Release \
  --logger "trx;LogFileName=dr-drill-results.trx"

if [ $? -ne 0 ]; then
  echo "❌ Regression tests failed after failover"
  exit 1
fi

# Step 4: Calculate RTO
END_TIME=$(date +%s)
RTO_SECONDS=$((END_TIME - START_TIME))
RTO_MINUTES=$((RTO_SECONDS / 60))

echo "✅ DR drill complete"
echo "RTO Achieved: $RTO_MINUTES minutes (Target: 60 minutes)"

# Step 5: Document results
az boards work-item create \
  --title "DR Drill Results - Staging - $(date +%Y-%m-%d)" \
  --type "Task" \
  --description "DR drill completed successfully. RTO: $RTO_MINUTES minutes. All regression tests passed." \
  --assigned-to "platform-team@connectsoft.example" \
  --fields "Microsoft.VSTS.Common.Priority=3" "System.State=Closed"

# Optional: Swap back to blue (or leave green as production)
echo "Leaving green slot as production (validate for 24 hours)"

Production Environment Testing

Purpose: Continuous validation with synthetic monitors, canary deployments, and A/B testing for production confidence.

Synthetic Monitors

Execution: Every 5 minutes from multiple regions

Application Insights Availability Test:

// Create multi-region availability test
var availabilityTest = new WebTest("atp-availability-test-prod", new WebTestArgs
{
    WebTestName = "atp-ingestion-availability",
    ResourceGroupName = sharedResourceGroup.Name,
    Location = "eastus",
    Kind = "ping",

    SyntheticMonitorId = "atp-ingestion-prod-monitor",

    Enabled = true,
    Frequency = 300,  // 5 minutes
    Timeout = 30,
    RetryEnabled = true,

    Locations = new[]
    {
        new WebTestGeolocationArgs { Location = "us-va-ash-azr" },  // East US
        new WebTestGeolocationArgs { Location = "emea-nl-ams-azr" },  // West Europe
        new WebTestGeolocationArgs { Location = "apac-sg-sin-azr" },  // Southeast Asia
        new WebTestGeolocationArgs { Location = "us-ca-sjc-azr" },  // West US
        new WebTestGeolocationArgs { Location = "emea-gb-db3-azr" }   // North Europe
    },

    Configuration = new WebTestPropertiesConfigurationArgs
    {
        WebTest = @"
            <WebTest Name='ATP Ingestion Health Check' Enabled='True'>
              <Items>
                <Request Method='GET' Version='1.1' Url='https://api.atp.connectsoft.com/health' ThinkTime='0'>
                  <ValidationRules>
                    <ValidationRule Classname='Microsoft.VisualStudio.TestTools.WebTesting.Rules.ValidationRuleResponseCode' />
                    <ValidationRule Classname='Microsoft.VisualStudio.TestTools.WebTesting.Rules.ValidationRuleExpectedText'>
                      <RuleParameters>
                        <RuleParameter Name='ExpectedText' Value='Healthy' />
                      </RuleParameters>
                    </ValidationRule>
                  </ValidationRules>
                </Request>
              </Items>
            </WebTest>
        "
    },

    Tags = prodTags
});

// Alert when availability drops below 99%
var availabilityAlert = new MetricAlert("atp-availability-alert-prod", new MetricAlertArgs
{
    AlertRuleName = "atp-availability-prod",
    ResourceGroupName = sharedResourceGroup.Name,
    Location = "global",
    Description = "Alert when ATP availability drops below 99%",
    Severity = 1,
    Enabled = true,
    Scopes = new[] { availabilityTest.Id },
    EvaluationFrequency = "PT5M",
    WindowSize = "PT15M",
    Criteria = new MetricAlertMultipleResourceMultipleMetricCriteriaArgs
    {
        OdataType = "Microsoft.Azure.Monitor.MultipleResourceMultipleMetricCriteria",
        AllOf = new[]
        {
            new MetricCriteriaArgs
            {
                Name = "AvailabilityPercentage",
                MetricName = "availabilityResults/availabilityPercentage",
                Operator = "LessThan",
                Threshold = 99,
                TimeAggregation = "Average"
            }
        }
    }
});

Custom Synthetic Monitor (Azure Function):

// Advanced synthetic monitor with full workflow validation
[FunctionName("SyntheticMonitor")]
public async Task RunAsync(
    [TimerTrigger("0 */5 * * * *")] TimerInfo timer,  // Every 5 minutes
    ILogger log)
{
    log.LogInformation("Starting synthetic monitor workflow...");

    var startTime = DateTime.UtcNow;
    var httpClient = new HttpClient
    {
        BaseAddress = new Uri("https://api.atp.connectsoft.com")
    };

    try
    {
        // Step 1: Health check
        var healthResponse = await httpClient.GetAsync("/health");
        healthResponse.EnsureSuccessStatusCode();

        // Step 2: Ingest synthetic event
        var auditEvent = new
        {
            eventType = "SyntheticMonitor",
            timestamp = DateTime.UtcNow,
            userId = "synthetic-monitor",
            metadata = new { source = "synthetic-monitor", region = Environment.GetEnvironmentVariable("REGION") }
        };

        httpClient.DefaultRequestHeaders.Add("X-Tenant-Id", "synthetic-tenant-001");
        var ingestResponse = await httpClient.PostAsJsonAsync("/api/v1/audit/ingest", auditEvent);
        ingestResponse.EnsureSuccessStatusCode();

        var ingestResult = await ingestResponse.Content.ReadFromJsonAsync<IngestResponse>();

        // Step 3: Query event (validate query service)
        await Task.Delay(TimeSpan.FromSeconds(5));  // Allow indexing

        var queryResponse = await httpClient.GetAsync($"/api/v1/audit/events/{ingestResult.EventId}");
        queryResponse.EnsureSuccessStatusCode();

        // Step 4: Validate tamper evidence (throw instead of using test-framework asserts in a Function)
        var queriedEvent = await queryResponse.Content.ReadFromJsonAsync<AuditEvent>();
        if (queriedEvent?.TamperEvidence is null)
            throw new InvalidOperationException("Queried event is missing tamper evidence");

        var elapsed = DateTime.UtcNow - startTime;

        // Track success metrics
        _telemetry.TrackMetric("SyntheticMonitor.Duration", elapsed.TotalMilliseconds);
        _telemetry.TrackEvent("SyntheticMonitor.Success", new Dictionary<string, string>
        {
            ["Region"] = Environment.GetEnvironmentVariable("REGION"),
            ["Timestamp"] = DateTime.UtcNow.ToString("o")
        });

        log.LogInformation($"✅ Synthetic monitor passed in {elapsed.TotalMilliseconds}ms");
    }
    catch (Exception ex)
    {
        log.LogError(ex, "❌ Synthetic monitor failed");

        // Track failure
        _telemetry.TrackException(ex);
        _telemetry.TrackEvent("SyntheticMonitor.Failure", new Dictionary<string, string>
        {
            ["Region"] = Environment.GetEnvironmentVariable("REGION"),
            ["ErrorMessage"] = ex.Message
        });

        throw;
    }
}

Canary Tests

Execution: Production deployments (10% rollout with 24-hour validation)

Canary Deployment Strategy (Azure Pipelines):

- stage: CD_Production_Canary
  displayName: 'Deploy to Production (Canary)'
  dependsOn: CD_Staging
  jobs:
  - deployment: CanaryDeployment
    environment: ATP-Production
    strategy:
      canary:
        increments: [10, 25, 50]  # 10% → 25% → 50% → 100%
        preDeploy:
          steps:
          - script: echo "Validating staging has been stable for 48 hours..."
          - task: PowerShell@2
            inputs:
              targetType: 'inline'
              script: |
                # Check for active incidents created in the last 48 hours
                # (PowerShell continues lines with backticks, not backslashes)
                $incidents = az boards query `
                  --wiql "SELECT [System.Id] FROM WorkItems WHERE [System.WorkItemType] = 'Incident' AND [System.State] = 'Active' AND [System.CreatedDate] > @Today - 2" `
                  --output json | ConvertFrom-Json

                if ($incidents.Count -gt 0) {
                  Write-Error "Active incidents detected; aborting canary deployment"
                  exit 1
                }

        deploy:
          steps:
          - template: deploy/deploy-microservice-to-azure-web-site.yaml@templates
            parameters:
              azureSubscription: $(azureSubscription)
              appName: atp-ingestion-prod
              package: $(Pipeline.Workspace)/drop/*.zip
              trafficPercentage: $(strategy.increment)  # Canary percentage

        routeTraffic:
          steps:
          - script: echo "Routing $(strategy.increment)% traffic to canary version"

        postRouteTraffic:
          steps:
          # Monitor for 24 hours at each increment
          - task: PowerShell@2
            displayName: 'Monitor Canary Metrics (24 hours)'
            inputs:
              targetType: 'inline'
              script: |
                $monitorDuration = if ($(strategy.increment) -eq 10) { 1440 } else { 720 }  # 24h for 10%, 12h for others
                Write-Host "Monitoring canary for $monitorDuration minutes..."

                Start-Sleep -Seconds ($monitorDuration * 60)

                # Query Application Insights for canary metrics
                $errorRate = az monitor app-insights metrics show `
                  --app atp-appinsights-prod-eus `
                  --metric "requests/failed" `
                  --aggregation avg `
                  --offset ${monitorDuration}m `
                  --query "value.segments[0]['requests/failed'].avg" -o tsv

                $p95Latency = az monitor app-insights metrics show `
                  --app atp-appinsights-prod-eus `
                  --metric "requests/duration" `
                  --aggregation percentile `
                  --interval PT1H `
                  --offset ${monitorDuration}m `
                  --query "value.segments[0]['requests/duration'].percentiles.95" -o tsv

                Write-Host "Error Rate: $errorRate%"
                Write-Host "P95 Latency: ${p95Latency}ms"

                # Validate against thresholds
                if ($errorRate -gt 1) {  # error rate above 1%
                  Write-Error "Error rate too high: $errorRate%"
                  exit 1
                }

                if ($p95Latency -gt 2000) {  # >2s p95 latency
                  Write-Error "Latency too high: ${p95Latency}ms"
                  exit 1
                }

                Write-Host "✅ Canary metrics healthy at $(strategy.increment)%"

        on:
          failure:
            steps:
            - script: echo "🔴 Canary deployment failed; rolling back..."
            - task: AzureAppServiceManage@0
              inputs:
                azureSubscription: $(azureSubscription)
                action: 'Swap Slots'
                webAppName: 'atp-ingestion-prod'
                sourceSlot: 'canary'
                swapWithProduction: true

            - task: PowerShell@2
              inputs:
                targetType: 'inline'
                script: |
                  # Create incident for failed canary (backtick continuations inside PowerShell)
                  az boards work-item create `
                    --title "Canary Deployment Failed - $(Build.BuildNumber)" `
                    --type "Incident" `
                    --description "Canary deployment rolled back due to failed metrics. Build: $(Build.BuildNumber)" `
                    --assigned-to "platform-team@connectsoft.example" `
                    --fields "Microsoft.VSTS.Common.Priority=1"

A/B Tests

Execution: Feature flag validation with statistical analysis

A/B Test Configuration:

// A/B test: New export format (JSON vs Parquet)
public class ExportFormatABTest
{
    private readonly IFeatureManager _featureManager;
    private readonly TelemetryClient _telemetry;

    public async Task<ExportResult> ExportAuditEventsAsync(ExportRequest request)
    {
        var startTime = DateTime.UtcNow;

        // Determine variant based on feature flag (50/50 split)
        var useParquet = await _featureManager.IsEnabledAsync("ExportFormat_Parquet");

        ExportResult result;

        if (useParquet)
        {
            // Variant A: Parquet format
            result = await ExportAsParquetAsync(request);

            _telemetry.TrackEvent("ABTest.ExportFormat", new Dictionary<string, string>
            {
                ["Variant"] = "Parquet",
                ["TenantId"] = request.TenantId,
                ["EventCount"] = request.EventCount.ToString(),
                ["Duration"] = (DateTime.UtcNow - startTime).TotalMilliseconds.ToString(),
                ["FileSize"] = result.FileSizeBytes.ToString()
            });
        }
        else
        {
            // Variant B: JSON format (control)
            result = await ExportAsJsonAsync(request);

            _telemetry.TrackEvent("ABTest.ExportFormat", new Dictionary<string, string>
            {
                ["Variant"] = "JSON",
                ["TenantId"] = request.TenantId,
                ["EventCount"] = request.EventCount.ToString(),
                ["Duration"] = (DateTime.UtcNow - startTime).TotalMilliseconds.ToString(),
                ["FileSize"] = result.FileSizeBytes.ToString()
            });
        }

        return result;
    }
}

A/B Test Analysis (Application Insights):

// Compare A/B test variants
customEvents
| where name == "ABTest.ExportFormat"
| where timestamp > ago(7d)
| extend Variant = tostring(customDimensions.Variant)
| extend EventCount = toint(customDimensions.EventCount)
| extend Duration = todouble(customDimensions.Duration)
| extend FileSize = todouble(customDimensions.FileSize)
| summarize 
    TotalExports = count(),
    AvgDuration = avg(Duration),
    P50Duration = percentile(Duration, 50),
    P95Duration = percentile(Duration, 95),
    P99Duration = percentile(Duration, 99),
    AvgFileSize = avg(FileSize),
    AvgBytesPerEvent = avg(FileSize) / avg(EventCount)  // bytes per exported event
  by Variant
| extend Winner = case(
    Variant == "Parquet" and P95Duration < 1000 and AvgFileSize < 50000000, "Parquet",
    Variant == "JSON" and P95Duration < 1000, "JSON",
    "Inconclusive"
  )
| project Variant, TotalExports, AvgDuration, P95Duration, AvgFileSize, Winner

A/B Test Statistical Significance:

// Calculate statistical significance of A/B test
public class ABTestAnalyzer
{
    public ABTestResult AnalyzeTest(List<ABTestSample> variantA, List<ABTestSample> variantB)
    {
        // Calculate means
        var meanA = variantA.Average(s => s.Duration);
        var meanB = variantB.Average(s => s.Duration);

        // Calculate sample variances (n-1 denominator)
        var varianceA = variantA.Sum(s => Math.Pow(s.Duration - meanA, 2)) / (variantA.Count - 1);
        var varianceB = variantB.Sum(s => Math.Pow(s.Duration - meanB, 2)) / (variantB.Count - 1);

        // Welch's t-test (independent samples, unequal variances)
        var tStatistic = (meanA - meanB) / Math.Sqrt(varianceA / variantA.Count + varianceB / variantB.Count);
        var degreesOfFreedom = variantA.Count + variantB.Count - 2;  // simplified; Welch–Satterthwaite is more precise

        // P-value (simplified; use proper statistical library in production)
        var pValue = CalculatePValue(tStatistic, degreesOfFreedom);

        return new ABTestResult
        {
            VariantAMean = meanA,
            VariantBMean = meanB,
            DifferencePercent = ((meanB - meanA) / meanA) * 100,
            PValue = pValue,
            IsSignificant = pValue < 0.05,  // 95% confidence
            Recommendation = pValue < 0.05 
                ? (meanA < meanB ? "Variant A is significantly faster" : "Variant B is significantly faster")
                : "No significant difference; choose based on other criteria"
        };
    }
}
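`CalculatePValue` is stubbed out above. For the large sample counts an A/B test accumulates, a two-sided p-value can be approximated from the standard normal CDF; the sketch below uses the Abramowitz–Stegun polynomial approximation as a quick shell/awk pipeline check (a proper t-distribution from a statistics library is still preferable for small samples):

```shell
# Two-sided p-value from a t statistic via the normal approximation
# (adequate at large degrees of freedom; use a real t-distribution otherwise).
t_statistic=1.96

p_value=$(LC_ALL=C awk -v t="$t_statistic" 'BEGIN {
  x = (t < 0) ? -t : t                       # |t|
  k = 1 / (1 + 0.2316419 * x)
  pdf = exp(-x * x / 2) / sqrt(2 * 3.14159265358979)
  poly = k * (0.319381530 + k * (-0.356563782 + k * (1.781477937 + k * (-1.821255978 + k * 1.330274429))))
  cdf = 1 - pdf * poly                       # standard normal CDF at |t|
  printf "%.4f", 2 * (1 - cdf)               # two-sided p-value
}')

echo "p-value for t=${t_statistic}: $p_value"
```

For t = 1.96 this lands at roughly 0.05, the familiar 95%-confidence boundary used by the analyzer above.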

Test Automation Infrastructure

Test Execution Summary (By Environment):

| Environment | Test Type | Frequency | Duration | Pass Threshold | Retry Policy |
|---|---|---|---|---|---|
| Dev | Unit | Every commit | < 2 min | 100% | No retry |
| Dev | Integration | Every commit | < 5 min | 100% | No retry |
| Test | Smoke | Post-deployment | < 2 min | 100% | 3 retries |
| Test | Regression | Nightly | < 30 min | 100% | No retry (investigate failures) |
| Test | API Contract | Every deployment | < 5 min | 100% (no breaking changes) | No retry |
| Staging | Load | Pre-production | 60 min | <1% error, <1s p95 latency | No retry |
| Staging | Chaos | Weekly | 10 min | <5% error during failures | No retry |
| Staging | Security | Pre-production | 20 min | Zero high-risk findings | No retry |
| Staging | DR Drill | Monthly | < 60 min | 100% pass + RTO met | No retry |
| Production | Synthetic | Every 5 min | < 30 sec | 99% availability | 2 retries |
| Production | Canary | Per deployment | 24-72 hours | <0.5% error, <1.5x baseline latency | Auto-rollback on failure |
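The retry policies in the table can be implemented with a small wrapper around the test command (a sketch; something like `retry 3 ./run-smoke-tests.sh` would give smoke tests their three attempts — the script name is illustrative):

```shell
# Retry wrapper: rerun a command up to N times before failing the stage.
retry() {
  local attempts=$1; shift
  local n=1
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "❌ Failed after $attempts attempt(s)"
      return 1
    fi
    echo "Attempt $n failed; retrying..."
    n=$((n + 1))
  done
  echo "✅ Passed on attempt $n"
}

# Demo: `true` passes immediately; `false` exhausts its retries.
retry 3 true
retry 2 false || echo "gate fails after exhausting retries"
```

"No retry" suites simply call the command directly, so a single failure fails the stage and triggers investigation.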

Test Data Management

Dev/Test (Synthetic Data):

// Synthetic test data generator
public class SyntheticDataGenerator
{
    private readonly Faker<AuditEvent> _auditEventFaker;

    public SyntheticDataGenerator()
    {
        _auditEventFaker = new Faker<AuditEvent>()
            .RuleFor(e => e.TenantId, f => $"tenant-{f.Random.Number(1, 100)}")
            .RuleFor(e => e.EventType, f => f.PickRandom("UserLogin", "UserLogout", "DocumentAccess", "SettingChanged"))
            .RuleFor(e => e.Timestamp, f => f.Date.Recent(7))
            .RuleFor(e => e.UserId, f => $"user-{f.Random.Guid()}")
            .RuleFor(e => e.IPAddress, f => f.Internet.Ip())
            .RuleFor(e => e.Payload, f => JsonSerializer.Serialize(new
            {
                action = f.PickRandom("read", "write", "delete"),
                resourceId = f.Random.Guid(),
                success = f.Random.Bool(0.95f)  // 95% success rate
            }));
    }

    public List<AuditEvent> Generate(int count) => _auditEventFaker.Generate(count);
}

// Usage in tests
[Fact]
public async Task LoadTest_1000Events_CompletesWithinSLA()
{
    // Arrange
    var generator = new SyntheticDataGenerator();
    var events = generator.Generate(1000);

    // Act
    var stopwatch = Stopwatch.StartNew();

    foreach (var evt in events)
    {
        await _ingestionService.IngestEventAsync(evt);
    }

    stopwatch.Stop();

    // Assert
    Assert.True(stopwatch.Elapsed < TimeSpan.FromMinutes(5), "Load test exceeded 5-minute SLA");
}

Staging (Anonymized Production Data):

# Anonymize production snapshot for staging
./anonymize-production-data.sh \
  --source atp-sql-prod-eus \
  --destination atp-sql-staging-eus \
  --pii-fields "UserId,IPAddress,ContactEmail" \
  --hash-seed "staging-anonymization-2025"
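Under the hood, the anonymizer's core transform is a salted deterministic hash: the same production value plus the same seed always yields the same pseudonym, which preserves referential integrity across tables. A minimal sketch (the `anonymize` helper and 16-character truncation are illustrative, not the real script's format):

```shell
# Deterministic pseudonymization: identical input + seed → identical output,
# so foreign-key relationships survive anonymization.
HASH_SEED="staging-anonymization-2025"

anonymize() {
  printf '%s' "${HASH_SEED}:$1" | sha256sum | cut -c1-16
}

a=$(anonymize "user-1234")
b=$(anonymize "user-1234")
c=$(anonymize "user-5678")

echo "user-1234 -> $a"
[ "$a" = "$b" ] && echo "stable: repeated values map to the same pseudonym"
[ "$a" != "$c" ] && echo "distinct: different values stay distinguishable"
```

Because the mapping is keyed by the seed, rotating the seed re-pseudonymizes the whole staging dataset without touching production.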

Summary

  • Dev Environment Tests: Unit tests (every commit, <2min, 100% pass), integration tests (service containers, <5min), local E2E (Postman, developer-run).
  • Test Environment Tests: Smoke tests (post-deployment, critical paths), regression suite (nightly, 500+ tests, <30min), API contract tests (OpenAPI validation, breaking change detection).
  • Staging Environment Tests: Load tests (JMeter, 500 users, 60min, 50% prod scale), chaos tests (Chaos Mesh, weekly, network/pod/storage failures), security tests (OWASP ZAP, pre-production), DR drills (monthly, blue-green failover, <60min RTO).
  • Production Environment Tests: Synthetic monitors (multi-region, every 5min, full workflow validation), canary tests (10%→25%→50%→100%, 24-72h validation), A/B tests (feature flags, statistical analysis).
  • Test Pyramid: Unit (1000+ tests) → Integration (100+ tests) → E2E (10+ flows) → Load/Chaos → Synthetic monitoring.
  • Pass Thresholds: 100% (unit/integration/regression), <1% error rate (load), <5% error during chaos, 99% availability (synthetic).
  • Automation: Pre-push hooks (unit tests), CI/CD pipelines (all tests), nightly schedules (regression), weekly schedules (chaos), continuous (synthetic).

Governance & Continuous Improvement

ATP implements structured governance and continuous improvement frameworks to ensure environments remain cost-efficient, secure, and aligned with business objectives. Regular reviews (monthly cost, quarterly security, annual refresh) combined with a formal Change Advisory Board ensure controlled evolution of the platform.

This approach balances operational stability (CAB-approved production changes) with innovation (roadmap-driven improvements) while maintaining accountability through documented reviews and measurable improvement metrics.

Environment Review Cadence

ATP conducts tiered reviews at monthly, quarterly, and annual intervals to maintain environment health and optimize operations.

Monthly Review: Cost & Utilization

Purpose: Optimize spending and identify underutilized resources for rightsizing or decommissioning.

Attendees: Platform Team Lead, Finance Representative, Engineering Manager

Agenda:

monthlyReview:
  agenda:
    1. Cost Analysis (30 minutes)
       - Actual vs. budgeted spend per environment
       - Cost anomalies and trend analysis
       - Reserved instance utilization
       - Storage lifecycle effectiveness

    2. Resource Utilization (20 minutes)
       - Compute utilization (CPU, memory)
       - Database DTU/RU consumption
       - Storage growth trends
       - Network bandwidth utilization

    3. Scaling Policy Review (15 minutes)
       - Autoscaling trigger effectiveness
       - Over-provisioned resources
       - Under-provisioned resources

    4. Action Items (15 minutes)
       - Rightsizing recommendations
       - Cost optimization opportunities
       - Resource decommissioning

  duration: 80 minutes
  artifacts:
    - Monthly cost report (generated by automation)
    - Resource utilization dashboard
    - Action items list (Azure DevOps work items)

Monthly Cost Review Report (Automated):

// Generate monthly cost and utilization report
[FunctionName("MonthlyEnvironmentReview")]
public async Task RunAsync(
    [TimerTrigger("0 0 9 1 * *")] TimerInfo timer,  // 1st of month at 9 AM
    ILogger log)
{
    log.LogInformation("Generating monthly environment review report...");

    var reportDate = DateTime.UtcNow.AddMonths(-1);  // Previous month
    var reportMonth = reportDate.ToString("MMMM yyyy");

    var report = new MonthlyEnvironmentReport
    {
        Month = reportMonth,
        GeneratedAt = DateTime.UtcNow
    };

    // Cost Analysis
    report.CostAnalysis = await GenerateCostAnalysisAsync(reportDate);

    // Resource Utilization
    report.Utilization = new UtilizationReport
    {
        Dev = await AnalyzeEnvironmentUtilizationAsync("dev"),
        Test = await AnalyzeEnvironmentUtilizationAsync("test"),
        Staging = await AnalyzeEnvironmentUtilizationAsync("staging"),
        Production = await AnalyzeEnvironmentUtilizationAsync("prod")
    };

    // Rightsizing Recommendations
    report.Recommendations = await GenerateRightsizingRecommendationsAsync();

    // Generate PDF report
    var pdf = await GeneratePdfAsync(report);

    // Upload to Blob Storage
    await UploadReportAsync($"monthly-reviews/{reportMonth}/Environment-Review-{reportMonth}.pdf", pdf);

    // Send to stakeholders
    await SendReportAsync(pdf, new[]
    {
        "platform-team@connectsoft.example",
        "finance@connectsoft.example",
        "engineering-manager@connectsoft.example"
    });

    log.LogInformation($"✅ Monthly review report generated and sent for {reportMonth}");
}

private async Task<EnvironmentUtilization> AnalyzeEnvironmentUtilizationAsync(string environment)
{
    var metrics = await QueryAzureMonitorAsync(environment);

    return new EnvironmentUtilization
    {
        Environment = environment,

        // Compute utilization
        AvgCpuUtilization = metrics.AvgCpu,
        AvgMemoryUtilization = metrics.AvgMemory,
        PeakCpuUtilization = metrics.PeakCpu,

        // Database utilization
        AvgDtuUtilization = metrics.AvgDtu,
        AvgStorageUtilization = metrics.AvgStorage,

        // Recommendations
        ComputeRecommendation = metrics.AvgCpu < 30 
            ? "Consider downsizing SKU (avg CPU < 30%)" 
            : metrics.AvgCpu > 80 
                ? "Consider upsizing SKU (avg CPU > 80%)" 
                : "Optimal sizing",

        EstimatedMonthlySavings = metrics.AvgCpu < 30 
            ? CalculatePotentialSavings(environment, "downsize") 
            : 0
    };
}

Cost Review KQL Query:

// Month-over-month cost comparison (pivot emits columns named by month values,
// so use prev() over a sorted, serialized row set instead)
AzureCostManagement
| where TimeGenerated >= startofmonth(ago(60d))
| extend Environment = tostring(Tags["Environment"])
| extend Month = startofmonth(TimeGenerated)
| summarize MonthlyCost = sum(Cost) by Environment, Month
| order by Environment asc, Month asc
| extend PreviousMonthCost = iff(prev(Environment) == Environment, prev(MonthlyCost), real(null))
| extend MoM_Change = (MonthlyCost - PreviousMonthCost) / PreviousMonthCost * 100  // month-over-month change %
| where isnotnull(MoM_Change)
| project Environment, PreviousMonthCost, CurrentMonthCost = MonthlyCost, MoM_Change
| order by MoM_Change desc

Quarterly Review: Security & Compliance

Purpose: Validate security posture, audit access controls, and test disaster recovery procedures.

Attendees: CISO, Platform Team Lead, Security Engineer, Compliance Officer

Agenda:

quarterlyReview:
  agenda:
    1. Security Posture Assessment (45 minutes)
       - Vulnerability scan results (open findings)
       - Penetration test findings (remediation status)
       - Azure Defender recommendations
       - Security incident retrospective

    2. Access Control Audit (30 minutes)
       - Access review findings (stale permissions)
       - PIM usage analytics (elevation frequency)
       - Service principal inventory
       - Break-glass account validation

    3. Compliance Status (30 minutes)
       - SOC 2 control effectiveness
       - GDPR compliance scorecard
       - HIPAA safeguard validation
       - Policy compliance state

    4. DR Drill Review (30 minutes)
       - DR test results (last quarter)
       - RTO/RPO achievement rate
       - Failover success rate
       - Lessons learned

    5. Action Items (15 minutes)
       - Security remediation tasks
       - Access revocation list
       - DR procedure improvements

  duration: 150 minutes (2.5 hours)
  artifacts:
    - Security posture report
    - Access audit findings
    - Compliance scorecard
    - DR test summary
    - Action items (Azure DevOps backlog)

Quarterly Security Review Report:

// Generate quarterly security posture report
[FunctionName("QuarterlySecurityReview")]
public async Task RunAsync(
    [TimerTrigger("0 0 9 1 */3 *")] TimerInfo timer,  // 9:00 AM on the 1st of Jan/Apr/Jul/Oct (quarterly)
    ILogger log)
{
    log.LogInformation("Generating quarterly security review report...");

    // Report on the quarter containing the previous month; take the year from the
    // same shifted date so a January run labels Q4 of the prior year correctly
    var reportDate = DateTime.UtcNow.AddMonths(-1);
    var quarter = $"Q{(reportDate.Month - 1) / 3 + 1}-{reportDate.Year}";

    var report = new QuarterlySecurityReport
    {
        Quarter = quarter,
        GeneratedAt = DateTime.UtcNow,
        Scope = "All ATP Environments"
    };

    // Security Findings Summary
    report.VulnerabilityFindings = await GetVulnerabilitySummaryAsync();
    report.PentestFindings = await GetPentestFindingsAsync(quarter);
    report.DefenderRecommendations = await GetDefenderRecommendationsAsync();

    // Access Control Audit
    report.AccessAudit = new AccessAuditReport
    {
        TotalUsers = await CountActiveUsersAsync(),
        StalePermissions = await IdentifyStalePermissionsAsync(),
        PimElevations = await GetPimElevationStatsAsync(quarter),
        ServicePrincipals = await AuditServicePrincipalsAsync()
    };

    // Compliance Status (evaluate each framework once and reuse the scores)
    var soc2 = await EvaluateSOC2ComplianceAsync();
    var gdpr = await EvaluateGDPRComplianceAsync();
    var hipaa = await EvaluateHIPAAComplianceAsync();

    report.ComplianceStatus = new ComplianceStatusReport
    {
        SOC2 = soc2,
        GDPR = gdpr,
        HIPAA = hipaa,
        OverallScore = (soc2 + gdpr + hipaa) / 3
    };

    // DR Test Results
    report.DRTestResults = await GetDRTestResultsAsync(quarter);

    // Generate PDF
    var pdf = await GeneratePdfAsync(report);

    // Upload and distribute
    await UploadReportAsync($"quarterly-security-reviews/{quarter}/Security-Review-{quarter}.pdf", pdf);
    await SendReportAsync(pdf, new[]
    {
        "ciso@connectsoft.example",
        "platform-team@connectsoft.example",
        "compliance@connectsoft.example"
    });

    log.LogInformation($"✅ Quarterly security review report generated for {quarter}");
}
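The quarter label is derived from the previous month (a run on the 1st reports the quarter that just ended), and the year must be read from that same shifted date so a January run labels Q4 of the prior year. A small Python sketch of the intended derivation:

```python
from datetime import datetime, timedelta

def quarter_label(now: datetime) -> str:
    """Label for the quarter containing the previous month,
    e.g. a run on 2025-01-01 reports Q4-2024."""
    # Step back into the previous month, then read quarter and year from it
    prev = now.replace(day=1) - timedelta(days=1)
    quarter = (prev.month - 1) // 3 + 1
    return f"Q{quarter}-{prev.year}"

quarter_label(datetime(2025, 1, 1))   # "Q4-2024"
quarter_label(datetime(2025, 4, 1))   # "Q1-2025"
```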

Access Audit Query (Stale Permissions):

// Identify users with stale access (last successful activity more than 90 days ago).
// Look back 180 days so users last active 90-180 days ago are captured; a 90-day
// lookback would only ever see activity newer than the threshold.
AzureActivity
| where TimeGenerated > ago(180d)
| where ActivityStatusValue == "Success"
| extend User = Caller
| summarize LastActivity = max(TimeGenerated) by User
| join kind=leftouter (
    IdentityInfo
    | where TimeGenerated > ago(1d)
    | distinct AccountUPN, AccountObjectId
) on $left.User == $right.AccountUPN
| extend DaysSinceLastActivity = datetime_diff('day', now(), LastActivity)
| where DaysSinceLastActivity > 90
| project User, LastActivity, DaysSinceLastActivity, Recommendation = "Revoke Access"
| order by DaysSinceLastActivity desc
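The staleness rule (last successful activity more than 90 days before now) reduces to a date comparison; a Python sketch with hypothetical users, mirroring the `DaysSinceLastActivity > 90` filter above:

```python
from datetime import datetime, timedelta

STALE_AFTER_DAYS = 90

def stale_users(last_activity: dict, now: datetime) -> list:
    """Users whose most recent successful activity is older than the threshold."""
    cutoff = now - timedelta(days=STALE_AFTER_DAYS)
    return sorted(u for u, ts in last_activity.items() if ts < cutoff)

now = datetime(2025, 6, 1)
activity = {
    "alice@connectsoft.example": datetime(2025, 5, 20),  # recently active
    "bob@connectsoft.example": datetime(2025, 1, 15),    # stale -> revoke
}
stale_users(activity, now)  # ["bob@connectsoft.example"]
```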

Annual Review: Full Environment Refresh

Purpose: Comprehensive refresh of infrastructure, cost optimization, and technology stack updates.

Attendees: CTO, Platform Team, Security Team, Finance, Product Management

Agenda:

annualReview:
  agenda:
    1. Infrastructure Assessment (60 minutes)
       - IaC codebase review (Pulumi/Bicep)
       - Azure service updates (new capabilities)
       - SKU tier optimization
       - Region expansion evaluation

    2. Cost Optimization Deep Dive (60 minutes)
       - Annual spend analysis (trend vs. forecast)
       - Reserved instance renewal strategy
       - Commitment-based discounts (3-year options)
       - Alternative service evaluation (cost reduction)

    3. Security & Compliance Evolution (45 minutes)
       - New regulatory requirements (upcoming)
       - Security technology upgrades (TLS 1.3, PQC)
       - Zero-trust maturity assessment
       - Threat landscape changes

    4. Technology Roadmap Review (45 minutes)
       - Platform modernization opportunities
       - Multi-cloud strategy evaluation
       - Kubernetes version upgrades
       - Service mesh evaluation (Istio, Linkerd)

    5. Strategic Initiatives (30 minutes)
       - Q1-Q4 roadmap alignment
       - Budget allocation for improvements
       - Team capacity planning
       - Vendor evaluations

  duration: 240 minutes (4 hours, typically full-day workshop)
  artifacts:
    - Annual infrastructure report
    - Cost optimization plan
    - Security roadmap
    - Strategic initiatives backlog
    - Budget proposal for next fiscal year

Annual IaC Refresh Procedure:

#!/bin/bash
# annual-iac-refresh.sh

echo "Starting annual IaC refresh for all environments..."

ENVIRONMENTS=("dev" "test" "staging" "prod")

for ENV in "${ENVIRONMENTS[@]}"; do
  echo "Refreshing $ENV environment IaC..."

  # Step 1: Update Pulumi dependencies
  # (omitting --version makes "dotnet add package" resolve the latest stable
  # release; "--version latest" is not a valid value)
  cd infrastructure/
  dotnet add package Pulumi.AzureNative
  dotnet add package Pulumi.Azure

  # Step 2: Validate against latest Azure Policy
  pulumi preview --stack atp-$ENV-eus --diff

  # Step 3: Generate cost estimate for refresh
  pulumi preview --stack atp-$ENV-eus --json > preview-$ENV.json

  # Step 4: Create PR for review
  git checkout -b "annual-refresh-$ENV-2025"
  git add .  # the script is already inside infrastructure/ at this point
  git commit -m "chore: Annual IaC refresh for $ENV environment"
  git push origin "annual-refresh-$ENV-2025"

  # Create PR
  az repos pr create \
    --title "Annual IaC Refresh - $ENV Environment" \
    --description "$(printf 'Annual infrastructure refresh with latest Pulumi packages and Azure best practices.\n\nCost Impact: See preview-%s.json' "$ENV")" \
    --source-branch "annual-refresh-$ENV-2025" \
    --target-branch "main" \
    --reviewers "platform-team@connectsoft.example"

  cd ..
done

echo "✅ Annual IaC refresh PRs created for all environments"

Change Advisory Board (CAB)

ATP's Change Advisory Board provides governance oversight for production changes, ensuring risk mitigation, communication planning, and rollback readiness.

CAB Composition

cabMembers:
  core:
    - role: Chair
      title: Lead Architect
      responsibilities:
        - Facilitate CAB meetings
        - Final approval authority
        - Escalation point for conflicts

    - role: Technical Reviewer
      title: SRE Lead
      responsibilities:
        - Assess technical risk
        - Validate rollback procedures
        - Review deployment strategy

    - role: Security Reviewer
      title: Security Officer
      responsibilities:
        - Security impact assessment
        - Compliance validation
        - Vulnerability review

    - role: Product Representative
      title: Product Owner
      responsibilities:
        - Business impact assessment
        - Stakeholder communication
        - Change prioritization

  optional:
    - Database Administrator (for schema changes)
    - Network Engineer (for network changes)
    - Compliance Officer (for regulatory changes)
    - Customer Success (for tenant-impacting changes)

CAB Meeting Cadence:

cabSchedule:
  regular:
    frequency: Weekly (every Wednesday 2 PM ET)
    duration: 60 minutes
    type: Routine production changes
    approvalThreshold: 2 core members (Architect + SRE or Security)

  emergency:
    frequency: On-demand (within 2 hours)
    duration: 30 minutes
    type: Hotfix/P0 incident remediation
    approvalThreshold: 2 core members (expedited review)

  majorChange:
    frequency: Ad-hoc (scheduled 2 weeks in advance)
    duration: 120 minutes
    type: Major architecture changes, multi-service deployments
    approvalThreshold: All 4 core members + optional reviewers
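The approval thresholds above can be expressed as a simple rule table; a minimal sketch (meeting-type names follow the YAML, the function itself is an illustrative assumption, not ATP tooling):

```python
# Required core-member approvals per CAB meeting type, per the schedule above
REQUIRED = {"regular": 2, "emergency": 2, "majorChange": 4}

def is_approved(meeting_type: str, core_approvals: int) -> bool:
    """True when enough core members have approved for this meeting type."""
    return core_approvals >= REQUIRED[meeting_type]

is_approved("regular", 2)      # True: e.g. Architect + SRE suffices
is_approved("majorChange", 3)  # False: all 4 core members required
```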

Change Request Process

Change Ticket Template (Azure DevOps):

# Change Request Work Item Template
workItemType: Change Request
requiredFields:
  - title: Short description of change
  - description: |
      ## Change Summary
      Brief description of what is being changed and why.

      ## Business Justification
      Why is this change necessary? What business value does it provide?

      ## Technical Details
      - Affected services: [List services]
      - Infrastructure changes: [Yes/No]
      - Database schema changes: [Yes/No]
      - API breaking changes: [Yes/No]

      ## Risk Assessment
      - Risk level: [Low/Medium/High/Critical]
      - Potential impact: [Description]
      - Affected tenants: [All/Specific/None]

      ## Rollback Plan
      - Rollback procedure: [Detailed steps]
      - Rollback RTO: [Estimated time]
      - Rollback validation: [How to verify rollback success]

      ## Communication Plan
      - Status page update: [Yes/No]
      - Tenant notification: [Yes/No]
      - Maintenance window: [Date/Time or Zero-downtime]

      ## Testing Evidence
      - Staging validation: [Link to test results]
      - DR drill: [Link to drill report]
      - Security scan: [Link to scan results]

  - assignedTo: Requester (engineer submitting change)
  - scheduledDate: Proposed deployment date/time
  - changeType: [Standard/Emergency/Major]
  - priority: [1-4]

linkedItems:
  - Epic/Feature: Parent feature this change supports
  - Test Results: Staging validation results
  - Security Scan: Latest security scan report
  - DR Drill: Most recent DR drill results

approvalWorkflow:
  step1:
    approver: SRE Lead
    criteria: Technical feasibility, rollback plan validated

  step2:
    approver: Security Officer
    criteria: Security impact assessed, compliance validated

  step3:
    approver: Lead Architect (final approval)
    criteria: Overall risk acceptable, aligned with roadmap

Change Request Workflow (Mermaid):

stateDiagram-v2
    [*] --> Submitted: Engineer creates change request
    Submitted --> TechnicalReview: Assigned to SRE Lead

    TechnicalReview --> SecurityReview: SRE approves
    TechnicalReview --> Rejected: Technical concerns

    SecurityReview --> ArchitectApproval: Security Officer approves
    SecurityReview --> Rejected: Security concerns

    ArchitectApproval --> Scheduled: Lead Architect approves
    ArchitectApproval --> Rejected: Architecture concerns

    Scheduled --> InProgress: Deployment window starts

    InProgress --> Deployed: Deployment successful
    InProgress --> RolledBack: Deployment failed

    Deployed --> Validated: Post-deployment validation passes
    RolledBack --> PostMortem: Rollback complete

    Validated --> Closed: Change successful
    PostMortem --> Closed: RCA documented

    Rejected --> [*]: Change cancelled
    Closed --> [*]: Change complete
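The same workflow can be encoded as a transition table, which makes it easy to validate that a proposed status change is legal before updating the work item (a sketch; state names follow the diagram):

```python
# Legal transitions from the change-request state diagram above
TRANSITIONS = {
    "Submitted": {"TechnicalReview"},
    "TechnicalReview": {"SecurityReview", "Rejected"},
    "SecurityReview": {"ArchitectApproval", "Rejected"},
    "ArchitectApproval": {"Scheduled", "Rejected"},
    "Scheduled": {"InProgress"},
    "InProgress": {"Deployed", "RolledBack"},
    "Deployed": {"Validated"},
    "RolledBack": {"PostMortem"},
    "Validated": {"Closed"},
    "PostMortem": {"Closed"},
}

def can_transition(current: str, target: str) -> bool:
    """True when the state diagram permits moving from current to target."""
    return target in TRANSITIONS.get(current, set())

can_transition("InProgress", "Deployed")  # True
can_transition("Submitted", "Scheduled")  # False: reviews cannot be skipped
```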

CAB Meeting Template

Meeting Agenda (Weekly CAB):

# ATP Change Advisory Board Meeting
**Date**: Wednesday, January 15, 2025, 2:00 PM ET  
**Duration**: 60 minutes  
**Attendees**: Lead Architect, SRE Lead, Security Officer, Product Owner

---

## Agenda

### 1. Review Previous Week's Changes (10 minutes)
- Deployments executed: [List]
- Issues encountered: [List or "None"]
- Rollbacks performed: [List or "None"]

### 2. Upcoming Changes for Approval (40 minutes)

#### Change Request #12345: Upgrade Redis Cache to Premium P3
- **Requester**: Platform Team
- **Risk**: Low
- **Scheduled**: January 18, 2025, 10 PM ET
- **Duration**: 30 minutes (slot swap)
- **Rollback**: Swap back to previous slot (5 minutes)
- **Tenant Impact**: None (transparent upgrade)
- **Cost Impact**: +$500/month
- **Decision**: ☐ Approved ☐ Rejected ☐ Deferred

#### Change Request #12346: Deploy Tamper Evidence V2
- **Requester**: Engineering Team
- **Risk**: Medium
- **Scheduled**: January 20, 2025, via canary deployment
- **Duration**: 7 days (phased rollout)
- **Rollback**: Feature flag kill switch (immediate)
- **Tenant Impact**: Improved tamper evidence (backward compatible)
- **Testing**: Passed load tests, security scans, DR drill
- **Decision**: ☐ Approved ☐ Rejected ☐ Deferred

### 3. Emergency/Hotfix Changes (5 minutes)
- No emergency changes this week

### 4. Action Items Review (5 minutes)
- Outstanding action items from previous meetings

---

## Decisions
1. Change #12345: **Approved** (Architect, SRE, Security)
2. Change #12346: **Approved with conditions** (Require 48h monitoring at 10% before proceeding to 25%)

## Action Items
1. [Platform Team] Schedule Redis upgrade for January 18, 10 PM ET
2. [Engineering Team] Deploy Tamper Evidence V2 canary (10%) on January 20
3. [SRE Team] Monitor canary metrics for 48 hours before advancing

Deployment Scheduling & Change Windows

Purpose: Minimize tenant disruption by scheduling changes during low-traffic windows and communicating proactively.

Change Windows (Production):

changeWindows:
  routine:
    days: Tuesday, Wednesday, Thursday
    time: 10 PM - 2 AM ET (low-traffic window)
    frequency: 1-2 per week
    type: Standard deployments, infrastructure changes
    notification: 48-hour advance notice on status page

  emergency:
    days: Any
    time: Immediate
    frequency: As needed (P0/P1 incidents)
    type: Hotfixes, security patches, critical bug fixes
    notification: Real-time status page + email

  blackout:
    periods:
      - December 15 - January 5 (holiday freeze)
      - End of fiscal quarter (last 3 days)
      - Major customer events (identified by Product team)
    exemptions: P0 incidents, security vulnerabilities
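Whether a proposed deployment time falls inside the routine window can be checked mechanically. A Python sketch (times treated as ET wall-clock; blackout-period handling omitted for brevity):

```python
from datetime import datetime

ROUTINE_DAYS = {1, 2, 3}  # Tue, Wed, Thu (Monday == 0)

def in_routine_window(t: datetime) -> bool:
    """True when t (ET wall-clock) falls in the Tue-Thu 10 PM - 2 AM window.
    The window spans midnight: 22:00-23:59 on a routine day, or
    00:00-01:59 on the day after one."""
    if t.weekday() in ROUTINE_DAYS and t.hour >= 22:
        return True
    # Early-morning tail belongs to the previous day's window
    return (t.weekday() - 1) % 7 in ROUTINE_DAYS and t.hour < 2

in_routine_window(datetime(2025, 1, 15, 22, 30))  # Wed 10:30 PM -> True
in_routine_window(datetime(2025, 1, 17, 1, 0))    # Fri 1:00 AM  -> True (tail of Thu window)
in_routine_window(datetime(2025, 1, 18, 23, 0))   # Sat 11:00 PM -> False
```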

Change Communication Template:

# Production Change Notification

**Change ID**: CR-12345  
**Scheduled Date**: January 18, 2025, 10:00 PM - 10:30 PM ET  
**Type**: Infrastructure Upgrade  
**Risk**: Low  
**Expected Downtime**: Zero (blue-green deployment)

---

## What's Changing?
Upgrading Redis Cache from Premium P1 (6 GB) to Premium P3 (26 GB) to improve performance and capacity.

## Why?
Current Redis utilization averaging 85%; upgrade provides headroom for growth.

## Impact to You
- **Downtime**: None (transparent slot swap)
- **Performance**: Improved response times for query operations
- **Action Required**: None

## Rollback Plan
If issues detected, we will swap back to the previous Redis instance within 5 minutes.

## Questions?
Contact: platform-team@connectsoft.example

---

**Status**: ☐ Scheduled ☐ In Progress ☐ Complete ☐ Rolled Back

Continuous Improvement Metrics

Purpose: Track improvement progress with measurable KPIs aligned with DORA metrics and operational excellence.

DORA Metrics (DevOps Research and Assessment)

doraMetrics:
  deploymentFrequency:
    target: Daily (to Staging), Weekly (to Production)
    current: 2-3 deployments/week (Production)
    trend: Improving (up from 1/week in 2024)

  leadTimeForChanges:
    target: < 1 week (commit to production)
    current: 10 days average
    trend: Stable

  changeFailureRate:
    target: < 15%
    current: 8% (Production deployments requiring rollback)
    trend: Improving (down from 12% in 2024)

  timeToRestoreService:
    target: < 1 hour
    current: 25 minutes average (automated failover)
    trend: Excellent

DORA Metrics Dashboard (Application Insights):

// Deployment frequency (last 30 days)
customEvents
| where name == "DeploymentCompleted"
| where timestamp > ago(30d)
| extend Environment = tostring(customDimensions.Environment)
| where Environment == "prod"
| summarize DeploymentCount = count() by bin(timestamp, 1d)
| extend DeploymentsPerWeek = DeploymentCount * 7
| summarize AvgDeploymentsPerWeek = avg(DeploymentsPerWeek)

// Lead time for changes (commit to production)
customEvents
| where name == "DeploymentCompleted"
| where timestamp > ago(30d)
| extend Environment = tostring(customDimensions.Environment)
| extend CommitTimestamp = todatetime(customDimensions.CommitTimestamp)
| extend DeploymentTimestamp = timestamp
| extend LeadTimeHours = datetime_diff('hour', DeploymentTimestamp, CommitTimestamp)
| where Environment == "prod"
| summarize 
    AvgLeadTimeHours = avg(LeadTimeHours),
    P50LeadTimeHours = percentile(LeadTimeHours, 50),
    P95LeadTimeHours = percentile(LeadTimeHours, 95)

// Change failure rate (deployments requiring rollback)
customEvents
| where name in ("DeploymentCompleted", "DeploymentRolledBack")
| where timestamp > ago(30d)
| extend Environment = tostring(customDimensions.Environment)
| where Environment == "prod"
| summarize 
    SuccessfulDeployments = countif(name == "DeploymentCompleted"),
    FailedDeployments = countif(name == "DeploymentRolledBack"),
    // Each deployment emits exactly one of the two events, so count() is the total
    ChangeFailureRate = 100.0 * countif(name == "DeploymentRolledBack") / count()
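The change-failure-rate arithmetic can be checked outside KQL; a minimal sketch with hypothetical event counts (each deployment emits exactly one of the two event names):

```python
def change_failure_rate(events: list) -> float:
    """Percentage of deployments rolled back, matching the KQL summarize above."""
    failed = events.count("DeploymentRolledBack")
    return 100.0 * failed / len(events)

# Hypothetical month: 23 successful deployments, 2 rollbacks
events = ["DeploymentCompleted"] * 23 + ["DeploymentRolledBack"] * 2
change_failure_rate(events)  # 8.0 -> meets the < 15% target
```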

Platform Maturity Scorecard

platformMaturity:
  automation:
    current: 85%
    target: 95%
    gaps:
      - Manual canary progression (target: automated with ML)
      - Manual DR failback (target: automated validation)

  observability:
    current: 90%
    target: 95%
    gaps:
      - Distributed tracing incomplete (target: 100% coverage)
      - Business metrics dashboards (target: Grafana dashboards per service)

  security:
    current: 95%
    target: 98%
    gaps:
      - Post-quantum cryptography (target: evaluate PQC algorithms)
      - Zero-trust microsegmentation (target: Istio AuthorizationPolicy per pod)

  compliance:
    current: 98%
    target: 100%
    gaps:
      - Continuous compliance monitoring (target: real-time policy enforcement)
      - Automated evidence collection (target: zero manual effort)

  costOptimization:
    current: 80%
    target: 90%
    gaps:
      - Kubernetes node autoscaling tuning (target: KEDA event-driven scaling)
      - Spot instance adoption (target: 50% of non-production compute)

Strategic Roadmap (2025)

ATP's environment roadmap focuses on developer productivity, multi-region expansion, automation maturity, and self-service capabilities.

Q1 2025: Ephemeral Preview Environments

Objective: Per-PR preview environments using Kubernetes namespaces for faster feedback and isolated testing.

Implementation Plan:

q1Deliverables:
  epic: Ephemeral Preview Environments

  features:
    - name: AKS Namespace per PR
      description: Automatically create Kubernetes namespace for each PR
      effort: 13 story points
      dependencies: AKS cluster capacity planning

    - name: Automated DNS per Preview
      description: Dynamic subdomain creation (pr-123.preview.atp.connectsoft.com)
      effort: 8 story points
      dependencies: Azure DNS integration

    - name: Auto-Delete after PR Merge
      description: Cleanup preview environments within 24 hours of PR merge/close
      effort: 5 story points
      dependencies: GitHub webhook integration

    - name: Cost Tracking per PR
      description: Tag preview resources with PR ID for cost attribution
      effort: 3 story points
      dependencies: Azure Cost Management API

  successCriteria:
    - Preview environment created within 10 minutes of PR creation
    - Full ATP stack deployed (7 microservices)
    - Cost < $5 per PR (Spot instances)
    - Automated cleanup 100% successful

Preview Environment Provisioning (GitHub Action):

# .github/workflows/preview-environment.yml
name: Create Preview Environment

on:
  pull_request:
    types: [opened, synchronize]
    branches: [main]

env:
  PR_NUMBER: ${{ github.event.pull_request.number }}
  NAMESPACE: pr-${{ github.event.pull_request.number }}

jobs:
  create-preview:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4

    - name: Azure Login
      uses: azure/login@v1
      with:
        creds: ${{ secrets.AZURE_CREDENTIALS }}

    - name: Create Kubernetes Namespace
      run: |
        az aks get-credentials \
          --resource-group ConnectSoft-ATP-Preview-RG \
          --name atp-aks-preview-eus

        kubectl create namespace $NAMESPACE \
          --dry-run=client -o yaml | kubectl apply -f -

        kubectl label namespace $NAMESPACE \
          pr-number=$PR_NUMBER \
          cost-center=Engineering \
          auto-delete=24h

    - name: Deploy ATP Stack
      run: |
        helm upgrade --install atp-$PR_NUMBER ./helm/atp \
          --namespace $NAMESPACE \
          --set image.tag=pr-$PR_NUMBER \
          --set ingress.host=pr-$PR_NUMBER.preview.atp.connectsoft.com \
          --set resources.requests.cpu=100m \
          --set resources.requests.memory=128Mi

    - name: Wait for Deployment
      run: |
        kubectl wait --for=condition=ready pod \
          --selector=app=atp-gateway \
          --namespace=$NAMESPACE \
          --timeout=600s

    - name: Run Smoke Tests
      run: |
        PREVIEW_URL="https://pr-$PR_NUMBER.preview.atp.connectsoft.com"

        # Wait for DNS propagation
        sleep 60

        # Health check
        curl -f $PREVIEW_URL/health || exit 1

        echo "✅ Preview environment ready: $PREVIEW_URL"

    - name: Comment on PR
      uses: actions/github-script@v7
      with:
        script: |
          github.rest.issues.createComment({
            issue_number: context.issue.number,
            owner: context.repo.owner,
            repo: context.repo.repo,
            body: `## 🚀 Preview Environment Ready\n\n**URL**: https://pr-${{ github.event.pull_request.number }}.preview.atp.connectsoft.com\n\n**Services**:\n- Gateway: https://pr-${{ github.event.pull_request.number }}.preview.atp.connectsoft.com/health\n- Swagger: https://pr-${{ github.event.pull_request.number }}.preview.atp.connectsoft.com/swagger\n\n**Auto-Delete**: 24 hours after PR merge/close`
          })

Q2 2025: Multi-Region Active-Active Expansion

Objective: Expand multi-region to all ATP services (currently only Ingestion/Query are multi-region).

Implementation Plan:

q2Deliverables:
  epic: Multi-Region Active-Active Expansion

  features:
    - name: Deploy Policy Service to West Europe
      effort: 13 story points
      dependencies: West Europe AKS cluster, Cosmos DB geo-replication

    - name: Deploy Export Service to West Europe
      effort: 13 story points
      dependencies: Blob Storage RA-GRS, cross-region blob access

    - name: Deploy Integrity Service to West Europe
      effort: 8 story points
      dependencies: HSM key replication (Azure Key Vault managed HSM)

    - name: Cross-Region Service Discovery
      effort: 8 story points
      dependencies: Azure Front Door routing rules per service

    - name: Multi-Region E2E Tests
      effort: 5 story points
      dependencies: Synthetic monitors from both regions

  successCriteria:
    - All 7 ATP services deployed to both regions
    - Traffic split: 70% East US, 30% West Europe
    - Failover RTO < 30 minutes for all services
    - Incremental cost increase < $3,000/month

Q3 2025: Automated Canary with ML Anomaly Detection

Objective: Fully automated canary deployments with ML-based anomaly detection for intelligent rollout decisions.

Implementation Plan:

q3Deliverables:
  epic: Intelligent Canary Deployments

  features:
    - name: ML Anomaly Detection Model
      description: Train model on historical deployment metrics (error rate, latency, throughput)
      effort: 21 story points
      dependencies: Azure Machine Learning workspace, historical telemetry data

    - name: Automated Canary Progression
      description: Auto-advance canary based on ML model confidence
      effort: 13 story points
      dependencies: ML model endpoint, Azure Pipelines integration

    - name: Intelligent Rollback
      description: ML model detects anomalies; auto-rollback without human intervention
      effort: 13 story points
      dependencies: ML model, automated rollback runbook

    - name: Canary Dashboard
      description: Real-time canary health dashboard with ML predictions
      effort: 8 story points
      dependencies: Grafana, ML model metrics export

  successCriteria:
    - 90% of canary deployments fully automated (no manual progression)
    - False positive rate < 5% (incorrect rollbacks)
    - Anomaly detection latency < 5 minutes
    - Rollback initiated within 10 minutes of anomaly detection

ML Anomaly Detection (Conceptual):

// ML-based canary health prediction
public class CanaryHealthPredictor
{
    private readonly MachineLearningClient _mlClient;

    public async Task<CanaryHealthPrediction> PredictHealthAsync(CanaryMetrics metrics)
    {
        // Prepare features for ML model
        var features = new
        {
            errorRate = metrics.ErrorRate,
            p95Latency = metrics.P95Latency,
            p99Latency = metrics.P99Latency,
            throughput = metrics.RequestsPerSecond,
            cpuUtilization = metrics.AvgCpuUtilization,
            memoryUtilization = metrics.AvgMemoryUtilization,

            // Comparison to baseline
            errorRateChange = (metrics.ErrorRate - metrics.BaselineErrorRate) / metrics.BaselineErrorRate,
            latencyChange = (metrics.P95Latency - metrics.BaselineP95Latency) / metrics.BaselineP95Latency
        };

        // Invoke ML model endpoint
        var prediction = await _mlClient.PredictAsync("canary-health-model", features);

        return new CanaryHealthPrediction
        {
            IsHealthy = prediction.Prediction == "Healthy",
            Confidence = prediction.Confidence,
            AnomalyScore = prediction.AnomalyScore,
            Recommendation = prediction.AnomalyScore > 0.8 
                ? "Rollback immediately" 
                : prediction.AnomalyScore > 0.5 
                    ? "Pause rollout; investigate" 
                    : "Proceed to next increment"
        };
    }
}

Q4 2025: Self-Service Environment Provisioning

Objective: Empower developers to create on-demand dev/test environments without platform team intervention.

Implementation Plan:

q4Deliverables:
  epic: Self-Service Environment Provisioning

  features:
    - name: Environment Portal (Web UI)
      description: Self-service portal for creating dev/test environments
      effort: 21 story points
      dependencies: Azure App Service, Azure AD integration

    - name: Pulumi Automation API Integration
      description: Backend API to trigger Pulumi deployments
      effort: 13 story points
      dependencies: Pulumi Automation API, Azure DevOps integration

    - name: Cost Guardrails
      description: Prevent creation of expensive resources; require approval for >$100/month
      effort: 8 story points
      dependencies: Azure Cost Management API

    - name: Automatic Expiration
      description: Auto-delete environments after 7 days (with renewal option)
      effort: 5 story points
      dependencies: Azure Automation, tagging strategy

  successCriteria:
    - Developers can create environment in < 15 minutes
    - Zero platform team involvement for dev/test environments
    - 100% environments tagged with owner + expiration
    - Orphaned resource detection and cleanup (weekly scan)

Self-Service Portal (ASP.NET Core):

// Environment provisioning API
[ApiController]
[Route("api/environments")]
public class EnvironmentProvisioningController : ControllerBase
{
    private readonly IPulumiAutomationService _pulumi;

    [HttpPost]
    [Authorize(Roles = "Developer")]
    public async Task<IActionResult> CreateEnvironment([FromBody] CreateEnvironmentRequest request)
    {
        // Validate request
        if (request.Environment != "dev" && request.Environment != "test")
        {
            return BadRequest("Self-service provisioning only available for dev/test environments");
        }

        // Estimate cost
        var estimatedCost = await EstimateMonthlyCostAsync(request);

        if (estimatedCost > 100 && !User.IsInRole("PlatformTeam"))
        {
            // The caller is authenticated but lacks permission, so 403 (not 401) is the right status
            return StatusCode(StatusCodes.Status403Forbidden, "Environments >$100/month require platform team approval");
        }

        // Generate unique environment name
        var envName = $"{request.Environment}-{User.Identity.Name.Replace("@", "-")}-{DateTime.UtcNow:yyyyMMdd}";

        // Provision via Pulumi Automation API
        var provisioningJob = await _pulumi.CreateStackAsync(new StackConfig
        {
            ProjectName = "atp-infrastructure",
            StackName = envName,
            Config = new Dictionary<string, string>
            {
                ["environment"] = request.Environment,
                ["region"] = "eastus",
                ["owner"] = User.Identity.Name,
                ["expiresAt"] = DateTime.UtcNow.AddDays(7).ToString("o"),
                ["costCenter"] = "Engineering"
            }
        });

        // Tag resources for tracking
        await TagResourcesAsync(envName, new Dictionary<string, string>
        {
            ["Owner"] = User.Identity.Name,
            ["CreatedAt"] = DateTime.UtcNow.ToString("o"),
            ["ExpiresAt"] = DateTime.UtcNow.AddDays(7).ToString("o"),
            ["EstimatedMonthlyCost"] = estimatedCost.ToString("F2")
        });

        return Accepted(new
        {
            environmentName = envName,
            status = "Provisioning",
            estimatedCompletionTime = DateTime.UtcNow.AddMinutes(15),
            estimatedMonthlyCost = estimatedCost,
            expiresAt = DateTime.UtcNow.AddDays(7),
            provisioningJobId = provisioningJob.Id
        });
    }

    [HttpDelete("{environmentName}")]
    [Authorize(Roles = "Developer")]
    public async Task<IActionResult> DeleteEnvironment(string environmentName)
    {
        // Validate ownership
        var owner = await GetEnvironmentOwnerAsync(environmentName);

        if (owner != User.Identity.Name && !User.IsInRole("PlatformTeam"))
        {
            // Forbid(string) would treat the argument as an authentication scheme,
            // so return an explicit 403 with a message body instead
            return StatusCode(StatusCodes.Status403Forbidden, "You can only delete environments you own");
        }

        // Delete via Pulumi
        await _pulumi.DestroyStackAsync(environmentName);

        return NoContent();
    }
}

Environment Lifecycle Management

Automated Expiration (Daily Scan):

#!/bin/bash
# cleanup-expired-environments.sh

echo "Scanning for expired environments..."

CURRENT_DATE=$(date -u +%Y-%m-%dT%H:%M:%SZ)

# Query resource groups with expiration tags.
# JMESPath only order-compares numbers, so do the date comparison in jq instead
# (lexicographic order matches chronological order for ISO 8601 UTC timestamps).
EXPIRED_RGS=$(az group list \
  --query "[?tags.ExpiresAt != null].{Name:name, Owner:tags.Owner, ExpiresAt:tags.ExpiresAt}" \
  --output json | jq --arg now "$CURRENT_DATE" '[.[] | select(.ExpiresAt < $now)]')

EXPIRED_COUNT=$(echo "$EXPIRED_RGS" | jq length)

if [ "$EXPIRED_COUNT" -eq 0 ]; then
  echo "No expired environments found"
  exit 0
fi

echo "Found $EXPIRED_COUNT expired environments:"
echo "$EXPIRED_RGS" | jq -r '.[] | "\(.Name) (Owner: \(.Owner), Expired: \(.ExpiresAt))"'

# Notify owners before deletion
for RG in $(echo "$EXPIRED_RGS" | jq -r '.[].Name'); do
  OWNER=$(echo "$EXPIRED_RGS" | jq -r ".[] | select(.Name == \"$RG\") | .Owner")
  EXPIRES_AT=$(echo "$EXPIRED_RGS" | jq -r ".[] | select(.Name == \"$RG\") | .ExpiresAt")

  echo "Notifying owner: $OWNER about $RG..."

  # Send email notification. There is no built-in "az mail" command; this assumes
  # the Azure Communication Services CLI extension and a configured connection
  # string ($ACS_CONNECTION_STRING) — substitute your org's notification tooling.
  az communication email send \
    --connection-string "$ACS_CONNECTION_STRING" \
    --sender "noreply@connectsoft.example" \
    --to "$OWNER" \
    --subject "Environment Expiration Notice: $RG" \
    --text "Your environment '$RG' expired on $EXPIRES_AT and will be deleted in 24 hours. To extend, visit: https://portal.atp.connectsoft.com/environments/$RG/renew"

  # Tag for deletion (grace period)
  az group update \
    --name "$RG" \
    --set tags.PendingDeletion="$(date -u -d '+24 hours' +%Y-%m-%dT%H:%M:%SZ)"
done

echo "✅ Expiration notifications sent; environments will be deleted in 24 hours"

Orphaned Resource Detection:

<#
.SYNOPSIS
    Detect orphaned resources without owner tags.
.DESCRIPTION
    Weekly scan for resources missing required tags; notify platform team.
#>

Connect-AzAccount -Identity

$orphanedResources = Get-AzResource | Where-Object {
    -not $_.Tags -or
    -not $_.Tags.ContainsKey('Owner') -or
    -not $_.Tags.ContainsKey('Environment') -or
    -not $_.Tags.ContainsKey('CostCenter')
}

if ($orphanedResources.Count -gt 0) {
    Write-Output "Found $($orphanedResources.Count) orphaned resources:"

    $orphanedResources | Format-Table -Property Name, ResourceGroupName, ResourceType, Location

    # Create work item for cleanup (PowerShell continues lines with backticks, not backslashes)
    az boards work-item create `
        --title "Orphaned Resources Detected: $($orphanedResources.Count) resources" `
        --type "Task" `
        --description "Resources without required tags detected.`n`nSee attached report." `
        --assigned-to "platform-team@connectsoft.example" `
        --fields "Microsoft.VSTS.Common.Priority=3"

    # Export CSV for review
    $orphanedResources | Export-Csv -Path "orphaned-resources-$(Get-Date -Format 'yyyyMMdd').csv" -NoTypeInformation
}
else {
    Write-Output "✅ No orphaned resources detected"
}

Continuous Improvement Framework

Purpose: Systematic improvement through retrospectives, metrics tracking, and experimentation.

Improvement Cycle:

graph LR
    A[Identify Opportunity] --> B[Define Metric]
    B --> C[Implement Change]
    C --> D[Measure Impact]
    D --> E{Improvement?}
    E -->|Yes| F[Document & Scale]
    E -->|No| G[Rollback & Retry]
    F --> A
    G --> A

    style F fill:#90EE90
    style G fill:#FF6347

Improvement Tracking:

// Track environment improvements
public class ImprovementTracker
{
    public async Task RecordImprovementAsync(Improvement improvement)
    {
        var workItem = new WorkItem
        {
            Fields = new Dictionary<string, object>
            {
                ["System.Title"] = improvement.Title,
                ["System.WorkItemType"] = "Improvement",
                ["System.State"] = "Proposed",
                ["Custom.Hypothesis"] = improvement.Hypothesis,
                ["Custom.TargetMetric"] = improvement.TargetMetric,
                ["Custom.BaselineValue"] = improvement.BaselineValue,
                ["Custom.TargetValue"] = improvement.TargetValue,
                ["Custom.EstimatedEffort"] = improvement.EstimatedEffort,
                ["Custom.ExpectedImpact"] = improvement.ExpectedImpact
            }
        };

        await _devOpsClient.CreateWorkItemAsync(workItem, "ConnectSoft", "ATP");
    }

    public async Task MeasureImpactAsync(int improvementId)
    {
        var improvement = await _devOpsClient.GetWorkItemAsync(improvementId);
        var targetMetric = improvement.Fields["Custom.TargetMetric"].ToString();

        // Query actual metric value post-improvement
        var actualValue = await QueryMetricAsync(targetMetric);
        var baselineValue = double.Parse(improvement.Fields["Custom.BaselineValue"].ToString());
        var targetValue = double.Parse(improvement.Fields["Custom.TargetValue"].ToString());

        var actualImprovement = ((actualValue - baselineValue) / baselineValue) * 100;
        var targetImprovement = ((targetValue - baselineValue) / baselineValue) * 100;

        // Direction-aware success check: when the target is a reduction (e.g. cost),
        // lower actual values are better; otherwise higher is better.
        var success = targetValue < baselineValue
            ? actualValue <= targetValue
            : actualValue >= targetValue;

        // Update work item
        await _devOpsClient.UpdateWorkItemAsync(improvementId, new[]
        {
            new JsonPatchOperation
            {
                Operation = Operation.Add,
                Path = "/fields/Custom.ActualValue",
                Value = actualValue
            },
            new JsonPatchOperation
            {
                Operation = Operation.Add,
                Path = "/fields/Custom.ActualImprovement",
                Value = $"{actualImprovement:F1}%"
            },
            new JsonPatchOperation
            {
                Operation = Operation.Add,
                Path = "/fields/System.State",
                Value = success ? "Closed" : "Active"
            }
        });
    }
}

Example Improvements (2024-2025):

improvements:
  - id: IMP-001
    title: Automated Dev/Test Shutdown
    hypothesis: Shutting down dev/test during non-business hours will reduce costs by 40%
    targetMetric: Monthly cost (Dev+Test)
    baselineValue: $1,500
    targetValue: $900
    actualValue: $1,020
    actualImprovement: 32%  # (slightly below target, but significant)
    status: Success

  - id: IMP-002
    title: Reserved Instances for Production
    hypothesis: 1-year reserved instances will reduce compute costs by 20%
    targetMetric: Monthly compute cost (Production)
    baselineValue: $3,000
    targetValue: $2,400
    actualValue: $2,450
    actualImprovement: 18.3%  # (close to target)
    status: Success

  - id: IMP-003
    title: Application Insights Sampling
    hypothesis: 10% sampling will reduce ingestion costs by 90% with minimal observability impact
    targetMetric: Application Insights monthly cost
    baselineValue: $500
    targetValue: $50
    actualValue: $115
    actualImprovement: 77%  # (below the 90% target; some telemetry is never sampled)
    status: Success
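The actualImprovement figures above follow directly from the recorded values: reduction % = (baseline − actual) / baseline × 100. A quick shell check against the IMP-001 numbers:

```shell
#!/bin/bash
# verify-improvement.sh — recompute a cost-reduction percentage from its
# baseline and actual monthly values (IMP-001 figures from the table above)

baseline=1500
actual=1020

# (baseline - actual) / baseline * 100, to one decimal place
reduction=$(awk -v b="$baseline" -v a="$actual" 'BEGIN { printf "%.1f", (b - a) / b * 100 }')

echo "IMP-001 reduction: ${reduction}%"   # 32.0%
```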

Summary

  • Review Cadence: Monthly (cost/utilization), Quarterly (security/access/DR), Annually (full refresh/roadmap).
  • Monthly Review: Cost analysis, resource utilization, scaling policies, rightsizing recommendations with automated report generation.
  • Quarterly Review: Security posture, access audit, compliance scorecard, DR drill review with 2.5-hour workshop format.
  • Annual Review: Full IaC refresh, cost optimization deep dive, security evolution, technology roadmap with 4-hour full-day workshop.
  • Change Advisory Board: 4 core members (Architect, SRE, Security, Product), weekly meetings for routine changes, emergency meetings for hotfixes.
  • Change Process: Formal change request template, 3-step approval workflow (SRE → Security → Architect), documented as a Mermaid state diagram.
  • Change Windows: Tuesday-Thursday 10 PM-2 AM ET (routine), immediate (emergency), blackout periods (holidays, fiscal quarter-end).
  • DORA Metrics: Deployment frequency (2-3/week), lead time (10 days), change failure rate (8%), time to restore (25 min).
  • Platform Maturity: Automation (85%), Observability (90%), Security (95%), Compliance (98%), Cost Optimization (80%).
  • Q1 2025 Roadmap: Ephemeral preview environments per PR (AKS namespaces, auto-delete, <$5 per PR).
  • Q2 2025 Roadmap: Multi-region active-active for all 7 ATP services (70/30 traffic split, <$3k incremental cost).
  • Q3 2025 Roadmap: Automated canary with ML anomaly detection (90% automated, <5% false positives, 10-min rollback).
  • Q4 2025 Roadmap: Self-service environment provisioning (web portal, 15-min creation, cost guardrails, 7-day expiration).
  • Improvement Framework: Hypothesis-driven improvements with baseline/target metrics, success tracking via Azure DevOps work items.

Appendix A — Environment Variable Reference

This appendix provides comprehensive environment variable templates for all ATP environments, including base configuration, environment-specific overrides, Key Vault references, and container runtime examples.

Purpose

  • Standardize environment variable naming and structure across all environments
  • Document required vs. optional variables per environment
  • Provide templates for local development, container deployments, and Azure App Service
  • Clarify Key Vault reference syntax for secure secrets injection
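For reference, Azure App Service accepts two Key Vault reference forms — a full SecretUri (optionally version-pinned) or a VaultName/SecretName pair that resolves the latest version. Shown here as shell exports to match the templates below (vault and secret names are illustrative):

```shell
# Form 1: full secret URI (append a version segment to pin a specific version)
export ConnectionStrings__DefaultConnection="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-prod-eus.vault.azure.net/secrets/SqlConnectionString/)"

# Form 2: vault + secret name (always resolves the latest version)
export Authentication__JwtSecret="@Microsoft.KeyVault(VaultName=atp-keyvault-prod-eus;SecretName=JwtSigningKey)"

echo "$Authentication__JwtSecret"
```

App Service substitutes the secret value at startup; the application only ever sees the resolved string.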

Base Environment Variables (All Environments)

Common Variables (required in all environments):

# Runtime environment
export ASPNETCORE_ENVIRONMENT=Development  # Development | Test | Staging | Production

# Application identity
export APP_NAME=atp-ingestion
export APP_VERSION=1.0.123
export DEPLOYMENT_TIMESTAMP=2025-01-15T14:30:00Z

# Logging
export Logging__LogLevel__Default=Information
export Logging__LogLevel__Microsoft=Warning
export Logging__LogLevel__System=Warning

# OpenTelemetry
export OpenTelemetry__ServiceName=atp-ingestion
export OpenTelemetry__ServiceVersion=1.0.123
export OpenTelemetry__ServiceNamespace=atp
export OpenTelemetry__ExporterEndpoint=http://otel-collector:4317

# Health checks
export HealthChecks__Enabled=true
export HealthChecks__Port=8080
export HealthChecks__Path=/health
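The double underscore in names like `Logging__LogLevel__Default` is how .NET's environment-variable configuration provider encodes the hierarchical `:` separator (which is not portable in variable names). A small bash illustration of the mapping:

```shell
# .NET's env-var configuration provider translates "__" into the hierarchical
# ":" separator, so Logging__LogLevel__Default becomes Logging:LogLevel:Default
env_key="Logging__LogLevel__Default"
config_key="${env_key//__/:}"

echo "$env_key -> $config_key"   # Logging__LogLevel__Default -> Logging:LogLevel:Default
```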

Dev Environment Variables

Dev Environment (appsettings.Development.json + environment variables):

#!/bin/bash
# dev-environment-variables.sh

# ═══════════════════════════════════════════════════════════════════════════
# ATP Dev Environment Variables
# Usage: source dev-environment-variables.sh
# ═══════════════════════════════════════════════════════════════════════════

# Runtime
export ASPNETCORE_ENVIRONMENT=Development
export ASPNETCORE_URLS="http://+:5000;https://+:5001"  # quoted: an unquoted semicolon would split the command

# Database (local SQL Server)
export ConnectionStrings__DefaultConnection="Server=localhost,1433;Database=ATP_Dev;User Id=sa;Password=P@ssw0rd123!;TrustServerCertificate=True;MultipleActiveResultSets=True"
export ConnectionStrings__ReadReplica="Server=localhost,1433;Database=ATP_Dev;User Id=sa;Password=P@ssw0rd123!;TrustServerCertificate=True;ApplicationIntent=ReadOnly"

# Cosmos DB (local emulator)
export ConnectionStrings__CosmosDb="AccountEndpoint=https://localhost:8081/;AccountKey=C2y6yDjf5/R+ob0N8A7Cgv30VRDJIWEHLM+4QDU5DE2nQ9nDuVTqobD4b8mGGyPMbIZnqyMsEcaGQy67XIw/Jw=="
export CosmosDb__DatabaseName=ATPDev
export CosmosDb__ContainerName=AuditEvents

# Redis (local container)
export ConnectionStrings__Redis="localhost:6379,abortConnect=false,connectRetry=3,connectTimeout=5000"
export Redis__InstanceName=ATPDev
export Redis__DefaultExpirationMinutes=60

# Service Bus (local emulator or Azure)
export ConnectionStrings__ServiceBus="Endpoint=sb://localhost:5672;SharedAccessKeyName=RootManageSharedAccessKey;SharedAccessKey=..."
export ServiceBus__QueueName=atp-audit-events-dev
export ServiceBus__TopicName=atp-events-dev

# RabbitMQ (alternative to Service Bus in dev)
export ConnectionStrings__RabbitMQ="amqp://guest:guest@localhost:5672"
export RabbitMQ__QueueName=atp-audit-events-dev
export RabbitMQ__ExchangeName=atp-events-dev

# Blob Storage (local Azurite)
export ConnectionStrings__BlobStorage="DefaultEndpointsProtocol=http;AccountName=devstoreaccount1;AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==;BlobEndpoint=http://127.0.0.1:10000/devstoreaccount1"
export BlobStorage__ContainerName=atp-audit-attachments-dev

# Audit Configuration
export Audit__EnableImmutability=false
export Audit__EnableTamperEvidence=false
export Audit__RetentionDays=30
export Audit__EnableWormStorage=false

# Compliance
export Compliance__StrictInDevelopment=true  # Enable strict validation even in dev
export Compliance__EnableLoggingRedaction=true
export Compliance__Profile=default

# OpenTelemetry
export OpenTelemetry__ServiceName=atp-ingestion
export OpenTelemetry__ServiceVersion=1.0.0-dev
export OpenTelemetry__ExporterEndpoint=http://localhost:4317
export OpenTelemetry__SamplingRatio=1.0  # 100% sampling in dev
export OpenTelemetry__ExportIntervalSeconds=5

# Application Insights (optional in dev)
export ApplicationInsights__InstrumentationKey=""  # Empty = disabled
export ApplicationInsights__EnableAdaptiveSampling=false

# Feature Flags (all enabled in dev)
export FeatureManagement__TamperEvidenceV2=true
export FeatureManagement__AdvancedQueryFilters=true
export FeatureManagement__AIAssistedAnomalyDetection=true

# JWT Authentication (dev key)
export Authentication__JwtSecret=dev-secret-key-change-in-production-32-chars-min
export Authentication__JwtIssuer=https://atp-dev.connectsoft.local
export Authentication__JwtAudience=atp-services
export Authentication__JwtExpirationMinutes=480  # 8 hours

# API Rate Limiting (relaxed in dev)
export RateLimiting__Enabled=false
export RateLimiting__RequestsPerMinute=1000

# Debugging
export ASPNETCORE_DETAILEDERRORS=true
export ASPNETCORE_SHUTDOWNTIMEOUTSECONDS=30
export COMPlus_EnableDiagnostics=1

echo "✅ Dev environment variables loaded"
echo "   - ASPNETCORE_ENVIRONMENT: $ASPNETCORE_ENVIRONMENT"
echo "   - Database: localhost:1433"
echo "   - Redis: localhost:6379"
echo "   - OpenTelemetry: $OpenTelemetry__ExporterEndpoint"
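After sourcing the file, it helps to fail fast on anything still unset before launching a service — the same idea as the C# startup validator later in this appendix, sketched here in bash:

```shell
#!/bin/bash
# preflight-check.sh — report required variables that are unset or empty

REQUIRED_VARS=(
  ASPNETCORE_ENVIRONMENT
  ConnectionStrings__DefaultConnection
  ConnectionStrings__Redis
  OpenTelemetry__ServiceName
)

missing=0
for var in "${REQUIRED_VARS[@]}"; do
  if [ -z "${!var}" ]; then
    echo "❌ Missing required variable: $var"
    missing=$((missing + 1))
  fi
done

if [ "$missing" -gt 0 ]; then
  echo "$missing required variable(s) missing"   # a real script would 'exit 1' here
else
  echo "✅ All required variables present"
fi
```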

Test Environment Variables

Test Environment (Azure App Service or AKS):

#!/bin/bash
# test-environment-variables.sh

# Runtime
export ASPNETCORE_ENVIRONMENT=Test
export ASPNETCORE_URLS=http://+:8080

# Database (Azure SQL)
export ConnectionStrings__DefaultConnection="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-test-eus.vault.azure.net/secrets/SqlConnectionString)"
export ConnectionStrings__ReadReplica="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-test-eus.vault.azure.net/secrets/SqlReadReplicaConnectionString)"

# Cosmos DB
export ConnectionStrings__CosmosDb="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-test-eus.vault.azure.net/secrets/CosmosDbConnectionString)"
export CosmosDb__DatabaseName=ATPTest
export CosmosDb__ContainerName=AuditEvents

# Redis
export ConnectionStrings__Redis="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-test-eus.vault.azure.net/secrets/RedisConnectionString)"
export Redis__InstanceName=ATPTest
export Redis__DefaultExpirationMinutes=120

# Service Bus
export ConnectionStrings__ServiceBus="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-test-eus.vault.azure.net/secrets/ServiceBusConnectionString)"
export ServiceBus__QueueName=atp-audit-events-test
export ServiceBus__TopicName=atp-events-test

# Blob Storage
export ConnectionStrings__BlobStorage="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-test-eus.vault.azure.net/secrets/BlobStorageConnectionString)"
export BlobStorage__ContainerName=atp-audit-attachments-test

# Audit Configuration
export Audit__EnableImmutability=false  # Still disabled in test
export Audit__EnableTamperEvidence=true  # Test tamper evidence
export Audit__RetentionDays=90
export Audit__EnableWormStorage=false

# Compliance
export Compliance__StrictInDevelopment=false
export Compliance__EnableLoggingRedaction=true
export Compliance__Profile=default

# OpenTelemetry
export OpenTelemetry__ServiceName=atp-ingestion
export OpenTelemetry__ServiceVersion=1.0.123
export OpenTelemetry__ExporterEndpoint=http://otel-collector-test.atp.local:4317
export OpenTelemetry__SamplingRatio=0.5  # 50% sampling in test
export OpenTelemetry__ExportIntervalSeconds=30

# Application Insights
export ApplicationInsights__InstrumentationKey="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-test-eus.vault.azure.net/secrets/AppInsightsKey)"
export ApplicationInsights__EnableAdaptiveSampling=true
export ApplicationInsights__MaxTelemetryItemsPerSecond=50

# Feature Flags (controlled via Azure App Configuration)
export AppConfiguration__Endpoint=https://atp-appconfig-test-eus.azconfig.io
export AppConfiguration__ManagedIdentityEnabled=true

# JWT Authentication
export Authentication__JwtSecret="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-test-eus.vault.azure.net/secrets/JwtSigningKey)"
export Authentication__JwtIssuer=https://atp-test.connectsoft.com
export Authentication__JwtAudience=atp-services
export Authentication__JwtExpirationMinutes=240  # 4 hours

# API Rate Limiting
export RateLimiting__Enabled=true
export RateLimiting__RequestsPerMinute=500

# Azure Managed Identity
export AZURE_CLIENT_ID=12345678-1234-1234-1234-123456789012  # User-assigned managed identity
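The `@Microsoft.KeyVault(...)` references above are resolved by App Service itself; for tooling that needs the value outside App Service (e.g. a local smoke test), the SecretUri can be extracted and fetched with the Azure CLI. A sketch (the `az` call is commented out because it requires a signed-in session and a live vault):

```shell
#!/bin/bash
# resolve-kv-reference.sh — extract the SecretUri from an App Service
# Key Vault reference so it can be fetched with the Azure CLI

ref='@Microsoft.KeyVault(SecretUri=https://atp-keyvault-test-eus.vault.azure.net/secrets/SqlConnectionString)'

secret_uri=$(echo "$ref" | sed -n 's/^@Microsoft\.KeyVault(SecretUri=\(.*\))$/\1/p')

echo "Secret URI: $secret_uri"

# Fetch the secret value (requires a signed-in Azure CLI session):
# az keyvault secret show --id "$secret_uri" --query value -o tsv
```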

Staging Environment Variables

Staging Environment (production-like configuration):

#!/bin/bash
# staging-environment-variables.sh

# Runtime
export ASPNETCORE_ENVIRONMENT=Staging
export ASPNETCORE_URLS=http://+:8080

# Database (Azure SQL with geo-replication)
export ConnectionStrings__DefaultConnection="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-staging-eus.vault.azure.net/secrets/SqlConnectionString)"
export ConnectionStrings__ReadReplica="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-staging-eus.vault.azure.net/secrets/SqlReadReplicaConnectionString)"

# Cosmos DB
export ConnectionStrings__CosmosDb="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-staging-eus.vault.azure.net/secrets/CosmosDbConnectionString)"
export CosmosDb__DatabaseName=ATPStaging
export CosmosDb__ContainerName=AuditEvents
export CosmosDb__EnableMultiRegion=true
export CosmosDb__PreferredRegions=eastus,westeurope

# Redis (Premium tier)
export ConnectionStrings__Redis="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-staging-eus.vault.azure.net/secrets/RedisConnectionString)"
export Redis__InstanceName=ATPStaging
export Redis__DefaultExpirationMinutes=240
export Redis__EnableCluster=true

# Service Bus (Premium tier)
export ConnectionStrings__ServiceBus="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-staging-eus.vault.azure.net/secrets/ServiceBusConnectionString)"
export ServiceBus__QueueName=atp-audit-events-staging
export ServiceBus__TopicName=atp-events-staging
export ServiceBus__EnablePartitioning=true

# Blob Storage (with WORM for testing)
export ConnectionStrings__BlobStorage="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-staging-eus.vault.azure.net/secrets/BlobStorageConnectionString)"
export BlobStorage__ContainerName=atp-audit-attachments-staging
export BlobStorage__EnableImmutability=true
export BlobStorage__ImmutabilityPeriodDays=90

# Audit Configuration (production-like)
export Audit__EnableImmutability=true
export Audit__EnableTamperEvidence=true
export Audit__RetentionDays=180
export Audit__EnableWormStorage=true
export Audit__EnableHashChains=true

# Compliance
export Compliance__StrictInDevelopment=false
export Compliance__EnableLoggingRedaction=true
export Compliance__Profile=default
export Compliance__EnableGDPR=true
export Compliance__EnableHIPAA=true

# OpenTelemetry
export OpenTelemetry__ServiceName=atp-ingestion
export OpenTelemetry__ServiceVersion=1.0.123
export OpenTelemetry__ExporterEndpoint=http://otel-collector-staging.atp.local:4317
export OpenTelemetry__SamplingRatio=0.25  # 25% sampling in staging
export OpenTelemetry__ExportIntervalSeconds=60

# Application Insights
export ApplicationInsights__InstrumentationKey="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-staging-eus.vault.azure.net/secrets/AppInsightsKey)"
export ApplicationInsights__EnableAdaptiveSampling=true
export ApplicationInsights__MaxTelemetryItemsPerSecond=100
export ApplicationInsights__InitialSamplingPercentage=25

# Feature Flags (Azure App Configuration)
export AppConfiguration__Endpoint=https://atp-appconfig-staging-eus.azconfig.io
export AppConfiguration__ManagedIdentityEnabled=true
export AppConfiguration__RefreshIntervalSeconds=30

# JWT Authentication
export Authentication__JwtSecret="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-staging-eus.vault.azure.net/secrets/JwtSigningKey)"
export Authentication__JwtIssuer=https://atp-staging.connectsoft.com
export Authentication__JwtAudience=atp-services
export Authentication__JwtExpirationMinutes=120  # 2 hours
export Authentication__EnableRefreshTokens=true

# API Rate Limiting
export RateLimiting__Enabled=true
export RateLimiting__RequestsPerMinute=500
export RateLimiting__BurstSize=100

# Azure Managed Identity
export AZURE_CLIENT_ID=23456789-2345-2345-2345-234567890123
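Because staging is meant to mirror production, it is worth periodically diffing the two variable sets for drift. A sketch using `comm` on sorted key names (the file contents below are illustrative stand-ins for the real variable files):

```shell
#!/bin/bash
# env-drift-check.sh — list variable names present in prod but not staging
export LC_ALL=C  # deterministic sort order for comm

# Illustrative stand-ins for the real variable files
cat > /tmp/staging-vars.txt <<'EOF'
ASPNETCORE_ENVIRONMENT
Audit__EnableImmutability
RateLimiting__Enabled
EOF

cat > /tmp/prod-vars.txt <<'EOF'
ASPNETCORE_ENVIRONMENT
Audit__EnableImmutability
Audit__EnableDigitalSignatures
RateLimiting__Enabled
EOF

# comm -13 keeps only lines unique to the second (prod) file
drift=$(comm -13 <(sort /tmp/staging-vars.txt) <(sort /tmp/prod-vars.txt))

echo "Variables in prod but not staging:"
echo "$drift"
```

In practice the name lists would be generated from the real scripts, e.g. with `grep -o '^export [A-Za-z_]*'` over each file.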

Production Environment Variables

Production Environment (full security and compliance):

#!/bin/bash
# production-environment-variables.sh

# Runtime
export ASPNETCORE_ENVIRONMENT=Production
export ASPNETCORE_URLS=http://+:8080

# Database (Azure SQL with multi-region failover)
export ConnectionStrings__DefaultConnection="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-prod-eus.vault.azure.net/secrets/SqlConnectionString)"
export ConnectionStrings__ReadReplica="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-prod-eus.vault.azure.net/secrets/SqlReadReplicaConnectionString)"
export ConnectionStrings__FailoverPartner=atp-sql-prod-weu.database.windows.net

# Cosmos DB (multi-region with automatic failover)
export ConnectionStrings__CosmosDb="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-prod-eus.vault.azure.net/secrets/CosmosDbConnectionString)"
export CosmosDb__DatabaseName=ATPProduction
export CosmosDb__ContainerName=AuditEvents
export CosmosDb__EnableMultiRegion=true
export CosmosDb__PreferredRegions=eastus,westeurope,southeastasia
export CosmosDb__ConsistencyLevel=Session
export CosmosDb__EnableAutomaticFailover=true

# Redis (Premium tier with geo-replication)
export ConnectionStrings__Redis="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-prod-eus.vault.azure.net/secrets/RedisConnectionString)"
export Redis__InstanceName=ATPProduction
export Redis__DefaultExpirationMinutes=1440  # 24 hours
export Redis__EnableCluster=true
export Redis__EnableGeoReplication=true

# Service Bus (Premium tier with geo-disaster recovery)
export ConnectionStrings__ServiceBus="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-prod-eus.vault.azure.net/secrets/ServiceBusConnectionString)"
export ServiceBus__QueueName=atp-audit-events-prod
export ServiceBus__TopicName=atp-events-prod
export ServiceBus__EnablePartitioning=true
export ServiceBus__EnableDuplicateDetection=true
export ServiceBus__DuplicateDetectionHistoryTimeWindow=600  # 10 minutes

# Blob Storage (WORM enabled, 7-year retention)
export ConnectionStrings__BlobStorage="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-prod-eus.vault.azure.net/secrets/BlobStorageConnectionString)"
export BlobStorage__ContainerName=atp-audit-attachments-prod
export BlobStorage__EnableImmutability=true
export BlobStorage__ImmutabilityPeriodDays=2555  # 7 years
export BlobStorage__EnableVersioning=true
export BlobStorage__EnableSoftDelete=true
export BlobStorage__SoftDeleteRetentionDays=90

# Audit Configuration (full compliance)
export Audit__EnableImmutability=true
export Audit__EnableTamperEvidence=true
export Audit__RetentionDays=2555  # 7 years
export Audit__EnableWormStorage=true
export Audit__EnableHashChains=true
export Audit__EnableDigitalSignatures=true
export Audit__SigningKeyVaultUri="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-prod-eus.vault.azure.net/secrets/AuditSigningKey)"

# Compliance (all frameworks enabled)
export Compliance__StrictInDevelopment=false
export Compliance__EnableLoggingRedaction=true
export Compliance__Profile=default
export Compliance__EnableGDPR=true
export Compliance__EnableHIPAA=true
export Compliance__EnableSOC2=true
export Compliance__EnablePCIDSS=false  # Not applicable to ATP
export Compliance__DataClassificationEnabled=true

# OpenTelemetry (optimized for production)
export OpenTelemetry__ServiceName=atp-ingestion
export OpenTelemetry__ServiceVersion=1.0.123
export OpenTelemetry__ServiceNamespace=atp
export OpenTelemetry__ExporterEndpoint=http://otel-collector-prod.atp.local:4317
export OpenTelemetry__SamplingRatio=0.1  # 10% sampling in production
export OpenTelemetry__ExportIntervalSeconds=60
export OpenTelemetry__EnableMetrics=true
export OpenTelemetry__EnableTracing=true
export OpenTelemetry__EnableLogging=false  # Use separate log aggregation

# Application Insights (with advanced features)
export ApplicationInsights__InstrumentationKey="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-prod-eus.vault.azure.net/secrets/AppInsightsKey)"
export ApplicationInsights__EnableAdaptiveSampling=true
export ApplicationInsights__MaxTelemetryItemsPerSecond=100
export ApplicationInsights__InitialSamplingPercentage=10
export ApplicationInsights__MinSamplingPercentage=5
export ApplicationInsights__MaxSamplingPercentage=25
export ApplicationInsights__EnableDependencyTracking=true
export ApplicationInsights__EnablePerformanceCounterCollection=true

# Feature Flags (Azure App Configuration with caching)
export AppConfiguration__Endpoint=https://atp-appconfig-prod-eus.azconfig.io
export AppConfiguration__ManagedIdentityEnabled=true
export AppConfiguration__RefreshIntervalSeconds=60
export AppConfiguration__EnableCaching=true
export AppConfiguration__CacheDurationSeconds=300  # 5 minutes

# JWT Authentication (production keys)
export Authentication__JwtSecret="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-prod-eus.vault.azure.net/secrets/JwtSigningKey)"
export Authentication__JwtIssuer=https://atp.connectsoft.com
export Authentication__JwtAudience=atp-services
export Authentication__JwtExpirationMinutes=60  # 1 hour
export Authentication__EnableRefreshTokens=true
export Authentication__RefreshTokenExpirationDays=7
export Authentication__EnableTokenRotation=true

# API Rate Limiting (strict in production)
export RateLimiting__Enabled=true
export RateLimiting__RequestsPerMinute=500
export RateLimiting__BurstSize=100
export RateLimiting__EnableDistributedRateLimiting=true
export RateLimiting__RedisConnectionString="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-prod-eus.vault.azure.net/secrets/RedisConnectionString)"

# Circuit Breaker (Polly)
export Resilience__CircuitBreaker__Enabled=true
export Resilience__CircuitBreaker__FailureThreshold=0.5  # 50% failure rate
export Resilience__CircuitBreaker__SamplingDurationSeconds=60
export Resilience__CircuitBreaker__MinimumThroughput=10
export Resilience__CircuitBreaker__BreakDurationSeconds=30

# Azure Managed Identity (production)
export AZURE_CLIENT_ID=34567890-3456-3456-3456-345678901234

# Logging (production-optimized)
export Logging__LogLevel__Default=Warning
export Logging__LogLevel__Microsoft=Error
export Logging__LogLevel__System=Error
export Logging__LogLevel__Microsoft_Hosting_Lifetime=Information

# Performance
export ASPNETCORE_FORWARDEDHEADERS_ENABLED=true
export ASPNETCORE_SHUTDOWNTIMEOUTSECONDS=60

# Health Checks
export HealthChecks__Enabled=true
export HealthChecks__Port=8080
export HealthChecks__Path=/health
export HealthChecks__DetailedErrorsEnabled=false  # Security: don't expose details
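Every secret-bearing value in the production file should be a Key Vault reference, never an inline credential. A quick guard that scans for credential markers (the sample file and patterns are illustrative):

```shell
#!/bin/bash
# no-inline-secrets.sh — flag connection strings that embed credentials
# instead of using a Key Vault reference

# Illustrative stand-in for production-environment-variables.sh
cat > /tmp/prod-env.sh <<'EOF'
export ConnectionStrings__DefaultConnection="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-prod-eus.vault.azure.net/secrets/SqlConnectionString)"
export ConnectionStrings__Legacy="Server=db;User Id=sa;Password=hunter2"
EOF

# Lines containing credential markers that are not Key Vault references
violations=$(grep -E 'Password=|AccountKey=|SharedAccessKey=' /tmp/prod-env.sh | grep -v '@Microsoft.KeyVault' || true)

if [ -n "$violations" ]; then
  echo "❌ Inline secrets found:"
  echo "$violations"
else
  echo "✅ No inline secrets"
fi
```

A check like this fits naturally as a CI gate on changes to the production variable files.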

Kubernetes ConfigMap & Secret Example

Kubernetes Deployment (combining ConfigMap + Secrets):

# k8s-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: atp-ingestion-config
  namespace: atp-prod
data:
  ASPNETCORE_ENVIRONMENT: "Production"
  ASPNETCORE_URLS: "http://+:8080"

  # Audit Configuration
  Audit__EnableImmutability: "true"
  Audit__EnableTamperEvidence: "true"
  Audit__RetentionDays: "2555"
  Audit__EnableWormStorage: "true"

  # OpenTelemetry
  OpenTelemetry__ServiceName: "atp-ingestion"
  OpenTelemetry__ServiceVersion: "1.0.123"
  OpenTelemetry__ExporterEndpoint: "http://otel-collector.atp-prod.svc.cluster.local:4317"
  OpenTelemetry__SamplingRatio: "0.1"

  # Feature Flags
  AppConfiguration__Endpoint: "https://atp-appconfig-prod-eus.azconfig.io"
  AppConfiguration__ManagedIdentityEnabled: "true"

  # Rate Limiting
  RateLimiting__Enabled: "true"
  RateLimiting__RequestsPerMinute: "500"

---
# k8s-secrets.yaml (Key Vault CSI)
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: atp-ingestion-secrets
  namespace: atp-prod
spec:
  provider: azure
  parameters:
    usePodIdentity: "false"
    useVMManagedIdentity: "true"
    userAssignedIdentityID: "34567890-3456-3456-3456-345678901234"
    keyvaultName: "atp-keyvault-prod-eus"
    tenantId: "12345678-1234-1234-1234-123456789012"
    objects: |
      array:
        - objectName: SqlConnectionString
          objectType: secret
          objectAlias: ConnectionStrings__DefaultConnection

        - objectName: CosmosDbConnectionString
          objectType: secret
          objectAlias: ConnectionStrings__CosmosDb

        - objectName: RedisConnectionString
          objectType: secret
          objectAlias: ConnectionStrings__Redis

        - objectName: ServiceBusConnectionString
          objectType: secret
          objectAlias: ConnectionStrings__ServiceBus

        - objectName: BlobStorageConnectionString
          objectType: secret
          objectAlias: ConnectionStrings__BlobStorage

        - objectName: AppInsightsKey
          objectType: secret
          objectAlias: ApplicationInsights__InstrumentationKey

        - objectName: JwtSigningKey
          objectType: secret
          objectAlias: Authentication__JwtSecret

  # Sync mounted content into a Kubernetes Secret so the Deployment's
  # secretKeyRef entries resolve; without secretObjects the CSI driver only
  # mounts files and never creates the 'atp-ingestion-secrets' Secret.
  secretObjects:
    - secretName: atp-ingestion-secrets
      type: Opaque
      data:
        - objectName: ConnectionStrings__DefaultConnection
          key: ConnectionStrings__DefaultConnection
        - objectName: ConnectionStrings__Redis
          key: ConnectionStrings__Redis

---
# k8s-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
  namespace: atp-prod
spec:
  replicas: 3
  selector:
    matchLabels:
      app: atp-ingestion
  template:
    metadata:
      labels:
        app: atp-ingestion
        version: 1.0.123
    spec:
      serviceAccountName: atp-ingestion-sa
      containers:
      - name: atp-ingestion
        image: connectsoft.azurecr.io/atp/ingestion:1.0.123
        ports:
        - containerPort: 8080

        # Environment variables from ConfigMap
        envFrom:
        - configMapRef:
            name: atp-ingestion-config

        # Secrets from Key Vault CSI
        volumeMounts:
        - name: secrets-store
          mountPath: "/mnt/secrets-store"
          readOnly: true

        # Override with secrets
        env:
        - name: ConnectionStrings__DefaultConnection
          valueFrom:
            secretKeyRef:
              name: atp-ingestion-secrets
              key: ConnectionStrings__DefaultConnection

        - name: ConnectionStrings__Redis
          valueFrom:
            secretKeyRef:
              name: atp-ingestion-secrets
              key: ConnectionStrings__Redis

        resources:
          requests:
            cpu: 500m
            memory: 512Mi
          limits:
            cpu: 2000m
            memory: 2Gi

      volumes:
      - name: secrets-store
        csi:
          driver: secrets-store.csi.k8s.io
          readOnly: true
          volumeAttributes:
            secretProviderClass: atp-ingestion-secrets

Docker Compose Example (Dev Environment)

Local Development with Docker Compose:

# docker-compose.yml
version: '3.8'

services:
  atp-ingestion:
    build:
      context: .
      dockerfile: src/ConnectSoft.ATP.Ingestion/Dockerfile
    ports:
      - "5000:8080"
    environment:
      ASPNETCORE_ENVIRONMENT: Development
      ASPNETCORE_URLS: http://+:8080

      # Connection strings (Docker service names)
      ConnectionStrings__DefaultConnection: "Server=sqlserver;Database=ATP_Dev;User Id=sa;Password=P@ssw0rd123!;TrustServerCertificate=True"
      ConnectionStrings__Redis: "redis:6379,abortConnect=false"
      ConnectionStrings__ServiceBus: "Endpoint=sb://rabbitmq:5672"
      # NOTE: assumes a 'cosmos' emulator service that is not defined below; add
      # the Cosmos DB Linux emulator as a service or point this at a dev account
      ConnectionStrings__CosmosDb: "AccountEndpoint=http://cosmos:8081/;AccountKey=C2y6yDjf5/R+ob0N8A7Cgv30VRDJIWEHLM+4QDU5DE2nQ9nDuVTqobD4b8mGGyPMbIZnqyMsEcaGQy67XIw/Jw=="

      # Audit
      Audit__EnableImmutability: "false"
      Audit__RetentionDays: "30"

      # OpenTelemetry
      OpenTelemetry__ServiceName: atp-ingestion
      OpenTelemetry__ExporterEndpoint: http://otel-collector:4317
      OpenTelemetry__SamplingRatio: "1.0"

      # Feature Flags
      FeatureManagement__TamperEvidenceV2: "true"
    depends_on:
      - sqlserver
      - redis
      - rabbitmq
      - otel-collector
      - seq

  sqlserver:
    image: mcr.microsoft.com/mssql/server:2022-latest
    environment:
      ACCEPT_EULA: Y
      SA_PASSWORD: P@ssw0rd123!
      MSSQL_PID: Developer
    ports:
      - "1433:1433"

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"

  rabbitmq:
    image: rabbitmq:3-management-alpine
    environment:
      RABBITMQ_DEFAULT_USER: guest
      RABBITMQ_DEFAULT_PASS: guest
    ports:
      - "5672:5672"
      - "15672:15672"

  otel-collector:
    image: otel/opentelemetry-collector:0.97.0
    command: ["--config=/etc/otel/config.yaml"]
    volumes:
      - ./otel-config.yaml:/etc/otel/config.yaml
    ports:
      - "4317:4317"
      - "8888:8888"

  seq:
    image: datalust/seq:latest
    environment:
      ACCEPT_EULA: Y
    ports:
      - "5341:80"

Environment Variable Validation

C# Startup Validation (ensures required variables are present):

// Validate environment variables at startup
public static class EnvironmentVariableValidator
{
    public static void ValidateRequiredVariables(IConfiguration configuration, ILogger logger)
    {
        var requiredVariables = new Dictionary<string, string>
        {
            ["ASPNETCORE_ENVIRONMENT"] = "Runtime environment",
            ["ConnectionStrings:DefaultConnection"] = "Primary database connection",
            ["ConnectionStrings:Redis"] = "Redis cache connection",
            ["OpenTelemetry:ServiceName"] = "Service name for telemetry",
            ["OpenTelemetry:ExporterEndpoint"] = "OTEL collector endpoint"
        };

        var missingVariables = new List<string>();

        foreach (var (key, description) in requiredVariables)
        {
            // IConfiguration keys always use ':' separators; the environment
            // variable provider normalizes "__" to ":" at load time, so look
            // the key up as-is rather than converting it back to "__" form.
            var value = configuration[key];

            if (string.IsNullOrWhiteSpace(value))
            {
                missingVariables.Add($"{key} ({description})");
                logger.LogError("Missing required environment variable: {Key} ({Description})", key, description);
            }
        }

        if (missingVariables.Any())
        {
            throw new InvalidOperationException(
                $"Missing required environment variables:\n  - {string.Join("\n  - ", missingVariables)}");
        }

        logger.LogInformation("✅ All required environment variables validated successfully");
    }
}

// In Program.cs
var builder = WebApplication.CreateBuilder(args);

// Validate environment variables before building the app. A standalone
// LoggerFactory avoids building a throwaway service provider from the
// builder's service collection just to obtain a logger.
using var startupLoggerFactory = LoggerFactory.Create(logging => logging.AddConsole());
EnvironmentVariableValidator.ValidateRequiredVariables(
    builder.Configuration,
    startupLoggerFactory.CreateLogger<Program>());
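The validator works because ASP.NET Core's environment-variable configuration provider maps double underscores in variable names to ":" hierarchy separators when building configuration keys. A quick shell sketch of that mapping (the variable name below is illustrative):

```shell
# ASP.NET Core maps "__" in environment-variable names to ":" when it
# builds configuration keys. This mirrors that normalization in bash.
var_name="ConnectionStrings__DefaultConnection"
config_key="${var_name//__/:}"   # bash replace-all parameter expansion
echo "$config_key"               # ConnectionStrings:DefaultConnection
```

This is why Docker Compose and Kubernetes manifests set `ConnectionStrings__DefaultConnection` while C# code reads `ConnectionStrings:DefaultConnection`.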

Summary

  • Base Variables: Common variables required across all environments (runtime, logging, OTel, health checks).
  • Dev Environment: Local development with localhost services, 100% sampling, relaxed security, all features enabled.
  • Test Environment: Azure-hosted services, Key Vault references, 50% sampling, production-like configuration without immutability.
  • Staging Environment: Full production simulation, Key Vault secrets, 25% sampling, WORM storage enabled, multi-region support.
  • Production Environment: Full compliance controls, 10% sampling, WORM storage with 7-year retention, multi-region failover, circuit breakers, strict rate limiting.
  • Kubernetes Deployment: ConfigMap for non-sensitive config, Key Vault CSI for secrets, workload identity integration.
  • Docker Compose: Local development with service containers (SQL, Redis, RabbitMQ, OTel, Seq).
  • Validation: C# startup validation ensures all required variables are present before application starts.
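The per-environment trace-sampling ratios summarized above can be wired into startup scripts; a minimal sketch, assuming the environment name follows `ASPNETCORE_ENVIRONMENT` conventions (Preview and Hotfix tiers are omitted here and would need their own entries):

```shell
# Map the runtime environment to the trace-sampling ratio summarized above.
# Unknown names fall back to full sampling, the safe dev-time default.
env_name="${ASPNETCORE_ENVIRONMENT:-Development}"
case "$env_name" in
  Development) ratio="1.0"  ;;
  Test)        ratio="0.5"  ;;
  Staging)     ratio="0.25" ;;
  Production)  ratio="0.1"  ;;
  *)           ratio="1.0"  ;;
esac
export OpenTelemetry__SamplingRatio="$ratio"
echo "$env_name: sampling ratio $OpenTelemetry__SamplingRatio"
```

A deployment wrapper can source this before launching the service so the exported variable flows into the OpenTelemetry configuration shown earlier.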

Appendix B — IaC Overlay Rendering Examples

This appendix demonstrates practical IaC deployment workflows using Pulumi (C#) and Bicep for rendering environment-specific infrastructure overlays.

Purpose

  • Illustrate the complete workflow for deploying infrastructure per environment
  • Provide command-line examples for Pulumi stack management
  • Demonstrate environment-specific configuration rendering
  • Show validation and preview workflows before deployment

Pulumi Workflow (C#)

Initial Setup (One-Time)

Step 1: Install Pulumi CLI

# Install Pulumi CLI (macOS/Linux)
curl -fsSL https://get.pulumi.com | sh

# Or via Homebrew (macOS)
brew install pulumi/tap/pulumi

# Or via Chocolatey (Windows)
choco install pulumi

# Verify installation
pulumi version
# Example output: v3.100.0

Step 2: Login to Pulumi Backend

# Azure Blob Storage backend (recommended for team use)
pulumi login azblob://atp-pulumi-state?storage_account=atppulumistate

# Or use Pulumi Cloud (SaaS)
pulumi login

# Or use local file system (dev only)
pulumi login --local   # stores state under ~/.pulumi

Step 3: Initialize Pulumi Project

cd infrastructure/

# Create new Pulumi project (if not exists)
pulumi new azure-csharp --name atp-infrastructure --description "ATP Infrastructure as Code"

# Restore .NET dependencies
dotnet restore

Deploy Dev Environment

Step 1: Create/Select Dev Stack

#!/bin/bash
# deploy-dev.sh

set -e  # Exit on error

echo "Deploying ATP Dev environment..."

# Select (or create) dev stack
pulumi stack select dev --create

# Configure dev-specific settings
pulumi config set azure-native:location eastus
pulumi config set environment dev
pulumi config set region eus
pulumi config set sku B1  # Basic tier for dev
pulumi config set autoscale false
pulumi config set instances 1
pulumi config set enableMultiRegion false
pulumi config set enablePrivateEndpoints false
pulumi config set costCenter Engineering
pulumi config set owner platform-team@connectsoft.example

# Set secrets (encrypted)
pulumi config set --secret sqlAdminPassword "DevP@ssw0rd123!"
pulumi config set --secret cosmosDbKey "dev-cosmos-key-placeholder"

echo "✅ Dev stack configured"

Step 2: Preview Changes

# Preview infrastructure changes (dry-run)
pulumi preview --diff

# Expected output:
# Previewing update (dev)
#
#      Type                                        Name                        Plan       
#  +   pulumi:pulumi:Stack                         atp-infrastructure-dev      create     
#  +   ├─ azure-native:resources:ResourceGroup     atp-dev-eus-rg             create     
#  +   ├─ azure-native:web:AppServicePlan          atp-plan-dev-eus           create     
#  +   ├─ azure-native:web:WebApp                  atp-ingestion-dev-eus      create     
#  +   ├─ azure-native:sql:Server                  atp-sql-dev-eus            create     
#  +   ├─ azure-native:sql:Database                atp-sql-db-dev-eus         create     
#  +   ├─ azure-native:cache:Redis                 atp-redis-dev-eus          create     
#  +   ├─ azure-native:storage:StorageAccount      atpstoragedeveus           create     
#  +   └─ azure-native:keyvault:Vault              atp-keyvault-dev-eus       create     
#
# Resources:
#     + 9 to create
#
# ✅ Preview complete (no actual changes made)

Step 3: Deploy Infrastructure

# Deploy infrastructure (with auto-approval)
pulumi up --yes --skip-preview

# Or interactive (requires manual approval)
pulumi up

# Expected output:
# Updating (dev)
#
#      Type                                        Name                        Status      
#  +   pulumi:pulumi:Stack                         atp-infrastructure-dev      created     
#  +   ├─ azure-native:resources:ResourceGroup     atp-dev-eus-rg             created     
#  +   ├─ azure-native:web:AppServicePlan          atp-plan-dev-eus           created (45s)     
#  +   ├─ azure-native:web:WebApp                  atp-ingestion-dev-eus      created (1m30s)     
#  +   ├─ azure-native:sql:Server                  atp-sql-dev-eus            created (2m15s)     
#  +   ├─ azure-native:sql:Database                atp-sql-db-dev-eus         created (3m0s)     
#  +   ├─ azure-native:cache:Redis                 atp-redis-dev-eus          created (10m0s)     
#  +   ├─ azure-native:storage:StorageAccount      atpstoragedeveus           created (1m0s)     
#  +   └─ azure-native:keyvault:Vault              atp-keyvault-dev-eus       created (30s)     
#
# Outputs:
#     appServiceUrl: "https://atp-ingestion-dev-eus.azurewebsites.net"
#     sqlServerFqdn: "atp-sql-dev-eus.database.windows.net"
#     redisHostName: "atp-redis-dev-eus.redis.cache.windows.net"
#     keyVaultUri: "https://atp-keyvault-dev-eus.vault.azure.net/"
#
# Resources:
#     + 9 created
#
# Duration: 12m30s
#
# ✅ Dev environment deployed successfully

Step 4: Export Stack Outputs

# Export stack outputs as JSON
pulumi stack output --json > dev-outputs.json

# Or export specific output
SQL_SERVER=$(pulumi stack output sqlServerFqdn)
echo "SQL Server: $SQL_SERVER"

# Or export as environment variables (for CI/CD)
pulumi stack output --json | jq -r 'to_entries | .[] | "export \(.key)=\(.value)"' > dev-env.sh
source dev-env.sh

Deploy Production Environment

Step 1: Create/Select Prod Stack

#!/bin/bash
# deploy-prod.sh

set -e

echo "Deploying ATP Production environment..."

# Select (or create) prod stack
pulumi stack select prod --create

# Configure production-specific settings
pulumi config set azure-native:location eastus
pulumi config set environment prod
pulumi config set region eus
pulumi config set sku P1v3  # Premium tier for production
pulumi config set autoscale true
pulumi config set minInstances 3
pulumi config set maxInstances 10
pulumi config set enableMultiRegion true
pulumi config set enablePrivateEndpoints true
pulumi config set enableZoneRedundancy true
pulumi config set enableGeoReplication true
pulumi config set costCenter Production
pulumi config set owner platform-team@connectsoft.example

# Set secrets (from Key Vault or secure input)
read -sp "Enter SQL Admin Password: " SQL_PASSWORD
pulumi config set --secret sqlAdminPassword "$SQL_PASSWORD"

# Or fetch from existing Key Vault
COSMOS_KEY=$(az keyvault secret show --vault-name atp-keyvault-bootstrap --name CosmosDbKey --query value -o tsv)
pulumi config set --secret cosmosDbKey "$COSMOS_KEY"

echo "✅ Production stack configured"

Step 2: Preview Changes (with Cost Estimation)

# Preview with detailed diff
pulumi preview --diff --show-config --show-replacement-steps

# Export preview as JSON for approval workflow
pulumi preview --json > prod-preview.json

# Note: cost estimation is not built into the Pulumi CLI. The per-resource
# costs annotated below come from Pulumi Cloud insights or third-party
# tooling and are illustrative only.

# Illustrative preview output (cost annotations added manually):
# Previewing update (prod)
#
#      Type                                        Name                         Plan       Info
#  +   pulumi:pulumi:Stack                         atp-infrastructure-prod      create     
#  +   ├─ azure-native:resources:ResourceGroup     atp-prod-eus-rg             create     
#  +   ├─ azure-native:web:AppServicePlan          atp-plan-prod-eus           create     ~$300/mo
#  +   ├─ azure-native:web:WebApp                  atp-ingestion-prod-eus      create     
#  +   ├─ azure-native:sql:Server                  atp-sql-prod-eus            create     
#  +   ├─ azure-native:sql:Database                atp-sql-db-prod-eus         create     ~$1,200/mo (P2 tier)
#  +   ├─ azure-native:cache:Redis                 atp-redis-prod-eus          create     ~$600/mo (P1 Premium)
#  +   ├─ azure-native:storage:StorageAccount      atpstorageprodeus           create     ~$100/mo (WORM enabled)
#  +   ├─ azure-native:keyvault:Vault              atp-keyvault-prod-eus       create     ~$10/mo
#  +   ├─ azure-native:network:VirtualNetwork      atp-vnet-prod-eus           create     
#  +   ├─ azure-native:network:PrivateEndpoint     sql-private-endpoint        create     ~$15/mo
#  +   └─ azure-native:frontdoor:FrontDoor         atp-frontdoor-prod          create     ~$50/mo
#
# Estimated Monthly Cost: ~$2,275
#
# Resources:
#     + 12 to create
#
# ⚠️  This will create production infrastructure. Review carefully before proceeding.

Step 3: Deploy with Approval Gate

# Production deployment (manual approval required)
pulumi up

# You will be prompted:
# Do you want to perform this update? [yes/no]: yes

# Or use --yes for automation (only in CI/CD with approvals)
pulumi up --yes

# Expected output:
# Updating (prod)
#
#      Type                                        Name                         Status      
#  +   pulumi:pulumi:Stack                         atp-infrastructure-prod      created     
#  +   ├─ azure-native:resources:ResourceGroup     atp-prod-eus-rg             created     
#  +   ├─ azure-native:web:AppServicePlan          atp-plan-prod-eus           created (1m0s)     
#  +   ├─ azure-native:web:WebApp                  atp-ingestion-prod-eus      created (2m0s)     
#  +   ├─ azure-native:sql:Server                  atp-sql-prod-eus            created (3m0s)     
#  +   ├─ azure-native:sql:Database                atp-sql-db-prod-eus         created (5m0s)     
#  +   ├─ azure-native:cache:Redis                 atp-redis-prod-eus          created (15m0s)     
#  +   ├─ azure-native:storage:StorageAccount      atpstorageprodeus           created (2m0s)     
#  +   ├─ azure-native:keyvault:Vault              atp-keyvault-prod-eus       created (1m0s)     
#  +   ├─ azure-native:network:VirtualNetwork      atp-vnet-prod-eus           created (1m30s)     
#  +   ├─ azure-native:network:PrivateEndpoint     sql-private-endpoint        created (2m30s)     
#  +   └─ azure-native:frontdoor:FrontDoor         atp-frontdoor-prod          created (5m0s)     
#
# Outputs:
#     appServiceUrl: "https://atp-ingestion-prod-eus.azurewebsites.net"
#     sqlServerFqdn: "atp-sql-prod-eus.database.windows.net"
#     redisHostName: "atp-redis-prod-eus.redis.cache.windows.net"
#     keyVaultUri: "https://atp-keyvault-prod-eus.vault.azure.net/"
#     frontDoorEndpoint: "https://atp.connectsoft.com"
#
# Resources:
#     + 12 created
#
# Duration: 25m15s
#
# ✅ Production environment deployed successfully

Update Existing Stack

Scenario: Upgrade Redis from Basic to Premium

# Select prod stack
pulumi stack select prod

# Update configuration
pulumi config set redisSku Premium
pulumi config set redisCapacity P1

# Preview changes
pulumi preview --diff

# Expected output:
# Previewing update (prod)
#
#      Type                              Name                  Plan       Info
#  ~   azure-native:cache:Redis          atp-redis-prod-eus    update     [diff: ~sku]
#
# Resources:
#     ~ 1 to update
#     11 unchanged
#
# ⚠️  Check the diff carefully: depending on the SKU change, Redis may be
#     updated in place or replaced; either path can briefly interrupt clients

# Deploy update
pulumi up --yes

# ✅ Redis upgraded to Premium tier

Destroy Environment

Safely destroy non-production environment:

# Select dev stack
pulumi stack select dev

# Preview what will be destroyed without deleting anything
pulumi destroy --preview-only

# Destroy infrastructure (with confirmation)
pulumi destroy

# You will be prompted:
# Do you want to perform this destroy? [yes/no]: yes

# Expected output:
# Destroying (dev)
#
#      Type                                        Name                        Status      
#  -   pulumi:pulumi:Stack                         atp-infrastructure-dev      deleted     
#  -   ├─ azure-native:keyvault:Vault              atp-keyvault-dev-eus       deleted (30s)     
#  -   ├─ azure-native:storage:StorageAccount      atpstoragedeveus           deleted (45s)     
#  -   ├─ azure-native:cache:Redis                 atp-redis-dev-eus          deleted (5m0s)     
#  -   ├─ azure-native:sql:Database                atp-sql-db-dev-eus         deleted (1m0s)     
#  -   ├─ azure-native:sql:Server                  atp-sql-dev-eus            deleted (30s)     
#  -   ├─ azure-native:web:WebApp                  atp-ingestion-dev-eus      deleted (30s)     
#  -   ├─ azure-native:web:AppServicePlan          atp-plan-dev-eus           deleted (15s)     
#  -   └─ azure-native:resources:ResourceGroup     atp-dev-eus-rg             deleted (2m0s)     
#
# Resources:
#     - 9 deleted
#
# Duration: 8m45s
#
# ✅ Dev environment destroyed

# Remove stack (optional)
pulumi stack rm dev --yes
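Since `pulumi destroy` is equally happy to delete production, destroy wrapper scripts are worth guarding. A hedged sketch — `safe_destroy` is a hypothetical helper, not part of the Pulumi CLI, and the stack names follow this document's dev/test/staging/prod convention:

```shell
# Illustrative guard for destroy wrapper scripts: refuse protected stacks.
safe_destroy() {
  local stack="$1"
  case "$stack" in
    prod|staging)
      echo "Refusing to destroy protected stack: $stack" >&2
      return 1
      ;;
  esac
  echo "destroy permitted for stack: $stack"
  # pulumi stack select "$stack" && pulumi destroy   # real invocation
}

safe_destroy dev           # permitted
safe_destroy prod || true  # blocked; helper returns non-zero
```

Marking individual resources as protected in the Pulumi program itself provides a second, stack-level line of defense.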

Bicep Workflow (Alternative)

Deploy Dev Environment with Bicep:

#!/bin/bash
# deploy-dev-bicep.sh

set -e

echo "Deploying ATP Dev environment with Bicep..."

# Variables
ENVIRONMENT=dev
REGION=eastus
RESOURCE_GROUP=atp-dev-eus-rg

# Create resource group
az group create \
  --name $RESOURCE_GROUP \
  --location $REGION \
  --tags Environment=$ENVIRONMENT CostCenter=Engineering

# Validate Bicep template
az deployment group validate \
  --resource-group $RESOURCE_GROUP \
  --template-file main.bicep \
  --parameters @parameters.dev.json

# Preview changes (What-If)
az deployment group what-if \
  --resource-group $RESOURCE_GROUP \
  --template-file main.bicep \
  --parameters @parameters.dev.json

# Deploy infrastructure. Capture the deployment name once: calling date
# twice would generate two different timestamps, and the later
# `az deployment group show` would look up a nonexistent deployment.
DEPLOYMENT_NAME="atp-dev-deployment-$(date +%Y%m%d-%H%M%S)"

az deployment group create \
  --name "$DEPLOYMENT_NAME" \
  --resource-group $RESOURCE_GROUP \
  --template-file main.bicep \
  --parameters @parameters.dev.json \
  --verbose

# Export outputs
az deployment group show \
  --name "$DEPLOYMENT_NAME" \
  --resource-group $RESOURCE_GROUP \
  --query properties.outputs \
  --output json > dev-outputs.json

echo "✅ Dev environment deployed with Bicep"

Bicep Parameters File (parameters.dev.json):

{
  "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentParameters.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "environment": {
      "value": "dev"
    },
    "region": {
      "value": "eus"
    },
    "appServicePlanSku": {
      "value": "B1"
    },
    "sqlDatabaseTier": {
      "value": "Basic"
    },
    "redisSku": {
      "value": "Basic"
    },
    "enableAutoscale": {
      "value": false
    },
    "enablePrivateEndpoints": {
      "value": false
    },
    "sqlAdminUsername": {
      "value": "sqladmin"
    },
    "sqlAdminPassword": {
      "reference": {
        "keyVault": {
          "id": "/subscriptions/{subscription-id}/resourceGroups/atp-bootstrap-rg/providers/Microsoft.KeyVault/vaults/atp-keyvault-bootstrap"
        },
        "secretName": "SqlAdminPassword-Dev"
      }
    }
  }
}
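Parameter files like the one above can be generated by release tooling rather than edited by hand. A minimal sketch that emits the Key Vault reference fragment (the vault resource ID keeps the document's `{subscription-id}` placeholder; the helper itself is illustrative):

```shell
# Emit the Key Vault secret-reference fragment used in parameters.dev.json.
vault_id="/subscriptions/{subscription-id}/resourceGroups/atp-bootstrap-rg/providers/Microsoft.KeyVault/vaults/atp-keyvault-bootstrap"
secret_name="SqlAdminPassword-Dev"

printf '{"reference":{"keyVault":{"id":"%s"},"secretName":"%s"}}\n' \
  "$vault_id" "$secret_name"
```

Note that resolving such references at deployment time requires the vault to permit template deployment access (the `enabledForTemplateDeployment` vault property).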

CI/CD Integration (Azure Pipelines)

Automated Pulumi Deployment:

# pulumi-deploy-pipeline.yml
trigger:
  branches:
    include:
      - main
  paths:
    include:
      - infrastructure/**

pool:
  vmImage: 'ubuntu-latest'

variables:
  - group: Pulumi-Secrets  # Azure DevOps variable group

stages:
- stage: Deploy_Dev
  displayName: 'Deploy to Dev'
  jobs:
  - job: Pulumi_Up_Dev
    steps:
    - task: UseDotNet@2
      inputs:
        version: '8.x'

    - script: |
        curl -fsSL https://get.pulumi.com | sh
        # `export PATH` does not survive into later script steps; use the
        # Azure Pipelines logging command to prepend the path for the job
        echo "##vso[task.prependpath]$HOME/.pulumi/bin"
        $HOME/.pulumi/bin/pulumi version
      displayName: 'Install Pulumi CLI'

    - script: |
        cd infrastructure/
        pulumi login azblob://atp-pulumi-state?storage_account=atppulumistate
        pulumi stack select dev --create
        pulumi config set azure-native:location eastus
        pulumi config set environment dev
        pulumi up --yes --skip-preview
      displayName: 'Deploy Dev Infrastructure'
      env:
        PULUMI_ACCESS_TOKEN: $(PulumiAccessToken)
        AZURE_STORAGE_ACCOUNT: atppulumistate
        AZURE_STORAGE_KEY: $(AzureStorageKey)

- stage: Deploy_Prod
  displayName: 'Deploy to Production'
  dependsOn: Deploy_Dev
  condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
  jobs:
  - deployment: Pulumi_Up_Prod
    environment: ATP-Production  # Requires manual approval
    strategy:
      runOnce:
        deploy:
          steps:
          - script: |
              # Deployment jobs run on a fresh agent, so install the CLI here too
              curl -fsSL https://get.pulumi.com | sh
              export PATH=$PATH:$HOME/.pulumi/bin
              cd infrastructure/
              pulumi login "azblob://atp-pulumi-state?storage_account=atppulumistate"
              pulumi stack select prod
              pulumi up --yes
            displayName: 'Deploy Production Infrastructure'
            env:
              PULUMI_ACCESS_TOKEN: $(PulumiAccessToken)
              AZURE_STORAGE_ACCOUNT: atppulumistate
              AZURE_STORAGE_KEY: $(AzureStorageKey)

Summary

  • Pulumi Workflow: Stack creation, configuration, preview (dry-run), deployment, output export, updates, and destruction.
  • Dev Deployment: Basic SKU (B1), single instance, no autoscale, no private endpoints, <5 minutes deployment.
  • Production Deployment: Premium SKU (P1v3), 3-10 instances, autoscale, private endpoints, multi-region, ~25 minutes deployment.
  • Update Workflow: Configuration changes, preview diffs, in-place updates with minimal downtime.
  • Bicep Alternative: Azure CLI deployment with What-If preview, parameter files per environment, Key Vault parameter references.
  • CI/CD Integration: Azure Pipelines with automated dev deployment, manual approval for production, Pulumi state in Azure Blob.
  • Cost Estimation: Preview shows estimated monthly cost per resource (~$2,275/month for production).
  • Safety: Preview before deploy, What-If analysis, manual approval gates for production, destroy confirmation.

Appendix C — Cross-Reference Map

This appendix provides comprehensive cross-references to related ATP documentation, enabling seamless navigation across architecture, operations, compliance, and implementation domains.

Purpose

  • Link environment-specific topics to their primary documentation sources
  • Provide context for where specific concerns are addressed in depth
  • Facilitate cross-functional collaboration by mapping responsibilities
  • Ensure consistency across documentation domains

Cross-Reference Table

| Topic | Primary Document | Section | Notes |
|---|---|---|---|
| CI/CD & Pipelines | | | |
| Azure Pipelines architecture | azure-pipelines.md | Pipeline Architecture Overview | Build/deploy automation per environment, template usage |
| Quality gates & thresholds | azure-pipelines.md | Quality Gates & Policies | Code coverage, security scans, test pass rates per environment |
| Deployment strategies | azure-pipelines.md | Multi-Environment Deployment | Rolling, blue-green, canary deployments per environment |
| Pipeline observability | azure-pipelines.md | Pipeline Observability & Metrics | Build duration, success rate, DORA metrics |
| Architecture & Design | | | |
| Azure deployment topology | ../architecture/deployment-views.md | Deployment Views | Azure topology, regions, failure domains, resource distribution |
| High-level architecture | ../architecture/hld.md | High-Level Design | System components, service interactions, data flows |
| Microservice patterns | ../architecture/microservice-architecture.md | Microservice Architecture | Service boundaries, communication patterns, resilience |
| Data architecture | ../architecture/data-architecture.md | Data Architecture | Database design, CQRS, event sourcing, data residency |
| Platform & Infrastructure | | | |
| Security controls | ../platform/security-compliance.md | Security Compliance | Environment-specific security controls, zero-trust, encryption |
| Compliance frameworks | ../platform/security-compliance.md | Compliance Attestation | SOC 2, GDPR, HIPAA evidence collection per environment |
| Data residency | ../platform/data-residency-retention.md | Data Residency & Retention | Geographic data storage, retention policies, GDPR compliance |
| Tenant isolation | ../platform/multitenancy-tenancy.md | Multi-Tenancy | Tenant isolation strategies, data separation per environment |
| Networking & VNets | ../platform/networking.md | Networking | VNet topology, NSG rules, private endpoints, Azure Firewall |
| Operations | | | |
| Observability strategy | ../operations/observability.md | Observability | Telemetry levels, log aggregation, distributed tracing per environment |
| Monitoring & alerts | ../operations/monitoring-alerts.md | Monitoring & Alerts | Alert thresholds, escalation policies, incident response per environment |
| Backup & restore | ../operations/backups-restore-ediscovery.md | Backups & Restore | DR procedures, RPO/RTO targets, restore testing per environment |
| Incident management | ../operations/runbook.md | Runbooks | Incident response procedures, escalation paths, communication templates |
| Hardening & Security | | | |
| Zero-trust architecture | ../hardening/zero-trust.md | Zero Trust | Network microsegmentation, workload identity, least privilege per environment |
| Key rotation | ../hardening/key-rotation.md | Key Rotation | Automated secret rotation, Key Vault integration, rotation cadence per environment |
| Tamper evidence | ../hardening/tamper-evidence.md | Tamper Evidence | Hash chains, digital signatures, immutability enforcement per environment |
| Chaos engineering | ../hardening/chaos-drills.md | Chaos Drills | Failure injection, resilience testing, DR drill procedures per environment |
| Domain & Contracts | | | |
| REST API specifications | ../domain/contracts/rest-apis.md | REST APIs | API versioning, breaking change management, environment-specific endpoints |
| Message schemas | ../domain/contracts/message-schemas.md | Message Schemas | Event formats, schema evolution, backward compatibility per environment |
| Webhook integration | ../domain/contracts/webhooks.md | Webhooks | Webhook endpoints, retry policies, signature validation per environment |
| Idempotency patterns | ../domain/contracts/shared/idempotency.md | Idempotency | Idempotency keys, duplicate detection, consistency guarantees |
| Infrastructure | | | |
| Pulumi IaC | ../infrastructure/pulumi.md | Pulumi | C# infrastructure code, stack management, environment overlays |
| Database migrations | ../infrastructure/database-migrations.md | Database Migrations | Schema versioning, migration strategies, rollback procedures per environment |
| Container orchestration | ../infrastructure/kubernetes.md | Kubernetes | AKS configuration, Helm charts, namespace isolation per environment |
| Testing & Quality | | | |
| Test strategy | ../testing/strategy.md | Test Strategy | Test pyramid, coverage targets, test automation per environment |
| Load testing | ../testing/load-testing.md | Load Testing | Performance benchmarks, stress testing, capacity planning per environment |
| Security testing | ../testing/security-testing.md | Security Testing | SAST, DAST, penetration testing, vulnerability management per environment |

Environment-Specific Cross-References

Dev Environment

| Concern | Primary Document | Key Details |
|---|---|---|
| Local development setup | ../guides/development-setup.md | Docker Compose, service containers, dev tooling |
| Debugging | ../development/debugging.md | Local debugging, remote debugging, log analysis |
| Feature flags | Feature Flags & Runtime Configuration | All features enabled, experimental flags on |

Test Environment

| Concern | Primary Document | Key Details |
|---|---|---|
| Integration testing | ../testing/integration-testing.md | Service-to-service tests, contract validation |
| Test data management | Data Management Per Environment | Synthetic data, stable fixtures, 90-day retention |

Staging Environment

| Concern | Primary Document | Key Details |
|---|---|---|
| Pre-production validation | ../testing/staging-validation.md | Load tests, chaos tests, full regression suite |
| Blue-green deployments | azure-pipelines.md | Slot swaps, validation gates, rollback procedures |

Production Environment

| Concern | Primary Document | Key Details |
|---|---|---|
| Canary deployments | azure-pipelines.md | 10%→25%→50%→100% rollout, automated metrics validation |
| Incident response | ../operations/runbook.md | On-call procedures, escalation paths, post-mortems |
| Compliance auditing | ../platform/security-compliance.md | SOC 2, GDPR, HIPAA audit evidence collection |

Responsibility Matrix (RACI)

Environment Management Ownership:

| Activity | Platform Team | Security Team | SRE Team | Development Team |
|---|---|---|---|---|
| Environment provisioning | R, A | C | C | I |
| Configuration management | R, A | C | C | I |
| Secrets management | C | R, A | C | I |
| Cost optimization | R, A | I | C | I |
| Deployment approvals (Staging) | A | C | R | C |
| Deployment approvals (Production) | A | C | R | I |
| DR testing | C | I | R, A | I |
| Compliance audits | C | R, A | C | I |
| Incident response | C | C | R, A | C |

Legend: R = Responsible (does the work), A = Accountable (final approval), C = Consulted, I = Informed


Related ADRs

| ADR | Title | Environment Impact |
|---|---|---|
| ADR-001 | Multi-Environment Strategy | Defines 6-tier environment topology (Preview, Dev, Test, Staging, Prod, Hotfix) |
| ADR-002 | Pulumi for Infrastructure as Code | C# Pulumi chosen over Bicep/Terraform for type safety and .NET ecosystem alignment |
| ADR-003 | Azure App Configuration for Feature Flags | Centralized feature flag management with environment-specific targeting |
| ADR-004 | Key Vault Per Environment | Separate Key Vaults for isolation and blast radius containment |
| ADR-005 | WORM Storage for Production | Immutable audit logs with 7-year retention for regulatory compliance |
| ADR-006 | Multi-Region Active-Active Topology | Production traffic split 80/20 across East US and West Europe |
| ADR-007 | Automated Canary Deployments | Phased rollout with automated metrics validation and rollback |

Document Hierarchy

docs/
├── architecture/
│   ├── hld.md                        # System overview (referenced for context)
│   ├── deployment-views.md           # Azure topology (referenced for resource naming)
│   └── data-architecture.md          # Data flows (referenced for data management)
├── ci-cd/
│   ├── azure-pipelines.md            # CI/CD automation (referenced for deployment workflows)
│   ├── environments.md               # ← YOU ARE HERE
│   └── quality-gates.md              # Test thresholds (referenced for validation criteria)
├── platform/
│   ├── security-compliance.md        # Security controls (referenced for per-environment policies)
│   ├── data-residency-retention.md   # Data residency (referenced for retention policies)
│   └── multitenancy-tenancy.md       # Tenant isolation (referenced for staging/prod isolation)
├── operations/
│   ├── observability.md              # Telemetry (referenced for logging/tracing levels)
│   ├── backups-restore-ediscovery.md # DR procedures (referenced for RPO/RTO targets)
│   ├── monitoring-alerts.md          # Alerts (referenced for health monitoring)
├── hardening/
│   ├── zero-trust.md                 # Network security (referenced for VNet isolation)
│   ├── key-rotation.md               # Secret rotation (referenced for Key Vault automation)
│   ├── tamper-evidence.md            # Immutability (referenced for WORM storage)
│   └── chaos-drills.md               # Resilience testing (referenced for DR drills)
└── infrastructure/
    ├── pulumi.md                     # IaC (referenced for overlay examples)
    └── database-migrations.md        # Schema changes (referenced for migration workflows)

Quick Reference: Key Metrics by Environment

| Metric | Dev | Test | Staging | Production |
|---|---|---|---|---|
| Uptime SLA | 95% | 98% | 99.5% | 99.9% |
| RPO | 24h | 12h | 1h | 15min |
| RTO | 4h | 2h | 1h | 30min |
| Log Retention | 7 days | 14 days | 30 days | 90 days (hot) + 7 years (cold) |
| Trace Sampling | 100% | 50% | 25% | 10% |
| Monthly Budget | $500 | $1,000 | $3,000 | $10,000 |
| Approval SLA | None | None | 4 hours (1 approver) | 24 hours (2 approvers + CAB) |
| Change Frequency | Multiple/day | 1-2/day | 1-2/week | 1-2/month |

Summary

  • Comprehensive Cross-References: 40+ links to related ATP documentation spanning architecture, operations, compliance, hardening, and domain contracts.
  • Environment-Specific Guidance: Dev (local setup, debugging), Test (integration testing, test data), Staging (pre-production validation, blue-green), Production (canary, incident response, compliance).
  • Responsibility Matrix (RACI): Clear ownership for environment provisioning, configuration, secrets, deployments, DR, compliance, and incident response.
  • Related ADRs: 7 architecture decision records defining multi-environment strategy, Pulumi choice, App Configuration, Key Vault isolation, WORM storage, multi-region, and canary deployments.
  • Document Hierarchy: Visual map showing environments.md position within the broader ATP documentation structure.
  • Quick Reference Metrics: Side-by-side comparison of uptime SLA, RPO/RTO, log retention, trace sampling, budget, approval SLA, and change frequency per environment.