Environments - Audit Trail Platform (ATP)¶
Environment isolation by design — ATP enforces separation across dev, test, staging, and production with graduated controls and approval workflows.
Purpose & Scope¶
This document defines the multi-environment deployment strategy for the ConnectSoft Audit Trail Platform (ATP), establishing how applications progress from development through production with graduated controls, environment isolation, and compliance-aware configuration management at each stage.
ATP operates across six distinct environment tiers — each with specific purposes, data handling requirements, approval workflows, and infrastructure characteristics. This separation ensures safe experimentation in lower environments while maintaining production stability, security, and regulatory compliance in higher tiers.
What this document covers
- Establish ATP's environment topology across all deployment tiers: Preview (ephemeral PR environments), Dev, Test, Staging, Production, and Hotfix with clear boundaries and characteristics.
- Define environment-specific infrastructure using Infrastructure as Code (IaC) overlays: SKU tiers, scaling policies, networking configurations, and regional deployment patterns.
- Specify configuration management hierarchy: appsettings.json layering, Azure App Configuration integration, environment variables, and Key Vault secret references.
- Detail secrets and key management per environment: Key Vault organization, secret categories, rotation policies, and managed identity patterns.
- Describe promotion workflows and approval gates: automated progression (Dev → Test), manual approvals (Staging, Production), Change Advisory Board (CAB) processes, and rollback procedures.
- Outline networking and security boundaries: VNet isolation, private endpoints, NSG rules, egress controls, and environment-specific access policies.
- Document data management strategies per environment: synthetic data generation (Dev/Test), production-like datasets (Staging), live tenant data with compliance controls (Production).
- Specify observability and monitoring configurations: telemetry sampling rates, log retention policies, Application Insights settings, and alerting thresholds.
- Define cost management and optimization: environment budgets, SKU selection rationale, auto-shutdown policies, reserved instances, and cost alerts.
- Detail disaster recovery and high availability: RPO/RTO targets per environment, multi-region topology, failover procedures, and DR drill cadence.
- Outline compliance and audit requirements: environment-specific policy enforcement (encryption, immutability, access reviews, audit logging).
- Describe testing strategies per environment: unit/integration (Dev), smoke/regression (Test), load/chaos (Staging), synthetic monitors/canary (Production).
Out of scope (referenced elsewhere)
- CI/CD pipeline implementation details and template structure (see azure-pipelines.md).
- Quality gate policies, code coverage thresholds, and security scanning rules (see quality-gates.md).
- Detailed security controls, threat model, and compliance framework mappings (see security-compliance.md).
- Data residency rules, retention policies, and legal hold procedures (see data-residency-retention.md).
- Service-specific business logic, domain models, or API contracts (see service repositories and hld.md).
- Operational runbooks for incident response, on-call procedures, and troubleshooting (see runbook.md).
Readers & ownership
- Platform Engineering/DevOps (owners): Environment topology, IaC overlays, promotion workflows, infrastructure provisioning, and cost optimization.
- SRE/Operations: Disaster recovery planning, failover procedures, environment health monitoring, capacity planning, and incident response coordination.
- Security/Compliance: Environment-specific security controls, access policies, secret management, compliance enforcement, and audit evidence collection.
- Service Teams: Environment-specific configuration, feature flag management, deployment validation, and testing strategies.
- QA/Test Engineering: Test environment maintenance, test data management, regression testing, and quality validation.
- Finance/FinOps: Environment cost budgets, resource optimization, cost allocation, and financial forecasting.
Artifacts produced
- Infrastructure as Code (IaC): Pulumi stacks per environment with C# code defining Azure resources, networking, security configurations, and observability integrations.
- Environment Configurations: appsettings.json overlays, Azure App Configuration feature flags, environment-specific variable groups in Azure DevOps.
- Secret Management: Key Vault per environment with organized secret categories, access policies, rotation schedules, and audit logs.
- Deployment Manifests: Environment-specific deployment receipts with version history, configuration snapshots, and rollback points.
- Network Topology Diagrams: VNet/subnet layouts, private endpoint configurations, NSG rule sets, and cross-environment isolation boundaries.
- Observability Configurations: Application Insights instrumentation keys, Log Analytics workspaces, sampling policies, and alert rules per environment.
- Cost Reports: Monthly environment cost breakdowns, budget alerts, optimization recommendations, and resource utilization dashboards.
- DR Plans: Environment-specific disaster recovery procedures, failover runbooks, RPO/RTO validation reports, and drill evidence.
- Compliance Evidence: Environment audit trails, access reviews, encryption verification, immutability proofs, and policy enforcement logs.
Acceptance (done when)
- All six environment tiers (Preview, Dev, Test, Staging, Production, Hotfix) are provisioned with appropriate infrastructure, networking, and security configurations.
- Promotion workflows are operational with automated gates (Dev → Test), manual approvals (Test → Staging → Production), and rollback procedures validated.
- Configuration management hierarchy is established with clear precedence (appsettings → App Configuration → environment variables → Key Vault) and documented for all services.
- Secrets management is operational with Key Vault per environment, managed identity access, automatic rotation for Production, and no plaintext secrets in code or configurations.
- Environment isolation is verified with network segmentation (separate VNets/subscriptions), private endpoints (Staging/Production), and cross-environment access denied by default.
- Observability is configured with appropriate telemetry levels (100% sampling Dev, 10% Production), log retention policies, dashboards, and alerts per environment.
- Cost management is active with environment budgets, cost alerts at 80% threshold, auto-shutdown policies (Dev/Test), and monthly cost reviews scheduled.
- Disaster recovery procedures are documented and tested with DR drills per environment (quarterly Staging/Production), failover automation validated, and RPO/RTO targets met.
- Compliance controls are enforced appropriately per environment (relaxed Dev/Test, production-like Staging, full compliance Production) with audit evidence collection operational.
- Testing strategies are implemented per environment with appropriate test suites, automation, and validation criteria defined and operational.
- Documentation complete with comprehensive examples, runbooks, troubleshooting guides, and cross-references to related documents.
Environment Topology Overview¶
ATP's environment strategy follows the graduated control principle: lower environments prioritize developer velocity and rapid iteration, while higher environments enforce stability, security, and compliance. Each tier serves a distinct purpose in the software delivery lifecycle, with increasing levels of control, approval requirements, and production-like characteristics.
This topology balances innovation speed (developers can experiment freely in Dev/Preview) with risk management (Production changes require multiple approvals and validation). Environment progression acts as a quality funnel — defects are caught early, performance is validated under load, and compliance controls are verified before reaching live tenants.
Environment Tiers¶
ATP operates six environment tiers, each with specific characteristics, deployment triggers, and control requirements:
Standard Environment Tiers¶
| Environment | Purpose | Data | Change Frequency | Approval | Uptime SLA |
|---|---|---|---|---|---|
| Preview | Per-PR ephemeral | Synthetic/mock | Continuous (per commit) | None | Best-effort |
| Dev | Integration playground | Synthetic + sample | Multiple per day | None | 95% |
| Test (QA) | System verification | Stable test datasets | 1-2 per day | None | 98% |
| Staging | Pre-production validation | Production-like | 1-2 per week | 1 approver | 99.5% |
| Production | Live tenant traffic | Real tenant data | 1-2 per month | 2 approvers + CAB | 99.9% |
| Hotfix | Emergency patches | Production clone | As needed | 2 approvers (expedited) | 99.9% |
Environment Characteristics¶
Isolation:
- Network: Separate VNets (Staging/Production) or shared VNet with subnet isolation (Dev/Test).
- Subscriptions: Dedicated Azure subscriptions for Production; shared subscriptions for lower environments with resource group separation.
- Access: No cross-environment access; developers cannot access Staging/Production data or secrets.
- Blast Radius: Failures in Dev/Test do not impact Production; environment failures are contained.
Tenancy:
- Dev/Test: Shared synthetic tenants (tenant-001, tenant-002, etc.); non-production data only.
- Staging: Production-like synthetic tenants with realistic data volumes and obfuscated PII.
- Production: Isolated per real tenant with strict tenant boundaries and compliance controls.
Immutability:
- Dev/Test: Disabled; data can be modified/deleted for testing scenarios.
- Staging: Enabled; mimics production WORM storage and tamper-evidence.
- Production: Fully enforced with WORM storage, hash-chained segments, legal holds, and audit trails.
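The hash-chained segments mentioned above can be illustrated with a minimal shell sketch. This is not the production algorithm (the real segment format, field layout, and digest scheme belong to the integrity service); it assumes `sha256sum` is on the PATH and simply shows why editing any record invalidates every later link in the chain.

```bash
#!/usr/bin/env bash
# Sketch: hash-chain a list of audit records so that modifying any record
# changes every subsequent link (tamper evidence). Illustrative only.
chain_hash() {
  local prev="genesis" record
  for record in "$@"; do
    # Each link covers the previous link's hash plus the record payload.
    prev=$(printf '%s|%s' "$prev" "$record" | sha256sum | cut -d' ' -f1)
  done
  printf '%s\n' "$prev"
}

original=$(chain_hash "evt-1:login" "evt-2:export" "evt-3:delete")
tampered=$(chain_hash "evt-1:login" "evt-2:EXPORT" "evt-3:delete")
# A mid-chain modification yields a different final hash.
[ "$original" != "$tampered" ] && echo "tamper detected"
```

Because each link depends on the previous one, verifying only the final hash is enough to detect an edit anywhere in the segment.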
Observability:
- Dev/Test: Verbose logging (Debug level), 100% trace sampling, local Seq containers, 7-14 day retention.
- Staging: Production logging levels (Warning), 25% sampling, Azure Log Analytics, 30-day retention.
- Production: Optimized logging (Warning/Error), 10% adaptive sampling, 90-day hot + 7-year archive.
Environment Progression Model¶
Deployment Flow:
flowchart LR
DEV[Development] -->|Auto| TEST[Test/QA]
TEST -->|Manual Approval| STAGE[Staging]
STAGE -->|CAB + 2 Approvals| PROD[Production]
PR[Feature Branch] -.->|Ephemeral| PREVIEW[Preview Environment]
HOTFIX[Hotfix Branch] -.->|Expedited| PROD
style DEV fill:#90EE90
style TEST fill:#FFD700
style STAGE fill:#FFA500
style PROD fill:#FF6347
style PREVIEW fill:#87CEEB
style HOTFIX fill:#FF69B4
Promotion Gates:
- Dev → Test: Automated (CI pipeline success, smoke tests pass, no critical bugs).
- Test → Staging: Manual approval (Lead Engineer), regression tests green, no P1/P2 bugs, performance benchmarks met.
- Staging → Production: Manual approval (2 approvers: Architect + SRE), CAB approval, change window scheduled, no active incidents, load tests pass.
- Hotfix → Production: Expedited approval (2 approvers), incident ticket linked, limited scope, mandatory post-deployment validation.
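The approval requirements above can be encoded as a simple lookup that a pipeline script might consult before promoting. The function names and return convention here are illustrative sketches, not part of any ATP tooling.

```bash
#!/usr/bin/env bash
# Sketch: minimum human approvals per promotion, matching the gates above
# (0 means fully automated).
required_approvals() {
  case "$1" in
    dev-to-test)     echo 0 ;;  # automated: CI success + smoke tests
    test-to-staging) echo 1 ;;  # Lead Engineer
    staging-to-prod) echo 2 ;;  # Architect + SRE (plus CAB)
    hotfix-to-prod)  echo 2 ;;  # expedited, incident ticket linked
    *) echo "unknown promotion: $1" >&2; return 1 ;;
  esac
}

can_promote() {
  local needed
  needed=$(required_approvals "$1") || return 1
  [ "$2" -ge "$needed" ]
}

can_promote staging-to-prod 2 && echo "promotion allowed"
```

Note that the real gates also include non-approval criteria (regression results, open incidents, change windows); this sketch covers only the approval count.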
Rollback Strategy:
- Dev/Test: Redeploy previous version (5 minutes).
- Staging: Blue-green slot swap (2 minutes).
- Production: Canary rollback or blue-green swap (1-3 minutes); automatic rollback on health check failures or error rate thresholds.
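A blue-green slot-swap rollback can be sketched as a dry-run script that prints the Azure CLI call it would make. `az webapp deployment slot swap` is the standard App Service command; the resource group and app names below are illustrative.

```bash
#!/usr/bin/env bash
# Sketch: print (dry-run) the slot-swap command used for a blue-green
# rollback. Resource names are illustrative, not real ATP resources.
rollback_swap() {
  local app="$1" rg="$2"
  # Swapping the last-known-good slot back into 'production' is what keeps
  # this rollback in the ~2 minute range instead of a full redeploy.
  echo "az webapp deployment slot swap" \
       "--resource-group $rg --name $app" \
       "--slot blue --target-slot production"
}

rollback_swap atp-ingestion-staging-eus atp-rg-staging-eus
```

In a real runbook the echoed command would be executed (and followed by health-check validation); printing it first makes the script safe to rehearse.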
Environment Lifecycle¶
Preview Environments (Ephemeral):
- Creation: Automatically provisioned when PR is opened against master/main branch.
- Lifespan: Exists while PR is open; auto-deleted 4 hours after PR merge/close.
- Purpose: Validate feature changes in isolation before merging to main branch.
- Infrastructure: Lightweight Azure Container Instances (ACI) or AKS namespaces; minimal dependencies.
- Cost Optimization: Pay-per-second billing; serverless SQL (auto-pause); shared infrastructure where possible.
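The 4-hour auto-delete can be sketched as the TTL check a scheduled cleanup job might run. The timestamp inputs are epoch seconds to keep the sketch portable; the function name and threshold variable are assumptions for illustration.

```bash
#!/usr/bin/env bash
# Sketch: decide whether a preview environment is past its TTL.
TTL_SECONDS=$((4 * 3600))  # 4 hours after PR merge/close

expired() {
  local closed_at="$1" now="$2"   # both epoch seconds
  [ $((now - closed_at)) -ge "$TTL_SECONDS" ]
}

now=$(date +%s)
expired "$((now - 5 * 3600))" "$now" && echo "destroy: past TTL"
expired "$((now - 1 * 3600))" "$now" || echo "keep: within TTL"
```

A cleanup job would pair this check with `pulumi destroy` against the matching `atp-preview-pr-{PR-ID}` stack.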
Persistent Environments (Dev, Test, Staging, Production, Hotfix):
- Provisioning: Created once via IaC; updated through infrastructure pipelines.
- Maintenance: Regular updates to match production topology; quarterly infrastructure refreshes.
- Decommissioning: Only with approval; data exported and archived before deletion.
Environment Naming & Tagging¶
Resource Naming Pattern:
atp-{service}-{env}-{region}
Examples:
- atp-ingestion-dev-eus
- atp-query-prod-weu
- atp-gateway-staging-apse
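The naming pattern can be sketched as a small helper plus a shape check; the region short codes follow the examples above, and the validation here is deliberately loose (a real policy would also whitelist environments and regions).

```bash
#!/usr/bin/env bash
# Sketch: build and shape-check ATP resource names (atp-{service}-{env}-{region}).
atp_name() {
  printf 'atp-%s-%s-%s\n' "$1" "$2" "$3"
}

valid_atp_name() {
  # Minimal shape check against the documented pattern.
  case "$1" in
    atp-*-*-*) echo "valid" ;;
    *)         echo "invalid" ;;
  esac
}

atp_name ingestion dev eus         # atp-ingestion-dev-eus
valid_atp_name atp-query-prod-weu  # prints: valid
```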
Required Tags (All Resources):
| Tag | Example Value | Purpose |
|---|---|---|
| Environment | dev, test, staging, prod | Environment identification |
| Service | ingestion, query, gateway | Service ownership |
| CostCenter | ATP-Platform, ATP-Services | Cost allocation |
| Owner | platform-team@connectsoft.example | Responsible team |
| Compliance | gdpr, hipaa, soc2 | Compliance scope |
| DataClassification | public, internal, restricted, secret | Data sensitivity |
| ManagedBy | pulumi, bicep, manual | IaC tool |
| BackupRequired | true, false | Backup policy |
| DR-Tier | critical, important, standard | DR priority |
Tag Enforcement:
- Azure Policy: Enforces required tags on resource creation; denies resources without tags.
- Cost Management: Tags enable cost breakdowns by environment, service, and team.
- Compliance: Tags identify resources requiring specific controls (encryption, audit, backup).
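As a pre-flight complement to the Azure Policy deny rule, the required-tag check can be sketched in pure shell. The real enforcement happens in Azure Policy at resource creation; this illustrative helper (names are assumptions) just reports which required keys a proposed tag set is missing.

```bash
#!/usr/bin/env bash
# Sketch: report required tag keys missing from a resource's tag set.
REQUIRED_TAGS="Environment Service CostCenter Owner Compliance DataClassification ManagedBy BackupRequired DR-Tier"

missing_tags() {
  # $1: space-separated key=value pairs, e.g. "Environment=dev Service=query"
  local present="$1" key
  for key in $REQUIRED_TAGS; do
    case " $present " in
      *" $key="*) ;;                # tag present
      *) printf '%s\n' "$key" ;;    # tag missing
    esac
  done
}

missing_tags "Environment=dev Service=query Owner=platform-team@connectsoft.example"
```

Running such a check in CI before `pulumi up` surfaces tag gaps earlier than a policy denial at deployment time.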
Environment Health & Status¶
Health Indicators (Per Environment):
environmentHealth:
dev:
status: healthy | degraded | down
deployments:
lastSuccessful: 2025-10-30T08:15:00Z
lastFailed: 2025-10-29T14:22:00Z
successRate24h: 94.2%
services:
online: 7/7
degraded: 0
alerts:
active: 2
severity: [warning, warning]
production:
status: healthy
deployments:
lastSuccessful: 2025-10-28T02:30:00Z
successRate24h: 100%
services:
online: 7/7
degraded: 0
sla:
current: 99.97%
target: 99.9%
alerts:
active: 0
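A monitor consuming the health snapshot above needs a floating-point comparison of measured availability against the SLA target; a minimal sketch (function name is illustrative) uses awk for the arithmetic.

```bash
#!/usr/bin/env bash
# Sketch: compare measured availability against the SLA target.
sla_status() {
  awk -v c="$1" -v t="$2" 'BEGIN {
    if (c + 0 >= t + 0) print "meeting SLA"; else print "SLA breach"
  }'
}

sla_status 99.97 99.9  # production snapshot above; prints: meeting SLA
sla_status 99.2  99.5  # prints: SLA breach
```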
Dashboard Integration:
- Azure DevOps: Environment health widgets on team dashboards.
- Application Insights: Environment-specific workbooks with health trends.
- Status Page: Public status page showing Production health (no sensitive details).
Detailed Environment Specifications¶
Each environment tier has specific infrastructure configurations, data management policies, deployment patterns, and operational characteristics tailored to its purpose in the software delivery lifecycle. This section provides comprehensive specifications for all six environments.
Preview Environment (Ephemeral)¶
Purpose: Provide isolated, full-stack environments for pull request validation before code merges to the main branch. Preview environments enable developers to test features with realistic dependencies without impacting shared Dev/Test environments.
Lifecycle:
- Creation: Automatically triggered when a PR is opened against the master/main branch.
- Provisioning Time: 5-10 minutes (full ATP stack with 7 services).
- Lifespan: Active while PR is open; auto-deleted 4 hours after PR merge/close (prevents resource leaks).
- Naming: atp-preview-pr-{PR-ID}-{region} (e.g., atp-preview-pr-1234-eus).
Infrastructure:
# Preview Environment (Lightweight, Ephemeral)
compute:
type: Azure Container Instances (ACI)
services:
- atp-gateway-preview-pr-{ID}
- atp-ingestion-preview-pr-{ID}
- atp-query-preview-pr-{ID}
instances: 1 per service
sku: 1 vCPU, 1.5 GB RAM
storage:
sql:
type: Azure SQL Database (Serverless)
tier: General Purpose
vCores: 2
autoPauseDelay: 60 minutes # Auto-pause when inactive
redis:
type: Azure Cache for Redis
sku: Basic C0 (250 MB)
serviceBus:
type: Azure Service Bus
tier: Basic
blobStorage:
type: Azure Blob Storage
tier: Hot
retention: 7 days
networking:
vnet: Shared Preview VNet (10.10.0.0/16)
subnet: Dynamic allocation per PR (10.10.{PR-ID}.0/24)
privateEndpoints: false # Public endpoints for cost savings
nsg: Allow HTTP/HTTPS from CI/CD agents
observability:
logging:
level: Debug
sink: Ephemeral Seq container
tracing:
sampling: 100%
metrics:
enabled: true
retention: PR lifespan only
costOptimization:
budget: $10 per PR per day
autoShutdown: 4 hours after last activity
cleanup: Aggressive (delete on PR close)
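The dynamic subnet allocation above (10.10.{PR-ID}.0/24) can be sketched as a derivation function. Note a /16 only holds 256 /24 subnets, so PR IDs above 255 must be folded into the third octet somehow; the modulo below is one simple, collision-prone choice shown purely as an illustration, not the actual allocator.

```bash
#!/usr/bin/env bash
# Sketch: derive a per-PR /24 inside the shared preview /16 (10.10.0.0/16).
preview_subnet() {
  local pr_id="$1"
  # Fold PR IDs > 255 into the available octet range (assumption: modulo).
  printf '10.10.%d.0/24\n' "$((pr_id % 256))"
}

preview_subnet 42    # 10.10.42.0/24
preview_subnet 1234  # 10.10.210.0/24
```

A production allocator would instead track leases to avoid two open PRs colliding on the same /24.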
Data Management:
- Data Source: Minimal synthetic data seeded at creation (10 sample audit events per tenant).
- Tenants: 2 synthetic tenants (preview-tenant-1, preview-tenant-2).
- Immutability: Disabled (not needed for short-lived validation).
- Backup: None (ephemeral data; recreate from seed scripts if needed).
Deployment:
# azure-pipelines.yml - PR trigger
pr:
branches:
include: [master, main]
stages:
- stage: Provision_Preview
jobs:
- job: Create_Ephemeral_Stack
steps:
- script: |
# Create Pulumi stack for this PR
pulumi stack select atp-preview-pr-$(System.PullRequest.PullRequestId) --create
pulumi config set environment preview
pulumi config set prId $(System.PullRequest.PullRequestId)
pulumi up --yes
# Capture service URLs
echo "##vso[task.setvariable variable=GatewayUrl;isOutput=true]$(pulumi stack output GatewayUrl)"
name: provision  # step must be named so its output variable can be referenced cross-stage
displayName: 'Provision Preview Environment'
- stage: Test_Preview
dependsOn: Provision_Preview
jobs:
- job: Integration_Tests
variables:
GatewayUrl: $[ stageDependencies.Provision_Preview.Create_Ephemeral_Stack.outputs['provision.GatewayUrl'] ]
steps:
- script: |
dotnet test tests/Integration.Tests.csproj \
--environment:GatewayUrl=$(GatewayUrl) \
--filter Category=PullRequest
displayName: 'Run Integration Tests Against Preview'
- stage: Cleanup_Preview
condition: always() # Always cleanup, even on failure
jobs:
- job: Destroy_Ephemeral_Stack
steps:
- script: |
pulumi stack select atp-preview-pr-$(System.PullRequest.PullRequestId)
pulumi destroy --yes
pulumi stack rm --yes
displayName: 'Destroy Preview Environment'
Use Cases:
- Feature Validation: Test new features against full ATP stack before merge.
- Integration Testing: Validate service-to-service communication with realistic dependencies.
- Breaking Change Detection: Ensure API contract compatibility across services.
- Performance Baseline: Quick performance validation (not comprehensive load testing).
Limitations:
- No Production Data: Only synthetic data; cannot validate against real tenant scenarios.
- Limited Scale: Single instance per service; not suitable for load testing.
- Short-Lived: 4-hour maximum; not for long-running testing.
- Cost Constraints: Minimal infrastructure; simplified topology.
Dev Environment (Integration Playground)¶
Purpose: Primary environment for continuous integration and developer experimentation. Dev environment receives deployments on every commit to main branch, enabling rapid feedback loops and integration testing.
Lifecycle:
- Always-On: Persistent environment; never deleted.
- Deployment Frequency: Multiple times per day (every main branch commit).
- Uptime Target: 95% (tolerates brief downtime for infrastructure updates).
- Maintenance Window: Weeknights 10 PM - 6 AM local time.
Infrastructure:
# Dev Environment (Cost-Optimized, Shared)
compute:
type: Azure App Service (Linux)
services:
- atp-gateway-dev-eus
- atp-ingestion-dev-eus
- atp-query-dev-eus
- atp-integrity-dev-eus
- atp-export-dev-eus
- atp-policy-dev-eus
- atp-search-dev-eus
sku: B1 (Basic)
instances: 1 per service
autoscale: false
storage:
sql:
name: atp-sql-dev-eus
tier: Basic
dtu: 5
maxSizeGB: 2
geoReplication: false
backupRetention: 7 days
cosmos:
name: atp-cosmos-dev-eus
tier: Standard
throughput: 400 RU/s (manual)
consistency: Session
redis:
name: atp-redis-dev-eus
sku: Basic C0
capacity: 250 MB
persistence: false
serviceBus:
name: atp-servicebus-dev-eus
tier: Basic
messaging: Standard queues/topics
blobStorage:
name: atpstoragedeveus
tier: Hot
replication: LRS (Locally Redundant)
retention: 30 days
networking:
vnet: Shared ATP VNet (10.0.0.0/16)
subnet: Dev Subnet (10.0.1.0/24)
nsg:
- Allow HTTPS from developer IPs
- Allow RDP/SSH from VPN (jumpbox access)
- Allow all within subnet (service-to-service)
publicEndpoints:
enabled: true
ipWhitelist:
- Developer IPs
- CI/CD agents
- VPN gateway range
privateEndpoints: false # Cost optimization
identity:
managedIdentity: System-assigned per App Service
keyVault: atp-keyvault-dev-eus
rbac:
- Developers: Contributor on resource group
- Service Principals: Reader on secrets
observability:
appInsights: atp-appinsights-dev-eus
logAnalytics: atp-loganalytics-dev-eus
logging:
level: Debug
sinks: [Console, AppInsights, Seq]
tracing:
sampling: 100%
exportInterval: 5 seconds
metrics:
all: true
customDimensions: [tenantId, service, operation]
retention:
logs: 7 days (hot)
traces: 7 days
metrics: 30 days
costManagement:
monthlyBudget: $500
autoShutdown:
enabled: true
schedule: "Weeknights 8 PM - 6 AM, Weekends"
timezone: Eastern
alerts:
- threshold: 80% of budget
- anomaly: >50% daily spike
Configuration (appsettings.Development.json):
{
"Logging": {
"LogLevel": {
"Default": "Debug",
"Microsoft": "Information",
"Microsoft.EntityFrameworkCore": "Information",
"System": "Information"
},
"Console": {
"IncludeScopes": true
}
},
"ConnectionStrings": {
"DefaultConnection": "Server=atp-sql-dev-eus.database.windows.net;Database=ATP_Dev;Authentication=Active Directory Managed Identity;",
"Redis": "atp-redis-dev-eus.redis.cache.windows.net:6380,ssl=True,abortConnect=False",
"ServiceBus": "Endpoint=sb://atp-servicebus-dev-eus.servicebus.windows.net/;Authentication=Managed Identity",
"CosmosDb": "AccountEndpoint=https://atp-cosmos-dev-eus.documents.azure.com:443/;DefaultKeyResolution=ManagedIdentity"
},
"Audit": {
"EnableImmutability": false,
"EnableTamperEvidence": false,
"EnableHashChaining": false,
"RetentionDays": 30,
"WormStorage": false,
"SegmentSize": 1000 // Smaller segments for faster testing
},
"Compliance": {
"StrictInDevelopment": true,
"EnableLoggingRedaction": true, // Practice redaction even in dev
"SimulateComplianceChecks": true,
"Profile": "development"
},
"OpenTelemetry": {
"ServiceName": "atp-{service}-dev",
"ExporterEndpoint": "http://otel-collector-dev:4317",
"SamplingRatio": 1.0, // 100% sampling
"ExportIntervalSeconds": 5
},
"FeatureManagement": {
"TamperEvidenceV2": true,
"AdvancedQueryFilters": true,
"AIAssistedAnomalyDetection": true, // Enable all features in dev
"ExperimentalFeatures": true
},
"RateLimiting": {
"Enabled": false, // No rate limits in dev
"PermitLimit": 0,
"Window": 0
}
}
Data Management:
// Dev Data Seeding (Example: C# Seeder)
public class DevDataSeeder
{
private readonly IAuditDbContext _context;
public async Task SeedDevEnvironmentAsync()
{
// Create 10 synthetic tenants
var tenants = Enumerable.Range(1, 10)
.Select(i => new Tenant
{
TenantId = $"dev-tenant-{i:D3}",
Name = $"Development Tenant {i}",
Edition = i <= 3 ? "Standard" : (i <= 7 ? "Business" : "Enterprise"),
Region = i % 2 == 0 ? "US" : "EU",
CreatedAt = DateTime.UtcNow.AddDays(-30)
}).ToList(); // materialize once; the sequence is enumerated again below
await _context.Tenants.AddRangeAsync(tenants);
// Seed 1000 audit events per tenant
foreach (var tenant in tenants)
{
var events = Enumerable.Range(1, 1000)
.Select(i => new AuditEvent
{
EventId = $"evt-{tenant.TenantId}-{i:D6}",
TenantId = tenant.TenantId,
Timestamp = DateTime.UtcNow.AddHours(-i),
Actor = $"user-{i % 10}@{tenant.Name}",
Action = GetRandomAction(),
Resource = $"resource-{i % 50}",
Outcome = i % 20 == 0 ? "Denied" : "Allowed",
Metadata = GenerateSyntheticMetadata()
}).ToList(); // materialize before AddRangeAsync
await _context.AuditEvents.AddRangeAsync(events);
}
await _context.SaveChangesAsync();
}
}
Bash Seeding Script:
#!/bin/bash
# seed-dev-environment.sh
echo "Seeding Dev Environment with synthetic data..."
# Seed database
dotnet run --project tools/DataSeeder \
--environment Development \
--tenants 10 \
--events-per-tenant 1000 \
--start-date "2025-09-01" \
--end-date "2025-10-30"
# Seed Redis cache
redis-cli -h atp-redis-dev-eus.redis.cache.windows.net -p 6380 -a $(az keyvault secret show --name RedisPassword --vault-name atp-keyvault-dev-eus --query value -o tsv) --tls <<EOF
SET session:dev-tenant-001 "{\\"userId\\":\\"dev-user-1\\",\\"expiresAt\\":\\"2025-10-31T00:00:00Z\\"}"
SET session:dev-tenant-002 "{\\"userId\\":\\"dev-user-2\\",\\"expiresAt\\":\\"2025-10-31T00:00:00Z\\"}"
EOF
echo "✅ Dev environment seeded successfully"
Access Control:
- Developers: Full access (Contributor role on resource group).
- CI/CD: Managed identity with deploy permissions.
- External Access: IP-whitelisted (developer office IPs, VPN range).
- Key Vault: Developers can read secrets (for local debugging).
Use Cases:
- Continuous Integration: Every main branch commit deploys automatically; developers see changes within 10 minutes.
- Integration Testing: Service-to-service communication validated with real dependencies (Redis, SQL, RabbitMQ).
- Local Development Parity: Developers can test against Dev environment to debug integration issues.
- Experimental Features: New features enabled by default; breaking changes tested before Test environment.
- SDK Testing: Client SDK developers test against Dev APIs with realistic responses.
Deployment Pattern:
# Deployment: Automated on every main commit
trigger:
branches:
include: [master, main]
stages:
- stage: CI_Stage
# ... build, test, security scans
- stage: Deploy_Dev
dependsOn: CI_Stage
condition: succeeded()
jobs:
- deployment: DeployToDev
environment: ATP-Dev # No approval required
strategy:
runOnce:
deploy:
steps:
- template: deploy/deploy-microservice-to-azure-web-site.yaml@templates
parameters:
azureSubscription: $(azureSubscription)
appName: atp-ingestion-dev-eus
package: $(Pipeline.Workspace)/drop/*.zip
appSettings: |
-ASPNETCORE_ENVIRONMENT "Development"
-Audit__EnableImmutability "false"
Monitoring & Alerts:
- Health Checks: Every 5 minutes; alert on 3 consecutive failures.
- Error Rate: Alert if >10% (relaxed threshold).
- Deployment Failures: Notify #dev-team Slack channel.
- Cost Alerts: Notify if approaching $500/month budget.
Test Environment (System Verification)¶
Purpose: Dedicated environment for automated regression testing, QA validation, and integration verification with stable datasets. Test environment acts as the quality gate before promoting to Staging.
Lifecycle:
- Always-On: Persistent environment with controlled refresh cycles.
- Deployment Frequency: 1-2 times per day (after Dev soak period).
- Uptime Target: 98% (planned downtime for test data refresh).
- Maintenance Window: Nightly 2 AM - 4 AM local time.
Infrastructure:
# Test Environment (QA-Grade, Stable)
compute:
type: Azure App Service (Linux)
services: All 7 ATP services
sku: S1 (Standard)
instances: 2 per service (for blue-green testing)
autoscale: false
alwaysOn: true
storage:
sql:
name: atp-sql-test-eus
tier: Standard S1
dtu: 20
maxSizeGB: 10
geoReplication: false
backupRetention: 14 days
pointInTimeRestore: true
cosmos:
name: atp-cosmos-test-eus
tier: Standard
throughput: 1000 RU/s (manual)
consistency: BoundedStaleness
redis:
name: atp-redis-test-eus
sku: Standard C1
capacity: 1 GB
persistence: RDB (snapshots every 15 min)
serviceBus:
name: atp-servicebus-test-eus
tier: Standard
messaging: Standard + Topics
blobStorage:
name: atpstoragetesteus
tier: Hot
replication: GRS (Geo-Redundant for DR testing)
retention: 60 days
elasticsearch:
name: atp-search-test-eus
tier: Basic
nodes: 2
storage: 100 GB
networking:
vnet: Shared ATP VNet (10.0.0.0/16)
subnet: Test Subnet (10.0.2.0/24)
nsg:
- Allow HTTPS from CI/CD agents
- Allow test automation IPs
- Deny all other inbound
publicEndpoints:
enabled: true
ipWhitelist:
- CI/CD agent pool IPs
- QA team IPs
privateEndpoints: false
identity:
managedIdentity: System-assigned
keyVault: atp-keyvault-test-eus
rbac:
- QA Team: Reader + Test Data Manager
- CI/CD: Contributor (deploy only)
- Developers: Reader (view-only for debugging)
observability:
appInsights: atp-appinsights-test-eus
logAnalytics: atp-loganalytics-test-eus
logging:
level: Information
sinks: [AppInsights, Seq]
structuredLogging: true
tracing:
sampling: 50% # Reduced from Dev
adaptiveSampling: true
metrics:
all: true
aggregationInterval: 60 seconds
retention:
logs: 14 days
traces: 14 days
metrics: 60 days
costManagement:
monthlyBudget: $1,000
autoShutdown:
enabled: true
schedule: "Weekends only"
reservedInstances: false
alerts:
- threshold: 80%
- anomaly: >50% spike
Configuration (appsettings.Test.json):
{
"Logging": {
"LogLevel": {
"Default": "Information",
"Microsoft": "Warning",
"Microsoft.EntityFrameworkCore.Database.Command": "Information"
}
},
"ConnectionStrings": {
"DefaultConnection": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-test-eus.vault.azure.net/secrets/SqlConnectionString)",
"Redis": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-test-eus.vault.azure.net/secrets/RedisConnectionString)",
"ServiceBus": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-test-eus.vault.azure.net/secrets/ServiceBusConnectionString)"
},
"Audit": {
"EnableImmutability": false,
"EnableTamperEvidence": true, // Test tamper evidence logic
"EnableHashChaining": true,
"RetentionDays": 90,
"WormStorage": false,
"SegmentSize": 10000
},
"Compliance": {
"StrictInDevelopment": false,
"EnableLoggingRedaction": true,
"SimulateComplianceChecks": true,
"Profile": "test",
"EnforceGDPR": true, // Test GDPR workflows
"EnforceHIPAA": false
},
"OpenTelemetry": {
"ServiceName": "atp-{service}-test",
"ExporterEndpoint": "http://otel-collector-test:4317",
"SamplingRatio": 0.5, // 50% sampling
"ExportIntervalSeconds": 10
},
"FeatureManagement": {
"TamperEvidenceV2": true,
"AdvancedQueryFilters": true,
"AIAssistedAnomalyDetection": false, // Stable features only
"ExperimentalFeatures": false
},
"RateLimiting": {
"Enabled": true,
"PermitLimit": 1000, // Higher than prod for load testing
"Window": 60
}
}
Data Management:
// Test Data Management (Stable Datasets)
public class TestDataManager
{
public async Task RefreshTestDataAsync()
{
// Backup current data (for rollback)
await BackupAsync("test-data-backup-{timestamp}");
// Clear existing data
await TruncateTablesAsync();
// Load stable test datasets
await LoadFixturesAsync("fixtures/test-tenants.json");
await LoadFixturesAsync("fixtures/test-events-stable.json");
// Verify data integrity
var counts = await VerifyDataCountsAsync();
Assert.Equal(20, counts.Tenants);
Assert.Equal(50000, counts.AuditEvents);
}
// Stable test datasets (version-controlled)
private async Task LoadFixturesAsync(string fixturePath)
{
var json = await File.ReadAllTextAsync(fixturePath);
var fixtures = JsonSerializer.Deserialize<TestFixture>(json);
foreach (var tenant in fixtures.Tenants)
{
await _context.Tenants.AddAsync(tenant);
}
await _context.SaveChangesAsync();
}
}
Test Dataset Characteristics:
- Tenants: 20 synthetic tenants with varied profiles (Standard, Business, Enterprise editions).
- Audit Events: 50,000 stable events (version-controlled in fixtures/test-events-stable.json).
- Time Range: Events spanning last 90 days (predictable date ranges for test assertions).
- Scenarios: Pre-defined test scenarios (compliance violation, high-volume ingestion, tamper detection, export workflows).
Access Control:
- QA Team: Full access (read/write on data, deploy permissions).
- Developers: Read-only (view logs, query data, no deployments).
- CI/CD: Deploy and test execution permissions.
- External Access: IP-whitelisted (CI/CD agents, QA team IPs).
Use Cases:
- Automated Regression Testing: Nightly full regression suite (all tests, all services).
- API Contract Validation: OpenAPI spec generation and breaking change detection.
- Integration Testing: Multi-service workflows (ingestion → integrity verification → query → export).
- Performance Baseline: Response time benchmarks under normal load.
- Data Migration Testing: Test database migration scripts before Staging.
Deployment Pattern:
# Deployment: Automated after Dev soak (24 hours)
stages:
- stage: Deploy_Test
dependsOn: Deploy_Dev
condition: |
and(
succeeded(),
eq(variables['Dev.SoakHours'], '24'),
eq(variables['Build.SourceBranch'], 'refs/heads/main')
)
jobs:
- deployment: DeployToTest
environment: ATP-Test # No approval
strategy:
runOnce:
deploy:
steps:
- template: deploy/deploy-microservice-to-azure-web-site.yaml@templates
parameters:
azureSubscription: $(azureSubscription)
appName: atp-ingestion-test-eus
package: $(Pipeline.Workspace)/drop/*.zip
postRouteTraffic:
steps:
- script: |
# Run smoke tests
dotnet test tests/Smoke.Tests.csproj \
--environment Test \
--logger "trx;LogFileName=smoke-results.trx"
# Run regression suite
dotnet test tests/Regression.Tests.csproj \
--environment Test \
--logger "trx;LogFileName=regression-results.trx" \
--settings test.runsettings
displayName: 'Post-Deployment Test Suite'
Monitoring & Alerts:
- Health Checks: Every 2 minutes; alert on 2 consecutive failures.
- Test Suite Failures: Alert QA team on regression failures.
- Error Rate: Alert if >5% (stricter than Dev).
- Deployment Success Rate: Alert if <95% over 7 days.
Staging Environment (Pre-Production Validation)¶
Purpose: Production-equivalent environment for final validation before Production deployment. Staging mirrors Production infrastructure, security controls, and compliance policies, enabling realistic load testing, chaos engineering, and stakeholder acceptance.
Lifecycle:
- Always-On: Persistent environment; critical for go-live decisions.
- Deployment Frequency: 1-2 times per week (after Test validation).
- Uptime Target: 99.5% (near-production SLA).
- Maintenance Window: Scheduled changes only during approved windows.
Infrastructure:
# Staging Environment (Production-Equivalent)
compute:
type: Azure App Service (Linux) or AKS
services: All 7 ATP services
sku: P1v2 (Premium)
instances: 2 per service (blue-green deployment slots)
autoscale:
enabled: true
min: 2
max: 5
rules:
- metric: CPU Percentage
threshold: 70%
scaleOut: 1
scaleIn: 1
alwaysOn: true
slots:
- production (active)
- blue (deployment staging)
storage:
sql:
name: atp-sql-staging-eus
tier: Premium P2
dtu: 250
maxSizeGB: 100
geoReplication: true
secondaryRegion: West Europe
backupRetention: 35 days
pointInTimeRestore: true
encryption: TDE (Transparent Data Encryption)
cosmos:
name: atp-cosmos-staging-eus
tier: Standard
throughput: 5000 RU/s (autoscale 1000-5000)
consistency: Session
multiRegion:
- eastus (primary)
- westeurope (read replica)
redis:
name: atp-redis-staging-eus
sku: Premium P1
capacity: 6 GB
clustering: true
persistence: AOF (Append-Only File)
geoReplication:
enabled: true
secondary: atp-redis-staging-weu
serviceBus:
name: atp-servicebus-staging-eus
tier: Premium
messagingUnits: 1
geoDisasterRecovery: true
secondaryNamespace: atp-servicebus-staging-weu
blobStorage:
name: atpstoragestgeus
tier: Hot
replication: GZRS (Geo-Zone-Redundant)
retention: 180 days
immutability:
enabled: true
policy: time-based (90 days)
encryption:
type: Microsoft-managed keys (testing BYOK patterns)
networking:
vnet: Dedicated Staging VNet (10.1.0.0/16)
subnets:
- Gateway Subnet: 10.1.1.0/24
- Services Subnet: 10.1.2.0/24
- Data Subnet: 10.1.3.0/24 (private endpoints)
nsg:
- Deny all inbound by default
- Allow HTTPS from API Gateway subnet
- Allow service-to-service within Services subnet
privateEndpoints:
enabled: true
resources:
- SQL Database (10.1.3.4)
- Storage Account (10.1.3.5)
- Key Vault (10.1.3.6)
- Service Bus (10.1.3.7)
applicationGateway:
enabled: true
waf: WAF_v2 (OWASP 3.2)
sslPolicy: AppGwSslPolicy20220101
identity:
managedIdentity: System-assigned + User-assigned
keyVault: atp-keyvault-staging-eus
rbac:
- SRE Team: Reader + Deployment Operator
- Platform Team: Contributor
- Developers: No access (production-like restrictions)
conditionalAccess:
- Require MFA for all human access
- Device compliance required
observability:
appInsights: atp-appinsights-staging-eus
logAnalytics: atp-loganalytics-staging-eus
logging:
level: Warning
sinks: [AppInsights, LogAnalytics]
structuredLogging: true
sensitiveDataMasking: true
tracing:
sampling: 25%
adaptiveSampling: true
dependencies: true
metrics:
all: true
customMetrics: true
aggregationInterval: 60 seconds
retention:
logs: 30 days (hot) + 90 days (archive)
traces: 30 days
metrics: 90 days
alerts:
- Service health degradation
- Error rate >2%
- p95 latency >1000ms
- Failed deployments
costManagement:
monthlyBudget: $3,000
autoShutdown:
enabled: false # Always-on for production parity
reservedInstances:
enabled: true
term: 1 year (App Services)
alerts:
- threshold: 80% of budget
- anomaly: >30% spike
Configuration (appsettings.Staging.json):
{
"Logging": {
"LogLevel": {
"Default": "Warning",
"Microsoft": "Error",
"Microsoft.EntityFrameworkCore": "Warning"
}
},
"ConnectionStrings": {
"DefaultConnection": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-staging-eus.vault.azure.net/secrets/SqlConnectionString)",
"Redis": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-staging-eus.vault.azure.net/secrets/RedisConnectionString)",
"ServiceBus": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-staging-eus.vault.azure.net/secrets/ServiceBusConnectionString)",
"CosmosDb": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-staging-eus.vault.azure.net/secrets/CosmosConnectionString)"
},
"Audit": {
"EnableImmutability": true, // Production-like
"EnableTamperEvidence": true,
"EnableHashChaining": true,
"RetentionDays": 2555, // 7 years (production setting)
"WormStorage": true,
"SegmentSize": 100000,
"SealInterval": "PT15M"
},
"Compliance": {
"StrictInDevelopment": false,
"EnableLoggingRedaction": true,
"SimulateComplianceChecks": false, // Real compliance checks
"Profile": "staging",
"EnforceGDPR": true,
"EnforceHIPAA": true,
"EnforceSOC2": true
},
"OpenTelemetry": {
"ServiceName": "atp-{service}-staging",
"ExporterEndpoint": "https://otel-collector-staging.connectsoft.local:4317",
"SamplingRatio": 0.25, // 25% sampling
"ExportIntervalSeconds": 30
},
"FeatureManagement": {
"TamperEvidenceV2": {
"EnabledFor": [
{ "Name": "Percentage", "Parameters": { "Value": 50 } } // 50% canary
]
},
"AdvancedQueryFilters": true,
"AIAssistedAnomalyDetection": {
"EnabledFor": [
{ "Name": "TargetingFilter", "Parameters": { "Audience": { "Users": ["staging-tenant-001"] } } }
]
},
"ExperimentalFeatures": false
},
"RateLimiting": {
"Enabled": true,
"PermitLimit": 500, // Production-like limits
"Window": 60,
"QueueLimit": 100
},
"Security": {
"RequireHttps": true,
"HstsEnabled": true,
"HstsMaxAge": 31536000,
"ContentSecurityPolicy": "default-src 'self'; script-src 'self' 'unsafe-inline'; style-src 'self' 'unsafe-inline'"
}
}
Data Management:
// Staging Data (Production-Like Synthetic)
public class StagingDataManager
{
public async Task RefreshStagingDataAsync()
{
// Option 1: Anonymized Production Snapshot
await RestoreFromAnonymizedProductionAsync();
// Option 2: Generate Production-Scale Synthetic Data
await GenerateProductionScaleDataAsync();
}
private async Task RestoreFromAnonymizedProductionAsync()
{
// Download anonymized backup from Production
var backupUri = await GetLatestAnonymizedBackupAsync();
// Restore to Staging SQL
await RestoreDatabaseAsync(
sourceUri: backupUri,
targetDatabase: "ATP_Staging",
overwriteExisting: true);
// Verify PII redaction
await VerifyNoPIIAsync();
}
private async Task GenerateProductionScaleDataAsync()
{
// 50 synthetic tenants (mimic real tenant distribution)
var tenants = GenerateSyntheticTenants(count: 50);
// 5 million audit events (realistic volume)
var events = await GenerateRealisticEventsAsync(
tenants: tenants,
totalEvents: 5_000_000,
timeRange: TimeSpan.FromDays(180));
// Insert in batches (efficient bulk insert)
await BulkInsertAsync(tenants, batchSize: 1000);
await BulkInsertAsync(events, batchSize: 10000);
}
}
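The VerifyNoPIIAsync step above is the safety gate for the anonymized-snapshot path. As a language-agnostic illustration (sketched in Python, since the check itself is just pattern scanning), this is the kind of spot-check such a verification might perform; the patterns and helper name are hypothetical, and a real scanner (e.g. Microsoft Presidio) would be far more thorough:

```python
import re

# Hypothetical PII patterns for spot-checking an anonymized snapshot.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def find_pii(text: str) -> list[str]:
    """Return the names of PII patterns found in the given text."""
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(text)]

# An anonymized row should produce no matches; a raw row should be flagged.
assert find_pii("user anon-7f3a logged in from region eus") == []
assert find_pii("contact jane.doe@example.com, SSN 123-45-6789") == ["email", "ssn"]
```

A refresh job would fail fast if any sampled rows produce matches, blocking the restore before the data reaches Staging.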
Access Control:
- SRE Team: Full access (deployment, troubleshooting, configuration).
- Platform Team: Contributor (infrastructure changes, monitoring).
- QA Team: Test execution permissions only.
- Developers: No access (production-like restrictions).
- Stakeholders: Read-only access for acceptance validation.
Use Cases:
- Load Testing: Simulate production traffic (50-80% of expected peak load).
- Chaos Engineering: Inject failures (pod restarts, network latency, database throttling).
- Security Testing: OWASP ZAP scans, penetration testing, vulnerability validation.
- Disaster Recovery Drills: Practice failover to secondary region; validate RPO/RTO.
- Stakeholder Acceptance: Product owners and compliance teams validate features before Production.
- Feature Flag Testing: Validate percentage rollouts and targeting filters before Production.
Deployment Pattern:
# Deployment: Manual approval required
stages:
- stage: Deploy_Staging
dependsOn: Deploy_Test
condition: |
and(
succeeded(),
eq(variables['Build.SourceBranch'], 'refs/heads/main')
)
jobs:
- deployment: DeployToStaging
environment: ATP-Staging # Requires 1 manual approval (Lead Engineer)
strategy:
runOnce:
preDeploy:
steps:
- script: |
# Pre-deployment validation
echo "Verifying Test environment stability..."
# Check Test error rate (last 24 hours)
ERROR_RATE=$(az monitor metrics list \
--resource atp-appinsights-test-eus \
--metric "requests/failed" \
--aggregation avg \
--interval PT24H \
--query "value[0].timeseries[0].data[-1].average")
if (( $(echo "$ERROR_RATE > 0.02" | bc -l) )); then
echo "❌ Test error rate too high: $ERROR_RATE%"
exit 1
fi
echo "✅ Test environment stable; proceeding to Staging"
displayName: 'Pre-Deploy Validation'
deploy:
steps:
- template: deploy/deploy-microservice-to-azure-web-site.yaml@templates
parameters:
azureSubscription: $(azureSubscription)
appName: atp-ingestion-staging-eus
package: $(Pipeline.Workspace)/drop/*.zip
slotName: blue # Deploy to blue slot first
routeTraffic:
steps:
- task: AzureAppServiceManage@0
displayName: 'Swap Blue → Production Slot'
inputs:
azureSubscription: $(azureSubscription)
action: 'Swap Slots'
webAppName: atp-ingestion-staging-eus
sourceSlot: blue
targetSlot: production
postRouteTraffic:
steps:
- script: |
# Post-deployment validation
echo "Running post-deployment checks..."
# Health checks
curl -f https://atp-gateway-staging-eus.azurewebsites.net/health || exit 1
# Smoke tests
dotnet test tests/Smoke.Tests.csproj --environment Staging
# Load test (light validation)
k6 run --vus 50 --duration 5m tests/load/basic-load.js
displayName: 'Post-Deployment Validation'
Monitoring & Alerts:
- Health Checks: Every 1 minute; alert immediately on failure.
- Error Rate: Alert if >1% (production threshold).
- Latency: Alert if p95 >500ms (production SLO).
- Deployment Validation: Alert on slot swap failures or post-deployment test failures.
- Security: WAF blocks, failed auth attempts, ABAC denials.
Production Environment (Live Tenant Traffic)¶
Purpose: Live environment serving real tenant traffic with full compliance enforcement, high availability, disaster recovery, and 24/7 monitoring. Production is the most controlled environment with strict approval workflows and change management.
Lifecycle:
- Mission-Critical: Always-on with multi-region redundancy.
- Deployment Frequency: 1-2 times per month (conservative change cadence).
- Uptime Target: 99.9% (SLA-backed; ~43 minutes downtime/month).
- Maintenance Window: Approved CAB windows only; typically Friday nights.
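The ~43 minutes/month downtime figure follows directly from the 99.9% target. A quick sanity check, assuming a 30-day month:

```python
# Monthly downtime budget implied by a 99.9% uptime SLA (30-day month).
uptime_target = 0.999
minutes_per_month = 30 * 24 * 60  # 43,200 minutes

downtime_budget_minutes = (1 - uptime_target) * minutes_per_month
assert abs(downtime_budget_minutes - 43.2) < 0.01  # about 43 minutes/month
```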
Infrastructure:
# Production Environment (Enterprise-Grade, Multi-Region)
compute:
type: Azure Kubernetes Service (AKS)
cluster:
name: atp-aks-prod-eus
nodeCount: 6 (2 per zone)
vmSize: Standard_D4s_v5 (4 vCPU, 16 GB RAM)
zones: [1, 2, 3] # Zone-redundant
autoscale:
enabled: true
min: 6
max: 20
profile: production
services:
replicas: 3 per service (distributed across zones)
resources:
requests: { cpu: "500m", memory: "1Gi" }
limits: { cpu: "2000m", memory: "4Gi" }
probes:
liveness: /health/live
readiness: /health/ready
startup: /health/startup
hpa: # Horizontal Pod Autoscaler
enabled: true
minReplicas: 3
maxReplicas: 10
targetCPU: 70%
targetMemory: 80%
storage:
sql:
name: atp-sql-prod-eus
tier: Premium P6
dtu: 1000
maxSizeGB: 500
zoneRedundant: true
geoReplication:
enabled: true
secondaryRegion: West Europe
failoverPolicy: Automatic
backupRetention: 35 days
longTermRetention: 7 years
encryption:
type: TDE with CMK (Customer-Managed Key)
keyVault: atp-keyvault-prod-eus
keyName: sql-tde-key
advancedThreatProtection: true
cosmos:
name: atp-cosmos-prod-eus
tier: Standard
throughput: 20000 RU/s (autoscale 5000-20000)
consistency: BoundedStaleness
multiRegion:
- eastus (write)
- westeurope (read)
- southeastasia (read)
automaticFailover: true
multipleWriteLocations: false # Single writer
redis:
name: atp-redis-prod-eus
sku: Premium P3
capacity: 26 GB
clustering: true
shardCount: 3
persistence: AOF + RDB
geoReplication:
enabled: true
secondary: atp-redis-prod-weu
encryption: TLS 1.2+
serviceBus:
name: atp-servicebus-prod-eus
tier: Premium
messagingUnits: 2
geoDisasterRecovery: true
secondaryNamespace: atp-servicebus-prod-weu
zones: [1, 2, 3]
blobStorage:
name: atpstorageprodeus
tier: Hot (recent) + Cool (30-90 days) + Archive (90+ days)
replication: GZRS (Geo-Zone-Redundant)
retention: 7 years
immutability:
enabled: true
policy: WORM (time-based, 7 years)
legalHold: supported
encryption:
type: Customer-managed keys (BYOK)
keyVault: atp-keyvault-prod-eus
keyName: storage-cmk-key
rotation: Automatic (90 days)
advancedThreatProtection: true
softDelete:
enabled: true
retentionDays: 30
networking:
vnet: Dedicated Production VNet (10.2.0.0/16)
subnets:
- AKS Nodes: 10.2.1.0/24
- Application Gateway: 10.2.2.0/27
- Private Endpoints: 10.2.3.0/24
- Azure Firewall: 10.2.4.0/26
nsg:
- Deny all by default (Zero Trust)
- Allow inbound HTTPS (443) to App Gateway only
- Allow AKS → Private Endpoints (SQL, Storage, KV)
privateEndpoints:
enabled: true
dnsIntegration: true
resources:
- SQL Database (10.2.3.4)
- Storage Account (10.2.3.5)
- Key Vault (10.2.3.6)
- Service Bus (10.2.3.7)
- Cosmos DB (10.2.3.8)
- Container Registry (10.2.3.9)
applicationGateway:
name: atp-appgw-prod-eus
tier: WAF_v2
capacity: 2-10 (autoscale)
waf:
enabled: true
mode: Prevention
ruleSet: OWASP 3.2
exclusions: []
sslPolicy: AppGwSslPolicy20220101S
httpListeners:
- HTTPS only
- TLS 1.2+
- Custom domain with managed certificate
azureFirewall:
name: atp-fw-prod-eus
tier: Premium
threatIntel: Alert and deny
dnsProxy: enabled
outboundRules:
- Allow HTTPS to approved FQDNs (NuGet, Docker Hub, Azure services)
- Deny all other egress
identity:
managedIdentity:
- System-assigned (per AKS node pool)
- User-assigned (for Key Vault access)
keyVault: atp-keyvault-prod-eus
rbac:
- Production Operators: Custom role (deploy only, no data access)
- SRE On-Call: Reader + Incident Responder
- Platform Security: Key Vault Administrator
- No Developer Access: Prohibited
conditionalAccess:
- Require MFA
- Require compliant device
- Require trusted location (corporate network or VPN)
- Block legacy authentication
privilegedIdentityManagement:
enabled: true
justInTime: true
maxDuration: 4 hours
approvalRequired: true
observability:
appInsights: atp-appinsights-prod-eus
logAnalytics: atp-loganalytics-prod-eus
logging:
level: Warning
sinks: [AppInsights, LogAnalytics, Seq (centralized)]
structuredLogging: true
sensitiveDataMasking: true
piiRedaction: enforced
tracing:
sampling: 10%
adaptiveSampling: true
intelligentSampling:
enabled: true
prioritize: [errors, slowRequests, dependencies]
metrics:
all: true
customMetrics: true
dimensionality:
- tenantId
- region
- service
- operation
aggregationInterval: 60 seconds
retention:
logs: 90 days (hot) + 7 years (archive to Blob Storage)
traces: 90 days
metrics: 1 year
alerts:
- SLO breaches (error budget)
- Security events (failed auth, ABAC denials)
- Performance degradation (p95, p99)
- Cost anomalies
costManagement:
monthlyBudget: $10,000
autoShutdown:
enabled: false # Never shutdown Production
reservedInstances:
enabled: true
term: 3 years (maximum savings)
resources: [AKS nodes, SQL Database, App Services]
costOptimization:
- Spot instances for non-critical batch jobs
- Storage lifecycle policies (Hot → Cool → Archive)
- Autoscaling based on traffic patterns
alerts:
- threshold: 85% of budget
- anomaly: >20% spike
- forecast: Projected to exceed budget
Configuration (appsettings.Production.json):
{
"Logging": {
"LogLevel": {
"Default": "Warning",
"Microsoft": "Error",
"Microsoft.EntityFrameworkCore": "Error",
"ConnectSoft": "Warning"
},
"ApplicationInsights": {
"LogLevel": {
"Default": "Warning"
}
}
},
"ConnectionStrings": {
"DefaultConnection": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-prod-eus.vault.azure.net/secrets/SqlConnectionString)",
"Redis": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-prod-eus.vault.azure.net/secrets/RedisConnectionString)",
"ServiceBus": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-prod-eus.vault.azure.net/secrets/ServiceBusConnectionString)",
"CosmosDb": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-prod-eus.vault.azure.net/secrets/CosmosConnectionString)"
},
"Audit": {
"EnableImmutability": true,
"EnableTamperEvidence": true,
"EnableHashChaining": true,
"RetentionDays": 2555, // 7 years
"WormStorage": true,
"SegmentSize": 100000,
"SealInterval": "PT15M",
"IntegrityVerification": {
"OnRead": true,
"Scheduled": true,
"ScheduleCron": "0 2 * * *", // Daily at 2 AM
"SampleRate": 0.1
}
},
"Compliance": {
"StrictInDevelopment": false,
"EnableLoggingRedaction": true,
"SimulateComplianceChecks": false,
"Profile": "production",
"EnforceGDPR": true,
"EnforceHIPAA": true,
"EnforceSOC2": true,
"AuditTrail": {
"Enabled": true,
"RetentionYears": 7,
"ImmutableStorage": true
}
},
"OpenTelemetry": {
"ServiceName": "atp-{service}-prod",
"ExporterEndpoint": "https://otel-collector-prod.connectsoft.local:4317",
"SamplingRatio": 0.1, // 10% sampling
"ExportIntervalSeconds": 60,
"AdaptiveSampling": {
"Enabled": true,
"MaxTelemetryItemsPerSecond": 10
}
},
"FeatureManagement": {
"TamperEvidenceV2": true, // Stable features only
"AdvancedQueryFilters": true,
"AIAssistedAnomalyDetection": {
"EnabledFor": [
{ "Name": "Percentage", "Parameters": { "Value": 10 } } // Conservative rollout
]
},
"ExperimentalFeatures": false // Never in Production
},
"RateLimiting": {
"Enabled": true,
"PermitLimit": 100, // Per client per minute
"Window": 60,
"QueueLimit": 50,
"ByTenant": true,
"ByIPAddress": true
},
"Security": {
"RequireHttps": true,
"HstsEnabled": true,
"HstsMaxAge": 31536000,
"HstsIncludeSubdomains": true,
"HstsPreload": true,
"ContentSecurityPolicy": "default-src 'self'; script-src 'self'; style-src 'self'; img-src 'self' data: https:; font-src 'self'; connect-src 'self'; frame-ancestors 'none'",
"XFrameOptions": "DENY",
"XContentTypeOptions": "nosniff",
"ReferrerPolicy": "strict-origin-when-cross-origin"
},
"HighAvailability": {
"MultiRegion": true,
"PrimaryRegion": "eastus",
"SecondaryRegion": "westeurope",
"TrafficDistribution": "80-20", // 80% East US, 20% West Europe
"FailoverMode": "Automatic",
"HealthCheckInterval": 30
}
}
Data Management:
- Data Source: Live tenant audit records with real PII (classified and protected).
- Volume: Millions of audit events per day across all tenants.
- Tenancy: Strict tenant isolation with compliance controls per tenant's residency profile.
- Retention: 7 years default (configurable per tenant; legal holds override).
- Immutability: Full WORM enforcement; hash-chained segments; tamper-evidence with HSM-signed anchors.
- Backup: Daily incremental + weekly full; geo-replicated; 7-year retention.
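The hash-chained segments mentioned above work by folding each sealed segment's hash into the next, so altering any historical event invalidates every later seal. A minimal language-agnostic sketch in Python (field names and the genesis value are illustrative, not ATP's actual schema):

```python
import hashlib
import json

def seal_segment(events: list[dict], prev_hash: str) -> str:
    """Seal a segment by hashing its events together with the previous segment's hash."""
    payload = json.dumps({"prev": prev_hash, "events": events}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def verify_chain(segments: list[list[dict]], hashes: list[str]) -> bool:
    """Recompute every segment hash and compare against the stored chain."""
    prev = "genesis"
    for events, stored in zip(segments, hashes):
        recomputed = seal_segment(events, prev)
        if recomputed != stored:
            return False
        prev = recomputed
    return True

segments = [[{"id": 1, "action": "login"}], [{"id": 2, "action": "export"}]]
hashes = []
prev = "genesis"
for seg in segments:
    prev = seal_segment(seg, prev)
    hashes.append(prev)

assert verify_chain(segments, hashes)
# Tampering with an earlier event invalidates the chain from that point on.
segments[0][0]["action"] = "delete"
assert not verify_chain(segments, hashes)
```

In production the chain anchors are additionally HSM-signed, so an attacker would have to forge both the hash chain and the signatures to hide a modification.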
Access Control (Highly Restricted):
# Production Access Policy (Least Privilege)
roles:
productionOperators:
permissions:
- Deploy (via approved pipelines only)
- View logs (PII-redacted)
- Restart services (break-glass only)
restrictions:
- No data access
- No secret read
- No manual configuration changes
sreOnCall:
permissions:
- Read logs (PII-redacted)
- Execute runbooks
- Trigger failover (with approval)
justInTime: true
maxDuration: 4 hours
approvalRequired: true
platformSecurity:
permissions:
- Key Vault administration
- Security policy updates
- Compliance report generation
mfaRequired: true
auditLogging: comprehensive
developers:
permissions: [] # Zero access to Production
exceptions: None
accessReview:
cadence: Weekly
approver: Security Officer
autoRemove: Inactive for 30 days
Use Cases:
- Live Tenant Traffic: Serve production audit trail ingestion, queries, and exports.
- Compliance Evidence: Generate SOC 2, GDPR, HIPAA compliance artifacts.
- SLA Monitoring: Track and maintain 99.9% uptime commitment.
- Security Monitoring: Real-time threat detection and incident response.
- Performance Optimization: Continuous performance tuning based on real traffic patterns.
Deployment Pattern:
# Deployment: Strict approval + canary rollout
stages:
- stage: Deploy_Production
dependsOn: Deploy_Staging
condition: |
and(
succeeded(),
eq(variables['Build.Reason'], 'Manual'),
eq(variables['Build.SourceBranch'], 'refs/heads/main')
)
jobs:
- deployment: DeployToProduction
environment: ATP-Production # Requires 2 approvals + CAB
strategy:
canary:
increments: [5, 20, 50] # 5% → 20% → 50% → 100%
preDeploy:
steps:
- script: |
# Verify Staging stability (48 hours)
echo "Verifying Staging has been stable for 48 hours..."
STAGING_INCIDENTS=$(az monitor activity-log list \
--resource-group ATP-Staging-RG \
--offset 48h \
--query "[?contains(status.value, 'Failed')] | length(@)")
if [ "$STAGING_INCIDENTS" -gt "0" ]; then
echo "❌ Staging has active incidents; blocking Production deployment"
exit 1
fi
echo "✅ Staging stable; proceeding with canary deployment"
displayName: 'Pre-Deploy Safety Checks'
deploy:
steps:
- task: Kubernetes@1
displayName: 'Deploy Canary ($(strategy.increment)%)'
inputs:
connectionType: 'Azure Resource Manager'
azureSubscription: $(azureSubscription)
azureResourceGroup: 'ATP-Prod-RG'
kubernetesCluster: 'atp-aks-prod-eus'
command: 'apply'
arguments: '-f k8s/canary-$(strategy.increment).yaml'
routeTraffic:
steps:
- script: |
echo "Routing $(strategy.increment)% traffic to new version..."
kubectl apply -f k8s/istio-traffic-split-$(strategy.increment).yaml
displayName: 'Route Traffic to Canary'
postRouteTraffic:
steps:
- script: |
echo "Monitoring canary for 15 minutes..."
sleep 900 # 15 minutes soak
# Query Application Insights metrics
ERROR_RATE=$(az monitor app-insights metrics show \
--app atp-appinsights-prod-eus \
--metric "requests/failed" \
--aggregation avg \
--offset 15m \
--filter "cloud_RoleName eq 'atp-ingestion-canary'" \
--query "value.segments[0].segments[0]['requests/failed'].avg")
if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
echo "❌ Canary error rate exceeded 1%: $ERROR_RATE%"
exit 1 # Triggers automatic rollback
fi
LATENCY_P95=$(az monitor app-insights metrics show \
--app atp-appinsights-prod-eus \
--metric "requests/duration" \
--aggregation percentile95 \
--offset 15m \
--query "value.segments[0].segments[0]['requests/duration'].percentile95")
if (( $(echo "$LATENCY_P95 > 1000" | bc -l) )); then
echo "❌ Canary p95 latency exceeded 1000ms: ${LATENCY_P95}ms"
exit 1
fi
echo "✅ Canary metrics healthy; proceeding to next increment"
displayName: 'Validate Canary Metrics'
on:
failure:
steps:
- script: |
echo "🔴 Canary deployment failed; rolling back..."
# Revert traffic to stable version
kubectl apply -f k8s/istio-traffic-split-stable.yaml
# Notify on-call
curl -X POST https://hooks.slack.com/services/PROD_ONCALL \
-d '{"text":"Production canary rollback triggered for build $(Build.BuildNumber)"}'
# Create incident ticket
az boards work-item create \
--title "Production Canary Rollback - Build $(Build.BuildNumber)" \
--type "Incident" \
--assigned-to "SRE-Team"
displayName: 'Automatic Rollback'
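The istio-traffic-split-*.yaml manifests applied above are weighted Istio VirtualService routes. A plausible shape for the 20% increment (the host, subset, and namespace names are assumptions, not taken from the repo):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: atp-ingestion
  namespace: atp-prod
spec:
  hosts:
    - atp-ingestion
  http:
    - route:
        - destination:
            host: atp-ingestion
            subset: stable
          weight: 80
        - destination:
            host: atp-ingestion
            subset: canary
          weight: 20
```

The istio-traffic-split-stable.yaml used by the rollback step would be the same resource with 100% weight on the stable subset.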
Monitoring & Alerts (24/7 On-Call):
- Health Checks: Every 30 seconds; PagerDuty alert immediately on failure.
- Error Rate: Alert if >0.5% (strict SLO).
- Latency: Alert if p95 >500ms or p99 >1000ms.
- Security Events: Alert on failed auth >10/min, ABAC denials spike, WAF blocks.
- Compliance: Alert on immutability violations, retention policy failures.
- Cost: Alert on unexpected spending (>20% above forecast).
Hotfix Environment (Emergency Patches)¶
Purpose: Fast-track environment for critical production fixes that cannot wait for normal release cycles. The Hotfix environment is a Production clone with expedited approval workflows.
Lifecycle:
- On-Demand: Created when P0/P1 incident requires immediate fix.
- Deployment Frequency: As needed (incident-driven).
- Uptime Target: 99.9% (same as Production).
- Lifespan: Active during incident; decommissioned after successful Production hotfix.
Infrastructure:
# Hotfix Environment (Production Clone)
# Infrastructure mirrors Production exactly
# Created from Production IaC with hotfix overlay
compute:
# Same as Production (AKS with same SKUs)
# Deployed in separate namespace: atp-hotfix
storage:
# Fresh databases (not cloned from Prod for safety)
# Seeded with anonymized Production data if needed
sql:
name: atp-sql-hotfix-eus
tier: Premium P6 (same as Prod)
# Data: Anonymized snapshot from Production
# Other storage: Same tiers as Production
networking:
vnet: Dedicated Hotfix VNet (10.3.0.0/16)
# Network topology mirrors Production
# Separate to prevent any Production impact
Deployment Pattern:
# Deployment: Expedited approval (2 approvers within 2 hours)
stages:
- stage: Deploy_Hotfix_Validation
jobs:
- deployment: DeployToHotfix
environment: ATP-Hotfix # Requires 2 approvals (expedited SLA: 2 hours)
strategy:
runOnce:
deploy:
steps:
- script: |
# Deploy hotfix to Hotfix environment first
kubectl apply -f k8s/hotfix-deployment.yaml
displayName: 'Deploy to Hotfix Environment'
- script: |
# Run targeted tests (hotfix validation only)
dotnet test tests/Hotfix.Validation.csproj \
--filter "Category=Hotfix" \
--environment Hotfix
displayName: 'Validate Hotfix'
- stage: Deploy_Production_Hotfix
dependsOn: Deploy_Hotfix_Validation
condition: succeeded()
jobs:
- deployment: HotfixProduction
environment: ATP-Production # Additional Production approval
strategy:
runOnce:
deploy:
steps:
- script: |
# Apply hotfix to Production with minimal scope
kubectl set image deployment/atp-ingestion-prod \
atp-ingestion=connectsoft.azurecr.io/atp/ingestion:hotfix-$(Build.BuildNumber)
# Monitor rollout
kubectl rollout status deployment/atp-ingestion-prod --timeout=10m
displayName: 'Apply Hotfix to Production'
postRouteTraffic:
steps:
- script: |
# Immediate validation
curl -f https://atp-gateway-prod.connectsoft.com/health || exit 1
# Monitor for 30 minutes
python scripts/monitor-hotfix.py \
--duration 30 \
--error-threshold 0.01 \
--latency-threshold 1000
displayName: 'Post-Hotfix Monitoring'
Hotfix Workflow:
flowchart TD
A[P0/P1 Incident Detected] --> B[Create Hotfix Branch]
B --> C[Develop Fix]
C --> D[Deploy to Hotfix Environment]
D --> E{Validation Pass?}
E -->|No| C
E -->|Yes| F[Request Expedited Approvals]
F --> G[2 Approvers + Incident Commander]
G --> H[Deploy to Production]
H --> I[Monitor 30 Minutes]
I --> J{Metrics Healthy?}
J -->|No| K[Rollback + Escalate]
J -->|Yes| L[Merge to Main + Decommission Hotfix]
Approval Requirements:
- Hotfix Environment: 2 approvers (SRE Lead + Platform Architect) within 2 hours.
- Production Deployment: Same 2 approvers + Incident Commander confirmation.
- Post-Deployment: Mandatory 30-minute monitoring before incident closure.
Environment Comparison Matrix¶
| Characteristic | Preview | Dev | Test | Staging | Production | Hotfix |
|---|---|---|---|---|---|---|
| Compute SKU | ACI (1 vCPU) | B1 Basic | S1 Standard | P1v2 Premium | AKS (Standard_D4s_v5) | AKS (Prod clone) |
| Instances | 1 | 1 | 2 | 2-5 (autoscale) | 3-10 (autoscale) | 3 (fixed) |
| SQL Tier | Serverless | Basic (5 DTU) | Standard S1 | Premium P2 | Premium P6 | Premium P6 |
| Redis SKU | Basic C0 | Basic C0 | Standard C1 | Premium P1 | Premium P3 | Premium P3 |
| Zone Redundancy | No | No | No | No | Yes | Yes |
| Geo-Replication | No | No | No | Yes | Yes | Yes |
| Private Endpoints | No | No | No | Yes | Yes | Yes |
| VNet Isolation | Shared | Shared | Shared | Dedicated | Dedicated | Dedicated |
| Managed Identity | No | System | System | System + User | System + User | System + User |
| Log Retention | PR lifetime | 7 days | 14 days | 30 days | 90d + 7yr | 90d + 7yr |
| Trace Sampling | 100% | 100% | 50% | 25% | 10% | 10% |
| Deployment Approvals | 0 | 0 | 0 | 1 | 2 + CAB | 2 (expedited) |
| Deployment Frequency | Per PR commit | Multiple/day | 1-2/day | 1-2/week | 1-2/month | As needed |
| Cost/Month | $10/PR | $500 | $1,000 | $3,000 | $10,000 | $500 (short-lived) |
| Uptime SLA | Best-effort | 95% | 98% | 99.5% | 99.9% | 99.9% |
| Data Type | Synthetic | Synthetic | Stable fixtures | Prod-like synthetic | Live tenant data | Prod clone |
| Immutability | No | No | No | Yes | Yes (WORM) | Yes (WORM) |
| Security Level | Basic | Standard | Enhanced | Production-like | Maximum | Maximum |
Environment Selection Decision Tree¶
Use this flowchart to determine which environment to use for specific scenarios:
flowchart TD
START[Need to test/deploy?] --> Q1{What are you testing?}
Q1 -->|Feature in isolation| PR[Create Preview Environment]
Q1 -->|Integration changes| DEV[Deploy to Dev]
Q1 -->|Regression validation| TEST[Deploy to Test]
Q1 -->|Load/chaos testing| STAGE[Deploy to Staging]
Q1 -->|Production release| PROD_Q{Is it urgent?}
PROD_Q -->|P0/P1 Incident| HOTFIX[Use Hotfix Path]
PROD_Q -->|Normal release| PROD[Deploy to Production via CAB]
PR --> PR_VALID{Tests pass?}
PR_VALID -->|Yes| MERGE[Merge PR → Dev]
PR_VALID -->|No| FIX[Fix in branch]
DEV --> DEV_STABLE{Stable 24h?}
DEV_STABLE -->|Yes| TEST
DEV_STABLE -->|No| WAIT_DEV[Monitor Dev]
TEST --> TEST_PASS{All tests green?}
TEST_PASS -->|Yes| STAGE
TEST_PASS -->|No| FIX_TEST[Fix issues]
STAGE --> STAGE_APPROVE{1 Approval + Tests?}
STAGE_APPROVE -->|Yes| PROD
STAGE_APPROVE -->|No| WAIT_STAGE[Address feedback]
style PR fill:#87CEEB
style DEV fill:#90EE90
style TEST fill:#FFD700
style STAGE fill:#FFA500
style PROD fill:#FF6347
style HOTFIX fill:#FF69B4
Azure Topology & Resource Naming¶
ATP's Azure resource organization follows a hierarchical naming convention that enables clear resource identification, cost allocation, and automated management across all environments. This section defines the resource group structure, naming patterns, and Azure-specific topology considerations.
Resource Group Structure¶
Each environment deploys to a dedicated resource group containing all ATP services and their dependencies. The resource group acts as the lifecycle boundary — all resources are provisioned, updated, and decommissioned together.
Standard Resource Group Layout:
ConnectSoft-ATP-{Env}-{Region}-RG
├── atp-gateway-{env}-{region} # API Gateway (App Service or AKS pod)
├── atp-ingestion-{env}-{region} # Ingestion Service (App Service or AKS pod)
├── atp-query-{env}-{region} # Query Service (App Service or AKS pod)
├── atp-integrity-{env}-{region} # Integrity Verification (App Service or AKS pod)
├── atp-export-{env}-{region} # Export Service (App Service or AKS pod)
├── atp-policy-{env}-{region} # Policy Engine (App Service or AKS pod)
├── atp-search-{env}-{region} # Search Service (App Service or AKS pod)
├── atp-sql-{env}-{region} # Azure SQL Database (primary audit store)
├── atp-cosmos-{env}-{region} # Cosmos DB or PostgreSQL (metadata store)
├── atp-storage-{env}-{region} # Blob Storage (WORM in prod; immutable audit logs)
├── atp-servicebus-{env}-{region} # Service Bus namespace (async messaging)
├── atp-redis-{env}-{region} # Redis Cache (session state, distributed cache)
├── atp-keyvault-{env}-{region} # Key Vault (secrets, certificates, keys)
├── atp-appinsights-{env}-{region} # Application Insights (telemetry)
└── atp-loganalytics-{env}-{region} # Log Analytics workspace (centralized logs)
Multi-Region Resource Groups:
For Production and Staging (geo-replicated):
# Primary Region (East US)
ConnectSoft-ATP-Prod-EUS-RG
├── atp-gateway-prod-eus
├── atp-ingestion-prod-eus
├── ... (all services)
├── atp-sql-prod-eus (primary write)
├── atp-storage-prod-eus (GZRS replication to WEU)
└── atp-redis-prod-eus (geo-replicated to WEU)
# Secondary Region (West Europe)
ConnectSoft-ATP-Prod-WEU-RG
├── atp-gateway-prod-weu
├── atp-ingestion-prod-weu
├── ... (all services)
├── atp-sql-prod-weu (read replica)
├── atp-storage-prod-weu (geo-replicated secondary)
└── atp-redis-prod-weu (geo-replicated secondary)
Shared Infrastructure Resource Group:
Some resources are shared across environments for cost optimization:
ConnectSoft-ATP-Shared-EUS-RG
├── atp-acr-shared-eus # Azure Container Registry (shared Docker images)
├── atp-vnet-shared-eus # Shared VNet for Dev/Test (10.0.0.0/16)
├── atp-bastion-shared-eus # Azure Bastion (secure RDP/SSH access)
├── atp-devops-agents-eus # Self-hosted Azure DevOps agents
└── atp-monitoring-shared-eus # Shared monitoring infrastructure
Naming Conventions¶
Pattern: `atp-{service}-{env}-{region}`
Components:
- `atp`: Project prefix (Audit Trail Platform).
- `{service}`: Service identifier (e.g., `gateway`, `ingestion`, `query`).
- `{env}`: Environment abbreviation (lowercase).
- `{region}`: Azure region abbreviation (lowercase).
Environment Abbreviations:
| Environment | Abbreviation | Example |
|---|---|---|
| Preview (Ephemeral) | `preview` | `atp-gateway-preview-pr-1234-eus` |
| Development | `dev` | `atp-ingestion-dev-eus` |
| Test/QA | `test` | `atp-query-test-eus` |
| Staging | `staging` | `atp-integrity-staging-eus` |
| Production | `prod` | `atp-export-prod-eus` |
| Hotfix | `hotfix` | `atp-policy-hotfix-eus` |
Region Abbreviations:
| Azure Region | Abbreviation | Example |
|---|---|---|
| East US | `eus` | `atp-sql-prod-eus` |
| West Europe | `weu` | `atp-sql-prod-weu` |
| Southeast Asia | `apse` | `atp-cosmos-prod-apse` |
| Central US | `cus` | `atpstoragestagingcus` |
| North Europe | `neu` | `atp-redis-dev-neu` |
Service Abbreviations:
| Service | Abbreviation | Resource Type | Example |
|---|---|---|---|
| API Gateway | `gateway` | App Service / AKS | `atp-gateway-prod-eus` |
| Ingestion Service | `ingestion` | App Service / AKS | `atp-ingestion-prod-eus` |
| Query Service | `query` | App Service / AKS | `atp-query-prod-eus` |
| Integrity Verification | `integrity` | App Service / AKS | `atp-integrity-prod-eus` |
| Export Service | `export` | App Service / AKS | `atp-export-prod-eus` |
| Policy Engine | `policy` | App Service / AKS | `atp-policy-prod-eus` |
| Search Service | `search` | App Service / AKS | `atp-search-prod-eus` |
| SQL Database | `sql` | Azure SQL | `atp-sql-prod-eus` |
| Cosmos DB | `cosmos` | Cosmos DB | `atp-cosmos-prod-eus` |
| Blob Storage | `storage` | Storage Account | `atpstorageprodeus` (no hyphens)* |
| Service Bus | `servicebus` | Service Bus | `atp-servicebus-prod-eus` |
| Redis Cache | `redis` | Redis Cache | `atp-redis-prod-eus` |
| Key Vault | `keyvault` | Key Vault | `atp-keyvault-prod-eus` |
| Application Insights | `appinsights` | App Insights | `atp-appinsights-prod-eus` |
| Log Analytics | `loganalytics` | Log Analytics | `atp-loganalytics-prod-eus` |
| Container Registry | `acr` | ACR | `atpacrsharedeus` (no hyphens)* |
* Storage Accounts and Container Registries have stricter naming rules (no hyphens, lowercase alphanumeric only, 3-24 characters).
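The no-hyphen exception above can be folded into a small helper. The `build_name` function below is an illustrative sketch (not part of ATP's tooling) that derives a resource name from the standard pattern:

```shell
# Illustrative sketch (not ATP tooling): derive a resource name from the standard
# atp-{service}-{env}-{region} pattern, applying the no-hyphen exception for
# Storage Accounts and Container Registries.
build_name() {
  local service="$1" env="$2" region="$3"
  local name="atp-${service}-${env}-${region}"
  case "$service" in
    storage|acr) name="${name//-/}" ;;   # lowercase alphanumeric only
  esac
  printf '%s\n' "$name"
}

build_name gateway prod eus   # atp-gateway-prod-eus
build_name storage prod eus   # atpstorageprodeus
build_name acr shared eus     # atpacrsharedeus
```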
Complete Naming Examples¶
Dev Environment (East US):
Resource Group: ConnectSoft-ATP-Dev-EUS-RG
├── atp-gateway-dev-eus
├── atp-ingestion-dev-eus
├── atp-query-dev-eus
├── atp-integrity-dev-eus
├── atp-export-dev-eus
├── atp-policy-dev-eus
├── atp-search-dev-eus
├── atp-sql-dev-eus
├── atp-cosmos-dev-eus
├── atpstoragedeveus
├── atp-servicebus-dev-eus
├── atp-redis-dev-eus
├── atp-keyvault-dev-eus
├── atp-appinsights-dev-eus
└── atp-loganalytics-dev-eus
Production Environment (Multi-Region):
# Primary (East US)
Resource Group: ConnectSoft-ATP-Prod-EUS-RG
├── atp-aks-prod-eus # AKS cluster
├── atp-sql-prod-eus # SQL primary
├── atp-cosmos-prod-eus # Cosmos primary
├── atpstorageprodeus # GZRS storage
├── atp-servicebus-prod-eus # Service Bus primary
├── atp-redis-prod-eus # Redis primary
├── atp-keyvault-prod-eus # Key Vault (geo-backed up)
├── atp-appgw-prod-eus # Application Gateway
├── atp-fw-prod-eus # Azure Firewall
├── atp-appinsights-prod-eus
└── atp-loganalytics-prod-eus
# Secondary (West Europe)
Resource Group: ConnectSoft-ATP-Prod-WEU-RG
├── atp-aks-prod-weu # AKS cluster (standby)
├── atp-sql-prod-weu # SQL read replica
├── atp-cosmos-prod-weu # Cosmos read replica
├── atpstorageprodweu # GZRS storage secondary
├── atp-servicebus-prod-weu # Service Bus secondary (GDR)
├── atp-redis-prod-weu # Redis geo-replicated
├── atp-appgw-prod-weu # Application Gateway
└── atp-appinsights-prod-weu
Azure-Specific Resource Constraints¶
Storage Account Naming (Most Restrictive):
- Length: 3-24 characters
- Characters: Lowercase letters and numbers only (no hyphens, underscores, or uppercase)
- Uniqueness: Globally unique across all Azure
- Pattern: `atpstorage{env}{region}` (e.g., `atpstorageprodeus`)
Container Registry Naming:
- Length: 5-50 characters
- Characters: Alphanumeric only (no special characters)
- Uniqueness: Globally unique
- Pattern: `atpacr{env}{region}` (e.g., `atpacrprodeus`)
Key Vault Naming:
- Length: 3-24 characters
- Characters: Alphanumeric and hyphens (must start/end with alphanumeric)
- Uniqueness: Globally unique
- Pattern: `atp-keyvault-{env}-{region}` or `atp-kv-{env}-{region}` (shortened)
App Service Naming:
- Length: 2-60 characters
- Characters: Alphanumeric and hyphens
- Uniqueness: Globally unique (*.azurewebsites.net subdomain)
- Pattern: `atp-{service}-{env}-{region}`
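A pre-flight check against the most restrictive rules can catch a bad name before an ARM or Pulumi deployment fails late. The sketch below (illustrative, not part of ATP's pipelines) validates the Storage Account constraints:

```shell
# Sketch: fail fast on the strictest naming rules (Storage Account: 3-24 chars,
# lowercase letters and digits only) before a deployment rejects the name.
check_storage_name() {
  local name="$1"
  (( ${#name} >= 3 && ${#name} <= 24 )) || { echo "bad length (${#name})"; return 1; }
  [[ "$name" =~ ^[a-z0-9]+$ ]] || { echo "bad charset"; return 1; }
  echo "ok"
}

check_storage_name atpstorageprodeus             # ok (17 chars)
check_storage_name atp-storage-prod-eus || true  # bad charset (hyphens not allowed)
```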
DNS and Domain Strategy¶
App Service URLs (Default):
# Dev
https://atp-gateway-dev-eus.azurewebsites.net
https://atp-ingestion-dev-eus.azurewebsites.net
# Production
https://atp-gateway-prod-eus.azurewebsites.net (behind Application Gateway)
Custom Domains (Production):
# Public API (via Application Gateway)
https://api.atp.connectsoft.com → Application Gateway → AKS Ingress
# Services (internal, private endpoints)
https://ingestion.atp.internal → Private endpoint
https://query.atp.internal → Private endpoint
Private DNS Zones (Production):
privatelink.database.windows.net # Azure SQL private endpoints
privatelink.blob.core.windows.net # Storage private endpoints
privatelink.vaultcore.azure.net # Key Vault private endpoints
privatelink.servicebus.windows.net # Service Bus private endpoints
atp.internal # Custom private zone for ATP services
Infrastructure as Code (IaC) Naming in Pulumi¶
Pulumi Stack Naming:
// Pattern: {organization}/{project}/{environment}
var stackName = $"connectsoft/atp/{environment}";
// Examples:
// connectsoft/atp/dev
// connectsoft/atp/prod-eus
// connectsoft/atp/prod-weu
Pulumi Resource Naming (C# Example):
// Resource Group
var resourceGroup = new ResourceGroup($"atp-{environment}-{region}-rg", new ResourceGroupArgs
{
ResourceGroupName = $"ConnectSoft-ATP-{char.ToUpper(environment[0])}{environment[1..]}-{region.ToUpper()}-RG", // "dev" -> "Dev" to match ConnectSoft-ATP-Dev-EUS-RG
Location = region
});
// App Service
var appService = new WebApp($"atp-ingestion-{environment}-{region}", new WebAppArgs
{
Name = $"atp-ingestion-{environment}-{region}",
ResourceGroupName = resourceGroup.Name,
Location = region,
Tags = new InputMap<string>
{
["Environment"] = environment,
["Service"] = "ingestion",
["ManagedBy"] = "pulumi"
}
});
// Storage Account (handle naming constraints)
var storageAccount = new StorageAccount($"atp-storage-{environment}-{region}", new StorageAccountArgs
{
AccountName = $"atpstorage{environment}{region}".Replace("-", "").ToLower(), // Remove hyphens
ResourceGroupName = resourceGroup.Name,
Location = region,
Sku = new SkuArgs { Name = SkuName.Standard_GRS },
// Declare tags per resource rather than referencing appService.Tags (an Output of another resource)
Tags = new InputMap<string>
{
    ["Environment"] = environment,
    ["Service"] = "storage",
    ["ManagedBy"] = "pulumi"
}
});
Tagging Strategy¶
Required Tags (Enforced via Azure Policy):
{
"Environment": "dev | test | staging | prod | hotfix",
"Service": "gateway | ingestion | query | integrity | export | policy | search",
"Owner": "platform-team@connectsoft.example",
"CostCenter": "ATP-Platform | ATP-Services",
"Compliance": "gdpr | hipaa | soc2",
"DataClassification": "public | internal | restricted | secret",
"ManagedBy": "pulumi | bicep | terraform | manual",
"BackupRequired": "true | false",
"DR-Tier": "critical | important | standard",
"CreatedDate": "2025-10-30T00:00:00Z",
"ExpiryDate": "2026-10-30T00:00:00Z" // For Preview/Hotfix only
}
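Azure Policy is the authoritative enforcement point for these tags; a lightweight local check can still be useful in scripts. The `check_tags` helper below is hypothetical and validates only a subset of the required keys:

```shell
# Hypothetical helper (not an ATP tool): verify that a set of "key=value" tag
# pairs carries a subset of the required tag keys listed above.
check_tags() {
  local required=(Environment Service Owner CostCenter ManagedBy)
  local missing=() req t found
  for req in "${required[@]}"; do
    found=no
    for t in "$@"; do [[ "$t" == "$req="* ]] && found=yes; done
    [[ "$found" == yes ]] || missing+=("$req")
  done
  if (( ${#missing[@]} > 0 )); then
    echo "missing: ${missing[*]}"
  else
    echo "ok"
  fi
}

check_tags Environment=prod Service=ingestion Owner=platform-team \
           CostCenter=ATP-Platform ManagedBy=pulumi   # ok
check_tags Environment=prod Service=ingestion         # missing: Owner CostCenter ManagedBy
```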
Tag Application Example (Pulumi):
var commonTags = new InputMap<string>
{
["Environment"] = config.Require("environment"),
["Owner"] = "platform-team@connectsoft.example",
["CostCenter"] = "ATP-Platform",
["Compliance"] = "gdpr,hipaa,soc2",
["DataClassification"] = config.Require("dataClassification"), // env-specific
["ManagedBy"] = "pulumi",
["BackupRequired"] = config.RequireBoolean("backupRequired").ToString(),
["DR-Tier"] = config.Require("drTier"),
["CreatedDate"] = DateTime.UtcNow.ToString("o"),
["Project"] = "ATP",
["Version"] = "1.0"
};
// Apply to all resources
var resourceGroup = new ResourceGroup("atp-rg", new ResourceGroupArgs
{
Tags = commonTags
});
Resource Naming Validation¶
Azure Policy Enforcement (Custom Policy). Note: Azure Policy's `match` condition does not support `{a|b}` alternation, so the allowed environment segments are expressed as `anyOf`/`like` conditions; hyphen-less resource types (Storage Accounts, ACR) need a separate rule:
{
  "policyRule": {
    "if": {
      "allOf": [
        {
          "field": "type",
          "in": [
            "Microsoft.Web/sites",
            "Microsoft.Sql/servers"
          ]
        },
        {
          "not": {
            "anyOf": [
              { "field": "name", "like": "atp-*-dev-*" },
              { "field": "name", "like": "atp-*-test-*" },
              { "field": "name", "like": "atp-*-staging-*" },
              { "field": "name", "like": "atp-*-prod-*" },
              { "field": "name", "like": "atp-*-hotfix-*" }
            ]
          }
        }
      ]
    },
    "then": {
      "effect": "deny"
    }
}
}
Automated Validation Script (PowerShell):
# validate-naming.ps1
param(
[string]$ResourceGroupName
)
$resources = Get-AzResource -ResourceGroupName $ResourceGroupName
foreach ($resource in $resources) {
$name = $resource.Name
# Validate naming pattern (excluding storage accounts)
if ($resource.Type -notlike "*Storage*") {
if ($name -notmatch "^atp-[\w]+-(?:dev|test|staging|prod|hotfix)-(?:eus|weu|apse)$") {
Write-Warning "❌ Invalid name: $name (Type: $($resource.Type))"
} else {
Write-Host "✅ Valid name: $name"
}
}
# Validate required tags
$requiredTags = @("Environment", "Service", "Owner", "ManagedBy")
foreach ($tag in $requiredTags) {
if (-not $resource.Tags.ContainsKey($tag)) {
Write-Warning "❌ Missing tag '$tag' on resource: $name"
}
}
}
Cross-Environment Resource Dependencies¶
Shared Resources (Accessible by Multiple Environments):
# Azure Container Registry (shared across all environments)
atpacrsharedeus
# Shared VNet (Dev + Test only)
atp-vnet-shared-eus (10.0.0.0/16)
├── Dev Subnet: 10.0.1.0/24
└── Test Subnet: 10.0.2.0/24
# Dedicated VNets (Staging, Production)
atp-vnet-staging-eus (10.1.0.0/16)
atp-vnet-prod-eus (10.2.0.0/16)
atp-vnet-prod-weu (10.2.0.0/16) # Same CIDR (different regions)
Resource Group Locking:
# Production resource groups: ReadOnly lock (prevent accidental deletion)
az lock create --name prod-delete-lock \
--resource-group ConnectSoft-ATP-Prod-EUS-RG \
--lock-type ReadOnly \
--notes "Prevent accidental deletion of production resources"
# Staging: CanNotDelete lock (allow updates, prevent deletion)
az lock create --name staging-delete-lock \
--resource-group ConnectSoft-ATP-Staging-EUS-RG \
--lock-type CanNotDelete
Summary¶
- Naming Pattern: `atp-{service}-{env}-{region}` ensures clarity and automation compatibility.
- Resource Groups: One per environment/region combination; acts as the lifecycle boundary.
- Tagging: Required tags enable cost allocation, compliance tracking, and automated management.
- Validation: Azure Policy and scripts enforce naming conventions and tag requirements.
- Multi-Region: Production uses dedicated resource groups per region with geo-replication.
- Shared Infrastructure: Cost optimization via shared ACR, VNets, and monitoring for lower environments.
Configuration Management & Hierarchy¶
ATP employs a layered configuration strategy that balances developer convenience (defaults, local development) with production security (secrets in Key Vault, dynamic feature flags). Configuration precedence follows the ASP.NET Core standard with ATP-specific extensions for Azure App Configuration and Key Vault integration.
This approach ensures that no secrets are hardcoded, environment-specific overrides are explicit, and production secrets are never stored in source control or deployment artifacts.
Configuration Hierarchy¶
ATP configurations are resolved in the following precedence order (later sources override earlier ones):
1. appsettings.json # Base defaults (checked into source control)
↓
2. appsettings.{Environment}.json # Environment-specific overrides (checked in)
↓
3. Azure App Configuration (optional) # Dynamic feature flags, A/B testing configs
↓
4. Environment Variables # Runtime overrides, container orchestration
↓
5. Key Vault References # Secrets, connection strings, certificates
↓
6. Command-Line Arguments # Override for debugging/testing
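The "later sources override earlier ones" rule can be illustrated with a toy model, where each `apply` call stands in for one configuration source writing the same key:

```shell
# Minimal sketch of the precedence rule: sources are applied in order, so the
# last source that defines a key determines its effective value.
declare -A config
apply() { config["$1"]="$2"; }            # models one configuration source

apply "RateLimiting:PermitLimit" "100"    # 1. appsettings.json base default
apply "RateLimiting:PermitLimit" "500"    # 2. appsettings.Staging.json override
apply "RateLimiting:PermitLimit" "250"    # 4. environment variable override

echo "${config[RateLimiting:PermitLimit]}"   # 250 -- the last source applied wins
```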
Configuration Resolution Example:
// Program.cs - Configuration loading order
public class Program
{
public static IHostBuilder CreateHostBuilder(string[] args) =>
Host.CreateDefaultBuilder(args)
.ConfigureAppConfiguration((context, config) =>
{
var env = context.HostingEnvironment;
// 1. Base configuration
config.AddJsonFile("appsettings.json", optional: false, reloadOnChange: true);
// 2. Environment-specific overrides
config.AddJsonFile($"appsettings.{env.EnvironmentName}.json", optional: true, reloadOnChange: true);
// 3. Azure App Configuration (Production, Staging only)
if (env.IsProduction() || env.IsStaging())
{
var settings = config.Build();
var appConfigConnection = settings["AppConfig:ConnectionString"];
config.AddAzureAppConfiguration(options =>
{
options
.Connect(appConfigConnection)
.Select(KeyFilter.Any, LabelFilter.Null)
.Select(KeyFilter.Any, env.EnvironmentName)
.ConfigureRefresh(refresh =>
{
refresh.Register("Sentinel", refreshAll: true)
.SetCacheExpiration(TimeSpan.FromMinutes(5));
})
.UseFeatureFlags(featureFlagOptions =>
{
featureFlagOptions.CacheExpirationInterval = TimeSpan.FromMinutes(5);
});
});
}
// 4. Environment variables (override via ASPNETCORE_* or custom prefix)
config.AddEnvironmentVariables(prefix: "ATP_");
// 5. Key Vault (Production, Staging only)
if (env.IsProduction() || env.IsStaging())
{
var builtConfig = config.Build();
var keyVaultEndpoint = builtConfig["KeyVault:Endpoint"];
config.AddAzureKeyVault(
new Uri(keyVaultEndpoint),
new DefaultAzureCredential());
}
// 6. Command-line arguments (highest priority)
config.AddCommandLine(args);
})
.ConfigureWebHostDefaults(webBuilder =>
{
webBuilder.UseStartup<Startup>();
});
}
Base Configuration (appsettings.json)¶
The base configuration contains safe defaults suitable for local development and general application structure. No secrets or environment-specific values should be in this file.
{
"Logging": {
"LogLevel": {
"Default": "Information",
"Microsoft": "Warning",
"Microsoft.Hosting.Lifetime": "Information"
},
"Console": {
"IncludeScopes": false,
"TimestampFormat": "yyyy-MM-dd HH:mm:ss "
}
},
"AllowedHosts": "*",
"Audit": {
"ServiceName": "ATP",
"Version": "1.0.0",
"EnableImmutability": false,
"EnableTamperEvidence": false,
"EnableHashChaining": false,
"RetentionDays": 90,
"WormStorage": false,
"SegmentSize": 10000,
"SealInterval": "PT1H",
"MaxBatchSize": 1000,
"BatchTimeoutSeconds": 30
},
"Compliance": {
"StrictInDevelopment": false,
"EnableLoggingRedaction": false,
"SimulateComplianceChecks": false,
"Profile": "default",
"EnforceGDPR": false,
"EnforceHIPAA": false,
"EnforceSOC2": false
},
"OpenTelemetry": {
"ServiceName": "atp-service",
"ServiceVersion": "1.0.0",
"ExporterEndpoint": "http://localhost:4317",
"SamplingRatio": 1.0,
"ExportIntervalSeconds": 5,
"EnableConsoleExporter": false,
"EnableJaegerExporter": false
},
"RateLimiting": {
"Enabled": false,
"PermitLimit": 100,
"Window": 60,
"QueueLimit": 0
},
"Caching": {
"DefaultSlidingExpiration": "00:05:00",
"DefaultAbsoluteExpiration": "01:00:00"
},
"HealthChecks": {
"Enabled": true,
"PollingIntervalSeconds": 30
},
"KeyVault": {
"Endpoint": ""
},
"AppConfig": {
"ConnectionString": ""
}
}
Dev Environment (appsettings.Development.json)¶
Purpose: Local development and continuous integration with verbose logging, synthetic data, and disabled compliance controls for rapid iteration.
{
"Logging": {
"LogLevel": {
"Default": "Debug",
"Microsoft": "Information",
"Microsoft.EntityFrameworkCore": "Information",
"Microsoft.EntityFrameworkCore.Database.Command": "Information",
"System": "Information",
"ConnectSoft": "Debug"
},
"Console": {
"IncludeScopes": true,
"TimestampFormat": "HH:mm:ss.fff "
}
},
"ConnectionStrings": {
"DefaultConnection": "Server=atp-sql-dev-eus.database.windows.net;Database=ATP_Dev;Authentication=Active Directory Managed Identity;",
"Redis": "atp-redis-dev-eus.redis.cache.windows.net:6380,ssl=True,abortConnect=False",
"ServiceBus": "Endpoint=sb://atp-servicebus-dev-eus.servicebus.windows.net/;Authentication=Managed Identity",
"CosmosDb": "AccountEndpoint=https://atp-cosmos-dev-eus.documents.azure.com:443/;AuthKeyOrResourceToken=ManagedIdentity"
},
"Audit": {
"EnableImmutability": false,
"EnableTamperEvidence": false,
"EnableHashChaining": false,
"RetentionDays": 30,
"WormStorage": false,
"SegmentSize": 1000,
"SealInterval": "PT24H",
"MaxBatchSize": 100,
"IntegrityVerification": {
"OnRead": false,
"Scheduled": false
}
},
"Compliance": {
"StrictInDevelopment": true,
"EnableLoggingRedaction": true,
"SimulateComplianceChecks": true,
"Profile": "development",
"EnforceGDPR": false,
"EnforceHIPAA": false,
"EnforceSOC2": false,
"AllowTestData": true
},
"OpenTelemetry": {
"ServiceName": "atp-ingestion-dev",
"ServiceVersion": "1.0.0",
"ExporterEndpoint": "http://otel-collector-dev:4317",
"SamplingRatio": 1.0,
"ExportIntervalSeconds": 5,
"EnableConsoleExporter": true,
"EnableJaegerExporter": false,
"Attributes": {
"environment": "dev",
"region": "eus"
}
},
"FeatureManagement": {
"TamperEvidenceV2": true,
"AdvancedQueryFilters": true,
"AIAssistedAnomalyDetection": true,
"ExperimentalFeatures": true,
"PerformanceOptimizations": true
},
"RateLimiting": {
"Enabled": false,
"PermitLimit": 0,
"Window": 0
},
"Caching": {
"DefaultSlidingExpiration": "00:01:00",
"DefaultAbsoluteExpiration": "00:05:00",
"Enabled": true
},
"HealthChecks": {
"Enabled": true,
"PollingIntervalSeconds": 60,
"IncludeDependencies": true
},
"Cors": {
"AllowedOrigins": ["http://localhost:3000", "http://localhost:5173"],
"AllowCredentials": true
},
"Swagger": {
"Enabled": true,
"IncludeXmlComments": true
}
}
Test Environment (appsettings.Test.json)¶
Purpose: Automated testing and QA validation with stable datasets, moderate logging, and selective compliance enforcement for test validation.
{
"Logging": {
"LogLevel": {
"Default": "Information",
"Microsoft": "Warning",
"Microsoft.EntityFrameworkCore": "Information",
"Microsoft.EntityFrameworkCore.Database.Command": "Information",
"ConnectSoft": "Information"
}
},
"ConnectionStrings": {
"DefaultConnection": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-test-eus.vault.azure.net/secrets/SqlConnectionString)",
"Redis": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-test-eus.vault.azure.net/secrets/RedisConnectionString)",
"ServiceBus": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-test-eus.vault.azure.net/secrets/ServiceBusConnectionString)",
"CosmosDb": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-test-eus.vault.azure.net/secrets/CosmosConnectionString)"
},
"Audit": {
"EnableImmutability": false,
"EnableTamperEvidence": true,
"EnableHashChaining": true,
"RetentionDays": 90,
"WormStorage": false,
"SegmentSize": 10000,
"SealInterval": "PT1H",
"IntegrityVerification": {
"OnRead": true,
"Scheduled": true,
"ScheduleCron": "0 2 * * *",
"SampleRate": 0.5
}
},
"Compliance": {
"StrictInDevelopment": false,
"EnableLoggingRedaction": true,
"SimulateComplianceChecks": true,
"Profile": "test",
"EnforceGDPR": true,
"EnforceHIPAA": false,
"EnforceSOC2": false,
"AllowTestData": true
},
"OpenTelemetry": {
"ServiceName": "atp-ingestion-test",
"ExporterEndpoint": "http://otel-collector-test:4317",
"SamplingRatio": 0.5,
"ExportIntervalSeconds": 10,
"EnableConsoleExporter": false,
"Attributes": {
"environment": "test",
"region": "eus"
}
},
"FeatureManagement": {
"TamperEvidenceV2": true,
"AdvancedQueryFilters": true,
"AIAssistedAnomalyDetection": false,
"ExperimentalFeatures": false
},
"RateLimiting": {
"Enabled": true,
"PermitLimit": 1000,
"Window": 60,
"QueueLimit": 100
},
"KeyVault": {
"Endpoint": "https://atp-keyvault-test-eus.vault.azure.net/"
},
"Swagger": {
"Enabled": true
}
}
Staging Environment (appsettings.Staging.json)¶
Purpose: Pre-production validation with production-equivalent configuration, full compliance enforcement, and realistic load testing capabilities.
{
"Logging": {
"LogLevel": {
"Default": "Warning",
"Microsoft": "Error",
"Microsoft.EntityFrameworkCore": "Warning",
"ConnectSoft": "Warning"
}
},
"ConnectionStrings": {
"DefaultConnection": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-staging-eus.vault.azure.net/secrets/SqlConnectionString)",
"Redis": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-staging-eus.vault.azure.net/secrets/RedisConnectionString)",
"ServiceBus": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-staging-eus.vault.azure.net/secrets/ServiceBusConnectionString)",
"CosmosDb": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-staging-eus.vault.azure.net/secrets/CosmosConnectionString)"
},
"Audit": {
"EnableImmutability": true,
"EnableTamperEvidence": true,
"EnableHashChaining": true,
"RetentionDays": 2555,
"WormStorage": true,
"SegmentSize": 100000,
"SealInterval": "PT15M",
"IntegrityVerification": {
"OnRead": true,
"Scheduled": true,
"ScheduleCron": "0 */6 * * *",
"SampleRate": 0.2
}
},
"Compliance": {
"StrictInDevelopment": false,
"EnableLoggingRedaction": true,
"SimulateComplianceChecks": false,
"Profile": "staging",
"EnforceGDPR": true,
"EnforceHIPAA": true,
"EnforceSOC2": true,
"AllowTestData": false,
"AuditTrail": {
"Enabled": true,
"RetentionYears": 7,
"ImmutableStorage": true
}
},
"OpenTelemetry": {
"ServiceName": "atp-ingestion-staging",
"ExporterEndpoint": "https://otel-collector-staging.connectsoft.local:4317",
"SamplingRatio": 0.25,
"ExportIntervalSeconds": 30,
"EnableConsoleExporter": false,
"AdaptiveSampling": {
"Enabled": true,
"MaxTelemetryItemsPerSecond": 50
},
"Attributes": {
"environment": "staging",
"region": "eus"
}
},
"FeatureManagement": {
"TamperEvidenceV2": {
"EnabledFor": [
{
"Name": "Percentage",
"Parameters": {
"Value": 50
}
}
]
},
"AdvancedQueryFilters": true,
"AIAssistedAnomalyDetection": {
"EnabledFor": [
{
"Name": "TargetingFilter",
"Parameters": {
"Audience": {
"Users": ["staging-tenant-001"]
}
}
}
]
},
"ExperimentalFeatures": false
},
"RateLimiting": {
"Enabled": true,
"PermitLimit": 500,
"Window": 60,
"QueueLimit": 100,
"ByTenant": true,
"ByIPAddress": true
},
"Security": {
"RequireHttps": true,
"HstsEnabled": true,
"HstsMaxAge": 31536000,
"ContentSecurityPolicy": "default-src 'self'; script-src 'self' 'unsafe-inline'; style-src 'self' 'unsafe-inline'"
},
"KeyVault": {
"Endpoint": "https://atp-keyvault-staging-eus.vault.azure.net/"
},
"AppConfig": {
"ConnectionString": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-staging-eus.vault.azure.net/secrets/AppConfigConnectionString)",
"RefreshInterval": "00:05:00"
},
"Swagger": {
"Enabled": false
}
}
Production Environment (appsettings.Production.json)¶
Purpose: Live tenant traffic with maximum security, full compliance enforcement, optimized performance, and strict monitoring.
{
"Logging": {
"LogLevel": {
"Default": "Warning",
"Microsoft": "Error",
"Microsoft.EntityFrameworkCore": "Error",
"ConnectSoft": "Warning"
},
"ApplicationInsights": {
"LogLevel": {
"Default": "Warning",
"ConnectSoft": "Warning"
}
}
},
"ConnectionStrings": {
"DefaultConnection": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-prod-eus.vault.azure.net/secrets/SqlConnectionString)",
"Redis": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-prod-eus.vault.azure.net/secrets/RedisConnectionString)",
"ServiceBus": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-prod-eus.vault.azure.net/secrets/ServiceBusConnectionString)",
"CosmosDb": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-prod-eus.vault.azure.net/secrets/CosmosConnectionString)"
},
"Audit": {
"EnableImmutability": true,
"EnableTamperEvidence": true,
"EnableHashChaining": true,
"RetentionDays": 2555,
"WormStorage": true,
"SegmentSize": 100000,
"SealInterval": "PT15M",
"MaxBatchSize": 10000,
"BatchTimeoutSeconds": 60,
"IntegrityVerification": {
"OnRead": true,
"Scheduled": true,
"ScheduleCron": "0 2 * * *",
"SampleRate": 0.1,
"ParallelVerification": true,
"MaxDegreeOfParallelism": 4
}
},
"Compliance": {
"StrictInDevelopment": false,
"EnableLoggingRedaction": true,
"SimulateComplianceChecks": false,
"Profile": "production",
"EnforceGDPR": true,
"EnforceHIPAA": true,
"EnforceSOC2": true,
"AllowTestData": false,
"AuditTrail": {
"Enabled": true,
"RetentionYears": 7,
"ImmutableStorage": true,
"EncryptionAtRest": true,
"EncryptionInTransit": true
}
},
"OpenTelemetry": {
"ServiceName": "atp-ingestion-prod",
"ServiceVersion": "1.0.0",
"ExporterEndpoint": "https://otel-collector-prod.connectsoft.local:4317",
"SamplingRatio": 0.1,
"ExportIntervalSeconds": 60,
"EnableConsoleExporter": false,
"EnableJaegerExporter": false,
"AdaptiveSampling": {
"Enabled": true,
"MaxTelemetryItemsPerSecond": 10,
"SamplingPercentage": {
"Default": 10,
"OnError": 100,
"SlowRequests": 100
}
},
"Attributes": {
"environment": "prod",
"region": "eus",
"datacenter": "azure-eastus"
}
},
"FeatureManagement": {
"TamperEvidenceV2": true,
"AdvancedQueryFilters": true,
"AIAssistedAnomalyDetection": {
"EnabledFor": [
{
"Name": "Percentage",
"Parameters": {
"Value": 10
}
}
]
},
"ExperimentalFeatures": false,
"PerformanceOptimizations": true
},
"RateLimiting": {
"Enabled": true,
"PermitLimit": 100,
"Window": 60,
"QueueLimit": 50,
"ByTenant": true,
"ByIPAddress": true,
"ByUser": true,
"Strategy": "TokenBucket"
},
"Caching": {
"DefaultSlidingExpiration": "00:15:00",
"DefaultAbsoluteExpiration": "01:00:00",
"Enabled": true,
"DistributedCache": true,
"CompressionEnabled": true
},
"Security": {
"RequireHttps": true,
"HstsEnabled": true,
"HstsMaxAge": 31536000,
"HstsIncludeSubdomains": true,
"HstsPreload": true,
"ContentSecurityPolicy": "default-src 'self'; script-src 'self'; style-src 'self'; img-src 'self' data: https:; font-src 'self'; connect-src 'self'; frame-ancestors 'none'",
"XFrameOptions": "DENY",
"XContentTypeOptions": "nosniff",
"ReferrerPolicy": "strict-origin-when-cross-origin",
"PermissionsPolicy": "geolocation=(), microphone=(), camera=()"
},
"HighAvailability": {
"MultiRegion": true,
"PrimaryRegion": "eastus",
"SecondaryRegion": "westeurope",
"TrafficDistribution": "80-20",
"FailoverMode": "Automatic",
"HealthCheckInterval": 30,
"HealthCheckTimeout": 5,
"UnhealthyThreshold": 3
},
"HealthChecks": {
"Enabled": true,
"PollingIntervalSeconds": 30,
"TimeoutSeconds": 10,
"IncludeDependencies": true,
"PublishToApplicationInsights": true
},
"KeyVault": {
"Endpoint": "https://atp-keyvault-prod-eus.vault.azure.net/",
"CacheExpirationMinutes": 60,
"ReloadInterval": "00:15:00"
},
"AppConfig": {
"ConnectionString": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-prod-eus.vault.azure.net/secrets/AppConfigConnectionString)",
"RefreshInterval": "00:05:00",
"UseFeatureFlags": true,
"FeatureFlagRefreshInterval": "00:01:00"
},
"Swagger": {
"Enabled": false
},
"Cors": {
"AllowedOrigins": ["https://portal.connectsoft.com"],
"AllowCredentials": false
}
}
Azure App Configuration Integration¶
Purpose: Centralized dynamic configuration and feature flags that can be updated without redeployment. Used in Staging and Production only.
Configuration Structure (Azure App Configuration):
# Key-Value Pairs (Namespaced by environment)
ATP:Ingestion:MaxBatchSize = 10000 (label: prod)
ATP:Ingestion:MaxBatchSize = 1000 (label: staging)
ATP:RateLimiting:PermitLimit = 100 (label: prod)
ATP:RateLimiting:PermitLimit = 500 (label: staging)
# Feature Flags
TamperEvidenceV2 = true (label: prod)
AIAssistedAnomalyDetection = true (label: prod, percentage: 10%)
C# Integration Example:
// Startup.cs
public void ConfigureServices(IServiceCollection services)
{
// Add Azure App Configuration
services.AddAzureAppConfiguration();
// Add Feature Management
services.AddFeatureManagement()
.AddFeatureFilter<PercentageFilter>()
.AddFeatureFilter<TargetingFilter>()
.AddFeatureFilter<TimeWindowFilter>();
// Bind configuration sections
services.Configure<AuditOptions>(Configuration.GetSection("Audit"));
services.Configure<ComplianceOptions>(Configuration.GetSection("Compliance"));
services.Configure<OpenTelemetryOptions>(Configuration.GetSection("OpenTelemetry"));
}
// Middleware to refresh App Configuration
public void Configure(IApplicationBuilder app)
{
app.UseAzureAppConfiguration();
}
Feature Flag Usage:
// Service implementation with feature flag
public class AuditService : IAuditService
{
private readonly IFeatureManager _featureManager;
public AuditService(IFeatureManager featureManager)
{
_featureManager = featureManager;
}
public async Task<AuditResult> RecordEventAsync(AuditEvent auditEvent)
{
// Check if new tamper evidence is enabled
if (await _featureManager.IsEnabledAsync("TamperEvidenceV2"))
{
return await RecordWithTamperEvidenceV2Async(auditEvent);
}
else
{
return await RecordWithLegacyTamperEvidenceAsync(auditEvent);
}
}
}
Environment Variables & Container Overrides¶
Purpose: Override configuration at runtime without modifying appsettings.json. Useful for containerized deployments (AKS, Docker Compose) and CI/CD pipelines.
Kubernetes ConfigMap Example:
# atp-ingestion-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: atp-ingestion-config
namespace: atp-prod
data:
ASPNETCORE_ENVIRONMENT: "Production"
ATP_Audit__SegmentSize: "100000"
ATP_OpenTelemetry__SamplingRatio: "0.1"
ATP_RateLimiting__PermitLimit: "100"
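The `ATP_` prefix and double-underscore separator in these keys follow ASP.NET Core's environment-variable mapping (the real provider also matches names case-insensitively). A sketch of the key translation:

```shell
# Sketch of ASP.NET Core's env-var-to-config-key mapping, assuming the
# AddEnvironmentVariables(prefix: "ATP_") setup shown earlier: the prefix is
# stripped and "__" becomes the ":" section separator.
env_to_key() {
  local var="$1"
  [[ "$var" == ATP_* ]] || return 1   # variables outside the prefix are ignored
  local key="${var#ATP_}"
  printf '%s\n' "${key//__/:}"
}

env_to_key ATP_Audit__SegmentSize             # Audit:SegmentSize
env_to_key ATP_OpenTelemetry__SamplingRatio   # OpenTelemetry:SamplingRatio
```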
Kubernetes Deployment with ConfigMap:
# atp-ingestion-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: atp-ingestion
namespace: atp-prod
spec:
replicas: 3
template:
spec:
containers:
- name: atp-ingestion
image: connectsoft.azurecr.io/atp/ingestion:1.0.0
envFrom:
- configMapRef:
name: atp-ingestion-config
env:
- name: ConnectionStrings__DefaultConnection
valueFrom:
secretKeyRef:
name: atp-sql-secret
key: connectionString
Docker Compose Example (Local Development):
version: '3.8'
services:
atp-ingestion:
image: atp-ingestion:dev
environment:
- ASPNETCORE_ENVIRONMENT=Development
- ATP_Audit__EnableImmutability=false
- ATP_ConnectionStrings__Redis=redis:6379
- ATP_ConnectionStrings__ServiceBus=sb://localhost:5672
depends_on:
- redis
- rabbitmq
Key Vault Secret References¶
Purpose: Store sensitive configuration (connection strings, API keys, certificates) in Azure Key Vault with managed identity access.
Key Vault Secret Naming Convention:
# Secrets live in the per-environment vault (atp-keyvault-{env}-{region}), so names identify only the secret type:
SqlConnectionString
RedisConnectionString
ServiceBusConnectionString
CosmosConnectionString
AppConfigConnectionString
StorageAccountConnectionString
JwtSigningKey
EncryptionKey
CertificatePassword
ExternalApiKey
Reference in appsettings.json (Production) — note that `@Microsoft.KeyVault(...)` references are resolved by Azure App Service app settings, not by the JSON configuration provider itself; outside App Service, use the programmatic Key Vault integration shown in the next example:
{
"ConnectionStrings": {
"DefaultConnection": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-prod-eus.vault.azure.net/secrets/SqlConnectionString)",
"Redis": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-prod-eus.vault.azure.net/secrets/RedisConnectionString)"
}
}
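The reference strings above follow a fixed shape, so they can be generated from the vault and secret names. `kv_ref` is an illustrative helper, not an Azure CLI command:

```shell
# Illustrative helper (not an Azure CLI command): emit the App Service Key Vault
# reference string for a given vault name and secret name.
kv_ref() {
  printf '@Microsoft.KeyVault(SecretUri=https://%s.vault.azure.net/secrets/%s)\n' "$1" "$2"
}

kv_ref atp-keyvault-prod-eus SqlConnectionString
```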
Programmatic Key Vault Access:
// Program.cs - Manual Key Vault integration
public static IHostBuilder CreateHostBuilder(string[] args) =>
Host.CreateDefaultBuilder(args)
.ConfigureAppConfiguration((context, config) =>
{
if (context.HostingEnvironment.IsProduction())
{
var builtConfig = config.Build();
var keyVaultEndpoint = builtConfig["KeyVault:Endpoint"];
// Use Managed Identity for authentication
var credential = new DefaultAzureCredential();
config.AddAzureKeyVault(
new Uri(keyVaultEndpoint),
credential,
new AzureKeyVaultConfigurationOptions
{
ReloadInterval = TimeSpan.FromMinutes(15)
});
}
});
Managed Identity Configuration (Azure App Service):
# Enable system-assigned managed identity
az webapp identity assign \
--name atp-ingestion-prod-eus \
--resource-group ConnectSoft-ATP-Prod-EUS-RG
# Grant Key Vault access to managed identity
az keyvault set-policy \
--name atp-keyvault-prod-eus \
--object-id <managed-identity-object-id> \
--secret-permissions get list
Configuration Validation¶
Purpose: Validate configuration at startup to prevent runtime errors from misconfigured settings.
Options Validation Example:
// AuditOptions.cs
public class AuditOptions : IValidatableObject
{
public bool EnableImmutability { get; set; }
public bool EnableTamperEvidence { get; set; }
public int RetentionDays { get; set; }
public bool WormStorage { get; set; }
public int SegmentSize { get; set; }
public IEnumerable<ValidationResult> Validate(ValidationContext validationContext)
{
if (RetentionDays < 1)
{
yield return new ValidationResult(
"RetentionDays must be at least 1 day",
new[] { nameof(RetentionDays) });
}
if (EnableImmutability && !WormStorage)
{
yield return new ValidationResult(
"WormStorage must be enabled when EnableImmutability is true",
new[] { nameof(WormStorage) });
}
if (SegmentSize < 100 || SegmentSize > 1000000)
{
yield return new ValidationResult(
"SegmentSize must be between 100 and 1,000,000",
new[] { nameof(SegmentSize) });
}
}
}
// Startup.cs
public void ConfigureServices(IServiceCollection services)
{
services.AddOptions<AuditOptions>()
.Bind(Configuration.GetSection("Audit"))
.ValidateDataAnnotations()
.ValidateOnStart(); // Fail fast on startup if invalid
}
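The fail-fast rules enforced by AuditOptions above can be re-stated compactly. This Python sketch is an illustrative translation (not ATP code): it returns a list of violation messages, where an empty list means the configuration is valid.

```python
# Illustrative re-statement of the AuditOptions validation rules:
# RetentionDays >= 1, WORM storage required when immutability is on,
# and SegmentSize within [100, 1,000,000].
def validate_audit_options(enable_immutability: bool, worm_storage: bool,
                           retention_days: int, segment_size: int):
    errors = []
    if retention_days < 1:
        errors.append("RetentionDays must be at least 1 day")
    if enable_immutability and not worm_storage:
        errors.append("WormStorage must be enabled when EnableImmutability is true")
    if not 100 <= segment_size <= 1_000_000:
        errors.append("SegmentSize must be between 100 and 1,000,000")
    return errors

if __name__ == "__main__":
    print(validate_audit_options(True, True, 2555, 10_000))  # [] -> valid
    print(validate_audit_options(True, False, 0, 50))        # three violations
```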
Startup Configuration Validation:
// Program.cs - Validate critical configuration
public static void Main(string[] args)
{
var host = CreateHostBuilder(args).Build();
// Validate configuration before starting
var logger = host.Services.GetRequiredService<ILogger<Program>>();
try
{
var auditOptions = host.Services.GetRequiredService<IOptions<AuditOptions>>().Value;
var complianceOptions = host.Services.GetRequiredService<IOptions<ComplianceOptions>>().Value;
logger.LogInformation("✅ Configuration validated successfully");
logger.LogInformation("Audit - Immutability: {Immutability}, Retention: {Retention} days",
auditOptions.EnableImmutability, auditOptions.RetentionDays);
}
catch (OptionsValidationException ex)
{
logger.LogCritical(ex, "❌ Configuration validation failed");
throw;
}
host.Run();
}
Configuration Best Practices¶
Security:
- Never commit secrets: Use .gitignore to exclude appsettings.*.local.json files.
- Use Managed Identity: Avoid connection strings with passwords; use Managed Identity authentication.
- Rotate secrets: Implement automated secret rotation (90-day cycle for Production).
- Encrypt sensitive data: Use Data Protection API for configuration encryption at rest.
Maintainability:
- Environment parity: Staging should mirror Production configuration (except scale/cost).
- Configuration as code: Store appsettings.json files in source control; manage Key Vault secrets via IaC (Pulumi/Bicep).
- Validation: Use Options pattern with validation to fail fast on misconfiguration.
- Documentation: Comment complex configuration sections; link to ADRs for architectural decisions.
Performance:
- Cache configuration: Reload Key Vault secrets every 15 minutes (not every request).
- Minimize App Configuration calls: Use 5-minute refresh interval; sentinel key pattern for bulk refresh.
- Local caching: Cache expensive computations (feature flag evaluations, connection string parsing).
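The caching guidance above amounts to a simple time-to-live cache in front of each configuration source. This is a minimal sketch under assumed names (not part of ATP): the loader callback is only invoked when the cached value is older than the refresh interval (15 minutes for Key Vault, 5 minutes for App Configuration).

```python
import time

# Minimal TTL cache for configuration lookups. The loader runs only on a
# cache miss or after the TTL expires, mirroring the 15-minute Key Vault
# reload interval recommended above.
class TtlConfigCache:
    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}  # key -> (value, loaded_at)

    def get(self, key, loader):
        now = self._clock()
        entry = self._store.get(key)
        if entry is None or now - entry[1] >= self._ttl:
            entry = (loader(), now)  # reload only on miss or expiry
            self._store[key] = entry
        return entry[0]

if __name__ == "__main__":
    calls = []
    cache = TtlConfigCache(ttl_seconds=900)  # 15-minute TTL
    cache.get("SqlConnectionString", lambda: calls.append(1) or "Server=...;")
    cache.get("SqlConnectionString", lambda: calls.append(1) or "Server=...;")
    print(len(calls))  # loader ran once; the second read was served from cache
```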
Configuration Deployment Workflow¶
Pipeline Integration (Azure DevOps):
# azure-pipelines.yml - Deploy with environment-specific config
- task: AzureKeyVault@2
displayName: 'Fetch secrets from Key Vault'
inputs:
azureSubscription: $(azureSubscription)
KeyVaultName: 'atp-keyvault-$(environment)-eus'
SecretsFilter: '*'
RunAsPreJob: true
- task: FileTransform@1
displayName: 'Transform appsettings.json'
inputs:
folderPath: '$(System.DefaultWorkingDirectory)/publish'
fileType: 'json'
targetFiles: '**/appsettings.$(environment).json'
- task: AzureWebApp@1
displayName: 'Deploy to Azure App Service'
inputs:
azureSubscription: $(azureSubscription)
appName: 'atp-ingestion-$(environment)-eus'
package: '$(System.DefaultWorkingDirectory)/publish/*.zip'
appSettings: |
-ASPNETCORE_ENVIRONMENT "$(environment)"
-KeyVault__Endpoint "https://atp-keyvault-$(environment)-eus.vault.azure.net/"
Summary¶
- Configuration Hierarchy: Layered approach from base defaults → environment overrides → Azure App Configuration → environment variables → Key Vault.
- Environment-Specific: Each environment has tailored configuration balancing developer productivity (Dev) with production security (Prod).
- Secret Management: All sensitive data stored in Key Vault with Managed Identity access; no secrets in source control.
- Dynamic Configuration: Azure App Configuration enables feature flags and runtime config changes without redeployment.
- Validation: Options pattern with startup validation ensures misconfiguration is caught early.
- Best Practices: Security-first approach with encrypted secrets, managed identities, and automated rotation.
Secrets & Key Management¶
ATP implements a defense-in-depth approach to secrets management with Azure Key Vault as the centralized secret store, Managed Identities for authentication, and environment-specific access controls. This strategy keeps secrets out of source control, enforces the principle of least privilege, and automates rotation for production environments.
Secrets management follows the zero-trust security model: lower environments (Dev/Test) balance developer productivity with basic security, while higher environments (Staging/Production) enforce strict controls with zero human access to production secrets.
Key Vault Per Environment¶
Each environment has a dedicated Key Vault with appropriate access controls, audit logging, and compliance configurations tailored to its security requirements.
Dev Environment Key Vault¶
Name: atp-keyvault-dev-eus
Purpose: Developer-accessible secrets for local development and CI/CD testing with relaxed security for rapid iteration.
Configuration:
# Dev Key Vault (atp-keyvault-dev-eus)
properties:
sku: Standard
tenantId: <tenant-id>
accessPolicies:
- tenantId: <tenant-id>
objectId: <developers-aad-group-id>
permissions:
secrets: [get, list, set, delete] # Full access for developers
keys: [get, list]
certificates: [get, list]
- tenantId: <tenant-id>
objectId: <ci-cd-service-principal-id>
permissions:
secrets: [get, list] # Read-only for pipelines
enabledForDeployment: true
enabledForTemplateDeployment: true
enabledForDiskEncryption: false
enableSoftDelete: true
softDeleteRetentionInDays: 7 # Minimum retention
enablePurgeProtection: false # Allow purge for cleanup
networkAcls:
bypass: AzureServices
defaultAction: Allow # Public access for developer convenience
ipRules: []
virtualNetworkRules: []
publicNetworkAccess: Enabled
Access Control:
- Developers: Full access (get, list, set, delete secrets) for local debugging.
- CI/CD Pipelines: Read-only access (get, list secrets) for automated deployments.
- No MFA Required: Developer convenience prioritized over strict security.
Secret Characteristics:
- Rotation: Manual (on-demand when compromised).
- Audit Logging: 30-day retention in Log Analytics.
- Backup: Not required (ephemeral development secrets).
Example Secrets (Dev):
# Connection strings (non-production databases)
az keyvault secret set \
--vault-name atp-keyvault-dev-eus \
--name SqlConnectionString \
--value "Server=atp-sql-dev-eus.database.windows.net;Database=ATP_Dev;User Id=devuser;Password=DevP@ss123!"
# Shared API keys (development tier)
az keyvault secret set \
--vault-name atp-keyvault-dev-eus \
--name ExternalApiKey \
--value "dev-api-key-12345"
# JWT signing key (fixed for dev)
az keyvault secret set \
--vault-name atp-keyvault-dev-eus \
--name JwtSigningKey \
--value "dev-jwt-secret-key-do-not-use-in-prod"
Test Environment Key Vault¶
Name: atp-keyvault-test-eus
Purpose: Test automation with service principal access and moderate security controls for QA validation.
Configuration:
# Test Key Vault (atp-keyvault-test-eus)
properties:
sku: Standard
tenantId: <tenant-id>
accessPolicies:
- tenantId: <tenant-id>
objectId: <qa-team-aad-group-id>
permissions:
secrets: [get, list] # Read-only for QA
keys: [get, list]
certificates: [get, list]
- tenantId: <tenant-id>
objectId: <test-automation-service-principal-id>
permissions:
secrets: [get, list] # Read-only for test automation
- tenantId: <tenant-id>
objectId: <atp-test-managed-identity-id>
permissions:
secrets: [get, list] # Managed identity for Test services
enabledForDeployment: true
enabledForTemplateDeployment: true
enabledForDiskEncryption: false
enableSoftDelete: true
softDeleteRetentionInDays: 30
enablePurgeProtection: false
networkAcls:
bypass: AzureServices
defaultAction: Deny
ipRules:
- value: <ci-cd-agent-ip> # CI/CD agent pool
- value: <qa-team-office-ip> # QA team office
virtualNetworkRules:
- id: /subscriptions/<sub-id>/resourceGroups/ATP-Test-RG/providers/Microsoft.Network/virtualNetworks/atp-vnet-shared-eus/subnets/Test-Subnet
publicNetworkAccess: Enabled
Access Control:
- QA Team: Read-only access (view secrets for troubleshooting).
- Test Automation: Read-only service principal access.
- Managed Identity: Test environment services use managed identity (no keys in config).
Secret Characteristics:
- Rotation: Quarterly (90-day cycle).
- Audit Logging: 90-day retention in Log Analytics.
- Backup: Daily automated backups (30-day retention).
Example Secrets (Test):
# Connection strings (with Key Vault references in appsettings.Test.json)
az keyvault secret set \
--vault-name atp-keyvault-test-eus \
--name SqlConnectionString \
--value "Server=atp-sql-test-eus.database.windows.net;Database=ATP_Test;User Id=testuser;Password=$(Generate-SecurePassword)"
# Test-specific certificates
az keyvault certificate import \
--vault-name atp-keyvault-test-eus \
--name MtlsClientCertificate \
--file test-client-cert.pfx \
--password <pfx-password>
Staging Environment Key Vault¶
Name: atp-keyvault-staging-eus
Purpose: Production-like security with restricted access, private endpoints, and compliance controls for pre-production validation.
Configuration:
# Staging Key Vault (atp-keyvault-staging-eus)
properties:
sku: Premium # HSM-backed keys
tenantId: <tenant-id>
accessPolicies:
- tenantId: <tenant-id>
objectId: <platform-team-aad-group-id>
permissions:
secrets: [get, list] # Read-only for platform team
keys: [get, list]
certificates: [get, list]
- tenantId: <tenant-id>
objectId: <atp-staging-managed-identity-id>
permissions:
secrets: [get, list] # Managed identity only
keys: [get, unwrapKey, wrapKey] # For encryption operations
certificates: [get]
enabledForDeployment: false # Prevent VM deployments
enabledForTemplateDeployment: true # IaC deployments only
enabledForDiskEncryption: false
enableSoftDelete: true
softDeleteRetentionInDays: 90
enablePurgeProtection: true # Cannot purge deleted secrets
networkAcls:
bypass: AzureServices
defaultAction: Deny
ipRules: [] # No public IP access
virtualNetworkRules:
- id: /subscriptions/<sub-id>/resourceGroups/ATP-Staging-RG/providers/Microsoft.Network/virtualNetworks/atp-vnet-staging-eus/subnets/Services-Subnet
- id: /subscriptions/<sub-id>/resourceGroups/ATP-Staging-RG/providers/Microsoft.Network/virtualNetworks/atp-vnet-staging-eus/subnets/Data-Subnet
publicNetworkAccess: Disabled # Private endpoint only
privateEndpointConnections:
- privateLinkServiceConnectionState:
status: Approved
privateEndpoint:
id: /subscriptions/<sub-id>/resourceGroups/ATP-Staging-RG/providers/Microsoft.Network/privateEndpoints/atp-kv-staging-pe
Access Control:
- Platform Team: Read-only access (view secrets for troubleshooting) with MFA required.
- Managed Identity: Only authentication method for services (no service principals).
- No Developer Access: Prohibited.
Secret Characteristics:
- Rotation: Monthly (30-day cycle with automated rotation).
- Audit Logging: 90-day retention with Azure Sentinel integration.
- Backup: Daily automated backups with geo-redundancy (365-day retention).
- HSM-Backed Keys: Premium tier with hardware security module protection.
Example Secrets (Staging):
# Connection strings (production-equivalent)
az keyvault secret set \
--vault-name atp-keyvault-staging-eus \
--name SqlConnectionString \
--value "Server=atp-sql-staging-eus.database.windows.net;Database=ATP_Staging;Authentication=Active Directory Managed Identity;" \
--description "Staging SQL Connection (Managed Identity)" \
--tags Environment=Staging Compliance=GDPR,HIPAA,SOC2
# Per-tenant encryption keys
az keyvault key create \
--vault-name atp-keyvault-staging-eus \
--name TenantKEK-staging-tenant-001 \
--kty RSA-HSM \
--size 4096 \
--ops wrapKey unwrapKey \
--protection hsm
Production Environment Key Vault¶
Name: atp-keyvault-prod-eus
Purpose: Maximum security with zero human access, private endpoints only, HSM-backed keys, and full compliance enforcement.
Configuration:
# Production Key Vault (atp-keyvault-prod-eus)
properties:
sku: Premium # HSM-backed keys
tenantId: <tenant-id>
accessPolicies:
- tenantId: <tenant-id>
objectId: <atp-prod-managed-identity-id>
permissions:
secrets: [get, list] # Managed identity ONLY
keys: [get, unwrapKey, wrapKey, decrypt, encrypt]
certificates: [get]
- tenantId: <tenant-id>
objectId: <break-glass-emergency-group-id>
permissions:
secrets: [get] # Break-glass read-only (audited)
# Conditional Access Policy Required:
# - MFA enforced
# - Compliant device required
# - Trusted location (VPN) required
# - Just-in-Time access (max 4 hours)
enabledForDeployment: false
enabledForTemplateDeployment: false # Prevent any template deployments
enabledForDiskEncryption: false
enableSoftDelete: true
softDeleteRetentionInDays: 90
enablePurgeProtection: true
networkAcls:
bypass: None # Strict: no bypasses
defaultAction: Deny
ipRules: []
virtualNetworkRules:
- id: /subscriptions/<sub-id>/resourceGroups/ATP-Prod-RG/providers/Microsoft.Network/virtualNetworks/atp-vnet-prod-eus/subnets/AKS-Nodes
- id: /subscriptions/<sub-id>/resourceGroups/ATP-Prod-RG/providers/Microsoft.Network/virtualNetworks/atp-vnet-prod-eus/subnets/Private-Endpoints
publicNetworkAccess: Disabled
privateEndpointConnections:
- privateLinkServiceConnectionState:
status: Approved
privateEndpoint:
id: /subscriptions/<sub-id>/resourceGroups/ATP-Prod-RG/providers/Microsoft.Network/privateEndpoints/atp-kv-prod-pe
Access Control:
- Managed Identity Only: No human access under normal operations.
- Break-Glass Access: Emergency access group with strict conditional access (MFA, compliant device, VPN, JIT approval, max 4-hour access).
- Zero Developer Access: Absolutely prohibited.
- Zero Service Principal Access: Only managed identities allowed.
Secret Characteristics:
- Rotation: Monthly automated rotation (30-day cycle with zero-downtime).
- Audit Logging: 365-day retention with Azure Sentinel + SIEM integration; real-time alerting on all access.
- Backup: Continuous backup with geo-redundancy (7-year retention for compliance).
- HSM-Backed Keys: All keys protected by FIPS 140-2 validated HSMs (Key Vault Premium is validated at Level 2; Azure Managed HSM is required for Level 3).
- Private Endpoint Only: No public internet access.
Example Secrets (Production):
# Connection strings (managed identity only)
az keyvault secret set \
--vault-name atp-keyvault-prod-eus \
--name SqlConnectionString \
--value "Server=atp-sql-prod-eus.database.windows.net;Database=ATP_Prod;Authentication=Active Directory Managed Identity;" \
--description "Production SQL Connection (Managed Identity Only)" \
--tags Environment=Production Compliance=GDPR,HIPAA,SOC2 Criticality=High AutoRotate=true
# Per-tenant encryption keys (HSM-backed)
az keyvault key create \
--vault-name atp-keyvault-prod-eus \
--name TenantKEK-tenant-12345 \
--kty RSA-HSM \
--size 4096 \
--ops wrapKey unwrapKey \
--protection hsm \
--tags TenantId=tenant-12345 KeyType=KEK AutoRotate=true RotationDays=90
# JWT signing keys (auto-rotated)
az keyvault key create \
--vault-name atp-keyvault-prod-eus \
--name JwtSigningKey \
--kty RSA-HSM \
--size 2048 \
--ops sign verify \
--protection hsm \
--tags KeyType=JwtSigning AutoRotate=true RotationDays=30
Secret Categories¶
ATP manages five primary secret categories, each with specific security controls, rotation policies, and access patterns.
| Secret Type | Example | Dev/Test | Staging/Prod | Rotation | Storage |
|---|---|---|---|---|---|
| Connection Strings | SQL, Redis, Service Bus, Cosmos DB | Plaintext in appsettings.json | Key Vault reference | Quarterly (Test), Monthly (Prod) | Standard Key Vault Secret |
| API Keys | Third-party integrations, webhooks | Shared dev key | Unique per environment | On-demand (Dev), Monthly (Prod) | Standard Key Vault Secret |
| Certificates | mTLS, code signing, SSL/TLS | Self-signed, long-lived | Managed certificate from KV | Annual renewal | Key Vault Certificate |
| Encryption Keys | Per-tenant KEK, DEK | Single shared key | Per-tenant HSM-backed key | 90-day rotation | Key Vault Key (HSM) |
| JWT Signing Keys | Auth tokens, API tokens | Fixed dev key | Auto-rotated RSA key | Never (Dev), 30-day (Prod) | Key Vault Key (HSM) |
Connection Strings¶
Purpose: Database, cache, message queue, and storage connections.
Dev/Test Pattern:
// appsettings.Development.json (plaintext for convenience)
{
"ConnectionStrings": {
"DefaultConnection": "Server=atp-sql-dev-eus.database.windows.net;Database=ATP_Dev;User Id=devuser;Password=DevP@ss123!",
"Redis": "atp-redis-dev-eus.redis.cache.windows.net:6380,password=dev-redis-key,ssl=True"
}
}
Staging/Production Pattern:
// appsettings.Production.json (Key Vault references; resolved by the App Service platform, not by .NET configuration itself)
{
"ConnectionStrings": {
"DefaultConnection": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-prod-eus.vault.azure.net/secrets/SqlConnectionString)",
"Redis": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-prod-eus.vault.azure.net/secrets/RedisConnectionString)"
}
}
Managed Identity Authentication (Preferred for Prod):
{
"ConnectionStrings": {
"DefaultConnection": "Server=atp-sql-prod-eus.database.windows.net;Database=ATP_Prod;Authentication=Active Directory Managed Identity;"
}
}
Rotation Strategy:
- Test: Quarterly (every 90 days).
- Staging/Production: Monthly (every 30 days) with automated rotation via Azure Functions.
API Keys¶
Purpose: Third-party API integrations, webhook endpoints, external service authentication.
Dev/Test Pattern:
# Shared development key (not secret)
az keyvault secret set \
--vault-name atp-keyvault-dev-eus \
--name ThirdPartyApiKey \
--value "dev-api-key-shared-12345" \
--tags Environment=Dev Shared=true
Production Pattern:
# Unique per-environment key with metadata
az keyvault secret set \
--vault-name atp-keyvault-prod-eus \
--name ThirdPartyApiKey-Prod \
--value $(Generate-SecureApiKey) \
--description "Production API key for ExternalServiceX" \
--tags Environment=Production AutoRotate=true RotationDays=30 Criticality=High \
--expires $(date -d "+90 days" +%Y-%m-%dT%H:%M:%SZ)
Rotation Strategy:
- Dev: On-demand (when compromised).
- Test: Quarterly.
- Production: Monthly automated rotation with overlap period (old key valid for 7 days during rotation).
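The 7-day overlap rule above can be expressed as a small validity check. This sketch encodes the assumed policy (it is not an Azure API): the current key version is always valid, and the superseded version stays valid only within the overlap window after rotation.

```python
from datetime import datetime, timedelta

# Overlap policy from the rotation strategy above: after rotation, the
# previous key version remains valid for 7 days so in-flight callers can
# finish migrating to the new key.
OVERLAP = timedelta(days=7)

def is_key_version_valid(rotated_at: datetime, now: datetime,
                         is_current: bool) -> bool:
    if is_current:
        return True
    return now - rotated_at <= OVERLAP

if __name__ == "__main__":
    rotated = datetime(2024, 6, 1)
    print(is_key_version_valid(rotated, datetime(2024, 6, 5), is_current=False))  # True (day 4)
    print(is_key_version_valid(rotated, datetime(2024, 6, 9), is_current=False))  # False (day 8)
```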
Certificates¶
Purpose: mTLS client/server authentication, code signing, SSL/TLS certificates.
Dev/Test Pattern:
# Self-signed certificate (long-lived, 1 year)
az keyvault certificate create \
--vault-name atp-keyvault-dev-eus \
--name DevMtlsCertificate \
--policy @self-signed-policy.json
Production Pattern:
# Managed certificate with auto-renewal
az keyvault certificate create \
--vault-name atp-keyvault-prod-eus \
--name ProdMtlsCertificate \
--policy @cert-policy.json
# cert-policy.json (issuer may be DigiCert or an internal CA)
{
"issuerParameters": {
"name": "DigiCert",
"certificateType": "OV-SSL"
},
"x509CertificateProperties": {
"subject": "CN=atp-ingestion-prod.connectsoft.com",
"subjectAlternativeNames": {
"dnsNames": [
"atp-ingestion-prod-eus.azurewebsites.net",
"atp-ingestion-prod.connectsoft.com"
]
},
"validityInMonths": 12
},
"lifetimeActions": [
{
"trigger": {
"daysBeforeExpiry": 30
},
"action": {
"actionType": "AutoRenew"
}
},
{
"trigger": {
"daysBeforeExpiry": 60
},
"action": {
"actionType": "EmailContacts"
}
}
]
}
Rotation Strategy:
- Dev/Test: Annual (365 days).
- Production: Annual with automated renewal 30 days before expiry.
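The lifetimeActions triggers in the policy above translate to two dates computed back from expiry. A quick sketch (illustrative helper, not Azure SDK code) for a certificate with 12-month validity:

```python
from datetime import date, timedelta

# Compute when the EmailContacts warning (60 days before expiry) and the
# AutoRenew action (30 days before expiry) fire, per the cert policy above.
def lifetime_triggers(expiry: date) -> dict:
    return {
        "email_contacts": expiry - timedelta(days=60),
        "auto_renew": expiry - timedelta(days=30),
    }

if __name__ == "__main__":
    t = lifetime_triggers(date(2025, 1, 1))
    print(t["auto_renew"])      # 2024-12-02
    print(t["email_contacts"])  # 2024-11-02
```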
Encryption Keys (KEK/DEK)¶
Purpose: Per-tenant encryption keys for data-at-rest encryption, envelope encryption pattern.
Dev/Test Pattern:
# Single shared key (software-protected)
az keyvault key create \
--vault-name atp-keyvault-dev-eus \
--name SharedEncryptionKey \
--kty RSA \
--size 2048 \
--ops encrypt decrypt wrapKey unwrapKey
Production Pattern:
# Per-tenant HSM-backed KEK
az keyvault key create \
--vault-name atp-keyvault-prod-eus \
--name TenantKEK-tenant-12345 \
--kty RSA-HSM \
--size 4096 \
--ops wrapKey unwrapKey \
--protection hsm \
--tags TenantId=tenant-12345 KeyType=KEK AutoRotate=true RotationDays=90
# Data Encryption Key (DEK) wrapped by KEK
# DEK is generated per audit segment and stored encrypted in database
Envelope Encryption Pattern:
// Encryption service using Key Vault KEK
public class TenantEncryptionService
{
private readonly KeyClient _keyClient;
public async Task<byte[]> EncryptAuditSegmentAsync(string tenantId, byte[] plaintext)
{
// 1. Generate ephemeral DEK (AES-256)
var dek = GenerateDataEncryptionKey();
// 2. Encrypt plaintext with DEK
var ciphertext = AesEncrypt(plaintext, dek);
// 3. Wrap DEK with tenant's KEK via a CryptographyClient
//    (WrapKeyAsync lives on CryptographyClient, not KeyClient)
var kekName = $"TenantKEK-{tenantId}";
var cryptoClient = _keyClient.GetCryptographyClient(kekName);
var wrapResult = await cryptoClient.WrapKeyAsync(KeyWrapAlgorithm.RsaOaep256, dek);
// 4. Store ciphertext + wrapped DEK together
return CombineCiphertextAndWrappedKey(ciphertext, wrapResult.EncryptedKey);
}
}
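The CombineCiphertextAndWrappedKey step above needs an unambiguous framing so the wrapped DEK can be recovered before decryption. This stdlib sketch assumes a simple layout (4-byte big-endian wrapped-key length, then the wrapped DEK, then the ciphertext) — the actual ATP format is not specified here, and the wrap/unwrap itself happens in Key Vault:

```python
import struct

# Length-prefixed framing: [4-byte BE length][wrapped DEK][ciphertext].
def combine(ciphertext: bytes, wrapped_dek: bytes) -> bytes:
    return struct.pack(">I", len(wrapped_dek)) + wrapped_dek + ciphertext

def split(blob: bytes):
    (key_len,) = struct.unpack_from(">I", blob)
    wrapped_dek = blob[4:4 + key_len]
    ciphertext = blob[4 + key_len:]
    return ciphertext, wrapped_dek

if __name__ == "__main__":
    # An RSA-4096 wrap produces a 512-byte wrapped key.
    blob = combine(b"segment-bytes", b"\x01" * 512)
    ct, wk = split(blob)
    print(ct == b"segment-bytes", len(wk))
```

Storing the wrapped DEK alongside the ciphertext is what makes the envelope self-contained: decryption needs only the blob plus access to the tenant's KEK.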
Rotation Strategy:
- Dev/Test: Never (static key for testing).
- Production: Quarterly (90 days) with zero-downtime rotation (new KEK version; old version remains valid for decryption).
JWT Signing Keys¶
Purpose: Cryptographic signing keys for JWT tokens, API authentication, service-to-service communication.
Dev/Test Pattern:
# Fixed symmetric key (HS256)
az keyvault secret set \
--vault-name atp-keyvault-dev-eus \
--name JwtSigningKey \
--value "dev-jwt-secret-key-do-not-use-in-production-256bit"
Production Pattern:
# Asymmetric RSA key pair (RS256, HSM-backed)
az keyvault key create \
--vault-name atp-keyvault-prod-eus \
--name JwtSigningKey \
--kty RSA-HSM \
--size 2048 \
--ops sign verify \
--protection hsm \
--tags KeyType=JwtSigning AutoRotate=true RotationDays=30
# Public key published to JWKS endpoint for verification
JWT Signing Implementation:
// JWT signing with Key Vault RSA key
public class JwtTokenService
{
private readonly CryptographyClient _cryptoClient;
public async Task<string> GenerateTokenAsync(ClaimsPrincipal user)
{
var header = new { alg = "RS256", typ = "JWT", kid = "JwtSigningKey" };
var payload = new { sub = user.Identity.Name, exp = DateTimeOffset.UtcNow.AddHours(1).ToUnixTimeSeconds() };
var headerEncoded = Base64UrlEncode(JsonSerializer.Serialize(header));
var payloadEncoded = Base64UrlEncode(JsonSerializer.Serialize(payload));
var message = $"{headerEncoded}.{payloadEncoded}";
// Sign with Key Vault RSA key
var signature = await _cryptoClient.SignDataAsync(SignatureAlgorithm.RS256, Encoding.UTF8.GetBytes(message));
var signatureEncoded = Base64UrlEncode(signature.Signature);
return $"{message}.{signatureEncoded}";
}
}
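For contrast with the RS256 production path above, the Dev-tier pattern (fixed symmetric JwtSigningKey, HS256) can be shown end-to-end with the standard library alone. This is a teaching sketch, not a substitute for a vetted JWT library:

```python
import base64, hashlib, hmac, json

# Dev-tier HS256 JWT: HMAC-SHA256 over "header.payload" with the shared
# symmetric key. Production replaces this with RS256 via CryptographyClient.
def b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def sign_hs256(payload: dict, secret: bytes) -> str:
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = b64url(json.dumps(payload).encode())
    message = f"{header}.{body}"
    sig = hmac.new(secret, message.encode(), hashlib.sha256).digest()
    return f"{message}.{b64url(sig)}"

def verify_hs256(token: str, secret: bytes) -> bool:
    message, _, sig = token.rpartition(".")
    expected = hmac.new(secret, message.encode(), hashlib.sha256).digest()
    return hmac.compare_digest(b64url(expected), sig)

if __name__ == "__main__":
    key = b"dev-jwt-secret-key-do-not-use-in-prod"
    token = sign_hs256({"sub": "atp-ingestion"}, key)
    print(verify_hs256(token, key))           # True
    print(verify_hs256(token, b"wrong-key"))  # False
```

The asymmetric production scheme exists precisely so verifiers only ever hold the public key; with HS256, anyone who can verify can also forge.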
Rotation Strategy:
- Dev/Test: Never (fixed key for consistency).
- Production: Monthly (30 days) with overlap period (old key remains valid for verification for 7 days).
Secret Rotation Policies¶
Automated Rotation (Production):
// Azure Function for automated secret rotation
[FunctionName("RotateSecrets")]
public async Task RunAsync(
[TimerTrigger("0 0 2 1 * *")] TimerInfo timer, // Monthly on 1st at 2 AM
ILogger log)
{
var secretsToRotate = new[]
{
"SqlConnectionString",
"RedisConnectionString",
"ServiceBusConnectionString",
"JwtSigningKey"
};
foreach (var secretName in secretsToRotate)
{
log.LogInformation($"Rotating secret: {secretName}");
// 1. Generate new secret value
var newSecretValue = await GenerateNewSecretAsync(secretName);
// 2. Create new version in Key Vault
await _secretClient.SetSecretAsync(secretName, newSecretValue);
// 3. Wait for services to pick up new secret (15-minute cache expiration)
await Task.Delay(TimeSpan.FromMinutes(20));
// 4. Verify all services using new secret
var healthCheckPassed = await VerifyServicesHealthAsync();
if (!healthCheckPassed)
{
log.LogError($"Health check failed after rotating {secretName}. Rolling back...");
await RollbackSecretAsync(secretName);
throw new Exception($"Secret rotation failed for {secretName}");
}
log.LogInformation($"✅ Successfully rotated secret: {secretName}");
}
}
Rotation Schedule:
| Environment | Connection Strings | API Keys | Certificates | Encryption Keys | JWT Keys |
|---|---|---|---|---|---|
| Dev | On-demand | On-demand | Annual | Never | Never |
| Test | Quarterly | Quarterly | Annual | Never | Never |
| Staging | Monthly | Monthly | Annual | Quarterly | Monthly |
| Production | Monthly (automated) | Monthly (automated) | Annual (auto-renew) | Quarterly (automated) | Monthly (automated) |
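The schedule table above maps directly to a lookup plus date arithmetic. This sketch encodes a representative subset (the table keys and the next_rotation helper are assumptions for illustration):

```python
from datetime import date, timedelta

# Subset of the rotation schedule table above; None means "never rotated".
ROTATION_DAYS = {
    ("Dev", "EncryptionKeys"): None,
    ("Test", "ConnectionStrings"): 90,
    ("Staging", "ConnectionStrings"): 30,
    ("Production", "ConnectionStrings"): 30,
    ("Production", "EncryptionKeys"): 90,
    ("Production", "JwtKeys"): 30,
}

def next_rotation(env: str, secret_type: str, last_rotated: date):
    """Return the next due date, or None for never-rotated secrets."""
    days = ROTATION_DAYS.get((env, secret_type))
    return None if days is None else last_rotated + timedelta(days=days)

if __name__ == "__main__":
    print(next_rotation("Production", "ConnectionStrings", date(2024, 6, 1)))  # 2024-07-01
    print(next_rotation("Dev", "EncryptionKeys", date(2024, 6, 1)))            # None
```

A scheduled job (like the Azure Function above) can use this kind of table to decide which secrets are due on each run instead of hard-coding the secret list.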
Managed Identity Access Patterns¶
App Service Managed Identity (Staging/Production):
// Program.cs - Managed Identity for Key Vault access
public static IHostBuilder CreateHostBuilder(string[] args) =>
Host.CreateDefaultBuilder(args)
.ConfigureAppConfiguration((context, config) =>
{
if (context.HostingEnvironment.IsProduction())
{
var builtConfig = config.Build();
var keyVaultEndpoint = builtConfig["KeyVault:Endpoint"];
// DefaultAzureCredential: tries Managed Identity first
var credential = new DefaultAzureCredential();
config.AddAzureKeyVault(
new Uri(keyVaultEndpoint),
credential,
new AzureKeyVaultConfigurationOptions
{
ReloadInterval = TimeSpan.FromMinutes(15)
});
}
});
AKS Pod Identity (Production):
# Azure AD Pod Identity (deprecated; use Workload Identity)
apiVersion: aadpodidentity.k8s.io/v1
kind: AzureIdentity
metadata:
name: atp-prod-identity
namespace: atp-prod
spec:
type: 0 # Managed Identity
resourceID: /subscriptions/<sub-id>/resourceGroups/ATP-Prod-RG/providers/Microsoft.ManagedIdentity/userAssignedIdentities/atp-prod-mi
clientID: <managed-identity-client-id>
---
apiVersion: aadpodidentity.k8s.io/v1
kind: AzureIdentityBinding
metadata:
name: atp-prod-identity-binding
namespace: atp-prod
spec:
azureIdentity: atp-prod-identity
selector: atp-prod-pods # Pod label selector
AKS Workload Identity (Recommended for Production):
# ServiceAccount with Workload Identity
apiVersion: v1
kind: ServiceAccount
metadata:
name: atp-prod-sa
namespace: atp-prod
annotations:
azure.workload.identity/client-id: <managed-identity-client-id>
---
# Deployment using Workload Identity
apiVersion: apps/v1
kind: Deployment
metadata:
name: atp-ingestion
namespace: atp-prod
spec:
template:
metadata:
labels:
azure.workload.identity/use: "true"
spec:
serviceAccountName: atp-prod-sa
containers:
- name: atp-ingestion
image: connectsoft.azurecr.io/atp/ingestion:1.0.0
env:
- name: AZURE_CLIENT_ID
value: <managed-identity-client-id>
- name: KEY_VAULT_ENDPOINT
value: https://atp-keyvault-prod-eus.vault.azure.net/
Key Vault CSI Integration (AKS)¶
Purpose: Mount Key Vault secrets as volumes in Kubernetes pods for seamless secret injection without environment variables.
SecretProviderClass (Production):
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
name: atp-prod-secrets
namespace: atp-prod
spec:
provider: azure
parameters:
usePodIdentity: "false" # Use Workload Identity instead
useVMManagedIdentity: "false"
clientID: <managed-identity-client-id>
keyvaultName: atp-keyvault-prod-eus
tenantId: <tenant-id>
objects: |
array:
- objectName: SqlConnectionString
objectType: secret
objectAlias: sql-connection-string
- objectName: RedisConnectionString
objectType: secret
objectAlias: redis-connection-string
- objectName: ServiceBusConnectionString
objectType: secret
objectAlias: servicebus-connection-string
- objectName: TenantKEK-tenant-12345
objectType: key
objectAlias: tenant-kek-12345
- objectName: JwtSigningKey
objectType: key
objectAlias: jwt-signing-key
- objectName: MtlsClientCertificate
objectType: cert
objectAlias: mtls-client-cert
secretObjects:
- secretName: atp-sql-secret
type: Opaque
data:
- objectName: sql-connection-string
key: connectionString
- secretName: atp-redis-secret
type: Opaque
data:
- objectName: redis-connection-string
key: connectionString
Deployment with CSI Secrets (Production):
apiVersion: apps/v1
kind: Deployment
metadata:
name: atp-ingestion
namespace: atp-prod
spec:
replicas: 3
selector:
matchLabels:
app: atp-ingestion
template:
metadata:
labels:
app: atp-ingestion
azure.workload.identity/use: "true"
spec:
serviceAccountName: atp-prod-sa
containers:
- name: atp-ingestion
image: connectsoft.azurecr.io/atp/ingestion:1.0.0
volumeMounts:
- name: secrets-store
mountPath: "/mnt/secrets"
readOnly: true
env:
- name: ConnectionStrings__DefaultConnection
valueFrom:
secretKeyRef:
name: atp-sql-secret
key: connectionString
- name: ConnectionStrings__Redis
valueFrom:
secretKeyRef:
name: atp-redis-secret
key: connectionString
volumes:
- name: secrets-store
csi:
driver: secrets-store.csi.k8s.io
readOnly: true
volumeAttributes:
secretProviderClass: atp-prod-secrets
Benefits of CSI Integration:
- Automatic Secret Refresh: Secrets updated in Key Vault are automatically synced to pods (polling interval: 2 minutes).
- Reduced Env-Var Exposure: Secrets can be consumed from mounted files rather than environment variables (the deployment example maps synced Kubernetes Secrets into env vars only for .NET configuration binding).
- File-Based Access: Applications read secrets from mounted files (/mnt/secrets/sql-connection-string).
- Atomic Updates: Secret updates are atomic (no partial reads during rotation).
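File-based access is a small amount of application code. This sketch (paths and the env-var fallback are illustrative assumptions, not ATP code) reads a secret from the CSI mount path and falls back to an environment variable for local development:

```python
import os
import tempfile

# Read a CSI-mounted secret by its objectAlias (e.g. sql-connection-string),
# falling back to an environment variable such as SQL_CONNECTION_STRING for
# local development where no volume is mounted.
def read_secret(alias: str, mount_dir: str = "/mnt/secrets") -> str:
    path = os.path.join(mount_dir, alias)
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return f.read().strip()
    env_key = alias.replace("-", "_").upper()
    value = os.environ.get(env_key)
    if value is None:
        raise KeyError(f"secret {alias!r} not found in {mount_dir} or ${env_key}")
    return value

if __name__ == "__main__":
    # Simulate a CSI mount with a temporary directory.
    with tempfile.TemporaryDirectory() as d:
        with open(os.path.join(d, "sql-connection-string"), "w") as f:
            f.write("Server=...;\n")
        print(read_secret("sql-connection-string", mount_dir=d))  # Server=...;
```

Re-reading the file on each use (or on a short TTL) is what lets the 2-minute CSI sync deliver rotated secrets without a pod restart.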
Secret Management Best Practices¶
Security:
- Never commit secrets: Enforce pre-commit hooks to detect secrets in source control.
- Managed Identity only (Prod): No service principals, passwords, or API keys for production services.
- Rotate regularly: Automated monthly rotation for production with overlap periods.
- Least privilege: Grant minimum required permissions (get, list only; never set, delete).
- Private endpoints: Production Key Vaults accessible only via private network.
Operational:
- Audit logging: Enable diagnostic logs for all Key Vault operations; integrate with Azure Sentinel.
- Soft delete + purge protection: Prevent accidental deletion; 90-day retention for recovery.
- Backup: Automated daily backups with geo-redundancy for Staging/Production.
- Health checks: Validate Key Vault connectivity and secret retrieval in application health checks.
- Break-glass procedures: Document emergency access procedures with strict approval workflows.
Compliance:
- Encryption at rest: All secrets encrypted with Microsoft-managed keys; HSM-backed for Prod.
- Encryption in transit: TLS 1.2+ for all Key Vault API calls.
- Compliance tags: Tag secrets with compliance scope (GDPR, HIPAA, SOC2).
- Audit retention: 7-year audit log retention for compliance evidence.
- Access reviews: Quarterly review of Key Vault access policies; remove stale permissions.
Emergency Access Procedures¶
Break-Glass Access (Production):
# Break-Glass Access Policy (Production Key Vault)
accessPolicies:
- tenantId: <tenant-id>
objectId: <break-glass-emergency-group-id>
permissions:
secrets: [get] # Read-only
# Conditional Access Requirements:
# - Multi-Factor Authentication: Required
# - Compliant Device: Required
# - Trusted Location: VPN only
# - Just-in-Time Access: PIM activation (4-hour max)
# - Approval: 2 approvers (Security Officer + Incident Commander)
# - Audit: Real-time alert to SIEM; Slack notification to #security-alerts
Emergency Access Workflow:
flowchart TD
A[P0 Incident Requires Secret Access] --> B[Request PIM Elevation]
B --> C[2 Approvers Review]
C --> D{Approved?}
D -->|No| E[Access Denied + Audit Log]
D -->|Yes| F[4-Hour JIT Access Granted]
F --> G[Access Key Vault via VPN]
G --> H[Retrieve Secret]
H --> I[Real-Time Alert to SIEM]
I --> J[Incident Resolution]
J --> K[PIM Access Expires]
K --> L[Post-Incident Review]
Break-Glass Secret Retrieval (Azure CLI):
# 1. Activate PIM role (requires approval)
#    (illustrative: the Azure CLI has no first-class `az ad pim` group;
#    PIM activation is typically done via the portal or the PIM REST API)
az ad pim role-assignment request create \
--resource-id /subscriptions/<sub-id>/resourceGroups/ATP-Prod-RG \
--role-definition-id <key-vault-secrets-user-role-id> \
--principal-id <my-user-object-id> \
--duration PT4H \
--reason "P0 Incident #12345: Production SQL connection failure"
# 2. Wait for approval (~5-10 minutes)
# 3. Connect via VPN (required by Conditional Access)
# 4. Retrieve secret (audited)
az keyvault secret show \
--vault-name atp-keyvault-prod-eus \
--name SqlConnectionString \
--query value -o tsv
# 5. Use secret for incident resolution
# 6. PIM access expires after 4 hours
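The `--duration PT4H` value above is an ISO-8601 duration. A small helper (illustrative, not part of ATP) makes the JIT expiry computation explicit:

```python
from datetime import datetime, timedelta
import re

def jit_expiry(activated_at, duration):
    """Compute when a JIT (PIM) activation lapses from an ISO-8601
    duration such as 'PT4H' or 'PT30M' (hours/minutes only)."""
    m = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?", duration)
    if not m:
        raise ValueError(f"unsupported duration: {duration}")
    hours, minutes = (int(g) if g else 0 for g in m.groups())
    return activated_at + timedelta(hours=hours, minutes=minutes)

activated = datetime(2025, 11, 1, 22, 15)
print(jit_expiry(activated, "PT4H"))  # 2025-11-02 02:15:00
```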
Summary¶
- Key Vault Per Environment: Dedicated Key Vault for each environment with graduated security controls (Dev: developer-accessible → Prod: zero human access).
- Secret Categories: Five primary categories (Connection Strings, API Keys, Certificates, Encryption Keys, JWT Keys) with environment-specific patterns.
- Managed Identity Only (Prod): No service principals or passwords; all production access via managed identities.
- Automated Rotation: Monthly automated rotation for production with zero-downtime and overlap periods.
- Key Vault CSI: AKS integration for seamless secret injection as mounted volumes with automatic refresh.
- Break-Glass Access: Emergency access procedures with strict conditional access, JIT approval, and comprehensive auditing.
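The "automated rotation with overlap periods" bullet implies a small state machine: a superseded secret version stays valid for a grace window so consumers can cut over without downtime. A minimal sketch (the interval and overlap lengths are assumptions, not ATP's actual values):

```python
from datetime import date

def rotation_state(created, today, rotation_days=30, overlap_days=7):
    """Classify a secret version during monthly rotation with an overlap
    window in which old and new versions are both accepted."""
    age = (today - created).days
    if age < rotation_days:
        return "current"
    if age < rotation_days + overlap_days:
        return "overlap"   # superseded but still accepted (zero downtime)
    return "expired"

created = date(2025, 10, 1)
assert rotation_state(created, date(2025, 10, 20)) == "current"   # day 19
assert rotation_state(created, date(2025, 11, 3)) == "overlap"    # day 33
assert rotation_state(created, date(2025, 11, 10)) == "expired"   # day 40
print("rotation states ok")
```

Validation logic that accepts both `current` and `overlap` versions is what makes the monthly rotation zero-downtime.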
Environment Promotion Workflow¶
ATP's environment promotion strategy implements a graduated deployment pipeline where code progresses through increasingly production-like environments with escalating quality gates and approval requirements. This approach ensures defects are caught early in lower environments while maintaining production stability through rigorous validation and controlled change management.
The promotion workflow balances deployment velocity (automated Dev/Test) with risk mitigation (manual approvals for Staging/Production) and provides fast rollback mechanisms at every tier to minimize incident impact.
Promotion Lanes¶
ATP uses a multi-lane promotion strategy supporting both regular feature releases and emergency hotfixes:
flowchart TD
subgraph Feature Development
A[Feature Branch] --> B[Open Pull Request]
B --> C[Preview Environment Created]
C --> D{PR Tests Pass?}
D -->|No| E[Fix in Branch]
E --> C
D -->|Yes| F[Merge to Main]
end
F --> G[CI Build Pipeline]
subgraph Continuous Integration
G --> H[Dev Environment]
H --> I{Smoke Tests Pass?}
I -->|No| J[Alert Dev Team]
I -->|Yes| K[24-Hour Soak]
end
subgraph Continuous Delivery
K --> L[Test Environment]
L --> M{Regression Tests Pass?}
M -->|No| N[Block Promotion]
M -->|Yes| O[Manual Approval Required]
O --> P[Staging Environment]
P --> Q{Load + Chaos Tests Pass?}
Q -->|No| R[Fix Issues]
Q -->|Yes| S[CAB Approval + 2 Approvers]
S --> T[Production Environment]
end
subgraph Hotfix Lane
U[P0/P1 Incident] --> V[Hotfix Branch]
V --> W[Hotfix Environment]
W --> X{Validation Pass?}
X -->|No| V
X -->|Yes| Y[Expedited Approvals]
Y --> T
end
T --> Z[Post-Deployment Monitoring]
Z --> AA{Metrics Healthy?}
AA -->|No| AB[Automated Rollback]
AA -->|Yes| AC[Deployment Complete]
style A fill:#87CEEB
style H fill:#90EE90
style L fill:#FFD700
style P fill:#FFA500
style T fill:#FF6347
style W fill:#FF69B4
Standard Promotion Lane (Feature Releases):
1. feature-branch → Pull Request → Preview Environment (ephemeral)
↓ (PR approved + tests pass)
2. Merge to main → CI Build
↓ (build + tests + security scans pass)
3. main → Dev Environment (auto-deploy)
↓ (smoke tests pass + 24-hour soak)
4. Dev → Test Environment (auto-deploy)
↓ (regression tests pass)
5. Test → Staging Environment (manual approval: 1 Lead Engineer)
↓ (load tests + chaos tests pass)
6. Staging → Production Environment (manual approval: 2 approvers + CAB)
↓ (canary deployment with metrics validation)
7. Production (stable) → Monitoring & Observability
Hotfix Lane (Emergency Fixes):
1. P0/P1 Incident → Hotfix Branch
↓ (fix developed)
2. Hotfix Branch → Hotfix Environment (on-demand provision)
↓ (targeted tests pass)
3. Hotfix Environment → Production (expedited approval: 2 approvers within 2 hours)
↓ (30-minute intensive monitoring)
4. Production (stable) → Merge to main + Decommission Hotfix Environment
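Both lanes reduce to a transition table. The sketch below encodes the approvals and SLAs listed above; the dict is illustrative documentation shorthand, not an ATP API:

```python
# Approval requirements per environment transition, as described in the lanes above.
APPROVAL_GATES = {
    ("dev", "test"):           {"approvers": 0, "sla_hours": 0},
    ("test", "staging"):       {"approvers": 1, "sla_hours": 4},
    ("staging", "production"): {"approvers": 2, "sla_hours": 24, "cab": True},
    ("hotfix", "production"):  {"approvers": 2, "sla_hours": 2, "cab": True},
}

def required_gate(source, target):
    """Look up the gate for a promotion; anything else is not a valid lane."""
    try:
        return APPROVAL_GATES[(source.lower(), target.lower())]
    except KeyError:
        raise ValueError(f"no promotion lane from {source} to {target}")

print(required_gate("Staging", "Production"))  # {'approvers': 2, 'sla_hours': 24, 'cab': True}
```

Note that ("dev", "production") is deliberately absent: there is no lane that skips Test and Staging.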
Approval Gates¶
Each environment transition has specific approval requirements, validation criteria, and SLA commitments to ensure quality while maintaining deployment velocity.
Dev → Test (Automated)¶
Approvers: None (fully automated)
Requirements:
- Smoke Tests: All critical path smoke tests pass (health checks, basic API calls).
- Dev Soak Period: Deployed to Dev for minimum 24 hours without critical incidents.
- Build Artifacts: All artifacts published successfully (binaries, Docker images, SBOM).
- Security Scans: No critical/high vulnerabilities in SAST, dependency, or container scans.
Implementation (Azure Pipelines):
# Dev → Test promotion (automated)
- stage: Deploy_Test
displayName: 'Promote to Test Environment'
dependsOn: Deploy_Dev
condition: |
and(
succeeded(),
ge(variables['Dev.SoakHours'], 24), # Dev deployed 24+ hours ago
eq(variables['Dev.CriticalIncidents'], '0'), # No critical incidents in Dev
eq(variables['Build.SourceBranch'], 'refs/heads/main') # Only main branch
)
jobs:
- deployment: PromoteToTest
environment: ATP-Test # No manual approval
strategy:
runOnce:
preDeploy:
steps:
- script: |
echo "Validating Dev environment stability..."
# Query Dev error rate (last 24 hours)
DEV_ERROR_RATE=$(az monitor app-insights metrics show \
--app atp-appinsights-dev-eus \
--metric "requests/failed" \
--aggregation avg \
--offset 24h \
--query 'value.segments[0]."requests/failed".avg')
if (( $(echo "$DEV_ERROR_RATE > 0.10" | bc -l) )); then
echo "❌ Dev error rate too high: $DEV_ERROR_RATE%"
exit 1
fi
# Query Dev incidents
DEV_INCIDENTS=$(az monitor activity-log list \
--resource-group ATP-Dev-RG \
--offset 24h \
--query "[?level=='Critical'] | length(@)")
if [ "$DEV_INCIDENTS" -gt "0" ]; then
echo "❌ Critical incidents detected in Dev"
exit 1
fi
echo "✅ Dev environment stable; promoting to Test"
displayName: 'Validate Dev Stability'
deploy:
steps:
- template: deploy/deploy-microservice-to-azure-web-site.yaml@templates
parameters:
azureSubscription: $(azureSubscription)
appName: atp-ingestion-test-eus
package: $(Pipeline.Workspace)/drop/*.zip
postDeployment:
steps:
- script: |
# Run smoke tests
dotnet test tests/Smoke.Tests.csproj \
--environment Test \
--logger "trx;LogFileName=test-smoke-results.trx" \
--filter "Priority=1"
if [ $? -ne 0 ]; then
echo "❌ Smoke tests failed; rolling back Test deployment"
exit 1
fi
echo "✅ Test deployment successful"
displayName: 'Post-Deployment Smoke Tests'
SLA: Immediate (automated promotion within 5 minutes of Dev validation).
Rollback: Automatic if post-deployment smoke tests fail.
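The soak-period condition above amounts to an elapsed-time check plus an incident count. A self-contained sketch of that gate logic (variable names are illustrative):

```python
from datetime import datetime, timezone

def soak_hours(deployed_at, now=None):
    """Hours a build has soaked in an environment -- what the pipeline
    condition reads as Dev.SoakHours."""
    now = now or datetime.now(timezone.utc)
    return (now - deployed_at).total_seconds() / 3600

def may_promote(deployed_at, critical_incidents, now=None, min_soak=24):
    """Automated Dev -> Test gate: enough soak time and zero critical incidents."""
    return soak_hours(deployed_at, now) >= min_soak and critical_incidents == 0

deployed = datetime(2025, 11, 1, 8, 0, tzinfo=timezone.utc)
now = datetime(2025, 11, 2, 10, 0, tzinfo=timezone.utc)   # 26 hours later
assert may_promote(deployed, 0, now)        # soaked, no incidents
assert not may_promote(deployed, 1, now)    # any critical incident blocks promotion
print("promotion gate logic ok")
```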
Test → Staging (Manual Approval)¶
Approvers: 1 Lead Engineer
Requirements:
- Regression Tests: Full regression test suite passes (100% pass rate required).
- No P1/P2 Bugs: Zero high-priority bugs in Test environment.
- Performance Benchmarks: Response time p95 within acceptable thresholds.
- Test Soak Period: Minimum 48 hours in Test without incidents.
- Database Migrations: All migrations tested successfully in Test.
Implementation (Azure DevOps Environment):
# Azure DevOps Environment Configuration
# Environment: ATP-Staging
approvals:
- type: manual
approvers:
- lead-engineer@connectsoft.example
minRequired: 1
timeoutInMinutes: 240 # 4-hour approval window
instructions: |
## Staging Promotion Checklist
Before approving, verify:
- [ ] All regression tests passed (100% green)
- [ ] No P1/P2 bugs in Test environment
- [ ] Performance benchmarks met (p95 < 500ms)
- [ ] Test deployed for 48+ hours without incidents
- [ ] Database migrations validated
- [ ] Security scans passed (no critical/high vulnerabilities)
- [ ] Change ticket created and approved
**Approval SLA**: 4 hours
Pipeline Stage (Test → Staging):
- stage: Deploy_Staging
displayName: 'Promote to Staging Environment'
dependsOn: Deploy_Test
condition: |
and(
succeeded(),
ge(variables['Test.SoakHours'], 48), # Test deployed 48+ hours
eq(variables['Test.RegressionTestsPass'], 'true'),
eq(variables['Test.P1P2BugCount'], '0'),
eq(variables['Build.SourceBranch'], 'refs/heads/main')
)
jobs:
- deployment: PromoteToStaging
environment: ATP-Staging # Requires 1 manual approval
timeoutInMinutes: 300 # 5-hour timeout (includes approval wait)
strategy:
runOnce:
preDeploy:
steps:
- script: |
echo "Pre-Staging Validation Checks..."
# Verify Test regression tests
# (illustrative: there is no built-in `az devops test` command group;
#  test-run results are normally fetched via the Azure DevOps REST API)
TEST_PASS_RATE=$(az devops test runs list \
--project ATP \
--query "[?state=='Completed' && startDate > '$(date -d '48 hours ago' --iso-8601)'].passRate" \
--output tsv | awk '{sum+=$1; count++} END {print sum/count}')
if (( $(echo "$TEST_PASS_RATE < 100" | bc -l) )); then
echo "❌ Test pass rate below 100%: $TEST_PASS_RATE%"
exit 1
fi
# Verify no high-priority bugs
P1_P2_BUGS=$(az boards query --wiql "SELECT [System.Id] FROM WorkItems WHERE [System.State] = 'Active' AND [Microsoft.VSTS.Common.Priority] <= 2 AND [System.Tags] CONTAINS 'ATP-Test'" --output tsv | wc -l)
if [ "$P1_P2_BUGS" -gt "0" ]; then
echo "❌ Found $P1_P2_BUGS P1/P2 bugs in Test"
exit 1
fi
echo "✅ Test environment ready for Staging promotion"
displayName: 'Pre-Staging Validation'
deploy:
steps:
- template: deploy/deploy-microservice-to-azure-web-site.yaml@templates
parameters:
azureSubscription: $(azureSubscription)
appName: atp-ingestion-staging-eus
package: $(Pipeline.Workspace)/drop/*.zip
slotName: blue # Blue-green deployment
routeTraffic:
steps:
- task: AzureAppServiceManage@0
displayName: 'Swap Blue → Production Slot'
inputs:
azureSubscription: $(azureSubscription)
action: 'Swap Slots'
webAppName: atp-ingestion-staging-eus
sourceSlot: blue
targetSlot: production
postRouteTraffic:
steps:
- script: |
# Post-deployment validation
echo "Running Staging post-deployment tests..."
# Health checks
for i in {1..10}; do
curl -f https://atp-gateway-staging-eus.azurewebsites.net/health && break || sleep 30
done
# Smoke tests
dotnet test tests/Smoke.Tests.csproj --environment Staging
# Light load test
k6 run --vus 50 --duration 5m tests/load/staging-validation.js
echo "✅ Staging deployment validated"
displayName: 'Post-Staging Validation'
SLA: 4 hours (time from approval request to approver response).
Notifications:
# Azure DevOps Service Hook (Slack notification)
- trigger: Approval Pending
action: Send Slack message
target: #staging-approvals channel
message: |
🟡 Staging Approval Required
Build: $(Build.BuildNumber)
Branch: $(Build.SourceBranch)
Requester: $(Build.RequestedFor)
Approve: $(Environment.ApprovalUrl)
Checklist:
- Regression tests: $(Test.RegressionTestsPass)
- P1/P2 bugs: $(Test.P1P2BugCount)
- Soak period: $(Test.SoakHours) hours
Staging → Production (CAB + 2 Approvers)¶
Approvers: 2 required (Platform Architect + SRE Lead)
Requirements:
- CAB Approval: Change Advisory Board approval with documented change ticket.
- Staging Soak: Minimum 48 hours in Staging without incidents.
- Load Tests: Load tests pass at 80% expected peak production load.
- Chaos Tests: Chaos engineering tests pass (pod failures, network latency, database throttling).
- Security Review: Final security review completed; no pending vulnerabilities.
- Change Window: Deployment scheduled during approved maintenance window.
- Rollback Plan: Documented rollback procedure with RTO < 5 minutes.
Implementation (Azure DevOps Environment):
# Azure DevOps Environment Configuration
# Environment: ATP-Production
approvals:
- type: manual
approvers:
- platform-architect@connectsoft.example
- sre-lead@connectsoft.example
minRequired: 2 # Both approvers must approve
timeoutInMinutes: 1440 # 24-hour approval window
instructions: |
## Production Promotion Checklist (CAB Required)
**Prerequisites**:
- [ ] CAB approval obtained (change ticket: CHG-XXXXX)
- [ ] Staging deployed for 48+ hours (no incidents)
- [ ] Load tests passed (80% peak load)
- [ ] Chaos tests passed (pod failures, network latency)
- [ ] Security review completed (no critical/high vulnerabilities)
- [ ] Change window scheduled (approved maintenance slot)
- [ ] Rollback plan documented and reviewed
- [ ] On-call rotation confirmed (SRE coverage)
- [ ] Stakeholder notification sent
- [ ] Database backup verified (< 24 hours old)
**Deployment Details**:
- Build: $(Build.BuildNumber)
- Commit: $(Build.SourceVersion)
- Requester: $(Build.RequestedFor)
- Scheduled Window: [Specify date/time]
**Approval SLA**: 24 hours
**Rollback RTO**: < 5 minutes (canary rollback or slot swap)
checks:
- type: gate
displayName: 'Verify CAB Approval'
evaluationMode: Sequential
timeout: 1440
gates:
- task: InvokeRESTAPI@1
inputs:
serviceConnection: 'ChangeManagementAPI'
method: 'GET'
urlSuffix: '/api/change-tickets/$(ChangeTicketId)/status'
successCriteria: 'eq(root.status, "Approved")'
- type: gate
displayName: 'Verify No Active Incidents'
evaluationMode: Sequential
timeout: 1440
gates:
- task: AzureCLI@2
inputs:
scriptType: 'bash'
scriptLocation: 'inlineScript'
inlineScript: |
INCIDENTS=$(az monitor activity-log list \
--resource-group ATP-Staging-RG \
--offset 48h \
--query "[?level=='Critical' || level=='Error'] | length(@)")
if [ "$INCIDENTS" -gt "0" ]; then
echo "Active incidents detected: $INCIDENTS"
exit 1
fi
Pipeline Stage (Staging → Production):
- stage: Deploy_Production
displayName: 'Promote to Production Environment'
dependsOn: Deploy_Staging
condition: |
and(
succeeded(),
eq(variables['Build.Reason'], 'Manual'), # Manual trigger only
ge(variables['Staging.SoakHours'], 48),
eq(variables['CAB.Approved'], 'true'),
eq(variables['Build.SourceBranch'], 'refs/heads/main')
)
jobs:
- deployment: PromoteToProduction
environment: ATP-Production # Requires 2 approvals + CAB gate
timeoutInMinutes: 600 # 10-hour timeout (includes approval + canary deployment)
strategy:
canary:
increments: [5, 20, 50] # 5% → 20% → 50% → 100%
preDeploy:
steps:
- script: |
echo "Pre-Production Safety Checks..."
# Verify Staging stability (48 hours)
STAGING_ERROR_RATE=$(az monitor app-insights metrics show \
--app atp-appinsights-staging-eus \
--metric "requests/failed" \
--aggregation avg \
--offset 48h \
--query 'value.segments[0]."requests/failed".avg')
if (( $(echo "$STAGING_ERROR_RATE > 0.01" | bc -l) )); then
echo "❌ Staging error rate too high: $STAGING_ERROR_RATE%"
exit 1
fi
# Verify load tests passed
# (illustrative: the Azure Load Testing CLI expects --load-test-resource
#  and --test-id; adapt to your load-testing resource)
LOAD_TEST_STATUS=$(az load test show \
--name staging-load-test-latest \
--query "status")
if [ "$LOAD_TEST_STATUS" != "Passed" ]; then
echo "❌ Load tests did not pass"
exit 1
fi
# Verify CAB approval
CAB_STATUS=$(curl -s https://changemanagement.connectsoft.local/api/tickets/$(ChangeTicketId) | jq -r '.status')
if [ "$CAB_STATUS" != "Approved" ]; then
echo "❌ CAB approval not obtained"
exit 1
fi
echo "✅ All pre-production checks passed"
displayName: 'Pre-Production Validation'
- task: AzureCLI@2
displayName: 'Create Production Backup Snapshot'
inputs:
scriptType: 'bash'
scriptLocation: 'inlineScript'
inlineScript: |
# Create database backup before deployment
az sql db copy \
--name ATP_Prod \
--resource-group ATP-Prod-RG \
--server atp-sql-prod-eus \
--dest-name ATP_Prod_Backup_$(Build.BuildId) \
--dest-resource-group ATP-Prod-Backups-RG \
--dest-server atp-sql-backup-eus
deploy:
steps:
- task: Kubernetes@1
displayName: 'Deploy Canary ($(strategy.increment)%)'
inputs:
connectionType: 'Azure Resource Manager'
azureSubscription: $(azureSubscription)
azureResourceGroup: 'ATP-Prod-RG'
kubernetesCluster: 'atp-aks-prod-eus'
command: 'apply'
arguments: '-f k8s/canary/atp-ingestion-canary-$(strategy.increment).yaml'
routeTraffic:
steps:
- script: |
echo "Routing $(strategy.increment)% traffic to canary..."
# Update Istio VirtualService for traffic splitting
kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: atp-ingestion-traffic
namespace: atp-prod
spec:
hosts:
- atp-ingestion.atp-prod.svc.cluster.local
http:
- match:
- headers:
canary:
exact: "true"
route:
- destination:
host: atp-ingestion-canary
subset: canary
weight: 100
- route:
- destination:
host: atp-ingestion
subset: stable
weight: $((100 - $(strategy.increment)))
- destination:
host: atp-ingestion-canary
subset: canary
weight: $(strategy.increment)
EOF
displayName: 'Configure Istio Traffic Split'
postRouteTraffic:
steps:
- script: |
echo "Monitoring canary deployment ($(strategy.increment)% traffic)..."
# Monitor for 15 minutes
sleep 900
# Query Application Insights for canary metrics
CANARY_ERROR_RATE=$(az monitor app-insights metrics show \
--app atp-appinsights-prod-eus \
--metric "requests/failed" \
--aggregation avg \
--offset 15m \
--filter "cloud_RoleName eq 'atp-ingestion-canary'" \
--query 'value.segments[0]."requests/failed".avg')
if (( $(echo "$CANARY_ERROR_RATE > 0.01" | bc -l) )); then
echo "❌ Canary error rate exceeded 1%: $CANARY_ERROR_RATE%"
exit 1
fi
CANARY_LATENCY_P95=$(az monitor app-insights metrics show \
--app atp-appinsights-prod-eus \
--metric "requests/duration" \
--aggregation percentile95 \
--offset 15m \
--filter "cloud_RoleName eq 'atp-ingestion-canary'" \
--query 'value.segments[0]."requests/duration".percentile95')
if (( $(echo "$CANARY_LATENCY_P95 > 1000" | bc -l) )); then
echo "❌ Canary p95 latency exceeded 1000ms: ${CANARY_LATENCY_P95}ms"
exit 1
fi
# Check for canary-specific errors in logs
CANARY_EXCEPTIONS=$(az monitor app-insights query \
--app atp-appinsights-prod-eus \
--analytics-query "exceptions | where cloud_RoleName == 'atp-ingestion-canary' and timestamp > ago(15m) | count" \
--query "tables[0].rows[0][0]")
if [ "$CANARY_EXCEPTIONS" -gt "10" ]; then
echo "❌ Canary has $CANARY_EXCEPTIONS exceptions"
exit 1
fi
echo "✅ Canary metrics healthy at $(strategy.increment)% traffic"
displayName: 'Validate Canary Metrics (15-min soak)'
on:
failure:
steps:
- script: |
echo "🔴 Canary deployment failed; initiating rollback..."
# Revert traffic to stable version (100% to stable)
kubectl apply -f k8s/canary/atp-ingestion-stable-only.yaml
# Notify on-call team
curl -X POST $(SlackWebhookUrl) \
-H 'Content-Type: application/json' \
-d '{
"text": "🚨 Production Canary Rollback",
"attachments": [{
"color": "danger",
"fields": [
{"title": "Build", "value": "$(Build.BuildNumber)", "short": true},
{"title": "Increment", "value": "$(strategy.increment)%", "short": true},
{"title": "Reason", "value": "Metrics threshold exceeded", "short": false}
]
}]
}'
# Create incident ticket
az boards work-item create \
--title "Production Canary Rollback - Build $(Build.BuildNumber)" \
--type "Incident" \
--description "Canary deployment failed at $(strategy.increment)% traffic. Metrics exceeded thresholds. Automatic rollback executed." \
--assigned-to "sre-team@connectsoft.example" \
--area "ATP/Production" \
--iteration "ATP/Current" \
--fields Priority=1 Severity="1 - Critical"
# Send PagerDuty alert
curl -X POST https://events.pagerduty.com/v2/enqueue \
-H 'Content-Type: application/json' \
-d '{
"routing_key": "$(PagerDutyRoutingKey)",
"event_action": "trigger",
"payload": {
"summary": "Production canary rollback for build $(Build.BuildNumber)",
"severity": "critical",
"source": "Azure DevOps"
}
}'
displayName: 'Rollback + Incident Response'
SLA: 24 hours (CAB approval + deployment scheduling).
Rollback: Automatic if canary metrics exceed thresholds; manual option available.
Hotfix → Production (Expedited)¶
Approvers: 2 (SRE Lead + Platform Architect)
Requirements:
- Incident Ticket: Linked P0/P1 incident ticket with root cause analysis.
- Hotfix Validation: Targeted tests pass in Hotfix environment.
- Expedited CAB: Emergency CAB approval (within 2 hours).
- Limited Scope: Changes limited to specific service/component; no breaking changes.
- Rollback Ready: Immediate rollback plan with < 2-minute RTO.
Implementation:
- stage: Deploy_Production_Hotfix
displayName: 'Emergency Hotfix to Production'
dependsOn: Deploy_Hotfix_Validation
condition: |
and(
succeeded(),
eq(variables['Hotfix.Validated'], 'true'),
ne(variables['IncidentTicketId'], '') # Incident ticket required
)
jobs:
- deployment: HotfixProduction
environment: ATP-Production # Requires 2 approvals (expedited)
timeoutInMinutes: 180 # 3-hour timeout (expedited approval SLA)
strategy:
runOnce:
preDeploy:
steps:
- script: |
echo "Hotfix Pre-Deployment Validation..."
# Verify incident ticket exists and is P0/P1
INCIDENT_PRIORITY=$(az boards work-item show \
--id $(IncidentTicketId) \
--query 'fields."Microsoft.VSTS.Common.Priority"' \
--output tsv)
if [ "$INCIDENT_PRIORITY" -gt "1" ]; then
echo "❌ Hotfix only allowed for P0/P1 incidents (found P$INCIDENT_PRIORITY)"
exit 1
fi
# Verify Hotfix environment tests passed
HOTFIX_TESTS=$(az pipelines runs list \
--pipeline-id $(HotfixPipelineId) \
--query "[0].result" \
--output tsv)
if [ "$HOTFIX_TESTS" != "succeeded" ]; then
echo "❌ Hotfix validation tests did not pass"
exit 1
fi
echo "✅ Hotfix validated; proceeding to Production"
displayName: 'Verify Hotfix Prerequisites'
deploy:
steps:
- task: Kubernetes@1
displayName: 'Apply Hotfix to Production'
inputs:
connectionType: 'Azure Resource Manager'
azureSubscription: $(azureSubscription)
azureResourceGroup: 'ATP-Prod-RG'
kubernetesCluster: 'atp-aks-prod-eus'
command: 'set'
arguments: 'image deployment/atp-ingestion atp-ingestion=connectsoft.azurecr.io/atp/ingestion:hotfix-$(Build.BuildNumber)'
- script: |
# Monitor rollout (10-minute timeout)
kubectl rollout status deployment/atp-ingestion -n atp-prod --timeout=10m
if [ $? -ne 0 ]; then
echo "❌ Hotfix rollout failed"
kubectl rollout undo deployment/atp-ingestion -n atp-prod
exit 1
fi
displayName: 'Monitor Hotfix Rollout'
postDeployment:
steps:
- script: |
echo "Intensive post-hotfix monitoring (30 minutes)..."
# Monitor for 30 minutes with 1-minute intervals
for i in {1..30}; do
echo "Monitoring minute $i/30..."
# Health check
curl -f https://atp-gateway-prod.connectsoft.com/health || {
echo "❌ Health check failed at minute $i"
exit 1
}
# Error rate check
ERROR_RATE=$(az monitor app-insights metrics show \
--app atp-appinsights-prod-eus \
--metric "requests/failed" \
--aggregation avg \
--offset 1m \
--query 'value.segments[0]."requests/failed".avg')
if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
echo "❌ Error rate spike detected: $ERROR_RATE%"
exit 1
fi
sleep 60
done
echo "✅ Hotfix monitoring complete; deployment stable"
displayName: 'Post-Hotfix Monitoring (30 min)'
- task: AzureCLI@2
displayName: 'Update Incident Ticket'
inputs:
scriptType: 'bash'
scriptLocation: 'inlineScript'
inlineScript: |
# Update incident with deployment details
az boards work-item update \
--id $(IncidentTicketId) \
--state "Resolved" \
--discussion "Hotfix deployed to Production: Build $(Build.BuildNumber). 30-minute monitoring complete. No issues detected."
SLA: 2 hours (expedited approval from request to deployment).
Rollback: Immediate (< 2 minutes) via kubectl rollout undo or traffic reversion.
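The hotfix stage condition and the pre-deployment priority check combine into one eligibility rule; a sketch (function and parameter names are illustrative):

```python
def hotfix_eligible(incident_id, priority, validated):
    """Mirror the Deploy_Production_Hotfix gates: an incident ticket is
    required, the fix must pass Hotfix-environment validation, and only
    P0/P1 incidents (priority <= 1) qualify for the expedited lane."""
    return bool(incident_id) and validated and priority <= 1

assert hotfix_eligible("INC-12345", 1, True)
assert not hotfix_eligible("", 1, True)            # incident ticket required
assert not hotfix_eligible("INC-12345", 2, True)   # P2 uses the standard lane
print("hotfix eligibility ok")
```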
Approval Gate Comparison¶
| Transition | Approvers | Requirements Summary | Approval SLA | Rollback RTO |
|---|---|---|---|---|
| Dev → Test | None (auto) | Smoke tests green, 24h soak | Immediate | 5 minutes |
| Test → Staging | 1 (Lead Engineer) | All tests green, no P1/P2 bugs, 48h soak | 4 hours | 2 minutes (slot swap) |
| Staging → Production | 2 (Architect + SRE) | CAB approval, load tests, chaos tests, change window | 24 hours | < 5 minutes (canary rollback) |
| Hotfix → Production | 2 (same) | Expedited CAB, incident ticket, limited scope | 2 hours | < 2 minutes (rollout undo) |
Rollback Triggers & Procedures¶
ATP implements multi-layered rollback strategies with both automated triggers (metric-based) and manual operator control to minimize incident impact and restore service quickly.
Automated Rollback Triggers¶
Metric-Based Triggers (Production Canary):
# Automated rollback conditions
rollbackTriggers:
errorRate:
threshold: 0.05 # 5% error rate
window: 5 minutes
action: Immediate rollback
latencyP95:
threshold: 1000 # 1000ms
baselineMultiplier: 2.0 # 2x baseline
window: 10 minutes
action: Immediate rollback
healthChecks:
consecutiveFailures: 3
interval: 30 seconds
action: Immediate rollback
exceptionRate:
threshold: 100 # exceptions per minute
window: 5 minutes
action: Immediate rollback
customMetrics:
- name: AuditIngestionFailureRate
threshold: 0.02 # 2% failure rate
window: 10 minutes
action: Immediate rollback
Implementation (Monitoring Script):
# monitor-canary.py
import time
import sys
from azure.monitor.query import MetricsQueryClient
from azure.identity import DefaultAzureCredential

# query_error_rate, query_latency_p95 and query_health_checks are thin
# wrappers around MetricsQueryClient, defined alongside this script.
def monitor_canary_deployment(duration_minutes, error_threshold, latency_threshold):
credential = DefaultAzureCredential()
client = MetricsQueryClient(credential)
start_time = time.time()
end_time = start_time + (duration_minutes * 60)
while time.time() < end_time:
elapsed = int((time.time() - start_time) / 60)
print(f"Monitoring canary: {elapsed}/{duration_minutes} minutes...")
# Query error rate
error_rate = query_error_rate(client, window_minutes=5)
if error_rate > error_threshold:
print(f"❌ ERROR: Canary error rate {error_rate}% exceeds threshold {error_threshold}%")
sys.exit(1)
# Query latency p95
latency_p95 = query_latency_p95(client, window_minutes=5)
if latency_p95 > latency_threshold:
print(f"❌ ERROR: Canary p95 latency {latency_p95}ms exceeds threshold {latency_threshold}ms")
sys.exit(1)
# Query health checks
health_status = query_health_checks(client)
if health_status != "Healthy":
print(f"❌ ERROR: Canary health check failed: {health_status}")
sys.exit(1)
print(f"✅ Canary healthy: Error={error_rate}%, P95={latency_p95}ms")
time.sleep(60) # Check every minute
print(f"✅ Canary monitoring complete: {duration_minutes} minutes passed")
sys.exit(0)
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--duration", type=int, required=True)
parser.add_argument("--error-threshold", type=float, required=True)
parser.add_argument("--latency-threshold", type=int, required=True)
args = parser.parse_args()
monitor_canary_deployment(args.duration, args.error_threshold, args.latency_threshold)
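The threshold checks inside the monitoring loop can be factored into a pure function, which makes them unit-testable without Azure credentials. A sketch (names are illustrative; the default thresholds match the `rollbackTriggers` error-rate and latency values above):

```python
def evaluate_canary(error_rate, latency_p95, health, *,
                    error_threshold=0.05, latency_threshold=1000):
    """Decide whether a single canary sample is healthy.
    Returns ('rollback', reason) or ('continue', reason)."""
    if error_rate > error_threshold:
        return ("rollback", f"error rate {error_rate} > {error_threshold}")
    if latency_p95 > latency_threshold:
        return ("rollback", f"p95 {latency_p95}ms > {latency_threshold}ms")
    if health != "Healthy":
        return ("rollback", f"health check: {health}")
    return ("continue", "canary healthy")

assert evaluate_canary(0.01, 420, "Healthy")[0] == "continue"
assert evaluate_canary(0.08, 420, "Healthy")[0] == "rollback"
assert evaluate_canary(0.01, 1500, "Healthy")[0] == "rollback"
print("canary evaluation ok")
```

The monitoring loop then only gathers metrics and acts on the returned decision, so the rollback policy itself can be covered by plain unit tests.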
Manual Rollback Triggers¶
Operator-Initiated Rollback:
# Manual rollback via Azure DevOps CLI
az pipelines run \
--name "ATP-Rollback-Pipeline" \
--parameters \
environment=production \
targetVersion=1.0.42 \
reason="Manual rollback due to [REASON]" \
initiatedBy=$(az account show --query user.name -o tsv)
Rollback Reasons (Documented):
- Functional Regression: Feature not working as expected; user-reported issues.
- Performance Degradation: Latency increase not caught by automated thresholds.
- Business Decision: Stakeholder request to revert feature.
- Security Vulnerability: Newly discovered vulnerability in deployed version.
- Data Corruption: Audit data integrity issues detected.
Rollback Procedures¶
Dev/Test Rollback (Redeploy Previous Version):
# rollback-dev.yaml
- stage: Rollback_Dev
jobs:
- deployment: RollbackToPreviousVersion
environment: ATP-Dev
strategy:
runOnce:
deploy:
steps:
- script: |
# Identify last-known-good build
LAST_GOOD_BUILD=$(az pipelines runs list \
--pipeline-id $(PipelineId) \
--status completed \
--result succeeded \
--top 2 \
--query "[1].id" \
--output tsv)
echo "Rolling back to build: $LAST_GOOD_BUILD"
# Download previous build artifacts
az pipelines runs artifact download \
--run-id $LAST_GOOD_BUILD \
--artifact-name drop \
--path $(Pipeline.Workspace)/rollback
displayName: 'Download Previous Build'
- template: deploy/deploy-microservice-to-azure-web-site.yaml@templates
parameters:
azureSubscription: $(azureSubscription)
appName: atp-ingestion-dev-eus
package: $(Pipeline.Workspace)/rollback/*.zip
RTO: 5 minutes (download + redeploy).
Staging Rollback (Blue-Green Slot Swap):
# rollback-staging.yaml
- stage: Rollback_Staging
jobs:
- deployment: RollbackToStable
environment: ATP-Staging
strategy:
runOnce:
deploy:
steps:
- task: AzureAppServiceManage@0
displayName: 'Swap Back to Previous Slot'
inputs:
azureSubscription: $(azureSubscription)
action: 'Swap Slots'
webAppName: atp-ingestion-staging-eus
sourceSlot: production
targetSlot: blue # Swap back
- script: |
# Verify rollback successful
curl -f https://atp-gateway-staging-eus.azurewebsites.net/health
# Update deployment tracking
echo "Rollback completed at $(date -u +%Y-%m-%dT%H:%M:%SZ)"
displayName: 'Verify Rollback'
RTO: 2 minutes (slot swap is nearly instantaneous).
Production Rollback (Canary Traffic Reversion):
# rollback-production.yaml
- stage: Rollback_Production
jobs:
- deployment: RollbackCanary
environment: ATP-Production # May require 1 approver depending on policy
strategy:
runOnce:
deploy:
steps:
- script: |
echo "🔴 Rolling back Production canary deployment..."
# Revert Istio traffic to 100% stable
kubectl apply -f k8s/canary/atp-ingestion-stable-only.yaml
# Verify traffic shift
kubectl get virtualservice atp-ingestion-traffic -n atp-prod -o yaml
# Wait 2 minutes for traffic to drain
sleep 120
# Delete canary deployment
kubectl delete deployment atp-ingestion-canary -n atp-prod
echo "✅ Rollback complete; 100% traffic on stable version"
displayName: 'Revert Traffic to Stable'
- script: |
# Verify stable version health
for i in {1..10}; do
HEALTH=$(curl -s https://atp-gateway-prod.connectsoft.com/health | jq -r '.status')
if [ "$HEALTH" == "Healthy" ]; then
echo "✅ Stable version healthy"
break
fi
sleep 30
done
# Verify error rate returned to normal
ERROR_RATE=$(az monitor app-insights metrics show \
--app atp-appinsights-prod-eus \
--metric "requests/failed" \
--aggregation avg \
--offset 5m \
--query 'value.segments[0]."requests/failed".avg')
echo "Post-rollback error rate: $ERROR_RATE%"
displayName: 'Post-Rollback Validation'
- task: AzureCLI@2
displayName: 'Create Post-Incident Review Task'
inputs:
scriptType: 'bash'
scriptLocation: 'inlineScript'
inlineScript: |
# Create work item for RCA
az boards work-item create \
--title "Post-Incident Review: Production Rollback - Build $(Build.BuildNumber)" \
--type "Task" \
--description "Conduct RCA for production rollback. Analyze canary metrics, identify root cause, implement preventive measures." \
--assigned-to "platform-architect@connectsoft.example" \
--area "ATP/Production" \
--fields Priority=1 \
--discussion "Rollback executed at $(date -u +%Y-%m-%dT%H:%M:%SZ). Incident ticket: $(IncidentTicketId)"
RTO: < 5 minutes (traffic reversion is near-instantaneous; cleanup takes additional time).
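Traffic reversion is the degenerate case of the same split computation used during rollout. A sketch, assuming Istio-style stable/canary weight pairs as in the VirtualService above (a rollback is simply the `(100, 0)` split):

```python
def traffic_split(increments):
    """Stable/canary weight pairs for a progressive canary rollout,
    mirroring the Istio VirtualService weights used in the pipeline."""
    if any(not 0 < i <= 100 for i in increments):
        raise ValueError("increments must be percentages in (0, 100]")
    return [(100 - i, i) for i in increments]

# 5% -> 20% -> 50% -> 100% rollout
print(traffic_split([5, 20, 50, 100]))  # [(95, 5), (80, 20), (50, 50), (0, 100)]
```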
Promotion Workflow Automation¶
Automated Promotion Script (Dev → Test):
#!/bin/bash
# promote-to-test.sh
set -e
echo "Automated Promotion: Dev → Test"
# 1. Verify Dev stability
echo "Checking Dev environment stability..."
DEV_ERROR_RATE=$(az monitor app-insights metrics show \
--app atp-appinsights-dev-eus \
--metric "requests/failed" \
--aggregation avg \
--offset 24h \
--query 'value.segments[0]."requests/failed".avg')
if (( $(echo "$DEV_ERROR_RATE > 0.10" | bc -l) )); then
echo "❌ Dev error rate too high: $DEV_ERROR_RATE%"
exit 1
fi
# 2. Verify smoke tests
echo "Running Dev smoke tests..."
dotnet test tests/Smoke.Tests.csproj --environment Dev --filter "Priority=1" || {
echo "❌ Dev smoke tests failed"
exit 1
}
# 3. Trigger Test deployment pipeline
echo "Triggering Test deployment..."
az pipelines run \
--name "ATP-Ingestion-Pipeline" \
--branch main \
--variables \
targetEnvironment=test \
sourceEnvironment=dev \
buildId=$(az pipelines runs list --pipeline-id "$PIPELINE_ID" --top 1 --query "[0].id" -o tsv) # PIPELINE_ID exported by the caller
echo "✅ Promotion to Test initiated"
Promotion Tracking (Azure DevOps):
// Promotion tracking service
public class PromotionTracker
{
public async Task<PromotionResult> TrackPromotionAsync(string buildId, string sourceEnv, string targetEnv)
{
var promotion = new Promotion
{
BuildId = buildId,
SourceEnvironment = sourceEnv,
TargetEnvironment = targetEnv,
InitiatedAt = DateTime.UtcNow,
InitiatedBy = _context.User.Identity.Name,
Status = PromotionStatus.Pending
};
await _dbContext.Promotions.AddAsync(promotion);
await _dbContext.SaveChangesAsync();
// Emit telemetry event
_telemetry.TrackEvent("EnvironmentPromotion", new Dictionary<string, string>
{
["BuildId"] = buildId,
["SourceEnvironment"] = sourceEnv,
["TargetEnvironment"] = targetEnv,
["PromotionId"] = promotion.Id.ToString()
});
return new PromotionResult { PromotionId = promotion.Id, Status = "Initiated" };
}
}
Change Advisory Board (CAB) Process¶
Purpose: Provide governance oversight for Production deployments with cross-functional review of changes, risks, and rollback plans.
CAB Composition:
- Platform Architect (chair)
- SRE Lead
- Security Officer
- Product Owner
- Compliance Officer (for regulatory changes)
CAB Meeting Cadence:
- Regular CAB: Weekly (Wednesdays 2 PM); reviews all Staging → Production promotions.
- Emergency CAB: On-demand (within 2 hours); reviews P0/P1 hotfixes.
Change Ticket Template:
# Change Ticket: CHG-2025-1030-001
## Change Summary
Deploy ATP Ingestion Service v1.0.50 to Production
## Justification
- New feature: AI-assisted anomaly detection (10% canary rollout)
- Bug fix: Query performance regression (20% improvement expected)
- Security patch: Update dependency with CVE-2025-12345 fix
## Impact Assessment
- **Services Affected**: ATP Ingestion, ATP Query
- **Downtime Expected**: None (canary deployment)
- **User Impact**: Minimal (10% of tenants see new features)
- **Data Impact**: None (backward-compatible schema)
## Testing Evidence
- [x] All regression tests passed (Test environment)
- [x] Load tests passed at 80% peak load (Staging)
- [x] Chaos tests passed (pod failures, network latency)
- [x] Security scan: 0 critical/high vulnerabilities
- [x] Staging soak: 72 hours (no incidents)
## Deployment Plan
- **Date**: Friday, November 1, 2025
- **Time**: 10 PM - 12 AM EST (approved maintenance window)
- **Method**: Canary deployment (5% → 20% → 50% → 100%)
- **Duration**: 3 hours (including monitoring)
## Rollback Plan
- **Method**: Istio traffic reversion to stable version
- **RTO**: < 5 minutes
- **Trigger**: Error rate > 1% OR p95 latency > 1000ms OR manual abort
- **Communication**: PagerDuty alert to SRE on-call
## Approvals
- [x] Platform Architect: Approved
- [x] SRE Lead: Approved
- [x] Security Officer: Approved (no security concerns)
- [ ] CAB: **Pending Review**
## Communication Plan
- **Pre-Deployment**: Email to stakeholders (24 hours before)
- **During Deployment**: Slack updates in #production-deployments
- **Post-Deployment**: Status page update + email confirmation
CAB Approval Workflow:
flowchart LR
A[Change Ticket Created] --> B[CAB Review]
B --> C{Approved?}
C -->|No| D[Modify Change + Resubmit]
C -->|Yes| E[Schedule Deployment]
E --> F[Pre-Deployment Notification]
F --> G[Execute Deployment]
G --> H[Post-Deployment Review]
H --> I[Close Change Ticket]
D --> B
Deployment Scheduling & Change Windows¶
Production Change Windows (Approved Times):
| Day | Window | Type | Use Case |
|---|---|---|---|
| Tuesday | 10 PM - 12 AM EST | Standard | Minor updates, feature rollouts |
| Friday | 10 PM - 2 AM EST | Extended | Major releases, infrastructure changes |
| Saturday | 2 AM - 6 AM EST | Extended | Database migrations, breaking changes |
| Any Day | Emergency | Hotfix | P0/P1 incident resolution only |
Blackout Periods (No Production Deployments):
- End of Quarter: Last 3 days of Q1, Q2, Q3, Q4 (business-critical reporting).
- Major Holidays: December 24-26, December 31 - January 2.
- Tenant Peak Hours: Monday-Friday 8 AM - 6 PM EST.
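The end-of-quarter blackout can also be computed for any year instead of hard-coding dates, which keeps the rule aligned with the "last 3 days" policy above. A minimal sketch (the helper name is illustrative and assumes GNU date):

```shell
#!/bin/bash
# Illustrative helper (not part of ATP tooling): succeeds when the given
# YYYY-MM-DD date falls in the last three days of a calendar quarter.
in_quarter_end_blackout() {
  local d="$1"
  local year month day last_day
  year=${d%%-*}
  month=$(echo "$d" | cut -d- -f2)
  day=$(echo "$d" | cut -d- -f3)
  case "$month" in
    03|06|09|12) ;;   # only quarter-end months can be in blackout
    *) return 1 ;;
  esac
  # Last day of the month via GNU date arithmetic
  last_day=$(date -d "${year}-${month}-01 +1 month -1 day" +%d)
  [ "$((10#$day))" -ge "$((last_day - 2))" ]
}

in_quarter_end_blackout "2025-03-30" && echo "2025-03-30: blackout" || echo "2025-03-30: clear"   # → blackout
in_quarter_end_blackout "2025-03-15" && echo "2025-03-15: blackout" || echo "2025-03-15: clear"   # → clear
```

Because the last day of the month is derived at runtime, the same check is correct for 30-day and 31-day quarter ends in any year.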
Change Window Validation (Pipeline):
- stage: Validate_Change_Window
jobs:
- job: CheckChangeWindow
steps:
- script: |
# Evaluate in Eastern time so the check matches the approved windows regardless of agent timezone
CURRENT_DAY=$(TZ=America/New_York date +%A)
CURRENT_HOUR=$(TZ=America/New_York date +%H)
CURRENT_DATE=$(TZ=America/New_York date +%Y-%m-%d)
# Check if current time is within an approved window (see change window table above)
if [ "$CURRENT_DAY" == "Tuesday" ] && [ "$CURRENT_HOUR" -ge 22 ]; then
echo "✅ Within approved change window (Tuesday 10 PM - 12 AM)"
elif [ "$CURRENT_DAY" == "Friday" ] && [ "$CURRENT_HOUR" -ge 22 ]; then
echo "✅ Within approved change window (Friday 10 PM - 2 AM)"
elif [ "$CURRENT_DAY" == "Saturday" ] && [ "$CURRENT_HOUR" -lt 2 ]; then
echo "✅ Within approved change window (Friday 10 PM - 2 AM, Saturday early hours)"
elif [ "$CURRENT_DAY" == "Saturday" ] && [ "$CURRENT_HOUR" -ge 2 ] && [ "$CURRENT_HOUR" -lt 6 ]; then
echo "✅ Within approved change window (Saturday 2 AM - 6 AM)"
else
# Check if it's an emergency hotfix
if [ "$(IsHotfix)" == "true" ]; then
echo "⚠️ Emergency hotfix; bypassing change window"
else
echo "❌ Deployment outside approved change window"
echo "Current: $CURRENT_DAY $CURRENT_HOUR:00"
exit 1
fi
fi
# Check blackout periods (approximation: flags the 28th-31st of each quarter-end month in 2025)
if [[ "$CURRENT_DATE" =~ 2025-(03|06|09|12)-(2[89]|30|31) ]]; then
echo "❌ Blackout period: End of quarter"
exit 1
fi
displayName: 'Validate Change Window'
Rollback Procedures by Environment¶
Dev Environment Rollback¶
Method: Redeploy previous build artifacts
RTO: 5 minutes
Procedure:
#!/bin/bash
# rollback-dev.sh
# 1. Identify last-known-good build
LAST_GOOD_BUILD=$(az pipelines runs list \
--pipeline-id $(PipelineId) \
--branch main \
--status completed \
--result succeeded \
--top 2 \
--query "[1].id" \
--output tsv)
echo "Rolling back Dev to build: $LAST_GOOD_BUILD"
# 2. Download artifacts
az pipelines runs artifact download \
--run-id $LAST_GOOD_BUILD \
--artifact-name drop \
--path ./rollback
# 3. Deploy to Dev
az webapp deployment source config-zip \
--resource-group ATP-Dev-RG \
--name atp-ingestion-dev-eus \
--src ./rollback/drop.zip
# 4. Verify rollback
sleep 30
curl -f https://atp-ingestion-dev-eus.azurewebsites.net/health
echo "✅ Dev rollback complete"
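The fixed sleep before the health probe can miss slow cold starts or flag a healthy app too early; a bounded retry with exponential backoff is a sturdier pattern. A sketch, where the `true` probe stands in for the real curl health call:

```shell
#!/bin/bash
# Bounded retry with exponential backoff in place of a fixed sleep.
# Usage: retry_health_check <max_attempts> <probe command...>
retry_health_check() {
  local attempts=$1; shift
  local delay=1 i
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    if (( i < attempts )); then
      sleep "$delay"
      delay=$((delay * 2))   # back off: 1s, 2s, 4s, ...
    fi
  done
  echo "unhealthy after $attempts attempts" >&2
  return 1
}

# In rollback-dev.sh the probe would be the health endpoint, e.g.:
#   retry_health_check 5 curl -fsS https://atp-ingestion-dev-eus.azurewebsites.net/health
retry_health_check 3 true   # demo probe that succeeds immediately
# → healthy after 1 attempt(s)
```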
Test Environment Rollback¶
Method: Redeploy previous build + restore test data
RTO: 10 minutes
Procedure:
#!/bin/bash
# rollback-test.sh
# 1. Backup current test data (in case rollback fails)
az sql db copy \
--name ATP_Test \
--resource-group ATP-Test-RG \
--server atp-sql-test-eus \
--dest-name ATP_Test_Rollback_Backup_$(date +%Y%m%d%H%M%S) \
--dest-resource-group ATP-Test-Backups-RG
# 2. Identify last-known-good build
LAST_GOOD_BUILD=$(az pipelines runs list \
--pipeline-id $(PipelineId) \
--branch main \
--status completed \
--result succeeded \
--top 2 \
--query "[1].id" \
--output tsv)
# 3. Download artifacts from last-known-good build
az pipelines runs artifact download \
--run-id $LAST_GOOD_BUILD \
--artifact-name drop \
--path ./rollback
# 4. Deploy previous version
az webapp deployment source config-zip \
--resource-group ATP-Test-RG \
--name atp-ingestion-test-eus \
--src ./rollback/drop.zip
# 5. Restore test data (if schema changed)
if [ "$(SchemaChanged)" == "true" ]; then
echo "Restoring test data from stable snapshot..."
az sql db restore \
--resource-group ATP-Test-RG \
--server atp-sql-test-eus \
--name ATP_Test \
--dest-name ATP_Test_Restored \
--time "$(date -u -d '48 hours ago' +%Y-%m-%dT%H:%M:%SZ)"  # restore point must be an ISO 8601 timestamp
fi
echo "✅ Test rollback complete"
Staging Environment Rollback¶
Method: Blue-green slot swap
RTO: 2 minutes
Procedure:
# Rollback via slot swap (instant)
- task: AzureAppServiceManage@0
displayName: 'Rollback Staging (Slot Swap)'
inputs:
azureSubscription: $(azureSubscription)
action: 'Swap Slots'
webAppName: atp-ingestion-staging-eus
resourceGroupName: ATP-Staging-RG
sourceSlot: blue # Previous (good) version
swapWithProduction: true # Swapping returns the good build to the production slot
Post-Rollback Validation:
# Verify rollback successful
curl -f https://atp-gateway-staging-eus.azurewebsites.net/health
# Run quick smoke tests
dotnet test tests/Smoke.Tests.csproj \
--environment Staging \
--filter "Category=Critical"
# Check error rate
ERROR_RATE=$(az monitor app-insights metrics show \
--app atp-appinsights-staging-eus \
--metric "requests/failed" \
--aggregation avg \
--offset 5m \
--query "value.segments[0]['requests/failed'].avg")
if (( $(echo "$ERROR_RATE > 0.02" | bc -l) )); then
echo "⚠️ Warning: Error rate still elevated: $ERROR_RATE%"
else
echo "✅ Staging rollback successful; error rate normal"
fi
Production Environment Rollback¶
Method: Canary traffic reversion (< 1 minute) or Kubernetes rollout undo (< 5 minutes)
RTO: < 5 minutes
Automated Rollback (Triggered by Metrics):
# Automatic rollback on metric threshold breach
on:
failure:
steps:
- script: |
echo "🔴 AUTOMATIC ROLLBACK TRIGGERED"
# Log rollback reason
ROLLBACK_REASON=$(cat <<EOF
{
"buildId": "$(Build.BuildId)",
"buildNumber": "$(Build.BuildNumber)",
"timestamp": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
"trigger": "Automated",
"reason": "Canary metrics exceeded thresholds",
"canaryIncrement": "$(strategy.increment)%"
}
EOF
)
echo "$ROLLBACK_REASON" | tee rollback-log.json
# Revert Istio traffic to 100% stable
kubectl apply -f k8s/canary/atp-ingestion-stable-only.yaml
# Wait for traffic drain
sleep 60
# Delete canary deployment
kubectl delete deployment atp-ingestion-canary -n atp-prod --ignore-not-found=true
# Verify stable version healthy
kubectl rollout status deployment/atp-ingestion -n atp-prod
echo "✅ Automatic rollback complete"
displayName: 'Automatic Canary Rollback'
- task: PublishBuildArtifacts@1
displayName: 'Publish Rollback Log'
inputs:
PathtoPublish: 'rollback-log.json'
ArtifactName: 'rollback-evidence'
- script: |
# Notify on-call team
curl -X POST https://events.pagerduty.com/v2/enqueue \
-H 'Content-Type: application/json' \
-d '{
"routing_key": "$(PagerDutyRoutingKey)",
"event_action": "trigger",
"payload": {
"summary": "Production canary automatic rollback - Build $(Build.BuildNumber)",
"severity": "critical",
"source": "Azure Pipelines",
"custom_details": {
"buildId": "$(Build.BuildId)",
"canaryIncrement": "$(strategy.increment)%",
"reason": "Metrics threshold exceeded"
}
}
}'
# Slack notification
curl -X POST $(SlackWebhookUrl) \
-H 'Content-Type: application/json' \
-d '{
"text": "🚨 Production Automatic Rollback",
"attachments": [{
"color": "danger",
"title": "Build $(Build.BuildNumber) - Canary Rollback",
"fields": [
{"title": "Increment", "value": "$(strategy.increment)%", "short": true},
{"title": "Trigger", "value": "Automated (metrics)", "short": true},
{"title": "RTO", "value": "< 5 minutes", "short": true},
{"title": "Incident", "value": "Created: INC-AUTO-$(Build.BuildId)", "short": true}
]
}]
}'
displayName: 'Notify Stakeholders'
Manual Rollback (Operator-Initiated):
#!/bin/bash
# manual-rollback-production.sh
read -p "Confirm Production rollback (yes/no): " CONFIRM
if [ "$CONFIRM" != "yes" ]; then
echo "Rollback cancelled"
exit 0
fi
read -p "Enter rollback reason: " REASON
read -p "Enter target build ID (last-known-good): " TARGET_BUILD_ID
echo "Initiating Production rollback..."
echo "Reason: $REASON"
echo "Target Build: $TARGET_BUILD_ID"
# 1. Revert traffic to stable version
kubectl apply -f k8s/canary/atp-ingestion-stable-only.yaml
# 2. Wait for traffic to drain from canary
sleep 120
# 3. Delete canary deployment
kubectl delete deployment atp-ingestion-canary -n atp-prod
# 4. If needed, rollback stable deployment to previous version
if [ -n "$TARGET_BUILD_ID" ]; then
echo "Rolling back stable deployment to build $TARGET_BUILD_ID..."
kubectl set image deployment/atp-ingestion -n atp-prod \
atp-ingestion=connectsoft.azurecr.io/atp/ingestion:$TARGET_BUILD_ID
kubectl rollout status deployment/atp-ingestion -n atp-prod --timeout=5m
fi
# 5. Verify rollback
curl -f https://atp-gateway-prod.connectsoft.com/health
# 6. Create incident ticket
az boards work-item create \
--title "Manual Production Rollback - Build $(Build.BuildNumber)" \
--type "Incident" \
--description "Reason: $REASON. Rolled back to: $TARGET_BUILD_ID" \
--assigned-to "sre-team@connectsoft.example" \
--fields Priority=1
echo "✅ Production rollback complete"
Post-Deployment Verification¶
Verification Checklist (All Environments):
# Post-deployment verification template
- stage: Post_Deployment_Verification
jobs:
- job: VerifyDeployment
steps:
# 1. Health checks
- script: |
for service in gateway ingestion query integrity export policy search; do
echo "Health check: atp-$service-$(environment)-eus"
curl -f https://atp-$service-$(environment)-eus.azurewebsites.net/health || exit 1
done
displayName: 'Verify Service Health'
# 2. Smoke tests
- task: DotNetCoreCLI@2
displayName: 'Run Smoke Tests'
inputs:
command: 'test'
projects: 'tests/Smoke.Tests.csproj'
arguments: '--environment $(environment) --filter "Priority=1"'
# 3. Metrics validation
- script: |
# Wait for metrics to stabilize
sleep 300 # 5 minutes
ERROR_RATE=$(az monitor app-insights metrics show \
--app atp-appinsights-$(environment)-eus \
--metric "requests/failed" \
--aggregation avg \
--offset 5m \
--query "value.segments[0]['requests/failed'].avg")
echo "Post-deployment error rate: $ERROR_RATE%"
THRESHOLD=$([ "$(environment)" == "prod" ] && echo "0.01" || echo "0.05")
if (( $(echo "$ERROR_RATE > $THRESHOLD" | bc -l) )); then
echo "❌ Error rate exceeds threshold for $(environment)"
exit 1
fi
displayName: 'Validate Metrics'
# 4. Database migration verification
- script: |
# Verify database schema version matches deployment
DB_VERSION=$(sqlcmd -S atp-sql-$(environment)-eus.database.windows.net \
-d ATP_$(environment) \
-Q "SET NOCOUNT ON; SELECT TOP 1 MigrationId FROM __EFMigrationsHistory ORDER BY MigrationId DESC" \
-h -1)
EXPECTED_VERSION=$(grep 'MigrationVersion' version.txt | cut -d'=' -f2)
if [ "$DB_VERSION" != "$EXPECTED_VERSION" ]; then
echo "❌ Database version mismatch. Expected: $EXPECTED_VERSION, Actual: $DB_VERSION"
exit 1
fi
echo "✅ Database schema version verified"
displayName: 'Verify Database Migrations'
# 5. Configuration validation
- script: |
# Verify environment configuration loaded correctly
CONFIG_ENV=$(curl -s https://atp-gateway-$(environment)-eus.azurewebsites.net/api/diagnostics/config | jq -r '.environment')
if [ "$CONFIG_ENV" != "$(environment)" ]; then
echo "❌ Configuration environment mismatch"
exit 1
fi
echo "✅ Configuration validated"
displayName: 'Verify Configuration'
Promotion Metrics & Reporting¶
Metrics Tracked (Per Promotion):
promotionMetrics:
deployment:
initiatedAt: 2025-10-30T22:00:00Z
completedAt: 2025-10-30T23:45:00Z
duration: 105 minutes
approvals:
requestedAt: 2025-10-30T14:00:00Z
approvedAt: 2025-10-30T15:30:00Z
approvalDuration: 90 minutes
approvers: [architect@example.com, sre@example.com]
validation:
smokeTests: passed
regressionTests: passed (100%)
loadTests: passed (p95: 450ms)
chaosTests: passed
securityScans: passed (0 critical)
rollback:
triggered: false
reason: null
duration: null
outcome: success
Promotion Dashboard (Power BI / Azure Monitor Workbook):
- Promotion Frequency: Deployments per environment per week.
- Approval Duration: Time from approval request to approval granted.
- Success Rate: Percentage of successful promotions (no rollback).
- DORA Metrics: Deployment frequency, lead time, MTTR, change failure rate.
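Change failure rate, one of the DORA metrics above, falls out directly of the promotion records: rollbacks divided by deployments. A sketch assuming the Promotions table is exported to CSV (file name and columns are illustrative):

```shell
#!/bin/bash
# Compute change failure rate = rolled-back deployments / total deployments
# from a hypothetical CSV export of the promotion tracking table.
cat > promotions.csv <<'EOF'
buildId,targetEnvironment,rolledBack
1001,production,false
1002,production,true
1003,production,false
1004,production,false
EOF

total=$(awk -F, 'NR > 1 && $2 == "production"' promotions.csv | wc -l)
failed=$(awk -F, 'NR > 1 && $2 == "production" && $3 == "true"' promotions.csv | wc -l)
echo "deployments: $total, rollbacks: $failed"       # → deployments: 4, rollbacks: 1
echo "change failure rate: $((failed * 100 / total))%"  # → change failure rate: 25%
```

The same aggregation, grouped per week, yields the deployment frequency series for the dashboard.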
Summary¶
- Promotion Lanes: Standard lane (feature releases) and hotfix lane (emergency fixes) with clearly defined progression.
- Approval Gates: Graduated approvals from zero (Dev → Test) to 2 + CAB (Staging → Production) with specific validation requirements.
- Automated Promotion: Dev → Test fully automated with 24-hour soak and metric validation.
- Manual Approvals: Test → Staging (1 approver, 4-hour SLA), Staging → Production (2 approvers + CAB, 24-hour SLA).
- Rollback Triggers: Automated (metric thresholds) and manual (operator-initiated) with environment-specific RTOs.
- Change Windows: Approved deployment times with blackout periods for business-critical operations.
- CAB Process: Weekly governance meetings for Production changes with comprehensive change ticket template.
Data Management Per Environment¶
ATP's data management strategy ensures each environment has appropriate data characteristics for its purpose: synthetic data for rapid testing (Dev/Test), production-like datasets for validation (Staging), and live tenant data with full compliance controls (Production). This approach balances testing realism with privacy protection and compliance requirements.
Data management policies vary significantly across environments to support different use cases while maintaining data sovereignty, retention compliance, and disaster recovery capabilities appropriate to each tier's criticality.
Data Management Overview¶
| Environment | Data Source | Volume | PII/Sensitivity | Retention | Immutability | Backups | Compliance |
|---|---|---|---|---|---|---|---|
| Preview | Minimal synthetic | 100 events | None (synthetic) | PR lifetime | No | No | None |
| Dev | Synthetic generators | 10K events | None (synthetic) | 30 days rolling | No | No | Basic redaction testing |
| Test | Stable fixtures | 50K events | None (synthetic) | 90 days | No | Daily (30-day retention) | GDPR/HIPAA simulation |
| Staging | Prod-like synthetic | 5M events | Obfuscated PII | 180 days | Yes | Weekly (4-week retention) | Full enforcement |
| Production | Live tenant data | Millions/day | Real PII (classified) | 7 years | Yes (WORM) | Daily + weekly (7-year retention) | Full enforcement |
| Hotfix | Prod clone (anonymized) | Subset of Prod | Anonymized PII | Incident duration | Yes | No | Full enforcement |
Dev Environment Data Management¶
Purpose: Synthetic data for rapid development iteration with no PII and minimal compliance constraints.
Data Source:
- Synthetic Data Generators: C# data generator libraries (Bogus, AutoFixture) create realistic but fake tenant data.
- Seeding Scripts: Version-controlled scripts regenerate consistent dev datasets.
- Volume: 10 synthetic tenants, 1,000 audit events per tenant (10,000 total events).
Data Characteristics:
// Dev Data Generator (C# with Bogus library)
public class DevDataGenerator
{
private readonly Faker<Tenant> _tenantFaker;
private readonly Faker<AuditEvent> _auditEventFaker;
public DevDataGenerator()
{
// Tenant generator
_tenantFaker = new Faker<Tenant>()
.RuleFor(t => t.TenantId, f => $"dev-tenant-{f.IndexFaker:000}")
.RuleFor(t => t.Name, f => $"{f.Company.CompanyName()} (Dev)")
.RuleFor(t => t.Edition, f => f.PickRandom("Standard", "Business", "Enterprise"))
.RuleFor(t => t.Region, f => f.PickRandom("US", "EU", "APAC"))
.RuleFor(t => t.CreatedAt, f => f.Date.Past(1))
.RuleFor(t => t.MaxRetentionDays, f => f.Random.Int(30, 365))
.RuleFor(t => t.ComplianceProfile, f => f.PickRandom("gdpr", "hipaa", "soc2", "none"));
// Audit event generator
_auditEventFaker = new Faker<AuditEvent>()
.RuleFor(e => e.EventId, f => $"evt-{Guid.NewGuid()}")
.RuleFor(e => e.Timestamp, f => f.Date.Recent(30))
.RuleFor(e => e.Actor, f => f.Internet.Email())
.RuleFor(e => e.Action, f => f.PickRandom("Create", "Read", "Update", "Delete", "Login", "Logout"))
.RuleFor(e => e.Resource, f => $"/api/{f.PickRandom("users", "documents", "settings")}/{f.Random.Int(1, 100)}")
.RuleFor(e => e.Outcome, f => f.Random.WeightedRandom(new[] { "Allowed", "Denied" }, new[] { 0.9f, 0.1f })) // weighted pick; PickRandom has no weights overload
.RuleFor(e => e.IpAddress, f => f.Internet.Ip())
.RuleFor(e => e.UserAgent, f => f.Internet.UserAgent())
.RuleFor(e => e.Metadata, f => new Dictionary<string, object>
{
["duration"] = f.Random.Int(10, 5000),
["statusCode"] = f.Random.Int(200, 500),
["region"] = f.PickRandom("eastus", "westeurope", "southeastasia")
});
}
public Task<IEnumerable<Tenant>> GenerateTenantsAsync(int count)
{
// Generation is synchronous; return a completed task to keep the async API shape
return Task.FromResult<IEnumerable<Tenant>>(_tenantFaker.Generate(count));
}
public Task<IEnumerable<AuditEvent>> GenerateEventsAsync(string tenantId, int count)
{
return Task.FromResult<IEnumerable<AuditEvent>>(_auditEventFaker
.RuleFor(e => e.TenantId, _ => tenantId)
.Generate(count));
}
}
Data Seeding Script (Dev):
#!/bin/bash
# seed-dev-environment.sh
echo "Seeding Dev Environment with synthetic data..."
# 1. Clear existing data
dotnet run --project tools/DataSeeder -- --clear --environment Development
# 2. Generate synthetic tenants and events
dotnet run --project tools/DataSeeder -- \
--environment Development \
--tenants 10 \
--events-per-tenant 1000 \
--start-date "2025-09-01" \
--end-date "2025-10-30" \
--seed 42 # Fixed seed for reproducibility
# 3. Seed Redis cache with sessions
redis-cli -h atp-redis-dev-eus.redis.cache.windows.net -p 6380 -a $(az keyvault secret show --name RedisPassword --vault-name atp-keyvault-dev-eus --query value -o tsv) --tls <<EOF
SET session:dev-tenant-001:user-1 "{\"userId\":\"user-1\",\"tenantId\":\"dev-tenant-001\",\"expiresAt\":\"2025-10-31T00:00:00Z\"}"
SET session:dev-tenant-002:user-2 "{\"userId\":\"user-2\",\"tenantId\":\"dev-tenant-002\",\"expiresAt\":\"2025-10-31T00:00:00Z\"}"
EXPIRE session:dev-tenant-001:user-1 86400
EXPIRE session:dev-tenant-002:user-2 86400
EOF
# 4. Verify data counts
TENANT_COUNT=$(sqlcmd -S atp-sql-dev-eus.database.windows.net -d ATP_Dev -Q "SELECT COUNT(*) FROM Tenants" -h -1)
EVENT_COUNT=$(sqlcmd -S atp-sql-dev-eus.database.windows.net -d ATP_Dev -Q "SELECT COUNT(*) FROM AuditEvents" -h -1)
echo "Tenants: $TENANT_COUNT"
echo "Events: $EVENT_COUNT"
if [ "$TENANT_COUNT" -eq 10 ] && [ "$EVENT_COUNT" -eq 10000 ]; then
echo "✅ Dev environment seeded successfully"
else
echo "❌ Seeding verification failed"
exit 1
fi
Retention Policy (Dev):
- Audit Events: 30-day rolling retention; events older than 30 days automatically purged.
- Tenants: Persist until manual cleanup.
- Logs/Traces: 7-day retention in Application Insights.
Retention Enforcement (Automated Cleanup Job):
// Dev data cleanup job (Azure Function)
[FunctionName("CleanupDevData")]
public async Task RunAsync(
[TimerTrigger("0 0 2 * * *")] TimerInfo timer, // Daily at 2 AM
ILogger log)
{
log.LogInformation("Dev data cleanup started");
var cutoffDate = DateTime.UtcNow.AddDays(-30);
// Delete events older than 30 days
var deletedCount = await _dbContext.AuditEvents
.Where(e => e.Timestamp < cutoffDate)
.ExecuteDeleteAsync();
log.LogInformation($"Deleted {deletedCount} events older than 30 days");
// Vacuum database to reclaim space
await _dbContext.Database.ExecuteSqlRawAsync("DBCC SHRINKDATABASE (ATP_Dev, 10)");
log.LogInformation("✅ Dev data cleanup complete");
}
Immutability: Disabled (data can be freely modified/deleted for testing scenarios).
Backups: None (ephemeral dev data; recreate from seeding scripts if needed).
Test Environment Data Management¶
Purpose: Stable test datasets with version-controlled fixtures for consistent regression testing and QA validation.
Data Source:
- Stable Fixtures: JSON/Parquet files version-controlled in the tests/fixtures/ directory.
- Reproducible: Same data across test runs; enables predictable test assertions.
- Volume: 20 synthetic tenants, 50,000 audit events (stable over time).
Data Characteristics:
// tests/fixtures/test-tenants.json
[
{
"tenantId": "test-tenant-001",
"name": "Acme Corporation (Test)",
"edition": "Enterprise",
"region": "US",
"complianceProfile": "gdpr,hipaa",
"createdAt": "2025-01-01T00:00:00Z",
"maxRetentionDays": 2555
},
{
"tenantId": "test-tenant-002",
"name": "Global Industries (Test)",
"edition": "Business",
"region": "EU",
"complianceProfile": "gdpr",
"createdAt": "2025-01-15T00:00:00Z",
"maxRetentionDays": 365
}
// ... 18 more tenants
]
Test Data Seeding (C#):
// Test data loader (loads from version-controlled fixtures)
public class TestDataSeeder
{
private readonly IAuditDbContext _context;
public async Task SeedTestEnvironmentAsync()
{
// 1. Clear existing data
await _context.Database.ExecuteSqlRawAsync("TRUNCATE TABLE AuditEvents");
await _context.Database.ExecuteSqlRawAsync("DELETE FROM Tenants");
// 2. Load tenant fixtures
var tenantsJson = await File.ReadAllTextAsync("fixtures/test-tenants.json");
var tenants = JsonSerializer.Deserialize<List<Tenant>>(tenantsJson);
await _context.Tenants.AddRangeAsync(tenants);
await _context.SaveChangesAsync();
// 3. Load event fixtures (Parquet for efficient storage)
var eventsParquet = await LoadParquetAsync("fixtures/test-events-stable.parquet");
// Batch insert for performance
const int batchSize = 1000;
for (int i = 0; i < eventsParquet.Count; i += batchSize)
{
var batch = eventsParquet.Skip(i).Take(batchSize).ToList();
await _context.AuditEvents.AddRangeAsync(batch);
await _context.SaveChangesAsync();
}
// 4. Verify data integrity
var tenantCount = await _context.Tenants.CountAsync();
var eventCount = await _context.AuditEvents.CountAsync();
if (tenantCount != 20 || eventCount != 50000)
{
throw new Exception($"Data seeding verification failed. Expected: 20 tenants, 50K events. Actual: {tenantCount} tenants, {eventCount} events");
}
Console.WriteLine("✅ Test environment seeded successfully");
}
}
Data Refresh Script (Test):
#!/bin/bash
# refresh-test-data.sh
echo "Refreshing Test environment data..."
# 1. Backup current data (safety net)
az sql db export \
--name ATP_Test \
--resource-group ATP-Test-RG \
--server atp-sql-test-eus \
--admin-user testadmin \
--admin-password $(az keyvault secret show --vault-name atp-keyvault-test-eus --name SqlAdminPassword --query value -o tsv) \
--storage-key $(az storage account keys list --account-name atpstoragetesteus --query "[0].value" -o tsv) \
--storage-key-type StorageAccessKey \
--storage-uri "https://atpstoragetesteus.blob.core.windows.net/backups/test-backup-$(date +%Y%m%d).bacpac"
# 2. Run seeding tool
dotnet run --project tools/DataSeeder -- \
--environment Test \
--clear \
--load-fixtures tests/fixtures/test-tenants.json \
--load-fixtures tests/fixtures/test-events-stable.parquet
# 3. Verify data integrity
TEST_TENANT_COUNT=$(sqlcmd -S atp-sql-test-eus.database.windows.net -d ATP_Test -Q "SELECT COUNT(*) FROM Tenants" -h -1)
TEST_EVENT_COUNT=$(sqlcmd -S atp-sql-test-eus.database.windows.net -d ATP_Test -Q "SELECT COUNT(*) FROM AuditEvents" -h -1)
if [ "$TEST_TENANT_COUNT" -eq 20 ] && [ "$TEST_EVENT_COUNT" -eq 50000 ]; then
echo "✅ Test data refresh successful"
else
echo "❌ Test data verification failed"
exit 1
fi
# 4. Update test data version metadata
az storage blob metadata update \
--account-name atpstoragetesteus \
--container-name fixtures \
--name test-data-version.txt \
--metadata version=$(date +%Y%m%d) refreshedAt=$(date -u +%Y-%m-%dT%H:%M:%SZ)
Retention Policy (Test):
- Audit Events: 90-day retention (matches Test compliance profile).
- Automated Purge: Monthly cleanup job removes events older than 90 days.
- Test Data Refresh: Weekly refresh from fixtures (every Sunday 2 AM).
Immutability: Disabled (test data can be modified for scenario testing).
Backups:
- Frequency: Daily automated backups.
- Retention: 30 days.
- Purpose: Disaster recovery if test data corrupted; not for compliance.
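Since these backups are the only recovery path for corrupted test data, a freshness check catches a silently failing backup job. A local-filesystem sketch (the path and file are illustrative; against Azure the same check would list blobs in the backups container):

```shell
#!/bin/bash
# Fail if the newest .bacpac under BACKUP_DIR is older than 36 hours
# (daily cadence plus a grace period).
BACKUP_DIR=${BACKUP_DIR:-./backups}
mkdir -p "$BACKUP_DIR"
touch "$BACKUP_DIR/test-backup-demo.bacpac"   # stand-in for a real backup file

newest=$(find "$BACKUP_DIR" -name '*.bacpac' -mmin -2160 | head -n 1)  # 2160 min = 36 h
if [ -n "$newest" ]; then
  echo "backup fresh: $newest"
else
  echo "❌ no backup newer than 36 hours" >&2
  exit 1
fi
```

Wired into a scheduled pipeline, a non-zero exit from this check would page the owning team before a restore is ever needed.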
Staging Environment Data Management¶
Purpose: Production-like datasets with obfuscated PII for realistic load testing, chaos engineering, and stakeholder acceptance validation.
Data Source (Two Options):
Option 1: Anonymized Production Snapshot¶
// Anonymize Production data for Staging
public class ProductionDataAnonymizer
{
public async Task CreateAnonymizedSnapshotAsync()
{
// 1. Export production data (read-only replica)
var prodEvents = await _prodDbContext.AuditEvents
.Where(e => e.Timestamp > DateTime.UtcNow.AddDays(-180))
.OrderBy(e => e.Timestamp)
.Take(5_000_000)
.AsNoTracking()
.ToListAsync();
// 2. Anonymize PII fields
var anonymizedEvents = prodEvents.Select(e => new AuditEvent
{
EventId = e.EventId,
TenantId = AnonymizeTenantId(e.TenantId), // tenant-12345 → staging-tenant-001
Timestamp = e.Timestamp,
Actor = AnonymizeEmail(e.Actor), // john.doe@example.com → user-123@staging.local
Action = e.Action, // Preserve (no PII)
Resource = e.Resource, // Preserve (no PII)
Outcome = e.Outcome, // Preserve
IpAddress = AnonymizeIpAddress(e.IpAddress), // 192.168.1.1 → 10.0.X.X
UserAgent = e.UserAgent, // Preserve
Metadata = AnonymizeMetadata(e.Metadata) // Remove any PII in JSON
});
// 3. Export to Parquet (efficient format)
await ExportToParquetAsync(anonymizedEvents, "anonymized-prod-snapshot.parquet");
// 4. Verify no PII remains
await VerifyNoPIIAsync("anonymized-prod-snapshot.parquet");
}
private string AnonymizeTenantId(string realTenantId)
{
// Consistent mapping: real tenant → staging tenant
var hash = ComputeHash(realTenantId);
var index = Math.Abs(hash) % 50 + 1;
return $"staging-tenant-{index:D3}";
}
private string AnonymizeEmail(string email)
{
// Hash email to consistent fake email
var hash = ComputeHash(email);
return $"user-{Math.Abs(hash) % 10000:D4}@staging.local";
}
private string AnonymizeIpAddress(string ipAddress)
{
// Replace with private IP range
var hash = ComputeHash(ipAddress);
return $"10.0.{Math.Abs(hash) % 256}.{Math.Abs(hash >> 8) % 256}";
}
}
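The consistent mapping in AnonymizeTenantId (hash the real identifier, then fold it into a small index) can be reproduced from the shell when spot-checking a snapshot. A sketch where md5sum stands in for ComputeHash; the specific digest does not matter, only that the mapping is deterministic:

```shell
#!/bin/bash
# Deterministic pseudonym: hash the real tenant ID and fold it into one of
# 50 staging tenants, so the same input always yields the same output.
pseudonymize_tenant() {
  local hex idx
  hex=$(printf '%s' "$1" | md5sum | cut -c1-8)   # first 32 bits of the digest
  idx=$(( (16#$hex % 50) + 1 ))                  # fold into the range 1..50
  printf 'staging-tenant-%03d\n' "$idx"
}

pseudonymize_tenant "tenant-12345"
pseudonymize_tenant "tenant-12345"   # same input yields the identical pseudonym
```

Determinism matters because cross-table joins (events to tenants) must survive anonymization; a random pseudonym per row would break referential integrity in the snapshot.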
Option 2: Production-Scale Synthetic Data¶
// Generate production-scale synthetic data
public class StagingDataGenerator
{
public async Task GenerateProductionScaleDataAsync()
{
// 50 synthetic tenants (mimics real tenant distribution)
var tenants = GenerateSyntheticTenants(count: 50, distribution: new
{
Standard = 20, // 40%
Business = 20, // 40%
Enterprise = 10 // 20%
});
await _context.Tenants.AddRangeAsync(tenants);
await _context.SaveChangesAsync();
// 5 million audit events (realistic production volume)
var startDate = DateTime.UtcNow.AddDays(-180);
var endDate = DateTime.UtcNow;
var eventsPerDay = 5_000_000 / 180; // ~27,778 events/day
foreach (var tenant in tenants)
{
var tenantEventsPerDay = (int)(eventsPerDay * GetTenantWeightFactor(tenant.Edition));
var events = await GenerateRealisticEventsAsync(
tenantId: tenant.TenantId,
startDate: startDate,
endDate: endDate,
eventsPerDay: tenantEventsPerDay,
patterns: new[]
{
"BusinessHours", // More activity 9 AM - 5 PM
"WeekdayBias", // Less activity on weekends
"SeasonalSpikes" // Occasional high-volume days
});
// Batch insert (efficient)
await BulkInsertAsync(events, batchSize: 10000);
}
}
private double GetTenantWeightFactor(string edition)
{
return edition switch
{
"Standard" => 0.5, // Lower activity
"Business" => 1.0, // Average activity
"Enterprise" => 2.0, // Higher activity
_ => 1.0
};
}
}
Data Refresh Workflow (Staging):
#!/bin/bash
# refresh-staging-data.sh
echo "Refreshing Staging environment with production-like data..."
# Option 1: Restore from anonymized production snapshot
if [ "$USE_PROD_SNAPSHOT" == "true" ]; then
echo "Restoring from anonymized production snapshot..."
# Download latest anonymized snapshot
az storage blob download \
--account-name atpstorageprodeus \
--container-name anonymized-snapshots \
--name "anonymized-prod-snapshot-latest.parquet" \
--file ./anonymized-snapshot.parquet
# Import into Staging database
dotnet run --project tools/DataImporter -- \
--environment Staging \
--clear \
--import ./anonymized-snapshot.parquet \
--verify-no-pii
else
# Option 2: Generate production-scale synthetic data
echo "Generating production-scale synthetic data..."
dotnet run --project tools/DataSeeder -- \
--environment Staging \
--clear \
--tenants 50 \
--events-total 5000000 \
--start-date "2025-04-01" \
--end-date "2025-10-30" \
--use-realistic-patterns \
--seed 2025
fi
# Verify data volume
STAGING_TENANT_COUNT=$(sqlcmd -S atp-sql-staging-eus.database.windows.net -d ATP_Staging -Q "SELECT COUNT(*) FROM Tenants" -h -1)
STAGING_EVENT_COUNT=$(sqlcmd -S atp-sql-staging-eus.database.windows.net -d ATP_Staging -Q "SELECT COUNT(*) FROM AuditEvents" -h -1)
echo "Staging - Tenants: $STAGING_TENANT_COUNT, Events: $STAGING_EVENT_COUNT"
if [ "$STAGING_TENANT_COUNT" -ge 50 ] && [ "$STAGING_EVENT_COUNT" -ge 5000000 ]; then
echo "✅ Staging data refresh successful"
else
echo "❌ Staging data verification failed"
exit 1
fi
Retention Policy (Staging):
- Audit Events: 180-day retention (production-like; validates retention policies).
- Legal Holds: Test legal hold workflows (place/release holds on specific events).
- Automated Purge: Weekly job purges events older than 180 days.
Retention Enforcement (Staging):
// Staging retention enforcement (mirrors production logic)
[FunctionName("EnforceRetentionPolicy")]
public async Task RunAsync(
[TimerTrigger("0 0 3 * * 0")] TimerInfo timer, // Weekly on Sunday at 3 AM
ILogger log)
{
log.LogInformation("Enforcing retention policy for Staging");
// Query tenants with custom retention settings
var tenants = await _context.Tenants.ToListAsync();
foreach (var tenant in tenants)
{
var retentionCutoff = DateTime.UtcNow.AddDays(-tenant.MaxRetentionDays);
// Find events eligible for purge (excluding legal holds)
var eventsToDelete = await _context.AuditEvents
.Where(e => e.TenantId == tenant.TenantId)
.Where(e => e.Timestamp < retentionCutoff)
.Where(e => !e.LegalHold) // Never delete events under legal hold
.ToListAsync();
if (eventsToDelete.Any())
{
log.LogInformation($"Purging {eventsToDelete.Count} events for tenant {tenant.TenantId} (retention: {tenant.MaxRetentionDays} days)");
_context.AuditEvents.RemoveRange(eventsToDelete);
await _context.SaveChangesAsync();
}
}
log.LogInformation("✅ Retention policy enforcement complete");
}
Immutability: Enabled (tests WORM storage, hash chaining, and tamper-evidence workflows).
Backups:
- Frequency: Weekly automated backups.
- Retention: 4 weeks (enables DR drills with realistic data).
- Geo-Replication: Enabled (tests failover procedures).
Backup Script (Staging):
#!/bin/bash
# backup-staging.sh
echo "Creating Staging database backup..."
# Export to BACPAC (includes schema + data)
az sql db export \
--name ATP_Staging \
--resource-group ATP-Staging-RG \
--server atp-sql-staging-eus \
--admin-user $(az keyvault secret show --vault-name atp-keyvault-staging-eus --name SqlAdminUser --query value -o tsv) \
--admin-password $(az keyvault secret show --vault-name atp-keyvault-staging-eus --name SqlAdminPassword --query value -o tsv) \
--storage-key $(az storage account keys list --account-name atpstoragestgeus --query "[0].value" -o tsv) \
--storage-key-type StorageAccessKey \
--storage-uri "https://atpstoragestgeus.blob.core.windows.net/backups/staging-weekly-$(date +%Y%m%d).bacpac"
# Verify backup
BACKUP_SIZE=$(az storage blob show \
--account-name atpstoragestgeus \
--container-name backups \
--name "staging-weekly-$(date +%Y%m%d).bacpac" \
--query properties.contentLength -o tsv)
if [ "$BACKUP_SIZE" -gt 1000000 ]; then
echo "✅ Staging backup created: $(($BACKUP_SIZE / 1024 / 1024)) MB"
else
echo "❌ Backup verification failed"
exit 1
fi
Production Environment Data Management¶
Purpose: Live tenant audit records with real PII, full compliance enforcement, and 7-year retention with WORM storage and tamper-evidence.
Data Source:
- Live Tenant Traffic: Real audit events ingested from production tenant applications.
- Volume: Millions of events per day across all tenants.
- PII Classification: Full PII classification with data sensitivity labels (see pii-redaction-classification.md).
Data Characteristics:
// Production audit event with PII classification
public class AuditEvent
{
public string EventId { get; set; }
public string TenantId { get; set; }
public DateTime Timestamp { get; set; }
[PersonalData] // PII classification
public string Actor { get; set; } // john.doe@customer.com
public string Action { get; set; }
public string Resource { get; set; }
public string Outcome { get; set; }
[PersonalData]
public string IpAddress { get; set; } // Client IP
public string UserAgent { get; set; }
[SensitiveData]
public Dictionary<string, object> Metadata { get; set; } // May contain PII
// Immutability fields
public string Hash { get; set; } // SHA-256 hash of event
public string PreviousHash { get; set; } // Hash chain
public string SegmentId { get; set; }
public bool Sealed { get; set; }
public DateTime? SealedAt { get; set; }
// Compliance fields
public bool LegalHold { get; set; }
public string LegalHoldReason { get; set; }
public DateTime? LegalHoldPlacedAt { get; set; }
public int RetentionDays { get; set; } // Per-tenant retention
public DateTime PurgeEligibleAt { get; set; } // Timestamp + RetentionDays
public bool MarkedForDeletion { get; set; } // Soft delete flag; physical purge after 30 days
public DateTime? MarkedForDeletionAt { get; set; }
}
Retention Policy (Production):
- Default Retention: 7 years (2,555 days) for all tenants.
- Tenant-Specific Retention: Configurable per tenant (minimum 1 year, maximum 10 years).
- Legal Holds: Override retention; events never purged while under legal hold.
- Regulatory Compliance: GDPR storage-limitation principle (retain no longer than a documented purpose requires), HIPAA (6-year minimum), SOC 2 (7-year audit log retention).
Retention Configuration (Per Tenant):
// Tenant retention configuration
public class Tenant
{
public string TenantId { get; set; }
public string Name { get; set; }
// Retention settings
public int RetentionDays { get; set; } = 2555; // 7 years default
public bool AllowCustomRetention { get; set; } = false;
public int MinRetentionDays { get; set; } = 365; // 1 year minimum
public int MaxRetentionDays { get; set; } = 3650; // 10 years maximum
// Compliance profile determines retention requirements
public string ComplianceProfile { get; set; } // "gdpr,hipaa,soc2"
// Legal hold management
public bool HasActiveLegalHolds { get; set; }
public List<LegalHold> LegalHolds { get; set; }
}
// Legal hold entity
public class LegalHold
{
public string LegalHoldId { get; set; }
public string TenantId { get; set; }
public string Reason { get; set; }
public DateTime PlacedAt { get; set; }
public string PlacedBy { get; set; }
public DateTime? ReleasedAt { get; set; }
public string ReleasedBy { get; set; }
public string CaseNumber { get; set; }
}
Retention Enforcement (Production):
// Production retention enforcement (Azure Function)
[FunctionName("EnforceProductionRetention")]
public async Task RunAsync(
[TimerTrigger("0 0 2 * * *")] TimerInfo timer, // Daily at 2 AM
ILogger log)
{
log.LogInformation("Enforcing production retention policy...");
var tenants = await _context.Tenants.ToListAsync();
foreach (var tenant in tenants)
{
var retentionCutoff = DateTime.UtcNow.AddDays(-tenant.RetentionDays);
// Find events eligible for purge
var eligibleEvents = await _context.AuditEvents
.Where(e => e.TenantId == tenant.TenantId)
.Where(e => e.Timestamp < retentionCutoff)
.Where(e => !e.LegalHold) // Never delete events under legal hold
.Where(e => e.Sealed) // Only delete sealed (immutable) events
.ToListAsync();
if (eligibleEvents.Any())
{
log.LogInformation($"Purging {eligibleEvents.Count} events for tenant {tenant.TenantId}");
// Mark as deleted (soft delete; actual purge happens after 30 days)
foreach (var evt in eligibleEvents)
{
evt.MarkedForDeletion = true;
evt.MarkedForDeletionAt = DateTime.UtcNow;
}
await _context.SaveChangesAsync();
// Audit the purge operation
await _auditLogger.LogRetentionPurgeAsync(tenant.TenantId, eligibleEvents.Count, retentionCutoff);
}
}
log.LogInformation("✅ Production retention enforcement complete");
}
Immutability: Fully Enabled (WORM storage, hash chains, tamper-evidence):
// Production immutability implementation
public class ImmutableAuditService
{
public async Task<AuditSegment> SealSegmentAsync(string segmentId)
{
var segment = await _context.AuditSegments
.Include(s => s.Events)
.FirstOrDefaultAsync(s => s.SegmentId == segmentId);
if (segment is null)
{
throw new InvalidOperationException($"Segment {segmentId} not found");
}
if (segment.Sealed)
{
throw new InvalidOperationException("Segment already sealed");
}
// 1. Calculate Merkle tree hash
var merkleRoot = CalculateMerkleTreeHash(segment.Events);
// 2. Sign hash with HSM-backed key
var signature = await _cryptoClient.SignDataAsync(
SignatureAlgorithm.RS256,
Encoding.UTF8.GetBytes(merkleRoot));
// 3. Seal segment
segment.Sealed = true;
segment.SealedAt = DateTime.UtcNow;
segment.MerkleRoot = merkleRoot;
segment.Signature = Convert.ToBase64String(signature.Signature);
segment.Immutable = true;
await _context.SaveChangesAsync();
// 4. Store segment hash in blockchain/DLT (optional)
await _blockchainAnchor.AnchorHashAsync(segmentId, merkleRoot, signature.Signature);
return segment;
}
}
Backups (Production):
- Incremental Backups: Daily incremental backups (capture changes since last full).
- Full Backups: Weekly full database backups.
- Geo-Replication: All backups replicated to secondary region (West Europe).
- Retention: 7 years (matches audit retention requirements).
- Encryption: All backups encrypted with customer-managed keys (CMK).
Backup Strategy (Production):
#!/bin/bash
# production-backup-strategy.sh
DAY_OF_WEEK=$(date +%u) # 1 = Monday, 7 = Sunday
if [ "$DAY_OF_WEEK" -eq 7 ]; then
echo "Sunday: Creating weekly full backup..."
# Full backup (BACPAC export)
az sql db export \
--name ATP_Prod \
--resource-group ATP-Prod-RG \
--server atp-sql-prod-eus \
--admin-user $(az keyvault secret show --vault-name atp-keyvault-prod-eus --name SqlAdminUser --query value -o tsv) \
--admin-password $(az keyvault secret show --vault-name atp-keyvault-prod-eus --name SqlAdminPassword --query value -o tsv) \
--storage-key $(az storage account keys list --account-name atpstorageprodeus --query "[0].value" -o tsv) \
--storage-key-type StorageAccessKey \
--storage-uri "https://atpstorageprodeus.blob.core.windows.net/backups/weekly/prod-full-$(date +%Y%m%d).bacpac"
# Copy to geo-redundant storage (automatic with GZRS)
echo "✅ Weekly full backup created"
else
echo "Weekday: Daily incremental backup..."
# Azure SQL takes automated full/differential/log backups natively;
# this step only ensures the backup storage is geo-redundant
az sql db update \
--name ATP_Prod \
--resource-group ATP-Prod-RG \
--server atp-sql-prod-eus \
--backup-storage-redundancy Geo
# Point-in-time restore enabled (automatic)
echo "✅ Daily incremental backup configured (geo-redundant, automatic)"
fi
# Verify backup retention
BACKUP_COUNT=$(az storage blob list \
--account-name atpstorageprodeus \
--container-name backups \
--prefix "weekly/" \
--query "length([?properties.createdOn > '$(date -d '7 years ago' --iso-8601)'])")
echo "Production backups (7-year retention): $BACKUP_COUNT"
Immutability: Fully Enforced (WORM storage with policy lock):
# Configure WORM (Write-Once-Read-Many) storage
az storage container immutability-policy create \
--account-name atpstorageprodeus \
--container-name audit-events \
--period 2555 \
--allow-protected-append-writes false
# Lock immutability policy (irreversible)
az storage container immutability-policy lock \
--account-name atpstorageprodeus \
--container-name audit-events \
--if-match <etag>
# Place legal hold (for litigation support)
az storage container legal-hold set \
--account-name atpstorageprodeus \
--container-name audit-events \
--tags "case-2025-001" "litigation-hold"
Data Refresh & Maintenance Windows¶
Dev Environment:
- Refresh Frequency: On-demand (developers trigger when needed).
- Downtime: Acceptable (no SLA).
- Method: Drop database + re-seed from scripts.
Test Environment:
- Refresh Frequency: Weekly (every Sunday 2 AM).
- Downtime: 2-hour maintenance window (2 AM - 4 AM).
- Method: Truncate tables + reload fixtures.
Staging Environment:
- Refresh Frequency: Monthly (first Saturday of month, 2 AM).
- Downtime: 4-hour maintenance window (2 AM - 6 AM).
- Method: Restore from anonymized production snapshot or regenerate synthetic data.
Production Environment:
- Refresh Frequency: Never (live data only).
- Maintenance: Continuous (online operations; no downtime for data management).
- Method: N/A (live ingestion only).
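The staging cadence above (first Saturday of the month) can be gated directly in a refresh script. A minimal sketch using GNU `date`; the function name is illustrative:

```shell
#!/bin/bash
# is_first_saturday: true when the given date (YYYY-MM-DD) falls on the
# month's first Saturday. A Saturday within the first seven days of a month
# is always the first Saturday of that month.
is_first_saturday() {
  local d=$1
  [ "$(date -d "$d" +%u)" -eq 6 ] && [ "$(date -d "$d" +%-d)" -le 7 ]
}

if is_first_saturday "$(date +%Y-%m-%d)"; then
  echo "Within staging refresh window; proceeding"
else
  echo "Not the first Saturday; skipping staging refresh"
fi
```

The same check works for the weekly Test refresh by testing `+%u` against 7 (Sunday) instead.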
Data Anonymization & Obfuscation¶
Purpose: Enable realistic testing in Staging without exposing production PII.
Anonymization Techniques:
- Email Addresses: Hash-based mapping to fake emails (consistent across snapshots).
- IP Addresses: Replace with private IP ranges (10.0.0.0/8).
- User Names: Replace with generated pseudonyms (user-0001, user-0002).
- Tenant IDs: Map to staging tenant IDs (tenant-12345 → staging-tenant-001).
- Metadata: Redact PII fields; preserve structure and data types.
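The consistency property of the hash-based email mapping (same input plus a fixed salt always yields the same pseudonym, so cross-snapshot joins survive anonymization) can be demonstrated from the shell. The salt value and output format here are illustrative, mirroring the approach used by the C# pipeline:

```shell
#!/bin/bash
# Deterministic email pseudonymization: identical input + salt always maps to
# the same fake address. The salt value is illustrative; in practice it would
# come from Key Vault.
SALT="staging-anonymization-salt"

anonymize_email() {
  local hash prefix
  # First 8 hex chars of SHA-256(email + salt), reduced mod 10000
  hash=$(printf '%s%s' "$1" "$SALT" | sha256sum | cut -c1-8)
  prefix=$((16#$hash % 10000))
  printf 'user-%04d@staging.local\n' "$prefix"
}

anonymize_email "alice@example.com"   # same pseudonym on every run
anonymize_email "alice@example.com"
anonymize_email "bob@example.com"
```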
Anonymization Pipeline:
// Production → Staging anonymization pipeline
public class DataAnonymizationPipeline
{
public async Task<string> CreateAnonymizedSnapshotAsync()
{
// 1. Extract from production (read replica to avoid impact)
var prodConnectionString = await GetReadReplicaConnectionStringAsync();
var prodEvents = await ExtractProductionDataAsync(prodConnectionString, days: 180);
// 2. Anonymize PII fields
var anonymizedEvents = prodEvents.Select(AnonymizeEvent).ToList();
// 3. Verify no PII remains
await VerifyNoPIIAsync(anonymizedEvents);
// 4. Export to Parquet (efficient columnar format)
var outputPath = $"anonymized-prod-snapshot-{DateTime.UtcNow:yyyyMMdd}.parquet";
await ExportToParquetAsync(anonymizedEvents, outputPath);
// 5. Upload to Staging storage
await UploadToAzureStorageAsync(outputPath, "atpstoragestgeus", "anonymized-snapshots");
return outputPath;
}
private AuditEvent AnonymizeEvent(AuditEvent prodEvent)
{
return new AuditEvent
{
EventId = prodEvent.EventId, // Preserve ID (no PII)
TenantId = _tenantMapper.MapToStagingTenant(prodEvent.TenantId),
Timestamp = prodEvent.Timestamp, // Preserve timestamp
// Anonymize PII fields
Actor = AnonymizeEmail(prodEvent.Actor),
IpAddress = AnonymizeIp(prodEvent.IpAddress),
// Preserve non-PII fields
Action = prodEvent.Action,
Resource = prodEvent.Resource,
Outcome = prodEvent.Outcome,
UserAgent = prodEvent.UserAgent,
// Anonymize metadata (recursive PII removal)
Metadata = AnonymizeMetadata(prodEvent.Metadata)
};
}
private string AnonymizeEmail(string email)
{
// Deterministic hashing (same email → same fake email)
var hash = SHA256.HashData(Encoding.UTF8.GetBytes(email + _salt));
var hashInt = BitConverter.ToInt32(hash, 0);
return $"user-{Math.Abs(hashInt) % 10000:D4}@staging.local";
}
private string AnonymizeIp(string ipAddress)
{
var hash = SHA256.HashData(Encoding.UTF8.GetBytes(ipAddress + _salt));
return $"10.0.{hash[0]}.{hash[1]}";
}
private Dictionary<string, object> AnonymizeMetadata(Dictionary<string, object> metadata)
{
var anonymized = new Dictionary<string, object>();
if (metadata is null) return anonymized; // events may carry no metadata
foreach (var kvp in metadata)
{
// Redact known PII fields
if (IsPIIField(kvp.Key))
{
anonymized[kvp.Key] = "[REDACTED]";
}
else
{
anonymized[kvp.Key] = kvp.Value;
}
}
return anonymized;
}
}
PII Verification (Automated):
// Verify no PII in anonymized dataset
public class PIIVerifier
{
private readonly List<Regex> _piiPatterns = new()
{
new Regex(@"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"), // Email
new Regex(@"\b\d{3}-\d{2}-\d{4}\b"), // SSN
new Regex(@"\b(?:\d{1,3}\.){3}\d{1,3}\b"), // IP address (public)
new Regex(@"\b\d{3}-\d{3}-\d{4}\b") // Phone number
};
public async Task<bool> VerifyNoPIIAsync(IEnumerable<AuditEvent> events)
{
var piiFound = false;
foreach (var evt in events.Take(1000)) // Sample 1000 events
{
var serialized = JsonSerializer.Serialize(evt);
foreach (var pattern in _piiPatterns)
{
if (pattern.IsMatch(serialized))
{
Console.WriteLine($"⚠️ Potential PII found in event {evt.EventId}");
piiFound = true;
}
}
}
if (piiFound)
{
throw new InvalidOperationException("PII verification failed; anonymization incomplete");
}
Console.WriteLine("✅ PII verification passed (no PII detected in sample)");
return true;
}
}
Data Seeding & Generation Tools¶
Data Seeder CLI Tool¶
// tools/DataSeeder/Program.cs
public class Program
{
public static async Task<int> Main(string[] args)
{
var rootCommand = new RootCommand("ATP Data Seeder - Generate synthetic audit data");
var environmentOption = new Option<string>("--environment", "Target environment (Development, Test, Staging)");
var clearOption = new Option<bool>("--clear", "Clear existing data before seeding");
var tenantsOption = new Option<int>("--tenants", () => 10, "Number of tenants to generate");
var eventsPerTenantOption = new Option<int>("--events-per-tenant", () => 1000, "Events per tenant");
var eventsTotalOption = new Option<int>("--events-total", "Total events (overrides events-per-tenant)");
var startDateOption = new Option<DateTime>("--start-date", () => DateTime.UtcNow.AddDays(-30), "Start date for events");
var endDateOption = new Option<DateTime>("--end-date", () => DateTime.UtcNow, "End date for events");
var seedOption = new Option<int>("--seed", () => 42, "Random seed for reproducibility");
var loadFixturesOption = new Option<string[]>("--load-fixtures", "Path(s) to fixture files (JSON/Parquet); may be repeated");
rootCommand.AddOption(environmentOption);
rootCommand.AddOption(clearOption);
rootCommand.AddOption(tenantsOption);
rootCommand.AddOption(eventsPerTenantOption);
rootCommand.AddOption(eventsTotalOption);
rootCommand.AddOption(startDateOption);
rootCommand.AddOption(endDateOption);
rootCommand.AddOption(seedOption);
rootCommand.AddOption(loadFixturesOption);
rootCommand.SetHandler(async (context) =>
{
var environment = context.ParseResult.GetValueForOption(environmentOption);
var clear = context.ParseResult.GetValueForOption(clearOption);
var tenants = context.ParseResult.GetValueForOption(tenantsOption);
var eventsPerTenant = context.ParseResult.GetValueForOption(eventsPerTenantOption);
var eventsTotal = context.ParseResult.GetValueForOption(eventsTotalOption);
var startDate = context.ParseResult.GetValueForOption(startDateOption);
var endDate = context.ParseResult.GetValueForOption(endDateOption);
var seed = context.ParseResult.GetValueForOption(seedOption);
var fixturesPaths = context.ParseResult.GetValueForOption(loadFixturesOption);
var seeder = new DataSeeder(environment);
if (clear)
{
await seeder.ClearDataAsync();
}
if (fixturesPaths is { Length: > 0 })
{
foreach (var fixturesPath in fixturesPaths)
{
await seeder.LoadFixturesAsync(fixturesPath);
}
}
else
{
var totalEvents = eventsTotal > 0 ? eventsTotal : (tenants * eventsPerTenant);
await seeder.GenerateSyntheticDataAsync(tenants, totalEvents, startDate, endDate, seed);
}
});
return await rootCommand.InvokeAsync(args);
}
}
Usage Examples:
# Dev: Generate 10 tenants with 1000 events each
dotnet run --project tools/DataSeeder -- \
--environment Development \
--clear \
--tenants 10 \
--events-per-tenant 1000 \
--start-date "2025-09-01" \
--end-date "2025-10-30"
# Test: Load stable fixtures
dotnet run --project tools/DataSeeder -- \
--environment Test \
--clear \
--load-fixtures tests/fixtures/test-tenants.json \
--load-fixtures tests/fixtures/test-events-stable.parquet
# Staging: Generate production-scale data
dotnet run --project tools/DataSeeder -- \
--environment Staging \
--clear \
--tenants 50 \
--events-total 5000000 \
--start-date "2025-04-01" \
--end-date "2025-10-30" \
--seed 2025
Data Migration & Schema Updates¶
Purpose: Manage database schema changes across environments with zero-downtime migrations and rollback capabilities.
Migration Workflow:
flowchart LR
A[EF Core Migration Created] --> B[Dev: Apply Migration]
B --> C{Tests Pass?}
C -->|No| D[Fix Migration]
C -->|Yes| E[Test: Apply Migration]
E --> F{Regression Tests Pass?}
F -->|No| D
F -->|Yes| G[Staging: Apply Migration]
G --> H{Production-Like Tests Pass?}
H -->|No| D
H -->|Yes| I[Production: Apply Migration]
I --> J[Verify Migration Success]
Migration Script (Entity Framework Core):
#!/bin/bash
# apply-migration.sh
ENVIRONMENT=$1
MIGRATION_NAME=$2
echo "Applying migration '$MIGRATION_NAME' to $ENVIRONMENT..."
# 1. Backup database before migration
az sql db copy \
--name ATP_$ENVIRONMENT \
--resource-group ATP-$ENVIRONMENT-RG \
--server atp-sql-$ENVIRONMENT-eus \
--dest-name ATP_${ENVIRONMENT}_PreMigration_$(date +%Y%m%d) \
--dest-resource-group ATP-Backups-RG
# 2. Apply migration
dotnet ef database update \
--project src/ConnectSoft.ATP.Infrastructure \
--connection "Server=atp-sql-$ENVIRONMENT-eus.database.windows.net;Database=ATP_$ENVIRONMENT;Authentication=Active Directory Default;" \
--verbose
if [ $? -ne 0 ]; then
echo "❌ Migration failed; rolling back..."
# Rollback migration (apply previous migration)
dotnet ef database update <PreviousMigration> \
--project src/ConnectSoft.ATP.Infrastructure \
--connection "Server=atp-sql-$ENVIRONMENT-eus.database.windows.net;Database=ATP_$ENVIRONMENT;Authentication=Active Directory Default;"
exit 1
fi
# 3. Verify migration applied
APPLIED_MIGRATION=$(sqlcmd -S atp-sql-$ENVIRONMENT-eus.database.windows.net \
-d ATP_$ENVIRONMENT \
-Q "SELECT TOP 1 MigrationId FROM __EFMigrationsHistory ORDER BY MigrationId DESC" \
-h -1)
if [[ "$APPLIED_MIGRATION" == *"$MIGRATION_NAME"* ]]; then
echo "✅ Migration '$MIGRATION_NAME' applied successfully to $ENVIRONMENT"
else
echo "❌ Migration verification failed"
exit 1
fi
Data Compliance & Privacy¶
GDPR Compliance (Data Subject Rights):
// Right to erasure (GDPR Article 17)
public class GdprDataService
{
public async Task ErasePersonalDataAsync(string tenantId, string userId)
{
// 1. Find all events for user
var userEvents = await _context.AuditEvents
.Where(e => e.TenantId == tenantId)
.Where(e => e.Actor == userId)
.ToListAsync();
// 2. Pseudonymize (cannot delete immutable audit events)
foreach (var evt in userEvents)
{
evt.Actor = $"[ERASED-{ComputeHash(userId).Substring(0, 8)}]";
evt.IpAddress = "0.0.0.0";
evt.Metadata = new Dictionary<string, object>
{
["erasedAt"] = DateTime.UtcNow,
["reason"] = "GDPR Right to Erasure"
};
}
await _context.SaveChangesAsync();
// 3. Audit the erasure operation
await _auditLogger.LogDataErasureAsync(tenantId, userId, userEvents.Count);
}
}
HIPAA Compliance (Minimum Necessary Rule):
- Dev/Test: No PHI (Protected Health Information); synthetic data only.
- Staging: Obfuscated data; no real patient information.
- Production: Real PHI with encryption at rest, encryption in transit, access controls, and audit logging.
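A coarse pre-import guard can enforce the no-PHI rule for lower environments before any fixture or snapshot is loaded. The grep patterns below are illustrative; this complements, and does not replace, the PIIVerifier tool:

```shell
#!/bin/bash
# Pre-import guard for Dev/Test/Staging: reject a data file if it contains
# obvious PHI/PII markers (SSN- or US-phone-shaped strings). This is a coarse
# illustrative check, not a substitute for the PIIVerifier tool.
scan_for_phi() {
  local file=$1
  if grep -E -q '\b[0-9]{3}-[0-9]{2}-[0-9]{4}\b|\b[0-9]{3}-[0-9]{3}-[0-9]{4}\b' "$file"; then
    echo "❌ PHI-like pattern found in $file; refusing import" >&2
    return 1
  fi
  echo "✅ No PHI-like patterns in $file"
}
```

Running this as the first step of any non-production seeding script turns the "minimum necessary" policy into an executable gate.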
Data Seeding Scripts (Comprehensive)¶
Dev Environment Seeding:
#!/bin/bash
# comprehensive-dev-seed.sh
echo "=== Dev Environment Data Seeding ==="
# 1. Database seeding
echo "Seeding SQL database..."
dotnet run --project tools/DataSeeder -- \
--environment Development \
--clear \
--tenants 10 \
--events-per-tenant 1000 \
--start-date "2025-09-01" \
--end-date "2025-10-30" \
--seed 42
# 2. Redis cache seeding
echo "Seeding Redis cache..."
REDIS_HOST="atp-redis-dev-eus.redis.cache.windows.net"
REDIS_PASSWORD=$(az keyvault secret show --vault-name atp-keyvault-dev-eus --name RedisPassword --query value -o tsv)
redis-cli -h $REDIS_HOST -p 6380 -a $REDIS_PASSWORD --tls <<EOF
FLUSHDB
SET session:dev-tenant-001:user-1 "{\"userId\":\"user-1\",\"tenantId\":\"dev-tenant-001\",\"expiresAt\":\"2025-10-31T00:00:00Z\"}"
SET session:dev-tenant-002:user-2 "{\"userId\":\"user-2\",\"tenantId\":\"dev-tenant-002\",\"expiresAt\":\"2025-10-31T00:00:00Z\"}"
SET cache:tenant-config:dev-tenant-001 "{\"maxRetention\":30,\"enableImmutability\":false}"
EXPIRE session:dev-tenant-001:user-1 86400
EXPIRE session:dev-tenant-002:user-2 86400
EOF
# 3. Cosmos DB seeding (metadata store)
echo "Seeding Cosmos DB..."
az cosmosdb sql container item upsert \
--account-name atp-cosmos-dev-eus \
--database-name ATP \
--container-name TenantMetadata \
--partition-key "dev-tenant-001" \
--body '{
"id": "dev-tenant-001",
"tenantId": "dev-tenant-001",
"settings": {
"retentionDays": 30,
"enableImmutability": false
}
}'
# 4. Service Bus seeding (seed queues with sample messages)
echo "Seeding Service Bus..."
SB_CONNECTION=$(az keyvault secret show --vault-name atp-keyvault-dev-eus --name ServiceBusConnectionString --query value -o tsv)
# Send sample messages to ingestion queue
for i in {1..10}; do
az servicebus queue message send \
--namespace-name atp-servicebus-dev-eus \
--queue-name ingestion-queue \
--body "{\"eventId\":\"seed-$i\",\"tenantId\":\"dev-tenant-001\",\"action\":\"Create\"}"
done
echo "✅ Dev environment seeding complete"
Staging Environment Data Refresh (From Anonymized Prod Snapshot):
#!/bin/bash
# refresh-staging-from-prod-snapshot.sh
echo "=== Staging Data Refresh (Anonymized Production Snapshot) ==="
# 1. Generate anonymized snapshot in Production (via Azure Function)
echo "Triggering anonymization pipeline..."
SNAPSHOT_PATH=$(az functionapp function invoke \
--resource-group ATP-Prod-RG \
--name atp-functions-prod-eus \
--function-name CreateAnonymizedSnapshot \
--query "outputBindings.snapshotPath" -o tsv)
echo "Anonymized snapshot created: $SNAPSHOT_PATH"
# 2. Download snapshot to staging
echo "Downloading anonymized snapshot..."
az storage blob download \
--account-name atpstorageprodeus \
--container-name anonymized-snapshots \
--name "$SNAPSHOT_PATH" \
--file ./staging-snapshot.parquet \
--auth-mode login
# 3. Verify no PII in snapshot
echo "Verifying no PII in snapshot..."
dotnet run --project tools/PIIVerifier -- \
--input ./staging-snapshot.parquet \
--strict
if [ $? -ne 0 ]; then
echo "❌ PII verification failed; aborting refresh"
exit 1
fi
# 4. Clear staging database
echo "Clearing Staging database..."
sqlcmd -S atp-sql-staging-eus.database.windows.net -d ATP_Staging \
-Q "TRUNCATE TABLE AuditEvents; DELETE FROM Tenants;" \
-U $(az keyvault secret show --vault-name atp-keyvault-staging-eus --name SqlAdminUser --query value -o tsv) \
-P $(az keyvault secret show --vault-name atp-keyvault-staging-eus --name SqlAdminPassword --query value -o tsv)
# 5. Import anonymized data
echo "Importing anonymized data to Staging..."
dotnet run --project tools/DataImporter -- \
--environment Staging \
--import ./staging-snapshot.parquet \
--batch-size 10000 \
--parallel-workers 4
# 6. Verify data counts
STAGING_TENANT_COUNT=$(sqlcmd -S atp-sql-staging-eus.database.windows.net -d ATP_Staging -Q "SELECT COUNT(*) FROM Tenants" -h -1)
STAGING_EVENT_COUNT=$(sqlcmd -S atp-sql-staging-eus.database.windows.net -d ATP_Staging -Q "SELECT COUNT(*) FROM AuditEvents" -h -1)
echo "Staging - Tenants: $STAGING_TENANT_COUNT, Events: $STAGING_EVENT_COUNT"
if [ "$STAGING_TENANT_COUNT" -ge 50 ] && [ "$STAGING_EVENT_COUNT" -ge 5000000 ]; then
echo "✅ Staging data refresh from production snapshot successful"
else
echo "❌ Staging data verification failed"
exit 1
fi
# 7. Cleanup temporary files
rm ./staging-snapshot.parquet
echo "✅ Staging refresh complete"
Data Backup & Restore Procedures¶
Automated Backup (Production)¶
# Azure DevOps Pipeline: Production Backups
trigger: none # Scheduled only
schedules:
- cron: "0 2 * * 0" # Weekly on Sunday at 2 AM
displayName: Weekly Full Backup
branches:
include:
- main
always: true
- cron: "0 3 * * 1-6" # Daily at 3 AM (Mon-Sat)
displayName: Daily Incremental Backup
branches:
include:
- main
always: true
jobs:
- job: BackupProduction
pool:
vmImage: 'ubuntu-latest'
steps:
- task: AzureCLI@2
displayName: 'Create Production Backup'
inputs:
scriptType: 'bash'
scriptLocation: 'inlineScript'
inlineScript: |
DAY_OF_WEEK=$(date +%u)
if [ "$DAY_OF_WEEK" -eq 7 ]; then
# Full backup (Sunday)
echo "Creating weekly full backup..."
az sql db export \
--name ATP_Prod \
--resource-group ATP-Prod-RG \
--server atp-sql-prod-eus \
--admin-user $(SqlAdminUser) \
--admin-password $(SqlAdminPassword) \
--storage-key $(StorageAccountKey) \
--storage-key-type StorageAccessKey \
--storage-uri "https://atpstorageprodeus.blob.core.windows.net/backups/weekly/prod-full-$(date +%Y%m%d).bacpac"
# Tag backup with metadata
az storage blob metadata update \
--account-name atpstorageprodeus \
--container-name backups \
--name "weekly/prod-full-$(date +%Y%m%d).bacpac" \
--metadata type=full createdAt=$(date -u +%Y-%m-%dT%H:%M:%SZ) retentionYears=7
else
# Incremental backup (Mon-Sat) - automatic via Azure SQL
echo "Daily incremental backup (automatic point-in-time restore)"
fi
# Verify backup exists
LATEST_BACKUP=$(az storage blob list \
--account-name atpstorageprodeus \
--container-name backups \
--prefix "weekly/" \
--query "sort_by([?properties.createdOn > '$(date -d '7 days ago' --iso-8601)'], &properties.createdOn)[-1].name" \
-o tsv)
if [ -n "$LATEST_BACKUP" ]; then
echo "✅ Latest backup verified: $LATEST_BACKUP"
else
echo "❌ Backup verification failed"
exit 1
fi
Disaster Recovery Restore (Production)¶
#!/bin/bash
# disaster-recovery-restore.sh
echo "=== DISASTER RECOVERY: Restore Production from Backup ==="
read -p "Confirm DR restore to Production (yes/no): " CONFIRM
if [ "$CONFIRM" != "yes" ]; then
echo "DR restore cancelled"
exit 0
fi
read -p "Enter backup file name (e.g., prod-full-20251030.bacpac): " BACKUP_FILE
echo "Restoring Production from backup: $BACKUP_FILE"
# 1. Create temporary database for verification
echo "Creating temporary restore database..."
az sql db import \
--resource-group ATP-Prod-RG \
--server atp-sql-prod-eus \
--name ATP_Prod_Restore_Temp \
--admin-user $(az keyvault secret show --vault-name atp-keyvault-prod-eus --name SqlAdminUser --query value -o tsv) \
--admin-password $(az keyvault secret show --vault-name atp-keyvault-prod-eus --name SqlAdminPassword --query value -o tsv) \
--storage-key $(az storage account keys list --account-name atpstorageprodeus --query "[0].value" -o tsv) \
--storage-key-type StorageAccessKey \
--storage-uri "https://atpstorageprodeus.blob.core.windows.net/backups/weekly/$BACKUP_FILE"
# 2. Verify restored database integrity
echo "Verifying restored database..."
RESTORE_TENANT_COUNT=$(sqlcmd -S atp-sql-prod-eus.database.windows.net -d ATP_Prod_Restore_Temp -Q "SELECT COUNT(*) FROM Tenants" -h -1)
RESTORE_EVENT_COUNT=$(sqlcmd -S atp-sql-prod-eus.database.windows.net -d ATP_Prod_Restore_Temp -Q "SELECT COUNT(*) FROM AuditEvents" -h -1)
echo "Restored - Tenants: $RESTORE_TENANT_COUNT, Events: $RESTORE_EVENT_COUNT"
read -p "Counts look correct? Proceed with swap (yes/no): " PROCEED
if [ "$PROCEED" != "yes" ]; then
echo "Aborting DR restore"
az sql db delete --name ATP_Prod_Restore_Temp --resource-group ATP-Prod-RG --server atp-sql-prod-eus --yes
exit 0
fi
# 3. Rename databases (swap)
echo "Swapping databases..."
# Capture the timestamp once so the reported name matches the actual name
BACKUP_NAME="ATP_Prod_Backup_$(date +%Y%m%d%H%M%S)"
# Rename current prod to backup
az sql db rename \
--resource-group ATP-Prod-RG \
--server atp-sql-prod-eus \
--name ATP_Prod \
--new-name "$BACKUP_NAME"
# Rename restored to prod
az sql db rename \
--resource-group ATP-Prod-RG \
--server atp-sql-prod-eus \
--name ATP_Prod_Restore_Temp \
--new-name ATP_Prod
echo "✅ DR restore complete; Production database swapped"
echo "⚠️ Old production database preserved as: $BACKUP_NAME"
Data Volume & Growth Projections¶
Expected Data Growth:
| Environment | Current Volume | Growth Rate | 1-Year Projection | Storage Tier |
|---|---|---|---|---|
| Dev | 10K events | 0 (static) | 10K events | Hot |
| Test | 50K events | 0 (static fixtures) | 50K events | Hot |
| Staging | 5M events | Refreshed monthly | 5M events (static) | Hot |
| Production | 100M events | 500K/day | 280M events | Hot (0-30d), Cool (30-90d), Archive (90d+) |
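The production projection in the table follows directly from the stated ingest rate; a quick arithmetic check:

```shell
#!/bin/bash
# Sanity-check the production growth projection from the table:
# 100M current events + 500K/day for one year = 282.5M, i.e. roughly 280M.
CURRENT=100000000
PER_DAY=500000
PROJECTED=$((CURRENT + PER_DAY * 365))
echo "1-year projection: $PROJECTED events"
```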
Storage Lifecycle Policy (Production):
{
"rules": [
{
"enabled": true,
"name": "MoveToCool",
"type": "Lifecycle",
"definition": {
"filters": {
"blobTypes": ["blockBlob"],
"prefixMatch": ["audit-events/"]
},
"actions": {
"baseBlob": {
"tierToCool": {
"daysAfterModificationGreaterThan": 30
},
"tierToArchive": {
"daysAfterModificationGreaterThan": 90
},
"delete": {
"daysAfterModificationGreaterThan": 2555
}
}
}
}
}
]
}
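Assuming the JSON above is saved as `lifecycle-policy.json` (an illustrative file name), it can be applied to the production storage account with the Azure CLI:

```shell
#!/bin/bash
# Apply the storage lifecycle policy to the production storage account.
# Resource names follow the conventions used elsewhere in this document.
az storage account management-policy create \
  --account-name atpstorageprodeus \
  --resource-group ATP-Prod-RG \
  --policy @lifecycle-policy.json
```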
Summary¶
- Data Sources: Synthetic generators (Dev/Test), production-like synthetic or anonymized snapshots (Staging), live tenant data (Production).
- Retention Policies: 30 days (Dev), 90 days (Test), 180 days (Staging), 7 years (Production) with legal hold overrides.
- Immutability: Disabled (Dev/Test), enabled (Staging for testing), fully enforced with WORM (Production).
- Backups: None (Dev), daily (Test), weekly (Staging), daily incremental + weekly full with geo-replication (Production).
- Data Anonymization: Production → Staging anonymization pipeline with PII verification ensures no real PII in lower environments.
- Data Seeding Tools: Comprehensive CLI tools, C# generators, and automated refresh scripts for all environments.
- Compliance: GDPR/HIPAA compliance with data subject rights, minimum necessary rule, and audit evidence.
Infrastructure as Code (IaC) Overlays¶
ATP uses Infrastructure as Code with Pulumi (C# preferred) and Bicep to provision and manage Azure resources across all environments. The overlay pattern separates base infrastructure definitions (common across all environments) from environment-specific configurations (SKU tiers, scaling, networking), enabling consistent infrastructure with graduated controls.
This approach ensures infrastructure reproducibility, configuration drift detection, and environment parity (Staging mirrors Production) while optimizing costs (Dev/Test use lower SKUs).
IaC Strategy Overview¶
Tools:
- Pulumi (C#): Primary IaC tool; preferred for type-safety, IDE support, and C# team familiarity.
- Bicep: Alternative for Azure-native declarative templates; used for simple resource provisioning.
Repository Structure:
ConnectSoft.ATP.Infrastructure/
├── Pulumi.yaml # Pulumi project metadata
├── Pulumi.dev.yaml # Dev stack configuration
├── Pulumi.test.yaml # Test stack configuration
├── Pulumi.staging.yaml # Staging stack configuration
├── Pulumi.prod.yaml # Production stack configuration
├── Program.cs # Base infrastructure (Pulumi C#)
├── Overlays/
│ ├── DevOverlay.cs # Dev-specific overrides
│ ├── TestOverlay.cs # Test-specific overrides
│ ├── StagingOverlay.cs # Staging-specific overrides
│ └── ProductionOverlay.cs # Production-specific overrides
├── Resources/
│ ├── AppServiceResources.cs # App Service definitions
│ ├── DatabaseResources.cs # SQL, Cosmos DB
│ ├── StorageResources.cs # Blob, Service Bus, Redis
│ ├── NetworkingResources.cs # VNet, NSG, Private Endpoints
│ └── ObservabilityResources.cs # App Insights, Log Analytics
└── bicep/ # Bicep templates (alternative)
├── main.bicep # Base template
└── overlays/
├── dev.bicepparam # Dev parameters
├── test.bicepparam # Test parameters
├── staging.bicepparam # Staging parameters
└── prod.bicepparam # Production parameters
Base Infrastructure (Pulumi C#)¶
Base infrastructure defines the common resources required for all ATP environments, with parameterized values that are overridden by environment-specific overlays.
// Program.cs - Base infrastructure
using System;
using Pulumi;
using Pulumi.AzureNative.Insights;     // Application Insights Component
using Pulumi.AzureNative.Resources;
using Pulumi.AzureNative.Web;
using Pulumi.AzureNative.Web.Inputs;   // SkuDescriptionArgs
using Pulumi.AzureNative.Sql;
using Pulumi.AzureNative.Cache;
using Pulumi.AzureNative.ServiceBus;
using Pulumi.AzureNative.Storage;
class ATPInfrastructure : Stack
{
public ATPInfrastructure()
{
// Read configuration
var config = new Config();
var environment = config.Require("environment"); // dev, test, staging, prod
var region = config.Get("region") ?? "eastus";
// Load environment-specific overlay
var overlay = LoadOverlay(environment);
// Common tags
var tags = new InputMap<string>
{
["Environment"] = environment,
["Project"] = "ATP",
["ManagedBy"] = "pulumi",
["Owner"] = "platform-team@connectsoft.example",
["CostCenter"] = "ATP-Platform",
["Compliance"] = "gdpr,hipaa,soc2"
};
// Resource Group
var resourceGroup = new ResourceGroup($"atp-{environment}-{region}-rg", new ResourceGroupArgs
{
ResourceGroupName = $"ConnectSoft-ATP-{environment.ToUpper()}-{region.ToUpper()}-RG",
Location = region,
Tags = tags
});
// App Service Plan
var appServicePlan = new AppServicePlan($"atp-plan-{environment}-{region}", new AppServicePlanArgs
{
Name = $"atp-plan-{environment}-{region}",
ResourceGroupName = resourceGroup.Name,
Location = region,
Sku = new SkuDescriptionArgs
{
Name = overlay.AppServiceSku,
Tier = overlay.AppServiceTier,
Capacity = overlay.AppServiceInstances
},
Kind = "linux",
Reserved = true, // Linux
Tags = tags
});
// Azure SQL Server
var sqlServer = new Server($"atp-sql-{environment}-{region}", new ServerArgs
{
ServerName = $"atp-sql-{environment}-{region}",
ResourceGroupName = resourceGroup.Name,
Location = region,
AdministratorLogin = config.RequireSecret("sqlAdminUser"),
AdministratorLoginPassword = config.RequireSecret("sqlAdminPassword"),
Version = "12.0",
PublicNetworkAccess = overlay.EnablePublicNetworkAccess ? "Enabled" : "Disabled",
Tags = tags
});
// Azure SQL Database
var database = new Database($"atp-db-{environment}-{region}", new DatabaseArgs
{
DatabaseName = $"ATP_{environment}",
ResourceGroupName = resourceGroup.Name,
ServerName = sqlServer.Name,
Location = region,
Sku = new Pulumi.AzureNative.Sql.Inputs.SkuArgs
{
Name = overlay.SqlSku,
Tier = overlay.SqlTier,
Capacity = overlay.SqlCapacity
},
MaxSizeBytes = overlay.SqlMaxSizeBytes,
ZoneRedundant = overlay.EnableZoneRedundancy,
Tags = tags
});
// Redis Cache
var redis = new Redis($"atp-redis-{environment}-{region}", new RedisArgs
{
Name = $"atp-redis-{environment}-{region}",
ResourceGroupName = resourceGroup.Name,
Location = region,
Sku = new Pulumi.AzureNative.Cache.Inputs.SkuArgs
{
Name = overlay.RedisSku,
Family = overlay.RedisFamily,
Capacity = overlay.RedisCapacity
},
EnableNonSslPort = false,
MinimumTlsVersion = "1.2",
Tags = tags
});
// Service Bus Namespace
var serviceBus = new Namespace($"atp-servicebus-{environment}-{region}", new NamespaceArgs
{
NamespaceName = $"atp-servicebus-{environment}-{region}",
ResourceGroupName = resourceGroup.Name,
Location = region,
Sku = new Pulumi.AzureNative.ServiceBus.Inputs.SBSkuArgs
{
Name = overlay.ServiceBusSku,
Tier = overlay.ServiceBusTier
},
Tags = tags
});
// Storage Account
var storageAccount = new StorageAccount($"atp-storage-{environment}-{region}", new StorageAccountArgs
{
AccountName = $"atpstorage{environment}{region}".Replace("-", "").ToLower(),
ResourceGroupName = resourceGroup.Name,
Location = region,
Sku = new Pulumi.AzureNative.Storage.Inputs.SkuArgs
{
Name = overlay.StorageReplication
},
Kind = "StorageV2",
EnableHttpsTrafficOnly = true,
MinimumTlsVersion = "TLS1_2",
AllowBlobPublicAccess = false,
Tags = tags
});
// Application Insights
var appInsights = new Component($"atp-appinsights-{environment}-{region}", new ComponentArgs
{
ResourceName = $"atp-appinsights-{environment}-{region}",
ResourceGroupName = resourceGroup.Name,
Location = region,
Kind = "web",
ApplicationType = "web",
RetentionInDays = overlay.LogRetentionDays,
SamplingPercentage = overlay.TelemetrySamplingPercentage,
Tags = tags
});
// Export outputs
this.ResourceGroupName = resourceGroup.Name;
this.AppServicePlanId = appServicePlan.Id;
this.SqlServerName = sqlServer.Name;
this.AppInsightsInstrumentationKey = appInsights.InstrumentationKey;
}
private EnvironmentOverlay LoadOverlay(string environment)
{
return environment.ToLower() switch
{
"dev" => new DevOverlay(),
"test" => new TestOverlay(),
"staging" => new StagingOverlay(),
"prod" => new ProductionOverlay(),
_ => throw new ArgumentException($"Unknown environment: {environment}")
};
}
[Output] public Output<string> ResourceGroupName { get; set; }
[Output] public Output<string> AppServicePlanId { get; set; }
[Output] public Output<string> SqlServerName { get; set; }
[Output] public Output<string> AppInsightsInstrumentationKey { get; set; }
}
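A subtle detail in the base stack is the storage account name: Azure requires 3-24 characters, lowercase letters and digits only, which is why the C# code strips hyphens and lowercases. The same normalization and validation, sketched as a shell helper (the function name is illustrative):

```shell
# Normalize and validate an Azure storage account name, mirroring the
# Replace("-", "").ToLower() logic in the Pulumi stack above.
# Azure requires 3-24 characters, lowercase letters and digits only.
storage_account_name() {
  local env="$1" region="$2" name
  name=$(echo "atpstorage${env}${region}" | tr -d '-' | tr '[:upper:]' '[:lower:]')
  if [ ${#name} -gt 24 ] || ! echo "$name" | grep -Eq '^[a-z0-9]{3,24}$'; then
    echo "invalid storage account name: $name" >&2
    return 1
  fi
  echo "$name"
}

storage_account_name staging eastus   # → atpstoragestagingeastus
```

Staging in eastus yields `atpstoragestagingeastus` (23 characters), which fits; longer environment/region pairs would fail validation and need an abbreviated region code.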
Environment Overlay Base Class:
// EnvironmentOverlay.cs
public abstract class EnvironmentOverlay
{
// App Service configuration
public abstract string AppServiceSku { get; }
public abstract string AppServiceTier { get; }
public abstract int AppServiceInstances { get; }
public abstract bool EnableAutoscale { get; }
// SQL Database configuration
public abstract string SqlSku { get; }
public abstract string SqlTier { get; }
public abstract int SqlCapacity { get; }
public abstract long SqlMaxSizeBytes { get; }
public abstract bool EnableZoneRedundancy { get; }
public abstract bool EnableGeoReplication { get; }
// Redis configuration
public abstract string RedisSku { get; }
public abstract string RedisFamily { get; }
public abstract int RedisCapacity { get; }
public abstract bool EnableRedisPersistence { get; }
// Service Bus configuration
public abstract string ServiceBusSku { get; }
public abstract string ServiceBusTier { get; }
// Storage configuration
public abstract string StorageReplication { get; }
public abstract bool EnableImmutability { get; }
// Networking configuration
public abstract bool EnablePublicNetworkAccess { get; }
public abstract bool EnablePrivateEndpoints { get; }
public abstract string VNetAddressSpace { get; }
// Observability configuration
public abstract int LogRetentionDays { get; }
public abstract double TelemetrySamplingPercentage { get; }
// Cost management
public abstract int MonthlyBudget { get; }
public abstract bool EnableAutoShutdown { get; }
}
Dev Overlay¶
Purpose: Cost-optimized infrastructure with minimal SKUs and public access for developer convenience.
// Overlays/DevOverlay.cs
public class DevOverlay : EnvironmentOverlay
{
// App Service: Basic tier, single instance
public override string AppServiceSku => "B1";
public override string AppServiceTier => "Basic";
public override int AppServiceInstances => 1;
public override bool EnableAutoscale => false;
// SQL Database: Basic tier, 5 DTU
public override string SqlSku => "Basic";
public override string SqlTier => "Basic";
public override int SqlCapacity => 5;
public override long SqlMaxSizeBytes => 2L * 1024 * 1024 * 1024; // 2 GB
public override bool EnableZoneRedundancy => false;
public override bool EnableGeoReplication => false;
// Redis: Basic C0 (250 MB)
public override string RedisSku => "Basic";
public override string RedisFamily => "C";
public override int RedisCapacity => 0;
public override bool EnableRedisPersistence => false;
// Service Bus: Basic tier
public override string ServiceBusSku => "Basic";
public override string ServiceBusTier => "Basic";
// Storage: LRS (Locally Redundant)
public override string StorageReplication => "Standard_LRS";
public override bool EnableImmutability => false;
// Networking: Public access enabled
public override bool EnablePublicNetworkAccess => true;
public override bool EnablePrivateEndpoints => false;
public override string VNetAddressSpace => "10.0.0.0/16"; // Shared VNet
// Observability: Short retention, verbose sampling
public override int LogRetentionDays => 7;
public override double TelemetrySamplingPercentage => 100.0; // 100% sampling
// Cost: $500/month budget
public override int MonthlyBudget => 500;
public override bool EnableAutoShutdown => true; // Shutdown weeknights/weekends
}
Dev Stack Configuration (Pulumi.dev.yaml):
config:
azure-native:location: eastus
atp-infrastructure:environment: dev
atp-infrastructure:region: eastus
atp-infrastructure:sqlAdminUser:
secure: AQAAANCMnd8BFdERjHoAwE/Cl+sBAAAA... # Encrypted
atp-infrastructure:sqlAdminPassword:
secure: AQAAANCMnd8BFdERjHoAwE/Cl+sBAAAA... # Encrypted
atp-infrastructure:enableAutoShutdown: true
atp-infrastructure:monthlyBudget: 500
Dev Deployment (Pulumi CLI):
#!/bin/bash
# deploy-dev-infrastructure.sh
echo "Deploying Dev infrastructure..."
# Select Dev stack
pulumi stack select connectsoft/atp/dev --create
# Set configuration
pulumi config set azure-native:location eastus
pulumi config set atp-infrastructure:environment dev
pulumi config set atp-infrastructure:region eastus
# Set secrets (from Key Vault)
pulumi config set --secret atp-infrastructure:sqlAdminUser devadmin
pulumi config set --secret atp-infrastructure:sqlAdminPassword $(az keyvault secret show --vault-name atp-keyvault-shared-eus --name DevSqlPassword --query value -o tsv)
# Deploy (preview first)
pulumi preview
# Confirm and deploy
pulumi up --yes
echo "✅ Dev infrastructure deployed"
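The Dev overlay enables auto-shutdown on weeknights and weekends. A hedged sketch of the schedule check such automation might apply (the exact window, 07:00-19:00 on weekdays, is an assumption rather than ATP's documented policy):

```shell
# Decide whether a Dev resource should be stopped, given a day of week
# (1=Mon .. 7=Sun) and an hour (0-23). The window itself is an assumed
# policy: stop outside 07:00-19:00 on weekdays and all weekend.
should_shutdown() {
  local dow="$1" hour="${2#0}"   # strip leading zero from e.g. "09"
  if [ "$dow" -ge 6 ]; then echo "shutdown"; return; fi   # weekend
  if [ "${hour:-0}" -lt 7 ] || [ "${hour:-0}" -ge 19 ]; then
    echo "shutdown"        # weeknight
  else
    echo "keep-running"    # working hours
  fi
}

should_shutdown 6 12   # → shutdown (Saturday)
should_shutdown 3 10   # → keep-running (Wednesday morning)
```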
Test Overlay¶
Purpose: QA-grade infrastructure with moderate SKUs and IP-restricted access for automated testing.
// Overlays/TestOverlay.cs
public class TestOverlay : EnvironmentOverlay
{
// App Service: Standard tier, 2 instances (blue-green testing)
public override string AppServiceSku => "S1";
public override string AppServiceTier => "Standard";
public override int AppServiceInstances => 2;
public override bool EnableAutoscale => false;
// SQL Database: Standard S1, 20 DTU
public override string SqlSku => "S1";
public override string SqlTier => "Standard";
public override int SqlCapacity => 20;
public override long SqlMaxSizeBytes => 10L * 1024 * 1024 * 1024; // 10 GB
public override bool EnableZoneRedundancy => false;
public override bool EnableGeoReplication => false;
// Redis: Standard C1 (1 GB)
public override string RedisSku => "Standard";
public override string RedisFamily => "C";
public override int RedisCapacity => 1;
public override bool EnableRedisPersistence => false; // persistence requires Premium tier
// Service Bus: Standard tier
public override string ServiceBusSku => "Standard";
public override string ServiceBusTier => "Standard";
// Storage: GRS (Geo-Redundant for DR testing)
public override string StorageReplication => "Standard_GRS";
public override bool EnableImmutability => false;
// Networking: IP-restricted public access
public override bool EnablePublicNetworkAccess => true;
public override bool EnablePrivateEndpoints => false;
public override string VNetAddressSpace => "10.0.0.0/16"; // Shared VNet
// Observability: Moderate retention, 50% sampling
public override int LogRetentionDays => 14;
public override double TelemetrySamplingPercentage => 50.0;
// Cost: $1,000/month budget
public override int MonthlyBudget => 1000;
public override bool EnableAutoShutdown => true; // Weekends only
}
Staging Overlay¶
Purpose: Production-equivalent infrastructure with premium SKUs, private endpoints, and geo-replication for pre-production validation.
// Overlays/StagingOverlay.cs
public class StagingOverlay : EnvironmentOverlay
{
// App Service: Premium tier, autoscale 2-5 instances
public override string AppServiceSku => "P1v2";
public override string AppServiceTier => "PremiumV2";
public override int AppServiceInstances => 2;
public override bool EnableAutoscale => true;
// SQL Database: Premium P2, 250 DTU
public override string SqlSku => "P2";
public override string SqlTier => "Premium";
public override int SqlCapacity => 250;
public override long SqlMaxSizeBytes => 100L * 1024 * 1024 * 1024; // 100 GB
public override bool EnableZoneRedundancy => false;
public override bool EnableGeoReplication => true; // Test failover
// Redis: Premium P1 (6 GB, clustering)
public override string RedisSku => "Premium";
public override string RedisFamily => "P";
public override int RedisCapacity => 1;
public override bool EnableRedisPersistence => true; // AOF
// Service Bus: Premium tier
public override string ServiceBusSku => "Premium";
public override string ServiceBusTier => "Premium";
// Storage: GZRS (Geo-Zone-Redundant)
public override string StorageReplication => "Standard_GZRS";
public override bool EnableImmutability => true; // Test WORM
// Networking: Private endpoints enabled
public override bool EnablePublicNetworkAccess => false;
public override bool EnablePrivateEndpoints => true;
public override string VNetAddressSpace => "10.1.0.0/16"; // Dedicated VNet
// Observability: Production-like retention, 25% sampling
public override int LogRetentionDays => 30;
public override double TelemetrySamplingPercentage => 25.0;
// Cost: $3,000/month budget
public override int MonthlyBudget => 3000;
public override bool EnableAutoShutdown => false; // Always-on
}
Staging Private Endpoints (Pulumi C#):
// Add private endpoints for Staging
if (overlay.EnablePrivateEndpoints)
{
var privateEndpointSubnet = new Subnet($"atp-pe-subnet-{environment}-{region}", new SubnetArgs
{
SubnetName = "PrivateEndpoints-Subnet",
ResourceGroupName = resourceGroup.Name,
VirtualNetworkName = vnet.Name,
AddressPrefix = "10.1.3.0/24",
PrivateEndpointNetworkPolicies = "Disabled"
});
// SQL Private Endpoint
var sqlPrivateEndpoint = new PrivateEndpoint($"atp-sql-pe-{environment}-{region}", new PrivateEndpointArgs
{
PrivateEndpointName = $"atp-sql-pe-{environment}-{region}",
ResourceGroupName = resourceGroup.Name,
Location = region,
Subnet = new Pulumi.AzureNative.Network.Inputs.SubnetArgs
{
Id = privateEndpointSubnet.Id
},
PrivateLinkServiceConnections = new[]
{
new Pulumi.AzureNative.Network.Inputs.PrivateLinkServiceConnectionArgs
{
Name = $"sql-connection",
PrivateLinkServiceId = sqlServer.Id,
GroupIds = new[] { "sqlServer" }
}
},
Tags = tags
});
// Storage Private Endpoint
var storagePrivateEndpoint = new PrivateEndpoint($"atp-storage-pe-{environment}-{region}", new PrivateEndpointArgs
{
PrivateEndpointName = $"atp-storage-pe-{environment}-{region}",
ResourceGroupName = resourceGroup.Name,
Location = region,
Subnet = new Pulumi.AzureNative.Network.Inputs.SubnetArgs
{
Id = privateEndpointSubnet.Id
},
PrivateLinkServiceConnections = new[]
{
new Pulumi.AzureNative.Network.Inputs.PrivateLinkServiceConnectionArgs
{
Name = $"storage-connection",
PrivateLinkServiceId = storageAccount.Id,
GroupIds = new[] { "blob" }
}
},
Tags = tags
});
}
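Once the private endpoints are in place, a common smoke test is to confirm that service hostnames resolve to addresses inside the dedicated Staging VNet (10.1.0.0/16). The CIDR membership check can be sketched in pure shell; DNS resolution itself is omitted here:

```shell
# Verify an IPv4 address sits inside a CIDR range, e.g. to check that a
# Staging hostname resolves into the dedicated VNet (10.1.0.0/16).
ip_to_int() {
  local IFS=.
  set -- $1
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

ip_in_cidr() {
  local ip="$1" net="${2%/*}" bits="${2#*/}"
  local mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$ip") & mask )) -eq $(( $(ip_to_int "$net") & mask )) ]
}

ip_in_cidr 10.1.3.25 10.1.0.0/16 && echo "inside staging VNet"   # → inside staging VNet
```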
Production Overlay¶
Purpose: Enterprise-grade infrastructure with AKS, zone redundancy, multi-region, and maximum security.
// Overlays/ProductionOverlay.cs
public class ProductionOverlay : EnvironmentOverlay
{
// App Service: Premium tier (or AKS for containerized)
public override string AppServiceSku => "P3v3";
public override string AppServiceTier => "PremiumV3";
public override int AppServiceInstances => 3;
public override bool EnableAutoscale => true; // 3-10 instances
// SQL Database: Premium P6, 1000 DTU
public override string SqlSku => "P6";
public override string SqlTier => "Premium";
public override int SqlCapacity => 1000;
public override long SqlMaxSizeBytes => 500L * 1024 * 1024 * 1024; // 500 GB
public override bool EnableZoneRedundancy => true;
public override bool EnableGeoReplication => true; // Multi-region
// Redis: Premium P3 (26 GB, clustering)
public override string RedisSku => "Premium";
public override string RedisFamily => "P";
public override int RedisCapacity => 3;
public override bool EnableRedisPersistence => true; // AOF persistence
// Service Bus: Premium tier
public override string ServiceBusSku => "Premium";
public override string ServiceBusTier => "Premium";
// Storage: GZRS with WORM
public override string StorageReplication => "Standard_GZRS";
public override bool EnableImmutability => true;
// Networking: Private endpoints only
public override bool EnablePublicNetworkAccess => false;
public override bool EnablePrivateEndpoints => true;
public override string VNetAddressSpace => "10.2.0.0/16"; // Dedicated VNet
// Observability: Long retention, 10% sampling
public override int LogRetentionDays => 90;
public override double TelemetrySamplingPercentage => 10.0;
// Cost: $10,000/month budget
public override int MonthlyBudget => 10000;
public override bool EnableAutoShutdown => false; // Always-on
}
Production AKS Cluster (Pulumi C#):
// Production uses AKS instead of App Service
if (environment == "prod")
{
var aksCluster = new ManagedCluster($"atp-aks-{environment}-{region}", new ManagedClusterArgs
{
ResourceName = $"atp-aks-{environment}-{region}",
ResourceGroupName = resourceGroup.Name,
Location = region,
// Identity
Identity = new ManagedClusterIdentityArgs
{
Type = ResourceIdentityType.SystemAssigned
},
// Kubernetes version
KubernetesVersion = "1.28.3",
// DNS prefix
DnsPrefix = $"atp-{environment}",
// Node pools
AgentPoolProfiles = new[]
{
new ManagedClusterAgentPoolProfileArgs
{
Name = "system",
Count = 3,
VmSize = "Standard_D4s_v5", // 4 vCPU, 16 GB RAM
Mode = "System",
OsType = "Linux",
OsDiskSizeGB = 128,
VnetSubnetID = aksSubnet.Id,
EnableAutoScaling = true,
MinCount = 3,
MaxCount = 10,
AvailabilityZones = new[] { "1", "2", "3" } // Zone-redundant
},
new ManagedClusterAgentPoolProfileArgs
{
Name = "user",
Count = 6,
VmSize = "Standard_D4s_v5",
Mode = "User",
OsType = "Linux",
OsDiskSizeGB = 128,
VnetSubnetID = aksSubnet.Id,
EnableAutoScaling = true,
MinCount = 6,
MaxCount = 20,
AvailabilityZones = new[] { "1", "2", "3" }
}
},
// Networking
NetworkProfile = new ContainerServiceNetworkProfileArgs
{
NetworkPlugin = "azure",
NetworkPolicy = "azure",
LoadBalancerSku = "Standard",
ServiceCidr = "10.2.10.0/24",
DnsServiceIP = "10.2.10.10"
},
// Add-ons
AddonProfiles = new InputMap<ManagedClusterAddonProfileArgs>
{
["azureKeyvaultSecretsProvider"] = new ManagedClusterAddonProfileArgs
{
Enabled = true,
Config = new InputMap<string>
{
["enableSecretRotation"] = "true",
["rotationPollInterval"] = "2m"
}
},
["omsagent"] = new ManagedClusterAddonProfileArgs // Container Insights
{
Enabled = true,
Config = new InputMap<string>
{
["logAnalyticsWorkspaceResourceID"] = logAnalyticsWorkspace.Id
}
}
},
// Security
AadProfile = new ManagedClusterAADProfileArgs
{
Managed = true,
EnableAzureRBAC = true
},
Tags = tags
});
this.AksClusterName = aksCluster.Name;
}
Bicep Alternative (Base Template)¶
Purpose: Azure-native declarative IaC for teams preferring ARM/Bicep over Pulumi.
// main.bicep - Base infrastructure template
@description('Environment name (dev, test, staging, prod)')
param environment string
@description('Azure region')
param location string = resourceGroup().location
@description('App Service SKU')
param appServiceSku string
@description('SQL Database SKU')
param sqlSku string
@description('Redis SKU')
param redisSku string
@description('Storage replication')
param storageReplication string
@description('Enable private endpoints')
param enablePrivateEndpoints bool = false
// Variables
var resourcePrefix = 'atp'
var regionAbbr = location == 'eastus' ? 'eus' : (location == 'westeurope' ? 'weu' : 'apse')
var commonTags = {
Environment: environment
Project: 'ATP'
ManagedBy: 'bicep'
Owner: 'platform-team@connectsoft.example'
}
// App Service Plan
resource appServicePlan 'Microsoft.Web/serverfarms@2022-09-01' = {
name: '${resourcePrefix}-plan-${environment}-${regionAbbr}'
location: location
kind: 'linux'
sku: {
name: appServiceSku
}
properties: {
reserved: true
}
tags: commonTags
}
// SQL Server
resource sqlServer 'Microsoft.Sql/servers@2023-05-01-preview' = {
name: '${resourcePrefix}-sql-${environment}-${regionAbbr}'
location: location
properties: {
administratorLogin: 'sqladmin'
administratorLoginPassword: '<managed-by-keyvault>'
version: '12.0'
publicNetworkAccess: enablePrivateEndpoints ? 'Disabled' : 'Enabled'
}
tags: commonTags
}
// SQL Database
resource database 'Microsoft.Sql/servers/databases@2023-05-01-preview' = {
parent: sqlServer
name: 'ATP_${environment}'
location: location
sku: {
name: sqlSku
}
properties: {
collation: 'SQL_Latin1_General_CP1_CI_AS'
maxSizeBytes: 2147483648
}
tags: commonTags
}
// Redis Cache
resource redis 'Microsoft.Cache/redis@2023-08-01' = {
name: '${resourcePrefix}-redis-${environment}-${regionAbbr}'
location: location
properties: {
sku: {
name: redisSku
family: redisSku == 'Premium' ? 'P' : 'C'
capacity: redisSku == 'Basic' ? 0 : 1
}
enableNonSslPort: false
minimumTlsVersion: '1.2'
}
tags: commonTags
}
// Storage Account
resource storageAccount 'Microsoft.Storage/storageAccounts@2023-01-01' = {
name: '${resourcePrefix}storage${environment}${regionAbbr}'
location: location
sku: {
name: storageReplication
}
kind: 'StorageV2'
properties: {
supportsHttpsTrafficOnly: true
minimumTlsVersion: 'TLS1_2'
allowBlobPublicAccess: false
}
tags: commonTags
}
// Application Insights
resource appInsights 'Microsoft.Insights/components@2020-02-02' = {
name: '${resourcePrefix}-appinsights-${environment}-${regionAbbr}'
location: location
kind: 'web'
properties: {
Application_Type: 'web'
RetentionInDays: environment == 'prod' ? 90 : 30
SamplingPercentage: environment == 'prod' ? 10 : 100
}
tags: commonTags
}
// Outputs
output resourceGroupName string = resourceGroup().name
output appServicePlanId string = appServicePlan.id
output sqlServerName string = sqlServer.name
output appInsightsInstrumentationKey string = appInsights.properties.InstrumentationKey
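The nested ternary computing regionAbbr grows unwieldy as regions are added. The same mapping as a standalone helper, usable from naming or deployment scripts (the `apse` fallback mirrors the template above):

```shell
# Map a full Azure region name to the short code used in resource names,
# mirroring the regionAbbr ternary in main.bicep. Unknown regions fall
# back to 'apse', exactly as the template does.
region_abbr() {
  case "$1" in
    eastus)     echo "eus" ;;
    westeurope) echo "weu" ;;
    *)          echo "apse" ;;
  esac
}

region_abbr eastus       # → eus
region_abbr westeurope   # → weu
```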
Bicep Parameter Files (Per Environment):
// overlays/dev.bicepparam
using './main.bicep'
param environment = 'dev'
param location = 'eastus'
param appServiceSku = 'B1'
param sqlSku = 'Basic'
param redisSku = 'Basic'
param storageReplication = 'Standard_LRS'
param enablePrivateEndpoints = false
// overlays/prod.bicepparam
using './main.bicep'
param environment = 'prod'
param location = 'eastus'
param appServiceSku = 'P3v3'
param sqlSku = 'P6'
param redisSku = 'Premium'
param storageReplication = 'Standard_GZRS'
param enablePrivateEndpoints = true
Bicep Deployment (Azure CLI):
#!/bin/bash
# deploy-bicep-infrastructure.sh
ENVIRONMENT=$1
REGION=${2:-eastus}
echo "Deploying $ENVIRONMENT infrastructure using Bicep..."
# Create resource group
az group create \
--name "ConnectSoft-ATP-${ENVIRONMENT^^}-${REGION^^}-RG" \
--location $REGION
# Deploy with environment-specific parameters
az deployment group create \
--resource-group "ConnectSoft-ATP-${ENVIRONMENT^^}-${REGION^^}-RG" \
--template-file bicep/main.bicep \
--parameters bicep/overlays/$ENVIRONMENT.bicepparam \
--parameters location=$REGION
echo "✅ $ENVIRONMENT infrastructure deployed via Bicep"
IaC Deployment Workflow (Azure Pipelines)¶
Purpose: Automate infrastructure provisioning via CI/CD pipelines with validation, preview, and approval gates.
# infrastructure-pipeline.yaml
name: Infrastructure-${{ parameters.environment }}-$(Date:yyyyMMdd)$(Rev:.r)
parameters:
- name: environment
displayName: 'Target Environment'
type: string
values:
- dev
- test
- staging
- prod
- name: action
displayName: 'Deployment Action'
type: string
default: 'preview'
values:
- preview
- deploy
- destroy
trigger: none # Manual trigger only
pool:
vmImage: 'ubuntu-latest'
stages:
- stage: Validate_IaC
displayName: 'Validate Infrastructure Code'
jobs:
- job: Validate
steps:
- task: UseDotNet@2
inputs:
version: '8.x'
- script: |
# Install Pulumi
curl -fsSL https://get.pulumi.com | sh
export PATH=$PATH:$HOME/.pulumi/bin
# Select stack
pulumi stack select connectsoft/atp/${{ parameters.environment }}
# Validate configuration
pulumi config
# Run preview
pulumi preview --non-interactive
displayName: 'Pulumi Preview'
env:
PULUMI_ACCESS_TOKEN: $(PulumiAccessToken)
AZURE_CLIENT_ID: $(AzureClientId)
AZURE_CLIENT_SECRET: $(AzureClientSecret)
AZURE_TENANT_ID: $(AzureTenantId)
- stage: Deploy_Infrastructure
displayName: 'Deploy Infrastructure to ${{ parameters.environment }}'
dependsOn: Validate_IaC
condition: eq('${{ parameters.action }}', 'deploy')
jobs:
- deployment: DeployInfrastructure
environment: ATP-Infrastructure-${{ parameters.environment }}
strategy:
runOnce:
deploy:
steps:
- script: |
# Install Pulumi
curl -fsSL https://get.pulumi.com | sh
export PATH=$PATH:$HOME/.pulumi/bin
# Select stack
pulumi stack select connectsoft/atp/${{ parameters.environment }}
# Deploy infrastructure
pulumi up --yes --non-interactive --skip-preview
# Export outputs
pulumi stack output --json > infrastructure-outputs.json
displayName: 'Pulumi Deploy'
env:
PULUMI_ACCESS_TOKEN: $(PulumiAccessToken)
AZURE_CLIENT_ID: $(AzureClientId)
AZURE_CLIENT_SECRET: $(AzureClientSecret)
AZURE_TENANT_ID: $(AzureTenantId)
- task: PublishBuildArtifacts@1
inputs:
PathtoPublish: 'infrastructure-outputs.json'
ArtifactName: 'infrastructure-outputs-${{ parameters.environment }}'
- script: |
# Verify deployment (each script step runs in a fresh shell, so
# re-add Pulumi to PATH and reselect the stack before reading outputs)
export PATH=$PATH:$HOME/.pulumi/bin
pulumi stack select connectsoft/atp/${{ parameters.environment }}
RESOURCE_GROUP=$(pulumi stack output ResourceGroupName)
RESOURCE_COUNT=$(az resource list \
--resource-group $RESOURCE_GROUP \
--query "length([])")
echo "Resources deployed: $RESOURCE_COUNT"
if [ "$RESOURCE_COUNT" -lt 10 ]; then
echo "❌ Expected at least 10 resources"
exit 1
fi
echo "✅ Infrastructure deployment verified"
displayName: 'Verify Deployment'
Configuration Drift Detection¶
Purpose: Detect and alert on manual changes to infrastructure that deviate from IaC definitions.
Drift Detection Script (Pulumi):
#!/bin/bash
# detect-drift.sh
ENVIRONMENT=$1
echo "Detecting infrastructure drift for $ENVIRONMENT..."
# Select stack
pulumi stack select connectsoft/atp/$ENVIRONMENT
# Refresh state from Azure
pulumi refresh --yes --non-interactive
# Preview to detect drift
DRIFT_OUTPUT=$(pulumi preview --diff --non-interactive 2>&1)
# Check if drift detected
if echo "$DRIFT_OUTPUT" | grep -q "~ update"; then
echo "⚠️ DRIFT DETECTED in $ENVIRONMENT"
echo "$DRIFT_OUTPUT"
# Create work item for drift resolution
az boards work-item create \
--title "Infrastructure Drift Detected: $ENVIRONMENT" \
--type "Task" \
--description "Drift detected in $ENVIRONMENT infrastructure. Review and resolve.\n\n$DRIFT_OUTPUT" \
--assigned-to "platform-team@connectsoft.example" \
--area "ATP/Infrastructure"
exit 1
elif echo "$DRIFT_OUTPUT" | grep -q "no changes"; then
echo "✅ No drift detected; infrastructure matches code"
else
echo "⚠️ Unable to determine drift status"
fi
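The grep-based classification above can be factored into a small function that is easy to test in isolation; the markers matched (`~ update`, `no changes`) are the same strings the drift script greps for in Pulumi's human-readable preview output:

```shell
# Classify captured `pulumi preview --diff` output into a drift status.
# Matches the same markers the drift-detection script above relies on.
classify_drift() {
  local output="$1"
  if echo "$output" | grep -q "~ update"; then
    echo "drift"
  elif echo "$output" | grep -q "no changes"; then
    echo "clean"
  else
    echo "unknown"
  fi
}

classify_drift "Resources: no changes"   # → clean
```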
Scheduled Drift Detection (Azure DevOps):
# drift-detection-pipeline.yaml
trigger: none
schedules:
- cron: "0 6 * * *" # Daily at 6 AM
displayName: Daily Drift Detection
branches:
include:
- main
always: true
jobs:
- job: DetectDrift
pool:
vmImage: 'ubuntu-latest'
strategy:
matrix:
dev:
environment: dev
test:
environment: test
staging:
environment: staging
prod:
environment: prod
steps:
- script: |
# Detect drift
./scripts/detect-drift.sh $(environment)
displayName: 'Detect Drift: $(environment)'
continueOnError: true # Don't fail pipeline; just alert
GitOps for Configuration Management¶
Purpose: Manage application configuration (feature flags, connection strings) via Git with automated sync to Azure App Configuration.
GitOps Workflow:
flowchart LR
A[Update config/prod.yaml] --> B[Commit to main]
B --> C[GitHub Action Triggered]
C --> D[Validate Config Schema]
D --> E{Valid?}
E -->|No| F[Fail CI]
E -->|Yes| G[Sync to Azure App Configuration]
G --> H[Services Auto-Refresh]
style F fill:#FF6347
style H fill:#90EE90
Configuration Repository (config/prod.yaml):
# config/prod.yaml - GitOps-managed configuration
environment: prod
region: eastus
featureFlags:
TamperEvidenceV2:
enabled: true
description: "V2 tamper evidence with Merkle trees"
AIAssistedAnomalyDetection:
enabled: true
targetingRules:
- name: PercentageRollout
percentage: 10
description: "AI-based anomaly detection (canary rollout)"
appSettings:
Audit:
MaxBatchSize: 10000
SealInterval: "PT15M"
RateLimiting:
PermitLimit: 100
Window: 60
OpenTelemetry:
SamplingRatio: 0.1
ExportIntervalSeconds: 60
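Before this file is synced, the workflow validates it against a JSON Schema. As a lighter-weight illustration, the required top-level keys (taken from the example above) can be checked directly in shell:

```shell
# Pre-sync sanity check: required top-level keys must exist in the
# GitOps config file before it is imported into App Configuration.
check_config_keys() {
  local file="$1" key missing=0
  for key in environment region featureFlags appSettings; do
    if ! grep -Eq "^${key}:" "$file"; then
      echo "missing key: $key" >&2
      missing=1
    fi
  done
  return $missing
}

sample=$(mktemp)
cat > "$sample" <<'EOF'
environment: prod
region: eastus
featureFlags:
appSettings:
EOF
check_config_keys "$sample" && echo "config OK"   # → config OK
```

A full JSON Schema validation (as in the workflow below) remains the authoritative gate; this check only catches structurally missing sections early.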
GitOps Sync Script (GitHub Action):
# .github/workflows/sync-config.yaml
name: Sync Configuration to Azure App Config
on:
push:
branches: [main]
paths:
- 'config/prod.yaml'
- 'config/staging.yaml'
jobs:
sync-config:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Validate Configuration Schema
run: |
# JSON Schema validation
npx ajv-cli validate \
-s schemas/config-schema.json \
-d config/prod.yaml
- name: Sync to Azure App Configuration
run: |
# Install Azure CLI
curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
# Login with service principal
az login --service-principal \
-u ${{ secrets.AZURE_CLIENT_ID }} \
-p ${{ secrets.AZURE_CLIENT_SECRET }} \
--tenant ${{ secrets.AZURE_TENANT_ID }}
# Sync configuration
az appconfig kv import \
--name atp-appconfig-prod-eus \
--source file \
--path config/prod.yaml \
--format yaml \
--label prod \
--yes
# Update sentinel key (triggers app refresh)
az appconfig kv set \
--name atp-appconfig-prod-eus \
--key Sentinel \
--value "$(date +%s)" \
--label prod \
--yes
echo "✅ Configuration synced to Azure App Configuration"
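The sentinel update at the end of the workflow is what triggers reloads: services cache their configuration and poll only the `Sentinel` key, refreshing everything when its value changes. The client-side comparison is essentially:

```shell
# Client-side sentinel check: reload configuration only when the
# sentinel value differs from the cached one. Polling a single key
# is far cheaper than re-reading the whole configuration set.
check_sentinel() {
  local cached="$1" current="$2"
  if [ "$cached" != "$current" ]; then
    echo "refresh-configuration"
  else
    echo "no-op"
  fi
}

check_sentinel 1717000000 1717000060   # → refresh-configuration
check_sentinel 1717000060 1717000060   # → no-op
```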
Environment Comparison (IaC Configurations)¶
| Resource | Dev | Test | Staging | Production |
|---|---|---|---|---|
| App Service SKU | B1 (Basic) | S1 (Standard) | P1v2 (Premium) | P3v3 (Premium) or AKS |
| App Service Instances | 1 | 2 | 2-5 (autoscale) | 3-10 (autoscale) or AKS nodes |
| SQL SKU | Basic (5 DTU) | S1 (20 DTU) | P2 (250 DTU) | P6 (1000 DTU) |
| SQL Max Size | 2 GB | 10 GB | 100 GB | 500 GB |
| SQL Zone Redundancy | No | No | No | Yes |
| SQL Geo-Replication | No | No | Yes | Yes (multi-region) |
| Redis SKU | Basic C0 (250 MB) | Standard C1 (1 GB) | Premium P1 (6 GB) | Premium P3 (26 GB) |
| Redis Clustering | No | No | Yes | Yes |
| Redis Persistence | No | No | AOF | AOF |
| Service Bus SKU | Basic | Standard | Premium | Premium |
| Storage Replication | LRS | GRS | GZRS | GZRS |
| Storage Immutability | No | No | Yes (time-based) | Yes (WORM locked) |
| VNet | Shared (10.0.0.0/16) | Shared (10.0.0.0/16) | Dedicated (10.1.0.0/16) | Dedicated (10.2.0.0/16) |
| Private Endpoints | No | No | Yes | Yes |
| Public Network Access | Yes | Yes (IP-restricted) | No | No |
| Log Retention | 7 days | 14 days | 30 days | 90 days + archive |
| Telemetry Sampling | 100% | 50% | 25% | 10% |
| Monthly Budget | $500 | $1,000 | $3,000 | $10,000 |
IaC Best Practices¶
Security:
- Secrets in Key Vault: Never hardcode secrets in IaC; reference Key Vault.
- Least Privilege: Grant minimal RBAC roles for IaC service principals.
- State File Security: Encrypt Pulumi state files; restrict access to state storage.
- Private Endpoints: Use private endpoints for Production/Staging (no public access).
Maintainability:
- DRY Principle: Use overlays to avoid duplicating base infrastructure across environments.
- Version Control: Tag IaC releases; pin Production to stable tags.
- Documentation: Comment complex resources; link to ADRs for architectural decisions.
- Modular Design: Separate concerns (compute, storage, networking) into reusable modules.
Operational:
- Preview Before Deploy: Always run `pulumi preview` or `az deployment group validate` before applying changes.
- Drift Detection: Schedule daily drift detection; alert on manual changes.
- Incremental Updates: Use `pulumi up` for incremental updates (not `pulumi destroy` + recreate).
- Backup State: Back up Pulumi state regularly; test state restore procedures.
Cost Optimization:
- Right-Sizing: Use overlays to assign appropriate SKUs per environment (Dev: cheapest, Prod: optimized for performance).
- Auto-Shutdown: Enable auto-shutdown for Dev/Test (nights/weekends).
- Reserved Instances: Use reserved instances for Production (3-year commitment for maximum savings).
- Storage Tiers: Use lifecycle policies (Hot → Cool → Archive) for Production.
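To give a rough sense of what auto-shutdown buys: the savings are just the fraction of hours the environment no longer runs. A small illustrative Python sketch, assuming simple pay-per-hour compute (real Azure billing varies by SKU, reservations, and storage that keeps accruing while stopped):

```python
def autoshutdown_savings(hours_per_weekday: int = 12, weekdays: int = 5) -> float:
    """Fraction of compute cost saved by stopping Dev/Test nights and weekends.

    Illustrative only: assumes linear pay-per-hour compute pricing.
    """
    running_hours = hours_per_weekday * weekdays  # hours the environment is up per week
    always_on_hours = 24 * 7                      # hours in a full week
    return 1 - running_hours / always_on_hours

# Running 12 h on weekdays only saves roughly 64% versus running 24/7.
print(round(autoshutdown_savings(), 2))
```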
Multi-Region Deployment (Production)¶
Purpose: Deploy Production infrastructure to multiple Azure regions for high availability and disaster recovery.
// Production multi-region deployment (DeployRegionalInfrastructure / ConfigureGeoReplication helpers elided for brevity)
class ProductionMultiRegionStack : Stack
{
public ProductionMultiRegionStack()
{
var config = new Config();
var primaryRegion = config.Get("primaryRegion") ?? "eastus";
var secondaryRegion = config.Get("secondaryRegion") ?? "westeurope";
// Deploy to primary region
var primaryInfra = DeployRegionalInfrastructure(primaryRegion, isPrimary: true);
// Deploy to secondary region
var secondaryInfra = DeployRegionalInfrastructure(secondaryRegion, isPrimary: false);
// Configure geo-replication
ConfigureGeoReplication(primaryInfra, secondaryInfra);
// Configure Traffic Manager (multi-region routing)
var trafficManager = new Profile("atp-traffic-manager-prod", new ProfileArgs
{
ProfileName = "atp-prod",
ResourceGroupName = primaryInfra.ResourceGroupName,
TrafficRoutingMethod = "Performance", // Route each user to the nearest healthy region
Endpoints = new[]
{
new EndpointArgs
{
Name = "primary-eastus",
Type = "azureEndpoints",
TargetResourceId = primaryInfra.AppGatewayId,
Priority = 1,
Weight = 80 // Priority/Weight apply only under Priority/Weighted routing; ignored by Performance
},
new EndpointArgs
{
Name = "secondary-westeurope",
Type = "azureEndpoints",
TargetResourceId = secondaryInfra.AppGatewayId,
Priority = 2,
Weight = 20 // Ignored under Performance routing; retained in case of a routing-method switch
}
}
});
}
}
Summary¶
- IaC Tools: Pulumi (C# preferred) and Bicep (Azure-native alternative) for infrastructure provisioning and management.
- Overlay Pattern: Base infrastructure + environment-specific overlays enable consistent resources with graduated controls.
- Environment Overlays: Dev (cost-optimized), Test (QA-grade), Staging (production-equivalent), Production (enterprise-grade with AKS).
- Deployment Workflows: Automated CI/CD pipelines with validation, preview, and approval gates for infrastructure changes.
- Drift Detection: Daily automated drift detection with alerts and work item creation for manual changes.
- GitOps: Configuration management via Git with automated sync to Azure App Configuration.
- Multi-Region: Production deploys to multiple regions with Traffic Manager routing and geo-replication.
Feature Flags & Runtime Configuration¶
ATP leverages Azure App Configuration for dynamic feature management and runtime configuration changes without requiring code redeployment. Feature flags enable gradual rollouts, A/B testing, kill switches, and environment-specific feature enablement, supporting safe experimentation in lower environments and controlled production releases.
This strategy ensures feature toggles are managed centrally, flag states are audited, and rollback is instantaneous (toggle flag off) without deploying previous code versions.
Azure App Configuration Per Environment¶
Each environment has tailored feature flag policies balancing innovation (Dev: all features on) with stability (Production: conservative rollouts).
Dev Environment Feature Flags¶
Purpose: Enable all features including experimental capabilities for rapid development and integration testing.
Configuration (atp-appconfig-dev-eus):
{
"featureManagement": {
"TamperEvidenceV2": true,
"AdvancedQueryFilters": true,
"AIAssistedAnomalyDetection": true,
"ExperimentalFeatures": true,
"PerformanceOptimizations": true,
"NewExportFormats": true,
"BlockchainAnchoring": true
},
"appSettings": {
"Audit:MaxBatchSize": 100,
"Audit:SealInterval": "PT24H",
"RateLimiting:Enabled": false,
"Caching:DefaultSlidingExpiration": "00:01:00"
}
}
Feature Flag Policy (Dev):
- All Features: Enabled by default (including experimental).
- No Targeting: No user/tenant filters; everyone gets all features.
- No Time Windows: Features always available.
- Purpose: Test feature interactions; validate new capabilities early.
Test Environment Feature Flags¶
Purpose: Stable features only with integration test-specific flags for automated validation.
Configuration (atp-appconfig-test-eus):
{
"featureManagement": {
"TamperEvidenceV2": true,
"AdvancedQueryFilters": true,
"AIAssistedAnomalyDetection": false, // Not stable yet
"ExperimentalFeatures": false, // Never in Test
"PerformanceOptimizations": true,
"NewExportFormats": {
"EnabledFor": [
{
"Name": "TargetingFilter",
"Parameters": {
"Audience": {
"Users": ["test-tenant-001", "test-tenant-002"]
}
}
}
]
}
},
"appSettings": {
"Audit:MaxBatchSize": 1000,
"Audit:SealInterval": "PT1H",
"RateLimiting:Enabled": true,
"RateLimiting:PermitLimit": 1000,
"Caching:DefaultSlidingExpiration": "00:05:00"
}
}
Feature Flag Policy (Test):
- Stable Features: Enabled for regression testing.
- Beta Features: Disabled (not ready for QA validation).
- Integration Test Flags: Targeting specific test tenants for feature validation.
- Purpose: Validate features with predictable stable behavior.
Staging Environment Feature Flags¶
Purpose: Production feature set with canary flags for validating new features at scale before Production rollout.
Configuration (atp-appconfig-staging-eus):
{
"featureManagement": {
"TamperEvidenceV2": true,
"AdvancedQueryFilters": true,
"AIAssistedAnomalyDetection": {
"EnabledFor": [
{
"Name": "Percentage",
"Parameters": {
"Value": 50
}
}
]
},
"ExperimentalFeatures": false,
"NewExportFormats": {
"EnabledFor": [
{
"Name": "TargetingFilter",
"Parameters": {
"Audience": {
"Users": ["staging-tenant-001", "staging-tenant-003"],
"Groups": ["beta-testers"]
}
}
}
]
},
"BlockchainAnchoring": {
"EnabledFor": [
{
"Name": "TimeWindow",
"Parameters": {
"Start": "2025-11-01T00:00:00Z",
"End": "2025-11-30T23:59:59Z"
}
}
]
}
},
"appSettings": {
"Audit:MaxBatchSize": 10000,
"Audit:SealInterval": "PT15M",
"RateLimiting:Enabled": true,
"RateLimiting:PermitLimit": 500,
"Caching:DefaultSlidingExpiration": "00:15:00"
}
}
Feature Flag Policy (Staging):
- Production Features: Enabled (mirrors production).
- Canary Features: 50% rollout for load testing impact.
- Beta Features: Targeted rollout to specific tenants for acceptance validation.
- Time-Windowed Features: Test time-based feature activation for planned releases.
- Purpose: Validate production feature configuration; test rollout strategies.
Production Environment Feature Flags¶
Purpose: Stable features only with conservative gradual rollouts and targeting for early access tenants.
Configuration (atp-appconfig-prod-eus):
{
"featureManagement": {
"TamperEvidenceV2": true, // Fully rolled out
"AdvancedQueryFilters": true, // Fully rolled out
"AIAssistedAnomalyDetection": {
"EnabledFor": [
{
"Name": "Percentage",
"Parameters": {
"Value": 10,
"Seed": "consistent-seed-123"
}
}
]
},
"ExperimentalFeatures": false, // Never in Production
"NewExportFormats": {
"EnabledFor": [
{
"Name": "TargetingFilter",
"Parameters": {
"Audience": {
"Users": ["tenant-12345", "tenant-67890"],
"Groups": ["early-access", "enterprise-tier"],
"DefaultRolloutPercentage": 5
}
}
}
]
},
"PerformanceOptimizations": {
"EnabledFor": [
{
"Name": "Percentage",
"Parameters": {
"Value": 100 // Fully rolled out after successful canary
}
}
]
}
},
"appSettings": {
"Audit:MaxBatchSize": 10000,
"Audit:SealInterval": "PT15M",
"RateLimiting:Enabled": true,
"RateLimiting:PermitLimit": 100,
"RateLimiting:Window": 60,
"Caching:DefaultSlidingExpiration": "00:15:00",
"Caching:DistributedCache": true,
"OpenTelemetry:SamplingRatio": 0.1
}
}
Feature Flag Policy (Production):
- Stable Features: Fully enabled (100% rollout).
- New Features: Conservative percentage rollout (5-10% initially).
- Beta Features: Targeting-based rollout to early access tenants only.
- Experimental Features: Absolutely prohibited.
- Purpose: Minimize production risk; enable data-driven rollout decisions.
Feature Flag Patterns & Filters¶
ATP uses Microsoft Feature Management library with five filter types for sophisticated feature control.
Boolean Filter (Simple On/Off)¶
Usage: Feature is either enabled or disabled for all users.
C# Implementation:
// Check if feature enabled
if (await _featureManager.IsEnabledAsync("TamperEvidenceV2"))
{
return await RecordWithTamperEvidenceV2Async(auditEvent);
}
else
{
return await RecordWithLegacyTamperEvidenceAsync(auditEvent);
}
Percentage Filter (Gradual Rollout)¶
Usage: Enable feature for random percentage of users/requests (canary deployments, A/B testing).
{
"featureManagement": {
"AIAssistedAnomalyDetection": {
"EnabledFor": [
{
"Name": "Percentage",
"Parameters": {
"Value": 10,
"Seed": "consistent-seed-123" // Deterministic hashing
}
}
]
}
}
}
C# Implementation:
// Percentage filter implementation (deterministic per tenant)
public class PercentageFilter : IFeatureFilter
{
private readonly IHttpContextAccessor _httpContextAccessor;
public PercentageFilter(IHttpContextAccessor httpContextAccessor)
{
_httpContextAccessor = httpContextAccessor;
}
public Task<bool> EvaluateAsync(FeatureFilterEvaluationContext context)
{
var parameters = context.Parameters.Get<PercentageFilterSettings>();
// Get stable identifier (tenant ID for consistent results)
var tenantId = _httpContextAccessor.HttpContext?.User?.FindFirst("tenantId")?.Value;
if (string.IsNullOrEmpty(tenantId))
{
return Task.FromResult(false);
}
// Deterministic hash (same tenant always gets same result)
var hash = ComputeHash($"{tenantId}{parameters.Seed}");
var percentage = Math.Abs(hash) % 100;
var enabled = percentage < parameters.Value;
return Task.FromResult(enabled);
}
// Stable hash helper; string.GetHashCode is not stable across processes
private static int ComputeHash(string input) =>
BitConverter.ToInt32(System.Security.Cryptography.SHA256.HashData(System.Text.Encoding.UTF8.GetBytes(input)), 0);
}
Rollout Strategy (Production):
Week 1: 5% (50 tenants out of 1000)
Week 2: 10% (100 tenants)
Week 3: 25% (250 tenants)
Week 4: 50% (500 tenants)
Week 5: 100% (all tenants)
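The staged schedule above only works because the percentage filter is deterministic: a tenant admitted at 5% must stay admitted at every later stage, so the rollout grows monotonically instead of reshuffling tenants each week. A language-agnostic Python sketch of the seed-based bucketing (the actual ATP filter is the C# `PercentageFilter` above; names here are illustrative):

```python
import hashlib

def rollout_bucket(tenant_id: str, seed: str) -> int:
    """Deterministic 0-99 bucket from tenant id + seed.

    Uses SHA-256 so the bucket is stable across processes and machines,
    unlike language-native string hashing.
    """
    digest = hashlib.sha256(f"{tenant_id}{seed}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % 100

def is_enabled(tenant_id: str, seed: str, percentage: int) -> bool:
    """Tenant is in the rollout when its bucket falls below the percentage."""
    return rollout_bucket(tenant_id, seed) < percentage

# A tenant admitted at an early stage stays admitted at every later stage:
stages = (5, 10, 25, 50, 100)
decisions = [is_enabled("tenant-12345", "consistent-seed-123", p) for p in stages]
assert decisions == sorted(decisions)  # once on, never off as the percentage grows
```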
Targeting Filter (User/Tenant-Specific)¶
Usage: Enable feature for specific users, tenants, or groups (early access programs, beta testing).
{
"featureManagement": {
"NewExportFormats": {
"EnabledFor": [
{
"Name": "TargetingFilter",
"Parameters": {
"Audience": {
"Users": [
"tenant-12345",
"tenant-67890",
"tenant-11111"
],
"Groups": [
"early-access",
"enterprise-tier",
"beta-testers"
],
"DefaultRolloutPercentage": 5,
"Exclusion": {
"Users": ["tenant-99999"],
"Groups": ["opt-out"]
}
}
}
}
]
}
}
}
C# Implementation:
// Targeting filter implementation
public class TargetingFilter : IFeatureFilter
{
private readonly IHttpContextAccessor _httpContextAccessor;
public TargetingFilter(IHttpContextAccessor httpContextAccessor)
{
_httpContextAccessor = httpContextAccessor;
}
public Task<bool> EvaluateAsync(FeatureFilterEvaluationContext context)
{
var parameters = context.Parameters.Get<TargetingFilterSettings>();
var tenantId = _httpContextAccessor.HttpContext?.User?.FindFirst("tenantId")?.Value;
var groups = _httpContextAccessor.HttpContext?.User?.FindAll("group").Select(c => c.Value).ToList();
if (string.IsNullOrEmpty(tenantId))
{
return Task.FromResult(false);
}
// Check exclusions first
if (parameters.Audience.Exclusion?.Users?.Contains(tenantId) == true)
{
return Task.FromResult(false);
}
if (groups != null && parameters.Audience.Exclusion?.Groups?.Any(g => groups.Contains(g)) == true)
{
return Task.FromResult(false);
}
// Check explicit targeting
if (parameters.Audience.Users?.Contains(tenantId) == true)
{
return Task.FromResult(true);
}
if (groups != null && parameters.Audience.Groups?.Any(g => groups.Contains(g)) == true)
{
return Task.FromResult(true);
}
// Fall back to default rollout percentage
if (parameters.Audience.DefaultRolloutPercentage > 0)
{
var hash = ComputeHash(tenantId);
var percentage = Math.Abs(hash) % 100;
return Task.FromResult(percentage < parameters.Audience.DefaultRolloutPercentage);
}
return Task.FromResult(false);
}
// Stable hash helper (same approach as PercentageFilter)
private static int ComputeHash(string input) =>
BitConverter.ToInt32(System.Security.Cryptography.SHA256.HashData(System.Text.Encoding.UTF8.GetBytes(input)), 0);
}
Time Window Filter (Scheduled Features)¶
Usage: Enable feature between specific dates/times (planned feature launches, limited-time offers, maintenance windows).
{
"featureManagement": {
"BlockchainAnchoring": {
"EnabledFor": [
{
"Name": "TimeWindow",
"Parameters": {
"Start": "2025-11-01T00:00:00Z",
"End": "2025-11-30T23:59:59Z"
}
}
]
},
"MaintenanceMode": {
"EnabledFor": [
{
"Name": "TimeWindow",
"Parameters": {
"Start": "2025-10-31T02:00:00Z",
"End": "2025-10-31T04:00:00Z",
"Recurrence": {
"Pattern": "Weekly",
"DaysOfWeek": ["Sunday"]
}
}
}
]
}
}
}
C# Implementation:
// Time window filter implementation
public class TimeWindowFilter : IFeatureFilter
{
public Task<bool> EvaluateAsync(FeatureFilterEvaluationContext context)
{
var parameters = context.Parameters.Get<TimeWindowFilterSettings>();
var now = DateTime.UtcNow;
// Check if within time window
var enabled = now >= parameters.Start && now <= parameters.End;
// Check recurrence pattern (e.g., every Sunday)
if (!enabled && parameters.Recurrence != null)
{
if (parameters.Recurrence.Pattern == "Weekly")
{
var currentDay = now.DayOfWeek.ToString();
enabled = parameters.Recurrence.DaysOfWeek?.Contains(currentDay) == true;
if (enabled)
{
// Check if within daily time window
var startTime = parameters.Start.TimeOfDay;
var endTime = parameters.End.TimeOfDay;
enabled = now.TimeOfDay >= startTime && now.TimeOfDay <= endTime;
}
}
}
return Task.FromResult(enabled);
}
}
Custom Filter (Business Logic)¶
Usage: Complex business rules for feature enablement (tenant tier, compliance profile, region).
{
"featureManagement": {
"AdvancedAnalytics": {
"EnabledFor": [
{
"Name": "TenantTierFilter",
"Parameters": {
"RequiredTier": "Enterprise",
"RequiredCompliance": ["soc2", "hipaa"]
}
}
]
}
}
}
C# Implementation:
// Custom filter: Tenant tier + compliance requirements
public class TenantTierFilter : IFeatureFilter
{
private readonly ITenantService _tenantService;
private readonly IHttpContextAccessor _httpContextAccessor;
public TenantTierFilter(ITenantService tenantService, IHttpContextAccessor httpContextAccessor)
{
_tenantService = tenantService;
_httpContextAccessor = httpContextAccessor;
}
public async Task<bool> EvaluateAsync(FeatureFilterEvaluationContext context)
{
var parameters = context.Parameters.Get<TenantTierFilterSettings>();
var tenantId = _httpContextAccessor.HttpContext?.User?.FindFirst("tenantId")?.Value;
if (string.IsNullOrEmpty(tenantId))
{
return false;
}
// Fetch tenant details
var tenant = await _tenantService.GetTenantAsync(tenantId);
// Check tier requirement
if (tenant.Edition != parameters.RequiredTier)
{
return false;
}
// Check compliance profile
var tenantCompliance = tenant.ComplianceProfile.Split(',');
var hasRequiredCompliance = parameters.RequiredCompliance
.All(req => tenantCompliance.Contains(req));
return hasRequiredCompliance;
}
}
// Settings class
public class TenantTierFilterSettings
{
public string RequiredTier { get; set; } // Standard, Business, Enterprise
public List<string> RequiredCompliance { get; set; } // gdpr, hipaa, soc2
}
Composite Filters (Multiple Conditions)¶
Usage: Combine multiple filters with AND/OR logic for complex scenarios.
{
"featureManagement": {
"PremiumFeature": {
"RequirementType": "All", // AND logic
"EnabledFor": [
{
"Name": "TenantTierFilter",
"Parameters": {
"RequiredTier": "Enterprise"
}
},
{
"Name": "Percentage",
"Parameters": {
"Value": 25
}
},
{
"Name": "TimeWindow",
"Parameters": {
"Start": "2025-11-01T00:00:00Z",
"End": "2025-12-31T23:59:59Z"
}
}
]
}
}
}
Evaluation Logic:
- RequirementType: All (AND): All filters must return true.
- RequirementType: Any (OR): At least one filter must return true.
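The two modes reduce to a one-line combinator over the individual filter outcomes; a minimal Python sketch (the boolean inputs stand in for the JSON client filters above):

```python
def evaluate_feature(filter_results: list, requirement_type: str = "Any") -> bool:
    """Combine client-filter outcomes: "All" = AND logic, "Any" = OR (the default)."""
    if not filter_results:
        return False  # no filters evaluated to a decision
    return all(filter_results) if requirement_type == "All" else any(filter_results)

# e.g. Enterprise-tier filter passed, but the 25% percentage filter did not admit the tenant:
assert evaluate_feature([True, False, True], "All") is False  # AND: every filter must pass
assert evaluate_feature([True, False, True], "Any") is True   # OR: one passing filter suffices
```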
Feature Flag Precedence & Evaluation¶
Evaluation Order (First match wins):
1. Explicit User/Tenant Targeting (highest priority)
↓
2. Group Targeting
↓
3. Percentage Rollout
↓
4. Time Window
↓
5. Custom Business Logic Filters
↓
6. Environment Default (Boolean true/false)
↓
7. Global Default (false if not specified)
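Conceptually, the chain above is a first-match-wins fold: each stage either decides the flag or falls through to the next. A minimal Python sketch with hypothetical rule lambdas (the real evaluation is done by the Feature Management library):

```python
def first_match_wins(rules, ctx) -> bool:
    """Walk rules in precedence order; the first rule that applies decides the flag.

    A rule returns True/False when it applies, or None to fall through.
    """
    for rule in rules:
        decision = rule(ctx)
        if decision is not None:
            return decision
    return False  # 7. global default when nothing matched

# Hypothetical rules mirroring the list above (explicit tenant > group > percentage):
rules = [
    lambda c: True if c["tenant"] in {"tenant-12345"} else None,  # 1. explicit targeting
    lambda c: True if "early-access" in c["groups"] else None,    # 2. group targeting
    lambda c: True if c["bucket"] < 10 else None,                 # 3. 10% percentage rollout
]
assert first_match_wins(rules, {"tenant": "tenant-12345", "groups": (), "bucket": 99}) is True
```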
Precedence Example:
// Feature flag evaluation with precedence
public class FeatureEvaluationService
{
private readonly IFeatureManager _featureManager;
private readonly ILogger<FeatureEvaluationService> _logger;
private readonly TelemetryClient _telemetry;
public async Task<bool> IsFeatureEnabledAsync(string featureName, string tenantId)
{
// 1. Check if feature exists (GetFeatureNamesAsync returns IAsyncEnumerable<string>)
var featureNames = new List<string>();
await foreach (var name in _featureManager.GetFeatureNamesAsync())
{
featureNames.Add(name);
}
if (!featureNames.Contains(featureName))
{
_logger.LogWarning("Feature '{FeatureName}' not defined; defaulting to false", featureName);
return false;
}
// 2. Evaluate filters (in precedence order)
var context = new TargetingContext
{
UserId = tenantId,
Groups = await GetTenantGroupsAsync(tenantId)
};
var enabled = await _featureManager.IsEnabledAsync(featureName, context);
// 3. Log feature flag evaluation for audit trail
_logger.LogInformation(
"Feature flag evaluated: {FeatureName} = {Enabled} for tenant {TenantId}",
featureName, enabled, tenantId);
// 4. Emit telemetry for feature usage analytics
_telemetry.TrackEvent("FeatureFlagEvaluation", new Dictionary<string, string>
{
["FeatureName"] = featureName,
["Enabled"] = enabled.ToString(),
["TenantId"] = tenantId,
["Timestamp"] = DateTime.UtcNow.ToString("o")
});
return enabled;
}
}
Feature Flag Management Operations¶
Creating Feature Flags¶
Via Azure CLI:
#!/bin/bash
# create-feature-flag.sh
FEATURE_NAME=$1
ENVIRONMENT=$2
ENABLED=${3:-false}
echo "Creating feature flag '$FEATURE_NAME' for $ENVIRONMENT..."
az appconfig feature set \
--name "atp-appconfig-$ENVIRONMENT-eus" \
--feature "$FEATURE_NAME" \
--label "$ENVIRONMENT" \
--description "Feature: $FEATURE_NAME" \
--yes
if [ "$ENABLED" == "true" ]; then
az appconfig feature enable \
--name "atp-appconfig-$ENVIRONMENT-eus" \
--feature "$FEATURE_NAME" \
--label "$ENVIRONMENT" \
--yes
fi
echo "✅ Feature flag '$FEATURE_NAME' created"
Via Pulumi (IaC):
// Create feature flag in Azure App Configuration
var featureFlag = new ConfigurationStoreKeyValue($"feature-{featureName}", new ConfigurationStoreKeyValueArgs
{
ConfigStoreName = appConfigStore.Name,
ResourceGroupName = resourceGroup.Name,
Key = $".appconfig.featureflag/{featureName}",
Label = environment,
ContentType = "application/vnd.microsoft.appconfig.ff+json;charset=utf-8",
Value = JsonSerializer.Serialize(new
{
id = featureName,
description = $"{featureName} feature flag",
enabled = true,
conditions = new
{
client_filters = new[]
{
new
{
name = "Percentage",
parameters = new
{
Value = 10,
Seed = "consistent-seed"
}
}
}
}
})
});
Updating Feature Flags¶
Gradual Rollout (Production):
#!/bin/bash
# rollout-feature.sh
FEATURE_NAME="AIAssistedAnomalyDetection"
STAGES=(5 10 25 50 100)
for PERCENTAGE in "${STAGES[@]}"; do
echo "Rolling out $FEATURE_NAME to $PERCENTAGE%..."
# Update percentage filter
az appconfig feature filter add \
--name atp-appconfig-prod-eus \
--feature "$FEATURE_NAME" \
--label prod \
--filter-name Percentage \
--filter-parameters Value=$PERCENTAGE \
--yes
# Update sentinel key (trigger app refresh)
az appconfig kv set \
--name atp-appconfig-prod-eus \
--key Sentinel \
--value "$(date +%s)" \
--label prod \
--yes
echo "Waiting 7 days for monitoring..."
# In production, wait 7 days between rollout stages
# (simulated here; actual implementation would be manual or scheduled)
# Monitor metrics for this stage
ERROR_RATE=$(az monitor app-insights metrics show \
--app atp-appinsights-prod-eus \
--metric "requests/failed" \
--aggregation avg \
--offset 24h \
--query "value.segments[0]['requests/failed'].avg")
if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
echo "❌ Error rate too high: $ERROR_RATE%"
echo "Rolling back feature flag..."
# Disable feature
az appconfig feature disable \
--name atp-appconfig-prod-eus \
--feature "$FEATURE_NAME" \
--label prod \
--yes
exit 1
fi
echo "✅ $PERCENTAGE% rollout successful; proceeding to next stage"
done
echo "✅ Feature $FEATURE_NAME rolled out to 100%"
Disabling Feature Flags (Kill Switch)¶
Immediate Rollback:
#!/bin/bash
# kill-switch.sh
FEATURE_NAME=$1
REASON=$2
echo "🚨 KILL SWITCH: Disabling feature '$FEATURE_NAME'"
echo "Reason: $REASON"
# Disable feature in Production
az appconfig feature disable \
--name atp-appconfig-prod-eus \
--feature "$FEATURE_NAME" \
--label prod \
--yes
# Update sentinel (trigger immediate app refresh)
az appconfig kv set \
--name atp-appconfig-prod-eus \
--key Sentinel \
--value "$(date +%s)" \
--label prod \
--yes
# Create incident ticket
az boards work-item create \
--title "Feature Kill Switch Activated: $FEATURE_NAME" \
--type "Incident" \
--description "Feature '$FEATURE_NAME' disabled via kill switch.\n\nReason: $REASON\n\nTimestamp: $(date -u +%Y-%m-%dT%H:%M:%SZ)" \
--assigned-to "platform-team@connectsoft.example" \
--fields Priority=1
# Notify team
curl -X POST "$SLACK_WEBHOOK_URL" \
-H 'Content-Type: application/json' \
-d "{
\"text\": \"🚨 Feature Kill Switch Activated\",
\"attachments\": [{
\"color\": \"danger\",
\"fields\": [
{\"title\": \"Feature\", \"value\": \"$FEATURE_NAME\", \"short\": true},
{\"title\": \"Reason\", \"value\": \"$REASON\", \"short\": false}
]
}]
}"
echo "✅ Feature '$FEATURE_NAME' disabled; apps refreshing within 5 minutes"
Feature Flag Testing Strategies¶
Unit Testing with Feature Flags¶
// Unit test with feature flag mocking
[Fact]
public async Task RecordEvent_WhenTamperEvidenceV2Enabled_UsesNewAlgorithm()
{
// Arrange
var featureManager = new Mock<IFeatureManager>();
featureManager
.Setup(fm => fm.IsEnabledAsync("TamperEvidenceV2"))
.ReturnsAsync(true);
var service = new AuditService(featureManager.Object);
// Act
var result = await service.RecordEventAsync(new AuditEvent { /* ... */ });
// Assert
Assert.NotNull(result.TamperEvidence);
Assert.Equal("V2", result.TamperEvidence.Algorithm);
}
[Fact]
public async Task RecordEvent_WhenTamperEvidenceV2Disabled_UsesLegacyAlgorithm()
{
// Arrange
var featureManager = new Mock<IFeatureManager>();
featureManager
.Setup(fm => fm.IsEnabledAsync("TamperEvidenceV2"))
.ReturnsAsync(false);
var service = new AuditService(featureManager.Object);
// Act
var result = await service.RecordEventAsync(new AuditEvent { /* ... */ });
// Assert
Assert.NotNull(result.TamperEvidence);
Assert.Equal("V1", result.TamperEvidence.Algorithm);
}
Integration Testing (Test Environment)¶
// Integration test with actual Azure App Configuration
[Collection("IntegrationTests")]
public class FeatureFlagIntegrationTests
{
private readonly IFeatureManager _featureManager;
public FeatureFlagIntegrationTests()
{
// Connect to Test App Configuration
var configuration = new ConfigurationBuilder()
.AddAzureAppConfiguration(options =>
{
options.Connect(TestEnvironment.AppConfigConnectionString)
.Select(KeyFilter.Any, "test")
.UseFeatureFlags();
})
.Build();
var services = new ServiceCollection();
services.AddSingleton<IConfiguration>(configuration);
services.AddFeatureManagement();
var provider = services.BuildServiceProvider();
_featureManager = provider.GetRequiredService<IFeatureManager>();
}
[Fact]
public async Task FeatureFlag_TamperEvidenceV2_EnabledInTest()
{
// Act
var enabled = await _featureManager.IsEnabledAsync("TamperEvidenceV2");
// Assert
Assert.True(enabled, "TamperEvidenceV2 should be enabled in Test environment");
}
[Fact]
public async Task FeatureFlag_ExperimentalFeatures_DisabledInTest()
{
// Act
var enabled = await _featureManager.IsEnabledAsync("ExperimentalFeatures");
// Assert
Assert.False(enabled, "ExperimentalFeatures should be disabled in Test environment");
}
}
Feature Flag Monitoring & Analytics¶
Purpose: Track feature flag usage metrics, performance impact, and rollout success for data-driven decisions.
Telemetry Integration:
// Feature flag usage tracking
public class TelemetryFeatureManagerSnapshot : IFeatureManagerSnapshot
{
private readonly IFeatureManagerSnapshot _inner;
private readonly TelemetryClient _telemetry;
public TelemetryFeatureManagerSnapshot(
IFeatureManagerSnapshot inner,
TelemetryClient telemetry)
{
_inner = inner;
_telemetry = telemetry;
}
// Pass-through for the remaining IFeatureManager member
public IAsyncEnumerable<string> GetFeatureNamesAsync() => _inner.GetFeatureNamesAsync();
public async Task<bool> IsEnabledAsync(string feature)
{
var enabled = await _inner.IsEnabledAsync(feature);
// Track feature flag evaluation
_telemetry.TrackEvent("FeatureFlagEvaluated", new Dictionary<string, string>
{
["FeatureName"] = feature,
["Enabled"] = enabled.ToString(),
["Timestamp"] = DateTime.UtcNow.ToString("o")
});
return enabled;
}
public async Task<bool> IsEnabledAsync<TContext>(string feature, TContext context)
{
var enabled = await _inner.IsEnabledAsync(feature, context);
// Track with context
var properties = new Dictionary<string, string>
{
["FeatureName"] = feature,
["Enabled"] = enabled.ToString(),
["Timestamp"] = DateTime.UtcNow.ToString("o")
};
if (context is TargetingContext targetingContext)
{
properties["UserId"] = targetingContext.UserId;
properties["Groups"] = string.Join(",", targetingContext.Groups ?? new List<string>());
}
_telemetry.TrackEvent("FeatureFlagEvaluated", properties);
return enabled;
}
}
Feature Usage Dashboard (Application Insights Query):
// Feature flag usage over last 7 days
customEvents
| where timestamp > ago(7d)
| where name == "FeatureFlagEvaluated"
| extend FeatureName = tostring(customDimensions.FeatureName)
| extend Enabled = tostring(customDimensions.Enabled)
| summarize
TotalEvaluations = count(),
EnabledCount = countif(Enabled == "true"),
DisabledCount = countif(Enabled == "false"),
EnabledPercentage = 100.0 * countif(Enabled == "true") / count()
by FeatureName
| order by TotalEvaluations desc
Performance Impact Analysis:
// Compare performance with/without feature flag
requests
| where timestamp > ago(24h)
| extend FeatureFlagEnabled = tostring(customDimensions.TamperEvidenceV2)
| summarize
AvgDuration = avg(duration),
P95Duration = percentile(duration, 95),
P99Duration = percentile(duration, 99),
Count = count()
by FeatureFlagEnabled
| project FeatureFlagEnabled, AvgDuration, P95Duration, P99Duration, Count
Feature Flag Lifecycle Management¶
Feature Flag States:
stateDiagram-v2
[*] --> Development: Feature created
Development --> Testing: Feature ready
Testing --> Canary: Tests pass
Canary --> Rollout: Metrics healthy
Rollout --> General_Availability: 100% rollout
General_Availability --> Deprecated: Feature superseded
Deprecated --> Removed: Cleanup old code
Canary --> Disabled: Metrics unhealthy
Rollout --> Disabled: Issues detected
General_Availability --> Disabled: Kill switch
Disabled --> Canary: Issues resolved
Feature Flag Cleanup (Remove Old Flags):
// Identify stale feature flags (fully rolled out >90 days)
[FunctionName("IdentifyStaleFeatureFlags")]
public async Task RunAsync(
[TimerTrigger("0 0 1 * * 0")] TimerInfo timer, // Weekly on Sunday at 1 AM
ILogger log)
{
log.LogInformation("Identifying stale feature flags...");
// ConfigurationClient takes either a connection string or an endpoint + token credential
var appConfigClient = new ConfigurationClient(
new Uri(Environment.GetEnvironmentVariable("AppConfig:Endpoint")),
new DefaultAzureCredential());
var staleFlags = new List<string>();
await foreach (var setting in appConfigClient.GetConfigurationSettingsAsync(
new SettingSelector { KeyFilter = ".appconfig.featureflag/*", LabelFilter = "prod" }))
{
var featureFlag = JsonSerializer.Deserialize<FeatureFlag>(setting.Value);
// Boolean-true flag with no client filters = fully rolled out
if (featureFlag.Enabled && (featureFlag.Conditions?.ClientFilters?.Count ?? 0) == 0)
{
// Check how long it has been at 100% (LastModified is a nullable DateTimeOffset)
var lastModified = setting.LastModified ?? DateTimeOffset.UtcNow;
var daysSinceFullRollout = (DateTimeOffset.UtcNow - lastModified).Days;
if (daysSinceFullRollout > 90)
{
staleFlags.Add(featureFlag.Id);
log.LogWarning($"Stale flag: {featureFlag.Id} (fully rolled out for {daysSinceFullRollout} days)");
}
}
}
if (staleFlags.Any())
{
// Create work item for cleanup
await CreateCleanupTaskAsync(staleFlags);
}
log.LogInformation($"✅ Identified {staleFlags.Count} stale feature flags");
}
Runtime Configuration Refresh¶
Purpose: Enable configuration updates without restarting applications using Azure App Configuration refresh.
Configuration Refresh Implementation:
// Program.cs - Configure App Configuration refresh
public static IHostBuilder CreateHostBuilder(string[] args) =>
Host.CreateDefaultBuilder(args)
.ConfigureAppConfiguration((context, config) =>
{
var env = context.HostingEnvironment;
if (env.IsProduction() || env.IsStaging())
{
var settings = config.Build();
var appConfigConnection = settings["AppConfig:ConnectionString"];
config.AddAzureAppConfiguration(options =>
{
options
.Connect(appConfigConnection)
.Select(KeyFilter.Any, LabelFilter.Null)
.Select(KeyFilter.Any, env.EnvironmentName)
// Configure refresh behavior
.ConfigureRefresh(refresh =>
{
// Sentinel key pattern: refresh all when sentinel changes
refresh.Register("Sentinel", refreshAll: true)
.SetCacheExpiration(TimeSpan.FromMinutes(5));
// Refresh specific keys independently
refresh.Register("Audit:MaxBatchSize", refreshAll: false)
.SetCacheExpiration(TimeSpan.FromMinutes(15));
})
// Feature flags refresh
.UseFeatureFlags(featureFlagOptions =>
{
featureFlagOptions.CacheExpirationInterval = TimeSpan.FromMinutes(5);
featureFlagOptions.Label = env.EnvironmentName;
});
});
}
})
.ConfigureWebHostDefaults(webBuilder =>
{
webBuilder.UseStartup<Startup>();
});
// Startup.cs - Add middleware
public void Configure(IApplicationBuilder app)
{
// Azure App Configuration refresh middleware
app.UseAzureAppConfiguration();
// ... other middleware
}
Sentinel Key Pattern (Trigger Full Refresh):
# Update sentinel key to trigger full app configuration refresh
az appconfig kv set \
--name atp-appconfig-prod-eus \
--key Sentinel \
--value "$(date +%s)" \
--label prod \
--yes
echo "Sentinel updated; apps will refresh within 5 minutes"
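The reason one sentinel write refreshes everything: clients poll only the single sentinel key on the cache interval, and reload the full configuration set only when its value changes. A minimal Python sketch of that pattern (SentinelRefresher and the store interface are illustrative stand-ins, not the Azure SDK API):

```python
import time

class SentinelRefresher:
    """Reload the full config only when the single Sentinel key changes."""

    def __init__(self, store, cache_ttl_seconds: int = 300):
        self.store = store
        self.cache_ttl = cache_ttl_seconds
        self.last_check = 0.0
        self.sentinel = store.get("Sentinel")
        self.config = store.load_all()

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        if now - self.last_check >= self.cache_ttl:
            self.last_check = now
            current = self.store.get("Sentinel")
            if current != self.sentinel:  # one cheap read gates the full reload
                self.sentinel = current
                self.config = self.store.load_all()
        return self.config.get(key)

class InMemoryStore:
    """Stand-in for an App Configuration store (demo only)."""
    def __init__(self, data): self.data = data
    def get(self, key): return self.data[key]
    def load_all(self): return dict(self.data)

store = InMemoryStore({"Sentinel": "1", "Audit:MaxBatchSize": 100})
cfg = SentinelRefresher(store, cache_ttl_seconds=300)
store.data.update({"Audit:MaxBatchSize": 200, "Sentinel": "2"})
stale = cfg.get("Audit:MaxBatchSize", now=100)  # within TTL: cached value served
fresh = cfg.get("Audit:MaxBatchSize", now=400)  # TTL elapsed, sentinel changed: reloaded
```

This is why the scripts above bump the sentinel after every flag change: without it, each key would refresh on its own schedule and apps could briefly run with a mixed configuration.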
Feature Flag Best Practices¶
Development:
- Feature Flag Naming: Use descriptive names with version suffixes (e.g., TamperEvidenceV2, QueryOptimizationV3).
- Default to Off: New features default to false; explicitly enable per environment.
- Short-Lived Flags: Remove flags once features are fully rolled out and old code paths deleted.
- Documentation: Document feature flags with purpose, rollout plan, and cleanup date.
Testing:
- Test Both Paths: Test feature enabled AND disabled in unit/integration tests.
- Percentage Testing: Test percentage filter with various values (0%, 50%, 100%).
- Targeting Testing: Validate targeting filters work correctly for specific tenants.
- Performance Testing: Measure performance impact of new features during canary rollout.
Production:
- Conservative Rollouts: Start at 5-10%; monitor for 48+ hours before increasing.
- Monitor Metrics: Track error rate, latency, feature usage, and business metrics.
- Kill Switch Ready: Document rollback procedure; test kill switch in Staging.
- Audit Trail: Log all feature flag evaluations for compliance and debugging.
Operational:
- Centralized Management: Use Azure App Configuration UI or CLI; avoid hardcoded flags.
- Version Control: Store feature flag configurations in Git; sync via GitOps.
- Access Control: Restrict production feature flag changes to platform team only.
- Cleanup Policy: Remove feature flags >90 days after 100% rollout; delete legacy code paths.
Example Feature Flag Scenarios¶
Scenario 1: Gradual Feature Rollout¶
Feature: AI-Assisted Anomaly Detection
Rollout Plan:
Week 1: 5% (targeting: early-access group)
Week 2: 10% (percentage rollout)
Week 3: 25% (percentage rollout)
Week 4: 50% (percentage rollout)
Week 5: 100% (fully rolled out)
Week 12: Remove flag; delete legacy code
Implementation:
// Week 1: Targeting early access
{
"AIAssistedAnomalyDetection": {
"EnabledFor": [
{
"Name": "TargetingFilter",
"Parameters": {
"Audience": {
"Groups": ["early-access"]
}
}
}
]
}
}
// Week 2: 10% rollout
{
"AIAssistedAnomalyDetection": {
"EnabledFor": [
{
"Name": "Percentage",
"Parameters": { "Value": 10 }
}
]
}
}
// Week 5: Fully rolled out
{
"AIAssistedAnomalyDetection": true
}
Scenario 2: A/B Testing¶
Feature: New Export Format (JSON vs Parquet)
Test Setup:
{
"ExportFormat_Parquet": {
"EnabledFor": [
{
"Name": "Percentage",
"Parameters": {
"Value": 50,
"Seed": "export-format-test"
}
}
]
}
}
Usage:
// A/B test: Export format selection
public async Task<ExportResult> ExportAuditEventsAsync(ExportRequest request)
{
// Check which variant user gets
if (await _featureManager.IsEnabledAsync("ExportFormat_Parquet"))
{
// Variant A: Parquet format
_telemetry.TrackEvent("ExportFormat", new Dictionary<string, string>
{
["Format"] = "Parquet",
["TenantId"] = request.TenantId,
["EventCount"] = request.EventCount.ToString()
});
return await ExportAsParquetAsync(request);
}
else
{
// Variant B: JSON format (control)
_telemetry.TrackEvent("ExportFormat", new Dictionary<string, string>
{
["Format"] = "JSON",
["TenantId"] = request.TenantId,
["EventCount"] = request.EventCount.ToString()
});
return await ExportAsJsonAsync(request);
}
}
Analysis (After 2 weeks):
// Compare export performance by format
customEvents
| where name == "ExportFormat"
| extend Format = tostring(customDimensions.Format)
| extend EventCount = toint(customDimensions.EventCount)
| join kind=inner (
dependencies
| where name == "ExportAuditEvents"
) on operation_Id
| summarize
AvgDuration = avg(duration),
P95Duration = percentile(duration, 95),
TotalExports = count()
by Format
| project Format, AvgDuration, P95Duration, TotalExports
Scenario 3: Maintenance Mode¶
Feature: Enable read-only mode during maintenance
Configuration:
{
"MaintenanceMode": {
"EnabledFor": [
{
"Name": "TimeWindow",
"Parameters": {
"Start": "2025-11-02T02:00:00Z",
"End": "2025-11-02T04:00:00Z",
"Recurrence": {
"Pattern": {
"Type": "Weekly",
"DaysOfWeek": ["Sunday"]
},
"Range": { "Type": "NoEnd" }
}
}
}
]
}
}
Usage:
// Maintenance mode middleware
public class MaintenanceModeMiddleware
{
private readonly RequestDelegate _next;
private readonly IFeatureManager _featureManager;
public MaintenanceModeMiddleware(RequestDelegate next, IFeatureManager featureManager)
{
_next = next;
_featureManager = featureManager;
}
public async Task InvokeAsync(HttpContext context)
{
if (await _featureManager.IsEnabledAsync("MaintenanceMode"))
{
// Only allow read operations
if (context.Request.Method != "GET" && context.Request.Method != "HEAD")
{
context.Response.StatusCode = 503;
await context.Response.WriteAsJsonAsync(new
{
error = "Service Unavailable",
message = "System is in maintenance mode. Only read operations are allowed.",
retryAfter = "2025-10-31T04:00:00Z"
});
return;
}
}
await _next(context);
}
}
Feature Flag Security & Compliance¶
Access Control (Azure App Configuration):
# Dev: Developers can modify feature flags
accessControl:
- principal: developers-aad-group
role: App Configuration Data Owner
scope: /subscriptions/<sub-id>/resourceGroups/ATP-Dev-RG/providers/Microsoft.AppConfiguration/configurationStores/atp-appconfig-dev-eus
# Test: QA team read-only
accessControl:
- principal: qa-team-aad-group
role: App Configuration Data Reader
scope: /subscriptions/<sub-id>/resourceGroups/ATP-Test-RG/providers/Microsoft.AppConfiguration/configurationStores/atp-appconfig-test-eus
# Production: Platform team only (no developers)
accessControl:
- principal: platform-team-aad-group
role: App Configuration Data Owner
scope: /subscriptions/<sub-id>/resourceGroups/ATP-Prod-RG/providers/Microsoft.AppConfiguration/configurationStores/atp-appconfig-prod-eus
- principal: atp-prod-managed-identity
role: App Configuration Data Reader
scope: /subscriptions/<sub-id>/resourceGroups/ATP-Prod-RG/providers/Microsoft.AppConfiguration/configurationStores/atp-appconfig-prod-eus
Audit Logging (App Configuration Diagnostic Settings):
# Enable diagnostic logging for Production App Configuration
az monitor diagnostic-settings create \
--name atp-appconfig-prod-audit \
--resource /subscriptions/<sub-id>/resourceGroups/ATP-Prod-RG/providers/Microsoft.AppConfiguration/configurationStores/atp-appconfig-prod-eus \
--logs '[
{
"category": "HttpRequest",
"enabled": true,
"retentionPolicy": {
"enabled": true,
"days": 365
}
},
{
"category": "Audit",
"enabled": true,
"retentionPolicy": {
"enabled": true,
"days": 365
}
}
]' \
--workspace /subscriptions/<sub-id>/resourceGroups/ATP-Prod-RG/providers/Microsoft.OperationalInsights/workspaces/atp-loganalytics-prod-eus
echo "✅ App Configuration audit logging enabled"
Summary¶
- Azure App Configuration Per Environment: Dev (all features on), Test (stable only), Staging (canary testing), Production (conservative rollouts).
- Feature Flag Filters: Boolean, Percentage, Targeting, Time Window, Custom filters for flexible feature control.
- Precedence Rules: User targeting → Group targeting → Percentage → Time Window → Custom logic → Environment default.
- Feature Management: Create, update, disable (kill switch), and cleanup stale flags with automated workflows.
- Testing Strategies: Unit tests with mocking, integration tests with actual App Configuration, A/B testing with telemetry.
- Monitoring & Analytics: Track feature flag evaluations, usage metrics, and performance impact for data-driven decisions.
- Lifecycle Management: Feature progression from Development → Testing → Canary → General Availability → Deprecated → Removed.
- Security & Compliance: Access control per environment, audit logging with 365-day retention for Production.
Networking & Security Boundaries¶
ATP enforces strict network isolation and graduated security controls across environments to ensure developer productivity in lower tiers while maintaining zero-trust security in Production. Network boundaries prevent cross-environment access, protect sensitive data, and enforce least-privilege network access aligned with each environment's security requirements.
This approach implements defense-in-depth networking with VNet isolation, Network Security Groups (NSGs), private endpoints, and Azure Firewall to create security zones that match the criticality of each environment.
Network Isolation Strategy¶
ATP uses a hybrid VNet strategy: lower environments (Dev/Test) share a VNet with subnet isolation, while higher environments (Staging/Production) have dedicated VNets with full network segmentation and private endpoint enforcement.
Network Topology Overview¶
graph TB
subgraph Shared VNet - Dev/Test
DevSubnet[Dev Subnet<br/>10.0.1.0/24]
TestSubnet[Test Subnet<br/>10.0.2.0/24]
SharedServices[Shared Services<br/>10.0.3.0/24]
end
subgraph Dedicated VNet - Staging
StagingGateway[Gateway Subnet<br/>10.1.1.0/24]
StagingServices[Services Subnet<br/>10.1.2.0/24]
StagingData[Data Subnet<br/>10.1.3.0/24]
end
subgraph Dedicated VNet - Production Primary
ProdGateway[Gateway Subnet<br/>10.2.1.0/24]
ProdAKS[AKS Subnet<br/>10.2.2.0/23]
ProdData[Data Subnet<br/>10.2.3.0/24]
ProdFirewall[Firewall Subnet<br/>10.2.4.0/26]
end
Internet((Internet)) --> DevSubnet
Internet --> TestSubnet
Internet -.X.-> StagingGateway
Internet -.X.-> ProdGateway
DevSubnet <--> SharedServices
TestSubnet <--> SharedServices
style DevSubnet fill:#90EE90
style TestSubnet fill:#FFD700
style StagingServices fill:#FFA500
style ProdAKS fill:#FF6347
Network Isolation Per Environment¶
| Environment | VNet | Address Space | Subnets | NSG Rules | Public Access | Private Endpoints |
|---|---|---|---|---|---|---|
| Preview | Shared Preview VNet | 10.10.0.0/16 | Dynamic per PR (10.10.{PR-ID}.0/24) | Allow CI/CD agents | Yes | No |
| Dev | Shared ATP VNet | 10.0.0.0/16 | Dev: 10.0.1.0/24 | Allow developers, VPN | Yes (IP-whitelisted) | No |
| Test | Shared ATP VNet | 10.0.0.0/16 | Test: 10.0.2.0/24 | Allow CI/CD agents, QA | Yes (IP-whitelisted) | No |
| Staging | Dedicated Staging VNet | 10.1.0.0/16 | Gateway: 10.1.1.0/24; Services: 10.1.2.0/24; Data: 10.1.3.0/24 | Deny all by default | No | Yes (all data resources) |
| Production | Dedicated Production VNet | 10.2.0.0/16 | Gateway: 10.2.1.0/24; AKS: 10.2.2.0/23; Data: 10.2.3.0/24; Firewall: 10.2.4.0/26 | Zero-trust (deny all) | No | Yes (all resources) |
| Hotfix | Dedicated Hotfix VNet | 10.3.0.0/16 | Same as Production | Zero-trust | No | Yes |
Dev Environment Networking¶
Purpose: Convenient access for developers with IP-whitelisted public endpoints and shared VNet for cost optimization.
VNet Configuration (Pulumi):
// Shared VNet for Dev + Test environments
var sharedVNet = new VirtualNetwork("atp-vnet-shared-eus", new VirtualNetworkArgs
{
VirtualNetworkName = "atp-vnet-shared-eus",
ResourceGroupName = sharedResourceGroup.Name,
Location = "eastus",
AddressSpace = new AddressSpaceArgs
{
AddressPrefixes = new[] { "10.0.0.0/16" }
},
Subnets = new[]
{
new SubnetArgs
{
Name = "Dev-Subnet",
AddressPrefix = "10.0.1.0/24"
},
new SubnetArgs
{
Name = "Test-Subnet",
AddressPrefix = "10.0.2.0/24"
},
new SubnetArgs
{
Name = "Shared-Services-Subnet",
AddressPrefix = "10.0.3.0/24"
}
},
Tags = tags
});
Network Security Group (Dev Subnet):
// Dev NSG - Allow developer access
var devNsg = new NetworkSecurityGroup("atp-nsg-dev-eus", new NetworkSecurityGroupArgs
{
NetworkSecurityGroupName = "atp-nsg-dev-eus",
ResourceGroupName = resourceGroup.Name,
Location = "eastus",
SecurityRules = new[]
{
// Allow HTTPS from developer IPs
new SecurityRuleArgs
{
Name = "AllowDeveloperHTTPS",
Priority = 100,
Direction = "Inbound",
Access = "Allow",
Protocol = "Tcp",
SourcePortRange = "*",
DestinationPortRange = "443",
SourceAddressPrefixes = new[]
{
"203.0.113.0/24", // Developer office IP range
"198.51.100.0/24", // VPN gateway range
"192.0.2.0/24" // Home office IPs
},
DestinationAddressPrefix = "10.0.1.0/24"
},
// Allow SSH/RDP from VPN (jumpbox access)
new SecurityRuleArgs
{
Name = "AllowVPNManagement",
Priority = 110,
Direction = "Inbound",
Access = "Allow",
Protocol = "Tcp",
SourcePortRange = "*",
DestinationPortRanges = new[] { "22", "3389" },
SourceAddressPrefix = "198.51.100.0/24", // VPN range
DestinationAddressPrefix = "10.0.1.0/24"
},
// Allow all within subnet (service-to-service)
new SecurityRuleArgs
{
Name = "AllowWithinSubnet",
Priority = 120,
Direction = "Inbound",
Access = "Allow",
Protocol = "*",
SourcePortRange = "*",
DestinationPortRange = "*",
SourceAddressPrefix = "10.0.1.0/24",
DestinationAddressPrefix = "10.0.1.0/24"
},
// Allow Azure Load Balancer health probes
new SecurityRuleArgs
{
Name = "AllowAzureLoadBalancer",
Priority = 130,
Direction = "Inbound",
Access = "Allow",
Protocol = "*",
SourcePortRange = "*",
DestinationPortRange = "*",
SourceAddressPrefix = "AzureLoadBalancer",
DestinationAddressPrefix = "*"
},
// Deny all other inbound
new SecurityRuleArgs
{
Name = "DenyAllInbound",
Priority = 4096,
Direction = "Inbound",
Access = "Deny",
Protocol = "*",
SourcePortRange = "*",
DestinationPortRange = "*",
SourceAddressPrefix = "*",
DestinationAddressPrefix = "*"
}
},
Tags = tags
});
Public Network Access (Dev):
- Enabled: Yes (IP-whitelisted for developer convenience).
- Allowed IPs: Developer office IPs, VPN gateway, individual developer home IPs.
- Purpose: Enable remote development and troubleshooting.
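Beyond the NSG, public access to the Dev App Service itself is gated with App Service access restrictions; a sketch using the az CLI (the app name is illustrative; the IP range matches the NSG rules above):

```shell
# Whitelist the developer office range on the Dev App Service
az webapp config access-restriction add \
  --resource-group ATP-Dev-RG \
  --name atp-ingestion-dev \
  --rule-name AllowDeveloperOffice \
  --action Allow \
  --ip-address 203.0.113.0/24 \
  --priority 100
```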
Test Environment Networking¶
Purpose: Controlled access for CI/CD agents and QA team with IP restrictions and shared VNet with Dev.
Network Security Group (Test Subnet):
// Test NSG - Allow test automation access
var testNsg = new NetworkSecurityGroup("atp-nsg-test-eus", new NetworkSecurityGroupArgs
{
NetworkSecurityGroupName = "atp-nsg-test-eus",
ResourceGroupName = resourceGroup.Name,
Location = "eastus",
SecurityRules = new[]
{
// Allow HTTPS from CI/CD agents
new SecurityRuleArgs
{
Name = "AllowCICDAgents",
Priority = 100,
Direction = "Inbound",
Access = "Allow",
Protocol = "Tcp",
SourcePortRange = "*",
DestinationPortRange = "443",
SourceAddressPrefixes = new[]
{
"20.62.134.0/24", // Azure DevOps agent pool IP range
"13.107.6.0/24" // GitHub Actions runners
},
DestinationAddressPrefix = "10.0.2.0/24"
},
// Allow HTTPS from QA team IPs
new SecurityRuleArgs
{
Name = "AllowQATeam",
Priority = 110,
Direction = "Inbound",
Access = "Allow",
Protocol = "Tcp",
SourcePortRange = "*",
DestinationPortRange = "443",
SourceAddressPrefix = "203.0.113.0/24", // QA team office
DestinationAddressPrefix = "10.0.2.0/24"
},
// Allow test automation tools (Selenium Grid, API testing)
new SecurityRuleArgs
{
Name = "AllowTestAutomation",
Priority = 120,
Direction = "Inbound",
Access = "Allow",
Protocol = "Tcp",
SourcePortRange = "*",
DestinationPortRange = "443",
SourceAddressPrefix = "10.0.3.0/24", // Shared services subnet (test runners)
DestinationAddressPrefix = "10.0.2.0/24"
},
// Allow within subnet
new SecurityRuleArgs
{
Name = "AllowWithinSubnet",
Priority = 130,
Direction = "Inbound",
Access = "Allow",
Protocol = "*",
SourcePortRange = "*",
DestinationPortRange = "*",
SourceAddressPrefix = "10.0.2.0/24",
DestinationAddressPrefix = "10.0.2.0/24"
},
// Deny all other inbound
new SecurityRuleArgs
{
Name = "DenyAllInbound",
Priority = 4096,
Direction = "Inbound",
Access = "Deny",
Protocol = "*",
SourcePortRange = "*",
DestinationPortRange = "*",
SourceAddressPrefix = "*",
DestinationAddressPrefix = "*"
}
},
Tags = tags
});
Public Network Access (Test):
- Enabled: Yes (IP-whitelisted for CI/CD agents and QA team).
- Allowed IPs: Azure DevOps agent pool IPs, GitHub Actions runners, QA team office.
- Purpose: Enable automated testing and QA validation.
Staging Environment Networking¶
Purpose: Production-like security with dedicated VNet, private endpoints, and no public access for realistic security validation.
VNet Configuration (Pulumi):
// Dedicated VNet for Staging
var stagingVNet = new VirtualNetwork("atp-vnet-staging-eus", new VirtualNetworkArgs
{
VirtualNetworkName = "atp-vnet-staging-eus",
ResourceGroupName = stagingResourceGroup.Name,
Location = "eastus",
AddressSpace = new AddressSpaceArgs
{
AddressPrefixes = new[] { "10.1.0.0/16" }
},
Subnets = new[]
{
new SubnetArgs
{
Name = "Gateway-Subnet",
AddressPrefix = "10.1.1.0/24",
Delegations = new[]
{
new DelegationArgs
{
Name = "AppGatewayDelegation",
ServiceName = "Microsoft.Network/applicationGateways"
}
}
},
new SubnetArgs
{
Name = "Services-Subnet",
AddressPrefix = "10.1.2.0/24",
ServiceEndpoints = new[]
{
new ServiceEndpointPropertiesFormatArgs { Service = "Microsoft.Sql" },
new ServiceEndpointPropertiesFormatArgs { Service = "Microsoft.Storage" },
new ServiceEndpointPropertiesFormatArgs { Service = "Microsoft.KeyVault" }
}
},
new SubnetArgs
{
Name = "Data-Subnet",
AddressPrefix = "10.1.3.0/24",
PrivateEndpointNetworkPolicies = "Disabled", // Required for private endpoints
PrivateLinkServiceNetworkPolicies = "Enabled"
}
},
Tags = tags
});
Network Security Group (Staging Services Subnet):
// Staging NSG - Deny all by default
var stagingNsg = new NetworkSecurityGroup("atp-nsg-staging-services-eus", new NetworkSecurityGroupArgs
{
NetworkSecurityGroupName = "atp-nsg-staging-services-eus",
ResourceGroupName = stagingResourceGroup.Name,
Location = "eastus",
SecurityRules = new[]
{
// Allow HTTPS from Application Gateway subnet only
new SecurityRuleArgs
{
Name = "AllowAppGatewayHTTPS",
Priority = 100,
Direction = "Inbound",
Access = "Allow",
Protocol = "Tcp",
SourcePortRange = "*",
DestinationPortRange = "443",
SourceAddressPrefix = "10.1.1.0/24", // Gateway subnet
DestinationAddressPrefix = "10.1.2.0/24" // Services subnet
},
// Allow service-to-service within Services subnet
new SecurityRuleArgs
{
Name = "AllowServiceToService",
Priority = 110,
Direction = "Inbound",
Access = "Allow",
Protocol = "Tcp",
SourcePortRange = "*",
DestinationPortRanges = new[] { "80", "443", "5672", "6379" }, // HTTP, HTTPS, RabbitMQ, Redis
SourceAddressPrefix = "10.1.2.0/24",
DestinationAddressPrefix = "10.1.2.0/24"
},
// Allow Services → Data subnet (private endpoints)
new SecurityRuleArgs
{
Name = "AllowServicesToData",
Priority = 120,
Direction = "Inbound",
Access = "Allow",
Protocol = "Tcp",
SourcePortRange = "*",
DestinationPortRanges = new[] { "1433", "443", "5432" }, // SQL, HTTPS, PostgreSQL
SourceAddressPrefix = "10.1.2.0/24",
DestinationAddressPrefix = "10.1.3.0/24"
},
// Allow Azure Load Balancer
new SecurityRuleArgs
{
Name = "AllowAzureLoadBalancer",
Priority = 130,
Direction = "Inbound",
Access = "Allow",
Protocol = "*",
SourcePortRange = "*",
DestinationPortRange = "*",
SourceAddressPrefix = "AzureLoadBalancer",
DestinationAddressPrefix = "*"
},
// Deny all other inbound (Zero-trust)
new SecurityRuleArgs
{
Name = "DenyAllInbound",
Priority = 4096,
Direction = "Inbound",
Access = "Deny",
Protocol = "*",
SourcePortRange = "*",
DestinationPortRange = "*",
SourceAddressPrefix = "*",
DestinationAddressPrefix = "*"
}
},
Tags = tags
});
Private Endpoints (Staging):
// Private endpoint for SQL Database
var sqlPrivateEndpoint = new PrivateEndpoint("atp-sql-pe-staging-eus", new PrivateEndpointArgs
{
PrivateEndpointName = "atp-sql-pe-staging-eus",
ResourceGroupName = stagingResourceGroup.Name,
Location = "eastus",
Subnet = new SubnetArgs
{
Id = dataSubnet.Id // 10.1.3.0/24
},
PrivateLinkServiceConnections = new[]
{
new PrivateLinkServiceConnectionArgs
{
Name = "sql-connection",
PrivateLinkServiceId = sqlServer.Id,
GroupIds = new[] { "sqlServer" },
PrivateLinkServiceConnectionState = new PrivateLinkServiceConnectionStateArgs
{
Status = "Approved",
Description = "Auto-approved by Pulumi"
}
}
},
Tags = tags
});
// Private DNS Zone for SQL
var sqlPrivateDnsZone = new PrivateZone("privatelink-database-windows-net", new PrivateZoneArgs
{
PrivateZoneName = "privatelink.database.windows.net",
ResourceGroupName = stagingResourceGroup.Name,
Location = "global",
Tags = tags
});
// Link private DNS zone to VNet
var dnsZoneLink = new VirtualNetworkLink("sql-dns-link-staging", new VirtualNetworkLinkArgs
{
VirtualNetworkLinkName = "atp-vnet-staging-link",
ResourceGroupName = stagingResourceGroup.Name,
PrivateZoneName = sqlPrivateDnsZone.Name,
VirtualNetwork = new SubResourceArgs { Id = stagingVNet.Id },
RegistrationEnabled = false,
Location = "global",
Tags = tags
});
// DNS record for private endpoint
var dnsRecord = new RecordSet("sql-dns-record-staging", new RecordSetArgs
{
RecordSetName = sqlServer.Name,
ResourceGroupName = stagingResourceGroup.Name,
PrivateZoneName = sqlPrivateDnsZone.Name,
RecordType = "A",
Ttl = 3600,
ARecords = new[]
{
new ARecordArgs
{
Ipv4Address = sqlPrivateEndpoint.CustomDnsConfigs.Apply(configs => configs[0].IpAddresses[0])
}
}
});
Public Network Access (Staging):
- Disabled: All resources accessible only via private endpoints.
- Exception: Application Gateway has public IP for load testing access.
- Purpose: Validate production security posture; test private endpoint connectivity.
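Private endpoint enforcement can be spot-checked by resolving a data resource from inside the Staging VNet; with the private DNS zone linked, the name should resolve to an address in the Data subnet (resource group and server names follow the conventions above and are illustrative):

```shell
# From a jumpbox in the Staging VNet: expect a 10.1.3.x private IP, not a public one
nslookup atp-sql-staging-eus.database.windows.net

# Cross-check the IP allocated to the private endpoint NIC
az network private-endpoint show \
  --resource-group ATP-Staging-RG \
  --name atp-sql-pe-staging-eus \
  --query "customDnsConfigs[0].ipAddresses[0]" \
  --output tsv
```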
Production Environment Networking¶
Purpose: Maximum security with dedicated VNet, Azure Firewall, private endpoints only, and zero-trust network access.
VNet Configuration (Pulumi):
// Dedicated VNet for Production
var prodVNet = new VirtualNetwork("atp-vnet-prod-eus", new VirtualNetworkArgs
{
VirtualNetworkName = "atp-vnet-prod-eus",
ResourceGroupName = prodResourceGroup.Name,
Location = "eastus",
AddressSpace = new AddressSpaceArgs
{
AddressPrefixes = new[] { "10.2.0.0/16" }
},
Subnets = new[]
{
new SubnetArgs
{
Name = "Gateway-Subnet",
AddressPrefix = "10.2.1.0/24",
Delegations = new[]
{
new DelegationArgs
{
Name = "AppGatewayDelegation",
ServiceName = "Microsoft.Network/applicationGateways"
}
}
},
new SubnetArgs
{
Name = "AKS-Subnet",
AddressPrefix = "10.2.2.0/23", // /23 for AKS node scaling
ServiceEndpoints = new[]
{
new ServiceEndpointPropertiesFormatArgs { Service = "Microsoft.ContainerRegistry" },
new ServiceEndpointPropertiesFormatArgs { Service = "Microsoft.KeyVault" }
}
},
new SubnetArgs
{
Name = "Data-Subnet",
AddressPrefix = "10.2.3.0/24",
PrivateEndpointNetworkPolicies = "Disabled",
PrivateLinkServiceNetworkPolicies = "Enabled"
},
new SubnetArgs
{
Name = "AzureFirewallSubnet", // Must be named exactly this
AddressPrefix = "10.2.4.0/26"
}
},
Tags = tags
});
Network Security Group (Production AKS Subnet):
// Production NSG - Zero-trust (deny all by default)
var prodNsg = new NetworkSecurityGroup("atp-nsg-prod-aks-eus", new NetworkSecurityGroupArgs
{
NetworkSecurityGroupName = "atp-nsg-prod-aks-eus",
ResourceGroupName = prodResourceGroup.Name,
Location = "eastus",
SecurityRules = new[]
{
// Allow inbound HTTPS from Application Gateway only
new SecurityRuleArgs
{
Name = "AllowAppGatewayHTTPS",
Priority = 100,
Direction = "Inbound",
Access = "Allow",
Protocol = "Tcp",
SourcePortRange = "*",
DestinationPortRange = "443",
SourceAddressPrefix = "10.2.1.0/24", // App Gateway subnet
DestinationAddressPrefix = "10.2.2.0/23" // AKS subnet
},
// Allow AKS → Data subnet (private endpoints)
new SecurityRuleArgs
{
Name = "AllowAKSToData",
Priority = 110,
Direction = "Outbound",
Access = "Allow",
Protocol = "Tcp",
SourcePortRange = "*",
DestinationPortRanges = new[] { "1433", "443", "5432", "6379" },
SourceAddressPrefix = "10.2.2.0/23",
DestinationAddressPrefix = "10.2.3.0/24"
},
// Allow AKS → Azure services (via service endpoints)
new SecurityRuleArgs
{
Name = "AllowAKSToAzureServices",
Priority = 120,
Direction = "Outbound",
Access = "Allow",
Protocol = "Tcp",
SourcePortRange = "*",
DestinationPortRange = "443",
SourceAddressPrefix = "10.2.2.0/23",
DestinationAddressPrefixes = new[]
{
"AzureContainerRegistry",
"AzureKeyVault",
"AzureActiveDirectory"
}
},
// Allow AKS internal communication (Kubernetes API server)
new SecurityRuleArgs
{
Name = "AllowAKSInternal",
Priority = 130,
Direction = "Inbound",
Access = "Allow",
Protocol = "*",
SourcePortRange = "*",
DestinationPortRange = "*",
SourceAddressPrefix = "10.2.2.0/23",
DestinationAddressPrefix = "10.2.2.0/23"
},
// Deny all other traffic (Zero-trust)
new SecurityRuleArgs
{
Name = "DenyAllInbound",
Priority = 4096,
Direction = "Inbound",
Access = "Deny",
Protocol = "*",
SourcePortRange = "*",
DestinationPortRange = "*",
SourceAddressPrefix = "*",
DestinationAddressPrefix = "*"
},
// Deny all outbound except explicitly allowed
new SecurityRuleArgs
{
Name = "DenyAllOutbound",
Priority = 4096,
Direction = "Outbound",
Access = "Deny",
Protocol = "*",
SourcePortRange = "*",
DestinationPortRange = "*",
SourceAddressPrefix = "*",
DestinationAddressPrefix = "*"
}
},
Tags = tags
});
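Egress from the AKS subnet is additionally filtered through Azure Firewall with an approved FQDN allowlist; a sketch of one application rule using the az CLI, which requires the azure-firewall extension (firewall name, collection, and FQDNs are illustrative):

```shell
# Allow AKS nodes to pull images from Microsoft Container Registry only
az network firewall application-rule create \
  --resource-group ATP-Prod-RG \
  --firewall-name atp-firewall-prod-eus \
  --collection-name atp-approved-egress \
  --name AllowMcr \
  --protocols Https=443 \
  --source-addresses 10.2.2.0/23 \
  --target-fqdns mcr.microsoft.com "*.data.mcr.microsoft.com" \
  --action Allow \
  --priority 100
```

Traffic to any FQDN outside the allowlist is dropped, which is what the zero-trust outbound NSG rule above assumes.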
Summary¶
- Network Isolation: Shared VNet (Dev/Test) with subnet isolation, dedicated VNets (Staging/Production) with full segmentation.
- VNet Topology: Dev (10.0.1.0/24), Test (10.0.2.0/24), Staging (10.1.0.0/16), Production (10.2.0.0/16) with gateway, AKS, data, and firewall subnets.
- NSG Rules: Graduated from allow-developer-IPs (Dev) to deny-all-by-default (Production) with explicit allowlists.
- Private Endpoints: None (Dev/Test), all data resources (Staging/Production) with private DNS zones for name resolution.
- Public Access: Enabled with IP whitelisting (Dev/Test), disabled entirely (Staging/Production) except Application Gateway.
- Azure Firewall: Production egress filtering with approved FQDN allowlist and threat intelligence.
- Zero-Trust: Production enforces verify-explicitly, least-privilege, assume-breach with Istio mTLS and Kubernetes Network Policies.
- Monitoring: NSG flow logs, firewall logs, traffic analytics, and Azure Sentinel for security visibility.
Observability Per Environment¶
ATP implements graduated observability across environments, balancing developer debugging needs (verbose logs, high trace sampling) with production cost optimization (warning-level logs, adaptive sampling). Each environment has tailored telemetry levels, sampling rates, and retention policies aligned with its operational requirements and budget constraints.
This strategy ensures comprehensive visibility for troubleshooting in lower environments while maintaining cost-effective observability in Production with intelligent sampling and long-term cold storage for compliance.
Observability Strategy Overview¶
graph LR
subgraph Dev Environment
DevApp[App Service] --> DevSeq[Seq Container<br/>Debug logs<br/>100% traces<br/>7-day retention]
end
subgraph Test Environment
TestApp[App Service] --> TestSeq[Seq Container<br/>Info logs<br/>50% traces<br/>14-day retention]
end
subgraph Staging Environment
StagingApp[App Service] --> StagingOtel[OTel Collector<br/>Warning logs<br/>25% traces]
StagingOtel --> StagingLA[Log Analytics<br/>30-day hot]
StagingOtel --> StagingAI[Application Insights<br/>Adaptive sampling]
end
subgraph Production Environment
ProdPods[AKS Pods] --> ProdOtel[OTel Collector<br/>Warning logs<br/>10% traces]
ProdOtel --> ProdProm[Prometheus<br/>Metrics aggregation]
ProdOtel --> ProdLA[Log Analytics<br/>90-day hot]
ProdLA --> ProdBlob[Blob Storage<br/>7-year cold]
ProdOtel --> ProdAI[Application Insights<br/>Adaptive sampling]
ProdProm --> ProdGrafana[Grafana<br/>Dashboards + Alerts]
end
style DevSeq fill:#90EE90
style TestSeq fill:#FFD700
style StagingLA fill:#FFA500
style ProdGrafana fill:#FF6347
Telemetry Levels Per Environment¶
ATP uses graduated telemetry verbosity from verbose debugging (Dev) to optimized production observability.
| Environment | Log Level | Trace Sampling | Metric Collection | Retention (Hot) | Retention (Cold) |
|---|---|---|---|---|---|
| Preview | Debug | 100% | All metrics | Ephemeral (PR lifetime) | None |
| Dev | Debug | 100% | All metrics | 7 days | None |
| Test | Information | 50% | All metrics | 14 days | None |
| Staging | Warning | 25% (adaptive) | All metrics | 30 days | None |
| Production | Warning | 10% (adaptive) | All metrics | 90 days | 7 years (blob) |
| Hotfix | Warning | 10% (same as Prod) | All metrics | 90 days | 7 years |
Rationale:
- Dev (Debug, 100% sampling): Full visibility for rapid debugging; cost is minimal (low traffic).
- Test (Info, 50% sampling): Sufficient for test validation; reduce noise from automated tests.
- Staging (Warning, 25% sampling): Production-like telemetry; catch errors/warnings only; representative sampling.
- Production (Warning, 10% sampling): Cost-optimized; adaptive sampling adjusts based on traffic; long-term cold storage for compliance.
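The Production cold tier can be realized with a Log Analytics data export rule that streams selected tables to Blob Storage, where a lifecycle policy holds them for 7 years; a sketch using the az CLI (export rule and storage account names are illustrative):

```shell
# Continuously export audit-relevant tables to Blob Storage for long-term retention
az monitor log-analytics workspace data-export create \
  --resource-group ATP-Prod-RG \
  --workspace-name atp-loganalytics-prod-eus \
  --name atp-coldexport-prod \
  --tables AppTraces AppRequests \
  --destination /subscriptions/<sub-id>/resourceGroups/ATP-Prod-RG/providers/Microsoft.Storage/storageAccounts/atpauditcoldprodeus
```

Data export keeps the Log Analytics hot window at 90 days while the blob copy carries the compliance retention.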
Dev Environment Observability¶
Purpose: Maximum visibility for developers with verbose logging, 100% trace sampling, and immediate feedback.
Log Level: Debug (most verbose)
Telemetry Configuration (appsettings.Development.json):
{
"Logging": {
"LogLevel": {
"Default": "Debug",
"Microsoft": "Information",
"Microsoft.Hosting.Lifetime": "Information",
"System": "Information"
}
},
"Seq": {
"ServerUrl": "http://localhost:5341",
"ApiKey": "", // No auth in Dev
"MinimumLevel": "Trace",
"LevelOverride": {
"Microsoft": "Warning",
"System": "Warning"
}
},
"OpenTelemetry": {
"ServiceName": "atp-ingestion-dev",
"ServiceVersion": "1.0.0-dev",
"ExporterEndpoint": "http://localhost:4317", // Local OTel collector (optional)
"TracingSampler": {
"Type": "AlwaysOn", // 100% sampling
"Probability": 1.0
},
"MetricsExportInterval": 10 // 10 seconds (frequent for dev)
},
"ApplicationInsights": {
"ConnectionString": "", // Disabled in Dev (use Seq instead)
"EnableAdaptiveSampling": false,
"EnableDependencyTracking": true,
"EnablePerformanceCounterCollectionModule": true
}
}
Seq Container (Docker Compose for Dev):
# docker-compose.dev.yml
version: '3.8'
services:
seq:
image: datalust/seq:latest
container_name: atp-seq-dev
ports:
- "5341:80"
environment:
ACCEPT_EULA: "Y"
SEQ_FIRSTRUN_ADMINUSERNAME: admin
SEQ_FIRSTRUN_ADMINPASSWORDHASH: <bcrypt-hash> # Change in production
volumes:
- seq-data:/data
restart: unless-stopped
otel-collector:
image: otel/opentelemetry-collector:0.97.0
container_name: atp-otel-dev
command: ["--config=/etc/otel/config.yaml"]
ports:
- "4317:4317" # OTLP gRPC
- "4318:4318" # OTLP HTTP
- "8888:8888" # Prometheus metrics (collector itself)
- "13133:13133" # Health check
volumes:
- ./otel-config-dev.yaml:/etc/otel/config.yaml
restart: unless-stopped
volumes:
seq-data:
OpenTelemetry Collector Configuration (Dev):
# otel-config-dev.yaml
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
processors:
batch:
timeout: 10s
send_batch_size: 1024
# No sampling in Dev (100%)
attributes:
actions:
- key: environment
value: dev
action: insert
exporters:
logging:
loglevel: debug # Console output for dev debugging
# Export to Seq (optional, for centralized logs)
otlphttp/seq:
endpoint: http://seq:80/ingest/otlp
headers:
X-Seq-ApiKey: ""
service:
pipelines:
traces:
receivers: [otlp]
processors: [batch, attributes]
exporters: [logging]
metrics:
receivers: [otlp]
processors: [batch, attributes]
exporters: [logging]
logs:
receivers: [otlp]
processors: [batch, attributes]
exporters: [logging, otlphttp/seq]
C# Logging Configuration (Dev):
// Program.cs - Dev logging setup
public static IHostBuilder CreateHostBuilder(string[] args) =>
Host.CreateDefaultBuilder(args)
.ConfigureLogging((context, logging) =>
{
var env = context.HostingEnvironment;
logging.ClearProviders();
logging.AddConsole(); // Console output for local debugging
if (env.IsDevelopment())
{
logging.SetMinimumLevel(LogLevel.Debug);
// Add Seq for structured logging
logging.AddSeq(context.Configuration.GetSection("Seq"));
}
})
.ConfigureServices((context, services) =>
{
// OpenTelemetry instrumentation
services.AddOpenTelemetry()
.WithTracing(builder =>
{
builder
.SetResourceBuilder(ResourceBuilder.CreateDefault()
.AddService("atp-ingestion-dev", "1.0.0-dev"))
.AddAspNetCoreInstrumentation(options =>
{
options.RecordException = true;
options.Filter = (httpContext) => true; // Capture all requests
})
.AddHttpClientInstrumentation()
.AddSqlClientInstrumentation(options =>
{
options.SetDbStatementForText = true; // Include SQL in traces
options.RecordException = true;
})
.AddOtlpExporter(options =>
{
options.Endpoint = new Uri("http://localhost:4317");
options.Protocol = OtlpExportProtocol.Grpc;
});
})
.WithMetrics(builder =>
{
builder
.SetResourceBuilder(ResourceBuilder.CreateDefault()
.AddService("atp-ingestion-dev"))
.AddAspNetCoreInstrumentation()
.AddHttpClientInstrumentation()
.AddRuntimeInstrumentation()
.AddProcessInstrumentation()
.AddOtlpExporter((exporterOptions, metricReaderOptions) =>
{
exporterOptions.Endpoint = new Uri("http://localhost:4317");
exporterOptions.Protocol = OtlpExportProtocol.Grpc;
// The metric export interval lives on the reader, not the exporter
metricReaderOptions.PeriodicExportingMetricReaderOptions.ExportIntervalMilliseconds = 10000; // 10 seconds
});
});
})
.ConfigureWebHostDefaults(webBuilder =>
{
webBuilder.UseStartup<Startup>();
});
Dev Observability Benefits:
- Instant Feedback: Console logs + Seq UI for real-time log viewing.
- 100% Traces: No sampling; every request traced for debugging.
- SQL Query Visibility: Full SQL statements in traces for query optimization.
- No Cost Constraints: Local containers; unlimited logs/traces.
Test Environment Observability¶
Purpose: Balanced visibility for QA validation with Info-level logs and 50% sampling to reduce test automation noise.
Log Level: Information
Telemetry Configuration (appsettings.Test.json):
{
"Logging": {
"LogLevel": {
"Default": "Information",
"Microsoft": "Warning",
"Microsoft.Hosting.Lifetime": "Information",
"System": "Warning"
}
},
"Seq": {
"ServerUrl": "http://seq-test.connectsoft.local:5341",
"ApiKey": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-test-eus.vault.azure.net/secrets/SeqApiKey)",
"MinimumLevel": "Information",
"LevelOverride": {
"Microsoft": "Warning",
"System": "Warning"
}
},
"OpenTelemetry": {
"ServiceName": "atp-ingestion-test",
"ServiceVersion": "1.0.0-test",
"ExporterEndpoint": "http://otel-collector-test.connectsoft.local:4317",
"TracingSampler": {
"Type": "TraceIdRatioBased", // Consistent sampling
"Probability": 0.5 // 50% sampling
},
"MetricsExportInterval": 30 // 30 seconds
},
"ApplicationInsights": {
"ConnectionString": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-test-eus.vault.azure.net/secrets/AppInsightsConnectionString)",
"EnableAdaptiveSampling": false,
"SamplingPercentage": 50,
"EnableDependencyTracking": true,
"EnablePerformanceCounterCollectionModule": true
}
}
OpenTelemetry Collector Configuration (Test):
# otel-config-test.yaml
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
processors:
batch:
timeout: 30s
send_batch_size: 2048
# 50% sampling for Test
probabilistic_sampler:
sampling_percentage: 50
attributes:
actions:
- key: environment
value: test
action: insert
# Filter out health check traces
filter:
traces:
span:
- 'attributes["http.target"] == "/health"'
- 'attributes["http.target"] == "/ready"'
exporters:
logging:
loglevel: info
otlphttp/seq:
endpoint: http://seq-test:80/ingest/otlp
headers:
X-Seq-ApiKey: ${SEQ_API_KEY}
# Optional: Export to Application Insights
azuremonitor:
connection_string: ${APPLICATIONINSIGHTS_CONNECTION_STRING}
service:
pipelines:
traces:
receivers: [otlp]
processors: [batch, probabilistic_sampler, attributes, filter]
exporters: [logging, azuremonitor]
metrics:
receivers: [otlp]
processors: [batch, attributes]
exporters: [logging, azuremonitor]
logs:
receivers: [otlp]
processors: [batch, attributes]
exporters: [logging, otlphttp/seq]
Test Observability Benefits:
- QA Validation: Info-level logs sufficient for test pass/fail analysis.
- 50% Sampling: Reduce telemetry volume from automated test runs.
- Seq Integration: Centralized logs for test result analysis.
- Application Insights: Optional export for trend analysis.
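The `TraceIdRatioBased` sampler keeps a fixed fraction of traces by deriving the keep/drop decision from the trace id itself, so every service that sees the same trace makes the same choice without coordination. A minimal Python sketch of the idea (`should_sample` is a hypothetical helper; the real OTel SDK hashes the trace id differently but has the same consistency property):

```python
import hashlib

def should_sample(trace_id: str, probability: float) -> bool:
    """Deterministic ratio sampling: map the trace id onto [0, 1) with a
    stable hash, so the same id always yields the same decision."""
    h = int(hashlib.sha256(trace_id.encode()).hexdigest()[:16], 16)
    return h / 2**64 < probability

# At the Test tier's 0.5 ratio, roughly half of all trace ids are kept.
trace_ids = [f"{i:032x}" for i in range(10_000)]
kept = sum(should_sample(t, 0.5) for t in trace_ids)
```

Because the decision is a pure function of the trace id, downstream services need no shared state to agree on sampling.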
Staging Environment Observability¶
Purpose: Production-like observability with Warning-level logs, adaptive sampling (25%), and Azure Log Analytics for production validation.
Log Level: Warning
Telemetry Configuration (appsettings.Staging.json):
{
"Logging": {
"LogLevel": {
"Default": "Warning",
"Microsoft": "Error",
"Microsoft.Hosting.Lifetime": "Information",
"System": "Error"
},
"ApplicationInsights": {
"LogLevel": {
"Default": "Warning",
"Microsoft": "Error"
}
}
},
"OpenTelemetry": {
"ServiceName": "atp-ingestion-staging",
"ServiceVersion": "${BUILD_VERSION}",
"ExporterEndpoint": "http://otel-collector-staging.connectsoft.local:4317",
"TracingSampler": {
"Type": "ParentBased", // Respect upstream sampling decisions
"RootSampler": {
"Type": "TraceIdRatioBased",
"Probability": 0.25 // 25% base sampling
}
},
"MetricsExportInterval": 60 // 60 seconds
},
"ApplicationInsights": {
"ConnectionString": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-staging-eus.vault.azure.net/secrets/AppInsightsConnectionString)",
"EnableAdaptiveSampling": true,
"SamplingSettings": {
"IsEnabled": true,
"MaxTelemetryItemsPerSecond": 10,
"EvaluationInterval": "00:00:15",
"AdaptiveSamplingSettings": {
"MaxTelemetryItemsPerSecond": 10,
"InitialSamplingPercentage": 25,
"MinSamplingPercentage": 10,
"MaxSamplingPercentage": 50,
"MovingAverageRatio": 0.25
}
},
"EnableDependencyTracking": true,
"EnablePerformanceCounterCollectionModule": true,
"EnableEventCounterCollectionModule": true
}
}
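Adaptive sampling keeps telemetry near a target rate (`MaxTelemetryItemsPerSecond`) by raising or lowering the sampling percentage within the configured bounds. A simplified sketch of one feedback step (`next_sampling_pct` is a hypothetical helper, not Application Insights' exact algorithm):

```python
def next_sampling_pct(pct: float, items_per_sec: float,
                      target: float = 10, min_pct: float = 10,
                      max_pct: float = 50) -> float:
    """One adjustment step: sample less when over the target rate,
    sample more when comfortably under it, clamped to the bounds."""
    if items_per_sec > target:
        pct = pct * target / items_per_sec   # scale down proportionally
    elif items_per_sec < 0.8 * target:
        pct = pct * 1.25                     # cautiously scale back up
    return max(min_pct, min(max_pct, pct))
```

With the Staging bounds (10-50%), a burst to 40 items/s drives the rate to the 10% floor, while quiet periods drift it back toward the 50% ceiling.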
Azure Log Analytics Workspace (Pulumi):
// Staging Log Analytics Workspace
var stagingLogAnalytics = new Workspace("atp-loganalytics-staging-eus", new WorkspaceArgs
{
WorkspaceName = "atp-loganalytics-staging-eus",
ResourceGroupName = stagingResourceGroup.Name,
Location = "eastus",
Sku = new WorkspaceSkuArgs
{
Name = "PerGB2018" // Pay-per-GB pricing
},
RetentionInDays = 30, // 30-day hot retention
PublicNetworkAccessForIngestion = "Enabled",
PublicNetworkAccessForQuery = "Enabled",
Tags = tags
});
// Application Insights for Staging
var stagingAppInsights = new Component("atp-appinsights-staging-eus", new ComponentArgs
{
ResourceName = "atp-appinsights-staging-eus",
ResourceGroupName = stagingResourceGroup.Name,
Location = "eastus",
ApplicationType = "web",
Kind = "web",
WorkspaceResourceId = stagingLogAnalytics.Id,
RetentionInDays = 30,
SamplingPercentage = 25, // 25% sampling
DisableIpMasking = false, // Mask IP addresses for privacy
Tags = tags
});
OpenTelemetry Collector Configuration (Staging):
# otel-config-staging.yaml
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
processors:
batch:
timeout: 60s
send_batch_size: 4096
# Adaptive sampling (25% base)
probabilistic_sampler:
sampling_percentage: 25
attributes:
actions:
- key: environment
value: staging
action: insert
- key: deployment.environment
value: staging
action: insert
# Filter health checks
filter:
traces:
span:
- 'attributes["http.target"] == "/health"'
- 'attributes["http.target"] == "/ready"'
- 'attributes["http.target"] == "/metrics"'
# Resource detection (cloud environment)
resourcedetection:
detectors: [env, azure]
timeout: 5s
exporters:
logging:
loglevel: warn # Only warnings/errors to collector logs
azuremonitor:
connection_string: ${APPLICATIONINSIGHTS_CONNECTION_STRING}
maxbatchsize: 1024
maxbatchinterval: 10s
# Export to Log Analytics workspace
azureloganalytics:
workspace_id: ${LOG_ANALYTICS_WORKSPACE_ID}
workspace_key: ${LOG_ANALYTICS_WORKSPACE_KEY}
resource_type: "Custom-OpenTelemetry"
service:
pipelines:
traces:
receivers: [otlp]
processors: [batch, probabilistic_sampler, resourcedetection, attributes, filter]
exporters: [azuremonitor]
metrics:
receivers: [otlp]
processors: [batch, resourcedetection, attributes]
exporters: [azuremonitor]
logs:
receivers: [otlp]
processors: [batch, resourcedetection, attributes]
exporters: [azureloganalytics, azuremonitor]
Staging Observability Benefits:
- Production-Like: Same Warning log level as Production, with 25% sampling bridging Test (50%) and Production (10%) for realistic testing.
- Azure Log Analytics: Centralized log aggregation with KQL queries.
- Application Insights: APM for performance profiling and dependency tracking.
- Adaptive Sampling: Automatically adjusts based on traffic volume.
- 30-Day Retention: Sufficient for staging validation and troubleshooting.
Production Environment Observability¶
Purpose: Cost-optimized observability with Warning-level logs, 10% adaptive sampling, Prometheus/Grafana for metrics, and long-term cold storage for compliance.
Log Level: Warning
Telemetry Configuration (appsettings.Production.json):
{
"Logging": {
"LogLevel": {
"Default": "Warning",
"Microsoft": "Error",
"Microsoft.Hosting.Lifetime": "Warning",
"System": "Error",
"ConnectSoft.ATP": "Warning"
},
"ApplicationInsights": {
"LogLevel": {
"Default": "Warning",
"Microsoft": "Error"
}
}
},
"OpenTelemetry": {
"ServiceName": "atp-ingestion-prod",
"ServiceVersion": "${BUILD_VERSION}",
"ServiceInstanceId": "${HOSTNAME}", // Pod name in AKS
"ExporterEndpoint": "http://otel-collector.atp-prod.svc.cluster.local:4317",
"TracingSampler": {
"Type": "ParentBased",
"RootSampler": {
"Type": "TraceIdRatioBased",
"Probability": 0.1 // 10% base sampling
}
},
"MetricsExportInterval": 60, // 60 seconds
"Attributes": {
"deployment.environment": "production",
"service.namespace": "atp",
"k8s.cluster.name": "atp-aks-prod-eus",
"k8s.namespace.name": "atp-prod"
}
},
"ApplicationInsights": {
"ConnectionString": "@Microsoft.KeyVault(SecretUri=https://atp-keyvault-prod-eus.vault.azure.net/secrets/AppInsightsConnectionString)",
"EnableAdaptiveSampling": true,
"SamplingSettings": {
"IsEnabled": true,
"MaxTelemetryItemsPerSecond": 5,
"EvaluationInterval": "00:00:15",
"AdaptiveSamplingSettings": {
"MaxTelemetryItemsPerSecond": 5,
"InitialSamplingPercentage": 10,
"MinSamplingPercentage": 5,
"MaxSamplingPercentage": 25,
"MovingAverageRatio": 0.25
},
"ExcludedTypes": "Event,Trace", // Only sample requests/dependencies
"IncludedTypes": "Request,Dependency,Exception"
},
"EnableDependencyTracking": true,
"EnablePerformanceCounterCollectionModule": false, // Use OTel metrics instead
"EnableEventCounterCollectionModule": true,
"EnableHeartbeat": true,
"HeartbeatInterval": "00:15:00", // 15 minutes
"DisableIpMasking": false, // Mask PII
"EnableAuthenticationTrackingJavaScript": false // Privacy
},
"Prometheus": {
"Enabled": true,
"Port": 9090,
"Path": "/metrics",
"ScrapeInterval": 15 // 15 seconds
}
}
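`ParentBased` with a `TraceIdRatioBased` root means only root spans consult the ratio sampler; child spans inherit the caller's decision, so a trace is never sampled "half in". A compact sketch (hypothetical helper names; the OTel SDK hashes trace ids differently):

```python
import hashlib
from typing import Optional

def ratio_sample(trace_id: str, probability: float) -> bool:
    # Stable hash of the trace id mapped onto [0, 1).
    h = int(hashlib.sha256(trace_id.encode()).hexdigest()[:16], 16)
    return h / 2**64 < probability

def parent_based_sample(trace_id: str, parent_sampled: Optional[bool],
                        root_probability: float = 0.1) -> bool:
    """Honor the parent's decision when there is one; otherwise the
    root sampler (10% in Production) decides for the whole trace."""
    if parent_sampled is not None:
        return parent_sampled
    return ratio_sample(trace_id, root_probability)
```

The upshot: the 10% probability is paid once per trace at the root, and every downstream span follows it.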
Azure Log Analytics Workspace (Production):
// Production Log Analytics Workspace (90-day hot + long-term cold)
var prodLogAnalytics = new Workspace("atp-loganalytics-prod-eus", new WorkspaceArgs
{
WorkspaceName = "atp-loganalytics-prod-eus",
ResourceGroupName = prodResourceGroup.Name,
Location = "eastus",
    Sku = new WorkspaceSkuArgs
    {
        Name = "CapacityReservation", // Commitment-tier pricing
        CapacityReservationLevel = 100 // 100 GB/day commitment
    },
RetentionInDays = 90, // 90-day hot retention
PublicNetworkAccessForIngestion = "Disabled", // Private endpoint only
PublicNetworkAccessForQuery = "Disabled",
Tags = tags
});
// Data Export to Blob Storage (7-year cold retention)
var dataExportRule = new DataExport("atp-loganalytics-export-prod", new DataExportArgs
{
DataExportName = "export-to-blob",
ResourceGroupName = prodResourceGroup.Name,
WorkspaceName = prodLogAnalytics.Name,
TableNames = new[]
{
"AppTraces",
"AppRequests",
"AppDependencies",
"AppExceptions",
"AppMetrics"
},
Destination = new DestinationArgs
{
ResourceId = coldStorageAccount.Id,
MetaData = new DestinationMetaDataArgs
{
EventHubName = "" // Export to Storage Account, not Event Hub
}
},
Enable = true
});
// Application Insights (Production)
var prodAppInsights = new Component("atp-appinsights-prod-eus", new ComponentArgs
{
ResourceName = "atp-appinsights-prod-eus",
ResourceGroupName = prodResourceGroup.Name,
Location = "eastus",
ApplicationType = "web",
Kind = "web",
WorkspaceResourceId = prodLogAnalytics.Id,
RetentionInDays = 90,
SamplingPercentage = 10, // 10% sampling
    DisableIpMasking = false, // Mask IP addresses for GDPR
DisableLocalAuth = false, // Allow instrumentation key (legacy)
IngestionMode = "LogAnalytics", // Route to Log Analytics workspace
PublicNetworkAccessForIngestion = "Disabled", // Private endpoint
PublicNetworkAccessForQuery = "Disabled",
Tags = tags
});
OpenTelemetry Collector Configuration (Production AKS):
# otel-collector-config-prod.yaml (Kubernetes ConfigMap)
apiVersion: v1
kind: ConfigMap
metadata:
name: otel-collector-config
namespace: atp-prod
data:
otel-config.yaml: |
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
# Prometheus scrape for service metrics
prometheus:
config:
scrape_configs:
- job_name: 'atp-services'
kubernetes_sd_configs:
- role: pod
namespaces:
names: [atp-prod]
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
target_label: __address__
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
processors:
batch:
timeout: 60s
send_batch_size: 8192
# Adaptive sampling (10% base, adjust based on traffic)
probabilistic_sampler:
hash_seed: 42
sampling_percentage: 10
# Tail sampling (keep all errors, sample successes)
tail_sampling:
policies:
- name: errors-policy
type: status_code
status_code: {status_codes: [ERROR]}
- name: slow-requests-policy
type: latency
latency: {threshold_ms: 2000} # Keep requests >2s
- name: random-ok-policy
type: probabilistic
probabilistic: {sampling_percentage: 5} # 5% of successful requests
attributes:
actions:
- key: environment
value: production
action: insert
- key: deployment.environment
value: production
action: insert
- key: k8s.cluster.name
value: atp-aks-prod-eus
action: insert
# Filter health checks and metrics endpoints
filter:
traces:
span:
- 'attributes["http.target"] == "/health"'
- 'attributes["http.target"] == "/ready"'
- 'attributes["http.target"] == "/metrics"'
- 'attributes["http.target"] == "/livez"'
# Resource detection (Kubernetes, Azure)
resourcedetection:
detectors: [env, system, docker, azure, aks]
timeout: 10s
override: true
# Add Kubernetes metadata
k8sattributes:
auth_type: "serviceAccount"
passthrough: false
extract:
metadata:
- k8s.namespace.name
- k8s.deployment.name
- k8s.pod.name
- k8s.pod.uid
- k8s.node.name
labels:
- tag_name: app
key: app
from: pod
- tag_name: version
key: version
from: pod
exporters:
logging:
loglevel: error # Only errors in collector logs (reduce noise)
# Export to Application Insights
azuremonitor:
connection_string: ${APPLICATIONINSIGHTS_CONNECTION_STRING}
maxbatchsize: 2048
maxbatchinterval: 30s
# Export to Prometheus (for Grafana)
prometheusremotewrite:
endpoint: http://prometheus-server.atp-prod.svc.cluster.local:9090/api/v1/write
tls:
insecure: true
retry_on_failure:
enabled: true
initial_interval: 5s
max_interval: 30s
# Export logs to Log Analytics
azureloganalytics:
workspace_id: ${LOG_ANALYTICS_WORKSPACE_ID}
workspace_key: ${LOG_ANALYTICS_WORKSPACE_KEY}
resource_type: "Custom-OpenTelemetry-Prod"
    # Extensions referenced by service.extensions below
    extensions:
      health_check:
        endpoint: 0.0.0.0:13133
      pprof: {}
      zpages: {}
    service:
pipelines:
traces:
receivers: [otlp]
processors: [batch, tail_sampling, k8sattributes, resourcedetection, attributes, filter]
exporters: [azuremonitor]
metrics:
receivers: [otlp, prometheus]
processors: [batch, k8sattributes, resourcedetection, attributes]
exporters: [azuremonitor, prometheusremotewrite]
logs:
receivers: [otlp]
processors: [batch, k8sattributes, resourcedetection, attributes]
exporters: [azureloganalytics, azuremonitor]
extensions: [health_check, pprof, zpages]
telemetry:
logs:
level: warn
metrics:
level: detailed
address: 0.0.0.0:8888
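The tail-sampling policies above decide per completed trace: keep every error, keep anything slower than 2 s, and keep about 5% of ordinary successes. A sketch of that policy evaluation (`keep_trace` is a hypothetical helper; the collector's real implementation also buffers spans until the trace completes):

```python
import random

def keep_trace(spans: list[dict], latency_threshold_ms: float = 2000,
               ok_sample_rate: float = 0.05,
               rng: random.Random = random.Random(42)) -> bool:
    """Evaluate the errors / slow-requests / random-ok policies in order."""
    if any(s.get("status") == "ERROR" for s in spans):
        return True                                   # errors-policy
    if max(s["duration_ms"] for s in spans) > latency_threshold_ms:
        return True                                   # slow-requests-policy
    return rng.random() < ok_sample_rate              # random-ok-policy
```

This is why tail sampling is worth the buffering cost: failed and slow traces are retained at 100% even though overall volume drops sharply.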
OTel Collector Deployment (Kubernetes):
# otel-collector-deployment-prod.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: otel-collector
namespace: atp-prod
spec:
replicas: 3 # High availability
selector:
matchLabels:
app: otel-collector
template:
metadata:
labels:
app: otel-collector
version: 0.97.0
spec:
serviceAccountName: otel-collector
containers:
- name: otel-collector
image: otel/opentelemetry-collector-contrib:0.97.0
        args: ["--config=/etc/otel/otel-config.yaml"] # matches the ConfigMap key mounted at /etc/otel
ports:
- containerPort: 4317 # OTLP gRPC
name: otlp-grpc
- containerPort: 4318 # OTLP HTTP
name: otlp-http
- containerPort: 8888 # Prometheus metrics (collector itself)
name: metrics
- containerPort: 13133 # Health check
name: health
env:
- name: APPLICATIONINSIGHTS_CONNECTION_STRING
valueFrom:
secretKeyRef:
name: atp-secrets
key: applicationInsightsConnectionString
- name: LOG_ANALYTICS_WORKSPACE_ID
valueFrom:
secretKeyRef:
name: atp-secrets
key: logAnalyticsWorkspaceId
- name: LOG_ANALYTICS_WORKSPACE_KEY
valueFrom:
secretKeyRef:
name: atp-secrets
key: logAnalyticsWorkspaceKey
volumeMounts:
- name: otel-config
mountPath: /etc/otel
readOnly: true
resources:
requests:
memory: "512Mi"
cpu: "250m"
limits:
memory: "2Gi"
cpu: "1000m"
livenessProbe:
httpGet:
path: /
port: 13133
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /
port: 13133
initialDelaySeconds: 10
periodSeconds: 5
volumes:
- name: otel-config
configMap:
name: otel-collector-config
---
apiVersion: v1
kind: Service
metadata:
name: otel-collector
namespace: atp-prod
spec:
selector:
app: otel-collector
ports:
- name: otlp-grpc
port: 4317
targetPort: 4317
protocol: TCP
- name: otlp-http
port: 4318
targetPort: 4318
protocol: TCP
- name: metrics
port: 8888
targetPort: 8888
protocol: TCP
type: ClusterIP
Prometheus & Grafana (Production)¶
Prometheus Server (Kubernetes):
# prometheus-deployment-prod.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-config
namespace: atp-prod
data:
prometheus.yml: |
global:
scrape_interval: 15s
evaluation_interval: 15s
external_labels:
cluster: atp-aks-prod-eus
environment: production
# Alertmanager configuration
alerting:
alertmanagers:
- static_configs:
- targets: ['alertmanager:9093']
# Scrape configurations
scrape_configs:
# OTel Collector metrics
- job_name: 'otel-collector'
kubernetes_sd_configs:
- role: service
namespaces:
names: [atp-prod]
relabel_configs:
- source_labels: [__meta_kubernetes_service_name]
action: keep
regex: otel-collector
# ATP microservices (Prometheus endpoint)
- job_name: 'atp-services'
kubernetes_sd_configs:
- role: pod
namespaces:
names: [atp-prod]
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
target_label: __address__
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
# Kubernetes node metrics
- job_name: 'kubernetes-nodes'
kubernetes_sd_configs:
- role: node
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
# Kubernetes pod metrics
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
namespaces:
names: [atp-prod]
# Alerting rules
rule_files:
- /etc/prometheus/alerts/*.yml
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: prometheus-server
namespace: atp-prod
spec:
serviceName: prometheus-server
replicas: 1
selector:
matchLabels:
app: prometheus-server
template:
metadata:
labels:
app: prometheus-server
spec:
serviceAccountName: prometheus
containers:
- name: prometheus
image: prom/prometheus:v2.48.0
args:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--storage.tsdb.retention.time=90d' # 90-day retention
- '--storage.tsdb.retention.size=100GB'
- '--web.enable-lifecycle'
ports:
- containerPort: 9090
name: web
volumeMounts:
- name: prometheus-config
mountPath: /etc/prometheus
readOnly: true
- name: prometheus-storage
mountPath: /prometheus
resources:
requests:
memory: "4Gi"
cpu: "1000m"
limits:
memory: "8Gi"
cpu: "2000m"
volumes:
- name: prometheus-config
configMap:
name: prometheus-config
volumeClaimTemplates:
- metadata:
name: prometheus-storage
spec:
accessModes: ["ReadWriteOnce"]
storageClassName: "managed-premium"
resources:
requests:
storage: 100Gi
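The `__address__` relabel rule used in both scrape configs splits the joined `address;port-annotation` string and rewrites the scrape target to the annotated port. Its regex can be exercised directly (Prometheus applies RE2 with full-string anchoring, which Python's `fullmatch` approximates for this pattern):

```python
import re

# ([^:]+)(?::\d+)?;(\d+)  ->  $1:$2
pattern = re.compile(r"([^:]+)(?::\d+)?;(\d+)")

def rewrite_address(source: str) -> str:
    """Rebuild __address__ as host:annotated-port."""
    m = pattern.fullmatch(source)
    return f"{m.group(1)}:{m.group(2)}"

# Pod 10.0.1.7 serves on 8080 but annotates prometheus.io/port=9090:
rewrite_address("10.0.1.7:8080;9090")  # -> "10.0.1.7:9090"
rewrite_address("10.0.1.7;9090")       # original port part is optional
```

The optional `(?::\d+)?` group is what lets the rule handle pods whose discovered address has no port.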
Grafana Deployment (Kubernetes):
# grafana-deployment-prod.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: grafana
namespace: atp-prod
spec:
replicas: 2
selector:
matchLabels:
app: grafana
template:
metadata:
labels:
app: grafana
spec:
containers:
- name: grafana
image: grafana/grafana:10.2.0
ports:
- containerPort: 3000
name: web
env:
- name: GF_SERVER_ROOT_URL
value: "https://grafana.atp.connectsoft.com"
- name: GF_AUTH_GENERIC_OAUTH_ENABLED
value: "true"
- name: GF_AUTH_GENERIC_OAUTH_CLIENT_ID
valueFrom:
secretKeyRef:
name: grafana-secrets
key: oauth-client-id
- name: GF_AUTH_GENERIC_OAUTH_CLIENT_SECRET
valueFrom:
secretKeyRef:
name: grafana-secrets
key: oauth-client-secret
- name: GF_DATABASE_TYPE
value: "postgres"
- name: GF_DATABASE_HOST
valueFrom:
secretKeyRef:
name: grafana-secrets
key: db-host
volumeMounts:
- name: grafana-storage
mountPath: /var/lib/grafana
- name: grafana-datasources
mountPath: /etc/grafana/provisioning/datasources
- name: grafana-dashboards
mountPath: /etc/grafana/provisioning/dashboards
resources:
requests:
memory: "512Mi"
cpu: "250m"
limits:
memory: "2Gi"
cpu: "1000m"
volumes:
- name: grafana-storage
persistentVolumeClaim:
claimName: grafana-pvc
- name: grafana-datasources
configMap:
name: grafana-datasources
- name: grafana-dashboards
configMap:
name: grafana-dashboards
Grafana Datasources (ConfigMap):
# grafana-datasources-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: grafana-datasources
namespace: atp-prod
data:
datasources.yaml: |
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
access: proxy
url: http://prometheus-server:9090
isDefault: true
editable: false
jsonData:
timeInterval: "15s"
- name: Azure Log Analytics
type: grafana-azure-monitor-datasource
access: proxy
jsonData:
subscriptionId: ${AZURE_SUBSCRIPTION_ID}
tenantId: ${AZURE_TENANT_ID}
clientId: ${AZURE_CLIENT_ID}
cloudName: azuremonitor
azureLogAnalyticsSameAs: true
logAnalyticsDefaultWorkspace: ${LOG_ANALYTICS_WORKSPACE_ID}
secureJsonData:
clientSecret: ${AZURE_CLIENT_SECRET}
- name: Application Insights
type: grafana-azure-monitor-datasource
access: proxy
jsonData:
subscriptionId: ${AZURE_SUBSCRIPTION_ID}
tenantId: ${AZURE_TENANT_ID}
clientId: ${AZURE_CLIENT_ID}
cloudName: azuremonitor
appInsightsAppId: ${APPLICATIONINSIGHTS_APP_ID}
secureJsonData:
appInsightsApiKey: ${APPLICATIONINSIGHTS_API_KEY}
Log Aggregation & Retention¶
Dev/Test: Seq Containers¶
Purpose: Local log viewing with structured logging and ephemeral retention.
- Dev: Local Docker Compose; 7-day retention.
- Test: Shared Seq instance; 14-day retention; API key authentication.
Staging: Azure Log Analytics¶
Purpose: Production-like log aggregation with KQL queries and 30-day retention.
- Log Analytics Workspace: Centralized log store.
- Retention: 30 days (hot storage).
- Cost: ~$2.30/GB ingested.
Production: Multi-Tier Retention¶
Purpose: Cost-optimized log retention with 90-day hot storage and 7-year cold storage for compliance.
Retention Tiers:
Hot Storage (Log Analytics): 90 days
↓ (automated export)
Cold Storage (Blob - Cool tier): 7 years
↓ (automated lifecycle policy)
Archive Storage (Blob - Archive tier): Indefinite (compliance hold)
Blob Storage Lifecycle Policy:
{
"rules": [
{
"enabled": true,
"name": "MoveLogsToCoolAfter90Days",
"type": "Lifecycle",
"definition": {
"filters": {
"blobTypes": ["blockBlob"],
"prefixMatch": ["logs/"]
},
"actions": {
"baseBlob": {
"tierToCool": {
"daysAfterModificationGreaterThan": 90
},
"tierToArchive": {
"daysAfterModificationGreaterThan": 2555 // 7 years
}
}
}
}
}
]
}
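The lifecycle policy above reads as a simple age-to-tier function: Hot until the blob is more than 90 days past modification, Cool until 2,555 days (7 years), then Archive. A sketch of that mapping (hypothetical helper; real transitions run on the policy's periodic evaluation, not instantly):

```python
def tier_for_age(days_since_modification: int) -> str:
    """Mirror the lifecycle rules: Hot for 90 days, Cool until ~7 years
    (2,555 days), then Archive under compliance hold."""
    if days_since_modification <= 90:
        return "Hot"
    if days_since_modification <= 2555:
        return "Cool"
    return "Archive"
```

Note the `daysAfterModificationGreaterThan` semantics: a blob exactly 90 days old has not yet crossed the threshold, so it stays Hot.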
Cold Storage Cost (Production):
Hot (Log Analytics): ~$2.30/GB/month (90 days)
Cool (Blob): ~$0.01/GB/month (7 years)
Archive (Blob): ~$0.002/GB/month (indefinite)
Example: 10 GB/day logs
- Hot: 900 GB × $2.30 = $2,070/month
- Cool: 25,550 GB × $0.01 ≈ $256/month (10 GB/day × 365 × 7 years)
- Archive: grows indefinitely at $0.002/GB/month → negligible
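These figures follow directly from 10 GB/day of ingestion. A quick check of the steady-state volumes and monthly costs (assuming a full 7 years of accumulated cool-tier logs):

```python
GB_PER_DAY = 10
HOT_DAYS, HOT_PRICE = 90, 2.30      # Log Analytics, $/GB/month
COOL_YEARS, COOL_PRICE = 7, 0.01    # Blob Cool tier, $/GB/month

hot_gb = GB_PER_DAY * HOT_DAYS                # 900 GB resident in hot storage
cool_gb = GB_PER_DAY * 365 * COOL_YEARS       # 25,550 GB at steady state
hot_monthly = hot_gb * HOT_PRICE              # $2,070/month
cool_monthly = cool_gb * COOL_PRICE           # ~$255/month
```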
Summary¶
- Telemetry Levels: Graduated from Debug/100% sampling (Dev) to Warning/10% adaptive sampling (Production).
- Dev: Seq containers, 100% traces, 7-day retention, unlimited logs for debugging.
- Test: Seq + Application Insights, 50% sampling, 14-day retention, QA validation focus.
- Staging: Application Insights + Log Analytics, 25% adaptive sampling, 30-day retention, production-like observability.
- Production: OTel Collector → Prometheus/Grafana + Application Insights + Log Analytics, 10% adaptive sampling with tail sampling (errors/slow requests), 90-day hot + 7-year cold retention.
- Prometheus & Grafana: Production metrics aggregation with custom dashboards and alerting.
- Cost Optimization: Adaptive sampling, tail sampling (keep errors), multi-tier retention (hot/cool/archive).
Cost Management & Optimization¶
ATP implements FinOps principles across all environments to balance operational requirements with cost efficiency. Each environment has tailored cost profiles, optimization strategies, and cost governance controls aligned with its criticality and usage patterns.
This approach ensures predictable spending, cost transparency, and continuous optimization through automated shutdown schedules, reserved capacity, and granular cost tracking per environment, service, and team.
Environment Cost Profiles¶
ATP's cost model is graduated by environment with Dev/Test optimized for minimal cost and Production optimized for reliability within budget constraints.
| Environment | Monthly Budget | Primary Compute | SKU Tier | Scaling Strategy | Monitoring Cost | Total Est. Monthly |
|---|---|---|---|---|---|---|
| Preview | $100 | Azure Container Instances | Dynamic | Per-PR ephemeral | N/A | $50-150 (variable) |
| Dev | $500 | App Service | Basic B1 (1 vCPU, 1.75 GB) | Fixed (1 instance) | Basic alerts | $400-600 |
| Test | $1,000 | App Service | Standard S1 (1 vCPU, 1.75 GB) | Fixed (2 instances) | Basic alerts | $900-1,200 |
| Staging | $3,000 | App Service | Premium P1v2 (1 vCPU, 3.5 GB) | Autoscale (2-5) | Enhanced alerts | $2,500-3,500 |
| Production | $10,000 | AKS (3-10 nodes) | Standard_D4s_v5 (4 vCPU, 16 GB) | Autoscale (3-10 nodes) | Full observability | $8,000-12,000 |
| Hotfix | $500 | App Service (on-demand) | Premium P1v3 (2 vCPU, 8 GB) | Fixed (1 instance) | Basic alerts | $0-500 (as-needed) |
Cost Profile Rationale:
- Dev ($500): Cost-minimized with Basic SKU; shutdown evenings/weekends (40% savings).
- Test ($1,000): Standard SKU for stable performance; 2 instances for load testing; shutdown nights.
- Staging ($3,000): Premium SKU for production-like validation; autoscaling; always-on.
- Production ($10,000): AKS for enterprise-grade scalability; reserved instances (20% savings); 99.9% SLA.
- Hotfix ($500): On-demand deployment only when needed; deleted after hotfix deployment.
Detailed Cost Breakdown¶
Dev Environment Monthly Costs¶
Compute (App Service Basic B1 × 1): $13/month × 0.6 (60% uptime with shutdown) = $8/month
SQL Database (Basic - 5 DTU): $5/month
Redis Cache (Basic C0 - 250 MB): $16/month
Service Bus (Basic): $5/month
Storage (LRS - 100 GB): $2/month
Key Vault (transactions): $1/month
Bandwidth: $5/month
---
Subtotal: $42/month
Actual with shutdown automation: ~$25/month per service × 7 services = $175/month
Dev shared infrastructure: $300/month (VNet, NSG, Seq, etc.)
---
Total Dev Environment: $475/month
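The subtotal and environment total can be sanity-checked in a few lines (figures copied from the breakdown above; the ~$25/service figure is the post-shutdown actual):

```python
# Per-service line items, $/month, from the Dev breakdown
per_service = {"compute": 8, "sql": 5, "redis": 16, "service_bus": 5,
               "storage": 2, "key_vault": 1, "bandwidth": 5}
subtotal = sum(per_service.values())     # $42/month on paper
actual_services = 25 * 7                 # ~$25/month actual x 7 services
dev_total = actual_services + 300        # plus shared infrastructure
```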
Cost Optimization (Dev):
- Shutdown Schedule: Stop 6 PM - 8 AM weekdays, all weekend → 60% uptime → 40% savings.
- Shared Resources: VNet, NSG, Seq shared between Dev/Test → split cost.
- Basic SKUs: Minimum viable performance for development.
Test Environment Monthly Costs¶
Compute (App Service Standard S1 × 2): $70/month × 0.7 (70% uptime) × 2 = $98/month
SQL Database (Standard S1 - 20 DTU): $30/month
Redis Cache (Standard C1 - 1 GB): $75/month
Service Bus (Standard): $10/month
Storage (LRS - 500 GB): $10/month
Key Vault (transactions): $2/month
Application Insights (5 GB/month): $12/month
Bandwidth: $10/month
---
Subtotal: $247/month
Actual with shutdown automation: ~$35/month per service × 7 services = $245/month
Test shared infrastructure: $400/month
CI/CD agent costs: $200/month
---
Total Test Environment: $845/month
Cost Optimization (Test):
- Shutdown Schedule: Stop 10 PM - 6 AM weekdays, all weekend → 70% uptime → 30% savings.
- Standard SKUs: Sufficient for automated testing; reliable performance.
- Shared Seq Instance: Single Seq container for all test services.
Staging Environment Monthly Costs¶
Compute (App Service Premium P1v2 × 3): $146/month × 3 = $438/month
SQL Database (Premium P1 - 125 DTU): $465/month
Redis Cache (Premium P1 - 6 GB): $250/month
Service Bus (Premium - 1 messaging unit): $677/month
Storage (GRS - 1 TB): $40/month
Key Vault (HSM - 10 keys): $125/month
Application Insights (15 GB/month): $35/month
Log Analytics (30-day retention): $70/month
Private Endpoints (5 × $7): $35/month
Application Gateway (v2): $125/month
Bandwidth: $50/month
---
Total Staging Environment: $2,310/month
Cost Optimization (Staging):
- Always-On: No shutdown (production validation requires 24/7 availability).
- Reserved Instances: 1-year commitment → 20% savings (~$460/year).
- Private Endpoints: Shared across data resources.
- Geo-Redundant Storage: Balance cost vs. reliability for production-like testing.
Production Environment Monthly Costs¶
AKS Cluster (3-10 nodes, Standard_D4s_v5):
- System pool (3 nodes, always on): $200/month × 3 = $600/month
- User pool (3-7 nodes, autoscale): $200/month × 5 avg = $1,000/month
SQL Database (Premium P4 - 500 DTU): $1,860/month
Cosmos DB (1000 RU/s provisioned): $730/month
Redis Cache (Premium P3 - 26 GB): $1,037/month
Service Bus (Premium - 4 messaging units): $2,708/month
Storage (GRS + WORM - 10 TB): $500/month (hot) + $100/month (cool)
Key Vault (HSM - 50 keys): $625/month
Application Insights (50 GB/month): $115/month
Log Analytics (90-day retention, 30 GB/day): $900/month
Private Endpoints (10 × $7): $70/month
Application Gateway (v2 with WAF): $250/month
Azure Firewall (Premium): $625/month
DDoS Protection Standard: $2,944/month
Bandwidth (outbound - 1 TB): $90/month
ACR (Premium - geo-replication): $30/month
Prometheus + Grafana (self-hosted on AKS): $50/month (storage only)
---
Total Production Environment: $14,234/month
Reserved Instance Savings (1-year): -$2,400/month (20% on compute/database)
---
Net Production Monthly Cost: $11,834/month
Cost Optimization (Production):
- Reserved Instances: 1-year commitment for AKS nodes, SQL, Cosmos DB → 20-30% savings.
- Autoscaling: Scale down to 3 nodes during low-traffic hours → save ~$400/month.
- Storage Lifecycle: Auto-transition to cool tier after 90 days → save ~$300/month.
- DDoS Protection: Shared across all public endpoints in subscription.
- Application Insights Sampling: 10% adaptive sampling → reduce ingestion by 90%.
Cost Optimization Strategies¶
ATP implements automated cost optimization across all environments using Azure Policy, automation scripts, and IaC overlays.
Automated Shutdown Schedules¶
Purpose: Reduce compute costs in Dev/Test environments by shutting down resources during non-business hours.
Dev Environment Shutdown Schedule:
# Shutdown Schedule (Dev)
schedule:
weekdays:
shutdown: "18:00" # 6 PM local time
startup: "08:00" # 8 AM local time
timezone: "Eastern Standard Time"
weekends:
shutdown: "18:00 Friday"
startup: "08:00 Monday"
timezone: "Eastern Standard Time"
  expectedUptime: 60% # ~100 hours/week out of 168
  estimatedSavings: 40% # ~$8/month vs $13/month per App Service
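As a rule of thumb, compute-only savings scale with the fraction of the week a resource is off. A hypothetical helper, assuming billing is strictly proportional to uptime (true for VMs and ACI; App Service plans keep billing while apps are stopped, so treat this as an approximation there):

```python
def shutdown_savings(full_monthly: float, uptime_fraction: float) -> float:
    """Monthly compute saved by stopping a resource outside business hours.
    Storage, networking, and per-plan charges still bill regardless."""
    return full_monthly * (1 - uptime_fraction)

# Dev B1 plan at 60% uptime: roughly $5 of the $13/month saved.
shutdown_savings(13, 0.6)  # -> 5.2
```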
Azure Automation Runbook (Shutdown):
<#
.SYNOPSIS
Shutdown ATP Dev environment resources during non-business hours.
.DESCRIPTION
Stops App Services, VMs, and SQL Databases in Dev environment to reduce costs.
Scheduled to run at 6 PM ET weekdays and 6 PM Friday for weekend.
#>
param(
[Parameter(Mandatory=$false)]
[string]$ResourceGroupName = "ConnectSoft-ATP-Dev-EUS-RG"
)
# Authenticate using Managed Identity
Connect-AzAccount -Identity
Write-Output "Starting shutdown sequence for Dev environment: $ResourceGroupName"
# Stop all App Services
$appServices = Get-AzWebApp -ResourceGroupName $ResourceGroupName
foreach ($app in $appServices) {
Write-Output "Stopping App Service: $($app.Name)"
Stop-AzWebApp -ResourceGroupName $ResourceGroupName -Name $app.Name
}
# Pause SQL Databases (note: Suspend-AzSqlDatabase applies only to Data Warehouse / dedicated SQL pool editions; standard DTU-tier databases cannot be paused)
$sqlServers = Get-AzSqlServer -ResourceGroupName $ResourceGroupName
foreach ($server in $sqlServers) {
$databases = Get-AzSqlDatabase -ServerName $server.ServerName -ResourceGroupName $ResourceGroupName
foreach ($db in $databases) {
if ($db.DatabaseName -ne "master") {
Write-Output "Pausing SQL Database: $($db.DatabaseName)"
Suspend-AzSqlDatabase -ResourceGroupName $ResourceGroupName -ServerName $server.ServerName -DatabaseName $db.DatabaseName
}
}
}
# Stop VMs (if any for Dev jumpbox)
$vms = Get-AzVM -ResourceGroupName $ResourceGroupName
foreach ($vm in $vms) {
Write-Output "Stopping VM: $($vm.Name)"
Stop-AzVM -ResourceGroupName $ResourceGroupName -Name $vm.Name -Force
}
# Stop ACI instances (Preview environments)
$containers = Get-AzContainerGroup -ResourceGroupName $ResourceGroupName
foreach ($container in $containers) {
Write-Output "Stopping Container Instance: $($container.Name)"
Stop-AzContainerGroup -ResourceGroupName $ResourceGroupName -Name $container.Name
}
Write-Output "Dev environment shutdown complete. Estimated monthly savings: 40%"
Azure Automation Runbook (Startup):
<#
.SYNOPSIS
Startup ATP Dev environment resources during business hours.
.DESCRIPTION
Starts App Services, VMs, and SQL Databases in Dev environment.
Scheduled to run at 8 AM ET weekdays.
#>
param(
[Parameter(Mandatory=$false)]
[string]$ResourceGroupName = "ConnectSoft-ATP-Dev-EUS-RG"
)
Connect-AzAccount -Identity
Write-Output "Starting startup sequence for Dev environment: $ResourceGroupName"
# Start App Services
$appServices = Get-AzWebApp -ResourceGroupName $ResourceGroupName
foreach ($app in $appServices) {
Write-Output "Starting App Service: $($app.Name)"
Start-AzWebApp -ResourceGroupName $ResourceGroupName -Name $app.Name
}
# Resume SQL Databases
$sqlServers = Get-AzSqlServer -ResourceGroupName $ResourceGroupName
foreach ($server in $sqlServers) {
$databases = Get-AzSqlDatabase -ServerName $server.ServerName -ResourceGroupName $ResourceGroupName
foreach ($db in $databases) {
if ($db.DatabaseName -ne "master" -and $db.Status -eq "Paused") {
Write-Output "Resuming SQL Database: $($db.DatabaseName)"
Resume-AzSqlDatabase -ResourceGroupName $ResourceGroupName -ServerName $server.ServerName -DatabaseName $db.DatabaseName
}
}
}
# Start VMs
$vms = Get-AzVM -ResourceGroupName $ResourceGroupName -Status
foreach ($vm in $vms | Where-Object {$_.PowerState -eq "VM deallocated"}) {
Write-Output "Starting VM: $($vm.Name)"
Start-AzVM -ResourceGroupName $ResourceGroupName -Name $vm.Name
}
Write-Output "Dev environment startup complete."
Azure Automation Schedule:
# Create Automation Account
az automation account create \
--name "atp-automation-eus" \
--resource-group "ConnectSoft-ATP-Shared-RG" \
--location "eastus" \
--sku "Basic" \
--tags Environment=Shared Purpose=CostOptimization
# Enable Managed Identity
az automation account update \
--name "atp-automation-eus" \
--resource-group "ConnectSoft-ATP-Shared-RG" \
--assign-identity
# Create Shutdown Schedule (Weekdays 6 PM)
az automation schedule create \
--automation-account-name "atp-automation-eus" \
--resource-group "ConnectSoft-ATP-Shared-RG" \
--name "Dev-Shutdown-Weekdays" \
--frequency "Week" \
--interval 1 \
--start-time "2025-01-01T18:00:00-05:00" \
--time-zone "Eastern Standard Time" \
--week-days Monday Tuesday Wednesday Thursday Friday
# Create Startup Schedule (Weekdays 8 AM)
az automation schedule create \
--automation-account-name "atp-automation-eus" \
--resource-group "ConnectSoft-ATP-Shared-RG" \
--name "Dev-Startup-Weekdays" \
--frequency "Week" \
--interval 1 \
--start-time "2025-01-01T08:00:00-05:00" \
--time-zone "Eastern Standard Time" \
--week-days Monday Tuesday Wednesday Thursday Friday
echo "✅ Automation schedules created; Dev environment will shutdown/startup automatically"
Test Environment Shutdown Schedule (10 PM - 6 AM):
schedule:
  weekdays:
    shutdown: "22:00"  # 10 PM (after automated test runs)
    startup: "06:00"   # 6 AM (before CI/CD starts)
  weekends:
    shutdown: "22:00 Friday"
    startup: "06:00 Monday"
expectedUptime: ~48%   # 80 hours/week (16 h/day, weekdays only)
estimatedSavings: 30%  # $49/month vs $70/month per App Service (plan and storage charges continue while stopped)
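The schedule above is easy to sanity-check with a few lines; this is an illustrative calculation only, assuming weekday uptime of 06:00–22:00 and a full weekend shutdown as in the YAML (no Azure API involved):

```python
# Illustrative uptime math for the Test shutdown schedule above.
def weekly_uptime_hours(startup_hour: int, shutdown_hour: int,
                        weekend_shutdown: bool = True) -> int:
    """Hours per week the environment is running."""
    weekday_hours = (shutdown_hour - startup_hour) * 5  # Mon-Fri daytime
    weekend_hours = 0 if weekend_shutdown else 48       # Sat + Sun
    return weekday_hours + weekend_hours

hours = weekly_uptime_hours(6, 22)
print(f"{hours} h/week ({hours / 168:.0%} of the 168-hour week)")  # 80 h/week (48%)
```

Savings land below the downtime fraction because some charges (App Service plans, storage) continue while resources are stopped.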
Reserved Instances & Savings Plans¶
Purpose: Long-term cost savings for predictable workloads in Staging/Production with 1-3 year commitments.
Reserved Instance Strategy:
environment: Staging
commitment: 1-year
resources:
  - type: App Service Premium P1v2
    quantity: 2
    monthlyCost: $146 × 2 = $292
    reservedCost: $116.50 × 2 = $233 (≈20% savings)
    annualSavings: $708
  - type: SQL Database Premium P1
    quantity: 1
    monthlyCost: $465
    reservedCost: $372 (20% savings)
    annualSavings: $1,116
totalAnnualSavings: $1,824 (Staging)
Production Reserved Instance Plan:
environment: Production
commitment: 1-year (renew annually)
resources:
  - type: AKS Standard_D4s_v5
    quantity: 3 (system pool, always on)
    monthlyCost: $600
    reservedCost: $480 (20% savings)
    annualSavings: $1,440
  - type: SQL Database Premium P4
    quantity: 1
    monthlyCost: $1,860
    reservedCost: $1,395 (25% savings)
    annualSavings: $5,580
  - type: Cosmos DB (1000 RU/s)
    quantity: 1
    monthlyCost: $730
    reservedCost: $511 (30% savings)
    annualSavings: $2,628
  - type: Redis Cache Premium P3
    quantity: 1
    monthlyCost: $1,037
    reservedCost: $830 (20% savings)
    annualSavings: $2,484
totalAnnualSavings: $12,132 (Production)
totalATPReservedInstanceSavings: $13,956/year
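The savings figures above follow directly from the quoted monthly and reserved rates; a quick check (numbers copied from the tables, nothing queried from Azure):

```python
# Verify the reserved-instance arithmetic quoted above.
def annual_savings(monthly: float, reserved: float) -> int:
    """Annual savings when a pay-as-you-go rate is replaced by a reserved rate."""
    return round((monthly - reserved) * 12)

staging = annual_savings(292, 233) + annual_savings(465, 372)
production = (annual_savings(600, 480)      # AKS system pool (quoted as a pool total)
              + annual_savings(1860, 1395)  # SQL Database Premium P4
              + annual_savings(730, 511)    # Cosmos DB 1000 RU/s
              + annual_savings(1037, 830))  # Redis Cache Premium P3
print(staging, production, staging + production)  # 1824 12132 13956
```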
Purchase Reserved Instances (Azure CLI):
#!/bin/bash
# purchase-reserved-instances.sh
SUBSCRIPTION_ID="<azure-subscription-id>"
REGION="eastus"
echo "Purchasing Reserved Instances for ATP Production..."

# AKS Nodes (Standard_D4s_v5 × 3)
# In practice, run "az reservations reservation-order calculate" first to obtain
# a reservation order ID, then pass it to the purchase command.
az reservations reservation-order purchase \
  --reserved-resource-type "VirtualMachines" \
  --sku "Standard_D4s_v5" \
  --location "$REGION" \
  --quantity 3 \
  --term "P1Y" \
  --billing-plan "Monthly" \
  --display-name "ATP-Prod-AKS-RI-2025"

# SQL Database (Premium P4): this step only configures the database onto the
# provisioned compute tier that reserved capacity covers; the SQL reservation
# itself is purchased separately (reserved-resource-type "SqlDatabases").
az sql db update \
  --resource-group "ConnectSoft-ATP-Prod-EUS-RG" \
  --server "atp-sql-prod-eus" \
  --name "ATP_Prod" \
  --compute-model "Provisioned" \
  --service-objective "P4" \
  --backup-storage-redundancy "Geo" \
  --zone-redundant true \
  --read-scale "Enabled"

# Cosmos DB: set the provisioned throughput that reserved capacity covers
# (the Cosmos reservation is purchased separately, reserved-resource-type "CosmosDb").
az cosmosdb sql container throughput update \
  --resource-group "ConnectSoft-ATP-Prod-EUS-RG" \
  --account-name "atp-cosmos-prod-eus" \
  --database-name "ATP" \
  --name "AuditEvents" \
  --throughput 1000
echo "✅ Reserved Instances purchased; savings will appear in next billing cycle"
Spot Instances (Preview Environments)¶
Purpose: 90% cost savings for ephemeral Preview environments using Azure Spot VMs.
Spot Instance Configuration (ACI with Spot pricing):
# Azure Container Instances with Spot pricing
apiVersion: '2021-09-01'
location: eastus
name: atp-preview-pr-123
properties:
  containers:
    - name: atp-gateway-preview
      properties:
        image: connectsoft.azurecr.io/atp-gateway:pr-123
        resources:
          requests:
            memoryInGB: 1.5
            cpu: 1
  osType: Linux
  priority: Spot # Use Spot pricing (90% cheaper than regular)
  restartPolicy: Never
  sku: Standard
tags:
  Environment: Preview
  PullRequest: PR-123
  CostCenter: Engineering
  AutoDelete: "24h" # Delete after 24 hours
Cost Comparison (Preview Environment):
Regular ACI Pricing: $0.0000125/second × 3600s × 24h × 30 days = $32.40/month
Spot ACI Pricing: $0.00000125/second × 3600s × 24h × 30 days = $3.24/month
Savings: $29.16/month per container (90% savings)
Typical Preview Environment (5 containers × 24 hours):
- Regular: $5.40
- Spot: $0.54
- Savings per PR: $4.86 (90%)
Estimated Monthly (20 PRs/month):
- Regular: $108
- Spot: $10.80
- Total Savings: $97.20/month
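The per-second arithmetic above generalizes to any container count and duration; a small helper using the same illustrative rates (real ACI prices vary by region and over time):

```python
# Illustrative ACI cost model using the per-second rates quoted above.
REGULAR_RATE = 0.0000125   # $/second, regular ACI
SPOT_RATE = 0.00000125     # $/second, Spot (90% discount)

def aci_cost(rate_per_second: float, hours: float, containers: int = 1) -> float:
    return round(rate_per_second * 3600 * hours * containers, 2)

monthly_regular = aci_cost(REGULAR_RATE, 24 * 30)   # one container, full month
monthly_spot = aci_cost(SPOT_RATE, 24 * 30)
per_pr_savings = aci_cost(REGULAR_RATE, 24, 5) - aci_cost(SPOT_RATE, 24, 5)
print(monthly_regular, monthly_spot, round(per_pr_savings, 2))  # 32.4 3.24 4.86
```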
Storage Lifecycle Management¶
Purpose: Automated tiering from hot → cool → archive based on retention policies to minimize storage costs.
Storage Lifecycle Policy (Production):
{
"rules": [
{
"enabled": true,
"name": "MoveLogsToArchive",
"type": "Lifecycle",
"definition": {
"filters": {
"blobTypes": ["blockBlob"],
"prefixMatch": ["logs/", "traces/", "metrics/"]
},
"actions": {
"baseBlob": {
"tierToCool": {
"daysAfterModificationGreaterThan": 90
},
"tierToArchive": {
"daysAfterModificationGreaterThan": 2555 // 7 years
},
"delete": {
"daysAfterModificationGreaterThan": 3650 // 10 years (optional cleanup)
}
},
"snapshot": {
"tierToCool": {
"daysAfterCreationGreaterThan": 30
},
"delete": {
"daysAfterCreationGreaterThan": 90
}
}
}
}
},
{
"enabled": true,
"name": "MoveBackupsToArchive",
"type": "Lifecycle",
"definition": {
"filters": {
"blobTypes": ["blockBlob"],
"prefixMatch": ["backups/"]
},
"actions": {
"baseBlob": {
"tierToCool": {
"daysAfterModificationGreaterThan": 30 // Weekly backups to cool after 30 days
},
"tierToArchive": {
"daysAfterModificationGreaterThan": 180 // Archive after 6 months
}
}
}
}
},
{
"enabled": true,
"name": "DeleteTempBlobs",
"type": "Lifecycle",
"definition": {
"filters": {
"blobTypes": ["blockBlob"],
"prefixMatch": ["temp/", "preview/"]
},
"actions": {
"baseBlob": {
"delete": {
"daysAfterModificationGreaterThan": 7 // Delete temp/preview after 7 days
}
}
}
}
}
]
}
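The policy's decision logic can be read as a simple function of blob prefix and age; this sketch mirrors the three rules above (thresholds copied from the JSON; it is not Azure's actual evaluation engine):

```python
# Mirror of the lifecycle rules above: prefix + age in days -> action/tier.
def lifecycle_action(prefix: str, age_days: int) -> str:
    if prefix in ("temp/", "preview/"):             # DeleteTempBlobs rule
        return "delete" if age_days > 7 else "hot"
    if prefix == "backups/":                        # MoveBackupsToArchive rule
        if age_days > 180:
            return "archive"
        return "cool" if age_days > 30 else "hot"
    if prefix in ("logs/", "traces/", "metrics/"):  # MoveLogsToArchive rule
        if age_days > 3650:
            return "delete"                         # optional 10-year cleanup
        if age_days > 2555:
            return "archive"                        # after the 7-year window
        return "cool" if age_days > 90 else "hot"
    return "hot"  # prefixes not matched by any rule keep their current tier

print(lifecycle_action("logs/", 120),      # cool
      lifecycle_action("backups/", 200),   # archive
      lifecycle_action("temp/", 10))       # delete
```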
Storage Cost Savings (Production):
Hot Storage (0-90 days): 900 GB × $0.0184/GB = $16.56/month
Cool Storage (91 days - 7 years): 25,200 GB × $0.01/GB = $252/month
Archive Storage (7+ years): 100,000 GB × $0.002/GB = $200/month
Without Lifecycle Policy (all hot): 126,100 GB × $0.0184/GB = $2,320.24/month
With Lifecycle Policy: $16.56 + $252 + $200 = $468.56/month
Total Savings: $1,851.68/month (80% savings on storage)
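The same figures can be reproduced from the per-GB rates (rates as quoted above; actual Azure pricing varies by region and redundancy):

```python
# Tiered vs all-hot storage cost, using the quoted per-GB monthly rates.
tiers = {"hot": (900, 0.0184), "cool": (25_200, 0.01), "archive": (100_000, 0.002)}

tiered = sum(gb * rate for gb, rate in tiers.values())
all_hot = sum(gb for gb, _ in tiers.values()) * 0.0184
print(f"tiered=${tiered:.2f}  all-hot=${all_hot:.2f}  savings=${all_hot - tiered:.2f}")
```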
Cost Alerts & Monitoring¶
Purpose: Proactive cost visibility with alerts when environments exceed budget thresholds or exhibit anomalous spending patterns.
Azure Cost Management Budgets (Pulumi):
// Dev Environment Budget
var devBudget = new Budget("atp-budget-dev", new BudgetArgs
{
BudgetName = "atp-budget-dev",
ResourceGroupName = devResourceGroup.Name,
Amount = 500, // $500/month
TimeGrain = "Monthly",
TimePeriod = new BudgetTimePeriodArgs
{
StartDate = "2025-01-01",
EndDate = "2025-12-31"
},
Category = "Cost",
Notifications = new InputMap<NotificationArgs>
{
["Alert80Percent"] = new NotificationArgs
{
Enabled = true,
Operator = "GreaterThanOrEqualTo",
Threshold = 80,
ContactEmails = new[] { "platform-team@connectsoft.example" },
ContactRoles = new[] { "Owner", "Contributor" },
ThresholdType = "Actual"
},
["Alert100Percent"] = new NotificationArgs
{
Enabled = true,
Operator = "GreaterThanOrEqualTo",
Threshold = 100,
ContactEmails = new[] { "platform-team@connectsoft.example", "finance@connectsoft.example" },
ContactRoles = new[] { "Owner" },
ThresholdType = "Actual"
},
["Forecast120Percent"] = new NotificationArgs
{
Enabled = true,
Operator = "GreaterThanOrEqualTo",
Threshold = 120,
ContactEmails = new[] { "platform-team@connectsoft.example" },
ThresholdType = "Forecasted"
}
},
Filter = new BudgetFilterArgs
{
Tags = new InputList<BudgetFilterTagsArgs>
{
new BudgetFilterTagsArgs
{
Name = "Environment",
Operator = "In",
Values = new[] { "dev" }
}
}
}
});
// Production Environment Budget
var prodBudget = new Budget("atp-budget-prod", new BudgetArgs
{
BudgetName = "atp-budget-prod",
ResourceGroupName = prodResourceGroup.Name,
Amount = 10000, // $10,000/month
TimeGrain = "Monthly",
TimePeriod = new BudgetTimePeriodArgs
{
StartDate = "2025-01-01",
EndDate = "2025-12-31"
},
Category = "Cost",
Notifications = new InputMap<NotificationArgs>
{
["Alert50Percent"] = new NotificationArgs
{
Enabled = true,
Operator = "GreaterThanOrEqualTo",
Threshold = 50,
ContactEmails = new[] { "platform-team@connectsoft.example" },
ThresholdType = "Actual"
},
["Alert80Percent"] = new NotificationArgs
{
Enabled = true,
Operator = "GreaterThanOrEqualTo",
Threshold = 80,
ContactEmails = new[] { "platform-team@connectsoft.example", "finance@connectsoft.example" },
ContactRoles = new[] { "Owner" },
ThresholdType = "Actual"
},
["Alert100Percent"] = new NotificationArgs
{
Enabled = true,
Operator = "GreaterThanOrEqualTo",
Threshold = 100,
ContactEmails = new[] { "cfo@connectsoft.example", "platform-team@connectsoft.example" },
ContactRoles = new[] { "Owner" },
ThresholdType = "Actual",
ContactGroups = new[] { "<action-group-resource-id>" } // Action group can auto-create a P1 incident
}
}
});
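The notification rules above amount to simple threshold checks; the sketch below evaluates which alerts of the Dev budget would fire for a given spend (pure arithmetic, no Cost Management API call):

```python
# Which Dev-budget notifications fire for a given actual/forecast spend.
# Thresholds mirror the Pulumi definition above ($500/month budget).
def fired_notifications(actual: float, forecast: float, budget: float = 500.0) -> list[str]:
    fired = []
    if actual >= budget * 0.80:
        fired.append("Alert80Percent")       # Actual >= 80%
    if actual >= budget * 1.00:
        fired.append("Alert100Percent")      # Actual >= 100%
    if forecast >= budget * 1.20:
        fired.append("Forecast120Percent")   # Forecasted >= 120%
    return fired

print(fired_notifications(actual=410, forecast=630))
# ['Alert80Percent', 'Forecast120Percent']
```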
Cost Anomaly Detection (Azure Monitor):
// Cost anomaly alert (50% spike in single day)
var costAnomalyAlert = new MetricAlert("atp-cost-anomaly-alert-prod", new MetricAlertArgs
{
AlertRuleName = "atp-cost-anomaly-prod",
ResourceGroupName = prodResourceGroup.Name,
Location = "global",
Description = "Alert when Production environment cost spikes >50% in 24 hours",
Severity = 1, // High severity
Enabled = true,
Scopes = new[] { prodResourceGroup.Id },
EvaluationFrequency = "PT1H", // Evaluate every hour
WindowSize = "PT24H", // 24-hour window
Criteria = new MetricAlertMultipleResourceMultipleMetricCriteriaArgs
{
OdataType = "Microsoft.Azure.Monitor.MultipleResourceMultipleMetricCriteria",
AllOf = new[]
{
new DynamicMetricCriteriaArgs
{
Name = "CostAnomaly",
MetricName = "ActualCost",
MetricNamespace = "Microsoft.CostManagement/budgets",
Operator = "GreaterThan",
AlertSensitivity = "Medium",
DynamicThresholdFailingPeriods = new DynamicThresholdFailingPeriodsArgs
{
NumberOfEvaluationPeriods = 4,
MinFailingPeriodsToAlert = 2
},
TimeAggregation = "Total"
}
}
},
Actions = new[]
{
new MetricAlertActionArgs
{
ActionGroupId = costAnomalyActionGroup.Id
}
}
});
// Action Group for cost alerts (in real code, declare this before the MetricAlert that references it)
var costAnomalyActionGroup = new ActionGroup("atp-cost-anomaly-action-group", new ActionGroupArgs
{
ActionGroupName = "atp-cost-anomaly",
ResourceGroupName = sharedResourceGroup.Name,
Location = "global",
Enabled = true,
ShortName = "CostAlert",
EmailReceivers = new[]
{
new EmailReceiverArgs
{
Name = "PlatformTeam",
EmailAddress = "platform-team@connectsoft.example",
UseCommonAlertSchema = true
},
new EmailReceiverArgs
{
Name = "Finance",
EmailAddress = "finance@connectsoft.example",
UseCommonAlertSchema = true
}
},
WebhookReceivers = new[]
{
new WebhookReceiverArgs
{
Name = "SlackNotification",
ServiceUri = "https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXX",
UseCommonAlertSchema = true
}
}
});
Monthly Cost Review Automation:
<#
.SYNOPSIS
Generate monthly cost report for ATP environments.
.DESCRIPTION
Queries Azure Cost Management API for current month spending per environment.
Sends report to finance and platform teams.
#>
param(
[Parameter(Mandatory=$false)]
[string]$SubscriptionId = "<subscription-id>"
)
Connect-AzAccount -Identity
$startDate = (Get-Date -Day 1).ToString("yyyy-MM-dd")
$endDate = (Get-Date).ToString("yyyy-MM-dd")
Write-Output "Generating ATP cost report for $startDate to $endDate"
# Query costs per environment
$environments = @("dev", "test", "staging", "prod", "hotfix")
$costReport = @()
foreach ($env in $environments) {
$query = @{
type = "ActualCost"
timeframe = "Custom"
timePeriod = @{
from = $startDate
to = $endDate
}
dataset = @{
granularity = "Monthly"
aggregation = @{
totalCost = @{
name = "Cost"
function = "Sum"
}
}
grouping = @(
@{
type = "Dimension"
name = "ResourceGroupName"
}
)
filter = @{
tags = @{
name = "Environment"
operator = "In"
values = @($env)
}
}
}
} | ConvertTo-Json -Depth 10
$result = Invoke-AzRestMethod `
-Path "/subscriptions/$SubscriptionId/providers/Microsoft.CostManagement/query?api-version=2021-10-01" `
-Method POST `
-Payload $query
$cost = ($result.Content | ConvertFrom-Json).properties.rows |
    ForEach-Object { $_[0] } |
    Measure-Object -Sum

$budget = switch ($env) {
    "dev"     { 500 }
    "test"    { 1000 }
    "staging" { 3000 }
    "prod"    { 10000 }
    "hotfix"  { 500 }
}

$costReport += [PSCustomObject]@{
    Environment = $env.ToUpper()
    Cost        = [math]::Round($cost.Sum, 2)
    Budget      = $budget
    Utilization = [math]::Round(($cost.Sum / $budget) * 100, 1)
}
}
# Generate HTML report
$htmlReport = @"
<html>
<head>
<style>
body { font-family: Arial; }
table { border-collapse: collapse; width: 100%; }
th, td { border: 1px solid #ddd; padding: 8px; text-align: left; }
th { background-color: #4CAF50; color: white; }
.over-budget { background-color: #ffcccc; }
.near-budget { background-color: #ffffcc; }
</style>
</head>
<body>
<h2>ATP Monthly Cost Report - $(Get-Date -Format 'MMMM yyyy')</h2>
<table>
<tr>
<th>Environment</th>
<th>Current Cost</th>
<th>Monthly Budget</th>
<th>Utilization</th>
</tr>
"@
foreach ($item in $costReport) {
$rowClass = if ($item.Utilization -gt 100) { "over-budget" }
elseif ($item.Utilization -gt 80) { "near-budget" }
else { "" }
$htmlReport += @"
<tr class="$rowClass">
<td>$($item.Environment)</td>
<td>`$$($item.Cost)</td>
<td>`$$($item.Budget)</td>
<td>$($item.Utilization)%</td>
</tr>
"@
}
$totalCost = ($costReport | Measure-Object -Property Cost -Sum).Sum
$totalBudget = ($costReport | Measure-Object -Property Budget -Sum).Sum
$htmlReport += @"
<tr style="font-weight: bold;">
<td>TOTAL</td>
<td>`$$totalCost</td>
<td>`$$totalBudget</td>
<td>$([math]::Round(($totalCost / $totalBudget) * 100, 1))%</td>
</tr>
</table>
<h3>Cost Optimization Recommendations</h3>
<ul>
<li>Dev/Test shutdown automation active: Estimated savings ~`$200/month</li>
<li>Reserved Instances active: Saving ~`$1,163/month</li>
<li>Storage lifecycle policies active: Saving ~`$1,851/month</li>
<li>Total Estimated Savings: ~`$3,214/month</li>
</ul>
</body>
</html>
"@
# Send email report
Send-MailMessage `
-From "platform-team@connectsoft.example" `
-To "finance@connectsoft.example", "platform-team@connectsoft.example" `
-Subject "ATP Monthly Cost Report - $(Get-Date -Format 'MMMM yyyy')" `
-Body $htmlReport `
-BodyAsHtml `
-SmtpServer "smtp.office365.com" `
-UseSsl `
-Credential (Get-AutomationPSCredential -Name "EmailCredential")
Write-Output "✅ Cost report sent successfully"
Cost Tagging Strategy¶
Purpose: Granular cost attribution per environment, service, team, and cost center for accurate chargeback/showback.
Tagging Policy (Azure Policy):
{
"properties": {
"displayName": "Require tags on ATP resources",
"policyType": "Custom",
"mode": "Indexed",
"description": "Enforces required tags on all ATP resources for cost tracking",
"metadata": {
"category": "Cost Management"
},
"parameters": {
"requiredTags": {
"type": "Object",
"defaultValue": {
"Environment": ["dev", "test", "staging", "prod", "hotfix", "preview"],
"Service": ["gateway", "ingestion", "query", "integrity", "export", "policy", "search"],
"CostCenter": ["Engineering", "Platform", "Security"],
"Owner": ["platform-team@connectsoft.example"],
"Project": ["ATP"]
}
}
},
"policyRule": {
"if": {
"allOf": [
{
"field": "type",
"notIn": [
"Microsoft.Resources/subscriptions",
"Microsoft.Resources/subscriptions/resourceGroups"
]
},
{
"anyOf": [
{
"field": "tags['Environment']",
"exists": "false"
},
{
"field": "tags['Service']",
"exists": "false"
},
{
"field": "tags['CostCenter']",
"exists": "false"
},
{
"field": "tags['Owner']",
"exists": "false"
},
{
"field": "tags['Project']",
"exists": "false"
}
]
}
]
},
"then": {
"effect": "deny"
}
}
}
}
Pulumi Tagging (Consistent across all resources):
// Global tags applied to all ATP resources
var globalTags = new Dictionary<string, string>
{
    ["Project"] = "ATP",
    ["ManagedBy"] = "Pulumi",
    ["CostCenter"] = "Engineering",
    ["Owner"] = "platform-team@connectsoft.example"
};

// Environment-specific tags: copy the global set, then add overrides
var devTags = new Dictionary<string, string>(globalTags)
{
    ["Environment"] = "dev",
    ["CostOptimization"] = "ShutdownSchedule",
    ["BackupRequired"] = "false"
};

var prodTags = new Dictionary<string, string>(globalTags)
{
    ["Environment"] = "prod",
    ["CostOptimization"] = "ReservedInstances",
    ["BackupRequired"] = "true",
    ["Compliance"] = "SOC2,HIPAA,GDPR"
};

// Apply to resources (Dictionary<string, string> converts implicitly to the Tags input)
var prodAppService = new WebApp("atp-gateway-prod-eus", new WebAppArgs
{
    // ... resource configuration ...
    Tags = new Dictionary<string, string>(prodTags)
    {
        ["Service"] = "gateway",
        ["Tier"] = "Premium"
    }
});
Cost Attribution Query (KQL):
// Cost breakdown by Environment and Service
AzureCostManagement
| where TimeGenerated >= startofmonth(now())
| extend Environment = tostring(Tags["Environment"])
| extend Service = tostring(Tags["Service"])
| extend CostCenter = tostring(Tags["CostCenter"])
| summarize TotalCost = sum(Cost) by Environment, Service, CostCenter
| order by TotalCost desc
FinOps Best Practices¶
ATP FinOps Principles:
- Visibility: Tag all resources; enable Cost Management; monthly reviews.
- Optimization: Shutdown schedules, reserved instances, autoscaling, storage lifecycle.
- Governance: Budget alerts, Azure Policy enforcement, approval workflows for cost increases.
- Culture: Cost awareness in development; cost-per-feature metrics; regular optimization sprints.
Cost per Tenant (Production):
Total Production Monthly Cost: $9,834
Active Tenants (production): 50
Cost per Tenant: $9,834 / 50 = $196.68/month
Target Cost per Tenant (with 500 tenants): $9,834 / 500 = $19.67/month
Required Optimization: 90% reduction through economies of scale and multi-tenancy
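The per-tenant math above is worth keeping as a living metric; a minimal sketch (figures copied from this section):

```python
# Cost-per-tenant economics quoted above.
monthly_cost = 9834

cost_per_tenant_now = monthly_cost / 50        # at 50 tenants
cost_per_tenant_target = monthly_cost / 500    # at 500 tenants
reduction = 1 - cost_per_tenant_target / cost_per_tenant_now

print(f"${cost_per_tenant_now:.2f} -> ${cost_per_tenant_target:.2f} "
      f"({reduction:.0%} reduction per tenant)")
```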
Cost Governance Workflow:
# Approval required for resources exceeding cost thresholds
costGovernance:
  thresholds:
    - resource: App Service Premium
      monthlyCost: $150
      approver: Lead Architect
    - resource: SQL Database Premium
      monthlyCost: $500
      approver: CTO
    - resource: AKS Node Pool
      monthlyCost: $1000
      approver: CFO
  process:
    1. Engineer submits Pulumi PR with new resource
    2. CI/CD calculates estimated monthly cost (Infracost)
    3. If cost > threshold, require approval
    4. Approver reviews cost justification
    5. If approved, Pulumi deploys resource with cost tags
    6. Monthly review of actual vs estimated costs
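In CI this check reduces to mapping an estimated monthly cost to an approver; a simplified sketch keyed on cost alone (the table above also keys on resource type, and the Infracost integration itself is assumed, not shown):

```python
from typing import Optional

# Simplified governance gate: estimated monthly cost -> required approver.
# Thresholds from the table above, evaluated highest-first.
THRESHOLDS = [
    (1000, "CFO"),             # e.g. AKS node pools
    (500, "CTO"),              # e.g. SQL Database Premium
    (150, "Lead Architect"),   # e.g. App Service Premium
]

def required_approver(estimated_monthly_cost: float) -> Optional[str]:
    for floor, approver in THRESHOLDS:
        if estimated_monthly_cost > floor:
            return approver
    return None  # below every threshold: no approval gate

print(required_approver(1860), required_approver(146))  # CFO None
```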
Summary¶
- Environment Cost Profiles: Graduated from $500/month (Dev) to $10,000/month (Production) with tailored SKUs and scaling.
- Dev Optimization: Shutdown evenings/weekends (60% uptime) → 40% savings ($175/month actual cost).
- Test Optimization: Nightly and weekend shutdown (~48% uptime) → 30% savings ($845/month actual cost).
- Reserved Instances: 1-year commitments for Staging/Production → 20-30% savings ($13,956/year total).
- Spot Instances: Preview environments use Spot pricing → 90% savings ($97/month).
- Storage Lifecycle: Automated hot → cool → archive transitions → 80% storage savings ($1,851/month).
- Cost Alerts: Budget thresholds (80%, 100%) and anomaly detection (50% spike) with automated notifications.
- Tagging Strategy: Granular cost attribution per environment, service, team, and cost center.
- FinOps Culture: Monthly cost reviews, cost-per-feature metrics, governance workflows for cost increases.
Disaster Recovery & High Availability¶
ATP implements graduated disaster recovery and high availability strategies aligned with each environment's criticality and business requirements. Production operates in active-active multi-region configuration with 15-minute RPO and 30-minute RTO, while lower environments use cost-effective recreate-from-IaC or backup restore strategies.
This approach ensures business continuity for critical production workloads while maintaining cost efficiency in non-production environments through infrastructure-as-code recovery instead of expensive geo-replication.
RPO/RTO Targets Per Environment¶
ATP defines graduated recovery objectives from best-effort Dev recovery to mission-critical Production multi-region failover.
| Environment | RPO (Recovery Point) | RTO (Recovery Time) | HA Strategy | DR Strategy | Annual DR Test |
|---|---|---|---|---|---|
| Preview | N/A | N/A | None | None (ephemeral) | N/A |
| Dev | 24 hours | 4 hours | Single instance | Recreate from IaC + Git | Quarterly |
| Test | 12 hours | 2 hours | 2 instances (no LB) | Restore from last backup | Quarterly |
| Staging | 1 hour | 1 hour | Blue-green slots | Slot swap + backup restore | Monthly |
| Production | 15 minutes | 30 minutes | Multi-region active-active (80/20 split) | Automated regional failover | Weekly (non-disruptive) |
| Hotfix | 15 minutes | 30 minutes | Same as Production | Same as Production | N/A (mirrors Prod) |
RPO/RTO Rationale:
- Dev (24h RPO, 4h RTO): Acceptable data loss; recreate from Git + IaC in half-day.
- Test (12h RPO, 2h RTO): Daily backups sufficient; restore within business hours.
- Staging (1h RPO, 1h RTO): Production-like; validate blue-green failover.
- Production (15min RPO, 30min RTO): Mission-critical; continuous geo-replication; automated failover.
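The targets above are easy to keep machine-checkable; a sketch that encodes the table and asserts objectives only tighten up the promotion path:

```python
# Recovery targets from the table above, in minutes: env -> (RPO, RTO).
TARGETS = {
    "dev": (24 * 60, 4 * 60),
    "test": (12 * 60, 2 * 60),
    "staging": (60, 60),
    "prod": (15, 30),  # Hotfix mirrors Production
}

PROMOTION_ORDER = ["dev", "test", "staging", "prod"]
for earlier, later in zip(PROMOTION_ORDER, PROMOTION_ORDER[1:]):
    rpo_e, rto_e = TARGETS[earlier]
    rpo_l, rto_l = TARGETS[later]
    # Each tier up the promotion path must be at least as strict as the last.
    assert rpo_l <= rpo_e and rto_l <= rto_e, f"{later} looser than {earlier}"

print("RPO/RTO targets tighten monotonically toward production")
```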
Multi-Region Production Topology¶
ATP Production operates in active-active mode across two Azure regions (East US primary, West Europe secondary) with automated failover and geo-replicated data.
Regional Architecture¶
graph TB
subgraph Internet
Users[Global Users]
AFD[Azure Front Door<br/>Global Load Balancer]
end
subgraph Primary Region - East US
EUS_AppGw[App Gateway WAF<br/>10.2.1.0/24]
EUS_AKS[AKS Cluster<br/>3-10 nodes<br/>10.2.2.0/23]
EUS_SQL[(SQL Primary<br/>Geo-Replication Enabled)]
EUS_Cosmos[(Cosmos DB Primary<br/>Multi-region writes)]
EUS_Storage[(Blob Storage GRS<br/>Auto-replicate to WEU)]
EUS_ServiceBus[(Service Bus Premium<br/>Geo-DR paired)]
end
subgraph Secondary Region - West Europe
WEU_AppGw[App Gateway WAF<br/>10.4.1.0/24]
WEU_AKS[AKS Cluster<br/>2-6 nodes<br/>10.4.2.0/23]
WEU_SQL[(SQL Secondary<br/>Read replica)]
WEU_Cosmos[(Cosmos DB Secondary<br/>Multi-region writes)]
WEU_Storage[(Blob Storage GRS<br/>Replica)]
WEU_ServiceBus[(Service Bus Premium<br/>Geo-DR paired)]
end
Users --> AFD
AFD -->|80% traffic| EUS_AppGw
AFD -->|20% traffic| WEU_AppGw
EUS_AppGw --> EUS_AKS
WEU_AppGw --> WEU_AKS
EUS_AKS --> EUS_SQL
EUS_AKS --> EUS_Cosmos
EUS_AKS --> EUS_Storage
EUS_AKS --> EUS_ServiceBus
WEU_AKS --> WEU_SQL
WEU_AKS --> WEU_Cosmos
WEU_AKS --> WEU_Storage
WEU_AKS --> WEU_ServiceBus
EUS_SQL -.->|Geo-Replication| WEU_SQL
EUS_Cosmos <-.->|Multi-write| WEU_Cosmos
EUS_Storage -.->|GRS Replication| WEU_Storage
EUS_ServiceBus <-.->|Geo-DR Pairing| WEU_ServiceBus
style EUS_AKS fill:#90EE90
style WEU_AKS fill:#FFD700
style AFD fill:#4CAF50
Regional Resource Naming: every Production resource carries an explicit region suffix — atp-{service}-prod-eus in East US and atp-{service}-prod-weu in West Europe (for example, atp-aks-prod-eus / atp-aks-prod-weu) — so paired resources remain unambiguous across regions.
Dev Environment DR Strategy¶
RPO: 24 hours | RTO: 4 hours
Strategy: Recreate from Infrastructure as Code (no backups; Git is source of truth).
Recovery Procedure (Dev):
#!/bin/bash
# dr-recovery-dev.sh
ENVIRONMENT="dev"
REGION="eastus"
echo "🔄 Starting DR recovery for Dev environment..."
# Step 1: Recreate infrastructure via Pulumi
cd infrastructure/
pulumi stack select atp-dev-eus
pulumi up --yes --skip-preview
if [ $? -ne 0 ]; then
echo "❌ Pulumi infrastructure deployment failed"
exit 1
fi
echo "✅ Infrastructure recreated"
# Step 2: Redeploy latest main branch code
cd ../
# Trigger latest CI/CD pipeline
az pipelines run \
--name "ATP.Ingestion" \
--branch "main" \
--organization "https://dev.azure.com/ConnectSoft" \
--project "ATP"
echo "✅ CI/CD pipeline triggered; Dev environment will be ready in ~15 minutes"
# Step 3: Seed synthetic data
dotnet run --project tools/DataSeeder \
--environment Dev \
--tenants 10 \
--records-per-tenant 1000
echo "✅ Dev environment recovery complete"
echo "Total Recovery Time: ~4 hours (infrastructure + deployment + data seeding)"
Dev DR Testing (Quarterly):
drTestProcedure:
  frequency: Quarterly
  steps:
    1. Delete Dev resource group
    2. Run dr-recovery-dev.sh script
    3. Validate all services healthy
    4. Validate synthetic data seeded
    5. Document actual RTO achieved
  acceptanceCriteria:
    - All services pass health checks
    - RTO < 4 hours
    - No data corruption
Test Environment DR Strategy¶
RPO: 12 hours | RTO: 2 hours
Strategy: Restore from daily backups (SQL, Redis snapshots).
Backup Configuration (Test):
// Test SQL Database automated backups
var testSqlDatabase = new Database("atp-sql-test-eus", new DatabaseArgs
{
DatabaseName = "ATP_Test",
ServerName = testSqlServer.Name,
ResourceGroupName = testResourceGroup.Name,
Location = "eastus",
Sku = new SkuArgs
{
Name = "S1",
Tier = "Standard"
},
// Automated backups
BackupRetentionPolicyInDays = 14, // 14-day retention
BackupStorageRedundancy = "Local", // LRS (cheaper than GRS)
// Long-term retention (optional)
LongTermRetentionPolicy = new DatabaseLongTermRetentionPolicyArgs
{
WeeklyRetention = "P4W", // 4 weeks
MonthlyRetention = "P0M", // Disabled
YearlyRetention = "P0Y" // Disabled
},
Tags = testTags
});
Recovery Procedure (Test):
<#
.SYNOPSIS
Restore ATP Test environment from last good backup.
.DESCRIPTION
Restores SQL Database and Redis Cache from automated backups.
RTO target: 2 hours
#>
param(
[Parameter(Mandatory=$false)]
[DateTime]$RestorePointInTime = (Get-Date).AddHours(-1) # Default: 1 hour ago
)
Connect-AzAccount -Identity
$resourceGroup = "ConnectSoft-ATP-Test-EUS-RG"
$region = "eastus"
$StartTime = Get-Date  # Captured so the actual RTO can be reported at the end

Write-Output "Starting Test environment DR recovery..."
Write-Output "Restore point: $RestorePointInTime"
# Step 1: Restore SQL Database from automated backup
$sqlServer = "atp-sql-test-eus"
$originalDb = "ATP_Test"
$restoredDb = "ATP_Test_Restored_$(Get-Date -Format 'yyyyMMddHHmmss')"
Write-Output "Restoring SQL Database from point-in-time: $RestorePointInTime"
Restore-AzSqlDatabase `
-ResourceGroupName $resourceGroup `
-ServerName $sqlServer `
-TargetDatabaseName $restoredDb `
-ServiceObjectiveName "S1" `
-Edition "Standard" `
-PointInTime $RestorePointInTime `
-ResourceId "/subscriptions/<sub-id>/resourceGroups/$resourceGroup/providers/Microsoft.Sql/servers/$sqlServer/databases/$originalDb"
Write-Output "✅ SQL Database restored to $restoredDb"
# Step 2: Swap restored database into place (rename restored → active).
# Azure SQL has no rename cmdlet; renames run as T-SQL against the logical server
# (requires the SqlServer module for Invoke-Sqlcmd).
$token = (Get-AzAccessToken -ResourceUrl "https://database.windows.net/").Token

Invoke-Sqlcmd -ServerInstance "$sqlServer.database.windows.net" -Database "master" `
    -AccessToken $token `
    -Query "ALTER DATABASE [$originalDb] MODIFY NAME = [${originalDb}_OLD];"

Invoke-Sqlcmd -ServerInstance "$sqlServer.database.windows.net" -Database "master" `
    -AccessToken $token `
    -Query "ALTER DATABASE [$restoredDb] MODIFY NAME = [$originalDb];"

Write-Output "✅ Database swap complete"
# Step 3: Restore Redis Cache (from RDB snapshot)
# Note: Azure Redis Cache automated backups (Premium tier only)
# Test uses Standard tier, so recreate cache instead
Write-Output "Recreating Redis Cache (no backups in Standard tier)..."
# App Services will reconnect and rebuild cache from database
# Step 4: Restart App Services (pick up new connection strings)
$appServices = Get-AzWebApp -ResourceGroupName $resourceGroup
foreach ($app in $appServices) {
Write-Output "Restarting App Service: $($app.Name)"
Restart-AzWebApp -ResourceGroupName $resourceGroup -Name $app.Name
}
# Step 5: Run smoke tests
Write-Output "Running smoke tests..."
$healthCheckUrl = "https://atp-gateway-test.azurewebsites.net/health"
$response = Invoke-RestMethod -Uri $healthCheckUrl -Method Get
if ($response.status -ne "Healthy") {
Write-Error "❌ Health check failed after DR recovery"
exit 1
}
Write-Output "✅ Test environment DR recovery complete"
Write-Output "Actual RTO: $((Get-Date) - $StartTime | Select-Object -ExpandProperty TotalMinutes) minutes"
Test DR Testing (Quarterly):
drTestProcedure:
  frequency: Quarterly
  steps:
    1. Simulate failure (delete database)
    2. Run dr-recovery-test.ps1 script
    3. Validate data integrity
    4. Validate RTO < 2 hours
    5. Document lessons learned
  acceptanceCriteria:
    - Database restored successfully
    - All tests green post-recovery
    - RTO < 2 hours
Staging Environment DR Strategy¶
RPO: 1 hour | RTO: 1 hour
Strategy: Blue-Green deployment slots for instant failover with hourly backups for data recovery.
Blue-Green Configuration (Staging):
// Staging App Service with blue-green slots
var stagingAppService = new WebApp("atp-gateway-staging-eus", new WebAppArgs
{
Name = "atp-gateway-staging-eus",
ResourceGroupName = stagingResourceGroup.Name,
Location = "eastus",
ServerFarmId = stagingAppServicePlan.Id,
SiteConfig = new SiteConfigArgs
{
AlwaysOn = true,
Http20Enabled = true,
MinTlsVersion = "1.2"
},
Tags = stagingTags
});
// Blue slot (active)
var blueSlot = new WebAppSlot("atp-gateway-staging-blue", new WebAppSlotArgs
{
Name = "atp-gateway-staging-eus/blue",
ResourceGroupName = stagingResourceGroup.Name,
Location = "eastus",
ServerFarmId = stagingAppServicePlan.Id,
SiteConfig = new SiteConfigArgs
{
AlwaysOn = true,
AppSettings = new[]
{
new NameValuePairArgs { Name = "Slot", Value = "Blue" },
new NameValuePairArgs { Name = "HealthCheckPath", Value = "/health" }
}
},
Tags = stagingTags
});
// Green slot (standby)
var greenSlot = new WebAppSlot("atp-gateway-staging-green", new WebAppSlotArgs
{
Name = "atp-gateway-staging-eus/green",
ResourceGroupName = stagingResourceGroup.Name,
Location = "eastus",
ServerFarmId = stagingAppServicePlan.Id,
SiteConfig = new SiteConfigArgs
{
AlwaysOn = true,
AppSettings = new[]
{
new NameValuePairArgs { Name = "Slot", Value = "Green" },
new NameValuePairArgs { Name = "HealthCheckPath", Value = "/health" }
}
},
Tags = stagingTags
});
Failover Procedure (Staging - Slot Swap):
#!/bin/bash
# failover-staging-blue-green.sh
RESOURCE_GROUP="ConnectSoft-ATP-Staging-EUS-RG"
APP_NAME="atp-gateway-staging-eus"
SOURCE_SLOT="green"  # standby slot (validated below, then swapped into production)
TARGET_SLOT="production"
echo "🔄 Starting blue-green failover for Staging..."
# Step 1: Validate green slot health
GREEN_HEALTH=$(curl -s https://$APP_NAME-green.azurewebsites.net/health | jq -r '.status')
if [ "$GREEN_HEALTH" != "Healthy" ]; then
echo "❌ Green slot unhealthy; aborting failover"
exit 1
fi
echo "✅ Green slot healthy; proceeding with swap"
# Step 2: Perform slot swap
az webapp deployment slot swap \
--name $APP_NAME \
--resource-group $RESOURCE_GROUP \
--slot $SOURCE_SLOT \
--target-slot $TARGET_SLOT \
--action swap
if [ $? -ne 0 ]; then
echo "❌ Slot swap failed"
exit 1
fi
echo "✅ Slot swap complete; green is now production"
# Step 3: Validate production health
sleep 30 # Allow the swapped slot to warm up
PROD_HEALTH=$(curl -s https://$APP_NAME.azurewebsites.net/health | jq -r '.status')
if [ "$PROD_HEALTH" != "Healthy" ]; then
echo "❌ Production unhealthy after swap; rolling back..."
# Rollback: Swap back to blue
az webapp deployment slot swap \
--name $APP_NAME \
--resource-group $RESOURCE_GROUP \
--slot $SOURCE_SLOT \
--target-slot $TARGET_SLOT \
--action swap
exit 1
fi
echo "✅ Staging failover complete; RTO: ~5 minutes"
Staging DR Testing (Monthly):
drTestProcedure:
  frequency: Monthly
  steps:
    1. Deploy known-good version to green slot
    2. Run failover-staging-blue-green.sh
    3. Validate production slot serving traffic
    4. Run regression tests
    5. Swap back to blue (or keep green as new production)
  acceptanceCriteria:
    - Swap completes in < 2 minutes
    - Zero downtime during swap
    - All tests pass post-swap
Production Environment DR Strategy¶
RPO: 15 minutes | RTO: 30 minutes
Strategy: Active-active multi-region with Azure Front Door global load balancing and automated regional failover.
Multi-Region Infrastructure (Pulumi)¶
Primary Region (East US):
// Primary Production Region (East US)
var prodEUSResourceGroup = new ResourceGroup("atp-prod-eus-rg", new ResourceGroupArgs
{
ResourceGroupName = "ConnectSoft-ATP-Prod-EUS-RG",
Location = "eastus",
Tags = prodTags
});
var prodEUSVNet = new VirtualNetwork("atp-vnet-prod-eus", new VirtualNetworkArgs
{
VirtualNetworkName = "atp-vnet-prod-eus",
ResourceGroupName = prodEUSResourceGroup.Name,
Location = "eastus",
AddressSpace = new AddressSpaceArgs
{
AddressPrefixes = new[] { "10.2.0.0/16" }
},
// ... subnets configuration (see previous cycle)
Tags = prodTags
});
// Primary AKS Cluster (3-10 nodes)
var prodEUSAKS = new ManagedCluster("atp-aks-prod-eus", new ManagedClusterArgs
{
ResourceName = "atp-aks-prod-eus",
ResourceGroupName = prodEUSResourceGroup.Name,
Location = "eastus",
// ... AKS configuration (see Infrastructure as Code cycle)
AgentPoolProfiles = new[]
{
new ManagedClusterAgentPoolProfileArgs
{
Name = "system",
Count = 3, // Always 3 system nodes
MinCount = 3,
MaxCount = 10,
EnableAutoScaling = true,
VmSize = "Standard_D4s_v5",
AvailabilityZones = new[] { "1", "2", "3" } // Zone-redundant in region
}
},
Tags = prodTags
});
// Primary SQL Database with geo-replication
var prodEUSSQL = new Server("atp-sql-prod-eus", new ServerArgs
{
ServerName = "atp-sql-prod-eus",
ResourceGroupName = prodEUSResourceGroup.Name,
Location = "eastus",
AdministratorLogin = "sqladmin",
AdministratorLoginPassword = sqlAdminPassword,
Version = "12.0",
PublicNetworkAccess = "Disabled", // Private endpoint only
Tags = prodTags
});
var prodEUSDatabase = new Database("atp-db-prod-eus", new DatabaseArgs
{
DatabaseName = "ATP_Prod",
ServerName = prodEUSSQL.Name,
ResourceGroupName = prodEUSResourceGroup.Name,
Location = "eastus",
Sku = new SkuArgs
{
Name = "P4", // Premium 500 DTU
Tier = "Premium"
},
MaxSizeBytes = 1099511627776, // 1 TB
ZoneRedundant = true, // Zone-redundant within region
ReadScale = "Enabled", // Read replicas for load distribution
BackupRetentionPolicyInDays = 35, // 35-day retention
BackupStorageRedundancy = "Geo", // Geo-redundant backups
Tags = prodTags
});
Secondary Region (West Europe):
// Secondary Production Region (West Europe)
var prodWEUResourceGroup = new ResourceGroup("atp-prod-weu-rg", new ResourceGroupArgs
{
ResourceGroupName = "ConnectSoft-ATP-Prod-WEU-RG",
Location = "westeurope",
Tags = prodTags
});
var prodWEUVNet = new VirtualNetwork("atp-vnet-prod-weu", new VirtualNetworkArgs
{
VirtualNetworkName = "atp-vnet-prod-weu",
ResourceGroupName = prodWEUResourceGroup.Name,
Location = "westeurope",
AddressSpace = new AddressSpaceArgs
{
AddressPrefixes = new[] { "10.4.0.0/16" } // Different address space
},
// ... subnets configuration
Tags = prodTags
});
// Secondary AKS Cluster (2-6 nodes, smaller than primary)
var prodWEUAKS = new ManagedCluster("atp-aks-prod-weu", new ManagedClusterArgs
{
ResourceName = "atp-aks-prod-weu",
ResourceGroupName = prodWEUResourceGroup.Name,
Location = "westeurope",
AgentPoolProfiles = new[]
{
new ManagedClusterAgentPoolProfileArgs
{
Name = "system",
Count = 2, // Smaller secondary region
MinCount = 2,
MaxCount = 6,
EnableAutoScaling = true,
VmSize = "Standard_D4s_v5",
AvailabilityZones = new[] { "1", "2", "3" }
}
},
Tags = prodTags
});
// SQL Geo-Replication (East US → West Europe)
// A geo-replica is created as a secondary database on the partner server
// (the ReplicationLink resource itself is read-only). Assumes prodWEUSQL, the
// West Europe SQL server, is declared alongside the other secondary-region resources above.
var sqlGeoReplica = new Database("atp-db-prod-weu", new DatabaseArgs
{
DatabaseName = "ATP_Prod",
ServerName = prodWEUSQL.Name,
ResourceGroupName = prodWEUResourceGroup.Name,
Location = "westeurope",
CreateMode = "Secondary", // Asynchronous geo-replication from the primary
SourceDatabaseId = prodEUSDatabase.Id,
Sku = new SkuArgs { Name = "P4", Tier = "Premium" },
Tags = prodTags
});
Geo-Replication Setup:
# SQL Geo-Replication (Primary → Secondary)
primary:
  region: East US
  role: Primary
  readWrite: true
  replicationLag: < 5 seconds (typically)
secondary:
  region: West Europe
  role: Secondary (readable)
  readWrite: false (read-only until failover)
  replicationLag: < 15 seconds (RPO guarantee)
# Cosmos DB Multi-Region Writes
cosmosDB:
  writeRegions:
    - East US (primary)
    - West Europe (secondary)
  readRegions:
    - East US
    - West Europe
    - Southeast Asia (read-only)
  consistencyLevel: Session # Balance consistency vs. availability
  automaticFailover: true
  multiRegionWrites: true
# Blob Storage GRS (Geo-Redundant Storage)
storage:
  primaryRegion: East US
  secondaryRegion: West Europe
  replicationType: GRS (Geo-Redundant Storage)
  readAccess: RA-GRS (Read-Access Geo-Redundant)
  replicationLag: < 15 minutes (RPO guarantee)
# Service Bus Geo-DR
serviceBus:
  primaryNamespace: atp-servicebus-prod-eus
  secondaryNamespace: atp-servicebus-prod-weu
  alias: atp-servicebus-prod # Abstraction over primary/secondary
  failoverType: Automated (metadata only)
  messageReplication: Manual application-level replication
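The per-service replication lags above can be compared against their stated guarantees in monitoring. A hedged Python sketch (the function, service keys, and observed values are illustrative):

```python
# Guaranteed replication lag per service (seconds), mirroring the values above.
RPO_GUARANTEES_S = {
    "sql": 15,        # SQL geo-replica: < 15 seconds
    "blob": 15 * 60,  # Blob RA-GRS: < 15 minutes
}

def rpo_violations(observed_lag_s: dict) -> list:
    """Return the services whose observed replication lag exceeds the guarantee."""
    return [svc for svc, lag in observed_lag_s.items()
            if lag > RPO_GUARANTEES_S.get(svc, float("inf"))]

print(rpo_violations({"sql": 4, "blob": 120}))    # [] -- within guarantees
print(rpo_violations({"sql": 30, "blob": 120}))   # ['sql'] -- lag breached guarantee
```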
Azure Front Door Configuration (Global Load Balancing)¶
Purpose: Distribute traffic across regions (80% primary, 20% secondary) with automatic failover on regional outage.
// Azure Front Door for global traffic distribution
var frontDoor = new FrontDoor("atp-frontdoor-prod", new FrontDoorArgs
{
FrontDoorName = "atp-prod",
ResourceGroupName = sharedResourceGroup.Name,
Location = "global",
EnabledState = "Enabled",
FrontendEndpoints = new[]
{
new FrontendEndpointArgs
{
Name = "atp-frontend",
HostName = "atp-prod.azurefd.net",
SessionAffinityEnabledState = "Disabled",
WebApplicationFirewallPolicyLink = new FrontendEndpointUpdateParametersWebApplicationFirewallPolicyLinkArgs
{
Id = wafPolicy.Id
}
}
},
BackendPools = new[]
{
new BackendPoolArgs
{
Name = "atp-backend-pool",
LoadBalancingSettings = new SubResourceArgs { Id = "loadBalancingSettings1" },
HealthProbeSettings = new SubResourceArgs { Id = "healthProbeSettings1" },
Backends = new[]
{
// Primary region (East US) - Weight 8 (80% traffic)
new BackendArgs
{
Address = "atp-appgw-prod-eus.eastus.cloudapp.azure.com",
HttpPort = 80,
HttpsPort = 443,
Priority = 1,
Weight = 8, // 80% traffic
BackendHostHeader = "api.atp.connectsoft.com",
EnabledState = "Enabled"
},
// Secondary region (West Europe) - Weight 2 (20% traffic)
new BackendArgs
{
Address = "atp-appgw-prod-weu.westeurope.cloudapp.azure.com",
HttpPort = 80,
HttpsPort = 443,
Priority = 1,
Weight = 2, // 20% traffic
BackendHostHeader = "api.atp.connectsoft.com",
EnabledState = "Enabled"
}
}
}
},
LoadBalancingSettings = new[]
{
new LoadBalancingSettingsModelArgs
{
Name = "loadBalancingSettings1",
SampleSize = 4,
SuccessfulSamplesRequired = 2,
AdditionalLatencyMilliseconds = 50 // Prefer lower latency
}
},
HealthProbeSettings = new[]
{
new HealthProbeSettingsModelArgs
{
Name = "healthProbeSettings1",
Path = "/health",
Protocol = "Https",
IntervalInSeconds = 30,
HealthProbeMethod = "GET",
EnabledState = "Enabled"
}
},
RoutingRules = new[]
{
new RoutingRuleArgs
{
Name = "atp-routing-rule",
FrontendEndpoints = new[] { new SubResourceArgs { Id = "atp-frontend" } },
AcceptedProtocols = new[] { "Https" },
PatternsToMatch = new[] { "/*" },
RouteConfiguration = new ForwardingConfigurationArgs
{
OdataType = "#Microsoft.Azure.FrontDoor.Models.FrontdoorForwardingConfiguration",
BackendPool = new SubResourceArgs { Id = "atp-backend-pool" }
},
EnabledState = "Enabled"
}
},
Tags = prodTags
});
Traffic Distribution (Normal Operation):
Global Users
↓
Azure Front Door (atp-prod.azurefd.net)
↓ 80% traffic → East US (Primary)
↓ 20% traffic → West Europe (Secondary)
↓
Both regions serve traffic (active-active)
Traffic Distribution (East US Failure):
Global Users
↓
Azure Front Door (health probe detects East US down)
↓ 100% traffic → West Europe (Secondary promoted)
↓
West Europe serves all traffic (failover)
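Both distributions above follow from Front Door's weighted round-robin over enabled backends. A quick sketch of the arithmetic (plain Python, not an Azure SDK call):

```python
def traffic_shares(backends: dict) -> dict:
    """Compute per-backend traffic fraction from Front Door weights.
    Only enabled backends count -- a backend ejected by health probes
    (as during regional failover) is removed from the pool."""
    enabled = {name: w for name, (w, up) in backends.items() if up}
    total = sum(enabled.values())
    return {name: w / total for name, w in enabled.items()}

# Normal operation: weight 8 (East US) vs. weight 2 (West Europe)
print(traffic_shares({"eus": (8, True), "weu": (2, True)}))  # {'eus': 0.8, 'weu': 0.2}

# East US failure: probes disable the backend -> 100% to West Europe
print(traffic_shares({"eus": (8, False), "weu": (2, True)}))  # {'weu': 1.0}
```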
Automated Failover Mechanism¶
Health Probe Configuration:
# Azure Front Door health probe settings
healthProbe:
  path: /health
  protocol: HTTPS
  interval: 30 seconds
  method: GET
  expectedStatusCodes: [200]
  failureThreshold:
    consecutiveFailures: 3 # 3 failures (90 seconds) triggers failover
    action:
      - Mark backend as unhealthy
      - Remove from load balancing pool
      - Route 100% traffic to healthy region
      - Send alert to platform team
      - Create incident ticket (P1)
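The ejection logic above (3 consecutive failed probes at a 30-second interval, so roughly 90 seconds to remove a backend) can be modeled in a few lines. A minimal sketch of the behavior, not Front Door's actual implementation:

```python
class BackendHealth:
    """Model of consecutive-failure ejection: a backend is marked unhealthy
    after `threshold` consecutive probe failures and recovers on the first
    successful probe."""
    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.consecutive_failures = 0
        self.healthy = True

    def record_probe(self, status_code: int) -> bool:
        if status_code == 200:
            self.consecutive_failures = 0
            self.healthy = True
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.threshold:
                self.healthy = False
        return self.healthy

b = BackendHealth()
print(b.record_probe(500))  # True  (1st failure -- still in pool)
print(b.record_probe(500))  # True  (2nd failure)
print(b.record_probe(500))  # False (3rd failure -- ejected, ~90s elapsed)
```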
Automated Failover Logic (Azure Monitor Alert):
// Azure Monitor alert for regional failure
// (regionalFailoverActionGroup, declared further below, must precede this alert in the actual program)
var regionalFailureAlert = new MetricAlert("atp-regional-failure-alert", new MetricAlertArgs
{
AlertRuleName = "atp-regional-failure-prod",
ResourceGroupName = sharedResourceGroup.Name,
Location = "global",
Description = "Alert when primary region (East US) is unavailable",
Severity = 0, // Critical
Enabled = true,
Scopes = new[] { frontDoor.Id },
EvaluationFrequency = "PT1M", // Evaluate every minute
WindowSize = "PT5M", // 5-minute window
Criteria = new MetricAlertMultipleResourceMultipleMetricCriteriaArgs
{
OdataType = "Microsoft.Azure.Monitor.MultipleResourceMultipleMetricCriteria",
AllOf = new[]
{
new MetricCriteriaArgs
{
Name = "BackendHealthPercentage",
MetricName = "BackendHealthPercentage",
MetricNamespace = "Microsoft.Network/frontdoors",
Operator = "LessThan",
Threshold = 50, // Less than 50% backends healthy
TimeAggregation = "Average",
Dimensions = new[]
{
new MetricDimensionArgs
{
Name = "Backend",
Operator = "Include",
Values = new[] { "atp-appgw-prod-eus.eastus.cloudapp.azure.com" }
}
}
}
}
},
Actions = new[]
{
new MetricAlertActionArgs
{
ActionGroupId = regionalFailoverActionGroup.Id
}
}
});
// Action Group for regional failover
var regionalFailoverActionGroup = new ActionGroup("atp-regional-failover-action-group", new ActionGroupArgs
{
ActionGroupName = "atp-regional-failover",
ResourceGroupName = sharedResourceGroup.Name,
Location = "global",
Enabled = true,
ShortName = "RegFailover",
EmailReceivers = new[]
{
new EmailReceiverArgs
{
Name = "PlatformTeam",
EmailAddress = "platform-team@connectsoft.example",
UseCommonAlertSchema = true
},
new EmailReceiverArgs
{
Name = "IncidentManager",
EmailAddress = "incident-manager@connectsoft.example",
UseCommonAlertSchema = true
}
},
SmsReceivers = new[]
{
new SmsReceiverArgs
{
Name = "OnCallEngineer",
CountryCode = "1",
PhoneNumber = "+1234567890"
}
},
WebhookReceivers = new[]
{
new WebhookReceiverArgs
{
Name = "PagerDuty",
ServiceUri = "https://events.pagerduty.com/integration/<key>/enqueue",
UseCommonAlertSchema = true
},
new WebhookReceiverArgs
{
Name = "RunFailoverRunbook",
ServiceUri = "https://atp-automation-eus.azure-automation.net/webhooks/<webhook-token>",
UseCommonAlertSchema = false
}
}
});
Failover Runbook (Automated):
<#
.SYNOPSIS
Automated regional failover for ATP Production.
.DESCRIPTION
Triggered by Azure Monitor when primary region (East US) is unavailable.
Promotes secondary region (West Europe) to primary.
.NOTES
Target RTO: 30 minutes
#>
param(
[Parameter(Mandatory=$true)]
[string]$FailedRegion, # "eastus" or "westeurope"
[Parameter(Mandatory=$false)]
[string]$Reason = "Automated health probe failure"
)
$StartTime = Get-Date
Connect-AzAccount -Identity
Write-Output "🚨 REGIONAL FAILOVER INITIATED"
Write-Output "Failed Region: $FailedRegion"
Write-Output "Reason: $Reason"
Write-Output "Start Time: $StartTime"
# Determine target region
$targetRegion = if ($FailedRegion -eq "eastus") { "westeurope" } else { "eastus" }
$targetRG = "ConnectSoft-ATP-Prod-$($targetRegion.ToUpper() -replace 'EASTUS','EUS' -replace 'WESTEUROPE','WEU')-RG"
# Step 1: Promote SQL secondary to primary (if East US failed)
if ($FailedRegion -eq "eastus") {
Write-Output "Promoting SQL secondary (West Europe) to primary..."
$failoverGroup = Get-AzSqlDatabaseFailoverGroup `
-ResourceGroupName "ConnectSoft-ATP-Prod-WEU-RG" `
-ServerName "atp-sql-prod-weu" `
-FailoverGroupName "atp-sql-failover-group"
# Forced failover (allow data loss if primary completely unavailable)
$failoverGroup | Switch-AzSqlDatabaseFailoverGroup -AllowDataLoss
Write-Output "✅ SQL failover complete (RPO: < 15 minutes)"
}
# Step 2: Update Azure Front Door weights (route 100% to healthy region)
Write-Output "Updating Front Door backend weights..."
$frontDoorName = "atp-prod"
$backendPoolName = "atp-backend-pool"
# Disable failed region backend (PowerShell continues lines with backticks, not backslashes)
$regionCode = if ($FailedRegion -eq "eastus") { "eus" } else { "weu" }
az network front-door backend-pool backend update `
--front-door-name $frontDoorName `
--pool-name $backendPoolName `
--resource-group "ConnectSoft-ATP-Shared-RG" `
--address "atp-appgw-prod-$regionCode.$FailedRegion.cloudapp.azure.com" `
--enabled-state Disabled
Write-Output "✅ Failed region removed from load balancing pool"
# Step 3: Scale up secondary region AKS (handle 100% traffic)
if ($targetRegion -eq "westeurope") {
Write-Output "Scaling up West Europe AKS to handle full traffic..."
az aks nodepool scale `
--resource-group $targetRG `
--cluster-name "atp-aks-prod-weu" `
--name "user" `
--node-count 10 # Scale to max capacity
Write-Output "✅ Secondary region scaled to 10 nodes"
}
# Step 4: Validate secondary region health
Write-Output "Validating secondary region health..."
$healthCheckUrl = "https://api.atp.connectsoft.com/health" # Front Door routes to healthy region
$response = Invoke-RestMethod -Uri $healthCheckUrl -Method Get
if ($response.status -ne "Healthy") {
Write-Error "❌ Secondary region unhealthy; manual intervention required"
exit 1
}
Write-Output "✅ Secondary region healthy and serving traffic"
# Step 5: Create incident ticket
$incident = az boards work-item create `
--title "Regional Failover: $FailedRegion Unavailable" `
--type "Incident" `
--description "Automated failover executed from $FailedRegion to $targetRegion.`n`nReason: $Reason`n`nStart Time: $StartTime`n`nStatus: Failover complete; $targetRegion serving 100% traffic" `
--assigned-to "platform-team@connectsoft.example" `
--fields Priority=1 Severity="1 - Critical" `
--output json | ConvertFrom-Json
Write-Output "✅ Incident ticket created: $($incident.id)"
# Step 6: Update status page
$statusPageUpdate = @{
status = "Degraded"
message = "ATP experienced a regional outage in $FailedRegion. Traffic has been automatically rerouted to $targetRegion. All services are operational."
timestamp = (Get-Date).ToUniversalTime().ToString("o")
} | ConvertTo-Json
Invoke-RestMethod `
-Uri "https://status.atp.connectsoft.com/api/incidents" `
-Method POST `
-Body $statusPageUpdate `
-ContentType "application/json" `
-Headers @{ "Authorization" = "Bearer $(Get-AutomationVariable -Name 'StatusPageApiKey')" }
# Step 7: Send notifications
$emailBody = @"
ATP Regional Failover Notification
Failed Region: $FailedRegion
Target Region: $targetRegion
Start Time: $StartTime
Completion Time: $(Get-Date)
RTO Achieved: $((Get-Date) - $StartTime | Select-Object -ExpandProperty TotalMinutes) minutes
All services are operational. No action required from tenants.
Incident Ticket: https://dev.azure.com/ConnectSoft/ATP/_workitems/edit/$($incident.id)
Status Page: https://status.atp.connectsoft.com
"@
Send-MailMessage `
-From "platform-team@connectsoft.example" `
-To "cto@connectsoft.example", "platform-team@connectsoft.example" `
-Subject "🚨 ATP Regional Failover: $FailedRegion → $targetRegion" `
-Body $emailBody `
-SmtpServer "smtp.office365.com" `
-UseSsl `
-Credential (Get-AutomationPSCredential -Name "EmailCredential")
$elapsed = (Get-Date) - $StartTime
Write-Output "✅ REGIONAL FAILOVER COMPLETE"
Write-Output "Total RTO: $($elapsed.TotalMinutes) minutes (Target: 30 minutes)"
Failback Procedure (Return to Primary Region)¶
Purpose: Restore normal operations to primary region (East US) after recovery.
<#
.SYNOPSIS
Failback to primary region after DR event.
.DESCRIPTION
Restores primary region (East US) and rebalances traffic.
.NOTES
Execute only after primary region fully recovered and validated.
#>
param(
[Parameter(Mandatory=$false)]
[string]$PrimaryRegion = "eastus"
)
$StartTime = Get-Date
Connect-AzAccount -Identity
Write-Output "🔄 REGIONAL FAILBACK INITIATED"
Write-Output "Primary Region: $PrimaryRegion"
# Step 1: Validate primary region infrastructure health
Write-Output "Validating primary region infrastructure..."
$aksCluster = Get-AzAksCluster `
-ResourceGroupName "ConnectSoft-ATP-Prod-EUS-RG" `
-Name "atp-aks-prod-eus"
if ($aksCluster.PowerState.Code -ne "Running") {
Write-Error "❌ Primary AKS cluster not running; aborting failback"
exit 1
}
Write-Output "✅ Primary infrastructure healthy"
# Step 2: Synchronize SQL databases (secondary → primary)
Write-Output "Synchronizing databases..."
# Force sync from secondary (now primary) to original primary (now secondary)
$failoverGroup = Get-AzSqlDatabaseFailoverGroup `
-ResourceGroupName "ConnectSoft-ATP-Prod-WEU-RG" `
-ServerName "atp-sql-prod-weu" `
-FailoverGroupName "atp-sql-failover-group"
# Planned failover (no data loss)
$failoverGroup | Switch-AzSqlDatabaseFailoverGroup
Write-Output "✅ SQL databases synchronized; East US is now primary again"
# Step 3: Gradually shift traffic back to primary (phased approach)
Write-Output "Phasing traffic back to primary region..."
# Phase 1: 20% to primary, 80% to secondary
# (both weights must be set to produce the stated split; PowerShell uses backticks for continuation)
az network front-door backend-pool backend update `
--front-door-name "atp-prod" `
--pool-name "atp-backend-pool" `
--resource-group "ConnectSoft-ATP-Shared-RG" `
--address "atp-appgw-prod-eus.eastus.cloudapp.azure.com" `
--weight 2 `
--enabled-state Enabled
az network front-door backend-pool backend update `
--front-door-name "atp-prod" `
--pool-name "atp-backend-pool" `
--resource-group "ConnectSoft-ATP-Shared-RG" `
--address "atp-appgw-prod-weu.westeurope.cloudapp.azure.com" `
--weight 8
Start-Sleep -Seconds 600 # Monitor for 10 minutes
# Phase 2: 50% to primary, 50% to secondary
az network front-door backend-pool backend update `
--front-door-name "atp-prod" `
--pool-name "atp-backend-pool" `
--resource-group "ConnectSoft-ATP-Shared-RG" `
--address "atp-appgw-prod-eus.eastus.cloudapp.azure.com" `
--weight 5
az network front-door backend-pool backend update `
--front-door-name "atp-prod" `
--pool-name "atp-backend-pool" `
--resource-group "ConnectSoft-ATP-Shared-RG" `
--address "atp-appgw-prod-weu.westeurope.cloudapp.azure.com" `
--weight 5
Start-Sleep -Seconds 600 # Monitor for 10 minutes
# Phase 3: 80% to primary, 20% to secondary (normal state)
az network front-door backend-pool backend update `
--front-door-name "atp-prod" `
--pool-name "atp-backend-pool" `
--resource-group "ConnectSoft-ATP-Shared-RG" `
--address "atp-appgw-prod-eus.eastus.cloudapp.azure.com" `
--weight 8
az network front-door backend-pool backend update `
--front-door-name "atp-prod" `
--pool-name "atp-backend-pool" `
--resource-group "ConnectSoft-ATP-Shared-RG" `
--address "atp-appgw-prod-weu.westeurope.cloudapp.azure.com" `
--weight 2
Write-Output "✅ Traffic rebalanced to normal state (80/20)"
# Step 4: Scale down secondary region (cost optimization)
az aks nodepool scale `
--resource-group "ConnectSoft-ATP-Prod-WEU-RG" `
--cluster-name "atp-aks-prod-weu" `
--name "user" `
--node-count 3 # Return to normal capacity
Write-Output "✅ Secondary region scaled back to normal"
# Step 5: Close incident ticket
$incidentId = (az boards work-item query `
--wiql "SELECT [System.Id] FROM WorkItems WHERE [System.Title] CONTAINS 'Regional Failover' AND [System.State] = 'Active'" `
--output json | ConvertFrom-Json)[0].id
az boards work-item update `
--id $incidentId `
--state "Resolved" `
--fields "Microsoft.VSTS.Common.ResolvedReason=Fixed" `
"Microsoft.VSTS.Common.ResolvedBy=automation@connectsoft.example"
$elapsed = (Get-Date) - $StartTime
Write-Output "✅ REGIONAL FAILBACK COMPLETE"
Write-Output "Total Failback Time: $($elapsed.TotalMinutes) minutes"
DR Testing Schedule & Procedures¶
Production DR Testing (Weekly, Non-Disruptive):
drTestingStrategy:
  frequency: Weekly (every Sunday 2 AM ET)
  type: Non-disruptive (traffic shifting only)
  procedure:
    1. Gradually shift traffic from primary → secondary (10% increments)
    2. Monitor metrics for 1 hour (error rate, latency, throughput)
    3. If metrics healthy, continue; if degraded, rollback
    4. Once 100% traffic on secondary, validate for 30 minutes
    5. Shift traffic back to primary (reverse process)
    6. Document actual RTO/RPO achieved
  acceptanceCriteria:
    - Zero errors during traffic shift
    - Latency p95 < 1.2x baseline
    - Successful validation queries in secondary region
    - RTO < 30 minutes (measured)
  rollback:
    - Automatic if error rate > 1%
    - Automatic if latency p99 > 3x baseline
    - Manual abort via Azure Portal
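The automatic-rollback conditions above reduce to a single predicate evaluated after each phase. A hedged sketch (function name and inputs are illustrative):

```python
def should_rollback(error_rate: float,
                    latency_p99_ms: float,
                    baseline_p99_ms: float) -> bool:
    """Abort the DR test and restore normal weights if the error rate
    exceeds 1% or p99 latency exceeds 3x the pre-test baseline."""
    return error_rate > 0.01 or latency_p99_ms > 3 * baseline_p99_ms

print(should_rollback(0.005, 450, 200))  # False -- within both thresholds
print(should_rollback(0.02, 450, 200))   # True  -- error rate > 1%
print(should_rollback(0.001, 700, 200))  # True  -- p99 > 3x baseline
```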
DR Testing Automation (Azure Function):
// Automated DR testing (weekly)
[FunctionName("WeeklyDRTest")]
public async Task RunAsync(
[TimerTrigger("0 0 2 * * 0")] TimerInfo timer, // Every Sunday 02:00 (UTC by default; set WEBSITE_TIME_ZONE to run in ET)
ILogger log)
{
log.LogInformation("Starting weekly DR test...");
var startTime = DateTime.UtcNow;
var frontDoorClient = new FrontDoorManagementClient(new DefaultAzureCredential());
var testResults = new List<string>();
try
{
// Phase 1: Shift 20% traffic to secondary
await ShiftTrafficAsync(frontDoorClient, primaryWeight: 80, secondaryWeight: 20);
await Task.Delay(TimeSpan.FromMinutes(10));
await ValidateMetricsAsync(log);
// Phase 2: Shift 50% traffic to secondary
await ShiftTrafficAsync(frontDoorClient, primaryWeight: 50, secondaryWeight: 50);
await Task.Delay(TimeSpan.FromMinutes(10));
await ValidateMetricsAsync(log);
// Phase 3: Shift 100% traffic to secondary
await ShiftTrafficAsync(frontDoorClient, primaryWeight: 0, secondaryWeight: 100);
await Task.Delay(TimeSpan.FromMinutes(30)); // Validate for 30 minutes
await ValidateMetricsAsync(log);
// Phase 4: Shift back to normal (80/20)
await ShiftTrafficAsync(frontDoorClient, primaryWeight: 80, secondaryWeight: 20);
var elapsed = DateTime.UtcNow - startTime;
log.LogInformation($"✅ DR test complete. Duration: {elapsed.TotalMinutes:F1} minutes");
// Create test report
await CreateDRTestReportAsync(elapsed, testResults, success: true);
}
catch (Exception ex)
{
log.LogError(ex, "❌ DR test failed");
// Rollback to normal state
await ShiftTrafficAsync(frontDoorClient, primaryWeight: 80, secondaryWeight: 20);
await CreateDRTestReportAsync(DateTime.UtcNow - startTime, testResults, success: false);
throw;
}
}
private async Task ValidateMetricsAsync(ILogger log)
{
var appInsightsClient = new ApplicationInsightsDataClient(new DefaultAzureCredential());
// Query error rate
var errorRate = await appInsightsClient.Metrics.GetAsync(
appId: "atp-appinsights-prod-eus", // In practice, the App Insights Application ID (a GUID)
metricId: "requests/failed",
timespan: "PT10M", // Last 10 minutes
aggregation: new[] { "avg" }
);
var avgErrorRate = errorRate.Value.Segments[0].Metrics["requests/failed"].Avg;
if (avgErrorRate > 0.01) // >1% error rate
{
throw new Exception($"Error rate too high during DR test: {avgErrorRate:P2}");
}
log.LogInformation($"✅ Metrics healthy: Error rate {avgErrorRate:P2}");
}
Geo-Replication Configuration¶
SQL Database Geo-Replication¶
// SQL Failover Group (automatic failover)
var sqlFailoverGroup = new FailoverGroup("atp-sql-failover-group", new FailoverGroupArgs
{
FailoverGroupName = "atp-sql-failover-group",
ResourceGroupName = prodEUSResourceGroup.Name,
ServerName = prodEUSSQL.Name,
// Partner server (secondary region)
PartnerServers = new[]
{
new PartnerInfoArgs
{
Id = prodWEUSQL.Id
}
},
// Databases to replicate
Databases = new[]
{
prodEUSDatabase.Id
},
// Read-write listener endpoint
ReadWriteEndpoint = new FailoverGroupReadWriteEndpointArgs
{
FailoverPolicy = "Automatic",
FailoverWithDataLossGracePeriodMinutes = 60 // Allow 1 hour for primary to recover before forcing failover
},
// Read-only listener endpoint (route to nearest replica)
ReadOnlyEndpoint = new FailoverGroupReadOnlyEndpointArgs
{
FailoverPolicy = "Disabled" // Read-only queries don't failover
},
Tags = prodTags
});
SQL Connection String (Failover Group Aware):
{
"ConnectionStrings": {
"DefaultConnection": "Server=atp-sql-failover-group.database.windows.net;Database=ATP_Prod;User Id=$(DbUser);Password=$(DbPassword);MultipleActiveResultSets=true;ApplicationIntent=ReadWrite"
}
}
Explanation: Application connects to failover group endpoint (atp-sql-failover-group.database.windows.net) which automatically routes to current primary region. Upon failover, DNS updates to point to new primary (WEU) with zero application code changes.
Cosmos DB Multi-Region Configuration¶
// Cosmos DB with multi-region writes (active-active)
var cosmosAccount = new DatabaseAccount("atp-cosmos-prod", new DatabaseAccountArgs
{
AccountName = "atp-cosmos-prod",
ResourceGroupName = sharedResourceGroup.Name,
Location = "eastus", // Primary location
// Multi-region write configuration
Locations = new[]
{
new LocationArgs
{
LocationName = "eastus",
FailoverPriority = 0, // Primary
IsZoneRedundant = true
},
new LocationArgs
{
LocationName = "westeurope",
FailoverPriority = 1, // Secondary
IsZoneRedundant = true
},
new LocationArgs
{
LocationName = "southeastasia",
FailoverPriority = 2, // Read-only tertiary
IsZoneRedundant = false
}
},
// Consistency level (balance consistency vs. availability)
ConsistencyPolicy = new ConsistencyPolicyArgs
{
DefaultConsistencyLevel = "Session", // Session consistency for ATP
MaxIntervalInSeconds = 5,
MaxStalenessPrefix = 100
},
// Enable multi-region writes
EnableMultipleWriteLocations = true,
EnableAutomaticFailover = true,
// Backup configuration
BackupPolicy = new ContinuousModeBackupPolicyArgs
{
Type = "Continuous", // Continuous backup (point-in-time restore)
ContinuousModeProperties = new ContinuousModePropertiesArgs
{
Tier = "Continuous7Days" // 7-day continuous backup
}
},
Tags = prodTags
});
Cosmos DB Failover (Automated):
- Health Monitoring: Azure monitors Cosmos DB availability per region.
- Automatic Failover: If primary region unavailable >5 minutes, promote secondary to primary.
- Multi-Write: Both regions accept writes simultaneously (conflict resolution via last-write-wins).
- RPO: Near-zero (synchronous replication within seconds).
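Last-write-wins resolution keeps whichever version of a conflicting document carries the later commit timestamp (Cosmos DB's default policy resolves on the system `_ts` property). A minimal sketch of the rule, not Cosmos DB's internal implementation:

```python
def last_write_wins(replica_a: dict, replica_b: dict) -> dict:
    """Resolve a multi-region write conflict by keeping the document with
    the higher commit timestamp (`_ts`, epoch seconds, Cosmos DB-style)."""
    return replica_a if replica_a["_ts"] >= replica_b["_ts"] else replica_b

# Same audit event written concurrently in both regions; the later write wins
eus_write = {"id": "evt-1", "status": "sealed",  "_ts": 1700000050}
weu_write = {"id": "evt-1", "status": "pending", "_ts": 1700000040}
print(last_write_wins(eus_write, weu_write)["status"])  # sealed
```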
Blob Storage Geo-Replication¶
// Blob Storage with Read-Access Geo-Redundant Storage (RA-GRS)
var prodStorage = new StorageAccount("atpstorageprodeus", new StorageAccountArgs
{
AccountName = "atpstorageprodeus",
ResourceGroupName = prodEUSResourceGroup.Name,
Location = "eastus",
Sku = new SkuArgs
{
Name = "Standard_RAGRS" // Read-Access Geo-Redundant Storage
},
Kind = "StorageV2",
// Blob configuration
BlobServices = new BlobServicesArgs
{
// ... (WORM, versioning, etc.)
},
// Geo-replication to West Europe is automatic with Standard_RAGRS;
// GeoReplicationStats is a read-only property reported by the platform,
// not something set at deployment time.
Tags = prodTags
});
Storage Failover (Manual):
# Initiate storage account failover to secondary region
az storage account failover \
--name atpstorageprodeus \
--resource-group ConnectSoft-ATP-Prod-EUS-RG \
--yes
# Note: This makes West Europe the new primary
# After failover the account becomes locally redundant (LRS); geo-replication
# must be re-enabled (convert back to GRS/RA-GRS) to get a new secondary
Service Bus Geo-Disaster Recovery¶
// Service Bus Geo-DR pairing
var serviceBusGeoAlias = new DisasterRecoveryConfig("atp-servicebus-geo-dr", new DisasterRecoveryConfigArgs
{
Alias = "atp-servicebus-prod", // Stable endpoint name
ResourceGroupName = prodEUSResourceGroup.Name,
NamespaceName = prodEUSServiceBus.Name,
// Partner namespace (secondary region)
PartnerNamespace = prodWEUServiceBus.Id,
Tags = prodTags
});
Service Bus Connection String (Geo-DR Aware):
{
"ConnectionStrings": {
"ServiceBus": "Endpoint=sb://atp-servicebus-prod.servicebus.windows.net/;SharedAccessKeyName=RootManageSharedAccessKey;SharedAccessKey=$(ServiceBusKey)"
}
}
Explanation: Application connects to Geo-DR alias (atp-servicebus-prod.servicebus.windows.net) which routes to active region. Upon failover, alias DNS switches to secondary region within minutes.
High Availability Within Region (Zone Redundancy)¶
Purpose: Protect against single datacenter failures within a region using Azure Availability Zones.
Zone-Redundant Resources (Production East US):
// AKS with zone-redundant node pools
var aksNodePool = new ManagedClusterAgentPoolProfileArgs
{
Name = "system",
Count = 3,
VmSize = "Standard_D4s_v5",
AvailabilityZones = new[] { "1", "2", "3" }, // Spread across 3 zones
EnableAutoScaling = true,
MinCount = 3,
MaxCount = 10
};
// SQL Database with zone redundancy
var sqlDatabase = new Database("atp-db-prod-eus", new DatabaseArgs
{
DatabaseName = "ATP_Prod",
ZoneRedundant = true, // Synchronous replication across 3 zones
ReadScale = "Enabled", // Read replicas in each zone
// ...
});
// Application Gateway with zone redundancy
var appGateway = new ApplicationGateway("atp-appgw-prod-eus", new ApplicationGatewayArgs
{
// ...
Zones = new[] { "1", "2", "3" }, // Deploy instances in all zones
});
// Cosmos DB zone-redundant
var cosmosAccount = new DatabaseAccount("atp-cosmos-prod", new DatabaseAccountArgs
{
Locations = new[]
{
new LocationArgs
{
LocationName = "eastus",
IsZoneRedundant = true // Zone-redundant within region
}
}
});
Zone Redundancy Benefits:
- SLA: 99.99% (zone-redundant) vs. 99.95% (single zone): roughly 5x less allowed downtime (≈4.3 vs. ≈21.6 minutes per month).
- Failure Isolation: Single datacenter outage → automatic failover to other zones within seconds.
- Zero RPO: Synchronous replication within region (zero data loss).
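The SLA comparison translates directly into allowed downtime per month; a quick check of the numbers (30-day month assumed):

```python
def allowed_downtime_minutes(sla_percent: float, days: int = 30) -> float:
    """Allowed downtime per month implied by an availability SLA."""
    return days * 24 * 60 * (1 - sla_percent / 100)

single_zone = allowed_downtime_minutes(99.95)     # ~21.6 minutes/month
zone_redundant = allowed_downtime_minutes(99.99)  # ~4.3 minutes/month
print(round(single_zone, 1), round(zone_redundant, 1))
print(round(single_zone / zone_redundant, 1))  # 5.0 -- ~5x less allowed downtime
```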
Backup Strategy Per Environment¶
Dev Environment Backups¶
Strategy: No backups (recreate from Git + IaC).
backups:
  enabled: false
  rationale: Git is source of truth; recreate faster than restore
  costSavings: ~$50/month (no backup storage)
Test Environment Backups¶
Strategy: Daily automated backups with 14-day retention.
backups:
  sqlDatabase:
    frequency: Daily (automated)
    retention: 14 days
    type: Automated point-in-time restore
    cost: Included in SQL Database cost
  redis:
    frequency: None (Standard tier doesn't support persistence)
    recreate: Rebuild cache from database on restore
  storage:
    frequency: None (test data is synthetic)
    recreate: Re-run data seeding scripts
Staging Environment Backups¶
Strategy: 35-day SQL point-in-time restore with geo-redundant backups and weekly/monthly long-term retention, plus hourly Redis RDB snapshots.
// Staging SQL Database backup configuration
var stagingSqlDatabase = new Database("atp-db-staging-eus", new DatabaseArgs
{
DatabaseName = "ATP_Staging",
BackupRetentionPolicyInDays = 35, // 35-day short-term retention
BackupStorageRedundancy = "Geo", // Geo-redundant backups
// Long-term retention
LongTermRetentionPolicy = new DatabaseLongTermRetentionPolicyArgs
{
WeeklyRetention = "P4W", // 4 weeks
MonthlyRetention = "P3M", // 3 months
YearlyRetention = "P0Y" // Disabled
},
// ... other configuration
});
// Redis Cache persistence (Staging uses Premium tier)
var stagingRedis = new RedisResource("atp-redis-staging-eus", new RedisResourceArgs
{
Name = "atp-redis-staging-eus",
ResourceGroupName = stagingResourceGroup.Name,
Location = "eastus",
Sku = new SkuArgs
{
Name = "Premium",
Family = "P",
Capacity = 1 // P1 (6 GB)
},
// RDB persistence (snapshots to storage account)
RedisConfiguration = new RedisCommonPropertiesRedisConfigurationArgs
{
RdbBackupEnabled = "true",
RdbBackupFrequency = "60", // 60 minutes
RdbBackupMaxSnapshotCount = "1",
RdbStorageConnectionString = stagingStorageConnectionString
},
Tags = stagingTags
});
Production Environment Backups¶
Strategy: Continuous backups with point-in-time restore (35-day SQL, 7-day Cosmos DB) + long-term retention (7 years).
backups:
  sqlDatabase:
    frequency: Continuous (transaction log backups every 5-10 minutes)
    retention:
      - Short-term: 35 days (point-in-time restore)
      - Long-term: 7 years (weekly full + monthly)
    redundancy: Geo-redundant (replicated to secondary region)
    RPO: < 5 minutes
  cosmosDB:
    frequency: Continuous (change feed based)
    retention: 7 days (continuous backup mode)
    redundancy: Multi-region (automatic)
    RPO: < 1 minute
  redis:
    frequency: RDB snapshots every 15 minutes
    retention: 7 days
    redundancy: Replicated to geo-paired region
    RPO: < 15 minutes
  blobStorage:
    frequency: Real-time geo-replication
    retention: 7 years (with WORM)
    redundancy: RA-GRS (Read-Access Geo-Redundant)
    RPO: < 15 minutes
  serviceBus:
    frequency: N/A (Geo-DR replicates metadata, not messages)
    strategy: Application-level message replication
    RPO: Near-zero with application-level replication (in-flight messages may be lost on failover)
Production Backup Validation (Daily):
#!/bin/bash
# validate-backups-prod.sh
echo "Validating Production backups..."
# Step 1: Verify the live database has a point-in-time restore window
# (list-restorable-dropped only covers deleted databases)
EARLIEST_RESTORE=$(az sql db show \
--name ATP_Prod \
--server atp-sql-prod-eus \
--resource-group ConnectSoft-ATP-Prod-EUS-RG \
--query "earliestRestoreDate" -o tsv)
if [ -z "$EARLIEST_RESTORE" ]; then
echo "❌ No SQL restore point available"
exit 1
fi
echo "✅ SQL point-in-time restore available since: $EARLIEST_RESTORE"
# Step 2: Verify Cosmos DB continuous backup mode
COSMOS_BACKUP_MODE=$(az cosmosdb show \
--name atp-cosmos-prod \
--resource-group ConnectSoft-ATP-Shared-RG \
--query "backupPolicy.type" -o tsv)
if [ "$COSMOS_BACKUP_MODE" != "Continuous" ]; then
echo "❌ Cosmos DB not in continuous backup mode"
exit 1
fi
echo "✅ Cosmos DB continuous backup enabled"
# Step 3: Verify blob geo-replication status
GEO_REPL_STATUS=$(az storage account show \
--name atpstorageprodeus \
--resource-group ConnectSoft-ATP-Prod-EUS-RG \
--query "geoReplicationStats.status" -o tsv)
if [ "$GEO_REPL_STATUS" != "Live" ]; then
echo "❌ Blob geo-replication not live"
exit 1
fi
echo "✅ Blob geo-replication status: Live"
# Step 4: Test restore operation (non-disruptive)
# Restore to a test database to validate backup integrity
az sql db restore \
--dest-name "ATP_Prod_BackupTest_$(date +%Y%m%d)" \
--resource-group ConnectSoft-ATP-Prod-EUS-RG \
--server atp-sql-prod-eus \
--time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
--name "ATP_Prod" \
--service-objective "P1" \
--edition "Premium"
echo "✅ Backup restore test successful"
# Clean up test database after validation
az sql db delete \
--name "ATP_Prod_BackupTest_$(date +%Y%m%d)" \
--resource-group ConnectSoft-ATP-Prod-EUS-RG \
--server atp-sql-prod-eus \
--yes
echo "✅ Production backup validation complete"
RTO/RPO Validation & Reporting¶
Purpose: Measure and validate actual RTO/RPO achieved during DR tests and incidents.
DR Metrics Tracking:
// Track DR test results in Application Insights
public class DRTestMetrics
{
private readonly TelemetryClient _telemetry;
public DRTestMetrics(TelemetryClient telemetry) => _telemetry = telemetry;
public async Task RecordDRTestAsync(DRTestResult result)
{
var properties = new Dictionary<string, string>
{
["TestType"] = result.TestType, // "Automated" or "Manual"
["Environment"] = result.Environment,
["SourceRegion"] = result.SourceRegion,
["TargetRegion"] = result.TargetRegion,
["Success"] = result.Success.ToString(),
["FailureReason"] = result.FailureReason ?? "N/A"
};
var metrics = new Dictionary<string, double>
{
["RTOMinutes"] = result.RTOMinutes,
["RPOMinutes"] = result.RPOMinutes,
["TargetRTOMinutes"] = result.TargetRTOMinutes,
["TargetRPOMinutes"] = result.TargetRPOMinutes,
["DataLossRecords"] = result.DataLossRecords
};
_telemetry.TrackEvent("DRTestCompleted", properties, metrics);
// Create work item if RTO/RPO targets not met
if (result.RTOMinutes > result.TargetRTOMinutes || result.RPOMinutes > result.TargetRPOMinutes)
{
await CreateDRImprovementTaskAsync(result);
}
}
}
DR Dashboard (Application Insights KQL):
// DR test success rate over last 6 months
customEvents
| where name == "DRTestCompleted"
| where timestamp > ago(180d)
| extend Environment = tostring(customDimensions.Environment)
| extend Success = tobool(customDimensions.Success)
| extend RTOMinutes = todouble(customMeasurements.RTOMinutes)
| extend TargetRTOMinutes = todouble(customMeasurements.TargetRTOMinutes)
| extend RTOMet = RTOMinutes <= TargetRTOMinutes
| summarize
TotalTests = count(),
SuccessfulTests = countif(Success == true),
SuccessRate = 100.0 * countif(Success == true) / count(),
AvgRTO = avg(RTOMinutes),
TargetRTO = max(TargetRTOMinutes),
RTOMetPercentage = 100.0 * countif(RTOMet == true) / count()
by Environment
| order by SuccessRate asc
Business Continuity Plan (BCP)¶
ATP Business Continuity Objectives:
businessContinuity:
missionCriticalServices:
- Audit Event Ingestion (100% uptime required)
- Audit Event Query (99.9% uptime required)
- Tamper-Evidence Validation (99.9% uptime required)
tolerableDowntime:
- Ingestion: 0 minutes (cannot lose audit events)
- Query: 30 minutes (tenants can retry)
- Export: 4 hours (background process)
dataCriticality:
- Audit Events: Mission-critical (immutable, compliance)
- Configuration: High (backed up hourly)
- Logs/Metrics: Medium (7-day recovery acceptable)
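The tolerable-downtime figures above lend themselves to a mechanical post-incident check. A minimal sketch (Python; the service names and limits mirror the YAML, while the helper itself is illustrative, not part of ATP):

```python
from datetime import timedelta

# Tolerable downtime per service, mirroring the BCP objectives above.
TOLERABLE_DOWNTIME = {
    "ingestion": timedelta(minutes=0),   # cannot lose audit events
    "query": timedelta(minutes=30),      # tenants can retry
    "export": timedelta(hours=4),        # background process
}

def breached_services(observed: dict) -> list:
    """Return services whose observed downtime exceeded the BCP limit."""
    return sorted(
        name for name, downtime in observed.items()
        if downtime > TOLERABLE_DOWNTIME.get(name, timedelta(0))
    )

# Example incident: query down 45 minutes, export down 1 hour
observed = {
    "ingestion": timedelta(0),
    "query": timedelta(minutes=45),
    "export": timedelta(hours=1),
}
print(breached_services(observed))  # ['query']
```

Here only `query` exceeds its 30-minute budget; `export` at 1 hour stays within its 4-hour tolerance.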
Incident Response Runbook:
# P0 Incident: Regional Outage
incidentResponse:
detection:
- Azure Monitor health probe failures (3 consecutive)
- PagerDuty alert to on-call engineer
- Automated failover triggered (if configured)
response:
1. Validate automated failover executed
2. Confirm secondary region serving traffic
3. Check error rates, latency, throughput
4. Update status page (Degraded: Regional outage)
5. Notify tenants via email/webhook
recovery:
1. Investigate primary region root cause
2. Validate primary region health
3. Execute gradual failback (phased traffic shift)
4. Close incident ticket
5. Post-mortem analysis within 48 hours
communication:
- Status page: https://status.atp.connectsoft.com
- Tenant notifications: Email + webhook
- Internal: Slack #incidents channel
- Executive: Email to CTO/CEO if downtime > 1 hour
Disaster Scenarios & Response¶
Scenario 1: Regional Azure Outage (East US)¶
Detection: Front Door health probes fail for East US backends (3 consecutive failures over 90 seconds).
Automated Response:
T+0:00 - Health probes detect East US unavailable
T+0:01 - Front Door removes East US from pool
T+0:01 - 100% traffic routed to West Europe
T+0:02 - Azure Monitor alert fires
T+0:02 - PagerDuty pages on-call engineer
T+0:02 - Incident ticket auto-created (P0)
T+0:03 - Status page updated (Degraded)
T+0:05 - Tenant notification emails sent
Manual Response (On-Call Engineer):
T+0:05 - Engineer acknowledges PagerDuty alert
T+0:10 - Validate West Europe serving traffic correctly
T+0:15 - Check Azure status page for East US outage
T+0:20 - Scale up West Europe AKS to handle 100% traffic
T+0:30 - Validate metrics (error rate, latency, throughput)
T+0:35 - Update status page with ETA
T+1:00 - Monitor for next hour; coordinate with Azure support
RTO Achieved: ~5 minutes (automated failover)
RPO Achieved: ~15 minutes (geo-replication lag)
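The achieved RTO follows directly from the timeline: the gap between detection (T+0:00) and full rerouting (T+0:05). A small sketch of that calculation using the runbook's `T+H:MM` offset notation (the parser is illustrative, not part of ATP):

```python
def offset_minutes(t: str) -> int:
    """Parse a 'T+H:MM' runbook timeline offset into minutes."""
    hours, minutes = t.removeprefix("T+").split(":")
    return int(hours) * 60 + int(minutes)

# Scenario 1 events: detection at T+0:00, traffic fully rerouted by T+0:05
detection = offset_minutes("T+0:00")
recovered = offset_minutes("T+0:05")

rto_achieved = recovered - detection  # minutes of effective downtime
rto_target = 30                       # Production RTO target (minutes)
assert rto_achieved <= rto_target
print(f"RTO achieved: {rto_achieved} min (target {rto_target} min)")
```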
Scenario 2: SQL Database Corruption¶
Detection: Application errors; integrity check failures.
Response Procedure:
# Restore SQL Database from point-in-time (before corruption)
$corruptionTime = (Get-Date).ToUniversalTime().AddHours(-2) # Corruption detected 2 hours ago; point-in-time restore expects UTC
# Step 1: Restore to a new database
Restore-AzSqlDatabase `
-ResourceGroupName "ConnectSoft-ATP-Prod-EUS-RG" `
-ServerName "atp-sql-prod-eus" `
-TargetDatabaseName "ATP_Prod_Restored" `
-ServiceObjectiveName "P4" `
-Edition "Premium" `
-PointInTime $corruptionTime
# Step 2: Validate restored database integrity
$integrityCheck = Invoke-Sqlcmd `
-ServerInstance "atp-sql-prod-eus.database.windows.net" `
-Database "ATP_Prod_Restored" `
-Query "EXEC sp_ATP_ValidateIntegrity" `
-Username "sqladmin" `
-Password $(Get-AutomationVariable -Name "SqlAdminPassword")
if ($integrityCheck.IntegrityValid -ne $true) {
Write-Error "❌ Restored database integrity check failed"
exit 1
}
# Step 3: Swap databases (minimal downtime)
# Use failover group to switch active database
RTO: ~30 minutes (restore + validation)
RPO: ~2 hours (restore to before corruption time)
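The restore target above (the corruption time, minus a safety margin) must still fall inside the point-in-time-restore window. A sketch of that check (the 5-minute buffer and helper are illustrative; the 7-day window matches the Production PITR retention noted in the summary):

```python
from datetime import datetime, timedelta, timezone

PITR_WINDOW = timedelta(days=7)  # Production point-in-time-restore retention

def restore_point(corruption_detected: datetime, now: datetime,
                  buffer: timedelta = timedelta(minutes=5)) -> datetime:
    """Pick a restore point just before the detected corruption and verify
    it still falls inside the PITR window."""
    point = corruption_detected - buffer
    if now - point > PITR_WINDOW:
        raise ValueError("Restore point is outside the 7-day PITR window")
    return point

now = datetime(2025, 1, 10, 12, 0, tzinfo=timezone.utc)
corruption = now - timedelta(hours=2)
print(restore_point(corruption, now))  # 2025-01-10 09:55:00+00:00
```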
Scenario 3: AKS Cluster Failure¶
Detection: All pods in primary region unhealthy; Kubernetes API unreachable.
Automated Response:
T+0:00 - Health checks fail for all pods
T+0:01 - Front Door marks East US unhealthy
T+0:01 - Traffic routed to West Europe AKS
T+0:02 - Azure Monitor alert (AKS cluster down)
T+0:05 - Autoscale West Europe AKS from 3 → 10 nodes
T+0:10 - All pods running in West Europe
T+0:15 - Traffic fully served from West Europe
Manual Recovery:
# Diagnose AKS cluster failure
az aks show \
--name atp-aks-prod-eus \
--resource-group ConnectSoft-ATP-Prod-EUS-RG \
--query "powerState"
# If cluster stopped, restart
az aks start \
--name atp-aks-prod-eus \
--resource-group ConnectSoft-ATP-Prod-EUS-RG
# If cluster corrupted, recreate from IaC
pulumi up --target urn:pulumi:atp-prod-eus::atp-infrastructure::azure-native:containerservice:ManagedCluster::atp-aks-prod-eus
RTO: ~15 minutes (automated failover to West Europe)
RPO: 0 (stateless AKS pods; data in geo-replicated databases)
Summary¶
- RPO/RTO Targets: Graduated from 24h/4h (Dev recreate) to 15min/30min (Production multi-region).
- Dev: No backups; recreate from IaC + Git in 4 hours.
- Test: Daily backups with 12h RPO, 2h RTO via restore.
- Staging: Blue-green slots for instant failover (1h RPO/RTO), hourly backups.
- Production: Active-active multi-region (East US 80%, West Europe 20%) with automated failover, 15min RPO, 30min RTO.
- Multi-Region Topology: Primary (East US) + Secondary (West Europe) with VNet peering, SQL geo-replication, Cosmos multi-write, Blob RA-GRS, Service Bus Geo-DR.
- Azure Front Door: Global load balancing with health probes (30s interval, 3 failure threshold) and automated traffic rerouting.
- Zone Redundancy: Production resources spread across 3 availability zones (99.99% SLA).
- DR Testing: Weekly non-disruptive production tests, monthly staging tests, quarterly dev/test tests.
- Backup Strategy: Continuous backups (Production) with 7-day PITR + 7-year long-term retention.
- Incident Response: Automated failover (5 minutes) with manual validation and phased failback.
Compliance & Audit Per Environment¶
ATP implements graduated compliance controls across environments, balancing developer productivity with regulatory requirements. Production enforces full GDPR, HIPAA, and SOC 2 compliance with continuous monitoring, while Dev/Test environments use relaxed policies with synthetic data to accelerate development without compliance overhead.
This approach ensures regulatory alignment in production while maintaining rapid iteration in lower environments through environment-specific policies, automated compliance scanning, and comprehensive audit evidence collection.
Compliance Enforcement Matrix¶
ATP enforces progressive compliance controls, from relaxed (Dev) to strict (Production), aligned with data sensitivity and regulatory requirements.
| Control Category | Dev/Test | Staging | Production |
|---|---|---|---|
| Encryption at Rest | Optional (TDE off) | Required (TDE on) | Required + BYOK (Customer-Managed Keys) |
| Encryption in Transit | TLS 1.2+ | TLS 1.2+ | TLS 1.3 enforced |
| Immutability | Disabled | Enabled (validation) | Enabled + WORM storage |
| Tamper-Evidence | Disabled | Enabled (testing) | Enabled + HSM signatures |
| Audit Logging | Basic (7 days) | Enhanced (30 days) | Full + 7-year retention |
| PII Handling | Synthetic data only | Anonymized production data | Real PII + classification |
| Data Retention | 30 days | 90 days | 7 years (compliance) |
| Access Reviews | Quarterly | Monthly | Weekly + JIT access |
| Penetration Testing | Annually | Quarterly | Quarterly + post-change |
| Vulnerability Scanning | Weekly | Daily | Continuous (real-time) |
| Secret Rotation | Manual (90 days) | Automated (60 days) | Automated (30 days) |
| Privileged Access | Developer accounts | Restricted (Lead+ only) | Zero standing access (PIM) |
| Change Management | None | Manual approval | CAB + change ticket |
| Incident Response SLA | Best-effort | 4 hours | 30 minutes |
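Deployment tooling can consume the matrix above as data rather than prose. A sketch encoding a few of its rows (abbreviated; the helper and field names are hypothetical):

```python
# A few rows of the compliance matrix encoded as data, so pipelines can
# assert the right controls per environment (values taken from the table above).
CONTROLS = {
    "dev":        {"tde_required": False, "secret_rotation_days": 90, "worm": False},
    "staging":    {"tde_required": True,  "secret_rotation_days": 60, "worm": True},
    "production": {"tde_required": True,  "secret_rotation_days": 30, "worm": True},
}

def required_controls(environment: str) -> dict:
    """Look up the control profile for an environment; fail loudly on typos."""
    try:
        return CONTROLS[environment]
    except KeyError:
        raise ValueError(f"Unknown environment: {environment}") from None

assert required_controls("production")["secret_rotation_days"] == 30
assert not required_controls("dev")["tde_required"]
```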
Environment-Specific Compliance Policies¶
Dev/Test Compliance Policies¶
Purpose: Minimal compliance overhead with synthetic data and relaxed controls to maximize developer velocity.
Data Handling (Dev/Test):
# Dev/Test Compliance Profile
piiHandling:
realPII: Prohibited # Only synthetic/anonymized data
piiClassification: Not required
redaction: Optional (for testing redaction logic)
dataResidency: No restrictions (US-only for simplicity)
encryption:
atRest: Optional (TDE disabled for performance)
inTransit: TLS 1.2+ required
customerManagedKeys: No
immutability:
enabled: false
rationale: Support data mutations for testing
auditLogging:
level: Basic
retention: 7 days (Dev), 14 days (Test)
destinations: Local Seq container
piiRedaction: Optional
accessControl:
authentication: Azure AD
authorization: Developer role (full access)
mfa: Recommended (not enforced)
conditionalAccess: No
complianceScanning:
frequency: Weekly
frameworks: None (development only)
findings: Informational (no blocking)
Azure Policy (Dev/Test - Audit Mode):
{
"policyDefinitionId": "/providers/Microsoft.Authorization/policyDefinitions/dev-test-compliance",
"displayName": "Dev/Test Environment Compliance (Audit Mode)",
"policyRule": {
"if": {
"field": "tags['Environment']",
"in": ["dev", "test"]
},
"then": {
"effect": "audit", // Audit only (don't block)
"details": {
"type": "Microsoft.Sql/servers/databases",
"existenceCondition": {
"field": "Microsoft.Sql/servers/databases/transparentDataEncryption.status",
"equals": "Enabled"
}
}
}
},
"metadata": {
"category": "Compliance",
"severity": "Low"
}
}
Staging Compliance Policies¶
Purpose: Production-like compliance for validating compliance controls before production deployment.
Data Handling (Staging):
# Staging Compliance Profile (mirrors Production)
piiHandling:
realPII: Prohibited (anonymized prod snapshots only)
piiClassification: Required (test classification logic)
redaction: Enforced (validate redaction)
dataResidency: EU data in EU region (GDPR simulation)
encryption:
atRest: Required (TDE enabled)
inTransit: TLS 1.2+ enforced
customerManagedKeys: Optional (test BYOK integration)
immutability:
enabled: true
wormStorage: Enabled (test compliance workflows)
tamperEvidence: Enabled (validate hash chains)
auditLogging:
level: Enhanced
retention: 30 days (hot) + 180 days (cool)
destinations: Azure Log Analytics
piiRedaction: Enforced
accessControl:
authentication: Azure AD + MFA
authorization: Least privilege (RBAC)
mfa: Enforced for all users
conditionalAccess: Location-based (office/VPN only)
justInTimeAccess: Enabled (PIM)
complianceScanning:
frequency: Daily
frameworks: GDPR, HIPAA, SOC 2
findings: Blocking for critical/high severity
remediation: SLA 7 days (critical), 30 days (high)
Azure Policy (Staging - Deny Mode):
{
"policyDefinitionId": "/providers/Microsoft.Authorization/policyDefinitions/staging-compliance",
"displayName": "Staging Environment Compliance (Deny Mode)",
"policyRule": {
"if": {
"allOf": [
{
"field": "tags['Environment']",
"equals": "staging"
},
{
"anyOf": [
{
"field": "type",
"equals": "Microsoft.Sql/servers/databases"
},
{
"field": "type",
"equals": "Microsoft.Storage/storageAccounts"
}
]
},
{
"anyOf": [
{
"field": "Microsoft.Sql/servers/databases/transparentDataEncryption.status",
"notEquals": "Enabled"
},
{
"field": "Microsoft.Storage/storageAccounts/encryption.services.blob.enabled",
"notEquals": "true"
}
]
}
]
},
"then": {
"effect": "deny", // Block non-compliant resources
"details": {
"message": "Staging resources must have encryption at rest enabled"
}
}
},
"metadata": {
"category": "Compliance",
"severity": "High"
}
}
Production Compliance Policies¶
Purpose: Full regulatory compliance with GDPR, HIPAA, and SOC 2 controls enforced at infrastructure and application layers.
Data Handling (Production):
# Production Compliance Profile (strict enforcement)
piiHandling:
realPII: Allowed (with classification)
piiClassification: Required (automated scanning)
redaction: Enforced (logs, telemetry, exports)
dataResidency: Enforced (EU data stays in EU region)
rightToErasure: Automated GDPR deletion workflow
encryption:
atRest: Required (TDE + BYOK)
inTransit: TLS 1.3 enforced (no TLS 1.2 fallback)
customerManagedKeys: Required (HSM-backed)
keyRotation: Automated (90-day rotation)
immutability:
enabled: true
wormStorage: Enforced (Legal Hold + Time-based Retention)
tamperEvidence: Enabled + HSM signatures
hashChains: Merkle trees with periodic sealing
auditLogging:
level: Full (all API calls, database queries, access events)
retention: 90 days (hot) + 7 years (cold archive)
destinations: Log Analytics + Blob Storage (immutable)
piiRedaction: Enforced with automated scanning
logIntegrity: Signed with HSM
accessControl:
authentication: Azure AD + MFA + Conditional Access
authorization: Zero standing access (PIM required)
mfa: Enforced (no exceptions)
conditionalAccess: Device compliance + location + risk-based
justInTimeAccess: Enforced (max 8-hour elevation)
privilegedAccess: Break-glass accounts only
complianceScanning:
frequency: Continuous (real-time)
frameworks: GDPR, HIPAA, SOC 2, ISO 27001
findings: Blocking (auto-remediate or escalate)
remediation: SLA 24 hours (critical), 7 days (high)
regulatoryReporting:
frequency: Quarterly
reports: GDPR compliance, HIPAA attestation, SOC 2 audit trail
auditor: External auditor access (read-only, time-limited)
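The `tamperEvidence` and `hashChains` entries above refer to chaining each audit event's hash to its predecessor's, so that modifying any stored event invalidates every later link. A minimal illustration of the idea (not ATP's actual implementation):

```python
import hashlib
import json

def chain_hash(prev_hash: str, event: dict) -> str:
    """Hash an event together with the previous link, so any later edit breaks the chain."""
    payload = prev_hash + json.dumps(event, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

events = [{"id": 1, "action": "login"}, {"id": 2, "action": "export"}]
chain = ["0" * 64]  # genesis hash
for event in events:
    chain.append(chain_hash(chain[-1], event))

# Verification: recompute the chain and compare; tampering with event 1
# would change its hash and therefore every hash after it.
recomputed = ["0" * 64]
for event in events:
    recomputed.append(chain_hash(recomputed[-1], event))
assert recomputed == chain
```

ATP's production design additionally seals these chains into Merkle trees and signs the roots with an HSM, per the profile above.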
Azure Policy (Production - Strict Enforcement):
{
"policySetDefinitionId": "/providers/Microsoft.Authorization/policySetDefinitions/production-compliance-initiative",
"displayName": "Production Environment Compliance Initiative",
"policyDefinitions": [
{
"policyDefinitionId": "/providers/Microsoft.Authorization/policyDefinitions/deny-public-network-access",
"parameters": {
"effect": "deny"
}
},
{
"policyDefinitionId": "/providers/Microsoft.Authorization/policyDefinitions/require-encryption-at-rest",
"parameters": {
"effect": "deny",
"requireCustomerManagedKeys": true
}
},
{
"policyDefinitionId": "/providers/Microsoft.Authorization/policyDefinitions/require-tls-1-3",
"parameters": {
"effect": "deny",
"minimumTlsVersion": "1.3"
}
},
{
"policyDefinitionId": "/providers/Microsoft.Authorization/policyDefinitions/require-diagnostic-logs",
"parameters": {
"effect": "deployIfNotExists",
"logAnalyticsWorkspaceId": "/subscriptions/<sub-id>/resourceGroups/ATP-Prod-RG/providers/Microsoft.OperationalInsights/workspaces/atp-loganalytics-prod-eus",
"retentionDays": 90
}
},
{
"policyDefinitionId": "/providers/Microsoft.Authorization/policyDefinitions/require-immutable-storage",
"parameters": {
"effect": "deny",
"requireWORM": true
}
}
],
"metadata": {
"category": "Regulatory Compliance",
"version": "1.0.0"
}
}
Encryption Controls Per Environment¶
Dev/Test Encryption¶
Encryption at Rest: Optional (disabled for cost/performance)
// Dev SQL Database (TDE disabled)
var devSqlDatabase = new Database("atp-sql-dev-eus", new DatabaseArgs
{
DatabaseName = "ATP_Dev",
TransparentDataEncryption = new TransparentDataEncryptionArgs
{
Status = "Disabled" // Optional in Dev
}
});
Encryption in Transit: TLS 1.2+ (enforced)
Staging/Production Encryption¶
Encryption at Rest: Required with Customer-Managed Keys (BYOK)
// Production SQL Database (TDE with BYOK)
var prodSqlDatabase = new Database("atp-db-prod-eus", new DatabaseArgs
{
DatabaseName = "ATP_Prod",
// Transparent Data Encryption with Customer-Managed Key
TransparentDataEncryption = new TransparentDataEncryptionArgs
{
Status = "Enabled",
KeyVaultKeyUri = "https://atp-keyvault-prod-eus.vault.azure.net/keys/TDE-Key/latest"
}
});
// Storage Account (encryption with BYOK)
var prodStorage = new StorageAccount("atpstorageprodeus", new StorageAccountArgs
{
AccountName = "atpstorageprodeus",
Encryption = new EncryptionArgs
{
Services = new EncryptionServicesArgs
{
Blob = new EncryptionServiceArgs { Enabled = true, KeyType = "Account" },
File = new EncryptionServiceArgs { Enabled = true, KeyType = "Account" },
Queue = new EncryptionServiceArgs { Enabled = true, KeyType = "Service" },
Table = new EncryptionServiceArgs { Enabled = true, KeyType = "Service" }
},
KeySource = "Microsoft.Keyvault",
KeyVaultProperties = new KeyVaultPropertiesArgs
{
KeyName = "StorageEncryptionKey",
KeyVersion = "", // Use latest version
KeyVaultUri = "https://atp-keyvault-prod-eus.vault.azure.net"
},
RequireInfrastructureEncryption = true // Double encryption
}
});
// Cosmos DB (encryption with BYOK)
var prodCosmos = new DatabaseAccount("atp-cosmos-prod", new DatabaseAccountArgs
{
AccountName = "atp-cosmos-prod",
KeyVaultKeyUri = "https://atp-keyvault-prod-eus.vault.azure.net/keys/CosmosEncryptionKey/latest",
// Default encryption key policy
DefaultIdentity = "SystemAssigned"
});
Encryption Key Rotation (Production):
// Automated key rotation (Azure Function)
[FunctionName("RotateEncryptionKeys")]
public async Task RunAsync(
[TimerTrigger("0 0 1 */3 * *")] TimerInfo timer, // Every 90 days
ILogger log)
{
log.LogInformation("Starting encryption key rotation...");
var keyVaultClient = new KeyClient(
vaultUri: new Uri("https://atp-keyvault-prod-eus.vault.azure.net"),
credential: new DefaultAzureCredential());
// Rotate TDE key
var newTdeKey = await keyVaultClient.CreateRsaKeyAsync(new CreateRsaKeyOptions($"TDE-Key-{DateTime.UtcNow:yyyyMMdd}")
{
KeySize = 4096,
KeyOperations = { KeyOperation.WrapKey, KeyOperation.UnwrapKey },
ExpiresOn = DateTimeOffset.UtcNow.AddDays(365)
});
log.LogInformation($"Created new TDE key: {newTdeKey.Value.Name}");
// Update SQL Database TDE key
await UpdateSqlTdeKeyAsync("atp-sql-prod-eus", "ATP_Prod", newTdeKey.Value.Id.ToString());
// Rotate Storage encryption key
var newStorageKey = await keyVaultClient.CreateRsaKeyAsync(new CreateRsaKeyOptions($"StorageEncryptionKey-{DateTime.UtcNow:yyyyMMdd}")
{
KeySize = 4096
});
await UpdateStorageEncryptionKeyAsync("atpstorageprodeus", newStorageKey.Value.Id.ToString());
// Disable old key versions (retain for 30 days so data encrypted under them can still be decrypted)
await foreach (var oldKey in keyVaultClient.GetPropertiesOfKeyVersionsAsync("TDE-Key"))
{
if (oldKey.CreatedOn >= DateTimeOffset.UtcNow.AddDays(-30)) continue;
oldKey.Enabled = false;
await keyVaultClient.UpdateKeyPropertiesAsync(oldKey);
log.LogInformation($"Disabled old key version: {oldKey.Version}");
}
log.LogInformation("✅ Encryption key rotation complete");
}
Encryption in Transit (Production - TLS 1.3 Only):
// Enforce TLS 1.3 in Production
public void ConfigureServices(IServiceCollection services)
{
services.Configure<KestrelServerOptions>(options =>
{
options.ConfigureHttpsDefaults(httpsOptions =>
{
httpsOptions.SslProtocols = SslProtocols.Tls13; // TLS 1.3 only
httpsOptions.ClientCertificateMode = ClientCertificateMode.RequireCertificate; // mTLS
httpsOptions.CheckCertificateRevocation = true; // Validate cert revocation
});
});
}
Immutability & WORM Storage¶
Staging (Immutability Testing):
// Staging blob container with immutability policy
var stagingImmutableContainer = new BlobContainer("audit-events-staging", new BlobContainerArgs
{
ContainerName = "audit-events",
AccountName = stagingStorage.Name,
ResourceGroupName = stagingResourceGroup.Name,
ImmutableStorageWithVersioning = new ImmutableStorageWithVersioningArgs
{
Enabled = true
},
// Time-based retention policy (test mode)
DefaultEncryptionScope = "$account-encryption-key",
DenyEncryptionScopeOverride = true
});
// Immutability policy (90-day retention)
var immutabilityPolicy = new ImmutabilityPolicy("staging-immutability-policy", new ImmutabilityPolicyArgs
{
ImmutabilityPolicyName = "default",
AccountName = stagingStorage.Name,
ResourceGroupName = stagingResourceGroup.Name,
ContainerName = stagingImmutableContainer.Name,
ImmutabilityPeriodSinceCreationInDays = 90,
AllowProtectedAppendWrites = false // No appends allowed
});
Production (WORM + Legal Hold):
// Production blob container with WORM storage
var prodImmutableContainer = new BlobContainer("audit-events-prod", new BlobContainerArgs
{
ContainerName = "audit-events",
AccountName = prodStorage.Name,
ResourceGroupName = prodEUSResourceGroup.Name,
ImmutableStorageWithVersioning = new ImmutableStorageWithVersioningArgs
{
Enabled = true
}
});
// Immutability policy (7-year retention + legal hold)
var prodImmutabilityPolicy = new ImmutabilityPolicy("prod-immutability-policy", new ImmutabilityPolicyArgs
{
ImmutabilityPolicyName = "default",
AccountName = prodStorage.Name,
ResourceGroupName = prodEUSResourceGroup.Name,
ContainerName = prodImmutableContainer.Name,
ImmutabilityPeriodSinceCreationInDays = 2555, // 7 years
AllowProtectedAppendWrites = false
});
// Lock immutability policy (cannot be reduced or deleted)
var policyLock = new ManagementLock("prod-immutability-lock", new ManagementLockArgs
{
LockName = "immutability-policy-lock",
LockLevel = "CanNotDelete",
ResourceGroupName = prodEUSResourceGroup.Name,
Notes = "Prevents deletion of immutability policy for compliance"
});
Legal Hold (Production):
# Apply legal hold to blob container
# (legal hold tags must be 3-23 alphanumeric characters; hyphens are not allowed)
az storage container legal-hold set \
--account-name atpstorageprodeus \
--container-name audit-events \
--tags "Litigation2025001" "SECInvestigation456" \
--allow-protected-append-writes-all false
echo "✅ Legal hold applied; blobs cannot be deleted or modified"
Audit Logging & Evidence Collection¶
Production Audit Logging¶
Comprehensive Logging (All layers):
auditLogs:
infrastructure:
- Azure Activity Log (control plane operations)
- NSG Flow Logs (network traffic)
- Azure Firewall Logs (egress traffic)
- Key Vault Audit Logs (secret access)
application:
- API Gateway Logs (all HTTP requests)
- Application Logs (business events)
- Database Audit Logs (all SQL queries)
- Authentication Logs (login attempts, MFA)
security:
- Azure AD Sign-in Logs (user authentication)
- Conditional Access Logs (access policy decisions)
- PIM Activation Logs (privileged access elevation)
- Azure Defender Alerts (security findings)
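Given the Production retention profile (90 days hot, then a 7-year immutable archive), log records can be tiered mechanically by age. An illustrative sketch (thresholds taken from the profile; the function itself is hypothetical):

```python
from datetime import date, timedelta

HOT_DAYS = 90          # Log Analytics hot retention
ARCHIVE_YEARS = 7      # immutable cold archive

def log_tier(created: date, today: date) -> str:
    """Classify a log record as hot / archive / expired by age."""
    age = today - created
    if age <= timedelta(days=HOT_DAYS):
        return "hot"
    if age <= timedelta(days=ARCHIVE_YEARS * 365):
        return "archive"
    return "expired"

today = date(2025, 6, 1)
assert log_tier(date(2025, 5, 1), today) == "hot"
assert log_tier(date(2023, 5, 1), today) == "archive"
assert log_tier(date(2017, 5, 1), today) == "expired"
```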
Azure SQL Auditing (Production):
// SQL Server auditing (production)
var sqlAuditing = new ServerBlobAuditingPolicy("atp-sql-audit-prod", new ServerBlobAuditingPolicyArgs
{
BlobAuditingPolicyName = "default",
ResourceGroupName = prodEUSResourceGroup.Name,
ServerName = prodEUSSQL.Name,
State = "Enabled",
// Storage account for audit logs
StorageEndpoint = prodStorage.PrimaryEndpoints.Apply(endpoints => endpoints.Blob),
StorageAccountAccessKey = prodStorage.PrimaryAccessKey,
StorageAccountSubscriptionId = subscriptionId,
RetentionDays = 90, // 90-day hot retention
IsStorageSecondaryKeyInUse = false,
IsAzureMonitorTargetEnabled = true, // Also send to Log Analytics
// Audit action groups (comprehensive)
AuditActionsAndGroups = new[]
{
"SUCCESSFUL_DATABASE_AUTHENTICATION_GROUP",
"FAILED_DATABASE_AUTHENTICATION_GROUP",
"BATCH_COMPLETED_GROUP",
"SCHEMA_OBJECT_CHANGE_GROUP",
"SCHEMA_OBJECT_ACCESS_GROUP",
"DATABASE_OBJECT_CHANGE_GROUP",
"DATABASE_OBJECT_OWNERSHIP_CHANGE_GROUP",
"DATABASE_OBJECT_PERMISSION_CHANGE_GROUP",
"DATABASE_PERMISSION_CHANGE_GROUP",
"DATABASE_PRINCIPAL_CHANGE_GROUP",
"DATABASE_ROLE_MEMBER_CHANGE_GROUP"
}
});
// Database-level auditing
var dbAuditing = new DatabaseBlobAuditingPolicy("atp-db-audit-prod", new DatabaseBlobAuditingPolicyArgs
{
BlobAuditingPolicyName = "default",
ResourceGroupName = prodEUSResourceGroup.Name,
ServerName = prodEUSSQL.Name,
DatabaseName = prodEUSDatabase.Name,
State = "Enabled",
// Inherit from server-level + add database-specific actions
AuditActionsAndGroups = new[]
{
"SELECT",
"INSERT",
"UPDATE",
"DELETE",
"EXECUTE",
"RECEIVE",
"REFERENCES"
}
});
Key Vault Audit Logging (Production):
# Enable Key Vault diagnostic logging
az monitor diagnostic-settings create \
--name atp-keyvault-prod-audit \
--resource /subscriptions/<sub-id>/resourceGroups/ATP-Prod-RG/providers/Microsoft.KeyVault/vaults/atp-keyvault-prod-eus \
--logs '[
{
"category": "AuditEvent",
"enabled": true,
"retentionPolicy": {
"enabled": true,
"days": 365
}
}
]' \
--metrics '[
{
"category": "AllMetrics",
"enabled": true,
"retentionPolicy": {
"enabled": true,
"days": 90
}
}
]' \
--workspace /subscriptions/<sub-id>/resourceGroups/ATP-Prod-RG/providers/Microsoft.OperationalInsights/workspaces/atp-loganalytics-prod-eus
echo "✅ Key Vault audit logging enabled (365-day retention)"
API Gateway Logging (Production):
// Application Gateway diagnostic settings
var appGatewayDiagnostics = new DiagnosticSetting("atp-appgw-diag-prod", new DiagnosticSettingArgs
{
Name = "atp-appgw-prod-diagnostics",
ResourceUri = applicationGateway.Id,
WorkspaceId = prodLogAnalytics.Id,
Logs = new[]
{
new LogSettingsArgs
{
Category = "ApplicationGatewayAccessLog",
Enabled = true,
RetentionPolicy = new RetentionPolicyArgs
{
Enabled = true,
Days = 90
}
},
new LogSettingsArgs
{
Category = "ApplicationGatewayPerformanceLog",
Enabled = true,
RetentionPolicy = new RetentionPolicyArgs
{
Enabled = true,
Days = 90
}
},
new LogSettingsArgs
{
Category = "ApplicationGatewayFirewallLog",
Enabled = true,
RetentionPolicy = new RetentionPolicyArgs
{
Enabled = true,
Days = 365 // WAF logs retained for 1 year
}
}
},
Metrics = new[]
{
new MetricSettingsArgs
{
Category = "AllMetrics",
Enabled = true,
RetentionPolicy = new RetentionPolicyArgs
{
Enabled = true,
Days = 90
}
}
}
});
Access Control & Privileged Access Management¶
Just-In-Time (JIT) Access (Staging/Production):
// Azure PIM role assignment (time-limited elevation)
public class PrivilegedAccessService
{
private readonly GraphServiceClient _graphClient;
private readonly ILogger<PrivilegedAccessService> _logger;
public async Task RequestProductionAccessAsync(string userId, string role, TimeSpan duration)
{
// Maximum elevation: 8 hours
if (duration > TimeSpan.FromHours(8))
{
throw new InvalidOperationException("Maximum elevation period is 8 hours");
}
// Create PIM role assignment request
var roleAssignmentScheduleRequest = new RoleAssignmentScheduleRequest
{
PrincipalId = userId,
RoleDefinitionId = role, // e.g., "Contributor", "Reader"
DirectoryScopeId = "/subscriptions/<sub-id>/resourceGroups/ConnectSoft-ATP-Prod-EUS-RG",
Justification = "Production troubleshooting - Incident #12345",
ScheduleInfo = new RequestSchedule
{
StartDateTime = DateTimeOffset.UtcNow,
Expiration = new ExpirationPattern
{
Type = "AfterDuration",
Duration = duration
}
}
};
await _graphClient.RoleManagement.Directory.RoleAssignmentScheduleRequests
.PostAsync(roleAssignmentScheduleRequest);
// Log PIM elevation for audit trail
_logger.LogWarning(
"PIM elevation granted: User {UserId} elevated to {Role} for {Duration}",
userId, role, duration);
}
}
Access Review (Production - Weekly):
<#
.SYNOPSIS
Weekly access review for Production environment.
.DESCRIPTION
Reviews all role assignments; identifies stale permissions; sends report.
#>
Connect-AzAccount -Identity
$resourceGroup = "ConnectSoft-ATP-Prod-EUS-RG"
$reportDate = Get-Date -Format "yyyy-MM-dd"
Write-Output "Starting weekly access review for Production..."
# Get all role assignments
$roleAssignments = Get-AzRoleAssignment -ResourceGroupName $resourceGroup
$accessReport = @()
foreach ($assignment in $roleAssignments) {
$principal = Get-AzADUser -ObjectId $assignment.ObjectId -ErrorAction SilentlyContinue
if ($principal) {
# Check last sign-in (requires the Microsoft Graph PowerShell module and AuditLog.Read.All)
$lastSignIn = (Get-MgUser -UserId $assignment.ObjectId -Property SignInActivity).SignInActivity.LastSignInDateTime
$daysSinceSignIn = ((Get-Date) - $lastSignIn).Days
$accessReport += [PSCustomObject]@{
User = $principal.UserPrincipalName
Role = $assignment.RoleDefinitionName
Scope = $assignment.Scope
LastSignIn = $lastSignIn
DaysSinceSignIn = $daysSinceSignIn
Recommendation = if ($daysSinceSignIn -gt 90) { "Remove" } elseif ($daysSinceSignIn -gt 30) { "Review" } else { "Keep" }
}
}
}
# Export report
$accessReport | Export-Csv -Path "access-review-$reportDate.csv" -NoTypeInformation
# Send to security team
Send-MailMessage `
-From "platform-team@connectsoft.example" `
-To "security-team@connectsoft.example" `
-Subject "Weekly Production Access Review - $reportDate" `
-Body "Attached: Access review for Production environment" `
-Attachments "access-review-$reportDate.csv" `
-SmtpServer "smtp.office365.com"
Write-Output "✅ Access review complete; report sent to security team"
Penetration Testing Per Environment¶
| Environment | Frequency | Scope | Approval | Findings SLA |
|---|---|---|---|---|
| Dev | Annually | Basic OWASP scan | None | Informational |
| Test | Annually | Full automated scan | None | 30 days |
| Staging | Quarterly | Full manual + automated | Lead Architect | 14 days (high), 7 days (critical) |
| Production | Quarterly + Post-Change | Full manual pentest | CISO | 48 hours (critical), 7 days (high) |
Production Penetration Testing Procedure:
pentestProcedure:
frequency: Quarterly + after major changes
vendor: External security firm (pre-approved)
scope:
inScope:
- Public-facing Application Gateway
- API endpoints (all microservices)
- Authentication/authorization flows
- Data encryption verification
- Network segmentation validation
outOfScope:
- Physical Azure datacenter testing
- Social engineering
- Denial of Service attacks
notification:
azure: Notify Azure 7 days before pentest
internal: CISO approval required
tenants: No notification (production isolation)
deliverables:
- Executive summary
- Detailed findings report
- Proof-of-concept exploits
- Remediation recommendations
- Retest validation (after fixes)
remediationSLA:
critical: 48 hours
high: 7 days
medium: 30 days
low: 90 days
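The remediation SLAs translate directly into due dates for the tracking work items. A sketch of that mapping (critical's 48 hours is expressed as 2 days here; the helper is illustrative):

```python
from datetime import date, timedelta

# Remediation SLAs from the procedure above (critical: 48h ≈ 2 days).
REMEDIATION_SLA_DAYS = {"critical": 2, "high": 7, "medium": 30, "low": 90}

def remediation_due(severity: str, found_on: date) -> date:
    """Compute the remediation due date for a pentest finding."""
    days = REMEDIATION_SLA_DAYS[severity.lower()]
    return found_on + timedelta(days=days)

found = date(2025, 1, 10)
assert remediation_due("Critical", found) == date(2025, 1, 12)
assert remediation_due("high", found) == date(2025, 1, 17)
```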
Pentest Findings Tracking:
// Track pentest findings in Azure DevOps
public async Task CreatePentestFindingAsync(PentestFinding finding)
{
var workItem = new WorkItem
{
Fields = new Dictionary<string, object>
{
["System.Title"] = $"Pentest Finding: {finding.Title}",
["System.WorkItemType"] = "Security Bug",
["System.State"] = "New",
["Microsoft.VSTS.Common.Priority"] = finding.Severity switch
{
"Critical" => 1,
"High" => 2,
"Medium" => 3,
"Low" => 4,
_ => 3
},
["Microsoft.VSTS.Common.Severity"] = finding.Severity,
["System.Description"] = $@"
<b>Finding:</b> {finding.Description}<br/>
<b>Severity:</b> {finding.Severity}<br/>
<b>CVSS Score:</b> {finding.CvssScore}<br/>
<b>Remediation:</b> {finding.Remediation}<br/>
<b>Due Date:</b> {finding.DueDate:yyyy-MM-dd}
",
["System.Tags"] = "Pentest; Security; Compliance; Q1-2025",
["Custom.ComplianceFramework"] = "SOC2-CC6.1",
["System.AreaPath"] = "ATP\\Security",
["System.IterationPath"] = "ATP\\2025\\Q1"
}
};
await _devOpsClient.CreateWorkItemAsync(workItem, "ConnectSoft", "ATP");
}
Compliance Scanning & Monitoring¶
Microsoft Defender for Cloud (Production):
# Enable Defender for Cloud for all Production resources
# Enable the Standard tier for each Defender plan
for PLAN in VirtualMachines SqlServers AppServices StorageAccounts KubernetesService ContainerRegistry KeyVaults; do
  az security pricing create \
    --name "$PLAN" \
    --tier "Standard"
done
# Enable regulatory compliance dashboard
az security regulatory-compliance-standards list \
  --query "[].{Name:name, State:state}"
echo "✅ Defender for Cloud enabled; compliance dashboard available"
Continuous Compliance Monitoring (Azure Policy Compliance):
#!/bin/bash
# compliance-scan-prod.sh
ENVIRONMENT="prod"
RESOURCE_GROUP="ConnectSoft-ATP-Prod-EUS-RG"
echo "Running compliance scan for Production..."
# Get policy compliance state
COMPLIANCE=$(az policy state summarize \
--resource-group $RESOURCE_GROUP \
--query "results.policyAssignments[].{Policy:policyAssignmentId, Compliant:results.resourceDetails.compliantResources, NonCompliant:results.resourceDetails.nonCompliantResources}" \
--output json)
NON_COMPLIANT_COUNT=$(echo "$COMPLIANCE" | jq '[.[].NonCompliant] | add // 0')
if [ "$NON_COMPLIANT_COUNT" -gt 0 ]; then
echo "❌ Non-compliant resources detected: $NON_COMPLIANT_COUNT"
# List non-compliant resources
az policy state list \
--resource-group $RESOURCE_GROUP \
--filter "complianceState eq 'NonCompliant'" \
--query "[].{Resource:resourceId, Policy:policyDefinitionName, Reason:policyDefinitionAction}" \
--output table
# Create compliance violation ticket
az boards work-item create \
--title "Compliance Violation Detected in Production" \
--type "Bug" \
--description "Azure Policy compliance scan detected $NON_COMPLIANT_COUNT non-compliant resources.\n\nSee attached compliance report." \
--assigned-to "security-team@connectsoft.example" \
--fields Priority=1 Severity="1 - Critical"
exit 1
else
echo "✅ All Production resources compliant"
fi
# Generate compliance report
az policy state summarize \
--resource-group $RESOURCE_GROUP \
--output json > "compliance-report-prod-$(date +%Y%m%d).json"
echo "✅ Compliance scan complete"
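The scan above reports raw compliant/non-compliant counts; dashboards usually present these as a percentage. A minimal sketch of that calculation (the `compliance_pct` helper is hypothetical):

```bash
# Compute a compliance percentage from compliant/non-compliant resource
# counts, as reported by `az policy state summarize`.
compliance_pct() {
  local compliant=$1 noncompliant=$2
  local total=$((compliant + noncompliant))
  # An empty resource group is trivially compliant
  [ "$total" -eq 0 ] && { echo 100; return; }
  echo $(( 100 * compliant / total ))
}

compliance_pct 95 5   # → 95
```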
Regulatory Compliance Evidence¶
SOC 2 Evidence Collection:
soc2Evidence:
CC6.1_LogicalAndPhysicalAccessControls:
- Azure AD sign-in logs (authentication)
- PIM activation logs (privileged access)
- Key Vault access logs (secret retrieval)
- NSG flow logs (network access)
CC6.6_LogicalAccessRemoval:
- Access review reports (weekly)
- Offboarding automation logs
- Role assignment change logs
CC7.2_DetectionAndMonitoring:
- Azure Defender alerts (security findings)
- Application Insights exceptions
- Azure Monitor alerts (health/performance)
CC8.1_ChangeManagement:
- Azure DevOps pipeline logs (deployments)
- Git commit history (code changes)
- CAB meeting minutes (approval records)
- Change ticket logs (ServiceNow integration)
GDPR Evidence Collection:
gdprEvidence:
Article30_RecordsOfProcessing:
- Data classification inventory (PII fields)
- Data flow diagrams (tenant → ATP → storage)
- Retention policy documentation
Article32_SecurityMeasures:
- Encryption certificates (TDE, TLS 1.3)
- Penetration test reports (quarterly)
- Vulnerability scan results (continuous)
Article33_BreachNotification:
- Incident response logs (P0/P1 incidents)
- Breach notification templates
- Tenant communication records
Article17_RightToErasure:
- GDPR deletion logs (tenant data purge)
- Deletion verification reports
- Backup retention override logs
HIPAA Evidence Collection:
hipaaEvidence:
164.308_AdministrativeSafeguards:
- Security training completion records
- Risk assessment reports (annual)
- Access review logs (weekly)
164.310_PhysicalSafeguards:
- Azure datacenter compliance certificates
- Physical access logs (Azure-managed)
164.312_TechnicalSafeguards:
- Encryption key rotation logs
- Audit trail reports (SQL, API, Key Vault)
- Access control logs (authentication/authorization)
164.316_PoliciesAndProcedures:
- Incident response runbooks
- DR test reports (weekly)
- Compliance policy documentation
Evidence Retention & Archival¶
Production Evidence Retention Policy:
evidenceRetention:
pipelineLogs:
retention: 1 year (SOC 2 minimum)
storage: Azure DevOps + exported to Blob Storage
format: JSON (machine-readable)
deploymentArtifacts:
retention: 7 years (match audit data retention)
storage: Azure Artifacts + Blob Archive
artifacts:
- SBOM (Software Bill of Materials)
- Security scan reports (SAST, dependency, secrets)
- Test results (unit, integration, regression)
- ADR snapshots (architecture decisions)
accessLogs:
retention: 90 days (hot) + 7 years (cold archive)
storage: Log Analytics + Blob Storage (immutable)
logs:
- Azure AD sign-in logs
- PIM activation logs
- Key Vault audit logs
- SQL audit logs
- API Gateway access logs
incidentRecords:
retention: 7 years (regulatory requirement)
storage: Azure DevOps Work Items + exported to Blob
records:
- Incident tickets (P0/P1)
- Post-mortem reports
- Corrective action tracking
- Communication logs (tenant notifications)
complianceReports:
retention: 7 years (audit requirement)
storage: Blob Storage (WORM + Legal Hold)
reports:
- Quarterly SOC 2 attestation
- Annual HIPAA assessment
- GDPR compliance reports
- Penetration test reports
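The 7-year retention windows above are typically expressed in days when configuring storage immutability policies; 7 × 365 = 2555 days (leap days ignored), which is the figure used by ATP's immutability configuration. A trivial sketch of the conversion:

```bash
# Convert a retention period in years to the day count used by
# immutability policies (365-day years, leap days ignored).
years_to_days() { echo $(( $1 * 365 )); }

years_to_days 7   # → 2555
years_to_days 1   # → 365
```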
Automated Evidence Export (Azure Function):
// Export compliance evidence to immutable storage
[FunctionName("ExportComplianceEvidence")]
public async Task RunAsync(
[TimerTrigger("0 0 1 1 */3 *")] TimerInfo timer, // Quarterly on 1st at 1 AM
ILogger log)
{
log.LogInformation("Starting quarterly compliance evidence export...");
var quarter = $"Q{(DateTime.UtcNow.Month - 1) / 3 + 1}-{DateTime.UtcNow.Year}";
var exportPath = $"compliance-evidence/{quarter}";
var blobClient = new BlobServiceClient(
connectionString: Environment.GetEnvironmentVariable("ComplianceStorageConnectionString"));
var containerClient = blobClient.GetBlobContainerClient("compliance-evidence");
// Export Azure DevOps pipeline logs
var pipelineLogs = await ExportPipelineLogsAsync(quarter);
await UploadEvidenceAsync(containerClient, $"{exportPath}/pipeline-logs.json", pipelineLogs);
// Export deployment artifacts (SBOM, security scans)
var deploymentArtifacts = await ExportDeploymentArtifactsAsync(quarter);
await UploadEvidenceAsync(containerClient, $"{exportPath}/deployment-artifacts.zip", deploymentArtifacts);
// Export access logs from Log Analytics
var accessLogs = await ExportAccessLogsAsync(quarter);
await UploadEvidenceAsync(containerClient, $"{exportPath}/access-logs.json", accessLogs);
// Export incident records
var incidents = await ExportIncidentRecordsAsync(quarter);
await UploadEvidenceAsync(containerClient, $"{exportPath}/incidents.json", incidents);
// Generate compliance summary report
var summaryReport = await GenerateComplianceSummaryAsync(quarter);
await UploadEvidenceAsync(containerClient, $"{exportPath}/compliance-summary.pdf", summaryReport);
// Apply a version-level immutability policy (WORM) to each exported blob.
// Note: container-level immutability policies are managed via the ARM
// management plane; version-level immutability must be enabled on the container.
await foreach (var blob in containerClient.GetBlobsAsync(prefix: exportPath))
{
    await containerClient.GetBlobClient(blob.Name).SetImmutabilityPolicyAsync(
        new BlobImmutabilityPolicy
        {
            ExpiresOn = DateTimeOffset.UtcNow.AddDays(2555), // 7 years
            PolicyMode = BlobImmutabilityPolicyMode.Locked
        });
}
log.LogInformation($"✅ Compliance evidence exported to {exportPath}");
log.LogInformation("Evidence is now immutable (7-year retention)");
}
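The export path is keyed by a quarter label computed from the current month. The same arithmetic as the C# expression above, as a shell helper for ad-hoc scripts (hypothetical, shown only to make the mapping explicit):

```bash
# Derive the quarter label (e.g. Q1-2025) from a numeric month and year,
# matching the C# formula: (month - 1) / 3 + 1.
quarter_label() {
  local month=$((10#$1)) year=$2   # 10# guards against leading-zero octal
  echo "Q$(( (month - 1) / 3 + 1 ))-$year"
}

quarter_label 2 2025    # → Q1-2025
quarter_label 11 2025   # → Q4-2025
```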
Compliance Dashboard & Reporting¶
Azure Policy Compliance Dashboard (KQL):
// Compliance posture over time
PolicyInsights
| where TimeGenerated > ago(90d)
| where ResourceGroup contains "ATP-Prod"
| extend IsCompliant = ComplianceState == "Compliant"
| summarize
TotalResources = dcount(ResourceId),
CompliantResources = dcountif(ResourceId, IsCompliant),
NonCompliantResources = dcountif(ResourceId, not(IsCompliant)),
CompliancePercentage = 100.0 * dcountif(ResourceId, IsCompliant) / dcount(ResourceId)
by bin(TimeGenerated, 1d), PolicyDefinitionName
| order by TimeGenerated desc
Compliance Scorecard (Monthly Report):
// Generate monthly compliance scorecard
public class ComplianceReportService
{
public async Task<ComplianceScorecard> GenerateMonthlyScorecard(string environment)
{
var scorecard = new ComplianceScorecard
{
Environment = environment,
ReportDate = DateTime.UtcNow,
Period = $"{DateTime.UtcNow:MMMM yyyy}"
};
// SOC 2 Controls
scorecard.SOC2 = new SOC2Scorecard
{
CC6_1_AccessControls = await ValidateAccessControlsAsync(),
CC6_6_AccessRemoval = await ValidateAccessRemovalAsync(),
CC7_2_Monitoring = await ValidateMonitoringAsync(),
CC8_1_ChangeManagement = await ValidateChangeManagementAsync(),
OverallCompliance = CalculateOverallCompliance(new[] { /* ... */ })
};
// GDPR Controls
scorecard.GDPR = new GDPRScorecard
{
Article30_RecordsOfProcessing = await ValidateDataInventoryAsync(),
Article32_SecurityMeasures = await ValidateSecurityMeasuresAsync(),
Article33_BreachNotification = await ValidateBreachNotificationAsync(),
Article17_RightToErasure = await ValidateRightToErasureAsync(),
OverallCompliance = CalculateOverallCompliance(new[] { /* ... */ })
};
// HIPAA Controls
scorecard.HIPAA = new HIPAAScorecard
{
Safeguard_164_308_Administrative = await ValidateAdministrativeSafeguardsAsync(),
Safeguard_164_310_Physical = await ValidatePhysicalSafeguardsAsync(),
Safeguard_164_312_Technical = await ValidateTechnicalSafeguardsAsync(),
Safeguard_164_316_Policies = await ValidatePoliciesAsync(),
OverallCompliance = CalculateOverallCompliance(new[] { /* ... */ })
};
// Overall environment compliance
scorecard.OverallScore = (scorecard.SOC2.OverallCompliance +
scorecard.GDPR.OverallCompliance +
scorecard.HIPAA.OverallCompliance) / 3;
return scorecard;
}
}
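The overall environment score above is an unweighted integer average of the three framework scores. The same calculation as a one-line shell helper (hypothetical, for spot-checking report numbers):

```bash
# Average three framework compliance scores (SOC 2, GDPR, HIPAA),
# matching the C# OverallScore calculation (integer division).
overall_score() { echo $(( ($1 + $2 + $3) / 3 )); }

overall_score 98 95 92   # → 95
```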
PII Redaction & Data Classification¶
Automated PII Detection (Production):
// PII detection and redaction middleware
public class PiiRedactionMiddleware
{
private readonly RequestDelegate _next;
private readonly IPiiDetectionService _piiDetector;
private readonly ILogger<PiiRedactionMiddleware> _logger;
public PiiRedactionMiddleware(
    RequestDelegate next,
    IPiiDetectionService piiDetector,
    ILogger<PiiRedactionMiddleware> logger)
{
    _next = next;
    _piiDetector = piiDetector;
    _logger = logger;
}
public async Task InvokeAsync(HttpContext context)
{
// Capture response body
var originalBody = context.Response.Body;
using var newBody = new MemoryStream();
context.Response.Body = newBody;
await _next(context);
// Read response
newBody.Seek(0, SeekOrigin.Begin);
var responseBody = await new StreamReader(newBody).ReadToEndAsync();
// Detect and redact PII
var piiDetected = _piiDetector.DetectPii(responseBody);
if (piiDetected.Any())
{
_logger.LogWarning(
"PII detected in API response: {PiiTypes}. Redacting...",
string.Join(", ", piiDetected.Select(p => p.Type)));
// Redact PII fields
responseBody = _piiDetector.RedactPii(responseBody, piiDetected);
// Log PII exposure incident (for compliance)
await LogPiiExposureAsync(context.Request.Path, piiDetected);
}
// Write the (possibly redacted) response to the original stream
context.Response.Body = originalBody;
var outputBytes = Encoding.UTF8.GetBytes(responseBody);
context.Response.ContentLength = outputBytes.Length;
await originalBody.WriteAsync(outputBytes);
}
}
// PII detection service
public class PiiDetectionService : IPiiDetectionService
{
private static readonly Regex EmailRegex = new Regex(@"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b");
private static readonly Regex SsnRegex = new Regex(@"\b\d{3}-\d{2}-\d{4}\b");
private static readonly Regex CreditCardRegex = new Regex(@"\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b");
public List<PiiDetection> DetectPii(string content)
{
var detections = new List<PiiDetection>();
// Email addresses
var emailMatches = EmailRegex.Matches(content);
detections.AddRange(emailMatches.Select(m => new PiiDetection
{
Type = "Email",
Value = m.Value,
StartIndex = m.Index,
Length = m.Length
}));
// SSN
var ssnMatches = SsnRegex.Matches(content);
detections.AddRange(ssnMatches.Select(m => new PiiDetection
{
Type = "SSN",
Value = m.Value,
StartIndex = m.Index,
Length = m.Length
}));
// Credit card numbers
var ccMatches = CreditCardRegex.Matches(content);
detections.AddRange(ccMatches.Select(m => new PiiDetection
{
Type = "CreditCard",
Value = m.Value,
StartIndex = m.Index,
Length = m.Length
}));
return detections;
}
public string RedactPii(string content, List<PiiDetection> detections)
{
foreach (var detection in detections.OrderByDescending(d => d.StartIndex))
{
var redacted = detection.Type switch
{
"Email" => MaskEmail(detection.Value),
"SSN" => "***-**-****",
"CreditCard" => "**** **** **** " + detection.Value.Substring(detection.Value.Length - 4),
_ => "[REDACTED]"
};
content = content.Remove(detection.StartIndex, detection.Length)
.Insert(detection.StartIndex, redacted);
}
return content;
}
private string MaskEmail(string email)
{
var parts = email.Split('@');
if (parts.Length != 2) return "[REDACTED]";
var localPart = parts[0];
var maskedLocal = localPart.Length > 3
? localPart.Substring(0, 3) + "***"
: "***";
return $"{maskedLocal}@{parts[1]}";
}
}
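The credit-card regex above matches any 16-digit pattern, so it will flag some non-card numbers. A common complement (not part of the ATP middleware; shown as a sketch) is a Luhn checksum to filter false positives before redaction:

```bash
# Luhn checksum: returns success (0) only for digit strings whose
# checksum is valid, which filters most random 16-digit matches.
luhn_valid() {
  local num="${1//[^0-9]/}" sum=0 i digit double=0
  for (( i=${#num}-1; i>=0; i-- )); do
    digit=${num:i:1}
    if [ "$double" -eq 1 ]; then
      digit=$(( digit * 2 ))
      [ "$digit" -gt 9 ] && digit=$(( digit - 9 ))
    fi
    sum=$(( sum + digit ))
    double=$(( 1 - double ))
  done
  [ $(( sum % 10 )) -eq 0 ]
}

luhn_valid "4532015112830366" && echo valid   # → valid (passes Luhn)
```

In the C# service, the same check would sit between `CreditCardRegex.Matches` and adding the `PiiDetection` entry.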
Data Classification (Automated Tagging):
// Automated data classification for SQL tables
public class DataClassificationService
{
private readonly ILogger<DataClassificationService> _logger; // injected via DI (constructor omitted for brevity)
public async Task ClassifyDatabaseAsync(string databaseName)
{
var classifications = new List<DataClassification>
{
new DataClassification
{
Schema = "dbo",
Table = "AuditEvents",
Column = "UserId",
InformationType = "Person.Name",
SensitivityLabel = "Confidential",
SensitivityRank = "High"
},
new DataClassification
{
Schema = "dbo",
Table = "AuditEvents",
Column = "IPAddress",
InformationType = "Network.IPAddress",
SensitivityLabel = "Confidential",
SensitivityRank = "Medium"
},
new DataClassification
{
Schema = "dbo",
Table = "Tenants",
Column = "ContactEmail",
InformationType = "Contact.EmailAddress",
SensitivityLabel = "Confidential - GDPR",
SensitivityRank = "High"
}
};
foreach (var classification in classifications)
{
await ApplyClassificationAsync(databaseName, classification);
}
}
private async Task ApplyClassificationAsync(string database, DataClassification classification)
{
// Values come from the fixed, code-defined list above; parameterize or
// validate them if classifications are ever sourced from external input.
var sql = $@"
ADD SENSITIVITY CLASSIFICATION TO
[{classification.Schema}].[{classification.Table}].[{classification.Column}]
WITH (
LABEL = '{classification.SensitivityLabel}',
INFORMATION_TYPE = '{classification.InformationType}',
RANK = {classification.SensitivityRank}
)
";
// Execute via SQL connection
await ExecuteSqlAsync(database, sql);
_logger.LogInformation(
"Applied data classification: {Schema}.{Table}.{Column} = {Label}",
classification.Schema, classification.Table, classification.Column, classification.SensitivityLabel);
}
}
Vulnerability Management Per Environment¶
Vulnerability Scanning Strategy:
vulnerabilityScanning:
dev:
frequency: Weekly
tools:
- OWASP Dependency Check (NuGet packages)
- Trivy (Docker images)
severity: Informational only
remediation: Best-effort
test:
frequency: Daily (as part of CI/CD)
tools:
- OWASP Dependency Check
- Trivy
- SonarQube (SAST)
severity: Block critical vulnerabilities
remediation: 30 days
staging:
frequency: Continuous
tools:
- Azure Defender for Cloud
- OWASP Dependency Check
- Trivy
- SonarQube
severity: Block critical/high
remediation: 7 days (critical), 14 days (high)
production:
frequency: Continuous (real-time)
tools:
- Azure Defender for Cloud (Advanced Threat Protection)
- Microsoft Defender for Containers
- Azure Sentinel (SIEM)
- OWASP Dependency Check
severity: Block all critical/high
remediation: 24 hours (critical), 7 days (high)
responseProcedure: Incident ticket + emergency patching
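The per-environment gating rules above can be expressed as a predicate a pipeline step might evaluate. A minimal sketch; the `should_block` helper is hypothetical, not part of the ATP pipelines:

```bash
# Decide whether a vulnerability of a given severity blocks deployment
# in a given environment, mirroring the vulnerabilityScanning table.
should_block() {
  local env=$1 severity=$2
  case "$env:$severity" in
    test:critical)                          return 0 ;;  # Test blocks critical
    staging:critical|staging:high)          return 0 ;;  # Staging blocks critical/high
    production:critical|production:high)    return 0 ;;  # Production blocks critical/high
    *)                                      return 1 ;;  # Report only
  esac
}

should_block production high && echo BLOCK   # → BLOCK
should_block dev critical || echo REPORT     # → REPORT
```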
Vulnerability Response Workflow:
flowchart TD
A[Vulnerability Detected] --> B{Severity?}
B -->|Critical| C[Create P0 Incident]
B -->|High| D[Create P1 Bug]
B -->|Medium/Low| E[Create P2/P3 Bug]
C --> F[Emergency Patching<br/>SLA: 24 hours]
D --> G[Scheduled Patching<br/>SLA: 7 days]
E --> H[Backlog Grooming<br/>SLA: 30 days]
F --> I[Deploy Hotfix to Prod]
G --> J[Deploy via Regular Release]
H --> J
I --> K[Validate Fix]
J --> K
K --> L{Fixed?}
L -->|Yes| M[Close Ticket]
L -->|No| N[Escalate to Security Team]
N --> O[Manual Investigation]
O --> K
M --> P[Post-Mortem if P0]
Compliance Automation & Guardrails¶
Infrastructure Compliance Validation (CI/CD Pipeline):
# Compliance validation stage in infrastructure pipeline
- stage: ComplianceValidation
displayName: 'Validate Compliance Before Deployment'
dependsOn: PulumiPreview
jobs:
- job: ComplianceScan
displayName: 'Run Compliance Scans'
steps:
# Checkov - Infrastructure as Code compliance
- task: Bash@3
displayName: 'Run Checkov IaC Scan'
inputs:
targetType: 'inline'
script: |
pip install checkov
checkov --directory infrastructure/ \
--framework pulumi \
--output junitxml \
--output-file-path checkov-results.xml \
--soft-fail # Don't fail build, just report
- task: PublishTestResults@2
inputs:
testResultsFormat: 'JUnit'
testResultsFiles: 'checkov-results.xml'
testRunTitle: 'Checkov Compliance Scan'
# Azure Policy What-If Analysis
- task: AzureCLI@2
displayName: 'Azure Policy What-If'
inputs:
azureSubscription: $(azureSubscription)
scriptType: 'bash'
scriptLocation: 'inlineScript'
inlineScript: |
# Simulate policy compliance for proposed changes
az policy state trigger-scan \
--resource-group $(resourceGroup)
# Get compliance state
az policy state list \
--resource-group $(resourceGroup) \
--filter "complianceState eq 'NonCompliant'" \
--output table
# Terraform Compliance (if using Terraform instead of Pulumi)
- task: Bash@3
displayName: 'Run Terraform Compliance'
inputs:
targetType: 'inline'
script: |
pip install terraform-compliance
terraform-compliance \
--features compliance/ \
--planfile tfplan.json
Application Compliance Validation (C# Build):
// Unit test: Validate PII redaction
[Fact]
public void ApiResponse_WhenContainsPII_RedactsEmail()
{
// Arrange
var middleware = new PiiRedactionMiddleware(/* ... */);
var response = new
{
tenantId = "tenant-123",
contactEmail = "user@example.com", // PII
auditEventCount = 1000
};
// Act
var json = JsonSerializer.Serialize(response);
var redacted = _piiDetector.RedactPii(json, _piiDetector.DetectPii(json));
// Assert
Assert.DoesNotContain("user@example.com", redacted);
Assert.Contains("use***@example.com", redacted); // Masked email
}
// Integration test: Validate encryption at rest
[Fact]
public async Task SqlDatabase_InProduction_HasTdeEnabled()
{
// Arrange
var environment = Environment.GetEnvironmentVariable("ASPNETCORE_ENVIRONMENT");
if (environment != "Production")
{
return; // Skip test in non-production
}
// Act
var tdeStatus = await _sqlClient.ExecuteScalarAsync<string>(
"SELECT encryption_state_desc FROM sys.dm_database_encryption_keys WHERE database_id = DB_ID()");
// Assert
Assert.Equal("ENCRYPTED", tdeStatus);
}
Audit Trail Integrity¶
Log Signing (Production):
// Sign audit logs with HSM for tamper-evidence
public class AuditLogSigner
{
private readonly CryptographyClient _cryptoClient;
public AuditLogSigner(string keyVaultUrl, string keyName)
{
var keyClient = new KeyClient(new Uri(keyVaultUrl), new DefaultAzureCredential());
var key = keyClient.GetKey(keyName);
_cryptoClient = new CryptographyClient(key.Value.Id, new DefaultAzureCredential());
}
public async Task<SignedAuditLog> SignLogBatchAsync(List<AuditLogEntry> logs)
{
// Serialize log batch
var logJson = JsonSerializer.Serialize(logs);
var logBytes = Encoding.UTF8.GetBytes(logJson);
// Compute SHA-256 hash
var hash = SHA256.HashData(logBytes);
// Sign with HSM key (RSASSA-PKCS1-v1_5)
var signResult = await _cryptoClient.SignAsync(SignatureAlgorithm.RS256, hash);
return new SignedAuditLog
{
Logs = logs,
Hash = Convert.ToBase64String(hash),
Signature = Convert.ToBase64String(signResult.Signature),
SignedAt = DateTime.UtcNow,
SigningKeyId = _cryptoClient.KeyId.ToString(),
Algorithm = "RS256"
};
}
public async Task<bool> VerifyLogBatchAsync(SignedAuditLog signedLog)
{
// Recompute hash
var logJson = JsonSerializer.Serialize(signedLog.Logs);
var logBytes = Encoding.UTF8.GetBytes(logJson);
var hash = SHA256.HashData(logBytes);
// Verify signature
var signature = Convert.FromBase64String(signedLog.Signature);
var verifyResult = await _cryptoClient.VerifyAsync(
SignatureAlgorithm.RS256,
hash,
signature);
return verifyResult.IsValid;
}
}
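The batch hash the signer computes is plain SHA-256 over the serialized log bytes, so it can be reproduced outside .NET for spot checks with `sha256sum` from coreutils. The string `abc` is a well-known SHA-256 test vector, used here only to show the digest shape:

```bash
# Recompute a SHA-256 digest over exact bytes, as the AuditLogSigner does
# before signing. printf avoids the trailing newline echo would add.
digest=$(printf 'abc' | sha256sum | awk '{print $1}')
echo "$digest"
# → ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad
```

For a real batch, the input must be byte-identical to the C# serialization (same JSON formatting, same encoding) or the digests will not match.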
Log Integrity Validation (Daily):
#!/bin/bash
# validate-log-integrity-prod.sh
echo "Validating audit log integrity for Production..."
# Download signed log batches from immutable storage
az storage blob download-batch \
--account-name atpstorageprodeus \
--source "audit-logs" \
--destination "./log-validation/" \
--pattern "*.signed.json"
# Validate signatures using Azure Key Vault
for LOG_FILE in ./log-validation/*.signed.json; do
echo "Validating: $LOG_FILE"
# Extract signature and hash
SIGNATURE=$(jq -r '.signature' "$LOG_FILE")
HASH=$(jq -r '.hash' "$LOG_FILE")
# Verify signature using Key Vault
az keyvault key verify \
--vault-name atp-keyvault-prod-eus \
--name AuditLogSigningKey \
--algorithm RS256 \
--digest "$HASH" \
--signature "$SIGNATURE"
if [ $? -eq 0 ]; then
echo "✅ $LOG_FILE signature valid"
else
echo "❌ $LOG_FILE signature INVALID - potential tampering detected"
# Create security incident
az boards work-item create \
--title "Log Tampering Detected: $LOG_FILE" \
--type "Incident" \
--description "Audit log signature validation failed. Potential tampering detected.\n\nFile: $LOG_FILE" \
--assigned-to "security-team@connectsoft.example" \
--fields Priority=1 Severity="1 - Critical"
exit 1
fi
done
echo "✅ All audit logs validated successfully"
Compliance Reporting & Attestation¶
Quarterly SOC 2 Report:
// Generate SOC 2 Type II attestation report
[FunctionName("GenerateSOC2Report")]
public async Task RunAsync(
[TimerTrigger("0 0 1 1 */3 *")] TimerInfo timer, // Quarterly
ILogger log)
{
log.LogInformation("Generating SOC 2 Type II report...");
var quarter = $"Q{(DateTime.UtcNow.Month - 1) / 3 + 1}-{DateTime.UtcNow.Year}";
var report = new SOC2Report
{
Quarter = quarter,
ReportDate = DateTime.UtcNow,
Scope = "ATP Production Environment (East US + West Europe)"
};
// CC6.1: Logical and Physical Access Controls
report.CC6_1 = new ControlEvidence
{
ControlName = "CC6.1 - Logical and Physical Access Controls",
EvidenceCollected = new List<string>
{
"Azure AD sign-in logs (90 days)",
"PIM activation logs (all elevations)",
"Key Vault access logs (all secret retrievals)",
"NSG flow logs (all network access attempts)",
"Access review reports (weekly)"
},
TestingPerformed = "Sampled 100 access requests; validated MFA and PIM enforcement",
Exceptions = new List<string>(), // No exceptions
OpinionDate = DateTime.UtcNow,
Opinion = "Operating Effectively"
};
// CC8.1: Change Management Controls
report.CC8_1 = new ControlEvidence
{
ControlName = "CC8.1 - Change Management",
EvidenceCollected = new List<string>
{
"Azure DevOps pipeline logs (all deployments)",
"Git commit history (code changes)",
"CAB meeting minutes (approval records)",
"Deployment artifacts (SBOM, security scans)",
"Rollback procedures (documented and tested)"
},
TestingPerformed = "Sampled 50 production deployments; validated CAB approval and SBOM generation",
Exceptions = new List<string> { "1 deployment without SBOM (remediated within 24h)" },
OpinionDate = DateTime.UtcNow,
Opinion = "Operating Effectively (1 exception noted)"
};
// ... (remaining controls)
// Generate PDF report
var pdf = await GeneratePdfReportAsync(report);
// Upload to compliance evidence storage
await UploadToImmutableStorageAsync($"soc2-reports/{quarter}/SOC2-Report-{quarter}.pdf", pdf);
// Send to auditor
await SendToAuditorAsync(pdf, "auditor@example-auditing-firm.com");
log.LogInformation($"✅ SOC 2 report generated for {quarter}");
}
External Auditor Access¶
Time-Limited Read-Only Access (Production):
#!/bin/bash
# grant-auditor-access.sh
AUDITOR_EMAIL=$1
DURATION_HOURS=${2:-8} # Default: 8 hours
JUSTIFICATION=$3
echo "Granting auditor access to Production..."
echo "Auditor: $AUDITOR_EMAIL"
echo "Duration: $DURATION_HOURS hours"
echo "Justification: $JUSTIFICATION"
# Create guest user in Azure AD (if not exists)
AUDITOR_ID=$(az ad user show --id "$AUDITOR_EMAIL" --query id -o tsv 2>/dev/null)
if [ -z "$AUDITOR_ID" ]; then
echo "Creating guest user..."
az ad user create \
--display-name "External Auditor" \
--user-principal-name "$AUDITOR_EMAIL" \
--password "$(openssl rand -base64 24)" \
--account-enabled true \
--force-change-password-next-sign-in false
AUDITOR_ID=$(az ad user show --id "$AUDITOR_EMAIL" --query id -o tsv)
fi
# Grant the Reader role via a time-bound PIM assignment (Azure resource scope,
# using the ARM roleAssignmentScheduleRequests API; the GUID below is the
# built-in Reader role definition ID)
az rest \
--method PUT \
--url "https://management.azure.com/subscriptions/<sub-id>/resourceGroups/ConnectSoft-ATP-Prod-EUS-RG/providers/Microsoft.Authorization/roleAssignmentScheduleRequests/$(uuidgen)?api-version=2020-10-01" \
--body "{
\"properties\": {
\"principalId\": \"$AUDITOR_ID\",
\"roleDefinitionId\": \"/subscriptions/<sub-id>/providers/Microsoft.Authorization/roleDefinitions/acdd72a7-3385-48ef-bd42-f606fba81ae7\",
\"requestType\": \"AdminAssign\",
\"justification\": \"$JUSTIFICATION\",
\"scheduleInfo\": {
\"startDateTime\": \"$(date -u +%Y-%m-%dT%H:%M:%SZ)\",
\"expiration\": {
\"type\": \"AfterDuration\",
\"duration\": \"PT${DURATION_HOURS}H\"
}
}
}
}"
# Log access grant for compliance
az monitor activity-log list \
--resource-group ConnectSoft-ATP-Prod-EUS-RG \
--start-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
--query "[?authorization.action=='Microsoft.Authorization/roleAssignments/write']"
echo "✅ Auditor access granted (expires in $DURATION_HOURS hours)"
echo "Auditor can access: Azure Portal (read-only), Log Analytics, Compliance Reports"
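The PIM request above encodes the access window as an ISO 8601 duration (`PT8H` = 8 hours). A tiny helper for the `PT<n>H` form used here (hypothetical, for scripts that build the request body):

```bash
# Format an hour count as the ISO 8601 duration string PIM expects.
iso_duration_hours() { printf 'PT%dH\n' "$1"; }

iso_duration_hours 8   # → PT8H
```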
Summary¶
- Compliance Enforcement: Graduated from optional (Dev) to required+BYOK (Production) for encryption, immutability, and audit logging.
- Environment Policies: Dev/Test (synthetic data, relaxed controls), Staging (production-like, full simulation), Production (GDPR/HIPAA/SOC2 enforced).
- Encryption: TDE optional (Dev), TDE required (Staging), TDE+BYOK+HSM (Production) with 90-day automated key rotation.
- Immutability: Disabled (Dev/Test), enabled for testing (Staging), WORM+Legal Hold (Production) with 7-year retention.
- Audit Logging: Basic/7-day (Dev), Enhanced/30-day (Staging), Full/90-day hot + 7-year cold (Production).
- Access Controls: Developer full access (Dev), MFA+PIM (Staging), Zero standing access+PIM+Conditional Access (Production).
- Penetration Testing: Annually (Dev/Test), Quarterly (Staging/Production), Post-Change (Production).
- Vulnerability Scanning: Weekly (Dev), Daily (Test), Continuous (Staging/Production) with 24-hour critical remediation SLA.
- PII Redaction: Optional (Dev/Test), enforced with automated detection/masking (Staging/Production).
- Evidence Collection: Pipeline logs (1 year), deployment artifacts (7 years), access logs (7 years), incident records (7 years).
- Compliance Reporting: Quarterly SOC 2/GDPR/HIPAA reports with automated evidence export to immutable storage.
- Auditor Access: Time-limited read-only access (8-hour max) via PIM with full audit trail.
Testing Strategies Per Environment¶
ATP implements progressive testing strategies across environments, from rapid unit testing in Dev to production synthetic monitoring with canary deployments. Each environment has tailored testing approaches that balance development velocity (fast feedback loops) with production confidence (comprehensive validation).
This strategy ensures high code quality through comprehensive testing in lower environments while maintaining production stability through gradual rollouts, chaos engineering, and continuous synthetic monitoring.
Testing Pyramid Per Environment¶
graph TB
subgraph Dev Environment
DevUnit[Unit Tests<br/>1000+ tests<br/>< 2 minutes]
DevInteg[Integration Tests<br/>100+ tests<br/>< 5 minutes]
DevE2E[Local E2E Tests<br/>10+ flows<br/>Developer-run]
DevUnit --> DevInteg
DevInteg --> DevE2E
end
subgraph Test Environment
TestSmoke[Smoke Tests<br/>Critical paths<br/>Post-deployment]
TestRegression[Regression Suite<br/>500+ tests<br/>Nightly]
TestContract[API Contract Tests<br/>OpenAPI validation<br/>Breaking change detection]
TestSmoke --> TestRegression
TestRegression --> TestContract
end
subgraph Staging Environment
StagingLoad[Load Tests<br/>50% prod traffic<br/>60-minute runs]
StagingChaos[Chaos Tests<br/>Failure injection<br/>Weekly]
StagingSecurity[Security Tests<br/>OWASP ZAP<br/>Pre-production]
StagingDR[DR Drills<br/>Failover validation<br/>Monthly]
StagingLoad --> StagingChaos
StagingChaos --> StagingSecurity
StagingSecurity --> StagingDR
end
subgraph Production Environment
ProdSynthetic[Synthetic Monitors<br/>Multi-region<br/>Every 5 minutes]
ProdCanary[Canary Tests<br/>10% rollout<br/>24-hour validation]
ProdAB[A/B Tests<br/>Feature flags<br/>Statistical analysis]
ProdSynthetic --> ProdCanary
ProdCanary --> ProdAB
end
style DevUnit fill:#90EE90
style TestRegression fill:#FFD700
style StagingChaos fill:#FFA500
style ProdSynthetic fill:#FF6347
Dev Environment Testing¶
Purpose: Rapid feedback loops — comprehensive unit and integration tests run on every commit, so defects surface within minutes of being introduced.
Unit Tests¶
Execution: Every commit (pre-push hook + CI pipeline)
Configuration (ConnectSoft.ATP.Ingestion.Tests.csproj):
<Project Sdk="Microsoft.NET.Sdk">
<PropertyGroup>
<TargetFramework>net8.0</TargetFramework>
<IsPackable>false</IsPackable>
<IsTestProject>true</IsTestProject>
</PropertyGroup>
<ItemGroup>
<PackageReference Include="xunit" Version="2.6.2" />
<PackageReference Include="xunit.runner.visualstudio" Version="2.5.4" />
<PackageReference Include="Moq" Version="4.20.70" />
<PackageReference Include="FluentAssertions" Version="6.12.0" />
<PackageReference Include="Microsoft.NET.Test.Sdk" Version="17.8.0" />
<PackageReference Include="coverlet.collector" Version="6.0.0" />
</ItemGroup>
<ItemGroup>
<ProjectReference Include="..\ConnectSoft.ATP.Ingestion\ConnectSoft.ATP.Ingestion.csproj" />
</ItemGroup>
</Project>
Example Unit Test (ATP Ingestion Service):
// Unit test: Validate audit event ingestion
public class AuditIngestionServiceTests
{
private readonly Mock<IAuditRepository> _mockRepository;
private readonly Mock<ITamperEvidenceService> _mockTamperEvidence;
private readonly Mock<ILogger<AuditIngestionService>> _mockLogger;
private readonly AuditIngestionService _service;
public AuditIngestionServiceTests()
{
_mockRepository = new Mock<IAuditRepository>();
_mockTamperEvidence = new Mock<ITamperEvidenceService>();
_mockLogger = new Mock<ILogger<AuditIngestionService>>();
_service = new AuditIngestionService(
_mockRepository.Object,
_mockTamperEvidence.Object,
_mockLogger.Object);
}
[Fact]
public async Task IngestEvent_ValidEvent_ReturnsSuccessWithEventId()
{
// Arrange
var auditEvent = new AuditEvent
{
TenantId = "tenant-123",
EventType = "UserLogin",
Timestamp = DateTime.UtcNow,
Payload = "{\"userId\": \"user-456\"}"
};
_mockRepository
.Setup(r => r.InsertAsync(It.IsAny<AuditEvent>()))
.ReturnsAsync(new AuditEvent { EventId = Guid.NewGuid() });
_mockTamperEvidence
.Setup(t => t.GenerateEvidenceAsync(It.IsAny<AuditEvent>()))
.ReturnsAsync(new TamperEvidence { Hash = "abc123", Signature = "def456" });
// Act
var result = await _service.IngestEventAsync(auditEvent);
// Assert
result.Should().NotBeNull();
result.EventId.Should().NotBeEmpty();
result.TamperEvidence.Should().NotBeNull();
_mockRepository.Verify(r => r.InsertAsync(It.IsAny<AuditEvent>()), Times.Once);
_mockTamperEvidence.Verify(t => t.GenerateEvidenceAsync(It.IsAny<AuditEvent>()), Times.Once);
}
[Fact]
public async Task IngestEvent_NullEvent_ThrowsArgumentNullException()
{
// Act & Assert
await Assert.ThrowsAsync<ArgumentNullException>(() =>
_service.IngestEventAsync(null));
}
[Theory]
[InlineData(null)]
[InlineData("")]
[InlineData(" ")]
public async Task IngestEvent_InvalidTenantId_ThrowsValidationException(string tenantId)
{
// Arrange
var auditEvent = new AuditEvent { TenantId = tenantId };
// Act & Assert
await Assert.ThrowsAsync<ValidationException>(() =>
_service.IngestEventAsync(auditEvent));
}
}
Unit Test Execution (Pre-Push Hook):
#!/bin/bash
# .git/hooks/pre-push
echo "Running unit tests before push..."
dotnet test ConnectSoft.ATP.Ingestion.sln \
--configuration Debug \
--filter Category=Unit \
--logger "console;verbosity=minimal" \
--no-build
if [ $? -ne 0 ]; then
echo "❌ Unit tests failed; push aborted"
exit 1
fi
echo "✅ Unit tests passed; proceeding with push"
Coverage Threshold (Dev):
<!-- runsettings file -->
<RunSettings>
<DataCollectionRunSettings>
<DataCollectors>
<DataCollector friendlyName="XPlat Code Coverage">
<Configuration>
<Format>cobertura</Format>
<Exclude>[*]*.Program,[*]*.Startup</Exclude>
<IncludeTestAssembly>false</IncludeTestAssembly>
</Configuration>
</DataCollector>
</DataCollectors>
</DataCollectionRunSettings>
<RunConfiguration>
<TargetFramework>net8.0</TargetFramework>
<TestSessionTimeout>300000</TestSessionTimeout>
<CollectSourceInformation>true</CollectSourceInformation>
</RunConfiguration>
</RunSettings>
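The runsettings above only collects coverage; enforcing a minimum is a separate gate. A minimal sketch of such a gate in shell, assuming a Cobertura report produced by `dotnet test --collect:"XPlat Code Coverage"` — the inline sample report and the 70% threshold are illustrative, not ATP's actual values:

```shell
#!/bin/bash
# Illustrative coverage gate: fail the build when the Cobertura line-rate
# falls below a threshold. The heredoc stands in for
# TestResults/**/coverage.cobertura.xml from a real test run.
THRESHOLD=70

cat > coverage.cobertura.xml <<'EOF'
<?xml version="1.0" encoding="utf-8"?>
<coverage line-rate="0.8234" branch-rate="0.7100" version="1.9">
</coverage>
EOF

# Extract line-rate (a fraction) and convert it to a whole percentage
LINE_RATE=$(sed -n 's/.*<coverage[^>]*line-rate="\([0-9.]*\)".*/\1/p' coverage.cobertura.xml)
PCT=$(awk -v r="$LINE_RATE" 'BEGIN { printf "%d", r * 100 }')

echo "Line coverage: ${PCT}% (threshold: ${THRESHOLD}%)"
if [ "$PCT" -lt "$THRESHOLD" ]; then
  echo "Coverage below threshold"
  exit 1
fi
echo "Coverage gate passed"
```

Teams using the coverlet.msbuild package can express the same gate inline with `dotnet test /p:CollectCoverage=true /p:Threshold=70`.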
Integration Tests¶
Execution: Every CI build (with service containers)
Service Container Setup (azure-pipelines.yml):
# ATP Ingestion pipeline with service containers
resources:
containers:
- container: redis
image: redis:7-alpine
ports:
- 6379:6379
options: --health-cmd "redis-cli ping" --health-interval 10s
- container: rabbitmq
image: rabbitmq:3-management-alpine
ports:
- 5672:5672
- 15672:15672
env:
RABBITMQ_DEFAULT_USER: guest
RABBITMQ_DEFAULT_PASS: guest
options: --health-cmd "rabbitmq-diagnostics -q ping" --health-interval 10s
- container: mssql
image: mcr.microsoft.com/mssql/server:2022-latest
ports:
- 1433:1433
env:
ACCEPT_EULA: Y
SA_PASSWORD: P@ssw0rd123!
options: --health-cmd "/opt/mssql-tools18/bin/sqlcmd -C -S localhost -U sa -P 'P@ssw0rd123!' -Q 'SELECT 1'" --health-interval 10s # 2022 image ships mssql-tools18; -C trusts its self-signed certificate
stages:
- stage: CI_Stage
jobs:
- job: Integration_Tests
services:
redis: redis
rabbitmq: rabbitmq
mssql: mssql
steps:
- task: DotNetCoreCLI@2
displayName: 'Run Integration Tests'
inputs:
command: 'test'
projects: '**/*Tests.Integration.csproj'
arguments: '--configuration Debug --filter Category=Integration --collect:"XPlat Code Coverage"'
Example Integration Test (Redis Integration):
[Collection("IntegrationTests")]
[Trait("Category", "Integration")]
public class RedisCacheIntegrationTests : IAsyncLifetime
{
private readonly IConnectionMultiplexer _redis;
private readonly IDatabase _db;
public RedisCacheIntegrationTests()
{
// Connect to service container Redis (allowAdmin permits the FLUSHALL used in cleanup)
_redis = ConnectionMultiplexer.Connect("localhost:6379,allowAdmin=true");
_db = _redis.GetDatabase();
}
[Fact]
public async Task CacheService_WhenEventCached_RetrievesFromRedis()
{
// Arrange
var cacheKey = "tenant-123:event-456";
var auditEvent = new AuditEvent
{
EventId = Guid.NewGuid(),
TenantId = "tenant-123",
EventType = "UserLogin"
};
var cacheService = new CacheService(_redis);
// Act
await cacheService.SetAsync(cacheKey, auditEvent, TimeSpan.FromMinutes(5));
var retrieved = await cacheService.GetAsync<AuditEvent>(cacheKey);
// Assert
retrieved.Should().NotBeNull();
retrieved.EventId.Should().Be(auditEvent.EventId);
retrieved.TenantId.Should().Be("tenant-123");
}
public Task InitializeAsync() => Task.CompletedTask;
public async Task DisposeAsync()
{
await _db.ExecuteAsync("FLUSHALL"); // Clean up after tests
_redis?.Dispose();
}
}
Example Integration Test (SQL Database):
[Collection("IntegrationTests")]
[Trait("Category", "Integration")]
public class AuditRepositoryIntegrationTests : IAsyncLifetime
{
private readonly SqlConnection _connection;
private readonly AuditRepository _repository;
public AuditRepositoryIntegrationTests()
{
var connectionString = "Server=localhost,1433;Database=ATP_Test;User Id=sa;Password=P@ssw0rd123!;TrustServerCertificate=true";
_connection = new SqlConnection(connectionString);
_repository = new AuditRepository(_connection);
}
[Fact]
public async Task InsertAsync_ValidEvent_PersistsToDatabase()
{
// Arrange
var auditEvent = new AuditEvent
{
TenantId = "tenant-integration-test",
EventType = "TestEvent",
Timestamp = DateTime.UtcNow,
Payload = "{\"test\": true}"
};
// Act
var inserted = await _repository.InsertAsync(auditEvent);
var retrieved = await _repository.GetByIdAsync(inserted.EventId);
// Assert
retrieved.Should().NotBeNull();
retrieved.EventId.Should().Be(inserted.EventId);
retrieved.TenantId.Should().Be("tenant-integration-test");
retrieved.EventType.Should().Be("TestEvent");
}
public async Task InitializeAsync()
{
await _connection.OpenAsync();
// Run database migrations
await _connection.ExecuteAsync(@"
IF NOT EXISTS (SELECT * FROM sys.tables WHERE name = 'AuditEvents')
BEGIN
CREATE TABLE AuditEvents (
EventId UNIQUEIDENTIFIER PRIMARY KEY,
TenantId NVARCHAR(50) NOT NULL,
EventType NVARCHAR(100) NOT NULL,
Timestamp DATETIME2 NOT NULL,
Payload NVARCHAR(MAX)
)
END
");
}
public async Task DisposeAsync()
{
// Clean up test data
await _connection.ExecuteAsync("DELETE FROM AuditEvents WHERE TenantId = 'tenant-integration-test'");
await _connection.CloseAsync();
_connection.Dispose();
}
}
Local End-to-End Tests¶
Execution: Developer-run (manual or via VS Code tasks)
Postman Collection (ATP.Ingestion.postman_collection.json):
{
"info": {
"name": "ATP Ingestion API - Dev",
"description": "Local E2E tests for ATP Ingestion service",
"schema": "https://schema.getpostman.com/json/collection/v2.1.0/collection.json"
},
"item": [
{
"name": "Health Check",
"request": {
"method": "GET",
"header": [],
"url": {
"raw": "https://localhost:7001/health",
"protocol": "https",
"host": ["localhost"],
"port": "7001",
"path": ["health"]
}
},
"event": [
{
"listen": "test",
"script": {
"exec": [
"pm.test(\"Status is 200\", function () {",
" pm.response.to.have.status(200);",
"});",
"",
"pm.test(\"Health status is Healthy\", function () {",
" var jsonData = pm.response.json();",
" pm.expect(jsonData.status).to.eql('Healthy');",
"});"
]
}
}
]
},
{
"name": "Ingest Audit Event",
"request": {
"method": "POST",
"header": [
{
"key": "Content-Type",
"value": "application/json"
},
{
"key": "X-Tenant-Id",
"value": "tenant-dev-001"
}
],
"body": {
"mode": "raw",
"raw": "{\n \"eventType\": \"UserLogin\",\n \"timestamp\": \"{{$isoTimestamp}}\",\n \"userId\": \"user-123\",\n \"ipAddress\": \"203.0.113.10\",\n \"metadata\": {\n \"browser\": \"Chrome\",\n \"os\": \"Windows\"\n }\n}"
},
"url": {
"raw": "https://localhost:7001/api/v1/audit/ingest",
"protocol": "https",
"host": ["localhost"],
"port": "7001",
"path": ["api", "v1", "audit", "ingest"]
}
},
"event": [
{
"listen": "test",
"script": {
"exec": [
"pm.test(\"Status is 201 Created\", function () {",
" pm.response.to.have.status(201);",
"});",
"",
"pm.test(\"Response contains eventId\", function () {",
" var jsonData = pm.response.json();",
" pm.expect(jsonData.eventId).to.exist;",
" pm.environment.set('lastEventId', jsonData.eventId);",
"});",
"",
"pm.test(\"Tamper evidence generated\", function () {",
" var jsonData = pm.response.json();",
" pm.expect(jsonData.tamperEvidence).to.exist;",
" pm.expect(jsonData.tamperEvidence.hash).to.not.be.empty;",
"});"
]
}
}
]
},
{
"name": "Query Audit Event",
"request": {
"method": "GET",
"header": [
{
"key": "X-Tenant-Id",
"value": "tenant-dev-001"
}
],
"url": {
"raw": "https://localhost:7002/api/v1/audit/events/{{lastEventId}}",
"protocol": "https",
"host": ["localhost"],
"port": "7002",
"path": ["api", "v1", "audit", "events", "{{lastEventId}}"]
}
},
"event": [
{
"listen": "test",
"script": {
"exec": [
"pm.test(\"Event retrieved successfully\", function () {",
" pm.response.to.have.status(200);",
" var jsonData = pm.response.json();",
" pm.expect(jsonData.eventId).to.eql(pm.environment.get('lastEventId'));",
"});"
]
}
}
]
}
]
}
Run Postman Tests (Newman CLI):
#!/bin/bash
# run-e2e-tests-dev.sh
echo "Starting ATP services locally..."
# Start services (Docker Compose)
docker-compose -f docker-compose.dev.yml up -d
# Wait for services to be ready
sleep 30
# Run Postman collection
newman run ATP.Ingestion.postman_collection.json \
--environment ATP.Dev.postman_environment.json \
--reporters cli,json,html \
--reporter-html-export e2e-results.html
EXIT_CODE=$?
# Stop services
docker-compose -f docker-compose.dev.yml down
if [ $EXIT_CODE -eq 0 ]; then
echo "✅ E2E tests passed"
else
echo "❌ E2E tests failed"
exit 1
fi
Test Environment Testing¶
Purpose: Automated validation with smoke tests (post-deployment), regression suite (nightly), and API contract validation.
Smoke Tests¶
Execution: Post-deployment (automated via Azure Pipelines)
Configuration (CI/CD Pipeline):
# Smoke tests after deployment to Test
- stage: CD_Test
dependsOn: CI_Stage
jobs:
- deployment: DeployToTest
environment: ATP-Test
strategy:
runOnce:
deploy:
steps:
- template: deploy/deploy-microservice-to-azure-web-site.yaml@templates
parameters:
azureSubscription: $(azureSubscription)
appName: atp-ingestion-test
# Post-deployment smoke tests
- task: PowerShell@2
displayName: 'Run Smoke Tests'
inputs:
targetType: 'inline'
script: |
$services = @(
@{ Name = "Gateway"; Url = "https://atp-gateway-test.azurewebsites.net/health" },
@{ Name = "Ingestion"; Url = "https://atp-ingestion-test.azurewebsites.net/health" },
@{ Name = "Query"; Url = "https://atp-query-test.azurewebsites.net/health" }
)
$allHealthy = $true
foreach ($service in $services) {
Write-Host "Checking $($service.Name) health..."
$response = Invoke-RestMethod -Uri $service.Url -Method Get -TimeoutSec 30
if ($response.status -eq "Healthy") {
Write-Host "✅ $($service.Name) healthy"
} else {
Write-Error "❌ $($service.Name) unhealthy: $($response.status)"
$allHealthy = $false
}
}
if (-not $allHealthy) {
Write-Error "Smoke tests failed; deployment may need rollback"
exit 1
}
Write-Host "✅ All smoke tests passed"
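The same gate can be sketched in portable shell for local debugging. The inline payloads below are illustrative stand-ins for live `/health` responses; a real run would substitute `curl -s "https://atp-<service>-test.azurewebsites.net/health"`:

```shell
#!/bin/bash
# Illustrative smoke-test gate: parse each service's /health JSON and fail
# if any service is not "Healthy".
check_health() {
  service="$1"; payload="$2"
  # Pull the "status" field out of the health JSON
  status=$(echo "$payload" | sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p')
  if [ "$status" = "Healthy" ]; then
    echo "OK: $service"
    return 0
  fi
  echo "FAIL: $service ($status)"
  return 1
}

all_healthy=0
check_health "Gateway"   '{"status": "Healthy"}' || all_healthy=1
check_health "Ingestion" '{"status": "Healthy"}' || all_healthy=1
check_health "Query"     '{"status": "Healthy"}' || all_healthy=1

if [ "$all_healthy" -ne 0 ]; then
  echo "Smoke tests failed"
  exit 1
fi
echo "All smoke tests passed"
```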
Regression Tests¶
Execution: Nightly (full suite)
Regression Test Suite (SpecFlow/BDD):
// Feature: Audit event ingestion regression tests
[Binding]
public class AuditIngestionSteps
{
private readonly HttpClient _httpClient;
private HttpResponseMessage _response;
private AuditEvent _auditEvent;
public AuditIngestionSteps()
{
_httpClient = new HttpClient
{
BaseAddress = new Uri("https://atp-gateway-test.azurewebsites.net")
};
}
[Given(@"I have a valid audit event for tenant ""(.*)""")]
public void GivenIHaveAValidAuditEvent(string tenantId)
{
_auditEvent = new AuditEvent
{
TenantId = tenantId,
EventType = "UserLogin",
Timestamp = DateTime.UtcNow,
Payload = "{\"userId\": \"test-user\"}"
};
}
[When(@"I submit the event to the ingestion API")]
public async Task WhenISubmitTheEvent()
{
_httpClient.DefaultRequestHeaders.Add("X-Tenant-Id", _auditEvent.TenantId);
var json = JsonSerializer.Serialize(_auditEvent);
var content = new StringContent(json, Encoding.UTF8, "application/json");
_response = await _httpClient.PostAsync("/api/v1/audit/ingest", content);
}
[Then(@"the event should be accepted with status (.*)")]
public void ThenTheEventShouldBeAccepted(int expectedStatus)
{
Assert.Equal(expectedStatus, (int)_response.StatusCode);
}
[Then(@"the response should contain an event ID")]
public async Task ThenTheResponseShouldContainEventId()
{
var responseBody = await _response.Content.ReadAsStringAsync();
var result = JsonSerializer.Deserialize<IngestResponse>(responseBody);
Assert.NotNull(result);
Assert.NotEqual(Guid.Empty, result.EventId);
}
[Then(@"tamper evidence should be generated")]
public async Task ThenTamperEvidenceShouldBeGenerated()
{
var responseBody = await _response.Content.ReadAsStringAsync();
var result = JsonSerializer.Deserialize<IngestResponse>(responseBody);
Assert.NotNull(result.TamperEvidence);
Assert.NotEmpty(result.TamperEvidence.Hash);
Assert.NotEmpty(result.TamperEvidence.Signature);
}
}
Nightly Regression Pipeline:
# Scheduled nightly regression tests
schedules:
- cron: "0 2 * * *" # 2 AM daily
displayName: Nightly Regression Tests
branches:
include:
- main
always: true # Run even if no code changes
stages:
- stage: Regression_Tests
displayName: 'Nightly Regression Suite'
jobs:
- job: Full_Regression
timeoutInMinutes: 60
steps:
- task: DotNetCoreCLI@2
displayName: 'Run Full Regression Suite'
inputs:
command: 'test'
projects: '**/*Tests.Regression.csproj'
arguments: '--configuration Release --filter Category=Regression --logger "trx;LogFileName=regression-results.trx"'
- task: PublishTestResults@2
inputs:
testResultsFormat: 'VSTest'
testResultsFiles: '**/regression-results.trx'
testRunTitle: 'Nightly Regression Tests'
failTaskOnFailedTests: true
# Send results to team
- task: PowerShell@2
condition: always()
displayName: 'Send Test Results Email'
inputs:
targetType: 'inline'
script: |
[xml]$testResults = Get-Content regression-results.trx
$passed = $testResults.TestRun.ResultSummary.Counters.passed
$failed = $testResults.TestRun.ResultSummary.Counters.failed
$subject = if ($failed -eq 0) { "✅ Nightly Regression PASSED" } else { "❌ Nightly Regression FAILED" }
Send-MailMessage `
-From "atp-ci@connectsoft.example" `
-To "dev-team@connectsoft.example" `
-Subject $subject `
-Body "Passed: $passed | Failed: $failed"
API Contract Tests¶
Execution: Every deployment (validate OpenAPI spec)
OpenAPI Spec Validation:
// Validate OpenAPI spec matches implementation
[Fact]
public async Task OpenApiSpec_MatchesImplementation()
{
// Arrange
var httpClient = new HttpClient
{
BaseAddress = new Uri("https://atp-gateway-test.azurewebsites.net")
};
// Act: Fetch generated OpenAPI spec and parse it with Microsoft.OpenApi.Readers
var specJson = await httpClient.GetStringAsync("/swagger/v1/swagger.json");
var spec = new OpenApiStringReader().Read(specJson, out var diagnostic);
// Assert: Validate critical endpoints exist
Assert.True(spec.Paths.ContainsKey("/api/v1/audit/ingest"));
Assert.True(spec.Paths.ContainsKey("/api/v1/audit/events/{eventId}"));
// Validate request/response schemas
var ingestEndpoint = spec.Paths["/api/v1/audit/ingest"].Operations[OperationType.Post];
Assert.NotNull(ingestEndpoint.RequestBody);
Assert.NotNull(ingestEndpoint.Responses["201"]);
}
// Breaking change detection
[Fact]
public async Task OpenApiSpec_NoBreakingChanges()
{
// Arrange: Load previous version OpenAPI spec
var previousSpec = await LoadOpenApiSpecAsync("v1.0.0");
var currentSpec = await FetchCurrentOpenApiSpecAsync();
// Act: Detect breaking changes
var breakingChanges = OpenApiDiff.Compare(previousSpec, currentSpec)
.Where(c => c.IsBreaking)
.ToList();
// Assert: No breaking changes allowed
Assert.Empty(breakingChanges);
}
OpenAPI Diff Tool (CI/CD Pipeline):
# API contract validation stage
- stage: API_Contract_Validation
dependsOn: CD_Test
jobs:
- job: Validate_API_Contract
steps:
- task: Bash@3
displayName: 'Download Previous OpenAPI Spec'
inputs:
targetType: 'inline'
script: |
# Download last known-good spec from Azure Artifacts
az artifacts universal download \
--organization https://dev.azure.com/ConnectSoft \
--feed ConnectSoft-Artifacts \
--name openapi-spec-atp-ingestion \
--version $(PreviousVersion) \
--path ./previous-spec/
- task: Bash@3
displayName: 'Fetch Current OpenAPI Spec'
inputs:
targetType: 'inline'
script: |
mkdir -p ./current-spec
curl -s https://atp-gateway-test.azurewebsites.net/swagger/v1/swagger.json \
-o ./current-spec/swagger.json
- task: Bash@3
displayName: 'Detect Breaking Changes'
inputs:
targetType: 'inline'
script: |
npx openapi-diff \
./previous-spec/swagger.json \
./current-spec/swagger.json \
--format markdown \
--output api-diff-report.md
# Check for breaking changes
if grep -q "Breaking changes detected" api-diff-report.md; then
echo "❌ Breaking API changes detected"
cat api-diff-report.md
exit 1
fi
echo "✅ No breaking API changes"
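The gate above depends on the report format of the `openapi-diff` npm tool. The underlying rule — a path present in the previous spec but absent from the current one is a breaking change — can be sketched with standard shell tools; the inline path lists are illustrative stand-ins for the two `swagger.json` files:

```shell
#!/bin/bash
# Illustrative breaking-change check: any path in the previous spec that is
# missing from the current one is flagged. Real lists would come from
# ./previous-spec/swagger.json and ./current-spec/swagger.json.
cat > previous-paths.txt <<'EOF'
/api/v1/audit/ingest
/api/v1/audit/events/{eventId}
EOF
cat > current-paths.txt <<'EOF'
/api/v1/audit/ingest
/api/v1/audit/events/{eventId}
/api/v1/audit/export
EOF

sort previous-paths.txt -o previous-paths.txt
sort current-paths.txt -o current-paths.txt

# comm -23 prints lines unique to the first file: the removed endpoints
REMOVED=$(comm -23 previous-paths.txt current-paths.txt)
if [ -n "$REMOVED" ]; then
  echo "Breaking change: removed paths:"
  echo "$REMOVED"
  exit 1
fi
echo "No removed paths; additive changes only"
```

With `jq` available, the path lists can be produced directly via `jq -r '.paths | keys[]' swagger.json`.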
Staging Environment Testing¶
Purpose: Production validation with load testing, chaos engineering, security testing, and DR drills before production deployment.
Load Tests¶
Execution: Pre-production deployment (validate scalability)
Apache JMeter Test Plan:
<!-- ATP-LoadTest.jmx -->
<jmeterTestPlan version="1.2">
<hashTree>
<TestPlan guiclass="TestPlanGui" testclass="TestPlan" testname="ATP Ingestion Load Test">
<elementProp name="TestPlan.user_defined_variables" elementType="Arguments">
<collectionProp name="Arguments.arguments">
<elementProp name="TARGET_URL" elementType="Argument">
<stringProp name="Argument.name">TARGET_URL</stringProp>
<stringProp name="Argument.value">atp-gateway-staging.azurewebsites.net</stringProp> <!-- host only; JMeter's HTTPSampler.domain must not include a scheme -->
</elementProp>
<elementProp name="THREAD_COUNT" elementType="Argument">
<stringProp name="Argument.name">THREAD_COUNT</stringProp>
<stringProp name="Argument.value">500</stringProp> <!-- 500 concurrent users -->
</elementProp>
<elementProp name="RAMP_UP_TIME" elementType="Argument">
<stringProp name="Argument.name">RAMP_UP_TIME</stringProp>
<stringProp name="Argument.value">300</stringProp> <!-- 5 minutes ramp-up -->
</elementProp>
<elementProp name="DURATION" elementType="Argument">
<stringProp name="Argument.name">DURATION</stringProp>
<stringProp name="Argument.value">3600</stringProp> <!-- 60 minutes -->
</elementProp>
</collectionProp>
</elementProp>
</TestPlan>
<hashTree>
<!-- Thread Group -->
<ThreadGroup guiclass="ThreadGroupGui" testclass="ThreadGroup" testname="Ingestion Load">
<stringProp name="ThreadGroup.num_threads">${THREAD_COUNT}</stringProp>
<stringProp name="ThreadGroup.ramp_time">${RAMP_UP_TIME}</stringProp>
<stringProp name="ThreadGroup.duration">${DURATION}</stringProp>
<boolProp name="ThreadGroup.scheduler">true</boolProp>
</ThreadGroup>
<hashTree>
<!-- HTTP Request: Ingest Event -->
<HTTPSamplerProxy guiclass="HttpTestSampleGui" testclass="HTTPSamplerProxy" testname="POST Ingest Event">
<stringProp name="HTTPSampler.protocol">https</stringProp>
<stringProp name="HTTPSampler.domain">${TARGET_URL}</stringProp>
<stringProp name="HTTPSampler.path">/api/v1/audit/ingest</stringProp>
<stringProp name="HTTPSampler.method">POST</stringProp>
<boolProp name="HTTPSampler.use_keepalive">true</boolProp>
<boolProp name="HTTPSampler.postBodyRaw">true</boolProp>
<elementProp name="HTTPsampler.Arguments" elementType="Arguments">
<collectionProp name="Arguments.arguments">
<elementProp name="" elementType="HTTPArgument">
<stringProp name="Argument.value">{
"eventType": "UserLogin",
"timestamp": "${__time(yyyy-MM-dd'T'HH:mm:ss'Z')}",
"userId": "load-test-user-${__Random(1,10000)}",
"ipAddress": "203.0.113.${__Random(1,255)}"
}</stringProp>
<stringProp name="Argument.metadata">=</stringProp>
</elementProp>
</collectionProp>
</elementProp>
</HTTPSamplerProxy>
<!-- Response Assertion -->
<ResponseAssertion guiclass="AssertionGui" testclass="ResponseAssertion" testname="Assert 201 Created">
<stringProp name="Assertion.test_field">Assertion.response_code</stringProp>
<stringProp name="Assertion.test_string">201</stringProp>
<intProp name="Assertion.test_type">8</intProp>
</ResponseAssertion>
</hashTree>
</hashTree>
</hashTree>
</jmeterTestPlan>
Run Load Tests (Azure Pipelines):
- stage: Load_Tests_Staging
dependsOn: CD_Staging
jobs:
- job: JMeter_Load_Test
pool:
vmImage: 'ubuntu-latest'
steps:
- task: Bash@3
displayName: 'Install JMeter'
inputs:
targetType: 'inline'
script: |
wget https://archive.apache.org/dist/jmeter/binaries/apache-jmeter-5.6.2.tgz
tar -xzf apache-jmeter-5.6.2.tgz
- task: Bash@3
displayName: 'Run Load Test (500 users, 60 minutes)'
inputs:
targetType: 'inline'
script: |
./apache-jmeter-5.6.2/bin/jmeter \
-n -t ATP-LoadTest.jmx \
-l load-test-results.jtl \
-e -o load-test-report/
- task: PublishBuildArtifacts@1
inputs:
PathtoPublish: 'load-test-report/'
ArtifactName: 'LoadTestReport'
- task: Bash@3
displayName: 'Validate Load Test Results'
inputs:
targetType: 'inline'
script: |
# Parse JMeter results
AVG_RESPONSE_TIME=$(awk -F',' 'NR>1 {sum+=$2; count++} END {print sum/count}' load-test-results.jtl)
# Column 8 of the default JTL layout is the boolean success flag
ERROR_RATE=$(awk -F',' 'NR>1 {if ($8 != "true") errors++; total++} END {print (errors/total)*100}' load-test-results.jtl)
echo "Average Response Time: ${AVG_RESPONSE_TIME}ms"
echo "Error Rate: ${ERROR_RATE}%"
# Validate against thresholds
if (( $(echo "$AVG_RESPONSE_TIME > 1000" | bc -l) )); then
echo "❌ Average response time exceeds 1000ms threshold"
exit 1
fi
if (( $(echo "$ERROR_RATE > 1" | bc -l) )); then
echo "❌ Error rate exceeds 1% threshold"
exit 1
fi
echo "✅ Load test passed all thresholds"
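The validation step above reports only the mean response time, while latency SLOs are usually stated as percentiles. A sketch of a p95 computation over the JTL's `elapsed` column — the inline sample rows are illustrative; a real run would reuse the `load-test-results.jtl` produced by JMeter:

```shell
#!/bin/bash
# Illustrative p95 computation over a JMeter JTL (CSV) file.
# Column 2 is "elapsed" (response time in ms) in the default JTL layout.
cat > load-test-results.jtl <<'EOF'
timeStamp,elapsed,label,responseCode,responseMessage,threadName,dataType,success
1,100,POST Ingest Event,201,Created,t1,text,true
2,200,POST Ingest Event,201,Created,t1,text,true
3,300,POST Ingest Event,201,Created,t1,text,true
4,400,POST Ingest Event,201,Created,t1,text,true
5,900,POST Ingest Event,201,Created,t1,text,true
EOF

# Sort elapsed times, then take the value at the ceil(0.95 * N) position
P95=$(awk -F',' 'NR>1 {print $2}' load-test-results.jtl \
  | sort -n \
  | awk '{v[NR]=$1} END {idx=int(NR*0.95); if (idx<NR*0.95) idx++; print v[idx]}')
echo "P95 latency: ${P95}ms"
```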
Chaos Tests¶
Execution: Weekly (inject failures to validate resilience)
Chaos Mesh Configuration (Kubernetes):
# Network delay chaos experiment
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: network-delay-experiment
namespace: atp-staging
spec:
action: delay
mode: one # Affect one pod at a time
selector:
namespaces:
- atp-staging
labelSelectors:
app: atp-ingestion
delay:
latency: "500ms"
correlation: "50" # 50% of packets delayed
jitter: "100ms"
duration: "5m" # 5-minute experiment
scheduler:
cron: "@weekly" # Run weekly
---
# Pod failure chaos experiment
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
name: pod-failure-experiment
namespace: atp-staging
spec:
action: pod-failure
mode: fixed-percent
value: "10" # Kill 10% of pods
selector:
namespaces:
- atp-staging
labelSelectors:
app: atp-query
duration: "2m"
scheduler:
cron: "@weekly"
---
# Storage I/O chaos experiment
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
name: storage-delay-experiment
namespace: atp-staging
spec:
action: latency
mode: one
selector:
namespaces:
- atp-staging
labelSelectors:
app: atp-ingestion
volumePath: /data
path: /data/**/*
delay: "1s" # 1-second I/O delay
percent: 50 # 50% of I/O operations
duration: "5m"
scheduler:
cron: "@weekly"
Chaos Test Validation:
#!/bin/bash
# run-chaos-tests-staging.sh
echo "Starting chaos engineering tests on Staging..."
# Install Chaos Mesh CLI
curl -sSL https://mirrors.chaos-mesh.org/latest/install.sh | bash
# Apply chaos experiments
kubectl apply -f chaos-experiments/ -n atp-staging
echo "Chaos experiments running for 10 minutes..."
sleep 600
# Monitor metrics during chaos (query Prometheus's HTTP API over a port-forward;
# promtool prints text output, so the API plus jq is easier to script against)
kubectl port-forward -n atp-staging deploy/prometheus-server 9090:9090 &
PF_PID=$!
sleep 5
ERROR_RATE=$(curl -sG http://localhost:9090/api/v1/query \
--data-urlencode 'query=rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100' \
| jq -r '.data.result[0].value[1]')
kill $PF_PID
echo "Error rate during chaos: ${ERROR_RATE}%"
# Validate resilience
if (( $(echo "$ERROR_RATE > 5" | bc -l) )); then
echo "❌ Error rate too high during chaos; system not resilient"
exit 1
fi
echo "✅ Chaos tests passed; system resilient to injected failures"
# Clean up chaos experiments
kubectl delete -f chaos-experiments/ -n atp-staging
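The resilience gate above parses Prometheus's instant-query JSON. A self-contained sketch of that parsing step against a canned response (the 2.5% value is illustrative):

```shell
#!/bin/bash
# Illustrative parse of a Prometheus /api/v1/query instant-vector response.
# The canned JSON stands in for the live query result in the script above.
RESPONSE='{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1700000000,"2.5"]}]}}'

# Extract the sample value (second element of "value") without jq
ERROR_RATE=$(echo "$RESPONSE" | sed -n 's/.*"value":\[[0-9.]*,"\([0-9.]*\)".*/\1/p')
echo "Error rate during chaos: ${ERROR_RATE}%"

# Gate on the 5% resilience budget (awk instead of bc for portability)
if [ "$(awk -v r="$ERROR_RATE" 'BEGIN { if (r > 5) print 1; else print 0 }')" -eq 1 ]; then
  echo "Error rate too high; system not resilient"
  exit 1
fi
echo "Within the 5% resilience budget"
```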
Security Tests¶
Execution: Pre-production deployment (OWASP ZAP scan)
OWASP ZAP Scan (Azure Pipelines):
- stage: Security_Tests_Staging
dependsOn: CD_Staging
jobs:
- job: OWASP_ZAP_Scan
pool:
vmImage: 'ubuntu-latest'
steps:
- task: Bash@3
displayName: 'Run OWASP ZAP Baseline Scan'
inputs:
targetType: 'inline'
script: |
docker run --rm \
-v $(pwd):/zap/wrk/:rw \
-t owasp/zap2docker-stable \
zap-baseline.py \
-t https://atp-gateway-staging.azurewebsites.net \
-r zap-baseline-report.html \
-w zap-baseline-report.md
- task: PublishBuildArtifacts@1
inputs:
PathtoPublish: 'zap-baseline-report.html'
ArtifactName: 'ZAPReport'
- task: Bash@3
displayName: 'Validate ZAP Scan Results'
inputs:
targetType: 'inline'
script: |
# Check for high-risk findings (grep -c prints 0 itself on no match but exits non-zero)
HIGH_RISK=$(grep -c "High (High)" zap-baseline-report.md || true)
if [ "$HIGH_RISK" -gt 0 ]; then
echo "❌ High-risk vulnerabilities detected: $HIGH_RISK"
exit 1
fi
echo "✅ OWASP ZAP scan passed"
DR Drills¶
Execution: Monthly (validate failover procedures)
DR Drill Script (Staging):
#!/bin/bash
# dr-drill-staging.sh
echo "Starting monthly DR drill for Staging..."
START_TIME=$(date +%s)
# Step 1: Simulate primary slot failure (swap to green)
echo "Simulating blue slot failure; failing over to green..."
az webapp deployment slot swap \
--name atp-gateway-staging-eus \
--resource-group ConnectSoft-ATP-Staging-EUS-RG \
--slot blue \
--target-slot production
# Step 2: Validate green slot serving traffic
sleep 30
HEALTH=$(curl -s https://atp-gateway-staging-eus.azurewebsites.net/health | jq -r '.status')
if [ "$HEALTH" != "Healthy" ]; then
echo "❌ DR drill failed; green slot unhealthy"
exit 1
fi
# Step 3: Run full regression suite
dotnet test ConnectSoft.ATP.Tests.Regression.sln \
--configuration Release \
--logger "trx;LogFileName=dr-drill-results.trx"
if [ $? -ne 0 ]; then
echo "❌ Regression tests failed after failover"
exit 1
fi
# Step 4: Calculate RTO
END_TIME=$(date +%s)
RTO_SECONDS=$((END_TIME - START_TIME))
RTO_MINUTES=$((RTO_SECONDS / 60))
echo "✅ DR drill complete"
echo "RTO Achieved: $RTO_MINUTES minutes (Target: 60 minutes)"
# Step 5: Document results
az boards work-item create \
--title "DR Drill Results - Staging - $(date +%Y-%m-%d)" \
--type "Task" \
--description "DR drill completed successfully.\n\nRTO: $RTO_MINUTES minutes\nAll regression tests passed." \
--assigned-to "platform-team@connectsoft.example" \
--fields "Microsoft.VSTS.Common.Priority=3" "System.State=Closed"
# Optional: Swap back to blue (or leave green as production)
echo "Leaving green slot as production (validate for 24 hours)"
Production Environment Testing¶
Purpose: Continuous validation with synthetic monitors, canary deployments, and A/B testing for production confidence.
Synthetic Monitors¶
Execution: Every 5 minutes from multiple regions
Application Insights Availability Test:
// Create multi-region availability test
var availabilityTest = new WebTest("atp-availability-test-prod", new WebTestArgs
{
WebTestName = "atp-ingestion-availability",
ResourceGroupName = sharedResourceGroup.Name,
Location = "eastus",
Kind = "ping",
SyntheticMonitorId = "atp-ingestion-prod-monitor",
Enabled = true,
Frequency = 300, // 5 minutes
Timeout = 30,
RetryEnabled = true,
Locations = new[]
{
new WebTestGeolocationArgs { Location = "us-va-ash-azr" }, // East US
new WebTestGeolocationArgs { Location = "emea-nl-ams-azr" }, // West Europe
new WebTestGeolocationArgs { Location = "apac-sg-sin-azr" }, // Southeast Asia
new WebTestGeolocationArgs { Location = "us-ca-sjc-azr" }, // West US
new WebTestGeolocationArgs { Location = "emea-gb-db3-azr" } // UK South
},
Configuration = new WebTestPropertiesConfigurationArgs
{
WebTest = @"
<WebTest Name='ATP Ingestion Health Check' Enabled='True'>
<Items>
<Request Method='GET' Version='1.1' Url='https://api.atp.connectsoft.com/health' ThinkTime='0'>
<ValidationRules>
<ValidationRule Classname='Microsoft.VisualStudio.TestTools.WebTesting.Rules.ValidationRuleResponseCode' />
<ValidationRule Classname='Microsoft.VisualStudio.TestTools.WebTesting.Rules.ValidationRuleExpectedText'>
<RuleParameters>
<RuleParameter Name='ExpectedText' Value='Healthy' />
</RuleParameters>
</ValidationRule>
</ValidationRules>
</Request>
</Items>
</WebTest>
"
},
Tags = prodTags
});
// Alert when availability drops below 99%
var availabilityAlert = new MetricAlert("atp-availability-alert-prod", new MetricAlertArgs
{
AlertRuleName = "atp-availability-prod",
ResourceGroupName = sharedResourceGroup.Name,
Location = "global",
Description = "Alert when ATP availability drops below 99%",
Severity = 1,
Enabled = true,
Scopes = new[] { availabilityTest.Id },
EvaluationFrequency = "PT5M",
WindowSize = "PT15M",
Criteria = new MetricAlertMultipleResourceMultipleMetricCriteriaArgs
{
OdataType = "Microsoft.Azure.Monitor.MultipleResourceMultipleMetricCriteria",
AllOf = new[]
{
new MetricCriteriaArgs
{
Name = "AvailabilityPercentage",
MetricName = "availabilityResults/availabilityPercentage",
Operator = "LessThan",
Threshold = 99,
TimeAggregation = "Average"
}
}
}
});
Custom Synthetic Monitor (Azure Function):
// Advanced synthetic monitor with full workflow validation
[FunctionName("SyntheticMonitor")]
public async Task RunAsync(
[TimerTrigger("0 */5 * * * *")] TimerInfo timer, // Every 5 minutes
ILogger log)
{
log.LogInformation("Starting synthetic monitor workflow...");
var startTime = DateTime.UtcNow;
var httpClient = new HttpClient
{
BaseAddress = new Uri("https://api.atp.connectsoft.com")
};
try
{
// Step 1: Health check
var healthResponse = await httpClient.GetAsync("/health");
healthResponse.EnsureSuccessStatusCode();
// Step 2: Ingest synthetic event
var auditEvent = new
{
eventType = "SyntheticMonitor",
timestamp = DateTime.UtcNow,
userId = "synthetic-monitor",
metadata = new { source = "synthetic-monitor", region = Environment.GetEnvironmentVariable("REGION") }
};
httpClient.DefaultRequestHeaders.Add("X-Tenant-Id", "synthetic-tenant-001");
var ingestResponse = await httpClient.PostAsJsonAsync("/api/v1/audit/ingest", auditEvent);
ingestResponse.EnsureSuccessStatusCode();
var ingestResult = await ingestResponse.Content.ReadFromJsonAsync<IngestResponse>();
// Step 3: Query event (validate query service)
await Task.Delay(TimeSpan.FromSeconds(5)); // Allow indexing
var queryResponse = await httpClient.GetAsync($"/api/v1/audit/events/{ingestResult.EventId}");
queryResponse.EnsureSuccessStatusCode();
// Step 4: Validate tamper evidence (throw so the failure path below is tracked)
var queriedEvent = await queryResponse.Content.ReadFromJsonAsync<AuditEvent>();
if (queriedEvent?.TamperEvidence is null)
throw new InvalidOperationException("Synthetic event is missing tamper evidence");
var elapsed = DateTime.UtcNow - startTime;
// Track success metrics
_telemetry.TrackMetric("SyntheticMonitor.Duration", elapsed.TotalMilliseconds);
_telemetry.TrackEvent("SyntheticMonitor.Success", new Dictionary<string, string>
{
["Region"] = Environment.GetEnvironmentVariable("REGION"),
["Timestamp"] = DateTime.UtcNow.ToString("o")
});
log.LogInformation($"✅ Synthetic monitor passed in {elapsed.TotalMilliseconds}ms");
}
catch (Exception ex)
{
log.LogError(ex, "❌ Synthetic monitor failed");
// Track failure
_telemetry.TrackException(ex);
_telemetry.TrackEvent("SyntheticMonitor.Failure", new Dictionary<string, string>
{
["Region"] = Environment.GetEnvironmentVariable("REGION"),
["ErrorMessage"] = ex.Message
});
throw;
}
}
Canary Tests¶
Execution: Production deployments (10% rollout with 24-hour validation)
Canary Deployment Strategy (Azure Pipelines):
- stage: CD_Production_Canary
displayName: 'Deploy to Production (Canary)'
dependsOn: CD_Staging
jobs:
- deployment: CanaryDeployment
environment: ATP-Production
strategy:
canary:
increments: [10, 25, 50] # 10% → 25% → 50% → 100%
preDeploy:
steps:
- script: echo "Validating staging has been stable for 48 hours..."
- task: PowerShell@2
inputs:
targetType: 'inline'
script: |
# Check for incidents in last 48 hours (PowerShell continues lines with backticks, not backslashes)
$incidents = az boards work-item query `
--wiql "SELECT [System.Id] FROM WorkItems WHERE [System.WorkItemType] = 'Incident' AND [System.State] = 'Active' AND [System.CreatedDate] > @Today - 2" `
--output json | ConvertFrom-Json
if ($incidents.workItems.Count -gt 0) {
Write-Error "Active incidents detected; aborting canary deployment"
exit 1
}
deploy:
steps:
- template: deploy/deploy-microservice-to-azure-web-site.yaml@templates
parameters:
azureSubscription: $(azureSubscription)
appName: atp-ingestion-prod
package: $(Pipeline.Workspace)/drop/*.zip
trafficPercentage: $(strategy.increment) # Canary percentage
routeTraffic:
steps:
- script: echo "Routing $(strategy.increment)% traffic to canary version"
postRouteTraffic:
steps:
# Monitor for 24 hours at each increment
- task: PowerShell@2
displayName: 'Monitor Canary Metrics (24 hours)'
inputs:
targetType: 'inline'
script: |
$monitorDuration = if ($(strategy.increment) -eq 10) { 1440 } else { 720 } # 24h for 10%, 12h for others
Write-Host "Monitoring canary for $monitorDuration minutes..."
Start-Sleep -Seconds ($monitorDuration * 60)
# Query Application Insights for canary metrics (backtick continuations; cast to
# [double] so the threshold comparisons are numeric rather than string-based)
$errorRate = [double](az monitor app-insights metrics show `
--app atp-appinsights-prod-eus `
--metric "requests/failed" `
--aggregation avg `
--offset ${monitorDuration}m `
--query "value.segments[0]['requests/failed'].avg" -o tsv)
$p95Latency = [double](az monitor app-insights metrics show `
--app atp-appinsights-prod-eus `
--metric "requests/duration" `
--aggregation percentile `
--interval PT1H `
--offset ${monitorDuration}m `
--query "value.segments[0]['requests/duration'].percentiles.95" -o tsv)
Write-Host "Error Rate: $errorRate%"
Write-Host "P95 Latency: ${p95Latency}ms"
# Validate against thresholds
if ($errorRate -gt 1) { # >1% error rate
Write-Error "Error rate too high: $errorRate%"
exit 1
}
if ($p95Latency -gt 2000) { # >2s p95 latency
Write-Error "Latency too high: ${p95Latency}ms"
exit 1
}
Write-Host "✅ Canary metrics healthy at $(strategy.increment)%"
on:
failure:
steps:
- script: echo "🔴 Canary deployment failed; rolling back..."
- task: AzureAppServiceManage@0
inputs:
azureSubscription: $(azureSubscription)
action: 'Swap Slots'
webAppName: 'atp-ingestion-prod'
sourceSlot: 'production'
targetSlot: 'canary'
- task: PowerShell@2
inputs:
targetType: 'inline'
script: |
# Create incident for failed canary
az boards work-item create \
  --title "Canary Deployment Failed - $(Build.BuildNumber)" \
  --type "Incident" \
  --description "Canary deployment rolled back due to failed metrics. Build: $(Build.BuildNumber)" \
  --assigned-to "platform-team@connectsoft.example" \
  --fields "Microsoft.VSTS.Common.Priority=1"
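The gate applied at each canary increment boils down to a simple threshold check. The sketch below restates that logic as a standalone function, using the same thresholds as the pipeline (1% failed-request fraction, 2000 ms p95); the function name and signature are illustrative, not ATP's actual implementation.

```python
# Illustrative canary health gate mirroring the pipeline thresholds above.
def canary_gate(error_fraction: float, p95_latency_ms: float,
                max_error_fraction: float = 0.01,
                max_p95_ms: float = 2000.0) -> tuple[bool, str]:
    """Return (healthy, reason) for one canary increment."""
    if error_fraction > max_error_fraction:
        return False, f"error fraction too high: {error_fraction:.4f}"
    if p95_latency_ms > max_p95_ms:
        return False, f"p95 latency too high: {p95_latency_ms:.0f} ms"
    return True, "canary metrics healthy"
```

A failing check maps to the `on: failure:` branch of the deployment job (slot swap back plus incident creation).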
A/B Tests¶
Execution: Feature flag validation with statistical analysis
A/B Test Configuration:
// A/B test: New export format (JSON vs Parquet)
public class ExportFormatABTest
{
    private readonly IFeatureManager _featureManager;
    private readonly TelemetryClient _telemetry;

    public ExportFormatABTest(IFeatureManager featureManager, TelemetryClient telemetry)
    {
        _featureManager = featureManager;
        _telemetry = telemetry;
    }
public async Task<ExportResult> ExportAuditEventsAsync(ExportRequest request)
{
var startTime = DateTime.UtcNow;
// Determine variant based on feature flag (50/50 split)
var useParquet = await _featureManager.IsEnabledAsync("ExportFormat_Parquet");
ExportResult result;
if (useParquet)
{
// Variant A: Parquet format
result = await ExportAsParquetAsync(request);
_telemetry.TrackEvent("ABTest.ExportFormat", new Dictionary<string, string>
{
["Variant"] = "Parquet",
["TenantId"] = request.TenantId,
["EventCount"] = request.EventCount.ToString(),
["Duration"] = (DateTime.UtcNow - startTime).TotalMilliseconds.ToString(),
["FileSize"] = result.FileSizeBytes.ToString()
});
}
else
{
// Variant B: JSON format (control)
result = await ExportAsJsonAsync(request);
_telemetry.TrackEvent("ABTest.ExportFormat", new Dictionary<string, string>
{
["Variant"] = "JSON",
["TenantId"] = request.TenantId,
["EventCount"] = request.EventCount.ToString(),
["Duration"] = (DateTime.UtcNow - startTime).TotalMilliseconds.ToString(),
["FileSize"] = result.FileSizeBytes.ToString()
});
}
return result;
}
}
A/B Test Analysis (Application Insights):
// Compare A/B test variants
customEvents
| where name == "ABTest.ExportFormat"
| where timestamp > ago(7d)
| extend Variant = tostring(customDimensions.Variant)
| extend EventCount = toint(customDimensions.EventCount)
| extend Duration = todouble(customDimensions.Duration)
| extend FileSize = todouble(customDimensions.FileSize)
| summarize
TotalExports = count(),
AvgDuration = avg(Duration),
P50Duration = percentile(Duration, 50),
P95Duration = percentile(Duration, 95),
P99Duration = percentile(Duration, 99),
    AvgFileSize = avg(FileSize),
    AvgBytesPerEvent = avg(FileSize) / avg(EventCount) // average event size, not a true compression ratio
by Variant
| extend Winner = case(
Variant == "Parquet" and P95Duration < 1000 and AvgFileSize < 50000000, "Parquet",
Variant == "JSON" and P95Duration < 1000, "JSON",
"Inconclusive"
)
| project Variant, TotalExports, AvgDuration, P95Duration, AvgFileSize, Winner
A/B Test Statistical Significance:
// Calculate statistical significance of A/B test
public class ABTestAnalyzer
{
public ABTestResult AnalyzeTest(List<ABTestSample> variantA, List<ABTestSample> variantB)
{
// Calculate means
var meanA = variantA.Average(s => s.Duration);
var meanB = variantB.Average(s => s.Duration);
// Calculate standard deviations
var stdDevA = Math.Sqrt(variantA.Average(s => Math.Pow(s.Duration - meanA, 2)));
var stdDevB = Math.Sqrt(variantB.Average(s => Math.Pow(s.Duration - meanB, 2)));
// T-test (independent samples)
var tStatistic = (meanA - meanB) / Math.Sqrt((stdDevA * stdDevA / variantA.Count) + (stdDevB * stdDevB / variantB.Count));
var degreesOfFreedom = variantA.Count + variantB.Count - 2;
// P-value (simplified; use proper statistical library in production)
var pValue = CalculatePValue(tStatistic, degreesOfFreedom);
return new ABTestResult
{
VariantAMean = meanA,
VariantBMean = meanB,
DifferencePercent = ((meanB - meanA) / meanA) * 100,
PValue = pValue,
IsSignificant = pValue < 0.05, // 95% confidence
Recommendation = pValue < 0.05
? (meanA < meanB ? "Variant A is significantly faster" : "Variant B is significantly faster")
: "No significant difference; choose based on other criteria"
};
}
}
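As the comment in `AnalyzeTest` notes, `CalculatePValue` is a placeholder. As a cross-check of the same arithmetic, the sketch below computes the Welch t-statistic with sample variances and approximates the two-sided p-value with the normal distribution, which is reasonable for large samples; a production analysis should use a statistics library's Student-t CDF instead.

```python
import math

# Welch's t-test with a normal-approximation p-value (illustrative cross-check
# of the simplified C# analyzer above; not a substitute for a stats library).
def welch_t_test(a: list[float], b: list[float]) -> tuple[float, float]:
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    var_a = sum((x - mean_a) ** 2 for x in a) / (len(a) - 1)  # sample variance
    var_b = sum((x - mean_b) ** 2 for x in b) / (len(b) - 1)
    t = (mean_a - mean_b) / math.sqrt(var_a / len(a) + var_b / len(b))
    p = math.erfc(abs(t) / math.sqrt(2))  # two-sided, normal approximation
    return t, p
```

With the usual 0.05 significance threshold, `p < 0.05` corresponds to the `IsSignificant` flag in the C# result type.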
Test Automation Infrastructure¶
Test Execution Summary (By Environment):
| Environment | Test Type | Frequency | Duration | Pass Threshold | Retry Policy |
|---|---|---|---|---|---|
| Dev | Unit | Every commit | < 2 min | 100% | No retry |
| Dev | Integration | Every commit | < 5 min | 100% | No retry |
| Test | Smoke | Post-deployment | < 2 min | 100% | 3 retries |
| Test | Regression | Nightly | < 30 min | 100% | No retry (investigate failures) |
| Test | API Contract | Every deployment | < 5 min | 100% (no breaking changes) | No retry |
| Staging | Load | Pre-production | 60 min | <1% error, <1s p95 latency | No retry |
| Staging | Chaos | Weekly | 10 min | <5% error during failures | No retry |
| Staging | Security | Pre-production | 20 min | Zero high-risk findings | No retry |
| Staging | DR Drill | Monthly | < 60 min | 100% pass + RTO met | No retry |
| Production | Synthetic | Every 5 min | < 30 sec | 99% availability | 2 retries |
| Production | Canary | Per deployment | 24-72 hours | <0.5% error, <1.5x baseline latency | Auto-rollback on failure |
Test Data Management¶
Dev/Test (Synthetic Data):
// Synthetic test data generator
public class SyntheticDataGenerator
{
private readonly Faker<AuditEvent> _auditEventFaker;
public SyntheticDataGenerator()
{
_auditEventFaker = new Faker<AuditEvent>()
.RuleFor(e => e.TenantId, f => $"tenant-{f.Random.Number(1, 100)}")
.RuleFor(e => e.EventType, f => f.PickRandom("UserLogin", "UserLogout", "DocumentAccess", "SettingChanged"))
.RuleFor(e => e.Timestamp, f => f.Date.Recent(7))
.RuleFor(e => e.UserId, f => $"user-{f.Random.Guid()}")
.RuleFor(e => e.IPAddress, f => f.Internet.Ip())
.RuleFor(e => e.Payload, f => JsonSerializer.Serialize(new
{
action = f.PickRandom("read", "write", "delete"),
resourceId = f.Random.Guid(),
success = f.Random.Bool(0.95f) // 95% success rate
}));
}
public List<AuditEvent> Generate(int count) => _auditEventFaker.Generate(count);
}
// Usage in tests
[Fact]
public async Task LoadTest_1000Events_CompletesWithinSLA()
{
// Arrange
var generator = new SyntheticDataGenerator();
var events = generator.Generate(1000);
// Act
var stopwatch = Stopwatch.StartNew();
foreach (var evt in events)
{
await _ingestionService.IngestEventAsync(evt);
}
stopwatch.Stop();
// Assert
Assert.True(stopwatch.Elapsed < TimeSpan.FromMinutes(5), "Load test exceeded 5-minute SLA");
}
Staging (Anonymized Production Data):
# Anonymize production snapshot for staging
./anonymize-production-data.sh \
--source atp-sql-prod-eus \
--destination atp-sql-staging-eus \
--pii-fields "UserId,IPAddress,ContactEmail" \
--hash-seed "staging-anonymization-2025"
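The `--hash-seed` flag implies deterministic pseudonymization: the same input and seed always produce the same token, so joins across anonymized tables still line up. The sketch below shows one way to implement that with a keyed hash; the function name and `anon-` token format are illustrative, not what the script actually emits.

```python
import hashlib
import hmac

# Deterministic, seeded pseudonymization sketch: identical (value, seed) pairs
# always map to the same token, preserving referential integrity in staging data.
def pseudonymize(value: str, seed: str, length: int = 16) -> str:
    digest = hmac.new(seed.encode(), value.encode(), hashlib.sha256).hexdigest()
    return f"anon-{digest[:length]}"
```

Rotating the seed each refresh cycle (as the `staging-anonymization-2025` value suggests) prevents tokens from being correlated across snapshots.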
Summary¶
- Dev Environment Tests: Unit tests (every commit, <2min, 100% pass), integration tests (service containers, <5min), local E2E (Postman, developer-run).
- Test Environment Tests: Smoke tests (post-deployment, critical paths), regression suite (nightly, 500+ tests, <30min), API contract tests (OpenAPI validation, breaking change detection).
- Staging Environment Tests: Load tests (JMeter, 500 users, 60min, 50% prod scale), chaos tests (Chaos Mesh, weekly, network/pod/storage failures), security tests (OWASP ZAP, pre-production), DR drills (monthly, blue-green failover, <60min RTO).
- Production Environment Tests: Synthetic monitors (multi-region, every 5min, full workflow validation), canary tests (10%→25%→50%→100%, 24-72h validation), A/B tests (feature flags, statistical analysis).
- Test Pyramid: Unit (1000+ tests) → Integration (100+ tests) → E2E (10+ flows) → Load/Chaos → Synthetic monitoring.
- Pass Thresholds: 100% (unit/integration/regression), <1% error rate (load), <5% error during chaos, 99% availability (synthetic).
- Automation: Pre-push hooks (unit tests), CI/CD pipelines (all tests), nightly schedules (regression), weekly schedules (chaos), continuous (synthetic).
Governance & Continuous Improvement¶
ATP implements structured governance and continuous improvement frameworks to ensure environments remain cost-efficient, secure, and aligned with business objectives. Regular reviews (monthly cost, quarterly security, annual refresh) combined with a formal Change Advisory Board ensure controlled evolution of the platform.
This approach balances operational stability (CAB-approved production changes) with innovation (roadmap-driven improvements) while maintaining accountability through documented reviews and measurable improvement metrics.
Environment Review Cadence¶
ATP conducts tiered reviews at monthly, quarterly, and annual intervals to maintain environment health and optimize operations.
Monthly Review: Cost & Utilization¶
Purpose: Optimize spending and identify underutilized resources for rightsizing or decommissioning.
Attendees: Platform Team Lead, Finance Representative, Engineering Manager
Agenda:
monthlyReview:
agenda:
1. Cost Analysis (30 minutes)
- Actual vs. budgeted spend per environment
- Cost anomalies and trend analysis
- Reserved instance utilization
- Storage lifecycle effectiveness
2. Resource Utilization (20 minutes)
- Compute utilization (CPU, memory)
- Database DTU/RU consumption
- Storage growth trends
- Network bandwidth utilization
3. Scaling Policy Review (15 minutes)
- Autoscaling trigger effectiveness
- Over-provisioned resources
- Under-provisioned resources
4. Action Items (15 minutes)
- Rightsizing recommendations
- Cost optimization opportunities
- Resource decommissioning
duration: 80 minutes
artifacts:
- Monthly cost report (generated by automation)
- Resource utilization dashboard
- Action items list (Azure DevOps work items)
Monthly Cost Review Report (Automated):
// Generate monthly cost and utilization report
[FunctionName("MonthlyEnvironmentReview")]
public async Task RunAsync(
[TimerTrigger("0 0 9 1 * *")] TimerInfo timer, // 1st of month at 9 AM
ILogger log)
{
log.LogInformation("Generating monthly environment review report...");
var reportDate = DateTime.UtcNow.AddMonths(-1); // Previous month
var reportMonth = reportDate.ToString("MMMM yyyy");
var report = new MonthlyEnvironmentReport
{
Month = reportMonth,
GeneratedAt = DateTime.UtcNow
};
// Cost Analysis
report.CostAnalysis = await GenerateCostAnalysisAsync(reportDate);
// Resource Utilization
report.Utilization = new UtilizationReport
{
Dev = await AnalyzeEnvironmentUtilizationAsync("dev"),
Test = await AnalyzeEnvironmentUtilizationAsync("test"),
Staging = await AnalyzeEnvironmentUtilizationAsync("staging"),
Production = await AnalyzeEnvironmentUtilizationAsync("prod")
};
// Rightsizing Recommendations
report.Recommendations = await GenerateRightsizingRecommendationsAsync();
// Generate PDF report
var pdf = await GeneratePdfAsync(report);
// Upload to Blob Storage
await UploadReportAsync($"monthly-reviews/{reportMonth}/Environment-Review-{reportMonth}.pdf", pdf);
// Send to stakeholders
await SendReportAsync(pdf, new[]
{
"platform-team@connectsoft.example",
"finance@connectsoft.example",
"engineering-manager@connectsoft.example"
});
log.LogInformation($"✅ Monthly review report generated and sent for {reportMonth}");
}
private async Task<EnvironmentUtilization> AnalyzeEnvironmentUtilizationAsync(string environment)
{
var metrics = await QueryAzureMonitorAsync(environment);
return new EnvironmentUtilization
{
Environment = environment,
// Compute utilization
AvgCpuUtilization = metrics.AvgCpu,
AvgMemoryUtilization = metrics.AvgMemory,
PeakCpuUtilization = metrics.PeakCpu,
// Database utilization
AvgDtuUtilization = metrics.AvgDtu,
AvgStorageUtilization = metrics.AvgStorage,
// Recommendations
ComputeRecommendation = metrics.AvgCpu < 30
? "Consider downsizing SKU (avg CPU < 30%)"
: metrics.AvgCpu > 80
? "Consider upsizing SKU (avg CPU > 80%)"
: "Optimal sizing",
EstimatedMonthlySavings = metrics.AvgCpu < 30
? CalculatePotentialSavings(environment, "downsize")
: 0
};
}
Cost Review KQL Query:
// Month-over-month cost comparison
// (pivot produces columns named after the month values, so a prev()-based
// comparison over ordered rows is more robust)
AzureCostManagement
| where TimeGenerated >= startofmonth(ago(60d))
| extend Environment = tostring(Tags["Environment"])
| extend Month = startofmonth(TimeGenerated)
| summarize MonthlyCost = sum(Cost) by Environment, Month
| order by Environment asc, Month asc
| extend PrevCost = prev(MonthlyCost), PrevEnv = prev(Environment)
| where Environment == PrevEnv // compare consecutive months within the same environment
| extend MoM_Change = (MonthlyCost - PrevCost) / PrevCost * 100 // month-over-month change %
| project Environment, PreviousMonth = PrevCost, CurrentMonth = MonthlyCost, MoM_Change
| order by MoM_Change desc
Quarterly Review: Security & Compliance¶
Purpose: Validate security posture, audit access controls, and test disaster recovery procedures.
Attendees: CISO, Platform Team Lead, Security Engineer, Compliance Officer
Agenda:
quarterlyReview:
agenda:
1. Security Posture Assessment (45 minutes)
- Vulnerability scan results (open findings)
- Penetration test findings (remediation status)
- Azure Defender recommendations
- Security incident retrospective
2. Access Control Audit (30 minutes)
- Access review findings (stale permissions)
- PIM usage analytics (elevation frequency)
- Service principal inventory
- Break-glass account validation
3. Compliance Status (30 minutes)
- SOC 2 control effectiveness
- GDPR compliance scorecard
- HIPAA safeguard validation
- Policy compliance state
4. DR Drill Review (30 minutes)
- DR test results (last quarter)
- RTO/RPO achievement rate
- Failover success rate
- Lessons learned
5. Action Items (15 minutes)
- Security remediation tasks
- Access revocation list
- DR procedure improvements
duration: 150 minutes (2.5 hours)
artifacts:
- Security posture report
- Access audit findings
- Compliance scorecard
- DR test summary
- Action items (Azure DevOps backlog)
Quarterly Security Review Report:
// Generate quarterly security posture report
[FunctionName("QuarterlySecurityReview")]
public async Task RunAsync(
[TimerTrigger("0 0 9 1 */3 *")] TimerInfo timer, // Quarterly on 1st at 9 AM
ILogger log)
{
log.LogInformation("Generating quarterly security review report...");
var reportDate = DateTime.UtcNow.AddMonths(-1); // anchor on the previous month so the January run reports Q4 of the prior year
var quarter = $"Q{(reportDate.Month - 1) / 3 + 1}-{reportDate.Year}";
var report = new QuarterlySecurityReport
{
Quarter = quarter,
GeneratedAt = DateTime.UtcNow,
Scope = "All ATP Environments"
};
// Security Findings Summary
report.VulnerabilityFindings = await GetVulnerabilitySummaryAsync();
report.PentestFindings = await GetPentestFindingsAsync(quarter);
report.DefenderRecommendations = await GetDefenderRecommendationsAsync();
// Access Control Audit
report.AccessAudit = new AccessAuditReport
{
TotalUsers = await CountActiveUsersAsync(),
StalePermissions = await IdentifyStalePermissionsAsync(),
PimElevations = await GetPimElevationStatsAsync(quarter),
ServicePrincipals = await AuditServicePrincipalsAsync()
};
// Compliance Status (evaluate each framework once and reuse the scores)
var soc2Score = await EvaluateSOC2ComplianceAsync();
var gdprScore = await EvaluateGDPRComplianceAsync();
var hipaaScore = await EvaluateHIPAAComplianceAsync();
report.ComplianceStatus = new ComplianceStatusReport
{
    SOC2 = soc2Score,
    GDPR = gdprScore,
    HIPAA = hipaaScore,
    OverallScore = (soc2Score + gdprScore + hipaaScore) / 3
};
// DR Test Results
report.DRTestResults = await GetDRTestResultsAsync(quarter);
// Generate PDF
var pdf = await GeneratePdfAsync(report);
// Upload and distribute
await UploadReportAsync($"quarterly-security-reviews/{quarter}/Security-Review-{quarter}.pdf", pdf);
await SendReportAsync(pdf, new[]
{
"ciso@connectsoft.example",
"platform-team@connectsoft.example",
"compliance@connectsoft.example"
});
log.LogInformation($"✅ Quarterly security review report generated for {quarter}");
}
Access Audit Query (Stale Permissions):
// Identify users with stale access (no successful activity in 90+ days)
// Look back further than the staleness threshold: a 90d window can never surface
// a last activity older than 90 days. Accounts with zero events in the window
// require a separate identity-based query.
AzureActivity
| where TimeGenerated > ago(180d)
| where ActivityStatusValue == "Success"
| extend User = Caller
| summarize LastActivity = max(TimeGenerated) by User
| join kind=leftouter (
IdentityInfo
| where TimeGenerated > ago(1d)
| distinct AccountUPN, AccountObjectId
) on $left.User == $right.AccountUPN
| extend DaysSinceLastActivity = datetime_diff('day', now(), LastActivity)
| where DaysSinceLastActivity > 90
| project User, LastActivity, DaysSinceLastActivity, Recommendation = "Revoke Access"
| order by DaysSinceLastActivity desc
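The staleness rule in the query above is just "last successful activity older than the threshold". Restated as a standalone sketch (function name and input shape are illustrative), which also makes the window requirement explicit — the activity data you feed in must cover more history than the threshold:

```python
from datetime import datetime, timedelta

# Flag accounts whose last successful activity is older than `threshold_days`.
# The activity lookback must exceed the threshold, or nothing can ever be flagged.
def stale_accounts(last_activity: dict[str, datetime],
                   now: datetime, threshold_days: int = 90) -> list[str]:
    cutoff = now - timedelta(days=threshold_days)
    return sorted(u for u, ts in last_activity.items() if ts < cutoff)
```

Flagged accounts feed the "Revoke Access" recommendation in the quarterly access audit.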
Annual Review: Full Environment Refresh¶
Purpose: Comprehensive refresh of infrastructure, cost optimization, and technology stack updates.
Attendees: CTO, Platform Team, Security Team, Finance, Product Management
Agenda:
annualReview:
agenda:
1. Infrastructure Assessment (60 minutes)
- IaC codebase review (Pulumi/Bicep)
- Azure service updates (new capabilities)
- SKU tier optimization
- Region expansion evaluation
2. Cost Optimization Deep Dive (60 minutes)
- Annual spend analysis (trend vs. forecast)
- Reserved instance renewal strategy
- Commitment-based discounts (3-year options)
- Alternative service evaluation (cost reduction)
3. Security & Compliance Evolution (45 minutes)
- New regulatory requirements (upcoming)
- Security technology upgrades (TLS 1.3 adoption, post-quantum cryptography)
- Zero-trust maturity assessment
- Threat landscape changes
4. Technology Roadmap Review (45 minutes)
- Platform modernization opportunities
- Multi-cloud strategy evaluation
- Kubernetes version upgrades
- Service mesh evaluation (Istio, Linkerd)
5. Strategic Initiatives (30 minutes)
- Q1-Q4 roadmap alignment
- Budget allocation for improvements
- Team capacity planning
- Vendor evaluations
duration: 240 minutes (4 hours, typically full-day workshop)
artifacts:
- Annual infrastructure report
- Cost optimization plan
- Security roadmap
- Strategic initiatives backlog
- Budget proposal for next fiscal year
Annual IaC Refresh Procedure:
#!/bin/bash
# annual-iac-refresh.sh
echo "Starting annual IaC refresh for all environments..."
ENVIRONMENTS=("dev" "test" "staging" "prod")
for ENV in "${ENVIRONMENTS[@]}"; do
echo "Refreshing $ENV environment IaC..."
  # Step 1: Update Pulumi dependencies (omit --version to pull the latest stable release)
  cd infrastructure/
  dotnet add package Pulumi.AzureNative
  dotnet add package Pulumi.Azure
# Step 2: Validate against latest Azure Policy
pulumi preview --stack atp-$ENV-eus --diff
# Step 3: Generate cost estimate for refresh
pulumi preview --stack atp-$ENV-eus --json > preview-$ENV.json
# Step 4: Create PR for review
git checkout -b "annual-refresh-$ENV-2025"
git add infrastructure/
git commit -m "chore: Annual IaC refresh for $ENV environment"
git push origin "annual-refresh-$ENV-2025"
# Create PR
az repos pr create \
--title "Annual IaC Refresh - $ENV Environment" \
--description "Annual infrastructure refresh with latest Pulumi packages and Azure best practices.\n\nCost Impact: See preview-$ENV.json" \
--source-branch "annual-refresh-$ENV-2025" \
--target-branch "main" \
--reviewers "platform-team@connectsoft.example"
cd ..
done
echo "✅ Annual IaC refresh PRs created for all environments"
Change Advisory Board (CAB)¶
ATP's Change Advisory Board provides governance oversight for production changes, ensuring risk mitigation, communication planning, and rollback readiness.
CAB Composition¶
cabMembers:
core:
- role: Chair
title: Lead Architect
responsibilities:
- Facilitate CAB meetings
- Final approval authority
- Escalation point for conflicts
- role: Technical Reviewer
title: SRE Lead
responsibilities:
- Assess technical risk
- Validate rollback procedures
- Review deployment strategy
- role: Security Reviewer
title: Security Officer
responsibilities:
- Security impact assessment
- Compliance validation
- Vulnerability review
- role: Product Representative
title: Product Owner
responsibilities:
- Business impact assessment
- Stakeholder communication
- Change prioritization
optional:
- Database Administrator (for schema changes)
- Network Engineer (for network changes)
- Compliance Officer (for regulatory changes)
- Customer Success (for tenant-impacting changes)
CAB Meeting Cadence:
cabSchedule:
regular:
frequency: Weekly (every Wednesday 2 PM ET)
duration: 60 minutes
type: Routine production changes
approvalThreshold: 2 core members (Architect + SRE or Security)
emergency:
frequency: On-demand (within 2 hours)
duration: 30 minutes
type: Hotfix/P0 incident remediation
approvalThreshold: 2 core members (expedited review)
majorChange:
frequency: Ad-hoc (scheduled 2 weeks in advance)
duration: 120 minutes
type: Major architecture changes, multi-service deployments
approvalThreshold: All 4 core members + optional reviewers
Change Request Process¶
Change Ticket Template (Azure DevOps):
# Change Request Work Item Template
workItemType: Change Request
requiredFields:
- title: Short description of change
- description: |
## Change Summary
Brief description of what is being changed and why.
## Business Justification
Why is this change necessary? What business value does it provide?
## Technical Details
- Affected services: [List services]
- Infrastructure changes: [Yes/No]
- Database schema changes: [Yes/No]
- API breaking changes: [Yes/No]
## Risk Assessment
- Risk level: [Low/Medium/High/Critical]
- Potential impact: [Description]
- Affected tenants: [All/Specific/None]
## Rollback Plan
- Rollback procedure: [Detailed steps]
- Rollback RTO: [Estimated time]
- Rollback validation: [How to verify rollback success]
## Communication Plan
- Status page update: [Yes/No]
- Tenant notification: [Yes/No]
- Maintenance window: [Date/Time or Zero-downtime]
## Testing Evidence
- Staging validation: [Link to test results]
- DR drill: [Link to drill report]
- Security scan: [Link to scan results]
- assignedTo: Requester (engineer submitting change)
- scheduledDate: Proposed deployment date/time
- changeType: [Standard/Emergency/Major]
- priority: [1-4]
linkedItems:
- Epic/Feature: Parent feature this change supports
- Test Results: Staging validation results
- Security Scan: Latest security scan report
- DR Drill: Most recent DR drill results
approvalWorkflow:
step1:
approver: SRE Lead
criteria: Technical feasibility, rollback plan validated
step2:
approver: Security Officer
criteria: Security impact assessed, compliance validated
step3:
approver: Lead Architect (final approval)
criteria: Overall risk acceptable, aligned with roadmap
Change Request Workflow (Mermaid):
stateDiagram-v2
[*] --> Submitted: Engineer creates change request
Submitted --> TechnicalReview: Assigned to SRE Lead
TechnicalReview --> SecurityReview: SRE approves
TechnicalReview --> Rejected: Technical concerns
SecurityReview --> ArchitectApproval: Security Officer approves
SecurityReview --> Rejected: Security concerns
ArchitectApproval --> Scheduled: Lead Architect approves
ArchitectApproval --> Rejected: Architecture concerns
Scheduled --> InProgress: Deployment window starts
InProgress --> Deployed: Deployment successful
InProgress --> RolledBack: Deployment failed
Deployed --> Validated: Post-deployment validation passes
RolledBack --> PostMortem: Rollback complete
Validated --> Closed: Change successful
PostMortem --> Closed: RCA documented
Rejected --> [*]: Change cancelled
Closed --> [*]: Change complete
CAB Meeting Template¶
Meeting Agenda (Weekly CAB):
# ATP Change Advisory Board Meeting
**Date**: Wednesday, January 15, 2025, 2:00 PM ET
**Duration**: 60 minutes
**Attendees**: Lead Architect, SRE Lead, Security Officer, Product Owner
---
## Agenda
### 1. Review Previous Week's Changes (10 minutes)
- Deployments executed: [List]
- Issues encountered: [List or "None"]
- Rollbacks performed: [List or "None"]
### 2. Upcoming Changes for Approval (40 minutes)
#### Change Request #12345: Upgrade Redis Cache to Premium P3
- **Requester**: Platform Team
- **Risk**: Low
- **Scheduled**: January 18, 2025, 10 PM ET
- **Duration**: 30 minutes (slot swap)
- **Rollback**: Swap back to previous slot (5 minutes)
- **Tenant Impact**: None (transparent upgrade)
- **Cost Impact**: +$500/month
- **Decision**: ☐ Approved ☐ Rejected ☐ Deferred
#### Change Request #12346: Deploy Tamper Evidence V2
- **Requester**: Engineering Team
- **Risk**: Medium
- **Scheduled**: January 20, 2025, via canary deployment
- **Duration**: 7 days (phased rollout)
- **Rollback**: Feature flag kill switch (immediate)
- **Tenant Impact**: Improved tamper evidence (backward compatible)
- **Testing**: Passed load tests, security scans, DR drill
- **Decision**: ☐ Approved ☐ Rejected ☐ Deferred
### 3. Emergency/Hotfix Changes (5 minutes)
- No emergency changes this week
### 4. Action Items Review (5 minutes)
- Outstanding action items from previous meetings
---
## Decisions
1. Change #12345: **Approved** (Architect, SRE, Security)
2. Change #12346: **Approved with conditions** (Require 48h monitoring at 10% before proceeding to 25%)
## Action Items
1. [Platform Team] Schedule Redis upgrade for January 18, 10 PM ET
2. [Engineering Team] Deploy Tamper Evidence V2 canary (10%) on January 20
3. [SRE Team] Monitor canary metrics for 48 hours before advancing
Deployment Scheduling & Change Windows¶
Purpose: Minimize tenant disruption by scheduling changes during low-traffic windows and communicating proactively.
Change Windows (Production):
changeWindows:
routine:
days: Tuesday, Wednesday, Thursday
time: 10 PM - 2 AM ET (low-traffic window)
frequency: 1-2 per week
type: Standard deployments, infrastructure changes
notification: 48-hour advance notice on status page
emergency:
days: Any
time: Immediate
frequency: As needed (P0/P1 incidents)
type: Hotfixes, security patches, critical bug fixes
notification: Real-time status page + email
blackout:
periods:
- December 15 - January 5 (holiday freeze)
- End of fiscal quarter (last 3 days)
- Major customer events (identified by Product team)
exemptions: P0 incidents, security vulnerabilities
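The routine window spans midnight (10 PM to 2 AM ET), so a naive weekday check misses the early-morning tail. The sketch below shows one way to attribute an early-morning hour to the previous day's window; it is illustrative only and omits blackout periods and emergency exemptions, and assumes naive datetimes already in ET.

```python
from datetime import datetime

# Routine change-window check: Tue/Wed/Thu, 10 PM - 2 AM ET.
# Hours 00:00-02:00 belong to the *previous* day's window
# (e.g. Friday 1 AM falls inside Thursday's 10 PM window).
def in_routine_window(ts: datetime) -> bool:
    if ts.hour >= 22:
        day = ts.weekday()              # Mon=0 ... Sun=6
    elif ts.hour < 2:
        day = (ts.weekday() - 1) % 7    # attribute to previous day's window
    else:
        return False
    return day in (1, 2, 3)             # Tuesday, Wednesday, Thursday
```

A scheduler built on this check would still need to subtract the blackout periods listed above before offering a slot.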
Change Communication Template:
# Production Change Notification
**Change ID**: CR-12345
**Scheduled Date**: January 18, 2025, 10:00 PM - 10:30 PM ET
**Type**: Infrastructure Upgrade
**Risk**: Low
**Expected Downtime**: Zero (blue-green deployment)
---
## What's Changing?
Upgrading Redis Cache from Premium P1 (6 GB) to Premium P3 (26 GB) to improve performance and capacity.
## Why?
Current Redis utilization averaging 85%; upgrade provides headroom for growth.
## Impact to You
- **Downtime**: None (transparent slot swap)
- **Performance**: Improved response times for query operations
- **Action Required**: None
## Rollback Plan
If issues detected, we will swap back to the previous Redis instance within 5 minutes.
## Questions?
Contact: platform-team@connectsoft.example
---
**Status**: ☐ Scheduled ☐ In Progress ☐ Complete ☐ Rolled Back
Continuous Improvement Metrics¶
Purpose: Track improvement progress with measurable KPIs aligned with DORA metrics and operational excellence.
DORA Metrics (DevOps Research and Assessment)¶
doraMetrics:
deploymentFrequency:
target: Daily (to Staging), Weekly (to Production)
current: 2-3 deployments/week (Production)
trend: Improving (up from 1/week in 2024)
leadTimeForChanges:
target: < 1 week (commit to production)
current: 10 days average
trend: Stable
changeFailureRate:
target: < 15%
current: 8% (Production deployments requiring rollback)
trend: Improving (down from 12% in 2024)
timeToRestoreService:
target: < 1 hour
current: 25 minutes average (automated failover)
trend: Excellent
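Two of these metrics (lead time for changes and change failure rate) are simple aggregations over deployment records. The sketch below computes them from an in-memory list; the record shape (`committed_at`, `deployed_at`, `rolled_back`) is an assumption for illustration, not ATP's telemetry schema.

```python
from datetime import datetime

# Compute average lead time (hours) and change failure rate (%) from
# deployment records. Illustrative only; ATP derives these from telemetry.
def dora_metrics(deployments: list[dict]) -> dict:
    lead_times = [
        (d["deployed_at"] - d["committed_at"]).total_seconds() / 3600
        for d in deployments
    ]
    failures = sum(1 for d in deployments if d["rolled_back"])
    return {
        "avg_lead_time_hours": sum(lead_times) / len(lead_times),
        "change_failure_rate_pct": 100.0 * failures / len(deployments),
    }
```

Deployment frequency and time-to-restore come from counting the same records per window and from incident timelines, respectively.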
DORA Metrics Dashboard (Application Insights):
// Deployment frequency (last 30 days)
customEvents
| where name == "DeploymentCompleted"
| where timestamp > ago(30d)
| extend Environment = tostring(customDimensions.Environment)
| where Environment == "prod"
| summarize TotalDeployments = count()
| extend AvgDeploymentsPerWeek = TotalDeployments / 30.0 * 7 // normalize over the full window, including zero-deployment days
// Lead time for changes (commit to production)
customEvents
| where name == "DeploymentCompleted"
| where timestamp > ago(30d)
| extend Environment = tostring(customDimensions.Environment)
| extend CommitTimestamp = todatetime(customDimensions.CommitTimestamp)
| extend DeploymentTimestamp = timestamp
| extend LeadTimeHours = datetime_diff('hour', DeploymentTimestamp, CommitTimestamp)
| where Environment == "prod"
| summarize
AvgLeadTimeHours = avg(LeadTimeHours),
P50LeadTimeHours = percentile(LeadTimeHours, 50),
P95LeadTimeHours = percentile(LeadTimeHours, 95)
// Change failure rate (deployments requiring rollback)
customEvents
| where name in ("DeploymentCompleted", "DeploymentRolledBack")
| where timestamp > ago(30d)
| extend Environment = tostring(customDimensions.Environment)
| where Environment == "prod"
| summarize
    TotalDeployments = countif(name == "DeploymentCompleted"),
    FailedDeployments = countif(name == "DeploymentRolledBack"),
    // denominator counts deployments only; a rolled-back deployment also emitted DeploymentCompleted
    ChangeFailureRate = 100.0 * countif(name == "DeploymentRolledBack") / countif(name == "DeploymentCompleted")
Platform Maturity Scorecard¶
platformMaturity:
automation:
current: 85%
target: 95%
gaps:
- Manual canary progression (target: automated with ML)
- Manual DR failback (target: automated validation)
observability:
current: 90%
target: 95%
gaps:
- Distributed tracing incomplete (target: 100% coverage)
- Business metrics dashboards (target: Grafana dashboards per service)
security:
current: 95%
target: 98%
gaps:
- Post-quantum cryptography (target: evaluate PQC algorithms)
- Zero-trust microsegmentation (target: Istio AuthorizationPolicy per pod)
compliance:
current: 98%
target: 100%
gaps:
- Continuous compliance monitoring (target: real-time policy enforcement)
- Automated evidence collection (target: zero manual effort)
costOptimization:
current: 80%
target: 90%
gaps:
- Kubernetes node autoscaling tuning (target: KEDA event-driven scaling)
- Spot instance adoption (target: 50% of non-production compute)
Strategic Roadmap (2025)¶
ATP's environment roadmap focuses on developer productivity, multi-region expansion, automation maturity, and self-service capabilities.
Q1 2025: Ephemeral Preview Environments¶
Objective: Per-PR preview environments using Kubernetes namespaces for faster feedback and isolated testing.
Implementation Plan:
q1Deliverables:
epic: Ephemeral Preview Environments
features:
- name: AKS Namespace per PR
description: Automatically create Kubernetes namespace for each PR
effort: 13 story points
dependencies: AKS cluster capacity planning
- name: Automated DNS per Preview
description: Dynamic subdomain creation (pr-123.preview.atp.connectsoft.com)
effort: 8 story points
dependencies: Azure DNS integration
- name: Auto-Delete after PR Merge
description: Cleanup preview environments within 24 hours of PR merge/close
effort: 5 story points
dependencies: GitHub webhook integration
- name: Cost Tracking per PR
description: Tag preview resources with PR ID for cost attribution
effort: 3 story points
dependencies: Azure Cost Management API
successCriteria:
- Preview environment created within 10 minutes of PR creation
- Full ATP stack deployed (7 microservices)
- Cost < $5 per PR (Spot instances)
- Automated cleanup 100% successful
Preview Environment Provisioning (GitHub Action):
# .github/workflows/preview-environment.yml
name: Create Preview Environment
on:
pull_request:
types: [opened, synchronize]
branches: [main]
env:
PR_NUMBER: ${{ github.event.pull_request.number }}
NAMESPACE: pr-${{ github.event.pull_request.number }}
jobs:
create-preview:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Azure Login
uses: azure/login@v1
with:
creds: ${{ secrets.AZURE_CREDENTIALS }}
- name: Create Kubernetes Namespace
run: |
az aks get-credentials \
--resource-group ConnectSoft-ATP-Preview-RG \
--name atp-aks-preview-eus
kubectl create namespace $NAMESPACE \
--dry-run=client -o yaml | kubectl apply -f -
kubectl label namespace $NAMESPACE \
pr-number=$PR_NUMBER \
cost-center=Engineering \
auto-delete=24h
- name: Deploy ATP Stack
run: |
helm upgrade --install atp-$PR_NUMBER ./helm/atp \
--namespace $NAMESPACE \
--set image.tag=pr-$PR_NUMBER \
--set ingress.host=pr-$PR_NUMBER.preview.atp.connectsoft.com \
--set resources.requests.cpu=100m \
--set resources.requests.memory=128Mi
- name: Wait for Deployment
run: |
kubectl wait --for=condition=ready pod \
--selector=app=atp-gateway \
--namespace=$NAMESPACE \
--timeout=600s
- name: Run Smoke Tests
run: |
PREVIEW_URL="https://pr-$PR_NUMBER.preview.atp.connectsoft.com"
# Wait for DNS propagation
sleep 60
# Health check
curl -f $PREVIEW_URL/health || exit 1
echo "✅ Preview environment ready: $PREVIEW_URL"
- name: Comment on PR
uses: actions/github-script@v7
with:
script: |
github.rest.issues.createComment({
issue_number: context.issue.number,
owner: context.repo.owner,
repo: context.repo.repo,
body: `## 🚀 Preview Environment Ready\n\n**URL**: https://pr-${{ github.event.pull_request.number }}.preview.atp.connectsoft.com\n\n**Services**:\n- Gateway: https://pr-${{ github.event.pull_request.number }}.preview.atp.connectsoft.com/health\n- Swagger: https://pr-${{ github.event.pull_request.number }}.preview.atp.connectsoft.com/swagger\n\n**Auto-Delete**: 24 hours after PR merge/close`
})
Q2 2025: Multi-Region Active-Active Expansion¶
Objective: Expand multi-region deployment to all ATP services (currently only Ingestion and Query are multi-region).
Implementation Plan:
q2Deliverables:
  epic: Multi-Region Active-Active Expansion
  features:
    - name: Deploy Policy Service to West Europe
      effort: 13 story points
      dependencies: West Europe AKS cluster, Cosmos DB geo-replication
    - name: Deploy Export Service to West Europe
      effort: 13 story points
      dependencies: Blob Storage RA-GRS, cross-region blob access
    - name: Deploy Integrity Service to West Europe
      effort: 8 story points
      dependencies: HSM key replication (Azure Key Vault Managed HSM)
    - name: Cross-Region Service Discovery
      effort: 8 story points
      dependencies: Azure Front Door routing rules per service
    - name: Multi-Region E2E Tests
      effort: 5 story points
      dependencies: Synthetic monitors from both regions
  successCriteria:
    - All 7 ATP services deployed to both regions
    - "Traffic split: 70% East US, 30% West Europe"
    - Failover RTO < 30 minutes for all services
    - Incremental cost increase < $3,000/month
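The 70/30 traffic split in the success criteria above is enforced by weight-based routing at Azure Front Door. The arithmetic behind a weighted split can be sketched as below; the region names and weights mirror the criteria, while the request count is purely illustrative (the real apportioning happens inside Front Door, not in a script):

```shell
#!/bin/bash
# Sketch: how weight-based routing apportions requests for a 70/30 split.
WEIGHT_EASTUS=70
WEIGHT_WESTEU=30
TOTAL_REQUESTS=1000   # illustrative sample size

TOTAL_WEIGHT=$((WEIGHT_EASTUS + WEIGHT_WESTEU))
# Each origin receives requests proportional to weight / total weight
EASTUS_SHARE=$((TOTAL_REQUESTS * WEIGHT_EASTUS / TOTAL_WEIGHT))
WESTEU_SHARE=$((TOTAL_REQUESTS - EASTUS_SHARE))

echo "East US:     $EASTUS_SHARE requests"
echo "West Europe: $WESTEU_SHARE requests"
```

With weights 70 and 30, every 1,000 requests resolve to 700 for East US and 300 for West Europe, matching the target split.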
Q3 2025: Automated Canary with ML Anomaly Detection¶
Objective: Fully automated canary deployments with ML-based anomaly detection for intelligent rollout decisions.
Implementation Plan:
q3Deliverables:
  epic: Intelligent Canary Deployments
  features:
    - name: ML Anomaly Detection Model
      description: Train model on historical deployment metrics (error rate, latency, throughput)
      effort: 21 story points
      dependencies: Azure Machine Learning workspace, historical telemetry data
    - name: Automated Canary Progression
      description: Auto-advance canary based on ML model confidence
      effort: 13 story points
      dependencies: ML model endpoint, Azure Pipelines integration
    - name: Intelligent Rollback
      description: ML model detects anomalies; auto-rollback without human intervention
      effort: 13 story points
      dependencies: ML model, automated rollback runbook
    - name: Canary Dashboard
      description: Real-time canary health dashboard with ML predictions
      effort: 8 story points
      dependencies: Grafana, ML model metrics export
  successCriteria:
    - 90% of canary deployments fully automated (no manual progression)
    - False positive rate < 5% (incorrect rollbacks)
    - Anomaly detection latency < 5 minutes
    - Rollback initiated within 10 minutes of anomaly detection
ML Anomaly Detection (Conceptual):
// ML-based canary health prediction (conceptual sketch)
public class CanaryHealthPredictor
{
    private readonly MachineLearningClient _mlClient;

    public CanaryHealthPredictor(MachineLearningClient mlClient) => _mlClient = mlClient;

    public async Task<CanaryHealthPrediction> PredictHealthAsync(CanaryMetrics metrics)
    {
        // Prepare features for the ML model
        var features = new
        {
            errorRate = metrics.ErrorRate,
            p95Latency = metrics.P95Latency,
            p99Latency = metrics.P99Latency,
            throughput = metrics.RequestsPerSecond,
            cpuUtilization = metrics.AvgCpuUtilization,
            memoryUtilization = metrics.AvgMemoryUtilization,
            // Relative change against the stable baseline
            errorRateChange = (metrics.ErrorRate - metrics.BaselineErrorRate) / metrics.BaselineErrorRate,
            latencyChange = (metrics.P95Latency - metrics.BaselineP95Latency) / metrics.BaselineP95Latency
        };

        // Invoke the ML model endpoint
        var prediction = await _mlClient.PredictAsync("canary-health-model", features);

        return new CanaryHealthPrediction
        {
            IsHealthy = prediction.Prediction == "Healthy",
            Confidence = prediction.Confidence,
            AnomalyScore = prediction.AnomalyScore,
            Recommendation = prediction.AnomalyScore > 0.8
                ? "Rollback immediately"
                : prediction.AnomalyScore > 0.5
                    ? "Pause rollout; investigate"
                    : "Proceed to next increment"
        };
    }
}
Q4 2025: Self-Service Environment Provisioning¶
Objective: Empower developers to create on-demand dev/test environments without platform team intervention.
Implementation Plan:
q4Deliverables:
  epic: Self-Service Environment Provisioning
  features:
    - name: Environment Portal (Web UI)
      description: Self-service portal for creating dev/test environments
      effort: 21 story points
      dependencies: Azure App Service, Azure AD integration
    - name: Pulumi Automation API Integration
      description: Backend API to trigger Pulumi deployments
      effort: 13 story points
      dependencies: Pulumi Automation API, Azure DevOps integration
    - name: Cost Guardrails
      description: Prevent creation of expensive resources; require approval for >$100/month
      effort: 8 story points
      dependencies: Azure Cost Management API
    - name: Automatic Expiration
      description: Auto-delete environments after 7 days (with renewal option)
      effort: 5 story points
      dependencies: Azure Automation, tagging strategy
  successCriteria:
    - Developers can create environment in < 15 minutes
    - Zero platform team involvement for dev/test environments
    - 100% environments tagged with owner + expiration
    - Orphaned resource detection and cleanup (weekly scan)
Self-Service Portal (ASP.NET Core):
// Environment provisioning API
[ApiController]
[Route("api/environments")]
public class EnvironmentProvisioningController : ControllerBase
{
    private readonly IPulumiAutomationService _pulumi;

    public EnvironmentProvisioningController(IPulumiAutomationService pulumi) => _pulumi = pulumi;

    [HttpPost]
    [Authorize(Roles = "Developer")]
    public async Task<IActionResult> CreateEnvironment([FromBody] CreateEnvironmentRequest request)
    {
        // Validate request
        if (request.Environment != "dev" && request.Environment != "test")
        {
            return BadRequest("Self-service provisioning only available for dev/test environments");
        }

        // Estimate cost
        var estimatedCost = await EstimateMonthlyCostAsync(request);
        if (estimatedCost > 100 && !User.IsInRole("PlatformTeam"))
        {
            // 403, not 401: the caller is authenticated but lacks permission
            return StatusCode(StatusCodes.Status403Forbidden,
                "Environments >$100/month require platform team approval");
        }

        // Generate unique environment name
        var envName = $"{request.Environment}-{User.Identity.Name.Replace("@", "-")}-{DateTime.UtcNow:yyyyMMdd}";

        // Provision via Pulumi Automation API
        var provisioningJob = await _pulumi.CreateStackAsync(new StackConfig
        {
            ProjectName = "atp-infrastructure",
            StackName = envName,
            Config = new Dictionary<string, string>
            {
                ["environment"] = request.Environment,
                ["region"] = "eastus",
                ["owner"] = User.Identity.Name,
                ["expiresAt"] = DateTime.UtcNow.AddDays(7).ToString("o"),
                ["costCenter"] = "Engineering"
            }
        });

        // Tag resources for tracking
        await TagResourcesAsync(envName, new Dictionary<string, string>
        {
            ["Owner"] = User.Identity.Name,
            ["CreatedAt"] = DateTime.UtcNow.ToString("o"),
            ["ExpiresAt"] = DateTime.UtcNow.AddDays(7).ToString("o"),
            ["EstimatedMonthlyCost"] = estimatedCost.ToString("F2")
        });

        return Accepted(new
        {
            environmentName = envName,
            status = "Provisioning",
            estimatedCompletionTime = DateTime.UtcNow.AddMinutes(15),
            estimatedMonthlyCost = estimatedCost,
            expiresAt = DateTime.UtcNow.AddDays(7),
            provisioningJobId = provisioningJob.Id
        });
    }

    [HttpDelete("{environmentName}")]
    [Authorize(Roles = "Developer")]
    public async Task<IActionResult> DeleteEnvironment(string environmentName)
    {
        // Validate ownership
        var owner = await GetEnvironmentOwnerAsync(environmentName);
        if (owner != User.Identity.Name && !User.IsInRole("PlatformTeam"))
        {
            // Forbid(string) interprets the argument as an authentication
            // scheme, not a message — return an explicit 403 with a body
            return StatusCode(StatusCodes.Status403Forbidden,
                "You can only delete environments you own");
        }

        // Delete via Pulumi
        await _pulumi.DestroyStackAsync(environmentName);
        return NoContent();
    }
}
Environment Lifecycle Management¶
Automated Expiration (Daily Scan):
#!/bin/bash
# cleanup-expired-environments.sh
echo "Scanning for expired environments..."
CURRENT_DATE=$(date -u +%Y-%m-%dT%H:%M:%SZ)

# Query resource groups with expiration tags
EXPIRED_RGS=$(az group list \
  --query "[?tags.ExpiresAt != null && tags.ExpiresAt < '$CURRENT_DATE'].{Name:name, Owner:tags.Owner, ExpiresAt:tags.ExpiresAt}" \
  --output json)

EXPIRED_COUNT=$(echo "$EXPIRED_RGS" | jq length)
if [ "$EXPIRED_COUNT" -eq 0 ]; then
  echo "No expired environments found"
  exit 0
fi

echo "Found $EXPIRED_COUNT expired environments:"
echo "$EXPIRED_RGS" | jq -r '.[] | "\(.Name) (Owner: \(.Owner), Expired: \(.ExpiresAt))"'

# Notify owners before deletion
for RG in $(echo "$EXPIRED_RGS" | jq -r '.[].Name'); do
  OWNER=$(echo "$EXPIRED_RGS" | jq -r ".[] | select(.Name == \"$RG\") | .Owner")
  EXPIRES_AT=$(echo "$EXPIRED_RGS" | jq -r ".[] | select(.Name == \"$RG\") | .ExpiresAt")
  echo "Notifying owner: $OWNER about $RG..."

  # Send email notification. Azure CLI has no built-in mail command, so post
  # to the team's notification webhook (e.g. a Logic App HTTP trigger)
  curl -s -X POST "$NOTIFICATION_WEBHOOK_URL" \
    -H "Content-Type: application/json" \
    -d "{\"to\": \"$OWNER\", \"subject\": \"Environment Expiration Notice: $RG\", \"body\": \"Your environment '$RG' expired on $EXPIRES_AT and will be deleted in 24 hours. To extend, visit: https://portal.atp.connectsoft.com/environments/$RG/renew\"}"

  # Tag for deletion (24-hour grace period)
  az group update \
    --name "$RG" \
    --set tags.PendingDeletion="$(date -u -d '+24 hours' +%Y-%m-%dT%H:%M:%SZ)"
done

echo "✅ Expiration notifications sent; environments will be deleted in 24 hours"
Orphaned Resource Detection:
<#
.SYNOPSIS
    Detect orphaned resources without owner tags.
.DESCRIPTION
    Weekly scan for resources missing required tags; notify platform team.
#>
Connect-AzAccount -Identity

# Tags can be $null, so guard before calling ContainsKey
$orphanedResources = Get-AzResource | Where-Object {
    -not $_.Tags -or
    -not $_.Tags.ContainsKey('Owner') -or
    -not $_.Tags.ContainsKey('Environment') -or
    -not $_.Tags.ContainsKey('CostCenter')
}

if ($orphanedResources.Count -gt 0) {
    Write-Output "Found $($orphanedResources.Count) orphaned resources:"
    $orphanedResources | Format-Table -Property Name, ResourceGroupName, ResourceType, Location

    # Create work item for cleanup (PowerShell continues lines with backticks,
    # not backslashes)
    az boards work-item create `
        --title "Orphaned Resources Detected: $($orphanedResources.Count) resources" `
        --type "Task" `
        --description "Resources without required tags detected. See attached report." `
        --assigned-to "platform-team@connectsoft.example" `
        --fields "Microsoft.VSTS.Common.Priority=3"

    # Export CSV for review
    $orphanedResources | Export-Csv -Path "orphaned-resources-$(Get-Date -Format 'yyyyMMdd').csv" -NoTypeInformation
}
else {
    Write-Output "✅ No orphaned resources detected"
}
Continuous Improvement Framework¶
Purpose: Systematic improvement through retrospectives, metrics tracking, and experimentation.
Improvement Cycle:
graph LR
    A[Identify Opportunity] --> B[Define Metric]
    B --> C[Implement Change]
    C --> D[Measure Impact]
    D --> E{Improvement?}
    E -->|Yes| F[Document & Scale]
    E -->|No| G[Rollback & Retry]
    F --> A
    G --> A
    style F fill:#90EE90
    style G fill:#FF6347
Improvement Tracking:
// Track environment improvements via Azure DevOps work items
public class ImprovementTracker
{
    // Abstraction over the Azure DevOps work item tracking API
    private readonly IDevOpsClient _devOpsClient;

    public ImprovementTracker(IDevOpsClient devOpsClient) => _devOpsClient = devOpsClient;

    public async Task RecordImprovementAsync(Improvement improvement)
    {
        var workItem = new WorkItem
        {
            Fields = new Dictionary<string, object>
            {
                ["System.Title"] = improvement.Title,
                ["System.WorkItemType"] = "Improvement",
                ["System.State"] = "Proposed",
                ["Custom.Hypothesis"] = improvement.Hypothesis,
                ["Custom.TargetMetric"] = improvement.TargetMetric,
                ["Custom.BaselineValue"] = improvement.BaselineValue,
                ["Custom.TargetValue"] = improvement.TargetValue,
                ["Custom.EstimatedEffort"] = improvement.EstimatedEffort,
                ["Custom.ExpectedImpact"] = improvement.ExpectedImpact
            }
        };

        await _devOpsClient.CreateWorkItemAsync(workItem, "ConnectSoft", "ATP");
    }

    public async Task MeasureImpactAsync(int improvementId)
    {
        var improvement = await _devOpsClient.GetWorkItemAsync(improvementId);
        var targetMetric = improvement.Fields["Custom.TargetMetric"].ToString();

        // Query actual metric value post-improvement
        var actualValue = await QueryMetricAsync(targetMetric);
        var baselineValue = double.Parse(improvement.Fields["Custom.BaselineValue"].ToString());
        var targetValue = double.Parse(improvement.Fields["Custom.TargetValue"].ToString());

        // Percentage change relative to baseline. Note: for lower-is-better
        // metrics (e.g. cost), both values come out negative and the
        // comparison below would need inverting.
        var actualImprovement = ((actualValue - baselineValue) / baselineValue) * 100;
        var targetImprovement = ((targetValue - baselineValue) / baselineValue) * 100;
        var success = actualImprovement >= targetImprovement;

        // Update work item with measured impact
        await _devOpsClient.UpdateWorkItemAsync(improvementId, new[]
        {
            new JsonPatchOperation
            {
                Operation = Operation.Add,
                Path = "/fields/Custom.ActualValue",
                Value = actualValue
            },
            new JsonPatchOperation
            {
                Operation = Operation.Add,
                Path = "/fields/Custom.ActualImprovement",
                Value = $"{actualImprovement:F1}%"
            },
            new JsonPatchOperation
            {
                Operation = Operation.Add,
                Path = "/fields/System.State",
                Value = success ? "Closed" : "Active"
            }
        });
    }
}
Example Improvements (2024-2025):
improvements:
  - id: IMP-001
    title: Automated Dev/Test Shutdown
    hypothesis: Shutting down dev/test during non-business hours will reduce costs by 40%
    targetMetric: Monthly cost (Dev+Test)
    baselineValue: $1,500
    targetValue: $900
    actualValue: $1,020
    actualImprovement: 32%  # slightly below target, but significant
    status: Success
  - id: IMP-002
    title: Reserved Instances for Production
    hypothesis: 1-year reserved instances will reduce compute costs by 20%
    targetMetric: Monthly compute cost (Production)
    baselineValue: $3,000
    targetValue: $2,400
    actualValue: $2,450
    actualImprovement: 18.3%  # close to target
    status: Success
  - id: IMP-003
    title: Application Insights Sampling
    hypothesis: 10% sampling will reduce ingestion costs by 90% with minimal observability impact
    targetMetric: Application Insights monthly cost
    baselineValue: $500
    targetValue: $50
    actualValue: $115
    actualImprovement: 77%  # better than expected; some non-sampled telemetry
    status: Success
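The actualImprovement figures above are cost reductions relative to baseline: (baseline − actual) / baseline × 100. A quick sketch reproducing them from the baseline/actual values in the table:

```shell
#!/bin/bash
# Sketch: reproduce the actualImprovement percentages from the table above.
improvement() {
  # $1 = baseline cost, $2 = actual cost; prints reduction in percent (1 decimal)
  awk -v b="$1" -v a="$2" 'BEGIN { printf "%.1f", (b - a) / b * 100 }'
}

IMP001=$(improvement 1500 1020)   # Automated Dev/Test Shutdown
IMP002=$(improvement 3000 2450)   # Reserved Instances for Production
IMP003=$(improvement 500 115)     # Application Insights Sampling
echo "IMP-001: ${IMP001}%  IMP-002: ${IMP002}%  IMP-003: ${IMP003}%"
```

This yields 32.0%, 18.3%, and 77.0% respectively, matching the recorded outcomes.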
Summary¶
- Review Cadence: Monthly (cost/utilization), Quarterly (security/access/DR), Annually (full refresh/roadmap).
- Monthly Review: Cost analysis, resource utilization, scaling policies, rightsizing recommendations with automated report generation.
- Quarterly Review: Security posture, access audit, compliance scorecard, DR drill review with 2.5-hour workshop format.
- Annual Review: Full IaC refresh, cost optimization deep dive, security evolution, technology roadmap with 4-hour full-day workshop.
- Change Advisory Board: 4 core members (Architect, SRE, Security, Product), weekly meetings for routine changes, emergency meetings for hotfixes.
- Change Process: Formal change request template, 3-step approval workflow (SRE → Security → Architect), documented with a state diagram.
- Change Windows: Tuesday-Thursday 10 PM-2 AM ET (routine), immediate (emergency), blackout periods (holidays, fiscal quarter-end).
- DORA Metrics: Deployment frequency (2-3/week), lead time (10 days), change failure rate (8%), time to restore (25 min).
- Platform Maturity: Automation (85%), Observability (90%), Security (95%), Compliance (98%), Cost Optimization (80%).
- Q1 2025 Roadmap: Ephemeral preview environments per PR (AKS namespaces, auto-delete, <$5 per PR).
- Q2 2025 Roadmap: Multi-region active-active for all 7 ATP services (70/30 traffic split, <$3k incremental cost).
- Q3 2025 Roadmap: Automated canary with ML anomaly detection (90% automated, <5% false positives, 10-min rollback).
- Q4 2025 Roadmap: Self-service environment provisioning (web portal, 15-min creation, cost guardrails, 7-day expiration).
- Improvement Framework: Hypothesis-driven improvements with baseline/target metrics, success tracking via Azure DevOps work items.
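The DORA change failure rate quoted above (8%) is simply failed deployments over total deployments in the measurement window. A minimal sketch of that calculation — the counts here are illustrative, not actual ATP deployment data:

```shell
#!/bin/bash
# Sketch: DORA change failure rate = failed deployments / total deployments.
# Counts are illustrative placeholders.
TOTAL_DEPLOYS=25
FAILED_DEPLOYS=2

CFR=$(awk -v f="$FAILED_DEPLOYS" -v t="$TOTAL_DEPLOYS" \
  'BEGIN { printf "%.0f", f / t * 100 }')
echo "Change failure rate: ${CFR}%"
```

In practice the inputs would come from the deployment pipeline's run history rather than hard-coded counts.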
Appendix A — Environment Variable Reference¶
This appendix provides comprehensive environment variable templates for all ATP environments, including base configuration, environment-specific overrides, Key Vault references, and container runtime examples.
Purpose¶
- Standardize environment variable naming and structure across all environments
- Document required vs. optional variables per environment
- Provide templates for local development, container deployments, and Azure App Service
- Clarify Key Vault reference syntax for secure secrets injection
Base Environment Variables (All Environments)¶
Common Variables (required in all environments):
# Runtime environment
export ASPNETCORE_ENVIRONMENT=Development # Development | Test | Staging | Production
# Application identity
export APP_NAME=atp-ingestion
export APP_VERSION=1.0.123
export DEPLOYMENT_TIMESTAMP=2025-01-15T14:30:00Z
# Logging
export Logging__LogLevel__Default=Information
export Logging__LogLevel__Microsoft=Warning
export Logging__LogLevel__System=Warning
# OpenTelemetry
export OpenTelemetry__ServiceName=atp-ingestion
export OpenTelemetry__ServiceVersion=1.0.123
export OpenTelemetry__ServiceNamespace=atp
export OpenTelemetry__ExporterEndpoint=http://otel-collector:4317
# Health checks
export HealthChecks__Enabled=true
export HealthChecks__Port=8080
export HealthChecks__Path=/health
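The double underscore in names like `Logging__LogLevel__Default` is the .NET configuration convention: the environment-variable provider treats `__` as the section separator, so each export lands at the corresponding hierarchical key from appsettings.json. The mapping can be illustrated as:

```shell
#!/bin/bash
# Sketch: .NET's environment-variable configuration provider maps "__" to
# the ":" hierarchy separator. This helper shows the resulting config key.
to_config_key() {
  echo "$1" | sed 's/__/:/g'
}

KEY=$(to_config_key "Logging__LogLevel__Default")
echo "$KEY"   # -> Logging:LogLevel:Default
```

So `export Logging__LogLevel__Default=Information` sets the same value that `"Logging": { "LogLevel": { "Default": "Information" } }` would in appsettings.json.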
Dev Environment Variables¶
Dev Environment (appsettings.Development.json + environment variables):
#!/bin/bash
# dev-environment-variables.sh
# ═══════════════════════════════════════════════════════════════════════════
# ATP Dev Environment Variables
# Usage: source dev-environment-variables.sh
# ═══════════════════════════════════════════════════════════════════════════
# Runtime
export ASPNETCORE_ENVIRONMENT=Development
export ASPNETCORE_URLS="http://+:5000;https://+:5001"  # quoted: an unquoted ';' would split the command
# Database (local SQL Server)
export ConnectionStrings__DefaultConnection="Server=localhost,1433;Database=ATP_Dev;User Id=sa;Password=P@ssw0rd123!;TrustServerCertificate=True;MultipleActiveResultSets=True"
export ConnectionStrings__ReadReplica="Server=localhost,1433;Database=ATP_Dev;User Id=sa;Password=P@ssw0rd123!;TrustServerCertificate=True;ApplicationIntent=ReadOnly"
# Cosmos DB (local emulator)
export ConnectionStrings__CosmosDb="AccountEndpoint=https://localhost:8081/;AccountKey=C2y6yDjf5/R+ob0N8A7Cgv30VRDJIWEHLM+4QDU5DE2nQ9nDuVTqobD4b8mGGyPMbIZnqyMsEcaGQy67XIw/Jw=="
export CosmosDb__DatabaseName=ATPDev
export CosmosDb__ContainerName=AuditEvents
# Redis (local container)
export ConnectionStrings__Redis="localhost:6379,abortConnect=false,connectRetry=3,connectTimeout=5000"
export Redis__InstanceName=ATPDev
export Redis__DefaultExpirationMinutes=60
# Service Bus (local emulator or Azure)
export ConnectionStrings__ServiceBus="Endpoint=sb://localhost:5672;SharedAccessKeyName=RootManageSharedAccessKey;SharedAccessKey=..."
export ServiceBus__QueueName=atp-audit-events-dev
export ServiceBus__TopicName=atp-events-dev
# RabbitMQ (alternative to Service Bus in dev)
export ConnectionStrings__RabbitMQ="amqp://guest:guest@localhost:5672"
export RabbitMQ__QueueName=atp-audit-events-dev
export RabbitMQ__ExchangeName=atp-events-dev
# Blob Storage (local Azurite)
export ConnectionStrings__BlobStorage="DefaultEndpointsProtocol=http;AccountName=devstoreaccount1;AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==;BlobEndpoint=http://127.0.0.1:10000/devstoreaccount1"
export BlobStorage__ContainerName=atp-audit-attachments-dev
# Audit Configuration
export Audit__EnableImmutability=false
export Audit__EnableTamperEvidence=false
export Audit__RetentionDays=30
export Audit__EnableWormStorage=false
# Compliance
export Compliance__StrictInDevelopment=true # Enable strict validation even in dev
export Compliance__EnableLoggingRedaction=true
export Compliance__Profile=default
# OpenTelemetry
export OpenTelemetry__ServiceName=atp-ingestion
export OpenTelemetry__ServiceVersion=1.0.0-dev
export OpenTelemetry__ExporterEndpoint=http://localhost:4317
export OpenTelemetry__SamplingRatio=1.0 # 100% sampling in dev
export OpenTelemetry__ExportIntervalSeconds=5
# Application Insights (optional in dev)
export ApplicationInsights__InstrumentationKey="" # Empty = disabled
export ApplicationInsights__EnableAdaptiveSampling=false
# Feature Flags (all enabled in dev)
export FeatureManagement__TamperEvidenceV2=true
export FeatureManagement__AdvancedQueryFilters=true
export FeatureManagement__AIAssistedAnomalyDetection=true
# JWT Authentication (dev key)
export Authentication__JwtSecret=dev-secret-key-change-in-production-32-chars-min
export Authentication__JwtIssuer=https://atp-dev.connectsoft.local
export Authentication__JwtAudience=atp-services
export Authentication__JwtExpirationMinutes=480 # 8 hours
# API Rate Limiting (relaxed in dev)
export RateLimiting__Enabled=false
export RateLimiting__RequestsPerMinute=1000
# Debugging
export ASPNETCORE_DETAILEDERRORS=true
export ASPNETCORE_SHUTDOWNTIMEOUTSECONDS=30
export COMPlus_EnableDiagnostics=1
echo "✅ Dev environment variables loaded"
echo " - ASPNETCORE_ENVIRONMENT: $ASPNETCORE_ENVIRONMENT"
echo " - Database: localhost:1433"
echo " - Redis: localhost:6379"
echo " - OpenTelemetry: $OpenTelemetry__ExporterEndpoint"
Test Environment Variables¶
Test Environment (Azure App Service or AKS):
#!/bin/bash
# test-environment-variables.sh
# Runtime
export ASPNETCORE_ENVIRONMENT=Test
export ASPNETCORE_URLS=http://+:8080
# Database (Azure SQL)
export ConnectionStrings__DefaultConnection="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-test-eus.vault.azure.net/secrets/SqlConnectionString)"
export ConnectionStrings__ReadReplica="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-test-eus.vault.azure.net/secrets/SqlReadReplicaConnectionString)"
# Cosmos DB
export ConnectionStrings__CosmosDb="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-test-eus.vault.azure.net/secrets/CosmosDbConnectionString)"
export CosmosDb__DatabaseName=ATPTest
export CosmosDb__ContainerName=AuditEvents
# Redis
export ConnectionStrings__Redis="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-test-eus.vault.azure.net/secrets/RedisConnectionString)"
export Redis__InstanceName=ATPTest
export Redis__DefaultExpirationMinutes=120
# Service Bus
export ConnectionStrings__ServiceBus="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-test-eus.vault.azure.net/secrets/ServiceBusConnectionString)"
export ServiceBus__QueueName=atp-audit-events-test
export ServiceBus__TopicName=atp-events-test
# Blob Storage
export ConnectionStrings__BlobStorage="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-test-eus.vault.azure.net/secrets/BlobStorageConnectionString)"
export BlobStorage__ContainerName=atp-audit-attachments-test
# Audit Configuration
export Audit__EnableImmutability=false # Still disabled in test
export Audit__EnableTamperEvidence=true # Test tamper evidence
export Audit__RetentionDays=90
export Audit__EnableWormStorage=false
# Compliance
export Compliance__StrictInDevelopment=false
export Compliance__EnableLoggingRedaction=true
export Compliance__Profile=default
# OpenTelemetry
export OpenTelemetry__ServiceName=atp-ingestion
export OpenTelemetry__ServiceVersion=1.0.123
export OpenTelemetry__ExporterEndpoint=http://otel-collector-test.atp.local:4317
export OpenTelemetry__SamplingRatio=0.5 # 50% sampling in test
export OpenTelemetry__ExportIntervalSeconds=30
# Application Insights
export ApplicationInsights__InstrumentationKey="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-test-eus.vault.azure.net/secrets/AppInsightsKey)"
export ApplicationInsights__EnableAdaptiveSampling=true
export ApplicationInsights__MaxTelemetryItemsPerSecond=50
# Feature Flags (controlled via Azure App Configuration)
export AppConfiguration__Endpoint=https://atp-appconfig-test-eus.azconfig.io
export AppConfiguration__ManagedIdentityEnabled=true
# JWT Authentication
export Authentication__JwtSecret="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-test-eus.vault.azure.net/secrets/JwtSigningKey)"
export Authentication__JwtIssuer=https://atp-test.connectsoft.com
export Authentication__JwtAudience=atp-services
export Authentication__JwtExpirationMinutes=240 # 4 hours
# API Rate Limiting
export RateLimiting__Enabled=true
export RateLimiting__RequestsPerMinute=500
# Azure Managed Identity
export AZURE_CLIENT_ID=12345678-1234-1234-1234-123456789012 # User-assigned managed identity
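The `@Microsoft.KeyVault(SecretUri=...)` values above are App Service Key Vault references, resolved at runtime by the platform using the app's managed identity. A small pre-deployment sanity check for the reference shape can look like the following; the regex is a loose approximation for illustration, not the platform's official grammar:

```shell
#!/bin/bash
# Sketch: validate that a setting follows the App Service Key Vault
# reference shape @Microsoft.KeyVault(SecretUri=...). The regex below is
# a loose approximation of the documented syntax.
is_keyvault_ref() {
  printf '%s' "$1" | grep -Eq \
    '^@Microsoft\.KeyVault\(SecretUri=https://[^)]+\.vault\.azure\.net/secrets/[^)]+\)$'
}

REF="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-test-eus.vault.azure.net/secrets/SqlConnectionString)"
if is_keyvault_ref "$REF"; then RESULT=valid; else RESULT=invalid; fi
echo "$RESULT"
```

Running a check like this over an environment file catches malformed references before they silently fail to resolve at runtime.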
Staging Environment Variables¶
Staging Environment (production-like configuration):
#!/bin/bash
# staging-environment-variables.sh
# Runtime
export ASPNETCORE_ENVIRONMENT=Staging
export ASPNETCORE_URLS=http://+:8080
# Database (Azure SQL with geo-replication)
export ConnectionStrings__DefaultConnection="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-staging-eus.vault.azure.net/secrets/SqlConnectionString)"
export ConnectionStrings__ReadReplica="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-staging-eus.vault.azure.net/secrets/SqlReadReplicaConnectionString)"
# Cosmos DB
export ConnectionStrings__CosmosDb="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-staging-eus.vault.azure.net/secrets/CosmosDbConnectionString)"
export CosmosDb__DatabaseName=ATPStaging
export CosmosDb__ContainerName=AuditEvents
export CosmosDb__EnableMultiRegion=true
export CosmosDb__PreferredRegions=eastus,westeurope
# Redis (Premium tier)
export ConnectionStrings__Redis="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-staging-eus.vault.azure.net/secrets/RedisConnectionString)"
export Redis__InstanceName=ATPStaging
export Redis__DefaultExpirationMinutes=240
export Redis__EnableCluster=true
# Service Bus (Premium tier)
export ConnectionStrings__ServiceBus="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-staging-eus.vault.azure.net/secrets/ServiceBusConnectionString)"
export ServiceBus__QueueName=atp-audit-events-staging
export ServiceBus__TopicName=atp-events-staging
export ServiceBus__EnablePartitioning=true
# Blob Storage (with WORM for testing)
export ConnectionStrings__BlobStorage="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-staging-eus.vault.azure.net/secrets/BlobStorageConnectionString)"
export BlobStorage__ContainerName=atp-audit-attachments-staging
export BlobStorage__EnableImmutability=true
export BlobStorage__ImmutabilityPeriodDays=90
# Audit Configuration (production-like)
export Audit__EnableImmutability=true
export Audit__EnableTamperEvidence=true
export Audit__RetentionDays=180
export Audit__EnableWormStorage=true
export Audit__EnableHashChains=true
# Compliance
export Compliance__StrictInDevelopment=false
export Compliance__EnableLoggingRedaction=true
export Compliance__Profile=default
export Compliance__EnableGDPR=true
export Compliance__EnableHIPAA=true
# OpenTelemetry
export OpenTelemetry__ServiceName=atp-ingestion
export OpenTelemetry__ServiceVersion=1.0.123
export OpenTelemetry__ExporterEndpoint=http://otel-collector-staging.atp.local:4317
export OpenTelemetry__SamplingRatio=0.25 # 25% sampling in staging
export OpenTelemetry__ExportIntervalSeconds=60
# Application Insights
export ApplicationInsights__InstrumentationKey="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-staging-eus.vault.azure.net/secrets/AppInsightsKey)"
export ApplicationInsights__EnableAdaptiveSampling=true
export ApplicationInsights__MaxTelemetryItemsPerSecond=100
export ApplicationInsights__InitialSamplingPercentage=25
# Feature Flags (Azure App Configuration)
export AppConfiguration__Endpoint=https://atp-appconfig-staging-eus.azconfig.io
export AppConfiguration__ManagedIdentityEnabled=true
export AppConfiguration__RefreshIntervalSeconds=30
# JWT Authentication
export Authentication__JwtSecret="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-staging-eus.vault.azure.net/secrets/JwtSigningKey)"
export Authentication__JwtIssuer=https://atp-staging.connectsoft.com
export Authentication__JwtAudience=atp-services
export Authentication__JwtExpirationMinutes=120 # 2 hours
export Authentication__EnableRefreshTokens=true
# API Rate Limiting
export RateLimiting__Enabled=true
export RateLimiting__RequestsPerMinute=500
export RateLimiting__BurstSize=100
# Azure Managed Identity
export AZURE_CLIENT_ID=23456789-2345-2345-2345-234567890123
Production Environment Variables¶
Production Environment (full security and compliance):
#!/bin/bash
# production-environment-variables.sh
# Runtime
export ASPNETCORE_ENVIRONMENT=Production
export ASPNETCORE_URLS=http://+:8080
# Database (Azure SQL with multi-region failover)
export ConnectionStrings__DefaultConnection="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-prod-eus.vault.azure.net/secrets/SqlConnectionString)"
export ConnectionStrings__ReadReplica="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-prod-eus.vault.azure.net/secrets/SqlReadReplicaConnectionString)"
export ConnectionStrings__FailoverPartner=atp-sql-prod-weu.database.windows.net
# Cosmos DB (multi-region with automatic failover)
export ConnectionStrings__CosmosDb="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-prod-eus.vault.azure.net/secrets/CosmosDbConnectionString)"
export CosmosDb__DatabaseName=ATPProduction
export CosmosDb__ContainerName=AuditEvents
export CosmosDb__EnableMultiRegion=true
export CosmosDb__PreferredRegions=eastus,westeurope,southeastasia
export CosmosDb__ConsistencyLevel=Session
export CosmosDb__EnableAutomaticFailover=true
# Redis (Premium tier with geo-replication)
export ConnectionStrings__Redis="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-prod-eus.vault.azure.net/secrets/RedisConnectionString)"
export Redis__InstanceName=ATPProduction
export Redis__DefaultExpirationMinutes=1440 # 24 hours
export Redis__EnableCluster=true
export Redis__EnableGeoReplication=true
# Service Bus (Premium tier with geo-disaster recovery)
export ConnectionStrings__ServiceBus="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-prod-eus.vault.azure.net/secrets/ServiceBusConnectionString)"
export ServiceBus__QueueName=atp-audit-events-prod
export ServiceBus__TopicName=atp-events-prod
export ServiceBus__EnablePartitioning=true
export ServiceBus__EnableDuplicateDetection=true
export ServiceBus__DuplicateDetectionHistoryTimeWindow=600 # 10 minutes
# Blob Storage (WORM enabled, 7-year retention)
export ConnectionStrings__BlobStorage="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-prod-eus.vault.azure.net/secrets/BlobStorageConnectionString)"
export BlobStorage__ContainerName=atp-audit-attachments-prod
export BlobStorage__EnableImmutability=true
export BlobStorage__ImmutabilityPeriodDays=2555 # 7 years
export BlobStorage__EnableVersioning=true
export BlobStorage__EnableSoftDelete=true
export BlobStorage__SoftDeleteRetentionDays=90
# Audit Configuration (full compliance)
export Audit__EnableImmutability=true
export Audit__EnableTamperEvidence=true
export Audit__RetentionDays=2555 # 7 years
export Audit__EnableWormStorage=true
export Audit__EnableHashChains=true
export Audit__EnableDigitalSignatures=true
export Audit__SigningKeyVaultUri="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-prod-eus.vault.azure.net/secrets/AuditSigningKey)"
# Compliance (all frameworks enabled)
export Compliance__StrictInDevelopment=false
export Compliance__EnableLoggingRedaction=true
export Compliance__Profile=default
export Compliance__EnableGDPR=true
export Compliance__EnableHIPAA=true
export Compliance__EnableSOC2=true
export Compliance__EnablePCIDSS=false # Not applicable to ATP
export Compliance__DataClassificationEnabled=true
# OpenTelemetry (optimized for production)
export OpenTelemetry__ServiceName=atp-ingestion
export OpenTelemetry__ServiceVersion=1.0.123
export OpenTelemetry__ServiceNamespace=atp
export OpenTelemetry__ExporterEndpoint=http://otel-collector-prod.atp.local:4317
export OpenTelemetry__SamplingRatio=0.1 # 10% sampling in production
export OpenTelemetry__ExportIntervalSeconds=60
export OpenTelemetry__EnableMetrics=true
export OpenTelemetry__EnableTracing=true
export OpenTelemetry__EnableLogging=false # Use separate log aggregation
# Application Insights (with advanced features)
export ApplicationInsights__InstrumentationKey="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-prod-eus.vault.azure.net/secrets/AppInsightsKey)"
export ApplicationInsights__EnableAdaptiveSampling=true
export ApplicationInsights__MaxTelemetryItemsPerSecond=100
export ApplicationInsights__InitialSamplingPercentage=10
export ApplicationInsights__MinSamplingPercentage=5
export ApplicationInsights__MaxSamplingPercentage=25
export ApplicationInsights__EnableDependencyTracking=true
export ApplicationInsights__EnablePerformanceCounterCollection=true
# Feature Flags (Azure App Configuration with caching)
export AppConfiguration__Endpoint=https://atp-appconfig-prod-eus.azconfig.io
export AppConfiguration__ManagedIdentityEnabled=true
export AppConfiguration__RefreshIntervalSeconds=60
export AppConfiguration__EnableCaching=true
export AppConfiguration__CacheDurationSeconds=300 # 5 minutes
# JWT Authentication (production keys)
export Authentication__JwtSecret="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-prod-eus.vault.azure.net/secrets/JwtSigningKey)"
export Authentication__JwtIssuer=https://atp.connectsoft.com
export Authentication__JwtAudience=atp-services
export Authentication__JwtExpirationMinutes=60 # 1 hour
export Authentication__EnableRefreshTokens=true
export Authentication__RefreshTokenExpirationDays=7
export Authentication__EnableTokenRotation=true
# API Rate Limiting (strict in production)
export RateLimiting__Enabled=true
export RateLimiting__RequestsPerMinute=500
export RateLimiting__BurstSize=100
export RateLimiting__EnableDistributedRateLimiting=true
export RateLimiting__RedisConnectionString="@Microsoft.KeyVault(SecretUri=https://atp-keyvault-prod-eus.vault.azure.net/secrets/RedisConnectionString)"
# Circuit Breaker (Polly)
export Resilience__CircuitBreaker__Enabled=true
export Resilience__CircuitBreaker__FailureThreshold=0.5 # 50% failure rate
export Resilience__CircuitBreaker__SamplingDurationSeconds=60
export Resilience__CircuitBreaker__MinimumThroughput=10
export Resilience__CircuitBreaker__BreakDurationSeconds=30
# Azure Managed Identity (production)
export AZURE_CLIENT_ID=34567890-3456-3456-3456-345678901234
# Logging (production-optimized)
export Logging__LogLevel__Default=Warning
export Logging__LogLevel__Microsoft=Error
export Logging__LogLevel__System=Error
export Logging__LogLevel__Microsoft_Hosting_Lifetime=Information
# Performance
export ASPNETCORE_FORWARDEDHEADERS_ENABLED=true
export ASPNETCORE_SHUTDOWNTIMEOUTSECONDS=60
# Health Checks
export HealthChecks__Enabled=true
export HealthChecks__Port=8080
export HealthChecks__Path=/health
export HealthChecks__DetailedErrorsEnabled=false # Security: don't expose details
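A missing or empty export in a listing like the one above fails silently until the service crashes at runtime. A small preflight guard in the container entrypoint can fail fast instead. This is a sketch only — the variable names checked here are a subset of the listing above, and the script name is illustrative:

```shell
#!/usr/bin/env bash
# preflight.sh - fail fast when required configuration is missing
# (a shell analogue of the C# startup validator shown later in this section).
set -u

check_required() {
  local var
  local missing=()
  for var in "$@"; do
    # ${!var-} expands the variable whose *name* is stored in $var ("" if unset)
    [ -n "${!var-}" ] || missing+=("$var")
  done
  if [ "${#missing[@]}" -gt 0 ]; then
    printf 'missing: %s\n' "${missing[@]}"
    return 1
  fi
  echo "ok"
}

# Demo: ServiceName is set, ExporterEndpoint is not
export OpenTelemetry__ServiceName=atp-ingestion
unset OpenTelemetry__ExporterEndpoint
check_required OpenTelemetry__ServiceName OpenTelemetry__ExporterEndpoint || true
```

Running the guard before `exec`-ing the application keeps misconfigured pods in a crash loop with a clear message rather than a half-started service.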
Kubernetes ConfigMap & Secret Example¶
Kubernetes Deployment (combining ConfigMap + Secrets):
# k8s-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: atp-ingestion-config
  namespace: atp-prod
data:
  ASPNETCORE_ENVIRONMENT: "Production"
  ASPNETCORE_URLS: "http://+:8080"
  # Audit Configuration
  Audit__EnableImmutability: "true"
  Audit__EnableTamperEvidence: "true"
  Audit__RetentionDays: "2555"
  Audit__EnableWormStorage: "true"
  # OpenTelemetry
  OpenTelemetry__ServiceName: "atp-ingestion"
  OpenTelemetry__ServiceVersion: "1.0.123"
  OpenTelemetry__ExporterEndpoint: "http://otel-collector.atp-prod.svc.cluster.local:4317"
  OpenTelemetry__SamplingRatio: "0.1"
  # Feature Flags
  AppConfiguration__Endpoint: "https://atp-appconfig-prod-eus.azconfig.io"
  AppConfiguration__ManagedIdentityEnabled: "true"
  # Rate Limiting
  RateLimiting__Enabled: "true"
  RateLimiting__RequestsPerMinute: "500"
---
# k8s-secrets.yaml (Key Vault CSI)
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: atp-ingestion-secrets
  namespace: atp-prod
spec:
  provider: azure
  # Sync the mounted objects into a native Kubernetes Secret so the
  # Deployment below can reference them via secretKeyRef. Without this
  # block, the secrets exist only as files in the CSI volume.
  secretObjects:
    - secretName: atp-ingestion-secrets
      type: Opaque
      data:
        - objectName: ConnectionStrings__DefaultConnection
          key: ConnectionStrings__DefaultConnection
        - objectName: ConnectionStrings__Redis
          key: ConnectionStrings__Redis
  parameters:
    usePodIdentity: "false"
    useVMManagedIdentity: "true"
    userAssignedIdentityID: "34567890-3456-3456-3456-345678901234"
    keyvaultName: "atp-keyvault-prod-eus"
    tenantId: "12345678-1234-1234-1234-123456789012"
    objects: |
      array:
        - |
          objectName: SqlConnectionString
          objectType: secret
          objectAlias: ConnectionStrings__DefaultConnection
        - |
          objectName: CosmosDbConnectionString
          objectType: secret
          objectAlias: ConnectionStrings__CosmosDb
        - |
          objectName: RedisConnectionString
          objectType: secret
          objectAlias: ConnectionStrings__Redis
        - |
          objectName: ServiceBusConnectionString
          objectType: secret
          objectAlias: ConnectionStrings__ServiceBus
        - |
          objectName: BlobStorageConnectionString
          objectType: secret
          objectAlias: ConnectionStrings__BlobStorage
        - |
          objectName: AppInsightsKey
          objectType: secret
          objectAlias: ApplicationInsights__InstrumentationKey
        - |
          objectName: JwtSigningKey
          objectType: secret
          objectAlias: Authentication__JwtSecret
---
# k8s-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
  namespace: atp-prod
spec:
  replicas: 3
  selector:
    matchLabels:
      app: atp-ingestion
  template:
    metadata:
      labels:
        app: atp-ingestion
        version: "1.0.123"
    spec:
      serviceAccountName: atp-ingestion-sa
      containers:
        - name: atp-ingestion
          image: connectsoft.azurecr.io/atp/ingestion:1.0.123
          ports:
            - containerPort: 8080
          # Environment variables from ConfigMap
          envFrom:
            - configMapRef:
                name: atp-ingestion-config
          # Secrets from Key Vault CSI (mounting the volume is what triggers
          # the driver to fetch secrets and populate the synced Secret)
          volumeMounts:
            - name: secrets-store
              mountPath: "/mnt/secrets-store"
              readOnly: true
          # Override with secrets (from the Secret synced via secretObjects)
          env:
            - name: ConnectionStrings__DefaultConnection
              valueFrom:
                secretKeyRef:
                  name: atp-ingestion-secrets
                  key: ConnectionStrings__DefaultConnection
            - name: ConnectionStrings__Redis
              valueFrom:
                secretKeyRef:
                  name: atp-ingestion-secrets
                  key: ConnectionStrings__Redis
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
            limits:
              cpu: 2000m
              memory: 2Gi
      volumes:
        - name: secrets-store
          csi:
            driver: secrets-store.csi.k8s.io
            readOnly: true
            volumeAttributes:
              secretProviderClass: atp-ingestion-secrets
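The Deployment above injects secrets via `secretKeyRef`; an alternative pattern is an entrypoint script that exports each CSI-mounted file (named after its `objectAlias`) as an environment variable. The sketch below simulates the mount with a temp directory — the mount path, file name, and connection-string value are all illustrative placeholders:

```shell
#!/usr/bin/env bash
# Alternative to secretKeyRef: export each CSI-mounted secret file as an
# environment variable. In the pod the directory would be /mnt/secrets-store;
# here it is simulated with a temp directory so the sketch is runnable.
set -u

SECRETS_DIR=$(mktemp -d)   # stands in for /mnt/secrets-store
printf 'Server=tcp:example,1433' > "$SECRETS_DIR/ConnectionStrings__DefaultConnection"

for f in "$SECRETS_DIR"/*; do
  # File name becomes the variable name; file content becomes the value
  export "$(basename "$f")=$(cat "$f")"
done

echo "$ConnectionStrings__DefaultConnection"
```

This keeps secret values out of the pod spec entirely, at the cost of a custom entrypoint.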
Docker Compose Example (Dev Environment)¶
Local Development with Docker Compose:
# docker-compose.yml
version: '3.8'

services:
  atp-ingestion:
    build:
      context: .
      dockerfile: src/ConnectSoft.ATP.Ingestion/Dockerfile
    ports:
      - "5000:8080"
    environment:
      ASPNETCORE_ENVIRONMENT: Development
      ASPNETCORE_URLS: http://+:8080
      # Connection strings (Docker service names)
      ConnectionStrings__DefaultConnection: "Server=sqlserver;Database=ATP_Dev;User Id=sa;Password=P@ssw0rd123!;TrustServerCertificate=True"
      ConnectionStrings__Redis: "redis:6379,abortConnect=false"
      # RabbitMQ stands in for Azure Service Bus in local development
      ConnectionStrings__ServiceBus: "Endpoint=sb://rabbitmq:5672"
      # Cosmos DB emulator (well-known public emulator key; the "cosmos"
      # service itself is not shown in this excerpt)
      ConnectionStrings__CosmosDb: "AccountEndpoint=http://cosmos:8081/;AccountKey=C2y6yDjf5/R+ob0N8A7Cgv30VRDJIWEHLM+4QDU5DE2nQ9nDuVTqobD4b8mGGyPMbIZnqyMsEcaGQy67XIw/Jw=="
      # Audit
      Audit__EnableImmutability: "false"
      Audit__RetentionDays: "30"
      # OpenTelemetry
      OpenTelemetry__ServiceName: atp-ingestion
      OpenTelemetry__ExporterEndpoint: http://otel-collector:4317
      OpenTelemetry__SamplingRatio: "1.0"
      # Feature Flags
      FeatureManagement__TamperEvidenceV2: "true"
    depends_on:
      - sqlserver
      - redis
      - rabbitmq
      - otel-collector
      - seq

  sqlserver:
    image: mcr.microsoft.com/mssql/server:2022-latest
    environment:
      ACCEPT_EULA: "Y"
      SA_PASSWORD: "P@ssw0rd123!"
      MSSQL_PID: Developer
    ports:
      - "1433:1433"

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"

  rabbitmq:
    image: rabbitmq:3-management-alpine
    environment:
      RABBITMQ_DEFAULT_USER: guest
      RABBITMQ_DEFAULT_PASS: guest
    ports:
      - "5672:5672"
      - "15672:15672"

  otel-collector:
    image: otel/opentelemetry-collector:0.97.0
    command: ["--config=/etc/otel/config.yaml"]
    volumes:
      - ./otel-config.yaml:/etc/otel/config.yaml
    ports:
      - "4317:4317"
      - "8888:8888"

  seq:
    image: datalust/seq:latest
    environment:
      ACCEPT_EULA: "Y"
    ports:
      - "5341:80"
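Note that `depends_on` in the Compose file above only orders container startup; it does not wait for SQL Server or RabbitMQ to actually accept connections. A small TCP poll can bridge that gap before running migrations or integration tests. This sketch uses bash's `/dev/tcp` redirection; host, port, and timeout values are illustrative:

```shell
#!/usr/bin/env bash
# wait-for.sh - poll a TCP endpoint until it accepts connections or a
# timeout elapses. Compose's depends_on orders startup only; it does not
# guarantee readiness.
wait_for() {
  local host=$1 port=$2 timeout=${3:-30} start
  start=$(date +%s)
  # Opening /dev/tcp/<host>/<port> in a subshell attempts a TCP connection
  until (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; do
    if (( $(date +%s) - start >= timeout )); then
      echo "timed out waiting for $host:$port" >&2
      return 1
    fi
    sleep 1
  done
  echo "$host:$port is ready"
}

# Example usage (after "docker compose up -d"):
# wait_for localhost 1433 60 && dotnet ef database update
```

For production-grade readiness, container healthchecks plus `depends_on: condition: service_healthy` are the more idiomatic Compose mechanism; the script above is the quick local-dev variant.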
Environment Variable Validation¶
C# Startup Validation (ensures required variables are present):
// Validate environment variables at startup
public static class EnvironmentVariableValidator
{
    public static void ValidateRequiredVariables(IConfiguration configuration, ILogger logger)
    {
        var requiredVariables = new Dictionary<string, string>
        {
            ["ASPNETCORE_ENVIRONMENT"] = "Runtime environment",
            ["ConnectionStrings:DefaultConnection"] = "Primary database connection",
            ["ConnectionStrings:Redis"] = "Redis cache connection",
            ["OpenTelemetry:ServiceName"] = "Service name for telemetry",
            ["OpenTelemetry:ExporterEndpoint"] = "OTEL collector endpoint"
        };

        var missingVariables = new List<string>();
        foreach (var (key, description) in requiredVariables)
        {
            // The environment variable provider already normalizes "__" to ":",
            // so configuration keys are queried with ":" directly.
            var value = configuration[key];
            if (string.IsNullOrWhiteSpace(value))
            {
                missingVariables.Add($"{key} ({description})");
                logger.LogError("Missing required environment variable: {Key} ({Description})", key, description);
            }
        }

        if (missingVariables.Any())
        {
            throw new InvalidOperationException(
                $"Missing required environment variables:\n - {string.Join("\n - ", missingVariables)}");
        }

        logger.LogInformation("✅ All required environment variables validated successfully");
    }
}

// In Program.cs
var builder = WebApplication.CreateBuilder(args);

// Validate before building the app. A bootstrap logger is used here because
// the host's logging pipeline is not available until Build() is called.
using var bootstrapLoggerFactory = LoggerFactory.Create(logging => logging.AddConsole());
EnvironmentVariableValidator.ValidateRequiredVariables(
    builder.Configuration,
    bootstrapLoggerFactory.CreateLogger("Startup"));
Summary¶
- Base Variables: Common variables required across all environments (runtime, logging, OTel, health checks).
- Dev Environment: Local development with localhost services, 100% sampling, relaxed security, all features enabled.
- Test Environment: Azure-hosted services, Key Vault references, 50% sampling, production-like configuration without immutability.
- Staging Environment: Full production simulation, Key Vault secrets, 25% sampling, WORM storage enabled, multi-region support.
- Production Environment: Full compliance controls, 10% sampling, WORM storage with 7-year retention, multi-region failover, circuit breakers, strict rate limiting.
- Kubernetes Deployment: ConfigMap for non-sensitive config, Key Vault CSI for secrets, workload identity integration.
- Docker Compose: Local development with service containers (SQL, Redis, RabbitMQ, OTel, Seq).
- Validation: C# startup validation ensures all required variables are present before application starts.
Appendix B — IaC Overlay Rendering Examples¶
This appendix demonstrates practical IaC deployment workflows using Pulumi (C#) and Bicep for rendering environment-specific infrastructure overlays.
Purpose¶
- Illustrate the complete workflow for deploying infrastructure per environment
- Provide command-line examples for Pulumi stack management
- Demonstrate environment-specific configuration rendering
- Show validation and preview workflows before deployment
Pulumi Workflow (C#)¶
Initial Setup (One-Time)¶
Step 1: Install Pulumi CLI
# Install Pulumi CLI (macOS/Linux)
curl -fsSL https://get.pulumi.com | sh
# Or via Homebrew (macOS)
brew install pulumi/tap/pulumi
# Or via Chocolatey (Windows)
choco install pulumi
# Verify installation
pulumi version
# Output: v3.100.0
Step 2: Login to Pulumi Backend
# Azure Blob Storage backend (recommended for team use)
pulumi login "azblob://atp-pulumi-state?storage_account=atppulumistate"
# Or use Pulumi Cloud (SaaS)
pulumi login
# Or use the local file system (dev only; state is stored under ~/.pulumi)
pulumi login --local
Step 3: Initialize Pulumi Project
cd infrastructure/
# Create new Pulumi project (if not exists)
pulumi new azure-csharp --name atp-infrastructure --description "ATP Infrastructure as Code"
# Restore .NET dependencies
dotnet restore
Deploy Dev Environment¶
Step 1: Create/Select Dev Stack
#!/bin/bash
# deploy-dev.sh
set -e # Exit on error
echo "Deploying ATP Dev environment..."
# Select (or create) dev stack
pulumi stack select dev --create
# Configure dev-specific settings
pulumi config set azure-native:location eastus
pulumi config set environment dev
pulumi config set region eus
pulumi config set sku B1 # Basic tier for dev
pulumi config set autoscale false
pulumi config set instances 1
pulumi config set enableMultiRegion false
pulumi config set enablePrivateEndpoints false
pulumi config set costCenter Engineering
pulumi config set owner platform-team@connectsoft.example
# Set secrets (encrypted)
pulumi config set --secret sqlAdminPassword "DevP@ssw0rd123!"
pulumi config set --secret cosmosDbKey "dev-cosmos-key-placeholder"
echo "✅ Dev stack configured"
Step 2: Preview Changes
# Preview infrastructure changes (dry-run)
pulumi preview --diff
# Expected output:
# Previewing update (dev)
#
# Type Name Plan
# + pulumi:pulumi:Stack atp-infrastructure-dev create
# + ├─ azure-native:resources:ResourceGroup atp-dev-eus-rg create
# + ├─ azure-native:web:AppServicePlan atp-plan-dev-eus create
# + ├─ azure-native:web:WebApp atp-ingestion-dev-eus create
# + ├─ azure-native:sql:Server atp-sql-dev-eus create
# + ├─ azure-native:sql:Database atp-sql-db-dev-eus create
# + ├─ azure-native:cache:Redis atp-redis-dev-eus create
# + ├─ azure-native:storage:StorageAccount atpstoragedeveus create
# + └─ azure-native:keyvault:Vault atp-keyvault-dev-eus create
#
# Resources:
# + 9 to create
#
# ✅ Preview complete (no actual changes made)
Step 3: Deploy Infrastructure
# Deploy infrastructure (with auto-approval)
pulumi up --yes --skip-preview
# Or interactive (requires manual approval)
pulumi up
# Expected output:
# Updating (dev)
#
# Type Name Status
# + pulumi:pulumi:Stack atp-infrastructure-dev created
# + ├─ azure-native:resources:ResourceGroup atp-dev-eus-rg created
# + ├─ azure-native:web:AppServicePlan atp-plan-dev-eus created (45s)
# + ├─ azure-native:web:WebApp atp-ingestion-dev-eus created (1m30s)
# + ├─ azure-native:sql:Server atp-sql-dev-eus created (2m15s)
# + ├─ azure-native:sql:Database atp-sql-db-dev-eus created (3m0s)
# + ├─ azure-native:cache:Redis atp-redis-dev-eus created (10m0s)
# + ├─ azure-native:storage:StorageAccount atpstoragedeveus created (1m0s)
# + └─ azure-native:keyvault:Vault atp-keyvault-dev-eus created (30s)
#
# Outputs:
# appServiceUrl: "https://atp-ingestion-dev-eus.azurewebsites.net"
# sqlServerFqdn: "atp-sql-dev-eus.database.windows.net"
# redisHostName: "atp-redis-dev-eus.redis.cache.windows.net"
# keyVaultUri: "https://atp-keyvault-dev-eus.vault.azure.net/"
#
# Resources:
# + 9 created
#
# Duration: 12m30s
#
# ✅ Dev environment deployed successfully
Step 4: Export Stack Outputs
# Export stack outputs as JSON
pulumi stack output --json > dev-outputs.json
# Or export specific output
SQL_SERVER=$(pulumi stack output sqlServerFqdn)
echo "SQL Server: $SQL_SERVER"
# Or export as environment variables (for CI/CD)
pulumi stack output --json | jq -r 'to_entries | .[] | "export \(.key)=\"\(.value)\""' > dev-env.sh
source dev-env.sh
Deploy Production Environment¶
Step 1: Create/Select Prod Stack
#!/bin/bash
# deploy-prod.sh
set -e
echo "Deploying ATP Production environment..."
# Select (or create) prod stack
pulumi stack select prod --create
# Configure production-specific settings
pulumi config set azure-native:location eastus
pulumi config set environment prod
pulumi config set region eus
pulumi config set sku P1v3 # Premium tier for production
pulumi config set autoscale true
pulumi config set minInstances 3
pulumi config set maxInstances 10
pulumi config set enableMultiRegion true
pulumi config set enablePrivateEndpoints true
pulumi config set enableZoneRedundancy true
pulumi config set enableGeoReplication true
pulumi config set costCenter Production
pulumi config set owner platform-team@connectsoft.example
# Set secrets (from Key Vault or secure input)
read -sp "Enter SQL Admin Password: " SQL_PASSWORD
pulumi config set --secret sqlAdminPassword "$SQL_PASSWORD"
# Or fetch from existing Key Vault
COSMOS_KEY=$(az keyvault secret show --vault-name atp-keyvault-bootstrap --name CosmosDbKey --query value -o tsv)
pulumi config set --secret cosmosDbKey "$COSMOS_KEY"
echo "✅ Production stack configured"
Step 2: Preview Changes (with Cost Estimation)
# Preview with detailed diff
pulumi preview --diff --show-config --show-replacement-steps
# Export preview as JSON for approval workflow
pulumi preview --json > prod-preview.json
# Cost estimation: the Pulumi CLI itself has no cost flag; use Pulumi Cloud's
# cost features, or run a third-party estimator against the exported preview JSON
# Illustrative output (cost annotations supplied by an external estimator):
# Previewing update (prod)
#
# Type Name Plan Info
# + pulumi:pulumi:Stack atp-infrastructure-prod create
# + ├─ azure-native:resources:ResourceGroup atp-prod-eus-rg create
# + ├─ azure-native:web:AppServicePlan atp-plan-prod-eus create ~$300/mo
# + ├─ azure-native:web:WebApp atp-ingestion-prod-eus create
# + ├─ azure-native:sql:Server atp-sql-prod-eus create
# + ├─ azure-native:sql:Database atp-sql-db-prod-eus create ~$1,200/mo (P2 tier)
# + ├─ azure-native:cache:Redis atp-redis-prod-eus create ~$600/mo (P1 Premium)
# + ├─ azure-native:storage:StorageAccount atpstorageprodeus create ~$100/mo (WORM enabled)
# + ├─ azure-native:keyvault:Vault atp-keyvault-prod-eus create ~$10/mo
# + ├─ azure-native:network:VirtualNetwork atp-vnet-prod-eus create
# + ├─ azure-native:network:PrivateEndpoint sql-private-endpoint create ~$15/mo
# + └─ azure-native:frontdoor:FrontDoor atp-frontdoor-prod create ~$50/mo
#
# Estimated Monthly Cost: ~$2,275
#
# Resources:
# + 12 to create
#
# ⚠️ This will create production infrastructure. Review carefully before proceeding.
Step 3: Deploy with Approval Gate
# Production deployment (manual approval required)
pulumi up
# You will be prompted:
# Do you want to perform this update? [yes/no]: yes
# Or use --yes for automation (only in CI/CD with approvals)
pulumi up --yes
# Expected output:
# Updating (prod)
#
# Type Name Status
# + pulumi:pulumi:Stack atp-infrastructure-prod created
# + ├─ azure-native:resources:ResourceGroup atp-prod-eus-rg created
# + ├─ azure-native:web:AppServicePlan atp-plan-prod-eus created (1m0s)
# + ├─ azure-native:web:WebApp atp-ingestion-prod-eus created (2m0s)
# + ├─ azure-native:sql:Server atp-sql-prod-eus created (3m0s)
# + ├─ azure-native:sql:Database atp-sql-db-prod-eus created (5m0s)
# + ├─ azure-native:cache:Redis atp-redis-prod-eus created (15m0s)
# + ├─ azure-native:storage:StorageAccount atpstorageprodeus created (2m0s)
# + ├─ azure-native:keyvault:Vault atp-keyvault-prod-eus created (1m0s)
# + ├─ azure-native:network:VirtualNetwork atp-vnet-prod-eus created (1m30s)
# + ├─ azure-native:network:PrivateEndpoint sql-private-endpoint created (2m30s)
# + └─ azure-native:frontdoor:FrontDoor atp-frontdoor-prod created (5m0s)
#
# Outputs:
# appServiceUrl: "https://atp-ingestion-prod-eus.azurewebsites.net"
# sqlServerFqdn: "atp-sql-prod-eus.database.windows.net"
# redisHostName: "atp-redis-prod-eus.redis.cache.windows.net"
# keyVaultUri: "https://atp-keyvault-prod-eus.vault.azure.net/"
# frontDoorEndpoint: "https://atp.connectsoft.com"
#
# Resources:
# + 12 created
#
# Duration: 25m15s
#
# ✅ Production environment deployed successfully
Update Existing Stack¶
Scenario: Upgrade Redis from Basic to Premium
# Select prod stack
pulumi stack select prod
# Update configuration
pulumi config set redisSku Premium
pulumi config set redisCapacity P1
# Preview changes
pulumi preview --diff
# Expected output:
# Previewing update (prod)
#
# Type Name Plan Info
# ~ azure-native:cache:Redis atp-redis-prod-eus update [diff: ~sku]
#
# Resources:
# ~ 1 to update
# 11 unchanged
#
# ⚠️ Redis will be updated in-place (may cause brief downtime)
# Deploy update
pulumi up --yes
# ✅ Redis upgraded to Premium tier
Destroy Environment¶
Safely destroy non-production environment:
# Select dev stack
pulumi stack select dev
# Destroy infrastructure: "pulumi destroy" first shows a preview of the
# resources to be deleted, then prompts for confirmation
pulumi destroy
# You will be prompted:
# Do you want to perform this destroy? [yes/no]: yes
# Expected output:
# Destroying (dev)
#
# Type Name Status
# - pulumi:pulumi:Stack atp-infrastructure-dev deleted
# - ├─ azure-native:keyvault:Vault atp-keyvault-dev-eus deleted (30s)
# - ├─ azure-native:storage:StorageAccount atpstoragedeveus deleted (45s)
# - ├─ azure-native:cache:Redis atp-redis-dev-eus deleted (5m0s)
# - ├─ azure-native:sql:Database atp-sql-db-dev-eus deleted (1m0s)
# - ├─ azure-native:sql:Server atp-sql-dev-eus deleted (30s)
# - ├─ azure-native:web:WebApp atp-ingestion-dev-eus deleted (30s)
# - ├─ azure-native:web:AppServicePlan atp-plan-dev-eus deleted (15s)
# - └─ azure-native:resources:ResourceGroup atp-dev-eus-rg deleted (2m0s)
#
# Resources:
# - 9 deleted
#
# Duration: 8m45s
#
# ✅ Dev environment destroyed
# Remove stack (optional)
pulumi stack rm dev --yes
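Because `pulumi destroy` works identically against every stack, it is worth wrapping it in a guard so that staging and production can never be destroyed by a mistyped stack name. A minimal sketch — the protected stack names and the dry-run echo are illustrative:

```shell
#!/usr/bin/env bash
# destroy-guard.sh - refuse to destroy protected stacks
destroy_stack() {
  local stack=$1
  case "$stack" in
    prod|staging)
      echo "refusing to destroy protected stack: $stack" >&2
      return 1
      ;;
  esac
  echo "would run: pulumi destroy --stack $stack"
  # pulumi destroy --stack "$stack" --yes   # uncomment for real use
}

destroy_stack dev
```

As a second layer of defense, Pulumi resources can be created with the `protect: true` resource option, which blocks deletion at the engine level regardless of how destroy is invoked.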
Bicep Workflow (Alternative)¶
Deploy Dev Environment with Bicep:
#!/bin/bash
# deploy-dev-bicep.sh
set -e
echo "Deploying ATP Dev environment with Bicep..."
# Variables
ENVIRONMENT=dev
REGION=eastus
RESOURCE_GROUP=atp-dev-eus-rg
# Create resource group
az group create \
--name $RESOURCE_GROUP \
--location $REGION \
--tags Environment=$ENVIRONMENT CostCenter=Engineering
# Validate Bicep template
az deployment group validate \
--resource-group $RESOURCE_GROUP \
--template-file main.bicep \
--parameters @parameters.dev.json
# Preview changes (What-If)
az deployment group what-if \
--resource-group $RESOURCE_GROUP \
--template-file main.bicep \
--parameters @parameters.dev.json
# Deploy infrastructure (capture the deployment name once so the same name
# can be used to read the outputs afterwards)
DEPLOYMENT_NAME="atp-dev-deployment-$(date +%Y%m%d-%H%M%S)"
az deployment group create \
  --name "$DEPLOYMENT_NAME" \
  --resource-group $RESOURCE_GROUP \
  --template-file main.bicep \
  --parameters @parameters.dev.json \
  --verbose
# Export outputs
az deployment group show \
  --name "$DEPLOYMENT_NAME" \
  --resource-group $RESOURCE_GROUP \
  --query properties.outputs \
  --output json > dev-outputs.json
echo "✅ Dev environment deployed with Bicep"
Bicep Parameters File (parameters.dev.json):
{
  "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentParameters.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "environment": {
      "value": "dev"
    },
    "region": {
      "value": "eus"
    },
    "appServicePlanSku": {
      "value": "B1"
    },
    "sqlDatabaseTier": {
      "value": "Basic"
    },
    "redisSku": {
      "value": "Basic"
    },
    "enableAutoscale": {
      "value": false
    },
    "enablePrivateEndpoints": {
      "value": false
    },
    "sqlAdminUsername": {
      "value": "sqladmin"
    },
    "sqlAdminPassword": {
      "reference": {
        "keyVault": {
          "id": "/subscriptions/{subscription-id}/resourceGroups/atp-bootstrap-rg/providers/Microsoft.KeyVault/vaults/atp-keyvault-bootstrap"
        },
        "secretName": "SqlAdminPassword-Dev"
      }
    }
  }
}
CI/CD Integration (Azure Pipelines)¶
Automated Pulumi Deployment:
# pulumi-deploy-pipeline.yml
trigger:
  branches:
    include:
      - main
  paths:
    include:
      - infrastructure/**

pool:
  vmImage: 'ubuntu-latest'

variables:
  - group: Pulumi-Secrets # Azure DevOps variable group

stages:
  - stage: Deploy_Dev
    displayName: 'Deploy to Dev'
    jobs:
      - job: Pulumi_Up_Dev
        steps:
          - task: UseDotNet@2
            inputs:
              version: '8.x'
          - script: |
              curl -fsSL https://get.pulumi.com | sh
              export PATH=$PATH:$HOME/.pulumi/bin
              pulumi version
            displayName: 'Install Pulumi CLI'
          - script: |
              cd infrastructure/
              pulumi login "azblob://atp-pulumi-state?storage_account=atppulumistate"
              pulumi stack select dev --create
              pulumi config set azure-native:location eastus
              pulumi config set environment dev
              pulumi up --yes --skip-preview
            displayName: 'Deploy Dev Infrastructure'
            env:
              PULUMI_ACCESS_TOKEN: $(PulumiAccessToken)
              AZURE_STORAGE_ACCOUNT: atppulumistate
              AZURE_STORAGE_KEY: $(AzureStorageKey)

  - stage: Deploy_Prod
    displayName: 'Deploy to Production'
    dependsOn: Deploy_Dev
    condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
    jobs:
      - deployment: Pulumi_Up_Prod
        environment: ATP-Production # Requires manual approval
        strategy:
          runOnce:
            deploy:
              steps:
                - script: |
                    cd infrastructure/
                    pulumi login "azblob://atp-pulumi-state?storage_account=atppulumistate"
                    pulumi stack select prod
                    pulumi up --yes
                  displayName: 'Deploy Production Infrastructure'
                  env:
                    PULUMI_ACCESS_TOKEN: $(PulumiAccessToken)
                    AZURE_STORAGE_ACCOUNT: atppulumistate
                    AZURE_STORAGE_KEY: $(AzureStorageKey)
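A useful hardening step for the production stage is to gate on the exported preview: if the plan contains deletes, fail the run and require human review instead of auto-applying. The sketch below greps the JSON produced by `pulumi preview --json`; the `"op"` field name reflects the JSON plan format and should be verified against your Pulumi version, and the demo file is synthetic:

```shell
#!/usr/bin/env bash
# check-preview.sh - fail if a Pulumi preview would delete resources.
check_no_deletes() {
  local preview_file=$1
  # The JSON plan lists steps with an "op" field ("create", "update",
  # "delete", ...). Verify the field name against your Pulumi version.
  if grep -q '"op": *"delete"' "$preview_file"; then
    echo "preview contains deletes - manual review required" >&2
    return 1
  fi
  echo "no deletes in preview"
}

# Demo with a synthetic preview file
preview=$(mktemp)
printf '{"steps":[{"op": "create"},{"op": "update"}]}' > "$preview"
check_no_deletes "$preview"
```

Inserted as a script step between `pulumi preview --json > prod-preview.json` and `pulumi up`, a non-zero exit fails the stage before any change is applied.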
Summary¶
- Pulumi Workflow: Stack creation, configuration, preview (dry-run), deployment, output export, updates, and destruction.
- Dev Deployment: Basic SKU (B1), single instance, no autoscale, no private endpoints, ~12 minutes deployment (Redis provisioning dominates).
- Production Deployment: Premium SKU (P1v3), 3-10 instances, autoscale, private endpoints, multi-region, ~25 minutes deployment.
- Update Workflow: Configuration changes, preview diffs, in-place updates with minimal downtime.
- Bicep Alternative: Azure CLI deployment with What-If preview, parameter files per environment, Key Vault parameter references.
- CI/CD Integration: Azure Pipelines with automated dev deployment, manual approval for production, Pulumi state in Azure Blob.
- Cost Estimation: per-resource monthly cost estimates can be attached to previews via Pulumi Cloud or an external estimator (~$2,275/month for the production example).
- Safety: Preview before deploy, What-If analysis, manual approval gates for production, destroy confirmation.
Appendix C — Cross-Reference Map¶
This appendix provides comprehensive cross-references to related ATP documentation, enabling seamless navigation across architecture, operations, compliance, and implementation domains.
Purpose¶
- Link environment-specific topics to their primary documentation sources
- Provide context for where specific concerns are addressed in depth
- Facilitate cross-functional collaboration by mapping responsibilities
- Ensure consistency across documentation domains
Cross-Reference Table¶
| Topic | Primary Document | Section | Notes |
|---|---|---|---|
| CI/CD & Pipelines | |||
| Azure Pipelines architecture | azure-pipelines.md | Pipeline Architecture Overview | Build/deploy automation per environment, template usage |
| Quality gates & thresholds | azure-pipelines.md | Quality Gates & Policies | Code coverage, security scans, test pass rates per environment |
| Deployment strategies | azure-pipelines.md | Multi-Environment Deployment | Rolling, blue-green, canary deployments per environment |
| Pipeline observability | azure-pipelines.md | Pipeline Observability & Metrics | Build duration, success rate, DORA metrics |
| Architecture & Design | |||
| Azure deployment topology | ../architecture/deployment-views.md | Deployment Views | Azure topology, regions, failure domains, resource distribution |
| High-level architecture | ../architecture/hld.md | High-Level Design | System components, service interactions, data flows |
| Microservice patterns | ../architecture/microservice-architecture.md | Microservice Architecture | Service boundaries, communication patterns, resilience |
| Data architecture | ../architecture/data-architecture.md | Data Architecture | Database design, CQRS, event sourcing, data residency |
| Platform & Infrastructure | |||
| Security controls | ../platform/security-compliance.md | Security Compliance | Environment-specific security controls, zero-trust, encryption |
| Compliance frameworks | ../platform/security-compliance.md | Compliance Attestation | SOC 2, GDPR, HIPAA evidence collection per environment |
| Data residency | ../platform/data-residency-retention.md | Data Residency & Retention | Geographic data storage, retention policies, GDPR compliance |
| Tenant isolation | ../platform/multitenancy-tenancy.md | Multi-Tenancy | Tenant isolation strategies, data separation per environment |
| Networking & VNets | ../platform/networking.md | Networking | VNet topology, NSG rules, private endpoints, Azure Firewall |
| Operations | |||
| Observability strategy | ../operations/observability.md | Observability | Telemetry levels, log aggregation, distributed tracing per environment |
| Monitoring & alerts | ../operations/monitoring-alerts.md | Monitoring & Alerts | Alert thresholds, escalation policies, incident response per environment |
| Backup & restore | ../operations/backups-restore-ediscovery.md | Backups & Restore | DR procedures, RPO/RTO targets, restore testing per environment |
| Incident management | ../operations/runbook.md | Runbooks | Incident response procedures, escalation paths, communication templates |
| Hardening & Security | |||
| Zero-trust architecture | ../hardening/zero-trust.md | Zero Trust | Network microsegmentation, workload identity, least privilege per environment |
| Key rotation | ../hardening/key-rotation.md | Key Rotation | Automated secret rotation, Key Vault integration, rotation cadence per environment |
| Tamper evidence | ../hardening/tamper-evidence.md | Tamper Evidence | Hash chains, digital signatures, immutability enforcement per environment |
| Chaos engineering | ../hardening/chaos-drills.md | Chaos Drills | Failure injection, resilience testing, DR drill procedures per environment |
| Domain & Contracts | |||
| REST API specifications | ../domain/contracts/rest-apis.md | REST APIs | API versioning, breaking change management, environment-specific endpoints |
| Message schemas | ../domain/contracts/message-schemas.md | Message Schemas | Event formats, schema evolution, backward compatibility per environment |
| Webhook integration | ../domain/contracts/webhooks.md | Webhooks | Webhook endpoints, retry policies, signature validation per environment |
| Idempotency patterns | ../domain/contracts/shared/idempotency.md | Idempotency | Idempotency keys, duplicate detection, consistency guarantees |
| Infrastructure | |||
| Pulumi IaC | ../infrastructure/pulumi.md | Pulumi | C# infrastructure code, stack management, environment overlays |
| Database migrations | ../infrastructure/database-migrations.md | Database Migrations | Schema versioning, migration strategies, rollback procedures per environment |
| Container orchestration | ../infrastructure/kubernetes.md | Kubernetes | AKS configuration, Helm charts, namespace isolation per environment |
| Testing & Quality | |||
| Test strategy | ../testing/strategy.md | Test Strategy | Test pyramid, coverage targets, test automation per environment |
| Load testing | ../testing/load-testing.md | Load Testing | Performance benchmarks, stress testing, capacity planning per environment |
| Security testing | ../testing/security-testing.md | Security Testing | SAST, DAST, penetration testing, vulnerability management per environment |
Environment-Specific Cross-References¶
Dev Environment¶
| Concern | Primary Document | Key Details |
|---|---|---|
| Local development setup | ../guides/development-setup.md | Docker Compose, service containers, dev tooling |
| Debugging | ../development/debugging.md | Local debugging, remote debugging, log analysis |
| Feature flags | Feature Flags & Runtime Configuration | All features enabled, experimental flags on |
Test Environment¶
| Concern | Primary Document | Key Details |
|---|---|---|
| Integration testing | ../testing/integration-testing.md | Service-to-service tests, contract validation |
| Test data management | Data Management Per Environment | Synthetic data, stable fixtures, 90-day retention |
Staging Environment¶
| Concern | Primary Document | Key Details |
|---|---|---|
| Pre-production validation | ../testing/staging-validation.md | Load tests, chaos tests, full regression suite |
| Blue-green deployments | azure-pipelines.md | Slot swaps, validation gates, rollback procedures |
Production Environment¶
| Concern | Primary Document | Key Details |
|---|---|---|
| Canary deployments | azure-pipelines.md | 10%→25%→50%→100% rollout, automated metrics validation |
| Incident response | ../operations/runbook.md | On-call procedures, escalation paths, post-mortems |
| Compliance auditing | ../platform/security-compliance.md | SOC 2, GDPR, HIPAA audit evidence collection |
Responsibility Matrix (RACI)¶
Environment Management Ownership:
| Activity | Platform Team | Security Team | SRE Team | Development Team |
|---|---|---|---|---|
| Environment provisioning | R, A | C | C | I |
| Configuration management | R, A | C | C | I |
| Secrets management | C | R, A | C | I |
| Cost optimization | R, A | I | C | I |
| Deployment approvals (Staging) | A | C | R | C |
| Deployment approvals (Production) | A | C | R | I |
| DR testing | C | I | R, A | I |
| Compliance audits | C | R, A | C | I |
| Incident response | C | C | R, A | C |
Legend: R = Responsible (does the work), A = Accountable (final approval), C = Consulted, I = Informed
Related ADRs (Architecture Decision Records)¶
| ADR | Title | Environment Impact |
|---|---|---|
| ADR-001 | Multi-Environment Strategy | Defines 6-tier environment topology (Preview, Dev, Test, Staging, Prod, Hotfix) |
| ADR-002 | Pulumi for Infrastructure as Code | C# Pulumi chosen over Bicep/Terraform for type safety and .NET ecosystem alignment |
| ADR-003 | Azure App Configuration for Feature Flags | Centralized feature flag management with environment-specific targeting |
| ADR-004 | Key Vault Per Environment | Separate Key Vaults for isolation and blast radius containment |
| ADR-005 | WORM Storage for Production | Immutable audit logs with 7-year retention for regulatory compliance |
| ADR-006 | Multi-Region Active-Active Topology | Production traffic split 80/20 across East US and West Europe |
| ADR-007 | Automated Canary Deployments | Phased rollout with automated metrics validation and rollback |
Document Hierarchy¶
```
docs/
├── architecture/
│ ├── hld.md # System overview (referenced for context)
│ ├── deployment-views.md # Azure topology (referenced for resource naming)
│ └── data-architecture.md # Data flows (referenced for data management)
│
├── ci-cd/
│ ├── azure-pipelines.md # CI/CD automation (referenced for deployment workflows)
│ ├── environments.md # ← YOU ARE HERE
│ └── quality-gates.md # Test thresholds (referenced for validation criteria)
│
├── platform/
│ ├── security-compliance.md # Security controls (referenced for per-environment policies)
│ ├── data-residency-retention.md # Data residency (referenced for retention policies)
│ └── multitenancy-tenancy.md # Tenant isolation (referenced for staging/prod isolation)
│
├── operations/
│ ├── observability.md # Telemetry (referenced for logging/tracing levels)
│ ├── backups-restore-ediscovery.md # DR procedures (referenced for RPO/RTO targets)
│ └── multitenancy-tenancy.md # Alerts (referenced for health monitoring)
│
├── hardening/
│ ├── zero-trust.md # Network security (referenced for VNet isolation)
│ ├── key-rotation.md # Secret rotation (referenced for Key Vault automation)
│ ├── tamper-evidence.md # Immutability (referenced for WORM storage)
│ └── chaos-drills.md # Resilience testing (referenced for DR drills)
│
└── infrastructure/
    ├── pulumi.md # IaC (referenced for overlay examples)
    └── database-migrations.md # Schema changes (referenced for migration workflows)
```
Quick Reference: Key Metrics by Environment¶
| Metric | Dev | Test | Staging | Production |
|---|---|---|---|---|
| Uptime SLA | 95% | 98% | 99.5% | 99.9% |
| RPO | 24h | 12h | 1h | 15min |
| RTO | 4h | 2h | 1h | 30min |
| Log Retention | 7 days | 14 days | 30 days | 90 days (hot) + 7 years (cold) |
| Trace Sampling | 100% | 50% | 25% | 10% |
| Monthly Budget | $500 | $1,000 | $3,000 | $10,000 |
| Approval SLA | None | None | 4 hours (1 approver) | 24 hours (2 approvers + CAB) |
| Change Frequency | Multiple/day | 1-2/day | 1-2/week | 1-2/month |
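Per-environment values like the trace sampling rates above typically feed configuration directly. A minimal sketch of deterministic head-based sampling keyed by environment follows; the lookup table mirrors the metrics table, while the function name and hashing scheme are illustrative assumptions, not ATP's actual implementation.

```python
import hashlib

# Trace sampling rates from the metrics table above, keyed by environment.
SAMPLING = {"Dev": 1.00, "Test": 0.50, "Staging": 0.25, "Production": 0.10}

def should_sample(env, trace_id):
    """Deterministic head-based sampling: hash the trace id into [0, 1)
    and keep the trace if it falls below the environment's rate."""
    rate = SAMPLING[env]
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate

# Dev samples everything; Production keeps roughly 1 in 10 traces.
print(should_sample("Dev", "trace-42"))  # True (rate is 100%)
```

Hashing the trace id (rather than sampling randomly per span) keeps the decision consistent for every span in the same trace, which matters when traces cross service boundaries.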
Summary¶
- Comprehensive Cross-References: 40+ links to related ATP documentation across the architecture, operations, compliance, hardening, and domain areas.
- Environment-Specific Guidance: Dev (local setup, debugging), Test (integration testing, test data), Staging (pre-production validation, blue-green), Production (canary, incident response, compliance).
- Responsibility Matrix (RACI): Clear ownership for environment provisioning, configuration, secrets, deployments, DR, compliance, and incident response.
- Related ADRs: 7 architecture decision records defining multi-environment strategy, Pulumi choice, App Configuration, Key Vault isolation, WORM storage, multi-region, and canary deployments.
- Document Hierarchy: Visual map showing the position of environments.md within the broader ATP documentation structure.
- Quick Reference Metrics: Side-by-side comparison of uptime SLA, RPO/RTO, log retention, trace sampling, budget, approval SLA, and change frequency per environment.