Cost Optimization & FinOps¶
Purpose & Scope¶
Purpose: Comprehensive guide for ATP's cost optimization strategies and FinOps practices, ensuring cost efficiency across all environments while maintaining operational requirements, performance, and compliance.
Scope: This document covers:
- FinOps Principles: Cost transparency, optimization, governance, and culture
- Environment Cost Profiles: Detailed cost breakdowns for Dev, Test, Staging, Production, Hotfix, and Preview environments
- Cost Optimization Strategies: Automated shutdown schedules, reserved instances, spot instances, storage lifecycle policies, autoscaling
- Right-Sizing Recommendations: AKS node pools, SQL databases, storage tiers, compute resources
- Cost Monitoring & Budgets: Azure Cost Management, budget alerts, anomaly detection, cost dashboards
- Cost Allocation: Per-tenant cost tracking, cost attribution, showback/chargeback
- Cost Governance: Approval workflows, cost thresholds, policy enforcement
- Storage Optimization: Hot/Warm/Cold tiering, lifecycle transitions, compression, deduplication
Audience: Platform operators, FinOps teams, engineering leads, finance teams, product managers
Relationship to Other Documents:
- Infrastructure: See ../infrastructure/kubernetes.md for AKS sizing and node pool strategies
- Environments: See ../ci-cd/environments.md for environment-specific configurations
- Storage: See ../platform/data-residency-retention.md for storage tiering and lifecycle policies
- Operations: See runbook.md for operational procedures
Table of Contents¶
- FinOps Principles
- Environment Cost Profiles
- Detailed Cost Breakdowns
- Cost Optimization Strategies
- Right-Sizing Recommendations
- Storage Optimization
- Cost Monitoring & Budgets
- Cost Allocation & Attribution
- Cost Governance
- Best Practices & Recommendations
- Cost Optimization Runbooks
FinOps Principles¶
ATP implements FinOps principles to balance operational requirements with cost efficiency:
- Visibility: Tag all resources; enable Cost Management; monthly reviews; cost transparency
- Optimization: Shutdown schedules, reserved instances, autoscaling, storage lifecycle, spot instances
- Governance: Budget alerts, Azure Policy enforcement, approval workflows for cost increases
- Culture: Cost awareness in development; cost-per-feature metrics; regular optimization sprints
FinOps Culture¶
Cost-Aware Development:
- Review cost impact before adding new Azure resources
- Use cost estimation tools (Infracost) in CI/CD pipelines
- Monitor cost-per-feature and cost-per-tenant metrics
- Run regular cost optimization sprints (quarterly)
Cost Metrics:
- Cost per Tenant: Monthly cost divided by active tenants
- Cost per Feature: Cost attribution to features/capabilities
- Cost per Environment: Track costs by environment (Dev/Test/Staging/Prod)
- Cost per Service: Breakdown by microservice/component
Environment Cost Profiles¶
ATP's cost model is graduated by environment with Dev/Test optimized for minimal cost and Production optimized for reliability within budget constraints.
Cost Profile Summary¶
| Environment | Monthly Budget | Primary Compute | SKU Tier | Scaling Strategy | Monitoring Cost | Total Est. Monthly |
|---|---|---|---|---|---|---|
| Preview | $100 | Azure Container Instances | Dynamic | Per-PR ephemeral | N/A | $50-150 (variable) |
| Dev | $500 | App Service | Basic B1 (1 vCPU, 1.75 GB) | Fixed (1 instance) | Basic alerts | $400-600 |
| Test | $1,000 | App Service | Standard S1 (1 vCPU, 1.75 GB) | Fixed (2 instances) | Basic alerts | $900-1,200 |
| Staging | $3,000 | App Service | Premium P1v2 (1 vCPU, 3.5 GB) | Autoscale (2-5) | Enhanced alerts | $2,500-3,500 |
| Production | $10,000 | AKS (3-10 nodes) | Standard_D4s_v5 (4 vCPU, 16 GB) | Autoscale (3-10 nodes) | Full observability | $8,000-12,000 |
| Hotfix | $500 | App Service (on-demand) | Premium P1v3 (2 vCPU, 8 GB) | Fixed (1 instance) | Basic alerts | $0-500 (as-needed) |
Cost Profile Rationale:
- Dev ($500): Cost-minimized with Basic SKU; shutdown evenings/weekends (40% savings)
- Test ($1,000): Standard SKU for stable performance; 2 instances for load testing; shutdown nights
- Staging ($3,000): Premium SKU for production-like validation; autoscaling; always-on
- Production ($10,000): AKS for enterprise-grade scalability; reserved instances (20% savings); 99.9% SLA
- Hotfix ($500): On-demand deployment only when needed; deleted after hotfix deployment
Detailed Cost Breakdowns¶
Dev Environment Monthly Costs¶
Compute (App Service Basic B1 × 1): $13/month × 0.6 (60% uptime) = $8/month
SQL Database (Basic - 5 DTU): $5/month
Redis Cache (Basic C0 - 250 MB): $16/month
Service Bus (Basic): $5/month
Storage (LRS - 100 GB): $2/month
Key Vault (transactions): $1/month
Bandwidth: $5/month
---
Subtotal: $42/month
Actual with shutdown automation: ~$25/month per service × 7 services = $175/month
Dev shared infrastructure: $300/month (VNet, NSG, Seq, etc.)
---
Total Dev Environment: $475/month
Cost Optimization (Dev):
- Shutdown Schedule: Stop 6 PM - 8 AM weekdays, all weekend → 60% uptime → 40% savings
- Shared Resources: VNet, NSG, Seq shared between Dev/Test → split cost
- Basic SKUs: Minimum viable performance for development
Test Environment Monthly Costs¶
Compute (App Service Standard S1 × 2): $70/month × 0.7 (70% uptime) × 2 = $98/month
SQL Database (Standard S1 - 20 DTU): $30/month
Redis Cache (Standard C1 - 1 GB): $75/month
Service Bus (Standard): $10/month
Storage (LRS - 500 GB): $10/month
Key Vault (transactions): $2/month
Application Insights (5 GB/month): $12/month
Bandwidth: $10/month
---
Subtotal: $247/month
Actual with shutdown automation: ~$173/month (70% uptime)
Test shared infrastructure: $300/month (shared with Dev)
---
Total Test Environment: $473/month
With reserved instances (1-year): -$50/month (15% savings)
Net Test Environment: $423/month
Cost Optimization (Test):
- Shutdown Schedule: Stop 10 PM - 6 AM weekdays → 70% uptime → 30% savings
- Standard SKUs: Balance between cost and performance for testing
Staging Environment Monthly Costs¶
Compute (App Service Premium P1v2 × 2-5): $146/month × 3 avg = $438/month
SQL Database (Standard S3 - 100 DTU): $150/month
Redis Cache (Standard C2 - 2.5 GB): $150/month
Service Bus (Standard - 2 messaging units): $20/month
Storage (LRS - 1 TB): $20/month
Key Vault (transactions): $3/month
Application Insights (15 GB/month): $30/month
Log Analytics (30-day retention, 10 GB/day): $300/month
Bandwidth: $20/month
---
Total Staging Environment: $1,131/month
With reserved instances (1-year): -$170/month (15% savings)
Net Staging Environment: $961/month
Cost Optimization (Staging):
- Autoscaling: Scale 2-5 instances based on load → optimize for actual usage
- Reserved Instances: 1-year commitment → 15% savings
Production Environment Monthly Costs¶
AKS Cluster (3-10 nodes, Standard_D4s_v5):
- System pool (3 nodes, always on): $200/month × 3 = $600/month
- User pool (3-7 nodes, autoscale): $200/month × 5 avg = $1,000/month
SQL Database (Premium P4 - 500 DTU): $1,860/month
Cosmos DB (1000 RU/s provisioned): $730/month
Redis Cache (Premium P3 - 26 GB): $1,037/month
Service Bus (Premium - 4 messaging units): $2,708/month
Storage (GRS + WORM - 10 TB):
- Hot tier (0-90 days): $500/month
- Cool tier (91 days - 7 years): $100/month
Key Vault (HSM - 50 keys): $625/month
Application Insights (50 GB/month): $115/month
Log Analytics (90-day retention, 30 GB/day): $900/month
Private Endpoints (10 × $7): $70/month
Application Gateway (v2 with WAF): $250/month
Azure Firewall (Premium): $625/month
DDoS Protection Standard: $2,944/month
Bandwidth (outbound - 1 TB): $90/month
ACR (Premium - geo-replication): $30/month
Prometheus + Grafana (self-hosted on AKS): $50/month (storage only)
---
Total Production Environment: $14,234/month
Reserved Instance Savings (1-year): -$2,400/month (20% on compute/database)
---
Net Production Monthly Cost: $11,834/month
Cost Optimization (Production):
- Reserved Instances: 1-year commitment for AKS nodes, SQL, Cosmos DB → 20-30% savings
- Autoscaling: Scale down to 3 nodes during low-traffic hours → save ~$400/month
- Storage Lifecycle: Auto-transition to cool tier after 90 days → save ~$300/month
- DDoS Protection: Shared across all public endpoints in subscription
- Application Insights Sampling: 10% adaptive sampling → reduce ingestion by 90%
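The Production totals can be re-checked whenever a line item changes. The sketch below is a minimal illustration, not ATP tooling: it copies the monthly line items from the Production breakdown above, and the names `PRODUCTION_LINE_ITEMS` and `net_monthly_cost` are hypothetical.

```python
# Monthly line items copied from the Production breakdown (USD).
PRODUCTION_LINE_ITEMS = {
    "AKS system pool": 600,
    "AKS user pool": 1_000,
    "SQL Database": 1_860,
    "Cosmos DB": 730,
    "Redis Cache": 1_037,
    "Service Bus": 2_708,
    "Storage (hot)": 500,
    "Storage (cool)": 100,
    "Key Vault (HSM)": 625,
    "Application Insights": 115,
    "Log Analytics": 900,
    "Private Endpoints": 70,
    "Application Gateway": 250,
    "Azure Firewall": 625,
    "DDoS Protection": 2_944,
    "Bandwidth": 90,
    "ACR": 30,
    "Prometheus + Grafana": 50,
}


def net_monthly_cost(line_items: dict, reserved_savings: float) -> float:
    """Sum all line items, then subtract reserved-instance savings."""
    return sum(line_items.values()) - reserved_savings


if __name__ == "__main__":
    gross = sum(PRODUCTION_LINE_ITEMS.values())
    net = net_monthly_cost(PRODUCTION_LINE_ITEMS, reserved_savings=2_400)
    print(f"Gross: ${gross:,}/month")
    print(f"Net:   ${net:,}/month")
```

Keeping the line items in one structure makes it harder for the stated total to drift away from the items it is supposed to summarize.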
Preview Environment Monthly Costs¶
Azure Container Instances (per-PR ephemeral):
- Average PR duration: 2 hours
- Average instances: 3
- Cost per hour: $0.05/instance
- Monthly PRs: 50
- Total: $0.05 × 3 × 2 × 50 = $15/month
Spot Instances (optional): $1.50/month (90% savings)
---
Total Preview Environment: $15-50/month (highly variable)
Cost Optimization (Preview):
- Spot Instances: Use Azure Spot VMs for non-critical workloads → 90% savings
- Ephemeral: Containers destroyed after PR merge/close
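Because Preview spend scales with PR volume, the estimate above is easier to reason about as a formula. The following is a small sketch under the figures stated in the breakdown; the function name `preview_monthly_cost` and its `spot_discount` parameter are illustrative assumptions.

```python
def preview_monthly_cost(
    cost_per_instance_hour: float,
    instances_per_pr: int,
    hours_per_pr: float,
    prs_per_month: int,
    spot_discount: float = 0.0,
) -> float:
    """Estimate monthly Preview cost; spot_discount is a fraction (0.9 = 90% off)."""
    on_demand = cost_per_instance_hour * instances_per_pr * hours_per_pr * prs_per_month
    return round(on_demand * (1.0 - spot_discount), 2)


if __name__ == "__main__":
    # Figures from the Preview breakdown: $0.05/hour, 3 instances, 2 hours, 50 PRs.
    print(preview_monthly_cost(0.05, 3, 2, 50))
    print(preview_monthly_cost(0.05, 3, 2, 50, spot_discount=0.9))
```

Doubling either the PR count or the average PR lifetime doubles the bill, which is why destroying containers promptly on merge/close matters more than the per-hour rate.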
Cost Optimization Strategies¶
ATP implements automated cost optimization across all environments using Azure Policy, automation scripts, and IaC overlays.
Automated Shutdown Schedules¶
Purpose: Reduce compute costs in Dev/Test environments by shutting down resources during non-business hours.
Implementation (Azure Automation):
// ConnectSoft.ATP.Infrastructure/Automation/ShutdownSchedule.cs
public class ShutdownSchedule
{
    public static void ConfigureDevEnvironment(ResourceGroup resourceGroup)
    {
        // Dev: Shutdown 6 PM - 8 AM weekdays, all weekend.
        // No weekend startup is scheduled, so resources stopped on Friday
        // evening stay off until the Monday 8 AM startup.
        var devShutdownSchedule = new Schedule("dev-shutdown-schedule", new ScheduleArgs
        {
            ResourceGroupName = resourceGroup.Name,
            Name = "dev-shutdown",
            Frequency = "Week",
            WeekDays = new[] { "Monday", "Tuesday", "Wednesday", "Thursday", "Friday" },
            StartTime = "18:00:00", // 6 PM
            Description = "Shutdown Dev environment after business hours"
        });

        var devStartupSchedule = new Schedule("dev-startup-schedule", new ScheduleArgs
        {
            ResourceGroupName = resourceGroup.Name,
            Name = "dev-startup",
            Frequency = "Week",
            WeekDays = new[] { "Monday", "Tuesday", "Wednesday", "Thursday", "Friday" },
            StartTime = "08:00:00", // 8 AM
            Description = "Startup Dev environment at beginning of business day"
        });
    }
}
Cost Savings:
- Dev: 60% uptime → 40% savings (~$5/month per App Service, $13 × 0.4)
- Test: 70% uptime → 30% savings (~$21/month per App Service, $70 × 0.3)
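The uptime percentages drive every shutdown-related saving in this document, so it helps to derive them from the schedule itself. The sketch below assumes a 168-hour week; for Test it further assumes weekends stay on, since only weekday nights are listed. Under these assumptions the literal schedule hours give roughly 30% uptime for Dev and 76% for Test, so the 60% and 70% planning figures should be read as estimates rather than exact schedule arithmetic. The function name is hypothetical.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def weekly_uptime_fraction(hours_per_weekday: float, hours_per_weekend_day: float) -> float:
    """Fraction of the week a resource is running, given daily running hours."""
    running = 5 * hours_per_weekday + 2 * hours_per_weekend_day
    return running / HOURS_PER_WEEK


if __name__ == "__main__":
    dev = weekly_uptime_fraction(10, 0)    # 08:00-18:00 weekdays, off all weekend
    test = weekly_uptime_fraction(16, 24)  # off 22:00-06:00 weekdays; weekends assumed on
    print(f"Dev uptime:  {dev:.1%}")
    print(f"Test uptime: {test:.1%}")
```

Multiplying an always-on monthly price by the resulting fraction gives the expected cost under that schedule.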
Reserved Instances¶
Purpose: Commit to 1-year or 3-year terms for predictable workloads to achieve 20-30% savings.
Production Reserved Instances:
reservedInstances:
  commitment: 1-year (renew annually)
  resources:
    - type: AKS Standard_D4s_v5
      quantity: 3 (system pool, always on)
      monthlyCost: $600
      reservedCost: $480 (20% savings)
      annualSavings: $1,440
    - type: SQL Database Premium P4
      quantity: 1
      monthlyCost: $1,860
      reservedCost: $1,395 (25% savings)
      annualSavings: $5,580
    - type: Cosmos DB (1000 RU/s)
      quantity: 1
      monthlyCost: $730
      reservedCost: $511 (30% savings)
      annualSavings: $2,628
    - type: Redis Cache Premium P3
      quantity: 1
      monthlyCost: $1,037
      reservedCost: $830 (20% savings)
      annualSavings: $2,484
  totalAnnualSavings: $12,132 (Production)
totalATPReservedInstanceSavings: $13,956/year
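The per-resource savings in the plan above follow from one formula: (on-demand monthly cost minus reserved monthly cost) times 12. A minimal sketch using the Production figures from that plan; the `RESERVATIONS` structure and function name are illustrative only.

```python
# (resource, on-demand monthly cost, reserved monthly cost) in USD,
# copied from the Production reserved-instance plan.
RESERVATIONS = [
    ("AKS Standard_D4s_v5 x3", 600, 480),
    ("SQL Database Premium P4", 1_860, 1_395),
    ("Cosmos DB (1000 RU/s)", 730, 511),
    ("Redis Cache Premium P3", 1_037, 830),
]


def annual_savings(monthly_cost: float, reserved_cost: float) -> float:
    """Yearly saving from paying the reserved rate instead of on-demand."""
    return (monthly_cost - reserved_cost) * 12


if __name__ == "__main__":
    for name, monthly, reserved in RESERVATIONS:
        print(f"{name}: ${annual_savings(monthly, reserved):,}/year")
    total = sum(annual_savings(m, r) for _, m, r in RESERVATIONS)
    print(f"Total Production savings: ${total:,}/year")
```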
Purchase Reserved Instances (Azure CLI):
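#!/bin/bash
# purchase-reserved-instances.sh
set -euo pipefail

SUBSCRIPTION_ID="<azure-subscription-id>"
REGION="eastus"

# Run every command against the intended subscription
az account set --subscription "$SUBSCRIPTION_ID"

echo "Purchasing Reserved Instances for ATP Production..."

# AKS Nodes (Standard_D4s_v5 × 3)
# Depending on the CLI version, further arguments (for example a reservation
# order ID and billing scope) may be required; check
# `az reservations reservation-order purchase --help` before running.
az reservations reservation-order purchase \
  --reserved-resource-type "VirtualMachines" \
  --sku "Standard_D4s_v5" \
  --location "$REGION" \
  --quantity 3 \
  --term "P1Y" \
  --billing-plan "Monthly" \
  --display-name "ATP-Prod-AKS-RI-2025"

# SQL Database (Premium P4)
# This pins the database configuration the reservation is sized against;
# it does not by itself purchase reserved capacity.
az sql db update \
  --resource-group "ConnectSoft-ATP-Prod-EUS-RG" \
  --server "atp-sql-prod-eus" \
  --name "ATP_Prod" \
  --compute-model "Provisioned" \
  --service-objective "P4" \
  --backup-storage-redundancy "Geo" \
  --zone-redundant true \
  --read-scale "Enabled"

# Cosmos DB (1000 RU/s)
# This sets provisioned throughput; reserved capacity is purchased separately.
az cosmosdb sql container throughput update \
  --resource-group "ConnectSoft-ATP-Prod-EUS-RG" \
  --account-name "atp-cosmos-prod-eus" \
  --database-name "ATP" \
  --name "AuditEvents" \
  --throughput 1000

echo "✅ Reservation purchase submitted and SQL/Cosmos DB configuration applied; savings will appear in next billing cycle"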
Spot Instances (Preview Environments)¶
Purpose: 90% cost savings for ephemeral Preview environments using Azure Spot VMs.
Implementation:
// AKS Node Pool with Spot instances
var spotNodePool = new KubernetesClusterNodePool("spot-pool", new KubernetesClusterNodePoolArgs
{
    KubernetesClusterId = aksCluster.Id,
    Name = "spotpool",
    VmSize = "Standard_D8s_v5",
    NodeCount = 0,
    MinCount = 0,
    MaxCount = 10,
    EnableAutoScaling = true,
    Priority = "Spot",
    EvictionPolicy = "Delete",
    SpotMaxPrice = 0.05, // Max $0.05/hour (90% discount)
    NodeLabels = new Dictionary<string, string>
    {
        ["workload"] = "preview",
        ["ephemeral"] = "true"
    }
});
Cost Savings: 90% discount vs. regular VMs ($0.50/hour → $0.05/hour)
Storage Lifecycle Policies¶
Purpose: Automatically transition data from Hot → Cool → Archive tiers to reduce storage costs by up to 80%.
Storage Lifecycle Policy:
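{
  "rules": [
    {
      "enabled": true,
      "name": "MoveLogsToCoolAfter90Days",
      "type": "Lifecycle",
      "definition": {
        "filters": {
          "blobTypes": ["blockBlob"],
          "prefixMatch": ["logs/"]
        },
        "actions": {
          "baseBlob": {
            "tierToCool": { "daysAfterModificationGreaterThan": 90 },
            "tierToArchive": { "daysAfterModificationGreaterThan": 2555 },
            "delete": { "daysAfterModificationGreaterThan": 3650 }
          }
        }
      }
    }
  ]
}
The thresholds are 90 days (Cool), 2,555 days (7 years, Archive), and 3,650 days (10 years, delete, unless the blob is on legal hold), in line with the deleteAfter: P10Y setting in the tiering configuration. The delete threshold must be later than the archive threshold: when two actions match the same blob, Azure applies the cheaper one, so a delete at 2,555 days would pre-empt archiving. The file carries no inline comments because JSON does not permit them.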
Storage Cost Savings (Production):
Hot Storage (0-90 days): 900 GB × $0.0184/GB = $16.56/month
Cool Storage (91 days - 7 years): 25,200 GB × $0.01/GB = $252/month
Archive Storage (7+ years): 100,000 GB × $0.002/GB = $200/month
Without Lifecycle Policy (all hot): 126,100 GB × $0.0184/GB = $2,320/month
With Lifecycle Policy: $16.56 + $252 + $200 = $468.56/month
Total Savings: $1,851.44/month (80% savings on storage)
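The savings figure above is the difference between pricing every gigabyte at the Hot rate and pricing each tier at its own rate. A minimal sketch using the volumes and per-GB rates from this section; the dictionary and function names are illustrative assumptions.

```python
# Per-GB monthly rates and tier volumes from the Production storage example.
RATES_PER_GB = {"hot": 0.0184, "cool": 0.01, "archive": 0.002}
VOLUMES_GB = {"hot": 900, "cool": 25_200, "archive": 100_000}


def tiered_cost(volumes: dict, rates: dict) -> float:
    """Monthly cost when each tier is billed at its own rate."""
    return sum(volumes[tier] * rates[tier] for tier in volumes)


def all_hot_cost(volumes: dict, rates: dict) -> float:
    """Monthly cost if no lifecycle policy existed and all data stayed Hot."""
    return sum(volumes.values()) * rates["hot"]


if __name__ == "__main__":
    with_policy = tiered_cost(VOLUMES_GB, RATES_PER_GB)
    without_policy = all_hot_cost(VOLUMES_GB, RATES_PER_GB)
    saving = without_policy - with_policy
    print(f"Without policy: ${without_policy:,.2f}/month")
    print(f"With policy:    ${with_policy:,.2f}/month")
    print(f"Saving:         ${saving:,.2f}/month ({saving / without_policy:.0%})")
```

Note that this models storage-at-rest only; tier-transition, early-deletion, and rehydration charges are not included.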
Apply Lifecycle Policy (Azure CLI):
#!/bin/bash
# apply-storage-lifecycle-policy.sh
STORAGE_ACCOUNT="atpstorageprodeus"
RESOURCE_GROUP="ATP-Prod-RG"
echo "Applying storage lifecycle policy..."
az storage account management-policy create \
--account-name "$STORAGE_ACCOUNT" \
--resource-group "$RESOURCE_GROUP" \
--policy @lifecycle-policy.json
echo "✅ Storage lifecycle policy applied"
Autoscaling¶
Purpose: Automatically scale resources based on demand to optimize costs during low-traffic periods.
AKS Cluster Autoscaling:
// Production AKS cluster with autoscaling
var aksCluster = new ManagedCluster("atp-aks-prod", new ManagedClusterArgs
{
    // ... other configuration ...
    AgentPoolProfiles = new[]
    {
        // System pool (fixed - always on)
        new ManagedClusterAgentPoolProfileArgs
        {
            Name = "system",
            Mode = "System",
            Count = 3,
            VmSize = "Standard_D4s_v5",
            // Fixed for system components; MinCount/MaxCount apply only
            // when autoscaling is enabled, so they are omitted here
            EnableAutoScaling = false
        },
        // User pool (autoscale based on demand)
        new ManagedClusterAgentPoolProfileArgs
        {
            Name = "user",
            Mode = "User",
            Count = 3,
            VmSize = "Standard_D4s_v5",
            EnableAutoScaling = true,
            MinCount = 3,
            MaxCount = 10
        }
    },
    // Cluster autoscaler tuning is set once per cluster, not per pool
    AutoScalerProfile = new ManagedClusterPropertiesAutoScalerProfileArgs
    {
        ScaleDownDelayAfterAdd = "10m",       // Wait 10 min after a scale-up before scaling down
        ScaleDownUtilizationThreshold = "0.5" // Scale down if <50% utilization
    }
});
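Cost Savings: When the user pool drops from its 5-node average to the 3-node minimum during low-traffic hours, two nodes are released; at $200 per node per month that is up to ~$400/month.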
Application Insights Sampling¶
Purpose: Reduce telemetry ingestion costs by 90% while maintaining error visibility.
Adaptive Sampling Configuration:
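// Production Application Insights with adaptive sampling
services.AddApplicationInsightsTelemetry(options =>
{
    options.ConnectionString = appInsightsConnectionString;
    options.EnableAdaptiveSampling = false; // Disable the default so the custom settings below apply
    options.EnablePerformanceCounterCollectionModule = true;
    options.EnableDependencyTrackingTelemetryModule = true;
    options.EnableEventCounterCollectionModule = true;
});

// Adaptive sampling is a telemetry processor, so it is configured on the
// processor chain rather than as a telemetry module
services.Configure<TelemetryConfiguration>(telemetryConfiguration =>
{
    var chain = telemetryConfiguration.DefaultTelemetrySink.TelemetryProcessorChainBuilder;
    var settings = new SamplingPercentageEstimatorSettings
    {
        MaxTelemetryItemsPerSecond = 10, // Target item rate per host (not a percentage)
        InitialSamplingPercentage = 10   // Start at 10% until the estimator adapts
    };
    // Exceptions and events are excluded from sampling so errors are always retained;
    // all other types (PageView, Trace, Dependency, Request) are sampled
    chain.UseAdaptiveSampling(settings, null, excludedTypes: "Exception;Event");
    chain.Build();
});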
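Cost Savings: Reduce ingestion from 500 GB/month to 50 GB/month. At the roughly $2.30/GB rate implied by the Production breakdown (50 GB/month for $115), that avoids about $1,035/month in ingestion charges.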
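The effect of sampling on the bill can be estimated from the ingestion volume and a per-GB price. The sketch below derives the price from the Production breakdown (50 GB/month for $115, about $2.30/GB); that derived rate is an assumption for illustration, not a quoted Azure price, and the function name is hypothetical.

```python
def ingestion_cost(gb_per_month: float, price_per_gb: float) -> float:
    """Monthly telemetry ingestion cost in USD."""
    return gb_per_month * price_per_gb


if __name__ == "__main__":
    price_per_gb = 115 / 50  # implied by the Production breakdown (assumption)
    unsampled = ingestion_cost(500, price_per_gb)
    sampled = ingestion_cost(50, price_per_gb)
    print(f"Unsampled: ${unsampled:,.2f}/month")
    print(f"Sampled:   ${sampled:,.2f}/month")
    print(f"Avoided:   ${unsampled - sampled:,.2f}/month")
```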
Right-Sizing Recommendations¶
AKS Node Pool Sizing¶
System Node Pool:
- VM Size: Standard_D4s_v5 (4 vCPU, 16 GB RAM)
- Count: 3 (fixed, one per AZ)
- Purpose: Control plane components, mesh, KEDA, FluxCD, OTel
- Cost: ~$600/month (always on)
User Node Pool:
- VM Size: Standard_D4s_v5 (4 vCPU, 16 GB RAM)
- Count: 3-10 (autoscale)
- Purpose: Stateless APIs (Gateway, Query, Admin, Policy)
- Cost: ~$1,000/month (avg 5 nodes)
I/O Node Pool:
- VM Size: Standard_E8s_v5 (8 vCPU, 64 GB RAM, premium storage)
- Count: 2-50 (autoscale)
- Purpose: I/O-heavy (Ingestion, Projection, Export, Integrity)
- Cost: ~$1,200/month (avg 3 nodes)
Jobs Node Pool (Optional):
- VM Size: Standard_F16s_v2 (16 vCPU, 32 GB RAM, compute-optimized)
- Count: 0-20 (KEDA scale to zero)
- Purpose: Export jobs, maintenance tasks, compliance reports
- Cost: ~$500/month (on-demand only)
Spot Node Pool (Optional):
- VM Size: Standard_D8s_v5
- Count: 0-10 (autoscale)
- Priority: Spot (90% discount)
- Purpose: Non-critical workloads (dev/test projections, backfills)
- Cost: ~$50/month (90% savings)
SQL Database Sizing¶
| Environment | SKU | DTU/vCores | Monthly Cost | Rationale |
|---|---|---|---|---|
| Dev | Basic | 5 DTU | $5 | Minimal for development |
| Test | Standard S1 | 20 DTU | $30 | Stable performance for testing |
| Staging | Standard S3 | 100 DTU | $150 | Production-like validation |
| Production | Premium P4 | 500 DTU | $1,860 | Enterprise-grade performance |
Right-Sizing Recommendations:
- Start Small: Begin with lower SKUs and scale up based on metrics
- Monitor DTU Usage: If consistently >80%, consider scaling up
- Reserved Instances: Use for Production → 25% savings
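The "consistently >80%" rule needs a concrete definition before it can be automated. The sketch below is one possible reading, not ATP's actual policy: it treats usage as consistent when at least 90% of the sampled DTU percentages exceed the threshold. Both the 90% figure and the function name are assumptions.

```python
def should_scale_up(
    dtu_percent_samples: list,
    threshold: float = 80.0,
    sustained_fraction: float = 0.9,
) -> bool:
    """Return True when most samples exceed the threshold.

    dtu_percent_samples: DTU utilization readings in percent (0-100).
    sustained_fraction: share of samples that must exceed the threshold
    for usage to count as "consistent" (an assumed definition).
    """
    if not dtu_percent_samples:
        return False  # No data: do not recommend a change
    above = sum(1 for sample in dtu_percent_samples if sample > threshold)
    return above / len(dtu_percent_samples) >= sustained_fraction


if __name__ == "__main__":
    print(should_scale_up([85, 90, 88, 92, 81]))  # sustained load
    print(should_scale_up([85, 40, 90, 30]))      # short spikes only
```

Requiring a sustained share of samples, rather than a single peak, avoids scaling up in response to brief bursts that the current SKU already absorbs.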
Redis Cache Sizing¶
| Environment | SKU | Size | Monthly Cost | Rationale |
|---|---|---|---|---|
| Dev | Basic C0 | 250 MB | $16 | Minimal for development |
| Test | Standard C1 | 1 GB | $75 | Standard for testing |
| Staging | Standard C2 | 2.5 GB | $150 | Production-like validation |
| Production | Premium P3 | 26 GB | $1,037 | Enterprise-grade with persistence |
Right-Sizing Recommendations:
- Monitor Memory Usage: If consistently >80%, consider scaling up
- Use Premium for Production: Persistence and high availability required
Storage Tier Selection¶
| Tier | Use Case | Cost/GB/Month | Access Time |
|---|---|---|---|
| Hot | Frequently accessed (0-90 days) | $0.0184 | <1 ms |
| Cool | Infrequently accessed (91 days - 7 years) | $0.01 | <30 ms |
| Archive | Rarely accessed (7+ years) | $0.002 | Hours (rehydration) |
Recommendation: Use lifecycle policies to automatically transition data to lower-cost tiers
Storage Optimization¶
Hot/Warm/Cold Tiering Strategy¶
Tier Definitions:
- Hot (Append/WORM): Authoritative segments and recent anchors; low latency, high IOPS, highest cost
- Warm (Read Models): Projections and indexes optimized for query; rebuilt from hot as needed; medium cost
- Cold (Archive/Export): Immutable object storage with long retention; bulk throughput; lowest cost
Lifecycle Transitions:
tiering:
  hot:
    targetWindow: P14D        # 14 days in hot tier
  warm:
    targetWindow: P90D        # 90 days in warm tier
    rebuildFirst: true        # Prefer re-project vs snapshot storage
  cold:
    storageClass: archive_immutable
    lifecycle:
      transitionAfter: P90D   # Move to cold after 90 days
      deleteAfter: P10Y       # Delete after 10 years (if not on legal hold)
  residency:
    crossRegionHydrate: deny  # Never hydrate across region families
Cost Savings:
- Hot → Cool transition: ~45% savings
- Cool → Archive transition: ~80% savings
- Total lifecycle savings: ~80% on long-term storage
Compression & Deduplication¶
Compression Strategies:
- JSONL: GZIP compression (5-10x reduction)
- Parquet: Snappy compression (columnar, 3-5x reduction)
- Segment Storage: Compress before upload to blob storage
Cost Impact: Reduce storage by 70-80% → proportional cost savings
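Compression ratios depend heavily on the data, so it is worth measuring them on representative records before relying on the figures above. The sketch below uses only the Python standard library to gzip a JSONL payload and report the ratio; the sample record fields are invented for illustration and are not ATP's event schema.

```python
import gzip
import json


def jsonl_compression_ratio(records: list) -> float:
    """Serialize records as JSONL, gzip them, and return raw/compressed size."""
    raw = "\n".join(json.dumps(r, separators=(",", ":")) for r in records).encode("utf-8")
    compressed = gzip.compress(raw)
    return len(raw) / len(compressed)


if __name__ == "__main__":
    # Hypothetical audit-style records; real events will compress differently.
    sample = [
        {"tenantId": "tenant-001", "action": "user.login", "sequence": i}
        for i in range(1_000)
    ]
    print(f"Compression ratio: {jsonl_compression_ratio(sample):.1f}x")
```

Highly repetitive records like these compress far better than payloads with high-entropy fields (hashes, signatures, encrypted blobs), so a sample drawn from production data gives a more honest estimate.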
Cost Monitoring & Budgets¶
Azure Cost Management Budgets¶
Budget Configuration (Pulumi):
// Production Budget
var prodBudget = new Budget("atp-budget-prod", new BudgetArgs
{
    BudgetName = "atp-budget-prod",
    Scope = prodResourceGroup.Id, // Budgets are scoped by resource ID (here: the Production resource group)
    Amount = 10000, // $10,000/month
    TimeGrain = "Monthly",
    TimePeriod = new BudgetTimePeriodArgs
    {
        StartDate = "2025-01-01",
        EndDate = "2025-12-31"
    },
    Category = "Cost",
    Notifications = new InputMap<NotificationArgs>
    {
        ["Alert50Percent"] = new NotificationArgs
        {
            Enabled = true,
            Operator = "GreaterThanOrEqualTo",
            Threshold = 50,
            ContactEmails = new[] { "platform-team@connectsoft.example" },
            ThresholdType = "Actual"
        },
        ["Alert80Percent"] = new NotificationArgs
        {
            Enabled = true,
            Operator = "GreaterThanOrEqualTo",
            Threshold = 80,
            ContactEmails = new[] { "platform-team@connectsoft.example", "finance@connectsoft.example" },
            ContactRoles = new[] { "Owner" },
            ThresholdType = "Actual"
        },
        ["Alert100Percent"] = new NotificationArgs
        {
            Enabled = true,
            Operator = "GreaterThanOrEqualTo",
            Threshold = 100,
            ContactEmails = new[] { "cfo@connectsoft.example", "platform-team@connectsoft.example" },
            ContactRoles = new[] { "Owner" },
            ThresholdType = "Actual",
            // Auto-create P1 incident. incidentActionGroup is a hypothetical
            // Azure Monitor action group defined elsewhere that opens the incident.
            ContactGroups = new[] { incidentActionGroup.Id }
        }
    }
});
Cost Anomaly Detection¶
Anomaly Alert (Azure Monitor):
// Cost anomaly alert (target: catch a ~50% spike in a single day)
// Dynamic thresholds flag deviations from a learned baseline rather than a
// fixed percentage, so the 50% figure is a tuning goal, not a configured value.
var costAnomalyAlert = new MetricAlert("atp-cost-anomaly-alert-prod", new MetricAlertArgs
{
    RuleName = "atp-cost-anomaly-prod",
    ResourceGroupName = prodResourceGroup.Name,
    Location = "global",
    Description = "Alert when Production environment cost spikes >50% in 24 hours",
    Severity = 1, // High severity
    Enabled = true,
    Scopes = new[] { prodResourceGroup.Id },
    EvaluationFrequency = "PT1H", // Evaluate every hour
    WindowSize = "PT24H", // 24-hour window
    Criteria = new MetricAlertMultipleResourceMultipleMetricCriteriaArgs
    {
        OdataType = "Microsoft.Azure.Monitor.MultipleResourceMultipleMetricCriteria",
        AllOf = new[]
        {
            new DynamicMetricCriteriaArgs
            {
                CriterionType = "DynamicThresholdCriterion",
                Name = "CostAnomaly",
                // Assumes cost data is published as an Azure Monitor metric under
                // this namespace; verify that it is available in your subscription.
                MetricName = "ActualCost",
                MetricNamespace = "Microsoft.CostManagement/budgets",
                Operator = "GreaterThan",
                AlertSensitivity = "Medium",
                FailingPeriods = new DynamicThresholdFailingPeriodsArgs
                {
                    NumberOfEvaluationPeriods = 4,
                    MinFailingPeriodsToAlert = 2
                },
                TimeAggregation = "Total"
            }
        }
    }
});
Cost Dashboards¶
KQL Query for Cost Attribution:
// Cost breakdown by Environment and Service
AzureCostManagement
| where TimeGenerated >= startofmonth(now())
| extend Environment = tostring(Tags["Environment"])
| extend Service = tostring(Tags["Service"])
| extend CostCenter = tostring(Tags["CostCenter"])
| summarize TotalCost = sum(Cost) by Environment, Service, CostCenter
| order by TotalCost desc
Cost Allocation & Attribution¶
Resource Tagging Strategy¶
Required Tags:
tags:
  Environment: production|staging|test|dev|preview|hotfix
  Service: gateway|ingestion|query|policy|export|admin
  Team: platform|engineering|ops
  CostCenter: atp-production|atp-staging
  TenantId: <tenant-id>  # For multi-tenant resources
  Edition: free|standard|enterprise
  Region: eastus|westus|westeurope
Tagging Example (Pulumi):
var prodTags = new InputMap<string>
{
    ["Environment"] = "production",
    ["Service"] = "gateway",
    ["Team"] = "platform",
    ["CostCenter"] = "atp-production",
    ["Region"] = "eastus",
    ["Compliance"] = "soc2-gdpr-hipaa"
};

var prodAppService = new WebApp("atp-gateway-prod-eus", new WebAppArgs
{
    // ... resource configuration ...
    Tags = prodTags
});
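Cost attribution only works if every resource actually carries the required tags, so a pre-deployment check is a natural companion to the tagging strategy. The sketch below is a hypothetical validator, not part of ATP's pipeline. It assumes that Environment, Service, Team, CostCenter, and Region are mandatory on every resource, while TenantId and Edition apply only to tenant-specific resources (the Pulumi example above omits them).

```python
# Tags assumed mandatory on every resource (assumption based on the example above).
REQUIRED_TAGS = {"Environment", "Service", "Team", "CostCenter", "Region"}

# Allowed values taken from the required-tags listing.
ALLOWED_ENVIRONMENTS = {"production", "staging", "test", "dev", "preview", "hotfix"}


def tag_problems(tags: dict) -> list:
    """Return a sorted list of human-readable tagging problems (empty if none)."""
    problems = [f"missing tag: {name}" for name in REQUIRED_TAGS - set(tags)]
    environment = tags.get("Environment")
    if environment is not None and environment not in ALLOWED_ENVIRONMENTS:
        problems.append(f"invalid Environment: {environment}")
    return sorted(problems)


if __name__ == "__main__":
    prod_tags = {
        "Environment": "production",
        "Service": "gateway",
        "Team": "platform",
        "CostCenter": "atp-production",
        "Region": "eastus",
        "Compliance": "soc2-gdpr-hipaa",
    }
    print(tag_problems(prod_tags))
    print(tag_problems({"Environment": "prod"}))
```

A check like this could run in CI before deployment, complementing Azure Policy enforcement at the platform level.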
Per-Tenant Cost Tracking¶
Cost per Tenant (Production):
Net Production Monthly Cost (line items less reserved-instance savings): $11,834
Active Tenants (production): 50
Cost per Tenant: $11,834 / 50 = $236.68/month
Target Cost per Tenant (with 500 tenants): $11,834 / 500 = $23.67/month
Required Optimization: 90% reduction through economies of scale and multi-tenancy
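The 90% figure follows from tenant growth alone: if platform cost stays flat while tenants grow tenfold, per-tenant cost falls by 1 - 50/500. The sketch below illustrates this with a round $10,000 monthly cost (the Production budget from the profile table) purely as an example input; the function names are hypothetical.

```python
def cost_per_tenant(monthly_cost: float, active_tenants: int) -> float:
    """Average monthly platform cost attributed to each active tenant."""
    if active_tenants <= 0:
        raise ValueError("active_tenants must be positive")
    return monthly_cost / active_tenants


def reduction_from_growth(current_tenants: int, target_tenants: int) -> float:
    """Fractional drop in per-tenant cost if total cost stays constant."""
    return 1 - current_tenants / target_tenants


if __name__ == "__main__":
    monthly_cost = 10_000  # illustrative input: the Production monthly budget
    print(f"At 50 tenants:  ${cost_per_tenant(monthly_cost, 50):,.2f}/tenant")
    print(f"At 500 tenants: ${cost_per_tenant(monthly_cost, 500):,.2f}/tenant")
    print(f"Reduction: {reduction_from_growth(50, 500):.0%}")
```

In practice total cost will rise somewhat with tenant count, so the achievable reduction depends on how much of the platform is shared fixed cost versus per-tenant variable cost.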
Per-Tenant Cost Attribution (KQL):
// Cost per tenant breakdown
AzureCostManagement
| where TimeGenerated >= startofmonth(now())
| where isnotempty(tostring(Tags["TenantId"]))
| extend TenantId = tostring(Tags["TenantId"])
| extend Service = tostring(Tags["Service"])
| summarize
    TotalCost = sum(Cost),
    StorageCost = sumif(Cost, Service == "storage"),
    ComputeCost = sumif(Cost, Service == "compute")
    by TenantId
| order by TotalCost desc
Cost Governance¶
Cost Governance Workflow¶
# Approval required for resources exceeding cost thresholds
costGovernance:
  thresholds:
    - resource: App Service Premium
      monthlyCost: $200
      approver: Lead Architect
    - resource: SQL Database Premium
      monthlyCost: $500
      approver: CTO
    - resource: AKS Node Pool
      monthlyCost: $1000
      approver: CFO
  process:
    - 1. Engineer submits Pulumi PR with new resource
    - 2. CI/CD calculates estimated monthly cost (Infracost)
    - 3. If cost > threshold, require approval
    - 4. Approver reviews cost justification
    - 5. If approved, Pulumi deploys resource with cost tags
    - 6. Monthly review of actual vs estimated costs
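Step 3 of the process (require approval when the estimate exceeds a threshold) reduces to a table lookup. The sketch below encodes the three thresholds listed above; the function name, and the choice to return `None` when no approval is needed or the resource type is not listed, are illustrative assumptions rather than the pipeline's actual behavior.

```python
from typing import Optional

# resource type -> (monthly cost threshold in USD, approver), from the governance table.
APPROVAL_THRESHOLDS = {
    "App Service Premium": (200, "Lead Architect"),
    "SQL Database Premium": (500, "CTO"),
    "AKS Node Pool": (1_000, "CFO"),
}


def required_approver(resource_type: str, estimated_monthly_cost: float) -> Optional[str]:
    """Return who must approve, or None if the estimate is within the threshold."""
    entry = APPROVAL_THRESHOLDS.get(resource_type)
    if entry is None:
        return None  # Unlisted resource types are not gated in this sketch
    threshold, approver = entry
    return approver if estimated_monthly_cost > threshold else None


if __name__ == "__main__":
    print(required_approver("SQL Database Premium", 1_860))
    print(required_approver("App Service Premium", 146))
```

A CI step could call a function like this with the estimated cost from the PR and block the merge until the named approver signs off.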
Cost Approval Workflow¶
Infracost Integration (CI/CD):
# azure-pipelines.yml
- task: InfracostTask@2
  inputs:
    path: 'infrastructure/pulumi'
    terraformVersion: '1.5.0'
    terraformWorkspace: 'default'
  displayName: 'Calculate infrastructure cost'
- task: InfracostComment@1
  inputs:
    behavior: 'update'
    path: 'infracost.json'
  displayName: 'Post cost comment to PR'
Best Practices & Recommendations¶
Cost Optimization Checklist¶
Monthly Review:
- [ ] Review cost dashboards for anomalies
- [ ] Identify unused or underutilized resources
- [ ] Review reserved instance utilization
- [ ] Check storage lifecycle policy effectiveness
- [ ] Verify autoscaling is working correctly
- [ ] Review cost per tenant trends
Quarterly Review:
- [ ] Right-size resources based on metrics
- [ ] Purchase/renew reserved instances
- [ ] Review and optimize storage tiers
- [ ] Conduct cost optimization sprint
- [ ] Review and update cost governance policies
Cost Levers (Top 10)¶
- Hot retention window (days in Hot before Warm)
- Warm index granularity (daily vs hourly partitions)
- Export frequency & size (number of eDiscovery/DSAR bundles)
- Cross-region reads/exports (egress)
- Cache TTL & hit rate (reduces Warm compute)
- Segment size/seal cadence (affects metadata overhead & verification cost)
- Index cardinality (number of fields indexed & distinct values)
- Compression & encoding (Parquet snappy/zstd; JSONL gzip)
- Rebuild-first vs snapshot retention for Warm
- Purge cadence/batch size (storage & compute churn)
Cost Calculator (Quick Estimate)¶
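Inputs:
- events_per_day
- avg_event_bytes
- hot_days
- warm_days
- export_gb_per_mo
- cross_region_gb_per_mo
The formulas also use warm_index_factor, archive_tb, the per-unit rates (rate_hot, rate_warm, rate_cold, rate_egress, rate_export_io), and the verify_cost and rebuild_cost terms, all of which must be supplied alongside the inputs above.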
Formulas:
hot_gb = events_per_day * avg_event_bytes * hot_days / (1024^3)
warm_gb = events_per_day * avg_event_bytes * warm_days / (1024^3) * warm_index_factor
storage_cost = hot_gb*rate_hot + warm_gb*rate_warm + archive_tb*rate_cold
egress_cost = cross_region_gb_per_mo * rate_egress
export_cost = export_gb_per_mo * rate_export_io
total = storage_cost + egress_cost + export_cost + verify_cost + rebuild_cost
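The formulas above translate directly into a small calculator. The sketch below is one possible transcription, with all rates passed in by the caller because the document does not fix their values; the `Rates` dataclass and `estimate_monthly_cost` function are hypothetical names, and `verify_cost` and `rebuild_cost` are treated as precomputed inputs.

```python
from dataclasses import dataclass

GIB = 1024 ** 3  # bytes per GB as used in the formulas (1024^3)


@dataclass
class Rates:
    hot: float        # USD per GB-month in Hot
    warm: float       # USD per GB-month in Warm
    cold: float       # USD per TB-month in Cold/Archive
    egress: float     # USD per GB of cross-region traffic
    export_io: float  # USD per GB exported


def estimate_monthly_cost(
    events_per_day: float,
    avg_event_bytes: float,
    hot_days: int,
    warm_days: int,
    warm_index_factor: float,
    archive_tb: float,
    export_gb_per_mo: float,
    cross_region_gb_per_mo: float,
    rates: Rates,
    verify_cost: float = 0.0,
    rebuild_cost: float = 0.0,
) -> float:
    """Quick monthly estimate following the formulas in this section."""
    hot_gb = events_per_day * avg_event_bytes * hot_days / GIB
    warm_gb = events_per_day * avg_event_bytes * warm_days / GIB * warm_index_factor
    storage_cost = hot_gb * rates.hot + warm_gb * rates.warm + archive_tb * rates.cold
    egress_cost = cross_region_gb_per_mo * rates.egress
    export_cost = export_gb_per_mo * rates.export_io
    return storage_cost + egress_cost + export_cost + verify_cost + rebuild_cost


if __name__ == "__main__":
    # Placeholder rates and volumes, chosen only to exercise the formulas.
    example_rates = Rates(hot=0.02, warm=0.01, cold=1.0, egress=0.1, export_io=0.05)
    total = estimate_monthly_cost(
        events_per_day=1_048_576, avg_event_bytes=1024,  # 1 GB of events per day
        hot_days=14, warm_days=90, warm_index_factor=0.5,
        archive_tb=2, export_gb_per_mo=20, cross_region_gb_per_mo=10,
        rates=example_rates,
    )
    print(f"Estimated total: ${total:,.2f}/month")
```

Varying one input at a time (for example hot_days) shows which of the cost levers listed earlier dominates for a given workload.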
Cost Optimization Runbooks¶
Monthly Cost Review Runbook¶
1. Gather Cost Data
   - Export Azure Cost Management report for current month
   - Review cost by environment, service, and tenant
   - Identify top cost drivers
2. Analyze Trends
   - Compare current month vs. previous month
   - Identify cost anomalies (>20% increase)
   - Review cost per tenant trends
3. Identify Optimization Opportunities
   - Unused resources (delete if safe)
   - Underutilized resources (right-size)
   - Storage tier optimization
   - Reserved instance opportunities
4. Take Action
   - Delete unused resources
   - Right-size underutilized resources
   - Apply storage lifecycle policies
   - Purchase reserved instances
5. Document Results
   - Record cost savings achieved
   - Update cost optimization log
   - Share findings with team
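The trend-analysis step (flag anything that rose more than 20% month over month) is simple to script against an exported cost report. The sketch below assumes the report has already been reduced to a mapping of service name to monthly cost; that input shape and the function name are assumptions for illustration.

```python
def cost_anomalies(previous: dict, current: dict, threshold: float = 0.20) -> dict:
    """Return {service: fractional increase} for services above the threshold.

    Services with no previous-month cost are skipped because a percentage
    change is undefined for them; review new services separately.
    """
    flagged = {}
    for service, current_cost in current.items():
        previous_cost = previous.get(service)
        if previous_cost is None or previous_cost <= 0:
            continue
        change = (current_cost - previous_cost) / previous_cost
        if change > threshold:
            flagged[service] = change
    return flagged


if __name__ == "__main__":
    last_month = {"gateway": 100.0, "query": 200.0}
    this_month = {"gateway": 130.0, "query": 210.0, "export": 50.0}
    for service, change in cost_anomalies(last_month, this_month).items():
        print(f"{service}: +{change:.0%} month over month")
```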
Cost Anomaly Response Runbook¶
1. Receive Alert
   - Cost anomaly alert triggered (>50% spike)
   - Review alert details (time, resource, amount)
2. Investigate
   - Check Azure Cost Management for resource breakdown
   - Review resource utilization metrics
   - Check for unexpected scaling or traffic spikes
3. Identify Root Cause
   - Resource misconfiguration
   - Traffic spike
   - Scaling issue
   - Storage lifecycle failure
4. Take Corrective Action
   - Fix misconfiguration
   - Scale down if appropriate
   - Apply cost optimizations
   - Prevent future occurrences
5. Document Incident
   - Root cause analysis
   - Actions taken
   - Cost impact
   - Prevention measures
Summary¶
ATP implements comprehensive cost optimization strategies across all environments:
- Environment Cost Profiles: Graduated from $500/month (Dev) to $10,000/month (Production)
- Dev Optimization: Shutdown evenings/weekends (60% uptime) → 40% savings
- Test Optimization: Shutdown nights (70% uptime) → 30% savings
- Reserved Instances: 1-year commitments → 20-30% savings ($13,956/year total)
- Spot Instances: Preview environments → 90% savings
- Storage Lifecycle: Automated hot → cool → archive transitions → 80% storage savings
- Cost Alerts: Budget thresholds (50%, 80%, 100%) and anomaly detection (50% spike)
- Tagging Strategy: Granular cost attribution per environment, service, team, tenant
- FinOps Culture: Monthly cost reviews, cost-per-feature metrics, governance workflows
Next Steps:
- Review and customize cost budgets for your organization
- Implement automated shutdown schedules for Dev/Test
- Purchase reserved instances for Production
- Configure storage lifecycle policies
- Set up cost monitoring and alerts
- Establish cost governance workflows
Document Version: 1.0
Last Updated: 2025-10-30
Maintained By: Platform Operations & FinOps Team