Cost Optimization & FinOps

Purpose & Scope

Purpose: Comprehensive guide for ATP's cost optimization strategies and FinOps practices, ensuring cost efficiency across all environments while maintaining operational requirements, performance, and compliance.

Scope: This document covers:

  • FinOps Principles: Cost transparency, optimization, governance, and culture
  • Environment Cost Profiles: Detailed cost breakdowns for Dev, Test, Staging, Production, Hotfix, and Preview environments
  • Cost Optimization Strategies: Automated shutdown schedules, reserved instances, spot instances, storage lifecycle policies, autoscaling
  • Right-Sizing Recommendations: AKS node pools, SQL databases, storage tiers, compute resources
  • Cost Monitoring & Budgets: Azure Cost Management, budget alerts, anomaly detection, cost dashboards
  • Cost Allocation: Per-tenant cost tracking, cost attribution, showback/chargeback
  • Cost Governance: Approval workflows, cost thresholds, policy enforcement
  • Storage Optimization: Hot/Warm/Cold tiering, lifecycle transitions, compression, deduplication

Audience: Platform operators, FinOps teams, engineering leads, finance teams, product managers

Relationship to Other Documents:

  • Infrastructure: See ../infrastructure/kubernetes.md for AKS sizing and node pool strategies
  • Environments: See ../ci-cd/environments.md for environment-specific configurations
  • Storage: See ../platform/data-residency-retention.md for storage tiering and lifecycle policies
  • Operations: See runbook.md for operational procedures


Table of Contents

  1. FinOps Principles
  2. Environment Cost Profiles
  3. Detailed Cost Breakdowns
  4. Cost Optimization Strategies
  5. Right-Sizing Recommendations
  6. Storage Optimization
  7. Cost Monitoring & Budgets
  8. Cost Allocation & Attribution
  9. Cost Governance
  10. Best Practices & Recommendations
  11. Cost Optimization Runbooks

FinOps Principles

ATP implements FinOps principles to balance operational requirements with cost efficiency:

  1. Visibility: Tag all resources; enable Cost Management; monthly reviews; cost transparency
  2. Optimization: Shutdown schedules, reserved instances, autoscaling, storage lifecycle, spot instances
  3. Governance: Budget alerts, Azure Policy enforcement, approval workflows for cost increases
  4. Culture: Cost awareness in development; cost-per-feature metrics; regular optimization sprints

FinOps Culture

Cost-Aware Development:

  • Review cost impact before adding new Azure resources
  • Use cost estimation tools (Infracost) in CI/CD pipelines
  • Monitor cost-per-feature and cost-per-tenant metrics
  • Run regular cost optimization sprints (quarterly)

Cost Metrics:

  • Cost per Tenant: Monthly cost divided by active tenants
  • Cost per Feature: Cost attribution to features/capabilities
  • Cost per Environment: Track costs by environment (Dev/Test/Staging/Prod)
  • Cost per Service: Breakdown by microservice/component


Environment Cost Profiles

ATP's cost model is graduated by environment with Dev/Test optimized for minimal cost and Production optimized for reliability within budget constraints.

Cost Profile Summary

| Environment | Monthly Budget | Primary Compute | SKU Tier | Scaling Strategy | Monitoring | Total Est. Monthly |
|-------------|----------------|-----------------|----------|------------------|------------|--------------------|
| Preview | $100 | Azure Container Instances | Dynamic | Per-PR ephemeral | N/A | $50-150 (variable) |
| Dev | $500 | App Service | Basic B1 (1 vCPU, 1.75 GB) | Fixed (1 instance) | Basic alerts | $400-600 |
| Test | $1,000 | App Service | Standard S1 (1 vCPU, 1.75 GB) | Fixed (2 instances) | Basic alerts | $900-1,200 |
| Staging | $3,000 | App Service | Premium P1v2 (1 vCPU, 3.5 GB) | Autoscale (2-5) | Enhanced alerts | $2,500-3,500 |
| Production | $10,000 | AKS (3-10 nodes) | Standard_D4s_v5 (4 vCPU, 16 GB) | Autoscale (3-10 nodes) | Full observability | $8,000-12,000 |
| Hotfix | $500 | App Service (on-demand) | Premium P1v3 (2 vCPU, 8 GB) | Fixed (1 instance) | Basic alerts | $0-500 (as-needed) |

Cost Profile Rationale:

  • Dev ($500): Cost-minimized with Basic SKU; shutdown evenings/weekends (40% savings)
  • Test ($1,000): Standard SKU for stable performance; 2 instances for load testing; shutdown nights
  • Staging ($3,000): Premium SKU for production-like validation; autoscaling; always-on
  • Production ($10,000): AKS for enterprise-grade scalability; reserved instances (20% savings); 99.9% SLA
  • Hotfix ($500): On-demand deployment only when needed; deleted after hotfix deployment


Detailed Cost Breakdowns

Dev Environment Monthly Costs

Compute (App Service Basic B1 × 1):          $13/month × 0.6 (60% uptime) = $8/month
SQL Database (Basic - 5 DTU):                $5/month
Redis Cache (Basic C0 - 250 MB):             $16/month
Service Bus (Basic):                         $5/month
Storage (LRS - 100 GB):                      $2/month
Key Vault (transactions):                    $1/month
Bandwidth:                                   $5/month
---
Subtotal:                                    $42/month

Actual with shutdown automation:             ~$25/month per service × 7 services = $175/month
Dev shared infrastructure:                   $300/month (VNet, NSG, Seq, etc.)
---
Total Dev Environment:                       $475/month

Cost Optimization (Dev):

  • Shutdown Schedule: Stop 6 PM - 8 AM weekdays, all weekend → 60% uptime → 40% savings
  • Shared Resources: VNet, NSG, Seq shared between Dev/Test → split cost
  • Basic SKUs: Minimum viable performance for development

Test Environment Monthly Costs

Compute (App Service Standard S1 × 2):       $70/month × 0.7 (70% uptime) × 2 = $98/month
SQL Database (Standard S1 - 20 DTU):         $30/month
Redis Cache (Standard C1 - 1 GB):            $75/month
Service Bus (Standard):                      $10/month
Storage (LRS - 500 GB):                      $10/month
Key Vault (transactions):                    $2/month
Application Insights (5 GB/month):           $12/month
Bandwidth:                                   $10/month
---
Subtotal:                                    $247/month

Actual with shutdown automation:             ~$173/month (70% uptime)
Test shared infrastructure:                  $300/month (shared with Dev)
---
Total Test Environment:                      $473/month

With reserved instances (1-year):            -$50/month (15% savings)
Net Test Environment:                        $423/month

Cost Optimization (Test):

  • Shutdown Schedule: Stop 10 PM - 6 AM weekdays → 70% uptime → 30% savings
  • Standard SKUs: Balance between cost and performance for testing

Staging Environment Monthly Costs

Compute (App Service Premium P1v2 × 2-5):    $146/month × 3 avg = $438/month
SQL Database (Standard S3 - 100 DTU):        $150/month
Redis Cache (Standard C2 - 2.5 GB):          $150/month
Service Bus (Standard - 2 messaging units):  $20/month
Storage (LRS - 1 TB):                        $20/month
Key Vault (transactions):                    $3/month
Application Insights (15 GB/month):          $30/month
Log Analytics (30-day retention, 10 GB/day): $300/month
Bandwidth:                                   $20/month
---
Total Staging Environment:                   $1,131/month

With reserved instances (1-year):            -$170/month (15% savings)
Net Staging Environment:                     $961/month

Cost Optimization (Staging):

  • Autoscaling: Scale 2-5 instances based on load → optimize for actual usage
  • Reserved Instances: 1-year commitment → 15% savings

Production Environment Monthly Costs

AKS Cluster (3-10 nodes, Standard_D4s_v5):
  - System pool (3 nodes, always on):        $200/month × 3 = $600/month
  - User pool (3-7 nodes, autoscale):        $200/month × 5 avg = $1,000/month
SQL Database (Premium P4 - 500 DTU):         $1,860/month
Cosmos DB (1000 RU/s provisioned):           $730/month
Redis Cache (Premium P3 - 26 GB):            $1,037/month
Service Bus (Premium - 4 messaging units):   $2,708/month
Storage (GRS + WORM - 10 TB):
  - Hot tier (0-90 days):                    $500/month
  - Cool tier (91 days - 7 years):           $100/month
Key Vault (HSM - 50 keys):                   $625/month
Application Insights (50 GB/month):          $115/month
Log Analytics (90-day retention, 30 GB/day): $900/month
Private Endpoints (10 × $7):                 $70/month
Application Gateway (v2 with WAF):           $250/month
Azure Firewall (Premium):                    $625/month
DDoS Protection Standard:                    $2,944/month
Bandwidth (outbound - 1 TB):                 $90/month
ACR (Premium - geo-replication):             $30/month
Prometheus + Grafana (self-hosted on AKS):   $50/month (storage only)
---
Total Production Environment:                $12,234/month

Reserved Instance Savings (1-year):          -$2,400/month (20% on compute/database)
---
Net Production Monthly Cost:                 $9,834/month

Cost Optimization (Production):

  • Reserved Instances: 1-year commitment for AKS nodes, SQL, Cosmos DB → 20-30% savings
  • Autoscaling: Scale down to 3 nodes during low-traffic hours → save ~$400/month
  • Storage Lifecycle: Auto-transition to cool tier after 90 days → save ~$300/month
  • DDoS Protection: Shared across all public endpoints in subscription
  • Application Insights Sampling: 10% adaptive sampling → reduce ingestion by 90%

Preview Environment Monthly Costs

Azure Container Instances (per-PR ephemeral):
  - Average PR duration: 2 hours
  - Average instances: 3
  - Cost per hour: $0.05/instance
  - Monthly PRs: 50
  - Total: $0.05 × 3 × 2 × 50 = $15/month

Spot Instances (optional):                   $1.50/month (90% savings)
---
Total Preview Environment:                   $15-50/month (highly variable)

Cost Optimization (Preview):

  • Spot Instances: Use Azure Spot VMs for non-critical workloads → 90% savings
  • Ephemeral: Containers destroyed after PR merge/close
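The Preview estimate is a product of four inputs, so it is easy to re-run when PR volume or duration changes. The sketch below is a minimal Python version of that arithmetic using the figures from the breakdown; the function name `preview_monthly_cost` and the `spot_discount` parameter are illustrative and not part of ATP's tooling.

```python
def preview_monthly_cost(rate_per_instance_hour: float, instances: int,
                         hours_per_pr: float, prs_per_month: int,
                         spot_discount: float = 0.0) -> float:
    """Estimate the monthly cost of per-PR ephemeral environments in USD.

    spot_discount is the fraction saved by running on Spot capacity (0.0 = none).
    """
    if not 0.0 <= spot_discount < 1.0:
        raise ValueError("spot_discount must be in [0, 1)")
    on_demand = rate_per_instance_hour * instances * hours_per_pr * prs_per_month
    return round(on_demand * (1.0 - spot_discount), 2)


# Figures from the breakdown: $0.05/hour, 3 instances, 2 hours per PR, 50 PRs.
on_demand_cost = preview_monthly_cost(0.05, 3, 2, 50)
spot_cost = preview_monthly_cost(0.05, 3, 2, 50, spot_discount=0.9)
print(f"On-demand: ${on_demand_cost}/month, Spot: ${spot_cost}/month")
```

Doubling either the PR count or the average PR duration doubles the estimate, which is why the total is described as highly variable.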


Cost Optimization Strategies

ATP implements automated cost optimization across all environments using Azure Policy, automation scripts, and IaC overlays.

Automated Shutdown Schedules

Purpose: Reduce compute costs in Dev/Test environments by shutting down resources during non-business hours.

Implementation (Azure Automation):

// ConnectSoft.ATP.Infrastructure/Automation/ShutdownSchedule.cs
public class ShutdownSchedule
{
    public static void ConfigureDevEnvironment(ResourceGroup resourceGroup)
    {
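        // Dev: Shutdown 6 PM - 8 AM weekdays, all weekend.
        // These schedules only define timing; each must still be linked to a
        // stop/start runbook (for example via a JobSchedule resource) to take effect.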
        var devShutdownSchedule = new Schedule("dev-shutdown-schedule", new ScheduleArgs
        {
            ResourceGroupName = resourceGroup.Name,
            Name = "dev-shutdown",
            Frequency = "Week",
            WeekDays = new[] { "Monday", "Tuesday", "Wednesday", "Thursday", "Friday" },
            StartTime = "18:00:00",  // 6 PM
            Description = "Shutdown Dev environment after business hours"
        });

        var devStartupSchedule = new Schedule("dev-startup-schedule", new ScheduleArgs
        {
            ResourceGroupName = resourceGroup.Name,
            Name = "dev-startup",
            Frequency = "Week",
            WeekDays = new[] { "Monday", "Tuesday", "Wednesday", "Thursday", "Friday" },
            StartTime = "08:00:00",  // 8 AM
            Description = "Startup Dev environment at beginning of business day"
        });
    }
}

Cost Savings:

  • Dev: 60% uptime → 40% savings (~$5/month per B1 App Service at $13/month)
  • Test: 70% uptime → 30% savings (~$21/month per S1 App Service at $70/month)
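To check how a shutdown window translates into an uptime fraction and a monthly cost, the arithmetic can be sketched as follows. This is a minimal illustration: the helper names are hypothetical, and whether Test stays on over weekends is an assumption, since the Test schedule only mentions weekday nights.

```python
HOURS_PER_WEEK = 7 * 24


def weekly_uptime_fraction(on_hours_per_weekday: float, weekend_on: bool) -> float:
    """Fraction of the week a resource runs under a fixed weekday window."""
    if not 0 <= on_hours_per_weekday <= 24:
        raise ValueError("on_hours_per_weekday must be between 0 and 24")
    weekday_hours = on_hours_per_weekday * 5
    weekend_hours = 48 if weekend_on else 0
    return (weekday_hours + weekend_hours) / HOURS_PER_WEEK


def cost_with_shutdown(full_monthly_cost: float, uptime_fraction: float) -> float:
    """Monthly cost in USD when the resource is billed only while running."""
    return round(full_monthly_cost * uptime_fraction, 2)


# Dev runs 8 AM - 6 PM on weekdays only; Test runs 6 AM - 10 PM on weekdays
# and (assumed here) all weekend.
dev_fraction = weekly_uptime_fraction(10, weekend_on=False)
test_fraction = weekly_uptime_fraction(16, weekend_on=True)
print(f"Dev uptime: {dev_fraction:.0%}, Test uptime: {test_fraction:.0%}")

# Planning figures used in the breakdowns: 60% (Dev) and 70% (Test).
print(cost_with_shutdown(13, 0.6), cost_with_shutdown(70, 0.7))
```

A strict 8 AM - 6 PM weekday window works out to roughly 30% of wall-clock hours, below the 60% planning figure used in the Dev breakdown, so the stated Dev savings are on the conservative side.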

Reserved Instances

Purpose: Commit to 1-year or 3-year terms for predictable workloads to achieve 20-30% savings.

Production Reserved Instances:

reservedInstances:
  commitment: 1-year (renew annually)
  resources:
    - type: AKS Standard_D4s_v5
      quantity: 3 (system pool, always on)
      monthlyCost: $600
      reservedCost: $480 (20% savings)
      annualSavings: $1,440

    - type: SQL Database Premium P4
      quantity: 1
      monthlyCost: $1,860
      reservedCost: $1,395 (25% savings)
      annualSavings: $5,580

    - type: Cosmos DB (1000 RU/s)
      quantity: 1
      monthlyCost: $730
      reservedCost: $511 (30% savings)
      annualSavings: $2,628

    - type: Redis Cache Premium P3
      quantity: 1
      monthlyCost: $1,037
      reservedCost: $830 (20% savings)
      annualSavings: $2,484

totalAnnualSavings: $12,132 (Production)
totalATPReservedInstanceSavings: $13,956/year
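The annual savings figures for Production follow directly from the pay-as-you-go and reserved monthly costs listed in the plan. A short Python sketch that reproduces them (the `Reservation` class is illustrative only):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reservation:
    name: str
    monthly_cost: int   # pay-as-you-go, USD/month
    reserved_cost: int  # with a 1-year reservation, USD/month

    @property
    def annual_savings(self) -> int:
        return (self.monthly_cost - self.reserved_cost) * 12

    @property
    def discount_pct(self) -> float:
        return round(100 * (1 - self.reserved_cost / self.monthly_cost), 1)


# Figures copied from the Production reserved instance plan.
PRODUCTION = [
    Reservation("AKS Standard_D4s_v5 x3", 600, 480),
    Reservation("SQL Database Premium P4", 1860, 1395),
    Reservation("Cosmos DB (1000 RU/s)", 730, 511),
    Reservation("Redis Cache Premium P3", 1037, 830),
]

for r in PRODUCTION:
    print(f"{r.name}: {r.discount_pct}% discount, ${r.annual_savings}/year")

total_annual_savings = sum(r.annual_savings for r in PRODUCTION)
print(f"Total Production savings: ${total_annual_savings}/year")
```

Keeping the inputs as monthly costs rather than percentages avoids rounding drift, since the Redis reserved cost is rounded to the nearest dollar.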

Purchase Reserved Instances (Azure CLI):

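#!/bin/bash
# purchase-reserved-instances.sh
set -euo pipefail

SUBSCRIPTION_ID="<azure-subscription-id>"
REGION="eastus"

echo "Purchasing Reserved Instances for ATP Production..."

# AKS Nodes (Standard_D4s_v5 × 3)
# This is the only command in the script that buys a reservation. It needs the
# `reservations` CLI extension, and the purchase call typically also expects a
# reservation order ID and a billing scope (for example "$SUBSCRIPTION_ID");
# confirm the required flags with `az reservations reservation-order purchase --help`.
az reservations reservation-order purchase \
  --reserved-resource-type "VirtualMachines" \
  --sku "Standard_D4s_v5" \
  --location "$REGION" \
  --quantity 3 \
  --term "P1Y" \
  --billing-plan "Monthly" \
  --display-name "ATP-Prod-AKS-RI-2025"

# SQL Database (Premium P4)
# Configures the service tier only; it does not purchase reserved capacity.
# Azure SQL reserved capacity is bought separately and applies to vCore-based tiers.
az sql db update \
  --resource-group "ConnectSoft-ATP-Prod-EUS-RG" \
  --server "atp-sql-prod-eus" \
  --name "ATP_Prod" \
  --compute-model "Provisioned" \
  --service-objective "P4" \
  --backup-storage-redundancy "Geo" \
  --zone-redundant true \
  --read-scale "Enabled"

# Cosmos DB (1000 RU/s)
# Sets provisioned throughput only; Cosmos DB reserved capacity is bought separately.
az cosmosdb sql container throughput update \
  --resource-group "ConnectSoft-ATP-Prod-EUS-RG" \
  --account-name "atp-cosmos-prod-eus" \
  --database-name "ATP" \
  --name "AuditEvents" \
  --throughput 1000

echo "✅ Reservation order submitted and resource tiers configured; savings will appear in next billing cycle"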

Spot Instances (Preview Environments)

Purpose: 90% cost savings for ephemeral Preview environments using Azure Spot VMs.

Implementation:

// AKS Node Pool with Spot instances
var spotNodePool = new KubernetesClusterNodePool("spot-pool", new KubernetesClusterNodePoolArgs
{
    KubernetesClusterId = aksCluster.Id,
    Name = "spotpool",
    VmSize = "Standard_D8s_v5",
    NodeCount = 0,
    MinCount = 0,
    MaxCount = 10,
    EnableAutoScaling = true,
    Priority = "Spot",
    EvictionPolicy = "Delete",
    SpotMaxPrice = 0.05,  // Max $0.05/hour (90% discount)
    NodeLabels = new Dictionary<string, string>
    {
        ["workload"] = "preview",
        ["ephemeral"] = "true"
    }
});

Cost Savings: 90% discount vs. regular VMs ($0.50/hour → $0.05/hour)

Storage Lifecycle Policies

Purpose: Automatically transition data from Hot → Cool → Archive tiers to reduce storage costs by up to 80%.

Storage Lifecycle Policy (lifecycle-policy.json; 2,555 days ≈ 7 years, and the delete action applies only to blobs not on legal hold). JSON does not allow comments, so the file must not contain any when passed to the CLI:

{
  "rules": [
    {
      "enabled": true,
      "name": "MoveLogsToCoolAfter90Days",
      "type": "Lifecycle",
      "definition": {
        "filters": {
          "blobTypes": ["blockBlob"],
          "prefixMatch": ["logs/"]
        },
        "actions": {
          "baseBlob": {
            "tierToCool": {
              "daysAfterModificationGreaterThan": 90
            },
            "tierToArchive": {
              "daysAfterModificationGreaterThan": 2555
            },
            "delete": {
              "daysAfterModificationGreaterThan": 2555
            }
          }
        }
      }
    }
  ]
}

Storage Cost Savings (Production):

Hot Storage (0-90 days): 900 GB × $0.0184/GB = $16.56/month
Cool Storage (91 days - 7 years): 25,200 GB × $0.01/GB = $252/month
Archive Storage (7+ years): 100,000 GB × $0.002/GB = $200/month

Without Lifecycle Policy (all hot): 126,100 GB × $0.0184/GB = $2,320/month
With Lifecycle Policy: $16.56 + $252 + $200 = $468.56/month
Total Savings: $1,851.44/month (80% savings on storage)
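The comparison above can be recomputed for other volumes with a few lines of Python. This is a sketch using the per-GB rates and volumes quoted in this section; the function names are illustrative.

```python
# USD per GB-month, as quoted in this section.
RATES = {"hot": 0.0184, "cool": 0.01, "archive": 0.002}


def tiered_cost(gb_by_tier: dict, rates: dict = RATES) -> float:
    """Monthly cost when each tier is billed at its own rate."""
    return round(sum(gb * rates[tier] for tier, gb in gb_by_tier.items()), 2)


def all_hot_cost(gb_by_tier: dict, rates: dict = RATES) -> float:
    """Monthly cost if every GB stayed in the hot tier."""
    return round(sum(gb_by_tier.values()) * rates["hot"], 2)


volumes = {"hot": 900, "cool": 25_200, "archive": 100_000}
with_policy = tiered_cost(volumes)
without_policy = all_hot_cost(volumes)
savings_pct = round(100 * (1 - with_policy / without_policy), 1)
print(f"With policy: ${with_policy}, all hot: ${without_policy}, savings: {savings_pct}%")
```

The estimate covers capacity charges only; tier transition, early deletion, and rehydration fees are not modeled and would reduce the net saving.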

Apply Lifecycle Policy (Azure CLI):

#!/bin/bash
# apply-storage-lifecycle-policy.sh

STORAGE_ACCOUNT="atpstorageprodeus"
RESOURCE_GROUP="ATP-Prod-RG"

echo "Applying storage lifecycle policy..."

az storage account management-policy create \
  --account-name "$STORAGE_ACCOUNT" \
  --resource-group "$RESOURCE_GROUP" \
  --policy @lifecycle-policy.json

echo "✅ Storage lifecycle policy applied"

Autoscaling

Purpose: Automatically scale resources based on demand to optimize costs during low-traffic periods.

AKS Cluster Autoscaling:

// Production AKS cluster with autoscaling
var aksCluster = new ManagedCluster("atp-aks-prod", new ManagedClusterArgs
{
    // ... other configuration ...

    AgentPoolProfiles = new[]
    {
        // System pool (fixed - always on)
        new ManagedClusterAgentPoolProfileArgs
        {
            Name = "system",
            Mode = "System",
            Count = 3,
            VmSize = "Standard_D4s_v5",
            EnableAutoScaling = false  // Fixed size; MinCount/MaxCount apply only when autoscaling is enabled
        },

        // User pool (autoscale based on demand)
        new ManagedClusterAgentPoolProfileArgs
        {
            Name = "user",
            Mode = "User",
            Count = 3,
            VmSize = "Standard_D4s_v5",
            EnableAutoScaling = true,
            MinCount = 3,
            MaxCount = 10
        }
    },

    // Scale-down tuning is a cluster-wide autoscaler setting, not a per-pool one
    AutoScalerProfile = new ManagedClusterPropertiesAutoScalerProfileArgs
    {
        ScaleDownDelayAfterAdd = "10m",        // Wait 10 min after a scale-up before scaling down
        ScaleDownUtilizationThreshold = "0.5"  // Node becomes a scale-down candidate below 50% utilization
    }
});

Cost Savings: Scale down to 3 nodes during low-traffic hours → save ~$400/month

Application Insights Sampling

Purpose: Reduce telemetry ingestion costs by 90% while maintaining error visibility.

Adaptive Sampling Configuration:

// Production Application Insights with adaptive sampling
services.AddApplicationInsightsTelemetry(options =>
{
    options.ConnectionString = appInsightsConnectionString;
    options.EnableAdaptiveSampling = false;  // Turn off the default so the custom settings below apply
    options.EnablePerformanceCounterCollectionModule = true;
    options.EnableDependencyTrackingTelemetryModule = true;
    options.EnableEventCounterCollectionModule = true;
});

// Adaptive sampling is a telemetry processor, so it is configured on the processor chain
services.Configure<TelemetryConfiguration>(telemetryConfiguration =>
{
    var chainBuilder = telemetryConfiguration.DefaultTelemetrySink.TelemetryProcessorChainBuilder;
    chainBuilder.UseAdaptiveSampling(
        new SamplingPercentageEstimatorSettings
        {
            MaxTelemetryItemsPerSecond = 10,  // Target item rate per host (a rate, not a percentage)
            InitialSamplingPercentage = 10    // Start at 10% until the estimator adapts
        },
        null,
        excludedTypes: "Event;Exception");    // Never sample custom events or exceptions
    chainBuilder.Build();
});

Cost Savings: Reduce ingestion from 500 GB/month to 50 GB/month → save ~$1,035/month at the ~$2.30/GB rate implied by the Production breakdown (50 GB ≈ $115)


Right-Sizing Recommendations

AKS Node Pool Sizing

System Node Pool:

  • VM Size: Standard_D4s_v5 (4 vCPU, 16 GB RAM)
  • Count: 3 (fixed, one per AZ)
  • Purpose: Control plane components, mesh, KEDA, FluxCD, OTel
  • Cost: ~$600/month (always on)

User Node Pool:

  • VM Size: Standard_D4s_v5 (4 vCPU, 16 GB RAM)
  • Count: 3-10 (autoscale)
  • Purpose: Stateless APIs (Gateway, Query, Admin, Policy)
  • Cost: ~$1,000/month (avg 5 nodes)

I/O Node Pool:

  • VM Size: Standard_E8s_v5 (8 vCPU, 64 GB RAM, premium storage)
  • Count: 2-50 (autoscale)
  • Purpose: I/O-heavy (Ingestion, Projection, Export, Integrity)
  • Cost: ~$1,200/month (avg 3 nodes)

Jobs Node Pool (Optional):

  • VM Size: Standard_F16s_v2 (16 vCPU, 32 GB RAM, compute-optimized)
  • Count: 0-20 (KEDA scale to zero)
  • Purpose: Export jobs, maintenance tasks, compliance reports
  • Cost: ~$500/month (on-demand only)

Spot Node Pool (Optional):

  • VM Size: Standard_D8s_v5
  • Count: 0-10 (autoscale)
  • Priority: Spot (90% discount)
  • Purpose: Non-critical workloads (dev/test projections, backfills)
  • Cost: ~$50/month (90% savings)

SQL Database Sizing

| Environment | SKU | DTU/vCores | Monthly Cost | Rationale |
|-------------|-----|------------|--------------|-----------|
| Dev | Basic | 5 DTU | $5 | Minimal for development |
| Test | Standard S1 | 20 DTU | $30 | Stable performance for testing |
| Staging | Standard S3 | 100 DTU | $150 | Production-like validation |
| Production | Premium P4 | 500 DTU | $1,860 | Enterprise-grade performance |

Right-Sizing Recommendations:

  • Start Small: Begin with lower SKUs and scale up based on metrics
  • Monitor DTU Usage: If consistently >80%, consider scaling up
  • Reserved Instances: Use for Production → 25% savings

Redis Cache Sizing

| Environment | SKU | Size | Monthly Cost | Rationale |
|-------------|-----|------|--------------|-----------|
| Dev | Basic C0 | 250 MB | $16 | Minimal for development |
| Test | Standard C1 | 1 GB | $75 | Standard for testing |
| Staging | Standard C2 | 2.5 GB | $150 | Production-like validation |
| Production | Premium P3 | 26 GB | $1,037 | Enterprise-grade with persistence |

Right-Sizing Recommendations:

  • Monitor Memory Usage: If consistently >80%, consider scaling up
  • Use Premium for Production: Persistence and high availability required

Storage Tier Selection

| Tier | Use Case | Cost/GB/Month | Access Time |
|------|----------|---------------|-------------|
| Hot | Frequently accessed (0-90 days) | $0.0184 | <1 ms |
| Cool | Infrequently accessed (91 days - 7 years) | $0.01 | <30 ms |
| Archive | Rarely accessed (7+ years) | $0.002 | Hours (rehydration) |

Recommendation: Use lifecycle policies to automatically transition data to lower-cost tiers


Storage Optimization

Hot/Warm/Cold Tiering Strategy

Tier Definitions:

  • Hot (Append/WORM): Authoritative segments and recent anchors; low latency, high IOPS, highest cost
  • Warm (Read Models): Projections and indexes optimized for query; rebuilt from hot as needed; medium cost
  • Cold (Archive/Export): Immutable object storage with long retention; bulk throughput; lowest cost

Lifecycle Transitions:

tiering:
  hot:
    targetWindow: P14D  # 14 days in hot tier
  warm:
    targetWindow: P90D  # 90 days in warm tier
    rebuildFirst: true  # Prefer re-project vs snapshot storage
  cold:
    storageClass: archive_immutable
    lifecycle:
      transitionAfter: P90D  # Move to cold after 90 days
      deleteAfter: P10Y  # Delete after 10 years (if not on legal hold)
  residency:
    crossRegionHydrate: deny  # Never hydrate across region families

Cost Savings:

  • Hot → Cool transition: ~45% savings
  • Cool → Archive transition: ~80% savings
  • Total lifecycle savings: ~80% on long-term storage
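One way to read the tiering windows is as an age-based classification of each segment. The Python sketch below assumes that all windows are measured from segment creation and that P10Y can be approximated as 3,650 days; both are assumptions for illustration, and `tier_for` is a hypothetical helper rather than ATP code.

```python
from datetime import date, timedelta

HOT_WINDOW = timedelta(days=14)       # tiering.hot.targetWindow = P14D
WARM_WINDOW = timedelta(days=90)      # tiering.warm.targetWindow = P90D
DELETE_AFTER = timedelta(days=3650)   # tiering.cold.lifecycle.deleteAfter = P10Y (approximated)


def tier_for(created: date, today: date, legal_hold: bool = False) -> str:
    """Return the tier a segment should occupy, or "delete" once retention has expired."""
    age = today - created
    if age < timedelta(0):
        raise ValueError("created date is in the future")
    if age <= HOT_WINDOW:
        return "hot"
    if age <= WARM_WINDOW:
        return "warm"
    if age > DELETE_AFTER and not legal_hold:
        return "delete"
    return "cold"


today = date(2025, 6, 1)
for created in (date(2025, 5, 25), date(2025, 4, 1), date(2024, 1, 1), date(2010, 1, 1)):
    print(created, "->", tier_for(created, today))
```

Segments under legal hold stay in cold storage regardless of age, matching the "if not on legal hold" condition on deletion.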

Compression & Deduplication

Compression Strategies:

  • JSONL: GZIP compression (5-10x reduction)
  • Parquet: Snappy compression (columnar, 3-5x reduction)
  • Segment Storage: Compress before upload to blob storage

Cost Impact: Reduce storage by 70-80% → proportional cost savings
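Compression ratios depend heavily on the data, so it is worth measuring them on a representative sample before relying on the figures above. A minimal Python sketch using only the standard library; the record fields are invented for illustration and do not reflect ATP's event schema.

```python
import gzip
import json


def gzip_ratio(records: list) -> float:
    """Return uncompressed size divided by gzip-compressed size for a JSONL payload."""
    raw = "\n".join(json.dumps(r, sort_keys=True) for r in records).encode("utf-8")
    compressed = gzip.compress(raw)
    return len(raw) / len(compressed)


# Hypothetical sample: 1,000 similar audit-style records.
sample = [
    {"tenantId": f"tenant-{i % 5}", "action": "user.login", "result": "success", "sequence": i}
    for i in range(1000)
]
ratio = gzip_ratio(sample)
print(f"JSONL gzip ratio: {ratio:.1f}x")
```

Synthetic records like these are far more repetitive than real events, so the measured ratio will overstate what production segments achieve; run the same function over exported segments for a realistic number.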


Cost Monitoring & Budgets

Azure Cost Management Budgets

Budget Configuration (Pulumi):

// Production Budget
var prodBudget = new Budget("atp-budget-prod", new BudgetArgs
{
    BudgetName = "atp-budget-prod",
    ResourceGroupName = prodResourceGroup.Name,
    Amount = 10000,  // $10,000/month
    TimeGrain = "Monthly",
    TimePeriod = new BudgetTimePeriodArgs
    {
        StartDate = "2025-01-01",
        EndDate = "2025-12-31"
    },
    Category = "Cost",
    Notifications = new InputMap<NotificationArgs>
    {
        ["Alert50Percent"] = new NotificationArgs
        {
            Enabled = true,
            Operator = "GreaterThanOrEqualTo",
            Threshold = 50,
            ContactEmails = new[] { "platform-team@connectsoft.example" },
            ThresholdType = "Actual"
        },
        ["Alert80Percent"] = new NotificationArgs
        {
            Enabled = true,
            Operator = "GreaterThanOrEqualTo",
            Threshold = 80,
            ContactEmails = new[] { "platform-team@connectsoft.example", "finance@connectsoft.example" },
            ContactRoles = new[] { "Owner" },
            ThresholdType = "Actual"
        },
        ["Alert100Percent"] = new NotificationArgs
        {
            Enabled = true,
            Operator = "GreaterThanOrEqualTo",
            Threshold = 100,
            ContactEmails = new[] { "cfo@connectsoft.example", "platform-team@connectsoft.example" },
            ContactRoles = new[] { "Owner" },
            ThresholdType = "Actual",
            ContactActions = new[] { "CreateIncident" }  // Auto-create P1 incident
        }
    }
});
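The notification thresholds above can be expressed as a small function, which is useful when testing alert routing without waiting for real spend. This Python sketch mirrors the 50/80/100 percent thresholds and the GreaterThanOrEqualTo operator; `fired_alerts` is an illustrative name.

```python
THRESHOLDS = (50, 80, 100)  # percent of budget, matching the notifications above


def fired_alerts(actual_spend: float, budget: float, thresholds=THRESHOLDS) -> list:
    """Return the thresholds (in percent) that the actual spend has reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    percent_used = 100 * actual_spend / budget
    return [t for t in thresholds if percent_used >= t]


# Production budget is $10,000/month.
for spend in (4_999, 5_000, 8_200, 12_000):
    print(spend, "->", fired_alerts(spend, 10_000))
```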

Cost Anomaly Detection

Anomaly Alert (Azure Monitor):

// Cost anomaly alert (50% spike in single day)
var costAnomalyAlert = new MetricAlert("atp-cost-anomaly-alert-prod", new MetricAlertArgs
{
    AlertRuleName = "atp-cost-anomaly-prod",
    ResourceGroupName = prodResourceGroup.Name,
    Location = "global",
    Description = "Alert when Production environment cost spikes >50% in 24 hours",
    Severity = 1,  // High severity
    Enabled = true,
    Scopes = new[] { prodResourceGroup.Id },
    EvaluationFrequency = "PT1H",  // Evaluate every hour
    WindowSize = "PT24H",  // 24-hour window
    Criteria = new MetricAlertMultipleResourceMultipleMetricCriteriaArgs
    {
        OdataType = "Microsoft.Azure.Monitor.MultipleResourceMultipleMetricCriteria",
        AllOf = new[]
        {
            new DynamicMetricCriteriaArgs
            {
                Name = "CostAnomaly",
                MetricName = "ActualCost",
                MetricNamespace = "Microsoft.CostManagement/budgets",
                Operator = "GreaterThan",
                AlertSensitivity = "Medium",
                DynamicThresholdFailingPeriods = new DynamicThresholdFailingPeriodsArgs
                {
                    NumberOfEvaluationPeriods = 4,
                    MinFailingPeriodsToAlert = 2
                },
                TimeAggregation = "Total"
            }
        }
    }
});
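The intent of the alert, flagging a spike of more than 50% within a day, can be illustrated with a simple day-over-day comparison. This Python sketch is a fixed-threshold check for illustration only; it is not the dynamic-threshold algorithm used by Azure Monitor, and the daily cost series is made up.

```python
def cost_spikes(daily_costs: list, threshold: float = 0.5) -> list:
    """Return indexes of days whose cost exceeds the previous day's by more than threshold."""
    spikes = []
    for i in range(1, len(daily_costs)):
        previous, current = daily_costs[i - 1], daily_costs[i]
        if previous > 0 and (current - previous) / previous > threshold:
            spikes.append(i)
    return spikes


# Hypothetical daily Production costs in USD.
series = [300, 310, 480, 470, 465]
print(cost_spikes(series))
```

A fixed day-over-day rule is easy to reason about but noisy around month-start billing events, which is one reason to prefer a sensitivity-based detector in production.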

Cost Dashboards

KQL Query for Cost Attribution:

// Cost breakdown by Environment and Service
AzureCostManagement
| where TimeGenerated >= startofmonth(now())
| extend Environment = tostring(Tags["Environment"])
| extend Service = tostring(Tags["Service"])
| extend CostCenter = tostring(Tags["CostCenter"])
| summarize TotalCost = sum(Cost) by Environment, Service, CostCenter
| order by TotalCost desc

Cost Allocation & Attribution

Resource Tagging Strategy

Required Tags:

tags:
  Environment: production|staging|test|dev|preview|hotfix
  Service: gateway|ingestion|query|policy|export|admin
  Team: platform|engineering|ops
  CostCenter: atp-production|atp-staging
  TenantId: <tenant-id>  # For multi-tenant resources
  Edition: free|standard|enterprise
  Region: eastus|westus|westeurope

Tagging Example (Pulumi):

var prodTags = new InputMap<string>
{
    ["Environment"] = "production",
    ["Service"] = "gateway",
    ["Team"] = "platform",
    ["CostCenter"] = "atp-production",
    ["Region"] = "eastus",
    ["Compliance"] = "soc2-gdpr-hipaa"
};

var prodAppService = new WebApp("atp-gateway-prod-eus", new WebAppArgs
{
    // ... resource configuration ...
    Tags = prodTags
});
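Tag coverage only helps attribution if it is enforced, so a pre-deployment check is a natural companion to the tagging strategy. The Python sketch below validates a tag map against the values listed under Required Tags; treating TenantId and Edition as optional is an assumption (the Pulumi example omits them), and `tag_violations` is a hypothetical helper.

```python
from typing import Dict, List

# Assumed mandatory on every resource; TenantId and Edition are treated as optional here.
REQUIRED_TAGS = ("Environment", "Service", "Team", "CostCenter", "Region")

ALLOWED_VALUES = {
    "Environment": {"production", "staging", "test", "dev", "preview", "hotfix"},
    "Service": {"gateway", "ingestion", "query", "policy", "export", "admin"},
    "Team": {"platform", "engineering", "ops"},
    "Region": {"eastus", "westus", "westeurope"},
}


def tag_violations(tags: Dict[str, str]) -> List[str]:
    """Return human-readable problems; an empty list means the tags pass."""
    problems = [f"missing tag: {name}" for name in REQUIRED_TAGS if name not in tags]
    for name, allowed in ALLOWED_VALUES.items():
        value = tags.get(name)
        if value is not None and value not in allowed:
            problems.append(f"invalid {name}: {value}")
    return problems


prod_tags = {
    "Environment": "production",
    "Service": "gateway",
    "Team": "platform",
    "CostCenter": "atp-production",
    "Region": "eastus",
    "Compliance": "soc2-gdpr-hipaa",
}
print(tag_violations(prod_tags))
print(tag_violations({"Environment": "prod"}))
```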

Per-Tenant Cost Tracking

Cost per Tenant (Production):

Total Production Monthly Cost: $9,834
Active Tenants (production): 50
Cost per Tenant: $9,834 / 50 = $196.68/month

Target Cost per Tenant (with 500 tenants): $9,834 / 500 = $19.67/month
Required Optimization: 90% reduction through economies of scale and multi-tenancy

Per-Tenant Cost Attribution (KQL):

// Cost per tenant breakdown
AzureCostManagement
| where TimeGenerated >= startofmonth(now())
| where Tags["TenantId"] != ""
| extend TenantId = tostring(Tags["TenantId"])
| extend Service = tostring(Tags["Service"])
| summarize 
    TotalCost = sum(Cost),
    StorageCost = sumif(Cost, Service == "storage"),
    ComputeCost = sumif(Cost, Service == "compute")
    by TenantId
| order by TotalCost desc
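The same per-tenant rollup can be done offline on an exported cost report. This Python sketch assumes each row is a dict with a `cost` value and a `tags` map; that row shape is hypothetical and would need to be adapted to the actual export format.

```python
from collections import defaultdict


def cost_by_tenant(rows: list) -> list:
    """Sum cost per TenantId tag and return (tenant, total) pairs, highest first."""
    totals = defaultdict(float)
    for row in rows:
        tenant = row.get("tags", {}).get("TenantId")
        if tenant:  # skip untagged or shared resources
            totals[tenant] += row["cost"]
    return sorted(
        ((tenant, round(total, 2)) for tenant, total in totals.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )


# Hypothetical exported rows.
rows = [
    {"cost": 120.50, "tags": {"TenantId": "t-001", "Service": "ingestion"}},
    {"cost": 30.25, "tags": {"TenantId": "t-002", "Service": "query"}},
    {"cost": 45.00, "tags": {"TenantId": "t-001", "Service": "export"}},
    {"cost": 900.00, "tags": {"Service": "gateway"}},  # shared, no TenantId
]
print(cost_by_tenant(rows))
```

Rows without a TenantId tag are excluded, so shared platform cost has to be allocated separately, for example by dividing it across active tenants.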

Cost Governance

Cost Governance Workflow

# Approval required for resources exceeding cost thresholds
costGovernance:
  thresholds:
    - resource: App Service Premium
      monthlyCost: $200
      approver: Lead Architect

    - resource: SQL Database Premium
      monthlyCost: $500
      approver: CTO

    - resource: AKS Node Pool
      monthlyCost: $1000
      approver: CFO

  process:
    1. Engineer submits Pulumi PR with new resource
    2. CI/CD calculates estimated monthly cost (Infracost)
    3. If cost > threshold, require approval
    4. Approver reviews cost justification
    5. If approved, Pulumi deploys resource with cost tags
    6. Monthly review of actual vs estimated costs
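Step 3 of the process, deciding whether an estimate needs sign-off, reduces to a lookup against the thresholds above. A minimal Python sketch, assuming approval is required only when the estimate strictly exceeds the threshold; `required_approver` is an illustrative name.

```python
from typing import Optional

# Thresholds and approvers copied from the governance configuration above (USD/month).
APPROVAL_THRESHOLDS = {
    "App Service Premium": (200, "Lead Architect"),
    "SQL Database Premium": (500, "CTO"),
    "AKS Node Pool": (1000, "CFO"),
}


def required_approver(resource: str, estimated_monthly_cost: float) -> Optional[str]:
    """Return who must approve the change, or None if no approval is needed."""
    rule = APPROVAL_THRESHOLDS.get(resource)
    if rule is None:
        return None  # no threshold defined for this resource type
    limit, approver = rule
    return approver if estimated_monthly_cost > limit else None


print(required_approver("SQL Database Premium", 1860))
print(required_approver("App Service Premium", 146))
```

In a pipeline, the estimated cost would come from the cost estimation step and a non-None result would block the merge until the named approver signs off.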

Cost Approval Workflow

Infracost Integration (CI/CD):

# azure-pipelines.yml
- task: InfracostTask@2
  inputs:
    path: 'infrastructure/pulumi'
    terraformVersion: '1.5.0'
    terraformWorkspace: 'default'
  displayName: 'Calculate infrastructure cost'

- task: InfracostComment@1
  inputs:
    behavior: 'update'
    path: 'infracost.json'
  displayName: 'Post cost comment to PR'

Best Practices & Recommendations

Cost Optimization Checklist

Monthly Review:

  • [ ] Review cost dashboards for anomalies
  • [ ] Identify unused or underutilized resources
  • [ ] Review reserved instance utilization
  • [ ] Check storage lifecycle policy effectiveness
  • [ ] Verify autoscaling is working correctly
  • [ ] Review cost per tenant trends

Quarterly Review:

  • [ ] Right-size resources based on metrics
  • [ ] Purchase/renew reserved instances
  • [ ] Review and optimize storage tiers
  • [ ] Conduct cost optimization sprint
  • [ ] Review and update cost governance policies

Cost Levers (Top 10)

  1. Hot retention window (days in Hot before Warm)
  2. Warm index granularity (daily vs hourly partitions)
  3. Export frequency & size (number of eDiscovery/DSAR bundles)
  4. Cross-region reads/exports (egress)
  5. Cache TTL & hit rate (reduces Warm compute)
  6. Segment size/seal cadence (affects metadata overhead & verification cost)
  7. Index cardinality (number of fields indexed & distinct values)
  8. Compression & encoding (Parquet snappy/zstd; JSONL gzip)
  9. Rebuild-first vs snapshot retention for Warm
  10. Purge cadence/batch size (storage & compute churn)

Cost Calculator (Quick Estimate)

Inputs:

  • events_per_day
  • avg_event_bytes
  • hot_days
  • warm_days
  • export_gb_per_mo
  • cross_region_gb_per_mo

Formulas:

hot_gb  = events_per_day * avg_event_bytes * hot_days  / (1024^3)
warm_gb = events_per_day * avg_event_bytes * warm_days / (1024^3) * warm_index_factor
storage_cost = hot_gb*rate_hot + warm_gb*rate_warm + archive_tb*rate_cold
egress_cost  = cross_region_gb_per_mo * rate_egress
export_cost  = export_gb_per_mo * rate_export_io
total        = storage_cost + egress_cost + export_cost + verify_cost + rebuild_cost
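The formulas above translate directly into code. The Python sketch below implements them as written; `warm_index_factor`, `archive_tb`, `verify_cost`, and `rebuild_cost` appear in the formulas but not in the input list, so they are exposed as parameters with neutral defaults, and every rate in the example is a placeholder rather than a quoted price.

```python
from dataclasses import dataclass

BYTES_PER_GB = 1024 ** 3  # the formulas divide by 1024^3


@dataclass(frozen=True)
class Rates:
    hot: float        # USD per GB-month
    warm: float       # USD per GB-month
    cold: float       # USD per TB-month (the formula multiplies it by archive_tb)
    egress: float     # USD per GB
    export_io: float  # USD per GB


def quick_estimate(events_per_day, avg_event_bytes, hot_days, warm_days,
                   export_gb_per_mo, cross_region_gb_per_mo, rates,
                   warm_index_factor=1.0, archive_tb=0.0,
                   verify_cost=0.0, rebuild_cost=0.0) -> dict:
    """Apply the quick-estimate formulas and return each component plus the total."""
    daily_gb = events_per_day * avg_event_bytes / BYTES_PER_GB
    hot_gb = daily_gb * hot_days
    warm_gb = daily_gb * warm_days * warm_index_factor
    storage_cost = hot_gb * rates.hot + warm_gb * rates.warm + archive_tb * rates.cold
    egress_cost = cross_region_gb_per_mo * rates.egress
    export_cost = export_gb_per_mo * rates.export_io
    total = storage_cost + egress_cost + export_cost + verify_cost + rebuild_cost
    return {"hot_gb": hot_gb, "warm_gb": warm_gb, "storage_cost": storage_cost,
            "egress_cost": egress_cost, "export_cost": export_cost, "total": total}


# Placeholder rates and workload: 1,048,576 events/day at 1 KiB each is 1 GB/day.
example_rates = Rates(hot=0.0184, warm=0.01, cold=2.0, egress=0.02, export_io=0.01)
estimate = quick_estimate(1_048_576, 1024, hot_days=14, warm_days=90,
                          export_gb_per_mo=50, cross_region_gb_per_mo=100,
                          rates=example_rates, warm_index_factor=1.5)
print({key: round(value, 4) for key, value in estimate.items()})
```

Because hot and warm volumes scale linearly with their retention windows, shortening the hot window is usually the quickest lever to test with this calculator.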

Cost Optimization Runbooks

Monthly Cost Review Runbook

  1. Gather Cost Data
       • Export Azure Cost Management report for current month
       • Review cost by environment, service, and tenant
       • Identify top cost drivers
  2. Analyze Trends
       • Compare current month vs. previous month
       • Identify cost anomalies (>20% increase)
       • Review cost per tenant trends
  3. Identify Optimization Opportunities
       • Unused resources (delete if safe)
       • Underutilized resources (right-size)
       • Storage tier optimization
       • Reserved instance opportunities
  4. Take Action
       • Delete unused resources
       • Right-size underutilized resources
       • Apply storage lifecycle policies
       • Purchase reserved instances
  5. Document Results
       • Record cost savings achieved
       • Update cost optimization log
       • Share findings with team

Cost Anomaly Response Runbook

  1. Receive Alert
       • Cost anomaly alert triggered (>50% spike)
       • Review alert details (time, resource, amount)
  2. Investigate
       • Check Azure Cost Management for resource breakdown
       • Review resource utilization metrics
       • Check for unexpected scaling or traffic spikes
  3. Identify Root Cause
       • Resource misconfiguration
       • Traffic spike
       • Scaling issue
       • Storage lifecycle failure
  4. Take Corrective Action
       • Fix misconfiguration
       • Scale down if appropriate
       • Apply cost optimizations
       • Prevent future occurrences
  5. Document Incident
       • Root cause analysis
       • Actions taken
       • Cost impact
       • Prevention measures

Summary

ATP implements comprehensive cost optimization strategies across all environments:

  • Environment Cost Profiles: Graduated from $500/month (Dev) to $10,000/month (Production)
  • Dev Optimization: Shutdown evenings/weekends (60% uptime) → 40% savings
  • Test Optimization: Shutdown nights (70% uptime) → 30% savings
  • Reserved Instances: 1-year commitments → 20-30% savings ($13,956/year total)
  • Spot Instances: Preview environments → 90% savings
  • Storage Lifecycle: Automated hot → cool → archive transitions → 80% storage savings
  • Cost Alerts: Budget thresholds (50%, 80%, 100%) and anomaly detection (50% spike)
  • Tagging Strategy: Granular cost attribution per environment, service, team, tenant
  • FinOps Culture: Monthly cost reviews, cost-per-feature metrics, governance workflows

Next Steps:

  • Review and customize cost budgets for your organization
  • Implement automated shutdown schedules for Dev/Test
  • Purchase reserved instances for Production
  • Configure storage lifecycle policies
  • Set up cost monitoring and alerts
  • Establish cost governance workflows


Document Version: 1.0
Last Updated: 2025-10-30
Maintained By: Platform Operations & FinOps Team