Cost Optimization & FinOps

Purpose & Scope

Purpose: Comprehensive guide for ATP's cost optimization strategies and FinOps practices, ensuring cost efficiency across all environments while maintaining operational requirements, performance, and compliance.

Scope: This document covers:

- FinOps Principles: Cost transparency, optimization, governance, and culture
- Environment Cost Profiles: Detailed cost breakdowns for Dev, Test, Staging, Production, Hotfix, and Preview environments
- Cost Optimization Strategies: Automated shutdown schedules, reserved instances, spot instances, storage lifecycle policies, autoscaling
- Right-Sizing Recommendations: AKS node pools, SQL databases, storage tiers, compute resources
- Cost Monitoring & Budgets: Azure Cost Management, budget alerts, anomaly detection, cost dashboards
- Cost Allocation: Per-tenant cost tracking, cost attribution, showback/chargeback
- Cost Governance: Approval workflows, cost thresholds, policy enforcement
- Storage Optimization: Hot/Warm/Cold tiering, lifecycle transitions, compression, deduplication

Audience: Platform operators, FinOps teams, engineering leads, finance teams, product managers

Relationship to Other Documents:

- Infrastructure: See ../infrastructure/kubernetes.md for AKS sizing and node pool strategies
- Environments: See ../ci-cd/environments.md for environment-specific configurations
- Storage: See ../platform/data-residency-retention.md for storage tiering and lifecycle policies
- Operations: See runbook.md for operational procedures

FinOps Principles

ATP implements FinOps principles to balance operational requirements with cost efficiency:

- Visibility: Tag all resources; enable Cost Management; monthly reviews; cost transparency
- Optimization: Shutdown schedules, reserved instances, autoscaling, storage lifecycle, spot instances
- Governance: Budget alerts, Azure Policy enforcement, approval workflows for cost increases
- Culture: Cost awareness in development; cost-per-feature metrics; regular optimization sprints

FinOps Culture

Cost-Aware Development:

- Review cost impact before adding new Azure resources
- Use cost estimation tools (Infracost) in CI/CD pipelines
- Monitor cost-per-feature and cost-per-tenant metrics
- Regular cost optimization sprints (quarterly)

Cost Metrics:

- Cost per Tenant: Monthly cost divided by active tenants
- Cost per Feature: Cost attribution to features/capabilities
- Cost per Environment: Track costs by environment (Dev/Test/Staging/Prod)
- Cost per Service: Breakdown by microservice/component

Environment Cost Profiles

ATP's cost model is graduated by environment: Dev and Test are optimized for minimal cost, and Production is optimized for reliability within budget constraints.

Cost Profile Summary

Environment	Monthly Budget	Primary Compute	SKU Tier	Scaling Strategy	Monitoring Cost	Total Est. Monthly
Preview	$100	Azure Container Instances	Dynamic	Per-PR ephemeral	N/A	$50-150 (variable)
Dev	$500	App Service	Basic B1 (1 vCPU, 1.75 GB)	Fixed (1 instance)	Basic alerts	$400-600
Test	$1,000	App Service	Standard S1 (1 vCPU, 1.75 GB)	Fixed (2 instances)	Basic alerts	$900-1,200
Staging	$3,000	App Service	Premium P1v2 (1 vCPU, 3.5 GB)	Autoscale (2-5)	Enhanced alerts	$2,500-3,500
Production	$10,000	AKS (3-10 nodes)	Standard_D4s_v5 (4 vCPU, 16 GB)	Autoscale (3-10 nodes)	Full observability	$8,000-12,000
Hotfix	$500	App Service (on-demand)	Premium P1v3 (2 vCPU, 8 GB)	Fixed (1 instance)	Basic alerts	$0-500 (as-needed)

Cost Profile Rationale:

- Dev ($500): Cost-minimized with Basic SKU; shutdown evenings/weekends (40% savings)
- Test ($1,000): Standard SKU for stable performance; 2 instances for load testing; shutdown nights
- Staging ($3,000): Premium SKU for production-like validation; autoscaling; always-on
- Production ($10,000): AKS for enterprise-grade scalability; reserved instances (20% savings); 99.9% SLA
- Hotfix ($500): On-demand deployment only when needed; deleted after hotfix deployment

Detailed Cost Breakdowns

Dev Environment Monthly Costs

Compute (App Service Basic B1 × 1):          $13/month × 0.6 (60% uptime) = $8/month
SQL Database (Basic - 5 DTU):                $5/month
Redis Cache (Basic C0 - 250 MB):             $16/month
Service Bus (Basic):                         $5/month
Storage (LRS - 100 GB):                      $2/month
Key Vault (transactions):                    $1/month
Bandwidth:                                   $5/month
---
Subtotal:                                    $42/month

Actual with shutdown automation:             ~$25/month per service × 7 services = $175/month
Dev shared infrastructure:                   $300/month (VNet, NSG, Seq, etc.)
---
Total Dev Environment:                       $475/month

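Cost Optimization (Dev):

- Shutdown Schedule: Stop 6 PM - 8 AM weekdays, all weekend. The estimates assume 60% uptime → 40% savings; the schedule itself leaves resources running only 50 of 168 hours per week (about 30%), so the 60% figure is conservative
- Shared Resources: VNet, NSG, Seq shared between Dev/Test → split cost
- Basic SKUs: Minimum viable performance for development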
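
As a quick check on the uptime assumption, the sketch below computes the share of the week a resource runs under a fixed daily window and the resulting monthly cost. The function names and parameters are illustrative and not part of ATP's tooling; it also assumes the resource is billed only while running.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def weekly_uptime_fraction(start_hour: int, stop_hour: int, running_days: int) -> float:
    """Share of the week a resource is up when it runs start_hour..stop_hour on each running day."""
    if not (0 <= start_hour < stop_hour <= 24) or not (0 <= running_days <= 7):
        raise ValueError("invalid schedule")
    return (stop_hour - start_hour) * running_days / HOURS_PER_WEEK


def scheduled_monthly_cost(full_price: float, uptime_fraction: float) -> float:
    """Monthly cost of a resource that is billed only while it is running (assumption)."""
    return full_price * uptime_fraction


# Dev schedule from this section: up 8 AM - 6 PM, Monday to Friday.
dev_scheduled = weekly_uptime_fraction(8, 18, 5)
print(f"Scheduled Dev uptime: {dev_scheduled:.1%}")

# Planning figure used in the cost table: 60% uptime on a $13/month App Service.
print(f"Planned compute cost: ${scheduled_monthly_cost(13, 0.6):.2f}/month")
```

Running the same function with the Test window (6 AM - 10 PM) gives a comparable check for the 70% figure.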

Test Environment Monthly Costs

Compute (App Service Standard S1 × 2):       $70/month × 0.7 (70% uptime) × 2 = $98/month
SQL Database (Standard S1 - 20 DTU):         $30/month
Redis Cache (Standard C1 - 1 GB):            $75/month
Service Bus (Standard):                      $10/month
Storage (LRS - 500 GB):                      $10/month
Key Vault (transactions):                    $2/month
Application Insights (5 GB/month):           $12/month
Bandwidth:                                   $10/month
---
Subtotal:                                    $247/month

Actual with shutdown automation:             ~$173/month (70% uptime)
Test shared infrastructure:                  $300/month (shared with Dev)
---
Total Test Environment:                      $473/month

With reserved instances (1-year):            -$50/month (15% savings)
Net Test Environment:                        $423/month

Cost Optimization (Test):

- Shutdown Schedule: Stop 10 PM - 6 AM weekdays → 70% uptime → 30% savings
- Standard SKUs: Balance between cost and performance for testing

Staging Environment Monthly Costs

Compute (App Service Premium P1v2 × 2-5):    $146/month × 3 avg = $438/month
SQL Database (Standard S3 - 100 DTU):        $150/month
Redis Cache (Standard C2 - 2.5 GB):          $150/month
Service Bus (Standard - 2 messaging units):  $20/month
Storage (LRS - 1 TB):                        $20/month
Key Vault (transactions):                    $3/month
Application Insights (15 GB/month):          $30/month
Log Analytics (30-day retention, 10 GB/day): $300/month
Bandwidth:                                   $20/month
---
Total Staging Environment:                   $1,131/month

With reserved instances (1-year):            -$170/month (15% savings)
Net Staging Environment:                     $961/month

Cost Optimization (Staging):

- Autoscaling: Scale 2-5 instances based on load → optimize for actual usage
- Reserved Instances: 1-year commitment → 15% savings

Production Environment Monthly Costs

AKS Cluster (3-10 nodes, Standard_D4s_v5):
  - System pool (3 nodes, always on):        $200/month × 3 = $600/month
  - User pool (3-7 nodes, autoscale):        $200/month × 5 avg = $1,000/month
SQL Database (Premium P4 - 500 DTU):         $1,860/month
Cosmos DB (1000 RU/s provisioned):           $730/month
Redis Cache (Premium P3 - 26 GB):            $1,037/month
Service Bus (Premium - 4 messaging units):   $2,708/month
Storage (GRS + WORM - 10 TB):
  - Hot tier (0-90 days):                    $500/month
  - Cool tier (91 days - 7 years):           $100/month
Key Vault (HSM - 50 keys):                   $625/month
Application Insights (50 GB/month):          $115/month
Log Analytics (90-day retention, 30 GB/day): $900/month
Private Endpoints (10 × $7):                 $70/month
Application Gateway (v2 with WAF):           $250/month
Azure Firewall (Premium):                    $625/month
DDoS Protection Standard:                    $2,944/month
Bandwidth (outbound - 1 TB):                 $90/month
ACR (Premium - geo-replication):             $30/month
Prometheus + Grafana (self-hosted on AKS):   $50/month (storage only)
---
Total Production Environment:                $14,234/month

Reserved Instance Savings (1-year):          -$2,400/month (20% on compute/database)
---
Net Production Monthly Cost:                 $11,834/month

Cost Optimization (Production):

- Reserved Instances: 1-year commitment for AKS nodes, SQL, Cosmos DB → 20-30% savings
- Autoscaling: Scale down to 3 nodes during low-traffic hours → save ~$400/month
- Storage Lifecycle: Auto-transition to cool tier after 90 days → save ~$300/month
- DDoS Protection: Shared across all public endpoints in subscription
- Application Insights Sampling: 10% adaptive sampling → reduce ingestion by 90%

Preview Environment Monthly Costs

Azure Container Instances (per-PR ephemeral):
  - Average PR duration: 2 hours
  - Average instances: 3
  - Cost per hour: $0.05/instance
  - Monthly PRs: 50
  - Total: $0.05 × 3 × 2 × 50 = $15/month

Spot Instances (optional):                   $1.50/month (90% savings)
---
Total Preview Environment:                   $15-50/month (highly variable)

Cost Optimization (Preview):

- Spot Instances: Use Azure Spot VMs for non-critical workloads → 90% savings
- Ephemeral: Containers destroyed after PR merge/close
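
The Preview estimate is a straight product of four inputs, which makes it easy to re-run with a team's own PR volume. A minimal sketch using the figures from the breakdown above; the function name and the `spot_discount` parameter are illustrative.

```python
def preview_monthly_cost(price_per_instance_hour, instances, hours_per_pr,
                         prs_per_month, spot_discount=0.0):
    """Monthly cost of per-PR ephemeral environments, optionally with a spot discount."""
    on_demand = price_per_instance_hour * instances * hours_per_pr * prs_per_month
    return on_demand * (1 - spot_discount)


# Figures from the Preview breakdown: $0.05/instance-hour, 3 instances, 2 hours, 50 PRs.
on_demand = preview_monthly_cost(0.05, 3, 2, 50)
spot = preview_monthly_cost(0.05, 3, 2, 50, spot_discount=0.90)
print(f"On-demand: ${on_demand:.2f}/month, with 90% spot discount: ${spot:.2f}/month")
```

Doubling either the PR count or the average PR lifetime doubles the total, so long-lived PRs are the main risk to this budget.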

Cost Optimization Strategies

ATP implements automated cost optimization across all environments using Azure Policy, automation scripts, and IaC overlays.

Automated Shutdown Schedules

Purpose: Reduce compute costs in Dev/Test environments by shutting down resources during non-business hours.

Implementation (Azure Automation):

// ConnectSoft.ATP.Infrastructure/Automation/ShutdownSchedule.cs
public class ShutdownSchedule
{
    public static void ConfigureDevEnvironment(ResourceGroup resourceGroup)
    {
        // Dev: Shutdown 6 PM - 8 AM weekdays, all weekend
        var devShutdownSchedule = new Schedule("dev-shutdown-schedule", new ScheduleArgs
        {
            ResourceGroupName = resourceGroup.Name,
            Name = "dev-shutdown",
            Frequency = "Week",
            WeekDays = new[] { "Monday", "Tuesday", "Wednesday", "Thursday", "Friday" },
            StartTime = "18:00:00",  // 6 PM
            Description = "Shutdown Dev environment after business hours"
        });

        var devStartupSchedule = new Schedule("dev-startup-schedule", new ScheduleArgs
        {
            ResourceGroupName = resourceGroup.Name,
            Name = "dev-startup",
            Frequency = "Week",
            WeekDays = new[] { "Monday", "Tuesday", "Wednesday", "Thursday", "Friday" },
            StartTime = "08:00:00",  // 8 AM
            Description = "Startup Dev environment at beginning of business day"
        });
    }
}

Cost Savings:

- Dev: 60% uptime → 40% savings (~$5/month per App Service at $13/month)
- Test: 70% uptime → 30% savings (~$21/month per App Service at $70/month)

Reserved Instances

Purpose: Commit to 1-year or 3-year terms for predictable workloads to achieve 20-30% savings.

Production Reserved Instances:

reservedInstances:
  commitment: 1-year (renew annually)
  resources:
    - type: AKS Standard_D4s_v5
      quantity: 3 (system pool, always on)
      monthlyCost: $600
      reservedCost: $480 (20% savings)
      annualSavings: $1,440

    - type: SQL Database Premium P4
      quantity: 1
      monthlyCost: $1,860
      reservedCost: $1,395 (25% savings)
      annualSavings: $5,580

    - type: Cosmos DB (1000 RU/s)
      quantity: 1
      monthlyCost: $730
      reservedCost: $511 (30% savings)
      annualSavings: $2,628

    - type: Redis Cache Premium P3
      quantity: 1
      monthlyCost: $1,037
      reservedCost: $830 (20% savings)
      annualSavings: $2,484

totalAnnualSavings: $12,132 (Production)
totalATPReservedInstanceSavings: $13,956/year
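
The per-resource savings in the table above follow from the monthly cost and the discount rate. A short sketch that reproduces the annual figures; the data structure and function name are illustrative, and the monthly saving is rounded to whole dollars because the table appears to do the same.

```python
# (resource, on-demand monthly cost in $, reserved discount) from the table above
reservations = [
    ("AKS Standard_D4s_v5 x3", 600, 0.20),
    ("SQL Database Premium P4", 1860, 0.25),
    ("Cosmos DB (1000 RU/s)", 730, 0.30),
    ("Redis Cache Premium P3", 1037, 0.20),
]


def annual_savings(monthly_cost, discount):
    """Annual savings, with the monthly saving rounded to whole dollars."""
    return round(monthly_cost * discount) * 12


per_resource = {name: annual_savings(cost, rate) for name, cost, rate in reservations}
production_total = sum(per_resource.values())

for name, saving in per_resource.items():
    print(f"{name}: ${saving:,}/year")
print(f"Production total: ${production_total:,}/year")
```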

Purchase Reserved Instances (Azure CLI):

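#!/bin/bash
# purchase-reserved-instances.sh
set -euo pipefail

SUBSCRIPTION_ID="<azure-subscription-id>"
REGION="eastus"

az account set --subscription "$SUBSCRIPTION_ID"

echo "Purchasing Reserved Instances for ATP Production..."

# AKS Nodes (Standard_D4s_v5 × 3)
# Note: the CLI may require further arguments (for example a reservation order ID
# from `az reservations reservation-order calculate`); check
# `az reservations reservation-order purchase --help` before running.
az reservations reservation-order purchase \
  --reserved-resource-type "VirtualMachines" \
  --sku "Standard_D4s_v5" \
  --location "$REGION" \
  --quantity 3 \
  --term "P1Y" \
  --billing-plan "Monthly" \
  --display-name "ATP-Prod-AKS-RI-2025"

# SQL Database (Premium P4): sets the database configuration that the reservation
# is sized against. This command does not purchase reserved capacity.
az sql db update \
  --resource-group "ConnectSoft-ATP-Prod-EUS-RG" \
  --server "atp-sql-prod-eus" \
  --name "ATP_Prod" \
  --compute-model "Provisioned" \
  --service-objective "P4" \
  --backup-storage-redundancy "Geo" \
  --zone-redundant true \
  --read-scale "Enabled"

# Cosmos DB: pins provisioned throughput at 1000 RU/s so it matches the reserved
# capacity. This command does not purchase the reservation itself.
az cosmosdb sql container throughput update \
  --resource-group "ConnectSoft-ATP-Prod-EUS-RG" \
  --account-name "atp-cosmos-prod-eus" \
  --database-name "ATP" \
  --name "AuditEvents" \
  --throughput 1000

echo "✅ AKS reservation submitted; SQL and Cosmos DB reservations are purchased separately. Savings appear in the next billing cycle"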

Spot Instances (Preview Environments)

Purpose: Cut compute cost for ephemeral Preview environments by up to 90% using Azure Spot VMs.

Implementation:

// AKS Node Pool with Spot instances
var spotNodePool = new KubernetesClusterNodePool("spot-pool", new KubernetesClusterNodePoolArgs
{
    KubernetesClusterId = aksCluster.Id,
    Name = "spotpool",
    VmSize = "Standard_D8s_v5",
    NodeCount = 0,
    MinCount = 0,
    MaxCount = 10,
    EnableAutoScaling = true,
    Priority = "Spot",
    EvictionPolicy = "Delete",
    SpotMaxPrice = 0.05,  // Max $0.05/hour (90% discount)
    NodeLabels = new Dictionary<string, string>
    {
        ["workload"] = "preview",
        ["ephemeral"] = "true"
    }
});

Cost Savings: 90% discount vs. regular VMs ($0.50/hour → $0.05/hour)

Storage Lifecycle Policies

Purpose: Automatically transition data from Hot → Cool → Archive tiers to reduce storage costs by up to 80%.

Storage Lifecycle Policy (cool after 90 days; archive after 2,555 days, about 7 years; delete after 3,650 days, about 10 years, if not on legal hold):

{
  "rules": [
    {
      "enabled": true,
      "name": "MoveLogsToCoolAfter90Days",
      "type": "Lifecycle",
      "definition": {
        "filters": {
          "blobTypes": ["blockBlob"],
          "prefixMatch": ["logs/"]
        },
        "actions": {
          "baseBlob": {
            "tierToCool": {
              "daysAfterModificationGreaterThan": 90
            },
            "tierToArchive": {
              "daysAfterModificationGreaterThan": 2555
            },
            "delete": {
              "daysAfterModificationGreaterThan": 3650
            }
          }
        }
      }
    }
  ]
}

Storage Cost Savings (Production):

Hot Storage (0-90 days): 900 GB × $0.0184/GB = $16.56/month
Cool Storage (91 days - 7 years): 25,200 GB × $0.01/GB = $252/month
Archive Storage (7+ years): 100,000 GB × $0.002/GB = $200/month

Without Lifecycle Policy (all hot): 126,100 GB × $0.0184/GB = $2,320/month
With Lifecycle Policy: $16.56 + $252 + $200 = $468.56/month
Total Savings: $1,851.44/month (80% savings on storage)
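
The savings figure above can be reproduced from the per-tier volumes and rates. A minimal sketch using the numbers in this section; the dictionary layout is illustrative. Note that the unrounded all-hot cost is $2,320.24, which the section rounds to $2,320.

```python
# $/GB/month rates and GB volumes as used in this section
RATES = {"hot": 0.0184, "cool": 0.01, "archive": 0.002}
VOLUMES_GB = {"hot": 900, "cool": 25_200, "archive": 100_000}

# Cost when each volume sits in its own tier
tiered_cost = sum(VOLUMES_GB[tier] * RATES[tier] for tier in VOLUMES_GB)

# Cost when everything stays in the hot tier
all_hot_cost = sum(VOLUMES_GB.values()) * RATES["hot"]

savings = all_hot_cost - tiered_cost
savings_share = savings / all_hot_cost

print(f"All hot: ${all_hot_cost:,.2f}/month")
print(f"Tiered:  ${tiered_cost:,.2f}/month")
print(f"Savings: ${savings:,.2f}/month ({savings_share:.0%})")
```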

Apply Lifecycle Policy (Azure CLI):

#!/bin/bash
# apply-storage-lifecycle-policy.sh

STORAGE_ACCOUNT="atpstorageprodeus"
RESOURCE_GROUP="ConnectSoft-ATP-Prod-EUS-RG"

echo "Applying storage lifecycle policy..."

az storage account management-policy create \
  --account-name "$STORAGE_ACCOUNT" \
  --resource-group "$RESOURCE_GROUP" \
  --policy @lifecycle-policy.json

echo "✅ Storage lifecycle policy applied"

Autoscaling

Purpose: Automatically scale resources based on demand to optimize costs during low-traffic periods.

AKS Cluster Autoscaling:

// Production AKS cluster with autoscaling
var aksCluster = new ManagedCluster("atp-aks-prod", new ManagedClusterArgs
{
    // ... other configuration ...

    AgentPoolProfiles = new[]
    {
        // System pool (fixed - always on)
        new ManagedClusterAgentPoolProfileArgs
        {
            Name = "system",
            Mode = "System",
            Count = 3,
            VmSize = "Standard_D4s_v5",
            EnableAutoScaling = false  // Fixed size; MinCount/MaxCount apply only when autoscaling is enabled
        },

        // User pool (autoscale based on demand)
        new ManagedClusterAgentPoolProfileArgs
        {
            Name = "user",
            Mode = "User",
            Count = 3,
            VmSize = "Standard_D4s_v5",
            EnableAutoScaling = true,
            MinCount = 3,
            MaxCount = 10
        }
    },

    // Cluster-wide autoscaler tuning (these settings are not per-pool)
    AutoScalerProfile = new ManagedClusterPropertiesAutoScalerProfileArgs
    {
        ScaleDownDelayAfterAdd = "10m",        // Wait 10 min after a scale-up before scaling down
        ScaleDownUtilizationThreshold = "0.5"  // Scale down nodes below 50% utilization
    }
});

Cost Savings: Scale down to 3 nodes during low-traffic hours → save ~$400/month
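
One way to read the ~$400/month figure is as the gap between the 5-node average used in the Production breakdown and the 3-node floor. The sketch below derives an average node count from an hourly profile and prices it at the document's $200/node/month; the profile itself is hypothetical.

```python
NODE_PRICE = 200  # $/node/month for Standard_D4s_v5, as used in this document


def average_nodes(hourly_node_counts):
    """Mean node count over a list of hourly samples."""
    return sum(hourly_node_counts) / len(hourly_node_counts)


def pool_monthly_cost(avg_nodes, node_price=NODE_PRICE):
    """Monthly cost of a node pool at a given average size."""
    return avg_nodes * node_price


# Hypothetical day: 12 quiet hours at the 3-node floor, 12 busy hours at 7 nodes.
profile = [3] * 12 + [7] * 12
autoscaled = pool_monthly_cost(average_nodes(profile))
floor_only = pool_monthly_cost(3)

print(f"Average nodes: {average_nodes(profile):.1f}")
print(f"Autoscaled: ${autoscaled:,.0f}/month, floor: ${floor_only:,.0f}/month, "
      f"difference: ${autoscaled - floor_only:,.0f}/month")
```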

Application Insights Sampling

Purpose: Reduce telemetry ingestion costs by 90% while maintaining error visibility.

Adaptive Sampling Configuration:

// Production Application Insights with adaptive sampling
services.AddApplicationInsightsTelemetry(options =>
{
    options.ConnectionString = appInsightsConnectionString;
    options.EnableAdaptiveSampling = false;  // Disabled here so the custom settings below take effect
    options.EnablePerformanceCounterCollectionModule = true;
    options.EnableDependencyTrackingTelemetryModule = true;
    options.EnableEventCounterCollectionModule = true;
});

// Adaptive sampling is a telemetry processor, so it is added to the processor chain
services.Configure<TelemetryConfiguration>(config =>
{
    var chain = config.DefaultTelemetrySink.TelemetryProcessorChainBuilder;
    chain.UseAdaptiveSampling(
        new SamplingPercentageEstimatorSettings
        {
            MaxTelemetryItemsPerSecond = 10,  // Target item rate; the sampling percentage adapts to hit it
            InitialSamplingPercentage = 10    // Start at 10% sampling
        },
        null,                                 // No callback on sampling-percentage changes
        excludedTypes: "Exception;Event");    // Never sample out exceptions or custom events
    chain.Build();
});

Cost Savings: Reduce ingestion from 500 GB/month to 50 GB/month → save ~$1,035/month at the ~$2.30/GB rate implied by the Production breakdown ($115 for 50 GB)

Right-Sizing Recommendations

AKS Node Pool Sizing

System Node Pool:

- VM Size: Standard_D4s_v5 (4 vCPU, 16 GB RAM)
- Count: 3 (fixed, one per AZ)
- Purpose: Control plane components, mesh, KEDA, FluxCD, OTel
- Cost: ~$600/month (always on)

User Node Pool:

- VM Size: Standard_D4s_v5 (4 vCPU, 16 GB RAM)
- Count: 3-10 (autoscale)
- Purpose: Stateless APIs (Gateway, Query, Admin, Policy)
- Cost: ~$1,000/month (avg 5 nodes)

I/O Node Pool:

- VM Size: Standard_E8s_v5 (8 vCPU, 64 GB RAM, premium storage)
- Count: 2-50 (autoscale)
- Purpose: I/O-heavy (Ingestion, Projection, Export, Integrity)
- Cost: ~$1,200/month (avg 3 nodes)

Jobs Node Pool (Optional):

- VM Size: Standard_F16s_v2 (16 vCPU, 32 GB RAM, compute-optimized)
- Count: 0-20 (KEDA scale to zero)
- Purpose: Export jobs, maintenance tasks, compliance reports
- Cost: ~$500/month (on-demand only)

Spot Node Pool (Optional):

- VM Size: Standard_D8s_v5
- Count: 0-10 (autoscale)
- Priority: Spot (90% discount)
- Purpose: Non-critical workloads (dev/test projections, backfills)
- Cost: ~$50/month (90% savings)

SQL Database Sizing

Environment	SKU	DTU/vCores	Monthly Cost	Rationale
Dev	Basic	5 DTU	$5	Minimal for development
Test	Standard S1	20 DTU	$30	Stable performance for testing
Staging	Standard S3	100 DTU	$150	Production-like validation
Production	Premium P4	500 DTU	$1,860	Enterprise-grade performance

Right-Sizing Recommendations:

- Start Small: Begin with lower SKUs and scale up based on metrics
- Monitor DTU Usage: If consistently >80%, consider scaling up
- Reserved Instances: Use for Production → 25% savings
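
The ">80% consistently" rule can be made concrete by deciding what share of samples must exceed the threshold. A minimal sketch, assuming "consistently" means at least 90% of samples in the review window; that share, like the function name, is an assumption rather than an ATP policy.

```python
def should_scale_up(utilization_samples, threshold=80.0, sustained_share=0.9):
    """True when at least sustained_share of the samples exceed threshold (percent)."""
    if not utilization_samples:
        return False
    over = sum(1 for sample in utilization_samples if sample > threshold)
    return over / len(utilization_samples) >= sustained_share


# Hypothetical DTU utilization samples (percent) from two review windows.
steady_pressure = [85, 88, 91, 86, 84, 90, 87, 89, 92, 70]
brief_spikes = [85, 50, 90, 40]

print(should_scale_up(steady_pressure))
print(should_scale_up(brief_spikes))
```

The same check works for any utilization percentage, so it can be reused for cache memory usage.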

Redis Cache Sizing

Environment	SKU	Size	Monthly Cost	Rationale
Dev	Basic C0	250 MB	$16	Minimal for development
Test	Standard C1	1 GB	$75	Standard for testing
Staging	Standard C2	2.5 GB	$150	Production-like validation
Production	Premium P3	26 GB	$1,037	Enterprise-grade with persistence

Right-Sizing Recommendations:

- Monitor Memory Usage: If consistently >80%, consider scaling up
- Use Premium for Production: Persistence and high availability required

Storage Tier Selection

Tier	Use Case	Cost/GB/Month	Access Time
Hot	Frequently accessed (0-90 days)	$0.0184	<1 ms
Cool	Infrequently accessed (91 days - 7 years)	$0.01	<30 ms
Archive	Rarely accessed (7+ years)	$0.002	Hours (rehydration)

Recommendation: Use lifecycle policies to automatically transition data to lower-cost tiers
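
The age boundaries in the table map directly onto a lookup. A small sketch that picks the tier for a blob of a given age and prices it at the table's rates; the function names are illustrative, and 7 years is taken as 2,555 days.

```python
RATE_PER_GB = {"hot": 0.0184, "cool": 0.01, "archive": 0.002}  # $/GB/month, from the table

COOL_AFTER_DAYS = 90
ARCHIVE_AFTER_DAYS = 2555  # 7 years


def storage_tier(age_days):
    """Tier for data of the given age, following the table's boundaries."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days <= COOL_AFTER_DAYS:
        return "hot"
    if age_days <= ARCHIVE_AFTER_DAYS:
        return "cool"
    return "archive"


def monthly_cost(size_gb, age_days):
    """Monthly storage cost for size_gb of data at the tier implied by its age."""
    return size_gb * RATE_PER_GB[storage_tier(age_days)]


for age in (30, 91, 3000):
    print(age, storage_tier(age), f"${monthly_cost(1000, age):.2f} per 1000 GB")
```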

Storage Optimization

Hot/Warm/Cold Tiering Strategy

Tier Definitions:

- Hot (Append/WORM): Authoritative segments and recent anchors; low latency, high IOPS, highest cost
- Warm (Read Models): Projections and indexes optimized for query; rebuilt from hot as needed; medium cost
- Cold (Archive/Export): Immutable object storage with long retention; bulk throughput; lowest cost

Lifecycle Transitions:

tiering:
  hot:
    targetWindow: P14D  # 14 days in hot tier
  warm:
    targetWindow: P90D  # 90 days in warm tier
    rebuildFirst: true  # Prefer re-project vs snapshot storage
  cold:
    storageClass: archive_immutable
    lifecycle:
      transitionAfter: P90D  # Move to cold after 90 days
      deleteAfter: P10Y  # Delete after 10 years (if not on legal hold)
  residency:
    crossRegionHydrate: deny  # Never hydrate across region families

Cost Savings:

- Hot → Cool transition: ~45% savings
- Cool → Archive transition: ~80% savings
- Total lifecycle savings: ~80% on long-term storage

Compression & Deduplication

Compression Strategies:

- JSONL: GZIP compression (5-10x reduction)
- Parquet: Snappy compression (columnar, 3-5x reduction)
- Segment Storage: Compress before upload to blob storage

Cost Impact: Reduce storage by 70-80% → proportional cost savings
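
A compression ratio of r shrinks data to 1/r of its original size, so the storage saved is 1 - 1/r. The sketch below converts the ratios quoted above into percentage reductions; the function name is illustrative.

```python
def storage_reduction(compression_ratio):
    """Fraction of storage saved for a given compression ratio (5 means 5x smaller)."""
    if compression_ratio < 1:
        raise ValueError("compression_ratio must be at least 1")
    return 1 - 1 / compression_ratio


for ratio in (3, 5, 10):
    print(f"{ratio}x compression saves {storage_reduction(ratio):.0%} of storage")
```

Returns diminish quickly: moving from 5x to 10x only adds ten percentage points of saving.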

Cost Monitoring & Budgets

Azure Cost Management Budgets

Budget Configuration (Pulumi):

// Production Budget
var prodBudget = new Budget("atp-budget-prod", new BudgetArgs
{
    BudgetName = "atp-budget-prod",
    Scope = prodResourceGroup.Id,  // Budget scope: the Production resource group
    Amount = 10000,  // $10,000/month
    TimeGrain = "Monthly",
    TimePeriod = new BudgetTimePeriodArgs
    {
        StartDate = "2025-01-01",
        EndDate = "2025-12-31"
    },
    Category = "Cost",
    Notifications = new InputMap<NotificationArgs>
    {
        ["Alert50Percent"] = new NotificationArgs
        {
            Enabled = true,
            Operator = "GreaterThanOrEqualTo",
            Threshold = 50,
            ContactEmails = new[] { "platform-team@connectsoft.example" },
            ThresholdType = "Actual"
        },
        ["Alert80Percent"] = new NotificationArgs
        {
            Enabled = true,
            Operator = "GreaterThanOrEqualTo",
            Threshold = 80,
            ContactEmails = new[] { "platform-team@connectsoft.example", "finance@connectsoft.example" },
            ContactRoles = new[] { "Owner" },
            ThresholdType = "Actual"
        },
        ["Alert100Percent"] = new NotificationArgs
        {
            Enabled = true,
            Operator = "GreaterThanOrEqualTo",
            Threshold = 100,
            ContactEmails = new[] { "cfo@connectsoft.example", "platform-team@connectsoft.example" },
            ContactRoles = new[] { "Owner" },
            ThresholdType = "Actual",
            ContactGroups = new[] { incidentActionGroupId }  // Action group that auto-creates a P1 incident; incidentActionGroupId is a placeholder for its resource ID
        }
    }
});
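
The three notifications above fire independently once actual spend reaches their share of the budget. A minimal sketch of that evaluation, useful for checking which alerts a given month-to-date spend would trigger; the function name is illustrative.

```python
THRESHOLDS = (50, 80, 100)  # percent of budget, as configured above


def fired_alerts(actual_spend, budget, thresholds=THRESHOLDS):
    """Thresholds (percent) reached by actual_spend, using greater-than-or-equal."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    percent_used = actual_spend * 100 / budget
    return [t for t in thresholds if percent_used >= t]


BUDGET = 10_000  # Production monthly budget
for spend in (1_000, 5_000, 8_200, 12_000):
    print(f"${spend:,}: alerts at {fired_alerts(spend, BUDGET)}")
```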

Cost Anomaly Detection

Anomaly Alert (Azure Monitor):

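// Cost anomaly alert: dynamic threshold (medium sensitivity) over a 24-hour window, evaluated hourly; fires when 2 of the last 4 evaluations breach the learned baseline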
var costAnomalyAlert = new MetricAlert("atp-cost-anomaly-alert-prod", new MetricAlertArgs
{
    AlertRuleName = "atp-cost-anomaly-prod",
    ResourceGroupName = prodResourceGroup.Name,
    Location = "global",
    Description = "Alert when Production environment cost spikes >50% in 24 hours",
    Severity = 1,  // High severity
    Enabled = true,
    Scopes = new[] { prodResourceGroup.Id },
    EvaluationFrequency = "PT1H",  // Evaluate every hour
    WindowSize = "PT24H",  // 24-hour window
    Criteria = new MetricAlertMultipleResourceMultipleMetricCriteriaArgs
    {
        OdataType = "Microsoft.Azure.Monitor.MultipleResourceMultipleMetricCriteria",
        AllOf = new[]
        {
            new DynamicMetricCriteriaArgs
            {
                Name = "CostAnomaly",
                MetricName = "ActualCost",
                MetricNamespace = "Microsoft.CostManagement/budgets",
                Operator = "GreaterThan",
                AlertSensitivity = "Medium",
                DynamicThresholdFailingPeriods = new DynamicThresholdFailingPeriodsArgs
                {
                    NumberOfEvaluationPeriods = 4,
                    MinFailingPeriodsToAlert = 2
                },
                TimeAggregation = "Total"
            }
        }
    }
});
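
The alert description refers to a spike of more than 50% in 24 hours. As an illustration of that fixed-ratio rule (which is simpler than the dynamic threshold configured above), the sketch below flags days whose cost exceeds the previous day's by more than a given ratio. The daily figures are hypothetical.

```python
def cost_spikes(daily_costs, spike_ratio=0.5):
    """Indices of days whose cost rose by more than spike_ratio over the previous day."""
    spikes = []
    for i in range(1, len(daily_costs)):
        previous, current = daily_costs[i - 1], daily_costs[i]
        if previous > 0 and (current - previous) / previous > spike_ratio:
            spikes.append(i)
    return spikes


# Hypothetical daily Production costs in dollars.
daily = [300, 310, 480, 470]
print(cost_spikes(daily))
```

A day-over-day comparison is sensitive to weekly seasonality; comparing against the same weekday in the previous week is a common refinement.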

Cost Dashboards

KQL Query for Cost Attribution:

// Cost breakdown by Environment and Service
AzureCostManagement
| where TimeGenerated >= startofmonth(now())
| extend Environment = tostring(Tags["Environment"])
| extend Service = tostring(Tags["Service"])
| extend CostCenter = tostring(Tags["CostCenter"])
| summarize TotalCost = sum(Cost) by Environment, Service, CostCenter
| order by TotalCost desc

Cost Allocation & Attribution

Resource Tagging Strategy

Required Tags:

tags:
  Environment: production|staging|test|dev|preview|hotfix
  Service: gateway|ingestion|query|policy|export|admin
  Team: platform|engineering|ops
  CostCenter: atp-production|atp-staging
  TenantId: <tenant-id>  # For multi-tenant resources
  Edition: free|standard|enterprise
  Region: eastus|westus|westeurope

Tagging Example (Pulumi):

var prodTags = new InputMap<string>
{
    ["Environment"] = "production",
    ["Service"] = "gateway",
    ["Team"] = "platform",
    ["CostCenter"] = "atp-production",
    ["Region"] = "eastus",
    ["Compliance"] = "soc2-gdpr-hipaa"
};

var prodAppService = new WebApp("atp-gateway-prod-eus", new WebAppArgs
{
    // ... resource configuration ...
    Tags = prodTags
});
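
Tag coverage is only useful for cost attribution if values stay within the agreed vocabulary. The sketch below validates a tag set against the allowed values listed above. It assumes TenantId and Edition apply only to tenant-scoped resources and are therefore optional here; that split, and the function name, are assumptions.

```python
ALLOWED_VALUES = {
    "Environment": {"production", "staging", "test", "dev", "preview", "hotfix"},
    "Service": {"gateway", "ingestion", "query", "policy", "export", "admin"},
    "Team": {"platform", "engineering", "ops"},
    "CostCenter": {"atp-production", "atp-staging"},
    "Edition": {"free", "standard", "enterprise"},
    "Region": {"eastus", "westus", "westeurope"},
}
# Assumption: these five are required on every resource; TenantId/Edition are optional.
ALWAYS_REQUIRED = ("Environment", "Service", "Team", "CostCenter", "Region")


def tag_problems(tags):
    """List of human-readable problems with a resource's tags (empty when valid)."""
    problems = [f"missing tag: {key}" for key in ALWAYS_REQUIRED if key not in tags]
    for key, value in tags.items():
        allowed = ALLOWED_VALUES.get(key)
        if allowed is not None and value not in allowed:
            problems.append(f"invalid value for {key}: {value}")
    return problems


prod_tags = {
    "Environment": "production", "Service": "gateway", "Team": "platform",
    "CostCenter": "atp-production", "Region": "eastus", "Compliance": "soc2-gdpr-hipaa",
}
print(tag_problems(prod_tags))
print(tag_problems({"Environment": "prod"}))
```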

Per-Tenant Cost Tracking

Cost per Tenant (Production):

Net Production Monthly Cost: $11,834 (line items total $14,234, less $2,400 reserved-instance savings)
Active Tenants (production): 50
Cost per Tenant: $11,834 / 50 = $236.68/month

Target Cost per Tenant (with 500 tenants): $11,834 / 500 = $23.67/month
Required Optimization: 90% reduction in per-tenant cost through economies of scale and multi-tenancy
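
The 90% figure depends only on the tenant counts, not on the total: if total cost stays flat while tenants grow from 50 to 500, per-tenant cost falls by 1 - 50/500. A minimal sketch, using the $10,000 Production budget as a round illustrative total; the function names are illustrative.

```python
def cost_per_tenant(total_monthly_cost, active_tenants):
    """Average monthly cost per active tenant."""
    if active_tenants <= 0:
        raise ValueError("active_tenants must be positive")
    return total_monthly_cost / active_tenants


def per_tenant_reduction(current_tenants, target_tenants):
    """Fractional drop in per-tenant cost if total cost stays flat while tenants grow."""
    return 1 - current_tenants / target_tenants


total = 10_000  # illustrative round figure (the Production monthly budget)
print(f"At 50 tenants:  ${cost_per_tenant(total, 50):.2f}/tenant")
print(f"At 500 tenants: ${cost_per_tenant(total, 500):.2f}/tenant")
print(f"Reduction: {per_tenant_reduction(50, 500):.0%}")
```

In practice total cost grows with tenant count, so the realized reduction will be smaller than this flat-cost bound.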

Per-Tenant Cost Attribution (KQL):

// Cost per tenant, broken down by the Service tag
AzureCostManagement
| where TimeGenerated >= startofmonth(now())
| extend TenantId = tostring(Tags["TenantId"])
| extend Service = tostring(Tags["Service"])
| where isnotempty(TenantId)
| summarize TotalCost = sum(Cost) by TenantId, Service
| order by TenantId asc, TotalCost desc

Cost Governance

Cost Governance Workflow

# Approval required for resources exceeding cost thresholds
costGovernance:
  thresholds:
    - resource: App Service Premium
      monthlyCost: $200
      approver: Lead Architect

    - resource: SQL Database Premium
      monthlyCost: $500
      approver: CTO

    - resource: AKS Node Pool
      monthlyCost: $1000
      approver: CFO

  process:
    - Engineer submits Pulumi PR with new resource
    - CI/CD calculates estimated monthly cost (Infracost)
    - If cost > threshold, require approval
    - Approver reviews cost justification
    - If approved, Pulumi deploys resource with cost tags
    - Monthly review of actual vs estimated costs
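
The threshold step of this workflow reduces to a lookup and a comparison. A minimal sketch of how a CI check might decide whether a change needs sign-off, using the thresholds above; the function name and the handling of unlisted resource types are assumptions.

```python
# resource type -> (monthly cost threshold in $, approver role), from the thresholds above
APPROVAL_RULES = {
    "App Service Premium": (200, "Lead Architect"),
    "SQL Database Premium": (500, "CTO"),
    "AKS Node Pool": (1000, "CFO"),
}


def required_approver(resource_type, estimated_monthly_cost):
    """Approver role if the estimate exceeds the threshold, otherwise None.

    Resource types without a rule need no approval in this sketch (assumption).
    """
    rule = APPROVAL_RULES.get(resource_type)
    if rule is None:
        return None
    threshold, approver = rule
    return approver if estimated_monthly_cost > threshold else None


print(required_approver("SQL Database Premium", 1860))
print(required_approver("App Service Premium", 146))
```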

Cost Approval Workflow

Infracost Integration (CI/CD):

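# azure-pipelines.yml
# Assumes infrastructure/pulumi contains a project format that Infracost can parse
- task: InfracostSetup@2
  inputs:
    apiKey: $(infracostApiKey)
  displayName: 'Install Infracost'

- bash: |
    infracost breakdown --path=infrastructure/pulumi \
      --format=json --out-file=infracost.json
  displayName: 'Calculate infrastructure cost'

- bash: |
    infracost comment azure-repos --path=infracost.json \
      --azure-access-token=$(System.AccessToken) \
      --pull-request=$(System.PullRequest.PullRequestId) \
      --repo-url=$(Build.Repository.Uri) \
      --behavior=update
  displayName: 'Post cost comment to PR'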

Best Practices & Recommendations

Cost Optimization Checklist

Monthly Review:

- [ ] Review cost dashboards for anomalies
- [ ] Identify unused or underutilized resources
- [ ] Review reserved instance utilization
- [ ] Check storage lifecycle policy effectiveness
- [ ] Verify autoscaling is working correctly
- [ ] Review cost per tenant trends

Quarterly Review:

- [ ] Right-size resources based on metrics
- [ ] Purchase/renew reserved instances
- [ ] Review and optimize storage tiers
- [ ] Conduct cost optimization sprint
- [ ] Review and update cost governance policies

Cost Levers (Top 10)

1. Hot retention window (days in Hot before Warm)
2. Warm index granularity (daily vs hourly partitions)
3. Export frequency & size (number of eDiscovery/DSAR bundles)
4. Cross-region reads/exports (egress)
5. Cache TTL & hit rate (reduces Warm compute)
6. Segment size/seal cadence (affects metadata overhead & verification cost)
7. Index cardinality (number of fields indexed & distinct values)
8. Compression & encoding (Parquet snappy/zstd; JSONL gzip)
9. Rebuild-first vs snapshot retention for Warm
10. Purge cadence/batch size (storage & compute churn)

Cost Calculator (Quick Estimate)

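Inputs:

- events_per_day
- avg_event_bytes
- hot_days
- warm_days
- export_gb_per_mo
- cross_region_gb_per_mo

The formulas also use archive_tb (archived volume in TB), warm_index_factor (Warm size relative to the raw data), the unit rates (rate_hot and rate_warm per GB-month, rate_cold per TB-month, rate_egress and rate_export_io per GB), and verify_cost and rebuild_cost, which are estimated separately.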

Formulas:

hot_gb  = events_per_day * avg_event_bytes * hot_days  / (1024^3)
warm_gb = events_per_day * avg_event_bytes * warm_days / (1024^3) * warm_index_factor
storage_cost = hot_gb*rate_hot + warm_gb*rate_warm + archive_tb*rate_cold
egress_cost  = cross_region_gb_per_mo * rate_egress
export_cost  = export_gb_per_mo * rate_export_io
total        = storage_cost + egress_cost + export_cost + verify_cost + rebuild_cost
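
The formulas above translate directly into a small calculator. The sketch below is one possible encoding; the class names, default values, and every rate in the usage example are placeholders chosen for easy arithmetic, not ATP pricing.

```python
from dataclasses import dataclass

BYTES_PER_GB = 1024 ** 3


@dataclass
class Workload:
    events_per_day: float
    avg_event_bytes: float
    hot_days: int
    warm_days: int
    export_gb_per_mo: float
    cross_region_gb_per_mo: float
    archive_tb: float = 0.0
    warm_index_factor: float = 1.0
    verify_cost: float = 0.0   # estimated separately
    rebuild_cost: float = 0.0  # estimated separately


@dataclass
class Rates:
    hot: float        # $/GB-month
    warm: float       # $/GB-month
    cold: float       # $/TB-month
    egress: float     # $/GB
    export_io: float  # $/GB


def estimate_total(w: Workload, r: Rates) -> float:
    """Monthly total following the quick-estimate formulas."""
    hot_gb = w.events_per_day * w.avg_event_bytes * w.hot_days / BYTES_PER_GB
    warm_gb = (w.events_per_day * w.avg_event_bytes * w.warm_days
               / BYTES_PER_GB * w.warm_index_factor)
    storage_cost = hot_gb * r.hot + warm_gb * r.warm + w.archive_tb * r.cold
    egress_cost = w.cross_region_gb_per_mo * r.egress
    export_cost = w.export_gb_per_mo * r.export_io
    return storage_cost + egress_cost + export_cost + w.verify_cost + w.rebuild_cost


# Placeholder workload: 1 GB of events per day (1,048,576 events of 1 KiB each).
workload = Workload(events_per_day=1024 ** 2, avg_event_bytes=1024, hot_days=14,
                    warm_days=90, export_gb_per_mo=100, cross_region_gb_per_mo=10,
                    archive_tb=1)
rates = Rates(hot=0.02, warm=0.01, cold=2.0, egress=0.05, export_io=0.01)
print(f"Estimated total: ${estimate_total(workload, rates):.2f}/month")
```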

Cost Optimization Runbooks

Monthly Cost Review Runbook

1. Gather Cost Data
   - Export Azure Cost Management report for current month
   - Review cost by environment, service, and tenant
   - Identify top cost drivers
2. Analyze Trends
   - Compare current month vs. previous month
   - Identify cost anomalies (>20% increase)
   - Review cost per tenant trends
3. Identify Optimization Opportunities
   - Unused resources (delete if safe)
   - Underutilized resources (right-size)
   - Storage tier optimization
   - Reserved instance opportunities
4. Take Action
   - Delete unused resources
   - Right-size underutilized resources
   - Apply storage lifecycle policies
   - Purchase reserved instances
5. Document Results
   - Record cost savings achieved
   - Update cost optimization log
   - Share findings with team
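
The trend-analysis step (flagging increases above 20%) is easy to automate once monthly costs are grouped by service or environment. A minimal sketch with hypothetical figures; the function name and the treatment of new spend with no baseline are assumptions.

```python
def month_over_month_flags(previous, current, threshold=0.20):
    """Map of key -> fractional increase for entries that rose by more than threshold.

    Entries with spend this month but none last month are flagged with infinity.
    """
    flags = {}
    for key, current_cost in current.items():
        previous_cost = previous.get(key, 0)
        if previous_cost > 0:
            change = (current_cost - previous_cost) / previous_cost
            if change > threshold:
                flags[key] = change
        elif current_cost > 0:
            flags[key] = float("inf")
    return flags


# Hypothetical monthly cost by service, in dollars.
last_month = {"gateway": 1000, "ingestion": 2000, "query": 500}
this_month = {"gateway": 1100, "ingestion": 2600, "query": 500, "export": 50}

for service, change in month_over_month_flags(last_month, this_month).items():
    print(service, "new spend" if change == float("inf") else f"+{change:.0%}")
```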

Cost Anomaly Response Runbook

1. Receive Alert
   - Cost anomaly alert triggered (>50% spike)
   - Review alert details (time, resource, amount)
2. Investigate
   - Check Azure Cost Management for resource breakdown
   - Review resource utilization metrics
   - Check for unexpected scaling or traffic spikes
3. Identify Root Cause
   - Resource misconfiguration
   - Traffic spike
   - Scaling issue
   - Storage lifecycle failure
4. Take Corrective Action
   - Fix misconfiguration
   - Scale down if appropriate
   - Apply cost optimizations
   - Prevent future occurrences
5. Document Incident
   - Root cause analysis
   - Actions taken
   - Cost impact
   - Prevention measures

Summary

ATP implements comprehensive cost optimization strategies across all environments:

- Environment Cost Profiles: Graduated from $500/month (Dev) to $10,000/month (Production)
- Dev Optimization: Shutdown evenings/weekends (60% uptime) → 40% savings
- Test Optimization: Shutdown nights (70% uptime) → 30% savings
- Reserved Instances: 1-year commitments → 20-30% savings ($13,956/year total)
- Spot Instances: Preview environments → 90% savings
- Storage Lifecycle: Automated hot → cool → archive transitions → 80% storage savings
- Cost Alerts: Budget thresholds (50%, 80%, 100%) and anomaly detection (50% spike)
- Tagging Strategy: Granular cost attribution per environment, service, team, tenant
- FinOps Culture: Monthly cost reviews, cost-per-feature metrics, governance workflows

Next Steps:

- Review and customize cost budgets for your organization
- Implement automated shutdown schedules for Dev/Test
- Purchase reserved instances for Production
- Configure storage lifecycle policies
- Set up cost monitoring and alerts
- Establish cost governance workflows

Document Version: 1.0
Last Updated: 2025-10-30
Maintained By: Platform Operations & FinOps Team