
Progressive Rollout Strategies

Zero-downtime deployments with automated risk mitigation — ATP uses progressive rollout strategies to minimize blast radius, validate changes under real traffic, and enable instant rollback if issues are detected.


Overview

Progressive rollout is a deployment methodology that introduces changes incrementally to a subset of users or infrastructure, validates stability, and either continues the rollout or automatically reverts based on real-time metrics. This approach is critical for the Audit Trail Platform, where data integrity, tamper-evidence, and regulatory compliance cannot be compromised.

Why Progressive Rollouts?

| Challenge | Solution |
|---|---|
| Risk of breaking production | Limit blast radius to a small percentage of traffic |
| Unknown behavior under load | Validate with real production traffic before full rollout |
| Slow manual rollbacks | Automated metric-based rollback in seconds |
| Customer impact from bugs | Affect only canary users; protect 95%+ of tenants |
| Compliance violations | Quick detection and reversion preserves audit integrity |

Core Principles

  1. Incremental Exposure - Deploy to 5% → 20% → 50% → 100% of traffic
  2. Automated Validation - Metrics-driven health checks at each increment
  3. Instant Rollback - Sub-minute reversion to last known good state
  4. Observability First - Rich telemetry during rollout phases
  5. Tenant Isolation - Canary failures don't cascade to stable deployments

Deployment Strategy Decision Matrix

Use this table to select the appropriate rollout strategy based on your deployment scenario:

| Scenario | Strategy | Traffic Pattern | Rollback Time | Use When |
|---|---|---|---|---|
| New Feature Release | Canary Deployment | 5% → 20% → 50% → 100% | < 1 min (traffic revert) | Default for production features |
| Bug Fix (Low Risk) | Rolling Update | Pod-by-pod replacement | < 5 min (rollout undo) | Minor fixes, config changes |
| Major Version Upgrade | Blue-Green Swap | 100% instant switch | < 30 sec (slot swap) | Database migrations, API versions |
| Hotfix (Emergency) | Direct + Monitoring | 100% with validation | < 2 min (rollout undo) | Critical security patches |
| Infrastructure Change | Rolling Update | Node-by-node | < 5 min (rollout undo) | Kubernetes version, OS patches |
| Breaking API Change | Feature Flag + Canary | Gradual feature activation | Instant (flag toggle) | API deprecation, schema changes |

Deployment Strategies

1. Canary Deployment (Default for Production)

Best For: New features, behavior changes, algorithm updates

How It Works:

  1. Deploy the new version alongside the stable version (both running)
  2. Route a small percentage of traffic to the canary (5%)
  3. Monitor metrics for 15 minutes (error rate, latency, exceptions)
  4. If healthy, increase to the next increment (20% → 50% → 100%)
  5. If unhealthy, automatically roll back to 0% (canary removed)

Canary Traffic Increments

Stable (v1.2.3)  →  100% → 95% → 80% → 50% → 0% (decommissioned)
                      ↓      ↓      ↓      ↓
Canary (v1.2.4)  →    5% → 20% → 50% → 100% (promoted to stable)
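The increment schedule above can be captured as paired traffic weights. The sketch below is illustrative only (the `CANARY_PHASES` name and helper are not part of ATP's tooling):

```python
# Canary increment schedule: (stable_weight, canary_weight) per phase.
# At every phase the two weights must account for 100% of traffic.
CANARY_PHASES = [
    (95, 5),     # initial canary exposure
    (80, 20),
    (50, 50),
    (0, 100),    # canary promoted, stable decommissioned
]

def next_phase(current_canary_weight):
    """Return the next (stable, canary) weights, or None once fully promoted."""
    for stable, canary in CANARY_PHASES:
        if canary > current_canary_weight:
            return (stable, canary)
    return None
```

For example, `next_phase(5)` yields the 20% phase, and `next_phase(100)` yields `None`, signalling that the stable version can be decommissioned.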

Canary Validation Metrics

At each increment, the deployment pauses for 15 minutes while monitoring:

| Metric | Threshold | Action if Exceeded |
|---|---|---|
| Error Rate | < 1% (0.01) | Automatic rollback |
| P95 Latency | < 1000ms | Automatic rollback |
| Exception Count | < 10 in 15 min | Automatic rollback |
| Health Check Failures | 0 failures | Automatic rollback |
| Custom Business Metrics | Service-specific | Manual review required |

Example Canary Flow:

graph TD
    A[Deploy Canary v1.2.4] --> B[Route 5% Traffic]
    B --> C{Monitor 15 min}
    C -->|Metrics OK| D[Increase to 20%]
    C -->|Metrics BAD| Z[Automatic Rollback]
    D --> E{Monitor 15 min}
    E -->|Metrics OK| F[Increase to 50%]
    E -->|Metrics BAD| Z
    F --> G{Monitor 15 min}
    G -->|Metrics OK| H[Promote to 100%]
    G -->|Metrics BAD| Z
    H --> I[Decommission Stable v1.2.3]
    Z --> J[Alert On-Call SRE]
    Z --> K[Create Incident Ticket]

Canary Rollback (Automatic)

Trigger: Metrics exceed thresholds during soak period

Action: Immediate traffic reversion to 0% canary, 100% stable

Command (Kubernetes):

# Revert traffic to stable only
kubectl apply -f k8s/canary/stable-only-routing.yaml

# Wait for traffic drain (30 seconds)
sleep 30

# Delete canary deployment
kubectl delete deployment atp-ingestion-canary -n atp-prod

# Verify stable deployment health
kubectl rollout status deployment/atp-ingestion -n atp-prod

RTO: < 1 minute (traffic routing change)


2. Blue-Green Deployment (Staging/Major Versions)

Best For: Full environment validation, database migrations, API version changes

How It Works:

  1. Deploy the new version to the "green" slot (inactive)
  2. Run smoke tests and validation scripts against green
  3. Instantly swap traffic from "blue" (active) to "green"
  4. Monitor green under 100% traffic
  5. If issues are detected, instantly swap back to blue

Blue-Green Slots

Blue Slot (v1.2.3)   [ACTIVE - 100% traffic]
Green Slot (v1.2.4)  [IDLE - validation only]

↓ Swap ↓

Blue Slot (v1.2.3)   [IDLE - ready for rollback]
Green Slot (v1.2.4)  [ACTIVE - 100% traffic]

Validation Before Swap

Before swapping traffic, validate the green slot:

# 1. Health check endpoints
curl https://atp-staging-green.azurewebsites.net/health
curl https://atp-staging-green.azurewebsites.net/ready

# 2. Smoke tests
pytest tests/smoke/ --env=green --tenant=test-tenant-001

# 3. Database connectivity
psql -h atp-staging-green-db.postgres.database.azure.com -U admin -c "SELECT 1"

# 4. Service Bus connectivity
az servicebus queue show --namespace-name atp-staging-green-sb --name audit-events

# 5. Key Vault access
az keyvault secret show --vault-name atp-staging-green-kv --name TenantDbConnString

Only swap if all validations pass.
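The "only swap if all validations pass" rule is an all-or-nothing gate. A sketch of such a gate (the function and check names are hypothetical, not an existing script in this repo):

```python
from typing import Callable, Dict, List, Tuple

def run_pre_swap_gate(checks: Dict[str, Callable[[], bool]]) -> Tuple[bool, List[str]]:
    """Run every validation check; return (all_passed, names_of_failures).

    All checks run even after a failure, so the report lists every
    problem at once instead of stopping at the first one.
    """
    failures = [name for name, check in checks.items() if not check()]
    return (len(failures) == 0, failures)
```

A pipeline step could call it with one callable per validation (`health`, `smoke`, `db`, ...) and refuse the slot swap unless the first element of the result is `True`.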

Blue-Green Swap (Azure App Service)

# Swap staging slot to production
az webapp deployment slot swap \
  --resource-group rg-atp-staging \
  --name atp-ingestion-staging \
  --slot green \
  --target-slot blue

# Verify swap completed
az webapp deployment slot list \
  --resource-group rg-atp-staging \
  --name atp-ingestion-staging

Swap Duration: 5-30 seconds (no downtime)

Blue-Green Rollback

# Swap back to previous slot
az webapp deployment slot swap \
  --resource-group rg-atp-staging \
  --name atp-ingestion-staging \
  --slot blue \
  --target-slot green

RTO: < 30 seconds


3. Rolling Update (Kubernetes Infrastructure)

Best For: Config changes, minor fixes, Kubernetes version updates

How It Works:

  1. Update the deployment manifest (image version, config)
  2. Kubernetes replaces pods one-by-one
  3. Each new pod must pass health checks before an old pod is terminated
  4. This continues until all pods are updated
  5. If health checks fail, the rollout pauses automatically

Rolling Update Configuration

apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
  namespace: atp-prod
spec:
  replicas: 6
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # Allow 1 extra pod during rollout (7 total)
      maxUnavailable: 1  # Allow 1 pod to be unavailable (5 minimum)
  minReadySeconds: 30    # Wait 30s after pod ready before next
  template:
    spec:
      containers:
      - name: ingestion-api
        image: atp.azurecr.io/ingestion:1.2.4
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10

Monitor Rolling Update

# Apply updated manifest
kubectl apply -f k8s/atp-ingestion-deployment.yaml

# Watch rollout progress
kubectl rollout status deployment/atp-ingestion -n atp-prod --timeout=10m

# Check pod status during rollout
kubectl get pods -n atp-prod -l app=atp-ingestion -w

# View rollout history
kubectl rollout history deployment/atp-ingestion -n atp-prod

Expected Output:

Waiting for deployment "atp-ingestion" rollout to finish: 2 out of 6 new replicas have been updated...
Waiting for deployment "atp-ingestion" rollout to finish: 3 out of 6 new replicas have been updated...
Waiting for deployment "atp-ingestion" rollout to finish: 4 out of 6 new replicas have been updated...
Waiting for deployment "atp-ingestion" rollout to finish: 5 out of 6 new replicas have been updated...
Waiting for deployment "atp-ingestion" rollout to finish: 6 out of 6 new replicas have been updated...
Waiting for deployment "atp-ingestion" rollout to finish: 1 old replicas are pending termination...
deployment "atp-ingestion" successfully rolled out

Rolling Update Rollback

# Undo last rollout (immediate)
kubectl rollout undo deployment/atp-ingestion -n atp-prod

# Rollback to specific revision
kubectl rollout undo deployment/atp-ingestion -n atp-prod --to-revision=3

# Verify rollback
kubectl rollout status deployment/atp-ingestion -n atp-prod

RTO: < 5 minutes (depends on pod count and readiness checks)


4. Feature Flag-Based Rollout

Best For: A/B testing, gradual feature activation, breaking API changes

How It Works:

  1. Deploy code with the feature behind a toggle (disabled by default)
  2. Enable the feature for a percentage of users (10% → 25% → 50% → 100%)
  3. Monitor metrics per cohort (feature-on vs feature-off)
  4. If issues are detected, toggle off instantly (no redeployment)

Feature Flag Configuration

Azure App Configuration:

{
  "id": "AuditTrail.NewTamperProofAlgorithm",
  "description": "Enable SHA-512 with merkle tree for tamper evidence",
  "enabled": true,
  "conditions": {
    "client_filters": [
      {
        "name": "Percentage",
        "parameters": {
          "Value": 10
        }
      },
      {
        "name": "Targeting",
        "parameters": {
          "Audience": {
            "Users": ["tenant-beta-001", "tenant-beta-002"],
            "Groups": ["beta-testers"],
            "DefaultRolloutPercentage": 10
          }
        }
      }
    ]
  }
}
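Percentage filters are normally deterministic per tenant, so a given tenant stays in the same cohort across requests rather than flapping in and out of the feature. A sketch of that technique (this illustrates the general idea, not Azure App Configuration's actual hashing):

```python
import hashlib

def in_rollout(tenant_id: str, feature: str, percentage: int) -> bool:
    """Deterministically bucket a tenant into [0, 100) and compare the
    bucket against the rollout percentage. The same tenant+feature pair
    always lands in the same bucket, so cohort membership is stable,
    and raising the percentage only ever adds tenants to the cohort."""
    digest = hashlib.sha256(f"{feature}:{tenant_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percentage
```

Keying the hash on the feature name as well as the tenant means different features sample different tenant subsets, so one unlucky tenant is not the canary for everything.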

Gradual Feature Rollout

# Phase 1: Beta testers only (named tenants)
az appconfig feature set \
  --name atp-prod-appconfig \
  --feature AuditTrail.NewTamperProofAlgorithm \
  --filter "Targeting" \
  --parameters '{"Audience":{"Users":["tenant-beta-001"]}}'

# Phase 2: 10% random sampling
az appconfig feature set \
  --name atp-prod-appconfig \
  --feature AuditTrail.NewTamperProofAlgorithm \
  --filter "Percentage" \
  --parameters '{"Value":10}'

# Phase 3: 50% rollout
az appconfig feature set \
  --name atp-prod-appconfig \
  --feature AuditTrail.NewTamperProofAlgorithm \
  --filter "Percentage" \
  --parameters '{"Value":50}'

# Phase 4: 100% (full activation)
az appconfig feature set \
  --name atp-prod-appconfig \
  --feature AuditTrail.NewTamperProofAlgorithm \
  --enabled true

Feature Flag Emergency Disable

# Instant disable (no deployment required)
az appconfig feature set \
  --name atp-prod-appconfig \
  --feature AuditTrail.NewTamperProofAlgorithm \
  --enabled false

RTO: < 10 seconds (config refresh interval)


Rollout Monitoring & Validation

Real-Time Metrics Dashboard

During any rollout (canary, blue-green, rolling), monitor these dashboards:

Application Insights Queries:

// Error rate comparison (canary vs stable)
requests
| where timestamp > ago(15m)
| summarize 
    TotalRequests = count(),
    FailedRequests = countif(success == false)
    by cloud_RoleName
| extend ErrorRate = (FailedRequests * 100.0) / TotalRequests
| project cloud_RoleName, ErrorRate, TotalRequests

// P95 Latency comparison
requests
| where timestamp > ago(15m)
| summarize P95 = percentile(duration, 95) by cloud_RoleName
| project cloud_RoleName, P95_ms = P95

// Exception count per deployment
exceptions
| where timestamp > ago(15m)
| summarize ExceptionCount = count() by cloud_RoleName
| order by ExceptionCount desc

Automated Health Checks

Health Check Script (runs every 2 minutes during rollout):

#!/usr/bin/env python3
"""
Canary health monitor - runs during progressive rollouts.
Triggers automatic rollback if metrics exceed thresholds.
"""
import os
import sys
import time

import requests
from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

CANARY_ROLE = "atp-ingestion-canary"
STABLE_ROLE = "atp-ingestion-stable"

THRESHOLDS = {
    "error_rate_percent": 1.0,      # Max 1% error rate
    "p95_latency_ms": 1000,          # Max 1000ms p95
    "exception_count": 10,           # Max 10 exceptions in 15min
}

def get_error_rate(role_name):
    """Query Application Insights for error rate."""
    query = f"""
    requests
    | where timestamp > ago(15m)
    | where cloud_RoleName == '{role_name}'
    | summarize 
        Total = count(),
        Failed = countif(success == false)
    | extend ErrorRate = (Failed * 100.0) / Total
    | project ErrorRate
    """
    client = LogsQueryClient(DefaultAzureCredential())
    response = client.query_workspace(
        workspace_id=os.environ["LOG_ANALYTICS_WORKSPACE_ID"],
        query=query,
        timespan=None
    )
    return response.tables[0].rows[0][0] if response.tables[0].rows else 0.0

def get_p95_latency(role_name):
    """Query Application Insights for P95 latency."""
    query = f"""
    requests
    | where timestamp > ago(15m)
    | where cloud_RoleName == '{role_name}'
    | summarize P95 = percentile(duration, 95)
    | project P95
    """
    client = LogsQueryClient(DefaultAzureCredential())
    response = client.query_workspace(
        workspace_id=os.environ["LOG_ANALYTICS_WORKSPACE_ID"],
        query=query,
        timespan=None
    )
    return response.tables[0].rows[0][0] if response.tables[0].rows else 0.0

def get_exception_count(role_name):
    """Query Application Insights for exception count."""
    query = f"""
    exceptions
    | where timestamp > ago(15m)
    | where cloud_RoleName == '{role_name}'
    | count
    """
    client = LogsQueryClient(DefaultAzureCredential())
    response = client.query_workspace(
        workspace_id=os.environ["LOG_ANALYTICS_WORKSPACE_ID"],
        query=query,
        timespan=None
    )
    return response.tables[0].rows[0][0] if response.tables[0].rows else 0

def check_health_endpoint(url):
    """Check HTTP health endpoint."""
    try:
        response = requests.get(url, timeout=5)
        return response.status_code == 200
    except requests.RequestException:
        return False

def validate_canary_health():
    """
    Validate canary deployment health.
    Returns (is_healthy, metrics_dict)
    """
    print(f"🔍 Validating canary health at {time.strftime('%H:%M:%S')}...")

    metrics = {
        "error_rate": get_error_rate(CANARY_ROLE),
        "p95_latency": get_p95_latency(CANARY_ROLE),
        "exception_count": get_exception_count(CANARY_ROLE),
        "health_check": check_health_endpoint("https://atp-ingestion-canary/health")
    }

    # Check thresholds
    failures = []

    if metrics["error_rate"] > THRESHOLDS["error_rate_percent"]:
        failures.append(f"Error rate {metrics['error_rate']:.2f}% exceeds {THRESHOLDS['error_rate_percent']}%")

    if metrics["p95_latency"] > THRESHOLDS["p95_latency_ms"]:
        failures.append(f"P95 latency {metrics['p95_latency']:.0f}ms exceeds {THRESHOLDS['p95_latency_ms']}ms")

    if metrics["exception_count"] > THRESHOLDS["exception_count"]:
        failures.append(f"Exception count {metrics['exception_count']} exceeds {THRESHOLDS['exception_count']}")

    if not metrics["health_check"]:
        failures.append("Health check endpoint failed")

    is_healthy = len(failures) == 0

    if is_healthy:
        print(f"✅ Canary healthy: Error={metrics['error_rate']:.2f}%, P95={metrics['p95_latency']:.0f}ms, Exceptions={metrics['exception_count']}")
    else:
        print(f"❌ Canary unhealthy:")
        for failure in failures:
            print(f"   - {failure}")

    return is_healthy, metrics

def monitor_rollout(duration_minutes, check_interval_seconds=120):
    """
    Monitor canary deployment for specified duration.
    Exits with code 1 if unhealthy (triggers rollback in pipeline).
    """
    end_time = time.time() + (duration_minutes * 60)

    print(f"🚀 Starting {duration_minutes}-minute canary monitoring...")

    while time.time() < end_time:
        is_healthy, metrics = validate_canary_health()

        if not is_healthy:
            print(f"🔴 ROLLBACK TRIGGERED: Canary failed health validation")
            sys.exit(1)  # Exit code 1 triggers automatic rollback

        remaining_minutes = (end_time - time.time()) / 60
        print(f"⏳ {remaining_minutes:.1f} minutes remaining...\n")
        time.sleep(check_interval_seconds)

    print(f"✅ Canary monitoring complete: {duration_minutes} minutes passed successfully")
    sys.exit(0)

if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument("--duration", type=int, default=15, help="Monitoring duration in minutes")
    parser.add_argument("--interval", type=int, default=120, help="Check interval in seconds")
    args = parser.parse_args()

    monitor_rollout(args.duration, args.interval)

Usage in Pipeline:

- script: |
    python scripts/monitor-canary.py --duration 15 --interval 120
  displayName: 'Monitor Canary Health (15 min)'
  continueOnError: false  # Pipeline fails if script exits with code 1


Rollback Procedures

When to Rollback

| Trigger | Action | Who Decides |
|---|---|---|
| Automated threshold breach | Immediate automatic rollback | System (no human approval) |
| Customer-reported critical issue | Manual rollback via runbook | On-call SRE |
| Silent data corruption detected | Emergency rollback + incident | SRE + Security Team |
| Compliance violation | Immediate rollback + audit | Compliance Officer + SRE |
| Performance degradation (SLO miss) | Rollback if not resolved in 10 min | SRE |

Rollback Decision Tree

graph TD
    A[Issue Detected] --> B{Severity?}
    B -->|SEV1: Data Loss/Corruption| C[IMMEDIATE ROLLBACK]
    B -->|SEV2: Service Degradation| D{SLO Impact?}
    B -->|SEV3: Minor Bug| E{User-Facing?}

    D -->|> 1% error rate| C
    D -->|< 1% error rate| F[Monitor 10 min]

    E -->|Yes| G[Rollback + Hotfix]
    E -->|No| H[File Bug, Fix in Next Release]

    F -->|Improves| I[Continue Rollout]
    F -->|Worsens| C

    C --> J[Execute Rollback]
    J --> K[Verify Health]
    K --> L[RCA + Incident Report]

Automatic Rollback Triggers

These conditions trigger rollback WITHOUT human approval:

  1. Error Rate: Canary error rate > 1% for 2 consecutive minutes
  2. Latency Spike: P95 latency > 1000ms for 5 consecutive minutes
  3. Exception Storm: > 10 exceptions in 15-minute window
  4. Health Check Failure: Any health check endpoint returns non-200 status
  5. Database Connection Pool Exhaustion: Connection pool > 90% for 3 minutes
  6. Message Queue Backlog: Event queue depth > 10,000 for 5 minutes
  7. Memory Leak Detected: Container memory > 90% for 3 minutes
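Rules like (1) and (2) fire only after consecutive breaches, which requires a little state between checks. One way to sketch such an evaluator (the class is illustrative; the threshold values match the list above):

```python
class ConsecutiveBreachTrigger:
    """Fires once a metric exceeds its threshold for N consecutive samples,
    e.g. error rate > 1% for 2 consecutive one-minute checks."""

    def __init__(self, threshold: float, required_consecutive: int):
        self.threshold = threshold
        self.required = required_consecutive
        self.streak = 0

    def observe(self, value: float) -> bool:
        """Record one sample; return True if the rollback should fire."""
        if value > self.threshold:
            self.streak += 1
        else:
            self.streak = 0   # any healthy sample resets the streak
        return self.streak >= self.required
```

With `ConsecutiveBreachTrigger(1.0, 2)` driving rule (1), a single one-minute error-rate spike does not trigger a rollback, but two in a row does.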

Alert Channels:

  • PagerDuty: On-call SRE
  • Slack: #atp-incidents
  • Email: Platform team distribution list
  • Azure Monitor: Alert rule with action group

Manual Rollback Commands

Canary Rollback (Kubernetes)

#!/bin/bash
# rollback-canary.sh - Manual canary rollback

set -e

echo "🔴 Initiating canary rollback..."

# 1. Revert traffic to stable only
echo "Step 1: Reverting traffic to stable deployment..."
kubectl apply -f k8s/canary/stable-only-routing.yaml

# 2. Wait for traffic drain
echo "Step 2: Waiting 30 seconds for traffic to drain from canary..."
sleep 30

# 3. Delete canary deployment
echo "Step 3: Deleting canary deployment..."
kubectl delete deployment atp-ingestion-canary -n atp-prod --ignore-not-found=true

# 4. Verify stable deployment health
echo "Step 4: Verifying stable deployment health..."
kubectl rollout status deployment/atp-ingestion -n atp-prod --timeout=2m

# 5. Check metrics
echo "Step 5: Validating post-rollback metrics..."
ERROR_RATE=$(az monitor app-insights metrics show \
  --app atp-prod-appinsights \
  --metric requests/failed \
  --aggregation avg \
  --interval 5m \
  --offset 5m \
  --query value -o tsv)

# Note: requests/failed is a count metric, not a percentage; tune the threshold to traffic volume
if (( $(echo "$ERROR_RATE < 0.01" | bc -l) )); then
  echo "✅ Rollback successful. Failed-request metric: ${ERROR_RATE}"
else
  echo "⚠️  Failed-request metric still elevated: ${ERROR_RATE}"
fi

# 6. Notify team
curl -X POST $SLACK_WEBHOOK_URL \
  -H 'Content-Type: application/json' \
  -d "{\"text\":\"🔴 Production canary rollback completed. Build: $BUILD_NUMBER\"}"

echo "✅ Canary rollback complete. Investigate root cause before next deployment."

Blue-Green Rollback (Azure App Service)

#!/bin/bash
# rollback-blue-green.sh - Manual slot swap rollback

set -e

echo "🔴 Initiating blue-green rollback (slot swap)..."

# Swap back to previous slot
az webapp deployment slot swap \
  --resource-group rg-atp-staging \
  --name atp-ingestion-staging \
  --slot green \
  --target-slot blue

echo "⏳ Waiting for swap to complete..."
sleep 15

# Verify production slot is serving traffic
HEALTH_STATUS=$(curl -s -o /dev/null -w "%{http_code}" https://atp-ingestion-staging.azurewebsites.net/health)

if [ "$HEALTH_STATUS" -eq "200" ]; then
  echo "✅ Rollback successful. Production slot is healthy."
else
  echo "❌ ERROR: Production slot health check failed (HTTP $HEALTH_STATUS)"
  exit 1
fi

# Notify team
curl -X POST $SLACK_WEBHOOK_URL \
  -H 'Content-Type: application/json' \
  -d "{\"text\":\"🔴 Staging blue-green rollback completed.\"}"

Rolling Update Rollback (Kubernetes)

#!/bin/bash
# rollback-rolling-update.sh - Rollback Kubernetes rolling update

set -e

echo "🔴 Initiating rolling update rollback..."

# Undo last rollout
kubectl rollout undo deployment/atp-ingestion -n atp-prod

# Wait for rollout to complete
echo "⏳ Waiting for rollout to complete..."
kubectl rollout status deployment/atp-ingestion -n atp-prod --timeout=5m

# Verify pod health
READY_PODS=$(kubectl get deployment atp-ingestion -n atp-prod -o jsonpath='{.status.readyReplicas}')
DESIRED_PODS=$(kubectl get deployment atp-ingestion -n atp-prod -o jsonpath='{.spec.replicas}')

if [ "$READY_PODS" -eq "$DESIRED_PODS" ]; then
  echo "✅ Rollback successful. $READY_PODS/$DESIRED_PODS pods ready."
else
  echo "⚠️  Only $READY_PODS/$DESIRED_PODS pods ready."
fi

# Check current image version
CURRENT_IMAGE=$(kubectl get deployment atp-ingestion -n atp-prod -o jsonpath='{.spec.template.spec.containers[0].image}')
echo "📦 Current image: $CURRENT_IMAGE"

Environment-Specific Rollout Strategies

| Environment | Primary Strategy | Approval Required | Monitoring Duration | Rollback Policy |
|---|---|---|---|---|
| Dev | Direct deploy | No | None | Manual only |
| Test | Rolling update | No | 5 minutes | Automatic on failure |
| Staging | Blue-green | 1 approver | 30 minutes | Automatic on health check failure |
| Production | Canary (5/20/50/100) | 2 approvers + CAB | 15 min per increment | Automatic on metrics breach |
| Hotfix | Direct + monitoring | 2 approvers (expedited) | 30 minutes | Automatic on any error |

Runbook Integration

Pre-Deployment Checklist

Before initiating any production rollout:

  • Change Request approved by CAB (except hotfix)
  • Rollback plan documented and tested in staging
  • On-call SRE notified and available
  • Monitoring dashboards open (Application Insights, Grafana)
  • Communication channel active (#atp-deployments in Slack)
  • Feature flags configured for gradual activation (if applicable)
  • Database migrations tested and reversible
  • Load tests passed in staging (for major changes)
  • Security scan passed (no critical vulnerabilities)
  • Deployment window confirmed (avoid Friday evenings!)

During Deployment

  • Monitor error rate in real-time (refresh every 30 seconds)
  • Watch exception logs in Application Insights
  • Check queue depths for backlog buildup
  • Validate health endpoints return 200 OK
  • Communicate status in #atp-deployments at each increment
  • Pause if uncertain — better safe than sorry

Post-Deployment

  • Verify metrics returned to baseline
  • Smoke tests passed in production
  • Synthetic monitors passing (Pingdom, Azure Monitor)
  • Update runbook if new issues discovered
  • Document lessons learned in deployment log
  • Decommission canary resources (if applicable)

Metrics & SLO Validation

Key Metrics to Monitor

| Metric | Target (SLO) | Warning Threshold | Critical Threshold |
|---|---|---|---|
| Availability | 99.9% | 99.5% | 99.0% |
| Error Rate | < 0.1% | > 0.5% | > 1.0% |
| P95 Latency | < 500ms | > 800ms | > 1000ms |
| P99 Latency | < 1000ms | > 1500ms | > 2000ms |
| Event Ingestion Throughput | 10K events/sec | 8K events/sec | 5K events/sec |
| Queue Processing Lag | < 1 minute | > 2 minutes | > 5 minutes |
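The warning/critical bands in the table above can be encoded as a small classifier. Note that throughput-style metrics invert the comparison, since falling below the threshold is the problem. A sketch (the `classify` helper is illustrative):

```python
def classify(value: float, warning: float, critical: float,
             higher_is_worse: bool = True) -> str:
    """Classify a metric sample against its warning/critical thresholds.

    For throughput-style metrics pass higher_is_worse=False, because
    dropping BELOW the threshold is what indicates trouble.
    """
    breaches_critical = value > critical if higher_is_worse else value < critical
    breaches_warning = value > warning if higher_is_worse else value < warning
    if breaches_critical:
        return "critical"
    if breaches_warning:
        return "warning"
    return "ok"
```

For example, an error rate of 0.7% classifies as `"warning"`, while ingestion throughput of 4K events/sec (with `higher_is_worse=False`) classifies as `"critical"`.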

SLO Dashboard

Application Insights Query (Error Budget):

// Calculate error budget remaining for current month
let startOfMonth = startofmonth(now());
let totalRequests = toscalar(requests
    | where timestamp >= startOfMonth
    | count);
let failedRequests = toscalar(requests
    | where timestamp >= startOfMonth
    | where success == false
    | count);
let actualAvailability = 100.0 - ((failedRequests * 100.0) / totalRequests);
let targetAvailability = 99.9;
let errorBudget = 100.0 - targetAvailability;  // 0.1%
let errorBudgetUsed = 100.0 - actualAvailability;
let errorBudgetRemaining = errorBudget - errorBudgetUsed;
print 
    ActualAvailability = actualAvailability,
    TargetAvailability = targetAvailability,
    ErrorBudgetTotal = errorBudget,
    ErrorBudgetUsed = errorBudgetUsed,
    ErrorBudgetRemaining = errorBudgetRemaining,
    ErrorBudgetRemainingPercent = (errorBudgetRemaining / errorBudget) * 100.0

Interpretation:

  • > 50% error budget remaining: Safe to deploy
  • 20-50% remaining: Deploy with caution, short canary soaks
  • < 20% remaining: Defer non-critical deployments
  • < 0% (budget exhausted): Freeze deployments, focus on stability
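The same error-budget arithmetic as the query above, as a standalone function a pipeline gate could reuse (the helper itself is illustrative; inputs are month-to-date request totals):

```python
def error_budget_remaining_percent(total: int, failed: int,
                                   target_availability: float = 99.9) -> float:
    """Return the share of the monthly error budget still unspent, in percent.

    100.0 means the budget is untouched; a negative value means the
    budget is exhausted and deployments should freeze.
    """
    actual_availability = 100.0 - (failed * 100.0 / total)
    budget = 100.0 - target_availability      # e.g. 0.1 for a 99.9% SLO
    used = 100.0 - actual_availability
    return (budget - used) / budget * 100.0
```

With 1M requests and 500 failures against a 99.9% SLO, the function reports 50% of the budget remaining, which falls in the "deploy with caution" band above.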


Communication & Escalation

Deployment Communication Template

Pre-Deployment Announcement:

📢 **Production Deployment Starting**

**Service:** Audit Trail Platform - Ingestion API
**Version:** v1.2.4 → v1.2.5
**Change:** New tamper-proof algorithm (SHA-512 + Merkle tree)
**Strategy:** Canary deployment (5% → 20% → 50% → 100%)
**Duration:** ~90 minutes (15 min per increment)
**Rollback Plan:** Automatic on error rate > 1%
**On-Call SRE:** @alice (primary), @bob (secondary)
**Monitoring:** https://portal.azure.com/#blade/AppInsights/atp-prod

**Status:** Starting canary at 5% traffic...

Mid-Deployment Update:

✅ **Canary Update: 20% Traffic**

**Metrics (last 15 min):**
- Error Rate: 0.08% ✅
- P95 Latency: 487ms ✅
- Exceptions: 2 ✅
- Health Checks: Passing ✅

**Next:** Increasing to 50% traffic. ETA: 2:30 PM EST

Rollback Announcement:

🔴 **ROLLBACK IN PROGRESS**

**Reason:** Canary error rate exceeded 1% (actual: 1.8%)
**Action:** Reverting all traffic to stable v1.2.4
**Impact:** No customer impact (95% on stable, canary isolated)
**ETA:** Rollback complete in 2 minutes
**Incident:** INC-2025-001 created
**Next Steps:** RCA meeting scheduled for 4:00 PM EST

Escalation Path

  1. Automated Alert → PagerDuty → On-call SRE (immediate)
  2. On-call SRE → Deployment lead (within 5 minutes)
  3. Deployment lead → Engineering manager (if rollback fails)
  4. Engineering manager → VP Engineering (if production down > 30 minutes)
  5. VP Engineering → CTO (if SEV1 incident > 1 hour)

Advanced Rollout Patterns

Progressive Feature Rollout (Combined Strategy)

Scenario: Deploy new feature with both infrastructure and feature flag rollout.

Approach:

  1. Week 1: Deploy code with the feature flag OFF to all environments (canary deployment)
  2. Week 2: Enable the feature for 10% of users in production (feature flag)
  3. Week 3: Increase to 50% if metrics are healthy
  4. Week 4: Enable for 100% (full activation)

Benefits:

  • Code is deployed separately from activation (decouples risk)
  • Instant disable via feature flag (no redeployment)
  • Gradual user exposure (minimizes blast radius)

Multi-Region Progressive Rollout

Scenario: Deploy to multiple Azure regions with staggered rollout.

Approach:

  1. Region 1 (East US): Canary deployment (5% → 20% → 50% → 100%)
  2. Wait 24 hours, monitor metrics
  3. Region 2 (West Europe): Canary deployment
  4. Wait 24 hours, monitor metrics
  5. Region 3 (Southeast Asia): Canary deployment

Benefits:

  • Contains blast radius to a single region
  • Leverages time zones for round-the-clock monitoring
  • Learn from each region before proceeding

Database Schema Migration Rollout

Scenario: Deploy backward-compatible database schema change.

Approach:

  1. Phase 1: Add the new column with a default value (backward compatible)
  2. Phase 2: Deploy application code that writes to both old and new columns
  3. Phase 3: Backfill data in the new column
  4. Phase 4: Deploy application code that reads from the new column
  5. Phase 5 (weeks later): Remove the old column

Benefits:

  • Zero-downtime migration
  • Instant rollback at each phase
  • Production validation before committing to the schema change
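Phase 2's dual-write step is the heart of the pattern: the legacy column stays authoritative while the new one fills in parallel. A minimal sketch with made-up column names (`status` and `status_v2` are illustrative, not ATP's schema):

```python
# Phase toggles; in practice these would come from config or feature flags.
WRITE_NEW_COLUMN = True    # Phase 2+: mirror every write into the new column
READ_NEW_COLUMN = False    # Phase 4+: flip reads once the backfill is verified

def write_event(record: dict, value: str) -> dict:
    """Write to the legacy column; in Phase 2+ also shadow-write the new
    column, so the backfill only has to cover pre-existing rows."""
    record["status"] = value              # legacy column, still authoritative
    if WRITE_NEW_COLUMN:
        record["status_v2"] = value       # new column, shadow write
    return record

def read_status(record: dict) -> str:
    """Read from whichever column the current phase designates."""
    return record["status_v2"] if READ_NEW_COLUMN else record["status"]
```

Because each toggle flips independently, rolling back Phase 4 is just setting `READ_NEW_COLUMN` back to `False`; no data is lost because writes never stopped going to the legacy column.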


Troubleshooting Common Rollout Issues

Issue: Canary Health Checks Fail Immediately

Symptoms:

  • Canary pods fail the readiness probe
  • Error: Readiness probe failed: HTTP probe failed with statuscode: 503

Diagnosis:

# Check pod logs
kubectl logs -n atp-prod -l app=atp-ingestion,version=canary --tail=100

# Check health endpoint directly
kubectl exec -n atp-prod atp-ingestion-canary-xyz -- curl localhost:8080/health

Common Causes:

  1. Database connection string incorrect (check Key Vault reference)
  2. Service Bus namespace unreachable (check NSG rules)
  3. Application startup timeout too short (increase initialDelaySeconds)
  4. Missing environment variable (check ConfigMap)

Resolution:

# Fix and redeploy
kubectl delete pod -n atp-prod -l app=atp-ingestion,version=canary
kubectl rollout restart deployment/atp-ingestion-canary -n atp-prod


Issue: Blue-Green Swap Fails with Timeout

Symptoms:

  • az webapp deployment slot swap hangs for > 5 minutes
  • Error: The operation has timed out

Diagnosis:

# Check slot status
az webapp show \
  --resource-group rg-atp-staging \
  --name atp-ingestion-staging \
  --slot green \
  --query state

# Check recent logs
az webapp log tail \
  --resource-group rg-atp-staging \
  --name atp-ingestion-staging \
  --slot green

Common Causes:

  1. Application initialization is slow (app pool startup)
  2. Warm-up requests configured but failing
  3. Health check endpoint not responding during the swap

Resolution:

# Cancel swap (if stuck)
az webapp deployment slot swap \
  --resource-group rg-atp-staging \
  --name atp-ingestion-staging \
  --slot green \
  --target-slot blue \
  --action cancel

# Increase swap timeout (app setting)
az webapp config appsettings set \
  --resource-group rg-atp-staging \
  --name atp-ingestion-staging \
  --slot green \
  --settings WEBSITE_SWAP_WARMUP_PING_PATH=/health WEBSITE_SWAP_WARMUP_PING_STATUSES=200


Issue: Rolling Update Stuck at "Waiting for deployment to finish"

Symptoms:

  • kubectl rollout status hangs
  • Some pods in CrashLoopBackOff or ImagePullBackOff

Diagnosis:

# Check rollout status
kubectl rollout status deployment/atp-ingestion -n atp-prod

# Check pod status
kubectl get pods -n atp-prod -l app=atp-ingestion

# Describe problematic pod
kubectl describe pod atp-ingestion-xyz -n atp-prod

Common Causes:

  1. New image doesn't exist in the container registry
  2. Image pull secret expired or missing
  3. Pod resource limits too low (OOMKilled)
  4. Liveness probe killing healthy pods

Resolution:

# Pause rollout to investigate
kubectl rollout pause deployment/atp-ingestion -n atp-prod

# Fix issue (e.g., increase memory limit)
kubectl set resources deployment/atp-ingestion -n atp-prod \
  --limits=memory=2Gi --requests=memory=1Gi

# Resume rollout
kubectl rollout resume deployment/atp-ingestion -n atp-prod

# OR rollback if unfixable
kubectl rollout undo deployment/atp-ingestion -n atp-prod


Changelog

| Date | Version | Changes | Author |
|---|---|---|---|
| 2025-10-30 | 1.0.0 | Initial progressive rollout guide | Platform Team |

Questions or feedback? Contact the Platform Engineering team or open an issue in Azure DevOps.