
Progressive Rollout Strategies

Zero-downtime deployments with automated risk mitigation — ATP uses progressive rollout strategies to minimize blast radius, validate changes under real traffic, and enable instant rollback if issues are detected.


Overview

Progressive rollout is a deployment methodology that introduces changes incrementally to a subset of users or infrastructure, validates stability, and either continues the rollout or automatically reverts based on real-time metrics. This approach is critical for the Audit Trail Platform, where data integrity, tamper-evidence, and regulatory compliance cannot be compromised.

Why Progressive Rollouts?

| Challenge | Solution |
|---|---|
| Risk of breaking production | Limit blast radius to a small percentage of traffic |
| Unknown behavior under load | Validate with real production traffic before full rollout |
| Slow manual rollbacks | Automated metric-based rollback in seconds |
| Customer impact from bugs | Affect only canary users; protect 95%+ of tenants |
| Compliance violations | Quick detection and reversion preserves audit integrity |

Core Principles

  1. Incremental Exposure - Deploy to 5% → 20% → 50% → 100% of traffic
  2. Automated Validation - Metrics-driven health checks at each increment
  3. Instant Rollback - Sub-minute reversion to last known good state
  4. Observability First - Rich telemetry during rollout phases
  5. Tenant Isolation - Canary failures don't cascade to stable deployments

Deployment Strategy Decision Matrix

Use this table to select the appropriate rollout strategy based on your deployment scenario:

| Scenario | Strategy | Traffic Pattern | Rollback Time | Use When |
|---|---|---|---|---|
| New Feature Release | Canary Deployment | 5% → 20% → 50% → 100% | < 1 min (traffic revert) | Default for production features |
| Bug Fix (Low Risk) | Rolling Update | Pod-by-pod replacement | < 5 min (rollout undo) | Minor fixes, config changes |
| Major Version Upgrade | Blue-Green Swap | 100% instant switch | < 30 sec (slot swap) | Database migrations, API versions |
| Hotfix (Emergency) | Direct + Monitoring | 100% with validation | < 2 min (rollout undo) | Critical security patches |
| Infrastructure Change | Rolling Update | Node-by-node | < 5 min (rollout undo) | Kubernetes version, OS patches |
| Breaking API Change | Feature Flag + Canary | Gradual feature activation | Instant (flag toggle) | API deprecation, schema changes |

Deployment Strategies

1. Canary Deployment (Default for Production)

Best For: New features, behavior changes, algorithm updates

How It Works:

  1. Deploy the new version alongside the stable version (both running)
  2. Route a small percentage of traffic to the canary (5%)
  3. Monitor metrics for 15 minutes (error rate, latency, exceptions)
  4. If healthy, increase to the next increment (20% → 50% → 100%)
  5. If unhealthy, automatically roll back to 0% (canary removed)

Canary Traffic Increments

Stable (v1.2.3)  →  100% → 95% → 80% → 50% → 0% (decommissioned)
                      ↓      ↓      ↓      ↓
Canary (v1.2.4)  →    5% → 20% → 50% → 100% (promoted to stable)
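The increment schedule above can be captured as paired traffic weights. The sketch below is illustrative only (the `CANARY_PHASES` name and helper are not part of ATP's tooling):

```python
# Canary increment schedule: (stable_weight, canary_weight) per phase.
# At every phase the two weights must account for 100% of traffic.
CANARY_PHASES = [
    (95, 5),     # initial canary exposure
    (80, 20),
    (50, 50),
    (0, 100),    # canary promoted, stable decommissioned
]

def next_phase(current_canary_weight):
    """Return the next (stable, canary) weights, or None once fully promoted."""
    for stable, canary in CANARY_PHASES:
        if canary > current_canary_weight:
            return (stable, canary)
    return None
```

For example, `next_phase(5)` yields the 20% phase, and `next_phase(100)` yields `None`, signalling that the stable version can be decommissioned.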

Canary Validation Metrics

At each increment, the deployment pauses for 15 minutes while monitoring:

| Metric | Threshold | Action if Exceeded |
|---|---|---|
| Error Rate | < 1% (0.01) | Automatic rollback |
| P95 Latency | < 1000ms | Automatic rollback |
| Exception Count | < 10 in 15 min | Automatic rollback |
| Health Check Failures | 0 failures | Automatic rollback |
| Custom Business Metrics | Service-specific | Manual review required |

Example Canary Flow:

graph TD
    A[Deploy Canary v1.2.4] --> B[Route 5% Traffic]
    B --> C{Monitor 15 min}
    C -->|Metrics OK| D[Increase to 20%]
    C -->|Metrics BAD| Z[Automatic Rollback]
    D --> E{Monitor 15 min}
    E -->|Metrics OK| F[Increase to 50%]
    E -->|Metrics BAD| Z
    F --> G{Monitor 15 min}
    G -->|Metrics OK| H[Promote to 100%]
    G -->|Metrics BAD| Z
    H --> I[Decommission Stable v1.2.3]
    Z --> J[Alert On-Call SRE]
    Z --> K[Create Incident Ticket]

Canary Rollback (Automatic)

Trigger: Metrics exceed thresholds during soak period

Action: Immediate traffic reversion to 0% canary, 100% stable

Command (Kubernetes):

# Revert traffic to stable only
kubectl apply -f k8s/canary/stable-only-routing.yaml

# Wait for traffic drain (30 seconds)
sleep 30

# Delete canary deployment
kubectl delete deployment atp-ingestion-canary -n atp-prod

# Verify stable deployment health
kubectl rollout status deployment/atp-ingestion -n atp-prod

RTO: < 1 minute (traffic routing change)


2. Blue-Green Deployment (Staging/Major Versions)

Best For: Full environment validation, database migrations, API version changes

How It Works:

  1. Deploy the new version to the "green" slot (inactive)
  2. Run smoke tests and validation scripts against green
  3. Instantly swap traffic from "blue" (active) to "green"
  4. Monitor green under 100% traffic
  5. If issues are detected, instantly swap back to blue

Blue-Green Slots

Blue Slot (v1.2.3)   [ACTIVE - 100% traffic]
Green Slot (v1.2.4)  [IDLE - validation only]

↓ Swap ↓

Blue Slot (v1.2.3)   [IDLE - ready for rollback]
Green Slot (v1.2.4)  [ACTIVE - 100% traffic]

Validation Before Swap

Before swapping traffic, validate the green slot:

# 1. Health check endpoints
curl https://atp-staging-green.azurewebsites.net/health
curl https://atp-staging-green.azurewebsites.net/ready

# 2. Smoke tests
pytest tests/smoke/ --env=green --tenant=test-tenant-001

# 3. Database connectivity
psql -h atp-staging-green-db.postgres.database.azure.com -U admin -c "SELECT 1"

# 4. Service Bus connectivity
az servicebus queue show --namespace-name atp-staging-green-sb --name audit-events

# 5. Key Vault access
az keyvault secret show --vault-name atp-staging-green-kv --name TenantDbConnString

Only swap if all validations pass.
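The "only swap if all validations pass" rule is an all-or-nothing gate. A sketch of such a gate (the function and check names are hypothetical, not an existing script in this repo):

```python
from typing import Callable, Dict, List, Tuple

def run_pre_swap_gate(checks: Dict[str, Callable[[], bool]]) -> Tuple[bool, List[str]]:
    """Run every validation check; return (all_passed, names_of_failures).

    All checks run even after a failure, so the report lists every
    problem at once instead of stopping at the first one.
    """
    failures = [name for name, check in checks.items() if not check()]
    return (len(failures) == 0, failures)
```

A pipeline step could call it with one callable per validation (`health`, `smoke`, `db`, ...) and refuse the slot swap unless the first element of the result is `True`.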

Blue-Green Swap (Azure App Service)

# Swap staging slot to production
az webapp deployment slot swap \
  --resource-group rg-atp-staging \
  --name atp-ingestion-staging \
  --slot green \
  --target-slot blue

# Verify swap completed
az webapp deployment slot list \
  --resource-group rg-atp-staging \
  --name atp-ingestion-staging

Swap Duration: 5-30 seconds (no downtime)

Blue-Green Rollback

# Swap back to previous slot
az webapp deployment slot swap \
  --resource-group rg-atp-staging \
  --name atp-ingestion-staging \
  --slot blue \
  --target-slot green

RTO: < 30 seconds


3. Rolling Update (Kubernetes Infrastructure)

Best For: Config changes, minor fixes, Kubernetes version updates

How It Works:

  1. Update the deployment manifest (image version, config)
  2. Kubernetes replaces pods one-by-one
  3. Each new pod must pass health checks before an old pod is terminated
  4. This continues until all pods are updated
  5. If health checks fail, the rollout pauses automatically

Rolling Update Configuration

apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
  namespace: atp-prod
spec:
  replicas: 6
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # Allow 1 extra pod during rollout (7 total)
      maxUnavailable: 1  # Allow 1 pod to be unavailable (5 minimum)
  minReadySeconds: 30    # Wait 30s after pod ready before next
  template:
    spec:
      containers:
      - name: ingestion-api
        image: atp.azurecr.io/ingestion:1.2.4
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10

Monitor Rolling Update

# Apply updated manifest
kubectl apply -f k8s/atp-ingestion-deployment.yaml

# Watch rollout progress
kubectl rollout status deployment/atp-ingestion -n atp-prod --timeout=10m

# Check pod status during rollout
kubectl get pods -n atp-prod -l app=atp-ingestion -w

# View rollout history
kubectl rollout history deployment/atp-ingestion -n atp-prod

Expected Output:

Waiting for deployment "atp-ingestion" rollout to finish: 2 out of 6 new replicas have been updated...
Waiting for deployment "atp-ingestion" rollout to finish: 3 out of 6 new replicas have been updated...
Waiting for deployment "atp-ingestion" rollout to finish: 4 out of 6 new replicas have been updated...
Waiting for deployment "atp-ingestion" rollout to finish: 5 out of 6 new replicas have been updated...
Waiting for deployment "atp-ingestion" rollout to finish: 6 out of 6 new replicas have been updated...
Waiting for deployment "atp-ingestion" rollout to finish: 1 old replicas are pending termination...
deployment "atp-ingestion" successfully rolled out

Rolling Update Rollback

# Undo last rollout (immediate)
kubectl rollout undo deployment/atp-ingestion -n atp-prod

# Rollback to specific revision
kubectl rollout undo deployment/atp-ingestion -n atp-prod --to-revision=3

# Verify rollback
kubectl rollout status deployment/atp-ingestion -n atp-prod

RTO: < 5 minutes (depends on pod count and readiness checks)


4. Feature Flag-Based Rollout

Best For: A/B testing, gradual feature activation, breaking API changes

How It Works:

  1. Deploy code with the feature behind a toggle (disabled by default)
  2. Enable the feature for a percentage of users (10% → 25% → 50% → 100%)
  3. Monitor metrics per cohort (feature-on vs feature-off)
  4. If issues are detected, toggle off instantly (no redeployment)

Feature Flag Configuration

Azure App Configuration:

{
  "id": "AuditTrail.NewTamperProofAlgorithm",
  "description": "Enable SHA-512 with merkle tree for tamper evidence",
  "enabled": true,
  "conditions": {
    "client_filters": [
      {
        "name": "Percentage",
        "parameters": {
          "Value": 10
        }
      },
      {
        "name": "Targeting",
        "parameters": {
          "Audience": {
            "Users": ["tenant-beta-001", "tenant-beta-002"],
            "Groups": ["beta-testers"],
            "DefaultRolloutPercentage": 10
          }
        }
      }
    ]
  }
}
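Percentage filters are normally deterministic per tenant, so a given tenant stays in the same cohort across requests rather than flapping in and out of the feature. A sketch of that technique (this illustrates the general idea, not Azure App Configuration's actual hashing):

```python
import hashlib

def in_rollout(tenant_id: str, feature: str, percentage: int) -> bool:
    """Deterministically bucket a tenant into [0, 100) and compare the
    bucket against the rollout percentage. The same tenant+feature pair
    always lands in the same bucket, so cohort membership is stable,
    and raising the percentage only ever adds tenants to the cohort."""
    digest = hashlib.sha256(f"{feature}:{tenant_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percentage
```

Keying the hash on the feature name as well as the tenant means different features sample different tenant subsets, so one unlucky tenant is not the canary for everything.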

Gradual Feature Rollout

# Phase 1: Beta testers only (named tenants)
az appconfig feature set \
  --name atp-prod-appconfig \
  --feature AuditTrail.NewTamperProofAlgorithm \
  --filter "Targeting" \
  --parameters '{"Audience":{"Users":["tenant-beta-001"]}}'

# Phase 2: 10% random sampling
az appconfig feature set \
  --name atp-prod-appconfig \
  --feature AuditTrail.NewTamperProofAlgorithm \
  --filter "Percentage" \
  --parameters '{"Value":10}'

# Phase 3: 50% rollout
az appconfig feature set \
  --name atp-prod-appconfig \
  --feature AuditTrail.NewTamperProofAlgorithm \
  --filter "Percentage" \
  --parameters '{"Value":50}'

# Phase 4: 100% (full activation)
az appconfig feature set \
  --name atp-prod-appconfig \
  --feature AuditTrail.NewTamperProofAlgorithm \
  --enabled true

Feature Flag Emergency Disable

# Instant disable (no deployment required)
az appconfig feature set \
  --name atp-prod-appconfig \
  --feature AuditTrail.NewTamperProofAlgorithm \
  --enabled false

RTO: < 10 seconds (config refresh interval)


Rollout Monitoring & Validation

Real-Time Metrics Dashboard

During any rollout (canary, blue-green, rolling), monitor these dashboards:

Application Insights Queries:

// Error rate comparison (canary vs stable)
requests
| where timestamp > ago(15m)
| summarize 
    TotalRequests = count(),
    FailedRequests = countif(success == false)
    by cloud_RoleName
| extend ErrorRate = (FailedRequests * 100.0) / TotalRequests
| project cloud_RoleName, ErrorRate, TotalRequests

// P95 Latency comparison
requests
| where timestamp > ago(15m)
| summarize P95 = percentile(duration, 95) by cloud_RoleName
| project cloud_RoleName, P95_ms = P95

// Exception count per deployment
exceptions
| where timestamp > ago(15m)
| summarize ExceptionCount = count() by cloud_RoleName
| order by ExceptionCount desc

Automated Health Checks

Health Check Script (runs every 2 minutes during rollout):

#!/usr/bin/env python3
"""
Canary health monitor - runs during progressive rollouts.
Triggers automatic rollback if metrics exceed thresholds.
"""
import os
import sys
import time

import requests
from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

CANARY_ROLE = "atp-ingestion-canary"
STABLE_ROLE = "atp-ingestion-stable"

THRESHOLDS = {
    "error_rate_percent": 1.0,      # Max 1% error rate
    "p95_latency_ms": 1000,          # Max 1000ms p95
    "exception_count": 10,           # Max 10 exceptions in 15min
}

def get_error_rate(role_name):
    """Query Application Insights for error rate."""
    query = f"""
    requests
    | where timestamp > ago(15m)
    | where cloud_RoleName == '{role_name}'
    | summarize 
        Total = count(),
        Failed = countif(success == false)
    | extend ErrorRate = (Failed * 100.0) / Total
    | project ErrorRate
    """
    client = LogsQueryClient(DefaultAzureCredential())
    response = client.query_workspace(
        workspace_id=os.environ["LOG_ANALYTICS_WORKSPACE_ID"],
        query=query,
        timespan=None
    )
    return response.tables[0].rows[0][0] if response.tables[0].rows else 0.0

def get_p95_latency(role_name):
    """Query Application Insights for P95 latency."""
    query = f"""
    requests
    | where timestamp > ago(15m)
    | where cloud_RoleName == '{role_name}'
    | summarize P95 = percentile(duration, 95)
    | project P95
    """
    client = LogsQueryClient(DefaultAzureCredential())
    response = client.query_workspace(
        workspace_id=os.environ["LOG_ANALYTICS_WORKSPACE_ID"],
        query=query,
        timespan=None
    )
    return response.tables[0].rows[0][0] if response.tables[0].rows else 0.0

def get_exception_count(role_name):
    """Query Application Insights for exception count."""
    query = f"""
    exceptions
    | where timestamp > ago(15m)
    | where cloud_RoleName == '{role_name}'
    | count
    """
    client = LogsQueryClient(DefaultAzureCredential())
    response = client.query_workspace(
        workspace_id=os.environ["LOG_ANALYTICS_WORKSPACE_ID"],
        query=query,
        timespan=None
    )
    return response.tables[0].rows[0][0] if response.tables[0].rows else 0

def check_health_endpoint(url):
    """Check HTTP health endpoint."""
    try:
        response = requests.get(url, timeout=5)
        return response.status_code == 200
    except requests.RequestException:
        return False

def validate_canary_health():
    """
    Validate canary deployment health.
    Returns (is_healthy, metrics_dict)
    """
    print(f"🔍 Validating canary health at {time.strftime('%H:%M:%S')}...")

    metrics = {
        "error_rate": get_error_rate(CANARY_ROLE),
        "p95_latency": get_p95_latency(CANARY_ROLE),
        "exception_count": get_exception_count(CANARY_ROLE),
        "health_check": check_health_endpoint("https://atp-ingestion-canary/health")
    }

    # Check thresholds
    failures = []

    if metrics["error_rate"] > THRESHOLDS["error_rate_percent"]:
        failures.append(f"Error rate {metrics['error_rate']:.2f}% exceeds {THRESHOLDS['error_rate_percent']}%")

    if metrics["p95_latency"] > THRESHOLDS["p95_latency_ms"]:
        failures.append(f"P95 latency {metrics['p95_latency']:.0f}ms exceeds {THRESHOLDS['p95_latency_ms']}ms")

    if metrics["exception_count"] > THRESHOLDS["exception_count"]:
        failures.append(f"Exception count {metrics['exception_count']} exceeds {THRESHOLDS['exception_count']}")

    if not metrics["health_check"]:
        failures.append("Health check endpoint failed")

    is_healthy = len(failures) == 0

    if is_healthy:
        print(f"✅ Canary healthy: Error={metrics['error_rate']:.2f}%, P95={metrics['p95_latency']:.0f}ms, Exceptions={metrics['exception_count']}")
    else:
        print(f"❌ Canary unhealthy:")
        for failure in failures:
            print(f"   - {failure}")

    return is_healthy, metrics

def monitor_rollout(duration_minutes, check_interval_seconds=120):
    """
    Monitor canary deployment for specified duration.
    Exits with code 1 if unhealthy (triggers rollback in pipeline).
    """
    end_time = time.time() + (duration_minutes * 60)

    print(f"🚀 Starting {duration_minutes}-minute canary monitoring...")

    while time.time() < end_time:
        is_healthy, metrics = validate_canary_health()

        if not is_healthy:
            print(f"🔴 ROLLBACK TRIGGERED: Canary failed health validation")
            sys.exit(1)  # Exit code 1 triggers automatic rollback

        remaining_minutes = (end_time - time.time()) / 60
        print(f"⏳ {remaining_minutes:.1f} minutes remaining...\n")
        time.sleep(check_interval_seconds)

    print(f"✅ Canary monitoring complete: {duration_minutes} minutes passed successfully")
    sys.exit(0)

if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument("--duration", type=int, default=15, help="Monitoring duration in minutes")
    parser.add_argument("--interval", type=int, default=120, help="Check interval in seconds")
    args = parser.parse_args()

    monitor_rollout(args.duration, args.interval)

Usage in Pipeline:

- script: |
    python scripts/monitor-canary.py --duration 15 --interval 120
  displayName: 'Monitor Canary Health (15 min)'
  continueOnError: false  # Pipeline fails if script exits with code 1


Rollback Procedures

When to Rollback

| Trigger | Action | Who Decides |
|---|---|---|
| Automated threshold breach | Immediate automatic rollback | System (no human approval) |
| Customer-reported critical issue | Manual rollback via runbook | On-call SRE |
| Silent data corruption detected | Emergency rollback + incident | SRE + Security Team |
| Compliance violation | Immediate rollback + audit | Compliance Officer + SRE |
| Performance degradation (SLO miss) | Rollback if not resolved in 10 min | SRE |

Rollback Decision Tree

graph TD
    A[Issue Detected] --> B{Severity?}
    B -->|SEV1: Data Loss/Corruption| C[IMMEDIATE ROLLBACK]
    B -->|SEV2: Service Degradation| D{SLO Impact?}
    B -->|SEV3: Minor Bug| E{User-Facing?}

    D -->|> 1% error rate| C
    D -->|< 1% error rate| F[Monitor 10 min]

    E -->|Yes| G[Rollback + Hotfix]
    E -->|No| H[File Bug, Fix in Next Release]

    F -->|Improves| I[Continue Rollout]
    F -->|Worsens| C

    C --> J[Execute Rollback]
    J --> K[Verify Health]
    K --> L[RCA + Incident Report]

Automatic Rollback Triggers

These conditions trigger rollback WITHOUT human approval:

  1. Error Rate: Canary error rate > 1% for 2 consecutive minutes
  2. Latency Spike: P95 latency > 1000ms for 5 consecutive minutes
  3. Exception Storm: > 10 exceptions in 15-minute window
  4. Health Check Failure: Any health check endpoint returns non-200 status
  5. Database Connection Pool Exhaustion: Connection pool > 90% for 3 minutes
  6. Message Queue Backlog: Event queue depth > 10,000 for 5 minutes
  7. Memory Leak Detected: Container memory > 90% for 3 minutes
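Rules like (1) and (2) fire only after consecutive breaches, which requires a little state between checks. One way to sketch such an evaluator (the class is illustrative; the threshold values match the list above):

```python
class ConsecutiveBreachTrigger:
    """Fires once a metric exceeds its threshold for N consecutive samples,
    e.g. error rate > 1% for 2 consecutive one-minute checks."""

    def __init__(self, threshold: float, required_consecutive: int):
        self.threshold = threshold
        self.required = required_consecutive
        self.streak = 0

    def observe(self, value: float) -> bool:
        """Record one sample; return True if the rollback should fire."""
        if value > self.threshold:
            self.streak += 1
        else:
            self.streak = 0   # any healthy sample resets the streak
        return self.streak >= self.required
```

With `ConsecutiveBreachTrigger(1.0, 2)` driving rule (1), a single one-minute error-rate spike does not trigger a rollback, but two in a row does.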

Alert Channels:

  • PagerDuty: On-call SRE
  • Slack: #atp-incidents
  • Email: Platform team distribution list
  • Azure Monitor: Alert rule with action group

Manual Rollback Commands

Canary Rollback (Kubernetes)

#!/bin/bash
# rollback-canary.sh - Manual canary rollback

set -e

echo "🔴 Initiating canary rollback..."

# 1. Revert traffic to stable only
echo "Step 1: Reverting traffic to stable deployment..."
kubectl apply -f k8s/canary/stable-only-routing.yaml

# 2. Wait for traffic drain
echo "Step 2: Waiting 30 seconds for traffic to drain from canary..."
sleep 30

# 3. Delete canary deployment
echo "Step 3: Deleting canary deployment..."
kubectl delete deployment atp-ingestion-canary -n atp-prod --ignore-not-found=true

# 4. Verify stable deployment health
echo "Step 4: Verifying stable deployment health..."
kubectl rollout status deployment/atp-ingestion -n atp-prod --timeout=2m

# 5. Check metrics
echo "Step 5: Validating post-rollback metrics..."
ERROR_RATE=$(az monitor app-insights metrics show \
  --app atp-prod-appinsights \
  --metric requests/failed \
  --aggregation avg \
  --interval 5m \
  --offset 5m \
  --query value -o tsv)

# Note: requests/failed is a count metric, not a percentage; tune the threshold to traffic volume
if (( $(echo "$ERROR_RATE < 0.01" | bc -l) )); then
  echo "✅ Rollback successful. Failed-request metric: ${ERROR_RATE}"
else
  echo "⚠️  Failed-request metric still elevated: ${ERROR_RATE}"
fi

# 6. Notify team
curl -X POST $SLACK_WEBHOOK_URL \
  -H 'Content-Type: application/json' \
  -d "{\"text\":\"🔴 Production canary rollback completed. Build: $BUILD_NUMBER\"}"

echo "✅ Canary rollback complete. Investigate root cause before next deployment."

Blue-Green Rollback (Azure App Service)

#!/bin/bash
# rollback-blue-green.sh - Manual slot swap rollback

set -e

echo "🔴 Initiating blue-green rollback (slot swap)..."

# Swap back to previous slot
az webapp deployment slot swap \
  --resource-group rg-atp-staging \
  --name atp-ingestion-staging \
  --slot green \
  --target-slot blue

echo "⏳ Waiting for swap to complete..."
sleep 15

# Verify production slot is serving traffic
HEALTH_STATUS=$(curl -s -o /dev/null -w "%{http_code}" https://atp-ingestion-staging.azurewebsites.net/health)

if [ "$HEALTH_STATUS" -eq "200" ]; then
  echo "✅ Rollback successful. Production slot is healthy."
else
  echo "❌ ERROR: Production slot health check failed (HTTP $HEALTH_STATUS)"
  exit 1
fi

# Notify team
curl -X POST $SLACK_WEBHOOK_URL \
  -H 'Content-Type: application/json' \
  -d "{\"text\":\"🔴 Staging blue-green rollback completed.\"}"

Rolling Update Rollback (Kubernetes)

#!/bin/bash
# rollback-rolling-update.sh - Rollback Kubernetes rolling update

set -e

echo "🔴 Initiating rolling update rollback..."

# Undo last rollout
kubectl rollout undo deployment/atp-ingestion -n atp-prod

# Wait for rollout to complete
echo "⏳ Waiting for rollout to complete..."
kubectl rollout status deployment/atp-ingestion -n atp-prod --timeout=5m

# Verify pod health
READY_PODS=$(kubectl get deployment atp-ingestion -n atp-prod -o jsonpath='{.status.readyReplicas}')
DESIRED_PODS=$(kubectl get deployment atp-ingestion -n atp-prod -o jsonpath='{.spec.replicas}')

if [ "$READY_PODS" -eq "$DESIRED_PODS" ]; then
  echo "✅ Rollback successful. $READY_PODS/$DESIRED_PODS pods ready."
else
  echo "⚠️  Only $READY_PODS/$DESIRED_PODS pods ready."
fi

# Check current image version
CURRENT_IMAGE=$(kubectl get deployment atp-ingestion -n atp-prod -o jsonpath='{.spec.template.spec.containers[0].image}')
echo "📦 Current image: $CURRENT_IMAGE"

Environment-Specific Rollout Strategies

| Environment | Primary Strategy | Approval Required | Monitoring Duration | Rollback Policy |
|---|---|---|---|---|
| Dev | Direct deploy | No | None | Manual only |
| Test | Rolling update | No | 5 minutes | Automatic on failure |
| Staging | Blue-green | 1 approver | 30 minutes | Automatic on health check failure |
| Production | Canary (5/20/50/100) | 2 approvers + CAB | 15 min per increment | Automatic on metrics breach |
| Hotfix | Direct + monitoring | 2 approvers (expedited) | 30 minutes | Automatic on any error |

Runbook Integration

Pre-Deployment Checklist

Before initiating any production rollout:

  • Change Request approved by CAB (except hotfix)
  • Rollback plan documented and tested in staging
  • On-call SRE notified and available
  • Monitoring dashboards open (Application Insights, Grafana)
  • Communication channel active (#atp-deployments in Slack)
  • Feature flags configured for gradual activation (if applicable)
  • Database migrations tested and reversible
  • Load tests passed in staging (for major changes)
  • Security scan passed (no critical vulnerabilities)
  • Deployment window confirmed (avoid Friday evenings!)

During Deployment

  • Monitor error rate in real-time (refresh every 30 seconds)
  • Watch exception logs in Application Insights
  • Check queue depths for backlog buildup
  • Validate health endpoints return 200 OK
  • Communicate status in #atp-deployments at each increment
  • Pause if uncertain — better safe than sorry

Post-Deployment

  • Verify metrics returned to baseline
  • Smoke tests passed in production
  • Synthetic monitors passing (Pingdom, Azure Monitor)
  • Update runbook if new issues discovered
  • Document lessons learned in deployment log
  • Decommission canary resources (if applicable)

Metrics & SLO Validation

Key Metrics to Monitor

| Metric | Target (SLO) | Warning Threshold | Critical Threshold |
|---|---|---|---|
| Availability | 99.9% | 99.5% | 99.0% |
| Error Rate | < 0.1% | > 0.5% | > 1.0% |
| P95 Latency | < 500ms | > 800ms | > 1000ms |
| P99 Latency | < 1000ms | > 1500ms | > 2000ms |
| Event Ingestion Throughput | 10K events/sec | 8K events/sec | 5K events/sec |
| Queue Processing Lag | < 1 minute | > 2 minutes | > 5 minutes |
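The warning/critical bands in the table above can be encoded as a small classifier. Note that throughput-style metrics invert the comparison, since falling below the threshold is the problem. A sketch (the `classify` helper is illustrative):

```python
def classify(value: float, warning: float, critical: float,
             higher_is_worse: bool = True) -> str:
    """Classify a metric sample against its warning/critical thresholds.

    For throughput-style metrics pass higher_is_worse=False, because
    dropping BELOW the threshold is what indicates trouble.
    """
    breaches_critical = value > critical if higher_is_worse else value < critical
    breaches_warning = value > warning if higher_is_worse else value < warning
    if breaches_critical:
        return "critical"
    if breaches_warning:
        return "warning"
    return "ok"
```

For example, an error rate of 0.7% classifies as `"warning"`, while ingestion throughput of 4K events/sec (with `higher_is_worse=False`) classifies as `"critical"`.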

SLO Dashboard

Application Insights Query (Error Budget):

// Calculate error budget remaining for current month
let startOfMonth = startofmonth(now());
let totalRequests = toscalar(requests
    | where timestamp >= startOfMonth
    | count);
let failedRequests = toscalar(requests
    | where timestamp >= startOfMonth
    | where success == false
    | count);
let actualAvailability = 100.0 - ((failedRequests * 100.0) / totalRequests);
let targetAvailability = 99.9;
let errorBudget = 100.0 - targetAvailability;  // 0.1%
let errorBudgetUsed = 100.0 - actualAvailability;
let errorBudgetRemaining = errorBudget - errorBudgetUsed;
print 
    ActualAvailability = actualAvailability,
    TargetAvailability = targetAvailability,
    ErrorBudgetTotal = errorBudget,
    ErrorBudgetUsed = errorBudgetUsed,
    ErrorBudgetRemaining = errorBudgetRemaining,
    ErrorBudgetRemainingPercent = (errorBudgetRemaining / errorBudget) * 100.0

Interpretation:

  • > 50% error budget remaining: Safe to deploy
  • 20-50% remaining: Deploy with caution, short canary soaks
  • < 20% remaining: Defer non-critical deployments
  • < 0% (budget exhausted): Freeze deployments, focus on stability
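The same error-budget arithmetic as the query above, as a standalone function a pipeline gate could reuse (the helper itself is illustrative; inputs are month-to-date request totals):

```python
def error_budget_remaining_percent(total: int, failed: int,
                                   target_availability: float = 99.9) -> float:
    """Return the share of the monthly error budget still unspent, in percent.

    100.0 means the budget is untouched; a negative value means the
    budget is exhausted and deployments should freeze.
    """
    actual_availability = 100.0 - (failed * 100.0 / total)
    budget = 100.0 - target_availability      # e.g. 0.1 for a 99.9% SLO
    used = 100.0 - actual_availability
    return (budget - used) / budget * 100.0
```

With 1M requests and 500 failures against a 99.9% SLO, the function reports 50% of the budget remaining, which falls in the "deploy with caution" band above.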


Communication & Escalation

Deployment Communication Template

Pre-Deployment Announcement:

📢 **Production Deployment Starting**

**Service:** Audit Trail Platform - Ingestion API
**Version:** v1.2.4 → v1.2.5
**Change:** New tamper-proof algorithm (SHA-512 + Merkle tree)
**Strategy:** Canary deployment (5% → 20% → 50% → 100%)
**Duration:** ~90 minutes (15 min per increment)
**Rollback Plan:** Automatic on error rate > 1%
**On-Call SRE:** @alice (primary), @bob (secondary)
**Monitoring:** https://portal.azure.com/#blade/AppInsights/atp-prod

**Status:** Starting canary at 5% traffic...

Mid-Deployment Update:

✅ **Canary Update: 20% Traffic**

**Metrics (last 15 min):**
- Error Rate: 0.08% ✅
- P95 Latency: 487ms ✅
- Exceptions: 2 ✅
- Health Checks: Passing ✅

**Next:** Increasing to 50% traffic. ETA: 2:30 PM EST

Rollback Announcement:

🔴 **ROLLBACK IN PROGRESS**

**Reason:** Canary error rate exceeded 1% (actual: 1.8%)
**Action:** Reverting all traffic to stable v1.2.4
**Impact:** No customer impact (95% on stable, canary isolated)
**ETA:** Rollback complete in 2 minutes
**Incident:** INC-2025-001 created
**Next Steps:** RCA meeting scheduled for 4:00 PM EST

Escalation Path

  1. Automated Alert → PagerDuty → On-call SRE (immediate)
  2. On-call SRE → Deployment lead (within 5 minutes)
  3. Deployment lead → Engineering manager (if rollback fails)
  4. Engineering manager → VP Engineering (if production down > 30 minutes)
  5. VP Engineering → CTO (if SEV1 incident > 1 hour)

Advanced Rollout Patterns

Progressive Feature Rollout (Combined Strategy)

Scenario: Deploy new feature with both infrastructure and feature flag rollout.

Approach:

  1. Week 1: Deploy code with the feature flag OFF to all environments (canary deployment)
  2. Week 2: Enable the feature for 10% of users in production (feature flag)
  3. Week 3: Increase to 50% if metrics are healthy
  4. Week 4: Enable for 100% (full activation)

Benefits:

  • Code is deployed separately from activation (decouples risk)
  • Instant disable via feature flag (no redeployment)
  • Gradual user exposure (minimizes blast radius)

Multi-Region Progressive Rollout

Scenario: Deploy to multiple Azure regions with staggered rollout.

Approach:

  1. Region 1 (East US): Canary deployment (5% → 20% → 50% → 100%)
  2. Wait 24 hours, monitor metrics
  3. Region 2 (West Europe): Canary deployment
  4. Wait 24 hours, monitor metrics
  5. Region 3 (Southeast Asia): Canary deployment

Benefits:

  • Contains blast radius to a single region
  • Leverages time zones for round-the-clock monitoring
  • Learn from each region before proceeding

Database Schema Migration Rollout

Scenario: Deploy backward-compatible database schema change.

Approach:

  1. Phase 1: Add the new column with a default value (backward compatible)
  2. Phase 2: Deploy application code that writes to both old and new columns
  3. Phase 3: Backfill data in the new column
  4. Phase 4: Deploy application code that reads from the new column
  5. Phase 5 (weeks later): Remove the old column

Benefits:

  • Zero-downtime migration
  • Instant rollback at each phase
  • Production validation before committing to the schema change
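Phase 2's dual-write step is the heart of the pattern: the legacy column stays authoritative while the new one fills in parallel. A minimal sketch with made-up column names (`status` and `status_v2` are illustrative, not ATP's schema):

```python
# Phase toggles; in practice these would come from config or feature flags.
WRITE_NEW_COLUMN = True    # Phase 2+: mirror every write into the new column
READ_NEW_COLUMN = False    # Phase 4+: flip reads once the backfill is verified

def write_event(record: dict, value: str) -> dict:
    """Write to the legacy column; in Phase 2+ also shadow-write the new
    column, so the backfill only has to cover pre-existing rows."""
    record["status"] = value              # legacy column, still authoritative
    if WRITE_NEW_COLUMN:
        record["status_v2"] = value       # new column, shadow write
    return record

def read_status(record: dict) -> str:
    """Read from whichever column the current phase designates."""
    return record["status_v2"] if READ_NEW_COLUMN else record["status"]
```

Because each toggle flips independently, rolling back Phase 4 is just setting `READ_NEW_COLUMN` back to `False`; no data is lost because writes never stopped going to the legacy column.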


Troubleshooting Common Rollout Issues

Issue: Canary Health Checks Fail Immediately

Symptoms:

  • Canary pods fail the readiness probe
  • Error: Readiness probe failed: HTTP probe failed with statuscode: 503

Diagnosis:

# Check pod logs
kubectl logs -n atp-prod -l app=atp-ingestion,version=canary --tail=100

# Check health endpoint directly
kubectl exec -n atp-prod atp-ingestion-canary-xyz -- curl localhost:8080/health

Common Causes:

  1. Database connection string incorrect (check Key Vault reference)
  2. Service Bus namespace unreachable (check NSG rules)
  3. Application startup timeout too short (increase initialDelaySeconds)
  4. Missing environment variable (check ConfigMap)

Resolution:

# Fix and redeploy
kubectl delete pod -n atp-prod -l app=atp-ingestion,version=canary
kubectl rollout restart deployment/atp-ingestion-canary -n atp-prod


Issue: Blue-Green Swap Fails with Timeout

Symptoms:

  • az webapp deployment slot swap hangs for > 5 minutes
  • Error: The operation has timed out

Diagnosis:

# Check slot status
az webapp show \
  --resource-group rg-atp-staging \
  --name atp-ingestion-staging \
  --slot green \
  --query state

# Check recent logs
az webapp log tail \
  --resource-group rg-atp-staging \
  --name atp-ingestion-staging \
  --slot green

Common Causes:

  1. Application initialization is slow (app pool startup)
  2. Warm-up requests configured but failing
  3. Health check endpoint not responding during the swap

Resolution:

# Cancel swap (if stuck)
az webapp deployment slot swap \
  --resource-group rg-atp-staging \
  --name atp-ingestion-staging \
  --slot green \
  --target-slot blue \
  --action cancel

# Increase swap timeout (app setting)
az webapp config appsettings set \
  --resource-group rg-atp-staging \
  --name atp-ingestion-staging \
  --slot green \
  --settings WEBSITE_SWAP_WARMUP_PING_PATH=/health WEBSITE_SWAP_WARMUP_PING_STATUSES=200


Issue: Rolling Update Stuck at "Waiting for deployment to finish"

Symptoms:

  • kubectl rollout status hangs
  • Some pods in CrashLoopBackOff or ImagePullBackOff

Diagnosis:

# Check rollout status
kubectl rollout status deployment/atp-ingestion -n atp-prod

# Check pod status
kubectl get pods -n atp-prod -l app=atp-ingestion

# Describe problematic pod
kubectl describe pod atp-ingestion-xyz -n atp-prod

Common Causes:

  1. New image doesn't exist in the container registry
  2. Image pull secret expired or missing
  3. Pod resource limits too low (OOMKilled)
  4. Liveness probe killing healthy pods

Resolution:

# Pause rollout to investigate
kubectl rollout pause deployment/atp-ingestion -n atp-prod

# Fix issue (e.g., increase memory limit)
kubectl set resources deployment/atp-ingestion -n atp-prod \
  --limits=memory=2Gi --requests=memory=1Gi

# Resume rollout
kubectl rollout resume deployment/atp-ingestion -n atp-prod

# OR rollback if unfixable
kubectl rollout undo deployment/atp-ingestion -n atp-prod


Changelog

| Date | Version | Changes | Author |
|---|---|---|---|
| 2025-10-30 | 1.0.0 | Initial progressive rollout guide | Platform Team |

Questions or feedback? Contact the Platform Engineering team or open an issue in Azure DevOps.