Progressive Rollout Strategies¶
Zero-downtime deployments with automated risk mitigation — ATP uses progressive rollout strategies to minimize blast radius, validate changes under real traffic, and enable instant rollback if issues are detected.
Overview¶
Progressive rollout is a deployment methodology that introduces changes incrementally to a subset of users or infrastructure, validates stability, and either continues the rollout or automatically reverts based on real-time metrics. This approach is critical for the Audit Trail Platform, where data integrity, tamper-evidence, and regulatory compliance cannot be compromised.
Why Progressive Rollouts?¶
| Challenge | Solution |
|---|---|
| Risk of breaking production | Limit blast radius to small percentage of traffic |
| Unknown behavior under load | Validate with real production traffic before full rollout |
| Slow manual rollbacks | Automated metric-based rollback in seconds |
| Customer impact from bugs | Bugs reach only canary traffic; 95%+ of tenants stay on stable |
| Compliance violations | Quick detection and reversion preserve audit integrity |
Core Principles¶
- Incremental Exposure - Deploy to 5% → 20% → 50% → 100% of traffic
- Automated Validation - Metrics-driven health checks at each increment
- Instant Rollback - Sub-minute reversion to last known good state
- Observability First - Rich telemetry during rollout phases
- Tenant Isolation - Canary failures don't cascade to stable deployments
Deployment Strategy Decision Matrix¶
Use this table to select the appropriate rollout strategy based on your deployment scenario:
| Scenario | Strategy | Traffic Pattern | Rollback Time | Use When |
|---|---|---|---|---|
| New Feature Release | Canary Deployment | 5% → 20% → 50% → 100% | < 1 min (traffic revert) | Default for Production features |
| Bug Fix (Low Risk) | Rolling Update | Pod-by-pod replacement | < 5 min (rollout undo) | Minor fixes, config changes |
| Major Version Upgrade | Blue-Green Swap | 100% instant switch | < 30 sec (slot swap) | Database migrations, API versions |
| Hotfix (Emergency) | Direct + Monitoring | 100% with validation | < 2 min (rollout undo) | Critical security patches |
| Infrastructure Change | Rolling Update | Node-by-node | < 5 min (rollout undo) | Kubernetes version, OS patches |
| Breaking API Change | Feature Flag + Canary | Gradual feature activation | Instant (flag toggle) | API deprecation, schema changes |
Deployment Strategies¶
1. Canary Deployment (Default for Production)¶
Best For: New features, behavior changes, algorithm updates
How It Works:
1. Deploy the new version alongside the stable version (both running)
2. Route a small percentage of traffic to the canary (5%)
3. Monitor metrics for 15 minutes (error rate, latency, exceptions)
4. If healthy, increase to the next increment (20% → 50% → 100%)
5. If unhealthy, automatically roll back to 0% (canary removed)
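The steps above amount to a simple promotion loop. A minimal Python sketch, where `soak()` stands in for the 15-minute metric watch and `set_canary_weight()` for the traffic-routing change (both names are hypothetical, not part of the ATP tooling):

```python
# Hypothetical sketch of the canary promotion loop described above.
# soak() would block for the 15-minute window and return False on any
# threshold breach; set_canary_weight() would adjust traffic routing.

INCREMENTS = [5, 20, 50, 100]  # percent of traffic sent to the canary

def run_canary(set_canary_weight, soak):
    """Walk the canary through each increment; roll back on failure."""
    for weight in INCREMENTS:
        set_canary_weight(weight)
        if not soak():                 # metrics breached a threshold
            set_canary_weight(0)       # revert all traffic to stable
            return "rolled-back"
    return "promoted"                  # canary now serves 100% of traffic

# Example with stubbed dependencies: every soak period is healthy.
history = []
result = run_canary(history.append, lambda: True)
print(result, history)  # promoted [5, 20, 50, 100]
```

Note that a failed soak sends the canary straight to 0%; there is no partial back-off through intermediate weights.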
Canary Traffic Increments¶
```
Stable (v1.2.3):  100% → 95% → 80% → 50% →   0%  (decommissioned)
Canary (v1.2.4):    0% →  5% → 20% → 50% → 100%  (promoted to stable)
```
Canary Validation Metrics¶
At each increment, the deployment pauses for 15 minutes while monitoring:
| Metric | Threshold | Action if Exceeded |
|---|---|---|
| Error Rate | < 1% (0.01) | Automatic rollback |
| P95 Latency | < 1000ms | Automatic rollback |
| Exception Count | < 10 in 15 min | Automatic rollback |
| Health Check Failures | 0 failures | Automatic rollback |
| Custom Business Metrics | Service-specific | Manual review required |
Example Canary Flow:
graph TD
A[Deploy Canary v1.2.4] --> B[Route 5% Traffic]
B --> C{Monitor 15 min}
C -->|Metrics OK| D[Increase to 20%]
C -->|Metrics BAD| Z[Automatic Rollback]
D --> E{Monitor 15 min}
E -->|Metrics OK| F[Increase to 50%]
E -->|Metrics BAD| Z
F --> G{Monitor 15 min}
G -->|Metrics OK| H[Promote to 100%]
G -->|Metrics BAD| Z
H --> I[Decommission Stable v1.2.3]
Z --> J[Alert On-Call SRE]
Z --> K[Create Incident Ticket]
Canary Rollback (Automatic)¶
Trigger: Metrics exceed thresholds during soak period
Action: Immediate traffic reversion to 0% canary, 100% stable
Command (Kubernetes):
# Revert traffic to stable only
kubectl apply -f k8s/canary/stable-only-routing.yaml
# Wait for traffic drain (30 seconds)
sleep 30
# Delete canary deployment
kubectl delete deployment atp-ingestion-canary -n atp-prod
# Verify stable deployment health
kubectl rollout status deployment/atp-ingestion -n atp-prod
RTO: < 1 minute (traffic routing change)
2. Blue-Green Deployment (Staging/Major Versions)¶
Best For: Full environment validation, database migrations, API version changes
How It Works:
1. Deploy the new version to the "green" slot (inactive)
2. Run smoke tests and validation scripts against green
3. Instantly swap traffic from "blue" (active) to "green"
4. Monitor green under 100% traffic
5. If issues are detected, instantly swap back to blue
Blue-Green Slots¶
Blue Slot (v1.2.3) [ACTIVE - 100% traffic]
Green Slot (v1.2.4) [IDLE - validation only]
↓ Swap ↓
Blue Slot (v1.2.3) [IDLE - ready for rollback]
Green Slot (v1.2.4) [ACTIVE - 100% traffic]
Validation Before Swap¶
Before swapping traffic, validate the green slot:
# 1. Health check endpoints
curl https://atp-staging-green.azurewebsites.net/health
curl https://atp-staging-green.azurewebsites.net/ready
# 2. Smoke tests
pytest tests/smoke/ --env=green --tenant=test-tenant-001
# 3. Database connectivity
psql -h atp-staging-green-db.postgres.database.azure.com -U admin -c "SELECT 1"
# 4. Service Bus connectivity
az servicebus queue show --namespace-name atp-staging-green-sb --name audit-events
# 5. Key Vault access
az keyvault secret show --vault-name atp-staging-green-kv --name TenantDbConnString
Only swap if all validations pass.
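The "all validations pass" gate can be automated so the swap never runs after a partial failure. A hedged sketch (the command list mirrors the checks above; the function name is illustrative):

```python
import subprocess

# Hypothetical gate: run each green-slot validation command and only
# proceed to the slot swap when every one exits with status 0.
VALIDATIONS = [
    ["curl", "-fsS", "https://atp-staging-green.azurewebsites.net/health"],
    ["pytest", "tests/smoke/", "--env=green", "--tenant=test-tenant-001"],
]

def all_validations_pass(commands, run=subprocess.run):
    """Return True only if every validation command exits with status 0."""
    for cmd in commands:
        if run(cmd, capture_output=True).returncode != 0:
            print(f"❌ Validation failed: {' '.join(cmd)}")
            return False
    return True

# The swap itself would only run inside this guard, e.g.:
# if all_validations_pass(VALIDATIONS):
#     subprocess.run(["az", "webapp", "deployment", "slot", "swap", ...])
```

Injecting `run` keeps the gate testable without actually hitting the green slot.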
Blue-Green Swap (Azure App Service)¶
# Swap staging slot to production
az webapp deployment slot swap \
--resource-group rg-atp-staging \
--name atp-ingestion-staging \
--slot green \
--target-slot blue
# Verify swap completed
az webapp deployment slot list \
--resource-group rg-atp-staging \
--name atp-ingestion-staging
Swap Duration: 5-30 seconds (no downtime)
Blue-Green Rollback¶
# Swap back to previous slot
az webapp deployment slot swap \
--resource-group rg-atp-staging \
--name atp-ingestion-staging \
--slot blue \
--target-slot green
RTO: < 30 seconds
3. Rolling Update (Kubernetes Infrastructure)¶
Best For: Config changes, minor fixes, Kubernetes version updates
How It Works:
1. Update the deployment manifest (image version, config)
2. Kubernetes replaces pods one by one
3. Each new pod must pass health checks before an old pod is terminated
4. This continues until all pods are updated
5. If health checks fail, the rollout pauses automatically
Rolling Update Configuration¶
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
  namespace: atp-prod
spec:
  replicas: 6
  minReadySeconds: 30      # Wait 30s after a pod is ready before the next
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # Allow 1 extra pod during rollout (7 total)
      maxUnavailable: 1    # Allow 1 pod to be unavailable (5 minimum)
  template:
    spec:
      containers:
        - name: ingestion-api
          image: atp.azurecr.io/ingestion:1.2.4
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
```

Note that `minReadySeconds` is a `spec`-level field, not part of `strategy.rollingUpdate`.
Monitor Rolling Update¶
# Apply updated manifest
kubectl apply -f k8s/atp-ingestion-deployment.yaml
# Watch rollout progress
kubectl rollout status deployment/atp-ingestion -n atp-prod --timeout=10m
# Check pod status during rollout
kubectl get pods -n atp-prod -l app=atp-ingestion -w
# View rollout history
kubectl rollout history deployment/atp-ingestion -n atp-prod
Expected Output:
Waiting for deployment "atp-ingestion" rollout to finish: 2 out of 6 new replicas have been updated...
Waiting for deployment "atp-ingestion" rollout to finish: 3 out of 6 new replicas have been updated...
Waiting for deployment "atp-ingestion" rollout to finish: 4 out of 6 new replicas have been updated...
Waiting for deployment "atp-ingestion" rollout to finish: 5 out of 6 new replicas have been updated...
Waiting for deployment "atp-ingestion" rollout to finish: 6 out of 6 new replicas have been updated...
Waiting for deployment "atp-ingestion" rollout to finish: 1 old replicas are pending termination...
deployment "atp-ingestion" successfully rolled out
Rolling Update Rollback¶
# Undo last rollout (immediate)
kubectl rollout undo deployment/atp-ingestion -n atp-prod
# Rollback to specific revision
kubectl rollout undo deployment/atp-ingestion -n atp-prod --to-revision=3
# Verify rollback
kubectl rollout status deployment/atp-ingestion -n atp-prod
RTO: < 5 minutes (depends on pod count and readiness checks)
4. Feature Flag-Based Rollout¶
Best For: A/B testing, gradual feature activation, breaking API changes
How It Works:
1. Deploy code with the feature behind a toggle (disabled by default)
2. Enable the feature for a percentage of users (10% → 25% → 50% → 100%)
3. Monitor metrics per cohort (feature-on vs feature-off)
4. If issues are detected, toggle the feature off instantly (no redeployment)
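Percentage-based cohorts need to be sticky: the same tenant should not flip between feature-on and feature-off on every request. A common technique (illustrative only; not the Azure App Configuration implementation) is to hash the user id into a 0-99 bucket per feature:

```python
import hashlib

def in_rollout(user_id: str, feature: str, percentage: int) -> bool:
    """Deterministically assign user_id to a 0-99 bucket for this feature.

    The same user always lands in the same bucket, so raising the
    percentage only ever adds users to the cohort, never swaps them.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percentage

flag = "AuditTrail.NewTamperProofAlgorithm"
print(in_rollout("tenant-beta-001", flag, 100))  # True: 100% includes everyone
print(in_rollout("tenant-beta-001", flag, 0))    # False: 0% includes no one
```

Seeding the hash with the feature name means different flags get independent cohorts, so one tenant is not always "first" for every experiment.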
Feature Flag Configuration¶
Azure App Configuration:
```json
{
  "id": "AuditTrail.NewTamperProofAlgorithm",
  "description": "Enable SHA-512 with merkle tree for tamper evidence",
  "enabled": true,
  "conditions": {
    "client_filters": [
      {
        "name": "Percentage",
        "parameters": {
          "Value": 10
        }
      },
      {
        "name": "Targeting",
        "parameters": {
          "Audience": {
            "Users": ["tenant-beta-001", "tenant-beta-002"],
            "Groups": ["beta-testers"],
            "DefaultRolloutPercentage": 10
          }
        }
      }
    ]
  }
}
```

The Percentage filter's `Value` of 10 enables the feature for 10% of requests.
Gradual Feature Rollout¶
# Phase 1: Beta testers only (named tenants)
az appconfig feature set \
--name atp-prod-appconfig \
--feature AuditTrail.NewTamperProofAlgorithm \
--filter "Targeting" \
--parameters '{"Audience":{"Users":["tenant-beta-001"]}}'
# Phase 2: 10% random sampling
az appconfig feature set \
--name atp-prod-appconfig \
--feature AuditTrail.NewTamperProofAlgorithm \
--filter "Percentage" \
--parameters '{"Value":10}'
# Phase 3: 50% rollout
az appconfig feature set \
--name atp-prod-appconfig \
--feature AuditTrail.NewTamperProofAlgorithm \
--filter "Percentage" \
--parameters '{"Value":50}'
# Phase 4: 100% (full activation)
az appconfig feature set \
--name atp-prod-appconfig \
--feature AuditTrail.NewTamperProofAlgorithm \
--enabled true
Feature Flag Emergency Disable¶
# Instant disable (no deployment required)
az appconfig feature set \
--name atp-prod-appconfig \
--feature AuditTrail.NewTamperProofAlgorithm \
--enabled false
RTO: < 10 seconds (config refresh interval)
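The sub-10-second RTO comes from the client-side configuration cache: the application re-reads flag state at most once per refresh interval, so a toggle propagates within that window. A minimal sketch of such a cache (illustrative; the real App Configuration provider libraries handle this refresh internally):

```python
import time

class FlagCache:
    """Cache feature-flag state, re-fetching after refresh_seconds.

    Worst-case propagation delay for an emergency disable equals
    refresh_seconds, which is why the RTO tracks the refresh interval.
    """
    def __init__(self, fetch, refresh_seconds=10, clock=time.monotonic):
        self._fetch = fetch            # callable returning {flag_name: bool}
        self._refresh = refresh_seconds
        self._clock = clock
        self._flags = {}
        self._fetched_at = float("-inf")

    def is_enabled(self, flag):
        now = self._clock()
        if now - self._fetched_at >= self._refresh:
            self._flags = self._fetch()    # hit the config store
            self._fetched_at = now
        return self._flags.get(flag, False)
```

Injecting `fetch` and `clock` keeps the cache testable; in production `fetch` would call the configuration store.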
Rollout Monitoring & Validation¶
Real-Time Metrics Dashboard¶
During any rollout (canary, blue-green, rolling), monitor these dashboards:
Application Insights Queries:
// Error rate comparison (canary vs stable)
requests
| where timestamp > ago(15m)
| summarize
TotalRequests = count(),
FailedRequests = countif(success == false)
by cloud_RoleName
| extend ErrorRate = (FailedRequests * 100.0) / TotalRequests
| project cloud_RoleName, ErrorRate, TotalRequests
// P95 Latency comparison
requests
| where timestamp > ago(15m)
| summarize P95 = percentile(duration, 95) by cloud_RoleName
| project cloud_RoleName, P95_ms = P95
// Exception count per deployment
exceptions
| where timestamp > ago(15m)
| summarize ExceptionCount = count() by cloud_RoleName
| order by ExceptionCount desc
Automated Health Checks¶
Health Check Script (runs every 2 minutes during rollout):
```python
#!/usr/bin/env python3
"""
Canary health monitor - runs during progressive rollouts.
Triggers automatic rollback if metrics exceed thresholds.
"""
import os
import sys
import time

import requests
from azure.monitor.query import LogsQueryClient
from azure.identity import DefaultAzureCredential

CANARY_ROLE = "atp-ingestion-canary"
STABLE_ROLE = "atp-ingestion-stable"

THRESHOLDS = {
    "error_rate_percent": 1.0,  # Max 1% error rate
    "p95_latency_ms": 1000,     # Max 1000ms p95
    "exception_count": 10,      # Max 10 exceptions in 15 min
}


def get_error_rate(role_name):
    """Query Application Insights for error rate."""
    query = f"""
    requests
    | where timestamp > ago(15m)
    | where cloud_RoleName == '{role_name}'
    | summarize
        Total = count(),
        Failed = countif(success == false)
    | extend ErrorRate = (Failed * 100.0) / Total
    | project ErrorRate
    """
    client = LogsQueryClient(DefaultAzureCredential())
    response = client.query_workspace(
        workspace_id=os.environ["LOG_ANALYTICS_WORKSPACE_ID"],
        query=query,
        timespan=None,
    )
    return response.tables[0].rows[0][0] if response.tables[0].rows else 0.0


def get_p95_latency(role_name):
    """Query Application Insights for P95 latency."""
    query = f"""
    requests
    | where timestamp > ago(15m)
    | where cloud_RoleName == '{role_name}'
    | summarize P95 = percentile(duration, 95)
    | project P95
    """
    client = LogsQueryClient(DefaultAzureCredential())
    response = client.query_workspace(
        workspace_id=os.environ["LOG_ANALYTICS_WORKSPACE_ID"],
        query=query,
        timespan=None,
    )
    return response.tables[0].rows[0][0] if response.tables[0].rows else 0.0


def get_exception_count(role_name):
    """Query Application Insights for exception count."""
    query = f"""
    exceptions
    | where timestamp > ago(15m)
    | where cloud_RoleName == '{role_name}'
    | count
    """
    client = LogsQueryClient(DefaultAzureCredential())
    response = client.query_workspace(
        workspace_id=os.environ["LOG_ANALYTICS_WORKSPACE_ID"],
        query=query,
        timespan=None,
    )
    return response.tables[0].rows[0][0] if response.tables[0].rows else 0


def check_health_endpoint(url):
    """Check HTTP health endpoint."""
    try:
        response = requests.get(url, timeout=5)
        return response.status_code == 200
    except requests.RequestException:
        return False


def validate_canary_health():
    """
    Validate canary deployment health.
    Returns (is_healthy, metrics_dict)
    """
    print(f"🔍 Validating canary health at {time.strftime('%H:%M:%S')}...")
    metrics = {
        "error_rate": get_error_rate(CANARY_ROLE),
        "p95_latency": get_p95_latency(CANARY_ROLE),
        "exception_count": get_exception_count(CANARY_ROLE),
        "health_check": check_health_endpoint("https://atp-ingestion-canary/health"),
    }

    # Check thresholds
    failures = []
    if metrics["error_rate"] > THRESHOLDS["error_rate_percent"]:
        failures.append(f"Error rate {metrics['error_rate']:.2f}% exceeds {THRESHOLDS['error_rate_percent']}%")
    if metrics["p95_latency"] > THRESHOLDS["p95_latency_ms"]:
        failures.append(f"P95 latency {metrics['p95_latency']:.0f}ms exceeds {THRESHOLDS['p95_latency_ms']}ms")
    if metrics["exception_count"] > THRESHOLDS["exception_count"]:
        failures.append(f"Exception count {metrics['exception_count']} exceeds {THRESHOLDS['exception_count']}")
    if not metrics["health_check"]:
        failures.append("Health check endpoint failed")

    is_healthy = len(failures) == 0
    if is_healthy:
        print(
            f"✅ Canary healthy: Error={metrics['error_rate']:.2f}%, "
            f"P95={metrics['p95_latency']:.0f}ms, Exceptions={metrics['exception_count']}"
        )
    else:
        print("❌ Canary unhealthy:")
        for failure in failures:
            print(f"  - {failure}")
    return is_healthy, metrics


def monitor_rollout(duration_minutes, check_interval_seconds=120):
    """
    Monitor canary deployment for the specified duration.
    Exits with code 1 if unhealthy (triggers rollback in pipeline).
    """
    end_time = time.time() + (duration_minutes * 60)
    print(f"🚀 Starting {duration_minutes}-minute canary monitoring...")
    while time.time() < end_time:
        is_healthy, metrics = validate_canary_health()
        if not is_healthy:
            print("🔴 ROLLBACK TRIGGERED: Canary failed health validation")
            sys.exit(1)  # Exit code 1 triggers automatic rollback
        remaining_minutes = (end_time - time.time()) / 60
        print(f"⏳ {remaining_minutes:.1f} minutes remaining...\n")
        time.sleep(check_interval_seconds)
    print(f"✅ Canary monitoring complete: {duration_minutes} minutes passed successfully")
    sys.exit(0)


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("--duration", type=int, default=15, help="Monitoring duration in minutes")
    parser.add_argument("--interval", type=int, default=120, help="Check interval in seconds")
    args = parser.parse_args()
    monitor_rollout(args.duration, args.interval)
```
Usage in Pipeline:
```yaml
- script: |
    python scripts/monitor-canary.py --duration 15 --interval 120
  displayName: 'Monitor Canary Health (15 min)'
  continueOnError: false  # Pipeline fails if script exits with code 1
```
Rollback Procedures¶
When to Rollback¶
| Trigger | Action | Who Decides |
|---|---|---|
| Automated threshold breach | Immediate automatic rollback | System (no human approval) |
| Customer-reported critical issue | Manual rollback via runbook | On-call SRE |
| Silent data corruption detected | Emergency rollback + incident | SRE + Security Team |
| Compliance violation | Immediate rollback + audit | Compliance Officer + SRE |
| Performance degradation (SLO miss) | Rollback if not resolved in 10 min | SRE |
Rollback Decision Tree¶
graph TD
A[Issue Detected] --> B{Severity?}
B -->|SEV1: Data Loss/Corruption| C[IMMEDIATE ROLLBACK]
B -->|SEV2: Service Degradation| D{SLO Impact?}
B -->|SEV3: Minor Bug| E{User-Facing?}
D -->|> 1% error rate| C
D -->|< 1% error rate| F[Monitor 10 min]
E -->|Yes| G[Rollback + Hotfix]
E -->|No| H[File Bug, Fix in Next Release]
F -->|Improves| I[Continue Rollout]
F -->|Worsens| C
C --> J[Execute Rollback]
J --> K[Verify Health]
K --> L[RCA + Incident Report]
Automatic Rollback Triggers¶
These conditions trigger rollback WITHOUT human approval:
- Error Rate: Canary error rate > 1% for 2 consecutive minutes
- Latency Spike: P95 latency > 1000ms for 5 consecutive minutes
- Exception Storm: > 10 exceptions in 15-minute window
- Health Check Failure: Any health check endpoint returns non-200 status
- Database Connection Pool Exhaustion: Connection pool > 90% for 3 minutes
- Message Queue Backlog: Event queue depth > 10,000 for 5 minutes
- Memory Leak Detected: Container memory > 90% for 3 minutes
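Several triggers above require a sustained breach ("for 2 consecutive minutes") rather than a single bad sample, which keeps one-off spikes from causing rollback churn. A sketch of that debounce logic (names are illustrative, not ATP tooling):

```python
class SustainedBreach:
    """Fire only after `required` consecutive samples exceed the threshold.

    A single healthy sample resets the counter, so a transient spike
    never triggers a rollback on its own.
    """
    def __init__(self, threshold, required):
        self.threshold = threshold
        self.required = required
        self.streak = 0

    def observe(self, value):
        self.streak = self.streak + 1 if value > self.threshold else 0
        return self.streak >= self.required

# Error rate > 1% must hold for 2 consecutive 1-minute samples:
trigger = SustainedBreach(threshold=1.0, required=2)
print(trigger.observe(1.8))  # False: first breach, streak starts
print(trigger.observe(0.4))  # False: recovered, streak resets
print(trigger.observe(1.5))  # False: breach again, streak restarts
print(trigger.observe(1.2))  # True: two consecutive breaches → roll back
```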
Alert Channels:
- PagerDuty: On-call SRE
- Slack: #atp-incidents
- Email: Platform team distribution list
- Azure Monitor: Alert rule with action group
Manual Rollback Commands¶
Canary Rollback (Kubernetes)¶
#!/bin/bash
# rollback-canary.sh - Manual canary rollback
set -e
echo "🔴 Initiating canary rollback..."
# 1. Revert traffic to stable only
echo "Step 1: Reverting traffic to stable deployment..."
kubectl apply -f k8s/canary/stable-only-routing.yaml
# 2. Wait for traffic drain
echo "Step 2: Waiting 30 seconds for traffic to drain from canary..."
sleep 30
# 3. Delete canary deployment
echo "Step 3: Deleting canary deployment..."
kubectl delete deployment atp-ingestion-canary -n atp-prod --ignore-not-found=true
# 4. Verify stable deployment health
echo "Step 4: Verifying stable deployment health..."
kubectl rollout status deployment/atp-ingestion -n atp-prod --timeout=2m
# 5. Check metrics
echo "Step 5: Validating post-rollback metrics..."
ERROR_RATE=$(az monitor app-insights metrics show \
--app atp-prod-appinsights \
--metric requests/failed \
--aggregation avg \
--interval 5m \
--offset 5m \
--query value -o tsv)
if (( $(echo "$ERROR_RATE < 0.01" | bc -l) )); then
echo "✅ Rollback successful. Error rate: ${ERROR_RATE}%"
else
echo "⚠️ Error rate still elevated: ${ERROR_RATE}%"
fi
# 6. Notify team
curl -X POST $SLACK_WEBHOOK_URL \
-H 'Content-Type: application/json' \
-d "{\"text\":\"🔴 Production canary rollback completed. Build: $BUILD_NUMBER\"}"
echo "✅ Canary rollback complete. Investigate root cause before next deployment."
Blue-Green Rollback (Azure App Service)¶
#!/bin/bash
# rollback-blue-green.sh - Manual slot swap rollback
set -e
echo "🔴 Initiating blue-green rollback (slot swap)..."
# Swap back to previous slot
az webapp deployment slot swap \
--resource-group rg-atp-staging \
--name atp-ingestion-staging \
--slot green \
--target-slot blue
echo "⏳ Waiting for swap to complete..."
sleep 15
# Verify production slot is serving traffic
HEALTH_STATUS=$(curl -s -o /dev/null -w "%{http_code}" https://atp-ingestion-staging.azurewebsites.net/health)
if [ "$HEALTH_STATUS" -eq "200" ]; then
echo "✅ Rollback successful. Production slot is healthy."
else
echo "❌ ERROR: Production slot health check failed (HTTP $HEALTH_STATUS)"
exit 1
fi
# Notify team
curl -X POST $SLACK_WEBHOOK_URL \
-H 'Content-Type: application/json' \
-d "{\"text\":\"🔴 Staging blue-green rollback completed.\"}"
Rolling Update Rollback (Kubernetes)¶
#!/bin/bash
# rollback-rolling-update.sh - Rollback Kubernetes rolling update
set -e
echo "🔴 Initiating rolling update rollback..."
# Undo last rollout
kubectl rollout undo deployment/atp-ingestion -n atp-prod
# Wait for rollout to complete
echo "⏳ Waiting for rollout to complete..."
kubectl rollout status deployment/atp-ingestion -n atp-prod --timeout=5m
# Verify pod health
READY_PODS=$(kubectl get deployment atp-ingestion -n atp-prod -o jsonpath='{.status.readyReplicas}')
DESIRED_PODS=$(kubectl get deployment atp-ingestion -n atp-prod -o jsonpath='{.spec.replicas}')
if [ "$READY_PODS" -eq "$DESIRED_PODS" ]; then
echo "✅ Rollback successful. $READY_PODS/$DESIRED_PODS pods ready."
else
echo "⚠️ Only $READY_PODS/$DESIRED_PODS pods ready."
fi
# Check current image version
CURRENT_IMAGE=$(kubectl get deployment atp-ingestion -n atp-prod -o jsonpath='{.spec.template.spec.containers[0].image}')
echo "📦 Current image: $CURRENT_IMAGE"
Environment-Specific Rollout Strategies¶
| Environment | Primary Strategy | Approval Required | Monitoring Duration | Rollback Policy |
|---|---|---|---|---|
| Dev | Direct deploy | No | None | Manual only |
| Test | Rolling update | No | 5 minutes | Automatic on failure |
| Staging | Blue-green | 1 approver | 30 minutes | Automatic on health check failure |
| Production | Canary (5/20/50/100) | 2 approvers + CAB | 15 min per increment | Automatic on metrics breach |
| Hotfix | Direct + monitoring | 2 approvers (expedited) | 30 minutes | Automatic on any error |
Runbook Integration¶
Pre-Deployment Checklist¶
Before initiating any production rollout:
- Change Request approved by CAB (except hotfix)
- Rollback plan documented and tested in staging
- On-call SRE notified and available
- Monitoring dashboards open (Application Insights, Grafana)
- Communication channel active (#atp-deployments in Slack)
- Feature flags configured for gradual activation (if applicable)
- Database migrations tested and reversible
- Load tests passed in staging (for major changes)
- Security scan passed (no critical vulnerabilities)
- Deployment window confirmed (avoid Friday evenings!)
During Deployment¶
- Monitor error rate in real-time (refresh every 30 seconds)
- Watch exception logs in Application Insights
- Check queue depths for backlog buildup
- Validate health endpoints return 200 OK
- Communicate status in #atp-deployments at each increment
- Pause if uncertain; better safe than sorry
Post-Deployment¶
- Verify metrics returned to baseline
- Smoke tests passed in production
- Synthetic monitors passing (Pingdom, Azure Monitor)
- Update runbook if new issues discovered
- Document lessons learned in deployment log
- Decommission canary resources (if applicable)
Metrics & SLO Validation¶
Key Metrics to Monitor¶
| Metric | Target (SLO) | Warning Threshold | Critical Threshold |
|---|---|---|---|
| Availability | 99.9% | 99.5% | 99.0% |
| Error Rate | < 0.1% | > 0.5% | > 1.0% |
| P95 Latency | < 500ms | > 800ms | > 1000ms |
| P99 Latency | < 1000ms | > 1500ms | > 2000ms |
| Event Ingestion Throughput | 10K events/sec | 8K events/sec | 5K events/sec |
| Queue Processing Lag | < 1 minute | > 2 minutes | > 5 minutes |
SLO Dashboard¶
Application Insights Query (Error Budget):
```kusto
// Calculate error budget remaining for current month
let startOfMonth = startofmonth(now());
let totalRequests = toscalar(
    requests
    | where timestamp >= startOfMonth
    | count);
let failedRequests = toscalar(
    requests
    | where timestamp >= startOfMonth
    | where success == false
    | count);
let actualAvailability = 100.0 - ((failedRequests * 100.0) / totalRequests);
let targetAvailability = 99.9;
let errorBudget = 100.0 - targetAvailability; // 0.1%
let errorBudgetUsed = 100.0 - actualAvailability;
let errorBudgetRemaining = errorBudget - errorBudgetUsed;
print
    ActualAvailability = actualAvailability,
    TargetAvailability = targetAvailability,
    ErrorBudgetTotal = errorBudget,
    ErrorBudgetUsed = errorBudgetUsed,
    ErrorBudgetRemaining = errorBudgetRemaining,
    ErrorBudgetRemainingPercent = (errorBudgetRemaining / errorBudget) * 100.0
```

Note the `toscalar()` wrappers: a bare `let x = requests | ... | count;` binds a tabular expression, not a number, so the arithmetic below would fail without them.
Interpretation:
- Above 50% of the error budget remaining: safe to deploy
- 20-50% remaining: deploy with caution, short canary soaks
- Below 20% remaining: defer non-critical deployments
- Budget exhausted (below 0%): freeze deployments, focus on stability
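The same arithmetic as the query, as a small worked example mapped to the deploy/defer decision (thresholds taken from the interpretation above):

```python
def error_budget_status(total_requests, failed_requests, slo=99.9):
    """Mirror the KQL error-budget math and map it to a deploy decision."""
    availability = 100.0 - (failed_requests * 100.0 / total_requests)
    budget = 100.0 - slo                      # 0.1% for a 99.9% SLO
    used = 100.0 - availability
    remaining_pct = (budget - used) / budget * 100.0
    if remaining_pct > 50:
        decision = "safe to deploy"
    elif remaining_pct >= 20:
        decision = "deploy with caution"
    elif remaining_pct >= 0:
        decision = "defer non-critical deployments"
    else:
        decision = "freeze deployments"
    return remaining_pct, decision

# 10M requests this month with 4,000 failures → 99.96% availability,
# i.e. 0.04% of the 0.1% budget used, so 60% of the budget remains.
pct, decision = error_budget_status(10_000_000, 4_000)
print(f"{pct:.0f}% remaining: {decision}")  # 60% remaining: safe to deploy
```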
Communication & Escalation¶
Deployment Communication Template¶
Pre-Deployment Announcement:
📢 **Production Deployment Starting**
**Service:** Audit Trail Platform - Ingestion API
**Version:** v1.2.4 → v1.2.5
**Change:** New tamper-proof algorithm (SHA-512 + Merkle tree)
**Strategy:** Canary deployment (5% → 20% → 50% → 100%)
**Duration:** ~90 minutes (15 min per increment)
**Rollback Plan:** Automatic on error rate > 1%
**On-Call SRE:** @alice (primary), @bob (secondary)
**Monitoring:** https://portal.azure.com/#blade/AppInsights/atp-prod
**Status:** Starting canary at 5% traffic...
Mid-Deployment Update:
✅ **Canary Update: 20% Traffic**
**Metrics (last 15 min):**
- Error Rate: 0.08% ✅
- P95 Latency: 487ms ✅
- Exceptions: 2 ✅
- Health Checks: Passing ✅
**Next:** Increasing to 50% traffic. ETA: 2:30 PM EST
Rollback Announcement:
🔴 **ROLLBACK IN PROGRESS**
**Reason:** Canary error rate exceeded 1% (actual: 1.8%)
**Action:** Reverting all traffic to stable v1.2.4
**Impact:** No customer impact (95% on stable, canary isolated)
**ETA:** Rollback complete in 2 minutes
**Incident:** INC-2025-001 created
**Next Steps:** RCA meeting scheduled for 4:00 PM EST
Escalation Path¶
- Automated Alert → PagerDuty → On-call SRE (immediate)
- On-call SRE → Deployment lead (within 5 minutes)
- Deployment lead → Engineering manager (if rollback fails)
- Engineering manager → VP Engineering (if production down > 30 minutes)
- VP Engineering → CTO (if SEV1 incident > 1 hour)
Advanced Rollout Patterns¶
Progressive Feature Rollout (Combined Strategy)¶
Scenario: Deploy new feature with both infrastructure and feature flag rollout.
Approach:
1. Week 1: Deploy code with the feature flag OFF to all environments (canary deployment)
2. Week 2: Enable the feature for 10% of users in production (feature flag)
3. Week 3: Increase to 50% if metrics are healthy
4. Week 4: Enable for 100% (full activation)
Benefits:
- Code is deployed separately from activation (decoupled risk)
- Instant disable via feature flag (no redeployment)
- Gradual user exposure (minimized blast radius)
Multi-Region Progressive Rollout¶
Scenario: Deploy to multiple Azure regions with staggered rollout.
Approach:
1. Region 1 (East US): Canary deployment (5% → 20% → 50% → 100%)
2. Wait 24 hours, monitor metrics
3. Region 2 (West Europe): Canary deployment
4. Wait 24 hours, monitor metrics
5. Region 3 (Southeast Asia): Canary deployment
Benefits:
- Contain blast radius to a single region
- Leverage time zones for round-the-clock monitoring
- Learn from each region before proceeding
Database Schema Migration Rollout¶
Scenario: Deploy backward-compatible database schema change.
Approach:
1. Phase 1: Add the new column with a default value (backward compatible)
2. Phase 2: Deploy application code that writes to both old and new columns
3. Phase 3: Backfill data in the new column
4. Phase 4: Deploy application code that reads from the new column
5. Phase 5 (weeks later): Remove the old column
Benefits:
- Zero-downtime migration
- Instant rollback at each phase
- Production validation before committing to the schema change
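Phase 2's dual-write step is the part application code has to implement. A hedged sketch using an in-memory SQLite database (the table and column names here are invented for illustration, not the ATP schema):

```python
import sqlite3

# Illustrative dual-write for Phase 2: keep writing the old column
# (hash_sha256) while also populating the new one (hash_sha512), so either
# application version reads consistent data and each phase can roll back.

def save_audit_event(db, event):
    """Write the tamper-evidence hash to both the old and new columns."""
    db.execute(
        "INSERT INTO audit_events (id, payload, hash_sha256, hash_sha512) "
        "VALUES (?, ?, ?, ?)",
        (event["id"], event["payload"], event["sha256"], event["sha512"]),
    )

# In-memory demo: Phase 1 already added hash_sha512 as a nullable column.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE audit_events "
    "(id TEXT, payload TEXT, hash_sha256 TEXT, hash_sha512 TEXT)"
)
save_audit_event(db, {"id": "e1", "payload": "{}", "sha256": "abc", "sha512": "def"})
print(db.execute("SELECT hash_sha256, hash_sha512 FROM audit_events").fetchone())
# ('abc', 'def')
```

Rolling back to the old reader at any point is safe because the old column never stops being written until Phase 5.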
Troubleshooting Common Rollout Issues¶
Issue: Canary Health Checks Fail Immediately¶
Symptoms:
- Canary pods fail readiness probe
- Error: Readiness probe failed: HTTP probe failed with statuscode: 503
Diagnosis:
# Check pod logs
kubectl logs -n atp-prod -l app=atp-ingestion,version=canary --tail=100
# Check health endpoint directly
kubectl exec -n atp-prod atp-ingestion-canary-xyz -- curl localhost:8080/health
Common Causes:
1. Database connection string incorrect (check Key Vault reference)
2. Service Bus namespace unreachable (check NSG rules)
3. Application startup timeout too short (increase initialDelaySeconds)
4. Missing environment variable (check ConfigMap)
Resolution:
# Fix and redeploy
kubectl delete pod -n atp-prod -l app=atp-ingestion,version=canary
kubectl rollout restart deployment/atp-ingestion-canary -n atp-prod
Issue: Blue-Green Swap Fails with Timeout¶
Symptoms:
- az webapp deployment slot swap hangs for > 5 minutes
- Error: The operation has timed out
Diagnosis:
# Check slot status
az webapp show \
--resource-group rg-atp-staging \
--name atp-ingestion-staging \
--slot green \
--query state
# Check recent logs
az webapp log tail \
--resource-group rg-atp-staging \
--name atp-ingestion-staging \
--slot green
Common Causes:
1. Application initialization is slow (app pool startup)
2. Warm-up requests configured but failing
3. Health check endpoint not responding during the swap
Resolution:
# Cancel swap (if stuck)
az webapp deployment slot swap \
--resource-group rg-atp-staging \
--name atp-ingestion-staging \
--slot green \
--target-slot blue \
--action reset
# Increase swap timeout (app setting)
az webapp config appsettings set \
--resource-group rg-atp-staging \
--name atp-ingestion-staging \
--slot green \
--settings WEBSITE_SWAP_WARMUP_PING_PATH=/health WEBSITE_SWAP_WARMUP_PING_STATUSES=200
Issue: Rolling Update Stuck at "Waiting for deployment to finish"¶
Symptoms:
- kubectl rollout status hangs
- Some pods in CrashLoopBackOff or ImagePullBackOff
Diagnosis:
# Check rollout status
kubectl rollout status deployment/atp-ingestion -n atp-prod
# Check pod status
kubectl get pods -n atp-prod -l app=atp-ingestion
# Describe problematic pod
kubectl describe pod atp-ingestion-xyz -n atp-prod
Common Causes:
1. New image doesn't exist in the container registry
2. Image pull secret expired or missing
3. Pod resource limits too low (OOMKilled)
4. Liveness probe killing healthy pods
Resolution:
# Pause rollout to investigate
kubectl rollout pause deployment/atp-ingestion -n atp-prod
# Fix issue (e.g., increase memory limit)
kubectl set resources deployment/atp-ingestion -n atp-prod \
--limits=memory=2Gi --requests=memory=1Gi
# Resume rollout
kubectl rollout resume deployment/atp-ingestion -n atp-prod
# OR rollback if unfixable
kubectl rollout undo deployment/atp-ingestion -n atp-prod
References¶
Related Documentation¶
- CI/CD Implementation: environments.md - Full CI/CD pipeline with canary deployment YAML
- Quality Gates: quality-gates.md - Pre-deployment validation and approval workflows
- Azure Pipelines: azure-pipelines.md - Pipeline templates and build/deploy stages
- Operational Runbook: runbook.md - Incident response procedures and troubleshooting
- Health Checks: health-checks.md - Health endpoint implementation and monitoring
- Monitoring: monitoring.md - Application Insights dashboards and metrics
- Alerts & SLOs: alerts-slos.md - Alert rules and SLO definitions
- Kubernetes Infrastructure: ../infrastructure/kubernetes.md - K8s deployment configurations
External Resources¶
- Canary Deployments: Martin Fowler - Canary Release
- Blue-Green Deployments: Azure App Service Staging Slots
- Kubernetes Rolling Updates: Kubernetes Deployment Strategies
- Feature Flags: Azure App Configuration Feature Management
- SLO Best Practices: Google SRE Book - SLO
Changelog¶
| Date | Version | Changes | Author |
|---|---|---|---|
| 2025-10-30 | 1.0.0 | Initial progressive rollout guide | Platform Team |
Questions or feedback? Contact the Platform Engineering team or open an issue in Azure DevOps.