Operations Runbook - Audit Trail Platform (ATP)
Your operational command center. This runbook provides step-by-step procedures for running ATP in production, responding to incidents, troubleshooting common issues, performing routine maintenance, and handling emergencies with clear ownership, escalation paths, and compliance preservation.
📋 Documentation Generation Plan
This document will be generated in 22 cycles. Current progress:
| Cycle | Topics | Estimated Lines | Status |
|---|---|---|---|
| Cycle 1 | Runbook Overview & Organization (1-2) | ~2,500 | ⏳ Not Started |
| Cycle 2 | Service Health & Status Monitoring (3-4) | ~4,000 | ⏳ Not Started |
| Cycle 3 | Alert Response Procedures (5-6) | ~4,000 | ⏳ Not Started |
| Cycle 4 | Incident Management (7-8) | ~4,500 | ⏳ Not Started |
| Cycle 5 | Severity Classification & SLAs (9-10) | ~3,000 | ⏳ Not Started |
| Cycle 6 | Common Problems & Solutions (11-12) | ~5,000 | ⏳ Not Started |
| Cycle 7 | Service-Specific Troubleshooting (13-14) | ~5,000 | ⏳ Not Started |
| Cycle 8 | Database Operations (15-16) | ~3,500 | ⏳ Not Started |
| Cycle 9 | Messaging & Event Bus Operations (17-18) | ~4,000 | ⏳ Not Started |
| Cycle 10 | DLQ Management & Replay (19-20) | ~4,000 | ⏳ Not Started |
| Cycle 11 | Deployment Procedures (21-22) | ~3,500 | ⏳ Not Started |
| Cycle 12 | Rollback Procedures (23-24) | ~3,000 | ⏳ Not Started |
| Cycle 13 | Configuration Changes (25-26) | ~3,000 | ⏳ Not Started |
| Cycle 14 | Key Rotation & Secret Management (27-28) | ~3,500 | ⏳ Not Started |
| Cycle 15 | Scaling & Capacity Management (29-30) | ~3,500 | ⏳ Not Started |
| Cycle 16 | Performance Troubleshooting (31-32) | ~4,000 | ⏳ Not Started |
| Cycle 17 | Security Incident Response (33-34) | ~4,500 | ⏳ Not Started |
| Cycle 18 | Data Recovery & Backups (35-36) | ~3,500 | ⏳ Not Started |
| Cycle 19 | Compliance & Audit Procedures (37-38) | ~3,000 | ⏳ Not Started |
| Cycle 20 | Maintenance Windows (39-40) | ~3,000 | ⏳ Not Started |
| Cycle 21 | Emergency Procedures (41-42) | ~4,000 | ⏳ Not Started |
| Cycle 22 | Contacts, Escalation & Knowledge Base (43-44) | ~3,000 | ⏳ Not Started |
Total Estimated Lines: ~81,000
Purpose & Scope
This Operations Runbook is the authoritative operational guide for ATP, providing actionable procedures for SRE/DevOps teams to monitor health, respond to incidents, troubleshoot issues, deploy changes, manage capacity, rotate secrets, handle security events, perform maintenance, and escalate emergencies while preserving ATP's core guarantees: tamper-evidence, tenant isolation, and compliance.
Who Should Use This Runbook?
- SRE/Platform Engineers: Day-to-day operations, monitoring, scaling
- On-Call Engineers: Incident response, troubleshooting, emergency procedures
- DevOps Engineers: Deployments, rollbacks, configuration changes
- Security Team: Security incident response, key rotation, audit preservation
- Compliance Team: Data recovery, legal hold procedures, audit validation
- Engineering Management: Escalation procedures, postmortem reviews
Runbook Philosophy
- Clarity: Step-by-step procedures with copy-paste commands
- Speed: Quick reference for time-sensitive incidents
- Safety: All procedures preserve audit integrity and compliance
- Traceability: Every action logged and auditable
- Escalation: Clear ownership and escalation paths
- Learning: Postmortem integration for continuous improvement
Detailed Cycle Plan
CYCLE 1: Runbook Overview & Organization (~2,500 lines)
Topic 1: Runbook Structure & Quick Reference
What will be covered: - Runbook Organization
This runbook is organized by operational scenario:
1. Health & Monitoring
- Service status checks
- Dashboard access
- Metric interpretation
2. Incident Response
- Alert procedures
- Severity classification
- Escalation paths
3. Troubleshooting
- Common problems by service
- Diagnostic procedures
- Root cause analysis
4. Operations
- Deployment procedures
- Rollback procedures
- Configuration changes
- Key rotation
5. Maintenance
- Scheduled tasks
- Database maintenance
- Index optimization
- Cache warming
6. Emergency Procedures
- Security breaches
- Data corruption
- Multi-region failover
- Disaster recovery
7. Reference
- Contact information
- Escalation matrix
- Service topology
- Runbook updates
- Quick Reference Cards
🚨 CRITICAL INCIDENT (SEV-1)
1. Acknowledge in PagerDuty (within 5 minutes)
2. Join war room: Slack #atp-incidents
3. Check health dashboard: https://atp-ops.example.com
4. Follow incident procedure (Section 4)
5. Escalate if unresolved in 30 minutes

📊 CHECK SERVICE HEALTH
kubectl get pods -n atp-{service}-ns
curl https://api.atp.example.com/health/{service}
View dashboard: https://grafana.atp.example.com/d/service-health

🔄 ROLLBACK DEPLOYMENT
helm rollback atp -n atp-system
OR
kubectl rollout undo deployment/{service} -n atp-{service}-ns

🔍 VIEW LOGS
kubectl logs -f deployment/{service} -n atp-{service}-ns
OR
az monitor log-analytics query -w {workspace-id} --analytics-query "..."

📈 CHECK METRICS
Prometheus: https://prometheus.atp.example.com
Grafana: https://grafana.atp.example.com
Azure Monitor: https://portal.azure.com -> ATP Log Analytics

- Critical Contacts
On-Call Rotation:
- Primary: PagerDuty "ATP Primary" schedule
- Secondary: PagerDuty "ATP Secondary" schedule
- Manager On-Call: PagerDuty "ATP Manager" schedule
War Room:
- Slack: #atp-incidents (critical)
- Slack: #atp-ops (warnings, maintenance)
- Teams: ATP Operations Team
Escalation:
- L1 (Platform Engineer): 0-30 min response
- L2 (Senior SRE): 30-60 min response
- L3 (Engineering Manager): 60-120 min response
- L4 (VP Engineering): 2+ hours, major outage
Stakeholders:
- Security: security@connectsoft.com
- Compliance: compliance@connectsoft.com
- Customer Success: success@connectsoft.com
Code Examples: - Quick reference templates - Contact information structure - Emergency checklists
Diagrams: - Runbook navigation - Escalation flow - War room structure
Deliverables: - Runbook organization guide - Quick reference cards - Contact directory
Topic 2: Using This Runbook
What will be covered: - How to Navigate
- Table of contents with direct links
- Search functionality (Ctrl+F)
- Cross-references to architecture docs
- When to Use Each Section
- Alert fired → Alert Response (Section 3)
- Service degraded → Troubleshooting (Section 6-7)
- Deploying change → Deployment Procedures (Section 11)
- Security event → Security Incident Response (Section 17)
- Runbook Maintenance
- Update after every postmortem
- Quarterly review by SRE team
- Version controlled in Git
- Changes go through PR review
Code Examples: - Runbook update procedure - Change log template
Diagrams: - Runbook usage flow
Deliverables: - Navigation guide - Usage instructions - Maintenance procedures
CYCLE 2: Service Health & Status Monitoring (~4,000 lines)
Topic 3: Health Check Endpoints
What will be covered: - ATP Health Check Architecture
Every ATP service exposes 3 health endpoints:
/health/live (Liveness Probe)
- Purpose: Is the service process alive?
- K8s: Restarts pod if fails
- Checks: Minimal (service responsive)
/health/ready (Readiness Probe)
- Purpose: Is the service ready to accept traffic?
- K8s: Removes from load balancer if fails
- Checks: Dependencies (DB, cache, message bus, KMS)
/health/startup (Startup Probe)
- Purpose: Has the service finished initialization?
- K8s: Delays liveness/readiness until passes
- Checks: Configuration loaded, migrations run, caches warmed
- Health Check Endpoint Details by Service
# Gateway Service
curl https://api.atp.example.com/health/live   # → 200 OK
curl https://api.atp.example.com/health/ready  # → 200 OK or 503
# Checks:
# - [ready] Azure Key Vault reachable
# - [ready] Backend services reachable (ingestion, query, policy)
# - [ready] Rate limiter cache (Redis) connected

# Ingestion Service
curl https://ingestion.atp.internal/health/ready
# Checks:
# - [ready] Azure SQL connection pool healthy
# - [ready] Azure Service Bus connected (publish capability)
# - [ready] Azure Blob Storage reachable (WORM container)
# - [ready] Policy Service reachable (classification API)
# - [ready] Outbox relay worker running

# Query Service
curl https://query.atp.internal/health/ready
# Checks:
# - [ready] Read model database (projections) reachable
# - [ready] Query cache (Redis) connected
# - [ready] Search index reachable (if enabled)
# - [ready] Latest projection watermark within SLO lag (<10s)

# Projection Service
curl https://projection.atp.internal/health/ready
# Checks:
# - [ready] Azure Service Bus subscription active
# - [ready] Read model database writable
# - [ready] Inbox deduplication table accessible
# - [ready] Projection lag within threshold (<5s)
# - [ready] No DLQ backlog (or within threshold)

# Export Service
curl https://export.atp.internal/health/ready
# Checks:
# - [ready] Azure Blob Storage writable (export container)
# - [ready] Export job queue (Redis) connected
# - [ready] KMS signing operation test (dry-run)
# - [ready] Bandwidth budget not exceeded

# Integrity Service
curl https://integrity.atp.internal/health/ready
# Checks:
# - [ready] Azure Blob Storage WORM container reachable
# - [ready] KMS signing keys accessible
# - [ready] Hash chain state store connected
# - [ready] Merkle tree computation functional

# Policy Service
curl https://policy.atp.internal/health/ready
# Checks:
# - [ready] Policy database reachable
# - [ready] Policy cache (Redis) connected
# - [ready] Default policies loaded

- Health Check Response Format
// Healthy
{
  "status": "Healthy",
  "totalDuration": "00:00:00.1234567",
  "entries": {
    "database": {
      "status": "Healthy",
      "description": "Azure SQL connection pool: 5/10 active",
      "duration": "00:00:00.0123456"
    },
    "servicebus": {
      "status": "Healthy",
      "description": "Azure Service Bus connected",
      "duration": "00:00:00.0234567"
    },
    "cache": {
      "status": "Healthy",
      "description": "Redis cache: 1000 keys, 45% memory",
      "duration": "00:00:00.0098765"
    }
  }
}

// Degraded (service still running but impaired)
{
  "status": "Degraded",
  "totalDuration": "00:00:00.5000000",
  "entries": {
    "database": { "status": "Healthy" },
    "cache": {
      "status": "Degraded",
      "description": "Redis cache: Degraded performance, high latency (150ms avg)",
      "duration": "00:00:00.4567890"
    }
  }
}

// Unhealthy (service should be removed from load balancer)
{
  "status": "Unhealthy",
  "totalDuration": "00:00:01.0000000",
  "entries": {
    "database": {
      "status": "Unhealthy",
      "description": "Azure SQL connection failed: Timeout",
      "exception": "System.Data.SqlClient.SqlException: Timeout expired...",
      "duration": "00:00:01.0000000"
    }
  }
}

- Automated Health Monitoring
- Kubernetes liveness/readiness probes (every 10s)
- Azure Monitor health check alerts
- Grafana health dashboard
- PagerDuty integration for failures
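Automated monitors usually act on the health response body rather than on the HTTP status alone. The sketch below is one possible way to pull the failing entries out of a response shaped like the format documented above. The helper names `failing_entries` and `should_receive_traffic` are hypothetical, and the rule that only an overall `Unhealthy` status removes a pod from rotation is an assumption drawn from the status descriptions above, not a confirmed ATP behavior.

```python
import json


def failing_entries(payload: dict) -> dict:
    """Return {entry_name: status} for every entry that is not Healthy."""
    return {
        name: entry.get("status", "Unknown")
        for name, entry in payload.get("entries", {}).items()
        if entry.get("status") != "Healthy"
    }


def should_receive_traffic(payload: dict) -> bool:
    """Assumption: only an overall Unhealthy status takes the pod out of rotation."""
    return payload.get("status") != "Unhealthy"


# Sample body modeled on the "Degraded" response shown in this topic.
degraded = json.loads("""
{
  "status": "Degraded",
  "entries": {
    "database": {"status": "Healthy"},
    "cache": {"status": "Degraded", "description": "Redis cache: high latency (150ms avg)"}
  }
}
""")

print(failing_entries(degraded))
print(should_receive_traffic(degraded))
```

A dashboard or alert hook could call `failing_entries` to name the impaired dependency in the page text, which saves the responder one `curl` during triage.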
Code Examples: - Complete health check implementations (all services) - Health check aggregation dashboard - Alerting rules
Diagrams: - Health check architecture - Probe failure handling - Alert routing
Deliverables: - Health check reference - Monitoring guide - Alert configuration
Topic 4: Service Status Dashboard
What will be covered: - Grafana Dashboard Access
ATP Operations Dashboard: https://grafana.atp.example.com/d/atp-ops
Panels:
- Service Health Matrix (all services, all regions)
- Error Rate by Service
- Request Rate by Service
- P50/P95/P99 Latency by Service
- Projection Lag (timeline, actor, resource)
- DLQ Depth by Consumer
- Outbox Backlog
- Cache Hit Rates
- Database Connection Pool Usage
- Message Bus Queue Depth
- Azure Monitor Workbooks
ATP Health Overview:
- Navigate to Azure Portal → ATP Resource Group → Monitor → Workbooks
- Select "ATP Service Health"
Views:
- Live Metrics Stream (real-time)
- Application Map (service dependencies)
- Performance (query execution, dependencies)
- Failures (exceptions, failed requests)
- Logs (structured query with KQL)

- Status Page (Public)
- External status page for customers
- Incident communication
- Maintenance window announcements
Code Examples: - Grafana dashboard JSON - Azure Monitor KQL queries - Status page integration
Diagrams: - Dashboard layout - Metric flow - Status propagation
Deliverables: - Dashboard templates - Query library - Status page setup
CYCLE 3: Alert Response Procedures (~4,000 lines)
Topic 5: Alert Handling Workflow
What will be covered: - Alert Lifecycle
flowchart LR
A[Metric Breaches Threshold] --> B[Alert Fired]
B --> C[PagerDuty Page]
B --> D[Slack Notification]
B --> E[Ticket Created]
C --> F[Engineer Acknowledges]
F --> G[Investigate & Mitigate]
G --> H{Issue Resolved?}
H -->|Yes| I[Alert Auto-Resolves]
H -->|No| J[Escalate]
J --> K[L2/L3 Engagement]
K --> G
I --> L[Close Ticket]
L --> M[Postmortem]
- Alert Channels
Critical (SEV-1, SEV-2):
- PagerDuty: Immediate phone/SMS/push
- Slack: #atp-incidents (auto-created channel per incident)
- Email: atp-oncall@connectsoft.com
- Ticket: Jira (auto-created)
Warning (SEV-3):
- Slack: #atp-ops
- Email: atp-ops@connectsoft.com
- Ticket: Jira (auto-created)
Info (SEV-4):
- Slack: #atp-notifications
- No page, no email

- Alert Acknowledgement
# Acknowledge in PagerDuty
# - Click "Acknowledge" in PagerDuty app/web
# - OR via API (JSON body, so the Content-Type header is required):
curl -X PUT https://api.pagerduty.com/incidents/{id} \
  -H "Authorization: Token token=$PD_TOKEN" \
  -H "Content-Type: application/json" \
  -H "From: oncall@connectsoft.com" \
  -d '{
    "incident": {
      "type": "incident_reference",
      "status": "acknowledged"
    }
  }'
# Post to Slack
# "🔔 Acknowledged: [Alert Name]. Investigating... ETA: 15 min"

- Initial Response Checklist
Code Examples: - Alert response templates - Acknowledgement scripts - Initial triage procedures
Diagrams: - Alert lifecycle - Response timeline - Communication flow
Deliverables: - Alert response procedures - Acknowledgement guide - Triage checklists
Topic 6: Alert Types & Runbook Links
What will be covered:
- ATP Alert Catalog
| Alert | Severity | Threshold | Runbook Section | Auto-Remediation |
|-------|----------|-----------|-----------------|------------------|
| ServiceDown | SEV-1 | 0 healthy pods | 6.1 (Service Restart) | Pod restart (K8s) |
| HighErrorRate | SEV-2 | >5% errors, 5min | 6.2 (Error Investigation) | No |
| HighLatency | SEV-2 | P95 >1s, 10min | 6.3 (Performance Tuning) | No |
| DatabaseConnectionFailure | SEV-1 | All connections fail | 8.1 (Database Recovery) | Retry, then page |
| MessageBusDown | SEV-1 | Service Bus unreachable | 9.1 (Message Bus Recovery) | Retry, then page |
| ProjectionLagHigh | SEV-2 | Lag >30s, 10min | 10.1 (Projection Catchup) | Scale workers (KEDA) |
| DLQBacklog | SEV-3 | >100 msgs, 30min | 10.2 (DLQ Triage) | Alert only |
| DiskSpaceLow | SEV-2 | <10% free | 15.1 (Capacity Expansion) | No |
| CertificateExpiring | SEV-3 | <30 days | 14.1 (Certificate Renewal) | No |
| KeyRotationOverdue | SEV-2 | >90 days | 14.2 (Emergency Key Rotation) | No |
| TamperDetected | SEV-1 | Any tamper alert | 17.1 (Tamper Investigation) | Freeze, escalate |
| ComplianceViolation | SEV-1 | Retention/residency breach | 19.1 (Compliance Emergency) | Freeze, escalate |
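
The catalog above is essentially a lookup table, so alert tooling can carry it as data. The following is a minimal sketch under that assumption: `AlertRule`, `ALERT_CATALOG`, and `route_alert` are hypothetical names, only a subset of rows is copied in, and the paging rule (SEV-1 and SEV-2 page, lower severities do not) is taken from the Alert Channels list in Topic 5.

```python
from typing import NamedTuple


class AlertRule(NamedTuple):
    severity: str
    runbook_section: str
    auto_remediation: str


# Subset of the alert catalog, copied row for row.
ALERT_CATALOG = {
    "ServiceDown": AlertRule("SEV-1", "6.1", "Pod restart (K8s)"),
    "HighErrorRate": AlertRule("SEV-2", "6.2", "No"),
    "ProjectionLagHigh": AlertRule("SEV-2", "10.1", "Scale workers (KEDA)"),
    "DLQBacklog": AlertRule("SEV-3", "10.2", "Alert only"),
    "TamperDetected": AlertRule("SEV-1", "17.1", "Freeze, escalate"),
}


def route_alert(name: str) -> str:
    """Build a one-line hint for the on-call engineer."""
    rule = ALERT_CATALOG.get(name)
    if rule is None:
        return f"{name}: not in catalog, triage manually"
    pages = rule.severity in ("SEV-1", "SEV-2")  # only critical alerts page
    return (
        f"{name}: {rule.severity}, runbook {rule.runbook_section}, "
        f"page={'yes' if pages else 'no'}"
    )


print(route_alert("DLQBacklog"))
```

Keeping the catalog as data means a new alert needs one new row, and an unknown alert name falls through to manual triage instead of raising.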
- Alert Runbook Structure
Code Examples: - Alert definitions (Prometheus rules) - Runbook templates for each alert - Auto-remediation scripts
Diagrams: - Alert taxonomy - Runbook linkage
Deliverables: - Complete alert catalog - Runbook procedures for each alert - Auto-remediation playbooks
CYCLE 4: Incident Management (~4,500 lines)
Topic 7: Incident Response Framework
What will be covered: - Incident Lifecycle
1. Detection
- Alert fires
- User report
- Monitoring anomaly
2. Acknowledgement
- On-call acknowledges (within 5 min for SEV-1)
- War room created
- Incident ticket opened
3. Triage
- Determine severity
- Identify affected services/tenants
- Assess impact
- Engage additional engineers
4. Investigation
- Review logs, metrics, traces
- Identify root cause
- Document findings
5. Mitigation
- Implement fix or workaround
- Deploy patch
- Verify resolution
6. Recovery
- Restore normal operations
- Validate all services healthy
- Notify stakeholders
7. Resolution
- Close incident ticket
- Resolve PagerDuty incident
- Communicate to affected tenants
8. Postmortem
- Conduct blameless postmortem
- Document root cause
- Create action items
- Update runbook
- Incident Command Structure
Incident Commander (IC):
- Owns incident response
- Coordinates team
- Makes go/no-go decisions
- Communicates to stakeholders
Technical Lead (TL):
- Drives investigation
- Implements fixes
- Coordinates with IC
Communications Lead (CL):
- Customer notifications
- Status page updates
- Stakeholder updates
Scribe:
- Documents timeline
- Captures decisions
- Records actions taken

- War Room Protocols
War Room Creation (SEV-1, SEV-2):
1. Create dedicated Slack channel: #incident-YYYY-MM-DD-HHMM
2. Pin incident ticket link
3. Pin dashboard links
4. Set channel topic: "[SEV-X] [Service] Brief description"
5. Invite: IC, TL, CL, Scribe, relevant SMEs
War Room Updates:
- Every 15 minutes: IC posts status update
- Every action: Engineer posts what they're trying
- Every finding: Post evidence (log snippets, metrics screenshots)
- Major decisions: IC announces and logs rationale
War Room Etiquette:
- Use threads for side discussions
- Main channel for critical updates only
- No "@here" or "@channel" unless critical
- Update channel topic with current status
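The channel name and topic formats in the creation steps are mechanical enough to generate, which avoids typos under pressure. This is a sketch only: the function names are hypothetical, and formatting the timestamp in UTC is an assumption (the protocol above does not state a time zone).

```python
from datetime import datetime, timezone


def war_room_channel(opened_at: datetime) -> str:
    """Format the channel name as #incident-YYYY-MM-DD-HHMM (UTC assumed)."""
    utc = opened_at.astimezone(timezone.utc)
    return "#incident-" + utc.strftime("%Y-%m-%d-%H%M")


def war_room_topic(severity: str, service: str, summary: str) -> str:
    """Format the channel topic as "[SEV-X] [Service] Brief description"."""
    return f"[{severity}] [{service}] {summary}"


opened = datetime(2025, 10, 30, 14, 32, tzinfo=timezone.utc)
print(war_room_channel(opened))  # #incident-2025-10-30-1432
print(war_room_topic("SEV-2", "Ingestion", "High latency in US-East"))
```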
Code Examples: - Incident templates - War room scripts - Communication templates
Diagrams: - Incident lifecycle - Command structure - War room flow
Deliverables: - Incident response procedures - War room protocols - Communication templates
Topic 8: Incident Documentation
What will be covered: - Incident Ticket Structure
Incident ID: INC-2025-001234
Title: [SEV-2] Ingestion Service High Latency in US-East
Status: Investigating | Mitigated | Resolved
Severity: SEV-1 | SEV-2 | SEV-3 | SEV-4
Timeline:
- 2025-10-30 14:32 UTC: Alert fired (P95 latency >1s)
- 2025-10-30 14:34 UTC: Acknowledged by @engineer
- 2025-10-30 14:40 UTC: Root cause identified (DB connection pool exhausted)
- 2025-10-30 14:45 UTC: Mitigation deployed (increased pool size)
- 2025-10-30 14:50 UTC: Metrics normal, incident resolved
Impact:
- Affected Tenants: 15 enterprise tenants in US-East
- Duration: 18 minutes
- User Impact: Increased ingestion latency (500ms → 2s)
- Data Integrity: ✅ No data loss, all events persisted
Root Cause:
- Database connection pool exhausted (max 100 connections)
- Traffic spike from tenant "acme-corp" (batch import)
- Pool size not sized for peak load
Resolution:
- Temporary: Increased connection pool max to 200
- Permanent: Implement per-tenant rate limiting
Action Items:
- [ ] Increase default connection pool size (deploy to all regions)
- [ ] Add per-tenant ingestion rate limits
- [ ] Add connection pool usage alerts (>80%)
- [ ] Update capacity planning docs
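Duration and time-to-acknowledge can be derived from the timeline instead of typed by hand. The sketch below assumes timeline entries are stored as `YYYY-MM-DD HH:MM` strings in UTC, as in the sample ticket. `TIMELINE` and `minutes_between` are hypothetical names.

```python
from datetime import datetime

# Entries copied from the sample ticket above (UTC).
TIMELINE = [
    ("2025-10-30 14:32", "Alert fired (P95 latency >1s)"),
    ("2025-10-30 14:34", "Acknowledged"),
    ("2025-10-30 14:50", "Metrics normal, incident resolved"),
]


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two timeline timestamps."""
    fmt = "%Y-%m-%d %H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


time_to_ack = minutes_between(TIMELINE[0][0], TIMELINE[1][0])
duration = minutes_between(TIMELINE[0][0], TIMELINE[-1][0])
print(f"Time to acknowledge: {time_to_ack} min, duration: {duration} min")
```

For the sample ticket this reproduces the 18-minute duration recorded under Impact.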
- Incident Log Template
- Use Jira/ServiceNow incident template
- Auto-populate from alerts
- Link to logs, metrics, traces
- Capture all actions taken
Code Examples: - Incident ticket templates - Timeline documentation - Action item tracking
Diagrams: - Incident ticket flow - Documentation structure
Deliverables: - Incident templates - Documentation procedures - Tracking systems
CYCLE 5: Severity Classification & SLAs (~3,000 lines)
Topic 9: Severity Levels
What will be covered: - ATP Severity Classification
SEV-1 (Critical - P1)
Definition: Complete service outage or data integrity compromise
Examples:
- All ATP services down (no ingestion, no queries)
- Data corruption detected (tamper evidence failed)
- Security breach (unauthorized access)
- Multi-tenant data leakage
- Compliance violation (GDPR, HIPAA, SOC2)
Response Time: 5 minutes
Resolution Time: 4 hours
Communication: Immediate customer notification
Escalation: Immediate (Manager + VP Engineering)
Actions:
- Page primary, secondary, manager on-call
- Create war room immediately
- Engage security team (if security-related)
- Freeze all deployments
- Customer Success notifies affected tenants
---
SEV-2 (High - P2)
Definition: Significant degradation affecting multiple tenants
Examples:
- Single service degraded (high latency, errors)
- Projection lag >30s (query results stale)
- Export service down (ingestion OK)
- Certificate expiring <7 days
- Key rotation overdue
Response Time: 15 minutes
Resolution Time: 8 hours
Communication: Notify affected tenants if impact >1 hour
Escalation: After 1 hour if unresolved
Actions:
- Page primary on-call
- Create war room if >30 min
- Post updates every 30 min
---
SEV-3 (Medium - P3)
Definition: Minor degradation affecting few tenants
Examples:
- Single tenant experiencing issues
- DLQ backlog >100 messages
- Slow query performance (specific endpoint)
- Cache miss rate elevated
- Non-critical background job failing
Response Time: 1 hour
Resolution Time: 24 hours
Communication: Internal only
Escalation: After 4 hours if unresolved
Actions:
- Slack notification to #atp-ops
- Assign to engineer
- Post updates when resolved
---
SEV-4 (Low - P4)
Definition: No user impact, informational
Examples:
- Warning thresholds breached
- Capacity planning alerts
- Maintenance reminders
- Configuration drift
Response Time: Best effort
Resolution Time: 1 week
Communication: None
Escalation: None
Actions:
- Create ticket
- Prioritize in backlog
- Severity Escalation Matrix
| Time Elapsed | SEV-1 | SEV-2 | SEV-3 |
|--------------|-------|-------|-------|
| 0-30 min | Primary On-Call | Primary On-Call | - |
| 30-60 min | + Secondary On-Call | - | - |
| 60-120 min | + Manager On-Call | + Secondary On-Call | - |
| 120+ min | + VP Engineering | + Manager On-Call | + Secondary On-Call |

- Downgrade/Upgrade Criteria
- Downgrade SEV-1 → SEV-2 if impact contained, workaround in place
- Upgrade SEV-2 → SEV-1 if multiple tenants affected, data integrity risk
- Document all severity changes with rationale
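Escalation automation can be driven directly from the Severity Escalation Matrix. The sketch below encodes each matrix column as a list of (minutes elapsed, role added) steps. `ESCALATION_STEPS` and `engaged_roles` are hypothetical names, and treating the lower bound of each time band as inclusive is an assumption.

```python
# One entry per matrix column: (minutes elapsed, role added at that point).
ESCALATION_STEPS = {
    "SEV-1": [(0, "Primary On-Call"), (30, "Secondary On-Call"),
              (60, "Manager On-Call"), (120, "VP Engineering")],
    "SEV-2": [(0, "Primary On-Call"), (60, "Secondary On-Call"),
              (120, "Manager On-Call")],
    "SEV-3": [(120, "Secondary On-Call")],
}


def engaged_roles(severity: str, minutes_elapsed: int) -> list[str]:
    """Return every role that should be engaged after the elapsed time."""
    steps = ESCALATION_STEPS.get(severity, [])
    return [role for threshold, role in steps if minutes_elapsed >= threshold]


print(engaged_roles("SEV-1", 45))
print(engaged_roles("SEV-2", 45))
```

Severities without a column in the matrix (SEV-4) return an empty list, which matches "Escalation: None".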
Code Examples: - Severity classification decision tree - Escalation automation - Communication templates
Diagrams: - Severity levels - Escalation flow - Timeline requirements
Deliverables: - Severity classification guide - SLA requirements - Escalation procedures
Topic 10: SLO Monitoring & Burn Rate Alerts
What will be covered: - ATP Service Level Objectives (SLOs)
Service Availability:
- Target: 99.9% uptime (43.2 min/month downtime budget)
- Measurement: Health check success rate
- Alert: 10% error budget consumed in 1 hour (burn rate)
Ingestion Latency:
- Target: P95 <500ms, P99 <1s
- Measurement: End-to-end (API → append → event published)
- Alert: P95 >1s for 10 minutes
Query Latency:
- Target: P95 <200ms, P99 <500ms
- Measurement: API request duration
- Alert: P95 >500ms for 10 minutes
Projection Lag:
- Target: P95 <5s, P99 <10s
- Measurement: Event timestamp → projection updated
- Alert: P95 >30s for 10 minutes
Data Durability:
- Target: 99.999999999% (11 nines)
- Measurement: Audit records with valid integrity proofs
- Alert: Any integrity verification failure
Tamper Detection:
- Target: 100% detection rate
- Measurement: Hash chain verification success
- Alert: Any hash mismatch
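The availability alert above ("10% error budget consumed in 1 hour") can be restated as a burn-rate threshold. The sketch below uses the common definition of burn rate, the observed error ratio divided by the ratio the SLO allows, and assumes a 30-day window to match the 43.2 min/month budget. All function names are hypothetical.

```python
SLO_TARGET = 0.999
WINDOW_HOURS = 30 * 24  # assumed 30-day SLO window


def burn_rate(error_ratio: float, slo_target: float = SLO_TARGET) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    return error_ratio / (1.0 - slo_target)


def budget_consumed(error_ratio: float, hours: float) -> float:
    """Fraction of the window's budget spent at this error ratio for `hours`."""
    return burn_rate(error_ratio) * hours / WINDOW_HOURS


def alert_threshold_burn_rate(budget_fraction: float, hours: float) -> float:
    """Burn rate that spends `budget_fraction` of the budget in `hours`."""
    return budget_fraction * WINDOW_HOURS / hours


# 10% of a 30-day budget in 1 hour corresponds to a burn rate of about 72.
print(round(alert_threshold_burn_rate(0.10, 1), 2))
```

A burn rate of 1 spends the budget exactly over the full window, so a threshold near 72 on a 1-hour window only fires on severe, fast-moving outages.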
- Error Budget Policy
- 99.9% SLO = 0.1% error budget = 43.2 min/month
- 10% budget consumed → Freeze feature releases
- 25% budget consumed → Emergency freeze, focus on stability
- 50% budget consumed → Incident declared, all hands
- 100% budget consumed → Postmortem, process review
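The policy thresholds above translate into a small lookup from downtime to required action. This is a sketch assuming a 30-day month and downtime measured in minutes. `error_budget_minutes`, `POLICY`, and `policy_action` are hypothetical names.

```python
def error_budget_minutes(slo_target: float, days: int = 30) -> float:
    """Allowed downtime in minutes for the window (99.9% over 30 days is about 43.2)."""
    return (1.0 - slo_target) * days * 24 * 60


# Thresholds copied from the policy above, highest first.
POLICY = [
    (1.00, "Postmortem, process review"),
    (0.50, "Incident declared, all hands"),
    (0.25, "Emergency freeze, focus on stability"),
    (0.10, "Freeze feature releases"),
]


def policy_action(downtime_minutes: float, slo_target: float = 0.999) -> str:
    """Return the policy action for the budget fraction consumed so far."""
    consumed = downtime_minutes / error_budget_minutes(slo_target)
    for threshold, action in POLICY:
        if consumed >= threshold:
            return action
    return "No action"


print(round(error_budget_minutes(0.999), 1))
print(policy_action(12))
```

For example, 12 minutes of downtime is roughly 28% of the budget, which lands in the "Emergency freeze" band.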
Code Examples: - SLO definitions (Prometheus recording rules) - Burn rate alerts - Error budget dashboards
Diagrams: - SLO monitoring - Error budget tracking - Burn rate visualization
Deliverables: - SLO definitions - Alert rules - Error budget policies
CYCLE 6: Common Problems & Solutions (~5,000 lines)
Topic 11: Service Health Issues
What will be covered: - Problem: Service Pods CrashLoopBackOff
Symptoms:
- Pods continuously restarting
- kubectl get pods shows "CrashLoopBackOff"
- Service unavailable
Diagnosis:
# Check pod status
kubectl get pods -n atp-ingest-ns
# View pod events
kubectl describe pod <pod-name> -n atp-ingest-ns
# Check logs (current and previous)
kubectl logs <pod-name> -n atp-ingest-ns
kubectl logs <pod-name> -n atp-ingest-ns --previous
Common Causes:
1. Configuration error (missing env var, invalid connection string)
2. Database migration failed
3. Secret not mounted (Key Vault CSI issue)
4. Startup timeout (slow dependency)
5. Application exception on startup
Solutions:
# 1. Check configuration
kubectl get configmap ingestion-config -n atp-ingest-ns -o yaml
kubectl get secret <secret-name> -n atp-ingest-ns
# 2. Check Secret Provider
kubectl get secretproviderclass -n atp-ingest-ns
kubectl describe secretproviderclass atp-kv-secrets -n atp-ingest-ns
# 3. Increase startup timeout
# Edit deployment, increase startupProbe failureThreshold
kubectl edit deployment ingestion -n atp-ingest-ns
# 4. Check database connectivity
kubectl run -it --rm debug --image=mcr.microsoft.com/mssql-tools \
--restart=Never -- /bin/bash
# Then: sqlcmd -S <server> -U <user> -P <password> -Q "SELECT 1"
# 5. Roll back to previous version
kubectl rollout undo deployment/ingestion -n atp-ingest-ns
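When many pods are affected, scanning `kubectl get pods` by eye is slow. The sketch below filters the JSON form of that output for containers waiting in `CrashLoopBackOff`. It relies on the standard Pod status fields (`status.containerStatuses[].state.waiting.reason` and `restartCount`). The function name and the sample pod names are hypothetical.

```python
def crash_looping(pod_list: dict) -> list[tuple[str, str, int]]:
    """Return (pod, container, restartCount) for containers in CrashLoopBackOff."""
    hits = []
    for pod in pod_list.get("items", []):
        pod_name = pod["metadata"]["name"]
        for cs in pod.get("status", {}).get("containerStatuses", []):
            waiting = cs.get("state", {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                hits.append((pod_name, cs["name"], cs.get("restartCount", 0)))
    return hits


# Trimmed sample in the shape of `kubectl get pods -n atp-ingest-ns -o json`.
sample = {
    "items": [
        {
            "metadata": {"name": "ingestion-abc"},
            "status": {"containerStatuses": [
                {"name": "ingestion", "restartCount": 7,
                 "state": {"waiting": {"reason": "CrashLoopBackOff"}}},
            ]},
        },
        {
            "metadata": {"name": "ingestion-def"},
            "status": {"containerStatuses": [
                {"name": "ingestion", "restartCount": 0,
                 "state": {"running": {"startedAt": "2025-10-30T14:00:00Z"}}},
            ]},
        },
    ]
}

print(crash_looping(sample))
```

In practice the input would come from piping `kubectl get pods -n atp-ingest-ns -o json` into `json.load(sys.stdin)`.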
- Problem: Pods in Pending State
Symptoms:
- Pods stuck in "Pending" status
- Service scaled but new pods not starting
Diagnosis:
kubectl describe pod <pod-name> -n atp-ingest-ns
# Look for events:
# - "0/5 nodes are available: 3 Insufficient cpu, 2 node(s) had taint..."
# - "persistentvolumeclaim not found"
# - "image pull backoff"
Common Causes:
1. Insufficient cluster capacity (CPU/memory)
2. Node taint mismatch (no toleration)
3. PVC not available
4. Image pull failure (authentication, not found)
5. Node selector mismatch
Solutions:
# 1. Check cluster capacity
kubectl top nodes
kubectl describe nodes
# 2. Check if autoscaler is working
kubectl get nodes --watch
# 3. Check node pool autoscaler limits
az aks nodepool show --resource-group atp-aks-prod-rg \
  --cluster-name atp-aks-useast-prod --name npgeneric
# 4. Check image pull secrets
kubectl get secrets -n atp-ingest-ns
kubectl describe pod <pod-name> -n atp-ingest-ns | grep -A 5 "Events:"
# 5. Temporarily reduce resource requests (if emergency)
kubectl edit deployment ingestion -n atp-ingest-ns
# Reduce requests.cpu / requests.memory

- Problem: Service Returns 503 (Service Unavailable)
Symptoms:
- API returns 503 errors
- Health check endpoint fails (/health/ready)
- Service in load balancer but rejecting traffic
Diagnosis:
# Check readiness probe
kubectl get pods -n atp-query-ns
# Look for: READY column showing "0/1" or "0/2" (sidecar)
kubectl logs <pod-name> -n atp-query-ns
# Search for: "Readiness check failed"
# Check dependencies
curl https://query.atp.internal/health/ready
# Response shows which dependency failed
Common Causes:
1. Database connection pool exhausted
2. Redis cache unreachable
3. Projection lag exceeds threshold (ready check fails)
4. Service Bus subscription paused
5. Dependency service down
Solutions:
# 1. Check database connections
# - View metrics: "Database connection pool usage"
# - If exhausted, scale database or increase pool size
# 2. Restart pods (if transient)
kubectl rollout restart deployment/query -n atp-query-ns
# 3. Scale up if overwhelmed
kubectl scale deployment/query -n atp-query-ns --replicas=10
# 4. Check dependency health
kubectl get pods -n atp-projection-ns # If query depends on projections
# 5. Bypass ready check temporarily (emergency only)
kubectl edit deployment query -n atp-query-ns
# Comment out readinessProbe (DO NOT DO THIS IN PRODUCTION without approval)
Code Examples: - Complete troubleshooting procedures - Diagnostic commands - Resolution scripts
Diagrams: - Problem diagnosis flow - Resolution decision tree
Deliverables: - Problem catalog - Diagnostic procedures - Solution library
Topic 12: Database Problems
What will be covered: - Problem: Database Connection Failures - Problem: Slow Queries - Problem: Deadlocks - Problem: Connection Pool Exhaustion - Problem: Migration Failures
Code Examples: - Database troubleshooting - Query optimization - Connection management
Deliverables: - Database operations guide
CYCLE 7: Service-Specific Troubleshooting (~5,000 lines)
Topic 13: Ingestion Service Issues
What will be covered: - High Ingestion Latency
Symptoms:
- P95 latency >500ms (SLO: <500ms)
- Slow API responses
- Queue backlog growing
Diagnosis:
# Check metrics
- Ingestion rate (events/sec)
- Database write latency
- Outbox relay lag
- CPU/memory usage
# Check logs
kubectl logs -f deployment/ingestion -n atp-ingest-ns | grep "WARN\|ERROR"
# Check Application Insights
az monitor app-insights query \
--app atp-appinsights \
--analytics-query "
requests
| where cloud_RoleName == 'Ingestion'
| where timestamp > ago(1h)
| summarize P95=percentile(duration, 95), P99=percentile(duration, 99) by bin(timestamp, 5m)
| render timechart
"
Common Causes:
1. Database bottleneck (DTU/RU exhaustion)
2. Classification service slow (policy evaluation)
3. Outbox table growing (relay worker slow)
4. Large batch ingestion (single tenant spike)
5. CPU/memory limits hit
Solutions:
# 1. Scale ingestion pods
kubectl scale deployment/ingestion -n atp-ingest-ns --replicas=10
# 2. Check database performance
# - View DTU usage in Azure Portal
# - If high, scale up database tier
# 3. Check outbox relay worker
kubectl logs -f deployment/outbox-relay -n atp-ingest-ns
# 4. Implement rate limiting (if single tenant spike)
# - Add per-tenant rate limit policy
# 5. Scale database (if sustained load)
az sql db update --resource-group atp-prod-rg \
--server atp-sql-prod --name atp-db \
--service-objective P4 # Scale to higher tier
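Solution 4 above mentions a per-tenant rate limit without saying how it would work. One common technique is a token bucket per tenant, sketched below. This illustrates the idea only and is not ATP's actual policy mechanism: the class name, parameters, and injected clock are all assumptions.

```python
import time
from typing import Callable


class TenantRateLimiter:
    """Token bucket per tenant: `rate` events/sec sustained, bursts up to `burst`."""

    def __init__(self, rate: float, burst: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock
        # tenant_id -> (tokens remaining, time of last request)
        self._buckets: dict[str, tuple[float, float]] = {}

    def allow(self, tenant_id: str, cost: float = 1.0) -> bool:
        now = self.clock()
        tokens, last = self._buckets.get(tenant_id, (self.burst, now))
        # Refill for the time elapsed since the last request, capped at burst.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        allowed = tokens >= cost
        if allowed:
            tokens -= cost
        self._buckets[tenant_id] = (tokens, now)
        return allowed


limiter = TenantRateLimiter(rate=100.0, burst=200.0)
print(limiter.allow("acme-corp"))
```

Because each tenant has its own bucket, a batch import from one tenant exhausts only that tenant's tokens and leaves other tenants' ingestion unaffected.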
- Ingestion Validation Failures
Symptoms:
- 400 Bad Request errors
- Schema validation failures in logs
- Rejected events
Diagnosis:
# Check validation error logs
kubectl logs deployment/ingestion -n atp-ingest-ns \
  | grep "ValidationException"
# Sample error:
# "ValidationException: Required field 'actor.id' missing in event"
Common Causes:
1. Client sending invalid schema
2. Schema version mismatch
3. Required field missing
4. Data type mismatch
Solutions:
# 1. Review schema documentation
# - Check OpenAPI spec: /api/v1/swagger
# 2. Provide client with correct schema
# - Send link to contract documentation
# 3. Add schema migration (if legitimate change)
# - Update schema validator to accept old + new formats
# 4. Check for API version mismatch
# - Verify client using correct API version

- Outbox Backlog Growing
- Idempotency Conflicts
- Policy Evaluation Timeouts
Code Examples: - Service-specific diagnostics (all ATP services) - Resolution procedures - Common fix scripts
Diagrams: - Service troubleshooting flow - Component dependencies
Deliverables: - Service troubleshooting guide (8 services) - Diagnostic procedures - Fix library
Topic 14: Query Service Issues
What will be covered: - Slow Query Performance - Projection Lag High - Cache Miss Rate High - Search Index Unavailable - Cross-Tenant Data Leakage (Security)
Code Examples: - Query optimization procedures - Cache troubleshooting - Security validation
Deliverables: - Query service operations guide
CYCLE 8: Database Operations (~3,500 lines)
Topic 15: Database Health Monitoring
What will be covered: - Azure SQL Database Metrics - Connection Pool Management - Query Performance Monitoring - Database Deadlocks - Index Fragmentation
Code Examples: - Database health queries - Performance diagnostics - Index maintenance scripts
Deliverables: - Database operations guide
Topic 16: Database Emergency Procedures
What will be covered: - Database Failover - Connection String Rotation - Emergency Scaling - Backup Restoration
Code Examples: - Failover procedures - Emergency scripts
Deliverables: - Emergency database procedures
CYCLE 9: Messaging & Event Bus Operations (~4,000 lines)
Topic 17: Service Bus Health
What will be covered: - Azure Service Bus Monitoring
# Check queue/topic depth
az servicebus queue show \
--resource-group atp-prod-rg \
--namespace-name sb-atp-prod \
--name projection-queue \
--query "countDetails"
# Check dead-letter queue depth
# (the DLQ is a sub-queue; its count is reported on the parent queue)
az servicebus queue show \
--resource-group atp-prod-rg \
--namespace-name sb-atp-prod \
--name projection-queue \
--query "countDetails.deadLetterMessageCount"
# List active subscriptions
az servicebus topic subscription list \
--resource-group atp-prod-rg \
--namespace-name sb-atp-prod \
--topic-name audit.appended.v1
- Message Backlog Handling
- Consumer Lag Monitoring
- Topic/Queue Throttling
- Connection Issues
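Queue depth is easier to act on once it is mapped to a status. The sketch below is illustrative only: the function name `backlog_status` and the default thresholds (1,000 and 10,000 messages) are assumptions, not values defined by ATP.

```shell
# Map a queue's active message count to OK / WARN / CRIT.
# Usage: backlog_status <count> [warn_threshold] [crit_threshold]
# The default thresholds are hypothetical; tune them per queue.
backlog_status() {
  local count="$1" warn="${2:-1000}" crit="${3:-10000}"
  case "$count" in
    ''|*[!0-9]*) echo "UNKNOWN"; return 1 ;;   # reject non-numeric input
  esac
  if [ "$count" -ge "$crit" ]; then
    echo "CRIT"
  elif [ "$count" -ge "$warn" ]; then
    echo "WARN"
  else
    echo "OK"
  fi
}
```

The count can be taken from the depth check above by adding `--query "countDetails.activeMessageCount" -o tsv` to `az servicebus queue show` and passing the result as the first argument.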
Code Examples: - Service Bus diagnostics - Backlog management - Throttling mitigation
Diagrams: - Message flow monitoring - Backlog handling
Deliverables: - Messaging operations guide - Troubleshooting procedures
Topic 18: Event Flow Troubleshooting¶
What will be covered: - Event Not Published (Outbox Stuck) - Event Not Received (Consumer Down) - Duplicate Events (Idempotency) - Event Ordering Issues - Schema Version Mismatch
Code Examples: - Event flow diagnostics - Replay procedures
Deliverables: - Event troubleshooting guide
CYCLE 10: DLQ Management & Replay (~4,000 lines)¶
Topic 19: Dead Letter Queue (DLQ) Triage¶
What will be covered: - DLQ Workflow
flowchart TD
A[Message in DLQ] --> B[Inspect Message]
B --> C{Classify Failure}
C -->|Schema Error| D[Fix Schema Mapping]
C -->|Auth/Permission| E[Fix Credentials]
C -->|Transient Error| F[Immediate Replay]
C -->|Business Logic| G[Fix Code Bug]
C -->|Malicious/Invalid| H[Quarantine]
D --> I[Test in Sandbox]
E --> I
F --> J[Replay to Topic]
G --> K[Deploy Fix]
K --> I
I --> J
H --> L[Document & Skip]
J --> M[Monitor Success]
L --> M
- DLQ Inspection Commands
# Show DLQ details (Azure CLI)
# Single quotes stop the shell from expanding $DeadLetterQueue as a variable
az servicebus queue show \
--resource-group atp-prod-rg \
--namespace-name sb-atp-prod \
--name 'projection-queue/$DeadLetterQueue'
# Peek messages (first 10)
az servicebus queue message peek \
--resource-group atp-prod-rg \
--namespace-name sb-atp-prod \
--name 'projection-queue/$DeadLetterQueue' \
--max-message-count 10
# Receive message (for inspection)
az servicebus queue message receive \
--resource-group atp-prod-rg \
--namespace-name sb-atp-prod \
--name 'projection-queue/$DeadLetterQueue' \
--max-message-count 1
# Or use ATP admin CLI
atp-admin dlq list --consumer projection-worker --limit 50
atp-admin dlq inspect --consumer projection-worker --message-id <id>
- DLQ Classification
Failure Reason: DeliveryCountExceeded
- Message failed max retries (default: 10)
- Indicates: Persistent handler failure
- Action: Fix code/config, then replay
Failure Reason: TTLExpiredException
- Message exceeded time-to-live
- Indicates: Long queue backlog, slow processing
- Action: Increase TTL or scale consumers
Failure Reason: MessageLockLostException
- Processing took longer than lock duration
- Indicates: Slow handler, long transactions
- Action: Optimize handler, increase lock duration
Failure Reason: UnauthorizedException
- Consumer lacks permission
- Indicates: RBAC/credential issue
- Action: Fix managed identity permissions
Failure Reason: SerializationException
- Cannot deserialize message
- Indicates: Schema version mismatch
- Action: Add schema migration, replay
- Safe Replay Procedure
# 1. Identify root cause and fix
# - Deploy code fix
# - OR update configuration
# - OR fix credentials
# 2. Test in sandbox (non-prod)
atp-admin dlq replay \
--consumer projection-worker \
--message-id <id> \
--environment dev \
--dry-run
# 3. Replay to production (small batch first)
atp-admin dlq replay \
--consumer projection-worker \
--filter "reason=DeliveryCountExceeded" \
--max-count 10 \
--confirm
# 4. Monitor for success
# - Check projection lag returns to normal
# - No new DLQ entries
# - Health checks pass
# 5. Replay remaining messages
atp-admin dlq replay \
--consumer projection-worker \
--filter "reason=DeliveryCountExceeded" \
--max-count 1000 \
--confirm
# 6. Document in incident ticket
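The classification rules above lend themselves to a small lookup that turns a list of failure reasons into a triage summary. This is a sketch under stated assumptions: `dlq_action` and `triage_summary` are hypothetical helper names, and the input is assumed to be one failure reason per line (how those reasons are extracted from the DLQ is left open).

```shell
# Map a dead-letter failure reason to the action listed in the classification rules.
dlq_action() {
  case "$1" in
    DeliveryCountExceeded)    echo "fix code/config, then replay" ;;
    TTLExpiredException)      echo "increase TTL or scale consumers" ;;
    MessageLockLostException) echo "optimize handler or increase lock duration" ;;
    UnauthorizedException)    echo "fix managed identity permissions" ;;
    SerializationException)   echo "add schema migration, then replay" ;;
    *)                        echo "unclassified: inspect manually" ;;
  esac
}

# Read one failure reason per line on stdin and print a count and action per reason.
triage_summary() {
  sort | uniq -c | while read -r count reason; do
    printf '%s x%s -> %s\n' "$reason" "$count" "$(dlq_action "$reason")"
  done
}
```

Unknown reasons fall through to manual inspection rather than being replayed, which keeps the summary conservative.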
Code Examples: - Complete DLQ management procedures - Classification logic - Replay automation
Diagrams: - DLQ triage workflow - Replay safety checks
Deliverables: - DLQ management guide - Replay procedures - Classification rules
Topic 20: Message Replay Scenarios¶
What will be covered: - Replay from Outbox (Re-publish failed events) - Replay from Event Store (Rebuild projections) - Selective Replay (Specific tenant/time range) - Dry-Run Replay (Test without applying)
Code Examples: - Replay scenarios - Safety procedures
Deliverables: - Replay cookbook
CYCLE 11: Deployment Procedures (~3,500 lines)¶
Topic 21: Standard Deployment¶
What will be covered: - Pre-Deployment Checklist
☐ All tests passed in CI/CD pipeline
☐ Code review approved (2+ reviewers)
☐ Security scan passed (no critical vulnerabilities)
☐ Performance testing completed
☐ Database migrations reviewed (if any)
☐ Configuration changes documented
☐ Rollback plan prepared
☐ Change Advisory Board (CAB) approved (for production)
☐ Stakeholders notified (if customer-facing change)
☐ Monitoring dashboards ready
☐ On-call engineer briefed
- Deployment Steps (Helm)
# 1. Verify current state
helm list -n atp-system
helm status atp -n atp-system
# 2. Backup current configuration
helm get values atp -n atp-system > backup-values-$(date +%Y%m%d-%H%M%S).yaml
# 3. Review changes (requires the helm-diff plugin)
helm diff upgrade atp ./charts/atp \
--namespace atp-system \
--values values.prod.yaml \
--values values.us.yaml
# 4. Deploy with canary (progressive rollout)
helm upgrade atp ./charts/atp \
--namespace atp-system \
--values values.prod.yaml \
--values values.us.yaml \
--set image.tag=1.2.4 \
--set canary.enabled=true \
--set canary.weight=5 \
--wait --timeout 10m
# 5. Monitor canary (15 minutes)
# - Watch metrics dashboard
# - Check error rate, latency
# - Review logs for exceptions
# 6. Increase canary weight (if healthy)
# The namespace must be repeated; without it Helm targets the current context's namespace
helm upgrade atp ./charts/atp \
--namespace atp-system \
--reuse-values \
--set canary.weight=20 \
--wait
# Monitor, then repeat at 50%, then at 100%
# 7. Promote canary to stable
helm upgrade atp ./charts/atp \
--namespace atp-system \
--reuse-values \
--set canary.enabled=false \
--wait
# 8. Post-deployment validation
# - Run smoke tests
# - Check all health endpoints
# - Verify key metrics stable
- Deployment with FluxCD (GitOps)
# 1. Update Git repository
git checkout -b release/v1.2.4
# 2. Update image tag in values
sed -i 's/tag: 1.2.3/tag: 1.2.4/' charts/atp/values.prod.yaml
# 3. Commit and push
git add charts/atp/values.prod.yaml
git commit -m "Release v1.2.4 to production"
git push origin release/v1.2.4
# 4. Create PR and get approval
# - PR review by 2+ engineers
# - Automated checks pass
# 5. Merge to main
git checkout main
git merge release/v1.2.4
git push origin main
# 6. FluxCD automatically reconciles (within 1 minute)
flux get kustomizations
flux get helmreleases -n atp-system
# 7. Monitor deployment
kubectl get pods -n atp-ingest-ns --watch
flux logs --follow
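The canary stages in the Helm procedure (5%, 20%, 50%, 100%) can be encoded so that an operator or script never skips a stage. A minimal sketch follows; the function names `next_canary_weight` and `rollout_plan` are hypothetical, and the stage values are taken from the steps above.

```shell
# Return the canary weight that follows the given one, or "done" after 100.
next_canary_weight() {
  case "$1" in
    5)   echo 20 ;;
    20)  echo 50 ;;
    50)  echo 100 ;;
    100) echo "done" ;;
    *)   echo "unknown weight: $1" >&2; return 1 ;;
  esac
}

# Print the remaining stages, starting from the given weight (default 5).
rollout_plan() {
  local w="${1:-5}" plan=""
  while [ "$w" != "done" ]; do
    plan="${plan:+$plan }$w"
    w="$(next_canary_weight "$w")" || return 1
  done
  echo "$plan"
}
```

Each value printed by `rollout_plan` would be passed as `--set canary.weight=<value>` after the monitoring period for the previous stage has passed cleanly. Rejecting unknown weights prevents an ad hoc jump such as 5% straight to 100%.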
Code Examples: - Complete deployment procedures - Validation scripts - Smoke tests
Diagrams: - Deployment workflow - Progressive rollout stages
Deliverables: - Deployment procedures - Validation checklists - Smoke test suite
Topic 22: Emergency Deployments (Hotfix)¶
What will be covered: - Hotfix Criteria - Expedited Approval Process - Fast-Track Deployment - Post-Hotfix Validation
Code Examples: - Hotfix procedures
Deliverables: - Hotfix guide
CYCLE 12: Rollback Procedures (~3,000 lines)¶
Topic 23: Rollback Decision Making¶
What will be covered: - When to Rollback - Rollback vs. Fix Forward - Impact Assessment - Approval Requirements
Code Examples: - Decision matrix
Deliverables: - Rollback decision guide
Topic 24: Rollback Execution¶
What will be covered: - Helm Rollback
# 1. Check release history
helm history atp -n atp-system
# Output:
# REVISION UPDATED STATUS CHART DESCRIPTION
# 1 Mon Oct 28 10:00:00 2025 superseded atp-1.2.3 Install complete
# 2 Tue Oct 29 14:30:00 2025 superseded atp-1.2.4 Upgrade complete
# 3 Wed Oct 30 09:15:00 2025 deployed atp-1.2.5 Upgrade complete
# 2. Rollback to previous version (revision 2)
helm rollback atp 2 -n atp-system --wait --timeout 5m
# 3. Verify rollback
helm list -n atp-system
kubectl get pods -n atp-ingest-ns
# 4. Check health
kubectl get pods -n atp-ingest-ns --watch
curl https://api.atp.example.com/health
# 5. Monitor metrics (15 minutes)
# - Error rate should drop
# - Latency should normalize
# - No new alerts
- Kubernetes Rollback
# Rollback deployment
kubectl rollout undo deployment/ingestion -n atp-ingest-ns
# Rollback to specific revision
kubectl rollout history deployment/ingestion -n atp-ingest-ns
kubectl rollout undo deployment/ingestion -n atp-ingest-ns --to-revision=5
# Watch rollback progress
kubectl rollout status deployment/ingestion -n atp-ingest-ns
- Database Migration Rollback
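For the Helm rollback, the target revision can be read from the release history instead of being typed by hand during an incident. The sketch below assumes the tabular `helm history` output shown in the Helm Rollback steps; `previous_revision` is a hypothetical helper name.

```shell
# Print the most recent revision whose status is "superseded".
# Reads `helm history` table output on stdin; exits non-zero if none is found.
previous_revision() {
  awk '$0 ~ /superseded/ { rev = $1 }
       END { if (rev != "") print rev; else exit 1 }'
}
```

One way to use it: `rev="$(helm history atp -n atp-system | previous_revision)" && helm rollback atp "$rev" -n atp-system --wait --timeout 5m`. The most recent superseded revision is not guaranteed to be a known-good one, so the value should still be checked against the history before rolling back.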
Code Examples: - Complete rollback procedures (all scenarios) - Verification steps - Communication templates
Diagrams: - Rollback workflow - Verification steps
Deliverables: - Rollback procedures - Verification guide - Communication templates
CYCLE 13: Configuration Changes (~3,000 lines)¶
Topic 25: Safe Configuration Updates¶
What will be covered: - ConfigMap Updates
# 1. Backup current ConfigMap
kubectl get configmap ingestion-config -n atp-ingest-ns -o yaml \
> backup-ingestion-config-$(date +%Y%m%d-%H%M%S).yaml
# 2. Edit ConfigMap
kubectl edit configmap ingestion-config -n atp-ingest-ns
# 3. Restart pods to pick up changes
kubectl rollout restart deployment/ingestion -n atp-ingest-ns
# 4. Monitor for issues
kubectl logs -f deployment/ingestion -n atp-ingest-ns
# 5. Rollback if issues
kubectl apply -f backup-ingestion-config-20251030-143000.yaml
kubectl rollout restart deployment/ingestion -n atp-ingest-ns
- Feature Flag Changes
- Rate Limit Adjustments
- Logging Level Changes
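The ConfigMap rollback step refers to a specific backup file by name. A small helper can select the newest backup automatically, since the `YYYYMMDD-HHMMSS` timestamps used in step 1 sort in time order. This is a sketch; `latest_backup` is a hypothetical name, and it assumes backups are kept in a single directory.

```shell
# Print the path of the newest "<prefix>-<timestamp>.yaml" file in a directory.
# Usage: latest_backup <dir> <prefix>
latest_backup() {
  local dir="$1" prefix="$2"
  # Timestamps in YYYYMMDD-HHMMSS form sort lexicographically in time order.
  find "$dir" -maxdepth 1 -name "${prefix}-*.yaml" | sort | tail -n 1
}
```

For example, `kubectl apply -f "$(latest_backup . backup-ingestion-config)"` would restore the most recent backup taken from the current directory.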
Code Examples: - Configuration update procedures - Validation scripts
Deliverables: - Configuration change guide
Topic 26: Azure App Configuration Updates¶
What will be covered: - Updating Feature Flags - Configuration Refresh - Rollback Configuration - Configuration Audit Trail
Code Examples: - App Configuration procedures
Deliverables: - App Configuration guide
CYCLE 14: Key Rotation & Secret Management (~3,500 lines)¶
Topic 27: Routine Key Rotation¶
What will be covered: - Rotation Schedule
Monthly Rotation:
- Database passwords (connection strings)
- Service Bus connection strings
- Redis authentication keys
Quarterly Rotation:
- API keys (third-party integrations)
- Webhook HMAC secrets
- Application secrets
Annual Rotation:
- TLS certificates (if not auto-renewed)
- Root signing keys (with dual-key overlap)
On-Demand Rotation:
- Security incident (immediate)
- Employee departure (within 24 hours)
- Suspected compromise (immediate)
- Database Password Rotation
# 1. Generate new password
NEW_PASSWORD=$(openssl rand -base64 32)
# 2. Add new password to Key Vault
az keyvault secret set \
--vault-name atp-kv-prod \
--name DatabasePassword-New \
--value "$NEW_PASSWORD"
# 3. Update the SQL server admin login with the new password
# (`az sql server ad-admin` manages the Azure AD admin and takes no password;
# for a contained application user, run ALTER USER ... WITH PASSWORD instead)
az sql server update \
--resource-group atp-prod-rg \
--name atp-sql-prod \
--admin-password "$NEW_PASSWORD"
# 4. Update SecretProviderClass to use new secret
kubectl edit secretproviderclass atp-kv-secrets -n atp-ingest-ns
# Change: DatabasePassword → DatabasePassword-New
# 5. Restart pods to mount new secret
kubectl rollout restart deployment/ingestion -n atp-ingest-ns
# 6. Wait for pods to be healthy
kubectl get pods -n atp-ingest-ns --watch
# 7. Verify connectivity
kubectl logs deployment/ingestion -n atp-ingest-ns | grep "Database connection"
# 8. Delete old secret (after 24 hour overlap)
az keyvault secret delete \
--vault-name atp-kv-prod \
--name DatabasePassword
# 9. Document rotation in audit log
- Signing Key Rotation (Zero-Downtime)
- Certificate Rotation
- Emergency Rotation (Suspected Compromise)
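The rotation schedule above can be checked mechanically once the age of each secret is known. The following sketch assumes 30, 90, and 365 days for the monthly, quarterly, and annual schedules; those day counts and the function names are illustrative assumptions, and on-demand rotation is deliberately out of scope because it is triggered by events, not age.

```shell
# Translate a schedule name into a maximum secret age in days (assumed values).
rotation_interval_days() {
  case "$1" in
    monthly)   echo 30 ;;
    quarterly) echo 90 ;;
    annual)    echo 365 ;;
    *)         return 1 ;;
  esac
}

# Succeed (exit 0) if a secret on the given schedule is due for rotation.
# Usage: rotation_due <monthly|quarterly|annual> <days_since_last_rotation>
rotation_due() {
  local interval
  interval="$(rotation_interval_days "$1")" || {
    echo "unknown schedule: $1" >&2
    return 2
  }
  [ "$2" -ge "$interval" ]
}
```

A check such as `rotation_due monthly 31 && echo "rotate database password"` could run in a scheduled job, with the secret age taken from the Key Vault secret's last-updated timestamp.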
Code Examples: - Complete rotation procedures (all secret types) - Zero-downtime patterns - Emergency protocols
Diagrams: - Rotation workflow - Dual-key overlap - Emergency rotation
Deliverables: - Key rotation procedures (all types) - Emergency rotation guide - Audit procedures
Topic 28: Secret Compromise Response¶
What will be covered: - Detection - Containment - Rotation - Investigation - Prevention
Code Examples: - Compromise response procedures
Deliverables: - Security incident guide
CYCLE 15: Scaling & Capacity Management (~3,500 lines)¶
Topic 29: Manual Scaling¶
What will be covered: - Scale Pods
# Scale deployment manually
kubectl scale deployment/query -n atp-query-ns --replicas=20
# Verify scaling
kubectl get pods -n atp-query-ns --watch
# Check if HPA will override (disable HPA temporarily if needed)
kubectl get hpa -n atp-query-ns
kubectl delete hpa query-hpa -n atp-query-ns # Temporary removal
- Scale Nodes (AKS)
- Scale Database
Code Examples: - Manual scaling procedures - Verification steps
Deliverables: - Scaling procedures
Topic 30: Capacity Planning¶
What will be covered: - Capacity Metrics - Growth Forecasting - Resource Planning - Cost Optimization
Code Examples: - Capacity analysis - Forecasting models
Deliverables: - Capacity planning guide
CYCLE 16: Performance Troubleshooting (~4,000 lines)¶
Topic 31: Performance Investigation¶
What will be covered: - Identifying Performance Bottlenecks - Database Query Optimization - Memory Leak Detection - CPU Profiling - Network Latency Analysis
Code Examples: - Performance diagnostics - Profiling tools
Deliverables: - Performance troubleshooting guide
Topic 32: Performance Optimization¶
What will be covered: - Cache Tuning - Connection Pool Optimization - Query Optimization - Resource Limit Tuning
Code Examples: - Optimization procedures
Deliverables: - Optimization cookbook
CYCLE 17: Security Incident Response (~4,500 lines)¶
Topic 33: Security Breach Response¶
What will be covered: - Breach Detection
Security Alert Types:
- Unauthorized access attempt
- Privilege escalation
- Data exfiltration
- Tamper detection
- DDoS attack
- Credential compromise
- Immediate Actions (SIEM Alert)
⚠️ SECURITY INCIDENT - IMMEDIATE ACTIONS
Within 15 minutes:
☐ Page Security team
☐ Create dedicated war room: #security-incident-YYYYMMDD
☐ Freeze all deployments (emergency freeze)
☐ Capture evidence (logs, metrics, network traffic)
☐ Isolate affected systems (network policies)
☐ Revoke compromised credentials
☐ Notify CISO and Legal
☐ Begin forensic investigation
DO NOT:
❌ Delete logs or evidence
❌ Restart services (a restart destroys the in-memory state needed for memory dumps)
❌ Notify customers (until Legal/Communications approval)
❌ Share details publicly
- Tamper Detection Response
# Tamper Alert Fired
# 1. Freeze purge/export operations
atp-admin integrity freeze --reason "tamper-investigation"
# 2. Retrieve integrity proofs
atp-admin integrity verify \
--segment-id <segment-id> \
--include-proof \
--output tamper-evidence-$(date +%Y%m%d-%H%M%S).json
# 3. Offline verification
# - Download proof bundle
# - Verify Merkle root
# - Verify digital signature
# - Compare with stored record
# 4. If tamper confirmed
# - Escalate to SEV-1
# - Engage Security + Compliance
# - Preserve all evidence
# - Notify affected customers (Legal approval)
# 5. If false positive
# - Document root cause
# - Tune detection thresholds
# - Unfreeze operations
- Data Exfiltration Response
- Credential Compromise Response
- DDoS Mitigation
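Evidence captured during a security incident is easier to defend later if its checksums are recorded at collection time. The sketch below is a generic illustration and not ATP's integrity-proof mechanism: the function names are hypothetical, and it assumes GNU coreutils `sha256sum` is available on the collection host.

```shell
# Print a SHA-256 manifest ("<hash>  <relative path>") for every file in a directory.
# Usage: evidence_manifest <evidence_dir> > /absolute/path/to/manifest
evidence_manifest() {
  (cd "$1" && find . -type f -exec sha256sum {} + | sort -k 2)
}

# Re-check the files against a previously written manifest.
# Usage: verify_evidence <evidence_dir> <absolute_manifest_path>
# Exits non-zero if any file changed or is missing.
verify_evidence() {
  (cd "$1" && sha256sum --quiet -c "$2")
}
```

The manifest should be stored outside the evidence directory, ideally in write-once storage, so that it cannot be altered together with the files it describes.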
Code Examples: - Complete security procedures - Forensic collection - Containment automation
Diagrams: - Security incident flow - Containment procedures
Deliverables: - Security incident guide - Forensic procedures - Containment playbooks
Topic 34: Post-Breach Recovery¶
What will be covered: - System Hardening - Credential Rotation (All) - Audit Trail Review - Customer Notification
Code Examples: - Recovery procedures
Deliverables: - Recovery guide
CYCLE 18: Data Recovery & Backups (~3,500 lines)¶
Topic 35: Backup Procedures¶
What will be covered: - Database Backups - Blob Storage Backups - Configuration Backups - Backup Verification
Code Examples: - Backup automation
Deliverables: - Backup procedures
Topic 36: Restore Procedures¶
What will be covered: - Point-in-Time Recovery - Disaster Recovery - Cross-Region Restore - Data Validation After Restore
Code Examples: - Restore procedures
Deliverables: - Recovery guide
CYCLE 19: Compliance & Audit Procedures (~3,000 lines)¶
Topic 37: Compliance Operations¶
What will be covered: - Legal Hold Procedures - Data Retention Validation - Data Residency Verification - Right-to-Erasure (GDPR Article 17) - Compliance Audit Support
Code Examples: - Compliance procedures
Deliverables: - Compliance operations guide
Topic 38: Audit Trail Validation¶
What will be covered: - Integrity Verification - Chain Validation - Export for Auditors - Compliance Reporting
Code Examples: - Validation procedures
Deliverables: - Audit validation guide
CYCLE 20: Maintenance Windows (~3,000 lines)¶
Topic 39: Scheduled Maintenance¶
What will be covered: - Maintenance Planning - Customer Communication - Maintenance Execution - Post-Maintenance Validation
Code Examples: - Maintenance procedures
Deliverables: - Maintenance guide
Topic 40: Zero-Downtime Maintenance¶
What will be covered: - Rolling Node Updates - Database Maintenance - Index Optimization - Cache Maintenance
Code Examples: - Zero-downtime procedures
Deliverables: - Maintenance cookbook
CYCLE 21: Emergency Procedures (~4,000 lines)¶
Topic 41: Emergency Response¶
What will be covered: - Multi-Region Outage
🚨 EMERGENCY: COMPLETE REGION OUTAGE
Scenario: US-East region completely unavailable
Immediate Actions (within 30 minutes):
☐ Declare SEV-1 incident
☐ Page all on-call (primary, secondary, manager)
☐ Activate incident command
☐ Assess scope (all regions? single region?)
☐ Notify Customer Success
☐ Initiate failover to secondary region
Failover Procedure:
1. Update Azure Front Door backend pool
- Remove failed region from rotation
- Route all traffic to healthy region(s)
2. Verify secondary region capacity
- Check pod counts, node counts
- Scale up if needed
3. Monitor secondary region closely
- Increased load may cause cascading issues
- Watch error rates, latency, resource usage
4. Communicate to customers
- Post on status page
- Email affected customers
- Update ETA every hour
5. Investigation (parallel track)
- Azure Service Health
- Azure Support ticket
- Internal root cause analysis
6. Recovery
- When primary region restored
- Gradual traffic shift back (20% → 50% → 100%)
- Monitor for residual issues
7. Postmortem
- Document timeline
- Identify improvements
- Update DR procedures
- Complete Data Loss Scenario
🚨 EMERGENCY: DATA LOSS SUSPECTED
DO NOT PANIC. ATP has multiple protection layers.
Immediate Actions:
☐ Freeze ALL write operations (emergency maintenance mode)
☐ Page Security + Compliance teams
☐ Preserve all evidence (logs, snapshots, backups)
☐ Begin forensic investigation
Investigation:
1. Verify integrity proofs
- Check hash chains
- Verify Merkle trees
- Validate digital signatures
2. Check all backup sources
- Azure SQL point-in-time restore points
- Blob Storage WORM containers
- Geo-replicated backups
3. Assess scope
- Which tenants affected?
- Which time range?
- What data types?
4. Recovery
- Restore from latest valid backup
- Verify integrity of restored data
- Replay events if needed
5. Customer Notification
- Legal approval required
- Prepare incident report
- Offer credit/compensation
6. Regulatory Notification
- GDPR: 72 hours
- HIPAA: 60 days
- SOC 2: Immediate
- Security Breach Emergency
- Compliance Violation Emergency
- Cascade Failure
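The regulatory notification windows listed in the data loss scenario (GDPR 72 hours, HIPAA 60 days, SOC 2 immediate) can be turned into concrete deadlines as soon as the detection time is known. This is a minimal sketch: `notification_deadline` is a hypothetical name, times are Unix epoch seconds, and the windows are copied from the list above and should be confirmed with Legal.

```shell
# Print the notification deadline (epoch seconds) for a regime.
# Usage: notification_deadline <GDPR|HIPAA|SOC2> <detected_epoch_seconds>
notification_deadline() {
  local regime="$1" detected="$2" window
  case "$regime" in
    GDPR)  window=$((72 * 3600)) ;;    # 72 hours
    HIPAA) window=$((60 * 86400)) ;;   # 60 days
    SOC2)  window=0 ;;                 # immediate
    *)     echo "unknown regime: $regime" >&2; return 1 ;;
  esac
  echo $((detected + window))
}
```

With GNU `date`, `date -u -d "@$(notification_deadline GDPR "$(date +%s)")"` prints the GDPR deadline in UTC for an incident detected now.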
Code Examples: - Emergency procedures (all scenarios) - Failover automation - Communication templates
Diagrams: - Emergency response flow - Failover procedures - Recovery timeline
Deliverables: - Emergency procedure manual - Failover guide - Crisis communication templates
Topic 42: Business Continuity¶
What will be covered: - Disaster Recovery Drills - Failover Testing - RTO/RPO Validation - Recovery Verification
Code Examples: - DR drill procedures
Deliverables: - Business continuity guide
CYCLE 22: Contacts, Escalation & Knowledge Base (~3,000 lines)¶
Topic 43: Contact Directory¶
What will be covered: - On-Call Schedules
Primary On-Call:
- PagerDuty: https://connectsoft.pagerduty.com/schedules#ATP-PRIMARY
- Rotation: Weekly (Monday 9 AM → Monday 9 AM)
- Coverage: 24/7
- Response: <5 min (SEV-1), <15 min (SEV-2)
Secondary On-Call:
- PagerDuty: https://connectsoft.pagerduty.com/schedules#ATP-SECONDARY
- Escalation: After 30 min (SEV-1), 1 hour (SEV-2)
Manager On-Call:
- Escalation: After 1 hour (SEV-1), 2 hours (SEV-2)
- Major decisions, customer communication
Subject Matter Experts (SMEs):
- Database: @db-team (Slack: #atp-database)
- Security: @security-team (Slack: #atp-security)
- Compliance: @compliance-team (Email: compliance@)
- Networking: @network-team (Slack: #atp-networking)
- Azure Infrastructure: @cloud-team (Slack: #atp-cloud)
- Escalation Matrix

| Level | Role | Escalate After | Contact Method |
|-------|------|----------------|----------------|
| L1 | Platform Engineer | N/A | PagerDuty auto-page |
| L2 | Senior SRE | 30 min (SEV-1), 1h (SEV-2) | PagerDuty escalation |
| L3 | Engineering Manager | 1h (SEV-1), 2h (SEV-2) | PagerDuty + Phone |
| L4 | VP Engineering | 2h (SEV-1), Major outage | Phone + Email |
| L5 | CTO | Customer-impacting, 4h+ | Phone + Email |

- External Contacts
Microsoft Azure Support:
- Portal: https://portal.azure.com -> Support + troubleshooting
- Phone: 1-800-867-1389
- Priority: Severity A (critical)
Customers (Major):
- Acme Corp: success@acmecorp.com, @acme-csm (Slack Connect)
- Contoso: support@contoso.com, @contoso-csm
- Fabrikam: ops@fabrikam.com, @fabrikam-csm
Vendors:
- Hashicorp (Vault): support@hashicorp.com
- Elastic (Search): support@elastic.co
- PagerDuty: support@pagerduty.com
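The time-based rows of the escalation matrix can be expressed as a lookup from severity and elapsed time to the level that should currently be engaged. The sketch below is an assumption-laden illustration: `escalation_level` is a hypothetical name, the L5 threshold of 4 hours is applied to SEV-1 only, and non-time triggers such as "Major outage" are not modeled.

```shell
# Print the escalation level (L1..L5) for a severity and minutes since the page.
# Usage: escalation_level <SEV-1|SEV-2> <minutes_elapsed>
escalation_level() {
  local sev="$1" mins="$2"
  case "$sev" in
    SEV-1)
      if   [ "$mins" -ge 240 ]; then echo L5
      elif [ "$mins" -ge 120 ]; then echo L4
      elif [ "$mins" -ge 60 ];  then echo L3
      elif [ "$mins" -ge 30 ];  then echo L2
      else echo L1
      fi ;;
    SEV-2)
      if   [ "$mins" -ge 120 ]; then echo L3
      elif [ "$mins" -ge 60 ];  then echo L2
      else echo L1
      fi ;;
    *)
      echo "unsupported severity: $sev" >&2
      return 1 ;;
  esac
}
```

For instance, `escalation_level SEV-1 45` reports the level reached 45 minutes into a SEV-1 incident, which an incident bot could compare against who has actually been paged.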
Code Examples: - Contact templates - Escalation automation
Diagrams: - Escalation flow - Contact hierarchy
Deliverables: - Contact directory - Escalation procedures
Topic 44: Knowledge Base & Resources¶
What will be covered: - Internal Documentation
Architecture Docs: /docs/architecture/
API Documentation: https://api.atp.example.com/swagger
Confluence Wiki: https://connectsoft.atlassian.net/wiki/ATP
Runbook (this doc): /docs/operations/runbook.md
Postmortems: /docs/postmortems/
- Training Resources
  - Onboarding guide for new on-call engineers
  - Video walkthroughs
  - Practice scenarios
- Runbook Updates
  - Update after every postmortem
  - Quarterly review
  - Version in Git
  - PR approval required
Code Examples: - Knowledge base structure - Update procedures
Diagrams: - Resource map
Deliverables: - Knowledge base index - Training materials - Update procedures
Summary of Deliverables¶
Across all 22 cycles, this documentation will provide:
- Runbook Foundation
  - Organization and navigation
  - Quick reference cards
  - Contact directory
- Monitoring & Health
  - Health check endpoints (all services)
  - Dashboard access
  - Metric interpretation
- Incident Management
  - Alert response procedures
  - Incident lifecycle
  - War room protocols
  - Severity classification
  - SLO monitoring
- Troubleshooting
  - Common problems catalog (50+ scenarios)
  - Service-specific diagnostics (8 services)
  - Database troubleshooting
  - Messaging troubleshooting
  - DLQ triage and replay
- Operational Procedures
  - Deployment procedures (standard, canary, blue-green)
  - Rollback procedures (Helm, K8s, database)
  - Configuration changes
  - Key rotation (routine and emergency)
- Capacity & Performance
  - Manual scaling procedures
  - Capacity planning
  - Performance troubleshooting
  - Optimization techniques
- Security & Compliance
  - Security incident response
  - Tamper investigation
  - Breach containment
  - Compliance operations
  - Audit validation
- Recovery
  - Backup procedures
  - Restore procedures
  - Disaster recovery
  - Multi-region failover
- Maintenance
  - Scheduled maintenance
  - Zero-downtime procedures
  - Routine tasks
- Emergency
  - Emergency response procedures
  - Crisis management
  - Business continuity
- Reference
  - Complete contact directory
  - Escalation matrix
  - Knowledge base index
  - Runbook maintenance
Related Documentation¶
- Progressive Rollout: Deployment strategies
- Kubernetes: K8s operations
- GitOps: Deployment automation
- Monitoring: Monitoring and alerting
- Security: Security architecture
- Key Rotation: Rotation procedures
- Chaos Drills: Resilience testing
- Disaster Recovery: DR procedures
- Configuration: Configuration management
This operational runbook provides step-by-step procedures for running ATP in production: responding to incidents, troubleshooting systematically, deploying changes safely, managing capacity, rotating secrets, investigating security events, recovering from failures, maintaining compliance, and escalating appropriately. Throughout, the procedures preserve ATP's core guarantees of tamper-evidence, tenant isolation, and regulatory compliance.