Operations Runbook - Audit Trail Platform (ATP)

Your operational command center: this runbook provides step-by-step procedures for running ATP in production, responding to incidents, troubleshooting common issues, performing routine maintenance, and handling emergencies, with clear ownership, escalation paths, and compliance preservation.


📋 Documentation Generation Plan

This document will be generated in 22 cycles. Current progress:

| Cycle | Topics | Estimated Lines | Status |
|-------|--------|-----------------|--------|
| Cycle 1 | Runbook Overview & Organization (1-2) | ~2,500 | ⏳ Not Started |
| Cycle 2 | Service Health & Status Monitoring (3-4) | ~4,000 | ⏳ Not Started |
| Cycle 3 | Alert Response Procedures (5-6) | ~4,000 | ⏳ Not Started |
| Cycle 4 | Incident Management (7-8) | ~4,500 | ⏳ Not Started |
| Cycle 5 | Severity Classification & SLAs (9-10) | ~3,000 | ⏳ Not Started |
| Cycle 6 | Common Problems & Solutions (11-12) | ~5,000 | ⏳ Not Started |
| Cycle 7 | Service-Specific Troubleshooting (13-14) | ~5,000 | ⏳ Not Started |
| Cycle 8 | Database Operations (15-16) | ~3,500 | ⏳ Not Started |
| Cycle 9 | Messaging & Event Bus Operations (17-18) | ~4,000 | ⏳ Not Started |
| Cycle 10 | DLQ Management & Replay (19-20) | ~4,000 | ⏳ Not Started |
| Cycle 11 | Deployment Procedures (21-22) | ~3,500 | ⏳ Not Started |
| Cycle 12 | Rollback Procedures (23-24) | ~3,000 | ⏳ Not Started |
| Cycle 13 | Configuration Changes (25-26) | ~3,000 | ⏳ Not Started |
| Cycle 14 | Key Rotation & Secret Management (27-28) | ~3,500 | ⏳ Not Started |
| Cycle 15 | Scaling & Capacity Management (29-30) | ~3,500 | ⏳ Not Started |
| Cycle 16 | Performance Troubleshooting (31-32) | ~4,000 | ⏳ Not Started |
| Cycle 17 | Security Incident Response (33-34) | ~4,500 | ⏳ Not Started |
| Cycle 18 | Data Recovery & Backups (35-36) | ~3,500 | ⏳ Not Started |
| Cycle 19 | Compliance & Audit Procedures (37-38) | ~3,000 | ⏳ Not Started |
| Cycle 20 | Maintenance Windows (39-40) | ~3,000 | ⏳ Not Started |
| Cycle 21 | Emergency Procedures (41-42) | ~4,000 | ⏳ Not Started |
| Cycle 22 | Contacts, Escalation & Knowledge Base (43-44) | ~3,000 | ⏳ Not Started |

Total Estimated Lines: ~81,000


Purpose & Scope

This Operations Runbook is the authoritative operational guide for ATP, providing actionable procedures for SRE/DevOps teams to monitor health, respond to incidents, troubleshoot issues, deploy changes, manage capacity, rotate secrets, handle security events, perform maintenance, and escalate emergencies while preserving ATP's core guarantees: tamper-evidence, tenant isolation, and compliance.

Who Should Use This Runbook?

  • SRE/Platform Engineers: Day-to-day operations, monitoring, scaling
  • On-Call Engineers: Incident response, troubleshooting, emergency procedures
  • DevOps Engineers: Deployments, rollbacks, configuration changes
  • Security Team: Security incident response, key rotation, audit preservation
  • Compliance Team: Data recovery, legal hold procedures, audit validation
  • Engineering Management: Escalation procedures, postmortem reviews

Runbook Philosophy

  1. Clarity: Step-by-step procedures with copy-paste commands
  2. Speed: Quick reference for time-sensitive incidents
  3. Safety: All procedures preserve audit integrity and compliance
  4. Traceability: Every action logged and auditable
  5. Escalation: Clear ownership and escalation paths
  6. Learning: Postmortem integration for continuous improvement

Detailed Cycle Plan

CYCLE 1: Runbook Overview & Organization (~2,500 lines)

Topic 1: Runbook Structure & Quick Reference

What will be covered: - Runbook Organization

This runbook is organized by operational scenario:

1. Health & Monitoring
   - Service status checks
   - Dashboard access
   - Metric interpretation

2. Incident Response
   - Alert procedures
   - Severity classification
   - Escalation paths

3. Troubleshooting
   - Common problems by service
   - Diagnostic procedures
   - Root cause analysis

4. Operations
   - Deployment procedures
   - Rollback procedures
   - Configuration changes
   - Key rotation

5. Maintenance
   - Scheduled tasks
   - Database maintenance
   - Index optimization
   - Cache warming

6. Emergency Procedures
   - Security breaches
   - Data corruption
   - Multi-region failover
   - Disaster recovery

7. Reference
   - Contact information
   - Escalation matrix
   - Service topology
   - Runbook updates

  • Quick Reference Cards

    🚨 CRITICAL INCIDENT (SEV-1)
    1. Acknowledge in PagerDuty (within 5 minutes)
    2. Join war room: Slack #atp-incidents
    3. Check health dashboard: https://atp-ops.example.com
    4. Follow incident procedure (Section 4)
    5. Escalate if unresolved in 30 minutes
    
    📊 CHECK SERVICE HEALTH
    kubectl get pods -n atp-{service}-ns
    curl https://api.atp.example.com/health/{service}
    View dashboard: https://grafana.atp.example.com/d/service-health
    
    🔄 ROLLBACK DEPLOYMENT
    helm rollback atp -n atp-system
    OR
    kubectl rollout undo deployment/{service} -n atp-{service}-ns
    
    🔍 VIEW LOGS
    kubectl logs -f deployment/{service} -n atp-{service}-ns
    OR
    az monitor log-analytics query -w {workspace-id} --analytics-query "..."
    
    📈 CHECK METRICS
    Prometheus: https://prometheus.atp.example.com
    Grafana: https://grafana.atp.example.com
    Azure Monitor: https://portal.azure.com -> ATP Log Analytics
    

  • Critical Contacts

    On-Call Rotation:
    - Primary: PagerDuty "ATP Primary" schedule
    - Secondary: PagerDuty "ATP Secondary" schedule
    - Manager On-Call: PagerDuty "ATP Manager" schedule
    
    War Room:
    - Slack: #atp-incidents (critical)
    - Slack: #atp-ops (warnings, maintenance)
    - Teams: ATP Operations Team
    
    Escalation:
    - L1 (Platform Engineer): 0-30 min response
    - L2 (Senior SRE): 30-60 min response
    - L3 (Engineering Manager): 60-120 min response
    - L4 (VP Engineering): 2+ hours, major outage
    
    Stakeholders:
    - Security: security@connectsoft.com
    - Compliance: compliance@connectsoft.com
    - Customer Success: success@connectsoft.com
    

Code Examples: - Quick reference templates - Contact information structure - Emergency checklists

Diagrams: - Runbook navigation - Escalation flow - War room structure

Deliverables: - Runbook organization guide - Quick reference cards - Contact directory


Topic 2: Using This Runbook

What will be covered: - How to Navigate
    - Table of contents with direct links
    - Search functionality (Ctrl+F)
    - Cross-references to architecture docs

  • When to Use Each Section
    - Alert fired → Alert Response (Section 3)
    - Service degraded → Troubleshooting (Section 6-7)
    - Deploying change → Deployment Procedures (Section 11)
    - Security event → Security Incident Response (Section 17)

  • Runbook Maintenance
    - Update after every postmortem
    - Quarterly review by SRE team
    - Version controlled in Git
    - Changes go through PR review

Code Examples: - Runbook update procedure - Change log template

Diagrams: - Runbook usage flow

Deliverables: - Navigation guide - Usage instructions - Maintenance procedures


CYCLE 2: Service Health & Status Monitoring (~4,000 lines)

Topic 3: Health Check Endpoints

What will be covered: - ATP Health Check Architecture

Every ATP service exposes 3 health endpoints:

/health/live (Liveness Probe)
- Purpose: Is the service process alive?
- K8s: Restarts the container if the probe fails
- Checks: Minimal (service responsive)

/health/ready (Readiness Probe)
- Purpose: Is the service ready to accept traffic?
- K8s: Removes the pod from load balancing if the probe fails
- Checks: Dependencies (DB, cache, message bus, KMS)

/health/startup (Startup Probe)
- Purpose: Has the service finished initialization?
- K8s: Holds off liveness/readiness probes until the startup probe passes
- Checks: Configuration loaded, migrations run, caches warmed

  • Health Check Endpoint Details by Service

    # Gateway Service
    curl https://api.atp.example.com/health/live    # → 200 OK
    curl https://api.atp.example.com/health/ready   # → 200 OK or 503
    
    # Checks:
    # - [ready] Azure Key Vault reachable
    # - [ready] Backend services reachable (ingestion, query, policy)
    # - [ready] Rate limiter cache (Redis) connected
    
    # Ingestion Service
    curl https://ingestion.atp.internal/health/ready
    
    # Checks:
    # - [ready] Azure SQL connection pool healthy
    # - [ready] Azure Service Bus connected (publish capability)
    # - [ready] Azure Blob Storage reachable (WORM container)
    # - [ready] Policy Service reachable (classification API)
    # - [ready] Outbox relay worker running
    
    # Query Service
    curl https://query.atp.internal/health/ready
    
    # Checks:
    # - [ready] Read model database (projections) reachable
    # - [ready] Query cache (Redis) connected
    # - [ready] Search index reachable (if enabled)
    # - [ready] Latest projection watermark within SLO lag (<10s)
    
    # Projection Service
    curl https://projection.atp.internal/health/ready
    
    # Checks:
    # - [ready] Azure Service Bus subscription active
    # - [ready] Read model database writable
    # - [ready] Inbox deduplication table accessible
    # - [ready] Projection lag within threshold (<5s)
    # - [ready] No DLQ backlog (or within threshold)
    
    # Export Service
    curl https://export.atp.internal/health/ready
    
    # Checks:
    # - [ready] Azure Blob Storage writable (export container)
    # - [ready] Export job queue (Redis) connected
    # - [ready] KMS signing operation test (dry-run)
    # - [ready] Bandwidth budget not exceeded
    
    # Integrity Service
    curl https://integrity.atp.internal/health/ready
    
    # Checks:
    # - [ready] Azure Blob Storage WORM container reachable
    # - [ready] KMS signing keys accessible
    # - [ready] Hash chain state store connected
    # - [ready] Merkle tree computation functional
    
    # Policy Service
    curl https://policy.atp.internal/health/ready
    
    # Checks:
    # - [ready] Policy database reachable
    # - [ready] Policy cache (Redis) connected
    # - [ready] Default policies loaded
    

  • Health Check Response Format

    // Healthy
    {
      "status": "Healthy",
      "totalDuration": "00:00:00.1234567",
      "entries": {
        "database": {
          "status": "Healthy",
          "description": "Azure SQL connection pool: 5/10 active",
          "duration": "00:00:00.0123456"
        },
        "servicebus": {
          "status": "Healthy",
          "description": "Azure Service Bus connected",
          "duration": "00:00:00.0234567"
        },
        "cache": {
          "status": "Healthy",
          "description": "Redis cache: 1000 keys, 45% memory",
          "duration": "00:00:00.0098765"
        }
      }
    }
    
    // Degraded (service still running but impaired)
    {
      "status": "Degraded",
      "totalDuration": "00:00:00.5000000",
      "entries": {
        "database": {
          "status": "Healthy"
        },
        "cache": {
          "status": "Degraded",
          "description": "Redis cache: Degraded performance, high latency (150ms avg)",
          "duration": "00:00:00.4567890"
        }
      }
    }
    
    // Unhealthy (service should be removed from load balancer)
    {
      "status": "Unhealthy",
      "totalDuration": "00:00:01.0000000",
      "entries": {
        "database": {
          "status": "Unhealthy",
          "description": "Azure SQL connection failed: Timeout",
          "exception": "System.Data.SqlClient.SqlException: Timeout expired...",
          "duration": "00:00:01.0000000"
        }
      }
    }
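
During triage, the useful part of a non-healthy response is usually which entry failed. The following is a minimal Python sketch, assuming the response shape shown above (a top-level `status` plus an `entries` map). The function name `failing_entries` is hypothetical and not part of ATP.

```python
import json


def failing_entries(payload: str) -> list[tuple[str, str, str]]:
    """Return (name, status, description) for every entry that is not Healthy."""
    report = json.loads(payload)
    problems = []
    for name, entry in report.get("entries", {}).items():
        status = entry.get("status", "Unknown")
        if status != "Healthy":
            problems.append((name, status, entry.get("description", "")))
    return problems


# Sample payload taken from the "Degraded" example above. The // comment is
# dropped because strict JSON does not allow comments.
degraded = """
{
  "status": "Degraded",
  "totalDuration": "00:00:00.5000000",
  "entries": {
    "database": {"status": "Healthy"},
    "cache": {
      "status": "Degraded",
      "description": "Redis cache: Degraded performance, high latency (150ms avg)",
      "duration": "00:00:00.4567890"
    }
  }
}
"""

for name, status, description in failing_entries(degraded):
    print(f"{name}: {status} - {description}")
```

In practice the payload would come from a `/health/ready` call such as the curl commands listed earlier.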
    

  • Automated Health Monitoring
    - Kubernetes liveness/readiness probes (every 10s)
    - Azure Monitor health check alerts
    - Grafana health dashboard
    - PagerDuty integration for failures

Code Examples: - Complete health check implementations (all services) - Health check aggregation dashboard - Alerting rules

Diagrams: - Health check architecture - Probe failure handling - Alert routing

Deliverables: - Health check reference - Monitoring guide - Alert configuration


Topic 4: Service Status Dashboard

What will be covered: - Grafana Dashboard Access

ATP Operations Dashboard: https://grafana.atp.example.com/d/atp-ops

Panels:
- Service Health Matrix (all services, all regions)
- Error Rate by Service
- Request Rate by Service
- P50/P95/P99 Latency by Service
- Projection Lag (timeline, actor, resource)
- DLQ Depth by Consumer
- Outbox Backlog
- Cache Hit Rates
- Database Connection Pool Usage
- Message Bus Queue Depth

  • Azure Monitor Workbooks

    ATP Health Overview: 
    - Navigate to Azure Portal → ATP Resource Group → Monitor → Workbooks
    - Select "ATP Service Health"
    
    Views:
    - Live Metrics Stream (real-time)
    - Application Map (service dependencies)
    - Performance (query execution, dependencies)
    - Failures (exceptions, failed requests)
    - Logs (structured query with KQL)
    

  • Status Page (Public)
    - External status page for customers
    - Incident communication
    - Maintenance window announcements

Code Examples: - Grafana dashboard JSON - Azure Monitor KQL queries - Status page integration

Diagrams: - Dashboard layout - Metric flow - Status propagation

Deliverables: - Dashboard templates - Query library - Status page setup


CYCLE 3: Alert Response Procedures (~4,000 lines)

Topic 5: Alert Handling Workflow

What will be covered: - Alert Lifecycle

flowchart LR
    A[Metric Breaches Threshold] --> B[Alert Fired]
    B --> C[PagerDuty Page]
    B --> D[Slack Notification]
    B --> E[Ticket Created]
    C --> F[Engineer Acknowledges]
    F --> G[Investigate & Mitigate]
    G --> H{Issue Resolved?}
    H -->|Yes| I[Alert Auto-Resolves]
    H -->|No| J[Escalate]
    J --> K[L2/L3 Engagement]
    K --> G
    I --> L[Close Ticket]
    L --> M[Postmortem]

  • Alert Channels

    Critical (SEV-1, SEV-2):
    - PagerDuty: Immediate phone/SMS/push
    - Slack: #atp-incidents (auto-created channel per incident)
    - Email: atp-oncall@connectsoft.com
    - Ticket: Jira (auto-created)
    
    Warning (SEV-3):
    - Slack: #atp-ops
    - Email: atp-ops@connectsoft.com
    - Ticket: Jira (auto-created)
    
    Info (SEV-4):
    - Slack: #atp-notifications
    - No page, no email
    

  • Alert Acknowledgement

    # Acknowledge in PagerDuty
    # - Click "Acknowledge" in PagerDuty app/web
    # - OR via API:
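    # Replace {id} with the PagerDuty incident ID. The JSON body needs an
    # explicit Content-Type, otherwise curl sends it as form data.
    curl -X PUT https://api.pagerduty.com/incidents/{id} \
        -H "Authorization: Token token=$PD_TOKEN" \
        -H "Content-Type: application/json" \
        -H "From: oncall@connectsoft.com" \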
        -d '{
          "incident": {
            "type": "incident_reference",
            "status": "acknowledged"
          }
        }'
    
    # Post to Slack
    # "🔔 Acknowledged: [Alert Name]. Investigating... ETA: 15 min"
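
For scripted acknowledgement, the same request can be built in Python. This is a sketch only: it mirrors the curl call above, builds the request without sending it, and uses a hypothetical helper name (`build_ack_request`) and placeholder incident ID and token.

```python
import json
import urllib.request


def build_ack_request(incident_id: str, api_token: str, from_email: str) -> urllib.request.Request:
    """Build (but do not send) the PUT request that acknowledges an incident."""
    body = json.dumps(
        {"incident": {"type": "incident_reference", "status": "acknowledged"}}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"https://api.pagerduty.com/incidents/{incident_id}",
        data=body,
        method="PUT",
        headers={
            "Authorization": f"Token token={api_token}",
            "Content-Type": "application/json",
            "From": from_email,
        },
    )


# Placeholder values for illustration; real values come from PagerDuty.
req = build_ack_request("PXXXXXX", "example-token", "oncall@connectsoft.com")
print(req.get_method(), req.full_url)

# To actually send it (requires network access and a valid token):
# with urllib.request.urlopen(req, timeout=10) as resp:
#     print(resp.status)
```

Keeping request construction separate from sending makes the helper easy to test and to reuse for other status changes.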
    

  • Initial Response Checklist

    Within 5 minutes of alert:
    ☐ Acknowledge in PagerDuty
    ☐ Post to Slack incident channel
    ☐ Check service health dashboard
    ☐ Review recent deployments (last 4 hours)
    ☐ Check related services (dependencies)
    ☐ Determine severity (use classification matrix)
    ☐ Engage additional engineers if SEV-1/SEV-2
    

Code Examples: - Alert response templates - Acknowledgement scripts - Initial triage procedures

Diagrams: - Alert lifecycle - Response timeline - Communication flow

Deliverables: - Alert response procedures - Acknowledgement guide - Triage checklists


Topic 6: Alert Catalog

What will be covered: - ATP Alert Catalog

| Alert | Severity | Threshold | Runbook Section | Auto-Remediation |
|-------|----------|-----------|-----------------|------------------|
| ServiceDown | SEV-1 | 0 healthy pods | 6.1 (Service Restart) | Pod restart (K8s) |
| HighErrorRate | SEV-2 | >5% errors, 5min | 6.2 (Error Investigation) | No |
| HighLatency | SEV-2 | P95 >1s, 10min | 6.3 (Performance Tuning) | No |
| DatabaseConnectionFailure | SEV-1 | All connections fail | 8.1 (Database Recovery) | Retry, then page |
| MessageBusDown | SEV-1 | Service Bus unreachable | 9.1 (Message Bus Recovery) | Retry, then page |
| ProjectionLagHigh | SEV-2 | Lag >30s, 10min | 10.1 (Projection Catchup) | Scale workers (KEDA) |
| DLQBacklog | SEV-3 | >100 msgs, 30min | 10.2 (DLQ Triage) | Alert only |
| DiskSpaceLow | SEV-2 | <10% free | 15.1 (Capacity Expansion) | No |
| CertificateExpiring | SEV-3 | <30 days | 14.1 (Certificate Renewal) | No |
| KeyRotationOverdue | SEV-2 | >90 days | 14.2 (Emergency Key Rotation) | No |
| TamperDetected | SEV-1 | Any tamper alert | 17.1 (Tamper Investigation) | Freeze, escalate |
| ComplianceViolation | SEV-1 | Retention/residency breach | 19.1 (Compliance Emergency) | Freeze, escalate |
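
One way to make the catalog above usable from tooling is a simple lookup table. The sketch below transcribes the catalog rows into a Python mapping. The `AlertEntry` type and the helper names are hypothetical, and the paging rule assumes that SEV-1 and SEV-2 alerts page on-call, as described under Alert Channels.

```python
from typing import NamedTuple


class AlertEntry(NamedTuple):
    severity: str
    runbook_section: str
    auto_remediation: str


# Values transcribed from the alert catalog; the structure itself is illustrative.
ALERT_CATALOG = {
    "ServiceDown": AlertEntry("SEV-1", "6.1", "Pod restart (K8s)"),
    "HighErrorRate": AlertEntry("SEV-2", "6.2", "No"),
    "HighLatency": AlertEntry("SEV-2", "6.3", "No"),
    "DatabaseConnectionFailure": AlertEntry("SEV-1", "8.1", "Retry, then page"),
    "MessageBusDown": AlertEntry("SEV-1", "9.1", "Retry, then page"),
    "ProjectionLagHigh": AlertEntry("SEV-2", "10.1", "Scale workers (KEDA)"),
    "DLQBacklog": AlertEntry("SEV-3", "10.2", "Alert only"),
    "DiskSpaceLow": AlertEntry("SEV-2", "15.1", "No"),
    "CertificateExpiring": AlertEntry("SEV-3", "14.1", "No"),
    "KeyRotationOverdue": AlertEntry("SEV-2", "14.2", "No"),
    "TamperDetected": AlertEntry("SEV-1", "17.1", "Freeze, escalate"),
    "ComplianceViolation": AlertEntry("SEV-1", "19.1", "Freeze, escalate"),
}


def lookup(alert_name: str) -> AlertEntry:
    """Return the catalog entry, with a clearer error for unknown alerts."""
    try:
        return ALERT_CATALOG[alert_name]
    except KeyError:
        raise KeyError(f"Unknown alert {alert_name!r}; add it to the catalog first") from None


def pages_on_call(alert_name: str) -> bool:
    """SEV-1 and SEV-2 alerts page via PagerDuty; lower severities do not."""
    return lookup(alert_name).severity in {"SEV-1", "SEV-2"}


entry = lookup("ProjectionLagHigh")
print(entry.severity, entry.runbook_section, pages_on_call("ProjectionLagHigh"))
```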

  • Alert Runbook Structure
    For each alert:
    
    1. Description: What this alert means
    2. Symptoms: What users/systems experience
    3. Diagnostic Steps: How to investigate
    4. Resolution Steps: How to fix
    5. Escalation: When to escalate
    6. Prevention: How to avoid in future
    7. Related Links: Architecture docs, code references
    

Code Examples: - Alert definitions (Prometheus rules) - Runbook templates for each alert - Auto-remediation scripts

Diagrams: - Alert taxonomy - Runbook linkage

Deliverables: - Complete alert catalog - Runbook procedures for each alert - Auto-remediation playbooks


CYCLE 4: Incident Management (~4,500 lines)

Topic 7: Incident Response Framework

What will be covered: - Incident Lifecycle

1. Detection
   - Alert fires
   - User report
   - Monitoring anomaly

2. Acknowledgement
   - On-call acknowledges (within 5 min for SEV-1)
   - War room created
   - Incident ticket opened

3. Triage
   - Determine severity
   - Identify affected services/tenants
   - Assess impact
   - Engage additional engineers

4. Investigation
   - Review logs, metrics, traces
   - Identify root cause
   - Document findings

5. Mitigation
   - Implement fix or workaround
   - Deploy patch
   - Verify resolution

6. Recovery
   - Restore normal operations
   - Validate all services healthy
   - Notify stakeholders

7. Resolution
   - Close incident ticket
   - Resolve PagerDuty incident
   - Communicate to affected tenants

8. Postmortem
   - Conduct blameless postmortem
   - Document root cause
   - Create action items
   - Update runbook

  • Incident Command Structure

    Incident Commander (IC):
    - Owns incident response
    - Coordinates team
    - Makes go/no-go decisions
    - Communicates to stakeholders
    
    Technical Lead (TL):
    - Drives investigation
    - Implements fixes
    - Coordinates with IC
    
    Communications Lead (CL):
    - Customer notifications
    - Status page updates
    - Stakeholder updates
    
    Scribe:
    - Documents timeline
    - Captures decisions
    - Records actions taken
    

  • War Room Protocols

    War Room Creation (SEV-1, SEV-2):
    1. Create dedicated Slack channel: #incident-YYYY-MM-DD-HHMM
    2. Pin incident ticket link
    3. Pin dashboard links
    4. Set channel topic: "[SEV-X] [Service] Brief description"
    5. Invite: IC, TL, CL, Scribe, relevant SMEs
    
    War Room Updates:
    - Every 15 minutes: IC posts status update
    - Every action: Engineer posts what they're trying
    - Every finding: Post evidence (log snippets, metrics screenshots)
    - Major decisions: IC announces and logs rationale
    
    War Room Etiquette:
    - Use threads for side discussions
    - Main channel for critical updates only
    - No "@here" or "@channel" unless critical
    - Update channel topic with current status
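
The channel naming and topic conventions above are easy to generate consistently. The sketch below assumes the timestamp in the channel name is UTC, which matches the incident timelines elsewhere in this runbook but is not stated explicitly. The function names are hypothetical.

```python
from datetime import datetime, timezone


def war_room_channel(opened_at: datetime) -> str:
    """Format the channel name as #incident-YYYY-MM-DD-HHMM (UTC assumed)."""
    if opened_at.tzinfo is None:
        raise ValueError("opened_at must be timezone-aware")
    return opened_at.astimezone(timezone.utc).strftime("#incident-%Y-%m-%d-%H%M")


def channel_topic(severity: str, service: str, description: str) -> str:
    """Format the topic as "[SEV-X] [Service] Brief description"."""
    return f"[{severity}] [{service}] {description}"


opened = datetime(2025, 10, 30, 14, 32, tzinfo=timezone.utc)
print(war_room_channel(opened))
print(channel_topic("SEV-2", "Ingestion", "High latency in US-East"))
```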
    

Code Examples: - Incident templates - War room scripts - Communication templates

Diagrams: - Incident lifecycle - Command structure - War room flow

Deliverables: - Incident response procedures - War room protocols - Communication templates


Topic 8: Incident Documentation

What will be covered: - Incident Ticket Structure

Incident ID: INC-2025-001234
Title: [SEV-2] Ingestion Service High Latency in US-East

Status: Investigating | Mitigated | Resolved
Severity: SEV-1 | SEV-2 | SEV-3 | SEV-4

Timeline:
- 2025-10-30 14:32 UTC: Alert fired (P95 latency >1s)
- 2025-10-30 14:34 UTC: Acknowledged by @engineer
- 2025-10-30 14:40 UTC: Root cause identified (DB connection pool exhausted)
- 2025-10-30 14:45 UTC: Mitigation deployed (increased pool size)
- 2025-10-30 14:50 UTC: Metrics normal, incident resolved

Impact:
- Affected Tenants: 15 enterprise tenants in US-East
- Duration: 18 minutes
- User Impact: Increased ingestion latency (500ms → 2s)
- Data Integrity: ✅ No data loss, all events persisted

Root Cause:
- Database connection pool exhausted (max 100 connections)
- Traffic spike from tenant "acme-corp" (batch import)
- Pool size not sized for peak load

Resolution:
- Temporary: Increased connection pool max to 200
- Permanent: Implement per-tenant rate limiting

Action Items:
- [ ] Increase default connection pool size (deploy to all regions)
- [ ] Add per-tenant ingestion rate limits
- [ ] Add connection pool usage alerts (>80%)
- [ ] Update capacity planning docs
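
The timeline in a ticket like the one above also yields the usual response metrics. The following Python sketch computes them from the sample timestamps. The field names (`alert_fired`, `acknowledged`, and so on) are illustrative and not a Jira or ServiceNow schema.

```python
from datetime import datetime, timezone

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M"


def parse_utc(value: str) -> datetime:
    """Parse a 'YYYY-MM-DD HH:MM' timestamp and mark it as UTC."""
    return datetime.strptime(value, TIMESTAMP_FORMAT).replace(tzinfo=timezone.utc)


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two timeline entries."""
    return int((parse_utc(end) - parse_utc(start)).total_seconds() // 60)


# Timestamps copied from the sample incident timeline.
timeline = {
    "alert_fired": "2025-10-30 14:32",
    "acknowledged": "2025-10-30 14:34",
    "mitigated": "2025-10-30 14:45",
    "resolved": "2025-10-30 14:50",
}

time_to_ack = minutes_between(timeline["alert_fired"], timeline["acknowledged"])
time_to_mitigate = minutes_between(timeline["alert_fired"], timeline["mitigated"])
duration = minutes_between(timeline["alert_fired"], timeline["resolved"])

print(f"time to acknowledge: {time_to_ack} min")
print(f"time to mitigate:    {time_to_mitigate} min")
print(f"incident duration:   {duration} min")
```

The computed duration agrees with the 18 minutes recorded in the Impact section of the sample ticket.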

  • Incident Log Template
    - Use Jira/ServiceNow incident template
    - Auto-populate from alerts
    - Link to logs, metrics, traces
    - Capture all actions taken

Code Examples: - Incident ticket templates - Timeline documentation - Action item tracking

Diagrams: - Incident ticket flow - Documentation structure

Deliverables: - Incident templates - Documentation procedures - Tracking systems


CYCLE 5: Severity Classification & SLAs (~3,000 lines)

Topic 9: Severity Levels

What will be covered: - ATP Severity Classification

SEV-1 (Critical - P1)
Definition: Complete service outage or data integrity compromise

Examples:
- All ATP services down (no ingestion, no queries)
- Data corruption detected (tamper evidence failed)
- Security breach (unauthorized access)
- Multi-tenant data leakage
- Compliance violation (GDPR, HIPAA, SOC2)

Response Time: 5 minutes
Resolution Time: 4 hours
Communication: Immediate customer notification
Escalation: Immediate (Manager + VP Engineering)

Actions:
- Page primary, secondary, manager on-call
- Create war room immediately
- Engage security team (if security-related)
- Freeze all deployments
- Customer Success notifies affected tenants

---

SEV-2 (High - P2)
Definition: Significant degradation affecting multiple tenants

Examples:
- Single service degraded (high latency, errors)
- Projection lag >30s (query results stale)
- Export service down (ingestion OK)
- Certificate expiring <7 days
- Key rotation overdue

Response Time: 15 minutes
Resolution Time: 8 hours
Communication: Notify affected tenants if impact >1 hour
Escalation: After 1 hour if unresolved

Actions:
- Page primary on-call
- Create war room if >30 min
- Post updates every 30 min

---

SEV-3 (Medium - P3)
Definition: Minor degradation affecting few tenants

Examples:
- Single tenant experiencing issues
- DLQ backlog >100 messages
- Slow query performance (specific endpoint)
- Cache miss rate elevated
- Non-critical background job failing

Response Time: 1 hour
Resolution Time: 24 hours
Communication: Internal only
Escalation: After 4 hours if unresolved

Actions:
- Slack notification to #atp-ops
- Assign to engineer
- Post updates when resolved

---

SEV-4 (Low - P4)
Definition: No user impact, informational

Examples:
- Warning thresholds breached
- Capacity planning alerts
- Maintenance reminders
- Configuration drift

Response Time: Best effort
Resolution Time: 1 week
Communication: None
Escalation: None

Actions:
- Create ticket
- Prioritize in backlog
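
The response and resolution targets above can be captured as data so that tooling can flag breaches. This is a minimal Python sketch with the values transcribed from the severity definitions. The `SeverityPolicy` type and `response_breached` helper are hypothetical, and "Best effort" is modelled as having no response deadline.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class SeverityPolicy:
    response: Optional[timedelta]        # None means best effort
    resolution: timedelta
    escalate_after: Optional[timedelta]  # None means no escalation


POLICIES = {
    "SEV-1": SeverityPolicy(timedelta(minutes=5), timedelta(hours=4), timedelta(0)),
    "SEV-2": SeverityPolicy(timedelta(minutes=15), timedelta(hours=8), timedelta(hours=1)),
    "SEV-3": SeverityPolicy(timedelta(hours=1), timedelta(hours=24), timedelta(hours=4)),
    "SEV-4": SeverityPolicy(None, timedelta(weeks=1), None),
}


def response_breached(severity: str, elapsed: timedelta) -> bool:
    """True if an unacknowledged incident has exceeded its response target."""
    policy = POLICIES[severity]
    return policy.response is not None and elapsed > policy.response


print(response_breached("SEV-1", timedelta(minutes=6)))
print(response_breached("SEV-2", timedelta(minutes=10)))
```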

  • Severity Escalation Matrix

| Time Elapsed | SEV-1 | SEV-2 | SEV-3 |
|--------------|-------|-------|-------|
| 0-30 min | Primary On-Call | Primary On-Call | - |
| 30-60 min | + Secondary On-Call | - | - |
| 60-120 min | + Manager On-Call | + Secondary On-Call | - |
| 120+ min | + VP Engineering | + Manager On-Call | + Secondary On-Call |
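
The escalation matrix can be expressed as thresholds per severity, which is one possible basis for escalation automation. In this Python sketch the lower bound of each time band is treated as inclusive (the matrix does not say which band exactly 30 or 60 minutes falls into), and the names `ESCALATION` and `engaged_roles` are hypothetical.

```python
# (minutes elapsed, role added at that point), transcribed from the matrix.
ESCALATION = {
    "SEV-1": [
        (0, "Primary On-Call"),
        (30, "Secondary On-Call"),
        (60, "Manager On-Call"),
        (120, "VP Engineering"),
    ],
    "SEV-2": [
        (0, "Primary On-Call"),
        (60, "Secondary On-Call"),
        (120, "Manager On-Call"),
    ],
    "SEV-3": [
        (120, "Secondary On-Call"),
    ],
}


def engaged_roles(severity: str, elapsed_minutes: int) -> list[str]:
    """Return every role that should be engaged after the given elapsed time."""
    if elapsed_minutes < 0:
        raise ValueError("elapsed_minutes must be non-negative")
    steps = ESCALATION.get(severity, [])  # severities outside the matrix never escalate
    return [role for threshold, role in steps if elapsed_minutes >= threshold]


print(engaged_roles("SEV-1", 45))
print(engaged_roles("SEV-2", 45))
```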

  • Downgrade/Upgrade Criteria
    - Downgrade SEV-1 → SEV-2 if impact contained, workaround in place
    - Upgrade SEV-2 → SEV-1 if multiple tenants affected, data integrity risk
    - Document all severity changes with rationale

Code Examples: - Severity classification decision tree - Escalation automation - Communication templates

Diagrams: - Severity levels - Escalation flow - Timeline requirements

Deliverables: - Severity classification guide - SLA requirements - Escalation procedures


Topic 10: SLO Monitoring & Burn Rate Alerts

What will be covered: - ATP Service Level Objectives (SLOs)

Service Availability:
- Target: 99.9% uptime (43.2 min/month downtime budget)
- Measurement: Health check success rate
- Alert: 10% error budget consumed in 1 hour (burn rate)

Ingestion Latency:
- Target: P95 <500ms, P99 <1s
- Measurement: End-to-end (API → append → event published)
- Alert: P95 >1s for 10 minutes

Query Latency:
- Target: P95 <200ms, P99 <500ms
- Measurement: API request duration
- Alert: P95 >500ms for 10 minutes

Projection Lag:
- Target: P95 <5s, P99 <10s
- Measurement: Event timestamp → projection updated
- Alert: P95 >30s for 10 minutes

Data Durability:
- Target: 99.999999999% (11 nines)
- Measurement: Audit records with valid integrity proofs
- Alert: Any integrity verification failure

Tamper Detection:
- Target: 100% detection rate
- Measurement: Hash chain verification success
- Alert: Any hash mismatch
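
The availability alert ("10% error budget consumed in 1 hour") can be restated as a burn rate and an equivalent error-rate threshold. The Python sketch below works through that arithmetic, assuming a 30-day SLO window (consistent with the 43.2 min/month budget). The function names are illustrative.

```python
def burn_rate(budget_fraction_consumed: float, alert_window_hours: float,
              slo_window_hours: float = 30 * 24) -> float:
    """How many times faster than 'exactly on budget' the budget is burning.

    A burn rate of 1 spends the whole budget over the full SLO window.
    """
    return budget_fraction_consumed * slo_window_hours / alert_window_hours


def error_rate_threshold(slo_target: float, rate: float) -> float:
    """Error ratio that, if sustained, produces the given burn rate."""
    return (1 - slo_target) * rate


rate = burn_rate(0.10, alert_window_hours=1)
threshold = error_rate_threshold(0.999, rate)

print(f"burn rate: {rate:.0f}x")
print(f"equivalent error rate over the 1-hour window: {threshold:.1%}")
```

Under these assumptions, consuming 10% of a 30-day budget in one hour corresponds to a 72x burn rate, or roughly a 7.2% failure ratio sustained for that hour.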

  • Error Budget Policy
    - 99.9% SLO = 0.1% error budget = 43.2 min/month
    - 10% budget consumed → Freeze feature releases
    - 25% budget consumed → Emergency freeze, focus on stability
    - 50% budget consumed → Incident declared, all hands
    - 100% budget consumed → Postmortem, process review
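
The budget arithmetic and policy thresholds above can be checked with a few lines of Python. This is a sketch, assuming a 30-day month and that the highest threshold reached determines the action. The names `error_budget_minutes` and `policy_action` are hypothetical.

```python
from typing import Optional


def error_budget_minutes(slo_target: float, days: int = 30) -> float:
    """Allowed downtime in minutes for the given SLO over the period."""
    return (1 - slo_target) * days * 24 * 60


# (fraction of budget consumed, action), highest threshold first.
POLICY = [
    (1.00, "Postmortem, process review"),
    (0.50, "Incident declared, all hands"),
    (0.25, "Emergency freeze, focus on stability"),
    (0.10, "Freeze feature releases"),
]


def policy_action(consumed_minutes: float, budget_minutes: float) -> Optional[str]:
    """Return the action for the highest threshold reached, or None."""
    fraction = consumed_minutes / budget_minutes
    for threshold, action in POLICY:
        if fraction >= threshold:
            return action
    return None


budget = error_budget_minutes(0.999)
print(f"monthly budget: {budget:.1f} min")
print(policy_action(consumed_minutes=18, budget_minutes=budget))
```

For example, a single 18-minute outage consumes about 42% of the monthly budget, which falls in the "Emergency freeze" band under this reading of the policy.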

Code Examples: - SLO definitions (Prometheus recording rules) - Burn rate alerts - Error budget dashboards

Diagrams: - SLO monitoring - Error budget tracking - Burn rate visualization

Deliverables: - SLO definitions - Alert rules - Error budget policies


CYCLE 6: Common Problems & Solutions (~5,000 lines)

Topic 11: Service Health Issues

What will be covered: - Problem: Service Pods CrashLoopBackOff

Symptoms:
- Pods continuously restarting
- kubectl get pods shows "CrashLoopBackOff"
- Service unavailable

Diagnosis:
# Check pod status
kubectl get pods -n atp-ingest-ns

# View pod events
kubectl describe pod <pod-name> -n atp-ingest-ns

# Check logs (current and previous)
kubectl logs <pod-name> -n atp-ingest-ns
kubectl logs <pod-name> -n atp-ingest-ns --previous

Common Causes:
1. Configuration error (missing env var, invalid connection string)
2. Database migration failed
3. Secret not mounted (Key Vault CSI issue)
4. Startup timeout (slow dependency)
5. Application exception on startup

Solutions:
# 1. Check configuration
kubectl get configmap ingestion-config -n atp-ingest-ns -o yaml
kubectl get secret <secret-name> -n atp-ingest-ns

# 2. Check Secret Provider
kubectl get secretproviderclass -n atp-ingest-ns
kubectl describe secretproviderclass atp-kv-secrets -n atp-ingest-ns

# 3. Increase startup timeout
# Edit deployment, increase startupProbe failureThreshold
kubectl edit deployment ingestion -n atp-ingest-ns

# 4. Check database connectivity
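# Run the debug pod in the service namespace so it is subject to the same network policies
kubectl run -it --rm debug -n atp-ingest-ns --image=mcr.microsoft.com/mssql-tools \
    --restart=Never -- /bin/bash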
# Then: sqlcmd -S <server> -U <user> -P <password> -Q "SELECT 1"

# 5. Roll back to previous version
kubectl rollout undo deployment/ingestion -n atp-ingest-ns
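
When many pods are involved, the diagnosis step can be scripted against the JSON form of the pod list (`kubectl get pods -n atp-ingest-ns -o json`). The Python sketch below picks out containers waiting in CrashLoopBackOff. The function name and the sample pod data are hypothetical and only illustrate the fields that are read.

```python
def crashlooping_containers(pod_list: dict) -> list[tuple[str, str, int]]:
    """Return (pod, container, restart count) for containers in CrashLoopBackOff."""
    found = []
    for pod in pod_list.get("items", []):
        pod_name = pod["metadata"]["name"]
        for cs in pod.get("status", {}).get("containerStatuses", []):
            waiting = cs.get("state", {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                found.append((pod_name, cs["name"], cs.get("restartCount", 0)))
    return found


# Hand-written sample in the shape of `kubectl get pods -o json` output,
# trimmed to the fields used above. Pod names are made up.
sample = {
    "items": [
        {
            "metadata": {"name": "ingestion-abc"},
            "status": {"containerStatuses": [
                {"name": "ingestion", "restartCount": 7,
                 "state": {"waiting": {"reason": "CrashLoopBackOff"}}},
            ]},
        },
        {
            "metadata": {"name": "ingestion-def"},
            "status": {"containerStatuses": [
                {"name": "ingestion", "restartCount": 0,
                 "state": {"running": {"startedAt": "2025-10-30T14:00:00Z"}}},
            ]},
        },
    ]
}

for pod, container, restarts in crashlooping_containers(sample):
    print(f"{pod}/{container}: {restarts} restarts")
```

To use it against a live cluster, load the kubectl output with `json.load(sys.stdin)` and pass the result to the function.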

  • Problem: Pods in Pending State

    Symptoms:
    - Pods stuck in "Pending" status
    - Service scaled but new pods not starting
    
    Diagnosis:
    kubectl describe pod <pod-name> -n atp-ingest-ns
    
    # Look for events:
    # - "0/5 nodes are available: 3 Insufficient cpu, 2 node(s) had taint..."
    # - "persistentvolumeclaim not found"
    # - "image pull backoff"
    
    Common Causes:
    1. Insufficient cluster capacity (CPU/memory)
    2. Node taint mismatch (no toleration)
    3. PVC not available
    4. Image pull failure (authentication, not found)
    5. Node selector mismatch
    
    Solutions:
    # 1. Check cluster capacity
    kubectl top nodes
    kubectl describe nodes
    
    # 2. Check if autoscaler is working
    kubectl get nodes --watch
    
    # 3. Check node pool autoscaler limits
    az aks nodepool show --resource-group atp-aks-prod-rg \
        --cluster-name atp-aks-useast-prod --name npgeneric
    
    # 4. Check image pull secrets
    kubectl get secrets -n atp-ingest-ns
    kubectl describe pod <pod-name> -n atp-ingest-ns | grep -A 5 "Events:"
    
    # 5. Temporarily reduce resource requests (if emergency)
    kubectl edit deployment ingestion -n atp-ingest-ns
    # Reduce requests.cpu / requests.memory
    

  • Problem: Service Returns 503 (Service Unavailable)

    Symptoms:
    - API returns 503 errors
    - Health check endpoint fails (/health/ready)
    - Service in load balancer but rejecting traffic
    
    Diagnosis:
    # Check readiness probe
    kubectl get pods -n atp-query-ns
    # Look for: READY column showing "0/1" or "0/2" (sidecar)
    
    kubectl logs <pod-name> -n atp-query-ns
    # Search for: "Readiness check failed"
    
    # Check dependencies
    curl https://query.atp.internal/health/ready
    # Response shows which dependency failed
    
    Common Causes:
    1. Database connection pool exhausted
    2. Redis cache unreachable
    3. Projection lag exceeds threshold (ready check fails)
    4. Service Bus subscription paused
    5. Dependency service down
    
    Solutions:
    # 1. Check database connections
    # - View metrics: "Database connection pool usage"
    # - If exhausted, scale database or increase pool size
    
    # 2. Restart pods (if transient)
    kubectl rollout restart deployment/query -n atp-query-ns
    
    # 3. Scale up if overwhelmed
    kubectl scale deployment/query -n atp-query-ns --replicas=10
    
    # 4. Check dependency health
    kubectl get pods -n atp-projection-ns  # If query depends on projections
    
    # 5. Bypass ready check temporarily (emergency only)
    kubectl edit deployment query -n atp-query-ns
    # Comment out readinessProbe (DO NOT DO THIS IN PRODUCTION without approval)
    

Code Examples: - Complete troubleshooting procedures - Diagnostic commands - Resolution scripts

Diagrams: - Problem diagnosis flow - Resolution decision tree

Deliverables: - Problem catalog - Diagnostic procedures - Solution library


Topic 12: Database Problems

What will be covered: - Problem: Database Connection Failures - Problem: Slow Queries - Problem: Deadlocks - Problem: Connection Pool Exhaustion - Problem: Migration Failures

Code Examples: - Database troubleshooting - Query optimization - Connection management

Deliverables: - Database operations guide


CYCLE 7: Service-Specific Troubleshooting (~5,000 lines)

Topic 13: Ingestion Service Issues

What will be covered: - High Ingestion Latency

Symptoms:
- P95 latency >500ms (SLO: <500ms)
- Slow API responses
- Queue backlog growing

Diagnosis:
# Check metrics
# - Ingestion rate (events/sec)
# - Database write latency
# - Outbox relay lag
# - CPU/memory usage

# Check logs
kubectl logs -f deployment/ingestion -n atp-ingest-ns | grep "WARN\|ERROR"

# Check Application Insights
az monitor app-insights query \
    --app atp-appinsights \
    --analytics-query "
      requests
      | where cloud_RoleName == 'Ingestion'
      | where timestamp > ago(1h)
      | summarize P95=percentile(duration, 95), P99=percentile(duration, 99) by bin(timestamp, 5m)
      | render timechart
    "

Common Causes:
1. Database bottleneck (DTU/RU exhaustion)
2. Classification service slow (policy evaluation)
3. Outbox table growing (relay worker slow)
4. Large batch ingestion (single tenant spike)
5. CPU/memory limits hit

Solutions:
# 1. Scale ingestion pods
kubectl scale deployment/ingestion -n atp-ingest-ns --replicas=10

# 2. Check database performance
# - View DTU usage in Azure Portal
# - If high, scale up database tier

# 3. Check outbox relay worker
kubectl logs -f deployment/outbox-relay -n atp-ingest-ns

# 4. Implement rate limiting (if single tenant spike)
# - Add per-tenant rate limit policy

# 5. Scale database (if sustained load)
az sql db update --resource-group atp-prod-rg \
    --server atp-sql-prod --name atp-db \
    --service-objective P4  # Scale to higher tier
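
The symptom threshold above (P95 latency over 500ms) can also be checked against a raw list of request durations, for example values exported from logs. The following is a minimal sketch, assuming one latency value in milliseconds per line; the `percentile` helper is a hypothetical name and uses the nearest-rank method, which may differ slightly from the estimator Application Insights uses.

```shell
# Nearest-rank percentile: reads one latency value (ms) per line on stdin
# and prints the value at rank ceil(p/100 * N) in ascending order.
percentile() {
    sort -n | awk -v p="$1" '
        { v[NR] = $1 }
        END {
            if (NR == 0) exit 1
            rank = int((p * NR + 99) / 100)   # integer ceil(p * NR / 100)
            if (rank < 1) rank = 1
            print v[rank]
        }'
}

# Example: four sample latencies, compared against the 500ms SLO.
p95=$(printf '%s\n' 120 480 510 90 | percentile 95)
if [ "$p95" -gt 500 ]; then
    echo "P95 ${p95}ms breaches the 500ms SLO"
fi
```

With the four sample values the example prints `P95 510ms breaches the 500ms SLO`.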

  • Ingestion Validation Failures

    Symptoms:
    - 400 Bad Request errors
    - Schema validation failures in logs
    - Rejected events
    
    Diagnosis:
    # Check validation error logs
    kubectl logs deployment/ingestion -n atp-ingest-ns \
        | grep "ValidationException"
    
    # Sample error:
    # "ValidationException: Required field 'actor.id' missing in event"
    
    Common Causes:
    1. Client sending invalid schema
    2. Schema version mismatch
    3. Required field missing
    4. Data type mismatch
    
    Solutions:
    # 1. Review schema documentation
    # - Check OpenAPI spec: /api/v1/swagger
    
    # 2. Provide client with correct schema
    # - Send link to contract documentation
    
    # 3. Add schema migration (if legitimate change)
    # - Update schema validator to accept old + new formats
    
    # 4. Check for API version mismatch
    # - Verify client using correct API version
    

  • Outbox Backlog Growing

  • Idempotency Conflicts
  • Policy Evaluation Timeouts

Code Examples: - Service-specific diagnostics (all ATP services) - Resolution procedures - Common fix scripts

Diagrams: - Service troubleshooting flow - Component dependencies

Deliverables: - Service troubleshooting guide (8 services) - Diagnostic procedures - Fix library


Topic 14: Query Service Issues

What will be covered: - Slow Query Performance - Projection Lag High - Cache Miss Rate High - Search Index Unavailable - Cross-Tenant Data Leakage (Security)

Code Examples: - Query optimization procedures - Cache troubleshooting - Security validation

Deliverables: - Query service operations guide


CYCLE 8: Database Operations (~3,500 lines)

Topic 15: Database Health Monitoring

What will be covered: - Azure SQL Database Metrics - Connection Pool Management - Query Performance Monitoring - Database Deadlocks - Index Fragmentation

Code Examples: - Database health queries - Performance diagnostics - Index maintenance scripts

Deliverables: - Database operations guide


Topic 16: Database Emergency Procedures

What will be covered: - Database Failover - Connection String Rotation - Emergency Scaling - Backup Restoration

Code Examples: - Failover procedures - Emergency scripts

Deliverables: - Emergency database procedures


CYCLE 9: Messaging & Event Bus Operations (~4,000 lines)

Topic 17: Service Bus Health

What will be covered: - Azure Service Bus Monitoring

# Check queue/topic depth
az servicebus queue show \
    --resource-group atp-prod-rg \
    --namespace-name sb-atp-prod \
    --name projection-queue \
    --query "countDetails"

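# Check dead-letter count (reported in the parent queue's countDetails;
# an unquoted $DeadLetterQueue would also be expanded by the shell as an empty variable)
az servicebus queue show \
    --resource-group atp-prod-rg \
    --namespace-name sb-atp-prod \
    --name projection-queue \
    --query "countDetails.deadLetterMessageCount"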

# List active subscriptions
az servicebus topic subscription list \
    --resource-group atp-prod-rg \
    --namespace-name sb-atp-prod \
    --topic-name audit.appended.v1

  • Message Backlog Handling
  • Consumer Lag Monitoring
  • Topic/Queue Throttling
  • Connection Issues

Code Examples: - Service Bus diagnostics - Backlog management - Throttling mitigation

Diagrams: - Message flow monitoring - Backlog handling

Deliverables: - Messaging operations guide - Troubleshooting procedures


Topic 18: Event Flow Troubleshooting

What will be covered: - Event Not Published (Outbox Stuck) - Event Not Received (Consumer Down) - Duplicate Events (Idempotency) - Event Ordering Issues - Schema Version Mismatch

Code Examples: - Event flow diagnostics - Replay procedures

Deliverables: - Event troubleshooting guide


CYCLE 10: DLQ Management & Replay (~4,000 lines)

Topic 19: Dead Letter Queue (DLQ) Triage

What will be covered: - DLQ Workflow

flowchart TD
    A[Message in DLQ] --> B[Inspect Message]
    B --> C{Classify Failure}
    C -->|Schema Error| D[Fix Schema Mapping]
    C -->|Auth/Permission| E[Fix Credentials]
    C -->|Transient Error| F[Immediate Replay]
    C -->|Business Logic| G[Fix Code Bug]
    C -->|Malicious/Invalid| H[Quarantine]
    D --> I[Test in Sandbox]
    E --> I
    F --> J[Replay to Topic]
    G --> K[Deploy Fix]
    K --> I
    I --> J
    H --> L[Document & Skip]
    J --> M[Monitor Success]
    L --> M

  • DLQ Inspection Commands

    # Count DLQ messages (Azure CLI; the count is reported on the parent queue)
    az servicebus queue show \
        --resource-group atp-prod-rg \
        --namespace-name sb-atp-prod \
        --name projection-queue \
        --query "countDetails.deadLetterMessageCount"
    
    # Peek messages (first 10)
    # The Azure CLI covers only management-plane operations for Service Bus,
    # so it cannot peek or receive message bodies. Use Service Bus Explorer in
    # the Azure Portal (dead-letter sub-queue, Peek mode) or an SDK client
    # bound to the path 'projection-queue/$DeadLetterQueue' (keep the single
    # quotes so the shell does not expand $DeadLetterQueue).
    
    # Receive message (for inspection)
    # Receive-and-delete removes the message from the DLQ. Prefer peek, or
    # peek-lock followed by abandon, so the message stays available for replay.
    
    # Or use ATP admin CLI
    atp-admin dlq list --consumer projection-worker --limit 50
    atp-admin dlq inspect --consumer projection-worker --message-id <id>
    

  • DLQ Classification

    Failure Reason: DeliveryCountExceeded (reported by Service Bus as MaxDeliveryCountExceeded)
    - Message failed max retries (default: 10)
    - Indicates: Persistent handler failure
    - Action: Fix code/config, then replay
    
    Failure Reason: TTLExpiredException
    - Message exceeded time-to-live
    - Indicates: Long queue backlog, slow processing
    - Action: Increase TTL or scale consumers
    
    Failure Reason: MessageLockLostException
    - Processing took longer than lock duration
    - Indicates: Slow handler, long transactions
    - Action: Optimize handler, increase lock duration
    
    Failure Reason: UnauthorizedException
    - Consumer lacks permission
    - Indicates: RBAC/credential issue
    - Action: Fix managed identity permissions
    
    Failure Reason: SerializationException
    - Cannot deserialize message
    - Indicates: Schema version mismatch
    - Action: Add schema migration, replay
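
The classification list above maps naturally onto a lookup that triage tooling could call per message. The following is a minimal sketch, assuming the reason string has already been extracted from the dead-lettered message; the function name and the action labels are hypothetical and are not part of `atp-admin`.

```shell
# Map a dead-letter reason to the triage action from the classification list.
# Both DeliveryCountExceeded (as written in this runbook) and
# MaxDeliveryCountExceeded (the string Service Bus itself sets) are accepted.
classify_dlq_reason() {
    case "$1" in
        DeliveryCountExceeded|MaxDeliveryCountExceeded)
            echo "fix-then-replay" ;;
        TTLExpiredException)
            echo "increase-ttl-or-scale-consumers" ;;
        MessageLockLostException)
            echo "optimize-handler-or-extend-lock" ;;
        UnauthorizedException)
            echo "fix-identity-permissions" ;;
        SerializationException)
            echo "add-schema-migration-then-replay" ;;
        *)
            # Anything unrecognized goes to a human.
            echo "manual-triage" ;;
    esac
}
```

For example, `classify_dlq_reason TTLExpiredException` prints `increase-ttl-or-scale-consumers`. Falling back to `manual-triage` keeps unknown reasons out of any automated replay path.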
    

  • Safe Replay Procedure

    # 1. Identify root cause and fix
    # - Deploy code fix
    # - OR update configuration
    # - OR fix credentials
    
    # 2. Test in sandbox (non-prod)
    atp-admin dlq replay \
        --consumer projection-worker \
        --message-id <id> \
        --environment dev \
        --dry-run
    
    # 3. Replay to production (small batch first)
    atp-admin dlq replay \
        --consumer projection-worker \
        --filter "reason=DeliveryCountExceeded" \
        --max-count 10 \
        --confirm
    
    # 4. Monitor for success
    # - Check projection lag returns to normal
    # - No new DLQ entries
    # - Health checks pass
    
    # 5. Replay remaining messages
    atp-admin dlq replay \
        --consumer projection-worker \
        --filter "reason=DeliveryCountExceeded" \
        --max-count 1000 \
        --confirm
    
    # 6. Document in incident ticket
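
The decision between the small batch (step 3) and the full replay (step 5) can be made explicit with a simple gate. The following is a sketch only: the function name, its arguments, and the failure threshold are assumptions, and the replayed and failed counts would have to come from whatever monitoring is used in step 4.

```shell
# Decide whether to continue from the small batch to the full replay.
# Arguments: replayed count, failed count, maximum tolerated failure
# percentage (defaults to 0, meaning any failure halts the replay).
replay_gate() {
    replayed=$1; failed=$2; max_fail_pct=${3:-0}
    if [ "$replayed" -le 0 ]; then
        echo "halt: nothing replayed"
        return 1
    fi
    # Integer comparison of failed/replayed against max_fail_pct/100.
    if [ $((failed * 100)) -gt $((max_fail_pct * replayed)) ]; then
        echo "halt: $failed of $replayed replays failed"
        return 1
    fi
    echo "proceed"
}
```

For example, `replay_gate 10 0` prints `proceed`, while `replay_gate 10 1 5` prints `halt: 1 of 10 replays failed` and returns a non-zero status, so it can guard the larger batch with `replay_gate ... && atp-admin dlq replay ...`.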
    

Code Examples: - Complete DLQ management procedures - Classification logic - Replay automation

Diagrams: - DLQ triage workflow - Replay safety checks

Deliverables: - DLQ management guide - Replay procedures - Classification rules


Topic 20: Message Replay Scenarios

What will be covered: - Replay from Outbox (Re-publish failed events) - Replay from Event Store (Rebuild projections) - Selective Replay (Specific tenant/time range) - Dry-Run Replay (Test without applying)

Code Examples: - Replay scenarios - Safety procedures

Deliverables: - Replay cookbook


CYCLE 11: Deployment Procedures (~3,500 lines)

Topic 21: Standard Deployment

What will be covered: - Pre-Deployment Checklist

☐ All tests passed in CI/CD pipeline
☐ Code review approved (2+ reviewers)
☐ Security scan passed (no critical vulnerabilities)
☐ Performance testing completed
☐ Database migrations reviewed (if any)
☐ Configuration changes documented
☐ Rollback plan prepared
☐ Change Advisory Board (CAB) approved (for production)
☐ Stakeholders notified (if customer-facing change)
☐ Monitoring dashboards ready
☐ On-call engineer briefed

  • Deployment Steps (Helm)

    # 1. Verify current state
    helm list -n atp-system
    helm status atp -n atp-system
    
    # 2. Backup current configuration
    helm get values atp -n atp-system > backup-values-$(date +%Y%m%d-%H%M%S).yaml
    
    # 3. Review changes
    helm diff upgrade atp ./charts/atp \
        --namespace atp-system \
        --values values.prod.yaml \
        --values values.us.yaml
    
    # 4. Deploy with canary (progressive rollout)
    helm upgrade atp ./charts/atp \
        --namespace atp-system \
        --values values.prod.yaml \
        --values values.us.yaml \
        --set image.tag=1.2.4 \
        --set canary.enabled=true \
        --set canary.weight=5 \
        --wait --timeout 10m
    
    # 5. Monitor canary (15 minutes)
    # - Watch metrics dashboard
    # - Check error rate, latency
    # - Review logs for exceptions
    
    # 6. Increase canary weight (if healthy)
    helm upgrade atp ./charts/atp \
        --namespace atp-system \
        --reuse-values \
        --set canary.weight=20 \
        --wait
    
    # Monitor... then 50%... then 100%
    
    # 7. Promote canary to stable
    helm upgrade atp ./charts/atp \
        --namespace atp-system \
        --reuse-values \
        --set canary.enabled=false \
        --wait
    
    # 8. Post-deployment validation
    # - Run smoke tests
    # - Check all health endpoints
    # - Verify key metrics stable
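
The progressive rollout in steps 4 to 7 follows a fixed weight progression (5, 20, 50, 100, then promote). A small helper can make that progression explicit so a wrapper script never skips a stage. This is a sketch under that assumption; `next_canary_weight` is a hypothetical name and is not part of Helm or the ATP chart.

```shell
# Return the next canary weight in the 5 -> 20 -> 50 -> 100 progression.
# Prints "promote" once the canary already carries 100% of traffic.
next_canary_weight() {
    case "$1" in
        5)   echo 20 ;;
        20)  echo 50 ;;
        50)  echo 100 ;;
        100) echo "promote" ;;
        *)   echo "unknown weight: $1" >&2; return 1 ;;
    esac
}
```

For example, `next_canary_weight 20` prints `50`, which can be passed to `--set canary.weight=...` after the monitoring period for the current stage has passed cleanly.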
    

  • Deployment with FluxCD (GitOps)

    # 1. Update Git repository
    git checkout -b release/v1.2.4
    
    # 2. Update image tag in values
    sed -i 's/tag: 1\.2\.3/tag: 1.2.4/' charts/atp/values.prod.yaml
    
    # 3. Commit and push
    git add charts/atp/values.prod.yaml
    git commit -m "Release v1.2.4 to production"
    git push origin release/v1.2.4
    
    # 4. Create PR and get approval
    # - PR review by 2+ engineers
    # - Automated checks pass
    
    # 5. Merge to main
    git checkout main
    git merge release/v1.2.4
    git push origin main
    
    # 6. FluxCD automatically reconciles (within 1 minute)
    flux get kustomizations
    flux get helmreleases -n atp-system
    
    # 7. Monitor deployment
    kubectl get pods -n atp-ingest-ns --watch
    flux logs --follow
    

Code Examples: - Complete deployment procedures - Validation scripts - Smoke tests

Diagrams: - Deployment workflow - Progressive rollout stages

Deliverables: - Deployment procedures - Validation checklists - Smoke test suite


Topic 22: Emergency Deployments (Hotfix)

What will be covered: - Hotfix Criteria - Expedited Approval Process - Fast-Track Deployment - Post-Hotfix Validation

Code Examples: - Hotfix procedures

Deliverables: - Hotfix guide


CYCLE 12: Rollback Procedures (~3,000 lines)

Topic 23: Rollback Decision Making

What will be covered: - When to Rollback - Rollback vs. Fix Forward - Impact Assessment - Approval Requirements

Code Examples: - Decision matrix

Deliverables: - Rollback decision guide


Topic 24: Rollback Execution

What will be covered: - Helm Rollback

# 1. Check release history
helm history atp -n atp-system

# Output:
# REVISION  UPDATED                   STATUS      CHART       DESCRIPTION
# 1         Tue Oct 28 10:00:00 2025  superseded  atp-1.2.3   Install complete
# 2         Wed Oct 29 14:30:00 2025  superseded  atp-1.2.4   Upgrade complete
# 3         Thu Oct 30 09:15:00 2025  deployed    atp-1.2.5   Upgrade complete

# 2. Rollback to previous version (revision 2)
helm rollback atp 2 -n atp-system --wait --timeout 5m

# 3. Verify rollback
helm list -n atp-system
kubectl get pods -n atp-ingest-ns

# 4. Check health (curl -f returns a non-zero exit code on HTTP errors)
kubectl rollout status deployment/ingestion -n atp-ingest-ns
curl -fsS https://api.atp.example.com/health

# 5. Monitor metrics (15 minutes)
# - Error rate should drop
# - Latency should normalize
# - No new alerts

  • Kubernetes Rollback

    # Rollback deployment
    kubectl rollout undo deployment/ingestion -n atp-ingest-ns
    
    # Rollback to specific revision
    kubectl rollout history deployment/ingestion -n atp-ingest-ns
    kubectl rollout undo deployment/ingestion -n atp-ingest-ns --to-revision=5
    
    # Watch rollback progress
    kubectl rollout status deployment/ingestion -n atp-ingest-ns
    

  • Database Migration Rollback

    # Rollback FluentMigrator migration
    dotnet run -- --task migrate:rollback --version 20251029143000
    
    # OR, if migrations are managed with golang-migrate, step down one migration
    migrate -database "sqlserver://..." -path ./migrations down 1
    

Code Examples: - Complete rollback procedures (all scenarios) - Verification steps - Communication templates

Diagrams: - Rollback workflow - Verification steps

Deliverables: - Rollback procedures - Verification guide - Communication templates


CYCLE 13: Configuration Changes (~3,000 lines)

Topic 25: Safe Configuration Updates

What will be covered: - ConfigMap Updates

# 1. Backup current ConfigMap
kubectl get configmap ingestion-config -n atp-ingest-ns -o yaml \
    > backup-ingestion-config-$(date +%Y%m%d-%H%M%S).yaml

# 2. Edit ConfigMap
kubectl edit configmap ingestion-config -n atp-ingest-ns

# 3. Restart pods to pick up changes
kubectl rollout restart deployment/ingestion -n atp-ingest-ns

# 4. Monitor for issues
kubectl logs -f deployment/ingestion -n atp-ingest-ns

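# 5. Rollback if issues (use the backup file created in step 1; if the apply is
#    rejected with a conflict, remove metadata.resourceVersion from the backup)
kubectl apply -f backup-ingestion-config-20251030-143000.yaml
kubectl rollout restart deployment/ingestion -n atp-ingest-ns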

  • Feature Flag Changes
  • Rate Limit Adjustments
  • Logging Level Changes

Code Examples: - Configuration update procedures - Validation scripts

Deliverables: - Configuration change guide


Topic 26: Azure App Configuration Updates

What will be covered: - Updating Feature Flags - Configuration Refresh - Rollback Configuration - Configuration Audit Trail

Code Examples: - App Configuration procedures

Deliverables: - App Configuration guide


CYCLE 14: Key Rotation & Secret Management (~3,500 lines)

Topic 27: Routine Key Rotation

What will be covered: - Rotation Schedule

Monthly Rotation:
- Database passwords (connection strings)
- Service Bus connection strings
- Redis authentication keys

Quarterly Rotation:
- API keys (third-party integrations)
- Webhook HMAC secrets
- Application secrets

Annual Rotation:
- TLS certificates (if not auto-renewed)
- Root signing keys (with dual-key overlap)

On-Demand Rotation:
- Security incident (immediate)
- Employee departure (within 24 hours)
- Suspected compromise (immediate)
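
The schedule above can be turned into a check that flags secrets whose age has reached their rotation interval. The following is a minimal sketch: the function names are hypothetical, the 30/90/365-day values are assumed approximations of "monthly", "quarterly", and "annual", and the secret age in days is expected to come from elsewhere (for example, the secret's last-updated timestamp in Key Vault).

```shell
# Maximum secret age in days per rotation class (assumed values).
rotation_interval_days() {
    case "$1" in
        monthly)   echo 30 ;;
        quarterly) echo 90 ;;
        annual)    echo 365 ;;
        *)         echo "unknown rotation class: $1" >&2; return 1 ;;
    esac
}

# Print "due" when a secret of the given class is at least as old as its
# interval, otherwise "ok (N days left)".
rotation_status() {
    class=$1; age_days=$2
    limit=$(rotation_interval_days "$class") || return 1
    if [ "$age_days" -ge "$limit" ]; then
        echo "due"
    else
        echo "ok ($((limit - age_days)) days left)"
    fi
}
```

For example, `rotation_status monthly 31` prints `due` and `rotation_status quarterly 45` prints `ok (45 days left)`. On-demand rotation triggers (incident, departure, suspected compromise) are deliberately outside this check because they are not age-based.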

  • Database Password Rotation

    # 1. Generate new password
    NEW_PASSWORD=$(openssl rand -base64 32)
    
    # 2. Add new password to Key Vault
    az keyvault secret set \
        --vault-name atp-kv-prod \
        --name DatabasePassword-New \
        --value "$NEW_PASSWORD"
    
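    # 3. Update the SQL server admin login with the new password
    #    (for an application-level SQL login, run ALTER LOGIN ... WITH PASSWORD in T-SQL instead)
    az sql server update \
        --resource-group atp-prod-rg \
        --name atp-sql-prod \
        --admin-password "$NEW_PASSWORD"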
    
    # 4. Update SecretProviderClass to use new secret
    kubectl edit secretproviderclass atp-kv-secrets -n atp-ingest-ns
    # Change: DatabasePassword → DatabasePassword-New
    
    # 5. Restart pods to mount new secret
    kubectl rollout restart deployment/ingestion -n atp-ingest-ns
    
    # 6. Wait for pods to be healthy
    kubectl get pods -n atp-ingest-ns --watch
    
    # 7. Verify connectivity
    kubectl logs deployment/ingestion -n atp-ingest-ns | grep "Database connection"
    
    # 8. Delete old secret (after 24 hour overlap)
    az keyvault secret delete \
        --vault-name atp-kv-prod \
        --name DatabasePassword
    
    # 9. Document rotation in audit log
    

  • Signing Key Rotation (Zero-Downtime)

  • Certificate Rotation
  • Emergency Rotation (Suspected Compromise)

Code Examples: - Complete rotation procedures (all secret types) - Zero-downtime patterns - Emergency protocols

Diagrams: - Rotation workflow - Dual-key overlap - Emergency rotation

Deliverables: - Key rotation procedures (all types) - Emergency rotation guide - Audit procedures


Topic 28: Secret Compromise Response

What will be covered: - Detection - Containment - Rotation - Investigation - Prevention

Code Examples: - Compromise response procedures

Deliverables: - Security incident guide


CYCLE 15: Scaling & Capacity Management (~3,500 lines)

Topic 29: Manual Scaling

What will be covered: - Scale Pods

# Scale deployment manually
kubectl scale deployment/query -n atp-query-ns --replicas=20

# Verify scaling
kubectl get pods -n atp-query-ns --watch

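# Check if an HPA will override the manual replica count
kubectl get hpa -n atp-query-ns
# Prefer raising the HPA floor (must not exceed maxReplicas) over deleting the HPA
kubectl patch hpa query-hpa -n atp-query-ns -p '{"spec":{"minReplicas":20}}'
# Last resort: remove the HPA temporarily and re-apply it afterwards
# kubectl delete hpa query-hpa -n atp-query-ns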

  • Scale Nodes (AKS)

    # Scale node pool manually
    az aks nodepool scale \
        --resource-group atp-aks-prod-rg \
        --cluster-name atp-aks-useast-prod \
        --name npgeneric \
        --node-count 20
    
    # Verify nodes
    kubectl get nodes
    

  • Scale Database

    # Scale Azure SQL Database
    az sql db update \
        --resource-group atp-prod-rg \
        --server atp-sql-prod \
        --name atp-db \
        --service-objective P6  # Scale up
    
    # Monitor scaling progress
    az sql db show \
        --resource-group atp-prod-rg \
        --server atp-sql-prod \
        --name atp-db \
        --query "status"
    

Code Examples: - Manual scaling procedures - Verification steps

Deliverables: - Scaling procedures


Topic 30: Capacity Planning

What will be covered: - Capacity Metrics - Growth Forecasting - Resource Planning - Cost Optimization

Code Examples: - Capacity analysis - Forecasting models

Deliverables: - Capacity planning guide


CYCLE 16: Performance Troubleshooting (~4,000 lines)

Topic 31: Performance Investigation

What will be covered: - Identifying Performance Bottlenecks - Database Query Optimization - Memory Leak Detection - CPU Profiling - Network Latency Analysis

Code Examples: - Performance diagnostics - Profiling tools

Deliverables: - Performance troubleshooting guide


Topic 32: Performance Optimization

What will be covered: - Cache Tuning - Connection Pool Optimization - Query Optimization - Resource Limit Tuning

Code Examples: - Optimization procedures

Deliverables: - Optimization cookbook


CYCLE 17: Security Incident Response (~4,500 lines)

Topic 33: Security Breach Response

What will be covered: - Breach Detection

Security Alert Types:
- Unauthorized access attempt
- Privilege escalation
- Data exfiltration
- Tamper detection
- DDoS attack
- Credential compromise

  • Immediate Actions (SIEM Alert)

    ⚠️ SECURITY INCIDENT - IMMEDIATE ACTIONS
    
    Within 15 minutes:
    ☐ Page Security team
    ☐ Create dedicated war room: #security-incident-YYYYMMDD
    ☐ Freeze all deployments (emergency freeze)
    ☐ Capture evidence (logs, metrics, network traffic)
    ☐ Isolate affected systems (network policies)
    ☐ Revoke compromised credentials
    ☐ Notify CISO and Legal
    ☐ Begin forensic investigation
    
    DO NOT:
    ❌ Delete logs or evidence
    ❌ Restart services (a restart destroys the in-memory state needed for memory dumps)
    ❌ Notify customers (until Legal/Communications approval)
    ❌ Share details publicly
    

  • Tamper Detection Response

    # Tamper Alert Fired
    # 1. Freeze purge/export operations
    atp-admin integrity freeze --reason "tamper-investigation"
    
    # 2. Retrieve integrity proofs
    atp-admin integrity verify \
        --segment-id <segment-id> \
        --include-proof \
        --output tamper-evidence-$(date +%Y%m%d-%H%M%S).json
    
    # 3. Offline verification
    # - Download proof bundle
    # - Verify Merkle root
    # - Verify digital signature
    # - Compare with stored record
    
    # 4. If tamper confirmed
    # - Escalate to SEV-1
    # - Engage Security + Compliance
    # - Preserve all evidence
    # - Notify affected customers (Legal approval)
    
    # 5. If false positive
    # - Document root cause
    # - Tune detection thresholds
    # - Unfreeze operations
    

  • Data Exfiltration Response

  • Credential Compromise Response
  • DDoS Mitigation

Code Examples: - Complete security procedures - Forensic collection - Containment automation

Diagrams: - Security incident flow - Containment procedures

Deliverables: - Security incident guide - Forensic procedures - Containment playbooks


Topic 34: Post-Breach Recovery

What will be covered: - System Hardening - Credential Rotation (All) - Audit Trail Review - Customer Notification

Code Examples: - Recovery procedures

Deliverables: - Recovery guide


CYCLE 18: Data Recovery & Backups (~3,500 lines)

Topic 35: Backup Procedures

What will be covered: - Database Backups - Blob Storage Backups - Configuration Backups - Backup Verification

Code Examples: - Backup automation

Deliverables: - Backup procedures


Topic 36: Restore Procedures

What will be covered: - Point-in-Time Recovery - Disaster Recovery - Cross-Region Restore - Data Validation After Restore

Code Examples: - Restore procedures

Deliverables: - Recovery guide


CYCLE 19: Compliance & Audit Procedures (~3,000 lines)

Topic 37: Compliance Operations

What will be covered: - Legal Hold Procedures - Data Retention Validation - Data Residency Verification - Right-to-Erasure (GDPR Article 17) - Compliance Audit Support

Code Examples: - Compliance procedures

Deliverables: - Compliance operations guide


Topic 38: Audit Trail Validation

What will be covered: - Integrity Verification - Chain Validation - Export for Auditors - Compliance Reporting

Code Examples: - Validation procedures

Deliverables: - Audit validation guide


CYCLE 20: Maintenance Windows (~3,000 lines)

Topic 39: Scheduled Maintenance

What will be covered: - Maintenance Planning - Customer Communication - Maintenance Execution - Post-Maintenance Validation

Code Examples: - Maintenance procedures

Deliverables: - Maintenance guide


Topic 40: Zero-Downtime Maintenance

What will be covered: - Rolling Node Updates - Database Maintenance - Index Optimization - Cache Maintenance

Code Examples: - Zero-downtime procedures

Deliverables: - Maintenance cookbook


CYCLE 21: Emergency Procedures (~4,000 lines)

Topic 41: Emergency Response

What will be covered: - Region Outage (Multi-Region Failover)

🚨 EMERGENCY: COMPLETE REGION OUTAGE

Scenario: US-East region completely unavailable

Immediate Actions (within 30 minutes):
☐ Declare SEV-1 incident
☐ Page all on-call (primary, secondary, manager)
☐ Activate incident command
☐ Assess scope (all regions? single region?)
☐ Notify Customer Success
☐ Initiate failover to secondary region

Failover Procedure:
1. Update Azure Front Door backend pool
   - Remove failed region from rotation
   - Route all traffic to healthy region(s)

2. Verify secondary region capacity
   - Check pod counts, node counts
   - Scale up if needed

3. Monitor secondary region closely
   - Increased load may cause cascading issues
   - Watch error rates, latency, resource usage

4. Communicate to customers
   - Post on status page
   - Email affected customers
   - Update ETA every hour

5. Investigation (parallel track)
   - Azure Service Health
   - Azure Support ticket
   - Internal root cause analysis

6. Recovery
   - When primary region restored
   - Gradual traffic shift back (20% → 50% → 100%)
   - Monitor for residual issues

7. Postmortem
   - Document timeline
   - Identify improvements
   - Update DR procedures

  • Complete Data Loss Scenario

    🚨 EMERGENCY: DATA LOSS SUSPECTED
    
    DO NOT PANIC. ATP has multiple protection layers.
    
    Immediate Actions:
    ☐ Freeze ALL write operations (emergency maintenance mode)
    ☐ Page Security + Compliance teams
    ☐ Preserve all evidence (logs, snapshots, backups)
    ☐ Begin forensic investigation
    
    Investigation:
    1. Verify integrity proofs
       - Check hash chains
       - Verify Merkle trees
       - Validate digital signatures
    
    2. Check all backup sources
       - Azure SQL point-in-time restore points
       - Blob Storage WORM containers
       - Geo-replicated backups
    
    3. Assess scope
       - Which tenants affected?
       - Which time range?
       - What data types?
    
    4. Recovery
       - Restore from latest valid backup
       - Verify integrity of restored data
       - Replay events if needed
    
    5. Customer Notification
       - Legal approval required
       - Prepare incident report
       - Offer credit/compensation
    
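    6. Regulatory Notification
       - GDPR: 72 hours to the supervisory authority after becoming aware of a personal data breach
       - HIPAA: without unreasonable delay, no later than 60 days after discovery
       - SOC 2: no statutory deadline; follow contractual notification commitments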
    

  • Security Breach Emergency

  • Compliance Violation Emergency
  • Cascade Failure

Code Examples: - Emergency procedures (all scenarios) - Failover automation - Communication templates

Diagrams: - Emergency response flow - Failover procedures - Recovery timeline

Deliverables: - Emergency procedure manual - Failover guide - Crisis communication templates


Topic 42: Business Continuity

What will be covered: - Disaster Recovery Drills - Failover Testing - RTO/RPO Validation - Recovery Verification

Code Examples: - DR drill procedures

Deliverables: - Business continuity guide


CYCLE 22: Contacts, Escalation & Knowledge Base (~3,000 lines)

Topic 43: Contact Directory

What will be covered: - On-Call Schedules

Primary On-Call:
- PagerDuty: https://connectsoft.pagerduty.com/schedules#ATP-PRIMARY
- Rotation: Weekly (Monday 9 AM → Monday 9 AM)
- Coverage: 24/7
- Response: <5 min (SEV-1), <15 min (SEV-2)

Secondary On-Call:
- PagerDuty: https://connectsoft.pagerduty.com/schedules#ATP-SECONDARY
- Escalation: After 30 min (SEV-1), 1 hour (SEV-2)

Manager On-Call:
- Escalation: After 1 hour (SEV-1), 2 hours (SEV-2)
- Major decisions, customer communication

Subject Matter Experts (SMEs):
- Database: @db-team (Slack: #atp-database)
- Security: @security-team (Slack: #atp-security)
- Compliance: @compliance-team (Email: compliance@)
- Networking: @network-team (Slack: #atp-networking)
- Azure Infrastructure: @cloud-team (Slack: #atp-cloud)

  • Escalation Matrix

    | Level | Role | Escalate After | Contact Method |
    |-------|------|----------------|----------------|
    | L1 | Platform Engineer | N/A | PagerDuty auto-page |
    | L2 | Senior SRE | 30 min (SEV-1), 1h (SEV-2) | PagerDuty escalation |
    | L3 | Engineering Manager | 1h (SEV-1), 2h (SEV-2) | PagerDuty + Phone |
    | L4 | VP Engineering | 2h (SEV-1), Major outage | Phone + Email |
    | L5 | CTO | Customer-impacting, 4h+ | Phone + Email |
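
The time-based rows of the escalation matrix above can be expressed as a lookup, which is useful for escalation automation or for checking during a postmortem whether escalation happened on time. This is a sketch only: `escalation_level` is a hypothetical name, "escalate after N minutes" is interpreted as "at or beyond N minutes", and the judgment-based triggers (L4 for a major outage, L5 for customer impact) are intentionally left out.

```shell
# Return the highest escalation level that should be engaged for a given
# severity and number of minutes elapsed since the initial page.
escalation_level() {
    sev=$1; minutes=$2
    case "$sev" in
        SEV-1) l2=30; l3=60;  l4=120 ;;
        SEV-2) l2=60; l3=120; l4=-1 ;;   # no time-based L4 trigger for SEV-2
        *)     echo "unsupported severity: $sev" >&2; return 1 ;;
    esac
    if [ "$l4" -ge 0 ] && [ "$minutes" -ge "$l4" ]; then
        echo L4
    elif [ "$minutes" -ge "$l3" ]; then
        echo L3
    elif [ "$minutes" -ge "$l2" ]; then
        echo L2
    else
        echo L1
    fi
}
```

For example, `escalation_level SEV-1 45` prints `L2` and `escalation_level SEV-2 45` prints `L1`.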

  • External Contacts

    Microsoft Azure Support:
    - Portal: https://portal.azure.com -> Support + troubleshooting
    - Phone: 1-800-867-1389
    - Priority: Severity A (critical)
    
    Customers (Major):
    - Acme Corp: success@acmecorp.com, @acme-csm (Slack Connect)
    - Contoso: support@contoso.com, @contoso-csm
    - Fabrikam: ops@fabrikam.com, @fabrikam-csm
    
    Vendors:
    - HashiCorp (Vault): support@hashicorp.com
    - Elastic (Search): support@elastic.co
    - PagerDuty: support@pagerduty.com
    

Code Examples: - Contact templates - Escalation automation

Diagrams: - Escalation flow - Contact hierarchy

Deliverables: - Contact directory - Escalation procedures


Topic 44: Knowledge Base & Resources

What will be covered: - Internal Documentation

Architecture Docs: /docs/architecture/
API Documentation: https://api.atp.example.com/swagger
Confluence Wiki: https://connectsoft.atlassian.net/wiki/ATP
Runbook (this doc): /docs/operations/runbook.md
Postmortems: /docs/postmortems/

  • Training Resources

    • Onboarding guide for new on-call engineers
    • Video walkthroughs
    • Practice scenarios

  • Runbook Updates

    • Update after every postmortem
    • Quarterly review
    • Version in Git
    • PR approval required

Code Examples: - Knowledge base structure - Update procedures

Diagrams: - Resource map

Deliverables: - Knowledge base index - Training materials - Update procedures


Summary of Deliverables

Across all 22 cycles, this documentation will provide:

  1. Runbook Foundation

    • Organization and navigation
    • Quick reference cards
    • Contact directory
  2. Monitoring & Health

    • Health check endpoints (all services)
    • Dashboard access
    • Metric interpretation
  3. Incident Management

    • Alert response procedures
    • Incident lifecycle
    • War room protocols
    • Severity classification
    • SLO monitoring
  4. Troubleshooting

    • Common problems catalog (50+ scenarios)
    • Service-specific diagnostics (8 services)
    • Database troubleshooting
    • Messaging troubleshooting
    • DLQ triage and replay
  5. Operational Procedures

    • Deployment procedures (standard, canary, blue-green)
    • Rollback procedures (Helm, K8s, database)
    • Configuration changes
    • Key rotation (routine and emergency)
  6. Capacity & Performance

    • Manual scaling procedures
    • Capacity planning
    • Performance troubleshooting
    • Optimization techniques
  7. Security & Compliance

    • Security incident response
    • Tamper investigation
    • Breach containment
    • Compliance operations
    • Audit validation
  8. Recovery

    • Backup procedures
    • Restore procedures
    • Disaster recovery
    • Multi-region failover
  9. Maintenance

    • Scheduled maintenance
    • Zero-downtime procedures
    • Routine tasks
  10. Emergency

    • Emergency response procedures
    • Crisis management
    • Business continuity
  11. Reference

    • Complete contact directory
    • Escalation matrix
    • Knowledge base index
    • Runbook maintenance


This runbook provides step-by-step procedures for running ATP in production: responding to incidents, troubleshooting systematically, deploying changes safely, managing capacity, rotating secrets, investigating security events, recovering from failures, maintaining compliance, and escalating appropriately. Every procedure is designed to preserve ATP's core guarantees of tamper-evidence, tenant isolation, and regulatory compliance.