Operations Runbook - Audit Trail Platform (ATP)

Your operational command center. This runbook provides step-by-step procedures for running ATP in production: responding to incidents, troubleshooting common issues, performing routine maintenance, and handling emergencies, with clear ownership, escalation paths, and compliance preservation.

📋 Documentation Generation Plan

This document will be generated in 22 cycles. Current progress:

| Cycle | Topics | Estimated Lines | Status |
|-------|--------|-----------------|--------|
| Cycle 1 | Runbook Overview & Organization (1-2) | ~2,500 | ⏳ Not Started |
| Cycle 2 | Service Health & Status Monitoring (3-4) | ~4,000 | ⏳ Not Started |
| Cycle 3 | Alert Response Procedures (5-6) | ~4,000 | ⏳ Not Started |
| Cycle 4 | Incident Management (7-8) | ~4,500 | ⏳ Not Started |
| Cycle 5 | Severity Classification & SLAs (9-10) | ~3,000 | ⏳ Not Started |
| Cycle 6 | Common Problems & Solutions (11-12) | ~5,000 | ⏳ Not Started |
| Cycle 7 | Service-Specific Troubleshooting (13-14) | ~5,000 | ⏳ Not Started |
| Cycle 8 | Database Operations (15-16) | ~3,500 | ⏳ Not Started |
| Cycle 9 | Messaging & Event Bus Operations (17-18) | ~4,000 | ⏳ Not Started |
| Cycle 10 | DLQ Management & Replay (19-20) | ~4,000 | ⏳ Not Started |
| Cycle 11 | Deployment Procedures (21-22) | ~3,500 | ⏳ Not Started |
| Cycle 12 | Rollback Procedures (23-24) | ~3,000 | ⏳ Not Started |
| Cycle 13 | Configuration Changes (25-26) | ~3,000 | ⏳ Not Started |
| Cycle 14 | Key Rotation & Secret Management (27-28) | ~3,500 | ⏳ Not Started |
| Cycle 15 | Scaling & Capacity Management (29-30) | ~3,500 | ⏳ Not Started |
| Cycle 16 | Performance Troubleshooting (31-32) | ~4,000 | ⏳ Not Started |
| Cycle 17 | Security Incident Response (33-34) | ~4,500 | ⏳ Not Started |
| Cycle 18 | Data Recovery & Backups (35-36) | ~3,500 | ⏳ Not Started |
| Cycle 19 | Compliance & Audit Procedures (37-38) | ~3,000 | ⏳ Not Started |
| Cycle 20 | Maintenance Windows (39-40) | ~3,000 | ⏳ Not Started |
| Cycle 21 | Emergency Procedures (41-42) | ~4,000 | ⏳ Not Started |
| Cycle 22 | Contacts, Escalation & Knowledge Base (43-44) | ~3,000 | ⏳ Not Started |

Total Estimated Lines: ~81,000

Purpose & Scope

This Operations Runbook is the authoritative operational guide for ATP, providing actionable procedures for SRE/DevOps teams to monitor health, respond to incidents, troubleshoot issues, deploy changes, manage capacity, rotate secrets, handle security events, perform maintenance, and escalate emergencies while preserving ATP's core guarantees: tamper-evidence, tenant isolation, and compliance.

Who Should Use This Runbook?

SRE/Platform Engineers: Day-to-day operations, monitoring, scaling
On-Call Engineers: Incident response, troubleshooting, emergency procedures
DevOps Engineers: Deployments, rollbacks, configuration changes
Security Team: Security incident response, key rotation, audit preservation
Compliance Team: Data recovery, legal hold procedures, audit validation
Engineering Management: Escalation procedures, postmortem reviews

Runbook Philosophy

Clarity: Step-by-step procedures with copy-paste commands
Speed: Quick reference for time-sensitive incidents
Safety: All procedures preserve audit integrity and compliance
Traceability: Every action logged and auditable
Escalation: Clear ownership and escalation paths
Learning: Postmortem integration for continuous improvement

Detailed Cycle Plan

CYCLE 1: Runbook Overview & Organization (~2,500 lines)

Topic 1: Runbook Structure & Quick Reference

What will be covered: - Runbook Organization

This runbook is organized by operational scenario:

1. Health & Monitoring
   - Service status checks
   - Dashboard access
   - Metric interpretation

2. Incident Response
   - Alert procedures
   - Severity classification
   - Escalation paths

3. Troubleshooting
   - Common problems by service
   - Diagnostic procedures
   - Root cause analysis

4. Operations
   - Deployment procedures
   - Rollback procedures
   - Configuration changes
   - Key rotation

5. Maintenance
   - Scheduled tasks
   - Database maintenance
   - Index optimization
   - Cache warming

6. Emergency Procedures
   - Security breaches
   - Data corruption
   - Multi-region failover
   - Disaster recovery

7. Reference
   - Contact information
   - Escalation matrix
   - Service topology
   - Runbook updates

Quick Reference Cards

🚨 CRITICAL INCIDENT (SEV-1)
1. Acknowledge in PagerDuty (within 5 minutes)
2. Join war room: Slack #atp-incidents
3. Check health dashboard: https://atp-ops.example.com
4. Follow incident procedure (Section 4)
5. Escalate if unresolved in 30 minutes

📊 CHECK SERVICE HEALTH
kubectl get pods -n atp-{service}-ns
curl https://api.atp.example.com/health/{service}
View dashboard: https://grafana.atp.example.com/d/service-health

🔄 ROLLBACK DEPLOYMENT
helm rollback atp -n atp-system
OR
kubectl rollout undo deployment/{service} -n atp-{service}-ns

🔍 VIEW LOGS
kubectl logs -f deployment/{service} -n atp-{service}-ns
OR
az monitor log-analytics query -w {workspace-id} --analytics-query "..."

📈 CHECK METRICS
Prometheus: https://prometheus.atp.example.com
Grafana: https://grafana.atp.example.com
Azure Monitor: https://portal.azure.com -> ATP Log Analytics

Critical Contacts

On-Call Rotation:
- Primary: PagerDuty "ATP Primary" schedule
- Secondary: PagerDuty "ATP Secondary" schedule
- Manager On-Call: PagerDuty "ATP Manager" schedule

War Room:
- Slack: #atp-incidents (critical)
- Slack: #atp-ops (warnings, maintenance)
- Teams: ATP Operations Team

Escalation:
- L1 (Platform Engineer): 0-30 min response
- L2 (Senior SRE): 30-60 min response
- L3 (Engineering Manager): 60-120 min response
- L4 (VP Engineering): 2+ hours, major outage

Stakeholders:
- Security: security@connectsoft.com
- Compliance: compliance@connectsoft.com
- Customer Success: success@connectsoft.com

Code Examples: - Quick reference templates - Contact information structure - Emergency checklists

Diagrams: - Runbook navigation - Escalation flow - War room structure

Deliverables: - Runbook organization guide - Quick reference cards - Contact directory

Topic 2: Using This Runbook

What will be covered:

- How to Navigate
  - Table of contents with direct links
  - Search functionality (Ctrl+F)
  - Cross-references to architecture docs
- When to Use Each Section
  - Alert fired → Alert Response (Section 3)
  - Service degraded → Troubleshooting (Section 6-7)
  - Deploying change → Deployment Procedures (Section 11)
  - Security event → Security Incident Response (Section 17)
- Runbook Maintenance
  - Update after every postmortem
  - Quarterly review by SRE team
  - Version controlled in Git
  - Changes go through PR review

Code Examples: - Runbook update procedure - Change log template

Diagrams: - Runbook usage flow

Deliverables: - Navigation guide - Usage instructions - Maintenance procedures

CYCLE 2: Service Health & Status Monitoring (~4,000 lines)

Topic 3: Health Check Endpoints

What will be covered: - ATP Health Check Architecture

Every ATP service exposes 3 health endpoints:

/health/live (Liveness Probe)
- Purpose: Is the service process alive?
- K8s: Restarts the container if the probe fails
- Checks: Minimal (service responsive)

/health/ready (Readiness Probe)
- Purpose: Is the service ready to accept traffic?
- K8s: Removes the pod from Service endpoints (load balancing) if the probe fails
- Checks: Dependencies (DB, cache, message bus, KMS)

/health/startup (Startup Probe)
- Purpose: Has the service finished initialization?
- K8s: Holds off liveness/readiness probes until the startup probe passes
- Checks: Configuration loaded, migrations run, caches warmed

Health Check Endpoint Details by Service

# Gateway Service
curl https://api.atp.example.com/health/live    # → 200 OK
curl https://api.atp.example.com/health/ready   # → 200 OK or 503

# Checks:
# - [ready] Azure Key Vault reachable
# - [ready] Backend services reachable (ingestion, query, policy)
# - [ready] Rate limiter cache (Redis) connected

# Ingestion Service
curl https://ingestion.atp.internal/health/ready

# Checks:
# - [ready] Azure SQL connection pool healthy
# - [ready] Azure Service Bus connected (publish capability)
# - [ready] Azure Blob Storage reachable (WORM container)
# - [ready] Policy Service reachable (classification API)
# - [ready] Outbox relay worker running

# Query Service
curl https://query.atp.internal/health/ready

# Checks:
# - [ready] Read model database (projections) reachable
# - [ready] Query cache (Redis) connected
# - [ready] Search index reachable (if enabled)
# - [ready] Latest projection watermark within SLO lag (<10s)

# Projection Service
curl https://projection.atp.internal/health/ready

# Checks:
# - [ready] Azure Service Bus subscription active
# - [ready] Read model database writable
# - [ready] Inbox deduplication table accessible
# - [ready] Projection lag within threshold (<5s)
# - [ready] No DLQ backlog (or within threshold)

# Export Service
curl https://export.atp.internal/health/ready

# Checks:
# - [ready] Azure Blob Storage writable (export container)
# - [ready] Export job queue (Redis) connected
# - [ready] KMS signing operation test (dry-run)
# - [ready] Bandwidth budget not exceeded

# Integrity Service
curl https://integrity.atp.internal/health/ready

# Checks:
# - [ready] Azure Blob Storage WORM container reachable
# - [ready] KMS signing keys accessible
# - [ready] Hash chain state store connected
# - [ready] Merkle tree computation functional

# Policy Service
curl https://policy.atp.internal/health/ready

# Checks:
# - [ready] Policy database reachable
# - [ready] Policy cache (Redis) connected
# - [ready] Default policies loaded

Health Check Response Format

// Healthy
{
  "status": "Healthy",
  "totalDuration": "00:00:00.1234567",
  "entries": {
    "database": {
      "status": "Healthy",
      "description": "Azure SQL connection pool: 5/10 active",
      "duration": "00:00:00.0123456"
    },
    "servicebus": {
      "status": "Healthy",
      "description": "Azure Service Bus connected",
      "duration": "00:00:00.0234567"
    },
    "cache": {
      "status": "Healthy",
      "description": "Redis cache: 1000 keys, 45% memory",
      "duration": "00:00:00.0098765"
    }
  }
}

// Degraded (service still running but impaired)
{
  "status": "Degraded",
  "totalDuration": "00:00:00.5000000",
  "entries": {
    "database": {
      "status": "Healthy"
    },
    "cache": {
      "status": "Degraded",
      "description": "Redis cache: Degraded performance, high latency (150ms avg)",
      "duration": "00:00:00.4567890"
    }
  }
}

// Unhealthy (service should be removed from load balancer)
{
  "status": "Unhealthy",
  "totalDuration": "00:00:01.0000000",
  "entries": {
    "database": {
      "status": "Unhealthy",
      "description": "Azure SQL connection failed: Timeout",
      "exception": "System.Data.SqlClient.SqlException: Timeout expired...",
      "duration": "00:00:01.0000000"
    }
  }
}
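The response format above lends itself to automated triage. The following is a minimal Python sketch, assuming the payload shape shown in the examples; the function name `failing_entries` is hypothetical and not part of ATP.

```python
import json


def failing_entries(payload: str) -> dict[str, str]:
    """Return {entry_name: status} for every entry that is not Healthy."""
    report = json.loads(payload)
    return {
        name: entry.get("status", "Unknown")
        for name, entry in report.get("entries", {}).items()
        if entry.get("status") != "Healthy"
    }


# Sample payload mirroring the "Degraded" example above (trimmed).
sample = json.dumps({
    "status": "Degraded",
    "entries": {
        "database": {"status": "Healthy"},
        "cache": {"status": "Degraded", "description": "high latency"},
    },
})

print(failing_entries(sample))
```

Feeding it the body returned by `/health/ready` would point directly at the dependency to investigate first.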

Automated Health Monitoring
- Kubernetes liveness/readiness probes (every 10s)
- Azure Monitor health check alerts
- Grafana health dashboard
- PagerDuty integration for failures

Code Examples: - Complete health check implementations (all services) - Health check aggregation dashboard - Alerting rules

Diagrams: - Health check architecture - Probe failure handling - Alert routing

Deliverables: - Health check reference - Monitoring guide - Alert configuration

Topic 4: Service Status Dashboard

What will be covered: - Grafana Dashboard Access

ATP Operations Dashboard: https://grafana.atp.example.com/d/atp-ops

Panels:
- Service Health Matrix (all services, all regions)
- Error Rate by Service
- Request Rate by Service
- P50/P95/P99 Latency by Service
- Projection Lag (timeline, actor, resource)
- DLQ Depth by Consumer
- Outbox Backlog
- Cache Hit Rates
- Database Connection Pool Usage
- Message Bus Queue Depth

Azure Monitor Workbooks

ATP Health Overview: 
- Navigate to Azure Portal → ATP Resource Group → Monitor → Workbooks
- Select "ATP Service Health"

Views:
- Live Metrics Stream (real-time)
- Application Map (service dependencies)
- Performance (query execution, dependencies)
- Failures (exceptions, failed requests)
- Logs (structured query with KQL)

Status Page (Public)
- External status page for customers
- Incident communication
- Maintenance window announcements

Code Examples: - Grafana dashboard JSON - Azure Monitor KQL queries - Status page integration

Diagrams: - Dashboard layout - Metric flow - Status propagation

Deliverables: - Dashboard templates - Query library - Status page setup

CYCLE 3: Alert Response Procedures (~4,000 lines)

Topic 5: Alert Handling Workflow

What will be covered: - Alert Lifecycle

flowchart LR
    A[Metric Breaches Threshold] --> B[Alert Fired]
    B --> C[PagerDuty Page]
    B --> D[Slack Notification]
    B --> E[Ticket Created]
    C --> F[Engineer Acknowledges]
    F --> G[Investigate & Mitigate]
    G --> H{Issue Resolved?}
    H -->|Yes| I[Alert Auto-Resolves]
    H -->|No| J[Escalate]
    J --> K[L2/L3 Engagement]
    K --> G
    I --> L[Close Ticket]
    L --> M[Postmortem]


Alert Channels

Critical (SEV-1, SEV-2):
- PagerDuty: Immediate phone/SMS/push
- Slack: #atp-incidents (auto-created channel per incident)
- Email: atp-oncall@connectsoft.com
- Ticket: Jira (auto-created)

Warning (SEV-3):
- Slack: #atp-ops
- Email: atp-ops@connectsoft.com
- Ticket: Jira (auto-created)

Info (SEV-4):
- Slack: #atp-notifications
- No page, no email
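The channel routing above can be expressed as a small lookup. This is an illustrative Python sketch only: the tier names, the `channels_for` helper, and the `kind:target` string convention are assumptions made for the example, not an existing ATP component.

```python
# Channel routing copied from the "Alert Channels" list above.
CHANNELS = {
    "critical": ["pagerduty", "slack:#atp-incidents",
                 "email:atp-oncall@connectsoft.com", "jira"],
    "warning": ["slack:#atp-ops", "email:atp-ops@connectsoft.com", "jira"],
    "info": ["slack:#atp-notifications"],
}

TIER_BY_SEVERITY = {
    "SEV-1": "critical",
    "SEV-2": "critical",
    "SEV-3": "warning",
    "SEV-4": "info",
}


def channels_for(severity: str) -> list[str]:
    """Return the notification targets for a severity such as 'SEV-2'."""
    try:
        return CHANNELS[TIER_BY_SEVERITY[severity]]
    except KeyError:
        raise ValueError(f"unknown severity: {severity!r}") from None


print(channels_for("SEV-3"))
```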

Alert Acknowledgement

# Acknowledge in PagerDuty
# - Click "Acknowledge" in PagerDuty app/web
# - OR via API:
curl -X PUT https://api.pagerduty.com/incidents/{id} \
    -H "Authorization: Token token=$PD_TOKEN" \
    -H "Accept: application/vnd.pagerduty+json;version=2" \
    -H "Content-Type: application/json" \
    -H "From: oncall@connectsoft.com" \
    -d '{
      "incident": {
        "type": "incident_reference",
        "status": "acknowledged"
      }
    }'

# Post to Slack
# "🔔 Acknowledged: [Alert Name]. Investigating... ETA: 15 min"
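For scripted acknowledgement, the same call can be assembled in Python. The sketch below mirrors the curl request above using only the standard library and sets the JSON content type explicitly; the incident ID and token are placeholders, and `build_ack_request` is a hypothetical helper name. The request is built but not sent.

```python
import json
import urllib.request


def build_ack_request(incident_id: str, token: str, from_email: str) -> urllib.request.Request:
    """Build (but do not send) a PagerDuty 'acknowledge incident' request."""
    body = json.dumps(
        {"incident": {"type": "incident_reference", "status": "acknowledged"}}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"https://api.pagerduty.com/incidents/{incident_id}",
        data=body,
        method="PUT",
        headers={
            "Authorization": f"Token token={token}",
            "From": from_email,
            "Content-Type": "application/json",
        },
    )


# Placeholder values for illustration only.
req = build_ack_request("PABC123", "example-token", "oncall@connectsoft.com")
# To send it: urllib.request.urlopen(req, timeout=10)
```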

Initial Response Checklist

Within 5 minutes of alert:
☐ Acknowledge in PagerDuty
☐ Post to Slack incident channel
☐ Check service health dashboard
☐ Review recent deployments (last 4 hours)
☐ Check related services (dependencies)
☐ Determine severity (use classification matrix)
☐ Engage additional engineers if SEV-1/SEV-2

Code Examples: - Alert response templates - Acknowledgement scripts - Initial triage procedures

Diagrams: - Alert lifecycle - Response timeline - Communication flow

Deliverables: - Alert response procedures - Acknowledgement guide - Triage checklists

Topic 6: Alert Types & Runbook Links

What will be covered: - ATP Alert Catalog

| Alert | Severity | Threshold | Runbook Section | Auto-Remediation |
|-------|----------|-----------|-----------------|------------------|
| ServiceDown | SEV-1 | 0 healthy pods | 6.1 (Service Restart) | Pod restart (K8s) |
| HighErrorRate | SEV-2 | >5% errors, 5min | 6.2 (Error Investigation) | No |
| HighLatency | SEV-2 | P95 >1s, 10min | 6.3 (Performance Tuning) | No |
| DatabaseConnectionFailure | SEV-1 | All connections fail | 8.1 (Database Recovery) | Retry, then page |
| MessageBusDown | SEV-1 | Service Bus unreachable | 9.1 (Message Bus Recovery) | Retry, then page |
| ProjectionLagHigh | SEV-2 | Lag >30s, 10min | 10.1 (Projection Catchup) | Scale workers (KEDA) |
| DLQBacklog | SEV-3 | >100 msgs, 30min | 10.2 (DLQ Triage) | Alert only |
| DiskSpaceLow | SEV-2 | <10% free | 15.1 (Capacity Expansion) | No |
| CertificateExpiring | SEV-3 | <30 days | 14.1 (Certificate Renewal) | No |
| KeyRotationOverdue | SEV-2 | >90 days | 14.2 (Emergency Key Rotation) | No |
| TamperDetected | SEV-1 | Any tamper alert | 17.1 (Tamper Investigation) | Freeze, escalate |
| ComplianceViolation | SEV-1 | Retention/residency breach | 19.1 (Compliance Emergency) | Freeze, escalate |
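Several catalog thresholds pair a limit with a duration (for example, HighErrorRate at >5% errors for 5 minutes). The Python sketch below illustrates that "sustained breach" semantics, similar in spirit to a `for:` clause in a Prometheus alerting rule. It is an illustration under assumed inputs (timestamped samples, oldest first), not the actual alert implementation, and `breached_for` is a hypothetical name.

```python
def breached_for(samples: list[tuple[float, float]], threshold: float, window_s: float) -> bool:
    """samples: (unix_timestamp, value) pairs sorted oldest first.

    True only when the samples cover at least `window_s` seconds and every
    sample inside the trailing window exceeds `threshold`.
    """
    if not samples:
        return False
    latest = samples[-1][0]
    window = [(t, v) for t, v in samples if t >= latest - window_s]
    covers_window = samples[0][0] <= latest - window_s
    return covers_window and all(v > threshold for _, v in window)


# One sample per minute for six minutes, error rate steady at 6%.
steady = [(minute * 60.0, 0.06) for minute in range(7)]
print(breached_for(steady, threshold=0.05, window_s=300))
```

A single sample back under the threshold resets the condition, which is why brief spikes do not page.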

Alert Runbook Structure

For each alert:

1. Description: What this alert means
2. Symptoms: What users/systems experience
3. Diagnostic Steps: How to investigate
4. Resolution Steps: How to fix
5. Escalation: When to escalate
6. Prevention: How to avoid in future
7. Related Links: Architecture docs, code references

Code Examples: - Alert definitions (Prometheus rules) - Runbook templates for each alert - Auto-remediation scripts

Diagrams: - Alert taxonomy - Runbook linkage

Deliverables: - Complete alert catalog - Runbook procedures for each alert - Auto-remediation playbooks

CYCLE 4: Incident Management (~4,500 lines)

Topic 7: Incident Response Framework

What will be covered: - Incident Lifecycle

1. Detection
   - Alert fires
   - User report
   - Monitoring anomaly

2. Acknowledgement
   - On-call acknowledges (within 5 min for SEV-1)
   - War room created
   - Incident ticket opened

3. Triage
   - Determine severity
   - Identify affected services/tenants
   - Assess impact
   - Engage additional engineers

4. Investigation
   - Review logs, metrics, traces
   - Identify root cause
   - Document findings

5. Mitigation
   - Implement fix or workaround
   - Deploy patch
   - Verify resolution

6. Recovery
   - Restore normal operations
   - Validate all services healthy
   - Notify stakeholders

7. Resolution
   - Close incident ticket
   - Resolve PagerDuty incident
   - Communicate to affected tenants

8. Postmortem
   - Conduct blameless postmortem
   - Document root cause
   - Create action items
   - Update runbook

Incident Command Structure

Incident Commander (IC):
- Owns incident response
- Coordinates team
- Makes go/no-go decisions
- Communicates to stakeholders

Technical Lead (TL):
- Drives investigation
- Implements fixes
- Coordinates with IC

Communications Lead (CL):
- Customer notifications
- Status page updates
- Stakeholder updates

Scribe:
- Documents timeline
- Captures decisions
- Records actions taken

War Room Protocols

War Room Creation (SEV-1, SEV-2):
1. Create dedicated Slack channel: #incident-YYYY-MM-DD-HHMM
2. Pin incident ticket link
3. Pin dashboard links
4. Set channel topic: "[SEV-X] [Service] Brief description"
5. Invite: IC, TL, CL, Scribe, relevant SMEs

War Room Updates:
- Every 15 minutes: IC posts status update
- Every action: Engineer posts what they're trying
- Every finding: Post evidence (log snippets, metrics screenshots)
- Major decisions: IC announces and logs rationale

War Room Etiquette:
- Use threads for side discussions
- Main channel for critical updates only
- No "@here" or "@channel" unless critical
- Update channel topic with current status
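The channel naming and topic conventions above are easy to generate consistently. A minimal Python sketch follows; it assumes the timestamp in the channel name is UTC (the protocol above does not say), and both helper names are hypothetical.

```python
from datetime import datetime, timezone


def war_room_channel(opened_at: datetime) -> str:
    """Format '#incident-YYYY-MM-DD-HHMM' (UTC is an assumption)."""
    utc = opened_at.astimezone(timezone.utc)
    return utc.strftime("#incident-%Y-%m-%d-%H%M")


def channel_topic(severity: str, service: str, description: str) -> str:
    """Format '[SEV-X] [Service] Brief description'."""
    return f"[{severity}] [{service}] {description}"


opened = datetime(2025, 10, 30, 14, 32, tzinfo=timezone.utc)
print(war_room_channel(opened))
print(channel_topic("SEV-2", "Ingestion", "High latency in US-East"))
```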

Code Examples: - Incident templates - War room scripts - Communication templates

Diagrams: - Incident lifecycle - Command structure - War room flow

Deliverables: - Incident response procedures - War room protocols - Communication templates

Topic 8: Incident Documentation

What will be covered: - Incident Ticket Structure

Incident ID: INC-2025-001234
Title: [SEV-2] Ingestion Service High Latency in US-East

Status: Investigating | Mitigated | Resolved
Severity: SEV-1 | SEV-2 | SEV-3 | SEV-4

Timeline:
- 2025-10-30 14:32 UTC: Alert fired (P95 latency >1s)
- 2025-10-30 14:34 UTC: Acknowledged by @engineer
- 2025-10-30 14:40 UTC: Root cause identified (DB connection pool exhausted)
- 2025-10-30 14:45 UTC: Mitigation deployed (increased pool size)
- 2025-10-30 14:50 UTC: Metrics normal, incident resolved

Impact:
- Affected Tenants: 15 enterprise tenants in US-East
- Duration: 18 minutes
- User Impact: Increased ingestion latency (500ms → 2s)
- Data Integrity: ✅ No data loss, all events persisted

Root Cause:
- Database connection pool exhausted (max 100 connections)
- Traffic spike from tenant "acme-corp" (batch import)
- Connection pool not sized for peak load

Resolution:
- Temporary: Increased connection pool max to 200
- Permanent: Implement per-tenant rate limiting

Action Items:
- [ ] Increase default connection pool size (deploy to all regions)
- [ ] Add per-tenant ingestion rate limits
- [ ] Add connection pool usage alerts (>80%)
- [ ] Update capacity planning docs
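Durations in the ticket can be derived from the timeline rather than typed by hand. The Python sketch below assumes the timestamp format used in the sample timeline above; `minutes_between` is a hypothetical helper.

```python
from datetime import datetime

# Matches timeline entries such as "2025-10-30 14:32 UTC".
FMT = "%Y-%m-%d %H:%M UTC"


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two timeline timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


# Alert fired -> incident resolved, taken from the sample ticket.
duration = minutes_between("2025-10-30 14:32 UTC", "2025-10-30 14:50 UTC")
# Alert fired -> acknowledged.
time_to_ack = minutes_between("2025-10-30 14:32 UTC", "2025-10-30 14:34 UTC")
print(duration, time_to_ack)
```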

Incident Log Template
- Use Jira/ServiceNow incident template
- Auto-populate from alerts
- Link to logs, metrics, traces
- Capture all actions taken

Code Examples: - Incident ticket templates - Timeline documentation - Action item tracking

Diagrams: - Incident ticket flow - Documentation structure

Deliverables: - Incident templates - Documentation procedures - Tracking systems

CYCLE 5: Severity Classification & SLAs (~3,000 lines)

Topic 9: Severity Levels

What will be covered: - ATP Severity Classification

SEV-1 (Critical - P1)
Definition: Complete service outage or data integrity compromise

Examples:
- All ATP services down (no ingestion, no queries)
- Data corruption detected (tamper evidence failed)
- Security breach (unauthorized access)
- Multi-tenant data leakage
- Compliance violation (GDPR, HIPAA, SOC2)

Response Time: 5 minutes
Resolution Time: 4 hours
Communication: Immediate customer notification
Escalation: Immediate (Manager + VP Engineering)

Actions:
- Page primary, secondary, manager on-call
- Create war room immediately
- Engage security team (if security-related)
- Freeze all deployments
- Customer Success notifies affected tenants

---

SEV-2 (High - P2)
Definition: Significant degradation affecting multiple tenants

Examples:
- Single service degraded (high latency, errors)
- Projection lag >30s (query results stale)
- Export service down (ingestion OK)
- Certificate expiring <7 days
- Key rotation overdue

Response Time: 15 minutes
Resolution Time: 8 hours
Communication: Notify affected tenants if impact >1 hour
Escalation: After 1 hour if unresolved

Actions:
- Page primary on-call
- Create war room if >30 min
- Post updates every 30 min

---

SEV-3 (Medium - P3)
Definition: Minor degradation affecting few tenants

Examples:
- Single tenant experiencing issues
- DLQ backlog >100 messages
- Slow query performance (specific endpoint)
- Cache miss rate elevated
- Non-critical background job failing

Response Time: 1 hour
Resolution Time: 24 hours
Communication: Internal only
Escalation: After 4 hours if unresolved

Actions:
- Slack notification to #atp-ops
- Assign to engineer
- Post updates when resolved

---

SEV-4 (Low - P4)
Definition: No user impact, informational

Examples:
- Warning thresholds breached
- Capacity planning alerts
- Maintenance reminders
- Configuration drift

Response Time: Best effort
Resolution Time: 1 week
Communication: None
Escalation: None

Actions:
- Create ticket
- Prioritize in backlog
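The response and resolution targets above translate directly into deadlines. This Python sketch assumes both clocks start at detection time, which the definitions above do not state explicitly; `sla_deadlines` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

# (response, resolution) targets from the severity definitions above.
# SEV-4 response is "best effort", modelled here as None.
SLA = {
    "SEV-1": (timedelta(minutes=5), timedelta(hours=4)),
    "SEV-2": (timedelta(minutes=15), timedelta(hours=8)),
    "SEV-3": (timedelta(hours=1), timedelta(hours=24)),
    "SEV-4": (None, timedelta(weeks=1)),
}


def sla_deadlines(severity: str, detected_at: datetime):
    """Return (respond_by, resolve_by); respond_by is None for best effort."""
    response, resolution = SLA[severity]
    respond_by = detected_at + response if response is not None else None
    return respond_by, detected_at + resolution


detected = datetime(2025, 10, 30, 14, 32, tzinfo=timezone.utc)
print(sla_deadlines("SEV-2", detected))
```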

Severity Escalation Matrix

| Time Elapsed | SEV-1 | SEV-2 | SEV-3 |
|--------------|-------|-------|-------|
| 0-30 min | Primary On-Call | Primary On-Call | - |
| 30-60 min | + Secondary On-Call | - | - |
| 60-120 min | + Manager On-Call | + Secondary On-Call | - |
| 120+ min | + VP Engineering | + Manager On-Call | + Secondary On-Call |

Downgrade/Upgrade Criteria
- Downgrade SEV-1 → SEV-2 if impact contained, workaround in place
- Upgrade SEV-2 → SEV-1 if multiple tenants affected, data integrity risk
- Document all severity changes with rationale
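The escalation matrix can be read as a cumulative lookup: each row adds a role to those already engaged. The Python sketch below encodes it that way. Treating the boundary minute (for example, exactly 30) as already escalated is an assumption, since the matrix ranges overlap at their edges; `engaged_roles` is a hypothetical helper.

```python
# (minutes elapsed at which the role joins, role), per severity, from the matrix.
ESCALATION = {
    "SEV-1": [(0, "Primary On-Call"), (30, "Secondary On-Call"),
              (60, "Manager On-Call"), (120, "VP Engineering")],
    "SEV-2": [(0, "Primary On-Call"), (60, "Secondary On-Call"),
              (120, "Manager On-Call")],
    "SEV-3": [(120, "Secondary On-Call")],
}


def engaged_roles(severity: str, elapsed_min: float) -> list[str]:
    """Roles that should be engaged once `elapsed_min` minutes have passed."""
    return [role for start, role in ESCALATION.get(severity, [])
            if elapsed_min >= start]


print(engaged_roles("SEV-1", 45))
```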

Code Examples: - Severity classification decision tree - Escalation automation - Communication templates

Diagrams: - Severity levels - Escalation flow - Timeline requirements

Deliverables: - Severity classification guide - SLA requirements - Escalation procedures

Topic 10: SLO Monitoring & Burn Rate Alerts

What will be covered: - ATP Service Level Objectives (SLOs)

Service Availability:
- Target: 99.9% uptime (43.2 min/month downtime budget)
- Measurement: Health check success rate
- Alert: 10% error budget consumed in 1 hour (burn rate)

Ingestion Latency:
- Target: P95 <500ms, P99 <1s
- Measurement: End-to-end (API → append → event published)
- Alert: P95 >1s for 10 minutes

Query Latency:
- Target: P95 <200ms, P99 <500ms
- Measurement: API request duration
- Alert: P95 >500ms for 10 minutes

Projection Lag:
- Target: P95 <5s, P99 <10s
- Measurement: Event timestamp → projection updated
- Alert: P95 >30s for 10 minutes

Data Durability:
- Target: 99.999999999% (11 nines)
- Measurement: Audit records with valid integrity proofs
- Alert: Any integrity verification failure

Tamper Detection:
- Target: 100% detection rate
- Measurement: Hash chain verification success
- Alert: Any hash mismatch
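To make the availability numbers concrete, the following Python sketch works through the arithmetic behind the 43.2-minute budget and the "10% of budget in 1 hour" alert. It assumes a 30-day month, which is what the 43.2-minute figure implies; the function names are hypothetical.

```python
def error_budget_minutes(slo: float, period_minutes: float) -> float:
    """Allowed downtime for an availability SLO over the period."""
    return (1.0 - slo) * period_minutes


def burn_rate(budget_fraction_consumed: float, window_hours: float, period_hours: float) -> float:
    """How many times faster than 'exactly on budget' the budget is burning."""
    return budget_fraction_consumed * period_hours / window_hours


budget = error_budget_minutes(0.999, 30 * 24 * 60)   # about 43.2 minutes
rate = burn_rate(0.10, window_hours=1, period_hours=30 * 24)  # about 72x
print(round(budget, 1), round(rate, 1))
```

At a burn rate of about 72, the whole monthly budget would be gone in roughly 10 hours (720 h / 72), which is why this alert is treated as urgent.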

Error Budget Policy
99.9% SLO = 0.1% error budget = 43.2 min/month
10% budget consumed → Freeze feature releases
25% budget consumed → Emergency freeze, focus on stability
50% budget consumed → Incident declared, all hands
100% budget consumed → Postmortem, process review
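The policy thresholds above can be applied mechanically once downtime for the month is known. A minimal Python sketch, assuming the 43.2-minute monthly budget stated above; `policy_action` is a hypothetical helper, and it returns None while consumption is below the first threshold.

```python
# (minimum fraction of budget consumed, action), highest threshold first.
POLICY = [
    (1.00, "Postmortem, process review"),
    (0.50, "Incident declared, all hands"),
    (0.25, "Emergency freeze, focus on stability"),
    (0.10, "Freeze feature releases"),
]


def policy_action(downtime_minutes: float, budget_minutes: float = 43.2):
    """Return the policy action for the budget consumed so far, or None."""
    consumed = downtime_minutes / budget_minutes
    for threshold, action in POLICY:
        if consumed >= threshold:
            return action
    return None


# An 18-minute outage consumes about 42% of the monthly budget.
print(policy_action(18))
```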

Code Examples: - SLO definitions (Prometheus recording rules) - Burn rate alerts - Error budget dashboards

Diagrams: - SLO monitoring - Error budget tracking - Burn rate visualization

Deliverables: - SLO definitions - Alert rules - Error budget policies

CYCLE 6: Common Problems & Solutions (~5,000 lines)

Topic 11: Service Health Issues

What will be covered: - Problem: Service Pods CrashLoopBackOff

Symptoms:
- Pods continuously restarting
- kubectl get pods shows "CrashLoopBackOff"
- Service unavailable

Diagnosis:
# Check pod status
kubectl get pods -n atp-ingest-ns

# View pod events
kubectl describe pod <pod-name> -n atp-ingest-ns

# Check logs (current and previous)
kubectl logs <pod-name> -n atp-ingest-ns
kubectl logs <pod-name> -n atp-ingest-ns --previous

Common Causes:
1. Configuration error (missing env var, invalid connection string)
2. Database migration failed
3. Secret not mounted (Key Vault CSI issue)
4. Startup timeout (slow dependency)
5. Application exception on startup

Solutions:
# 1. Check configuration
kubectl get configmap ingestion-config -n atp-ingest-ns -o yaml
kubectl get secret <secret-name> -n atp-ingest-ns

# 2. Check Secret Provider
kubectl get secretproviderclass -n atp-ingest-ns
kubectl describe secretproviderclass atp-kv-secrets -n atp-ingest-ns

# 3. Increase startup timeout
# Edit deployment, increase startupProbe failureThreshold
kubectl edit deployment ingestion -n atp-ingest-ns

# 4. Check database connectivity
kubectl run -it --rm debug --image=mcr.microsoft.com/mssql-tools \
    --restart=Never -- /bin/bash
# Then: sqlcmd -S <server> -U <user> -P <password> -Q "SELECT 1"

# 5. Roll back to previous version
kubectl rollout undo deployment/ingestion -n atp-ingest-ns
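When many pods are affected, scanning `kubectl get pods -o json` output is faster than reading each pod by eye. The Python sketch below filters that JSON for containers waiting in CrashLoopBackOff. The pod names in the sample are made up, and `crashlooping` is a hypothetical helper; only the standard pod status fields (`containerStatuses`, `state.waiting.reason`, `restartCount`) are relied on.

```python
import json


def crashlooping(pod_list_json: str) -> list[tuple[str, str, int]]:
    """Return (pod, container, restart_count) for containers in CrashLoopBackOff."""
    found = []
    for pod in json.loads(pod_list_json).get("items", []):
        for cs in pod.get("status", {}).get("containerStatuses", []):
            waiting = cs.get("state", {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                found.append((pod["metadata"]["name"], cs["name"],
                              cs.get("restartCount", 0)))
    return found


# Trimmed stand-in for `kubectl get pods -n atp-ingest-ns -o json`.
sample = json.dumps({"items": [
    {"metadata": {"name": "ingestion-7d9f-abcde"},
     "status": {"containerStatuses": [
         {"name": "ingestion", "restartCount": 7,
          "state": {"waiting": {"reason": "CrashLoopBackOff"}}}]}},
    {"metadata": {"name": "ingestion-7d9f-fghij"},
     "status": {"containerStatuses": [
         {"name": "ingestion", "restartCount": 0, "state": {"running": {}}}]}},
]})

print(crashlooping(sample))
```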

Problem: Pods in Pending State

Symptoms:
- Pods stuck in "Pending" status
- Service scaled but new pods not starting

Diagnosis:
kubectl describe pod <pod-name> -n atp-ingest-ns

# Look for events:
# - "0/5 nodes are available: 3 Insufficient cpu, 2 node(s) had taint..."
# - "persistentvolumeclaim not found"
# - "image pull backoff"

Common Causes:
1. Insufficient cluster capacity (CPU/memory)
2. Node taint mismatch (no toleration)
3. PVC not available
4. Image pull failure (authentication, not found)
5. Node selector mismatch

Solutions:
# 1. Check cluster capacity
kubectl top nodes
kubectl describe nodes

# 2. Check if autoscaler is working
kubectl get nodes --watch

# 3. Check node pool autoscaler limits
az aks nodepool show --resource-group atp-aks-prod-rg \
    --cluster-name atp-aks-useast-prod --name npgeneric

# 4. Check image pull secrets
kubectl get secrets -n atp-ingest-ns
kubectl describe pod <pod-name> -n atp-ingest-ns | grep -A 5 "Events:"

# 5. Temporarily reduce resource requests (if emergency)
kubectl edit deployment ingestion -n atp-ingest-ns
# Reduce requests.cpu / requests.memory
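The scheduler event message already summarises why no node fits. As a rough illustration, the Python sketch below splits a message of the form shown in the diagnosis step into per-reason node counts. The exact wording of these events varies between Kubernetes versions, so treat the pattern as an assumption to adapt; `unschedulable_reasons` is a hypothetical helper.

```python
import re


def unschedulable_reasons(message: str) -> dict[str, int]:
    """Map each reason in a 'nodes are available' message to its node count."""
    # Everything after the first colon lists "<count> <reason>" pairs.
    _, _, detail = message.partition(":")
    return {
        reason.strip(): int(count)
        for count, reason in re.findall(r"(\d+) ([^,.]+)", detail)
    }


msg = "0/5 nodes are available: 3 Insufficient cpu, 2 node(s) had taint"
print(unschedulable_reasons(msg))
```

A dominant "Insufficient cpu" count points at capacity (solutions 1 to 3), while taint counts point at tolerations or node selectors.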

Problem: Service Returns 503 (Service Unavailable)

Symptoms:
- API returns 503 errors
- Health check endpoint fails (/health/ready)
- Service in load balancer but rejecting traffic

Diagnosis:
# Check readiness probe
kubectl get pods -n atp-query-ns
# Look for: READY column showing "0/1" or "0/2" (sidecar)

kubectl logs <pod-name> -n atp-query-ns
# Search for: "Readiness check failed"

# Check dependencies
curl https://query.atp.internal/health/ready
# Response shows which dependency failed

Common Causes:
1. Database connection pool exhausted
2. Redis cache unreachable
3. Projection lag exceeds threshold (ready check fails)
4. Service Bus subscription paused
5. Dependency service down

Solutions:
# 1. Check database connections
# - View metrics: "Database connection pool usage"
# - If exhausted, scale database or increase pool size

# 2. Restart pods (if transient)
kubectl rollout restart deployment/query -n atp-query-ns

# 3. Scale up if overwhelmed
kubectl scale deployment/query -n atp-query-ns --replicas=10

# 4. Check dependency health
kubectl get pods -n atp-projection-ns  # If query depends on projections

# 5. Bypass ready check temporarily (emergency only)
kubectl edit deployment query -n atp-query-ns
# Comment out readinessProbe (DO NOT DO THIS IN PRODUCTION without approval)

Code Examples: - Complete troubleshooting procedures - Diagnostic commands - Resolution scripts

Diagrams: - Problem diagnosis flow - Resolution decision tree

Deliverables: - Problem catalog - Diagnostic procedures - Solution library

Topic 12: Database Problems

What will be covered: - Problem: Database Connection Failures - Problem: Slow Queries - Problem: Deadlocks - Problem: Connection Pool Exhaustion - Problem: Migration Failures

Code Examples: - Database troubleshooting - Query optimization - Connection management

Deliverables: - Database operations guide

CYCLE 7: Service-Specific Troubleshooting (~5,000 lines)

Topic 13: Ingestion Service Issues

What will be covered: - High Ingestion Latency

Symptoms:
- P95 latency >500ms (SLO: <500ms)
- Slow API responses
- Queue backlog growing

Diagnosis:
# Check metrics
# - Ingestion rate (events/sec)
# - Database write latency
# - Outbox relay lag
# - CPU/memory usage

# Check logs
kubectl logs -f deployment/ingestion -n atp-ingest-ns | grep -E "WARN|ERROR"

# Check Application Insights
az monitor app-insights query \
    --app atp-appinsights \
    --analytics-query "
      requests
      | where cloud_RoleName == 'Ingestion'
      | where timestamp > ago(1h)
      | summarize P95=percentile(duration, 95), P99=percentile(duration, 99) by bin(timestamp, 5m)
      | render timechart
    "

Common Causes:
1. Database bottleneck (DTU/RU exhaustion)
2. Classification service slow (policy evaluation)
3. Outbox table growing (relay worker slow)
4. Large batch ingestion (single tenant spike)
5. CPU/memory limits hit

Solutions:
# 1. Scale ingestion pods
kubectl scale deployment/ingestion -n atp-ingest-ns --replicas=10

# 2. Check database performance
# - View DTU usage in Azure Portal
# - If high, scale up database tier

# 3. Check outbox relay worker
kubectl logs -f deployment/outbox-relay -n atp-ingest-ns

# 4. Implement rate limiting (if single tenant spike)
# - Add per-tenant rate limit policy

# 5. Scale database (if sustained load)
az sql db update --resource-group atp-prod-rg \
    --server atp-sql-prod --name atp-db \
    --service-objective P4  # Scale to higher tier
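Solution 4 above calls for a per-tenant rate limit when a single tenant causes the spike. One common way to implement such a limit is a token bucket per tenant. The Python sketch below is illustrative only: the class name, parameters, and in-memory storage are assumptions, and a production limiter for ATP would more likely keep its state in a shared store so all pods enforce the same limit.

```python
import time


class TenantRateLimiter:
    """Token bucket per tenant: `rate` events/sec sustained, `burst` at once."""

    def __init__(self, rate: float, burst: int, clock=time.monotonic):
        self.rate, self.burst, self.clock = rate, burst, clock
        # tenant_id -> (tokens remaining, time of last update)
        self._buckets: dict[str, tuple[float, float]] = {}

    def allow(self, tenant_id: str) -> bool:
        now = self.clock()
        tokens, last = self._buckets.get(tenant_id, (float(self.burst), now))
        # Refill for the time elapsed, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        allowed = tokens >= 1.0
        self._buckets[tenant_id] = (tokens - 1.0 if allowed else tokens, now)
        return allowed


limiter = TenantRateLimiter(rate=100.0, burst=200)
print(limiter.allow("acme-corp"))
```

Because each tenant has its own bucket, a batch import from one tenant is throttled without affecting the others.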

Ingestion Validation Failures

Symptoms:
- 400 Bad Request errors
- Schema validation failures in logs
- Rejected events

Diagnosis:
# Check validation error logs
kubectl logs deployment/ingestion -n atp-ingest-ns \
    | grep "ValidationException"

# Sample error:
# "ValidationException: Required field 'actor.id' missing in event"

Common Causes:
1. Client sending invalid schema
2. Schema version mismatch
3. Required field missing
4. Data type mismatch

Solutions:
# 1. Review schema documentation
# - Check OpenAPI spec: /api/v1/swagger

# 2. Provide client with correct schema
# - Send link to contract documentation

# 3. Add schema migration (if legitimate change)
# - Update schema validator to accept old + new formats

# 4. Check for API version mismatch
# - Verify client using correct API version
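When helping a client debug rejected events, it can be useful to reproduce the required-field check locally. The Python sketch below reports missing dotted paths such as `actor.id` from the sample error above. The `REQUIRED` list is hypothetical apart from `actor.id`; the authoritative list should come from the OpenAPI spec.

```python
def missing_fields(event: dict, required: list[str]) -> list[str]:
    """Return the dotted paths in `required` that are absent or None in `event`."""
    missing = []
    for path in required:
        node = event
        for key in path.split("."):
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                missing.append(path)
                break
    return missing


# Hypothetical list for illustration; only 'actor.id' appears in the text above.
REQUIRED = ["actor.id", "action"]

bad_event = {"actor": {"name": "alice"}, "action": "login"}
print(missing_fields(bad_event, REQUIRED))
```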

Outbox Backlog Growing
Idempotency Conflicts
Policy Evaluation Timeouts

Code Examples: - Service-specific diagnostics (all ATP services) - Resolution procedures - Common fix scripts

Diagrams: - Service troubleshooting flow - Component dependencies

Deliverables: - Service troubleshooting guide (8 services) - Diagnostic procedures - Fix library

Topic 14: Query Service Issues

What will be covered: - Slow Query Performance - Projection Lag High - Cache Miss Rate High - Search Index Unavailable - Cross-Tenant Data Leakage (Security)

Code Examples: - Query optimization procedures - Cache troubleshooting - Security validation

Deliverables: - Query service operations guide

CYCLE 8: Database Operations (~3,500 lines)¶

Topic 15: Database Health Monitoring¶

What will be covered: - Azure SQL Database Metrics - Connection Pool Management - Query Performance Monitoring - Database Deadlocks - Index Fragmentation

Code Examples: - Database health queries - Performance diagnostics - Index maintenance scripts

Deliverables: - Database operations guide

Topic 16: Database Emergency Procedures¶

What will be covered: - Database Failover - Connection String Rotation - Emergency Scaling - Backup Restoration

Code Examples: - Failover procedures - Emergency scripts

Deliverables: - Emergency database procedures

CYCLE 9: Messaging & Event Bus Operations (~4,000 lines)¶

Topic 17: Service Bus Health¶

What will be covered: - Azure Service Bus Monitoring

# Check queue/topic depth
az servicebus queue show \
    --resource-group atp-prod-rg \
    --namespace-name sb-atp-prod \
    --name projection-queue \
    --query "countDetails"

# Check dead-letter count (the DLQ is a sub-queue, so read it from the parent queue)
az servicebus queue show \
    --resource-group atp-prod-rg \
    --namespace-name sb-atp-prod \
    --name projection-queue \
    --query "countDetails.deadLetterMessageCount"

# List active subscriptions
az servicebus topic subscription list \
    --resource-group atp-prod-rg \
    --namespace-name sb-atp-prod \
    --topic-name audit.appended.v1

Message Backlog Handling
Consumer Lag Monitoring
Topic/Queue Throttling
Connection Issues

Code Examples: - Service Bus diagnostics - Backlog management - Throttling mitigation

Diagrams: - Message flow monitoring - Backlog handling

Deliverables: - Messaging operations guide - Troubleshooting procedures

Topic 18: Event Flow Troubleshooting¶

What will be covered: - Event Not Published (Outbox Stuck) - Event Not Received (Consumer Down) - Duplicate Events (Idempotency) - Event Ordering Issues - Schema Version Mismatch

Code Examples: - Event flow diagnostics - Replay procedures

Deliverables: - Event troubleshooting guide

CYCLE 10: DLQ Management & Replay (~4,000 lines)¶

Topic 19: Dead Letter Queue (DLQ) Triage¶

What will be covered: - DLQ Workflow

flowchart TD
    A[Message in DLQ] --> B[Inspect Message]
    B --> C{Classify Failure}
    C -->|Schema Error| D[Fix Schema Mapping]
    C -->|Auth/Permission| E[Fix Credentials]
    C -->|Transient Error| F[Immediate Replay]
    C -->|Business Logic| G[Fix Code Bug]
    C -->|Malicious/Invalid| H[Quarantine]
    D --> I[Test in Sandbox]
    E --> I
    F --> J[Replay to Topic]
    G --> K[Deploy Fix]
    K --> I
    I --> J
    H --> L[Document & Skip]
    J --> M[Monitor Success]
    L --> M

DLQ Inspection Commands

# Count DLQ messages (Azure CLI); the DLQ is a sub-queue of projection-queue
az servicebus queue show \
    --resource-group atp-prod-rg \
    --namespace-name sb-atp-prod \
    --name projection-queue \
    --query "countDetails.deadLetterMessageCount"

# Peek or receive DLQ messages
# - The Azure CLI manages Service Bus entities only; it has no commands to
#   peek or receive message bodies
# - Use Service Bus Explorer in the Azure Portal, or an SDK client against the
#   sub-queue path 'projection-queue/$DeadLetterQueue' (quote the path so the
#   shell does not expand $DeadLetterQueue)
# - Prefer peek over receive: receive-and-delete removes the message from the DLQ

# Or use ATP admin CLI
atp-admin dlq list --consumer projection-worker --limit 50
atp-admin dlq inspect --consumer projection-worker --message-id <id>

DLQ Classification

Failure Reason: DeliveryCountExceeded
- Message failed max retries (default: 10)
- Indicates: Persistent handler failure
- Action: Fix code/config, then replay

Failure Reason: TTLExpiredException
- Message exceeded time-to-live
- Indicates: Long queue backlog, slow processing
- Action: Increase TTL or scale consumers

Failure Reason: MessageLockLostException
- Processing took longer than lock duration
- Indicates: Slow handler, long transactions
- Action: Optimize handler, increase lock duration

Failure Reason: UnauthorizedException
- Consumer lacks permission
- Indicates: RBAC/credential issue
- Action: Fix managed identity permissions

Failure Reason: SerializationException
- Cannot deserialize message
- Indicates: Schema version mismatch
- Action: Add schema migration, replay
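
The classification above can be captured as a small lookup so that triage scripts and humans agree on the next step. This is a minimal sketch: the function name `dlq_action` and the action labels are illustrative assumptions, not part of `atp-admin` or any Azure tool.

```shell
#!/bin/sh
# Map a dead-letter failure reason to the triage action listed above.
# dlq_action and its output labels are hypothetical names for illustration.
dlq_action() {
    case "$1" in
        DeliveryCountExceeded)    echo "fix-code-or-config-then-replay" ;;
        TTLExpiredException)      echo "increase-ttl-or-scale-consumers" ;;
        MessageLockLostException) echo "optimize-handler-or-increase-lock-duration" ;;
        UnauthorizedException)    echo "fix-managed-identity-permissions" ;;
        SerializationException)   echo "add-schema-migration-then-replay" ;;
        *)                        echo "manual-triage" ;;  # unknown reasons go to a human
    esac
}

dlq_action "TTLExpiredException"
```

Unrecognized reasons fall through to manual triage rather than to an automatic replay, which keeps the default safe.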

Safe Replay Procedure

# 1. Identify root cause and fix
# - Deploy code fix
# - OR update configuration
# - OR fix credentials

# 2. Test in sandbox (non-prod)
atp-admin dlq replay \
    --consumer projection-worker \
    --message-id <id> \
    --environment dev \
    --dry-run

# 3. Replay to production (small batch first)
atp-admin dlq replay \
    --consumer projection-worker \
    --filter "reason=DeliveryCountExceeded" \
    --max-count 10 \
    --confirm

# 4. Monitor for success
# - Check projection lag returns to normal
# - No new DLQ entries
# - Health checks pass

# 5. Replay remaining messages
atp-admin dlq replay \
    --consumer projection-worker \
    --filter "reason=DeliveryCountExceeded" \
    --max-count 1000 \
    --confirm

# 6. Document in incident ticket
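
The gate between the small batch (step 3) and the full replay (step 5) can be made explicit by comparing DLQ depth before and after the batch. The sketch below assumes that a replayed message leaves the DLQ and only returns if it fails again; `replay_gate` is a hypothetical helper, and the depth values would come from whatever monitoring source is in use.

```shell
#!/bin/sh
# Decide whether to continue after a small replay batch.
# Arguments: DLQ depth before the batch, batch size, DLQ depth after the batch.
# Assumption: a successfully replayed message does not reappear in the DLQ.
replay_gate() {
    before=$1; batch=$2; after=$3
    expected=$((before - batch))
    if [ "$expected" -lt 0 ]; then expected=0; fi
    if [ "$after" -le "$expected" ]; then
        echo "proceed"   # batch drained and no new dead letters arrived
    else
        echo "halt"      # messages came back or new failures appeared
    fi
}

replay_gate 500 10 490
replay_gate 500 10 497
```

A "halt" result means the root cause is probably not fixed yet, so the remaining messages should stay in the DLQ until it is.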

Code Examples: - Complete DLQ management procedures - Classification logic - Replay automation

Diagrams: - DLQ triage workflow - Replay safety checks

Deliverables: - DLQ management guide - Replay procedures - Classification rules

Topic 20: Message Replay Scenarios¶

What will be covered: - Replay from Outbox (Re-publish failed events) - Replay from Event Store (Rebuild projections) - Selective Replay (Specific tenant/time range) - Dry-Run Replay (Test without applying)

Code Examples: - Replay scenarios - Safety procedures

Deliverables: - Replay cookbook

CYCLE 11: Deployment Procedures (~3,500 lines)¶

Topic 21: Standard Deployment¶

What will be covered: - Pre-Deployment Checklist

☐ All tests passed in CI/CD pipeline
☐ Code review approved (2+ reviewers)
☐ Security scan passed (no critical vulnerabilities)
☐ Performance testing completed
☐ Database migrations reviewed (if any)
☐ Configuration changes documented
☐ Rollback plan prepared
☐ Change Advisory Board (CAB) approved (for production)
☐ Stakeholders notified (if customer-facing change)
☐ Monitoring dashboards ready
☐ On-call engineer briefed

Deployment Steps (Helm)

# 1. Verify current state
helm list -n atp-system
helm status atp -n atp-system

# 2. Backup current configuration
helm get values atp -n atp-system > backup-values-$(date +%Y%m%d-%H%M%S).yaml

# 3. Review changes
helm diff upgrade atp ./charts/atp \
    --namespace atp-system \
    --values values.prod.yaml \
    --values values.us.yaml

# 4. Deploy with canary (progressive rollout)
helm upgrade atp ./charts/atp \
    --namespace atp-system \
    --values values.prod.yaml \
    --values values.us.yaml \
    --set image.tag=1.2.4 \
    --set canary.enabled=true \
    --set canary.weight=5 \
    --wait --timeout 10m

# 5. Monitor canary (15 minutes)
# - Watch metrics dashboard
# - Check error rate, latency
# - Review logs for exceptions

# 6. Increase canary weight (if healthy)
helm upgrade atp ./charts/atp \
    --namespace atp-system \
    --reuse-values \
    --set canary.weight=20 \
    --wait

# Monitor... then 50%... then 100%

# 7. Promote canary to stable
helm upgrade atp ./charts/atp \
    --namespace atp-system \
    --reuse-values \
    --set canary.enabled=false \
    --wait

# 8. Post-deployment validation
# - Run smoke tests
# - Check all health endpoints
# - Verify key metrics stable
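
The progressive rollout in steps 4 to 7 follows a fixed weight ladder (5, 20, 50, 100) with a health check between stages. A minimal sketch of that decision is shown here; `next_canary_step` is a hypothetical helper, and how "healthy" is determined (error rate, latency, logs) is left to the monitoring step above.

```shell
#!/bin/sh
# Given the current canary weight and a health verdict, print the next action:
# the next weight to set, "promote" once at 100, or "rollback" when unhealthy.
next_canary_step() {
    weight=$1; health=$2
    if [ "$health" != "healthy" ]; then
        echo "rollback"
        return 0
    fi
    case "$weight" in
        5)   echo 20 ;;
        20)  echo 50 ;;
        50)  echo 100 ;;
        100) echo "promote" ;;
        *)   echo "unknown weight: $weight" >&2; return 1 ;;
    esac
}

next_canary_step 5 healthy
next_canary_step 20 unhealthy
```

Rejecting weights outside the ladder avoids silently continuing from a state that someone set by hand.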

Deployment with FluxCD (GitOps)

# 1. Update Git repository
git checkout -b release/v1.2.4

# 2. Update image tag in values
sed -i 's/tag: 1.2.3/tag: 1.2.4/' charts/atp/values.prod.yaml

# 3. Commit and push
git add charts/atp/values.prod.yaml
git commit -m "Release v1.2.4 to production"
git push origin release/v1.2.4

# 4. Create PR and get approval
# - PR review by 2+ engineers
# - Automated checks pass

# 5. Merge to main
git checkout main
git merge release/v1.2.4
git push origin main

# 6. FluxCD automatically reconciles (within 1 minute)
flux get kustomizations
flux get helmreleases -n atp-system

# 7. Monitor deployment
kubectl get pods -n atp-ingest-ns --watch
flux logs --follow

Code Examples: - Complete deployment procedures - Validation scripts - Smoke tests

Diagrams: - Deployment workflow - Progressive rollout stages

Deliverables: - Deployment procedures - Validation checklists - Smoke test suite

Topic 22: Emergency Deployments (Hotfix)¶

What will be covered: - Hotfix Criteria - Expedited Approval Process - Fast-Track Deployment - Post-Hotfix Validation

Code Examples: - Hotfix procedures

Deliverables: - Hotfix guide

CYCLE 12: Rollback Procedures (~3,000 lines)¶

Topic 23: Rollback Decision Making¶

What will be covered: - When to Rollback - Rollback vs. Fix Forward - Impact Assessment - Approval Requirements

Code Examples: - Decision matrix

Deliverables: - Rollback decision guide

Topic 24: Rollback Execution¶

What will be covered: - Helm Rollback

# 1. Check release history
helm history atp -n atp-system

# Output:
# REVISION  UPDATED                   STATUS      CHART       DESCRIPTION
# 1         Tue Oct 28 10:00:00 2025  superseded  atp-1.2.3   Install complete
# 2         Wed Oct 29 14:30:00 2025  superseded  atp-1.2.4   Upgrade complete
# 3         Thu Oct 30 09:15:00 2025  deployed    atp-1.2.5   Upgrade complete

# 2. Rollback to previous version (revision 2)
helm rollback atp 2 -n atp-system --wait --timeout 5m

# 3. Verify rollback
helm list -n atp-system
kubectl get pods -n atp-ingest-ns

# 4. Check health
kubectl get pods -n atp-ingest-ns --watch
curl https://api.atp.example.com/health

# 5. Monitor metrics (15 minutes)
# - Error rate should drop
# - Latency should normalize
# - No new alerts

Kubernetes Rollback

# Rollback deployment
kubectl rollout undo deployment/ingestion -n atp-ingest-ns

# Rollback to specific revision
kubectl rollout history deployment/ingestion -n atp-ingest-ns
kubectl rollout undo deployment/ingestion -n atp-ingest-ns --to-revision=5

# Watch rollback progress
kubectl rollout status deployment/ingestion -n atp-ingest-ns

Database Migration Rollback

# Rollback FluentMigrator migration
dotnet run -- --task migrate:rollback --version 20251029143000

# OR use migration tool
migrate -database "sqlserver://..." -path ./migrations down 1

Code Examples: - Complete rollback procedures (all scenarios) - Verification steps - Communication templates

Diagrams: - Rollback workflow - Verification steps

Deliverables: - Rollback procedures - Verification guide - Communication templates

CYCLE 13: Configuration Changes (~3,000 lines)¶

Topic 25: Safe Configuration Updates¶

What will be covered: - ConfigMap Updates

# 1. Backup current ConfigMap
kubectl get configmap ingestion-config -n atp-ingest-ns -o yaml \
    > backup-ingestion-config-$(date +%Y%m%d-%H%M%S).yaml

# 2. Edit ConfigMap
kubectl edit configmap ingestion-config -n atp-ingest-ns

# 3. Restart pods to pick up changes
kubectl rollout restart deployment/ingestion -n atp-ingest-ns

# 4. Monitor for issues
kubectl logs -f deployment/ingestion -n atp-ingest-ns

# 5. Rollback if issues
kubectl apply -f backup-ingestion-config-20251030-143000.yaml
kubectl rollout restart deployment/ingestion -n atp-ingest-ns

Feature Flag Changes
Rate Limit Adjustments
Logging Level Changes

Code Examples: - Configuration update procedures - Validation scripts

Deliverables: - Configuration change guide

Topic 26: Azure App Configuration Updates¶

What will be covered: - Updating Feature Flags - Configuration Refresh - Rollback Configuration - Configuration Audit Trail

Code Examples: - App Configuration procedures

Deliverables: - App Configuration guide

CYCLE 14: Key Rotation & Secret Management (~3,500 lines)¶

Topic 27: Routine Key Rotation¶

What will be covered: - Rotation Schedule

Monthly Rotation:
- Database passwords (connection strings)
- Service Bus connection strings
- Redis authentication keys

Quarterly Rotation:
- API keys (third-party integrations)
- Webhook HMAC secrets
- Application secrets

Annual Rotation:
- TLS certificates (if not auto-renewed)
- Root signing keys (with dual-key overlap)

On-Demand Rotation:
- Security incident (immediate)
- Employee departure (within 24 hours)
- Suspected compromise (immediate)

Database Password Rotation

# 1. Generate new password
NEW_PASSWORD=$(openssl rand -base64 32)

# 2. Add new password to Key Vault
az keyvault secret set \
    --vault-name atp-kv-prod \
    --name DatabasePassword-New \
    --value "$NEW_PASSWORD"

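# 3. Update the SQL login with the new password
# - The command below resets the server admin login; if the application uses a
#   dedicated SQL login or contained user, run ALTER LOGIN / ALTER USER instead
# - New connections that still use the old password fail until step 5 completes
az sql server update \
    --resource-group atp-prod-rg \
    --name atp-sql-prod \
    --admin-password "$NEW_PASSWORD"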

# 4. Update SecretProviderClass to use new secret
kubectl edit secretproviderclass atp-kv-secrets -n atp-ingest-ns
# Change: DatabasePassword → DatabasePassword-New

# 5. Restart pods to mount new secret
kubectl rollout restart deployment/ingestion -n atp-ingest-ns

# 6. Wait for pods to be healthy
kubectl get pods -n atp-ingest-ns --watch

# 7. Verify connectivity
kubectl logs deployment/ingestion -n atp-ingest-ns | grep "Database connection"

# 8. Delete old secret (after 24 hour overlap)
az keyvault secret delete \
    --vault-name atp-kv-prod \
    --name DatabasePassword

# 9. Document rotation in audit log

Signing Key Rotation (Zero-Downtime)
Certificate Rotation
Emergency Rotation (Suspected Compromise)

Code Examples: - Complete rotation procedures (all secret types) - Zero-downtime patterns - Emergency protocols

Diagrams: - Rotation workflow - Dual-key overlap - Emergency rotation

Deliverables: - Key rotation procedures (all types) - Emergency rotation guide - Audit procedures

Topic 28: Secret Compromise Response¶

What will be covered: - Detection - Containment - Rotation - Investigation - Prevention

Code Examples: - Compromise response procedures

Deliverables: - Security incident guide

CYCLE 15: Scaling & Capacity Management (~3,500 lines)¶

Topic 29: Manual Scaling¶

What will be covered: - Scale Pods

# Scale deployment manually
kubectl scale deployment/query -n atp-query-ns --replicas=20

# Verify scaling
kubectl get pods -n atp-query-ns --watch

# Check if HPA will override (disable HPA temporarily if needed)
kubectl get hpa -n atp-query-ns
kubectl delete hpa query-hpa -n atp-query-ns  # Temporary removal

Scale Nodes (AKS)

# Scale node pool manually
az aks nodepool scale \
    --resource-group atp-aks-prod-rg \
    --cluster-name atp-aks-useast-prod \
    --name npgeneric \
    --node-count 20

# Verify nodes
kubectl get nodes

Scale Database

# Scale Azure SQL Database
az sql db update \
    --resource-group atp-prod-rg \
    --server atp-sql-prod \
    --name atp-db \
    --service-objective P6  # Scale up

# Monitor scaling progress
az sql db show \
    --resource-group atp-prod-rg \
    --server atp-sql-prod \
    --name atp-db \
    --query "status"

Code Examples: - Manual scaling procedures - Verification steps

Deliverables: - Scaling procedures

Topic 30: Capacity Planning¶

What will be covered: - Capacity Metrics - Growth Forecasting - Resource Planning - Cost Optimization

Code Examples: - Capacity analysis - Forecasting models

Deliverables: - Capacity planning guide

CYCLE 16: Performance Troubleshooting (~4,000 lines)¶

Topic 31: Performance Investigation¶

What will be covered: - Identifying Performance Bottlenecks - Database Query Optimization - Memory Leak Detection - CPU Profiling - Network Latency Analysis

Code Examples: - Performance diagnostics - Profiling tools

Deliverables: - Performance troubleshooting guide

Topic 32: Performance Optimization¶

What will be covered: - Cache Tuning - Connection Pool Optimization - Query Optimization - Resource Limit Tuning

Code Examples: - Optimization procedures

Deliverables: - Optimization cookbook

CYCLE 17: Security Incident Response (~4,500 lines)¶

Topic 33: Security Breach Response¶

What will be covered: - Breach Detection

Security Alert Types:
- Unauthorized access attempt
- Privilege escalation
- Data exfiltration
- Tamper detection
- DDoS attack
- Credential compromise

Immediate Actions (SIEM Alert)

⚠️ SECURITY INCIDENT - IMMEDIATE ACTIONS

Within 15 minutes:
☐ Page Security team
☐ Create dedicated war room: #security-incident-YYYYMMDD
☐ Freeze all deployments (emergency freeze)
☐ Capture evidence (logs, metrics, network traffic)
☐ Isolate affected systems (network policies)
☐ Revoke compromised credentials
☐ Notify CISO and Legal
☐ Begin forensic investigation

DO NOT:
❌ Delete logs or evidence
❌ Restart services (a restart destroys the in-memory state needed for memory dumps)
❌ Notify customers (until Legal/Communications approval)
❌ Share details publicly

Tamper Detection Response

# Tamper Alert Fired
# 1. Freeze purge/export operations
atp-admin integrity freeze --reason "tamper-investigation"

# 2. Retrieve integrity proofs
atp-admin integrity verify \
    --segment-id <segment-id> \
    --include-proof \
    --output tamper-evidence-$(date +%Y%m%d-%H%M%S).json

# 3. Offline verification
# - Download proof bundle
# - Verify Merkle root
# - Verify digital signature
# - Compare with stored record

# 4. If tamper confirmed
# - Escalate to SEV-1
# - Engage Security + Compliance
# - Preserve all evidence
# - Notify affected customers (Legal approval)

# 5. If false positive
# - Document root cause
# - Tune detection thresholds
# - Unfreeze operations
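
To illustrate what the offline verification in step 3 does in principle, here is a sketch that checks a simple linear hash chain, where each link is the SHA-256 of the previous link concatenated with the record payload. This is a stand-in technique, not ATP's proof format: the real bundle uses Merkle roots and digital signatures, whose layout is not described here. The function names and the tab-separated input format are assumptions, and the script requires `sha256sum` from GNU coreutils.

```shell
#!/bin/sh
# Compute one chain link: sha256(previous_link + payload), as lowercase hex.
chain_link() {
    printf '%s%s' "$1" "$2" | sha256sum | cut -d' ' -f1
}

# Verify a chain read from stdin as "payload<TAB>stored_link" lines.
# $1 is the genesis value that the first link was computed from.
# Returns non-zero at the first record whose stored link does not match.
verify_chain() {
    prev=$1
    n=0
    while IFS="$(printf '\t')" read -r payload stored; do
        n=$((n + 1))
        expected=$(chain_link "$prev" "$payload")
        if [ "$expected" != "$stored" ]; then
            echo "mismatch at record $n"
            return 1
        fi
        prev=$stored
    done
    echo "chain ok ($n records)"
}

# Build a two-record chain and verify it.
genesis="genesis"
link1=$(chain_link "$genesis" "event-a")
link2=$(chain_link "$link1" "event-b")
printf '%s\t%s\n%s\t%s\n' "event-a" "$link1" "event-b" "$link2" | verify_chain "$genesis"
```

Because every link depends on all earlier payloads, changing any stored record invalidates that record's link and every link after it, which is the property the tamper alert relies on.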

Data Exfiltration Response
Credential Compromise Response
DDoS Mitigation

Code Examples: - Complete security procedures - Forensic collection - Containment automation

Diagrams: - Security incident flow - Containment procedures

Deliverables: - Security incident guide - Forensic procedures - Containment playbooks

Topic 34: Post-Breach Recovery¶

What will be covered: - System Hardening - Credential Rotation (All) - Audit Trail Review - Customer Notification

Code Examples: - Recovery procedures

Deliverables: - Recovery guide

CYCLE 18: Data Recovery & Backups (~3,500 lines)¶

Topic 35: Backup Procedures¶

What will be covered: - Database Backups - Blob Storage Backups - Configuration Backups - Backup Verification

Code Examples: - Backup automation

Deliverables: - Backup procedures

Topic 36: Restore Procedures¶

What will be covered: - Point-in-Time Recovery - Disaster Recovery - Cross-Region Restore - Data Validation After Restore

Code Examples: - Restore procedures

Deliverables: - Recovery guide

CYCLE 19: Compliance & Audit Procedures (~3,000 lines)¶

Topic 37: Compliance Operations¶

What will be covered: - Legal Hold Procedures - Data Retention Validation - Data Residency Verification - Right-to-Erasure (GDPR Article 17) - Compliance Audit Support

Code Examples: - Compliance procedures

Deliverables: - Compliance operations guide

Topic 38: Audit Trail Validation¶

What will be covered: - Integrity Verification - Chain Validation - Export for Auditors - Compliance Reporting

Code Examples: - Validation procedures

Deliverables: - Audit validation guide

CYCLE 20: Maintenance Windows (~3,000 lines)¶

Topic 39: Scheduled Maintenance¶

What will be covered: - Maintenance Planning - Customer Communication - Maintenance Execution - Post-Maintenance Validation

Code Examples: - Maintenance procedures

Deliverables: - Maintenance guide

Topic 40: Zero-Downtime Maintenance¶

What will be covered: - Rolling Node Updates - Database Maintenance - Index Optimization - Cache Maintenance

Code Examples: - Zero-downtime procedures

Deliverables: - Maintenance cookbook

CYCLE 21: Emergency Procedures (~4,000 lines)¶

Topic 41: Emergency Response¶

What will be covered: - Multi-Region Outage

🚨 EMERGENCY: COMPLETE REGION OUTAGE

Scenario: US-East region completely unavailable

Immediate Actions (within 30 minutes):
☐ Declare SEV-1 incident
☐ Page all on-call (primary, secondary, manager)
☐ Activate incident command
☐ Assess scope (all regions? single region?)
☐ Notify Customer Success
☐ Initiate failover to secondary region

Failover Procedure:
1. Update Azure Front Door backend pool
   - Remove failed region from rotation
   - Route all traffic to healthy region(s)

2. Verify secondary region capacity
   - Check pod counts, node counts
   - Scale up if needed

3. Monitor secondary region closely
   - Increased load may cause cascading issues
   - Watch error rates, latency, resource usage

4. Communicate to customers
   - Post on status page
   - Email affected customers
   - Update ETA every hour

5. Investigation (parallel track)
   - Azure Service Health
   - Azure Support ticket
   - Internal root cause analysis

6. Recovery
   - When primary region restored
   - Gradual traffic shift back (20% → 50% → 100%)
   - Monitor for residual issues

7. Postmortem
   - Document timeline
   - Identify improvements
   - Update DR procedures

Complete Data Loss Scenario

🚨 EMERGENCY: DATA LOSS SUSPECTED

DO NOT PANIC. ATP has multiple protection layers.

Immediate Actions:
☐ Freeze ALL write operations (emergency maintenance mode)
☐ Page Security + Compliance teams
☐ Preserve all evidence (logs, snapshots, backups)
☐ Begin forensic investigation

Investigation:
1. Verify integrity proofs
   - Check hash chains
   - Verify Merkle trees
   - Validate digital signatures

2. Check all backup sources
   - Azure SQL point-in-time restore points
   - Blob Storage WORM containers
   - Geo-replicated backups

3. Assess scope
   - Which tenants affected?
   - Which time range?
   - What data types?

4. Recovery
   - Restore from latest valid backup
   - Verify integrity of restored data
   - Replay events if needed

5. Customer Notification
   - Legal approval required
   - Prepare incident report
   - Offer credit/compensation

6. Regulatory Notification
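   - GDPR: notify the supervisory authority within 72 hours of becoming aware
   - HIPAA: notify no later than 60 days after discovery
   - SOC 2: no statutory deadline (attestation framework, not a regulator); notify per customer contracts and inform the auditor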

Security Breach Emergency
Compliance Violation Emergency
Cascade Failure

Code Examples: - Emergency procedures (all scenarios) - Failover automation - Communication templates

Diagrams: - Emergency response flow - Failover procedures - Recovery timeline

Deliverables: - Emergency procedure manual - Failover guide - Crisis communication templates

Topic 42: Business Continuity¶

What will be covered: - Disaster Recovery Drills - Failover Testing - RTO/RPO Validation - Recovery Verification

Code Examples: - DR drill procedures

Deliverables: - Business continuity guide

CYCLE 22: Contacts, Escalation & Knowledge Base (~3,000 lines)¶

Topic 43: Contact Directory¶

What will be covered: - On-Call Schedules

Primary On-Call:
- PagerDuty: https://connectsoft.pagerduty.com/schedules#ATP-PRIMARY
- Rotation: Weekly (Monday 9 AM → Monday 9 AM)
- Coverage: 24/7
- Response: <5 min (SEV-1), <15 min (SEV-2)

Secondary On-Call:
- PagerDuty: https://connectsoft.pagerduty.com/schedules#ATP-SECONDARY
- Escalation: After 30 min (SEV-1), 1 hour (SEV-2)

Manager On-Call:
- Escalation: After 1 hour (SEV-1), 2 hours (SEV-2)
- Major decisions, customer communication

Subject Matter Experts (SMEs):
- Database: @db-team (Slack: #atp-database)
- Security: @security-team (Slack: #atp-security)
- Compliance: @compliance-team (Email: compliance@)
- Networking: @network-team (Slack: #atp-networking)
- Azure Infrastructure: @cloud-team (Slack: #atp-cloud)

Escalation Matrix

| Level | Role | Escalate After | Contact Method |
|-------|------|----------------|----------------|
| L1 | Platform Engineer | N/A | PagerDuty auto-page |
| L2 | Senior SRE | 30 min (SEV-1), 1h (SEV-2) | PagerDuty escalation |
| L3 | Engineering Manager | 1h (SEV-1), 2h (SEV-2) | PagerDuty + Phone |
| L4 | VP Engineering | 2h (SEV-1), Major outage | Phone + Email |
| L5 | CTO | Customer-impacting, 4h+ | Phone + Email |
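
The time-based part of the escalation matrix can be expressed as a lookup from severity and elapsed minutes to the level that should currently be engaged. This is a sketch under stated assumptions: `escalation_level` is a hypothetical helper, the 4h+ CTO threshold is applied to SEV-1 only, and the judgment-based triggers ("Major outage", "Customer-impacting") are not modeled.

```shell
#!/bin/sh
# Print the highest escalation level reached for a severity after N minutes.
# Thresholds follow the escalation matrix; SEV-2 stops at L3 because the
# matrix defines no time-based trigger beyond that for SEV-2.
escalation_level() {
    sev=$1; minutes=$2
    case "$sev" in
        SEV-1)
            if   [ "$minutes" -ge 240 ]; then echo L5
            elif [ "$minutes" -ge 120 ]; then echo L4
            elif [ "$minutes" -ge 60 ];  then echo L3
            elif [ "$minutes" -ge 30 ];  then echo L2
            else echo L1; fi ;;
        SEV-2)
            if   [ "$minutes" -ge 120 ]; then echo L3
            elif [ "$minutes" -ge 60 ];  then echo L2
            else echo L1; fi ;;
        *)
            echo "unsupported severity: $sev" >&2
            return 1 ;;
    esac
}

escalation_level SEV-1 45
escalation_level SEV-2 90
```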

External Contacts

Microsoft Azure Support:
- Portal: https://portal.azure.com -> Support + troubleshooting
- Phone: 1-800-867-1389
- Priority: Severity A (critical)

Customers (Major):
- Acme Corp: success@acmecorp.com, @acme-csm (Slack Connect)
- Contoso: support@contoso.com, @contoso-csm
- Fabrikam: ops@fabrikam.com, @fabrikam-csm

Vendors:
- HashiCorp (Vault): support@hashicorp.com
- Elastic (Search): support@elastic.co
- PagerDuty: support@pagerduty.com

Code Examples: - Contact templates - Escalation automation

Diagrams: - Escalation flow - Contact hierarchy

Deliverables: - Contact directory - Escalation procedures

Topic 44: Knowledge Base & Resources¶

What will be covered: - Internal Documentation

Architecture Docs: /docs/architecture/
API Documentation: https://api.atp.example.com/swagger
Confluence Wiki: https://connectsoft.atlassian.net/wiki/ATP
Runbook (this doc): /docs/operations/runbook.md
Postmortems: /docs/postmortems/

Training Resources
- Onboarding guide for new on-call engineers
- Video walkthroughs
- Practice scenarios

Runbook Updates
- Update after every postmortem
- Quarterly review
- Version in Git
- PR approval required

Code Examples: - Knowledge base structure - Update procedures

Diagrams: - Resource map

Deliverables: - Knowledge base index - Training materials - Update procedures

Summary of Deliverables¶

Across all 22 cycles, this documentation will provide:

Runbook Foundation
- Organization and navigation
- Quick reference cards
- Contact directory
Monitoring & Health
- Health check endpoints (all services)
- Dashboard access
- Metric interpretation
Incident Management
- Alert response procedures
- Incident lifecycle
- War room protocols
- Severity classification
- SLO monitoring
Troubleshooting
- Common problems catalog (50+ scenarios)
- Service-specific diagnostics (8 services)
- Database troubleshooting
- Messaging troubleshooting
- DLQ triage and replay
Operational Procedures
- Deployment procedures (standard, canary, blue-green)
- Rollback procedures (Helm, K8s, database)
- Configuration changes
- Key rotation (routine and emergency)
Capacity & Performance
- Manual scaling procedures
- Capacity planning
- Performance troubleshooting
- Optimization techniques
Security & Compliance
- Security incident response
- Tamper investigation
- Breach containment
- Compliance operations
- Audit validation
Recovery
- Backup procedures
- Restore procedures
- Disaster recovery
- Multi-region failover
Maintenance
- Scheduled maintenance
- Zero-downtime procedures
- Routine tasks
Emergency
- Emergency response procedures
- Crisis management
- Business continuity
Reference
- Complete contact directory
- Escalation matrix
- Knowledge base index
- Runbook maintenance

Related Documentation¶

Progressive Rollout: Deployment strategies
Kubernetes: K8s operations
GitOps: Deployment automation
Monitoring: Monitoring and alerting
Security: Security architecture
Key Rotation: Rotation procedures
Chaos Drills: Resilience testing
Disaster Recovery: DR procedures
Configuration: Configuration management

This runbook provides step-by-step procedures for running ATP in production: responding to incidents, troubleshooting issues, deploying changes, managing capacity, rotating secrets, investigating security events, recovering from failures, maintaining compliance, and escalating appropriately. Every procedure is written to preserve ATP's core guarantees of tamper-evidence, tenant isolation, and regulatory compliance.
