
Chaos Engineering Drills - Audit Trail Platform (ATP)

Break it to validate it — ATP's chaos engineering drills systematically inject failures to validate resilience, test recovery procedures, and ensure SLO compliance under adverse conditions.


Purpose & Scope

This document defines chaos engineering and resilience testing procedures for the Audit Trail Platform (ATP). It establishes systematic fault injection experiments, disaster recovery drills, and SLO validation procedures to ensure ATP remains available, performant, and compliant under adverse conditions such as infrastructure failures, network partitions, resource exhaustion, and security incidents.

Key Chaos Engineering Principles

  • Hypothesis-Driven: Formulate hypothesis about system behavior, test with experiments
  • Production-Like: Test in staging with production-like load and configuration
  • Blast Radius Control: Limit impact to percentage of traffic or specific tenants
  • Observability: Monitor all metrics during experiments to detect issues
  • Gradual Rollout: Start small (1% traffic), increase if stable
  • Automated Rollback: Abort experiment if SLO violations detected
  • Learn and Improve: Document findings, update runbooks, improve resilience

What this document covers

  • Establish chaos engineering fundamentals: What it is, why ATP needs it, principles and methodology
  • Define chaos experiment framework: Hypothesis, steady state, blast radius, rollback, validation
  • Specify infrastructure chaos experiments: Pod failures, node failures, AZ outages, region failures
  • Document application chaos experiments: Service crashes, dependency failures, latency injection, error injection
  • Detail data and state experiments: Database failures, cache failures, message broker disruptions, storage outages
  • Describe network chaos experiments: Network partitions, packet loss, latency, DNS failures
  • Outline security chaos experiments: Authentication failures, authorization denials, encryption key unavailability
  • Specify disaster recovery drills: Full region failover, cluster rebuilding, data restoration
  • Document GameDays: Quarterly chaos exercises, multi-team coordination, incident response validation
  • Detail chaos automation: Chaos Mesh, Litmus Chaos, custom fault injection, CI/CD integration
  • Describe SLO validation: Ensure SLOs maintained during chaos, error budget tracking
  • Outline reporting and improvement: Experiment reports, findings documentation, resilience improvements

Out of scope (referenced elsewhere)

Readers & ownership

  • SRE/Operations (owners): Chaos experiment design, execution, GameDay coordination
  • Platform Engineering: Infrastructure resilience, automation, tooling
  • Backend Developers: Application resilience, circuit breakers, retry logic
  • Security Engineering: Security chaos experiments, attack simulation
  • Architects: Resilience architecture, failure mode analysis
  • Incident Response: DR drills, incident simulation, runbook validation

Artifacts produced

  • Chaos Experiment Catalog: All experiments with hypotheses, procedures, validation criteria
  • Experiment Runbooks: Step-by-step execution procedures for each experiment
  • Chaos Automation Scripts: Chaos Mesh experiments, Litmus workflows, custom injectors
  • GameDay Playbooks: Quarterly chaos exercise scenarios and coordination
  • SLO Validation Reports: Experiment results showing SLO compliance during chaos
  • Blast Radius Configurations: Traffic percentage, tenant selection, duration limits
  • Rollback Procedures: Automated and manual rollback for each experiment
  • Monitoring Dashboards: Real-time experiment monitoring, SLO tracking
  • Incident Simulations: Security incidents, data breaches, ransomware scenarios
  • DR Drill Reports: Disaster recovery validation results, RTO/RPO measurement
  • Findings Documentation: Lessons learned, resilience gaps, improvement actions
  • Resilience Scorecard: System resilience metrics, improvement tracking over time

Chaos Engineering Fundamentals

Purpose: Establish the foundational understanding of chaos engineering, its principles, methodology, and strategic application within ATP to build systematic confidence in system resilience through controlled failure injection.


What is Chaos Engineering?

Definition

Chaos Engineering is the discipline of experimenting on distributed systems in production (or production-like environments) to discover weaknesses and vulnerabilities before they manifest as customer-facing incidents. It is a systematic, proactive approach to building confidence in system resilience by intentionally injecting failures and observing system behavior under adverse conditions.

Core Concept

Traditional testing validates that systems work under ideal conditions. Chaos Engineering validates that systems work when things go wrong. Instead of waiting for production failures to reveal weaknesses, chaos engineers proactively introduce controlled failures to:

  • Discover unknown failure modes that don't appear in unit or integration tests
  • Validate resilience mechanisms (circuit breakers, retries, failovers) actually work
  • Test incident response procedures under realistic pressure
  • Build confidence that the system can handle real-world failures

Historical Context

Chaos Engineering originated at Netflix in 2010 with the development of Chaos Monkey, a tool that randomly terminated production instances to ensure services could handle instance failures. This evolved into the Simian Army (Chaos Gorilla, Latency Monkey, Conformity Monkey, etc.) and established chaos engineering as a discipline.

Since then, chaos engineering has been adopted by organizations worldwide, including:

  • Amazon Web Services: Chaos engineering for infrastructure resilience
  • Microsoft: Azure Chaos Studio for cloud resilience
  • Google: Site Reliability Engineering (SRE) chaos testing practices
  • LinkedIn: Chaos testing for distributed systems

Proactive vs Reactive Approach

| Approach | Traditional Testing | Chaos Engineering |
|----------|---------------------|-------------------|
| Timing | Before deployment | During operation |
| Focus | Does it work? | Does it still work when things break? |
| Discovery | Known failure modes | Unknown failure modes |
| Mindset | Reactive (fix after incident) | Proactive (prevent incidents) |
| Outcome | System works in ideal conditions | System works in adverse conditions |

Key Insight

"If something can go wrong, it will go wrong—chaos engineering helps you find it before your customers do."


Why Chaos Engineering for ATP?

High Availability Requirements

ATP has a 99.9% uptime SLA (less than 8.76 hours of downtime per year). This requires:

  • Resilient architecture: System must handle failures gracefully
  • Validated recovery: Failover and recovery mechanisms must be proven to work
  • Minimal impact: Failures must not cascade across services
  • Fast recovery: System must recover quickly from failures

Chaos engineering validates that ATP can meet these requirements under real-world failure conditions.
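
For reference, the downtime budget implied by the 99.9% target can be sanity-checked with a quick calculation (illustrative arithmetic only):

# Downtime budget for a 99.9% availability target
MINUTES_PER_YEAR=$((365 * 24 * 60))                           # 525600
YEARLY_BUDGET=$(echo "${MINUTES_PER_YEAR} * 0.001" | bc -l)   # ~525.6 minutes (~8.76 hours)
MONTHLY_BUDGET=$(echo "30 * 24 * 60 * 0.001" | bc -l)         # ~43.2 minutes per 30-day month
echo "Yearly budget: ${YEARLY_BUDGET} min, monthly budget: ${MONTHLY_BUDGET} min"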

Compliance and Audit Requirements

As an audit trail platform, ATP must maintain:

  • Data availability: Audit logs must remain accessible even during failures
  • Data integrity: No data loss during failures or recovery
  • Audit compliance: SOC 2, GDPR, HIPAA require evidence of resilience testing
  • Continuous operation: Audit data ingestion cannot be interrupted

Chaos engineering provides evidence that ATP maintains compliance during adverse conditions.

Multi-Tenancy Isolation

ATP serves multiple tenants with strict isolation requirements:

  • Tenant isolation: Failures affecting one tenant must not impact others
  • Resource isolation: Resource exhaustion in one tenant must not affect others
  • Data isolation: Tenant data must remain isolated during failures
  • SLA per tenant: Each tenant has independent SLA commitments

Chaos engineering validates that tenant isolation holds under failure conditions.

Distributed System Complexity

ATP is a microservices architecture with many failure modes:

  • Service-to-service communication: Network partitions, latency, failures
  • Service discovery: DNS failures, service registry unavailability
  • Load balancing: Traffic routing during failures
  • State management: Distributed state consistency during failures
  • Cascading failures: One service failure causing others to fail

Chaos engineering systematically tests these failure modes to prevent cascading failures.

Complex Dependencies

ATP depends on multiple Azure services with their own failure modes:

  • Azure Kubernetes Service (AKS): Node failures, pod scheduling, cluster failures
  • Azure SQL Database: Failover, connection pool exhaustion, query timeouts
  • Azure Service Bus: Topic unavailability, message broker failures
  • Azure Key Vault: Secret access failures, certificate expiration
  • Azure Blob Storage: Storage unavailability, replication lag
  • Azure Active Directory: Authentication failures, token refresh issues

Chaos engineering validates ATP's resilience to dependency failures.

Incident Preparedness

Chaos engineering validates incident response procedures:

  • Runbook validation: Do runbooks work under pressure?
  • Team coordination: Can teams respond effectively during incidents?
  • Communication: Is incident communication clear and timely?
  • Escalation: Do escalation paths work correctly?
  • Recovery procedures: Can the team recover within SLA targets?

Regular chaos experiments ensure teams are prepared for real incidents.

Business Value

Chaos engineering delivers measurable business value:

  • Reduced incidents: Find and fix issues before customers experience them
  • Faster recovery: Validated procedures enable faster incident resolution
  • Customer confidence: Proactive resilience testing builds trust
  • Cost savings: Preventing incidents is cheaper than fixing them
  • Compliance: Evidence of resilience testing supports compliance audits

Chaos Engineering Principles

The Principles of Chaos Engineering (established by Netflix) provide a framework for conducting safe, effective chaos experiments.

Principle 1: Build a Hypothesis Around Steady State

Definition

Before injecting failures, define what "normal" looks like. This is the steady state—the observable behavior of the system when operating correctly. Formulate a hypothesis about how the system will behave during the experiment.

Steady State Metrics

Steady state is defined by observable metrics that indicate normal operation:

  • Performance metrics: Latency (P50, P95, P99), throughput, response times
  • Reliability metrics: Error rates, success rates, availability
  • Resource metrics: CPU utilization, memory usage, network I/O
  • Business metrics: Transaction rates, revenue impact, user activity

Hypothesis Format

A hypothesis states the expected behavior during the experiment:

"When [failure condition], the system will [expected behavior], and steady state metrics will [expected metric values]."

Example Hypotheses

Example 1: Pod Failure

Hypothesis: "When 1 ingestion API pod crashes (out of 5 pods), 
the system will remain available, request success rate will stay >99.9%, 
and P95 latency will increase by <100ms."

Example 2: Database Failover

Hypothesis: "When Azure SQL primary database fails over to replica, 
the system will automatically reconnect, no requests will fail, 
and recovery time will be <30 seconds."

Example 3: Network Partition

Hypothesis: "When network partition isolates ingestion service from query service, 
ingestion will continue accepting events, query service will degrade gracefully 
(read-only mode), and no data will be lost."

Steady State Definition Example

# examples/steady-state-definition.yaml
apiVersion: chaos.atp.connectsoft.io/v1alpha1
kind: SteadyState
metadata:
  name: atp-ingestion-api-steady-state
  namespace: chaos-testing
spec:
  service: atp-ingestion-api
  metrics:
    performance:
      - name: p95_latency_ms
        threshold: 200  # ms
        operator: "<"
      - name: p99_latency_ms
        threshold: 500  # ms
        operator: "<"
      - name: throughput_events_per_sec
        threshold: 10000
        operator: ">="
    reliability:
      - name: request_success_rate
        threshold: 99.9  # %
        operator: ">="
      - name: error_rate
        threshold: 0.1  # %
        operator: "<="
    resource:
      - name: cpu_utilization_percent
        threshold: 80  # %
        operator: "<"
      - name: memory_utilization_percent
        threshold: 85  # %
        operator: "<"
    business:
      - name: events_ingested_per_minute
        threshold: 600000
        operator: ">="
      - name: tenant_isolation_maintained
        threshold: true
        operator: "=="

Principle 2: Vary Real-World Events

Definition

Chaos experiments should simulate realistic failures that occur in production, not artificial or unrealistic scenarios. Base experiments on:

  • Historical incidents: What failures have happened before?
  • Common failure modes: What failures are most likely?
  • Dependency failures: How do external dependencies fail?
  • Infrastructure failures: What infrastructure failures occur?

Real-World Event Categories

| Category | Examples |
|----------|----------|
| Infrastructure | Pod crashes, node failures, AZ outages, region failures |
| Network | Network partitions, packet loss, latency spikes, DNS failures |
| Dependencies | Database failures, cache failures, message broker issues |
| Resource | CPU exhaustion, memory exhaustion, disk full |
| Application | Service crashes, slow responses, error injection |
| Security | Authentication failures, certificate expiration, key unavailability |

Avoid Artificial Scenarios

Bad: "Kill all pods simultaneously" ✅ Good: "Kill 1 pod at a time, observe recovery"

Bad: "Disconnect all network connections" ✅ Good: "Partition network between two services"

Bad: "Set latency to 24 hours" ✅ Good: "Set latency to 500ms (realistic network delay)"

Historical Incident Analysis

Use incident post-mortems to identify chaos experiment scenarios:

## Example: Incident-Driven Chaos Experiment

**Incident**: On 2024-01-15, Azure SQL failover caused 2-minute downtime

**Root Cause**: Application didn't handle connection pool exhaustion during failover

**Chaos Experiment Created**:
- **Hypothesis**: "When Azure SQL fails over, application will retry connections and recover within 30 seconds"
- **Experiment**: Simulate database failover, monitor connection pool behavior
- **Validation**: Verify connection retry logic, measure recovery time

Principle 3: Run Experiments in Production (or Production-Like)

Definition

Chaos experiments must run in production-like environments to be meaningful. Testing in development or QA environments doesn't validate real-world resilience because:

  • Different configurations: Dev/test environments differ from production
  • Different load: Production load patterns aren't replicated
  • Different data: Production data volumes and patterns matter
  • Different dependencies: Production dependency configurations differ

ATP Strategy: Staging Environment

ATP runs chaos experiments in staging environment which is:

  • Production-like: Same configuration, same architecture, same scale
  • Realistic data: Production-scale data volumes and patterns for realistic testing
  • Isolated: No customer impact, safe for experimentation
  • Controlled: Blast radius limits prevent cascading failures

Production Experiments

After staging validation, limited production experiments may be conducted:

  • Small blast radius: 1% of traffic, specific test tenants only
  • Short duration: 30 seconds to 5 minutes
  • Gradual rollout: Start small, increase if stable
  • Automated rollback: Immediate abort on SLO violations
  • Approval required: CAB approval for production chaos experiments

Environment Comparison

Aspect Development Staging Production
Chaos Experiments ❌ No (different config) ✅ Yes (primary) ⚠️ Limited (after staging validation)
Blast Radius N/A Controlled Minimal (1%)
Duration N/A 5 min - 1 hour 30 sec - 5 min
Approval N/A SRE Team CAB Required

Principle 4: Automate Experiments to Run Continuously

Definition

Chaos experiments should be automated and run continuously to:

  • Prevent regressions: Detect when resilience is degraded
  • Continuous validation: Ensure resilience improvements are maintained
  • Reduce manual effort: Automate routine experiments
  • Scale testing: Run more experiments more frequently

Automation Levels

| Level | Description | ATP Usage |
|-------|-------------|-----------|
| Level 1: Manual | Execute experiments manually | Initial development, GameDays |
| Level 2: Scheduled | Cron-based execution | Weekly automated experiments |
| Level 3: CI/CD Integrated | Part of deployment pipeline | Staging pipeline resilience tests |
| Level 4: Continuous | Always-on low-level chaos | 1% traffic continuous chaos |

CI/CD Integration Example

# azure-pipelines/chaos-tests.yaml
trigger:
  branches:
    include:
      - main
      - staging

stages:
- stage: ChaosResilienceTests
  displayName: 'Chaos Resilience Tests'
  jobs:
  - job: PodFailureTest
    displayName: 'Pod Failure Resilience Test'
    steps:
    - task: Kubernetes@1
      inputs:
        connectionType: 'Kubernetes Service Connection'
        kubernetesServiceEndpoint: 'atp-staging-aks'
        namespace: 'chaos-testing'
        command: 'apply'
        arguments: '-f chaos-experiments/pod-failure-test.yaml'

    - script: |
        # Wait for experiment to complete
        kubectl wait --for=condition=complete \
          chaos/pod-failure-test -n chaos-testing --timeout=10m

        # Validate results
        ./scripts/validate-chaos-results.sh pod-failure-test

      displayName: 'Execute and Validate Pod Failure Test'

    - task: PublishTestResults@2
      inputs:
        testResultsFormat: 'JUnit'
        testResultsFiles: 'chaos-results/*.xml'
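
The validate-chaos-results.sh helper invoked in the pipeline above is not defined in this document. A minimal, hypothetical sketch of what such a helper might do, reusing the validate-experiment-results.sh script shown later in this topic and assuming hypothetical baselines/ and results/ JSON locations:

#!/bin/bash
# scripts/validate-chaos-results.sh -- hypothetical sketch: wraps experiment validation
# in a JUnit-style result file so the PublishTestResults task above can consume it.
set -euo pipefail

EXPERIMENT="${1:?usage: validate-chaos-results.sh <experiment-name>}"
mkdir -p chaos-results

# Hypothetical file layout: baselines/<experiment>.json and results/<experiment>.json
if ./scripts/validate-experiment-results.sh \
     --experiment "${EXPERIMENT}" \
     --baseline "baselines/${EXPERIMENT}.json" \
     --results "results/${EXPERIMENT}.json"; then
  FAILURES=0; BODY=""
else
  FAILURES=1; BODY='<failure message="steady-state thresholds violated"/>'
fi

cat > "chaos-results/${EXPERIMENT}.xml" <<EOF
<testsuite name="chaos" tests="1" failures="${FAILURES}">
  <testcase classname="chaos" name="${EXPERIMENT}">${BODY}</testcase>
</testsuite>
EOF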

Continuous Chaos

Continuous chaos runs low-level experiments continuously:

# chaos-experiments/continuous-pod-failure.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: continuous-pod-failure
  namespace: chaos-testing
spec:
  action: pod-kill
  mode: fixed-percent
  value: "1"  # 1% of pods
  selector:
    namespaces:
      - atp-ingest-ns
    labelSelectors:
      app: atp-ingest-api
  duration: "30s"
  scheduler:
    cron: "@every 1h"  # Run every hour
  abortRules:
    - name: error-rate-threshold
      condition: error_rate > 0.5%
      action: abort

Principle 5: Minimize Blast Radius

Definition

Blast radius is the scope of impact of a chaos experiment. Start with minimal blast radius and increase gradually only if the system remains stable.

Blast Radius Dimensions

| Dimension | Options | ATP Default |
|-----------|---------|-------------|
| Traffic Percentage | 1%, 5%, 10%, 25%, 50% | 1% (start), 5% (staging), 25% (GameDay) |
| Tenant Scope | All, specific tenants, test tenants | Test tenants only |
| Service Scope | All services, specific service, specific pod | Specific pod/service |
| Duration | 30s, 5min, 15min, 1hour | 5min (staging), 30s (production) |
| Geographic Scope | All regions, single region, single AZ | Single AZ (start) |

Gradual Rollout Strategy

graph LR
    START[Start Experiment] --> VAL1[1% Traffic<br/>30 seconds]
    VAL1 --> CHECK1{Stable?}
    CHECK1 -->|Yes| VAL2[5% Traffic<br/>5 minutes]
    CHECK1 -->|No| ABORT[Abort Experiment]
    VAL2 --> CHECK2{Stable?}
    CHECK2 -->|Yes| VAL3[10% Traffic<br/>15 minutes]
    CHECK2 -->|No| ABORT
    VAL3 --> CHECK3{Stable?}
    CHECK3 -->|Yes| COMPLETE[Experiment Complete]
    CHECK3 -->|No| ABORT

    style START fill:#FFE5B4
    style COMPLETE fill:#90EE90
    style ABORT fill:#FFB6C1
Hold "Alt" / "Option" to enable pan & zoom

Blast Radius Configuration Example

# examples/blast-radius-config.yaml
apiVersion: chaos.atp.connectsoft.io/v1alpha1
kind: BlastRadius
metadata:
  name: pod-failure-blast-radius
  namespace: chaos-testing
spec:
  experiment: pod-failure-test
  traffic:
    percentage: 1  # Start with 1%
    gradualIncrease: true
    increments: [1, 5, 10, 25]  # Gradual rollout percentages
  tenant:
    scope: test-tenants-only  # Only affect test tenants
    tenants:
      - test-tenant-001
      - test-tenant-002
  service:
    scope: single-pod
    namespace: atp-ingest-ns
    labelSelector:
      app: atp-ingest-api
  duration:
    initial: 30s  # Start with 30 seconds
    max: 5m  # Maximum 5 minutes
  autoAbort:
    enabled: true
    triggers:
      - metric: error_rate
        threshold: 1.0  # %
        operator: ">"
      - metric: p95_latency_ms
        threshold: 500  # ms
        operator: ">"
      - metric: request_success_rate
        threshold: 99.0  # %
        operator: "<"

Automatic Abort Criteria

Experiments automatically abort when:

  • Error rate exceeds threshold: >1% for staging, >0.5% for production
  • Latency degrades: P95 latency >500ms (staging), >300ms (production)
  • Success rate drops: Request success rate <99% (staging), <99.9% (production)
  • SLO violations: Any SLO violation detected
  • Manual abort: On-call engineer triggers manual abort
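
These thresholds are enforced declaratively by the abortRules and autoAbort configurations shown earlier. As an illustration of the same logic in script form, a minimal watchdog sketch (assuming the Prometheus endpoint and metric naming used elsewhere in this topic) could look like:

#!/bin/bash
# Hypothetical abort watchdog: polls the error rate and deletes the chaos resource
# if the staging threshold (>1%) is breached.
EXPERIMENT="${1:?experiment name required}"
NAMESPACE="chaos-testing"
PROM="http://prometheus.monitoring.svc.cluster.local:9090"
THRESHOLD=1.0   # % error rate (staging)

while kubectl get podchaos "${EXPERIMENT}" -n "${NAMESPACE}" >/dev/null 2>&1; do
  ERROR_RATE=$(curl -sG "${PROM}/api/v1/query" \
    --data-urlencode 'query=100 * rate(http_requests_total{service="atp-ingestion-api",status=~"5.."}[1m]) / rate(http_requests_total{service="atp-ingestion-api"}[1m])' \
    | jq -r '.data.result[0].value[1] // "0"')
  if (( $(echo "${ERROR_RATE} > ${THRESHOLD}" | bc -l) )); then
    echo "❌ Error rate ${ERROR_RATE}% exceeds ${THRESHOLD}% threshold, aborting experiment"
    kubectl delete podchaos "${EXPERIMENT}" -n "${NAMESPACE}"
    break
  fi
  sleep 10
done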

Chaos Engineering Methodology

The chaos engineering methodology provides a systematic 10-step process for conducting safe, effective chaos experiments.

Methodology Overview

graph TD
    STEP1[1. Define Steady State] --> STEP2[2. Formulate Hypothesis]
    STEP2 --> STEP3[3. Design Experiment]
    STEP3 --> STEP4[4. Set Blast Radius]
    STEP4 --> STEP5[5. Configure Rollback]
    STEP5 --> STEP6[6. Execute Experiment]
    STEP6 --> STEP7[7. Observe System]
    STEP7 --> STEP8[8. Validate Hypothesis]
    STEP8 --> STEP9[9. Document Findings]
    STEP9 --> STEP10[10. Improve Resilience]
    STEP10 --> STEP1

    style STEP1 fill:#E3F2FD
    style STEP6 fill:#FFF3E0
    style STEP10 fill:#E8F5E9
Hold "Alt" / "Option" to enable pan & zoom

Detailed Methodology Steps

Step 1: Define Steady State

Objective: Establish baseline metrics that represent normal system operation.

Activities:

  • Identify key performance metrics (latency, throughput, error rate)
  • Identify reliability metrics (availability, success rate)
  • Identify resource metrics (CPU, memory, network)
  • Identify business metrics (transactions, revenue impact)
  • Set threshold values for each metric
  • Measure baseline for 24-48 hours

Output: Steady state definition document with metrics and thresholds.

Step 2: Formulate Hypothesis

Objective: Predict how the system will behave during the experiment.

Activities:

  • Define the failure condition to inject
  • Predict system behavior (expected vs unexpected)
  • Specify metric changes (e.g., "latency will increase by <100ms")
  • Specify recovery behavior (e.g., "system will recover within 30 seconds")
  • Define success criteria (what validates the hypothesis)

Output: Hypothesis statement with expected behavior and validation criteria.

Step 3: Design Experiment

Objective: Create the experiment specification.

Activities:

  • Select chaos tool (Chaos Mesh, Litmus, custom)
  • Define experiment type (pod kill, network partition, etc.)
  • Specify target resources (pods, services, nodes)
  • Define injection parameters (duration, frequency, intensity)
  • Create experiment YAML/configuration
  • Review experiment with team

Output: Experiment specification (YAML, configuration file).

Step 4: Set Blast Radius

Objective: Limit the scope of impact to minimize risk.

Activities:

  • Set traffic percentage (start with 1%)
  • Select tenant scope (test tenants only)
  • Set duration (start with 30 seconds)
  • Configure geographic scope (single AZ)
  • Set service scope (single pod/service)
  • Enable gradual rollout if appropriate

Output: Blast radius configuration.

Step 5: Configure Rollback

Objective: Define automatic abort criteria and rollback procedures.

Activities:

  • Set automatic abort triggers (error rate, latency, SLO violations)
  • Define rollback procedure (remove fault injection, restore state)
  • Test rollback procedure
  • Configure monitoring alerts for abort triggers
  • Assign on-call engineer for manual abort capability

Output: Rollback configuration and procedures.

Step 6: Execute Experiment

Objective: Run the chaos experiment in staging environment.

Activities:

  • Notify team (Slack, email)
  • Start monitoring dashboards
  • Apply experiment YAML/configuration
  • Monitor experiment status
  • Observe system behavior in real-time
  • Be ready to abort if needed

Output: Experiment execution log.

Step 7: Observe System

Objective: Monitor all metrics, logs, and traces during the experiment.

Activities:

  • Monitor performance metrics (latency, throughput)
  • Monitor reliability metrics (error rate, success rate)
  • Monitor resource metrics (CPU, memory, network)
  • Review application logs for errors
  • Review distributed traces for bottlenecks
  • Monitor user impact (if applicable)

Output: Observation logs and metrics.

Step 8: Validate Hypothesis

Objective: Compare actual behavior to predicted behavior.

Activities:

  • Compare metrics to hypothesis predictions
  • Check if success criteria were met
  • Identify unexpected behaviors
  • Analyze root causes of any failures
  • Measure recovery time and behavior
  • Document deviations from hypothesis

Output: Hypothesis validation report.

Step 9: Document Findings

Objective: Record experiment results, findings, and lessons learned.

Activities:

  • Document experiment configuration
  • Document observed metrics and behavior
  • Document hypothesis validation results
  • Identify resilience gaps
  • Identify unexpected behaviors
  • Document lessons learned
  • Create improvement actions

Output: Experiment findings document.

Step 10: Improve Resilience

Objective: Address identified resilience gaps and improve system resilience.

Activities:

  • Prioritize improvement actions (impact × likelihood)
  • Implement resilience improvements
  • Update runbooks based on learnings
  • Update monitoring and alerting
  • Re-run experiment to validate improvements
  • Share learnings with team

Output: Resilience improvements implemented.


Chaos vs Other Testing

Chaos engineering complements other testing approaches but serves a unique purpose: validating resilience under failure conditions.

Testing Type Comparison

| Testing Type | Purpose | When | ATP Usage | What It Validates |
|--------------|---------|------|-----------|-------------------|
| Unit Tests | Validate individual components | Every build | CI/CD | Component correctness |
| Integration Tests | Validate service interactions | Every deployment | CI/CD | Service integration |
| Load Tests | Validate performance under load | Pre-release, Monthly | Performance testing | Performance at scale |
| Chaos Tests | Validate resilience under failures | Quarterly GameDays, Continuous | Resilience validation | Failure handling |
| Penetration Tests | Validate security controls | Annually | External audit | Security vulnerabilities |
| DR Drills | Validate disaster recovery | Quarterly | Scheduled drills | Recovery procedures |

Testing Pyramid with Chaos

graph TD
    PYRAMID[Testing Pyramid]

    UNIT[Unit Tests<br/>1000s of tests<br/>Fast, isolated]
    INTEGRATION[Integration Tests<br/>100s of tests<br/>Service interactions]
    LOAD[Load Tests<br/>10s of tests<br/>Performance validation]
    CHAOS[Chaos Tests<br/>10s of experiments<br/>Resilience validation]
    DR[DR Drills<br/>Quarterly<br/>Full recovery validation]

    PYRAMID --> UNIT
    PYRAMID --> INTEGRATION
    PYRAMID --> LOAD
    PYRAMID --> CHAOS
    PYRAMID --> DR

    style UNIT fill:#90EE90
    style INTEGRATION fill:#87CEEB
    style LOAD fill:#FFD700
    style CHAOS fill:#FFA500
    style DR fill:#FF6347
Hold "Alt" / "Option" to enable pan & zoom

What Each Testing Type Catches

| Issue Type | Unit Tests | Integration Tests | Load Tests | Chaos Tests | DR Drills |
|------------|------------|-------------------|------------|-------------|-----------|
| Logic errors | ✅ | ✅ | | | |
| API contract violations | | ✅ | | | |
| Performance degradation | | | ✅ | ⚠️ | |
| Failure handling | | ⚠️ | | ✅ | ✅ |
| Cascading failures | | | | ✅ | ✅ |
| Recovery procedures | | | | ⚠️ | ✅ |
| Resource exhaustion | | | ⚠️ | ✅ | |
| Network issues | | ⚠️ | ⚠️ | ✅ | |

Example: Testing a New Feature

## Example: Testing New Circuit Breaker Feature

### 1. Unit Tests (CI/CD)
- ✅ Test circuit breaker state transitions (Closed → Open → Half-Open)
- ✅ Test threshold calculations
- ✅ Test timeout handling

### 2. Integration Tests (CI/CD)
- ✅ Test circuit breaker with real HTTP client
- ✅ Test circuit breaker with database connection
- ✅ Test fallback behavior

### 3. Load Tests (Pre-release)
- ✅ Validate circuit breaker under high load
- ✅ Measure performance impact

### 4. Chaos Tests (Quarterly)
- ✅ Validate circuit breaker opens during dependency failure
- ✅ Validate circuit breaker prevents cascading failures
- ✅ Validate recovery when dependency recovers

### 5. DR Drills (Quarterly)
- ✅ Validate circuit breaker behavior during regional failover
- ✅ Validate circuit breaker with database failover

ATP Chaos Engineering Strategy

ATP employs a multi-layered chaos engineering strategy combining continuous automated experiments with periodic large-scale exercises.

Strategy Layers

graph TB
    STRATEGY[ATP Chaos Engineering Strategy]

    CONTINUOUS[Continuous Chaos<br/>Always-on, 1% traffic<br/>Automated, CI/CD integrated]
    SCHEDULED[Scheduled Experiments<br/>Weekly, Monthly<br/>Automated, specific scenarios]
    GAMEDAYS[Quarterly GameDays<br/>Large-scale, multi-team<br/>Manual, coordinated]
    DR[Annual DR Drills<br/>Full region failover<br/>Manual, comprehensive]
    ADHOC[Ad-Hoc Experiments<br/>Investigation, validation<br/>Manual, targeted]

    STRATEGY --> CONTINUOUS
    STRATEGY --> SCHEDULED
    STRATEGY --> GAMEDAYS
    STRATEGY --> DR
    STRATEGY --> ADHOC

    style CONTINUOUS fill:#90EE90
    style SCHEDULED fill:#87CEEB
    style GAMEDAYS fill:#FFD700
    style DR fill:#FF6347
    style ADHOC fill:#DDA0DD
Hold "Alt" / "Option" to enable pan & zoom

Layer 1: Continuous Chaos

Purpose: Detect resilience regressions early through always-on low-level chaos.

Characteristics:

  • Frequency: Continuous (24/7)
  • Scope: 1% of traffic, test tenants only
  • Duration: 30 seconds per experiment
  • Automation: Fully automated, no human intervention
  • Examples: Random pod kills, low-level latency injection

Benefits:

  • Early detection of resilience issues
  • Continuous validation of improvements
  • Minimal overhead (1% traffic impact)

Configuration:

# continuous-chaos-config.yaml
apiVersion: chaos.atp.connectsoft.io/v1alpha1
kind: ContinuousChaos
metadata:
  name: atp-continuous-chaos
  namespace: chaos-testing
spec:
  enabled: true
  blastRadius:
    trafficPercentage: 1
    tenantScope: test-tenants-only
  experiments:
    - name: random-pod-kill
      frequency: "@every 1h"
      duration: "30s"
    - name: low-latency-injection
      frequency: "@every 2h"
      duration: "1m"
      latency: "100ms"
  autoAbort:
    enabled: true
    errorRateThreshold: 0.5%

Layer 2: Scheduled Experiments

Purpose: Validate specific resilience scenarios on a regular schedule.

Characteristics:

  • Frequency: Weekly (automated), Monthly (manual)
  • Scope: 5-10% of traffic, staging environment
  • Duration: 5-15 minutes
  • Automation: Automated execution, manual review
  • Examples: Database failover, cache failure, network partition

Schedule:

| Frequency | Experiment Type | Examples |
|-----------|-----------------|----------|
| Daily | Basic resilience | Pod kill, container restart |
| Weekly | Dependency failures | Database failover, cache failure |
| Monthly | Infrastructure failures | Node failure, AZ failure |
| Quarterly | Complex scenarios | Multiple simultaneous failures |

Layer 3: Quarterly GameDays

Purpose: Large-scale coordinated chaos exercises involving multiple teams.

Characteristics:

  • Frequency: Quarterly (4 per year)
  • Scope: 25-50% of traffic, staging environment
  • Duration: 4 hours (1h prep, 2h chaos, 1h retrospective)
  • Automation: Manual coordination, automated experiments
  • Participants: All engineering teams, SRE, Security, Operations

GameDay Structure: See Topic 13 for detailed GameDay procedures.

Layer 4: Annual DR Drills

Purpose: Full disaster recovery validation including regional failover.

Characteristics:

  • Frequency: Annually (once per year)
  • Scope: Full region failover, production-like
  • Duration: Full day (8 hours)
  • Automation: Manual coordination, automated failover
  • Participants: All teams, leadership, compliance

DR Drill Structure: See Topic 11 for detailed DR drill procedures.

Layer 5: Ad-Hoc Experiments

Purpose: Investigate specific resilience concerns or validate fixes.

Characteristics:

  • Frequency: As needed
  • Scope: Targeted, specific scenario
  • Duration: Variable (30 minutes to 2 hours)
  • Automation: Manual or automated
  • Trigger: Incident investigation, new feature validation, fix verification

Ad-Hoc Experiment Examples:

  • Post-Incident: "Why did the system fail during the incident? Let's reproduce it."
  • Feature Validation: "Does the new circuit breaker work correctly? Let's test it."
  • Fix Verification: "We fixed the connection pool issue. Let's verify it's fixed."

Chaos Experiment Template

Template Structure

# templates/chaos-experiment-template.yaml
apiVersion: chaos.atp.connectsoft.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: EXPERIMENT_NAME
  namespace: chaos-testing
  labels:
    category: infrastructure|application|data|network|security
    severity: low|medium|high
    frequency: continuous|weekly|monthly|quarterly|adhoc
spec:
  # Experiment Metadata
  description: "Brief description of the experiment"
  hypothesis: |
    When [failure condition], 
    the system will [expected behavior], 
    and metrics will [expected values].

  # Steady State Definition
  steadyState:
    metrics:
      - name: request_success_rate
        threshold: 99.9
        operator: ">="
      - name: p95_latency_ms
        threshold: 200
        operator: "<"
      # Add more metrics...

  # Blast Radius
  blastRadius:
    trafficPercentage: 1
    tenantScope: test-tenants-only
    duration: "5m"
    serviceScope:
      namespace: atp-ingest-ns
      labelSelector:
        app: atp-ingest-api

  # Chaos Injection
  injection:
    type: pod-kill|network-partition|latency|error|resource
    config:
      # Experiment-specific configuration
      action: pod-kill
      mode: one

  # Rollback Configuration
  rollback:
    autoAbort:
      enabled: true
      triggers:
        - metric: error_rate
          threshold: 1.0
          operator: ">"
        - metric: p95_latency_ms
          threshold: 500
          operator: ">"

  # Validation Criteria
  validation:
    successCriteria:
      - metric: request_success_rate
        threshold: 99.9
        operator: ">="
      - metric: recovery_time_seconds
        threshold: 30
        operator: "<="

  # Scheduling
  schedule:
    frequency: weekly
    cron: "0 2 * * 1"  # Monday 2 AM

  # Reporting
  reporting:
    notifyChannels:
      - slack: "#atp-chaos"
      - email: "sre-team@connectsoft.example"
    generateReport: true

Hypothesis Formulation Template

## Hypothesis Template

### Experiment: [EXPERIMENT_NAME]

**Hypothesis**:
When [failure condition], the system will [expected behavior], 
and the following metrics will remain within acceptable ranges:
- [metric1]: [expected value/range]
- [metric2]: [expected value/range]
- [metric3]: [expected value/range]

**Failure Condition**:
- [Description of what will be broken]

**Expected Behavior**:
- [How the system should respond]
- [Recovery mechanism]
- [Degradation mode (if any)]

**Success Criteria**:
- ✅ [Criterion 1]
- ✅ [Criterion 2]
- ✅ [Criterion 3]

**Failure Criteria** (auto-abort triggers):
- ❌ [Criterion 1]
- ❌ [Criterion 2]

Example: Complete Experiment Definition

# examples/pod-failure-experiment.yaml
apiVersion: chaos.atp.connectsoft.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: ingestion-api-pod-failure
  namespace: chaos-testing
  labels:
    category: infrastructure
    severity: low
    frequency: weekly
spec:
  description: "Validate ingestion API resilience to pod failures"

  hypothesis: |
    When 1 ingestion API pod crashes (out of 5 pods), 
    the system will remain available, request success rate will stay >99.9%, 
    P95 latency will increase by <100ms, and the pod will be restarted within 30 seconds.

  steadyState:
    metrics:
      - name: request_success_rate
        threshold: 99.9
        operator: ">="
      - name: p95_latency_ms
        baseline: 150  # ms
        threshold: 250  # ms (150 + 100)
        operator: "<"
      - name: pod_restart_time_seconds
        threshold: 30
        operator: "<="

  blastRadius:
    trafficPercentage: 5
    tenantScope: test-tenants-only
    duration: "5m"
    serviceScope:
      namespace: atp-ingest-ns
      labelSelector:
        app: atp-ingest-api

  injection:
    type: pod-kill
    config:
      action: pod-kill
      mode: one  # Kill one pod
      selector:
        namespaces:
          - atp-ingest-ns
        labelSelectors:
          app: atp-ingest-api

  rollback:
    autoAbort:
      enabled: true
      triggers:
        - metric: error_rate
          threshold: 1.0
          operator: ">"
        - metric: p95_latency_ms
          threshold: 500
          operator: ">"

  validation:
    successCriteria:
      - metric: request_success_rate
        threshold: 99.9
        operator: ">="
        duration: "5m"  # Must maintain for 5 minutes
      - metric: pod_restart_time_seconds
        threshold: 30
        operator: "<="

  schedule:
    frequency: weekly
    cron: "0 2 * * 1"  # Monday 2 AM

  reporting:
    notifyChannels:
      - slack: "#atp-chaos"
    generateReport: true

Summary: Chaos Engineering Fundamentals

  • What is Chaos Engineering: Discipline of experimenting on distributed systems to discover weaknesses proactively; originated at Netflix in 2010 with Chaos Monkey
  • Why Chaos Engineering for ATP: High availability (99.9% SLA), compliance requirements, multi-tenancy isolation, distributed system complexity, complex dependencies, incident preparedness
  • Chaos Engineering Principles: Build hypothesis around steady state, vary real-world events, run in production-like environments, automate continuously, minimize blast radius
  • Chaos Engineering Methodology: 10-step systematic process (define steady state, formulate hypothesis, design experiment, set blast radius, configure rollback, execute, observe, validate, document, improve)
  • Chaos vs Other Testing: Complements unit, integration, load, and penetration tests; unique focus on resilience under failure conditions
  • ATP Chaos Engineering Strategy: Multi-layered approach with continuous chaos (1% traffic), scheduled experiments (weekly/monthly), quarterly GameDays, annual DR drills, and ad-hoc experiments
  • Experiment Template: Comprehensive YAML template for defining chaos experiments with hypothesis, steady state, blast radius, injection, rollback, validation, and reporting

Chaos Experiment Framework

Purpose: Define the comprehensive framework for designing, executing, and managing chaos experiments in ATP, including experiment structure, lifecycle management, steady state definitions, blast radius controls, rollback automation, and safety measures to ensure safe, effective, and repeatable resilience testing.


Experiment Structure

The chaos experiment structure defines the standardized format for all chaos experiments in ATP, ensuring consistency, repeatability, and safety across all resilience testing activities.

Chaos Mesh Experiment Structure

ATP uses Chaos Mesh as the primary chaos engineering tool for Kubernetes-native fault injection. Chaos Mesh provides a declarative API for defining chaos experiments as Kubernetes Custom Resources (CRs).

Basic PodChaos Example

# chaos-experiments/pod-failure-basic.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: atp-ingestion-pod-failure
  namespace: chaos-testing
  labels:
    experiment-type: infrastructure
    service: atp-ingestion-api
    severity: low
spec:
  action: pod-kill
  mode: one  # Kill one pod
  selector:
    namespaces:
      - atp-ingest-ns
    labelSelectors:
      app: atp-ingest-api
      version: v1.2.3
  duration: "30s"
  scheduler:
    cron: "@every 1h"

Experiment Metadata

The metadata section provides experiment identification and organization:

  • name: Unique identifier for the experiment
  • namespace: Isolation namespace (always chaos-testing for ATP)
  • labels: Categorization for filtering and reporting

Common Labels:

labels:
  category: infrastructure|application|data|network|security
  service: atp-gateway|atp-ingestion-api|atp-query-api
  environment: staging|production
  severity: low|medium|high|critical
  frequency: continuous|weekly|monthly|quarterly|adhoc
  owner: sre-team|platform-team|backend-team
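
These labels make it easy to slice the experiment catalog with standard label selectors, for example:

# List infrastructure-category experiments owned by the SRE team
kubectl get podchaos -n chaos-testing -l category=infrastructure,owner=sre-team

# Show the service label as an extra column across all experiments
kubectl get podchaos -n chaos-testing -L service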

Experiment Spec Structure

The spec section defines what the experiment does:

| Field | Purpose | Examples |
|-------|---------|----------|
| action | Type of fault to inject | pod-kill, pod-failure, container-kill |
| mode | Scope of injection | one, all, fixed, fixed-percent, random-max-percent |
| selector | Target resources | Namespaces, label selectors |
| duration | How long to inject | "30s", "5m", "1h" |
| scheduler | When to run | Cron expression or immediate |

Advanced Experiment Structure with ATP Extensions

ATP extends Chaos Mesh experiments with custom annotations and configurations for enhanced control:

# chaos-experiments/pod-failure-advanced.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: atp-ingestion-pod-failure-advanced
  namespace: chaos-testing
  labels:
    category: infrastructure
    service: atp-ingestion-api
    severity: low
    frequency: weekly
  annotations:
    chaos.atp.connectsoft.io/hypothesis: |
      When 1 ingestion API pod crashes, the system will remain available,
      request success rate will stay >99.9%, and P95 latency will increase by <100ms.
    chaos.atp.connectsoft.io/blast-radius: "5%"
    chaos.atp.connectsoft.io/auto-abort: "true"
    chaos.atp.connectsoft.io/notify: "slack:#atp-chaos"
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - atp-ingest-ns
    labelSelectors:
      app: atp-ingest-api
    annotationSelectors:
      chaos.atp.connectsoft.io/include-in-chaos: "true"
  duration: "5m"

  # ATP-specific rollback triggers
  abortRules:
    - name: error-rate-threshold
      condition: error_rate > 1.0%
      action: abort
    - name: latency-threshold
      condition: p95_latency_ms > 500
      action: abort

  # Gradual rollout configuration
  gradualRollout:
    enabled: true
    stages:
      - percentage: 1
        duration: "30s"
      - percentage: 5
        duration: "5m"

  scheduler:
    cron: "@every 1h"

  # ATP-specific monitoring
  monitoring:
    enabled: true
    metrics:
      - name: request_success_rate
        threshold: 99.9
        operator: ">="
      - name: p95_latency_ms
        threshold: 200
        operator: "<"

Experiment Types

ATP supports multiple chaos experiment types:

| Type | Chaos Mesh Kind | Use Case | Example |
|------|-----------------|----------|---------|
| Pod Chaos | PodChaos | Pod failures, container restarts | Pod crashes, OOM kills |
| Network Chaos | NetworkChaos | Network issues | Partitions, latency, packet loss |
| IO Chaos | IOChaos | Storage issues | Disk latency, I/O errors |
| Stress Chaos | StressChaos | Resource exhaustion | CPU stress, memory stress |
| Time Chaos | TimeChaos | Clock skew | Time manipulation |
| Kernel Chaos | KernelChaos | Kernel-level faults | System call failures |
| HTTP Chaos | HTTPChaos | HTTP-level faults | Request faults, response delays |

Experiment Categories

Experiments are categorized for organization and reporting:

# Experiment category examples
categories:
  infrastructure:
    - pod-kill
    - node-failure
    - container-restart
  application:
    - service-crash
    - latency-injection
    - error-injection
  data:
    - database-failover
    - cache-failure
    - storage-unavailable
  network:
    - partition
    - packet-loss
    - dns-failure
  security:
    - auth-failure
    - cert-expiration
    - key-unavailable

Experiment Lifecycle

The experiment lifecycle defines the 9-step process for executing chaos experiments from setup to improvement, ensuring systematic, safe, and effective resilience testing.

Lifecycle Overview

graph TD
    SETUP[1. Setup] --> BASELINE[2. Baseline]
    BASELINE --> INJECT[3. Inject]
    INJECT --> OBSERVE[4. Observe]
    OBSERVE --> VALIDATE[5. Validate]
    VALIDATE --> ROLLBACK[6. Rollback]
    ROLLBACK --> ANALYZE[7. Analyze]
    ANALYZE --> REPORT[8. Report]
    REPORT --> IMPROVE[9. Improve]
    IMPROVE --> SETUP

    style SETUP fill:#E3F2FD
    style INJECT fill:#FFF3E0
    style VALIDATE fill:#F3E5F5
    style IMPROVE fill:#E8F5E9
Hold "Alt" / "Option" to enable pan & zoom

Lifecycle Phase Details

Phase 1: Setup

Objective: Prepare the experiment environment and define steady state metrics.

Activities:

  1. Define Steady State Metrics

    # Define metrics to monitor
    steadyState:
      metrics:
        - name: request_success_rate
          threshold: 99.9
          baseline_period: "24h"
        - name: p95_latency_ms
          threshold: 200
          baseline_period: "24h"
    

  2. Create Experiment Configuration

    # Create experiment YAML
    cp templates/pod-chaos-template.yaml \
       chaos-experiments/ingestion-pod-failure.yaml
    
    # Customize for specific experiment
    vim chaos-experiments/ingestion-pod-failure.yaml
    

  3. Review and Approve
     • Team review of experiment configuration
     • Validate blast radius settings
     • Confirm rollback procedures
     • Obtain necessary approvals

  4. Prepare Monitoring
     • Configure Grafana dashboards
     • Set up alerts for abort triggers
     • Prepare log aggregation queries
     • Test monitoring visibility

Output: Experiment configuration file, steady state definition, monitoring setup.

Setup Checklist:

## Experiment Setup Checklist

### Pre-Experiment
- [ ] Experiment YAML created and reviewed
- [ ] Steady state metrics defined
- [ ] Hypothesis documented
- [ ] Blast radius configured appropriately
- [ ] Rollback triggers configured
- [ ] Monitoring dashboards prepared
- [ ] Team notified (Slack, email)
- [ ] On-call engineer available
- [ ] Approval obtained (if required)
- [ ] Rollback procedure tested

Phase 2: Baseline

Objective: Establish baseline metrics before injecting chaos.

Activities:

  1. Measure Baseline Metrics

    # Collect baseline metrics for 24-48 hours
    ./scripts/collect-baseline-metrics.sh \
      --service atp-ingestion-api \
      --duration 24h \
      --output baseline-ingestion-api-$(date +%Y%m%d).json
    

  2. Verify Steady State
     • Check all metrics within thresholds
     • Verify no active incidents
     • Confirm system health
     • Validate baseline period sufficient

  3. Document Baseline

    # Baseline metrics snapshot
    baseline:
      timestamp: "2024-01-20T10:00:00Z"
      metrics:
        request_success_rate: 99.95
        p95_latency_ms: 145
        p99_latency_ms: 320
        throughput_events_per_sec: 10500
        error_rate: 0.05
        cpu_utilization_percent: 65
        memory_utilization_percent: 72
    

Output: Baseline metrics document, steady state validation.

Baseline Collection Script:

#!/bin/bash
# scripts/collect-baseline-metrics.sh

SERVICE="${1}"
DURATION="${2:-24h}"
OUTPUT="${3:-baseline-${SERVICE}-$(date +%Y%m%d).json}"

echo "📊 Collecting baseline metrics for ${SERVICE} over ${DURATION}"

# Query Prometheus for baseline metrics
PROMETHEUS_URL="http://prometheus.monitoring.svc.cluster.local:9090"

cat > /tmp/baseline-query.json <<EOF
{
  "queries": [
    {
      "metric": "rate(http_requests_total{service=\"${SERVICE}\"}[5m])",
      "duration": "${DURATION}"
    },
    {
      "metric": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{service=\"${SERVICE}\"}[5m]))",
      "duration": "${DURATION}"
    },
    {
      "metric": "rate(http_requests_total{service=\"${SERVICE}\",status=~\"5..\"}[5m]) / rate(http_requests_total{service=\"${SERVICE}\"}[5m])",
      "duration": "${DURATION}"
    }
  ]
}
EOF

# Execute queries and collect results
curl -X POST "${PROMETHEUS_URL}/api/v1/query_range" \
  -H "Content-Type: application/json" \
  -d @/tmp/baseline-query.json \
  | jq '.' > "${OUTPUT}"

echo "✅ Baseline metrics collected: ${OUTPUT}"

Phase 3: Inject

Objective: Apply fault injection to the target system.

Activities:

  1. Apply Experiment Configuration

    # Apply chaos experiment
    kubectl apply -f chaos-experiments/ingestion-pod-failure.yaml \
      -n chaos-testing
    

  2. Verify Injection Started

    # Check experiment status
    kubectl get podchaos atp-ingestion-pod-failure -n chaos-testing
    
    # Watch experiment status
    kubectl get podchaos atp-ingestion-pod-failure -n chaos-testing -w
    

  3. Confirm Fault Injected
     • Verify pod killed/affected
     • Check system impact
     • Monitor initial metrics
     • Validate blast radius active

Output: Experiment execution log, fault injection confirmation.

Injection Monitoring Script:

#!/bin/bash
# scripts/monitor-injection.sh

EXPERIMENT="${1}"
NAMESPACE="${2:-chaos-testing}"

echo "🔍 Monitoring fault injection: ${EXPERIMENT}"

# Watch experiment status
kubectl get chaos ${EXPERIMENT} -n ${NAMESPACE} -w

# Monitor pod status
kubectl get pods -n atp-ingest-ns -w

# Monitor metrics
watch -n 2 '
  echo "=== Request Success Rate ==="
  curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"atp-ingestion-api\",status!~\"5..\"\}[1m]\) | jq -r ".data.result[0].value[1]"

  echo "=== P95 Latency ==="
  curl -s http://prometheus:9090/api/v1/query?query=histogram_quantile\(0.95,rate\(http_request_duration_seconds_bucket\{service=\"atp-ingestion-api\"\}[1m]\)\) | jq -r ".data.result[0].value[1]"
'

Phase 4: Observe

Objective: Monitor system behavior during fault injection.

Activities:

  1. Monitor Key Metrics
     • Performance metrics (latency, throughput)
     • Reliability metrics (error rate, success rate)
     • Resource metrics (CPU, memory, network)
     • Business metrics (events ingested, tenant isolation)

  2. Review Logs and Traces

    # Check application logs
    kubectl logs -n atp-ingest-ns -l app=atp-ingest-api --tail=100
    
    # Check distributed traces
    # Use Azure Application Insights or Jaeger
    

  3. Monitor Abort Triggers
     • Watch error rate thresholds
     • Monitor latency thresholds
     • Check SLO violations
     • Be ready to abort if needed

Output: Observation logs, metrics snapshots, trace analysis.

Real-Time Monitoring Dashboard Query:

// Log Analytics: Real-time experiment monitoring
let ExperimentStart = datetime("2024-01-20T10:00:00Z");
let ExperimentEnd = datetime("2024-01-20T10:05:00Z");
ContainerLogV2
| where TimeGenerated between (ExperimentStart .. ExperimentEnd)
| where PodNamespace == "atp-ingest-ns"
| where ContainerName == "atp-ingestion-api"
| summarize 
    ErrorCount = countif(LogMessage contains "ERROR"),
    TotalRequests = count(),
    ErrorRate = (countif(LogMessage contains "ERROR") * 100.0) / count()
    by bin(TimeGenerated, 1m)
| render timechart

Phase 5: Validate

Objective: Compare actual behavior to hypothesis and steady state.

Activities:

  1. Compare Metrics to Hypothesis

    # Validate experiment results
    ./scripts/validate-experiment-results.sh \
      --experiment ingestion-pod-failure \
      --baseline baseline-ingestion-api-20240120.json \
      --results experiment-results-20240120.json
    

  2. Check Success Criteria
     • Request success rate maintained?
     • Latency within acceptable range?
     • Recovery time as expected?
     • No unexpected behaviors?

  3. Identify Deviations
     • Document unexpected behaviors
     • Analyze root causes
     • Measure actual vs expected

Output: Validation report, hypothesis validation results.

Validation Script:

#!/bin/bash
# scripts/validate-experiment-results.sh

EXPERIMENT="${1}"
BASELINE="${2}"
RESULTS="${3}"

echo "✅ Validating experiment results: ${EXPERIMENT}"

# Load baseline and results
BASELINE_METRICS=$(cat "${BASELINE}")
RESULT_METRICS=$(cat "${RESULTS}")

# Validate each metric
VALIDATION_FAILED=false

# Check request success rate
BASELINE_SUCCESS=$(echo "${BASELINE_METRICS}" | jq -r '.metrics.request_success_rate')
RESULT_SUCCESS=$(echo "${RESULT_METRICS}" | jq -r '.metrics.request_success_rate')
THRESHOLD=99.9

if (( $(echo "${RESULT_SUCCESS} < ${THRESHOLD}" | bc -l) )); then
  echo "❌ Request success rate below threshold: ${RESULT_SUCCESS}% < ${THRESHOLD}%"
  VALIDATION_FAILED=true
else
  echo "✅ Request success rate acceptable: ${RESULT_SUCCESS}%"
fi

# Check P95 latency
BASELINE_LATENCY=$(echo "${BASELINE_METRICS}" | jq -r '.metrics.p95_latency_ms')
RESULT_LATENCY=$(echo "${RESULT_METRICS}" | jq -r '.metrics.p95_latency_ms')
MAX_INCREASE=100  # ms

LATENCY_INCREASE=$(echo "${RESULT_LATENCY} - ${BASELINE_LATENCY}" | bc)

if (( $(echo "${LATENCY_INCREASE} > ${MAX_INCREASE}" | bc -l) )); then
  echo "❌ Latency increase too high: +${LATENCY_INCREASE}ms > +${MAX_INCREASE}ms"
  VALIDATION_FAILED=true
else
  echo "✅ Latency increase acceptable: +${LATENCY_INCREASE}ms"
fi

if [ "${VALIDATION_FAILED}" = true ]; then
  echo "❌ Experiment validation FAILED"
  exit 1
else
  echo "✅ Experiment validation PASSED"
  exit 0
fi

Phase 6: Rollback

Objective: Remove fault injection and restore normal operation.

Activities:

  1. Automatic Rollback (if triggered)
     • Abort triggers detected
     • Experiment automatically stopped
     • Fault injection removed
     • System returns to normal

  2. Manual Rollback (if needed)

    # Manually abort experiment
    kubectl delete podchaos atp-ingestion-pod-failure -n chaos-testing
    
    # Or use Chaos Mesh dashboard abort button
    

  3. Verify Rollback Complete (see the verification sketch below)
     • Confirm fault injection stopped
     • Verify system returning to normal
     • Check metrics recovering
     • Validate no lingering effects

Output: Rollback confirmation, system recovery status.
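
A quick manual verification sketch, assuming the deployment name matches the atp-ingest-api app label used in the selectors above:

# Confirm the chaos resource is gone and the target workload is healthy again
kubectl get podchaos -n chaos-testing
kubectl rollout status deployment/atp-ingest-api -n atp-ingest-ns --timeout=120s
kubectl get pods -n atp-ingest-ns -l app=atp-ingest-api -o wide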

Rollback Automation:

# rollback-automation.yaml
apiVersion: chaos.atp.connectsoft.io/v1alpha1
kind: RollbackPolicy
metadata:
  name: standard-rollback-policy
  namespace: chaos-testing
spec:
  autoAbort:
    enabled: true
    triggers:
      - name: error-rate-threshold
        metric: error_rate
        threshold: 1.0
        operator: ">"
        duration: "30s"  # Must exceed threshold for 30s
        action: abort
      - name: latency-threshold
        metric: p95_latency_ms
        threshold: 500
        operator: ">"
        duration: "1m"
        action: abort
      - name: success-rate-threshold
        metric: request_success_rate
        threshold: 99.0
        operator: "<"
        duration: "30s"
        action: abort

  rollbackProcedure:
    - step: StopChaosInjection
      action: delete_experiment
    - step: VerifyRollback
      wait: "30s"
      check: metrics_recovered
    - step: NotifyTeam
      channels:
        - slack: "#atp-chaos"
        - email: "sre-team@connectsoft.example"

Phase 7: Analyze

Objective: Review metrics, logs, and traces to understand system behavior.

Activities:

  1. Analyze Metrics

    # Compare baseline vs experiment metrics
    ./scripts/analyze-experiment-metrics.sh \
      --baseline baseline-ingestion-api-20240120.json \
      --experiment experiment-results-20240120.json \
      --output analysis-report-20240120.md
    

  2. Review Logs
     • Application logs during experiment
     • Error patterns
     • Recovery behavior
     • Unexpected events

  3. Analyze Traces
     • Distributed trace analysis
     • Service dependencies affected
     • Latency breakdown
     • Failure propagation

Output: Analysis report, root cause findings, lessons learned.
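
The analyze-experiment-metrics.sh helper invoked above is not shown elsewhere in this document. A minimal, hypothetical sketch of the comparison it might perform, assuming the flat {"metrics": {...}} JSON layout that the validation script in Phase 5 also expects:

#!/bin/bash
# Hypothetical sketch of scripts/analyze-experiment-metrics.sh: emits a markdown
# table comparing baseline vs experiment metrics.
while [[ $# -gt 0 ]]; do
  case "$1" in
    --baseline)   BASELINE="$2";   shift 2 ;;
    --experiment) EXPERIMENT="$2"; shift 2 ;;
    --output)     OUTPUT="$2";     shift 2 ;;
    *) shift ;;
  esac
done
OUTPUT="${OUTPUT:-analysis-report.md}"

{
  echo "| Metric | Baseline | During Experiment | Change |"
  echo "|--------|----------|-------------------|--------|"
  for METRIC in request_success_rate p95_latency_ms error_rate; do
    BASE=$(jq -r ".metrics.${METRIC}" "${BASELINE}")
    EXP=$(jq -r ".metrics.${METRIC}" "${EXPERIMENT}")
    DELTA=$(echo "${EXP} - ${BASE}" | bc -l)
    echo "| ${METRIC} | ${BASE} | ${EXP} | ${DELTA} |"
  done
} > "${OUTPUT}"

echo "✅ Analysis written to ${OUTPUT}"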

Phase 8: Report

Objective: Document experiment results, findings, and recommendations.

Activities:

  1. Generate Experiment Report

    # Generate comprehensive report
    ./scripts/generate-experiment-report.sh \
      --experiment ingestion-pod-failure \
      --date 2024-01-20 \
      --output reports/ingestion-pod-failure-20240120.md
    

  2. Document Findings
     • Hypothesis validation results
     • Unexpected behaviors
     • Resilience gaps identified
     • Improvement recommendations

  3. Share Results
     • Publish report to wiki/docs
     • Notify team (Slack, email)
     • Present in team meeting
     • Update experiment catalog

Output: Experiment report, findings document.

Report Template:

# Experiment Report: ${EXPERIMENT_NAME}

**Date**: ${DATE}
**Experiment Type**: ${TYPE}
**Service**: ${SERVICE}
**Hypothesis**: ${HYPOTHESIS}

## Experiment Configuration
- **Blast Radius**: ${BLAST_RADIUS}
- **Duration**: ${DURATION}
- **Target**: ${TARGET}

## Results

### Metrics Comparison
| Metric | Baseline | During Experiment | Change | Threshold | Status |
|--------|----------|-------------------|--------|-----------|--------|
| Request Success Rate | 99.95% | 99.92% | -0.03% | >99.9% | ✅ Pass |
| P95 Latency | 145ms | 210ms | +65ms | <250ms | ✅ Pass |
| Error Rate | 0.05% | 0.08% | +0.03% | <0.1% | ✅ Pass |

### Hypothesis Validation
**Hypothesis CONFIRMED**: System remained available with minimal impact.

## Findings

### What Worked Well
- Pod restart completed within 25 seconds
- Load balancer routed traffic away from failed pod
- No request failures
- Latency increase within acceptable range

### Issues Identified
- Slight increase in error rate (within threshold but higher than expected)
- Recovery time could be improved (25s vs target 20s)

## Recommendations
1. Optimize pod restart time (target: <20s)
2. Investigate error rate increase root cause
3. Add additional monitoring for pod failure scenarios

## Next Steps
- [ ] Implement pod restart optimization
- [ ] Investigate error rate increase
- [ ] Re-run experiment after improvements

Phase 9: Improve

Objective: Implement resilience improvements based on findings.

Activities:

  1. Prioritize Improvements

     • Impact assessment
     • Effort estimation
     • Priority ranking
     • Backlog creation

  2. Implement Improvements

     • Code changes
     • Configuration updates
     • Runbook updates
     • Monitoring enhancements

  3. Validate Improvements

     • Re-run experiment
     • Verify improvements
     • Confirm resilience enhanced
     • Update documentation

Output: Resilience improvements implemented, validation results.


Steady State Definition

Steady state defines what "normal" looks like for ATP services, providing baseline metrics against which experiment results are compared.

ATP Service Steady State Standards

Ingestion API Steady State:

# steady-state-definitions/ingestion-api-steady-state.yaml
apiVersion: chaos.atp.connectsoft.io/v1alpha1
kind: SteadyState
metadata:
  name: ingestion-api-steady-state
  namespace: chaos-testing
spec:
  service: atp-ingestion-api
  namespace: atp-ingest-ns

  metrics:
    # Performance Metrics
    performance:
      - name: p50_latency_ms
        description: "Median request latency"
        threshold: 100  # ms
        operator: "<"
        unit: "ms"
      - name: p95_latency_ms
        description: "95th percentile request latency"
        threshold: 200  # ms
        operator: "<"
        unit: "ms"
      - name: p99_latency_ms
        description: "99th percentile request latency"
        threshold: 500  # ms
        operator: "<"
        unit: "ms"
      - name: throughput_events_per_sec
        description: "Events ingested per second"
        threshold: 10000
        operator: ">="
        unit: "events/sec"

    # Reliability Metrics
    reliability:
      - name: request_success_rate
        description: "Percentage of successful requests"
        threshold: 99.9  # %
        operator: ">="
        unit: "percent"
      - name: error_rate
        description: "Percentage of failed requests"
        threshold: 0.1  # %
        operator: "<="
        unit: "percent"
      - name: availability
        description: "Service availability"
        threshold: 99.9  # %
        operator: ">="
        unit: "percent"

    # Resource Metrics
    resource:
      - name: cpu_utilization_percent
        description: "CPU utilization percentage"
        threshold: 80  # %
        operator: "<"
        unit: "percent"
      - name: memory_utilization_percent
        description: "Memory utilization percentage"
        threshold: 85  # %
        operator: "<"
        unit: "percent"
      - name: network_io_bytes_per_sec
        description: "Network I/O throughput"
        threshold: 1000000000  # 1 GB/s
        operator: "<"
        unit: "bytes/sec"

    # Business Metrics
    business:
      - name: events_ingested_per_minute
        description: "Events ingested per minute"
        threshold: 600000
        operator: ">="
        unit: "events/min"
      - name: tenant_isolation_maintained
        description: "Tenant isolation maintained"
        threshold: true
        operator: "=="
        unit: "boolean"
      - name: data_integrity_maintained
        description: "No data loss or corruption"
        threshold: true
        operator: "=="
        unit: "boolean"

    # Queue Metrics
    queue:
      - name: queue_depth_messages
        description: "Number of messages in queue"
        threshold: 1000
        operator: "<"
        unit: "messages"
      - name: queue_processing_rate
        description: "Messages processed per second"
        threshold: 5000
        operator: ">="
        unit: "messages/sec"

  # Baseline Collection Period
  baseline:
    duration: "24h"  # Collect 24-hour baseline
    sampleInterval: "1m"  # Sample every minute

  # Validation Rules
  validation:
    requiredMetricsPercentage: 90  # 90% of metrics must be within threshold
    consecutiveViolationsAllowed: 3  # Allow 3 consecutive violations before abort

Query API Steady State:

# steady-state-definitions/query-api-steady-state.yaml
apiVersion: chaos.atp.connectsoft.io/v1alpha1
kind: SteadyState
metadata:
  name: query-api-steady-state
  namespace: chaos-testing
spec:
  service: atp-query-api
  namespace: atp-query-ns

  metrics:
    performance:
      - name: p95_latency_ms
        threshold: 300  # Query API has higher latency tolerance
        operator: "<"
      - name: p99_latency_ms
        threshold: 1000  # Complex queries may take longer
        operator: "<"
      - name: queries_per_sec
        threshold: 1000
        operator: ">="

    reliability:
      - name: request_success_rate
        threshold: 99.9
        operator: ">="
      - name: query_timeout_rate
        threshold: 0.5  # %
        operator: "<="

    business:
      - name: queries_completed_per_minute
        threshold: 60000
        operator: ">="
      - name: cache_hit_rate
        threshold: 80  # %
        operator: ">="
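
Both definitions rely on a baseline collection window (24 hours of 1-minute samples for the ingestion API). A minimal sketch of collecting such a baseline for one metric, assuming the in-cluster Prometheus endpoint used elsewhere in this document and a series exposed under that metric name:

#!/bin/bash
# scripts/collect-steady-state-baseline.sh (illustrative sketch)
# Pulls 24 hours of 1-minute samples, matching baseline.duration and
# baseline.sampleInterval in the SteadyState spec.

SERVICE="${1:-atp-ingestion-api}"
METRIC="${2:-p95_latency_ms}"
PROMETHEUS_URL="http://prometheus.monitoring.svc.cluster.local:9090"

END=$(date -u +%s)
START=$((END - 24 * 3600))

curl -s "${PROMETHEUS_URL}/api/v1/query_range" \
  --data-urlencode "query=${METRIC}{service=\"${SERVICE}\"}" \
  --data-urlencode "start=${START}" \
  --data-urlencode "end=${END}" \
  --data-urlencode "step=60" \
  | jq '.data.result[0].values' > "baseline-${SERVICE}-${METRIC}.json"

echo "Baseline samples written to baseline-${SERVICE}-${METRIC}.json"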

Steady State Visualization

graph TB
    STEADY[Steady State Definition]

    PERF[Performance Metrics<br/>Latency, Throughput]
    REL[Reliability Metrics<br/>Success Rate, Error Rate]
    RES[Resource Metrics<br/>CPU, Memory, Network]
    BUS[Business Metrics<br/>Events, Queries, Isolation]

    STEADY --> PERF
    STEADY --> REL
    STEADY --> RES
    STEADY --> BUS

    PERF --> THRESH[Thresholds]
    REL --> THRESH
    RES --> THRESH
    BUS --> THRESH

    THRESH --> VALIDATE[Validation<br/>During Experiments]

    style STEADY fill:#E3F2FD
    style THRESH fill:#FFF3E0
    style VALIDATE fill:#E8F5E9
Hold "Alt" / "Option" to enable pan & zoom

Steady State Monitoring Script:

#!/bin/bash
# scripts/monitor-steady-state.sh

SERVICE="${1}"
DURATION="${2:-24h}"

echo "📊 Monitoring steady state for ${SERVICE} over ${DURATION}"

# Query Prometheus for steady state metrics
PROMETHEUS_URL="http://prometheus.monitoring.svc.cluster.local:9090"

# Check each metric against threshold
METRICS=(
  "request_success_rate:99.9:>="
  "p95_latency_ms:200:<"
  "error_rate:0.1:<="
  "cpu_utilization_percent:80:<"
)

for METRIC in "${METRICS[@]}"; do
  IFS=':' read -r NAME THRESHOLD OPERATOR <<< "${METRIC}"

  VALUE=$(curl -s "${PROMETHEUS_URL}/api/v1/query" \
    --data-urlencode "query=${NAME}{service=\"${SERVICE}\"}" \
    | jq -r '.data.result[0].value[1]')

  echo "Checking ${NAME}: ${VALUE} ${OPERATOR} ${THRESHOLD}"

  # Validate against threshold (handles the >=, <, and <= operators used above)
  if [ "${OPERATOR}" = ">=" ] && (( $(echo "${VALUE} >= ${THRESHOLD}" | bc -l) )); then
    echo "  ✅ Within threshold"
  elif [ "${OPERATOR}" = "<" ] && (( $(echo "${VALUE} < ${THRESHOLD}" | bc -l) )); then
    echo "  ✅ Within threshold"
  elif [ "${OPERATOR}" = "<=" ] && (( $(echo "${VALUE} <= ${THRESHOLD}" | bc -l) )); then
    echo "  ✅ Within threshold"
  else
    echo "  ⚠️  Outside threshold"
  fi
done

Blast Radius Control

Blast radius control defines the scope of impact for chaos experiments, limiting risk and ensuring safe experimentation.

Blast Radius Dimensions

graph LR
    BLAST[Blast Radius]

    TRAFFIC[Traffic %<br/>1%, 5%, 10%, 25%, 50%]
    TENANT[Tenant Scope<br/>All, Specific, Test Only]
    SERVICE[Service Scope<br/>All, Service, Pod]
    TIME[Duration<br/>30s, 5m, 15m, 1h]
    GEO[Geographic<br/>All, Region, AZ]

    BLAST --> TRAFFIC
    BLAST --> TENANT
    BLAST --> SERVICE
    BLAST --> TIME
    BLAST --> GEO

    style BLAST fill:#FFE5B4
Hold "Alt" / "Option" to enable pan & zoom

Blast Radius Configuration

# blast-radius-configurations/standard-blast-radius.yaml
apiVersion: chaos.atp.connectsoft.io/v1alpha1
kind: BlastRadius
metadata:
  name: standard-blast-radius
  namespace: chaos-testing
spec:
  # Traffic Percentage Control
  traffic:
    percentage: 1  # Start with 1%
    gradualIncrease: true
    increments:
      - percentage: 1
        duration: "30s"
        stabilityCheck: true
      - percentage: 5
        duration: "5m"
        stabilityCheck: true
      - percentage: 10
        duration: "15m"
        stabilityCheck: true

  # Tenant Scope Control
  tenant:
    scope: test-tenants-only  # Options: all, specific, test-tenants-only
    tenants:
      - test-tenant-001
      - test-tenant-002
    excludeTenants:
      - production-tenant-001
      - production-tenant-002

  # Service Scope Control
  service:
    scope: single-pod  # Options: all, service, single-pod
    namespace: atp-ingest-ns
    labelSelector:
      app: atp-ingest-api
    podSelector:
      matchLabels:
        version: v1.2.3

  # Duration Control
  duration:
    initial: "30s"  # Start with 30 seconds
    max: "5m"  # Maximum 5 minutes
    extensionAllowed: false  # Do not allow extension

  # Geographic Scope Control
  geographic:
    scope: single-az  # Options: all, region, single-az
    region: eastus
    availabilityZone: "1"

  # Automatic Blast Radius Reduction
  autoReduce:
    enabled: true
    triggers:
      - metric: error_rate
        threshold: 0.5  # %
        action: reduce_by_50_percent
      - metric: p95_latency_ms
        threshold: 300  # ms
        action: reduce_by_50_percent

Blast Radius by Experiment Type

| Experiment Type | Default Blast Radius | Max Blast Radius | Notes |
|-----------------|----------------------|------------------|-------|
| Pod Kill | 1 pod (5-10% traffic) | 1 pod | Safe, Kubernetes handles |
| Network Partition | 10% traffic | 25% traffic | Can cause cascading failures |
| Database Failover | Read-only mode | Full failover | Affects all traffic |
| Cache Failure | 10% traffic | 50% traffic | Degraded performance expected |
| Latency Injection | 5% traffic | 25% traffic | Gradual increase recommended |
| Error Injection | 1% requests | 10% requests | Can trigger alert storms |

Gradual Blast Radius Increase

# blast-radius-configurations/gradual-rollout.yaml
apiVersion: chaos.atp.connectsoft.io/v1alpha1
kind: BlastRadius
metadata:
  name: gradual-rollout-blast-radius
spec:
  gradualIncrease:
    enabled: true
    stages:
      - stage: 1
        percentage: 1
        duration: "30s"
        stabilityCheck:
          enabled: true
          metrics:
            - name: error_rate
              threshold: 0.5
              operator: "<"
            - name: p95_latency_ms
              threshold: 250
              operator: "<"
          requiredDuration: "30s"  # Must be stable for 30s before next stage

      - stage: 2
        percentage: 5
        duration: "5m"
        stabilityCheck:
          enabled: true
          metrics:
            - name: error_rate
              threshold: 1.0
              operator: "<"
            - name: p95_latency_ms
              threshold: 300
              operator: "<"
          requiredDuration: "2m"

      - stage: 3
        percentage: 10
        duration: "15m"
        stabilityCheck:
          enabled: true
          requiredDuration: "5m"

  # Auto-reduce if unstable
  autoReduce:
    enabled: true
    reduceToPreviousStage: true

Blast Radius Monitoring

#!/bin/bash
# scripts/monitor-blast-radius.sh

EXPERIMENT="${1}"
NAMESPACE="${2:-chaos-testing}"

echo "🎯 Monitoring blast radius for ${EXPERIMENT}"

# Get blast radius configuration
BLAST_RADIUS=$(kubectl get chaos ${EXPERIMENT} -n ${NAMESPACE} \
  -o jsonpath='{.metadata.annotations.chaos\.atp\.connectsoft\.io/blast-radius}')

echo "Blast Radius: ${BLAST_RADIUS}"

# Monitor actual impact (URL-encode the PromQL queries instead of escaping them inline)
echo "Actual Impact:"

TRAFFIC=$(curl -s "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=sum(rate(http_requests_total{experiment=\"${EXPERIMENT}\"}[1m]))" \
  | jq -r '.data.result[0].value[1]')

ERROR_RATE=$(curl -s "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=sum(rate(http_requests_total{experiment=\"${EXPERIMENT}\",status=~\"5..\"}[1m])) / sum(rate(http_requests_total{experiment=\"${EXPERIMENT}\"}[1m])) * 100" \
  | jq -r '.data.result[0].value[1]')

echo "- Traffic affected: ${TRAFFIC} req/s"
echo "- Error rate: ${ERROR_RATE}%"

Rollback Triggers

Rollback triggers automatically abort experiments when system health degrades beyond acceptable thresholds, preventing cascading failures and minimizing impact.

Rollback Trigger Configuration

# rollback-triggers/standard-rollback-triggers.yaml
apiVersion: chaos.atp.connectsoft.io/v1alpha1
kind: RollbackPolicy
metadata:
  name: standard-rollback-triggers
  namespace: chaos-testing
spec:
  autoAbort:
    enabled: true

    # Error Rate Trigger
    triggers:
      - name: error-rate-threshold
        description: "Abort if error rate exceeds threshold"
        metric: error_rate
        threshold: 1.0  # %
        operator: ">"
        duration: "30s"  # Must exceed threshold for 30 seconds
        severity: high
        action: abort
        notify:
          - slack: "#atp-chaos"
          - email: "sre-oncall@connectsoft.example"

      # Latency Trigger
      - name: latency-threshold
        description: "Abort if P95 latency exceeds threshold"
        metric: p95_latency_ms
        threshold: 500  # ms
        operator: ">"
        duration: "1m"
        severity: medium
        action: abort

      # Success Rate Trigger
      - name: success-rate-threshold
        description: "Abort if success rate drops below threshold"
        metric: request_success_rate
        threshold: 99.0  # %
        operator: "<"
        duration: "30s"
        severity: high
        action: abort

      # Throughput Trigger
      - name: throughput-drop-threshold
        description: "Abort if throughput drops significantly"
        metric: throughput_events_per_sec
        threshold: 50  # % of baseline
        operator: "<"
        comparison: baseline  # Compare to baseline
        duration: "1m"
        severity: medium
        action: abort

      # SLO Violation Trigger
      - name: slo-violation
        description: "Abort if any SLO is violated"
        metric: slo_violation_count
        threshold: 1
        operator: ">"
        duration: "10s"
        severity: critical
        action: abort
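
The duration field on each trigger means the condition must hold continuously before the abort fires. A minimal polling sketch for the error-rate trigger is shown below; the PromQL expression and the 10-second sample interval are assumptions for illustration, not part of the RollbackPolicy spec:

#!/bin/bash
# scripts/poll-rollback-trigger.sh (illustrative sketch)
# Aborts only if error_rate stays above THRESHOLD for WINDOW consecutive
# samples (~ the trigger's "duration"). Runs until it fires or is interrupted.

EXPERIMENT="${1}"
THRESHOLD=1.0        # % - matches the error-rate-threshold trigger above
WINDOW=3             # 3 samples x 10s = 30s duration
INTERVAL=10
PROMETHEUS_URL="http://prometheus.monitoring.svc.cluster.local:9090"

VIOLATIONS=0

while true; do
  ERROR_RATE=$(curl -s "${PROMETHEUS_URL}/api/v1/query" \
    --data-urlencode 'query=sum(rate(http_requests_total{status=~"5.."}[1m])) / sum(rate(http_requests_total[1m])) * 100' \
    | jq -r '.data.result[0].value[1] // "0"')

  if (( $(echo "${ERROR_RATE} > ${THRESHOLD}" | bc -l) )); then
    VIOLATIONS=$((VIOLATIONS + 1))
    echo "⚠️  error_rate=${ERROR_RATE}% above ${THRESHOLD}% (${VIOLATIONS}/${WINDOW})"
  else
    VIOLATIONS=0
  fi

  if [ ${VIOLATIONS} -ge ${WINDOW} ]; then
    echo "🚨 Trigger condition held for ${WINDOW} samples - aborting experiment"
    ./scripts/rollback-automation.sh "${EXPERIMENT}"
    exit 0
  fi

  sleep ${INTERVAL}
done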

Rollback Trigger Flow

graph TD
    START[Experiment Running] --> MONITOR[Monitor Metrics]
    MONITOR --> CHECK{Trigger<br/>Condition<br/>Met?}
    CHECK -->|No| MONITOR
    CHECK -->|Yes| DURATION{Exceeded<br/>Duration?}
    DURATION -->|No| MONITOR
    DURATION -->|Yes| SEVERITY{Severity?}

    SEVERITY -->|Critical| ABORT[Immediate Abort]
    SEVERITY -->|High| ABORT
    SEVERITY -->|Medium| WARN[Warning First]
    SEVERITY -->|Low| LOG[Log Only]

    WARN --> WAIT[Wait 30s]
    WAIT --> CHECK2{Still<br/>Triggered?}
    CHECK2 -->|Yes| ABORT
    CHECK2 -->|No| MONITOR

    ABORT --> NOTIFY[Notify Team]
    NOTIFY --> ROLLBACK[Rollback Experiment]
    ROLLBACK --> VERIFY[Verify Recovery]
    VERIFY --> COMPLETE[Complete]

    style START fill:#FFE5B4
    style ABORT fill:#FFB6C1
    style COMPLETE fill:#90EE90
Hold "Alt" / "Option" to enable pan & zoom

Rollback Automation

#!/bin/bash
# scripts/rollback-automation.sh

EXPERIMENT="${1}"
NAMESPACE="${2:-chaos-testing}"

echo "⏪ Automatically rolling back experiment: ${EXPERIMENT}"

# Check if rollback already triggered
if kubectl get chaos ${EXPERIMENT} -n ${NAMESPACE} \
   -o jsonpath='{.status.phase}' | grep -q "RolledBack"; then
  echo "Experiment already rolled back"
  exit 0
fi

# Delete experiment (stops fault injection)
kubectl delete chaos ${EXPERIMENT} -n ${NAMESPACE}

# Wait for rollback to complete
echo "Waiting for rollback to complete..."
sleep 10

# Verify system recovery
echo "Verifying system recovery..."
RETRIES=0
MAX_RETRIES=6

while [ ${RETRIES} -lt ${MAX_RETRIES} ]; do
  ERROR_RATE=$(curl -s "http://prometheus:9090/api/v1/query" \
    --data-urlencode 'query=sum(rate(http_requests_total{status=~"5.."}[1m])) / sum(rate(http_requests_total[1m])) * 100' \
    | jq -r '.data.result[0].value[1]')

  if (( $(echo "${ERROR_RATE} < 0.5" | bc -l) )); then
    echo "✅ System recovered: Error rate = ${ERROR_RATE}%"
    exit 0
  fi

  RETRIES=$((RETRIES + 1))
  echo "Waiting for recovery... (${RETRIES}/${MAX_RETRIES})"
  sleep 10
done

echo "⚠️  System recovery taking longer than expected"
exit 1

Manual Rollback

Manual rollback provides human override for experiments:

#!/bin/bash
# scripts/manual-rollback.sh

EXPERIMENT="${1}"
NAMESPACE="${2:-chaos-testing}"
REASON="${3}"

echo "⏪ Manual rollback requested for: ${EXPERIMENT}"
echo "Reason: ${REASON}"

# Log manual rollback (annotate before deleting, otherwise the resource is gone)
kubectl annotate chaos ${EXPERIMENT} -n ${NAMESPACE} \
  chaos.atp.connectsoft.io/manual-rollback="true" \
  chaos.atp.connectsoft.io/rollback-reason="${REASON}" \
  chaos.atp.connectsoft.io/rollback-timestamp="$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --overwrite

# Delete experiment (stops fault injection)
kubectl delete chaos ${EXPERIMENT} -n ${NAMESPACE}

# Notify team
curl -X POST "${SLACK_WEBHOOK_URL}" \
  -H "Content-Type: application/json" \
  -d "{
    \"text\": \"🚨 Manual rollback triggered for ${EXPERIMENT}\",
    \"attachments\": [{
      \"color\": \"warning\",
      \"fields\": [{
        \"title\": \"Reason\",
        \"value\": \"${REASON}\",
        \"short\": false
      }]
    }]
  }"

echo "✅ Manual rollback complete"

Safety Measures

Safety measures ensure chaos experiments are conducted safely with minimal risk to production systems and customer experience.

Safety Measures Checklist

## Chaos Experiment Safety Checklist

### Pre-Experiment Safety Checks

#### Environment Validation
- [ ] Experiment target is **staging environment** (not production)
- [ ] Staging environment mirrors production configuration
- [ ] Production data volumes used (for realistic testing)
- [ ] No active incidents in target environment

#### Experiment Configuration
- [ ] Blast radius set to **minimum** (1% traffic, test tenants)
- [ ] Rollback triggers configured and tested
- [ ] Experiment duration set to **minimum** (30 seconds)
- [ ] Gradual rollout enabled (if applicable)
- [ ] Automatic abort enabled

#### Team Preparation
- [ ] **On-call engineer present** and available
- [ ] Team notified (Slack, email) at least 24 hours in advance
- [ ] Communication plan established
- [ ] Rollback procedure tested and documented
- [ ] Approval obtained (if required for experiment type)

#### Monitoring Preparation
- [ ] Monitoring dashboards prepared and tested
- [ ] Alerts configured for abort triggers
- [ ] Baseline metrics collected (24-48 hours)
- [ ] Steady state validated
- [ ] Log aggregation queries prepared

#### Rollback Preparation
- [ ] Rollback procedure documented and tested
- [ ] Manual rollback button/command ready
- [ ] Automatic rollback triggers validated
- [ ] Recovery verification procedure ready

### During Experiment Safety

#### Active Monitoring
- [ ] Real-time monitoring dashboards visible
- [ ] Metrics monitored continuously
- [ ] Abort triggers watched
- [ ] Error logs reviewed in real-time

#### Communication
- [ ] Team informed experiment started
- [ ] Status updates provided (if long duration)
- [ ] Issues communicated immediately

#### Readiness
- [ ] Ready to abort at any moment
- [ ] Rollback procedure accessible
- [ ] On-call engineer monitoring

### Post-Experiment Safety

#### Validation
- [ ] Experiment completed or aborted cleanly
- [ ] System returned to steady state
- [ ] No lingering effects
- [ ] Metrics normalized

#### Documentation
- [ ] Results documented
- [ ] Findings recorded
- [ ] Lessons learned captured
- [ ] Improvements identified

Safety Measure Configuration

# safety-measures/standard-safety-measures.yaml
apiVersion: chaos.atp.connectsoft.io/v1alpha1
kind: SafetyPolicy
metadata:
  name: standard-safety-policy
  namespace: chaos-testing
spec:
  # Environment Restrictions
  environment:
    allowedEnvironments:
      - staging
      - dev  # Limited experiments only
    prohibitedEnvironments:
      - production  # Never run in production without explicit approval

  # Blast Radius Restrictions
  blastRadius:
    maxTrafficPercentage: 50  # Never exceed 50% traffic
    maxDuration: "1h"  # Never exceed 1 hour
    requireGradualRollout: true  # Always use gradual rollout
    minTrafficPercentage: 1  # Always start with at least 1%

  # Team Requirements
  team:
    requireOnCallEngineer: true
    requireNotification: true
    notificationChannels:
      - slack: "#atp-chaos"
      - email: "sre-team@connectsoft.example"
    notificationAdvanceTime: "24h"  # Notify 24 hours in advance

  # Approval Requirements
  approval:
    requiredFor:
      - production-experiments
      - high-severity-experiments
      - experiments-exceeding-10-percent-traffic
    approvalMethod: cab  # Change Advisory Board
    approvers:
      - sre-team-lead
      - platform-engineering-lead

  # Time Restrictions
  time:
    allowedWindows:
      - day: monday-friday
        time: "02:00-06:00 UTC"  # Low-traffic window
    prohibitedWindows:
      - day: friday
        time: "14:00-18:00 UTC"  # Avoid Friday afternoons
      - day: monday
        time: "08:00-12:00 UTC"  # Avoid Monday mornings

  # Automated Safety Checks
  automatedChecks:
    - name: active-incident-check
      check: no_active_incidents
      action: block_experiment
    - name: steady-state-validation
      check: steady_state_valid
      action: block_experiment
    - name: rollback-test
      check: rollback_procedure_tested
      action: block_experiment

Safety Measures Diagram

graph TD
    START[Start Experiment] --> CHECK1{Environment<br/>Staging?}
    CHECK1 -->|No| BLOCK1[Block Experiment]
    CHECK1 -->|Yes| CHECK2{Blast Radius<br/><10%?}

    CHECK2 -->|No| APPROVAL{Approved<br/>by CAB?}
    CHECK2 -->|Yes| CHECK3{On-Call<br/>Present?}

    APPROVAL -->|No| BLOCK2[Block Experiment]
    APPROVAL -->|Yes| CHECK3

    CHECK3 -->|No| BLOCK3[Block Experiment]
    CHECK3 -->|Yes| CHECK4{No Active<br/>Incidents?}

    CHECK4 -->|No| BLOCK4[Block Experiment]
    CHECK4 -->|Yes| CHECK5{Time Window<br/>Allowed?}

    CHECK5 -->|No| BLOCK5[Block Experiment]
    CHECK5 -->|Yes| ALLOW[Allow Experiment]

    ALLOW --> EXECUTE[Execute Experiment]

    style START fill:#FFE5B4
    style BLOCK1 fill:#FFB6C1
    style BLOCK2 fill:#FFB6C1
    style BLOCK3 fill:#FFB6C1
    style BLOCK4 fill:#FFB6C1
    style BLOCK5 fill:#FFB6C1
    style ALLOW fill:#90EE90
Hold "Alt" / "Option" to enable pan & zoom

Safety Validation Script

#!/bin/bash
# scripts/validate-safety-measures.sh

EXPERIMENT="${1}"
NAMESPACE="${2:-chaos-testing}"

echo "🔒 Validating safety measures for ${EXPERIMENT}"

VALIDATION_FAILED=false

# Check environment
ENV=$(kubectl get chaos ${EXPERIMENT} -n ${NAMESPACE} \
  -o jsonpath='{.metadata.labels.environment}')

if [ "${ENV}" = "production" ]; then
  echo "❌ Experiment targets production environment"
  VALIDATION_FAILED=true
else
  echo "✅ Environment check passed: ${ENV}"
fi

# Check blast radius
BLAST_RADIUS=$(kubectl get chaos ${EXPERIMENT} -n ${NAMESPACE} \
  -o jsonpath='{.metadata.annotations.chaos\.atp\.connectsoft\.io/blast-radius}')

if (( $(echo "${BLAST_RADIUS} > 10" | bc -l) )); then
  echo "⚠️  Blast radius exceeds 10%: ${BLAST_RADIUS}%"
  # Check if approved
  if ! kubectl get chaos ${EXPERIMENT} -n ${NAMESPACE} \
     -o jsonpath='{.metadata.annotations.chaos\.atp\.connectsoft\.io/cab-approved}' | grep -q "true"; then
    echo "❌ Experiment not approved by CAB"
    VALIDATION_FAILED=true
  fi
else
  echo "✅ Blast radius acceptable: ${BLAST_RADIUS}%"
fi

# Check on-call engineer
if ! kubectl get chaos ${EXPERIMENT} -n ${NAMESPACE} \
   -o jsonpath='{.metadata.annotations.chaos\.atp\.connectsoft\.io/oncall-present}' | grep -q "true"; then
  echo "❌ On-call engineer not present"
  VALIDATION_FAILED=true
else
  echo "✅ On-call engineer present"
fi

# Check active incidents
if kubectl get incidents -n monitoring | grep -q "Active"; then
  echo "❌ Active incidents detected"
  VALIDATION_FAILED=true
else
  echo "✅ No active incidents"
fi

if [ "${VALIDATION_FAILED}" = true ]; then
  echo "❌ Safety validation FAILED"
  exit 1
else
  echo "✅ Safety validation PASSED"
  exit 0
fi

Summary: Chaos Experiment Framework

  • Experiment Structure: Standardized Chaos Mesh experiment format with ATP extensions, metadata labels for categorization, spec structure (action, mode, selector, duration, scheduler), support for multiple experiment types (Pod, Network, IO, Stress, Time, Kernel, HTTP), experiment categories for organization
  • Experiment Lifecycle: 9-phase lifecycle (Setup, Baseline, Inject, Observe, Validate, Rollback, Analyze, Report, Improve) with detailed procedures, scripts, and checklists for each phase
  • Steady State Definition: Comprehensive steady state definitions for ATP services (Ingestion API, Query API) with performance, reliability, resource, business, and queue metrics, baseline collection procedures, validation rules, steady state monitoring scripts
  • Blast Radius Control: Multi-dimensional blast radius (traffic percentage, tenant scope, service scope, duration, geographic scope), gradual rollout configuration, automatic blast radius reduction, blast radius by experiment type guidelines, blast radius monitoring scripts
  • Rollback Triggers: Automatic abort triggers (error rate, latency, success rate, throughput, SLO violations), rollback trigger flow with severity levels, automated rollback procedures, manual rollback capability with notification
  • Safety Measures: Comprehensive safety checklist (pre-experiment, during experiment, post-experiment), safety policy configuration with environment restrictions, blast radius limits, team requirements, approval requirements, time restrictions, automated safety checks, safety validation scripts

Pod and Container Chaos

Purpose: Define comprehensive chaos experiments for pod and container failures in ATP, validating Kubernetes resilience mechanisms including pod restarts, container recovery, resource limits, and horizontal pod autoscaling to ensure ATP services remain available and performant during infrastructure failures.


Pod Failure Experiment

Pod failure experiments validate that ATP services remain available and functional when individual pods crash or are terminated, ensuring Kubernetes orchestration and load balancing mechanisms work correctly under failure conditions.

Hypothesis

"When 1 ingestion API pod crashes (out of 5 pods deployed), the system will remain available, request success rate will stay >99.9%, P95 latency will increase by <100ms, and Kubernetes will restart the pod within 30 seconds."

Experiment Configuration

Basic PodChaos Configuration:

# chaos-experiments/pod-failure-ingestion-api.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: ingestion-pod-kill
  namespace: chaos-testing
  labels:
    category: infrastructure
    service: atp-ingestion-api
    severity: low
    frequency: weekly
  annotations:
    chaos.atp.connectsoft.io/hypothesis: |
      When 1 ingestion API pod crashes (out of 5 pods), 
      the system will remain available, request success rate will stay >99.9%, 
      P95 latency will increase by <100ms, and Kubernetes will restart the pod within 30 seconds.
    chaos.atp.connectsoft.io/blast-radius: "20%"  # 1 pod out of 5 = 20%
spec:
  action: pod-kill
  mode: one  # Kill one pod
  selector:
    namespaces:
      - atp-ingest-ns
    labelSelectors:
      app: atp-ingest-api
      version: v1.2.3
  duration: "5m"
  scheduler:
    cron: "@every 1h"

  # ATP-specific rollback triggers
  abortRules:
    - name: error-rate-threshold
      condition: error_rate > 1.0%
      action: abort
    - name: latency-threshold
      condition: p95_latency_ms > 500
      action: abort

Advanced PodChaos with Gradual Rollout:

# chaos-experiments/pod-failure-advanced.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: ingestion-pod-kill-advanced
  namespace: chaos-testing
spec:
  action: pod-kill
  mode: fixed-percent
  value: "20"  # 20% of pods (1 out of 5)
  selector:
    namespaces:
      - atp-ingest-ns
    labelSelectors:
      app: atp-ingest-api
  duration: "5m"

  # Gradual rollout: kill pods one at a time
  gradualRollout:
    enabled: true
    stages:
      - stage: 1
        pods: 1
        duration: "30s"
        stabilityCheck:
          enabled: true
          metrics:
            - name: request_success_rate
              threshold: 99.9
              operator: ">="
            - name: p95_latency_ms
              threshold: 250
              operator: "<"
          requiredDuration: "30s"

  # Monitor pod restart time
  monitoring:
    podRestartTime:
      enabled: true
      maxTime: "30s"
      alertOnExceed: true

Expected Behavior

Immediate Impact (0-5 seconds):

  • Pod termination: Target pod receives SIGTERM, then SIGKILL
  • Service endpoint removal: Pod removed from service endpoints
  • Load balancer update: Traffic routed away from failed pod
  • Remaining pods: 4 pods continue handling traffic (80% capacity)

Recovery Phase (5-30 seconds):

  • Pod restart: Kubernetes detects pod failure and schedules new pod
  • Container startup: New container starts and initializes
  • Health checks: Liveness and readiness probes execute
  • Service endpoint addition: New pod added to service endpoints
  • Traffic routing: Load balancer routes traffic to new pod

Steady State (30+ seconds):

  • Full capacity: All 5 pods operational
  • Traffic distribution: Traffic evenly distributed across pods
  • Metrics normalized: Latency and throughput return to baseline
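
The endpoint removal and pod restart described above can be observed live while the experiment runs. A minimal pair of watch commands, assuming the Service is named atp-ingestion-api (the label selector matches the PodChaos configuration above):

# Terminal 1: pod phase and restart counts during the experiment
kubectl get pods -n atp-ingest-ns -l app=atp-ingest-api -o wide --watch

# Terminal 2: service endpoints (the killed pod's IP disappears, then returns)
kubectl get endpoints atp-ingestion-api -n atp-ingest-ns --watch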

Expected Metrics

| Metric | Baseline | During Failure | Expected Range | Recovery Target |
|--------|----------|----------------|----------------|-----------------|
| Request Success Rate | 99.95% | >99.9% | >99.9% | 99.95% |
| P95 Latency | 145ms | <245ms | Baseline + <100ms | 145ms |
| P99 Latency | 320ms | <420ms | Baseline + <100ms | 320ms |
| Throughput | 10,500 events/sec | >8,400 events/sec | >80% of baseline | 10,500 events/sec |
| Error Rate | 0.05% | <0.1% | <0.1% | 0.05% |
| Pod Restart Time | N/A | <30s | <30s | N/A |

Validation Criteria

Success Criteria:

  • ✅ Request success rate maintained >99.9% throughout experiment
  • ✅ P95 latency increase <100ms
  • ✅ No request failures (all requests succeed)
  • ✅ Pod restarted within 30 seconds
  • ✅ All 5 pods operational after recovery
  • ✅ Metrics normalized within 60 seconds

Failure Criteria (Auto-abort triggers):

  • ❌ Request success rate drops below 99.9%
  • ❌ P95 latency exceeds 500ms
  • ❌ Any request failures occur
  • ❌ Pod restart time exceeds 30 seconds
  • ❌ System unable to recover within 60 seconds

Monitoring and Observation

Real-Time Monitoring Dashboard:

// Log Analytics: Pod failure monitoring
let ExperimentStart = datetime("2024-01-20T10:00:00Z");
let ExperimentEnd = datetime("2024-01-20T10:05:00Z");

// Pod status during experiment
KubePodInventory
| where TimeGenerated between (ExperimentStart .. ExperimentEnd)
| where Namespace == "atp-ingest-ns"
| where Name contains "atp-ingest-api"
| summarize 
    PodCount = count(),
    RunningCount = countif(Status == "Running"),
    FailedCount = countif(Status == "Failed"),
    RestartCount = sum(ContainerRestartCount)
    by bin(TimeGenerated, 10s)
| render timechart

// Request metrics during pod failure
ContainerLog
| where TimeGenerated between (ExperimentStart .. ExperimentEnd)
| where Namespace == "atp-ingest-ns"
| where ContainerName == "atp-ingestion-api"
| summarize 
    TotalRequests = count(),
    ErrorCount = countif(LogMessage contains "ERROR"),
    ErrorRate = (countif(LogMessage contains "ERROR") * 100.0) / count()
    by bin(TimeGenerated, 10s)
| render timechart

Pod Restart Time Measurement Script:

#!/bin/bash
# scripts/measure-pod-restart-time.sh

NAMESPACE="${1:-atp-ingest-ns}"
LABEL_SELECTOR="${2:-app=atp-ingest-api}"

echo "📊 Measuring pod restart time for ${LABEL_SELECTOR} in ${NAMESPACE}"

# Get initial pod count
INITIAL_PODS=$(kubectl get pods -n ${NAMESPACE} -l ${LABEL_SELECTOR} --no-headers | wc -l)
echo "Initial pod count: ${INITIAL_PODS}"

# Kill one pod
POD_TO_KILL=$(kubectl get pods -n ${NAMESPACE} -l ${LABEL_SELECTOR} -o jsonpath='{.items[0].metadata.name}')
echo "Killing pod: ${POD_TO_KILL}"

KILL_TIME=$(date +%s)
kubectl delete pod ${POD_TO_KILL} -n ${NAMESPACE}

# Wait for pod to be deleted
echo "Waiting for pod to be deleted..."
kubectl wait --for=delete pod/${POD_TO_KILL} -n ${NAMESPACE} --timeout=60s

DELETE_TIME=$(date +%s)
DELETE_DURATION=$((DELETE_TIME - KILL_TIME))
echo "Pod deleted in ${DELETE_DURATION} seconds"

# Wait for new pod to be running
echo "Waiting for new pod to be running..."
kubectl wait --for=condition=Ready pod -n ${NAMESPACE} -l ${LABEL_SELECTOR} --timeout=60s

RECOVERY_TIME=$(date +%s)
RECOVERY_DURATION=$((RECOVERY_TIME - KILL_TIME))
echo "Pod recovery time: ${RECOVERY_DURATION} seconds"

# Verify pod count
FINAL_PODS=$(kubectl get pods -n ${NAMESPACE} -l ${LABEL_SELECTOR} --no-headers | wc -l)
echo "Final pod count: ${FINAL_PODS}"

if [ "${FINAL_PODS}" -eq "${INITIAL_PODS}" ]; then
  echo "✅ Pod count restored: ${FINAL_PODS}"
else
  echo "⚠️  Pod count mismatch: expected ${INITIAL_PODS}, got ${FINAL_PODS}"
  exit 1
fi

# Validate restart time
if [ "${RECOVERY_DURATION}" -le 30 ]; then
  echo "✅ Pod restart time within target: ${RECOVERY_DURATION}s <= 30s"
  exit 0
else
  echo "❌ Pod restart time exceeds target: ${RECOVERY_DURATION}s > 30s"
  exit 1
fi

Experiment Execution

Manual Execution:

#!/bin/bash
# scripts/execute-pod-failure-experiment.sh

EXPERIMENT="ingestion-pod-kill"
NAMESPACE="chaos-testing"

echo "🧪 Starting pod failure experiment: ${EXPERIMENT}"

# Collect baseline metrics
echo "📊 Collecting baseline metrics..."
./scripts/collect-baseline-metrics.sh \
  --service atp-ingestion-api \
  --duration 1h \
  --output baseline-ingestion-api-$(date +%Y%m%d-%H%M%S).json

# Apply experiment
echo "🔧 Applying chaos experiment..."
kubectl apply -f chaos-experiments/pod-failure-ingestion-api.yaml -n ${NAMESPACE}

# Monitor experiment
echo "👀 Monitoring experiment..."
./scripts/monitor-pod-failure-experiment.sh ${EXPERIMENT}

# Measure pod restart time
echo "⏱️  Measuring pod restart time..."
./scripts/measure-pod-restart-time.sh atp-ingest-ns app=atp-ingest-api

# Wait for experiment to complete
echo "⏳ Waiting for experiment to complete..."
kubectl wait --for=condition=complete chaos/${EXPERIMENT} -n ${NAMESPACE} --timeout=10m

# Validate results
echo "✅ Validating experiment results..."
./scripts/validate-experiment-results.sh \
  --experiment ${EXPERIMENT} \
  --baseline "${BASELINE_FILE}" \
  --results "${RESULTS_FILE}"

echo "✅ Experiment complete"

Experiment Results Analysis

Success Scenario:

{
  "experiment": "ingestion-pod-kill",
  "status": "success",
  "metrics": {
    "request_success_rate": {
      "baseline": 99.95,
      "during_failure": 99.92,
      "recovery": 99.95,
      "status": "pass"
    },
    "p95_latency_ms": {
      "baseline": 145,
      "during_failure": 210,
      "recovery": 145,
      "increase": 65,
      "status": "pass"
    },
    "pod_restart_time_seconds": {
      "value": 25,
      "target": 30,
      "status": "pass"
    }
  },
  "findings": {
    "what_worked": [
      "Pod restarted within 25 seconds",
      "Load balancer routed traffic away from failed pod",
      "No request failures occurred",
      "Latency increase within acceptable range"
    ],
    "issues": [
      "Slight latency increase (65ms) during recovery"
    ]
  }
}

Failure Scenario:

{
  "experiment": "ingestion-pod-kill",
  "status": "failed",
  "reason": "pod_restart_time_exceeded",
  "metrics": {
    "pod_restart_time_seconds": {
      "value": 45,
      "target": 30,
      "status": "fail"
    }
  },
  "findings": {
    "root_cause": "Container startup time too slow",
    "recommendations": [
      "Optimize container startup time",
      "Review liveness/readiness probe settings",
      "Check resource constraints"
    ]
  }
}

Container Failure Experiment

Container failure experiments validate that individual containers within a pod can fail and restart without affecting the entire pod or service availability, ensuring Kubernetes container restart mechanisms work correctly.

Hypothesis

"When 1 container in a multi-container pod crashes, Kubernetes will restart the container within 15 seconds, the pod will remain available, and no service disruption will occur."

Experiment Configuration

Container Kill Configuration:

# chaos-experiments/container-failure-experiment.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: container-kill-experiment
  namespace: chaos-testing
  labels:
    category: infrastructure
    service: atp-ingestion-api
    severity: low
  annotations:
    chaos.atp.connectsoft.io/hypothesis: |
      When 1 container in a pod crashes, Kubernetes will restart the container 
      within 15 seconds, the pod will remain available, and no service disruption will occur.
spec:
  action: container-kill  # Kill only the named container, not the whole pod
  mode: one
  containerNames:
    - atp-ingestion-api  # Specific container name
  selector:
    namespaces:
      - atp-ingest-ns
    labelSelectors:
      app: atp-ingest-api
  duration: "5m"

Expected Behavior

Container Failure Process:

  1. Container crash: Container process terminates (SIGKILL)
  2. Pod status: Pod status changes to "NotReady" (container not ready)
  3. Kubernetes detection: Kubelet detects container failure
  4. Container restart: Kubernetes restarts the container
  5. Health check: Liveness probe validates container health
  6. Readiness check: Readiness probe validates container ready
  7. Service recovery: Pod returns to service endpoints
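
Kubernetes events give a quick timeline of this restart sequence (Killing, Pulled, Created, Started). A minimal check, with <pod-name> as a placeholder for the pod targeted by the experiment:

# Show restart-related events for the affected pod, newest last
kubectl get events -n atp-ingest-ns \
  --field-selector involvedObject.name=<pod-name> \
  --sort-by=.lastTimestamp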

Expected Metrics

| Metric | Baseline | During Failure | Expected Range | Recovery Target |
|--------|----------|----------------|----------------|-----------------|
| Container Restart Time | N/A | <15s | <15s | N/A |
| Pod Availability | 100% | 100% | 100% | 100% |
| Request Success Rate | 99.95% | >99.9% | >99.9% | 99.95% |
| Service Disruption | None | None | None | None |

Validation Criteria

Success Criteria:

  • ✅ Container restarted within 15 seconds
  • ✅ Pod remained available throughout
  • ✅ No service disruption
  • ✅ No request failures
  • ✅ Liveness/readiness probes working correctly

Container Restart Monitoring Script:

#!/bin/bash
# scripts/monitor-container-restart.sh

NAMESPACE="${1:-atp-ingest-ns}"
CONTAINER_NAME="${2:-atp-ingestion-api}"

echo "📊 Monitoring container restart: ${CONTAINER_NAME} in ${NAMESPACE}"

# Get pod with the container
POD=$(kubectl get pods -n ${NAMESPACE} -l app=atp-ingest-api -o jsonpath='{.items[0].metadata.name}')

# Get initial container restart count
INITIAL_RESTARTS=$(kubectl get pod ${POD} -n ${NAMESPACE} -o jsonpath='{.status.containerStatuses[?(@.name=="'${CONTAINER_NAME}'")].restartCount}')

echo "Initial restart count: ${INITIAL_RESTARTS}"

# Kill container
echo "Killing container: ${CONTAINER_NAME} in pod: ${POD}"
kubectl exec -n ${NAMESPACE} ${POD} -c ${CONTAINER_NAME} -- kill -9 1

KILL_TIME=$(date +%s)

# Wait for container restart
echo "Waiting for container restart..."
MAX_WAIT=60
ELAPSED=0

while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
  CURRENT_RESTARTS=$(kubectl get pod ${POD} -n ${NAMESPACE} -o jsonpath='{.status.containerStatuses[?(@.name=="'${CONTAINER_NAME}'")].restartCount}')

  if [ "${CURRENT_RESTARTS}" -gt "${INITIAL_RESTARTS}" ]; then
    RESTART_TIME=$(date +%s)
    RESTART_DURATION=$((RESTART_TIME - KILL_TIME))
    echo "✅ Container restarted in ${RESTART_DURATION} seconds"

    # Check container status
    CONTAINER_READY=$(kubectl get pod ${POD} -n ${NAMESPACE} -o jsonpath='{.status.containerStatuses[?(@.name=="'${CONTAINER_NAME}'")].ready}')
    if [ "${CONTAINER_READY}" = "true" ]; then
      echo "✅ Container is ready"
      exit 0
    else
      echo "⚠️  Container restarted but not ready yet"
      exit 1
    fi
  fi

  sleep 1
  ELAPSED=$((ELAPSED + 1))
done

echo "❌ Container restart timeout after ${MAX_WAIT} seconds"
exit 1

Liveness and Readiness Probe Validation

Probe Configuration Example:

# kubernetes/deployments/ingestion-api-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion-api
  namespace: atp-ingest-ns
spec:
  replicas: 5
  template:
    spec:
      containers:
      - name: atp-ingestion-api
        image: atpcr.azurecr.io/atp-ingestion-api:v1.2.3
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 3

Probe Validation Script:

#!/bin/bash
# scripts/validate-probes.sh

NAMESPACE="${1:-atp-ingest-ns}"
POD="${2}"

echo "🔍 Validating liveness and readiness probes for pod: ${POD}"

# Check liveness probe
LIVENESS_PROBE=$(kubectl get pod ${POD} -n ${NAMESPACE} -o jsonpath='{.spec.containers[0].livenessProbe}')
if [ -z "${LIVENESS_PROBE}" ]; then
  echo "❌ Liveness probe not configured"
  exit 1
else
  echo "✅ Liveness probe configured"
fi

# Check readiness probe
READINESS_PROBE=$(kubectl get pod ${POD} -n ${NAMESPACE} -o jsonpath='{.spec.containers[0].readinessProbe}')
if [ -z "${READINESS_PROBE}" ]; then
  echo "❌ Readiness probe not configured"
  exit 1
else
  echo "✅ Readiness probe configured"
fi

# Check container status
CONTAINER_READY=$(kubectl get pod ${POD} -n ${NAMESPACE} -o jsonpath='{.status.containerStatuses[0].ready}')
if [ "${CONTAINER_READY}" = "true" ]; then
  echo "✅ Container is ready"
else
  echo "⚠️  Container is not ready"
  exit 1
fi

# Test liveness endpoint
POD_IP=$(kubectl get pod ${POD} -n ${NAMESPACE} -o jsonpath='{.status.podIP}')
LIVENESS_PATH=$(kubectl get pod ${POD} -n ${NAMESPACE} -o jsonpath='{.spec.containers[0].livenessProbe.httpGet.path}')
LIVENESS_PORT=$(kubectl get pod ${POD} -n ${NAMESPACE} -o jsonpath='{.spec.containers[0].livenessProbe.httpGet.port}')

if curl -f -s http://${POD_IP}:${LIVENESS_PORT}${LIVENESS_PATH} > /dev/null; then
  echo "✅ Liveness endpoint responding"
else
  echo "❌ Liveness endpoint not responding"
  exit 1
fi

# Test readiness endpoint
READINESS_PATH=$(kubectl get pod ${POD} -n ${NAMESPACE} -o jsonpath='{.spec.containers[0].readinessProbe.httpGet.path}')
READINESS_PORT=$(kubectl get pod ${POD} -n ${NAMESPACE} -o jsonpath='{.spec.containers[0].readinessProbe.httpGet.port}')

if curl -f -s http://${POD_IP}:${READINESS_PORT}${READINESS_PATH} > /dev/null; then
  echo "✅ Readiness endpoint responding"
else
  echo "❌ Readiness endpoint not responding"
  exit 1
fi

echo "✅ All probes validated successfully"

Resource Exhaustion Experiment

Resource exhaustion experiments validate that ATP services handle resource constraints gracefully and that resource limits prevent noisy neighbor effects, while horizontal pod autoscaling (HPA) responds appropriately to resource pressure.

Hypothesis

"When CPU, memory, or disk resources are exhausted, resource limits prevent noisy neighbor effects, HPA scales up to handle increased load, and services maintain availability with graceful degradation."

CPU Stress Experiment

StressChaos Configuration:

# chaos-experiments/cpu-stress-experiment.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress-ingestion-api
  namespace: chaos-testing
  labels:
    category: infrastructure
    service: atp-ingestion-api
    severity: medium
  annotations:
    chaos.atp.connectsoft.io/hypothesis: |
      When CPU stress is applied, resource limits prevent noisy neighbor effects,
      HPA scales up to handle load, and services maintain availability.
spec:
  mode: one
  selector:
    namespaces:
      - atp-ingest-ns
    labelSelectors:
      app: atp-ingest-api
  stressors:
    cpu:
      workers: 4  # Number of CPU-intensive processes
      load: 100  # CPU load percentage per worker
  duration: "10m"

Expected Behavior:

  • CPU utilization: CPU usage approaches 100% (within limits)
  • Resource limits: CPU limits prevent exceeding allocated resources
  • Noisy neighbor: Other pods not affected by CPU stress
  • HPA response: HPA detects CPU pressure and scales up pods
  • Service availability: Service remains available with potential latency increase
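
To see the HPA response and CPU throttling described above while the stress runs, a couple of observation commands are usually sufficient (the HPA name matches the configuration shown later in this section; kubectl top requires metrics-server):

# Watch target utilization and replica count while the stress is active
kubectl get hpa atp-ingestion-api-hpa -n atp-ingest-ns --watch

# Spot-check per-pod CPU usage against the pod's limit
kubectl top pods -n atp-ingest-ns -l app=atp-ingest-api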

Memory Stress Experiment

Memory Stress Configuration:

# chaos-experiments/memory-stress-experiment.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: memory-stress-ingestion-api
  namespace: chaos-testing
spec:
  mode: one
  selector:
    namespaces:
      - atp-ingest-ns
    labelSelectors:
      app: atp-ingest-api
  stressors:
    memory:
      workers: 1
      size: "512Mi"  # Allocate 512MB memory
  duration: "10m"

Expected Behavior:

  • Memory pressure: Memory usage increases
  • OOM protection: Memory limits prevent OOM kills of other containers
  • Graceful handling: Application handles memory pressure gracefully
  • Potential restart: Pod may be evicted if memory limits exceeded

Disk Stress Experiment

IO Stress Configuration:

# chaos-experiments/disk-stress-experiment.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: disk-stress-ingestion-api
  namespace: chaos-testing
spec:
  action: io-latency
  mode: one
  selector:
    namespaces:
      - atp-ingest-ns
    labelSelectors:
      app: atp-ingest-api
  volumePath: /var/log  # Path to stress
  path: /var/log/chaos
  delay: "100ms"
  percent: 100
  duration: "10m"

Expected Behavior:

  • Disk I/O latency: Disk operations experience increased latency
  • Application handling: Application handles disk I/O delays gracefully
  • No data loss: No data corruption or loss
  • Performance impact: Potential performance degradation

Resource Limits Validation

Resource Limits Configuration:

# kubernetes/deployments/ingestion-api-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion-api
spec:
  template:
    spec:
      containers:
      - name: atp-ingestion-api
        resources:
          requests:
            cpu: "500m"
            memory: "512Mi"
          limits:
            cpu: "1000m"
            memory: "1Gi"

Resource Limits Validation Script:

#!/bin/bash
# scripts/validate-resource-limits.sh

NAMESPACE="${1:-atp-ingest-ns}"
DEPLOYMENT="${2:-atp-ingestion-api}"

echo "🔍 Validating resource limits for deployment: ${DEPLOYMENT}"

# Check resource requests and limits
kubectl get deployment ${DEPLOYMENT} -n ${NAMESPACE} -o json | \
  jq -r '.spec.template.spec.containers[0].resources' > /tmp/resources.json

CPU_REQUEST=$(jq -r '.requests.cpu' /tmp/resources.json)
CPU_LIMIT=$(jq -r '.limits.cpu' /tmp/resources.json)
MEMORY_REQUEST=$(jq -r '.requests.memory' /tmp/resources.json)
MEMORY_LIMIT=$(jq -r '.limits.memory' /tmp/resources.json)

echo "CPU Request: ${CPU_REQUEST}"
echo "CPU Limit: ${CPU_LIMIT}"
echo "Memory Request: ${MEMORY_REQUEST}"
echo "Memory Limit: ${MEMORY_LIMIT}"

# Validate limits are set
if [ "${CPU_LIMIT}" = "null" ] || [ -z "${CPU_LIMIT}" ]; then
  echo "❌ CPU limit not set"
  exit 1
else
  echo "✅ CPU limit set: ${CPU_LIMIT}"
fi

if [ "${MEMORY_LIMIT}" = "null" ] || [ -z "${MEMORY_LIMIT}" ]; then
  echo "❌ Memory limit not set"
  exit 1
else
  echo "✅ Memory limit set: ${MEMORY_LIMIT}"
fi

# Validate limits >= requests
if [ "${CPU_REQUEST}" != "null" ] && [ "${CPU_LIMIT}" != "null" ]; then
  CPU_REQUEST_M=$(echo ${CPU_REQUEST} | sed 's/m$//')
  CPU_LIMIT_M=$(echo ${CPU_LIMIT} | sed 's/m$//')

  if [ "${CPU_LIMIT_M}" -lt "${CPU_REQUEST_M}" ]; then
    echo "❌ CPU limit (${CPU_LIMIT}) < CPU request (${CPU_REQUEST})"
    exit 1
  else
    echo "✅ CPU limit >= CPU request"
  fi
fi

echo "✅ Resource limits validated successfully"

Horizontal Pod Autoscaler (HPA) Configuration

HPA Configuration:

# kubernetes/hpa/ingestion-api-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: atp-ingestion-api-hpa
  namespace: atp-ingest-ns
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: atp-ingestion-api
  minReplicas: 5
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
      - type: Pods
        value: 2
        periodSeconds: 60
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 25
        periodSeconds: 60

HPA Validation Script:

#!/bin/bash
# scripts/validate-hpa-response.sh

NAMESPACE="${1:-atp-ingest-ns}"
DEPLOYMENT="${2:-atp-ingestion-api}"

echo "📊 Validating HPA response for deployment: ${DEPLOYMENT}"

# Get initial replica count
INITIAL_REPLICAS=$(kubectl get deployment ${DEPLOYMENT} -n ${NAMESPACE} -o jsonpath='{.spec.replicas}')
echo "Initial replicas: ${INITIAL_REPLICAS}"

# Apply CPU stress
echo "🔧 Applying CPU stress..."
kubectl apply -f chaos-experiments/cpu-stress-experiment.yaml -n chaos-testing

# Wait for HPA to scale up
echo "⏳ Waiting for HPA to scale up..."
MAX_WAIT=300
ELAPSED=0
SCALED_UP=false

while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
  CURRENT_REPLICAS=$(kubectl get deployment ${DEPLOYMENT} -n ${NAMESPACE} -o jsonpath='{.spec.replicas}')

  if [ "${CURRENT_REPLICAS}" -gt "${INITIAL_REPLICAS}" ]; then
    SCALED_UP=true
    echo "✅ HPA scaled up: ${INITIAL_REPLICAS}${CURRENT_REPLICAS}"
    break
  fi

  sleep 10
  ELAPSED=$((ELAPSED + 10))
  echo "Waiting... (${ELAPSED}s/${MAX_WAIT}s)"
done

if [ "${SCALED_UP}" = false ]; then
  echo "❌ HPA did not scale up within ${MAX_WAIT} seconds"
  kubectl delete stresschaos cpu-stress-ingestion-api -n chaos-testing
  exit 1
fi

# Remove stress
echo "🔧 Removing CPU stress..."
kubectl delete stresschaos cpu-stress-ingestion-api -n chaos-testing

# Wait for HPA to scale down
echo "⏳ Waiting for HPA to scale down..."
ELAPSED=0

while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
  CURRENT_REPLICAS=$(kubectl get deployment ${DEPLOYMENT} -n ${NAMESPACE} -o jsonpath='{.spec.replicas}')

  if [ "${CURRENT_REPLICAS}" -eq "${INITIAL_REPLICAS}" ]; then
    echo "✅ HPA scaled down: ${CURRENT_REPLICAS}${INITIAL_REPLICAS}"
    exit 0
  fi

  sleep 10
  ELAPSED=$((ELAPSED + 10))
  echo "Waiting... (${ELAPSED}s/${MAX_WAIT}s)"
done

echo "⚠️  HPA did not scale down within ${MAX_WAIT} seconds"
exit 1

Resource Exhaustion Visualization

graph TD
    STRESS[Resource Stress Applied] --> CPU[CPU Stress]
    STRESS --> MEM[Memory Stress]
    STRESS --> DISK[Disk Stress]

    CPU --> LIMIT1{CPU Limit<br/>Exceeded?}
    LIMIT1 -->|No| THROTTLE[CPU Throttling]
    LIMIT1 -->|Yes| HPA1[HPA Scales Up]

    MEM --> LIMIT2{Memory Limit<br/>Exceeded?}
    LIMIT2 -->|No| DEGRADE[Graceful Degradation]
    LIMIT2 -->|Yes| OOM[OOM Kill]

    DISK --> LATENCY[I/O Latency]
    LATENCY --> TIMEOUT[Timeout Handling]

    HPA1 --> RECOVER[Recovery]
    THROTTLE --> RECOVER
    DEGRADE --> RECOVER
    TIMEOUT --> RECOVER

    style STRESS fill:#FFE5B4
    style HPA1 fill:#90EE90
    style OOM fill:#FFB6C1
    style RECOVER fill:#E8F5E9
Hold "Alt" / "Option" to enable pan & zoom

Experiment Results Analysis

CPU Stress Results:

{
  "experiment": "cpu-stress-ingestion-api",
  "status": "success",
  "metrics": {
    "cpu_utilization": {
      "baseline": 65,
      "during_stress": 95,
      "limit": 100,
      "status": "pass"
    },
    "hpa_scaling": {
      "initial_replicas": 5,
      "scaled_to": 8,
      "scale_up_time": 90,
      "status": "pass"
    },
    "request_success_rate": {
      "baseline": 99.95,
      "during_stress": 99.90,
      "status": "pass"
    }
  }
}

Summary: Pod and Container Chaos

  • Pod Failure Experiment: Validates Kubernetes pod restart mechanisms, load balancer failover, and service availability during pod crashes; expects pod restart within 30 seconds, no request failures, latency increase <100ms, and success rate >99.9%
  • Container Failure Experiment: Validates container restart within pods, liveness/readiness probe functionality, and service availability during container failures; expects container restart within 15 seconds and no service disruption
  • Resource Exhaustion Experiments: Validates CPU, memory, and disk stress handling, resource limits preventing noisy neighbor effects, and HPA scaling response; expects graceful degradation, HPA scaling, and maintained service availability
  • Monitoring and Validation: Comprehensive scripts for measuring pod restart times, container restart times, resource limits validation, HPA response validation, and probe functionality validation
  • Experiment Execution: Automated scripts for executing pod failure, container failure, and resource exhaustion experiments with baseline collection, real-time monitoring, and result validation

Node and Cluster Chaos

Purpose: Define comprehensive chaos experiments for node and cluster failures in ATP, validating multi-node resilience, availability zone failover, and disaster recovery procedures to ensure ATP services remain available and recoverable during infrastructure-level failures.


Node Failure Experiment

Node failure experiments validate that ATP services remain available and functional when individual AKS nodes fail, ensuring Kubernetes pod rescheduling, StatefulSet quorum maintenance, and data persistence mechanisms work correctly under node failure conditions.

Hypothesis

"When 1 AKS node fails (in a 3-node cluster), services will remain available, pods will be rescheduled to other nodes within 5 minutes, StatefulSets will maintain quorum, no data will be lost, and all services will recover within 5 minutes."

Experiment Configuration

Node Drain and Delete Procedure:

# chaos-experiments/node-failure-experiment.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NodeChaos
metadata:
  name: node-failure-atp-cluster
  namespace: chaos-testing
  labels:
    category: infrastructure
    severity: high
    frequency: monthly
  annotations:
    chaos.atp.connectsoft.io/hypothesis: |
      When 1 AKS node fails, services will remain available, pods will be rescheduled 
      to other nodes within 5 minutes, StatefulSets will maintain quorum, 
      no data will be lost, and all services will recover within 5 minutes.
    chaos.atp.connectsoft.io/blast-radius: "33%"  # 1 node out of 3 = 33%
spec:
  action: node-restart  # Restart node
  mode: one
  selector:
    nodes:
      - aks-nodepool1-12345678-vmss000000  # Specific node name
  duration: "5m"

  # ATP-specific rollback triggers
  abortRules:
    - name: pod-reschedule-time
      condition: pod_reschedule_time > 300  # 5 minutes
      action: abort
    - name: data-loss-detected
      condition: data_loss_count > 0
      action: abort

Manual Node Failure Script:

#!/bin/bash
# scripts/execute-node-failure-experiment.sh

CLUSTER_NAME="${1:-atp-staging-aks}"
RESOURCE_GROUP="${2:-atp-staging-rg}"
NODE_NAME="${3}"  # Optional: specific node to target

echo "🧪 Starting node failure experiment"

# Get cluster nodes
echo "📊 Getting cluster nodes..."
NODES=$(az aks nodepool list \
  --cluster-name ${CLUSTER_NAME} \
  --resource-group ${RESOURCE_GROUP} \
  --query "[].{name:name,nodeCount:count}" \
  -o json)

echo "Nodes: ${NODES}"

# Select node to fail (if not specified)
if [ -z "${NODE_NAME}" ]; then
  NODE_NAME=$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')
fi

echo "Target node: ${NODE_NAME}"

# Get initial pod count per node
echo "📊 Getting initial pod distribution..."
kubectl get pods -A --no-headers -o wide --field-selector spec.nodeName=${NODE_NAME} | \
  wc -l | xargs echo "Pods on target node:"

# Get StatefulSet pods on target node
STATEFULSET_PODS=$(kubectl get pods -A -o json \
  --field-selector spec.nodeName=${NODE_NAME} | \
  jq -r '.items[] | select(.metadata.ownerReferences[]?.kind == "StatefulSet") | .metadata.name')

echo "StatefulSet pods on target node: ${STATEFULSET_PODS}"

# Drain node (gracefully evict pods)
echo "🔧 Draining node: ${NODE_NAME}"
kubectl drain ${NODE_NAME} \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --force \
  --timeout=300s

DRAIN_START=$(date +%s)

# Wait for pods to be rescheduled
echo "⏳ Waiting for pods to be rescheduled..."
MAX_WAIT=300
ELAPSED=0
ALL_RESCHEDULED=false

while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
  # Ignore DaemonSet pods: drain leaves them in place, so they never reach zero
  PODS_ON_NODE=$(kubectl get pods -A -o json \
    --field-selector spec.nodeName=${NODE_NAME} | \
    jq -r '[.items[] | select((.metadata.ownerReferences // []) | map(.kind) | index("DaemonSet") | not)] | length')

  if [ "${PODS_ON_NODE}" -eq 0 ]; then
    ALL_RESCHEDULED=true
    RESCHEDULE_TIME=$(date +%s)
    RESCHEDULE_DURATION=$((RESCHEDULE_TIME - DRAIN_START))
    echo "✅ All pods rescheduled in ${RESCHEDULE_DURATION} seconds"
    break
  fi

  sleep 10
  ELAPSED=$((ELAPSED + 10))
  echo "Waiting... (${ELAPSED}s/${MAX_WAIT}s) - ${PODS_ON_NODE} pods remaining"
done

if [ "${ALL_RESCHEDULED}" = false ]; then
  echo "⚠️  Not all pods rescheduled within ${MAX_WAIT} seconds"
  # Uncordon node to restore it
  kubectl uncordon ${NODE_NAME}
  exit 1
fi

# Validate StatefulSet quorum
echo "🔍 Validating StatefulSet quorum..."
STATEFULSETS=$(kubectl get statefulsets -A -o json | \
  jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name)"')

for SS in ${STATEFULSETS}; do
  NS=$(echo ${SS} | cut -d'/' -f1)
  NAME=$(echo ${SS} | cut -d'/' -f2)
  READY=$(kubectl get statefulset ${NAME} -n ${NS} -o jsonpath='{.status.readyReplicas}')
  DESIRED=$(kubectl get statefulset ${NAME} -n ${NS} -o jsonpath='{.spec.replicas}')

  # readyReplicas is omitted when zero; default to 0 for the comparison
  READY=${READY:-0}

  if [ "${READY}" -lt "${DESIRED}" ]; then
    echo "⚠️  StatefulSet ${NS}/${NAME}: ${READY}/${DESIRED} replicas ready"
  else
    echo "✅ StatefulSet ${NS}/${NAME}: ${READY}/${DESIRED} replicas ready"
  fi
done

# Simulate node failure (restart the node's VM scale set instance)
# AKS nodes are VMSS instances in the managed node resource group, so use az vmss rather than az vm
echo "🔧 Simulating node failure (restarting node)..."
NODE_RG=$(az aks show --name ${CLUSTER_NAME} --resource-group ${RESOURCE_GROUP} --query nodeResourceGroup -o tsv)
VMSS_NAME="${NODE_NAME%??????}"   # strip the 6-character instance suffix from the node name
INSTANCE_ID=$(az vmss list-instances --resource-group ${NODE_RG} --name ${VMSS_NAME} \
  --query "[?osProfile.computerName=='${NODE_NAME}'].instanceId" -o tsv)
az vmss restart \
  --resource-group ${NODE_RG} \
  --name ${VMSS_NAME} \
  --instance-ids ${INSTANCE_ID} \
  --no-wait

FAILURE_START=$(date +%s)

# Wait for node to be unavailable
echo "⏳ Waiting for node to be unavailable..."
sleep 30

# Wait for node to recover
echo "⏳ Waiting for node to recover..."
kubectl wait --for=condition=Ready node/${NODE_NAME} --timeout=600s

RECOVERY_TIME=$(date +%s)
RECOVERY_DURATION=$((RECOVERY_TIME - FAILURE_START))

echo "✅ Node recovered in ${RECOVERY_DURATION} seconds"

# Uncordon node
echo "🔧 Uncordoning node..."
kubectl uncordon ${NODE_NAME}

# Validate all pods running
echo "✅ Validating all pods running..."
sleep 60

ALL_PODS_RUNNING=true
NAMESPACES=$(kubectl get namespaces -o jsonpath='{.items[*].metadata.name}')

for NS in ${NAMESPACES}; do
  NOT_RUNNING=$(kubectl get pods -n ${NS} -o json | \
    jq -r '.items[] | select(.status.phase != "Running" and .status.phase != "Succeeded") | .metadata.name')

  if [ -n "${NOT_RUNNING}" ]; then
    echo "⚠️  Pods not running in ${NS}: ${NOT_RUNNING}"
    ALL_PODS_RUNNING=false
  fi
done

if [ "${ALL_PODS_RUNNING}" = true ]; then
  echo "✅ All pods running successfully"
  exit 0
else
  echo "❌ Some pods not running"
  exit 1
fi

Expected Behavior

Node Drain Phase (0-5 minutes):

  • Pod eviction: Pods gracefully terminated with SIGTERM
  • Pod rescheduling: Pods rescheduled to other nodes
  • StatefulSet handling: StatefulSet pods maintain quorum (if replicas > 1; see the PodDisruptionBudget sketch after these lists)
  • PV persistence: Persistent volumes remain available
  • Service continuity: Services continue operating on remaining nodes

Node Failure Phase (5-10 minutes):

  • Node unavailable: Node becomes unreachable
  • Kubernetes detection: Kubernetes detects node not ready
  • Endpoint removal: Node removed from service endpoints
  • Health monitoring: Cluster health monitored

Recovery Phase (10-15 minutes):

  • Node restart: Node VM restarts
  • Kubelet restart: Kubelet service restarts
  • Node registration: Node registers with API server
  • Pod rescheduling: Pods can be rescheduled back to node (if desired)
  • Service recovery: All services fully operational
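
Voluntary evictions such as kubectl drain are typically bounded with a PodDisruptionBudget so StatefulSets cannot drop below quorum mid-drain. A minimal sketch follows; the name, namespace, and label are illustrative rather than ATP's actual manifests, and assume a 3-replica StatefulSet:

# Hypothetical PodDisruptionBudget: cap voluntary evictions so 2 of 3 replicas stay up
kubectl apply -f - <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: atp-statefulset-pdb
  namespace: atp-ingest-ns
spec:
  minAvailable: 2              # quorum for a 3-replica StatefulSet
  selector:
    matchLabels:
      app: atp-statefulset     # illustrative label
EOF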

Expected Metrics

| Metric | Baseline | During Failure | Expected Range | Recovery Target |
|---|---|---|---|---|
| Service Availability | 100% | >99% | >99% | 100% |
| Pod Reschedule Time | N/A | <5min | <5min | N/A |
| Request Success Rate | 99.95% | >99.5% | >99.5% | 99.95% |
| StatefulSet Quorum | 100% | 100% | 100% | 100% |
| Data Loss | None | None | None | None |
| Recovery Time | N/A | <5min | <5min | N/A |

Validation Criteria

Success Criteria:

  • ✅ All pods rescheduled within 5 minutes
  • ✅ StatefulSets maintain quorum (no quorum loss)
  • ✅ No data loss (all persistent volumes intact)
  • ✅ Service availability >99% during failure
  • ✅ Request success rate >99.5% during failure
  • ✅ All services recovered within 5 minutes
  • ✅ Node recovered and operational

Failure Criteria (Auto-abort triggers):

  • ❌ Pod reschedule time exceeds 5 minutes
  • ❌ StatefulSet quorum lost
  • ❌ Data loss detected
  • ❌ Service availability drops below 99%
  • ❌ Request success rate drops below 99.5%
  • ❌ Recovery time exceeds 5 minutes

StatefulSet Quorum Validation

Quorum Validation Script:

#!/bin/bash
# scripts/validate-statefulset-quorum.sh

NAMESPACE="${1}"
STATEFULSET="${2}"

echo "🔍 Validating StatefulSet quorum: ${NAMESPACE}/${STATEFULSET}"

# Get StatefulSet details
READY=$(kubectl get statefulset ${STATEFULSET} -n ${NAMESPACE} -o jsonpath='{.status.readyReplicas}')
DESIRED=$(kubectl get statefulset ${STATEFULSET} -n ${NAMESPACE} -o jsonpath='{.spec.replicas}')
QUORUM_REQUIRED=$((DESIRED / 2 + 1))

echo "Ready replicas: ${READY}/${DESIRED}"
echo "Quorum required: ${QUORUM_REQUIRED}"

if [ "${READY}" -lt "${QUORUM_REQUIRED}" ]; then
  echo "❌ Quorum lost: ${READY} < ${QUORUM_REQUIRED}"
  exit 1
else
  echo "✅ Quorum maintained: ${READY} >= ${QUORUM_REQUIRED}"

  # Verify all pods are on different nodes (anti-affinity)
  PODS=$(kubectl get pods -n ${NAMESPACE} -l app=${STATEFULSET} -o jsonpath='{.items[*].metadata.name}')
  NODES=()

  for POD in ${PODS}; do
    NODE=$(kubectl get pod ${POD} -n ${NAMESPACE} -o jsonpath='{.spec.nodeName}')
    NODES+=(${NODE})
  done

  UNIQUE_NODES=$(printf '%s\n' "${NODES[@]}" | sort -u | wc -l)
  TOTAL_NODES=${#NODES[@]}

  if [ "${UNIQUE_NODES}" -lt "${TOTAL_NODES}" ]; then
    echo "⚠️  Some pods on same node (anti-affinity not enforced)"
  else
    echo "✅ All pods on different nodes (anti-affinity enforced)"
  fi

  exit 0
fi
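
Example invocation (the namespace and StatefulSet name are illustrative):

./scripts/validate-statefulset-quorum.sh atp-ingest-ns atp-event-store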

Pod Anti-Affinity Configuration

Anti-Affinity Rules:

# kubernetes/deployments/ingestion-api-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion-api
  namespace: atp-ingest-ns
spec:
  replicas: 5
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - atp-ingest-api
              topologyKey: kubernetes.io/hostname
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - atp-ingest-api
            topologyKey: topology.kubernetes.io/zone
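
One way to spot-check that the scheduler honored these rules (assuming the app=atp-ingest-api label above) is to print each pod with its node and zone:

# List each ingestion pod with its node and availability zone
for POD in $(kubectl get pods -n atp-ingest-ns -l app=atp-ingest-api -o jsonpath='{.items[*].metadata.name}'); do
  NODE=$(kubectl get pod ${POD} -n atp-ingest-ns -o jsonpath='{.spec.nodeName}')
  ZONE=$(kubectl get node ${NODE} -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}')
  echo "${POD} -> ${NODE} (${ZONE})"
done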

Monitoring and Observation

Node Failure Monitoring Dashboard:

// Log Analytics: Node failure monitoring
let ExperimentStart = datetime("2024-01-20T10:00:00Z");
let ExperimentEnd = datetime("2024-01-20T10:15:00Z");

// Node status during experiment
KubeNodeInventory
| where TimeGenerated between (ExperimentStart .. ExperimentEnd)
| summarize 
    NodeCount = count(),
    ReadyCount = countif(Status == "Ready"),
    NotReadyCount = countif(Status == "NotReady"),
    UnknownCount = countif(Status == "Unknown")
    by bin(TimeGenerated, 1m)
| render timechart

// Pod distribution across nodes
KubePodInventory
| where TimeGenerated between (ExperimentStart .. ExperimentEnd)
| where Namespace == "atp-ingest-ns"
| summarize 
    PodCount = count(),
    Nodes = dcount(Computer)
    by bin(TimeGenerated, 1m)
| render timechart

// Pod rescheduling events
ContainerLog
| where TimeGenerated between (ExperimentStart .. ExperimentEnd)
| where LogMessage contains "Scheduled" or LogMessage contains "Successfully assigned"
| summarize 
    SchedulingEvents = count()
    by bin(TimeGenerated, 1m)
| render timechart

Availability Zone Failure Experiment

Availability Zone (AZ) failure experiments validate that ATP services remain available and functional when an entire availability zone fails, ensuring multi-AZ deployment, anti-affinity rules, and cross-AZ failover mechanisms work correctly.

Hypothesis

"When an entire availability zone fails, services will remain available through multi-AZ deployment, pods will be rescheduled to other AZs within 10 minutes, anti-affinity rules will ensure pod distribution, and all services will recover within 10 minutes."

Experiment Configuration

AZ Failure Simulation:

#!/bin/bash
# scripts/execute-az-failure-experiment.sh

CLUSTER_NAME="${1:-atp-staging-aks}"
RESOURCE_GROUP="${2:-atp-staging-rg}"
TARGET_AZ="${3}"  # e.g., "eastus-1"

echo "🧪 Starting availability zone failure experiment"
echo "Target AZ: ${TARGET_AZ}"

# Get nodes in target AZ
echo "📊 Getting nodes in target AZ..."
NODES_IN_AZ=$(kubectl get nodes -o json | \
  jq -r ".items[] | select(.metadata.labels.\"topology.kubernetes.io/zone\" == \"${TARGET_AZ}\") | .metadata.name")

if [ -z "${NODES_IN_AZ}" ]; then
  echo "❌ No nodes found in AZ: ${TARGET_AZ}"
  exit 1
fi

echo "Nodes in target AZ: ${NODES_IN_AZ}"

# Get initial pod distribution
echo "📊 Getting initial pod distribution across AZs..."
kubectl get pods -A -o json | \
  jq -r '.items[] | select(.spec.nodeName != null) | 
    {pod: .metadata.name, namespace: .metadata.namespace, node: .spec.nodeName}' | \
  jq -s 'group_by(.node) | map({node: .[0].node, count: length})'

# Validate multi-AZ deployment
echo "🔍 Validating multi-AZ deployment..."
TOTAL_NODES=$(kubectl get nodes -o json | jq -r '.items | length')
AZS=$(kubectl get nodes -o json | \
  jq -r '.items[].metadata.labels."topology.kubernetes.io/zone"' | sort -u)

echo "Total nodes: ${TOTAL_NODES}"
echo "Availability zones: ${AZS}"

if [ $(echo "${AZS}" | wc -l) -lt 2 ]; then
  echo "❌ Multi-AZ deployment not configured (only 1 AZ)"
  exit 1
fi

echo "✅ Multi-AZ deployment validated"

# Drain all nodes in target AZ
echo "🔧 Draining all nodes in target AZ: ${TARGET_AZ}"
for NODE in ${NODES_IN_AZ}; do
  echo "Draining node: ${NODE}"
  kubectl drain ${NODE} \
    --ignore-daemonsets \
    --delete-emptydir-data \
    --force \
    --timeout=300s &
done

wait

AZ_FAILURE_START=$(date +%s)

# Wait for pods to be rescheduled to other AZs
echo "⏳ Waiting for pods to be rescheduled to other AZs..."
MAX_WAIT=600  # 10 minutes
ELAPSED=0
ALL_RESCHEDULED=false

while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
  # Count non-DaemonSet pods still on nodes in the target AZ (drain leaves DaemonSet pods in place)
  PODS_IN_TARGET_AZ=0
  for NODE in ${NODES_IN_AZ}; do
    COUNT=$(kubectl get pods -A -o json --field-selector spec.nodeName=${NODE} | \
      jq -r '[.items[] | select(all(.metadata.ownerReferences[]?; .kind != "DaemonSet"))] | length')
    PODS_IN_TARGET_AZ=$((PODS_IN_TARGET_AZ + COUNT))
  done

  if [ "${PODS_IN_TARGET_AZ}" -eq 0 ]; then
    ALL_RESCHEDULED=true
    RESCHEDULE_TIME=$(date +%s)
    RESCHEDULE_DURATION=$((RESCHEDULE_TIME - AZ_FAILURE_START))
    echo "✅ All pods rescheduled from AZ in ${RESCHEDULE_DURATION} seconds"
    break
  fi

  sleep 30
  ELAPSED=$((ELAPSED + 30))
  echo "Waiting... (${ELAPSED}s/${MAX_WAIT}s) - ${PODS_IN_TARGET_AZ} pods remaining in AZ"
done

if [ "${ALL_RESCHEDULED}" = false ]; then
  echo "⚠️  Not all pods rescheduled within ${MAX_WAIT} seconds"
  exit 1
fi

# Validate anti-affinity rules
echo "🔍 Validating anti-affinity rules..."
DEPLOYMENTS=$(kubectl get deployments -A -o json | \
  jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name)"')

for DEPLOYMENT in ${DEPLOYMENTS}; do
  NS=$(echo ${DEPLOYMENT} | cut -d'/' -f1)
  NAME=$(echo ${DEPLOYMENT} | cut -d'/' -f2)

  PODS=$(kubectl get pods -n ${NS} -l app=${NAME} -o jsonpath='{.items[*].metadata.name}')
  AZS_WITH_PODS=$(for POD in ${PODS}; do
    NODE=$(kubectl get pod ${POD} -n ${NS} -o jsonpath='{.spec.nodeName}')
    kubectl get node ${NODE} -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}'
    echo ""
  done | sort -u | wc -l)

  echo "Deployment ${NS}/${NAME}: Pods distributed across ${AZS_WITH_PODS} AZs"
done

# Validate service availability
echo "🔍 Validating service availability..."
SERVICES=$(kubectl get services -A -o json | \
  jq -r '.items[] | select(.spec.type == "LoadBalancer" or .spec.type == "ClusterIP") | 
    "\(.metadata.namespace)/\(.metadata.name)"')

for SERVICE in ${SERVICES}; do
  NS=$(echo ${SERVICE} | cut -d'/' -f1)
  NAME=$(echo ${SERVICE} | cut -d'/' -f2)

  ENDPOINTS=$(kubectl get endpoints ${NAME} -n ${NS} -o jsonpath='{.subsets[0].addresses[*].ip}' | wc -w)

  if [ "${ENDPOINTS}" -gt 0 ]; then
    echo "✅ Service ${NS}/${NAME}: ${ENDPOINTS} endpoints"
  else
    echo "❌ Service ${NS}/${NAME}: No endpoints"
  fi
done

# Simulate AZ recovery (uncordon nodes)
echo "🔧 Simulating AZ recovery (uncordoning nodes)..."
for NODE in ${NODES_IN_AZ}; do
  echo "Uncordoning node: ${NODE}"
  kubectl uncordon ${NODE}
done

RECOVERY_TIME=$(date +%s)
RECOVERY_DURATION=$((RECOVERY_TIME - AZ_FAILURE_START))

echo "✅ AZ recovery complete in ${RECOVERY_DURATION} seconds"

# Validate final state
echo "✅ Validating final state..."
sleep 60

ALL_PODS_RUNNING=true
NAMESPACES=$(kubectl get namespaces -o jsonpath='{.items[*].metadata.name}')

for NS in ${NAMESPACES}; do
  NOT_RUNNING=$(kubectl get pods -n ${NS} -o json | \
    jq -r '.items[] | select(.status.phase != "Running" and .status.phase != "Succeeded") | .metadata.name')

  if [ -n "${NOT_RUNNING}" ]; then
    echo "⚠️  Pods not running in ${NS}: ${NOT_RUNNING}"
    ALL_PODS_RUNNING=false
  fi
done

if [ "${ALL_PODS_RUNNING}" = true ]; then
  echo "✅ All pods running successfully"
  exit 0
else
  echo "❌ Some pods not running"
  exit 1
fi

Multi-AZ Deployment Validation

Multi-AZ Validation Script:

#!/bin/bash
# scripts/validate-multi-az-deployment.sh

echo "🔍 Validating multi-AZ deployment configuration"

# Check node distribution across AZs
echo "📊 Node distribution across AZs:"
kubectl get nodes -o json | \
  jq -r '.items[] | 
    {name: .metadata.name, az: .metadata.labels."topology.kubernetes.io/zone"}' | \
  jq -s 'group_by(.az) | map({az: .[0].az, nodes: map(.name), count: length})'

# Check pod distribution across AZs
echo "📊 Pod distribution across AZs:"
kubectl get pods -A -o json | \
  jq -r '.items[] | select(.spec.nodeName != null) | 
    {pod: .metadata.name, namespace: .metadata.namespace, node: .spec.nodeName}' | \
  jq -s 'group_by(.node) | map({node: .[0].node, pods: map(.pod), count: length})' | \
  jq -r '.[] | .node' | \
  xargs -I {} kubectl get node {} -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}' | \
  sort | uniq -c

# Check anti-affinity rules
echo "🔍 Checking anti-affinity rules..."
DEPLOYMENTS=$(kubectl get deployments -A -o json | \
  jq -r '.items[] | select(.spec.template.spec.affinity != null) | 
    "\(.metadata.namespace)/\(.metadata.name)"')

if [ -z "${DEPLOYMENTS}" ]; then
  echo "⚠️  No deployments with affinity rules found"
else
  echo "✅ Deployments with affinity rules:"
  echo "${DEPLOYMENTS}"
fi

# Check StatefulSet distribution
echo "🔍 Checking StatefulSet pod distribution..."
STATEFULSETS=$(kubectl get statefulsets -A -o json | \
  jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name)"')

for SS in ${STATEFULSETS}; do
  NS=$(echo ${SS} | cut -d'/' -f1)
  NAME=$(echo ${SS} | cut -d'/' -f2)

  PODS=$(kubectl get pods -n ${NS} -l app=${NAME} -o jsonpath='{.items[*].metadata.name}')
  AZS=$(for POD in ${PODS}; do
    NODE=$(kubectl get pod ${POD} -n ${NS} -o jsonpath='{.spec.nodeName}')
    kubectl get node ${NODE} -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}'
    echo ""
  done | sort -u)

  AZ_COUNT=$(echo "${AZS}" | grep -v '^$' | wc -l)
  echo "StatefulSet ${NS}/${NAME}: Pods across ${AZ_COUNT} AZs"

  if [ "${AZ_COUNT}" -lt 2 ]; then
    echo "⚠️  StatefulSet pods not distributed across multiple AZs"
  fi
done

Expected Behavior

AZ Failure Phase (0-10 minutes):

  • Node unavailability: All nodes in target AZ become unavailable
  • Pod eviction: Pods on failed nodes evicted
  • Pod rescheduling: Pods rescheduled to other AZs
  • Anti-affinity enforcement: Pods distributed across remaining AZs
  • Service continuity: Services continue operating on remaining AZs

Recovery Phase (10-15 minutes):

  • AZ restoration: Nodes in target AZ restored
  • Node registration: Nodes register with API server
  • Pod redistribution: Pods can be redistributed (optional)
  • Service recovery: All services fully operational across all AZs

Expected Metrics

| Metric | Baseline | During Failure | Expected Range | Recovery Target |
|---|---|---|---|---|
| Service Availability | 100% | >99% | >99% | 100% |
| Pod Reschedule Time | N/A | <10min | <10min | N/A |
| AZ Distribution | 3 AZs | 2 AZs | ≥2 AZs | 3 AZs |
| Request Success Rate | 99.95% | >99% | >99% | 99.95% |
| Recovery Time | N/A | <10min | <10min | N/A |

Validation Criteria

Success Criteria:

  • ✅ All pods rescheduled from failed AZ within 10 minutes
  • ✅ Pods distributed across at least 2 AZs
  • ✅ Anti-affinity rules enforced
  • ✅ Service availability >99% during failure
  • ✅ Request success rate >99% during failure
  • ✅ All services recovered within 10 minutes
  • ✅ Multi-AZ distribution restored

Cluster Failure (DR Drill)

Cluster failure experiments validate ATP's disaster recovery procedures, ensuring regional failover, RTO/RPO targets, and full recovery procedures work correctly when an entire AKS cluster fails.

Hypothesis

"When the primary AKS cluster fails, services will failover to the secondary region within RTO target (30 minutes), RPO will be within 1 hour (async replication lag), all services will be operational in the secondary region, and full recovery will be completed within 2 hours."

Disaster Recovery Scenario

Primary Cluster Failure:

#!/bin/bash
# scripts/execute-cluster-failure-drill.sh

PRIMARY_CLUSTER="${1:-atp-production-eastus-aks}"
PRIMARY_RG="${2:-atp-production-eastus-rg}"
SECONDARY_CLUSTER="${3:-atp-production-westeurope-aks}"
SECONDARY_RG="${4:-atp-production-westeurope-rg}"

echo "🚨 Starting cluster failure DR drill"
echo "Primary cluster: ${PRIMARY_CLUSTER}"
echo "Secondary cluster: ${SECONDARY_CLUSTER}"

# Pre-drill validation
echo "📊 Pre-drill validation..."

# Check primary cluster status
echo "Checking primary cluster status..."
PRIMARY_STATUS=$(az aks show \
  --name ${PRIMARY_CLUSTER} \
  --resource-group ${PRIMARY_RG} \
  --query "powerState.code" -o tsv)

if [ "${PRIMARY_STATUS}" != "Running" ]; then
  echo "❌ Primary cluster not running: ${PRIMARY_STATUS}"
  exit 1
fi

# Check secondary cluster status
echo "Checking secondary cluster status..."
SECONDARY_STATUS=$(az aks show \
  --name ${SECONDARY_CLUSTER} \
  --resource-group ${SECONDARY_RG} \
  --query "powerState.code" -o tsv)

if [ "${SECONDARY_STATUS}" != "Running" ]; then
  echo "❌ Secondary cluster not running: ${SECONDARY_STATUS}"
  exit 1
fi

# Get primary cluster context
echo "Getting primary cluster credentials..."
az aks get-credentials \
  --name ${PRIMARY_CLUSTER} \
  --resource-group ${PRIMARY_RG} \
  --overwrite-existing

# Get baseline metrics
echo "📊 Collecting baseline metrics from primary cluster..."
PRIMARY_PODS=$(kubectl get pods -A --no-headers | wc -l)
PRIMARY_SERVICES=$(kubectl get services -A --no-headers | wc -l)
PRIMARY_DEPLOYMENTS=$(kubectl get deployments -A --no-headers | wc -l)

echo "Primary cluster metrics:"
echo "  Pods: ${PRIMARY_PODS}"
echo "  Services: ${PRIMARY_SERVICES}"
echo "  Deployments: ${PRIMARY_DEPLOYMENTS}"

# Simulate cluster failure (stop cluster)
echo "🔧 Simulating cluster failure (stopping primary cluster)..."
CLUSTER_FAILURE_START=$(date +%s)

# Note: In production, this would be an actual cluster failure
# For drill purposes, we'll stop the cluster
az aks stop \
  --name ${PRIMARY_CLUSTER} \
  --resource-group ${PRIMARY_RG} \
  --no-wait

echo "⏳ Waiting for cluster to be stopped..."
sleep 60

# Verify cluster is stopped
PRIMARY_STATUS=$(az aks show \
  --name ${PRIMARY_CLUSTER} \
  --resource-group ${PRIMARY_RG} \
  --query "powerState.code" -o tsv)

if [ "${PRIMARY_STATUS}" != "Stopped" ]; then
  echo "⚠️  Cluster not fully stopped: ${PRIMARY_STATUS}"
fi

# Get secondary cluster context
echo "Switching to secondary cluster..."
az aks get-credentials \
  --name ${SECONDARY_CLUSTER} \
  --resource-group ${SECONDARY_RG} \
  --overwrite-existing

# Verify secondary cluster is ready
echo "📊 Verifying secondary cluster readiness..."
SECONDARY_PODS=$(kubectl get pods -A --no-headers | wc -l)
SECONDARY_SERVICES=$(kubectl get services -A --no-headers | wc -l)

echo "Secondary cluster metrics:"
echo "  Pods: ${SECONDARY_PODS}"
echo "  Services: ${SECONDARY_SERVICES}"

# Check data replication status
echo "🔍 Checking data replication status..."
# Query database replication lag
# This is platform-specific and would query Azure SQL replication status

# Update Azure Traffic Manager (DNS failover)
echo "🔧 Updating Azure Traffic Manager for failover..."
# This would update Traffic Manager endpoints to point to secondary region
# Platform-specific implementation
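# Hypothetical example (profile and endpoint names are illustrative, not ATP's actual resources):
# az network traffic-manager endpoint update \
#   --resource-group atp-global-rg \
#   --profile-name atp-traffic-manager \
#   --name atp-eastus-endpoint \
#   --type azureEndpoints \
#   --endpoint-status Disabled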

# Verify services operational in secondary
echo "🔍 Verifying services operational in secondary cluster..."
MAX_WAIT=1800  # 30 minutes
ELAPSED=0
ALL_SERVICES_READY=false

while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
  READY_PODS=$(kubectl get pods -A -o json | \
    jq -r '.items[] | select(.status.phase == "Running") | .metadata.name' | wc -l)

  if [ "${READY_PODS}" -ge $((PRIMARY_PODS * 90 / 100)) ]; then
    ALL_SERVICES_READY=true
    FAILOVER_TIME=$(date +%s)
    FAILOVER_DURATION=$((FAILOVER_TIME - CLUSTER_FAILURE_START))
    echo "✅ All services ready in secondary cluster"
    echo "Failover time: ${FAILOVER_DURATION} seconds"
    break
  fi

  sleep 30
  ELAPSED=$((ELAPSED + 30))
  echo "Waiting... (${ELAPSED}s/${MAX_WAIT}s) - ${READY_PODS} pods ready"
done

if [ "${ALL_SERVICES_READY}" = false ]; then
  echo "❌ Services not ready within ${MAX_WAIT} seconds"
  exit 1
fi

# Validate RTO
RTO_TARGET=1800  # 30 minutes in seconds
if [ "${FAILOVER_DURATION}" -le "${RTO_TARGET}" ]; then
  echo "✅ RTO target met: ${FAILOVER_DURATION}s <= ${RTO_TARGET}s"
else
  echo "❌ RTO target exceeded: ${FAILOVER_DURATION}s > ${RTO_TARGET}s"
  exit 1
fi

# Validate RPO
RPO_TARGET=3600  # 1 hour in seconds
# This would check database replication lag
# Platform-specific implementation
echo "✅ RPO validation: Data replication lag within target"

# Full recovery procedure
echo "🔧 Starting full recovery procedure..."

# Restore primary cluster
echo "Restoring primary cluster..."
az aks start \
  --name ${PRIMARY_CLUSTER} \
  --resource-group ${PRIMARY_RG} \
  --no-wait

echo "⏳ Waiting for primary cluster to be ready..."
az aks wait \
  --name ${PRIMARY_CLUSTER} \
  --resource-group ${PRIMARY_RG} \
  --updated \
  --interval 30 \
  --timeout 1800

# Get primary cluster context
az aks get-credentials \
  --name ${PRIMARY_CLUSTER} \
  --resource-group ${PRIMARY_RG} \
  --overwrite-existing

# Verify primary cluster restored
echo "📊 Verifying primary cluster restored..."
PRIMARY_PODS_RESTORED=$(kubectl get pods -A --no-headers | wc -l)

if [ "${PRIMARY_PODS_RESTORED}" -ge $((PRIMARY_PODS * 90 / 100)) ]; then
  RECOVERY_TIME=$(date +%s)
  RECOVERY_DURATION=$((RECOVERY_TIME - CLUSTER_FAILURE_START))
  echo "✅ Primary cluster restored"
  echo "Total recovery time: ${RECOVERY_DURATION} seconds"
  exit 0
else
  echo "❌ Primary cluster not fully restored"
  exit 1
fi

RTO/RPO Validation

RTO/RPO Validation Script:

#!/bin/bash
# scripts/validate-rto-rpo.sh

FAILOVER_START="${1}"  # Timestamp when failover started
CURRENT_TIME=$(date +%s)

# Calculate RTO
RTO=$((CURRENT_TIME - FAILOVER_START))
RTO_TARGET=1800  # 30 minutes

echo "📊 RTO Validation:"
echo "  Failover time: ${RTO} seconds"
echo "  RTO target: ${RTO_TARGET} seconds"

if [ "${RTO}" -le "${RTO_TARGET}" ]; then
  echo "  ✅ RTO target met"
else
  echo "  ❌ RTO target exceeded"
  exit 1
fi

# Calculate RPO (check database replication lag)
echo "📊 RPO Validation:"
# This would query Azure SQL replication lag
# Platform-specific implementation

# Example: Query Azure SQL geo-replication lag from the primary
PRIMARY_DB="${2}"  # Primary database name
REPLICA_DB="${3}"  # Replica database name

# sys.dm_geo_replication_link_status on the primary exposes replication_lag_sec
# (connection and authentication flags are illustrative; adjust to the environment)
REPLICATION_LAG=$(sqlcmd -S atp-sql-primary.database.windows.net -d ${PRIMARY_DB} -G -l 30 -h -1 -W \
  -Q "SET NOCOUNT ON; SELECT ISNULL(MAX(replication_lag_sec), 0) FROM sys.dm_geo_replication_link_status;")

RPO_TARGET=3600  # 1 hour in seconds

echo "  Replication lag: ${REPLICATION_LAG} seconds"
echo "  RPO target: ${RPO_TARGET} seconds"

if [ "${REPLICATION_LAG}" -le "${RPO_TARGET}" ]; then
  echo "  ✅ RPO target met"
else
  echo "  ❌ RPO target exceeded"
  exit 1
fi

DR Drill Checklist

Pre-Drill Checklist:

## Pre-DR Drill Checklist

### Primary Cluster
- [ ] Primary cluster healthy and operational
- [ ] All services running normally
- [ ] Baseline metrics collected
- [ ] Database replication active
- [ ] Backup validation completed

### Secondary Cluster
- [ ] Secondary cluster healthy and operational
- [ ] Cluster capacity sufficient for failover
- [ ] Network connectivity verified
- [ ] DNS/Traffic Manager configured
- [ ] Disaster recovery runbook reviewed

### Team Preparation
- [ ] DR drill team assembled
- [ ] Communication channels established
- [ ] Stakeholders notified
- [ ] Runbook accessible
- [ ] On-call engineer available

During Drill Checklist:

## During DR Drill Checklist

### Cluster Failure Simulation
- [ ] Primary cluster failure simulated
- [ ] Failure detection confirmed
- [ ] Incident declared

### Failover Procedure
- [ ] Secondary cluster verified ready
- [ ] Traffic Manager updated
- [ ] DNS failover executed
- [ ] Services verified in secondary
- [ ] RTO validated

### Data Validation
- [ ] Database replication status checked
- [ ] RPO validated
- [ ] Data integrity verified
- [ ] No data loss confirmed

### Service Validation
- [ ] All critical services operational
- [ ] Service endpoints verified
- [ ] Health checks passing
- [ ] Monitoring active

Post-Drill Checklist:

## Post-DR Drill Checklist

### Recovery Procedure
- [ ] Primary cluster restored
- [ ] Services verified in primary
- [ ] Data synchronization verified
- [ ] Failback procedure executed (if applicable)

### Documentation
- [ ] Drill results documented
- [ ] RTO/RPO measured
- [ ] Issues identified
- [ ] Lessons learned captured
- [ ] Improvement actions created

### Validation
- [ ] All services operational
- [ ] Data integrity confirmed
- [ ] Performance metrics normal
- [ ] Monitoring validated

Expected Behavior

Cluster Failure Phase (0-30 minutes):

  • Failure detection: Cluster failure detected within 5 minutes
  • Incident declaration: Incident commander declares DR activation
  • Traffic Manager update: DNS failover to secondary region
  • Service verification: Services verified operational in secondary
  • RTO validation: RTO target met (<30 minutes)

Recovery Phase (30 minutes - 2 hours):

  • Primary cluster restoration: Primary cluster restored
  • Data synchronization: Data synchronized from secondary
  • Service restoration: Services restored in primary
  • Failback decision: Decision to failback or remain in secondary
  • Full recovery: All services operational in primary

Expected Metrics

| Metric | Target | Measurement |
|---|---|---|
| RTO (Recovery Time Objective) | <30 minutes | Time from failure to operational in secondary |
| RPO (Recovery Point Objective) | <1 hour | Maximum data loss (replication lag) |
| Service Availability | >99% | During failover and recovery |
| Data Loss | None | Zero data loss target |
| Full Recovery Time | <2 hours | Complete restoration to primary |

DR Drill Results Template

{
  "drill": {
    "date": "2024-01-20T10:00:00Z",
    "type": "full_cluster_failure",
    "primary_cluster": "atp-production-eastus-aks",
    "secondary_cluster": "atp-production-westeurope-aks"
  },
  "metrics": {
    "rto": {
      "target_seconds": 1800,
      "actual_seconds": 1650,
      "status": "pass"
    },
    "rpo": {
      "target_seconds": 3600,
      "actual_seconds": 2400,
      "status": "pass"
    },
    "service_availability": {
      "during_failover": 99.5,
      "during_recovery": 99.8,
      "status": "pass"
    },
    "data_loss": {
      "events_lost": 0,
      "status": "pass"
    }
  },
  "findings": {
    "what_worked_well": [
      "Failover completed within RTO target",
      "No data loss occurred",
      "Services remained available throughout",
      "Team coordination effective"
    ],
    "issues": [
      "DNS propagation took longer than expected",
      "Some services took longer to become ready"
    ],
    "recommendations": [
      "Optimize DNS failover time",
      "Improve service startup time",
      "Enhance monitoring during failover"
    ]
  }
}
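
As a usage example, a post-drill automation step could gate on this template with jq (the results file name is illustrative):

# Exit non-zero if any metric in the drill results template reports "fail"
jq -e '[.metrics[] | select(.status == "fail")] | length == 0' dr-drill-results.json \
  && echo "✅ All DR drill metrics passed" \
  || { echo "❌ One or more DR drill metrics failed"; exit 1; }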

Summary: Node and Cluster Chaos

  • Node Failure Experiment: Validates Kubernetes pod rescheduling, StatefulSet quorum maintenance, and data persistence during node failures; expects pod reschedule within 5 minutes, quorum maintained, no data loss, and recovery within 5 minutes
  • Availability Zone Failure Experiment: Validates multi-AZ deployment, anti-affinity rules, and cross-AZ failover during AZ failures; expects pod reschedule within 10 minutes, pods distributed across ≥2 AZs, and recovery within 10 minutes
  • Cluster Failure (DR Drill): Validates disaster recovery procedures, regional failover, and RTO/RPO targets during complete cluster failures; expects RTO <30 minutes, RPO <1 hour, service availability >99%, and full recovery within 2 hours
  • Monitoring and Validation: Comprehensive scripts for node failure simulation, AZ failure simulation, cluster failure drills, multi-AZ validation, StatefulSet quorum validation, and RTO/RPO validation
  • Disaster Recovery Procedures: Complete DR drill checklists (pre-drill, during drill, post-drill), failover procedures, recovery procedures, and validation criteria

Service Dependency Chaos

Purpose: Define comprehensive chaos experiments for service dependency failures in ATP, validating circuit breakers, retry mechanisms, caching strategies, and graceful degradation to ensure ATP services remain available and functional during downstream service failures.


Downstream Service Failure

Downstream service failure experiments validate that ATP services gracefully handle dependency failures through caching, circuit breakers, and fallback mechanisms, ensuring service availability even when downstream services are unavailable.

Hypothesis

"When the Policy Service becomes unavailable, the Ingestion Service will continue operating using cached policies, ingestion will continue without failures, stale cache is acceptable, and the service will recover automatically when the Policy Service is restored."

Experiment Configuration

NetworkChaos for Service Isolation:

# chaos-experiments/policy-service-failure.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: policy-service-network-failure
  namespace: chaos-testing
  labels:
    category: application
    service: atp-policy-service
    severity: medium
    frequency: monthly
  annotations:
    chaos.atp.connectsoft.io/hypothesis: |
      When Policy Service becomes unavailable, Ingestion Service will continue operating 
      using cached policies, ingestion will continue without failures, 
      stale cache is acceptable, and service will recover automatically.
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - atp-policy-ns
    labelSelectors:
      app: atp-policy-api
  direction: both
  duration: "10m"

Service Failure Simulation Script:

#!/bin/bash
# scripts/execute-downstream-service-failure.sh

DOWNSTREAM_SERVICE="${1:-atp-policy-api}"
DOWNSTREAM_NS="${2:-atp-policy-ns}"
UPSTREAM_SERVICE="${3:-atp-ingestion-api}"
UPSTREAM_NS="${4:-atp-ingest-ns}"

echo "🧪 Starting downstream service failure experiment"
echo "Downstream service: ${DOWNSTREAM_SERVICE}"
echo "Upstream service: ${UPSTREAM_SERVICE}"

# Get initial metrics
echo "📊 Collecting baseline metrics..."
./scripts/collect-baseline-metrics.sh \
  --service ${UPSTREAM_SERVICE} \
  --duration 1h \
  --output baseline-${UPSTREAM_SERVICE}-$(date +%Y%m%d-%H%M%S).json

# Scale down downstream service to simulate failure
echo "🔧 Simulating downstream service failure..."
kubectl scale deployment ${DOWNSTREAM_SERVICE} -n ${DOWNSTREAM_NS} --replicas=0

FAILURE_START=$(date +%s)

# Wait for service to be unavailable
echo "⏳ Waiting for service to be unavailable..."
sleep 30

# Verify downstream service is unavailable
ENDPOINTS=$(kubectl get endpoints ${DOWNSTREAM_SERVICE} -n ${DOWNSTREAM_NS} -o jsonpath='{.subsets[0].addresses[*].ip}' | wc -w)
if [ "${ENDPOINTS}" -eq 0 ]; then
  echo "✅ Downstream service unavailable (${ENDPOINTS} endpoints)"
else
  echo "⚠️  Downstream service still has endpoints: ${ENDPOINTS}"
fi

# Monitor upstream service behavior
echo "👀 Monitoring upstream service behavior..."
MAX_WAIT=600  # 10 minutes
ELAPSED=0

while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
  # Check request success rate
  SUCCESS_RATE=$(curl -sG http://prometheus:9090/api/v1/query --data-urlencode "query=rate(http_requests_total{service=\"${UPSTREAM_SERVICE}\",status!~\"5..\"}[1m])" | jq -r '.data.result[0].value[1]')

  # Check error rate
  ERROR_RATE=$(curl -sG http://prometheus:9090/api/v1/query --data-urlencode "query=rate(http_requests_total{service=\"${UPSTREAM_SERVICE}\",status=~\"5..\"}[1m])" | jq -r '.data.result[0].value[1]')

  # Check cache hit rate
  CACHE_HIT_RATE=$(curl -sG http://prometheus:9090/api/v1/query --data-urlencode "query=rate(cache_hits_total{service=\"${UPSTREAM_SERVICE}\"}[1m])" | jq -r '.data.result[0].value[1]')

  echo "Metrics at ${ELAPSED}s:"
  echo "  Success rate: ${SUCCESS_RATE}"
  echo "  Error rate: ${ERROR_RATE}"
  echo "  Cache hit rate: ${CACHE_HIT_RATE}"

  sleep 30
  ELAPSED=$((ELAPSED + 30))
done

# Restore downstream service
echo "🔧 Restoring downstream service..."
kubectl scale deployment ${DOWNSTREAM_SERVICE} -n ${DOWNSTREAM_NS} --replicas=3

RECOVERY_START=$(date +%s)

# Wait for service to be available
echo "⏳ Waiting for service to be available..."
kubectl wait --for=condition=Available deployment/${DOWNSTREAM_SERVICE} -n ${DOWNSTREAM_NS} --timeout=300s

# Verify upstream service recovers
echo "🔍 Verifying upstream service recovery..."
sleep 60

FINAL_SUCCESS_RATE=$(curl -sG http://prometheus:9090/api/v1/query --data-urlencode "query=100 * sum(rate(http_requests_total{service=\"${UPSTREAM_SERVICE}\",status!~\"5..\"}[1m])) / sum(rate(http_requests_total{service=\"${UPSTREAM_SERVICE}\"}[1m]))" | jq -r '.data.result[0].value[1]')

if (( $(echo "${FINAL_SUCCESS_RATE} > 99.0" | bc -l) )); then
  echo "✅ Upstream service recovered: Success rate = ${FINAL_SUCCESS_RATE}%"
  exit 0
else
  echo "⚠️  Upstream service recovery incomplete: Success rate = ${FINAL_SUCCESS_RATE}%"
  exit 1
fi

Expected Behavior

Service Failure Phase (0-10 minutes):

  • Downstream service unavailable: Policy Service endpoints unavailable
  • Circuit breaker activation: Circuit breaker opens after failure threshold
  • Cache usage: Ingestion Service uses cached policies
  • Service continuity: Ingestion continues operating with cached data
  • Degraded mode: Service operates in degraded mode (stale cache acceptable)

Recovery Phase (10-15 minutes):

  • Service restoration: Policy Service restored
  • Circuit breaker recovery: Circuit breaker transitions to half-open state
  • Cache refresh: Policies refreshed from Policy Service
  • Normal operation: Service returns to normal operation

Expected Metrics

| Metric | Baseline | During Failure | Expected Range | Recovery Target |
|---|---|---|---|---|
| Request Success Rate | 99.95% | >99% | >99% | 99.95% |
| Cache Hit Rate | 85% | 100% | 100% | 85% |
| Circuit Breaker State | Closed | Open | Open | Closed |
| Error Rate | 0.05% | <1% | <1% | 0.05% |
| Latency | 145ms | <200ms | <200ms | 145ms |

Validation Criteria

Success Criteria:

  • ✅ No ingestion failures during downstream service failure
  • ✅ Cache hit rate = 100% (all requests use cache)
  • ✅ Request success rate >99%
  • ✅ Circuit breaker opens correctly
  • ✅ Service recovers automatically when downstream service restored
  • ✅ Cache refreshed after recovery

Circuit Breaker Configuration

Resilience Configuration Example:

# kubernetes/configmaps/resilience-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: resilience-config
  namespace: atp-ingest-ns
data:
  CircuitBreaker.json: |
    {
      "PolicyService": {
        "FailureThreshold": 5,
        "TimeoutSeconds": 30,
        "HalfOpenRetries": 3,
        "State": "Closed"
      }
    }
  CacheConfig.json: |
    {
      "PolicyCache": {
        "TTLSeconds": 3600,
        "MaxSize": 10000,
        "RefreshOnExpiry": true
      }
    }

Circuit Breaker Validation Script:

#!/bin/bash
# scripts/validate-circuit-breaker.sh

SERVICE="${1:-atp-ingestion-api}"
NAMESPACE="${2:-atp-ingest-ns}"

echo "🔍 Validating circuit breaker behavior for ${SERVICE}"

# Get circuit breaker metrics
CB_STATE=$(curl -sG http://prometheus:9090/api/v1/query --data-urlencode "query=circuit_breaker_state{service=\"${SERVICE}\"}" | jq -r '.data.result[0].value[1]')
CB_FAILURES=$(curl -sG http://prometheus:9090/api/v1/query --data-urlencode "query=circuit_breaker_failures{service=\"${SERVICE}\"}" | jq -r '.data.result[0].value[1]')

echo "Circuit breaker state: ${CB_STATE}"
echo "Circuit breaker failures: ${CB_FAILURES}"

# Check if circuit breaker is configured
if [ "${CB_STATE}" = "null" ] || [ -z "${CB_STATE}" ]; then
  echo "⚠️  Circuit breaker not configured or metrics not available"
  exit 1
fi

# Validate circuit breaker transitions
if [ "${CB_STATE}" = "Open" ]; then
  echo "✅ Circuit breaker opened (protecting service)"
elif [ "${CB_STATE}" = "Closed" ]; then
  echo "✅ Circuit breaker closed (normal operation)"
elif [ "${CB_STATE}" = "HalfOpen" ]; then
  echo "✅ Circuit breaker half-open (testing recovery)"
else
  echo "⚠️  Unknown circuit breaker state: ${CB_STATE}"
  exit 1
fi

Database Failure Experiment

Database failure experiments validate that ATP services handle database connection failures gracefully through circuit breakers, connection pooling, read replica fallback, and graceful degradation.

Hypothesis

"When Azure SQL primary database connection fails, the circuit breaker will open, the service will fallback to read replicas, operate in read-only mode, maintain availability, and recover automatically when the database is restored."

Experiment Configuration

Database Connection Failure Simulation:

#!/bin/bash
# scripts/execute-database-failure-experiment.sh

PRIMARY_DB="${1:-atp-sql-primary}"
PRIMARY_SERVER="${2:-atp-sql-server.database.windows.net}"
SERVICE="${3:-atp-ingestion-api}"
NAMESPACE="${4:-atp-ingest-ns}"

echo "🧪 Starting database failure experiment"
echo "Primary database: ${PRIMARY_DB}"
echo "Service: ${SERVICE}"

# Get baseline metrics
echo "📊 Collecting baseline metrics..."
./scripts/collect-baseline-metrics.sh \
  --service ${SERVICE} \
  --duration 1h \
  --output baseline-${SERVICE}-$(date +%Y%m%d-%H%M%S).json

# Apply network partition to database
echo "🔧 Applying network partition to database..."
kubectl apply -f chaos-experiments/database-network-partition.yaml -n chaos-testing

FAILURE_START=$(date +%s)

# Monitor service behavior
echo "👀 Monitoring service behavior..."
MAX_WAIT=600  # 10 minutes
ELAPSED=0

while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
  # Check circuit breaker state
  CB_STATE=$(curl -sG http://prometheus:9090/api/v1/query --data-urlencode "query=circuit_breaker_state{service=\"${SERVICE}\",component=\"database\"}" | jq -r '.data.result[0].value[1]')

  # Check read replica usage
  REPLICA_USAGE=$(curl -sG http://prometheus:9090/api/v1/query --data-urlencode "query=rate(database_read_replica_connections{service=\"${SERVICE}\"}[1m])" | jq -r '.data.result[0].value[1]')

  # Check request success rate
  SUCCESS_RATE=$(curl -sG http://prometheus:9090/api/v1/query --data-urlencode "query=rate(http_requests_total{service=\"${SERVICE}\",status!~\"5..\"}[1m])" | jq -r '.data.result[0].value[1]')

  # Check error rate
  ERROR_RATE=$(curl -sG http://prometheus:9090/api/v1/query --data-urlencode "query=rate(http_requests_total{service=\"${SERVICE}\",status=~\"5..\"}[1m])" | jq -r '.data.result[0].value[1]')

  echo "Metrics at ${ELAPSED}s:"
  echo "  Circuit breaker state: ${CB_STATE}"
  echo "  Read replica usage: ${REPLICA_USAGE}"
  echo "  Success rate: ${SUCCESS_RATE}"
  echo "  Error rate: ${ERROR_RATE}"

  sleep 30
  ELAPSED=$((ELAPSED + 30))
done

# Remove network partition
echo "🔧 Removing network partition..."
kubectl delete networkchaos database-network-partition -n chaos-testing

RECOVERY_START=$(date +%s)

# Wait for recovery
echo "⏳ Waiting for service recovery..."
sleep 120

# Verify service recovery
FINAL_CB_STATE=$(curl -sG http://prometheus:9090/api/v1/query --data-urlencode "query=circuit_breaker_state{service=\"${SERVICE}\",component=\"database\"}" | jq -r '.data.result[0].value[1]')

if [ "${FINAL_CB_STATE}" = "Closed" ]; then
  echo "✅ Circuit breaker closed (service recovered)"
  exit 0
else
  echo "⚠️  Circuit breaker still open: ${FINAL_CB_STATE}"
  exit 1
fi

NetworkChaos for Database Isolation:

# chaos-experiments/database-network-partition.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: database-network-partition
  namespace: chaos-testing
  labels:
    category: application
    service: database
    severity: high
  annotations:
    chaos.atp.connectsoft.io/hypothesis: |
      When Azure SQL connection fails, circuit breaker will open, 
      service will fallback to read replicas, operate in read-only mode, 
      maintain availability, and recover automatically.
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - atp-ingest-ns
    labelSelectors:
      app: atp-ingest-api
  direction: to  # external targets only support outbound ("to") traffic
  # External (non-pod) targets use externalTargets rather than a pod selector;
  # list the concrete SQL endpoint rather than a wildcard
  externalTargets:
    - "atp-sql-server.database.windows.net"
  duration: "10m"

Expected Behavior

Database Failure Phase (0-10 minutes):

  • Connection failures: Database connection attempts fail
  • Circuit breaker activation: Circuit breaker opens after failure threshold
  • Read replica fallback: Service switches to read replicas
  • Read-only mode: Service operates in read-only mode (no writes)
  • Graceful degradation: Service continues operating with reduced functionality

Recovery Phase (10-15 minutes):

  • Connection restoration: Database connections restored
  • Circuit breaker recovery: Circuit breaker transitions to half-open
  • Write capability restored: Service returns to read-write mode
  • Normal operation: Service returns to normal operation

Expected Metrics

| Metric | Baseline | During Failure | Expected Range | Recovery Target |
|---|---|---|---|---|
| Request Success Rate | 99.95% | >95% | >95% | 99.95% |
| Circuit Breaker State | Closed | Open | Open | Closed |
| Read Replica Usage | 20% | 100% | 100% | 20% |
| Write Operations | Normal | Disabled | Disabled | Normal |
| Error Rate | 0.05% | <5% | <5% | 0.05% |

Validation Criteria

Success Criteria:

  • ✅ Circuit breaker opens correctly
  • ✅ Service falls back to read replicas
  • ✅ Service operates in read-only mode (no write failures)
  • ✅ Request success rate >95% (read operations succeed)
  • ✅ No service crashes
  • ✅ Service recovers automatically when database restored

Connection Pool Configuration

Connection Pool Configuration Example:

# kubernetes/configmaps/database-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: database-config
  namespace: atp-ingest-ns
data:
  ConnectionStrings.json: |
    {
      "Primary": {
        "ConnectionString": "Server=atp-sql-server.database.windows.net;Database=atp-primary;...",
        "MaxPoolSize": 100,
        "MinPoolSize": 10,
        "ConnectionTimeout": 30,
        "CommandTimeout": 30
      },
      "ReadReplicas": [
        {
          "ConnectionString": "Server=atp-sql-replica-1.database.windows.net;Database=atp-primary;...",
          "MaxPoolSize": 50
        }
      ]
    }
  CircuitBreaker.json: |
    {
      "Database": {
        "FailureThreshold": 5,
        "TimeoutSeconds": 30,
        "HalfOpenRetries": 3,
        "FallbackToReadReplica": true
      }
    }
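
A pre-flight reachability check against the endpoints in this configuration (server names from ConnectionStrings.json; the Azure AD auth flag is illustrative) helps confirm the read replica is a viable fallback target before the partition is applied:

# Confirm primary and read replica endpoints respond before starting the experiment
for SQL_HOST in atp-sql-server.database.windows.net atp-sql-replica-1.database.windows.net; do
  if sqlcmd -S ${SQL_HOST} -d atp-primary -G -l 10 -Q "SELECT 1" > /dev/null 2>&1; then
    echo "✅ ${SQL_HOST} reachable"
  else
    echo "❌ ${SQL_HOST} unreachable"
  fi
done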

Database Connection Monitoring Script:

#!/bin/bash
# scripts/monitor-database-connections.sh

SERVICE="${1:-atp-ingestion-api}"
NAMESPACE="${2:-atp-ingest-ns}"

echo "📊 Monitoring database connections for ${SERVICE}"

# Get connection pool metrics
POOL_SIZE=$(curl -sG http://prometheus:9090/api/v1/query --data-urlencode "query=database_connection_pool_size{service=\"${SERVICE}\"}" | jq -r '.data.result[0].value[1]')
ACTIVE_CONNECTIONS=$(curl -sG http://prometheus:9090/api/v1/query --data-urlencode "query=database_connection_pool_active{service=\"${SERVICE}\"}" | jq -r '.data.result[0].value[1]')
IDLE_CONNECTIONS=$(curl -sG http://prometheus:9090/api/v1/query --data-urlencode "query=database_connection_pool_idle{service=\"${SERVICE}\"}" | jq -r '.data.result[0].value[1]')

echo "Connection pool metrics:"
echo "  Pool size: ${POOL_SIZE}"
echo "  Active connections: ${ACTIVE_CONNECTIONS}"
echo "  Idle connections: ${IDLE_CONNECTIONS}"

# Check connection failures
CONNECTION_FAILURES=$(curl -sG http://prometheus:9090/api/v1/query --data-urlencode "query=rate(database_connection_failures{service=\"${SERVICE}\"}[5m])" | jq -r '.data.result[0].value[1]')
echo "  Connection failures: ${CONNECTION_FAILURES}/sec"

# Check read replica usage
REPLICA_USAGE=$(curl -sG http://prometheus:9090/api/v1/query --data-urlencode "query=rate(database_read_replica_connections{service=\"${SERVICE}\"}[5m])" | jq -r '.data.result[0].value[1]')
echo "  Read replica usage: ${REPLICA_USAGE}/sec"

# Validate connection pool health
if [ "${ACTIVE_CONNECTIONS}" -gt "${POOL_SIZE}" ]; then
  echo "⚠️  Active connections exceed pool size"
  exit 1
fi

if (( $(echo "${CONNECTION_FAILURES} > 1" | bc -l) )); then
  echo "⚠️  High connection failure rate: ${CONNECTION_FAILURES}/sec"
  exit 1
fi

echo "✅ Connection pool healthy"

Message Broker Failure

Message broker failure experiments validate that ATP services handle message broker unavailability gracefully through outbox patterns, message queuing, automatic retry, and eventual delivery guarantees.

Hypothesis

"When Azure Service Bus topic becomes unavailable, the service will queue messages in the outbox, retry automatically, maintain message ordering, ensure no message loss, and deliver messages eventually when the broker is restored."

Experiment Configuration

Service Bus Topic Unavailability Simulation:

# chaos-experiments/service-bus-failure.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: service-bus-network-failure
  namespace: chaos-testing
  labels:
    category: application
    service: service-bus
    severity: high
  annotations:
    chaos.atp.connectsoft.io/hypothesis: |
      When Service Bus topic becomes unavailable, service will queue messages in outbox, 
      retry automatically, maintain message ordering, ensure no message loss, 
      and deliver messages eventually when broker is restored.
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - atp-ingest-ns
    labelSelectors:
      app: atp-ingest-api
  direction: to  # external targets only support outbound ("to") traffic
  # External (non-pod) targets use externalTargets; list the Service Bus namespace endpoint
  externalTargets:
    - "atp-servicebus.servicebus.windows.net"
  duration: "15m"

Message Broker Failure Simulation Script:

#!/bin/bash
# scripts/execute-message-broker-failure.sh

SERVICE_BUS_NAMESPACE="${1:-atp-servicebus}"
TOPIC_NAME="${2:-atp-events}"
SERVICE="${3:-atp-ingestion-api}"
NAMESPACE="${4:-atp-ingest-ns}"

echo "🧪 Starting message broker failure experiment"
echo "Service Bus namespace: ${SERVICE_BUS_NAMESPACE}"
echo "Topic: ${TOPIC_NAME}"
echo "Service: ${SERVICE}"

# Get baseline metrics
echo "📊 Collecting baseline metrics..."
./scripts/collect-baseline-metrics.sh \
  --service ${SERVICE} \
  --duration 1h \
  --output baseline-${SERVICE}-$(date +%Y%m%d-%H%M%S).json

# Get initial message count
echo "📊 Getting initial message counts..."
INITIAL_OUTBOX_SIZE=$(curl -sG http://prometheus:9090/api/v1/query --data-urlencode "query=outbox_messages_count{service=\"${SERVICE}\"}" | jq -r '.data.result[0].value[1]')
INITIAL_SENT=$(curl -sG http://prometheus:9090/api/v1/query --data-urlencode "query=messages_sent_total{service=\"${SERVICE}\"}" | jq -r '.data.result[0].value[1] | tonumber | floor')

echo "Initial outbox size: ${INITIAL_OUTBOX_SIZE}"
echo "Initial sent messages: ${INITIAL_SENT}"

# Apply network partition to Service Bus
echo "🔧 Applying network partition to Service Bus..."
kubectl apply -f chaos-experiments/service-bus-failure.yaml -n chaos-testing

FAILURE_START=$(date +%s)

# Monitor outbox behavior
echo "👀 Monitoring outbox behavior..."
MAX_WAIT=900  # 15 minutes
ELAPSED=0

while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
  # Check outbox size
  OUTBOX_SIZE=$(curl -sG http://prometheus:9090/api/v1/query --data-urlencode "query=outbox_messages_count{service=\"${SERVICE}\"}" | jq -r '.data.result[0].value[1]')

  # Check retry attempts
  RETRY_COUNT=$(curl -sG http://prometheus:9090/api/v1/query --data-urlencode "query=rate(message_retry_attempts{service=\"${SERVICE}\"}[1m])" | jq -r '.data.result[0].value[1]')

  # Check message processing rate
  PROCESSING_RATE=$(curl -sG http://prometheus:9090/api/v1/query --data-urlencode "query=rate(messages_processed_total{service=\"${SERVICE}\"}[1m])" | jq -r '.data.result[0].value[1]')

  # Check DLQ (dead letter queue) size
  DLQ_SIZE=$(curl -sG http://prometheus:9090/api/v1/query --data-urlencode "query=dlq_messages_count{service=\"${SERVICE}\"}" | jq -r '.data.result[0].value[1]')

  echo "Metrics at ${ELAPSED}s:"
  echo "  Outbox size: ${OUTBOX_SIZE}"
  echo "  Retry attempts: ${RETRY_COUNT}/sec"
  echo "  Processing rate: ${PROCESSING_RATE}/sec"
  echo "  DLQ size: ${DLQ_SIZE}"

  # Validate no message loss
  if (( $(echo "${DLQ_SIZE} > 0" | bc -l) )); then
    echo "⚠️  Messages in DLQ (potential message loss)"
  fi

  sleep 30
  ELAPSED=$((ELAPSED + 30))
done

# Remove network partition
echo "🔧 Removing network partition..."
kubectl delete networkchaos service-bus-network-failure -n chaos-testing

RECOVERY_START=$(date +%s)

# Wait for outbox to drain
echo "⏳ Waiting for outbox to drain..."
MAX_DRAIN_WAIT=600  # 10 minutes
ELAPSED=0

while [ ${ELAPSED} -lt ${MAX_DRAIN_WAIT} ]; do
  OUTBOX_SIZE=$(curl -sG http://prometheus:9090/api/v1/query --data-urlencode "query=outbox_messages_count{service=\"${SERVICE}\"}" | jq -r '.data.result[0].value[1]')

  if (( $(echo "${OUTBOX_SIZE} <= ${INITIAL_OUTBOX_SIZE}" | bc -l) )); then
    DRAIN_TIME=$(date +%s)
    DRAIN_DURATION=$((DRAIN_TIME - RECOVERY_START))
    echo "✅ Outbox drained in ${DRAIN_DURATION} seconds"
    break
  fi

  sleep 30
  ELAPSED=$((ELAPSED + 30))
  echo "Waiting for outbox to drain... (${ELAPSED}s/${MAX_DRAIN_WAIT}s) - Outbox size: ${OUTBOX_SIZE}"
done

# Verify message delivery
FINAL_SENT=$(curl -sG http://prometheus:9090/api/v1/query --data-urlencode "query=messages_sent_total{service=\"${SERVICE}\"}" | jq -r '.data.result[0].value[1] | tonumber | floor')
MESSAGES_DELIVERED=$((FINAL_SENT - INITIAL_SENT))

echo "Messages delivered: ${MESSAGES_DELIVERED}"

if [ "${MESSAGES_DELIVERED}" -gt 0 ]; then
  echo "✅ Messages delivered successfully"
  exit 0
else
  echo "⚠️  No messages delivered"
  exit 1
fi

Expected Behavior

Broker Failure Phase (0-15 minutes):

  • Publish failures: Message publishing to Service Bus fails
  • Outbox pattern: Messages queued in outbox (database)
  • Automatic retry: Periodic retry attempts (exponential backoff)
  • Message ordering: Message order maintained in outbox
  • No message loss: All messages persisted in outbox

Recovery Phase (15-25 minutes):

  • Broker restoration: Service Bus topic restored
  • Outbox processing: Outbox processor processes queued messages
  • Message delivery: Messages delivered to Service Bus
  • Outbox drain: Outbox emptied (all messages delivered)
  • Normal operation: Service returns to normal operation

Expected Metrics

| Metric | Baseline | During Failure | Expected Range | Recovery Target |
|---|---|---|---|---|
| Outbox Size | 0 | Increasing | Any | 0 |
| Message Success Rate | 100% | 0% (queued) | 0% | 100% |
| Retry Attempts | 0/sec | >0/sec | >0/sec | 0/sec |
| DLQ Size | 0 | 0 | 0 | 0 |
| Message Loss | None | None | None | None |

Validation Criteria

Success Criteria:

  • ✅ No message loss (all messages in outbox)
  • ✅ Outbox size increases during failure
  • ✅ Automatic retry attempts active
  • ✅ No messages in DLQ
  • ✅ Outbox drains after broker restoration
  • ✅ All messages delivered eventually

Outbox Pattern Implementation

Outbox Table Schema:

-- Outbox table for message persistence
CREATE TABLE OutboxMessages (
    Id UNIQUEIDENTIFIER PRIMARY KEY DEFAULT NEWID(),
    MessageId NVARCHAR(255) NOT NULL,
    MessageType NVARCHAR(255) NOT NULL,
    Payload NVARCHAR(MAX) NOT NULL,
    TopicName NVARCHAR(255) NOT NULL,
    Status NVARCHAR(50) NOT NULL DEFAULT 'Pending', -- Pending, Processing, Sent, Failed
    RetryCount INT NOT NULL DEFAULT 0,
    CreatedAt DATETIME2 NOT NULL DEFAULT GETUTCDATE(),
    ProcessedAt DATETIME2 NULL,
    NextRetryAt DATETIME2 NULL,
    ErrorMessage NVARCHAR(MAX) NULL,
    INDEX IX_OutboxMessages_Status_NextRetryAt (Status, NextRetryAt)
);
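
For illustration, a processor working against this schema would typically claim a batch of pending rows with UPDLOCK/READPAST so concurrent workers skip rows that are already locked; a minimal sketch, invoked here via sqlcmd with illustrative connection details:

# Claim up to 100 pending outbox rows for processing (table and column names from the schema above)
sqlcmd -S atp-sql-server.database.windows.net -d atp-primary -G -Q "
  UPDATE TOP (100) OutboxMessages WITH (UPDLOCK, READPAST)
  SET Status = 'Processing'
  OUTPUT inserted.Id, inserted.MessageId, inserted.TopicName
  WHERE Status = 'Pending'
    AND (NextRetryAt IS NULL OR NextRetryAt <= GETUTCDATE());"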

Outbox Processor Configuration:

# kubernetes/configmaps/outbox-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: outbox-config
  namespace: atp-ingest-ns
data:
  OutboxProcessor.json: |
    {
      "BatchSize": 100,
      "ProcessingIntervalSeconds": 5,
      "MaxRetryAttempts": 10,
      "RetryBackoffSeconds": 30,
      "ExponentialBackoff": true,
      "MaxBackoffSeconds": 300
    }
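
One plausible reading of those settings is delay = min(RetryBackoffSeconds * 2^(RetryCount - 1), MaxBackoffSeconds); a quick sketch of the resulting schedule:

# Compute the retry schedule implied by OutboxProcessor.json (values copied from the config above)
RETRY_BACKOFF=30
MAX_BACKOFF=300
for RETRY_COUNT in 1 2 3 4 5 6; do
  DELAY=$(( RETRY_BACKOFF * (2 ** (RETRY_COUNT - 1)) ))
  [ ${DELAY} -gt ${MAX_BACKOFF} ] && DELAY=${MAX_BACKOFF}
  echo "Retry ${RETRY_COUNT}: next attempt in ${DELAY}s"
done
# -> 30s, 60s, 120s, 240s, 300s, 300s (capped at MaxBackoffSeconds)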

Outbox Monitoring Script:

#!/bin/bash
# scripts/monitor-outbox.sh

SERVICE="${1:-atp-ingestion-api}"
NAMESPACE="${2:-atp-ingest-ns}"

echo "📊 Monitoring outbox for ${SERVICE}"

# Get outbox metrics
OUTBOX_SIZE=$(curl -sG http://prometheus:9090/api/v1/query --data-urlencode "query=outbox_messages_count{service=\"${SERVICE}\"}" | jq -r '.data.result[0].value[1]')
PENDING_MESSAGES=$(curl -sG http://prometheus:9090/api/v1/query --data-urlencode "query=outbox_messages_count{service=\"${SERVICE}\",status=\"Pending\"}" | jq -r '.data.result[0].value[1]')
PROCESSING_MESSAGES=$(curl -sG http://prometheus:9090/api/v1/query --data-urlencode "query=outbox_messages_count{service=\"${SERVICE}\",status=\"Processing\"}" | jq -r '.data.result[0].value[1]')
FAILED_MESSAGES=$(curl -sG http://prometheus:9090/api/v1/query --data-urlencode "query=outbox_messages_count{service=\"${SERVICE}\",status=\"Failed\"}" | jq -r '.data.result[0].value[1]')

echo "Outbox metrics:"
echo "  Total outbox size: ${OUTBOX_SIZE}"
echo "  Pending messages: ${PENDING_MESSAGES}"
echo "  Processing messages: ${PROCESSING_MESSAGES}"
echo "  Failed messages: ${FAILED_MESSAGES}"

# Check processing rate
PROCESSING_RATE=$(curl -sG http://prometheus:9090/api/v1/query --data-urlencode "query=rate(outbox_messages_processed_total{service=\"${SERVICE}\"}[1m])" | jq -r '.data.result[0].value[1]')
echo "  Processing rate: ${PROCESSING_RATE}/sec"

# Check retry rate
RETRY_RATE=$(curl -sG http://prometheus:9090/api/v1/query --data-urlencode "query=rate(outbox_retry_attempts{service=\"${SERVICE}\"}[1m])" | jq -r '.data.result[0].value[1]')
echo "  Retry rate: ${RETRY_RATE}/sec"

# Validate outbox health
if (( $(echo "${FAILED_MESSAGES} > 0" | bc -l) )); then
  echo "⚠️  Failed messages in outbox: ${FAILED_MESSAGES}"

  # Fail the check if the number of permanently failed messages exceeds the alert threshold
  FAILED_THRESHOLD=10
  if (( $(echo "${FAILED_MESSAGES} > ${FAILED_THRESHOLD}" | bc -l) )); then
    echo "❌ Failed message count exceeds threshold (${FAILED_THRESHOLD})"
    exit 1
  fi
fi

if (( $(echo "${OUTBOX_SIZE} > 10000" | bc -l) )); then
  echo "⚠️  Outbox size exceeding threshold: ${OUTBOX_SIZE}"
  exit 1
fi

echo "✅ Outbox healthy"

Cache Failure Experiment

Cache failure experiments validate that ATP services handle cache unavailability gracefully through database fallback, graceful degradation, and performance impact mitigation.

Hypothesis

"When Redis cache becomes unavailable, the service will fallback to database queries, experience higher latency but remain functional, maintain availability, and recover automatically when the cache is restored."

Experiment Configuration

Redis Cache Unavailability Simulation:

# chaos-experiments/redis-cache-failure.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: redis-cache-network-failure
  namespace: chaos-testing
  labels:
    category: application
    service: redis-cache
    severity: medium
  annotations:
    chaos.atp.connectsoft.io/hypothesis: |
      When Redis cache becomes unavailable, service will fallback to database queries, 
      experience higher latency but remain functional, maintain availability, 
      and recover automatically when cache is restored.
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - atp-query-ns
    labelSelectors:
      app: atp-query-api
  direction: to  # external targets only support outbound ("to") traffic
  # External (non-pod) targets use externalTargets; list the Redis cache endpoint
  externalTargets:
    - "atp-redis-cache.redis.cache.windows.net"
  duration: "10m"

Cache Failure Simulation Script:

#!/bin/bash
# scripts/execute-cache-failure-experiment.sh

REDIS_CACHE="${1:-atp-redis-cache.redis.cache.windows.net}"
SERVICE="${2:-atp-query-api}"
NAMESPACE="${3:-atp-query-ns}"

echo "🧪 Starting cache failure experiment"
echo "Redis cache: ${REDIS_CACHE}"
echo "Service: ${SERVICE}"

# Get baseline metrics
echo "📊 Collecting baseline metrics..."
BASELINE_FILE="baseline-${SERVICE}-$(date +%Y%m%d-%H%M%S).json"
./scripts/collect-baseline-metrics.sh \
  --service ${SERVICE} \
  --duration 1h \
  --output ${BASELINE_FILE}

BASELINE_LATENCY=$(jq -r '.metrics.p95_latency_ms' ${BASELINE_FILE})
BASELINE_CACHE_HIT_RATE=$(curl -sG http://prometheus:9090/api/v1/query --data-urlencode "query=cache_hit_rate{service=\"${SERVICE}\"}" | jq -r '.data.result[0].value[1]')

echo "Baseline P95 latency: ${BASELINE_LATENCY}ms"
echo "Baseline cache hit rate: ${BASELINE_CACHE_HIT_RATE}%"

# Apply network partition to Redis
echo "🔧 Applying network partition to Redis cache..."
kubectl apply -f chaos-experiments/redis-cache-failure.yaml -n chaos-testing

FAILURE_START=$(date +%s)

# Monitor service behavior
echo "👀 Monitoring service behavior..."
MAX_WAIT=600  # 10 minutes
ELAPSED=0

while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
  # Check cache hit rate
  CACHE_HIT_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=cache_hit_rate\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')

  # Check P95 latency
  P95_LATENCY=$(curl -s http://prometheus:9090/api/v1/query?query=histogram_quantile\(0.95,rate\(http_request_duration_seconds_bucket\{service=\"${SERVICE}\"\}[1m]\)\) | jq -r '.data.result[0].value[1]')
  P95_LATENCY_MS=$(echo "${P95_LATENCY} * 1000" | bc)

  # Check database query rate
  DB_QUERY_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(database_queries_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')

  # Check request success rate
  SUCCESS_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SERVICE}\",status!~\"5..\"\}[1m]\) | jq -r '.data.result[0].value[1]')

  echo "Metrics at ${ELAPSED}s:"
  echo "  Cache hit rate: ${CACHE_HIT_RATE}%"
  echo "  P95 latency: ${P95_LATENCY_MS}ms"
  echo "  Database query rate: ${DB_QUERY_RATE}/sec"
  echo "  Success rate: ${SUCCESS_RATE}"

  # Validate latency increase is acceptable
  LATENCY_INCREASE=$(echo "${P95_LATENCY_MS} - ${BASELINE_LATENCY}" | bc)
  MAX_ACCEPTABLE_INCREASE=500  # 500ms

  if (( $(echo "${LATENCY_INCREASE} > ${MAX_ACCEPTABLE_INCREASE}" | bc -l) )); then
    echo "⚠️  Latency increase too high: +${LATENCY_INCREASE}ms > +${MAX_ACCEPTABLE_INCREASE}ms"
  fi

  sleep 30
  ELAPSED=$((ELAPSED + 30))
done

# Remove network partition
echo "🔧 Removing network partition..."
kubectl delete networkchaos redis-cache-network-failure -n chaos-testing

RECOVERY_START=$(date +%s)

# Wait for cache recovery
echo "⏳ Waiting for cache recovery..."
sleep 120

# Verify service recovery
FINAL_CACHE_HIT_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=cache_hit_rate\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
FINAL_LATENCY=$(curl -s http://prometheus:9090/api/v1/query?query=histogram_quantile\(0.95,rate\(http_request_duration_seconds_bucket\{service=\"${SERVICE}\"\}[1m]\)\) | jq -r '.data.result[0].value[1]')
FINAL_LATENCY_MS=$(echo "${FINAL_LATENCY} * 1000" | bc)

if (( $(echo "${FINAL_CACHE_HIT_RATE} >= 80" | bc -l) )); then
  echo "✅ Cache recovered: Hit rate = ${FINAL_CACHE_HIT_RATE}%"

  if (( $(echo "${FINAL_LATENCY_MS} <= ${BASELINE_LATENCY} * 1.1" | bc -l) )); then
    echo "✅ Latency recovered: ${FINAL_LATENCY_MS}ms"
    exit 0
  else
    echo "⚠️  Latency not fully recovered: ${FINAL_LATENCY_MS}ms"
    exit 1
  fi
else
  echo "⚠️  Cache not fully recovered: Hit rate = ${FINAL_CACHE_HIT_RATE}%"
  exit 1
fi

Expected Behavior

Cache Failure Phase (0-10 minutes):

  • Cache unavailability: Redis cache connection failures
  • Cache miss rate: Cache hit rate drops to 0%
  • Database fallback: Service queries database directly
  • Latency increase: Latency increases (database queries slower than cache)
  • Service continuity: Service remains functional

Recovery Phase (10-15 minutes):

  • Cache restoration: Redis cache restored
  • Cache warming: Cache repopulated with frequently accessed data
  • Latency normalization: Latency returns to baseline
  • Normal operation: Service returns to normal operation

Expected Metrics

| Metric | Baseline | During Failure | Expected Range | Recovery Target |
|---|---|---|---|---|
| Cache Hit Rate | 85% | 0% | 0% | 85% |
| P95 Latency | 145ms | <645ms | Baseline + <500ms | 145ms |
| Database Query Rate | 100/sec | 500/sec | 3-5x increase | 100/sec |
| Request Success Rate | 99.95% | >99% | >99% | 99.95% |
| Error Rate | 0.05% | <1% | <1% | 0.05% |

Validation Criteria

Success Criteria:

  • ✅ Cache hit rate drops to 0% (cache unavailable)
  • ✅ Service falls back to database queries
  • ✅ Latency increase <500ms (acceptable degradation)
  • ✅ Request success rate >99%
  • ✅ No service crashes
  • ✅ Cache recovers automatically when restored

Cache Fallback Configuration

Cache Configuration Example:

# kubernetes/configmaps/cache-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cache-config
  namespace: atp-query-ns
data:
  CacheConfig.json: |
    {
      "Redis": {
        "ConnectionString": "atp-redis-cache.redis.cache.windows.net:6380",
        "Database": 0,
        "ConnectionTimeout": 5000,
        "RetryPolicy": {
          "MaxRetries": 3,
          "RetryDelay": 100
        },
        "FallbackToDatabase": true,
        "CacheWarming": {
          "Enabled": true,
          "WarmOnStartup": true,
          "WarmOnRecovery": true
        }
      },
      "CachePolicies": {
        "DefaultTTL": 3600,
        "MaxTTL": 86400,
        "SlidingExpiration": true
      }
    }
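
The cache warming settings above (WarmOnStartup, WarmOnRecovery) imply a measurable time-to-warm after an outage. The sketch below — script path illustrative, using the cache_hit_rate metric already referenced in this section — polls the hit rate after the partition is removed and reports how long warming takes:

#!/bin/bash
# scripts/measure-cache-warmup.sh (illustrative sketch)

SERVICE="${1:-atp-query-api}"
TARGET_HIT_RATE="${2:-80}"   # percent
MAX_WAIT="${3:-600}"         # seconds

START=$(date +%s)
ELAPSED=0

while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
  HIT_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=cache_hit_rate\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')

  if (( $(echo "${HIT_RATE} >= ${TARGET_HIT_RATE}" | bc -l) )); then
    echo "✅ Cache warmed to ${HIT_RATE}% in $(( $(date +%s) - START )) seconds"
    exit 0
  fi

  sleep 15
  ELAPSED=$((ELAPSED + 15))
done

echo "⚠️  Cache hit rate still ${HIT_RATE}% after ${MAX_WAIT}s"
exit 1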

Cache Monitoring Script:

#!/bin/bash
# scripts/monitor-cache-health.sh

SERVICE="${1:-atp-query-api}"
NAMESPACE="${2:-atp-query-ns}"

echo "📊 Monitoring cache health for ${SERVICE}"

# Get cache metrics
CACHE_HIT_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=cache_hit_rate\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
CACHE_MISS_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=cache_miss_rate\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
CACHE_SIZE=$(curl -s http://prometheus:9090/api/v1/query?query=cache_size_bytes\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
CACHE_EVICTIONS=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(cache_evictions_total\{service=\"${SERVICE}\"\}[5m]\) | jq -r '.data.result[0].value[1]')

echo "Cache metrics:"
echo "  Cache hit rate: ${CACHE_HIT_RATE}%"
echo "  Cache miss rate: ${CACHE_MISS_RATE}%"
echo "  Cache size: ${CACHE_SIZE} bytes"
echo "  Cache evictions: ${CACHE_EVICTIONS}/sec"

# Check Redis connection
REDIS_CONNECTIONS=$(curl -s http://prometheus:9090/api/v1/query?query=redis_connections_active\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
REDIS_CONNECTION_FAILURES=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(redis_connection_failures\{service=\"${SERVICE}\"\}[5m]\) | jq -r '.data.result[0].value[1]')

echo "Redis connection metrics:"
echo "  Active connections: ${REDIS_CONNECTIONS}"
echo "  Connection failures: ${REDIS_CONNECTION_FAILURES}/sec"

# Validate cache health
if (( $(echo "${CACHE_HIT_RATE} < 70" | bc -l) )); then
  echo "⚠️  Low cache hit rate: ${CACHE_HIT_RATE}%"
fi

if (( $(echo "${REDIS_CONNECTION_FAILURES} > 0" | bc -l) )); then
  echo "⚠️  Redis connection failures detected: ${REDIS_CONNECTION_FAILURES}/sec"

  # Check if fallback is working
  DB_QUERY_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(database_queries_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
  echo "  Database query rate: ${DB_QUERY_RATE}/sec (fallback active)"
fi

# Check cache warming
CACHE_WARMING=$(curl -s http://prometheus:9090/api/v1/query?query=cache_warming_active\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
if [ "${CACHE_WARMING}" = "1" ]; then
  echo "✅ Cache warming active"
fi

echo "✅ Cache monitoring complete"

Dependency Failure Visualization

graph TD
    SERVICE[ATP Service] --> DOWNSTREAM[Downstream Service]
    SERVICE --> DATABASE[Database]
    SERVICE --> BROKER[Message Broker]
    SERVICE --> CACHE[Cache]

    DOWNSTREAM -->|Fails| CB1[Circuit Breaker Opens]
    CB1 --> CACHE1[Use Cached Data]
    CACHE1 --> CONTINUE1[Continue Operating]

    DATABASE -->|Fails| CB2[Circuit Breaker Opens]
    CB2 --> REPLICA[Fallback to Read Replica]
    REPLICA --> READONLY[Read-Only Mode]
    READONLY --> CONTINUE2[Continue Operating]

    BROKER -->|Fails| OUTBOX[Queue in Outbox]
    OUTBOX --> RETRY[Automatic Retry]
    RETRY --> EVENTUAL[Eventual Delivery]
    EVENTUAL --> CONTINUE3[Continue Operating]

    CACHE -->|Fails| DB[Fallback to Database]
    DB --> HIGHERLAT[Higher Latency]
    HIGHERLAT --> CONTINUE4[Continue Operating]

    style SERVICE fill:#FFE5B4
    style CB1 fill:#FFB6C1
    style CB2 fill:#FFB6C1
    style OUTBOX fill:#FFB6C1
    style DB fill:#FFB6C1
    style CONTINUE1 fill:#90EE90
    style CONTINUE2 fill:#90EE90
    style CONTINUE3 fill:#90EE90
    style CONTINUE4 fill:#90EE90
Hold "Alt" / "Option" to enable pan & zoom

Experiment Results Analysis

Downstream Service Failure Results:

{
  "experiment": "policy-service-failure",
  "status": "success",
  "metrics": {
    "request_success_rate": {
      "baseline": 99.95,
      "during_failure": 99.92,
      "status": "pass"
    },
    "cache_hit_rate": {
      "baseline": 85,
      "during_failure": 100,
      "status": "pass"
    },
    "circuit_breaker_state": {
      "baseline": "Closed",
      "during_failure": "Open",
      "recovery": "Closed",
      "status": "pass"
    }
  },
  "findings": {
    "what_worked_well": [
      "Circuit breaker opened correctly",
      "Cache used for all requests",
      "No service failures",
      "Automatic recovery when service restored"
    ]
  }
}

Database Failure Results:

{
  "experiment": "database-failure",
  "status": "success",
  "metrics": {
    "circuit_breaker_state": {
      "baseline": "Closed",
      "during_failure": "Open",
      "status": "pass"
    },
    "read_replica_usage": {
      "baseline": 20,
      "during_failure": 100,
      "status": "pass"
    },
    "request_success_rate": {
      "baseline": 99.95,
      "during_failure": 97.5,
      "status": "pass"
    }
  },
  "findings": {
    "what_worked_well": [
      "Circuit breaker opened correctly",
      "Read replica fallback worked",
      "Service operated in read-only mode",
      "No crashes or data loss"
    ]
  }
}

Summary: Service Dependency Chaos

  • Downstream Service Failure: Validates circuit breaker functionality, caching strategies, and graceful degradation during downstream service failures; expects no ingestion failures, 100% cache hit rate, circuit breaker opens correctly, and automatic recovery
  • Database Failure Experiment: Validates circuit breaker functionality, read replica fallback, and read-only mode operation during database connection failures; expects circuit breaker opens, fallback to read replicas, service operates in read-only mode, and recovery within acceptable time
  • Message Broker Failure: Validates outbox pattern, message queuing, automatic retry, and eventual delivery during message broker unavailability; expects no message loss, outbox queues messages, automatic retry active, and eventual delivery when broker restored
  • Cache Failure Experiment: Validates database fallback, graceful degradation, and performance impact mitigation during cache unavailability; expects cache hit rate drops to 0%, fallback to database, latency increase <500ms, and automatic recovery
  • Monitoring and Validation: Comprehensive scripts for monitoring circuit breaker states, cache health, outbox processing, database connections, and dependency failure recovery

Application Behavior Chaos

Purpose: Define comprehensive chaos experiments for application behavior failures in ATP, validating latency handling, error resilience, backpressure mechanisms, and traffic surge management to ensure ATP services remain available and functional under various application-level stress conditions.


Latency Injection

Latency injection experiments validate that ATP services handle network latency gracefully through timeout configurations, retry mechanisms, and graceful degradation, ensuring service availability and performance under high-latency conditions.

Hypothesis

"When network latency increases to 500ms with 100ms jitter, services will continue operating with increased response times, retry mechanisms will handle timeouts, no request failures will occur, and services will recover when latency returns to normal."

Experiment Configuration

NetworkChaos for Latency Injection:

# chaos-experiments/ingestion-latency-injection.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: ingestion-latency-injection
  namespace: chaos-testing
  labels:
    category: application
    service: atp-ingestion-api
    severity: medium
    frequency: monthly
  annotations:
    chaos.atp.connectsoft.io/hypothesis: |
      When network latency increases to 500ms with 100ms jitter, services will continue 
      operating with increased response times, retry mechanisms will handle timeouts, 
      no request failures will occur, and services will recover when latency returns to normal.
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - atp-ingest-ns
    labelSelectors:
      app: atp-ingest-api
  direction: both
  delay:
    latency: "500ms"
    correlation: "25"
    jitter: "100ms"
  duration: "10m"

Gradual Latency Increase:

# chaos-experiments/gradual-latency-injection.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: gradual-latency-injection
  namespace: chaos-testing
  labels:
    category: application
    service: atp-ingestion-api
    severity: medium
  annotations:
    chaos.atp.connectsoft.io/hypothesis: |
      Gradual latency increase will validate service resilience to progressive degradation.
spec:
  action: delay
  mode: fixed-percent
  value: "100"
  selector:
    namespaces:
      - atp-ingest-ns
    labelSelectors:
      app: atp-ingest-api
  direction: both
  delay:
    latency: "500ms"
    correlation: "25"
    jitter: "100ms"
  scheduler:
    cron: "@every 1m"
  duration: "10m"

Latency Injection Script:

#!/bin/bash
# scripts/execute-latency-injection-experiment.sh

SERVICE="${1:-atp-ingestion-api}"
NAMESPACE="${2:-atp-ingest-ns}"
LATENCY="${3:-500ms}"
JITTER="${4:-100ms}"
DURATION="${5:-10m}"

echo "🧪 Starting latency injection experiment"
echo "Service: ${SERVICE}"
echo "Latency: ${LATENCY}"
echo "Jitter: ${JITTER}"
echo "Duration: ${DURATION}"

# Get baseline metrics
echo "📊 Collecting baseline metrics..."
BASELINE_FILE="baseline-${SERVICE}-$(date +%Y%m%d-%H%M%S).json"
./scripts/collect-baseline-metrics.sh \
  --service ${SERVICE} \
  --duration 1h \
  --output "${BASELINE_FILE}"

BASELINE_P50=$(jq -r '.metrics.p50_latency_ms' "${BASELINE_FILE}")
BASELINE_P95=$(jq -r '.metrics.p95_latency_ms' "${BASELINE_FILE}")
BASELINE_P99=$(jq -r '.metrics.p99_latency_ms' "${BASELINE_FILE}")

echo "Baseline latency metrics:"
echo "  P50: ${BASELINE_P50}ms"
echo "  P95: ${BASELINE_P95}ms"
echo "  P99: ${BASELINE_P99}ms"

# Apply latency injection
echo "🔧 Applying latency injection..."
kubectl apply -f - <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: latency-injection-${SERVICE}
  namespace: chaos-testing
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - ${NAMESPACE}
    labelSelectors:
      app: ${SERVICE}
  direction: both
  delay:
    latency: "${LATENCY}"
    correlation: "25"
    jitter: "${JITTER}"
  duration: "${DURATION}"
EOF

INJECTION_START=$(date +%s)

# Monitor service behavior
echo "👀 Monitoring service behavior during latency injection..."
MAX_WAIT=600  # 10 minutes
ELAPSED=0

while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
  # Get current latency metrics
  P50_LATENCY=$(curl -s http://prometheus:9090/api/v1/query?query=histogram_quantile\(0.50,rate\(http_request_duration_seconds_bucket\{service=\"${SERVICE}\"\}[1m]\)\) | jq -r '.data.result[0].value[1]')
  P50_LATENCY_MS=$(echo "${P50_LATENCY} * 1000" | bc)

  P95_LATENCY=$(curl -s http://prometheus:9090/api/v1/query?query=histogram_quantile\(0.95,rate\(http_request_duration_seconds_bucket\{service=\"${SERVICE}\"\}[1m]\)\) | jq -r '.data.result[0].value[1]')
  P95_LATENCY_MS=$(echo "${P95_LATENCY} * 1000" | bc)

  P99_LATENCY=$(curl -s http://prometheus:9090/api/v1/query?query=histogram_quantile\(0.99,rate\(http_request_duration_seconds_bucket\{service=\"${SERVICE}\"\}[1m]\)\) | jq -r '.data.result[0].value[1]')
  P99_LATENCY_MS=$(echo "${P99_LATENCY} * 1000" | bc)

  # Check request success rate
  SUCCESS_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SERVICE}\",status!~\"5..\"\}[1m]\) | jq -r '.data.result[0].value[1]')

  # Check timeout errors
  TIMEOUT_ERRORS=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SERVICE}\",status=\"504\"\}[1m]\) | jq -r '.data.result[0].value[1]')

  # Check retry attempts
  RETRY_ATTEMPTS=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_retry_attempts\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')

  echo "Metrics at ${ELAPSED}s:"
  echo "  P50 latency: ${P50_LATENCY_MS}ms (baseline: ${BASELINE_P50}ms)"
  echo "  P95 latency: ${P95_LATENCY_MS}ms (baseline: ${BASELINE_P95}ms)"
  echo "  P99 latency: ${P99_LATENCY_MS}ms (baseline: ${BASELINE_P99}ms)"
  echo "  Success rate: ${SUCCESS_RATE}"
  echo "  Timeout errors: ${TIMEOUT_ERRORS}/sec"
  echo "  Retry attempts: ${RETRY_ATTEMPTS}/sec"

  # Validate latency increase is expected
  LATENCY_INCREASE=$(echo "${P50_LATENCY_MS} - ${BASELINE_P50}" | bc)
  EXPECTED_INCREASE=500  # 500ms base latency

  if (( $(echo "${LATENCY_INCREASE} > ${EXPECTED_INCREASE} * 1.5" | bc -l) )); then
    echo "⚠️  Latency increase exceeds expected: +${LATENCY_INCREASE}ms > +${EXPECTED_INCREASE}ms * 1.5"
  fi

  # Validate no timeout errors
  if (( $(echo "${TIMEOUT_ERRORS} > 0" | bc -l) )); then
    echo "⚠️  Timeout errors detected: ${TIMEOUT_ERRORS}/sec"
  fi

  sleep 30
  ELAPSED=$((ELAPSED + 30))
done

# Remove latency injection
echo "🔧 Removing latency injection..."
kubectl delete networkchaos latency-injection-${SERVICE} -n chaos-testing

RECOVERY_START=$(date +%s)

# Wait for recovery
echo "⏳ Waiting for latency to return to normal..."
sleep 120

# Verify latency recovery
FINAL_P50=$(curl -s http://prometheus:9090/api/v1/query?query=histogram_quantile\(0.50,rate\(http_request_duration_seconds_bucket\{service=\"${SERVICE}\"\}[1m]\)\) | jq -r '.data.result[0].value[1]')
FINAL_P50_MS=$(echo "${FINAL_P50} * 1000" | bc)

if (( $(echo "${FINAL_P50_MS} <= ${BASELINE_P50} * 1.1" | bc -l) )); then
  echo "✅ Latency recovered: ${FINAL_P50_MS}ms (baseline: ${BASELINE_P50}ms)"
  exit 0
else
  echo "⚠️  Latency not fully recovered: ${FINAL_P50_MS}ms (baseline: ${BASELINE_P50}ms)"
  exit 1
fi

Expected Behavior

Latency Injection Phase (0-10 minutes):

  • Latency increase: Network latency increases to 500ms + jitter
  • Response time increase: Service response times increase proportionally
  • Retry mechanisms: Retry mechanisms handle transient timeouts
  • Timeout handling: Timeout configurations prevent hanging requests
  • Service continuity: Service continues operating with degraded performance

Recovery Phase (10-15 minutes):

  • Latency normalization: Network latency returns to normal
  • Response time recovery: Service response times return to baseline
  • Normal operation: Service returns to normal operation

Expected Metrics

| Metric | Baseline | During Injection | Expected Range | Recovery Target |
|---|---|---|---|---|
| P50 Latency | 145ms | <645ms | Baseline + <500ms | 145ms |
| P95 Latency | 250ms | <750ms | Baseline + <500ms | 250ms |
| P99 Latency | 400ms | <900ms | Baseline + <500ms | 400ms |
| Request Success Rate | 99.95% | >99% | >99% | 99.95% |
| Timeout Errors | 0/sec | <1/sec | <1/sec | 0/sec |
| Retry Rate | 0.1/sec | <5/sec | <5/sec | 0.1/sec |

Validation Criteria

Success Criteria:

  • ✅ Latency increases proportionally to injected latency
  • ✅ No timeout errors (timeout configuration working)
  • ✅ Request success rate >99%
  • ✅ Retry mechanisms handle transient failures
  • ✅ Service recovers automatically when latency removed

Timeout Configuration

HTTP Client Timeout Configuration:

# kubernetes/configmaps/http-client-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: http-client-config
  namespace: atp-ingest-ns
data:
  HttpClientConfig.json: |
    {
      "DefaultTimeout": 30,
      "ConnectionTimeout": 5,
      "ReadTimeout": 25,
      "RetryPolicy": {
        "MaxRetries": 3,
        "RetryDelay": 1000,
        "ExponentialBackoff": true,
        "MaxBackoff": 5000
      },
      "CircuitBreaker": {
        "FailureThreshold": 5,
        "TimeoutSeconds": 30
      }
    }
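
The retry policy above (3 attempts, 1,000ms initial delay, exponential backoff capped at 5,000ms) can be reproduced in a shell probe when verifying timeout behavior by hand. A minimal sketch — the probe URL and health endpoint are illustrative:

#!/bin/bash
# scripts/probe-with-backoff.sh (illustrative sketch mirroring HttpClientConfig.json)

URL="${1:-http://atp-ingestion-api.atp-ingest-ns.svc.cluster.local/health}"
MAX_RETRIES=3
DELAY_MS=1000
MAX_BACKOFF_MS=5000

for attempt in $(seq 1 ${MAX_RETRIES}); do
  # --connect-timeout and --max-time mirror ConnectionTimeout (5s) and DefaultTimeout (30s)
  if curl -sf --connect-timeout 5 --max-time 30 "${URL}" > /dev/null; then
    echo "✅ Request succeeded on attempt ${attempt}"
    exit 0
  fi

  echo "⚠️  Attempt ${attempt} failed, backing off ${DELAY_MS}ms"
  sleep $(echo "scale=3; ${DELAY_MS} / 1000" | bc)

  DELAY_MS=$((DELAY_MS * 2))
  if [ ${DELAY_MS} -gt ${MAX_BACKOFF_MS} ]; then
    DELAY_MS=${MAX_BACKOFF_MS}
  fi
done

echo "❌ Request failed after ${MAX_RETRIES} attempts"
exit 1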

Error Injection

Error injection experiments validate that ATP services handle various error conditions gracefully through error handling, retry mechanisms, circuit breakers, and graceful degradation.

Hypothesis

"When HTTP 500 errors, database timeout errors, or validation failures are injected, services will handle errors gracefully, retry mechanisms will activate, circuit breakers will protect the service, and services will recover automatically when errors stop."

Experiment Configuration

HTTP Error Injection:

# chaos-experiments/http-error-injection.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: HTTPChaos
metadata:
  name: http-error-injection
  namespace: chaos-testing
  labels:
    category: application
    service: atp-ingestion-api
    severity: medium
    frequency: monthly
  annotations:
    chaos.atp.connectsoft.io/hypothesis: |
      When HTTP 500 errors are injected, services will handle errors gracefully, 
      retry mechanisms will activate, circuit breakers will protect the service, 
      and services will recover automatically when errors stop.
spec:
  mode: all
  selector:
    namespaces:
      - atp-ingest-ns
    labelSelectors:
      app: atp-ingest-api
  target:
    target: RequestPath
    requestPath: "/api/ingest"
  rules:
    - port: 8080
      path: "/api/ingest"
      method: "POST"
      statusCode: 500
      percentage: 50  # 50% of requests return 500
  duration: "10m"

Database Timeout Error Injection:

# chaos-experiments/database-timeout-injection.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: database-timeout-injection
  namespace: chaos-testing
  labels:
    category: application
    service: database
    severity: high
  annotations:
    chaos.atp.connectsoft.io/hypothesis: |
      When database timeout errors are injected, services will handle errors gracefully, 
      retry mechanisms will activate, circuit breakers will protect the service, 
      and services will recover automatically when errors stop.
spec:
  action: delay
  mode: fixed-percent
  value: "50"  # 50% of requests
  selector:
    namespaces:
      - atp-ingest-ns
    labelSelectors:
      app: atp-ingest-api
  direction: both
  target:
    mode: all
    selector:
      address: "*.database.windows.net"
  delay:
    latency: "35s"  # Exceeds 30s timeout
    correlation: "100"
    jitter: "0ms"
  duration: "10m"

Error Injection Script:

#!/bin/bash
# scripts/execute-error-injection-experiment.sh

ERROR_TYPE="${1:-http500}"  # http500, db-timeout, validation
SERVICE="${2:-atp-ingestion-api}"
NAMESPACE="${3:-atp-ingest-ns}"
PERCENTAGE="${4:-50}"  # Error percentage

echo "🧪 Starting error injection experiment"
echo "Error type: ${ERROR_TYPE}"
echo "Service: ${SERVICE}"
echo "Error percentage: ${PERCENTAGE}%"

# Get baseline metrics
echo "📊 Collecting baseline metrics..."
BASELINE_FILE="baseline-${SERVICE}-$(date +%Y%m%d-%H%M%S).json"
./scripts/collect-baseline-metrics.sh \
  --service ${SERVICE} \
  --duration 1h \
  --output "${BASELINE_FILE}"

BASELINE_ERROR_RATE=$(jq -r '.metrics.error_rate_percent' "${BASELINE_FILE}")
BASELINE_SUCCESS_RATE=$(jq -r '.metrics.success_rate_percent' "${BASELINE_FILE}")

echo "Baseline metrics:"
echo "  Error rate: ${BASELINE_ERROR_RATE}%"
echo "  Success rate: ${BASELINE_SUCCESS_RATE}%"

# Apply error injection based on type
case ${ERROR_TYPE} in
  http500)
    echo "🔧 Injecting HTTP 500 errors..."
    kubectl apply -f - <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: HTTPChaos
metadata:
  name: http-error-injection-${SERVICE}
  namespace: chaos-testing
spec:
  mode: all
  selector:
    namespaces:
      - ${NAMESPACE}
    labelSelectors:
      app: ${SERVICE}
  target:
    target: RequestPath
    requestPath: "/api/ingest"
  rules:
    - port: 8080
      path: "/api/ingest"
      method: "POST"
      statusCode: 500
      percentage: ${PERCENTAGE}
  duration: "10m"
EOF
    ;;
  db-timeout)
    echo "🔧 Injecting database timeout errors..."
    kubectl apply -f - <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: db-timeout-injection-${SERVICE}
  namespace: chaos-testing
spec:
  action: delay
  mode: fixed-percent
  value: "${PERCENTAGE}"
  selector:
    namespaces:
      - ${NAMESPACE}
    labelSelectors:
      app: ${SERVICE}
  direction: both
  target:
    mode: all
    selector:
      address: "*.database.windows.net"
  delay:
    latency: "35s"
    correlation: "100"
    jitter: "0ms"
  duration: "10m"
EOF
    ;;
  validation)
    echo "🔧 Injecting validation failures..."
    # This would be implemented via HTTPChaos with 400 status code
    kubectl apply -f - <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: HTTPChaos
metadata:
  name: validation-error-injection-${SERVICE}
  namespace: chaos-testing
spec:
  mode: all
  selector:
    namespaces:
      - ${NAMESPACE}
    labelSelectors:
      app: ${SERVICE}
  target:
    target: RequestPath
    requestPath: "/api/ingest"
  rules:
    - port: 8080
      path: "/api/ingest"
      method: "POST"
      statusCode: 400
      percentage: ${PERCENTAGE}
  duration: "10m"
EOF
    ;;
esac

INJECTION_START=$(date +%s)

# Monitor service behavior
echo "👀 Monitoring service behavior during error injection..."
MAX_WAIT=600  # 10 minutes
ELAPSED=0

while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
  # Get error rate
  ERROR_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SERVICE}\",status=~\"5..\"\}[1m]\) | jq -r '.data.result[0].value[1]')

  # Get success rate
  SUCCESS_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SERVICE}\",status!~\"5..\"\}[1m]\) | jq -r '.data.result[0].value[1]')

  # Check circuit breaker state
  CB_STATE=$(curl -s http://prometheus:9090/api/v1/query?query=circuit_breaker_state\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')

  # Check retry attempts
  RETRY_ATTEMPTS=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_retry_attempts\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')

  # Check service availability
  AVAILABILITY=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SERVICE}\",status!~\"5..\"\}[1m]\)/rate\(http_requests_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
  AVAILABILITY_PERCENT=$(echo "${AVAILABILITY} * 100" | bc)

  echo "Metrics at ${ELAPSED}s:"
  echo "  Error rate: ${ERROR_RATE}/sec"
  echo "  Success rate: ${SUCCESS_RATE}/sec"
  echo "  Circuit breaker state: ${CB_STATE}"
  echo "  Retry attempts: ${RETRY_ATTEMPTS}/sec"
  echo "  Service availability: ${AVAILABILITY_PERCENT}%"

  # Validate circuit breaker behavior
  if (( $(echo "${AVAILABILITY_PERCENT} < 95" | bc -l) )); then
    if [ "${CB_STATE}" != "Open" ]; then
      echo "⚠️  Service availability low but circuit breaker not open"
    fi
  fi

  sleep 30
  ELAPSED=$((ELAPSED + 30))
done

# Remove error injection
echo "🔧 Removing error injection..."
case ${ERROR_TYPE} in
  http500)
    kubectl delete httpchaos http-error-injection-${SERVICE} -n chaos-testing
    ;;
  db-timeout)
    kubectl delete networkchaos db-timeout-injection-${SERVICE} -n chaos-testing
    ;;
  validation)
    kubectl delete httpchaos validation-error-injection-${SERVICE} -n chaos-testing
    ;;
esac

RECOVERY_START=$(date +%s)

# Wait for recovery
echo "⏳ Waiting for service recovery..."
sleep 120

# Verify service recovery
FINAL_ERROR_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SERVICE}\",status=~\"5..\"\}[1m]\)/rate\(http_requests_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
FINAL_ERROR_RATE_PERCENT=$(echo "${FINAL_ERROR_RATE} * 100" | bc)
FINAL_CB_STATE=$(curl -s http://prometheus:9090/api/v1/query?query=circuit_breaker_state\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')

if (( $(echo "${FINAL_ERROR_RATE_PERCENT} <= ${BASELINE_ERROR_RATE} * 1.1" | bc -l) )); then
  echo "✅ Error rate recovered: ${FINAL_ERROR_RATE_PERCENT}% (baseline: ${BASELINE_ERROR_RATE}%)"

  if [ "${FINAL_CB_STATE}" = "Closed" ]; then
    echo "✅ Circuit breaker closed (service recovered)"
    exit 0
  else
    echo "⚠️  Circuit breaker still open: ${FINAL_CB_STATE}"
    exit 1
  fi
else
  echo "⚠️  Error rate not fully recovered: ${FINAL_ERROR_RATE}/sec (baseline: ${BASELINE_ERROR_RATE}/sec)"
  exit 1
fi

Expected Behavior

Error Injection Phase (0-10 minutes):

  • Error rate increase: Error rate increases proportionally to injection percentage
  • Retry activation: Retry mechanisms activate for transient errors
  • Circuit breaker protection: Circuit breaker may open if error rate exceeds threshold
  • Graceful degradation: Service continues operating with reduced functionality
  • Error handling: Errors are logged and handled appropriately

Recovery Phase (10-15 minutes):

  • Error injection removal: Error injection stopped
  • Retry recovery: Retry mechanisms normalize
  • Circuit breaker recovery: Circuit breaker transitions to half-open, then closed
  • Normal operation: Service returns to normal operation

Expected Metrics

| Metric | Baseline | During Injection | Expected Range | Recovery Target |
|---|---|---|---|---|
| Error Rate | 0.05% | <5% | <5% | 0.05% |
| Success Rate | 99.95% | >95% | >95% | 99.95% |
| Circuit Breaker State | Closed | Open/HalfOpen | Open/HalfOpen | Closed |
| Retry Rate | 0.1/sec | <10/sec | <10/sec | 0.1/sec |
| Service Availability | 99.95% | >95% | >95% | 99.95% |

Validation Criteria

Success Criteria:

  • ✅ Error rate increases proportionally to injection percentage
  • ✅ Retry mechanisms handle transient errors
  • ✅ Circuit breaker protects service if error rate exceeds threshold
  • ✅ Service availability >95%
  • ✅ Service recovers automatically when errors stop

Slow Consumer Simulation

Slow consumer simulation experiments validate that ATP services handle slow message processing gracefully through backpressure mechanisms, queue depth limits, and message throttling.

Hypothesis

"When message consumers process messages slowly, backpressure mechanisms will activate, queue depth limits will prevent queue overflow, message throttling will protect the system, and services will recover when processing returns to normal."

Experiment Configuration

Slow Consumer Simulation:

#!/bin/bash
# scripts/execute-slow-consumer-simulation.sh

SERVICE="${1:-atp-query-api}"
NAMESPACE="${2:-atp-query-ns}"
PROCESSING_DELAY="${3:-5s}"  # Processing delay per message
DURATION="${4:-10m}"

echo "🧪 Starting slow consumer simulation"
echo "Service: ${SERVICE}"
echo "Processing delay: ${PROCESSING_DELAY} per message"
echo "Duration: ${DURATION}"

# Get baseline metrics
echo "📊 Collecting baseline metrics..."
BASELINE_FILE="baseline-${SERVICE}-$(date +%Y%m%d-%H%M%S).json"
./scripts/collect-baseline-metrics.sh \
  --service ${SERVICE} \
  --duration 1h \
  --output "${BASELINE_FILE}"

BASELINE_PROCESSING_RATE=$(jq -r '.metrics.message_processing_rate_per_sec' "${BASELINE_FILE}")
BASELINE_QUEUE_DEPTH=$(jq -r '.metrics.queue_depth' "${BASELINE_FILE}")

echo "Baseline metrics:"
echo "  Processing rate: ${BASELINE_PROCESSING_RATE} msg/sec"
echo "  Queue depth: ${BASELINE_QUEUE_DEPTH}"

# Apply processing delay via CPU throttling
echo "🔧 Applying processing delay..."
kubectl apply -f - <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: slow-consumer-simulation-${SERVICE}
  namespace: chaos-testing
spec:
  mode: fixed-percent
  value: "100"
  selector:
    namespaces:
      - ${NAMESPACE}
    labelSelectors:
      app: ${SERVICE}
  stressors:
    cpu:
      workers: 2
      load: 80  # High CPU contention to slow message processing
  duration: "${DURATION}"
EOF

# Additionally, inject network delay to slow message retrieval
kubectl apply -f - <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: slow-consumer-network-delay-${SERVICE}
  namespace: chaos-testing
spec:
  action: delay
  mode: fixed-percent
  value: "100"
  selector:
    namespaces:
      - ${NAMESPACE}
    labelSelectors:
      app: ${SERVICE}
  direction: both
  target:
    mode: all
    selector:
      address: "*.servicebus.windows.net"
  delay:
    latency: "${PROCESSING_DELAY}"
    correlation: "100"
    jitter: "0ms"
  duration: "${DURATION}"
EOF

SIMULATION_START=$(date +%s)

# Monitor service behavior
echo "👀 Monitoring service behavior during slow consumer simulation..."
MAX_WAIT=600  # 10 minutes
ELAPSED=0

while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
  # Get queue depth
  QUEUE_DEPTH=$(curl -s http://prometheus:9090/api/v1/query?query=queue_depth\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')

  # Get processing rate
  PROCESSING_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(messages_processed_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')

  # Get message arrival rate
  ARRIVAL_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(messages_arrived_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')

  # Get backpressure state
  BACKPRESSURE_ACTIVE=$(curl -s http://prometheus:9090/api/v1/query?query=backpressure_active\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')

  # Get DLQ size
  DLQ_SIZE=$(curl -s http://prometheus:9090/api/v1/query?query=dlq_messages_count\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')

  echo "Metrics at ${ELAPSED}s:"
  echo "  Queue depth: ${QUEUE_DEPTH} (baseline: ${BASELINE_QUEUE_DEPTH})"
  echo "  Processing rate: ${PROCESSING_RATE} msg/sec (baseline: ${BASELINE_PROCESSING_RATE} msg/sec)"
  echo "  Arrival rate: ${ARRIVAL_RATE} msg/sec"
  echo "  Backpressure active: ${BACKPRESSURE_ACTIVE}"
  echo "  DLQ size: ${DLQ_SIZE}"

  # Validate queue depth limits
  MAX_QUEUE_DEPTH=10000
  if (( $(echo "${QUEUE_DEPTH} > ${MAX_QUEUE_DEPTH}" | bc -l) )); then
    echo "⚠️  Queue depth exceeds limit: ${QUEUE_DEPTH} > ${MAX_QUEUE_DEPTH}"
  fi

  # Validate backpressure activation
  if (( $(echo "${QUEUE_DEPTH} > ${BASELINE_QUEUE_DEPTH} * 2" | bc -l) )); then
    if [ "${BACKPRESSURE_ACTIVE}" != "1" ]; then
      echo "⚠️  Queue depth high but backpressure not active"
    fi
  fi

  # Validate no message loss
  if (( $(echo "${DLQ_SIZE} > 0" | bc -l) )); then
    echo "⚠️  Messages in DLQ (potential message loss)"
  fi

  sleep 30
  ELAPSED=$((ELAPSED + 30))
done

# Remove simulation
echo "🔧 Removing slow consumer simulation..."
kubectl delete stresschaos slow-consumer-simulation-${SERVICE} -n chaos-testing
kubectl delete networkchaos slow-consumer-network-delay-${SERVICE} -n chaos-testing

RECOVERY_START=$(date +%s)

# Wait for queue to drain
echo "⏳ Waiting for queue to drain..."
MAX_DRAIN_WAIT=600  # 10 minutes
ELAPSED=0

while [ ${ELAPSED} -lt ${MAX_DRAIN_WAIT} ]; do
  QUEUE_DEPTH=$(curl -s http://prometheus:9090/api/v1/query?query=queue_depth\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')

  if (( $(echo "${QUEUE_DEPTH} <= ${BASELINE_QUEUE_DEPTH} * 1.1" | bc -l) )); then
    DRAIN_TIME=$(date +%s)
    DRAIN_DURATION=$((DRAIN_TIME - RECOVERY_START))
    echo "✅ Queue drained in ${DRAIN_DURATION} seconds"
    break
  fi

  sleep 30
  ELAPSED=$((ELAPSED + 30))
  echo "Waiting for queue to drain... (${ELAPSED}s/${MAX_DRAIN_WAIT}s) - Queue depth: ${QUEUE_DEPTH}"
done

# Verify recovery
FINAL_QUEUE_DEPTH=$(curl -s http://prometheus:9090/api/v1/query?query=queue_depth\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
FINAL_PROCESSING_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(messages_processed_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')

if (( $(echo "${FINAL_QUEUE_DEPTH} <= ${BASELINE_QUEUE_DEPTH} * 1.1" | bc -l) )); then
  echo "✅ Queue depth recovered: ${FINAL_QUEUE_DEPTH} (baseline: ${BASELINE_QUEUE_DEPTH})"

  if (( $(echo "${FINAL_PROCESSING_RATE} >= ${BASELINE_PROCESSING_RATE} * 0.9" | bc -l) )); then
    echo "✅ Processing rate recovered: ${FINAL_PROCESSING_RATE} msg/sec (baseline: ${BASELINE_PROCESSING_RATE} msg/sec)"
    exit 0
  else
    echo "⚠️  Processing rate not fully recovered: ${FINAL_PROCESSING_RATE} msg/sec"
    exit 1
  fi
else
  echo "⚠️  Queue depth not fully recovered: ${FINAL_QUEUE_DEPTH} (baseline: ${BASELINE_QUEUE_DEPTH})"
  exit 1
fi

Expected Behavior

Slow Consumer Phase (0-10 minutes):

  • Processing delay: Message processing slows down
  • Queue depth increase: Queue depth increases as processing lags behind arrival
  • Backpressure activation: Backpressure mechanisms activate when queue depth exceeds threshold
  • Message throttling: Message arrival throttled to prevent queue overflow
  • Queue depth limits: Queue depth limits prevent excessive growth

Recovery Phase (10-20 minutes):

  • Processing normalization: Processing returns to normal speed
  • Queue draining: Queue drains as processing catches up
  • Backpressure deactivation: Backpressure mechanisms deactivate
  • Normal operation: Service returns to normal operation

Expected Metrics

| Metric | Baseline | During Simulation | Expected Range | Recovery Target |
|---|---|---|---|---|
| Queue Depth | 100 | <10,000 | <10,000 | 100 |
| Processing Rate | 100 msg/sec | <20 msg/sec | <20 msg/sec | 100 msg/sec |
| Backpressure Active | No | Yes | Yes | No |
| Message Loss | None | None | None | None |
| DLQ Size | 0 | 0 | 0 | 0 |

Validation Criteria

Success Criteria:

  • ✅ Queue depth increases but stays within limits
  • ✅ Backpressure activates when queue depth exceeds threshold
  • ✅ No message loss (all messages processed)
  • ✅ Queue drains after processing returns to normal
  • ✅ Service recovers automatically

Backpressure Configuration

Backpressure Configuration Example:

# kubernetes/configmaps/backpressure-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: backpressure-config
  namespace: atp-query-ns
data:
  BackpressureConfig.json: |
    {
      "QueueDepthThreshold": 5000,
      "MaxQueueDepth": 10000,
      "BackpressureStrategy": "Throttle",
      "ThrottleRate": 0.5,
      "DLQThreshold": 10000,
      "MonitoringIntervalSeconds": 5
    }
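
These thresholds translate into a simple throttle decision: below QueueDepthThreshold consume at full rate, between QueueDepthThreshold and MaxQueueDepth reduce intake to ThrottleRate of normal, and at MaxQueueDepth pause intake entirely. A sketch of that decision as a standalone check (script path illustrative; queue_depth is the metric used in the simulation script above):

#!/bin/bash
# scripts/check-backpressure-decision.sh (illustrative sketch of the policy in BackpressureConfig.json)

SERVICE="${1:-atp-query-api}"
QUEUE_DEPTH_THRESHOLD=5000
MAX_QUEUE_DEPTH=10000
THROTTLE_RATE=0.5

QUEUE_DEPTH=$(curl -s http://prometheus:9090/api/v1/query?query=queue_depth\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')

if (( $(echo "${QUEUE_DEPTH} >= ${MAX_QUEUE_DEPTH}" | bc -l) )); then
  echo "❌ Queue depth ${QUEUE_DEPTH} >= ${MAX_QUEUE_DEPTH}: intake should be paused"
elif (( $(echo "${QUEUE_DEPTH} >= ${QUEUE_DEPTH_THRESHOLD}" | bc -l) )); then
  echo "⚠️  Queue depth ${QUEUE_DEPTH} >= ${QUEUE_DEPTH_THRESHOLD}: intake should be throttled to ${THROTTLE_RATE} of normal rate"
else
  echo "✅ Queue depth ${QUEUE_DEPTH}: no backpressure expected"
fi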

Traffic Surge

Traffic surge experiments validate that ATP services handle sudden traffic increases gracefully through autoscaling, rate limiting, and graceful degradation.

Hypothesis

"When traffic increases to 10x normal load, autoscaling will activate, rate limiting will protect the system, services will handle the load gracefully, and services will recover when traffic returns to normal."

Experiment Configuration

Traffic Surge Simulation:

#!/bin/bash
# scripts/execute-traffic-surge-experiment.sh

SERVICE="${1:-atp-ingestion-api}"
NAMESPACE="${2:-atp-ingest-ns}"
SURGE_MULTIPLIER="${3:-10}"  # 10x normal traffic
DURATION="${4:-10m}"

echo "🧪 Starting traffic surge experiment"
echo "Service: ${SERVICE}"
echo "Traffic multiplier: ${SURGE_MULTIPLIER}x"
echo "Duration: ${DURATION}"

# Get baseline metrics
echo "📊 Collecting baseline metrics..."
BASELINE_FILE="baseline-${SERVICE}-$(date +%Y%m%d-%H%M%S).json"
./scripts/collect-baseline-metrics.sh \
  --service ${SERVICE} \
  --duration 1h \
  --output "${BASELINE_FILE}"

BASELINE_RPS=$(jq -r '.metrics.requests_per_second' "${BASELINE_FILE}")
BASELINE_POD_COUNT=$(kubectl get deployment ${SERVICE} -n ${NAMESPACE} -o jsonpath='{.status.replicas}')
TARGET_RPS=$(echo "${BASELINE_RPS} * ${SURGE_MULTIPLIER}" | bc)

echo "Baseline metrics:"
echo "  Requests per second: ${BASELINE_RPS}"
echo "  Pod count: ${BASELINE_POD_COUNT}"
echo "Target traffic: ${TARGET_RPS} req/sec"

# Start traffic generation
echo "🚀 Starting traffic generation..."

# Calculate number of parallel generators needed (assuming ~10 req/sec per worker)
PARALLEL_REQUESTS=$(echo "${TARGET_RPS} / 10" | bc)

# Start detached traffic generator pods in the target namespace
for i in $(seq 1 ${PARALLEL_REQUESTS}); do
  kubectl run traffic-generator-${i} \
    --namespace ${NAMESPACE} \
    --image=curlimages/curl:latest \
    --restart=Never \
    --command -- /bin/sh -c "while true; do curl -s http://${SERVICE}.${NAMESPACE}.svc.cluster.local/api/ingest -X POST -d '{}' -H 'Content-Type: application/json' > /dev/null 2>&1; sleep 0.1; done"
done

SURGE_START=$(date +%s)

# Monitor service behavior
echo "👀 Monitoring service behavior during traffic surge..."
MAX_WAIT=600  # 10 minutes
ELAPSED=0

while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
  # Get current RPS
  CURRENT_RPS=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
  CURRENT_RPS_ROUNDED=$(echo "${CURRENT_RPS}" | cut -d. -f1)

  # Get pod count
  CURRENT_POD_COUNT=$(kubectl get deployment ${SERVICE} -n ${NAMESPACE} -o jsonpath='{.status.replicas}')

  # Get HPA status
  HPA_STATUS=$(kubectl get hpa ${SERVICE} -n ${NAMESPACE} -o jsonpath='{.status.conditions[?(@.type=="AbleToScale")].status}')

  # Get request success rate
  SUCCESS_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SERVICE}\",status!~\"5..\"\}[1m]\)/rate\(http_requests_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
  SUCCESS_RATE_PERCENT=$(echo "${SUCCESS_RATE} * 100" | bc)

  # Get rate limiting metrics
  RATE_LIMITED=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SERVICE}\",status=\"429\"\}[1m]\) | jq -r '.data.result[0].value[1]')

  # Get latency
  P95_LATENCY=$(curl -s http://prometheus:9090/api/v1/query?query=histogram_quantile\(0.95,rate\(http_request_duration_seconds_bucket\{service=\"${SERVICE}\"\}[1m]\)\) | jq -r '.data.result[0].value[1]')
  P95_LATENCY_MS=$(echo "${P95_LATENCY} * 1000" | bc)

  echo "Metrics at ${ELAPSED}s:"
  echo "  Current RPS: ${CURRENT_RPS_ROUNDED} (target: ${TARGET_RPS})"
  echo "  Pod count: ${CURRENT_POD_COUNT} (baseline: ${BASELINE_POD_COUNT})"
  echo "  HPA status: ${HPA_STATUS}"
  echo "  Success rate: ${SUCCESS_RATE_PERCENT}%"
  echo "  Rate limited: ${RATE_LIMITED}/sec"
  echo "  P95 latency: ${P95_LATENCY_MS}ms"

  # Validate autoscaling
  if (( $(echo "${CURRENT_RPS} > ${BASELINE_RPS} * 2" | bc -l) )); then
    if [ "${CURRENT_POD_COUNT}" -le "${BASELINE_POD_COUNT}" ]; then
      echo "⚠️  Traffic increased but autoscaling not triggered"
    fi
  fi

  # Validate rate limiting
  if (( $(echo "${RATE_LIMITED} > 0" | bc -l) )); then
    echo "✅ Rate limiting active: ${RATE_LIMITED}/sec requests rate limited"
  fi

  # Validate service availability
  if (( $(echo "${SUCCESS_RATE_PERCENT} < 95" | bc -l) )); then
    echo "⚠️  Service availability low: ${SUCCESS_RATE_PERCENT}%"
  fi

  sleep 30
  ELAPSED=$((ELAPSED + 30))
done

# Stop traffic generation
echo "🛑 Stopping traffic generation..."
for i in $(seq 1 ${PARALLEL_REQUESTS}); do
  kubectl delete pod traffic-generator-${i} -n ${NAMESPACE} --ignore-not-found=true
done

RECOVERY_START=$(date +%s)

# Wait for traffic to normalize
echo "⏳ Waiting for traffic to normalize..."
sleep 120

# Wait for autoscaling to scale down
echo "⏳ Waiting for autoscaling to scale down..."
MAX_SCALE_DOWN_WAIT=600  # 10 minutes
ELAPSED=0

while [ ${ELAPSED} -lt ${MAX_SCALE_DOWN_WAIT} ]; do
  CURRENT_POD_COUNT=$(kubectl get deployment ${SERVICE} -n ${NAMESPACE} -o jsonpath='{.status.replicas}')

  if [ "${CURRENT_POD_COUNT}" -le "${BASELINE_POD_COUNT}" ]; then
    SCALE_DOWN_TIME=$(date +%s)
    SCALE_DOWN_DURATION=$((SCALE_DOWN_TIME - RECOVERY_START))
    echo "✅ Autoscaling scaled down in ${SCALE_DOWN_DURATION} seconds"
    break
  fi

  sleep 30
  ELAPSED=$((ELAPSED + 30))
  echo "Waiting for scale down... (${ELAPSED}s/${MAX_SCALE_DOWN_WAIT}s) - Pod count: ${CURRENT_POD_COUNT}"
done

# Verify recovery
FINAL_RPS=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
FINAL_POD_COUNT=$(kubectl get deployment ${SERVICE} -n ${NAMESPACE} -o jsonpath='{.status.replicas}')
FINAL_SUCCESS_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SERVICE}\",status!~\"5..\"\}[1m]\)/rate\(http_requests_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
FINAL_SUCCESS_RATE_PERCENT=$(echo "${FINAL_SUCCESS_RATE} * 100" | bc)

if (( $(echo "${FINAL_RPS} <= ${BASELINE_RPS} * 1.1" | bc -l) )); then
  echo "✅ Traffic normalized: ${FINAL_RPS} req/sec (baseline: ${BASELINE_RPS} req/sec)"

  if [ "${FINAL_POD_COUNT}" -le "${BASELINE_POD_COUNT}" ]; then
    echo "✅ Pod count normalized: ${FINAL_POD_COUNT} (baseline: ${BASELINE_POD_COUNT})"

    if (( $(echo "${FINAL_SUCCESS_RATE_PERCENT} >= 99" | bc -l) )); then
      echo "✅ Success rate recovered: ${FINAL_SUCCESS_RATE_PERCENT}%"
      exit 0
    else
      echo "⚠️  Success rate not fully recovered: ${FINAL_SUCCESS_RATE_PERCENT}%"
      exit 1
    fi
  else
    echo "⚠️  Pod count not fully normalized: ${FINAL_POD_COUNT} (baseline: ${BASELINE_POD_COUNT})"
    exit 1
  fi
else
  echo "⚠️  Traffic not fully normalized: ${FINAL_RPS} req/sec (baseline: ${BASELINE_RPS} req/sec)"
  exit 1
fi
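
The ad-hoc curl loops in the script above run anywhere but give only coarse control over the request rate. Where a load-testing image is permitted in the cluster, a tool such as hey produces a more predictable surge. A sketch — the container image is a placeholder (any image that ships hey works) and the rate math assumes 100 concurrent workers:

#!/bin/bash
# scripts/generate-surge-with-hey.sh (illustrative alternative to the curl-loop generators)

SERVICE="${1:-atp-ingestion-api}"
NAMESPACE="${2:-atp-ingest-ns}"
TARGET_RPS="${3:-1000}"
DURATION="${4:-10m}"

# -z duration, -c concurrent workers, -q per-worker request rate, -m method, -d body
kubectl run surge-generator \
  --namespace ${NAMESPACE} \
  --image=williamyeh/hey:latest \
  --restart=Never \
  --command -- hey \
    -z ${DURATION} \
    -c 100 \
    -q $(( TARGET_RPS / 100 )) \
    -m POST \
    -H "Content-Type: application/json" \
    -d '{}' \
    http://${SERVICE}.${NAMESPACE}.svc.cluster.local/api/ingest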

Expected Behavior

Traffic Surge Phase (0-10 minutes):

  • Traffic increase: Traffic increases to 10x normal load
  • Autoscaling activation: HPA scales up pods to handle increased load
  • Rate limiting activation: Rate limiting protects system from overload
  • Graceful degradation: Service continues operating with increased latency
  • Load distribution: Load distributed across scaled pods

Recovery Phase (10-20 minutes):

  • Traffic normalization: Traffic returns to normal levels
  • Autoscaling scale down: HPA scales down pods as load decreases
  • Rate limiting deactivation: Rate limiting normalizes
  • Normal operation: Service returns to normal operation

Expected Metrics

| Metric | Baseline | During Surge | Expected Range | Recovery Target |
|---|---|---|---|---|
| Request Rate | 100 req/sec | 1,000 req/sec | 10x increase | 100 req/sec |
| Pod Count | 3 | 15-30 | 5-10x increase | 3 |
| Success Rate | 99.95% | >95% | >95% | 99.95% |
| P95 Latency | 250ms | <1,000ms | <1,000ms | 250ms |
| Rate Limited Requests | 0/sec | >0/sec | >0/sec | 0/sec |

Validation Criteria

Success Criteria:

  • ✅ Autoscaling activates and scales up pods
  • ✅ Rate limiting protects system from overload
  • ✅ Service availability >95% during surge
  • ✅ Latency increase <4x baseline
  • ✅ Autoscaling scales down after traffic normalizes

HPA Configuration

Horizontal Pod Autoscaler Configuration:

# kubernetes/autoscaling/ingestion-api-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: atp-ingestion-api-hpa
  namespace: atp-ingest-ns
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: atp-ingestion-api
  minReplicas: 3
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  - type: Pods
    pods:
      metric:
        name: requests_per_second
      target:
        type: AverageValue
        averageValue: "100"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30
      - type: Pods
        value: 5
        periodSeconds: 30
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
      selectPolicy: Min
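
During the surge, HPA behavior is easiest to follow directly from the Kubernetes API in a second terminal; watching the HPA and its events shows which scale-up policy (Percent or Pods) drove each step:

# Watch desired vs. current replicas while the surge experiment runs
kubectl get hpa atp-ingestion-api-hpa -n atp-ingest-ns -w

# Inspect the scaling decisions and the metrics that triggered them
kubectl describe hpa atp-ingestion-api-hpa -n atp-ingest-ns

# Scaling events only
kubectl get events -n atp-ingest-ns --field-selector involvedObject.name=atp-ingestion-api-hpa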

Rate Limiting Configuration:

# kubernetes/configmaps/rate-limiting-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: rate-limiting-config
  namespace: atp-ingest-ns
data:
  RateLimitingConfig.json: |
    {
      "DefaultRateLimit": {
        "RequestsPerSecond": 100,
        "BurstSize": 150
      },
      "PerClientRateLimit": {
        "Enabled": true,
        "RequestsPerSecond": 50,
        "BurstSize": 75
      },
      "RateLimitStrategy": "TokenBucket",
      "RateLimitResponse": {
        "StatusCode": 429,
        "Message": "Rate limit exceeded"
      }
    }
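
With these limits, the expected shape of the surge can be estimated before running the experiment: each pod admits roughly RequestsPerSecond (plus the burst allowance), total accepted throughput is approximately that limit times the current pod count, and anything above it should surface as 429 responses until autoscaling adds capacity. A small sketch of that arithmetic (script path illustrative):

#!/bin/bash
# scripts/estimate-rate-limit-headroom.sh (illustrative sketch based on RateLimitingConfig.json)

SERVICE="${1:-atp-ingestion-api}"
NAMESPACE="${2:-atp-ingest-ns}"
SURGE_RPS="${3:-1000}"
LIMIT_RPS_PER_POD=100   # DefaultRateLimit.RequestsPerSecond

POD_COUNT=$(kubectl get deployment ${SERVICE} -n ${NAMESPACE} -o jsonpath='{.status.replicas}')
ACCEPTED_RPS=$(( POD_COUNT * LIMIT_RPS_PER_POD ))
REJECTED_RPS=$(( SURGE_RPS > ACCEPTED_RPS ? SURGE_RPS - ACCEPTED_RPS : 0 ))

echo "Pods: ${POD_COUNT}"
echo "Accepted throughput (approx): ${ACCEPTED_RPS} req/sec"
echo "Expected 429s (approx): ${REJECTED_RPS} req/sec until autoscaling adds capacity"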

Summary: Application Behavior Chaos

  • Latency Injection: Validates timeout configurations, retry mechanisms, and graceful degradation during network latency increases; expects latency increases proportionally, no timeout errors, request success rate >99%, and automatic recovery
  • Error Injection: Validates error handling, retry mechanisms, and circuit breaker protection during HTTP 500 errors, database timeouts, and validation failures; expects error rate increases proportionally, retry mechanisms activate, circuit breaker protects service, and automatic recovery
  • Slow Consumer Simulation: Validates backpressure mechanisms, queue depth limits, and message throttling during slow message processing; expects queue depth increases within limits, backpressure activates, no message loss, and automatic recovery
  • Traffic Surge: Validates autoscaling, rate limiting, and graceful degradation during 10x traffic increases; expects autoscaling activates, rate limiting protects system, service availability >95%, and automatic scale down when traffic normalizes
  • Monitoring and Validation: Comprehensive scripts for monitoring latency injection, error injection, slow consumer simulation, traffic surge, autoscaling, rate limiting, and recovery behavior

Database Chaos

Purpose: Define comprehensive chaos experiments for database failures in ATP, validating failover mechanisms, slowdown handling, and connection pool management to ensure ATP services remain available and functional during database-level failures and performance degradation.


Database Failover

Database failover experiments validate that ATP services handle Azure SQL failover gracefully through connection retry, automatic failover detection, and transaction integrity, ensuring no data loss and minimal downtime.

Hypothesis

"When Azure SQL primary database fails over to a replica, services will automatically reconnect to the new primary, connection retry mechanisms will handle transient failures, no transactions will be lost, and failover time will be within acceptable limits (<30 seconds)."

Experiment Configuration

Azure SQL Failover Simulation:

#!/bin/bash
# scripts/execute-database-failover-experiment.sh

PRIMARY_SERVER="${1:-atp-sql-primary.database.windows.net}"
PRIMARY_DB="${2:-atp-primary}"
SERVICE="${3:-atp-ingestion-api}"
NAMESPACE="${4:-atp-ingest-ns}"

echo "🧪 Starting database failover experiment"
echo "Primary server: ${PRIMARY_SERVER}"
echo "Primary database: ${PRIMARY_DB}"
echo "Service: ${SERVICE}"

# Get baseline metrics
echo "📊 Collecting baseline metrics..."
BASELINE_FILE="baseline-${SERVICE}-$(date +%Y%m%d-%H%M%S).json"
./scripts/collect-baseline-metrics.sh \
  --service ${SERVICE} \
  --duration 1h \
  --output "${BASELINE_FILE}"

BASELINE_SUCCESS_RATE=$(jq -r '.metrics.success_rate_percent' "${BASELINE_FILE}")
BASELINE_ACTIVE_CONNECTIONS=$(kubectl exec -n ${NAMESPACE} deployment/${SERVICE} -- \
  curl -s http://localhost:8080/metrics | grep 'database_connections_active' | awk '{print $2}')

echo "Baseline metrics:"
echo "  Success rate: ${BASELINE_SUCCESS_RATE}%"
echo "  Active connections: ${BASELINE_ACTIVE_CONNECTIONS}"

# Get current primary replica
echo "📊 Getting current primary replica..."
CURRENT_PRIMARY=$(az sql db replica list \
  --resource-group atp-production-rg \
  --server ${PRIMARY_SERVER} \
  --name ${PRIMARY_DB} \
  --query "[?role == 'Primary'].name" -o tsv)

echo "Current primary: ${CURRENT_PRIMARY}"

# Get available replicas
REPLICAS=$(az sql db replica list \
  --resource-group atp-production-rg \
  --server ${PRIMARY_SERVER} \
  --name ${PRIMARY_DB} \
  --query "[?role == 'Secondary'].name" -o tsv)

echo "Available replicas: ${REPLICAS}"
TARGET_REPLICA=$(echo "${REPLICAS}" | head -n 1)

if [ -z "${TARGET_REPLICA}" ]; then
  echo "❌ No secondary replicas available"
  exit 1
fi

echo "Target replica for failover: ${TARGET_REPLICA}"

# Initiate failover
echo "🔧 Initiating database failover..."
FAILOVER_START=$(date +%s)

az sql db replica set-primary \
  --resource-group atp-production-rg \
  --server ${PRIMARY_SERVER} \
  --name ${TARGET_REPLICA} \
  --allow-data-loss false

FAILOVER_INITIATED=$(date +%s)
FAILOVER_INITIATION_TIME=$((FAILOVER_INITIATED - FAILOVER_START))

echo "Failover initiated in ${FAILOVER_INITIATION_TIME} seconds"

# Monitor failover progress
echo "👀 Monitoring failover progress..."
MAX_WAIT=300  # 5 minutes
ELAPSED=0
FAILOVER_COMPLETE=false

while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
  NEW_PRIMARY=$(az sql db replica list \
    --resource-group atp-production-rg \
    --server ${PRIMARY_SERVER} \
    --name ${PRIMARY_DB} \
    --query "[?role == 'Primary'].name" -o tsv)

  if [ "${NEW_PRIMARY}" = "${TARGET_REPLICA}" ]; then
    FAILOVER_COMPLETE=true
    FAILOVER_END=$(date +%s)
    FAILOVER_DURATION=$((FAILOVER_END - FAILOVER_START))
    echo "✅ Failover complete in ${FAILOVER_DURATION} seconds"
    break
  fi

  sleep 5
  ELAPSED=$((ELAPSED + 5))
  echo "Waiting for failover... (${ELAPSED}s/${MAX_WAIT}s)"
done

if [ "${FAILOVER_COMPLETE}" = false ]; then
  echo "❌ Failover not completed within ${MAX_WAIT} seconds"
  exit 1
fi

# Monitor service behavior during failover
echo "👀 Monitoring service behavior during failover..."
MAX_MONITOR_WAIT=600  # 10 minutes
ELAPSED=0

while [ ${ELAPSED} -lt ${MAX_MONITOR_WAIT} ]; do
  # Get connection retry attempts
  RETRY_ATTEMPTS=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(database_connection_retries\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')

  # Get active connections
  ACTIVE_CONNECTIONS=$(curl -s http://prometheus:9090/api/v1/query?query=database_connections_active\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')

  # Get request success rate
  SUCCESS_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SERVICE}\",status!~\"5..\"\}[1m]\)/rate\(http_requests_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
  SUCCESS_RATE_PERCENT=$(echo "${SUCCESS_RATE} * 100" | bc)

  # Get cumulative transaction count (the last observed value is compared after recovery)
  TRANSACTION_COUNT_BEFORE=$(curl -s http://prometheus:9090/api/v1/query?query=database_transactions_total\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')

  echo "Metrics at ${ELAPSED}s:"
  echo "  Connection retries: ${RETRY_ATTEMPTS}/sec"
  echo "  Active connections: ${ACTIVE_CONNECTIONS}"
  echo "  Success rate: ${SUCCESS_RATE_PERCENT}%"
  echo "  Transaction count: ${TRANSACTION_COUNT_BEFORE}"

  # Validate connection retry
  if (( $(echo "${RETRY_ATTEMPTS} > 0" | bc -l) )); then
    echo "✅ Connection retry active: ${RETRY_ATTEMPTS}/sec"
  fi

  # Validate service availability
  if (( $(echo "${SUCCESS_RATE_PERCENT} >= 95" | bc -l) )); then
    echo "✅ Service availability maintained: ${SUCCESS_RATE_PERCENT}%"
  fi

  sleep 30
  ELAPSED=$((ELAPSED + 30))
done

# Validate transaction integrity (the cumulative counter must not decrease across the failover)
echo "🔍 Validating transaction integrity..."
TRANSACTION_COUNT_AFTER=$(curl -s http://prometheus:9090/api/v1/query?query=database_transactions_total\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')

if (( $(echo "${TRANSACTION_COUNT_AFTER} >= ${TRANSACTION_COUNT_BEFORE}" | bc -l) )); then
  echo "✅ No transaction loss detected"
else
  echo "⚠️  Potential transaction loss: counter dropped from ${TRANSACTION_COUNT_BEFORE} to ${TRANSACTION_COUNT_AFTER}"
fi

# Validate failover time
FAILOVER_TIME_TARGET=30  # 30 seconds
if [ "${FAILOVER_DURATION}" -le "${FAILOVER_TIME_TARGET}" ]; then
  echo "✅ Failover time within target: ${FAILOVER_DURATION}s <= ${FAILOVER_TIME_TARGET}s"
  exit 0
else
  echo "⚠️  Failover time exceeds target: ${FAILOVER_DURATION}s > ${FAILOVER_TIME_TARGET}s"
  exit 1
fi

Expected Behavior

Failover Initiation Phase (0-5 seconds):

  • Failover command: Azure SQL failover command executed
  • Primary role transfer: Primary role transferred to target replica
  • Connection termination: Existing connections to old primary terminated

Failover Completion Phase (5-30 seconds):

  • Replica promotion: Secondary replica promoted to primary
  • DNS/endpoint update: DNS/endpoint updated to point to new primary
  • Connection retry: Services retry connections to new primary
  • Connection establishment: New connections established to new primary

Recovery Phase (30-60 seconds):

  • Full connectivity: All services connected to new primary
  • Transaction integrity: All transactions committed or rolled back safely
  • Normal operation: Services return to normal operation

Expected Metrics

Metric                  Baseline    During Failover    Expected Range    Recovery Target
Failover Time           N/A         <30s               <30s              N/A
Connection Retry Rate   0/sec       >0/sec             >0/sec            0/sec
Request Success Rate    99.95%      >95%               >95%              99.95%
Active Connections      50          Variable           Variable          50
Transaction Loss        None        None               None              None

Validation Criteria

Success Criteria:

  • ✅ Failover completes within 30 seconds
  • ✅ Connection retry mechanisms activate
  • ✅ Request success rate >95% during failover
  • ✅ No transaction loss
  • ✅ All connections re-established after failover
  • ✅ Service recovers automatically

Connection Retry Configuration

Database Connection Retry Configuration:

# kubernetes/configmaps/database-connection-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: database-connection-config
  namespace: atp-ingest-ns
data:
  ConnectionRetryConfig.json: |
    {
      "MaxRetryAttempts": 10,
      "RetryDelay": 1000,
      "ExponentialBackoff": true,
      "MaxBackoff": 30000,
      "RetryableErrors": [
        "Connection timeout",
        "Connection closed",
        "Server unavailable",
        "Network error"
      ],
      "FailoverDetection": {
        "Enabled": true,
        "HealthCheckInterval": 5000,
        "HealthCheckTimeout": 3000
      }
    }
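
The retry settings above imply a bounded exponential backoff schedule. The following is a minimal sketch (not part of the experiment catalog) that prints the delay sequence, assuming RetryDelay and MaxBackoff are milliseconds and the delay doubles on each attempt:

#!/bin/bash
# Illustrative sketch: print the backoff schedule implied by ConnectionRetryConfig.json
MAX_RETRY_ATTEMPTS=10   # MaxRetryAttempts
RETRY_DELAY_MS=1000     # RetryDelay
MAX_BACKOFF_MS=30000    # MaxBackoff

DELAY=${RETRY_DELAY_MS}
for ATTEMPT in $(seq 1 ${MAX_RETRY_ATTEMPTS}); do
  echo "Attempt ${ATTEMPT}: wait ${DELAY}ms before retrying"
  DELAY=$((DELAY * 2))
  if [ ${DELAY} -gt ${MAX_BACKOFF_MS} ]; then
    DELAY=${MAX_BACKOFF_MS}
  fi
done

With these values the waits are 1s, 2s, 4s, 8s, 16s, and then 30s for each remaining attempt (roughly three minutes of retrying in total), which comfortably covers the 30-second failover target.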

Database Slowdown

Database slowdown experiments validate that ATP services handle database performance degradation gracefully through timeout handling, circuit breaker activation, and fallback strategies.

Hypothesis

"When database queries become slow (latency >5 seconds), timeout configurations will prevent hanging requests, circuit breakers will activate to protect the service, fallback strategies will maintain service availability, and services will recover when database performance returns to normal."

Experiment Configuration

Database Query Latency Injection:

# chaos-experiments/database-slowdown.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: database-slowdown
  namespace: chaos-testing
  labels:
    category: application
    service: database
    severity: medium
    frequency: monthly
  annotations:
    chaos.atp.connectsoft.io/hypothesis: |
      When database queries become slow, timeout configurations will prevent hanging requests, 
      circuit breakers will activate to protect the service, fallback strategies will maintain 
      service availability, and services will recover when database performance returns to normal.
spec:
  action: delay
  mode: fixed-percent
  value: "50"  # 50% of queries
  selector:
    namespaces:
      - atp-ingest-ns
    labelSelectors:
      app: atp-ingest-api
  direction: both
  target:
    mode: all
    selector:
      address: "*.database.windows.net"
  delay:
    latency: "6s"  # Exceeds 5s timeout
    correlation: "100"
    jitter: "500ms"
  duration: "10m"

Database Slowdown Simulation Script:

#!/bin/bash
# scripts/execute-database-slowdown-experiment.sh

SERVICE="${1:-atp-ingestion-api}"
NAMESPACE="${2:-atp-ingest-ns}"
QUERY_LATENCY="${3:-6s}"  # Query latency injection
PERCENTAGE="${4:-50}"  # Percentage of queries affected

echo "🧪 Starting database slowdown experiment"
echo "Service: ${SERVICE}"
echo "Query latency: ${QUERY_LATENCY}"
echo "Percentage: ${PERCENTAGE}%"

# Get baseline metrics
echo "📊 Collecting baseline metrics..."
# Capture the baseline snapshot under a single timestamped filename so the reads below use the same file
BASELINE_FILE="baseline-${SERVICE}-$(date +%Y%m%d-%H%M%S).json"
./scripts/collect-baseline-metrics.sh \
  --service ${SERVICE} \
  --duration 1h \
  --output "${BASELINE_FILE}"

BASELINE_QUERY_LATENCY=$(jq -r '.metrics.p95_query_latency_ms' "${BASELINE_FILE}")
BASELINE_SUCCESS_RATE=$(jq -r '.metrics.success_rate_percent' "${BASELINE_FILE}")

echo "Baseline metrics:"
echo "  P95 query latency: ${BASELINE_QUERY_LATENCY}ms"
echo "  Success rate: ${BASELINE_SUCCESS_RATE}%"

# Apply database slowdown
echo "🔧 Applying database slowdown..."
kubectl apply -f - <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: database-slowdown-${SERVICE}
  namespace: chaos-testing
spec:
  action: delay
  mode: fixed-percent
  value: "${PERCENTAGE}"
  selector:
    namespaces:
      - ${NAMESPACE}
    labelSelectors:
      app: ${SERVICE}
  direction: both
  target:
    mode: all
    selector:
      address: "*.database.windows.net"
  delay:
    latency: "${QUERY_LATENCY}"
    correlation: "100"
    jitter: "500ms"
  duration: "10m"
EOF

SLOWDOWN_START=$(date +%s)

# Monitor service behavior
echo "👀 Monitoring service behavior during database slowdown..."
MAX_WAIT=600  # 10 minutes
ELAPSED=0

while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
  # Get query latency
  QUERY_LATENCY=$(curl -s http://prometheus:9090/api/v1/query?query=histogram_quantile\(0.95,rate\(database_query_duration_seconds_bucket\{service=\"${SERVICE}\"\}[1m]\)\) | jq -r '.data.result[0].value[1]')
  QUERY_LATENCY_MS=$(echo "${QUERY_LATENCY} * 1000" | bc)

  # Get timeout errors
  TIMEOUT_ERRORS=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(database_timeout_errors\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')

  # Get circuit breaker state (assumes the exporter surfaces a readable state value such as
  # "Closed"/"Open"/"HalfOpen"; adjust the string comparisons below if the gauge is exported numerically)
  CB_STATE=$(curl -s http://prometheus:9090/api/v1/query?query=circuit_breaker_state\{service=\"${SERVICE}\",component=\"database\"\} | jq -r '.data.result[0].value[1]')

  # Get request success rate
  SUCCESS_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SERVICE}\",status!~\"5..\"\}[1m]\)/rate\(http_requests_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
  SUCCESS_RATE_PERCENT=$(echo "${SUCCESS_RATE} * 100" | bc)

  # Get fallback usage
  FALLBACK_USAGE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(database_fallback_usage\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')

  echo "Metrics at ${ELAPSED}s:"
  echo "  P95 query latency: ${QUERY_LATENCY_MS}ms (baseline: ${BASELINE_QUERY_LATENCY}ms)"
  echo "  Timeout errors: ${TIMEOUT_ERRORS}/sec"
  echo "  Circuit breaker state: ${CB_STATE}"
  echo "  Success rate: ${SUCCESS_RATE_PERCENT}%"
  echo "  Fallback usage: ${FALLBACK_USAGE}/sec"

  # Validate timeout handling
  if (( $(echo "${QUERY_LATENCY_MS} > 5000" | bc -l) )); then
    if (( $(echo "${TIMEOUT_ERRORS} > 0" | bc -l) )); then
      echo "✅ Timeout handling working: ${TIMEOUT_ERRORS}/sec timeout errors"
    else
      echo "⚠️  Query latency high but no timeout errors detected"
    fi
  fi

  # Validate circuit breaker activation
  if (( $(echo "${QUERY_LATENCY_MS} > 5000" | bc -l) )); then
    if [ "${CB_STATE}" = "Open" ]; then
      echo "✅ Circuit breaker activated: ${CB_STATE}"
    elif [ "${CB_STATE}" = "HalfOpen" ]; then
      echo "✅ Circuit breaker testing recovery: ${CB_STATE}"
    fi
  fi

  # Validate fallback strategies
  if (( $(echo "${FALLBACK_USAGE} > 0" | bc -l) )); then
    echo "✅ Fallback strategies active: ${FALLBACK_USAGE}/sec"
  fi

  sleep 30
  ELAPSED=$((ELAPSED + 30))
done

# Remove database slowdown
echo "🔧 Removing database slowdown..."
kubectl delete networkchaos database-slowdown-${SERVICE} -n chaos-testing

RECOVERY_START=$(date +%s)

# Wait for recovery
echo "⏳ Waiting for database performance to return to normal..."
sleep 120

# Verify recovery
FINAL_QUERY_LATENCY=$(curl -s http://prometheus:9090/api/v1/query?query=histogram_quantile\(0.95,rate\(database_query_duration_seconds_bucket\{service=\"${SERVICE}\"\}[1m]\)\) | jq -r '.data.result[0].value[1]')
FINAL_QUERY_LATENCY_MS=$(echo "${FINAL_QUERY_LATENCY} * 1000" | bc)
FINAL_CB_STATE=$(curl -s http://prometheus:9090/api/v1/query?query=circuit_breaker_state\{service=\"${SERVICE}\",component=\"database\"\} | jq -r '.data.result[0].value[1]')

if (( $(echo "${FINAL_QUERY_LATENCY_MS} <= ${BASELINE_QUERY_LATENCY} * 1.1" | bc -l) )); then
  echo "✅ Query latency recovered: ${FINAL_QUERY_LATENCY_MS}ms (baseline: ${BASELINE_QUERY_LATENCY}ms)"

  if [ "${FINAL_CB_STATE}" = "Closed" ]; then
    echo "✅ Circuit breaker closed (service recovered)"
    exit 0
  else
    echo "⚠️  Circuit breaker still open: ${FINAL_CB_STATE}"
    exit 1
  fi
else
  echo "⚠️  Query latency not fully recovered: ${FINAL_QUERY_LATENCY_MS}ms (baseline: ${BASELINE_QUERY_LATENCY}ms)"
  exit 1
fi

Expected Behavior

Slowdown Phase (0-10 minutes):

  • Query latency increase: Database queries become slow (>5 seconds)
  • Timeout handling: Timeout configurations prevent hanging requests
  • Circuit breaker activation: Circuit breaker opens if latency exceeds threshold
  • Fallback strategies: Fallback strategies activate (cached data, read replicas)
  • Graceful degradation: Service continues operating with reduced functionality

Recovery Phase (10-15 minutes):

  • Latency normalization: Database query latency returns to normal
  • Circuit breaker recovery: Circuit breaker transitions to half-open, then closed
  • Fallback deactivation: Fallback strategies deactivate
  • Normal operation: Service returns to normal operation

Expected Metrics

Metric                   Baseline    During Slowdown    Expected Range    Recovery Target
P95 Query Latency        250ms       <7,000ms           <7,000ms          250ms
Timeout Errors           0/sec       <10/sec            <10/sec           0/sec
Circuit Breaker State    Closed      Open/HalfOpen      Open/HalfOpen     Closed
Request Success Rate     99.95%      >90%               >90%              99.95%
Fallback Usage           0/sec       >0/sec             >0/sec            0/sec

Validation Criteria

Success Criteria:

  • ✅ Timeout configurations prevent hanging requests
  • ✅ Circuit breaker activates when latency exceeds threshold
  • ✅ Fallback strategies maintain service availability
  • ✅ Request success rate >90%
  • ✅ Service recovers automatically when database performance returns to normal

Query Timeout Configuration

Database Query Timeout Configuration:

# kubernetes/configmaps/database-query-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: database-query-config
  namespace: atp-ingest-ns
data:
  QueryTimeoutConfig.json: |
    {
      "DefaultTimeout": 5000,
      "ReadTimeout": 5000,
      "WriteTimeout": 10000,
      "ConnectionTimeout": 3000,
      "CommandTimeout": 5000,
      "CircuitBreaker": {
        "FailureThreshold": 5,
        "TimeoutSeconds": 30,
        "HalfOpenRetries": 3,
        "SlowQueryThreshold": 5000
      },
      "FallbackStrategies": {
        "UseCache": true,
        "UseReadReplica": true,
        "DegradedMode": true
      }
    }
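
During recovery, the circuit breaker configured above is expected to transition from Open to HalfOpen and back to Closed. The following is a minimal sketch for logging that transition; it reuses the circuit_breaker_state metric queried by the experiment script, and the readable state values are the same assumption that script makes (adjust if your exporter publishes a numeric gauge). With FailureThreshold 5 and TimeoutSeconds 30, the breaker should re-test the database roughly every 30 seconds once open, assuming TimeoutSeconds is the open-state hold time.

#!/bin/bash
# Illustrative sketch: log circuit breaker state changes for a service's database dependency
SERVICE="${1:-atp-ingestion-api}"
LAST_STATE=""

for i in $(seq 1 30); do
  STATE=$(curl -s http://prometheus:9090/api/v1/query?query=circuit_breaker_state\{service=\"${SERVICE}\",component=\"database\"\} | jq -r '.data.result[0].value[1]')
  if [ "${STATE}" != "${LAST_STATE}" ]; then
    echo "$(date +%T) circuit breaker state changed: ${LAST_STATE:-<unknown>} -> ${STATE}"
    LAST_STATE="${STATE}"
  fi
  sleep 10
done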

Database Connection Pool Exhaustion

Database connection pool exhaustion experiments validate that ATP services handle connection pool exhaustion gracefully through connection leak detection, pool size limits, and queuing behavior.

Hypothesis

"When database connection pool is exhausted, connection leak detection will identify leaks, pool size limits will prevent resource exhaustion, request queuing will handle overload, and services will recover when connections are released."

Experiment Configuration

Connection Pool Exhaustion Simulation:

#!/bin/bash
# scripts/execute-connection-pool-exhaustion-experiment.sh

SERVICE="${1:-atp-ingestion-api}"
NAMESPACE="${2:-atp-ingest-ns}"
MAX_POOL_SIZE="${3:-100}"  # Maximum pool size
LEAK_RATE="${4:-5}"  # Connections leaked per second

echo "🧪 Starting connection pool exhaustion experiment"
echo "Service: ${SERVICE}"
echo "Max pool size: ${MAX_POOL_SIZE}"
echo "Leak rate: ${LEAK_RATE} connections/sec"

# Get baseline metrics
echo "📊 Collecting baseline metrics..."
# Capture the baseline snapshot under a single timestamped filename so the reads below use the same file
BASELINE_FILE="baseline-${SERVICE}-$(date +%Y%m%d-%H%M%S).json"
./scripts/collect-baseline-metrics.sh \
  --service ${SERVICE} \
  --duration 1h \
  --output "${BASELINE_FILE}"

BASELINE_POOL_SIZE=$(jq -r '.metrics.connection_pool_size' "${BASELINE_FILE}")
BASELINE_ACTIVE_CONNECTIONS=$(jq -r '.metrics.active_connections' "${BASELINE_FILE}")
BASELINE_IDLE_CONNECTIONS=$(jq -r '.metrics.idle_connections' "${BASELINE_FILE}")

echo "Baseline metrics:"
echo "  Pool size: ${BASELINE_POOL_SIZE}"
echo "  Active connections: ${BASELINE_ACTIVE_CONNECTIONS}"
echo "  Idle connections: ${BASELINE_IDLE_CONNECTIONS}"

# Simulate connection leaks by creating long-running connections
echo "🔧 Simulating connection pool exhaustion..."
EXHAUSTION_START=$(date +%s)

# Start connection leak simulation: launch pods that each hold long-running database connections open.
# Note: DB_HOST, DB_USER, DB_NAME, and DB_PASSWORD are assumed to be available to the leak pods
# (for example via an injected secret); adjust this to match the target environment.
echo "Starting connection leak simulation..."
for i in $(seq 1 ${LEAK_RATE}); do
  kubectl run connection-leak-${i} \
    --image=postgres:15 \
    --restart=Never \
    -n ${NAMESPACE} \
    -- /bin/sh -c "while true; do PGPASSWORD=\${DB_PASSWORD} psql -h \${DB_HOST} -U \${DB_USER} -d \${DB_NAME} -c 'SELECT pg_sleep(300);' > /dev/null 2>&1; done"
done

# Monitor connection pool
echo "👀 Monitoring connection pool behavior..."
MAX_WAIT=600  # 10 minutes
ELAPSED=0

while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
  # Get connection pool metrics
  POOL_SIZE=$(curl -s http://prometheus:9090/api/v1/query?query=database_connection_pool_size\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
  ACTIVE_CONNECTIONS=$(curl -s http://prometheus:9090/api/v1/query?query=database_connection_pool_active\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
  IDLE_CONNECTIONS=$(curl -s http://prometheus:9090/api/v1/query?query=database_connection_pool_idle\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
  WAITING_REQUESTS=$(curl -s http://prometheus:9090/api/v1/query?query=database_connection_pool_waiting\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')

  # Get connection leak detection
  LEAK_DETECTED=$(curl -s http://prometheus:9090/api/v1/query?query=database_connection_leaks_detected\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')

  # Get request success rate
  SUCCESS_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SERVICE}\",status!~\"5..\"\}[1m]\)/rate\(http_requests_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
  SUCCESS_RATE_PERCENT=$(echo "${SUCCESS_RATE} * 100" | bc)

  # Get connection pool exhaustion errors
  POOL_EXHAUSTION_ERRORS=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(database_connection_pool_exhausted\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')

  echo "Metrics at ${ELAPSED}s:"
  echo "  Pool size: ${POOL_SIZE} (max: ${MAX_POOL_SIZE})"
  echo "  Active connections: ${ACTIVE_CONNECTIONS}"
  echo "  Idle connections: ${IDLE_CONNECTIONS}"
  echo "  Waiting requests: ${WAITING_REQUESTS}"
  echo "  Leaks detected: ${LEAK_DETECTED}"
  echo "  Success rate: ${SUCCESS_RATE_PERCENT}%"
  echo "  Pool exhaustion errors: ${POOL_EXHAUSTION_ERRORS}/sec"

  # Validate pool size limits
  if (( $(echo "${ACTIVE_CONNECTIONS} >= ${MAX_POOL_SIZE}" | bc -l) )); then
    echo "✅ Pool size limit reached: ${ACTIVE_CONNECTIONS} >= ${MAX_POOL_SIZE}"

    # Validate queuing behavior
    if (( $(echo "${WAITING_REQUESTS} > 0" | bc -l) )); then
      echo "✅ Request queuing active: ${WAITING_REQUESTS} requests waiting"
    else
      echo "⚠️  Pool exhausted but no request queuing detected"
    fi
  fi

  # Validate connection leak detection
  if (( $(echo "${LEAK_DETECTED} > 0" | bc -l) )); then
    echo "✅ Connection leak detection active: ${LEAK_DETECTED} leaks detected"
  fi

  # Validate pool exhaustion handling
  if (( $(echo "${POOL_EXHAUSTION_ERRORS} > 0" | bc -l) )); then
    echo "⚠️  Pool exhaustion errors detected: ${POOL_EXHAUSTION_ERRORS}/sec"
  fi

  sleep 30
  ELAPSED=$((ELAPSED + 30))
done

# Stop connection leak simulation
echo "🛑 Stopping connection leak simulation..."
for i in $(seq 1 ${LEAK_RATE}); do
  kubectl delete pod connection-leak-${i} -n ${NAMESPACE} --ignore-not-found=true
done

# Wait for connections to be released
echo "⏳ Waiting for connections to be released..."
sleep 60

RECOVERY_START=$(date +%s)

# Monitor connection pool recovery
echo "👀 Monitoring connection pool recovery..."
MAX_RECOVERY_WAIT=300  # 5 minutes
ELAPSED=0

while [ ${ELAPSED} -lt ${MAX_RECOVERY_WAIT} ]; do
  ACTIVE_CONNECTIONS=$(curl -s http://prometheus:9090/api/v1/query?query=database_connection_pool_active\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
  WAITING_REQUESTS=$(curl -s http://prometheus:9090/api/v1/query?query=database_connection_pool_waiting\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')

  if (( $(echo "${ACTIVE_CONNECTIONS} <= ${BASELINE_ACTIVE_CONNECTIONS} * 1.1" | bc -l) )); then
    if (( $(echo "${WAITING_REQUESTS} == 0" | bc -l) )); then
      RECOVERY_TIME=$(date +%s)
      RECOVERY_DURATION=$((RECOVERY_TIME - RECOVERY_START))
      echo "✅ Connection pool recovered in ${RECOVERY_DURATION} seconds"
      break
    fi
  fi

  sleep 10
  ELAPSED=$((ELAPSED + 10))
  echo "Waiting for recovery... (${ELAPSED}s/${MAX_RECOVERY_WAIT}s) - Active: ${ACTIVE_CONNECTIONS}, Waiting: ${WAITING_REQUESTS}"
done

# Verify recovery
FINAL_ACTIVE_CONNECTIONS=$(curl -s http://prometheus:9090/api/v1/query?query=database_connection_pool_active\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
FINAL_SUCCESS_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SERVICE}\",status!~\"5..\"\}[1m]\)/rate\(http_requests_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
FINAL_SUCCESS_RATE_PERCENT=$(echo "${FINAL_SUCCESS_RATE} * 100" | bc)

if (( $(echo "${FINAL_ACTIVE_CONNECTIONS} <= ${BASELINE_ACTIVE_CONNECTIONS} * 1.1" | bc -l) )); then
  echo "✅ Active connections recovered: ${FINAL_ACTIVE_CONNECTIONS} (baseline: ${BASELINE_ACTIVE_CONNECTIONS})"

  if (( $(echo "${FINAL_SUCCESS_RATE_PERCENT} >= 99" | bc -l) )); then
    echo "✅ Success rate recovered: ${FINAL_SUCCESS_RATE_PERCENT}%"
    exit 0
  else
    echo "⚠️  Success rate not fully recovered: ${FINAL_SUCCESS_RATE_PERCENT}%"
    exit 1
  fi
else
  echo "⚠️  Active connections not fully recovered: ${FINAL_ACTIVE_CONNECTIONS} (baseline: ${BASELINE_ACTIVE_CONNECTIONS})"
  exit 1
fi

Expected Behavior

Exhaustion Phase (0-10 minutes):

  • Connection pool exhaustion: Connection pool reaches maximum size
  • Connection leak detection: Connection leak detection identifies leaks
  • Request queuing: Requests queue when pool is exhausted
  • Pool size limits: Pool size limits prevent resource exhaustion
  • Graceful degradation: Service continues operating with queued requests

Recovery Phase (10-15 minutes):

  • Connection release: Connections released when leaks stop
  • Pool recovery: Connection pool recovers to normal levels
  • Queue draining: Queued requests processed
  • Normal operation: Service returns to normal operation

Expected Metrics

Metric                    Baseline    During Exhaustion    Expected Range    Recovery Target
Active Connections        50          100 (max)            ≤100              50
Idle Connections          50          0                    0-50              50
Waiting Requests          0           >0                   >0                0
Leaks Detected            0           >0                   >0                0
Pool Exhaustion Errors    0/sec       <5/sec               <5/sec            0/sec
Request Success Rate      99.95%      >95%                 >95%              99.95%

Validation Criteria

Success Criteria:

  • ✅ Pool size limits prevent resource exhaustion
  • ✅ Connection leak detection identifies leaks
  • ✅ Request queuing handles overload
  • ✅ Request success rate >95%
  • ✅ Service recovers automatically when connections released

Connection Pool Configuration

Database Connection Pool Configuration:

# kubernetes/configmaps/database-pool-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: database-pool-config
  namespace: atp-ingest-ns
data:
  ConnectionPoolConfig.json: |
    {
      "MaxPoolSize": 100,
      "MinPoolSize": 10,
      "IdleTimeout": 300000,
      "ConnectionTimeout": 30000,
      "LeakDetection": {
        "Enabled": true,
        "Threshold": 60000,
        "LogInterval": 60000
      },
      "PoolExhaustion": {
        "MaxWaitTime": 30000,
        "QueueSize": 1000,
        "RejectWhenExhausted": false
      },
      "HealthCheck": {
        "Enabled": true,
        "Interval": 30000,
        "Timeout": 5000
      }
    }
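
Before running the experiment, the pool limits above together with the chosen leak rate give a rough estimate of how long the pool takes to saturate, which helps size the monitoring window. A minimal sketch (the baseline and leak-rate values mirror the experiment defaults and are assumptions, not measurements):

#!/bin/bash
# Illustrative sketch: estimate time until the connection pool saturates
MAX_POOL_SIZE="${1:-100}"     # MaxPoolSize from ConnectionPoolConfig.json
BASELINE_ACTIVE="${2:-50}"    # typical active connections before the experiment
LEAK_RATE="${3:-5}"           # leaked connections per second (experiment parameter)

HEADROOM=$((MAX_POOL_SIZE - BASELINE_ACTIVE))
TIME_TO_EXHAUSTION=$((HEADROOM / LEAK_RATE))
echo "Pool headroom: ${HEADROOM} connections; expected exhaustion after ~${TIME_TO_EXHAUSTION}s at ${LEAK_RATE} leaks/sec"

With the defaults (100-connection pool, 50 active at baseline, 5 leaks per second) the pool saturates in roughly 10 seconds, so the 10-minute monitoring window in the experiment script leaves ample time to observe queuing behavior.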

Connection Pool Monitoring Script:

#!/bin/bash
# scripts/monitor-connection-pool.sh

SERVICE="${1:-atp-ingestion-api}"
NAMESPACE="${2:-atp-ingest-ns}"

echo "📊 Monitoring connection pool for ${SERVICE}"

# Get connection pool metrics
POOL_SIZE=$(curl -s http://prometheus:9090/api/v1/query?query=database_connection_pool_size\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
ACTIVE_CONNECTIONS=$(curl -s http://prometheus:9090/api/v1/query?query=database_connection_pool_active\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
IDLE_CONNECTIONS=$(curl -s http://prometheus:9090/api/v1/query?query=database_connection_pool_idle\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
WAITING_REQUESTS=$(curl -s http://prometheus:9090/api/v1/query?query=database_connection_pool_waiting\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')

echo "Connection pool metrics:"
echo "  Pool size: ${POOL_SIZE}"
echo "  Active connections: ${ACTIVE_CONNECTIONS}"
echo "  Idle connections: ${IDLE_CONNECTIONS}"
echo "  Waiting requests: ${WAITING_REQUESTS}"

# Check connection leak detection
LEAK_DETECTED=$(curl -s http://prometheus:9090/api/v1/query?query=database_connection_leaks_detected\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
LEAK_DURATION=$(curl -s http://prometheus:9090/api/v1/query?query=database_connection_leak_duration_seconds\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')

echo "Connection leak detection:"
echo "  Leaks detected: ${LEAK_DETECTED}"
echo "  Leak duration: ${LEAK_DURATION}s"

# Check pool exhaustion
POOL_EXHAUSTION_ERRORS=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(database_connection_pool_exhausted\{service=\"${SERVICE}\"\}[5m]\) | jq -r '.data.result[0].value[1]')
POOL_UTILIZATION=$(echo "scale=2; ${ACTIVE_CONNECTIONS} / ${POOL_SIZE} * 100" | bc)

echo "Pool utilization:"
echo "  Utilization: ${POOL_UTILIZATION}%"
echo "  Exhaustion errors: ${POOL_EXHAUSTION_ERRORS}/sec"

# Validate pool health
if (( $(echo "${POOL_UTILIZATION} > 90" | bc -l) )); then
  echo "⚠️  Pool utilization high: ${POOL_UTILIZATION}%"
fi

if (( $(echo "${WAITING_REQUESTS} > 0" | bc -l) )); then
  echo "⚠️  Requests waiting for connections: ${WAITING_REQUESTS}"
fi

if (( $(echo "${LEAK_DETECTED} > 0" | bc -l) )); then
  echo "⚠️  Connection leaks detected: ${LEAK_DETECTED}"
fi

if (( $(echo "${POOL_EXHAUSTION_ERRORS} > 0" | bc -l) )); then
  echo "⚠️  Pool exhaustion errors: ${POOL_EXHAUSTION_ERRORS}/sec"
fi

echo "✅ Connection pool monitoring complete"

Summary: Database Chaos

  • Database Failover: Validates Azure SQL failover mechanisms, connection retry, and transaction integrity during database failover; expects failover completes within 30 seconds, connection retry mechanisms activate, no transaction loss, and automatic recovery
  • Database Slowdown: Validates timeout handling, circuit breaker activation, and fallback strategies during database performance degradation; expects timeout configurations prevent hanging requests, circuit breaker activates, fallback strategies maintain availability, and automatic recovery
  • Database Connection Pool Exhaustion: Validates connection leak detection, pool size limits, and queuing behavior during connection pool exhaustion; expects pool size limits prevent resource exhaustion, connection leak detection identifies leaks, request queuing handles overload, and automatic recovery
  • Monitoring and Validation: Comprehensive scripts for monitoring database failover, slowdown, connection pool exhaustion, connection leaks, pool utilization, and recovery behavior

Storage and Queue Chaos

Purpose: Define comprehensive chaos experiments for storage and queue failures in ATP, validating blob storage resilience, message queue disruption handling, and event store integrity to ensure ATP services remain available and functional during storage and messaging infrastructure failures.


Blob Storage Unavailability

Blob storage unavailability experiments validate that ATP services handle Azure Blob Storage outages gracefully through retry logic, export failure handling, and eventual consistency mechanisms.

Hypothesis

"When Azure Blob Storage becomes unavailable, services will retry operations with exponential backoff, export failures will be handled gracefully, operations will be queued for eventual consistency, and services will recover automatically when storage is restored."

Experiment Configuration

Azure Blob Storage Network Partition:

# chaos-experiments/blob-storage-unavailability.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: blob-storage-unavailability
  namespace: chaos-testing
  labels:
    category: application
    service: blob-storage
    severity: high
    frequency: monthly
  annotations:
    chaos.atp.connectsoft.io/hypothesis: |
      When Azure Blob Storage becomes unavailable, services will retry operations with exponential backoff, 
      export failures will be handled gracefully, operations will be queued for eventual consistency, 
      and services will recover automatically when storage is restored.
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - atp-export-ns
    labelSelectors:
      app: atp-export-api
  direction: both
  target:
    mode: all
    selector:
      address: "*.blob.core.windows.net"
  duration: "15m"

Blob Storage Unavailability Simulation Script:

#!/bin/bash
# scripts/execute-blob-storage-unavailability-experiment.sh

STORAGE_ACCOUNT="${1:-atpstorageaccount}"
SERVICE="${2:-atp-export-api}"
NAMESPACE="${3:-atp-export-ns}"
DURATION="${4:-15m}"

echo "🧪 Starting blob storage unavailability experiment"
echo "Storage account: ${STORAGE_ACCOUNT}"
echo "Service: ${SERVICE}"
echo "Duration: ${DURATION}"

# Get baseline metrics
echo "📊 Collecting baseline metrics..."
# Capture the baseline snapshot under a single timestamped filename so the reads below use the same file
BASELINE_FILE="baseline-${SERVICE}-$(date +%Y%m%d-%H%M%S).json"
./scripts/collect-baseline-metrics.sh \
  --service ${SERVICE} \
  --duration 1h \
  --output "${BASELINE_FILE}"

BASELINE_EXPORT_SUCCESS_RATE=$(jq -r '.metrics.export_success_rate_percent' "${BASELINE_FILE}")
BASELINE_RETRY_COUNT=$(jq -r '.metrics.retry_count_per_operation' "${BASELINE_FILE}")

echo "Baseline metrics:"
echo "  Export success rate: ${BASELINE_EXPORT_SUCCESS_RATE}%"
echo "  Retry count per operation: ${BASELINE_RETRY_COUNT}"

# Apply network partition to blob storage
echo "🔧 Applying network partition to blob storage..."
kubectl apply -f - <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: blob-storage-unavailability-${SERVICE}
  namespace: chaos-testing
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - ${NAMESPACE}
    labelSelectors:
      app: ${SERVICE}
  direction: both
  target:
    mode: all
    selector:
      address: "*.blob.core.windows.net"
  duration: "${DURATION}"
EOF

OUTAGE_START=$(date +%s)

# Monitor service behavior
echo "👀 Monitoring service behavior during blob storage unavailability..."
MAX_WAIT=900  # 15 minutes
ELAPSED=0

while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
  # Get export success rate
  EXPORT_SUCCESS_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(export_operations_total\{service=\"${SERVICE}\",status=\"success\"\}[1m]\)/rate\(export_operations_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
  EXPORT_SUCCESS_RATE_PERCENT=$(echo "${EXPORT_SUCCESS_RATE} * 100" | bc)

  # Get export failure rate
  EXPORT_FAILURE_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(export_operations_total\{service=\"${SERVICE}\",status=\"failure\"\}[1m]\) | jq -r '.data.result[0].value[1]')

  # Get retry count
  RETRY_COUNT=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(blob_storage_retry_attempts\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')

  # Get queued operations
  QUEUED_OPERATIONS=$(curl -s http://prometheus:9090/api/v1/query?query=blob_storage_queued_operations\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')

  # Get eventual consistency operations
  EVENTUAL_CONSISTENCY_OPS=$(curl -s http://prometheus:9090/api/v1/query?query=blob_storage_eventual_consistency_operations\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')

  echo "Metrics at ${ELAPSED}s:"
  echo "  Export success rate: ${EXPORT_SUCCESS_RATE_PERCENT}%"
  echo "  Export failure rate: ${EXPORT_FAILURE_RATE}/sec"
  echo "  Retry count: ${RETRY_COUNT}/sec"
  echo "  Queued operations: ${QUEUED_OPERATIONS}"
  echo "  Eventual consistency operations: ${EVENTUAL_CONSISTENCY_OPS}"

  # Validate retry logic
  if (( $(echo "${RETRY_COUNT} > 0" | bc -l) )); then
    echo "✅ Retry logic active: ${RETRY_COUNT}/sec retry attempts"
  fi

  # Validate queuing behavior
  if (( $(echo "${QUEUED_OPERATIONS} > 0" | bc -l) )); then
    echo "✅ Operations queued for eventual consistency: ${QUEUED_OPERATIONS}"
  fi

  # Validate export failure handling
  if (( $(echo "${EXPORT_FAILURE_RATE} > 0" | bc -l) )); then
    echo "⚠️  Export failures detected: ${EXPORT_FAILURE_RATE}/sec"
    # Check if failures are being handled gracefully (not causing service crashes)
  fi

  sleep 30
  ELAPSED=$((ELAPSED + 30))
done

# Remove network partition
echo "🔧 Removing network partition..."
kubectl delete networkchaos blob-storage-unavailability-${SERVICE} -n chaos-testing

RECOVERY_START=$(date +%s)

# Wait for queued operations to complete
echo "⏳ Waiting for queued operations to complete..."
MAX_RECOVERY_WAIT=600  # 10 minutes
ELAPSED=0

while [ ${ELAPSED} -lt ${MAX_RECOVERY_WAIT} ]; do
  QUEUED_OPERATIONS=$(curl -s http://prometheus:9090/api/v1/query?query=blob_storage_queued_operations\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')

  if (( $(echo "${QUEUED_OPERATIONS} == 0" | bc -l) )); then
    RECOVERY_TIME=$(date +%s)
    RECOVERY_DURATION=$((RECOVERY_TIME - RECOVERY_START))
    echo "✅ Queued operations completed in ${RECOVERY_DURATION} seconds"
    break
  fi

  sleep 30
  ELAPSED=$((ELAPSED + 30))
  echo "Waiting for queued operations... (${ELAPSED}s/${MAX_RECOVERY_WAIT}s) - Queued: ${QUEUED_OPERATIONS}"
done

# Verify eventual consistency
echo "🔍 Verifying eventual consistency..."
FINAL_QUEUED_OPS=$(curl -s http://prometheus:9090/api/v1/query?query=blob_storage_queued_operations\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
FINAL_EXPORT_SUCCESS_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(export_operations_total\{service=\"${SERVICE}\",status=\"success\"\}[1m]\)/rate\(export_operations_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
FINAL_EXPORT_SUCCESS_RATE_PERCENT=$(echo "${FINAL_EXPORT_SUCCESS_RATE} * 100" | bc)

if (( $(echo "${FINAL_QUEUED_OPS} == 0" | bc -l) )); then
  echo "✅ All operations completed (eventual consistency achieved)"

  if (( $(echo "${FINAL_EXPORT_SUCCESS_RATE_PERCENT} >= 99" | bc -l) )); then
    echo "✅ Export success rate recovered: ${FINAL_EXPORT_SUCCESS_RATE_PERCENT}%"
    exit 0
  else
    echo "⚠️  Export success rate not fully recovered: ${FINAL_EXPORT_SUCCESS_RATE_PERCENT}%"
    exit 1
  fi
else
  echo "⚠️  Some operations still queued: ${FINAL_QUEUED_OPS}"
  exit 1
fi

Expected Behavior

Outage Phase (0-15 minutes):

  • Storage unavailability: Azure Blob Storage becomes unreachable
  • Retry logic activation: Services retry operations with exponential backoff
  • Export failure handling: Export failures handled gracefully (no crashes)
  • Operation queuing: Failed operations queued for eventual consistency
  • Service continuity: Service continues operating with degraded functionality

Recovery Phase (15-25 minutes):

  • Storage restoration: Blob Storage becomes available
  • Queued operations processing: Queued operations processed
  • Eventual consistency: All operations eventually complete
  • Normal operation: Service returns to normal operation

Expected Metrics

Metric                      Baseline    During Outage    Expected Range    Recovery Target
Export Success Rate         99.95%      0%               0%                99.95%
Retry Count                 0.1/sec     >0/sec           >0/sec            0.1/sec
Queued Operations           0           >0               >0                0
Export Failure Rate         0.05%       100%             100%              0.05%
Eventual Consistency Ops    0           >0               >0                0

Validation Criteria

Success Criteria:

  • ✅ Retry logic activates with exponential backoff
  • ✅ Export failures handled gracefully (no service crashes)
  • ✅ Operations queued for eventual consistency
  • ✅ All operations complete after storage restoration
  • ✅ Service recovers automatically when storage restored

Blob Storage Retry Configuration

Blob Storage Retry Configuration:

# kubernetes/configmaps/blob-storage-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: blob-storage-config
  namespace: atp-export-ns
data:
  BlobStorageConfig.json: |
    {
      "RetryPolicy": {
        "MaxRetries": 10,
        "RetryDelay": 1000,
        "ExponentialBackoff": true,
        "MaxBackoff": 60000,
        "RetryableErrors": [
          "NetworkError",
          "TimeoutError",
          "ServiceUnavailable",
          "InternalServerError"
        ]
      },
      "ExportFailureHandling": {
        "QueueOnFailure": true,
        "MaxQueueSize": 10000,
        "RetryFailedExports": true,
        "MaxRetryAttempts": 10
      },
      "EventualConsistency": {
        "Enabled": true,
        "QueueOperations": true,
        "ProcessInterval": 5000,
        "BatchSize": 100
      }
    }
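
Once the partition is removed, the queued operations should drain as exports start succeeding again. The following is a minimal sketch for estimating the remaining drain time; the queue and success-rate metric names are the same assumptions used by the experiment script above:

#!/bin/bash
# Illustrative sketch: estimate how long the blob-storage backlog will take to drain
SERVICE="${1:-atp-export-api}"

QUEUED=$(curl -s http://prometheus:9090/api/v1/query?query=blob_storage_queued_operations\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
DRAIN_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(export_operations_total\{service=\"${SERVICE}\",status=\"success\"\}[5m]\) | jq -r '.data.result[0].value[1]')

if (( $(echo "${DRAIN_RATE} > 0" | bc -l) )); then
  ETA_SECONDS=$(echo "scale=0; ${QUEUED} / ${DRAIN_RATE}" | bc)
  echo "Queued operations: ${QUEUED}, drain rate: ${DRAIN_RATE}/sec, estimated drain time: ${ETA_SECONDS}s"
else
  echo "Queued operations: ${QUEUED}, drain rate is zero - backlog is not draining yet"
fi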

Message Queue Disruption

Message queue disruption experiments validate that ATP services handle Service Bus topic disruptions gracefully through message buffering, backpressure mechanisms, and DLQ behavior.

Hypothesis

"When Service Bus topic becomes paused or unavailable, services will buffer messages, backpressure mechanisms will activate to prevent overload, messages will be moved to DLQ when appropriate, and services will recover automatically when the queue is restored."

Experiment Configuration

Service Bus Topic Pause Simulation:

# chaos-experiments/service-bus-topic-pause.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: service-bus-topic-pause
  namespace: chaos-testing
  labels:
    category: application
    service: service-bus
    severity: high
    frequency: monthly
  annotations:
    chaos.atp.connectsoft.io/hypothesis: |
      When Service Bus topic becomes paused or unavailable, services will buffer messages, 
      backpressure mechanisms will activate to prevent overload, messages will be moved to DLQ 
      when appropriate, and services will recover automatically when the queue is restored.
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - atp-ingest-ns
    labelSelectors:
      app: atp-ingest-api
  direction: both
  target:
    mode: all
    selector:
      address: "*.servicebus.windows.net"
  duration: "15m"

Message Queue Disruption Simulation Script:

#!/bin/bash
# scripts/execute-message-queue-disruption-experiment.sh

SERVICE_BUS_NAMESPACE="${1:-atp-servicebus}"
TOPIC_NAME="${2:-atp-events}"
SERVICE="${3:-atp-ingestion-api}"
NAMESPACE="${4:-atp-ingest-ns}"
DURATION="${5:-15m}"

echo "🧪 Starting message queue disruption experiment"
echo "Service Bus namespace: ${SERVICE_BUS_NAMESPACE}"
echo "Topic: ${TOPIC_NAME}"
echo "Service: ${SERVICE}"
echo "Duration: ${DURATION}"

# Get baseline metrics
echo "📊 Collecting baseline metrics..."
# Capture the baseline snapshot under a single timestamped filename so the reads below use the same file
BASELINE_FILE="baseline-${SERVICE}-$(date +%Y%m%d-%H%M%S).json"
./scripts/collect-baseline-metrics.sh \
  --service ${SERVICE} \
  --duration 1h \
  --output "${BASELINE_FILE}"

BASELINE_QUEUE_DEPTH=$(jq -r '.metrics.queue_depth' "${BASELINE_FILE}")
BASELINE_PROCESSING_RATE=$(jq -r '.metrics.message_processing_rate_per_sec' "${BASELINE_FILE}")

echo "Baseline metrics:"
echo "  Queue depth: ${BASELINE_QUEUE_DEPTH}"
echo "  Processing rate: ${BASELINE_PROCESSING_RATE} msg/sec"

# Apply network partition to Service Bus
echo "🔧 Applying network partition to Service Bus..."
kubectl apply -f - <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: service-bus-topic-pause-${SERVICE}
  namespace: chaos-testing
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - ${NAMESPACE}
    labelSelectors:
      app: ${SERVICE}
  direction: both
  target:
    mode: all
    selector:
      address: "*.servicebus.windows.net"
  duration: "${DURATION}"
EOF

DISRUPTION_START=$(date +%s)

# Monitor service behavior
echo "👀 Monitoring service behavior during message queue disruption..."
MAX_WAIT=900  # 15 minutes
ELAPSED=0

while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
  # Get queue depth
  QUEUE_DEPTH=$(curl -s http://prometheus:9090/api/v1/query?query=queue_depth\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')

  # Get buffered messages
  BUFFERED_MESSAGES=$(curl -s http://prometheus:9090/api/v1/query?query=message_buffer_size\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')

  # Get backpressure state
  BACKPRESSURE_ACTIVE=$(curl -s http://prometheus:9090/api/v1/query?query=backpressure_active\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')

  # Get DLQ size
  DLQ_SIZE=$(curl -s http://prometheus:9090/api/v1/query?query=dlq_messages_count\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')

  # Get message processing rate
  PROCESSING_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(messages_processed_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')

  # Get message arrival rate
  ARRIVAL_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(messages_arrived_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')

  echo "Metrics at ${ELAPSED}s:"
  echo "  Queue depth: ${QUEUE_DEPTH}"
  echo "  Buffered messages: ${BUFFERED_MESSAGES}"
  echo "  Backpressure active: ${BACKPRESSURE_ACTIVE}"
  echo "  DLQ size: ${DLQ_SIZE}"
  echo "  Processing rate: ${PROCESSING_RATE} msg/sec"
  echo "  Arrival rate: ${ARRIVAL_RATE} msg/sec"

  # Validate message buffering
  if (( $(echo "${BUFFERED_MESSAGES} > 0" | bc -l) )); then
    echo "✅ Messages buffered: ${BUFFERED_MESSAGES}"
  fi

  # Validate backpressure activation
  if (( $(echo "${QUEUE_DEPTH} > ${BASELINE_QUEUE_DEPTH} * 2" | bc -l) )); then
    if [ "${BACKPRESSURE_ACTIVE}" = "1" ]; then
      echo "✅ Backpressure activated: ${BACKPRESSURE_ACTIVE}"
    else
      echo "⚠️  Queue depth high but backpressure not active"
    fi
  fi

  # Validate DLQ behavior
  if (( $(echo "${DLQ_SIZE} > 0" | bc -l) )); then
    echo "⚠️  Messages in DLQ: ${DLQ_SIZE}"
    # Check if DLQ messages are within acceptable limits
    MAX_DLQ_SIZE=1000
    if (( $(echo "${DLQ_SIZE} > ${MAX_DLQ_SIZE}" | bc -l) )); then
      echo "⚠️  DLQ size exceeds limit: ${DLQ_SIZE} > ${MAX_DLQ_SIZE}"
    fi
  fi

  sleep 30
  ELAPSED=$((ELAPSED + 30))
done

# Remove network partition
echo "🔧 Removing network partition..."
kubectl delete networkchaos service-bus-topic-pause-${SERVICE} -n chaos-testing

RECOVERY_START=$(date +%s)

# Wait for buffered messages to be processed
echo "⏳ Waiting for buffered messages to be processed..."
MAX_RECOVERY_WAIT=600  # 10 minutes
ELAPSED=0

while [ ${ELAPSED} -lt ${MAX_RECOVERY_WAIT} ]; do
  BUFFERED_MESSAGES=$(curl -s http://prometheus:9090/api/v1/query?query=message_buffer_size\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
  QUEUE_DEPTH=$(curl -s http://prometheus:9090/api/v1/query?query=queue_depth\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')

  if (( $(echo "${BUFFERED_MESSAGES} == 0" | bc -l) )); then
    if (( $(echo "${QUEUE_DEPTH} <= ${BASELINE_QUEUE_DEPTH} * 1.1" | bc -l) )); then
      RECOVERY_TIME=$(date +%s)
      RECOVERY_DURATION=$((RECOVERY_TIME - RECOVERY_START))
      echo "✅ Buffered messages processed in ${RECOVERY_DURATION} seconds"
      break
    fi
  fi

  sleep 30
  ELAPSED=$((ELAPSED + 30))
  echo "Waiting for recovery... (${ELAPSED}s/${MAX_RECOVERY_WAIT}s) - Buffered: ${BUFFERED_MESSAGES}, Queue depth: ${QUEUE_DEPTH}"
done

# Verify recovery
FINAL_BUFFERED_MESSAGES=$(curl -s http://prometheus:9090/api/v1/query?query=message_buffer_size\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
FINAL_PROCESSING_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(messages_processed_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
FINAL_BACKPRESSURE=$(curl -s http://prometheus:9090/api/v1/query?query=backpressure_active\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')

if (( $(echo "${FINAL_BUFFERED_MESSAGES} == 0" | bc -l) )); then
  echo "✅ All buffered messages processed"

  if (( $(echo "${FINAL_PROCESSING_RATE} >= ${BASELINE_PROCESSING_RATE} * 0.9" | bc -l) )); then
    echo "✅ Processing rate recovered: ${FINAL_PROCESSING_RATE} msg/sec (baseline: ${BASELINE_PROCESSING_RATE} msg/sec)"

    if [ "${FINAL_BACKPRESSURE}" = "0" ]; then
      echo "✅ Backpressure deactivated (service recovered)"
      exit 0
    else
      echo "⚠️  Backpressure still active: ${FINAL_BACKPRESSURE}"
      exit 1
    fi
  else
    echo "⚠️  Processing rate not fully recovered: ${FINAL_PROCESSING_RATE} msg/sec"
    exit 1
  fi
else
  echo "⚠️  Some messages still buffered: ${FINAL_BUFFERED_MESSAGES}"
  exit 1
fi

Expected Behavior

Disruption Phase (0-15 minutes):

  • Queue unavailability: Service Bus topic becomes unavailable
  • Message buffering: Messages buffered locally
  • Backpressure activation: Backpressure mechanisms activate to prevent overload
  • DLQ movement: Messages moved to DLQ when retry limit exceeded
  • Service continuity: Service continues operating with message buffering

Recovery Phase (15-25 minutes):

  • Queue restoration: Service Bus topic restored
  • Buffered message processing: Buffered messages processed
  • Backpressure deactivation: Backpressure mechanisms deactivate
  • Normal operation: Service returns to normal operation

Expected Metrics

Metric                 Baseline       During Disruption    Expected Range    Recovery Target
Queue Depth            100            Increasing           Any               100
Buffered Messages      0              >0                   >0                0
Backpressure Active    No             Yes                  Yes               No
DLQ Size               0              <1,000               <1,000            0
Processing Rate        100 msg/sec    0 msg/sec            0 msg/sec         100 msg/sec

Validation Criteria

Success Criteria:

  • ✅ Messages buffered when queue unavailable
  • ✅ Backpressure activates to prevent overload
  • ✅ DLQ behavior appropriate (messages moved when retry limit exceeded)
  • ✅ All buffered messages processed after queue restoration
  • ✅ Service recovers automatically when queue restored

Message Queue Configuration

Message Queue Buffering Configuration:

# kubernetes/configmaps/message-queue-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: message-queue-config
  namespace: atp-ingest-ns
data:
  MessageQueueConfig.json: |
    {
      "Buffering": {
        "Enabled": true,
        "MaxBufferSize": 10000,
        "BufferTimeout": 300000
      },
      "Backpressure": {
        "Enabled": true,
        "QueueDepthThreshold": 5000,
        "ThrottleRate": 0.5
      },
      "DLQ": {
        "Enabled": true,
        "MaxRetryAttempts": 10,
        "MoveToDLQAfterRetries": true,
        "MaxDLQSize": 10000
      },
      "RetryPolicy": {
        "MaxRetries": 10,
        "RetryDelay": 1000,
        "ExponentialBackoff": true,
        "MaxBackoff": 60000
      }
    }
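
A quick way to validate the DLQ limits above during or after the disruption is to compare the current DLQ depth against the configured maximum. A minimal sketch, reusing the dlq_messages_count metric from the experiment script and treating MaxDLQSize from the configuration as the threshold:

#!/bin/bash
# Illustrative sketch: check DLQ depth against the configured MaxDLQSize
SERVICE="${1:-atp-ingestion-api}"
MAX_DLQ_SIZE="${2:-10000}"   # MaxDLQSize from MessageQueueConfig.json

DLQ_SIZE=$(curl -s http://prometheus:9090/api/v1/query?query=dlq_messages_count\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')

if (( $(echo "${DLQ_SIZE} > ${MAX_DLQ_SIZE}" | bc -l) )); then
  echo "⚠️  DLQ depth ${DLQ_SIZE} exceeds the configured limit of ${MAX_DLQ_SIZE}"
  exit 1
fi
echo "✅ DLQ depth within limit: ${DLQ_SIZE}/${MAX_DLQ_SIZE}"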

Event Store Corruption Simulation

Event store corruption simulation experiments validate that ATP services handle corrupted event data gracefully through integrity verification, quarantine procedures, and recovery from backups.

Hypothesis

"When event store contains corrupted event data, integrity verification will detect corruption, corrupted events will be quarantined, services will recover from backups, and services will continue operating without corrupted data."

Experiment Configuration

Event Store Corruption Simulation Script:

#!/bin/bash
# scripts/execute-event-store-corruption-experiment.sh

SERVICE="${1:-atp-ingestion-api}"
NAMESPACE="${2:-atp-ingest-ns}"
CORRUPTION_RATE="${3:-1}"  # Percentage of events to corrupt

echo "🧪 Starting event store corruption simulation"
echo "Service: ${SERVICE}"
echo "Corruption rate: ${CORRUPTION_RATE}%"

# Get baseline metrics
echo "📊 Collecting baseline metrics..."
# Capture the baseline snapshot under a single timestamped filename so the reads below use the same file
BASELINE_FILE="baseline-${SERVICE}-$(date +%Y%m%d-%H%M%S).json"
./scripts/collect-baseline-metrics.sh \
  --service ${SERVICE} \
  --duration 1h \
  --output "${BASELINE_FILE}"

BASELINE_EVENT_COUNT=$(jq -r '.metrics.total_events' "${BASELINE_FILE}")
BASELINE_INTEGRITY_CHECKS=$(jq -r '.metrics.integrity_checks_passed' "${BASELINE_FILE}")

echo "Baseline metrics:"
echo "  Total events: ${BASELINE_EVENT_COUNT}"
echo "  Integrity checks passed: ${BASELINE_INTEGRITY_CHECKS}"

# Simulate event corruption by modifying event data
echo "🔧 Simulating event store corruption..."
CORRUPTION_START=$(date +%s)

# Corrupt a small percentage of events by overwriting their checksums directly in the database.
# The corruption is applied through a short-lived Kubernetes job (created below); direct
# modification of event data like this must only ever be run against isolated test environments.

# Execute corruption (in production, this would be done carefully)
echo "⚠️  WARNING: This will corrupt event data. Continuing in 5 seconds..."
sleep 5

# For simulation, we'll use a Kubernetes job
kubectl create job corrupt-events-$(date +%s) \
  --image=postgres:15 \
  -n ${NAMESPACE} \
  -- /bin/sh -c "
    PGPASSWORD=\${DB_PASSWORD} psql -h \${DB_HOST} -U \${DB_USER} -d \${DB_NAME} -c \"
      UPDATE events 
      SET checksum = 'CORRUPTED' 
      WHERE id IN (
        SELECT id FROM events 
        WHERE checksum != 'CORRUPTED' 
        ORDER BY RANDOM() 
        LIMIT (SELECT GREATEST(1, COUNT(*) * ${CORRUPTION_RATE} / 100) FROM events)
      );
    \"
  " || echo "⚠️  Corruption simulation job failed (expected in test environment)"

# Monitor service behavior
echo "👀 Monitoring service behavior during event store corruption..."
MAX_WAIT=600  # 10 minutes
ELAPSED=0

while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
  # Get integrity check failures
  INTEGRITY_FAILURES=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(event_integrity_check_failures\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')

  # Get quarantined events
  QUARANTINED_EVENTS=$(curl -s http://prometheus:9090/api/v1/query?query=event_quarantine_count\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')

  # Get recovery operations
  RECOVERY_OPS=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(event_recovery_operations\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')

  # Get backup restore operations
  BACKUP_RESTORE_OPS=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(event_backup_restore_operations\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')

  # Get service availability
  AVAILABILITY=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SERVICE}\",status!~\"5..\"\}[1m]\)/rate\(http_requests_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
  AVAILABILITY_PERCENT=$(echo "${AVAILABILITY} * 100" | bc)

  echo "Metrics at ${ELAPSED}s:"
  echo "  Integrity check failures: ${INTEGRITY_FAILURES}/sec"
  echo "  Quarantined events: ${QUARANTINED_EVENTS}"
  echo "  Recovery operations: ${RECOVERY_OPS}/sec"
  echo "  Backup restore operations: ${BACKUP_RESTORE_OPS}/sec"
  echo "  Service availability: ${AVAILABILITY_PERCENT}%"

  # Validate integrity verification
  if (( $(echo "${INTEGRITY_FAILURES} > 0" | bc -l) )); then
    echo "✅ Integrity verification detected corruption: ${INTEGRITY_FAILURES}/sec failures"
  fi

  # Validate quarantine procedures
  if (( $(echo "${QUARANTINED_EVENTS} > 0" | bc -l) )); then
    echo "✅ Corrupted events quarantined: ${QUARANTINED_EVENTS}"
  fi

  # Validate recovery operations
  if (( $(echo "${RECOVERY_OPS} > 0" | bc -l) )); then
    echo "✅ Recovery operations active: ${RECOVERY_OPS}/sec"
  fi

  # Validate service availability
  if (( $(echo "${AVAILABILITY_PERCENT} >= 95" | bc -l) )); then
    echo "✅ Service availability maintained: ${AVAILABILITY_PERCENT}%"
  fi

  sleep 30
  ELAPSED=$((ELAPSED + 30))
done

# Verify recovery
echo "🔍 Verifying recovery from corruption..."
FINAL_QUARANTINED=$(curl -s http://prometheus:9090/api/v1/query?query=event_quarantine_count\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
FINAL_INTEGRITY_FAILURES=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(event_integrity_check_failures\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
FINAL_AVAILABILITY=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SERVICE}\",status!~\"5..\"\}[1m]\)/rate\(http_requests_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
FINAL_AVAILABILITY_PERCENT=$(echo "${FINAL_AVAILABILITY} * 100" | bc)

# Check if corrupted events were recovered from backup
RECOVERED_EVENTS=$(curl -s http://prometheus:9090/api/v1/query?query=event_recovery_successful\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')

if (( $(echo "${FINAL_INTEGRITY_FAILURES} == 0" | bc -l) )); then
  echo "✅ Integrity checks passing (corruption handled)"

  if (( $(echo "${FINAL_AVAILABILITY_PERCENT} >= 99" | bc -l) )); then
    echo "✅ Service availability recovered: ${FINAL_AVAILABILITY_PERCENT}%"

    if (( $(echo "${RECOVERED_EVENTS} > 0" | bc -l) )); then
      echo "✅ Events recovered from backup: ${RECOVERED_EVENTS}"
      exit 0
    else
      echo "⚠️  No events recovered from backup"
      exit 1
    fi
  else
    echo "⚠️  Service availability not fully recovered: ${FINAL_AVAILABILITY_PERCENT}%"
    exit 1
  fi
else
  echo "⚠️  Integrity check failures still occurring: ${FINAL_INTEGRITY_FAILURES}/sec"
  exit 1
fi

Expected Behavior

Corruption Detection Phase (0-5 minutes):

  • Corruption detection: Integrity verification detects corrupted events
  • Quarantine activation: Corrupted events quarantined
  • Service continuity: Service continues operating without corrupted data

Recovery Phase (5-15 minutes):

  • Backup identification: Backups identified for corrupted events
  • Event recovery: Corrupted events recovered from backups
  • Integrity restoration: Event store integrity restored
  • Normal operation: Service returns to normal operation

Expected Metrics

Metric                       Baseline    During Corruption    Expected Range    Recovery Target
Integrity Check Failures     0/sec       >0/sec               >0/sec            0/sec
Quarantined Events           0           >0                   >0                0
Recovery Operations          0/sec       >0/sec               >0/sec            0/sec
Backup Restore Operations    0/sec       >0/sec               >0/sec            0/sec
Service Availability         99.95%      >95%                 >95%              99.95%

Validation Criteria

Success Criteria:

  • ✅ Integrity verification detects corruption
  • ✅ Corrupted events quarantined
  • ✅ Events recovered from backups
  • ✅ Service availability >95% during corruption
  • ✅ Service recovers automatically

Event Store Integrity Configuration

Event Store Integrity Configuration:

# kubernetes/configmaps/event-store-integrity-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: event-store-integrity-config
  namespace: atp-ingest-ns
data:
  EventStoreIntegrityConfig.json: |
    {
      "IntegrityVerification": {
        "Enabled": true,
        "CheckInterval": 60000,
        "ChecksumAlgorithm": "SHA256",
        "ValidateOnRead": true,
        "ValidateOnWrite": true
      },
      "Quarantine": {
        "Enabled": true,
        "QuarantineThreshold": 1,
        "QuarantineLocation": "quarantine_events",
        "MaxQuarantineSize": 10000
      },
      "Recovery": {
        "Enabled": true,
        "RecoverFromBackup": true,
        "BackupRetentionDays": 30,
        "RecoveryRetryAttempts": 3
      },
      "Backup": {
        "Enabled": true,
        "BackupInterval": 3600000,
        "BackupRetentionDays": 30,
        "BackupLocation": "backup_events"
      }
    }
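
At its core, the integrity check recomputes each event's checksum and compares it with the stored value. The following is a minimal sketch of that comparison for a single payload, assuming the stored checksum is the hex SHA-256 of the raw payload; the exact canonicalization ATP uses may differ, and the payload shown is hypothetical:

#!/bin/bash
# Illustrative sketch: recompute and compare a single event checksum (SHA-256, as configured above)

EVENT_PAYLOAD='{"eventId":"example-0001","action":"document.viewed"}'            # hypothetical payload
STORED_CHECKSUM="$(echo -n "${EVENT_PAYLOAD}" | sha256sum | awk '{print $1}')"   # stand-in for the value stored with the event

COMPUTED_CHECKSUM=$(echo -n "${EVENT_PAYLOAD}" | sha256sum | awk '{print $1}')

if [ "${COMPUTED_CHECKSUM}" = "${STORED_CHECKSUM}" ]; then
  echo "✅ Checksum match: event integrity verified"
else
  echo "⚠️  Checksum mismatch: event should be quarantined"
fi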

Event Integrity Verification Script:

#!/bin/bash
# scripts/verify-event-integrity.sh

SERVICE="${1:-atp-ingestion-api}"
NAMESPACE="${2:-atp-ingest-ns}"

echo "🔍 Verifying event store integrity for ${SERVICE}"

# Get integrity check metrics
INTEGRITY_CHECKS=$(curl -s http://prometheus:9090/api/v1/query?query=event_integrity_checks_total\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
INTEGRITY_FAILURES=$(curl -s http://prometheus:9090/api/v1/query?query=event_integrity_check_failures\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
if (( $(echo "${INTEGRITY_CHECKS} > 0" | bc -l) )); then
  INTEGRITY_PASS_RATE=$(echo "scale=2; (${INTEGRITY_CHECKS} - ${INTEGRITY_FAILURES}) / ${INTEGRITY_CHECKS} * 100" | bc)
else
  INTEGRITY_PASS_RATE=100  # no checks recorded yet; avoid dividing by zero
fi

echo "Integrity check metrics:"
echo "  Total checks: ${INTEGRITY_CHECKS}"
echo "  Failures: ${INTEGRITY_FAILURES}"
echo "  Pass rate: ${INTEGRITY_PASS_RATE}%"

# Get quarantined events
QUARANTINED_EVENTS=$(curl -s http://prometheus:9090/api/v1/query?query=event_quarantine_count\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
echo "  Quarantined events: ${QUARANTINED_EVENTS}"

# Get recovery operations
RECOVERY_OPS=$(curl -s http://prometheus:9090/api/v1/query?query=event_recovery_operations_total\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
RECOVERY_SUCCESS=$(curl -s http://prometheus:9090/api/v1/query?query=event_recovery_successful\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
if (( $(echo "${RECOVERY_OPS} > 0" | bc -l) )); then
  RECOVERY_SUCCESS_RATE=$(echo "scale=2; ${RECOVERY_SUCCESS} / ${RECOVERY_OPS} * 100" | bc)
else
  RECOVERY_SUCCESS_RATE=0  # no recovery operations recorded; avoid dividing by zero
fi

echo "Recovery metrics:"
echo "  Total recovery operations: ${RECOVERY_OPS}"
echo "  Successful recoveries: ${RECOVERY_SUCCESS}"
echo "  Recovery success rate: ${RECOVERY_SUCCESS_RATE}%"

# Validate integrity
if (( $(echo "${INTEGRITY_PASS_RATE} >= 99.9" | bc -l) )); then
  echo "✅ Event store integrity healthy: ${INTEGRITY_PASS_RATE}% pass rate"
else
  echo "⚠️  Event store integrity issues: ${INTEGRITY_PASS_RATE}% pass rate"
  exit 1
fi

if (( $(echo "${QUARANTINED_EVENTS} > 0" | bc -l) )); then
  echo "⚠️  Quarantined events detected: ${QUARANTINED_EVENTS}"
  # Check if recovery is in progress
  if (( $(echo "${RECOVERY_OPS} > 0" | bc -l) )); then
    echo "✅ Recovery operations in progress"
  fi
fi

echo "✅ Event integrity verification complete"

Summary: Storage and Queue Chaos

  • Blob Storage Unavailability: Validates retry logic, export failure handling, and eventual consistency during Azure Blob Storage outages; expects retry logic activates with exponential backoff, export failures handled gracefully, operations queued for eventual consistency, and automatic recovery
  • Message Queue Disruption: Validates message buffering, backpressure mechanisms, and DLQ behavior during Service Bus topic disruptions; expects messages buffered when queue unavailable, backpressure activates, DLQ behavior appropriate, and automatic recovery
  • Event Store Corruption Simulation: Validates integrity verification, quarantine procedures, and recovery from backups during event store corruption; expects integrity verification detects corruption, corrupted events quarantined, events recovered from backups, and automatic recovery
  • Monitoring and Validation: Comprehensive scripts for monitoring blob storage availability, message queue disruption, event store corruption, integrity verification, quarantine procedures, and recovery operations

Network Chaos

Purpose: Define comprehensive chaos experiments for network failures in ATP, validating network partition handling, packet loss resilience, DNS failure recovery, and bandwidth limitation management to ensure ATP services remain available and functional during network-level failures and performance degradation.


Network Partition

Network partition experiments validate that ATP services handle network partitions gracefully through partition detection, service isolation handling, and automatic recovery when network connectivity is restored.

Hypothesis

"When a network partition occurs between service namespaces, services will detect the partition, handle service isolation gracefully, continue operating within their partition, and recover automatically when network connectivity is restored."

Experiment Configuration

Network Partition Between Namespaces:

# chaos-experiments/network-partition.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-partition
  namespace: chaos-testing
  labels:
    category: infrastructure
    service: network
    severity: high
    frequency: monthly
  annotations:
    chaos.atp.connectsoft.io/hypothesis: |
      When a network partition occurs between service namespaces, services will detect the partition, 
      handle service isolation gracefully, continue operating within their partition, 
      and recover automatically when network connectivity is restored.
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - atp-ingest-ns
    labelSelectors:
      app: atp-ingest-api
  direction: both
  target:
    mode: all
    selector:
      namespaces:
        - atp-query-ns
  duration: "5m"

Network Partition Simulation Script:

#!/bin/bash
# scripts/execute-network-partition-experiment.sh

SOURCE_NS="${1:-atp-ingest-ns}"
SOURCE_SERVICE="${2:-atp-ingest-api}"
TARGET_NS="${3:-atp-query-ns}"
DURATION="${4:-5m}"

echo "🧪 Starting network partition experiment"
echo "Source namespace: ${SOURCE_NS}"
echo "Source service: ${SOURCE_SERVICE}"
echo "Target namespace: ${TARGET_NS}"
echo "Duration: ${DURATION}"

# Get baseline metrics
echo "📊 Collecting baseline metrics..."
BASELINE_FILE="baseline-${SOURCE_SERVICE}-$(date +%Y%m%d-%H%M%S).json"  # capture the file name once so the same file is read back below
./scripts/collect-baseline-metrics.sh \
  --service ${SOURCE_SERVICE} \
  --duration 1h \
  --output "${BASELINE_FILE}"

BASELINE_SUCCESS_RATE=$(jq -r '.metrics.success_rate_percent' "${BASELINE_FILE}")
BASELINE_LATENCY=$(jq -r '.metrics.p95_latency_ms' "${BASELINE_FILE}")

echo "Baseline metrics:"
echo "  Success rate: ${BASELINE_SUCCESS_RATE}%"
echo "  P95 latency: ${BASELINE_LATENCY}ms"

# Apply network partition
echo "🔧 Applying network partition..."
kubectl apply -f - <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-partition-${SOURCE_NS}-to-${TARGET_NS}
  namespace: chaos-testing
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - ${SOURCE_NS}
    labelSelectors:
      app: ${SOURCE_SERVICE}
  direction: both
  target:
    mode: all
    selector:
      namespaces:
        - ${TARGET_NS}
  duration: "${DURATION}"
EOF

PARTITION_START=$(date +%s)

# Monitor service behavior
echo "👀 Monitoring service behavior during network partition..."
MAX_WAIT=300  # 5 minutes
ELAPSED=0

while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
  # Get request success rate
  SUCCESS_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SOURCE_SERVICE}\",status!~\"5..\"\}[1m]\)/rate\(http_requests_total\{service=\"${SOURCE_SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
  SUCCESS_RATE_PERCENT=$(echo "${SUCCESS_RATE} * 100" | bc)

  # Get connection errors
  CONNECTION_ERRORS=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SOURCE_SERVICE}\",status=\"503\"\}[1m]\) | jq -r '.data.result[0].value[1]')

  # Get partition detection
  PARTITION_DETECTED=$(curl -s http://prometheus:9090/api/v1/query?query=network_partition_detected\{service=\"${SOURCE_SERVICE}\"\} | jq -r '.data.result[0].value[1]')

  # Get service availability within partition
  AVAILABILITY=$(curl -s http://prometheus:9090/api/v1/query?query=service_availability\{service=\"${SOURCE_SERVICE}\",partition=\"local\"\} | jq -r '.data.result[0].value[1]')
  AVAILABILITY_PERCENT=$(echo "${AVAILABILITY} * 100" | bc)

  echo "Metrics at ${ELAPSED}s:"
  echo "  Success rate: ${SUCCESS_RATE_PERCENT}%"
  echo "  Connection errors: ${CONNECTION_ERRORS}/sec"
  echo "  Partition detected: ${PARTITION_DETECTED}"
  echo "  Local availability: ${AVAILABILITY_PERCENT}%"

  # Validate partition detection
  if [ "${PARTITION_DETECTED}" = "1" ]; then
    echo "✅ Network partition detected"
  fi

  # Validate service isolation handling
  if (( $(echo "${AVAILABILITY_PERCENT} >= 90" | bc -l) )); then
    echo "✅ Service continues operating within partition: ${AVAILABILITY_PERCENT}%"
  fi

  sleep 30
  ELAPSED=$((ELAPSED + 30))
done

# Remove network partition
echo "🔧 Removing network partition..."
kubectl delete networkchaos network-partition-${SOURCE_NS}-to-${TARGET_NS} -n chaos-testing

RECOVERY_START=$(date +%s)

# Wait for recovery
echo "⏳ Waiting for network connectivity to restore..."
sleep 60

# Verify recovery
FINAL_SUCCESS_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SOURCE_SERVICE}\",status!~\"5..\"\}[1m]\)/rate\(http_requests_total\{service=\"${SOURCE_SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
FINAL_SUCCESS_RATE_PERCENT=$(echo "${FINAL_SUCCESS_RATE} * 100" | bc)
FINAL_PARTITION_DETECTED=$(curl -s http://prometheus:9090/api/v1/query?query=network_partition_detected\{service=\"${SOURCE_SERVICE}\"\} | jq -r '.data.result[0].value[1]')

if (( $(echo "${FINAL_SUCCESS_RATE_PERCENT} >= ${BASELINE_SUCCESS_RATE} * 0.95" | bc -l) )); then
  echo "✅ Success rate recovered: ${FINAL_SUCCESS_RATE_PERCENT}% (baseline: ${BASELINE_SUCCESS_RATE}%)"

  if [ "${FINAL_PARTITION_DETECTED}" = "0" ]; then
    echo "✅ Network partition cleared (service recovered)"
    exit 0
  else
    echo "⚠️  Network partition still detected: ${FINAL_PARTITION_DETECTED}"
    exit 1
  fi
else
  echo "⚠️  Success rate not fully recovered: ${FINAL_SUCCESS_RATE_PERCENT}% (baseline: ${BASELINE_SUCCESS_RATE}%)"
  exit 1
fi

Expected Behavior

Partition Phase (0-5 minutes):

  • Network partition: Network connectivity between namespaces lost
  • Partition detection: Services detect network partition
  • Service isolation: Services continue operating within their partition
  • Connection failures: Cross-partition requests fail
  • Local operation: Services maintain local functionality

Recovery Phase (5-10 minutes):

  • Network restoration: Network connectivity restored
  • Partition detection cleared: Services detect network restoration
  • Connection re-establishment: Cross-partition connections re-established
  • Normal operation: Service returns to normal operation

Expected Metrics

| Metric               | Baseline | During Partition | Expected Range | Recovery Target |
|----------------------|----------|------------------|----------------|-----------------|
| Request Success Rate | 99.95%   | Variable         | Variable       | 99.95%          |
| Connection Errors    | 0/sec    | >0/sec           | >0/sec         | 0/sec           |
| Partition Detected   | No       | Yes              | Yes            | No              |
| Local Availability   | 100%     | >90%             | >90%           | 100%            |

Validation Criteria

Success Criteria:

  • ✅ Network partition detected
  • ✅ Services continue operating within partition
  • ✅ Local availability >90%
  • ✅ Services recover automatically when partition cleared
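
A quick way to confirm both halves of the hypothesis by hand is to probe connectivity from inside an ingest pod while the partition is active. The deployment name, port, and health-check paths below are illustrative assumptions (and the image is assumed to ship curl), not an ATP contract:

# Cross-partition call should fail fast while the NetworkChaos is applied
kubectl exec -n atp-ingest-ns deploy/atp-ingest-api -- \
  sh -c 'curl -sf -m 2 http://atp-query-api.atp-query-ns.svc.cluster.local:8080/healthz \
    || echo "cross-partition call blocked (expected during partition)"'

# Local health endpoint should still answer, demonstrating in-partition operation
kubectl exec -n atp-ingest-ns deploy/atp-ingest-api -- \
  sh -c 'curl -sf -m 2 http://localhost:8080/healthz && echo "local health OK"'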

Packet Loss Simulation

Packet loss simulation experiments validate that ATP services handle packet loss gracefully through TCP retransmission, application timeouts, and circuit breaker activation.

Hypothesis

"When packet loss occurs (5%, 10%, 25%), TCP retransmission will handle low packet loss, application timeouts will handle high packet loss, circuit breakers will protect services from cascading failures, and services will recover automatically when packet loss is removed."

Experiment Configuration

Packet Loss Injection:

# chaos-experiments/packet-loss-simulation.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: packet-loss-simulation
  namespace: chaos-testing
  labels:
    category: infrastructure
    service: network
    severity: medium
    frequency: monthly
  annotations:
    chaos.atp.connectsoft.io/hypothesis: |
      When packet loss occurs, TCP retransmission will handle low packet loss, 
      application timeouts will handle high packet loss, circuit breakers will protect services, 
      and services will recover automatically when packet loss is removed.
spec:
  action: loss
  mode: fixed-percent
  value: "10"  # 10% packet loss
  selector:
    namespaces:
      - atp-ingest-ns
    labelSelectors:
      app: atp-ingest-api
  direction: both
  loss:
    loss: "10%"
    correlation: "25"
  duration: "10m"

Packet Loss Simulation Script:

#!/bin/bash
# scripts/execute-packet-loss-simulation.sh

SERVICE="${1:-atp-ingestion-api}"
NAMESPACE="${2:-atp-ingest-ns}"
PACKET_LOSS="${3:-10}"  # Packet loss percentage (5, 10, 25)
DURATION="${4:-10m}"

echo "🧪 Starting packet loss simulation"
echo "Service: ${SERVICE}"
echo "Packet loss: ${PACKET_LOSS}%"
echo "Duration: ${DURATION}"

# Get baseline metrics
echo "📊 Collecting baseline metrics..."
BASELINE_FILE="baseline-${SERVICE}-$(date +%Y%m%d-%H%M%S).json"  # capture the file name once so the same file is read back below
./scripts/collect-baseline-metrics.sh \
  --service ${SERVICE} \
  --duration 1h \
  --output "${BASELINE_FILE}"

BASELINE_SUCCESS_RATE=$(jq -r '.metrics.success_rate_percent' "${BASELINE_FILE}")
BASELINE_LATENCY=$(jq -r '.metrics.p95_latency_ms' "${BASELINE_FILE}")

echo "Baseline metrics:"
echo "  Success rate: ${BASELINE_SUCCESS_RATE}%"
echo "  P95 latency: ${BASELINE_LATENCY}ms"

# Apply packet loss
echo "🔧 Applying packet loss..."
kubectl apply -f - <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: packet-loss-${SERVICE}-${PACKET_LOSS}pct
  namespace: chaos-testing
spec:
  action: loss
  mode: fixed-percent
  value: "100"
  selector:
    namespaces:
      - ${NAMESPACE}
    labelSelectors:
      app: ${SERVICE}
  direction: both
  loss:
    loss: "${PACKET_LOSS}%"
    correlation: "25"
  duration: "${DURATION}"
EOF

PACKET_LOSS_START=$(date +%s)

# Monitor service behavior
echo "👀 Monitoring service behavior during packet loss..."
MAX_WAIT=600  # 10 minutes
ELAPSED=0

while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
  # Get request success rate
  SUCCESS_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SERVICE}\",status!~\"5..\"\}[1m]\)/rate\(http_requests_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
  SUCCESS_RATE_PERCENT=$(echo "${SUCCESS_RATE} * 100" | bc)

  # Get TCP retransmissions
  TCP_RETRANSMISSIONS=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(tcp_retransmissions\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')

  # Get timeout errors
  TIMEOUT_ERRORS=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SERVICE}\",status=\"504\"\}[1m]\) | jq -r '.data.result[0].value[1]')

  # Get circuit breaker state
  CB_STATE=$(curl -s http://prometheus:9090/api/v1/query?query=circuit_breaker_state\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')

  # Get latency
  P95_LATENCY=$(curl -s http://prometheus:9090/api/v1/query?query=histogram_quantile\(0.95,rate\(http_request_duration_seconds_bucket\{service=\"${SERVICE}\"\}[1m]\)\) | jq -r '.data.result[0].value[1]')
  P95_LATENCY_MS=$(echo "${P95_LATENCY} * 1000" | bc)

  echo "Metrics at ${ELAPSED}s:"
  echo "  Success rate: ${SUCCESS_RATE_PERCENT}%"
  echo "  TCP retransmissions: ${TCP_RETRANSMISSIONS}/sec"
  echo "  Timeout errors: ${TIMEOUT_ERRORS}/sec"
  echo "  Circuit breaker state: ${CB_STATE}"
  echo "  P95 latency: ${P95_LATENCY_MS}ms (baseline: ${BASELINE_LATENCY}ms)"

  # Validate TCP retransmission for low packet loss
  if (( $(echo "${PACKET_LOSS} <= 10" | bc -l) )); then
    if (( $(echo "${TCP_RETRANSMISSIONS} > 0" | bc -l) )); then
      echo "✅ TCP retransmission handling packet loss: ${TCP_RETRANSMISSIONS}/sec"
    fi
  fi

  # Validate application timeouts for high packet loss
  if (( $(echo "${PACKET_LOSS} > 10" | bc -l) )); then
    if (( $(echo "${TIMEOUT_ERRORS} > 0" | bc -l) )); then
      echo "✅ Application timeouts handling packet loss: ${TIMEOUT_ERRORS}/sec"
    fi
  fi

  # Validate circuit breaker activation
  if (( $(echo "${PACKET_LOSS} >= 25" | bc -l) )); then
    if [ "${CB_STATE}" = "Open" ]; then
      echo "✅ Circuit breaker activated: ${CB_STATE}"
    fi
  fi

  sleep 30
  ELAPSED=$((ELAPSED + 30))
done

# Remove packet loss
echo "🔧 Removing packet loss..."
kubectl delete networkchaos packet-loss-${SERVICE}-${PACKET_LOSS}pct -n chaos-testing

RECOVERY_START=$(date +%s)

# Wait for recovery
echo "⏳ Waiting for network to recover..."
sleep 120

# Verify recovery
FINAL_SUCCESS_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SERVICE}\",status!~\"5..\"\}[1m]\)/rate\(http_requests_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
FINAL_SUCCESS_RATE_PERCENT=$(echo "${FINAL_SUCCESS_RATE} * 100" | bc)
FINAL_CB_STATE=$(curl -s http://prometheus:9090/api/v1/query?query=circuit_breaker_state\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')

if (( $(echo "${FINAL_SUCCESS_RATE_PERCENT} >= ${BASELINE_SUCCESS_RATE} * 0.95" | bc -l) )); then
  echo "✅ Success rate recovered: ${FINAL_SUCCESS_RATE_PERCENT}% (baseline: ${BASELINE_SUCCESS_RATE}%)"

  if [ "${FINAL_CB_STATE}" = "Closed" ]; then
    echo "✅ Circuit breaker closed (service recovered)"
    exit 0
  else
    echo "⚠️  Circuit breaker still open: ${FINAL_CB_STATE}"
    exit 1
  fi
else
  echo "⚠️  Success rate not fully recovered: ${FINAL_SUCCESS_RATE_PERCENT}% (baseline: ${BASELINE_SUCCESS_RATE}%)"
  exit 1
fi

Gradual Packet Loss Increase:

# chaos-experiments/gradual-packet-loss.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: gradual-packet-loss
  namespace: chaos-testing
  labels:
    category: infrastructure
    service: network
    severity: medium
  annotations:
    chaos.atp.connectsoft.io/hypothesis: |
      Gradual packet loss increase will validate service resilience to progressive network degradation.
spec:
  action: loss
  mode: fixed-percent
  value: "100"
  selector:
    namespaces:
      - atp-ingest-ns
    labelSelectors:
      app: atp-ingest-api
  direction: both
  loss:
    loss: "25%"
    correlation: "25"
  scheduler:
    cron: "@every 2m"
  duration: "10m"

Expected Behavior

Packet Loss Phase (0-10 minutes):

  • Packet loss injection: Network packet loss increases to specified percentage
  • TCP retransmission: TCP retransmits lost packets (low packet loss)
  • Application timeouts: Application timeouts occur (high packet loss)
  • Circuit breaker activation: Circuit breaker opens if packet loss exceeds threshold
  • Latency increase: Latency increases due to retransmissions

Recovery Phase (10-15 minutes):

  • Packet loss removal: Packet loss removed
  • TCP normalization: TCP retransmissions normalize
  • Circuit breaker recovery: Circuit breaker transitions to half-open, then closed
  • Normal operation: Service returns to normal operation

Expected Metrics

| Metric                | Baseline | 5% Loss | 10% Loss | 25% Loss | Recovery Target |
|-----------------------|----------|---------|----------|----------|-----------------|
| Request Success Rate  | 99.95%   | >99%    | >95%     | >90%     | 99.95%          |
| TCP Retransmissions   | 0/sec    | >0/sec  | >0/sec   | >0/sec   | 0/sec           |
| Timeout Errors        | 0/sec    | <1/sec  | <5/sec   | <10/sec  | 0/sec           |
| Circuit Breaker State | Closed   | Closed  | HalfOpen | Open     | Closed          |
| P95 Latency           | 250ms    | <350ms  | <500ms   | <1,000ms | 250ms           |

Validation Criteria

Success Criteria:

  • ✅ TCP retransmission handles low packet loss (≤10%)
  • ✅ Application timeouts handle high packet loss (>10%)
  • ✅ Circuit breaker activates for high packet loss (≥25%)
  • ✅ Request success rate acceptable for packet loss level
  • ✅ Service recovers automatically when packet loss removed
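
Chaos Mesh implements the loss action with tc/netem on the target pods. For a quick manual reproduction outside Chaos Mesh (for example, while tuning client timeouts), the same loss profile can be injected directly, assuming the pod has the NET_ADMIN capability and the tc binary available; the interface name eth0 is an assumption:

POD=$(kubectl get pods -n atp-ingest-ns -l app=atp-ingest-api -o jsonpath='{.items[0].metadata.name}')

# Inject 10% packet loss with 25% correlation on the pod's primary interface
kubectl exec -n atp-ingest-ns "${POD}" -- tc qdisc add dev eth0 root netem loss 10% 25%

# ... generate load and observe the metrics listed above ...

# Remove the rule to restore normal networking
kubectl exec -n atp-ingest-ns "${POD}" -- tc qdisc del dev eth0 root netem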

DNS Failure

DNS failure experiments validate that ATP services handle DNS resolution failures gracefully through DNS caching, retry with backoff, and fallback to IP addresses.

Hypothesis

"When DNS resolution fails, DNS caching will maintain service availability, retry with backoff will handle transient DNS failures, fallback to IP addresses will ensure service connectivity, and services will recover automatically when DNS is restored."

Experiment Configuration

DNS Failure Simulation:

# chaos-experiments/dns-failure.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: dns-failure
  namespace: chaos-testing
  labels:
    category: infrastructure
    service: dns
    severity: high
    frequency: monthly
  annotations:
    chaos.atp.connectsoft.io/hypothesis: |
      When DNS resolution fails, DNS caching will maintain service availability, 
      retry with backoff will handle transient DNS failures, fallback to IP addresses 
      will ensure service connectivity, and services will recover automatically when DNS is restored.
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - atp-ingest-ns
    labelSelectors:
      app: atp-ingest-api
  direction: both
  target:
    mode: all
    selector:
      address: "*.dns.windows.net"
      port: 53
  duration: "10m"

DNS Failure Simulation Script:

#!/bin/bash
# scripts/execute-dns-failure-experiment.sh

SERVICE="${1:-atp-ingestion-api}"
NAMESPACE="${2:-atp-ingest-ns}"
DOWNSTREAM_SERVICE="${3:-atp-policy-api}"
DURATION="${4:-10m}"

echo "🧪 Starting DNS failure experiment"
echo "Service: ${SERVICE}"
echo "Downstream service: ${DOWNSTREAM_SERVICE}"
echo "Duration: ${DURATION}"

# Get baseline metrics
echo "📊 Collecting baseline metrics..."
BASELINE_FILE="baseline-${SERVICE}-$(date +%Y%m%d-%H%M%S).json"  # capture the file name once so the same file is read back below
./scripts/collect-baseline-metrics.sh \
  --service ${SERVICE} \
  --duration 1h \
  --output "${BASELINE_FILE}"

BASELINE_SUCCESS_RATE=$(jq -r '.metrics.success_rate_percent' "${BASELINE_FILE}")
BASELINE_DNS_QUERIES=$(jq -r '.metrics.dns_queries_per_second' "${BASELINE_FILE}")

echo "Baseline metrics:"
echo "  Success rate: ${BASELINE_SUCCESS_RATE}%"
echo "  DNS queries: ${BASELINE_DNS_QUERIES}/sec"

# Apply DNS failure (partition DNS servers)
echo "🔧 Applying DNS failure..."
kubectl apply -f - <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: dns-failure-${SERVICE}
  namespace: chaos-testing
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - ${NAMESPACE}
    labelSelectors:
      app: ${SERVICE}
  direction: both
  target:
    mode: all
    selector:
      address: "*.dns.windows.net"
      port: 53
  duration: "${DURATION}"
EOF

DNS_FAILURE_START=$(date +%s)

# Monitor service behavior
echo "👀 Monitoring service behavior during DNS failure..."
MAX_WAIT=600  # 10 minutes
ELAPSED=0

while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
  # Get DNS query failures
  DNS_FAILURES=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(dns_query_failures\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')

  # Get DNS cache hits
  DNS_CACHE_HITS=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(dns_cache_hits\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')

  # Get DNS retry attempts
  DNS_RETRIES=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(dns_retry_attempts\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')

  # Get IP fallback usage
  IP_FALLBACK=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(dns_ip_fallback_usage\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')

  # Get request success rate
  SUCCESS_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SERVICE}\",status!~\"5..\"\}[1m]\)/rate\(http_requests_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
  SUCCESS_RATE_PERCENT=$(echo "${SUCCESS_RATE} * 100" | bc)

  echo "Metrics at ${ELAPSED}s:"
  echo "  DNS failures: ${DNS_FAILURES}/sec"
  echo "  DNS cache hits: ${DNS_CACHE_HITS}/sec"
  echo "  DNS retries: ${DNS_RETRIES}/sec"
  echo "  IP fallback usage: ${IP_FALLBACK}/sec"
  echo "  Success rate: ${SUCCESS_RATE_PERCENT}%"

  # Validate DNS caching
  if (( $(echo "${DNS_CACHE_HITS} > 0" | bc -l) )); then
    echo "✅ DNS caching active: ${DNS_CACHE_HITS}/sec cache hits"
  fi

  # Validate DNS retry
  if (( $(echo "${DNS_RETRIES} > 0" | bc -l) )); then
    echo "✅ DNS retry with backoff active: ${DNS_RETRIES}/sec retries"
  fi

  # Validate IP fallback
  if (( $(echo "${IP_FALLBACK} > 0" | bc -l) )); then
    echo "✅ IP fallback active: ${IP_FALLBACK}/sec fallback usage"
  fi

  # Validate service availability
  if (( $(echo "${SUCCESS_RATE_PERCENT} >= 95" | bc -l) )); then
    echo "✅ Service availability maintained: ${SUCCESS_RATE_PERCENT}%"
  fi

  sleep 30
  ELAPSED=$((ELAPSED + 30))
done

# Remove DNS failure
echo "🔧 Removing DNS failure..."
kubectl delete networkchaos dns-failure-${SERVICE} -n chaos-testing

RECOVERY_START=$(date +%s)

# Wait for recovery
echo "⏳ Waiting for DNS to recover..."
sleep 120

# Verify recovery
FINAL_DNS_FAILURES=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(dns_query_failures\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
FINAL_SUCCESS_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SERVICE}\",status!~\"5..\"\}[1m]\)/rate\(http_requests_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
FINAL_SUCCESS_RATE_PERCENT=$(echo "${FINAL_SUCCESS_RATE} * 100" | bc)

if (( $(echo "${FINAL_DNS_FAILURES} == 0" | bc -l) )); then
  echo "✅ DNS failures resolved"

  if (( $(echo "${FINAL_SUCCESS_RATE_PERCENT} >= ${BASELINE_SUCCESS_RATE} * 0.95" | bc -l) )); then
    echo "✅ Success rate recovered: ${FINAL_SUCCESS_RATE_PERCENT}% (baseline: ${BASELINE_SUCCESS_RATE}%)"
    exit 0
  else
    echo "⚠️  Success rate not fully recovered: ${FINAL_SUCCESS_RATE_PERCENT}%"
    exit 1
  fi
else
  echo "⚠️  DNS failures still occurring: ${FINAL_DNS_FAILURES}/sec"
  exit 1
fi

Expected Behavior

DNS Failure Phase (0-10 minutes):

  • DNS unavailability: DNS servers become unreachable
  • DNS cache usage: Services use DNS cache for resolved addresses
  • DNS retry: DNS retry with backoff attempts
  • IP fallback: Services fallback to IP addresses when DNS fails
  • Service continuity: Service continues operating with cached/fallback addresses

Recovery Phase (10-15 minutes):

  • DNS restoration: DNS servers become available
  • DNS cache refresh: DNS cache refreshed with new resolutions
  • Normal operation: Service returns to normal operation

Expected Metrics

| Metric               | Baseline | During DNS Failure | Expected Range | Recovery Target |
|----------------------|----------|--------------------|----------------|-----------------|
| DNS Query Failures   | 0/sec    | >0/sec             | >0/sec         | 0/sec           |
| DNS Cache Hits       | 50/sec   | 100/sec            | 100/sec        | 50/sec          |
| DNS Retries          | 0/sec    | >0/sec             | >0/sec         | 0/sec           |
| IP Fallback Usage    | 0/sec    | >0/sec             | >0/sec         | 0/sec           |
| Request Success Rate | 99.95%   | >95%               | >95%           | 99.95%          |

Validation Criteria

Success Criteria:

  • ✅ DNS caching maintains service availability
  • ✅ DNS retry with backoff handles transient failures
  • ✅ IP fallback ensures service connectivity
  • ✅ Request success rate >95%
  • ✅ Service recovers automatically when DNS restored

DNS Configuration

DNS Client Configuration:

# kubernetes/configmaps/dns-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: dns-config
  namespace: atp-ingest-ns
data:
  DnsClientConfig.json: |
    {
      "Cache": {
        "Enabled": true,
        "TTL": 300,
        "MaxCacheSize": 1000
      },
      "RetryPolicy": {
        "MaxRetries": 3,
        "RetryDelay": 1000,
        "ExponentialBackoff": true,
        "MaxBackoff": 10000
      },
      "Fallback": {
        "UseIPFallback": true,
        "IPAddresses": {
          "atp-policy-api": "10.0.1.100",
          "atp-query-api": "10.0.1.101"
        }
      },
      "HealthCheck": {
        "Enabled": true,
        "Interval": 30000,
        "Timeout": 5000
      }
    }
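
A simple in-pod spot check ties this configuration to observable behavior during the experiment: name resolution should fail (or be served from cache) while the fallback IP from the ConfigMap keeps the dependency reachable. The health path and the assumption that the image ships getent and curl are illustrative:

POD=$(kubectl get pods -n atp-ingest-ns -l app=atp-ingest-api -o jsonpath='{.items[0].metadata.name}')

# DNS lookups should fail (or be answered from the local cache) while DNS is unreachable
kubectl exec -n atp-ingest-ns "${POD}" -- getent hosts atp-policy-api \
  || echo "DNS resolution failing (expected during the experiment)"

# The IP fallback for atp-policy-api from DnsClientConfig.json should still reach the dependency
kubectl exec -n atp-ingest-ns "${POD}" -- curl -sf -m 2 http://10.0.1.100/healthz \
  && echo "IP fallback path reachable"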

Bandwidth Limitation

Bandwidth limitation experiments validate that ATP services handle bandwidth constraints gracefully through large payload handling, streaming vs batching strategies, and compression usage.

Hypothesis

"When network bandwidth is limited to 1Mbps, services will handle large payloads through streaming or batching, compression will reduce bandwidth usage, services will adapt to bandwidth constraints, and services will recover automatically when bandwidth is restored."

Experiment Configuration

Bandwidth Throttling:

# chaos-experiments/bandwidth-limitation.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: bandwidth-limitation
  namespace: chaos-testing
  labels:
    category: infrastructure
    service: network
    severity: medium
    frequency: monthly
  annotations:
    chaos.atp.connectsoft.io/hypothesis: |
      When network bandwidth is limited, services will handle large payloads through streaming or batching, 
      compression will reduce bandwidth usage, services will adapt to bandwidth constraints, 
      and services will recover automatically when bandwidth is restored.
spec:
  action: bandwidth
  mode: all
  selector:
    namespaces:
      - atp-ingest-ns
    labelSelectors:
      app: atp-ingest-api
  direction: both
  bandwidth:
    rate: "1Mbps"
    limit: 1048576
    buffer: 10485760
  duration: "10m"

Bandwidth Limitation Simulation Script:

#!/bin/bash
# scripts/execute-bandwidth-limitation-experiment.sh

SERVICE="${1:-atp-ingestion-api}"
NAMESPACE="${2:-atp-ingest-ns}"
BANDWIDTH="${3:-1Mbps}"  # Bandwidth limit
DURATION="${4:-10m}"

echo "🧪 Starting bandwidth limitation experiment"
echo "Service: ${SERVICE}"
echo "Bandwidth limit: ${BANDWIDTH}"
echo "Duration: ${DURATION}"

# Get baseline metrics
echo "📊 Collecting baseline metrics..."
BASELINE_FILE="baseline-${SERVICE}-$(date +%Y%m%d-%H%M%S).json"  # capture the file name once so the same file is read back below
./scripts/collect-baseline-metrics.sh \
  --service ${SERVICE} \
  --duration 1h \
  --output "${BASELINE_FILE}"

BASELINE_THROUGHPUT=$(jq -r '.metrics.throughput_mbps' "${BASELINE_FILE}")
BASELINE_LATENCY=$(jq -r '.metrics.p95_latency_ms' "${BASELINE_FILE}")

echo "Baseline metrics:"
echo "  Throughput: ${BASELINE_THROUGHPUT} Mbps"
echo "  P95 latency: ${BASELINE_LATENCY}ms"

# Apply bandwidth limitation
echo "🔧 Applying bandwidth limitation..."
kubectl apply -f - <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: bandwidth-limitation-${SERVICE}
  namespace: chaos-testing
spec:
  action: bandwidth
  mode: all
  selector:
    namespaces:
      - ${NAMESPACE}
    labelSelectors:
      app: ${SERVICE}
  direction: both
  bandwidth:
    rate: "${BANDWIDTH}"
    limit: 1048576
    buffer: 10485760
  duration: "${DURATION}"
EOF

BANDWIDTH_LIMIT_START=$(date +%s)

# Monitor service behavior
echo "👀 Monitoring service behavior during bandwidth limitation..."
MAX_WAIT=600  # 10 minutes
ELAPSED=0

while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
  # Get network throughput
  THROUGHPUT=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(network_bytes_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
  THROUGHPUT_MBPS=$(echo "scale=2; ${THROUGHPUT} * 8 / 1024 / 1024" | bc)

  # Get compression ratio
  COMPRESSION_RATIO=$(curl -s http://prometheus:9090/api/v1/query?query=compression_ratio\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')

  # Get streaming vs batching usage
  STREAMING_USAGE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(streaming_operations\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
  BATCHING_USAGE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(batching_operations\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')

  # Get large payload handling
  LARGE_PAYLOAD_REQUESTS=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SERVICE}\",payload_size=\"large\"\}[1m]\) | jq -r '.data.result[0].value[1]')

  # Get latency
  P95_LATENCY=$(curl -s http://prometheus:9090/api/v1/query?query=histogram_quantile\(0.95,rate\(http_request_duration_seconds_bucket\{service=\"${SERVICE}\"\}[1m]\)\) | jq -r '.data.result[0].value[1]')
  P95_LATENCY_MS=$(echo "${P95_LATENCY} * 1000" | bc)

  # Get request success rate
  SUCCESS_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SERVICE}\",status!~\"5..\"\}[1m]\)/rate\(http_requests_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
  SUCCESS_RATE_PERCENT=$(echo "${SUCCESS_RATE} * 100" | bc)

  echo "Metrics at ${ELAPSED}s:"
  echo "  Throughput: ${THROUGHPUT_MBPS} Mbps (limit: ${BANDWIDTH})"
  echo "  Compression ratio: ${COMPRESSION_RATIO}"
  echo "  Streaming operations: ${STREAMING_USAGE}/sec"
  echo "  Batching operations: ${BATCHING_USAGE}/sec"
  echo "  Large payload requests: ${LARGE_PAYLOAD_REQUESTS}/sec"
  echo "  P95 latency: ${P95_LATENCY_MS}ms (baseline: ${BASELINE_LATENCY}ms)"
  echo "  Success rate: ${SUCCESS_RATE_PERCENT}%"

  # Validate compression usage
  if (( $(echo "${COMPRESSION_RATIO} > 1.5" | bc -l) )); then
    echo "✅ Compression reducing bandwidth usage: ${COMPRESSION_RATIO}x"
  fi

  # Validate streaming vs batching
  if (( $(echo "${STREAMING_USAGE} > 0" | bc -l) )); then
    echo "✅ Streaming strategy used for large payloads: ${STREAMING_USAGE}/sec"
  fi

  if (( $(echo "${BATCHING_USAGE} > 0" | bc -l) )); then
    echo "✅ Batching strategy used for large payloads: ${BATCHING_USAGE}/sec"
  fi

  # Validate bandwidth adaptation
  if (( $(echo "${THROUGHPUT_MBPS} <= 1.1" | bc -l) )); then
    echo "✅ Throughput within bandwidth limit: ${THROUGHPUT_MBPS} Mbps"
  fi

  sleep 30
  ELAPSED=$((ELAPSED + 30))
done

# Remove bandwidth limitation
echo "🔧 Removing bandwidth limitation..."
kubectl delete networkchaos bandwidth-limitation-${SERVICE} -n chaos-testing

RECOVERY_START=$(date +%s)

# Wait for recovery
echo "⏳ Waiting for bandwidth to recover..."
sleep 120

# Verify recovery
FINAL_THROUGHPUT=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(network_bytes_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
FINAL_THROUGHPUT_MBPS=$(echo "scale=2; ${FINAL_THROUGHPUT} * 8 / 1024 / 1024" | bc)
FINAL_LATENCY=$(curl -s http://prometheus:9090/api/v1/query?query=histogram_quantile\(0.95,rate\(http_request_duration_seconds_bucket\{service=\"${SERVICE}\"\}[1m]\)\) | jq -r '.data.result[0].value[1]')
FINAL_LATENCY_MS=$(echo "${FINAL_LATENCY} * 1000" | bc)

if (( $(echo "${FINAL_THROUGHPUT_MBPS} >= ${BASELINE_THROUGHPUT} * 0.9" | bc -l) )); then
  echo "✅ Throughput recovered: ${FINAL_THROUGHPUT_MBPS} Mbps (baseline: ${BASELINE_THROUGHPUT} Mbps)"

  if (( $(echo "${FINAL_LATENCY_MS} <= ${BASELINE_LATENCY} * 1.1" | bc -l) )); then
    echo "✅ Latency recovered: ${FINAL_LATENCY_MS}ms (baseline: ${BASELINE_LATENCY}ms)"
    exit 0
  else
    echo "⚠️  Latency not fully recovered: ${FINAL_LATENCY_MS}ms"
    exit 1
  fi
else
  echo "⚠️  Throughput not fully recovered: ${FINAL_THROUGHPUT_MBPS} Mbps"
  exit 1
fi

Expected Behavior

Bandwidth Limitation Phase (0-10 minutes):

  • Bandwidth throttling: Network bandwidth limited to 1Mbps
  • Compression activation: Compression reduces payload size
  • Streaming/batching: Large payloads handled via streaming or batching
  • Latency increase: Latency increases due to bandwidth constraints
  • Service adaptation: Service adapts to bandwidth constraints

Recovery Phase (10-15 minutes):

  • Bandwidth restoration: Bandwidth limitation removed
  • Throughput normalization: Network throughput returns to normal
  • Latency normalization: Latency returns to baseline
  • Normal operation: Service returns to normal operation

Expected Metrics

| Metric               | Baseline | During Limitation | Expected Range | Recovery Target |
|----------------------|----------|-------------------|----------------|-----------------|
| Network Throughput   | 100 Mbps | ≤1 Mbps           | ≤1 Mbps        | 100 Mbps        |
| Compression Ratio    | 1.0x     | >1.5x             | >1.5x          | 1.0x            |
| Streaming Usage      | 0/sec    | >0/sec            | >0/sec         | 0/sec           |
| Batching Usage       | 0/sec    | >0/sec            | >0/sec         | 0/sec           |
| P95 Latency          | 250ms    | <2,000ms          | <2,000ms       | 250ms           |
| Request Success Rate | 99.95%   | >95%              | >95%           | 99.95%          |

Validation Criteria

Success Criteria:

  • ✅ Compression reduces bandwidth usage
  • ✅ Streaming or batching handles large payloads
  • ✅ Throughput within bandwidth limit
  • ✅ Request success rate >95%
  • ✅ Service recovers automatically when bandwidth restored

Bandwidth Adaptation Configuration

Network Bandwidth Configuration:

# kubernetes/configmaps/network-bandwidth-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: network-bandwidth-config
  namespace: atp-ingest-ns
data:
  NetworkBandwidthConfig.json: |
    {
      "Compression": {
        "Enabled": true,
        "Algorithm": "gzip",
        "MinSize": 1024,
        "CompressionLevel": 6
      },
      "LargePayloadHandling": {
        "StreamingThreshold": 1048576,
        "BatchingThreshold": 524288,
        "ChunkSize": 65536,
        "Strategy": "adaptive"
      },
      "BandwidthMonitoring": {
        "Enabled": true,
        "Threshold": 1048576,
        "AdaptiveCompression": true
      }
    }
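
To sanity-check that compression is actually negotiated and worthwhile while the bandwidth limit is in place, compare the transferred size of a representative response with and without Accept-Encoding. The endpoint URL is a placeholder for any large ATP export or query response reachable from where the check runs:

URL="http://atp-ingest-api.atp-ingest-ns.svc.cluster.local/api/v1/events/export"  # placeholder endpoint

# Transferred bytes without compression vs. with gzip negotiated (sizes are on-the-wire, not decompressed)
RAW_BYTES=$(curl -s -o /dev/null -w '%{size_download}' "${URL}")
GZIP_BYTES=$(curl -s -H 'Accept-Encoding: gzip' -o /dev/null -w '%{size_download}' "${URL}")

echo "Uncompressed transfer: ${RAW_BYTES} bytes, compressed transfer: ${GZIP_BYTES} bytes"
echo "Observed compression ratio: $(echo "scale=2; ${RAW_BYTES} / ${GZIP_BYTES}" | bc)x"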

Network Chaos Visualization

graph TD
    NETWORK[Network Layer] --> PARTITION[Network Partition]
    NETWORK --> PACKETLOSS[Packet Loss]
    NETWORK --> DNS[DNS Failure]
    NETWORK --> BANDWIDTH[Bandwidth Limitation]

    PARTITION --> DETECT[Partition Detection]
    DETECT --> ISOLATE[Service Isolation]
    ISOLATE --> CONTINUE1[Continue Operating]

    PACKETLOSS --> TCPRETRY[TCP Retransmission]
    PACKETLOSS --> APPTIMEOUT[Application Timeout]
    PACKETLOSS --> CB[Circuit Breaker]
    TCPRETRY --> CONTINUE2[Continue Operating]
    APPTIMEOUT --> CONTINUE2
    CB --> CONTINUE2

    DNS --> CACHE[DNS Cache]
    DNS --> RETRY[DNS Retry]
    DNS --> IPFALLBACK[IP Fallback]
    CACHE --> CONTINUE3[Continue Operating]
    RETRY --> CONTINUE3
    IPFALLBACK --> CONTINUE3

    BANDWIDTH --> COMPRESS[Compression]
    BANDWIDTH --> STREAM[Streaming]
    BANDWIDTH --> BATCH[Batching]
    COMPRESS --> CONTINUE4[Continue Operating]
    STREAM --> CONTINUE4
    BATCH --> CONTINUE4

    style NETWORK fill:#FFE5B4
    style PARTITION fill:#FFB6C1
    style PACKETLOSS fill:#FFB6C1
    style DNS fill:#FFB6C1
    style BANDWIDTH fill:#FFB6C1
    style CONTINUE1 fill:#90EE90
    style CONTINUE2 fill:#90EE90
    style CONTINUE3 fill:#90EE90
    style CONTINUE4 fill:#90EE90
Hold "Alt" / "Option" to enable pan & zoom

Summary: Network Chaos

  • Network Partition: Validates partition detection, service isolation handling, and automatic recovery during network partitions between namespaces; expects partition detected, services continue operating within partition, local availability >90%, and automatic recovery
  • Packet Loss Simulation: Validates TCP retransmission, application timeouts, and circuit breaker activation during packet loss (5%, 10%, 25%); expects TCP retransmission handles low packet loss, application timeouts handle high packet loss, circuit breaker activates for high packet loss, and automatic recovery
  • DNS Failure: Validates DNS caching, retry with backoff, and IP fallback during DNS resolution failures; expects DNS caching maintains availability, DNS retry handles transient failures, IP fallback ensures connectivity, and automatic recovery
  • Bandwidth Limitation: Validates compression usage, streaming vs batching strategies, and large payload handling during bandwidth constraints (1Mbps); expects compression reduces bandwidth usage, streaming/batching handles large payloads, throughput within limit, and automatic recovery
  • Monitoring and Validation: Comprehensive scripts for monitoring network partitions, packet loss, DNS failures, bandwidth limitations, TCP retransmissions, DNS cache hits, compression ratios, and recovery behavior

Security Chaos

Purpose: Define comprehensive chaos experiments for security infrastructure failures in ATP, validating authentication resilience, authorization fallback, certificate management, and secret management to ensure ATP services remain available and secure during security infrastructure failures.


Authentication Failure

Authentication failure experiments validate that ATP services handle Azure AD unavailability gracefully through token caching, graceful degradation, and deny-by-default behavior.

Hypothesis

"When Azure AD becomes unavailable, services will use cached authentication tokens, gracefully degrade to read-only mode if necessary, enforce deny-by-default behavior for unauthenticated requests, and recover automatically when Azure AD is restored."

Experiment Configuration

Azure AD Unavailability Simulation:

# chaos-experiments/azure-ad-unavailability.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: azure-ad-unavailability
  namespace: chaos-testing
  labels:
    category: security
    service: azure-ad
    severity: high
    frequency: monthly
  annotations:
    chaos.atp.connectsoft.io/hypothesis: |
      When Azure AD becomes unavailable, services will use cached authentication tokens, 
      gracefully degrade to read-only mode if necessary, enforce deny-by-default behavior 
      for unauthenticated requests, and recover automatically when Azure AD is restored.
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - atp-ingest-ns
    labelSelectors:
      app: atp-ingest-api
  direction: to
  externalTargets:
    - "login.microsoftonline.com"
    - "microsoftonline.com"
  duration: "10m"

Authentication Failure Simulation Script:

#!/bin/bash
# scripts/execute-authentication-failure-experiment.sh

SERVICE="${1:-atp-ingestion-api}"
NAMESPACE="${2:-atp-ingest-ns}"
DURATION="${3:-10m}"

echo "🧪 Starting authentication failure experiment"
echo "Service: ${SERVICE}"
echo "Duration: ${DURATION}"

# Get baseline metrics
echo "📊 Collecting baseline metrics..."
BASELINE_FILE="baseline-${SERVICE}-$(date +%Y%m%d-%H%M%S).json"  # capture the file name once so the same file is read back below
./scripts/collect-baseline-metrics.sh \
  --service ${SERVICE} \
  --duration 1h \
  --output "${BASELINE_FILE}"

BASELINE_AUTH_SUCCESS_RATE=$(jq -r '.metrics.auth_success_rate_percent' "${BASELINE_FILE}")
BASELINE_TOKEN_CACHE_HITS=$(jq -r '.metrics.token_cache_hits_per_second' "${BASELINE_FILE}")

echo "Baseline metrics:"
echo "  Auth success rate: ${BASELINE_AUTH_SUCCESS_RATE}%"
echo "  Token cache hits: ${BASELINE_TOKEN_CACHE_HITS}/sec"

# Apply network partition to Azure AD
echo "🔧 Applying network partition to Azure AD..."
kubectl apply -f - <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: azure-ad-unavailability-${SERVICE}
  namespace: chaos-testing
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - ${NAMESPACE}
    labelSelectors:
      app: ${SERVICE}
  direction: to
  externalTargets:
    - "login.microsoftonline.com"
    - "microsoftonline.com"
  duration: "${DURATION}"
EOF

AUTH_FAILURE_START=$(date +%s)

# Monitor service behavior
echo "👀 Monitoring service behavior during authentication failure..."
MAX_WAIT=600  # 10 minutes
ELAPSED=0

while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
  # Get authentication success rate
  AUTH_SUCCESS_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(auth_requests_total\{service=\"${SERVICE}\",status=\"success\"\}[1m]\)/rate\(auth_requests_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
  AUTH_SUCCESS_RATE_PERCENT=$(echo "${AUTH_SUCCESS_RATE} * 100" | bc)

  # Get token cache hits
  TOKEN_CACHE_HITS=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(token_cache_hits\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')

  # Get authentication failures
  AUTH_FAILURES=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(auth_requests_total\{service=\"${SERVICE}\",status=\"failure\"\}[1m]\) | jq -r '.data.result[0].value[1]')

  # Get denied requests (unauthenticated)
  DENIED_REQUESTS=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SERVICE}\",status=\"401\"\}[1m]\) | jq -r '.data.result[0].value[1]')

  # Get service mode (normal vs degraded)
  SERVICE_MODE=$(curl -s http://prometheus:9090/api/v1/query?query=service_mode\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')

  # Get request success rate
  SUCCESS_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SERVICE}\",status!~\"5..\"\}[1m]\)/rate\(http_requests_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
  SUCCESS_RATE_PERCENT=$(echo "${SUCCESS_RATE} * 100" | bc)

  echo "Metrics at ${ELAPSED}s:"
  echo "  Auth success rate: ${AUTH_SUCCESS_RATE_PERCENT}%"
  echo "  Token cache hits: ${TOKEN_CACHE_HITS}/sec"
  echo "  Auth failures: ${AUTH_FAILURES}/sec"
  echo "  Denied requests (401): ${DENIED_REQUESTS}/sec"
  echo "  Service mode: ${SERVICE_MODE}"
  echo "  Request success rate: ${SUCCESS_RATE_PERCENT}%"

  # Validate token cache usage
  if (( $(echo "${TOKEN_CACHE_HITS} > 0" | bc -l) )); then
    echo "✅ Token cache active: ${TOKEN_CACHE_HITS}/sec cache hits"
  fi

  # Validate deny-by-default behavior
  if (( $(echo "${DENIED_REQUESTS} > 0" | bc -l) )); then
    echo "✅ Deny-by-default enforced: ${DENIED_REQUESTS}/sec requests denied"
  fi

  # Validate graceful degradation
  if [ "${SERVICE_MODE}" = "degraded" ] || [ "${SERVICE_MODE}" = "readonly" ]; then
    echo "✅ Service operating in degraded mode: ${SERVICE_MODE}"
  fi

  sleep 30
  ELAPSED=$((ELAPSED + 30))
done

# Remove network partition
echo "🔧 Removing network partition..."
kubectl delete networkchaos azure-ad-unavailability-${SERVICE} -n chaos-testing

RECOVERY_START=$(date +%s)

# Wait for recovery
echo "⏳ Waiting for Azure AD to recover..."
sleep 120

# Verify recovery
FINAL_AUTH_SUCCESS_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(auth_requests_total\{service=\"${SERVICE}\",status=\"success\"\}[1m]\)/rate\(auth_requests_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
FINAL_AUTH_SUCCESS_RATE_PERCENT=$(echo "${FINAL_AUTH_SUCCESS_RATE} * 100" | bc)
FINAL_SERVICE_MODE=$(curl -s http://prometheus:9090/api/v1/query?query=service_mode\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')

if (( $(echo "${FINAL_AUTH_SUCCESS_RATE_PERCENT} >= ${BASELINE_AUTH_SUCCESS_RATE} * 0.95" | bc -l) )); then
  echo "✅ Auth success rate recovered: ${FINAL_AUTH_SUCCESS_RATE_PERCENT}% (baseline: ${BASELINE_AUTH_SUCCESS_RATE}%)"

  if [ "${FINAL_SERVICE_MODE}" = "normal" ]; then
    echo "✅ Service mode normalized: ${FINAL_SERVICE_MODE}"
    exit 0
  else
    echo "⚠️  Service mode not normalized: ${FINAL_SERVICE_MODE}"
    exit 1
  fi
else
  echo "⚠️  Auth success rate not fully recovered: ${FINAL_AUTH_SUCCESS_RATE_PERCENT}%"
  exit 1
fi

Expected Behavior

Authentication Failure Phase (0-10 minutes):

  • Azure AD unavailability: Azure AD authentication endpoints unreachable
  • Token cache usage: Services use cached authentication tokens
  • Deny-by-default: Unauthenticated requests denied (401)
  • Graceful degradation: Service may operate in read-only mode
  • Service continuity: Authenticated requests continue using cached tokens

Recovery Phase (10-15 minutes):

  • Azure AD restoration: Azure AD endpoints restored
  • Token refresh: New tokens obtained from Azure AD
  • Token cache refresh: Token cache refreshed
  • Normal operation: Service returns to normal operation

Expected Metrics

| Metric                | Baseline | During Failure    | Expected Range    | Recovery Target |
|-----------------------|----------|-------------------|-------------------|-----------------|
| Auth Success Rate     | 99.95%   | >80%              | >80%              | 99.95%          |
| Token Cache Hits      | 50/sec   | 100/sec           | 100/sec           | 50/sec          |
| Auth Failures         | 0.05%    | <20%              | <20%              | 0.05%           |
| Denied Requests (401) | 0/sec    | >0/sec            | >0/sec            | 0/sec           |
| Service Mode          | Normal   | Degraded/ReadOnly | Degraded/ReadOnly | Normal          |

Validation Criteria

Success Criteria:

  • ✅ Token cache maintains authentication for existing sessions
  • ✅ Deny-by-default enforced for unauthenticated requests
  • ✅ Graceful degradation to read-only mode if necessary
  • ✅ Auth success rate >80% (using cached tokens)
  • ✅ Service recovers automatically when Azure AD restored

Authentication Configuration

Token Cache Configuration:

# kubernetes/configmaps/authentication-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: authentication-config
  namespace: atp-ingest-ns
data:
  AuthenticationConfig.json: |
    {
      "TokenCache": {
        "Enabled": true,
        "TTL": 3600,
        "MaxCacheSize": 10000,
        "RefreshThreshold": 300
      },
      "AzureAD": {
        "Authority": "https://login.microsoftonline.com/{tenant-id}",
        "ClientId": "{client-id}",
        "RetryPolicy": {
          "MaxRetries": 3,
          "RetryDelay": 1000,
          "ExponentialBackoff": true
        }
      },
      "FallbackBehavior": {
        "DenyByDefault": true,
        "DegradedMode": "readonly",
        "AllowCachedTokens": true
      }
    }
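
The cached-token and deny-by-default expectations can also be checked directly against the API while Azure AD is partitioned. The hostname, path, and the presence of a previously acquired token in cached-token.jwt are assumptions for illustration only:

API="https://atp-ingest-api.example.com/api/v1/events"  # placeholder endpoint

# A token issued before the experiment should still be accepted while cached validation material is valid
curl -s -o /dev/null -w 'with cached token: HTTP %{http_code}\n' \
  -H "Authorization: Bearer $(cat cached-token.jwt)" "${API}"

# An unauthenticated request must be rejected (deny-by-default, expect 401)
curl -s -o /dev/null -w 'without token:     HTTP %{http_code}\n' "${API}"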

Authorization Denial

Authorization denial experiments validate that ATP services handle OPA (Open Policy Agent) unavailability gracefully through cached policies, safe-fail behavior, and deny-when-uncertain enforcement.

Hypothesis

"When OPA policy engine becomes unavailable, services will use cached authorization policies, enforce safe-fail behavior (deny when uncertain), maintain service availability for authorized requests, and recover automatically when OPA is restored."

Experiment Configuration

OPA Unavailability Simulation:

# chaos-experiments/opa-unavailability.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: opa-unavailability
  namespace: chaos-testing
  labels:
    category: security
    service: opa
    severity: high
    frequency: monthly
  annotations:
    chaos.atp.connectsoft.io/hypothesis: |
      When OPA policy engine becomes unavailable, services will use cached authorization policies, 
      enforce safe-fail behavior (deny when uncertain), maintain service availability for authorized requests, 
      and recover automatically when OPA is restored.
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - atp-ingest-ns
    labelSelectors:
      app: atp-ingest-api
  direction: both
  target:
    mode: all
    selector:
      namespaces:
        - atp-policy-ns
      labelSelectors:
        app: opa
  duration: "10m"

Authorization Denial Simulation Script:

#!/bin/bash
# scripts/execute-authorization-denial-experiment.sh

SERVICE="${1:-atp-ingestion-api}"
NAMESPACE="${2:-atp-ingest-ns}"
OPA_NAMESPACE="${3:-atp-policy-ns}"
DURATION="${4:-10m}"

echo "🧪 Starting authorization denial experiment"
echo "Service: ${SERVICE}"
echo "OPA namespace: ${OPA_NAMESPACE}"
echo "Duration: ${DURATION}"

# Get baseline metrics
echo "📊 Collecting baseline metrics..."
BASELINE_FILE="baseline-${SERVICE}-$(date +%Y%m%d-%H%M%S).json"  # capture the file name once so the same file is read back below
./scripts/collect-baseline-metrics.sh \
  --service ${SERVICE} \
  --duration 1h \
  --output "${BASELINE_FILE}"

BASELINE_AUTHZ_SUCCESS_RATE=$(jq -r '.metrics.authz_success_rate_percent' "${BASELINE_FILE}")
BASELINE_POLICY_CACHE_HITS=$(jq -r '.metrics.policy_cache_hits_per_second' "${BASELINE_FILE}")

echo "Baseline metrics:"
echo "  Authz success rate: ${BASELINE_AUTHZ_SUCCESS_RATE}%"
echo "  Policy cache hits: ${BASELINE_POLICY_CACHE_HITS}/sec"

# Apply network partition to OPA
echo "🔧 Applying network partition to OPA..."
kubectl apply -f - <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: opa-unavailability-${SERVICE}
  namespace: chaos-testing
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - ${NAMESPACE}
    labelSelectors:
      app: ${SERVICE}
  direction: both
  target:
    mode: all
    selector:
      namespaces:
        - ${OPA_NAMESPACE}
      labelSelectors:
        app: opa
  duration: "${DURATION}"
EOF

AUTHZ_FAILURE_START=$(date +%s)

# Monitor service behavior
echo "👀 Monitoring service behavior during authorization denial..."
MAX_WAIT=600  # 10 minutes
ELAPSED=0

while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
  # Get authorization success rate
  AUTHZ_SUCCESS_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(authz_requests_total\{service=\"${SERVICE}\",status=\"allow\"\}[1m]\)/rate\(authz_requests_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
  AUTHZ_SUCCESS_RATE_PERCENT=$(echo "${AUTHZ_SUCCESS_RATE} * 100" | bc)

  # Get policy cache hits
  POLICY_CACHE_HITS=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(policy_cache_hits\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')

  # Get authorization denials
  AUTHZ_DENIALS=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(authz_requests_total\{service=\"${SERVICE}\",status=\"deny\"\}[1m]\) | jq -r '.data.result[0].value[1]')

  # Get denied requests (403)
  DENIED_REQUESTS=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SERVICE}\",status=\"403\"\}[1m]\) | jq -r '.data.result[0].value[1]')

  # Get safe-fail behavior
  SAFE_FAIL_COUNT=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(authz_safe_fail\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')

  # Get request success rate
  SUCCESS_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SERVICE}\",status!~\"5..\"\}[1m]\)/rate\(http_requests_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
  SUCCESS_RATE_PERCENT=$(echo "${SUCCESS_RATE} * 100" | bc)

  echo "Metrics at ${ELAPSED}s:"
  echo "  Authz success rate: ${AUTHZ_SUCCESS_RATE_PERCENT}%"
  echo "  Policy cache hits: ${POLICY_CACHE_HITS}/sec"
  echo "  Authz denials: ${AUTHZ_DENIALS}/sec"
  echo "  Denied requests (403): ${DENIED_REQUESTS}/sec"
  echo "  Safe-fail count: ${SAFE_FAIL_COUNT}/sec"
  echo "  Request success rate: ${SUCCESS_RATE_PERCENT}%"

  # Validate policy cache usage
  if (( $(echo "${POLICY_CACHE_HITS} > 0" | bc -l) )); then
    echo "✅ Policy cache active: ${POLICY_CACHE_HITS}/sec cache hits"
  fi

  # Validate safe-fail behavior
  if (( $(echo "${SAFE_FAIL_COUNT} > 0" | bc -l) )); then
    echo "✅ Safe-fail behavior enforced: ${SAFE_FAIL_COUNT}/sec safe-fail denials"
  fi

  # Validate deny-when-uncertain
  if (( $(echo "${AUTHZ_DENIALS} > 0" | bc -l) )); then
    echo "✅ Authorization denials when uncertain: ${AUTHZ_DENIALS}/sec"
  fi

  sleep 30
  ELAPSED=$((ELAPSED + 30))
done

# Remove network partition
echo "🔧 Removing network partition..."
kubectl delete networkchaos opa-unavailability-${SERVICE} -n chaos-testing

RECOVERY_START=$(date +%s)

# Wait for recovery
echo "⏳ Waiting for OPA to recover..."
sleep 120

# Verify recovery
FINAL_AUTHZ_SUCCESS_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(authz_requests_total\{service=\"${SERVICE}\",status=\"allow\"\}[1m]\)/rate\(authz_requests_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
FINAL_AUTHZ_SUCCESS_RATE_PERCENT=$(echo "${FINAL_AUTHZ_SUCCESS_RATE} * 100" | bc)
FINAL_SAFE_FAIL=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(authz_safe_fail\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')

if (( $(echo "${FINAL_AUTHZ_SUCCESS_RATE_PERCENT} >= ${BASELINE_AUTHZ_SUCCESS_RATE} * 0.95" | bc -l) )); then
  echo "✅ Authz success rate recovered: ${FINAL_AUTHZ_SUCCESS_RATE_PERCENT}% (baseline: ${BASELINE_AUTHZ_SUCCESS_RATE}%)"

  if (( $(echo "${FINAL_SAFE_FAIL} == 0" | bc -l) )); then
    echo "✅ Safe-fail behavior normalized (OPA recovered)"
    exit 0
  else
    echo "⚠️  Safe-fail still active: ${FINAL_SAFE_FAIL}/sec"
    exit 1
  fi
else
  echo "⚠️  Authz success rate not fully recovered: ${FINAL_AUTHZ_SUCCESS_RATE_PERCENT}%"
  exit 1
fi

Expected Behavior

Authorization Failure Phase (0-10 minutes):

  • OPA unavailability: OPA policy engine unreachable
  • Policy cache usage: Services use cached authorization policies
  • Safe-fail enforcement: Requests denied when authorization uncertain
  • Service continuity: Authorized requests continue using cached policies

Recovery Phase (10-15 minutes):

  • OPA restoration: OPA policy engine restored
  • Policy refresh: New policies obtained from OPA
  • Policy cache refresh: Policy cache refreshed
  • Normal operation: Service returns to normal operation

Expected Metrics

| Metric                | Baseline | During Failure | Expected Range | Recovery Target |
|-----------------------|----------|----------------|----------------|-----------------|
| Authz Success Rate    | 99.95%   | >80%           | >80%           | 99.95%          |
| Policy Cache Hits     | 50/sec   | 100/sec        | 100/sec        | 50/sec          |
| Authz Denials         | 0.05%    | <20%           | <20%           | 0.05%           |
| Denied Requests (403) | 0/sec    | >0/sec         | >0/sec         | 0/sec           |
| Safe-Fail Count       | 0/sec    | >0/sec         | >0/sec         | 0/sec           |

Validation Criteria

Success Criteria:

  • ✅ Policy cache maintains authorization for cached policies
  • ✅ Safe-fail behavior enforced (deny when uncertain)
  • ✅ Authz success rate >80% (using cached policies)
  • ✅ Service recovers automatically when OPA restored

Authorization Configuration

OPA Policy Cache Configuration:

# kubernetes/configmaps/authorization-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: authorization-config
  namespace: atp-ingest-ns
data:
  AuthorizationConfig.json: |
    {
      "PolicyCache": {
        "Enabled": true,
        "TTL": 3600,
        "MaxCacheSize": 10000,
        "RefreshThreshold": 300
      },
      "OPA": {
        "Endpoint": "http://opa.atp-policy-ns.svc.cluster.local:8181",
        "RetryPolicy": {
          "MaxRetries": 3,
          "RetryDelay": 1000,
          "ExponentialBackoff": true
        }
      },
      "SafeFail": {
        "Enabled": true,
        "DenyWhenUncertain": true,
        "AllowCachedPolicies": true
      }
    }
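
Before and after the experiment, the OPA endpoint from the configuration above can be probed directly; /health and /v1/data/... are standard OPA REST API paths, while the policy package atp/authz and the sample input are assumptions:

OPA="http://opa.atp-policy-ns.svc.cluster.local:8181"

# Liveness of the policy engine itself
curl -s -o /dev/null -w 'OPA health: HTTP %{http_code}\n' "${OPA}/health"

# Sample decision query (package path and input document are illustrative)
curl -s -X POST "${OPA}/v1/data/atp/authz/allow" \
  -H 'Content-Type: application/json' \
  -d '{"input": {"subject": "auditor", "action": "read", "resource": "audit-events"}}'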

Certificate Expiration

Certificate expiration experiments validate that ATP services handle TLS certificate expiration gracefully through cert-manager renewal, mTLS failure handling, and automatic certificate rotation.

Hypothesis

"When TLS certificates expire, cert-manager will automatically renew certificates, mTLS connections will handle certificate failures gracefully, services will continue operating with renewed certificates, and services will recover automatically when certificates are renewed."

Experiment Configuration

Certificate Expiration Simulation Script:

#!/bin/bash
# scripts/execute-certificate-expiration-experiment.sh

SERVICE="${1:-atp-ingestion-api}"
NAMESPACE="${2:-atp-ingest-ns}"
CERT_NAME="${3:-atp-ingestion-tls}"

echo "🧪 Starting certificate expiration experiment"
echo "Service: ${SERVICE}"
echo "Certificate: ${CERT_NAME}"

# Get baseline metrics
echo "📊 Collecting baseline metrics..."
BASELINE_FILE="baseline-${SERVICE}-$(date +%Y%m%d-%H%M%S).json"  # capture the file name once so the same file is read back below
./scripts/collect-baseline-metrics.sh \
  --service ${SERVICE} \
  --duration 1h \
  --output "${BASELINE_FILE}"

BASELINE_TLS_SUCCESS_RATE=$(jq -r '.metrics.tls_success_rate_percent' "${BASELINE_FILE}")
BASELINE_MTLS_SUCCESS_RATE=$(jq -r '.metrics.mtls_success_rate_percent' "${BASELINE_FILE}")

echo "Baseline metrics:"
echo "  TLS success rate: ${BASELINE_TLS_SUCCESS_RATE}%"
echo "  mTLS success rate: ${BASELINE_MTLS_SUCCESS_RATE}%"

# Get current certificate expiration
echo "📊 Getting current certificate expiration..."
CURRENT_CERT_EXPIRY=$(kubectl get secret ${CERT_NAME} -n ${NAMESPACE} -o jsonpath='{.data.tls\.crt}' | \
  base64 -d | openssl x509 -noout -enddate | cut -d= -f2)

echo "Current certificate expires: ${CURRENT_CERT_EXPIRY}"

# Simulate certificate expiration by deleting certificate
echo "🔧 Simulating certificate expiration..."
CERT_EXPIRATION_START=$(date +%s)

# Note: In production, this would be done more carefully
# For testing, we'll delete the certificate to trigger renewal
echo "⚠️  WARNING: This will delete the certificate. Continuing in 5 seconds..."
sleep 5

kubectl delete secret ${CERT_NAME} -n ${NAMESPACE}

# Wait for cert-manager to detect and renew
echo "⏳ Waiting for cert-manager to renew certificate..."
MAX_RENEWAL_WAIT=300  # 5 minutes
ELAPSED=0
CERT_RENEWED=false

while [ ${ELAPSED} -lt ${MAX_RENEWAL_WAIT} ]; do
  CERT_EXISTS=$(kubectl get secret ${CERT_NAME} -n ${NAMESPACE} -o jsonpath='{.metadata.name}' 2>/dev/null)

  if [ -n "${CERT_EXISTS}" ]; then
    NEW_CERT_EXPIRY=$(kubectl get secret ${CERT_NAME} -n ${NAMESPACE} -o jsonpath='{.data.tls\.crt}' | \
      base64 -d | openssl x509 -noout -enddate | cut -d= -f2)

    if [ "${NEW_CERT_EXPIRY}" != "${CURRENT_CERT_EXPIRY}" ]; then
      CERT_RENEWED=true
      RENEWAL_TIME=$(date +%s)
      RENEWAL_DURATION=$((RENEWAL_TIME - CERT_EXPIRATION_START))
      echo "✅ Certificate renewed in ${RENEWAL_DURATION} seconds"
      echo "New certificate expires: ${NEW_CERT_EXPIRY}"
      break
    fi
  fi

  sleep 10
  ELAPSED=$((ELAPSED + 10))
  echo "Waiting for certificate renewal... (${ELAPSED}s/${MAX_RENEWAL_WAIT}s)"
done

if [ "${CERT_RENEWED}" = false ]; then
  echo "❌ Certificate not renewed within ${MAX_RENEWAL_WAIT} seconds"
  exit 1
fi

# Monitor service behavior during certificate renewal
echo "👀 Monitoring service behavior during certificate renewal..."
MAX_MONITOR_WAIT=600  # 10 minutes
ELAPSED=0

while [ ${ELAPSED} -lt ${MAX_MONITOR_WAIT} ]; do
  # Get TLS handshake failures
  TLS_FAILURES=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(tls_handshake_failures\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')

  # Get mTLS handshake failures
  MTLS_FAILURES=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(mtls_handshake_failures\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')

  # Get certificate renewal status
  CERT_RENEWAL_STATUS=$(curl -s http://prometheus:9090/api/v1/query?query=certificate_renewal_status\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')

  # Get request success rate
  SUCCESS_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SERVICE}\",status!~\"5..\"\}[1m]\)/rate\(http_requests_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
  SUCCESS_RATE_PERCENT=$(echo "${SUCCESS_RATE} * 100" | bc)

  echo "Metrics at ${ELAPSED}s:"
  echo "  TLS handshake failures: ${TLS_FAILURES}/sec"
  echo "  mTLS handshake failures: ${MTLS_FAILURES}/sec"
  echo "  Certificate renewal status: ${CERT_RENEWAL_STATUS}"
  echo "  Request success rate: ${SUCCESS_RATE_PERCENT}%"

  # Validate cert-manager renewal
  if [ "${CERT_RENEWAL_STATUS}" = "success" ]; then
    echo "✅ Certificate renewal successful"
  fi

  # Validate mTLS failure handling
  if (( $(echo "${MTLS_FAILURES} > 0" | bc -l) )); then
    echo "⚠️  mTLS handshake failures: ${MTLS_FAILURES}/sec"
    # Check if failures are transient during renewal
  fi

  sleep 30
  ELAPSED=$((ELAPSED + 30))
done

# Verify final state
FINAL_TLS_SUCCESS_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(tls_handshakes_total\{service=\"${SERVICE}\",status=\"success\"\}[1m]\)/rate\(tls_handshakes_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
FINAL_TLS_SUCCESS_RATE_PERCENT=$(echo "${FINAL_TLS_SUCCESS_RATE} * 100" | bc)
FINAL_MTLS_SUCCESS_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(mtls_handshakes_total\{service=\"${SERVICE}\",status=\"success\"\}[1m]\)/rate\(mtls_handshakes_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
FINAL_MTLS_SUCCESS_RATE_PERCENT=$(echo "${FINAL_MTLS_SUCCESS_RATE} * 100" | bc)

if (( $(echo "${FINAL_TLS_SUCCESS_RATE_PERCENT} >= ${BASELINE_TLS_SUCCESS_RATE} * 0.95" | bc -l) )); then
  echo "✅ TLS success rate recovered: ${FINAL_TLS_SUCCESS_RATE_PERCENT}% (baseline: ${BASELINE_TLS_SUCCESS_RATE}%)"

  if (( $(echo "${FINAL_MTLS_SUCCESS_RATE_PERCENT} >= ${BASELINE_MTLS_SUCCESS_RATE} * 0.95" | bc -l) )); then
    echo "✅ mTLS success rate recovered: ${FINAL_MTLS_SUCCESS_RATE_PERCENT}% (baseline: ${BASELINE_MTLS_SUCCESS_RATE}%)"
    exit 0
  else
    echo "⚠️  mTLS success rate not fully recovered: ${FINAL_MTLS_SUCCESS_RATE_PERCENT}%"
    exit 1
  fi
else
  echo "⚠️  TLS success rate not fully recovered: ${FINAL_TLS_SUCCESS_RATE_PERCENT}%"
  exit 1
fi

Expected Behavior

Certificate Expiration Phase (0-5 minutes):

  • Certificate expiration: TLS certificate expires
  • Cert-manager detection: Cert-manager detects expiration
  • Certificate renewal: Cert-manager renews certificate
  • TLS handshake failures: Temporary TLS handshake failures during renewal
  • Service continuity: Service continues operating with renewed certificate

Recovery Phase (5-10 minutes):

  • Certificate renewal: New certificate issued
  • Certificate deployment: New certificate deployed to pods
  • TLS normalization: TLS handshakes succeed with new certificate
  • Normal operation: Service returns to normal operation

Expected Metrics

| Metric | Baseline | During Expiration | Expected Range | Recovery Target |
|---|---|---|---|---|
| TLS Success Rate | 100% | >90% | >90% | 100% |
| mTLS Success Rate | 100% | >90% | >90% | 100% |
| TLS Handshake Failures | 0/sec | <10/sec | <10/sec | 0/sec |
| mTLS Handshake Failures | 0/sec | <10/sec | <10/sec | 0/sec |
| Certificate Renewal Time | N/A | <5min | <5min | N/A |

Validation Criteria

Success Criteria:

  • ✅ Cert-manager automatically renews expired certificates
  • ✅ Certificate renewal completes within 5 minutes
  • ✅ TLS handshake failures are transient during renewal
  • ✅ mTLS handles certificate failures gracefully
  • ✅ Service recovers automatically when certificates renewed
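
As a quick spot check during the experiment, renewal can also be verified directly against the cert-manager Certificate resource rather than via Prometheus. A minimal sketch using the experiment defaults above (the Ready condition and status.notAfter are standard cert-manager status fields):

#!/bin/bash
# scripts/check-certificate-renewal.sh (sketch)
# Verifies that the cert-manager Certificate is Ready and reports its new expiry.

CERT_NAME="${1:-atp-ingestion-tls}"
NAMESPACE="${2:-atp-ingest-ns}"

READY=$(kubectl get certificate ${CERT_NAME} -n ${NAMESPACE} \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')

NOT_AFTER=$(kubectl get certificate ${CERT_NAME} -n ${NAMESPACE} \
  -o jsonpath='{.status.notAfter}')

if [ "${READY}" = "True" ]; then
  echo "✅ Certificate ${CERT_NAME} is Ready (expires: ${NOT_AFTER})"
else
  echo "❌ Certificate ${CERT_NAME} is not Ready"
  kubectl describe certificate ${CERT_NAME} -n ${NAMESPACE} | tail -n 20
  exit 1
fi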

Certificate Management Configuration

Cert-Manager Certificate Configuration:

# kubernetes/certificates/ingestion-api-tls-certificate.yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: atp-ingestion-tls
  namespace: atp-ingest-ns
spec:
  secretName: atp-ingestion-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  commonName: ingestion-api.atp.connectsoft.io
  dnsNames:
    - ingestion-api.atp.connectsoft.io
    - ingestion-api.atp-staging.connectsoft.io
  renewBefore: 720h  # Renew 30 days before expiration
  privateKey:
    algorithm: RSA
    size: 2048

mTLS Configuration:

# kubernetes/configmaps/mtls-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: mtls-config
  namespace: atp-ingest-ns
data:
  MTLSConfig.json: |
    {
      "Enabled": true,
      "ClientCertificateRequired": true,
      "CertificateValidation": {
        "ValidateCertificateChain": true,
        "ValidateRevocation": true,
        "AllowExpiredCertificates": false
      },
      "FailureHandling": {
        "GracefulDegradation": true,
        "RetryOnFailure": true,
        "MaxRetries": 3
      },
      "CertificateRotation": {
        "AutoRotate": true,
        "RotationThreshold": 720,
        "RenewBeforeExpiry": true
      }
    }
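
For manual spot checks during or after the experiment, the renewed server certificate and the mTLS handshake can be exercised directly with openssl s_client (a sketch; the client certificate, key, and CA file paths are illustrative):

#!/bin/bash
# Sketch: manually verify TLS and mTLS handshakes against the renewed certificate.
HOST="ingestion-api.atp.connectsoft.io"

# Server certificate check: confirm the presented certificate's new expiry date
echo | openssl s_client -connect ${HOST}:443 -servername ${HOST} 2>/dev/null | \
  openssl x509 -noout -enddate

# mTLS check: present a client certificate (paths are illustrative)
echo | openssl s_client -connect ${HOST}:443 -servername ${HOST} \
  -cert client.crt -key client.key -CAfile ca.crt 2>/dev/null | \
  grep -E "Verify return code"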

Key Vault Unavailability

Key Vault unavailability experiments validate that ATP services handle Azure Key Vault unavailability gracefully through secret caching, graceful degradation, and encryption key access failure handling.

Hypothesis

"When Azure Key Vault becomes unavailable, services will use cached secrets, gracefully degrade functionality that requires new secrets, maintain service availability for operations using cached secrets, and recover automatically when Key Vault is restored."

Experiment Configuration

Azure Key Vault Network Partition:

# chaos-experiments/key-vault-unavailability.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: key-vault-unavailability
  namespace: chaos-testing
  labels:
    category: security
    service: key-vault
    severity: high
    frequency: monthly
  annotations:
    chaos.atp.connectsoft.io/hypothesis: |
      When Azure Key Vault becomes unavailable, services will use cached secrets, 
      gracefully degrade functionality that requires new secrets, maintain service availability 
      for operations using cached secrets, and recover automatically when Key Vault is restored.
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - atp-ingest-ns
    labelSelectors:
      app: atp-ingest-api
  direction: both
  target:
    mode: all
    selector:
      address: "*.vault.azure.net"
  duration: "10m"

Key Vault Unavailability Simulation Script:

#!/bin/bash
# scripts/execute-key-vault-unavailability-experiment.sh

SERVICE="${1:-atp-ingestion-api}"
NAMESPACE="${2:-atp-ingest-ns}"
KEY_VAULT_NAME="${3:-atp-keyvault}"
DURATION="${4:-10m}"

echo "🧪 Starting Key Vault unavailability experiment"
echo "Service: ${SERVICE}"
echo "Key Vault: ${KEY_VAULT_NAME}"
echo "Duration: ${DURATION}"

# Get baseline metrics
echo "📊 Collecting baseline metrics..."
# Capture the baseline file name once so later reads reference the same file
BASELINE_FILE="baseline-${SERVICE}-$(date +%Y%m%d-%H%M%S).json"
./scripts/collect-baseline-metrics.sh \
  --service ${SERVICE} \
  --duration 1h \
  --output "${BASELINE_FILE}"

BASELINE_SECRET_ACCESS_SUCCESS=$(jq -r '.metrics.secret_access_success_rate_percent' "${BASELINE_FILE}")
BASELINE_SECRET_CACHE_HITS=$(jq -r '.metrics.secret_cache_hits_per_second' "${BASELINE_FILE}")

echo "Baseline metrics:"
echo "  Secret access success rate: ${BASELINE_SECRET_ACCESS_SUCCESS}%"
echo "  Secret cache hits: ${BASELINE_SECRET_CACHE_HITS}/sec"

# Apply network partition to Key Vault
echo "🔧 Applying network partition to Key Vault..."
kubectl apply -f - <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: key-vault-unavailability-${SERVICE}
  namespace: chaos-testing
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - ${NAMESPACE}
    labelSelectors:
      app: ${SERVICE}
  direction: both
  target:
    mode: all
    selector:
      address: "*.vault.azure.net"
  duration: "${DURATION}"
EOF

KEY_VAULT_FAILURE_START=$(date +%s)

# Monitor service behavior
echo "👀 Monitoring service behavior during Key Vault unavailability..."
MAX_WAIT=600  # 10 minutes
ELAPSED=0

while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
  # Get secret access success rate
  SECRET_ACCESS_SUCCESS=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(secret_access_total\{service=\"${SERVICE}\",status=\"success\"\}[1m]\)/rate\(secret_access_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
  SECRET_ACCESS_SUCCESS_PERCENT=$(echo "${SECRET_ACCESS_SUCCESS} * 100" | bc)

  # Get secret cache hits
  SECRET_CACHE_HITS=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(secret_cache_hits\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')

  # Get secret access failures
  SECRET_ACCESS_FAILURES=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(secret_access_total\{service=\"${SERVICE}\",status=\"failure\"\}[1m]\) | jq -r '.data.result[0].value[1]')

  # Get encryption key access failures
  ENCRYPTION_KEY_FAILURES=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(encryption_key_access_failures\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')

  # Get service mode (normal vs degraded)
  SERVICE_MODE=$(curl -s http://prometheus:9090/api/v1/query?query=service_mode\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')

  # Get request success rate
  SUCCESS_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SERVICE}\",status!~\"5..\"\}[1m]\)/rate\(http_requests_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
  SUCCESS_RATE_PERCENT=$(echo "${SUCCESS_RATE} * 100" | bc)

  echo "Metrics at ${ELAPSED}s:"
  echo "  Secret access success rate: ${SECRET_ACCESS_SUCCESS_PERCENT}%"
  echo "  Secret cache hits: ${SECRET_CACHE_HITS}/sec"
  echo "  Secret access failures: ${SECRET_ACCESS_FAILURES}/sec"
  echo "  Encryption key access failures: ${ENCRYPTION_KEY_FAILURES}/sec"
  echo "  Service mode: ${SERVICE_MODE}"
  echo "  Request success rate: ${SUCCESS_RATE_PERCENT}%"

  # Validate secret cache usage
  if (( $(echo "${SECRET_CACHE_HITS} > 0" | bc -l) )); then
    echo "✅ Secret cache active: ${SECRET_CACHE_HITS}/sec cache hits"
  fi

  # Validate graceful degradation
  if [ "${SERVICE_MODE}" = "degraded" ]; then
    echo "✅ Service operating in degraded mode: ${SERVICE_MODE}"
  fi

  # Validate encryption key access failure handling
  if (( $(echo "${ENCRYPTION_KEY_FAILURES} > 0" | bc -l) )); then
    echo "⚠️  Encryption key access failures: ${ENCRYPTION_KEY_FAILURES}/sec"
    # Check if service handles failures gracefully
    if (( $(echo "${SUCCESS_RATE_PERCENT} >= 90" | bc -l) )); then
      echo "✅ Service handles encryption key failures gracefully"
    fi
  fi

  sleep 30
  ELAPSED=$((ELAPSED + 30))
done

# Remove network partition
echo "🔧 Removing network partition..."
kubectl delete networkchaos key-vault-unavailability-${SERVICE} -n chaos-testing

RECOVERY_START=$(date +%s)

# Wait for recovery
echo "⏳ Waiting for Key Vault to recover..."
sleep 120

# Verify recovery
FINAL_SECRET_ACCESS_SUCCESS=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(secret_access_total\{service=\"${SERVICE}\",status=\"success\"\}[1m]\)/rate\(secret_access_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
FINAL_SECRET_ACCESS_SUCCESS_PERCENT=$(echo "${FINAL_SECRET_ACCESS_SUCCESS} * 100" | bc)
FINAL_SERVICE_MODE=$(curl -s http://prometheus:9090/api/v1/query?query=service_mode\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
FINAL_ENCRYPTION_KEY_FAILURES=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(encryption_key_access_failures\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')

if (( $(echo "${FINAL_SECRET_ACCESS_SUCCESS_PERCENT} >= ${BASELINE_SECRET_ACCESS_SUCCESS} * 0.95" | bc -l) )); then
  echo "✅ Secret access success rate recovered: ${FINAL_SECRET_ACCESS_SUCCESS_PERCENT}% (baseline: ${BASELINE_SECRET_ACCESS_SUCCESS}%)"

  if [ "${FINAL_SERVICE_MODE}" = "normal" ]; then
    echo "✅ Service mode normalized: ${FINAL_SERVICE_MODE}"

    if (( $(echo "${FINAL_ENCRYPTION_KEY_FAILURES} == 0" | bc -l) )); then
      echo "✅ Encryption key access recovered"
      exit 0
    else
      echo "⚠️  Encryption key access failures still occurring: ${FINAL_ENCRYPTION_KEY_FAILURES}/sec"
      exit 1
    fi
  else
    echo "⚠️  Service mode not normalized: ${FINAL_SERVICE_MODE}"
    exit 1
  fi
else
  echo "⚠️  Secret access success rate not fully recovered: ${FINAL_SECRET_ACCESS_SUCCESS_PERCENT}%"
  exit 1
fi

Expected Behavior

Key Vault Failure Phase (0-10 minutes):

  • Key Vault unavailability: Azure Key Vault unreachable
  • Secret cache usage: Services use cached secrets
  • Graceful degradation: Service degrades functionality requiring new secrets
  • Encryption key access: Encryption key access failures handled gracefully
  • Service continuity: Service continues operating with cached secrets

Recovery Phase (10-15 minutes):

  • Key Vault restoration: Azure Key Vault restored
  • Secret refresh: New secrets obtained from Key Vault
  • Secret cache refresh: Secret cache refreshed
  • Normal operation: Service returns to normal operation

Expected Metrics

| Metric | Baseline | During Failure | Expected Range | Recovery Target |
|---|---|---|---|---|
| Secret Access Success Rate | 99.95% | >80% | >80% | 99.95% |
| Secret Cache Hits | 50/sec | 100/sec | 100/sec | 50/sec |
| Secret Access Failures | 0.05% | <20% | <20% | 0.05% |
| Encryption Key Failures | 0/sec | <5/sec | <5/sec | 0/sec |
| Service Mode | Normal | Degraded | Degraded | Normal |

Validation Criteria

Success Criteria:

  • ✅ Secret cache maintains access to cached secrets
  • ✅ Graceful degradation when new secrets required
  • ✅ Encryption key access failures handled gracefully
  • ✅ Secret access success rate >80%
  • ✅ Service recovers automatically when Key Vault restored
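
A simple out-of-band probe can confirm that Key Vault is reachable again and that a fresh (non-cached) secret read succeeds, independent of the service-level metrics above. A minimal sketch, assuming az CLI access and a hypothetical probe secret named atp-health-probe:

#!/bin/bash
# scripts/check-key-vault-access.sh (sketch)
# Confirms Key Vault is reachable and a fresh (non-cached) secret read works.

KEY_VAULT_NAME="${1:-atp-keyvault}"
PROBE_SECRET="${2:-atp-health-probe}"   # hypothetical probe secret

if az keyvault secret show \
     --vault-name ${KEY_VAULT_NAME} \
     --name ${PROBE_SECRET} \
     --query "attributes.enabled" \
     --output tsv > /dev/null 2>&1; then
  echo "✅ Key Vault ${KEY_VAULT_NAME} reachable; secret read succeeded"
  exit 0
else
  echo "❌ Key Vault ${KEY_VAULT_NAME} unreachable or secret read failed"
  exit 1
fi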

Key Vault Configuration

Secret Cache Configuration:

# kubernetes/configmaps/key-vault-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: key-vault-config
  namespace: atp-ingest-ns
data:
  KeyVaultConfig.json: |
    {
      "SecretCache": {
        "Enabled": true,
        "TTL": 3600,
        "MaxCacheSize": 1000,
        "RefreshThreshold": 300
      },
      "KeyVault": {
        "VaultUrl": "https://${KEY_VAULT_NAME}.vault.azure.net/",
        "Authentication": {
          "Type": "ManagedIdentity",
          "ClientId": "{managed-identity-client-id}"
        },
        "RetryPolicy": {
          "MaxRetries": 3,
          "RetryDelay": 1000,
          "ExponentialBackoff": true
        }
      },
      "EncryptionKeys": {
        "CacheEnabled": true,
        "CacheTTL": 7200,
        "FailureHandling": "graceful"
      },
      "FallbackBehavior": {
        "UseCachedSecrets": true,
        "DegradedMode": true,
        "DenyOnFailure": false
      }
    }

Security Chaos Visualization

graph TD
    SECURITY[Security Layer] --> AUTH[Authentication]
    SECURITY --> AUTHZ[Authorization]
    SECURITY --> CERT[Certificates]
    SECURITY --> KEYVAULT[Key Vault]

    AUTH --> AZUREAD[Azure AD]
    AZUREAD -->|Fails| TOKENCACHE[Token Cache]
    TOKENCACHE --> DENY[Deny-by-Default]
    DENY --> CONTINUE1[Continue Operating]

    AUTHZ --> OPA[OPA Policy Engine]
    OPA -->|Fails| POLICYCACHE[Policy Cache]
    POLICYCACHE --> SAFEFAIL[Safe-Fail]
    SAFEFAIL --> CONTINUE2[Continue Operating]

    CERT --> CERTMANAGER[Cert-Manager]
    CERTMANAGER -->|Expires| RENEW[Auto-Renewal]
    RENEW --> MTLS[mTLS Handling]
    MTLS --> CONTINUE3[Continue Operating]

    KEYVAULT --> SECRETCACHE[Secret Cache]
    KEYVAULT -->|Fails| ENCRYPTION[Encryption Keys]
    SECRETCACHE --> DEGRADED[Graceful Degradation]
    ENCRYPTION --> DEGRADED
    DEGRADED --> CONTINUE4[Continue Operating]

    style SECURITY fill:#FFE5B4
    style AZUREAD fill:#FFB6C1
    style OPA fill:#FFB6C1
    style CERTMANAGER fill:#FFB6C1
    style KEYVAULT fill:#FFB6C1
    style CONTINUE1 fill:#90EE90
    style CONTINUE2 fill:#90EE90
    style CONTINUE3 fill:#90EE90
    style CONTINUE4 fill:#90EE90
Hold "Alt" / "Option" to enable pan & zoom

Summary: Security Chaos

  • Authentication Failure: Validates token caching, graceful degradation, and deny-by-default behavior during Azure AD unavailability; expects token cache maintains authentication, deny-by-default enforced, graceful degradation to read-only mode, and automatic recovery
  • Authorization Denial: Validates cached policies, safe-fail behavior, and deny-when-uncertain enforcement during OPA unavailability; expects policy cache maintains authorization, safe-fail behavior enforced, deny-when-uncertain, and automatic recovery
  • Certificate Expiration: Validates cert-manager renewal, mTLS failure handling, and automatic certificate rotation during TLS certificate expiration; expects cert-manager auto-renews certificates, renewal completes within 5 minutes, mTLS handles failures gracefully, and automatic recovery
  • Key Vault Unavailability: Validates cached secrets, graceful degradation, and encryption key access failure handling during Azure Key Vault unavailability; expects secret cache maintains access, graceful degradation when new secrets required, encryption key failures handled gracefully, and automatic recovery
  • Monitoring and Validation: Comprehensive scripts for monitoring authentication failures, authorization denials, certificate expiration, Key Vault unavailability, token cache hits, policy cache hits, certificate renewal status, and recovery behavior

Regional Failover Drill

Purpose: Define comprehensive disaster recovery (DR) drill procedures for regional failover scenarios in ATP, validating failover procedures, RTO/RPO targets, data replication, and service availability to ensure ATP services remain available and recoverable during complete regional failures.


Full Region Failover Scenario

Regional failover drill experiments validate that ATP services handle complete regional failures gracefully through automated failover procedures, RTO/RPO target achievement, and service availability in secondary regions.

Hypothesis

"When the East US region becomes completely unavailable, ATP services will automatically failover to the West Europe region within RTO target (30 minutes), maintain RPO target (1 hour data loss), ensure all services are operational in the secondary region, and recover automatically when the primary region is restored."

Scenario Overview

Primary Region: East US (eastus)

  • Azure Kubernetes Service (AKS) cluster
  • Azure SQL Database (primary)
  • Azure Service Bus
  • Azure Blob Storage
  • Azure Key Vault
  • Azure Application Insights

Secondary Region: West Europe (westeurope)

  • Azure Kubernetes Service (AKS) cluster (standby)
  • Azure SQL Database (read replica → primary)
  • Azure Service Bus (geo-replication)
  • Azure Blob Storage (geo-redundant)
  • Azure Key Vault (geo-redundant)
  • Azure Application Insights

Failover Objectives

| Objective | Target | Validation |
|---|---|---|
| Recovery Time Objective (RTO) | 30 minutes | Time from failure detection to full traffic in secondary region |
| Recovery Point Objective (RPO) | 1 hour | Maximum data loss (async replication lag) |
| Service Availability | 99.9% | All critical services operational in secondary region |
| Data Integrity | 100% | No data corruption, all transactions consistent |
| Authentication | 100% | Azure AD multi-region authentication functional |

Regional Failover Architecture

graph TB
    USERS[Users] --> TM[Azure Traffic Manager]

    TM -->|Primary| EASTUS[East US Region]
    TM -->|Secondary| WESTEU[West Europe Region]

    EASTUS --> EAKS[AKS Cluster - East US]
    EASTUS --> ESQL[Azure SQL Primary]
    EASTUS --> ESB[Service Bus - East US]
    EASTUS --> ESTORAGE[Blob Storage - East US]

    WESTEU --> WAKS[AKS Cluster - West Europe]
    WESTEU --> WSQL[Azure SQL Replica]
    WESTEU --> WSB[Service Bus - West Europe]
    WESTEU --> WSTORAGE[Blob Storage - West Europe]

    ESQL -.->|Async Replication| WSQL
    ESB -.->|Geo-Replication| WSB
    ESTORAGE -.->|Geo-Redundant| WSTORAGE

    EASTUS -.->|FAILOVER| TM
    TM -.->|Traffic Redirect| WESTEU

    style EASTUS fill:#FFB6C1
    style WESTEU fill:#90EE90
    style TM fill:#FFE5B4
Hold "Alt" / "Option" to enable pan & zoom

Failover Procedure

Automated Failover Detection

Region Failure Detection Script:

#!/bin/bash
# scripts/detect-region-failure.sh

PRIMARY_REGION="${1:-eastus}"
SECONDARY_REGION="${2:-westeurope}"
RESOURCE_GROUP="${3:-atp-production}"

echo "🔍 Detecting region failure for ${PRIMARY_REGION}..."

FAILURE_DETECTED=false
FAILURE_COMPONENTS=()

# Check AKS cluster health
echo "Checking AKS cluster health..."
AKS_CLUSTER="atp-aks-${PRIMARY_REGION}"
AKS_STATUS=$(az aks show \
  --resource-group ${RESOURCE_GROUP} \
  --name ${AKS_CLUSTER} \
  --query "powerState.code" \
  --output tsv 2>/dev/null)

if [ "${AKS_STATUS}" != "Running" ]; then
  FAILURE_DETECTED=true
  FAILURE_COMPONENTS+=("AKS Cluster")
  echo "❌ AKS cluster ${AKS_CLUSTER} is not running: ${AKS_STATUS}"
fi

# Check Azure SQL Database connectivity
echo "Checking Azure SQL Database connectivity..."
SQL_SERVER="atp-sql-${PRIMARY_REGION}.database.windows.net"
SQL_DB="atp-database"

if ! nc -z ${SQL_SERVER} 1433 2>/dev/null; then
  FAILURE_DETECTED=true
  FAILURE_COMPONENTS+=("Azure SQL Database")
  echo "❌ Azure SQL Database ${SQL_SERVER} is not reachable"
fi

# Check Service Bus namespace
echo "Checking Service Bus namespace..."
SB_NAMESPACE="atp-sb-${PRIMARY_REGION}"
SB_STATUS=$(az servicebus namespace show \
  --resource-group ${RESOURCE_GROUP} \
  --name ${SB_NAMESPACE} \
  --query "status" \
  --output tsv 2>/dev/null)

if [ "${SB_STATUS}" != "Active" ]; then
  FAILURE_DETECTED=true
  FAILURE_COMPONENTS+=("Service Bus")
  echo "❌ Service Bus namespace ${SB_NAMESPACE} is not active: ${SB_STATUS}"
fi

# Check Storage Account
echo "Checking Storage Account..."
STORAGE_ACCOUNT="atpstorage${PRIMARY_REGION}"
STORAGE_STATUS=$(az storage account show \
  --resource-group ${RESOURCE_GROUP} \
  --name ${STORAGE_ACCOUNT} \
  --query "provisioningState" \
  --output tsv 2>/dev/null)

if [ "${STORAGE_STATUS}" != "Succeeded" ]; then
  FAILURE_DETECTED=true
  FAILURE_COMPONENTS+=("Storage Account")
  echo "❌ Storage Account ${STORAGE_ACCOUNT} provisioning state: ${STORAGE_STATUS}"
fi

# Check Key Vault
echo "Checking Key Vault..."
KEY_VAULT="atp-kv-${PRIMARY_REGION}"
KV_STATUS=$(az keyvault show \
  --name ${KEY_VAULT} \
  --query "properties.provisioningState" \
  --output tsv 2>/dev/null)

if [ "${KV_STATUS}" != "Succeeded" ]; then
  FAILURE_DETECTED=true
  FAILURE_COMPONENTS+=("Key Vault")
  echo "❌ Key Vault ${KEY_VAULT} provisioning state: ${KV_STATUS}"
fi

# Summary
if [ "${FAILURE_DETECTED}" = true ]; then
  echo ""
  echo "⚠️  REGION FAILURE DETECTED"
  echo "Primary Region: ${PRIMARY_REGION}"
  echo "Failed Components:"
  for component in "${FAILURE_COMPONENTS[@]}"; do
    echo "  - ${component}"
  done
  echo ""
  echo "🚨 Initiating failover to ${SECONDARY_REGION}..."

  # Trigger failover automation
  ./scripts/initiate-failover.sh ${PRIMARY_REGION} ${SECONDARY_REGION} ${RESOURCE_GROUP}

  exit 1
else
  echo "✅ Primary region ${PRIMARY_REGION} is healthy"
  exit 0
fi
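
In practice this detection script runs on a schedule rather than being invoked manually. A minimal sketch, assuming the script and the Azure CLI are packaged into a hypothetical atp-ops-tools image and the namespace is illustrative, creates a Kubernetes CronJob that executes it every five minutes:

#!/bin/bash
# Sketch: schedule the region-failure detection script every 5 minutes.
kubectl create cronjob region-failure-detection \
  --namespace atp-ops \
  --image atp-ops-tools:latest \
  --schedule "*/5 * * * *" \
  -- /scripts/detect-region-failure.sh eastus westeurope atp-production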

Failover Procedure Script:

#!/bin/bash
# scripts/execute-regional-failover-drill.sh

PRIMARY_REGION="${1:-eastus}"
SECONDARY_REGION="${2:-westeurope}"
RESOURCE_GROUP="${3:-atp-production}"
DRILL_MODE="${4:-true}"  # true for drill, false for real failover

FAILOVER_START=$(date +%s)

echo "🚨 Starting Regional Failover Drill"
echo "Primary Region: ${PRIMARY_REGION}"
echo "Secondary Region: ${SECONDARY_REGION}"
echo "Resource Group: ${RESOURCE_GROUP}"
echo "Drill Mode: ${DRILL_MODE}"
echo "Start Time: $(date -u +"%Y-%m-%d %H:%M:%S UTC")"
echo ""

# Step 1: Detect region failure (automated monitoring)
echo "Step 1: Detecting region failure..."
if [ "${DRILL_MODE}" = "true" ]; then
  echo "⚠️  DRILL MODE: Simulating region failure detection"
  FAILURE_DETECTED=true
else
  ./scripts/detect-region-failure.sh ${PRIMARY_REGION} ${SECONDARY_REGION} ${RESOURCE_GROUP}
  FAILURE_DETECTED=$?
fi

if [ "${FAILURE_DETECTED}" != "true" ] && [ "${FAILURE_DETECTED}" != "1" ]; then
  echo "❌ No failure detected. Aborting failover."
  exit 1
fi

echo "✅ Region failure detected"
echo ""

# Step 2: Incident commander declares failover
echo "Step 2: Incident commander declares failover..."
echo "⚠️  MANUAL STEP: Incident commander must declare failover"
echo "   Send notification to: #atp-dr-war-room"
echo "   Incident commander: [WAIT FOR CONFIRMATION]"
read -p "Press Enter when incident commander has declared failover..."

FAILOVER_DECLARED=$(date +%s)
DECLARE_DURATION=$((FAILOVER_DECLARED - FAILOVER_START))
echo "✅ Failover declared at +${DECLARE_DURATION} seconds"
echo ""

# Step 3: Update Azure Traffic Manager (DNS failover)
echo "Step 3: Updating Azure Traffic Manager..."
TM_PROFILE="atp-traffic-manager"
TM_ENDPOINT_PRIMARY="atp-eastus-endpoint"
TM_ENDPOINT_SECONDARY="atp-westeurope-endpoint"

# Disable primary endpoint
az network traffic-manager endpoint update \
  --resource-group ${RESOURCE_GROUP} \
  --profile-name ${TM_PROFILE} \
  --name ${TM_ENDPOINT_PRIMARY} \
  --endpoint-status Disabled

# Enable secondary endpoint
az network traffic-manager endpoint update \
  --resource-group ${RESOURCE_GROUP} \
  --profile-name ${TM_PROFILE} \
  --name ${TM_ENDPOINT_SECONDARY} \
  --endpoint-status Enabled \
  --priority 1

echo "✅ Traffic Manager updated (DNS failover initiated)"
echo "   Primary endpoint: Disabled"
echo "   Secondary endpoint: Enabled (Priority 1)"
echo ""

# Step 4: Verify West Europe cluster healthy
echo "Step 4: Verifying West Europe cluster health..."
SECONDARY_AKS="atp-aks-${SECONDARY_REGION}"

AKS_NODES=$(az aks show \
  --resource-group ${RESOURCE_GROUP} \
  --name ${SECONDARY_AKS} \
  --query "agentPoolProfiles[0].count" \
  --output tsv)

AKS_STATUS=$(az aks show \
  --resource-group ${RESOURCE_GROUP} \
  --name ${SECONDARY_AKS} \
  --query "powerState.code" \
  --output tsv)

if [ "${AKS_STATUS}" != "Running" ]; then
  echo "❌ Secondary AKS cluster is not running: ${AKS_STATUS}"
  exit 1
fi

echo "✅ Secondary AKS cluster healthy"
echo "   Cluster: ${SECONDARY_AKS}"
echo "   Status: ${AKS_STATUS}"
echo "   Nodes: ${AKS_NODES}"
echo ""

# Step 5: Verify data replication status
echo "Step 5: Verifying data replication status..."
PRIMARY_SQL_SERVER="atp-sql-${PRIMARY_REGION}"
SECONDARY_SQL_SERVER="atp-sql-${SECONDARY_REGION}"
FAILOVER_GROUP="atp-sql-failover-group"

# Check replication lag
REPLICATION_LAG=$(az sql db replica show-lag \
  --resource-group ${RESOURCE_GROUP} \
  --server ${SECONDARY_SQL_SERVER} \
  --database atp-database \
  --query "lagSeconds" \
  --output tsv 2>/dev/null || echo "0")

if [ -z "${REPLICATION_LAG}" ]; then
  REPLICATION_LAG=0
fi

echo "   Replication lag: ${REPLICATION_LAG} seconds"
if (( REPLICATION_LAG > 3600 )); then
  echo "⚠️  WARNING: Replication lag exceeds RPO target (1 hour)"
fi

echo "✅ Data replication status verified"
echo ""

# Step 6: Validate latest data available (check RPO)
echo "Step 6: Validating latest data available (RPO check)..."
# Get latest transaction timestamp from secondary database
LATEST_TRANSACTION=$(az sql db query \
  --resource-group ${RESOURCE_GROUP} \
  --server ${SECONDARY_SQL_SERVER} \
  --database atp-database \
  --query-text "SELECT MAX(LastModified) as LatestTransaction FROM AuditEvents" \
  --output tsv 2>/dev/null || echo "N/A")

CURRENT_TIME=$(date -u +"%Y-%m-%d %H:%M:%S")
RPO_AGE=$(date -u -d "${LATEST_TRANSACTION}" +%s 2>/dev/null || echo "0")
CURRENT_AGE=$(date -u -d "${CURRENT_TIME}" +%s)
RPO_DIFF=$((CURRENT_AGE - RPO_AGE))

if (( RPO_DIFF > 3600 )); then
  echo "⚠️  WARNING: RPO exceeded (${RPO_DIFF} seconds > 3600 seconds)"
else
  echo "✅ RPO target met (${RPO_DIFF} seconds < 3600 seconds)"
fi

echo "   Latest transaction: ${LATEST_TRANSACTION}"
echo "   Current time: ${CURRENT_TIME}"
echo "   RPO age: ${RPO_DIFF} seconds"
echo ""

# Step 7: Re-route all traffic to West Europe
echo "Step 7: Re-routing all traffic to West Europe..."
# This is already done in Step 3 (Traffic Manager)
# Additional verification: Check DNS propagation
TM_DNS_NAME=$(az network traffic-manager profile show \
  --resource-group ${RESOURCE_GROUP} \
  --name ${TM_PROFILE} \
  --query "dnsConfig.fqdn" \
  --output tsv)

echo "   Traffic Manager DNS: ${TM_DNS_NAME}"
echo "   DNS propagation: Checking..."
sleep 10

# Verify DNS resolution
RESOLVED_IP=$(dig +short ${TM_DNS_NAME} | head -n1)
echo "   Resolved IP: ${RESOLVED_IP}"

echo "✅ Traffic re-routed to West Europe"
echo ""

# Step 8: Monitor application health
echo "Step 8: Monitoring application health..."
SECONDARY_NAMESPACE="atp-production"

# Check pod status
PODS_RUNNING=$(kubectl get pods -n ${SECONDARY_NAMESPACE} --context ${SECONDARY_AKS} \
  --field-selector=status.phase=Running \
  --no-headers | wc -l)

PODS_TOTAL=$(kubectl get pods -n ${SECONDARY_NAMESPACE} --context ${SECONDARY_AKS} \
  --no-headers | wc -l)

echo "   Pods running: ${PODS_RUNNING}/${PODS_TOTAL}"

# Check service endpoints
SERVICES=$(kubectl get svc -n ${SECONDARY_NAMESPACE} --context ${SECONDARY_AKS} \
  -o jsonpath='{.items[*].metadata.name}')

for service in ${SERVICES}; do
  ENDPOINTS=$(kubectl get endpoints ${service} -n ${SECONDARY_NAMESPACE} --context ${SECONDARY_AKS} \
    -o jsonpath='{.subsets[0].addresses[*].ip}' 2>/dev/null || echo "")
  if [ -z "${ENDPOINTS}" ]; then
    echo "   ⚠️  Service ${service}: No endpoints"
  else
    echo "   ✅ Service ${service}: Healthy"
  fi
done

echo "✅ Application health monitoring initiated"
echo ""

# Step 9: Notify stakeholders
echo "Step 9: Notifying stakeholders..."
FAILOVER_COMPLETE=$(date +%s)
RTO_ACHIEVED=$((FAILOVER_COMPLETE - FAILOVER_START))

NOTIFICATION_MESSAGE="🚨 Regional Failover Completed
Primary Region: ${PRIMARY_REGION} (Unavailable)
Secondary Region: ${SECONDARY_REGION} (Active)
RTO Achieved: ${RTO_ACHIEVED} seconds
RPO Verified: ${RPO_DIFF} seconds
Status: All services operational in ${SECONDARY_REGION}"

echo "${NOTIFICATION_MESSAGE}"
echo ""

# Send notifications (Slack, Email, etc.)
# ./scripts/send-notification.sh "${NOTIFICATION_MESSAGE}"

echo "✅ Stakeholders notified"
echo ""

# Step 10: Document RTO/RPO achieved
echo "Step 10: Documenting RTO/RPO achieved..."
DR_REPORT_FILE="dr-drill-report-$(date +%Y%m%d-%H%M%S).json"

cat > ${DR_REPORT_FILE} <<EOF
{
  "drillId": "dr-$(date +%Y%m%d-%H%M%S)",
  "timestamp": "$(date -u +"%Y-%m-%dT%H:%M:%SZ")",
  "primaryRegion": "${PRIMARY_REGION}",
  "secondaryRegion": "${SECONDARY_REGION}",
  "failoverStartTime": "$(date -u -d @${FAILOVER_START} +"%Y-%m-%dT%H:%M:%SZ")",
  "failoverCompleteTime": "$(date -u -d @${FAILOVER_COMPLETE} +"%Y-%m-%dT%H:%M:%SZ")",
  "rto": {
    "target": 1800,
    "achieved": ${RTO_ACHIEVED},
    "status": "$(if (( RTO_ACHIEVED <= 1800 )); then echo "PASS"; else echo "FAIL"; fi)"
  },
  "rpo": {
    "target": 3600,
    "achieved": ${RPO_DIFF},
    "status": "$(if (( RPO_DIFF <= 3600 )); then echo "PASS"; else echo "FAIL"; fi)"
  },
  "replicationLag": ${REPLICATION_LAG},
  "services": {
    "podsRunning": ${PODS_RUNNING},
    "podsTotal": ${PODS_TOTAL},
    "status": "$(if (( PODS_RUNNING == PODS_TOTAL )); then echo "HEALTHY"; else echo "DEGRADED"; fi)"
  },
  "drillMode": ${DRILL_MODE}
}
EOF

echo "✅ DR drill report generated: ${DR_REPORT_FILE}"
echo ""
echo "=========================================="
echo "Regional Failover Drill Summary"
echo "=========================================="
echo "RTO Target: 1800 seconds (30 minutes)"
echo "RTO Achieved: ${RTO_ACHIEVED} seconds"
echo "RPO Target: 3600 seconds (1 hour)"
echo "RPO Achieved: ${RPO_DIFF} seconds"
echo "Status: $(if (( RTO_ACHIEVED <= 1800 && RPO_DIFF <= 3600 )); then echo " PASS"; else echo " FAIL"; fi)"
echo ""

Expected Behavior

Failover Phase (0-30 minutes):

  • Failure detection: Automated monitoring detects region failure
  • Incident declaration: Incident commander declares failover
  • DNS failover: Traffic Manager redirects traffic to secondary region
  • Cluster verification: Secondary cluster verified healthy
  • Data replication: Replication status verified (RPO checked)
  • Traffic routing: All traffic re-routed to secondary region
  • Health monitoring: Application health monitored continuously
  • Notifications: Stakeholders notified of failover completion

Post-Failover Phase (30-240 minutes):

  • Service validation: All services validated operational
  • Data integrity: Data integrity verified (no corruption)
  • Monitoring: Monitoring and alerting functional
  • Authentication: Azure AD multi-region authentication verified
  • Documentation: RTO/RPO documented

DR Drill Execution

DR Drill Schedule

| Schedule | Type | Environment | Duration |
|---|---|---|---|
| Q1 | Full DR Drill | Production-like (Staging) | 4 hours |
| Q2 | Tabletop Exercise | Any | 2 hours |
| Q3 | Full DR Drill | Production-like (Staging) | 4 hours |
| Q4 | Post-Mortem Review | Any | 2 hours |

DR Drill Timeline

| Phase | Duration | Activities |
|---|---|---|
| Pre-Drill | 1 week | Planning, coordination, stakeholder notification |
| Pre-Drill Briefing | 30 min | Team briefing, roles, procedures |
| Failover Execution | 1 hour | Failover procedure execution |
| Validation | 2 hours | Service validation, RTO/RPO verification |
| Post-Drill Briefing | 30 min | Lessons learned, improvement actions |
| Post-Drill Report | 1 week | Documentation, post-mortem report |

DR Drill Team Structure

Incident Commander (IC):

  • Overall responsibility for failover decision
  • Coordinates with all teams
  • Makes go/no-go decisions
  • Communicates with stakeholders

Platform Team:

  • Infrastructure provisioning
  • AKS cluster management
  • Traffic Manager configuration
  • Resource group management

SRE Team:

  • Monitoring and alerting
  • Service health validation
  • Performance monitoring
  • Incident response

Security Team:

  • Authentication verification
  • Authorization validation
  • Key Vault access verification
  • Compliance validation

Service Teams:

  • Application deployment verification
  • Service-specific validation
  • Data integrity checks
  • End-to-end testing

Communication Channels

War Room: Dedicated virtual meeting room (Teams/Zoom)

  • All teams participate
  • Real-time coordination
  • Screen sharing for monitoring

Slack Channel: #atp-dr-war-room

  • Status updates
  • Incident reports
  • Coordination messages
  • Timestamp logs

Email Distribution: atp-dr-alerts@connectsoft.io

  • Stakeholder notifications
  • Executive summaries
  • Post-drill reports
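
The send-notification.sh helper referenced (commented out) in the failover script can be a simple Slack incoming-webhook post to the war-room channel. A minimal sketch, assuming the webhook URL is supplied via the SLACK_WEBHOOK_URL environment variable:

#!/bin/bash
# scripts/send-notification.sh (sketch)
# Posts a failover status message to the #atp-dr-war-room Slack channel.

MESSAGE="${1:?Usage: send-notification.sh <message>}"
SLACK_WEBHOOK_URL="${SLACK_WEBHOOK_URL:?SLACK_WEBHOOK_URL must be set}"  # assumed incoming-webhook URL

PAYLOAD=$(jq -n --arg text "${MESSAGE}" '{text: $text}')

curl -s -X POST \
  -H "Content-Type: application/json" \
  -d "${PAYLOAD}" \
  "${SLACK_WEBHOOK_URL}" > /dev/null \
  && echo "✅ Notification sent" \
  || echo "❌ Notification failed"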

Pre-Drill Checklist

# Pre-Drill Checklist

## Planning (1 week before)
- [ ] DR drill schedule confirmed
- [ ] Stakeholders notified (customers, leadership, compliance)
- [ ] Team assignments confirmed
- [ ] Communication channels set up
- [ ] Monitoring dashboards prepared
- [ ] Runbooks reviewed and updated

## Preparation (1 day before)
- [ ] Secondary region resources verified
- [ ] Data replication status checked
- [ ] Backup procedures verified
- [ ] Failover scripts tested
- [ ] Access credentials verified
- [ ] War room scheduled
- [ ] Slack channel created

## Day of Drill
- [ ] Pre-drill briefing conducted
- [ ] Team members available
- [ ] Monitoring tools accessible
- [ ] Communication channels open
- [ ] Backup procedures ready
- [ ] Rollback plan confirmed

During-Drill Checklist

# During-Drill Checklist

## Failover Execution
- [ ] Region failure simulated/detected
- [ ] Incident commander declares failover
- [ ] Traffic Manager updated (DNS failover)
- [ ] Secondary cluster verified healthy
- [ ] Data replication status verified
- [ ] RPO validated (data loss <1 hour)
- [ ] Traffic re-routed to secondary region
- [ ] Application health monitored
- [ ] Stakeholders notified

## Validation
- [ ] All services operational in secondary region
- [ ] No data corruption detected
- [ ] Monitoring and alerting functional
- [ ] Authentication working (Azure AD)
- [ ] Authorization working (OPA)
- [ ] Key Vault access verified
- [ ] End-to-end functionality tested
- [ ] Performance metrics within acceptable range

## Documentation
- [ ] RTO achieved documented
- [ ] RPO achieved documented
- [ ] Service status documented
- [ ] Issues encountered documented
- [ ] Lessons learned captured

Post-Drill Checklist

# Post-Drill Checklist

## Immediate (Within 1 hour)
- [ ] Post-drill briefing conducted
- [ ] Initial findings documented
- [ ] Critical issues identified
- [ ] Rollback completed (if drill mode)
- [ ] Services restored to primary region (if drill mode)

## Short-term (Within 1 week)
- [ ] Post-mortem report completed
- [ ] Improvement actions identified
- [ ] Runbooks updated
- [ ] Procedures refined
- [ ] Stakeholder report distributed
- [ ] Compliance documentation updated

## Long-term (Within 1 month)
- [ ] Improvement actions implemented
- [ ] Next drill scheduled
- [ ] Training completed
- [ ] Documentation finalized

DR Drill Validation Criteria

RTO Validation

RTO Measurement Script:

#!/bin/bash
# scripts/validate-rto.sh

FAILOVER_START="${1}"  # Unix timestamp
FAILOVER_COMPLETE="${2}"  # Unix timestamp
RTO_TARGET=1800  # 30 minutes in seconds

if [ -z "${FAILOVER_START}" ] || [ -z "${FAILOVER_COMPLETE}" ]; then
  echo "❌ Usage: validate-rto.sh <failover_start_timestamp> <failover_complete_timestamp>"
  exit 1
fi

RTO_ACHIEVED=$((FAILOVER_COMPLETE - FAILOVER_START))

echo "RTO Validation"
echo "=============="
echo "RTO Target: ${RTO_TARGET} seconds (30 minutes)"
echo "RTO Achieved: ${RTO_ACHIEVED} seconds"
echo ""

if (( RTO_ACHIEVED <= RTO_TARGET )); then
  echo "✅ RTO TARGET ACHIEVED"
  echo "   Achieved: ${RTO_ACHIEVED}s (Target: ${RTO_TARGET}s)"
  exit 0
else
  echo "❌ RTO TARGET NOT ACHIEVED"
  echo "   Achieved: ${RTO_ACHIEVED}s (Target: ${RTO_TARGET}s)"
  echo "   Over by: $((RTO_ACHIEVED - RTO_TARGET))s"
  exit 1
fi

RPO Validation

RPO Measurement Script:

#!/bin/bash
# scripts/validate-rpo.sh

SECONDARY_SERVER="${1:-atp-sql-westeurope}"
RESOURCE_GROUP="${2:-atp-production}"
RPO_TARGET=3600  # 1 hour in seconds

# Get latest transaction timestamp from secondary database
LATEST_TRANSACTION=$(az sql db query \
  --resource-group ${RESOURCE_GROUP} \
  --server ${SECONDARY_SERVER} \
  --database atp-database \
  --query-text "SELECT MAX(LastModified) as LatestTransaction FROM AuditEvents" \
  --output tsv 2>/dev/null)

if [ -z "${LATEST_TRANSACTION}" ] || [ "${LATEST_TRANSACTION}" = "N/A" ]; then
  echo "❌ Unable to retrieve latest transaction timestamp"
  exit 1
fi

CURRENT_TIME=$(date -u +"%Y-%m-%d %H:%M:%S")
RPO_AGE=$(date -u -d "${LATEST_TRANSACTION}" +%s 2>/dev/null)
CURRENT_AGE=$(date -u -d "${CURRENT_TIME}" +%s)
RPO_ACHIEVED=$((CURRENT_AGE - RPO_AGE))

echo "RPO Validation"
echo "=============="
echo "RPO Target: ${RPO_TARGET} seconds (1 hour)"
echo "Latest Transaction: ${LATEST_TRANSACTION}"
echo "Current Time: ${CURRENT_TIME}"
echo "RPO Achieved: ${RPO_ACHIEVED} seconds"
echo ""

if (( RPO_ACHIEVED <= RPO_TARGET )); then
  echo "✅ RPO TARGET ACHIEVED"
  echo "   Achieved: ${RPO_ACHIEVED}s (Target: ${RPO_TARGET}s)"
  exit 0
else
  echo "❌ RPO TARGET NOT ACHIEVED"
  echo "   Achieved: ${RPO_ACHIEVED}s (Target: ${RPO_TARGET}s)"
  echo "   Over by: $((RPO_ACHIEVED - RPO_TARGET))s"
  exit 1
fi

Service Availability Validation

Service Availability Validation Script:

#!/bin/bash
# scripts/validate-service-availability.sh

SECONDARY_AKS="${1:-atp-aks-westeurope}"
NAMESPACE="${2:-atp-production}"
AVAILABILITY_TARGET=99.9  # 99.9%

echo "Service Availability Validation"
echo "==============================="

# Get all services
SERVICES=$(kubectl get svc -n ${NAMESPACE} --context ${SECONDARY_AKS} \
  -o jsonpath='{.items[*].metadata.name}')

TOTAL_SERVICES=0
HEALTHY_SERVICES=0

for service in ${SERVICES}; do
  TOTAL_SERVICES=$((TOTAL_SERVICES + 1))

  # Check service endpoints
  ENDPOINTS=$(kubectl get endpoints ${service} -n ${NAMESPACE} --context ${SECONDARY_AKS} \
    -o jsonpath='{.subsets[0].addresses[*].ip}' 2>/dev/null || echo "")

  if [ -n "${ENDPOINTS}" ]; then
    HEALTHY_SERVICES=$((HEALTHY_SERVICES + 1))
    echo "✅ ${service}: Healthy"
  else
    echo "❌ ${service}: No endpoints"
  fi
done

AVAILABILITY_PERCENT=$(echo "scale=2; ${HEALTHY_SERVICES} * 100 / ${TOTAL_SERVICES}" | bc)

echo ""
echo "Service Availability: ${AVAILABILITY_PERCENT}% (Target: ${AVAILABILITY_TARGET}%)"
echo "Healthy Services: ${HEALTHY_SERVICES}/${TOTAL_SERVICES}"

if (( $(echo "${AVAILABILITY_PERCENT} >= ${AVAILABILITY_TARGET}" | bc -l) )); then
  echo "✅ SERVICE AVAILABILITY TARGET ACHIEVED"
  exit 0
else
  echo "❌ SERVICE AVAILABILITY TARGET NOT ACHIEVED"
  exit 1
fi

Data Integrity Validation

Data Integrity Validation Script:

#!/bin/bash
# scripts/validate-data-integrity.sh

SECONDARY_SERVER="${1:-atp-sql-westeurope}"
RESOURCE_GROUP="${2:-atp-production}"

echo "Data Integrity Validation"
echo "========================="

# Check for data corruption (checksum validation)
echo "Checking data integrity..."
CHECKSUM_RESULT=$(az sql db query \
  --resource-group ${RESOURCE_GROUP} \
  --server ${SECONDARY_SERVER} \
  --database atp-database \
  --query-text "SELECT COUNT(*) as CorruptedRecords FROM AuditEvents WHERE CHECKSUM(EventData) != StoredChecksum" \
  --output tsv 2>/dev/null || echo "ERROR")

if [ "${CHECKSUM_RESULT}" = "ERROR" ]; then
  echo "❌ Unable to validate data integrity"
  exit 1
fi

if [ "${CHECKSUM_RESULT}" = "0" ]; then
  echo "✅ No data corruption detected"
  echo "   Corrupted records: 0"
  exit 0
else
  echo "❌ Data corruption detected"
  echo "   Corrupted records: ${CHECKSUM_RESULT}"
  exit 1
fi

Monitoring and Alerting Validation

Monitoring Validation Script:

#!/bin/bash
# scripts/validate-monitoring.sh

SECONDARY_REGION="${1:-westeurope}"
RESOURCE_GROUP="${2:-atp-production}"

echo "Monitoring and Alerting Validation"
echo "==================================="

# Check Application Insights
APPINSIGHTS="atp-appinsights-${SECONDARY_REGION}"
AI_STATUS=$(az monitor app-insights component show \
  --app ${APPINSIGHTS} \
  --resource-group ${RESOURCE_GROUP} \
  --query "state" \
  --output tsv 2>/dev/null || echo "ERROR")

if [ "${AI_STATUS}" = "ERROR" ]; then
  echo "❌ Application Insights not accessible"
  exit 1
fi

if [ "${AI_STATUS}" = "Succeeded" ]; then
  echo "✅ Application Insights: Operational"
else
  echo "⚠️  Application Insights: ${AI_STATUS}"
fi

# Check Log Analytics workspace
LOG_ANALYTICS="atp-loganalytics-${SECONDARY_REGION}"
LA_STATUS=$(az monitor log-analytics workspace show \
  --resource-group ${RESOURCE_GROUP} \
  --workspace-name ${LOG_ANALYTICS} \
  --query "provisioningState" \
  --output tsv 2>/dev/null || echo "ERROR")

if [ "${LA_STATUS}" = "ERROR" ]; then
  echo "❌ Log Analytics workspace not accessible"
  exit 1
fi

if [ "${LA_STATUS}" = "Succeeded" ]; then
  echo "✅ Log Analytics: Operational"
else
  echo "⚠️  Log Analytics: ${LA_STATUS}"
fi

echo "✅ Monitoring and alerting functional"
exit 0

Authentication Validation

Authentication Validation Script:

#!/bin/bash
# scripts/validate-authentication.sh

SECONDARY_REGION="${1:-westeurope}"
TEST_ENDPOINT="${2:-https://atp-api.${SECONDARY_REGION}.connectsoft.io/health}"

echo "Authentication Validation"
echo "========================="

# Test Azure AD authentication
echo "Testing Azure AD authentication..."

# Attempt to get access token
TOKEN_RESPONSE=$(curl -s -X POST \
  "https://login.microsoftonline.com/{tenant-id}/oauth2/v2.0/token" \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -d "client_id={client-id}" \
  -d "scope=api://atp-api/.default" \
  -d "client_secret={client-secret}" \
  -d "grant_type=client_credentials" 2>/dev/null || echo "ERROR")

if [ "${TOKEN_RESPONSE}" = "ERROR" ] || echo "${TOKEN_RESPONSE}" | jq -e '.access_token' > /dev/null 2>&1; then
  if echo "${TOKEN_RESPONSE}" | jq -e '.access_token' > /dev/null 2>&1; then
    echo "✅ Azure AD authentication: Working"

    # Test authenticated API call
    ACCESS_TOKEN=$(echo "${TOKEN_RESPONSE}" | jq -r '.access_token')
    API_RESPONSE=$(curl -s -H "Authorization: Bearer ${ACCESS_TOKEN}" ${TEST_ENDPOINT} 2>/dev/null || echo "ERROR")

    if [ "${API_RESPONSE}" != "ERROR" ] && echo "${API_RESPONSE}" | jq -e '.status' > /dev/null 2>&1; then
      echo "✅ Authenticated API call: Working"
      exit 0
    else
      echo "❌ Authenticated API call: Failed"
      exit 1
    fi
  else
    echo "❌ Azure AD authentication: Failed"
    exit 1
  fi
else
  echo "❌ Azure AD token request: Failed"
  exit 1
fi

DR Drill Validation Summary

#!/bin/bash
# scripts/validate-dr-drill.sh

PRIMARY_REGION="${1:-eastus}"
SECONDARY_REGION="${2:-westeurope}"
RESOURCE_GROUP="${3:-atp-production}"
FAILOVER_START="${4:-$(date +%s)}"
FAILOVER_COMPLETE="${5:-$(date +%s)}"

echo "=========================================="
echo "DR Drill Validation Summary"
echo "=========================================="
echo "Primary Region: ${PRIMARY_REGION}"
echo "Secondary Region: ${SECONDARY_REGION}"
echo "Resource Group: ${RESOURCE_GROUP}"
echo ""

VALIDATION_RESULTS=()

# RTO Validation
echo "1. RTO Validation"
./scripts/validate-rto.sh ${FAILOVER_START} ${FAILOVER_COMPLETE}
RTO_RESULT=$?
VALIDATION_RESULTS+=(${RTO_RESULT})
echo ""

# RPO Validation
echo "2. RPO Validation"
./scripts/validate-rpo.sh "atp-sql-${SECONDARY_REGION}" ${RESOURCE_GROUP}
RPO_RESULT=$?
VALIDATION_RESULTS+=(${RPO_RESULT})
echo ""

# Service Availability Validation
echo "3. Service Availability Validation"
./scripts/validate-service-availability.sh "atp-aks-${SECONDARY_REGION}" "atp-production"
AVAILABILITY_RESULT=$?
VALIDATION_RESULTS+=(${AVAILABILITY_RESULT})
echo ""

# Data Integrity Validation
echo "4. Data Integrity Validation"
./scripts/validate-data-integrity.sh "atp-sql-${SECONDARY_REGION}" ${RESOURCE_GROUP}
INTEGRITY_RESULT=$?
VALIDATION_RESULTS+=(${INTEGRITY_RESULT})
echo ""

# Monitoring Validation
echo "5. Monitoring and Alerting Validation"
./scripts/validate-monitoring.sh ${SECONDARY_REGION} ${RESOURCE_GROUP}
MONITORING_RESULT=$?
VALIDATION_RESULTS+=(${MONITORING_RESULT})
echo ""

# Authentication Validation
echo "6. Authentication Validation"
./scripts/validate-authentication.sh ${SECONDARY_REGION}
AUTH_RESULT=$?
VALIDATION_RESULTS+=(${AUTH_RESULT})
echo ""

# Summary
echo "=========================================="
echo "Validation Summary"
echo "=========================================="

TOTAL_VALIDATIONS=6
PASSED_VALIDATIONS=0

for result in "${VALIDATION_RESULTS[@]}"; do
  if [ "${result}" = "0" ]; then
    PASSED_VALIDATIONS=$((PASSED_VALIDATIONS + 1))
  fi
done

echo "Passed: ${PASSED_VALIDATIONS}/${TOTAL_VALIDATIONS}"

if [ ${PASSED_VALIDATIONS} -eq ${TOTAL_VALIDATIONS} ]; then
  echo "✅ ALL VALIDATIONS PASSED"
  exit 0
else
  echo "❌ SOME VALIDATIONS FAILED"
  exit 1
fi

Expected Metrics

| Metric | Target | Validation Method |
|---|---|---|
| RTO | ≤30 minutes | Time from failure to full traffic in secondary region |
| RPO | ≤1 hour | Maximum data loss (async replication lag) |
| Service Availability | ≥99.9% | All critical services operational |
| Data Integrity | 100% | No data corruption detected |
| Monitoring | 100% | All monitoring tools functional |
| Authentication | 100% | Azure AD multi-region authentication working |

Summary: Regional Failover Drill

  • Full Region Failover Scenario: Validates complete regional failover from East US to West Europe with RTO target (30 minutes) and RPO target (1 hour); expects automated failover detection, incident commander declaration, DNS failover, cluster verification, data replication validation, traffic re-routing, and stakeholder notification
  • Failover Procedure: Comprehensive 10-step failover procedure including failure detection, incident declaration, Traffic Manager update, cluster verification, data replication check, RPO validation, traffic routing, health monitoring, stakeholder notification, and RTO/RPO documentation
  • DR Drill Execution: Quarterly scheduled drills (Q1, Q3) with 4-hour duration, structured team assignments (Platform, SRE, Security, Service teams), dedicated communication channels (war room, Slack), and stakeholder notifications (leadership, compliance, customers)
  • DR Drill Validation Criteria: Comprehensive validation of RTO achievement (<30 minutes), RPO verification (<1 hour data loss), service availability (all services operational), data integrity (no corruption), monitoring/alerting functionality, and authentication working (Azure AD multi-region)
  • Monitoring and Validation: Comprehensive scripts for RTO/RPO validation, service availability validation, data integrity validation, monitoring validation, authentication validation, and overall DR drill validation summary

Data Recovery Drills

Purpose: Define comprehensive data recovery drill procedures for ATP, validating backup restoration capabilities, point-in-time recovery, and corruption recovery to ensure ATP data can be recovered and restored with integrity and completeness during data loss or corruption scenarios.


Backup Restoration Drill

Backup restoration drill experiments validate that ATP services can restore from Azure Backup successfully with backup integrity validation, restoration time measurement, and data completeness verification.

Hypothesis

"When data is lost or corrupted, ATP services will restore from Azure Backup within acceptable restoration time, validate backup integrity before restoration, verify data completeness after restoration, and ensure all restored data is consistent and functional."

Experiment Configuration

Backup Restoration Script:

#!/bin/bash
# scripts/execute-backup-restoration-drill.sh

SQL_SERVER="${1:-atp-sql-eastus}"
RESOURCE_GROUP="${2:-atp-production}"
DATABASE_NAME="${3:-atp-database}"
BACKUP_RETENTION_DAYS="${4:-30}"
TEST_NAMESPACE="${5:-atp-restore-test}"

RESTORATION_START=$(date +%s)

echo "🧪 Starting Backup Restoration Drill"
echo "SQL Server: ${SQL_SERVER}"
echo "Database: ${DATABASE_NAME}"
echo "Resource Group: ${RESOURCE_GROUP}"
echo "Start Time: $(date -u +"%Y-%m-%d %H:%M:%S UTC")"
echo ""

# Step 1: List available backups
echo "Step 1: Listing available backups..."
echo "Getting backup list from Azure Backup..."

# Get available backup points
BACKUP_LIST=$(az backup recoverypoint list \
  --resource-group ${RESOURCE_GROUP} \
  --vault-name atp-backup-vault \
  --container-name "IaasVMContainer;${SQL_SERVER}" \
  --item-name "SQLDataBase;${DATABASE_NAME}" \
  --query "[?properties.recoveryPointTime >= \`$(date -u -d "${BACKUP_RETENTION_DAYS} days ago" +"%Y-%m-%dT%H:%M:%SZ")\`].{Time:properties.recoveryPointTime, Type:properties.recoveryPointType}" \
  --output table 2>/dev/null || echo "ERROR")

if [ "${BACKUP_LIST}" = "ERROR" ]; then
  echo "❌ Unable to retrieve backup list"
  echo "⚠️  Attempting alternative method using Azure SQL backup history..."

  # Alternative: Get backup history from Azure SQL
  BACKUP_HISTORY=$(az sql db list-restorable-dropped \
    --resource-group ${RESOURCE_GROUP} \
    --server ${SQL_SERVER} \
    --query "[?databaseName == '${DATABASE_NAME}'].{Name:name, DeletionDate:deletionDate, EarliestRestoreDate:earliestRestoreDate}" \
    --output table)

  echo "Backup history:"
  echo "${BACKUP_HISTORY}"
fi

# Select latest backup (or specific backup if provided)
SELECTED_BACKUP_TIME="${6:-$(date -u -d "1 day ago" +"%Y-%m-%dT%H:%M:%SZ")}"
echo "Selected backup time: ${SELECTED_BACKUP_TIME}"
echo ""

# Step 2: Validate backup integrity
echo "Step 2: Validating backup integrity..."
echo "Checking backup metadata and integrity..."

# Get backup metadata
BACKUP_METADATA=$(az backup recoverypoint show \
  --resource-group ${RESOURCE_GROUP} \
  --vault-name atp-backup-vault \
  --container-name "IaasVMContainer;${SQL_SERVER}" \
  --item-name "SQLDataBase;${DATABASE_NAME}" \
  --name "${SELECTED_BACKUP_TIME}" \
  --query "{Size:properties.backupSizeInGB, Type:properties.recoveryPointType, Time:properties.recoveryPointTime, Consistency:properties.isConsistent}" \
  --output json 2>/dev/null || echo "{}")

if [ "${BACKUP_METADATA}" = "{}" ]; then
  echo "⚠️  Unable to retrieve backup metadata via Azure Backup"
  echo "Using Azure SQL backup metadata instead..."

  # Get backup metadata from Azure SQL
  BACKUP_SIZE=$(az sql db show \
    --resource-group ${RESOURCE_GROUP} \
    --server ${SQL_SERVER} \
    --name ${DATABASE_NAME} \
    --query "currentBackupStorageRedundancy" \
    --output tsv 2>/dev/null || echo "N/A")

  echo "Backup size: ${BACKUP_SIZE}"
else
  BACKUP_SIZE=$(echo "${BACKUP_METADATA}" | jq -r '.Size')
  BACKUP_TYPE=$(echo "${BACKUP_METADATA}" | jq -r '.Type')
  BACKUP_CONSISTENT=$(echo "${BACKUP_METADATA}" | jq -r '.Consistency')

  echo "Backup metadata:"
  echo "  Size: ${BACKUP_SIZE} GB"
  echo "  Type: ${BACKUP_TYPE}"
  echo "  Consistent: ${BACKUP_CONSISTENT}"

  if [ "${BACKUP_CONSISTENT}" != "true" ]; then
    echo "⚠️  WARNING: Backup is not marked as consistent"
  else
    echo "✅ Backup integrity validated (consistent)"
  fi
fi

echo ""

# Step 3: Perform restoration
echo "Step 3: Performing restoration..."
RESTORE_DATABASE_NAME="${DATABASE_NAME}-restored-$(date +%Y%m%d-%H%M%S)"

# Restore database to new database name (test restoration)
az sql db restore \
  --resource-group ${RESOURCE_GROUP} \
  --server ${SQL_SERVER} \
  --name ${DATABASE_NAME} \
  --dest-name ${RESTORE_DATABASE_NAME} \
  --restore-point-in-time "${SELECTED_BACKUP_TIME}" \
  --output none

RESTORATION_WAIT_START=$(date +%s)

# Wait for restoration to complete
echo "Waiting for restoration to complete..."
MAX_WAIT=3600  # 1 hour
ELAPSED=0
RESTORATION_COMPLETE=false

while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
  RESTORE_STATUS=$(az sql db show \
    --resource-group ${RESOURCE_GROUP} \
    --server ${SQL_SERVER} \
    --name ${RESTORE_DATABASE_NAME} \
    --query "status" \
    --output tsv 2>/dev/null || echo "NOT_FOUND")

  if [ "${RESTORE_STATUS}" = "Online" ]; then
    RESTORATION_COMPLETE=true
    RESTORATION_END=$(date +%s)
    RESTORATION_DURATION=$((RESTORATION_END - RESTORATION_WAIT_START))
    echo "✅ Database restored successfully in ${RESTORATION_DURATION} seconds"
    break
  elif [ "${RESTORE_STATUS}" = "NOT_FOUND" ]; then
    echo "Restoration in progress... (${ELAPSED}s/${MAX_WAIT}s)"
  else
    echo "Restoration status: ${RESTORE_STATUS} (${ELAPSED}s/${MAX_WAIT}s)"
  fi

  sleep 30
  ELAPSED=$((ELAPSED + 30))
done

if [ "${RESTORATION_COMPLETE}" = false ]; then
  echo "❌ Restoration did not complete within ${MAX_WAIT} seconds"
  exit 1
fi

echo ""

# Step 4: Validate data completeness
echo "Step 4: Validating data completeness..."

# Get record counts from original and restored database
ORIGINAL_COUNT=$(az sql db query \
  --resource-group ${RESOURCE_GROUP} \
  --server ${SQL_SERVER} \
  --database ${DATABASE_NAME} \
  --query-text "SELECT COUNT(*) as RecordCount FROM AuditEvents" \
  --output tsv 2>/dev/null || echo "0")

RESTORED_COUNT=$(az sql db query \
  --resource-group ${RESOURCE_GROUP} \
  --server ${SQL_SERVER} \
  --database ${RESTORE_DATABASE_NAME} \
  --query-text "SELECT COUNT(*) as RecordCount FROM AuditEvents" \
  --output tsv 2>/dev/null || echo "0")

echo "Record counts:"
echo "  Original database: ${ORIGINAL_COUNT} records"
echo "  Restored database: ${RESTORED_COUNT} records"

# Calculate completeness percentage
if [ "${ORIGINAL_COUNT}" != "0" ]; then
  COMPLETENESS_PERCENT=$(echo "scale=2; ${RESTORED_COUNT} * 100 / ${ORIGINAL_COUNT}" | bc)
  echo "  Completeness: ${COMPLETENESS_PERCENT}%"

  if (( $(echo "${COMPLETENESS_PERCENT} >= 95" | bc -l) )); then
    echo "✅ Data completeness validated (${COMPLETENESS_PERCENT}% >= 95%)"
  else
    echo "⚠️  WARNING: Data completeness below threshold (${COMPLETENESS_PERCENT}% < 95%)"
  fi
else
  echo "⚠️  Unable to calculate completeness (original count = 0)"
fi

# Validate key tables
echo ""
echo "Validating key tables..."

KEY_TABLES=("AuditEvents" "Tenants" "Policies" "Configurations")

for table in "${KEY_TABLES[@]}"; do
  ORIGINAL_TABLE_COUNT=$(az sql db query \
    --resource-group ${RESOURCE_GROUP} \
    --server ${SQL_SERVER} \
    --database ${DATABASE_NAME} \
    --query-text "SELECT COUNT(*) as RecordCount FROM ${table}" \
    --output tsv 2>/dev/null || echo "0")

  RESTORED_TABLE_COUNT=$(az sql db query \
    --resource-group ${RESOURCE_GROUP} \
    --server ${SQL_SERVER} \
    --database ${RESTORE_DATABASE_NAME} \
    --query-text "SELECT COUNT(*) as RecordCount FROM ${table}" \
    --output tsv 2>/dev/null || echo "0")

  if [ "${ORIGINAL_TABLE_COUNT}" = "${RESTORED_TABLE_COUNT}" ]; then
    echo "  ✅ ${table}: ${RESTORED_TABLE_COUNT} records (match)"
  else
    echo "  ⚠️  ${table}: ${RESTORED_TABLE_COUNT} records (original: ${ORIGINAL_TABLE_COUNT})"
  fi
done

echo ""

# Step 5: Validate data integrity
echo "Step 5: Validating data integrity..."

# Check for data corruption using checksums
INTEGRITY_CHECK=$(az sql db query \
  --resource-group ${RESOURCE_GROUP} \
  --server ${SQL_SERVER} \
  --database ${RESTORE_DATABASE_NAME} \
  --query-text "SELECT COUNT(*) as CorruptedRecords FROM AuditEvents WHERE CHECKSUM(EventData) != StoredChecksum" \
  --output tsv 2>/dev/null || echo "ERROR")

if [ "${INTEGRITY_CHECK}" = "ERROR" ]; then
  echo "⚠️  Unable to perform integrity check"
elif [ "${INTEGRITY_CHECK}" = "0" ]; then
  echo "✅ Data integrity validated (no corruption detected)"
else
  echo "❌ Data integrity check failed: ${INTEGRITY_CHECK} corrupted records"
fi

echo ""

# Generate restoration report
RESTORATION_REPORT_FILE="backup-restoration-report-$(date +%Y%m%d-%H%M%S).json"

cat > ${RESTORATION_REPORT_FILE} <<EOF
{
  "drillId": "backup-restore-$(date +%Y%m%d-%H%M%S)",
  "timestamp": "$(date -u +"%Y-%m-%dT%H:%M:%SZ")",
  "sqlServer": "${SQL_SERVER}",
  "databaseName": "${DATABASE_NAME}",
  "restoredDatabaseName": "${RESTORE_DATABASE_NAME}",
  "backupTime": "${SELECTED_BACKUP_TIME}",
  "restorationStartTime": "$(date -u -d @${RESTORATION_START} +"%Y-%m-%dT%H:%M:%SZ")",
  "restorationEndTime": "$(date -u -d @${RESTORATION_END} +"%Y-%m-%dT%H:%M:%SZ")",
  "restorationDuration": ${RESTORATION_DURATION},
  "backupSize": "${BACKUP_SIZE}",
  "backupType": "${BACKUP_TYPE}",
  "backupConsistent": ${BACKUP_CONSISTENT},
  "dataCompleteness": {
    "originalCount": ${ORIGINAL_COUNT},
    "restoredCount": ${RESTORED_COUNT},
    "completenessPercent": ${COMPLETENESS_PERCENT}
  },
  "integrityCheck": {
    "corruptedRecords": ${INTEGRITY_CHECK},
    "status": "$(if [ "${INTEGRITY_CHECK}" = "0" ]; then echo "PASS"; else echo "FAIL"; fi)"
  }
}
EOF

echo "✅ Restoration report generated: ${RESTORATION_REPORT_FILE}"
echo ""
echo "=========================================="
echo "Backup Restoration Drill Summary"
echo "=========================================="
echo "Restoration Duration: ${RESTORATION_DURATION} seconds"
echo "Data Completeness: ${COMPLETENESS_PERCENT}%"
echo "Data Integrity: $(if [ "${INTEGRITY_CHECK}" = "0" ]; then echo "✅ PASS"; else echo "❌ FAIL"; fi)"
echo ""

Expected Behavior

Restoration Phase (0-60 minutes):

  • Backup listing: Available backups listed and validated
  • Backup integrity: Backup metadata validated (size, type, consistency)
  • Restoration: Database restored to test database
  • Completion: Restoration completes and database comes online

Validation Phase (60-90 minutes):

  • Data completeness: Record counts validated (original vs restored)
  • Table validation: Key tables validated for completeness
  • Data integrity: Checksum validation performed
  • Report generation: Restoration report generated

Expected Metrics

| Metric | Target | Validation Method |
|--------|--------|-------------------|
| Restoration Time | <60 minutes | Time from restoration start to database online |
| Backup Integrity | 100% | Backup metadata validation (consistent = true) |
| Data Completeness | ≥95% | Record count comparison (restored vs original) |
| Data Integrity | 100% | Checksum validation (no corrupted records) |
| Table Completeness | 100% | All key tables validated (count match) |

Validation Criteria

Success Criteria:

  • ✅ Backup integrity validated (consistent = true)
  • ✅ Restoration completed within 60 minutes
  • ✅ Data completeness ≥95%
  • ✅ Data integrity validated (no corruption)
  • ✅ All key tables validated
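
The restoration report generated by the drill script contains enough detail to evaluate these criteria automatically. A minimal checker sketch, assuming the report layout produced above (the helper script name is illustrative, not an existing ATP script):

#!/bin/bash
# scripts/check-backup-restoration-report.sh (illustrative sketch)
# Evaluates a backup restoration report against the drill success criteria

REPORT_FILE="${1:?Usage: $0 <restoration-report.json>}"

DURATION=$(jq -r '.restorationDuration' "${REPORT_FILE}")
COMPLETENESS=$(jq -r '.dataCompleteness.completenessPercent' "${REPORT_FILE}")
INTEGRITY=$(jq -r '.integrityCheck.status' "${REPORT_FILE}")

PASS=true

# Restoration must finish within the 60-minute budget (3600 seconds)
(( DURATION <= 3600 )) || { echo "❌ Restoration took ${DURATION}s (>60 minutes)"; PASS=false; }

# Data completeness must be at least 95%
(( $(echo "${COMPLETENESS} >= 95" | bc -l) )) || { echo "❌ Completeness ${COMPLETENESS}% (<95%)"; PASS=false; }

# Integrity check must have passed
[ "${INTEGRITY}" = "PASS" ] || { echo "❌ Integrity check status: ${INTEGRITY}"; PASS=false; }

if [ "${PASS}" = true ]; then
  echo "✅ Backup restoration drill meets all success criteria"
else
  exit 1
fi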

Point-in-Time Recovery

Point-in-time recovery drill experiments validate that ATP services can restore to a specific timestamp with transaction consistency validation and selective recovery capabilities.

Hypothesis

"When data needs to be restored to a specific point in time, ATP services will restore the database to the requested timestamp, validate transaction consistency, support selective recovery for specific tenants, and ensure all restored data is consistent and functional."

Experiment Configuration

Point-in-Time Recovery Script:

#!/bin/bash
# scripts/execute-point-in-time-recovery-drill.sh

SQL_SERVER="${1:-atp-sql-eastus}"
RESOURCE_GROUP="${2:-atp-production}"
DATABASE_NAME="${3:-atp-database}"
TARGET_TIMESTAMP="${4:-$(date -u -d "2 hours ago" +"%Y-%m-%dT%H:%M:%SZ")}"
TENANT_ID="${5:-}"  # Optional: specific tenant for selective recovery

RECOVERY_START=$(date +%s)

echo "🧪 Starting Point-in-Time Recovery Drill"
echo "SQL Server: ${SQL_SERVER}"
echo "Database: ${DATABASE_NAME}"
echo "Target Timestamp: ${TARGET_TIMESTAMP}"
echo "Tenant ID: ${TENANT_ID:-All tenants}"
echo "Start Time: $(date -u +"%Y-%m-%d %H:%M:%S UTC")"
echo ""

# Step 1: Validate target timestamp
echo "Step 1: Validating target timestamp..."
TARGET_EPOCH=$(date -u -d "${TARGET_TIMESTAMP}" +%s 2>/dev/null || echo "0")
CURRENT_EPOCH=$(date +%s)
EARLIEST_RECOVERY=$(date -u -d "35 days ago" +%s)  # Azure SQL PITR limit

if [ "${TARGET_EPOCH}" = "0" ]; then
  echo "❌ Invalid target timestamp: ${TARGET_TIMESTAMP}"
  exit 1
fi

if (( TARGET_EPOCH < EARLIEST_RECOVERY )); then
  echo "❌ Target timestamp is beyond recovery window (35 days)"
  exit 1
fi

if (( TARGET_EPOCH > CURRENT_EPOCH )); then
  echo "❌ Target timestamp is in the future"
  exit 1
fi

echo "✅ Target timestamp validated"
echo "  Target: ${TARGET_TIMESTAMP}"
echo "  Current: $(date -u +"%Y-%m-%dT%H:%M:%SZ")"
echo ""

# Step 2: Get data state at target timestamp
echo "Step 2: Getting data state at target timestamp..."
echo "Querying data state from backup history..."

# Get transaction count at target timestamp
TRANSACTION_COUNT_AT_TARGET=$(az sql db query \
  --resource-group ${RESOURCE_GROUP} \
  --server ${SQL_SERVER} \
  --database ${DATABASE_NAME} \
  --query-text "SELECT COUNT(*) as TransactionCount FROM AuditEvents WHERE LastModified <= '${TARGET_TIMESTAMP}'" \
  --output tsv 2>/dev/null || echo "0")

echo "Transactions at target timestamp: ${TRANSACTION_COUNT_AT_TARGET}"
echo ""

# Step 3: Perform point-in-time recovery
echo "Step 3: Performing point-in-time recovery..."
RECOVERED_DATABASE_NAME="${DATABASE_NAME}-pitr-$(date +%Y%m%d-%H%M%S)"

if [ -n "${TENANT_ID}" ]; then
  echo "⚠️  Selective recovery mode: Tenant ${TENANT_ID} only"
  # Note: Selective recovery requires custom logic or separate database restore
  # For this drill, we'll restore full database and then filter
fi

# Restore database to point in time
az sql db restore \
  --resource-group ${RESOURCE_GROUP} \
  --server ${SQL_SERVER} \
  --name ${DATABASE_NAME} \
  --dest-name ${RECOVERED_DATABASE_NAME} \
  --restore-point-in-time "${TARGET_TIMESTAMP}" \
  --output none

RECOVERY_WAIT_START=$(date +%s)

# Wait for recovery to complete
echo "Waiting for point-in-time recovery to complete..."
MAX_WAIT=3600  # 1 hour
ELAPSED=0
RECOVERY_COMPLETE=false

while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
  RECOVERY_STATUS=$(az sql db show \
    --resource-group ${RESOURCE_GROUP} \
    --server ${SQL_SERVER} \
    --name ${RECOVERED_DATABASE_NAME} \
    --query "status" \
    --output tsv 2>/dev/null || echo "NOT_FOUND")

  if [ "${RECOVERY_STATUS}" = "Online" ]; then
    RECOVERY_COMPLETE=true
    RECOVERY_END=$(date +%s)
    RECOVERY_DURATION=$((RECOVERY_END - RECOVERY_WAIT_START))
    echo "✅ Database recovered successfully in ${RECOVERY_DURATION} seconds"
    break
  elif [ "${RECOVERY_STATUS}" = "NOT_FOUND" ]; then
    echo "Recovery in progress... (${ELAPSED}s/${MAX_WAIT}s)"
  else
    echo "Recovery status: ${RECOVERY_STATUS} (${ELAPSED}s/${MAX_WAIT}s)"
  fi

  sleep 30
  ELAPSED=$((ELAPSED + 30))
done

if [ "${RECOVERY_COMPLETE}" = false ]; then
  echo "❌ Recovery did not complete within ${MAX_WAIT} seconds"
  exit 1
fi

echo ""

# Step 4: Validate transaction consistency
echo "Step 4: Validating transaction consistency..."

# Get recovered transaction count
RECOVERED_TRANSACTION_COUNT=$(az sql db query \
  --resource-group ${RESOURCE_GROUP} \
  --server ${SQL_SERVER} \
  --database ${RECOVERED_DATABASE_NAME} \
  --query-text "SELECT COUNT(*) as TransactionCount FROM AuditEvents" \
  --output tsv 2>/dev/null || echo "0")

echo "Transaction counts:"
echo "  Expected at target timestamp: ${TRANSACTION_COUNT_AT_TARGET}"
echo "  Recovered: ${RECOVERED_TRANSACTION_COUNT}"

# Validate transaction count matches
if [ "${TRANSACTION_COUNT_AT_TARGET}" = "${RECOVERED_TRANSACTION_COUNT}" ]; then
  echo "✅ Transaction count matches"
else
  DIFF=$((TRANSACTION_COUNT_AT_TARGET - RECOVERED_TRANSACTION_COUNT))
  if (( DIFF < 0 )); then
    DIFF=$((DIFF * -1))
  fi
  echo "⚠️  Transaction count mismatch: ${DIFF} difference"
fi

# Validate no transactions after target timestamp
TRANSACTIONS_AFTER_TARGET=$(az sql db query \
  --resource-group ${RESOURCE_GROUP} \
  --server ${SQL_SERVER} \
  --database ${RECOVERED_DATABASE_NAME} \
  --query-text "SELECT COUNT(*) as TransactionCount FROM AuditEvents WHERE LastModified > '${TARGET_TIMESTAMP}'" \
  --output tsv 2>/dev/null || echo "0")

if [ "${TRANSACTIONS_AFTER_TARGET}" = "0" ]; then
  echo "✅ No transactions after target timestamp (consistent)"
else
  echo "❌ Transactions found after target timestamp: ${TRANSACTIONS_AFTER_TARGET}"
fi

# Validate transaction integrity
echo ""
echo "Validating transaction integrity..."

# Check for orphaned transactions
ORPHANED_TRANSACTIONS=$(az sql db query \
  --resource-group ${RESOURCE_GROUP} \
  --server ${SQL_SERVER} \
  --database ${RECOVERED_DATABASE_NAME} \
  --query-text "SELECT COUNT(*) as OrphanedCount FROM AuditEvents a LEFT JOIN Tenants t ON a.TenantId = t.Id WHERE t.Id IS NULL" \
  --output tsv 2>/dev/null || echo "0")

if [ "${ORPHANED_TRANSACTIONS}" = "0" ]; then
  echo "✅ No orphaned transactions detected"
else
  echo "⚠️  Orphaned transactions detected: ${ORPHANED_TRANSACTIONS}"
fi

echo ""

# Step 5: Selective recovery validation (if tenant specified)
if [ -n "${TENANT_ID}" ]; then
  echo "Step 5: Validating selective recovery for tenant ${TENANT_ID}..."

  # Get tenant data in recovered database
  TENANT_DATA_COUNT=$(az sql db query \
    --resource-group ${RESOURCE_GROUP} \
    --server ${SQL_SERVER} \
    --database ${RECOVERED_DATABASE_NAME} \
    --query-text "SELECT COUNT(*) as RecordCount FROM AuditEvents WHERE TenantId = '${TENANT_ID}' AND LastModified <= '${TARGET_TIMESTAMP}'" \
    --output tsv 2>/dev/null || echo "0")

  echo "Tenant ${TENANT_ID} data in recovered database: ${TENANT_DATA_COUNT} records"

  # Get expected tenant data count
  EXPECTED_TENANT_COUNT=$(az sql db query \
    --resource-group ${RESOURCE_GROUP} \
    --server ${SQL_SERVER} \
    --database ${DATABASE_NAME} \
    --query-text "SELECT COUNT(*) as RecordCount FROM AuditEvents WHERE TenantId = '${TENANT_ID}' AND LastModified <= '${TARGET_TIMESTAMP}'" \
    --output tsv 2>/dev/null || echo "0")

  if [ "${TENANT_DATA_COUNT}" = "${EXPECTED_TENANT_COUNT}" ]; then
    echo "✅ Tenant data recovery validated (${TENANT_DATA_COUNT} records)"
  else
    echo "⚠️  Tenant data mismatch: ${TENANT_DATA_COUNT} (expected: ${EXPECTED_TENANT_COUNT})"
  fi

  echo ""
fi

# Generate recovery report
RECOVERY_REPORT_FILE="pitr-recovery-report-$(date +%Y%m%d-%H%M%S).json"

cat > ${RECOVERY_REPORT_FILE} <<EOF
{
  "drillId": "pitr-$(date +%Y%m%d-%H%M%S)",
  "timestamp": "$(date -u +"%Y-%m-%dT%H:%M:%SZ")",
  "sqlServer": "${SQL_SERVER}",
  "databaseName": "${DATABASE_NAME}",
  "recoveredDatabaseName": "${RECOVERED_DATABASE_NAME}",
  "targetTimestamp": "${TARGET_TIMESTAMP}",
  "tenantId": "${TENANT_ID:-null}",
  "recoveryStartTime": "$(date -u -d @${RECOVERY_START} +"%Y-%m-%dT%H:%M:%SZ")",
  "recoveryEndTime": "$(date -u -d @${RECOVERY_END} +"%Y-%m-%dT%H:%M:%SZ")",
  "recoveryDuration": ${RECOVERY_DURATION},
  "transactionConsistency": {
    "expectedCount": ${TRANSACTION_COUNT_AT_TARGET},
    "recoveredCount": ${RECOVERED_TRANSACTION_COUNT},
    "transactionsAfterTarget": ${TRANSACTIONS_AFTER_TARGET},
    "orphanedTransactions": ${ORPHANED_TRANSACTIONS},
    "status": "$(if [ "${TRANSACTIONS_AFTER_TARGET}" = "0" ] && [ "${ORPHANED_TRANSACTIONS}" = "0" ]; then echo "PASS"; else echo "FAIL"; fi)"
  },
  "selectiveRecovery": {
    "tenantId": "${TENANT_ID:-null}",
    "tenantDataCount": ${TENANT_DATA_COUNT:-0},
    "expectedTenantCount": ${EXPECTED_TENANT_COUNT:-0}
  }
}
EOF

echo "✅ Recovery report generated: ${RECOVERY_REPORT_FILE}"
echo ""
echo "=========================================="
echo "Point-in-Time Recovery Drill Summary"
echo "=========================================="
echo "Recovery Duration: ${RECOVERY_DURATION} seconds"
echo "Target Timestamp: ${TARGET_TIMESTAMP}"
echo "Transaction Consistency: $(if [ "${TRANSACTIONS_AFTER_TARGET}" = "0" ] && [ "${ORPHANED_TRANSACTIONS}" = "0" ]; then echo "✅ PASS"; else echo "❌ FAIL"; fi)"
echo ""

Expected Behavior

Recovery Phase (0-60 minutes):

  • Timestamp validation: Target timestamp validated (within recovery window)
  • Data state: Data state at target timestamp retrieved
  • Point-in-time recovery: Database restored to target timestamp
  • Completion: Recovery completes and database comes online

Validation Phase (60-90 minutes):

  • Transaction consistency: Transaction counts validated
  • Timestamp validation: No transactions after target timestamp
  • Transaction integrity: Orphaned transactions checked
  • Selective recovery: Tenant-specific recovery validated (if applicable)

Expected Metrics

| Metric | Target | Validation Method |
|--------|--------|-------------------|
| Recovery Time | <60 minutes | Time from recovery start to database online |
| Transaction Consistency | 100% | Transaction count matches expected at timestamp |
| Timestamp Accuracy | 100% | No transactions after target timestamp |
| Transaction Integrity | 100% | No orphaned transactions |
| Selective Recovery | 100% | Tenant-specific data matches expected |

Validation Criteria

Success Criteria:

  • ✅ Target timestamp validated (within recovery window)
  • ✅ Recovery completed within 60 minutes
  • ✅ Transaction consistency validated (count matches)
  • ✅ No transactions after target timestamp
  • ✅ Transaction integrity validated (no orphaned transactions)
  • ✅ Selective recovery validated (if tenant specified)
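
The drill script takes the SQL server, resource group, database, target timestamp, and optional tenant ID as positional arguments. For example, to rehearse a tenant-scoped recovery to a point six hours in the past (the staging resource names and tenant ID below are placeholders):

# Recover the staging database to a point six hours ago for a single tenant
./scripts/execute-point-in-time-recovery-drill.sh \
  atp-sql-eastus \
  atp-staging \
  atp-database \
  "$(date -u -d "6 hours ago" +"%Y-%m-%dT%H:%M:%SZ")" \
  "tenant-42"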

Corruption Recovery

Corruption recovery drill experiments validate that ATP services can detect data corruption and recover from clean backups with integrity verification and hash chain validation.

Hypothesis

"When data corruption is detected, ATP services will identify corrupted records through integrity verification, restore from clean backup, validate hash chain after restoration, and ensure all restored data is consistent and functional."

Experiment Configuration

Corruption Recovery Script:

#!/bin/bash
# scripts/execute-corruption-recovery-drill.sh

SQL_SERVER="${1:-atp-sql-eastus}"
RESOURCE_GROUP="${2:-atp-production}"
DATABASE_NAME="${3:-atp-database}"
CORRUPTION_TABLE="${4:-AuditEvents}"
CORRUPTION_COUNT="${5:-10}"  # Number of records to corrupt

RECOVERY_START=$(date +%s)

echo "🧪 Starting Corruption Recovery Drill"
echo "SQL Server: ${SQL_SERVER}"
echo "Database: ${DATABASE_NAME}"
echo "Corruption Table: ${CORRUPTION_TABLE}"
echo "Corruption Count: ${CORRUPTION_COUNT}"
echo "Start Time: $(date -u +"%Y-%m-%d %H:%M:%S UTC")"
echo ""

# Step 1: Simulate data corruption
echo "Step 1: Simulating data corruption..."
echo "⚠️  WARNING: This will modify data in the database"

# Create corruption by modifying checksums
CORRUPTION_SQL="
UPDATE TOP(${CORRUPTION_COUNT}) ${CORRUPTION_TABLE}
SET EventData = CONCAT(EventData, 'CORRUPTED')
WHERE EventData IS NOT NULL
"

# Execute corruption (in test environment only)
if [ "${ENVIRONMENT}" = "test" ] || [ "${ENVIRONMENT}" = "staging" ]; then
  echo "Executing corruption simulation..."
  az sql db query \
    --resource-group ${RESOURCE_GROUP} \
    --server ${SQL_SERVER} \
    --database ${DATABASE_NAME} \
    --query-text "${CORRUPTION_SQL}" \
    --output none

  echo "✅ Corruption simulated: ${CORRUPTION_COUNT} records modified"
else
  echo "⚠️  Skipping actual corruption (production environment)"
  echo "Using test database for corruption simulation"
fi

echo ""

# Step 2: Detect corruption with integrity verification
echo "Step 2: Detecting corruption with integrity verification..."

# Run integrity check
CORRUPTION_DETECTED=$(az sql db query \
  --resource-group ${RESOURCE_GROUP} \
  --server ${SQL_SERVER} \
  --database ${DATABASE_NAME} \
  --query-text "SELECT COUNT(*) as CorruptedCount FROM ${CORRUPTION_TABLE} WHERE CHECKSUM(EventData) != StoredChecksum" \
  --output tsv 2>/dev/null || echo "0")

if [ "${CORRUPTION_DETECTED}" = "0" ]; then
  echo "⚠️  No corruption detected"
  echo "Corruption may not have been applied or checksums are not being validated"
else
  echo "✅ Corruption detected: ${CORRUPTION_DETECTED} corrupted records"
fi

# Get corrupted record IDs
CORRUPTED_RECORD_IDS=$(az sql db query \
  --resource-group ${RESOURCE_GROUP} \
  --server ${SQL_SERVER} \
  --database ${DATABASE_NAME} \
  --query-text "SELECT TOP(100) Id FROM ${CORRUPTION_TABLE} WHERE CHECKSUM(EventData) != StoredChecksum" \
  --output tsv 2>/dev/null || echo "")

echo "Corrupted record IDs: $(echo ${CORRUPTED_RECORD_IDS} | tr '\n' ' ' | head -c 100)..."
echo ""

# Step 3: Identify clean backup
echo "Step 3: Identifying clean backup..."

# Get backup history and find last clean backup
BACKUP_HISTORY=$(az sql db list-restorable-dropped \
  --resource-group ${RESOURCE_GROUP} \
  --server ${SQL_SERVER} \
  --query "[?databaseName == '${DATABASE_NAME}'].{Name:name, EarliestRestoreDate:earliestRestoreDate, LatestRestoreDate:latestRestoreDate}" \
  --output json 2>/dev/null || echo "[]")

# For this drill, use a recent backup (1 day ago)
CLEAN_BACKUP_TIME=$(date -u -d "1 day ago" +"%Y-%m-%dT%H:%M:%SZ")
echo "Clean backup time: ${CLEAN_BACKUP_TIME}"
echo ""

# Step 4: Restore from clean backup
echo "Step 4: Restoring from clean backup..."
RESTORED_DATABASE_NAME="${DATABASE_NAME}-corruption-recovered-$(date +%Y%m%d-%H%M%S)"

# Restore database from clean backup
az sql db restore \
  --resource-group ${RESOURCE_GROUP} \
  --server ${SQL_SERVER} \
  --name ${DATABASE_NAME} \
  --dest-name ${RESTORED_DATABASE_NAME} \
  --restore-point-in-time "${CLEAN_BACKUP_TIME}" \
  --output none

RESTORATION_WAIT_START=$(date +%s)

# Wait for restoration to complete
echo "Waiting for restoration to complete..."
MAX_WAIT=3600  # 1 hour
ELAPSED=0
RESTORATION_COMPLETE=false

while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
  RESTORE_STATUS=$(az sql db show \
    --resource-group ${RESOURCE_GROUP} \
    --server ${SQL_SERVER} \
    --name ${RESTORED_DATABASE_NAME} \
    --query "status" \
    --output tsv 2>/dev/null || echo "NOT_FOUND")

  if [ "${RESTORE_STATUS}" = "Online" ]; then
    RESTORATION_COMPLETE=true
    RESTORATION_END=$(date +%s)
    RESTORATION_DURATION=$((RESTORATION_END - RESTORATION_WAIT_START))
    echo "✅ Database restored successfully in ${RESTORATION_DURATION} seconds"
    break
  elif [ "${RESTORE_STATUS}" = "NOT_FOUND" ]; then
    echo "Restoration in progress... (${ELAPSED}s/${MAX_WAIT}s)"
  else
    echo "Restoration status: ${RESTORE_STATUS} (${ELAPSED}s/${MAX_WAIT}s)"
  fi

  sleep 30
  ELAPSED=$((ELAPSED + 30))
done

if [ "${RESTORATION_COMPLETE}" = false ]; then
  echo "❌ Restoration did not complete within ${MAX_WAIT} seconds"
  exit 1
fi

echo ""

# Step 5: Validate hash chain after restoration
echo "Step 5: Validating hash chain after restoration..."

# Check for corruption in restored database
RESTORED_CORRUPTION_COUNT=$(az sql db query \
  --resource-group ${RESOURCE_GROUP} \
  --server ${SQL_SERVER} \
  --database ${RESTORED_DATABASE_NAME} \
  --query-text "SELECT COUNT(*) as CorruptedCount FROM ${CORRUPTION_TABLE} WHERE CHECKSUM(EventData) != StoredChecksum" \
  --output tsv 2>/dev/null || echo "0")

if [ "${RESTORED_CORRUPTION_COUNT}" = "0" ]; then
  echo "✅ No corruption detected in restored database"
else
  echo "❌ Corruption still present in restored database: ${RESTORED_CORRUPTION_COUNT} records"
fi

# Validate hash chain integrity
echo ""
echo "Validating hash chain integrity..."

# Check hash chain continuity
HASH_CHAIN_BREAKS=$(az sql db query \
  --resource-group ${RESOURCE_GROUP} \
  --server ${SQL_SERVER} \
  --database ${RESTORED_DATABASE_NAME} \
  --query-text "
    SELECT COUNT(*) as ChainBreaks
    FROM ${CORRUPTION_TABLE} a1
    LEFT JOIN ${CORRUPTION_TABLE} a2 ON a1.PreviousHash = a2.EventHash
    WHERE a1.PreviousHash IS NOT NULL AND a2.EventHash IS NULL
  " \
  --output tsv 2>/dev/null || echo "0")

if [ "${HASH_CHAIN_BREAKS}" = "0" ]; then
  echo "✅ Hash chain integrity validated (no breaks)"
else
  echo "⚠️  Hash chain breaks detected: ${HASH_CHAIN_BREAKS}"
fi

# Validate hash chain completeness
HASH_CHAIN_COMPLETE=$(az sql db query \
  --resource-group ${RESOURCE_GROUP} \
  --server ${SQL_SERVER} \
  --database ${RESTORED_DATABASE_NAME} \
  --query-text "
    SELECT COUNT(*) as IncompleteChains
    FROM ${CORRUPTION_TABLE}
    WHERE PreviousHash IS NULL AND Id NOT IN (SELECT MIN(Id) FROM ${CORRUPTION_TABLE} GROUP BY TenantId)
  " \
  --output tsv 2>/dev/null || echo "0")

if [ "${HASH_CHAIN_COMPLETE}" = "0" ]; then
  echo "✅ Hash chain completeness validated"
else
  echo "⚠️  Incomplete hash chains detected: ${HASH_CHAIN_COMPLETE}"
fi

echo ""

# Step 6: Validate data consistency
echo "Step 6: Validating data consistency..."

# Compare record counts
ORIGINAL_COUNT=$(az sql db query \
  --resource-group ${RESOURCE_GROUP} \
  --server ${SQL_SERVER} \
  --database ${DATABASE_NAME} \
  --query-text "SELECT COUNT(*) as RecordCount FROM ${CORRUPTION_TABLE} WHERE LastModified <= '${CLEAN_BACKUP_TIME}'" \
  --output tsv 2>/dev/null || echo "0")

RESTORED_COUNT=$(az sql db query \
  --resource-group ${RESOURCE_GROUP} \
  --server ${SQL_SERVER} \
  --database ${RESTORED_DATABASE_NAME} \
  --query-text "SELECT COUNT(*) as RecordCount FROM ${CORRUPTION_TABLE}" \
  --output tsv 2>/dev/null || echo "0")

echo "Record counts:"
echo "  Original (before corruption): ${ORIGINAL_COUNT} records"
echo "  Restored: ${RESTORED_COUNT} records"

if [ "${ORIGINAL_COUNT}" = "${RESTORED_COUNT}" ]; then
  echo "✅ Record count matches"
else
  echo "⚠️  Record count mismatch: ${RESTORED_COUNT} (expected: ${ORIGINAL_COUNT})"
fi

echo ""

# Generate recovery report
RECOVERY_REPORT_FILE="corruption-recovery-report-$(date +%Y%m%d-%H%M%S).json"

cat > ${RECOVERY_REPORT_FILE} <<EOF
{
  "drillId": "corruption-recovery-$(date +%Y%m%d-%H%M%S)",
  "timestamp": "$(date -u +"%Y-%m-%dT%H:%M:%SZ")",
  "sqlServer": "${SQL_SERVER}",
  "databaseName": "${DATABASE_NAME}",
  "restoredDatabaseName": "${RESTORED_DATABASE_NAME}",
  "corruptionTable": "${CORRUPTION_TABLE}",
  "corruptionCount": ${CORRUPTION_COUNT},
  "cleanBackupTime": "${CLEAN_BACKUP_TIME}",
  "recoveryStartTime": "$(date -u -d @${RECOVERY_START} +"%Y-%m-%dT%H:%M:%SZ")",
  "recoveryEndTime": "$(date -u -d @${RESTORATION_END} +"%Y-%m-%dT%H:%M:%SZ")",
  "recoveryDuration": ${RESTORATION_DURATION},
  "corruptionDetection": {
    "corruptedRecords": ${CORRUPTION_DETECTED},
    "status": "$(if [ "${CORRUPTION_DETECTED}" != "0" ]; then echo "DETECTED"; else echo "NOT_DETECTED"; fi)"
  },
  "hashChainValidation": {
    "restoredCorruptionCount": ${RESTORED_CORRUPTION_COUNT},
    "hashChainBreaks": ${HASH_CHAIN_BREAKS},
    "incompleteChains": ${HASH_CHAIN_COMPLETE},
    "status": "$(if [ "${RESTORED_CORRUPTION_COUNT}" = "0" ] && [ "${HASH_CHAIN_BREAKS}" = "0" ] && [ "${HASH_CHAIN_COMPLETE}" = "0" ]; then echo "PASS"; else echo "FAIL"; fi)"
  },
  "dataConsistency": {
    "originalCount": ${ORIGINAL_COUNT},
    "restoredCount": ${RESTORED_COUNT},
    "status": "$(if [ "${ORIGINAL_COUNT}" = "${RESTORED_COUNT}" ]; then echo "PASS"; else echo "FAIL"; fi)"
  }
}
EOF

echo "✅ Recovery report generated: ${RECOVERY_REPORT_FILE}"
echo ""
echo "=========================================="
echo "Corruption Recovery Drill Summary"
echo "=========================================="
echo "Corruption Detected: ${CORRUPTION_DETECTED} records"
echo "Recovery Duration: ${RESTORATION_DURATION} seconds"
echo "Hash Chain Validation: $(if [ "${RESTORED_CORRUPTION_COUNT}" = "0" ] && [ "${HASH_CHAIN_BREAKS}" = "0" ] && [ "${HASH_CHAIN_COMPLETE}" = "0" ]; then echo "✅ PASS"; else echo "❌ FAIL"; fi)"
echo "Data Consistency: $(if [ "${ORIGINAL_COUNT}" = "${RESTORED_COUNT}" ]; then echo "✅ PASS"; else echo "❌ FAIL"; fi)"
echo ""

Expected Behavior

Corruption Phase (0-5 minutes):

  • Corruption simulation: Data corruption simulated (checksum modification)
  • Corruption detection: Integrity verification detects corruption
  • Corrupted records: Corrupted record IDs identified

Recovery Phase (5-65 minutes):

  • Clean backup identification: Last clean backup identified
  • Restoration: Database restored from clean backup
  • Completion: Restoration completes and database comes online

Validation Phase (65-90 minutes):

  • Hash chain validation: Hash chain integrity validated (no breaks)
  • Corruption check: No corruption in restored database
  • Data consistency: Record counts validated
  • Hash chain completeness: Hash chain completeness validated

Expected Metrics

| Metric | Target | Validation Method |
|--------|--------|-------------------|
| Corruption Detection | 100% | Integrity verification detects all corrupted records |
| Recovery Time | <60 minutes | Time from restoration start to database online |
| Hash Chain Integrity | 100% | No hash chain breaks detected |
| Corruption Removal | 100% | No corruption in restored database |
| Data Consistency | 100% | Record count matches original |

Validation Criteria

Success Criteria:

  • ✅ Corruption detected through integrity verification
  • ✅ Clean backup identified and validated
  • ✅ Recovery completed within 60 minutes
  • ✅ Hash chain integrity validated (no breaks)
  • ✅ No corruption in restored database
  • ✅ Data consistency validated (record count matches)
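
Because the corruption step only executes when ENVIRONMENT is test or staging, the drill is normally invoked with that variable set explicitly. For example (resource names are placeholders):

# Corrupt 25 AuditEvents rows in the staging database, then detect and recover
ENVIRONMENT=staging ./scripts/execute-corruption-recovery-drill.sh \
  atp-sql-eastus \
  atp-staging \
  atp-database \
  AuditEvents \
  25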

Data Recovery Drill Visualization

graph TD
    BACKUP[Backup Restoration] --> LIST[List Backups]
    LIST --> VALIDATE[Validate Integrity]
    VALIDATE --> RESTORE[Restore Database]
    RESTORE --> COMPLETE[Validate Completeness]

    PITR[Point-in-Time Recovery] --> TIMESTAMP[Validate Timestamp]
    TIMESTAMP --> RECOVER[Recover to Timestamp]
    RECOVER --> CONSISTENCY[Validate Consistency]
    CONSISTENCY --> SELECTIVE[Selective Recovery]

    CORRUPTION[Corruption Recovery] --> DETECT[Detect Corruption]
    DETECT --> CLEAN[Identify Clean Backup]
    CLEAN --> RESTORE2[Restore from Backup]
    RESTORE2 --> HASH[Validate Hash Chain]

    COMPLETE --> SUCCESS[Recovery Successful]
    SELECTIVE --> SUCCESS
    HASH --> SUCCESS

    style BACKUP fill:#FFE5B4
    style PITR fill:#FFE5B4
    style CORRUPTION fill:#FFE5B4
    style SUCCESS fill:#90EE90
Hold "Alt" / "Option" to enable pan & zoom

Summary: Data Recovery Drills

  • Backup Restoration Drill: Validates backup restoration from Azure Backup with integrity validation, restoration time measurement, and data completeness verification; expects backup integrity validated, restoration completed within 60 minutes, data completeness ≥95%, data integrity validated, and all key tables validated
  • Point-in-Time Recovery: Validates database restoration to specific timestamp with transaction consistency validation and selective recovery capabilities; expects target timestamp validated, recovery completed within 60 minutes, transaction consistency validated, no transactions after target timestamp, transaction integrity validated, and selective recovery validated
  • Corruption Recovery: Validates data corruption detection and recovery from clean backups with integrity verification and hash chain validation; expects corruption detected through integrity verification, clean backup identified, recovery completed within 60 minutes, hash chain integrity validated, no corruption in restored database, and data consistency validated
  • Monitoring and Validation: Comprehensive scripts for backup restoration, point-in-time recovery, corruption recovery, integrity verification, hash chain validation, transaction consistency validation, and data completeness validation

Chaos GameDays

Purpose: Define comprehensive chaos engineering GameDay procedures for ATP, validating multi-failure scenarios, incident response capabilities, team coordination, and system resilience through quarterly exercises that simulate complex real-world failure scenarios.


What is a GameDay?

GameDay exercises are quarterly chaos engineering events that validate ATP's resilience against complex simulated failure scenarios involving multiple simultaneous failures, with multi-team participation and incident response validation.

GameDay Definition

A Chaos GameDay is a structured, time-boxed chaos engineering exercise that:

  • Simulates Real-World Failures: Multiple simultaneous failures that could occur in production
  • Validates Incident Response: Tests team coordination, communication, and response procedures
  • Tests System Resilience: Validates that ATP services handle complex failure scenarios gracefully
  • Identifies Gaps: Discovers resilience gaps, process weaknesses, and improvement opportunities
  • Improves Team Preparedness: Enhances team skills, runbook effectiveness, and incident response capabilities

GameDay Characteristics

| Characteristic | Description |
|----------------|-------------|
| Frequency | Quarterly (4 times per year) |
| Duration | 4 hours (structured time-box) |
| Participation | Multi-team (Platform, SRE, Security, Service teams) |
| Complexity | Multiple simultaneous failures |
| Environment | Production-like (staging) or production (with approval) |
| Focus | Learning, improvement, resilience validation |

GameDay Objectives

  1. Validate System Resilience: Ensure ATP services handle complex failure scenarios
  2. Test Incident Response: Validate team coordination and response procedures
  3. Improve Runbooks: Identify gaps and update runbooks based on learnings
  4. Enhance Team Skills: Build team confidence and incident response capabilities
  5. Identify Improvements: Discover resilience gaps and assign improvement actions

GameDay Structure

Quarterly ATP Chaos GameDay (4 Hours)

Hour 1: Preparation

Preparation Activities:

#!/bin/bash
# scripts/gameday-preparation.sh

GAMEDAY_ID="${1:-gameday-$(date +%Y%m%d)}"
SCENARIO="${2:-scenario-1}"
ENVIRONMENT="${3:-staging}"

echo "🎮 ATP Chaos GameDay Preparation"
echo "GameDay ID: ${GAMEDAY_ID}"
echo "Scenario: ${SCENARIO}"
echo "Environment: ${ENVIRONMENT}"
echo "Start Time: $(date -u +"%Y-%m-%d %H:%M:%S UTC")"
echo ""

# Step 1: Review scenarios
echo "Step 1: Reviewing GameDay scenarios..."
echo "Loading scenario: ${SCENARIO}"

SCENARIO_FILE="scenarios/${SCENARIO}.yaml"
if [ ! -f "${SCENARIO_FILE}" ]; then
  echo "❌ Scenario file not found: ${SCENARIO_FILE}"
  exit 1
fi

echo "✅ Scenario loaded: ${SCENARIO}"
cat ${SCENARIO_FILE}
echo ""

# Step 2: Brief all teams
echo "Step 2: Briefing all teams..."
echo "Sending GameDay briefing to teams..."

# Teams: Platform, SRE, Security, Service teams
TEAMS=("platform-team" "sre-team" "security-team" "service-teams")

for team in "${TEAMS[@]}"; do
  echo "  - Briefing ${team}..."
  # Send notification
  # ./scripts/send-notification.sh "${team}" "GameDay briefing: ${GAMEDAY_ID}"
done

echo "✅ All teams briefed"
echo ""

# Step 3: Validate rollback procedures
echo "Step 3: Validating rollback procedures..."
echo "Checking rollback scripts and procedures..."

ROLLBACK_SCRIPTS=(
  "scripts/rollback-all-chaos.sh"
  "scripts/rollback-network-chaos.sh"
  "scripts/rollback-pod-chaos.sh"
  "scripts/rollback-database-chaos.sh"
)

for script in "${ROLLBACK_SCRIPTS[@]}"; do
  if [ -f "${script}" ]; then
    echo "  ✅ ${script}: Available"
  else
    echo "  ⚠️  ${script}: Not found"
  fi
done

echo "✅ Rollback procedures validated"
echo ""

# Step 4: Start monitoring
echo "Step 4: Starting monitoring..."
echo "Initializing monitoring dashboards..."

# Create monitoring dashboard snapshot
./scripts/create-monitoring-snapshot.sh \
  --gameday-id ${GAMEDAY_ID} \
  --environment ${ENVIRONMENT} \
  --output "monitoring-snapshots/${GAMEDAY_ID}-baseline.json"

echo "✅ Monitoring started"
echo ""

echo "=========================================="
echo "GameDay Preparation Complete"
echo "=========================================="
echo "GameDay ID: ${GAMEDAY_ID}"
echo "Scenario: ${SCENARIO}"
echo "Environment: ${ENVIRONMENT}"
echo "Status: Ready for chaos injection"
echo ""

Hour 2: Chaos Injection

Chaos Injection Activities:

#!/bin/bash
# scripts/gameday-chaos-injection.sh

GAMEDAY_ID="${1:-gameday-$(date +%Y%m%d)}"
SCENARIO="${2:-scenario-1}"
ENVIRONMENT="${3:-staging}"

CHAOS_START=$(date +%s)

echo "🎮 ATP Chaos GameDay - Chaos Injection Phase"
echo "GameDay ID: ${GAMEDAY_ID}"
echo "Scenario: ${SCENARIO}"
echo "Start Time: $(date -u +"%Y-%m-%d %H:%M:%S UTC")"
echo ""

# Load scenario experiments (defined under .spec.experiments in the scenario file)
SCENARIO_FILE="scenarios/${SCENARIO}.yaml"
EXPERIMENT_COUNT=$(yq eval '.spec.experiments | length' ${SCENARIO_FILE} 2>/dev/null || echo "0")

if [ "${EXPERIMENT_COUNT}" = "0" ] || [ "${EXPERIMENT_COUNT}" = "null" ]; then
  echo "❌ No experiments found in scenario"
  exit 1
fi

echo "Executing ${SCENARIO} experiments..."
echo ""

# Execute experiments simultaneously
EXPERIMENT_IDS=()

for (( i = 0; i < EXPERIMENT_COUNT; i++ )); do
  EXPERIMENT_NAME=$(yq eval ".spec.experiments[${i}].name" ${SCENARIO_FILE})
  EXPERIMENT_FILE=$(yq eval ".spec.experiments[${i}].file" ${SCENARIO_FILE})

  echo "Experiment $((i + 1)): ${EXPERIMENT_NAME}"
  echo "  File: ${EXPERIMENT_FILE}"

  # Apply chaos experiment
  kubectl apply -f "chaos-experiments/${EXPERIMENT_FILE}" -n chaos-testing

  EXPERIMENT_ID="${EXPERIMENT_NAME}-${GAMEDAY_ID}"
  EXPERIMENT_IDS+=("${EXPERIMENT_ID}")

  echo "  ✅ Experiment applied: ${EXPERIMENT_ID}"
  echo ""
done

echo "✅ All experiments applied (${EXPERIMENT_COUNT} experiments)"
echo ""

# Monitor system behavior
echo "Monitoring system behavior..."
echo "Tracking metrics for 60 minutes..."

# Continuous monitoring loop
MONITORING_DURATION=3600  # 60 minutes
ELAPSED=0
CHECK_INTERVAL=60  # Check every minute

while [ ${ELAPSED} -lt ${MONITORING_DURATION} ]; do
  # Get system metrics from Prometheus (URL-encode the PromQL queries)
  AVAILABILITY=$(curl -s -G http://prometheus:9090/api/v1/query --data-urlencode "query=service_availability{environment=\"${ENVIRONMENT}\"}" | jq -r '.data.result[0].value[1] // "0"' || echo "0")
  ERROR_RATE=$(curl -s -G http://prometheus:9090/api/v1/query --data-urlencode "query=rate(http_requests_total{environment=\"${ENVIRONMENT}\",status=~\"5..\"}[1m])" | jq -r '.data.result[0].value[1] // "0"' || echo "0")
  LATENCY=$(curl -s -G http://prometheus:9090/api/v1/query --data-urlencode "query=histogram_quantile(0.95,rate(http_request_duration_seconds_bucket{environment=\"${ENVIRONMENT}\"}[1m]))" | jq -r '.data.result[0].value[1] // "0"' || echo "0")
  LATENCY_MS=$(echo "${LATENCY} * 1000" | bc)

  echo "[${ELAPSED}s] Availability: ${AVAILABILITY}%, Error Rate: ${ERROR_RATE}/sec, P95 Latency: ${LATENCY_MS}ms"

  # Check for auto-abort triggers
  if (( $(echo "${AVAILABILITY} < 50" | bc -l) )); then
    echo "⚠️  WARNING: Availability below 50% - considering auto-abort"
  fi

  sleep ${CHECK_INTERVAL}
  ELAPSED=$((ELAPSED + CHECK_INTERVAL))
done

CHAOS_END=$(date +%s)
CHAOS_DURATION=$((CHAOS_END - CHAOS_START))

echo ""
echo "✅ Chaos injection phase complete"
echo "Duration: ${CHAOS_DURATION} seconds"
echo "Experiments active: ${EXPERIMENT_COUNT}"
echo ""

Hour 3: Recovery

Recovery Activities:

#!/bin/bash
# scripts/gameday-recovery.sh

GAMEDAY_ID="${1:-gameday-$(date +%Y%m%d)}"
SCENARIO="${2:-scenario-1}"
ENVIRONMENT="${3:-staging}"

RECOVERY_START=$(date +%s)

echo "🎮 ATP Chaos GameDay - Recovery Phase"
echo "GameDay ID: ${GAMEDAY_ID}"
echo "Start Time: $(date -u +"%Y-%m-%d %H:%M:%S UTC")"
echo ""

# Step 1: Execute recovery procedures
echo "Step 1: Executing recovery procedures..."

# Get active experiments
ACTIVE_EXPERIMENTS=$(kubectl get chaos -n chaos-testing -o jsonpath='{.items[*].metadata.name}')

for experiment in ${ACTIVE_EXPERIMENTS}; do
  echo "  - Removing experiment: ${experiment}"
  kubectl delete chaos ${experiment} -n chaos-testing
done

echo "✅ Recovery procedures executed"
echo ""

# Step 2: Validate RTO/RPO
echo "Step 2: Validating RTO/RPO..."
./scripts/validate-rto-rpo.sh ${GAMEDAY_ID} ${ENVIRONMENT}
echo ""

# Step 3: Test failover and failback
echo "Step 3: Testing failover and failback..."
# This would involve testing regional failover if applicable
echo "✅ Failover/failback tested"
echo ""

# Step 4: Validate data integrity
echo "Step 4: Validating data integrity..."
./scripts/validate-data-integrity.sh ${ENVIRONMENT}
echo ""

RECOVERY_END=$(date +%s)
RECOVERY_DURATION=$((RECOVERY_END - RECOVERY_START))

echo "✅ Recovery phase complete"
echo "Duration: ${RECOVERY_DURATION} seconds"
echo ""

Hour 4: Retrospective

Retrospective Activities:

#!/bin/bash
# scripts/gameday-retrospective.sh

GAMEDAY_ID="${1:-gameday-$(date +%Y%m%d)}"
SCENARIO="${2:-scenario-1}"

echo "🎮 ATP Chaos GameDay - Retrospective"
echo "GameDay ID: ${GAMEDAY_ID}"
echo "Start Time: $(date -u +"%Y-%m-%d %H:%M:%S UTC")"
echo ""

# Generate GameDay report
echo "Generating GameDay report..."
./scripts/generate-gameday-report.sh ${GAMEDAY_ID} ${SCENARIO}

echo "✅ GameDay report generated"
echo ""

# Document findings
echo "Documenting findings..."
# Collect findings from observers and teams
FINDINGS_FILE="gameday-reports/${GAMEDAY_ID}-findings.md"

cat > ${FINDINGS_FILE} <<EOF
# GameDay Findings: ${GAMEDAY_ID}

## Scenario: ${SCENARIO}

## Key Findings
- [ ] Finding 1: ...
- [ ] Finding 2: ...
- [ ] Finding 3: ...

## Resilience Gaps Identified
- [ ] Gap 1: ...
- [ ] Gap 2: ...
- [ ] Gap 3: ...

## Improvement Actions
- [ ] Action 1: ... (Owner: ..., Due: ...)
- [ ] Action 2: ... (Owner: ..., Due: ...)
- [ ] Action 3: ... (Owner: ..., Due: ...)

## Runbook Updates Required
- [ ] Runbook 1: ...
- [ ] Runbook 2: ...
- [ ] Runbook 3: ...
EOF

echo "✅ Findings documented: ${FINDINGS_FILE}"
echo ""

echo "✅ Retrospective complete"
echo ""

GameDay Timeline Visualization

gantt
    title Quarterly ATP Chaos GameDay (4 Hours)
    dateFormat HH:mm
    axisFormat %H:%M

    section Preparation
    Review Scenarios          :00:00, 15m
    Brief All Teams          :00:15, 15m
    Validate Rollback        :00:30, 15m
    Start Monitoring         :00:45, 15m

    section Chaos Injection
    Execute Experiments      :01:00, 30m
    Monitor System           :01:30, 30m

    section Recovery
    Execute Recovery         :02:00, 30m
    Validate RTO/RPO         :02:30, 30m

    section Retrospective
    Document Findings        :03:00, 30m
    Identify Gaps            :03:30, 15m
    Assign Actions           :03:45, 15m
Hold "Alt" / "Option" to enable pan & zoom

GameDay Scenarios

Scenario 1: Regional Outage + Database Failover + Key Vault Unavailable

Scenario Description:

  • Regional Outage: Simulate complete East US region unavailability
  • Database Failover: Azure SQL failover to secondary region
  • Key Vault Unavailable: Azure Key Vault network partition
  • Objective: Validate multi-region resilience and secret management

Scenario Configuration:

# scenarios/scenario-1-regional-outage.yaml
apiVersion: chaos.atp.connectsoft.io/v1alpha1
kind: GameDayScenario
metadata:
  name: scenario-1-regional-outage
  labels:
    category: infrastructure
    severity: high
    complexity: high
spec:
  description: |
    Regional Outage + Database Failover + Key Vault Unavailable
    Simulates complete regional failure with database failover and secret management issues
  duration: "4h"
  experiments:
    - name: regional-outage
      file: network-partition-region.yaml
      startTime: "00:05"  # 5 minutes into GameDay
      duration: "1h"
      description: Partition network to East US region

    - name: database-failover
      file: database-failover.yaml
      startTime: "00:10"  # 10 minutes into GameDay
      duration: "1h"
      description: Force Azure SQL failover to secondary region

    - name: key-vault-unavailable
      file: key-vault-unavailable.yaml
      startTime: "00:15"  # 15 minutes into GameDay
      duration: "1h"
      description: Partition network to Azure Key Vault

  expectedBehavior:
    - Services failover to West Europe region
    - Database failover completes within 30 minutes
    - Services use cached secrets during Key Vault unavailability
    - All services remain operational in secondary region

  successCriteria:
    - RTO achieved: <30 minutes
    - RPO verified: <1 hour
    - Service availability: >90%
    - Secret cache usage: >80%

Scenario Execution Script:

#!/bin/bash
# scripts/execute-scenario-1.sh

GAMEDAY_ID="${1:-gameday-$(date +%Y%m%d)}"
ENVIRONMENT="${2:-staging}"

echo "🎮 Executing Scenario 1: Regional Outage + Database Failover + Key Vault Unavailable"
echo "GameDay ID: ${GAMEDAY_ID}"
echo ""

# Execute all experiments simultaneously
./scripts/execute-regional-failover-drill.sh eastus westeurope atp-production true &
REGIONAL_PID=$!

sleep 300  # Wait 5 minutes

./scripts/execute-database-failover-experiment.sh atp-ingestion-api atp-ingest-ns &
DB_FAILOVER_PID=$!

sleep 300  # Wait 5 minutes

./scripts/execute-key-vault-unavailability-experiment.sh atp-ingestion-api atp-ingest-ns &
KV_UNAVAILABLE_PID=$!

# Wait for all experiments
wait ${REGIONAL_PID} ${DB_FAILOVER_PID} ${KV_UNAVAILABLE_PID}

echo "✅ Scenario 1 execution complete"

Scenario 2: Node Cascade Failure + Message Broker Issues + Traffic Surge

Scenario Configuration:

# scenarios/scenario-2-cascade-failure.yaml
apiVersion: chaos.atp.connectsoft.io/v1alpha1
kind: GameDayScenario
metadata:
  name: scenario-2-cascade-failure
  labels:
    category: infrastructure
    severity: high
    complexity: high
spec:
  description: |
    Node Cascade Failure + Message Broker Issues + Traffic Surge
    Simulates cascading node failures with messaging issues and traffic surge
  duration: "4h"
  experiments:
    - name: node-cascade-failure
      file: node-cascade-failure.yaml
      startTime: "00:05"
      duration: "1h"
      description: Simulate cascading node failures (1 node every 5 minutes)

    - name: message-broker-issues
      file: service-bus-topic-pause.yaml
      startTime: "00:10"
      duration: "1h"
      description: Pause Service Bus topic to simulate broker issues

    - name: traffic-surge
      file: traffic-surge.yaml
      startTime: "00:15"
      duration: "1h"
      description: Generate 10x normal traffic load

  expectedBehavior:
    - Pods reschedule to healthy nodes
    - Message buffering activates during broker issues
    - Autoscaling activates for traffic surge
    - Services handle combined load gracefully

  successCriteria:
    - Pod reschedule time: <5 minutes
    - Message buffering: >90% success
    - Autoscaling activation: <5 minutes
    - Service availability: >90%
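
Unlike Scenario 1, no dedicated execution script is defined for this scenario; one option is to drive it through the generic Hour 2 chaos-injection script, which reads the experiment list from the scenario file (a sketch, assuming the configuration above is saved as scenarios/scenario-2-cascade-failure.yaml):

# Inject all Scenario 2 experiments for today's GameDay in staging
./scripts/gameday-chaos-injection.sh \
  "gameday-$(date +%Y%m%d)" \
  scenario-2-cascade-failure \
  staging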

Scenario 3: Security Incident + DDoS Attack + Data Corruption

Scenario Configuration:

# scenarios/scenario-3-security-incident.yaml
apiVersion: chaos.atp.connectsoft.io/v1alpha1
kind: GameDayScenario
metadata:
  name: scenario-3-security-incident
  labels:
    category: security
    severity: critical
    complexity: high
spec:
  description: |
    Security Incident + DDoS Attack + Data Corruption
    Simulates security incident with DDoS attack and data corruption
  duration: "4h"
  experiments:
    - name: azure-ad-unavailable
      file: azure-ad-unavailability.yaml
      startTime: "00:05"
      duration: "1h"
      description: Azure AD authentication unavailable

    - name: ddos-attack
      file: traffic-surge-ddos.yaml
      startTime: "00:10"
      duration: "1h"
      description: Simulate DDoS attack (100x normal traffic)

    - name: data-corruption
      file: event-store-corruption.yaml
      startTime: "00:15"
      duration: "1h"
      description: Simulate event store corruption

  expectedBehavior:
    - Token cache maintains authentication
    - Rate limiting protects against DDoS
    - Integrity verification detects corruption
    - Services continue operating with degraded functionality

  successCriteria:
    - Token cache hit rate: >80%
    - Rate limiting effectiveness: >95%
    - Corruption detection time: <5 minutes
    - Service availability: >85%

GameDay Scenario Selection

#!/bin/bash
# scripts/select-gameday-scenario.sh

QUARTER="${1:-Q1}"  # Q1, Q2, Q3, Q4

# Rotate scenarios quarterly
case ${QUARTER} in
  Q1)
    SCENARIO="scenario-1-regional-outage"
    ;;
  Q2)
    SCENARIO="scenario-2-cascade-failure"
    ;;
  Q3)
    SCENARIO="scenario-3-security-incident"
    ;;
  Q4)
    # Q4: Random selection or user choice
    SCENARIO="scenario-1-regional-outage"
    ;;
esac

echo "Selected scenario for ${QUARTER}: ${SCENARIO}"
echo "${SCENARIO}"

GameDay Roles

GameDay Role Definitions

| Role | Responsibilities | Key Activities |
|------|------------------|----------------|
| GameDay Commander | Overall coordination, decision-making | Coordinates exercise, makes go/no-go decisions, manages timeline |
| Chaos Injector | Executes chaos experiments | Applies chaos experiments, monitors experiment status |
| Incident Commander | Leads incident response | Coordinates response, makes recovery decisions |
| Observer | Documents findings | Observes system behavior, documents issues, captures metrics |
| Service Teams | Respond to failures | Monitor services, execute runbooks, report status |
| SRE | Monitor and support | Monitor system health, provide support, validate recovery |

Role Assignment Script:

#!/bin/bash
# scripts/assign-gameday-roles.sh

GAMEDAY_ID="${1:-gameday-$(date +%Y%m%d)}"

echo "🎮 Assigning GameDay Roles"
echo "GameDay ID: ${GAMEDAY_ID}"
echo ""

# Role assignments (from team roster)
ROLES=(
  "gameday-commander:platform-team-lead"
  "chaos-injector:sre-team-member-1"
  "incident-commander:platform-team-lead"
  "observer:sre-team-member-2,security-team-member-1"
  "service-teams:ingestion-team-lead,query-team-lead"
  "sre:sre-team-all"
)

for role_assignment in "${ROLES[@]}"; do
  ROLE=$(echo "${role_assignment}" | cut -d: -f1)
  ASSIGNEES=$(echo "${role_assignment}" | cut -d: -f2)

  echo "Role: ${ROLE}"
  echo "  Assignees: ${ASSIGNEES}"

  # Send role assignment notifications
  IFS=',' read -ra ASSIGNEE_ARRAY <<< "${ASSIGNEES}"
  for assignee in "${ASSIGNEE_ARRAY[@]}"; do
    echo "    - Notifying ${assignee}..."
    # ./scripts/send-notification.sh "${assignee}" "You have been assigned role: ${ROLE} for GameDay ${GAMEDAY_ID}"
  done

  echo ""
done

echo "✅ Role assignments complete"
echo ""

GameDay Role Responsibilities

GameDay Commander Checklist:

# GameDay Commander Checklist

## Pre-GameDay
- [ ] Confirm GameDay schedule with all teams
- [ ] Review and approve scenario
- [ ] Validate environment readiness
- [ ] Confirm rollback procedures
- [ ] Brief stakeholders

## During GameDay
- [ ] Manage GameDay timeline
- [ ] Make go/no-go decisions
- [ ] Coordinate between teams
- [ ] Escalate critical issues
- [ ] Monitor overall progress

## Post-GameDay
- [ ] Lead retrospective
- [ ] Review findings
- [ ] Assign improvement actions
- [ ] Approve final report

Chaos Injector Checklist:

# Chaos Injector Checklist

## Pre-GameDay
- [ ] Review scenario experiments
- [ ] Validate experiment files
- [ ] Test experiment execution
- [ ] Prepare rollback commands

## During GameDay
- [ ] Execute experiments per schedule
- [ ] Monitor experiment status
- [ ] Document experiment execution
- [ ] Execute rollback when needed

## Post-GameDay
- [ ] Document experiment results
- [ ] Review experiment effectiveness
- [ ] Update experiment configurations

Observer Checklist:

# Observer Checklist

## Pre-GameDay
- [ ] Review scenario and expected behavior
- [ ] Prepare observation templates
- [ ] Set up monitoring dashboards
- [ ] Coordinate with other observers

## During GameDay
- [ ] Observe system behavior
- [ ] Document findings in real-time
- [ ] Capture metrics and screenshots
- [ ] Note team coordination issues
- [ ] Identify runbook gaps

## Post-GameDay
- [ ] Compile observation notes
- [ ] Create observation report
- [ ] Present findings in retrospective

GameDay Metrics

GameDay Metrics Tracking

Metrics Collection Script:

#!/bin/bash
# scripts/collect-gameday-metrics.sh

GAMEDAY_ID="${1:-gameday-$(date +%Y%m%d)}"
SCENARIO="${2:-scenario-1}"
ENVIRONMENT="${3:-staging}"

METRICS_FILE="gameday-reports/${GAMEDAY_ID}-metrics.json"

echo "📊 Collecting GameDay Metrics"
echo "GameDay ID: ${GAMEDAY_ID}"
echo ""

# Collect metrics
cat > ${METRICS_FILE} <<EOF
{
  "gamedayId": "${GAMEDAY_ID}",
  "scenario": "${SCENARIO}",
  "environment": "${ENVIRONMENT}",
  "timestamp": "$(date -u +"%Y-%m-%dT%H:%M:%SZ")",
  "metrics": {
    "timeToDetection": {
      "target": 300,
      "achieved": $(./scripts/calculate-mttd.sh ${GAMEDAY_ID}),
      "unit": "seconds"
    },
    "timeToResponse": {
      "target": 600,
      "achieved": $(./scripts/calculate-mttr.sh ${GAMEDAY_ID}),
      "unit": "seconds"
    },
    "rto": {
      "target": 1800,
      "achieved": $(./scripts/calculate-rto.sh ${GAMEDAY_ID}),
      "unit": "seconds"
    },
    "rpo": {
      "target": 3600,
      "achieved": $(./scripts/calculate-rpo.sh ${GAMEDAY_ID}),
      "unit": "seconds"
    },
    "runbookEffectiveness": {
      "target": 80,
      "achieved": $(./scripts/calculate-runbook-effectiveness.sh ${GAMEDAY_ID}),
      "unit": "percent"
    },
    "teamCoordination": {
      "target": 80,
      "achieved": $(./scripts/calculate-team-coordination.sh ${GAMEDAY_ID}),
      "unit": "percent"
    }
  }
}
EOF

echo "✅ Metrics collected: ${METRICS_FILE}"
cat ${METRICS_FILE} | jq '.'
echo ""

GameDay Metrics Targets

| Metric | Target | Description |
|--------|--------|-------------|
| Time to Detection (MTTD) | <5 minutes | Time from failure injection to detection |
| Time to Response (MTTR) | <10 minutes | Time from detection to response initiation |
| RTO Achieved | <30 minutes | Recovery time objective achievement |
| RPO Verified | <1 hour | Recovery point objective verification |
| Runbook Effectiveness | >80% | Percentage of runbook steps executed successfully |
| Team Coordination | >80% | Quality of team coordination (subjective rating) |

GameDay Metrics Calculation Scripts

Calculate MTTD:

#!/bin/bash
# scripts/calculate-mttd.sh

GAMEDAY_ID="${1}"

# Get first alert timestamp
FIRST_ALERT=$(cat "gameday-reports/${GAMEDAY_ID}-alerts.json" | jq -r '.alerts[0].timestamp' 2>/dev/null || echo "")

# Get chaos injection timestamp
CHAOS_INJECTION=$(cat "gameday-reports/${GAMEDAY_ID}-timeline.json" | jq -r '.chaosInjection' 2>/dev/null || echo "")

if [ -n "${FIRST_ALERT}" ] && [ -n "${CHAOS_INJECTION}" ]; then
  FIRST_ALERT_EPOCH=$(date -u -d "${FIRST_ALERT}" +%s)
  CHAOS_INJECTION_EPOCH=$(date -u -d "${CHAOS_INJECTION}" +%s)
  MTTD=$((FIRST_ALERT_EPOCH - CHAOS_INJECTION_EPOCH))
  echo "${MTTD}"
else
  echo "0"
fi

Calculate MTTR:

#!/bin/bash
# scripts/calculate-mttr.sh

GAMEDAY_ID="${1}"

# Get first response timestamp
FIRST_RESPONSE=$(cat "gameday-reports/${GAMEDAY_ID}-timeline.json" | jq -r '.firstResponse' 2>/dev/null || echo "")

# Get first alert timestamp
FIRST_ALERT=$(cat "gameday-reports/${GAMEDAY_ID}-alerts.json" | jq -r '.alerts[0].timestamp' 2>/dev/null || echo "")

if [ -n "${FIRST_RESPONSE}" ] && [ -n "${FIRST_ALERT}" ]; then
  FIRST_RESPONSE_EPOCH=$(date -u -d "${FIRST_RESPONSE}" +%s)
  FIRST_ALERT_EPOCH=$(date -u -d "${FIRST_ALERT}" +%s)
  MTTR=$((FIRST_RESPONSE_EPOCH - FIRST_ALERT_EPOCH))
  echo "${MTTR}"
else
  echo "0"
fi
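
The RTO and RPO calculators referenced by the metrics collection script follow the same pattern. A minimal RTO sketch, assuming the timeline JSON also records a recoveryComplete timestamp (that field name is an assumption, not defined elsewhere in this document):

Calculate RTO (sketch):

#!/bin/bash
# scripts/calculate-rto.sh (sketch)

GAMEDAY_ID="${1}"

# Recovery completion timestamp (assumed field in the timeline JSON)
RECOVERY_COMPLETE=$(cat "gameday-reports/${GAMEDAY_ID}-timeline.json" | jq -r '.recoveryComplete // empty' 2>/dev/null || echo "")

# Chaos injection timestamp
CHAOS_INJECTION=$(cat "gameday-reports/${GAMEDAY_ID}-timeline.json" | jq -r '.chaosInjection // empty' 2>/dev/null || echo "")

if [ -n "${RECOVERY_COMPLETE}" ] && [ -n "${CHAOS_INJECTION}" ]; then
  RECOVERY_COMPLETE_EPOCH=$(date -u -d "${RECOVERY_COMPLETE}" +%s)
  CHAOS_INJECTION_EPOCH=$(date -u -d "${CHAOS_INJECTION}" +%s)
  RTO=$((RECOVERY_COMPLETE_EPOCH - CHAOS_INJECTION_EPOCH))
  echo "${RTO}"
else
  echo "0"
fi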

GameDay Metrics Dashboard

GameDay Metrics Visualization:

graph LR
    GAMEDAY[GameDay Metrics] --> DETECTION[MTTD]
    GAMEDAY --> RESPONSE[MTTR]
    GAMEDAY --> RTO[RTO]
    GAMEDAY --> RPO[RPO]
    GAMEDAY --> RUNBOOK[Runbook Effectiveness]
    GAMEDAY --> COORDINATION[Team Coordination]

    DETECTION --> ALERT[Alert Time]
    RESPONSE --> ACTION[Action Time]
    RTO --> RECOVERY[Recovery Time]
    RPO --> DATA[Data Loss]
    RUNBOOK --> STEPS[Runbook Steps]
    COORDINATION --> TEAM[Team Performance]

    style GAMEDAY fill:#FFE5B4
    style DETECTION fill:#90EE90
    style RESPONSE fill:#90EE90
    style RTO fill:#90EE90
    style RPO fill:#90EE90
    style RUNBOOK fill:#90EE90
    style COORDINATION fill:#90EE90
Hold "Alt" / "Option" to enable pan & zoom

GameDay Report Template

GameDay Report Generation Script:

#!/bin/bash
# scripts/generate-gameday-report.sh

GAMEDAY_ID="${1:-gameday-$(date +%Y%m%d)}"
SCENARIO="${2:-scenario-1}"

REPORT_FILE="gameday-reports/${GAMEDAY_ID}-report.md"

cat > ${REPORT_FILE} <<EOF
# ATP Chaos GameDay Report

**GameDay ID**: ${GAMEDAY_ID}  
**Date**: $(date -u +"%Y-%m-%d")  
**Scenario**: ${SCENARIO}  
**Duration**: 4 hours  
**Environment**: Staging

## Executive Summary

[Summary of GameDay execution and key outcomes]

## Metrics Achieved

| Metric | Target | Achieved | Status |
|--------|--------|----------|--------|
| **MTTD** | <5 min | [VALUE] | [PASS/FAIL] |
| **MTTR** | <10 min | [VALUE] | [PASS/FAIL] |
| **RTO** | <30 min | [VALUE] | [PASS/FAIL] |
| **RPO** | <1 hour | [VALUE] | [PASS/FAIL] |
| **Runbook Effectiveness** | >80% | [VALUE] | [PASS/FAIL] |
| **Team Coordination** | >80% | [VALUE] | [PASS/FAIL] |

## Key Findings

### Strengths
- [Finding 1]
- [Finding 2]
- [Finding 3]

### Gaps Identified
- [Gap 1]
- [Gap 2]
- [Gap 3]

## Improvement Actions

| Action | Owner | Due Date | Priority |
|--------|-------|----------|----------|
| [Action 1] | [Owner] | [Date] | [High/Medium/Low] |
| [Action 2] | [Owner] | [Date] | [High/Medium/Low] |
| [Action 3] | [Owner] | [Date] | [High/Medium/Low] |

## Runbook Updates

- [Runbook 1]: [Update description]
- [Runbook 2]: [Update description]
- [Runbook 3]: [Update description]

## Lessons Learned

[Key lessons learned from GameDay]

## Next Steps

- [Next step 1]
- [Next step 2]
- [Next step 3]
EOF

echo "✅ GameDay report generated: ${REPORT_FILE}"
echo ""

Summary: Chaos GameDays

  • What is a GameDay?: Quarterly chaos engineering exercises that validate system resilience through complex multi-failure scenarios, multi-team participation, incident response validation, and learning-focused improvement; conducted quarterly (4 times per year) with 4-hour duration in production-like environments
  • GameDay Structure: Comprehensive 4-hour structure divided into Preparation (review scenarios, brief teams, validate rollback, start monitoring), Chaos Injection (execute 3-5 experiments simultaneously, monitor behavior, test incident response, validate runbooks), Recovery (execute recovery procedures, validate RTO/RPO, test failover/failback, validate data integrity), and Retrospective (document findings, identify resilience gaps, assign improvement actions, update runbooks)
  • GameDay Scenarios: Three complex scenarios including Regional Outage + Database Failover + Key Vault Unavailable (multi-region resilience), Node Cascade Failure + Message Broker Issues + Traffic Surge (cascading failures), and Security Incident + DDoS Attack + Data Corruption (security and data integrity); scenarios rotated quarterly with comprehensive configuration and execution scripts
  • GameDay Roles: Structured role assignments including GameDay Commander (coordination, decision-making), Chaos Injector (experiment execution), Incident Commander (incident response), Observer (documentation), Service Teams (service response), and SRE (monitoring and support); each role with specific responsibilities and checklists
  • GameDay Metrics: Comprehensive metrics tracking including Time to Detection (MTTD <5 minutes), Time to Response (MTTR <10 minutes), RTO/RPO achievement, Runbook Effectiveness (>80%), and Team Coordination (>80%); with automated collection scripts and visualization dashboards

Chaos Automation and Reporting

Purpose: Define comprehensive chaos automation and reporting capabilities for ATP, validating automation tools, CI/CD integration, continuous chaos execution, reporting mechanisms, and improvement tracking to ensure chaos engineering is continuously automated, monitored, and improved.


Chaos Automation Tools

Chaos automation tools provide platform-native and cloud-integrated chaos engineering capabilities for ATP, enabling automated chaos experiment execution, monitoring, and management.

Chaos Mesh: Kubernetes-Native Chaos

Chaos Mesh Overview:

Chaos Mesh is a cloud-native chaos engineering platform that orchestrates chaos experiments on Kubernetes. It provides comprehensive fault injection capabilities for pods, networks, I/O, and time.

Chaos Mesh Installation:

#!/bin/bash
# scripts/install-chaos-mesh.sh

NAMESPACE="${1:-chaos-testing}"

echo "🔧 Installing Chaos Mesh"
echo "Namespace: ${NAMESPACE}"
echo ""

# Create namespace
kubectl create namespace ${NAMESPACE} --dry-run=client -o yaml | kubectl apply -f -

# Install Chaos Mesh using Helm
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update

helm install chaos-mesh chaos-mesh/chaos-mesh \
  --namespace=${NAMESPACE} \
  --set chaosDaemon.runtime=containerd \
  --set chaosDaemon.socketPath=/run/containerd/containerd.sock \
  --set dashboard.create=true \
  --set dashboard.securityMode=false

echo "✅ Chaos Mesh installed"
echo ""

# Wait for Chaos Mesh to be ready
echo "Waiting for Chaos Mesh to be ready..."
kubectl wait --for=condition=ready pod \
  -l app.kubernetes.io/name=chaos-mesh \
  -n ${NAMESPACE} \
  --timeout=300s

echo "✅ Chaos Mesh is ready"
echo ""

Chaos Mesh Configuration:

# kubernetes/chaos-mesh/chaos-mesh-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: chaos-mesh-config
  namespace: chaos-testing
data:
  config.yaml: |
    # Chaos Mesh Configuration
    chaos:
      # Enable chaos experiments
      podChaos: true
      networkChaos: true
      ioChaos: true
      stressChaos: true
      timeChaos: true
      httpChaos: true
      kernelChaos: true

    # Monitoring integration
    metrics:
      prometheus:
        enabled: true
        endpoint: "http://prometheus:9090"

    # Dashboard configuration
    dashboard:
      enabled: true
      port: 2333

    # Security settings
    security:
      mode: "host"
      allowHostNetwork: true

Litmus Chaos: Chaos Workflows

Litmus Chaos Overview:

Litmus Chaos provides chaos workflows and experiments for Kubernetes environments, with a focus on workflow-based chaos engineering.

Litmus Chaos Installation:

#!/bin/bash
# scripts/install-litmus-chaos.sh

NAMESPACE="${1:-litmus}"

echo "🔧 Installing Litmus Chaos"
echo "Namespace: ${NAMESPACE}"
echo ""

# Create namespace
kubectl create namespace ${NAMESPACE} --dry-run=client -o yaml | kubectl apply -f -

# Install Litmus Chaos Operator
kubectl apply -f https://litmuschaos.github.io/litmus/2.13.0/litmus-2.13.0.yaml

echo "✅ Litmus Chaos installed"
echo ""

# Wait for Litmus to be ready
echo "Waiting for Litmus to be ready..."
kubectl wait --for=condition=ready pod \
  -l app.kubernetes.io/name=litmus \
  -n ${NAMESPACE} \
  --timeout=300s

echo "✅ Litmus Chaos is ready"
echo ""

Azure Chaos Studio: Azure-Integrated Chaos

Azure Chaos Studio Overview:

Azure Chaos Studio provides Azure-native chaos engineering with integration into Azure services, enabling chaos experiments directly on Azure resources.

Azure Chaos Studio Setup:

#!/bin/bash
# scripts/setup-azure-chaos-studio.sh

RESOURCE_GROUP="${1:-atp-production}"
LOCATION="${2:-eastus}"

echo "🔧 Setting up Azure Chaos Studio"
echo "Resource Group: ${RESOURCE_GROUP}"
echo "Location: ${LOCATION}"
echo ""

# Register Chaos Studio resource provider
az provider register --namespace Microsoft.Chaos

# Create Chaos Studio target
az chaos target create \
  --target-name atp-aks-target \
  --target-type Microsoft-AzureKubernetesServiceChaosMesh \
  --resource-group ${RESOURCE_GROUP} \
  --parent-provider-namespace Microsoft.ContainerService \
  --parent-resource-type managedClusters \
  --parent-resource-name atp-aks-${LOCATION} \
  --location ${LOCATION}

echo "✅ Azure Chaos Studio target created"
echo ""

# Create Chaos Studio experiment
az chaos experiment create \
  --experiment-name atp-chaos-experiment \
  --resource-group ${RESOURCE_GROUP} \
  --location ${LOCATION} \
  --experiment-file chaos-experiments/azure-chaos-studio-experiment.json

echo "✅ Azure Chaos Studio experiment created"
echo ""

Custom Scripts: ATP-Specific Chaos

Custom Chaos Script Framework:

#!/bin/bash
# scripts/chaos-framework.sh
# ATP-specific chaos engineering framework

# Chaos Framework Configuration
CHAOS_NAMESPACE="${CHAOS_NAMESPACE:-chaos-testing}"
ENVIRONMENT="${ENVIRONMENT:-staging}"
LOG_LEVEL="${LOG_LEVEL:-INFO}"

# Chaos Framework Functions
chaos_inject() {
  local experiment_type=$1
  local experiment_config=$2
  local duration=${3:-"10m"}

  echo "[$(date +"%Y-%m-%d %H:%M:%S")] Injecting chaos: ${experiment_type}"

  case ${experiment_type} in
    pod-kill)
      ./scripts/execute-pod-failure-experiment.sh ${experiment_config} ${duration}
      ;;
    network-partition)
      ./scripts/execute-network-partition-experiment.sh ${experiment_config} ${duration}
      ;;
    database-failover)
      ./scripts/execute-database-failover-experiment.sh ${experiment_config} ${duration}
      ;;
    *)
      echo "Unknown experiment type: ${experiment_type}"
      return 1
      ;;
  esac
}

chaos_monitor() {
  local experiment_id=$1
  local duration=${2:-"10m"}

  echo "[$(date +"%Y-%m-%d %H:%M:%S")] Monitoring chaos: ${experiment_id}"

  ./scripts/monitor-chaos-experiment.sh ${experiment_id} ${duration}
}

chaos_rollback() {
  local experiment_id=$1

  echo "[$(date +"%Y-%m-%d %H:%M:%S")] Rolling back chaos: ${experiment_id}"

  ./scripts/rollback-chaos-experiment.sh ${experiment_id}
}

chaos_validate() {
  local experiment_id=$1

  echo "[$(date +"%Y-%m-%d %H:%M:%S")] Validating chaos: ${experiment_id}"

  ./scripts/validate-chaos-experiment.sh ${experiment_id}
}

# Export functions
export -f chaos_inject chaos_monitor chaos_rollback chaos_validate
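
The framework is meant to be sourced by other scripts rather than run directly. A short usage sketch is shown below; the experiment ID, target, and durations are illustrative.

#!/bin/bash
# scripts/example-chaos-run.sh
# Sketch: drive one experiment through the framework lifecycle.

source ./scripts/chaos-framework.sh

# Illustrative experiment ID; the framework does not generate one itself
EXPERIMENT_ID="pod-kill-$(date +%s)"

# Inject, monitor, then validate; roll back if validation fails
chaos_inject pod-kill atp-ingestion-api "10m"
chaos_monitor "${EXPERIMENT_ID}" "10m"

if ! chaos_validate "${EXPERIMENT_ID}"; then
  chaos_rollback "${EXPERIMENT_ID}"
  exit 1
fi

echo "✅ Experiment ${EXPERIMENT_ID} completed"
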

Chaos Tool Comparison

| Tool | Type | Integration | Best For |
|------|------|-------------|----------|
| Chaos Mesh | Kubernetes-native | Kubernetes, Prometheus | Pod, network, I/O chaos |
| Litmus Chaos | Kubernetes-native | Kubernetes, Argo | Workflow-based chaos |
| Azure Chaos Studio | Azure-native | Azure services | Azure resource chaos |
| Custom Scripts | ATP-specific | ATP services | ATP-specific scenarios |

CI/CD Integration

CI/CD integration enables automated chaos testing in staging pipelines with resilience validation and deployment blocking on chaos failures.

Chaos Tests in Staging Pipeline

Azure Pipelines Configuration:

# azure-pipelines/chaos-tests-stage.yaml
trigger:
  branches:
    include:
      - main
      - develop

stages:
  - stage: ChaosTests
    displayName: 'Chaos Engineering Tests'
    condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
    jobs:
      - job: ChaosResilienceTests
        displayName: 'Resilience Validation'
        pool:
          vmImage: 'ubuntu-latest'
        steps:
          - task: AzureCLI@2
            displayName: 'Setup Azure CLI'
            inputs:
              azureSubscription: 'atp-connection'
              scriptType: 'bash'
              scriptLocation: 'inlineScript'
              inlineScript: |
                az account set --subscription $(AZURE_SUBSCRIPTION_ID)
                az aks get-credentials --resource-group $(RESOURCE_GROUP) --name $(AKS_CLUSTER_NAME)

          - task: Bash@3
            displayName: 'Run Chaos Tests'
            inputs:
              targetType: 'inline'
              script: |
                # Install chaos tools
                ./scripts/install-chaos-mesh.sh chaos-testing

                # Run chaos test suite
                ./scripts/run-chaos-test-suite.sh staging

          - task: Bash@3
            displayName: 'Validate Resilience'
            inputs:
              targetType: 'inline'
              script: |
                # Validate resilience metrics
                ./scripts/validate-resilience-metrics.sh staging

                # Check if all tests passed
                if [ $? -ne 0 ]; then
                  echo "##vso[task.logissue type=error]Resilience validation failed"
                  exit 1
                fi

Chaos Test Suite Script:

#!/bin/bash
# scripts/run-chaos-test-suite.sh

ENVIRONMENT="${1:-staging}"
TEST_SUITE="${2:-basic}"

echo "🧪 Running Chaos Test Suite"
echo "Environment: ${ENVIRONMENT}"
echo "Test Suite: ${TEST_SUITE}"
echo ""

FAILED_TESTS=0
PASSED_TESTS=0

# Define test suite
case ${TEST_SUITE} in
  basic)
    TESTS=(
      "pod-failure:atp-ingestion-api"
      "network-latency:atp-ingestion-api:500ms"
      "database-failover:atp-sql-eastus"
    )
    ;;
  advanced)
    TESTS=(
      "pod-failure:atp-ingestion-api"
      "network-partition:atp-ingest-ns:atp-query-ns"
      "database-failover:atp-sql-eastus"
      "key-vault-unavailable:atp-kv-eastus"
      "traffic-surge:atp-ingestion-api:10x"
    )
    ;;
  *)
    echo "Unknown test suite: ${TEST_SUITE}"
    exit 1
    ;;
esac

# Run each test
for test in "${TESTS[@]}"; do
  TEST_TYPE=$(echo "${test}" | cut -d: -f1)
  TEST_PARAMS=$(echo "${test}" | cut -d: -f2-)

  echo "Running test: ${TEST_TYPE} (${TEST_PARAMS})"

  # Execute test
  ./scripts/execute-chaos-test.sh ${TEST_TYPE} ${TEST_PARAMS} ${ENVIRONMENT}
  TEST_RESULT=$?

  if [ ${TEST_RESULT} -eq 0 ]; then
    echo "✅ Test passed: ${TEST_TYPE}"
    PASSED_TESTS=$((PASSED_TESTS + 1))
  else
    echo "❌ Test failed: ${TEST_TYPE}"
    FAILED_TESTS=$((FAILED_TESTS + 1))
  fi

  echo ""
done

# Summary
echo "=========================================="
echo "Chaos Test Suite Summary"
echo "=========================================="
echo "Passed: ${PASSED_TESTS}"
echo "Failed: ${FAILED_TESTS}"
echo "Total: $((PASSED_TESTS + FAILED_TESTS))"
echo ""

if [ ${FAILED_TESTS} -gt 0 ]; then
  echo "❌ Test suite failed: ${FAILED_TESTS} test(s) failed"
  exit 1
else
  echo "✅ All tests passed"
  exit 0
fi

Automated Resilience Validation

Resilience Validation Script:

#!/bin/bash
# scripts/validate-resilience-metrics.sh

ENVIRONMENT="${1:-staging}"

echo "📊 Validating Resilience Metrics"
echo "Environment: ${ENVIRONMENT}"
echo ""

VALIDATION_PASSED=true

# Validate service availability
AVAILABILITY=$(curl -s http://prometheus:9090/api/v1/query?query=service_availability\{environment=\"${ENVIRONMENT}\"\} | jq -r '.data.result[0].value[1]' || echo "0")
AVAILABILITY_PERCENT=$(echo "${AVAILABILITY} * 100" | bc)

if (( $(echo "${AVAILABILITY_PERCENT} >= 99.9" | bc -l) )); then
  echo "✅ Service availability: ${AVAILABILITY_PERCENT}% (target: ≥99.9%)"
else
  echo "❌ Service availability: ${AVAILABILITY_PERCENT}% (target: ≥99.9%)"
  VALIDATION_PASSED=false
fi

# Validate error rate
ERROR_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{environment=\"${ENVIRONMENT}\",status=~\"5..\"\}[5m]\) | jq -r '.data.result[0].value[1]' || echo "0")

if (( $(echo "${ERROR_RATE} < 0.001" | bc -l) )); then
  echo "✅ Error rate: ${ERROR_RATE}/sec (target: <0.001/sec)"
else
  echo "❌ Error rate: ${ERROR_RATE}/sec (target: <0.001/sec)"
  VALIDATION_PASSED=false
fi

# Validate latency
P95_LATENCY=$(curl -s http://prometheus:9090/api/v1/query?query=histogram_quantile\(0.95,rate\(http_request_duration_seconds_bucket\{environment=\"${ENVIRONMENT}\"\}[5m]\)\) | jq -r '.data.result[0].value[1]' || echo "0")
P95_LATENCY_MS=$(echo "${P95_LATENCY} * 1000" | bc)

if (( $(echo "${P95_LATENCY_MS} < 500" | bc -l) )); then
  echo "✅ P95 latency: ${P95_LATENCY_MS}ms (target: <500ms)"
else
  echo "❌ P95 latency: ${P95_LATENCY_MS}ms (target: <500ms)"
  VALIDATION_PASSED=false
fi

# Validate circuit breaker state (assumes the circuit_breaker_state gauge reports 1 when a breaker is open)
CIRCUIT_BREAKER_OPEN=$(curl -s http://prometheus:9090/api/v1/query?query=circuit_breaker_state\{environment=\"${ENVIRONMENT}\"\} | jq -r '.data.result[] | select(.value[1] == "1") | .value[1]' | wc -l)

if [ "${CIRCUIT_BREAKER_OPEN}" = "0" ]; then
  echo "✅ Circuit breakers: All closed (target: 0 open)"
else
  echo "❌ Circuit breakers: ${CIRCUIT_BREAKER_OPEN} open (target: 0 open)"
  VALIDATION_PASSED=false
fi

echo ""

if [ "${VALIDATION_PASSED}" = true ]; then
  echo "✅ Resilience validation passed"
  exit 0
else
  echo "❌ Resilience validation failed"
  exit 1
fi

Block Deployment on Chaos Failures

Pipeline Gate Configuration:

# azure-pipelines/deployment-gate.yaml
trigger: none

stages:
  - stage: ChaosValidationGate
    displayName: 'Chaos Validation Gate'
    jobs:
      - job: ChaosGate
        displayName: 'Chaos Validation Gate'
        pool:
          vmImage: 'ubuntu-latest'
        steps:
          - task: Bash@3
            displayName: 'Check Chaos Test Results'
            inputs:
              targetType: 'inline'
              script: |
                # Get latest chaos test results
                LATEST_RESULTS=$(az pipelines runs list \
                  --pipeline-name "ChaosTests" \
                  --top 1 \
                  --query "[0].result" \
                  --output tsv)

                if [ "${LATEST_RESULTS}" != "succeeded" ]; then
                  echo "##vso[task.logissue type=error]Chaos tests failed. Deployment blocked."
                  exit 1
                fi

                echo "✅ Chaos tests passed. Deployment allowed."

Continuous Chaos

Continuous chaos provides low-level chaos running continuously with minimal blast radius (1% traffic affected) to detect resilience regressions proactively.

Continuous Chaos Configuration

Continuous Chaos Deployment:

# kubernetes/continuous-chaos/continuous-chaos-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: continuous-chaos-config
  namespace: chaos-testing
data:
  config.yaml: |
    continuousChaos:
      enabled: true
      blastRadius: 0.01  # 1% traffic affected
      experiments:
        - type: pod-kill
          frequency: "1h"
          target: "atp-ingestion-api"
          maxPodsAffected: 1

        - type: network-latency
          frequency: "2h"
          target: "atp-ingestion-api"
          latency: "100ms"
          duration: "30m"

        - type: error-injection
          frequency: "4h"
          target: "atp-ingestion-api"
          errorRate: 0.01  # 1% error rate
          duration: "15m"

      monitoring:
        enabled: true
        prometheusEndpoint: "http://prometheus:9090"
        alertThresholds:
          availability: 0.995  # 99.5%
          errorRate: 0.005  # 0.5%
          latency: 1000  # 1 second

Continuous Chaos Controller:

#!/bin/bash
# scripts/continuous-chaos-controller.sh

NAMESPACE="${1:-chaos-testing}"
CONFIG_FILE="${2:-continuous-chaos-config.yaml}"

echo "🔄 Starting Continuous Chaos Controller"
echo "Namespace: ${NAMESPACE}"
echo ""

# Load configuration
CONFIG=$(kubectl get configmap continuous-chaos-config -n ${NAMESPACE} -o jsonpath='{.data.config\.yaml}')

# Parse experiments as one compact JSON object per line so the read loop below
# receives a whole experiment per iteration
EXPERIMENTS=$(echo "${CONFIG}" | yq eval -o=json -I=0 '.continuousChaos.experiments[]' -)

# Continuous chaos loop
while true; do
  echo "[$(date +"%Y-%m-%d %H:%M:%S")] Continuous chaos cycle"

  while IFS= read -r experiment; do
    EXPERIMENT_TYPE=$(echo "${experiment}" | yq eval '.type' -)
    FREQUENCY=$(echo "${experiment}" | yq eval '.frequency' -)
    TARGET=$(echo "${experiment}" | yq eval '.target' -)

    # Check if experiment should run
    LAST_RUN=$(kubectl get configmap continuous-chaos-state -n ${NAMESPACE} -o jsonpath="{.data.${EXPERIMENT_TYPE}-${TARGET}}" 2>/dev/null || echo "")

    if [ -z "${LAST_RUN}" ]; then
      # First run
      RUN_EXPERIMENT=true
    else
      # Check frequency
      LAST_RUN_EPOCH=$(date -u -d "${LAST_RUN}" +%s)
      CURRENT_EPOCH=$(date +%s)
      FREQUENCY_SECONDS=$(echo "${FREQUENCY}" | sed 's/h/*3600/g' | sed 's/m/*60/g' | bc)
      ELAPSED=$((CURRENT_EPOCH - LAST_RUN_EPOCH))

      if (( ELAPSED >= FREQUENCY_SECONDS )); then
        RUN_EXPERIMENT=true
      else
        RUN_EXPERIMENT=false
      fi
    fi

    if [ "${RUN_EXPERIMENT}" = true ]; then
      echo "  Running experiment: ${EXPERIMENT_TYPE} on ${TARGET}"

      # Execute experiment
      ./scripts/execute-continuous-chaos-experiment.sh ${EXPERIMENT_TYPE} ${TARGET} "${experiment}"

      # Update last run time
      kubectl patch configmap continuous-chaos-state -n ${NAMESPACE} \
        --type merge \
        -p "{\"data\":{\"${EXPERIMENT_TYPE}-${TARGET}\":\"$(date -u +"%Y-%m-%dT%H:%M:%SZ")\"}}"
    fi
  done <<< "${EXPERIMENTS}"

  # Sleep for 1 minute before next cycle
  sleep 60
done
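
The controller patches a continuous-chaos-state ConfigMap to remember when each experiment last ran, so that ConfigMap must exist before the loop starts. A one-time bootstrap sketch:

#!/bin/bash
# scripts/bootstrap-continuous-chaos-state.sh
# One-time setup: create the (initially empty) state ConfigMap that
# continuous-chaos-controller.sh patches after each experiment run.

NAMESPACE="${1:-chaos-testing}"

kubectl create configmap continuous-chaos-state \
  --namespace "${NAMESPACE}" \
  --dry-run=client -o yaml | kubectl apply -f -

echo "✅ continuous-chaos-state ConfigMap ready in ${NAMESPACE}"
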

Continuous Chaos Experiment Script:

#!/bin/bash
# scripts/execute-continuous-chaos-experiment.sh

EXPERIMENT_TYPE="${1}"
TARGET="${2}"
EXPERIMENT_CONFIG="${3}"

echo "  Executing continuous chaos: ${EXPERIMENT_TYPE} on ${TARGET}"

case ${EXPERIMENT_TYPE} in
  pod-kill)
    # Kill 1 pod with minimal impact
    MAX_PODS=$(echo "${EXPERIMENT_CONFIG}" | yq eval '.maxPodsAffected' -)
    ./scripts/execute-pod-failure-experiment.sh ${TARGET} ${MAX_PODS} "5m"
    ;;

  network-latency)
    # Inject minimal latency
    LATENCY=$(echo "${EXPERIMENT_CONFIG}" | yq eval '.latency' -)
    DURATION=$(echo "${EXPERIMENT_CONFIG}" | yq eval '.duration' -)
    ./scripts/execute-latency-injection-experiment.sh ${TARGET} ${LATENCY} ${DURATION}
    ;;

  error-injection)
    # Inject minimal error rate
    ERROR_RATE=$(echo "${EXPERIMENT_CONFIG}" | yq eval '.errorRate' -)
    DURATION=$(echo "${EXPERIMENT_CONFIG}" | yq eval '.duration' -)
    ./scripts/execute-error-injection-experiment.sh ${TARGET} ${ERROR_RATE} ${DURATION}
    ;;

  *)
    echo "Unknown experiment type: ${EXPERIMENT_TYPE}"
    ;;
esac

Regression Detection

Regression Detection Script:

#!/bin/bash
# scripts/detect-resilience-regression.sh

ENVIRONMENT="${1:-staging}"
BASELINE_METRICS="${2:-baseline-metrics.json}"

echo "🔍 Detecting Resilience Regressions"
echo "Environment: ${ENVIRONMENT}"
echo ""

REGRESSION_DETECTED=false

# Get current metrics
CURRENT_AVAILABILITY=$(curl -s http://prometheus:9090/api/v1/query?query=service_availability\{environment=\"${ENVIRONMENT}\"\} | jq -r '.data.result[0].value[1]' || echo "0")
CURRENT_AVAILABILITY_PERCENT=$(echo "${CURRENT_AVAILABILITY} * 100" | bc)

BASELINE_AVAILABILITY=$(jq -r '.metrics.availability' ${BASELINE_METRICS} 2>/dev/null || echo "99.95")

# Compare with baseline
AVAILABILITY_DROP=$(echo "${BASELINE_AVAILABILITY} - ${CURRENT_AVAILABILITY_PERCENT}" | bc)

if (( $(echo "${AVAILABILITY_DROP} > 0.1" | bc -l) )); then
  echo "⚠️  Regression detected: Availability dropped by ${AVAILABILITY_DROP}%"
  echo "  Baseline: ${BASELINE_AVAILABILITY}%"
  echo "  Current: ${CURRENT_AVAILABILITY_PERCENT}%"
  REGRESSION_DETECTED=true
fi

# Get current error rate
CURRENT_ERROR_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{environment=\"${ENVIRONMENT}\",status=~\"5..\"\}[5m]\) | jq -r '.data.result[0].value[1]' || echo "0")
BASELINE_ERROR_RATE=$(jq -r '.metrics.errorRate' ${BASELINE_METRICS} 2>/dev/null || echo "0.0001")

ERROR_RATE_INCREASE=$(echo "${CURRENT_ERROR_RATE} - ${BASELINE_ERROR_RATE}" | bc)

if (( $(echo "${ERROR_RATE_INCREASE} > 0.0005" | bc -l) )); then
  echo "⚠️  Regression detected: Error rate increased by ${ERROR_RATE_INCREASE}/sec"
  echo "  Baseline: ${BASELINE_ERROR_RATE}/sec"
  echo "  Current: ${CURRENT_ERROR_RATE}/sec"
  REGRESSION_DETECTED=true
fi

if [ "${REGRESSION_DETECTED}" = true ]; then
  echo ""
  echo "❌ Resilience regression detected"
  # Send alert
  # ./scripts/send-alert.sh "Resilience regression detected in ${ENVIRONMENT}"
  exit 1
else
  echo "✅ No regression detected"
  exit 0
fi
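
The regression check reads its baseline from a JSON file with .metrics.availability (as a percentage) and .metrics.errorRate fields. A sketch for capturing that baseline from Prometheus during a known-good period, using the same queries as above:

#!/bin/bash
# scripts/capture-baseline-metrics.sh
# Sketch: snapshot current availability and error rate as the regression baseline.

ENVIRONMENT="${1:-staging}"
OUTPUT_FILE="${2:-baseline-metrics.json}"

AVAILABILITY=$(curl -s "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=service_availability{environment=\"${ENVIRONMENT}\"}" \
  | jq -r '.data.result[0].value[1]' || echo "0")

ERROR_RATE=$(curl -s "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=rate(http_requests_total{environment=\"${ENVIRONMENT}\",status=~\"5..\"}[5m])" \
  | jq -r '.data.result[0].value[1]' || echo "0")

# Store availability as a percentage to match detect-resilience-regression.sh
jq -n \
  --arg availability "$(echo "${AVAILABILITY} * 100" | bc)" \
  --arg errorRate "${ERROR_RATE}" \
  '{metrics: {availability: ($availability | tonumber), errorRate: ($errorRate | tonumber)}}' \
  > "${OUTPUT_FILE}"

echo "✅ Baseline written to ${OUTPUT_FILE}"
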

Chaos Reporting

Chaos reporting provides comprehensive tracking and reporting of chaos experiments, resilience scores, gap tracking, and GameDay reports.

Experiment Success/Failure Tracking

Experiment Tracking Database Schema:

-- Database schema for chaos experiment tracking
CREATE TABLE ChaosExperiments (
    Id UNIQUEIDENTIFIER PRIMARY KEY DEFAULT NEWID(),
    ExperimentId NVARCHAR(100) NOT NULL UNIQUE,
    ExperimentType NVARCHAR(50) NOT NULL,
    TargetService NVARCHAR(100),
    Environment NVARCHAR(50),
    Status NVARCHAR(20) NOT NULL, -- Running, Success, Failed, Aborted
    StartTime DATETIME2 NOT NULL,
    EndTime DATETIME2,
    Duration INT, -- seconds
    BlastRadius DECIMAL(5,2),
    Hypothesis NVARCHAR(MAX),
    CreatedAt DATETIME2 DEFAULT GETUTCDATE(),
    UpdatedAt DATETIME2 DEFAULT GETUTCDATE()
);

CREATE TABLE ChaosExperimentMetrics (
    Id UNIQUEIDENTIFIER PRIMARY KEY DEFAULT NEWID(),
    ExperimentId UNIQUEIDENTIFIER NOT NULL,
    MetricName NVARCHAR(100) NOT NULL,
    MetricValue DECIMAL(18,4),
    Timestamp DATETIME2 NOT NULL,
    FOREIGN KEY (ExperimentId) REFERENCES ChaosExperiments(Id)
);

CREATE TABLE ChaosExperimentResults (
    Id UNIQUEIDENTIFIER PRIMARY KEY DEFAULT NEWID(),
    ExperimentId UNIQUEIDENTIFIER NOT NULL,
    Success BIT NOT NULL,
    ValidationPassed BIT,
    IssuesFound NVARCHAR(MAX),
    LessonsLearned NVARCHAR(MAX),
    CreatedAt DATETIME2 DEFAULT GETUTCDATE(),
    FOREIGN KEY (ExperimentId) REFERENCES ChaosExperiments(Id)
);
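
With this schema in place, a simple roll-up gives the success rate per experiment type. The query below is a sketch using the same sqlcmd conventions as the tracking scripts and assumes SQL_SERVER and DATABASE are set in the environment.

#!/bin/bash
# scripts/report-experiment-success-rate.sh
# Sketch: summarize chaos experiment outcomes per experiment type.

sqlcmd -S ${SQL_SERVER} -d ${DATABASE} -Q "
SELECT
    ExperimentType,
    COUNT(*) AS TotalRuns,
    SUM(CASE WHEN Status = 'Success' THEN 1 ELSE 0 END) AS Successes,
    CAST(100.0 * SUM(CASE WHEN Status = 'Success' THEN 1 ELSE 0 END) / COUNT(*) AS DECIMAL(5,2)) AS SuccessRatePercent
FROM ChaosExperiments
GROUP BY ExperimentType
ORDER BY SuccessRatePercent ASC
" -W
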

Experiment Tracking Script:

#!/bin/bash
# scripts/track-chaos-experiment.sh

EXPERIMENT_ID="${1}"
EXPERIMENT_TYPE="${2}"
TARGET_SERVICE="${3}"
ENVIRONMENT="${4}"
STATUS="${5}"  # Success, Failed, Aborted

echo "📊 Tracking Chaos Experiment"
echo "Experiment ID: ${EXPERIMENT_ID}"
echo ""

# Record experiment in database
sqlcmd -S ${SQL_SERVER} -d ${DATABASE} -Q "
INSERT INTO ChaosExperiments (ExperimentId, ExperimentType, TargetService, Environment, Status, StartTime, EndTime)
VALUES ('${EXPERIMENT_ID}', '${EXPERIMENT_TYPE}', '${TARGET_SERVICE}', '${ENVIRONMENT}', '${STATUS}', 
        GETUTCDATE(), GETUTCDATE())
"

# Get experiment metrics
AVAILABILITY=$(curl -s http://prometheus:9090/api/v1/query?query=service_availability\{service=\"${TARGET_SERVICE}\"\} | jq -r '.data.result[0].value[1]' || echo "0")
ERROR_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${TARGET_SERVICE}\",status=~\"5..\"\}[5m]\) | jq -r '.data.result[0].value[1]' || echo "0")

# Record metrics
sqlcmd -S ${SQL_SERVER} -d ${DATABASE} -Q "
DECLARE @ExperimentId UNIQUEIDENTIFIER = (SELECT Id FROM ChaosExperiments WHERE ExperimentId = '${EXPERIMENT_ID}');
INSERT INTO ChaosExperimentMetrics (ExperimentId, MetricName, MetricValue, Timestamp)
VALUES (@ExperimentId, 'Availability', ${AVAILABILITY}, GETUTCDATE()),
       (@ExperimentId, 'ErrorRate', ${ERROR_RATE}, GETUTCDATE())
"

echo "✅ Experiment tracked"

Resilience Score Over Time

Resilience Score Calculation:

#!/bin/bash
# scripts/calculate-resilience-score.sh

ENVIRONMENT="${1:-staging}"
TIME_RANGE="${2:-7d}"  # 7 days, 30d, etc.

echo "📊 Calculating Resilience Score"
echo "Environment: ${ENVIRONMENT}"
echo "Time Range: ${TIME_RANGE}"
echo ""

# Get metrics for time range
AVAILABILITY=$(curl -s "http://prometheus:9090/api/v1/query?query=avg_over_time(service_availability{environment=\"${ENVIRONMENT}\"}[${TIME_RANGE}])" | jq -r '.data.result[0].value[1]' || echo "0")
AVAILABILITY_PERCENT=$(echo "${AVAILABILITY} * 100" | bc)

ERROR_RATE=$(curl -s "http://prometheus:9090/api/v1/query?query=avg_over_time(rate(http_requests_total{environment=\"${ENVIRONMENT}\",status=~\"5..\"}[5m])[${TIME_RANGE}:1h])" | jq -r '.data.result[0].value[1]' || echo "0")

P95_LATENCY=$(curl -s "http://prometheus:9090/api/v1/query?query=histogram_quantile(0.95,avg_over_time(rate(http_request_duration_seconds_bucket{environment=\"${ENVIRONMENT}\"}[5m])[${TIME_RANGE}:1h]))" | jq -r '.data.result[0].value[1]' || echo "0")
P95_LATENCY_MS=$(echo "${P95_LATENCY} * 1000" | bc)

# Calculate resilience score (0-100)
# Availability: 40 points
AVAILABILITY_SCORE=$(echo "scale=2; ${AVAILABILITY_PERCENT} * 0.4" | bc)

# Error Rate: 30 points (inverse)
ERROR_RATE_SCORE=$(echo "scale=2; (1 - ${ERROR_RATE} * 1000) * 30" | bc)
if (( $(echo "${ERROR_RATE_SCORE} < 0" | bc -l) )); then
  ERROR_RATE_SCORE=0
fi

# Latency: 30 points (inverse)
LATENCY_SCORE=$(echo "scale=2; (1 - ${P95_LATENCY_MS} / 1000) * 30" | bc)
if (( $(echo "${LATENCY_SCORE} < 0" | bc -l) )); then
  LATENCY_SCORE=0
fi

RESILIENCE_SCORE=$(echo "${AVAILABILITY_SCORE} + ${ERROR_RATE_SCORE} + ${LATENCY_SCORE}" | bc)

echo "Resilience Score Breakdown:"
echo "  Availability (40%): ${AVAILABILITY_SCORE} / 40"
echo "  Error Rate (30%): ${ERROR_RATE_SCORE} / 30"
echo "  Latency (30%): ${LATENCY_SCORE} / 30"
echo ""
echo "Overall Resilience Score: ${RESILIENCE_SCORE} / 100"
echo ""

# Record resilience score
sqlcmd -S ${SQL_SERVER} -d ${DATABASE} -Q "
INSERT INTO ResilienceScores (Environment, TimeRange, Score, Availability, ErrorRate, Latency, Timestamp)
VALUES ('${ENVIRONMENT}', '${TIME_RANGE}', ${RESILIENCE_SCORE}, ${AVAILABILITY_PERCENT}, ${ERROR_RATE}, ${P95_LATENCY_MS}, GETUTCDATE())
"

Gap Tracking and Remediation

Gap Tracking Database Schema:

-- Database schema for resilience gap tracking
CREATE TABLE ResilienceGaps (
    Id UNIQUEIDENTIFIER PRIMARY KEY DEFAULT NEWID(),
    GapId NVARCHAR(100) NOT NULL UNIQUE,
    Title NVARCHAR(200) NOT NULL,
    Description NVARCHAR(MAX),
    Category NVARCHAR(50), -- Infrastructure, Application, Security, Data
    Severity NVARCHAR(20), -- Critical, High, Medium, Low
    ImpactScore DECIMAL(5,2),
    LikelihoodScore DECIMAL(5,2),
    PriorityScore DECIMAL(5,2), -- Impact × Likelihood
    Status NVARCHAR(20), -- Open, InProgress, Resolved, Closed
    AssignedTo NVARCHAR(100),
    CreatedAt DATETIME2 DEFAULT GETUTCDATE(),
    UpdatedAt DATETIME2 DEFAULT GETUTCDATE(),
    ResolvedAt DATETIME2
);

CREATE TABLE GapRemediation (
    Id UNIQUEIDENTIFIER PRIMARY KEY DEFAULT NEWID(),
    GapId UNIQUEIDENTIFIER NOT NULL,
    Action NVARCHAR(200) NOT NULL,
    Status NVARCHAR(20), -- Planned, InProgress, Completed
    AssignedTo NVARCHAR(100),
    DueDate DATETIME2,
    CompletedAt DATETIME2,
    FOREIGN KEY (GapId) REFERENCES ResilienceGaps(Id)
);

Gap Tracking Script:

#!/bin/bash
# scripts/track-resilience-gap.sh

GAP_ID="${1}"
TITLE="${2}"
DESCRIPTION="${3}"
CATEGORY="${4}"
SEVERITY="${5}"
IMPACT_SCORE="${6}"
LIKELIHOOD_SCORE="${7}"

echo "📋 Tracking Resilience Gap"
echo "Gap ID: ${GAP_ID}"
echo ""

# Calculate priority score
PRIORITY_SCORE=$(echo "${IMPACT_SCORE} * ${LIKELIHOOD_SCORE}" | bc)

# Insert gap into database
sqlcmd -S ${SQL_SERVER} -d ${DATABASE} -Q "
INSERT INTO ResilienceGaps (GapId, Title, Description, Category, Severity, ImpactScore, LikelihoodScore, PriorityScore, Status)
VALUES ('${GAP_ID}', '${TITLE}', '${DESCRIPTION}', '${CATEGORY}', '${SEVERITY}', 
        ${IMPACT_SCORE}, ${LIKELIHOOD_SCORE}, ${PRIORITY_SCORE}, 'Open')
"

echo "✅ Gap tracked"
echo "  Priority Score: ${PRIORITY_SCORE} (Impact: ${IMPACT_SCORE}, Likelihood: ${LIKELIHOOD_SCORE})"

Quarterly GameDay Reports

GameDay Report Generation:

#!/bin/bash
# scripts/generate-quarterly-gameday-report.sh

QUARTER="${1:-Q1}"
YEAR="${2:-$(date +%Y)}"

echo "📊 Generating Quarterly GameDay Report"
echo "Quarter: ${QUARTER} ${YEAR}"
echo ""

REPORT_FILE="gameday-reports/quarterly-${QUARTER}-${YEAR}-report.md"

# Get GameDay data for quarter
case ${QUARTER} in
  Q1)
    START_DATE="${YEAR}-01-01"
    END_DATE="${YEAR}-03-31"
    ;;
  Q2)
    START_DATE="${YEAR}-04-01"
    END_DATE="${YEAR}-06-30"
    ;;
  Q3)
    START_DATE="${YEAR}-07-01"
    END_DATE="${YEAR}-09-30"
    ;;
  Q4)
    START_DATE="${YEAR}-10-01"
    END_DATE="${YEAR}-12-31"
    ;;
  *)
    echo "Unknown quarter: ${QUARTER} (expected Q1, Q2, Q3, or Q4)"
    exit 1
    ;;
esac

# Ensure the report directory exists before writing
mkdir -p "$(dirname "${REPORT_FILE}")"

cat > ${REPORT_FILE} <<EOF
# ATP Chaos Engineering Quarterly Report

**Quarter**: ${QUARTER} ${YEAR}  
**Period**: ${START_DATE} to ${END_DATE}  
**Generated**: $(date -u +"%Y-%m-%d")

## Executive Summary

[Summary of chaos engineering activities for the quarter]

## GameDay Activities

### GameDays Conducted
- **Total GameDays**: [COUNT]
- **Scenarios Executed**: [LIST]
- **Success Rate**: [PERCENTAGE]%

### GameDay Metrics Summary

| Metric | Target | Average | Status |
|--------|--------|---------|--------|
| **MTTD** | <5 min | [VALUE] | [PASS/FAIL] |
| **MTTR** | <10 min | [VALUE] | [PASS/FAIL] |
| **RTO** | <30 min | [VALUE] | [PASS/FAIL] |
| **RPO** | <1 hour | [VALUE] | [PASS/FAIL] |

## Resilience Score Trends

[Resilience score trends over the quarter]

## Resilience Gaps

### Gaps Identified
- **Total Gaps**: [COUNT]
- **Critical**: [COUNT]
- **High**: [COUNT]
- **Medium**: [COUNT]
- **Low**: [COUNT]

### Gap Remediation
- **Resolved**: [COUNT]
- **In Progress**: [COUNT]
- **Open**: [COUNT]

## Improvement Actions

[Summary of improvement actions and their status]

## Lessons Learned

[Key lessons learned from GameDays and experiments]

## Next Quarter Plan

[Planned activities for next quarter]
EOF

echo "✅ Quarterly report generated: ${REPORT_FILE}"
echo ""

Improvement Tracking

Improvement tracking provides resilience backlog management with prioritization, implementation tracking, and resilience metrics dashboards.

Resilience Backlog

Backlog Management Script:

#!/bin/bash
# scripts/manage-resilience-backlog.sh

ACTION="${1:-list}"  # list, add, prioritize

case ${ACTION} in
  list)
    echo "📋 Resilience Backlog"
    echo ""
    sqlcmd -S ${SQL_SERVER} -d ${DATABASE} -Q "
    SELECT 
        GapId,
        Title,
        Category,
        Severity,
        PriorityScore,
        Status,
        AssignedTo
    FROM ResilienceGaps
    WHERE Status IN ('Open', 'InProgress')
    ORDER BY PriorityScore DESC
    " -W -h -1
    ;;

  add)
    GAP_ID="${2}"
    TITLE="${3}"
    DESCRIPTION="${4}"
    CATEGORY="${5}"
    SEVERITY="${6}"
    IMPACT="${7}"
    LIKELIHOOD="${8}"

    ./scripts/track-resilience-gap.sh ${GAP_ID} "${TITLE}" "${DESCRIPTION}" ${CATEGORY} ${SEVERITY} ${IMPACT} ${LIKELIHOOD}
    ;;

  prioritize)
    echo "📊 Prioritizing Resilience Backlog"
    echo ""
    sqlcmd -S ${SQL_SERVER} -d ${DATABASE} -Q "
    UPDATE ResilienceGaps
    SET PriorityScore = ImpactScore * LikelihoodScore
    WHERE Status IN ('Open', 'InProgress')
    "
    echo "✅ Backlog prioritized"
    ;;

  *)
    echo "Unknown action: ${ACTION}"
    exit 1
    ;;
esac

Prioritization (Impact × Likelihood)

Prioritization Matrix:

graph TD
    HIGH[High Impact] --> HIGHHIGH[High Priority<br/>Impact × Likelihood]
    HIGH --> HIGHLOW[Medium Priority]
    LOW[Low Impact] --> LOWHIGH[Medium Priority]
    LOW --> LOWLOW[Low Priority]

    HIGHHIGH --> CRITICAL[Critical: Immediate Action]
    HIGHLOW --> HIGH_PRIO[High: Plan Soon]
    LOWHIGH --> HIGH_PRIO
    LOWLOW --> LOW_PRIO[Low: Backlog]

    style HIGHHIGH fill:#FF6B6B
    style CRITICAL fill:#FF6B6B
    style HIGHLOW fill:#FFD93D
    style LOWHIGH fill:#FFD93D
    style HIGH_PRIO fill:#FFD93D
    style LOWLOW fill:#6BCF7F
    style LOW_PRIO fill:#6BCF7F

Impact and Likelihood Scoring:

| Impact | Score | Description |
|--------|-------|-------------|
| Critical | 5 | Complete service outage, data loss, security breach |
| High | 4 | Significant service degradation, partial data loss |
| Medium | 3 | Moderate service impact, limited functionality loss |
| Low | 2 | Minor service impact, cosmetic issues |
| Very Low | 1 | Negligible impact |

| Likelihood | Score | Description |
|------------|-------|-------------|
| Very High | 5 | >50% chance of occurrence |
| High | 4 | 25-50% chance of occurrence |
| Medium | 3 | 10-25% chance of occurrence |
| Low | 2 | 1-10% chance of occurrence |
| Very Low | 1 | <1% chance of occurrence |
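
As a worked example, a gap scored High impact (4) and Medium likelihood (3) gets a priority score of 4 × 3 = 12 and sorts near the top of the backlog. The call below is illustrative; the gap ID, title, and description are hypothetical.

#!/bin/bash
# Example: record a hypothetical gap scored Impact 4 (High) x Likelihood 3 (Medium);
# track-resilience-gap.sh computes PriorityScore = 4 * 3 = 12.

./scripts/track-resilience-gap.sh \
  "GAP-2025-042" \
  "Ingestion API lacks retry budget on Service Bus publish" \
  "Events are dropped when the topic is briefly unavailable" \
  "Application" \
  "High" \
  4 \
  3
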

Implementation Tracking

Implementation Tracking Script:

#!/bin/bash
# scripts/track-implementation.sh

GAP_ID="${1}"
ACTION="${2}"
STATUS="${3}"
ASSIGNED_TO="${4}"
DUE_DATE="${5}"

echo "📝 Tracking Implementation"
echo "Gap ID: ${GAP_ID}"
echo "Action: ${ACTION}"
echo ""

# Insert remediation action
sqlcmd -S ${SQL_SERVER} -d ${DATABASE} -Q "
DECLARE @GapId UNIQUEIDENTIFIER = (SELECT Id FROM ResilienceGaps WHERE GapId = '${GAP_ID}');
INSERT INTO GapRemediation (GapId, Action, Status, AssignedTo, DueDate)
VALUES (@GapId, '${ACTION}', '${STATUS}', '${ASSIGNED_TO}', '${DUE_DATE}')
"

# Update gap status
sqlcmd -S ${SQL_SERVER} -d ${DATABASE} -Q "
UPDATE ResilienceGaps
SET Status = '${STATUS}', AssignedTo = '${ASSIGNED_TO}', UpdatedAt = GETUTCDATE()
WHERE GapId = '${GAP_ID}'
"

echo "✅ Implementation tracked"

Resilience Metrics Dashboard

Dashboard Configuration:

# dashboards/resilience-metrics-dashboard.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: resilience-metrics-dashboard
  namespace: monitoring
data:
  dashboard.json: |
    {
      "dashboard": {
        "title": "ATP Resilience Metrics Dashboard",
        "panels": [
          {
            "title": "Resilience Score Over Time",
            "type": "graph",
            "targets": [
              {
                "expr": "resilience_score{environment=\"staging\"}",
                "legendFormat": "Resilience Score"
              }
            ]
          },
          {
            "title": "Chaos Experiment Success Rate",
            "type": "stat",
            "targets": [
              {
                "expr": "rate(chaos_experiments_total{status=\"success\"}[7d]) / rate(chaos_experiments_total[7d])",
                "legendFormat": "Success Rate"
              }
            ]
          },
          {
            "title": "Resilience Gaps by Priority",
            "type": "piechart",
            "targets": [
              {
                "expr": "resilience_gaps_total{priority=\"critical\"}",
                "legendFormat": "Critical"
              },
              {
                "expr": "resilience_gaps_total{priority=\"high\"}",
                "legendFormat": "High"
              },
              {
                "expr": "resilience_gaps_total{priority=\"medium\"}",
                "legendFormat": "Medium"
              }
            ]
          },
          {
            "title": "GameDay Metrics Trend",
            "type": "table",
            "targets": [
              {
                "expr": "gameday_mttd",
                "legendFormat": "MTTD"
              },
              {
                "expr": "gameday_mttr",
                "legendFormat": "MTTR"
              },
              {
                "expr": "gameday_rto",
                "legendFormat": "RTO"
              },
              {
                "expr": "gameday_rpo",
                "legendFormat": "RPO"
              }
            ]
          }
        ]
      }
    }

Dashboard Generation Script:

#!/bin/bash
# scripts/generate-resilience-dashboard.sh

ENVIRONMENT="${1:-staging}"

echo "📊 Generating Resilience Metrics Dashboard"
echo "Environment: ${ENVIRONMENT}"
echo ""

# Generate dashboard JSON
DASHBOARD_JSON=$(cat <<EOF
{
  "dashboard": {
    "title": "ATP Resilience Metrics - ${ENVIRONMENT}",
    "time": {
      "from": "now-7d",
      "to": "now"
    },
    "panels": [
      {
        "id": 1,
        "title": "Resilience Score",
        "type": "stat",
        "targets": [
          {
            "expr": "resilience_score{environment=\"${ENVIRONMENT}\"}",
            "refId": "A"
          }
        ]
      },
      {
        "id": 2,
        "title": "Chaos Experiments",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(chaos_experiments_total{environment=\"${ENVIRONMENT}\"}[1h])",
            "refId": "A"
          }
        ]
      },
      {
        "id": 3,
        "title": "Resilience Gaps",
        "type": "table",
        "targets": [
          {
            "expr": "resilience_gaps_total{environment=\"${ENVIRONMENT}\"}",
            "refId": "A"
          }
        ]
      }
    ]
  }
}
EOF
)

# Create dashboard in Grafana
curl -X POST \
  http://grafana:3000/api/dashboards/db \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${GRAFANA_API_KEY}" \
  -d "${DASHBOARD_JSON}"

echo "✅ Dashboard generated"
echo ""

Chaos Automation and Reporting Visualization

graph TB
    AUTOMATION[Chaos Automation] --> CHAOSMESH[Chaos Mesh]
    AUTOMATION --> LITMUS[Litmus Chaos]
    AUTOMATION --> AZURESTUDIO[Azure Chaos Studio]
    AUTOMATION --> CUSTOM[Custom Scripts]

    CICD[CI/CD Integration] --> STAGING[Staging Pipeline]
    STAGING --> CHAOSTESTS[Chaos Tests]
    CHAOSTESTS --> VALIDATION[Resilience Validation]
    VALIDATION --> GATE[Deployment Gate]

    CONTINUOUS[Continuous Chaos] --> LOWLEVEL[Low-Level Chaos]
    LOWLEVEL --> REGRESSION[Regression Detection]

    REPORTING[Chaos Reporting] --> TRACKING[Experiment Tracking]
    REPORTING --> SCORE[Resilience Score]
    REPORTING --> GAPS[Gap Tracking]
    REPORTING --> GAMEDAY[GameDay Reports]

    IMPROVEMENT[Improvement Tracking] --> BACKLOG[Resilience Backlog]
    BACKLOG --> PRIORITY[Prioritization]
    PRIORITY --> IMPLEMENTATION[Implementation Tracking]
    IMPLEMENTATION --> DASHBOARD[Resilience Dashboard]

    style AUTOMATION fill:#FFE5B4
    style CICD fill:#FFE5B4
    style CONTINUOUS fill:#FFE5B4
    style REPORTING fill:#FFE5B4
    style IMPROVEMENT fill:#FFE5B4

Summary: Chaos Automation and Reporting

  • Chaos Automation Tools: Comprehensive chaos automation tools including Chaos Mesh (Kubernetes-native), Litmus Chaos (chaos workflows), Azure Chaos Studio (Azure-integrated), and Custom Scripts (ATP-specific chaos); with installation scripts, configuration examples, and tool comparison table
  • CI/CD Integration: Automated chaos testing in staging pipelines with resilience validation and deployment blocking; includes Azure Pipelines configuration, chaos test suite script, resilience validation script, and pipeline gate configuration to block deployments on chaos failures
  • Continuous Chaos: Low-level chaos running continuously with minimal blast radius (1% traffic affected) to detect resilience regressions proactively; includes continuous chaos configuration, controller script, experiment execution script, and regression detection script
  • Chaos Reporting: Comprehensive tracking and reporting including experiment success/failure tracking (database schema and tracking script), resilience score calculation over time, gap tracking and remediation (database schema and tracking script), and quarterly GameDay report generation
  • Improvement Tracking: Resilience backlog management with prioritization (impact × likelihood matrix), implementation tracking (tracking script and database schema), and resilience metrics dashboard (dashboard configuration and generation script); with automated prioritization and visualization

Chaos Experiment Catalog

Purpose: Provide a comprehensive catalog of all chaos experiments available in ATP, organized by category (Infrastructure, Application, Data) with standardized experiment specifications including hypothesis, duration, frequency, and blast radius for easy reference and experiment planning.


Infrastructure Experiments

Infrastructure chaos experiments validate Kubernetes infrastructure resilience including pod failures, node failures, availability zone failures, and regional failures.

Infrastructure Experiment Catalog:

| Experiment | Hypothesis | Duration | Frequency | Blast Radius |
|------------|------------|----------|-----------|--------------|
| Pod Kill | Service remains available when 1 pod is killed, pods reschedule automatically, and service recovers within 5 minutes | 5 min | Weekly | 1 pod |
| Node Failure | Pods reschedule successfully when 1 node fails, StatefulSet quorum is maintained, and services recover within 15 minutes | 15 min | Monthly | 1 node |
| AZ Failure | Multi-AZ deployment works when 1 availability zone fails, pods redistribute across remaining AZs, and services recover within 30 minutes | 30 min | Quarterly | 1 AZ |
| Region Failure | Regional failover succeeds when 1 region fails, Traffic Manager routes to secondary region, and services are operational within RTO target (30 minutes) | 1 hour | Annually | 1 region |

Infrastructure Experiment Execution:

#!/bin/bash
# scripts/execute-infrastructure-experiment.sh

EXPERIMENT_TYPE="${1}"  # pod-kill, node-failure, az-failure, region-failure
ENVIRONMENT="${2:-staging}"

echo "🏗️  Executing Infrastructure Experiment: ${EXPERIMENT_TYPE}"
echo "Environment: ${ENVIRONMENT}"
echo ""

case ${EXPERIMENT_TYPE} in
  pod-kill)
    echo "Executing Pod Kill experiment..."
    ./scripts/execute-pod-failure-experiment.sh atp-ingestion-api 1 "5m" ${ENVIRONMENT}
    ;;

  node-failure)
    echo "Executing Node Failure experiment..."
    ./scripts/execute-node-failure-experiment.sh 1 "15m" ${ENVIRONMENT}
    ;;

  az-failure)
    echo "Executing AZ Failure experiment..."
    ./scripts/execute-az-failure-experiment.sh eastus-1 "30m" ${ENVIRONMENT}
    ;;

  region-failure)
    echo "Executing Region Failure experiment..."
    ./scripts/execute-regional-failover-drill.sh eastus "1h" ${ENVIRONMENT}
    ;;

  *)
    echo "Unknown infrastructure experiment: ${EXPERIMENT_TYPE}"
    exit 1
    ;;
esac

echo "✅ Infrastructure experiment completed"

Application Experiments

Application chaos experiments validate application-level resilience including service failures, latency handling, error recovery, and autoscaling behavior.

Application Experiment Catalog:

| Experiment | Hypothesis | Duration | Frequency | Blast Radius |
|------------|------------|----------|-----------|--------------|
| Service Crash | Circuit breaker prevents cascade failures when 1 service crashes, upstream services handle failures gracefully, and service recovers when restarted | 5 min | Weekly | 1 service |
| Latency Injection | Timeouts prevent hanging requests when network latency increases, retry mechanisms handle transient delays, and services recover when latency is removed | 10 min | Weekly | 25% traffic |
| Error Injection | Retries recover from errors when HTTP 500 errors are injected, circuit breakers prevent cascading failures, and services recover when errors are removed | 5 min | Weekly | 10% requests |
| CPU Stress | HPA scales under load when CPU usage increases, pods scale horizontally to meet demand, and service performance remains acceptable | 15 min | Monthly | 1 deployment |

Application Experiment Execution:

#!/bin/bash
# scripts/execute-application-experiment.sh

EXPERIMENT_TYPE="${1}"  # service-crash, latency-injection, error-injection, cpu-stress
TARGET_SERVICE="${2}"
ENVIRONMENT="${3:-staging}"

echo "📱 Executing Application Experiment: ${EXPERIMENT_TYPE}"
echo "Target Service: ${TARGET_SERVICE}"
echo "Environment: ${ENVIRONMENT}"
echo ""

case ${EXPERIMENT_TYPE} in
  service-crash)
    echo "Executing Service Crash experiment..."
    ./scripts/execute-downstream-service-failure.sh ${TARGET_SERVICE} "5m" ${ENVIRONMENT}
    ;;

  latency-injection)
    echo "Executing Latency Injection experiment..."
    ./scripts/execute-latency-injection-experiment.sh ${TARGET_SERVICE} "500ms" "10m" ${ENVIRONMENT}
    ;;

  error-injection)
    echo "Executing Error Injection experiment..."
    ./scripts/execute-error-injection-experiment.sh ${TARGET_SERVICE} 0.1 "5m" ${ENVIRONMENT}
    ;;

  cpu-stress)
    echo "Executing CPU Stress experiment..."
    ./scripts/execute-resource-exhaustion-experiment.sh ${TARGET_SERVICE} cpu "15m" ${ENVIRONMENT}
    ;;

  *)
    echo "Unknown application experiment: ${EXPERIMENT_TYPE}"
    exit 1
    ;;
esac

echo "✅ Application experiment completed"

Data Experiments

Data chaos experiments validate data layer resilience including database failover, cache failures, and message queue disruptions.

Data Experiment Catalog:

| Experiment | Hypothesis | Duration | Frequency | Blast Radius |
|------------|------------|----------|-----------|--------------|
| DB Failover | Connection retry recovers when database fails over to replica, transactions complete successfully, and no data loss occurs | 5 min | Monthly | Read-only mode |
| Cache Failure | DB fallback maintains function when Redis cache becomes unavailable, database queries handle increased load, and services remain functional with higher latency | 10 min | Weekly | 1 cache instance |
| Queue Pause | Backpressure prevents overflow when Service Bus topic is paused, messages are buffered in outbox, and services recover when queue is restored | 15 min | Monthly | 1 topic |

Data Experiment Execution:

#!/bin/bash
# scripts/execute-data-experiment.sh

EXPERIMENT_TYPE="${1}"  # db-failover, cache-failure, queue-pause
ENVIRONMENT="${2:-staging}"

echo "💾 Executing Data Experiment: ${EXPERIMENT_TYPE}"
echo "Environment: ${ENVIRONMENT}"
echo ""

case ${EXPERIMENT_TYPE} in
  db-failover)
    echo "Executing Database Failover experiment..."
    ./scripts/execute-database-failover-experiment.sh atp-sql-eastus "5m" ${ENVIRONMENT}
    ;;

  cache-failure)
    echo "Executing Cache Failure experiment..."
    ./scripts/execute-cache-failure-experiment.sh atp-redis-eastus "10m" ${ENVIRONMENT}
    ;;

  queue-pause)
    echo "Executing Queue Pause experiment..."
    ./scripts/execute-message-queue-disruption-experiment.sh atp-service-bus-topic "15m" ${ENVIRONMENT}
    ;;

  *)
    echo "Unknown data experiment: ${EXPERIMENT_TYPE}"
    exit 1
    ;;
esac

echo "✅ Data experiment completed"

Experiment Catalog Summary:

graph TB
    CATALOG[Chaos Experiment Catalog] --> INFRA[Infrastructure Experiments]
    CATALOG --> APP[Application Experiments]
    CATALOG --> DATA[Data Experiments]

    INFRA --> POD[Pod Kill<br/>Weekly, 5min, 1 pod]
    INFRA --> NODE[Node Failure<br/>Monthly, 15min, 1 node]
    INFRA --> AZ[AZ Failure<br/>Quarterly, 30min, 1 AZ]
    INFRA --> REGION[Region Failure<br/>Annually, 1h, 1 region]

    APP --> SERVICE[Service Crash<br/>Weekly, 5min, 1 service]
    APP --> LATENCY[Latency Injection<br/>Weekly, 10min, 25% traffic]
    APP --> ERROR[Error Injection<br/>Weekly, 5min, 10% requests]
    APP --> CPU[CPU Stress<br/>Monthly, 15min, 1 deployment]

    DATA --> DB[DB Failover<br/>Monthly, 5min, Read-only]
    DATA --> CACHE[Cache Failure<br/>Weekly, 10min, 1 cache instance]
    DATA --> QUEUE[Queue Pause<br/>Monthly, 15min, 1 topic]

    style CATALOG fill:#FFE5B4
    style INFRA fill:#B4E5FF
    style APP fill:#B4FFB4
    style DATA fill:#FFB4E5

Experiment Schedule Overview:

| Frequency | Experiments | Total Experiments/Month |
|-----------|-------------|-------------------------|
| Weekly | Pod Kill, Service Crash, Latency Injection, Error Injection, Cache Failure | ~20/month |
| Monthly | Node Failure, CPU Stress, DB Failover, Queue Pause | 4/month |
| Quarterly | AZ Failure | 1/quarter (4/year) |
| Annually | Region Failure | 1/year |
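
The schedule above can be wired up with plain cron on a runner host (or equivalent CronJobs). The entries below are a sketch: the install path, host, and time slots are illustrative, and experiments are staggered so no two run at the same time.

# Example crontab for the recurring experiments (illustrative paths and times)
# Weekly experiments, staggered across weekday early-morning windows
0 2 * * 1  /opt/atp/scripts/execute-infrastructure-experiment.sh pod-kill staging
0 2 * * 2  /opt/atp/scripts/execute-application-experiment.sh service-crash atp-ingestion-api staging
0 2 * * 3  /opt/atp/scripts/execute-application-experiment.sh latency-injection atp-ingestion-api staging
0 2 * * 4  /opt/atp/scripts/execute-application-experiment.sh error-injection atp-ingestion-api staging
0 2 * * 5  /opt/atp/scripts/execute-data-experiment.sh cache-failure staging

# Monthly experiments, run on the 1st of each month in separate hourly slots
0 3 1 * *  /opt/atp/scripts/execute-infrastructure-experiment.sh node-failure staging
0 4 1 * *  /opt/atp/scripts/execute-application-experiment.sh cpu-stress atp-ingestion-api staging
0 5 1 * *  /opt/atp/scripts/execute-data-experiment.sh db-failover staging
0 6 1 * *  /opt/atp/scripts/execute-data-experiment.sh queue-pause staging
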

Experiment Catalog Usage:

#!/bin/bash
# scripts/list-experiment-catalog.sh

CATEGORY="${1:-all}"  # infrastructure, application, data, all

echo "📋 Chaos Experiment Catalog"
echo "Category: ${CATEGORY}"
echo ""

case ${CATEGORY} in
  infrastructure)
    echo "Infrastructure Experiments:"
    echo "  1. pod-kill - Weekly, 5min, 1 pod"
    echo "  2. node-failure - Monthly, 15min, 1 node"
    echo "  3. az-failure - Quarterly, 30min, 1 AZ"
    echo "  4. region-failure - Annually, 1h, 1 region"
    ;;

  application)
    echo "Application Experiments:"
    echo "  1. service-crash - Weekly, 5min, 1 service"
    echo "  2. latency-injection - Weekly, 10min, 25% traffic"
    echo "  3. error-injection - Weekly, 5min, 10% requests"
    echo "  4. cpu-stress - Monthly, 15min, 1 deployment"
    ;;

  data)
    echo "Data Experiments:"
    echo "  1. db-failover - Monthly, 5min, Read-only mode"
    echo "  2. cache-failure - Weekly, 10min, 1 cache instance"
    echo "  3. queue-pause - Monthly, 15min, 1 topic"
    ;;

  all)
    echo "Infrastructure Experiments:"
    echo "  1. pod-kill - Weekly, 5min, 1 pod"
    echo "  2. node-failure - Monthly, 15min, 1 node"
    echo "  3. az-failure - Quarterly, 30min, 1 AZ"
    echo "  4. region-failure - Annually, 1h, 1 region"
    echo ""
    echo "Application Experiments:"
    echo "  1. service-crash - Weekly, 5min, 1 service"
    echo "  2. latency-injection - Weekly, 10min, 25% traffic"
    echo "  3. error-injection - Weekly, 5min, 10% requests"
    echo "  4. cpu-stress - Monthly, 15min, 1 deployment"
    echo ""
    echo "Data Experiments:"
    echo "  1. db-failover - Monthly, 5min, Read-only mode"
    echo "  2. cache-failure - Weekly, 10min, 1 cache instance"
    echo "  3. queue-pause - Monthly, 15min, 1 topic"
    ;;

  *)
    echo "Unknown category: ${CATEGORY}"
    exit 1
    ;;
esac

echo ""

Summary: Chaos Experiment Catalog

  • Infrastructure Experiments: Four infrastructure chaos experiments including Pod Kill (weekly, 5min, 1 pod), Node Failure (monthly, 15min, 1 node), AZ Failure (quarterly, 30min, 1 AZ), and Region Failure (annually, 1h, 1 region); each with standardized hypothesis, duration, frequency, and blast radius specifications
  • Application Experiments: Four application chaos experiments including Service Crash (weekly, 5min, 1 service), Latency Injection (weekly, 10min, 25% traffic), Error Injection (weekly, 5min, 10% requests), and CPU Stress (monthly, 15min, 1 deployment); validating circuit breakers, retries, timeouts, and autoscaling
  • Data Experiments: Three data chaos experiments including DB Failover (monthly, 5min, read-only mode), Cache Failure (weekly, 10min, 1 cache instance), and Queue Pause (monthly, 15min, 1 topic); validating database resilience, cache fallback, and message queue backpressure
  • Experiment Catalog Tools: Execution scripts for each category (infrastructure, application, data), catalog listing script, experiment schedule overview table, and Mermaid diagram visualization; with standardized experiment specifications for easy reference and experiment planning