Chaos Engineering Drills - Audit Trail Platform (ATP)¶
Break it to validate it — ATP's chaos engineering drills systematically inject failures to validate resilience, test recovery procedures, and ensure SLO compliance under adverse conditions.
Purpose & Scope¶
This document defines chaos engineering and resilience testing procedures for the Audit Trail Platform (ATP). It establishes systematic fault injection experiments, disaster recovery drills, and SLO validation procedures to ensure ATP remains available, performant, and compliant under adverse conditions, including infrastructure failures, network partitions, resource exhaustion, and security incidents.
Key Chaos Engineering Principles
- Hypothesis-Driven: Formulate hypothesis about system behavior, test with experiments
- Production-Like: Test in staging with production-like load and configuration
- Blast Radius Control: Limit impact to percentage of traffic or specific tenants
- Observability: Monitor all metrics during experiments to detect issues
- Gradual Rollout: Start small (1% traffic), increase if stable
- Automated Rollback: Abort experiment if SLO violations detected
- Learn and Improve: Document findings, update runbooks, improve resilience
What this document covers
- Establish chaos engineering fundamentals: What it is, why ATP needs it, principles and methodology
- Define chaos experiment framework: Hypothesis, steady state, blast radius, rollback, validation
- Specify infrastructure chaos experiments: Pod failures, node failures, AZ outages, region failures
- Document application chaos experiments: Service crashes, dependency failures, latency injection, error injection
- Detail data and state experiments: Database failures, cache failures, message broker disruptions, storage outages
- Describe network chaos experiments: Network partitions, packet loss, latency, DNS failures
- Outline security chaos experiments: Authentication failures, authorization denials, encryption key unavailability
- Specify disaster recovery drills: Full region failover, cluster rebuilding, data restoration
- Document GameDays: Quarterly chaos exercises, multi-team coordination, incident response validation
- Detail chaos automation: Chaos Mesh, Litmus Chaos, custom fault injection, CI/CD integration
- Describe SLO validation: Ensure SLOs maintained during chaos, error budget tracking
- Outline reporting and improvement: Experiment reports, findings documentation, resilience improvements
Out of scope (referenced elsewhere)
- Zero-trust security architecture (see zero-trust.md)
- Incident response procedures (see ../operations/runbook.md)
- SLO and alert definitions (see ../operations/alerts-slos.md)
- Backup and restore procedures (see ../operations/backups-restore-ediscovery.md)
- Performance testing and load testing (not chaos-focused)
Readers & ownership
- SRE/Operations (owners): Chaos experiment design, execution, GameDay coordination
- Platform Engineering: Infrastructure resilience, automation, tooling
- Backend Developers: Application resilience, circuit breakers, retry logic
- Security Engineering: Security chaos experiments, attack simulation
- Architects: Resilience architecture, failure mode analysis
- Incident Response: DR drills, incident simulation, runbook validation
Artifacts produced
- Chaos Experiment Catalog: All experiments with hypotheses, procedures, validation criteria
- Experiment Runbooks: Step-by-step execution procedures for each experiment
- Chaos Automation Scripts: Chaos Mesh experiments, Litmus workflows, custom injectors
- GameDay Playbooks: Quarterly chaos exercise scenarios and coordination
- SLO Validation Reports: Experiment results showing SLO compliance during chaos
- Blast Radius Configurations: Traffic percentage, tenant selection, duration limits
- Rollback Procedures: Automated and manual rollback for each experiment
- Monitoring Dashboards: Real-time experiment monitoring, SLO tracking
- Incident Simulations: Security incidents, data breaches, ransomware scenarios
- DR Drill Reports: Disaster recovery validation results, RTO/RPO measurement
- Findings Documentation: Lessons learned, resilience gaps, improvement actions
- Resilience Scorecard: System resilience metrics, improvement tracking over time
Chaos Engineering Fundamentals¶
Purpose: Establish the foundational understanding of chaos engineering, its principles, methodology, and strategic application within ATP to build systematic confidence in system resilience through controlled failure injection.
What is Chaos Engineering?¶
Definition
Chaos Engineering is the discipline of experimenting on distributed systems in production (or production-like environments) to discover weaknesses and vulnerabilities before they manifest as customer-facing incidents. It is a systematic, proactive approach to building confidence in system resilience by intentionally injecting failures and observing system behavior under adverse conditions.
Core Concept
Traditional testing validates that systems work under ideal conditions. Chaos Engineering validates that systems work when things go wrong. Instead of waiting for production failures to reveal weaknesses, chaos engineers proactively introduce controlled failures to:
- Discover unknown failure modes that don't appear in unit or integration tests
- Validate resilience mechanisms (circuit breakers, retries, failovers) actually work
- Test incident response procedures under realistic pressure
- Build confidence that the system can handle real-world failures
Historical Context
Chaos Engineering originated at Netflix in 2010 with the development of Chaos Monkey, a tool that randomly terminated production instances to ensure services could handle instance failures. This evolved into the Simian Army (Chaos Gorilla, Latency Monkey, Conformity Monkey, etc.) and established chaos engineering as a discipline.
Since then, chaos engineering has been adopted by organizations worldwide, including:
- Amazon Web Services: Chaos engineering for infrastructure resilience
- Microsoft: Azure Chaos Studio for cloud resilience
- Google: Site Reliability Engineering (SRE) chaos testing practices
- LinkedIn: Chaos testing for distributed systems
Proactive vs Reactive Approach
| Approach | Traditional Testing | Chaos Engineering |
|---|---|---|
| Timing | Before deployment | During operation |
| Focus | Does it work? | Does it still work when things break? |
| Discovery | Known failure modes | Unknown failure modes |
| Mindset | Reactive (fix after incident) | Proactive (prevent incidents) |
| Outcome | System works in ideal conditions | System works in adverse conditions |
Key Insight
"If something can go wrong, it will go wrong—chaos engineering helps you find it before your customers do."
Why Chaos Engineering for ATP?¶
High Availability Requirements
ATP has a 99.9% uptime SLA (less than 8.76 hours of downtime per year). This requires:
- Resilient architecture: System must handle failures gracefully
- Validated recovery: Failover and recovery mechanisms must be proven to work
- Minimal impact: Failures must not cascade across services
- Fast recovery: System must recover quickly from failures
Chaos engineering validates that ATP can meet these requirements under real-world failure conditions.
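As a sanity check on the SLA arithmetic, the downtime budget implied by an availability target can be computed directly. A quick illustrative sketch (not part of ATP's tooling; the function name is hypothetical):

```python
def downtime_budget_hours(availability: float, hours_per_year: float = 365 * 24) -> float:
    """Annual downtime allowed by an availability target given as a fraction (e.g. 0.999)."""
    return (1.0 - availability) * hours_per_year

# downtime_budget_hours(0.999) -> 8.76 hours/year (rounded), matching the 99.9% SLA figure
```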
Compliance and Audit Requirements
As an audit trail platform, ATP must maintain:
- Data availability: Audit logs must remain accessible even during failures
- Data integrity: No data loss during failures or recovery
- Audit compliance: SOC 2, GDPR, HIPAA require evidence of resilience testing
- Continuous operation: Audit data ingestion cannot be interrupted
Chaos engineering provides evidence that ATP maintains compliance during adverse conditions.
Multi-Tenancy Isolation
ATP serves multiple tenants with strict isolation requirements:
- Tenant isolation: Failures affecting one tenant must not impact others
- Resource isolation: Resource exhaustion in one tenant must not affect others
- Data isolation: Tenant data must remain isolated during failures
- SLA per tenant: Each tenant has independent SLA commitments
Chaos engineering validates that tenant isolation holds under failure conditions.
Distributed System Complexity
ATP is a microservices architecture with many failure modes:
- Service-to-service communication: Network partitions, latency, failures
- Service discovery: DNS failures, service registry unavailability
- Load balancing: Traffic routing during failures
- State management: Distributed state consistency during failures
- Cascading failures: One service failure causing others to fail
Chaos engineering systematically tests these failure modes to prevent cascading failures.
Complex Dependencies
ATP depends on multiple Azure services with their own failure modes:
- Azure Kubernetes Service (AKS): Node failures, pod scheduling, cluster failures
- Azure SQL Database: Failover, connection pool exhaustion, query timeouts
- Azure Service Bus: Topic unavailability, message broker failures
- Azure Key Vault: Secret access failures, certificate expiration
- Azure Blob Storage: Storage unavailability, replication lag
- Azure Active Directory: Authentication failures, token refresh issues
Chaos engineering validates ATP's resilience to dependency failures.
Incident Preparedness
Chaos engineering validates incident response procedures:
- Runbook validation: Do runbooks work under pressure?
- Team coordination: Can teams respond effectively during incidents?
- Communication: Is incident communication clear and timely?
- Escalation: Do escalation paths work correctly?
- Recovery procedures: Can the team recover within SLA targets?
Regular chaos experiments ensure teams are prepared for real incidents.
Business Value
Chaos engineering delivers measurable business value:
- Reduced incidents: Find and fix issues before customers experience them
- Faster recovery: Validated procedures enable faster incident resolution
- Customer confidence: Proactive resilience testing builds trust
- Cost savings: Preventing incidents is cheaper than fixing them
- Compliance: Evidence of resilience testing supports compliance audits
Chaos Engineering Principles¶
The Principles of Chaos Engineering (established by Netflix) provide a framework for conducting safe, effective chaos experiments.
Principle 1: Build a Hypothesis Around Steady State¶
Definition
Before injecting failures, define what "normal" looks like. This is the steady state—the observable behavior of the system when operating correctly. Formulate a hypothesis about how the system will behave during the experiment.
Steady State Metrics
Steady state is defined by observable metrics that indicate normal operation:
- Performance metrics: Latency (P50, P95, P99), throughput, response times
- Reliability metrics: Error rates, success rates, availability
- Resource metrics: CPU utilization, memory usage, network I/O
- Business metrics: Transaction rates, revenue impact, user activity
Hypothesis Format
A hypothesis states the expected behavior during the experiment:
"When [failure condition], the system will [expected behavior], and steady state metrics will [expected metric values]."
Example Hypotheses
Example 1: Pod Failure
Hypothesis: "When 1 ingestion API pod crashes (out of 5 pods),
the system will remain available, request success rate will stay >99.9%,
and P95 latency will increase by <100ms."
Example 2: Database Failover
Hypothesis: "When Azure SQL primary database fails over to replica,
the system will automatically reconnect, no requests will fail,
and recovery time will be <30 seconds."
Example 3: Network Partition
Hypothesis: "When network partition isolates ingestion service from query service,
ingestion will continue accepting events, query service will degrade gracefully
(read-only mode), and no data will be lost."
Steady State Definition Example
# examples/steady-state-definition.yaml
apiVersion: chaos.atp.connectsoft.io/v1alpha1
kind: SteadyState
metadata:
  name: atp-ingestion-api-steady-state
  namespace: chaos-testing
spec:
  service: atp-ingestion-api
  metrics:
    performance:
      - name: p95_latency_ms
        threshold: 200 # ms
        operator: "<"
      - name: p99_latency_ms
        threshold: 500 # ms
        operator: "<"
      - name: throughput_events_per_sec
        threshold: 10000
        operator: ">="
    reliability:
      - name: request_success_rate
        threshold: 99.9 # %
        operator: ">="
      - name: error_rate
        threshold: 0.1 # %
        operator: "<="
    resource:
      - name: cpu_utilization_percent
        threshold: 80 # %
        operator: "<"
      - name: memory_utilization_percent
        threshold: 85 # %
        operator: "<"
    business:
      - name: events_ingested_per_minute
        threshold: 600000
        operator: ">="
      - name: tenant_isolation_maintained
        threshold: true
        operator: "=="
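The threshold checks above can be evaluated generically: each metric has an operator and a limit, and steady state holds when every comparison passes. A minimal sketch (the metric names and operator strings mirror the YAML; the function itself is illustrative, not ATP's actual validator):

```python
import operator

# Operator strings as used in the steady-state YAML
OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt,
       ">=": operator.ge, "==": operator.eq}

def steady_state_holds(thresholds, measured):
    """Compare measured metrics against steady-state thresholds.

    thresholds: {metric_name: (operator_string, limit)}
    measured:   {metric_name: observed_value}
    Returns (ok, list_of_violated_metric_names).
    """
    violations = [name for name, (op, limit) in thresholds.items()
                  if not OPS[op](measured[name], limit)]
    return (not violations, violations)
```

The same check can drive both pre-experiment baseline validation and in-experiment abort decisions.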
Principle 2: Vary Real-World Events¶
Definition
Chaos experiments should simulate realistic failures that occur in production, not artificial or unrealistic scenarios. Base experiments on:
- Historical incidents: What failures have happened before?
- Common failure modes: What failures are most likely?
- Dependency failures: How do external dependencies fail?
- Infrastructure failures: What infrastructure failures occur?
Real-World Event Categories
| Category | Examples |
|---|---|
| Infrastructure | Pod crashes, node failures, AZ outages, region failures |
| Network | Network partitions, packet loss, latency spikes, DNS failures |
| Dependencies | Database failures, cache failures, message broker issues |
| Resource | CPU exhaustion, memory exhaustion, disk full |
| Application | Service crashes, slow responses, error injection |
| Security | Authentication failures, certificate expiration, key unavailability |
Avoid Artificial Scenarios
❌ Bad: "Kill all pods simultaneously"
✅ Good: "Kill 1 pod at a time, observe recovery"
❌ Bad: "Disconnect all network connections"
✅ Good: "Partition network between two services"
❌ Bad: "Set latency to 24 hours"
✅ Good: "Set latency to 500ms (realistic network delay)"
Historical Incident Analysis
Use incident post-mortems to identify chaos experiment scenarios:
## Example: Incident-Driven Chaos Experiment
**Incident**: On 2024-01-15, Azure SQL failover caused 2-minute downtime
**Root Cause**: Application didn't handle connection pool exhaustion during failover
**Chaos Experiment Created**:
- **Hypothesis**: "When Azure SQL fails over, application will retry connections and recover within 30 seconds"
- **Experiment**: Simulate database failover, monitor connection pool behavior
- **Validation**: Verify connection retry logic, measure recovery time
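The connection retry logic that this experiment validates typically follows an exponential-backoff-with-jitter pattern, so that a failover window of a few tens of seconds is ridden out instead of exhausting the pool. A hedged sketch of the general shape (illustrative only, not ATP's production code; `connect` is any caller-supplied connection factory):

```python
import random
import time

def connect_with_backoff(connect, attempts=5, base_delay=0.5, max_delay=8.0):
    """Call a connection factory, retrying on ConnectionError with
    exponential backoff and jitter."""
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # retry budget exhausted; surface the failure
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Jitter spreads reconnects out, avoiding a thundering herd on failover
            time.sleep(delay * random.uniform(0.5, 1.0))
```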
Principle 3: Run Experiments in Production (or Production-Like)¶
Definition
Chaos experiments must run in production-like environments to be meaningful. Testing in development or QA environments doesn't validate real-world resilience because:
- Different configurations: Dev/test environments differ from production
- Different load: Production load patterns aren't replicated
- Different data: Production data volumes and patterns matter
- Different dependencies: Production dependency configurations differ
ATP Strategy: Staging Environment
ATP runs chaos experiments in the staging environment, which is:
- Production-like: Same configuration, same architecture, same scale
- Realistic data: Production-scale data volumes and patterns
- Isolated: No customer impact, safe for experimentation
- Controlled: Blast radius limits prevent cascading failures
Production Experiments
After staging validation, limited production experiments may be conducted:
- Small blast radius: 1% of traffic, specific test tenants only
- Short duration: 30 seconds to 5 minutes
- Gradual rollout: Start small, increase if stable
- Automated rollback: Immediate abort on SLO violations
- Approval required: CAB approval for production chaos experiments
Environment Comparison
| Aspect | Development | Staging | Production |
|---|---|---|---|
| Chaos Experiments | ❌ No (different config) | ✅ Yes (primary) | ⚠️ Limited (after staging validation) |
| Blast Radius | N/A | Controlled | Minimal (1%) |
| Duration | N/A | 5 min - 1 hour | 30 sec - 5 min |
| Approval | N/A | SRE Team | CAB Required |
Principle 4: Automate Experiments to Run Continuously¶
Definition
Chaos experiments should be automated and run continuously to:
- Prevent regressions: Detect when resilience is degraded
- Continuous validation: Ensure resilience improvements are maintained
- Reduce manual effort: Automate routine experiments
- Scale testing: Run more experiments more frequently
Automation Levels
| Level | Description | ATP Usage |
|---|---|---|
| Level 1: Manual | Execute experiments manually | Initial development, GameDays |
| Level 2: Scheduled | Cron-based execution | Weekly automated experiments |
| Level 3: CI/CD Integrated | Part of deployment pipeline | Staging pipeline resilience tests |
| Level 4: Continuous | Always-on low-level chaos | 1% traffic continuous chaos |
CI/CD Integration Example
# azure-pipelines/chaos-tests.yaml
trigger:
  branches:
    include:
      - main
      - staging
stages:
  - stage: ChaosResilienceTests
    displayName: 'Chaos Resilience Tests'
    jobs:
      - job: PodFailureTest
        displayName: 'Pod Failure Resilience Test'
        steps:
          - task: Kubernetes@1
            inputs:
              connectionType: 'Kubernetes Service Connection'
              kubernetesServiceEndpoint: 'atp-staging-aks'
              namespace: 'chaos-testing'
              command: 'apply'
              arguments: '-f chaos-experiments/pod-failure-test.yaml'
          - script: |
              # Wait for experiment to complete
              kubectl wait --for=condition=complete \
                chaos/pod-failure-test -n chaos-testing --timeout=10m
              # Validate results
              ./scripts/validate-chaos-results.sh pod-failure-test
            displayName: 'Execute and Validate Pod Failure Test'
          - task: PublishTestResults@2
            inputs:
              testResultsFormat: 'JUnit'
              testResultsFiles: 'chaos-results/*.xml'
Continuous Chaos
Continuous chaos runs low-level experiments continuously:
# chaos-experiments/continuous-pod-failure.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: continuous-pod-failure
  namespace: chaos-testing
spec:
  action: pod-kill
  mode: fixed-percent
  value: "1" # 1% of pods
  selector:
    namespaces:
      - atp-ingest-ns
    labelSelectors:
      app: atp-ingest-api
  duration: "30s"
  scheduler:
    cron: "@every 1h" # Run every hour
  # Note: abortRules is not a native Chaos Mesh field; it is evaluated by
  # ATP's chaos watchdog, which removes the experiment when a rule fires
  abortRules:
    - name: error-rate-threshold
      condition: error_rate > 0.5%
      action: abort
Principle 5: Minimize Blast Radius¶
Definition
Blast radius is the scope of impact of a chaos experiment. Start with minimal blast radius and increase gradually only if the system remains stable.
Blast Radius Dimensions
| Dimension | Options | ATP Default |
|---|---|---|
| Traffic Percentage | 1%, 5%, 10%, 25%, 50% | 1% (start), 5% (staging), 25% (GameDay) |
| Tenant Scope | All, specific tenants, test tenants | Test tenants only |
| Service Scope | All services, specific service, specific pod | Specific pod/service |
| Duration | 30s, 5min, 15min, 1hour | 5min (staging), 30s (production) |
| Geographic Scope | All regions, single region, single AZ | Single AZ (start) |
Gradual Rollout Strategy
graph LR
START[Start Experiment] --> VAL1[1% Traffic<br/>30 seconds]
VAL1 --> CHECK1{Stable?}
CHECK1 -->|Yes| VAL2[5% Traffic<br/>5 minutes]
CHECK1 -->|No| ABORT[Abort Experiment]
VAL2 --> CHECK2{Stable?}
CHECK2 -->|Yes| VAL3[10% Traffic<br/>15 minutes]
CHECK2 -->|No| ABORT
VAL3 --> CHECK3{Stable?}
CHECK3 -->|Yes| COMPLETE[Experiment Complete]
CHECK3 -->|No| ABORT
style START fill:#FFE5B4
style COMPLETE fill:#90EE90
style ABORT fill:#FFB6C1
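The rollout loop in the diagram reduces to a simple control structure: run each stage, check stability, widen only on success. An illustrative sketch (stage execution and the stability check are assumed to be provided by the chaos tooling):

```python
def gradual_rollout(increments, run_stage):
    """Walk blast-radius increments in order, aborting at the first unstable stage.

    increments: traffic percentages, e.g. [1, 5, 10, 25]
    run_stage:  callable(percent) -> True if steady state held at that stage
    Returns ("complete", last_percent) or ("aborted", failing_percent).
    """
    for percent in increments:
        if not run_stage(percent):
            return ("aborted", percent)  # stop immediately, never widen further
    return ("complete", increments[-1])
```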
Blast Radius Configuration Example
# examples/blast-radius-config.yaml
apiVersion: chaos.atp.connectsoft.io/v1alpha1
kind: BlastRadius
metadata:
  name: pod-failure-blast-radius
  namespace: chaos-testing
spec:
  experiment: pod-failure-test
  traffic:
    percentage: 1 # Start with 1%
    gradualIncrease: true
    increments: [1, 5, 10, 25] # Gradual rollout percentages
  tenant:
    scope: test-tenants-only # Only affect test tenants
    tenants:
      - test-tenant-001
      - test-tenant-002
  service:
    scope: single-pod
    namespace: atp-ingest-ns
    labelSelector:
      app: atp-ingest-api
  duration:
    initial: 30s # Start with 30 seconds
    max: 5m # Maximum 5 minutes
  autoAbort:
    enabled: true
    triggers:
      - metric: error_rate
        threshold: 1.0 # %
        operator: ">"
      - metric: p95_latency_ms
        threshold: 500 # ms
        operator: ">"
      - metric: request_success_rate
        threshold: 99.0 # %
        operator: "<"
Automatic Abort Criteria
Experiments automatically abort when:
- Error rate exceeds threshold: >1% for staging, >0.5% for production
- Latency degrades: P95 latency >500ms (staging), >300ms (production)
- Success rate drops: Request success rate <99% (staging), <99.9% (production)
- SLO violations: Any SLO violation detected
- Manual abort: On-call engineer triggers manual abort
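These criteria can be checked on every metrics poll. The sketch below mirrors the trigger format from the blast-radius configuration above (illustrative only; the function name is hypothetical):

```python
def fired_abort_triggers(metrics, triggers):
    """Return the names of abort triggers whose condition currently holds.

    triggers: list of {"metric": str, "operator": ">" or "<", "threshold": number}
    metrics:  {metric_name: current_value}
    """
    fired = []
    for t in triggers:
        value = metrics[t["metric"]]
        if t["operator"] == ">" and value > t["threshold"]:
            fired.append(t["metric"])
        elif t["operator"] == "<" and value < t["threshold"]:
            fired.append(t["metric"])
    return fired  # non-empty result means the experiment should abort
```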
Chaos Engineering Methodology¶
The chaos engineering methodology provides a systematic 10-step process for conducting safe, effective chaos experiments.
Methodology Overview
graph TD
STEP1[1. Define Steady State] --> STEP2[2. Formulate Hypothesis]
STEP2 --> STEP3[3. Design Experiment]
STEP3 --> STEP4[4. Set Blast Radius]
STEP4 --> STEP5[5. Configure Rollback]
STEP5 --> STEP6[6. Execute Experiment]
STEP6 --> STEP7[7. Observe System]
STEP7 --> STEP8[8. Validate Hypothesis]
STEP8 --> STEP9[9. Document Findings]
STEP9 --> STEP10[10. Improve Resilience]
STEP10 --> STEP1
style STEP1 fill:#E3F2FD
style STEP6 fill:#FFF3E0
style STEP10 fill:#E8F5E9
Detailed Methodology Steps
Step 1: Define Steady State¶
Objective: Establish baseline metrics that represent normal system operation.
Activities:
- Identify key performance metrics (latency, throughput, error rate)
- Identify reliability metrics (availability, success rate)
- Identify resource metrics (CPU, memory, network)
- Identify business metrics (transactions, revenue impact)
- Set threshold values for each metric
- Measure baseline for 24-48 hours
Output: Steady state definition document with metrics and thresholds.
Step 2: Formulate Hypothesis¶
Objective: Predict how the system will behave during the experiment.
Activities:
- Define the failure condition to inject
- Predict system behavior (expected vs unexpected)
- Specify metric changes (e.g., "latency will increase by <100ms")
- Specify recovery behavior (e.g., "system will recover within 30 seconds")
- Define success criteria (what validates the hypothesis)
Output: Hypothesis statement with expected behavior and validation criteria.
Step 3: Design Experiment¶
Objective: Create the experiment specification.
Activities:
- Select chaos tool (Chaos Mesh, Litmus, custom)
- Define experiment type (pod kill, network partition, etc.)
- Specify target resources (pods, services, nodes)
- Define injection parameters (duration, frequency, intensity)
- Create experiment YAML/configuration
- Review experiment with team
Output: Experiment specification (YAML, configuration file).
Step 4: Set Blast Radius¶
Objective: Limit the scope of impact to minimize risk.
Activities:
- Set traffic percentage (start with 1%)
- Select tenant scope (test tenants only)
- Set duration (start with 30 seconds)
- Configure geographic scope (single AZ)
- Set service scope (single pod/service)
- Enable gradual rollout if appropriate
Output: Blast radius configuration.
Step 5: Configure Rollback¶
Objective: Define automatic abort criteria and rollback procedures.
Activities:
- Set automatic abort triggers (error rate, latency, SLO violations)
- Define rollback procedure (remove fault injection, restore state)
- Test rollback procedure
- Configure monitoring alerts for abort triggers
- Assign on-call engineer for manual abort capability
Output: Rollback configuration and procedures.
Step 6: Execute Experiment¶
Objective: Run the chaos experiment in staging environment.
Activities:
- Notify team (Slack, email)
- Start monitoring dashboards
- Apply experiment YAML/configuration
- Monitor experiment status
- Observe system behavior in real-time
- Be ready to abort if needed
Output: Experiment execution log.
Step 7: Observe System¶
Objective: Monitor all metrics, logs, and traces during the experiment.
Activities:
- Monitor performance metrics (latency, throughput)
- Monitor reliability metrics (error rate, success rate)
- Monitor resource metrics (CPU, memory, network)
- Review application logs for errors
- Review distributed traces for bottlenecks
- Monitor user impact (if applicable)
Output: Observation logs and metrics.
Step 8: Validate Hypothesis¶
Objective: Compare actual behavior to predicted behavior.
Activities:
- Compare metrics to hypothesis predictions
- Check if success criteria were met
- Identify unexpected behaviors
- Analyze root causes of any failures
- Measure recovery time and behavior
- Document deviations from hypothesis
Output: Hypothesis validation report.
Step 9: Document Findings¶
Objective: Record experiment results, findings, and lessons learned.
Activities:
- Document experiment configuration
- Document observed metrics and behavior
- Document hypothesis validation results
- Identify resilience gaps
- Identify unexpected behaviors
- Document lessons learned
- Create improvement actions
Output: Experiment findings document.
Step 10: Improve Resilience¶
Objective: Address identified resilience gaps and improve system resilience.
Activities:
- Prioritize improvement actions (impact × likelihood)
- Implement resilience improvements
- Update runbooks based on learnings
- Update monitoring and alerting
- Re-run experiment to validate improvements
- Share learnings with team
Output: Resilience improvements implemented.
Chaos vs Other Testing¶
Chaos engineering complements other testing approaches but serves a unique purpose: validating resilience under failure conditions.
Testing Type Comparison
| Testing Type | Purpose | When | ATP Usage | What It Validates |
|---|---|---|---|---|
| Unit Tests | Validate individual components | Every build | CI/CD | Component correctness |
| Integration Tests | Validate service interactions | Every deployment | CI/CD | Service integration |
| Load Tests | Validate performance under load | Pre-release, Monthly | Performance testing | Performance at scale |
| Chaos Tests | Validate resilience under failures | Quarterly GameDays, Continuous | Resilience validation | Failure handling |
| Penetration Tests | Validate security controls | Annually | External audit | Security vulnerabilities |
| DR Drills | Validate disaster recovery | Quarterly | Scheduled drills | Recovery procedures |
Testing Pyramid with Chaos
graph TD
PYRAMID[Testing Pyramid]
UNIT[Unit Tests<br/>1000s of tests<br/>Fast, isolated]
INTEGRATION[Integration Tests<br/>100s of tests<br/>Service interactions]
LOAD[Load Tests<br/>10s of tests<br/>Performance validation]
CHAOS[Chaos Tests<br/>10s of experiments<br/>Resilience validation]
DR[DR Drills<br/>Quarterly<br/>Full recovery validation]
PYRAMID --> UNIT
PYRAMID --> INTEGRATION
PYRAMID --> LOAD
PYRAMID --> CHAOS
PYRAMID --> DR
style UNIT fill:#90EE90
style INTEGRATION fill:#87CEEB
style LOAD fill:#FFD700
style CHAOS fill:#FFA500
style DR fill:#FF6347
What Each Testing Type Catches
| Issue Type | Unit Tests | Integration Tests | Load Tests | Chaos Tests | DR Drills |
|---|---|---|---|---|---|
| Logic errors | ✅ | ✅ | ❌ | ❌ | ❌ |
| API contract violations | ❌ | ✅ | ❌ | ❌ | ❌ |
| Performance degradation | ❌ | ❌ | ✅ | ⚠️ | ❌ |
| Failure handling | ❌ | ⚠️ | ❌ | ✅ | ✅ |
| Cascading failures | ❌ | ❌ | ❌ | ✅ | ✅ |
| Recovery procedures | ❌ | ❌ | ❌ | ⚠️ | ✅ |
| Resource exhaustion | ❌ | ❌ | ⚠️ | ✅ | ❌ |
| Network issues | ❌ | ⚠️ | ❌ | ✅ | ⚠️ |
Example: Testing a New Feature
## Example: Testing New Circuit Breaker Feature
### 1. Unit Tests (CI/CD)
- ✅ Test circuit breaker state transitions (Closed → Open → Half-Open)
- ✅ Test threshold calculations
- ✅ Test timeout handling
### 2. Integration Tests (CI/CD)
- ✅ Test circuit breaker with real HTTP client
- ✅ Test circuit breaker with database connection
- ✅ Test fallback behavior
### 3. Load Tests (Pre-release)
- ✅ Validate circuit breaker under high load
- ✅ Measure performance impact
### 4. Chaos Tests (Quarterly)
- ✅ Validate circuit breaker opens during dependency failure
- ✅ Validate circuit breaker prevents cascading failures
- ✅ Validate recovery when dependency recovers
### 5. DR Drills (Quarterly)
- ✅ Validate circuit breaker behavior during regional failover
- ✅ Validate circuit breaker with database failover
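For reference, the Closed → Open → Half-Open behavior exercised by the unit tests above can be captured in a few lines. A minimal illustrative breaker (not ATP's implementation; thresholds and names are placeholders):

```python
import time

class CircuitBreaker:
    """Minimal Closed -> Open -> Half-Open circuit breaker sketch."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout  # seconds before probing again
        self.clock = clock                  # injectable for deterministic tests
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == "open":
            if self.clock() - self.opened_at >= self.reset_timeout:
                self.state = "half-open"  # allow one probe through
            else:
                raise RuntimeError("circuit open")  # fail fast, protect dependency
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # A failed probe, or too many consecutive failures, opens the circuit
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.state = "closed"  # successful call closes the circuit
        return result
```

Injecting the clock makes the state transitions unit-testable without real waiting, which is exactly what the chaos experiment later validates end-to-end.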
ATP Chaos Engineering Strategy¶
ATP employs a multi-layered chaos engineering strategy combining continuous automated experiments with periodic large-scale exercises.
Strategy Layers
graph TB
STRATEGY[ATP Chaos Engineering Strategy]
CONTINUOUS[Continuous Chaos<br/>Always-on, 1% traffic<br/>Automated, CI/CD integrated]
SCHEDULED[Scheduled Experiments<br/>Weekly, Monthly<br/>Automated, specific scenarios]
GAMEDAYS[Quarterly GameDays<br/>Large-scale, multi-team<br/>Manual, coordinated]
DR[Annual DR Drills<br/>Full region failover<br/>Manual, comprehensive]
ADHOC[Ad-Hoc Experiments<br/>Investigation, validation<br/>Manual, targeted]
STRATEGY --> CONTINUOUS
STRATEGY --> SCHEDULED
STRATEGY --> GAMEDAYS
STRATEGY --> DR
STRATEGY --> ADHOC
style CONTINUOUS fill:#90EE90
style SCHEDULED fill:#87CEEB
style GAMEDAYS fill:#FFD700
style DR fill:#FF6347
style ADHOC fill:#DDA0DD
Layer 1: Continuous Chaos
Purpose: Detect resilience regressions early through always-on low-level chaos.
Characteristics:
- Frequency: Continuous (24/7)
- Scope: 1% of traffic, test tenants only
- Duration: 30 seconds per experiment
- Automation: Fully automated, no human intervention
- Examples: Random pod kills, low-level latency injection
Benefits:
- Early detection of resilience issues
- Continuous validation of improvements
- Minimal overhead (1% traffic impact)
Configuration:
# continuous-chaos-config.yaml
apiVersion: chaos.atp.connectsoft.io/v1alpha1
kind: ContinuousChaos
metadata:
  name: atp-continuous-chaos
  namespace: chaos-testing
spec:
  enabled: true
  blastRadius:
    trafficPercentage: 1
    tenantScope: test-tenants-only
  experiments:
    - name: random-pod-kill
      frequency: "@every 1h"
      duration: "30s"
    - name: low-latency-injection
      frequency: "@every 2h"
      duration: "1m"
      latency: "100ms"
  autoAbort:
    enabled: true
    errorRateThreshold: 0.5%
Layer 2: Scheduled Experiments
Purpose: Validate specific resilience scenarios on a regular schedule.
Characteristics:
- Frequency: Weekly (automated), Monthly (manual)
- Scope: 5-10% of traffic, staging environment
- Duration: 5-15 minutes
- Automation: Automated execution, manual review
- Examples: Database failover, cache failure, network partition
Schedule:
| Frequency | Experiment Type | Examples |
|---|---|---|
| Daily | Basic resilience | Pod kill, container restart |
| Weekly | Dependency failures | Database failover, cache failure |
| Monthly | Infrastructure failures | Node failure, AZ failure |
| Quarterly | Complex scenarios | Multiple simultaneous failures |
Layer 3: Quarterly GameDays
Purpose: Large-scale coordinated chaos exercises involving multiple teams.
Characteristics:
- Frequency: Quarterly (4 per year)
- Scope: 25-50% of traffic, staging environment
- Duration: 4 hours (1h prep, 2h chaos, 1h retrospective)
- Automation: Manual coordination, automated experiments
- Participants: All engineering teams, SRE, Security, Operations
GameDay Structure: See Topic 13 for detailed GameDay procedures.
Layer 4: Annual DR Drills
Purpose: Full disaster recovery validation including regional failover.
Characteristics:
- Frequency: Annually (once per year)
- Scope: Full region failover, production-like
- Duration: Full day (8 hours)
- Automation: Manual coordination, automated failover
- Participants: All teams, leadership, compliance
DR Drill Structure: See Topic 11 for detailed DR drill procedures.
Layer 5: Ad-Hoc Experiments
Purpose: Investigate specific resilience concerns or validate fixes.
Characteristics:
- Frequency: As needed
- Scope: Targeted, specific scenario
- Duration: Variable (30 minutes to 2 hours)
- Automation: Manual or automated
- Trigger: Incident investigation, new feature validation, fix verification
Ad-Hoc Experiment Examples:
- Post-Incident: "Why did the system fail during the incident? Let's reproduce it."
- Feature Validation: "Does the new circuit breaker work correctly? Let's test it."
- Fix Verification: "We fixed the connection pool issue. Let's verify it's fixed."
Chaos Experiment Template¶
Template Structure
# templates/chaos-experiment-template.yaml
apiVersion: chaos.atp.connectsoft.io/v1alpha1
kind: ChaosExperiment
metadata:
name: EXPERIMENT_NAME
namespace: chaos-testing
labels:
category: infrastructure|application|data|network|security
severity: low|medium|high
frequency: continuous|weekly|monthly|quarterly|adhoc
spec:
# Experiment Metadata
description: "Brief description of the experiment"
hypothesis: |
When [failure condition],
the system will [expected behavior],
and metrics will [expected values].
# Steady State Definition
steadyState:
metrics:
- name: request_success_rate
threshold: 99.9
operator: ">="
- name: p95_latency_ms
threshold: 200
operator: "<"
# Add more metrics...
# Blast Radius
blastRadius:
trafficPercentage: 1
tenantScope: test-tenants-only
duration: "5m"
serviceScope:
namespace: atp-ingest-ns
labelSelector:
app: atp-ingest-api
# Chaos Injection
injection:
type: pod-kill|network-partition|latency|error|resource
config:
# Experiment-specific configuration
action: pod-kill
mode: one
# Rollback Configuration
rollback:
autoAbort:
enabled: true
triggers:
- metric: error_rate
threshold: 1.0
operator: ">"
- metric: p95_latency_ms
threshold: 500
operator: ">"
# Validation Criteria
validation:
successCriteria:
- metric: request_success_rate
threshold: 99.9
operator: ">="
- metric: recovery_time_seconds
threshold: 30
operator: "<="
# Scheduling
schedule:
frequency: weekly
cron: "0 2 * * 1" # Monday 2 AM
# Reporting
reporting:
notifyChannels:
- slack: "#atp-chaos"
- email: "sre-team@connectsoft.example"
generateReport: true
Hypothesis Formulation Template
## Hypothesis Template
### Experiment: [EXPERIMENT_NAME]
**Hypothesis**:
When [failure condition], the system will [expected behavior],
and the following metrics will remain within acceptable ranges:
- [metric1]: [expected value/range]
- [metric2]: [expected value/range]
- [metric3]: [expected value/range]
**Failure Condition**:
- [Description of what will be broken]
**Expected Behavior**:
- [How the system should respond]
- [Recovery mechanism]
- [Degradation mode (if any)]
**Success Criteria**:
- ✅ [Criterion 1]
- ✅ [Criterion 2]
- ✅ [Criterion 3]
**Failure Criteria** (auto-abort triggers):
- ❌ [Criterion 1]
- ❌ [Criterion 2]
Example: Complete Experiment Definition
# examples/pod-failure-experiment.yaml
apiVersion: chaos.atp.connectsoft.io/v1alpha1
kind: ChaosExperiment
metadata:
name: ingestion-api-pod-failure
namespace: chaos-testing
labels:
category: infrastructure
severity: low
frequency: weekly
spec:
description: "Validate ingestion API resilience to pod failures"
hypothesis: |
When 1 ingestion API pod crashes (out of 5 pods),
the system will remain available, request success rate will stay >99.9%,
P95 latency will increase by <100ms, and the pod will be restarted within 30 seconds.
steadyState:
metrics:
- name: request_success_rate
threshold: 99.9
operator: ">="
- name: p95_latency_ms
baseline: 150 # ms
threshold: 250 # ms (150 + 100)
operator: "<"
- name: pod_restart_time_seconds
threshold: 30
operator: "<="
blastRadius:
trafficPercentage: 5
tenantScope: test-tenants-only
duration: "5m"
serviceScope:
namespace: atp-ingest-ns
labelSelector:
app: atp-ingest-api
injection:
type: pod-kill
config:
action: pod-kill
mode: one # Kill one pod
selector:
namespaces:
- atp-ingest-ns
labelSelectors:
app: atp-ingest-api
rollback:
autoAbort:
enabled: true
triggers:
- metric: error_rate
threshold: 1.0
operator: ">"
- metric: p95_latency_ms
threshold: 500
operator: ">"
validation:
successCriteria:
- metric: request_success_rate
threshold: 99.9
operator: ">="
duration: "5m" # Must maintain for 5 minutes
- metric: pod_restart_time_seconds
threshold: 30
operator: "<="
schedule:
frequency: weekly
cron: "0 2 * * 1" # Monday 2 AM
reporting:
notifyChannels:
- slack: "#atp-chaos"
generateReport: true
Summary: Chaos Engineering Fundamentals¶
- What is Chaos Engineering: Discipline of experimenting on distributed systems to discover weaknesses proactively; originated at Netflix in 2010 with Chaos Monkey
- Why Chaos Engineering for ATP: High availability (99.9% SLA), compliance requirements, multi-tenancy isolation, distributed system complexity, complex dependencies, incident preparedness
- Chaos Engineering Principles: Build hypothesis around steady state, vary real-world events, run in production-like environments, automate continuously, minimize blast radius
- Chaos Engineering Methodology: 10-step systematic process (define steady state, formulate hypothesis, design experiment, set blast radius, configure rollback, execute, observe, validate, document, improve)
- Chaos vs Other Testing: Complements unit, integration, load, and penetration tests; unique focus on resilience under failure conditions
- ATP Chaos Engineering Strategy: Multi-layered approach with continuous chaos (1% traffic), scheduled experiments (weekly/monthly), quarterly GameDays, annual DR drills, and ad-hoc experiments
- Experiment Template: Comprehensive YAML template for defining chaos experiments with hypothesis, steady state, blast radius, injection, rollback, validation, and reporting
Chaos Experiment Framework¶
Purpose: Define the comprehensive framework for designing, executing, and managing chaos experiments in ATP, including experiment structure, lifecycle management, steady state definitions, blast radius controls, rollback automation, and safety measures to ensure safe, effective, and repeatable resilience testing.
Experiment Structure¶
The chaos experiment structure defines the standardized format for all chaos experiments in ATP, ensuring consistency, repeatability, and safety across all resilience testing activities.
Chaos Mesh Experiment Structure
ATP uses Chaos Mesh as the primary chaos engineering tool for Kubernetes-native fault injection. Chaos Mesh provides a declarative API for defining chaos experiments as Kubernetes Custom Resources (CRs).
Basic PodChaos Example
# chaos-experiments/pod-failure-basic.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
name: atp-ingestion-pod-failure
namespace: chaos-testing
labels:
experiment-type: infrastructure
service: atp-ingestion-api
severity: low
spec:
action: pod-kill
mode: one # Kill one pod
selector:
namespaces:
- atp-ingest-ns
labelSelectors:
app: atp-ingest-api
version: v1.2.3
duration: "30s"
scheduler:
cron: "@every 1h"
Experiment Metadata
The metadata section provides experiment identification and organization:
- name: Unique identifier for the experiment
- namespace: Isolation namespace (always chaos-testing for ATP)
- labels: Categorization for filtering and reporting
Common Labels:
labels:
category: infrastructure|application|data|network|security
service: atp-gateway|atp-ingestion-api|atp-query-api
environment: staging|production
severity: low|medium|high|critical
frequency: continuous|weekly|monthly|quarterly|adhoc
owner: sre-team|platform-team|backend-team
Experiment Spec Structure
The spec section defines what the experiment does:
| Field | Purpose | Examples |
|---|---|---|
| action | Type of fault to inject | pod-kill, pod-failure, container-kill |
| mode | Scope of injection | one, all, fixed, fixed-percent, random-max-percent |
| selector | Target resources | Namespaces, label selectors |
| duration | How long to inject | "30s", "5m", "1h" |
| scheduler | When to run | Cron expression or immediate |
Advanced Experiment Structure with ATP Extensions
ATP extends Chaos Mesh experiments with custom annotations and configurations for enhanced control:
# chaos-experiments/pod-failure-advanced.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
name: atp-ingestion-pod-failure-advanced
namespace: chaos-testing
labels:
category: infrastructure
service: atp-ingestion-api
severity: low
frequency: weekly
annotations:
chaos.atp.connectsoft.io/hypothesis: |
When 1 ingestion API pod crashes, the system will remain available,
request success rate will stay >99.9%, and P95 latency will increase by <100ms.
chaos.atp.connectsoft.io/blast-radius: "5%"
chaos.atp.connectsoft.io/auto-abort: "true"
chaos.atp.connectsoft.io/notify: "slack:#atp-chaos"
spec:
action: pod-kill
mode: one
selector:
namespaces:
- atp-ingest-ns
labelSelectors:
app: atp-ingest-api
annotationSelectors:
chaos.atp.connectsoft.io/include-in-chaos: "true"
duration: "5m"
# ATP-specific rollback triggers
abortRules:
- name: error-rate-threshold
condition: error_rate > 1.0%
action: abort
- name: latency-threshold
condition: p95_latency_ms > 500
action: abort
# Gradual rollout configuration
gradualRollout:
enabled: true
stages:
- percentage: 1
duration: "30s"
- percentage: 5
duration: "5m"
scheduler:
cron: "@every 1h"
# ATP-specific monitoring
monitoring:
enabled: true
metrics:
- name: request_success_rate
threshold: 99.9
operator: ">="
- name: p95_latency_ms
threshold: 200
operator: "<"
Experiment Types
ATP supports multiple chaos experiment types:
| Type | Chaos Mesh Kind | Use Case | Example |
|---|---|---|---|
| Pod Chaos | PodChaos | Pod failures, container restarts | Pod crashes, OOM kills |
| Network Chaos | NetworkChaos | Network issues | Partitions, latency, packet loss |
| IO Chaos | IOChaos | Storage issues | Disk latency, I/O errors |
| Stress Chaos | StressChaos | Resource exhaustion | CPU stress, memory stress |
| Time Chaos | TimeChaos | Clock skew | Time manipulation |
| Kernel Chaos | KernelChaos | Kernel-level faults | System call failures |
| HTTP Chaos | HTTPChaos | HTTP-level faults | Request faults, response delays |
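Each kind follows the same declarative shape as PodChaos. For instance, a NetworkChaos latency experiment might look like the following (a sketch; the service names and delay values are illustrative):

```yaml
# chaos-experiments/network-latency-example.yaml (illustrative)
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: atp-query-latency
  namespace: chaos-testing
  labels:
    category: network
    severity: low
spec:
  action: delay          # inject network latency
  mode: one              # affect one matching pod
  selector:
    namespaces:
      - atp-query-ns
    labelSelectors:
      app: atp-query-api
  delay:
    latency: "100ms"     # added latency
    jitter: "10ms"       # +/- variation per packet
  duration: "5m"
```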
Experiment Categories
Experiments are categorized for organization and reporting:
# Experiment category examples
categories:
infrastructure:
- pod-kill
- node-failure
- container-restart
application:
- service-crash
- latency-injection
- error-injection
data:
- database-failover
- cache-failure
- storage-unavailable
network:
- partition
- packet-loss
- dns-failure
security:
- auth-failure
- cert-expiration
- key-unavailable
Experiment Lifecycle¶
The experiment lifecycle defines the 9-step process for executing chaos experiments from setup to improvement, ensuring systematic, safe, and effective resilience testing.
Lifecycle Overview
graph TD
SETUP[1. Setup] --> BASELINE[2. Baseline]
BASELINE --> INJECT[3. Inject]
INJECT --> OBSERVE[4. Observe]
OBSERVE --> VALIDATE[5. Validate]
VALIDATE --> ROLLBACK[6. Rollback]
ROLLBACK --> ANALYZE[7. Analyze]
ANALYZE --> REPORT[8. Report]
REPORT --> IMPROVE[9. Improve]
IMPROVE --> SETUP
style SETUP fill:#E3F2FD
style INJECT fill:#FFF3E0
style VALIDATE fill:#F3E5F5
style IMPROVE fill:#E8F5E9
Lifecycle Phase Details
Phase 1: Setup¶
Objective: Prepare the experiment environment and define steady state metrics.
Activities:
- Define Steady State Metrics
- Create Experiment Configuration
- Review and Approve
  - Team review of experiment configuration
  - Validate blast radius settings
  - Confirm rollback procedures
  - Obtain necessary approvals
- Prepare Monitoring
  - Configure Grafana dashboards
  - Set up alerts for abort triggers
  - Prepare log aggregation queries
  - Test monitoring visibility
Output: Experiment configuration file, steady state definition, monitoring setup.
Setup Checklist:
## Experiment Setup Checklist
### Pre-Experiment
- [ ] Experiment YAML created and reviewed
- [ ] Steady state metrics defined
- [ ] Hypothesis documented
- [ ] Blast radius configured appropriately
- [ ] Rollback triggers configured
- [ ] Monitoring dashboards prepared
- [ ] Team notified (Slack, email)
- [ ] On-call engineer available
- [ ] Approval obtained (if required)
- [ ] Rollback procedure tested
Phase 2: Baseline¶
Objective: Establish baseline metrics before injecting chaos.
Activities:
- Measure Baseline Metrics
- Verify Steady State
  - Check all metrics within thresholds
  - Verify no active incidents
  - Confirm system health
  - Validate baseline period sufficient
- Document Baseline
Output: Baseline metrics document, steady state validation.
Baseline Collection Script:
#!/bin/bash
# scripts/collect-baseline-metrics.sh
set -euo pipefail
SERVICE="${1:?usage: collect-baseline-metrics.sh SERVICE [DURATION] [OUTPUT]}"
DURATION="${2:-24h}"   # hour-based durations, e.g. 24h
OUTPUT="${3:-baseline-${SERVICE}-$(date +%Y%m%d).json}"
echo "📊 Collecting baseline metrics for ${SERVICE} over ${DURATION}"
PROMETHEUS_URL="http://prometheus.monitoring.svc.cluster.local:9090"
# Prometheus range queries take query/start/end/step as URL parameters,
# not a JSON body, so issue one query_range call per metric.
END=$(date +%s)
START=$(( END - ${DURATION%h} * 3600 ))
STEP="1m"
QUERIES=(
  "rate(http_requests_total{service=\"${SERVICE}\"}[5m])"
  "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{service=\"${SERVICE}\"}[5m]))"
  "rate(http_requests_total{service=\"${SERVICE}\",status=~\"5..\"}[5m]) / rate(http_requests_total{service=\"${SERVICE}\"}[5m])"
)
for QUERY in "${QUERIES[@]}"; do
  curl -sG "${PROMETHEUS_URL}/api/v1/query_range" \
    --data-urlencode "query=${QUERY}" \
    --data-urlencode "start=${START}" \
    --data-urlencode "end=${END}" \
    --data-urlencode "step=${STEP}"
done | jq -s '.' > "${OUTPUT}"
echo "✅ Baseline metrics collected: ${OUTPUT}"
Phase 3: Inject¶
Objective: Apply fault injection to the target system.
Activities:
- Apply Experiment Configuration
- Verify Injection Started
- Confirm Fault Injected
  - Verify pod killed/affected
  - Check system impact
  - Monitor initial metrics
  - Validate blast radius active
Output: Experiment execution log, fault injection confirmation.
Injection Monitoring Script:
#!/bin/bash
# scripts/monitor-injection.sh
EXPERIMENT="${1}"
NAMESPACE="${2:-chaos-testing}"
echo "🔍 Monitoring fault injection: ${EXPERIMENT}"
# Watch experiment and pod status in the background; each -w would
# otherwise block and the metric loop below would never run.
kubectl get chaos ${EXPERIMENT} -n ${NAMESPACE} -w &
kubectl get pods -n atp-ingest-ns -w &
# Poll key metrics every 2 seconds (queries URL-encoded via curl -G)
watch -n 2 '
echo "=== Request Success Rate ==="
curl -sG http://prometheus:9090/api/v1/query \
  --data-urlencode "query=rate(http_requests_total{service=\"atp-ingestion-api\",status!~\"5..\"}[1m])" \
  | jq -r ".data.result[0].value[1]"
echo "=== P95 Latency ==="
curl -sG http://prometheus:9090/api/v1/query \
  --data-urlencode "query=histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{service=\"atp-ingestion-api\"}[1m]))" \
  | jq -r ".data.result[0].value[1]"
'
Phase 4: Observe¶
Objective: Monitor system behavior during fault injection.
Activities:
- Monitor Key Metrics
  - Performance metrics (latency, throughput)
  - Reliability metrics (error rate, success rate)
  - Resource metrics (CPU, memory, network)
  - Business metrics (events ingested, tenant isolation)
- Review Logs and Traces
- Monitor Abort Triggers
  - Watch error rate thresholds
  - Monitor latency thresholds
  - Check SLO violations
  - Be ready to abort if needed
Output: Observation logs, metrics snapshots, trace analysis.
Real-Time Monitoring Dashboard Query:
// Log Analytics: Real-time experiment monitoring
let ExperimentStart = datetime("2024-01-20T10:00:00Z");
let ExperimentEnd = datetime("2024-01-20T10:05:00Z");
ContainerLog
| where TimeGenerated between (ExperimentStart .. ExperimentEnd)
| where Namespace == "atp-ingest-ns"
| where ContainerName == "atp-ingestion-api"
| summarize
ErrorCount = countif(LogMessage contains "ERROR"),
TotalRequests = count(),
ErrorRate = (countif(LogMessage contains "ERROR") * 100.0) / count()
by bin(TimeGenerated, 1m)
| render timechart
Phase 5: Validate¶
Objective: Compare actual behavior to hypothesis and steady state.
Activities:
- Compare Metrics to Hypothesis
- Check Success Criteria
  - Request success rate maintained?
  - Latency within acceptable range?
  - Recovery time as expected?
  - No unexpected behaviors?
- Identify Deviations
  - Document unexpected behaviors
  - Analyze root causes
  - Measure actual vs expected
Output: Validation report, hypothesis validation results.
Validation Script:
#!/bin/bash
# scripts/validate-experiment-results.sh
EXPERIMENT="${1}"
BASELINE="${2}"
RESULTS="${3}"
echo "✅ Validating experiment results: ${EXPERIMENT}"
# Load baseline and results
BASELINE_METRICS=$(cat "${BASELINE}")
RESULT_METRICS=$(cat "${RESULTS}")
# Validate each metric
VALIDATION_FAILED=false
# Check request success rate
BASELINE_SUCCESS=$(echo "${BASELINE_METRICS}" | jq -r '.metrics.request_success_rate')
RESULT_SUCCESS=$(echo "${RESULT_METRICS}" | jq -r '.metrics.request_success_rate')
THRESHOLD=99.9
if (( $(echo "${RESULT_SUCCESS} < ${THRESHOLD}" | bc -l) )); then
echo "❌ Request success rate below threshold: ${RESULT_SUCCESS}% < ${THRESHOLD}%"
VALIDATION_FAILED=true
else
echo "✅ Request success rate acceptable: ${RESULT_SUCCESS}%"
fi
# Check P95 latency
BASELINE_LATENCY=$(echo "${BASELINE_METRICS}" | jq -r '.metrics.p95_latency_ms')
RESULT_LATENCY=$(echo "${RESULT_METRICS}" | jq -r '.metrics.p95_latency_ms')
MAX_INCREASE=100 # ms
LATENCY_INCREASE=$(echo "${RESULT_LATENCY} - ${BASELINE_LATENCY}" | bc)
if (( $(echo "${LATENCY_INCREASE} > ${MAX_INCREASE}" | bc -l) )); then
echo "❌ Latency increase too high: +${LATENCY_INCREASE}ms > +${MAX_INCREASE}ms"
VALIDATION_FAILED=true
else
echo "✅ Latency increase acceptable: +${LATENCY_INCREASE}ms"
fi
if [ "${VALIDATION_FAILED}" = true ]; then
echo "❌ Experiment validation FAILED"
exit 1
else
echo "✅ Experiment validation PASSED"
exit 0
fi
Phase 6: Rollback¶
Objective: Remove fault injection and restore normal operation.
Activities:
- Automatic Rollback (if triggered)
  - Abort triggers detected
  - Experiment automatically stopped
  - Fault injection removed
  - System returns to normal
- Manual Rollback (if needed)
- Verify Rollback Complete
  - Confirm fault injection stopped
  - Verify system returning to normal
  - Check metrics recovering
  - Validate no lingering effects
Output: Rollback confirmation, system recovery status.
Rollback Automation:
# rollback-automation.yaml
apiVersion: chaos.atp.connectsoft.io/v1alpha1
kind: RollbackPolicy
metadata:
name: standard-rollback-policy
namespace: chaos-testing
spec:
autoAbort:
enabled: true
triggers:
- name: error-rate-threshold
metric: error_rate
threshold: 1.0
operator: ">"
duration: "30s" # Must exceed threshold for 30s
action: abort
- name: latency-threshold
metric: p95_latency_ms
threshold: 500
operator: ">"
duration: "1m"
action: abort
- name: success-rate-threshold
metric: request_success_rate
threshold: 99.0
operator: "<"
duration: "30s"
action: abort
rollbackProcedure:
- step: StopChaosInjection
action: delete_experiment
- step: VerifyRollback
wait: "30s"
check: metrics_recovered
- step: NotifyTeam
channels:
- slack: "#atp-chaos"
- email: "sre-team@connectsoft.example"
Phase 7: Analyze¶
Objective: Review metrics, logs, and traces to understand system behavior.
Activities:
- Analyze Metrics
- Review Logs
  - Application logs during experiment
  - Error patterns
  - Recovery behavior
  - Unexpected events
- Analyze Traces
  - Distributed trace analysis
  - Service dependencies affected
  - Latency breakdown
  - Failure propagation
Output: Analysis report, root cause findings, lessons learned.
Phase 8: Report¶
Objective: Document experiment results, findings, and recommendations.
Activities:
- Generate Experiment Report
- Document Findings
  - Hypothesis validation results
  - Unexpected behaviors
  - Resilience gaps identified
  - Improvement recommendations
- Share Results
  - Publish report to wiki/docs
  - Notify team (Slack, email)
  - Present in team meeting
  - Update experiment catalog
Output: Experiment report, findings document.
Report Template:
# Experiment Report: ${EXPERIMENT_NAME}
**Date**: ${DATE}
**Experiment Type**: ${TYPE}
**Service**: ${SERVICE}
**Hypothesis**: ${HYPOTHESIS}
## Experiment Configuration
- **Blast Radius**: ${BLAST_RADIUS}
- **Duration**: ${DURATION}
- **Target**: ${TARGET}
## Results
### Metrics Comparison
| Metric | Baseline | During Experiment | Change | Threshold | Status |
|--------|----------|-------------------|--------|-----------|--------|
| Request Success Rate | 99.95% | 99.92% | -0.03% | >99.9% | ✅ Pass |
| P95 Latency | 145ms | 210ms | +65ms | <250ms | ✅ Pass |
| Error Rate | 0.05% | 0.08% | +0.03% | <0.1% | ✅ Pass |
### Hypothesis Validation
✅ **Hypothesis CONFIRMED**: System remained available with minimal impact.
## Findings
### What Worked Well
- Pod restart completed within 25 seconds
- Load balancer routed traffic away from failed pod
- No request failures
- Latency increase within acceptable range
### Issues Identified
- Slight increase in error rate (within threshold but higher than expected)
- Recovery time could be improved (25s vs target 20s)
## Recommendations
1. Optimize pod restart time (target: <20s)
2. Investigate error rate increase root cause
3. Add additional monitoring for pod failure scenarios
## Next Steps
- [ ] Implement pod restart optimization
- [ ] Investigate error rate increase
- [ ] Re-run experiment after improvements
Phase 9: Improve¶
Objective: Implement resilience improvements based on findings.
Activities:
- Prioritize Improvements
  - Impact assessment
  - Effort estimation
  - Priority ranking
  - Backlog creation
- Implement Improvements
  - Code changes
  - Configuration updates
  - Runbook updates
  - Monitoring enhancements
- Validate Improvements
  - Re-run experiment
  - Verify improvements
  - Confirm resilience enhanced
  - Update documentation
Output: Resilience improvements implemented, validation results.
Steady State Definition¶
Steady state defines what "normal" looks like for ATP services, providing baseline metrics against which experiment results are compared.
ATP Service Steady State Standards
Ingestion API Steady State:
# steady-state-definitions/ingestion-api-steady-state.yaml
apiVersion: chaos.atp.connectsoft.io/v1alpha1
kind: SteadyState
metadata:
name: ingestion-api-steady-state
namespace: chaos-testing
spec:
service: atp-ingestion-api
namespace: atp-ingest-ns
metrics:
# Performance Metrics
performance:
- name: p50_latency_ms
description: "Median request latency"
threshold: 100 # ms
operator: "<"
unit: "ms"
- name: p95_latency_ms
description: "95th percentile request latency"
threshold: 200 # ms
operator: "<"
unit: "ms"
- name: p99_latency_ms
description: "99th percentile request latency"
threshold: 500 # ms
operator: "<"
unit: "ms"
- name: throughput_events_per_sec
description: "Events ingested per second"
threshold: 10000
operator: ">="
unit: "events/sec"
# Reliability Metrics
reliability:
- name: request_success_rate
description: "Percentage of successful requests"
threshold: 99.9 # %
operator: ">="
unit: "percent"
- name: error_rate
description: "Percentage of failed requests"
threshold: 0.1 # %
operator: "<="
unit: "percent"
- name: availability
description: "Service availability"
threshold: 99.9 # %
operator: ">="
unit: "percent"
# Resource Metrics
resource:
- name: cpu_utilization_percent
description: "CPU utilization percentage"
threshold: 80 # %
operator: "<"
unit: "percent"
- name: memory_utilization_percent
description: "Memory utilization percentage"
threshold: 85 # %
operator: "<"
unit: "percent"
- name: network_io_bytes_per_sec
description: "Network I/O throughput"
threshold: 1000000000 # 1 GB/s
operator: "<"
unit: "bytes/sec"
# Business Metrics
business:
- name: events_ingested_per_minute
description: "Events ingested per minute"
threshold: 600000
operator: ">="
unit: "events/min"
- name: tenant_isolation_maintained
description: "Tenant isolation maintained"
threshold: true
operator: "=="
unit: "boolean"
- name: data_integrity_maintained
description: "No data loss or corruption"
threshold: true
operator: "=="
unit: "boolean"
# Queue Metrics
queue:
- name: queue_depth_messages
description: "Number of messages in queue"
threshold: 1000
operator: "<"
unit: "messages"
- name: queue_processing_rate
description: "Messages processed per second"
threshold: 5000
operator: ">="
unit: "messages/sec"
# Baseline Collection Period
baseline:
duration: "24h" # Collect 24-hour baseline
sampleInterval: "1m" # Sample every minute
# Validation Rules
validation:
requiredMetricsPercentage: 90 # 90% of metrics must be within threshold
consecutiveViolationsAllowed: 3 # Allow 3 consecutive violations before abort
Query API Steady State:
# steady-state-definitions/query-api-steady-state.yaml
apiVersion: chaos.atp.connectsoft.io/v1alpha1
kind: SteadyState
metadata:
name: query-api-steady-state
namespace: chaos-testing
spec:
service: atp-query-api
namespace: atp-query-ns
metrics:
performance:
- name: p95_latency_ms
threshold: 300 # Query API has higher latency tolerance
operator: "<"
- name: p99_latency_ms
threshold: 1000 # Complex queries may take longer
operator: "<"
- name: queries_per_sec
threshold: 1000
operator: ">="
reliability:
- name: request_success_rate
threshold: 99.9
operator: ">="
- name: query_timeout_rate
threshold: 0.5 # %
operator: "<="
business:
- name: queries_completed_per_minute
threshold: 60000
operator: ">="
- name: cache_hit_rate
threshold: 80 # %
operator: ">="
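The ingestion definition's validation rules require 90% of metrics to be within threshold (requiredMetricsPercentage: 90) before steady state is considered held. A minimal sketch of that rule, assuming a hypothetical name:pass|fail input format per metric:

```shell
# Evaluate the requiredMetricsPercentage rule: steady state holds when at
# least required_pct of the supplied metric results are "pass".
required_pct=90

check_steady_state() {
  total=0; passed=0
  for result in "$@"; do
    total=$((total + 1))
    [ "${result##*:}" = "pass" ] && passed=$((passed + 1))
  done
  # Integer comparison of passed/total >= required_pct/100
  if [ $((passed * 100)) -ge $((required_pct * total)) ]; then
    echo "steady-state: OK (${passed}/${total} metrics within threshold)"
  else
    echo "steady-state: VIOLATED (${passed}/${total} metrics within threshold)"
  fi
}

check_steady_state p50:pass p95:pass p99:pass error_rate:fail success:pass
# 4 of 5 (80%) within threshold -> VIOLATED
```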
Steady State Visualization
graph TB
STEADY[Steady State Definition]
PERF[Performance Metrics<br/>Latency, Throughput]
REL[Reliability Metrics<br/>Success Rate, Error Rate]
RES[Resource Metrics<br/>CPU, Memory, Network]
BUS[Business Metrics<br/>Events, Queries, Isolation]
STEADY --> PERF
STEADY --> REL
STEADY --> RES
STEADY --> BUS
PERF --> THRESH[Thresholds]
REL --> THRESH
RES --> THRESH
BUS --> THRESH
THRESH --> VALIDATE[Validation<br/>During Experiments]
style STEADY fill:#E3F2FD
style THRESH fill:#FFF3E0
style VALIDATE fill:#E8F5E9
Steady State Monitoring Script:
#!/bin/bash
# scripts/monitor-steady-state.sh
SERVICE="${1}"
DURATION="${2:-24h}"
echo "📊 Monitoring steady state for ${SERVICE} over ${DURATION}"
# Query Prometheus for steady state metrics
PROMETHEUS_URL="http://prometheus.monitoring.svc.cluster.local:9090"
# Check each metric against threshold
METRICS=(
"request_success_rate:99.9:>="
"p95_latency_ms:200:<"
"error_rate:0.1:<="
"cpu_utilization_percent:80:<"
)
for METRIC in "${METRICS[@]}"; do
IFS=':' read -r NAME THRESHOLD OPERATOR <<< "${METRIC}"
VALUE=$(curl -s "${PROMETHEUS_URL}/api/v1/query" \
--data-urlencode "query=${NAME}{service=\"${SERVICE}\"}" \
| jq -r '.data.result[0].value[1]')
echo "Checking ${NAME}: ${VALUE} ${OPERATOR} ${THRESHOLD}"
# Validate against threshold
if [ "${OPERATOR}" = ">=" ] && (( $(echo "${VALUE} >= ${THRESHOLD}" | bc -l) )); then
echo " ✅ Within threshold"
elif [ "${OPERATOR}" = "<" ] && (( $(echo "${VALUE} < ${THRESHOLD}" | bc -l) )); then
echo " ✅ Within threshold"
elif [ "${OPERATOR}" = "<=" ] && (( $(echo "${VALUE} <= ${THRESHOLD}" | bc -l) )); then
echo " ✅ Within threshold"
else
echo " ⚠️ Outside threshold"
fi
done
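The per-metric check in the script above can be factored into a reusable comparator covering every operator the steady-state definitions use. A sketch, using awk for floating-point comparison:

```shell
# Compare a metric value against a threshold for any supported operator;
# returns 0 (success) when the value is within the threshold.
meets_threshold() {
  value="$1"; operator="$2"; threshold="$3"
  awk -v v="$value" -v t="$threshold" -v op="$operator" 'BEGIN {
    if      (op == ">=") ok = (v >= t)
    else if (op == "<=") ok = (v <= t)
    else if (op == ">")  ok = (v >  t)
    else if (op == "<")  ok = (v <  t)
    else                 ok = 0
    exit !ok
  }'
}

meets_threshold 99.95 ">=" 99.9 && echo "success rate OK"
meets_threshold 210 "<" 200 || echo "latency outside threshold"
```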
Blast Radius Control¶
Blast radius control defines the scope of impact for chaos experiments, limiting risk and ensuring safe experimentation.
Blast Radius Dimensions
graph LR
BLAST[Blast Radius]
TRAFFIC[Traffic %<br/>1%, 5%, 10%, 25%, 50%]
TENANT[Tenant Scope<br/>All, Specific, Test Only]
SERVICE[Service Scope<br/>All, Service, Pod]
TIME[Duration<br/>30s, 5m, 15m, 1h]
GEO[Geographic<br/>All, Region, AZ]
BLAST --> TRAFFIC
BLAST --> TENANT
BLAST --> SERVICE
BLAST --> TIME
BLAST --> GEO
style BLAST fill:#FFE5B4
Blast Radius Configuration
# blast-radius-configurations/standard-blast-radius.yaml
apiVersion: chaos.atp.connectsoft.io/v1alpha1
kind: BlastRadius
metadata:
name: standard-blast-radius
namespace: chaos-testing
spec:
# Traffic Percentage Control
traffic:
percentage: 1 # Start with 1%
gradualIncrease: true
increments:
- percentage: 1
duration: "30s"
stabilityCheck: true
- percentage: 5
duration: "5m"
stabilityCheck: true
- percentage: 10
duration: "15m"
stabilityCheck: true
# Tenant Scope Control
tenant:
scope: test-tenants-only # Options: all, specific, test-tenants-only
tenants:
- test-tenant-001
- test-tenant-002
excludeTenants:
- production-tenant-001
- production-tenant-002
# Service Scope Control
service:
scope: single-pod # Options: all, service, single-pod
namespace: atp-ingest-ns
labelSelector:
app: atp-ingest-api
podSelector:
matchLabels:
version: v1.2.3
# Duration Control
duration:
initial: "30s" # Start with 30 seconds
max: "5m" # Maximum 5 minutes
extensionAllowed: false # Do not allow extension
# Geographic Scope Control
geographic:
scope: single-az # Options: all, region, single-az
region: eastus
availabilityZone: "1"
# Automatic Blast Radius Reduction
autoReduce:
enabled: true
triggers:
- metric: error_rate
threshold: 0.5 # %
action: reduce_by_50_percent
- metric: p95_latency_ms
threshold: 300 # ms
action: reduce_by_50_percent
Blast Radius by Experiment Type
| Experiment Type | Default Blast Radius | Max Blast Radius | Notes |
|---|---|---|---|
| Pod Kill | 1 pod (5-10% traffic) | 1 pod | Safe, Kubernetes handles |
| Network Partition | 10% traffic | 25% traffic | Can cause cascading failures |
| Database Failover | Read-only mode | Full failover | Affects all traffic |
| Cache Failure | 10% traffic | 50% traffic | Degraded performance expected |
| Latency Injection | 5% traffic | 25% traffic | Gradual increase recommended |
| Error Injection | 1% requests | 10% requests | Can trigger alert storms |
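Traffic percentages translate into concrete pod counts when using fixed-percent injection modes. A sketch of the rounding rule (assumes traffic is spread evenly across replicas):

```shell
# Number of pods a given blast-radius percentage touches, rounded up so a
# non-zero percentage always affects at least one pod.
pods_affected() {
  replicas="$1"; percent="$2"
  echo $(( (replicas * percent + 99) / 100 ))
}

pods_affected 5 10    # 10% of 5 replicas -> 1 pod
pods_affected 20 25   # 25% of 20 replicas -> 5 pods
```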
Gradual Blast Radius Increase
# blast-radius-configurations/gradual-rollout.yaml
apiVersion: chaos.atp.connectsoft.io/v1alpha1
kind: BlastRadius
metadata:
name: gradual-rollout-blast-radius
spec:
gradualIncrease:
enabled: true
stages:
- stage: 1
percentage: 1
duration: "30s"
stabilityCheck:
enabled: true
metrics:
- name: error_rate
threshold: 0.5
operator: "<"
- name: p95_latency_ms
threshold: 250
operator: "<"
requiredDuration: "30s" # Must be stable for 30s before next stage
- stage: 2
percentage: 5
duration: "5m"
stabilityCheck:
enabled: true
metrics:
- name: error_rate
threshold: 1.0
operator: "<"
- name: p95_latency_ms
threshold: 300
operator: "<"
requiredDuration: "2m"
- stage: 3
percentage: 10
duration: "15m"
stabilityCheck:
enabled: true
requiredDuration: "5m"
# Auto-reduce if unstable
autoReduce:
enabled: true
reduceToPreviousStage: true
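The stage progression above can be read as a simple state transition: advance when the stability check passes, fall back to the previous stage otherwise. A sketch with the stage percentages taken from the example (the function shape is illustrative, not the controller's API):

```shell
# Given the current traffic percentage and the stability-check result,
# return the next stage percentage (mirrors reduceToPreviousStage: true).
next_stage() {
  local current="$1" stable="$2"
  local stages=(1 5 10) i
  for i in "${!stages[@]}"; do
    [ "${stages[$i]}" = "$current" ] || continue
    if [ "$stable" = "true" ] && [ $((i + 1)) -lt "${#stages[@]}" ]; then
      echo "${stages[$((i + 1))]}"   # advance to the next stage
    elif [ "$stable" != "true" ] && [ "$i" -gt 0 ]; then
      echo "${stages[$((i - 1))]}"   # fall back to the previous stage
    else
      echo "$current"                # hold at the first/last stage
    fi
    return 0
  done
  echo "$current"                    # unknown stage: hold
}

next_stage 1 true    # -> 5
next_stage 5 false   # -> 1
next_stage 10 true   # -> 10 (final stage; hold)
```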
Blast Radius Monitoring
#!/bin/bash
# scripts/monitor-blast-radius.sh
EXPERIMENT="${1}"
NAMESPACE="${2:-chaos-testing}"
echo "🎯 Monitoring blast radius for ${EXPERIMENT}"
# Get blast radius configuration
BLAST_RADIUS=$(kubectl get chaos ${EXPERIMENT} -n ${NAMESPACE} \
  -o jsonpath='{.metadata.annotations.chaos\.atp\.connectsoft\.io/blast-radius}')
echo "Blast Radius: ${BLAST_RADIUS}"
# Monitor actual impact (queries URL-encoded via curl -G)
TRAFFIC=$(curl -sG http://prometheus:9090/api/v1/query \
  --data-urlencode "query=sum(rate(http_requests_total{experiment=\"${EXPERIMENT}\"}[1m]))" \
  | jq -r '.data.result[0].value[1]')
ERRORS=$(curl -sG http://prometheus:9090/api/v1/query \
  --data-urlencode "query=rate(http_requests_total{experiment=\"${EXPERIMENT}\",status=~\"5..\"}[1m])" \
  | jq -r '.data.result[0].value[1]')
echo "Actual Impact:"
echo "- Traffic affected: ${TRAFFIC} req/s"
echo "- Error rate: ${ERRORS} req/s"
Rollback Triggers¶
Rollback triggers automatically abort experiments when system health degrades beyond acceptable thresholds, preventing cascading failures and minimizing impact.
Rollback Trigger Configuration
# rollback-triggers/standard-rollback-triggers.yaml
apiVersion: chaos.atp.connectsoft.io/v1alpha1
kind: RollbackPolicy
metadata:
name: standard-rollback-triggers
namespace: chaos-testing
spec:
autoAbort:
enabled: true
# Error Rate Trigger
triggers:
- name: error-rate-threshold
description: "Abort if error rate exceeds threshold"
metric: error_rate
threshold: 1.0 # %
operator: ">"
duration: "30s" # Must exceed threshold for 30 seconds
severity: high
action: abort
notify:
- slack: "#atp-chaos"
- email: "sre-oncall@connectsoft.example"
# Latency Trigger
- name: latency-threshold
description: "Abort if P95 latency exceeds threshold"
metric: p95_latency_ms
threshold: 500 # ms
operator: ">"
duration: "1m"
severity: medium
action: abort
# Success Rate Trigger
- name: success-rate-threshold
description: "Abort if success rate drops below threshold"
metric: request_success_rate
threshold: 99.0 # %
operator: "<"
duration: "30s"
severity: high
action: abort
# Throughput Trigger
- name: throughput-drop-threshold
description: "Abort if throughput drops significantly"
metric: throughput_events_per_sec
threshold: 50 # % of baseline
operator: "<"
comparison: baseline # Compare to baseline
duration: "1m"
severity: medium
action: abort
# SLO Violation Trigger
- name: slo-violation
description: "Abort if any SLO is violated"
metric: slo_violation_count
threshold: 1
operator: ">"
duration: "10s"
severity: critical
action: abort
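Trigger Evaluation Sketch
The duration field means a trigger fires only after the metric stays past its threshold for the entire window, not on a single spike. A minimal bash sketch of that evaluation loop (the sample values and 10-second scrape interval are illustrative; the real controller reads these series from Prometheus):

```shell
#!/bin/bash
# Sketch: abort only when every sample in the window breaches the threshold.
THRESHOLD="1.0"     # error rate %, as in the error-rate-threshold trigger
WINDOW_SAMPLES=3    # a 30s duration at a 10s scrape interval

evaluate_trigger() {
  local breaches=0
  for sample in "$@"; do
    if (( $(echo "${sample} > ${THRESHOLD}" | bc -l) )); then
      breaches=$((breaches + 1))    # consecutive breach
    else
      breaches=0                    # any healthy sample resets the window
    fi
    if [ "${breaches}" -ge "${WINDOW_SAMPLES}" ]; then
      echo "abort"
      return 0
    fi
  done
  echo "continue"
}

evaluate_trigger 0.2 1.5 0.3 1.2 1.4   # isolated spikes reset: prints "continue"
evaluate_trigger 1.5 1.2 1.4           # 3 consecutive breaches: prints "abort"
```

The reset-on-healthy-sample behavior is what keeps transient spikes from aborting an otherwise healthy experiment.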
Rollback Trigger Flow
graph TD
START[Experiment Running] --> MONITOR[Monitor Metrics]
MONITOR --> CHECK{Trigger<br/>Condition<br/>Met?}
CHECK -->|No| MONITOR
CHECK -->|Yes| DURATION{Exceeded<br/>Duration?}
DURATION -->|No| MONITOR
DURATION -->|Yes| SEVERITY{Severity?}
SEVERITY -->|Critical| ABORT[Immediate Abort]
SEVERITY -->|High| ABORT
SEVERITY -->|Medium| WARN[Warning First]
SEVERITY -->|Low| LOG[Log Only]
WARN --> WAIT[Wait 30s]
WAIT --> CHECK2{Still<br/>Triggered?}
CHECK2 -->|Yes| ABORT
CHECK2 -->|No| MONITOR
ABORT --> NOTIFY[Notify Team]
NOTIFY --> ROLLBACK[Rollback Experiment]
ROLLBACK --> VERIFY[Verify Recovery]
VERIFY --> COMPLETE[Complete]
style START fill:#FFE5B4
style ABORT fill:#FFB6C1
style COMPLETE fill:#90EE90
Rollback Automation
#!/bin/bash
# scripts/rollback-automation.sh
EXPERIMENT="${1}"
NAMESPACE="${2:-chaos-testing}"
echo "⏪ Automatically rolling back experiment: ${EXPERIMENT}"
# Check if rollback already triggered
if kubectl get chaos ${EXPERIMENT} -n ${NAMESPACE} \
-o jsonpath='{.status.phase}' | grep -q "RolledBack"; then
echo "Experiment already rolled back"
exit 0
fi
# Delete experiment (stops fault injection)
kubectl delete chaos ${EXPERIMENT} -n ${NAMESPACE}
# Wait for rollback to complete
echo "Waiting for rollback to complete..."
sleep 10
# Verify system recovery
echo "Verifying system recovery..."
RETRIES=0
MAX_RETRIES=6
while [ ${RETRIES} -lt ${MAX_RETRIES} ]; do
ERROR_RATE=$(curl -s -G 'http://prometheus:9090/api/v1/query' --data-urlencode 'query=sum(rate(http_requests_total{status=~"5.."}[1m])) / sum(rate(http_requests_total[1m])) * 100' | jq -r '.data.result[0].value[1]')
if (( $(echo "${ERROR_RATE} < 0.5" | bc -l) )); then
echo "✅ System recovered: Error rate = ${ERROR_RATE}%"
exit 0
fi
RETRIES=$((RETRIES + 1))
echo "Waiting for recovery... (${RETRIES}/${MAX_RETRIES})"
sleep 10
done
echo "⚠️ System recovery taking longer than expected"
exit 1
Manual Rollback
Manual rollback provides human override for experiments:
#!/bin/bash
# scripts/manual-rollback.sh
EXPERIMENT="${1}"
NAMESPACE="${2:-chaos-testing}"
REASON="${3}"
echo "⏪ Manual rollback requested for: ${EXPERIMENT}"
echo "Reason: ${REASON}"
# Record rollback metadata before deleting the experiment
kubectl annotate chaos ${EXPERIMENT} -n ${NAMESPACE} \
chaos.atp.connectsoft.io/manual-rollback="true" \
chaos.atp.connectsoft.io/rollback-reason="${REASON}" \
chaos.atp.connectsoft.io/rollback-timestamp="$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
--overwrite
# Delete experiment (stops fault injection)
kubectl delete chaos ${EXPERIMENT} -n ${NAMESPACE}
# Notify team
curl -X POST "${SLACK_WEBHOOK_URL}" \
-H "Content-Type: application/json" \
-d "{
\"text\": \"🚨 Manual rollback triggered for ${EXPERIMENT}\",
\"attachments\": [{
\"color\": \"warning\",
\"fields\": [{
\"title\": \"Reason\",
\"value\": \"${REASON}\",
\"short\": false
}]
}]
}"
echo "✅ Manual rollback complete"
Safety Measures¶
Safety measures ensure chaos experiments are conducted safely with minimal risk to production systems and customer experience.
Safety Measures Checklist
## Chaos Experiment Safety Checklist
### Pre-Experiment Safety Checks
#### Environment Validation
- [ ] Experiment target is **staging environment** (not production)
- [ ] Staging environment mirrors production configuration
- [ ] Production-like data volumes used (for realistic testing; no real customer data)
- [ ] No active incidents in target environment
#### Experiment Configuration
- [ ] Blast radius set to **minimum** (1% traffic, test tenants)
- [ ] Rollback triggers configured and tested
- [ ] Experiment duration set to **minimum** (30 seconds)
- [ ] Gradual rollout enabled (if applicable)
- [ ] Automatic abort enabled
#### Team Preparation
- [ ] **On-call engineer present** and available
- [ ] Team notified (Slack, email) at least 24 hours in advance
- [ ] Communication plan established
- [ ] Rollback procedure tested and documented
- [ ] Approval obtained (if required for experiment type)
#### Monitoring Preparation
- [ ] Monitoring dashboards prepared and tested
- [ ] Alerts configured for abort triggers
- [ ] Baseline metrics collected (24-48 hours)
- [ ] Steady state validated
- [ ] Log aggregation queries prepared
#### Rollback Preparation
- [ ] Rollback procedure documented and tested
- [ ] Manual rollback button/command ready
- [ ] Automatic rollback triggers validated
- [ ] Recovery verification procedure ready
### During Experiment Safety
#### Active Monitoring
- [ ] Real-time monitoring dashboards visible
- [ ] Metrics monitored continuously
- [ ] Abort triggers watched
- [ ] Error logs reviewed in real-time
#### Communication
- [ ] Team informed experiment started
- [ ] Status updates provided (if long duration)
- [ ] Issues communicated immediately
#### Readiness
- [ ] Ready to abort at any moment
- [ ] Rollback procedure accessible
- [ ] On-call engineer monitoring
### Post-Experiment Safety
#### Validation
- [ ] Experiment completed or aborted cleanly
- [ ] System returned to steady state
- [ ] No lingering effects
- [ ] Metrics normalized
#### Documentation
- [ ] Results documented
- [ ] Findings recorded
- [ ] Lessons learned captured
- [ ] Improvements identified
Safety Measure Configuration
# safety-measures/standard-safety-measures.yaml
apiVersion: chaos.atp.connectsoft.io/v1alpha1
kind: SafetyPolicy
metadata:
name: standard-safety-policy
namespace: chaos-testing
spec:
# Environment Restrictions
environment:
allowedEnvironments:
- staging
- dev # Limited experiments only
prohibitedEnvironments:
- production # Never run in production without explicit approval
# Blast Radius Restrictions
blastRadius:
maxTrafficPercentage: 50 # Never exceed 50% traffic
maxDuration: "1h" # Never exceed 1 hour
requireGradualRollout: true # Always use gradual rollout
minTrafficPercentage: 1 # Always start with at least 1%
# Team Requirements
team:
requireOnCallEngineer: true
requireNotification: true
notificationChannels:
- slack: "#atp-chaos"
- email: "sre-team@connectsoft.example"
notificationAdvanceTime: "24h" # Notify 24 hours in advance
# Approval Requirements
approval:
requiredFor:
- production-experiments
- high-severity-experiments
- experiments-exceeding-10-percent-traffic
approvalMethod: cab # Change Advisory Board
approvers:
- sre-team-lead
- platform-engineering-lead
# Time Restrictions
time:
allowedWindows:
- day: monday-friday
time: "02:00-06:00 UTC" # Low-traffic window
prohibitedWindows:
- day: friday
time: "14:00-18:00 UTC" # Avoid Friday afternoons
- day: monday
time: "08:00-12:00 UTC" # Avoid Monday mornings
# Automated Safety Checks
automatedChecks:
- name: active-incident-check
check: no_active_incidents
action: block_experiment
- name: steady-state-validation
check: steady_state_valid
action: block_experiment
- name: rollback-test
check: rollback_procedure_tested
action: block_experiment
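Time Window Check Sketch
The allowed and prohibited windows above are enforced by the chaos controller; at its core the check is simple UTC-hour arithmetic. A simplified sketch using the 02:00-06:00 UTC window from the policy (the real check also evaluates the day-of-week fields):

```shell
#!/bin/bash
# Sketch: is a given UTC hour inside the 02:00-06:00 allowed window?
WINDOW_START=2
WINDOW_END=6

in_allowed_window() {
  local hour="${1}"   # 0-23, normally taken from $(date -u +%H)
  if [ "${hour}" -ge "${WINDOW_START}" ] && [ "${hour}" -lt "${WINDOW_END}" ]; then
    echo "allowed"
  else
    echo "blocked"
  fi
}

in_allowed_window 3    # prints "allowed"
in_allowed_window 14   # prints "blocked"
```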
Safety Measures Diagram
graph TD
START[Start Experiment] --> CHECK1{Environment<br/>Staging?}
CHECK1 -->|No| BLOCK1[Block Experiment]
CHECK1 -->|Yes| CHECK2{Blast Radius<br/><10%?}
CHECK2 -->|No| APPROVAL{Approved<br/>by CAB?}
CHECK2 -->|Yes| CHECK3{On-Call<br/>Present?}
APPROVAL -->|No| BLOCK2[Block Experiment]
APPROVAL -->|Yes| CHECK3
CHECK3 -->|No| BLOCK3[Block Experiment]
CHECK3 -->|Yes| CHECK4{No Active<br/>Incidents?}
CHECK4 -->|No| BLOCK4[Block Experiment]
CHECK4 -->|Yes| CHECK5{Time Window<br/>Allowed?}
CHECK5 -->|No| BLOCK5[Block Experiment]
CHECK5 -->|Yes| ALLOW[Allow Experiment]
ALLOW --> EXECUTE[Execute Experiment]
style START fill:#FFE5B4
style BLOCK1 fill:#FFB6C1
style BLOCK2 fill:#FFB6C1
style BLOCK3 fill:#FFB6C1
style BLOCK4 fill:#FFB6C1
style BLOCK5 fill:#FFB6C1
style ALLOW fill:#90EE90
Safety Validation Script
#!/bin/bash
# scripts/validate-safety-measures.sh
EXPERIMENT="${1}"
NAMESPACE="${2:-chaos-testing}"
echo "🔒 Validating safety measures for ${EXPERIMENT}"
VALIDATION_FAILED=false
# Check environment
ENV=$(kubectl get chaos ${EXPERIMENT} -n ${NAMESPACE} \
-o jsonpath='{.metadata.labels.environment}')
if [ "${ENV}" = "production" ]; then
echo "❌ Experiment targets production environment"
VALIDATION_FAILED=true
else
echo "✅ Environment check passed: ${ENV}"
fi
# Check blast radius
BLAST_RADIUS=$(kubectl get chaos ${EXPERIMENT} -n ${NAMESPACE} \
-o jsonpath='{.metadata.annotations.chaos\.atp\.connectsoft\.io/blast-radius}')
BLAST_RADIUS_VALUE=$(echo "${BLAST_RADIUS}" | tr -d '%')   # annotation may be stored as "20%"
if (( $(echo "${BLAST_RADIUS_VALUE} > 10" | bc -l) )); then
echo "⚠️ Blast radius exceeds 10%: ${BLAST_RADIUS_VALUE}%"
# Check if approved
if ! kubectl get chaos ${EXPERIMENT} -n ${NAMESPACE} \
-o jsonpath='{.metadata.annotations.chaos\.atp\.connectsoft\.io/cab-approved}' | grep -q "true"; then
echo "❌ Experiment not approved by CAB"
VALIDATION_FAILED=true
fi
else
echo "✅ Blast radius acceptable: ${BLAST_RADIUS_VALUE}%"
fi
# Check on-call engineer
if ! kubectl get chaos ${EXPERIMENT} -n ${NAMESPACE} \
-o jsonpath='{.metadata.annotations.chaos\.atp\.connectsoft\.io/oncall-present}' | grep -q "true"; then
echo "❌ On-call engineer not present"
VALIDATION_FAILED=true
else
echo "✅ On-call engineer present"
fi
# Check active incidents
if kubectl get incidents -n monitoring | grep -q "Active"; then
echo "❌ Active incidents detected"
VALIDATION_FAILED=true
else
echo "✅ No active incidents"
fi
if [ "${VALIDATION_FAILED}" = true ]; then
echo "❌ Safety validation FAILED"
exit 1
else
echo "✅ Safety validation PASSED"
exit 0
fi
Summary: Chaos Experiment Framework¶
- Experiment Structure: Standardized Chaos Mesh experiment format with ATP extensions, metadata labels for categorization, spec structure (action, mode, selector, duration, scheduler), support for multiple experiment types (Pod, Network, IO, Stress, Time, Kernel, HTTP), experiment categories for organization
- Experiment Lifecycle: 9-phase lifecycle (Setup, Baseline, Inject, Observe, Validate, Rollback, Analyze, Report, Improve) with detailed procedures, scripts, and checklists for each phase
- Steady State Definition: Comprehensive steady state definitions for ATP services (Ingestion API, Query API) with performance, reliability, resource, business, and queue metrics, baseline collection procedures, validation rules, steady state monitoring scripts
- Blast Radius Control: Multi-dimensional blast radius (traffic percentage, tenant scope, service scope, duration, geographic scope), gradual rollout configuration, automatic blast radius reduction, blast radius by experiment type guidelines, blast radius monitoring scripts
- Rollback Triggers: Automatic abort triggers (error rate, latency, success rate, throughput, SLO violations), rollback trigger flow with severity levels, automated rollback procedures, manual rollback capability with notification
- Safety Measures: Comprehensive safety checklist (pre-experiment, during experiment, post-experiment), safety policy configuration with environment restrictions, blast radius limits, team requirements, approval requirements, time restrictions, automated safety checks, safety validation scripts
Pod and Container Chaos¶
Purpose: Define comprehensive chaos experiments for pod and container failures in ATP, validating Kubernetes resilience mechanisms including pod restarts, container recovery, resource limits, and horizontal pod autoscaling to ensure ATP services remain available and performant during infrastructure failures.
Pod Failure Experiment¶
Pod failure experiments validate that ATP services remain available and functional when individual pods crash or are terminated, ensuring Kubernetes orchestration and load balancing mechanisms work correctly under failure conditions.
Hypothesis
"When 1 ingestion API pod crashes (out of 5 pods deployed), the system will remain available, request success rate will stay >99.9%, P95 latency will increase by <100ms, and Kubernetes will restart the pod within 30 seconds."
Experiment Configuration
Basic PodChaos Configuration:
# chaos-experiments/pod-failure-ingestion-api.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
name: ingestion-pod-kill
namespace: chaos-testing
labels:
category: infrastructure
service: atp-ingestion-api
severity: low
frequency: weekly
annotations:
chaos.atp.connectsoft.io/hypothesis: |
When 1 ingestion API pod crashes (out of 5 pods),
the system will remain available, request success rate will stay >99.9%,
P95 latency will increase by <100ms, and Kubernetes will restart the pod within 30 seconds.
chaos.atp.connectsoft.io/blast-radius: "20%" # 1 pod out of 5 = 20%
spec:
action: pod-kill
mode: one # Kill one pod
selector:
namespaces:
- atp-ingest-ns
labelSelectors:
app: atp-ingest-api
version: v1.2.3
duration: "5m"
scheduler:
cron: "@every 1h"
# ATP-specific rollback triggers
abortRules:
- name: error-rate-threshold
condition: error_rate > 1.0%
action: abort
- name: latency-threshold
condition: p95_latency_ms > 500
action: abort
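The 20% blast-radius annotation follows directly from pods killed over pods deployed. A quick sketch of the arithmetic, using the replica counts from this experiment:

```shell
#!/bin/bash
# Blast radius = pods affected / total replicas, as an integer percentage.
blast_radius_pct() {
  local affected="${1}"
  local total="${2}"
  echo $(( affected * 100 / total ))
}

blast_radius_pct 1 5    # this experiment, 1 of 5 ingestion pods: prints 20
blast_radius_pct 2 20   # prints 10
```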
Advanced PodChaos with Gradual Rollout:
# chaos-experiments/pod-failure-advanced.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
name: ingestion-pod-kill-advanced
namespace: chaos-testing
spec:
action: pod-kill
mode: fixed-percent
value: "20" # 20% of pods (1 out of 5)
selector:
namespaces:
- atp-ingest-ns
labelSelectors:
app: atp-ingest-api
duration: "5m"
# Gradual rollout: kill pods one at a time
gradualRollout:
enabled: true
stages:
- stage: 1
pods: 1
duration: "30s"
stabilityCheck:
enabled: true
metrics:
- name: request_success_rate
threshold: 99.9
operator: ">="
- name: p95_latency_ms
threshold: 250
operator: "<"
requiredDuration: "30s"
# Monitor pod restart time
monitoring:
podRestartTime:
enabled: true
maxTime: "30s"
alertOnExceed: true
Expected Behavior
Immediate Impact (0-5 seconds):
- Pod termination: Target pod receives SIGTERM, then SIGKILL
- Service endpoint removal: Pod removed from service endpoints
- Load balancer update: Traffic routed away from failed pod
- Remaining pods: 4 pods continue handling traffic (80% capacity)
Recovery Phase (5-30 seconds):
- Pod restart: Kubernetes detects pod failure and schedules new pod
- Container startup: New container starts and initializes
- Health checks: Liveness and readiness probes execute
- Service endpoint addition: New pod added to service endpoints
- Traffic routing: Load balancer routes traffic to new pod
Steady State (30+ seconds):
- Full capacity: All 5 pods operational
- Traffic distribution: Traffic evenly distributed across pods
- Metrics normalized: Latency and throughput return to baseline
Expected Metrics
| Metric | Baseline | During Failure | Expected Range | Recovery Target |
|---|---|---|---|---|
| Request Success Rate | 99.95% | >99.9% | >99.9% | 99.95% |
| P95 Latency | 145ms | <245ms | Baseline + <100ms | 145ms |
| P99 Latency | 320ms | <420ms | Baseline + <100ms | 320ms |
| Throughput | 10,500 events/sec | >8,400 events/sec | >80% of baseline | 10,500 events/sec |
| Error Rate | 0.05% | <0.1% | <0.1% | 0.05% |
| Pod Restart Time | N/A | <30s | <30s | N/A |
Validation Criteria
Success Criteria:
- ✅ Request success rate maintained >99.9% throughout experiment
- ✅ P95 latency increase <100ms
- ✅ Error rate stays below 0.1% (within the experiment error budget)
- ✅ Pod restarted within 30 seconds
- ✅ All 5 pods operational after recovery
- ✅ Metrics normalized within 60 seconds
Failure Criteria (Auto-abort triggers):
- ❌ Request success rate drops below 99.9%
- ❌ P95 latency exceeds 500ms
- ❌ Error rate exceeds 0.1%
- ❌ Pod restart time exceeds 30 seconds
- ❌ System unable to recover within 60 seconds
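Each criterion above reduces to comparing an observed metric against the baseline plus an allowed delta. A minimal sketch of that comparison, using the example numbers from the expected-metrics table:

```shell
#!/bin/bash
# Pass if observed P95 latency stays within baseline + allowed increase.
validate_latency() {
  local baseline_ms="${1}"
  local observed_ms="${2}"
  local max_increase_ms="${3}"
  if [ "${observed_ms}" -le $(( baseline_ms + max_increase_ms )) ]; then
    echo "pass"
  else
    echo "fail"
  fi
}

validate_latency 145 210 100   # prints "pass" (210 <= 145 + 100)
validate_latency 145 260 100   # prints "fail" (260 > 245)
```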
Monitoring and Observation
Real-Time Monitoring Dashboard:
// Log Analytics: Pod failure monitoring
let ExperimentStart = datetime("2024-01-20T10:00:00Z");
let ExperimentEnd = datetime("2024-01-20T10:05:00Z");
// Pod status during experiment
KubePodInventory
| where TimeGenerated between (ExperimentStart .. ExperimentEnd)
| where Namespace == "atp-ingest-ns"
| where Name contains "atp-ingest-api"
| summarize
PodCount = count(),
RunningCount = countif(Status == "Running"),
FailedCount = countif(Status == "Failed"),
RestartCount = sum(ContainerRestartCount)
by bin(TimeGenerated, 10s)
| render timechart
// Request metrics during pod failure
ContainerLog
| where TimeGenerated between (ExperimentStart .. ExperimentEnd)
| where Namespace == "atp-ingest-ns"
| where ContainerName == "atp-ingestion-api"
| summarize
TotalRequests = count(),
ErrorCount = countif(LogMessage contains "ERROR"),
ErrorRate = (countif(LogMessage contains "ERROR") * 100.0) / count()
by bin(TimeGenerated, 10s)
| render timechart
Pod Restart Time Measurement Script:
#!/bin/bash
# scripts/measure-pod-restart-time.sh
NAMESPACE="${1:-atp-ingest-ns}"
LABEL_SELECTOR="${2:-app=atp-ingest-api}"
echo "📊 Measuring pod restart time for ${LABEL_SELECTOR} in ${NAMESPACE}"
# Get initial pod count
INITIAL_PODS=$(kubectl get pods -n ${NAMESPACE} -l ${LABEL_SELECTOR} --no-headers | wc -l)
echo "Initial pod count: ${INITIAL_PODS}"
# Kill one pod
POD_TO_KILL=$(kubectl get pods -n ${NAMESPACE} -l ${LABEL_SELECTOR} -o jsonpath='{.items[0].metadata.name}')
echo "Killing pod: ${POD_TO_KILL}"
KILL_TIME=$(date +%s)
kubectl delete pod ${POD_TO_KILL} -n ${NAMESPACE}
# Wait for pod to be deleted
echo "Waiting for pod to be deleted..."
kubectl wait --for=delete pod/${POD_TO_KILL} -n ${NAMESPACE} --timeout=60s
DELETE_TIME=$(date +%s)
DELETE_DURATION=$((DELETE_TIME - KILL_TIME))
echo "Pod deleted in ${DELETE_DURATION} seconds"
# Wait for new pod to be running
echo "Waiting for new pod to be running..."
kubectl wait --for=condition=Ready pod -n ${NAMESPACE} -l ${LABEL_SELECTOR} --timeout=60s
RECOVERY_TIME=$(date +%s)
RECOVERY_DURATION=$((RECOVERY_TIME - KILL_TIME))
echo "Pod recovery time: ${RECOVERY_DURATION} seconds"
# Verify pod count
FINAL_PODS=$(kubectl get pods -n ${NAMESPACE} -l ${LABEL_SELECTOR} --no-headers | wc -l)
echo "Final pod count: ${FINAL_PODS}"
if [ "${FINAL_PODS}" -eq "${INITIAL_PODS}" ]; then
echo "✅ Pod count restored: ${FINAL_PODS}"
else
echo "⚠️ Pod count mismatch: expected ${INITIAL_PODS}, got ${FINAL_PODS}"
exit 1
fi
# Validate restart time
if [ "${RECOVERY_DURATION}" -le 30 ]; then
echo "✅ Pod restart time within target: ${RECOVERY_DURATION}s <= 30s"
exit 0
else
echo "❌ Pod restart time exceeds target: ${RECOVERY_DURATION}s > 30s"
exit 1
fi
Experiment Execution
Manual Execution:
#!/bin/bash
# scripts/execute-pod-failure-experiment.sh
EXPERIMENT="ingestion-pod-kill"
NAMESPACE="chaos-testing"
echo "🧪 Starting pod failure experiment: ${EXPERIMENT}"
# Collect baseline metrics
echo "📊 Collecting baseline metrics..."
./scripts/collect-baseline-metrics.sh \
--service atp-ingestion-api \
--duration 1h \
--output baseline-ingestion-api-$(date +%Y%m%d-%H%M%S).json
# Apply experiment
echo "🔧 Applying chaos experiment..."
kubectl apply -f chaos-experiments/pod-failure-ingestion-api.yaml -n ${NAMESPACE}
# Monitor experiment
echo "👀 Monitoring experiment..."
./scripts/monitor-pod-failure-experiment.sh ${EXPERIMENT}
# Measure pod restart time
echo "⏱️ Measuring pod restart time..."
./scripts/measure-pod-restart-time.sh atp-ingest-ns app=atp-ingest-api
# Wait for experiment to complete
echo "⏳ Waiting for experiment to complete..."
kubectl wait --for=condition=complete chaos/${EXPERIMENT} -n ${NAMESPACE} --timeout=10m
# Validate results
echo "✅ Validating experiment results..."
./scripts/validate-experiment-results.sh \
--experiment ${EXPERIMENT} \
--baseline baseline-ingestion-api-$(date +%Y%m%d-%H%M%S).json \
--results experiment-results-$(date +%Y%m%d-%H%M%S).json
echo "✅ Experiment complete"
Experiment Results Analysis
Success Scenario:
{
"experiment": "ingestion-pod-kill",
"status": "success",
"metrics": {
"request_success_rate": {
"baseline": 99.95,
"during_failure": 99.92,
"recovery": 99.95,
"status": "pass"
},
"p95_latency_ms": {
"baseline": 145,
"during_failure": 210,
"recovery": 145,
"increase": 65,
"status": "pass"
},
"pod_restart_time_seconds": {
"value": 25,
"target": 30,
"status": "pass"
}
},
"findings": {
"what_worked": [
"Pod restarted within 25 seconds",
"Load balancer routed traffic away from failed pod",
"No request failures occurred",
"Latency increase within acceptable range"
],
"issues": [
"Slight latency increase (65ms) during recovery"
]
}
}
Failure Scenario:
{
"experiment": "ingestion-pod-kill",
"status": "failed",
"reason": "pod_restart_time_exceeded",
"metrics": {
"pod_restart_time_seconds": {
"value": 45,
"target": 30,
"status": "fail"
}
},
"findings": {
"root_cause": "Container startup time too slow",
"recommendations": [
"Optimize container startup time",
"Review liveness/readiness probe settings",
"Check resource constraints"
]
}
}
Container Failure Experiment¶
Container failure experiments validate that individual containers within a pod can fail and restart without affecting the entire pod or service availability, ensuring Kubernetes container restart mechanisms work correctly.
Hypothesis
"When 1 container in a multi-container pod crashes, Kubernetes will restart the container within 15 seconds, the pod will remain available, and no service disruption will occur."
Experiment Configuration
Container Kill Configuration:
# chaos-experiments/container-failure-experiment.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
name: container-kill-experiment
namespace: chaos-testing
labels:
category: infrastructure
service: atp-ingestion-api
severity: low
annotations:
chaos.atp.connectsoft.io/hypothesis: |
When 1 container in a pod crashes, Kubernetes will restart the container
within 15 seconds, the pod will remain available, and no service disruption will occur.
spec:
action: container-kill
mode: one
containerNames:
- atp-ingestion-api # Specific container name
selector:
namespaces:
- atp-ingest-ns
labelSelectors:
app: atp-ingest-api
duration: "5m"
Expected Behavior
Container Failure Process:
- Container crash: Container process terminates (SIGKILL)
- Pod status: Pod status changes to "NotReady" (container not ready)
- Kubernetes detection: Kubelet detects container failure
- Container restart: Kubernetes restarts the container
- Health check: Liveness probe validates container health
- Readiness check: Readiness probe validates container ready
- Service recovery: Pod returns to service endpoints
Expected Metrics
| Metric | Baseline | During Failure | Expected Range | Recovery Target |
|---|---|---|---|---|
| Container Restart Time | N/A | <15s | <15s | N/A |
| Pod Availability | 100% | 100% | 100% | 100% |
| Request Success Rate | 99.95% | >99.9% | >99.9% | 99.95% |
| Service Disruption | None | None | None | None |
Validation Criteria
Success Criteria:
- ✅ Container restarted within 15 seconds
- ✅ Pod remained available throughout
- ✅ No service disruption
- ✅ No request failures
- ✅ Liveness/readiness probes working correctly
Container Restart Monitoring Script:
#!/bin/bash
# scripts/monitor-container-restart.sh
NAMESPACE="${1:-atp-ingest-ns}"
CONTAINER_NAME="${2:-atp-ingestion-api}"
echo "📊 Monitoring container restart: ${CONTAINER_NAME} in ${NAMESPACE}"
# Get pod with the container
POD=$(kubectl get pods -n ${NAMESPACE} -l app=atp-ingest-api -o jsonpath='{.items[0].metadata.name}')
# Get initial container restart count
INITIAL_RESTARTS=$(kubectl get pod ${POD} -n ${NAMESPACE} -o jsonpath='{.status.containerStatuses[?(@.name=="'${CONTAINER_NAME}'")].restartCount}')
echo "Initial restart count: ${INITIAL_RESTARTS}"
# Kill container
echo "Killing container: ${CONTAINER_NAME} in pod: ${POD}"
kubectl exec -n ${NAMESPACE} ${POD} -c ${CONTAINER_NAME} -- kill -9 1
KILL_TIME=$(date +%s)
# Wait for container restart
echo "Waiting for container restart..."
MAX_WAIT=60
ELAPSED=0
while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
CURRENT_RESTARTS=$(kubectl get pod ${POD} -n ${NAMESPACE} -o jsonpath='{.status.containerStatuses[?(@.name=="'${CONTAINER_NAME}'")].restartCount}')
if [ "${CURRENT_RESTARTS}" -gt "${INITIAL_RESTARTS}" ]; then
RESTART_TIME=$(date +%s)
RESTART_DURATION=$((RESTART_TIME - KILL_TIME))
echo "✅ Container restarted in ${RESTART_DURATION} seconds"
# Check container status
CONTAINER_READY=$(kubectl get pod ${POD} -n ${NAMESPACE} -o jsonpath='{.status.containerStatuses[?(@.name=="'${CONTAINER_NAME}'")].ready}')
if [ "${CONTAINER_READY}" = "true" ]; then
echo "✅ Container is ready"
exit 0
else
echo "⚠️ Container restarted but not ready yet"
exit 1
fi
fi
sleep 1
ELAPSED=$((ELAPSED + 1))
done
echo "❌ Container restart timeout after ${MAX_WAIT} seconds"
exit 1
Liveness and Readiness Probe Validation
Probe Configuration Example:
# kubernetes/deployments/ingestion-api-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: atp-ingestion-api
namespace: atp-ingest-ns
spec:
replicas: 5
template:
spec:
containers:
- name: atp-ingestion-api
image: atpcr.azurecr.io/atp-ingestion-api:v1.2.3
livenessProbe:
httpGet:
path: /health/live
port: 8080
initialDelaySeconds: 10
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
httpGet:
path: /health/ready
port: 8080
initialDelaySeconds: 5
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 3
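With these settings, the worst-case time for the kubelet to declare the container unhealthy is roughly failureThreshold × periodSeconds, plus up to one timeoutSeconds for the final probe. A sketch of that arithmetic, worth checking against the 30-second restart target (the liveness settings above can take up to about 35 seconds to detect a hang in the worst case):

```shell
#!/bin/bash
# Worst-case unhealthy-detection time for a probe, in seconds.
detection_time() {
  local period="${1}"     # periodSeconds
  local failures="${2}"   # failureThreshold
  local timeout="${3}"    # timeoutSeconds for the final probe attempt
  echo $(( failures * period + timeout ))
}

detection_time 10 3 5   # liveness probe above: prints 35
detection_time 5 3 3    # readiness probe above: prints 18
```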
Probe Validation Script:
#!/bin/bash
# scripts/validate-probes.sh
NAMESPACE="${1:-atp-ingest-ns}"
POD="${2}"
echo "🔍 Validating liveness and readiness probes for pod: ${POD}"
# Check liveness probe
LIVENESS_PROBE=$(kubectl get pod ${POD} -n ${NAMESPACE} -o jsonpath='{.spec.containers[0].livenessProbe}')
if [ -z "${LIVENESS_PROBE}" ]; then
echo "❌ Liveness probe not configured"
exit 1
else
echo "✅ Liveness probe configured"
fi
# Check readiness probe
READINESS_PROBE=$(kubectl get pod ${POD} -n ${NAMESPACE} -o jsonpath='{.spec.containers[0].readinessProbe}')
if [ -z "${READINESS_PROBE}" ]; then
echo "❌ Readiness probe not configured"
exit 1
else
echo "✅ Readiness probe configured"
fi
# Check container status
CONTAINER_READY=$(kubectl get pod ${POD} -n ${NAMESPACE} -o jsonpath='{.status.containerStatuses[0].ready}')
if [ "${CONTAINER_READY}" = "true" ]; then
echo "✅ Container is ready"
else
echo "⚠️ Container is not ready"
exit 1
fi
# Test liveness endpoint
POD_IP=$(kubectl get pod ${POD} -n ${NAMESPACE} -o jsonpath='{.status.podIP}')
LIVENESS_PATH=$(kubectl get pod ${POD} -n ${NAMESPACE} -o jsonpath='{.spec.containers[0].livenessProbe.httpGet.path}')
LIVENESS_PORT=$(kubectl get pod ${POD} -n ${NAMESPACE} -o jsonpath='{.spec.containers[0].livenessProbe.httpGet.port}')
if curl -f -s http://${POD_IP}:${LIVENESS_PORT}${LIVENESS_PATH} > /dev/null; then
echo "✅ Liveness endpoint responding"
else
echo "❌ Liveness endpoint not responding"
exit 1
fi
# Test readiness endpoint
READINESS_PATH=$(kubectl get pod ${POD} -n ${NAMESPACE} -o jsonpath='{.spec.containers[0].readinessProbe.httpGet.path}')
READINESS_PORT=$(kubectl get pod ${POD} -n ${NAMESPACE} -o jsonpath='{.spec.containers[0].readinessProbe.httpGet.port}')
if curl -f -s http://${POD_IP}:${READINESS_PORT}${READINESS_PATH} > /dev/null; then
echo "✅ Readiness endpoint responding"
else
echo "❌ Readiness endpoint not responding"
exit 1
fi
echo "✅ All probes validated successfully"
Resource Exhaustion Experiment¶
Resource exhaustion experiments validate that ATP services handle resource constraints gracefully and that resource limits prevent noisy neighbor effects, while horizontal pod autoscaling (HPA) responds appropriately to resource pressure.
Hypothesis
"When CPU, memory, or disk resources are exhausted, resource limits prevent noisy neighbor effects, HPA scales up to handle increased load, and services maintain availability with graceful degradation."
CPU Stress Experiment
StressChaos Configuration:
# chaos-experiments/cpu-stress-experiment.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
name: cpu-stress-ingestion-api
namespace: chaos-testing
labels:
category: infrastructure
service: atp-ingestion-api
severity: medium
annotations:
chaos.atp.connectsoft.io/hypothesis: |
When CPU stress is applied, resource limits prevent noisy neighbor effects,
HPA scales up to handle load, and services maintain availability.
spec:
mode: one
selector:
namespaces:
- atp-ingest-ns
labelSelectors:
app: atp-ingest-api
stressors:
cpu:
workers: 4 # Number of CPU-intensive processes
load: 100 # CPU load percentage per worker
duration: "10m"
Expected Behavior:
- CPU utilization: CPU usage approaches 100% (within limits)
- Resource limits: CPU limits prevent exceeding allocated resources
- Noisy neighbor: Other pods not affected by CPU stress
- HPA response: HPA detects CPU pressure and scales up pods
- Service availability: Service remains available with potential latency increase
Memory Stress Experiment
Memory Stress Configuration:
# chaos-experiments/memory-stress-experiment.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
name: memory-stress-ingestion-api
namespace: chaos-testing
spec:
mode: one
selector:
namespaces:
- atp-ingest-ns
labelSelectors:
app: atp-ingest-api
stressors:
memory:
workers: 1
size: "512Mi" # Allocate 512MB memory
duration: "10m"
Expected Behavior:
- Memory pressure: Memory usage increases
- OOM protection: Memory limits prevent OOM kills of other containers
- Graceful handling: Application handles memory pressure gracefully
- Potential restart: The stressed container is OOM-killed and restarted if it exceeds its memory limit (pod eviction applies only under node-level memory pressure)
Disk Stress Experiment
IO Stress Configuration:
# chaos-experiments/disk-stress-experiment.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
name: disk-stress-ingestion-api
namespace: chaos-testing
spec:
action: latency
mode: one
selector:
namespaces:
- atp-ingest-ns
labelSelectors:
app: atp-ingest-api
volumePath: /var/log # Path to stress
path: /var/log/chaos
delay: "100ms"
percent: 100
duration: "10m"
Expected Behavior:
- Disk I/O latency: Disk operations experience increased latency
- Application handling: Application handles disk I/O delays gracefully
- No data loss: No data corruption or loss
- Performance impact: Potential performance degradation
Resource Limits Validation
Resource Limits Configuration:
# kubernetes/deployments/ingestion-api-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: atp-ingestion-api
spec:
template:
spec:
containers:
- name: atp-ingestion-api
resources:
requests:
cpu: "500m"
memory: "512Mi"
limits:
cpu: "1000m"
memory: "1Gi"
Resource Limits Validation Script:
#!/bin/bash
# scripts/validate-resource-limits.sh
NAMESPACE="${1:-atp-ingest-ns}"
DEPLOYMENT="${2:-atp-ingestion-api}"
echo "🔍 Validating resource limits for deployment: ${DEPLOYMENT}"
# Check resource requests and limits
kubectl get deployment ${DEPLOYMENT} -n ${NAMESPACE} -o json | \
jq -r '.spec.template.spec.containers[0].resources' > /tmp/resources.json
CPU_REQUEST=$(jq -r '.requests.cpu' /tmp/resources.json)
CPU_LIMIT=$(jq -r '.limits.cpu' /tmp/resources.json)
MEMORY_REQUEST=$(jq -r '.requests.memory' /tmp/resources.json)
MEMORY_LIMIT=$(jq -r '.limits.memory' /tmp/resources.json)
echo "CPU Request: ${CPU_REQUEST}"
echo "CPU Limit: ${CPU_LIMIT}"
echo "Memory Request: ${MEMORY_REQUEST}"
echo "Memory Limit: ${MEMORY_LIMIT}"
# Validate limits are set
if [ "${CPU_LIMIT}" = "null" ] || [ -z "${CPU_LIMIT}" ]; then
echo "❌ CPU limit not set"
exit 1
else
echo "✅ CPU limit set: ${CPU_LIMIT}"
fi
if [ "${MEMORY_LIMIT}" = "null" ] || [ -z "${MEMORY_LIMIT}" ]; then
echo "❌ Memory limit not set"
exit 1
else
echo "✅ Memory limit set: ${MEMORY_LIMIT}"
fi
# Validate limits >= requests
if [ "${CPU_REQUEST}" != "null" ] && [ "${CPU_LIMIT}" != "null" ]; then
CPU_REQUEST_M=$(echo ${CPU_REQUEST} | sed 's/m$//')
CPU_LIMIT_M=$(echo ${CPU_LIMIT} | sed 's/m$//')
if [ "${CPU_LIMIT_M}" -lt "${CPU_REQUEST_M}" ]; then
echo "❌ CPU limit (${CPU_LIMIT}) < CPU request (${CPU_REQUEST})"
exit 1
else
echo "✅ CPU limit >= CPU request"
fi
fi
echo "✅ Resource limits validated successfully"
Horizontal Pod Autoscaler (HPA) Configuration
HPA Configuration:
# kubernetes/hpa/ingestion-api-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: atp-ingestion-api-hpa
namespace: atp-ingest-ns
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: atp-ingestion-api
minReplicas: 5
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
behavior:
scaleUp:
stabilizationWindowSeconds: 60
policies:
- type: Percent
value: 50
periodSeconds: 60
- type: Pods
value: 2
periodSeconds: 60
selectPolicy: Max
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 25
periodSeconds: 60
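The HPA config above implies a concrete replica calculation: Kubernetes computes desired = ceil(currentReplicas × currentUtilization / targetUtilization), and the behavior block caps each scale-up step at the larger of +50% or +2 pods (selectPolicy: Max), bounded by maxReplicas. A sketch of that arithmetic in integer bash (function names are illustrative):

```shell
#!/bin/bash
# Core HPA formula: desired = ceil(current * utilization / target)
hpa_desired() {  # args: current_replicas current_utilization target_utilization
  echo $(( ($1 * $2 + $3 - 1) / $3 ))
}

# Per-period cap from the behavior block: max(+50% of current, +2 pods)
hpa_scale_up_cap() {  # args: current_replicas
  local by_percent=$(( $1 + ($1 * 50 + 99) / 100 ))  # Percent policy, rounded up
  local by_pods=$(( $1 + 2 ))                        # Pods policy
  if [ ${by_percent} -gt ${by_pods} ]; then echo ${by_percent}; else echo ${by_pods}; fi
}

# Next replica count, bounded by the per-period cap and maxReplicas (20)
hpa_next() {  # args: current_replicas current_utilization target_utilization
  local desired=$(hpa_desired "$@")
  local cap=$(hpa_scale_up_cap "$1")
  local next=$(( desired < cap ? desired : cap ))
  echo $(( next < 20 ? next : 20 ))
}
```

For example, 5 replicas at 95% CPU against the 70% target gives `hpa_desired 5 95 70` = 7, well inside the per-period cap of 8, so the HPA can reach it in one step.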
HPA Validation Script:
#!/bin/bash
# scripts/validate-hpa-response.sh
NAMESPACE="${1:-atp-ingest-ns}"
DEPLOYMENT="${2:-atp-ingestion-api}"
echo "📊 Validating HPA response for deployment: ${DEPLOYMENT}"
# Get initial replica count
INITIAL_REPLICAS=$(kubectl get deployment ${DEPLOYMENT} -n ${NAMESPACE} -o jsonpath='{.spec.replicas}')
echo "Initial replicas: ${INITIAL_REPLICAS}"
# Apply CPU stress
echo "🔧 Applying CPU stress..."
kubectl apply -f chaos-experiments/cpu-stress-ingestion-api.yaml -n chaos-testing
# Wait for HPA to scale up
echo "⏳ Waiting for HPA to scale up..."
MAX_WAIT=300
ELAPSED=0
SCALED_UP=false
while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
CURRENT_REPLICAS=$(kubectl get deployment ${DEPLOYMENT} -n ${NAMESPACE} -o jsonpath='{.spec.replicas}')
if [ "${CURRENT_REPLICAS}" -gt "${INITIAL_REPLICAS}" ]; then
SCALED_UP=true
echo "✅ HPA scaled up: ${INITIAL_REPLICAS} → ${CURRENT_REPLICAS}"
break
fi
sleep 10
ELAPSED=$((ELAPSED + 10))
echo "Waiting... (${ELAPSED}s/${MAX_WAIT}s)"
done
if [ "${SCALED_UP}" = false ]; then
echo "❌ HPA did not scale up within ${MAX_WAIT} seconds"
kubectl delete stresschaos cpu-stress-ingestion-api -n chaos-testing
exit 1
fi
# Remove stress
echo "🔧 Removing CPU stress..."
kubectl delete stresschaos cpu-stress-ingestion-api -n chaos-testing
# Wait for HPA to scale down
echo "⏳ Waiting for HPA to scale down..."
ELAPSED=0
while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
CURRENT_REPLICAS=$(kubectl get deployment ${DEPLOYMENT} -n ${NAMESPACE} -o jsonpath='{.spec.replicas}')
if [ "${CURRENT_REPLICAS}" -eq "${INITIAL_REPLICAS}" ]; then
echo "✅ HPA scaled down to ${INITIAL_REPLICAS} replicas"
exit 0
fi
sleep 10
ELAPSED=$((ELAPSED + 10))
echo "Waiting... (${ELAPSED}s/${MAX_WAIT}s)"
done
echo "⚠️ HPA did not scale down within ${MAX_WAIT} seconds"
exit 1
Resource Exhaustion Visualization
graph TD
STRESS[Resource Stress Applied] --> CPU[CPU Stress]
STRESS --> MEM[Memory Stress]
STRESS --> DISK[Disk Stress]
CPU --> LIMIT1{CPU Limit<br/>Exceeded?}
LIMIT1 -->|Yes| THROTTLE[CPU Throttling]
LIMIT1 -->|No| HPA1[HPA Scales Up]
MEM --> LIMIT2{Memory Limit<br/>Exceeded?}
LIMIT2 -->|No| DEGRADE[Graceful Degradation]
LIMIT2 -->|Yes| OOM[OOM Kill]
DISK --> LATENCY[I/O Latency]
LATENCY --> TIMEOUT[Timeout Handling]
HPA1 --> RECOVER[Recovery]
THROTTLE --> RECOVER
DEGRADE --> RECOVER
TIMEOUT --> RECOVER
style STRESS fill:#FFE5B4
style HPA1 fill:#90EE90
style OOM fill:#FFB6C1
style RECOVER fill:#E8F5E9
Experiment Results Analysis
CPU Stress Results:
{
"experiment": "cpu-stress-ingestion-api",
"status": "success",
"metrics": {
"cpu_utilization": {
"baseline": 65,
"during_stress": 95,
"limit": 100,
"status": "pass"
},
"hpa_scaling": {
"initial_replicas": 5,
"scaled_to": 8,
"scale_up_time": 90,
"status": "pass"
},
"request_success_rate": {
"baseline": 99.95,
"during_stress": 99.90,
"status": "pass"
}
}
}
Summary: Pod and Container Chaos¶
- Pod Failure Experiment: Validates Kubernetes pod restart mechanisms, load balancer failover, and service availability during pod crashes; expects pod restart within 30 seconds, no request failures, latency increase <100ms, and success rate >99.9%
- Container Failure Experiment: Validates container restart within pods, liveness/readiness probe functionality, and service availability during container failures; expects container restart within 15 seconds and no service disruption
- Resource Exhaustion Experiments: Validates CPU, memory, and disk stress handling, resource limits preventing noisy neighbor effects, and HPA scaling response; expects graceful degradation, HPA scaling, and maintained service availability
- Monitoring and Validation: Comprehensive scripts for measuring pod restart times, container restart times, resource limits validation, HPA response validation, and probe functionality validation
- Experiment Execution: Automated scripts for executing pod failure, container failure, and resource exhaustion experiments with baseline collection, real-time monitoring, and result validation
Node and Cluster Chaos¶
Purpose: Define comprehensive chaos experiments for node and cluster failures in ATP, validating multi-node resilience, availability zone failover, and disaster recovery procedures to ensure ATP services remain available and recoverable during infrastructure-level failures.
Node Failure Experiment¶
Node failure experiments validate that ATP services remain available and functional when individual AKS nodes fail, ensuring Kubernetes pod rescheduling, StatefulSet quorum maintenance, and data persistence mechanisms work correctly under node failure conditions.
Hypothesis
"When 1 AKS node fails (in a 3-node cluster), services will remain available, pods will be rescheduled to other nodes within 5 minutes, StatefulSets will maintain quorum, no data will be lost, and all services will recover within 5 minutes."
Experiment Configuration
Node Drain and Delete Procedure:
# chaos-experiments/node-failure-experiment.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NodeChaos
metadata:
name: node-failure-atp-cluster
namespace: chaos-testing
labels:
category: infrastructure
severity: high
frequency: monthly
annotations:
chaos.atp.connectsoft.io/hypothesis: |
When 1 AKS node fails, services will remain available, pods will be rescheduled
to other nodes within 5 minutes, StatefulSets will maintain quorum,
no data will be lost, and all services will recover within 5 minutes.
chaos.atp.connectsoft.io/blast-radius: "33%" # 1 node out of 3 = 33%
spec:
action: node-restart # Restart node
mode: one
selector:
nodes:
- aks-nodepool1-12345678-vmss000000 # Specific node name
duration: "5m"
# ATP-specific rollback triggers
abortRules:
- name: pod-reschedule-time
condition: pod_reschedule_time > 300 # 5 minutes
action: abort
- name: data-loss-detected
condition: data_loss_count > 0
action: abort
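The 33% blast-radius annotation above is simple arithmetic (failed nodes over total nodes), but computing it explicitly keeps the annotation honest as node pools scale. A minimal sketch (the helper name is illustrative):

```shell
#!/bin/bash
# Blast radius as a whole-number percentage of cluster nodes affected
blast_radius_pct() {  # args: failed_nodes total_nodes
  echo $(( $1 * 100 / $2 ))
}
```

For the 3-node staging cluster, `blast_radius_pct 1 3` yields 33; on a 6-node pool the same single-node experiment drops to 16%.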
Manual Node Failure Script:
#!/bin/bash
# scripts/execute-node-failure-experiment.sh
CLUSTER_NAME="${1:-atp-staging-aks}"
RESOURCE_GROUP="${2:-atp-staging-rg}"
NODE_NAME="${3}" # Optional: specific node to target
echo "🧪 Starting node failure experiment"
# Get cluster nodes
echo "📊 Getting cluster nodes..."
NODES=$(az aks nodepool list \
--cluster-name ${CLUSTER_NAME} \
--resource-group ${RESOURCE_GROUP} \
--query "[].{name:name,nodeCount:count}" \
-o json)
echo "Nodes: ${NODES}"
# Select node to fail (if not specified)
if [ -z "${NODE_NAME}" ]; then
NODE_NAME=$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')
fi
echo "Target node: ${NODE_NAME}"
# Get initial pod count per node
echo "📊 Getting initial pod distribution..."
kubectl get pods -A --no-headers --field-selector spec.nodeName=${NODE_NAME} | \
wc -l | xargs echo "Pods on target node:"
# Get StatefulSet pods on target node
STATEFULSET_PODS=$(kubectl get pods -A -o json \
--field-selector spec.nodeName=${NODE_NAME} | \
jq -r '.items[] | select(.metadata.ownerReferences[]?.kind == "StatefulSet") | .metadata.name')
echo "StatefulSet pods on target node: ${STATEFULSET_PODS}"
# Drain node (gracefully evict pods)
echo "🔧 Draining node: ${NODE_NAME}"
kubectl drain ${NODE_NAME} \
--ignore-daemonsets \
--delete-emptydir-data \
--force \
--timeout=300s
DRAIN_START=$(date +%s)
# Wait for pods to be rescheduled
echo "⏳ Waiting for pods to be rescheduled..."
MAX_WAIT=300
ELAPSED=0
ALL_RESCHEDULED=false
while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
PODS_ON_NODE=$(kubectl get pods -A -o json \
--field-selector spec.nodeName=${NODE_NAME} | \
jq -r '.items | length')
if [ "${PODS_ON_NODE}" -eq 0 ]; then
ALL_RESCHEDULED=true
RESCHEDULE_TIME=$(date +%s)
RESCHEDULE_DURATION=$((RESCHEDULE_TIME - DRAIN_START))
echo "✅ All pods rescheduled in ${RESCHEDULE_DURATION} seconds"
break
fi
sleep 10
ELAPSED=$((ELAPSED + 10))
echo "Waiting... (${ELAPSED}s/${MAX_WAIT}s) - ${PODS_ON_NODE} pods remaining"
done
if [ "${ALL_RESCHEDULED}" = false ]; then
echo "⚠️ Not all pods rescheduled within ${MAX_WAIT} seconds"
# Uncordon node to restore it
kubectl uncordon ${NODE_NAME}
exit 1
fi
# Validate StatefulSet quorum
echo "🔍 Validating StatefulSet quorum..."
STATEFULSETS=$(kubectl get statefulsets -A -o json | jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name)"')
for SS in ${STATEFULSETS}; do
NS=$(echo ${SS} | cut -d'/' -f1)
NAME=$(echo ${SS} | cut -d'/' -f2)
READY=$(kubectl get statefulset ${NAME} -n ${NS} -o jsonpath='{.status.readyReplicas}')
DESIRED=$(kubectl get statefulset ${NAME} -n ${NS} -o jsonpath='{.spec.replicas}')
if [ "${READY:-0}" -lt "${DESIRED}" ]; then
echo "⚠️ StatefulSet ${NS}/${NAME}: ${READY:-0}/${DESIRED} replicas ready"
else
echo "✅ StatefulSet ${NS}/${NAME}: ${READY}/${DESIRED} replicas ready"
fi
done
# Simulate node failure (restart node)
# AKS nodes live in a VM scale set inside the managed node resource group,
# so restart via `az vmss restart` rather than `az vm restart`; the trailing
# 6 characters of the node name are the base-36 VMSS instance ID
echo "🔧 Simulating node failure (restarting node)..."
NODE_RG=$(az aks show --name ${CLUSTER_NAME} --resource-group ${RESOURCE_GROUP} --query "nodeResourceGroup" -o tsv)
VMSS_NAME="${NODE_NAME%??????}"
INSTANCE_ID=$(( 36#${NODE_NAME: -6} ))
az vmss restart \
--resource-group ${NODE_RG} \
--name ${VMSS_NAME} \
--instance-ids ${INSTANCE_ID} \
--no-wait
FAILURE_START=$(date +%s)
# Wait for node to be unavailable
echo "⏳ Waiting for node to be unavailable..."
sleep 30
# Wait for node to recover
echo "⏳ Waiting for node to recover..."
kubectl wait --for=condition=Ready node/${NODE_NAME} --timeout=600s
RECOVERY_TIME=$(date +%s)
RECOVERY_DURATION=$((RECOVERY_TIME - FAILURE_START))
echo "✅ Node recovered in ${RECOVERY_DURATION} seconds"
# Uncordon node
echo "🔧 Uncordoning node..."
kubectl uncordon ${NODE_NAME}
# Validate all pods running
echo "✅ Validating all pods running..."
sleep 60
ALL_PODS_RUNNING=true
NAMESPACES=$(kubectl get namespaces -o jsonpath='{.items[*].metadata.name}')
for NS in ${NAMESPACES}; do
NOT_RUNNING=$(kubectl get pods -n ${NS} -o json | \
jq -r '.items[] | select(.status.phase != "Running" and .status.phase != "Succeeded") | .metadata.name')
if [ -n "${NOT_RUNNING}" ]; then
echo "⚠️ Pods not running in ${NS}: ${NOT_RUNNING}"
ALL_PODS_RUNNING=false
fi
done
if [ "${ALL_PODS_RUNNING}" = true ]; then
echo "✅ All pods running successfully"
exit 0
else
echo "❌ Some pods not running"
exit 1
fi
Expected Behavior
Node Drain Phase (0-5 minutes):
- Pod eviction: Pods gracefully terminated with SIGTERM
- Pod rescheduling: Pods rescheduled to other nodes
- StatefulSet handling: StatefulSet pods maintain quorum (if replicas > 1)
- PV persistence: Persistent volumes remain available
- Service continuity: Services continue operating on remaining nodes
Node Failure Phase (5-10 minutes):
- Node unavailable: Node becomes unreachable
- Kubernetes detection: Kubernetes detects node not ready
- Endpoint removal: Node removed from service endpoints
- Health monitoring: Cluster health monitored
Recovery Phase (10-15 minutes):
- Node restart: Node VM restarts
- Kubelet restart: Kubelet service restarts
- Node registration: Node registers with API server
- Pod rescheduling: Pods can be rescheduled back to node (if desired)
- Service recovery: All services fully operational
Expected Metrics
| Metric | Baseline | During Failure | Expected Range | Recovery Target |
|---|---|---|---|---|
| Service Availability | 100% | >99% | >99% | 100% |
| Pod Reschedule Time | N/A | <5min | <5min | N/A |
| Request Success Rate | 99.95% | >99.5% | >99.5% | 99.95% |
| StatefulSet Quorum | 100% | 100% | 100% | 100% |
| Data Loss | None | None | None | None |
| Recovery Time | N/A | <5min | <5min | N/A |
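The >99% availability floor in the table maps to a concrete downtime budget: over a 15-minute experiment window (900 seconds), 1% is at most 9 seconds of unavailability. Working in basis points (hundredths of a percent) keeps this in integer arithmetic that plain bash can evaluate; the helper names below are illustrative:

```shell
#!/bin/bash
# Allowed downtime in seconds for an availability floor given in basis
# points (99% -> 9900 bp) over a window of N seconds
allowed_downtime_s() {  # args: window_seconds availability_bp
  echo $(( $1 * (10000 - $2) / 10000 ))
}

# Exit 0 when observed downtime fits the budget, non-zero otherwise
downtime_within_budget() {  # args: observed_downtime_s window_seconds availability_bp
  [ "$1" -le "$(allowed_downtime_s $2 $3)" ]
}
```

So 8 seconds of downtime in a 15-minute window passes the 99% floor, while 12 seconds fails it.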
Validation Criteria
Success Criteria:
- ✅ All pods rescheduled within 5 minutes
- ✅ StatefulSets maintain quorum (no quorum loss)
- ✅ No data loss (all persistent volumes intact)
- ✅ Service availability >99% during failure
- ✅ Request success rate >99.5% during failure
- ✅ All services recovered within 5 minutes
- ✅ Node recovered and operational
Failure Criteria (Auto-abort triggers):
- ❌ Pod reschedule time exceeds 5 minutes
- ❌ StatefulSet quorum lost
- ❌ Data loss detected
- ❌ Service availability drops below 99%
- ❌ Request success rate drops below 99.5%
- ❌ Recovery time exceeds 5 minutes
StatefulSet Quorum Validation
Quorum Validation Script:
#!/bin/bash
# scripts/validate-statefulset-quorum.sh
NAMESPACE="${1}"
STATEFULSET="${2}"
echo "🔍 Validating StatefulSet quorum: ${NAMESPACE}/${STATEFULSET}"
# Get StatefulSet details
READY=$(kubectl get statefulset ${STATEFULSET} -n ${NAMESPACE} -o jsonpath='{.status.readyReplicas}')
DESIRED=$(kubectl get statefulset ${STATEFULSET} -n ${NAMESPACE} -o jsonpath='{.spec.replicas}')
QUORUM_REQUIRED=$((DESIRED / 2 + 1))
echo "Ready replicas: ${READY}/${DESIRED}"
echo "Quorum required: ${QUORUM_REQUIRED}"
if [ "${READY}" -lt "${QUORUM_REQUIRED}" ]; then
echo "❌ Quorum lost: ${READY} < ${QUORUM_REQUIRED}"
exit 1
else
echo "✅ Quorum maintained: ${READY} >= ${QUORUM_REQUIRED}"
# Verify all pods are on different nodes (anti-affinity)
PODS=$(kubectl get pods -n ${NAMESPACE} -l app=${STATEFULSET} -o jsonpath='{.items[*].metadata.name}')
NODES=()
for POD in ${PODS}; do
NODE=$(kubectl get pod ${POD} -n ${NAMESPACE} -o jsonpath='{.spec.nodeName}')
NODES+=(${NODE})
done
UNIQUE_NODES=$(printf '%s\n' "${NODES[@]}" | sort -u | wc -l)
TOTAL_NODES=${#NODES[@]}
if [ "${UNIQUE_NODES}" -lt "${TOTAL_NODES}" ]; then
echo "⚠️ Some pods on same node (anti-affinity not enforced)"
else
echo "✅ All pods on different nodes (anti-affinity enforced)"
fi
exit 0
fi
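The majority-quorum formula in the script, floor(n/2)+1, has a consequence worth spelling out: a 3-replica StatefulSet tolerates one failure, a 5-replica set two, and an even replica count tolerates no more failures than the odd count below it. A minimal sketch of that arithmetic:

```shell
#!/bin/bash
# Majority quorum and fault tolerance for an n-replica StatefulSet
quorum_required() {    # args: desired_replicas
  echo $(( $1 / 2 + 1 ))
}
failures_tolerated() { # args: desired_replicas
  echo $(( $1 - ($1 / 2 + 1) ))
}
```

This is why ATP StatefulSets use odd replica counts: `failures_tolerated 4` is 1, the same as for 3 replicas, at 33% more cost.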
Pod Anti-Affinity Configuration
Anti-Affinity Rules:
# kubernetes/deployments/ingestion-api-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: atp-ingestion-api
namespace: atp-ingest-ns
spec:
replicas: 5
template:
spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - atp-ingestion-api
              topologyKey: kubernetes.io/hostname
      # A required zone anti-affinity would cap schedulable replicas at the AZ
      # count (3), stranding 2 of the 5 replicas; spread across zones instead
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: atp-ingestion-api
Monitoring and Observation
Node Failure Monitoring Dashboard:
// Log Analytics: Node failure monitoring
let ExperimentStart = datetime("2024-01-20T10:00:00Z");
let ExperimentEnd = datetime("2024-01-20T10:15:00Z");
// Node status during experiment
KubeNodeInventory
| where TimeGenerated between (ExperimentStart .. ExperimentEnd)
| summarize
NodeCount = count(),
ReadyCount = countif(Status == "Ready"),
NotReadyCount = countif(Status == "NotReady"),
UnknownCount = countif(Status == "Unknown")
by bin(TimeGenerated, 1m)
| render timechart
// Pod distribution across nodes
KubePodInventory
| where TimeGenerated between (ExperimentStart .. ExperimentEnd)
| where Namespace == "atp-ingest-ns"
| summarize
PodCount = count(),
Nodes = dcount(Computer)
by bin(TimeGenerated, 1m)
| render timechart
// Pod rescheduling events (requires Container Insights to collect Normal events)
KubeEvents
| where TimeGenerated between (ExperimentStart .. ExperimentEnd)
| where Reason == "Scheduled"
| summarize
SchedulingEvents = count()
by bin(TimeGenerated, 1m)
| render timechart
Availability Zone Failure Experiment¶
Availability Zone (AZ) failure experiments validate that ATP services remain available and functional when an entire availability zone fails, ensuring multi-AZ deployment, anti-affinity rules, and cross-AZ failover mechanisms work correctly.
Hypothesis
"When an entire availability zone fails, services will remain available through multi-AZ deployment, pods will be rescheduled to other AZs within 10 minutes, anti-affinity rules will ensure pod distribution, and all services will recover within 10 minutes."
Experiment Configuration
AZ Failure Simulation:
#!/bin/bash
# scripts/execute-az-failure-experiment.sh
CLUSTER_NAME="${1:-atp-staging-aks}"
RESOURCE_GROUP="${2:-atp-staging-rg}"
TARGET_AZ="${3}" # e.g., "eastus-1"
echo "🧪 Starting availability zone failure experiment"
echo "Target AZ: ${TARGET_AZ}"
# Get nodes in target AZ
echo "📊 Getting nodes in target AZ..."
NODES_IN_AZ=$(kubectl get nodes -o json | \
jq -r ".items[] | select(.metadata.labels.\"topology.kubernetes.io/zone\" == \"${TARGET_AZ}\") | .metadata.name")
if [ -z "${NODES_IN_AZ}" ]; then
echo "❌ No nodes found in AZ: ${TARGET_AZ}"
exit 1
fi
echo "Nodes in target AZ: ${NODES_IN_AZ}"
# Get initial pod distribution
echo "📊 Getting initial pod distribution across AZs..."
kubectl get pods -A -o json | \
jq -r '.items[] | select(.spec.nodeName != null) |
{pod: .metadata.name, namespace: .metadata.namespace, node: .spec.nodeName}' | \
jq -s 'group_by(.node) | map({node: .[0].node, count: length})'
# Validate multi-AZ deployment
echo "🔍 Validating multi-AZ deployment..."
TOTAL_NODES=$(kubectl get nodes -o json | jq -r '.items | length')
AZS=$(kubectl get nodes -o json | \
jq -r '.items[].metadata.labels."topology.kubernetes.io/zone"' | sort -u)
echo "Total nodes: ${TOTAL_NODES}"
echo "Availability zones: ${AZS}"
if [ $(echo "${AZS}" | wc -l) -lt 2 ]; then
echo "❌ Multi-AZ deployment not configured (only 1 AZ)"
exit 1
fi
echo "✅ Multi-AZ deployment validated"
# Drain all nodes in target AZ
echo "🔧 Draining all nodes in target AZ: ${TARGET_AZ}"
for NODE in ${NODES_IN_AZ}; do
echo "Draining node: ${NODE}"
kubectl drain ${NODE} \
--ignore-daemonsets \
--delete-emptydir-data \
--force \
--timeout=300s &
done
wait
AZ_FAILURE_START=$(date +%s)
# Wait for pods to be rescheduled to other AZs
echo "⏳ Waiting for pods to be rescheduled to other AZs..."
MAX_WAIT=600 # 10 minutes
ELAPSED=0
ALL_RESCHEDULED=false
while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
# Count pods still scheduled on the target AZ's nodes
PODS_IN_TARGET_AZ=0
for NODE in ${NODES_IN_AZ}; do
COUNT=$(kubectl get pods -A --no-headers --field-selector spec.nodeName=${NODE} 2>/dev/null | wc -l)
PODS_IN_TARGET_AZ=$((PODS_IN_TARGET_AZ + COUNT))
done
if [ "${PODS_IN_TARGET_AZ}" -eq 0 ]; then
ALL_RESCHEDULED=true
RESCHEDULE_TIME=$(date +%s)
RESCHEDULE_DURATION=$((RESCHEDULE_TIME - AZ_FAILURE_START))
echo "✅ All pods rescheduled from AZ in ${RESCHEDULE_DURATION} seconds"
break
fi
sleep 30
ELAPSED=$((ELAPSED + 30))
echo "Waiting... (${ELAPSED}s/${MAX_WAIT}s) - ${PODS_IN_TARGET_AZ} pods remaining in AZ"
done
if [ "${ALL_RESCHEDULED}" = false ]; then
echo "⚠️ Not all pods rescheduled within ${MAX_WAIT} seconds"
exit 1
fi
# Validate anti-affinity rules
echo "🔍 Validating anti-affinity rules..."
DEPLOYMENTS=$(kubectl get deployments -A -o json | \
jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name)"')
for DEPLOYMENT in ${DEPLOYMENTS}; do
NS=$(echo ${DEPLOYMENT} | cut -d'/' -f1)
NAME=$(echo ${DEPLOYMENT} | cut -d'/' -f2)
PODS=$(kubectl get pods -n ${NS} -l app=${NAME} -o jsonpath='{.items[*].metadata.name}')
AZS_WITH_PODS=$(for POD in ${PODS}; do
NODE=$(kubectl get pod ${POD} -n ${NS} -o jsonpath='{.spec.nodeName}')
kubectl get node ${NODE} -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}'
echo ""
done | sort -u | wc -l)
echo "Deployment ${NS}/${NAME}: Pods distributed across ${AZS_WITH_PODS} AZs"
done
# Validate service availability
echo "🔍 Validating service availability..."
SERVICES=$(kubectl get services -A -o json | \
jq -r '.items[] | select(.spec.type == "LoadBalancer" or .spec.type == "ClusterIP") |
"\(.metadata.namespace)/\(.metadata.name)"')
for SERVICE in ${SERVICES}; do
NS=$(echo ${SERVICE} | cut -d'/' -f1)
NAME=$(echo ${SERVICE} | cut -d'/' -f2)
ENDPOINTS=$(kubectl get endpoints ${NAME} -n ${NS} -o jsonpath='{.subsets[0].addresses[*].ip}' | wc -w)
if [ "${ENDPOINTS}" -gt 0 ]; then
echo "✅ Service ${NS}/${NAME}: ${ENDPOINTS} endpoints"
else
echo "❌ Service ${NS}/${NAME}: No endpoints"
fi
done
# Simulate AZ recovery (uncordon nodes)
echo "🔧 Simulating AZ recovery (uncordoning nodes)..."
for NODE in ${NODES_IN_AZ}; do
echo "Uncordoning node: ${NODE}"
kubectl uncordon ${NODE}
done
RECOVERY_TIME=$(date +%s)
RECOVERY_DURATION=$((RECOVERY_TIME - AZ_FAILURE_START))
echo "✅ AZ recovery complete in ${RECOVERY_DURATION} seconds"
# Validate final state
echo "✅ Validating final state..."
sleep 60
ALL_PODS_RUNNING=true
NAMESPACES=$(kubectl get namespaces -o jsonpath='{.items[*].metadata.name}')
for NS in ${NAMESPACES}; do
NOT_RUNNING=$(kubectl get pods -n ${NS} -o json | \
jq -r '.items[] | select(.status.phase != "Running" and .status.phase != "Succeeded") | .metadata.name')
if [ -n "${NOT_RUNNING}" ]; then
echo "⚠️ Pods not running in ${NS}: ${NOT_RUNNING}"
ALL_PODS_RUNNING=false
fi
done
if [ "${ALL_PODS_RUNNING}" = true ]; then
echo "✅ All pods running successfully"
exit 0
else
echo "❌ Some pods not running"
exit 1
fi
Multi-AZ Deployment Validation
Multi-AZ Validation Script:
#!/bin/bash
# scripts/validate-multi-az-deployment.sh
echo "🔍 Validating multi-AZ deployment configuration"
# Check node distribution across AZs
echo "📊 Node distribution across AZs:"
kubectl get nodes -o json | \
jq -r '.items[] |
{name: .metadata.name, az: .metadata.labels."topology.kubernetes.io/zone"}' | \
jq -s 'group_by(.az) | map({az: .[0].az, nodes: map(.name), count: length})'
# Check pod distribution across AZs
echo "📊 Pod distribution across AZs:"
kubectl get pods -A -o json | \
jq -r '.items[] | select(.spec.nodeName != null) | .spec.nodeName' | \
xargs -I {} kubectl get node {} -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}{"\n"}' | \
sort | uniq -c
# Check anti-affinity rules
echo "🔍 Checking anti-affinity rules..."
DEPLOYMENTS=$(kubectl get deployments -A -o json | \
jq -r '.items[] | select(.spec.template.spec.affinity != null) |
"\(.metadata.namespace)/\(.metadata.name)"')
if [ -z "${DEPLOYMENTS}" ]; then
echo "⚠️ No deployments with affinity rules found"
else
echo "✅ Deployments with affinity rules:"
echo "${DEPLOYMENTS}"
fi
# Check StatefulSet distribution
echo "🔍 Checking StatefulSet pod distribution..."
STATEFULSETS=$(kubectl get statefulsets -A -o json | \
jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name)"')
for SS in ${STATEFULSETS}; do
NS=$(echo ${SS} | cut -d'/' -f1)
NAME=$(echo ${SS} | cut -d'/' -f2)
PODS=$(kubectl get pods -n ${NS} -l app=${NAME} -o jsonpath='{.items[*].metadata.name}')
AZS=$(for POD in ${PODS}; do
NODE=$(kubectl get pod ${POD} -n ${NS} -o jsonpath='{.spec.nodeName}')
kubectl get node ${NODE} -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}'
echo ""
done | sort -u)
AZ_COUNT=$(echo "${AZS}" | grep -v '^$' | wc -l)
echo "StatefulSet ${NS}/${NAME}: Pods across ${AZ_COUNT} AZs"
if [ "${AZ_COUNT}" -lt 2 ]; then
echo "⚠️ StatefulSet pods not distributed across multiple AZs"
fi
done
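Zone distribution can be quantified the same way topologySpreadConstraints does it: skew = (pods in the fullest AZ) minus (pods in the emptiest AZ), where maxSkew: 1 means near-even spread. A hedged sketch that evaluates skew from per-AZ pod counts (the helper name is illustrative; counts would come from the queries above):

```shell
#!/bin/bash
# Zone skew (max - min) across per-AZ pod counts passed as arguments,
# mirroring the maxSkew calculation used by topologySpreadConstraints
zone_skew() {
  local max=$1 min=$1 c
  for c in "$@"; do
    [ "$c" -gt "$max" ] && max=$c
    [ "$c" -lt "$min" ] && min=$c
  done
  echo $(( max - min ))
}
```

A deployment spread 2/2/1 across three AZs has skew 1 (healthy); 5/0/0 has skew 5 and would fail the multi-AZ validation.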
Expected Behavior
AZ Failure Phase (0-10 minutes):
- Node unavailability: All nodes in target AZ become unavailable
- Pod eviction: Pods on failed nodes evicted
- Pod rescheduling: Pods rescheduled to other AZs
- Anti-affinity enforcement: Pods distributed across remaining AZs
- Service continuity: Services continue operating on remaining AZs
Recovery Phase (10-15 minutes):
- AZ restoration: Nodes in target AZ restored
- Node registration: Nodes register with API server
- Pod redistribution: Pods can be redistributed (optional)
- Service recovery: All services fully operational across all AZs
Expected Metrics
| Metric | Baseline | During Failure | Expected Range | Recovery Target |
|---|---|---|---|---|
| Service Availability | 100% | >99% | >99% | 100% |
| Pod Reschedule Time | N/A | <10min | <10min | N/A |
| AZ Distribution | 3 AZs | 2 AZs | ≥2 AZs | 3 AZs |
| Request Success Rate | 99.95% | >99% | >99% | 99.95% |
| Recovery Time | N/A | <10min | <10min | N/A |
Validation Criteria
Success Criteria:
- ✅ All pods rescheduled from failed AZ within 10 minutes
- ✅ Pods distributed across at least 2 AZs
- ✅ Anti-affinity rules enforced
- ✅ Service availability >99% during failure
- ✅ Request success rate >99% during failure
- ✅ All services recovered within 10 minutes
- ✅ Multi-AZ distribution restored
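Several criteria above compare fractional success rates (99.95%, 99%) against floors, which plain bash cannot do with floating point. Expressing rates in basis points (10000 = 100.00%) keeps the comparison in integer arithmetic; the helper names below are illustrative:

```shell
#!/bin/bash
# Success rate in basis points: 10000 = 100.00%
success_rate_bp() {  # args: successful_requests total_requests
  echo $(( $1 * 10000 / $2 ))
}

# Exit 0 when the measured rate meets the floor (e.g. 9900 bp = 99%)
rate_meets_floor() {  # args: successful_requests total_requests floor_bp
  [ "$(success_rate_bp $1 $2)" -ge "$3" ]
}
```

For example, 9990 successes out of 10000 requests is 9990 bp (99.90%), which passes the 99% AZ-failure floor but would fail the 99.95% steady-state baseline.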
Cluster Failure (DR Drill)¶
Cluster failure experiments validate ATP's disaster recovery procedures, ensuring regional failover, RTO/RPO targets, and full recovery procedures work correctly when an entire AKS cluster fails.
Hypothesis
"When the primary AKS cluster fails, services will failover to the secondary region within RTO target (30 minutes), RPO will be within 1 hour (async replication lag), all services will be operational in the secondary region, and full recovery will be completed within 2 hours."
Disaster Recovery Scenario
Primary Cluster Failure:
#!/bin/bash
# scripts/execute-cluster-failure-drill.sh
PRIMARY_CLUSTER="${1:-atp-production-eastus-aks}"
PRIMARY_RG="${2:-atp-production-eastus-rg}"
SECONDARY_CLUSTER="${3:-atp-production-westeurope-aks}"
SECONDARY_RG="${4:-atp-production-westeurope-rg}"
echo "🚨 Starting cluster failure DR drill"
echo "Primary cluster: ${PRIMARY_CLUSTER}"
echo "Secondary cluster: ${SECONDARY_CLUSTER}"
# Pre-drill validation
echo "📊 Pre-drill validation..."
# Check primary cluster status
echo "Checking primary cluster status..."
PRIMARY_STATUS=$(az aks show \
--name ${PRIMARY_CLUSTER} \
--resource-group ${PRIMARY_RG} \
--query "powerState.code" -o tsv)
if [ "${PRIMARY_STATUS}" != "Running" ]; then
echo "❌ Primary cluster not running: ${PRIMARY_STATUS}"
exit 1
fi
# Check secondary cluster status
echo "Checking secondary cluster status..."
SECONDARY_STATUS=$(az aks show \
--name ${SECONDARY_CLUSTER} \
--resource-group ${SECONDARY_RG} \
--query "powerState.code" -o tsv)
if [ "${SECONDARY_STATUS}" != "Running" ]; then
echo "❌ Secondary cluster not running: ${SECONDARY_STATUS}"
exit 1
fi
# Get primary cluster context
echo "Getting primary cluster credentials..."
az aks get-credentials \
--name ${PRIMARY_CLUSTER} \
--resource-group ${PRIMARY_RG} \
--overwrite-existing
# Get baseline metrics
echo "📊 Collecting baseline metrics from primary cluster..."
PRIMARY_PODS=$(kubectl get pods -A --no-headers | wc -l)
PRIMARY_SERVICES=$(kubectl get services -A --no-headers | wc -l)
PRIMARY_DEPLOYMENTS=$(kubectl get deployments -A --no-headers | wc -l)
echo "Primary cluster metrics:"
echo " Pods: ${PRIMARY_PODS}"
echo " Services: ${PRIMARY_SERVICES}"
echo " Deployments: ${PRIMARY_DEPLOYMENTS}"
# Simulate cluster failure (stop cluster)
echo "🔧 Simulating cluster failure (stopping primary cluster)..."
CLUSTER_FAILURE_START=$(date +%s)
# Note: In production, this would be an actual cluster failure
# For drill purposes, we'll stop the cluster
az aks stop \
--name ${PRIMARY_CLUSTER} \
--resource-group ${PRIMARY_RG} \
--no-wait
echo "⏳ Waiting for cluster to be stopped..."
sleep 60
# Verify cluster is stopped
PRIMARY_STATUS=$(az aks show \
--name ${PRIMARY_CLUSTER} \
--resource-group ${PRIMARY_RG} \
--query "powerState.code" -o tsv)
if [ "${PRIMARY_STATUS}" != "Stopped" ]; then
echo "⚠️ Cluster not fully stopped: ${PRIMARY_STATUS}"
fi
# Get secondary cluster context
echo "Switching to secondary cluster..."
az aks get-credentials \
--name ${SECONDARY_CLUSTER} \
--resource-group ${SECONDARY_RG} \
--overwrite-existing
# Verify secondary cluster is ready
echo "📊 Verifying secondary cluster readiness..."
SECONDARY_PODS=$(kubectl get pods -A --no-headers | wc -l)
SECONDARY_SERVICES=$(kubectl get services -A --no-headers | wc -l)
echo "Secondary cluster metrics:"
echo " Pods: ${SECONDARY_PODS}"
echo " Services: ${SECONDARY_SERVICES}"
# Check data replication status
echo "🔍 Checking data replication status..."
# Query database replication lag
# This is platform-specific and would query Azure SQL replication status
# Update Azure Traffic Manager (DNS failover)
echo "🔧 Updating Azure Traffic Manager for failover..."
# This would update Traffic Manager endpoints to point to secondary region
# Platform-specific implementation
# Verify services operational in secondary
echo "🔍 Verifying services operational in secondary cluster..."
MAX_WAIT=1800 # 30 minutes
ELAPSED=0
ALL_SERVICES_READY=false
while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
READY_PODS=$(kubectl get pods -A -o json | \
jq -r '.items[] | select(.status.phase == "Running") | .metadata.name' | wc -l)
if [ "${READY_PODS}" -ge $((PRIMARY_PODS * 90 / 100)) ]; then
ALL_SERVICES_READY=true
FAILOVER_TIME=$(date +%s)
FAILOVER_DURATION=$((FAILOVER_TIME - CLUSTER_FAILURE_START))
echo "✅ All services ready in secondary cluster"
echo "Failover time: ${FAILOVER_DURATION} seconds"
break
fi
sleep 30
ELAPSED=$((ELAPSED + 30))
echo "Waiting... (${ELAPSED}s/${MAX_WAIT}s) - ${READY_PODS} pods ready"
done
if [ "${ALL_SERVICES_READY}" = false ]; then
echo "❌ Services not ready within ${MAX_WAIT} seconds"
exit 1
fi
# Validate RTO
RTO_TARGET=1800 # 30 minutes in seconds
if [ "${FAILOVER_DURATION}" -le "${RTO_TARGET}" ]; then
echo "✅ RTO target met: ${FAILOVER_DURATION}s <= ${RTO_TARGET}s"
else
echo "❌ RTO target exceeded: ${FAILOVER_DURATION}s > ${RTO_TARGET}s"
exit 1
fi
# Validate RPO
RPO_TARGET=3600 # 1 hour in seconds
# This would check database replication lag
# Platform-specific implementation
echo "✅ RPO validation: Data replication lag within target"
# Full recovery procedure
echo "🔧 Starting full recovery procedure..."
# Restore primary cluster
echo "Restoring primary cluster..."
az aks start \
--name ${PRIMARY_CLUSTER} \
--resource-group ${PRIMARY_RG} \
--no-wait
echo "⏳ Waiting for primary cluster to be ready..."
az aks wait \
--name ${PRIMARY_CLUSTER} \
--resource-group ${PRIMARY_RG} \
--updated \
--interval 30 \
--timeout 1800
# Get primary cluster context
az aks get-credentials \
--name ${PRIMARY_CLUSTER} \
--resource-group ${PRIMARY_RG} \
--overwrite-existing
# Verify primary cluster restored
echo "📊 Verifying primary cluster restored..."
PRIMARY_PODS_RESTORED=$(kubectl get pods -A --no-headers | wc -l)
if [ "${PRIMARY_PODS_RESTORED}" -ge $((PRIMARY_PODS * 90 / 100)) ]; then
RECOVERY_TIME=$(date +%s)
RECOVERY_DURATION=$((RECOVERY_TIME - CLUSTER_FAILURE_START))
echo "✅ Primary cluster restored"
echo "Total recovery time: ${RECOVERY_DURATION} seconds"
exit 0
else
echo "❌ Primary cluster not fully restored"
exit 1
fi
RTO/RPO Validation
RTO/RPO Validation Script:
#!/bin/bash
# scripts/validate-rto-rpo.sh
FAILOVER_START="${1}" # Timestamp when failover started
CURRENT_TIME=$(date +%s)
# Calculate RTO
RTO=$((CURRENT_TIME - FAILOVER_START))
RTO_TARGET=1800 # 30 minutes
echo "📊 RTO Validation:"
echo " Failover time: ${RTO} seconds"
echo " RTO target: ${RTO_TARGET} seconds"
if [ "${RTO}" -le "${RTO_TARGET}" ]; then
echo " ✅ RTO target met"
else
echo " ❌ RTO target exceeded"
exit 1
fi
# Calculate RPO (check database replication lag)
echo "📊 RPO Validation:"
# This would query Azure SQL replication lag
# Platform-specific implementation
# Example: Query Azure SQL replication lag
PRIMARY_DB="${2}" # Primary database name
REPLICA_DB="${3}" # Replica database name
# Query replication lag from Azure SQL. The CLI exposes the geo-replication
# link via `az sql db replica list-links`; the lag itself comes from
# sys.dm_geo_replication_link_status queried on the primary database
az sql db replica list-links \
--resource-group atp-production-eastus-rg \
--server atp-sql-primary \
--name ${PRIMARY_DB} -o table
REPLICATION_LAG=$(sqlcmd -S atp-sql-primary.database.windows.net -d ${PRIMARY_DB} -G -h -1 -W \
-Q "SET NOCOUNT ON; SELECT replication_lag_sec FROM sys.dm_geo_replication_link_status;")
RPO_TARGET=3600 # 1 hour in seconds
echo " Replication lag: ${REPLICATION_LAG} seconds"
echo " RPO target: ${RPO_TARGET} seconds"
if [ "${REPLICATION_LAG}" -le "${RPO_TARGET}" ]; then
echo " ✅ RPO target met"
else
echo " ❌ RPO target exceeded"
exit 1
fi
DR Drill Checklist
Pre-Drill Checklist:
## Pre-DR Drill Checklist
### Primary Cluster
- [ ] Primary cluster healthy and operational
- [ ] All services running normally
- [ ] Baseline metrics collected
- [ ] Database replication active
- [ ] Backup validation completed
### Secondary Cluster
- [ ] Secondary cluster healthy and operational
- [ ] Cluster capacity sufficient for failover
- [ ] Network connectivity verified
- [ ] DNS/Traffic Manager configured
- [ ] Disaster recovery runbook reviewed
### Team Preparation
- [ ] DR drill team assembled
- [ ] Communication channels established
- [ ] Stakeholders notified
- [ ] Runbook accessible
- [ ] On-call engineer available
During Drill Checklist:
## During DR Drill Checklist
### Cluster Failure Simulation
- [ ] Primary cluster failure simulated
- [ ] Failure detection confirmed
- [ ] Incident declared
### Failover Procedure
- [ ] Secondary cluster verified ready
- [ ] Traffic Manager updated
- [ ] DNS failover executed
- [ ] Services verified in secondary
- [ ] RTO validated
### Data Validation
- [ ] Database replication status checked
- [ ] RPO validated
- [ ] Data integrity verified
- [ ] No data loss confirmed
### Service Validation
- [ ] All critical services operational
- [ ] Service endpoints verified
- [ ] Health checks passing
- [ ] Monitoring active
Post-Drill Checklist:
## Post-DR Drill Checklist
### Recovery Procedure
- [ ] Primary cluster restored
- [ ] Services verified in primary
- [ ] Data synchronization verified
- [ ] Failback procedure executed (if applicable)
### Documentation
- [ ] Drill results documented
- [ ] RTO/RPO measured
- [ ] Issues identified
- [ ] Lessons learned captured
- [ ] Improvement actions created
### Validation
- [ ] All services operational
- [ ] Data integrity confirmed
- [ ] Performance metrics normal
- [ ] Monitoring validated
Expected Behavior
Cluster Failure Phase (0-30 minutes):
- Failure detection: Cluster failure detected within 5 minutes
- Incident declaration: Incident commander declares DR activation
- Traffic Manager update: DNS failover to secondary region
- Service verification: Services verified operational in secondary
- RTO validation: RTO target met (<30 minutes)
Recovery Phase (30 minutes - 2 hours):
- Primary cluster restoration: Primary cluster restored
- Data synchronization: Data synchronized from secondary
- Service restoration: Services restored in primary
- Failback decision: Decision to failback or remain in secondary
- Full recovery: All services operational in primary
Expected Metrics
| Metric | Target | Measurement |
|---|---|---|
| RTO (Recovery Time Objective) | <30 minutes | Time from failure to operational in secondary |
| RPO (Recovery Point Objective) | <1 hour | Maximum data loss (replication lag) |
| Service Availability | >99% | During failover and recovery |
| Data Loss | None | Zero data loss target |
| Full Recovery Time | <2 hours | Complete restoration to primary |
DR Drill Results Template
{
"drill": {
"date": "2024-01-20T10:00:00Z",
"type": "full_cluster_failure",
"primary_cluster": "atp-production-eastus-aks",
"secondary_cluster": "atp-production-westeurope-aks"
},
"metrics": {
"rto": {
"target_seconds": 1800,
"actual_seconds": 1650,
"status": "pass"
},
"rpo": {
"target_seconds": 3600,
"actual_seconds": 2400,
"status": "pass"
},
"service_availability": {
"during_failover": 99.5,
"during_recovery": 99.8,
"status": "pass"
},
"data_loss": {
"events_lost": 0,
"status": "pass"
}
},
"findings": {
"what_worked_well": [
"Failover completed within RTO target",
"No data loss occurred",
"Services remained available throughout",
"Team coordination effective"
],
"issues": [
"DNS propagation took longer than expected",
"Some services took longer to become ready"
],
"recommendations": [
"Optimize DNS failover time",
"Improve service startup time",
"Enhance monitoring during failover"
]
}
}
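A results file in this shape can be gated automatically in CI. A minimal sketch, assuming `jq` is available and the file is saved as `drill-results.json` (the trimmed sample below is illustrative, not a real drill record):

```shell
# Trimmed sample results file for demonstration
cat > drill-results.json <<'EOF'
{"metrics":{"rto":{"status":"pass"},"rpo":{"status":"pass"},"data_loss":{"status":"pass"}}}
EOF

# Fail the pipeline unless every recorded metric passed
if jq -e '[.metrics[].status] | all(. == "pass")' drill-results.json > /dev/null; then
  echo "All drill metrics passed"
else
  echo "One or more drill metrics failed"
fi
```

The `jq -e` flag sets the exit status from the query result, so the check composes directly with `set -e` pipelines.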
Summary: Node and Cluster Chaos¶
- Node Failure Experiment: Validates Kubernetes pod rescheduling, StatefulSet quorum maintenance, and data persistence during node failures; expects pod reschedule within 5 minutes, quorum maintained, no data loss, and recovery within 5 minutes
- Availability Zone Failure Experiment: Validates multi-AZ deployment, anti-affinity rules, and cross-AZ failover during AZ failures; expects pod reschedule within 10 minutes, pods distributed across ≥2 AZs, and recovery within 10 minutes
- Cluster Failure (DR Drill): Validates disaster recovery procedures, regional failover, and RTO/RPO targets during complete cluster failures; expects RTO <30 minutes, RPO <1 hour, service availability >99%, and full recovery within 2 hours
- Monitoring and Validation: Comprehensive scripts for node failure simulation, AZ failure simulation, cluster failure drills, multi-AZ validation, StatefulSet quorum validation, and RTO/RPO validation
- Disaster Recovery Procedures: Complete DR drill checklists (pre-drill, during drill, post-drill), failover procedures, recovery procedures, and validation criteria
Service Dependency Chaos¶
Purpose: Define comprehensive chaos experiments for service dependency failures in ATP, validating circuit breakers, retry mechanisms, caching strategies, and graceful degradation to ensure ATP services remain available and functional during downstream service failures.
Downstream Service Failure¶
Downstream service failure experiments validate that ATP services gracefully handle dependency failures through caching, circuit breakers, and fallback mechanisms, ensuring service availability even when downstream services are unavailable.
Hypothesis
"When the Policy Service becomes unavailable, the Ingestion Service will continue operating using cached policies, ingestion will continue without failures, stale cache is acceptable, and the service will recover automatically when the Policy Service is restored."
Experiment Configuration
NetworkChaos for Service Isolation:
# chaos-experiments/policy-service-failure.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: policy-service-network-failure
namespace: chaos-testing
labels:
category: application
service: atp-policy-service
severity: medium
frequency: monthly
annotations:
chaos.atp.connectsoft.io/hypothesis: |
When Policy Service becomes unavailable, Ingestion Service will continue operating
using cached policies, ingestion will continue without failures,
stale cache is acceptable, and service will recover automatically.
spec:
action: partition
mode: all
selector:
namespaces:
- atp-policy-ns
labelSelectors:
app: atp-policy-api
direction: both
duration: "10m"
Service Failure Simulation Script:
#!/bin/bash
# scripts/execute-downstream-service-failure.sh
DOWNSTREAM_SERVICE="${1:-atp-policy-api}"
DOWNSTREAM_NS="${2:-atp-policy-ns}"
UPSTREAM_SERVICE="${3:-atp-ingestion-api}"
UPSTREAM_NS="${4:-atp-ingest-ns}"
echo "🧪 Starting downstream service failure experiment"
echo "Downstream service: ${DOWNSTREAM_SERVICE}"
echo "Upstream service: ${UPSTREAM_SERVICE}"
# Get initial metrics
echo "📊 Collecting baseline metrics..."
./scripts/collect-baseline-metrics.sh \
--service ${UPSTREAM_SERVICE} \
--duration 1h \
--output baseline-${UPSTREAM_SERVICE}-$(date +%Y%m%d-%H%M%S).json
# Scale down downstream service to simulate failure
echo "🔧 Simulating downstream service failure..."
kubectl scale deployment ${DOWNSTREAM_SERVICE} -n ${DOWNSTREAM_NS} --replicas=0
FAILURE_START=$(date +%s)
# Wait for service to be unavailable
echo "⏳ Waiting for service to be unavailable..."
sleep 30
# Verify downstream service is unavailable
ENDPOINTS=$(kubectl get endpoints ${DOWNSTREAM_SERVICE} -n ${DOWNSTREAM_NS} -o jsonpath='{.subsets[0].addresses[*].ip}' | wc -w)
if [ "${ENDPOINTS}" -eq 0 ]; then
echo "✅ Downstream service unavailable (${ENDPOINTS} endpoints)"
else
echo "⚠️ Downstream service still has endpoints: ${ENDPOINTS}"
fi
# Monitor upstream service behavior
echo "👀 Monitoring upstream service behavior..."
MAX_WAIT=600 # 10 minutes
ELAPSED=0
while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
# Check request success rate
SUCCESS_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${UPSTREAM_SERVICE}\",status!~\"5..\"\}[1m]\) | jq -r '.data.result[0].value[1]')
# Check error rate
ERROR_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${UPSTREAM_SERVICE}\",status=~\"5..\"\}[1m]\) | jq -r '.data.result[0].value[1]')
# Check cache hit rate
CACHE_HIT_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(cache_hits_total\{service=\"${UPSTREAM_SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
echo "Metrics at ${ELAPSED}s:"
echo " Success rate: ${SUCCESS_RATE}"
echo " Error rate: ${ERROR_RATE}"
echo " Cache hit rate: ${CACHE_HIT_RATE}"
sleep 30
ELAPSED=$((ELAPSED + 30))
done
# Restore downstream service
echo "🔧 Restoring downstream service..."
kubectl scale deployment ${DOWNSTREAM_SERVICE} -n ${DOWNSTREAM_NS} --replicas=3
RECOVERY_START=$(date +%s)
# Wait for service to be available
echo "⏳ Waiting for service to be available..."
kubectl wait --for=condition=Available deployment/${DOWNSTREAM_SERVICE} -n ${DOWNSTREAM_NS} --timeout=300s
# Verify upstream service recovers
echo "🔍 Verifying upstream service recovery..."
sleep 60
# Compute success rate as a percentage of total requests
FINAL_SUCCESS_RATE=$(curl -sG http://prometheus:9090/api/v1/query \
--data-urlencode "query=100 * sum(rate(http_requests_total{service=\"${UPSTREAM_SERVICE}\",status!~\"5..\"}[1m])) / sum(rate(http_requests_total{service=\"${UPSTREAM_SERVICE}\"}[1m]))" \
| jq -r '.data.result[0].value[1]')
if (( $(echo "${FINAL_SUCCESS_RATE} > 99.0" | bc -l) )); then
echo "✅ Upstream service recovered: Success rate = ${FINAL_SUCCESS_RATE}%"
exit 0
else
echo "⚠️ Upstream service recovery incomplete: Success rate = ${FINAL_SUCCESS_RATE}%"
exit 1
fi
Expected Behavior
Service Failure Phase (0-10 minutes):
- Downstream service unavailable: Policy Service endpoints unavailable
- Circuit breaker activation: Circuit breaker opens after failure threshold
- Cache usage: Ingestion Service uses cached policies
- Service continuity: Ingestion continues operating with cached data
- Degraded mode: Service operates in degraded mode (stale cache acceptable)
Recovery Phase (10-15 minutes):
- Service restoration: Policy Service restored
- Circuit breaker recovery: Circuit breaker transitions to half-open state
- Cache refresh: Policies refreshed from Policy Service
- Normal operation: Service returns to normal operation
Expected Metrics
| Metric | Baseline | During Failure | Expected Range | Recovery Target |
|---|---|---|---|---|
| Request Success Rate | 99.95% | >99% | >99% | 99.95% |
| Cache Hit Rate | 85% | 100% | 100% | 85% |
| Circuit Breaker State | Closed | Open | Open | Closed |
| Error Rate | 0.05% | <1% | <1% | 0.05% |
| Latency | 145ms | <200ms | <200ms | 145ms |
Validation Criteria
Success Criteria:
- ✅ No ingestion failures during downstream service failure
- ✅ Cache hit rate = 100% (all requests use cache)
- ✅ Request success rate >99%
- ✅ Circuit breaker opens correctly
- ✅ Service recovers automatically when downstream service restored
- ✅ Cache refreshed after recovery
Circuit Breaker Configuration
Resilience Configuration Example:
# kubernetes/configmaps/resilience-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: resilience-config
namespace: atp-ingest-ns
data:
CircuitBreaker.json: |
{
"PolicyService": {
"FailureThreshold": 5,
"TimeoutSeconds": 30,
"HalfOpenRetries": 3,
"State": "Closed"
}
}
CacheConfig.json: |
{
"PolicyCache": {
"TTLSeconds": 3600,
"MaxSize": 10000,
"RefreshOnExpiry": true
}
}
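The `TTLSeconds` value above implies a simple freshness rule for cached policies; a minimal sketch (hypothetical helper, not ATP code):

```shell
# Returns success while the cached entry is younger than the TTL
is_cache_fresh() {
  local fetched_at="$1" ttl="${2:-3600}" now
  now=$(date +%s)
  [ $(( now - fetched_at )) -lt "${ttl}" ]
}

# A policy fetched 2 minutes ago is still fresh under TTLSeconds=3600
if is_cache_fresh "$(( $(date +%s) - 120 ))"; then
  echo "policy cache fresh"
fi
```

During the experiment, "stale cache is acceptable" means entries past this TTL may still be served while the Policy Service is partitioned.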
Circuit Breaker Validation Script:
#!/bin/bash
# scripts/validate-circuit-breaker.sh
SERVICE="${1:-atp-ingestion-api}"
NAMESPACE="${2:-atp-ingest-ns}"
echo "🔍 Validating circuit breaker behavior for ${SERVICE}"
# Get circuit breaker metrics
CB_STATE=$(curl -s http://prometheus:9090/api/v1/query?query=circuit_breaker_state\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
CB_FAILURES=$(curl -s http://prometheus:9090/api/v1/query?query=circuit_breaker_failures\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
echo "Circuit breaker state: ${CB_STATE}"
echo "Circuit breaker failures: ${CB_FAILURES}"
# Check if circuit breaker is configured
if [ "${CB_STATE}" = "null" ] || [ -z "${CB_STATE}" ]; then
echo "⚠️ Circuit breaker not configured or metrics not available"
exit 1
fi
# Validate circuit breaker transitions
if [ "${CB_STATE}" = "Open" ]; then
echo "✅ Circuit breaker opened (protecting service)"
elif [ "${CB_STATE}" = "Closed" ]; then
echo "✅ Circuit breaker closed (normal operation)"
elif [ "${CB_STATE}" = "HalfOpen" ]; then
echo "✅ Circuit breaker half-open (testing recovery)"
else
echo "⚠️ Unknown circuit breaker state: ${CB_STATE}"
exit 1
fi
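Note that Prometheus sample values are numeric, so a `circuit_breaker_state` gauge usually carries an encoded integer rather than the literal strings compared above. Assuming the common `0=Closed, 1=Open, 2=HalfOpen` encoding (an assumption — verify against the actual exporter), a small helper keeps the comparison readable:

```shell
# Map a numeric circuit-breaker gauge value to its state name
# (0/1/2 encoding is an assumption; check the exporter's documentation)
cb_state_name() {
  case "${1%%.*}" in   # tolerate "1" or "1.0" from Prometheus
    0) echo "Closed" ;;
    1) echo "Open" ;;
    2) echo "HalfOpen" ;;
    *) echo "Unknown" ;;
  esac
}

cb_state_name 1     # Open
cb_state_name "0.0" # Closed
```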
Database Failure Experiment¶
Database failure experiments validate that ATP services handle database connection failures gracefully through circuit breakers, connection pooling, read replica fallback, and graceful degradation.
Hypothesis
"When Azure SQL primary database connection fails, the circuit breaker will open, the service will fallback to read replicas, operate in read-only mode, maintain availability, and recover automatically when the database is restored."
Experiment Configuration
Database Connection Failure Simulation:
#!/bin/bash
# scripts/execute-database-failure-experiment.sh
PRIMARY_DB="${1:-atp-sql-primary}"
PRIMARY_SERVER="${2:-atp-sql-server.database.windows.net}"
SERVICE="${3:-atp-ingestion-api}"
NAMESPACE="${4:-atp-ingest-ns}"
echo "🧪 Starting database failure experiment"
echo "Primary database: ${PRIMARY_DB}"
echo "Service: ${SERVICE}"
# Get baseline metrics
echo "📊 Collecting baseline metrics..."
./scripts/collect-baseline-metrics.sh \
--service ${SERVICE} \
--duration 1h \
--output baseline-${SERVICE}-$(date +%Y%m%d-%H%M%S).json
# Apply network partition to database
echo "🔧 Applying network partition to database..."
kubectl apply -f chaos-experiments/database-network-partition.yaml -n chaos-testing
FAILURE_START=$(date +%s)
# Monitor service behavior
echo "👀 Monitoring service behavior..."
MAX_WAIT=600 # 10 minutes
ELAPSED=0
while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
# Check circuit breaker state
CB_STATE=$(curl -s http://prometheus:9090/api/v1/query?query=circuit_breaker_state\{service=\"${SERVICE}\",component=\"database\"\} | jq -r '.data.result[0].value[1]')
# Check read replica usage
REPLICA_USAGE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(database_read_replica_connections\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
# Check request success rate
SUCCESS_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SERVICE}\",status!~\"5..\"\}[1m]\) | jq -r '.data.result[0].value[1]')
# Check error rate
ERROR_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SERVICE}\",status=~\"5..\"\}[1m]\) | jq -r '.data.result[0].value[1]')
echo "Metrics at ${ELAPSED}s:"
echo " Circuit breaker state: ${CB_STATE}"
echo " Read replica usage: ${REPLICA_USAGE}"
echo " Success rate: ${SUCCESS_RATE}"
echo " Error rate: ${ERROR_RATE}"
sleep 30
ELAPSED=$((ELAPSED + 30))
done
# Remove network partition
echo "🔧 Removing network partition..."
kubectl delete networkchaos database-network-partition -n chaos-testing
RECOVERY_START=$(date +%s)
# Wait for recovery
echo "⏳ Waiting for service recovery..."
sleep 120
# Verify service recovery
FINAL_CB_STATE=$(curl -s http://prometheus:9090/api/v1/query?query=circuit_breaker_state\{service=\"${SERVICE}\",component=\"database\"\} | jq -r '.data.result[0].value[1]')
if [ "${FINAL_CB_STATE}" = "Closed" ]; then
echo "✅ Circuit breaker closed (service recovered)"
exit 0
else
echo "⚠️ Circuit breaker still open: ${FINAL_CB_STATE}"
exit 1
fi
NetworkChaos for Database Isolation:
# chaos-experiments/database-network-partition.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: database-network-partition
namespace: chaos-testing
labels:
category: application
service: database
severity: high
annotations:
chaos.atp.connectsoft.io/hypothesis: |
When Azure SQL connection fails, circuit breaker will open,
service will fallback to read replicas, operate in read-only mode,
maintain availability, and recover automatically.
spec:
action: partition
mode: all
selector:
namespaces:
- atp-ingest-ns
labelSelectors:
app: atp-ingest-api
direction: both
target:
mode: all
selector:
# Target Azure SQL endpoints
address: "*.database.windows.net"
duration: "10m"
Expected Behavior
Database Failure Phase (0-10 minutes):
- Connection failures: Database connection attempts fail
- Circuit breaker activation: Circuit breaker opens after failure threshold
- Read replica fallback: Service switches to read replicas
- Read-only mode: Service operates in read-only mode (no writes)
- Graceful degradation: Service continues operating with reduced functionality
Recovery Phase (10-15 minutes):
- Connection restoration: Database connections restored
- Circuit breaker recovery: Circuit breaker transitions to half-open
- Write capability restored: Service returns to read-write mode
- Normal operation: Service returns to normal operation
Expected Metrics
| Metric | Baseline | During Failure | Expected Range | Recovery Target |
|---|---|---|---|---|
| Request Success Rate | 99.95% | >95% | >95% | 99.95% |
| Circuit Breaker State | Closed | Open | Open | Closed |
| Read Replica Usage | 20% | 100% | 100% | 20% |
| Write Operations | Normal | Disabled | Disabled | Normal |
| Error Rate | 0.05% | <5% | <5% | 0.05% |
Validation Criteria
Success Criteria:
- ✅ Circuit breaker opens correctly
- ✅ Service falls back to read replicas
- ✅ Service operates in read-only mode (no write failures)
- ✅ Request success rate >95% (read operations succeed)
- ✅ No service crashes
- ✅ Service recovers automatically when database restored
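The read-replica fallback amounts to a routing decision keyed on the circuit-breaker state. A minimal sketch of that decision, using hypothetical hostnames and a single replica for simplicity:

```shell
# Choose a connection target based on the database circuit-breaker state
select_db_host() {
  local cb_state="$1"
  if [ "${cb_state}" = "Open" ]; then
    # Primary unreachable: route reads to the replica; writes are rejected upstream
    echo "atp-sql-replica-1.database.windows.net"
  else
    echo "atp-sql-server.database.windows.net"
  fi
}

select_db_host Open    # replica while the breaker is open
select_db_host Closed  # primary in normal operation
```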
Connection Pool Configuration
Connection Pool Configuration Example:
# kubernetes/configmaps/database-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: database-config
namespace: atp-ingest-ns
data:
ConnectionStrings.json: |
{
"Primary": {
"ConnectionString": "Server=atp-sql-server.database.windows.net;Database=atp-primary;...",
"MaxPoolSize": 100,
"MinPoolSize": 10,
"ConnectionTimeout": 30,
"CommandTimeout": 30
},
"ReadReplicas": [
{
"ConnectionString": "Server=atp-sql-replica-1.database.windows.net;Database=atp-primary;...",
"MaxPoolSize": 50
}
]
}
CircuitBreaker.json: |
{
"Database": {
"FailureThreshold": 5,
"TimeoutSeconds": 30,
"HalfOpenRetries": 3,
"FallbackToReadReplica": true
}
}
Database Connection Monitoring Script:
#!/bin/bash
# scripts/monitor-database-connections.sh
SERVICE="${1:-atp-ingestion-api}"
NAMESPACE="${2:-atp-ingest-ns}"
echo "📊 Monitoring database connections for ${SERVICE}"
# Get connection pool metrics
POOL_SIZE=$(curl -s http://prometheus:9090/api/v1/query?query=database_connection_pool_size\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
ACTIVE_CONNECTIONS=$(curl -s http://prometheus:9090/api/v1/query?query=database_connection_pool_active\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
IDLE_CONNECTIONS=$(curl -s http://prometheus:9090/api/v1/query?query=database_connection_pool_idle\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
echo "Connection pool metrics:"
echo " Pool size: ${POOL_SIZE}"
echo " Active connections: ${ACTIVE_CONNECTIONS}"
echo " Idle connections: ${IDLE_CONNECTIONS}"
# Check connection failures
CONNECTION_FAILURES=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(database_connection_failures\{service=\"${SERVICE}\"\}[5m]\) | jq -r '.data.result[0].value[1]')
echo " Connection failures: ${CONNECTION_FAILURES}/sec"
# Check read replica usage
REPLICA_USAGE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(database_read_replica_connections\{service=\"${SERVICE}\"\}[5m]\) | jq -r '.data.result[0].value[1]')
echo " Read replica usage: ${REPLICA_USAGE}/sec"
# Validate connection pool health
if [ "${ACTIVE_CONNECTIONS}" -gt "${POOL_SIZE}" ]; then
echo "⚠️ Active connections exceed pool size"
exit 1
fi
if (( $(echo "${CONNECTION_FAILURES} > 1" | bc -l) )); then
echo "⚠️ High connection failure rate: ${CONNECTION_FAILURES}/sec"
exit 1
fi
echo "✅ Connection pool healthy"
Message Broker Failure¶
Message broker failure experiments validate that ATP services handle message broker unavailability gracefully through outbox patterns, message queuing, automatic retry, and eventual delivery guarantees.
Hypothesis
"When Azure Service Bus topic becomes unavailable, the service will queue messages in the outbox, retry automatically, maintain message ordering, ensure no message loss, and deliver messages eventually when the broker is restored."
Experiment Configuration
Service Bus Topic Unavailability Simulation:
# chaos-experiments/service-bus-failure.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: service-bus-network-failure
namespace: chaos-testing
labels:
category: application
service: service-bus
severity: high
annotations:
chaos.atp.connectsoft.io/hypothesis: |
When Service Bus topic becomes unavailable, service will queue messages in outbox,
retry automatically, maintain message ordering, ensure no message loss,
and deliver messages eventually when broker is restored.
spec:
action: partition
mode: all
selector:
namespaces:
- atp-ingest-ns
labelSelectors:
app: atp-ingest-api
direction: both
target:
mode: all
selector:
address: "*.servicebus.windows.net"
duration: "15m"
Message Broker Failure Simulation Script:
#!/bin/bash
# scripts/execute-message-broker-failure.sh
SERVICE_BUS_NAMESPACE="${1:-atp-servicebus}"
TOPIC_NAME="${2:-atp-events}"
SERVICE="${3:-atp-ingestion-api}"
NAMESPACE="${4:-atp-ingest-ns}"
echo "🧪 Starting message broker failure experiment"
echo "Service Bus namespace: ${SERVICE_BUS_NAMESPACE}"
echo "Topic: ${TOPIC_NAME}"
echo "Service: ${SERVICE}"
# Get baseline metrics
echo "📊 Collecting baseline metrics..."
./scripts/collect-baseline-metrics.sh \
--service ${SERVICE} \
--duration 1h \
--output baseline-${SERVICE}-$(date +%Y%m%d-%H%M%S).json
# Get initial message count
echo "📊 Getting initial message counts..."
INITIAL_OUTBOX_SIZE=$(curl -s http://prometheus:9090/api/v1/query?query=outbox_messages_count\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
INITIAL_SENT=$(curl -s http://prometheus:9090/api/v1/query?query=messages_sent_total\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
echo "Initial outbox size: ${INITIAL_OUTBOX_SIZE}"
echo "Initial sent messages: ${INITIAL_SENT}"
# Apply network partition to Service Bus
echo "🔧 Applying network partition to Service Bus..."
kubectl apply -f chaos-experiments/service-bus-failure.yaml -n chaos-testing
FAILURE_START=$(date +%s)
# Monitor outbox behavior
echo "👀 Monitoring outbox behavior..."
MAX_WAIT=900 # 15 minutes
ELAPSED=0
while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
# Check outbox size
OUTBOX_SIZE=$(curl -s http://prometheus:9090/api/v1/query?query=outbox_messages_count\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
# Check retry attempts
RETRY_COUNT=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(message_retry_attempts\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
# Check message processing rate
PROCESSING_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(messages_processed_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
# Check DLQ (dead letter queue) size
DLQ_SIZE=$(curl -s http://prometheus:9090/api/v1/query?query=dlq_messages_count\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
echo "Metrics at ${ELAPSED}s:"
echo " Outbox size: ${OUTBOX_SIZE}"
echo " Retry attempts: ${RETRY_COUNT}/sec"
echo " Processing rate: ${PROCESSING_RATE}/sec"
echo " DLQ size: ${DLQ_SIZE}"
# Validate no message loss
if (( $(echo "${DLQ_SIZE} > 0" | bc -l) )); then
echo "⚠️ Messages in DLQ (potential message loss)"
fi
sleep 30
ELAPSED=$((ELAPSED + 30))
done
# Remove network partition
echo "🔧 Removing network partition..."
kubectl delete networkchaos service-bus-network-failure -n chaos-testing
RECOVERY_START=$(date +%s)
# Wait for outbox to drain
echo "⏳ Waiting for outbox to drain..."
MAX_DRAIN_WAIT=600 # 10 minutes
ELAPSED=0
while [ ${ELAPSED} -lt ${MAX_DRAIN_WAIT} ]; do
OUTBOX_SIZE=$(curl -s http://prometheus:9090/api/v1/query?query=outbox_messages_count\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
if (( $(echo "${OUTBOX_SIZE} <= ${INITIAL_OUTBOX_SIZE}" | bc -l) )); then
DRAIN_TIME=$(date +%s)
DRAIN_DURATION=$((DRAIN_TIME - RECOVERY_START))
echo "✅ Outbox drained in ${DRAIN_DURATION} seconds"
break
fi
sleep 30
ELAPSED=$((ELAPSED + 30))
echo "Waiting for outbox to drain... (${ELAPSED}s/${MAX_DRAIN_WAIT}s) - Outbox size: ${OUTBOX_SIZE}"
done
# Verify message delivery
FINAL_SENT=$(curl -s http://prometheus:9090/api/v1/query?query=messages_sent_total\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
MESSAGES_DELIVERED=$(echo "${FINAL_SENT} - ${INITIAL_SENT}" | bc)
echo "Messages delivered: ${MESSAGES_DELIVERED}"
if (( $(echo "${MESSAGES_DELIVERED} > 0" | bc -l) )); then
echo "✅ Messages delivered successfully"
exit 0
else
echo "⚠️ No messages delivered"
exit 1
fi
Expected Behavior
Broker Failure Phase (0-15 minutes):
- Publish failures: Message publishing to Service Bus fails
- Outbox pattern: Messages queued in outbox (database)
- Automatic retry: Periodic retry attempts (exponential backoff)
- Message ordering: Message order maintained in outbox
- No message loss: All messages persisted in outbox
Recovery Phase (15-25 minutes):
- Broker restoration: Service Bus topic restored
- Outbox processing: Outbox processor processes queued messages
- Message delivery: Messages delivered to Service Bus
- Outbox drain: Outbox emptied (all messages delivered)
- Normal operation: Service returns to normal operation
Expected Metrics
| Metric | Baseline | During Failure | Expected Range | Recovery Target |
|---|---|---|---|---|
| Outbox Size | 0 | Increasing | Any | 0 |
| Message Success Rate | 100% | 0% (queued) | 0% | 100% |
| Retry Attempts | 0/sec | >0/sec | >0/sec | 0/sec |
| DLQ Size | 0 | 0 | 0 | 0 |
| Message Loss | None | None | None | None |
Validation Criteria
Success Criteria:
- ✅ No message loss (all messages in outbox)
- ✅ Outbox size increases during failure
- ✅ Automatic retry attempts active
- ✅ No messages in DLQ
- ✅ Outbox drains after broker restoration
- ✅ All messages delivered eventually
Outbox Pattern Implementation
Outbox Table Schema:
-- Outbox table for message persistence
CREATE TABLE OutboxMessages (
Id UNIQUEIDENTIFIER PRIMARY KEY DEFAULT NEWID(),
MessageId NVARCHAR(255) NOT NULL,
MessageType NVARCHAR(255) NOT NULL,
Payload NVARCHAR(MAX) NOT NULL,
TopicName NVARCHAR(255) NOT NULL,
Status NVARCHAR(50) NOT NULL DEFAULT 'Pending', -- Pending, Processing, Sent, Failed
RetryCount INT NOT NULL DEFAULT 0,
CreatedAt DATETIME2 NOT NULL DEFAULT GETUTCDATE(),
ProcessedAt DATETIME2 NULL,
NextRetryAt DATETIME2 NULL,
ErrorMessage NVARCHAR(MAX) NULL,
INDEX IX_OutboxMessages_Status_NextRetryAt (Status, NextRetryAt)
);
Outbox Processor Configuration:
# kubernetes/configmaps/outbox-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: outbox-config
namespace: atp-ingest-ns
data:
OutboxProcessor.json: |
{
"BatchSize": 100,
"ProcessingIntervalSeconds": 5,
"MaxRetryAttempts": 10,
"RetryBackoffSeconds": 30,
"ExponentialBackoff": true,
"MaxBackoffSeconds": 300
}
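Under one common reading of the `ExponentialBackoff` settings above — base delay doubled per attempt and capped at `MaxBackoffSeconds`, which is an assumption about the processor's exact schedule — the retry delays work out as follows:

```shell
# delay(n) = min(RetryBackoffSeconds * 2^(n-1), MaxBackoffSeconds)
backoff_seconds() {
  local retry="$1" base=30 cap=300
  local delay=$(( base * (1 << (retry - 1)) ))
  [ "${delay}" -gt "${cap}" ] && delay="${cap}"
  echo "${delay}"
}

backoff_seconds 1  # 30
backoff_seconds 4  # 240
backoff_seconds 5  # 300 (capped)
```

With `MaxRetryAttempts: 10`, a message therefore keeps retrying at the 300-second cap until it is delivered or marked `Failed`.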
Outbox Monitoring Script:
#!/bin/bash
# scripts/monitor-outbox.sh
SERVICE="${1:-atp-ingestion-api}"
NAMESPACE="${2:-atp-ingest-ns}"
echo "📊 Monitoring outbox for ${SERVICE}"
# Get outbox metrics
OUTBOX_SIZE=$(curl -s http://prometheus:9090/api/v1/query?query=outbox_messages_count\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
PENDING_MESSAGES=$(curl -s http://prometheus:9090/api/v1/query?query=outbox_messages_count\{service=\"${SERVICE}\",status=\"Pending\"\} | jq -r '.data.result[0].value[1]')
PROCESSING_MESSAGES=$(curl -s http://prometheus:9090/api/v1/query?query=outbox_messages_count\{service=\"${SERVICE}\",status=\"Processing\"\} | jq -r '.data.result[0].value[1]')
FAILED_MESSAGES=$(curl -s http://prometheus:9090/api/v1/query?query=outbox_messages_count\{service=\"${SERVICE}\",status=\"Failed\"\} | jq -r '.data.result[0].value[1]')
echo "Outbox metrics:"
echo " Total outbox size: ${OUTBOX_SIZE}"
echo " Pending messages: ${PENDING_MESSAGES}"
echo " Processing messages: ${PROCESSING_MESSAGES}"
echo " Failed messages: ${FAILED_MESSAGES}"
# Check processing rate
PROCESSING_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(outbox_messages_processed_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
echo " Processing rate: ${PROCESSING_RATE}/sec"
# Check retry rate
RETRY_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(outbox_retry_attempts\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
echo " Retry rate: ${RETRY_RATE}/sec"
# Validate outbox health
if (( $(echo "${FAILED_MESSAGES} > 0" | bc -l) )); then
echo "⚠️ Failed messages in outbox: ${FAILED_MESSAGES}"
# Alert if the count of permanently failed messages exceeds the tolerated threshold
MAX_FAILED_MESSAGES=10
if (( $(echo "${FAILED_MESSAGES} > ${MAX_FAILED_MESSAGES}" | bc -l) )); then
echo "❌ Failed message count exceeds threshold (${MAX_FAILED_MESSAGES})"
exit 1
fi
fi
if (( $(echo "${OUTBOX_SIZE} > 10000" | bc -l) )); then
echo "⚠️ Outbox size exceeding threshold: ${OUTBOX_SIZE}"
exit 1
fi
echo "✅ Outbox healthy"
Cache Failure Experiment¶
Cache failure experiments validate that ATP services handle cache unavailability gracefully through database fallback, graceful degradation, and performance impact mitigation.
Hypothesis
"When Redis cache becomes unavailable, the service will fallback to database queries, experience higher latency but remain functional, maintain availability, and recover automatically when the cache is restored."
Experiment Configuration
Redis Cache Unavailability Simulation:
# chaos-experiments/redis-cache-failure.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: redis-cache-network-failure
namespace: chaos-testing
labels:
category: application
service: redis-cache
severity: medium
annotations:
chaos.atp.connectsoft.io/hypothesis: |
When Redis cache becomes unavailable, service will fallback to database queries,
experience higher latency but remain functional, maintain availability,
and recover automatically when cache is restored.
spec:
action: partition
mode: all
selector:
namespaces:
- atp-query-ns
labelSelectors:
app: atp-query-api
direction: both
target:
mode: all
selector:
address: "*.redis.cache.windows.net"
duration: "10m"
Cache Failure Simulation Script:
#!/bin/bash
# scripts/execute-cache-failure-experiment.sh
REDIS_CACHE="${1:-atp-redis-cache.redis.cache.windows.net}"
SERVICE="${2:-atp-query-api}"
NAMESPACE="${3:-atp-query-ns}"
echo "🧪 Starting cache failure experiment"
echo "Redis cache: ${REDIS_CACHE}"
echo "Service: ${SERVICE}"
# Get baseline metrics (capture the filename once so the later read matches)
echo "📊 Collecting baseline metrics..."
BASELINE_FILE="baseline-${SERVICE}-$(date +%Y%m%d-%H%M%S).json"
./scripts/collect-baseline-metrics.sh \
--service ${SERVICE} \
--duration 1h \
--output "${BASELINE_FILE}"
BASELINE_LATENCY=$(jq -r '.metrics.p95_latency_ms' "${BASELINE_FILE}")
BASELINE_CACHE_HIT_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=cache_hit_rate\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
echo "Baseline P95 latency: ${BASELINE_LATENCY}ms"
echo "Baseline cache hit rate: ${BASELINE_CACHE_HIT_RATE}%"
# Apply network partition to Redis
echo "🔧 Applying network partition to Redis cache..."
kubectl apply -f chaos-experiments/redis-cache-failure.yaml -n chaos-testing
FAILURE_START=$(date +%s)
# Monitor service behavior
echo "👀 Monitoring service behavior..."
MAX_WAIT=600 # 10 minutes
ELAPSED=0
while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
# Check cache hit rate
CACHE_HIT_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=cache_hit_rate\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
# Check P95 latency
P95_LATENCY=$(curl -s http://prometheus:9090/api/v1/query?query=histogram_quantile\(0.95,rate\(http_request_duration_seconds_bucket\{service=\"${SERVICE}\"\}[1m]\)\) | jq -r '.data.result[0].value[1]')
P95_LATENCY_MS=$(echo "${P95_LATENCY} * 1000" | bc)
# Check database query rate
DB_QUERY_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(database_queries_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
# Check request success rate
SUCCESS_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SERVICE}\",status!~\"5..\"\}[1m]\) | jq -r '.data.result[0].value[1]')
echo "Metrics at ${ELAPSED}s:"
echo " Cache hit rate: ${CACHE_HIT_RATE}%"
echo " P95 latency: ${P95_LATENCY_MS}ms"
echo " Database query rate: ${DB_QUERY_RATE}/sec"
echo " Success rate: ${SUCCESS_RATE}"
# Validate latency increase is acceptable
LATENCY_INCREASE=$(echo "${P95_LATENCY_MS} - ${BASELINE_LATENCY}" | bc)
MAX_ACCEPTABLE_INCREASE=500 # 500ms
if (( $(echo "${LATENCY_INCREASE} > ${MAX_ACCEPTABLE_INCREASE}" | bc -l) )); then
echo "⚠️ Latency increase too high: +${LATENCY_INCREASE}ms > +${MAX_ACCEPTABLE_INCREASE}ms"
fi
sleep 30
ELAPSED=$((ELAPSED + 30))
done
# Remove network partition
echo "🔧 Removing network partition..."
kubectl delete networkchaos redis-cache-network-failure -n chaos-testing
RECOVERY_START=$(date +%s)
# Wait for cache recovery
echo "⏳ Waiting for cache recovery..."
sleep 120
# Verify service recovery
FINAL_CACHE_HIT_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=cache_hit_rate\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
FINAL_LATENCY=$(curl -s http://prometheus:9090/api/v1/query?query=histogram_quantile\(0.95,rate\(http_request_duration_seconds_bucket\{service=\"${SERVICE}\"\}[1m]\)\) | jq -r '.data.result[0].value[1]')
FINAL_LATENCY_MS=$(echo "${FINAL_LATENCY} * 1000" | bc)
if (( $(echo "${FINAL_CACHE_HIT_RATE} >= 80" | bc -l) )); then
echo "✅ Cache recovered: Hit rate = ${FINAL_CACHE_HIT_RATE}%"
if (( $(echo "${FINAL_LATENCY_MS} <= ${BASELINE_LATENCY} * 1.1" | bc -l) )); then
echo "✅ Latency recovered: ${FINAL_LATENCY_MS}ms"
exit 0
else
echo "⚠️ Latency not fully recovered: ${FINAL_LATENCY_MS}ms"
exit 1
fi
else
echo "⚠️ Cache not fully recovered: Hit rate = ${FINAL_CACHE_HIT_RATE}%"
exit 1
fi
Expected Behavior
Cache Failure Phase (0-10 minutes):
- Cache unavailability: Redis cache connection failures
- Cache miss rate: Cache hit rate drops to 0%
- Database fallback: Service queries database directly
- Latency increase: Latency increases (database queries slower than cache)
- Service continuity: Service remains functional
Recovery Phase (10-15 minutes):
- Cache restoration: Redis cache restored
- Cache warming: Cache repopulated with frequently accessed data
- Latency normalization: Latency returns to baseline
- Normal operation: Service returns to normal operation
Expected Metrics
| Metric | Baseline | During Failure | Expected Range | Recovery Target |
|---|---|---|---|---|
| Cache Hit Rate | 85% | 0% | 0% | 85% |
| P95 Latency | 145ms | <645ms | Baseline + <500ms | 145ms |
| Database Query Rate | 100/sec | 500/sec | 3-5x increase | 100/sec |
| Request Success Rate | 99.95% | >99% | >99% | 99.95% |
| Error Rate | 0.05% | <1% | <1% | 0.05% |
Validation Criteria
Success Criteria:
- ✅ Cache hit rate drops to 0% (cache unavailable)
- ✅ Service falls back to database queries
- ✅ Latency increase <500ms (acceptable degradation)
- ✅ Request success rate >99%
- ✅ No service crashes
- ✅ Cache recovers automatically when restored
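The fallback path these criteria exercise — serve from Redis when available, fall back to the database on a cache miss or cache error, and repopulate on recovery — can be sketched as a read-through wrapper. This is illustrative only: the in-memory dict and `fetch_from_db` stand in for the real Redis client and repository.

```python
from typing import Callable, Dict

class ReadThroughCache:
    """Illustrative read-through cache with database fallback.

    On a cache hit, serve the cached value. On a miss *or* while the
    cache is unhealthy (simulating Redis partitioned away), fall back
    to the database loader and, when the cache is healthy, repopulate.
    """

    def __init__(self, loader: Callable[[str], str]) -> None:
        self._store: Dict[str, str] = {}
        self._loader = loader
        self.healthy = True   # flipped to False during a cache outage
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> str:
        if self.healthy and key in self._store:
            self.hits += 1
            return self._store[key]
        # Cache unavailable or key missing: fall back to the database.
        self.misses += 1
        value = self._loader(key)
        if self.healthy:
            self._store[key] = value  # repopulate (cache warming)
        return value

def fetch_from_db(key: str) -> str:
    return f"db-value-for-{key}"      # hypothetical repository call

cache = ReadThroughCache(fetch_from_db)
cache.get("tenant-1")                                     # miss -> DB, then cached
assert cache.get("tenant-1") == "db-value-for-tenant-1"   # hit
cache.healthy = False                                     # simulate Redis partition
assert cache.get("tenant-1") == "db-value-for-tenant-1"   # still served, via DB
```

The key property the experiment validates is the last line: requests keep succeeding during the outage, at the cost of higher latency from the database path.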
Cache Fallback Configuration
Cache Configuration Example:
# kubernetes/configmaps/cache-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: cache-config
namespace: atp-query-ns
data:
CacheConfig.json: |
{
"Redis": {
"ConnectionString": "atp-redis-cache.redis.cache.windows.net:6380",
"Database": 0,
"ConnectionTimeout": 5000,
"RetryPolicy": {
"MaxRetries": 3,
"RetryDelay": 100
},
"FallbackToDatabase": true,
"CacheWarming": {
"Enabled": true,
"WarmOnStartup": true,
"WarmOnRecovery": true
}
},
"CachePolicies": {
"DefaultTTL": 3600,
"MaxTTL": 86400,
"SlidingExpiration": true
}
}
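The `SlidingExpiration` policy above means each read pushes an entry's expiry forward by `DefaultTTL`, while `MaxTTL` caps the entry's total lifetime from creation. A minimal sketch of that eviction rule, under the assumption that this is the intended semantics (the real cache library may differ):

```python
DEFAULT_TTL = 3600   # seconds, from CachePolicies.DefaultTTL
MAX_TTL = 86400      # seconds, from CachePolicies.MaxTTL

def expiry_after_access(created_at: float, accessed_at: float) -> float:
    """Entry expiry time after an access at `accessed_at`.

    Sliding expiration extends the expiry to access + DefaultTTL,
    but never beyond creation + MaxTTL.
    """
    return min(accessed_at + DEFAULT_TTL, created_at + MAX_TTL)

# Entry created at t=0, accessed at t=1000: expiry slides to 4600.
assert expiry_after_access(0, 1000) == 4600
# Accessed near the end of its life: capped at created + MaxTTL.
assert expiry_after_access(0, 86000) == 86400
```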
Cache Monitoring Script:
#!/bin/bash
# scripts/monitor-cache-health.sh
SERVICE="${1:-atp-query-api}"
NAMESPACE="${2:-atp-query-ns}"
echo "📊 Monitoring cache health for ${SERVICE}"
# Get cache metrics
CACHE_HIT_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=cache_hit_rate\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
CACHE_MISS_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=cache_miss_rate\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
CACHE_SIZE=$(curl -s http://prometheus:9090/api/v1/query?query=cache_size_bytes\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
CACHE_EVICTIONS=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(cache_evictions_total\{service=\"${SERVICE}\"\}[5m]\) | jq -r '.data.result[0].value[1]')
echo "Cache metrics:"
echo " Cache hit rate: ${CACHE_HIT_RATE}%"
echo " Cache miss rate: ${CACHE_MISS_RATE}%"
echo " Cache size: ${CACHE_SIZE} bytes"
echo " Cache evictions: ${CACHE_EVICTIONS}/sec"
# Check Redis connection
REDIS_CONNECTIONS=$(curl -s http://prometheus:9090/api/v1/query?query=redis_connections_active\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
REDIS_CONNECTION_FAILURES=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(redis_connection_failures\{service=\"${SERVICE}\"\}[5m]\) | jq -r '.data.result[0].value[1]')
echo "Redis connection metrics:"
echo " Active connections: ${REDIS_CONNECTIONS}"
echo " Connection failures: ${REDIS_CONNECTION_FAILURES}/sec"
# Validate cache health
if (( $(echo "${CACHE_HIT_RATE} < 70" | bc -l) )); then
echo "⚠️ Low cache hit rate: ${CACHE_HIT_RATE}%"
fi
if (( $(echo "${REDIS_CONNECTION_FAILURES} > 0" | bc -l) )); then
echo "⚠️ Redis connection failures detected: ${REDIS_CONNECTION_FAILURES}/sec"
# Check if fallback is working
DB_QUERY_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(database_queries_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
echo " Database query rate: ${DB_QUERY_RATE}/sec (fallback active)"
fi
# Check cache warming
CACHE_WARMING=$(curl -s http://prometheus:9090/api/v1/query?query=cache_warming_active\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
if [ "${CACHE_WARMING}" = "1" ]; then
echo "✅ Cache warming active"
fi
echo "✅ Cache monitoring complete"
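The thresholds the monitoring script applies (warn below a 70% hit rate; the cache-failure experiment treats 80%+ as recovered) can be collapsed into one classification helper. A sketch using the thresholds from this document:

```python
def classify_cache_health(hit_rate_percent: float) -> str:
    """Map a cache hit rate to a health state using this runbook's thresholds."""
    if hit_rate_percent >= 80:
        return "healthy"      # at or above the recovery target
    if hit_rate_percent >= 70:
        return "degraded"     # below target, above the warning threshold
    return "unhealthy"        # the monitoring script warns below 70%

assert classify_cache_health(85) == "healthy"
assert classify_cache_health(75) == "degraded"
assert classify_cache_health(0) == "unhealthy"   # cache fully partitioned away
```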
Dependency Failure Visualization
graph TD
SERVICE[ATP Service] --> DOWNSTREAM[Downstream Service]
SERVICE --> DATABASE[Database]
SERVICE --> BROKER[Message Broker]
SERVICE --> CACHE[Cache]
DOWNSTREAM -->|Fails| CB1[Circuit Breaker Opens]
CB1 --> CACHE1[Use Cached Data]
CACHE1 --> CONTINUE1[Continue Operating]
DATABASE -->|Fails| CB2[Circuit Breaker Opens]
CB2 --> REPLICA[Fallback to Read Replica]
REPLICA --> READONLY[Read-Only Mode]
READONLY --> CONTINUE2[Continue Operating]
BROKER -->|Fails| OUTBOX[Queue in Outbox]
OUTBOX --> RETRY[Automatic Retry]
RETRY --> EVENTUAL[Eventual Delivery]
EVENTUAL --> CONTINUE3[Continue Operating]
CACHE -->|Fails| DB[Fallback to Database]
DB --> HIGHERLAT[Higher Latency]
HIGHERLAT --> CONTINUE4[Continue Operating]
style SERVICE fill:#FFE5B4
style CB1 fill:#FFB6C1
style CB2 fill:#FFB6C1
style OUTBOX fill:#FFB6C1
style DB fill:#FFB6C1
style CONTINUE1 fill:#90EE90
style CONTINUE2 fill:#90EE90
style CONTINUE3 fill:#90EE90
style CONTINUE4 fill:#90EE90
Experiment Results Analysis
Downstream Service Failure Results:
{
"experiment": "policy-service-failure",
"status": "success",
"metrics": {
"request_success_rate": {
"baseline": 99.95,
"during_failure": 99.92,
"status": "pass"
},
"cache_hit_rate": {
"baseline": 85,
"during_failure": 100,
"status": "pass"
},
"circuit_breaker_state": {
"baseline": "Closed",
"during_failure": "Open",
"recovery": "Closed",
"status": "pass"
}
},
"findings": {
"what_worked_well": [
"Circuit breaker opened correctly",
"Cache used for all requests",
"No service failures",
"Automatic recovery when service restored"
]
}
}
Database Failure Results:
{
"experiment": "database-failure",
"status": "success",
"metrics": {
"circuit_breaker_state": {
"baseline": "Closed",
"during_failure": "Open",
"status": "pass"
},
"read_replica_usage": {
"baseline": 20,
"during_failure": 100,
"status": "pass"
},
"request_success_rate": {
"baseline": 99.95,
"during_failure": 97.5,
"status": "pass"
}
},
"findings": {
"what_worked_well": [
"Circuit breaker opened correctly",
"Read replica fallback worked",
"Service operated in read-only mode",
"No crashes or data loss"
]
}
}
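Both result sets above record the same circuit-breaker lifecycle: Closed under normal load, Open after consecutive failures, and back through Half-Open to Closed once the dependency recovers. A compact sketch of that state machine, using the `FailureThreshold: 5` from the HTTP client configuration in this document — illustrative, not ATP's actual implementation:

```python
class CircuitBreaker:
    """Minimal Closed -> Open -> HalfOpen -> Closed state machine (sketch)."""

    def __init__(self, failure_threshold: int = 5) -> None:
        self.failure_threshold = failure_threshold
        self.failures = 0
        self.state = "Closed"

    def record_failure(self) -> None:
        self.failures += 1
        if self.state == "HalfOpen" or self.failures >= self.failure_threshold:
            self.state = "Open"          # trip: stop calling the dependency

    def record_success(self) -> None:
        self.failures = 0
        self.state = "Closed"            # dependency healthy again

    def cooldown_elapsed(self) -> None:
        if self.state == "Open":
            self.state = "HalfOpen"      # allow one probe request through

cb = CircuitBreaker()
for _ in range(5):
    cb.record_failure()                  # dependency failing
assert cb.state == "Open"
cb.cooldown_elapsed()
assert cb.state == "HalfOpen"
cb.record_success()                      # probe succeeded
assert cb.state == "Closed"
```

The experiments validate exactly these transitions: the breaker must be Open during the failure window and Closed again after recovery, which is what the `circuit_breaker_state` checks in the scripts assert.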
Summary: Service Dependency Chaos¶
- Downstream Service Failure: Validates circuit breaker functionality, caching strategies, and graceful degradation during downstream service failures; expects no ingestion failures, 100% cache hit rate, circuit breaker opens correctly, and automatic recovery
- Database Failure Experiment: Validates circuit breaker functionality, read replica fallback, and read-only mode operation during database connection failures; expects circuit breaker opens, fallback to read replicas, service operates in read-only mode, and recovery within acceptable time
- Message Broker Failure: Validates outbox pattern, message queuing, automatic retry, and eventual delivery during message broker unavailability; expects no message loss, outbox queues messages, automatic retry active, and eventual delivery when broker restored
- Cache Failure Experiment: Validates database fallback, graceful degradation, and performance impact mitigation during cache unavailability; expects cache hit rate drops to 0%, fallback to database, latency increase <500ms, and automatic recovery
- Monitoring and Validation: Comprehensive scripts for monitoring circuit breaker states, cache health, outbox processing, database connections, and dependency failure recovery
Application Behavior Chaos¶
Purpose: Define chaos experiments for application-behavior failures in ATP — validating latency handling, error resilience, backpressure mechanisms, and traffic surge management — so that ATP services remain available and functional under application-level stress.
Latency Injection¶
Latency injection experiments validate that ATP services handle network latency gracefully through timeout configurations, retry mechanisms, and graceful degradation, ensuring service availability and performance under high-latency conditions.
Hypothesis
"When network latency increases to 500ms with 100ms jitter, services will continue operating with increased response times, retry mechanisms will handle timeouts, no request failures will occur, and services will recover when latency returns to normal."
Experiment Configuration
NetworkChaos for Latency Injection:
# chaos-experiments/ingestion-latency-injection.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: ingestion-latency-injection
namespace: chaos-testing
labels:
category: application
service: atp-ingestion-api
severity: medium
frequency: monthly
annotations:
chaos.atp.connectsoft.io/hypothesis: |
When network latency increases to 500ms with 100ms jitter, services will continue
operating with increased response times, retry mechanisms will handle timeouts,
no request failures will occur, and services will recover when latency returns to normal.
spec:
action: delay
mode: all
selector:
namespaces:
- atp-ingest-ns
labelSelectors:
app: atp-ingest-api
direction: both
delay:
latency: "500ms"
correlation: "25"
jitter: "100ms"
duration: "10m"
Gradual Latency Increase:
# chaos-experiments/gradual-latency-injection.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: gradual-latency-injection
namespace: chaos-testing
labels:
category: application
service: atp-ingestion-api
severity: medium
annotations:
chaos.atp.connectsoft.io/hypothesis: |
Gradual latency increase will validate service resilience to progressive degradation.
spec:
action: delay
mode: fixed-percent
value: "100"
selector:
namespaces:
- atp-ingest-ns
labelSelectors:
app: atp-ingest-api
direction: both
delay:
latency: "500ms"
correlation: "25"
jitter: "100ms"
scheduler:
cron: "@every 1m"
duration: "10m"
Latency Injection Script:
#!/bin/bash
# scripts/execute-latency-injection-experiment.sh
SERVICE="${1:-atp-ingestion-api}"
NAMESPACE="${2:-atp-ingest-ns}"
LATENCY="${3:-500ms}"
JITTER="${4:-100ms}"
DURATION="${5:-10m}"
echo "🧪 Starting latency injection experiment"
echo "Service: ${SERVICE}"
echo "Latency: ${LATENCY}"
echo "Jitter: ${JITTER}"
echo "Duration: ${DURATION}"
# Get baseline metrics
echo "📊 Collecting baseline metrics..."
BASELINE_FILE="baseline-${SERVICE}-$(date +%Y%m%d-%H%M%S).json"
./scripts/collect-baseline-metrics.sh \
--service ${SERVICE} \
--duration 1h \
--output "${BASELINE_FILE}"
BASELINE_P50=$(jq -r '.metrics.p50_latency_ms' "${BASELINE_FILE}")
BASELINE_P95=$(jq -r '.metrics.p95_latency_ms' "${BASELINE_FILE}")
BASELINE_P99=$(jq -r '.metrics.p99_latency_ms' "${BASELINE_FILE}")
echo "Baseline latency metrics:"
echo " P50: ${BASELINE_P50}ms"
echo " P95: ${BASELINE_P95}ms"
echo " P99: ${BASELINE_P99}ms"
# Apply latency injection
echo "🔧 Applying latency injection..."
kubectl apply -f - <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: latency-injection-${SERVICE}
namespace: chaos-testing
spec:
action: delay
mode: all
selector:
namespaces:
- ${NAMESPACE}
labelSelectors:
app: ${SERVICE}
direction: both
delay:
latency: "${LATENCY}"
correlation: "25"
jitter: "${JITTER}"
duration: "${DURATION}"
EOF
INJECTION_START=$(date +%s)
# Monitor service behavior
echo "👀 Monitoring service behavior during latency injection..."
MAX_WAIT=600 # 10 minutes
ELAPSED=0
while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
# Get current latency metrics
P50_LATENCY=$(curl -s http://prometheus:9090/api/v1/query?query=histogram_quantile\(0.50,rate\(http_request_duration_seconds_bucket\{service=\"${SERVICE}\"\}[1m]\)\) | jq -r '.data.result[0].value[1]')
P50_LATENCY_MS=$(echo "${P50_LATENCY} * 1000" | bc)
P95_LATENCY=$(curl -s http://prometheus:9090/api/v1/query?query=histogram_quantile\(0.95,rate\(http_request_duration_seconds_bucket\{service=\"${SERVICE}\"\}[1m]\)\) | jq -r '.data.result[0].value[1]')
P95_LATENCY_MS=$(echo "${P95_LATENCY} * 1000" | bc)
P99_LATENCY=$(curl -s http://prometheus:9090/api/v1/query?query=histogram_quantile\(0.99,rate\(http_request_duration_seconds_bucket\{service=\"${SERVICE}\"\}[1m]\)\) | jq -r '.data.result[0].value[1]')
P99_LATENCY_MS=$(echo "${P99_LATENCY} * 1000" | bc)
# Check request success rate
SUCCESS_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SERVICE}\",status!~\"5..\"\}[1m]\) | jq -r '.data.result[0].value[1]')
# Check timeout errors
TIMEOUT_ERRORS=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SERVICE}\",status=\"504\"\}[1m]\) | jq -r '.data.result[0].value[1]')
# Check retry attempts
RETRY_ATTEMPTS=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_retry_attempts\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
echo "Metrics at ${ELAPSED}s:"
echo " P50 latency: ${P50_LATENCY_MS}ms (baseline: ${BASELINE_P50}ms)"
echo " P95 latency: ${P95_LATENCY_MS}ms (baseline: ${BASELINE_P95}ms)"
echo " P99 latency: ${P99_LATENCY_MS}ms (baseline: ${BASELINE_P99}ms)"
echo " Success rate: ${SUCCESS_RATE}"
echo " Timeout errors: ${TIMEOUT_ERRORS}/sec"
echo " Retry attempts: ${RETRY_ATTEMPTS}/sec"
# Validate latency increase is expected
LATENCY_INCREASE=$(echo "${P50_LATENCY_MS} - ${BASELINE_P50}" | bc)
EXPECTED_INCREASE=500 # 500ms base latency
if (( $(echo "${LATENCY_INCREASE} > ${EXPECTED_INCREASE} * 1.5" | bc -l) )); then
echo "⚠️ Latency increase exceeds 1.5x expected: +${LATENCY_INCREASE}ms (expected ~+${EXPECTED_INCREASE}ms)"
fi
# Validate no timeout errors
if (( $(echo "${TIMEOUT_ERRORS} > 0" | bc -l) )); then
echo "⚠️ Timeout errors detected: ${TIMEOUT_ERRORS}/sec"
fi
sleep 30
ELAPSED=$((ELAPSED + 30))
done
# Remove latency injection
echo "🔧 Removing latency injection..."
kubectl delete networkchaos latency-injection-${SERVICE} -n chaos-testing
RECOVERY_START=$(date +%s)
# Wait for recovery
echo "⏳ Waiting for latency to return to normal..."
sleep 120
# Verify latency recovery
FINAL_P50=$(curl -s http://prometheus:9090/api/v1/query?query=histogram_quantile\(0.50,rate\(http_request_duration_seconds_bucket\{service=\"${SERVICE}\"\}[1m]\)\) | jq -r '.data.result[0].value[1]')
FINAL_P50_MS=$(echo "${FINAL_P50} * 1000" | bc)
if (( $(echo "${FINAL_P50_MS} <= ${BASELINE_P50} * 1.1" | bc -l) )); then
echo "✅ Latency recovered: ${FINAL_P50_MS}ms (baseline: ${BASELINE_P50}ms)"
exit 0
else
echo "⚠️ Latency not fully recovered: ${FINAL_P50_MS}ms (baseline: ${BASELINE_P50}ms)"
exit 1
fi
Expected Behavior
Latency Injection Phase (0-10 minutes):
- Latency increase: Network latency increases to 500ms + jitter
- Response time increase: Service response times increase proportionally
- Retry mechanisms: Retry mechanisms handle transient timeouts
- Timeout handling: Timeout configurations prevent hanging requests
- Service continuity: Service continues operating with degraded performance
Recovery Phase (10-15 minutes):
- Latency normalization: Network latency returns to normal
- Response time recovery: Service response times return to baseline
- Normal operation: Service returns to normal operation
Expected Metrics
| Metric | Baseline | During Injection | Expected Range | Recovery Target |
|---|---|---|---|---|
| P50 Latency | 145ms | <645ms | Baseline + <500ms | 145ms |
| P95 Latency | 250ms | <750ms | Baseline + <500ms | 250ms |
| P99 Latency | 400ms | <900ms | Baseline + <500ms | 400ms |
| Request Success Rate | 99.95% | >99% | >99% | 99.95% |
| Timeout Errors | 0/sec | <1/sec | <1/sec | 0/sec |
| Retry Rate | 0.1/sec | <5/sec | <5/sec | 0.1/sec |
Validation Criteria
Success Criteria:
- ✅ Latency increases proportionally to injected latency
- ✅ No timeout errors (timeout configuration working)
- ✅ Request success rate >99%
- ✅ Retry mechanisms handle transient failures
- ✅ Service recovers automatically when latency removed
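These success criteria reduce to two numeric checks: during injection, observed latency must stay within baseline + 500 ms; after recovery, within 110% of baseline (the same 1.1x factor the script applies with `bc`). A sketch of both checks:

```python
def within_injection_budget(baseline_ms: float, observed_ms: float,
                            budget_ms: float = 500.0) -> bool:
    """During injection: latency may rise, but by less than the budget."""
    return observed_ms - baseline_ms < budget_ms

def recovered(baseline_ms: float, observed_ms: float,
              tolerance: float = 1.1) -> bool:
    """After recovery: latency must be back within 110% of baseline."""
    return observed_ms <= baseline_ms * tolerance

assert within_injection_budget(145, 600)        # +455 ms, inside the budget
assert not within_injection_budget(145, 700)    # +555 ms, outside the budget
assert recovered(145, 150)                      # within 110% of baseline
assert not recovered(145, 200)
```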
Timeout Configuration
HTTP Client Timeout Configuration:
# kubernetes/configmaps/http-client-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: http-client-config
namespace: atp-ingest-ns
data:
HttpClientConfig.json: |
{
"DefaultTimeout": 30,
"ConnectionTimeout": 5,
"ReadTimeout": 25,
"RetryPolicy": {
"MaxRetries": 3,
"RetryDelay": 1000,
"ExponentialBackoff": true,
"MaxBackoff": 5000
},
"CircuitBreaker": {
"FailureThreshold": 5,
"TimeoutSeconds": 30
}
}
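With `RetryDelay: 1000`, `ExponentialBackoff: true`, and `MaxBackoff: 5000`, a natural reading is that retry *n* waits `min(RetryDelay * 2^n, MaxBackoff)` milliseconds. This interpretation is an assumption — the actual client may also add jitter — but it shows how the configured values bound total retry time:

```python
def backoff_schedule(max_retries: int = 3, retry_delay_ms: int = 1000,
                     max_backoff_ms: int = 5000) -> list[int]:
    """Delays (ms) before each retry, doubling and capped at MaxBackoff."""
    return [min(retry_delay_ms * 2 ** attempt, max_backoff_ms)
            for attempt in range(max_retries)]

# Three retries with the configured values: 1s, 2s, 4s (cap never reached).
assert backoff_schedule() == [1000, 2000, 4000]
# A fourth retry would hit the 5s cap: 1s, 2s, 4s, 5s.
assert backoff_schedule(max_retries=4) == [1000, 2000, 4000, 5000]
```

Summing the default schedule (7 s of waiting) against the 30 s `DefaultTimeout` confirms the retry budget fits inside the request timeout.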
Error Injection¶
Error injection experiments validate that ATP services handle various error conditions gracefully through error handling, retry mechanisms, circuit breakers, and graceful degradation.
Hypothesis
"When HTTP 500 errors, database timeout errors, or validation failures are injected, services will handle errors gracefully, retry mechanisms will activate, circuit breakers will protect the service, and services will recover automatically when errors stop."
Experiment Configuration
HTTP Error Injection:
# chaos-experiments/http-error-injection.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: HTTPChaos
metadata:
name: http-error-injection
namespace: chaos-testing
labels:
category: application
service: atp-ingestion-api
severity: medium
frequency: monthly
annotations:
chaos.atp.connectsoft.io/hypothesis: |
When HTTP 500 errors are injected, services will handle errors gracefully,
retry mechanisms will activate, circuit breakers will protect the service,
and services will recover automatically when errors stop.
spec:
mode: all
selector:
namespaces:
- atp-ingest-ns
labelSelectors:
app: atp-ingest-api
target:
target: RequestPath
requestPath: "/api/ingest"
rules:
- port: 8080
path: "/api/ingest"
method: "POST"
statusCode: 500
percentage: 50 # 50% of requests return 500
duration: "10m"
Database Timeout Error Injection:
# chaos-experiments/database-timeout-injection.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: database-timeout-injection
namespace: chaos-testing
labels:
category: application
service: database
severity: high
annotations:
chaos.atp.connectsoft.io/hypothesis: |
When database timeout errors are injected, services will handle errors gracefully,
retry mechanisms will activate, circuit breakers will protect the service,
and services will recover automatically when errors stop.
spec:
action: delay
mode: fixed-percent
value: "50" # 50% of requests
selector:
namespaces:
- atp-ingest-ns
labelSelectors:
app: atp-ingest-api
direction: both
target:
mode: all
selector:
address: "*.database.windows.net"
delay:
latency: "35s" # Exceeds 30s timeout
correlation: "100"
jitter: "0ms"
duration: "10m"
Error Injection Script:
#!/bin/bash
# scripts/execute-error-injection-experiment.sh
ERROR_TYPE="${1:-http500}" # http500, db-timeout, validation
SERVICE="${2:-atp-ingestion-api}"
NAMESPACE="${3:-atp-ingest-ns}"
PERCENTAGE="${4:-50}" # Error percentage
echo "🧪 Starting error injection experiment"
echo "Error type: ${ERROR_TYPE}"
echo "Service: ${SERVICE}"
echo "Error percentage: ${PERCENTAGE}%"
# Get baseline metrics
echo "📊 Collecting baseline metrics..."
BASELINE_FILE="baseline-${SERVICE}-$(date +%Y%m%d-%H%M%S).json"
./scripts/collect-baseline-metrics.sh \
--service ${SERVICE} \
--duration 1h \
--output "${BASELINE_FILE}"
BASELINE_ERROR_RATE=$(jq -r '.metrics.error_rate_percent' "${BASELINE_FILE}")
BASELINE_SUCCESS_RATE=$(jq -r '.metrics.success_rate_percent' "${BASELINE_FILE}")
echo "Baseline metrics:"
echo " Error rate: ${BASELINE_ERROR_RATE}%"
echo " Success rate: ${BASELINE_SUCCESS_RATE}%"
# Apply error injection based on type
case ${ERROR_TYPE} in
http500)
echo "🔧 Injecting HTTP 500 errors..."
kubectl apply -f - <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: HTTPChaos
metadata:
name: http-error-injection-${SERVICE}
namespace: chaos-testing
spec:
mode: all
selector:
namespaces:
- ${NAMESPACE}
labelSelectors:
app: ${SERVICE}
target:
target: RequestPath
requestPath: "/api/ingest"
rules:
- port: 8080
path: "/api/ingest"
method: "POST"
statusCode: 500
percentage: ${PERCENTAGE}
duration: "10m"
EOF
;;
db-timeout)
echo "🔧 Injecting database timeout errors..."
kubectl apply -f - <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: db-timeout-injection-${SERVICE}
namespace: chaos-testing
spec:
action: delay
mode: fixed-percent
value: "${PERCENTAGE}"
selector:
namespaces:
- ${NAMESPACE}
labelSelectors:
app: ${SERVICE}
direction: both
target:
mode: all
selector:
address: "*.database.windows.net"
delay:
latency: "35s"
correlation: "100"
jitter: "0ms"
duration: "10m"
EOF
;;
validation)
echo "🔧 Injecting validation failures..."
# This would be implemented via HTTPChaos with 400 status code
kubectl apply -f - <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: HTTPChaos
metadata:
name: validation-error-injection-${SERVICE}
namespace: chaos-testing
spec:
mode: all
selector:
namespaces:
- ${NAMESPACE}
labelSelectors:
app: ${SERVICE}
target:
target: RequestPath
requestPath: "/api/ingest"
rules:
- port: 8080
path: "/api/ingest"
method: "POST"
statusCode: 400
percentage: ${PERCENTAGE}
duration: "10m"
EOF
;;
esac
INJECTION_START=$(date +%s)
# Monitor service behavior
echo "👀 Monitoring service behavior during error injection..."
MAX_WAIT=600 # 10 minutes
ELAPSED=0
while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
# Get error rate
ERROR_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SERVICE}\",status=~\"5..\"\}[1m]\) | jq -r '.data.result[0].value[1]')
# Get success rate
SUCCESS_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SERVICE}\",status!~\"5..\"\}[1m]\) | jq -r '.data.result[0].value[1]')
# Check circuit breaker state
CB_STATE=$(curl -s http://prometheus:9090/api/v1/query?query=circuit_breaker_state\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
# Check retry attempts
RETRY_ATTEMPTS=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_retry_attempts\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
# Check service availability
AVAILABILITY=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SERVICE}\",status!~\"5..\"\}[1m]\)/rate\(http_requests_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
AVAILABILITY_PERCENT=$(echo "${AVAILABILITY} * 100" | bc)
echo "Metrics at ${ELAPSED}s:"
echo " Error rate: ${ERROR_RATE}/sec"
echo " Success rate: ${SUCCESS_RATE}/sec"
echo " Circuit breaker state: ${CB_STATE}"
echo " Retry attempts: ${RETRY_ATTEMPTS}/sec"
echo " Service availability: ${AVAILABILITY_PERCENT}%"
# Validate circuit breaker behavior
if (( $(echo "${AVAILABILITY_PERCENT} < 95" | bc -l) )); then
if [ "${CB_STATE}" != "Open" ]; then
echo "⚠️ Service availability low but circuit breaker not open"
fi
fi
sleep 30
ELAPSED=$((ELAPSED + 30))
done
# Remove error injection
echo "🔧 Removing error injection..."
case ${ERROR_TYPE} in
http500)
kubectl delete httpchaos http-error-injection-${SERVICE} -n chaos-testing
;;
db-timeout)
kubectl delete networkchaos db-timeout-injection-${SERVICE} -n chaos-testing
;;
validation)
kubectl delete httpchaos validation-error-injection-${SERVICE} -n chaos-testing
;;
esac
RECOVERY_START=$(date +%s)
# Wait for recovery
echo "⏳ Waiting for service recovery..."
sleep 120
# Verify service recovery
FINAL_ERROR_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=100*rate\(http_requests_total\{service=\"${SERVICE}\",status=~\"5..\"\}[1m]\)/rate\(http_requests_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
FINAL_CB_STATE=$(curl -s http://prometheus:9090/api/v1/query?query=circuit_breaker_state\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
# Compare percentages: BASELINE_ERROR_RATE comes from .metrics.error_rate_percent
if (( $(echo "${FINAL_ERROR_RATE} <= ${BASELINE_ERROR_RATE} * 1.1" | bc -l) )); then
echo "✅ Error rate recovered: ${FINAL_ERROR_RATE}% (baseline: ${BASELINE_ERROR_RATE}%)"
if [ "${FINAL_CB_STATE}" = "Closed" ]; then
echo "✅ Circuit breaker closed (service recovered)"
exit 0
else
echo "⚠️ Circuit breaker still open: ${FINAL_CB_STATE}"
exit 1
fi
else
echo "⚠️ Error rate not fully recovered: ${FINAL_ERROR_RATE}% (baseline: ${BASELINE_ERROR_RATE}%)"
exit 1
fi
Expected Behavior
Error Injection Phase (0-10 minutes):
- Error rate increase: Error rate increases proportionally to injection percentage
- Retry activation: Retry mechanisms activate for transient errors
- Circuit breaker protection: Circuit breaker may open if error rate exceeds threshold
- Graceful degradation: Service continues operating with reduced functionality
- Error handling: Errors are logged and handled appropriately
Recovery Phase (10-15 minutes):
- Error injection removal: Error injection stopped
- Retry recovery: Retry mechanisms normalize
- Circuit breaker recovery: Circuit breaker transitions to half-open, then closed
- Normal operation: Service returns to normal operation
Expected Metrics
| Metric | Baseline | During Injection | Expected Range | Recovery Target |
|---|---|---|---|---|
| Error Rate | 0.05% | <5% | <5% | 0.05% |
| Success Rate | 99.95% | >95% | >95% | 99.95% |
| Circuit Breaker State | Closed | Open/HalfOpen | Open/HalfOpen | Closed |
| Retry Rate | 0.1/sec | <10/sec | <10/sec | 0.1/sec |
| Service Availability | 99.95% | >95% | >95% | 99.95% |
Validation Criteria
Success Criteria:
- ✅ Error rate increases proportionally to injection percentage
- ✅ Retry mechanisms handle transient errors
- ✅ Circuit breaker protects service if error rate exceeds threshold
- ✅ Service availability >95%
- ✅ Service recovers automatically when errors stop
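A rough model of how retries interact with injected errors: if each attempt fails independently with probability p and the client makes up to `MaxRetries: 3` retries (four attempts total), the end-to-end failure probability is p^4. For p = 0.5 that is ~6.25%, so the retry configuration (and which statuses are retried) materially determines whether the >95% availability criterion is met. This is an idealized sizing model — real attempts are not independent, and not every status code is retried:

```python
def end_to_end_failure_prob(p_attempt_failure: float, max_retries: int) -> float:
    """Idealized model: a request fails only if every attempt fails."""
    attempts = max_retries + 1
    return p_attempt_failure ** attempts

# 50% injected errors, 3 retries -> 0.5**4 = 6.25% end-to-end failures.
assert abs(end_to_end_failure_prob(0.5, 3) - 0.0625) < 1e-12
# 10% injection with the same retries is effectively invisible (~0.01%).
assert end_to_end_failure_prob(0.1, 3) < 0.001
```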
Slow Consumer Simulation¶
Slow consumer simulation experiments validate that ATP services handle slow message processing gracefully through backpressure mechanisms, queue depth limits, and message throttling.
Hypothesis
"When message consumers process messages slowly, backpressure mechanisms will activate, queue depth limits will prevent queue overflow, message throttling will protect the system, and services will recover when processing returns to normal."
Experiment Configuration
Slow Consumer Simulation:
#!/bin/bash
# scripts/execute-slow-consumer-simulation.sh
SERVICE="${1:-atp-query-api}"
NAMESPACE="${2:-atp-query-ns}"
PROCESSING_DELAY="${3:-5s}" # Processing delay per message
DURATION="${4:-10m}"
echo "🧪 Starting slow consumer simulation"
echo "Service: ${SERVICE}"
echo "Processing delay: ${PROCESSING_DELAY} per message"
echo "Duration: ${DURATION}"
# Get baseline metrics
echo "📊 Collecting baseline metrics..."
BASELINE_FILE="baseline-${SERVICE}-$(date +%Y%m%d-%H%M%S).json"
./scripts/collect-baseline-metrics.sh \
--service ${SERVICE} \
--duration 1h \
--output "${BASELINE_FILE}"
BASELINE_PROCESSING_RATE=$(jq -r '.metrics.message_processing_rate_per_sec' "${BASELINE_FILE}")
BASELINE_QUEUE_DEPTH=$(jq -r '.metrics.queue_depth' "${BASELINE_FILE}")
echo "Baseline metrics:"
echo " Processing rate: ${BASELINE_PROCESSING_RATE} msg/sec"
echo " Queue depth: ${BASELINE_QUEUE_DEPTH}"
# Simulate slow processing via CPU stress plus network delay on message retrieval
echo "🔧 Applying processing delay..."
kubectl apply -f - <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
name: slow-consumer-simulation-${SERVICE}
namespace: chaos-testing
spec:
mode: fixed-percent
value: "100"
selector:
namespaces:
- ${NAMESPACE}
labelSelectors:
app: ${SERVICE}
stressors:
cpu:
workers: 1
load: 10 # light CPU stress; the Service Bus network delay below provides most of the slowdown
duration: "${DURATION}"
EOF
# Additionally, inject network delay to slow message retrieval
kubectl apply -f - <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: slow-consumer-network-delay-${SERVICE}
namespace: chaos-testing
spec:
action: delay
mode: fixed-percent
value: "100"
selector:
namespaces:
- ${NAMESPACE}
labelSelectors:
app: ${SERVICE}
direction: both
target:
mode: all
selector:
address: "*.servicebus.windows.net"
delay:
latency: "${PROCESSING_DELAY}"
correlation: "100"
jitter: "0ms"
duration: "${DURATION}"
EOF
SIMULATION_START=$(date +%s)
# Monitor service behavior
echo "👀 Monitoring service behavior during slow consumer simulation..."
MAX_WAIT=600 # 10 minutes
ELAPSED=0
while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
# Get queue depth
QUEUE_DEPTH=$(curl -s http://prometheus:9090/api/v1/query?query=queue_depth\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
# Get processing rate
PROCESSING_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(messages_processed_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
# Get message arrival rate
ARRIVAL_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(messages_arrived_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
# Get backpressure state
BACKPRESSURE_ACTIVE=$(curl -s http://prometheus:9090/api/v1/query?query=backpressure_active\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
# Get DLQ size
DLQ_SIZE=$(curl -s http://prometheus:9090/api/v1/query?query=dlq_messages_count\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
echo "Metrics at ${ELAPSED}s:"
echo " Queue depth: ${QUEUE_DEPTH} (baseline: ${BASELINE_QUEUE_DEPTH})"
echo " Processing rate: ${PROCESSING_RATE} msg/sec (baseline: ${BASELINE_PROCESSING_RATE} msg/sec)"
echo " Arrival rate: ${ARRIVAL_RATE} msg/sec"
echo " Backpressure active: ${BACKPRESSURE_ACTIVE}"
echo " DLQ size: ${DLQ_SIZE}"
# Validate queue depth limits
MAX_QUEUE_DEPTH=10000
if (( $(echo "${QUEUE_DEPTH} > ${MAX_QUEUE_DEPTH}" | bc -l) )); then
echo "⚠️ Queue depth exceeds limit: ${QUEUE_DEPTH} > ${MAX_QUEUE_DEPTH}"
fi
# Validate backpressure activation
if (( $(echo "${QUEUE_DEPTH} > ${BASELINE_QUEUE_DEPTH} * 2" | bc -l) )); then
if [ "${BACKPRESSURE_ACTIVE}" != "1" ]; then
echo "⚠️ Queue depth high but backpressure not active"
fi
fi
# Validate no message loss
if (( $(echo "${DLQ_SIZE} > 0" | bc -l) )); then
echo "⚠️ Messages in DLQ (potential message loss)"
fi
sleep 30
ELAPSED=$((ELAPSED + 30))
done
# Remove simulation
echo "🔧 Removing slow consumer simulation..."
kubectl delete stresschaos slow-consumer-simulation-${SERVICE} -n chaos-testing
kubectl delete networkchaos slow-consumer-network-delay-${SERVICE} -n chaos-testing
RECOVERY_START=$(date +%s)
# Wait for queue to drain
echo "⏳ Waiting for queue to drain..."
MAX_DRAIN_WAIT=600 # 10 minutes
ELAPSED=0
while [ ${ELAPSED} -lt ${MAX_DRAIN_WAIT} ]; do
QUEUE_DEPTH=$(curl -s http://prometheus:9090/api/v1/query?query=queue_depth\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
if (( $(echo "${QUEUE_DEPTH} <= ${BASELINE_QUEUE_DEPTH} * 1.1" | bc -l) )); then
DRAIN_TIME=$(date +%s)
DRAIN_DURATION=$((DRAIN_TIME - RECOVERY_START))
echo "✅ Queue drained in ${DRAIN_DURATION} seconds"
break
fi
sleep 30
ELAPSED=$((ELAPSED + 30))
echo "Waiting for queue to drain... (${ELAPSED}s/${MAX_DRAIN_WAIT}s) - Queue depth: ${QUEUE_DEPTH}"
done
# Verify recovery
FINAL_QUEUE_DEPTH=$(curl -s -G http://prometheus:9090/api/v1/query --data-urlencode "query=queue_depth{service=\"${SERVICE}\"}" | jq -r '.data.result[0].value[1]')
FINAL_PROCESSING_RATE=$(curl -s -G http://prometheus:9090/api/v1/query --data-urlencode "query=rate(messages_processed_total{service=\"${SERVICE}\"}[1m])" | jq -r '.data.result[0].value[1]')
if (( $(echo "${FINAL_QUEUE_DEPTH} <= ${BASELINE_QUEUE_DEPTH} * 1.1" | bc -l) )); then
echo "✅ Queue depth recovered: ${FINAL_QUEUE_DEPTH} (baseline: ${BASELINE_QUEUE_DEPTH})"
if (( $(echo "${FINAL_PROCESSING_RATE} >= ${BASELINE_PROCESSING_RATE} * 0.9" | bc -l) )); then
echo "✅ Processing rate recovered: ${FINAL_PROCESSING_RATE} msg/sec (baseline: ${BASELINE_PROCESSING_RATE} msg/sec)"
exit 0
else
echo "⚠️ Processing rate not fully recovered: ${FINAL_PROCESSING_RATE} msg/sec"
exit 1
fi
else
echo "⚠️ Queue depth not fully recovered: ${FINAL_QUEUE_DEPTH} (baseline: ${BASELINE_QUEUE_DEPTH})"
exit 1
fi
Expected Behavior
Slow Consumer Phase (0-10 minutes):
- Processing delay: Message processing slows down
- Queue depth increase: Queue depth increases as processing lags behind arrival
- Backpressure activation: Backpressure mechanisms activate when queue depth exceeds threshold
- Message throttling: Message arrival throttled to prevent queue overflow
- Queue depth limits: Queue depth limits prevent excessive growth
Recovery Phase (10-20 minutes):
- Processing normalization: Processing returns to normal speed
- Queue draining: Queue drains as processing catches up
- Backpressure deactivation: Backpressure mechanisms deactivate
- Normal operation: Service returns to normal operation
Expected Metrics
| Metric | Baseline | During Simulation | Expected Range | Recovery Target |
|---|---|---|---|---|
| Queue Depth | 100 | <10,000 | <10,000 | 100 |
| Processing Rate | 100 msg/sec | <20 msg/sec | <20 msg/sec | 100 msg/sec |
| Backpressure Active | No | Yes | Yes | No |
| Message Loss | None | None | None | None |
| DLQ Size | 0 | 0 | 0 | 0 |
Validation Criteria
Success Criteria:
- ✅ Queue depth increases but stays within limits
- ✅ Backpressure activates when queue depth exceeds threshold
- ✅ No message loss (all messages processed)
- ✅ Queue drains after processing returns to normal
- ✅ Service recovers automatically
Backpressure Configuration
Backpressure Configuration Example:
# kubernetes/configmaps/backpressure-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: backpressure-config
  namespace: atp-query-ns
data:
  BackpressureConfig.json: |
    {
      "QueueDepthThreshold": 5000,
      "MaxQueueDepth": 10000,
      "BackpressureStrategy": "Throttle",
      "ThrottleRate": 0.5,
      "DLQThreshold": 10000,
      "MonitoringIntervalSeconds": 5
    }
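The throttle decision this configuration implies can be sketched in a few lines of shell. This is an illustration, not ATP's implementation: the `accept_fraction` helper is ours, and the thresholds are copied from the config above.

```shell
#!/bin/bash
# Sketch of the throttle decision implied by BackpressureConfig.json.
QUEUE_DEPTH_THRESHOLD=5000
MAX_QUEUE_DEPTH=10000
THROTTLE_RATE="0.5"   # accept 50% of arrivals while backpressure is active

# Prints the fraction of incoming messages to accept for a given queue depth.
accept_fraction() {
  local depth=$1
  if [ "${depth}" -ge "${MAX_QUEUE_DEPTH}" ]; then
    echo "0"                  # hard limit reached: reject (or DLQ) new messages
  elif [ "${depth}" -ge "${QUEUE_DEPTH_THRESHOLD}" ]; then
    echo "${THROTTLE_RATE}"   # backpressure active: throttle arrivals
  else
    echo "1"                  # normal operation: accept everything
  fi
}

for depth in 100 5000 9999 10000; do
  echo "depth=${depth} accept=$(accept_fraction ${depth})"
done
```

Note the two-tier design: throttling kicks in at `QueueDepthThreshold` well before the hard `MaxQueueDepth` limit, so the queue degrades gradually rather than overflowing.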
Traffic Surge¶
Traffic surge experiments validate that ATP services handle sudden traffic increases gracefully through autoscaling, rate limiting, and graceful degradation.
Hypothesis
"When traffic increases to 10x normal load, autoscaling will activate, rate limiting will protect the system, services will handle the load gracefully, and services will recover when traffic returns to normal."
Experiment Configuration
Traffic Surge Simulation:
#!/bin/bash
# scripts/execute-traffic-surge-experiment.sh
SERVICE="${1:-atp-ingestion-api}"
NAMESPACE="${2:-atp-ingest-ns}"
SURGE_MULTIPLIER="${3:-10}" # 10x normal traffic
DURATION="${4:-10m}"
echo "🧪 Starting traffic surge experiment"
echo "Service: ${SERVICE}"
echo "Traffic multiplier: ${SURGE_MULTIPLIER}x"
echo "Duration: ${DURATION}"
# Get baseline metrics
echo "📊 Collecting baseline metrics..."
BASELINE_FILE="baseline-${SERVICE}-$(date +%Y%m%d-%H%M%S).json"
./scripts/collect-baseline-metrics.sh \
--service ${SERVICE} \
--duration 1h \
--output "${BASELINE_FILE}"
BASELINE_RPS=$(jq -r '.metrics.requests_per_second' "${BASELINE_FILE}")
BASELINE_POD_COUNT=$(kubectl get deployment ${SERVICE} -n ${NAMESPACE} -o jsonpath='{.status.replicas}')
TARGET_RPS=$(echo "${BASELINE_RPS} * ${SURGE_MULTIPLIER}" | bc)
echo "Baseline metrics:"
echo " Requests per second: ${BASELINE_RPS}"
echo " Pod count: ${BASELINE_POD_COUNT}"
echo "Target traffic: ${TARGET_RPS} req/sec"
# Start traffic generation
echo "🚀 Starting traffic generation..."
# Calculate number of parallel generator pods needed (each worker sends ~10 req/sec)
PARALLEL_REQUESTS=$(echo "${TARGET_RPS} / 10" | bc)
# Start detached generator pods; they are deleted explicitly after the experiment
for i in $(seq 1 ${PARALLEL_REQUESTS}); do
kubectl run traffic-generator-${i} \
-n ${NAMESPACE} \
--image=curlimages/curl:latest \
--restart=Never \
-- /bin/sh -c "while true; do curl -s http://${SERVICE}.${NAMESPACE}.svc.cluster.local/api/ingest -X POST -d '{}' -H 'Content-Type: application/json' > /dev/null 2>&1; sleep 0.1; done"
done
SURGE_START=$(date +%s)
# Monitor service behavior
echo "👀 Monitoring service behavior during traffic surge..."
MAX_WAIT=600 # 10 minutes
ELAPSED=0
while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
# Get current RPS
CURRENT_RPS=$(curl -s -G http://prometheus:9090/api/v1/query --data-urlencode "query=rate(http_requests_total{service=\"${SERVICE}\"}[1m])" | jq -r '.data.result[0].value[1]')
CURRENT_RPS_ROUNDED=$(echo "${CURRENT_RPS}" | cut -d. -f1)
# Get pod count
CURRENT_POD_COUNT=$(kubectl get deployment ${SERVICE} -n ${NAMESPACE} -o jsonpath='{.status.replicas}')
# Get HPA status
HPA_STATUS=$(kubectl get hpa ${SERVICE} -n ${NAMESPACE} -o jsonpath='{.status.conditions[?(@.type=="AbleToScale")].status}')
# Get request success rate
SUCCESS_RATE=$(curl -s -G http://prometheus:9090/api/v1/query --data-urlencode "query=rate(http_requests_total{service=\"${SERVICE}\",status!~\"5..\"}[1m])/rate(http_requests_total{service=\"${SERVICE}\"}[1m])" | jq -r '.data.result[0].value[1]')
SUCCESS_RATE_PERCENT=$(echo "${SUCCESS_RATE} * 100" | bc)
# Get rate limiting metrics
RATE_LIMITED=$(curl -s -G http://prometheus:9090/api/v1/query --data-urlencode "query=rate(http_requests_total{service=\"${SERVICE}\",status=\"429\"}[1m])" | jq -r '.data.result[0].value[1]')
# Get latency
P95_LATENCY=$(curl -s -G http://prometheus:9090/api/v1/query --data-urlencode "query=histogram_quantile(0.95,rate(http_request_duration_seconds_bucket{service=\"${SERVICE}\"}[1m]))" | jq -r '.data.result[0].value[1]')
P95_LATENCY_MS=$(echo "${P95_LATENCY} * 1000" | bc)
echo "Metrics at ${ELAPSED}s:"
echo " Current RPS: ${CURRENT_RPS_ROUNDED} (target: ${TARGET_RPS})"
echo " Pod count: ${CURRENT_POD_COUNT} (baseline: ${BASELINE_POD_COUNT})"
echo " HPA status: ${HPA_STATUS}"
echo " Success rate: ${SUCCESS_RATE_PERCENT}%"
echo " Rate limited: ${RATE_LIMITED}/sec"
echo " P95 latency: ${P95_LATENCY_MS}ms"
# Validate autoscaling
if (( $(echo "${CURRENT_RPS} > ${BASELINE_RPS} * 2" | bc -l) )); then
if [ "${CURRENT_POD_COUNT}" -le "${BASELINE_POD_COUNT}" ]; then
echo "⚠️ Traffic increased but autoscaling not triggered"
fi
fi
# Validate rate limiting
if (( $(echo "${RATE_LIMITED} > 0" | bc -l) )); then
echo "✅ Rate limiting active: ${RATE_LIMITED}/sec requests rate limited"
fi
# Validate service availability
if (( $(echo "${SUCCESS_RATE_PERCENT} < 95" | bc -l) )); then
echo "⚠️ Service availability low: ${SUCCESS_RATE_PERCENT}%"
fi
sleep 30
ELAPSED=$((ELAPSED + 30))
done
# Stop traffic generation
echo "🛑 Stopping traffic generation..."
kubectl delete pod traffic-generator -n ${NAMESPACE} --ignore-not-found=true
for i in $(seq 1 ${PARALLEL_REQUESTS}); do
kubectl delete pod traffic-generator-${i} -n ${NAMESPACE} --ignore-not-found=true
done
RECOVERY_START=$(date +%s)
# Wait for traffic to normalize
echo "⏳ Waiting for traffic to normalize..."
sleep 120
# Wait for autoscaling to scale down
echo "⏳ Waiting for autoscaling to scale down..."
MAX_SCALE_DOWN_WAIT=600 # 10 minutes
ELAPSED=0
while [ ${ELAPSED} -lt ${MAX_SCALE_DOWN_WAIT} ]; do
CURRENT_POD_COUNT=$(kubectl get deployment ${SERVICE} -n ${NAMESPACE} -o jsonpath='{.status.replicas}')
if [ "${CURRENT_POD_COUNT}" -le "${BASELINE_POD_COUNT}" ]; then
SCALE_DOWN_TIME=$(date +%s)
SCALE_DOWN_DURATION=$((SCALE_DOWN_TIME - RECOVERY_START))
echo "✅ Autoscaling scaled down in ${SCALE_DOWN_DURATION} seconds"
break
fi
sleep 30
ELAPSED=$((ELAPSED + 30))
echo "Waiting for scale down... (${ELAPSED}s/${MAX_SCALE_DOWN_WAIT}s) - Pod count: ${CURRENT_POD_COUNT}"
done
# Verify recovery
FINAL_RPS=$(curl -s -G http://prometheus:9090/api/v1/query --data-urlencode "query=rate(http_requests_total{service=\"${SERVICE}\"}[1m])" | jq -r '.data.result[0].value[1]')
FINAL_POD_COUNT=$(kubectl get deployment ${SERVICE} -n ${NAMESPACE} -o jsonpath='{.status.replicas}')
FINAL_SUCCESS_RATE=$(curl -s -G http://prometheus:9090/api/v1/query --data-urlencode "query=rate(http_requests_total{service=\"${SERVICE}\",status!~\"5..\"}[1m])/rate(http_requests_total{service=\"${SERVICE}\"}[1m])" | jq -r '.data.result[0].value[1]')
FINAL_SUCCESS_RATE_PERCENT=$(echo "${FINAL_SUCCESS_RATE} * 100" | bc)
if (( $(echo "${FINAL_RPS} <= ${BASELINE_RPS} * 1.1" | bc -l) )); then
echo "✅ Traffic normalized: ${FINAL_RPS} req/sec (baseline: ${BASELINE_RPS} req/sec)"
if [ "${FINAL_POD_COUNT}" -le "${BASELINE_POD_COUNT}" ]; then
echo "✅ Pod count normalized: ${FINAL_POD_COUNT} (baseline: ${BASELINE_POD_COUNT})"
if (( $(echo "${FINAL_SUCCESS_RATE_PERCENT} >= 99" | bc -l) )); then
echo "✅ Success rate recovered: ${FINAL_SUCCESS_RATE_PERCENT}%"
exit 0
else
echo "⚠️ Success rate not fully recovered: ${FINAL_SUCCESS_RATE_PERCENT}%"
exit 1
fi
else
echo "⚠️ Pod count not fully normalized: ${FINAL_POD_COUNT} (baseline: ${BASELINE_POD_COUNT})"
exit 1
fi
else
echo "⚠️ Traffic not fully normalized: ${FINAL_RPS} req/sec (baseline: ${BASELINE_RPS} req/sec)"
exit 1
fi
Expected Behavior
Traffic Surge Phase (0-10 minutes):
- Traffic increase: Traffic increases to 10x normal load
- Autoscaling activation: HPA scales up pods to handle increased load
- Rate limiting activation: Rate limiting protects system from overload
- Graceful degradation: Service continues operating with increased latency
- Load distribution: Load distributed across scaled pods
Recovery Phase (10-20 minutes):
- Traffic normalization: Traffic returns to normal levels
- Autoscaling scale down: HPA scales down pods as load decreases
- Rate limiting deactivation: Rate limiting normalizes
- Normal operation: Service returns to normal operation
Expected Metrics
| Metric | Baseline | During Surge | Expected Range | Recovery Target |
|---|---|---|---|---|
| Request Rate | 100 req/sec | 1,000 req/sec | 10x increase | 100 req/sec |
| Pod Count | 3 | 15-30 | 5-10x increase | 3 |
| Success Rate | 99.95% | >95% | >95% | 99.95% |
| P95 Latency | 250ms | <1,000ms | <1,000ms | 250ms |
| Rate Limited Requests | 0/sec | >0/sec | >0/sec | 0/sec |
Validation Criteria
Success Criteria:
- ✅ Autoscaling activates and scales up pods
- ✅ Rate limiting protects system from overload
- ✅ Service availability >95% during surge
- ✅ Latency increase <4x baseline
- ✅ Autoscaling scales down after traffic normalizes
HPA Configuration
Horizontal Pod Autoscaler Configuration:
# kubernetes/autoscaling/ingestion-api-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: atp-ingestion-api-hpa
  namespace: atp-ingest-ns
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: atp-ingestion-api
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    - type: Pods
      pods:
        metric:
          name: requests_per_second
        target:
          type: AverageValue
          averageValue: "100"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100
          periodSeconds: 30
        - type: Pods
          value: 5
          periodSeconds: 30
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
      selectPolicy: Min
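The HPA's core scaling rule is desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue). A minimal shell sketch of that formula (the `desired_replicas` helper and the sample values are ours; the per-pod target of 100 req/sec matches the Pods metric above):

```shell
#!/bin/bash
# Sketch of the HPA scaling formula: desired = ceil(current * metric / target).
desired_replicas() {
  local current=$1 metric=$2 target=$3
  # ceil(current * metric / target) using integer arithmetic
  echo $(( (current * metric + target - 1) / target ))
}

# 3 pods each seeing 100 req/sec against a 100 req/sec per-pod target -> stay at 3
desired_replicas 3 100 100
# 3 pods each seeing 1000 req/sec (a 10x surge) -> scale toward 30 pods
desired_replicas 3 1000 100
```

The `behavior` section then constrains how fast that desired count is approached: at most +100% or +5 pods per 30s on the way up, and at most -50% per 60s after a 5-minute stabilization window on the way down.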
Rate Limiting Configuration:
# kubernetes/configmaps/rate-limiting-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: rate-limiting-config
  namespace: atp-ingest-ns
data:
  RateLimitingConfig.json: |
    {
      "DefaultRateLimit": {
        "RequestsPerSecond": 100,
        "BurstSize": 150
      },
      "PerClientRateLimit": {
        "Enabled": true,
        "RequestsPerSecond": 50,
        "BurstSize": 75
      },
      "RateLimitStrategy": "TokenBucket",
      "RateLimitResponse": {
        "StatusCode": 429,
        "Message": "Rate limit exceeded"
      }
    }
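The `TokenBucket` strategy named in this config works as follows: a bucket holds up to `BurstSize` tokens, refilled at `RequestsPerSecond`; each request spends a token, and an empty bucket yields a 429. A minimal sketch under those assumptions (the `try_acquire`/`refill` helpers are ours; time is simulated in whole seconds rather than measured):

```shell
#!/bin/bash
# Minimal token-bucket sketch matching RateLimitingConfig.json
# (100 tokens/sec refill, burst capacity of 150).
RATE=100      # tokens added per simulated second
BURST=150     # bucket capacity
tokens=${BURST}

# try_acquire N: spend N tokens if available; prints "allow" or "limit" (HTTP 429)
try_acquire() {
  local n=$1
  if [ "${tokens}" -ge "${n}" ]; then
    tokens=$((tokens - n))
    echo "allow"
  else
    echo "limit"   # would return 429 "Rate limit exceeded"
  fi
}

# refill: one simulated second elapses
refill() {
  tokens=$((tokens + RATE))
  if [ "${tokens}" -gt "${BURST}" ]; then
    tokens=${BURST}
  fi
}

try_acquire 150   # burst drains the bucket -> allow
try_acquire 1     # bucket empty -> limit
refill            # +100 tokens after one second
try_acquire 100   # -> allow
```

This is why the table above shows rate-limited requests only during the surge: steady traffic at or below 100 req/sec never empties the bucket, while a 10x surge exhausts the 150-token burst almost immediately.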
Summary: Application Behavior Chaos¶
- Latency Injection: Validates timeout configurations, retry mechanisms, and graceful degradation during network latency increases; expected outcomes are a proportional latency increase, no timeout errors, request success rate >99%, and automatic recovery
- Error Injection: Validates error handling, retry mechanisms, and circuit breaker protection during HTTP 500 errors, database timeouts, and validation failures; expected outcomes are a proportional error-rate increase, retry activation, circuit breaker protection, and automatic recovery
- Slow Consumer Simulation: Validates backpressure mechanisms, queue depth limits, and message throttling during slow message processing; expected outcomes are queue depth growth within limits, backpressure activation, no message loss, and automatic recovery
- Traffic Surge: Validates autoscaling, rate limiting, and graceful degradation during a 10x traffic increase; expected outcomes are autoscaling activation, rate limiting protection, service availability >95%, and automatic scale-down once traffic normalizes
- Monitoring and Validation: Comprehensive scripts for monitoring latency injection, error injection, slow consumer simulation, traffic surge, autoscaling, rate limiting, and recovery behavior
Database Chaos¶
Purpose: Define comprehensive chaos experiments for database failures in ATP, validating failover mechanisms, slowdown handling, and connection pool management to ensure ATP services remain available and functional during database-level failures and performance degradation.
Database Failover¶
Database failover experiments validate that ATP services handle Azure SQL failover gracefully through connection retry, automatic failover detection, and transaction-integrity safeguards, ensuring no data loss and minimal downtime.
Hypothesis
"When Azure SQL primary database fails over to a replica, services will automatically reconnect to the new primary, connection retry mechanisms will handle transient failures, no transactions will be lost, and failover time will be within acceptable limits (<30 seconds)."
Experiment Configuration
Azure SQL Failover Simulation:
#!/bin/bash
# scripts/execute-database-failover-experiment.sh
PRIMARY_SERVER="${1:-atp-sql-primary.database.windows.net}"
PRIMARY_DB="${2:-atp-primary}"
SERVICE="${3:-atp-ingestion-api}"
NAMESPACE="${4:-atp-ingest-ns}"
echo "🧪 Starting database failover experiment"
echo "Primary server: ${PRIMARY_SERVER}"
echo "Primary database: ${PRIMARY_DB}"
echo "Service: ${SERVICE}"
# Get baseline metrics
echo "📊 Collecting baseline metrics..."
BASELINE_FILE="baseline-${SERVICE}-$(date +%Y%m%d-%H%M%S).json"
./scripts/collect-baseline-metrics.sh \
--service ${SERVICE} \
--duration 1h \
--output "${BASELINE_FILE}"
BASELINE_SUCCESS_RATE=$(jq -r '.metrics.success_rate_percent' "${BASELINE_FILE}")
BASELINE_ACTIVE_CONNECTIONS=$(kubectl exec -n ${NAMESPACE} deployment/${SERVICE} -- \
curl -s http://localhost:8080/metrics | grep 'database_connections_active' | awk '{print $2}')
echo "Baseline metrics:"
echo " Success rate: ${BASELINE_SUCCESS_RATE}%"
echo " Active connections: ${BASELINE_ACTIVE_CONNECTIONS}"
# Get available secondary replicas (geo-replicated databases live on partner servers)
echo "📊 Getting geo-replication links..."
echo "Current primary server: ${PRIMARY_SERVER}"
REPLICAS=$(az sql db replica list-links \
--resource-group atp-production-rg \
--server ${PRIMARY_SERVER} \
--name ${PRIMARY_DB} \
--query "[?partnerRole == 'Secondary'].partnerServer" -o tsv)
echo "Available replica servers: ${REPLICAS}"
TARGET_REPLICA=$(echo "${REPLICAS}" | head -n 1)
if [ -z "${TARGET_REPLICA}" ]; then
echo "❌ No secondary replicas available"
exit 1
fi
echo "Target replica server for failover: ${TARGET_REPLICA}"
# Initiate planned failover (run against the secondary server; omitting
# --allow-data-loss makes this a planned failover with no data loss)
echo "🔧 Initiating database failover..."
FAILOVER_START=$(date +%s)
az sql db replica set-primary \
--resource-group atp-production-rg \
--server ${TARGET_REPLICA} \
--name ${PRIMARY_DB}
FAILOVER_INITIATED=$(date +%s)
FAILOVER_INITIATION_TIME=$((FAILOVER_INITIATED - FAILOVER_START))
echo "Failover initiated in ${FAILOVER_INITIATION_TIME} seconds"
# Monitor failover progress
echo "👀 Monitoring failover progress..."
MAX_WAIT=300 # 5 minutes
ELAPSED=0
FAILOVER_COMPLETE=false
while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
NEW_PRIMARY=$(az sql db replica list-links \
--resource-group atp-production-rg \
--server ${PRIMARY_SERVER} \
--name ${PRIMARY_DB} \
--query "[?partnerRole == 'Primary'].partnerServer" -o tsv)
if [ "${NEW_PRIMARY}" = "${TARGET_REPLICA}" ]; then
FAILOVER_COMPLETE=true
FAILOVER_END=$(date +%s)
FAILOVER_DURATION=$((FAILOVER_END - FAILOVER_START))
echo "✅ Failover complete in ${FAILOVER_DURATION} seconds"
break
fi
sleep 5
ELAPSED=$((ELAPSED + 5))
echo "Waiting for failover... (${ELAPSED}s/${MAX_WAIT}s)"
done
if [ "${FAILOVER_COMPLETE}" = false ]; then
echo "❌ Failover not completed within ${MAX_WAIT} seconds"
exit 1
fi
# Monitor service behavior during failover
echo "👀 Monitoring service behavior during failover..."
MAX_MONITOR_WAIT=600 # 10 minutes
ELAPSED=0
while [ ${ELAPSED} -lt ${MAX_MONITOR_WAIT} ]; do
# Get connection retry attempts
RETRY_ATTEMPTS=$(curl -s -G http://prometheus:9090/api/v1/query --data-urlencode "query=rate(database_connection_retries{service=\"${SERVICE}\"}[1m])" | jq -r '.data.result[0].value[1]')
# Get active connections
ACTIVE_CONNECTIONS=$(curl -s -G http://prometheus:9090/api/v1/query --data-urlencode "query=database_connections_active{service=\"${SERVICE}\"}" | jq -r '.data.result[0].value[1]')
# Get request success rate
SUCCESS_RATE=$(curl -s -G http://prometheus:9090/api/v1/query --data-urlencode "query=rate(http_requests_total{service=\"${SERVICE}\",status!~\"5..\"}[1m])/rate(http_requests_total{service=\"${SERVICE}\"}[1m])" | jq -r '.data.result[0].value[1]')
SUCCESS_RATE_PERCENT=$(echo "${SUCCESS_RATE} * 100" | bc)
# Sample the transaction counter (last sample taken before post-failover validation)
TRANSACTION_COUNT_BEFORE=$(curl -s -G http://prometheus:9090/api/v1/query --data-urlencode "query=database_transactions_total{service=\"${SERVICE}\"}" | jq -r '.data.result[0].value[1]')
echo "Metrics at ${ELAPSED}s:"
echo " Connection retries: ${RETRY_ATTEMPTS}/sec"
echo " Active connections: ${ACTIVE_CONNECTIONS}"
echo " Success rate: ${SUCCESS_RATE_PERCENT}%"
echo " Transaction count: ${TRANSACTION_COUNT_BEFORE}"
# Validate connection retry
if (( $(echo "${RETRY_ATTEMPTS} > 0" | bc -l) )); then
echo "✅ Connection retry active: ${RETRY_ATTEMPTS}/sec"
fi
# Validate service availability
if (( $(echo "${SUCCESS_RATE_PERCENT} >= 95" | bc -l) )); then
echo "✅ Service availability maintained: ${SUCCESS_RATE_PERCENT}%"
fi
sleep 30
ELAPSED=$((ELAPSED + 30))
done
# Validate transaction integrity (database_transactions_total is a monotonic
# counter; a drop indicates a counter reset or lost transactions after failover)
echo "🔍 Validating transaction integrity..."
TRANSACTION_COUNT_AFTER=$(curl -s -G http://prometheus:9090/api/v1/query --data-urlencode "query=database_transactions_total{service=\"${SERVICE}\"}" | jq -r '.data.result[0].value[1]')
if (( $(echo "${TRANSACTION_COUNT_AFTER} >= ${TRANSACTION_COUNT_BEFORE}" | bc -l) )); then
echo "✅ No transaction loss detected (counter is monotonic across failover)"
else
echo "⚠️ Potential transaction loss: counter dropped from ${TRANSACTION_COUNT_BEFORE} to ${TRANSACTION_COUNT_AFTER}"
fi
# Validate failover time
FAILOVER_TIME_TARGET=30 # 30 seconds
if [ "${FAILOVER_DURATION}" -le "${FAILOVER_TIME_TARGET}" ]; then
echo "✅ Failover time within target: ${FAILOVER_DURATION}s <= ${FAILOVER_TIME_TARGET}s"
exit 0
else
echo "⚠️ Failover time exceeds target: ${FAILOVER_DURATION}s > ${FAILOVER_TIME_TARGET}s"
exit 1
fi
Expected Behavior
Failover Initiation Phase (0-5 seconds):
- Failover command: Azure SQL failover command executed
- Primary role transfer: Primary role transferred to target replica
- Connection termination: Existing connections to old primary terminated
Failover Completion Phase (5-30 seconds):
- Replica promotion: Secondary replica promoted to primary
- DNS/endpoint update: DNS/endpoint updated to point to new primary
- Connection retry: Services retry connections to new primary
- Connection establishment: New connections established to new primary
Recovery Phase (30-60 seconds):
- Full connectivity: All services connected to new primary
- Transaction integrity: All transactions committed or rolled back safely
- Normal operation: Services return to normal operation
Expected Metrics
| Metric | Baseline | During Failover | Expected Range | Recovery Target |
|---|---|---|---|---|
| Failover Time | N/A | <30s | <30s | N/A |
| Connection Retry Rate | 0/sec | >0/sec | >0/sec | 0/sec |
| Request Success Rate | 99.95% | >95% | >95% | 99.95% |
| Active Connections | 50 | Variable | Variable | 50 |
| Transaction Loss | None | None | None | None |
Validation Criteria
Success Criteria:
- ✅ Failover completes within 30 seconds
- ✅ Connection retry mechanisms activate
- ✅ Request success rate >95% during failover
- ✅ No transaction loss
- ✅ All connections re-established after failover
- ✅ Service recovers automatically
Connection Retry Configuration
Database Connection Retry Configuration:
# kubernetes/configmaps/database-connection-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: database-connection-config
  namespace: atp-ingest-ns
data:
  ConnectionRetryConfig.json: |
    {
      "MaxRetryAttempts": 10,
      "RetryDelay": 1000,
      "ExponentialBackoff": true,
      "MaxBackoff": 30000,
      "RetryableErrors": [
        "Connection timeout",
        "Connection closed",
        "Server unavailable",
        "Network error"
      ],
      "FailoverDetection": {
        "Enabled": true,
        "HealthCheckInterval": 5000,
        "HealthCheckTimeout": 3000
      }
    }
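The delay schedule this configuration implies (exponential backoff from `RetryDelay` up to `MaxBackoff`) can be sketched directly; the `backoff_ms` helper is ours and, for simplicity, omits the jitter a production retry policy would usually add:

```shell
#!/bin/bash
# Sketch of the retry delay schedule implied by ConnectionRetryConfig.json:
# delay doubles from RetryDelay (1000 ms) and is capped at MaxBackoff (30000 ms).
RETRY_DELAY_MS=1000
MAX_BACKOFF_MS=30000
MAX_ATTEMPTS=10

backoff_ms() {
  local attempt=$1   # 1-based attempt number
  local delay=$(( RETRY_DELAY_MS * (1 << (attempt - 1)) ))
  if [ "${delay}" -gt "${MAX_BACKOFF_MS}" ]; then
    delay=${MAX_BACKOFF_MS}
  fi
  echo "${delay}"
}

for attempt in $(seq 1 ${MAX_ATTEMPTS}); do
  echo "attempt ${attempt}: wait $(backoff_ms ${attempt}) ms"
done
```

With these values the schedule is 1s, 2s, 4s, 8s, 16s, then 30s for attempts 6 through 10 — comfortably spanning the <30-second failover target, so a retrying client should reconnect without exhausting its attempts.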
Database Slowdown¶
Database slowdown experiments validate that ATP services handle database performance degradation gracefully through timeout handling, circuit breaker activation, and fallback strategies.
Hypothesis
"When database queries become slow (latency >5 seconds), timeout configurations will prevent hanging requests, circuit breakers will activate to protect the service, fallback strategies will maintain service availability, and services will recover when database performance returns to normal."
Experiment Configuration
Database Query Latency Injection:
# chaos-experiments/database-slowdown.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: database-slowdown
  namespace: chaos-testing
  labels:
    category: application
    service: database
    severity: medium
    frequency: monthly
  annotations:
    chaos.atp.connectsoft.io/hypothesis: |
      When database queries become slow, timeout configurations will prevent hanging requests,
      circuit breakers will activate to protect the service, fallback strategies will maintain
      service availability, and services will recover when database performance returns to normal.
spec:
  action: delay
  mode: fixed-percent
  value: "50" # delay injected into 50% of the selected pods
  selector:
    namespaces:
      - atp-ingest-ns
    labelSelectors:
      app: atp-ingestion-api
  direction: both
  target:
    mode: all
    selector:
      address: "*.database.windows.net"
  delay:
    latency: "6s" # exceeds the 5s query timeout
    correlation: "100"
    jitter: "500ms"
  duration: "10m"
Database Slowdown Simulation Script:
#!/bin/bash
# scripts/execute-database-slowdown-experiment.sh
SERVICE="${1:-atp-ingestion-api}"
NAMESPACE="${2:-atp-ingest-ns}"
QUERY_LATENCY="${3:-6s}" # Query latency injection
PERCENTAGE="${4:-50}" # Percentage of pods affected
echo "🧪 Starting database slowdown experiment"
echo "Service: ${SERVICE}"
echo "Query latency: ${QUERY_LATENCY}"
echo "Percentage: ${PERCENTAGE}%"
# Get baseline metrics
echo "📊 Collecting baseline metrics..."
BASELINE_FILE="baseline-${SERVICE}-$(date +%Y%m%d-%H%M%S).json"
./scripts/collect-baseline-metrics.sh \
--service ${SERVICE} \
--duration 1h \
--output "${BASELINE_FILE}"
BASELINE_QUERY_LATENCY=$(jq -r '.metrics.p95_query_latency_ms' "${BASELINE_FILE}")
BASELINE_SUCCESS_RATE=$(jq -r '.metrics.success_rate_percent' "${BASELINE_FILE}")
echo "Baseline metrics:"
echo " P95 query latency: ${BASELINE_QUERY_LATENCY}ms"
echo " Success rate: ${BASELINE_SUCCESS_RATE}%"
# Apply database slowdown
echo "🔧 Applying database slowdown..."
kubectl apply -f - <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: database-slowdown-${SERVICE}
  namespace: chaos-testing
spec:
  action: delay
  mode: fixed-percent
  value: "${PERCENTAGE}"
  selector:
    namespaces:
      - ${NAMESPACE}
    labelSelectors:
      app: ${SERVICE}
  direction: both
  target:
    mode: all
    selector:
      address: "*.database.windows.net"
  delay:
    latency: "${QUERY_LATENCY}"
    correlation: "100"
    jitter: "500ms"
  duration: "10m"
EOF
SLOWDOWN_START=$(date +%s)
# Monitor service behavior
echo "👀 Monitoring service behavior during database slowdown..."
MAX_WAIT=600 # 10 minutes
ELAPSED=0
while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
# Get query latency
CURRENT_QUERY_LATENCY=$(curl -s -G http://prometheus:9090/api/v1/query --data-urlencode "query=histogram_quantile(0.95,rate(database_query_duration_seconds_bucket{service=\"${SERVICE}\"}[1m]))" | jq -r '.data.result[0].value[1]')
QUERY_LATENCY_MS=$(echo "${CURRENT_QUERY_LATENCY} * 1000" | bc)
# Get timeout errors
TIMEOUT_ERRORS=$(curl -s -G http://prometheus:9090/api/v1/query --data-urlencode "query=rate(database_timeout_errors{service=\"${SERVICE}\"}[1m])" | jq -r '.data.result[0].value[1]')
# Get circuit breaker state (the gauge is numeric; the mapping below assumes
# 0=Closed, 1=Open, 2=HalfOpen — adjust to match the exporter's encoding)
CB_STATE_RAW=$(curl -s -G http://prometheus:9090/api/v1/query --data-urlencode "query=circuit_breaker_state{service=\"${SERVICE}\",component=\"database\"}" | jq -r '.data.result[0].value[1]')
case "${CB_STATE_RAW}" in
0) CB_STATE="Closed" ;;
1) CB_STATE="Open" ;;
2) CB_STATE="HalfOpen" ;;
*) CB_STATE="Unknown" ;;
esac
# Get request success rate
SUCCESS_RATE=$(curl -s -G http://prometheus:9090/api/v1/query --data-urlencode "query=rate(http_requests_total{service=\"${SERVICE}\",status!~\"5..\"}[1m])/rate(http_requests_total{service=\"${SERVICE}\"}[1m])" | jq -r '.data.result[0].value[1]')
SUCCESS_RATE_PERCENT=$(echo "${SUCCESS_RATE} * 100" | bc)
# Get fallback usage
FALLBACK_USAGE=$(curl -s -G http://prometheus:9090/api/v1/query --data-urlencode "query=rate(database_fallback_usage{service=\"${SERVICE}\"}[1m])" | jq -r '.data.result[0].value[1]')
echo "Metrics at ${ELAPSED}s:"
echo " P95 query latency: ${QUERY_LATENCY_MS}ms (baseline: ${BASELINE_QUERY_LATENCY}ms)"
echo " Timeout errors: ${TIMEOUT_ERRORS}/sec"
echo " Circuit breaker state: ${CB_STATE}"
echo " Success rate: ${SUCCESS_RATE_PERCENT}%"
echo " Fallback usage: ${FALLBACK_USAGE}/sec"
# Validate timeout handling
if (( $(echo "${QUERY_LATENCY_MS} > 5000" | bc -l) )); then
if (( $(echo "${TIMEOUT_ERRORS} > 0" | bc -l) )); then
echo "✅ Timeout handling working: ${TIMEOUT_ERRORS}/sec timeout errors"
else
echo "⚠️ Query latency high but no timeout errors detected"
fi
fi
# Validate circuit breaker activation
if (( $(echo "${QUERY_LATENCY_MS} > 5000" | bc -l) )); then
if [ "${CB_STATE}" = "Open" ]; then
echo "✅ Circuit breaker activated: ${CB_STATE}"
elif [ "${CB_STATE}" = "HalfOpen" ]; then
echo "✅ Circuit breaker testing recovery: ${CB_STATE}"
fi
fi
# Validate fallback strategies
if (( $(echo "${FALLBACK_USAGE} > 0" | bc -l) )); then
echo "✅ Fallback strategies active: ${FALLBACK_USAGE}/sec"
fi
sleep 30
ELAPSED=$((ELAPSED + 30))
done
# Remove database slowdown
echo "🔧 Removing database slowdown..."
kubectl delete networkchaos database-slowdown-${SERVICE} -n chaos-testing
RECOVERY_START=$(date +%s)
# Wait for recovery
echo "⏳ Waiting for database performance to return to normal..."
sleep 120
# Verify recovery
FINAL_QUERY_LATENCY=$(curl -s -G http://prometheus:9090/api/v1/query --data-urlencode "query=histogram_quantile(0.95,rate(database_query_duration_seconds_bucket{service=\"${SERVICE}\"}[1m]))" | jq -r '.data.result[0].value[1]')
FINAL_QUERY_LATENCY_MS=$(echo "${FINAL_QUERY_LATENCY} * 1000" | bc)
# Map the numeric circuit-breaker gauge to a state name (assumed encoding: 0=Closed, 1=Open, 2=HalfOpen)
FINAL_CB_STATE_RAW=$(curl -s -G http://prometheus:9090/api/v1/query --data-urlencode "query=circuit_breaker_state{service=\"${SERVICE}\",component=\"database\"}" | jq -r '.data.result[0].value[1]')
case "${FINAL_CB_STATE_RAW}" in
0) FINAL_CB_STATE="Closed" ;;
1) FINAL_CB_STATE="Open" ;;
2) FINAL_CB_STATE="HalfOpen" ;;
*) FINAL_CB_STATE="Unknown" ;;
esac
if (( $(echo "${FINAL_QUERY_LATENCY_MS} <= ${BASELINE_QUERY_LATENCY} * 1.1" | bc -l) )); then
echo "✅ Query latency recovered: ${FINAL_QUERY_LATENCY_MS}ms (baseline: ${BASELINE_QUERY_LATENCY}ms)"
if [ "${FINAL_CB_STATE}" = "Closed" ]; then
echo "✅ Circuit breaker closed (service recovered)"
exit 0
else
echo "⚠️ Circuit breaker still open: ${FINAL_CB_STATE}"
exit 1
fi
else
echo "⚠️ Query latency not fully recovered: ${FINAL_QUERY_LATENCY_MS}ms (baseline: ${BASELINE_QUERY_LATENCY}ms)"
exit 1
fi
Expected Behavior
Slowdown Phase (0-10 minutes):
- Query latency increase: Database queries become slow (>5 seconds)
- Timeout handling: Timeout configurations prevent hanging requests
- Circuit breaker activation: Circuit breaker opens if latency exceeds threshold
- Fallback strategies: Fallback strategies activate (cached data, read replicas)
- Graceful degradation: Service continues operating with reduced functionality
Recovery Phase (10-15 minutes):
- Latency normalization: Database query latency returns to normal
- Circuit breaker recovery: Circuit breaker transitions to half-open, then closed
- Fallback deactivation: Fallback strategies deactivate
- Normal operation: Service returns to normal operation
Expected Metrics
| Metric | Baseline | During Slowdown | Expected Range | Recovery Target |
|---|---|---|---|---|
| P95 Query Latency | 250ms | <7,000ms | <7,000ms | 250ms |
| Timeout Errors | 0/sec | <10/sec | <10/sec | 0/sec |
| Circuit Breaker State | Closed | Open/HalfOpen | Open/HalfOpen | Closed |
| Request Success Rate | 99.95% | >90% | >90% | 99.95% |
| Fallback Usage | 0/sec | >0/sec | >0/sec | 0/sec |
Validation Criteria
Success Criteria:
- ✅ Timeout configurations prevent hanging requests
- ✅ Circuit breaker activates when latency exceeds threshold
- ✅ Fallback strategies maintain service availability
- ✅ Request success rate >90%
- ✅ Service recovers automatically when database performance returns to normal
Query Timeout Configuration
Database Query Timeout Configuration:
# kubernetes/configmaps/database-query-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: database-query-config
  namespace: atp-ingest-ns
data:
  QueryTimeoutConfig.json: |
    {
      "DefaultTimeout": 5000,
      "ReadTimeout": 5000,
      "WriteTimeout": 10000,
      "ConnectionTimeout": 3000,
      "CommandTimeout": 5000,
      "CircuitBreaker": {
        "FailureThreshold": 5,
        "TimeoutSeconds": 30,
        "HalfOpenRetries": 3,
        "SlowQueryThreshold": 5000
      },
      "FallbackStrategies": {
        "UseCache": true,
        "UseReadReplica": true,
        "DegradedMode": true
      }
    }
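The breaker behavior this configuration implies can be sketched as a small state machine: the breaker opens after `FailureThreshold` (5) consecutive slow or failed queries and resets its failure count on success. The sketch is ours and deliberately simplified — a real breaker would only close again via HalfOpen probe attempts after `TimeoutSeconds`:

```shell
#!/bin/bash
# Sketch of the circuit-breaker transitions implied by QueryTimeoutConfig.json.
FAILURE_THRESHOLD=5
state="Closed"
failures=0

record_result() {   # $1 = "ok" or "fail"
  if [ "$1" = "fail" ]; then
    failures=$((failures + 1))
    if [ "${state}" = "Closed" ] && [ "${failures}" -ge "${FAILURE_THRESHOLD}" ]; then
      state="Open"   # stop sending queries; fall back to cache / read replica
    fi
  else
    failures=0
    state="Closed"   # simplified: a real breaker recovers via HalfOpen probes
  fi
}

for r in fail fail fail fail fail; do record_result "$r"; done
echo "${state}"     # five consecutive failures open the breaker
record_result ok
echo "${state}"     # a success closes it again (via HalfOpen in practice)
```

This matches the expected metrics above: the gauge reads Open/HalfOpen while queries exceed the 5s `SlowQueryThreshold`, and returns to Closed once latency normalizes.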
Database Connection Pool Exhaustion¶
Database connection pool exhaustion experiments validate that ATP services handle connection pool exhaustion gracefully through connection leak detection, pool size limits, and queuing behavior.
Hypothesis
"When database connection pool is exhausted, connection leak detection will identify leaks, pool size limits will prevent resource exhaustion, request queuing will handle overload, and services will recover when connections are released."
Experiment Configuration
Connection Pool Exhaustion Simulation:
#!/bin/bash
# scripts/execute-connection-pool-exhaustion-experiment.sh
SERVICE="${1:-atp-ingestion-api}"
NAMESPACE="${2:-atp-ingest-ns}"
MAX_POOL_SIZE="${3:-100}" # Maximum pool size
LEAK_RATE="${4:-5}" # Connections leaked per second
echo "🧪 Starting connection pool exhaustion experiment"
echo "Service: ${SERVICE}"
echo "Max pool size: ${MAX_POOL_SIZE}"
echo "Leak rate: ${LEAK_RATE} connections/sec"
# Get baseline metrics
echo "📊 Collecting baseline metrics..."
BASELINE_FILE="baseline-${SERVICE}-$(date +%Y%m%d-%H%M%S).json"
./scripts/collect-baseline-metrics.sh \
--service ${SERVICE} \
--duration 1h \
--output "${BASELINE_FILE}"
BASELINE_POOL_SIZE=$(jq -r '.metrics.connection_pool_size' "${BASELINE_FILE}")
BASELINE_ACTIVE_CONNECTIONS=$(jq -r '.metrics.active_connections' "${BASELINE_FILE}")
BASELINE_IDLE_CONNECTIONS=$(jq -r '.metrics.idle_connections' "${BASELINE_FILE}")
echo "Baseline metrics:"
echo " Pool size: ${BASELINE_POOL_SIZE}"
echo " Active connections: ${BASELINE_ACTIVE_CONNECTIONS}"
echo " Idle connections: ${BASELINE_IDLE_CONNECTIONS}"
# Simulate connection leaks by creating long-running connections
echo "🔧 Simulating connection pool exhaustion..."
EXHAUSTION_START=$(date +%s)
# Each leaker pod below holds database connections open indefinitely,
# gradually starving the pool
# Start connection leak simulation
echo "Starting connection leak simulation..."
for i in $(seq 1 ${LEAK_RATE}); do
# Run each leaker as a detached pod in the target namespace; kubectl run
# returns once the pod is created, so no backgrounding or TTY is needed
kubectl run connection-leak-${i} \
--image=postgres:15 \
--restart=Never \
-n ${NAMESPACE} \
-- /bin/sh -c "while true; do PGPASSWORD=\${DB_PASSWORD} psql -h \${DB_HOST} -U \${DB_USER} -d \${DB_NAME} -c 'SELECT pg_sleep(300);' > /dev/null 2>&1; done"
done
# Monitor connection pool
echo "👀 Monitoring connection pool behavior..."
MAX_WAIT=600 # 10 minutes
ELAPSED=0
while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
# Get connection pool metrics
POOL_SIZE=$(curl -s http://prometheus:9090/api/v1/query?query=database_connection_pool_size\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
ACTIVE_CONNECTIONS=$(curl -s http://prometheus:9090/api/v1/query?query=database_connection_pool_active\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
IDLE_CONNECTIONS=$(curl -s http://prometheus:9090/api/v1/query?query=database_connection_pool_idle\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
WAITING_REQUESTS=$(curl -s http://prometheus:9090/api/v1/query?query=database_connection_pool_waiting\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
# Get connection leak detection
LEAK_DETECTED=$(curl -s http://prometheus:9090/api/v1/query?query=database_connection_leaks_detected\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
# Get request success rate
SUCCESS_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SERVICE}\",status!~\"5..\"\}[1m]\)/rate\(http_requests_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
SUCCESS_RATE_PERCENT=$(echo "${SUCCESS_RATE} * 100" | bc)
# Get connection pool exhaustion errors
POOL_EXHAUSTION_ERRORS=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(database_connection_pool_exhausted\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
echo "Metrics at ${ELAPSED}s:"
echo " Pool size: ${POOL_SIZE} (max: ${MAX_POOL_SIZE})"
echo " Active connections: ${ACTIVE_CONNECTIONS}"
echo " Idle connections: ${IDLE_CONNECTIONS}"
echo " Waiting requests: ${WAITING_REQUESTS}"
echo " Leaks detected: ${LEAK_DETECTED}"
echo " Success rate: ${SUCCESS_RATE_PERCENT}%"
echo " Pool exhaustion errors: ${POOL_EXHAUSTION_ERRORS}/sec"
# Validate pool size limits
if (( $(echo "${ACTIVE_CONNECTIONS} >= ${MAX_POOL_SIZE}" | bc -l) )); then
echo "✅ Pool size limit reached: ${ACTIVE_CONNECTIONS} >= ${MAX_POOL_SIZE}"
# Validate queuing behavior
if (( $(echo "${WAITING_REQUESTS} > 0" | bc -l) )); then
echo "✅ Request queuing active: ${WAITING_REQUESTS} requests waiting"
else
echo "⚠️ Pool exhausted but no request queuing detected"
fi
fi
# Validate connection leak detection
if (( $(echo "${LEAK_DETECTED} > 0" | bc -l) )); then
echo "✅ Connection leak detection active: ${LEAK_DETECTED} leaks detected"
fi
# Validate pool exhaustion handling
if (( $(echo "${POOL_EXHAUSTION_ERRORS} > 0" | bc -l) )); then
echo "⚠️ Pool exhaustion errors detected: ${POOL_EXHAUSTION_ERRORS}/sec"
fi
sleep 30
ELAPSED=$((ELAPSED + 30))
done
# Stop connection leak simulation (kubectl does not expand name globs,
# so delete each leaker pod explicitly)
echo "🛑 Stopping connection leak simulation..."
for i in $(seq 1 ${LEAK_RATE}); do
kubectl delete pod connection-leak-${i} -n ${NAMESPACE} --ignore-not-found=true
done
# Wait for connections to be released
echo "⏳ Waiting for connections to be released..."
sleep 60
RECOVERY_START=$(date +%s)
# Monitor connection pool recovery
echo "👀 Monitoring connection pool recovery..."
MAX_RECOVERY_WAIT=300 # 5 minutes
ELAPSED=0
while [ ${ELAPSED} -lt ${MAX_RECOVERY_WAIT} ]; do
ACTIVE_CONNECTIONS=$(curl -s http://prometheus:9090/api/v1/query?query=database_connection_pool_active\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
WAITING_REQUESTS=$(curl -s http://prometheus:9090/api/v1/query?query=database_connection_pool_waiting\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
if (( $(echo "${ACTIVE_CONNECTIONS} <= ${BASELINE_ACTIVE_CONNECTIONS} * 1.1" | bc -l) )); then
if (( $(echo "${WAITING_REQUESTS} == 0" | bc -l) )); then
RECOVERY_TIME=$(date +%s)
RECOVERY_DURATION=$((RECOVERY_TIME - RECOVERY_START))
echo "✅ Connection pool recovered in ${RECOVERY_DURATION} seconds"
break
fi
fi
sleep 10
ELAPSED=$((ELAPSED + 10))
echo "Waiting for recovery... (${ELAPSED}s/${MAX_RECOVERY_WAIT}s) - Active: ${ACTIVE_CONNECTIONS}, Waiting: ${WAITING_REQUESTS}"
done
# Verify recovery
FINAL_ACTIVE_CONNECTIONS=$(curl -s http://prometheus:9090/api/v1/query?query=database_connection_pool_active\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
FINAL_SUCCESS_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SERVICE}\",status!~\"5..\"\}[1m]\)/rate\(http_requests_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
FINAL_SUCCESS_RATE_PERCENT=$(echo "${FINAL_SUCCESS_RATE} * 100" | bc)
if (( $(echo "${FINAL_ACTIVE_CONNECTIONS} <= ${BASELINE_ACTIVE_CONNECTIONS} * 1.1" | bc -l) )); then
echo "✅ Active connections recovered: ${FINAL_ACTIVE_CONNECTIONS} (baseline: ${BASELINE_ACTIVE_CONNECTIONS})"
if (( $(echo "${FINAL_SUCCESS_RATE_PERCENT} >= 99" | bc -l) )); then
echo "✅ Success rate recovered: ${FINAL_SUCCESS_RATE_PERCENT}%"
exit 0
else
echo "⚠️ Success rate not fully recovered: ${FINAL_SUCCESS_RATE_PERCENT}%"
exit 1
fi
else
echo "⚠️ Active connections not fully recovered: ${FINAL_ACTIVE_CONNECTIONS} (baseline: ${BASELINE_ACTIVE_CONNECTIONS})"
exit 1
fi
Expected Behavior
Exhaustion Phase (0-10 minutes):
- Connection pool exhaustion: Connection pool reaches maximum size
- Connection leak detection: Connection leak detection identifies leaks
- Request queuing: Requests queue when pool is exhausted
- Pool size limits: Pool size limits prevent resource exhaustion
- Graceful degradation: Service continues operating with queued requests
Recovery Phase (10-15 minutes):
- Connection release: Connections released when leaks stop
- Pool recovery: Connection pool recovers to normal levels
- Queue draining: Queued requests processed
- Normal operation: Service returns to normal operation
Expected Metrics
| Metric | Baseline | During Exhaustion | Expected Range | Recovery Target |
|---|---|---|---|---|
| Active Connections | 50 | 100 (max) | ≤100 | 50 |
| Idle Connections | 50 | 0 | 0-50 | 50 |
| Waiting Requests | 0 | >0 | >0 | 0 |
| Leaks Detected | 0 | >0 | >0 | 0 |
| Pool Exhaustion Errors | 0/sec | <5/sec | <5/sec | 0/sec |
| Request Success Rate | 99.95% | >95% | >95% | 99.95% |
Validation Criteria
Success Criteria:
- ✅ Pool size limits prevent resource exhaustion
- ✅ Connection leak detection identifies leaks
- ✅ Request queuing handles overload
- ✅ Request success rate >95%
- ✅ Service recovers automatically when connections released
Connection Pool Configuration
Database Connection Pool Configuration:
# kubernetes/configmaps/database-pool-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: database-pool-config
namespace: atp-ingest-ns
data:
ConnectionPoolConfig.json: |
{
"MaxPoolSize": 100,
"MinPoolSize": 10,
"IdleTimeout": 300000,
"ConnectionTimeout": 30000,
"LeakDetection": {
"Enabled": true,
"Threshold": 60000,
"LogInterval": 60000
},
"PoolExhaustion": {
"MaxWaitTime": 30000,
"QueueSize": 1000,
"RejectWhenExhausted": false
},
"HealthCheck": {
"Enabled": true,
"Interval": 30000,
"Timeout": 5000
}
}
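The leak-detection rule in the config above can be sketched as a simple threshold check. This is a minimal illustration under the stated config values; `is_leak` is a hypothetical helper, not part of ATP.

```shell
#!/bin/bash
# Sketch of the leak-detection rule: a connection checked out for longer
# than LeakDetection.Threshold (60000 ms) is flagged as a leak and would
# be reported every LogInterval.
LEAK_THRESHOLD_MS=60000

is_leak() { # $1 = how long the connection has been held, in ms
  if [ "$1" -gt "${LEAK_THRESHOLD_MS}" ]; then
    echo "leak"
  else
    echo "ok"
  fi
}

echo "held 45000 ms: $(is_leak 45000)"
echo "held 90000 ms: $(is_leak 90000)"
```

With `RejectWhenExhausted: false`, a request that cannot get a connection waits (up to `MaxWaitTime`, 30 s) in a queue of at most `QueueSize` (1000) rather than failing immediately.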
Connection Pool Monitoring Script:
#!/bin/bash
# scripts/monitor-connection-pool.sh
SERVICE="${1:-atp-ingestion-api}"
NAMESPACE="${2:-atp-ingest-ns}"
echo "📊 Monitoring connection pool for ${SERVICE}"
# Get connection pool metrics
POOL_SIZE=$(curl -s http://prometheus:9090/api/v1/query?query=database_connection_pool_size\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
ACTIVE_CONNECTIONS=$(curl -s http://prometheus:9090/api/v1/query?query=database_connection_pool_active\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
IDLE_CONNECTIONS=$(curl -s http://prometheus:9090/api/v1/query?query=database_connection_pool_idle\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
WAITING_REQUESTS=$(curl -s http://prometheus:9090/api/v1/query?query=database_connection_pool_waiting\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
echo "Connection pool metrics:"
echo " Pool size: ${POOL_SIZE}"
echo " Active connections: ${ACTIVE_CONNECTIONS}"
echo " Idle connections: ${IDLE_CONNECTIONS}"
echo " Waiting requests: ${WAITING_REQUESTS}"
# Check connection leak detection
LEAK_DETECTED=$(curl -s http://prometheus:9090/api/v1/query?query=database_connection_leaks_detected\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
LEAK_DURATION=$(curl -s http://prometheus:9090/api/v1/query?query=database_connection_leak_duration_seconds\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
echo "Connection leak detection:"
echo " Leaks detected: ${LEAK_DETECTED}"
echo " Leak duration: ${LEAK_DURATION}s"
# Check pool exhaustion
POOL_EXHAUSTION_ERRORS=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(database_connection_pool_exhausted\{service=\"${SERVICE}\"\}[5m]\) | jq -r '.data.result[0].value[1]')
POOL_UTILIZATION=$(echo "scale=2; ${ACTIVE_CONNECTIONS} / ${POOL_SIZE} * 100" | bc)
echo "Pool utilization:"
echo " Utilization: ${POOL_UTILIZATION}%"
echo " Exhaustion errors: ${POOL_EXHAUSTION_ERRORS}/sec"
# Validate pool health
if (( $(echo "${POOL_UTILIZATION} > 90" | bc -l) )); then
echo "⚠️ Pool utilization high: ${POOL_UTILIZATION}%"
fi
if (( $(echo "${WAITING_REQUESTS} > 0" | bc -l) )); then
echo "⚠️ Requests waiting for connections: ${WAITING_REQUESTS}"
fi
if (( $(echo "${LEAK_DETECTED} > 0" | bc -l) )); then
echo "⚠️ Connection leaks detected: ${LEAK_DETECTED}"
fi
if (( $(echo "${POOL_EXHAUSTION_ERRORS} > 0" | bc -l) )); then
echo "⚠️ Pool exhaustion errors: ${POOL_EXHAUSTION_ERRORS}/sec"
fi
echo "✅ Connection pool monitoring complete"
Summary: Database Chaos¶
- Database Failover: Validates Azure SQL failover mechanisms, connection retry, and transaction integrity during database failover; expects failover completes within 30 seconds, connection retry mechanisms activate, no transaction loss, and automatic recovery
- Database Slowdown: Validates timeout handling, circuit breaker activation, and fallback strategies during database performance degradation; expects timeout configurations prevent hanging requests, circuit breaker activates, fallback strategies maintain availability, and automatic recovery
- Database Connection Pool Exhaustion: Validates connection leak detection, pool size limits, and queuing behavior during connection pool exhaustion; expects pool size limits prevent resource exhaustion, connection leak detection identifies leaks, request queuing handles overload, and automatic recovery
- Monitoring and Validation: Comprehensive scripts for monitoring database failover, slowdown, connection pool exhaustion, connection leaks, pool utilization, and recovery behavior
Storage and Queue Chaos¶
Purpose: Define comprehensive chaos experiments for storage and queue failures in ATP, validating blob storage resilience, message queue disruption handling, and event store integrity to ensure ATP services remain available and functional during storage and messaging infrastructure failures.
Blob Storage Unavailability¶
Blob storage unavailability experiments validate that ATP services handle Azure Blob Storage outages gracefully through retry logic, export failure handling, and eventual consistency mechanisms.
Hypothesis
"When Azure Blob Storage becomes unavailable, services will retry operations with exponential backoff, export failures will be handled gracefully, operations will be queued for eventual consistency, and services will recover automatically when storage is restored."
Experiment Configuration
Azure Blob Storage Network Partition:
# chaos-experiments/blob-storage-unavailability.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: blob-storage-unavailability
namespace: chaos-testing
labels:
category: application
service: blob-storage
severity: high
frequency: monthly
annotations:
chaos.atp.connectsoft.io/hypothesis: |
When Azure Blob Storage becomes unavailable, services will retry operations with exponential backoff,
export failures will be handled gracefully, operations will be queued for eventual consistency,
and services will recover automatically when storage is restored.
spec:
action: partition
mode: all
selector:
namespaces:
- atp-export-ns
labelSelectors:
app: atp-export-api
direction: to
# Chaos Mesh targets endpoints outside the cluster via externalTargets
# rather than a pod selector; wildcard hostnames are not supported, so
# list the storage account endpoint explicitly
externalTargets:
- "atpstorageaccount.blob.core.windows.net"
duration: "15m"
Blob Storage Unavailability Simulation Script:
#!/bin/bash
# scripts/execute-blob-storage-unavailability-experiment.sh
STORAGE_ACCOUNT="${1:-atpstorageaccount}"
SERVICE="${2:-atp-export-api}"
NAMESPACE="${3:-atp-export-ns}"
DURATION="${4:-15m}"
echo "🧪 Starting blob storage unavailability experiment"
echo "Storage account: ${STORAGE_ACCOUNT}"
echo "Service: ${SERVICE}"
echo "Duration: ${DURATION}"
# Get baseline metrics
echo "📊 Collecting baseline metrics..."
# Capture the baseline filename once so the jq reads below hit the same file
BASELINE_FILE="baseline-${SERVICE}-$(date +%Y%m%d-%H%M%S).json"
./scripts/collect-baseline-metrics.sh \
--service ${SERVICE} \
--duration 1h \
--output "${BASELINE_FILE}"
BASELINE_EXPORT_SUCCESS_RATE=$(jq -r '.metrics.export_success_rate_percent' "${BASELINE_FILE}")
BASELINE_RETRY_COUNT=$(jq -r '.metrics.retry_count_per_operation' "${BASELINE_FILE}")
echo "Baseline metrics:"
echo " Export success rate: ${BASELINE_EXPORT_SUCCESS_RATE}%"
echo " Retry count per operation: ${BASELINE_RETRY_COUNT}"
# Apply network partition to blob storage
echo "🔧 Applying network partition to blob storage..."
kubectl apply -f - <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: blob-storage-unavailability-${SERVICE}
namespace: chaos-testing
spec:
action: partition
mode: all
selector:
namespaces:
- ${NAMESPACE}
labelSelectors:
app: ${SERVICE}
direction: to
# external endpoints are targeted via externalTargets, not a pod selector
externalTargets:
- "${STORAGE_ACCOUNT}.blob.core.windows.net"
duration: "${DURATION}"
EOF
OUTAGE_START=$(date +%s)
# Monitor service behavior
echo "👀 Monitoring service behavior during blob storage unavailability..."
MAX_WAIT=900 # 15 minutes
ELAPSED=0
while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
# Get export success rate
EXPORT_SUCCESS_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(export_operations_total\{service=\"${SERVICE}\",status=\"success\"\}[1m]\)/rate\(export_operations_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
EXPORT_SUCCESS_RATE_PERCENT=$(echo "${EXPORT_SUCCESS_RATE} * 100" | bc)
# Get export failure rate
EXPORT_FAILURE_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(export_operations_total\{service=\"${SERVICE}\",status=\"failure\"\}[1m]\) | jq -r '.data.result[0].value[1]')
# Get retry count
RETRY_COUNT=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(blob_storage_retry_attempts\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
# Get queued operations
QUEUED_OPERATIONS=$(curl -s http://prometheus:9090/api/v1/query?query=blob_storage_queued_operations\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
# Get eventual consistency operations
EVENTUAL_CONSISTENCY_OPS=$(curl -s http://prometheus:9090/api/v1/query?query=blob_storage_eventual_consistency_operations\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
echo "Metrics at ${ELAPSED}s:"
echo " Export success rate: ${EXPORT_SUCCESS_RATE_PERCENT}%"
echo " Export failure rate: ${EXPORT_FAILURE_RATE}/sec"
echo " Retry count: ${RETRY_COUNT}/sec"
echo " Queued operations: ${QUEUED_OPERATIONS}"
echo " Eventual consistency operations: ${EVENTUAL_CONSISTENCY_OPS}"
# Validate retry logic
if (( $(echo "${RETRY_COUNT} > 0" | bc -l) )); then
echo "✅ Retry logic active: ${RETRY_COUNT}/sec retry attempts"
fi
# Validate queuing behavior
if (( $(echo "${QUEUED_OPERATIONS} > 0" | bc -l) )); then
echo "✅ Operations queued for eventual consistency: ${QUEUED_OPERATIONS}"
fi
# Validate export failure handling
if (( $(echo "${EXPORT_FAILURE_RATE} > 0" | bc -l) )); then
echo "⚠️ Export failures detected: ${EXPORT_FAILURE_RATE}/sec"
# Check if failures are being handled gracefully (not causing service crashes)
fi
sleep 30
ELAPSED=$((ELAPSED + 30))
done
# Remove network partition
echo "🔧 Removing network partition..."
kubectl delete networkchaos blob-storage-unavailability-${SERVICE} -n chaos-testing
RECOVERY_START=$(date +%s)
# Wait for queued operations to complete
echo "⏳ Waiting for queued operations to complete..."
MAX_RECOVERY_WAIT=600 # 10 minutes
ELAPSED=0
while [ ${ELAPSED} -lt ${MAX_RECOVERY_WAIT} ]; do
QUEUED_OPERATIONS=$(curl -s http://prometheus:9090/api/v1/query?query=blob_storage_queued_operations\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
if (( $(echo "${QUEUED_OPERATIONS} == 0" | bc -l) )); then
RECOVERY_TIME=$(date +%s)
RECOVERY_DURATION=$((RECOVERY_TIME - RECOVERY_START))
echo "✅ Queued operations completed in ${RECOVERY_DURATION} seconds"
break
fi
sleep 30
ELAPSED=$((ELAPSED + 30))
echo "Waiting for queued operations... (${ELAPSED}s/${MAX_RECOVERY_WAIT}s) - Queued: ${QUEUED_OPERATIONS}"
done
# Verify eventual consistency
echo "🔍 Verifying eventual consistency..."
FINAL_QUEUED_OPS=$(curl -s http://prometheus:9090/api/v1/query?query=blob_storage_queued_operations\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
FINAL_EXPORT_SUCCESS_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(export_operations_total\{service=\"${SERVICE}\",status=\"success\"\}[1m]\)/rate\(export_operations_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
FINAL_EXPORT_SUCCESS_RATE_PERCENT=$(echo "${FINAL_EXPORT_SUCCESS_RATE} * 100" | bc)
if (( $(echo "${FINAL_QUEUED_OPS} == 0" | bc -l) )); then
echo "✅ All operations completed (eventual consistency achieved)"
if (( $(echo "${FINAL_EXPORT_SUCCESS_RATE_PERCENT} >= 99" | bc -l) )); then
echo "✅ Export success rate recovered: ${FINAL_EXPORT_SUCCESS_RATE_PERCENT}%"
exit 0
else
echo "⚠️ Export success rate not fully recovered: ${FINAL_EXPORT_SUCCESS_RATE_PERCENT}%"
exit 1
fi
else
echo "⚠️ Some operations still queued: ${FINAL_QUEUED_OPS}"
exit 1
fi
Expected Behavior
Outage Phase (0-15 minutes):
- Storage unavailability: Azure Blob Storage becomes unreachable
- Retry logic activation: Services retry operations with exponential backoff
- Export failure handling: Export failures handled gracefully (no crashes)
- Operation queuing: Failed operations queued for eventual consistency
- Service continuity: Service continues operating with degraded functionality
Recovery Phase (15-25 minutes):
- Storage restoration: Blob Storage becomes available
- Queued operations processing: Queued operations processed
- Eventual consistency: All operations eventually complete
- Normal operation: Service returns to normal operation
Expected Metrics
| Metric | Baseline | During Outage | Expected Range | Recovery Target |
|---|---|---|---|---|
| Export Success Rate | 99.95% | 0% | 0% | 99.95% |
| Retry Count | 0.1/sec | >0/sec | >0/sec | 0.1/sec |
| Queued Operations | 0 | >0 | >0 | 0 |
| Export Failure Rate | 0.05% | 100% | 100% | 0.05% |
| Eventual Consistency Ops | 0 | >0 | >0 | 0 |
Validation Criteria
Success Criteria:
- ✅ Retry logic activates with exponential backoff
- ✅ Export failures handled gracefully (no service crashes)
- ✅ Operations queued for eventual consistency
- ✅ All operations complete after storage restoration
- ✅ Service recovers automatically when storage restored
Blob Storage Retry Configuration
Blob Storage Retry Configuration:
# kubernetes/configmaps/blob-storage-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: blob-storage-config
namespace: atp-export-ns
data:
BlobStorageConfig.json: |
{
"RetryPolicy": {
"MaxRetries": 10,
"RetryDelay": 1000,
"ExponentialBackoff": true,
"MaxBackoff": 60000,
"RetryableErrors": [
"NetworkError",
"TimeoutError",
"ServiceUnavailable",
"InternalServerError"
]
},
"ExportFailureHandling": {
"QueueOnFailure": true,
"MaxQueueSize": 10000,
"RetryFailedExports": true,
"MaxRetryAttempts": 10
},
"EventualConsistency": {
"Enabled": true,
"QueueOperations": true,
"ProcessInterval": 5000,
"BatchSize": 100
}
}
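The exponential-backoff schedule defined by the retry policy above can be sketched in a few lines. This is illustrative only; `backoff_ms` is a hypothetical helper showing how the configured values interact, not ATP's retry code.

```shell
#!/bin/bash
# Sketch of the retry policy: the delay doubles from RetryDelay (1000 ms)
# on each attempt and is capped at MaxBackoff (60000 ms).
RETRY_DELAY_MS=1000
MAX_BACKOFF_MS=60000

backoff_ms() { # $1 = attempt number (1-based)
  local delay=$(( RETRY_DELAY_MS * (1 << ($1 - 1)) ))
  if [ ${delay} -gt ${MAX_BACKOFF_MS} ]; then
    delay=${MAX_BACKOFF_MS}
  fi
  echo ${delay}
}

for attempt in 1 2 3 4 7 10; do
  echo "attempt ${attempt}: waiting $(backoff_ms ${attempt}) ms before retry"
done
```

With `MaxRetries: 10`, the schedule saturates at the 60-second cap from attempt 7 onward, which is the retry cadence the experiment's `blob_storage_retry_attempts` metric should reflect during the outage.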
Message Queue Disruption¶
Message queue disruption experiments validate that ATP services handle Service Bus topic disruptions gracefully through message buffering, backpressure mechanisms, and DLQ behavior.
Hypothesis
"When a Service Bus topic is paused or unavailable, services will buffer messages, backpressure mechanisms will activate to prevent overload, messages will be moved to the DLQ when appropriate, and services will recover automatically when the queue is restored."
Experiment Configuration
Service Bus Topic Pause Simulation:
# chaos-experiments/service-bus-topic-pause.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: service-bus-topic-pause
namespace: chaos-testing
labels:
category: application
service: service-bus
severity: high
frequency: monthly
annotations:
chaos.atp.connectsoft.io/hypothesis: |
When a Service Bus topic is paused or unavailable, services will buffer messages,
backpressure mechanisms will activate to prevent overload, messages will be moved to the DLQ
when appropriate, and services will recover automatically when the queue is restored.
spec:
action: partition
mode: all
selector:
namespaces:
- atp-ingest-ns
labelSelectors:
app: atp-ingest-api
direction: to
# external endpoints are targeted via externalTargets, not a pod selector
externalTargets:
- "atp-servicebus.servicebus.windows.net"
duration: "15m"
Message Queue Disruption Simulation Script:
#!/bin/bash
# scripts/execute-message-queue-disruption-experiment.sh
SERVICE_BUS_NAMESPACE="${1:-atp-servicebus}"
TOPIC_NAME="${2:-atp-events}"
SERVICE="${3:-atp-ingestion-api}"
NAMESPACE="${4:-atp-ingest-ns}"
DURATION="${5:-15m}"
echo "🧪 Starting message queue disruption experiment"
echo "Service Bus namespace: ${SERVICE_BUS_NAMESPACE}"
echo "Topic: ${TOPIC_NAME}"
echo "Service: ${SERVICE}"
echo "Duration: ${DURATION}"
# Get baseline metrics
echo "📊 Collecting baseline metrics..."
# Capture the baseline filename once so the jq reads below hit the same file
BASELINE_FILE="baseline-${SERVICE}-$(date +%Y%m%d-%H%M%S).json"
./scripts/collect-baseline-metrics.sh \
--service ${SERVICE} \
--duration 1h \
--output "${BASELINE_FILE}"
BASELINE_QUEUE_DEPTH=$(jq -r '.metrics.queue_depth' "${BASELINE_FILE}")
BASELINE_PROCESSING_RATE=$(jq -r '.metrics.message_processing_rate_per_sec' "${BASELINE_FILE}")
echo "Baseline metrics:"
echo " Queue depth: ${BASELINE_QUEUE_DEPTH}"
echo " Processing rate: ${BASELINE_PROCESSING_RATE} msg/sec"
# Apply network partition to Service Bus
echo "🔧 Applying network partition to Service Bus..."
kubectl apply -f - <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: service-bus-topic-pause-${SERVICE}
namespace: chaos-testing
spec:
action: partition
mode: all
selector:
namespaces:
- ${NAMESPACE}
labelSelectors:
app: ${SERVICE}
direction: to
# external endpoints are targeted via externalTargets, not a pod selector
externalTargets:
- "${SERVICE_BUS_NAMESPACE}.servicebus.windows.net"
duration: "${DURATION}"
EOF
DISRUPTION_START=$(date +%s)
# Monitor service behavior
echo "👀 Monitoring service behavior during message queue disruption..."
MAX_WAIT=900 # 15 minutes
ELAPSED=0
while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
# Get queue depth
QUEUE_DEPTH=$(curl -s http://prometheus:9090/api/v1/query?query=queue_depth\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
# Get buffered messages
BUFFERED_MESSAGES=$(curl -s http://prometheus:9090/api/v1/query?query=message_buffer_size\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
# Get backpressure state
BACKPRESSURE_ACTIVE=$(curl -s http://prometheus:9090/api/v1/query?query=backpressure_active\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
# Get DLQ size
DLQ_SIZE=$(curl -s http://prometheus:9090/api/v1/query?query=dlq_messages_count\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
# Get message processing rate
PROCESSING_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(messages_processed_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
# Get message arrival rate
ARRIVAL_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(messages_arrived_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
echo "Metrics at ${ELAPSED}s:"
echo " Queue depth: ${QUEUE_DEPTH}"
echo " Buffered messages: ${BUFFERED_MESSAGES}"
echo " Backpressure active: ${BACKPRESSURE_ACTIVE}"
echo " DLQ size: ${DLQ_SIZE}"
echo " Processing rate: ${PROCESSING_RATE} msg/sec"
echo " Arrival rate: ${ARRIVAL_RATE} msg/sec"
# Validate message buffering
if (( $(echo "${BUFFERED_MESSAGES} > 0" | bc -l) )); then
echo "✅ Messages buffered: ${BUFFERED_MESSAGES}"
fi
# Validate backpressure activation
if (( $(echo "${QUEUE_DEPTH} > ${BASELINE_QUEUE_DEPTH} * 2" | bc -l) )); then
if [ "${BACKPRESSURE_ACTIVE}" = "1" ]; then
echo "✅ Backpressure activated: ${BACKPRESSURE_ACTIVE}"
else
echo "⚠️ Queue depth high but backpressure not active"
fi
fi
# Validate DLQ behavior
if (( $(echo "${DLQ_SIZE} > 0" | bc -l) )); then
echo "⚠️ Messages in DLQ: ${DLQ_SIZE}"
# Check if DLQ messages are within acceptable limits
MAX_DLQ_SIZE=1000
if (( $(echo "${DLQ_SIZE} > ${MAX_DLQ_SIZE}" | bc -l) )); then
echo "⚠️ DLQ size exceeds limit: ${DLQ_SIZE} > ${MAX_DLQ_SIZE}"
fi
fi
sleep 30
ELAPSED=$((ELAPSED + 30))
done
# Remove network partition
echo "🔧 Removing network partition..."
kubectl delete networkchaos service-bus-topic-pause-${SERVICE} -n chaos-testing
RECOVERY_START=$(date +%s)
# Wait for buffered messages to be processed
echo "⏳ Waiting for buffered messages to be processed..."
MAX_RECOVERY_WAIT=600 # 10 minutes
ELAPSED=0
while [ ${ELAPSED} -lt ${MAX_RECOVERY_WAIT} ]; do
BUFFERED_MESSAGES=$(curl -s http://prometheus:9090/api/v1/query?query=message_buffer_size\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
QUEUE_DEPTH=$(curl -s http://prometheus:9090/api/v1/query?query=queue_depth\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
if (( $(echo "${BUFFERED_MESSAGES} == 0" | bc -l) )); then
if (( $(echo "${QUEUE_DEPTH} <= ${BASELINE_QUEUE_DEPTH} * 1.1" | bc -l) )); then
RECOVERY_TIME=$(date +%s)
RECOVERY_DURATION=$((RECOVERY_TIME - RECOVERY_START))
echo "✅ Buffered messages processed in ${RECOVERY_DURATION} seconds"
break
fi
fi
sleep 30
ELAPSED=$((ELAPSED + 30))
echo "Waiting for recovery... (${ELAPSED}s/${MAX_RECOVERY_WAIT}s) - Buffered: ${BUFFERED_MESSAGES}, Queue depth: ${QUEUE_DEPTH}"
done
# Verify recovery
FINAL_BUFFERED_MESSAGES=$(curl -s http://prometheus:9090/api/v1/query?query=message_buffer_size\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
FINAL_PROCESSING_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(messages_processed_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
FINAL_BACKPRESSURE=$(curl -s http://prometheus:9090/api/v1/query?query=backpressure_active\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
if (( $(echo "${FINAL_BUFFERED_MESSAGES} == 0" | bc -l) )); then
echo "✅ All buffered messages processed"
if (( $(echo "${FINAL_PROCESSING_RATE} >= ${BASELINE_PROCESSING_RATE} * 0.9" | bc -l) )); then
echo "✅ Processing rate recovered: ${FINAL_PROCESSING_RATE} msg/sec (baseline: ${BASELINE_PROCESSING_RATE} msg/sec)"
if [ "${FINAL_BACKPRESSURE}" = "0" ]; then
echo "✅ Backpressure deactivated (service recovered)"
exit 0
else
echo "⚠️ Backpressure still active: ${FINAL_BACKPRESSURE}"
exit 1
fi
else
echo "⚠️ Processing rate not fully recovered: ${FINAL_PROCESSING_RATE} msg/sec"
exit 1
fi
else
echo "⚠️ Some messages still buffered: ${FINAL_BUFFERED_MESSAGES}"
exit 1
fi
Expected Behavior
Disruption Phase (0-15 minutes):
- Queue unavailability: Service Bus topic becomes unavailable
- Message buffering: Messages buffered locally
- Backpressure activation: Backpressure mechanisms activate to prevent overload
- DLQ movement: Messages moved to DLQ when retry limit exceeded
- Service continuity: Service continues operating with message buffering
Recovery Phase (15-25 minutes):
- Queue restoration: Service Bus topic restored
- Buffered message processing: Buffered messages processed
- Backpressure deactivation: Backpressure mechanisms deactivate
- Normal operation: Service returns to normal operation
Expected Metrics
| Metric | Baseline | During Disruption | Expected Range | Recovery Target |
|---|---|---|---|---|
| Queue Depth | 100 | Increasing | Any | 100 |
| Buffered Messages | 0 | >0 | >0 | 0 |
| Backpressure Active | No | Yes | Yes | No |
| DLQ Size | 0 | <1,000 | <1,000 | 0 |
| Processing Rate | 100 msg/sec | 0 msg/sec | 0 msg/sec | 100 msg/sec |
Validation Criteria
Success Criteria:
- ✅ Messages buffered when queue unavailable
- ✅ Backpressure activates to prevent overload
- ✅ DLQ behavior appropriate (messages moved when retry limit exceeded)
- ✅ All buffered messages processed after queue restoration
- ✅ Service recovers automatically when queue restored
Message Queue Configuration
Message Queue Buffering Configuration:
# kubernetes/configmaps/message-queue-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: message-queue-config
namespace: atp-ingest-ns
data:
MessageQueueConfig.json: |
{
"Buffering": {
"Enabled": true,
"MaxBufferSize": 10000,
"BufferTimeout": 300000
},
"Backpressure": {
"Enabled": true,
"QueueDepthThreshold": 5000,
"ThrottleRate": 0.5
},
"DLQ": {
"Enabled": true,
"MaxRetryAttempts": 10,
"MoveToDLQAfterRetries": true,
"MaxDLQSize": 10000
},
"RetryPolicy": {
"MaxRetries": 10,
"RetryDelay": 1000,
"ExponentialBackoff": true,
"MaxBackoff": 60000
}
}
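The backpressure rule in the config above can be sketched as a throttle on accepted throughput. This is a minimal illustration of the configured values; `allowed_rate` is a hypothetical helper, not ATP's admission logic.

```shell
#!/bin/bash
# Sketch of the backpressure rule: once queue depth crosses
# QueueDepthThreshold (5000), accepted throughput drops to ThrottleRate
# (0.5, i.e. 50%) of the nominal rate.
QUEUE_DEPTH_THRESHOLD=5000
THROTTLE_PERCENT=50   # ThrottleRate 0.5 expressed as a percentage

allowed_rate() { # $1 = current queue depth, $2 = nominal rate (msg/sec)
  if [ "$1" -gt "${QUEUE_DEPTH_THRESHOLD}" ]; then
    echo $(( $2 * THROTTLE_PERCENT / 100 ))
  else
    echo "$2"
  fi
}

echo "depth 1000 -> $(allowed_rate 1000 100) msg/sec"   # below threshold: full rate
echo "depth 8000 -> $(allowed_rate 8000 100) msg/sec"   # above threshold: throttled
```

This is the transition the experiment validates when `backpressure_active` flips to `1` after queue depth exceeds twice the baseline.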
Event Store Corruption Simulation¶
Event store corruption simulation experiments validate that ATP services handle corrupted event data gracefully through integrity verification, quarantine procedures, and recovery from backups.
Hypothesis
"When the event store contains corrupted event data, integrity verification will detect the corruption, corrupted events will be quarantined, services will recover from backups, and services will continue operating without the corrupted data."
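The integrity check the hypothesis relies on can be sketched as a checksum recomputation. This is illustrative only: `verify_event`, the payload shape, and the use of SHA-256 are assumptions for the sketch, not ATP's actual event schema or hashing scheme.

```shell
#!/bin/bash
# Sketch of event integrity verification: recompute the checksum from the
# event payload and quarantine the event on mismatch.
verify_event() { # $1 = event payload, $2 = stored sha256 checksum
  local actual
  actual=$(printf '%s' "$1" | sha256sum | awk '{print $1}')
  if [ "${actual}" = "$2" ]; then
    echo "ok"
  else
    echo "quarantine"
  fi
}

payload='{"eventId":"evt-1","action":"login"}'
stored=$(printf '%s' "${payload}" | sha256sum | awk '{print $1}')
echo "intact event:   $(verify_event "${payload}" "${stored}")"
echo "tampered event: $(verify_event "${payload}" 'CORRUPTED')"
```

The corruption experiment below overwrites stored checksums with the literal string `CORRUPTED`, which is exactly the mismatch this kind of check should catch.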
Experiment Configuration
Event Store Corruption Simulation Script:
#!/bin/bash
# scripts/execute-event-store-corruption-experiment.sh
SERVICE="${1:-atp-ingestion-api}"
NAMESPACE="${2:-atp-ingest-ns}"
CORRUPTION_RATE="${3:-1}" # Percentage of events to corrupt
echo "🧪 Starting event store corruption simulation"
echo "Service: ${SERVICE}"
echo "Corruption rate: ${CORRUPTION_RATE}%"
# Get baseline metrics
echo "📊 Collecting baseline metrics..."
# Capture the baseline filename once so the jq reads below hit the same file
BASELINE_FILE="baseline-${SERVICE}-$(date +%Y%m%d-%H%M%S).json"
./scripts/collect-baseline-metrics.sh \
--service ${SERVICE} \
--duration 1h \
--output "${BASELINE_FILE}"
BASELINE_EVENT_COUNT=$(jq -r '.metrics.total_events' "${BASELINE_FILE}")
BASELINE_INTEGRITY_CHECKS=$(jq -r '.metrics.integrity_checks_passed' "${BASELINE_FILE}")
echo "Baseline metrics:"
echo " Total events: ${BASELINE_EVENT_COUNT}"
echo " Integrity checks passed: ${BASELINE_INTEGRITY_CHECKS}"
# Simulate event corruption by modifying event data
echo "🔧 Simulating event store corruption..."
CORRUPTION_START=$(date +%s)
# Corrupt a percentage of events in the event store
# This would be done by directly modifying event data in the database
# For simulation, we'll use a script that modifies event checksums
cat > /tmp/corrupt-events.sh <<'EOF'
#!/bin/bash
# Illustrative helper (the live corruption job is created directly below).
# The quoted heredoc keeps everything literal, so default the inputs here
# rather than relying on variables that would not expand inside <<'EOF'.
NAMESPACE="${NAMESPACE:-atp-ingest-ns}"
CORRUPTION_RATE="${CORRUPTION_RATE:-1}"
# In production, corruption would be injected through controlled database operations
kubectl create job --from=cronjob/event-corruption-simulator corrupt-events-$(date +%s) \
-n ${NAMESPACE} \
-- /bin/sh -c "
# Connect to database and corrupt events
# This is a simplified example - actual implementation would use proper database client
psql -h \${DB_HOST} -U \${DB_USER} -d \${DB_NAME} -c \"
UPDATE events
SET checksum = 'CORRUPTED'
WHERE id IN (
SELECT id FROM events
WHERE checksum != 'CORRUPTED'
ORDER BY RANDOM()
LIMIT (SELECT GREATEST(1, COUNT(*) * ${CORRUPTION_RATE} / 100) FROM events)
);
\"
"
EOF
chmod +x /tmp/corrupt-events.sh
# Execute corruption (in production, this would be done carefully)
echo "⚠️ WARNING: This will corrupt event data. Continuing in 5 seconds..."
sleep 5
# For simulation, we'll use a Kubernetes job
kubectl create job corrupt-events-$(date +%s) \
--image=postgres:15 \
-n ${NAMESPACE} \
-- /bin/sh -c "
PGPASSWORD=\${DB_PASSWORD} psql -h \${DB_HOST} -U \${DB_USER} -d \${DB_NAME} -c \"
UPDATE events
SET checksum = 'CORRUPTED'
WHERE id IN (
SELECT id FROM events
WHERE checksum != 'CORRUPTED'
ORDER BY RANDOM()
LIMIT (SELECT GREATEST(1, COUNT(*) * ${CORRUPTION_RATE} / 100) FROM events)
);
\"
" || echo "⚠️ Corruption simulation job failed (expected in test environment)"
# Monitor service behavior
echo "👀 Monitoring service behavior during event store corruption..."
MAX_WAIT=600 # 10 minutes
ELAPSED=0
while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
# Get integrity check failures
# curl -g disables URL globbing so the raw {} and [1m] in the query are sent as-is
INTEGRITY_FAILURES=$(curl -sg http://prometheus:9090/api/v1/query?query=rate\(event_integrity_check_failures\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
# Get quarantined events
QUARANTINED_EVENTS=$(curl -sg http://prometheus:9090/api/v1/query?query=event_quarantine_count\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
# Get recovery operations
RECOVERY_OPS=$(curl -sg http://prometheus:9090/api/v1/query?query=rate\(event_recovery_operations\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
# Get backup restore operations
BACKUP_RESTORE_OPS=$(curl -sg http://prometheus:9090/api/v1/query?query=rate\(event_backup_restore_operations\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
# Get service availability
AVAILABILITY=$(curl -sg http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SERVICE}\",status!~\"5..\"\}[1m]\)/rate\(http_requests_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
AVAILABILITY_PERCENT=$(echo "${AVAILABILITY} * 100" | bc)
echo "Metrics at ${ELAPSED}s:"
echo " Integrity check failures: ${INTEGRITY_FAILURES}/sec"
echo " Quarantined events: ${QUARANTINED_EVENTS}"
echo " Recovery operations: ${RECOVERY_OPS}/sec"
echo " Backup restore operations: ${BACKUP_RESTORE_OPS}/sec"
echo " Service availability: ${AVAILABILITY_PERCENT}%"
# Validate integrity verification
if (( $(echo "${INTEGRITY_FAILURES} > 0" | bc -l) )); then
echo "✅ Integrity verification detected corruption: ${INTEGRITY_FAILURES}/sec failures"
fi
# Validate quarantine procedures
if (( $(echo "${QUARANTINED_EVENTS} > 0" | bc -l) )); then
echo "✅ Corrupted events quarantined: ${QUARANTINED_EVENTS}"
fi
# Validate recovery operations
if (( $(echo "${RECOVERY_OPS} > 0" | bc -l) )); then
echo "✅ Recovery operations active: ${RECOVERY_OPS}/sec"
fi
# Validate service availability
if (( $(echo "${AVAILABILITY_PERCENT} >= 95" | bc -l) )); then
echo "✅ Service availability maintained: ${AVAILABILITY_PERCENT}%"
fi
sleep 30
ELAPSED=$((ELAPSED + 30))
done
# Verify recovery
echo "🔍 Verifying recovery from corruption..."
FINAL_QUARANTINED=$(curl -sg http://prometheus:9090/api/v1/query?query=event_quarantine_count\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
FINAL_INTEGRITY_FAILURES=$(curl -sg http://prometheus:9090/api/v1/query?query=rate\(event_integrity_check_failures\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
FINAL_AVAILABILITY=$(curl -sg http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SERVICE}\",status!~\"5..\"\}[1m]\)/rate\(http_requests_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
FINAL_AVAILABILITY_PERCENT=$(echo "${FINAL_AVAILABILITY} * 100" | bc)
# Check if corrupted events were recovered from backup
RECOVERED_EVENTS=$(curl -sg http://prometheus:9090/api/v1/query?query=event_recovery_successful\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
if (( $(echo "${FINAL_INTEGRITY_FAILURES} == 0" | bc -l) )); then
echo "✅ Integrity checks passing (corruption handled)"
if (( $(echo "${FINAL_AVAILABILITY_PERCENT} >= 99" | bc -l) )); then
echo "✅ Service availability recovered: ${FINAL_AVAILABILITY_PERCENT}%"
if (( $(echo "${RECOVERED_EVENTS} > 0" | bc -l) )); then
echo "✅ Events recovered from backup: ${RECOVERED_EVENTS}"
exit 0
else
echo "⚠️ No events recovered from backup"
exit 1
fi
else
echo "⚠️ Service availability not fully recovered: ${FINAL_AVAILABILITY_PERCENT}%"
exit 1
fi
else
echo "⚠️ Integrity check failures still occurring: ${FINAL_INTEGRITY_FAILURES}/sec"
exit 1
fi
Expected Behavior
Corruption Detection Phase (0-5 minutes):
- Corruption detection: Integrity verification detects corrupted events
- Quarantine activation: Corrupted events quarantined
- Service continuity: Service continues operating without corrupted data
Recovery Phase (5-15 minutes):
- Backup identification: Backups identified for corrupted events
- Event recovery: Corrupted events recovered from backups
- Integrity restoration: Event store integrity restored
- Normal operation: Service returns to normal operation
Expected Metrics
| Metric | Baseline | During Corruption | Expected Range | Recovery Target |
|---|---|---|---|---|
| Integrity Check Failures | 0/sec | >0/sec | >0/sec | 0/sec |
| Quarantined Events | 0 | >0 | >0 | 0 |
| Recovery Operations | 0/sec | >0/sec | >0/sec | 0/sec |
| Backup Restore Operations | 0/sec | >0/sec | >0/sec | 0/sec |
| Service Availability | 99.95% | >95% | >95% | 99.95% |
Validation Criteria
Success Criteria:
- ✅ Integrity verification detects corruption
- ✅ Corrupted events quarantined
- ✅ Events recovered from backups
- ✅ Service availability >95% during corruption
- ✅ Service recovers automatically
Event Store Integrity Configuration
Event Store Integrity Configuration:
# kubernetes/configmaps/event-store-integrity-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: event-store-integrity-config
namespace: atp-ingest-ns
data:
EventStoreIntegrityConfig.json: |
{
"IntegrityVerification": {
"Enabled": true,
"CheckInterval": 60000,
"ChecksumAlgorithm": "SHA256",
"ValidateOnRead": true,
"ValidateOnWrite": true
},
"Quarantine": {
"Enabled": true,
"QuarantineThreshold": 1,
"QuarantineLocation": "quarantine_events",
"MaxQuarantineSize": 10000
},
"Recovery": {
"Enabled": true,
"RecoverFromBackup": true,
"BackupRetentionDays": 30,
"RecoveryRetryAttempts": 3
},
"Backup": {
"Enabled": true,
"BackupInterval": 3600000,
"BackupRetentionDays": 30,
"BackupLocation": "backup_events"
}
}
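A minimal sketch of the verify-and-quarantine flow described by `IntegrityVerification` and `Quarantine` above, using `sha256sum` to match `ChecksumAlgorithm: SHA256`; the file layout and quarantine directory are illustrative, not ATP's actual storage schema:

```shell
#!/bin/bash
# Verify an event file against its stored SHA-256 checksum; quarantine on mismatch.
# Event-per-file layout and the quarantine directory are illustrative only.
verify_event() {
  local event_file="$1" stored_checksum="$2" quarantine_dir="$3"
  local actual
  actual=$(sha256sum "${event_file}" | awk '{print $1}')
  if [ "${actual}" = "${stored_checksum}" ]; then
    echo "OK ${event_file}"
  else
    mkdir -p "${quarantine_dir}"
    mv "${event_file}" "${quarantine_dir}/"  # quarantined, pending restore from backup
    echo "QUARANTINED ${event_file}"
  fi
}
```

A periodic job (the `CheckInterval` above) would run this over recently written events and raise `event_integrity_check_failures` for each quarantined file.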
Event Integrity Verification Script:
#!/bin/bash
# scripts/verify-event-integrity.sh
SERVICE="${1:-atp-ingestion-api}"
NAMESPACE="${2:-atp-ingest-ns}"
echo "🔍 Verifying event store integrity for ${SERVICE}"
# Get integrity check metrics
INTEGRITY_CHECKS=$(curl -sg http://prometheus:9090/api/v1/query?query=event_integrity_checks_total\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
INTEGRITY_FAILURES=$(curl -sg http://prometheus:9090/api/v1/query?query=event_integrity_check_failures\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
if [ -z "${INTEGRITY_CHECKS}" ] || [ "${INTEGRITY_CHECKS}" = "null" ] || [ "${INTEGRITY_CHECKS}" = "0" ]; then
INTEGRITY_PASS_RATE="100" # no checks recorded yet; avoid division by zero
else
INTEGRITY_PASS_RATE=$(echo "scale=2; (${INTEGRITY_CHECKS} - ${INTEGRITY_FAILURES}) / ${INTEGRITY_CHECKS} * 100" | bc)
fi
echo "Integrity check metrics:"
echo " Total checks: ${INTEGRITY_CHECKS}"
echo " Failures: ${INTEGRITY_FAILURES}"
echo " Pass rate: ${INTEGRITY_PASS_RATE}%"
# Get quarantined events
QUARANTINED_EVENTS=$(curl -sg http://prometheus:9090/api/v1/query?query=event_quarantine_count\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
echo " Quarantined events: ${QUARANTINED_EVENTS}"
# Get recovery operations
RECOVERY_OPS=$(curl -sg http://prometheus:9090/api/v1/query?query=event_recovery_operations_total\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
RECOVERY_SUCCESS=$(curl -sg http://prometheus:9090/api/v1/query?query=event_recovery_successful\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
if [ -z "${RECOVERY_OPS}" ] || [ "${RECOVERY_OPS}" = "null" ] || [ "${RECOVERY_OPS}" = "0" ]; then
RECOVERY_SUCCESS_RATE="0" # no recovery operations yet; avoid division by zero
else
RECOVERY_SUCCESS_RATE=$(echo "scale=2; ${RECOVERY_SUCCESS} / ${RECOVERY_OPS} * 100" | bc)
fi
echo "Recovery metrics:"
echo " Total recovery operations: ${RECOVERY_OPS}"
echo " Successful recoveries: ${RECOVERY_SUCCESS}"
echo " Recovery success rate: ${RECOVERY_SUCCESS_RATE}%"
# Validate integrity
if (( $(echo "${INTEGRITY_PASS_RATE} >= 99.9" | bc -l) )); then
echo "✅ Event store integrity healthy: ${INTEGRITY_PASS_RATE}% pass rate"
else
echo "⚠️ Event store integrity issues: ${INTEGRITY_PASS_RATE}% pass rate"
exit 1
fi
if (( $(echo "${QUARANTINED_EVENTS} > 0" | bc -l) )); then
echo "⚠️ Quarantined events detected: ${QUARANTINED_EVENTS}"
# Check if recovery is in progress
if (( $(echo "${RECOVERY_OPS} > 0" | bc -l) )); then
echo "✅ Recovery operations in progress"
fi
fi
echo "✅ Event integrity verification complete"
Summary: Storage and Queue Chaos¶
- Blob Storage Unavailability: Validates retry logic, export failure handling, and eventual consistency during Azure Blob Storage outages; expects retry logic activates with exponential backoff, export failures handled gracefully, operations queued for eventual consistency, and automatic recovery
- Message Queue Disruption: Validates message buffering, backpressure mechanisms, and DLQ behavior during Service Bus topic disruptions; expects messages buffered when queue unavailable, backpressure activates, DLQ behavior appropriate, and automatic recovery
- Event Store Corruption Simulation: Validates integrity verification, quarantine procedures, and recovery from backups during event store corruption; expects integrity verification detects corruption, corrupted events quarantined, events recovered from backups, and automatic recovery
- Monitoring and Validation: Comprehensive scripts for monitoring blob storage availability, message queue disruption, event store corruption, integrity verification, quarantine procedures, and recovery operations
Network Chaos¶
Purpose: Define comprehensive chaos experiments for network failures in ATP, validating network partition handling, packet loss resilience, DNS failure recovery, and bandwidth limitation management to ensure ATP services remain available and functional during network-level failures and performance degradation.
Network Partition¶
Network partition experiments validate that ATP services handle network partitions gracefully through partition detection, service isolation handling, and automatic recovery when network connectivity is restored.
Hypothesis
"When a network partition occurs between service namespaces, services will detect the partition, handle service isolation gracefully, continue operating within their partition, and recover automatically when network connectivity is restored."
Experiment Configuration
Network Partition Between Namespaces:
# chaos-experiments/network-partition.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: network-partition
namespace: chaos-testing
labels:
category: infrastructure
service: network
severity: high
frequency: monthly
annotations:
chaos.atp.connectsoft.io/hypothesis: |
When a network partition occurs between service namespaces, services will detect the partition,
handle service isolation gracefully, continue operating within their partition,
and recover automatically when network connectivity is restored.
spec:
action: partition
mode: all
selector:
namespaces:
- atp-ingest-ns
labelSelectors:
app: atp-ingest-api
direction: both
target:
mode: all
selector:
namespaces:
- atp-query-ns
duration: "5m"
Network Partition Simulation Script:
#!/bin/bash
# scripts/execute-network-partition-experiment.sh
SOURCE_NS="${1:-atp-ingest-ns}"
SOURCE_SERVICE="${2:-atp-ingest-api}"
TARGET_NS="${3:-atp-query-ns}"
DURATION="${4:-5m}"
echo "🧪 Starting network partition experiment"
echo "Source namespace: ${SOURCE_NS}"
echo "Source service: ${SOURCE_SERVICE}"
echo "Target namespace: ${TARGET_NS}"
echo "Duration: ${DURATION}"
# Get baseline metrics
echo "📊 Collecting baseline metrics..."
BASELINE_FILE="baseline-${SOURCE_SERVICE}-$(date +%Y%m%d-%H%M%S).json"
./scripts/collect-baseline-metrics.sh \
--service ${SOURCE_SERVICE} \
--duration 1h \
--output "${BASELINE_FILE}"
BASELINE_SUCCESS_RATE=$(jq -r '.metrics.success_rate_percent' "${BASELINE_FILE}")
BASELINE_LATENCY=$(jq -r '.metrics.p95_latency_ms' "${BASELINE_FILE}")
echo "Baseline metrics:"
echo " Success rate: ${BASELINE_SUCCESS_RATE}%"
echo " P95 latency: ${BASELINE_LATENCY}ms"
# Apply network partition
echo "🔧 Applying network partition..."
kubectl apply -f - <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: network-partition-${SOURCE_NS}-to-${TARGET_NS}
namespace: chaos-testing
spec:
action: partition
mode: all
selector:
namespaces:
- ${SOURCE_NS}
labelSelectors:
app: ${SOURCE_SERVICE}
direction: both
target:
mode: all
selector:
namespaces:
- ${TARGET_NS}
duration: "${DURATION}"
EOF
PARTITION_START=$(date +%s)
# Monitor service behavior
echo "👀 Monitoring service behavior during network partition..."
MAX_WAIT=300 # 5 minutes
ELAPSED=0
while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
# Get request success rate
SUCCESS_RATE=$(curl -sg http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SOURCE_SERVICE}\",status!~\"5..\"\}[1m]\)/rate\(http_requests_total\{service=\"${SOURCE_SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
SUCCESS_RATE_PERCENT=$(echo "${SUCCESS_RATE} * 100" | bc)
# Get connection errors
CONNECTION_ERRORS=$(curl -sg http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SOURCE_SERVICE}\",status=\"503\"\}[1m]\) | jq -r '.data.result[0].value[1]')
# Get partition detection
PARTITION_DETECTED=$(curl -sg http://prometheus:9090/api/v1/query?query=network_partition_detected\{service=\"${SOURCE_SERVICE}\"\} | jq -r '.data.result[0].value[1]')
# Get service availability within partition
AVAILABILITY=$(curl -sg http://prometheus:9090/api/v1/query?query=service_availability\{service=\"${SOURCE_SERVICE}\",partition=\"local\"\} | jq -r '.data.result[0].value[1]')
AVAILABILITY_PERCENT=$(echo "${AVAILABILITY} * 100" | bc)
echo "Metrics at ${ELAPSED}s:"
echo " Success rate: ${SUCCESS_RATE_PERCENT}%"
echo " Connection errors: ${CONNECTION_ERRORS}/sec"
echo " Partition detected: ${PARTITION_DETECTED}"
echo " Local availability: ${AVAILABILITY_PERCENT}%"
# Validate partition detection
if [ "${PARTITION_DETECTED}" = "1" ]; then
echo "✅ Network partition detected"
fi
# Validate service isolation handling
if (( $(echo "${AVAILABILITY_PERCENT} >= 90" | bc -l) )); then
echo "✅ Service continues operating within partition: ${AVAILABILITY_PERCENT}%"
fi
sleep 30
ELAPSED=$((ELAPSED + 30))
done
# Remove network partition
echo "🔧 Removing network partition..."
kubectl delete networkchaos network-partition-${SOURCE_NS}-to-${TARGET_NS} -n chaos-testing
RECOVERY_START=$(date +%s)
# Wait for recovery
echo "⏳ Waiting for network connectivity to restore..."
sleep 60
# Verify recovery
FINAL_SUCCESS_RATE=$(curl -sg http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SOURCE_SERVICE}\",status!~\"5..\"\}[1m]\)/rate\(http_requests_total\{service=\"${SOURCE_SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
FINAL_SUCCESS_RATE_PERCENT=$(echo "${FINAL_SUCCESS_RATE} * 100" | bc)
FINAL_PARTITION_DETECTED=$(curl -sg http://prometheus:9090/api/v1/query?query=network_partition_detected\{service=\"${SOURCE_SERVICE}\"\} | jq -r '.data.result[0].value[1]')
if (( $(echo "${FINAL_SUCCESS_RATE_PERCENT} >= ${BASELINE_SUCCESS_RATE} * 0.95" | bc -l) )); then
echo "✅ Success rate recovered: ${FINAL_SUCCESS_RATE_PERCENT}% (baseline: ${BASELINE_SUCCESS_RATE}%)"
if [ "${FINAL_PARTITION_DETECTED}" = "0" ]; then
echo "✅ Network partition cleared (service recovered)"
exit 0
else
echo "⚠️ Network partition still detected: ${FINAL_PARTITION_DETECTED}"
exit 1
fi
else
echo "⚠️ Success rate not fully recovered: ${FINAL_SUCCESS_RATE_PERCENT}% (baseline: ${BASELINE_SUCCESS_RATE}%)"
exit 1
fi
Expected Behavior
Partition Phase (0-5 minutes):
- Network partition: Network connectivity between namespaces lost
- Partition detection: Services detect network partition
- Service isolation: Services continue operating within their partition
- Connection failures: Cross-partition requests fail
- Local operation: Services maintain local functionality
Recovery Phase (5-10 minutes):
- Network restoration: Network connectivity restored
- Partition detection cleared: Services detect network restoration
- Connection re-establishment: Cross-partition connections re-established
- Normal operation: Service returns to normal operation
Expected Metrics
| Metric | Baseline | During Partition | Expected Range | Recovery Target |
|---|---|---|---|---|
| Request Success Rate | 99.95% | Degraded (cross-partition calls fail) | >90% for local traffic | 99.95% |
| Connection Errors | 0/sec | >0/sec | >0/sec | 0/sec |
| Partition Detected | No | Yes | Yes | No |
| Local Availability | 100% | >90% | >90% | 100% |
Validation Criteria
Success Criteria:
- ✅ Network partition detected
- ✅ Services continue operating within partition
- ✅ Local availability >90%
- ✅ Services recover automatically when partition cleared
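The `network_partition_detected` signal used in the monitoring loop can be approximated by counting consecutive failed cross-namespace probes; the threshold and probe mechanism here are illustrative assumptions, not ATP's actual detector:

```shell
#!/bin/bash
# Flag a partition after N consecutive failed cross-namespace probes;
# clear the flag on the first successful probe. Threshold is illustrative.
PARTITION_THRESHOLD=3
consecutive_failures=0
partition_detected=0

record_probe() {   # $1 = probe exit code (0 = target reachable)
  if [ "$1" -eq 0 ]; then
    consecutive_failures=0
    partition_detected=0
  else
    consecutive_failures=$((consecutive_failures + 1))
    if [ "${consecutive_failures}" -ge "${PARTITION_THRESHOLD}" ]; then
      partition_detected=1
    fi
  fi
}
```

A monitoring sidecar would call `record_probe $?` after something like `curl -fsS --max-time 2 http://atp-query-api.atp-query-ns.svc/healthz` (endpoint name assumed) and export `partition_detected` as the gauge queried above.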
Packet Loss Simulation¶
Packet loss simulation experiments validate that ATP services handle packet loss gracefully through TCP retransmission, application timeouts, and circuit breaker activation.
Hypothesis
"When packet loss occurs (5%, 10%, 25%), TCP retransmission will handle low packet loss, application timeouts will handle high packet loss, circuit breakers will protect services from cascading failures, and services will recover automatically when packet loss is removed."
Experiment Configuration
Packet Loss Injection:
# chaos-experiments/packet-loss-simulation.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: packet-loss-simulation
namespace: chaos-testing
labels:
category: infrastructure
service: network
severity: medium
frequency: monthly
annotations:
chaos.atp.connectsoft.io/hypothesis: |
When packet loss occurs, TCP retransmission will handle low packet loss,
application timeouts will handle high packet loss, circuit breakers will protect services,
and services will recover automatically when packet loss is removed.
spec:
action: loss
mode: fixed-percent
value: "100" # percent of matched pods to affect; the packet-loss percent is set under loss below
selector:
namespaces:
- atp-ingest-ns
labelSelectors:
app: atp-ingest-api
direction: both
loss:
loss: "10%"
correlation: "25"
duration: "10m"
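Chaos Mesh applies `action: loss` with Linux `tc netem` inside the target pod's network namespace; the spec above (loss `10%`, correlation `25`) corresponds to the command composed below. The interface name is an assumption, and running it by hand requires root in that namespace:

```shell
#!/bin/bash
# Build the tc netem command equivalent to the NetworkChaos loss spec above.
# Chaos Mesh resolves the pod's real device; eth0 here is an assumption.
build_netem_loss_cmd() {
  local dev="$1" loss_pct="$2" correlation_pct="$3"
  printf 'tc qdisc add dev %s root netem loss %s%% %s%%\n' \
    "${dev}" "${loss_pct}" "${correlation_pct}"
}

build_netem_loss_cmd eth0 10 25
# To remove the impairment afterwards: tc qdisc del dev eth0 root netem
```

The correlation argument makes each packet's drop probability depend partly on whether the previous packet was dropped, producing the bursty loss patterns real networks exhibit.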
Packet Loss Simulation Script:
#!/bin/bash
# scripts/execute-packet-loss-simulation.sh
SERVICE="${1:-atp-ingestion-api}"
NAMESPACE="${2:-atp-ingest-ns}"
PACKET_LOSS="${3:-10}" # Packet loss percentage (5, 10, 25)
DURATION="${4:-10m}"
echo "🧪 Starting packet loss simulation"
echo "Service: ${SERVICE}"
echo "Packet loss: ${PACKET_LOSS}%"
echo "Duration: ${DURATION}"
# Get baseline metrics
echo "📊 Collecting baseline metrics..."
BASELINE_FILE="baseline-${SERVICE}-$(date +%Y%m%d-%H%M%S).json"
./scripts/collect-baseline-metrics.sh \
--service ${SERVICE} \
--duration 1h \
--output "${BASELINE_FILE}"
BASELINE_SUCCESS_RATE=$(jq -r '.metrics.success_rate_percent' "${BASELINE_FILE}")
BASELINE_LATENCY=$(jq -r '.metrics.p95_latency_ms' "${BASELINE_FILE}")
echo "Baseline metrics:"
echo " Success rate: ${BASELINE_SUCCESS_RATE}%"
echo " P95 latency: ${BASELINE_LATENCY}ms"
# Apply packet loss
echo "🔧 Applying packet loss..."
kubectl apply -f - <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: packet-loss-${SERVICE}-${PACKET_LOSS}pct
namespace: chaos-testing
spec:
action: loss
mode: fixed-percent
value: "100"
selector:
namespaces:
- ${NAMESPACE}
labelSelectors:
app: ${SERVICE}
direction: both
loss:
loss: "${PACKET_LOSS}%"
correlation: "25"
duration: "${DURATION}"
EOF
PACKET_LOSS_START=$(date +%s)
# Monitor service behavior
echo "👀 Monitoring service behavior during packet loss..."
MAX_WAIT=600 # 10 minutes
ELAPSED=0
while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
# Get request success rate
SUCCESS_RATE=$(curl -sg http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SERVICE}\",status!~\"5..\"\}[1m]\)/rate\(http_requests_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
SUCCESS_RATE_PERCENT=$(echo "${SUCCESS_RATE} * 100" | bc)
# Get TCP retransmissions
TCP_RETRANSMISSIONS=$(curl -sg http://prometheus:9090/api/v1/query?query=rate\(tcp_retransmissions\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
# Get timeout errors
TIMEOUT_ERRORS=$(curl -sg http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SERVICE}\",status=\"504\"\}[1m]\) | jq -r '.data.result[0].value[1]')
# Get circuit breaker state
CB_STATE=$(curl -sg http://prometheus:9090/api/v1/query?query=circuit_breaker_state\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
# Get latency
P95_LATENCY=$(curl -sg http://prometheus:9090/api/v1/query?query=histogram_quantile\(0.95,rate\(http_request_duration_seconds_bucket\{service=\"${SERVICE}\"\}[1m]\)\) | jq -r '.data.result[0].value[1]')
P95_LATENCY_MS=$(echo "${P95_LATENCY} * 1000" | bc)
echo "Metrics at ${ELAPSED}s:"
echo " Success rate: ${SUCCESS_RATE_PERCENT}%"
echo " TCP retransmissions: ${TCP_RETRANSMISSIONS}/sec"
echo " Timeout errors: ${TIMEOUT_ERRORS}/sec"
echo " Circuit breaker state: ${CB_STATE}"
echo " P95 latency: ${P95_LATENCY_MS}ms (baseline: ${BASELINE_LATENCY}ms)"
# Validate TCP retransmission for low packet loss
if (( $(echo "${PACKET_LOSS} <= 10" | bc -l) )); then
if (( $(echo "${TCP_RETRANSMISSIONS} > 0" | bc -l) )); then
echo "✅ TCP retransmission handling packet loss: ${TCP_RETRANSMISSIONS}/sec"
fi
fi
# Validate application timeouts for high packet loss
if (( $(echo "${PACKET_LOSS} > 10" | bc -l) )); then
if (( $(echo "${TIMEOUT_ERRORS} > 0" | bc -l) )); then
echo "✅ Application timeouts handling packet loss: ${TIMEOUT_ERRORS}/sec"
fi
fi
# Validate circuit breaker activation
if (( $(echo "${PACKET_LOSS} >= 25" | bc -l) )); then
if [ "${CB_STATE}" = "Open" ]; then
echo "✅ Circuit breaker activated: ${CB_STATE}"
fi
fi
sleep 30
ELAPSED=$((ELAPSED + 30))
done
# Remove packet loss
echo "🔧 Removing packet loss..."
kubectl delete networkchaos packet-loss-${SERVICE}-${PACKET_LOSS}pct -n chaos-testing
RECOVERY_START=$(date +%s)
# Wait for recovery
echo "⏳ Waiting for network to recover..."
sleep 120
# Verify recovery
FINAL_SUCCESS_RATE=$(curl -sg http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SERVICE}\",status!~\"5..\"\}[1m]\)/rate\(http_requests_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
FINAL_SUCCESS_RATE_PERCENT=$(echo "${FINAL_SUCCESS_RATE} * 100" | bc)
FINAL_CB_STATE=$(curl -sg http://prometheus:9090/api/v1/query?query=circuit_breaker_state\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
if (( $(echo "${FINAL_SUCCESS_RATE_PERCENT} >= ${BASELINE_SUCCESS_RATE} * 0.95" | bc -l) )); then
echo "✅ Success rate recovered: ${FINAL_SUCCESS_RATE_PERCENT}% (baseline: ${BASELINE_SUCCESS_RATE}%)"
if [ "${FINAL_CB_STATE}" = "Closed" ]; then
echo "✅ Circuit breaker closed (service recovered)"
exit 0
else
echo "⚠️ Circuit breaker still open: ${FINAL_CB_STATE}"
exit 1
fi
else
echo "⚠️ Success rate not fully recovered: ${FINAL_SUCCESS_RATE_PERCENT}% (baseline: ${BASELINE_SUCCESS_RATE}%)"
exit 1
fi
Gradual Packet Loss Increase:
# chaos-experiments/gradual-packet-loss.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: gradual-packet-loss
namespace: chaos-testing
labels:
category: infrastructure
service: network
severity: medium
annotations:
chaos.atp.connectsoft.io/hypothesis: |
Gradual packet loss increase will validate service resilience to progressive network degradation.
spec:
action: loss
mode: fixed-percent
value: "100"
selector:
namespaces:
- atp-ingest-ns
labelSelectors:
app: atp-ingest-api
direction: both
loss:
loss: "25%"
correlation: "25"
scheduler:
cron: "@every 2m"
duration: "10m"
Expected Behavior
Packet Loss Phase (0-10 minutes):
- Packet loss injection: Network packet loss increases to specified percentage
- TCP retransmission: TCP retransmits lost packets (low packet loss)
- Application timeouts: Application timeouts occur (high packet loss)
- Circuit breaker activation: Circuit breaker opens if packet loss exceeds threshold
- Latency increase: Latency increases due to retransmissions
Recovery Phase (10-15 minutes):
- Packet loss removal: Packet loss removed
- TCP normalization: TCP retransmissions normalize
- Circuit breaker recovery: Circuit breaker transitions to half-open, then closed
- Normal operation: Service returns to normal operation
Expected Metrics
| Metric | Baseline | 5% Loss | 10% Loss | 25% Loss | Recovery Target |
|---|---|---|---|---|---|
| Request Success Rate | 99.95% | >99% | >95% | >90% | 99.95% |
| TCP Retransmissions | 0/sec | >0/sec | >0/sec | >0/sec | 0/sec |
| Timeout Errors | 0/sec | <1/sec | <5/sec | <10/sec | 0/sec |
| Circuit Breaker State | Closed | Closed | HalfOpen | Open | Closed |
| P95 Latency | 250ms | <350ms | <500ms | <1,000ms | 250ms |
Validation Criteria
Success Criteria:
- ✅ TCP retransmission handles low packet loss (≤10%)
- ✅ Application timeouts handle high packet loss (>10%)
- ✅ Circuit breaker activates for high packet loss (≥25%)
- ✅ Request success rate acceptable for packet loss level
- ✅ Service recovers automatically when packet loss removed
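The Closed → Open → HalfOpen transitions in the table above can be sketched as a small state machine; the failure threshold and trial-request policy are illustrative, not ATP's production circuit-breaker settings:

```shell
#!/bin/bash
# Minimal circuit-breaker state machine matching the states in the table above.
# FAILURE_THRESHOLD is an illustrative value, not ATP's configured setting.
FAILURE_THRESHOLD=5
state="Closed"
failures=0

cb_record() {   # $1 = "success" or "failure"
  case "${state}" in
    Closed)
      if [ "$1" = "failure" ]; then
        failures=$((failures + 1))
        if [ "${failures}" -ge "${FAILURE_THRESHOLD}" ]; then state="Open"; fi
      else
        failures=0
      fi
      ;;
    Open)
      : # requests are rejected while open; outcomes are not recorded
      ;;
    HalfOpen)
      if [ "$1" = "success" ]; then state="Closed"; failures=0; else state="Open"; fi
      ;;
  esac
}

cb_cooldown_elapsed() {   # invoked by a timer after the open interval
  if [ "${state}" = "Open" ]; then state="HalfOpen"; fi
}
```

Under sustained 25% packet loss, repeated failures drive the breaker to Open; once the loss is removed, the cooldown timer moves it to HalfOpen and the first successful trial request closes it, matching the recovery column.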
DNS Failure¶
DNS failure experiments validate that ATP services handle DNS resolution failures gracefully through DNS caching, retry with backoff, and fallback to IP addresses.
Hypothesis
"When DNS resolution fails, DNS caching will maintain service availability, retry with backoff will handle transient DNS failures, fallback to IP addresses will ensure service connectivity, and services will recover automatically when DNS is restored."
Experiment Configuration
DNS Failure Simulation:
# chaos-experiments/dns-failure.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: dns-failure
namespace: chaos-testing
labels:
category: infrastructure
service: dns
severity: high
frequency: monthly
annotations:
chaos.atp.connectsoft.io/hypothesis: |
When DNS resolution fails, DNS caching will maintain service availability,
retry with backoff will handle transient DNS failures, fallback to IP addresses
will ensure service connectivity, and services will recover automatically when DNS is restored.
spec:
action: partition
mode: all
selector:
namespaces:
- atp-ingest-ns
labelSelectors:
app: atp-ingest-api
direction: to
# NetworkChaos addresses endpoints outside the cluster via externalTargets,
# which requires direction: to; per-port targeting is not available there
externalTargets:
- "*.dns.windows.net"
duration: "10m"
DNS Failure Simulation Script:
#!/bin/bash
# scripts/execute-dns-failure-experiment.sh
SERVICE="${1:-atp-ingestion-api}"
NAMESPACE="${2:-atp-ingest-ns}"
DOWNSTREAM_SERVICE="${3:-atp-policy-api}"
DURATION="${4:-10m}"
echo "🧪 Starting DNS failure experiment"
echo "Service: ${SERVICE}"
echo "Downstream service: ${DOWNSTREAM_SERVICE}"
echo "Duration: ${DURATION}"
# Get baseline metrics
echo "📊 Collecting baseline metrics..."
BASELINE_FILE="baseline-${SERVICE}-$(date +%Y%m%d-%H%M%S).json"
./scripts/collect-baseline-metrics.sh \
--service ${SERVICE} \
--duration 1h \
--output "${BASELINE_FILE}"
BASELINE_SUCCESS_RATE=$(jq -r '.metrics.success_rate_percent' "${BASELINE_FILE}")
BASELINE_DNS_QUERIES=$(jq -r '.metrics.dns_queries_per_second' "${BASELINE_FILE}")
echo "Baseline metrics:"
echo " Success rate: ${BASELINE_SUCCESS_RATE}%"
echo " DNS queries: ${BASELINE_DNS_QUERIES}/sec"
# Apply DNS failure (partition DNS servers)
echo "🔧 Applying DNS failure..."
kubectl apply -f - <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: dns-failure-${SERVICE}
namespace: chaos-testing
spec:
action: partition
mode: all
selector:
namespaces:
- ${NAMESPACE}
labelSelectors:
app: ${SERVICE}
direction: to
# external endpoints (here the DNS servers) are addressed via externalTargets,
# which requires direction: to
externalTargets:
- "*.dns.windows.net"
duration: "${DURATION}"
EOF
DNS_FAILURE_START=$(date +%s)
# Monitor service behavior
echo "👀 Monitoring service behavior during DNS failure..."
MAX_WAIT=600 # 10 minutes
ELAPSED=0
while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
# Get DNS query failures
DNS_FAILURES=$(curl -sg http://prometheus:9090/api/v1/query?query=rate\(dns_query_failures\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
# Get DNS cache hits
DNS_CACHE_HITS=$(curl -sg http://prometheus:9090/api/v1/query?query=rate\(dns_cache_hits\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
# Get DNS retry attempts
DNS_RETRIES=$(curl -sg http://prometheus:9090/api/v1/query?query=rate\(dns_retry_attempts\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
# Get IP fallback usage
IP_FALLBACK=$(curl -sg http://prometheus:9090/api/v1/query?query=rate\(dns_ip_fallback_usage\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
# Get request success rate
SUCCESS_RATE=$(curl -sg http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SERVICE}\",status!~\"5..\"\}[1m]\)/rate\(http_requests_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
SUCCESS_RATE_PERCENT=$(echo "${SUCCESS_RATE} * 100" | bc)
echo "Metrics at ${ELAPSED}s:"
echo " DNS failures: ${DNS_FAILURES}/sec"
echo " DNS cache hits: ${DNS_CACHE_HITS}/sec"
echo " DNS retries: ${DNS_RETRIES}/sec"
echo " IP fallback usage: ${IP_FALLBACK}/sec"
echo " Success rate: ${SUCCESS_RATE_PERCENT}%"
# Validate DNS caching
if (( $(echo "${DNS_CACHE_HITS} > 0" | bc -l) )); then
echo "✅ DNS caching active: ${DNS_CACHE_HITS}/sec cache hits"
fi
# Validate DNS retry
if (( $(echo "${DNS_RETRIES} > 0" | bc -l) )); then
echo "✅ DNS retry with backoff active: ${DNS_RETRIES}/sec retries"
fi
# Validate IP fallback
if (( $(echo "${IP_FALLBACK} > 0" | bc -l) )); then
echo "✅ IP fallback active: ${IP_FALLBACK}/sec fallback usage"
fi
# Validate service availability
if (( $(echo "${SUCCESS_RATE_PERCENT} >= 95" | bc -l) )); then
echo "✅ Service availability maintained: ${SUCCESS_RATE_PERCENT}%"
fi
sleep 30
ELAPSED=$((ELAPSED + 30))
done
# Remove DNS failure
echo "🔧 Removing DNS failure..."
kubectl delete networkchaos dns-failure-${SERVICE} -n chaos-testing
RECOVERY_START=$(date +%s)
# Wait for recovery
echo "⏳ Waiting for DNS to recover..."
sleep 120
# Verify recovery
FINAL_DNS_FAILURES=$(curl -sg http://prometheus:9090/api/v1/query?query=rate\(dns_query_failures\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
FINAL_SUCCESS_RATE=$(curl -sg http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SERVICE}\",status!~\"5..\"\}[1m]\)/rate\(http_requests_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
FINAL_SUCCESS_RATE_PERCENT=$(echo "${FINAL_SUCCESS_RATE} * 100" | bc)
if (( $(echo "${FINAL_DNS_FAILURES} == 0" | bc -l) )); then
echo "✅ DNS failures resolved"
if (( $(echo "${FINAL_SUCCESS_RATE_PERCENT} >= ${BASELINE_SUCCESS_RATE} * 0.95" | bc -l) )); then
echo "✅ Success rate recovered: ${FINAL_SUCCESS_RATE_PERCENT}% (baseline: ${BASELINE_SUCCESS_RATE}%)"
exit 0
else
echo "⚠️ Success rate not fully recovered: ${FINAL_SUCCESS_RATE_PERCENT}%"
exit 1
fi
else
echo "⚠️ DNS failures still occurring: ${FINAL_DNS_FAILURES}/sec"
exit 1
fi
Expected Behavior
DNS Failure Phase (0-10 minutes):
- DNS unavailability: DNS servers become unreachable
- DNS cache usage: Services resolve previously seen names from the local DNS cache
- DNS retry: Services retry DNS queries with exponential backoff
- IP fallback: Services fall back to configured IP addresses when DNS fails
- Service continuity: Services continue operating with cached or fallback addresses
Recovery Phase (10-15 minutes):
- DNS restoration: DNS servers become available
- DNS cache refresh: DNS cache refreshed with new resolutions
- Normal operation: Service returns to normal operation
Expected Metrics
| Metric | Baseline | During DNS Failure | Expected Range | Recovery Target |
|---|---|---|---|---|
| DNS Query Failures | 0/sec | >0/sec | >0/sec | 0/sec |
| DNS Cache Hits | 50/sec | 100/sec | 100/sec | 50/sec |
| DNS Retries | 0/sec | >0/sec | >0/sec | 0/sec |
| IP Fallback Usage | 0/sec | >0/sec | >0/sec | 0/sec |
| Request Success Rate | 99.95% | >95% | >95% | 99.95% |
Validation Criteria
Success Criteria:
- ✅ DNS caching maintains service availability
- ✅ DNS retry with backoff handles transient failures
- ✅ IP fallback ensures service connectivity
- ✅ Request success rate >95%
- ✅ Service recovers automatically when DNS restored
DNS Configuration
DNS Client Configuration:
# kubernetes/configmaps/dns-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: dns-config
  namespace: atp-ingest-ns
data:
  DnsClientConfig.json: |
    {
      "Cache": {
        "Enabled": true,
        "TTL": 300,
        "MaxCacheSize": 1000
      },
      "RetryPolicy": {
        "MaxRetries": 3,
        "RetryDelay": 1000,
        "ExponentialBackoff": true,
        "MaxBackoff": 10000
      },
      "Fallback": {
        "UseIPFallback": true,
        "IPAddresses": {
          "atp-policy-api": "10.0.1.100",
          "atp-query-api": "10.0.1.101"
        }
      },
      "HealthCheck": {
        "Enabled": true,
        "Interval": 30000,
        "Timeout": 5000
      }
    }
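The cache-plus-fallback behavior configured above can be sketched in shell. This is a hypothetical illustration, not ATP code: the `resolve_with_fallback` helper and the static map are assumptions for the example. It tries normal resolution first and only then consults an IP map mirroring `Fallback.IPAddresses`.

```shell
#!/bin/bash
# Hypothetical sketch of IP fallback: try DNS first, then a static map
# mirroring the Fallback.IPAddresses section of DnsClientConfig.json.
declare -A IP_FALLBACK=(
  [atp-policy-api]="10.0.1.100"
  [atp-query-api]="10.0.1.101"
)

resolve_with_fallback() {
  local host="$1" ip
  # getent consults the normal resolver chain (DNS, /etc/hosts, nscd cache)
  ip=$(getent hosts "${host}" | awk '{print $1; exit}')
  if [ -n "${ip}" ]; then
    echo "${ip}"
    return 0
  fi
  # DNS failed: use the configured fallback address if one exists
  if [ -n "${IP_FALLBACK[${host}]:-}" ]; then
    echo "${IP_FALLBACK[${host}]}"
    return 0
  fi
  return 1
}

resolve_with_fallback atp-policy-api
```

When the name does not resolve, the caller still gets the configured address, which is exactly the continuity the experiment's "IP fallback usage" metric is meant to capture.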
Bandwidth Limitation¶
Bandwidth limitation experiments validate that ATP services handle bandwidth constraints gracefully through large payload handling, streaming vs batching strategies, and compression usage.
Hypothesis
"When network bandwidth is limited to 1Mbps, services will handle large payloads through streaming or batching, compression will reduce bandwidth usage, services will adapt to bandwidth constraints, and services will recover automatically when bandwidth is restored."
Experiment Configuration
Bandwidth Throttling:
# chaos-experiments/bandwidth-limitation.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: bandwidth-limitation
  namespace: chaos-testing
  labels:
    category: infrastructure
    service: network
    severity: medium
    frequency: monthly
  annotations:
    chaos.atp.connectsoft.io/hypothesis: |
      When network bandwidth is limited, services will handle large payloads through streaming or batching,
      compression will reduce bandwidth usage, services will adapt to bandwidth constraints,
      and services will recover automatically when bandwidth is restored.
spec:
  action: bandwidth
  mode: all
  selector:
    namespaces:
      - atp-ingest-ns
    labelSelectors:
      app: atp-ingest-api
  direction: both
  bandwidth:
    rate: "1Mbps"
    limit: 1048576
    buffer: 10485760
  duration: "10m"
Bandwidth Limitation Simulation Script:
#!/bin/bash
# scripts/execute-bandwidth-limitation-experiment.sh
SERVICE="${1:-atp-ingestion-api}"
NAMESPACE="${2:-atp-ingest-ns}"
BANDWIDTH="${3:-1Mbps}" # Bandwidth limit
DURATION="${4:-10m}"
echo "🧪 Starting bandwidth limitation experiment"
echo "Service: ${SERVICE}"
echo "Bandwidth limit: ${BANDWIDTH}"
echo "Duration: ${DURATION}"
# Get baseline metrics
echo "📊 Collecting baseline metrics..."
BASELINE_FILE="baseline-${SERVICE}-$(date +%Y%m%d-%H%M%S).json"
./scripts/collect-baseline-metrics.sh \
--service ${SERVICE} \
--duration 1h \
--output ${BASELINE_FILE}
BASELINE_THROUGHPUT=$(jq -r '.metrics.throughput_mbps' ${BASELINE_FILE})
BASELINE_LATENCY=$(jq -r '.metrics.p95_latency_ms' ${BASELINE_FILE})
echo "Baseline metrics:"
echo " Throughput: ${BASELINE_THROUGHPUT} Mbps"
echo " P95 latency: ${BASELINE_LATENCY}ms"
# Apply bandwidth limitation
echo "🔧 Applying bandwidth limitation..."
kubectl apply -f - <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: bandwidth-limitation-${SERVICE}
  namespace: chaos-testing
spec:
  action: bandwidth
  mode: all
  selector:
    namespaces:
      - ${NAMESPACE}
    labelSelectors:
      app: ${SERVICE}
  direction: both
  bandwidth:
    rate: "${BANDWIDTH}"
    limit: 1048576
    buffer: 10485760
  duration: "${DURATION}"
EOF
BANDWIDTH_LIMIT_START=$(date +%s)
# Monitor service behavior
echo "👀 Monitoring service behavior during bandwidth limitation..."
MAX_WAIT=600 # 10 minutes
ELAPSED=0
while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
# Get network throughput
THROUGHPUT=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(network_bytes_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
THROUGHPUT_MBPS=$(echo "scale=2; ${THROUGHPUT} * 8 / 1024 / 1024" | bc)
# Get compression ratio
COMPRESSION_RATIO=$(curl -s http://prometheus:9090/api/v1/query?query=compression_ratio\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].value[1]')
# Get streaming vs batching usage
STREAMING_USAGE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(streaming_operations\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
BATCHING_USAGE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(batching_operations\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
# Get large payload handling
LARGE_PAYLOAD_REQUESTS=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SERVICE}\",payload_size=\"large\"\}[1m]\) | jq -r '.data.result[0].value[1]')
# Get latency
P95_LATENCY=$(curl -s http://prometheus:9090/api/v1/query?query=histogram_quantile\(0.95,rate\(http_request_duration_seconds_bucket\{service=\"${SERVICE}\"\}[1m]\)\) | jq -r '.data.result[0].value[1]')
P95_LATENCY_MS=$(echo "${P95_LATENCY} * 1000" | bc)
# Get request success rate
SUCCESS_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SERVICE}\",status!~\"5..\"\}[1m]\)/rate\(http_requests_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
SUCCESS_RATE_PERCENT=$(echo "${SUCCESS_RATE} * 100" | bc)
echo "Metrics at ${ELAPSED}s:"
echo " Throughput: ${THROUGHPUT_MBPS} Mbps (limit: ${BANDWIDTH})"
echo " Compression ratio: ${COMPRESSION_RATIO}"
echo " Streaming operations: ${STREAMING_USAGE}/sec"
echo " Batching operations: ${BATCHING_USAGE}/sec"
echo " Large payload requests: ${LARGE_PAYLOAD_REQUESTS}/sec"
echo " P95 latency: ${P95_LATENCY_MS}ms (baseline: ${BASELINE_LATENCY}ms)"
echo " Success rate: ${SUCCESS_RATE_PERCENT}%"
# Validate compression usage
if (( $(echo "${COMPRESSION_RATIO} > 1.5" | bc -l) )); then
echo "✅ Compression reducing bandwidth usage: ${COMPRESSION_RATIO}x"
fi
# Validate streaming vs batching
if (( $(echo "${STREAMING_USAGE} > 0" | bc -l) )); then
echo "✅ Streaming strategy used for large payloads: ${STREAMING_USAGE}/sec"
fi
if (( $(echo "${BATCHING_USAGE} > 0" | bc -l) )); then
echo "✅ Batching strategy used for large payloads: ${BATCHING_USAGE}/sec"
fi
# Validate bandwidth adaptation
if (( $(echo "${THROUGHPUT_MBPS} <= 1.1" | bc -l) )); then
echo "✅ Throughput within bandwidth limit: ${THROUGHPUT_MBPS} Mbps"
fi
sleep 30
ELAPSED=$((ELAPSED + 30))
done
# Remove bandwidth limitation
echo "🔧 Removing bandwidth limitation..."
kubectl delete networkchaos bandwidth-limitation-${SERVICE} -n chaos-testing
RECOVERY_START=$(date +%s)
# Wait for recovery
echo "⏳ Waiting for bandwidth to recover..."
sleep 120
# Verify recovery
FINAL_THROUGHPUT=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(network_bytes_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
FINAL_THROUGHPUT_MBPS=$(echo "scale=2; ${FINAL_THROUGHPUT} * 8 / 1024 / 1024" | bc)
FINAL_LATENCY=$(curl -s http://prometheus:9090/api/v1/query?query=histogram_quantile\(0.95,rate\(http_request_duration_seconds_bucket\{service=\"${SERVICE}\"\}[1m]\)\) | jq -r '.data.result[0].value[1]')
FINAL_LATENCY_MS=$(echo "${FINAL_LATENCY} * 1000" | bc)
if (( $(echo "${FINAL_THROUGHPUT_MBPS} >= ${BASELINE_THROUGHPUT} * 0.9" | bc -l) )); then
echo "✅ Throughput recovered: ${FINAL_THROUGHPUT_MBPS} Mbps (baseline: ${BASELINE_THROUGHPUT} Mbps)"
if (( $(echo "${FINAL_LATENCY_MS} <= ${BASELINE_LATENCY} * 1.1" | bc -l) )); then
echo "✅ Latency recovered: ${FINAL_LATENCY_MS}ms (baseline: ${BASELINE_LATENCY}ms)"
exit 0
else
echo "⚠️ Latency not fully recovered: ${FINAL_LATENCY_MS}ms"
exit 1
fi
else
echo "⚠️ Throughput not fully recovered: ${FINAL_THROUGHPUT_MBPS} Mbps"
exit 1
fi
Expected Behavior
Bandwidth Limitation Phase (0-10 minutes):
- Bandwidth throttling: Network bandwidth limited to 1Mbps
- Compression activation: Compression reduces payload size
- Streaming/batching: Large payloads handled via streaming or batching
- Latency increase: Latency increases due to bandwidth constraints
- Service adaptation: Service adapts to bandwidth constraints
Recovery Phase (10-15 minutes):
- Bandwidth restoration: Bandwidth limitation removed
- Throughput normalization: Network throughput returns to normal
- Latency normalization: Latency returns to baseline
- Normal operation: Service returns to normal operation
Expected Metrics
| Metric | Baseline | During Limitation | Expected Range | Recovery Target |
|---|---|---|---|---|
| Network Throughput | 100 Mbps | ≤1 Mbps | ≤1 Mbps | 100 Mbps |
| Compression Ratio | 1.0x | >1.5x | >1.5x | 1.0x |
| Streaming Usage | 0/sec | >0/sec | >0/sec | 0/sec |
| Batching Usage | 0/sec | >0/sec | >0/sec | 0/sec |
| P95 Latency | 250ms | <2,000ms | <2,000ms | 250ms |
| Request Success Rate | 99.95% | >95% | >95% | 99.95% |
Validation Criteria
Success Criteria:
- ✅ Compression reduces bandwidth usage
- ✅ Streaming or batching handles large payloads
- ✅ Throughput within bandwidth limit
- ✅ Request success rate >95%
- ✅ Service recovers automatically when bandwidth restored
Bandwidth Adaptation Configuration
Network Bandwidth Configuration:
# kubernetes/configmaps/network-bandwidth-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: network-bandwidth-config
  namespace: atp-ingest-ns
data:
  NetworkBandwidthConfig.json: |
    {
      "Compression": {
        "Enabled": true,
        "Algorithm": "gzip",
        "MinSize": 1024,
        "CompressionLevel": 6
      },
      "LargePayloadHandling": {
        "StreamingThreshold": 1048576,
        "BatchingThreshold": 524288,
        "ChunkSize": 65536,
        "Strategy": "adaptive"
      },
      "BandwidthMonitoring": {
        "Enabled": true,
        "Threshold": 1048576,
        "AdaptiveCompression": true
      }
    }
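As a rough illustration of the compression ratios the experiment monitors, the sketch below compresses a repetitive 1 MiB payload with gzip at the configured level (6) and computes the ratio. The numbers are illustrative, not ATP measurements; real audit payloads will compress less well than this worst-case-friendly input.

```shell
#!/bin/bash
# Hypothetical sketch: compute the compression ratio that the
# compression_ratio metric tracks, using gzip level 6 as configured above.
PAYLOAD=$(head -c 1048576 /dev/zero | tr '\0' 'a')  # 1 MiB repetitive payload
RAW_BYTES=${#PAYLOAD}
COMPRESSED_BYTES=$(printf '%s' "${PAYLOAD}" | gzip -6 | wc -c)
RATIO=$(echo "scale=2; ${RAW_BYTES} / ${COMPRESSED_BYTES}" | bc)
# A ratio above the 1.5x success criterion means compression is paying off
echo "raw: ${RAW_BYTES} bytes, compressed: ${COMPRESSED_BYTES} bytes, ratio: ${RATIO}x"
```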
Network Chaos Visualization
graph TD
NETWORK[Network Layer] --> PARTITION[Network Partition]
NETWORK --> PACKETLOSS[Packet Loss]
NETWORK --> DNS[DNS Failure]
NETWORK --> BANDWIDTH[Bandwidth Limitation]
PARTITION --> DETECT[Partition Detection]
DETECT --> ISOLATE[Service Isolation]
ISOLATE --> CONTINUE1[Continue Operating]
PACKETLOSS --> TCPRETRY[TCP Retransmission]
PACKETLOSS --> APPTIMEOUT[Application Timeout]
PACKETLOSS --> CB[Circuit Breaker]
TCPRETRY --> CONTINUE2[Continue Operating]
APPTIMEOUT --> CONTINUE2
CB --> CONTINUE2
DNS --> CACHE[DNS Cache]
DNS --> RETRY[DNS Retry]
DNS --> IPFALLBACK[IP Fallback]
CACHE --> CONTINUE3[Continue Operating]
RETRY --> CONTINUE3
IPFALLBACK --> CONTINUE3
BANDWIDTH --> COMPRESS[Compression]
BANDWIDTH --> STREAM[Streaming]
BANDWIDTH --> BATCH[Batching]
COMPRESS --> CONTINUE4[Continue Operating]
STREAM --> CONTINUE4
BATCH --> CONTINUE4
style NETWORK fill:#FFE5B4
style PARTITION fill:#FFB6C1
style PACKETLOSS fill:#FFB6C1
style DNS fill:#FFB6C1
style BANDWIDTH fill:#FFB6C1
style CONTINUE1 fill:#90EE90
style CONTINUE2 fill:#90EE90
style CONTINUE3 fill:#90EE90
style CONTINUE4 fill:#90EE90
Summary: Network Chaos¶
- Network Partition: Validates partition detection, service isolation handling, and automatic recovery during network partitions between namespaces; expects partition detected, services continue operating within partition, local availability >90%, and automatic recovery
- Packet Loss Simulation: Validates TCP retransmission, application timeouts, and circuit breaker activation during packet loss (5%, 10%, 25%); expects TCP retransmission handles low packet loss, application timeouts handle high packet loss, circuit breaker activates for high packet loss, and automatic recovery
- DNS Failure: Validates DNS caching, retry with backoff, and IP fallback during DNS resolution failures; expects DNS caching maintains availability, DNS retry handles transient failures, IP fallback ensures connectivity, and automatic recovery
- Bandwidth Limitation: Validates compression usage, streaming vs batching strategies, and large payload handling during bandwidth constraints (1Mbps); expects compression reduces bandwidth usage, streaming/batching handles large payloads, throughput within limit, and automatic recovery
- Monitoring and Validation: Comprehensive scripts for monitoring network partitions, packet loss, DNS failures, bandwidth limitations, TCP retransmissions, DNS cache hits, compression ratios, and recovery behavior
Security Chaos¶
Purpose: Define comprehensive chaos experiments for security infrastructure failures in ATP, validating authentication resilience, authorization fallback, certificate management, and secret management to ensure ATP services remain available and secure during security infrastructure failures.
Authentication Failure¶
Authentication failure experiments validate that ATP services handle Azure AD unavailability gracefully through token caching, graceful degradation, and deny-by-default behavior.
Hypothesis
"When Azure AD becomes unavailable, services will use cached authentication tokens, gracefully degrade to read-only mode if necessary, enforce deny-by-default behavior for unauthenticated requests, and recover automatically when Azure AD is restored."
Experiment Configuration
Azure AD Unavailability Simulation:
# chaos-experiments/azure-ad-unavailability.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: azure-ad-unavailability
  namespace: chaos-testing
  labels:
    category: security
    service: azure-ad
    severity: high
    frequency: monthly
  annotations:
    chaos.atp.connectsoft.io/hypothesis: |
      When Azure AD becomes unavailable, services will use cached authentication tokens,
      gracefully degrade to read-only mode if necessary, enforce deny-by-default behavior
      for unauthenticated requests, and recover automatically when Azure AD is restored.
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - atp-ingest-ns
    labelSelectors:
      app: atp-ingest-api
  # Chaos Mesh addresses targets outside the cluster via externalTargets,
  # which only applies in the outbound ("to") direction
  direction: to
  externalTargets:
    - "*.login.microsoftonline.com"
    - "*.microsoftonline.com"
  duration: "10m"
Authentication Failure Simulation Script:
#!/bin/bash
# scripts/execute-authentication-failure-experiment.sh
SERVICE="${1:-atp-ingestion-api}"
NAMESPACE="${2:-atp-ingest-ns}"
DURATION="${3:-10m}"
echo "🧪 Starting authentication failure experiment"
echo "Service: ${SERVICE}"
echo "Duration: ${DURATION}"
# Get baseline metrics
echo "📊 Collecting baseline metrics..."
BASELINE_FILE="baseline-${SERVICE}-$(date +%Y%m%d-%H%M%S).json"
./scripts/collect-baseline-metrics.sh \
--service ${SERVICE} \
--duration 1h \
--output ${BASELINE_FILE}
BASELINE_AUTH_SUCCESS_RATE=$(jq -r '.metrics.auth_success_rate_percent' ${BASELINE_FILE})
BASELINE_TOKEN_CACHE_HITS=$(jq -r '.metrics.token_cache_hits_per_second' ${BASELINE_FILE})
echo "Baseline metrics:"
echo " Auth success rate: ${BASELINE_AUTH_SUCCESS_RATE}%"
echo " Token cache hits: ${BASELINE_TOKEN_CACHE_HITS}/sec"
# Apply network partition to Azure AD
echo "🔧 Applying network partition to Azure AD..."
kubectl apply -f - <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: azure-ad-unavailability-${SERVICE}
  namespace: chaos-testing
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - ${NAMESPACE}
    labelSelectors:
      app: ${SERVICE}
  # externalTargets only applies in the outbound ("to") direction
  direction: to
  externalTargets:
    - "*.login.microsoftonline.com"
    - "*.microsoftonline.com"
  duration: "${DURATION}"
EOF
AUTH_FAILURE_START=$(date +%s)
# Monitor service behavior
echo "👀 Monitoring service behavior during authentication failure..."
MAX_WAIT=600 # 10 minutes
ELAPSED=0
while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
# Get authentication success rate
AUTH_SUCCESS_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(auth_requests_total\{service=\"${SERVICE}\",status=\"success\"\}[1m]\)/rate\(auth_requests_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
AUTH_SUCCESS_RATE_PERCENT=$(echo "${AUTH_SUCCESS_RATE} * 100" | bc)
# Get token cache hits
TOKEN_CACHE_HITS=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(token_cache_hits\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
# Get authentication failures
AUTH_FAILURES=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(auth_requests_total\{service=\"${SERVICE}\",status=\"failure\"\}[1m]\) | jq -r '.data.result[0].value[1]')
# Get denied requests (unauthenticated)
DENIED_REQUESTS=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SERVICE}\",status=\"401\"\}[1m]\) | jq -r '.data.result[0].value[1]')
# Get service mode (normal vs degraded); assumes service_mode is exported
# as an info-style metric whose "mode" label carries the value, since a
# numeric gauge sample cannot be compared against "degraded"/"readonly"
SERVICE_MODE=$(curl -s http://prometheus:9090/api/v1/query?query=service_mode\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].metric.mode // "unknown"')
# Get request success rate
SUCCESS_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SERVICE}\",status!~\"5..\"\}[1m]\)/rate\(http_requests_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
SUCCESS_RATE_PERCENT=$(echo "${SUCCESS_RATE} * 100" | bc)
echo "Metrics at ${ELAPSED}s:"
echo " Auth success rate: ${AUTH_SUCCESS_RATE_PERCENT}%"
echo " Token cache hits: ${TOKEN_CACHE_HITS}/sec"
echo " Auth failures: ${AUTH_FAILURES}/sec"
echo " Denied requests (401): ${DENIED_REQUESTS}/sec"
echo " Service mode: ${SERVICE_MODE}"
echo " Request success rate: ${SUCCESS_RATE_PERCENT}%"
# Validate token cache usage
if (( $(echo "${TOKEN_CACHE_HITS} > 0" | bc -l) )); then
echo "✅ Token cache active: ${TOKEN_CACHE_HITS}/sec cache hits"
fi
# Validate deny-by-default behavior
if (( $(echo "${DENIED_REQUESTS} > 0" | bc -l) )); then
echo "✅ Deny-by-default enforced: ${DENIED_REQUESTS}/sec requests denied"
fi
# Validate graceful degradation
if [ "${SERVICE_MODE}" = "degraded" ] || [ "${SERVICE_MODE}" = "readonly" ]; then
echo "✅ Service operating in degraded mode: ${SERVICE_MODE}"
fi
sleep 30
ELAPSED=$((ELAPSED + 30))
done
# Remove network partition
echo "🔧 Removing network partition..."
kubectl delete networkchaos azure-ad-unavailability-${SERVICE} -n chaos-testing
RECOVERY_START=$(date +%s)
# Wait for recovery
echo "⏳ Waiting for Azure AD to recover..."
sleep 120
# Verify recovery
FINAL_AUTH_SUCCESS_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(auth_requests_total\{service=\"${SERVICE}\",status=\"success\"\}[1m]\)/rate\(auth_requests_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
FINAL_AUTH_SUCCESS_RATE_PERCENT=$(echo "${FINAL_AUTH_SUCCESS_RATE} * 100" | bc)
FINAL_SERVICE_MODE=$(curl -s http://prometheus:9090/api/v1/query?query=service_mode\{service=\"${SERVICE}\"\} | jq -r '.data.result[0].metric.mode // "unknown"')
if (( $(echo "${FINAL_AUTH_SUCCESS_RATE_PERCENT} >= ${BASELINE_AUTH_SUCCESS_RATE} * 0.95" | bc -l) )); then
echo "✅ Auth success rate recovered: ${FINAL_AUTH_SUCCESS_RATE_PERCENT}% (baseline: ${BASELINE_AUTH_SUCCESS_RATE}%)"
if [ "${FINAL_SERVICE_MODE}" = "normal" ]; then
echo "✅ Service mode normalized: ${FINAL_SERVICE_MODE}"
exit 0
else
echo "⚠️ Service mode not normalized: ${FINAL_SERVICE_MODE}"
exit 1
fi
else
echo "⚠️ Auth success rate not fully recovered: ${FINAL_AUTH_SUCCESS_RATE_PERCENT}%"
exit 1
fi
Expected Behavior
Authentication Failure Phase (0-10 minutes):
- Azure AD unavailability: Azure AD authentication endpoints unreachable
- Token cache usage: Services use cached authentication tokens
- Deny-by-default: Unauthenticated requests denied (401)
- Graceful degradation: Service may operate in read-only mode
- Service continuity: Authenticated requests continue using cached tokens
Recovery Phase (10-15 minutes):
- Azure AD restoration: Azure AD endpoints restored
- Token refresh: New tokens obtained from Azure AD
- Token cache refresh: Token cache refreshed
- Normal operation: Service returns to normal operation
Expected Metrics
| Metric | Baseline | During Failure | Expected Range | Recovery Target |
|---|---|---|---|---|
| Auth Success Rate | 99.95% | >80% | >80% | 99.95% |
| Token Cache Hits | 50/sec | 100/sec | 100/sec | 50/sec |
| Auth Failures | 0.05% | <20% | <20% | 0.05% |
| Denied Requests (401) | 0/sec | >0/sec | >0/sec | 0/sec |
| Service Mode | Normal | Degraded/ReadOnly | Degraded/ReadOnly | Normal |
Validation Criteria
Success Criteria:
- ✅ Token cache maintains authentication for existing sessions
- ✅ Deny-by-default enforced for unauthenticated requests
- ✅ Graceful degradation to read-only mode if necessary
- ✅ Auth success rate >80% (using cached tokens)
- ✅ Service recovers automatically when Azure AD restored
Authentication Configuration
Token Cache Configuration:
# kubernetes/configmaps/authentication-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: authentication-config
  namespace: atp-ingest-ns
data:
  AuthenticationConfig.json: |
    {
      "TokenCache": {
        "Enabled": true,
        "TTL": 3600,
        "MaxCacheSize": 10000,
        "RefreshThreshold": 300
      },
      "AzureAD": {
        "Authority": "https://login.microsoftonline.com/{tenant-id}",
        "ClientId": "{client-id}",
        "RetryPolicy": {
          "MaxRetries": 3,
          "RetryDelay": 1000,
          "ExponentialBackoff": true
        }
      },
      "FallbackBehavior": {
        "DenyByDefault": true,
        "DegradedMode": "readonly",
        "AllowCachedTokens": true
      }
    }
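A cached token is only reusable while its `exp` claim is comfortably in the future. The helper below is a hypothetical sketch of that check, applying the `RefreshThreshold` (300 s) from the configuration above. Decoding the payload with `base64` and extracting `exp` with `sed` is an illustrative shortcut, not full JWT validation (no signature check, no base64url handling).

```shell
#!/bin/bash
# Hypothetical sketch: decide whether a cached JWT can still be used.
# A token is "usable" if its exp claim is more than RefreshThreshold
# seconds away; otherwise a refresh against Azure AD is needed.
token_usable() {
  local jwt="$1" threshold="${2:-300}"
  local payload exp now
  payload=$(echo "${jwt}" | cut -d. -f2)
  # Restore base64 padding that JWT encoding strips
  while [ $(( ${#payload} % 4 )) -ne 0 ]; do payload="${payload}="; done
  exp=$(echo "${payload}" | base64 -d 2>/dev/null | sed -n 's/.*"exp":\([0-9]*\).*/\1/p')
  now=$(date +%s)
  [ -n "${exp}" ] && [ $(( exp - now )) -gt "${threshold}" ]
}
```

During the drill, tokens that pass this check keep flowing from the cache; tokens inside the refresh window fail it and would normally trigger a refresh, which is exactly when the deny-by-default and degraded-mode behavior takes over.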
Authorization Denial¶
Authorization denial experiments validate that ATP services handle OPA (Open Policy Agent) unavailability gracefully through cached policies, safe-fail behavior, and deny-when-uncertain enforcement.
Hypothesis
"When OPA policy engine becomes unavailable, services will use cached authorization policies, enforce safe-fail behavior (deny when uncertain), maintain service availability for authorized requests, and recover automatically when OPA is restored."
Experiment Configuration
OPA Unavailability Simulation:
# chaos-experiments/opa-unavailability.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: opa-unavailability
  namespace: chaos-testing
  labels:
    category: security
    service: opa
    severity: high
    frequency: monthly
  annotations:
    chaos.atp.connectsoft.io/hypothesis: |
      When OPA policy engine becomes unavailable, services will use cached authorization policies,
      enforce safe-fail behavior (deny when uncertain), maintain service availability for authorized requests,
      and recover automatically when OPA is restored.
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - atp-ingest-ns
    labelSelectors:
      app: atp-ingest-api
  direction: both
  target:
    mode: all
    selector:
      namespaces:
        - atp-policy-ns
      labelSelectors:
        app: opa
  duration: "10m"
Authorization Denial Simulation Script:
#!/bin/bash
# scripts/execute-authorization-denial-experiment.sh
SERVICE="${1:-atp-ingestion-api}"
NAMESPACE="${2:-atp-ingest-ns}"
OPA_NAMESPACE="${3:-atp-policy-ns}"
DURATION="${4:-10m}"
echo "🧪 Starting authorization denial experiment"
echo "Service: ${SERVICE}"
echo "OPA namespace: ${OPA_NAMESPACE}"
echo "Duration: ${DURATION}"
# Get baseline metrics
echo "📊 Collecting baseline metrics..."
BASELINE_FILE="baseline-${SERVICE}-$(date +%Y%m%d-%H%M%S).json"
./scripts/collect-baseline-metrics.sh \
--service ${SERVICE} \
--duration 1h \
--output ${BASELINE_FILE}
BASELINE_AUTHZ_SUCCESS_RATE=$(jq -r '.metrics.authz_success_rate_percent' ${BASELINE_FILE})
BASELINE_POLICY_CACHE_HITS=$(jq -r '.metrics.policy_cache_hits_per_second' ${BASELINE_FILE})
echo "Baseline metrics:"
echo " Authz success rate: ${BASELINE_AUTHZ_SUCCESS_RATE}%"
echo " Policy cache hits: ${BASELINE_POLICY_CACHE_HITS}/sec"
# Apply network partition to OPA
echo "🔧 Applying network partition to OPA..."
kubectl apply -f - <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: opa-unavailability-${SERVICE}
  namespace: chaos-testing
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - ${NAMESPACE}
    labelSelectors:
      app: ${SERVICE}
  direction: both
  target:
    mode: all
    selector:
      namespaces:
        - ${OPA_NAMESPACE}
      labelSelectors:
        app: opa
  duration: "${DURATION}"
EOF
AUTHZ_FAILURE_START=$(date +%s)
# Monitor service behavior
echo "👀 Monitoring service behavior during authorization denial..."
MAX_WAIT=600 # 10 minutes
ELAPSED=0
while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
# Get authorization success rate
AUTHZ_SUCCESS_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(authz_requests_total\{service=\"${SERVICE}\",status=\"allow\"\}[1m]\)/rate\(authz_requests_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
AUTHZ_SUCCESS_RATE_PERCENT=$(echo "${AUTHZ_SUCCESS_RATE} * 100" | bc)
# Get policy cache hits
POLICY_CACHE_HITS=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(policy_cache_hits\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
# Get authorization denials
AUTHZ_DENIALS=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(authz_requests_total\{service=\"${SERVICE}\",status=\"deny\"\}[1m]\) | jq -r '.data.result[0].value[1]')
# Get denied requests (403)
DENIED_REQUESTS=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SERVICE}\",status=\"403\"\}[1m]\) | jq -r '.data.result[0].value[1]')
# Get safe-fail behavior
SAFE_FAIL_COUNT=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(authz_safe_fail\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
# Get request success rate
SUCCESS_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(http_requests_total\{service=\"${SERVICE}\",status!~\"5..\"\}[1m]\)/rate\(http_requests_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
SUCCESS_RATE_PERCENT=$(echo "${SUCCESS_RATE} * 100" | bc)
echo "Metrics at ${ELAPSED}s:"
echo " Authz success rate: ${AUTHZ_SUCCESS_RATE_PERCENT}%"
echo " Policy cache hits: ${POLICY_CACHE_HITS}/sec"
echo " Authz denials: ${AUTHZ_DENIALS}/sec"
echo " Denied requests (403): ${DENIED_REQUESTS}/sec"
echo " Safe-fail count: ${SAFE_FAIL_COUNT}/sec"
echo " Request success rate: ${SUCCESS_RATE_PERCENT}%"
# Validate policy cache usage
if (( $(echo "${POLICY_CACHE_HITS} > 0" | bc -l) )); then
echo "✅ Policy cache active: ${POLICY_CACHE_HITS}/sec cache hits"
fi
# Validate safe-fail behavior
if (( $(echo "${SAFE_FAIL_COUNT} > 0" | bc -l) )); then
echo "✅ Safe-fail behavior enforced: ${SAFE_FAIL_COUNT}/sec safe-fail denials"
fi
# Validate deny-when-uncertain
if (( $(echo "${AUTHZ_DENIALS} > 0" | bc -l) )); then
echo "✅ Authorization denials when uncertain: ${AUTHZ_DENIALS}/sec"
fi
sleep 30
ELAPSED=$((ELAPSED + 30))
done
# Remove network partition
echo "🔧 Removing network partition..."
kubectl delete networkchaos opa-unavailability-${SERVICE} -n chaos-testing
RECOVERY_START=$(date +%s)
# Wait for recovery
echo "⏳ Waiting for OPA to recover..."
sleep 120
# Verify recovery
FINAL_AUTHZ_SUCCESS_RATE=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(authz_requests_total\{service=\"${SERVICE}\",status=\"allow\"\}[1m]\)/rate\(authz_requests_total\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
FINAL_AUTHZ_SUCCESS_RATE_PERCENT=$(echo "${FINAL_AUTHZ_SUCCESS_RATE} * 100" | bc)
FINAL_SAFE_FAIL=$(curl -s http://prometheus:9090/api/v1/query?query=rate\(authz_safe_fail\{service=\"${SERVICE}\"\}[1m]\) | jq -r '.data.result[0].value[1]')
if (( $(echo "${FINAL_AUTHZ_SUCCESS_RATE_PERCENT} >= ${BASELINE_AUTHZ_SUCCESS_RATE} * 0.95" | bc -l) )); then
echo "✅ Authz success rate recovered: ${FINAL_AUTHZ_SUCCESS_RATE_PERCENT}% (baseline: ${BASELINE_AUTHZ_SUCCESS_RATE}%)"
if (( $(echo "${FINAL_SAFE_FAIL} == 0" | bc -l) )); then
echo "✅ Safe-fail behavior normalized (OPA recovered)"
exit 0
else
echo "⚠️ Safe-fail still active: ${FINAL_SAFE_FAIL}/sec"
exit 1
fi
else
echo "⚠️ Authz success rate not fully recovered: ${FINAL_AUTHZ_SUCCESS_RATE_PERCENT}%"
exit 1
fi
Expected Behavior
Authorization Failure Phase (0-10 minutes):
- OPA unavailability: OPA policy engine unreachable
- Policy cache usage: Services use cached authorization policies
- Safe-fail enforcement: Requests denied when authorization uncertain
- Service continuity: Authorized requests continue using cached policies
Recovery Phase (10-15 minutes):
- OPA restoration: OPA policy engine restored
- Policy refresh: New policies obtained from OPA
- Policy cache refresh: Policy cache refreshed
- Normal operation: Service returns to normal operation
Expected Metrics
| Metric | Baseline | During Failure | Expected Range | Recovery Target |
|---|---|---|---|---|
| Authz Success Rate | 99.95% | >80% | >80% | 99.95% |
| Policy Cache Hits | 50/sec | 100/sec | 100/sec | 50/sec |
| Authz Denials | 0.05% | <20% | <20% | 0.05% |
| Denied Requests (403) | 0/sec | >0/sec | >0/sec | 0/sec |
| Safe-Fail Count | 0/sec | >0/sec | >0/sec | 0/sec |
Validation Criteria
Success Criteria:
- ✅ Policy cache maintains authorization for cached policies
- ✅ Safe-fail behavior enforced (deny when uncertain)
- ✅ Authz success rate >80% (using cached policies)
- ✅ Service recovers automatically when OPA restored
Authorization Configuration
OPA Policy Cache Configuration:
# kubernetes/configmaps/authorization-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: authorization-config
  namespace: atp-ingest-ns
data:
  AuthorizationConfig.json: |
    {
      "PolicyCache": {
        "Enabled": true,
        "TTL": 3600,
        "MaxCacheSize": 10000,
        "RefreshThreshold": 300
      },
      "OPA": {
        "Endpoint": "http://opa.atp-policy-ns.svc.cluster.local:8181",
        "RetryPolicy": {
          "MaxRetries": 3,
          "RetryDelay": 1000,
          "ExponentialBackoff": true
        }
      },
      "SafeFail": {
        "Enabled": true,
        "DenyWhenUncertain": true,
        "AllowCachedPolicies": true
      }
    }
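The `SafeFail` settings above reduce to a simple rule at the call site: only an explicit allow from OPA (or a valid cached decision) permits the request; a timeout, connection error, or malformed response is a deny. A hypothetical sketch, assuming OPA's standard Data API path (`/v1/data/...`) and an illustrative `atp/allow` policy:

```shell
#!/bin/bash
# Hypothetical sketch of deny-when-uncertain: any failure to obtain an
# explicit "true" decision from OPA results in a deny.
authorize() {
  local opa_url="$1" input_json="$2" decision
  decision=$(curl -s --max-time 2 -X POST "${opa_url}/v1/data/atp/allow" \
    -H 'Content-Type: application/json' -d "${input_json}" 2>/dev/null \
    | sed -n 's/.*"result":\(true\|false\).*/\1/p')
  # Safe-fail: only an explicit true is an allow; empty or garbled => deny
  [ "${decision}" = "true" ]
}

# An unreachable OPA yields a deny (non-zero exit), never a fail-open allow
authorize "http://127.0.0.1:9" '{"input":{"action":"read"}}' || echo "denied"
```

This is the behavior the experiment's "safe-fail count" metric should surface: the partition makes OPA unreachable, and the deny path, not an error path, absorbs the uncertainty.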
Certificate Expiration¶
Certificate expiration experiments validate that ATP services handle TLS certificate expiration gracefully through cert-manager renewal, mTLS failure handling, and automatic certificate rotation.
Hypothesis
"When TLS certificates expire, cert-manager will automatically renew certificates, mTLS connections will handle certificate failures gracefully, services will continue operating with renewed certificates, and services will recover automatically when certificates are renewed."
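Before and after the drill it is useful to know how long a certificate has left. The helper below is a small sketch built on the same `openssl x509 -enddate` check the experiment script applies to the TLS secret; the `days_until_expiry` name and the GNU `date -d` usage are assumptions for the example.

```shell
#!/bin/bash
# Hypothetical sketch: days until a PEM certificate expires, using the
# same openssl check the experiment applies to the extracted TLS secret.
days_until_expiry() {
  local cert_file="$1" end_date epoch_end epoch_now
  end_date=$(openssl x509 -in "${cert_file}" -noout -enddate | cut -d= -f2)
  epoch_end=$(date -d "${end_date}" +%s)   # GNU date
  epoch_now=$(date +%s)
  echo $(( (epoch_end - epoch_now) / 86400 ))
}

# Example usage against a certificate extracted from a Kubernetes secret:
# days_until_expiry /tmp/tls.crt
```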
Experiment Configuration
Certificate Expiration Simulation Script:
#!/bin/bash
# scripts/execute-certificate-expiration-experiment.sh
SERVICE="${1:-atp-ingestion-api}"
NAMESPACE="${2:-atp-ingest-ns}"
CERT_NAME="${3:-atp-ingestion-tls}"
echo "🧪 Starting certificate expiration experiment"
echo "Service: ${SERVICE}"
echo "Certificate: ${CERT_NAME}"
# Get baseline metrics
echo "📊 Collecting baseline metrics..."
BASELINE_FILE="baseline-${SERVICE}-$(date +%Y%m%d-%H%M%S).json"
./scripts/collect-baseline-metrics.sh \
--service ${SERVICE} \
--duration 1h \
--output ${BASELINE_FILE}
BASELINE_TLS_SUCCESS_RATE=$(jq -r '.metrics.tls_success_rate_percent' ${BASELINE_FILE})
BASELINE_MTLS_SUCCESS_RATE=$(jq -r '.metrics.mtls_success_rate_percent' ${BASELINE_FILE})
echo "Baseline metrics:"
echo " TLS success rate: ${BASELINE_TLS_SUCCESS_RATE}%"
echo " mTLS success rate: ${BASELINE_MTLS_SUCCESS_RATE}%"
# Get current certificate expiration
echo "📊 Getting current certificate expiration..."
CURRENT_CERT_EXPIRY=$(kubectl get secret ${CERT_NAME} -n ${NAMESPACE} -o jsonpath='{.data.tls\.crt}' | \
base64 -d | openssl x509 -noout -enddate | cut -d= -f2)
echo "Current certificate expires: ${CURRENT_CERT_EXPIRY}"
# Simulate certificate expiration by deleting certificate
echo "🔧 Simulating certificate expiration..."
CERT_EXPIRATION_START=$(date +%s)
# Note: In production, this would be done more carefully
# For testing, we'll delete the certificate to trigger renewal
echo "⚠️ WARNING: This will delete the certificate. Continuing in 5 seconds..."
sleep 5
kubectl delete secret ${CERT_NAME} -n ${NAMESPACE}
# Wait for cert-manager to detect and renew
echo "⏳ Waiting for cert-manager to renew certificate..."
MAX_RENEWAL_WAIT=300 # 5 minutes
ELAPSED=0
CERT_RENEWED=false
while [ ${ELAPSED} -lt ${MAX_RENEWAL_WAIT} ]; do
  CERT_EXISTS=$(kubectl get secret ${CERT_NAME} -n ${NAMESPACE} -o jsonpath='{.metadata.name}' 2>/dev/null)
  if [ -n "${CERT_EXISTS}" ]; then
    NEW_CERT_EXPIRY=$(kubectl get secret ${CERT_NAME} -n ${NAMESPACE} -o jsonpath='{.data.tls\.crt}' | \
      base64 -d | openssl x509 -noout -enddate | cut -d= -f2)
    if [ "${NEW_CERT_EXPIRY}" != "${CURRENT_CERT_EXPIRY}" ]; then
      CERT_RENEWED=true
      RENEWAL_TIME=$(date +%s)
      RENEWAL_DURATION=$((RENEWAL_TIME - CERT_EXPIRATION_START))
      echo "✅ Certificate renewed in ${RENEWAL_DURATION} seconds"
      echo "New certificate expires: ${NEW_CERT_EXPIRY}"
      break
    fi
  fi
  sleep 10
  ELAPSED=$((ELAPSED + 10))
  echo "Waiting for certificate renewal... (${ELAPSED}s/${MAX_RENEWAL_WAIT}s)"
done
if [ "${CERT_RENEWED}" = false ]; then
  echo "❌ Certificate not renewed within ${MAX_RENEWAL_WAIT} seconds"
  exit 1
fi
# Monitor service behavior during certificate renewal
echo "👀 Monitoring service behavior during certificate renewal..."
MAX_MONITOR_WAIT=600 # 10 minutes
ELAPSED=0
while [ ${ELAPSED} -lt ${MAX_MONITOR_WAIT} ]; do
  # Get TLS handshake failures (URL-encode the PromQL so braces and quotes survive)
  TLS_FAILURES=$(curl -s -G http://prometheus:9090/api/v1/query \
    --data-urlencode "query=rate(tls_handshake_failures{service=\"${SERVICE}\"}[1m])" | jq -r '.data.result[0].value[1]')
  # Get mTLS handshake failures
  MTLS_FAILURES=$(curl -s -G http://prometheus:9090/api/v1/query \
    --data-urlencode "query=rate(mtls_handshake_failures{service=\"${SERVICE}\"}[1m])" | jq -r '.data.result[0].value[1]')
  # Get certificate renewal status
  CERT_RENEWAL_STATUS=$(curl -s -G http://prometheus:9090/api/v1/query \
    --data-urlencode "query=certificate_renewal_status{service=\"${SERVICE}\"}" | jq -r '.data.result[0].value[1]')
  # Get request success rate
  SUCCESS_RATE=$(curl -s -G http://prometheus:9090/api/v1/query \
    --data-urlencode "query=rate(http_requests_total{service=\"${SERVICE}\",status!~\"5..\"}[1m])/rate(http_requests_total{service=\"${SERVICE}\"}[1m])" | jq -r '.data.result[0].value[1]')
  SUCCESS_RATE_PERCENT=$(echo "${SUCCESS_RATE} * 100" | bc)
  echo "Metrics at ${ELAPSED}s:"
  echo " TLS handshake failures: ${TLS_FAILURES}/sec"
  echo " mTLS handshake failures: ${MTLS_FAILURES}/sec"
  echo " Certificate renewal status: ${CERT_RENEWAL_STATUS}"
  echo " Request success rate: ${SUCCESS_RATE_PERCENT}%"
  # Validate cert-manager renewal
  if [ "${CERT_RENEWAL_STATUS}" = "success" ]; then
    echo "✅ Certificate renewal successful"
  fi
  # Validate mTLS failure handling (failures should be transient during renewal)
  if (( $(echo "${MTLS_FAILURES} > 0" | bc -l) )); then
    echo "⚠️ mTLS handshake failures: ${MTLS_FAILURES}/sec"
  fi
  sleep 30
  ELAPSED=$((ELAPSED + 30))
done
# Verify final state
FINAL_TLS_SUCCESS_RATE=$(curl -s -G http://prometheus:9090/api/v1/query \
  --data-urlencode "query=rate(tls_handshakes_total{service=\"${SERVICE}\",status=\"success\"}[1m])/rate(tls_handshakes_total{service=\"${SERVICE}\"}[1m])" | jq -r '.data.result[0].value[1]')
FINAL_TLS_SUCCESS_RATE_PERCENT=$(echo "${FINAL_TLS_SUCCESS_RATE} * 100" | bc)
FINAL_MTLS_SUCCESS_RATE=$(curl -s -G http://prometheus:9090/api/v1/query \
  --data-urlencode "query=rate(mtls_handshakes_total{service=\"${SERVICE}\",status=\"success\"}[1m])/rate(mtls_handshakes_total{service=\"${SERVICE}\"}[1m])" | jq -r '.data.result[0].value[1]')
FINAL_MTLS_SUCCESS_RATE_PERCENT=$(echo "${FINAL_MTLS_SUCCESS_RATE} * 100" | bc)
if (( $(echo "${FINAL_TLS_SUCCESS_RATE_PERCENT} >= ${BASELINE_TLS_SUCCESS_RATE} * 0.95" | bc -l) )); then
  echo "✅ TLS success rate recovered: ${FINAL_TLS_SUCCESS_RATE_PERCENT}% (baseline: ${BASELINE_TLS_SUCCESS_RATE}%)"
  if (( $(echo "${FINAL_MTLS_SUCCESS_RATE_PERCENT} >= ${BASELINE_MTLS_SUCCESS_RATE} * 0.95" | bc -l) )); then
    echo "✅ mTLS success rate recovered: ${FINAL_MTLS_SUCCESS_RATE_PERCENT}% (baseline: ${BASELINE_MTLS_SUCCESS_RATE}%)"
    exit 0
  else
    echo "⚠️ mTLS success rate not fully recovered: ${FINAL_MTLS_SUCCESS_RATE_PERCENT}%"
    exit 1
  fi
else
  echo "⚠️ TLS success rate not fully recovered: ${FINAL_TLS_SUCCESS_RATE_PERCENT}%"
  exit 1
fi
Expected Behavior
Certificate Expiration Phase (0-5 minutes):
- Certificate expiration: TLS certificate expires
- Cert-manager detection: Cert-manager detects expiration
- Certificate renewal: Cert-manager renews certificate
- TLS handshake failures: Temporary TLS handshake failures during renewal
- Service continuity: Service continues operating with renewed certificate
Recovery Phase (5-10 minutes):
- Certificate renewal: New certificate issued
- Certificate deployment: New certificate deployed to pods
- TLS normalization: TLS handshakes succeed with new certificate
- Normal operation: Service returns to normal operation
Expected Metrics
| Metric | Baseline | During Expiration | Expected Range | Recovery Target |
|---|---|---|---|---|
| TLS Success Rate | 100% | >90% | >90% | 100% |
| mTLS Success Rate | 100% | >90% | >90% | 100% |
| TLS Handshake Failures | 0/sec | <10/sec | <10/sec | 0/sec |
| mTLS Handshake Failures | 0/sec | <10/sec | <10/sec | 0/sec |
| Certificate Renewal Time | N/A | <5min | <5min | N/A |
Validation Criteria
Success Criteria:
- ✅ Cert-manager automatically renews expired certificates
- ✅ Certificate renewal completes within 5 minutes
- ✅ TLS handshake failures are transient during renewal
- ✅ mTLS handles certificate failures gracefully
- ✅ Service recovers automatically when certificates renewed
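As a pre-drill sanity check, it helps to confirm how far the live certificate actually is from its notAfter date relative to the 30-day renewal window. A minimal sketch (hypothetical helper, assuming GNU date; in practice the date string would come from `openssl x509 -noout -enddate` on the secret's `tls.crt`):

```shell
#!/bin/bash
# Hypothetical pre-drill helper: days remaining before a certificate's
# notAfter date, compared against the renewBefore window (720h = 30 days).

days_until_expiry() {
  local not_after="$1"   # e.g. "Jun 15 12:00:00 2025 GMT" from openssl -enddate
  local expiry_epoch now_epoch
  expiry_epoch=$(date -u -d "${not_after}" +%s)
  now_epoch=$(date -u +%s)
  echo $(( (expiry_epoch - now_epoch) / 86400 ))
}

# Example: a certificate expiring 40 days from now is outside the 30-day
# renewal window, so cert-manager would not yet renew it on its own.
SAMPLE_NOT_AFTER=$(date -u -d "+40 days" +"%b %d %H:%M:%S %Y GMT")
DAYS_LEFT=$(days_until_expiry "${SAMPLE_NOT_AFTER}")
if [ "${DAYS_LEFT}" -gt 30 ]; then
  echo "outside renewal window (${DAYS_LEFT} days left)"
else
  echo "inside renewal window (${DAYS_LEFT} days left)"
fi
```

If the certificate is already inside the renewal window, deleting the secret (as the experiment script does) conflates ordinary renewal with the induced expiration, so run the drill outside that window.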
Certificate Management Configuration
Cert-Manager Certificate Configuration:
# kubernetes/certificates/ingestion-api-tls-certificate.yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: atp-ingestion-tls
  namespace: atp-ingest-ns
spec:
  secretName: atp-ingestion-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  commonName: ingestion-api.atp.connectsoft.io
  dnsNames:
    - ingestion-api.atp.connectsoft.io
    - ingestion-api.atp-staging.connectsoft.io
  renewBefore: 720h # Renew 30 days before expiration
  privateKey:
    algorithm: RSA
    size: 2048
mTLS Configuration:
# kubernetes/configmaps/mtls-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: mtls-config
  namespace: atp-ingest-ns
data:
  MTLSConfig.json: |
    {
      "Enabled": true,
      "ClientCertificateRequired": true,
      "CertificateValidation": {
        "ValidateCertificateChain": true,
        "ValidateRevocation": true,
        "AllowExpiredCertificates": false
      },
      "FailureHandling": {
        "GracefulDegradation": true,
        "RetryOnFailure": true,
        "MaxRetries": 3
      },
      "CertificateRotation": {
        "AutoRotate": true,
        "RotationThreshold": 720,
        "RenewBeforeExpiry": true
      }
    }
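The FailureHandling policy above (retry up to MaxRetries, then degrade gracefully) can be sketched in shell. This is an illustration of the retry semantics, not the ATP implementation; the flaky operation below is simulated:

```shell
#!/bin/bash
# Sketch of the MTLSConfig.json FailureHandling policy: retry a
# handshake-like operation up to MaxRetries before degrading.

retry_handshake() {
  local max_retries="$1" op="$2"
  local attempt=1
  while [ ${attempt} -le ${max_retries} ]; do
    if eval "${op}"; then
      echo "handshake ok on attempt ${attempt}"
      return 0
    fi
    attempt=$((attempt + 1))
  done
  echo "handshake failed after ${max_retries} attempts; degrading gracefully"
  return 1
}

# Simulated transient failure: fails twice, succeeds on the third attempt,
# mirroring the brief handshake failures expected during certificate renewal
ATTEMPTS_FILE=$(mktemp)
echo 0 > "${ATTEMPTS_FILE}"
flaky() {
  local n=$(( $(cat "${ATTEMPTS_FILE}") + 1 ))
  echo ${n} > "${ATTEMPTS_FILE}"
  [ ${n} -ge 3 ]
}
retry_handshake 3 flaky   # prints: handshake ok on attempt 3
```

With MaxRetries set to 3, a failure burst shorter than the retry budget is absorbed silently, which is why the success criteria only require that handshake failures be transient rather than zero.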
Key Vault Unavailability¶
Key Vault unavailability experiments validate that ATP services handle Azure Key Vault unavailability gracefully through secret caching, graceful degradation, and encryption key access failure handling.
Hypothesis
"When Azure Key Vault becomes unavailable, services will use cached secrets, gracefully degrade functionality that requires new secrets, maintain service availability for operations using cached secrets, and recover automatically when Key Vault is restored."
Experiment Configuration
Azure Key Vault Network Partition:
# chaos-experiments/key-vault-unavailability.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: key-vault-unavailability
  namespace: chaos-testing
  labels:
    category: security
    service: key-vault
    severity: high
    frequency: monthly
  annotations:
    chaos.atp.connectsoft.io/hypothesis: |
      When Azure Key Vault becomes unavailable, services will use cached secrets,
      gracefully degrade functionality that requires new secrets, maintain service availability
      for operations using cached secrets, and recover automatically when Key Vault is restored.
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - atp-ingest-ns
    labelSelectors:
      app: atp-ingestion-api
  direction: both
  externalTargets:
    - "*.vault.azure.net"
  duration: "10m"
Key Vault Unavailability Simulation Script:
#!/bin/bash
# scripts/execute-key-vault-unavailability-experiment.sh
SERVICE="${1:-atp-ingestion-api}"
NAMESPACE="${2:-atp-ingest-ns}"
KEY_VAULT_NAME="${3:-atp-keyvault}"
DURATION="${4:-10m}"
echo "🧪 Starting Key Vault unavailability experiment"
echo "Service: ${SERVICE}"
echo "Key Vault: ${KEY_VAULT_NAME}"
echo "Duration: ${DURATION}"
# Get baseline metrics
echo "📊 Collecting baseline metrics..."
# Capture the output filename once so the jq reads below reference the same file
BASELINE_FILE="baseline-${SERVICE}-$(date +%Y%m%d-%H%M%S).json"
./scripts/collect-baseline-metrics.sh \
  --service "${SERVICE}" \
  --duration 1h \
  --output "${BASELINE_FILE}"
BASELINE_SECRET_ACCESS_SUCCESS=$(jq -r '.metrics.secret_access_success_rate_percent' "${BASELINE_FILE}")
BASELINE_SECRET_CACHE_HITS=$(jq -r '.metrics.secret_cache_hits_per_second' "${BASELINE_FILE}")
echo "Baseline metrics:"
echo " Secret access success rate: ${BASELINE_SECRET_ACCESS_SUCCESS}%"
echo " Secret cache hits: ${BASELINE_SECRET_CACHE_HITS}/sec"
# Apply network partition to Key Vault
echo "🔧 Applying network partition to Key Vault..."
kubectl apply -f - <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: key-vault-unavailability-${SERVICE}
  namespace: chaos-testing
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - ${NAMESPACE}
    labelSelectors:
      app: ${SERVICE}
  direction: both
  externalTargets:
    - "*.vault.azure.net"
  duration: "${DURATION}"
EOF
KEY_VAULT_FAILURE_START=$(date +%s)
# Monitor service behavior
echo "👀 Monitoring service behavior during Key Vault unavailability..."
MAX_WAIT=600 # 10 minutes
ELAPSED=0
while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
  # Get secret access success rate (URL-encode the PromQL so braces and quotes survive)
  SECRET_ACCESS_SUCCESS=$(curl -s -G http://prometheus:9090/api/v1/query \
    --data-urlencode "query=rate(secret_access_total{service=\"${SERVICE}\",status=\"success\"}[1m])/rate(secret_access_total{service=\"${SERVICE}\"}[1m])" | jq -r '.data.result[0].value[1]')
  SECRET_ACCESS_SUCCESS_PERCENT=$(echo "${SECRET_ACCESS_SUCCESS} * 100" | bc)
  # Get secret cache hits
  SECRET_CACHE_HITS=$(curl -s -G http://prometheus:9090/api/v1/query \
    --data-urlencode "query=rate(secret_cache_hits{service=\"${SERVICE}\"}[1m])" | jq -r '.data.result[0].value[1]')
  # Get secret access failures
  SECRET_ACCESS_FAILURES=$(curl -s -G http://prometheus:9090/api/v1/query \
    --data-urlencode "query=rate(secret_access_total{service=\"${SERVICE}\",status=\"failure\"}[1m])" | jq -r '.data.result[0].value[1]')
  # Get encryption key access failures
  ENCRYPTION_KEY_FAILURES=$(curl -s -G http://prometheus:9090/api/v1/query \
    --data-urlencode "query=rate(encryption_key_access_failures{service=\"${SERVICE}\"}[1m])" | jq -r '.data.result[0].value[1]')
  # Get service mode (normal vs degraded)
  SERVICE_MODE=$(curl -s -G http://prometheus:9090/api/v1/query \
    --data-urlencode "query=service_mode{service=\"${SERVICE}\"}" | jq -r '.data.result[0].value[1]')
  # Get request success rate
  SUCCESS_RATE=$(curl -s -G http://prometheus:9090/api/v1/query \
    --data-urlencode "query=rate(http_requests_total{service=\"${SERVICE}\",status!~\"5..\"}[1m])/rate(http_requests_total{service=\"${SERVICE}\"}[1m])" | jq -r '.data.result[0].value[1]')
  SUCCESS_RATE_PERCENT=$(echo "${SUCCESS_RATE} * 100" | bc)
  echo "Metrics at ${ELAPSED}s:"
  echo " Secret access success rate: ${SECRET_ACCESS_SUCCESS_PERCENT}%"
  echo " Secret cache hits: ${SECRET_CACHE_HITS}/sec"
  echo " Secret access failures: ${SECRET_ACCESS_FAILURES}/sec"
  echo " Encryption key access failures: ${ENCRYPTION_KEY_FAILURES}/sec"
  echo " Service mode: ${SERVICE_MODE}"
  echo " Request success rate: ${SUCCESS_RATE_PERCENT}%"
  # Validate secret cache usage
  if (( $(echo "${SECRET_CACHE_HITS} > 0" | bc -l) )); then
    echo "✅ Secret cache active: ${SECRET_CACHE_HITS}/sec cache hits"
  fi
  # Validate graceful degradation
  if [ "${SERVICE_MODE}" = "degraded" ]; then
    echo "✅ Service operating in degraded mode: ${SERVICE_MODE}"
  fi
  # Validate encryption key access failure handling
  if (( $(echo "${ENCRYPTION_KEY_FAILURES} > 0" | bc -l) )); then
    echo "⚠️ Encryption key access failures: ${ENCRYPTION_KEY_FAILURES}/sec"
    if (( $(echo "${SUCCESS_RATE_PERCENT} >= 90" | bc -l) )); then
      echo "✅ Service handles encryption key failures gracefully"
    fi
  fi
  sleep 30
  ELAPSED=$((ELAPSED + 30))
done
# Remove network partition
echo "🔧 Removing network partition..."
kubectl delete networkchaos key-vault-unavailability-${SERVICE} -n chaos-testing
RECOVERY_START=$(date +%s)
# Wait for recovery
echo "⏳ Waiting for Key Vault to recover..."
sleep 120
# Verify recovery
FINAL_SECRET_ACCESS_SUCCESS=$(curl -s -G http://prometheus:9090/api/v1/query \
  --data-urlencode "query=rate(secret_access_total{service=\"${SERVICE}\",status=\"success\"}[1m])/rate(secret_access_total{service=\"${SERVICE}\"}[1m])" | jq -r '.data.result[0].value[1]')
FINAL_SECRET_ACCESS_SUCCESS_PERCENT=$(echo "${FINAL_SECRET_ACCESS_SUCCESS} * 100" | bc)
FINAL_SERVICE_MODE=$(curl -s -G http://prometheus:9090/api/v1/query \
  --data-urlencode "query=service_mode{service=\"${SERVICE}\"}" | jq -r '.data.result[0].value[1]')
FINAL_ENCRYPTION_KEY_FAILURES=$(curl -s -G http://prometheus:9090/api/v1/query \
  --data-urlencode "query=rate(encryption_key_access_failures{service=\"${SERVICE}\"}[1m])" | jq -r '.data.result[0].value[1]')
if (( $(echo "${FINAL_SECRET_ACCESS_SUCCESS_PERCENT} >= ${BASELINE_SECRET_ACCESS_SUCCESS} * 0.95" | bc -l) )); then
  echo "✅ Secret access success rate recovered: ${FINAL_SECRET_ACCESS_SUCCESS_PERCENT}% (baseline: ${BASELINE_SECRET_ACCESS_SUCCESS}%)"
  if [ "${FINAL_SERVICE_MODE}" = "normal" ]; then
    echo "✅ Service mode normalized: ${FINAL_SERVICE_MODE}"
    if (( $(echo "${FINAL_ENCRYPTION_KEY_FAILURES} == 0" | bc -l) )); then
      echo "✅ Encryption key access recovered"
      exit 0
    else
      echo "⚠️ Encryption key access failures still occurring: ${FINAL_ENCRYPTION_KEY_FAILURES}/sec"
      exit 1
    fi
  else
    echo "⚠️ Service mode not normalized: ${FINAL_SERVICE_MODE}"
    exit 1
  fi
else
  echo "⚠️ Secret access success rate not fully recovered: ${FINAL_SECRET_ACCESS_SUCCESS_PERCENT}%"
  exit 1
fi
Expected Behavior
Key Vault Failure Phase (0-10 minutes):
- Key Vault unavailability: Azure Key Vault unreachable
- Secret cache usage: Services use cached secrets
- Graceful degradation: Service degrades functionality requiring new secrets
- Encryption key access: Encryption key access failures handled gracefully
- Service continuity: Service continues operating with cached secrets
Recovery Phase (10-15 minutes):
- Key Vault restoration: Azure Key Vault restored
- Secret refresh: New secrets obtained from Key Vault
- Secret cache refresh: Secret cache refreshed
- Normal operation: Service returns to normal operation
Expected Metrics
| Metric | Baseline | During Failure | Expected Range | Recovery Target |
|---|---|---|---|---|
| Secret Access Success Rate | 99.95% | >80% | >80% | 99.95% |
| Secret Cache Hits | 50/sec | 100/sec | 100/sec | 50/sec |
| Secret Access Failures | 0.05% | <20% | <20% | 0.05% |
| Encryption Key Failures | 0/sec | <5/sec | <5/sec | 0/sec |
| Service Mode | Normal | Degraded | Degraded | Normal |
Validation Criteria
Success Criteria:
- ✅ Secret cache maintains access to cached secrets
- ✅ Graceful degradation when new secrets required
- ✅ Encryption key access failures handled gracefully
- ✅ Secret access success rate >80%
- ✅ Service recovers automatically when Key Vault restored
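The cached-secret fallback these criteria exercise can be sketched in shell. This is an illustration, not ATP's actual client code: `fetch_cmd` stands in for a real Key Vault call (such as `az keyvault secret show`), and the cache directory is an assumption for the example:

```shell
#!/bin/bash
# Illustrative cached-secret fallback: serve the live value when the vault
# is reachable (refreshing the cache), fall back to the cache when it is not.

CACHE_DIR="${CACHE_DIR:-/tmp/secret-cache}"
mkdir -p "${CACHE_DIR}"

get_secret_with_cache() {
  local name="$1" fetch_cmd="$2"   # fetch_cmd: stand-in for a Key Vault call
  local cache_file="${CACHE_DIR}/${name}"
  local value
  if value=$(eval "${fetch_cmd}" 2>/dev/null); then
    # Key Vault reachable: refresh the cache and return the live value
    printf '%s' "${value}" > "${cache_file}"
    printf '%s' "${value}"
  elif [ -f "${cache_file}" ]; then
    # Key Vault unreachable: degrade gracefully to the cached value
    cat "${cache_file}"
  else
    # No cached copy either: fail (the service enters degraded mode)
    return 1
  fi
}

# Populate the cache while the "vault" is up, then simulate an outage
get_secret_with_cache db-password "echo s3cret" >/dev/null
get_secret_with_cache db-password "false"   # prints cached value: s3cret
```

Operations whose secrets were fetched before the outage keep working from the cache, while requests for never-cached secrets fail, which is exactly the split the >80% secret access success target reflects.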
Key Vault Configuration
Secret Cache Configuration:
# kubernetes/configmaps/key-vault-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: key-vault-config
  namespace: atp-ingest-ns
data:
  KeyVaultConfig.json: |
    {
      "SecretCache": {
        "Enabled": true,
        "TTL": 3600,
        "MaxCacheSize": 1000,
        "RefreshThreshold": 300
      },
      "KeyVault": {
        "VaultUrl": "https://${KEY_VAULT_NAME}.vault.azure.net/",
        "Authentication": {
          "Type": "ManagedIdentity",
          "ClientId": "{managed-identity-client-id}"
        },
        "RetryPolicy": {
          "MaxRetries": 3,
          "RetryDelay": 1000,
          "ExponentialBackoff": true
        }
      },
      "EncryptionKeys": {
        "CacheEnabled": true,
        "CacheTTL": 7200,
        "FailureHandling": "graceful"
      },
      "FallbackBehavior": {
        "UseCachedSecrets": true,
        "DegradedMode": true,
        "DenyOnFailure": false
      }
    }
Security Chaos Visualization
graph TD
SECURITY[Security Layer] --> AUTH[Authentication]
SECURITY --> AUTHZ[Authorization]
SECURITY --> CERT[Certificates]
SECURITY --> KEYVAULT[Key Vault]
AUTH --> AZUREAD[Azure AD]
AZUREAD -->|Fails| TOKENCACHE[Token Cache]
TOKENCACHE --> DENY[Deny-by-Default]
DENY --> CONTINUE1[Continue Operating]
AUTHZ --> OPA[OPA Policy Engine]
OPA -->|Fails| POLICYCACHE[Policy Cache]
POLICYCACHE --> SAFEFAIL[Safe-Fail]
SAFEFAIL --> CONTINUE2[Continue Operating]
CERT --> CERTMANAGER[Cert-Manager]
CERTMANAGER -->|Expires| RENEW[Auto-Renewal]
RENEW --> MTLS[mTLS Handling]
MTLS --> CONTINUE3[Continue Operating]
KEYVAULT --> SECRETCACHE[Secret Cache]
KEYVAULT -->|Fails| ENCRYPTION[Encryption Keys]
SECRETCACHE --> DEGRADED[Graceful Degradation]
ENCRYPTION --> DEGRADED
DEGRADED --> CONTINUE4[Continue Operating]
style SECURITY fill:#FFE5B4
style AZUREAD fill:#FFB6C1
style OPA fill:#FFB6C1
style CERTMANAGER fill:#FFB6C1
style KEYVAULT fill:#FFB6C1
style CONTINUE1 fill:#90EE90
style CONTINUE2 fill:#90EE90
style CONTINUE3 fill:#90EE90
style CONTINUE4 fill:#90EE90
Summary: Security Chaos¶
- Authentication Failure: Validates token caching, graceful degradation, and deny-by-default behavior during Azure AD unavailability; expects token cache maintains authentication, deny-by-default enforced, graceful degradation to read-only mode, and automatic recovery
- Authorization Denial: Validates cached policies, safe-fail behavior, and deny-when-uncertain enforcement during OPA unavailability; expects policy cache maintains authorization, safe-fail behavior enforced, deny-when-uncertain, and automatic recovery
- Certificate Expiration: Validates cert-manager renewal, mTLS failure handling, and automatic certificate rotation during TLS certificate expiration; expects cert-manager auto-renews certificates, renewal completes within 5 minutes, mTLS handles failures gracefully, and automatic recovery
- Key Vault Unavailability: Validates cached secrets, graceful degradation, and encryption key access failure handling during Azure Key Vault unavailability; expects secret cache maintains access, graceful degradation when new secrets required, encryption key failures handled gracefully, and automatic recovery
- Monitoring and Validation: Comprehensive scripts for monitoring authentication failures, authorization denials, certificate expiration, Key Vault unavailability, token cache hits, policy cache hits, certificate renewal status, and recovery behavior
Regional Failover Drill¶
Purpose: Define comprehensive disaster recovery (DR) drill procedures for regional failover scenarios in ATP, validating failover procedures, RTO/RPO targets, data replication, and service availability to ensure ATP services remain available and recoverable during complete regional failures.
Full Region Failover Scenario¶
Regional failover drill experiments validate that ATP services handle complete regional failures gracefully through automated failover procedures, RTO/RPO target achievement, and service availability in secondary regions.
Hypothesis
"When the East US region becomes completely unavailable, ATP services will automatically failover to the West Europe region within RTO target (30 minutes), maintain RPO target (1 hour data loss), ensure all services are operational in the secondary region, and recover automatically when the primary region is restored."
Scenario Overview
Primary Region: East US (eastus)
- Azure Kubernetes Service (AKS) cluster
- Azure SQL Database (primary)
- Azure Service Bus
- Azure Blob Storage
- Azure Key Vault
- Azure Application Insights
Secondary Region: West Europe (westeurope)
- Azure Kubernetes Service (AKS) cluster (standby)
- Azure SQL Database (read replica → primary)
- Azure Service Bus (geo-replication)
- Azure Blob Storage (geo-redundant)
- Azure Key Vault (geo-redundant)
- Azure Application Insights
Failover Objectives
| Objective | Target | Validation |
|---|---|---|
| Recovery Time Objective (RTO) | 30 minutes | Time from failure detection to full traffic in secondary region |
| Recovery Point Objective (RPO) | 1 hour | Maximum data loss (async replication lag) |
| Service Availability | 99.9% | All critical services operational in secondary region |
| Data Integrity | 100% | No data corruption, all transactions consistent |
| Authentication | 100% | Azure AD multi-region authentication functional |
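The RTO and RPO rows above reduce to simple arithmetic over timestamps the drill tooling records; a minimal sketch (epoch seconds, hypothetical helper):

```shell
#!/bin/bash
# Hypothetical objective check: RTO is detection -> full traffic in the
# secondary region; RPO is the data not yet replicated at failure time.

check_objectives() {
  local detect_ts="$1" traffic_ts="$2" last_replicated_ts="$3" failure_ts="$4"
  local rto=$(( traffic_ts - detect_ts ))
  local rpo=$(( failure_ts - last_replicated_ts ))
  echo "RTO: ${rto}s ($( [ ${rto} -le 1800 ] && echo PASS || echo FAIL ))"
  echo "RPO: ${rpo}s ($( [ ${rpo} -le 3600 ] && echo PASS || echo FAIL ))"
}

# Example with illustrative timestamps: failover took 1500s, replication
# was 700s behind at the moment of failure -- both within target
check_objectives 1000 2500 500 1200
```

Keeping the two measurements separate matters: a drill can meet RTO while blowing RPO (fast failover onto stale data), and the report templates later in this section record both independently for exactly that reason.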
Regional Failover Architecture
graph TB
USERS[Users] --> TM[Azure Traffic Manager]
TM -->|Primary| EASTUS[East US Region]
TM -->|Secondary| WESTEU[West Europe Region]
EASTUS --> EAKS[AKS Cluster - East US]
EASTUS --> ESQL[Azure SQL Primary]
EASTUS --> ESB[Service Bus - East US]
EASTUS --> ESTORAGE[Blob Storage - East US]
WESTEU --> WAKS[AKS Cluster - West Europe]
WESTEU --> WSQL[Azure SQL Replica]
WESTEU --> WSB[Service Bus - West Europe]
WESTEU --> WSTORAGE[Blob Storage - West Europe]
ESQL -.->|Async Replication| WSQL
ESB -.->|Geo-Replication| WSB
ESTORAGE -.->|Geo-Redundant| WSTORAGE
EASTUS -.->|FAILOVER| TM
TM -.->|Traffic Redirect| WESTEU
style EASTUS fill:#FFB6C1
style WESTEU fill:#90EE90
style TM fill:#FFE5B4
Failover Procedure¶
Automated Failover Detection
Region Failure Detection Script:
#!/bin/bash
# scripts/detect-region-failure.sh
PRIMARY_REGION="${1:-eastus}"
SECONDARY_REGION="${2:-westeurope}"
RESOURCE_GROUP="${3:-atp-production}"
echo "🔍 Detecting region failure for ${PRIMARY_REGION}..."
FAILURE_DETECTED=false
FAILURE_COMPONENTS=()
# Check AKS cluster health
echo "Checking AKS cluster health..."
AKS_CLUSTER="atp-aks-${PRIMARY_REGION}"
AKS_STATUS=$(az aks show \
  --resource-group ${RESOURCE_GROUP} \
  --name ${AKS_CLUSTER} \
  --query "powerState.code" \
  --output tsv 2>/dev/null)
if [ "${AKS_STATUS}" != "Running" ]; then
  FAILURE_DETECTED=true
  FAILURE_COMPONENTS+=("AKS Cluster")
  echo "❌ AKS cluster ${AKS_CLUSTER} is not running: ${AKS_STATUS}"
fi
# Check Azure SQL Database connectivity
echo "Checking Azure SQL Database connectivity..."
SQL_SERVER="atp-sql-${PRIMARY_REGION}.database.windows.net"
SQL_DB="atp-database"
if ! nc -z ${SQL_SERVER} 1433 2>/dev/null; then
  FAILURE_DETECTED=true
  FAILURE_COMPONENTS+=("Azure SQL Database")
  echo "❌ Azure SQL Database ${SQL_SERVER} is not reachable"
fi
# Check Service Bus namespace
echo "Checking Service Bus namespace..."
SB_NAMESPACE="atp-sb-${PRIMARY_REGION}"
SB_STATUS=$(az servicebus namespace show \
  --resource-group ${RESOURCE_GROUP} \
  --name ${SB_NAMESPACE} \
  --query "status" \
  --output tsv 2>/dev/null)
if [ "${SB_STATUS}" != "Active" ]; then
  FAILURE_DETECTED=true
  FAILURE_COMPONENTS+=("Service Bus")
  echo "❌ Service Bus namespace ${SB_NAMESPACE} is not active: ${SB_STATUS}"
fi
# Check Storage Account
echo "Checking Storage Account..."
STORAGE_ACCOUNT="atpstorage${PRIMARY_REGION}"
STORAGE_STATUS=$(az storage account show \
  --resource-group ${RESOURCE_GROUP} \
  --name ${STORAGE_ACCOUNT} \
  --query "provisioningState" \
  --output tsv 2>/dev/null)
if [ "${STORAGE_STATUS}" != "Succeeded" ]; then
  FAILURE_DETECTED=true
  FAILURE_COMPONENTS+=("Storage Account")
  echo "❌ Storage Account ${STORAGE_ACCOUNT} provisioning state: ${STORAGE_STATUS}"
fi
# Check Key Vault
echo "Checking Key Vault..."
KEY_VAULT="atp-kv-${PRIMARY_REGION}"
KV_STATUS=$(az keyvault show \
  --name ${KEY_VAULT} \
  --query "properties.provisioningState" \
  --output tsv 2>/dev/null)
if [ "${KV_STATUS}" != "Succeeded" ]; then
  FAILURE_DETECTED=true
  FAILURE_COMPONENTS+=("Key Vault")
  echo "❌ Key Vault ${KEY_VAULT} provisioning state: ${KV_STATUS}"
fi
# Summary
if [ "${FAILURE_DETECTED}" = true ]; then
  echo ""
  echo "⚠️ REGION FAILURE DETECTED"
  echo "Primary Region: ${PRIMARY_REGION}"
  echo "Failed Components:"
  for component in "${FAILURE_COMPONENTS[@]}"; do
    echo " - ${component}"
  done
  echo ""
  echo "🚨 Initiating failover to ${SECONDARY_REGION}..."
  # Trigger failover automation
  ./scripts/initiate-failover.sh ${PRIMARY_REGION} ${SECONDARY_REGION} ${RESOURCE_GROUP}
  exit 1
else
  echo "✅ Primary region ${PRIMARY_REGION} is healthy"
  exit 0
fi
Failover Procedure Script:
#!/bin/bash
# scripts/execute-regional-failover-drill.sh
PRIMARY_REGION="${1:-eastus}"
SECONDARY_REGION="${2:-westeurope}"
RESOURCE_GROUP="${3:-atp-production}"
DRILL_MODE="${4:-true}" # true for drill, false for real failover
FAILOVER_START=$(date +%s)
echo "🚨 Starting Regional Failover Drill"
echo "Primary Region: ${PRIMARY_REGION}"
echo "Secondary Region: ${SECONDARY_REGION}"
echo "Resource Group: ${RESOURCE_GROUP}"
echo "Drill Mode: ${DRILL_MODE}"
echo "Start Time: $(date -u +"%Y-%m-%d %H:%M:%S UTC")"
echo ""
# Step 1: Detect region failure (automated monitoring)
echo "Step 1: Detecting region failure..."
if [ "${DRILL_MODE}" = "true" ]; then
  echo "⚠️ DRILL MODE: Simulating region failure detection"
  FAILURE_DETECTED=true
else
  # detect-region-failure.sh exits non-zero when it detects a failure
  if ./scripts/detect-region-failure.sh ${PRIMARY_REGION} ${SECONDARY_REGION} ${RESOURCE_GROUP}; then
    FAILURE_DETECTED=false
  else
    FAILURE_DETECTED=true
  fi
fi
if [ "${FAILURE_DETECTED}" != "true" ]; then
  echo "❌ No failure detected. Aborting failover."
  exit 1
fi
echo "✅ Region failure detected"
echo ""
# Step 2: Incident commander declares failover
echo "Step 2: Incident commander declares failover..."
echo "⚠️ MANUAL STEP: Incident commander must declare failover"
echo " Send notification to: #atp-dr-war-room"
echo " Incident commander: [WAIT FOR CONFIRMATION]"
read -p "Press Enter when incident commander has declared failover..."
FAILOVER_DECLARED=$(date +%s)
DECLARE_DURATION=$((FAILOVER_DECLARED - FAILOVER_START))
echo "✅ Failover declared at +${DECLARE_DURATION} seconds"
echo ""
# Step 3: Update Azure Traffic Manager (DNS failover)
echo "Step 3: Updating Azure Traffic Manager..."
TM_PROFILE="atp-traffic-manager"
TM_ENDPOINT_PRIMARY="atp-eastus-endpoint"
TM_ENDPOINT_SECONDARY="atp-westeurope-endpoint"
# Disable primary endpoint
az network traffic-manager endpoint update \
--resource-group ${RESOURCE_GROUP} \
--profile-name ${TM_PROFILE} \
--name ${TM_ENDPOINT_PRIMARY} \
--endpoint-status Disabled
# Enable secondary endpoint
az network traffic-manager endpoint update \
--resource-group ${RESOURCE_GROUP} \
--profile-name ${TM_PROFILE} \
--name ${TM_ENDPOINT_SECONDARY} \
--endpoint-status Enabled \
--priority 1
echo "✅ Traffic Manager updated (DNS failover initiated)"
echo " Primary endpoint: Disabled"
echo " Secondary endpoint: Enabled (Priority 1)"
echo ""
# Step 4: Verify West Europe cluster healthy
echo "Step 4: Verifying West Europe cluster health..."
SECONDARY_AKS="atp-aks-${SECONDARY_REGION}"
AKS_NODES=$(az aks show \
--resource-group ${RESOURCE_GROUP} \
--name ${SECONDARY_AKS} \
--query "agentPoolProfiles[0].count" \
--output tsv)
AKS_STATUS=$(az aks show \
--resource-group ${RESOURCE_GROUP} \
--name ${SECONDARY_AKS} \
--query "powerState.code" \
--output tsv)
if [ "${AKS_STATUS}" != "Running" ]; then
  echo "❌ Secondary AKS cluster is not running: ${AKS_STATUS}"
  exit 1
fi
echo "✅ Secondary AKS cluster healthy"
echo " Cluster: ${SECONDARY_AKS}"
echo " Status: ${AKS_STATUS}"
echo " Nodes: ${AKS_NODES}"
echo ""
# Step 5: Verify data replication status
echo "Step 5: Verifying data replication status..."
PRIMARY_SQL_SERVER="atp-sql-${PRIMARY_REGION}"
SECONDARY_SQL_SERVER="atp-sql-${SECONDARY_REGION}"
FAILOVER_GROUP="atp-sql-failover-group"
# Check replication lag
REPLICATION_LAG=$(az sql db replica show-lag \
--resource-group ${RESOURCE_GROUP} \
--server ${SECONDARY_SQL_SERVER} \
--database atp-database \
--query "lagSeconds" \
--output tsv 2>/dev/null || echo "0")
if [ -z "${REPLICATION_LAG}" ]; then
  REPLICATION_LAG=0
fi
echo " Replication lag: ${REPLICATION_LAG} seconds"
if (( REPLICATION_LAG > 3600 )); then
  echo "⚠️ WARNING: Replication lag exceeds RPO target (1 hour)"
fi
echo "✅ Data replication status verified"
echo ""
# Step 6: Validate latest data available (check RPO)
echo "Step 6: Validating latest data available (RPO check)..."
# Get latest transaction timestamp from secondary database
LATEST_TRANSACTION=$(az sql db query \
--resource-group ${RESOURCE_GROUP} \
--server ${SECONDARY_SQL_SERVER} \
--database atp-database \
--query-text "SELECT MAX(LastModified) as LatestTransaction FROM AuditEvents" \
--output tsv 2>/dev/null || echo "N/A")
CURRENT_TIME=$(date -u +"%Y-%m-%d %H:%M:%S")
RPO_AGE=$(date -u -d "${LATEST_TRANSACTION}" +%s 2>/dev/null || echo "0")
CURRENT_AGE=$(date -u -d "${CURRENT_TIME}" +%s)
RPO_DIFF=$((CURRENT_AGE - RPO_AGE))
if (( RPO_DIFF > 3600 )); then
  echo "⚠️ WARNING: RPO exceeded (${RPO_DIFF} seconds > 3600 seconds)"
else
  echo "✅ RPO target met (${RPO_DIFF} seconds < 3600 seconds)"
fi
echo " Latest transaction: ${LATEST_TRANSACTION}"
echo " Current time: ${CURRENT_TIME}"
echo " RPO age: ${RPO_DIFF} seconds"
echo ""
# Step 7: Re-route all traffic to West Europe
echo "Step 7: Re-routing all traffic to West Europe..."
# This is already done in Step 3 (Traffic Manager)
# Additional verification: Check DNS propagation
TM_DNS_NAME=$(az network traffic-manager profile show \
--resource-group ${RESOURCE_GROUP} \
--name ${TM_PROFILE} \
--query "dnsConfig.fqdn" \
--output tsv)
echo " Traffic Manager DNS: ${TM_DNS_NAME}"
echo " DNS propagation: Checking..."
sleep 10
# Verify DNS resolution
RESOLVED_IP=$(dig +short ${TM_DNS_NAME} | head -n1)
echo " Resolved IP: ${RESOLVED_IP}"
echo "✅ Traffic re-routed to West Europe"
echo ""
# Step 8: Monitor application health
echo "Step 8: Monitoring application health..."
SECONDARY_NAMESPACE="atp-production"
# Check pod status
PODS_RUNNING=$(kubectl get pods -n ${SECONDARY_NAMESPACE} --context ${SECONDARY_AKS} \
--field-selector=status.phase=Running \
--no-headers | wc -l)
PODS_TOTAL=$(kubectl get pods -n ${SECONDARY_NAMESPACE} --context ${SECONDARY_AKS} \
--no-headers | wc -l)
echo " Pods running: ${PODS_RUNNING}/${PODS_TOTAL}"
# Check service endpoints
SERVICES=$(kubectl get svc -n ${SECONDARY_NAMESPACE} --context ${SECONDARY_AKS} \
-o jsonpath='{.items[*].metadata.name}')
for service in ${SERVICES}; do
  ENDPOINTS=$(kubectl get endpoints ${service} -n ${SECONDARY_NAMESPACE} --context ${SECONDARY_AKS} \
    -o jsonpath='{.subsets[0].addresses[*].ip}' 2>/dev/null || echo "")
  if [ -z "${ENDPOINTS}" ]; then
    echo " ⚠️ Service ${service}: No endpoints"
  else
    echo " ✅ Service ${service}: Healthy"
  fi
done
echo "✅ Application health monitoring initiated"
echo ""
# Step 9: Notify stakeholders
echo "Step 9: Notifying stakeholders..."
FAILOVER_COMPLETE=$(date +%s)
RTO_ACHIEVED=$((FAILOVER_COMPLETE - FAILOVER_START))
NOTIFICATION_MESSAGE="🚨 Regional Failover Completed
Primary Region: ${PRIMARY_REGION} (Unavailable)
Secondary Region: ${SECONDARY_REGION} (Active)
RTO Achieved: ${RTO_ACHIEVED} seconds
RPO Verified: ${RPO_DIFF} seconds
Status: All services operational in ${SECONDARY_REGION}"
echo "${NOTIFICATION_MESSAGE}"
echo ""
# Send notifications (Slack, Email, etc.)
# ./scripts/send-notification.sh "${NOTIFICATION_MESSAGE}"
echo "✅ Stakeholders notified"
echo ""
# Step 10: Document RTO/RPO achieved
echo "Step 10: Documenting RTO/RPO achieved..."
DR_REPORT_FILE="dr-drill-report-$(date +%Y%m%d-%H%M%S).json"
cat > ${DR_REPORT_FILE} <<EOF
{
  "drillId": "dr-$(date +%Y%m%d-%H%M%S)",
  "timestamp": "$(date -u +"%Y-%m-%dT%H:%M:%SZ")",
  "primaryRegion": "${PRIMARY_REGION}",
  "secondaryRegion": "${SECONDARY_REGION}",
  "failoverStartTime": "$(date -u -d @${FAILOVER_START} +"%Y-%m-%dT%H:%M:%SZ")",
  "failoverCompleteTime": "$(date -u -d @${FAILOVER_COMPLETE} +"%Y-%m-%dT%H:%M:%SZ")",
  "rto": {
    "target": 1800,
    "achieved": ${RTO_ACHIEVED},
    "status": "$(if (( RTO_ACHIEVED <= 1800 )); then echo "PASS"; else echo "FAIL"; fi)"
  },
  "rpo": {
    "target": 3600,
    "achieved": ${RPO_DIFF},
    "status": "$(if (( RPO_DIFF <= 3600 )); then echo "PASS"; else echo "FAIL"; fi)"
  },
  "replicationLag": ${REPLICATION_LAG},
  "services": {
    "podsRunning": ${PODS_RUNNING},
    "podsTotal": ${PODS_TOTAL},
    "status": "$(if (( PODS_RUNNING == PODS_TOTAL )); then echo "HEALTHY"; else echo "DEGRADED"; fi)"
  },
  "drillMode": ${DRILL_MODE}
}
EOF
echo "✅ DR drill report generated: ${DR_REPORT_FILE}"
echo ""
echo "=========================================="
echo "Regional Failover Drill Summary"
echo "=========================================="
echo "RTO Target: 1800 seconds (30 minutes)"
echo "RTO Achieved: ${RTO_ACHIEVED} seconds"
echo "RPO Target: 3600 seconds (1 hour)"
echo "RPO Achieved: ${RPO_DIFF} seconds"
echo "Status: $(if (( RTO_ACHIEVED <= 1800 && RPO_DIFF <= 3600 )); then echo "✅ PASS"; else echo "❌ FAIL"; fi)"
echo ""
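The RTO/RPO figures above are reported as raw seconds; for drill summaries a human-readable rendering helps. A minimal sketch (hypothetical helper, not part of the drill scripts):

```shell
#!/usr/bin/env bash
# format_duration: render a duration in seconds as "XhYmZs" for drill reports.
# Hypothetical helper; the drill scripts above report raw seconds.
format_duration() {
local total="$1"
printf '%dh%dm%ds' $(( total / 3600 )) $(( (total % 3600) / 60 )) $(( total % 60 ))
}

# Example: the 30-minute RTO target
format_duration 1800   # prints 0h30m0s
echo
format_duration 5025   # prints 1h23m45s
echo
```

This could be dropped into the summary echoes, e.g. `echo "RTO Achieved: $(format_duration ${RTO_ACHIEVED})"`.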
Expected Behavior
Failover Phase (0-30 minutes):
- Failure detection: Automated monitoring detects region failure
- Incident declaration: Incident commander declares failover
- DNS failover: Traffic Manager redirects traffic to secondary region
- Cluster verification: Secondary cluster verified healthy
- Data replication: Replication status verified (RPO checked)
- Traffic routing: All traffic re-routed to secondary region
- Health monitoring: Application health monitored continuously
- Notifications: Stakeholders notified of failover completion
Post-Failover Phase (30-240 minutes):
- Service validation: All services validated operational
- Data integrity: Data integrity verified (no corruption)
- Monitoring: Monitoring and alerting functional
- Authentication: Azure AD multi-region authentication verified
- Documentation: RTO/RPO documented
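The continuous health monitoring described above follows a poll-until-healthy pattern. A generic sketch with an injectable check command (the kubectl endpoint check from the failover script could be passed in; names and defaults here are illustrative):

```shell
#!/usr/bin/env bash
# wait_until_healthy: poll an arbitrary health-check command until it succeeds
# or a timeout elapses. Sketch only; in the drill this would wrap e.g.
# "kubectl get endpoints <svc> -n <ns> --context <cluster>" against the
# secondary cluster, or a plain HTTP probe.
wait_until_healthy() {
local check_cmd="$1" timeout="${2:-300}" interval="${3:-5}"
local elapsed=0
while (( elapsed < timeout )); do
if eval "${check_cmd}" > /dev/null 2>&1; then
echo "healthy after ${elapsed}s"
return 0
fi
sleep "${interval}"
elapsed=$(( elapsed + interval ))
done
echo "unhealthy after ${timeout}s"
return 1
}

# Example: a check that succeeds once a marker file appears
( sleep 2; touch /tmp/atp-health-marker ) &
wait_until_healthy "test -f /tmp/atp-health-marker" 30 1
rm -f /tmp/atp-health-marker
```

Because the check is a string, the same loop serves pod readiness, endpoint presence, or end-to-end probes without modification.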
DR Drill Execution¶
DR Drill Schedule
| Schedule | Type | Environment | Duration |
|---|---|---|---|
| Q1 | Full DR Drill | Production-like (Staging) | 4 hours |
| Q2 | Tabletop Exercise | Any | 2 hours |
| Q3 | Full DR Drill | Production-like (Staging) | 4 hours |
| Q4 | Post-Mortem Review | Any | 2 hours |
DR Drill Timeline
| Phase | Duration | Activities |
|---|---|---|
| Pre-Drill | 1 week | Planning, coordination, stakeholder notification |
| Pre-Drill Briefing | 30 min | Team briefing, roles, procedures |
| Failover Execution | 1 hour | Failover procedure execution |
| Validation | 2 hours | Service validation, RTO/RPO verification |
| Post-Drill Briefing | 30 min | Lessons learned, improvement actions |
| Post-Drill Report | 1 week | Documentation, post-mortem report |
DR Drill Team Structure
Incident Commander (IC):
- Overall responsibility for failover decision
- Coordinates with all teams
- Makes go/no-go decisions
- Communicates with stakeholders
Platform Team:
- Infrastructure provisioning
- AKS cluster management
- Traffic Manager configuration
- Resource group management
SRE Team:
- Monitoring and alerting
- Service health validation
- Performance monitoring
- Incident response
Security Team:
- Authentication verification
- Authorization validation
- Key Vault access verification
- Compliance validation
Service Teams:
- Application deployment verification
- Service-specific validation
- Data integrity checks
- End-to-end testing
Communication Channels
War Room: Dedicated virtual meeting room (Teams/Zoom)
- All teams participate
- Real-time coordination
- Screen sharing for monitoring
Slack Channel: #atp-dr-war-room
- Status updates
- Incident reports
- Coordination messages
- Timestamp logs
Email Distribution: atp-dr-alerts@connectsoft.io
- Stakeholder notifications
- Executive summaries
- Post-drill reports
Pre-Drill Checklist
# Pre-Drill Checklist
## Planning (1 week before)
- [ ] DR drill schedule confirmed
- [ ] Stakeholders notified (customers, leadership, compliance)
- [ ] Team assignments confirmed
- [ ] Communication channels set up
- [ ] Monitoring dashboards prepared
- [ ] Runbooks reviewed and updated
## Preparation (1 day before)
- [ ] Secondary region resources verified
- [ ] Data replication status checked
- [ ] Backup procedures verified
- [ ] Failover scripts tested
- [ ] Access credentials verified
- [ ] War room scheduled
- [ ] Slack channel created
## Day of Drill
- [ ] Pre-drill briefing conducted
- [ ] Team members available
- [ ] Monitoring tools accessible
- [ ] Communication channels open
- [ ] Backup procedures ready
- [ ] Rollback plan confirmed
During-Drill Checklist
# During-Drill Checklist
## Failover Execution
- [ ] Region failure simulated/detected
- [ ] Incident commander declares failover
- [ ] Traffic Manager updated (DNS failover)
- [ ] Secondary cluster verified healthy
- [ ] Data replication status verified
- [ ] RPO validated (data loss <1 hour)
- [ ] Traffic re-routed to secondary region
- [ ] Application health monitored
- [ ] Stakeholders notified
## Validation
- [ ] All services operational in secondary region
- [ ] No data corruption detected
- [ ] Monitoring and alerting functional
- [ ] Authentication working (Azure AD)
- [ ] Authorization working (OPA)
- [ ] Key Vault access verified
- [ ] End-to-end functionality tested
- [ ] Performance metrics within acceptable range
## Documentation
- [ ] RTO achieved documented
- [ ] RPO achieved documented
- [ ] Service status documented
- [ ] Issues encountered documented
- [ ] Lessons learned captured
Post-Drill Checklist
# Post-Drill Checklist
## Immediate (Within 1 hour)
- [ ] Post-drill briefing conducted
- [ ] Initial findings documented
- [ ] Critical issues identified
- [ ] Rollback completed (if drill mode)
- [ ] Services restored to primary region (if drill mode)
## Short-term (Within 1 week)
- [ ] Post-mortem report completed
- [ ] Improvement actions identified
- [ ] Runbooks updated
- [ ] Procedures refined
- [ ] Stakeholder report distributed
- [ ] Compliance documentation updated
## Long-term (Within 1 month)
- [ ] Improvement actions implemented
- [ ] Next drill scheduled
- [ ] Training completed
- [ ] Documentation finalized
DR Drill Validation Criteria¶
RTO Validation
RTO Measurement Script:
#!/bin/bash
# scripts/validate-rto.sh
FAILOVER_START="${1}" # Unix timestamp
FAILOVER_COMPLETE="${2}" # Unix timestamp
RTO_TARGET=1800 # 30 minutes in seconds
if [ -z "${FAILOVER_START}" ] || [ -z "${FAILOVER_COMPLETE}" ]; then
echo "❌ Usage: validate-rto.sh <failover_start_timestamp> <failover_complete_timestamp>"
exit 1
fi
RTO_ACHIEVED=$((FAILOVER_COMPLETE - FAILOVER_START))
echo "RTO Validation"
echo "=============="
echo "RTO Target: ${RTO_TARGET} seconds (30 minutes)"
echo "RTO Achieved: ${RTO_ACHIEVED} seconds"
echo ""
if (( RTO_ACHIEVED <= RTO_TARGET )); then
echo "✅ RTO TARGET ACHIEVED"
echo " Achieved: ${RTO_ACHIEVED}s (Target: ${RTO_TARGET}s)"
exit 0
else
echo "❌ RTO TARGET NOT ACHIEVED"
echo " Achieved: ${RTO_ACHIEVED}s (Target: ${RTO_TARGET}s)"
echo " Over by: $((RTO_ACHIEVED - RTO_TARGET))s"
exit 1
fi
RPO Validation
RPO Measurement Script:
#!/bin/bash
# scripts/validate-rpo.sh
SECONDARY_SERVER="${1:-atp-sql-westeurope}"
RESOURCE_GROUP="${2:-atp-production}"
RPO_TARGET=3600 # 1 hour in seconds
# Get latest transaction timestamp from secondary database
LATEST_TRANSACTION=$(az sql db query \
--resource-group ${RESOURCE_GROUP} \
--server ${SECONDARY_SERVER} \
--database atp-database \
--query-text "SELECT MAX(LastModified) as LatestTransaction FROM AuditEvents" \
--output tsv 2>/dev/null)
if [ -z "${LATEST_TRANSACTION}" ] || [ "${LATEST_TRANSACTION}" = "N/A" ]; then
echo "❌ Unable to retrieve latest transaction timestamp"
exit 1
fi
CURRENT_TIME=$(date -u +"%Y-%m-%d %H:%M:%S")
RPO_AGE=$(date -u -d "${LATEST_TRANSACTION}" +%s 2>/dev/null)
if [ -z "${RPO_AGE}" ]; then
echo "❌ Unable to parse latest transaction timestamp: ${LATEST_TRANSACTION}"
exit 1
fi
CURRENT_AGE=$(date -u +%s)
RPO_ACHIEVED=$((CURRENT_AGE - RPO_AGE))
echo "RPO Validation"
echo "=============="
echo "RPO Target: ${RPO_TARGET} seconds (1 hour)"
echo "Latest Transaction: ${LATEST_TRANSACTION}"
echo "Current Time: ${CURRENT_TIME}"
echo "RPO Achieved: ${RPO_ACHIEVED} seconds"
echo ""
if (( RPO_ACHIEVED <= RPO_TARGET )); then
echo "✅ RPO TARGET ACHIEVED"
echo " Achieved: ${RPO_ACHIEVED}s (Target: ${RPO_TARGET}s)"
exit 0
else
echo "❌ RPO TARGET NOT ACHIEVED"
echo " Achieved: ${RPO_ACHIEVED}s (Target: ${RPO_TARGET}s)"
echo " Over by: $((RPO_ACHIEVED - RPO_TARGET))s"
exit 1
fi
Service Availability Validation
Service Availability Validation Script:
#!/bin/bash
# scripts/validate-service-availability.sh
SECONDARY_AKS="${1:-atp-aks-westeurope}"
NAMESPACE="${2:-atp-production}"
AVAILABILITY_TARGET=99.9 # 99.9%
echo "Service Availability Validation"
echo "==============================="
# Get all services
SERVICES=$(kubectl get svc -n ${NAMESPACE} --context ${SECONDARY_AKS} \
-o jsonpath='{.items[*].metadata.name}')
TOTAL_SERVICES=0
HEALTHY_SERVICES=0
for service in ${SERVICES}; do
TOTAL_SERVICES=$((TOTAL_SERVICES + 1))
# Check service endpoints
ENDPOINTS=$(kubectl get endpoints ${service} -n ${NAMESPACE} --context ${SECONDARY_AKS} \
-o jsonpath='{.subsets[0].addresses[*].ip}' 2>/dev/null || echo "")
if [ -n "${ENDPOINTS}" ]; then
HEALTHY_SERVICES=$((HEALTHY_SERVICES + 1))
echo "✅ ${service}: Healthy"
else
echo "❌ ${service}: No endpoints"
fi
done
if (( TOTAL_SERVICES == 0 )); then
echo "❌ No services found in namespace ${NAMESPACE}"
exit 1
fi
AVAILABILITY_PERCENT=$(echo "scale=2; ${HEALTHY_SERVICES} * 100 / ${TOTAL_SERVICES}" | bc)
echo ""
echo "Service Availability: ${AVAILABILITY_PERCENT}% (Target: ${AVAILABILITY_TARGET}%)"
echo "Healthy Services: ${HEALTHY_SERVICES}/${TOTAL_SERVICES}"
if (( $(echo "${AVAILABILITY_PERCENT} >= ${AVAILABILITY_TARGET}" | bc -l) )); then
echo "✅ SERVICE AVAILABILITY TARGET ACHIEVED"
exit 0
else
echo "❌ SERVICE AVAILABILITY TARGET NOT ACHIEVED"
exit 1
fi
Data Integrity Validation
Data Integrity Validation Script:
#!/bin/bash
# scripts/validate-data-integrity.sh
SECONDARY_SERVER="${1:-atp-sql-westeurope}"
RESOURCE_GROUP="${2:-atp-production}"
echo "Data Integrity Validation"
echo "========================="
# Check for data corruption (checksum validation)
echo "Checking data integrity..."
CHECKSUM_RESULT=$(az sql db query \
--resource-group ${RESOURCE_GROUP} \
--server ${SECONDARY_SERVER} \
--database atp-database \
--query-text "SELECT COUNT(*) as CorruptedRecords FROM AuditEvents WHERE CHECKSUM(EventData) != StoredChecksum" \
--output tsv 2>/dev/null || echo "ERROR")
if [ "${CHECKSUM_RESULT}" = "ERROR" ]; then
echo "❌ Unable to validate data integrity"
exit 1
fi
if [ "${CHECKSUM_RESULT}" = "0" ]; then
echo "✅ No data corruption detected"
echo " Corrupted records: 0"
exit 0
else
echo "❌ Data corruption detected"
echo " Corrupted records: ${CHECKSUM_RESULT}"
exit 1
fi
Monitoring and Alerting Validation
Monitoring Validation Script:
#!/bin/bash
# scripts/validate-monitoring.sh
SECONDARY_REGION="${1:-westeurope}"
RESOURCE_GROUP="${2:-atp-production}"
echo "Monitoring and Alerting Validation"
echo "==================================="
# Check Application Insights
APPINSIGHTS="atp-appinsights-${SECONDARY_REGION}"
AI_STATUS=$(az monitor app-insights component show \
--app ${APPINSIGHTS} \
--resource-group ${RESOURCE_GROUP} \
--query "state" \
--output tsv 2>/dev/null || echo "ERROR")
if [ "${AI_STATUS}" = "ERROR" ]; then
echo "❌ Application Insights not accessible"
exit 1
fi
if [ "${AI_STATUS}" = "Succeeded" ]; then
echo "✅ Application Insights: Operational"
else
echo "⚠️ Application Insights: ${AI_STATUS}"
fi
# Check Log Analytics workspace
LOG_ANALYTICS="atp-loganalytics-${SECONDARY_REGION}"
LA_STATUS=$(az monitor log-analytics workspace show \
--resource-group ${RESOURCE_GROUP} \
--workspace-name ${LOG_ANALYTICS} \
--query "provisioningState" \
--output tsv 2>/dev/null || echo "ERROR")
if [ "${LA_STATUS}" = "ERROR" ]; then
echo "❌ Log Analytics workspace not accessible"
exit 1
fi
if [ "${LA_STATUS}" = "Succeeded" ]; then
echo "✅ Log Analytics: Operational"
else
echo "⚠️ Log Analytics: ${LA_STATUS}"
fi
echo "✅ Monitoring and alerting functional"
exit 0
Authentication Validation
Authentication Validation Script:
#!/bin/bash
# scripts/validate-authentication.sh
SECONDARY_REGION="${1:-westeurope}"
TEST_ENDPOINT="${2:-https://atp-api.${SECONDARY_REGION}.connectsoft.io/health}"
echo "Authentication Validation"
echo "========================="
# Test Azure AD authentication
echo "Testing Azure AD authentication..."
# Attempt to get access token
TOKEN_RESPONSE=$(curl -s -X POST \
"https://login.microsoftonline.com/{tenant-id}/oauth2/v2.0/token" \
-H "Content-Type: application/x-www-form-urlencoded" \
-d "client_id={client-id}" \
-d "scope=api://atp-api/.default" \
-d "client_secret={client-secret}" \
-d "grant_type=client_credentials" 2>/dev/null || echo "ERROR")
if [ "${TOKEN_RESPONSE}" = "ERROR" ]; then
echo "❌ Azure AD token request: Failed"
exit 1
fi
if echo "${TOKEN_RESPONSE}" | jq -e '.access_token' > /dev/null 2>&1; then
echo "✅ Azure AD authentication: Working"
# Test authenticated API call
ACCESS_TOKEN=$(echo "${TOKEN_RESPONSE}" | jq -r '.access_token')
API_RESPONSE=$(curl -s -H "Authorization: Bearer ${ACCESS_TOKEN}" ${TEST_ENDPOINT} 2>/dev/null || echo "ERROR")
if [ "${API_RESPONSE}" != "ERROR" ] && echo "${API_RESPONSE}" | jq -e '.status' > /dev/null 2>&1; then
echo "✅ Authenticated API call: Working"
exit 0
else
echo "❌ Authenticated API call: Failed"
exit 1
fi
else
echo "❌ Azure AD authentication: Failed"
exit 1
fi
DR Drill Validation Summary
#!/bin/bash
# scripts/validate-dr-drill.sh
PRIMARY_REGION="${1:-eastus}"
SECONDARY_REGION="${2:-westeurope}"
RESOURCE_GROUP="${3:-atp-production}"
FAILOVER_START="${4:-$(date +%s)}"
FAILOVER_COMPLETE="${5:-$(date +%s)}"
echo "=========================================="
echo "DR Drill Validation Summary"
echo "=========================================="
echo "Primary Region: ${PRIMARY_REGION}"
echo "Secondary Region: ${SECONDARY_REGION}"
echo "Resource Group: ${RESOURCE_GROUP}"
echo ""
VALIDATION_RESULTS=()
# RTO Validation
echo "1. RTO Validation"
./scripts/validate-rto.sh ${FAILOVER_START} ${FAILOVER_COMPLETE}
RTO_RESULT=$?
VALIDATION_RESULTS+=(${RTO_RESULT})
echo ""
# RPO Validation
echo "2. RPO Validation"
./scripts/validate-rpo.sh "atp-sql-${SECONDARY_REGION}" ${RESOURCE_GROUP}
RPO_RESULT=$?
VALIDATION_RESULTS+=(${RPO_RESULT})
echo ""
# Service Availability Validation
echo "3. Service Availability Validation"
./scripts/validate-service-availability.sh "atp-aks-${SECONDARY_REGION}" "atp-production"
AVAILABILITY_RESULT=$?
VALIDATION_RESULTS+=(${AVAILABILITY_RESULT})
echo ""
# Data Integrity Validation
echo "4. Data Integrity Validation"
./scripts/validate-data-integrity.sh "atp-sql-${SECONDARY_REGION}" ${RESOURCE_GROUP}
INTEGRITY_RESULT=$?
VALIDATION_RESULTS+=(${INTEGRITY_RESULT})
echo ""
# Monitoring Validation
echo "5. Monitoring and Alerting Validation"
./scripts/validate-monitoring.sh ${SECONDARY_REGION} ${RESOURCE_GROUP}
MONITORING_RESULT=$?
VALIDATION_RESULTS+=(${MONITORING_RESULT})
echo ""
# Authentication Validation
echo "6. Authentication Validation"
./scripts/validate-authentication.sh ${SECONDARY_REGION}
AUTH_RESULT=$?
VALIDATION_RESULTS+=(${AUTH_RESULT})
echo ""
# Summary
echo "=========================================="
echo "Validation Summary"
echo "=========================================="
TOTAL_VALIDATIONS=6
PASSED_VALIDATIONS=0
for result in "${VALIDATION_RESULTS[@]}"; do
if [ "${result}" = "0" ]; then
PASSED_VALIDATIONS=$((PASSED_VALIDATIONS + 1))
fi
done
echo "Passed: ${PASSED_VALIDATIONS}/${TOTAL_VALIDATIONS}"
if [ ${PASSED_VALIDATIONS} -eq ${TOTAL_VALIDATIONS} ]; then
echo "✅ ALL VALIDATIONS PASSED"
exit 0
else
echo "❌ SOME VALIDATIONS FAILED"
exit 1
fi
Expected Metrics
| Metric | Target | Validation Method |
|---|---|---|
| RTO | ≤30 minutes | Time from failure to full traffic in secondary region |
| RPO | ≤1 hour | Maximum data loss (async replication lag) |
| Service Availability | ≥99.9% | All critical services operational |
| Data Integrity | 100% | No data corruption detected |
| Monitoring | 100% | All monitoring tools functional |
| Authentication | 100% | Azure AD multi-region authentication working |
Summary: Regional Failover Drill¶
- Full Region Failover Scenario: Validates complete regional failover from East US to West Europe with RTO target (30 minutes) and RPO target (1 hour); expects automated failover detection, incident commander declaration, DNS failover, cluster verification, data replication validation, traffic re-routing, and stakeholder notification
- Failover Procedure: Comprehensive 10-step failover procedure including failure detection, incident declaration, Traffic Manager update, cluster verification, data replication check, RPO validation, traffic routing, health monitoring, stakeholder notification, and RTO/RPO documentation
- DR Drill Execution: Quarterly scheduled drills (Q1, Q3) with 4-hour duration, structured team assignments (Platform, SRE, Security, Service teams), dedicated communication channels (war room, Slack), and stakeholder notifications (leadership, compliance, customers)
- DR Drill Validation Criteria: Comprehensive validation of RTO achievement (<30 minutes), RPO verification (<1 hour data loss), service availability (all services operational), data integrity (no corruption), monitoring/alerting functionality, and authentication working (Azure AD multi-region)
- Monitoring and Validation: Comprehensive scripts for RTO/RPO validation, service availability validation, data integrity validation, monitoring validation, authentication validation, and overall DR drill validation summary
Data Recovery Drills¶
Purpose: Define comprehensive data recovery drill procedures for ATP, validating backup restoration capabilities, point-in-time recovery, and corruption recovery to ensure ATP data can be recovered and restored with integrity and completeness during data loss or corruption scenarios.
Backup Restoration Drill¶
Backup restoration drill experiments validate that ATP services can restore from Azure Backup successfully with backup integrity validation, restoration time measurement, and data completeness verification.
Hypothesis
"When data is lost or corrupted, ATP services will restore from Azure Backup within acceptable restoration time, validate backup integrity before restoration, verify data completeness after restoration, and ensure all restored data is consistent and functional."
Experiment Configuration
Backup Restoration Script:
#!/bin/bash
# scripts/execute-backup-restoration-drill.sh
SQL_SERVER="${1:-atp-sql-eastus}"
RESOURCE_GROUP="${2:-atp-production}"
DATABASE_NAME="${3:-atp-database}"
BACKUP_RETENTION_DAYS="${4:-30}"
TEST_NAMESPACE="${5:-atp-restore-test}"
RESTORATION_START=$(date +%s)
echo "🧪 Starting Backup Restoration Drill"
echo "SQL Server: ${SQL_SERVER}"
echo "Database: ${DATABASE_NAME}"
echo "Resource Group: ${RESOURCE_GROUP}"
echo "Start Time: $(date -u +"%Y-%m-%d %H:%M:%S UTC")"
echo ""
# Step 1: List available backups
echo "Step 1: Listing available backups..."
echo "Getting backup list from Azure Backup..."
# Get available backup points
BACKUP_LIST=$(az backup recoverypoint list \
--resource-group ${RESOURCE_GROUP} \
--vault-name atp-backup-vault \
--container-name "IaasVMContainer;${SQL_SERVER}" \
--item-name "SQLDataBase;${DATABASE_NAME}" \
--query "[?properties.recoveryPointTime >= \`$(date -u -d "${BACKUP_RETENTION_DAYS} days ago" +"%Y-%m-%dT%H:%M:%SZ")\`].{Time:properties.recoveryPointTime, Type:properties.recoveryPointType}" \
--output table 2>/dev/null || echo "ERROR")
if [ "${BACKUP_LIST}" = "ERROR" ]; then
echo "❌ Unable to retrieve backup list"
echo "⚠️ Attempting alternative method using Azure SQL backup history..."
# Alternative: Get backup history from Azure SQL
BACKUP_HISTORY=$(az sql db list-restorable-dropped \
--resource-group ${RESOURCE_GROUP} \
--server ${SQL_SERVER} \
--query "[?databaseName == '${DATABASE_NAME}'].{Name:name, DeletionDate:deletionDate, EarliestRestoreDate:earliestRestoreDate}" \
--output table)
echo "Backup history:"
echo "${BACKUP_HISTORY}"
fi
# Select latest backup (or specific backup if provided)
SELECTED_BACKUP_TIME="${6:-$(date -u -d "1 day ago" +"%Y-%m-%dT%H:%M:%SZ")}"
echo "Selected backup time: ${SELECTED_BACKUP_TIME}"
echo ""
# Step 2: Validate backup integrity
echo "Step 2: Validating backup integrity..."
echo "Checking backup metadata and integrity..."
# Get backup metadata
BACKUP_METADATA=$(az backup recoverypoint show \
--resource-group ${RESOURCE_GROUP} \
--vault-name atp-backup-vault \
--container-name "IaasVMContainer;${SQL_SERVER}" \
--item-name "SQLDataBase;${DATABASE_NAME}" \
--name "${SELECTED_BACKUP_TIME}" \
--query "{Size:properties.backupSizeInGB, Type:properties.recoveryPointType, Time:properties.recoveryPointTime, Consistency:properties.isConsistent}" \
--output json 2>/dev/null || echo "{}")
if [ "${BACKUP_METADATA}" = "{}" ]; then
echo "⚠️ Unable to retrieve backup metadata via Azure Backup"
echo "Using Azure SQL backup metadata instead..."
# Get backup metadata from Azure SQL
BACKUP_SIZE=$(az sql db show \
--resource-group ${RESOURCE_GROUP} \
--server ${SQL_SERVER} \
--name ${DATABASE_NAME} \
--query "currentBackupStorageRedundancy" \
--output tsv 2>/dev/null || echo "N/A")
echo "Backup size: ${BACKUP_SIZE}"
else
BACKUP_SIZE=$(echo "${BACKUP_METADATA}" | jq -r '.Size')
BACKUP_TYPE=$(echo "${BACKUP_METADATA}" | jq -r '.Type')
BACKUP_CONSISTENT=$(echo "${BACKUP_METADATA}" | jq -r '.Consistency')
echo "Backup metadata:"
echo " Size: ${BACKUP_SIZE} GB"
echo " Type: ${BACKUP_TYPE}"
echo " Consistent: ${BACKUP_CONSISTENT}"
if [ "${BACKUP_CONSISTENT}" != "true" ]; then
echo "⚠️ WARNING: Backup is not marked as consistent"
else
echo "✅ Backup integrity validated (consistent)"
fi
fi
echo ""
# Step 3: Perform restoration
echo "Step 3: Performing restoration..."
RESTORE_DATABASE_NAME="${DATABASE_NAME}-restored-$(date +%Y%m%d-%H%M%S)"
# Restore database to new database name (test restoration)
az sql db restore \
--resource-group ${RESOURCE_GROUP} \
--server ${SQL_SERVER} \
--name ${DATABASE_NAME} \
--dest-name ${RESTORE_DATABASE_NAME} \
--time "${SELECTED_BACKUP_TIME}" \
--output none
RESTORATION_WAIT_START=$(date +%s)
# Wait for restoration to complete
echo "Waiting for restoration to complete..."
MAX_WAIT=3600 # 1 hour
ELAPSED=0
RESTORATION_COMPLETE=false
while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
RESTORE_STATUS=$(az sql db show \
--resource-group ${RESOURCE_GROUP} \
--server ${SQL_SERVER} \
--name ${RESTORE_DATABASE_NAME} \
--query "status" \
--output tsv 2>/dev/null || echo "NOT_FOUND")
if [ "${RESTORE_STATUS}" = "Online" ]; then
RESTORATION_COMPLETE=true
RESTORATION_END=$(date +%s)
RESTORATION_DURATION=$((RESTORATION_END - RESTORATION_WAIT_START))
echo "✅ Database restored successfully in ${RESTORATION_DURATION} seconds"
break
elif [ "${RESTORE_STATUS}" = "NOT_FOUND" ]; then
echo "Restoration in progress... (${ELAPSED}s/${MAX_WAIT}s)"
else
echo "Restoration status: ${RESTORE_STATUS} (${ELAPSED}s/${MAX_WAIT}s)"
fi
sleep 30
ELAPSED=$((ELAPSED + 30))
done
if [ "${RESTORATION_COMPLETE}" = false ]; then
echo "❌ Restoration did not complete within ${MAX_WAIT} seconds"
exit 1
fi
echo ""
# Step 4: Validate data completeness
echo "Step 4: Validating data completeness..."
# Get record counts from original and restored database
ORIGINAL_COUNT=$(az sql db query \
--resource-group ${RESOURCE_GROUP} \
--server ${SQL_SERVER} \
--database ${DATABASE_NAME} \
--query-text "SELECT COUNT(*) as RecordCount FROM AuditEvents" \
--output tsv 2>/dev/null || echo "0")
RESTORED_COUNT=$(az sql db query \
--resource-group ${RESOURCE_GROUP} \
--server ${SQL_SERVER} \
--database ${RESTORE_DATABASE_NAME} \
--query-text "SELECT COUNT(*) as RecordCount FROM AuditEvents" \
--output tsv 2>/dev/null || echo "0")
echo "Record counts:"
echo " Original database: ${ORIGINAL_COUNT} records"
echo " Restored database: ${RESTORED_COUNT} records"
# Calculate completeness percentage
if [ "${ORIGINAL_COUNT}" != "0" ]; then
COMPLETENESS_PERCENT=$(echo "scale=2; ${RESTORED_COUNT} * 100 / ${ORIGINAL_COUNT}" | bc)
echo " Completeness: ${COMPLETENESS_PERCENT}%"
if (( $(echo "${COMPLETENESS_PERCENT} >= 95" | bc -l) )); then
echo "✅ Data completeness validated (${COMPLETENESS_PERCENT}% >= 95%)"
else
echo "⚠️ WARNING: Data completeness below threshold (${COMPLETENESS_PERCENT}% < 95%)"
fi
else
echo "⚠️ Unable to calculate completeness (original count = 0)"
fi
# Validate key tables
echo ""
echo "Validating key tables..."
KEY_TABLES=("AuditEvents" "Tenants" "Policies" "Configurations")
for table in "${KEY_TABLES[@]}"; do
ORIGINAL_TABLE_COUNT=$(az sql db query \
--resource-group ${RESOURCE_GROUP} \
--server ${SQL_SERVER} \
--database ${DATABASE_NAME} \
--query-text "SELECT COUNT(*) as RecordCount FROM ${table}" \
--output tsv 2>/dev/null || echo "0")
RESTORED_TABLE_COUNT=$(az sql db query \
--resource-group ${RESOURCE_GROUP} \
--server ${SQL_SERVER} \
--database ${RESTORE_DATABASE_NAME} \
--query-text "SELECT COUNT(*) as RecordCount FROM ${table}" \
--output tsv 2>/dev/null || echo "0")
if [ "${ORIGINAL_TABLE_COUNT}" = "${RESTORED_TABLE_COUNT}" ]; then
echo " ✅ ${table}: ${RESTORED_TABLE_COUNT} records (match)"
else
echo " ⚠️ ${table}: ${RESTORED_TABLE_COUNT} records (original: ${ORIGINAL_TABLE_COUNT})"
fi
done
echo ""
# Step 5: Validate data integrity
echo "Step 5: Validating data integrity..."
# Check for data corruption using checksums
INTEGRITY_CHECK=$(az sql db query \
--resource-group ${RESOURCE_GROUP} \
--server ${SQL_SERVER} \
--database ${RESTORE_DATABASE_NAME} \
--query-text "SELECT COUNT(*) as CorruptedRecords FROM AuditEvents WHERE CHECKSUM(EventData) != StoredChecksum" \
--output tsv 2>/dev/null || echo "ERROR")
if [ "${INTEGRITY_CHECK}" = "ERROR" ]; then
echo "⚠️ Unable to perform integrity check"
elif [ "${INTEGRITY_CHECK}" = "0" ]; then
echo "✅ Data integrity validated (no corruption detected)"
else
echo "❌ Data integrity check failed: ${INTEGRITY_CHECK} corrupted records"
fi
echo ""
# Generate restoration report
# Default values that earlier fallback paths may leave unset or non-numeric, so the JSON below stays valid
BACKUP_TYPE="${BACKUP_TYPE:-Unknown}"
BACKUP_CONSISTENT="${BACKUP_CONSISTENT:-false}"
COMPLETENESS_PERCENT="${COMPLETENESS_PERCENT:-0}"
if [ "${INTEGRITY_CHECK}" = "ERROR" ]; then INTEGRITY_CHECK=-1; fi
RESTORATION_REPORT_FILE="backup-restoration-report-$(date +%Y%m%d-%H%M%S).json"
cat > ${RESTORATION_REPORT_FILE} <<EOF
{
"drillId": "backup-restore-$(date +%Y%m%d-%H%M%S)",
"timestamp": "$(date -u +"%Y-%m-%dT%H:%M:%SZ")",
"sqlServer": "${SQL_SERVER}",
"databaseName": "${DATABASE_NAME}",
"restoredDatabaseName": "${RESTORE_DATABASE_NAME}",
"backupTime": "${SELECTED_BACKUP_TIME}",
"restorationStartTime": "$(date -u -d @${RESTORATION_START} +"%Y-%m-%dT%H:%M:%SZ")",
"restorationEndTime": "$(date -u -d @${RESTORATION_END} +"%Y-%m-%dT%H:%M:%SZ")",
"restorationDuration": ${RESTORATION_DURATION},
"backupSize": "${BACKUP_SIZE}",
"backupType": "${BACKUP_TYPE}",
"backupConsistent": ${BACKUP_CONSISTENT},
"dataCompleteness": {
"originalCount": ${ORIGINAL_COUNT},
"restoredCount": ${RESTORED_COUNT},
"completenessPercent": ${COMPLETENESS_PERCENT}
},
"integrityCheck": {
"corruptedRecords": ${INTEGRITY_CHECK},
"status": "$(if [ "${INTEGRITY_CHECK}" = "0" ]; then echo "PASS"; else echo "FAIL"; fi)"
}
}
EOF
echo "✅ Restoration report generated: ${RESTORATION_REPORT_FILE}"
echo ""
echo "=========================================="
echo "Backup Restoration Drill Summary"
echo "=========================================="
echo "Restoration Duration: ${RESTORATION_DURATION} seconds"
echo "Data Completeness: ${COMPLETENESS_PERCENT}%"
echo "Data Integrity: $(if [ "${INTEGRITY_CHECK}" = "0" ]; then echo "✅ PASS"; else echo "❌ FAIL"; fi)"
echo ""
Expected Behavior
Restoration Phase (0-60 minutes):
- Backup listing: Available backups listed and validated
- Backup integrity: Backup metadata validated (size, type, consistency)
- Restoration: Database restored to test database
- Completion: Restoration completes and database comes online
Validation Phase (60-90 minutes):
- Data completeness: Record counts validated (original vs restored)
- Table validation: Key tables validated for completeness
- Data integrity: Checksum validation performed
- Report generation: Restoration report generated
Expected Metrics
| Metric | Target | Validation Method |
|---|---|---|
| Restoration Time | <60 minutes | Time from restoration start to database online |
| Backup Integrity | 100% | Backup metadata validation (consistent = true) |
| Data Completeness | ≥95% | Record count comparison (restored vs original) |
| Data Integrity | 100% | Checksum validation (no corrupted records) |
| Table Completeness | 100% | All key tables validated (count match) |
Validation Criteria
Success Criteria:
- ✅ Backup integrity validated (consistent = true)
- ✅ Restoration completed within 60 minutes
- ✅ Data completeness ≥95%
- ✅ Data integrity validated (no corruption)
- ✅ All key tables validated
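These criteria can be checked mechanically once the measured values are known. A minimal sketch (hypothetical helper; thresholds mirror the metrics table above, with completeness as an integer percent for simplicity):

```shell
#!/usr/bin/env bash
# evaluate_restoration_drill: apply the backup-restoration success criteria.
# Arguments: restoration duration (seconds), data completeness (integer percent),
# corrupted record count. Hypothetical helper; thresholds from the table above
# (<=60 min restore, >=95% completeness, zero corruption).
evaluate_restoration_drill() {
local duration="$1" completeness="$2" corrupted="$3"
local fail=0
if (( duration <= 3600 )); then echo "restoration time: PASS"; else echo "restoration time: FAIL"; fail=1; fi
if (( completeness >= 95 )); then echo "completeness: PASS"; else echo "completeness: FAIL"; fail=1; fi
if (( corrupted == 0 )); then echo "integrity: PASS"; else echo "integrity: FAIL"; fail=1; fi
return ${fail}
}

# Example: 42-minute restore, 99% complete, no corruption
evaluate_restoration_drill 2520 99 0
# prints:
# restoration time: PASS
# completeness: PASS
# integrity: PASS
```

The non-zero return code on any failed criterion lets the drill pipeline abort in the same way the validation scripts above do.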
Point-in-Time Recovery¶
Point-in-time recovery drill experiments validate that ATP services can restore to a specific timestamp with transaction consistency validation and selective recovery capabilities.
Hypothesis
"When data needs to be restored to a specific point in time, ATP services will restore the database to the requested timestamp, validate transaction consistency, support selective recovery for specific tenants, and ensure all restored data is consistent and functional."
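The recovery-window validation this hypothesis implies can be isolated into a small helper. A sketch assuming GNU date and the 35-day Azure SQL PITR retention used in this drill (function name is illustrative):

```shell
#!/usr/bin/env bash
# validate_pitr_target: check that a target timestamp falls inside the recovery
# window (not in the future, not older than the retention limit).
# Sketch assuming GNU date; retention defaults to the 35-day Azure SQL PITR limit.
validate_pitr_target() {
local target="$1" retention_days="${2:-35}"
local target_epoch now earliest
target_epoch=$(date -u -d "${target}" +%s 2>/dev/null) || { echo "invalid"; return 1; }
now=$(date -u +%s)
earliest=$(( now - retention_days * 86400 ))
if (( target_epoch > now )); then echo "in the future"; return 1; fi
if (( target_epoch < earliest )); then echo "outside retention window"; return 1; fi
echo "ok"
}

# Example: two hours ago is inside the window
validate_pitr_target "$(date -u -d '2 hours ago' +%Y-%m-%dT%H:%M:%SZ)"   # prints "ok"
```

The drill script below performs the same three checks inline before issuing the restore.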
Experiment Configuration
Point-in-Time Recovery Script:
#!/bin/bash
# scripts/execute-point-in-time-recovery-drill.sh
SQL_SERVER="${1:-atp-sql-eastus}"
RESOURCE_GROUP="${2:-atp-production}"
DATABASE_NAME="${3:-atp-database}"
TARGET_TIMESTAMP="${4:-$(date -u -d "2 hours ago" +"%Y-%m-%dT%H:%M:%SZ")}"
TENANT_ID="${5:-}" # Optional: specific tenant for selective recovery
RECOVERY_START=$(date +%s)
echo "🧪 Starting Point-in-Time Recovery Drill"
echo "SQL Server: ${SQL_SERVER}"
echo "Database: ${DATABASE_NAME}"
echo "Target Timestamp: ${TARGET_TIMESTAMP}"
echo "Tenant ID: ${TENANT_ID:-All tenants}"
echo "Start Time: $(date -u +"%Y-%m-%d %H:%M:%S UTC")"
echo ""
# Step 1: Validate target timestamp
echo "Step 1: Validating target timestamp..."
TARGET_EPOCH=$(date -u -d "${TARGET_TIMESTAMP}" +%s 2>/dev/null || echo "0")
CURRENT_EPOCH=$(date +%s)
EARLIEST_RECOVERY=$(date -u -d "35 days ago" +%s) # Azure SQL PITR limit
if [ "${TARGET_EPOCH}" = "0" ]; then
echo "❌ Invalid target timestamp: ${TARGET_TIMESTAMP}"
exit 1
fi
if (( TARGET_EPOCH < EARLIEST_RECOVERY )); then
echo "❌ Target timestamp is beyond recovery window (35 days)"
exit 1
fi
if (( TARGET_EPOCH > CURRENT_EPOCH )); then
echo "❌ Target timestamp is in the future"
exit 1
fi
echo "✅ Target timestamp validated"
echo " Target: ${TARGET_TIMESTAMP}"
echo " Current: $(date -u +"%Y-%m-%dT%H:%M:%SZ")"
echo ""
# Step 2: Get data state at target timestamp
echo "Step 2: Getting data state at target timestamp..."
echo "Querying data state from backup history..."
# Get transaction count at target timestamp
TRANSACTION_COUNT_AT_TARGET=$(az sql db query \
--resource-group ${RESOURCE_GROUP} \
--server ${SQL_SERVER} \
--database ${DATABASE_NAME} \
--query-text "SELECT COUNT(*) as TransactionCount FROM AuditEvents WHERE LastModified <= '${TARGET_TIMESTAMP}'" \
--output tsv 2>/dev/null || echo "0")
echo "Transactions at target timestamp: ${TRANSACTION_COUNT_AT_TARGET}"
echo ""
# Step 3: Perform point-in-time recovery
echo "Step 3: Performing point-in-time recovery..."
RECOVERED_DATABASE_NAME="${DATABASE_NAME}-pitr-$(date +%Y%m%d-%H%M%S)"
if [ -n "${TENANT_ID}" ]; then
echo "⚠️ Selective recovery mode: Tenant ${TENANT_ID} only"
# Note: Selective recovery requires custom logic or a separate database restore.
# For this drill, the full database is restored and the tenant's data filtered afterward.
fi
# Restore database to point in time
az sql db restore \
--resource-group ${RESOURCE_GROUP} \
--server ${SQL_SERVER} \
--name ${DATABASE_NAME} \
--dest-name ${RECOVERED_DATABASE_NAME} \
--time "${TARGET_TIMESTAMP}" \
--output none
RECOVERY_WAIT_START=$(date +%s)
# Wait for recovery to complete
echo "Waiting for point-in-time recovery to complete..."
MAX_WAIT=3600 # 1 hour
ELAPSED=0
RECOVERY_COMPLETE=false
while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
RECOVERY_STATUS=$(az sql db show \
--resource-group ${RESOURCE_GROUP} \
--server ${SQL_SERVER} \
--name ${RECOVERED_DATABASE_NAME} \
--query "status" \
--output tsv 2>/dev/null || echo "NOT_FOUND")
if [ "${RECOVERY_STATUS}" = "Online" ]; then
RECOVERY_COMPLETE=true
RECOVERY_END=$(date +%s)
RECOVERY_DURATION=$((RECOVERY_END - RECOVERY_WAIT_START))
echo "✅ Database recovered successfully in ${RECOVERY_DURATION} seconds"
break
elif [ "${RECOVERY_STATUS}" = "NOT_FOUND" ]; then
echo "Recovery in progress... (${ELAPSED}s/${MAX_WAIT}s)"
else
echo "Recovery status: ${RECOVERY_STATUS} (${ELAPSED}s/${MAX_WAIT}s)"
fi
sleep 30
ELAPSED=$((ELAPSED + 30))
done
if [ "${RECOVERY_COMPLETE}" = false ]; then
echo "❌ Recovery did not complete within ${MAX_WAIT} seconds"
exit 1
fi
echo ""
# Step 4: Validate transaction consistency
echo "Step 4: Validating transaction consistency..."
# Get recovered transaction count
RECOVERED_TRANSACTION_COUNT=$(az sql db query \
--resource-group ${RESOURCE_GROUP} \
--server ${SQL_SERVER} \
--database ${RECOVERED_DATABASE_NAME} \
--query-text "SELECT COUNT(*) as TransactionCount FROM AuditEvents" \
--output tsv 2>/dev/null || echo "0")
echo "Transaction counts:"
echo " Expected at target timestamp: ${TRANSACTION_COUNT_AT_TARGET}"
echo " Recovered: ${RECOVERED_TRANSACTION_COUNT}"
# Validate transaction count matches
if [ "${TRANSACTION_COUNT_AT_TARGET}" = "${RECOVERED_TRANSACTION_COUNT}" ]; then
echo "✅ Transaction count matches"
else
DIFF=$((TRANSACTION_COUNT_AT_TARGET - RECOVERED_TRANSACTION_COUNT))
if (( DIFF < 0 )); then
DIFF=$((DIFF * -1))
fi
echo "⚠️ Transaction count mismatch: ${DIFF} difference"
fi
# Validate no transactions after target timestamp
TRANSACTIONS_AFTER_TARGET=$(az sql db query \
--resource-group ${RESOURCE_GROUP} \
--server ${SQL_SERVER} \
--database ${RECOVERED_DATABASE_NAME} \
--query-text "SELECT COUNT(*) as TransactionCount FROM AuditEvents WHERE LastModified > '${TARGET_TIMESTAMP}'" \
--output tsv 2>/dev/null || echo "0")
if [ "${TRANSACTIONS_AFTER_TARGET}" = "0" ]; then
echo "✅ No transactions after target timestamp (consistent)"
else
echo "❌ Transactions found after target timestamp: ${TRANSACTIONS_AFTER_TARGET}"
fi
# Validate transaction integrity
echo ""
echo "Validating transaction integrity..."
# Check for orphaned transactions
ORPHANED_TRANSACTIONS=$(az sql db query \
--resource-group ${RESOURCE_GROUP} \
--server ${SQL_SERVER} \
--database ${RECOVERED_DATABASE_NAME} \
--query-text "SELECT COUNT(*) as OrphanedCount FROM AuditEvents a LEFT JOIN Tenants t ON a.TenantId = t.Id WHERE t.Id IS NULL" \
--output tsv 2>/dev/null || echo "0")
if [ "${ORPHANED_TRANSACTIONS}" = "0" ]; then
echo "✅ No orphaned transactions detected"
else
echo "⚠️ Orphaned transactions detected: ${ORPHANED_TRANSACTIONS}"
fi
echo ""
# Step 5: Selective recovery validation (if tenant specified)
if [ -n "${TENANT_ID}" ]; then
echo "Step 5: Validating selective recovery for tenant ${TENANT_ID}..."
# Get tenant data in recovered database
TENANT_DATA_COUNT=$(az sql db query \
--resource-group ${RESOURCE_GROUP} \
--server ${SQL_SERVER} \
--database ${RECOVERED_DATABASE_NAME} \
--query-text "SELECT COUNT(*) as RecordCount FROM AuditEvents WHERE TenantId = '${TENANT_ID}' AND LastModified <= '${TARGET_TIMESTAMP}'" \
--output tsv 2>/dev/null || echo "0")
echo "Tenant ${TENANT_ID} data in recovered database: ${TENANT_DATA_COUNT} records"
# Get expected tenant data count
EXPECTED_TENANT_COUNT=$(az sql db query \
--resource-group ${RESOURCE_GROUP} \
--server ${SQL_SERVER} \
--database ${DATABASE_NAME} \
--query-text "SELECT COUNT(*) as RecordCount FROM AuditEvents WHERE TenantId = '${TENANT_ID}' AND LastModified <= '${TARGET_TIMESTAMP}'" \
--output tsv 2>/dev/null || echo "0")
if [ "${TENANT_DATA_COUNT}" = "${EXPECTED_TENANT_COUNT}" ]; then
echo "✅ Tenant data recovery validated (${TENANT_DATA_COUNT} records)"
else
echo "⚠️ Tenant data mismatch: ${TENANT_DATA_COUNT} (expected: ${EXPECTED_TENANT_COUNT})"
fi
echo ""
fi
# Generate recovery report
RECOVERY_REPORT_FILE="pitr-recovery-report-$(date +%Y%m%d-%H%M%S).json"
cat > ${RECOVERY_REPORT_FILE} <<EOF
{
"drillId": "pitr-$(date +%Y%m%d-%H%M%S)",
"timestamp": "$(date -u +"%Y-%m-%dT%H:%M:%SZ")",
"sqlServer": "${SQL_SERVER}",
"databaseName": "${DATABASE_NAME}",
"recoveredDatabaseName": "${RECOVERED_DATABASE_NAME}",
"targetTimestamp": "${TARGET_TIMESTAMP}",
"tenantId": "${TENANT_ID:-null}",
"recoveryStartTime": "$(date -u -d @${RECOVERY_START} +"%Y-%m-%dT%H:%M:%SZ")",
"recoveryEndTime": "$(date -u -d @${RECOVERY_END} +"%Y-%m-%dT%H:%M:%SZ")",
"recoveryDuration": ${RECOVERY_DURATION},
"transactionConsistency": {
"expectedCount": ${TRANSACTION_COUNT_AT_TARGET},
"recoveredCount": ${RECOVERED_TRANSACTION_COUNT},
"transactionsAfterTarget": ${TRANSACTIONS_AFTER_TARGET},
"orphanedTransactions": ${ORPHANED_TRANSACTIONS},
"status": "$(if [ "${TRANSACTIONS_AFTER_TARGET}" = "0" ] && [ "${ORPHANED_TRANSACTIONS}" = "0" ]; then echo "PASS"; else echo "FAIL"; fi)"
},
"selectiveRecovery": {
"tenantId": "${TENANT_ID:-null}",
"tenantDataCount": ${TENANT_DATA_COUNT:-0},
"expectedTenantCount": ${EXPECTED_TENANT_COUNT:-0}
}
}
EOF
echo "✅ Recovery report generated: ${RECOVERY_REPORT_FILE}"
echo ""
echo "=========================================="
echo "Point-in-Time Recovery Drill Summary"
echo "=========================================="
echo "Recovery Duration: ${RECOVERY_DURATION} seconds"
echo "Target Timestamp: ${TARGET_TIMESTAMP}"
echo "Transaction Consistency: $(if [ "${TRANSACTIONS_AFTER_TARGET}" = "0" ] && [ "${ORPHANED_TRANSACTIONS}" = "0" ]; then echo "✅ PASS"; else echo "❌ FAIL"; fi)"
echo ""
Expected Behavior
Recovery Phase (0-60 minutes):
- Timestamp validation: Target timestamp validated (within recovery window)
- Data state: Data state at target timestamp retrieved
- Point-in-time recovery: Database restored to target timestamp
- Completion: Recovery completes and database comes online
Validation Phase (60-90 minutes):
- Transaction consistency: Transaction counts validated
- Timestamp validation: No transactions after target timestamp
- Transaction integrity: Orphaned transactions checked
- Selective recovery: Tenant-specific recovery validated (if applicable)
Expected Metrics
| Metric | Target | Validation Method |
|---|---|---|
| Recovery Time | <60 minutes | Time from recovery start to database online |
| Transaction Consistency | 100% | Transaction count matches expected at timestamp |
| Timestamp Accuracy | 100% | No transactions after target timestamp |
| Transaction Integrity | 100% | No orphaned transactions |
| Selective Recovery | 100% | Tenant-specific data matches expected |
Validation Criteria
Success Criteria:
- ✅ Target timestamp validated (within recovery window)
- ✅ Recovery completed within 60 minutes
- ✅ Transaction consistency validated (count matches)
- ✅ No transactions after target timestamp
- ✅ Transaction integrity validated (no orphaned transactions)
- ✅ Selective recovery validated (if tenant specified)
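The success criteria above map directly onto fields in the `pitr-recovery-report-*.json` generated by the drill script, so they can be gated mechanically. The sketch below is illustrative, not part of the drill scripts: the `check_pitr_report` helper and the inline sample report are hypothetical, assuming the report schema produced earlier in this section.

```shell
#!/bin/bash
# Hypothetical helper: evaluate PITR success criteria from the drill's JSON report.
check_pitr_report() {
  local report_file="$1"
  local status duration
  status=$(jq -r '.transactionConsistency.status' "${report_file}")
  duration=$(jq -r '.recoveryDuration' "${report_file}")
  # PASS only if consistency checks passed and recovery finished within 60 minutes
  if [ "${status}" = "PASS" ] && [ "${duration}" -le 3600 ]; then
    echo "PASS"
  else
    echo "FAIL"
  fi
}

# Illustrative sample report matching the schema produced by the drill above
cat > /tmp/pitr-sample-report.json <<'EOF'
{
  "recoveryDuration": 1800,
  "transactionConsistency": {
    "expectedCount": 1000,
    "recoveredCount": 1000,
    "transactionsAfterTarget": 0,
    "orphanedTransactions": 0,
    "status": "PASS"
  }
}
EOF
check_pitr_report /tmp/pitr-sample-report.json
```

A gate like this could run in CI after each drill so regressions in recovery time or consistency fail loudly instead of being buried in the report file.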
Corruption Recovery¶
Corruption recovery drill experiments validate that ATP services can detect data corruption and recover from clean backups with integrity verification and hash chain validation.
Hypothesis
"When data corruption is detected, ATP services will identify corrupted records through integrity verification, restore from clean backup, validate hash chain after restoration, and ensure all restored data is consistent and functional."
Experiment Configuration
Corruption Recovery Script:
#!/bin/bash
# scripts/execute-corruption-recovery-drill.sh
SQL_SERVER="${1:-atp-sql-eastus}"
RESOURCE_GROUP="${2:-atp-production}"
DATABASE_NAME="${3:-atp-database}"
CORRUPTION_TABLE="${4:-AuditEvents}"
CORRUPTION_COUNT="${5:-10}" # Number of records to corrupt
RECOVERY_START=$(date +%s)
echo "🧪 Starting Corruption Recovery Drill"
echo "SQL Server: ${SQL_SERVER}"
echo "Database: ${DATABASE_NAME}"
echo "Corruption Table: ${CORRUPTION_TABLE}"
echo "Corruption Count: ${CORRUPTION_COUNT}"
echo "Start Time: $(date -u +"%Y-%m-%d %H:%M:%S UTC")"
echo ""
# Step 1: Simulate data corruption
echo "Step 1: Simulating data corruption..."
echo "⚠️ WARNING: This will modify data in the database"
# Create corruption by modifying checksums
CORRUPTION_SQL="
UPDATE TOP(${CORRUPTION_COUNT}) ${CORRUPTION_TABLE}
SET EventData = CONCAT(EventData, 'CORRUPTED')
WHERE EventData IS NOT NULL
"
# Execute corruption (in test environment only)
if [ "${ENVIRONMENT}" = "test" ] || [ "${ENVIRONMENT}" = "staging" ]; then
echo "Executing corruption simulation..."
az sql db query \
--resource-group ${RESOURCE_GROUP} \
--server ${SQL_SERVER} \
--database ${DATABASE_NAME} \
--query-text "${CORRUPTION_SQL}" \
--output none
echo "✅ Corruption simulated: ${CORRUPTION_COUNT} records modified"
else
echo "⚠️ Skipping actual corruption (production environment)"
echo "Using test database for corruption simulation"
fi
echo ""
# Step 2: Detect corruption with integrity verification
echo "Step 2: Detecting corruption with integrity verification..."
# Run integrity check
CORRUPTION_DETECTED=$(az sql db query \
--resource-group ${RESOURCE_GROUP} \
--server ${SQL_SERVER} \
--database ${DATABASE_NAME} \
--query-text "SELECT COUNT(*) as CorruptedCount FROM ${CORRUPTION_TABLE} WHERE CHECKSUM(EventData) != StoredChecksum" \
--output tsv 2>/dev/null || echo "0")
if [ "${CORRUPTION_DETECTED}" = "0" ]; then
echo "⚠️ No corruption detected"
echo "Corruption may not have been applied or checksums are not being validated"
else
echo "✅ Corruption detected: ${CORRUPTION_DETECTED} corrupted records"
fi
# Get corrupted record IDs
CORRUPTED_RECORD_IDS=$(az sql db query \
--resource-group ${RESOURCE_GROUP} \
--server ${SQL_SERVER} \
--database ${DATABASE_NAME} \
--query-text "SELECT TOP(100) Id FROM ${CORRUPTION_TABLE} WHERE CHECKSUM(EventData) != StoredChecksum" \
--output tsv 2>/dev/null || echo "")
echo "Corrupted record IDs (first 100 chars): $(echo "${CORRUPTED_RECORD_IDS}" | tr '\n' ' ' | head -c 100)"
echo ""
# Step 3: Identify clean backup
echo "Step 3: Identifying clean backup..."
# Get backup history and find last clean backup (restorable dropped databases)
BACKUP_HISTORY=$(az sql db list-deleted \
--resource-group ${RESOURCE_GROUP} \
--server ${SQL_SERVER} \
--query "[?databaseName == '${DATABASE_NAME}'].{Name:name, EarliestRestoreDate:earliestRestoreDate, LatestRestoreDate:latestRestoreDate}" \
--output json 2>/dev/null || echo "[]")
# For this drill, use a recent backup (1 day ago)
CLEAN_BACKUP_TIME=$(date -u -d "1 day ago" +"%Y-%m-%dT%H:%M:%SZ")
echo "Clean backup time: ${CLEAN_BACKUP_TIME}"
echo ""
# Step 4: Restore from clean backup
echo "Step 4: Restoring from clean backup..."
RESTORED_DATABASE_NAME="${DATABASE_NAME}-corruption-recovered-$(date +%Y%m%d-%H%M%S)"
# Restore database from clean backup
az sql db restore \
--resource-group ${RESOURCE_GROUP} \
--server ${SQL_SERVER} \
--name ${DATABASE_NAME} \
--dest-name ${RESTORED_DATABASE_NAME} \
--restore-point-in-time "${CLEAN_BACKUP_TIME}" \
--output none
RESTORATION_WAIT_START=$(date +%s)
# Wait for restoration to complete
echo "Waiting for restoration to complete..."
MAX_WAIT=3600 # 1 hour
ELAPSED=0
RESTORATION_COMPLETE=false
while [ ${ELAPSED} -lt ${MAX_WAIT} ]; do
RESTORE_STATUS=$(az sql db show \
--resource-group ${RESOURCE_GROUP} \
--server ${SQL_SERVER} \
--name ${RESTORED_DATABASE_NAME} \
--query "status" \
--output tsv 2>/dev/null || echo "NOT_FOUND")
if [ "${RESTORE_STATUS}" = "Online" ]; then
RESTORATION_COMPLETE=true
RESTORATION_END=$(date +%s)
RESTORATION_DURATION=$((RESTORATION_END - RESTORATION_WAIT_START))
echo "✅ Database restored successfully in ${RESTORATION_DURATION} seconds"
break
elif [ "${RESTORE_STATUS}" = "NOT_FOUND" ]; then
echo "Restoration in progress... (${ELAPSED}s/${MAX_WAIT}s)"
else
echo "Restoration status: ${RESTORE_STATUS} (${ELAPSED}s/${MAX_WAIT}s)"
fi
sleep 30
ELAPSED=$((ELAPSED + 30))
done
if [ "${RESTORATION_COMPLETE}" = false ]; then
echo "❌ Restoration did not complete within ${MAX_WAIT} seconds"
exit 1
fi
echo ""
# Step 5: Validate hash chain after restoration
echo "Step 5: Validating hash chain after restoration..."
# Check for corruption in restored database
RESTORED_CORRUPTION_COUNT=$(az sql db query \
--resource-group ${RESOURCE_GROUP} \
--server ${SQL_SERVER} \
--database ${RESTORED_DATABASE_NAME} \
--query-text "SELECT COUNT(*) as CorruptedCount FROM ${CORRUPTION_TABLE} WHERE CHECKSUM(EventData) != StoredChecksum" \
--output tsv 2>/dev/null || echo "0")
if [ "${RESTORED_CORRUPTION_COUNT}" = "0" ]; then
echo "✅ No corruption detected in restored database"
else
echo "❌ Corruption still present in restored database: ${RESTORED_CORRUPTION_COUNT} records"
fi
# Validate hash chain integrity
echo ""
echo "Validating hash chain integrity..."
# Check hash chain continuity
HASH_CHAIN_BREAKS=$(az sql db query \
--resource-group ${RESOURCE_GROUP} \
--server ${SQL_SERVER} \
--database ${RESTORED_DATABASE_NAME} \
--query-text "
SELECT COUNT(*) as ChainBreaks
FROM ${CORRUPTION_TABLE} a1
LEFT JOIN ${CORRUPTION_TABLE} a2 ON a1.PreviousHash = a2.EventHash
WHERE a1.PreviousHash IS NOT NULL AND a2.EventHash IS NULL
" \
--output tsv 2>/dev/null || echo "0")
if [ "${HASH_CHAIN_BREAKS}" = "0" ]; then
echo "✅ Hash chain integrity validated (no breaks)"
else
echo "⚠️ Hash chain breaks detected: ${HASH_CHAIN_BREAKS}"
fi
# Validate hash chain completeness
HASH_CHAIN_COMPLETE=$(az sql db query \
--resource-group ${RESOURCE_GROUP} \
--server ${SQL_SERVER} \
--database ${RESTORED_DATABASE_NAME} \
--query-text "
SELECT COUNT(*) as IncompleteChains
FROM ${CORRUPTION_TABLE}
WHERE PreviousHash IS NULL AND Id NOT IN (SELECT MIN(Id) FROM ${CORRUPTION_TABLE} GROUP BY TenantId)
" \
--output tsv 2>/dev/null || echo "0")
if [ "${HASH_CHAIN_COMPLETE}" = "0" ]; then
echo "✅ Hash chain completeness validated"
else
echo "⚠️ Incomplete hash chains detected: ${HASH_CHAIN_COMPLETE}"
fi
echo ""
# Step 6: Validate data consistency
echo "Step 6: Validating data consistency..."
# Compare record counts
ORIGINAL_COUNT=$(az sql db query \
--resource-group ${RESOURCE_GROUP} \
--server ${SQL_SERVER} \
--database ${DATABASE_NAME} \
--query-text "SELECT COUNT(*) as RecordCount FROM ${CORRUPTION_TABLE} WHERE LastModified <= '${CLEAN_BACKUP_TIME}'" \
--output tsv 2>/dev/null || echo "0")
RESTORED_COUNT=$(az sql db query \
--resource-group ${RESOURCE_GROUP} \
--server ${SQL_SERVER} \
--database ${RESTORED_DATABASE_NAME} \
--query-text "SELECT COUNT(*) as RecordCount FROM ${CORRUPTION_TABLE}" \
--output tsv 2>/dev/null || echo "0")
echo "Record counts:"
echo " Original (before corruption): ${ORIGINAL_COUNT} records"
echo " Restored: ${RESTORED_COUNT} records"
if [ "${ORIGINAL_COUNT}" = "${RESTORED_COUNT}" ]; then
echo "✅ Record count matches"
else
echo "⚠️ Record count mismatch: ${RESTORED_COUNT} (expected: ${ORIGINAL_COUNT})"
fi
echo ""
# Generate recovery report
RECOVERY_REPORT_FILE="corruption-recovery-report-$(date +%Y%m%d-%H%M%S).json"
cat > ${RECOVERY_REPORT_FILE} <<EOF
{
"drillId": "corruption-recovery-$(date +%Y%m%d-%H%M%S)",
"timestamp": "$(date -u +"%Y-%m-%dT%H:%M:%SZ")",
"sqlServer": "${SQL_SERVER}",
"databaseName": "${DATABASE_NAME}",
"restoredDatabaseName": "${RESTORED_DATABASE_NAME}",
"corruptionTable": "${CORRUPTION_TABLE}",
"corruptionCount": ${CORRUPTION_COUNT},
"cleanBackupTime": "${CLEAN_BACKUP_TIME}",
"recoveryStartTime": "$(date -u -d @${RECOVERY_START} +"%Y-%m-%dT%H:%M:%SZ")",
"recoveryEndTime": "$(date -u -d @${RESTORATION_END} +"%Y-%m-%dT%H:%M:%SZ")",
"recoveryDuration": ${RESTORATION_DURATION},
"corruptionDetection": {
"corruptedRecords": ${CORRUPTION_DETECTED},
"status": "$(if [ "${CORRUPTION_DETECTED}" != "0" ]; then echo "DETECTED"; else echo "NOT_DETECTED"; fi)"
},
"hashChainValidation": {
"restoredCorruptionCount": ${RESTORED_CORRUPTION_COUNT},
"hashChainBreaks": ${HASH_CHAIN_BREAKS},
"incompleteChains": ${HASH_CHAIN_COMPLETE},
"status": "$(if [ "${RESTORED_CORRUPTION_COUNT}" = "0" ] && [ "${HASH_CHAIN_BREAKS}" = "0" ] && [ "${HASH_CHAIN_COMPLETE}" = "0" ]; then echo "PASS"; else echo "FAIL"; fi)"
},
"dataConsistency": {
"originalCount": ${ORIGINAL_COUNT},
"restoredCount": ${RESTORED_COUNT},
"status": "$(if [ "${ORIGINAL_COUNT}" = "${RESTORED_COUNT}" ]; then echo "PASS"; else echo "FAIL"; fi)"
}
}
EOF
echo "✅ Recovery report generated: ${RECOVERY_REPORT_FILE}"
echo ""
echo "=========================================="
echo "Corruption Recovery Drill Summary"
echo "=========================================="
echo "Corruption Detected: ${CORRUPTION_DETECTED} records"
echo "Recovery Duration: ${RESTORATION_DURATION} seconds"
echo "Hash Chain Validation: $(if [ "${RESTORED_CORRUPTION_COUNT}" = "0" ] && [ "${HASH_CHAIN_BREAKS}" = "0" ] && [ "${HASH_CHAIN_COMPLETE}" = "0" ]; then echo "✅ PASS"; else echo "❌ FAIL"; fi)"
echo "Data Consistency: $(if [ "${ORIGINAL_COUNT}" = "${RESTORED_COUNT}" ]; then echo "✅ PASS"; else echo "❌ FAIL"; fi)"
echo ""
Expected Behavior
Corruption Phase (0-5 minutes):
- Corruption simulation: Data corruption simulated (checksum modification)
- Corruption detection: Integrity verification detects corruption
- Corrupted records: Corrupted record IDs identified
Recovery Phase (5-65 minutes):
- Clean backup identification: Last clean backup identified
- Restoration: Database restored from clean backup
- Completion: Restoration completes and database comes online
Validation Phase (65-90 minutes):
- Hash chain validation: Hash chain integrity validated (no breaks)
- Corruption check: No corruption in restored database
- Data consistency: Record counts validated
- Hash chain completeness: Hash chain completeness validated
Expected Metrics
| Metric | Target | Validation Method |
|---|---|---|
| Corruption Detection | 100% | Integrity verification detects all corrupted records |
| Recovery Time | <60 minutes | Time from restoration start to database online |
| Hash Chain Integrity | 100% | No hash chain breaks detected |
| Corruption Removal | 100% | No corruption in restored database |
| Data Consistency | 100% | Record count matches original |
Validation Criteria
Success Criteria:
- ✅ Corruption detected through integrity verification
- ✅ Clean backup identified and validated
- ✅ Recovery completed within 60 minutes
- ✅ Hash chain integrity validated (no breaks)
- ✅ No corruption in restored database
- ✅ Data consistency validated (record count matches)
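The hash-chain checks above rely on each audit event carrying an `EventHash` plus the `PreviousHash` of its predecessor. The following self-contained sketch (pure shell with `sha256sum`, purely illustrative; the event strings are invented) shows why a single corrupted record breaks verification of every later link:

```shell
#!/bin/bash
# Illustrative hash chain: EventHash = SHA-256(PreviousHash + EventData).
events=("login user=alice" "update record=42" "logout user=alice")
prev="GENESIS"
hashes=()
for data in "${events[@]}"; do
  h=$(printf '%s%s' "${prev}" "${data}" | sha256sum | cut -d' ' -f1)
  hashes+=("${h}")
  prev="${h}"
done

# Verify by recomputing each link; any EventData change invalidates the stored hash.
verify_chain() {
  local prev="GENESIS" i expected
  for i in "${!events[@]}"; do
    expected=$(printf '%s%s' "${prev}" "${events[$i]}" | sha256sum | cut -d' ' -f1)
    if [ "${expected}" != "${hashes[$i]}" ]; then
      echo "BREAK at event ${i}"
      return 1
    fi
    prev="${hashes[$i]}"
  done
  echo "CHAIN OK"
}

verify_chain                     # intact chain: prints "CHAIN OK"
events[1]="update record=42CORRUPTED"
verify_chain || true             # prints "BREAK at event 1"
```

This is the same property the drill's SQL queries test at scale: a break is detected at the first record whose recomputed hash no longer matches, which also pinpoints where restoration must begin.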
Data Recovery Drill Visualization
graph TD
BACKUP[Backup Restoration] --> LIST[List Backups]
LIST --> VALIDATE[Validate Integrity]
VALIDATE --> RESTORE[Restore Database]
RESTORE --> COMPLETE[Validate Completeness]
PITR[Point-in-Time Recovery] --> TIMESTAMP[Validate Timestamp]
TIMESTAMP --> RECOVER[Recover to Timestamp]
RECOVER --> CONSISTENCY[Validate Consistency]
CONSISTENCY --> SELECTIVE[Selective Recovery]
CORRUPTION[Corruption Recovery] --> DETECT[Detect Corruption]
DETECT --> CLEAN[Identify Clean Backup]
CLEAN --> RESTORE2[Restore from Backup]
RESTORE2 --> HASH[Validate Hash Chain]
COMPLETE --> SUCCESS[Recovery Successful]
SELECTIVE --> SUCCESS
HASH --> SUCCESS
style BACKUP fill:#FFE5B4
style PITR fill:#FFE5B4
style CORRUPTION fill:#FFE5B4
style SUCCESS fill:#90EE90
Summary: Data Recovery Drills¶
- Backup Restoration Drill: Validates backup restoration from Azure Backup with integrity validation, restoration time measurement, and data completeness verification; expects backup integrity validated, restoration completed within 60 minutes, data completeness ≥95%, data integrity validated, and all key tables validated
- Point-in-Time Recovery: Validates database restoration to specific timestamp with transaction consistency validation and selective recovery capabilities; expects target timestamp validated, recovery completed within 60 minutes, transaction consistency validated, no transactions after target timestamp, transaction integrity validated, and selective recovery validated
- Corruption Recovery: Validates data corruption detection and recovery from clean backups with integrity verification and hash chain validation; expects corruption detected through integrity verification, clean backup identified, recovery completed within 60 minutes, hash chain integrity validated, no corruption in restored database, and data consistency validated
- Monitoring and Validation: Comprehensive scripts for backup restoration, point-in-time recovery, corruption recovery, integrity verification, hash chain validation, transaction consistency validation, and data completeness validation
Chaos GameDays¶
Purpose: Define comprehensive chaos engineering GameDay procedures for ATP, validating multi-failure scenarios, incident response capabilities, team coordination, and system resilience through quarterly exercises that simulate complex real-world failure scenarios.
What is a GameDay?¶
GameDay exercises are quarterly chaos engineering events that validate ATP's resilience through simulated complex failure scenarios involving multiple simultaneous failures, multi-team participation, and incident response validation.
GameDay Definition
A Chaos GameDay is a structured, time-boxed chaos engineering exercise that:
- Simulates Real-World Failures: Multiple simultaneous failures that could occur in production
- Validates Incident Response: Tests team coordination, communication, and response procedures
- Tests System Resilience: Validates that ATP services handle complex failure scenarios gracefully
- Identifies Gaps: Discovers resilience gaps, process weaknesses, and improvement opportunities
- Improves Team Preparedness: Enhances team skills, runbook effectiveness, and incident response capabilities
GameDay Characteristics
| Characteristic | Description |
|---|---|
| Frequency | Quarterly (4 times per year) |
| Duration | 4 hours (structured time-box) |
| Participation | Multi-team (Platform, SRE, Security, Service teams) |
| Complexity | Multiple simultaneous failures |
| Environment | Production-like (staging) or production (with approval) |
| Focus | Learning, improvement, resilience validation |
GameDay Objectives
- Validate System Resilience: Ensure ATP services handle complex failure scenarios
- Test Incident Response: Validate team coordination and response procedures
- Improve Runbooks: Identify gaps and update runbooks based on learnings
- Enhance Team Skills: Build team confidence and incident response capabilities
- Identify Improvements: Discover resilience gaps and assign improvement actions
GameDay Structure¶
Quarterly ATP Chaos GameDay (4 Hours)
Hour 1: Preparation
Preparation Activities:
#!/bin/bash
# scripts/gameday-preparation.sh
GAMEDAY_ID="${1:-gameday-$(date +%Y%m%d)}"
SCENARIO="${2:-scenario-1}"
ENVIRONMENT="${3:-staging}"
echo "🎮 ATP Chaos GameDay Preparation"
echo "GameDay ID: ${GAMEDAY_ID}"
echo "Scenario: ${SCENARIO}"
echo "Environment: ${ENVIRONMENT}"
echo "Start Time: $(date -u +"%Y-%m-%d %H:%M:%S UTC")"
echo ""
# Step 1: Review scenarios
echo "Step 1: Reviewing GameDay scenarios..."
echo "Loading scenario: ${SCENARIO}"
SCENARIO_FILE="scenarios/${SCENARIO}.yaml"
if [ ! -f "${SCENARIO_FILE}" ]; then
echo "❌ Scenario file not found: ${SCENARIO_FILE}"
exit 1
fi
echo "✅ Scenario loaded: ${SCENARIO}"
cat ${SCENARIO_FILE}
echo ""
# Step 2: Brief all teams
echo "Step 2: Briefing all teams..."
echo "Sending GameDay briefing to teams..."
# Teams: Platform, SRE, Security, Service teams
TEAMS=("platform-team" "sre-team" "security-team" "service-teams")
for team in "${TEAMS[@]}"; do
echo " - Briefing ${team}..."
# Send notification
# ./scripts/send-notification.sh "${team}" "GameDay briefing: ${GAMEDAY_ID}"
done
echo "✅ All teams briefed"
echo ""
# Step 3: Validate rollback procedures
echo "Step 3: Validating rollback procedures..."
echo "Checking rollback scripts and procedures..."
ROLLBACK_SCRIPTS=(
"scripts/rollback-all-chaos.sh"
"scripts/rollback-network-chaos.sh"
"scripts/rollback-pod-chaos.sh"
"scripts/rollback-database-chaos.sh"
)
for script in "${ROLLBACK_SCRIPTS[@]}"; do
if [ -f "${script}" ]; then
echo " ✅ ${script}: Available"
else
echo " ⚠️ ${script}: Not found"
fi
done
echo "✅ Rollback procedures validated"
echo ""
# Step 4: Start monitoring
echo "Step 4: Starting monitoring..."
echo "Initializing monitoring dashboards..."
# Create monitoring dashboard snapshot
./scripts/create-monitoring-snapshot.sh \
--gameday-id ${GAMEDAY_ID} \
--environment ${ENVIRONMENT} \
--output "monitoring-snapshots/${GAMEDAY_ID}-baseline.json"
echo "✅ Monitoring started"
echo ""
echo "=========================================="
echo "GameDay Preparation Complete"
echo "=========================================="
echo "GameDay ID: ${GAMEDAY_ID}"
echo "Scenario: ${SCENARIO}"
echo "Environment: ${ENVIRONMENT}"
echo "Status: Ready for chaos injection"
echo ""
Hour 2: Chaos Injection
Chaos Injection Activities:
#!/bin/bash
# scripts/gameday-chaos-injection.sh
GAMEDAY_ID="${1:-gameday-$(date +%Y%m%d)}"
SCENARIO="${2:-scenario-1}"
ENVIRONMENT="${3:-staging}"
CHAOS_START=$(date +%s)
echo "🎮 ATP Chaos GameDay - Chaos Injection Phase"
echo "GameDay ID: ${GAMEDAY_ID}"
echo "Scenario: ${SCENARIO}"
echo "Start Time: $(date -u +"%Y-%m-%d %H:%M:%S UTC")"
echo ""
# Load scenario experiments (indexed access keeps multi-line YAML elements intact)
SCENARIO_FILE="scenarios/${SCENARIO}.yaml"
EXPERIMENT_TOTAL=$(yq eval '.experiments | length' "${SCENARIO_FILE}" 2>/dev/null || echo "0")
if [ "${EXPERIMENT_TOTAL}" -eq 0 ]; then
echo "❌ No experiments found in scenario"
exit 1
fi
echo "Executing ${SCENARIO} experiments..."
echo ""
# Execute experiments simultaneously
EXPERIMENT_COUNT=0
EXPERIMENT_IDS=()
for ((i = 0; i < EXPERIMENT_TOTAL; i++)); do
EXPERIMENT_COUNT=$((EXPERIMENT_COUNT + 1))
EXPERIMENT_NAME=$(yq eval ".experiments[${i}].name" "${SCENARIO_FILE}")
EXPERIMENT_FILE=$(yq eval ".experiments[${i}].file" "${SCENARIO_FILE}")
echo "Experiment ${EXPERIMENT_COUNT}: ${EXPERIMENT_NAME}"
echo " File: ${EXPERIMENT_FILE}"
# Apply chaos experiment
kubectl apply -f "chaos-experiments/${EXPERIMENT_FILE}" -n chaos-testing
EXPERIMENT_ID="${EXPERIMENT_NAME}-${GAMEDAY_ID}"
EXPERIMENT_IDS+=("${EXPERIMENT_ID}")
echo " ✅ Experiment applied: ${EXPERIMENT_ID}"
echo ""
done
echo "✅ All experiments applied (${EXPERIMENT_COUNT} experiments)"
echo ""
# Monitor system behavior
echo "Monitoring system behavior..."
echo "Tracking metrics for 60 minutes..."
# Continuous monitoring loop
MONITORING_DURATION=3600 # 60 minutes
ELAPSED=0
CHECK_INTERVAL=60 # Check every minute
while [ ${ELAPSED} -lt ${MONITORING_DURATION} ]; do
# Get system metrics (URL-encode PromQL; default to 0 when a query returns no series)
AVAILABILITY=$(curl -sG http://prometheus:9090/api/v1/query \
--data-urlencode "query=service_availability{environment=\"${ENVIRONMENT}\"}" \
| jq -r '.data.result[0].value[1] // "0"')
ERROR_RATE=$(curl -sG http://prometheus:9090/api/v1/query \
--data-urlencode "query=rate(http_requests_total{environment=\"${ENVIRONMENT}\",status=~\"5..\"}[1m])" \
| jq -r '.data.result[0].value[1] // "0"')
LATENCY=$(curl -sG http://prometheus:9090/api/v1/query \
--data-urlencode "query=histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{environment=\"${ENVIRONMENT}\"}[1m]))" \
| jq -r '.data.result[0].value[1] // "0"')
AVAILABILITY=${AVAILABILITY:-0}; ERROR_RATE=${ERROR_RATE:-0}; LATENCY=${LATENCY:-0}
LATENCY_MS=$(echo "${LATENCY} * 1000" | bc)
echo "[${ELAPSED}s] Availability: ${AVAILABILITY}%, Error Rate: ${ERROR_RATE}/sec, P95 Latency: ${LATENCY_MS}ms"
# Check for auto-abort triggers
if (( $(echo "${AVAILABILITY} < 50" | bc -l) )); then
echo "⚠️ WARNING: Availability below 50% - considering auto-abort"
fi
sleep ${CHECK_INTERVAL}
ELAPSED=$((ELAPSED + CHECK_INTERVAL))
done
CHAOS_END=$(date +%s)
CHAOS_DURATION=$((CHAOS_END - CHAOS_START))
echo ""
echo "✅ Chaos injection phase complete"
echo "Duration: ${CHAOS_DURATION} seconds"
echo "Experiments active: ${EXPERIMENT_COUNT}"
echo ""
Hour 3: Recovery
Recovery Activities:
#!/bin/bash
# scripts/gameday-recovery.sh
GAMEDAY_ID="${1:-gameday-$(date +%Y%m%d)}"
SCENARIO="${2:-scenario-1}"
ENVIRONMENT="${3:-staging}"
RECOVERY_START=$(date +%s)
echo "🎮 ATP Chaos GameDay - Recovery Phase"
echo "GameDay ID: ${GAMEDAY_ID}"
echo "Start Time: $(date -u +"%Y-%m-%d %H:%M:%S UTC")"
echo ""
# Step 1: Execute recovery procedures
echo "Step 1: Executing recovery procedures..."
# Get active experiments
ACTIVE_EXPERIMENTS=$(kubectl get chaos -n chaos-testing -o jsonpath='{.items[*].metadata.name}')
for experiment in ${ACTIVE_EXPERIMENTS}; do
echo " - Removing experiment: ${experiment}"
kubectl delete chaos ${experiment} -n chaos-testing
done
echo "✅ Recovery procedures executed"
echo ""
# Step 2: Validate RTO/RPO
echo "Step 2: Validating RTO/RPO..."
./scripts/validate-rto-rpo.sh ${GAMEDAY_ID} ${ENVIRONMENT}
echo ""
# Step 3: Test failover and failback
echo "Step 3: Testing failover and failback..."
# This would involve testing regional failover if applicable
echo "✅ Failover/failback tested"
echo ""
# Step 4: Validate data integrity
echo "Step 4: Validating data integrity..."
./scripts/validate-data-integrity.sh ${ENVIRONMENT}
echo ""
RECOVERY_END=$(date +%s)
RECOVERY_DURATION=$((RECOVERY_END - RECOVERY_START))
echo "✅ Recovery phase complete"
echo "Duration: ${RECOVERY_DURATION} seconds"
echo ""
Hour 4: Retrospective
Retrospective Activities:
#!/bin/bash
# scripts/gameday-retrospective.sh
GAMEDAY_ID="${1:-gameday-$(date +%Y%m%d)}"
SCENARIO="${2:-scenario-1}"
echo "🎮 ATP Chaos GameDay - Retrospective"
echo "GameDay ID: ${GAMEDAY_ID}"
echo "Start Time: $(date -u +"%Y-%m-%d %H:%M:%S UTC")"
echo ""
# Generate GameDay report
echo "Generating GameDay report..."
./scripts/generate-gameday-report.sh ${GAMEDAY_ID} ${SCENARIO}
echo "✅ GameDay report generated"
echo ""
# Document findings
echo "Documenting findings..."
# Collect findings from observers and teams
FINDINGS_FILE="gameday-reports/${GAMEDAY_ID}-findings.md"
mkdir -p "$(dirname "${FINDINGS_FILE}")"
cat > ${FINDINGS_FILE} <<EOF
# GameDay Findings: ${GAMEDAY_ID}
## Scenario: ${SCENARIO}
## Key Findings
- [ ] Finding 1: ...
- [ ] Finding 2: ...
- [ ] Finding 3: ...
## Resilience Gaps Identified
- [ ] Gap 1: ...
- [ ] Gap 2: ...
- [ ] Gap 3: ...
## Improvement Actions
- [ ] Action 1: ... (Owner: ..., Due: ...)
- [ ] Action 2: ... (Owner: ..., Due: ...)
- [ ] Action 3: ... (Owner: ..., Due: ...)
## Runbook Updates Required
- [ ] Runbook 1: ...
- [ ] Runbook 2: ...
- [ ] Runbook 3: ...
EOF
echo "✅ Findings documented: ${FINDINGS_FILE}"
echo ""
echo "✅ Retrospective complete"
echo ""
GameDay Timeline Visualization
gantt
title Quarterly ATP Chaos GameDay (4 Hours)
dateFormat HH:mm
axisFormat %H:%M
section Preparation
section Preparation
Review Scenarios :00:00, 15m
Brief All Teams :00:15, 15m
Validate Rollback :00:30, 15m
Start Monitoring :00:45, 15m
section Chaos Injection
Execute Experiments :01:00, 30m
Monitor System :01:30, 30m
section Recovery
Execute Recovery :02:00, 30m
Validate RTO/RPO :02:30, 30m
section Retrospective
Document Findings :03:00, 30m
Identify Gaps :03:30, 15m
Assign Actions :03:45, 15m
GameDay Scenarios¶
Scenario 1: Regional Outage + Database Failover + Key Vault Unavailable
Scenario Description:
- Regional Outage: Simulate complete East US region unavailability
- Database Failover: Azure SQL failover to secondary region
- Key Vault Unavailable: Azure Key Vault network partition
- Objective: Validate multi-region resilience and secret management
Scenario Configuration:
# scenarios/scenario-1-regional-outage.yaml
apiVersion: chaos.atp.connectsoft.io/v1alpha1
kind: GameDayScenario
metadata:
name: scenario-1-regional-outage
labels:
category: infrastructure
severity: high
complexity: high
spec:
description: |
Regional Outage + Database Failover + Key Vault Unavailable
Simulates complete regional failure with database failover and secret management issues
duration: "4h"
experiments:
- name: regional-outage
file: network-partition-region.yaml
startTime: "00:05" # 5 minutes into GameDay
duration: "1h"
description: Partition network to East US region
- name: database-failover
file: database-failover.yaml
startTime: "00:10" # 10 minutes into GameDay
duration: "1h"
description: Force Azure SQL failover to secondary region
- name: key-vault-unavailable
file: key-vault-unavailable.yaml
startTime: "00:15" # 15 minutes into GameDay
duration: "1h"
description: Partition network to Azure Key Vault
expectedBehavior:
- Services failover to West Europe region
- Database failover completes within 30 minutes
- Services use cached secrets during Key Vault unavailability
- All services remain operational in secondary region
successCriteria:
- RTO achieved: <30 minutes
- RPO verified: <1 hour
- Service availability: >90%
- Secret cache usage: >80%
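The RTO and RPO criteria above can be evaluated mechanically from timestamps captured during the drill: RTO is the gap between outage start and service restoration, RPO the gap between the last replicated transaction and outage start. A minimal sketch using GNU `date` follows; the three timestamps are hypothetical placeholders for values a drill harness would record.

```shell
#!/bin/bash
# Hypothetical timestamps captured during the drill.
OUTAGE_START="2025-01-15T10:00:00Z"      # region declared unavailable
SERVICE_RESTORED="2025-01-15T10:22:00Z"  # services healthy in secondary region
LAST_REPLICATED="2025-01-15T09:45:00Z"   # last transaction replicated before outage

to_epoch() { date -u -d "$1" +%s; }

# RTO: outage start -> service restored; RPO: last replicated -> outage start
RTO_SECONDS=$(( $(to_epoch "${SERVICE_RESTORED}") - $(to_epoch "${OUTAGE_START}") ))
RPO_SECONDS=$(( $(to_epoch "${OUTAGE_START}") - $(to_epoch "${LAST_REPLICATED}") ))

echo "RTO: $((RTO_SECONDS / 60)) minutes (target: <30)"
echo "RPO: $((RPO_SECONDS / 60)) minutes (target: <60)"
[ ${RTO_SECONDS} -lt 1800 ] && echo "RTO: PASS" || echo "RTO: FAIL"
[ ${RPO_SECONDS} -lt 3600 ] && echo "RPO: PASS" || echo "RPO: FAIL"
```

Wiring a computation like this into the scenario's success-criteria check turns "RTO achieved: <30 minutes" from a judgment call into a reproducible pass/fail result.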
Scenario Execution Script:
#!/bin/bash
# scripts/execute-scenario-1.sh
GAMEDAY_ID="${1:-gameday-$(date +%Y%m%d)}"
ENVIRONMENT="${2:-staging}"
echo "🎮 Executing Scenario 1: Regional Outage + Database Failover + Key Vault Unavailable"
echo "GameDay ID: ${GAMEDAY_ID}"
echo ""
# Execute experiments in the background, staggered 5 minutes apart
./scripts/execute-regional-failover-drill.sh eastus westeurope atp-production true &
REGIONAL_PID=$!
sleep 300 # Wait 5 minutes
./scripts/execute-database-failover-experiment.sh atp-ingestion-api atp-ingest-ns &
DB_FAILOVER_PID=$!
sleep 300 # Wait 5 minutes
./scripts/execute-key-vault-unavailability-experiment.sh atp-ingestion-api atp-ingest-ns &
KV_UNAVAILABLE_PID=$!
# Wait for all experiments
wait ${REGIONAL_PID} ${DB_FAILOVER_PID} ${KV_UNAVAILABLE_PID}
echo "✅ Scenario 1 execution complete"
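Note that `wait ${REGIONAL_PID} ${DB_FAILOVER_PID} ${KV_UNAVAILABLE_PID}` returns only the status of the last PID, so a failed experiment can go unnoticed. One way to collect each status individually — a sketch using placeholder commands in place of the real drill scripts:

```shell
#!/usr/bin/env bash
# Call `wait <pid>` once per background process so every exit status is
# inspected, instead of one `wait p1 p2 p3` (which only reports the last).
# The sleep/true/false commands are placeholders for the real experiments.
run_staggered() {
  local pids=() names=("regional" "db-failover" "kv-unavailable") rc=0 i
  (sleep 1; true)  & pids+=($!)   # placeholder: regional failover drill
  (sleep 1; false) & pids+=($!)   # placeholder: a failing experiment
  (sleep 1; true)  & pids+=($!)   # placeholder: Key Vault experiment
  for i in "${!pids[@]}"; do
    if wait "${pids[$i]}"; then
      echo "OK ${names[$i]}"
    else
      echo "FAILED ${names[$i]}"
      rc=1
    fi
  done
  return "${rc}"
}

run_staggered || echo "at least one experiment failed"
```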
Scenario 2: Node Cascade Failure + Message Broker Issues + Traffic Surge
Scenario Configuration:
# scenarios/scenario-2-cascade-failure.yaml
apiVersion: chaos.atp.connectsoft.io/v1alpha1
kind: GameDayScenario
metadata:
name: scenario-2-cascade-failure
labels:
category: infrastructure
severity: high
complexity: high
spec:
description: |
Node Cascade Failure + Message Broker Issues + Traffic Surge
Simulates cascading node failures with messaging issues and traffic surge
duration: "4h"
experiments:
- name: node-cascade-failure
file: node-cascade-failure.yaml
startTime: "00:05"
duration: "1h"
description: Simulate cascading node failures (1 node every 5 minutes)
- name: message-broker-issues
file: service-bus-topic-pause.yaml
startTime: "00:10"
duration: "1h"
description: Pause Service Bus topic to simulate broker issues
- name: traffic-surge
file: traffic-surge.yaml
startTime: "00:15"
duration: "1h"
description: Generate 10x normal traffic load
expectedBehavior:
- Pods reschedule to healthy nodes
- Message buffering activates during broker issues
- Autoscaling activates for traffic surge
- Services handle combined load gracefully
successCriteria:
- Pod reschedule time: <5 minutes
- Message buffering: >90% success
- Autoscaling activation: <5 minutes
- Service availability: >90%
Scenario 3: Security Incident + DDoS Attack + Data Corruption
Scenario Configuration:
# scenarios/scenario-3-security-incident.yaml
apiVersion: chaos.atp.connectsoft.io/v1alpha1
kind: GameDayScenario
metadata:
name: scenario-3-security-incident
labels:
category: security
severity: critical
complexity: high
spec:
description: |
Security Incident + DDoS Attack + Data Corruption
Simulates security incident with DDoS attack and data corruption
duration: "4h"
experiments:
- name: azure-ad-unavailable
file: azure-ad-unavailability.yaml
startTime: "00:05"
duration: "1h"
description: Azure AD authentication unavailable
- name: ddos-attack
file: traffic-surge-ddos.yaml
startTime: "00:10"
duration: "1h"
description: Simulate DDoS attack (100x normal traffic)
- name: data-corruption
file: event-store-corruption.yaml
startTime: "00:15"
duration: "1h"
description: Simulate event store corruption
expectedBehavior:
- Token cache maintains authentication
- Rate limiting protects against DDoS
- Integrity verification detects corruption
- Services continue operating with degraded functionality
successCriteria:
- Token cache hit rate: >80%
- Rate limiting effectiveness: >95%
- Corruption detection time: <5 minutes
- Service availability: >85%
GameDay Scenario Selection
#!/bin/bash
# scripts/select-gameday-scenario.sh
QUARTER="${1:-Q1}" # Q1, Q2, Q3, Q4
# Rotate scenarios quarterly
case ${QUARTER} in
Q1)
SCENARIO="scenario-1-regional-outage"
;;
Q2)
SCENARIO="scenario-2-cascade-failure"
;;
Q3)
SCENARIO="scenario-3-security-incident"
;;
Q4)
# Q4: Random selection from the three scenarios
SCENARIOS=("scenario-1-regional-outage" "scenario-2-cascade-failure" "scenario-3-security-incident")
SCENARIO="${SCENARIOS[RANDOM % 3]}"
;;
esac
echo "Selected scenario for ${QUARTER}: ${SCENARIO}" >&2
echo "${SCENARIO}"
GameDay Roles¶
GameDay Role Definitions
| Role | Responsibilities | Key Activities |
|---|---|---|
| GameDay Commander | Overall coordination, decision-making | Coordinates exercise, makes go/no-go decisions, manages timeline |
| Chaos Injector | Executes chaos experiments | Applies chaos experiments, monitors experiment status |
| Incident Commander | Leads incident response | Coordinates response, makes recovery decisions |
| Observer | Documents findings | Observes system behavior, documents issues, captures metrics |
| Service Teams | Respond to failures | Monitor services, execute runbooks, report status |
| SRE | Monitor and support | Monitor system health, provide support, validate recovery |
Role Assignment Script:
#!/bin/bash
# scripts/assign-gameday-roles.sh
GAMEDAY_ID="${1:-gameday-$(date +%Y%m%d)}"
echo "🎮 Assigning GameDay Roles"
echo "GameDay ID: ${GAMEDAY_ID}"
echo ""
# Role assignments (from team roster)
ROLES=(
"gameday-commander:platform-team-lead"
"chaos-injector:sre-team-member-1"
"incident-commander:platform-team-lead"
"observer:sre-team-member-2,security-team-member-1"
"service-teams:ingestion-team-lead,query-team-lead"
"sre:sre-team-all"
)
for role_assignment in "${ROLES[@]}"; do
ROLE=$(echo "${role_assignment}" | cut -d: -f1)
ASSIGNEES=$(echo "${role_assignment}" | cut -d: -f2)
echo "Role: ${ROLE}"
echo " Assignees: ${ASSIGNEES}"
# Send role assignment notifications
IFS=',' read -ra ASSIGNEE_ARRAY <<< "${ASSIGNEES}"
for assignee in "${ASSIGNEE_ARRAY[@]}"; do
echo " - Notifying ${assignee}..."
# ./scripts/send-notification.sh "${assignee}" "You have been assigned role: ${ROLE} for GameDay ${GAMEDAY_ID}"
done
echo ""
done
echo "✅ Role assignments complete"
echo ""
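The roster above assigns `platform-team-lead` as both GameDay Commander and Incident Commander. Whether such double-assignment is acceptable is a per-GameDay call, so a quick check that flags any assignee holding more than one leadership role can be useful. A sketch operating on the same `role:assignees` strings:

```shell
#!/usr/bin/env bash
# Flag assignees holding more than one leadership role. Input: the same
# "role:assignee1,assignee2" strings used by assign-gameday-roles.sh.
detect_duplicate_leads() {
  printf '%s\n' "$@" |
    awk -F: '$1 == "gameday-commander" || $1 == "incident-commander" { print $2 }' |
    tr ',' '\n' | sort | uniq -d
}

ROLES=(
  "gameday-commander:platform-team-lead"
  "chaos-injector:sre-team-member-1"
  "incident-commander:platform-team-lead"
)
detect_duplicate_leads "${ROLES[@]}"
```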
GameDay Role Responsibilities
GameDay Commander Checklist:
# GameDay Commander Checklist
## Pre-GameDay
- [ ] Confirm GameDay schedule with all teams
- [ ] Review and approve scenario
- [ ] Validate environment readiness
- [ ] Confirm rollback procedures
- [ ] Brief stakeholders
## During GameDay
- [ ] Manage GameDay timeline
- [ ] Make go/no-go decisions
- [ ] Coordinate between teams
- [ ] Escalate critical issues
- [ ] Monitor overall progress
## Post-GameDay
- [ ] Lead retrospective
- [ ] Review findings
- [ ] Assign improvement actions
- [ ] Approve final report
Chaos Injector Checklist:
# Chaos Injector Checklist
## Pre-GameDay
- [ ] Review scenario experiments
- [ ] Validate experiment files
- [ ] Test experiment execution
- [ ] Prepare rollback commands
## During GameDay
- [ ] Execute experiments per schedule
- [ ] Monitor experiment status
- [ ] Document experiment execution
- [ ] Execute rollback when needed
## Post-GameDay
- [ ] Document experiment results
- [ ] Review experiment effectiveness
- [ ] Update experiment configurations
Observer Checklist:
# Observer Checklist
## Pre-GameDay
- [ ] Review scenario and expected behavior
- [ ] Prepare observation templates
- [ ] Set up monitoring dashboards
- [ ] Coordinate with other observers
## During GameDay
- [ ] Observe system behavior
- [ ] Document findings in real-time
- [ ] Capture metrics and screenshots
- [ ] Note team coordination issues
- [ ] Identify runbook gaps
## Post-GameDay
- [ ] Compile observation notes
- [ ] Create observation report
- [ ] Present findings in retrospective
GameDay Metrics¶
GameDay Metrics Tracking
Metrics Collection Script:
#!/bin/bash
# scripts/collect-gameday-metrics.sh
GAMEDAY_ID="${1:-gameday-$(date +%Y%m%d)}"
SCENARIO="${2:-scenario-1}"
ENVIRONMENT="${3:-staging}"
METRICS_FILE="gameday-reports/${GAMEDAY_ID}-metrics.json"
echo "📊 Collecting GameDay Metrics"
echo "GameDay ID: ${GAMEDAY_ID}"
echo ""
# Collect metrics (create the report directory if it does not exist)
mkdir -p "$(dirname "${METRICS_FILE}")"
cat > "${METRICS_FILE}" <<EOF
{
"gamedayId": "${GAMEDAY_ID}",
"scenario": "${SCENARIO}",
"environment": "${ENVIRONMENT}",
"timestamp": "$(date -u +"%Y-%m-%dT%H:%M:%SZ")",
"metrics": {
"timeToDetection": {
"target": 300,
"achieved": $(./scripts/calculate-mttd.sh ${GAMEDAY_ID}),
"unit": "seconds"
},
"timeToResponse": {
"target": 600,
"achieved": $(./scripts/calculate-mttr.sh ${GAMEDAY_ID}),
"unit": "seconds"
},
"rto": {
"target": 1800,
"achieved": $(./scripts/calculate-rto.sh ${GAMEDAY_ID}),
"unit": "seconds"
},
"rpo": {
"target": 3600,
"achieved": $(./scripts/calculate-rpo.sh ${GAMEDAY_ID}),
"unit": "seconds"
},
"runbookEffectiveness": {
"target": 80,
"achieved": $(./scripts/calculate-runbook-effectiveness.sh ${GAMEDAY_ID}),
"unit": "percent"
},
"teamCoordination": {
"target": 80,
"achieved": $(./scripts/calculate-team-coordination.sh ${GAMEDAY_ID}),
"unit": "percent"
}
}
}
EOF
echo "✅ Metrics collected: ${METRICS_FILE}"
jq '.' "${METRICS_FILE}"
echo ""
GameDay Metrics Targets
| Metric | Target | Description |
|---|---|---|
| Time to Detection (MTTD) | <5 minutes | Time from failure injection to detection |
| Time to Response (MTTR) | <10 minutes | Time from detection to response initiation |
| RTO Achieved | <30 minutes | Recovery time objective achievement |
| RPO Verified | <1 hour | Recovery point objective verification |
| Runbook Effectiveness | >80% | Percentage of runbook steps executed successfully |
| Team Coordination | >80% | Quality of team coordination (subjective rating) |
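The pass/fail judgment implied by the targets table can be captured in a small helper. A sketch using integer seconds and percents (so no external calculator is needed); the thresholds mirror the table above, while the helper name and interface are illustrative:

```shell
#!/usr/bin/env bash
# Evaluate an achieved metric value against its target.
# Usage: check_metric <achieved> <op> <target>   where <op> is "lt" or "gt".
check_metric() {
  local achieved=$1 op=$2 target=$3
  case ${op} in
    lt) [ "${achieved}" -lt "${target}" ] && echo "PASS" || echo "FAIL" ;;
    gt) [ "${achieved}" -gt "${target}" ] && echo "PASS" || echo "FAIL" ;;
    *)  echo "UNKNOWN" ;;
  esac
}

# Targets from the table: MTTD <300s, MTTR <600s, runbook effectiveness >80%
check_metric 240 lt 300   # MTTD 4 minutes  -> PASS
check_metric 720 lt 600   # MTTR 12 minutes -> FAIL
check_metric 85  gt 80    # runbooks at 85% -> PASS
```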
GameDay Metrics Calculation Scripts
Calculate MTTD:
#!/bin/bash
# scripts/calculate-mttd.sh
GAMEDAY_ID="${1}"
# Get first alert timestamp ("// empty" yields an empty string, rather than the literal "null", when no alerts were recorded)
FIRST_ALERT=$(jq -r '.alerts[0].timestamp // empty' "gameday-reports/${GAMEDAY_ID}-alerts.json" 2>/dev/null || echo "")
# Get chaos injection timestamp
CHAOS_INJECTION=$(jq -r '.chaosInjection // empty' "gameday-reports/${GAMEDAY_ID}-timeline.json" 2>/dev/null || echo "")
if [ -n "${FIRST_ALERT}" ] && [ -n "${CHAOS_INJECTION}" ]; then
FIRST_ALERT_EPOCH=$(date -u -d "${FIRST_ALERT}" +%s)
CHAOS_INJECTION_EPOCH=$(date -u -d "${CHAOS_INJECTION}" +%s)
MTTD=$((FIRST_ALERT_EPOCH - CHAOS_INJECTION_EPOCH))
echo "${MTTD}"
else
echo "0"
fi
Calculate MTTR:
#!/bin/bash
# scripts/calculate-mttr.sh
GAMEDAY_ID="${1}"
# Get first response timestamp ("// empty" yields an empty string, rather than the literal "null", when not recorded)
FIRST_RESPONSE=$(jq -r '.firstResponse // empty' "gameday-reports/${GAMEDAY_ID}-timeline.json" 2>/dev/null || echo "")
# Get first alert timestamp
FIRST_ALERT=$(jq -r '.alerts[0].timestamp // empty' "gameday-reports/${GAMEDAY_ID}-alerts.json" 2>/dev/null || echo "")
if [ -n "${FIRST_RESPONSE}" ] && [ -n "${FIRST_ALERT}" ]; then
FIRST_RESPONSE_EPOCH=$(date -u -d "${FIRST_RESPONSE}" +%s)
FIRST_ALERT_EPOCH=$(date -u -d "${FIRST_ALERT}" +%s)
MTTR=$((FIRST_RESPONSE_EPOCH - FIRST_ALERT_EPOCH))
echo "${MTTR}"
else
echo "0"
fi
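`calculate-rto.sh` and `calculate-rpo.sh`, referenced by the metrics collection script, can follow the same timestamp-difference pattern as the MTTD/MTTR scripts above. A sketch of the shared core, assuming GNU `date`; the example timestamps are illustrative:

```shell
#!/usr/bin/env bash
# Seconds elapsed between two ISO-8601 UTC timestamps (requires GNU date).
duration_between() {
  local start_epoch end_epoch
  start_epoch=$(date -u -d "$1" +%s)
  end_epoch=$(date -u -d "$2" +%s)
  echo $(( end_epoch - start_epoch ))
}

# e.g. RTO = time from chaos injection to service restoration
duration_between "2025-01-15T10:00:00Z" "2025-01-15T10:25:00Z"   # -> 1500
```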
GameDay Metrics Dashboard
GameDay Metrics Visualization:
graph LR
GAMEDAY[GameDay Metrics] --> DETECTION[MTTD]
GAMEDAY --> RESPONSE[MTTR]
GAMEDAY --> RTO[RTO]
GAMEDAY --> RPO[RPO]
GAMEDAY --> RUNBOOK[Runbook Effectiveness]
GAMEDAY --> COORDINATION[Team Coordination]
DETECTION --> ALERT[Alert Time]
RESPONSE --> ACTION[Action Time]
RTO --> RECOVERY[Recovery Time]
RPO --> DATA[Data Loss]
RUNBOOK --> STEPS[Runbook Steps]
COORDINATION --> TEAM[Team Performance]
style GAMEDAY fill:#FFE5B4
style DETECTION fill:#90EE90
style RESPONSE fill:#90EE90
style RTO fill:#90EE90
style RPO fill:#90EE90
style RUNBOOK fill:#90EE90
style COORDINATION fill:#90EE90
GameDay Report Template
GameDay Report Generation Script:
#!/bin/bash
# scripts/generate-gameday-report.sh
GAMEDAY_ID="${1:-gameday-$(date +%Y%m%d)}"
SCENARIO="${2:-scenario-1}"
REPORT_FILE="gameday-reports/${GAMEDAY_ID}-report.md"
mkdir -p "$(dirname "${REPORT_FILE}")"
cat > "${REPORT_FILE}" <<EOF
# ATP Chaos GameDay Report
**GameDay ID**: ${GAMEDAY_ID}
**Date**: $(date -u +"%Y-%m-%d")
**Scenario**: ${SCENARIO}
**Duration**: 4 hours
**Environment**: Staging
## Executive Summary
[Summary of GameDay execution and key outcomes]
## Metrics Achieved
| Metric | Target | Achieved | Status |
|--------|--------|----------|--------|
| **MTTD** | <5 min | [VALUE] | [PASS/FAIL] |
| **MTTR** | <10 min | [VALUE] | [PASS/FAIL] |
| **RTO** | <30 min | [VALUE] | [PASS/FAIL] |
| **RPO** | <1 hour | [VALUE] | [PASS/FAIL] |
| **Runbook Effectiveness** | >80% | [VALUE] | [PASS/FAIL] |
| **Team Coordination** | >80% | [VALUE] | [PASS/FAIL] |
## Key Findings
### Strengths
- [Finding 1]
- [Finding 2]
- [Finding 3]
### Gaps Identified
- [Gap 1]
- [Gap 2]
- [Gap 3]
## Improvement Actions
| Action | Owner | Due Date | Priority |
|--------|-------|----------|----------|
| [Action 1] | [Owner] | [Date] | [High/Medium/Low] |
| [Action 2] | [Owner] | [Date] | [High/Medium/Low] |
| [Action 3] | [Owner] | [Date] | [High/Medium/Low] |
## Runbook Updates
- [Runbook 1]: [Update description]
- [Runbook 2]: [Update description]
- [Runbook 3]: [Update description]
## Lessons Learned
[Key lessons learned from GameDay]
## Next Steps
- [Next step 1]
- [Next step 2]
- [Next step 3]
EOF
echo "✅ GameDay report generated: ${REPORT_FILE}"
echo ""
Summary: Chaos GameDays¶
- What is a GameDay?: Quarterly chaos engineering exercises that validate system resilience through complex multi-failure scenarios, multi-team participation, incident response validation, and learning-focused improvement; each exercise runs for 4 hours in a production-like environment
- GameDay Structure: Comprehensive 4-hour structure divided into Preparation (review scenarios, brief teams, validate rollback, start monitoring), Chaos Injection (execute 3-5 experiments simultaneously, monitor behavior, test incident response, validate runbooks), Recovery (execute recovery procedures, validate RTO/RPO, test failover/failback, validate data integrity), and Retrospective (document findings, identify resilience gaps, assign improvement actions, update runbooks)
- GameDay Scenarios: Three complex scenarios including Regional Outage + Database Failover + Key Vault Unavailable (multi-region resilience), Node Cascade Failure + Message Broker Issues + Traffic Surge (cascading failures), and Security Incident + DDoS Attack + Data Corruption (security and data integrity); scenarios rotated quarterly with comprehensive configuration and execution scripts
- GameDay Roles: Structured role assignments including GameDay Commander (coordination, decision-making), Chaos Injector (experiment execution), Incident Commander (incident response), Observer (documentation), Service Teams (service response), and SRE (monitoring and support); each role with specific responsibilities and checklists
- GameDay Metrics: Comprehensive metrics tracking including Time to Detection (MTTD <5 minutes), Time to Response (MTTR <10 minutes), RTO/RPO achievement, Runbook Effectiveness (>80%), and Team Coordination (>80%); with automated collection scripts and visualization dashboards
Chaos Automation and Reporting¶
Purpose: Define comprehensive chaos automation and reporting capabilities for ATP, covering automation tooling, CI/CD integration, continuous chaos execution, reporting mechanisms, and improvement tracking to ensure chaos engineering is continuously automated, monitored, and improved.
Chaos Automation Tools¶
Chaos automation tools provide platform-native and cloud-integrated chaos engineering capabilities for ATP, enabling automated chaos experiment execution, monitoring, and management.
Chaos Mesh: Kubernetes-Native Chaos
Chaos Mesh Overview:
Chaos Mesh is a cloud-native chaos engineering platform that orchestrates chaos experiments on Kubernetes. It provides comprehensive fault injection capabilities for pods, networks, I/O, and time.
Chaos Mesh Installation:
#!/bin/bash
# scripts/install-chaos-mesh.sh
NAMESPACE="${1:-chaos-testing}"
echo "🔧 Installing Chaos Mesh"
echo "Namespace: ${NAMESPACE}"
echo ""
# Create namespace
kubectl create namespace ${NAMESPACE} --dry-run=client -o yaml | kubectl apply -f -
# Install Chaos Mesh using Helm
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update
helm install chaos-mesh chaos-mesh/chaos-mesh \
--namespace=${NAMESPACE} \
--set chaosDaemon.runtime=containerd \
--set chaosDaemon.socketPath=/run/containerd/containerd.sock \
--set dashboard.create=true \
--set dashboard.securityMode=false
echo "✅ Chaos Mesh installed"
echo ""
# Wait for Chaos Mesh to be ready
echo "Waiting for Chaos Mesh to be ready..."
kubectl wait --for=condition=ready pod \
-l app.kubernetes.io/name=chaos-mesh \
-n ${NAMESPACE} \
--timeout=300s
echo "✅ Chaos Mesh is ready"
echo ""
Chaos Mesh Configuration:
# kubernetes/chaos-mesh/chaos-mesh-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: chaos-mesh-config
namespace: chaos-testing
data:
config.yaml: |
# Chaos Mesh Configuration
chaos:
# Enable chaos experiments
podChaos: true
networkChaos: true
ioChaos: true
stressChaos: true
timeChaos: true
httpChaos: true
kernelChaos: true
# Monitoring integration
metrics:
prometheus:
enabled: true
endpoint: "http://prometheus:9090"
# Dashboard configuration
dashboard:
enabled: true
port: 2333
# Security settings
security:
mode: "host"
allowHostNetwork: true
Litmus Chaos: Chaos Workflows
Litmus Chaos Overview:
Litmus Chaos provides chaos workflows and experiments for Kubernetes environments, with a focus on workflow-based chaos engineering.
Litmus Chaos Installation:
#!/bin/bash
# scripts/install-litmus-chaos.sh
NAMESPACE="${1:-litmus}"
echo "🔧 Installing Litmus Chaos"
echo "Namespace: ${NAMESPACE}"
echo ""
# Create namespace
kubectl create namespace ${NAMESPACE} --dry-run=client -o yaml | kubectl apply -f -
# Install Litmus Chaos Operator
kubectl apply -f https://litmuschaos.github.io/litmus/2.13.0/litmus-2.13.0.yaml
echo "✅ Litmus Chaos installed"
echo ""
# Wait for Litmus to be ready
echo "Waiting for Litmus to be ready..."
kubectl wait --for=condition=ready pod \
-l app.kubernetes.io/name=litmus \
-n ${NAMESPACE} \
--timeout=300s
echo "✅ Litmus Chaos is ready"
echo ""
Azure Chaos Studio: Azure-Integrated Chaos
Azure Chaos Studio Overview:
Azure Chaos Studio provides Azure-native chaos engineering with integration into Azure services, enabling chaos experiments directly on Azure resources.
Azure Chaos Studio Setup:
#!/bin/bash
# scripts/setup-azure-chaos-studio.sh
RESOURCE_GROUP="${1:-atp-production}"
LOCATION="${2:-eastus}"
echo "🔧 Setting up Azure Chaos Studio"
echo "Resource Group: ${RESOURCE_GROUP}"
echo "Location: ${LOCATION}"
echo ""
# Register Chaos Studio resource provider
az provider register --namespace Microsoft.Chaos
# Create Chaos Studio target
az chaos target create \
--target-name atp-aks-target \
--target-type Microsoft-AzureKubernetesServiceChaosMesh \
--resource-group ${RESOURCE_GROUP} \
--parent-provider-namespace Microsoft.ContainerService \
--parent-resource-type managedClusters \
--parent-resource-name atp-aks-${LOCATION} \
--location ${LOCATION}
echo "✅ Azure Chaos Studio target created"
echo ""
# Create Chaos Studio experiment
az chaos experiment create \
--experiment-name atp-chaos-experiment \
--resource-group ${RESOURCE_GROUP} \
--location ${LOCATION} \
--experiment-file chaos-experiments/azure-chaos-studio-experiment.json
echo "✅ Azure Chaos Studio experiment created"
echo ""
Custom Scripts: ATP-Specific Chaos
Custom Chaos Script Framework:
#!/bin/bash
# scripts/chaos-framework.sh
# ATP-specific chaos engineering framework
# Chaos Framework Configuration
CHAOS_NAMESPACE="${CHAOS_NAMESPACE:-chaos-testing}"
ENVIRONMENT="${ENVIRONMENT:-staging}"
LOG_LEVEL="${LOG_LEVEL:-INFO}"
# Chaos Framework Functions
chaos_inject() {
local experiment_type=$1
local experiment_config=$2
local duration=${3:-"10m"}
echo "[$(date +"%Y-%m-%d %H:%M:%S")] Injecting chaos: ${experiment_type}"
case ${experiment_type} in
pod-kill)
./scripts/execute-pod-failure-experiment.sh ${experiment_config} ${duration}
;;
network-partition)
./scripts/execute-network-partition-experiment.sh ${experiment_config} ${duration}
;;
database-failover)
./scripts/execute-database-failover-experiment.sh ${experiment_config} ${duration}
;;
*)
echo "Unknown experiment type: ${experiment_type}"
return 1
;;
esac
}
chaos_monitor() {
local experiment_id=$1
local duration=${2:-"10m"}
echo "[$(date +"%Y-%m-%d %H:%M:%S")] Monitoring chaos: ${experiment_id}"
./scripts/monitor-chaos-experiment.sh ${experiment_id} ${duration}
}
chaos_rollback() {
local experiment_id=$1
echo "[$(date +"%Y-%m-%d %H:%M:%S")] Rolling back chaos: ${experiment_id}"
./scripts/rollback-chaos-experiment.sh ${experiment_id}
}
chaos_validate() {
local experiment_id=$1
echo "[$(date +"%Y-%m-%d %H:%M:%S")] Validating chaos: ${experiment_id}"
./scripts/validate-chaos-experiment.sh ${experiment_id}
}
# Export functions
export -f chaos_inject chaos_monitor chaos_rollback chaos_validate
Chaos Tool Comparison
| Tool | Type | Integration | Best For |
|---|---|---|---|
| Chaos Mesh | Kubernetes-native | Kubernetes, Prometheus | Pod, network, I/O chaos |
| Litmus Chaos | Kubernetes-native | Kubernetes, Argo | Workflow-based chaos |
| Azure Chaos Studio | Azure-native | Azure services | Azure resource chaos |
| Custom Scripts | ATP-specific | ATP services | ATP-specific scenarios |
CI/CD Integration¶
CI/CD integration enables automated chaos testing in staging pipelines with resilience validation and deployment blocking on chaos failures.
Chaos Tests in Staging Pipeline
Azure Pipelines Configuration:
# azure-pipelines/chaos-tests-stage.yaml
trigger:
branches:
include:
- main
- develop
stages:
- stage: ChaosTests
displayName: 'Chaos Engineering Tests'
condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
jobs:
- job: ChaosResilienceTests
displayName: 'Resilience Validation'
pool:
vmImage: 'ubuntu-latest'
steps:
- task: AzureCLI@2
displayName: 'Setup Azure CLI'
inputs:
azureSubscription: 'atp-connection'
scriptType: 'bash'
scriptLocation: 'inlineScript'
inlineScript: |
az account set --subscription $(AZURE_SUBSCRIPTION_ID)
az aks get-credentials --resource-group $(RESOURCE_GROUP) --name $(AKS_CLUSTER_NAME)
- task: Bash@3
displayName: 'Run Chaos Tests'
inputs:
targetType: 'inline'
script: |
# Install chaos tools
./scripts/install-chaos-mesh.sh chaos-testing
# Run chaos test suite
./scripts/run-chaos-test-suite.sh staging
- task: Bash@3
displayName: 'Validate Resilience'
inputs:
targetType: 'inline'
script: |
# Validate resilience metrics and block the stage if validation fails
if ! ./scripts/validate-resilience-metrics.sh staging; then
echo "##vso[task.logissue type=error]Resilience validation failed"
exit 1
fi
Chaos Test Suite Script:
#!/bin/bash
# scripts/run-chaos-test-suite.sh
ENVIRONMENT="${1:-staging}"
TEST_SUITE="${2:-basic}"
echo "🧪 Running Chaos Test Suite"
echo "Environment: ${ENVIRONMENT}"
echo "Test Suite: ${TEST_SUITE}"
echo ""
FAILED_TESTS=0
PASSED_TESTS=0
# Define test suite
case ${TEST_SUITE} in
basic)
TESTS=(
"pod-failure:atp-ingestion-api"
"network-latency:atp-ingestion-api:500ms"
"database-failover:atp-sql-eastus"
)
;;
advanced)
TESTS=(
"pod-failure:atp-ingestion-api"
"network-partition:atp-ingest-ns:atp-query-ns"
"database-failover:atp-sql-eastus"
"key-vault-unavailable:atp-kv-eastus"
"traffic-surge:atp-ingestion-api:10x"
)
;;
*)
echo "Unknown test suite: ${TEST_SUITE}"
exit 1
;;
esac
# Run each test
for test in "${TESTS[@]}"; do
TEST_TYPE=$(echo "${test}" | cut -d: -f1)
TEST_PARAMS=$(echo "${test}" | cut -d: -f2-)
echo "Running test: ${TEST_TYPE} (${TEST_PARAMS})"
# Execute test
./scripts/execute-chaos-test.sh ${TEST_TYPE} ${TEST_PARAMS} ${ENVIRONMENT}
TEST_RESULT=$?
if [ ${TEST_RESULT} -eq 0 ]; then
echo "✅ Test passed: ${TEST_TYPE}"
PASSED_TESTS=$((PASSED_TESTS + 1))
else
echo "❌ Test failed: ${TEST_TYPE}"
FAILED_TESTS=$((FAILED_TESTS + 1))
fi
echo ""
done
# Summary
echo "=========================================="
echo "Chaos Test Suite Summary"
echo "=========================================="
echo "Passed: ${PASSED_TESTS}"
echo "Failed: ${FAILED_TESTS}"
echo "Total: $((PASSED_TESTS + FAILED_TESTS))"
echo ""
if [ ${FAILED_TESTS} -gt 0 ]; then
echo "❌ Test suite failed: ${FAILED_TESTS} test(s) failed"
exit 1
else
echo "✅ All tests passed"
exit 0
fi
Automated Resilience Validation
Resilience Validation Script:
#!/bin/bash
# scripts/validate-resilience-metrics.sh
ENVIRONMENT="${1:-staging}"
echo "📊 Validating Resilience Metrics"
echo "Environment: ${ENVIRONMENT}"
echo ""
VALIDATION_PASSED=true
# Validate service availability (--data-urlencode safely encodes the PromQL query; "// \"0\"" guards against an empty result)
AVAILABILITY=$(curl -sG http://prometheus:9090/api/v1/query --data-urlencode "query=service_availability{environment=\"${ENVIRONMENT}\"}" | jq -r '.data.result[0].value[1] // "0"')
AVAILABILITY_PERCENT=$(echo "${AVAILABILITY} * 100" | bc)
if (( $(echo "${AVAILABILITY_PERCENT} >= 99.9" | bc -l) )); then
echo "✅ Service availability: ${AVAILABILITY_PERCENT}% (target: ≥99.9%)"
else
echo "❌ Service availability: ${AVAILABILITY_PERCENT}% (target: ≥99.9%)"
VALIDATION_PASSED=false
fi
# Validate error rate
ERROR_RATE=$(curl -sG http://prometheus:9090/api/v1/query --data-urlencode "query=rate(http_requests_total{environment=\"${ENVIRONMENT}\",status=~\"5..\"}[5m])" | jq -r '.data.result[0].value[1] // "0"')
if (( $(echo "${ERROR_RATE} < 0.001" | bc -l) )); then
echo "✅ Error rate: ${ERROR_RATE}/sec (target: <0.001/sec)"
else
echo "❌ Error rate: ${ERROR_RATE}/sec (target: <0.001/sec)"
VALIDATION_PASSED=false
fi
# Validate latency
P95_LATENCY=$(curl -sG http://prometheus:9090/api/v1/query --data-urlencode "query=histogram_quantile(0.95,rate(http_request_duration_seconds_bucket{environment=\"${ENVIRONMENT}\"}[5m]))" | jq -r '.data.result[0].value[1] // "0"')
P95_LATENCY_MS=$(echo "${P95_LATENCY} * 1000" | bc)
if (( $(echo "${P95_LATENCY_MS} < 500" | bc -l) )); then
echo "✅ P95 latency: ${P95_LATENCY_MS}ms (target: <500ms)"
else
echo "❌ P95 latency: ${P95_LATENCY_MS}ms (target: <500ms)"
VALIDATION_PASSED=false
fi
# Validate circuit breaker state
CIRCUIT_BREAKER_OPEN=$(curl -sG http://prometheus:9090/api/v1/query --data-urlencode "query=circuit_breaker_state{environment=\"${ENVIRONMENT}\"}" | jq -r '.data.result[] | select(.value[1] == "Open") | .value[1]' | wc -l)
if [ "${CIRCUIT_BREAKER_OPEN}" = "0" ]; then
echo "✅ Circuit breakers: All closed (target: 0 open)"
else
echo "❌ Circuit breakers: ${CIRCUIT_BREAKER_OPEN} open (target: 0 open)"
VALIDATION_PASSED=false
fi
echo ""
if [ "${VALIDATION_PASSED}" = true ]; then
echo "✅ Resilience validation passed"
exit 0
else
echo "❌ Resilience validation failed"
exit 1
fi
Block Deployment on Chaos Failures
Pipeline Gate Configuration:
# azure-pipelines/deployment-gate.yaml
trigger: none
stages:
- stage: ChaosValidationGate
displayName: 'Chaos Validation Gate'
jobs:
- job: ChaosGate
displayName: 'Chaos Validation Gate'
pool:
vmImage: 'ubuntu-latest'
steps:
- task: Bash@3
displayName: 'Check Chaos Test Results'
inputs:
targetType: 'inline'
script: |
# Get latest chaos test results
LATEST_RESULTS=$(az pipelines runs list \
--pipeline-name "ChaosTests" \
--top 1 \
--query "[0].result" \
--output tsv)
if [ "${LATEST_RESULTS}" != "succeeded" ]; then
echo "##vso[task.logissue type=error]Chaos tests failed. Deployment blocked."
exit 1
fi
echo "✅ Chaos tests passed. Deployment allowed."
Continuous Chaos¶
Continuous chaos runs low-intensity experiments around the clock with a minimal blast radius (about 1% of traffic affected) to detect resilience regressions proactively.
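Translating a 1% blast radius into a concrete pod count requires rounding: a deployment with fewer than 100 replicas would otherwise compute zero affected pods. A sketch that takes the ceiling so at least one pod is always eligible; the helper name is illustrative:

```shell
#!/usr/bin/env bash
# Number of pods covered by a percentage blast radius, rounding up
# so small deployments still yield at least one eligible pod.
blast_radius_pods() {
  local replicas=$1 percent=$2
  echo $(( (replicas * percent + 99) / 100 ))
}

blast_radius_pods 150 1   # 1% of 150 replicas -> 2
blast_radius_pods 12 1    # 1% of 12 replicas  -> 1
```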
Continuous Chaos Configuration
Continuous Chaos Deployment:
# kubernetes/continuous-chaos/continuous-chaos-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: continuous-chaos-config
namespace: chaos-testing
data:
config.yaml: |
continuousChaos:
enabled: true
blastRadius: 0.01 # 1% traffic affected
experiments:
- type: pod-kill
frequency: "1h"
target: "atp-ingestion-api"
maxPodsAffected: 1
- type: network-latency
frequency: "2h"
target: "atp-ingestion-api"
latency: "100ms"
duration: "30m"
- type: error-injection
frequency: "4h"
target: "atp-ingestion-api"
errorRate: 0.01 # 1% error rate
duration: "15m"
monitoring:
enabled: true
prometheusEndpoint: "http://prometheus:9090"
alertThresholds:
availability: 0.995 # 99.5%
errorRate: 0.005 # 0.5%
latency: 1000 # 1 second
Continuous Chaos Controller:
#!/bin/bash
# scripts/continuous-chaos-controller.sh
NAMESPACE="${1:-chaos-testing}"
CONFIG_FILE="${2:-continuous-chaos-config.yaml}"
echo "🔄 Starting Continuous Chaos Controller"
echo "Namespace: ${NAMESPACE}"
echo ""
# Load configuration
CONFIG=$(kubectl get configmap continuous-chaos-config -n ${NAMESPACE} -o jsonpath='{.data.config\.yaml}')
# Parse experiments (each item is a multi-line YAML block, so iterate by index rather than by line)
EXPERIMENT_COUNT=$(echo "${CONFIG}" | yq eval '.continuousChaos.experiments | length' -)
# Continuous chaos loop
while true; do
echo "[$(date +"%Y-%m-%d %H:%M:%S")] Continuous chaos cycle"
for ((i = 0; i < EXPERIMENT_COUNT; i++)); do
experiment=$(echo "${CONFIG}" | yq eval ".continuousChaos.experiments[${i}]" -)
EXPERIMENT_TYPE=$(echo "${experiment}" | yq eval '.type' -)
FREQUENCY=$(echo "${experiment}" | yq eval '.frequency' -)
TARGET=$(echo "${experiment}" | yq eval '.target' -)
# Check if experiment should run (bracket notation because the key contains dashes)
LAST_RUN=$(kubectl get configmap continuous-chaos-state -n ${NAMESPACE} -o jsonpath="{.data['${EXPERIMENT_TYPE}-${TARGET}']}" 2>/dev/null || echo "")
if [ -z "${LAST_RUN}" ]; then
# First run
RUN_EXPERIMENT=true
else
# Check frequency (handles single-unit values such as "1h" or "30m")
LAST_RUN_EPOCH=$(date -u -d "${LAST_RUN}" +%s)
CURRENT_EPOCH=$(date +%s)
FREQUENCY_SECONDS=$(echo "${FREQUENCY}" | sed 's/h/*3600/g; s/m/*60/g' | bc)
ELAPSED=$((CURRENT_EPOCH - LAST_RUN_EPOCH))
if (( ELAPSED >= FREQUENCY_SECONDS )); then
RUN_EXPERIMENT=true
else
RUN_EXPERIMENT=false
fi
fi
if [ "${RUN_EXPERIMENT}" = true ]; then
echo " Running experiment: ${EXPERIMENT_TYPE} on ${TARGET}"
# Execute experiment (quote the YAML block so it is passed as a single argument)
./scripts/execute-continuous-chaos-experiment.sh "${EXPERIMENT_TYPE}" "${TARGET}" "${experiment}"
# Update last run time
kubectl patch configmap continuous-chaos-state -n ${NAMESPACE} \
--type merge \
-p "{\"data\":{\"${EXPERIMENT_TYPE}-${TARGET}\":\"$(date -u +"%Y-%m-%dT%H:%M:%SZ")\"}}"
fi
done
# Sleep for 1 minute before next cycle
sleep 60
done
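The controller's `sed | bc` frequency conversion handles single-unit values like `1h` or `30m` but mishandles compound ones like `1h30m` (the substitution yields `1*360030*60`). A more general parser in pure bash, as a sketch:

```shell
#!/usr/bin/env bash
# Convert durations such as "1h", "30m", "45s", or "1h30m" to seconds.
duration_to_seconds() {
  local d="$1" total=0
  [[ "$d" =~ ([0-9]+)h ]] && total=$(( total + BASH_REMATCH[1] * 3600 ))
  [[ "$d" =~ ([0-9]+)m ]] && total=$(( total + BASH_REMATCH[1] * 60 ))
  [[ "$d" =~ ([0-9]+)s ]] && total=$(( total + BASH_REMATCH[1] ))
  echo "${total}"
}

duration_to_seconds "1h"      # -> 3600
duration_to_seconds "1h30m"   # -> 5400
```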
Continuous Chaos Experiment Script:
#!/bin/bash
# scripts/execute-continuous-chaos-experiment.sh
EXPERIMENT_TYPE="${1}"
TARGET="${2}"
EXPERIMENT_CONFIG="${3}"
echo " Executing continuous chaos: ${EXPERIMENT_TYPE} on ${TARGET}"
case ${EXPERIMENT_TYPE} in
pod-kill)
# Kill 1 pod with minimal impact
MAX_PODS=$(echo "${EXPERIMENT_CONFIG}" | yq eval '.maxPodsAffected' -)
./scripts/execute-pod-failure-experiment.sh ${TARGET} ${MAX_PODS} "5m"
;;
network-latency)
# Inject minimal latency
LATENCY=$(echo "${EXPERIMENT_CONFIG}" | yq eval '.latency' -)
DURATION=$(echo "${EXPERIMENT_CONFIG}" | yq eval '.duration' -)
./scripts/execute-latency-injection-experiment.sh ${TARGET} ${LATENCY} ${DURATION}
;;
error-injection)
# Inject minimal error rate
ERROR_RATE=$(echo "${EXPERIMENT_CONFIG}" | yq eval '.errorRate' -)
DURATION=$(echo "${EXPERIMENT_CONFIG}" | yq eval '.duration' -)
./scripts/execute-error-injection-experiment.sh ${TARGET} ${ERROR_RATE} ${DURATION}
;;
*)
echo "Unknown experiment type: ${EXPERIMENT_TYPE}"
;;
esac
Regression Detection
Regression Detection Script:
#!/bin/bash
# scripts/detect-resilience-regression.sh
ENVIRONMENT="${1:-staging}"
BASELINE_METRICS="${2:-baseline-metrics.json}"
echo "🔍 Detecting Resilience Regressions"
echo "Environment: ${ENVIRONMENT}"
echo ""
REGRESSION_DETECTED=false
# Get current metrics ("// defaults" guard against empty results and missing keys)
CURRENT_AVAILABILITY=$(curl -sG http://prometheus:9090/api/v1/query --data-urlencode "query=service_availability{environment=\"${ENVIRONMENT}\"}" | jq -r '.data.result[0].value[1] // "0"')
CURRENT_AVAILABILITY_PERCENT=$(echo "${CURRENT_AVAILABILITY} * 100" | bc)
BASELINE_AVAILABILITY=$(jq -r '.metrics.availability // "99.95"' "${BASELINE_METRICS}" 2>/dev/null || echo "99.95")
# Compare with baseline
AVAILABILITY_DROP=$(echo "${BASELINE_AVAILABILITY} - ${CURRENT_AVAILABILITY_PERCENT}" | bc)
if (( $(echo "${AVAILABILITY_DROP} > 0.1" | bc -l) )); then
echo "⚠️ Regression detected: Availability dropped by ${AVAILABILITY_DROP}%"
echo " Baseline: ${BASELINE_AVAILABILITY}%"
echo " Current: ${CURRENT_AVAILABILITY_PERCENT}%"
REGRESSION_DETECTED=true
fi
# Get current error rate
CURRENT_ERROR_RATE=$(curl -sG http://prometheus:9090/api/v1/query --data-urlencode "query=rate(http_requests_total{environment=\"${ENVIRONMENT}\",status=~\"5..\"}[5m])" | jq -r '.data.result[0].value[1] // "0"')
BASELINE_ERROR_RATE=$(jq -r '.metrics.errorRate // "0.0001"' "${BASELINE_METRICS}" 2>/dev/null || echo "0.0001")
ERROR_RATE_INCREASE=$(echo "${CURRENT_ERROR_RATE} - ${BASELINE_ERROR_RATE}" | bc)
if (( $(echo "${ERROR_RATE_INCREASE} > 0.0005" | bc -l) )); then
echo "⚠️ Regression detected: Error rate increased by ${ERROR_RATE_INCREASE}/sec"
echo " Baseline: ${BASELINE_ERROR_RATE}/sec"
echo " Current: ${CURRENT_ERROR_RATE}/sec"
REGRESSION_DETECTED=true
fi
if [ "${REGRESSION_DETECTED}" = true ]; then
echo ""
echo "❌ Resilience regression detected"
# Send alert
# ./scripts/send-alert.sh "Resilience regression detected in ${ENVIRONMENT}"
exit 1
else
echo "✅ No regression detected"
exit 0
fi
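The abort logic above hinges on two fixed thresholds: a 0.1 percentage-point availability drop and a 0.0005 req/s error-rate increase. A minimal sketch of those checks as pure functions, using awk rather than bc so they can be unit-tested without a live Prometheus; the threshold values are carried over from the script above and should be tuned to your own SLOs:

```shell
#!/bin/bash
# Sketch: regression thresholds factored into testable predicates.
# Thresholds mirror detect-resilience-regression.sh above.

# availability_regressed BASELINE_PCT CURRENT_PCT -> exit 0 if regressed
availability_regressed() {
  awk -v b="$1" -v c="$2" 'BEGIN { exit !(b - c > 0.1) }'
}

# error_rate_regressed BASELINE_RPS CURRENT_RPS -> exit 0 if regressed
error_rate_regressed() {
  awk -v b="$1" -v c="$2" 'BEGIN { exit !(c - b > 0.0005) }'
}

if availability_regressed 99.95 99.80; then
  echo "availability regression"
fi
if ! error_rate_regressed 0.0001 0.0002; then
  echo "error rate within tolerance"
fi
```

Keeping the comparisons in one place makes it straightforward to tighten thresholds per environment without touching the Prometheus-query plumbing.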
Chaos Reporting¶
Chaos reporting provides comprehensive tracking and reporting of chaos experiments, resilience scores, gap tracking, and GameDay reports.
Experiment Success/Failure Tracking
Experiment Tracking Database Schema:
-- Database schema for chaos experiment tracking
CREATE TABLE ChaosExperiments (
Id UNIQUEIDENTIFIER PRIMARY KEY DEFAULT NEWID(),
ExperimentId NVARCHAR(100) NOT NULL UNIQUE,
ExperimentType NVARCHAR(50) NOT NULL,
TargetService NVARCHAR(100),
Environment NVARCHAR(50),
Status NVARCHAR(20) NOT NULL, -- Running, Success, Failed, Aborted
StartTime DATETIME2 NOT NULL,
EndTime DATETIME2,
Duration INT, -- seconds
BlastRadius DECIMAL(5,2),
Hypothesis NVARCHAR(MAX),
CreatedAt DATETIME2 DEFAULT GETUTCDATE(),
UpdatedAt DATETIME2 DEFAULT GETUTCDATE()
);
CREATE TABLE ChaosExperimentMetrics (
Id UNIQUEIDENTIFIER PRIMARY KEY DEFAULT NEWID(),
ExperimentId UNIQUEIDENTIFIER NOT NULL,
MetricName NVARCHAR(100) NOT NULL,
MetricValue DECIMAL(18,4),
Timestamp DATETIME2 NOT NULL,
FOREIGN KEY (ExperimentId) REFERENCES ChaosExperiments(Id)
);
CREATE TABLE ChaosExperimentResults (
Id UNIQUEIDENTIFIER PRIMARY KEY DEFAULT NEWID(),
ExperimentId UNIQUEIDENTIFIER NOT NULL,
Success BIT NOT NULL,
ValidationPassed BIT,
IssuesFound NVARCHAR(MAX),
LessonsLearned NVARCHAR(MAX),
CreatedAt DATETIME2 DEFAULT GETUTCDATE(),
FOREIGN KEY (ExperimentId) REFERENCES ChaosExperiments(Id)
);
Experiment Tracking Script:
#!/bin/bash
# scripts/track-chaos-experiment.sh
EXPERIMENT_ID="${1}"
EXPERIMENT_TYPE="${2}"
TARGET_SERVICE="${3}"
ENVIRONMENT="${4}"
STATUS="${5}" # Success, Failed, Aborted
echo "📊 Tracking Chaos Experiment"
echo "Experiment ID: ${EXPERIMENT_ID}"
echo ""
# Record experiment in database
sqlcmd -S ${SQL_SERVER} -d ${DATABASE} -Q "
INSERT INTO ChaosExperiments (ExperimentId, ExperimentType, TargetService, Environment, Status, StartTime, EndTime)
VALUES ('${EXPERIMENT_ID}', '${EXPERIMENT_TYPE}', '${TARGET_SERVICE}', '${ENVIRONMENT}', '${STATUS}',
GETUTCDATE(), GETUTCDATE())
"
# Get experiment metrics
# -g disables curl URL globbing so {} and [] pass through literally
AVAILABILITY=$(curl -sg "http://prometheus:9090/api/v1/query?query=service_availability{service=\"${TARGET_SERVICE}\"}" | jq -r '.data.result[0].value[1] // "0"')
ERROR_RATE=$(curl -sg "http://prometheus:9090/api/v1/query?query=rate(http_requests_total{service=\"${TARGET_SERVICE}\",status=~\"5..\"}[5m])" | jq -r '.data.result[0].value[1] // "0"')
# Record metrics
sqlcmd -S ${SQL_SERVER} -d ${DATABASE} -Q "
DECLARE @ExperimentId UNIQUEIDENTIFIER = (SELECT Id FROM ChaosExperiments WHERE ExperimentId = '${EXPERIMENT_ID}');
INSERT INTO ChaosExperimentMetrics (ExperimentId, MetricName, MetricValue, Timestamp)
VALUES (@ExperimentId, 'Availability', ${AVAILABILITY}, GETUTCDATE()),
(@ExperimentId, 'ErrorRate', ${ERROR_RATE}, GETUTCDATE())
"
echo "✅ Experiment tracked"
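The ChaosExperiments schema defines a Duration column in seconds, but the tracking script above records only start and end timestamps. A hypothetical helper (assuming GNU date for `-d` parsing; on macOS use `gdate` from coreutils) that derives the duration from two ISO-8601 timestamps before inserting it:

```shell
#!/bin/bash
# Sketch: compute the Duration column (seconds) from start/end timestamps.
# Assumes GNU date; experiment_duration_seconds is a hypothetical helper.
experiment_duration_seconds() {
  local start_ts="$1" end_ts="$2"
  echo $(( $(date -d "${end_ts}" +%s) - $(date -d "${start_ts}" +%s) ))
}

experiment_duration_seconds "2025-01-01T10:00:00Z" "2025-01-01T10:05:00Z"
```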
Resilience Score Over Time
Resilience Score Calculation:
#!/bin/bash
# scripts/calculate-resilience-score.sh
ENVIRONMENT="${1:-staging}"
TIME_RANGE="${2:-7d}" # 7 days, 30d, etc.
echo "📊 Calculating Resilience Score"
echo "Environment: ${ENVIRONMENT}"
echo "Time Range: ${TIME_RANGE}"
echo ""
# Get metrics for time range
# -g disables curl URL globbing; jq's // supplies a default when no series match
AVAILABILITY=$(curl -sg "http://prometheus:9090/api/v1/query?query=avg_over_time(service_availability{environment=\"${ENVIRONMENT}\"}[${TIME_RANGE}])" | jq -r '.data.result[0].value[1] // "0"')
AVAILABILITY_PERCENT=$(echo "${AVAILABILITY} * 100" | bc)
ERROR_RATE=$(curl -sg "http://prometheus:9090/api/v1/query?query=avg_over_time(rate(http_requests_total{environment=\"${ENVIRONMENT}\",status=~\"5..\"}[5m])[${TIME_RANGE}:1h])" | jq -r '.data.result[0].value[1] // "0"')
P95_LATENCY=$(curl -sg "http://prometheus:9090/api/v1/query?query=histogram_quantile(0.95,avg_over_time(rate(http_request_duration_seconds_bucket{environment=\"${ENVIRONMENT}\"}[5m])[${TIME_RANGE}:1h]))" | jq -r '.data.result[0].value[1] // "0"')
P95_LATENCY_MS=$(echo "${P95_LATENCY} * 1000" | bc)
# Calculate resilience score (0-100)
# Availability: 40 points
AVAILABILITY_SCORE=$(echo "scale=2; ${AVAILABILITY_PERCENT} * 0.4" | bc)
# Error Rate: 30 points (inverse)
ERROR_RATE_SCORE=$(echo "scale=2; (1 - ${ERROR_RATE} * 1000) * 30" | bc)
if (( $(echo "${ERROR_RATE_SCORE} < 0" | bc -l) )); then
ERROR_RATE_SCORE=0
fi
# Latency: 30 points (inverse)
LATENCY_SCORE=$(echo "scale=2; (1 - ${P95_LATENCY_MS} / 1000) * 30" | bc)
if (( $(echo "${LATENCY_SCORE} < 0" | bc -l) )); then
LATENCY_SCORE=0
fi
RESILIENCE_SCORE=$(echo "${AVAILABILITY_SCORE} + ${ERROR_RATE_SCORE} + ${LATENCY_SCORE}" | bc)
echo "Resilience Score Breakdown:"
echo " Availability (40%): ${AVAILABILITY_SCORE} / 40"
echo " Error Rate (30%): ${ERROR_RATE_SCORE} / 30"
echo " Latency (30%): ${LATENCY_SCORE} / 30"
echo ""
echo "Overall Resilience Score: ${RESILIENCE_SCORE} / 100"
echo ""
# Record resilience score
sqlcmd -S ${SQL_SERVER} -d ${DATABASE} -Q "
INSERT INTO ResilienceScores (Environment, TimeRange, Score, Availability, ErrorRate, Latency, Timestamp)
VALUES ('${ENVIRONMENT}', '${TIME_RANGE}', ${RESILIENCE_SCORE}, ${AVAILABILITY_PERCENT}, ${ERROR_RATE}, ${P95_LATENCY_MS}, GETUTCDATE())
"
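The weighting described above (40 points availability, 30 error rate, 30 latency, with the two penalty terms clamped at zero) can be isolated into a pure function for testing. A sketch using awk instead of bc, with the same assumed normalization as the script (error rate saturates at 0.001 req/s, p95 latency at 1000 ms):

```shell
#!/bin/bash
# Sketch: the 40/30/30 resilience score as a testable pure function.
# Inputs: availability %, 5xx rate (req/s), p95 latency (ms).
resilience_score() {
  awk -v avail="$1" -v err="$2" -v p95="$3" 'BEGIN {
    a = avail * 0.4                                # availability: up to 40 points
    e = (1 - err * 1000) * 30; if (e < 0) e = 0    # error rate: up to 30, clamped
    l = (1 - p95 / 1000) * 30; if (l < 0) l = 0    # latency: up to 30, clamped
    printf "%.2f\n", a + e + l
  }'
}

resilience_score 99.95 0 0
```

With a perfect error rate and latency, a 99.95% available service scores 99.98; any error rate above 0.001 req/s or p95 above 1000 ms zeroes out that component rather than going negative.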
Gap Tracking and Remediation
Gap Tracking Database Schema:
-- Database schema for resilience gap tracking
CREATE TABLE ResilienceGaps (
Id UNIQUEIDENTIFIER PRIMARY KEY DEFAULT NEWID(),
GapId NVARCHAR(100) NOT NULL UNIQUE,
Title NVARCHAR(200) NOT NULL,
Description NVARCHAR(MAX),
Category NVARCHAR(50), -- Infrastructure, Application, Security, Data
Severity NVARCHAR(20), -- Critical, High, Medium, Low
ImpactScore DECIMAL(5,2),
LikelihoodScore DECIMAL(5,2),
PriorityScore DECIMAL(5,2), -- Impact × Likelihood
Status NVARCHAR(20), -- Open, InProgress, Resolved, Closed
AssignedTo NVARCHAR(100),
CreatedAt DATETIME2 DEFAULT GETUTCDATE(),
UpdatedAt DATETIME2 DEFAULT GETUTCDATE(),
ResolvedAt DATETIME2
);
CREATE TABLE GapRemediation (
Id UNIQUEIDENTIFIER PRIMARY KEY DEFAULT NEWID(),
GapId UNIQUEIDENTIFIER NOT NULL,
Action NVARCHAR(200) NOT NULL,
Status NVARCHAR(20), -- Planned, InProgress, Completed
AssignedTo NVARCHAR(100),
DueDate DATETIME2,
CompletedAt DATETIME2,
FOREIGN KEY (GapId) REFERENCES ResilienceGaps(Id)
);
Gap Tracking Script:
#!/bin/bash
# scripts/track-resilience-gap.sh
GAP_ID="${1}"
TITLE="${2}"
DESCRIPTION="${3}"
CATEGORY="${4}"
SEVERITY="${5}"
IMPACT_SCORE="${6}"
LIKELIHOOD_SCORE="${7}"
echo "📋 Tracking Resilience Gap"
echo "Gap ID: ${GAP_ID}"
echo ""
# Calculate priority score
PRIORITY_SCORE=$(echo "${IMPACT_SCORE} * ${LIKELIHOOD_SCORE}" | bc)
# Insert gap into database
sqlcmd -S ${SQL_SERVER} -d ${DATABASE} -Q "
INSERT INTO ResilienceGaps (GapId, Title, Description, Category, Severity, ImpactScore, LikelihoodScore, PriorityScore, Status)
VALUES ('${GAP_ID}', '${TITLE}', '${DESCRIPTION}', '${CATEGORY}', '${SEVERITY}',
${IMPACT_SCORE}, ${LIKELIHOOD_SCORE}, ${PRIORITY_SCORE}, 'Open')
"
echo "✅ Gap tracked"
echo " Priority Score: ${PRIORITY_SCORE} (Impact: ${IMPACT_SCORE}, Likelihood: ${LIKELIHOOD_SCORE})"
Quarterly GameDay Reports
GameDay Report Generation:
#!/bin/bash
# scripts/generate-quarterly-gameday-report.sh
QUARTER="${1:-Q1}"
YEAR="${2:-$(date +%Y)}"
echo "📊 Generating Quarterly GameDay Report"
echo "Quarter: ${QUARTER} ${YEAR}"
echo ""
REPORT_FILE="gameday-reports/quarterly-${QUARTER}-${YEAR}-report.md"
# Get GameDay data for quarter
case ${QUARTER} in
Q1)
START_DATE="${YEAR}-01-01"
END_DATE="${YEAR}-03-31"
;;
Q2)
START_DATE="${YEAR}-04-01"
END_DATE="${YEAR}-06-30"
;;
Q3)
START_DATE="${YEAR}-07-01"
END_DATE="${YEAR}-09-30"
;;
Q4)
START_DATE="${YEAR}-10-01"
END_DATE="${YEAR}-12-31"
;;
*)
echo "Unknown quarter: ${QUARTER} (expected Q1-Q4)"
exit 1
;;
esac
mkdir -p "$(dirname "${REPORT_FILE}")"
cat > ${REPORT_FILE} <<EOF
# ATP Chaos Engineering Quarterly Report
**Quarter**: ${QUARTER} ${YEAR}
**Period**: ${START_DATE} to ${END_DATE}
**Generated**: $(date -u +"%Y-%m-%d")
## Executive Summary
[Summary of chaos engineering activities for the quarter]
## GameDay Activities
### GameDays Conducted
- **Total GameDays**: [COUNT]
- **Scenarios Executed**: [LIST]
- **Success Rate**: [PERCENTAGE]%
### GameDay Metrics Summary
| Metric | Target | Average | Status |
|--------|--------|---------|--------|
| **MTTD** | <5 min | [VALUE] | [PASS/FAIL] |
| **MTTR** | <10 min | [VALUE] | [PASS/FAIL] |
| **RTO** | <30 min | [VALUE] | [PASS/FAIL] |
| **RPO** | <1 hour | [VALUE] | [PASS/FAIL] |
## Resilience Score Trends
[Resilience score trends over the quarter]
## Resilience Gaps
### Gaps Identified
- **Total Gaps**: [COUNT]
- **Critical**: [COUNT]
- **High**: [COUNT]
- **Medium**: [COUNT]
- **Low**: [COUNT]
### Gap Remediation
- **Resolved**: [COUNT]
- **In Progress**: [COUNT]
- **Open**: [COUNT]
## Improvement Actions
[Summary of improvement actions and their status]
## Lessons Learned
[Key lessons learned from GameDays and experiments]
## Next Quarter Plan
[Planned activities for next quarter]
EOF
echo "✅ Quarterly report generated: ${REPORT_FILE}"
echo ""
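The script above takes the quarter as an argument; to have it default to the current quarter instead, the month-to-quarter mapping can be sketched as a small helper (`quarter_of` is hypothetical, not part of the existing scripts; `date +%-m` assumes GNU date):

```shell
#!/bin/bash
# Sketch: derive the quarter label from a 1-12 month number, so the report
# script could default with: QUARTER=$(quarter_of "$(date +%-m)")
quarter_of() {
  local month="$1"
  # 10# forces base-10 so zero-padded months like "06" parse correctly
  echo "Q$(( (10#${month} - 1) / 3 + 1 ))"
}

quarter_of 11
```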
Improvement Tracking¶
Improvement tracking provides resilience backlog management with prioritization, implementation tracking, and resilience metrics dashboards.
Resilience Backlog
Backlog Management Script:
#!/bin/bash
# scripts/manage-resilience-backlog.sh
ACTION="${1:-list}" # list, add, update, prioritize
case ${ACTION} in
list)
echo "📋 Resilience Backlog"
echo ""
sqlcmd -S ${SQL_SERVER} -d ${DATABASE} -Q "
SELECT
GapId,
Title,
Category,
Severity,
PriorityScore,
Status,
AssignedTo
FROM ResilienceGaps
WHERE Status IN ('Open', 'InProgress')
ORDER BY PriorityScore DESC
" -W -h -1
;;
add)
GAP_ID="${2}"
TITLE="${3}"
DESCRIPTION="${4}"
CATEGORY="${5}"
SEVERITY="${6}"
IMPACT="${7}"
LIKELIHOOD="${8}"
./scripts/track-resilience-gap.sh ${GAP_ID} "${TITLE}" "${DESCRIPTION}" ${CATEGORY} ${SEVERITY} ${IMPACT} ${LIKELIHOOD}
;;
prioritize)
echo "📊 Prioritizing Resilience Backlog"
echo ""
sqlcmd -S ${SQL_SERVER} -d ${DATABASE} -Q "
UPDATE ResilienceGaps
SET PriorityScore = ImpactScore * LikelihoodScore
WHERE Status IN ('Open', 'InProgress')
"
echo "✅ Backlog prioritized"
;;
*)
echo "Unknown action: ${ACTION}"
exit 1
;;
esac
Prioritization (Impact × Likelihood)
Prioritization Matrix:
graph TD
HIGH[High Impact] --> HIGHHIGH[High Impact ×<br/>High Likelihood]
HIGH --> HIGHLOW[High Impact ×<br/>Low Likelihood]
LOW[Low Impact] --> LOWHIGH[Low Impact ×<br/>High Likelihood]
LOW --> LOWLOW[Low Impact ×<br/>Low Likelihood]
HIGHHIGH --> CRITICAL[Critical: Immediate Action]
HIGHLOW --> MED_PRIO[Medium: Plan Soon]
LOWHIGH --> MED_PRIO
LOWLOW --> LOW_PRIO[Low: Backlog]
style HIGHHIGH fill:#FF6B6B
style CRITICAL fill:#FF6B6B
style HIGHLOW fill:#FFD93D
style LOWHIGH fill:#FFD93D
style MED_PRIO fill:#FFD93D
style LOWLOW fill:#6BCF7F
style LOW_PRIO fill:#6BCF7F
Impact and Likelihood Scoring:
| Impact | Score | Description |
|---|---|---|
| Critical | 5 | Complete service outage, data loss, security breach |
| High | 4 | Significant service degradation, partial data loss |
| Medium | 3 | Moderate service impact, limited functionality loss |
| Low | 2 | Minor service impact, cosmetic issues |
| Very Low | 1 | Negligible impact |

| Likelihood | Score | Description |
|---|---|---|
| Very High | 5 | >50% chance of occurrence |
| High | 4 | 25-50% chance of occurrence |
| Medium | 3 | 10-25% chance of occurrence |
| Low | 2 | 1-10% chance of occurrence |
| Very Low | 1 | <1% chance of occurrence |
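A sketch combining the two scales: priority is impact × likelihood (range 1-25), then bucketed for triage. The bucket thresholds (≥20 Critical, ≥12 High, ≥6 Medium) are illustrative assumptions, not defined by the matrix above; tune them to your own risk appetite:

```shell
#!/bin/bash
# Sketch: priority score and triage bucket from 1-5 impact/likelihood scores.
# Bucket cut-offs are assumptions for illustration.
priority_bucket() {
  local impact="$1" likelihood="$2"
  local score=$(( impact * likelihood ))
  local bucket="Low"
  if   [ ${score} -ge 20 ]; then bucket="Critical"
  elif [ ${score} -ge 12 ]; then bucket="High"
  elif [ ${score} -ge 6 ];  then bucket="Medium"
  fi
  echo "${score} ${bucket}"
}

priority_bucket 5 5
priority_bucket 4 3
```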
Implementation Tracking
Implementation Tracking Script:
#!/bin/bash
# scripts/track-implementation.sh
GAP_ID="${1}"
ACTION="${2}"
STATUS="${3}"
ASSIGNED_TO="${4}"
DUE_DATE="${5}"
echo "📝 Tracking Implementation"
echo "Gap ID: ${GAP_ID}"
echo "Action: ${ACTION}"
echo ""
# Insert remediation action
sqlcmd -S ${SQL_SERVER} -d ${DATABASE} -Q "
DECLARE @GapId UNIQUEIDENTIFIER = (SELECT Id FROM ResilienceGaps WHERE GapId = '${GAP_ID}');
INSERT INTO GapRemediation (GapId, Action, Status, AssignedTo, DueDate)
VALUES (@GapId, '${ACTION}', '${STATUS}', '${ASSIGNED_TO}', '${DUE_DATE}')
"
# Update gap status
sqlcmd -S ${SQL_SERVER} -d ${DATABASE} -Q "
UPDATE ResilienceGaps
SET Status = '${STATUS}', AssignedTo = '${ASSIGNED_TO}', UpdatedAt = GETUTCDATE()
WHERE GapId = '${GAP_ID}'
"
echo "✅ Implementation tracked"
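GapRemediation rows carry a DueDate, but nothing above flags overdue actions. One minimal approach relies on the fact that lexicographic order equals chronological order for ISO `YYYY-MM-DD` dates (`is_overdue` is a hypothetical helper, not an existing script):

```shell
#!/bin/bash
# Sketch: flag overdue remediation actions by comparing ISO dates as strings.
# Second argument defaults to today; passing it explicitly makes this testable.
is_overdue() {
  local due_date="$1" today="${2:-$(date +%F)}"
  [ "${today}" \> "${due_date}" ]
}

if is_overdue "2020-01-01"; then
  echo "action overdue"
fi
```

An action is not flagged on its due date itself, only after it has passed.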
Resilience Metrics Dashboard
Dashboard Configuration:
# dashboards/resilience-metrics-dashboard.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: resilience-metrics-dashboard
  namespace: monitoring
data:
  dashboard.json: |
    {
      "dashboard": {
        "title": "ATP Resilience Metrics Dashboard",
        "panels": [
          {
            "title": "Resilience Score Over Time",
            "type": "graph",
            "targets": [
              {
                "expr": "resilience_score{environment=\"staging\"}",
                "legendFormat": "Resilience Score"
              }
            ]
          },
          {
            "title": "Chaos Experiment Success Rate",
            "type": "stat",
            "targets": [
              {
                "expr": "rate(chaos_experiments_total{status=\"success\"}[7d]) / rate(chaos_experiments_total[7d])",
                "legendFormat": "Success Rate"
              }
            ]
          },
          {
            "title": "Resilience Gaps by Priority",
            "type": "piechart",
            "targets": [
              {
                "expr": "resilience_gaps_total{priority=\"critical\"}",
                "legendFormat": "Critical"
              },
              {
                "expr": "resilience_gaps_total{priority=\"high\"}",
                "legendFormat": "High"
              },
              {
                "expr": "resilience_gaps_total{priority=\"medium\"}",
                "legendFormat": "Medium"
              }
            ]
          },
          {
            "title": "GameDay Metrics Trend",
            "type": "table",
            "targets": [
              {
                "expr": "gameday_mttd",
                "legendFormat": "MTTD"
              },
              {
                "expr": "gameday_mttr",
                "legendFormat": "MTTR"
              },
              {
                "expr": "gameday_rto",
                "legendFormat": "RTO"
              },
              {
                "expr": "gameday_rpo",
                "legendFormat": "RPO"
              }
            ]
          }
        ]
      }
    }
Dashboard Generation Script:
#!/bin/bash
# scripts/generate-resilience-dashboard.sh
ENVIRONMENT="${1:-staging}"
echo "📊 Generating Resilience Metrics Dashboard"
echo "Environment: ${ENVIRONMENT}"
echo ""
# Generate dashboard JSON
DASHBOARD_JSON=$(cat <<EOF
{
"dashboard": {
"title": "ATP Resilience Metrics - ${ENVIRONMENT}",
"time": {
"from": "now-7d",
"to": "now"
},
"panels": [
{
"id": 1,
"title": "Resilience Score",
"type": "stat",
"targets": [
{
"expr": "resilience_score{environment=\"${ENVIRONMENT}\"}",
"refId": "A"
}
]
},
{
"id": 2,
"title": "Chaos Experiments",
"type": "graph",
"targets": [
{
"expr": "rate(chaos_experiments_total{environment=\"${ENVIRONMENT}\"}[1h])",
"refId": "A"
}
]
},
{
"id": 3,
"title": "Resilience Gaps",
"type": "table",
"targets": [
{
"expr": "resilience_gaps_total{environment=\"${ENVIRONMENT}\"}",
"refId": "A"
}
]
}
]
}
}
EOF
)
# Create dashboard in Grafana
curl -X POST \
http://grafana:3000/api/dashboards/db \
-H "Content-Type: application/json" \
-H "Authorization: Bearer ${GRAFANA_API_KEY}" \
-d "${DASHBOARD_JSON}"
echo "✅ Dashboard generated"
echo ""
Chaos Automation and Reporting Visualization
graph TB
AUTOMATION[Chaos Automation] --> CHAOSMESH[Chaos Mesh]
AUTOMATION --> LITMUS[Litmus Chaos]
AUTOMATION --> AZURESTUDIO[Azure Chaos Studio]
AUTOMATION --> CUSTOM[Custom Scripts]
CICD[CI/CD Integration] --> STAGING[Staging Pipeline]
STAGING --> CHAOSTESTS[Chaos Tests]
CHAOSTESTS --> VALIDATION[Resilience Validation]
VALIDATION --> GATE[Deployment Gate]
CONTINUOUS[Continuous Chaos] --> LOWLEVEL[Low-Level Chaos]
LOWLEVEL --> REGRESSION[Regression Detection]
REPORTING[Chaos Reporting] --> TRACKING[Experiment Tracking]
REPORTING --> SCORE[Resilience Score]
REPORTING --> GAPS[Gap Tracking]
REPORTING --> GAMEDAY[GameDay Reports]
IMPROVEMENT[Improvement Tracking] --> BACKLOG[Resilience Backlog]
BACKLOG --> PRIORITY[Prioritization]
PRIORITY --> IMPLEMENTATION[Implementation Tracking]
IMPLEMENTATION --> DASHBOARD[Resilience Dashboard]
style AUTOMATION fill:#FFE5B4
style CICD fill:#FFE5B4
style CONTINUOUS fill:#FFE5B4
style REPORTING fill:#FFE5B4
style IMPROVEMENT fill:#FFE5B4
Summary: Chaos Automation and Reporting¶
- Chaos Automation Tools: Comprehensive chaos automation tools including Chaos Mesh (Kubernetes-native), Litmus Chaos (chaos workflows), Azure Chaos Studio (Azure-integrated), and Custom Scripts (ATP-specific chaos); with installation scripts, configuration examples, and tool comparison table
- CI/CD Integration: Automated chaos testing in staging pipelines with resilience validation and deployment blocking; includes Azure Pipelines configuration, chaos test suite script, resilience validation script, and pipeline gate configuration to block deployments on chaos failures
- Continuous Chaos: Low-level chaos running continuously with minimal blast radius (1% traffic affected) to detect resilience regressions proactively; includes continuous chaos configuration, controller script, experiment execution script, and regression detection script
- Chaos Reporting: Comprehensive tracking and reporting including experiment success/failure tracking (database schema and tracking script), resilience score calculation over time, gap tracking and remediation (database schema and tracking script), and quarterly GameDay report generation
- Improvement Tracking: Resilience backlog management with prioritization (impact × likelihood matrix), implementation tracking (tracking script and database schema), and resilience metrics dashboard (dashboard configuration and generation script); with automated prioritization and visualization
Chaos Experiment Catalog¶
Purpose: Provide a comprehensive catalog of all chaos experiments available in ATP, organized by category (Infrastructure, Application, Data) with standardized experiment specifications including hypothesis, duration, frequency, and blast radius for easy reference and experiment planning.
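Assuming bash 4+ (associative arrays), the catalog tables that follow can also be encoded as a lookup table, so execution scripts can validate an experiment name and fetch its schedule in one step. Entries mirror this section's tables ("frequency|duration|blast radius"); `lookup_experiment` is a hypothetical helper:

```shell
#!/bin/bash
# Sketch: the experiment catalog as a bash associative array (bash 4+).
declare -A CATALOG=(
  [pod-kill]="Weekly|5m|1 pod"
  [node-failure]="Monthly|15m|1 node"
  [az-failure]="Quarterly|30m|1 AZ"
  [region-failure]="Annually|1h|1 region"
  [service-crash]="Weekly|5m|1 service"
  [latency-injection]="Weekly|10m|25% traffic"
  [error-injection]="Weekly|5m|10% requests"
  [cpu-stress]="Monthly|15m|1 deployment"
  [db-failover]="Monthly|5m|Read-only mode"
  [cache-failure]="Weekly|10m|1 cache instance"
  [queue-pause]="Monthly|15m|1 topic"
)

lookup_experiment() {
  local name="$1"
  if [ -z "${CATALOG[${name}]+x}" ]; then
    echo "Unknown experiment: ${name}" >&2
    return 1
  fi
  echo "${CATALOG[${name}]}"
}

lookup_experiment pod-kill
```

Keeping the specifications in one table like this avoids the drift that creeps in when the same schedule is repeated across several listing and execution scripts.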
Infrastructure Experiments¶
Infrastructure chaos experiments validate Kubernetes infrastructure resilience including pod failures, node failures, availability zone failures, and regional failures.
Infrastructure Experiment Catalog:
| Experiment | Hypothesis | Duration | Frequency | Blast Radius |
|---|---|---|---|---|
| Pod Kill | Service remains available when 1 pod is killed, pods reschedule automatically, and service recovers within 5 minutes | 5 min | Weekly | 1 pod |
| Node Failure | Pods reschedule successfully when 1 node fails, StatefulSet quorum is maintained, and services recover within 15 minutes | 15 min | Monthly | 1 node |
| AZ Failure | Multi-AZ deployment works when 1 availability zone fails, pods redistribute across remaining AZs, and services recover within 30 minutes | 30 min | Quarterly | 1 AZ |
| Region Failure | Regional failover succeeds when 1 region fails, Traffic Manager routes to secondary region, and services are operational within RTO target (30 minutes) | 1 hour | Annually | 1 region |
Infrastructure Experiment Execution:
#!/bin/bash
# scripts/execute-infrastructure-experiment.sh
EXPERIMENT_TYPE="${1}" # pod-kill, node-failure, az-failure, region-failure
ENVIRONMENT="${2:-staging}"
echo "🏗️ Executing Infrastructure Experiment: ${EXPERIMENT_TYPE}"
echo "Environment: ${ENVIRONMENT}"
echo ""
case ${EXPERIMENT_TYPE} in
pod-kill)
echo "Executing Pod Kill experiment..."
./scripts/execute-pod-failure-experiment.sh atp-ingestion-api 1 "5m" ${ENVIRONMENT}
;;
node-failure)
echo "Executing Node Failure experiment..."
./scripts/execute-node-failure-experiment.sh 1 "15m" ${ENVIRONMENT}
;;
az-failure)
echo "Executing AZ Failure experiment..."
./scripts/execute-az-failure-experiment.sh eastus-1 "30m" ${ENVIRONMENT}
;;
region-failure)
echo "Executing Region Failure experiment..."
./scripts/execute-regional-failover-drill.sh eastus "1h" ${ENVIRONMENT}
;;
*)
echo "Unknown infrastructure experiment: ${EXPERIMENT_TYPE}"
exit 1
;;
esac
echo "✅ Infrastructure experiment completed"
Application Experiments¶
Application chaos experiments validate application-level resilience including service failures, latency handling, error recovery, and autoscaling behavior.
Application Experiment Catalog:
| Experiment | Hypothesis | Duration | Frequency | Blast Radius |
|---|---|---|---|---|
| Service Crash | Circuit breaker prevents cascade failures when 1 service crashes, upstream services handle failures gracefully, and service recovers when restarted | 5 min | Weekly | 1 service |
| Latency Injection | Timeouts prevent hanging requests when network latency increases, retry mechanisms handle transient delays, and services recover when latency is removed | 10 min | Weekly | 25% traffic |
| Error Injection | Retries recover from errors when HTTP 500 errors are injected, circuit breakers prevent cascading failures, and services recover when errors are removed | 5 min | Weekly | 10% requests |
| CPU Stress | HPA scales under load when CPU usage increases, pods scale horizontally to meet demand, and service performance remains acceptable | 15 min | Monthly | 1 deployment |
Application Experiment Execution:
#!/bin/bash
# scripts/execute-application-experiment.sh
EXPERIMENT_TYPE="${1}" # service-crash, latency-injection, error-injection, cpu-stress
TARGET_SERVICE="${2}"
ENVIRONMENT="${3:-staging}"
echo "📱 Executing Application Experiment: ${EXPERIMENT_TYPE}"
echo "Target Service: ${TARGET_SERVICE}"
echo "Environment: ${ENVIRONMENT}"
echo ""
case ${EXPERIMENT_TYPE} in
service-crash)
echo "Executing Service Crash experiment..."
./scripts/execute-downstream-service-failure.sh ${TARGET_SERVICE} "5m" ${ENVIRONMENT}
;;
latency-injection)
echo "Executing Latency Injection experiment..."
./scripts/execute-latency-injection-experiment.sh ${TARGET_SERVICE} "500ms" "10m" ${ENVIRONMENT}
;;
error-injection)
echo "Executing Error Injection experiment..."
./scripts/execute-error-injection-experiment.sh ${TARGET_SERVICE} 0.1 "5m" ${ENVIRONMENT}
;;
cpu-stress)
echo "Executing CPU Stress experiment..."
./scripts/execute-resource-exhaustion-experiment.sh ${TARGET_SERVICE} cpu "15m" ${ENVIRONMENT}
;;
*)
echo "Unknown application experiment: ${EXPERIMENT_TYPE}"
exit 1
;;
esac
echo "✅ Application experiment completed"
Data Experiments¶
Data chaos experiments validate data layer resilience including database failover, cache failures, and message queue disruptions.
Data Experiment Catalog:
| Experiment | Hypothesis | Duration | Frequency | Blast Radius |
|---|---|---|---|---|
| DB Failover | Connection retry recovers when database fails over to replica, transactions complete successfully, and no data loss occurs | 5 min | Monthly | Read-only mode |
| Cache Failure | DB fallback maintains function when Redis cache becomes unavailable, database queries handle increased load, and services remain functional with higher latency | 10 min | Weekly | 1 cache instance |
| Queue Pause | Backpressure prevents overflow when Service Bus topic is paused, messages are buffered in outbox, and services recover when queue is restored | 15 min | Monthly | 1 topic |
Data Experiment Execution:
#!/bin/bash
# scripts/execute-data-experiment.sh
EXPERIMENT_TYPE="${1}" # db-failover, cache-failure, queue-pause
ENVIRONMENT="${2:-staging}"
echo "💾 Executing Data Experiment: ${EXPERIMENT_TYPE}"
echo "Environment: ${ENVIRONMENT}"
echo ""
case ${EXPERIMENT_TYPE} in
db-failover)
echo "Executing Database Failover experiment..."
./scripts/execute-database-failover-experiment.sh atp-sql-eastus "5m" ${ENVIRONMENT}
;;
cache-failure)
echo "Executing Cache Failure experiment..."
./scripts/execute-cache-failure-experiment.sh atp-redis-eastus "10m" ${ENVIRONMENT}
;;
queue-pause)
echo "Executing Queue Pause experiment..."
./scripts/execute-message-queue-disruption-experiment.sh atp-service-bus-topic "15m" ${ENVIRONMENT}
;;
*)
echo "Unknown data experiment: ${EXPERIMENT_TYPE}"
exit 1
;;
esac
echo "✅ Data experiment completed"
Experiment Catalog Summary:
graph TB
CATALOG[Chaos Experiment Catalog] --> INFRA[Infrastructure Experiments]
CATALOG --> APP[Application Experiments]
CATALOG --> DATA[Data Experiments]
INFRA --> POD[Pod Kill<br/>Weekly, 5min, 1 pod]
INFRA --> NODE[Node Failure<br/>Monthly, 15min, 1 node]
INFRA --> AZ[AZ Failure<br/>Quarterly, 30min, 1 AZ]
INFRA --> REGION[Region Failure<br/>Annually, 1h, 1 region]
APP --> SERVICE[Service Crash<br/>Weekly, 5min, 1 service]
APP --> LATENCY[Latency Injection<br/>Weekly, 10min, 25% traffic]
APP --> ERROR[Error Injection<br/>Weekly, 5min, 10% requests]
APP --> CPU[CPU Stress<br/>Monthly, 15min, 1 deployment]
DATA --> DB[DB Failover<br/>Monthly, 5min, Read-only]
DATA --> CACHE[Cache Failure<br/>Weekly, 10min, 1 cache instance]
DATA --> QUEUE[Queue Pause<br/>Monthly, 15min, 1 topic]
style CATALOG fill:#FFE5B4
style INFRA fill:#B4E5FF
style APP fill:#B4FFB4
style DATA fill:#FFB4E5
Experiment Schedule Overview:
| Frequency | Experiments | Total Experiments/Month |
|---|---|---|
| Weekly | Pod Kill, Service Crash, Latency Injection, Error Injection, Cache Failure | ~20/month |
| Monthly | Node Failure, CPU Stress, DB Failover, Queue Pause | 4/month |
| Quarterly | AZ Failure | 1/quarter (4/year) |
| Annually | Region Failure | 1/year |
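The schedule arithmetic behind the table above can be made explicit: assuming 52 weeks across 12 months, a weekly experiment runs about 4.33 times per month, so five weekly experiments come to roughly 21-22 runs per month, consistent with the table's rough ~20/month figure. A sketch (`runs_per_month` is a hypothetical helper):

```shell
#!/bin/bash
# Sketch: expected monthly run counts per frequency band (52 weeks / 12 months).
runs_per_month() {
  local count="$1" frequency="$2"
  case "${frequency}" in
    weekly)    awk -v n="${count}" 'BEGIN { printf "%.1f\n", n * 52 / 12 }' ;;
    monthly)   echo "${count}" ;;
    quarterly) awk -v n="${count}" 'BEGIN { printf "%.2f\n", n / 3 }' ;;
    annually)  awk -v n="${count}" 'BEGIN { printf "%.2f\n", n / 12 }' ;;
  esac
}

runs_per_month 5 weekly
```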
Experiment Catalog Usage:
#!/bin/bash
# scripts/list-experiment-catalog.sh
CATEGORY="${1:-all}" # infrastructure, application, data, all
echo "📋 Chaos Experiment Catalog"
echo "Category: ${CATEGORY}"
echo ""
case ${CATEGORY} in
infrastructure)
echo "Infrastructure Experiments:"
echo " 1. pod-kill - Weekly, 5min, 1 pod"
echo " 2. node-failure - Monthly, 15min, 1 node"
echo " 3. az-failure - Quarterly, 30min, 1 AZ"
echo " 4. region-failure - Annually, 1h, 1 region"
;;
application)
echo "Application Experiments:"
echo " 1. service-crash - Weekly, 5min, 1 service"
echo " 2. latency-injection - Weekly, 10min, 25% traffic"
echo " 3. error-injection - Weekly, 5min, 10% requests"
echo " 4. cpu-stress - Monthly, 15min, 1 deployment"
;;
data)
echo "Data Experiments:"
echo " 1. db-failover - Monthly, 5min, Read-only mode"
echo " 2. cache-failure - Weekly, 10min, 1 cache instance"
echo " 3. queue-pause - Monthly, 15min, 1 topic"
;;
all)
echo "Infrastructure Experiments:"
echo " 1. pod-kill - Weekly, 5min, 1 pod"
echo " 2. node-failure - Monthly, 15min, 1 node"
echo " 3. az-failure - Quarterly, 30min, 1 AZ"
echo " 4. region-failure - Annually, 1h, 1 region"
echo ""
echo "Application Experiments:"
echo " 1. service-crash - Weekly, 5min, 1 service"
echo " 2. latency-injection - Weekly, 10min, 25% traffic"
echo " 3. error-injection - Weekly, 5min, 10% requests"
echo " 4. cpu-stress - Monthly, 15min, 1 deployment"
echo ""
echo "Data Experiments:"
echo " 1. db-failover - Monthly, 5min, Read-only mode"
echo " 2. cache-failure - Weekly, 10min, 1 cache instance"
echo " 3. queue-pause - Monthly, 15min, 1 topic"
;;
*)
echo "Unknown category: ${CATEGORY}"
exit 1
;;
esac
echo ""
Summary: Chaos Experiment Catalog¶
- Infrastructure Experiments: Four infrastructure chaos experiments including Pod Kill (weekly, 5min, 1 pod), Node Failure (monthly, 15min, 1 node), AZ Failure (quarterly, 30min, 1 AZ), and Region Failure (annually, 1h, 1 region); each with standardized hypothesis, duration, frequency, and blast radius specifications
- Application Experiments: Four application chaos experiments including Service Crash (weekly, 5min, 1 service), Latency Injection (weekly, 10min, 25% traffic), Error Injection (weekly, 5min, 10% requests), and CPU Stress (monthly, 15min, 1 deployment); validating circuit breakers, retries, timeouts, and autoscaling
- Data Experiments: Three data chaos experiments including DB Failover (monthly, 5min, read-only mode), Cache Failure (weekly, 10min, 1 cache instance), and Queue Pause (monthly, 15min, 1 topic); validating database resilience, cache fallback, and message queue backpressure
- Experiment Catalog Tools: Execution scripts for each category (infrastructure, application, data), catalog listing script, experiment schedule overview table, and Mermaid diagram visualization; with standardized experiment specifications for easy reference and experiment planning