Health Checks - Audit Trail Platform (ATP)

Observable health, predictable behavior — ATP implements comprehensive health checks across all services using ASP.NET Core Health Checks with custom checks for databases, message buses, caches, KMS, and dependencies, exposed via /health/live, /health/ready, and /health/startup endpoints with Kubernetes probe integration and real-time monitoring.

📋 Documentation Generation Plan

This document will be generated in 16 cycles. Current progress:

Cycle	Topics	Estimated Lines	Status
Cycle 1	Health Check Fundamentals (1-2)	~3,000	⏳ Not Started
Cycle 2	ASP.NET Core Health Checks (3-4)	~3,500	⏳ Not Started
Cycle 3	Kubernetes Probe Integration (5-6)	~3,500	⏳ Not Started
Cycle 4	Custom Health Checks (7-8)	~4,500	⏳ Not Started
Cycle 5	Service-Specific Health Checks (9-10)	~5,000	⏳ Not Started
Cycle 6	Dependency Health Checks (11-12)	~4,000	⏳ Not Started
Cycle 7	Health Check UI & Monitoring (13-14)	~3,500	⏳ Not Started
Cycle 8	Health Check Response Formats (15-16)	~3,000	⏳ Not Started
Cycle 9	Degraded State Handling (17-18)	~3,000	⏳ Not Started
Cycle 10	Startup Warmup & Grace Periods (19-20)	~3,000	⏳ Not Started
Cycle 11	Health Check Caching (21-22)	~2,500	⏳ Not Started
Cycle 12	Multi-Tenant Health Isolation (23-24)	~3,000	⏳ Not Started
Cycle 13	Health Check Testing (25-26)	~2,500	⏳ Not Started
Cycle 14	Load Balancer Integration (27-28)	~3,000	⏳ Not Started
Cycle 15	Health Check Troubleshooting (29-30)	~3,500	⏳ Not Started
Cycle 16	Best Practices & Governance (31-32)	~3,000	⏳ Not Started

Total Estimated Lines: ~54,000

Purpose & Scope

This document provides the complete health check implementation guide for ATP, covering ASP.NET Core Health Checks, Kubernetes probes (liveness, readiness, startup), custom health checks for all dependencies (databases, message buses, caches, KMS, external services), Health Check UI, monitoring integration, and operational procedures for maintaining service health and resilience.

Why Health Checks for ATP?

Reliability: Detect unhealthy services before they impact users
Kubernetes Integration: Automatic pod restart (liveness) and traffic management (readiness)
Load Balancer: Remove unhealthy instances from rotation
Monitoring: Real-time health status in dashboards
Debugging: Identify failing dependencies quickly
Compliance: Health audit trail for SLA validation
Automation: Enable self-healing and auto-scaling
Observability: Health metrics feed into SLO calculations

ATP Health Check Architecture

Client Request
    ↓
Load Balancer (checks /health/ready)
    ↓
Kubernetes Service (only routes to ready pods)
    ↓
Pod (3 health endpoints)
    ├── /health/live    → Liveness Probe (K8s restarts if fails)
    ├── /health/ready   → Readiness Probe (K8s removes from service if fails)
    └── /health/startup → Startup Probe (K8s delays liveness until passes)

Health Check Types

Type	Endpoint	Purpose	Kubernetes Usage	Check Scope
Liveness	`/health/live`	Is the process alive and responsive?	Restart pod if fails	Minimal (self-ping)
Readiness	`/health/ready`	Can the service handle traffic?	Remove from load balancer if fails	Dependencies (DB, cache, bus)
Startup	`/health/startup`	Has the service finished initializing?	Delay liveness/readiness until passes	Warmup tasks (migrations, cache)

ATP Service Health Checks

All ATP services implement comprehensive health checks:
- Gateway: Key Vault (certs), backend services (HTTP), rate limiter (Redis)
- Ingestion: Azure SQL, Service Bus (publish), Blob Storage (WORM), Policy Service, Outbox
- Query: Read model database, query cache (Redis), Search index, projection lag
- Projection: Service Bus (subscribe), read model database, inbox table, projection lag
- Export: Blob Storage, export job queue (Redis), KMS (signing), bandwidth quota
- Integrity: Blob WORM, KMS (signing keys), hash chain state, Merkle computation
- Policy: Policy database, policy cache (Redis), default policies loaded
- Admin: Admin database, configuration store

Detailed Cycle Plan

CYCLE 1: Health Check Fundamentals (~3,000 lines)

Topic 1: Health Check Philosophy

What will be covered:

What is a Health Check?

Health Check:
- Periodic test of service health
- Returns: Healthy, Degraded, or Unhealthy
- Exposes internal state to external systems
- Enables automation (restart, route, scale)

Purpose:
- Detect failures early
- Enable self-healing
- Prevent cascading failures
- Support graceful degradation

Health vs. Metrics

Health Checks:
- Discrete state (healthy/degraded/unhealthy)
- Immediate action (restart, remove from LB)
- Synchronous (request/response)
- Examples: "Can I connect to database?"

Metrics:
- Continuous values (latency, throughput, errors)
- Gradual response (alerts, scaling)
- Asynchronous (scrape, push)
- Examples: "What's the current latency?"

Relationship:
- Metrics → Trends and patterns
- Health Checks → Go/no-go decisions
- Both needed for complete observability

Three Types of Health Checks

1. Liveness (Am I alive?)
   - Purpose: Detect deadlocks, infinite loops, crashes
   - Check: Minimal (process responsive)
   - Failure: Restart pod
   - Example: Self-ping, basic HTTP response

2. Readiness (Can I serve traffic?)
   - Purpose: Detect dependency failures
   - Check: Critical dependencies (DB, cache, queue)
   - Failure: Remove from load balancer
   - Example: Database connection, cache ping

3. Startup (Have I finished initializing?)
   - Purpose: Allow slow startup without killing pod
   - Check: Initialization tasks complete
   - Failure: Liveness and readiness probes stay on hold; the container restarts if failureThreshold is exceeded
   - Example: Migrations run, cache warmed, config loaded

ATP Health Check Principles

1. Fast Checks (<5 seconds)
   - Health checks must be lightweight
   - Avoid expensive operations (full scans, complex queries)
   - Use simple connectivity tests

2. Dependency-Aware
   - Check critical dependencies only
   - Don't fail on optional dependencies
   - Degrade gracefully when possible

3. No Side Effects
   - Health checks are read-only
   - Don't modify state
   - Don't trigger business logic

4. Cacheable (with short TTL)
   - Cache health status briefly (5-10 seconds)
   - Avoid overwhelming dependencies
   - Balance freshness vs. load

5. Detailed Responses
   - Return status + details for each check
   - Include duration and error messages
   - Enable debugging

6. Compliance-Aware
   - No PII in health responses
   - Audit health check access
   - Secure endpoints (internal only for details)
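
Principle 4 (cacheable with a short TTL) can be sketched as a decorator around any IHealthCheck. This is a minimal illustration, not ATP's implementation: the CachedHealthCheck name and the TTL value are assumptions introduced here.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical decorator: reuses the inner check's last result until the TTL expires,
// so frequent probes do not hit the dependency on every request.
public sealed class CachedHealthCheck : IHealthCheck
{
    private readonly IHealthCheck _inner;
    private readonly TimeSpan _ttl;
    private readonly SemaphoreSlim _gate = new(1, 1);
    private HealthCheckResult _lastResult;
    private DateTimeOffset _expiresAt = DateTimeOffset.MinValue;

    public CachedHealthCheck(IHealthCheck inner, TimeSpan ttl)
    {
        _inner = inner;
        _ttl = ttl;
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        // Serialize callers so only one of them refreshes an expired result
        await _gate.WaitAsync(cancellationToken);
        try
        {
            if (DateTimeOffset.UtcNow < _expiresAt)
            {
                return _lastResult;
            }

            _lastResult = await _inner.CheckHealthAsync(context, cancellationToken);
            _expiresAt = DateTimeOffset.UtcNow + _ttl;
            return _lastResult;
        }
        finally
        {
            _gate.Release();
        }
    }
}
```

The decorator only helps if one shared instance is registered (for example through the HealthCheckRegistration constructor that accepts an IHealthCheck instance), because a factory-based registration creates a new check object on every run and would discard the cached result.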

Code Examples:
- Health check concepts
- Three-probe architecture
- ATP principles

Diagrams:
- Health check types comparison
- Probe failure handling
- Health check architecture

Deliverables:
- Health check fundamentals guide
- Type comparison matrix
- ATP principles document

Topic 2: Health Check Standards

What will be covered:

Industry Standards

RFC 9111 (HTTP Caching, obsoletes RFC 7234):
- Cache-Control: no-cache for health checks
- Avoid stale health status

OpenAPI/Swagger:
- Document health endpoints
- Security schemes (Bearer token for detailed checks)

Health Check Response Format (de facto standard):
{
  "status": "Healthy" | "Degraded" | "Unhealthy",
  "totalDuration": "00:00:00.123",
  "entries": {
    "database": {
      "status": "Healthy",
      "description": "...",
      "duration": "00:00:00.050"
    }
  }
}

HTTP Status Codes:
- 200 OK: Healthy
- 503 Service Unavailable: Unhealthy
- 200 OK (with "Degraded" in body): Degraded
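
The response format above can be produced with a small custom response writer. The sketch below is illustrative only: the HealthResponseWriter class and its method name are assumptions, and it emits only the fields shown in the format above. The HTTP status code itself is chosen by the health check middleware (via ResultStatusCodes), not by the writer.

```csharp
using System.Linq;
using System.Text.Json;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical writer that serializes a HealthReport into the JSON shape shown above.
public static class HealthResponseWriter
{
    public static Task WriteJsonAsync(HttpContext context, HealthReport report)
    {
        context.Response.ContentType = "application/json; charset=utf-8";

        var payload = new
        {
            status = report.Status.ToString(),
            totalDuration = report.TotalDuration.ToString("c"),
            entries = report.Entries.ToDictionary(
                entry => entry.Key,
                entry => new
                {
                    status = entry.Value.Status.ToString(),
                    description = entry.Value.Description,
                    duration = entry.Value.Duration.ToString("c")
                })
        };

        return context.Response.WriteAsync(JsonSerializer.Serialize(payload));
    }
}
```

It would be wired up by setting ResponseWriter = HealthResponseWriter.WriteJsonAsync on a HealthCheckOptions instance. Exception details and the data dictionary are deliberately left out here, which fits the "no PII in health responses" principle for unauthenticated endpoints.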

ASP.NET Core Health Checks
- Microsoft.Extensions.Diagnostics.HealthChecks
- AspNetCore.HealthChecks.* (community packages)
- Health Check UI (AspNetCore.HealthChecks.UI)

Kubernetes Health Probes
- livenessProbe
- readinessProbe
- startupProbe

Code Examples:
- Standard response formats
- HTTP status codes
- ASP.NET Core integration

Diagrams:
- Standards compliance
- Response format evolution

Deliverables:
- Standards reference
- Format specifications
- Compliance guide

CYCLE 2: ASP.NET Core Health Checks (~3,500 lines)

Topic 3: Health Check Middleware

What will be covered:

Registering Health Checks

// Startup.cs / Program.cs
using HealthChecks.UI.Client;                         // UIResponseWriter
using Microsoft.AspNetCore.Diagnostics.HealthChecks;  // HealthCheckOptions
using Microsoft.Extensions.Diagnostics.HealthChecks;  // HealthCheckResult, HealthStatus

public void ConfigureServices(IServiceCollection services)
{
    // Add health check services
    var healthChecks = services.AddHealthChecks();

    // Add built-in checks (tagged "live" so the /health/live predicate selects it)
    healthChecks.AddCheck(
        "self",
        () => HealthCheckResult.Healthy("Service is alive"),
        tags: new[] { "live" });

    // Add dependency checks (`configuration` is the IConfiguration held by Startup)
    healthChecks.AddSqlServer(
        connectionString: configuration.GetConnectionString("AuditDb"),
        healthQuery: "SELECT 1",
        name: "database",
        failureStatus: HealthStatus.Unhealthy,
        tags: new[] { "ready", "database" });

    healthChecks.AddRedis(
        connectionString: configuration.GetConnectionString("Redis"),
        name: "cache",
        failureStatus: HealthStatus.Degraded,
        tags: new[] { "ready", "cache" });

    healthChecks.AddAzureServiceBusTopic(
        connectionString: configuration.GetConnectionString("ServiceBus"),
        topicName: "audit.appended.v1",
        name: "servicebus",
        tags: new[] { "ready", "messaging" });
}

public void Configure(IApplicationBuilder app)
{
    // Routing and authorization must run before the endpoints are mapped
    app.UseRouting();
    app.UseAuthorization();

    // Map health check endpoints
    app.UseEndpoints(endpoints =>
    {
        // Liveness endpoint (minimal checks)
        endpoints.MapHealthChecks("/health/live", new HealthCheckOptions
        {
            Predicate = check => check.Tags.Contains("live"),
            AllowCachingResponses = false,
            ResultStatusCodes =
            {
                [HealthStatus.Healthy] = StatusCodes.Status200OK,
                [HealthStatus.Degraded] = StatusCodes.Status200OK,
                [HealthStatus.Unhealthy] = StatusCodes.Status503ServiceUnavailable
            }
        });

        // Readiness endpoint (dependency checks)
        endpoints.MapHealthChecks("/health/ready", new HealthCheckOptions
        {
            Predicate = check => check.Tags.Contains("ready"),
            AllowCachingResponses = false,
            ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse,
            ResultStatusCodes =
            {
                [HealthStatus.Healthy] = StatusCodes.Status200OK,
                [HealthStatus.Degraded] = StatusCodes.Status200OK,
                [HealthStatus.Unhealthy] = StatusCodes.Status503ServiceUnavailable
            }
        });

        // Startup endpoint
        endpoints.MapHealthChecks("/health/startup", new HealthCheckOptions
        {
            Predicate = check => check.Tags.Contains("startup"),
            AllowCachingResponses = false
        });

        // Detailed health UI (internal only, secured)
        endpoints.MapHealthChecks("/health", new HealthCheckOptions
        {
            AllowCachingResponses = false,
            ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse
        })
        .RequireAuthorization("InternalOnly");
    });
}

Health Check Tags

Tag Strategy:

live:
- Minimal checks for liveness probe
- Only "self" check (process alive)
- Fast (<100ms)

ready:
- All critical dependency checks
- Database, cache, message bus
- Moderate speed (<5s total)

startup:
- Initialization completion checks
- Migrations complete, cache warmed
- Can be slow (30-60s acceptable)

database:
- Groups all database checks
- For targeted diagnostics

messaging:
- Groups messaging checks
- Service Bus, outbox, consumers

cache:
- Groups caching checks
- Redis, in-memory

external:
- Groups external API checks
- Policy service, KMS, etc.
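
The grouping tags (database, messaging, cache, external) lend themselves to targeted diagnostic endpoints. The helper below is a sketch under assumptions: the MapHealthChecksForTag name and the example paths are hypothetical and not part of ASP.NET Core or ATP.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.AspNetCore.Routing;

public static class HealthEndpointExtensions
{
    // Hypothetical helper: maps an endpoint that runs only the checks carrying the given tag.
    public static IEndpointConventionBuilder MapHealthChecksForTag(
        this IEndpointRouteBuilder endpoints,
        string pattern,
        string tag)
    {
        return endpoints.MapHealthChecks(pattern, new HealthCheckOptions
        {
            Predicate = registration => registration.Tags.Contains(tag),
            AllowCachingResponses = false
        });
    }
}
```

A call such as endpoints.MapHealthChecksForTag("/health/database", "database") would then expose only the database group. The returned builder can be chained with RequireAuthorization to keep such diagnostic endpoints internal.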

Code Examples:
- Complete health check setup
- Endpoint mapping
- Tag-based filtering

Diagrams:
- Health check middleware flow
- Tag-based routing
- Response writer pipeline

Deliverables:
- Health check setup guide
- Middleware configuration
- Tag strategy

Topic 4: Built-In Health Checks

What will be covered:

Community Health Check Packages

// Available health check packages (NuGet package IDs, not namespaces; the Add*
// extension methods live in Microsoft.Extensions.DependencyInjection)
//   AspNetCore.HealthChecks.SqlServer        SQL Server
//   AspNetCore.HealthChecks.AzureServiceBus  Azure Service Bus
//   AspNetCore.HealthChecks.Redis            Redis
//   AspNetCore.HealthChecks.MongoDb          MongoDB
//   AspNetCore.HealthChecks.AzureStorage     Azure Blob/Table
//   AspNetCore.HealthChecks.AzureKeyVault    Azure Key Vault
//   AspNetCore.HealthChecks.Uris             HTTP endpoints
//   AspNetCore.HealthChecks.System           Disk, memory, process
using Azure.Identity;                          // DefaultAzureCredential
using Microsoft.Extensions.DependencyInjection;

// Registration examples
services.AddHealthChecks()
    // Database
    .AddSqlServer(
        connectionString: config.GetConnectionString("AuditDb"),
        healthQuery: "SELECT 1",
        name: "audit-database",
        timeout: TimeSpan.FromSeconds(3))

    // Message Bus
    .AddAzureServiceBusTopic(
        connectionString: config.GetConnectionString("ServiceBus"),
        topicName: "audit.appended.v1",
        name: "servicebus-topic")

    // Cache
    .AddRedis(
        connectionString: config.GetConnectionString("Redis"),
        name: "redis-cache")

    // Blob Storage
    .AddAzureBlobStorage(
        connectionString: config.GetConnectionString("BlobStorage"),
        containerName: "audit-worm",
        name: "blob-worm-storage")

    // Key Vault (positional arguments, since parameter names vary between package versions)
    .AddAzureKeyVault(
        new Uri(config["KeyVault:Uri"]),
        new DefaultAzureCredential(),
        options => options.AddSecret("signing-key-public"),
        name: "key-vault")

    // HTTP Endpoint (Policy Service)
    .AddUrlGroup(
        uri: new Uri("https://policy.atp.internal/health/ready"),
        name: "policy-service",
        timeout: TimeSpan.FromSeconds(5))

    // System resources ("C:\\" applies to Windows hosts; use "/" in Linux containers)
    .AddDiskStorageHealthCheck(
        setup: options => options.AddDrive("C:\\", minimumFreeMegabytes: 1024),
        name: "disk-space")

    .AddProcessAllocatedMemoryHealthCheck(
        maximumMegabytesAllocated: 2048,
        name: "memory-allocation");

Configuration Options
- Timeouts
- Failure status (Unhealthy vs. Degraded)
- Retry policies
- Tags
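
Three of the options listed above (timeout, failure status, tags) can be set together through HealthCheckRegistration. The sketch below is an assumption-laden example: the extension method name and the "policy-cache" check name are made up for illustration. Retry policies are omitted because the registration type carries no retry setting; retries would have to live inside the check itself.

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

public static class HealthCheckRegistrationExample
{
    // Hypothetical helper showing timeout, failure status, and tags in one registration.
    public static IHealthChecksBuilder AddConfiguredCheck(
        this IHealthChecksBuilder builder,
        IHealthCheck check)
    {
        return builder.Add(new HealthCheckRegistration(
            name: "policy-cache",
            instance: check,
            failureStatus: HealthStatus.Degraded,   // reported if the check throws or times out
            tags: new[] { "ready", "cache" },
            timeout: TimeSpan.FromSeconds(3)));     // cancels the token passed to the check
    }
}
```

The timeout works by cancelling the CancellationToken handed to CheckHealthAsync, so it only takes effect when the check passes that token on to its I/O calls.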

Code Examples:
- Built-in check usage (all types)
- Configuration patterns
- Timeout management

Diagrams:
- Health check library ecosystem
- Package dependencies

Deliverables:
- Built-in checks catalog
- Usage guide
- Configuration reference

CYCLE 3: Kubernetes Probe Integration (~3,500 lines)

Topic 5: Probe Configuration

What will be covered:

Liveness Probe

# Kubernetes Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ingestion
  namespace: atp-ingest-ns
spec:
  template:
    spec:
      containers:
      - name: ingestion
        image: atpacr.azurecr.io/atp/ingestion:1.2.3

        # Liveness Probe (restart if fails)
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 30     # Wait 30s after pod start
          periodSeconds: 10            # Check every 10s
          timeoutSeconds: 5            # 5s timeout per check
          failureThreshold: 3          # Restart after 3 consecutive failures
          successThreshold: 1          # Healthy after 1 success

Readiness Probe

        # Readiness Probe (remove from service if fails)
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 3
          successThreshold: 1

Startup Probe

        # Startup Probe (for slow-starting apps)
        startupProbe:
          httpGet:
            path: /health/startup
            port: 8080
          initialDelaySeconds: 0
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 30         # 30 * 5s = 150s max startup time
          successThreshold: 1

Probe Types (HTTP, TCP, Exec)

# HTTP Probe (most common)
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
    httpHeaders:
    - name: X-Health-Check
      value: "kubernetes"

# TCP Probe (for non-HTTP services)
livenessProbe:
  tcpSocket:
    port: 9090

# Exec Probe (run command in container)
livenessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - /app/healthcheck.sh

# gRPC Probe (Kubernetes 1.24+)
livenessProbe:
  grpc:
    port: 9090
    service: grpc.health.v1.Health

Probe Timing Best Practices

Liveness Probe:
- initialDelaySeconds: 30-60s (allow startup)
- periodSeconds: 10-30s (not too frequent)
- timeoutSeconds: 5s (reasonable timeout)
- failureThreshold: 3 (tolerate transient failures)

Readiness Probe:
- initialDelaySeconds: 10-30s (faster than liveness)
- periodSeconds: 5-10s (more frequent for traffic mgmt)
- timeoutSeconds: 3s
- failureThreshold: 3

Startup Probe:
- initialDelaySeconds: 0 (start immediately)
- periodSeconds: 5s
- failureThreshold: 30-60 (allow long startup)
- Kubernetes holds liveness and readiness probes automatically until startup succeeds
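
The timing guidance can be checked with simple arithmetic. The helper below is a rough sketch: ProbeTiming and its two properties are hypothetical names, and the formulas are approximations (a fault is noticed within about failureThreshold probe periods plus one timeout; a startup probe allows about initialDelaySeconds plus failureThreshold periods).

```csharp
using System;

// Values taken from the liveness, readiness, and startup probe examples in this topic
var liveness = new ProbeTiming(InitialDelaySeconds: 30, PeriodSeconds: 10, TimeoutSeconds: 5, FailureThreshold: 3);
var readiness = new ProbeTiming(InitialDelaySeconds: 10, PeriodSeconds: 5, TimeoutSeconds: 3, FailureThreshold: 3);
var startup = new ProbeTiming(InitialDelaySeconds: 0, PeriodSeconds: 5, TimeoutSeconds: 3, FailureThreshold: 30);

Console.WriteLine($"Liveness detection bound: {liveness.DetectionBound.TotalSeconds}s");   // 35s
Console.WriteLine($"Readiness detection bound: {readiness.DetectionBound.TotalSeconds}s"); // 18s
Console.WriteLine($"Startup budget: {startup.StartupBudget.TotalSeconds}s");               // 150s

// Hypothetical helper for reasoning about probe settings
public readonly record struct ProbeTiming(
    int InitialDelaySeconds,
    int PeriodSeconds,
    int TimeoutSeconds,
    int FailureThreshold)
{
    // Approximate upper bound from the moment a fault occurs until Kubernetes acts on it
    public TimeSpan DetectionBound =>
        TimeSpan.FromSeconds(PeriodSeconds * FailureThreshold + TimeoutSeconds);

    // Approximate time a container is given to start before the startup probe gives up
    public TimeSpan StartupBudget =>
        TimeSpan.FromSeconds(InitialDelaySeconds + PeriodSeconds * FailureThreshold);
}
```

Shortening periodSeconds tightens detection but increases probe load, which is the trade-off behind the ranges listed above.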

Code Examples:
- Complete probe configurations (all ATP services)
- Probe type examples
- Timing guidelines

Diagrams:
- Probe lifecycle
- Failure handling flow
- Timing optimization

Deliverables:
- Probe configuration guide
- Timing recommendations
- ATP probe library

Topic 6: Probe Failure Scenarios

What will be covered:

Liveness Probe Failure

Scenario: Liveness probe fails 3 consecutive times

Kubernetes Actions:
1. Kubelet decides to restart the container; the pod becomes "Not Ready" (removed from service)
2. Send SIGTERM to container (graceful shutdown)
3. Wait for termination grace period (default: 30s)
4. Send SIGKILL if still running
5. Restart the container in the same pod (per restartPolicy; the pod is not rescheduled)
6. Wait for startup probe to pass
7. Wait for readiness probe to pass
8. Add pod back to service endpoints

Timeline:
T+0s:    Liveness probe fails (3rd failure)
T+0s:    Kubelet begins container termination
T+0s:    SIGTERM sent to container
T+30s:   SIGKILL sent (if not terminated)
T+30s:   Container restarted in the same pod
T+45s:   Container startup complete
T+50s:   Pod ready
T+50s:   Pod receives traffic again

Total Recovery Time: ~50 seconds in the worst case (repeated restarts add back-off delay)

Readiness Probe Failure

Scenario: Readiness probe fails (database connection lost)

Kubernetes Actions:
1. Mark pod as "Not Ready"
2. Remove pod IP from service endpoints
3. Stop routing traffic to pod
4. Pod continues running (NOT restarted)
5. Continue checking readiness probe
6. When probe passes → add back to service

User Impact:
- No downtime if other pods healthy
- Traffic automatically shifted
- Pod self-heals when dependency recovers

Timeline:
T+0s:    Readiness probe fails
T+0s:    Pod removed from service endpoints
T+0s-Xs: No traffic to pod (database recovers)
T+Xs:    Readiness probe passes
T+Xs:    Pod added back to service

Recovery Time: Depends on dependency recovery

Startup Probe Failure

Scenario: Startup probe fails (migrations taking too long)

Kubernetes Actions:
1. Keep running the startup probe every periodSeconds
2. Hold liveness and readiness probes until startup passes
3. If failureThreshold is exceeded (e.g., 30 failures * 5s = 150s), kill the container
4. Restart the container according to the pod's restartPolicy

Best Practice:
- Set failureThreshold high enough for worst-case startup
- Monitor startup duration metrics
- Optimize slow initialization tasks
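
One common way to back a /health/startup endpoint is a shared flag that a background warmup task sets when initialization finishes. The sketch below assumes that pattern: StartupState and WarmupService are hypothetical names, the one-second delay is a placeholder for real warmup work, and the StartupWarmupHealthCheck shown here is a guess at how a check of that name could look, not ATP's actual code.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;
using Microsoft.Extensions.Hosting;

// Hypothetical shared flag, registered as a singleton
public sealed class StartupState
{
    private volatile bool _completed;
    public bool Completed => _completed;
    public void MarkCompleted() => _completed = true;
}

// Reports Unhealthy until the warmup task has marked startup as complete
public sealed class StartupWarmupHealthCheck : IHealthCheck
{
    private readonly StartupState _state;

    public StartupWarmupHealthCheck(StartupState state) => _state = state;

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        return Task.FromResult(_state.Completed
            ? HealthCheckResult.Healthy("Startup tasks completed")
            : HealthCheckResult.Unhealthy("Startup tasks still running"));
    }
}

// Hypothetical background task that performs warmup and then flips the flag
public sealed class WarmupService : BackgroundService
{
    private readonly StartupState _state;

    public WarmupService(StartupState state) => _state = state;

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        // Placeholder for real work such as cache warming or configuration loading
        await Task.Delay(TimeSpan.FromSeconds(1), stoppingToken);
        _state.MarkCompleted();
    }
}
```

Registration would need services.AddSingleton<StartupState>() and services.AddHostedService<WarmupService>(), plus the check itself tagged "startup". Because the check only reads a flag, it stays fast no matter how long the warmup takes.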

Code Examples:
- Failure scenarios
- Recovery timelines
- Mitigation strategies

Diagrams:
- Probe failure flows
- Pod lifecycle during recovery
- Traffic shifting

Deliverables:
- Failure scenario guide
- Recovery procedures
- Timing analysis

CYCLE 4: Custom Health Checks (~4,500 lines)

Topic 7: Implementing Custom Health Checks

What will be covered:

IHealthCheck Interface

using Microsoft.Extensions.Diagnostics.HealthChecks;

public class OutboxHealthCheck : IHealthCheck
{
    private readonly IOutboxRepository _outboxRepository;
    private readonly ILogger<OutboxHealthCheck> _logger;
    private readonly int _maxPendingThreshold;

    public OutboxHealthCheck(
        IOutboxRepository outboxRepository,
        ILogger<OutboxHealthCheck> logger,
        IConfiguration configuration)
    {
        _outboxRepository = outboxRepository;
        _logger = logger;
        _maxPendingThreshold = configuration.GetValue<int>("HealthChecks:Outbox:MaxPending", 1000);
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        try
        {
            // Check outbox pending count
            var pendingCount = await _outboxRepository.GetPendingCountAsync(cancellationToken);

            if (pendingCount == 0)
            {
                return HealthCheckResult.Healthy(
                    description: "Outbox is empty (all events published)",
                    data: new Dictionary<string, object>
                    {
                        ["pendingEvents"] = 0,
                        ["checkedAt"] = DateTime.UtcNow
                    });
            }

            if (pendingCount < _maxPendingThreshold)
            {
                return HealthCheckResult.Healthy(
                    description: $"Outbox has {pendingCount} pending events (within threshold)",
                    data: new Dictionary<string, object>
                    {
                        ["pendingEvents"] = pendingCount,
                        ["threshold"] = _maxPendingThreshold,
                        ["checkedAt"] = DateTime.UtcNow
                    });
            }

            // Degraded if backlog growing but not critical
            if (pendingCount < _maxPendingThreshold * 2)
            {
                _logger.LogWarning("Outbox backlog is elevated: {Count}", pendingCount);

                return HealthCheckResult.Degraded(
                    description: $"Outbox backlog elevated: {pendingCount} events",
                    data: new Dictionary<string, object>
                    {
                        ["pendingEvents"] = pendingCount,
                        ["threshold"] = _maxPendingThreshold,
                        ["recommendation"] = "Scale outbox relay workers"
                    });
            }

            // Unhealthy if backlog critical
            _logger.LogError("Outbox backlog critical: {Count}", pendingCount);

            return HealthCheckResult.Unhealthy(
                description: $"Outbox backlog critical: {pendingCount} events",
                data: new Dictionary<string, object>
                {
                    ["pendingEvents"] = pendingCount,
                    ["threshold"] = _maxPendingThreshold,
                    ["action"] = "Immediate investigation required"
                });
        }
        catch (Exception ex)
        {
            _logger.LogError(ex, "Outbox health check failed");

            // Report the status configured at registration (failureStatus)
            return new HealthCheckResult(
                status: context.Registration.FailureStatus,
                description: "Failed to check outbox health",
                exception: ex);
        }
    }
}

// Registration
services.AddHealthChecks()
    .AddCheck<OutboxHealthCheck>(
        name: "outbox",
        failureStatus: HealthStatus.Degraded,  // Reported when the check itself fails to run
        tags: new[] { "ready", "messaging", "outbox" });

Health Check Data Dictionary
- Include diagnostic information
- Avoid PII
- Machine-readable format
- Useful for debugging

Async Health Checks
- All health checks should be async
- Use cancellation tokens
- Timeout handling
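
Timeout handling with cancellation tokens can be sketched as a small base class. This is one possible approach under assumptions, not a prescribed pattern: the TimeBoundHealthCheck and ProbeAsync names are hypothetical.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical base class: bounds the probe's duration with a linked cancellation token
public abstract class TimeBoundHealthCheck : IHealthCheck
{
    private readonly TimeSpan _timeout;

    protected TimeBoundHealthCheck(TimeSpan timeout) => _timeout = timeout;

    // Derived checks implement the actual probe and must honor the token
    protected abstract Task<HealthCheckResult> ProbeAsync(CancellationToken cancellationToken);

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
        cts.CancelAfter(_timeout);

        try
        {
            return await ProbeAsync(cts.Token);
        }
        catch (OperationCanceledException) when (!cancellationToken.IsCancellationRequested)
        {
            // Our own timeout fired (not the caller's token): report the registered failure status
            return new HealthCheckResult(
                context.Registration.FailureStatus,
                description: $"Check timed out after {_timeout.TotalSeconds:F1}s");
        }
    }
}
```

Cancellation requested by the caller is allowed to propagate, so a shutdown or an aborted request is not misreported as a dependency timeout.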

Code Examples:
- Complete custom health check implementations
- Registration patterns
- Error handling

Diagrams:
- Custom health check flow
- Data dictionary structure

Deliverables:
- Custom health check guide
- Implementation templates
- Registration patterns

Topic 8: ATP Custom Health Checks

What will be covered:

ProjectionLagHealthCheck

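public class ProjectionLagHealthCheck : IHealthCheck
{
    private readonly IProjectionWatermarkRepository _watermarkRepo;
    private readonly TimeSpan _lagThreshold;

    // The threshold is not a DI service; supply it via AddTypeActivatedCheck<T>(name, args)
    public ProjectionLagHealthCheck(
        IProjectionWatermarkRepository watermarkRepo,
        TimeSpan lagThreshold)
    {
        _watermarkRepo = watermarkRepo;
        _lagThreshold = lagThreshold;
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        var watermarks = (await _watermarkRepo.GetAllWatermarksAsync(cancellationToken)).ToList();

        // Max()/First() throw on an empty sequence, so handle "no projections" explicitly
        if (watermarks.Count == 0)
        {
            return HealthCheckResult.Healthy("No projection watermarks recorded");
        }

        // The projection with the oldest watermark has the largest lag
        var now = DateTime.UtcNow;
        var slowest = watermarks.OrderBy(w => w.LastEventTimestamp).First();
        var maxLag = now - slowest.LastEventTimestamp;

        if (maxLag < _lagThreshold)
        {
            return HealthCheckResult.Healthy(
                $"Projection lag: {maxLag.TotalSeconds:F1}s");
        }

        return HealthCheckResult.Degraded(
            $"Projection lag high: {maxLag.TotalSeconds:F1}s",
            data: new Dictionary<string, object>
            {
                ["lagSeconds"] = maxLag.TotalSeconds,
                ["threshold"] = _lagThreshold.TotalSeconds,
                ["laggestProjection"] = slowest.ProjectionName
            });
    }
}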

IdempotencyStoreHealthCheck
PolicyCacheHealthCheck
HashChainStateHealthCheck
ExportBandwidthHealthCheck
TenantQuotaHealthCheck

Code Examples:
- All ATP custom health checks
- ATP-specific validation logic

Deliverables:
- ATP health check library
- Implementation guide

CYCLE 5: Service-Specific Health Checks (~5,000 lines)

Topic 9: Gateway Service Health Checks

What will be covered:

Gateway Health Check Configuration

// Gateway health checks
services.AddHealthChecks()
    // Liveness (minimal)
    .AddCheck("self", () => HealthCheckResult.Healthy(), tags: new[] { "live" })

    // Readiness (dependencies)
    .AddAzureKeyVault(keyVaultUri, credential, name: "keyvault-certs", tags: new[] { "ready" })

    .AddRedis(redisConnection, name: "rate-limiter-cache", tags: new[] { "ready" })

    .AddUrlGroup(
        new Uri("https://ingestion.atp.internal/health/ready"),
        name: "ingestion-backend",
        tags: new[] { "ready", "backend" })

    .AddUrlGroup(
        new Uri("https://query.atp.internal/health/ready"),
        name: "query-backend",
        tags: new[] { "ready", "backend" })

    .AddUrlGroup(
        new Uri("https://policy.atp.internal/health/ready"),
        name: "policy-backend",
        tags: new[] { "ready", "backend" })

    .AddCheck<GatewayRoutingHealthCheck>(
        name: "routing-config",
        tags: new[] { "startup", "configuration" });

Custom Gateway Checks
- Routing configuration loaded
- YARP clusters reachable
- JWT validation keys available
- Rate limiting operational

Code Examples:
- Complete Gateway health checks
- Backend reachability tests
- Configuration validation

Diagrams:
- Gateway health architecture
- Dependency graph

Deliverables:
- Gateway health check guide
- Configuration templates

Topic 10: Ingestion Service Health Checks

What will be covered:

Ingestion Health Configuration

services.AddHealthChecks()
    // Live
    .AddCheck("self", () => HealthCheckResult.Healthy(), tags: new[] { "live" })

    // Ready - Critical dependencies
    .AddSqlServer(
        connectionString: config.GetConnectionString("AuditDb"),
        healthQuery: "SELECT 1",
        name: "audit-database",
        tags: new[] { "ready", "database" })

    .AddAzureServiceBusTopic(
        connectionString: config.GetConnectionString("ServiceBus"),
        topicName: "audit.appended.v1",
        name: "servicebus-publish",
        tags: new[] { "ready", "messaging" })

    .AddAzureBlobStorage(
        connectionString: config.GetConnectionString("BlobStorage"),
        containerName: "audit-worm",
        name: "blob-worm-storage",
        tags: new[] { "ready", "storage" })

    .AddUrlGroup(
        new Uri("https://policy.atp.internal/health/ready"),
        name: "policy-service",
        tags: new[] { "ready", "external" })

    .AddCheck<OutboxHealthCheck>(
        name: "outbox-backlog",
        failureStatus: HealthStatus.Degraded,
        tags: new[] { "ready", "outbox" })

    // Startup
    .AddCheck<DatabaseMigrationHealthCheck>(
        name: "database-migrations",
        tags: new[] { "startup" })

    .AddCheck<StartupWarmupHealthCheck>(
        name: "startup-warmup",
        tags: new[] { "startup" });

Ingestion-Specific Checks
- Outbox backlog within limits
- Idempotency store accessible
- Schema validation ready
- Tenant quota store loaded

Code Examples:
- Ingestion health checks (complete)
- All 8 ATP services health configurations

Diagrams:
- Service health dependencies
- Check hierarchy

Deliverables:
- Service health check library (8 services)
- Configuration guide

CYCLE 6: Dependency Health Checks (~4,000 lines)

Topic 11: Database Health Checks

What will be covered:

SQL Server Health Check

// Basic connectivity
.AddSqlServer(
    connectionString: connectionString,
    healthQuery: "SELECT 1",
    name: "database-connectivity")

// Connection pool health
.AddCheck<SqlConnectionPoolHealthCheck>(
    name: "database-connection-pool")

// SqlClient has no public API that returns pool statistics. This hypothetical
// abstraction supplies the active connection count from another source, for
// example the Microsoft.Data.SqlClient event counters.
public interface IConnectionPoolMetrics
{
    long ActiveConnections { get; }
}

public class SqlConnectionPoolHealthCheck : IHealthCheck
{
    private readonly IConnectionPoolMetrics _poolMetrics;
    private readonly int _maxPoolSize;

    public SqlConnectionPoolHealthCheck(
        IConnectionPoolMetrics poolMetrics,
        IConfiguration configuration)
    {
        _poolMetrics = poolMetrics;

        // Max Pool Size is read from the connection string (SqlClient default: 100)
        var connectionString = configuration.GetConnectionString("AuditDb");
        _maxPoolSize = new SqlConnectionStringBuilder(connectionString).MaxPoolSize;
    }

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        var activeConnections = _poolMetrics.ActiveConnections;
        var utilization = (double)activeConnections / _maxPoolSize;

        if (utilization < 0.8)
        {
            return Task.FromResult(HealthCheckResult.Healthy(
                $"Connection pool: {activeConnections}/{_maxPoolSize} ({utilization:P0})"));
        }

        if (utilization < 0.95)
        {
            return Task.FromResult(HealthCheckResult.Degraded(
                $"Connection pool high: {activeConnections}/{_maxPoolSize} ({utilization:P0})"));
        }

        return Task.FromResult(HealthCheckResult.Unhealthy(
            $"Connection pool exhausted: {activeConnections}/{_maxPoolSize}"));
    }
}

// Query performance check
.AddCheck<DatabasePerformanceHealthCheck>(
    name: "database-performance")

Azure Cosmos DB Health Check
MongoDB Health Check
NHibernate Session Factory Health Check

Code Examples:
- Database health checks (all types)
- Connection pool monitoring
- Performance validation

Deliverables:
- Database health check library

Topic 12: Messaging Health Checks

What will be covered:

Azure Service Bus Health Check

// Topic publish capability
.AddAzureServiceBusTopic(
    connectionString: serviceBusConnection,
    topicName: "audit.appended.v1",
    name: "servicebus-topic-publish")

// Subscription receive capability
.AddAzureServiceBusSubscription(
    connectionString: serviceBusConnection,
    topicName: "audit.appended.v1",
    subscriptionName: "projection-sub",
    name: "servicebus-subscription")

// Custom: Consumer lag check
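public class ConsumerLagHealthCheck : IHealthCheck
{
    private readonly ServiceBusClient _serviceBusClient;
    private readonly string _topicName;
    private readonly string _subscriptionName;
    private readonly TimeSpan _lagThreshold;

    // Non-service arguments can be supplied via AddTypeActivatedCheck<T>(name, args)
    public ConsumerLagHealthCheck(
        ServiceBusClient serviceBusClient,
        string topicName,
        string subscriptionName,
        TimeSpan lagThreshold)
    {
        _serviceBusClient = serviceBusClient;
        _topicName = topicName;
        _subscriptionName = subscriptionName;
        _lagThreshold = lagThreshold;
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        // Dispose the receiver after each check so AMQP links are not leaked
        await using var receiver = _serviceBusClient
            .CreateReceiver(_topicName, _subscriptionName);

        // Peek (without locking or removing) at the oldest message to estimate lag.
        // The token is passed by name because the first parameter is a sequence number.
        var peekedMessage = await receiver.PeekMessageAsync(
            cancellationToken: cancellationToken);

        if (peekedMessage == null)
        {
            return HealthCheckResult.Healthy("No pending messages");
        }

        var lag = DateTimeOffset.UtcNow - peekedMessage.EnqueuedTime;

        if (lag < _lagThreshold)
        {
            return HealthCheckResult.Healthy($"Lag: {lag.TotalSeconds:F1}s");
        }

        return HealthCheckResult.Degraded($"High lag: {lag.TotalSeconds:F1}s");
    }
}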

MassTransit Health Check
Outbox Relay Health Check
DLQ Depth Health Check
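
A DLQ depth check could read the dead-letter count from the subscription's runtime properties. The sketch below is an assumption, not ATP's implementation: the class name and the threshold parameter are hypothetical, and the administration client it relies on needs Manage rights on the namespace.

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Azure.Messaging.ServiceBus.Administration;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical check: reports Degraded when the dead-letter queue grows past a threshold
public sealed class DeadLetterDepthHealthCheck : IHealthCheck
{
    private readonly ServiceBusAdministrationClient _adminClient;
    private readonly string _topicName;
    private readonly string _subscriptionName;
    private readonly long _maxDeadLetters;

    public DeadLetterDepthHealthCheck(
        ServiceBusAdministrationClient adminClient,
        string topicName,
        string subscriptionName,
        long maxDeadLetters)
    {
        _adminClient = adminClient;
        _topicName = topicName;
        _subscriptionName = subscriptionName;
        _maxDeadLetters = maxDeadLetters;
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        var response = await _adminClient.GetSubscriptionRuntimePropertiesAsync(
            _topicName, _subscriptionName, cancellationToken);

        var deadLetters = response.Value.DeadLetterMessageCount;
        var data = new Dictionary<string, object>
        {
            ["deadLetterCount"] = deadLetters,
            ["threshold"] = _maxDeadLetters
        };

        return deadLetters <= _maxDeadLetters
            ? HealthCheckResult.Healthy($"DLQ depth: {deadLetters}", data)
            : HealthCheckResult.Degraded($"DLQ depth high: {deadLetters}", data: data);
    }
}
```

Degraded rather than Unhealthy is used here on the assumption that dead-lettered messages call for investigation, not for removing the pod from rotation.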

Code Examples:
- Messaging health checks
- Lag monitoring
- DLQ validation

Deliverables:
- Messaging health check library

CYCLE 7: Health Check UI & Monitoring (~3,500 lines)

Topic 13: Health Check UI

What will be covered:

ASP.NET Core Health Check UI

services.AddHealthChecksUI(setup =>
{
    setup.SetEvaluationTimeInSeconds(30);
    setup.MaximumHistoryEntriesPerEndpoint(50);
    setup.AddHealthCheckEndpoint("ATP Ingestion", "https://ingestion.atp.internal/health");
    setup.AddHealthCheckEndpoint("ATP Query", "https://query.atp.internal/health");
    setup.AddHealthCheckEndpoint("ATP Policy", "https://policy.atp.internal/health");
})
.AddInMemoryStorage();  // Or .AddSqlServerStorage() for persistence

// Map UI endpoint
app.UseEndpoints(endpoints =>
{
    endpoints.MapHealthChecksUI(setup =>
    {
        setup.UIPath = "/healthchecks-ui";
        setup.ApiPath = "/healthchecks-api";
    })
    .RequireAuthorization("AdminOnly");
});

Health Check UI Features:
- Real-time health status
- Historical health data
- Webhook notifications
- Failure tracking

Code Examples: - Health Check UI setup - Dashboard configuration - Webhook integration

Diagrams: - UI architecture - Data flow

Deliverables: - UI setup guide - Configuration reference

Topic 14: Prometheus Metrics from Health Checks¶

What will be covered: - Health Check Metrics Exporter

using System.Collections.Concurrent;
using System.Diagnostics.Metrics;

public class HealthCheckMetricsPublisher : IHealthCheckPublisher
{
    private readonly Counter<long> _healthCheckExecutions;

    // Latest status per check; the observable gauge reads this on each collection
    private readonly ConcurrentDictionary<string, long> _latestStatus = new();

    public HealthCheckMetricsPublisher(IMeterFactory meterFactory)
    {
        var meter = meterFactory.Create("ATP.HealthChecks");

        _healthCheckExecutions = meter.CreateCounter<long>(
            "health_check_executions_total",
            description: "Total health check executions");

        // Observable gauges are pull-based: they need a callback that reports current values
        meter.CreateObservableGauge<long>(
            "health_check_status",
            () => _latestStatus.Select(s => new Measurement<long>(
                s.Value, new KeyValuePair<string, object?>("name", s.Key))),
            description: "Health check status (0=unhealthy, 1=degraded, 2=healthy)");
    }

    public Task PublishAsync(HealthReport report, CancellationToken cancellationToken)
    {
        foreach (var entry in report.Entries)
        {
            var statusValue = entry.Value.Status switch
            {
                HealthStatus.Healthy => 2,
                HealthStatus.Degraded => 1,
                HealthStatus.Unhealthy => 0,
                _ => -1
            };

            // Store the status so the gauge callback can report it
            _latestStatus[entry.Key] = statusValue;

            _healthCheckExecutions.Add(1,
                new KeyValuePair<string, object?>("name", entry.Key),
                new KeyValuePair<string, object?>("status", entry.Value.Status.ToString()));
        }

        return Task.CompletedTask;
    }
}

// Register publisher
services.AddSingleton<IHealthCheckPublisher, HealthCheckMetricsPublisher>();

Code Examples: - Metrics publisher implementation - Prometheus integration - Grafana dashboards

Deliverables: - Metrics publisher guide - Dashboard templates

CYCLE 8: Health Check Response Formats (~3,000 lines)¶

Topic 15: Response Format Standards¶

What will be covered: - Healthy Response

GET /health/ready
HTTP/1.1 200 OK
Content-Type: application/json
Cache-Control: no-cache

{
  "status": "Healthy",
  "totalDuration": "00:00:00.234",
  "entries": {
    "database": {
      "status": "Healthy",
      "description": "Azure SQL connection pool: 5/100 active",
      "duration": "00:00:00.050",
      "data": {
        "activeConnections": 5,
        "maxPoolSize": 100,
        "utilization": 0.05
      }
    },
    "servicebus": {
      "status": "Healthy",
      "description": "Azure Service Bus connected",
      "duration": "00:00:00.120"
    },
    "cache": {
      "status": "Healthy",
      "description": "Redis cache: 1024 keys, 512MB used",
      "duration": "00:00:00.015",
      "data": {
        "keys": 1024,
        "memoryUsedMb": 512,
        "hitRate": 0.87
      }
    },
    "outbox": {
      "status": "Healthy",
      "description": "Outbox: 15 pending events",
      "duration": "00:00:00.035",
      "data": {
        "pendingEvents": 15,
        "threshold": 1000
      }
    }
  }
}

Degraded Response

GET /health/ready
HTTP/1.1 200 OK  ← Still 200, but body shows degraded

{
  "status": "Degraded",
  "totalDuration": "00:00:00.567",
  "entries": {
    "database": {
      "status": "Healthy",
      "duration": "00:00:00.045"
    },
    "cache": {
      "status": "Degraded",
      "description": "Redis cache: High latency (150ms avg)",
      "duration": "00:00:00.456",
      "data": {
        "latencyMs": 150,
        "threshold": 50,
        "recommendation": "Check Redis server performance"
      }
    },
    "outbox": {
      "status": "Degraded",
      "description": "Outbox backlog elevated: 800 events",
      "data": {
        "pendingEvents": 800,
        "threshold": 1000,
        "recommendation": "Scale outbox relay workers"
      }
    }
  }
}

Unhealthy Response

GET /health/ready
HTTP/1.1 503 Service Unavailable

{
  "status": "Unhealthy",
  "totalDuration": "00:00:03.000",
  "entries": {
    "database": {
      "status": "Unhealthy",
      "description": "Azure SQL connection failed: Timeout",
      "duration": "00:00:03.000",
      "exception": "System.Data.SqlClient.SqlException: Timeout expired...",
      "data": {
        "error": "Connection timeout",
        "action": "Check database availability and connection string"
      }
    },
    "cache": {
      "status": "Healthy",
      "duration": "00:00:00.012"
    }
  }
}
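The three response bodies above share one shape, so a single formatter can produce all of them. The following is a sketch under stated assumptions, not the ATP response writer: the `HealthResponseFormatter` name, the millisecond duration format, and the decision to emit only the exception type and message are illustrative choices.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;
using System.Text.Json.Serialization;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical formatter that mirrors the JSON shape of the examples above.
public static class HealthResponseFormatter
{
    private const string DurationFormat = @"hh\:mm\:ss\.fff";

    private static readonly JsonSerializerOptions Options = new()
    {
        PropertyNamingPolicy = JsonNamingPolicy.CamelCase,
        // Omit description, exception, and data when a check did not supply them
        DefaultIgnoreCondition = JsonIgnoreCondition.WhenWritingNull
    };

    public static string ToJson(HealthReport report)
    {
        var body = new
        {
            Status = report.Status.ToString(),
            TotalDuration = report.TotalDuration.ToString(DurationFormat),
            Entries = report.Entries.ToDictionary(
                e => e.Key,
                e => new
                {
                    Status = e.Value.Status.ToString(),
                    Description = e.Value.Description,
                    Duration = e.Value.Duration.ToString(DurationFormat),
                    // Type and message only; stack traces stay out of the response
                    Exception = e.Value.Exception is null
                        ? null
                        : $"{e.Value.Exception.GetType().FullName}: {e.Value.Exception.Message}",
                    Data = e.Value.Data.Count > 0 ? e.Value.Data : null
                })
        };

        return JsonSerializer.Serialize(body, Options);
    }
}
```

One way to wire it up is to assign `HealthCheckOptions.ResponseWriter` to a delegate that sets the content type to `application/json` and writes `HealthResponseFormatter.ToJson(report)` to the response.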

Code Examples: - Response format examples (all states) - JSON schema - Response writer customization

Diagrams: - Response format structure - Status code mapping

Deliverables: - Response format specification - Schema definitions - Writer implementations

Topic 16: Minimal vs. Detailed Responses¶

What will be covered: - Minimal Response (Public/Load Balancer) - Detailed Response (Internal/Debugging) - Authenticated Access (Detailed info) - PII Redaction in responses
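One possible approach to PII redaction is an allowlist: only data keys that are known to be safe are copied into the response, and everything else is dropped. The sketch below illustrates that idea. `HealthDataRedactor` and the key names in the usage note are hypothetical, and the actual ATP redaction rules may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical redaction helper based on an allowlist of safe data keys.
public static class HealthDataRedactor
{
    // Returns a copy of the data that contains only the allowed keys.
    public static IReadOnlyDictionary<string, object> Redact(
        IReadOnlyDictionary<string, object> data,
        ISet<string> allowedKeys)
    {
        if (data is null) throw new ArgumentNullException(nameof(data));
        if (allowedKeys is null) throw new ArgumentNullException(nameof(allowedKeys));

        return data
            .Where(kvp => allowedKeys.Contains(kvp.Key))
            .ToDictionary(kvp => kvp.Key, kvp => kvp.Value);
    }
}
```

For example, an allowlist of `activeConnections` and `maxPoolSize` would keep pool statistics while dropping a `connectionString` or `tenantEmail` entry that a check added by mistake. An allowlist fails closed: a new, unreviewed key is hidden until someone approves it.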

Code Examples: - Response filtering - Authentication integration

Deliverables: - Response strategy guide

CYCLE 9: Degraded State Handling (~3,000 lines)¶

Topic 17: Degraded vs. Unhealthy¶

What will be covered: - When to Return Degraded

Degraded (Service still functional, but impaired):
- Non-critical dependency unavailable
- Performance below target (but above minimum)
- Backlog elevated (but not critical)
- Cache miss rate high (but queries still work)
- Replica lag high (but within tolerance)

Examples:
- Redis cache down, but database queries work (slower)
- Projection lag 15s (target: 5s, critical: 30s)
- Outbox backlog 800 events (critical: 1000)
- Search index unavailable (fallback to SQL)

Response:
- HTTP 200 OK (keep in load balancer)
- Body: "status": "Degraded"
- Kubernetes: Pod stays in service
- Monitoring: Alert (warning severity)

---

Unhealthy (Service cannot fulfill requests):
- Critical dependency unavailable
- Cannot process requests safely
- Data integrity at risk
- Compliance violation

Examples:
- Database connection failed (cannot persist)
- Service Bus connection failed (cannot publish events)
- KMS unavailable (cannot sign/verify)
- All backend services down (Gateway)

Response:
- HTTP 503 Service Unavailable
- Body: "status": "Unhealthy"
- Kubernetes: Pod removed from service
- Monitoring: Alert (critical severity)
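The status-to-HTTP mapping described above can be stated explicitly through `HealthCheckOptions.ResultStatusCodes`. These values match the ASP.NET Core defaults, so the sketch mainly documents intent and guards against accidental changes. The `ReadinessOptions` class name is an assumption for illustration.

```csharp
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical factory for the readiness endpoint options.
public static class ReadinessOptions
{
    public static HealthCheckOptions Create() => new HealthCheckOptions
    {
        ResultStatusCodes =
        {
            // Healthy and Degraded keep the pod in rotation
            [HealthStatus.Healthy] = StatusCodes.Status200OK,
            [HealthStatus.Degraded] = StatusCodes.Status200OK,
            // Unhealthy fails the probe so the pod is removed from service
            [HealthStatus.Unhealthy] = StatusCodes.Status503ServiceUnavailable
        }
    };
}
```

The options could then be passed when mapping the endpoint, for example `app.MapHealthChecks("/health/ready", ReadinessOptions.Create())`.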

Graceful Degradation Patterns

// Query service with optional search
public class QueryServiceHealthCheck : IHealthCheck
{
    private readonly IDbConnection _database;
    private readonly ISearchClient _searchClient;
    private readonly ILogger _logger;

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        // Critical: Database must be healthy
        var dbHealthy = await CheckDatabaseAsync();
        if (!dbHealthy)
        {
            return HealthCheckResult.Unhealthy("Database unavailable");
        }

        // Optional: Search enhances experience but not required
        var searchHealthy = await CheckSearchAsync();
        if (!searchHealthy)
        {
            _logger.LogWarning("Search unavailable, degraded mode");
            return HealthCheckResult.Degraded(
                "Database healthy, search unavailable (fallback to SQL)");
        }

        return HealthCheckResult.Healthy("All dependencies healthy");
    }
}

Degraded Mode Indicators:
- Response headers: X-Service-Mode: Degraded
- Metrics: service_degraded{reason="cache-miss"} 1
- Logs: Warning-level degradation notices

Code Examples: - Degraded vs. unhealthy logic - Graceful degradation patterns - Mode indicators

Diagrams: - State transition diagram - Degradation flow

Deliverables: - Degradation guide - Pattern library - Mode handling procedures

Topic 18: Dependency Criticality Matrix¶

What will be covered: - Critical vs. Optional Dependencies

| Dependency | Critical? | Failure Status | Reason |
|------------|-----------|----------------|--------|
| Azure SQL (Ingestion) | ✅ Yes | Unhealthy | Cannot persist audit records |
| Service Bus (Ingestion) | ✅ Yes | Unhealthy | Cannot publish events |
| Redis Cache (Query) | ❌ No | Degraded | Slower queries, but functional |
| Search Index (Query) | ❌ No | Degraded | Fallback to SQL queries |
| Policy Service (Ingestion) | ✅ Yes | Unhealthy | Cannot classify/retain |
| KMS (Integrity) | ✅ Yes | Unhealthy | Cannot sign/verify |
| Blob Storage WORM (Ingestion) | ✅ Yes | Unhealthy | Cannot store evidence |
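The matrix can be encoded as a lookup from check name to failure status, which is then passed as the `failureStatus` argument when a check is registered. This is a sketch under assumptions: the check names (`ingestion-sql`, `query-redis`, and so on) and the `DependencyCriticality` class are hypothetical, and treating unknown checks as critical is an illustrative fail-safe choice.

```csharp
using System.Collections.Generic;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical encoding of the criticality matrix; check names are illustrative.
public static class DependencyCriticality
{
    private static readonly Dictionary<string, HealthStatus> FailureStatusByCheck = new()
    {
        ["ingestion-sql"] = HealthStatus.Unhealthy,
        ["ingestion-servicebus"] = HealthStatus.Unhealthy,
        ["query-redis"] = HealthStatus.Degraded,
        ["query-search"] = HealthStatus.Degraded,
        ["ingestion-policy"] = HealthStatus.Unhealthy,
        ["integrity-kms"] = HealthStatus.Unhealthy,
        ["ingestion-blob-worm"] = HealthStatus.Unhealthy
    };

    // Unknown checks are treated as critical (assumption: fail safe by default).
    public static HealthStatus FailureStatusFor(string checkName) =>
        FailureStatusByCheck.TryGetValue(checkName, out var status)
            ? status
            : HealthStatus.Unhealthy;
}
```

A registration could then read `.AddCheck<TCheck>("query-redis", failureStatus: DependencyCriticality.FailureStatusFor("query-redis"))`, where `TCheck` is whichever check type the service uses. The framework reports that status when the check throws, and a check can also read it from `context.Registration.FailureStatus`.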

Code Examples: - Criticality configuration - Failure status mapping

Deliverables: - Dependency matrix - Configuration guide

CYCLE 10: Startup Warmup & Grace Periods (~3,000 lines)¶

Topic 19: Startup Warmup Pattern¶

What will be covered: - StartupWarmupGate

public class StartupWarmupGate
{
    private readonly TimeSpan _warmupDuration;
    private readonly DateTime _startTime;

    // volatile: written by the startup thread, read by health check request threads
    private volatile bool _isReady;

    public StartupWarmupGate(IConfiguration configuration)
    {
        _warmupDuration = TimeSpan.FromSeconds(
            configuration.GetValue<int>("Microservice:StartupWarmupSeconds", 30));
        _startTime = DateTime.UtcNow;
        _isReady = false;
    }

    public bool IsReady
    {
        get
        {
            if (_isReady) return true;

            // Time-based fallback: the gate opens once the warmup window has elapsed,
            // even if MarkReady() was never called
            if (DateTime.UtcNow - _startTime >= _warmupDuration)
            {
                _isReady = true;
                return true;
            }

            return false;
        }
    }

    // Opens the gate early, as soon as initialization has finished
    public void MarkReady()
    {
        _isReady = true;
    }
}

// Health check
public class StartupWarmupHealthCheck : IHealthCheck
{
    private readonly StartupWarmupGate _gate;

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        return Task.FromResult(
            _gate.IsReady
                ? HealthCheckResult.Healthy("Warmup complete")
                : HealthCheckResult.Unhealthy("Warming up..."));
    }
}

// Usage in startup
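public static async Task Main(string[] args)
{
    var host = CreateHostBuilder(args).Build();

    // Start the host first so the health endpoints can answer "Warming up..."
    // while initialization is still running
    await host.StartAsync();

    // Run migrations, warm cache, etc.
    await host.RunMigrationsAsync();
    await host.WarmCacheAsync();

    // Mark ready (the gate must be registered as a singleton)
    host.Services.GetRequiredService<StartupWarmupGate>().MarkReady();

    await host.WaitForShutdownAsync();
}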

Initialization Tasks:
- Database migrations (FluentMigrator)
- Configuration validation
- Cache warming (policy cache, tenant metadata)
- External service connectivity verification

Code Examples: - Warmup gate implementation - Initialization task orchestration - Startup optimization

Deliverables: - Startup warmup guide - Task coordination - Optimization techniques

Topic 20: Startup Probe Configuration¶

What will be covered: - Kubernetes Startup Probe - Failure Threshold Calculation - Startup Time Monitoring - Slow Startup Troubleshooting
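The failure threshold calculation follows from how Kubernetes treats a startup probe: the container is restarted once `failureThreshold` consecutive probes have failed, so the startup budget is roughly `initialDelaySeconds + failureThreshold × periodSeconds`. The sketch below works backwards from a measured worst-case startup time. The `StartupProbeMath` name and the 1.5 safety factor are assumptions for illustration, not ATP recommendations.

```csharp
using System;

// Hypothetical helper for sizing a Kubernetes startup probe.
public static class StartupProbeMath
{
    // Smallest failureThreshold that covers the worst-case startup time with headroom.
    public static int FailureThreshold(
        TimeSpan worstCaseStartup, int periodSeconds, double safetyFactor = 1.5)
    {
        if (periodSeconds <= 0)
            throw new ArgumentOutOfRangeException(nameof(periodSeconds));
        if (safetyFactor < 1.0)
            throw new ArgumentOutOfRangeException(nameof(safetyFactor));

        var budgetSeconds = worstCaseStartup.TotalSeconds * safetyFactor;
        return Math.Max(1, (int)Math.Ceiling(budgetSeconds / periodSeconds));
    }

    // Total time the container is given to start before it is restarted.
    public static TimeSpan StartupBudget(
        int initialDelaySeconds, int failureThreshold, int periodSeconds) =>
        TimeSpan.FromSeconds(initialDelaySeconds + (double)failureThreshold * periodSeconds);
}
```

With a 60-second worst case and a 5-second period, `FailureThreshold` yields 18, which gives a 90-second budget when `initialDelaySeconds` is 0.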

Code Examples: - Probe configuration - Timing analysis

Deliverables: - Startup probe guide - Timing recommendations

CYCLE 11: Health Check Caching (~2,500 lines)¶

Topic 21: Response Caching Strategy¶

What will be covered: - Why Cache Health Check Responses?

Problem:
- Health checks run frequently (every 5-10s by K8s)
- Each check queries dependencies (DB, cache, bus)
- High load on dependencies from health checks alone
- Example: 100 pods × 2 probes every 5s = 40 probe requests/sec, each fanning out to every dependency checked

Solution:
- Cache health check results briefly (5-10s)
- Reduce dependency load: each pod queries a dependency at most once per TTL, however many probes poll it
- Still fresh enough for K8s probes

Trade-off:
- Slightly stale health status (max 10s old)
- Acceptable for most use cases
- Critical checks can bypass cache

Implementing Health Check Caching

public class CachedHealthCheck : IHealthCheck
{
    private readonly IHealthCheck _innerCheck;
    private readonly IMemoryCache _cache;
    private readonly TimeSpan _cacheDuration;
    private readonly string _cacheKey;

    public CachedHealthCheck(
        IHealthCheck innerCheck,
        IMemoryCache cache,
        string name,
        TimeSpan cacheDuration)
    {
        _innerCheck = innerCheck;
        _cache = cache;
        _cacheKey = $"HealthCheck:{name}";
        _cacheDuration = cacheDuration;
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        // Try cache first
        if (_cache.TryGetValue(_cacheKey, out HealthCheckResult cachedResult))
        {
            return cachedResult;
        }

        // Cache miss, run actual check
        var result = await _innerCheck.CheckHealthAsync(context, cancellationToken);

        // Cache result
        _cache.Set(_cacheKey, result, _cacheDuration);

        return result;
    }
}

// Registration
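services.AddMemoryCache();
services.AddHealthChecks()
    // AddCheck requires a name and cannot resolve IMemoryCache, so register via a factory
    .Add(new HealthCheckRegistration(
        name: "database",
        factory: sp => new CachedHealthCheck(
            new SqlServerHealthCheck(connectionString),
            sp.GetRequiredService<IMemoryCache>(),
            name: "database",
            cacheDuration: TimeSpan.FromSeconds(10)),
        failureStatus: HealthStatus.Unhealthy,
        tags: null));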

Conditional Caching:
- Cache only Healthy results (not failures)
- Shorter TTL for Degraded (5s)
- No caching for Unhealthy (immediate detection)

Cache Invalidation:
- Automatic TTL expiration
- Manual invalidation on config changes
- Invalidation on deployment
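The conditional caching rules above amount to a small policy that maps a result's status to a TTL. A minimal sketch follows. The `HealthCacheTtlPolicy` name is hypothetical, and the 10-second Healthy TTL is taken from the registration example rather than from a confirmed ATP setting.

```csharp
using System;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical TTL policy for cached health check results.
public static class HealthCacheTtlPolicy
{
    // Returns the cache duration for a result, or null when it must not be cached.
    public static TimeSpan? TtlFor(HealthStatus status) => status switch
    {
        HealthStatus.Healthy => TimeSpan.FromSeconds(10),
        HealthStatus.Degraded => TimeSpan.FromSeconds(5),
        // Unhealthy results are never cached so that recovery is detected immediately
        _ => (TimeSpan?)null
    };
}
```

Inside a caching decorator, the unconditional `_cache.Set(...)` call could be replaced with `if (HealthCacheTtlPolicy.TtlFor(result.Status) is { } ttl) _cache.Set(_cacheKey, result, ttl);`.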

Code Examples: - Cached health check implementation - Conditional caching logic - Cache invalidation

Diagrams: - Caching flow - TTL strategy

Deliverables: - Caching implementation - Strategy guide - Invalidation procedures

Topic 22: Health Check Performance¶

What will be covered: - Performance Optimization - Parallel Execution - Timeout Configuration - Circuit Breakers for Health Checks
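Parallel execution and timeouts combine naturally: run every dependency probe concurrently and bound each one, so total latency approaches that of the slowest probe or the timeout, whichever is smaller. The sketch below uses only the base class library. `ParallelProbeRunner` and the probe delegate shape are assumptions for illustration and are not part of ASP.NET Core Health Checks.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical runner: executes probes concurrently with a per-probe timeout.
public static class ParallelProbeRunner
{
    // A probe that times out or throws is reported as failed (false).
    public static async Task<IReadOnlyDictionary<string, bool>> RunAsync(
        IReadOnlyDictionary<string, Func<CancellationToken, Task<bool>>> probes,
        TimeSpan perProbeTimeout,
        CancellationToken cancellationToken = default)
    {
        var tasks = probes.Select(async probe =>
        {
            // Linked token lets cooperative probes stop work when the timeout fires
            using var cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
            cts.CancelAfter(perProbeTimeout);
            try
            {
                // WaitAsync bounds the wait even if the probe ignores its token
                var ok = await probe.Value(cts.Token)
                    .WaitAsync(perProbeTimeout, cancellationToken);
                return (Name: probe.Key, Ok: ok);
            }
            catch (Exception) when (!cancellationToken.IsCancellationRequested)
            {
                // Timeout or probe failure; caller cancellation still propagates
                return (Name: probe.Key, Ok: false);
            }
        });

        var results = await Task.WhenAll(tasks);
        return results.ToDictionary(r => r.Name, r => r.Ok);
    }
}
```

`Task.WaitAsync` requires .NET 6 or later. The same idea applies per registration through the `HealthCheckRegistration.Timeout` property when the built-in health check service runs the checks.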

Code Examples: - Performance optimization - Parallel check execution

Deliverables: - Performance guide - Optimization techniques

CYCLE 12: Multi-Tenant Health Isolation (~3,000 lines)¶

Topic 23: Tenant-Aware Health Checks¶

What will be covered: - Per-Tenant Health Indicators

public class TenantQuotaHealthCheck : IHealthCheck
{
    private readonly ITenantQuotaService _quotaService;

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        // Check if any tenant exceeding quota critically
        var tenantsOverQuota = await _quotaService.GetTenantsExceedingQuotaAsync(
            threshold: 0.95);  // 95% of quota

        if (tenantsOverQuota.Count == 0)
        {
            return HealthCheckResult.Healthy("All tenants within quota");
        }

        var criticalTenants = tenantsOverQuota.Count(t => t.Utilization > 1.0);

        if (criticalTenants == 0)
        {
            return HealthCheckResult.Degraded(
                $"{tenantsOverQuota.Count} tenants approaching quota limits");
        }

        return HealthCheckResult.Unhealthy(
            $"{criticalTenants} tenants exceeded quota (ingestion throttled)");
    }
}

Tenant Isolation in Health Responses:
- Aggregate tenant health (no individual tenant details in public response)
- Detailed tenant health (admin endpoint, authenticated)
- Tenant quota monitoring
- Per-tenant SLO tracking
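Aggregation for the public response can be sketched as reducing per-tenant utilization to counts, so that no tenant identifier leaves the service. The `TenantHealthSummary` record and `TenantHealthAggregator` class are hypothetical. The 0.95 and 1.0 thresholds mirror the quota check shown earlier in this topic.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical aggregate: counts only, no tenant identifiers.
public sealed record TenantHealthSummary(int TotalTenants, int ApproachingQuota, int OverQuota);

public static class TenantHealthAggregator
{
    // Reduces per-tenant quota utilization (1.0 = 100% of quota) to anonymous counts.
    public static TenantHealthSummary Summarize(
        IReadOnlyDictionary<string, double> utilizationByTenant,
        double warnThreshold = 0.95)
    {
        if (utilizationByTenant is null)
            throw new ArgumentNullException(nameof(utilizationByTenant));

        var over = utilizationByTenant.Values.Count(u => u > 1.0);
        var approaching = utilizationByTenant.Values
            .Count(u => u >= warnThreshold && u <= 1.0);

        return new TenantHealthSummary(utilizationByTenant.Count, approaching, over);
    }
}
```

The summary can be placed in the `data` dictionary of a health check result for the public endpoint, while the per-tenant breakdown stays behind the authenticated admin endpoint.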

Code Examples: - Tenant-aware health checks - Quota monitoring - Aggregation logic

Diagrams: - Tenant health aggregation - Quota monitoring flow

Deliverables: - Tenant health guide - Quota monitoring - Isolation procedures

Topic 24: Regional Health Aggregation¶

What will be covered: - Cross-Region Health - Multi-Cluster Health - Global Health Dashboard

Code Examples: - Regional aggregation

Deliverables: - Regional health guide

CYCLE 13: Health Check Testing (~2,500 lines)¶

Topic 25: Testing Health Checks¶

What will be covered: - Unit Testing Health Checks

[TestClass]
public class DatabaseHealthCheckTests
{
    [TestMethod]
    public async Task Should_ReturnHealthy_WhenDatabaseConnected()
    {
        // Arrange
        var healthCheck = new SqlServerHealthCheck(validConnectionString);

        // Act
        var result = await healthCheck.CheckHealthAsync(new HealthCheckContext());

        // Assert
        Assert.AreEqual(HealthStatus.Healthy, result.Status);
        Assert.IsTrue(result.Description.Contains("connected"));
    }

    [TestMethod]
    public async Task Should_ReturnUnhealthy_WhenDatabaseUnavailable()
    {
        // Arrange
        var healthCheck = new SqlServerHealthCheck(invalidConnectionString);

        // Act
        var result = await healthCheck.CheckHealthAsync(new HealthCheckContext());

        // Assert
        Assert.AreEqual(HealthStatus.Unhealthy, result.Status);
        Assert.IsNotNull(result.Exception);
    }

    [TestMethod]
    public async Task Should_ReturnDegraded_WhenConnectionPoolNearLimit()
    {
        // Arrange
        var healthCheck = new SqlConnectionPoolHealthCheck(connectionString);
        // Simulate high connection usage (CreateManyConnections is a test helper)
        var connections = CreateManyConnections(95);  // 95% of pool

        try
        {
            // Act
            var result = await healthCheck.CheckHealthAsync(new HealthCheckContext());

            // Assert
            Assert.AreEqual(HealthStatus.Degraded, result.Status);
        }
        finally
        {
            // Cleanup runs even when the assertion fails
            CloseConnections(connections);
        }
    }
}

Integration Testing

[TestMethod]
public async Task HealthEndpoint_Should_ReturnHealthy_WhenAllDependenciesUp()
{
    // Arrange
    var client = _testServer.CreateClient();

    // Act
    var response = await client.GetAsync("/health/ready");

    // Assert
    Assert.AreEqual(HttpStatusCode.OK, response.StatusCode);

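    var content = await response.Content.ReadAsStringAsync();

    // HealthReport cannot be deserialized directly, so inspect the JSON instead
    using var document = JsonDocument.Parse(content);
    var root = document.RootElement;

    Assert.AreEqual("Healthy", root.GetProperty("status").GetString());
    foreach (var entry in root.GetProperty("entries").EnumerateObject())
    {
        Assert.AreEqual("Healthy",
            entry.Value.GetProperty("status").GetString(),
            $"Check '{entry.Name}' is not healthy");
    }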
}

Acceptance Testing (Reqnroll)

Feature: Health Check Endpoints
  As an operator
  I want to monitor service health
  So that I can detect and resolve issues quickly

Scenario: Liveness check returns healthy when service is running
  Given the Ingestion service is running
  When I request "/health/live"
  Then the response status code should be 200
  And the response body should contain "Healthy"

Scenario: Readiness check returns unhealthy when database is down
  Given the Ingestion service is running
  And the database is unavailable
  When I request "/health/ready"
  Then the response status code should be 503
  And the response body should contain "Unhealthy"
  And the database check should show "Unhealthy"

Scenario: Readiness check returns degraded when cache is slow
  Given the Ingestion service is running
  And the Redis cache has high latency
  When I request "/health/ready"
  Then the response status code should be 200
  And the response body should contain "Degraded"
  And the cache check should show "Degraded"

Code Examples: - Complete test suites (unit, integration, acceptance) - Test helpers and mocks - CI/CD integration

Diagrams: - Test architecture - Test coverage

Deliverables: - Testing guide - Test templates - CI/CD integration

Topic 26: Chaos Testing for Health Checks¶

What will be covered: - Fault Injection - Dependency Failure Simulation - Recovery Validation - Probe Behavior Under Load

Code Examples: - Chaos scenarios - Validation procedures

Deliverables: - Chaos testing guide

CYCLE 14: Load Balancer Integration (~3,000 lines)¶

Topic 27: Azure Load Balancer Health Probes¶

What will be covered: - Azure LB Health Probe Configuration

# Azure Load Balancer (for AKS services)
kind: Service
apiVersion: v1
metadata:
  name: ingestion-svc
  namespace: atp-ingest-ns
  annotations:
    # Health probe configuration
    service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path: "/health/ready"
    service.beta.kubernetes.io/azure-load-balancer-health-probe-protocol: "http"
    service.beta.kubernetes.io/azure-load-balancer-health-probe-interval: "10"
    service.beta.kubernetes.io/azure-load-balancer-health-probe-num-of-probe: "3"
spec:
  type: LoadBalancer
  selector:
    app: ingestion
  ports:
  - port: 80
    targetPort: 8080

Azure Front Door Health Probes
API Management Health Checks
Application Gateway Health Probes

Code Examples: - LB probe configurations - Multi-layer health checks

Deliverables: - Load balancer integration guide

Topic 28: Service Mesh Health Checks¶

What will be covered: - Istio Health Checks - Linkerd Health Probes - Envoy Health Endpoints

Code Examples: - Mesh integration

Deliverables: - Service mesh health guide

CYCLE 15: Health Check Troubleshooting (~3,500 lines)¶

Topic 29: Common Health Check Issues¶

What will be covered: - Problem: Readiness Check Always Failing

Symptoms:
- Pod shows 0/1 READY
- Pod never receives traffic
- Logs show repeated health check failures

Diagnosis:
# Check pod status
kubectl get pods -n atp-ingest-ns
# NAME                         READY   STATUS    RESTARTS   AGE
# ingestion-7d8f6c9b4-abc123   0/1     Running   0          5m

# Describe pod
kubectl describe pod ingestion-7d8f6c9b4-abc123 -n atp-ingest-ns
# Events:
# Readiness probe failed: HTTP probe failed with statuscode: 503

# Check health endpoint directly
# (port-forward blocks, so run it in the background or in a second terminal)
kubectl port-forward ingestion-7d8f6c9b4-abc123 8080:8080 -n atp-ingest-ns &
curl http://localhost:8080/health/ready

# Response shows which dependency failed:
{
  "status": "Unhealthy",
  "entries": {
    "database": {
      "status": "Unhealthy",
      "description": "Connection timeout"
    }
  }
}

Common Causes:
1. Database not accessible (network policy, firewall)
2. Wrong connection string (secret not mounted)
3. Dependency service down
4. Health check timeout too short
5. Startup not complete (use startup probe)

Solutions:
# 1. Check database connectivity
kubectl run -it --rm debug --image=mcr.microsoft.com/mssql-tools \
    --restart=Never -n atp-ingest-ns -- /bin/bash
sqlcmd -S <server> -U <user> -P <password> -Q "SELECT 1"

# 2. Check secrets mounted
kubectl exec ingestion-7d8f6c9b4-abc123 -n atp-ingest-ns -- \
    ls -la /mnt/secrets

# 3. Check network policy
kubectl get networkpolicies -n atp-ingest-ns

# 4. Increase timeout
kubectl edit deployment ingestion -n atp-ingest-ns
# Change: timeoutSeconds: 3 → timeoutSeconds: 10

# 5. Add startup probe
# (see startup probe configuration)

Problem: CrashLoopBackOff Due to Liveness Failure

Symptoms:
- Pod continuously restarting
- STATUS shows CrashLoopBackOff
- Liveness probe failing

Diagnosis:
kubectl logs ingestion-7d8f6c9b4-abc123 -n atp-ingest-ns --previous

Common Causes:
1. Application deadlock
2. Liveness check too aggressive (short timeout/period)
3. Application startup slow (no startup probe)
4. Liveness check has bug (throws exception)

Solutions:
# 1. Fix application deadlock
# 2. Increase liveness timeout/threshold
# 3. Add startup probe (delay liveness)
# 4. Review liveness check code

Problem: Flapping Readiness (Ready → Not Ready → Ready)

Symptoms:
- Pod oscillates between ready/not ready
- Traffic intermittent
- Logs show dependency timeouts

Common Causes:
1. Dependency intermittently slow
2. Health check timeout too strict
3. Network issues
4. Health check not idempotent (side effects)

Solutions:
# 1. Increase successThreshold (require 2-3 consecutive successes)
# 2. Increase timeout
# 3. Add circuit breaker to dependency client
# 4. Review health check for side effects
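The effect of `successThreshold` and `failureThreshold` on flapping can be illustrated with a small simulation. This is a simplified model of the consecutive-count behaviour of Kubernetes probes, not kubelet code, and the `ReadinessSimulator` name is hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical simulator: replays probe outcomes against consecutive-count thresholds.
public static class ReadinessSimulator
{
    // Returns the ready/not-ready state after each probe result.
    public static List<bool> Replay(
        IEnumerable<bool> probeResults,
        int successThreshold,
        int failureThreshold,
        bool initiallyReady = false)
    {
        if (successThreshold < 1) throw new ArgumentOutOfRangeException(nameof(successThreshold));
        if (failureThreshold < 1) throw new ArgumentOutOfRangeException(nameof(failureThreshold));

        var ready = initiallyReady;
        var successes = 0;
        var failures = 0;
        var states = new List<bool>();

        foreach (var ok in probeResults)
        {
            if (ok)
            {
                successes++;
                failures = 0;
                if (!ready && successes >= successThreshold) ready = true;
            }
            else
            {
                failures++;
                successes = 0;
                if (ready && failures >= failureThreshold) ready = false;
            }
            states.Add(ready);
        }

        return states;
    }

    // Counts how often the state changed between consecutive probes.
    public static int CountTransitions(IReadOnlyList<bool> states)
    {
        var transitions = 0;
        for (var i = 1; i < states.Count; i++)
        {
            if (states[i] != states[i - 1]) transitions++;
        }
        return transitions;
    }
}
```

Replaying an alternating pass/fail sequence with both thresholds at 1 flips the state on every probe, while thresholds of 2 keep the state stable until two consecutive results agree.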

Code Examples: - Troubleshooting procedures (10+ scenarios) - Diagnostic commands - Resolution scripts

Diagrams: - Troubleshooting decision tree - Common failure patterns

Deliverables: - Troubleshooting guide - Diagnostic procedures - Fix library

Topic 30: Health Check Debugging¶

What will be covered: - Debug Endpoints

// Detailed health endpoint (authenticated, internal only)
app.MapHealthChecks("/health/debug", new HealthCheckOptions
{
    ResponseWriter = async (context, report) =>
    {
        context.Response.ContentType = "application/json";

        var result = new
        {
            status = report.Status.ToString(),
            totalDuration = report.TotalDuration,
            entries = report.Entries.Select(e => new
            {
                name = e.Key,
                status = e.Value.Status.ToString(),
                description = e.Value.Description,
                duration = e.Value.Duration,
                exception = e.Value.Exception?.ToString(),
                data = e.Value.Data,
                tags = e.Value.Tags
            }),
            timestamp = DateTime.UtcNow,
            hostName = Environment.MachineName,
            podName = Environment.GetEnvironmentVariable("HOSTNAME"),
            podIp = context.Connection.LocalIpAddress?.ToString()
        };

        await context.Response.WriteAsJsonAsync(result);
    }
})
.RequireAuthorization("AdminOnly");

Health Check Logging
Health Check Metrics
Distributed Tracing for Health Checks

Code Examples: - Debug endpoints - Logging integration - Tracing

Deliverables: - Debugging guide - Diagnostic tools

CYCLE 16: Best Practices & Governance (~3,000 lines)¶

Topic 31: Health Check Best Practices¶

What will be covered: - Design Best Practices

✅ DO:
- Keep checks fast (<5s total)
- Gate readiness on critical dependencies only
- Return detailed information in authenticated, internal responses
- Use tags for logical grouping
- Cache results appropriately
- Implement all three probe types (live, ready, startup)
- Use Degraded for non-critical failures
- Test health checks thoroughly
- Document expected check behavior
- Monitor health check performance

❌ DON'T:
- Run expensive operations (full scans, backups)
- Modify state (writes, deletes)
- Include PII in responses
- Fail on optional dependencies
- Use static thresholds without monitoring
- Ignore degraded state
- Forget to test failure scenarios
- Couple health checks to business logic
- Use high-cardinality labels
- Expose detailed health publicly (security risk)

ATP Health Check Standards:
- All services must implement /health/live, /health/ready, /health/startup
- All critical dependencies must have health checks
- Health check duration must be <5s (P95)
- Detailed responses must not include PII
- Failed dependencies must log errors
- Health metrics must be exported
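The duration standard can be verified from recorded check durations. The sketch below uses the nearest-rank percentile method, which is one of several common definitions. The `DurationBudget` name is hypothetical, and the 5-second limit is the one stated in the standards above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for validating the P95 duration standard.
public static class DurationBudget
{
    // Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
    public static TimeSpan Percentile(IReadOnlyCollection<TimeSpan> samples, int percentile)
    {
        if (samples.Count == 0)
            throw new ArgumentException("At least one sample is required", nameof(samples));
        if (percentile < 1 || percentile > 100)
            throw new ArgumentOutOfRangeException(nameof(percentile));

        var sorted = samples.OrderBy(s => s).ToArray();
        var rank = (int)Math.Ceiling(percentile * sorted.Length / 100.0);
        return sorted[rank - 1];
    }

    // True when the P95 duration is below the 5-second standard.
    public static bool MeetsStandard(IReadOnlyCollection<TimeSpan> samples) =>
        Percentile(samples, 95) < TimeSpan.FromSeconds(5);
}
```

A CI step or scheduled job could feed it the durations collected from health reports and fail when `MeetsStandard` returns false.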

Code Examples: - Best practice implementations - Anti-patterns to avoid

Deliverables: - Best practices guide - Standards document - Anti-patterns catalog

Topic 32: Health Check Governance¶

What will be covered: - Health Check Review Process - New health check PR review checklist - Performance testing required - Security review for public endpoints - Documentation requirements

Health Check Lifecycle:
- Creation: When adding new dependency
- Updates: When dependency changes
- Deprecation: When dependency removed
- Monitoring: Continuous monitoring of the health checks themselves

Compliance Requirements:
- Health audit trail (who accessed detailed health)
- No PII in health responses
- Secure admin endpoints
- Health check versioning

Code Examples: - Governance procedures - Review checklists - Compliance validation

Deliverables: - Governance guide - Review procedures - Compliance checklist

Summary of Deliverables¶

Across all 16 cycles, this documentation will provide:

Health Check Foundations
- Fundamentals (liveness, readiness, startup)
- ASP.NET Core Health Checks integration
- Industry standards and formats
Kubernetes Integration
- Probe configuration (all three types)
- Failure scenarios and recovery
- Timing optimization
Custom Health Checks
- IHealthCheck implementation patterns
- ATP-specific custom checks (10+ checks)
- Service-specific configurations (all 8 services)
Dependency Checks
- Database health checks (SQL, Cosmos, MongoDB)
- Messaging health checks (Service Bus, MassTransit)
- Storage health checks (Blob, WORM)
- Cache health checks (Redis, in-memory)
- KMS health checks (Key Vault, signing operations)
- External service checks (HTTP endpoints)
Monitoring & UI
- Health Check UI (ASP.NET Core)
- Prometheus metrics export
- Grafana dashboards
- Application Insights integration
Response Formats
- Healthy, Degraded, Unhealthy responses
- Detailed vs. minimal responses
- JSON schema and standards
- Custom response writers
Degraded State
- Degraded vs. unhealthy criteria
- Graceful degradation patterns
- Dependency criticality matrix
Startup & Warmup
- StartupWarmupGate pattern
- Initialization task orchestration
- Startup probe configuration
- Grace period handling
Performance
- Response caching (5-10s TTL)
- Parallel execution
- Timeout configuration
- Load optimization
Multi-Tenancy
- Tenant-aware health indicators
- Quota monitoring
- Regional health aggregation
Testing
- Unit testing health checks
- Integration testing endpoints
- Acceptance testing scenarios
- Chaos testing (fault injection)
Integration
- Azure Load Balancer probes
- Azure Front Door health checks
- API Management health
- Service mesh integration
Operations
- Troubleshooting common issues (10+ scenarios)
- Debugging tools and techniques
- Performance analysis
Governance
- Best practices (10+ do's and don'ts)
- Standards and conventions
- Review process
- Compliance requirements

Runbook: Operational procedures using health checks
Alerts & SLOs: Health metrics in SLO calculations
Kubernetes: Probe configuration and pod lifecycle
Observability: Health check metrics and monitoring
Template Integration: Health checks in microservice template
Configuration: Health check configuration options
Testing Strategy: Health check testing approaches
Progressive Rollout: Health validation during deployments

This health checks guide provides complete implementation and operational procedures for ATP service health monitoring. It spans ASP.NET Core Health Checks fundamentals, Kubernetes probe integration, custom health checks for all dependencies, Health Check UI, caching strategies, multi-tenant health isolation, testing, load balancer integration, troubleshooting, and governance. The goal is predictable, observable, and self-healing services with fast failure detection and automatic recovery.