
Health Checks - Audit Trail Platform (ATP)

Observable health, predictable behavior: ATP implements health checks across all services using ASP.NET Core Health Checks, with custom checks for databases, message buses, caches, KMS, and other dependencies. The checks are exposed via the /health/live, /health/ready, and /health/startup endpoints, which back the Kubernetes probes and feed real-time monitoring.


📋 Documentation Generation Plan

This document will be generated in 16 cycles. Current progress:

| Cycle | Topics | Estimated Lines | Status |
|-------|--------|-----------------|--------|
| Cycle 1 | Health Check Fundamentals (1-2) | ~3,000 | ⏳ Not Started |
| Cycle 2 | ASP.NET Core Health Checks (3-4) | ~3,500 | ⏳ Not Started |
| Cycle 3 | Kubernetes Probe Integration (5-6) | ~3,500 | ⏳ Not Started |
| Cycle 4 | Custom Health Checks (7-8) | ~4,500 | ⏳ Not Started |
| Cycle 5 | Service-Specific Health Checks (9-10) | ~5,000 | ⏳ Not Started |
| Cycle 6 | Dependency Health Checks (11-12) | ~4,000 | ⏳ Not Started |
| Cycle 7 | Health Check UI & Monitoring (13-14) | ~3,500 | ⏳ Not Started |
| Cycle 8 | Health Check Response Formats (15-16) | ~3,000 | ⏳ Not Started |
| Cycle 9 | Degraded State Handling (17-18) | ~3,000 | ⏳ Not Started |
| Cycle 10 | Startup Warmup & Grace Periods (19-20) | ~3,000 | ⏳ Not Started |
| Cycle 11 | Health Check Caching (21-22) | ~2,500 | ⏳ Not Started |
| Cycle 12 | Multi-Tenant Health Isolation (23-24) | ~3,000 | ⏳ Not Started |
| Cycle 13 | Health Check Testing (25-26) | ~2,500 | ⏳ Not Started |
| Cycle 14 | Load Balancer Integration (27-28) | ~3,000 | ⏳ Not Started |
| Cycle 15 | Health Check Troubleshooting (29-30) | ~3,500 | ⏳ Not Started |
| Cycle 16 | Best Practices & Governance (31-32) | ~3,000 | ⏳ Not Started |

Total Estimated Lines: ~54,000


Purpose & Scope

This document provides the complete health check implementation guide for ATP, covering ASP.NET Core Health Checks, Kubernetes probes (liveness, readiness, startup), custom health checks for all dependencies (databases, message buses, caches, KMS, external services), Health Check UI, monitoring integration, and operational procedures for maintaining service health and resilience.

Why Health Checks for ATP?

  1. Reliability: Detect unhealthy services before they impact users
  2. Kubernetes Integration: Automatic container restart (liveness) and traffic management (readiness)
  3. Load Balancer: Remove unhealthy instances from rotation
  4. Monitoring: Real-time health status in dashboards
  5. Debugging: Identify failing dependencies quickly
  6. Compliance: Health audit trail for SLA validation
  7. Automation: Enable self-healing and auto-scaling
  8. Observability: Health metrics feed into SLO calculations

ATP Health Check Architecture

Client Request
Load Balancer (checks /health/ready)
Kubernetes Service (only routes to ready pods)
Pod (3 health endpoints)
    ├── /health/live    → Liveness Probe (K8s restarts if fails)
    ├── /health/ready   → Readiness Probe (K8s removes from service if fails)
    └── /health/startup → Startup Probe (K8s delays liveness until passes)

Health Check Types

| Type | Endpoint | Purpose | Kubernetes Usage | Check Scope |
|------|----------|---------|------------------|-------------|
| Liveness | /health/live | Is the process alive and responsive? | Restart container if fails | Minimal (self-ping) |
| Readiness | /health/ready | Can the service handle traffic? | Remove from load balancer if fails | Dependencies (DB, cache, bus) |
| Startup | /health/startup | Has the service finished initializing? | Delay liveness/readiness until passes | Warmup tasks (migrations, cache) |

ATP Service Health Checks

All ATP services implement comprehensive health checks:

- Gateway: Key Vault (certs), backend services (HTTP), rate limiter (Redis)
- Ingestion: Azure SQL, Service Bus (publish), Blob Storage (WORM), Policy Service, Outbox
- Query: Read model database, query cache (Redis), Search index, projection lag
- Projection: Service Bus (subscribe), read model database, inbox table, projection lag
- Export: Blob Storage, export job queue (Redis), KMS (signing), bandwidth quota
- Integrity: Blob WORM, KMS (signing keys), hash chain state, Merkle computation
- Policy: Policy database, policy cache (Redis), default policies loaded
- Admin: Admin database, configuration store


Detailed Cycle Plan

CYCLE 1: Health Check Fundamentals (~3,000 lines)

Topic 1: Health Check Philosophy

What will be covered:

  • What is a Health Check?

Health Check:
- Periodic test of service health
- Returns: Healthy, Degraded, or Unhealthy
- Exposes internal state to external systems
- Enables automation (restart, route, scale)

Purpose:
- Detect failures early
- Enable self-healing
- Prevent cascading failures
- Support graceful degradation

  • Health vs. Metrics

    Health Checks:
    - Discrete state (Healthy, Degraded, or Unhealthy)
    - Immediate action (restart, remove from LB)
    - Synchronous (request/response)
    - Examples: "Can I connect to database?"
    
    Metrics:
    - Continuous values (latency, throughput, errors)
    - Gradual response (alerts, scaling)
    - Asynchronous (scrape, push)
    - Examples: "What's the current latency?"
    
    Relationship:
    - Metrics → Trends and patterns
    - Health Checks → Go/no-go decisions
    - Both needed for complete observability
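
One way to connect the two is to publish each health report as a metric. The following is a minimal sketch, assuming the framework's IHealthCheckPublisher hook and System.Diagnostics.Metrics; the class name, the meter name "Atp.Health", and the instrument name "atp.health.status" are illustrative placeholders, not ATP conventions.

```csharp
using System.Diagnostics.Metrics;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Exposes the aggregate health status as a numeric gauge so that
// go/no-go state can also be charted and queried as a time series.
public sealed class HealthStatusMetricPublisher : IHealthCheckPublisher
{
    // Meter and instrument names are hypothetical placeholders.
    private readonly Meter _meter = new("Atp.Health");

    // HealthStatus enum values: 0 = Unhealthy, 1 = Degraded, 2 = Healthy.
    private int _lastStatus = (int)HealthStatus.Healthy;

    public HealthStatusMetricPublisher()
    {
        _meter.CreateObservableGauge<int>(
            "atp.health.status",
            () => Volatile.Read(ref _lastStatus));
    }

    public int LastStatus => Volatile.Read(ref _lastStatus);

    // Called by the health check host each time a report is produced.
    public Task PublishAsync(HealthReport report, CancellationToken cancellationToken)
    {
        Volatile.Write(ref _lastStatus, (int)report.Status);
        return Task.CompletedTask;
    }
}
```

Registered with services.AddSingleton<IHealthCheckPublisher, HealthStatusMetricPublisher>(), the publisher is invoked periodically by the health check host, so the same status that drives restart and routing decisions is also available for trend analysis.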
    

  • Three Types of Health Checks

    1. Liveness (Am I alive?)
       - Purpose: Detect deadlocks, infinite loops, crashes
       - Check: Minimal (process responsive)
       - Failure: Restart container
       - Example: Self-ping, basic HTTP response
    
    2. Readiness (Can I serve traffic?)
       - Purpose: Detect dependency failures
       - Check: Critical dependencies (DB, cache, queue)
       - Failure: Remove from load balancer
       - Example: Database connection, cache ping
    
    3. Startup (Have I finished initializing?)
       - Purpose: Allow slow startup without killing pod
       - Check: Initialization tasks complete
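       - Failure: Container is restarted once failureThreshold is exceeded; until the probe passes, liveness and readiness probes are held off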
       - Example: Migrations run, cache warmed, config loaded
    

  • ATP Health Check Principles

    1. Fast Checks (<5 seconds)
       - Health checks must be lightweight
       - Avoid expensive operations (full scans, complex queries)
       - Use simple connectivity tests
    
    2. Dependency-Aware
       - Check critical dependencies only
       - Don't fail on optional dependencies
       - Degrade gracefully when possible
    
    3. No Side Effects
       - Health checks are read-only
       - Don't modify state
       - Don't trigger business logic
    
    4. Cacheable (with short TTL)
       - Cache health status briefly (5-10 seconds)
       - Avoid overwhelming dependencies
       - Balance freshness vs. load
    
    5. Detailed Responses
       - Return status + details for each check
       - Include duration and error messages
       - Enable debugging
    
    6. Compliance-Aware
       - No PII in health responses
       - Audit health check access
       - Secure endpoints (internal only for details)
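
Principle 4 can be sketched as a small wrapper that reuses the last result for a short TTL. This is an illustration under stated assumptions, not ATP's implementation: CachedHealthCheck, the probe delegate, and the injectable clock are hypothetical names introduced here.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Runs the wrapped probe at most once per TTL window so that frequent
// Kubernetes probes do not overwhelm the dependency being checked.
public sealed class CachedHealthCheck : IHealthCheck
{
    private readonly Func<CancellationToken, Task<HealthCheckResult>> _probe;
    private readonly TimeSpan _ttl;
    private readonly Func<DateTimeOffset> _clock;
    private readonly SemaphoreSlim _gate = new(1, 1);
    private HealthCheckResult _cached;
    private DateTimeOffset _expiresAt = DateTimeOffset.MinValue;

    public CachedHealthCheck(
        Func<CancellationToken, Task<HealthCheckResult>> probe,
        TimeSpan ttl,
        Func<DateTimeOffset>? clock = null)
    {
        _probe = probe;
        _ttl = ttl;
        _clock = clock ?? (() => DateTimeOffset.UtcNow);
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        // Serialize callers so concurrent probes share one dependency call.
        await _gate.WaitAsync(cancellationToken);
        try
        {
            if (_clock() < _expiresAt)
            {
                return _cached; // still fresh: skip the dependency call
            }

            _cached = await _probe(cancellationToken);
            _expiresAt = _clock() + _ttl;
            return _cached;
        }
        finally
        {
            _gate.Release();
        }
    }
}
```

An instance can be registered through the AddCheck(name, instance, ...) overload. The TTL should stay well below the probe period multiplied by failureThreshold, otherwise a cached failure could outlive the dependency's recovery.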
    

Code Examples: - Health check concepts - Three-probe architecture - ATP principles

Diagrams: - Health check types comparison - Probe failure handling - Health check architecture

Deliverables: - Health check fundamentals guide - Type comparison matrix - ATP principles document


Topic 2: Health Check Standards

What will be covered:

  • Industry Standards

RFC 9111 (HTTP Caching, obsoletes RFC 7234):
- Cache-Control: no-cache for health checks
- Avoid stale health status

OpenAPI/Swagger:
- Document health endpoints
- Security schemes (Bearer token for detailed checks)

Health Check Response Format (de facto standard):
{
  "status": "Healthy" | "Degraded" | "Unhealthy",
  "totalDuration": "00:00:00.123",
  "entries": {
    "database": {
      "status": "Healthy",
      "description": "...",
      "duration": "00:00:00.050"
    }
  }
}

HTTP Status Codes:
- 200 OK: Healthy
- 503 Service Unavailable: Unhealthy
- 200 OK (with "Degraded" in body): Degraded

  • ASP.NET Core Health Checks
    - Microsoft.Extensions.Diagnostics.HealthChecks
    - AspNetCore.HealthChecks.* (community packages)
    - Health Check UI (AspNetCore.HealthChecks.UI)

  • Kubernetes Health Probes
    - livenessProbe
    - readinessProbe
    - startupProbe
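
The response format listed under Industry Standards can be produced with a custom response writer. The sketch below assumes System.Text.Json and the HealthCheckOptions.ResponseWriter delegate; HealthResponseWriter is a hypothetical name, and exception details are deliberately left out of the payload.

```csharp
using System.Linq;
using System.Text.Json;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Diagnostics.HealthChecks;

public static class HealthResponseWriter
{
    // hh:mm:ss.fff, matching the duration strings in the format example.
    private const string DurationFormat = @"hh\:mm\:ss\.fff";

    // Serializes a report into the status / totalDuration / entries shape.
    public static string ToJson(HealthReport report) =>
        JsonSerializer.Serialize(new
        {
            status = report.Status.ToString(),
            totalDuration = report.TotalDuration.ToString(DurationFormat),
            entries = report.Entries.ToDictionary(
                entry => entry.Key,
                entry => new
                {
                    status = entry.Value.Status.ToString(),
                    description = entry.Value.Description,
                    duration = entry.Value.Duration.ToString(DurationFormat)
                })
        });

    // Signature matches HealthCheckOptions.ResponseWriter.
    public static Task WriteAsync(HttpContext context, HealthReport report)
    {
        context.Response.ContentType = "application/json";
        return context.Response.WriteAsync(ToJson(report));
    }
}
```

Assigning ResponseWriter = HealthResponseWriter.WriteAsync on an endpoint's HealthCheckOptions would emit this body, while the HTTP status code is still chosen by ResultStatusCodes.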

Code Examples: - Standard response formats - HTTP status codes - ASP.NET Core integration

Diagrams: - Standards compliance - Response format evolution

Deliverables: - Standards reference - Format specifications - Compliance guide


CYCLE 2: ASP.NET Core Health Checks (~3,500 lines)

Topic 3: Health Check Middleware

What will be covered:

  • Registering Health Checks

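// Startup.cs / Program.cs
using HealthChecks.UI.Client;                        // UIResponseWriter
using Microsoft.AspNetCore.Diagnostics.HealthChecks; // HealthCheckOptions
using Microsoft.Extensions.Diagnostics.HealthChecks; // HealthCheckResult, HealthStatus

// `configuration` is the IConfiguration instance injected into the Startup class
public void ConfigureServices(IServiceCollection services)
{
    // Add health check services
    var healthChecks = services.AddHealthChecks();

    // Add built-in checks; the "live" tag is what the /health/live predicate selects
    healthChecks.AddCheck("self", () => HealthCheckResult.Healthy("Service is alive"), tags: new[] { "live" });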

    // Add dependency checks
    healthChecks.AddSqlServer(
        connectionString: configuration.GetConnectionString("AuditDb"),
        healthQuery: "SELECT 1",
        name: "database",
        failureStatus: HealthStatus.Unhealthy,
        tags: new[] { "ready", "database" });

    healthChecks.AddRedis(
        redisConnectionString: configuration.GetConnectionString("Redis"),
        name: "cache",
        failureStatus: HealthStatus.Degraded,
        tags: new[] { "ready", "cache" });

    healthChecks.AddAzureServiceBusTopic(
        connectionString: configuration.GetConnectionString("ServiceBus"),
        topicName: "audit.appended.v1",
        name: "servicebus",
        tags: new[] { "ready", "messaging" });
}

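public void Configure(IApplicationBuilder app)
{
    // Routing must run before UseEndpoints; authorization backs RequireAuthorization below
    app.UseRouting();
    app.UseAuthorization();

    // Map health check endpoints
    app.UseEndpoints(endpoints =>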
    {
        // Liveness endpoint (minimal checks)
        endpoints.MapHealthChecks("/health/live", new HealthCheckOptions
        {
            Predicate = check => check.Tags.Contains("live"),
            AllowCachingResponses = false,
            ResultStatusCodes =
            {
                [HealthStatus.Healthy] = StatusCodes.Status200OK,
                [HealthStatus.Degraded] = StatusCodes.Status200OK,
                [HealthStatus.Unhealthy] = StatusCodes.Status503ServiceUnavailable
            }
        });

        // Readiness endpoint (dependency checks)
        endpoints.MapHealthChecks("/health/ready", new HealthCheckOptions
        {
            Predicate = check => check.Tags.Contains("ready"),
            AllowCachingResponses = false,
            ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse,
            ResultStatusCodes =
            {
                [HealthStatus.Healthy] = StatusCodes.Status200OK,
                [HealthStatus.Degraded] = StatusCodes.Status200OK,
                [HealthStatus.Unhealthy] = StatusCodes.Status503ServiceUnavailable
            }
        });

        // Startup endpoint
        endpoints.MapHealthChecks("/health/startup", new HealthCheckOptions
        {
            Predicate = check => check.Tags.Contains("startup"),
            AllowCachingResponses = false
        });

        // Detailed health UI (internal only, secured)
        endpoints.MapHealthChecks("/health", new HealthCheckOptions
        {
            AllowCachingResponses = false,
            ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse
        })
        .RequireAuthorization("InternalOnly");
    });
}

  • Health Check Tags
    Tag Strategy:
    
    live:
    - Minimal checks for liveness probe
    - Only "self" check (process alive)
    - Fast (<100ms)
    
    ready:
    - All critical dependency checks
    - Database, cache, message bus
    - Moderate speed (<5s total)
    
    startup:
    - Initialization completion checks
    - Migrations complete, cache warmed
    - Can be slow (30-60s acceptable)
    
    database:
    - Groups all database checks
    - For targeted diagnostics
    
    messaging:
    - Groups messaging checks
    - Service Bus, outbox, consumers
    
    cache:
    - Groups caching checks
    - Redis, in-memory
    
    external:
    - Groups external API checks
    - Policy service, KMS, etc.
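
The tag strategy above lends itself to a small helper so that each endpoint does not repeat the same predicate. A minimal sketch, assuming the standard HealthCheckOptions type; TaggedHealthOptions and ForTag are hypothetical names.

```csharp
using Microsoft.AspNetCore.Diagnostics.HealthChecks;

public static class TaggedHealthOptions
{
    // Builds endpoint options that run only the checks carrying the given tag.
    public static HealthCheckOptions ForTag(string tag) => new()
    {
        Predicate = registration => registration.Tags.Contains(tag),
        AllowCachingResponses = false
    };
}
```

Usage would look like endpoints.MapHealthChecks("/health/live", TaggedHealthOptions.ForTag("live")). A check that carries none of the probe tags is then only visible on an unfiltered endpoint such as /health.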
    

Code Examples: - Complete health check setup - Endpoint mapping - Tag-based filtering

Diagrams: - Health check middleware flow - Tag-based routing - Response writer pipeline

Deliverables: - Health check setup guide - Middleware configuration - Tag strategy


Topic 4: Built-In Health Checks

What will be covered:

  • Community Health Check Packages

// Available health check packages (NuGet package IDs, not namespaces)
// AspNetCore.HealthChecks.SqlServer        SQL Server
// AspNetCore.HealthChecks.AzureServiceBus  Azure Service Bus
// AspNetCore.HealthChecks.Redis            Redis
// AspNetCore.HealthChecks.MongoDb          MongoDB
// AspNetCore.HealthChecks.AzureStorage     Azure Blob/Table
// AspNetCore.HealthChecks.AzureKeyVault    Azure Key Vault
// AspNetCore.HealthChecks.Uris             HTTP endpoints
// AspNetCore.HealthChecks.System           Disk, memory, process

using Azure.Identity;                            // DefaultAzureCredential
using Microsoft.Extensions.DependencyInjection;  // AddHealthChecks() and the Add* extension methods

// Registration examples
services.AddHealthChecks()
    // Database
    .AddSqlServer(
        connectionString: config.GetConnectionString("AuditDb"),
        healthQuery: "SELECT 1",
        name: "audit-database",
        timeout: TimeSpan.FromSeconds(3))

    // Message Bus
    .AddAzureServiceBusTopic(
        connectionString: config.GetConnectionString("ServiceBus"),
        topicName: "audit.appended.v1",
        name: "servicebus-topic")

    // Cache
    .AddRedis(
        redisConnectionString: config.GetConnectionString("Redis"),
        name: "redis-cache")

    // Blob Storage
    .AddAzureBlobStorage(
        connectionString: config.GetConnectionString("BlobStorage"),
        containerName: "audit-worm",
        name: "blob-worm-storage")

    // Key Vault
    .AddAzureKeyVault(
        new Uri(config["KeyVault:Uri"]),    // vault URI (passed positionally)
        new DefaultAzureCredential(),       // token credential
        setup: options => options.AddSecret("signing-key-public"),
        name: "key-vault")

    // HTTP Endpoint (Policy Service)
    .AddUrlGroup(
        uri: new Uri("https://policy.atp.internal/health/ready"),
        name: "policy-service",
        timeout: TimeSpan.FromSeconds(5))

    // System resources
    .AddDiskStorageHealthCheck(
        // Linux containers expose mount points such as "/"; use "C:\\" only on Windows nodes
        setup: options => options.AddDrive("/", minimumFreeMegabytes: 1024),
        name: "disk-space")

    .AddProcessAllocatedMemoryHealthCheck(
        maximumMegabytesAllocated: 2048,
        name: "memory-allocation");

  • Configuration Options
    - Timeouts
    - Failure status (Unhealthy vs. Degraded)
    - Retry policies
    - Tags
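
The failure status option only takes effect when a check throws or explicitly reports the status from its registration. A minimal sketch of a check that honors it, assuming a caller-supplied ping delegate; DependencyPingHealthCheck is a hypothetical name.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Reports the status configured at registration (Unhealthy or Degraded)
// instead of hard-coding one, so the same class serves critical and
// optional dependencies.
public sealed class DependencyPingHealthCheck : IHealthCheck
{
    private readonly Func<CancellationToken, Task> _ping;

    public DependencyPingHealthCheck(Func<CancellationToken, Task> ping) => _ping = ping;

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        try
        {
            await _ping(cancellationToken);
            return HealthCheckResult.Healthy("Dependency reachable");
        }
        catch (Exception ex)
        {
            // FailureStatus comes from the failureStatus argument of AddCheck(...).
            return new HealthCheckResult(
                context.Registration.FailureStatus,
                description: "Dependency unreachable",
                exception: ex);
        }
    }
}
```

Registering one instance with failureStatus: HealthStatus.Degraded and another with the default keeps an optional cache outage from failing readiness while a database outage still does.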

Code Examples: - Built-in check usage (all types) - Configuration patterns - Timeout management

Diagrams: - Health check library ecosystem - Package dependencies

Deliverables: - Built-in checks catalog - Usage guide - Configuration reference


CYCLE 3: Kubernetes Probe Integration (~3,500 lines)

Topic 5: Probe Configuration

What will be covered:

  • Liveness Probe

# Kubernetes Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ingestion
  namespace: atp-ingest-ns
spec:
  template:
    spec:
      containers:
      - name: ingestion
        image: atpacr.azurecr.io/atp/ingestion:1.2.3

        # Liveness Probe (restart if fails)
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 30     # Wait 30s after pod start
          periodSeconds: 10            # Check every 10s
          timeoutSeconds: 5            # 5s timeout per check
          failureThreshold: 3          # Restart after 3 consecutive failures
          successThreshold: 1          # Healthy after 1 success

  • Readiness Probe

        # Readiness Probe (remove from service if fails)
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 3
          successThreshold: 1
    

  • Startup Probe

        # Startup Probe (for slow-starting apps)
        startupProbe:
          httpGet:
            path: /health/startup
            port: 8080
          initialDelaySeconds: 0
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 30         # 30 * 5s = 150s max startup time
          successThreshold: 1
    

  • Probe Types (HTTP, TCP, Exec)

    # HTTP Probe (most common)
    livenessProbe:
      httpGet:
        path: /health/live
        port: 8080
        httpHeaders:
        - name: X-Health-Check
          value: "kubernetes"
    
    # TCP Probe (for non-HTTP services)
    livenessProbe:
      tcpSocket:
        port: 9090
    
    # Exec Probe (run command in container)
    livenessProbe:
      exec:
        command:
        - /bin/sh
        - -c
        - /app/healthcheck.sh
    
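    # gRPC Probe (Kubernetes 1.24+)
    livenessProbe:
      grpc:
        port: 9090
        # Optional: `service` is the name placed in the gRPC HealthCheckRequest,
        # not the health service's own type name; omit it to query overall server health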
    

  • Probe Timing Best Practices

    Liveness Probe:
    - initialDelaySeconds: 30-60s (allow startup)
    - periodSeconds: 10-30s (not too frequent)
    - timeoutSeconds: 5s (reasonable timeout)
    - failureThreshold: 3 (tolerate transient failures)
    
    Readiness Probe:
    - initialDelaySeconds: 10-30s (faster than liveness)
    - periodSeconds: 5-10s (more frequent for traffic mgmt)
    - timeoutSeconds: 3s
    - failureThreshold: 3
    
    Startup Probe:
    - initialDelaySeconds: 0 (start immediately)
    - periodSeconds: 5s
    - failureThreshold: 30-60 (allow long startup)
    - Kubernetes holds off liveness and readiness probes automatically until the startup probe succeeds
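
The timing values above translate into a failure budget of roughly initialDelaySeconds + periodSeconds × failureThreshold. The sketch below works through that arithmetic; ProbeTiming is a hypothetical helper, and the estimate ignores timeoutSeconds and scheduling jitter.

```csharp
using System;

// Mirrors the probe fields that determine how long failures are tolerated.
public sealed record ProbeTiming(int PeriodSeconds, int FailureThreshold, int InitialDelaySeconds = 0)
{
    // Approximate time until the kubelet acts when every probe attempt fails.
    public TimeSpan FailureBudget =>
        TimeSpan.FromSeconds(InitialDelaySeconds + PeriodSeconds * FailureThreshold);
}

public static class ProbeTimingExamples
{
    // Values taken from the probe configurations in this topic.
    public static readonly ProbeTiming Startup = new(PeriodSeconds: 5, FailureThreshold: 30);
    public static readonly ProbeTiming Liveness = new(PeriodSeconds: 10, FailureThreshold: 3);
    public static readonly ProbeTiming Readiness = new(PeriodSeconds: 5, FailureThreshold: 3);
}
```

With these values a service gets about 150 seconds to start, a hung process is restarted after about 30 seconds of failed liveness probes, and an unready pod leaves the service endpoints after about 15 seconds.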
    

Code Examples: - Complete probe configurations (all ATP services) - Probe type examples - Timing guidelines

Diagrams: - Probe lifecycle - Failure handling flow - Timing optimization

Deliverables: - Probe configuration guide - Timing recommendations - ATP probe library


Topic 6: Probe Failure Scenarios

What will be covered:

  • Liveness Probe Failure

Scenario: Liveness probe fails 3 consecutive times

Kubernetes Actions:
1. Kubelet marks the container as failed (the pod is kept, not replaced)
2. Send SIGTERM to container (graceful shutdown)
3. Wait for termination grace period (default: 30s)
4. Send SIGKILL if still running
5. Restart the container in the same pod (per restartPolicy; repeated restarts back off as CrashLoopBackOff)
6. Wait for startup probe to pass
7. Wait for readiness probe to pass
8. Add pod back to service endpoints

While the container restarts it is not ready, so the pod receives no traffic.

Timeline (worst case, container ignores SIGTERM):
T+0s:    Liveness probe fails (3rd failure)
T+0s:    SIGTERM sent to container
T+30s:   SIGKILL sent (if not terminated)
T+30s:   Container restarted in the same pod
T+45s:   Startup probe passes
T+50s:   Readiness probe passes
T+50s:   Pod receives traffic again

Total Recovery Time: ~50 seconds

  • Readiness Probe Failure

    Scenario: Readiness probe fails (database connection lost)
    
    Kubernetes Actions:
    1. Mark pod as "Not Ready"
    2. Remove pod IP from service endpoints
    3. Stop routing traffic to pod
    4. Pod continues running (NOT restarted)
    5. Continue checking readiness probe
    6. When probe passes → add back to service
    
    User Impact:
    - No downtime if other pods healthy
    - Traffic automatically shifted
    - Pod self-heals when dependency recovers
    
    Timeline:
    T+0s:    Readiness probe fails
    T+0s:    Pod removed from service endpoints
    T+0s-Xs: No traffic to pod (database recovers)
    T+Xs:    Readiness probe passes
    T+Xs:    Pod added back to service
    
    Recovery Time: Depends on dependency recovery
    

  • Startup Probe Failure

    Scenario: Startup probe fails (migrations taking too long)
    
    Kubernetes Actions:
    1. Continue running startup probe
    2. Liveness/readiness disabled until startup passes
    3. If failureThreshold is exceeded (e.g., 30 failures * 5s = 150s):
    4. Kill the container and restart it in the same pod (per restartPolicy)
    
    Best Practice:
    - Set failureThreshold high enough for worst-case startup
    - Monitor startup duration metrics
    - Optimize slow initialization tasks
    

Code Examples: - Failure scenarios - Recovery timelines - Mitigation strategies

Diagrams: - Probe failure flows - Pod lifecycle during recovery - Traffic shifting

Deliverables: - Failure scenario guide - Recovery procedures - Timing analysis


CYCLE 4: Custom Health Checks (~4,500 lines)

Topic 7: Implementing Custom Health Checks

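What will be covered:

  • IHealthCheck Interface

using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.Diagnostics.HealthChecks;
using Microsoft.Extensions.Logging;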

public class OutboxHealthCheck : IHealthCheck
{
    private readonly IOutboxRepository _outboxRepository;
    private readonly ILogger<OutboxHealthCheck> _logger;
    private readonly int _maxPendingThreshold;

    public OutboxHealthCheck(
        IOutboxRepository outboxRepository,
        ILogger<OutboxHealthCheck> logger,
        IConfiguration configuration)
    {
        _outboxRepository = outboxRepository;
        _logger = logger;
        _maxPendingThreshold = configuration.GetValue<int>("HealthChecks:Outbox:MaxPending", 1000);
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        try
        {
            // Check outbox pending count
            var pendingCount = await _outboxRepository.GetPendingCountAsync(cancellationToken);

            if (pendingCount == 0)
            {
                return HealthCheckResult.Healthy(
                    description: "Outbox is empty (all events published)",
                    data: new Dictionary<string, object>
                    {
                        ["pendingEvents"] = 0,
                        ["checkedAt"] = DateTime.UtcNow
                    });
            }

            if (pendingCount < _maxPendingThreshold)
            {
                return HealthCheckResult.Healthy(
                    description: $"Outbox has {pendingCount} pending events (within threshold)",
                    data: new Dictionary<string, object>
                    {
                        ["pendingEvents"] = pendingCount,
                        ["threshold"] = _maxPendingThreshold,
                        ["checkedAt"] = DateTime.UtcNow
                    });
            }

            // Degraded if backlog growing but not critical
            if (pendingCount < _maxPendingThreshold * 2)
            {
                _logger.LogWarning("Outbox backlog is elevated: {Count}", pendingCount);

                return HealthCheckResult.Degraded(
                    description: $"Outbox backlog elevated: {pendingCount} events",
                    data: new Dictionary<string, object>
                    {
                        ["pendingEvents"] = pendingCount,
                        ["threshold"] = _maxPendingThreshold,
                        ["recommendation"] = "Scale outbox relay workers"
                    });
            }

            // Unhealthy if backlog critical
            _logger.LogError("Outbox backlog critical: {Count}", pendingCount);

            return HealthCheckResult.Unhealthy(
                description: $"Outbox backlog critical: {pendingCount} events",
                data: new Dictionary<string, object>
                {
                    ["pendingEvents"] = pendingCount,
                    ["threshold"] = _maxPendingThreshold,
                    ["action"] = "Immediate investigation required"
                });
        }
        catch (Exception ex)
        {
            _logger.LogError(ex, "Outbox health check failed");

            // Honor the failureStatus chosen at registration instead of hard-coding Unhealthy
            return new HealthCheckResult(
                context.Registration.FailureStatus,
                description: "Failed to check outbox health",
                exception: ex);
        }
    }
}

// Registration
services.AddHealthChecks()
    .AddCheck<OutboxHealthCheck>(
        name: "outbox",
        failureStatus: HealthStatus.Degraded,  // Used when the check itself fails; explicit Unhealthy results above are unaffected
        tags: new[] { "ready", "messaging", "outbox" });

  • Health Check Data Dictionary
    - Include diagnostic information
    - Avoid PII
    - Machine-readable format
    - Useful for debugging

  • Async Health Checks
    - All health checks should be async
    - Use cancellation tokens
    - Timeout handling
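
Timeout handling for an async check can be sketched with a linked cancellation token. This is an illustration rather than the ATP implementation; HealthCheckTimeouts and RunWithTimeoutAsync are hypothetical names, and the probe delegate stands in for any dependency call that accepts a token.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

public static class HealthCheckTimeouts
{
    // Runs a probe with its own deadline while still honoring the caller's token.
    public static async Task<HealthCheckResult> RunWithTimeoutAsync(
        Func<CancellationToken, Task<HealthCheckResult>> probe,
        TimeSpan timeout,
        CancellationToken cancellationToken = default)
    {
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
        cts.CancelAfter(timeout);

        try
        {
            return await probe(cts.Token);
        }
        catch (OperationCanceledException) when (!cancellationToken.IsCancellationRequested)
        {
            // Only the local deadline fired: report a timeout instead of
            // letting the cancellation bubble up as an unexplained failure.
            return HealthCheckResult.Unhealthy(
                $"Check timed out after {timeout.TotalSeconds:F1}s");
        }
    }
}
```

The timeout argument on AddCheck registrations can enforce a similar limit from outside the check; an explicit deadline inside the check is mainly useful when the result should carry a specific description or status.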

Code Examples: - Complete custom health check implementations - Registration patterns - Error handling

Diagrams: - Custom health check flow - Data dictionary structure

Deliverables: - Custom health check guide - Implementation templates - Registration patterns


Topic 8: ATP Custom Health Checks

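What will be covered:

  • ProjectionLagHealthCheck

public class ProjectionLagHealthCheck : IHealthCheck
{
    private readonly IProjectionWatermarkRepository _watermarkRepo;
    private readonly TimeSpan _lagThreshold;

    // The threshold is supplied by the caller, e.g. through AddTypeActivatedCheck
    public ProjectionLagHealthCheck(
        IProjectionWatermarkRepository watermarkRepo,
        TimeSpan lagThreshold)
    {
        _watermarkRepo = watermarkRepo;
        _lagThreshold = lagThreshold;
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        var watermarks = await _watermarkRepo.GetAllWatermarksAsync(cancellationToken);

        // Snapshot the clock once so every lag value uses the same "now"
        var now = DateTime.UtcNow;

        // Max() and First() throw on an empty sequence, so locate the laggard defensively
        var laggest = watermarks
            .OrderByDescending(w => now - w.LastEventTimestamp)
            .FirstOrDefault();

        if (laggest is null)
        {
            return HealthCheckResult.Healthy("No projection watermarks found");
        }

        var maxLag = now - laggest.LastEventTimestamp;

        if (maxLag < _lagThreshold)
        {
            return HealthCheckResult.Healthy(
                $"Projection lag: {maxLag.TotalSeconds:F1}s");
        }

        return HealthCheckResult.Degraded(
            $"Projection lag high: {maxLag.TotalSeconds:F1}s",
            data: new Dictionary<string, object>
            {
                ["lagSeconds"] = maxLag.TotalSeconds,
                ["threshold"] = _lagThreshold.TotalSeconds,
                ["laggestProjection"] = laggest.ProjectionName
            });
    }
}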

  • IdempotencyStoreHealthCheck
  • PolicyCacheHealthCheck
  • HashChainStateHealthCheck
  • ExportBandwidthHealthCheck
  • TenantQuotaHealthCheck

Code Examples: - All ATP custom health checks - ATP-specific validation logic

Deliverables: - ATP health check library - Implementation guide


CYCLE 5: Service-Specific Health Checks (~5,000 lines)

Topic 9: Gateway Service Health Checks

What will be covered:

  • Gateway Health Check Configuration

// Gateway health checks
services.AddHealthChecks()
    // Liveness (minimal)
    .AddCheck("self", () => HealthCheckResult.Healthy(), tags: new[] { "live" })

    // Readiness (dependencies)
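    .AddAzureKeyVault(
        keyVaultUri,
        credential,
        // "gateway-tls" is a placeholder certificate name; the check needs at least one item to verify
        setup: options => options.AddCertificate("gateway-tls"),
        name: "keyvault-certs",
        tags: new[] { "ready" })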

    .AddRedis(redisConnection, name: "rate-limiter-cache", tags: new[] { "ready" })

    .AddUrlGroup(
        new Uri("https://ingestion.atp.internal/health/ready"),
        name: "ingestion-backend",
        tags: new[] { "ready", "backend" })

    .AddUrlGroup(
        new Uri("https://query.atp.internal/health/ready"),
        name: "query-backend",
        tags: new[] { "ready", "backend" })

    .AddUrlGroup(
        new Uri("https://policy.atp.internal/health/ready"),
        name: "policy-backend",
        tags: new[] { "ready", "backend" })

    .AddCheck<GatewayRoutingHealthCheck>(
        name: "routing-config",
        tags: new[] { "startup", "configuration" });

  • Custom Gateway Checks
    - Routing configuration loaded
    - YARP clusters reachable
    - JWT validation keys available
    - Rate limiting operational

Code Examples: - Complete Gateway health checks - Backend reachability tests - Configuration validation

Diagrams: - Gateway health architecture - Dependency graph

Deliverables: - Gateway health check guide - Configuration templates


Topic 10: Ingestion Service Health Checks

What will be covered:

  • Ingestion Health Configuration

services.AddHealthChecks()
    // Live
    .AddCheck("self", () => HealthCheckResult.Healthy(), tags: new[] { "live" })

    // Ready - Critical dependencies
    .AddSqlServer(
        connectionString: config.GetConnectionString("AuditDb"),
        healthQuery: "SELECT 1",
        name: "audit-database",
        tags: new[] { "ready", "database" })

    .AddAzureServiceBusTopic(
        connectionString: config.GetConnectionString("ServiceBus"),
        topicName: "audit.appended.v1",
        name: "servicebus-publish",
        tags: new[] { "ready", "messaging" })

    .AddAzureBlobStorage(
        connectionString: config.GetConnectionString("BlobStorage"),
        containerName: "audit-worm",
        name: "blob-worm-storage",
        tags: new[] { "ready", "storage" })

    .AddUrlGroup(
        new Uri("https://policy.atp.internal/health/ready"),
        name: "policy-service",
        tags: new[] { "ready", "external" })

    .AddCheck<OutboxHealthCheck>(
        name: "outbox-backlog",
        failureStatus: HealthStatus.Degraded,
        tags: new[] { "ready", "outbox" })

    // Startup
    .AddCheck<DatabaseMigrationHealthCheck>(
        name: "database-migrations",
        tags: new[] { "startup" })

    .AddCheck<StartupWarmupHealthCheck>(
        name: "startup-warmup",
        tags: new[] { "startup" });

  • Ingestion-Specific Checks
    - Outbox backlog within limits
    - Idempotency store accessible
    - Schema validation ready
    - Tenant quota store loaded
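
The StartupWarmupHealthCheck registered above is not defined in this plan. One common shape for it is a check that reads a flag set by whatever performs the warmup. The sketch below assumes that pattern; StartupState is a hypothetical helper and the body is not ATP's actual implementation.

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Shared flag flipped by the component that performs warmup
// (for example a hosted service that preloads caches).
public sealed class StartupState
{
    private volatile bool _completed;

    public bool IsCompleted => _completed;

    public void MarkCompleted() => _completed = true;
}

// Reports Unhealthy until warmup has finished, which keeps the startup
// probe failing and therefore holds off liveness and readiness probes.
public sealed class StartupWarmupHealthCheck : IHealthCheck
{
    private readonly StartupState _state;

    public StartupWarmupHealthCheck(StartupState state) => _state = state;

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default) =>
        Task.FromResult(_state.IsCompleted
            ? HealthCheckResult.Healthy("Warmup complete")
            : HealthCheckResult.Unhealthy("Warmup still running"));
}
```

StartupState would be registered as a singleton so that the warmup component and the health check observe the same instance.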

Code Examples: - Ingestion health checks (complete) - All 8 ATP services health configurations

Diagrams: - Service health dependencies - Check hierarchy

Deliverables: - Service health check library (8 services) - Configuration guide


CYCLE 6: Dependency Health Checks (~4,000 lines)

Topic 11: Database Health Checks

What will be covered:

  • SQL Server Health Check

// Basic connectivity
.AddSqlServer(
    connectionString: connectionString,
    healthQuery: "SELECT 1",
    name: "database-connectivity")

// Connection pool health
.AddCheck<SqlConnectionPoolHealthCheck>(
    name: "database-connection-pool")

// Hypothetical abstraction: SqlConnection has no API that reports pool occupancy.
// An implementation could track it from the Microsoft.Data.SqlClient.EventSource
// event counters using an EventListener.
public interface ISqlPoolMetrics
{
    long ActiveConnections { get; }
}

public class SqlConnectionPoolHealthCheck : IHealthCheck
{
    private readonly ISqlPoolMetrics _poolMetrics;
    private readonly int _maxPoolSize;

    public SqlConnectionPoolHealthCheck(ISqlPoolMetrics poolMetrics, IConfiguration configuration)
    {
        _poolMetrics = poolMetrics;

        // Max Pool Size is a connection string setting (SqlClient defaults to 100)
        var builder = new SqlConnectionStringBuilder(configuration.GetConnectionString("AuditDb"));
        _maxPoolSize = builder.MaxPoolSize;
    }

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        var activeConnections = _poolMetrics.ActiveConnections;
        var utilization = (double)activeConnections / _maxPoolSize;

        if (utilization < 0.8)
        {
            return Task.FromResult(HealthCheckResult.Healthy(
                $"Connection pool: {activeConnections}/{_maxPoolSize} ({utilization:P0})"));
        }

        if (utilization < 0.95)
        {
            return Task.FromResult(HealthCheckResult.Degraded(
                $"Connection pool high: {activeConnections}/{_maxPoolSize} ({utilization:P0})"));
        }

        return Task.FromResult(HealthCheckResult.Unhealthy(
            $"Connection pool exhausted: {activeConnections}/{_maxPoolSize}"));
    }
}

// Query performance check
.AddCheck<DatabasePerformanceHealthCheck>(
    name: "database-performance")

  • Azure Cosmos DB Health Check
  • MongoDB Health Check
  • NHibernate Session Factory Health Check

Code Examples: - Database health checks (all types) - Connection pool monitoring - Performance validation

Deliverables: - Database health check library


Topic 12: Messaging Health Checks

What will be covered:

  • Azure Service Bus Health Check

// Topic publish capability
.AddAzureServiceBusTopic(
    connectionString: serviceBusConnection,
    topicName: "audit.appended.v1",
    name: "servicebus-topic-publish")

// Subscription receive capability
.AddAzureServiceBusSubscription(
    connectionString: serviceBusConnection,
    topicName: "audit.appended.v1",
    subscriptionName: "projection-sub",
    name: "servicebus-subscription")

// Custom: Consumer lag check
public class ConsumerLagHealthCheck : IHealthCheck
{
    private readonly ServiceBusClient _serviceBusClient;
    private readonly string _topicName;
    private readonly string _subscriptionName;
    private readonly TimeSpan _lagThreshold;

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
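        // Dispose the receiver after each check so AMQP links do not accumulate
        await using var receiver = _serviceBusClient
            .CreateReceiver(_topicName, _subscriptionName);

        // Peek (without locking) the next message to estimate lag
        var peekedMessage = await receiver.PeekMessageAsync(
            cancellationToken: cancellationToken);

        if (peekedMessage == null)
        {
            return HealthCheckResult.Healthy("No pending messages");
        }

        // EnqueuedTime is a DateTimeOffset, so compare against DateTimeOffset.UtcNow
        var lag = DateTimeOffset.UtcNow - peekedMessage.EnqueuedTime;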

        if (lag < _lagThreshold)
        {
            return HealthCheckResult.Healthy($"Lag: {lag.TotalSeconds:F1}s");
        }

        return HealthCheckResult.Degraded($"High lag: {lag.TotalSeconds:F1}s");
    }
}

  • MassTransit Health Check
  • Outbox Relay Health Check
  • DLQ Depth Health Check

Code Examples: - Messaging health checks - Lag monitoring - DLQ validation

Deliverables: - Messaging health check library


CYCLE 7: Health Check UI & Monitoring (~3,500 lines)

Topic 13: Health Check UI

What will be covered: - ASP.NET Core Health Check UI

services.AddHealthChecksUI(setup =>
{
    setup.SetEvaluationTimeInSeconds(30);
    setup.MaximumHistoryEntriesPerEndpoint(50);
    setup.AddHealthCheckEndpoint("ATP Ingestion", "https://ingestion.atp.internal/health");
    setup.AddHealthCheckEndpoint("ATP Query", "https://query.atp.internal/health");
    setup.AddHealthCheckEndpoint("ATP Policy", "https://policy.atp.internal/health");
})
.AddInMemoryStorage();  // Or .AddSqlServerStorage() for persistence

// Map UI endpoint
app.UseEndpoints(endpoints =>
{
    endpoints.MapHealthChecksUI(setup =>
    {
        setup.UIPath = "/healthchecks-ui";
        setup.ApiPath = "/healthchecks-api";
    })
    .RequireAuthorization("AdminOnly");
});

  • Health Check UI Features
    • Real-time health status
    • Historical health data
    • Webhook notifications
    • Failure tracking

Code Examples: - Health Check UI setup - Dashboard configuration - Webhook integration

Diagrams: - UI architecture - Data flow

Deliverables: - UI setup guide - Configuration reference


Topic 14: Prometheus Metrics from Health Checks

What will be covered: - Health Check Metrics Exporter

public class HealthCheckMetricsPublisher : IHealthCheckPublisher
{
    // Requires: System.Collections.Concurrent, System.Diagnostics.Metrics, System.Linq
    // Latest status per check; the observable gauge reads it on each collection.
    private readonly ConcurrentDictionary<string, long> _latestStatus = new();
    private readonly Counter<long> _healthCheckExecutions;

    public HealthCheckMetricsPublisher(IMeterFactory meterFactory)
    {
        var meter = meterFactory.Create("ATP.HealthChecks");

        _healthCheckExecutions = meter.CreateCounter<long>(
            "health_check_executions_total",
            description: "Total health check executions");

        // Observable gauges report values through a callback rather than a setter.
        meter.CreateObservableGauge<long>(
            "health_check_status",
            () => _latestStatus.Select(kv => new Measurement<long>(
                kv.Value, new KeyValuePair<string, object?>("name", kv.Key))),
            description: "Health check status (0=unhealthy, 1=degraded, 2=healthy)");
    }

    public Task PublishAsync(HealthReport report, CancellationToken cancellationToken)
    {
        foreach (var entry in report.Entries)
        {
            long statusValue = entry.Value.Status switch
            {
                HealthStatus.Healthy => 2,
                HealthStatus.Degraded => 1,
                HealthStatus.Unhealthy => 0,
                _ => -1
            };

            // Record the latest status for the gauge, then count the execution.
            _latestStatus[entry.Key] = statusValue;

            _healthCheckExecutions.Add(1,
                new KeyValuePair<string, object?>("name", entry.Key),
                new KeyValuePair<string, object?>("status", entry.Value.Status.ToString()));
        }

        return Task.CompletedTask;
    }
}

// Register publisher
services.AddSingleton<IHealthCheckPublisher, HealthCheckMetricsPublisher>();

Code Examples: - Metrics publisher implementation - Prometheus integration - Grafana dashboards

Deliverables: - Metrics publisher guide - Dashboard templates


CYCLE 8: Health Check Response Formats (~3,000 lines)

Topic 15: Response Format Standards

What will be covered: - Healthy Response

GET /health/ready
HTTP/1.1 200 OK
Content-Type: application/json
Cache-Control: no-cache

{
  "status": "Healthy",
  "totalDuration": "00:00:00.234",
  "entries": {
    "database": {
      "status": "Healthy",
      "description": "Azure SQL connection pool: 5/100 active",
      "duration": "00:00:00.050",
      "data": {
        "activeConnections": 5,
        "maxPoolSize": 100,
        "utilization": 0.05
      }
    },
    "servicebus": {
      "status": "Healthy",
      "description": "Azure Service Bus connected",
      "duration": "00:00:00.120"
    },
    "cache": {
      "status": "Healthy",
      "description": "Redis cache: 1024 keys, 512MB used",
      "duration": "00:00:00.015",
      "data": {
        "keys": 1024,
        "memoryUsedMb": 512,
        "hitRate": 0.87
      }
    },
    "outbox": {
      "status": "Healthy",
      "description": "Outbox: 15 pending events",
      "duration": "00:00:00.035",
      "data": {
        "pendingEvents": 15,
        "threshold": 1000
      }
    }
  }
}

  • Degraded Response

    GET /health/ready
    HTTP/1.1 200 OK   (still 200; the body reports "Degraded")
    
    {
      "status": "Degraded",
      "totalDuration": "00:00:00.567",
      "entries": {
        "database": {
          "status": "Healthy",
          "duration": "00:00:00.045"
        },
        "cache": {
          "status": "Degraded",
          "description": "Redis cache: High latency (150ms avg)",
          "duration": "00:00:00.456",
          "data": {
            "latencyMs": 150,
            "threshold": 50,
            "recommendation": "Check Redis server performance"
          }
        },
        "outbox": {
          "status": "Degraded",
          "description": "Outbox backlog elevated: 800 events",
          "data": {
            "pendingEvents": 800,
            "threshold": 1000,
            "recommendation": "Scale outbox relay workers"
          }
        }
      }
    }
    

  • Unhealthy Response

    GET /health/ready
    HTTP/1.1 503 Service Unavailable
    
    {
      "status": "Unhealthy",
      "totalDuration": "00:00:03.000",
      "entries": {
        "database": {
          "status": "Unhealthy",
          "description": "Azure SQL connection failed: Timeout",
          "duration": "00:00:03.000",
          "exception": "System.Data.SqlClient.SqlException: Timeout expired...",
          "data": {
            "error": "Connection timeout",
            "action": "Check database availability and connection string"
          }
        },
        "cache": {
          "status": "Healthy",
          "duration": "00:00:00.012"
        }
      }
    }
    

Code Examples: - Response format examples (all states) - JSON schema - Response writer customization

Diagrams: - Response format structure - Status code mapping

Deliverables: - Response format specification - Schema definitions - Writer implementations


Topic 16: Minimal vs. Detailed Responses

What will be covered: - Minimal Response (Public/Load Balancer) - Detailed Response (Internal/Debugging) - Authenticated Access (Detailed info) - PII Redaction in responses

Code Examples: - Response filtering - Authentication integration

Deliverables: - Response strategy guide
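
The split between minimal and detailed responses could be sketched with two response writers, as below. This is an illustration under assumptions rather than ATP's actual implementation: the class name `HealthResponseWriters` and both method names are hypothetical, and the choice of which fields count as safe is only an example. Descriptions would still need review for PII.

```csharp
using System.Linq;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical helper: two writers usable as HealthCheckOptions.ResponseWriter.
public static class HealthResponseWriters
{
    // Minimal: overall status only, suitable for load balancers and public callers.
    public static Task WriteMinimal(HttpContext context, HealthReport report) =>
        context.Response.WriteAsJsonAsync(new { status = report.Status.ToString() });

    // Detailed: per-check status and timing. Exception text and the free-form
    // data dictionary are omitted so secrets or PII cannot leak through them.
    public static Task WriteDetailed(HttpContext context, HealthReport report) =>
        context.Response.WriteAsJsonAsync(new
        {
            status = report.Status.ToString(),
            totalDuration = report.TotalDuration,
            entries = report.Entries.ToDictionary(
                e => e.Key,
                e => new
                {
                    status = e.Value.Status.ToString(),
                    description = e.Value.Description,
                    duration = e.Value.Duration
                })
        });
}
```

Either method can be assigned directly, for example `new HealthCheckOptions { ResponseWriter = HealthResponseWriters.WriteMinimal }` on the public endpoint, with the detailed writer reserved for an endpoint that requires authorization.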


CYCLE 9: Degraded State Handling (~3,000 lines)

Topic 17: Degraded vs. Unhealthy

What will be covered: - When to Return Degraded

Degraded (Service still functional, but impaired):
- Non-critical dependency unavailable
- Performance below target (but above minimum)
- Backlog elevated (but not critical)
- Cache miss rate high (but queries still work)
- Replica lag high (but within tolerance)

Examples:
- Redis cache down, but database queries work (slower)
- Projection lag 15s (target: 5s, critical: 30s)
- Outbox backlog 800 events (threshold: 1000)
- Search index unavailable (fallback to SQL)

Response:
- HTTP 200 OK (keep in load balancer)
- Body: "status": "Degraded"
- Kubernetes: Pod stays in service
- Monitoring: Alert (warning severity)

---

Unhealthy (Service cannot fulfill requests):
- Critical dependency unavailable
- Cannot process requests safely
- Data integrity at risk
- Compliance violation

Examples:
- Database connection failed (cannot persist)
- Service Bus connection failed (cannot publish events)
- KMS unavailable (cannot sign/verify)
- All backend services down (Gateway)

Response:
- HTTP 503 Service Unavailable
- Body: "status": "Unhealthy"
- Kubernetes: Pod removed from service
- Monitoring: Alert (critical severity)

  • Graceful Degradation Patterns

    // Query service with optional search
    public class QueryServiceHealthCheck : IHealthCheck
    {
        private readonly IDbConnection _database;
        private readonly ISearchClient _searchClient;
        private readonly ILogger _logger;
    
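        public async Task<HealthCheckResult> CheckHealthAsync(
            HealthCheckContext context,
            CancellationToken cancellationToken = default)
        {
            // Critical: Database must be healthy
            // (CheckDatabaseAsync and CheckSearchAsync are private probe helpers, omitted here)
            var dbHealthy = await CheckDatabaseAsync(cancellationToken);
            if (!dbHealthy)
            {
                return HealthCheckResult.Unhealthy("Database unavailable");
            }
    
            // Optional: Search enhances experience but not required
            var searchHealthy = await CheckSearchAsync(cancellationToken);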
            if (!searchHealthy)
            {
                _logger.LogWarning("Search unavailable, degraded mode");
                return HealthCheckResult.Degraded(
                    "Database healthy, search unavailable (fallback to SQL)");
            }
    
            return HealthCheckResult.Healthy("All dependencies healthy");
        }
    }
    

  • Degraded Mode Indicators
    • Response headers: X-Service-Mode: Degraded
    • Metrics: service_degraded{reason="cache-miss"} 1
    • Logs: Warning-level degradation notices

Code Examples: - Degraded vs. unhealthy logic - Graceful degradation patterns - Mode indicators

Diagrams: - State transition diagram - Degradation flow

Deliverables: - Degradation guide - Pattern library - Mode handling procedures


Topic 18: Dependency Criticality Matrix

What will be covered: - Critical vs. Optional Dependencies

| Dependency | Critical? | Failure Status | Reason |
|------------|-----------|----------------|--------|
| Azure SQL (Ingestion) | ✅ Yes | Unhealthy | Cannot persist audit records |
| Service Bus (Ingestion) | ✅ Yes | Unhealthy | Cannot publish events |
| Redis Cache (Query) | ❌ No | Degraded | Slower queries, but functional |
| Search Index (Query) | ❌ No | Degraded | Fallback to SQL queries |
| Policy Service (Ingestion) | ✅ Yes | Unhealthy | Cannot classify/retain |
| KMS (Integrity) | ✅ Yes | Unhealthy | Cannot sign/verify |
| Blob Storage WORM (Ingestion) | ✅ Yes | Unhealthy | Cannot store evidence |

Code Examples: - Criticality configuration - Failure status mapping

Deliverables: - Dependency matrix - Configuration guide
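
One way to turn the criticality matrix into configuration is a small lookup that yields the `failureStatus` for each registration. The sketch below is illustrative only: the class name, the dependency keys, and the fail-safe default for unknown dependencies are assumptions, not part of ATP.

```csharp
using System.Collections.Generic;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical lookup that maps the criticality matrix to failureStatus values.
public static class DependencyCriticality
{
    // true  = critical (a failure makes the service Unhealthy)
    // false = optional (a failure only degrades it)
    // Keys are illustrative registration names.
    private static readonly IReadOnlyDictionary<string, bool> Critical =
        new Dictionary<string, bool>
        {
            ["azure-sql"] = true,
            ["service-bus"] = true,
            ["redis-cache"] = false,
            ["search-index"] = false,
            ["policy-service"] = true,
            ["kms"] = true,
            ["blob-worm"] = true,
        };

    // Unknown dependencies default to critical so a missing entry fails safe.
    public static HealthStatus FailureStatusFor(string dependency) =>
        Critical.TryGetValue(dependency, out var isCritical) && !isCritical
            ? HealthStatus.Degraded
            : HealthStatus.Unhealthy;
}
```

The result can be passed as the `failureStatus` argument of `AddCheck`. Note that the framework applies `failureStatus` when a check throws or times out; a check that builds its own failing result would need to read `context.Registration.FailureStatus` to honor it.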


CYCLE 10: Startup Warmup & Grace Periods (~3,000 lines)

Topic 19: Startup Warmup Pattern

What will be covered: - StartupWarmupGate

public class StartupWarmupGate
{
    private readonly TimeSpan _warmupDuration;
    private readonly DateTime _startTime;
    private volatile bool _isReady;  // written by startup code, read by probe request threads

    public StartupWarmupGate(IConfiguration configuration)
    {
        _warmupDuration = TimeSpan.FromSeconds(
            configuration.GetValue<int>("Microservice:StartupWarmupSeconds", 30));
        _startTime = DateTime.UtcNow;
        _isReady = false;
    }

    public bool IsReady
    {
        get
        {
            if (_isReady) return true;

            if (DateTime.UtcNow - _startTime >= _warmupDuration)
            {
                _isReady = true;
                return true;
            }

            return false;
        }
    }

    public void MarkReady()
    {
        _isReady = true;
    }
}

// Health check
public class StartupWarmupHealthCheck : IHealthCheck
{
    private readonly StartupWarmupGate _gate;

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        return Task.FromResult(
            _gate.IsReady
                ? HealthCheckResult.Healthy("Warmup complete")
                : HealthCheckResult.Unhealthy("Warming up..."));
    }
}

// Usage in startup
public static async Task Main(string[] args)
{
    var host = CreateHostBuilder(args).Build();

    // Run migrations, warm cache, etc.
    await host.RunMigrationsAsync();
    await host.WarmCacheAsync();

    // Mark ready
    host.Services.GetRequiredService<StartupWarmupGate>().MarkReady();

    await host.RunAsync();
}

  • Initialization Tasks
    • Database migrations (FluentMigrator)
    • Configuration validation
    • Cache warming (policy cache, tenant metadata)
    • External service connectivity verification

Code Examples: - Warmup gate implementation - Initialization task orchestration - Startup optimization

Deliverables: - Startup warmup guide - Task coordination - Optimization techniques


Topic 20: Startup Probe Configuration

What will be covered: - Kubernetes Startup Probe - Failure Threshold Calculation - Startup Time Monitoring - Slow Startup Troubleshooting

Code Examples: - Probe configuration - Timing analysis

Deliverables: - Startup probe guide - Timing recommendations
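
The failure threshold calculation can be sketched as plain arithmetic. Kubernetes gives a container roughly `failureThreshold × periodSeconds` to pass its startup probe, so the threshold is the worst-case startup time divided by the probe period, rounded up. The helper below is a hypothetical illustration that ignores `initialDelaySeconds` and adds no safety margin; the class and method names are assumptions.

```csharp
using System;

// Hypothetical helper for sizing a Kubernetes startup probe.
public static class StartupProbeSizing
{
    // Smallest failureThreshold such that failureThreshold * periodSeconds
    // covers the worst-case startup time.
    public static int FailureThreshold(int maxStartupSeconds, int periodSeconds)
    {
        if (maxStartupSeconds < 0)
            throw new ArgumentOutOfRangeException(nameof(maxStartupSeconds));
        if (periodSeconds <= 0)
            throw new ArgumentOutOfRangeException(nameof(periodSeconds));

        // Integer ceiling division; at least one attempt is always allowed.
        return Math.Max(1, (maxStartupSeconds + periodSeconds - 1) / periodSeconds);
    }
}
```

For example, a service whose slowest observed startup is 120 seconds, probed every 5 seconds, needs a `failureThreshold` of 24. In practice some headroom on top of the measured worst case is usually added.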


CYCLE 11: Health Check Caching (~2,500 lines)

Topic 21: Response Caching Strategy

What will be covered: - Why Cache Health Check Responses?

Problem:
- Health checks run frequently (every 5-10s by K8s)
- Each check queries dependencies (DB, cache, bus)
- High load on dependencies from health checks alone
- Example: 100 pods × 2 checks/sec = 200 dependency queries/sec

Solution:
- Cache health check results briefly (5-10s)
- Reduce dependency load by 90%+
- Still fresh enough for K8s probes

Trade-off:
- Slightly stale health status (max 10s old)
- Acceptable for most use cases
- Critical checks can bypass cache

  • Implementing Health Check Caching

    public class CachedHealthCheck : IHealthCheck
    {
        private readonly IHealthCheck _innerCheck;
        private readonly IMemoryCache _cache;
        private readonly TimeSpan _cacheDuration;
        private readonly string _cacheKey;
    
        public CachedHealthCheck(
            IHealthCheck innerCheck,
            IMemoryCache cache,
            string name,
            TimeSpan cacheDuration)
        {
            _innerCheck = innerCheck;
            _cache = cache;
            _cacheKey = $"HealthCheck:{name}";
            _cacheDuration = cacheDuration;
        }
    
        public async Task<HealthCheckResult> CheckHealthAsync(
            HealthCheckContext context,
            CancellationToken cancellationToken = default)
        {
            // Try cache first
            if (_cache.TryGetValue(_cacheKey, out HealthCheckResult cachedResult))
            {
                return cachedResult;
            }
    
            // Cache miss, run actual check
            var result = await _innerCheck.CheckHealthAsync(context, cancellationToken);
    
            // Cache result
            _cache.Set(_cacheKey, result, _cacheDuration);
    
            return result;
        }
    }
    
    // Registration (AddCheck takes the registration name first, then the instance)
    services.AddHealthChecks()
        .AddCheck(
            "database",
            new CachedHealthCheck(
                new SqlServerHealthCheck(connectionString),
                cache,
                name: "database",
                cacheDuration: TimeSpan.FromSeconds(10)));
    

  • Conditional Caching

  • Cache only Healthy results (not failures)
  • Shorter TTL for Degraded (5s)
  • No caching for Unhealthy (immediate detection)

  • Cache Invalidation

  • Automatic TTL expiration
  • Manual invalidation on config changes
  • Invalidation on deployment

Code Examples: - Cached health check implementation - Conditional caching logic - Cache invalidation

Diagrams: - Caching flow - TTL strategy

Deliverables: - Caching implementation - Strategy guide - Invalidation procedures
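
The conditional caching rules listed for this topic (cache Healthy results, shorter TTL for Degraded, no caching for Unhealthy) can be expressed as a small policy function. This is a sketch under assumptions: the class name is hypothetical, and the 10 s Healthy TTL is taken from the upper end of the 5-10 s range mentioned earlier.

```csharp
using System;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical TTL policy for cached health check results.
public static class HealthCacheTtlPolicy
{
    // Returns how long a result may be cached, or null when it must not be cached.
    public static TimeSpan? For(HealthStatus status) => status switch
    {
        HealthStatus.Healthy => TimeSpan.FromSeconds(10),
        HealthStatus.Degraded => TimeSpan.FromSeconds(5),
        _ => null, // Unhealthy (or anything unexpected) is always re-evaluated
    };
}
```

A caching wrapper could then store a result only when a TTL is returned, for example `if (HealthCacheTtlPolicy.For(result.Status) is { } ttl) cache.Set(key, result, ttl);`, so failures are detected on the very next probe.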


Topic 22: Health Check Performance

What will be covered: - Performance Optimization - Parallel Execution - Timeout Configuration - Circuit Breakers for Health Checks

Code Examples: - Performance optimization - Parallel check execution

Deliverables: - Performance guide - Optimization techniques


CYCLE 12: Multi-Tenant Health Isolation (~3,000 lines)

Topic 23: Tenant-Aware Health Checks

What will be covered: - Per-Tenant Health Indicators

public class TenantQuotaHealthCheck : IHealthCheck
{
    private readonly ITenantQuotaService _quotaService;

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        // Check if any tenant exceeding quota critically
        var tenantsOverQuota = await _quotaService.GetTenantsExceedingQuotaAsync(
            threshold: 0.95);  // 95% of quota

        if (tenantsOverQuota.Count == 0)
        {
            return HealthCheckResult.Healthy("All tenants within quota");
        }

        var criticalTenants = tenantsOverQuota.Count(t => t.Utilization > 1.0);

        if (criticalTenants == 0)
        {
            return HealthCheckResult.Degraded(
                $"{tenantsOverQuota.Count} tenants approaching quota limits");
        }

        return HealthCheckResult.Unhealthy(
            $"{criticalTenants} tenants exceeded quota (ingestion throttled)");
    }
}

  • Tenant Isolation in Health Responses
  • Aggregate tenant health (no individual tenant details in public response)
  • Detailed tenant health (admin endpoint, authenticated)
  • Tenant quota monitoring
  • Per-tenant SLO tracking

Code Examples: - Tenant-aware health checks - Quota monitoring - Aggregation logic

Diagrams: - Tenant health aggregation - Quota monitoring flow

Deliverables: - Tenant health guide - Quota monitoring - Isolation procedures


Topic 24: Regional Health Aggregation

What will be covered: - Cross-Region Health - Multi-Cluster Health - Global Health Dashboard

Code Examples: - Regional aggregation

Deliverables: - Regional health guide
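
A simple starting point for regional aggregation is a worst-of rule: the global status equals the least healthy regional status. The sketch below assumes that policy, which the final guide may refine (for example by weighting regions); the class name is hypothetical, and treating an empty input as Unhealthy is a design assumption.

```csharp
using System.Collections.Generic;
using System.Linq;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical worst-of aggregation across regions.
public static class RegionalHealth
{
    // HealthStatus is ordered Unhealthy (0) < Degraded (1) < Healthy (2),
    // so the minimum value is the worst regional status.
    // With no regional data at all, health cannot be confirmed: report Unhealthy.
    public static HealthStatus Aggregate(IReadOnlyDictionary<string, HealthStatus> regions) =>
        regions.Count == 0 ? HealthStatus.Unhealthy : regions.Values.Min();
}
```

With one region Healthy and another Degraded, the aggregate is Degraded; a single Unhealthy region makes the aggregate Unhealthy.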


CYCLE 13: Health Check Testing (~2,500 lines)

Topic 25: Testing Health Checks

What will be covered: - Unit Testing Health Checks

[TestClass]
public class DatabaseHealthCheckTests
{
    [TestMethod]
    public async Task Should_ReturnHealthy_WhenDatabaseConnected()
    {
        // Arrange
        var healthCheck = new SqlServerHealthCheck(validConnectionString);

        // Act
        var result = await healthCheck.CheckHealthAsync(new HealthCheckContext());

        // Assert
        Assert.AreEqual(HealthStatus.Healthy, result.Status);
        Assert.IsTrue(result.Description.Contains("connected"));
    }

    [TestMethod]
    public async Task Should_ReturnUnhealthy_WhenDatabaseUnavailable()
    {
        // Arrange
        var healthCheck = new SqlServerHealthCheck(invalidConnectionString);

        // Act
        var result = await healthCheck.CheckHealthAsync(new HealthCheckContext());

        // Assert
        Assert.AreEqual(HealthStatus.Unhealthy, result.Status);
        Assert.IsNotNull(result.Exception);
    }

    [TestMethod]
    public async Task Should_ReturnDegraded_WhenConnectionPoolNearLimit()
    {
        // Arrange
        var healthCheck = new SqlConnectionPoolHealthCheck(connectionString);
        // Simulate high connection usage (assumes Max Pool Size = 100)
        var connections = CreateManyConnections(90);  // 90% of pool: inside the 80-95% Degraded band

        // Act
        var result = await healthCheck.CheckHealthAsync(new HealthCheckContext());

        // Assert
        Assert.AreEqual(HealthStatus.Degraded, result.Status);

        // Cleanup
        CloseConnections(connections);
    }
}

  • Integration Testing

    [TestMethod]
    public async Task HealthEndpoint_Should_ReturnHealthy_WhenAllDependenciesUp()
    {
        // Arrange
        var client = _testServer.CreateClient();
    
        // Act
        var response = await client.GetAsync("/health/ready");
    
        // Assert
        Assert.AreEqual(HttpStatusCode.OK, response.StatusCode);
    
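        var content = await response.Content.ReadAsStringAsync();
    
        // HealthReport is not designed for JSON deserialization, so inspect the payload directly
        using var doc = JsonDocument.Parse(content);
        var root = doc.RootElement;
    
        Assert.AreEqual("Healthy", root.GetProperty("status").GetString());
        Assert.IsTrue(root.GetProperty("entries").EnumerateObject().All(e =>
            e.Value.GetProperty("status").GetString() == "Healthy"));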
    }
    

  • Acceptance Testing (Reqnroll)

    Feature: Health Check Endpoints
      As an operator
      I want to monitor service health
      So that I can detect and resolve issues quickly
    
    Scenario: Liveness check returns healthy when service is running
      Given the Ingestion service is running
      When I request "/health/live"
      Then the response status code should be 200
      And the response body should contain "Healthy"
    
    Scenario: Readiness check returns unhealthy when database is down
      Given the Ingestion service is running
      And the database is unavailable
      When I request "/health/ready"
      Then the response status code should be 503
      And the response body should contain "Unhealthy"
      And the database check should show "Unhealthy"
    
    Scenario: Readiness check returns degraded when cache is slow
      Given the Ingestion service is running
      And the Redis cache has high latency
      When I request "/health/ready"
      Then the response status code should be 200
      And the response body should contain "Degraded"
      And the cache check should show "Degraded"
    

Code Examples: - Complete test suites (unit, integration, acceptance) - Test helpers and mocks - CI/CD integration

Diagrams: - Test architecture - Test coverage

Deliverables: - Testing guide - Test templates - CI/CD integration


Topic 26: Chaos Testing for Health Checks

What will be covered: - Fault Injection - Dependency Failure Simulation - Recovery Validation - Probe Behavior Under Load

Code Examples: - Chaos scenarios - Validation procedures

Deliverables: - Chaos testing guide
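
Dependency failure simulation does not always require breaking the real dependency. One option, sketched here as an assumption rather than the planned ATP approach, is a decorator that reports a failure while a test-controlled switch is on. The class name and the `Func<bool>` switch are hypothetical.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical decorator for chaos tests: while the switch is on, the wrapped
// check is bypassed and a failure is reported instead.
public sealed class FaultInjectingHealthCheck : IHealthCheck
{
    private readonly IHealthCheck _inner;
    private readonly Func<bool> _faultEnabled;

    public FaultInjectingHealthCheck(IHealthCheck inner, Func<bool> faultEnabled)
    {
        _inner = inner ?? throw new ArgumentNullException(nameof(inner));
        _faultEnabled = faultEnabled ?? throw new ArgumentNullException(nameof(faultEnabled));
    }

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        if (_faultEnabled())
        {
            // Simulated outage: the real dependency is never contacted.
            return Task.FromResult(
                HealthCheckResult.Unhealthy("Injected fault (chaos test)"));
        }

        return _inner.CheckHealthAsync(context, cancellationToken);
    }
}
```

Turning the switch off again lets a test confirm recovery: the next probe reaches the real check and the pod should return to service. This only exercises probe and routing behavior; it does not replace tests that break the actual dependency.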


CYCLE 14: Load Balancer Integration (~3,000 lines)

Topic 27: Azure Load Balancer Health Probes

What will be covered: - Azure LB Health Probe Configuration

# Azure Load Balancer (for AKS services)
kind: Service
apiVersion: v1
metadata:
  name: ingestion-svc
  namespace: atp-ingest-ns
  annotations:
    # Health probe configuration
    service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path: "/health/ready"
    service.beta.kubernetes.io/azure-load-balancer-health-probe-protocol: "http"
    service.beta.kubernetes.io/azure-load-balancer-health-probe-interval: "10"
    service.beta.kubernetes.io/azure-load-balancer-health-probe-num-of-probe: "3"
spec:
  type: LoadBalancer
  selector:
    app: ingestion
  ports:
  - port: 80
    targetPort: 8080

  • Azure Front Door Health Probes
  • API Management Health Checks
  • Application Gateway Health Probes

Code Examples: - LB probe configurations - Multi-layer health checks

Deliverables: - Load balancer integration guide


Topic 28: Service Mesh Health Checks

What will be covered: - Istio Health Checks - Linkerd Health Probes - Envoy Health Endpoints

Code Examples: - Mesh integration

Deliverables: - Service mesh health guide


CYCLE 15: Health Check Troubleshooting (~3,500 lines)

Topic 29: Common Health Check Issues

What will be covered: - Problem: Readiness Check Always Failing

Symptoms:
- Pod shows 0/1 READY
- Pod never receives traffic
- Logs show repeated health check failures

Diagnosis:
# Check pod status
kubectl get pods -n atp-ingest-ns
# NAME                         READY   STATUS    RESTARTS   AGE
# ingestion-7d8f6c9b4-abc123   0/1     Running   0          5m

# Describe pod
kubectl describe pod ingestion-7d8f6c9b4-abc123 -n atp-ingest-ns
# Events:
# Readiness probe failed: HTTP probe failed with statuscode: 503

# Check health endpoint directly
kubectl port-forward ingestion-7d8f6c9b4-abc123 8080:8080 -n atp-ingest-ns
curl http://localhost:8080/health/ready

# Response shows which dependency failed:
{
  "status": "Unhealthy",
  "entries": {
    "database": {
      "status": "Unhealthy",
      "description": "Connection timeout"
    }
  }
}

Common Causes:
1. Database not accessible (network policy, firewall)
2. Wrong connection string (secret not mounted)
3. Dependency service down
4. Health check timeout too short
5. Startup not complete (use startup probe)

Solutions:
# 1. Check database connectivity
kubectl run -it --rm debug --image=mcr.microsoft.com/mssql-tools \
    --restart=Never -n atp-ingest-ns -- /bin/bash
sqlcmd -S <server> -U <user> -P <password> -Q "SELECT 1"

# 2. Check secrets mounted
kubectl exec ingestion-7d8f6c9b4-abc123 -n atp-ingest-ns -- \
    ls -la /mnt/secrets

# 3. Check network policy
kubectl get networkpolicies -n atp-ingest-ns

# 4. Increase timeout
kubectl edit deployment ingestion -n atp-ingest-ns
# Change: timeoutSeconds: 3 → timeoutSeconds: 10

# 5. Add startup probe
# (see startup probe configuration)

  • Problem: CrashLoopBackOff Due to Liveness Failure

    Symptoms:
    - Pod continuously restarting
    - STATUS shows CrashLoopBackOff
    - Liveness probe failing
    
    Diagnosis:
    kubectl logs ingestion-7d8f6c9b4-abc123 -n atp-ingest-ns --previous
    
    Common Causes:
    1. Application deadlock
    2. Liveness check too aggressive (short timeout/period)
    3. Application startup slow (no startup probe)
    4. Liveness check has bug (throws exception)
    
    Solutions:
    # 1. Add startup probe (delay liveness)
    # 2. Increase liveness timeout/threshold
    # 3. Fix application deadlock
    # 4. Review liveness check code
    

  • Problem: Flapping Readiness (Ready → Not Ready → Ready)

    Symptoms:
    - Pod oscillates between ready/not ready
    - Traffic intermittent
    - Logs show dependency timeouts
    
    Common Causes:
    1. Dependency intermittently slow
    2. Health check timeout too strict
    3. Network issues
    4. Health check not idempotent (side effects)
    
    Solutions:
    # 1. Increase successThreshold (require 2-3 consecutive successes)
    # 2. Increase timeout
    # 3. Add circuit breaker to dependency client
    # 4. Review health check for side effects
    

Code Examples: - Troubleshooting procedures (10+ scenarios) - Diagnostic commands - Resolution scripts

Diagrams: - Troubleshooting decision tree - Common failure patterns

Deliverables: - Troubleshooting guide - Diagnostic procedures - Fix library


Topic 30: Health Check Debugging

What will be covered: - Debug Endpoints

// Detailed health endpoint (authenticated, internal only)
app.MapHealthChecks("/health/debug", new HealthCheckOptions
{
    ResponseWriter = async (context, report) =>
    {
        context.Response.ContentType = "application/json";

        var result = new
        {
            status = report.Status.ToString(),
            totalDuration = report.TotalDuration,
            entries = report.Entries.Select(e => new
            {
                name = e.Key,
                status = e.Value.Status.ToString(),
                description = e.Value.Description,
                duration = e.Value.Duration,
                exception = e.Value.Exception?.ToString(),
                data = e.Value.Data,
                tags = e.Value.Tags
            }),
            timestamp = DateTime.UtcNow,
            hostName = Environment.MachineName,
            podName = Environment.GetEnvironmentVariable("HOSTNAME"),
            podIp = context.Connection.LocalIpAddress?.ToString()
        };

        await context.Response.WriteAsJsonAsync(result);
    }
})
.RequireAuthorization("AdminOnly");

  • Health Check Logging
  • Health Check Metrics
  • Distributed Tracing for Health Checks

Code Examples: - Debug endpoints - Logging integration - Tracing

Deliverables: - Debugging guide - Diagnostic tools


CYCLE 16: Best Practices & Governance (~3,000 lines)

Topic 31: Health Check Best Practices

What will be covered: - Design Best Practices

✅ DO:
- Keep checks fast (<5s total)
- Check critical dependencies only
- Return detailed information in responses
- Use tags for logical grouping
- Cache results appropriately
- Implement all three probe types (live, ready, startup)
- Use Degraded for non-critical failures
- Test health checks thoroughly
- Document expected check behavior
- Monitor health check performance

❌ DON'T:
- Run expensive operations (full scans, backups)
- Modify state (writes, deletes)
- Include PII in responses
- Fail on optional dependencies
- Use static thresholds without monitoring
- Ignore degraded state
- Forget to test failure scenarios
- Couple health checks to business logic
- Use high-cardinality labels
- Expose detailed health publicly (security risk)

  • ATP Health Check Standards
    • All services must implement /health/live, /health/ready, /health/startup
    • All critical dependencies must have health checks
    • Health check duration must be <5s (P95)
    • Detailed responses must not include PII
    • Failed dependencies must log errors
    • Health metrics must be exported

Code Examples: - Best practice implementations - Anti-patterns to avoid

Deliverables: - Best practices guide - Standards document - Anti-patterns catalog


Topic 32: Health Check Governance

What will be covered: - Health Check Review Process

    • New health check PR review checklist
    • Performance testing required
    • Security review for public endpoints
    • Documentation requirements

  • Health Check Lifecycle

    • Creation: When adding a new dependency
    • Updates: When a dependency changes
    • Deprecation: When a dependency is removed
    • Monitoring: Continuous monitoring of the health checks themselves

  • Compliance Requirements

    • Health audit trail (who accessed detailed health)
    • No PII in health responses
    • Secure admin endpoints
    • Health check versioning

Code Examples: - Governance procedures - Review checklists - Compliance validation

Deliverables: - Governance guide - Review procedures - Compliance checklist


Summary of Deliverables

Across all 16 cycles, this documentation will provide:

  1. Health Check Foundations

    • Fundamentals (liveness, readiness, startup)
    • ASP.NET Core Health Checks integration
    • Industry standards and formats
  2. Kubernetes Integration

    • Probe configuration (all three types)
    • Failure scenarios and recovery
    • Timing optimization
  3. Custom Health Checks

    • IHealthCheck implementation patterns
    • ATP-specific custom checks (10+ checks)
    • Service-specific configurations (all 8 services)
  4. Dependency Checks

    • Database health checks (SQL, Cosmos, MongoDB)
    • Messaging health checks (Service Bus, MassTransit)
    • Storage health checks (Blob, WORM)
    • Cache health checks (Redis, in-memory)
    • KMS health checks (Key Vault, signing operations)
    • External service checks (HTTP endpoints)
  5. Monitoring & UI

    • Health Check UI (ASP.NET Core)
    • Prometheus metrics export
    • Grafana dashboards
    • Application Insights integration
  6. Response Formats

    • Healthy, Degraded, Unhealthy responses
    • Detailed vs. minimal responses
    • JSON schema and standards
    • Custom response writers
  7. Degraded State

    • Degraded vs. unhealthy criteria
    • Graceful degradation patterns
    • Dependency criticality matrix
  8. Startup & Warmup

    • StartupWarmupGate pattern
    • Initialization task orchestration
    • Startup probe configuration
    • Grace period handling
  9. Performance

    • Response caching (5-10s TTL)
    • Parallel execution
    • Timeout configuration
    • Load optimization
  10. Multi-Tenancy

    • Tenant-aware health indicators
    • Quota monitoring
    • Regional health aggregation
  11. Testing

    • Unit testing health checks
    • Integration testing endpoints
    • Acceptance testing scenarios
    • Chaos testing (fault injection)
  12. Integration

    • Azure Load Balancer probes
    • Azure Front Door health checks
    • API Management health
    • Service mesh integration
  13. Operations

    • Troubleshooting common issues (10+ scenarios)
    • Debugging tools and techniques
    • Performance analysis
  14. Governance

    • Best practices (10+ do's and don'ts)
    • Standards and conventions
    • Review process
    • Compliance requirements


This health checks guide provides complete implementation and operational procedures for ATP service health monitoring, from ASP.NET Core Health Checks fundamentals and Kubernetes probe integration to custom health checks for all dependencies, Health Check UI, caching strategies, multi-tenant health isolation, comprehensive testing, load balancer integration, troubleshooting procedures, and governance for maintaining predictable, observable, and self-healing services with fast failure detection and automatic recovery.