Health Checks - Audit Trail Platform (ATP)¶
Observable health, predictable behavior: ATP implements health checks across all services using ASP.NET Core Health Checks, with custom checks for databases, message buses, caches, KMS, and other dependencies. The checks are exposed via the /health/live, /health/ready, and /health/startup endpoints and integrate with Kubernetes probes and real-time monitoring.
📋 Documentation Generation Plan¶
This document will be generated in 16 cycles. Current progress:
| Cycle | Topics | Estimated Lines | Status |
|---|---|---|---|
| Cycle 1 | Health Check Fundamentals (1-2) | ~3,000 | ⏳ Not Started |
| Cycle 2 | ASP.NET Core Health Checks (3-4) | ~3,500 | ⏳ Not Started |
| Cycle 3 | Kubernetes Probe Integration (5-6) | ~3,500 | ⏳ Not Started |
| Cycle 4 | Custom Health Checks (7-8) | ~4,500 | ⏳ Not Started |
| Cycle 5 | Service-Specific Health Checks (9-10) | ~5,000 | ⏳ Not Started |
| Cycle 6 | Dependency Health Checks (11-12) | ~4,000 | ⏳ Not Started |
| Cycle 7 | Health Check UI & Monitoring (13-14) | ~3,500 | ⏳ Not Started |
| Cycle 8 | Health Check Response Formats (15-16) | ~3,000 | ⏳ Not Started |
| Cycle 9 | Degraded State Handling (17-18) | ~3,000 | ⏳ Not Started |
| Cycle 10 | Startup Warmup & Grace Periods (19-20) | ~3,000 | ⏳ Not Started |
| Cycle 11 | Health Check Caching (21-22) | ~2,500 | ⏳ Not Started |
| Cycle 12 | Multi-Tenant Health Isolation (23-24) | ~3,000 | ⏳ Not Started |
| Cycle 13 | Health Check Testing (25-26) | ~2,500 | ⏳ Not Started |
| Cycle 14 | Load Balancer Integration (27-28) | ~3,000 | ⏳ Not Started |
| Cycle 15 | Health Check Troubleshooting (29-30) | ~3,500 | ⏳ Not Started |
| Cycle 16 | Best Practices & Governance (31-32) | ~3,000 | ⏳ Not Started |
Total Estimated Lines: ~53,500
Purpose & Scope¶
This document provides the complete health check implementation guide for ATP, covering ASP.NET Core Health Checks, Kubernetes probes (liveness, readiness, startup), custom health checks for all dependencies (databases, message buses, caches, KMS, external services), Health Check UI, monitoring integration, and operational procedures for maintaining service health and resilience.
Why Health Checks for ATP?
- Reliability: Detect unhealthy services before they impact users
- Kubernetes Integration: Automatic pod restart (liveness) and traffic management (readiness)
- Load Balancer: Remove unhealthy instances from rotation
- Monitoring: Real-time health status in dashboards
- Debugging: Identify failing dependencies quickly
- Compliance: Health audit trail for SLA validation
- Automation: Enable self-healing and auto-scaling
- Observability: Health metrics feed into SLO calculations
ATP Health Check Architecture
Client Request
↓
Load Balancer (checks /health/ready)
↓
Kubernetes Service (only routes to ready pods)
↓
Pod (3 health endpoints)
├── /health/live → Liveness Probe (K8s restarts if fails)
├── /health/ready → Readiness Probe (K8s removes from service if fails)
└── /health/startup → Startup Probe (K8s delays liveness until passes)
Health Check Types
| Type | Endpoint | Purpose | Kubernetes Usage | Check Scope |
|---|---|---|---|---|
| Liveness | /health/live | Is the process alive and responsive? | Restart pod if fails | Minimal (self-ping) |
| Readiness | /health/ready | Can the service handle traffic? | Remove from load balancer if fails | Dependencies (DB, cache, bus) |
| Startup | /health/startup | Has the service finished initializing? | Delay liveness/readiness until passes | Warmup tasks (migrations, cache) |
ATP Service Health Checks
All ATP services implement comprehensive health checks:
- Gateway: Key Vault (certs), backend services (HTTP), rate limiter (Redis)
- Ingestion: Azure SQL, Service Bus (publish), Blob Storage (WORM), Policy Service, Outbox
- Query: Read model database, query cache (Redis), Search index, projection lag
- Projection: Service Bus (subscribe), read model database, inbox table, projection lag
- Export: Blob Storage, export job queue (Redis), KMS (signing), bandwidth quota
- Integrity: Blob WORM, KMS (signing keys), hash chain state, Merkle computation
- Policy: Policy database, policy cache (Redis), default policies loaded
- Admin: Admin database, configuration store
Detailed Cycle Plan¶
CYCLE 1: Health Check Fundamentals (~3,000 lines)¶
Topic 1: Health Check Philosophy¶
What will be covered:
- What is a Health Check?
    Health Check:
    - Periodic test of service health
    - Returns: Healthy, Degraded, or Unhealthy
    - Exposes internal state to external systems
    - Enables automation (restart, route, scale)
    Purpose:
    - Detect failures early
    - Enable self-healing
    - Prevent cascading failures
    - Support graceful degradation
- Health vs. Metrics
    Health Checks:
    - Discrete state (Healthy, Degraded, or Unhealthy)
    - Immediate action (restart, remove from LB)
    - Synchronous (request/response)
    - Examples: "Can I connect to database?"
    Metrics:
    - Continuous values (latency, throughput, errors)
    - Gradual response (alerts, scaling)
    - Asynchronous (scrape, push)
    - Examples: "What's the current latency?"
    Relationship:
    - Metrics → Trends and patterns
    - Health Checks → Go/no-go decisions
    - Both needed for complete observability
- Three Types of Health Checks
    1. Liveness (Am I alive?)
    - Purpose: Detect deadlocks, infinite loops, crashes
    - Check: Minimal (process responsive)
    - Failure: Restart pod
    - Example: Self-ping, basic HTTP response
    2. Readiness (Can I serve traffic?)
    - Purpose: Detect dependency failures
    - Check: Critical dependencies (DB, cache, queue)
    - Failure: Remove from load balancer
    - Example: Database connection, cache ping
    3. Startup (Have I finished initializing?)
    - Purpose: Allow slow startup without killing pod
    - Check: Initialization tasks complete
    - Failure: Liveness and readiness probes stay disabled; the container is restarted once failureThreshold is exceeded
    - Example: Migrations run, cache warmed, config loaded
- ATP Health Check Principles
    1. Fast Checks (<5 seconds)
    - Health checks must be lightweight
    - Avoid expensive operations (full scans, complex queries)
    - Use simple connectivity tests
    2. Dependency-Aware
    - Check critical dependencies only
    - Don't fail on optional dependencies
    - Degrade gracefully when possible
    3. No Side Effects
    - Health checks are read-only
    - Don't modify state
    - Don't trigger business logic
    4. Cacheable (with short TTL)
    - Cache health status briefly (5-10 seconds)
    - Avoid overwhelming dependencies
    - Balance freshness vs. load
    5. Detailed Responses
    - Return status + details for each check
    - Include duration and error messages
    - Enable debugging
    6. Compliance-Aware
    - No PII in health responses
    - Audit health check access
    - Secure endpoints (internal only for details)
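Principle 4 (cacheable, with a short TTL) can be sketched as a decorator around any IHealthCheck. This is a minimal illustration rather than ATP's implementation: CachedHealthCheck, its ttl parameter, and the injectable clock are hypothetical names introduced here. Only IHealthCheck, HealthCheckContext, and HealthCheckResult come from Microsoft.Extensions.Diagnostics.HealthChecks.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical decorator: reuses the last result of the wrapped check until the TTL expires.
public sealed class CachedHealthCheck : IHealthCheck
{
    private readonly IHealthCheck _inner;
    private readonly TimeSpan _ttl;
    private readonly Func<DateTimeOffset> _clock;
    private readonly SemaphoreSlim _gate = new(1, 1);
    private HealthCheckResult _cached;
    private DateTimeOffset _expiresAt = DateTimeOffset.MinValue;

    public CachedHealthCheck(IHealthCheck inner, TimeSpan ttl, Func<DateTimeOffset>? clock = null)
    {
        _inner = inner;
        _ttl = ttl;
        // The clock is injectable so the TTL can be exercised in tests (an assumption of this sketch).
        _clock = clock ?? (() => DateTimeOffset.UtcNow);
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        // Serialize callers so concurrent probes wait for one refresh
        // instead of all hitting the dependency at once.
        await _gate.WaitAsync(cancellationToken);
        try
        {
            if (_clock() < _expiresAt)
            {
                return _cached; // still fresh: skip the dependency call
            }

            _cached = await _inner.CheckHealthAsync(context, cancellationToken);
            _expiresAt = _clock() + _ttl;
            return _cached;
        }
        finally
        {
            _gate.Release();
        }
    }
}
```

With a TTL of 5 to 10 seconds, a probe period of 5 seconds results in roughly one real dependency call per TTL window per pod. A failed inner call is not cached if it throws, so the next probe retries immediately.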
Code Examples: - Health check concepts - Three-probe architecture - ATP principles
Diagrams: - Health check types comparison - Probe failure handling - Health check architecture
Deliverables: - Health check fundamentals guide - Type comparison matrix - ATP principles document
Topic 2: Health Check Standards¶
What will be covered: - Industry Standards
RFC 9111 (HTTP Caching; obsoletes RFC 7234):
- Cache-Control: no-cache for health checks
- Avoid stale health status
OpenAPI/Swagger:
- Document health endpoints
- Security schemes (Bearer token for detailed checks)
Health Check Response Format (de facto standard):
{
"status": "Healthy" | "Degraded" | "Unhealthy",
"totalDuration": "00:00:00.123",
"entries": {
"database": {
"status": "Healthy",
"description": "...",
"duration": "00:00:00.050"
}
}
}
HTTP Status Codes:
- 200 OK: Healthy
- 503 Service Unavailable: Unhealthy
- 200 OK (with "Degraded" in body): Degraded
- ASP.NET Core Health Checks
    - Microsoft.Extensions.Diagnostics.HealthChecks
    - AspNetCore.HealthChecks.* (community packages)
    - Health Check UI (AspNetCore.HealthChecks.UI)
- Kubernetes Health Probes
    - livenessProbe
    - readinessProbe
    - startupProbe
Code Examples: - Standard response formats - HTTP status codes - ASP.NET Core integration
Diagrams: - Standards compliance - Response format evolution
Deliverables: - Standards reference - Format specifications - Compliance guide
CYCLE 2: ASP.NET Core Health Checks (~3,500 lines)¶
Topic 3: Health Check Middleware¶
What will be covered: - Registering Health Checks
// Startup.cs / Program.cs
public void ConfigureServices(IServiceCollection services)
{
// Add health check services
var healthChecks = services.AddHealthChecks();
// Add built-in checks
healthChecks.AddCheck("self", () => HealthCheckResult.Healthy("Service is alive"), tags: new[] { "live" }); // "live" tag is what /health/live filters on
// Add dependency checks
healthChecks.AddSqlServer(
connectionString: configuration.GetConnectionString("AuditDb"),
healthQuery: "SELECT 1",
name: "database",
failureStatus: HealthStatus.Unhealthy,
tags: new[] { "ready", "database" });
healthChecks.AddRedis(
connectionString: configuration.GetConnectionString("Redis"),
name: "cache",
failureStatus: HealthStatus.Degraded,
tags: new[] { "ready", "cache" });
healthChecks.AddAzureServiceBusTopic(
connectionString: configuration.GetConnectionString("ServiceBus"),
topicName: "audit.appended.v1",
name: "servicebus",
tags: new[] { "ready", "messaging" });
}
public void Configure(IApplicationBuilder app)
{
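// Routing and authorization middleware must run before the endpoints are mapped
app.UseRouting();
app.UseAuthorization();
// Map health check endpoints
app.UseEndpoints(endpoints =>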
{
// Liveness endpoint (minimal checks)
endpoints.MapHealthChecks("/health/live", new HealthCheckOptions
{
Predicate = check => check.Tags.Contains("live"),
AllowCachingResponses = false,
ResultStatusCodes =
{
[HealthStatus.Healthy] = StatusCodes.Status200OK,
[HealthStatus.Degraded] = StatusCodes.Status200OK,
[HealthStatus.Unhealthy] = StatusCodes.Status503ServiceUnavailable
}
});
// Readiness endpoint (dependency checks)
endpoints.MapHealthChecks("/health/ready", new HealthCheckOptions
{
Predicate = check => check.Tags.Contains("ready"),
AllowCachingResponses = false,
ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse,
ResultStatusCodes =
{
[HealthStatus.Healthy] = StatusCodes.Status200OK,
[HealthStatus.Degraded] = StatusCodes.Status200OK,
[HealthStatus.Unhealthy] = StatusCodes.Status503ServiceUnavailable
}
});
// Startup endpoint
endpoints.MapHealthChecks("/health/startup", new HealthCheckOptions
{
Predicate = check => check.Tags.Contains("startup"),
AllowCachingResponses = false
});
// Detailed health UI (internal only, secured)
endpoints.MapHealthChecks("/health", new HealthCheckOptions
{
AllowCachingResponses = false,
ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse
})
.RequireAuthorization("InternalOnly");
});
}
- Health Check Tags
    Tag Strategy:
    live:
    - Minimal checks for liveness probe
    - Only "self" check (process alive)
    - Fast (<100ms)
    ready:
    - All critical dependency checks
    - Database, cache, message bus
    - Moderate speed (<5s total)
    startup:
    - Initialization completion checks
    - Migrations complete, cache warmed
    - Can be slow (30-60s acceptable)
    database:
    - Groups all database checks
    - For targeted diagnostics
    messaging:
    - Groups messaging checks
    - Service Bus, outbox, consumers
    cache:
    - Groups caching checks
    - Redis, in-memory
    external:
    - Groups external API checks
    - Policy service, KMS, etc.
Code Examples: - Complete health check setup - Endpoint mapping - Tag-based filtering
Diagrams: - Health check middleware flow - Tag-based routing - Response writer pipeline
Deliverables: - Health check setup guide - Middleware configuration - Tag strategy
Topic 4: Built-In Health Checks¶
What will be covered: - Community Health Check Packages
// Available health check packages (NuGet package IDs, not namespaces).
// Their Add* registration extensions live in Microsoft.Extensions.DependencyInjection,
// so the registrations below need no additional using directives.
// AspNetCore.HealthChecks.SqlServer        - SQL Server
// AspNetCore.HealthChecks.AzureServiceBus  - Azure Service Bus
// AspNetCore.HealthChecks.Redis            - Redis
// AspNetCore.HealthChecks.MongoDb          - MongoDB
// AspNetCore.HealthChecks.AzureStorage     - Azure Blob/Table
// AspNetCore.HealthChecks.AzureKeyVault    - Azure Key Vault
// AspNetCore.HealthChecks.Uris             - HTTP endpoints
// AspNetCore.HealthChecks.System           - Disk, memory, process
// Registration examples
services.AddHealthChecks()
// Database
.AddSqlServer(
connectionString: config.GetConnectionString("AuditDb"),
healthQuery: "SELECT 1",
name: "audit-database",
timeout: TimeSpan.FromSeconds(3))
// Message Bus
.AddAzureServiceBusTopic(
connectionString: config.GetConnectionString("ServiceBus"),
topicName: "audit.appended.v1",
name: "servicebus-topic")
// Cache
.AddRedis(
connectionString: config.GetConnectionString("Redis"),
name: "redis-cache")
// Blob Storage
.AddAzureBlobStorage(
connectionString: config.GetConnectionString("BlobStorage"),
containerName: "audit-worm",
name: "blob-worm-storage")
// Key Vault
.AddAzureKeyVault(
keyVaultUri: new Uri(config["KeyVault:Uri"]),
credential: new DefaultAzureCredential(),
setup: options => options.AddSecret("signing-key-public"),
name: "key-vault")
// HTTP Endpoint (Policy Service)
.AddUrlGroup(
uri: new Uri("https://policy.atp.internal/health/ready"),
name: "policy-service",
timeout: TimeSpan.FromSeconds(5))
// System resources
.AddDiskStorageHealthCheck(
// Drive name is platform-specific: "C:\\" on Windows, "/" in Linux containers
setup: options => options.AddDrive("C:\\", minimumFreeMegabytes: 1024),
name: "disk-space")
.AddProcessAllocatedMemoryHealthCheck(
maximumMegabytesAllocated: 2048,
name: "memory-allocation");
- Configuration Options
- Timeouts
- Failure status (Unhealthy vs. Degraded)
- Retry policies
- Tags
Code Examples: - Built-in check usage (all types) - Configuration patterns - Timeout management
Diagrams: - Health check library ecosystem - Package dependencies
Deliverables: - Built-in checks catalog - Usage guide - Configuration reference
CYCLE 3: Kubernetes Probe Integration (~3,500 lines)¶
Topic 5: Probe Configuration¶
What will be covered: - Liveness Probe
# Kubernetes Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: ingestion
namespace: atp-ingest-ns
spec:
template:
spec:
containers:
- name: ingestion
image: atpacr.azurecr.io/atp/ingestion:1.2.3
# Liveness Probe (restart if fails)
livenessProbe:
httpGet:
path: /health/live
port: 8080
scheme: HTTP
initialDelaySeconds: 30 # Wait 30s after pod start
periodSeconds: 10 # Check every 10s
timeoutSeconds: 5 # 5s timeout per check
failureThreshold: 3 # Restart after 3 consecutive failures
successThreshold: 1 # Healthy after 1 success
- Readiness Probe
- Startup Probe
- Probe Types (HTTP, TCP, Exec)
# HTTP Probe (most common)
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
    httpHeaders:
      - name: X-Health-Check
        value: "kubernetes"
# TCP Probe (for non-HTTP services)
livenessProbe:
  tcpSocket:
    port: 9090
# Exec Probe (run command in container)
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - /app/healthcheck.sh
# gRPC Probe (Kubernetes 1.24+)
livenessProbe:
  grpc:
    port: 9090
    service: grpc.health.v1.Health
- Probe Timing Best Practices
    Liveness Probe:
    - initialDelaySeconds: 30-60s (allow startup)
    - periodSeconds: 10-30s (not too frequent)
    - timeoutSeconds: 5s (reasonable timeout)
    - failureThreshold: 3 (tolerate transient failures)
    Readiness Probe:
    - initialDelaySeconds: 10-30s (faster than liveness)
    - periodSeconds: 5-10s (more frequent for traffic mgmt)
    - timeoutSeconds: 3s
    - failureThreshold: 3
    Startup Probe:
    - initialDelaySeconds: 0 (start immediately)
    - periodSeconds: 5s
    - failureThreshold: 30-60 (allow long startup)
    - Liveness and readiness probes are held off until startup succeeds (Kubernetes does this automatically)
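The readiness and startup probes follow the same shape as the liveness probe shown earlier. The fragment below is a sketch that applies the timing guidelines above to the /health/ready and /health/startup endpoints. The specific values are illustrative picks from the recommended ranges, not ATP's production settings.

```yaml
# Sketch only: values chosen from the ranges in the timing guidelines
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 10   # readiness may start earlier than liveness
  periodSeconds: 5          # frequent checks for quick traffic shifting
  timeoutSeconds: 3
  failureThreshold: 3       # removed from endpoints after 3 consecutive failures
startupProbe:
  httpGet:
    path: /health/startup
    port: 8080
  periodSeconds: 5
  failureThreshold: 30      # startup budget: 30 x 5s = 150s before the container is restarted
```

The startup budget is the product failureThreshold x periodSeconds, so raising either value extends how long slow initialization (such as migrations) is tolerated.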
Code Examples: - Complete probe configurations (all ATP services) - Probe type examples - Timing guidelines
Diagrams: - Probe lifecycle - Failure handling flow - Timing optimization
Deliverables: - Probe configuration guide - Timing recommendations - ATP probe library
Topic 6: Probe Failure Scenarios¶
What will be covered: - Liveness Probe Failure
    Scenario: Liveness probe fails 3 consecutive times
    Kubernetes Actions:
    1. Kubelet sends SIGTERM to the failing container (graceful shutdown)
    2. Wait for termination grace period (default: 30s)
    3. Send SIGKILL if still running
    4. Restart the container in the same pod (per restartPolicy; the pod itself is not deleted or rescheduled)
    5. Pod is "Not Ready" while the container restarts (removed from service endpoints)
    6. Wait for startup probe to pass
    7. Wait for readiness probe to pass
    8. Add pod back to service endpoints
    Timeline (worst case, container ignores SIGTERM):
    T+0s: Liveness probe fails (3rd failure)
    T+0s: SIGTERM sent to container
    T+30s: SIGKILL sent (if not terminated)
    T+30s: Container restarted in the same pod
    T+45s: Container startup complete
    T+50s: Pod ready
    T+50s: Pod receives traffic again
    Total Recovery Time: ~50 seconds (repeated failures add restart back-off delays)
- Readiness Probe Failure
    Scenario: Readiness probe fails (database connection lost)
    Kubernetes Actions:
    1. Mark pod as "Not Ready"
    2. Remove pod IP from service endpoints
    3. Stop routing traffic to pod
    4. Pod continues running (NOT restarted)
    5. Continue checking readiness probe
    6. When probe passes → add back to service
    User Impact:
    - No downtime if other pods healthy
    - Traffic automatically shifted
    - Pod self-heals when dependency recovers
    Timeline:
    T+0s: Readiness probe fails
    T+0s: Pod removed from service endpoints
    T+0s-Xs: No traffic to pod (database recovers)
    T+Xs: Readiness probe passes
    T+Xs: Pod added back to service
    Recovery Time: Depends on dependency recovery
- Startup Probe Failure
    Scenario: Startup probe fails (migrations taking too long)
    Kubernetes Actions:
    1. Continue running startup probe
    2. Liveness/readiness disabled until startup passes
    3. If failureThreshold exceeded (e.g., 30 failures * 5s = 150s)
    4. Kill pod and restart
    Best Practice:
    - Set failureThreshold high enough for worst-case startup
    - Monitor startup duration metrics
    - Optimize slow initialization tasks
Code Examples: - Failure scenarios - Recovery timelines - Mitigation strategies
Diagrams: - Probe failure flows - Pod lifecycle during recovery - Traffic shifting
Deliverables: - Failure scenario guide - Recovery procedures - Timing analysis
CYCLE 4: Custom Health Checks (~4,500 lines)¶
Topic 7: Implementing Custom Health Checks¶
What will be covered: - IHealthCheck Interface
using Microsoft.Extensions.Diagnostics.HealthChecks;
public class OutboxHealthCheck : IHealthCheck
{
private readonly IOutboxRepository _outboxRepository;
private readonly ILogger<OutboxHealthCheck> _logger;
private readonly int _maxPendingThreshold;
public OutboxHealthCheck(
IOutboxRepository outboxRepository,
ILogger<OutboxHealthCheck> logger,
IConfiguration configuration)
{
_outboxRepository = outboxRepository;
_logger = logger;
_maxPendingThreshold = configuration.GetValue<int>("HealthChecks:Outbox:MaxPending", 1000);
}
public async Task<HealthCheckResult> CheckHealthAsync(
HealthCheckContext context,
CancellationToken cancellationToken = default)
{
try
{
// Check outbox pending count
var pendingCount = await _outboxRepository.GetPendingCountAsync(cancellationToken);
if (pendingCount == 0)
{
return HealthCheckResult.Healthy(
description: "Outbox is empty (all events published)",
data: new Dictionary<string, object>
{
["pendingEvents"] = 0,
["checkedAt"] = DateTime.UtcNow
});
}
if (pendingCount < _maxPendingThreshold)
{
return HealthCheckResult.Healthy(
description: $"Outbox has {pendingCount} pending events (within threshold)",
data: new Dictionary<string, object>
{
["pendingEvents"] = pendingCount,
["threshold"] = _maxPendingThreshold,
["checkedAt"] = DateTime.UtcNow
});
}
// Degraded if backlog growing but not critical
if (pendingCount < _maxPendingThreshold * 2)
{
_logger.LogWarning("Outbox backlog is elevated: {Count}", pendingCount);
return HealthCheckResult.Degraded(
description: $"Outbox backlog elevated: {pendingCount} events",
data: new Dictionary<string, object>
{
["pendingEvents"] = pendingCount,
["threshold"] = _maxPendingThreshold,
["recommendation"] = "Scale outbox relay workers"
});
}
// Unhealthy if backlog critical
_logger.LogError("Outbox backlog critical: {Count}", pendingCount);
return HealthCheckResult.Unhealthy(
description: $"Outbox backlog critical: {pendingCount} events",
data: new Dictionary<string, object>
{
["pendingEvents"] = pendingCount,
["threshold"] = _maxPendingThreshold,
["action"] = "Immediate investigation required"
});
}
catch (Exception ex)
{
_logger.LogError(ex, "Outbox health check failed");
return HealthCheckResult.Unhealthy(
description: "Failed to check outbox health",
exception: ex);
}
}
}
// Registration
services.AddHealthChecks()
.AddCheck<OutboxHealthCheck>(
name: "outbox",
failureStatus: HealthStatus.Degraded, // Degraded, not Unhealthy
tags: new[] { "ready", "messaging", "outbox" });
- Health Check Data Dictionary
    - Include diagnostic information
    - Avoid PII
    - Machine-readable format
    - Useful for debugging
- Async Health Checks
    - All health checks should be async
    - Use cancellation tokens
    - Timeout handling
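The cancellation-token and timeout points can be sketched as a decorator that bounds how long any wrapped check may run. This is an illustrative sketch under stated assumptions: TimeBoundedHealthCheck is a hypothetical name, and it relies on the wrapped check honoring its cancellation token. The framework's own per-registration timeout (the timeout argument on the Add* methods) may be sufficient on its own.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical decorator: converts an overrunning check into a failure result
// instead of letting the probe request hang.
public sealed class TimeBoundedHealthCheck : IHealthCheck
{
    private readonly IHealthCheck _inner;
    private readonly TimeSpan _timeout;

    public TimeBoundedHealthCheck(IHealthCheck inner, TimeSpan timeout)
    {
        _inner = inner;
        _timeout = timeout;
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        // Link to the caller's token so a shutdown or aborted request still cancels the check,
        // then add our own deadline on top.
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
        cts.CancelAfter(_timeout);

        try
        {
            return await _inner.CheckHealthAsync(context, cts.Token);
        }
        catch (OperationCanceledException) when (!cancellationToken.IsCancellationRequested)
        {
            // Only our deadline fired: report the failure status configured at registration
            // (Unhealthy or Degraded) rather than throwing.
            return new HealthCheckResult(
                context.Registration.FailureStatus,
                description: $"Health check timed out after {_timeout.TotalSeconds:F1}s");
        }
    }
}
```

The exception filter distinguishes a deadline from caller cancellation: if the caller's token was cancelled, the exception propagates unchanged so the host can shut down normally.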
Code Examples: - Complete custom health check implementations - Registration patterns - Error handling
Diagrams: - Custom health check flow - Data dictionary structure
Deliverables: - Custom health check guide - Implementation templates - Registration patterns
Topic 8: ATP Custom Health Checks¶
What will be covered: - ProjectionLagHealthCheck
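public class ProjectionLagHealthCheck : IHealthCheck
{
    private readonly IProjectionWatermarkRepository _watermarkRepo;
    private readonly TimeSpan _lagThreshold;

    // lagThreshold is not resolvable from DI; it can be supplied at registration,
    // for example through AddTypeActivatedCheck<ProjectionLagHealthCheck>(name, args)
    public ProjectionLagHealthCheck(
        IProjectionWatermarkRepository watermarkRepo,
        TimeSpan lagThreshold)
    {
        _watermarkRepo = watermarkRepo;
        _lagThreshold = lagThreshold;
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        var watermarks = (await _watermarkRepo.GetAllWatermarksAsync(cancellationToken)).ToList();
        // Max()/First() throw on an empty sequence; treating "no watermarks yet"
        // as healthy is an assumption of this sketch
        if (watermarks.Count == 0)
        {
            return HealthCheckResult.Healthy("No projection watermarks recorded yet");
        }
        // Capture the clock once so all lag values share the same reference time
        var now = DateTime.UtcNow;
        // The projection with the oldest watermark has the largest lag
        var laggiest = watermarks.OrderBy(w => w.LastEventTimestamp).First();
        var maxLag = now - laggiest.LastEventTimestamp;
        if (maxLag < _lagThreshold)
        {
            return HealthCheckResult.Healthy(
                $"Projection lag: {maxLag.TotalSeconds:F1}s");
        }
        return HealthCheckResult.Degraded(
            $"Projection lag high: {maxLag.TotalSeconds:F1}s",
            data: new Dictionary<string, object>
            {
                ["lagSeconds"] = maxLag.TotalSeconds,
                ["threshold"] = _lagThreshold.TotalSeconds,
                ["laggestProjection"] = laggiest.ProjectionName
            });
    }
}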
- IdempotencyStoreHealthCheck
- PolicyCacheHealthCheck
- HashChainStateHealthCheck
- ExportBandwidthHealthCheck
- TenantQuotaHealthCheck
Code Examples: - All ATP custom health checks - ATP-specific validation logic
Deliverables: - ATP health check library - Implementation guide
CYCLE 5: Service-Specific Health Checks (~5,000 lines)¶
Topic 9: Gateway Service Health Checks¶
What will be covered: - Gateway Health Check Configuration
// Gateway health checks
services.AddHealthChecks()
// Liveness (minimal)
.AddCheck("self", () => HealthCheckResult.Healthy(), tags: new[] { "live" })
// Readiness (dependencies)
.AddAzureKeyVault(keyVaultUri, credential, name: "keyvault-certs", tags: new[] { "ready" })
.AddRedis(redisConnection, name: "rate-limiter-cache", tags: new[] { "ready" })
.AddUrlGroup(
new Uri("https://ingestion.atp.internal/health/ready"),
name: "ingestion-backend",
tags: new[] { "ready", "backend" })
.AddUrlGroup(
new Uri("https://query.atp.internal/health/ready"),
name: "query-backend",
tags: new[] { "ready", "backend" })
.AddUrlGroup(
new Uri("https://policy.atp.internal/health/ready"),
name: "policy-backend",
tags: new[] { "ready", "backend" })
.AddCheck<GatewayRoutingHealthCheck>(
name: "routing-config",
tags: new[] { "startup", "configuration" });
- Custom Gateway Checks
- Routing configuration loaded
- YARP clusters reachable
- JWT validation keys available
- Rate limiting operational
Code Examples: - Complete Gateway health checks - Backend reachability tests - Configuration validation
Diagrams: - Gateway health architecture - Dependency graph
Deliverables: - Gateway health check guide - Configuration templates
Topic 10: Ingestion Service Health Checks¶
What will be covered: - Ingestion Health Configuration
services.AddHealthChecks()
// Live
.AddCheck("self", () => HealthCheckResult.Healthy(), tags: new[] { "live" })
// Ready - Critical dependencies
.AddSqlServer(
connectionString: config.GetConnectionString("AuditDb"),
healthQuery: "SELECT 1",
name: "audit-database",
tags: new[] { "ready", "database" })
.AddAzureServiceBusTopic(
connectionString: config.GetConnectionString("ServiceBus"),
topicName: "audit.appended.v1",
name: "servicebus-publish",
tags: new[] { "ready", "messaging" })
.AddAzureBlobStorage(
connectionString: config.GetConnectionString("BlobStorage"),
containerName: "audit-worm",
name: "blob-worm-storage",
tags: new[] { "ready", "storage" })
.AddUrlGroup(
new Uri("https://policy.atp.internal/health/ready"),
name: "policy-service",
tags: new[] { "ready", "external" })
.AddCheck<OutboxHealthCheck>(
name: "outbox-backlog",
failureStatus: HealthStatus.Degraded,
tags: new[] { "ready", "outbox" })
// Startup
.AddCheck<DatabaseMigrationHealthCheck>(
name: "database-migrations",
tags: new[] { "startup" })
.AddCheck<StartupWarmupHealthCheck>(
name: "startup-warmup",
tags: new[] { "startup" });
- Ingestion-Specific Checks
- Outbox backlog within limits
- Idempotency store accessible
- Schema validation ready
- Tenant quota store loaded
Code Examples: - Ingestion health checks (complete) - All 8 ATP services health configurations
Diagrams: - Service health dependencies - Check hierarchy
Deliverables: - Service health check library (8 services) - Configuration guide
CYCLE 6: Dependency Health Checks (~4,000 lines)¶
Topic 11: Database Health Checks¶
What will be covered: - SQL Server Health Check
// Basic connectivity
.AddSqlServer(
connectionString: connectionString,
healthQuery: "SELECT 1",
name: "database-connectivity")
// Connection pool health
.AddCheck<SqlConnectionPoolHealthCheck>(
name: "database-connection-pool")
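public class SqlConnectionPoolHealthCheck : IHealthCheck
{
    private readonly string _connectionString;
    // Hypothetical abstraction: SqlClient has no public API that returns pool
    // statistics for a connection, so the active count has to come from elsewhere
    // (for example, a listener on the SqlClient event counters)
    private readonly IConnectionPoolMetrics _poolMetrics;

    public SqlConnectionPoolHealthCheck(
        string connectionString,
        IConnectionPoolMetrics poolMetrics)
    {
        _connectionString = connectionString;
        _poolMetrics = poolMetrics;
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        // Opening a connection confirms the pool can still hand one out
        using var connection = new SqlConnection(_connectionString);
        await connection.OpenAsync(cancellationToken);
        // Max Pool Size is read from the connection string (default: 100)
        var poolSize = new SqlConnectionStringBuilder(_connectionString).MaxPoolSize;
        var activeConnections = _poolMetrics.GetActiveConnectionCount();
        var utilization = (double)activeConnections / poolSize;
        if (utilization < 0.8)
        {
            return HealthCheckResult.Healthy(
                $"Connection pool: {activeConnections}/{poolSize} ({utilization:P0})");
        }
        if (utilization < 0.95)
        {
            return HealthCheckResult.Degraded(
                $"Connection pool high: {activeConnections}/{poolSize} ({utilization:P0})");
        }
        return HealthCheckResult.Unhealthy(
            $"Connection pool exhausted: {activeConnections}/{poolSize}");
    }
}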
// Query performance check
.AddCheck<DatabasePerformanceHealthCheck>(
name: "database-performance")
- Azure Cosmos DB Health Check
- MongoDB Health Check
- NHibernate Session Factory Health Check
Code Examples: - Database health checks (all types) - Connection pool monitoring - Performance validation
Deliverables: - Database health check library
Topic 12: Messaging Health Checks¶
What will be covered: - Azure Service Bus Health Check
// Topic publish capability
.AddAzureServiceBusTopic(
connectionString: serviceBusConnection,
topicName: "audit.appended.v1",
name: "servicebus-topic-publish")
// Subscription receive capability
.AddAzureServiceBusSubscription(
connectionString: serviceBusConnection,
topicName: "audit.appended.v1",
subscriptionName: "projection-sub",
name: "servicebus-subscription")
// Custom: Consumer lag check
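public class ConsumerLagHealthCheck : IHealthCheck
{
    private readonly ServiceBusClient _serviceBusClient;
    private readonly string _topicName;
    private readonly string _subscriptionName;
    private readonly TimeSpan _lagThreshold;

    public ConsumerLagHealthCheck(
        ServiceBusClient serviceBusClient,
        string topicName,
        string subscriptionName,
        TimeSpan lagThreshold)
    {
        _serviceBusClient = serviceBusClient;
        _topicName = topicName;
        _subscriptionName = subscriptionName;
        _lagThreshold = lagThreshold;
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        // Dispose the receiver after each check so its AMQP link is not leaked
        await using var receiver = _serviceBusClient
            .CreateReceiver(_topicName, _subscriptionName);
        // Peek (non-destructive read) to estimate lag; the token is passed by name
        // because the first parameter of PeekMessageAsync is a sequence number
        var peekedMessage = await receiver.PeekMessageAsync(
            cancellationToken: cancellationToken);
        if (peekedMessage == null)
        {
            return HealthCheckResult.Healthy("No pending messages");
        }
        // EnqueuedTime is a DateTimeOffset, so compare against DateTimeOffset.UtcNow
        var lag = DateTimeOffset.UtcNow - peekedMessage.EnqueuedTime;
        if (lag < _lagThreshold)
        {
            return HealthCheckResult.Healthy($"Lag: {lag.TotalSeconds:F1}s");
        }
        return HealthCheckResult.Degraded($"High lag: {lag.TotalSeconds:F1}s");
    }
}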
- MassTransit Health Check
- Outbox Relay Health Check
- DLQ Depth Health Check
Code Examples: - Messaging health checks - Lag monitoring - DLQ validation
Deliverables: - Messaging health check library
CYCLE 7: Health Check UI & Monitoring (~3,500 lines)¶
Topic 13: Health Check UI¶
What will be covered: - ASP.NET Core Health Check UI
services.AddHealthChecksUI(setup =>
{
setup.SetEvaluationTimeInSeconds(30);
setup.MaximumHistoryEntriesPerEndpoint(50);
setup.AddHealthCheckEndpoint("ATP Ingestion", "https://ingestion.atp.internal/health");
setup.AddHealthCheckEndpoint("ATP Query", "https://query.atp.internal/health");
setup.AddHealthCheckEndpoint("ATP Policy", "https://policy.atp.internal/health");
})
.AddInMemoryStorage(); // Or .AddSqlServerStorage() for persistence
// Map UI endpoint
app.UseEndpoints(endpoints =>
{
endpoints.MapHealthChecksUI(setup =>
{
setup.UIPath = "/healthchecks-ui";
setup.ApiPath = "/healthchecks-api";
})
.RequireAuthorization("AdminOnly");
});
- Health Check UI Features
- Real-time health status
- Historical health data
- Webhook notifications
- Failure tracking
Code Examples: - Health Check UI setup - Dashboard configuration - Webhook integration
Diagrams: - UI architecture - Data flow
Deliverables: - UI setup guide - Configuration reference
Topic 14: Prometheus Metrics from Health Checks¶
What will be covered: - Health Check Metrics Exporter
// Requires: System.Collections.Concurrent, System.Diagnostics.Metrics, System.Linq
public class HealthCheckMetricsPublisher : IHealthCheckPublisher
{
    private readonly Counter<long> _healthCheckExecutions;
    // Latest status per check name, read by the observable gauge callback
    private readonly ConcurrentDictionary<string, long> _latestStatus = new();

    public HealthCheckMetricsPublisher(IMeterFactory meterFactory)
    {
        var meter = meterFactory.Create("ATP.HealthChecks");
        _healthCheckExecutions = meter.CreateCounter<long>(
            "health_check_executions_total",
            description: "Total health check executions");
        // Observable gauges are pull-based: the callback reports the most recent
        // status of every check each time the meter is collected
        meter.CreateObservableGauge<long>(
            "health_check_status",
            () => _latestStatus.Select(kv => new Measurement<long>(
                kv.Value, new KeyValuePair<string, object?>("name", kv.Key))),
            description: "Health check status (0=unhealthy, 1=degraded, 2=healthy)");
    }

    public Task PublishAsync(HealthReport report, CancellationToken cancellationToken)
    {
        foreach (var entry in report.Entries)
        {
            var statusValue = entry.Value.Status switch
            {
                HealthStatus.Healthy => 2,
                HealthStatus.Degraded => 1,
                HealthStatus.Unhealthy => 0,
                _ => -1
            };
            // Remember the latest status so the gauge callback can report it
            _latestStatus[entry.Key] = statusValue;
            _healthCheckExecutions.Add(1,
                new KeyValuePair<string, object?>("name", entry.Key),
                new KeyValuePair<string, object?>("status", entry.Value.Status.ToString()));
        }
        return Task.CompletedTask;
    }
}
// Register publisher
services.AddSingleton<IHealthCheckPublisher, HealthCheckMetricsPublisher>();
Code Examples: - Metrics publisher implementation - Prometheus integration - Grafana dashboards
Deliverables: - Metrics publisher guide - Dashboard templates
CYCLE 8: Health Check Response Formats (~3,000 lines)¶
Topic 15: Response Format Standards¶
What will be covered: - Healthy Response
GET /health/ready
HTTP/1.1 200 OK
Content-Type: application/json
Cache-Control: no-cache
{
"status": "Healthy",
"totalDuration": "00:00:00.234",
"entries": {
"database": {
"status": "Healthy",
"description": "Azure SQL connection pool: 5/100 active",
"duration": "00:00:00.050",
"data": {
"activeConnections": 5,
"maxPoolSize": 100,
"utilization": 0.05
}
},
"servicebus": {
"status": "Healthy",
"description": "Azure Service Bus connected",
"duration": "00:00:00.120"
},
"cache": {
"status": "Healthy",
"description": "Redis cache: 1024 keys, 512MB used",
"duration": "00:00:00.015",
"data": {
"keys": 1024,
"memoryUsedMb": 512,
"hitRate": 0.87
}
},
"outbox": {
"status": "Healthy",
"description": "Outbox: 15 pending events",
"duration": "00:00:00.035",
"data": {
"pendingEvents": 15,
"threshold": 1000
}
}
}
}
- Degraded Response
GET /health/ready
HTTP/1.1 200 OK    ← Still 200, but body shows degraded
{
  "status": "Degraded",
  "totalDuration": "00:00:00.567",
  "entries": {
    "database": {
      "status": "Healthy",
      "duration": "00:00:00.045"
    },
    "cache": {
      "status": "Degraded",
      "description": "Redis cache: High latency (150ms avg)",
      "duration": "00:00:00.456",
      "data": {
        "latencyMs": 150,
        "threshold": 50,
        "recommendation": "Check Redis server performance"
      }
    },
    "outbox": {
      "status": "Degraded",
      "description": "Outbox backlog elevated: 800 events",
      "data": {
        "pendingEvents": 800,
        "threshold": 1000,
        "recommendation": "Scale outbox relay workers"
      }
    }
  }
}
- Unhealthy Response
GET /health/ready
HTTP/1.1 503 Service Unavailable
{
  "status": "Unhealthy",
  "totalDuration": "00:00:03.000",
  "entries": {
    "database": {
      "status": "Unhealthy",
      "description": "Azure SQL connection failed: Timeout",
      "duration": "00:00:03.000",
      "exception": "System.Data.SqlClient.SqlException: Timeout expired...",
      "data": {
        "error": "Connection timeout",
        "action": "Check database availability and connection string"
      }
    },
    "cache": {
      "status": "Healthy",
      "duration": "00:00:00.012"
    }
  }
}
Code Examples: - Response format examples (all states) - JSON schema - Response writer customization
Diagrams: - Response format structure - Status code mapping
Deliverables: - Response format specification - Schema definitions - Writer implementations
Topic 16: Minimal vs. Detailed Responses¶
What will be covered: - Minimal Response (Public/Load Balancer) - Detailed Response (Internal/Debugging) - Authenticated Access (Detailed info) - PII Redaction in responses
Code Examples: - Response filtering - Authentication integration
Deliverables: - Response strategy guide
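One way to approach the minimal-versus-detailed split is a single response writer that chooses the body shape from the caller's authentication state. The sketch below is illustrative only: `HealthResponseShaper` and its method names are hypothetical, and it assumes that leaving out `exception` and `data` is an acceptable first pass at redaction (descriptions may still need review for PII).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical helper, not part of ASP.NET Core.
public static class HealthResponseShaper
{
    // Minimal body for public callers and load balancers: overall status only.
    public static Dictionary<string, object> Minimal(HealthReport report) =>
        new Dictionary<string, object> { ["status"] = report.Status.ToString() };

    // Detailed body for authenticated internal callers. Exceptions and the
    // free-form data bag are deliberately omitted (assumed redaction policy).
    public static Dictionary<string, object> Detailed(HealthReport report) =>
        new Dictionary<string, object>
        {
            ["status"] = report.Status.ToString(),
            ["totalDuration"] = report.TotalDuration.ToString(),
            ["entries"] = report.Entries.ToDictionary(
                e => e.Key,
                e => (object)new Dictionary<string, object>
                {
                    ["status"] = e.Value.Status.ToString(),
                    ["description"] = e.Value.Description,
                    ["duration"] = e.Value.Duration.ToString()
                })
        };

    // Response writer: detailed only when the request is authenticated.
    public static Task WriteAsync(HttpContext context, HealthReport report)
    {
        bool detailed = context.User.Identity?.IsAuthenticated == true;
        return context.Response.WriteAsJsonAsync(
            detailed ? Detailed(report) : Minimal(report));
    }
}
```

It could be attached by setting `ResponseWriter = HealthResponseShaper.WriteAsync` on the `HealthCheckOptions` passed to `MapHealthChecks`. A stricter variant might check a specific authorization policy instead of plain authentication.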
CYCLE 9: Degraded State Handling (~3,000 lines)¶
Topic 17: Degraded vs. Unhealthy¶
What will be covered: - When to Return Degraded
Degraded (Service still functional, but impaired):
- Non-critical dependency unavailable
- Performance below target (but above minimum)
- Backlog elevated (but not critical)
- Cache miss rate high (but queries still work)
- Replica lag high (but within tolerance)
Examples:
- Redis cache down, but database queries work (slower)
- Projection lag 15s (target: 5s, critical: 30s)
- Outbox backlog 800 events (threshold: 1000)
- Search index unavailable (fallback to SQL)
Response:
- HTTP 200 OK (keep in load balancer)
- Body: "status": "Degraded"
- Kubernetes: Pod stays in service
- Monitoring: Alert (warning severity)
---
Unhealthy (Service cannot fulfill requests):
- Critical dependency unavailable
- Cannot process requests safely
- Data integrity at risk
- Compliance violation
Examples:
- Database connection failed (cannot persist)
- Service Bus connection failed (cannot publish events)
- KMS unavailable (cannot sign/verify)
- All backend services down (Gateway)
Response:
- HTTP 503 Service Unavailable
- Body: "status": "Unhealthy"
- Kubernetes: Pod removed from service
- Monitoring: Alert (critical severity)
- Graceful Degradation Patterns
// Query service with optional search
public class QueryServiceHealthCheck : IHealthCheck
{
    private readonly IDbConnection _database;
    private readonly ISearchClient _searchClient;
    private readonly ILogger _logger;
    // CheckDatabaseAsync and CheckSearchAsync are service-specific probes (not shown)
    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        // Critical: Database must be healthy
        var dbHealthy = await CheckDatabaseAsync();
        if (!dbHealthy)
        {
            return HealthCheckResult.Unhealthy("Database unavailable");
        }
        // Optional: Search enhances experience but not required
        var searchHealthy = await CheckSearchAsync();
        if (!searchHealthy)
        {
            _logger.LogWarning("Search unavailable, degraded mode");
            return HealthCheckResult.Degraded(
                "Database healthy, search unavailable (fallback to SQL)");
        }
        return HealthCheckResult.Healthy("All dependencies healthy");
    }
}
- Degraded Mode Indicators
- Response headers: X-Service-Mode: Degraded
- Metrics: service_degraded{reason="cache-miss"} 1
- Logs: Warning-level degradation notices
Code Examples: - Degraded vs. unhealthy logic - Graceful degradation patterns - Mode indicators
Diagrams: - State transition diagram - Degradation flow
Deliverables: - Degradation guide - Pattern library - Mode handling procedures
Topic 18: Dependency Criticality Matrix¶
What will be covered: - Critical vs. Optional Dependencies
| Dependency | Critical? | Failure Status | Reason |
|---|---|---|---|
| Azure SQL (Ingestion) | ✅ Yes | Unhealthy | Cannot persist audit records |
| Service Bus (Ingestion) | ✅ Yes | Unhealthy | Cannot publish events |
| Redis Cache (Query) | ❌ No | Degraded | Slower queries, but functional |
| Search Index (Query) | ❌ No | Degraded | Fallback to SQL queries |
| Policy Service (Ingestion) | ✅ Yes | Unhealthy | Cannot classify/retain |
| KMS (Integrity) | ✅ Yes | Unhealthy | Cannot sign/verify |
| Blob Storage WORM (Ingestion) | ✅ Yes | Unhealthy | Cannot store evidence |
Code Examples: - Criticality configuration - Failure status mapping
Deliverables: - Dependency matrix - Configuration guide
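The criticality matrix could be carried into registration through the `failureStatus` argument of `AddCheck<T>`. The following is a minimal sketch under stated assumptions: `CriticalityRegistration`, `AddDependencyCheck`, and the "critical"/"optional" tag names are hypothetical. Note that `failureStatus` is applied by the framework when a check throws or times out; a check that returns `HealthCheckResult.Unhealthy(...)` explicitly would need to use `context.Registration.FailureStatus` instead for the mapping to take effect.

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical helper that maps the "Critical?" column to a failure status.
public static class CriticalityRegistration
{
    // Critical dependencies fail as Unhealthy; optional ones only degrade.
    public static HealthStatus FailureStatusFor(bool critical) =>
        critical ? HealthStatus.Unhealthy : HealthStatus.Degraded;

    // Registers a check with the failure status and a tag derived from criticality.
    public static IHealthChecksBuilder AddDependencyCheck<TCheck>(
        this IHealthChecksBuilder builder, string name, bool critical)
        where TCheck : class, IHealthCheck =>
        builder.AddCheck<TCheck>(
            name,
            failureStatus: FailureStatusFor(critical),
            tags: new[] { critical ? "critical" : "optional" });
}
```

A registration might then read `services.AddHealthChecks().AddDependencyCheck<RedisCacheHealthCheck>("cache", critical: false)`, where `RedisCacheHealthCheck` stands in for whichever check class the service defines.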
CYCLE 10: Startup Warmup & Grace Periods (~3,000 lines)¶
Topic 19: Startup Warmup Pattern¶
What will be covered: - StartupWarmupGate
public class StartupWarmupGate
{
private readonly TimeSpan _warmupDuration;
private readonly DateTime _startTime;
private volatile bool _isReady; // written by startup code, read by probe threads
public StartupWarmupGate(IConfiguration configuration)
{
_warmupDuration = TimeSpan.FromSeconds(
configuration.GetValue<int>("Microservice:StartupWarmupSeconds", 30));
_startTime = DateTime.UtcNow;
_isReady = false;
}
public bool IsReady
{
get
{
if (_isReady) return true;
if (DateTime.UtcNow - _startTime >= _warmupDuration)
{
_isReady = true;
return true;
}
return false;
}
}
public void MarkReady()
{
_isReady = true;
}
}
// Health check
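public class StartupWarmupHealthCheck : IHealthCheck
{
private readonly StartupWarmupGate _gate;
public StartupWarmupHealthCheck(StartupWarmupGate gate)
{
_gate = gate;
}
public Task<HealthCheckResult> CheckHealthAsync(
HealthCheckContext context, CancellationToken cancellationToken = default)
{
return Task.FromResult(
_gate.IsReady
? HealthCheckResult.Healthy("Warmup complete")
: HealthCheckResult.Unhealthy("Warming up..."));
}
}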
// Usage in startup
public static async Task Main(string[] args)
{
var host = CreateHostBuilder(args).Build();
// Run migrations, warm cache, etc.
await host.RunMigrationsAsync();
await host.WarmCacheAsync();
// Mark ready
host.Services.GetRequiredService<StartupWarmupGate>().MarkReady();
await host.RunAsync();
}
- Initialization Tasks
- Database migrations (FluentMigrator)
- Configuration validation
- Cache warming (policy cache, tenant metadata)
- External service connectivity verification
Code Examples: - Warmup gate implementation - Initialization task orchestration - Startup optimization
Deliverables: - Startup warmup guide - Task coordination - Optimization techniques
Topic 20: Startup Probe Configuration¶
What will be covered: - Kubernetes Startup Probe - Failure Threshold Calculation - Startup Time Monitoring - Slow Startup Troubleshooting
Code Examples: - Probe configuration - Timing analysis
Deliverables: - Startup probe guide - Timing recommendations
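For the failure threshold calculation, Kubernetes gives a container roughly `failureThreshold × periodSeconds` (plus any `initialDelaySeconds`) to pass its startup probe before restarting it. A small sketch of that arithmetic follows; `StartupProbeMath` is a hypothetical helper, the inputs are illustrative, and it ignores `initialDelaySeconds` for simplicity.

```csharp
using System;

// Hypothetical helper for sizing a startup probe.
public static class StartupProbeMath
{
    // Smallest failureThreshold such that failureThreshold * periodSeconds
    // covers the worst-case startup time.
    public static int FailureThreshold(int worstCaseStartupSeconds, int periodSeconds)
    {
        if (periodSeconds <= 0)
            throw new ArgumentOutOfRangeException(nameof(periodSeconds));
        if (worstCaseStartupSeconds < 0)
            throw new ArgumentOutOfRangeException(nameof(worstCaseStartupSeconds));

        // Integer ceiling division; at least one attempt is always allowed.
        return Math.Max(1, (worstCaseStartupSeconds + periodSeconds - 1) / periodSeconds);
    }
}
```

For example, a service that can take up to 120 seconds to start and is probed every 5 seconds would need `FailureThreshold(120, 5)`, which returns 24. In practice some headroom above the measured worst case is usually worth adding.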
CYCLE 11: Health Check Caching (~2,500 lines)¶
Topic 21: Response Caching Strategy¶
What will be covered: - Why Cache Health Check Responses?
Problem:
- Health checks run frequently (every 5-10s by K8s)
- Each check queries dependencies (DB, cache, bus)
- High load on dependencies from health checks alone
- Example: 100 pods × 2 checks/sec = 200 dependency queries/sec
Solution:
- Cache health check results briefly (5-10s)
- Reduce dependency load by 90%+
- Still fresh enough for K8s probes
Trade-off:
- Slightly stale health status (max 10s old)
- Acceptable for most use cases
- Critical checks can bypass cache
- Implementing Health Check Caching
public class CachedHealthCheck : IHealthCheck
{
    private readonly IHealthCheck _innerCheck;
    private readonly IMemoryCache _cache;
    private readonly TimeSpan _cacheDuration;
    private readonly string _cacheKey;
    public CachedHealthCheck(
        IHealthCheck innerCheck,
        IMemoryCache cache,
        string name,
        TimeSpan cacheDuration)
    {
        _innerCheck = innerCheck;
        _cache = cache;
        _cacheKey = $"HealthCheck:{name}";
        _cacheDuration = cacheDuration;
    }
    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        // Try cache first
        if (_cache.TryGetValue(_cacheKey, out HealthCheckResult cachedResult))
        {
            return cachedResult;
        }
        // Cache miss, run actual check
        var result = await _innerCheck.CheckHealthAsync(context, cancellationToken);
        // Cache result
        _cache.Set(_cacheKey, result, _cacheDuration);
        return result;
    }
}
// Registration (AddCheck takes the registration name first, then the instance)
services.AddHealthChecks()
    .AddCheck(
        "database",
        new CachedHealthCheck(
            new SqlServerHealthCheck(connectionString),
            cache,
            name: "database",
            cacheDuration: TimeSpan.FromSeconds(10)));
- Conditional Caching
- Cache Healthy results for the full TTL (failures are never cached)
- Shorter TTL for Degraded (5s)
- No caching for Unhealthy (immediate detection)
- Cache Invalidation
- Automatic TTL expiration
- Manual invalidation on config changes
- Invalidation on deployment
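The conditional caching rules above (full TTL for Healthy, a shorter TTL for Degraded, nothing cached for Unhealthy) could be sketched as a variant of the caching wrapper. This is an illustration, not a definitive implementation: `ConditionallyCachedHealthCheck` and `TtlFor` are hypothetical names, and the 10s and 5s values are taken from the figures mentioned in this topic.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Memory;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical wrapper: caches results with a TTL that depends on the status.
public sealed class ConditionallyCachedHealthCheck : IHealthCheck
{
    private readonly IHealthCheck _inner;
    private readonly IMemoryCache _cache;
    private readonly string _cacheKey;

    public ConditionallyCachedHealthCheck(IHealthCheck inner, IMemoryCache cache, string name)
    {
        _inner = inner;
        _cache = cache;
        _cacheKey = $"HealthCheck:{name}";
    }

    // TTL per status; null means "do not cache".
    public static TimeSpan? TtlFor(HealthStatus status) => status switch
    {
        HealthStatus.Healthy => (TimeSpan?)TimeSpan.FromSeconds(10),
        HealthStatus.Degraded => TimeSpan.FromSeconds(5),
        _ => null // Unhealthy is always re-evaluated on the next probe
    };

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        if (_cache.TryGetValue(_cacheKey, out HealthCheckResult cached))
        {
            return cached;
        }

        var result = await _inner.CheckHealthAsync(context, cancellationToken);

        var ttl = TtlFor(result.Status);
        if (ttl.HasValue)
        {
            _cache.Set(_cacheKey, result, ttl.Value);
        }
        return result;
    }
}
```

Because Unhealthy results are never stored, a failing dependency is probed on every request, which matches the goal of immediate detection but also means the load reduction only applies while the dependency is up. Concurrent cache misses are not coalesced in this sketch.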
Code Examples: - Cached health check implementation - Conditional caching logic - Cache invalidation
Diagrams: - Caching flow - TTL strategy
Deliverables: - Caching implementation - Strategy guide - Invalidation procedures
Topic 22: Health Check Performance¶
What will be covered: - Performance Optimization - Parallel Execution - Timeout Configuration - Circuit Breakers for Health Checks
Code Examples: - Performance optimization - Parallel check execution
Deliverables: - Performance guide - Optimization techniques
CYCLE 12: Multi-Tenant Health Isolation (~3,000 lines)¶
Topic 23: Tenant-Aware Health Checks¶
What will be covered: - Per-Tenant Health Indicators
public class TenantQuotaHealthCheck : IHealthCheck
{
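private readonly ITenantQuotaService _quotaService;
public TenantQuotaHealthCheck(ITenantQuotaService quotaService)
{
_quotaService = quotaService;
}
public async Task<HealthCheckResult> CheckHealthAsync(
HealthCheckContext context, CancellationToken cancellationToken = default)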
{
// Check if any tenant exceeding quota critically
var tenantsOverQuota = await _quotaService.GetTenantsExceedingQuotaAsync(
threshold: 0.95); // 95% of quota
if (tenantsOverQuota.Count == 0)
{
return HealthCheckResult.Healthy("All tenants within quota");
}
var criticalTenants = tenantsOverQuota.Count(t => t.Utilization > 1.0);
if (criticalTenants == 0)
{
return HealthCheckResult.Degraded(
$"{tenantsOverQuota.Count} tenants approaching quota limits");
}
return HealthCheckResult.Unhealthy(
$"{criticalTenants} tenants exceeded quota (ingestion throttled)");
}
}
- Tenant Isolation in Health Responses
- Aggregate tenant health (no individual tenant details in public response)
- Detailed tenant health (admin endpoint, authenticated)
- Tenant quota monitoring
- Per-tenant SLO tracking
Code Examples: - Tenant-aware health checks - Quota monitoring - Aggregation logic
Diagrams: - Tenant health aggregation - Quota monitoring flow
Deliverables: - Tenant health guide - Quota monitoring - Isolation procedures
Topic 24: Regional Health Aggregation¶
What will be covered: - Cross-Region Health - Multi-Cluster Health - Global Health Dashboard
Code Examples: - Regional aggregation
Deliverables: - Regional health guide
CYCLE 13: Health Check Testing (~2,500 lines)¶
Topic 25: Testing Health Checks¶
What will be covered: - Unit Testing Health Checks
[TestClass]
public class DatabaseHealthCheckTests
{
[TestMethod]
public async Task Should_ReturnHealthy_WhenDatabaseConnected()
{
// Arrange
var healthCheck = new SqlServerHealthCheck(validConnectionString);
// Act
var result = await healthCheck.CheckHealthAsync(new HealthCheckContext());
// Assert
Assert.AreEqual(HealthStatus.Healthy, result.Status);
Assert.IsTrue(result.Description.Contains("connected"));
}
[TestMethod]
public async Task Should_ReturnUnhealthy_WhenDatabaseUnavailable()
{
// Arrange
var healthCheck = new SqlServerHealthCheck(invalidConnectionString);
// Act
var result = await healthCheck.CheckHealthAsync(new HealthCheckContext());
// Assert
Assert.AreEqual(HealthStatus.Unhealthy, result.Status);
Assert.IsNotNull(result.Exception);
}
[TestMethod]
public async Task Should_ReturnDegraded_WhenConnectionPoolNearLimit()
{
// Arrange
var healthCheck = new SqlConnectionPoolHealthCheck(connectionString);
// Simulate high connection usage
var connections = CreateManyConnections(95); // 95% of pool
// Act
var result = await healthCheck.CheckHealthAsync(new HealthCheckContext());
// Assert
Assert.AreEqual(HealthStatus.Degraded, result.Status);
// Cleanup
CloseConnections(connections);
}
}
- Integration Testing
[TestMethod]
public async Task HealthEndpoint_Should_ReturnHealthy_WhenAllDependenciesUp()
{
    // Arrange
    var client = _testServer.CreateClient();
    // Act
    var response = await client.GetAsync("/health/ready");
    // Assert
    Assert.AreEqual(HttpStatusCode.OK, response.StatusCode);
    var content = await response.Content.ReadAsStringAsync();
    // Read the body as JSON: HealthReport.Status is an enum and the type
    // is not meant to be deserialized directly
    using var json = JsonDocument.Parse(content);
    var root = json.RootElement;
    Assert.AreEqual("Healthy", root.GetProperty("status").GetString());
    Assert.IsTrue(root.GetProperty("entries").EnumerateObject()
        .All(e => e.Value.GetProperty("status").GetString() == "Healthy"));
}
- Acceptance Testing (Reqnroll)
Feature: Health Check Endpoints
  As an operator
  I want to monitor service health
  So that I can detect and resolve issues quickly
  Scenario: Liveness check returns healthy when service is running
    Given the Ingestion service is running
    When I request "/health/live"
    Then the response status code should be 200
    And the response body should contain "Healthy"
  Scenario: Readiness check returns unhealthy when database is down
    Given the Ingestion service is running
    And the database is unavailable
    When I request "/health/ready"
    Then the response status code should be 503
    And the response body should contain "Unhealthy"
    And the database check should show "Unhealthy"
  Scenario: Readiness check returns degraded when cache is slow
    Given the Ingestion service is running
    And the Redis cache has high latency
    When I request "/health/ready"
    Then the response status code should be 200
    And the response body should contain "Degraded"
    And the cache check should show "Degraded"
Code Examples: - Complete test suites (unit, integration, acceptance) - Test helpers and mocks - CI/CD integration
Diagrams: - Test architecture - Test coverage
Deliverables: - Testing guide - Test templates - CI/CD integration
Topic 26: Chaos Testing for Health Checks¶
What will be covered: - Fault Injection - Dependency Failure Simulation - Recovery Validation - Probe Behavior Under Load
Code Examples: - Chaos scenarios - Validation procedures
Deliverables: - Chaos testing guide
CYCLE 14: Load Balancer Integration (~3,000 lines)¶
Topic 27: Azure Load Balancer Health Probes¶
What will be covered: - Azure LB Health Probe Configuration
# Azure Load Balancer (for AKS services)
kind: Service
apiVersion: v1
metadata:
name: ingestion-svc
namespace: atp-ingest-ns
annotations:
# Health probe configuration
service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path: "/health/ready"
service.beta.kubernetes.io/azure-load-balancer-health-probe-protocol: "http"
service.beta.kubernetes.io/azure-load-balancer-health-probe-interval: "10"
service.beta.kubernetes.io/azure-load-balancer-health-probe-num-of-probe: "3"
spec:
type: LoadBalancer
selector:
app: ingestion
ports:
- port: 80
targetPort: 8080
- Azure Front Door Health Probes
- API Management Health Checks
- Application Gateway Health Probes
Code Examples: - LB probe configurations - Multi-layer health checks
Deliverables: - Load balancer integration guide
Topic 28: Service Mesh Health Checks¶
What will be covered: - Istio Health Checks - Linkerd Health Probes - Envoy Health Endpoints
Code Examples: - Mesh integration
Deliverables: - Service mesh health guide
CYCLE 15: Health Check Troubleshooting (~3,500 lines)¶
Topic 29: Common Health Check Issues¶
What will be covered: - Problem: Readiness Check Always Failing
Symptoms:
- Pod shows 0/1 READY
- Pod never receives traffic
- Logs show repeated health check failures
Diagnosis:
# Check pod status
kubectl get pods -n atp-ingest-ns
# NAME READY STATUS RESTARTS AGE
# ingestion-7d8f6c9b4-abc123 0/1 Running 0 5m
# Describe pod
kubectl describe pod ingestion-7d8f6c9b4-abc123 -n atp-ingest-ns
# Events:
# Readiness probe failed: HTTP probe failed with statuscode: 503
# Check health endpoint directly
kubectl port-forward ingestion-7d8f6c9b4-abc123 8080:8080 -n atp-ingest-ns
curl http://localhost:8080/health/ready
# Response shows which dependency failed:
{
"status": "Unhealthy",
"entries": {
"database": {
"status": "Unhealthy",
"description": "Connection timeout"
}
}
}
Common Causes:
1. Database not accessible (network policy, firewall)
2. Wrong connection string (secret not mounted)
3. Dependency service down
4. Health check timeout too short
5. Startup not complete (use startup probe)
Solutions:
# 1. Check database connectivity
kubectl run -it --rm debug --image=mcr.microsoft.com/mssql-tools \
--restart=Never -n atp-ingest-ns -- /bin/bash
sqlcmd -S <server> -U <user> -P <password> -Q "SELECT 1"
# 2. Check secrets mounted
kubectl exec ingestion-7d8f6c9b4-abc123 -n atp-ingest-ns -- \
ls -la /mnt/secrets
# 3. Check network policy
kubectl get networkpolicies -n atp-ingest-ns
# 4. Increase timeout
kubectl edit deployment ingestion -n atp-ingest-ns
# Change: timeoutSeconds: 3 → timeoutSeconds: 10
# 5. Add startup probe
# (see startup probe configuration)
- Problem: CrashLoopBackOff Due to Liveness Failure
Symptoms:
- Pod continuously restarting
- STATUS shows CrashLoopBackOff
- Liveness probe failing
Diagnosis:
kubectl logs ingestion-7d8f6c9b4-abc123 -n atp-ingest-ns --previous
Common Causes:
1. Application deadlock
2. Liveness check too aggressive (short timeout/period)
3. Application startup slow (no startup probe)
4. Liveness check has bug (throws exception)
Solutions:
# 1. Add startup probe (delay liveness)
# 2. Increase liveness timeout/threshold
# 3. Fix application deadlock
# 4. Review liveness check code
- Problem: Flapping Readiness (Ready → Not Ready → Ready)
Symptoms:
- Pod oscillates between ready/not ready
- Traffic intermittent
- Logs show dependency timeouts
Common Causes:
1. Dependency intermittently slow
2. Health check timeout too strict
3. Network issues
4. Health check not idempotent (side effects)
Solutions:
# 1. Increase successThreshold (require 2-3 consecutive successes)
# 2. Increase timeout
# 3. Add circuit breaker to dependency client
# 4. Review health check for side effects
Code Examples: - Troubleshooting procedures (10+ scenarios) - Diagnostic commands - Resolution scripts
Diagrams: - Troubleshooting decision tree - Common failure patterns
Deliverables: - Troubleshooting guide - Diagnostic procedures - Fix library
Topic 30: Health Check Debugging¶
What will be covered: - Debug Endpoints
// Detailed health endpoint (authenticated, internal only)
app.MapHealthChecks("/health/debug", new HealthCheckOptions
{
ResponseWriter = async (context, report) =>
{
context.Response.ContentType = "application/json";
var result = new
{
status = report.Status.ToString(),
totalDuration = report.TotalDuration,
entries = report.Entries.Select(e => new
{
name = e.Key,
status = e.Value.Status.ToString(),
description = e.Value.Description,
duration = e.Value.Duration,
exception = e.Value.Exception?.ToString(),
data = e.Value.Data,
tags = e.Value.Tags
}),
timestamp = DateTime.UtcNow,
hostName = Environment.MachineName,
podName = Environment.GetEnvironmentVariable("HOSTNAME"),
podIp = context.Connection.LocalIpAddress?.ToString()
};
await context.Response.WriteAsJsonAsync(result);
}
})
.RequireAuthorization("AdminOnly");
- Health Check Logging
- Health Check Metrics
- Distributed Tracing for Health Checks
Code Examples: - Debug endpoints - Logging integration - Tracing
Deliverables: - Debugging guide - Diagnostic tools
CYCLE 16: Best Practices & Governance (~3,000 lines)¶
Topic 31: Health Check Best Practices¶
What will be covered: - Design Best Practices
✅ DO:
- Keep checks fast (<5s total)
- Check critical dependencies only
- Return detailed information in responses
- Use tags for logical grouping
- Cache results appropriately
- Implement all three probe types (live, ready, startup)
- Use Degraded for non-critical failures
- Test health checks thoroughly
- Document expected check behavior
- Monitor health check performance
❌ DON'T:
- Run expensive operations (full scans, backups)
- Modify state (writes, deletes)
- Include PII in responses
- Fail on optional dependencies
- Use static thresholds without monitoring
- Ignore degraded state
- Forget to test failure scenarios
- Couple health checks to business logic
- Use high-cardinality labels
- Expose detailed health publicly (security risk)
- ATP Health Check Standards
- All services must implement /health/live, /health/ready, /health/startup
- All critical dependencies must have health checks
- Health check duration must be <5s (P95)
- Detailed responses must not include PII
- Failed dependencies must log errors
- Health metrics must be exported
Code Examples: - Best practice implementations - Anti-patterns to avoid
Deliverables: - Best practices guide - Standards document - Anti-patterns catalog
Topic 32: Health Check Governance¶
What will be covered: - Health Check Review Process
- New health check PR review checklist
- Performance testing required
- Security review for public endpoints
- Documentation requirements
- Health Check Lifecycle
- Creation: When adding new dependency
- Updates: When dependency changes
- Deprecation: When dependency removed
- Monitoring: Continuous monitoring of the health checks themselves
- Compliance Requirements
- Health audit trail (who accessed detailed health)
- No PII in health responses
- Secure admin endpoints
- Health check versioning
Code Examples: - Governance procedures - Review checklists - Compliance validation
Deliverables: - Governance guide - Review procedures - Compliance checklist
Summary of Deliverables¶
Across all 16 cycles, this documentation will provide:
- Health Check Foundations
  - Fundamentals (liveness, readiness, startup)
  - ASP.NET Core Health Checks integration
  - Industry standards and formats
- Kubernetes Integration
  - Probe configuration (all three types)
  - Failure scenarios and recovery
  - Timing optimization
- Custom Health Checks
  - IHealthCheck implementation patterns
  - ATP-specific custom checks (10+ checks)
  - Service-specific configurations (all 8 services)
- Dependency Checks
  - Database health checks (SQL, Cosmos, MongoDB)
  - Messaging health checks (Service Bus, MassTransit)
  - Storage health checks (Blob, WORM)
  - Cache health checks (Redis, in-memory)
  - KMS health checks (Key Vault, signing operations)
  - External service checks (HTTP endpoints)
- Monitoring & UI
  - Health Check UI (ASP.NET Core)
  - Prometheus metrics export
  - Grafana dashboards
  - Application Insights integration
- Response Formats
  - Healthy, Degraded, Unhealthy responses
  - Detailed vs. minimal responses
  - JSON schema and standards
  - Custom response writers
- Degraded State
  - Degraded vs. unhealthy criteria
  - Graceful degradation patterns
  - Dependency criticality matrix
- Startup & Warmup
  - StartupWarmupGate pattern
  - Initialization task orchestration
  - Startup probe configuration
  - Grace period handling
- Performance
  - Response caching (5-10s TTL)
  - Parallel execution
  - Timeout configuration
  - Load optimization
- Multi-Tenancy
  - Tenant-aware health indicators
  - Quota monitoring
  - Regional health aggregation
- Testing
  - Unit testing health checks
  - Integration testing endpoints
  - Acceptance testing scenarios
  - Chaos testing (fault injection)
- Integration
  - Azure Load Balancer probes
  - Azure Front Door health checks
  - API Management health
  - Service mesh integration
- Operations
  - Troubleshooting common issues (10+ scenarios)
  - Debugging tools and techniques
  - Performance analysis
- Governance
  - Best practices (10+ do's and don'ts)
  - Standards and conventions
  - Review process
  - Compliance requirements
Related Documentation¶
- Runbook: Operational procedures using health checks
- Alerts & SLOs: Health metrics in SLO calculations
- Kubernetes: Probe configuration and pod lifecycle
- Observability: Health check metrics and monitoring
- Template Integration: Health checks in microservice template
- Configuration: Health check configuration options
- Testing Strategy: Health check testing approaches
- Progressive Rollout: Health validation during deployments
This guide covers implementation and operations for ATP service health monitoring: ASP.NET Core Health Checks fundamentals, Kubernetes probe integration, custom health checks for all dependencies, Health Check UI, caching strategies, multi-tenant health isolation, testing, load balancer integration, troubleshooting, and governance. The goal is predictable, observable, self-healing services with fast failure detection and automatic recovery.