Observability Strategy & Practices¶
Purpose & Scope¶
Purpose: Strategic guide for using observability effectively in ATP, focusing on debugging workflows, troubleshooting patterns, and observability-driven development practices. This document complements the implementation details in monitoring.md.
Scope: This document covers:
- Observability Philosophy: How to think about observability, asking the right questions, observability-driven design
- Debugging Workflows: Practical troubleshooting scenarios using traces, logs, and metrics together
- Correlation Patterns: How to trace requests across services, messages, and background jobs
- Performance Investigation: Using observability to identify and fix performance issues
- Security Observability: Detecting threats, anomalies, and unauthorized access
- Observability Maturity: Evolving from basic monitoring to advanced observability practices
- Context Propagation: Ensuring tenant context, correlation IDs, and trace context flow end-to-end
- Observability for Compliance: Using telemetry for audit trails and regulatory evidence
Audience: Developers, SREs, incident responders, platform engineers, architects
Relationship to Other Documents:
- Implementation: See monitoring.md for OpenTelemetry setup, Serilog configuration, Prometheus/Grafana, Azure Monitor integration
- Operations: See runbook.md for operational procedures using observability data
- Alerts: See alerts-slos.md for alerting strategy and SLO definitions
- Architecture: See ../architecture/architecture.md for observability requirements
Table of Contents¶
- Observability Philosophy
- Asking the Right Questions
- Debugging Workflows
- Correlation & Context
- Performance Investigation
- Security Observability
- Observability-Driven Development
- Observability Maturity Model
- Troubleshooting Scenarios
- Best Practices
Observability Philosophy¶
What is Observability?¶
Observability is the ability to understand the internal state of a system by examining its external outputs (logs, metrics, traces). It goes beyond monitoring by enabling exploratory investigation of unknown issues.
Monitoring vs. Observability:
| Monitoring | Observability |
|---|---|
| Known unknowns: Pre-defined dashboards, alerts for expected failure modes | Unknown unknowns: Ad-hoc exploration, debugging unexpected issues |
| "Is X broken?" | "Why is X broken?" |
| Static thresholds | Dynamic queries |
| Reactive (alerts fire) | Proactive (explore before issues escalate) |
| Good for: Health checks, SLOs | Good for: Debugging, optimization, learning |
ATP Uses Both:
- Monitoring: Health checks, SLO dashboards, alerting
- Observability: Distributed tracing, structured logs, ad-hoc metrics queries
Three Pillars Working Together¶
┌─────────────────────────────────────────────────────────┐
│ OBSERVABILITY STACK │
├─────────────────────────────────────────────────────────┤
│ │
│ METRICS (What) → "P95 latency is 2.5s" │
│ ↓ │
│ TRACES (Why) → "Database query took 2.2s" │
│ ↓ │
│ LOGS (Details) → "Connection pool exhausted, waited..."│
│ │
└─────────────────────────────────────────────────────────┘
Example Investigation Flow:
- Metrics Alert: ingest_latency_p95 > 1s fires
- Check Metrics Dashboard: Latency spike started at 14:30 UTC
- Find Trace: Search for slow ingestion requests around 14:30
- Examine Trace: Policy evaluation span shows 900ms delay
- Check Logs: Policy service logs show "Connection pool exhausted"
- Root Cause: Policy service connection pool too small for load
- Fix: Increase connection pool, add connection pool health check
Observability Principles¶
- Instrument Everything: Every service, every request, every operation
- Correlate Everything: Trace IDs, correlation IDs, tenant context in all telemetry
- Structure Everything: Structured logs, semantic metrics, well-named spans
- Query Everything: Make all telemetry queryable and explorable
- Retain Everything: Balance retention with cost (hot → cool → archive)
- Redact Sensitive Data: Never log PII, sanitize parameters, hash when needed
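The redaction principle can be sketched as a small sanitizer applied to log fields before they reach the logger. This is an illustrative Python sketch, not ATP code; the field names are hypothetical:

```python
import hashlib

# Fields that must never appear in telemetry in clear text (illustrative names).
REDACT_FIELDS = {"email", "ssn", "password"}
HASH_FIELDS = {"user_id"}  # hashed so records can still be correlated

def sanitize(fields: dict) -> dict:
    """Return a copy of the log fields safe to emit: secrets masked, IDs hashed."""
    safe = {}
    for key, value in fields.items():
        if key in REDACT_FIELDS:
            safe[key] = "[REDACTED]"
        elif key in HASH_FIELDS:
            safe[key] = hashlib.sha256(str(value).encode()).hexdigest()[:16]
        else:
            safe[key] = value
    return safe

print(sanitize({"email": "a@b.com", "user_id": "u-42", "tenant_id": "acme-corp"}))
```

In .NET, the same idea is typically implemented with a Serilog enricher or destructuring policy so the rule applies uniformly across services.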
Asking the Right Questions¶
The Observability Question Framework¶
Observability enables you to answer arbitrary questions about system behavior. Learn to ask:
- What happened? (Logs)
- How much? (Metrics)
- Why? (Traces)
- Who was affected? (Tenant context, correlation)
- When did it start? (Time-series analysis)
- Where is the problem? (Service-level attribution)
ATP-Specific Questions¶
Performance Questions:
- "Why is ingestion latency high for tenant acme-corp?"
- "What's the slowest operation in the query service?"
- "Which tenants are hitting rate limits?"

Reliability Questions:
- "Why did this ingestion request fail?"
- "What's causing the dead-letter queue growth?"
- "Which services are experiencing errors?"

Business Questions:
- "How many audit records were ingested per tenant this month?"
- "What's the projection lag by tenant edition?"
- "Which tenants are using the most storage?"

Security Questions:
- "Who accessed audit records for tenant xyz last week?"
- "Are there any failed authentication attempts?"
- "Did any cross-tenant data access occur?"
Question-to-Telemetry Mapping¶
| Question Type | Primary Telemetry | Secondary Telemetry |
|---|---|---|
| Why is X slow? | Traces (timing breakdown) | Metrics (percentiles), Logs (warnings) |
| How many X? | Metrics (counters) | Logs (aggregated counts) |
| What error occurred? | Logs (exceptions, stack traces) | Traces (error spans), Metrics (error rate) |
| Who did X? | Logs (actor context) | Traces (user attributes), Metrics (per-tenant) |
| When did X happen? | All (timestamp correlation) | Metrics (time-series), Traces (timeline) |
| Where is the issue? | Traces (service attribution) | Metrics (per-service), Logs (service tags) |
Debugging Workflows¶
Standard Debugging Workflow¶
flowchart TD
A[Alert/Issue Reported] --> B[Check Metrics Dashboard]
B --> C{Issue Visible in Metrics?}
C -->|Yes| D[Identify Time Window]
C -->|No| E[Search Logs for Error Messages]
D --> F[Find Traces in Time Window]
E --> F
F --> G[Examine Trace Breakdown]
G --> H[Check Logs for Detailed Context]
H --> I[Identify Root Cause]
I --> J[Implement Fix]
J --> K[Verify Fix with Observability]
Workflow 1: Slow Request Investigation¶
Scenario: Customer reports slow ingestion (5+ seconds)
Step 1: Check Metrics
# Query Prometheus for P95 latency spike
histogram_quantile(0.95,
rate(http_server_duration_seconds_bucket{
service="atp.ingestion",
route="/api/v1/ingest"
}[5m]))
Result: P95 latency is 4.8s (normally 200ms)
Step 2: Find Slow Traces
Jaeger Query:
- Service: atp.ingestion
- Operation: POST /api/v1/ingest
- Duration: > 4000ms
- Time: Last 1 hour
Step 3: Examine Trace
Trace: 4bf92f3577b34da6a3ce929d0e0e4736
Duration: 4.8s
├─ Gateway (50ms)
│ ├─ Authentication (10ms)
│ └─ Routing (5ms)
│
├─ Ingestion (4.7s) ← SLOW
│ ├─ Validation (20ms)
│ ├─ Policy.Evaluate (3.9s) ← BOTTLENECK
│ │ └─ Database.Query (3.6s) ← ROOT CAUSE
│ ├─ Database.Insert (200ms)
│ └─ Outbox.Append (100ms)
│
└─ Service Bus Publish (50ms)
Step 4: Check Logs for Context
// Log Analytics query
traces
| where timestamp > ago(1h)
| where traceId == "4bf92f3577b34da6a3ce929d0e0e4736"
| where message contains "database" or message contains "connection"
| project timestamp, message, severityLevel, customDimensions
Log Result:
[14:32:15.123] WARN PolicyService: Connection pool exhausted,
waited 3600ms for available connection.
Pool size: 10, Active: 10, Pending: 45
Step 5: Root Cause Identified
- Policy service connection pool too small (10 connections)
- Under load, connections exhausted
- Requests wait for available connections
Step 6: Fix
- Increase connection pool to 50
- Add connection pool metrics (active, pending, wait time)
- Add alert: connection_pool_pending > 10
Workflow 2: Error Rate Investigation¶
Scenario: Error rate spike (5% → 15% in 10 minutes)
Step 1: Check Error Metrics
# Error rate by service
sum(rate(http_server_requests_total{
result="error"
}[5m])) by (service)
/
sum(rate(http_server_requests_total[5m])) by (service)
Result: Ingestion service: 15% error rate
Step 2: Find Error Traces
Jaeger Query:
- Service: atp.ingestion
- Tags: error=true
- Time: Last 30 minutes
Step 3: Group Errors by Type
// Log Analytics - group errors by exception type
traces
| where timestamp > ago(30m)
| where severityLevel >= 3 // Error or Critical
| where cloud_RoleName == "atp.ingestion"
| extend ExceptionType = tostring(customDimensions.ExceptionType)
| summarize ErrorCount = count() by ExceptionType
| order by ErrorCount desc
Result:
ExceptionType | ErrorCount
---------------------------|-----------
SqlException | 234
TimeoutException | 12
ValidationException | 5
Step 4: Examine Specific Error
// Get detailed error for SqlException
traces
| where timestamp > ago(30m)
| where customDimensions.ExceptionType == "SqlException"
| where customDimensions.Message contains "timeout"
| project timestamp, message, customDimensions, exceptionDetails
| take 10
Result:
[14:35:22.456] ERROR IngestionService: Database timeout after 30s
Exception: System.Data.SqlClient.SqlException
Message: Timeout expired. The timeout period elapsed prior to
completion of the operation or the server is not responding.
Query: SELECT * FROM PolicyRules WHERE TenantId = @p0 AND Active = 1
Parameters: @p0 = 'acme-corp'
Step 5: Check Database Metrics
# Database query duration
histogram_quantile(0.95,
rate(db_client_duration_seconds_bucket{
service="atp.ingestion",
db_operation="SELECT"
}[5m]))
Result: P95 database query duration is 28s (normally 50ms)
Step 6: Root Cause Identified
- Database queries timing out
- PolicyRules table may be locked or missing an index
- High query volume causing contention
Step 7: Fix
- Add index on (TenantId, Active) for PolicyRules table
- Investigate table locks
- Consider caching policy rules
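The caching fix above can be sketched as a small TTL cache in front of the PolicyRules query, so hot tenants don't hit the database on every request. Illustrative Python; the loader and names are hypothetical:

```python
import time

class TtlCache:
    """Minimal per-tenant TTL cache to avoid re-querying policy rules on every request."""
    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self.clock()
        entry = self._entries.get(key)
        if entry and entry[0] > now:
            return entry[1]          # cache hit: no database round-trip
        value = loader(key)          # cache miss or expired: reload
        self._entries[key] = (now + self.ttl, value)
        return value

# Hypothetical loader standing in for the PolicyRules query.
calls = []
def load_rules(tenant_id):
    calls.append(tenant_id)
    return [f"rule-for-{tenant_id}"]

cache = TtlCache(ttl_seconds=30)
cache.get_or_load("acme-corp", load_rules)
cache.get_or_load("acme-corp", load_rules)  # served from cache
print(len(calls))  # 1 — the database was hit once
```

The TTL bounds staleness: a 30-second TTL means a rule change takes at most 30 seconds to propagate, which is usually acceptable for policy data.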
Workflow 3: Missing Data Investigation¶
Scenario: Tenant reports missing audit records
Step 1: Verify Ingestion
// Check if records were ingested for tenant
traces
| where timestamp > ago(24h)
| where customDimensions.TenantId == "acme-corp"
| where message contains "Ingestion.Complete" or message contains "Record ingested"
| summarize IngestedCount = count() by bin(timestamp, 1h)
Step 2: Check Query Service
// Check query service logs for the records
traces
| where timestamp > ago(24h)
| where customDimensions.TenantId == "acme-corp"
| where cloud_RoleName == "atp.query"
| where message contains "Query.Executed"
| project timestamp, message, customDimensions
Step 3: Check Projection Service
// Check if projection processed the records
traces
| where timestamp > ago(24h)
| where customDimensions.TenantId == "acme-corp"
| where cloud_RoleName == "atp.projection"
| where message contains "Projection.Updated"
| summarize ProjectedCount = count() by bin(timestamp, 1h)
Step 4: Check for Errors in Pipeline
// Find any errors in the pipeline
traces
| where timestamp > ago(24h)
| where customDimensions.TenantId == "acme-corp"
| where severityLevel >= 3
| where message contains "acme-corp"
| project timestamp, cloud_RoleName, message, customDimensions
| order by timestamp desc
Step 5: Verify Database State
-- Check actual database records
SELECT COUNT(*)
FROM AuditRecords
WHERE TenantId = 'acme-corp'
AND CreatedAt >= DATEADD(hour, -24, GETUTCDATE())
Step 6: Trace Specific Record
If the tenant provides a specific record ID:
// Trace specific record ID through pipeline
traces
| where timestamp > ago(7d)
| where customDimensions.AuditRecordId == "01HZX123456789"
| project timestamp, cloud_RoleName, message, customDimensions
| order by timestamp asc
Correlation & Context¶
Context Propagation Pattern¶
Context Flow:
Client Request
↓ (HTTP headers)
Gateway
↓ (traceparent, baggage)
Ingestion Service
↓ (traceparent, baggage)
Policy Service (HTTP)
↓ (traceparent, baggage)
Database (connection context)
↓ (traceparent in message headers)
Service Bus Message
↓ (traceparent, baggage)
Projection Consumer
↓ (traceparent)
Database (projection)
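At each hop above, the trace context travels in the W3C traceparent header, four dash-separated hex fields: version, trace ID, parent span ID, and flags. A minimal parser sketch in Python shows the fields involved:

```python
def parse_traceparent(header: str) -> dict:
    """Split a W3C traceparent header into its four dash-separated hex fields."""
    version, trace_id, span_id, flags = header.split("-")
    if len(trace_id) != 32 or len(span_id) != 16:
        raise ValueError("malformed traceparent")
    return {
        "version": version,
        "trace_id": trace_id,            # links all spans in one request flow
        "parent_span_id": span_id,       # the caller's span
        "sampled": bool(int(flags, 16) & 0x01),
    }

ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(ctx["trace_id"])  # 4bf92f3577b34da6a3ce929d0e0e4736
```

In practice the OpenTelemetry SDK handles this parsing and re-injection automatically; the sketch only shows what is on the wire.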
Required Context Attributes¶
Every telemetry record MUST include:
// Resource attributes (service-level, constant)
service.name = "atp.ingestion"
service.version = "1.2.3"
deployment.environment = "production"
cloud.provider = "azure"
cloud.region = "eastus"
// Span/Log attributes (request-level, variable)
trace.id = "4bf92f3577b34da6a3ce929d0e0e4736"
span.id = "00f067aa0ba902b7"
tenant.id = "acme-corp" // or tenant.class for metrics
tenant.edition = "enterprise"
correlation.id = "01HZX123456789" // Business correlation ID
audit.record.id = "01HZX987654321" // ATP-specific
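A lightweight guard can verify these attributes are present before a record is emitted, catching context-propagation bugs early. Python sketch; the required set below is a subset chosen for illustration:

```python
# Request-level attributes this document requires on every telemetry record
# (illustrative subset).
REQUIRED_ATTRIBUTES = {"trace.id", "span.id", "tenant.id", "correlation.id"}

def missing_attributes(record: dict) -> set:
    """Return the required attribute keys absent from a telemetry record."""
    return REQUIRED_ATTRIBUTES - record.keys()

record = {
    "trace.id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "span.id": "00f067aa0ba902b7",
    "tenant.id": "acme-corp",
}
print(missing_attributes(record))  # correlation.id was forgotten
```

A check like this fits naturally in a logging middleware or a CI test against sample telemetry.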
Correlation ID Strategy¶
Three Types of IDs:
1. Trace ID (OpenTelemetry):
   - Generated by gateway on request
   - Propagated via the traceparent header (W3C Trace Context)
   - Links all spans in a single request flow
   - Example: 4bf92f3577b34da6a3ce929d0e0e4736
2. Correlation ID (Business):
   - ULID for audit record or business entity
   - Can span multiple traces/requests
   - Used for business logic correlation
   - Example: 01HZX123456789 (ULID)
3. Request ID (Gateway):
   - Unique per HTTP request
   - Returned to client for support
   - May equal the trace ID or be separate
   - Example: req-20251030-abc123
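For the business correlation ID, a ULID gives a globally unique value whose leading bits are a timestamp, so IDs sort by creation time. A minimal sketch of ULID-style generation (production code would use a ULID library):

```python
import os
import time

CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

def new_correlation_id() -> str:
    """ULID-style ID: 48-bit millisecond timestamp + 80 random bits, Crockford base32."""
    value = (int(time.time() * 1000) << 80) | int.from_bytes(os.urandom(10), "big")
    chars = []
    for _ in range(26):          # 26 chars x 5 bits covers the 128-bit value
        chars.append(CROCKFORD[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))

cid = new_correlation_id()
print(len(cid))  # 26
```

The timestamp prefix is what makes ULIDs useful for correlation: sorting IDs lexicographically approximates sorting events by time.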
Cross-Service Correlation¶
HTTP Request Correlation:
// Gateway sets correlation context
var traceId = Activity.Current?.TraceId.ToString();
var correlationId = GenerateCorrelationId();
// Add to headers (automatic with OTel)
request.Headers.Add("X-Correlation-ID", correlationId);
// traceparent header added automatically by OTel SDK
// Downstream service receives and uses
var correlationId = HttpContext.Request.Headers["X-Correlation-ID"];
var traceId = Activity.Current?.TraceId.ToString();
// Log with both
_logger.LogInformation(
"Processing request with CorrelationId={CorrelationId}, TraceId={TraceId}",
correlationId, traceId);
Message Bus Correlation:
// Publisher sets correlation context
await _bus.Publish(new AuditAcceptedEvent
{
AuditRecordId = recordId,
TenantId = tenantId
}, context =>
{
// MassTransit automatically propagates trace context
// Also set business correlation ID
context.Headers.Set("X-Correlation-ID", recordId);
});
// Consumer receives and uses
public async Task Consume(ConsumeContext<AuditAcceptedEvent> context)
{
var correlationId = context.Headers.Get<string>("X-Correlation-ID");
var traceId = Activity.Current?.TraceId.ToString();
// All logs in this consumer will include correlation context
using (_logger.BeginScope(new Dictionary<string, object>
{
["CorrelationId"] = correlationId,
["TraceId"] = traceId
}))
{
_logger.LogInformation("Processing audit accepted event");
// Process event...
}
}
Performance Investigation¶
Latency Analysis Workflow¶
Step 1: Identify Latency Component
Total Request Time: 500ms
├─ Gateway: 20ms (4%)
├─ Authentication: 10ms (2%)
├─ Ingestion Service: 450ms (90%) ← FOCUS HERE
│ ├─ Validation: 15ms
│ ├─ Policy Evaluation: 400ms ← BOTTLENECK
│ ├─ Database Insert: 30ms
│ └─ Outbox Append: 5ms
└─ Response Serialization: 20ms (4%)
Step 2: Drill into Bottleneck
- Examine Policy Evaluation span details
- Check downstream calls (database, cache, external API)
- Look for lock contention, resource exhaustion
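The drill-down can be automated: given the parent span's duration and its children, compute which child dominates. Python sketch using the durations from Step 1:

```python
def find_bottleneck(total_ms: float, children: dict) -> tuple:
    """Return (name, share) of the child span contributing most of the parent's time."""
    name = max(children, key=children.get)
    return name, children[name] / total_ms

# Child span durations from the Step 1 breakdown (milliseconds).
spans = {
    "Validation": 15,
    "Policy.Evaluate": 400,
    "Database.Insert": 30,
    "Outbox.Append": 5,
}
name, share = find_bottleneck(500, spans)
print(f"{name}: {share:.0%} of request time")  # Policy.Evaluate: 80% of request time
```

The same comparison is what trace UIs like Jaeger render visually; scripting it is useful for batch analysis over many traces.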
Step 3: Check Resource Metrics
# Connection pool usage
db_connection_pool_active{service="atp.policy"}
db_connection_pool_pending{service="atp.policy"}
# Cache hit rate
cache_hit_rate{service="atp.policy", cache="policy-rules"}
# CPU/Memory
process_cpu_usage{service="atp.policy"}
process_memory_usage{service="atp.policy"}
Throughput Analysis¶
Questions to Answer:
- What's the current throughput (requests/second)?
- Is throughput limited by CPU, memory, network, or database?
- Which tenants/operations have the highest throughput?
- Is any throttling or rate limiting in effect?
Metrics to Examine:
# Request rate by service
sum(rate(http_server_requests_total[1m])) by (service)
# Request rate by tenant class
sum(rate(http_server_requests_total{
tenant_class="enterprise"
}[1m])) by (service)
# Throughput vs. capacity
sum(rate(http_server_requests_total[1m])) by (service)
/
sum(http_server_capacity_total) by (service)
Security Observability¶
Threat Detection Patterns¶
Unauthorized Access Attempts:
// Failed authentication attempts
traces
| where timestamp > ago(24h)
| where customDimensions.EventType == "Authentication.Failed"
| summarize
FailedAttempts = count(),
UniqueIPs = dcount(customDimensions.ClientIp),
UniqueTenants = dcount(customDimensions.TenantId)
by bin(timestamp, 1h)
| where FailedAttempts > 10 // Threshold
Cross-Tenant Access Attempts:
// Potential cross-tenant data access
traces
| where timestamp > ago(24h)
| where customDimensions.EventType == "Authorization.Denied"
| where customDimensions.Reason contains "tenant" or
customDimensions.Reason contains "cross-tenant"
| project timestamp, customDimensions.ActorId,
customDimensions.TenantId, customDimensions.RequestedTenantId,
customDimensions.Resource
Data Exfiltration Patterns:
// Large export requests
traces
| where timestamp > ago(24h)
| where cloud_RoleName == "atp.export"
| where customDimensions.ExportSizeBytes > 1000000000 // > 1GB
| project timestamp, customDimensions.TenantId,
customDimensions.ExportSizeBytes, customDimensions.RequestedBy
Anomaly Detection:
# Unusual request patterns (sudden spike)
increase(http_server_requests_total{
service="atp.query"
}[5m]) > 1000 # More than 1000 requests in 5 minutes
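Static thresholds like the one above miss gradual shifts; a common refinement is to compare the current window against the recent baseline in units of standard deviations. Python sketch with illustrative request counts:

```python
from statistics import mean, stdev

def is_anomalous(history, current, threshold=3.0):
    """Flag `current` if it lies more than `threshold` standard deviations from the baseline."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > threshold

baseline = [980, 1020, 1005, 990, 1010, 995]   # requests per 5-minute window
print(is_anomalous(baseline, 1008))   # False — normal load
print(is_anomalous(baseline, 2500))   # True — sudden spike
```

This z-score approach assumes roughly stable traffic; seasonal workloads need a baseline from the same hour/day of previous weeks instead.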
Observability-Driven Development¶
Design for Observability¶
Before Writing Code:
1. Define Success Metrics: What metrics indicate this feature is working?
2. Define Failure Scenarios: What errors can occur? How will we detect them?
3. Plan Instrumentation: What spans/logs/metrics are needed?
4. Consider Correlation: How will we trace this operation end-to-end?

While Writing Code:
1. Instrument Early: Add spans/logs/metrics as you code, not after
2. Use Structured Logging: Named parameters, not string interpolation
3. Add Context: Include tenant ID, correlation ID, trace ID in all logs
4. Record Exceptions: Always log exceptions with full context

After Deploying:
1. Verify Instrumentation: Check that traces/logs/metrics are appearing
2. Validate Dashboards: Ensure new metrics show up in dashboards
3. Test Error Paths: Trigger errors, verify they're logged correctly
4. Review Queries: Can you answer questions about this feature?
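"Instrument Early" can be as simple as a wrapper that records success, error, and latency for every operation as it is written. Illustrative Python sketch; the metric store is an in-memory stand-in for a real metrics backend:

```python
import time
from collections import defaultdict

metrics = defaultdict(float)  # in-memory stand-in for a real metrics backend

def instrumented(name):
    """Record success/error counters and cumulative latency for the wrapped operation."""
    def wrap(fn):
        def inner(*args, **kwargs):
            start = time.perf_counter()
            try:
                result = fn(*args, **kwargs)
                metrics[f"{name}.success"] += 1
                return result
            except Exception:
                metrics[f"{name}.error"] += 1
                raise
            finally:
                metrics[f"{name}.duration_ms"] += (time.perf_counter() - start) * 1000
        return inner
    return wrap

@instrumented("ingest")
def ingest(record):
    if not record.get("tenant_id"):
        raise ValueError("missing tenant")
    return "accepted"

ingest({"tenant_id": "acme-corp"})
print(metrics["ingest.success"])  # 1.0
```

In ATP's stack the equivalent is an OpenTelemetry span plus a counter/histogram; the point is that instrumentation is written alongside the business logic, not retrofitted.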
Observability Checklist¶
For Every Feature:
- [ ] Traces cover the critical path (request → response)
- [ ] Logs include sufficient context (tenant, correlation, user)
- [ ] Metrics track success rate, latency, throughput
- [ ] Errors are logged with full exception details
- [ ] Dashboards show feature health
- [ ] Alerts fire for known failure modes
- [ ] Documentation explains how to debug this feature
Observability Maturity Model¶
Level 1: Basic Monitoring (Reactive)¶
- Logs exist but unstructured
- Basic metrics (CPU, memory)
- Manual investigation
- "What's broken?"
Level 2: Structured Observability (Proactive)¶
- Structured logs with correlation
- Service-level metrics
- Distributed tracing
- "Where is the problem?"
Level 3: Context-Rich Observability (Investigative)¶
- Full context propagation (tenant, correlation, trace)
- Business metrics alongside technical metrics
- Rich dashboards and alerting
- "Why did this happen?"
Level 4: Observability-Driven (Predictive)¶
- Automated anomaly detection
- Predictive alerting (before issues occur)
- Observability used for optimization
- "How can we prevent this?"
ATP Target: Level 3-4 (Context-Rich to Observability-Driven)
Troubleshooting Scenarios¶
Scenario 1: Intermittent Timeouts¶
Symptoms: Random 30s timeouts, affects 1% of requests
Investigation:
1. Find Timeout Traces: Search for traces with duration > 25s
2. Check Timeout Pattern: Are timeouts clustered by tenant, time, or operation?
3. Examine Span Details: Which operation is timing out?
4. Check Resource Metrics: Connection pools, queue depths, CPU
5. Look for Lock Contention: Database locks, distributed locks

Common Causes:
- Connection pool exhaustion (spikes)
- Database deadlocks
- Network partition
- Garbage collection pauses
Scenario 2: Data Inconsistency¶
Symptoms: Query returns stale data, missing records
Investigation:
1. Trace Record Lifecycle: Follow record from ingestion → projection → query
2. Check Projection Lag: Is projection service keeping up?
3. Verify Event Processing: Are events being consumed from Service Bus?
4. Check for Errors: Any errors in projection or query services?
5. Validate Watermarks: Are projection watermarks advancing?

Common Causes:
- Projection lag (events not processed)
- Event processing errors (dead-letter queue)
- Cache invalidation failures
- Database replication lag
Scenario 3: Performance Degradation¶
Symptoms: Gradual latency increase over days/weeks
Investigation:
1. Trend Analysis: Compare current metrics to baseline (7 days ago)
2. Identify Component: Which service/operation degraded?
3. Resource Analysis: CPU, memory, database, network trends
4. Check for Scaling Issues: Is autoscaling working?
5. Data Growth: Has data volume increased significantly?

Common Causes:
- Data growth (larger queries)
- Missing indexes
- Resource exhaustion
- Memory leaks
- Inefficient algorithms
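The trend analysis step can be sketched as a baseline comparison: compute P95 for the current window and the same window seven days earlier, then alert on the ratio. Illustrative Python with made-up latency samples:

```python
def p95(samples):
    """Approximate 95th percentile by index into the sorted samples."""
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, int(0.95 * len(ordered)))
    return ordered[idx]

def degradation_ratio(baseline_p95_ms, current_p95_ms):
    """How much slower the current window is versus the week-old baseline (1.0 = unchanged)."""
    return current_p95_ms / baseline_p95_ms

last_week = [50] * 95 + [200] * 5     # baseline window: mostly fast, a slow tail
this_week = [65] * 95 + [260] * 5     # everything ~30% slower
ratio = degradation_ratio(p95(last_week), p95(this_week))
print(round(ratio, 2))  # 1.3
```

Comparing against a week-old window rather than an absolute threshold catches gradual regressions that never trip a static alert.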
Best Practices¶
Logging Best Practices¶
1. Use Structured Logging: Named parameters, not string interpolation
2. Include Context: Tenant ID, correlation ID, and trace ID in every log entry
3. Log at Appropriate Levels:
   - Debug: Development-only, detailed execution flow
   - Information: Significant business events, normal operations
   - Warning: Abnormal but handled situations
   - Error: Error conditions, handled exceptions
   - Critical: Critical failures, unhandled exceptions
4. Never Log PII: Sanitize parameters, hash or redact sensitive values
Tracing Best Practices¶
1. Name Spans Clearly: Use well-named operation spans (e.g. Policy.Evaluate, Database.Insert)
2. Add Relevant Attributes: Tenant ID, correlation ID, and operation-specific context on every span
3. Record Exceptions: Always record exceptions on the failing span with full context
Metrics Best Practices¶
1. Use Low-Cardinality Labels: Prefer tenant class/edition over raw tenant IDs
2. Choose Appropriate Metric Types:
   - Counter: Total requests, total errors (monotonically increasing)
   - Histogram: Latency, size (distribution of values)
   - Gauge: Queue depth, active connections (current value)
3. Document Metrics: Record each metric's meaning, unit, and labels
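The low-cardinality rule in practice: resolve the unbounded tenant ID to a bounded tenant class before labeling a metric, so cardinality stays fixed no matter how many tenants exist. Python sketch; the lookup table is hypothetical:

```python
from collections import Counter

# Hypothetical tenant -> edition lookup; a real system would read this from config.
TENANT_EDITION = {"acme-corp": "enterprise", "tiny-co": "starter"}

def metric_labels(tenant_id: str) -> dict:
    """Label metrics with the bounded tenant class, never the unbounded tenant ID."""
    return {"tenant_class": TENANT_EDITION.get(tenant_id, "unknown")}

requests = ["acme-corp", "acme-corp", "tiny-co", "ghost-tenant"]
counts = Counter(metric_labels(t)["tenant_class"] for t in requests)
print(counts)  # label cardinality stays bounded by the number of editions
```

Per-tenant detail still exists, but in traces and logs (where tenant.id is an attribute), not in metric label sets.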
Summary¶
Observability in ATP enables:
- Fast Debugging: Trace requests end-to-end, find bottlenecks quickly
- Proactive Detection: Identify issues before customers notice
- Performance Optimization: Data-driven improvements
- Security Monitoring: Detect threats and anomalies
- Compliance Evidence: Audit trail visibility
Key Takeaways:
- Use traces, logs, and metrics together (not in isolation)
- Always include correlation context (tenant ID, trace ID, correlation ID)
- Structure everything (structured logs, semantic metrics)
- Ask the right questions (What? How much? Why?)
- Design for observability (instrument as you code)
Next Steps:
- Review monitoring.md for implementation details
- Practice debugging workflows with real scenarios
- Build observability into development process
- Regularly review and optimize telemetry
Document Version: 1.0
Last Updated: 2025-10-30
Maintained By: Platform Engineering & SRE Team