Observability Strategy & Practices¶
Purpose & Scope¶
Purpose: Strategic guide for using observability effectively in ATP, focusing on debugging workflows, troubleshooting patterns, and observability-driven development practices. This document complements the implementation details in monitoring.md.
Scope: This document covers:
- Observability Philosophy: How to think about observability, asking the right questions, observability-driven design
- Debugging Workflows: Practical troubleshooting scenarios using traces, logs, and metrics together
- Correlation Patterns: How to trace requests across services, messages, and background jobs
- Performance Investigation: Using observability to identify and fix performance issues
- Security Observability: Detecting threats, anomalies, and unauthorized access
- Observability Maturity: Evolving from basic monitoring to advanced observability practices
- Context Propagation: Ensuring tenant context, correlation IDs, and trace context flow end-to-end
- Observability for Compliance: Using telemetry for audit trails and regulatory evidence
Audience: Developers, SREs, incident responders, platform engineers, architects
Relationship to Other Documents:
- Implementation: See monitoring.md for OpenTelemetry setup, Serilog configuration, Prometheus/Grafana, Azure Monitor integration
- Operations: See runbook.md for operational procedures using observability data
- Alerts: See alerts-slos.md for alerting strategy and SLO definitions
- Architecture: See ../architecture/architecture.md for observability requirements
Table of Contents¶
- Observability Philosophy
- Asking the Right Questions
- Debugging Workflows
- Correlation & Context
- Performance Investigation
- Security Observability
- Observability-Driven Development
- Observability Maturity Model
- Troubleshooting Scenarios
- Best Practices
Observability Philosophy¶
What is Observability?¶
Observability is the ability to understand the internal state of a system by examining its external outputs (logs, metrics, traces). It goes beyond monitoring by enabling exploratory investigation of unknown issues.
Monitoring vs. Observability:
| Monitoring | Observability |
|---|---|
| Known unknowns: Pre-defined dashboards, alerts for expected failure modes | Unknown unknowns: Ad-hoc exploration, debugging unexpected issues |
| "Is X broken?" | "Why is X broken?" |
| Static thresholds | Dynamic queries |
| Reactive (alerts fire) | Proactive (explore before issues escalate) |
| Good for: Health checks, SLOs | Good for: Debugging, optimization, learning |
ATP Uses Both:
- Monitoring: Health checks, SLO dashboards, alerting
- Observability: Distributed tracing, structured logs, ad-hoc metrics queries
Three Pillars Working Together¶
┌─────────────────────────────────────────────────────────┐
│ OBSERVABILITY STACK │
├─────────────────────────────────────────────────────────┤
│ │
│ METRICS (What) → "P95 latency is 2.5s" │
│ ↓ │
│ TRACES (Why) → "Database query took 2.2s" │
│ ↓ │
│ LOGS (Details) → "Connection pool exhausted, waited..."│
│ │
└─────────────────────────────────────────────────────────┘
Example Investigation Flow:
- Metrics Alert: ingest_latency_p95 > 1s fires
- Check Metrics Dashboard: Latency spike started at 14:30 UTC
- Find Trace: Search for slow ingestion requests around 14:30
- Examine Trace: Policy evaluation span shows 900ms delay
- Check Logs: Policy service logs show "Connection pool exhausted"
- Root Cause: Policy service connection pool too small for load
- Fix: Increase connection pool, add connection pool health check
Observability Principles¶
- Instrument Everything: Every service, every request, every operation
- Correlate Everything: Trace IDs, correlation IDs, tenant context in all telemetry
- Structure Everything: Structured logs, semantic metrics, well-named spans
- Query Everything: Make all telemetry queryable and explorable
- Retain Everything: Balance retention with cost (hot → cool → archive)
- Redact Sensitive Data: Never log PII, sanitize parameters, hash when needed
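The redaction principle can be sketched as a small sanitizer applied to log fields before they reach the logger. This is an illustrative Python sketch, not ATP code; the field names are hypothetical:

```python
import hashlib

# Fields that must never appear in telemetry in clear text (illustrative names).
REDACT_FIELDS = {"email", "ssn", "password"}
HASH_FIELDS = {"user_id"}  # hashed so records can still be correlated

def sanitize(fields: dict) -> dict:
    """Return a copy of the log fields safe to emit: secrets masked, IDs hashed."""
    safe = {}
    for key, value in fields.items():
        if key in REDACT_FIELDS:
            safe[key] = "[REDACTED]"
        elif key in HASH_FIELDS:
            safe[key] = hashlib.sha256(str(value).encode()).hexdigest()[:16]
        else:
            safe[key] = value
    return safe

print(sanitize({"email": "a@b.com", "user_id": "u-42", "tenant_id": "acme-corp"}))
```

In .NET, the same idea is typically implemented with a Serilog enricher or destructuring policy so the rule applies uniformly across services.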
Asking the Right Questions¶
The Observability Question Framework¶
Observability enables you to answer arbitrary questions about system behavior. Learn to ask:
- What happened? (Logs)
- How much? (Metrics)
- Why? (Traces)
- Who was affected? (Tenant context, correlation)
- When did it start? (Time-series analysis)
- Where is the problem? (Service-level attribution)
ATP-Specific Questions¶
Performance Questions:
- "Why is ingestion latency high for tenant acme-corp?"
- "What's the slowest operation in the query service?"
- "Which tenants are hitting rate limits?"

Reliability Questions:
- "Why did this ingestion request fail?"
- "What's causing the dead-letter queue growth?"
- "Which services are experiencing errors?"

Business Questions:
- "How many audit records were ingested per tenant this month?"
- "What's the projection lag by tenant edition?"
- "Which tenants are using the most storage?"

Security Questions:
- "Who accessed audit records for tenant xyz last week?"
- "Are there any failed authentication attempts?"
- "Did any cross-tenant data access occur?"
Question-to-Telemetry Mapping¶
| Question Type | Primary Telemetry | Secondary Telemetry |
|---|---|---|
| Why is X slow? | Traces (timing breakdown) | Metrics (percentiles), Logs (warnings) |
| How many X? | Metrics (counters) | Logs (aggregated counts) |
| What error occurred? | Logs (exceptions, stack traces) | Traces (error spans), Metrics (error rate) |
| Who did X? | Logs (actor context) | Traces (user attributes), Metrics (per-tenant) |
| When did X happen? | All (timestamp correlation) | Metrics (time-series), Traces (timeline) |
| Where is the issue? | Traces (service attribution) | Metrics (per-service), Logs (service tags) |
Debugging Workflows¶
Standard Debugging Workflow¶
flowchart TD
A[Alert/Issue Reported] --> B[Check Metrics Dashboard]
B --> C{Issue Visible in Metrics?}
C -->|Yes| D[Identify Time Window]
C -->|No| E[Search Logs for Error Messages]
D --> F[Find Traces in Time Window]
E --> F
F --> G[Examine Trace Breakdown]
G --> H[Check Logs for Detailed Context]
H --> I[Identify Root Cause]
I --> J[Implement Fix]
J --> K[Verify Fix with Observability]
Workflow 1: Slow Request Investigation¶
Scenario: Customer reports slow ingestion (5+ seconds)
Step 1: Check Metrics
# Query Prometheus for P95 latency spike
histogram_quantile(0.95,
rate(http_server_duration_seconds_bucket{
service="atp.ingestion",
route="/api/v1/ingest"
}[5m]))
Result: P95 latency is 4.8s (normally 200ms)
Step 2: Find Slow Traces
Jaeger Query:
- Service: atp.ingestion
- Operation: POST /api/v1/ingest
- Duration: > 4000ms
- Time: Last 1 hour
Step 3: Examine Trace
Trace: 4bf92f3577b34da6a3ce929d0e0e4736
Duration: 4.8s
├─ Gateway (50ms)
│ ├─ Authentication (10ms)
│ └─ Routing (5ms)
│
├─ Ingestion (4.7s) ← SLOW
│ ├─ Validation (20ms)
│ ├─ Policy.Evaluate (3.9s) ← BOTTLENECK
│ │ └─ Database.Query (3.6s) ← ROOT CAUSE
│ ├─ Database.Insert (200ms)
│ └─ Outbox.Append (100ms)
│
└─ Service Bus Publish (50ms)
Step 4: Check Logs for Context
// Log Analytics query
traces
| where timestamp > ago(1h)
| where traceId == "4bf92f3577b34da6a3ce929d0e0e4736"
| where message contains "database" or message contains "connection"
| project timestamp, message, severityLevel, customDimensions
Log Result:
[14:32:15.123] WARN PolicyService: Connection pool exhausted,
waited 3600ms for available connection.
Pool size: 10, Active: 10, Pending: 45
Step 5: Root Cause Identified
- Policy service connection pool too small (10 connections)
- Under load, connections exhausted
- Requests wait for available connections
Step 6: Fix
- Increase connection pool to 50
- Add connection pool metrics (active, pending, wait time)
- Add alert: connection_pool_pending > 10
Workflow 2: Error Rate Investigation¶
Scenario: Error rate spike (5% → 15% in 10 minutes)
Step 1: Check Error Metrics
# Error rate by service
sum(rate(http_server_requests_total{
result="error"
}[5m])) by (service)
/
sum(rate(http_server_requests_total[5m])) by (service)
Result: Ingestion service: 15% error rate
Step 2: Find Error Traces
Jaeger Query:
- Service: atp.ingestion
- Tags: error=true
- Time: Last 30 minutes
Step 3: Group Errors by Type
// Log Analytics - group errors by exception type
traces
| where timestamp > ago(30m)
| where severityLevel >= 3 // Error or Critical
| where cloud_RoleName == "atp.ingestion"
| extend ExceptionType = tostring(customDimensions.ExceptionType)
| summarize ErrorCount = count() by ExceptionType
| order by ErrorCount desc
Result:
ExceptionType | ErrorCount
---------------------------|-----------
SqlException | 234
TimeoutException | 12
ValidationException | 5
Step 4: Examine Specific Error
// Get detailed error for SqlException
traces
| where timestamp > ago(30m)
| where customDimensions.ExceptionType == "SqlException"
| where customDimensions.Message contains "timeout"
| project timestamp, message, customDimensions, exceptionDetails
| take 10
Result:
[14:35:22.456] ERROR IngestionService: Database timeout after 30s
Exception: System.Data.SqlClient.SqlException
Message: Timeout expired. The timeout period elapsed prior to
completion of the operation or the server is not responding.
Query: SELECT * FROM PolicyRules WHERE TenantId = @p0 AND Active = 1
Parameters: @p0 = 'acme-corp'
Step 5: Check Database Metrics
# Database query duration
histogram_quantile(0.95,
rate(db_client_duration_seconds_bucket{
service="atp.ingestion",
db_operation="SELECT"
}[5m]))
Result: P95 database query duration is 28s (normally 50ms)
Step 6: Root Cause Identified
- Database queries timing out
- PolicyRules table may be locked or missing an index
- High query volume causing contention
Step 7: Fix
- Add index on (TenantId, Active) for PolicyRules table
- Investigate table locks
- Consider caching policy rules
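The caching fix above can be sketched as a small TTL cache in front of the PolicyRules query, so hot tenants don't hit the database on every request. Illustrative Python; the loader and names are hypothetical:

```python
import time

class TtlCache:
    """Minimal per-tenant TTL cache to avoid re-querying policy rules on every request."""
    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self.clock()
        entry = self._entries.get(key)
        if entry and entry[0] > now:
            return entry[1]          # cache hit: no database round-trip
        value = loader(key)          # cache miss or expired: reload
        self._entries[key] = (now + self.ttl, value)
        return value

# Hypothetical loader standing in for the PolicyRules query.
calls = []
def load_rules(tenant_id):
    calls.append(tenant_id)
    return [f"rule-for-{tenant_id}"]

cache = TtlCache(ttl_seconds=30)
cache.get_or_load("acme-corp", load_rules)
cache.get_or_load("acme-corp", load_rules)  # served from cache
print(len(calls))  # 1 — the database was hit once
```

The TTL bounds staleness: a 30-second TTL means a rule change takes at most 30 seconds to propagate, which is usually acceptable for policy data.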
Workflow 3: Missing Data Investigation¶
Scenario: Tenant reports missing audit records
Step 1: Verify Ingestion
// Check if records were ingested for tenant
traces
| where timestamp > ago(24h)
| where customDimensions.TenantId == "acme-corp"
| where message contains "Ingestion.Complete" or message contains "Record ingested"
| summarize IngestedCount = count() by bin(timestamp, 1h)
Step 2: Check Query Service
// Check query service logs for the records
traces
| where timestamp > ago(24h)
| where customDimensions.TenantId == "acme-corp"
| where cloud_RoleName == "atp.query"
| where message contains "Query.Executed"
| project timestamp, message, customDimensions
Step 3: Check Projection Service
// Check if projection processed the records
traces
| where timestamp > ago(24h)
| where customDimensions.TenantId == "acme-corp"
| where cloud_RoleName == "atp.projection"
| where message contains "Projection.Updated"
| summarize ProjectedCount = count() by bin(timestamp, 1h)
Step 4: Check for Errors in Pipeline
// Find any errors in the pipeline
traces
| where timestamp > ago(24h)
| where customDimensions.TenantId == "acme-corp"
| where severityLevel >= 3
| where message contains "acme-corp"
| project timestamp, cloud_RoleName, message, customDimensions
| order by timestamp desc
Step 5: Verify Database State
-- Check actual database records
SELECT COUNT(*)
FROM AuditRecords
WHERE TenantId = 'acme-corp'
AND CreatedAt >= DATEADD(hour, -24, GETUTCDATE())
Step 6: Trace Specific Record
If the tenant provides a specific record ID:
// Trace specific record ID through pipeline
traces
| where timestamp > ago(7d)
| where customDimensions.AuditRecordId == "01HZX123456789"
| project timestamp, cloud_RoleName, message, customDimensions
| order by timestamp asc
Correlation & Context¶
Context Propagation Pattern¶
Context Flow:
Client Request
↓ (HTTP headers)
Gateway
↓ (traceparent, baggage)
Ingestion Service
↓ (traceparent, baggage)
Policy Service (HTTP)
↓ (traceparent, baggage)
Database (connection context)
↓ (traceparent in message headers)
Service Bus Message
↓ (traceparent, baggage)
Projection Consumer
↓ (traceparent)
Database (projection)
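At each hop above, the trace context travels in the W3C traceparent header, four dash-separated hex fields: version, trace ID, parent span ID, and flags. A minimal parser sketch in Python shows the fields involved:

```python
def parse_traceparent(header: str) -> dict:
    """Split a W3C traceparent header into its four dash-separated hex fields."""
    version, trace_id, span_id, flags = header.split("-")
    if len(trace_id) != 32 or len(span_id) != 16:
        raise ValueError("malformed traceparent")
    return {
        "version": version,
        "trace_id": trace_id,            # links all spans in one request flow
        "parent_span_id": span_id,       # the caller's span
        "sampled": bool(int(flags, 16) & 0x01),
    }

ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(ctx["trace_id"])  # 4bf92f3577b34da6a3ce929d0e0e4736
```

In practice the OpenTelemetry SDK handles this parsing and re-injection automatically; the sketch only shows what is on the wire.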
Required Context Attributes¶
Every telemetry record MUST include:
// Resource attributes (service-level, constant)
service.name = "atp.ingestion"
service.version = "1.2.3"
deployment.environment = "production"
cloud.provider = "azure"
cloud.region = "eastus"
// Span/Log attributes (request-level, variable)
trace.id = "4bf92f3577b34da6a3ce929d0e0e4736"
span.id = "00f067aa0ba902b7"
tenant.id = "acme-corp" // or tenant.class for metrics
tenant.edition = "enterprise"
correlation.id = "01HZX123456789" // Business correlation ID
audit.record.id = "01HZX987654321" // ATP-specific
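A lightweight guard can verify these attributes are present before a record is emitted, catching context-propagation bugs early. Python sketch; the required set below is a subset chosen for illustration:

```python
# Request-level attributes this document requires on every telemetry record
# (illustrative subset).
REQUIRED_ATTRIBUTES = {"trace.id", "span.id", "tenant.id", "correlation.id"}

def missing_attributes(record: dict) -> set:
    """Return the required attribute keys absent from a telemetry record."""
    return REQUIRED_ATTRIBUTES - record.keys()

record = {
    "trace.id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "span.id": "00f067aa0ba902b7",
    "tenant.id": "acme-corp",
}
print(missing_attributes(record))  # correlation.id was forgotten
```

A check like this fits naturally in a logging middleware or a CI test against sample telemetry.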
Correlation ID Strategy¶
Three Types of IDs:
1. Trace ID (OpenTelemetry):
   - Generated by gateway on request
   - Propagated via the traceparent header (W3C Trace Context)
   - Links all spans in a single request flow
   - Example: 4bf92f3577b34da6a3ce929d0e0e4736
2. Correlation ID (Business):
   - ULID for audit record or business entity
   - Can span multiple traces/requests
   - Used for business logic correlation
   - Example: 01HZX123456789 (ULID)
3. Request ID (Gateway):
   - Unique per HTTP request
   - Returned to client for support
   - May equal the trace ID or be separate
   - Example: req-20251030-abc123
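For the business correlation ID, a ULID gives a globally unique value whose leading bits are a timestamp, so IDs sort by creation time. A minimal sketch of ULID-style generation (production code would use a ULID library):

```python
import os
import time

CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

def new_correlation_id() -> str:
    """ULID-style ID: 48-bit millisecond timestamp + 80 random bits, Crockford base32."""
    value = (int(time.time() * 1000) << 80) | int.from_bytes(os.urandom(10), "big")
    chars = []
    for _ in range(26):          # 26 chars x 5 bits covers the 128-bit value
        chars.append(CROCKFORD[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))

cid = new_correlation_id()
print(len(cid))  # 26
```

The timestamp prefix is what makes ULIDs useful for correlation: sorting IDs lexicographically approximates sorting events by time.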
Cross-Service Correlation¶
HTTP Request Correlation:
// Gateway sets correlation context
var traceId = Activity.Current?.TraceId.ToString();
var correlationId = GenerateCorrelationId();
// Add to headers (automatic with OTel)
request.Headers.Add("X-Correlation-ID", correlationId);
// traceparent header added automatically by OTel SDK
// Downstream service receives and uses
var correlationId = HttpContext.Request.Headers["X-Correlation-ID"];
var traceId = Activity.Current?.TraceId.ToString();
// Log with both
_logger.LogInformation(
"Processing request with CorrelationId={CorrelationId}, TraceId={TraceId}",
correlationId, traceId);
Message Bus Correlation:
// Publisher sets correlation context
await _bus.Publish(new AuditAcceptedEvent
{
AuditRecordId = recordId,
TenantId = tenantId
}, context =>
{
// MassTransit automatically propagates trace context
// Also set business correlation ID
context.Headers.Set("X-Correlation-ID", recordId);
});
// Consumer receives and uses
public async Task Consume(ConsumeContext<AuditAcceptedEvent> context)
{
var correlationId = context.Headers.Get<string>("X-Correlation-ID");
var traceId = Activity.Current?.TraceId.ToString();
// All logs in this consumer will include correlation context
using (_logger.BeginScope(new Dictionary<string, object>
{
["CorrelationId"] = correlationId,
["TraceId"] = traceId
}))
{
_logger.LogInformation("Processing audit accepted event");
// Process event...
}
}
Performance Investigation¶
Latency Analysis Workflow¶
Step 1: Identify Latency Component
Total Request Time: 500ms
├─ Gateway: 20ms (4%)
├─ Authentication: 10ms (2%)
├─ Ingestion Service: 450ms (90%) ← FOCUS HERE
│ ├─ Validation: 15ms
│ ├─ Policy Evaluation: 400ms ← BOTTLENECK
│ ├─ Database Insert: 30ms
│ └─ Outbox Append: 5ms
└─ Response Serialization: 20ms (4%)
Step 2: Drill into Bottleneck
- Examine Policy Evaluation span details
- Check downstream calls (database, cache, external API)
- Look for lock contention, resource exhaustion
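The drill-down can be automated: given the parent span's duration and its children, compute which child dominates. Python sketch using the durations from Step 1:

```python
def find_bottleneck(total_ms: float, children: dict) -> tuple:
    """Return (name, share) of the child span contributing most of the parent's time."""
    name = max(children, key=children.get)
    return name, children[name] / total_ms

# Child span durations from the Step 1 breakdown (milliseconds).
spans = {
    "Validation": 15,
    "Policy.Evaluate": 400,
    "Database.Insert": 30,
    "Outbox.Append": 5,
}
name, share = find_bottleneck(500, spans)
print(f"{name}: {share:.0%} of request time")  # Policy.Evaluate: 80% of request time
```

The same comparison is what trace UIs like Jaeger render visually; scripting it is useful for batch analysis over many traces.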
Step 3: Check Resource Metrics
# Connection pool usage
db_connection_pool_active{service="atp.policy"}
db_connection_pool_pending{service="atp.policy"}
# Cache hit rate
cache_hit_rate{service="atp.policy", cache="policy-rules"}
# CPU/Memory
process_cpu_usage{service="atp.policy"}
process_memory_usage{service="atp.policy"}
Throughput Analysis¶
Questions to Answer:
- What's the current throughput (requests/second)?
- Is throughput limited by CPU, memory, network, or database?
- Which tenants/operations have the highest throughput?
- Is any throttling or rate limiting in effect?
Metrics to Examine:
# Request rate by service
sum(rate(http_server_requests_total[1m])) by (service)
# Request rate by tenant class
sum(rate(http_server_requests_total{
tenant_class="enterprise"
}[1m])) by (service)
# Throughput vs. capacity
sum(rate(http_server_requests_total[1m])) by (service)
/
sum(http_server_capacity_total) by (service)
Security Observability¶
Threat Detection Patterns¶
Unauthorized Access Attempts:
// Failed authentication attempts
traces
| where timestamp > ago(24h)
| where customDimensions.EventType == "Authentication.Failed"
| summarize
FailedAttempts = count(),
UniqueIPs = dcount(customDimensions.ClientIp),
UniqueTenants = dcount(customDimensions.TenantId)
by bin(timestamp, 1h)
| where FailedAttempts > 10 // Threshold
Cross-Tenant Access Attempts:
// Potential cross-tenant data access
traces
| where timestamp > ago(24h)
| where customDimensions.EventType == "Authorization.Denied"
| where customDimensions.Reason contains "tenant" or
customDimensions.Reason contains "cross-tenant"
| project timestamp, customDimensions.ActorId,
customDimensions.TenantId, customDimensions.RequestedTenantId,
customDimensions.Resource
Data Exfiltration Patterns:
// Large export requests
traces
| where timestamp > ago(24h)
| where cloud_RoleName == "atp.export"
| where customDimensions.ExportSizeBytes > 1000000000 // > 1GB
| project timestamp, customDimensions.TenantId,
customDimensions.ExportSizeBytes, customDimensions.RequestedBy
Anomaly Detection:
# Unusual request patterns (sudden spike)
increase(http_server_requests_total{
service="atp.query"
}[5m]) > 1000 # More than 1000 requests in 5 minutes
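Static thresholds like the one above miss gradual shifts; a common refinement is to compare the current window against the recent baseline in units of standard deviations. Python sketch with illustrative request counts:

```python
from statistics import mean, stdev

def is_anomalous(history, current, threshold=3.0):
    """Flag `current` if it lies more than `threshold` standard deviations from the baseline."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > threshold

baseline = [980, 1020, 1005, 990, 1010, 995]   # requests per 5-minute window
print(is_anomalous(baseline, 1008))   # False — normal load
print(is_anomalous(baseline, 2500))   # True — sudden spike
```

This z-score approach assumes roughly stable traffic; seasonal workloads need a baseline from the same hour/day of previous weeks instead.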
Observability-Driven Development¶
Design for Observability¶
Before Writing Code:
1. Define Success Metrics: What metrics indicate this feature is working?
2. Define Failure Scenarios: What errors can occur? How will we detect them?
3. Plan Instrumentation: What spans/logs/metrics are needed?
4. Consider Correlation: How will we trace this operation end-to-end?

While Writing Code:
1. Instrument Early: Add spans/logs/metrics as you code, not after
2. Use Structured Logging: Named parameters, not string interpolation
3. Add Context: Include tenant ID, correlation ID, trace ID in all logs
4. Record Exceptions: Always log exceptions with full context

After Deploying:
1. Verify Instrumentation: Check that traces/logs/metrics are appearing
2. Validate Dashboards: Ensure new metrics show up in dashboards
3. Test Error Paths: Trigger errors, verify they're logged correctly
4. Review Queries: Can you answer questions about this feature?
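"Instrument Early" can be as simple as a wrapper that records success, error, and latency for every operation as it is written. Illustrative Python sketch; the metric store is an in-memory stand-in for a real metrics backend:

```python
import time
from collections import defaultdict

metrics = defaultdict(float)  # in-memory stand-in for a real metrics backend

def instrumented(name):
    """Record success/error counters and cumulative latency for the wrapped operation."""
    def wrap(fn):
        def inner(*args, **kwargs):
            start = time.perf_counter()
            try:
                result = fn(*args, **kwargs)
                metrics[f"{name}.success"] += 1
                return result
            except Exception:
                metrics[f"{name}.error"] += 1
                raise
            finally:
                metrics[f"{name}.duration_ms"] += (time.perf_counter() - start) * 1000
        return inner
    return wrap

@instrumented("ingest")
def ingest(record):
    if not record.get("tenant_id"):
        raise ValueError("missing tenant")
    return "accepted"

ingest({"tenant_id": "acme-corp"})
print(metrics["ingest.success"])  # 1.0
```

In ATP's stack the equivalent is an OpenTelemetry span plus a counter/histogram; the point is that instrumentation is written alongside the business logic, not retrofitted.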
Observability Checklist¶
For Every Feature:
- [ ] Traces cover the critical path (request → response)
- [ ] Logs include sufficient context (tenant, correlation, user)
- [ ] Metrics track success rate, latency, throughput
- [ ] Errors are logged with full exception details
- [ ] Dashboards show feature health
- [ ] Alerts fire for known failure modes
- [ ] Documentation explains how to debug this feature
Observability Maturity Model¶
Level 1: Basic Monitoring (Reactive)¶
- Logs exist but unstructured
- Basic metrics (CPU, memory)
- Manual investigation
- "What's broken?"
Level 2: Structured Observability (Proactive)¶
- Structured logs with correlation
- Service-level metrics
- Distributed tracing
- "Where is the problem?"
Level 3: Context-Rich Observability (Investigative)¶
- Full context propagation (tenant, correlation, trace)
- Business metrics alongside technical metrics
- Rich dashboards and alerting
- "Why did this happen?"
Level 4: Observability-Driven (Predictive)¶
- Automated anomaly detection
- Predictive alerting (before issues occur)
- Observability used for optimization
- "How can we prevent this?"
ATP Target: Level 3-4 (Context-Rich to Observability-Driven)
Troubleshooting Scenarios¶
Scenario 1: Intermittent Timeouts¶
Symptoms: Random 30s timeouts, affects 1% of requests
Investigation:
1. Find Timeout Traces: Search for traces with duration > 25s
2. Check Timeout Pattern: Are timeouts clustered by tenant, time, or operation?
3. Examine Span Details: Which operation is timing out?
4. Check Resource Metrics: Connection pools, queue depths, CPU
5. Look for Lock Contention: Database locks, distributed locks

Common Causes:
- Connection pool exhaustion (spikes)
- Database deadlocks
- Network partition
- Garbage collection pauses
Scenario 2: Data Inconsistency¶
Symptoms: Query returns stale data, missing records
Investigation:
1. Trace Record Lifecycle: Follow record from ingestion → projection → query
2. Check Projection Lag: Is projection service keeping up?
3. Verify Event Processing: Are events being consumed from Service Bus?
4. Check for Errors: Any errors in projection or query services?
5. Validate Watermarks: Are projection watermarks advancing?

Common Causes:
- Projection lag (events not processed)
- Event processing errors (dead-letter queue)
- Cache invalidation failures
- Database replication lag
Scenario 3: Performance Degradation¶
Symptoms: Gradual latency increase over days/weeks
Investigation:
1. Trend Analysis: Compare current metrics to baseline (7 days ago)
2. Identify Component: Which service/operation degraded?
3. Resource Analysis: CPU, memory, database, network trends
4. Check for Scaling Issues: Is autoscaling working?
5. Data Growth: Has data volume increased significantly?

Common Causes:
- Data growth (larger queries)
- Missing indexes
- Resource exhaustion
- Memory leaks
- Inefficient algorithms
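The trend analysis step can be sketched as a baseline comparison: compute P95 for the current window and the same window seven days earlier, then alert on the ratio. Illustrative Python with made-up latency samples:

```python
def p95(samples):
    """Approximate 95th percentile by index into the sorted samples."""
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, int(0.95 * len(ordered)))
    return ordered[idx]

def degradation_ratio(baseline_p95_ms, current_p95_ms):
    """How much slower the current window is versus the week-old baseline (1.0 = unchanged)."""
    return current_p95_ms / baseline_p95_ms

last_week = [50] * 95 + [200] * 5     # baseline window: mostly fast, a slow tail
this_week = [65] * 95 + [260] * 5     # everything ~30% slower
ratio = degradation_ratio(p95(last_week), p95(this_week))
print(round(ratio, 2))  # 1.3
```

Comparing against a week-old window rather than an absolute threshold catches gradual regressions that never trip a static alert.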
Best Practices¶
Logging Best Practices¶
1. Use Structured Logging: Named parameters, not string interpolation
2. Include Context: Tenant ID, correlation ID, and trace ID in every log entry
3. Log at Appropriate Levels:
   - Debug: Development-only, detailed execution flow
   - Information: Significant business events, normal operations
   - Warning: Abnormal but handled situations
   - Error: Error conditions, handled exceptions
   - Critical: Critical failures, unhandled exceptions
4. Never Log PII: Sanitize parameters, hash or redact sensitive values
Tracing Best Practices¶
1. Name Spans Clearly: Use well-named operation spans (e.g. Policy.Evaluate, Database.Insert)
2. Add Relevant Attributes: Tenant ID, correlation ID, and operation-specific context on every span
3. Record Exceptions: Always record exceptions on the failing span with full context
Metrics Best Practices¶
1. Use Low-Cardinality Labels: Prefer tenant class/edition over raw tenant IDs
2. Choose Appropriate Metric Types:
   - Counter: Total requests, total errors (monotonically increasing)
   - Histogram: Latency, size (distribution of values)
   - Gauge: Queue depth, active connections (current value)
3. Document Metrics: Record each metric's meaning, unit, and labels
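The low-cardinality rule in practice: resolve the unbounded tenant ID to a bounded tenant class before labeling a metric, so cardinality stays fixed no matter how many tenants exist. Python sketch; the lookup table is hypothetical:

```python
from collections import Counter

# Hypothetical tenant -> edition lookup; a real system would read this from config.
TENANT_EDITION = {"acme-corp": "enterprise", "tiny-co": "starter"}

def metric_labels(tenant_id: str) -> dict:
    """Label metrics with the bounded tenant class, never the unbounded tenant ID."""
    return {"tenant_class": TENANT_EDITION.get(tenant_id, "unknown")}

requests = ["acme-corp", "acme-corp", "tiny-co", "ghost-tenant"]
counts = Counter(metric_labels(t)["tenant_class"] for t in requests)
print(counts)  # label cardinality stays bounded by the number of editions
```

Per-tenant detail still exists, but in traces and logs (where tenant.id is an attribute), not in metric label sets.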
Summary¶
Observability in ATP enables:
- Fast Debugging: Trace requests end-to-end, find bottlenecks quickly
- Proactive Detection: Identify issues before customers notice
- Performance Optimization: Data-driven improvements
- Security Monitoring: Detect threats and anomalies
- Compliance Evidence: Audit trail visibility
Key Takeaways:
- Use traces, logs, and metrics together (not in isolation)
- Always include correlation context (tenant ID, trace ID, correlation ID)
- Structure everything (structured logs, semantic metrics)
- Ask the right questions (What? How much? Why?)
- Design for observability (instrument as you code)
Next Steps:
- Review monitoring.md for implementation details
- Practice debugging workflows with real scenarios
- Build observability into development process
- Regularly review and optimize telemetry
Document Version: 1.0
Last Updated: 2025-10-30
Maintained By: Platform Engineering & SRE Team