Monitoring & Observability - Audit Trail Platform (ATP)¶
Three pillars of observability: ATP implements comprehensive monitoring with OpenTelemetry for distributed tracing (W3C Trace Context), metrics (Prometheus), and structured logs (Serilog). Telemetry is exported to Azure Monitor (Application Insights, Log Analytics, Container Insights) and Grafana, with tenant/correlation context propagation, PII redaction, and actionable dashboards for operational excellence.
📋 Documentation Generation Plan¶
This document will be generated in 20 cycles. Current progress:
| Cycle | Topics | Estimated Lines | Status |
|---|---|---|---|
| Cycle 1 | Observability Fundamentals (1-2) | ~3,500 | ⏳ Not Started |
| Cycle 2 | OpenTelemetry Architecture (3-4) | ~4,000 | ⏳ Not Started |
| Cycle 3 | Distributed Tracing (5-6) | ~5,000 | ⏳ Not Started |
| Cycle 4 | Trace Context Propagation (7-8) | ~4,000 | ⏳ Not Started |
| Cycle 5 | Metrics Collection (9-10) | ~4,500 | ⏳ Not Started |
| Cycle 6 | Custom Metrics (11-12) | ~4,000 | ⏳ Not Started |
| Cycle 7 | Structured Logging (13-14) | ~4,500 | ⏳ Not Started |
| Cycle 8 | Log Enrichment & Correlation (15-16) | ~4,000 | ⏳ Not Started |
| Cycle 9 | Azure Monitor Integration (17-18) | ~5,000 | ⏳ Not Started |
| Cycle 10 | Application Insights (19-20) | ~4,000 | ⏳ Not Started |
| Cycle 11 | Log Analytics & KQL (21-22) | ~4,500 | ⏳ Not Started |
| Cycle 12 | Prometheus & Grafana (23-24) | ~4,500 | ⏳ Not Started |
| Cycle 13 | Dashboard Design (25-26) | ~4,000 | ⏳ Not Started |
| Cycle 14 | Performance Monitoring (27-28) | ~4,000 | ⏳ Not Started |
| Cycle 15 | Application Performance Monitoring (29-30) | ~3,500 | ⏳ Not Started |
| Cycle 16 | PII Redaction & Compliance (31-32) | ~3,500 | ⏳ Not Started |
| Cycle 17 | Correlation & Context (33-34) | ~3,500 | ⏳ Not Started |
| Cycle 18 | Log Aggregation & Search (35-36) | ~3,500 | ⏳ Not Started |
| Cycle 19 | Monitoring Operations (37-38) | ~3,000 | ⏳ Not Started |
| Cycle 20 | Best Practices & Troubleshooting (39-40) | ~3,500 | ⏳ Not Started |
Total Estimated Lines: ~80,000
Purpose & Scope¶
This document provides the complete monitoring and observability implementation guide for ATP, covering OpenTelemetry (distributed tracing, metrics, logs), Azure Monitor (Application Insights, Log Analytics, Container Insights), Prometheus, Grafana, Serilog structured logging, correlation, dashboards, APM, and operational procedures for comprehensive, actionable, and compliant system observability.
Why Comprehensive Observability for ATP?
- Debugging: Trace requests across microservices, identify bottlenecks
- Performance: Measure latency, throughput, errors at every layer
- Reliability: Detect issues before customers do (proactive alerts)
- Compliance: Audit all access to audit data ("audit the auditor")
- Security: Detect anomalies, unauthorized access, data exfiltration
- Business Insights: Tenant usage patterns, feature adoption, capacity planning
- SLO Validation: Measure actual performance against targets
- Incident Response: Rich telemetry for faster MTTR
- Optimization: Data-driven performance and cost optimization
- Regulatory: Immutable observability trail for compliance audits
Three Pillars of Observability
1. TRACES (Request Flow)
- Distributed tracing across services
- End-to-end request path
- Latency attribution
- Error correlation
- Example: "Why did this ingestion request take 5 seconds?"
2. METRICS (System Health)
- Time-series data (CPU, memory, latency, throughput)
- Aggregations and percentiles
- SLO/SLI measurements
- Alerting thresholds
- Example: "What's the P95 latency for queries?"
3. LOGS (Event Details)
- Structured event records
- Contextual information
- Error messages and stack traces
- Audit trail
- Example: "What error occurred at 14:32 UTC for tenant acme-corp?"
ATP Observability Stack
Application Layer:
- OpenTelemetry SDK (C#)
- Serilog (structured logging)
- Custom meters and spans
Collection Layer:
- OTel Collector (DaemonSet on K8s)
- Prometheus (scraping)
- Fluent Bit (log forwarding)
Storage Layer:
- Azure Monitor (Application Insights, Log Analytics)
- Prometheus (TSDB, 30-day retention)
- Azure Blob (long-term log archive)
Visualization Layer:
- Grafana (dashboards, alerts)
- Azure Monitor Workbooks
- Application Insights Workbooks
Alerting Layer:
- Prometheus Alertmanager
- Azure Monitor Alerts
- PagerDuty integration
Key Technologies
- OpenTelemetry: Vendor-neutral observability framework (traces, metrics, logs)
- Serilog: Structured logging for .NET
- Prometheus: Time-series metrics database
- Grafana: Visualization and dashboards
- Azure Monitor: Cloud-native monitoring (Application Insights, Log Analytics, Container Insights)
- Jaeger/Zipkin: Distributed tracing UIs
- KQL (Kusto Query Language): Azure Monitor query language
- PromQL: Prometheus query language
- W3C Trace Context: Standard for trace propagation
Detailed Cycle Plan¶
CYCLE 1: Observability Fundamentals (~3,500 lines)¶
Topic 1: Three Pillars of Observability¶
What will be covered: - Traces, Metrics, Logs
TRACES (The "Why" and "Where"):
- Request journey across services
- Timing breakdown (where did time go?)
- Causal relationships
- Error attribution
Use Cases:
- Debug slow requests
- Find bottlenecks
- Understand service dependencies
- Root cause analysis
Example Questions Answered:
- "Why did this request take 5 seconds?"
- "Which service in the chain is slow?"
- "Where did this error originate?"
---
METRICS (The "What" and "How Much"):
- Numeric measurements over time
- Aggregations (avg, sum, percentile)
- System health indicators
- SLO/SLI tracking
Use Cases:
- Alerting (latency > threshold)
- Capacity planning (trending growth)
- SLO compliance (99.9% availability)
- Performance dashboards
Example Questions Answered:
- "What's the current error rate?"
- "What's the P95 latency?"
- "How many events are in the queue?"
---
LOGS (The "What Happened"):
- Discrete events
- Detailed context
- Error messages and stack traces
- Audit trail
Use Cases:
- Debugging specific errors
- Audit compliance
- Security investigation
- Business event tracking
Example Questions Answered:
- "What error occurred at 14:32 UTC?"
- "Which user triggered this action?"
- "What was the validation failure reason?"
---
How They Work Together
Scenario: Slow Ingestion Request
1. METRICS alert fires:
   - "P95 ingestion latency > 1s"
2. Check DASHBOARD (metrics):
   - Latency spiked at 14:30 UTC
   - Affecting tenant "acme-corp"
3. Find REQUEST in TRACES:
   - Search for requests to /api/v1/ingest
   - Filter: timestamp ~14:30, tenant=acme-corp
   - Trace shows: 4.2s total
     - Gateway: 50ms
     - Ingestion: 4.1s
       - Validation: 20ms
       - Policy call: 3.8s ← BOTTLENECK
       - Database: 200ms
       - Outbox: 100ms
4. Examine LOGS (detailed context):
   - Search: timestamp=14:30, traceId=abc123
   - Find: "PolicyService.Evaluate timeout after 3s"
   - Root cause: Policy service database connection pool exhausted
5. Resolution:
   - Scale policy service
   - Increase connection pool
   - Add connection pool health check
---
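The metrics alert that kicks off step 1 of a scenario like this can be expressed as a Prometheus alerting rule. This is a minimal sketch: the rule name, threshold, and the `ingest_request_duration_bucket` histogram series are assumptions based on the metric names defined later in this plan.

```yaml
groups:
  - name: atp-ingestion
    rules:
      - alert: IngestionP95LatencyHigh
        # P95 over the last 5 minutes, computed from histogram buckets
        expr: histogram_quantile(0.95, sum(rate(ingest_request_duration_bucket[5m])) by (le)) > 1
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "P95 ingestion latency > 1s"
```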
Observability vs. Monitoring
Monitoring (Traditional):
- Known failure modes
- Predefined dashboards
- Static thresholds
- "Tell me when X goes wrong"
Observability (Modern):
- Unknown failure modes
- Ad-hoc exploration
- Dynamic queries
- "Help me understand why X is wrong"
ATP Uses BOTH:
- Monitoring: Alerts, SLOs, health checks
- Observability: Traces, logs, metrics for investigation
Code Examples: - Three pillars examples - Combined usage scenario - Observability workflow
Diagrams: - Three pillars architecture - Pillars working together - Observability vs. monitoring
Deliverables: - Observability fundamentals guide - Pillar comparison matrix - Usage scenarios
Topic 2: ATP Observability Requirements¶
What will be covered: - ATP-Specific Observability Needs
Audit Integrity:
- Trace every write (who, when, what)
- Tamper detection observability
- Integrity verification tracking
Compliance:
- No PII in logs/metrics/traces
- Tenant isolation in telemetry
- Audit access to audit data
- Retention period compliance (7+ years)
Multi-Tenancy:
- Per-tenant metrics (without high cardinality)
- Tenant context in all telemetry
- Cross-tenant query detection
Performance:
- Sub-second latency tracking (P50, P95, P99)
- Projection lag monitoring (<5s SLO)
- Query performance (by endpoint, tenant class)
Security:
- Unauthorized access attempts
- Privilege escalation detection
- Data exfiltration monitoring
- Anomaly detection
Operational:
- Service dependencies (application map)
- Capacity utilization (CPU, memory, storage, network)
- Error rates and types
- Queue depths and consumer lag
- Mandatory Telemetry Attributes
Resource Attributes (service-level):
- service.name: "atp.ingestion"
- service.version: "1.2.3"
- deployment.environment: "production"
- cloud.provider: "azure"
- cloud.region: "eastus"
- cloud.availability_zone: "eastus-1"
Span/Metric/Log Attributes (request-level):
- tenant.id: "acme-corp" (or tenant.class: "enterprise")
- tenant.edition: "enterprise"
- correlation.id: "<ulid>"
- trace.id: "<w3c-trace-id>"
- span.id: "<span-id>"
- http.route: "/api/v1/ingest"
- http.method: "POST"
- http.status_code: 200
- messaging.operation: "publish" | "receive"
- messaging.destination: "audit.appended.v1"
- db.operation: "INSERT" | "SELECT"
- db.table: "AuditRecords"
ATP-Specific:
- audit.record.id: "<ulid>"
- idempotency.key: "<tenant>:<source>:<seq>"
- classification: "SENSITIVE" | "PII" | "PUBLIC"
- policy.version: 123
Code Examples: - ATP observability requirements - Mandatory attributes - Attribute taxonomy
Diagrams: - ATP observability architecture - Attribute hierarchy
Deliverables: - Requirements specification - Attribute catalog - Compliance mapping
CYCLE 2: OpenTelemetry Architecture (~4,000 lines)¶
Topic 3: OpenTelemetry Overview¶
What will be covered: - What is OpenTelemetry (OTel)?
OpenTelemetry:
- Open source observability framework
- Vendor-neutral (no lock-in)
- Single SDK for traces, metrics, logs
- Standardized protocols (OTLP)
- Ecosystem of exporters (Prometheus, Jaeger, Azure Monitor, etc.)
Components:
1. SDK (in-process instrumentation)
- Traces: ActivitySource, Activity
- Metrics: Meter, Counter, Histogram, Gauge
- Logs: ILogger integration
2. API (interface for instrumentation)
- Language-agnostic specification
- Implementation-independent
3. Collector (agent/gateway)
- Receives telemetry (OTLP)
- Processes (batch, filter, transform)
- Exports to backends (Prometheus, Jaeger, Azure Monitor)
4. Semantic Conventions
- Standard attribute names
- Consistent meaning across systems
- Examples: http.method, db.system, messaging.destination
---
ATP OpenTelemetry Architecture
flowchart LR
  subgraph "ATP Services"
    GW[Gateway<br/>OTel SDK]
    ING[Ingestion<br/>OTel SDK]
    QRY[Query<br/>OTel SDK]
    PROJ[Projection<br/>OTel SDK]
  end
  subgraph "OTel Collection"
    COL[OTel Collector<br/>DaemonSet]
  end
  subgraph "Backends"
    PROM[Prometheus<br/>Metrics]
    JAEGER[Jaeger<br/>Traces]
    AZMON[Azure Monitor<br/>Traces/Metrics/Logs]
    SEQ[Seq<br/>Logs]
  end
  GW -->|OTLP gRPC :4317| COL
  ING -->|OTLP gRPC :4317| COL
  QRY -->|OTLP gRPC :4317| COL
  PROJ -->|OTLP gRPC :4317| COL
  COL -->|Remote Write| PROM
  COL -->|OTLP| JAEGER
  COL -->|Azure Monitor Exporter| AZMON
  COL -->|OTLP HTTP| SEQ
---
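The Collector topology in the diagram corresponds to a pipeline configuration roughly like the following. This is a sketch, not the final ATP config: the receiver/exporter component names follow the standard OTel Collector (contrib) distribution, and the endpoints and environment variable are assumptions.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  batch: {}
exporters:
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write
  azuremonitor:
    connection_string: ${APPLICATIONINSIGHTS_CONNECTION_STRING}
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [azuremonitor]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
```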
Why OpenTelemetry for ATP?
- Vendor-neutral (avoid lock-in)
- Single instrumentation (traces + metrics + logs)
- Rich ecosystem (auto-instrumentation for ASP.NET, SQL, HTTP, gRPC, Service Bus)
- Azure Monitor support
- Industry standard (CNCF graduated project)
Code Examples: - OTel architecture overview - Component responsibilities - ATP integration
Diagrams: - OTel architecture - ATP OTel topology - Data flow
Deliverables: - OTel overview - Architecture guide - Integration roadmap
Topic 4: OpenTelemetry SDK Configuration¶
What will be covered: - OTel SDK Setup (C#)
// Program.cs / Startup.cs
var builder = WebApplication.CreateBuilder(args);
// Configure OpenTelemetry
builder.Services.AddOpenTelemetry()
.ConfigureResource(resource =>
{
resource
.AddService(
serviceName: "atp.ingestion",
serviceVersion: "1.2.3",
serviceInstanceId: Environment.MachineName)
.AddAttributes(new Dictionary<string, object>
{
["deployment.environment"] = builder.Environment.EnvironmentName,
["cloud.provider"] = "azure",
["cloud.region"] = Environment.GetEnvironmentVariable("AZURE_REGION") ?? "eastus",
["host.name"] = Environment.MachineName
});
})
.WithTracing(tracing =>
{
tracing
// Auto-instrumentation
.AddAspNetCoreInstrumentation(options =>
{
options.RecordException = true;
options.EnrichWithHttpRequest = (activity, request) =>
{
// Add tenant context
var tenantId = request.Headers["X-Tenant-Id"].FirstOrDefault();
if (tenantId != null)
{
activity.SetTag("tenant.id", tenantId);
}
};
options.EnrichWithHttpResponse = (activity, response) =>
{
activity.SetTag("http.response.size", response.ContentLength);
};
})
.AddHttpClientInstrumentation()
.AddSqlClientInstrumentation(options =>
{
options.SetDbStatementForText = true;
options.RecordException = true;
options.EnableConnectionLevelAttributes = true;
})
.AddGrpcClientInstrumentation()
// Custom activity sources
.AddSource("ATP.Ingestion")
.AddSource("ATP.Domain")
.AddSource("MassTransit")
// Sampling (production: 10%, dev: 100%)
.SetSampler(builder.Environment.IsDevelopment()
? new AlwaysOnSampler()
: new TraceIdRatioBasedSampler(0.1))
// Exporters
.AddOtlpExporter(options =>
{
options.Endpoint = new Uri("http://otel-collector:4317");
options.Protocol = OtlpExportProtocol.Grpc;
});
})
.WithMetrics(metrics =>
{
metrics
// Auto-instrumentation
.AddAspNetCoreInstrumentation()
.AddHttpClientInstrumentation()
.AddRuntimeInstrumentation()
.AddProcessInstrumentation()
// Custom meters
.AddMeter("ATP.Ingestion")
.AddMeter("MassTransit")
// Exporters
.AddOtlpExporter(options =>
{
options.Endpoint = new Uri("http://otel-collector:4317");
options.Protocol = OtlpExportProtocol.Grpc;
})
.AddPrometheusExporter(); // Expose /metrics endpoint
});
// Enable Azure Monitor (alternative/additional)
builder.Services.AddOpenTelemetry()
.UseAzureMonitor(options =>
{
options.ConnectionString = builder.Configuration["ApplicationInsights:ConnectionString"];
});
- Configuration Options
- Sampling rates (dev: 100%, prod: 1-10%)
- Exporter endpoints
- Batch processing
- Resource attributes
- Instrumentation options
Code Examples: - Complete OTel SDK configuration (all ATP services) - Environment-specific settings - Exporter configurations
Diagrams: - SDK architecture - Configuration flow
Deliverables: - OTel SDK setup guide - Configuration reference - Service-specific configurations
CYCLE 3: Distributed Tracing (~5,000 lines)¶
Topic 5: Tracing Fundamentals¶
What will be covered: - Trace, Span, Context
Trace:
- End-to-end request journey
- Collection of related spans
- Unique trace ID (shared across all spans)
Span:
- Single operation within trace
- Has start time, end time, duration
- Parent-child relationships
- Attributes (tags)
Context:
- Trace ID + Span ID + Trace Flags
- Propagated across service boundaries
- W3C Trace Context standard: traceparent header
Example Trace:
TraceID: 4bf92f3577b34da6a3ce929d0e0e4736
Spans:
├─ Gateway (50ms)
│ ├─ Authentication (10ms)
│ └─ Routing (5ms)
├─ Ingestion (4.2s) ← SLOW
│ ├─ Validation (20ms)
│ ├─ Policy.Evaluate (3.8s) ← BOTTLENECK
│ │ └─ Database.Query (3.5s) ← ROOT CAUSE
│ ├─ Database.Insert (200ms)
│ └─ Outbox.Append (100ms)
└─ Service Bus Publish (50ms)
Total Duration: 4.3s
Bottleneck: Policy evaluation → database query
---
W3C Trace Context
traceparent Header Format:
00-<trace-id>-<parent-span-id>-<trace-flags>
Example:
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
Parts:
- 00: Version
- 4bf92f3577b34da6a3ce929d0e0e4736: Trace ID (16 bytes, hex)
- 00f067aa0ba902b7: Parent Span ID (8 bytes, hex)
- 01: Trace Flags (sampled=1)
tracestate Header (optional, vendor-specific):
tracestate: tenant=acme-corp,edition=enterprise
---
Creating Spans (C#)
using System.Diagnostics;
using OpenTelemetry.Trace; // RecordException extension method

public class IngestionService
{
    private static readonly ActivitySource ActivitySource =
        new ActivitySource("ATP.Ingestion", "1.0.0");

    public async Task<string> IngestEventAsync(AuditEvent auditEvent)
    {
        // Start span
        using var activity = ActivitySource.StartActivity(
            "Ingestion.IngestEvent",
            ActivityKind.Server);

        // Add attributes
        activity?.SetTag("tenant.id", auditEvent.TenantId);
        activity?.SetTag("event.type", auditEvent.EventType);
        activity?.SetTag("audit.record.id", auditEvent.Id);

        try
        {
            // Child span: Validation
            using (var validateActivity = ActivitySource.StartActivity("Ingestion.Validate"))
            {
                await ValidateEventAsync(auditEvent);
            }

            // Child span: Policy evaluation
            using (var policyActivity = ActivitySource.StartActivity("Ingestion.EvaluatePolicy"))
            {
                var policy = await _policyService.EvaluateAsync(auditEvent);
                policyActivity?.SetTag("policy.version", policy.Version);
                policyActivity?.SetTag("classification", policy.Classification);
            }

            // Child span: Database insert
            using (var dbActivity = ActivitySource.StartActivity("Ingestion.PersistToDatabase"))
            {
                await _repository.SaveAsync(auditEvent);
            }

            // Success
            activity?.SetStatus(ActivityStatusCode.Ok);
            activity?.SetTag("result", "success");
            return auditEvent.Id;
        }
        catch (Exception ex)
        {
            // Record exception in span
            activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
            activity?.RecordException(ex);
            activity?.SetTag("result", "error");
            activity?.SetTag("error.type", ex.GetType().Name);
            throw;
        }
    }
}
Code Examples: - Complete tracing implementation - Span creation patterns - Attribute setting - Error recording
Diagrams: - Trace structure - Span hierarchy - Context propagation
Deliverables: - Tracing fundamentals guide - Implementation patterns - Span library
Topic 6: Auto-Instrumentation¶
What will be covered:
- ASP.NET Core Auto-Instrumentation
  - Automatic spans for HTTP requests
  - Route, method, status code captured
  - Exception recording
- SQL Client Auto-Instrumentation
  - Automatic spans for database queries
  - Connection, command, query captured
  - Parameter sanitization (no PII)
- HTTP Client Auto-Instrumentation
  - Automatic spans for outbound HTTP calls
  - URL, method, status captured
- gRPC Auto-Instrumentation
- Azure SDK Auto-Instrumentation
  - Service Bus, Blob Storage, Key Vault
- MassTransit Auto-Instrumentation
  - Message publish/consume spans
  - Correlation propagation
Code Examples: - Auto-instrumentation configuration - All ATP auto-instrumented libraries
Deliverables: - Auto-instrumentation guide - Configuration catalog
CYCLE 4: Trace Context Propagation (~4,000 lines)¶
Topic 7: HTTP Context Propagation¶
What will be covered: - Propagating Trace Context (HTTP)
// Outbound HTTP call (automatic with HttpClient instrumentation)
var httpClient = _httpClientFactory.CreateClient("PolicyService");
// OTel automatically adds headers:
// traceparent: 00-{trace-id}-{span-id}-01
// tracestate: tenant={tenant-id}
var response = await httpClient.PostAsync("/api/policy/evaluate", content);
// Manual propagation (if needed)
var request = new HttpRequestMessage(HttpMethod.Post, "/api/policy/evaluate");
var currentActivity = Activity.Current;
if (currentActivity != null)
{
    // Activity.Id is already the W3C traceparent value
    // (version-traceid-spanid-flags); reuse it rather than
    // hand-building the header with hardcoded flags
    request.Headers.Add("traceparent", currentActivity.Id);
}
---
Message Bus Context Propagation
// MassTransit automatically propagates trace context
public class AuditAcceptedEventConsumer : IConsumer<AuditAcceptedEvent>
{
    public async Task Consume(ConsumeContext<AuditAcceptedEvent> context)
    {
        // Trace context automatically restored from message headers;
        // Activity.Current contains the parent trace.
        // Note: do NOT dispose Activity.Current here - the consumer
        // does not own it.
        var activity = Activity.Current;
        activity?.SetTag("tenant.id", context.Message.TenantId);
        activity?.SetTag("event.id", context.Message.AuditRecordId);

        // Process event...
    }
}

// Manual propagation (if needed)
await _bus.Publish(new AuditAcceptedEvent
{
    AuditRecordId = recordId,
    TenantId = tenantId
}, context =>
{
    // Trace context propagated automatically via MassTransit
    // Or manual: context.Headers.Set("traceparent", Activity.Current?.Id);
    context.Headers.Set("tenant-id", tenantId);
});
---
Background Job Context
- Start new trace or link to parent
- Baggage propagation
- Correlation preservation
Code Examples: - Context propagation patterns (HTTP, messaging, background) - Manual propagation when needed - Baggage usage
Diagrams: - Context propagation flow - Cross-service tracing
Deliverables: - Propagation guide - Implementation patterns
Topic 8: Baggage & Correlation¶
What will be covered: - OpenTelemetry Baggage
// Set baggage (propagated to all downstream services)
Baggage.SetBaggage("tenant.id", tenantId);
Baggage.SetBaggage("correlation.id", correlationId);
Baggage.SetBaggage("tenant.edition", "enterprise");
// Read baggage (in downstream service)
var tenantId = Baggage.GetBaggage("tenant.id");
var correlationId = Baggage.GetBaggage("correlation.id");
// Baggage propagated via W3C baggage header:
// baggage: tenant.id=acme-corp,correlation.id=01HZX123,edition=enterprise
- Correlation ID Pattern
- Request ID vs. Correlation ID vs. Trace ID
Trace ID:
- OpenTelemetry trace identifier
- Generated per request
- Links all spans in trace
Correlation ID:
- Business/application-level correlation
- Can span multiple traces
- Example: Order ID, User Session ID
- ATP: ULID for audit record
Request ID:
- Gateway-generated unique ID
- For client correlation
- May equal trace ID or separate
Code Examples: - Baggage usage - Correlation patterns - ID management
Deliverables: - Baggage guide - Correlation strategies
CYCLE 5: Metrics Collection (~4,500 lines)¶
Topic 9: OTel Metrics API¶
What will be covered: - Metric Instruments
using System.Diagnostics.Metrics;
public class IngestionMetrics
{
private readonly Meter _meter;
private readonly Counter<long> _requestsTotal;
private readonly Histogram<double> _requestDuration;
private readonly UpDownCounter<int> _activeRequests;
private readonly ObservableGauge<long> _outboxPending;
private readonly IOutboxRepository _outboxRepository;
public IngestionMetrics(IMeterFactory meterFactory, IOutboxRepository outboxRepository)
{
_outboxRepository = outboxRepository;
_meter = meterFactory.Create("ATP.Ingestion", "1.0.0");
// Counter (monotonic increase)
_requestsTotal = _meter.CreateCounter<long>(
name: "ingest.requests.total",
unit: "{requests}",
description: "Total ingestion requests");
// Histogram (distribution of values)
_requestDuration = _meter.CreateHistogram<double>(
name: "ingest.request.duration",
unit: "s",
description: "Ingestion request duration");
// UpDownCounter (can increase or decrease)
_activeRequests = _meter.CreateUpDownCounter<int>(
name: "ingest.requests.active",
unit: "{requests}",
description: "Active ingestion requests");
// ObservableGauge (observed value)
_outboxPending = _meter.CreateObservableGauge<long>(
name: "outbox.events.pending",
observeValue: () => new Measurement<long>(
GetOutboxPendingCount(),
new KeyValuePair<string, object>("service", "ingestion")),
unit: "{events}",
description: "Pending outbox events");
}
public void RecordRequest(string result, string tenantClass, double durationSeconds)
{
// Increment counter
_requestsTotal.Add(1,
new KeyValuePair<string, object>("result", result),
new KeyValuePair<string, object>("tenant_class", tenantClass));
// Record duration
_requestDuration.Record(durationSeconds,
new KeyValuePair<string, object>("result", result));
}
public IDisposable TrackActiveRequest()
{
_activeRequests.Add(1);
return new DisposableAction(() => _activeRequests.Add(-1));
}
private long GetOutboxPendingCount()
{
return _outboxRepository.GetPendingCount();
}
}
- Metric Types
Counter:
- Monotonically increasing
- Resets to zero on restart
- Example: Total requests, total errors
- Query: rate(metric[5m])
UpDownCounter:
- Can increase or decrease
- Example: Active connections, queue depth
- Query: metric (current value)
Histogram:
- Distribution of values (buckets)
- Calculate percentiles (P50, P95, P99)
- Example: Request duration, payload size
- Query: histogram_quantile(0.95, metric)
Gauge (Observable):
- Point-in-time value
- Observed when scraped
- Example: CPU usage, memory usage, pending jobs
- Query: metric (current value)
Code Examples: - Complete metrics implementation (all services) - Metric instrument usage - Query examples
Diagrams: - Metric types - Collection flow
Deliverables: - Metrics API guide - Instrument catalog - Usage patterns
Topic 10: Prometheus Exposition¶
What will be covered: - Prometheus /metrics Endpoint
// Enable Prometheus exporter
builder.Services.AddOpenTelemetry()
.WithMetrics(metrics =>
{
metrics.AddPrometheusExporter();
});
// Map endpoint
app.MapPrometheusScrapingEndpoint();
// Exposes /metrics on the application's own HTTP port
- Metrics Format
# HELP ingest_requests_total Total ingestion requests
# TYPE ingest_requests_total counter
ingest_requests_total{result="success",tenant_class="enterprise"} 15234
ingest_requests_total{result="failure",tenant_class="enterprise"} 23
# HELP ingest_request_duration Ingestion request duration
# TYPE ingest_request_duration histogram
ingest_request_duration_bucket{result="success",le="0.1"} 1000
ingest_request_duration_bucket{result="success",le="0.5"} 14500
ingest_request_duration_bucket{result="success",le="1.0"} 15200
ingest_request_duration_bucket{result="success",le="+Inf"} 15234
ingest_request_duration_sum{result="success"} 1234.56
ingest_request_duration_count{result="success"} 15234
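A matching Prometheus scrape job might look like the following. This is a sketch under common Kubernetes conventions; the job name and the `prometheus.io/scrape` annotation scheme are assumptions, not a confirmed ATP configuration.

```yaml
scrape_configs:
  - job_name: atp-services
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
```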
Code Examples: - Prometheus exporter setup - Scrape configuration - Query examples
Deliverables: - Prometheus integration guide - Scrape configurations
CYCLE 6: Custom Metrics (~4,000 lines)¶
Topic 11: ATP Business Metrics¶
What will be covered: - Audit Trail-Specific Metrics
public class AuditMetrics
{
private readonly Meter _meter;
// Instruments below are created from _meter in the constructor
// (e.g., _meter.CreateCounter<long>(...)); initialization omitted in this excerpt
// Counters
private readonly Counter<long> _recordsIngested;
private readonly Counter<long> _recordsClassified;
private readonly Counter<long> _recordsArchived;
private readonly Counter<long> _recordsPurged;
private readonly Counter<long> _integrityVerifications;
private readonly Counter<long> _tamperAnomalies;
// Histograms
private readonly Histogram<double> _recordSize;
private readonly Histogram<double> _verificationDuration;
// Gauges
private readonly ObservableGauge<long> _activeRetentionPolicies;
private readonly ObservableGauge<long> _recordsInHotStorage;
private readonly ObservableGauge<double> _projectionLag;
public void RecordIngestion(string tenantClass, string classification, int sizeBytes)
{
_recordsIngested.Add(1,
new KeyValuePair<string, object>("tenant_class", tenantClass),
new KeyValuePair<string, object>("classification", classification));
_recordSize.Record(sizeBytes);
}
public void RecordIntegrityCheck(string result, double durationSeconds)
{
_integrityVerifications.Add(1,
new KeyValuePair<string, object>("result", result));
_verificationDuration.Record(durationSeconds);
}
public void RecordTamperAnomaly(string type, string severity)
{
_tamperAnomalies.Add(1,
new KeyValuePair<string, object>("type", type),
new KeyValuePair<string, object>("severity", severity));
}
}
- Complete ATP Metrics Catalog (100+ metrics across all services)
Code Examples: - Complete ATP metrics implementation - Business metric patterns
Deliverables: - ATP metrics catalog (100+ metrics) - Implementation guide
Topic 12: Metric Labeling Strategy¶
What will be covered: - Low-Cardinality Labels
✅ GOOD (Low Cardinality):
- tenant_class: "small" | "medium" | "large" | "enterprise" (4 values)
- result: "success" | "failure" | "timeout" (3 values)
- region: "us-east" | "eu-west" | "il-central" (3 values)
- service: "gateway" | "ingestion" | "query" ... (8 values)
❌ BAD (High Cardinality):
- tenant_id: "acme-corp" | "contoso" | ... (1000s of values)
- user_id: "user-123" | "user-456" | ... (millions of values)
- trace_id: unique per request (billions of values)
- audit_record_id: unique per record (billions of values)
Why High Cardinality is Bad:
- Exponential metric explosion
- Prometheus memory exhaustion
- Query performance degradation
- Storage costs
Solution:
- Use tenant_class instead of tenant_id
- Use aggregations (count by class, not by ID)
- Use tracing for individual request details
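The aggregation approach above can be sketched in PromQL: per-class aggregation keeps the series count bounded while still answering tenant-tier questions. The metric names follow this document's catalog; the `tenant_class` label on the duration histogram is an assumption.

```promql
# Bounded cardinality: one series per (tenant_class, result) pair
sum(rate(ingest_requests_total[5m])) by (tenant_class, result)

# P95 latency per tenant class, from histogram buckets
histogram_quantile(0.95,
  sum(rate(ingest_request_duration_bucket[5m])) by (tenant_class, le))
```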
Code Examples: - Label strategy - Cardinality management
Deliverables: - Labeling guide - Cardinality policies
CYCLE 7: Structured Logging (~4,500 lines)¶
Topic 13: Serilog Configuration¶
What will be covered: - Serilog Setup
// Program.cs
Log.Logger = new LoggerConfiguration()
.ReadFrom.Configuration(configuration)
.Enrich.FromLogContext()
.Enrich.WithMachineName()
.Enrich.WithEnvironmentName()
.Enrich.WithCorrelationId()
.Enrich.WithClientIp()
.Enrich.WithExceptionDetails()
.WriteTo.Console(
outputTemplate: "[{Timestamp:HH:mm:ss} {Level:u3}] {Message:lj} {Properties:j}{NewLine}{Exception}")
.WriteTo.File(
path: "logs/atp-ingestion-.log",
rollingInterval: RollingInterval.Day,
outputTemplate: "[{Timestamp:yyyy-MM-dd HH:mm:ss.fff zzz}] [{Level:u3}] {Message:lj} {Properties:j}{NewLine}{Exception}",
retainedFileCountLimit: 31)
.WriteTo.File(
new JsonFormatter(),
path: "logs/atp-ingestion-.json",
rollingInterval: RollingInterval.Day,
retainedFileCountLimit: 31)
.WriteTo.Seq("http://seq:5341")
.WriteTo.ApplicationInsights(
telemetryConfiguration,
TelemetryConverter.Traces)
.WriteTo.OpenTelemetry(options =>
{
options.Endpoint = "http://otel-collector:4318/v1/logs";
options.Protocol = OtlpProtocol.HttpProtobuf;
options.ResourceAttributes = new Dictionary<string, object>
{
["service.name"] = "atp.ingestion"
};
})
.CreateLogger();
builder.Host.UseSerilog();
- Log Sinks
- Console (development)
- File (local debugging, JSON + text)
- Seq (development, log search)
- Application Insights (production)
- OpenTelemetry (unified pipeline)
- Azure Log Analytics (production)
Code Examples: - Complete Serilog configuration - Sink configurations - Environment-specific settings
Deliverables: - Serilog setup guide - Sink catalog - Configuration templates
Topic 14: Structured Logging Patterns¶
What will be covered: - Structured Logging Best Practices
// ❌ BAD: String interpolation (unstructured)
_logger.LogInformation($"User {userId} ingested event {eventId} for tenant {tenantId}");
// ✅ GOOD: Structured (named properties)
_logger.LogInformation(
"User {UserId} ingested event {EventId} for tenant {TenantId}",
userId, eventId, tenantId);
// Output (JSON):
{
"Timestamp": "2025-10-30T14:32:15.123Z",
"Level": "Information",
"Message": "User user-123 ingested event 01HZX... for tenant acme-corp",
"Properties": {
"UserId": "user-123",
"EventId": "01HZX123456789",
"TenantId": "acme-corp"
}
}
// ✅ GOOD: Enriched with context
using (_logger.BeginScope(new Dictionary<string, object>
{
["TenantId"] = tenantId,
["CorrelationId"] = correlationId,
["TraceId"] = Activity.Current?.TraceId.ToString()
}))
{
_logger.LogInformation("Processing ingestion request");
// ... operations
_logger.LogInformation("Ingestion complete");
// All logs in scope include TenantId, CorrelationId, TraceId
}
- Log Levels
Trace (Verbose):
- Detailed debugging information
- Use sparingly (performance cost)
- Example: "Entering method X with parameters Y"
Debug:
- Debugging information
- Disabled in production by default
- Example: "Cache hit for key X"
Information:
- Significant application events
- Enabled in production
- Example: "Event ingested successfully"
Warning:
- Abnormal but handled situations
- Review periodically
- Example: "Projection lag elevated (15s)"
Error:
- Error conditions (handled exceptions)
- Review immediately
- Example: "Failed to connect to database"
Critical:
- Critical failures (unhandled exceptions)
- Page on-call
- Example: "Data integrity verification failed"
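Minimum levels are typically set per environment via `Serilog.Settings.Configuration`. A sketch of the corresponding `appsettings.json` section (the namespace overrides shown are illustrative assumptions):

```json
{
  "Serilog": {
    "MinimumLevel": {
      "Default": "Information",
      "Override": {
        "Microsoft.AspNetCore": "Warning",
        "System.Net.Http.HttpClient": "Warning"
      }
    }
  }
}
```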
Code Examples: - Structured logging patterns - Log level usage - Scope and enrichment
Deliverables: - Logging best practices - Pattern catalog
CYCLE 8: Log Enrichment & Correlation (~4,000 lines)¶
Topic 15: Log Enrichers¶
What will be covered: - Built-In Enrichers
.Enrich.FromLogContext() // Include scope properties
.Enrich.WithMachineName() // Add machine name
.Enrich.WithEnvironmentName() // Add environment (dev, prod)
.Enrich.WithThreadId() // Add thread ID
.Enrich.WithProcessId() // Add process ID
.Enrich.WithCorrelationId() // Add correlation ID from headers
.Enrich.WithClientIp() // Add client IP
.Enrich.WithExceptionDetails() // Rich exception logging
- Custom Enrichers
public class TenantEnricher : ILogEventEnricher
{
    private readonly ITenantResolver _tenantResolver;

    public TenantEnricher(ITenantResolver tenantResolver)
    {
        _tenantResolver = tenantResolver;
    }

    public void Enrich(LogEvent logEvent, ILogEventPropertyFactory propertyFactory)
    {
        var tenantId = _tenantResolver.GetCurrentTenantId();
        if (tenantId != null)
        {
            var property = propertyFactory.CreateProperty("TenantId", tenantId);
            logEvent.AddPropertyIfAbsent(property);
        }
    }
}

// Register enricher
.Enrich.With<TenantEnricher>()
Code Examples: - Enricher configuration - Custom enrichers
Deliverables: - Enrichment guide
Topic 16: Correlation Across Logs, Traces, Metrics¶
What will be covered: - Unified Correlation - Cross-Pillar Queries - Exemplars (Prometheus → Traces)
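Unified correlation hinges on one identifier, the active trace, showing up in all three pillars. A sketch of the idea, assuming System.Diagnostics and an injected `ILogger` (source, meter, and instrument names are illustrative):

```csharp
using System.Diagnostics;
using System.Diagnostics.Metrics;

static readonly ActivitySource Source = new("ATP.Ingestion");
static readonly Meter Meter = new("ATP.Ingestion");
static readonly Counter<long> Ingested = Meter.CreateCounter<long>("atp_events_ingested_total");

using (var activity = Source.StartActivity("IngestEvent"))
using (logger.BeginScope(new Dictionary<string, object>
{
    ["TraceId"] = activity?.TraceId.ToString() ?? "none"   // logs ↔ traces
}))
{
    Ingested.Add(1);   // measurement taken inside the Activity, so the OTel SDK
                       // can attach an exemplar (metrics ↔ traces) if configured
    logger.LogInformation("Event ingested");
}
```

With this shape, a spike on the metric panel can link (via the exemplar) to a trace, and the trace ID filters the logs.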
Code Examples: - Correlation strategies
Deliverables: - Correlation guide
CYCLE 9: Azure Monitor Integration (~5,000 lines)¶
Topic 17: Application Insights¶
What will be covered: - Application Insights Setup - Telemetry Types (requests, dependencies, exceptions, traces, metrics) - Custom Events and Metrics - Application Map - Live Metrics Stream
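A minimal setup sketch using the Azure Monitor OpenTelemetry distro (Azure.Monitor.OpenTelemetry.AspNetCore package); the configuration key is an assumption, in practice the connection string comes from app settings or Key Vault:

```csharp
using Azure.Monitor.OpenTelemetry.AspNetCore;

var builder = WebApplication.CreateBuilder(args);

// Routes OpenTelemetry traces, metrics, and logs to Application Insights
builder.Services.AddOpenTelemetry().UseAzureMonitor(options =>
{
    options.ConnectionString = builder.Configuration["ApplicationInsights:ConnectionString"];
});

var app = builder.Build();
app.Run();
```

The distro enables the standard ASP.NET Core and HttpClient instrumentations by default, which is what populates the request, dependency, and exception telemetry types discussed above.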
Code Examples: - App Insights configuration
Deliverables: - Application Insights guide
Topic 18: Container Insights¶
What will be covered: - AKS Container Insights - Pod and Node Metrics - Log Collection - Performance Analysis
Code Examples: - Container Insights setup
Deliverables: - Container monitoring guide
CYCLE 11: Log Analytics & KQL (~4,500 lines)¶
Topic 21: KQL Query Language¶
What will be covered: - Kusto Query Language (KQL)
// Find all errors in last hour
traces
| where timestamp > ago(1h)
| where severityLevel >= 3 // Error or Critical
| where cloud_RoleName == "Ingestion"
| project timestamp, message, severityLevel, customDimensions
| order by timestamp desc
| take 100
// P95 latency by service
requests
| where timestamp > ago(1h)
| summarize P95=percentile(duration, 95) by cloud_RoleName, bin(timestamp, 5m)
| render timechart
// Error rate
requests
| where timestamp > ago(1h)
| summarize
Total = count(),
Errors = countif(success == false)
by bin(timestamp, 5m)
| extend ErrorRate = todouble(Errors) / todouble(Total)
| render timechart
Code Examples: - Complete KQL query library (50+ queries)
Deliverables: - KQL reference guide
Topic 22: Log Analytics Workbooks¶
What will be covered: - Custom Workbooks - Parameterized Queries - Interactive Dashboards
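Workbook queries interpolate parameter values with braces; a KQL sketch assuming a time-range parameter named `TimeRange` and a dropdown parameter named `ServiceName` (both hypothetical):

```kql
// {TimeRange} expands to the workbook's selected time window,
// {ServiceName} to the selected service
requests
| where timestamp {TimeRange}
| where cloud_RoleName == "{ServiceName}"
| summarize Count = count() by bin(timestamp, 5m)
| render timechart
```

The same query then drives multiple interactive views as the operator changes parameters, which is the main advantage of workbooks over static dashboards.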
Code Examples: - Workbook templates
Deliverables: - Workbook library
CYCLE 12: Prometheus & Grafana (~4,500 lines)¶
Topic 23: Prometheus Setup¶
What will be covered: - Prometheus Deployment - Scrape Configurations - Recording Rules - Storage and Retention
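A scrape-configuration sketch for ATP pods on Kubernetes, using the standard `prometheus.io/*` annotation convention (the job name is illustrative):

```yaml
# prometheus.yml (fragment): discover pods and keep only those opting in
scrape_configs:
  - job_name: atp-services
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
```

Recording rules and retention settings would layer on top of this base discovery config.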
Code Examples: - Prometheus configuration
Deliverables: - Prometheus setup guide
Topic 24: Grafana Dashboards¶
What will be covered: - Dashboard Design Principles - ATP Operational Dashboards - Service Health Dashboards - SLO Dashboards
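The core panel of an SLO dashboard is usually the error ratio measured against the budget. A PromQL sketch, assuming a conventional `http_requests_total` counter with a `code` label (both hypothetical for ATP):

```promql
# Fraction of requests failing over the last hour (compare against 1 - SLO target)
sum(rate(http_requests_total{code=~"5.."}[1h]))
  /
sum(rate(http_requests_total[1h]))
```

Evaluating the same ratio over multiple windows (e.g. 5m and 1h) gives the multi-window burn-rate view commonly used for SLO alerting.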
Code Examples: - Dashboard JSON templates (10+ dashboards)
Deliverables: - Dashboard library
CYCLE 13: Dashboard Design (~4,000 lines)¶
Topic 25: Dashboard Catalog¶
What will be covered: - ATP Operations Dashboard - Service-Specific Dashboards (8 services) - Infrastructure Dashboards - Business Dashboards
Code Examples: - Complete dashboard library
Deliverables: - Dashboard templates
Topic 26: Dashboard Best Practices¶
What will be covered: - Layout Guidelines - Color Conventions - Alert Visualization
Deliverables: - Design guide
CYCLE 16: PII Redaction & Compliance (~3,500 lines)¶
Topic 31: PII Redaction in Telemetry¶
What will be covered: - Log Redaction
// Microsoft.Extensions.Compliance.Redaction
// (DataClassifications.* is ATP's project-defined data taxonomy)
services.AddRedaction(configure =>
{
    configure.SetRedactor<ErasingRedactor>(DataClassifications.PIIData);
    configure.SetHmacRedactor(
        o => { o.KeyId = 1; o.Key = hmacKey; },   // hmacKey: base64 secret from configuration
        DataClassifications.SensitiveData);
});
// Usage: redaction applies to [LoggerMessage]-generated log methods whose
// parameters carry a data classification attribute
[LoggerMessage(LogLevel.Information, "User {UserId} with email {Email} logged in")]
partial void LogUserLogin(string userId, [PIIData] string email);

LogUserLogin(userId, userEmail); // Email automatically redacted
// Output:
// "User user-123 with email *** logged in"
- Trace Attribute Sanitization
- Metric Label Filtering
- Compliance Validation
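Trace attributes can be sanitized before export with a custom OpenTelemetry processor. A sketch, assuming the OpenTelemetry .NET SDK (the attribute names are illustrative):

```csharp
using System.Diagnostics;
using OpenTelemetry;

public sealed class PiiAttributeProcessor : BaseProcessor<Activity>
{
    private static readonly string[] PiiKeys = { "user.email", "client.address" };

    public override void OnEnd(Activity activity)
    {
        foreach (var key in PiiKeys)
        {
            if (activity.GetTagItem(key) is not null)
            {
                activity.SetTag(key, "***");   // overwrite before the span is exported
            }
        }
    }
}

// Registration sketch: .WithTracing(t => t.AddProcessor(new PiiAttributeProcessor()))
```

The same deny-list approach applies to metric labels, where high-risk values should ideally never be used as label values in the first place.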
Code Examples: - Redaction implementation
Deliverables: - PII redaction guide
Topic 32: Compliance Monitoring¶
What will be covered: - Retention Compliance Monitoring - Data Residency Validation - Access Audit Logging
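Retention compliance can itself be exposed as telemetry. A sketch using System.Diagnostics.Metrics; the meter name, instrument name, and `CountEventsPastRetention` helper are all hypothetical:

```csharp
using System.Diagnostics.Metrics;

static readonly Meter Meter = new("ATP.Compliance");

// Observed on each metrics collection; a non-zero value should trigger an alert
static readonly ObservableGauge<int> RetentionViolations =
    Meter.CreateObservableGauge(
        "atp_retention_violations",
        () => CountEventsPastRetention(),   // hypothetical query against the event store
        description: "Events retained beyond their tenant's retention policy");
```

This turns a periodic compliance check into a continuously monitored signal rather than a scheduled report.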
Code Examples: - Compliance metrics
Deliverables: - Compliance monitoring guide
CYCLE 20: Best Practices & Troubleshooting (~3,500 lines)¶
Topic 39: Monitoring Best Practices¶
What will be covered: - Observability Design Principles - Common Anti-Patterns - Performance Optimization
Deliverables: - Best practices handbook
Topic 40: Troubleshooting Observability¶
What will be covered: - Missing Traces - Missing Metrics - Log Gaps - High Cardinality Issues
Deliverables: - Troubleshooting guide
Summary of Deliverables¶
Complete observability implementation covering:
- Fundamentals: Three pillars, ATP requirements
- OpenTelemetry: SDK setup, configuration, integration
- Distributed Tracing: W3C Trace Context, span creation, propagation
- Metrics: OTel Metrics API, Prometheus, custom metrics
- Logging: Serilog, structured logging, enrichment
- Azure Monitor: Application Insights, Log Analytics, Container Insights
- Prometheus & Grafana: Setup, dashboards, alerting
- Dashboards: 15+ operational dashboards
- APM: Performance monitoring, dependency tracking
- Compliance: PII redaction, audit logging, retention
- Operations: Correlation, troubleshooting, best practices
Related Documentation¶
- Alerts & SLOs: Alerting and SLO definitions
- Health Checks: Service health monitoring
- Runbook: Using monitoring for operations
- Architecture: Observability requirements
- Kubernetes: Container monitoring
This monitoring & observability guide covers the complete implementation of ATP's observability stack: OpenTelemetry instrumentation and distributed tracing, structured logging, Azure Monitor integration, Prometheus and Grafana dashboards, PII redaction, and compliance monitoring. Together these provide actionable visibility into system behavior while preserving privacy and meeting regulatory requirements.