Monitoring & Observability - Audit Trail Platform (ATP)¶
Three pillars of observability: ATP implements comprehensive monitoring with OpenTelemetry for distributed tracing (W3C Trace Context), metrics (Prometheus), and structured logs (Serilog). Telemetry is exported to Azure Monitor (Application Insights, Log Analytics, Container Insights) and Grafana, with tenant/correlation context propagation, PII redaction, and actionable dashboards for operational excellence.
📋 Documentation Generation Plan¶
This document will be generated in 20 cycles. Current progress:
| Cycle | Topics | Estimated Lines | Status |
|---|---|---|---|
| Cycle 1 | Observability Fundamentals (1-2) | ~3,500 | ⏳ Not Started |
| Cycle 2 | OpenTelemetry Architecture (3-4) | ~4,000 | ⏳ Not Started |
| Cycle 3 | Distributed Tracing (5-6) | ~5,000 | ⏳ Not Started |
| Cycle 4 | Trace Context Propagation (7-8) | ~4,000 | ⏳ Not Started |
| Cycle 5 | Metrics Collection (9-10) | ~4,500 | ⏳ Not Started |
| Cycle 6 | Custom Metrics (11-12) | ~4,000 | ⏳ Not Started |
| Cycle 7 | Structured Logging (13-14) | ~4,500 | ⏳ Not Started |
| Cycle 8 | Log Enrichment & Correlation (15-16) | ~4,000 | ⏳ Not Started |
| Cycle 9 | Azure Monitor Integration (17-18) | ~5,000 | ⏳ Not Started |
| Cycle 10 | Application Insights (19-20) | ~4,000 | ⏳ Not Started |
| Cycle 11 | Log Analytics & KQL (21-22) | ~4,500 | ⏳ Not Started |
| Cycle 12 | Prometheus & Grafana (23-24) | ~4,500 | ⏳ Not Started |
| Cycle 13 | Dashboard Design (25-26) | ~4,000 | ⏳ Not Started |
| Cycle 14 | Performance Monitoring (27-28) | ~4,000 | ⏳ Not Started |
| Cycle 15 | Application Performance Monitoring (29-30) | ~3,500 | ⏳ Not Started |
| Cycle 16 | PII Redaction & Compliance (31-32) | ~3,500 | ⏳ Not Started |
| Cycle 17 | Correlation & Context (33-34) | ~3,500 | ⏳ Not Started |
| Cycle 18 | Log Aggregation & Search (35-36) | ~3,500 | ⏳ Not Started |
| Cycle 19 | Monitoring Operations (37-38) | ~3,000 | ⏳ Not Started |
| Cycle 20 | Best Practices & Troubleshooting (39-40) | ~3,500 | ⏳ Not Started |
Total Estimated Lines: ~80,000
Purpose & Scope¶
This document provides the complete monitoring and observability implementation guide for ATP, covering OpenTelemetry (distributed tracing, metrics, logs), Azure Monitor (Application Insights, Log Analytics, Container Insights), Prometheus, Grafana, Serilog structured logging, correlation, dashboards, APM, and operational procedures for comprehensive, actionable, and compliant system observability.
Why Comprehensive Observability for ATP?
- Debugging: Trace requests across microservices, identify bottlenecks
- Performance: Measure latency, throughput, errors at every layer
- Reliability: Detect issues before customers do (proactive alerts)
- Compliance: Audit all access to audit data ("audit the auditor")
- Security: Detect anomalies, unauthorized access, data exfiltration
- Business Insights: Tenant usage patterns, feature adoption, capacity planning
- SLO Validation: Measure actual performance against targets
- Incident Response: Rich telemetry for faster MTTR
- Optimization: Data-driven performance and cost optimization
- Regulatory: Immutable observability trail for compliance audits
Three Pillars of Observability
1. TRACES (Request Flow)
- Distributed tracing across services
- End-to-end request path
- Latency attribution
- Error correlation
- Example: "Why did this ingestion request take 5 seconds?"
2. METRICS (System Health)
- Time-series data (CPU, memory, latency, throughput)
- Aggregations and percentiles
- SLO/SLI measurements
- Alerting thresholds
- Example: "What's the P95 latency for queries?"
3. LOGS (Event Details)
- Structured event records
- Contextual information
- Error messages and stack traces
- Audit trail
- Example: "What error occurred at 14:32 UTC for tenant acme-corp?"
ATP Observability Stack
Application Layer:
- OpenTelemetry SDK (C#)
- Serilog (structured logging)
- Custom meters and spans
Collection Layer:
- OTel Collector (DaemonSet on K8s)
- Prometheus (scraping)
- Fluent Bit (log forwarding)
Storage Layer:
- Azure Monitor (Application Insights, Log Analytics)
- Prometheus (TSDB, 30-day retention)
- Azure Blob (long-term log archive)
Visualization Layer:
- Grafana (dashboards, alerts)
- Azure Monitor Workbooks
- Application Insights Workbooks
Alerting Layer:
- Prometheus Alertmanager
- Azure Monitor Alerts
- PagerDuty integration
Key Technologies
- OpenTelemetry: Vendor-neutral observability framework (traces, metrics, logs)
- Serilog: Structured logging for .NET
- Prometheus: Time-series metrics database
- Grafana: Visualization and dashboards
- Azure Monitor: Cloud-native monitoring (Application Insights, Log Analytics, Container Insights)
- Jaeger/Zipkin: Distributed tracing UIs
- KQL (Kusto Query Language): Azure Monitor query language
- PromQL: Prometheus query language
- W3C Trace Context: Standard for trace propagation
Detailed Cycle Plan¶
CYCLE 1: Observability Fundamentals (~3,500 lines)¶
Topic 1: Three Pillars of Observability¶
What will be covered: - Traces, Metrics, Logs
TRACES (The "Why" and "Where"):
- Request journey across services
- Timing breakdown (where did time go?)
- Causal relationships
- Error attribution
Use Cases:
- Debug slow requests
- Find bottlenecks
- Understand service dependencies
- Root cause analysis
Example Questions Answered:
- "Why did this request take 5 seconds?"
- "Which service in the chain is slow?"
- "Where did this error originate?"
---
METRICS (The "What" and "How Much"):
- Numeric measurements over time
- Aggregations (avg, sum, percentile)
- System health indicators
- SLO/SLI tracking
Use Cases:
- Alerting (latency > threshold)
- Capacity planning (trending growth)
- SLO compliance (99.9% availability)
- Performance dashboards
Example Questions Answered:
- "What's the current error rate?"
- "What's the P95 latency?"
- "How many events are in the queue?"
---
LOGS (The "What Happened"):
- Discrete events
- Detailed context
- Error messages and stack traces
- Audit trail
Use Cases:
- Debugging specific errors
- Audit compliance
- Security investigation
- Business event tracking
Example Questions Answered:
- "What error occurred at 14:32 UTC?"
- "Which user triggered this action?"
- "What was the validation failure reason?"
---
How They Work Together
Scenario: Slow Ingestion Request
1. METRICS alert fires:
   - "P95 ingestion latency > 1s"
2. Check DASHBOARD (metrics):
   - Latency spiked at 14:30 UTC
   - Affecting tenant "acme-corp"
3. Find REQUEST in TRACES:
   - Search for requests to /api/v1/ingest
   - Filter: timestamp ~14:30, tenant=acme-corp
   - Trace shows: 4.2s total
     - Gateway: 50ms
     - Ingestion: 4.1s
       - Validation: 20ms
       - Policy call: 3.8s ← BOTTLENECK
       - Database: 200ms
       - Outbox: 100ms
4. Examine LOGS (detailed context):
   - Search: timestamp=14:30, traceId=abc123
   - Find: "PolicyService.Evaluate timeout after 3s"
   - Root cause: Policy service database connection pool exhausted
5. Resolution:
   - Scale policy service
   - Increase connection pool
   - Add connection pool health check
---
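The metrics alert that kicks off step 1 of a scenario like this can be expressed as a Prometheus alerting rule. This is a minimal sketch: the rule name, threshold, and the `ingest_request_duration_bucket` histogram series are assumptions based on the metric names defined later in this plan.

```yaml
groups:
  - name: atp-ingestion
    rules:
      - alert: IngestionP95LatencyHigh
        # P95 over the last 5 minutes, computed from histogram buckets
        expr: histogram_quantile(0.95, sum(rate(ingest_request_duration_bucket[5m])) by (le)) > 1
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "P95 ingestion latency > 1s"
```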
Observability vs. Monitoring
Monitoring (Traditional):
- Known failure modes
- Predefined dashboards
- Static thresholds
- "Tell me when X goes wrong"
Observability (Modern):
- Unknown failure modes
- Ad-hoc exploration
- Dynamic queries
- "Help me understand why X is wrong"
ATP Uses BOTH:
- Monitoring: Alerts, SLOs, health checks
- Observability: Traces, logs, metrics for investigation
Code Examples: - Three pillars examples - Combined usage scenario - Observability workflow
Diagrams: - Three pillars architecture - Pillars working together - Observability vs. monitoring
Deliverables: - Observability fundamentals guide - Pillar comparison matrix - Usage scenarios
Topic 2: ATP Observability Requirements¶
What will be covered: - ATP-Specific Observability Needs
Audit Integrity:
- Trace every write (who, when, what)
- Tamper detection observability
- Integrity verification tracking
Compliance:
- No PII in logs/metrics/traces
- Tenant isolation in telemetry
- Audit access to audit data
- Retention period compliance (7+ years)
Multi-Tenancy:
- Per-tenant metrics (without high cardinality)
- Tenant context in all telemetry
- Cross-tenant query detection
Performance:
- Sub-second latency tracking (P50, P95, P99)
- Projection lag monitoring (<5s SLO)
- Query performance (by endpoint, tenant class)
Security:
- Unauthorized access attempts
- Privilege escalation detection
- Data exfiltration monitoring
- Anomaly detection
Operational:
- Service dependencies (application map)
- Capacity utilization (CPU, memory, storage, network)
- Error rates and types
- Queue depths and consumer lag
- Mandatory Telemetry Attributes
Resource Attributes (service-level):
- service.name: "atp.ingestion"
- service.version: "1.2.3"
- deployment.environment: "production"
- cloud.provider: "azure"
- cloud.region: "eastus"
- cloud.availability_zone: "eastus-1"
Span/Metric/Log Attributes (request-level):
- tenant.id: "acme-corp" (or tenant.class: "enterprise")
- tenant.edition: "enterprise"
- correlation.id: "<ulid>"
- trace.id: "<w3c-trace-id>"
- span.id: "<span-id>"
- http.route: "/api/v1/ingest"
- http.method: "POST"
- http.status_code: 200
- messaging.operation: "publish" | "receive"
- messaging.destination: "audit.appended.v1"
- db.operation: "INSERT" | "SELECT"
- db.table: "AuditRecords"
ATP-Specific:
- audit.record.id: "<ulid>"
- idempotency.key: "<tenant>:<source>:<seq>"
- classification: "SENSITIVE" | "PII" | "PUBLIC"
- policy.version: 123
Code Examples: - ATP observability requirements - Mandatory attributes - Attribute taxonomy
Diagrams: - ATP observability architecture - Attribute hierarchy
Deliverables: - Requirements specification - Attribute catalog - Compliance mapping
CYCLE 2: OpenTelemetry Architecture (~4,000 lines)¶
Topic 3: OpenTelemetry Overview¶
What will be covered: - What is OpenTelemetry (OTel)?
OpenTelemetry:
- Open source observability framework
- Vendor-neutral (no lock-in)
- Single SDK for traces, metrics, logs
- Standardized protocols (OTLP)
- Ecosystem of exporters (Prometheus, Jaeger, Azure Monitor, etc.)
Components:
1. SDK (in-process instrumentation)
- Traces: ActivitySource, Activity
- Metrics: Meter, Counter, Histogram, Gauge
- Logs: ILogger integration
2. API (interface for instrumentation)
- Language-agnostic specification
- Implementation-independent
3. Collector (agent/gateway)
- Receives telemetry (OTLP)
- Processes (batch, filter, transform)
- Exports to backends (Prometheus, Jaeger, Azure Monitor)
4. Semantic Conventions
- Standard attribute names
- Consistent meaning across systems
- Examples: http.method, db.system, messaging.destination
---
ATP OpenTelemetry Architecture
flowchart LR
  subgraph "ATP Services"
    GW[Gateway<br/>OTel SDK]
    ING[Ingestion<br/>OTel SDK]
    QRY[Query<br/>OTel SDK]
    PROJ[Projection<br/>OTel SDK]
  end
  subgraph "OTel Collection"
    COL[OTel Collector<br/>DaemonSet]
  end
  subgraph "Backends"
    PROM[Prometheus<br/>Metrics]
    JAEGER[Jaeger<br/>Traces]
    AZMON[Azure Monitor<br/>Traces/Metrics/Logs]
    SEQ[Seq<br/>Logs]
  end
  GW -->|OTLP gRPC :4317| COL
  ING -->|OTLP gRPC :4317| COL
  QRY -->|OTLP gRPC :4317| COL
  PROJ -->|OTLP gRPC :4317| COL
  COL -->|Remote Write| PROM
  COL -->|OTLP| JAEGER
  COL -->|Azure Monitor Exporter| AZMON
  COL -->|OTLP HTTP| SEQ
---
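The Collector topology in the diagram corresponds to a pipeline configuration roughly like the following. This is a sketch, not the final ATP config: the receiver/exporter component names follow the standard OTel Collector (contrib) distribution, and the endpoints and environment variable are assumptions.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  batch: {}
exporters:
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write
  azuremonitor:
    connection_string: ${APPLICATIONINSIGHTS_CONNECTION_STRING}
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [azuremonitor]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
```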
Why OpenTelemetry for ATP?
- Vendor-neutral (avoid lock-in)
- Single instrumentation (traces + metrics + logs)
- Rich ecosystem (auto-instrumentation for ASP.NET, SQL, HTTP, gRPC, Service Bus)
- Azure Monitor support
- Industry standard (CNCF graduated project)
Code Examples: - OTel architecture overview - Component responsibilities - ATP integration
Diagrams: - OTel architecture - ATP OTel topology - Data flow
Deliverables: - OTel overview - Architecture guide - Integration roadmap
Topic 4: OpenTelemetry SDK Configuration¶
What will be covered: - OTel SDK Setup (C#)
// Program.cs / Startup.cs
var builder = WebApplication.CreateBuilder(args);
// Configure OpenTelemetry
builder.Services.AddOpenTelemetry()
.ConfigureResource(resource =>
{
resource
.AddService(
serviceName: "atp.ingestion",
serviceVersion: "1.2.3",
serviceInstanceId: Environment.MachineName)
.AddAttributes(new Dictionary<string, object>
{
["deployment.environment"] = builder.Environment.EnvironmentName,
["cloud.provider"] = "azure",
["cloud.region"] = Environment.GetEnvironmentVariable("AZURE_REGION") ?? "eastus",
["host.name"] = Environment.MachineName
});
})
.WithTracing(tracing =>
{
tracing
// Auto-instrumentation
.AddAspNetCoreInstrumentation(options =>
{
options.RecordException = true;
options.EnrichWithHttpRequest = (activity, request) =>
{
// Add tenant context
var tenantId = request.Headers["X-Tenant-Id"].FirstOrDefault();
if (tenantId != null)
{
activity.SetTag("tenant.id", tenantId);
}
};
options.EnrichWithHttpResponse = (activity, response) =>
{
activity.SetTag("http.response.size", response.ContentLength);
};
})
.AddHttpClientInstrumentation()
.AddSqlClientInstrumentation(options =>
{
options.SetDbStatementForText = true;
options.RecordException = true;
options.EnableConnectionLevelAttributes = true;
})
.AddGrpcClientInstrumentation()
// Custom activity sources
.AddSource("ATP.Ingestion")
.AddSource("ATP.Domain")
.AddSource("MassTransit")
// Sampling (production: 10%, dev: 100%)
.SetSampler(builder.Environment.IsDevelopment()
? new AlwaysOnSampler()
: new TraceIdRatioBasedSampler(0.1))
// Exporters
.AddOtlpExporter(options =>
{
options.Endpoint = new Uri("http://otel-collector:4317");
options.Protocol = OtlpExportProtocol.Grpc;
});
})
.WithMetrics(metrics =>
{
metrics
// Auto-instrumentation
.AddAspNetCoreInstrumentation()
.AddHttpClientInstrumentation()
.AddRuntimeInstrumentation()
.AddProcessInstrumentation()
// Custom meters
.AddMeter("ATP.Ingestion")
.AddMeter("MassTransit")
// Exporters
.AddOtlpExporter(options =>
{
options.Endpoint = new Uri("http://otel-collector:4317");
options.Protocol = OtlpExportProtocol.Grpc;
})
.AddPrometheusExporter(); // Expose /metrics endpoint
});
// Enable Azure Monitor (alternative/additional)
builder.Services.AddOpenTelemetry()
.UseAzureMonitor(options =>
{
options.ConnectionString = builder.Configuration["ApplicationInsights:ConnectionString"];
});
- Configuration Options
- Sampling rates (dev: 100%, prod: 1-10%)
- Exporter endpoints
- Batch processing
- Resource attributes
- Instrumentation options
Code Examples: - Complete OTel SDK configuration (all ATP services) - Environment-specific settings - Exporter configurations
Diagrams: - SDK architecture - Configuration flow
Deliverables: - OTel SDK setup guide - Configuration reference - Service-specific configurations
CYCLE 3: Distributed Tracing (~5,000 lines)¶
Topic 5: Tracing Fundamentals¶
What will be covered: - Trace, Span, Context
Trace:
- End-to-end request journey
- Collection of related spans
- Unique trace ID (shared across all spans)
Span:
- Single operation within trace
- Has start time, end time, duration
- Parent-child relationships
- Attributes (tags)
Context:
- Trace ID + Span ID + Trace Flags
- Propagated across service boundaries
- W3C Trace Context standard: traceparent header
Example Trace:
TraceID: 4bf92f3577b34da6a3ce929d0e0e4736
Spans:
├─ Gateway (50ms)
│ ├─ Authentication (10ms)
│ └─ Routing (5ms)
├─ Ingestion (4.2s) ← SLOW
│ ├─ Validation (20ms)
│ ├─ Policy.Evaluate (3.8s) ← BOTTLENECK
│ │ └─ Database.Query (3.5s) ← ROOT CAUSE
│ ├─ Database.Insert (200ms)
│ └─ Outbox.Append (100ms)
└─ Service Bus Publish (50ms)
Total Duration: 4.3s
Bottleneck: Policy evaluation → database query
---
W3C Trace Context
traceparent Header Format:
00-<trace-id>-<parent-span-id>-<trace-flags>
Example:
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
Parts:
- 00: Version
- 4bf92f3577b34da6a3ce929d0e0e4736: Trace ID (16 bytes, hex)
- 00f067aa0ba902b7: Parent Span ID (8 bytes, hex)
- 01: Trace Flags (sampled=1)
tracestate Header (optional, vendor-specific):
tracestate: tenant=acme-corp,edition=enterprise
---
Creating Spans (C#)
using System.Diagnostics;
using OpenTelemetry.Trace; // RecordException extension method

public class IngestionService
{
    private static readonly ActivitySource ActivitySource =
        new ActivitySource("ATP.Ingestion", "1.0.0");

    public async Task<string> IngestEventAsync(AuditEvent auditEvent)
    {
        // Start span
        using var activity = ActivitySource.StartActivity(
            "Ingestion.IngestEvent",
            ActivityKind.Server);

        // Add attributes
        activity?.SetTag("tenant.id", auditEvent.TenantId);
        activity?.SetTag("event.type", auditEvent.EventType);
        activity?.SetTag("audit.record.id", auditEvent.Id);

        try
        {
            // Child span: Validation
            using (var validateActivity = ActivitySource.StartActivity("Ingestion.Validate"))
            {
                await ValidateEventAsync(auditEvent);
            }

            // Child span: Policy evaluation
            using (var policyActivity = ActivitySource.StartActivity("Ingestion.EvaluatePolicy"))
            {
                var policy = await _policyService.EvaluateAsync(auditEvent);
                policyActivity?.SetTag("policy.version", policy.Version);
                policyActivity?.SetTag("classification", policy.Classification);
            }

            // Child span: Database insert
            using (var dbActivity = ActivitySource.StartActivity("Ingestion.PersistToDatabase"))
            {
                await _repository.SaveAsync(auditEvent);
            }

            // Success
            activity?.SetStatus(ActivityStatusCode.Ok);
            activity?.SetTag("result", "success");
            return auditEvent.Id;
        }
        catch (Exception ex)
        {
            // Record exception in span
            activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
            activity?.RecordException(ex);
            activity?.SetTag("result", "error");
            activity?.SetTag("error.type", ex.GetType().Name);
            throw;
        }
    }
}
Code Examples: - Complete tracing implementation - Span creation patterns - Attribute setting - Error recording
Diagrams: - Trace structure - Span hierarchy - Context propagation
Deliverables: - Tracing fundamentals guide - Implementation patterns - Span library
Topic 6: Auto-Instrumentation¶
What will be covered:
- ASP.NET Core Auto-Instrumentation
  - Automatic spans for HTTP requests
  - Route, method, status code captured
  - Exception recording
- SQL Client Auto-Instrumentation
  - Automatic spans for database queries
  - Connection, command, query captured
  - Parameter sanitization (no PII)
- HTTP Client Auto-Instrumentation
  - Automatic spans for outbound HTTP calls
  - URL, method, status captured
- gRPC Auto-Instrumentation
- Azure SDK Auto-Instrumentation
  - Service Bus, Blob Storage, Key Vault
- MassTransit Auto-Instrumentation
  - Message publish/consume spans
  - Correlation propagation
Code Examples: - Auto-instrumentation configuration - All ATP auto-instrumented libraries
Deliverables: - Auto-instrumentation guide - Configuration catalog
CYCLE 4: Trace Context Propagation (~4,000 lines)¶
Topic 7: HTTP Context Propagation¶
What will be covered: - Propagating Trace Context (HTTP)
// Outbound HTTP call (automatic with HttpClient instrumentation)
var httpClient = _httpClientFactory.CreateClient("PolicyService");
// OTel automatically adds headers:
// traceparent: 00-{trace-id}-{span-id}-01
// tracestate: tenant={tenant-id}
var response = await httpClient.PostAsync("/api/policy/evaluate", content);
// Manual propagation (if needed)
var request = new HttpRequestMessage(HttpMethod.Post, "/api/policy/evaluate");
var currentActivity = Activity.Current;
if (currentActivity != null)
{
    // Activity.Id is already the W3C traceparent value
    // (version-traceid-spanid-flags); reuse it rather than
    // hand-building the header with hardcoded flags
    request.Headers.Add("traceparent", currentActivity.Id);
}
---
Message Bus Context Propagation
// MassTransit automatically propagates trace context
public class AuditAcceptedEventConsumer : IConsumer<AuditAcceptedEvent>
{
    public async Task Consume(ConsumeContext<AuditAcceptedEvent> context)
    {
        // Trace context automatically restored from message headers;
        // Activity.Current contains the parent trace.
        // Note: do NOT dispose Activity.Current here - the consumer
        // does not own it.
        var activity = Activity.Current;
        activity?.SetTag("tenant.id", context.Message.TenantId);
        activity?.SetTag("event.id", context.Message.AuditRecordId);

        // Process event...
    }
}

// Manual propagation (if needed)
await _bus.Publish(new AuditAcceptedEvent
{
    AuditRecordId = recordId,
    TenantId = tenantId
}, context =>
{
    // Trace context propagated automatically via MassTransit
    // Or manual: context.Headers.Set("traceparent", Activity.Current?.Id);
    context.Headers.Set("tenant-id", tenantId);
});
---
Background Job Context
- Start new trace or link to parent
- Baggage propagation
- Correlation preservation
Code Examples: - Context propagation patterns (HTTP, messaging, background) - Manual propagation when needed - Baggage usage
Diagrams: - Context propagation flow - Cross-service tracing
Deliverables: - Propagation guide - Implementation patterns
Topic 8: Baggage & Correlation¶
What will be covered: - OpenTelemetry Baggage
// Set baggage (propagated to all downstream services)
Baggage.SetBaggage("tenant.id", tenantId);
Baggage.SetBaggage("correlation.id", correlationId);
Baggage.SetBaggage("tenant.edition", "enterprise");
// Read baggage (in downstream service)
var tenantId = Baggage.GetBaggage("tenant.id");
var correlationId = Baggage.GetBaggage("correlation.id");
// Baggage propagated via W3C baggage header:
// baggage: tenant.id=acme-corp,correlation.id=01HZX123,edition=enterprise
- Correlation ID Pattern
- Request ID vs. Correlation ID vs. Trace ID
Trace ID:
- OpenTelemetry trace identifier
- Generated per request
- Links all spans in trace
Correlation ID:
- Business/application-level correlation
- Can span multiple traces
- Example: Order ID, User Session ID
- ATP: ULID for audit record
Request ID:
- Gateway-generated unique ID
- For client correlation
- May equal trace ID or separate
Code Examples: - Baggage usage - Correlation patterns - ID management
Deliverables: - Baggage guide - Correlation strategies
CYCLE 5: Metrics Collection (~4,500 lines)¶
Topic 9: OTel Metrics API¶
What will be covered: - Metric Instruments
using System.Diagnostics.Metrics;
public class IngestionMetrics
{
private readonly Meter _meter;
private readonly Counter<long> _requestsTotal;
private readonly Histogram<double> _requestDuration;
private readonly UpDownCounter<int> _activeRequests;
private readonly ObservableGauge<long> _outboxPending;
private readonly IOutboxRepository _outboxRepository;
public IngestionMetrics(IMeterFactory meterFactory, IOutboxRepository outboxRepository)
{
_outboxRepository = outboxRepository;
_meter = meterFactory.Create("ATP.Ingestion", "1.0.0");
// Counter (monotonic increase)
_requestsTotal = _meter.CreateCounter<long>(
name: "ingest.requests.total",
unit: "{requests}",
description: "Total ingestion requests");
// Histogram (distribution of values)
_requestDuration = _meter.CreateHistogram<double>(
name: "ingest.request.duration",
unit: "s",
description: "Ingestion request duration");
// UpDownCounter (can increase or decrease)
_activeRequests = _meter.CreateUpDownCounter<int>(
name: "ingest.requests.active",
unit: "{requests}",
description: "Active ingestion requests");
// ObservableGauge (observed value)
_outboxPending = _meter.CreateObservableGauge<long>(
name: "outbox.events.pending",
observeValue: () => new Measurement<long>(
GetOutboxPendingCount(),
new KeyValuePair<string, object>("service", "ingestion")),
unit: "{events}",
description: "Pending outbox events");
}
public void RecordRequest(string result, string tenantClass, double durationSeconds)
{
// Increment counter
_requestsTotal.Add(1,
new KeyValuePair<string, object>("result", result),
new KeyValuePair<string, object>("tenant_class", tenantClass));
// Record duration
_requestDuration.Record(durationSeconds,
new KeyValuePair<string, object>("result", result));
}
public IDisposable TrackActiveRequest()
{
_activeRequests.Add(1);
return new DisposableAction(() => _activeRequests.Add(-1));
}
private long GetOutboxPendingCount()
{
return _outboxRepository.GetPendingCount();
}
}
- Metric Types
Counter:
- Monotonically increasing
- Resets to zero on restart
- Example: Total requests, total errors
- Query: rate(metric[5m])
UpDownCounter:
- Can increase or decrease
- Example: Active connections, queue depth
- Query: metric (current value)
Histogram:
- Distribution of values (buckets)
- Calculate percentiles (P50, P95, P99)
- Example: Request duration, payload size
- Query: histogram_quantile(0.95, metric)
Gauge (Observable):
- Point-in-time value
- Observed when scraped
- Example: CPU usage, memory usage, pending jobs
- Query: metric (current value)
Code Examples: - Complete metrics implementation (all services) - Metric instrument usage - Query examples
Diagrams: - Metric types - Collection flow
Deliverables: - Metrics API guide - Instrument catalog - Usage patterns
Topic 10: Prometheus Exposition¶
What will be covered: - Prometheus /metrics Endpoint
// Enable Prometheus exporter
builder.Services.AddOpenTelemetry()
.WithMetrics(metrics =>
{
metrics.AddPrometheusExporter();
});
// Map endpoint
app.MapPrometheusScrapingEndpoint();
// Exposes /metrics on the application's own HTTP port
- Metrics Format
# HELP ingest_requests_total Total ingestion requests
# TYPE ingest_requests_total counter
ingest_requests_total{result="success",tenant_class="enterprise"} 15234
ingest_requests_total{result="failure",tenant_class="enterprise"} 23
# HELP ingest_request_duration Ingestion request duration
# TYPE ingest_request_duration histogram
ingest_request_duration_bucket{result="success",le="0.1"} 1000
ingest_request_duration_bucket{result="success",le="0.5"} 14500
ingest_request_duration_bucket{result="success",le="1.0"} 15200
ingest_request_duration_bucket{result="success",le="+Inf"} 15234
ingest_request_duration_sum{result="success"} 1234.56
ingest_request_duration_count{result="success"} 15234
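A matching Prometheus scrape job might look like the following. This is a sketch under common Kubernetes conventions; the job name and the `prometheus.io/scrape` annotation scheme are assumptions, not a confirmed ATP configuration.

```yaml
scrape_configs:
  - job_name: atp-services
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
```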
Code Examples: - Prometheus exporter setup - Scrape configuration - Query examples
Deliverables: - Prometheus integration guide - Scrape configurations
CYCLE 6: Custom Metrics (~4,000 lines)¶
Topic 11: ATP Business Metrics¶
What will be covered: - Audit Trail-Specific Metrics
public class AuditMetrics
{
private readonly Meter _meter;
// Instruments below are created from _meter in the constructor
// (e.g., _meter.CreateCounter<long>(...)); initialization omitted in this excerpt
// Counters
private readonly Counter<long> _recordsIngested;
private readonly Counter<long> _recordsClassified;
private readonly Counter<long> _recordsArchived;
private readonly Counter<long> _recordsPurged;
private readonly Counter<long> _integrityVerifications;
private readonly Counter<long> _tamperAnomalies;
// Histograms
private readonly Histogram<double> _recordSize;
private readonly Histogram<double> _verificationDuration;
// Gauges
private readonly ObservableGauge<long> _activeRetentionPolicies;
private readonly ObservableGauge<long> _recordsInHotStorage;
private readonly ObservableGauge<double> _projectionLag;
public void RecordIngestion(string tenantClass, string classification, int sizeBytes)
{
_recordsIngested.Add(1,
new KeyValuePair<string, object>("tenant_class", tenantClass),
new KeyValuePair<string, object>("classification", classification));
_recordSize.Record(sizeBytes);
}
public void RecordIntegrityCheck(string result, double durationSeconds)
{
_integrityVerifications.Add(1,
new KeyValuePair<string, object>("result", result));
_verificationDuration.Record(durationSeconds);
}
public void RecordTamperAnomaly(string type, string severity)
{
_tamperAnomalies.Add(1,
new KeyValuePair<string, object>("type", type),
new KeyValuePair<string, object>("severity", severity));
}
}
- Complete ATP Metrics Catalog (100+ metrics across all services)
Code Examples: - Complete ATP metrics implementation - Business metric patterns
Deliverables: - ATP metrics catalog (100+ metrics) - Implementation guide
Topic 12: Metric Labeling Strategy¶
What will be covered: - Low-Cardinality Labels
✅ GOOD (Low Cardinality):
- tenant_class: "small" | "medium" | "large" | "enterprise" (4 values)
- result: "success" | "failure" | "timeout" (3 values)
- region: "us-east" | "eu-west" | "il-central" (3 values)
- service: "gateway" | "ingestion" | "query" ... (8 values)
❌ BAD (High Cardinality):
- tenant_id: "acme-corp" | "contoso" | ... (1000s of values)
- user_id: "user-123" | "user-456" | ... (millions of values)
- trace_id: unique per request (billions of values)
- audit_record_id: unique per record (billions of values)
Why High Cardinality is Bad:
- Exponential metric explosion
- Prometheus memory exhaustion
- Query performance degradation
- Storage costs
Solution:
- Use tenant_class instead of tenant_id
- Use aggregations (count by class, not by ID)
- Use tracing for individual request details
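The aggregation approach above can be sketched in PromQL: per-class aggregation keeps the series count bounded while still answering tenant-tier questions. The metric names follow this document's catalog; the `tenant_class` label on the duration histogram is an assumption.

```promql
# Bounded cardinality: one series per (tenant_class, result) pair
sum(rate(ingest_requests_total[5m])) by (tenant_class, result)

# P95 latency per tenant class, from histogram buckets
histogram_quantile(0.95,
  sum(rate(ingest_request_duration_bucket[5m])) by (tenant_class, le))
```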
Code Examples: - Label strategy - Cardinality management
Deliverables: - Labeling guide - Cardinality policies
CYCLE 7: Structured Logging (~4,500 lines)¶
Topic 13: Serilog Configuration¶
What will be covered: - Serilog Setup
// Program.cs
Log.Logger = new LoggerConfiguration()
.ReadFrom.Configuration(configuration)
.Enrich.FromLogContext()
.Enrich.WithMachineName()
.Enrich.WithEnvironmentName()
.Enrich.WithCorrelationId()
.Enrich.WithClientIp()
.Enrich.WithExceptionDetails()
.WriteTo.Console(
outputTemplate: "[{Timestamp:HH:mm:ss} {Level:u3}] {Message:lj} {Properties:j}{NewLine}{Exception}")
.WriteTo.File(
path: "logs/atp-ingestion-.log",
rollingInterval: RollingInterval.Day,
outputTemplate: "[{Timestamp:yyyy-MM-dd HH:mm:ss.fff zzz}] [{Level:u3}] {Message:lj} {Properties:j}{NewLine}{Exception}",
retainedFileCountLimit: 31)
.WriteTo.File(
new JsonFormatter(),
path: "logs/atp-ingestion-.json",
rollingInterval: RollingInterval.Day,
retainedFileCountLimit: 31)
.WriteTo.Seq("http://seq:5341")
.WriteTo.ApplicationInsights(
telemetryConfiguration,
TelemetryConverter.Traces)
.WriteTo.OpenTelemetry(options =>
{
options.Endpoint = "http://otel-collector:4318/v1/logs";
options.Protocol = OtlpProtocol.HttpProtobuf;
options.ResourceAttributes = new Dictionary<string, object>
{
["service.name"] = "atp.ingestion"
};
})
.CreateLogger();
builder.Host.UseSerilog();
- Log Sinks
- Console (development)
- File (local debugging, JSON + text)
- Seq (development, log search)
- Application Insights (production)
- OpenTelemetry (unified pipeline)
- Azure Log Analytics (production)
Code Examples: - Complete Serilog configuration - Sink configurations - Environment-specific settings
Deliverables: - Serilog setup guide - Sink catalog - Configuration templates
Topic 14: Structured Logging Patterns¶
What will be covered: - Structured Logging Best Practices
// ❌ BAD: String interpolation (unstructured)
_logger.LogInformation($"User {userId} ingested event {eventId} for tenant {tenantId}");
// ✅ GOOD: Structured (named properties)
_logger.LogInformation(
"User {UserId} ingested event {EventId} for tenant {TenantId}",
userId, eventId, tenantId);
// Output (JSON):
{
"Timestamp": "2025-10-30T14:32:15.123Z",
"Level": "Information",
"Message": "User user-123 ingested event 01HZX... for tenant acme-corp",
"Properties": {
"UserId": "user-123",
"EventId": "01HZX123456789",
"TenantId": "acme-corp"
}
}
// ✅ GOOD: Enriched with context
using (_logger.BeginScope(new Dictionary<string, object>
{
["TenantId"] = tenantId,
["CorrelationId"] = correlationId,
["TraceId"] = Activity.Current?.TraceId.ToString()
}))
{
_logger.LogInformation("Processing ingestion request");
// ... operations
_logger.LogInformation("Ingestion complete");
// All logs in scope include TenantId, CorrelationId, TraceId
}
- Log Levels
Trace (Verbose):
- Detailed debugging information
- Use sparingly (performance cost)
- Example: "Entering method X with parameters Y"
Debug:
- Debugging information
- Disabled in production by default
- Example: "Cache hit for key X"
Information:
- Significant application events
- Enabled in production
- Example: "Event ingested successfully"
Warning:
- Abnormal but handled situations
- Review periodically
- Example: "Projection lag elevated (15s)"
Error:
- Error conditions (handled exceptions)
- Review immediately
- Example: "Failed to connect to database"
Critical:
- Critical failures (unhandled exceptions)
- Page on-call
- Example: "Data integrity verification failed"
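Minimum levels are typically set per environment via `Serilog.Settings.Configuration`. A sketch of the corresponding `appsettings.json` section (the namespace overrides shown are illustrative assumptions):

```json
{
  "Serilog": {
    "MinimumLevel": {
      "Default": "Information",
      "Override": {
        "Microsoft.AspNetCore": "Warning",
        "System.Net.Http.HttpClient": "Warning"
      }
    }
  }
}
```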
Code Examples: - Structured logging patterns - Log level usage - Scope and enrichment
Deliverables: - Logging best practices - Pattern catalog
CYCLE 8: Log Enrichment & Correlation (~4,000 lines)¶
Topic 15: Log Enrichers¶
What will be covered: - Built-In Enrichers
.Enrich.FromLogContext() // Include scope properties
.Enrich.WithMachineName() // Add machine name
.Enrich.WithEnvironmentName() // Add environment (dev, prod)
.Enrich.WithThreadId() // Add thread ID
.Enrich.WithProcessId() // Add process ID
.Enrich.WithCorrelationId() // Add correlation ID from headers
.Enrich.WithClientIp() // Add client IP
.Enrich.WithExceptionDetails() // Rich exception logging
- Custom Enrichers
public class TenantEnricher : ILogEventEnricher
{
    private readonly ITenantResolver _tenantResolver;

    public TenantEnricher(ITenantResolver tenantResolver)
    {
        _tenantResolver = tenantResolver;
    }

    public void Enrich(LogEvent logEvent, ILogEventPropertyFactory propertyFactory)
    {
        var tenantId = _tenantResolver.GetCurrentTenantId();
        if (tenantId != null)
        {
            var property = propertyFactory.CreateProperty("TenantId", tenantId);
            logEvent.AddPropertyIfAbsent(property);
        }
    }
}

// Register enricher
.Enrich.With<TenantEnricher>()
Code Examples: - Enricher configuration - Custom enrichers
Deliverables: - Enrichment guide
Topic 16: Correlation Across Logs, Traces, Metrics¶
What will be covered: - Unified Correlation - Cross-Pillar Queries - Exemplars (Prometheus → Traces)
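Unified correlation hinges on one identifier, the active trace, showing up in all three pillars. A sketch of the idea, assuming System.Diagnostics and an injected `ILogger` (source, meter, and instrument names are illustrative):

```csharp
using System.Diagnostics;
using System.Diagnostics.Metrics;

static readonly ActivitySource Source = new("ATP.Ingestion");
static readonly Meter Meter = new("ATP.Ingestion");
static readonly Counter<long> Ingested = Meter.CreateCounter<long>("atp_events_ingested_total");

using (var activity = Source.StartActivity("IngestEvent"))
using (logger.BeginScope(new Dictionary<string, object>
{
    ["TraceId"] = activity?.TraceId.ToString() ?? "none"   // logs ↔ traces
}))
{
    Ingested.Add(1);   // measurement taken inside the Activity, so the OTel SDK
                       // can attach an exemplar (metrics ↔ traces) if configured
    logger.LogInformation("Event ingested");
}
```

With this shape, a spike on the metric panel can link (via the exemplar) to a trace, and the trace ID filters the logs.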
Code Examples: - Correlation strategies
Deliverables: - Correlation guide
CYCLE 9: Azure Monitor Integration (~5,000 lines)¶
Topic 17: Application Insights¶
What will be covered: - Application Insights Setup - Telemetry Types (requests, dependencies, exceptions, traces, metrics) - Custom Events and Metrics - Application Map - Live Metrics Stream
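A minimal setup sketch using the Azure Monitor OpenTelemetry distro (Azure.Monitor.OpenTelemetry.AspNetCore package); the configuration key is an assumption, in practice the connection string comes from app settings or Key Vault:

```csharp
using Azure.Monitor.OpenTelemetry.AspNetCore;

var builder = WebApplication.CreateBuilder(args);

// Routes OpenTelemetry traces, metrics, and logs to Application Insights
builder.Services.AddOpenTelemetry().UseAzureMonitor(options =>
{
    options.ConnectionString = builder.Configuration["ApplicationInsights:ConnectionString"];
});

var app = builder.Build();
app.Run();
```

The distro enables the standard ASP.NET Core and HttpClient instrumentations by default, which is what populates the request, dependency, and exception telemetry types discussed above.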
Code Examples: - App Insights configuration
Deliverables: - Application Insights guide
Topic 18: Container Insights¶
What will be covered: - AKS Container Insights - Pod and Node Metrics - Log Collection - Performance Analysis
Code Examples: - Container Insights setup
Deliverables: - Container monitoring guide
CYCLE 11: Log Analytics & KQL (~4,500 lines)¶
Topic 21: KQL Query Language¶
What will be covered: - Kusto Query Language (KQL)
// Find all errors in last hour
traces
| where timestamp > ago(1h)
| where severityLevel >= 3 // Error or Critical
| where cloud_RoleName == "Ingestion"
| project timestamp, message, severityLevel, customDimensions
| order by timestamp desc
| take 100
// P95 latency by service
requests
| where timestamp > ago(1h)
| summarize P95=percentile(duration, 95) by cloud_RoleName, bin(timestamp, 5m)
| render timechart
// Error rate
requests
| where timestamp > ago(1h)
| summarize
Total = count(),
Errors = countif(success == false)
by bin(timestamp, 5m)
| extend ErrorRate = todouble(Errors) / todouble(Total)
| render timechart
Code Examples: - Complete KQL query library (50+ queries)
Deliverables: - KQL reference guide
Topic 22: Log Analytics Workbooks¶
What will be covered: - Custom Workbooks - Parameterized Queries - Interactive Dashboards
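Workbook queries interpolate parameter values with braces; a KQL sketch assuming a time-range parameter named `TimeRange` and a dropdown parameter named `ServiceName` (both hypothetical):

```kql
// {TimeRange} expands to the workbook's selected time window,
// {ServiceName} to the selected service
requests
| where timestamp {TimeRange}
| where cloud_RoleName == "{ServiceName}"
| summarize Count = count() by bin(timestamp, 5m)
| render timechart
```

The same query then drives multiple interactive views as the operator changes parameters, which is the main advantage of workbooks over static dashboards.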
Code Examples: - Workbook templates
Deliverables: - Workbook library
CYCLE 12: Prometheus & Grafana (~4,500 lines)¶
Topic 23: Prometheus Setup¶
What will be covered: - Prometheus Deployment - Scrape Configurations - Recording Rules - Storage and Retention
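A scrape-configuration sketch for ATP pods on Kubernetes, using the standard `prometheus.io/*` annotation convention (the job name is illustrative):

```yaml
# prometheus.yml (fragment): discover pods and keep only those opting in
scrape_configs:
  - job_name: atp-services
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
```

Recording rules and retention settings would layer on top of this base discovery config.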
Code Examples: - Prometheus configuration
Deliverables: - Prometheus setup guide
Topic 24: Grafana Dashboards¶
What will be covered: - Dashboard Design Principles - ATP Operational Dashboards - Service Health Dashboards - SLO Dashboards
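The core panel of an SLO dashboard is usually the error ratio measured against the budget. A PromQL sketch, assuming a conventional `http_requests_total` counter with a `code` label (both hypothetical for ATP):

```promql
# Fraction of requests failing over the last hour (compare against 1 - SLO target)
sum(rate(http_requests_total{code=~"5.."}[1h]))
  /
sum(rate(http_requests_total[1h]))
```

Evaluating the same ratio over multiple windows (e.g. 5m and 1h) gives the multi-window burn-rate view commonly used for SLO alerting.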
Code Examples: - Dashboard JSON templates (10+ dashboards)
Deliverables: - Dashboard library
CYCLE 13: Dashboard Design (~4,000 lines)¶
Topic 25: Dashboard Catalog¶
What will be covered: - ATP Operations Dashboard - Service-Specific Dashboards (8 services) - Infrastructure Dashboards - Business Dashboards
Code Examples: - Complete dashboard library
Deliverables: - Dashboard templates
Topic 26: Dashboard Best Practices¶
What will be covered: - Layout Guidelines - Color Conventions - Alert Visualization
Deliverables: - Design guide
CYCLE 16: PII Redaction & Compliance (~3,500 lines)¶
Topic 31: PII Redaction in Telemetry¶
What will be covered: - Log Redaction
// Microsoft.Extensions.Compliance.Redaction
// (DataClassifications.* is ATP's project-defined data taxonomy)
services.AddRedaction(configure =>
{
    configure.SetRedactor<ErasingRedactor>(DataClassifications.PIIData);
    configure.SetHmacRedactor(
        o => { o.KeyId = 1; o.Key = hmacKey; },   // hmacKey: base64 secret from configuration
        DataClassifications.SensitiveData);
});
// Usage: redaction applies to [LoggerMessage]-generated log methods whose
// parameters carry a data classification attribute
[LoggerMessage(LogLevel.Information, "User {UserId} with email {Email} logged in")]
partial void LogUserLogin(string userId, [PIIData] string email);

LogUserLogin(userId, userEmail); // Email automatically redacted
// Output:
// "User user-123 with email *** logged in"
- Trace Attribute Sanitization
- Metric Label Filtering
- Compliance Validation
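Trace attributes can be sanitized before export with a custom OpenTelemetry processor. A sketch, assuming the OpenTelemetry .NET SDK (the attribute names are illustrative):

```csharp
using System.Diagnostics;
using OpenTelemetry;

public sealed class PiiAttributeProcessor : BaseProcessor<Activity>
{
    private static readonly string[] PiiKeys = { "user.email", "client.address" };

    public override void OnEnd(Activity activity)
    {
        foreach (var key in PiiKeys)
        {
            if (activity.GetTagItem(key) is not null)
            {
                activity.SetTag(key, "***");   // overwrite before the span is exported
            }
        }
    }
}

// Registration sketch: .WithTracing(t => t.AddProcessor(new PiiAttributeProcessor()))
```

The same deny-list approach applies to metric labels, where high-risk values should ideally never be used as label values in the first place.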
Code Examples: - Redaction implementation
Deliverables: - PII redaction guide
Topic 32: Compliance Monitoring¶
What will be covered: - Retention Compliance Monitoring - Data Residency Validation - Access Audit Logging
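Retention compliance can itself be exposed as telemetry. A sketch using System.Diagnostics.Metrics; the meter name, instrument name, and `CountEventsPastRetention` helper are all hypothetical:

```csharp
using System.Diagnostics.Metrics;

static readonly Meter Meter = new("ATP.Compliance");

// Observed on each metrics collection; a non-zero value should trigger an alert
static readonly ObservableGauge<int> RetentionViolations =
    Meter.CreateObservableGauge(
        "atp_retention_violations",
        () => CountEventsPastRetention(),   // hypothetical query against the event store
        description: "Events retained beyond their tenant's retention policy");
```

This turns a periodic compliance check into a continuously monitored signal rather than a scheduled report.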
Code Examples: - Compliance metrics
Deliverables: - Compliance monitoring guide
CYCLE 20: Best Practices & Troubleshooting (~3,500 lines)¶
Topic 39: Monitoring Best Practices¶
What will be covered: - Observability Design Principles - Common Anti-Patterns - Performance Optimization
Deliverables: - Best practices handbook
Topic 40: Troubleshooting Observability¶
What will be covered: - Missing Traces - Missing Metrics - Log Gaps - High Cardinality Issues
Deliverables: - Troubleshooting guide
Summary of Deliverables¶
Complete observability implementation covering:
- Fundamentals: Three pillars, ATP requirements
- OpenTelemetry: SDK setup, configuration, integration
- Distributed Tracing: W3C Trace Context, span creation, propagation
- Metrics: OTel Metrics API, Prometheus, custom metrics
- Logging: Serilog, structured logging, enrichment
- Azure Monitor: Application Insights, Log Analytics, Container Insights
- Prometheus & Grafana: Setup, dashboards, alerting
- Dashboards: 15+ operational dashboards
- APM: Performance monitoring, dependency tracking
- Compliance: PII redaction, audit logging, retention
- Operations: Correlation, troubleshooting, best practices
Related Documentation¶
- Alerts & SLOs: Alerting and SLO definitions
- Health Checks: Service health monitoring
- Runbook: Using monitoring for operations
- Architecture: Observability requirements
- Kubernetes: Container monitoring
This monitoring & observability guide covers the complete implementation of ATP's observability stack: OpenTelemetry instrumentation and distributed tracing, structured logging, Azure Monitor integration, Prometheus and Grafana dashboards, PII redaction, and compliance monitoring. Together these provide actionable visibility into system behavior while preserving privacy and meeting regulatory requirements.