
Monitoring & Observability - Audit Trail Platform (ATP)

Three pillars of observability — ATP implements comprehensive monitoring with OpenTelemetry: distributed tracing (W3C Trace Context), metrics (Prometheus), and structured logs (Serilog), exported to Azure Monitor (Application Insights, Log Analytics, Container Insights) and Grafana. Tenant and correlation context are propagated throughout, PII is redacted, and dashboards are designed to be actionable for operational excellence.


📋 Documentation Generation Plan

This document will be generated in 20 cycles. Current progress:

Cycle Topics Estimated Lines Status
Cycle 1 Observability Fundamentals (1-2) ~3,500 ⏳ Not Started
Cycle 2 OpenTelemetry Architecture (3-4) ~4,000 ⏳ Not Started
Cycle 3 Distributed Tracing (5-6) ~5,000 ⏳ Not Started
Cycle 4 Trace Context Propagation (7-8) ~4,000 ⏳ Not Started
Cycle 5 Metrics Collection (9-10) ~4,500 ⏳ Not Started
Cycle 6 Custom Metrics (11-12) ~4,000 ⏳ Not Started
Cycle 7 Structured Logging (13-14) ~4,500 ⏳ Not Started
Cycle 8 Log Enrichment & Correlation (15-16) ~4,000 ⏳ Not Started
Cycle 9 Azure Monitor Integration (17-18) ~5,000 ⏳ Not Started
Cycle 10 Application Insights (19-20) ~4,000 ⏳ Not Started
Cycle 11 Log Analytics & KQL (21-22) ~4,500 ⏳ Not Started
Cycle 12 Prometheus & Grafana (23-24) ~4,500 ⏳ Not Started
Cycle 13 Dashboard Design (25-26) ~4,000 ⏳ Not Started
Cycle 14 Performance Monitoring (27-28) ~4,000 ⏳ Not Started
Cycle 15 Application Performance Monitoring (29-30) ~3,500 ⏳ Not Started
Cycle 16 PII Redaction & Compliance (31-32) ~3,500 ⏳ Not Started
Cycle 17 Correlation & Context (33-34) ~3,500 ⏳ Not Started
Cycle 18 Log Aggregation & Search (35-36) ~3,500 ⏳ Not Started
Cycle 19 Monitoring Operations (37-38) ~3,000 ⏳ Not Started
Cycle 20 Best Practices & Troubleshooting (39-40) ~3,500 ⏳ Not Started

Total Estimated Lines: ~80,000


Purpose & Scope

This document is the complete monitoring and observability implementation guide for ATP. It covers OpenTelemetry (distributed tracing, metrics, logs), Azure Monitor (Application Insights, Log Analytics, Container Insights), Prometheus, Grafana, Serilog structured logging, correlation, dashboards, APM, and the operational procedures needed for comprehensive, actionable, and compliant system observability.

Why Comprehensive Observability for ATP?

  1. Debugging: Trace requests across microservices, identify bottlenecks
  2. Performance: Measure latency, throughput, errors at every layer
  3. Reliability: Detect issues before customers do (proactive alerts)
  4. Compliance: Audit all access to audit data ("audit the auditor")
  5. Security: Detect anomalies, unauthorized access, data exfiltration
  6. Business Insights: Tenant usage patterns, feature adoption, capacity planning
  7. SLO Validation: Measure actual performance against targets
  8. Incident Response: Rich telemetry for faster MTTR
  9. Optimization: Data-driven performance and cost optimization
  10. Regulatory: Immutable observability trail for compliance audits

Three Pillars of Observability

1. TRACES (Request Flow)
   - Distributed tracing across services
   - End-to-end request path
   - Latency attribution
   - Error correlation
   - Example: "Why did this ingestion request take 5 seconds?"

2. METRICS (System Health)
   - Time-series data (CPU, memory, latency, throughput)
   - Aggregations and percentiles
   - SLO/SLI measurements
   - Alerting thresholds
   - Example: "What's the P95 latency for queries?"

3. LOGS (Event Details)
   - Structured event records
   - Contextual information
   - Error messages and stack traces
   - Audit trail
   - Example: "What error occurred at 14:32 UTC for tenant acme-corp?"

ATP Observability Stack

Application Layer:
- OpenTelemetry SDK (C#)
- Serilog (structured logging)
- Custom meters and spans

Collection Layer:
- OTel Collector (DaemonSet on K8s)
- Prometheus (scraping)
- Fluent Bit (log forwarding)

Storage Layer:
- Azure Monitor (Application Insights, Log Analytics)
- Prometheus (TSDB, 30-day retention)
- Azure Blob (long-term log archive)

Visualization Layer:
- Grafana (dashboards, alerts)
- Azure Monitor Workbooks
- Application Insights Workbooks

Alerting Layer:
- Prometheus Alertmanager
- Azure Monitor Alerts
- PagerDuty integration

Key Technologies

  • OpenTelemetry: Vendor-neutral observability framework (traces, metrics, logs)
  • Serilog: Structured logging for .NET
  • Prometheus: Time-series metrics database
  • Grafana: Visualization and dashboards
  • Azure Monitor: Cloud-native monitoring (Application Insights, Log Analytics, Container Insights)
  • Jaeger/Zipkin: Distributed tracing UIs
  • KQL (Kusto Query Language): Azure Monitor query language
  • PromQL: Prometheus query language
  • W3C Trace Context: Standard for trace propagation

Detailed Cycle Plan

CYCLE 1: Observability Fundamentals (~3,500 lines)

Topic 1: Three Pillars of Observability

What will be covered: - Traces, Metrics, Logs

TRACES (The "Why" and "Where"):
- Request journey across services
- Timing breakdown (where did time go?)
- Causal relationships
- Error attribution

Use Cases:
- Debug slow requests
- Find bottlenecks
- Understand service dependencies
- Root cause analysis

Example Questions Answered:
- "Why did this request take 5 seconds?"
- "Which service in the chain is slow?"
- "Where did this error originate?"

---

METRICS (The "What" and "How Much"):
- Numeric measurements over time
- Aggregations (avg, sum, percentile)
- System health indicators
- SLO/SLI tracking

Use Cases:
- Alerting (latency > threshold)
- Capacity planning (trending growth)
- SLO compliance (99.9% availability)
- Performance dashboards

Example Questions Answered:
- "What's the current error rate?"
- "What's the P95 latency?"
- "How many events are in the queue?"

---

LOGS (The "What Happened"):
- Discrete events
- Detailed context
- Error messages and stack traces
- Audit trail

Use Cases:
- Debugging specific errors
- Audit compliance
- Security investigation
- Business event tracking

Example Questions Answered:
- "What error occurred at 14:32 UTC?"
- "Which user triggered this action?"
- "What was the validation failure reason?"

  • How They Work Together

    Scenario: Slow Ingestion Request
    
    1. METRICS alert fires:
       - "P95 ingestion latency > 1s"
    
    2. Check DASHBOARD (metrics):
       - Latency spiked at 14:30 UTC
       - Affecting tenant "acme-corp"
    
    3. Find REQUEST in TRACES:
       - Search for requests to /api/v1/ingest
       - Filter: timestamp ~14:30, tenant=acme-corp
       - Trace shows: 4.2s total
         - Gateway: 50ms
         - Ingestion: 4.1s
           - Validation: 20ms
           - Policy call: 3.8s ← BOTTLENECK
           - Database: 200ms
           - Outbox: 100ms
    
    4. Examine LOGS (detailed context):
       - Search: timestamp=14:30, traceId=abc123
       - Find: "PolicyService.Evaluate timeout after 3s"
       - Root cause: Policy service database connection pool exhausted
    
    5. Resolution:
       - Scale policy service
       - Increase connection pool
       - Add connection pool health check
    

  • Observability vs. Monitoring

    Monitoring (Traditional):
    - Known failure modes
    - Predefined dashboards
    - Static thresholds
    - "Tell me when X goes wrong"
    
    Observability (Modern):
    - Unknown failure modes
    - Ad-hoc exploration
    - Dynamic queries
    - "Help me understand why X is wrong"
    
    ATP Uses BOTH:
    - Monitoring: Alerts, SLOs, health checks
    - Observability: Traces, logs, metrics for investigation
    

Code Examples:
- Three pillars examples
- Combined usage scenario
- Observability workflow

Diagrams:
- Three pillars architecture
- Pillars working together
- Observability vs. monitoring

Deliverables:
- Observability fundamentals guide
- Pillar comparison matrix
- Usage scenarios


Topic 2: ATP Observability Requirements

What will be covered: - ATP-Specific Observability Needs

Audit Integrity:
- Trace every write (who, when, what)
- Tamper detection observability
- Integrity verification tracking

Compliance:
- No PII in logs/metrics/traces
- Tenant isolation in telemetry
- Audit access to audit data
- Retention period compliance (7+ years)

Multi-Tenancy:
- Per-tenant metrics (without high cardinality)
- Tenant context in all telemetry
- Cross-tenant query detection

Performance:
- Sub-second latency tracking (P50, P95, P99)
- Projection lag monitoring (<5s SLO)
- Query performance (by endpoint, tenant class)

Security:
- Unauthorized access attempts
- Privilege escalation detection
- Data exfiltration monitoring
- Anomaly detection

Operational:
- Service dependencies (application map)
- Capacity utilization (CPU, memory, storage, network)
- Error rates and types
- Queue depths and consumer lag

  • Mandatory Telemetry Attributes
    Resource Attributes (service-level):
    - service.name: "atp.ingestion"
    - service.version: "1.2.3"
    - deployment.environment: "production"
    - cloud.provider: "azure"
    - cloud.region: "eastus"
    - cloud.availability_zone: "eastus-1"
    
    Span/Metric/Log Attributes (request-level):
    - tenant.id: "acme-corp" (or tenant.class: "enterprise")
    - tenant.edition: "enterprise"
    - correlation.id: "<ulid>"
    - trace.id: "<w3c-trace-id>"
    - span.id: "<span-id>"
    - http.route: "/api/v1/ingest"
    - http.method: "POST"
    - http.status_code: 200
    - messaging.operation: "publish" | "receive"
    - messaging.destination: "audit.appended.v1"
    - db.operation: "INSERT" | "SELECT"
    - db.table: "AuditRecords"
    
    ATP-Specific:
    - audit.record.id: "<ulid>"
    - idempotency.key: "<tenant>:<source>:<seq>"
    - classification: "SENSITIVE" | "PII" | "PUBLIC"
    - policy.version: 123
    

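The request-level attributes above can be applied with a small enrichment helper. A minimal sketch — the `TelemetryTags` helper and the `AuditContext` carrier are hypothetical names for illustration, not part of ATP's codebase:

```csharp
using System;
using System.Diagnostics;

// Hypothetical per-request value carrier; ATP would resolve these from
// headers, auth claims, and the audit record being processed.
public sealed record AuditContext(
    string TenantId,
    string TenantEdition,
    string CorrelationId,
    string AuditRecordId,
    string Classification);

public static class TelemetryTags
{
    // Stamps the mandatory request-level attributes onto the current span.
    // Safe to call when no activity is recording (activity is null).
    public static void ApplyRequestAttributes(Activity? activity, AuditContext ctx)
    {
        if (activity is null) return;

        activity.SetTag("tenant.id", ctx.TenantId);
        activity.SetTag("tenant.edition", ctx.TenantEdition);
        activity.SetTag("correlation.id", ctx.CorrelationId);
        activity.SetTag("audit.record.id", ctx.AuditRecordId);
        activity.SetTag("classification", ctx.Classification);
    }
}
```

Centralizing the stamping in one helper keeps the attribute names consistent across services, which is what makes cross-signal correlation queries work later.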
Code Examples:
- ATP observability requirements
- Mandatory attributes
- Attribute taxonomy

Diagrams:
- ATP observability architecture
- Attribute hierarchy

Deliverables:
- Requirements specification
- Attribute catalog
- Compliance mapping


CYCLE 2: OpenTelemetry Architecture (~4,000 lines)

Topic 3: OpenTelemetry Overview

What will be covered: - What is OpenTelemetry (OTel)?

OpenTelemetry:
- Open source observability framework
- Vendor-neutral (no lock-in)
- Single SDK for traces, metrics, logs
- Standardized protocols (OTLP)
- Ecosystem of exporters (Prometheus, Jaeger, Azure Monitor, etc.)

Components:
1. SDK (in-process instrumentation)
   - Traces: ActivitySource, Activity
   - Metrics: Meter, Counter, Histogram, Gauge
   - Logs: ILogger integration

2. API (interface for instrumentation)
   - Language-agnostic specification
   - Implementation-independent

3. Collector (agent/gateway)
   - Receives telemetry (OTLP)
   - Processes (batch, filter, transform)
   - Exports to backends (Prometheus, Jaeger, Azure Monitor)

4. Semantic Conventions
   - Standard attribute names
   - Consistent meaning across systems
   - Examples: http.method, db.system, messaging.destination

  • ATP OpenTelemetry Architecture

    flowchart LR
        subgraph ATP Services
            GW[Gateway<br/>OTel SDK]
            ING[Ingestion<br/>OTel SDK]
            QRY[Query<br/>OTel SDK]
            PROJ[Projection<br/>OTel SDK]
        end
    
        subgraph OTel Collection
            COL[OTel Collector<br/>DaemonSet]
        end
    
        subgraph Backends
            PROM[Prometheus<br/>Metrics]
            JAEGER[Jaeger<br/>Traces]
            AZMON[Azure Monitor<br/>Traces/Metrics/Logs]
            SEQ[Seq<br/>Logs]
        end
    
        GW -->|OTLP gRPC :4317| COL
        ING -->|OTLP gRPC :4317| COL
        QRY -->|OTLP gRPC :4317| COL
        PROJ -->|OTLP gRPC :4317| COL
    
        COL -->|Remote Write| PROM
        COL -->|OTLP| JAEGER
        COL -->|Azure Monitor Exporter| AZMON
        COL -->|OTLP HTTP| SEQ

  • Why OpenTelemetry for ATP?

  • Vendor-neutral (avoid lock-in)
  • Single instrumentation (traces + metrics + logs)
  • Rich ecosystem (auto-instrumentation for ASP.NET, SQL, HTTP, gRPC, Service Bus)
  • Azure Monitor support
  • Industry standard (CNCF incubating project)

Code Examples:
- OTel architecture overview
- Component responsibilities
- ATP integration

Diagrams:
- OTel architecture
- ATP OTel topology
- Data flow

Deliverables:
- OTel overview
- Architecture guide
- Integration roadmap


Topic 4: OpenTelemetry SDK Configuration

What will be covered: - OTel SDK Setup (C#)

// Program.cs / Startup.cs
var builder = WebApplication.CreateBuilder(args);

// Configure OpenTelemetry
builder.Services.AddOpenTelemetry()
    .ConfigureResource(resource =>
    {
        resource
            .AddService(
                serviceName: "atp.ingestion",
                serviceVersion: "1.2.3",
                serviceInstanceId: Environment.MachineName)
            .AddAttributes(new Dictionary<string, object>
            {
                ["deployment.environment"] = builder.Environment.EnvironmentName,
                ["cloud.provider"] = "azure",
                ["cloud.region"] = Environment.GetEnvironmentVariable("AZURE_REGION") ?? "eastus",
                ["host.name"] = Environment.MachineName
            });
    })
    .WithTracing(tracing =>
    {
        tracing
            // Auto-instrumentation
            .AddAspNetCoreInstrumentation(options =>
            {
                options.RecordException = true;
                options.EnrichWithHttpRequest = (activity, request) =>
                {
                    // Add tenant context
                    var tenantId = request.Headers["X-Tenant-Id"].FirstOrDefault();
                    if (tenantId != null)
                    {
                        activity.SetTag("tenant.id", tenantId);
                    }
                };
                options.EnrichWithHttpResponse = (activity, response) =>
                {
                    activity.SetTag("http.response.size", response.ContentLength);
                };
            })
            .AddHttpClientInstrumentation()
            .AddSqlClientInstrumentation(options =>
            {
                options.SetDbStatementForText = true;
                options.RecordException = true;
                options.EnableConnectionLevelAttributes = true;
            })
            .AddGrpcClientInstrumentation()

            // Custom activity sources
            .AddSource("ATP.Ingestion")
            .AddSource("ATP.Domain")
            .AddSource("MassTransit")

            // Sampling (production: 10%, dev: 100%)
            .SetSampler(builder.Environment.IsDevelopment()
                ? (Sampler)new AlwaysOnSampler()          // cast needed: no common type for the ternary
                : new TraceIdRatioBasedSampler(0.1))

            // Exporters
            .AddOtlpExporter(options =>
            {
                options.Endpoint = new Uri("http://otel-collector:4317");
                options.Protocol = OtlpExportProtocol.Grpc;
            });
    })
    .WithMetrics(metrics =>
    {
        metrics
            // Auto-instrumentation
            .AddAspNetCoreInstrumentation()
            .AddHttpClientInstrumentation()
            .AddRuntimeInstrumentation()
            .AddProcessInstrumentation()

            // Custom meters
            .AddMeter("ATP.Ingestion")
            .AddMeter("MassTransit")

            // Exporters
            .AddOtlpExporter(options =>
            {
                options.Endpoint = new Uri("http://otel-collector:4317");
                options.Protocol = OtlpExportProtocol.Grpc;
            })
            .AddPrometheusExporter();  // Expose /metrics endpoint
    });

// Enable Azure Monitor (alternative/additional)
builder.Services.AddOpenTelemetry()
    .UseAzureMonitor(options =>
    {
        options.ConnectionString = builder.Configuration["ApplicationInsights:ConnectionString"];
    });

  • Configuration Options
  • Sampling rates (dev: 100%, prod: 1-10%)
  • Exporter endpoints
  • Batch processing
  • Resource attributes
  • Instrumentation options

Code Examples:
- Complete OTel SDK configuration (all ATP services)
- Environment-specific settings
- Exporter configurations

Diagrams:
- SDK architecture
- Configuration flow

Deliverables:
- OTel SDK setup guide
- Configuration reference
- Service-specific configurations


CYCLE 3: Distributed Tracing (~5,000 lines)

Topic 5: Tracing Fundamentals

What will be covered: - Trace, Span, Context

Trace:
- End-to-end request journey
- Collection of related spans
- Unique trace ID (shared across all spans)

Span:
- Single operation within trace
- Has start time, end time, duration
- Parent-child relationships
- Attributes (tags)

Context:
- Trace ID + Span ID + Trace Flags
- Propagated across service boundaries
- W3C Trace Context standard: traceparent header

Example Trace:
TraceID: 4bf92f3577b34da6a3ce929d0e0e4736

Spans:
├─ Gateway (50ms)
│  ├─ Authentication (10ms)
│  └─ Routing (5ms)
├─ Ingestion (4.2s) ← SLOW
│  ├─ Validation (20ms)
│  ├─ Policy.Evaluate (3.8s) ← BOTTLENECK
│  │  └─ Database.Query (3.5s) ← ROOT CAUSE
│  ├─ Database.Insert (200ms)
│  └─ Outbox.Append (100ms)
└─ Service Bus Publish (50ms)

Total Duration: 4.3s
Bottleneck: Policy evaluation → database query

  • W3C Trace Context

    traceparent Header Format:
    00-<trace-id>-<parent-span-id>-<trace-flags>
    
    Example:
    traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
    
    Parts:
    - 00: Version
    - 4bf92f3577b34da6a3ce929d0e0e4736: Trace ID (16 bytes, 32 hex chars)
    - 00f067aa0ba902b7: Parent Span ID (8 bytes, 16 hex chars)
    - 01: Trace Flags (sampled=1)
    
    tracestate Header (optional, vendor-specific):
    tracestate: tenant=acme-corp,edition=enterprise
    
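The four fields can be pulled apart mechanically. A minimal sketch (helper name is illustrative) that splits a traceparent value and checks the fixed field widths from the W3C spec:

```csharp
using System;

// Hypothetical helper: splits "version-traceid-spanid-flags" and validates
// the fixed widths (2, 32, 16, 2 hex characters respectively).
public static class TraceParentHeader
{
    public static (string Version, string TraceId, string SpanId, string Flags) Parse(string header)
    {
        var parts = header.Split('-');
        if (parts.Length != 4 || parts[0].Length != 2 ||
            parts[1].Length != 32 || parts[2].Length != 16 || parts[3].Length != 2)
        {
            throw new FormatException($"Invalid traceparent: {header}");
        }
        return (parts[0], parts[1], parts[2], parts[3]);
    }
}
```

In practice the OTel SDK parses and propagates this header for you; a helper like this is only useful in diagnostics tooling or log enrichment.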

  • Creating Spans (C#)

    using System.Diagnostics;
    using OpenTelemetry.Trace;  // for the RecordException extension method
    
    public class IngestionService
    {
        private static readonly ActivitySource ActivitySource = 
            new ActivitySource("ATP.Ingestion", "1.0.0");
    
        public async Task<string> IngestEventAsync(AuditEvent auditEvent)
        {
            // Start span
            using var activity = ActivitySource.StartActivity(
                "Ingestion.IngestEvent",
                ActivityKind.Server);
    
            // Add attributes
            activity?.SetTag("tenant.id", auditEvent.TenantId);
            activity?.SetTag("event.type", auditEvent.EventType);
            activity?.SetTag("audit.record.id", auditEvent.Id);
    
            try
            {
                // Child span: Validation
                using (var validateActivity = ActivitySource.StartActivity("Ingestion.Validate"))
                {
                    await ValidateEventAsync(auditEvent);
                }
    
                // Child span: Policy evaluation
                using (var policyActivity = ActivitySource.StartActivity("Ingestion.EvaluatePolicy"))
                {
                    var policy = await _policyService.EvaluateAsync(auditEvent);
                    policyActivity?.SetTag("policy.version", policy.Version);
                    policyActivity?.SetTag("classification", policy.Classification);
                }
    
                // Child span: Database insert
                using (var dbActivity = ActivitySource.StartActivity("Ingestion.PersistToDatabase"))
                {
                    await _repository.SaveAsync(auditEvent);
                }
    
                // Success
                activity?.SetStatus(ActivityStatusCode.Ok);
                activity?.SetTag("result", "success");
    
                return auditEvent.Id;
            }
            catch (Exception ex)
            {
                // Record exception in span
                activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
                activity?.RecordException(ex);
                activity?.SetTag("result", "error");
                activity?.SetTag("error.type", ex.GetType().Name);
    
                throw;
            }
        }
    }
    

Code Examples:
- Complete tracing implementation
- Span creation patterns
- Attribute setting
- Error recording

Diagrams:
- Trace structure
- Span hierarchy
- Context propagation

Deliverables:
- Tracing fundamentals guide
- Implementation patterns
- Span library


Topic 6: Auto-Instrumentation

What will be covered: ASP.NET Core Auto-Instrumentation
- Automatic spans for HTTP requests
- Route, method, and status code captured
- Exception recording

  • SQL Client Auto-Instrumentation
  • Automatic spans for database queries
  • Connection, command, query captured
  • Parameter sanitization (no PII)

  • HTTP Client Auto-Instrumentation

  • Automatic spans for outbound HTTP calls
  • URL, method, status captured

  • gRPC Auto-Instrumentation

  • Azure SDK Auto-Instrumentation
  • Service Bus, Blob Storage, Key Vault

  • MassTransit Auto-Instrumentation

  • Message publish/consume spans
  • Correlation propagation

Code Examples:
- Auto-instrumentation configuration
- All ATP auto-instrumented libraries

Deliverables:
- Auto-instrumentation guide
- Configuration catalog


CYCLE 4: Trace Context Propagation (~4,000 lines)

Topic 7: HTTP Context Propagation

What will be covered: - Propagating Trace Context (HTTP)

// Outbound HTTP call (automatic with HttpClient instrumentation)
var httpClient = _httpClientFactory.CreateClient("PolicyService");

// OTel automatically adds headers:
// traceparent: 00-{trace-id}-{span-id}-{trace-flags}
// (tracestate is forwarded if present; tenant context travels via baggage)

var response = await httpClient.PostAsync("/api/policy/evaluate", content);

// Manual propagation (if needed)
var request = new HttpRequestMessage(HttpMethod.Post, "/api/policy/evaluate");

var currentActivity = Activity.Current;
if (currentActivity != null)
{
    // With the W3C ID format (the default), Activity.Id is already the traceparent value
    request.Headers.Add("traceparent", currentActivity.Id);
}

  • Message Bus Context Propagation

    // MassTransit automatically propagates trace context
    public class AuditAcceptedEventConsumer : IConsumer<AuditAcceptedEvent>
    {
        public async Task Consume(ConsumeContext<AuditAcceptedEvent> context)
        {
            // Trace context automatically restored from message headers
            // Activity.Current contains parent trace
    
            var activity = Activity.Current;  // not "using": don't dispose an activity this consumer didn't start
            activity?.SetTag("tenant.id", context.Message.TenantId);
            activity?.SetTag("event.id", context.Message.AuditRecordId);
    
            // Process event...
        }
    }
    
    // Manual propagation (if needed)
    await _bus.Publish(new AuditAcceptedEvent
    {
        AuditRecordId = recordId,
        TenantId = tenantId
    }, context =>
    {
        // Trace context propagated automatically via MassTransit
        // Or manual:
        context.Headers.Set("traceparent", Activity.Current?.Id);
        context.Headers.Set("tenant-id", tenantId);
    });
    

  • Background Job Context

  • Start new trace or link to parent
  • Baggage propagation
  • Correlation preservation
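One common pattern for the bullets above: start the background job as its own trace and attach an ActivityLink back to the trace that scheduled it, so long-running work doesn't inflate the original trace while staying navigable from it. A sketch, assuming a hypothetical `RetentionSweepJob`:

```csharp
using System.Diagnostics;

public class RetentionSweepJob
{
    private static readonly ActivitySource Source = new("ATP.BackgroundJobs");

    // schedulingContext is the ActivityContext captured when the job was enqueued.
    public void Run(ActivityContext schedulingContext, string tenantId)
    {
        using var activity = Source.StartActivity(
            "RetentionSweep",
            ActivityKind.Internal,
            parentContext: default,  // no ambient parent in a detached job → new root
            links: new[] { new ActivityLink(schedulingContext) });

        activity?.SetTag("tenant.id", tenantId);
        // ... job work here ...
    }
}
```

The link shows up in trace UIs as a cross-reference, which is usually preferable to parenting hour-long jobs under a millisecond-scale request span.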

Code Examples:
- Context propagation patterns (HTTP, messaging, background)
- Manual propagation when needed
- Baggage usage

Diagrams:
- Context propagation flow
- Cross-service tracing

Deliverables:
- Propagation guide
- Implementation patterns


Topic 8: Baggage & Correlation

What will be covered: - OpenTelemetry Baggage

// Set baggage (propagated to all downstream services)
Baggage.SetBaggage("tenant.id", tenantId);
Baggage.SetBaggage("correlation.id", correlationId);
Baggage.SetBaggage("tenant.edition", "enterprise");

// Read baggage (in downstream service)
var tenantId = Baggage.GetBaggage("tenant.id");
var correlationId = Baggage.GetBaggage("correlation.id");

// Baggage propagated via W3C baggage header:
// baggage: tenant.id=acme-corp,correlation.id=01HZX123,tenant.edition=enterprise

  • Correlation ID Pattern
  • Request ID vs. Correlation ID vs. Trace ID
    Trace ID:
    - OpenTelemetry trace identifier
    - Generated per request
    - Links all spans in trace
    
    Correlation ID:
    - Business/application-level correlation
    - Can span multiple traces
    - Example: Order ID, User Session ID
    - ATP: ULID for audit record
    
    Request ID:
    - Gateway-generated unique ID
    - For client correlation
    - May equal trace ID or separate
    
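The resolution rule for a correlation ID — reuse the caller's value when supplied, otherwise mint one — can be sketched as follows. The header name is an assumption, and `Guid` stands in only to keep the sketch dependency-free (ATP uses ULIDs):

```csharp
using System;

public static class CorrelationIds
{
    public const string HeaderName = "X-Correlation-Id";  // assumed header name

    // Reuse: the caller is continuing an existing chain.
    // Mint: this request starts a new chain.
    public static string Resolve(string? incomingHeaderValue) =>
        string.IsNullOrWhiteSpace(incomingHeaderValue)
            ? Guid.NewGuid().ToString("N")
            : incomingHeaderValue;
}
```

The resolved value would then be pushed into the log context and set as a span attribute, so the same business correlation survives across multiple traces.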

Code Examples:
- Baggage usage
- Correlation patterns
- ID management

Deliverables:
- Baggage guide
- Correlation strategies


CYCLE 5: Metrics Collection (~4,500 lines)

Topic 9: OTel Metrics API

What will be covered: - Metric Instruments

using System.Diagnostics.Metrics;

public class IngestionMetrics
{
    private readonly Meter _meter;
    private readonly Counter<long> _requestsTotal;
    private readonly Histogram<double> _requestDuration;
    private readonly UpDownCounter<int> _activeRequests;
    private readonly ObservableGauge<long> _outboxPending;

    public IngestionMetrics(IMeterFactory meterFactory)
    {
        _meter = meterFactory.Create("ATP.Ingestion", "1.0.0");

        // Counter (monotonic increase)
        _requestsTotal = _meter.CreateCounter<long>(
            name: "ingest.requests.total",
            unit: "{requests}",
            description: "Total ingestion requests");

        // Histogram (distribution of values)
        _requestDuration = _meter.CreateHistogram<double>(
            name: "ingest.request.duration",
            unit: "s",
            description: "Ingestion request duration");

        // UpDownCounter (can increase or decrease)
        _activeRequests = _meter.CreateUpDownCounter<int>(
            name: "ingest.requests.active",
            unit: "{requests}",
            description: "Active ingestion requests");

        // ObservableGauge (observed value)
        _outboxPending = _meter.CreateObservableGauge<long>(
            name: "outbox.events.pending",
            observeValue: () => new Measurement<long>(
                GetOutboxPendingCount(),
                new KeyValuePair<string, object>("service", "ingestion")),
            unit: "{events}",
            description: "Pending outbox events");
    }

    public void RecordRequest(string result, string tenantClass, double durationSeconds)
    {
        // Increment counter
        _requestsTotal.Add(1,
            new KeyValuePair<string, object>("result", result),
            new KeyValuePair<string, object>("tenant_class", tenantClass));

        // Record duration
        _requestDuration.Record(durationSeconds,
            new KeyValuePair<string, object>("result", result));
    }

    public IDisposable TrackActiveRequest()
    {
        _activeRequests.Add(1);
        // DisposableAction: small helper that runs the delegate on Dispose
        return new DisposableAction(() => _activeRequests.Add(-1));
    }

    private long GetOutboxPendingCount()
    {
        // _outboxRepository is assumed injected via the constructor (omitted above)
        return _outboxRepository.GetPendingCount();
    }
}

  • Metric Types
    Counter:
    - Monotonically increasing
    - Resets to zero on restart
    - Example: Total requests, total errors
    - Query: rate(metric[5m])
    
    UpDownCounter:
    - Can increase or decrease
    - Example: Active connections, queue depth
    - Query: metric (current value)
    
    Histogram:
    - Distribution of values (buckets)
    - Calculate percentiles (P50, P95, P99)
    - Example: Request duration, payload size
    - Query: histogram_quantile(0.95, metric)
    
    Gauge (Observable):
    - Point-in-time value
    - Observed when scraped
    - Example: CPU usage, memory usage, pending jobs
    - Query: metric (current value)
    

Code Examples:
- Complete metrics implementation (all services)
- Metric instrument usage
- Query examples

Diagrams:
- Metric types
- Collection flow

Deliverables:
- Metrics API guide
- Instrument catalog
- Usage patterns


Topic 10: Prometheus Exposition

What will be covered: - Prometheus /metrics Endpoint

// Enable Prometheus exporter
builder.Services.AddOpenTelemetry()
    .WithMetrics(metrics =>
    {
        metrics.AddPrometheusExporter();
    });

// Map endpoint
app.MapPrometheusScrapingEndpoint();

// Exposes /metrics on the service's own HTTP port for Prometheus to scrape

  • Metrics Format
    # HELP ingest_requests_total Total ingestion requests
    # TYPE ingest_requests_total counter
    ingest_requests_total{result="success",tenant_class="enterprise"} 15234
    ingest_requests_total{result="failure",tenant_class="enterprise"} 23
    
    # HELP ingest_request_duration Ingestion request duration
    # TYPE ingest_request_duration histogram
    ingest_request_duration_bucket{result="success",le="0.1"} 1000
    ingest_request_duration_bucket{result="success",le="0.5"} 14500
    ingest_request_duration_bucket{result="success",le="1.0"} 15200
    ingest_request_duration_bucket{result="success",le="+Inf"} 15234
    ingest_request_duration_sum{result="success"} 1234.56
    ingest_request_duration_count{result="success"} 15234
    

Code Examples:
- Prometheus exporter setup
- Scrape configuration
- Query examples

Deliverables:
- Prometheus integration guide
- Scrape configurations


CYCLE 6: Custom Metrics (~4,000 lines)

Topic 11: ATP Business Metrics

What will be covered: - Audit Trail-Specific Metrics

public class AuditMetrics
{
    private readonly Meter _meter;  // Meter and instrument creation in the constructor omitted for brevity

    // Counters
    private readonly Counter<long> _recordsIngested;
    private readonly Counter<long> _recordsClassified;
    private readonly Counter<long> _recordsArchived;
    private readonly Counter<long> _recordsPurged;
    private readonly Counter<long> _integrityVerifications;
    private readonly Counter<long> _tamperAnomalies;

    // Histograms
    private readonly Histogram<double> _recordSize;
    private readonly Histogram<double> _verificationDuration;

    // Gauges
    private readonly ObservableGauge<long> _activeRetentionPolicies;
    private readonly ObservableGauge<long> _recordsInHotStorage;
    private readonly ObservableGauge<double> _projectionLag;

    public void RecordIngestion(string tenantClass, string classification, int sizeBytes)
    {
        _recordsIngested.Add(1,
            new KeyValuePair<string, object>("tenant_class", tenantClass),
            new KeyValuePair<string, object>("classification", classification));

        _recordSize.Record(sizeBytes);
    }

    public void RecordIntegrityCheck(string result, double durationSeconds)
    {
        _integrityVerifications.Add(1,
            new KeyValuePair<string, object>("result", result));

        _verificationDuration.Record(durationSeconds);
    }

    public void RecordTamperAnomaly(string type, string severity)
    {
        _tamperAnomalies.Add(1,
            new KeyValuePair<string, object>("type", type),
            new KeyValuePair<string, object>("severity", severity));
    }
}

  • Complete ATP Metrics Catalog (100+ metrics across all services)

Code Examples:
- Complete ATP metrics implementation
- Business metric patterns

Deliverables:
- ATP metrics catalog (100+ metrics)
- Implementation guide


Topic 12: Metric Labeling Strategy

What will be covered: - Low-Cardinality Labels

✅ GOOD (Low Cardinality):
- tenant_class: "small" | "medium" | "large" | "enterprise"  (4 values)
- result: "success" | "failure" | "timeout"  (3 values)
- region: "us-east" | "eu-west" | "il-central"  (3 values)
- service: "gateway" | "ingestion" | "query" ...  (8 values)

❌ BAD (High Cardinality):
- tenant_id: "acme-corp" | "contoso" | ...  (1000s of values)
- user_id: "user-123" | "user-456" | ...  (millions of values)
- trace_id: unique per request  (billions of values)
- audit_record_id: unique per record  (billions of values)

Why High Cardinality is Bad:
- Exponential metric explosion
- Prometheus memory exhaustion
- Query performance degradation
- Storage costs

Solution:
- Use tenant_class instead of tenant_id
- Use aggregations (count by class, not by ID)
- Use tracing for individual request details
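The tenant_id → tenant_class mapping can be a simple bucketing function applied before a label is attached. A sketch with illustrative thresholds (not ATP's actual class boundaries):

```csharp
// Collapse an unbounded per-tenant dimension into four stable label values.
// Thresholds here are illustrative; the real classification would come from
// the tenant's edition or provisioned capacity.
public static class TenantClassifier
{
    public static string ToTenantClass(long dailyEventVolume) => dailyEventVolume switch
    {
        < 10_000    => "small",
        < 100_000   => "medium",
        < 1_000_000 => "large",
        _           => "enterprise",
    };
}
```

A metric call then records `tenant_class` (4 possible values) instead of `tenant_id` (thousands), while the tenant ID itself stays available on the trace for per-request drill-down.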

Code Examples:
- Label strategy
- Cardinality management

Deliverables:
- Labeling guide
- Cardinality policies


CYCLE 7: Structured Logging (~4,500 lines)

Topic 13: Serilog Configuration

What will be covered: - Serilog Setup

// Program.cs
Log.Logger = new LoggerConfiguration()
    .ReadFrom.Configuration(configuration)
    .Enrich.FromLogContext()
    .Enrich.WithMachineName()
    .Enrich.WithEnvironmentName()
    .Enrich.WithCorrelationId()
    .Enrich.WithClientIp()
    .Enrich.WithExceptionDetails()
    .WriteTo.Console(
        outputTemplate: "[{Timestamp:HH:mm:ss} {Level:u3}] {Message:lj} {Properties:j}{NewLine}{Exception}")
    .WriteTo.File(
        path: "logs/atp-ingestion-.log",
        rollingInterval: RollingInterval.Day,
        outputTemplate: "[{Timestamp:yyyy-MM-dd HH:mm:ss.fff zzz}] [{Level:u3}] {Message:lj} {Properties:j}{NewLine}{Exception}",
        retainedFileCountLimit: 31)
    .WriteTo.File(
        new JsonFormatter(),
        path: "logs/atp-ingestion-.json",
        rollingInterval: RollingInterval.Day,
        retainedFileCountLimit: 31)
    .WriteTo.Seq("http://seq:5341")
    .WriteTo.ApplicationInsights(
        telemetryConfiguration,
        TelemetryConverter.Traces)
    .WriteTo.OpenTelemetry(options =>
    {
        options.Endpoint = "http://otel-collector:4318/v1/logs";
        options.Protocol = OtlpProtocol.HttpProtobuf;
        options.ResourceAttributes = new Dictionary<string, object>
        {
            ["service.name"] = "atp.ingestion"
        };
    })
    .CreateLogger();

builder.Host.UseSerilog();

  • Log Sinks
  • Console (development)
  • File (local debugging, JSON + text)
  • Seq (development, log search)
  • Application Insights (production)
  • OpenTelemetry (unified pipeline)
  • Azure Log Analytics (production)
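Environment-specific settings are typically kept out of code: with the Serilog.Settings.Configuration package, the `ReadFrom.Configuration(configuration)` call above picks up a `Serilog` section from the active appsettings file. The sketch below (e.g. an `appsettings.Production.json`) uses illustrative levels and names:

```json
{
  "Serilog": {
    "MinimumLevel": {
      "Default": "Information",
      "Override": {
        "Microsoft.AspNetCore": "Warning",
        "System.Net.Http.HttpClient": "Warning"
      }
    },
    "Properties": {
      "Application": "atp.ingestion"
    }
  }
}
```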

Code Examples: - Complete Serilog configuration - Sink configurations - Environment-specific settings

Deliverables: - Serilog setup guide - Sink catalog - Configuration templates


Topic 14: Structured Logging Patterns

What will be covered: - Structured Logging Best Practices

// ❌ BAD: String interpolation (unstructured)
_logger.LogInformation($"User {userId} ingested event {eventId} for tenant {tenantId}");

// ✅ GOOD: Structured (named properties)
_logger.LogInformation(
    "User {UserId} ingested event {EventId} for tenant {TenantId}",
    userId, eventId, tenantId);

// Output (JSON):
{
  "Timestamp": "2025-10-30T14:32:15.123Z",
  "Level": "Information",
  "Message": "User user-123 ingested event 01HZX... for tenant acme-corp",
  "Properties": {
    "UserId": "user-123",
    "EventId": "01HZX123456789",
    "TenantId": "acme-corp"
  }
}

// ✅ GOOD: Enriched with context
using (_logger.BeginScope(new Dictionary<string, object>
{
    ["TenantId"] = tenantId,
    ["CorrelationId"] = correlationId,
    ["TraceId"] = Activity.Current?.TraceId.ToString()
}))
{
    _logger.LogInformation("Processing ingestion request");
    // ... operations
    _logger.LogInformation("Ingestion complete");
    // All logs in scope include TenantId, CorrelationId, TraceId
}

  • Log Levels
    Trace (Verbose):
    - Detailed debugging information
    - Use sparingly (performance cost)
    - Example: "Entering method X with parameters Y"
    
    Debug:
    - Debugging information
    - Disabled in production by default
    - Example: "Cache hit for key X"
    
    Information:
    - Significant application events
    - Enabled in production
    - Example: "Event ingested successfully"
    
    Warning:
    - Abnormal but handled situations
    - Review periodically
    - Example: "Projection lag elevated (15s)"
    
    Error:
    - Error conditions (handled exceptions)
    - Review immediately
    - Example: "Failed to connect to database"
    
    Critical:
    - Critical failures (unhandled exceptions)
    - Page on-call
    - Example: "Data integrity verification failed"
    

Code Examples: - Structured logging patterns - Log level usage - Scope and enrichment

Deliverables: - Logging best practices - Pattern catalog


CYCLE 8: Log Enrichment & Correlation (~4,000 lines)

Topic 15: Log Enrichers

What will be covered: - Built-In Enrichers

.Enrich.FromLogContext()           // Include scope properties
.Enrich.WithMachineName()          // Add machine name
.Enrich.WithEnvironmentName()      // Add environment (dev, prod)
.Enrich.WithThreadId()             // Add thread ID
.Enrich.WithProcessId()            // Add process ID
.Enrich.WithCorrelationId()        // Add correlation ID from headers
.Enrich.WithClientIp()             // Add client IP
.Enrich.WithExceptionDetails()     // Rich exception logging

  • Custom Enrichers
    public class TenantEnricher : ILogEventEnricher
    {
        private readonly ITenantResolver _tenantResolver;

        public TenantEnricher(ITenantResolver tenantResolver)
        {
            _tenantResolver = tenantResolver;
        }

        public void Enrich(LogEvent logEvent, ILogEventPropertyFactory propertyFactory)
        {
            var tenantId = _tenantResolver.GetCurrentTenantId();
            if (tenantId != null)
            {
                var property = propertyFactory.CreateProperty("TenantId", tenantId);
                logEvent.AddPropertyIfAbsent(property);
            }
        }
    }

    // Register enricher (an instance is required here: .Enrich.With<T>()
    // only works for enrichers with a parameterless constructor)
    .Enrich.With(new TenantEnricher(tenantResolver))
    

Code Examples: - Enricher configuration - Custom enrichers

Deliverables: - Enrichment guide


Topic 16: Correlation Across Logs, Traces, Metrics

What will be covered: - Unified Correlation - Cross-Pillar Queries - Exemplars (Prometheus → Traces)
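A cross-pillar query can be sketched in KQL against the Application Insights tables, where requests and logs share `operation_Id` (which carries the W3C trace ID); the time window and projected columns are illustrative:

```kql
// Sketch: join failed requests with their logs via the shared operation_Id,
// so one failed request yields its full log context.
requests
| where timestamp > ago(1h) and success == false
| project operation_Id, name, duration, resultCode
| join kind=inner (
    traces
    | where timestamp > ago(1h)
    | project operation_Id, message, severityLevel
  ) on operation_Id
| order by duration desc
```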

Code Examples: - Correlation strategies

Deliverables: - Correlation guide


CYCLE 9: Azure Monitor Integration (~5,000 lines)

Topic 17: Application Insights

What will be covered: - Application Insights Setup - Telemetry Types (requests, dependencies, exceptions, traces, metrics) - Custom Events and Metrics - Application Map - Live Metrics Stream
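As a setup sketch, assuming the Azure Monitor OpenTelemetry distro (the Azure.Monitor.OpenTelemetry.AspNetCore package); the configuration key is an illustrative choice:

```csharp
// Program.cs sketch, assuming the Azure Monitor OpenTelemetry distro.
using Azure.Monitor.OpenTelemetry.AspNetCore;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddOpenTelemetry()
    .UseAzureMonitor(options =>
    {
        // Usually supplied via the APPLICATIONINSIGHTS_CONNECTION_STRING
        // environment variable instead of code.
        options.ConnectionString =
            builder.Configuration["ApplicationInsights:ConnectionString"];
    });

var app = builder.Build();
app.Run();
```

The distro wires up request, dependency, and exception telemetry in one call, so manual `TelemetryClient` plumbing is rarely needed.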

Code Examples: - App Insights configuration

Deliverables: - Application Insights guide


Topic 18: Container Insights

What will be covered: - AKS Container Insights - Pod and Node Metrics - Log Collection - Performance Analysis
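A pod-health query of this kind might look like the following KQL sketch against the Container Insights `KubePodInventory` table; the time window is illustrative:

```kql
// Sketch: pods not in Running state over the last 15 minutes,
// with their latest status and restart counts.
KubePodInventory
| where TimeGenerated > ago(15m)
| where PodStatus != "Running"
| summarize arg_max(TimeGenerated, PodStatus, ContainerRestartCount)
    by Name, Namespace
| order by ContainerRestartCount desc
```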

Code Examples: - Container Insights setup

Deliverables: - Container monitoring guide


CYCLE 11: Log Analytics & KQL (~4,500 lines)

Topic 21: KQL Query Language

What will be covered: - Kusto Query Language (KQL)

// Find all errors in last hour
traces
| where timestamp > ago(1h)
| where severityLevel >= 3  // Error or Critical
| where cloud_RoleName == "Ingestion"
| project timestamp, message, severityLevel, customDimensions
| order by timestamp desc
| take 100

// P95 latency by service
requests
| where timestamp > ago(1h)
| summarize P95=percentile(duration, 95) by cloud_RoleName, bin(timestamp, 5m)
| render timechart

// Error rate
requests
| where timestamp > ago(1h)
| summarize 
    Total = count(),
    Errors = countif(success == false)
    by bin(timestamp, 5m)
| extend ErrorRate = todouble(Errors) / todouble(Total)
| render timechart

Code Examples: - Complete KQL query library (50+ queries)

Deliverables: - KQL reference guide


Topic 22: Log Analytics Workbooks

What will be covered: - Custom Workbooks - Parameterized Queries - Interactive Dashboards

Code Examples: - Workbook templates

Deliverables: - Workbook library


CYCLE 12: Prometheus & Grafana (~4,500 lines)

Topic 23: Prometheus Setup

What will be covered: - Prometheus Deployment - Scrape Configurations - Recording Rules - Storage and Retention
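A minimal scrape configuration can be sketched as follows; the job name, intervals, and pod-annotation convention are assumptions, not ATP's actual deployment values:

```yaml
# Sketch prometheus.yml: discover and scrape annotated pods in Kubernetes.
global:
  scrape_interval: 30s
  evaluation_interval: 30s

scrape_configs:
  - job_name: "atp-services"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
```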

Code Examples: - Prometheus configuration

Deliverables: - Prometheus setup guide


Topic 24: Grafana Dashboards

What will be covered: - Dashboard Design Principles - ATP Operational Dashboards - Service Health Dashboards - SLO Dashboards

Code Examples: - Dashboard JSON templates (10+ dashboards)

Deliverables: - Dashboard library


CYCLE 13: Dashboard Design (~4,000 lines)

Topic 25: Dashboard Catalog

What will be covered: - ATP Operations Dashboard - Service-Specific Dashboards (8 services) - Infrastructure Dashboards - Business Dashboards

Code Examples: - Complete dashboard library

Deliverables: - Dashboard templates


Topic 26: Dashboard Best Practices

What will be covered: - Layout Guidelines - Color Conventions - Alert Visualization

Deliverables: - Design guide


CYCLE 16: PII Redaction & Compliance (~3,500 lines)

Topic 31: PII Redaction in Telemetry

What will be covered: - Log Redaction

// Microsoft.Extensions.Compliance.Redaction
// (DataClassifications.PIIData / SensitiveData: a custom ATP taxonomy
//  built on DataClassificationAttribute)
services.AddRedaction(configure =>
{
    configure.SetRedactor<ErasingRedactor>(DataClassifications.PIIData);
    // For values that must remain joinable (pseudonymized), use the
    // built-in HMAC redactor via configure.SetHmacRedactor(...).
});
builder.Logging.EnableRedaction();  // opt the logging pipeline in

// Usage: redaction applies through source-generated logging; parameters
// annotated with a data-classification attribute are redacted automatically.
// Plain message-template logging is NOT redacted.
[LoggerMessage(Level = LogLevel.Information,
    Message = "User {UserId} with email {Email} logged in")]
static partial void LogUserLogin(ILogger logger, string userId,
    [PiiData] string email);  // [PiiData] derives from DataClassificationAttribute

// Output (email erased by ErasingRedactor):
// "User user-123 with email  logged in"

  • Trace Attribute Sanitization
  • Metric Label Filtering
  • Compliance Validation
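Trace attribute sanitization can be sketched as an OpenTelemetry processor that scrubs span attributes before export; the processor class and attribute key list are illustrative assumptions, not an ATP standard:

```csharp
using System.Diagnostics;
using OpenTelemetry;

// Sketch: overwrite PII-bearing span attributes before export.
public class PiiScrubbingProcessor : BaseProcessor<Activity>
{
    private static readonly string[] SensitiveKeys =
        { "user.email", "user.name", "client.address" };

    public override void OnEnd(Activity activity)
    {
        foreach (var key in SensitiveKeys)
        {
            if (activity.GetTagItem(key) is not null)
            {
                activity.SetTag(key, "[REDACTED]"); // overwrite rather than drop
            }
        }
    }
}

// Registration in the tracing pipeline:
// .WithTracing(tracing => tracing.AddProcessor(new PiiScrubbingProcessor()))
```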

Code Examples: - Redaction implementation

Deliverables: - PII redaction guide


Topic 32: Compliance Monitoring

What will be covered: - Retention Compliance Monitoring - Data Residency Validation - Access Audit Logging
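A retention-compliance metric might be sketched as follows; `IRetentionStore`, the meter name, and the instrument name are hypothetical, and the gauge deliberately uses the low-cardinality `tenant_class` label:

```csharp
using System.Collections.Generic;
using System.Diagnostics.Metrics;
using System.Linq;

// Sketch: observable gauge of audit records past their retention deadline,
// reported per tenant class. IRetentionStore is a hypothetical abstraction.
public class ComplianceMetrics
{
    private static readonly Meter Meter = new("atp.compliance");

    public ComplianceMetrics(IRetentionStore store)
    {
        Meter.CreateObservableGauge(
            "atp_retention_overdue_records",
            () => store.CountOverdueByClass()   // e.g. { "small": 0, "large": 12 }
                .Select(kv => new Measurement<long>(
                    kv.Value,
                    new KeyValuePair<string, object?>("tenant_class", kv.Key))),
            description: "Audit records past their retention deadline");
    }
}
```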

Code Examples: - Compliance metrics

Deliverables: - Compliance monitoring guide


CYCLE 20: Best Practices & Troubleshooting (~3,500 lines)

Topic 39: Monitoring Best Practices

What will be covered: - Observability Design Principles - Common Anti-Patterns - Performance Optimization

Deliverables: - Best practices handbook


Topic 40: Troubleshooting Observability

What will be covered: - Missing Traces - Missing Metrics - Log Gaps - High Cardinality Issues

Deliverables: - Troubleshooting guide


Summary of Deliverables

Complete observability implementation covering:

  1. Fundamentals: Three pillars, ATP requirements
  2. OpenTelemetry: SDK setup, configuration, integration
  3. Distributed Tracing: W3C Trace Context, span creation, propagation
  4. Metrics: OTel Metrics API, Prometheus, custom metrics
  5. Logging: Serilog, structured logging, enrichment
  6. Azure Monitor: Application Insights, Log Analytics, Container Insights
  7. Prometheus & Grafana: Setup, dashboards, alerting
  8. Dashboards: 15+ operational dashboards
  9. APM: Performance monitoring, dependency tracking
  10. Compliance: PII redaction, audit logging, retention
  11. Operations: Correlation, troubleshooting, best practices


This monitoring & observability guide provides the complete implementation of ATP's observability stack: OpenTelemetry instrumentation and distributed tracing, Serilog structured logging, Azure Monitor integration, Prometheus and Grafana dashboards, PII redaction, and compliance monitoring. Together, these practices maintain actionable visibility into system behavior while preserving privacy and regulatory compliance.