
Observability Strategy & Practices

Purpose & Scope

Purpose: Strategic guide for using observability effectively in ATP, focusing on debugging workflows, troubleshooting patterns, and observability-driven development practices. This document complements the implementation details in monitoring.md.

Scope: This document covers:

  • Observability Philosophy: How to think about observability, asking the right questions, observability-driven design
  • Debugging Workflows: Practical troubleshooting scenarios using traces, logs, and metrics together
  • Correlation Patterns: How to trace requests across services, messages, and background jobs
  • Performance Investigation: Using observability to identify and fix performance issues
  • Security Observability: Detecting threats, anomalies, and unauthorized access
  • Observability Maturity: Evolving from basic monitoring to advanced observability practices
  • Context Propagation: Ensuring tenant context, correlation IDs, and trace context flow end-to-end
  • Observability for Compliance: Using telemetry for audit trails and regulatory evidence

Audience: Developers, SREs, incident responders, platform engineers, architects

Relationship to Other Documents:

  • Implementation: See monitoring.md for OpenTelemetry setup, Serilog configuration, Prometheus/Grafana, Azure Monitor integration
  • Operations: See runbook.md for operational procedures using observability data
  • Alerts: See alerts-slos.md for alerting strategy and SLO definitions
  • Architecture: See ../architecture/architecture.md for observability requirements


Table of Contents

  1. Observability Philosophy
  2. Asking the Right Questions
  3. Debugging Workflows
  4. Correlation & Context
  5. Performance Investigation
  6. Security Observability
  7. Observability-Driven Development
  8. Observability Maturity Model
  9. Troubleshooting Scenarios
  10. Best Practices

Observability Philosophy

What is Observability?

Observability is the ability to understand the internal state of a system by examining its external outputs (logs, metrics, traces). It goes beyond monitoring by enabling exploratory investigation of unknown issues.

Monitoring vs. Observability:

Monitoring                                                                | Observability
--------------------------------------------------------------------------|------------------------------------------------------------------
Known unknowns: Pre-defined dashboards, alerts for expected failure modes | Unknown unknowns: Ad-hoc exploration, debugging unexpected issues
"Is X broken?"                                                            | "Why is X broken?"
Static thresholds                                                         | Dynamic queries
Reactive (alerts fire)                                                    | Proactive (explore before issues escalate)
Good for: Health checks, SLOs                                             | Good for: Debugging, optimization, learning

ATP Uses Both:

  • Monitoring: Health checks, SLO dashboards, alerting
  • Observability: Distributed tracing, structured logs, ad-hoc metrics queries

Three Pillars Working Together

┌─────────────────────────────────────────────────────────┐
│                  OBSERVABILITY STACK                     │
├─────────────────────────────────────────────────────────┤
│                                                          │
│  METRICS (What) → "P95 latency is 2.5s"                │
│       ↓                                                 │
│  TRACES (Why) → "Database query took 2.2s"             │
│       ↓                                                 │
│  LOGS (Details) → "Connection pool exhausted, waited..."│
│                                                          │
└─────────────────────────────────────────────────────────┘

Example Investigation Flow:

  1. Metrics Alert: ingest_latency_p95 > 1s fires
  2. Check Metrics Dashboard: Latency spike started at 14:30 UTC
  3. Find Trace: Search for slow ingestion requests around 14:30
  4. Examine Trace: Policy evaluation span shows 900ms delay
  5. Check Logs: Policy service logs show "Connection pool exhausted"
  6. Root Cause: Policy service connection pool too small for load
  7. Fix: Increase connection pool, add connection pool health check

Observability Principles

  1. Instrument Everything: Every service, every request, every operation
  2. Correlate Everything: Trace IDs, correlation IDs, tenant context in all telemetry
  3. Structure Everything: Structured logs, semantic metrics, well-named spans
  4. Query Everything: Make all telemetry queryable and explorable
  5. Retain Everything: Balance retention with cost (hot → cool → archive)
  6. Redact Sensitive Data: Never log PII, sanitize parameters, hash when needed

Asking the Right Questions

The Observability Question Framework

Observability enables you to answer arbitrary questions about system behavior. Learn to ask:

  1. What happened? (Logs)
  2. How much? (Metrics)
  3. Why? (Traces)
  4. Who was affected? (Tenant context, correlation)
  5. When did it start? (Time-series analysis)
  6. Where is the problem? (Service-level attribution)

ATP-Specific Questions

Performance Questions:

  • "Why is ingestion latency high for tenant acme-corp?"
  • "What's the slowest operation in the query service?"
  • "Which tenants are hitting rate limits?"

Reliability Questions:

  • "Why did this ingestion request fail?"
  • "What's causing the dead-letter queue growth?"
  • "Which services are experiencing errors?"

Business Questions:

  • "How many audit records were ingested per tenant this month?"
  • "What's the projection lag by tenant edition?"
  • "Which tenants are using the most storage?"

Security Questions:

  • "Who accessed audit records for tenant xyz last week?"
  • "Are there any failed authentication attempts?"
  • "Did any cross-tenant data access occur?"

Question-to-Telemetry Mapping

Question Type        | Primary Telemetry               | Secondary Telemetry
---------------------|---------------------------------|------------------------------------------------
Why is X slow?       | Traces (timing breakdown)       | Metrics (percentiles), Logs (warnings)
How many X?          | Metrics (counters)              | Logs (aggregated counts)
What error occurred? | Logs (exceptions, stack traces) | Traces (error spans), Metrics (error rate)
Who did X?           | Logs (actor context)            | Traces (user attributes), Metrics (per-tenant)
When did X happen?   | All (timestamp correlation)     | Metrics (time-series), Traces (timeline)
Where is the issue?  | Traces (service attribution)    | Metrics (per-service), Logs (service tags)

Debugging Workflows

Standard Debugging Workflow

flowchart TD
    A[Alert/Issue Reported] --> B[Check Metrics Dashboard]
    B --> C{Issue Visible in Metrics?}
    C -->|Yes| D[Identify Time Window]
    C -->|No| E[Search Logs for Error Messages]
    D --> F[Find Traces in Time Window]
    E --> F
    F --> G[Examine Trace Breakdown]
    G --> H[Check Logs for Detailed Context]
    H --> I[Identify Root Cause]
    I --> J[Implement Fix]
    J --> K[Verify Fix with Observability]

Workflow 1: Slow Request Investigation

Scenario: Customer reports slow ingestion (5+ seconds)

Step 1: Check Metrics

# Query Prometheus for P95 latency spike
histogram_quantile(0.95, 
  rate(http_server_duration_seconds_bucket{
    service="atp.ingestion",
    route="/api/v1/ingest"
  }[5m]))

Result: P95 latency is 4.8s (normally 200ms)

Step 2: Find Slow Traces

Jaeger Query:
- Service: atp.ingestion
- Operation: POST /api/v1/ingest
- Duration: > 4000ms
- Time: Last 1 hour

Step 3: Examine Trace

Trace: 4bf92f3577b34da6a3ce929d0e0e4736
Duration: 4.8s

├─ Gateway (50ms)
│  ├─ Authentication (10ms)
│  └─ Routing (5ms)
├─ Ingestion (4.7s) ← SLOW
│  ├─ Validation (20ms)
│  ├─ Policy.Evaluate (3.9s) ← BOTTLENECK
│  │  └─ Database.Query (3.6s) ← ROOT CAUSE
│  ├─ Database.Insert (200ms)
│  └─ Outbox.Append (100ms)
└─ Service Bus Publish (50ms)

Step 4: Check Logs for Context

// Log Analytics query
traces
| where timestamp > ago(1h)
| where traceId == "4bf92f3577b34da6a3ce929d0e0e4736"
| where message contains "database" or message contains "connection"
| project timestamp, message, severityLevel, customDimensions

Log Result:

[14:32:15.123] WARN PolicyService: Connection pool exhausted, 
  waited 3600ms for available connection. 
  Pool size: 10, Active: 10, Pending: 45

Step 5: Root Cause Identified

  • Policy service connection pool too small (10 connections)
  • Under load, connections exhausted
  • Requests wait for available connections

Step 6: Fix

  • Increase connection pool to 50
  • Add connection pool metrics (active, pending, wait time)
  • Add alert: connection_pool_pending > 10
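
The alert from Step 6 could be sketched as a Prometheus alerting rule. This is illustrative only: the metric name follows the db_connection_pool_pending convention used elsewhere in this document, and the group/alert names are invented; adjust both to what the policy service actually exports.

```yaml
# Sketch of the connection-pool alert from Step 6 (names illustrative).
groups:
  - name: atp-policy-connection-pool
    rules:
      - alert: PolicyConnectionPoolPendingHigh
        expr: db_connection_pool_pending{service="atp.policy"} > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Policy service connection pool has pending requests"
          description: "{{ $value }} requests are waiting for a database connection."
```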

Workflow 2: Error Rate Investigation

Scenario: Error rate spike (5% → 15% in 10 minutes)

Step 1: Check Error Metrics

# Error rate by service
sum(rate(http_server_requests_total{
  result="error"
}[5m])) by (service)
/
sum(rate(http_server_requests_total[5m])) by (service)

Result: Ingestion service: 15% error rate

Step 2: Find Error Traces

Jaeger Query:
- Service: atp.ingestion
- Tags: result=error
- Time: Last 30 minutes

Step 3: Group Errors by Type

// Log Analytics - group errors by exception type
traces
| where timestamp > ago(30m)
| where severityLevel >= 3  // Error or Critical
| where cloud_RoleName == "atp.ingestion"
| extend ExceptionType = tostring(customDimensions.ExceptionType)
| summarize ErrorCount = count() by ExceptionType
| order by ErrorCount desc

Result:

ExceptionType              | ErrorCount
---------------------------|-----------
SqlException               | 234
TimeoutException           | 12
ValidationException        | 5

Step 4: Examine Specific Error

// Get detailed error for SqlException
traces
| where timestamp > ago(30m)
| where customDimensions.ExceptionType == "SqlException"
| where customDimensions.Message contains "timeout"
| project timestamp, message, customDimensions
| take 10

Result:

[14:35:22.456] ERROR IngestionService: Database timeout after 30s
  Exception: System.Data.SqlClient.SqlException
  Message: Timeout expired. The timeout period elapsed prior to 
    completion of the operation or the server is not responding.
  Query: SELECT * FROM PolicyRules WHERE TenantId = @p0 AND Active = 1
  Parameters: @p0 = 'acme-corp'

Step 5: Check Database Metrics

# Database query duration
histogram_quantile(0.95,
  rate(db_client_duration_seconds_bucket{
    service="atp.ingestion",
    db_operation="SELECT"
  }[5m]))

Result: P95 database query duration is 28s (normally 50ms)

Step 6: Root Cause Identified

  • Database queries timing out
  • PolicyRules table may be locked or missing an index
  • High query volume causing contention

Step 7: Fix

  • Add an index on (TenantId, Active) for the PolicyRules table
  • Investigate table locks
  • Consider caching policy rules
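
The index from Step 7 could look like the following T-SQL sketch. Table and column names are taken from the query shown in Step 4; verify them against the real schema before applying.

```sql
-- Supports: SELECT * FROM PolicyRules WHERE TenantId = @p0 AND Active = 1
CREATE NONCLUSTERED INDEX IX_PolicyRules_TenantId_Active
    ON PolicyRules (TenantId, Active);
```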

Workflow 3: Missing Data Investigation

Scenario: Tenant reports missing audit records

Step 1: Verify Ingestion

// Check if records were ingested for tenant
traces
| where timestamp > ago(24h)
| where customDimensions.TenantId == "acme-corp"
| where message contains "Ingestion.Complete" or message contains "Record ingested"
| summarize IngestedCount = count() by bin(timestamp, 1h)

Step 2: Check Query Service

// Check query service logs for the records
traces
| where timestamp > ago(24h)
| where customDimensions.TenantId == "acme-corp"
| where cloud_RoleName == "atp.query"
| where message contains "Query.Executed"
| project timestamp, message, customDimensions

Step 3: Check Projection Service

// Check if projection processed the records
traces
| where timestamp > ago(24h)
| where customDimensions.TenantId == "acme-corp"
| where cloud_RoleName == "atp.projection"
| where message contains "Projection.Updated"
| summarize ProjectedCount = count() by bin(timestamp, 1h)

Step 4: Check for Errors in Pipeline

// Find any errors in the pipeline
traces
| where timestamp > ago(24h)
| where customDimensions.TenantId == "acme-corp"
| where severityLevel >= 3
| where message contains "acme-corp"
| project timestamp, cloud_RoleName, message, customDimensions
| order by timestamp desc

Step 5: Verify Database State

-- Check actual database records
SELECT COUNT(*) 
FROM AuditRecords 
WHERE TenantId = 'acme-corp' 
  AND CreatedAt >= DATEADD(hour, -24, GETUTCDATE())

Step 6: Trace Specific Record

If the tenant provides a specific record ID:

// Trace specific record ID through pipeline
traces
| where timestamp > ago(7d)
| where customDimensions.AuditRecordId == "01HZX123456789"
| project timestamp, cloud_RoleName, message, customDimensions
| order by timestamp asc


Correlation & Context

Context Propagation Pattern

Context Flow:

Client Request
  ↓ (HTTP headers)
Gateway
  ↓ (traceparent, baggage)
Ingestion Service
  ↓ (traceparent, baggage)
Policy Service (HTTP)
  ↓ (traceparent, baggage)
Database (connection context)
  ↓ (traceparent in message headers)
Service Bus Message
  ↓ (traceparent, baggage)
Projection Consumer
  ↓ (traceparent)
Database (projection)

Required Context Attributes

Every telemetry record MUST include:

// Resource attributes (service-level, constant)
service.name = "atp.ingestion"
service.version = "1.2.3"
deployment.environment = "production"
cloud.provider = "azure"
cloud.region = "eastus"

// Span/Log attributes (request-level, variable)
trace.id = "4bf92f3577b34da6a3ce929d0e0e4736"
span.id = "00f067aa0ba902b7"
tenant.id = "acme-corp"  // or tenant.class for metrics
tenant.edition = "enterprise"
correlation.id = "01HZX123456789"  // Business correlation ID
audit.record.id = "01HZX987654321"  // ATP-specific
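
A sketch of how these attributes might be wired up with the OpenTelemetry .NET SDK. The service-level values belong on the resource; the request-level values go on the current span. Attribute values here are illustrative; in practice they come from configuration.

```csharp
using System.Collections.Generic;
using System.Diagnostics;
using OpenTelemetry;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

// Service-level (constant) attributes are set once on the resource.
var tracerProvider = Sdk.CreateTracerProviderBuilder()
    .ConfigureResource(resource => resource
        .AddService(serviceName: "atp.ingestion", serviceVersion: "1.2.3")
        .AddAttributes(new Dictionary<string, object>
        {
            ["deployment.environment"] = "production",
            ["cloud.provider"] = "azure",
            ["cloud.region"] = "eastus"
        }))
    .Build();

// Request-level (variable) attributes belong on the current span, not the resource.
Activity.Current?.SetTag("tenant.id", "acme-corp");
Activity.Current?.SetTag("correlation.id", "01HZX123456789");
```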

Correlation ID Strategy

Three Types of IDs:

  1. Trace ID (OpenTelemetry):

     • Generated by the gateway on request
     • Propagated via the traceparent header (W3C Trace Context)
     • Links all spans in a single request flow
     • Example: 4bf92f3577b34da6a3ce929d0e0e4736

  2. Correlation ID (Business):

     • ULID for an audit record or business entity
     • Can span multiple traces/requests
     • Used for business logic correlation
     • Example: 01HZX123456789 (ULID)

  3. Request ID (Gateway):

     • Unique per HTTP request
     • Returned to the client for support
     • May equal the trace ID or be separate
     • Example: req-20251030-abc123

Cross-Service Correlation

HTTP Request Correlation:

// Gateway sets correlation context
var traceId = Activity.Current?.TraceId.ToString();
var correlationId = GenerateCorrelationId();

// Add to headers (automatic with OTel)
request.Headers.Add("X-Correlation-ID", correlationId);
// traceparent header added automatically by OTel SDK

// Downstream service receives and uses
var correlationId = HttpContext.Request.Headers["X-Correlation-ID"];
var traceId = Activity.Current?.TraceId.ToString();

// Log with both
_logger.LogInformation(
    "Processing request with CorrelationId={CorrelationId}, TraceId={TraceId}",
    correlationId, traceId);

Message Bus Correlation:

// Publisher sets correlation context
await _bus.Publish(new AuditAcceptedEvent
{
    AuditRecordId = recordId,
    TenantId = tenantId
}, context =>
{
    // MassTransit automatically propagates trace context
    // Also set business correlation ID
    context.Headers.Set("X-Correlation-ID", recordId);
});

// Consumer receives and uses
public async Task Consume(ConsumeContext<AuditAcceptedEvent> context)
{
    var correlationId = context.Headers.Get<string>("X-Correlation-ID");
    var traceId = Activity.Current?.TraceId.ToString();

    // All logs in this consumer will include correlation context
    using (_logger.BeginScope(new Dictionary<string, object>
    {
        ["CorrelationId"] = correlationId,
        ["TraceId"] = traceId
    }))
    {
        _logger.LogInformation("Processing audit accepted event");
        // Process event...
    }
}


Performance Investigation

Latency Analysis Workflow

Step 1: Identify Latency Component

Total Request Time: 500ms
├─ Gateway: 20ms (4%)
├─ Authentication: 10ms (2%)
├─ Ingestion Service: 450ms (90%) ← FOCUS HERE
│  ├─ Validation: 15ms
│  ├─ Policy Evaluation: 400ms ← BOTTLENECK
│  ├─ Database Insert: 30ms
│  └─ Outbox Append: 5ms
└─ Response Serialization: 20ms (4%)

Step 2: Drill into Bottleneck

  • Examine Policy Evaluation span details
  • Check downstream calls (database, cache, external API)
  • Look for lock contention, resource exhaustion

Step 3: Check Resource Metrics

# Connection pool usage
db_connection_pool_active{service="atp.policy"}
db_connection_pool_pending{service="atp.policy"}

# Cache hit rate
cache_hit_rate{service="atp.policy", cache="policy-rules"}

# CPU/Memory
process_cpu_usage{service="atp.policy"}
process_memory_usage{service="atp.policy"}
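
The pool gauges queried above must be published by the service itself. A minimal sketch using System.Diagnostics.Metrics follows; the PoolMonitor type is hypothetical, standing in for the real pool's statistics API.

```csharp
using System.Diagnostics.Metrics;

// Sketch: exporting the connection-pool gauges referenced above.
// PoolMonitor is a hypothetical stand-in for the actual pool statistics.
public static class PoolMetrics
{
    private static readonly Meter Meter = new("Atp.Policy");

    public static void Register()
    {
        Meter.CreateObservableGauge(
            "db_connection_pool_active",
            () => PoolMonitor.ActiveConnections,
            description: "Connections currently in use");

        Meter.CreateObservableGauge(
            "db_connection_pool_pending",
            () => PoolMonitor.PendingRequests,
            description: "Requests waiting for a connection");
    }
}
```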

Throughput Analysis

Questions to Answer:

  • What's the current throughput (requests/second)?
  • Is throughput limited by CPU, memory, network, or database?
  • Which tenants/operations have the highest throughput?
  • Is any throttling or rate limiting in effect?

Metrics to Examine:

# Request rate by service
sum(rate(http_server_requests_total[1m])) by (service)

# Request rate by tenant class
sum(rate(http_server_requests_total{
  tenant_class="enterprise"
}[1m])) by (service)

# Throughput vs. capacity
sum(rate(http_server_requests_total[1m])) by (service)
/
sum(http_server_capacity_total) by (service)


Security Observability

Threat Detection Patterns

Unauthorized Access Attempts:

// Failed authentication attempts
traces
| where timestamp > ago(24h)
| where customDimensions.EventType == "Authentication.Failed"
| summarize 
    FailedAttempts = count(),
    UniqueIPs = dcount(customDimensions.ClientIp),
    UniqueTenants = dcount(customDimensions.TenantId)
    by bin(timestamp, 1h)
| where FailedAttempts > 10  // Threshold

Cross-Tenant Access Attempts:

// Potential cross-tenant data access
traces
| where timestamp > ago(24h)
| where customDimensions.EventType == "Authorization.Denied"
| where customDimensions.Reason contains "tenant" or 
      customDimensions.Reason contains "cross-tenant"
| project timestamp, customDimensions.ActorId, 
    customDimensions.TenantId, customDimensions.RequestedTenantId,
    customDimensions.Resource

Data Exfiltration Patterns:

// Large export requests
traces
| where timestamp > ago(24h)
| where cloud_RoleName == "atp.export"
| where customDimensions.ExportSizeBytes > 1000000000  // > 1GB
| project timestamp, customDimensions.TenantId, 
    customDimensions.ExportSizeBytes, customDimensions.RequestedBy

Anomaly Detection:

# Unusual request patterns (sudden spike)
increase(http_server_requests_total{
  service="atp.query"
}[5m]) > 1000  # More than 1000 requests in 5 minutes


Observability-Driven Development

Design for Observability

Before Writing Code:

  1. Define Success Metrics: What metrics indicate this feature is working?
  2. Define Failure Scenarios: What errors can occur? How will we detect them?
  3. Plan Instrumentation: What spans/logs/metrics are needed?
  4. Consider Correlation: How will we trace this operation end-to-end?

While Writing Code:

  1. Instrument Early: Add spans/logs/metrics as you code, not after
  2. Use Structured Logging: Named parameters, not string interpolation
  3. Add Context: Include tenant ID, correlation ID, trace ID in all logs
  4. Record Exceptions: Always log exceptions with full context

After Deploying:

  1. Verify Instrumentation: Check that traces/logs/metrics are appearing
  2. Validate Dashboards: Ensure new metrics show up in dashboards
  3. Test Error Paths: Trigger errors, verify they're logged correctly
  4. Review Queries: Can you answer questions about this feature?
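
As a sketch of "instrument early", a single handler can emit all three signals with shared tenant context. The types and names below are illustrative, not ATP's actual code.

```csharp
using System.Diagnostics;
using System.Diagnostics.Metrics;
using Microsoft.Extensions.Logging;

// Illustrative sketch: one operation emitting a span, a structured log,
// and a metric together, each carrying correlation context.
public class IngestHandler
{
    private static readonly ActivitySource Source = new("Atp.Ingestion");
    private static readonly Meter Meter = new("Atp.Ingestion");
    private static readonly Counter<long> IngestTotal =
        Meter.CreateCounter<long>("ingest.requests.total");

    private readonly ILogger<IngestHandler> _logger;

    public IngestHandler(ILogger<IngestHandler> logger) => _logger = logger;

    public void Handle(string tenantId, string recordId)
    {
        // Trace: one span per operation, tagged with tenant context
        using var activity = Source.StartActivity("Ingestion.HandleRecord");
        activity?.SetTag("tenant.id", tenantId);

        // Log: structured parameters, never string interpolation
        _logger.LogInformation(
            "Ingesting record {AuditRecordId} for tenant {TenantId}",
            recordId, tenantId);

        // Metric: count the operation (low-cardinality labels only)
        IngestTotal.Add(1);
    }
}
```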

Observability Checklist

For Every Feature:

  - [ ] Traces cover the critical path (request → response)
  - [ ] Logs include sufficient context (tenant, correlation, user)
  - [ ] Metrics track success rate, latency, throughput
  - [ ] Errors are logged with full exception details
  - [ ] Dashboards show feature health
  - [ ] Alerts fire for known failure modes
  - [ ] Documentation explains how to debug this feature


Observability Maturity Model

Level 1: Basic Monitoring (Reactive)

  • Logs exist but unstructured
  • Basic metrics (CPU, memory)
  • Manual investigation
  • "What's broken?"

Level 2: Structured Observability (Proactive)

  • Structured logs with correlation
  • Service-level metrics
  • Distributed tracing
  • "Where is the problem?"

Level 3: Context-Rich Observability (Investigative)

  • Full context propagation (tenant, correlation, trace)
  • Business metrics alongside technical metrics
  • Rich dashboards and alerting
  • "Why did this happen?"

Level 4: Observability-Driven (Predictive)

  • Automated anomaly detection
  • Predictive alerting (before issues occur)
  • Observability used for optimization
  • "How can we prevent this?"

ATP Target: Level 3-4 (Context-Rich to Observability-Driven)


Troubleshooting Scenarios

Scenario 1: Intermittent Timeouts

Symptoms: Random 30s timeouts, affects 1% of requests

Investigation:

  1. Find Timeout Traces: Search for traces with duration > 25s
  2. Check Timeout Pattern: Are timeouts clustered by tenant, time, or operation?
  3. Examine Span Details: Which operation is timing out?
  4. Check Resource Metrics: Connection pools, queue depths, CPU
  5. Look for Lock Contention: Database locks, distributed locks

Common Causes:

  • Connection pool exhaustion (spikes)
  • Database deadlocks
  • Network partition
  • Garbage collection pauses
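
A Log Analytics query along these lines can reveal whether the timeouts cluster. The OperationName dimension is an assumption; substitute whatever dimension your logs actually record.

```kusto
// Do timeouts cluster by tenant, operation, or time window?
traces
| where timestamp > ago(24h)
| where customDimensions.ExceptionType == "TimeoutException"
| summarize Timeouts = count()
    by TenantId = tostring(customDimensions.TenantId),
       Operation = tostring(customDimensions.OperationName),
       bin(timestamp, 15m)
| order by Timeouts desc
```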

Scenario 2: Data Inconsistency

Symptoms: Query returns stale data, missing records

Investigation:

  1. Trace Record Lifecycle: Follow the record from ingestion → projection → query
  2. Check Projection Lag: Is the projection service keeping up?
  3. Verify Event Processing: Are events being consumed from Service Bus?
  4. Check for Errors: Any errors in projection or query services?
  5. Validate Watermarks: Are projection watermarks advancing?

Common Causes:

  • Projection lag (events not processed)
  • Event processing errors (dead-letter queue)
  • Cache invalidation failures
  • Database replication lag
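
Two signals worth checking for this scenario, sketched in PromQL. Both metric names are illustrative; use whatever your projection service and Service Bus exporter actually emit.

```promql
# Projection lag per service (illustrative metric name)
max(projection_lag_seconds) by (service)

# Dead-letter queue depth over time (illustrative metric name)
servicebus_deadletter_messages{entity="atp-projection"}
```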

Scenario 3: Performance Degradation

Symptoms: Gradual latency increase over days/weeks

Investigation:

  1. Trend Analysis: Compare current metrics to baseline (7 days ago)
  2. Identify Component: Which service/operation degraded?
  3. Resource Analysis: CPU, memory, database, network trends
  4. Check for Scaling Issues: Is autoscaling working?
  5. Data Growth: Has data volume increased significantly?

Common Causes:

  • Data growth (larger queries)
  • Missing indexes
  • Resource exhaustion
  • Memory leaks
  • Inefficient algorithms


Best Practices

Logging Best Practices

  1. Use Structured Logging:

    // ✅ GOOD
    _logger.LogInformation(
        "Ingested record {AuditRecordId} for tenant {TenantId}",
        recordId, tenantId);
    
    // ❌ BAD
    _logger.LogInformation($"Ingested record {recordId} for tenant {tenantId}");
    

  2. Include Context:

    using (_logger.BeginScope(new Dictionary<string, object>
    {
        ["TenantId"] = tenantId,
        ["CorrelationId"] = correlationId,
        ["TraceId"] = Activity.Current?.TraceId.ToString()
    }))
    {
        // All logs in this scope include context
    }
    

  3. Log at Appropriate Levels:

     • Debug: Development-only, detailed execution flow
     • Information: Significant business events, normal operations
     • Warning: Abnormal but handled situations
     • Error: Error conditions, handled exceptions
     • Critical: Critical failures, unhandled exceptions

  4. Never Log PII:

    // ❌ BAD
    _logger.LogInformation("User {Email} logged in", email);
    
    // ✅ GOOD
    _logger.LogInformation("User {UserId} logged in", userId);
    // Or hash/redact
    _logger.LogInformation("User {EmailHash} logged in", Hash(email));
    

Tracing Best Practices

  1. Name Spans Clearly:

    // ✅ GOOD
    _activitySource.StartActivity("Ingestion.ValidateRecord");
    _activitySource.StartActivity("Policy.EvaluateClassification");

    // ❌ BAD
    _activitySource.StartActivity("DoWork");
    _activitySource.StartActivity("Process");
    

  2. Add Relevant Attributes:

    activity?.SetTag("tenant.id", tenantId);
    activity?.SetTag("audit.record.id", recordId);
    activity?.SetTag("policy.version", policyVersion);
    

  3. Record Exceptions:

    try
    {
        // Operation
    }
    catch (Exception ex)
    {
        activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
        activity?.RecordException(ex);
        throw;
    }
    

Metrics Best Practices

  1. Use Low-Cardinality Labels:

    // ✅ GOOD (4 values)
    _requestsTotal.Add(1, new("tenant_class", "enterprise"));
    
    // ❌ BAD (1000s of values)
    _requestsTotal.Add(1, new("tenant_id", tenantId));
    

  2. Choose Appropriate Metric Types:

     • Counter: Total requests, total errors (monotonically increasing)
     • Histogram: Latency, size (distribution of values)
     • Gauge: Queue depth, active connections (current value)

  3. Document Metrics:

    _meter.CreateCounter<long>(
        name: "ingest.requests.total",
        unit: "{requests}",
        description: "Total number of ingestion requests");
    


Summary

Observability in ATP enables:

  1. Fast Debugging: Trace requests end-to-end, find bottlenecks quickly
  2. Proactive Detection: Identify issues before customers notice
  3. Performance Optimization: Data-driven improvements
  4. Security Monitoring: Detect threats and anomalies
  5. Compliance Evidence: Audit trail visibility

Key Takeaways:

  • Use traces, logs, and metrics together (not in isolation)
  • Always include correlation context (tenant ID, trace ID, correlation ID)
  • Structure everything (structured logs, semantic metrics)
  • Ask the right questions (What? How much? Why?)
  • Design for observability (instrument as you code)

Next Steps:

  • Review monitoring.md for implementation details
  • Practice debugging workflows with real scenarios
  • Build observability into the development process
  • Regularly review and optimize telemetry


Document Version: 1.0
Last Updated: 2025-10-30
Maintained By: Platform Engineering & SRE Team