Epic AUD-OTEL-001: Observability & Telemetry Framework

Status: 🔄 In Progress (85% complete)
Owner: SRE Team
Target: Q4 2025


Epic Description

This epic establishes the Observability and Telemetry Framework for the Audit Trail Platform (ATP). It integrates OpenTelemetry (OTel) instrumentation across all microservices to provide unified traces, metrics, and logs. The framework ensures that every transaction, spanning ingestion, storage, query, and export, is fully traceable, measurable, and diagnosable within the platform's observability stack (Prometheus, Grafana, Jaeger, and Seq).


Epic Objectives

  • Implement OpenTelemetry middleware across all ATP services
  • Standardize structured logging with correlation IDs
  • Collect metrics on performance, latency, error rates
  • Integrate with Prometheus and Grafana
  • Provide unified observability dashboards (90% complete)
  • Ensure observability coverage ≥ 95% across all microservices

Features

Feature AUD-OTEL-TRC-001: Traces & Spans ✅

Status: Complete
Delivered: Q3 2025

Tasks:

  • ✅ Add OTel middleware to all services
  • ✅ Define trace naming & span conventions
  • ✅ Configure trace sampling & exporters

Achievements:

  • 100% trace continuity across service boundaries
  • Traces visible in Jaeger and Grafana Tempo
  • Average tracing overhead: 3.2% (target: <5%)
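
Trace continuity across service boundaries depends on each service forwarding the incoming trace context on its outgoing calls. The sketch below illustrates that hand-off, assuming the W3C Trace Context `traceparent` header (the OpenTelemetry default propagator; this page does not state which format ATP uses). In practice the OTel SDK does this work; `parse_traceparent` and `child_traceparent` are hypothetical names used only to show the mechanism.

```python
import re
import secrets
from typing import NamedTuple, Optional

# W3C Trace Context header layout: version-traceid-parentid-flags
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceContext(NamedTuple):
    trace_id: str   # 32 lowercase hex chars, shared by every span in a trace
    span_id: str    # 16 lowercase hex chars, unique per span
    sampled: bool   # bit 0 of the flags byte


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the parsed context, or None if the header is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the specification.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return TraceContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_traceparent(parent: TraceContext) -> str:
    """Build the header for an outgoing call: same trace ID, fresh span ID."""
    span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex chars
    flags = "01" if parent.sampled else "00"
    return f"00-{parent.trace_id}-{span_id}-{flags}"
```

A trace stays continuous as long as every hop reuses the incoming `trace_id` and only replaces the span ID, which is what `child_traceparent` does.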


Feature AUD-OTEL-MET-001: Metrics (Latency, Throughput) ✅

Status: Complete
Delivered: Q3 2025

Tasks:

  • ✅ Instrument core metrics per service
  • ✅ Configure Prometheus/Grafana dashboards
  • ✅ Define SLOs and alerts

Deployed Metrics:

  • http_requests_total, http_request_duration_seconds
  • grpc_calls_total, bus_messages_published_total
  • audit_records_ingested_total, audit_query_latency_seconds

Dashboards:

  • ✅ Service Health Dashboard (all services)
  • ✅ Business Metrics Dashboard (ingestion, query volumes)
  • ✅ SLO Dashboard (availability, error budget)
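
The error budget shown on the SLO Dashboard comes down to simple arithmetic over request counts. The sketch below shows that calculation only; the 99.9% target in the usage note is an assumed value (this page does not list the actual SLO targets), and `error_budget_remaining` is a hypothetical helper, not the dashboard's query.

```python
def error_budget_remaining(
    total_requests: int, failed_requests: int, slo_target: float
) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        return 1.0  # no traffic in the window, so nothing has been spent
    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures
```

With an assumed 99.9% target, 1,000,000 requests allow about 1,000 failures, so 250 failures leave roughly three quarters of the budget.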


Feature AUD-OTEL-LOG-001: Structured Logs 🔄

Status: In Progress (90%)
Target: November 2025

Tasks:

  • ✅ Implement structured logging format (JSON)
  • ✅ Integrate centralized log storage (Seq, Application Insights)
  • 🔄 Add contextual logging middleware (final review)

Current Work:

  • Finalizing exception handling enrichment
  • Performance validation (CPU overhead < 3%)
  • Integration with correlation IDs from traces
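
The sketch below shows how JSON log lines can carry a correlation ID taken from the active trace, along with basic exception enrichment. It uses only the Python standard library and is not ATP's middleware: the field names, `trace_id_var`, `JsonFormatter`, and `build_logger` are all assumptions made for illustration.

```python
import contextvars
import json
import logging

# Hypothetical context variable; request middleware would set it once per
# request from the incoming trace context.
trace_id_var = contextvars.ContextVar("trace_id", default=None)


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "trace_id": trace_id_var.get(),  # correlation ID, None if unset
        }
        if record.exc_info and record.exc_info[0] is not None:
            # Exception enrichment: exception type plus formatted traceback.
            payload["exception_type"] = record.exc_info[0].__name__
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(name: str, stream=None) -> logging.Logger:
    """Create a logger that writes one JSON line per record to `stream`."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)  # defaults to stderr
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

Because the ID is read from a context variable at format time, individual log calls need no extra arguments, which keeps call sites unchanged when the middleware is introduced.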

Blockers: None


Current Sprint Work

Sprint 2025-10-28 to 2025-11-08

Focus: Complete structured logging rollout to production

Tasks:

  1. ✅ Deploy logging middleware to all services
  2. 🔄 Validate log correlation with traces (testing)
  3. ⏳ Create log aggregation dashboard in Grafana
  4. ⏳ Configure log retention policies (30 days hot, 365 days archive)
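
The retention policy in the last task can be read as a tiering rule based on record age. The sketch below assumes both limits are measured from the time the record was logged, which this page does not specify; `retention_tier` and the tier names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

HOT_DAYS = 30       # from the sprint task: 30 days hot
ARCHIVE_DAYS = 365  # from the sprint task: 365 days archive


def retention_tier(logged_at: datetime, now: datetime) -> str:
    """Classify a log record as 'hot', 'archive', or 'expired' by its age.

    Assumption: both limits count from the original log timestamp.
    """
    age = now - logged_at
    if age <= timedelta(days=HOT_DAYS):
        return "hot"
    if age <= timedelta(days=ARCHIVE_DAYS):
        return "archive"
    return "expired"


if __name__ == "__main__":
    now = datetime(2025, 11, 8, tzinfo=timezone.utc)
    for days in (10, 31, 400):
        print(days, retention_tier(now - timedelta(days=days), now))
```

If the archive window is instead meant to start when a record leaves the hot tier, the second threshold becomes `HOT_DAYS + ARCHIVE_DAYS`.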

Team Capacity:

  • 2 SRE engineers @ 100% allocation
  • 1 Platform engineer for dashboard creation


Dependencies

Upstream (Depends On)

  • ✅ AUD-ARC-001: Platform architecture defined
  • ✅ AUD-OPS-001: CI/CD pipelines operational

Downstream (Enables)

  • All epics benefit from observability
  • AUD-CHAOS-001: Chaos engineering needs metrics
  • AUD-OPS-002: Operations monitoring requires dashboards

Metrics & KPIs

Metric                  | Target | Current | Status
Observability Coverage  | > 95%  | 97%     | ✅
Trace Propagation       | 100%   | 100%    | ✅
Metrics Collection Rate | > 99%  | 99.4%   | ✅
Log Ingestion Latency   | < 5s   | 3.2s    | ✅
Dashboard Load Time     | < 2s   | 1.8s    | ✅
Tracing Overhead        | < 5%   | 3.2%    | ✅

Technology Stack

Component       | Technology                 | Purpose
Instrumentation | OpenTelemetry SDK          | Traces, metrics, logs
Trace Backend   | Jaeger + Grafana Tempo     | Distributed tracing
Metrics         | Prometheus                 | Time-series metrics
Visualization   | Grafana                    | Dashboards and alerts
Logs (Dev)      | Seq                        | Structured log search
Logs (Prod)     | Azure Application Insights | Cloud-native logging

Key Achievements

Q3 2025

  • ✅ OpenTelemetry integrated across 8 microservices
  • ✅ 15 Grafana dashboards deployed
  • ✅ SLO dashboard with error budget tracking
  • ✅ Distributed tracing validated end-to-end

Q4 2025 (Current)

  • ✅ Structured logging format standardized
  • 🔄 Log aggregation dashboard (90% complete)
  • 📊 97% observability coverage achieved

Risks & Mitigation

Risk                                  | Severity | Mitigation                                  | Status
Log volume overwhelming storage       | Medium   | Implement sampling + log retention policies | ✅ Mitigated
Grafana dashboard performance         | Low      | Optimize queries, add caching               | ✅ Mitigated
Trace sampling misses critical errors | Medium   | Always trace errors + 10% sampling          | ✅ Mitigated
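
The "always trace errors + 10% sampling" mitigation can be expressed as a keep-or-drop rule. The sketch below shows one way to write it, under two assumptions this page does not confirm: that the decision is made after the error status is known (tail-based sampling), and that the 10% share is chosen deterministically from the trace ID. `keep_trace` is a hypothetical name.

```python
SAMPLE_RATE = 0.10  # from the mitigation above: 10% sampling


def keep_trace(trace_id: str, has_error: bool, rate: float = SAMPLE_RATE) -> bool:
    """Always keep traces with errors; keep a fixed share of the rest."""
    if has_error:
        return True
    # Map the low 64 bits of the hex trace ID onto [0, 1). Every service that
    # sees the same trace ID computes the same value, so the decision is
    # consistent across hops without any coordination.
    bucket = int(trace_id[-16:], 16) / 2**64
    return bucket < rate
```

Deriving the decision from the trace ID, rather than from a random draw per service, avoids partially sampled traces in which some spans are kept and others are dropped.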

Recent Updates

2025-10-30:

  • ✅ Logging middleware performance validated (<3% CPU overhead)
  • 🔄 Log correlation with traces tested in staging
  • 📝 Retention policy configuration documented

2025-10-15:

  • ✅ All services upgraded to OpenTelemetry 1.6.0
  • ✅ Grafana alert rules published
  • 📊 SLO tracking dashboard deployed

2025-10-01:

  • ✅ Structured logging deployed to Ingestion and Storage services
  • 🎯 Started integration testing for log correlation


Next Steps

  1. Complete logging middleware - Final review and production deployment
  2. Create log aggregation dashboard - Grafana panels for error tracking
  3. Configure retention policies - 30 days hot, 365 days archive
  4. Document observability patterns - Best practices for service teams



Next Review: 2025-11-07 (Sprint Planning)
Contact: #atp-sre-team on Slack