Epic AUD-OTEL-001: Observability & Telemetry Framework
Status: 🔄 In Progress (85% complete)
Owner: SRE Team
Target: Q4 2025
Epic Description
This epic establishes the Observability and Telemetry Framework for the Audit Trail Platform (ATP). It integrates OpenTelemetry (OTel) instrumentation across all microservices to provide unified traces, metrics, and logs. The framework ensures that every transaction (spanning ingestion, storage, query, and export) is fully traceable, measurable, and diagnosable within the platform's observability stack (Prometheus, Grafana, Jaeger, and Seq).
Epic Objectives
- Implement OpenTelemetry middleware across all ATP services
- Standardize structured logging with correlation IDs
- Collect metrics on performance, latency, and error rates
- Integrate with Prometheus and Grafana
- Provide unified observability dashboards (90% complete)
- Ensure observability coverage ≥ 95% across all microservices
Features
Feature AUD-OTEL-TRC-001: Traces & Spans ✅
Status: Complete
Delivered: Q3 2025
Tasks:
- ✅ Add OTel middleware to all services
- ✅ Define trace naming & span conventions
- ✅ Configure trace sampling & exporters

Achievements:
- 100% trace continuity across service boundaries
- Traces visible in Jaeger and Grafana Tempo
- Average tracing overhead: 3.2% (target: <5%)
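
Trace continuity across service boundaries depends on each service forwarding the trace ID it received and minting a new span ID for its own hop. The sketch below illustrates that idea with the W3C Trace Context `traceparent` header, which OpenTelemetry propagators use by default. It is not ATP's middleware: the function names are hypothetical, only version `00` headers are handled, and starting a new trace as sampled is an assumption.

```python
import re
import secrets

# W3C Trace Context header (version 00): version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return (trace_id, parent_span_id, sampled), or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or span IDs.
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return trace_id, span_id, bool(int(flags, 16) & 0x01)


def child_traceparent(incoming):
    """Build the header for an outgoing call, continuing the incoming trace if valid."""
    parsed = parse_traceparent(incoming) if incoming else None
    if parsed is None:
        # Assumption: with no valid incoming context, start a new sampled trace.
        trace_id, sampled = secrets.token_hex(16), True
    else:
        trace_id, _, sampled = parsed
    span_id = secrets.token_hex(8)  # new span ID for this hop
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"
```

Calling `child_traceparent` with the header of an incoming request yields a header with the same trace ID and a fresh span ID, which is what lets a backend such as Jaeger or Tempo stitch the hops into one trace.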
Feature AUD-OTEL-MET-001: Metrics (Latency, Throughput) ✅
Status: Complete
Delivered: Q3 2025
Tasks:
- ✅ Instrument core metrics per service
- ✅ Configure Prometheus/Grafana dashboards
- ✅ Define SLOs and alerts

Deployed Metrics:
- http_requests_total, http_request_duration_seconds
- grpc_calls_total, bus_messages_published_total
- audit_records_ingested_total, audit_query_latency_seconds
Dashboards:
- ✅ Service Health Dashboard (all services)
- ✅ Business Metrics Dashboard (ingestion, query volumes)
- ✅ SLO Dashboard (availability, error budget)
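
The error budget shown on an SLO dashboard follows directly from the SLO target and request counters such as `http_requests_total`. The sketch below shows the arithmetic only. The function name, the returned field names, and the 99.9% target used in the usage note are illustrative assumptions; the page does not state ATP's actual SLO values.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Summarize error-budget consumption for one SLO window.

    slo_target is a fraction, for example 0.999 for a 99.9% availability SLO.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # The budget is the number of requests allowed to fail within the window.
    allowed_failures = (1.0 - slo_target) * total_requests
    consumed = failed_requests / allowed_failures
    return {
        "availability": 1.0 - failed_requests / total_requests,
        "allowed_failures": allowed_failures,
        "budget_consumed": consumed,
        "budget_remaining": 1.0 - consumed,
    }
```

For example, with a hypothetical 99.9% target, 1,000,000 requests, and 250 failures, about a quarter of the budget is consumed. A `budget_remaining` value below zero means the SLO was missed for the window.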
Feature AUD-OTEL-LOG-001: Structured Logs 🔄
Status: In Progress (90%)
Target: November 2025
Tasks:
- ✅ Implement structured logging format (JSON)
- ✅ Integrate centralized log storage (Seq, Application Insights)
- 🔄 Add contextual logging middleware (final review)

Current Work:
- Finalizing exception handling enrichment
- Performance validation (CPU overhead < 3%)
- Integration with correlation IDs from traces

Blockers:
- None
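
As an illustration of how JSON-structured logs can carry a correlation ID and enriched exception fields, here is a minimal sketch using only the Python standard library. It is not the ATP logging middleware: `trace_id_var`, `JsonFormatter`, `build_logger`, and the JSON field names are hypothetical, and a real service would populate the trace ID from the active OpenTelemetry span rather than setting it by hand.

```python
import contextvars
import json
import logging

# Hypothetical context variable holding the current trace ID for this request.
trace_id_var = contextvars.ContextVar("trace_id", default=None)


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "trace_id": trace_id_var.get(),  # correlation ID shared with traces
        }
        if record.exc_info and record.exc_info[0] is not None:
            # Exception enrichment: expose type and message as searchable fields.
            exc_type, exc_value, _ = record.exc_info
            payload["exception_type"] = exc_type.__name__
            payload["exception_message"] = str(exc_value)
        return json.dumps(payload)


def build_logger(name, stream):
    """Return a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

With `trace_id_var` set at the start of a request, every line the logger emits carries the same `trace_id`, so a log search in Seq or Application Insights can be joined to the matching trace.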
Current Sprint Work
Sprint 2025-10-28 to 2025-11-08
Focus: Complete structured logging rollout to production
Tasks:
1. ✅ Deploy logging middleware to all services
2. 🔄 Validate log correlation with traces (testing)
3. ⏳ Create log aggregation dashboard in Grafana
4. ⏳ Configure log retention policies (30 days hot, 365 days archive)
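
The retention task above (30 days hot, 365 days archive) can be read as an age-based tiering rule. The sketch below shows one way to express it. It assumes both windows are measured from the original log timestamp and that an entry exactly 30 or 365 days old moves to the next tier; neither detail is stated on this page, and the function and tier names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

HOT_DAYS = 30        # from the sprint task: 30 days hot
ARCHIVE_DAYS = 365   # from the sprint task: 365 days archive


def retention_tier(logged_at, now=None):
    """Classify a log entry as 'hot', 'archive', or 'expired' by its age."""
    now = now or datetime.now(timezone.utc)
    age = now - logged_at
    if age < timedelta(days=HOT_DAYS):
        return "hot"
    if age < timedelta(days=ARCHIVE_DAYS):
        return "archive"
    return "expired"
```

Passing an explicit `now` keeps the rule deterministic, which makes it straightforward to check boundary behaviour before applying a policy to real storage.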
Team Capacity:
- 2 SRE engineers @ 100% allocation
- 1 Platform engineer for dashboard creation
Dependencies
Upstream (Depends On)
- ✅ AUD-ARC-001: Platform architecture defined
- ✅ AUD-OPS-001: CI/CD pipelines operational
Downstream (Enables)
- All epics benefit from observability
- AUD-CHAOS-001: Chaos engineering needs metrics
- AUD-OPS-002: Operations monitoring requires dashboards
Metrics & KPIs
| Metric | Target | Current | Status |
|---|---|---|---|
| Observability Coverage | > 95% | 97% | ✅ |
| Trace Propagation | 100% | 100% | ✅ |
| Metrics Collection Rate | > 99% | 99.4% | ✅ |
| Log Ingestion Latency | < 5s | 3.2s | ✅ |
| Dashboard Load Time | < 2s | 1.8s | ✅ |
| Tracing Overhead | < 5% | 3.2% | ✅ |
Technology Stack
| Component | Technology | Purpose |
|---|---|---|
| Instrumentation | OpenTelemetry SDK | Traces, metrics, logs |
| Trace Backend | Jaeger + Grafana Tempo | Distributed tracing |
| Metrics | Prometheus | Time-series metrics |
| Visualization | Grafana | Dashboards and alerts |
| Logs (Dev) | Seq | Structured log search |
| Logs (Prod) | Azure Application Insights | Cloud-native logging |
Key Achievements
Q3 2025
- ✅ OpenTelemetry integrated across 8 microservices
- ✅ 15 Grafana dashboards deployed
- ✅ SLO dashboard with error budget tracking
- ✅ Distributed tracing validated end-to-end
Q4 2025 (Current)
- ✅ Structured logging format standardized
- 🔄 Log aggregation dashboard (90% complete)
- 97% observability coverage achieved
Risks & Mitigation
| Risk | Severity | Mitigation | Status |
|---|---|---|---|
| Log volume overwhelming storage | Medium | Implement sampling + log retention policies | ✅ Mitigated |
| Grafana dashboard performance | Low | Optimize queries, add caching | ✅ Mitigated |
| Trace sampling misses critical errors | Medium | Always trace errors + 10% sampling | ✅ Mitigated |
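
The "always trace errors + 10% sampling" mitigation can be sketched as a keep-or-drop decision made once a trace has finished, since whether an error occurred is not known when the first span starts. The code below is an illustration under that assumption, not the deployed sampler configuration; `keep_trace` is a hypothetical name, and the threshold comparison on the low 64 bits of the trace ID is one common way to make ratio sampling deterministic.

```python
SAMPLE_RATIO = 0.10  # from the risk table: 10% baseline sampling


def keep_trace(trace_id, had_error, ratio=SAMPLE_RATIO):
    """Decide whether to keep a finished trace.

    trace_id is the 32-character hex trace ID; had_error is True if any
    span in the trace recorded an error.
    """
    if had_error:
        return True  # error traces are always kept
    # Deterministic ratio sampling: compare the low 64 bits of the trace ID
    # with a threshold, so every component reaches the same decision.
    threshold = int(ratio * (1 << 64))
    return int(trace_id[-16:], 16) < threshold
```

Because the decision depends only on the trace ID and the error flag, all spans of a trace are kept or dropped together, which avoids partial traces in the backend.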
Recent Updates
2025-10-30:
- ✅ Logging middleware performance validated (<3% CPU overhead)
- Log correlation with traces tested in staging
- Retention policy configuration documented

2025-10-15:
- ✅ All services upgraded to OpenTelemetry 1.6.0
- ✅ Grafana alert rules published
- SLO tracking dashboard deployed

2025-10-01:
- ✅ Structured logging deployed to Ingestion and Storage services
- 🎯 Started integration testing for log correlation
Next Steps
- Complete logging middleware: final review and production deployment
- Create log aggregation dashboard: Grafana panels for error tracking
- Configure retention policies: 30 days hot, 365 days archive
- Document observability patterns: best practices for service teams
Related Documentation
- Implementation: Observability
- Operations: Monitoring
- Operations: Alerts & SLOs
- Reference: Baseline Roadmap
Azure DevOps Links
- Epic: AUD-OTEL-001
- Current Sprint: Sprint 2025-10-28
- Dashboards: Grafana ATP
Next Review: 2025-11-07 (Sprint Planning)
Contact: #atp-sre-team on Slack