Alerts & SLOs - Audit Trail Platform (ATP)¶
Measure what matters, alert intelligently: ATP defines Service Level Objectives (SLOs) for availability, latency, correctness, and freshness with error budgets, multi-window burn-rate alerts, golden signal monitoring, and actionable runbook-linked notifications to ensure reliability without alert fatigue.
Documentation Generation Plan¶
This document will be generated in 18 cycles. Current progress:
| Cycle | Topics | Estimated Lines | Status |
|---|---|---|---|
| Cycle 1 | SLO/SLI Fundamentals (1-2) | ~3,000 | Not Started |
| Cycle 2 | ATP SLO Catalog (3-4) | ~4,000 | Not Started |
| Cycle 3 | Error Budget Policies (5-6) | ~3,000 | Not Started |
| Cycle 4 | Golden Signals Framework (7-8) | ~4,000 | Not Started |
| Cycle 5 | Metrics Collection (9-10) | ~4,500 | Not Started |
| Cycle 6 | Prometheus Configuration (11-12) | ~4,000 | Not Started |
| Cycle 7 | Alert Rules Library (13-14) | ~5,000 | Not Started |
| Cycle 8 | Burn-Rate Alerts (15-16) | ~3,500 | Not Started |
| Cycle 9 | Alertmanager Routing (17-18) | ~3,500 | Not Started |
| Cycle 10 | Multi-Window SLO Monitoring (19-20) | ~3,000 | Not Started |
| Cycle 11 | Service-Specific Alerts (21-22) | ~5,000 | Not Started |
| Cycle 12 | Alert Inhibition & Grouping (23-24) | ~3,000 | Not Started |
| Cycle 13 | Dashboard Design (25-26) | ~4,000 | Not Started |
| Cycle 14 | Alert Fatigue Prevention (27-28) | ~3,000 | Not Started |
| Cycle 15 | On-Call Playbooks (29-30) | ~3,500 | Not Started |
| Cycle 16 | SLO Reporting & Reviews (31-32) | ~3,000 | Not Started |
| Cycle 17 | Testing & Validation (33-34) | ~2,500 | Not Started |
| Cycle 18 | Best Practices & Governance (35-36) | ~3,000 | Not Started |
Total Estimated Lines: ~64,500
Purpose & Scope¶
This document defines ATP's complete alerting and SLO strategy, covering Service Level Indicators (SLIs), Service Level Objectives (SLOs), error budgets, alert rules, golden signals, Prometheus/Alertmanager configuration, burn-rate monitoring, dashboard design, and operational excellence for reliable, observable, and incident-responsive service delivery.
Why SLOs & Alerting for ATP?
- Reliability: Define and measure what "healthy" means for each service
- Objectivity: Quantitative metrics replace subjective health assessments
- Prioritization: Focus engineering effort on what impacts users most
- Accountability: Error budgets balance innovation vs. stability
- Proactivity: Alerts detect issues before customers do
- Actionability: Every alert links to runbook with clear remediation
- Learning: SLO violations drive postmortems and improvements
- Compliance: Tamper-evidence and audit integrity SLOs for regulatory requirements
ATP Reliability Principles
1. Define SLOs for User Experience
- What do users care about? (latency, availability, correctness)
- NOT: CPU usage, memory, disk I/O (symptoms, not user impact)
2. Measure SLIs (Service Level Indicators)
- Actual measurements (request success rate, P95 latency, projection lag)
- High-quality, low-cardinality metrics
3. Set Error Budgets
- 99.9% SLO = 0.1% error budget = 43.2 min/month downtime
- Spend budget on innovation; preserve budget for launches
4. Alert on SLO Burn Rate (Not Static Thresholds)
- Multi-window burn (1h, 6h, 3d) detects fast/slow degradation
- Avoid alert fatigue from transient blips
5. Actionable Alerts Only
- Every alert → runbook link
- Every alert → clear owner
- Every alert → remediation steps
6. Continuous Improvement
- Postmortem every SLO violation
- Update SLOs quarterly
- Refine alert rules based on false positive rate
Key Concepts
- SLI (Service Level Indicator): Quantitative measurement (e.g., request success rate, P95 latency)
- SLO (Service Level Objective): Target value for SLI (e.g., 99.9% availability, P95 <200ms)
- SLA (Service Level Agreement): Customer-facing contractual commitment (≤ SLO, leaving a safety buffer)
- Error Budget: Acceptable failures within SLO (e.g., 0.1% for 99.9% SLO)
- Burn Rate: Rate at which error budget is consumed (fast burn = alert)
- Golden Signals: Latency, Errors, Traffic, Saturation (Google SRE model)
- Multi-Window Alerts: Combine short + long windows to detect fast/slow burns
Detailed Cycle Plan¶
CYCLE 1: SLO/SLI Fundamentals (~3,000 lines)¶
Topic 1: SLO/SLI Principles¶
What will be covered: - What are SLIs?
Service Level Indicator (SLI):
- Quantitative measurement of service behavior
- Expressed as a ratio or percentage
- Based on user experience (not infrastructure metrics)
Examples:
- Request success rate: (successful requests) / (total requests)
- Latency: Percentage of requests completing within threshold
- Availability: (uptime) / (total time)
- Freshness: Percentage of data within staleness threshold
What are SLOs?
Service Level Objective (SLO):
- Target value for an SLI
- Defines "healthy" vs. "unhealthy"
- Time-bound (rolling 30 days typical)
Examples:
- Availability: 99.9% (over 30 days)
- Latency: 95% of requests complete in <200ms
- Correctness: 100% of audit records have valid integrity proofs
- Freshness: 95% of projections lag <5 seconds
SLO vs. SLA
SLO (Internal Target):
- Engineering goal
- No penalties
- Drives error budget
- Example: 99.9% availability
SLA (Customer Contract):
- Legal commitment
- Financial penalties if breached
- Always ≤ SLO (buffer for safety)
- Example: 99.5% availability (with credits if breached)
Relationship: SLA (99.5%) ≤ SLO (99.9%), with a 0.4% buffer
Why SLOs Matter for ATP
User Experience:
- Users care about: Can I ingest events? Can I query them? Are they tamper-proof?
- Users don't care about: CPU usage, pod count, cache hit rate
Engineering Focus:
- SLOs guide prioritization
- Error budget allows innovation while protecting reliability
- Violations trigger postmortems and improvements
Compliance:
- Audit integrity SLOs (100% tamper-detection)
- Data retention SLOs (100% compliance)
- Privacy SLOs (100% PII redaction in logs)
Google SRE SLO Framework
1. Choose SLIs (what to measure)
   - Latency, availability, correctness, freshness
2. Set SLO Targets (what's acceptable)
   - 99.9%, 99%, 95% (depends on criticality)
3. Calculate Error Budget (room for failure)
   - 99.9% SLO = 0.1% error budget = 43.2 min/month
4. Alert on Burn Rate (rate of budget consumption)
   - Fast burn (1h window) + slow burn (6h window)
5. Report & Review (quarterly SLO review)
   - Adjust SLOs based on actual performance
   - Balance user expectations vs. cost
Code Examples: - SLI calculation formulas - SLO definition templates - Error budget computation
Diagrams: - SLO framework - SLI vs. SLO vs. SLA relationship - Error budget visualization
Deliverables: - SLO fundamentals guide - SLI selection criteria - Error budget policies
Topic 2: ATP SLI Taxonomy¶
What will be covered: - ATP Service Level Indicators (SLIs)
1. Request Success Rate
Formula: (successful requests) / (total requests)
Success: HTTP 2xx, 3xx
Failure: HTTP 5xx, network errors, timeouts
Scope: Per service (Gateway, Ingestion, Query)
2. Request Latency (Percentile)
Measurement: P50, P95, P99 latency
Threshold: <200ms (P95), <500ms (P99)
Scope: Per service, per endpoint
3. Projection Freshness (Lag)
Measurement: Event timestamp → projection updated timestamp
Threshold: P95 <5s, P99 <10s
Scope: Per projection type (timeline, actor, resource)
4. Availability (Uptime)
Measurement: (healthy time) / (total time)
Healthy: Health check returns 200 OK
Unhealthy: Health check fails or service unreachable
5. Correctness (Integrity)
Measurement: (records with valid proofs) / (total records)
Target: 100% (zero tolerance for integrity failures)
6. Data Durability
Measurement: (records successfully persisted) / (records ingested)
Target: 99.999999999% (11 nines)
7. Compliance Rate
Measurement: (compliant records) / (total records)
Compliant: Retention, residency, classification applied
Target: 100%
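Most of the SLIs above share one shape: a ratio of good events to total events. A minimal Python sketch of two of them (function names are illustrative, not ATP tooling):

```python
def success_rate(successful: int, total: int) -> float:
    """Request success rate SLI: successful / total (defined as 1.0 when there is no traffic)."""
    return successful / total if total else 1.0

def freshness_sli(lags_seconds: list[float], threshold: float = 5.0) -> float:
    """Projection freshness SLI: fraction of lag observations within the staleness threshold."""
    if not lags_seconds:
        return 1.0
    return sum(1 for lag in lags_seconds if lag <= threshold) / len(lags_seconds)

print(success_rate(999_000, 1_000_000))     # 0.999
print(freshness_sli([0.5, 1.2, 4.9, 6.0]))  # 0.75
```

The empty-input convention (no traffic = meeting the SLO) is a common choice but worth deciding explicitly per SLI.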
- SLI Categories by Impact
Code Examples: - SLI calculation queries (Prometheus/KQL) - SLI recording rules - SLI dashboards
Diagrams: - SLI taxonomy - User-facing vs. internal SLIs - SLI measurement flow
Deliverables: - Complete SLI catalog - Calculation formulas - Measurement procedures
CYCLE 2: ATP SLO Catalog (~4,000 lines)¶
Topic 3: Service-Level SLOs¶
What will be covered: - ATP SLO Definitions by Service
# Gateway Service
service: gateway
slos:
- name: availability
sli: request_success_rate
target: 99.95%
window: 30d
error_budget: 21.6 min/month
- name: latency_p95
sli: http_request_duration_p95
target: 200ms
window: 30d
- name: auth_success
sli: authentication_success_rate
target: 99.99%
window: 30d
# Ingestion Service
service: ingestion
slos:
- name: availability
sli: ingest_success_rate
target: 99.9%
window: 30d
error_budget: 43.2 min/month
- name: latency_p95
sli: ingest_duration_p95
target: 500ms
window: 30d
- name: outbox_lag_p95
sli: outbox_relay_latency_p95
target: 5s
window: 1h
- name: data_durability
sli: persist_success_rate
target: 99.999999999% # 11 nines
window: 30d
# Query Service
service: query
slos:
- name: availability
sli: query_success_rate
target: 99.9%
window: 30d
error_budget: 43.2 min/month
- name: latency_p95
sli: query_duration_p95
target: 200ms
window: 30d
- name: latency_p99
sli: query_duration_p99
target: 500ms
window: 30d
- name: projection_freshness
sli: projection_lag_p95
target: 10s
window: 1h
# Projection Service
service: projection
slos:
- name: projection_lag_p95
sli: projector_lag_seconds_p95
target: 5s
window: 1h
- name: projection_lag_p99
sli: projector_lag_seconds_p99
target: 10s
window: 1h
- name: consumer_success_rate
sli: projection_handler_success_rate
target: 99.99%
window: 30d
- name: dlq_rate
sli: dlq_messages_per_million
target: <100 ppm
window: 24h
# Export Service
service: export
slos:
- name: availability
sli: export_success_rate
target: 99.5%
window: 30d
error_budget: 3.6 hours/month
- name: ttfb_p95
sli: export_time_to_first_byte_p95
target: 30s
window: 1h
- name: completion_p95
sli: export_completion_time_p95
target: 5min
window: 24h
# Integrity Service
service: integrity
slos:
- name: correctness
sli: integrity_verification_success_rate
target: 100% # Zero tolerance
window: 30d
- name: tamper_detection
sli: tamper_detection_rate
target: 100% # Must detect all tampering
window: 30d
- name: seal_latency_p95
sli: hash_chain_seal_latency_p95
target: 2s
window: 1h
- name: kms_availability
sli: kms_request_success_rate
target: 99.99%
window: 30d
# Policy Service
service: policy
slos:
- name: decision_latency_p95
sli: policy_decision_latency_p95
target: 50ms
window: 1h
- name: cache_hit_rate
sli: policy_cache_hit_ratio
target: 95%
window: 1h
# Search Service (Optional)
service: search
slos:
- name: search_latency_p95
sli: search_query_latency_p95
target: 500ms
window: 1h
- name: indexer_lag_p95
sli: search_indexer_lag_seconds_p95
target: 10s
window: 1h
# Admin Service
service: admin
slos:
- name: availability
sli: admin_api_success_rate
target: 99%
window: 30d
error_budget: 7.2 hours/month
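Each `error_budget` value in the catalog above follows directly from the target and the 30-day window. A quick sketch of the arithmetic (plain Python, not ATP tooling), checked against the catalog's own numbers:

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime per window: (1 - target) x window length in minutes."""
    return (1 - slo_target) * window_days * 24 * 60

print(round(error_budget_minutes(0.9995), 1))      # 21.6 min  (Gateway availability)
print(round(error_budget_minutes(0.999), 1))       # 43.2 min  (Ingestion/Query availability)
print(round(error_budget_minutes(0.995) / 60, 1))  # 3.6 hours (Export availability)
print(round(error_budget_minutes(0.99) / 60, 1))   # 7.2 hours (Admin availability)
```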
- SLO Priority Tiers
Tier 1 (Critical - Zero Tolerance):
- Integrity correctness: 100%
- Tamper detection: 100%
- Data durability: 99.999999999%
- Compliance rate: 100%
Impact: Any violation = SEV-1 incident

Tier 2 (High - Strict SLOs):
- Gateway availability: 99.95%
- Ingestion availability: 99.9%
- Query availability: 99.9%
Impact: Violations = SEV-2 incident, customer impact

Tier 3 (Medium - Standard SLOs):
- Export availability: 99.5%
- Projection lag: P95 <5s
- Search latency: P95 <500ms
Impact: Violations = SEV-3, internal impact

Tier 4 (Low - Best Effort):
- Admin console availability: 99%
- Batch job success rate: 95%
Impact: Violations tracked, no immediate incident
Code Examples: - Complete SLO definitions (YAML configuration) - SLI query examples (PromQL, KQL) - SLO calculation scripts
Diagrams: - SLO hierarchy - Service SLO matrix - Tier classification
Deliverables: - Complete ATP SLO catalog - SLI definitions - Priority matrix
Topic 4: Cross-Cutting SLOs¶
What will be covered: - Platform-Wide SLOs
End-to-End Latency (Ingest → Query):
- SLI: Time from event ingestion to queryable
- SLO: P95 <10s, P99 <30s
- Spans: Ingestion + Outbox + Projection + Index
Tenant Isolation Integrity:
- SLI: Cross-tenant query leak rate
- SLO: 0 (zero tolerance)
- Validation: Automated tests, RLS verification
PII Redaction Compliance:
- SLI: PII leaked in logs/metrics
- SLO: 0 (zero tolerance)
- Validation: Log scanning, compliance audits
Secret Management:
- SLI: Secrets in Key Vault (not env vars/code)
- SLO: 100%
- Validation: Secret scanning in CI/CD
- Composite SLOs
- Multiple SLIs combined (AND/OR logic)
- User journey SLOs (multi-service)
Code Examples: - Cross-cutting SLO definitions - Composite calculations - Journey mapping
Diagrams: - Platform SLO dependencies - Journey SLO flow
Deliverables: - Platform SLO catalog - Composite SLO guide - Journey mapping
CYCLE 3: Error Budget Policies (~3,000 lines)¶
Topic 5: Error Budget Fundamentals¶
What will be covered: - What is an Error Budget?
Error Budget = Acceptable failure within SLO
Formula:
Error Budget = (1 - SLO) × Total Requests (or Time)
Examples:
99.9% Availability SLO (30 days):
- Allowed downtime = (1 - 0.999) × 43200 min
- Error budget = 43.2 minutes/month
99% Request Success Rate (1M requests/month):
- Allowed failures = (1 - 0.99) × 1,000,000
- Error budget = 10,000 failed requests/month
P95 Latency <200ms:
- Allowed slow requests = 5% of total
- Error budget = 50,000 slow requests/1M total
Error Budget Consumption
Budget Consumed By:
- Service outages (downtime)
- 5xx errors (server failures)
- Slow requests (exceed latency threshold)
- Projection lag (data staleness)
- DLQ messages (processing failures)
Budget NOT Consumed By:
- 4xx errors (client errors)
- Maintenance windows (scheduled)
- Load test traffic (excluded)
- Health check failures (synthetic probes)
Error Budget Policies
Budget Status: Healthy (0-50% consumed)
Action: Normal operations
- Continue feature development
- Normal deployment cadence
- Standard testing

Budget Status: Warning (50-75% consumed)
Action: Slow down, focus on reliability
- Freeze non-critical features
- Increase test coverage
- Review recent changes
- Add monitoring/alerts

Budget Status: Critical (75-100% consumed)
Action: Reliability freeze
- Stop all feature work
- All hands on stability
- Root cause analysis for all SLO violations
- Add redundancy/failover

Budget Status: Exhausted (>100% consumed)
Action: Incident mode
- Declare SEV-1 incident
- Engineering manager + VP engaged
- Customer communication
- Postmortem with action items
- Process review (why did we miss this?)
Budget Tracking
# Error budget remaining (gauge)
error_budget_remaining{service="ingestion", slo="availability"} 0.82
# → 82% of budget remaining (18% consumed)

# Burn rate (gauge, normalized to 1.0 = perfect burn)
error_budget_burn_rate{service="ingestion", window="1h"} 5.2
# → Consuming budget 5.2x faster than sustainable
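The burn-rate gauge can be derived from the observed error rate and the SLO target: burn rate = error rate / (1 − target). A small sketch of that derivation (illustrative, not ATP code), reproducing the 5.2x example gauge value:

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than sustainable the error budget is being consumed.
    1.0 means the budget lasts exactly the SLO window; >1.0 means early exhaustion."""
    return error_rate / (1 - slo_target)

# A 0.52% observed error rate against a 99.9% SLO burns 5.2x too fast.
print(round(burn_rate(0.0052, 0.999), 1))  # 5.2
```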
Code Examples: - Error budget calculation - Budget tracking metrics - Budget consumption queries
Diagrams: - Error budget lifecycle - Budget consumption timeline - Budget policy decision tree
Deliverables: - Error budget policies - Tracking implementation - Policy enforcement procedures
Topic 6: Error Budget-Driven Development¶
What will be covered: - Using Error Budget for Release Decisions
Pre-Release Check:
✅ SAFE TO RELEASE:
- Error budget >80% remaining
- No recent SLO violations (last 7 days)
- All tests passed
- Canary deployment planned
⚠️ CAUTION - EVALUATE CAREFULLY:
- Error budget 50-80% remaining
- Minor SLO violations in last 7 days
- Consider smaller canary (5% vs. 20%)
- Extended monitoring period
❌ DO NOT RELEASE:
- Error budget <50% remaining
- Recent SEV-1/SEV-2 incidents
- SLO violations in last 24 hours
- Focus on stability, not features
🚨 RELIABILITY FREEZE:
- Error budget <25% remaining
- Only hotfixes and reliability improvements
- VP Engineering approval required for any change
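The four bands above reduce to a threshold lookup on remaining budget. A minimal sketch of such a release gate (band cutoffs taken from the list; the function name and a real gate's extra inputs, such as recent incidents and SLO violations, are assumptions):

```python
def release_decision(budget_remaining: float) -> str:
    """Map remaining error budget (0.0-1.0) to the release bands above.
    A production gate would also consider recent incidents and SLO violations."""
    if budget_remaining < 0.25:
        return "RELIABILITY_FREEZE"   # hotfixes only, VP approval required
    if budget_remaining < 0.50:
        return "DO_NOT_RELEASE"       # focus on stability, not features
    if budget_remaining < 0.80:
        return "CAUTION"              # smaller canary, extended monitoring
    return "SAFE"                     # normal release process

print(release_decision(0.82))  # SAFE
print(release_decision(0.60))  # CAUTION
print(release_decision(0.30))  # DO_NOT_RELEASE
print(release_decision(0.10))  # RELIABILITY_FREEZE
```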
- Error Budget Review Meetings
- Weekly review for all services
- Quarterly SLO target review
- Postmortem action item tracking
Code Examples: - Budget decision automation - Release gate integration - Review meeting templates
Diagrams: - Budget-driven decision flow - Release gates
Deliverables: - Budget-driven development guide - Release decision criteria - Review procedures
CYCLE 4: Golden Signals Framework (~4,000 lines)¶
Topic 7: Google SRE Golden Signals¶
What will be covered: - Four Golden Signals
1. Latency
- How long does it take to service a request?
- Measure: P50, P95, P99 (histograms)
- Split by: Success vs. Error (errors often faster!)
2. Traffic (Throughput)
- How much demand is the system handling?
- Measure: Requests/second, events/second
- Split by: Endpoint, tenant class, region
3. Errors
- What is the rate of failed requests?
- Measure: Error rate (5xx / total), exception count
- Split by: Error type, endpoint
4. Saturation
- How "full" is the system?
- Measure: CPU %, memory %, queue depth, connection pool usage
- Alert before hitting limits (e.g., 80% threshold)
- ATP Golden Signals by Service
Gateway:
- Latency: http_request_duration_seconds{route,status} (histogram)
- Traffic: http_requests_total{route,status} (counter)
- Errors: http_requests_total{status=~"5.."} (counter)
- Saturation: http_server_active_requests (gauge)

Ingestion:
- Latency: ingest_append_latency_seconds (histogram)
- Traffic: ingest_requests_total{result} (counter)
- Errors: ingest_errors_total{reason} (counter)
- Saturation: outbox_pending_events (gauge), db_connection_pool_active (gauge)

Query:
- Latency: query_duration_seconds{route} (histogram)
- Traffic: query_requests_total{route} (counter)
- Errors: query_errors_total{route,reason} (counter)
- Saturation: query_cache_size_bytes (gauge), query_concurrent_requests (gauge)

Projection:
- Latency: projection_handler_duration_seconds{model} (histogram)
- Traffic: projection_events_processed_total{model} (counter)
- Errors: projection_errors_total{model,reason} (counter)
- Saturation: projection_lag_seconds{model} (gauge), consumer_queue_depth{subscription} (gauge)

Export:
- Latency: export_job_duration_seconds (histogram)
- Traffic: export_jobs_total{result} (counter)
- Errors: export_jobs_failed_total{reason} (counter)
- Saturation: export_queue_depth (gauge), export_bandwidth_usage_mbps (gauge)

Integrity:
- Latency: integrity_verify_latency_seconds{target} (histogram)
- Traffic: integrity_operations_total{operation} (counter)
- Errors: integrity_verification_failures_total{reason} (counter)
- Saturation: integrity_pending_verifications (gauge), kms_request_queue_depth (gauge)

Policy:
- Latency: policy_decision_latency_seconds{mode} (histogram)
- Traffic: policy_decisions_total{mode,result} (counter)
- Errors: policy_errors_total{reason} (counter)
- Saturation: policy_cache_size_bytes (gauge), policy_cache_hit_ratio (gauge)
Code Examples: - Complete golden signals implementation (all services) - Metric instrumentation code - Query examples
Diagrams: - Golden signals framework - Signal hierarchy
Deliverables: - Golden signals catalog - Instrumentation guide - Query library
Topic 8: ATP-Specific Signals¶
What will be covered: - Audit Trail-Specific Metrics
Tamper Detection:
- tamper_scans_total{result}
- tamper_anomalies_total{type,severity}
- hash_chain_verification_failures_total
Compliance:
- retention_policy_violations_total{tenant}
- data_residency_violations_total{tenant,region}
- pii_redaction_failures_total
- legal_hold_active_count
Multi-Tenancy:
- tenant_isolation_tests_total{result}
- cross_tenant_query_attempts_total (should be 0)
- tenant_quota_exceeded_total{tenant,resource}
Data Lifecycle:
- records_ingested_total{tenant,classification}
- records_archived_total{tenant,tier}
- records_purged_total{tenant,reason}
- records_exported_total{tenant,format}
Code Examples: - ATP-specific metrics - Compliance monitoring - Security metrics
Deliverables: - ATP signals catalog - Compliance metrics - Security dashboards
CYCLE 5: Metrics Collection (~4,500 lines)¶
Topic 9: OpenTelemetry Metrics¶
What will be covered: - OTel Metrics API
// Service instrumentation (C#)
public class IngestionMetrics
{
private readonly Meter _meter;
private readonly Counter<long> _ingestRequestsTotal;
private readonly Histogram<double> _ingestLatencySeconds;
private readonly ObservableGauge<long> _outboxPendingEvents;
private readonly IOutboxRepository _outboxRepository; // assumed repository abstraction for outbox counts
public IngestionMetrics(IMeterFactory meterFactory, IOutboxRepository outboxRepository)
{
_outboxRepository = outboxRepository;
_meter = meterFactory.Create("ATP.Ingestion", "1.0.0");
// Counter: Total ingestion requests
_ingestRequestsTotal = _meter.CreateCounter<long>(
name: "ingest.requests.total",
unit: "requests",
description: "Total number of ingestion requests");
// Histogram: Ingestion latency
_ingestLatencySeconds = _meter.CreateHistogram<double>(
name: "ingest.latency.seconds",
unit: "s",
description: "Ingestion request duration");
// Gauge: Outbox pending events
_outboxPendingEvents = _meter.CreateObservableGauge<long>(
name: "outbox.pending.events",
observeValue: () => GetOutboxPendingCount(),
unit: "events",
description: "Number of pending outbox events");
}
public void RecordIngestionRequest(string result, string tenantClass)
{
_ingestRequestsTotal.Add(1,
new KeyValuePair<string, object>("result", result),
new KeyValuePair<string, object>("tenant_class", tenantClass));
}
public void RecordIngestionLatency(double durationSeconds, string result)
{
_ingestLatencySeconds.Record(durationSeconds,
new KeyValuePair<string, object>("result", result));
}
private long GetOutboxPendingCount()
{
// Query outbox repository
return _outboxRepository.GetPendingCount();
}
}
Metric Naming Conventions
Pattern: {namespace}.{subsystem}.{name}.{unit}
Examples:
- atp.ingest.requests.total (counter)
- atp.ingest.latency.seconds (histogram)
- atp.query.cache.hit.ratio (gauge)
- atp.projection.lag.seconds (gauge)
Labels (low cardinality):
- service (gateway, ingestion, query, ...)
- route (/api/v1/ingest, /api/v1/query, ...)
- result (success, failure, timeout, ...)
- tenant_class (small, medium, large, enterprise)
- region (us-east, eu-west, il-central)
Avoid high-cardinality labels:
- ❌ tenant_id (thousands of values)
- ❌ user_id (millions of values)
- ❌ trace_id (unique per request)
- ✅ tenant_class (small set of buckets)
Histogram Buckets
// Latency histogram buckets (seconds)
var latencyBuckets = new double[]
{
    0.001, 0.005, 0.01, 0.025, 0.05,  // 1ms, 5ms, 10ms, 25ms, 50ms
    0.1, 0.25, 0.5, 1.0, 2.5,         // 100ms, 250ms, 500ms, 1s, 2.5s
    5.0, 10.0, 30.0, 60.0             // 5s, 10s, 30s, 60s
};

// Request size histogram buckets (bytes)
var sizeBuckets = new double[]
{
    1024, 4096, 16384, 65536,         // 1KB, 4KB, 16KB, 64KB
    262144, 1048576, 4194304,         // 256KB, 1MB, 4MB
    16777216, 67108864                // 16MB, 64MB
};
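Percentiles are estimated from these buckets by linear interpolation over cumulative counts, which is essentially what Prometheus's `histogram_quantile` does. A simplified Python sketch of that interpolation (assumes cumulative bucket counts, as Prometheus exposes them; not the exact Prometheus implementation):

```python
def histogram_quantile(q, bounds, cumulative_counts):
    """Estimate the q-quantile from cumulative histogram bucket counts by
    linear interpolation within the bucket where the target rank falls."""
    total = cumulative_counts[-1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in zip(bounds, cumulative_counts):
        if rank <= count:
            if count == prev_count:
                return bound
            # interpolate between the bucket's lower and upper bound
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return bounds[-1]

bounds = [0.05, 0.1, 0.25, 0.5]
cum = [600, 900, 990, 1000]  # 60% under 50ms, 90% under 100ms, 99% under 250ms
print(round(histogram_quantile(0.95, bounds, cum), 4))  # 0.1833 (~183ms P95)
```

This is why bucket boundaries matter: the estimate is only as precise as the bucket containing the target percentile.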
Code Examples: - Complete metrics instrumentation (all services) - Metric naming standards - Histogram configuration
Diagrams: - Metrics collection flow - OTel pipeline
Deliverables: - Metrics instrumentation guide - Naming conventions - Histogram configurations
Topic 10: Metrics Export & Storage¶
What will be covered: - OTel Collector Configuration
# otel-collector-config.yaml
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
prometheus:
config:
scrape_configs:
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
processors:
batch:
timeout: 10s
send_batch_size: 1024
resource:
attributes:
- key: cloud.provider
value: azure
action: insert
- key: deployment.environment
from_attribute: ENVIRONMENT
action: insert
exporters:
azuremonitor:
connection_string: "${APPINSIGHTS_CONNECTION_STRING}"
prometheusremotewrite:
endpoint: https://prometheus.atp.example.com/api/v1/write
auth:
authenticator: bearertokenauth
service:
pipelines:
metrics:
receivers: [otlp, prometheus]
processors: [batch, resource]
exporters: [azuremonitor, prometheusremotewrite]
- Prometheus Storage
- Azure Monitor Metrics
- Retention Policies
Code Examples: - Collector configuration - Storage setup - Retention policies
Deliverables: - Collection configuration - Storage setup - Retention guide
CYCLE 6: Prometheus Configuration (~4,000 lines)¶
Topic 11: Prometheus Setup¶
What will be covered: - Prometheus Deployment (Kubernetes)
# Prometheus StatefulSet
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: prometheus
namespace: monitoring
spec:
serviceName: prometheus
replicas: 2
selector:
matchLabels:
app: prometheus
template:
metadata:
labels:
app: prometheus
spec:
serviceAccountName: prometheus
containers:
- name: prometheus
image: prom/prometheus:v2.48.0
args:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--storage.tsdb.retention.time=30d'
- '--storage.tsdb.retention.size=50GB'
- '--web.enable-lifecycle'
ports:
- containerPort: 9090
volumeMounts:
- name: config
mountPath: /etc/prometheus
- name: storage
mountPath: /prometheus
volumes:
- name: config
configMap:
name: prometheus-config
volumeClaimTemplates:
- metadata:
name: storage
spec:
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 100Gi
- Prometheus Configuration
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: atp-prod-useast
    environment: production

# Alertmanager integration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Alert rule files
rule_files:
  - /etc/prometheus/rules/atp-*.yml

# Scrape configs
scrape_configs:
  # Kubernetes pods with prometheus.io/scrape annotation
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__

  # OTel Collector
  - job_name: 'otel-collector'
    static_configs:
      - targets: ['otel-collector:8889']

  # Kubernetes nodes
  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs:
      - role: node

  # Azure Monitor (if using federated metrics)
  - job_name: 'azure-monitor'
    static_configs:
      - targets: ['azure-monitor-exporter:9090']
Code Examples: - Complete Prometheus setup - Scrape configurations - Service discovery
Diagrams: - Prometheus architecture - Scrape flow
Deliverables: - Prometheus deployment guide - Configuration templates - Service discovery setup
Topic 12: Recording Rules¶
What will be covered: - Pre-Aggregation with Recording Rules
# Prometheus recording rules
groups:
- name: atp_sli_recordings
interval: 30s
rules:
# Ingestion success rate (5m window)
- record: sli:ingest_success_rate:5m
expr: |
sum(rate(ingest_requests_total{result="success"}[5m]))
/
sum(rate(ingest_requests_total[5m]))
# Ingestion P95 latency (5m window)
- record: sli:ingest_latency_p95:5m
expr: |
histogram_quantile(0.95,
sum(rate(ingest_latency_seconds_bucket[5m])) by (le)
)
# Query success rate (5m window)
- record: sli:query_success_rate:5m
expr: |
sum(rate(query_requests_total{status=~"2.."}[5m]))
/
sum(rate(query_requests_total[5m]))
# Projection lag (current)
- record: sli:projection_lag_seconds:current
expr: |
max(projection_lag_seconds) by (model, tenant_class, region)
# Error budget remaining (30d window)
- record: slo:error_budget_remaining:30d
expr: |
1 - (
(1 - sli:ingest_success_rate:30d) / (1 - 0.999)
)
Code Examples: - Recording rule library - SLI aggregations - Error budget calculations
Deliverables: - Recording rules catalog - Aggregation guide
CYCLE 7: Alert Rules Library (~5,000 lines)¶
Topic 13: Alert Rule Structure¶
What will be covered: - Prometheus Alert Rule Anatomy
groups:
- name: atp_ingestion_alerts
interval: 30s
rules:
# High Error Rate
- alert: IngestionHighErrorRate
expr: |
(
sum(rate(ingest_errors_total[5m])) by (region)
/
sum(rate(ingest_requests_total[5m])) by (region)
) > 0.05
for: 5m
labels:
severity: critical
service: ingestion
team: platform
tier: tier2
annotations:
summary: "High ingestion error rate in {{ $labels.region }}"
description: "Ingestion error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
runbook: "https://runbooks.atp.example.com/ingestion-high-error-rate"
dashboard: "https://grafana.atp.example.com/d/ingestion"
impact: "Users unable to ingest audit events"
action: "Check database connectivity, Service Bus health, and recent deployments"
- Alert Rule Best Practices
1. Clear Naming
   - Pattern: {Service}{Condition}{Metric}
   - Example: IngestionHighErrorRate, QueryHighLatency
2. Appropriate "for" Duration
   - Transient spikes: 5-10 minutes
   - Persistent issues: 15-30 minutes
   - Avoid: <1 minute (too noisy)
3. Meaningful Labels
   - severity: critical | warning | info
   - service: Name of affected service
   - team: Owner team
   - tier: SLO tier (tier1, tier2, tier3)
4. Actionable Annotations
   - summary: One-line description
   - description: Details with context
   - runbook: Link to remediation steps
   - dashboard: Link to relevant dashboard
   - impact: User/business impact
   - action: Immediate next steps
5. Threshold Selection
   - Based on SLO targets (not arbitrary)
   - Supported by historical data
   - Reviewed quarterly
Code Examples: - Alert rule templates - Complete alert library (50+ rules) - Best practice examples
Diagrams: - Alert rule structure - Alert lifecycle
Deliverables: - Alert rule library (all ATP services) - Naming conventions - Template guide
Topic 14: ATP Alert Catalog¶
What will be covered: - Complete Alert Inventory
# TIER 1 ALERTS (Critical - SEV-1)
- alert: IntegrityVerificationFailure
expr: integrity_verification_failures_total > 0
for: 1m
labels:
severity: critical
tier: tier1
annotations:
summary: "Integrity verification failure detected"
runbook: "runbooks/tamper-investigation"
- alert: TamperDetected
expr: tamper_anomalies_total{severity="critical"} > 0
for: 1m
labels:
severity: critical
tier: tier1
- alert: CrossTenantDataLeak
expr: cross_tenant_query_attempts_total > 0
for: 1m
labels:
severity: critical
tier: tier1
- alert: ComplianceViolation
expr: |
retention_policy_violations_total > 0
OR data_residency_violations_total > 0
for: 5m
labels:
severity: critical
tier: tier1
# TIER 2 ALERTS (High - SEV-2)
- alert: GatewayDown
expr: up{service="gateway"} == 0
for: 2m
labels:
severity: critical
tier: tier2
- alert: IngestionHighErrorRate
expr: |
(
sum(rate(ingest_errors_total[5m]))
/
sum(rate(ingest_requests_total[5m]))
) > 0.05
for: 5m
labels:
severity: critical
tier: tier2
- alert: QueryHighLatency
expr: |
histogram_quantile(0.95,
sum(rate(query_latency_seconds_bucket[5m])) by (le)
) > 0.5
for: 10m
labels:
severity: warning
tier: tier2
- alert: ProjectionLagHigh
expr: projection_lag_seconds > 30
for: 10m
labels:
severity: warning
tier: tier2
# TIER 3 ALERTS (Medium - SEV-3)
- alert: DLQBacklogGrowing
expr: dlq_depth{subscription=~".*"} > 100
for: 30m
labels:
severity: warning
tier: tier3
- alert: CacheHitRateLow
expr: query_cache_hit_ratio < 0.7
for: 1h
labels:
severity: info
tier: tier3
(50+ alerts documented with thresholds, durations, labels, runbooks)
Code Examples: - Complete alert rule files (all services) - Alert categorization - Threshold rationale
Deliverables: - Complete alert catalog - Alert rule files - Categorization guide
CYCLE 8: Burn-Rate Alerts (~3,500 lines)¶
Topic 15: Multi-Window Burn Rate¶
What will be covered: - Burn Rate Concept
Burn Rate = Rate of error budget consumption
Perfect Burn Rate = 1.0
- Consuming budget at sustainable rate
- Will reach 100% budget consumed at end of SLO window
Fast Burn Rate > 1.0
- Consuming budget faster than sustainable
- Will exhaust budget before end of window
- Example: Burn rate of 14.4 = 2% of a 30-day budget consumed in 1 hour; the entire budget would be exhausted in ~50 hours
Slow Burn Rate < 1.0
- Consuming budget slower than expected
- Good! Under budget
Multi-Window Burn Rate Alerts
# Google SRE Multi-Window Burn Rate
# 99.9% SLO = 0.1% error budget over 30 days
groups:
  - name: atp_burn_rate_alerts
    rules:
      # Page (2% budget in 1 hour = very fast burn)
      - alert: IngestionErrorBudgetBurnFast
        expr: |
          (
            sum(rate(ingest_errors_total[1h]))
            /
            sum(rate(ingest_requests_total[1h]))
          ) > (0.001 * 14.4)
          AND
          (
            sum(rate(ingest_errors_total[5m]))
            /
            sum(rate(ingest_requests_total[5m]))
          ) > (0.001 * 14.4)
        for: 2m
        labels:
          severity: critical
          tier: tier2
          alert_type: burn_rate
          window: fast
        annotations:
          summary: "Ingestion error budget burning fast"
          description: "At current rate, will consume 2% of monthly budget in 1 hour"
          runbook: "runbooks/burn-rate-response"

      # Ticket (5% budget in 6 hours = moderate burn)
      - alert: IngestionErrorBudgetBurnModerate
        expr: |
          (
            sum(rate(ingest_errors_total[6h]))
            /
            sum(rate(ingest_requests_total[6h]))
          ) > (0.001 * 6)
          AND
          (
            sum(rate(ingest_errors_total[30m]))
            /
            sum(rate(ingest_requests_total[30m]))
          ) > (0.001 * 6)
        for: 15m
        labels:
          severity: warning
          tier: tier2
          alert_type: burn_rate
          window: moderate
        annotations:
          summary: "Ingestion error budget burning at moderate rate"

      # Warning (10% budget in 3 days = slow burn)
      - alert: IngestionErrorBudgetBurnSlow
        expr: |
          (
            sum(rate(ingest_errors_total[3d]))
            /
            sum(rate(ingest_requests_total[3d]))
          ) > (0.001 * 1.0)
          AND
          (
            sum(rate(ingest_errors_total[6h]))
            /
            sum(rate(ingest_requests_total[6h]))
          ) > (0.001 * 1.0)
        for: 1h
        labels:
          severity: info
          tier: tier2
          alert_type: burn_rate
          window: slow
Burn Rate Thresholds (Google SRE)

| Alert Severity | Budget Consumed | Time Window | Burn Rate Multiple |
|----------------|-----------------|-------------|--------------------|
| Page           | 2% in 1 hour    | 1h + 5m     | 14.4x              |
| Ticket         | 5% in 6 hours   | 6h + 30m    | 6x                 |
| Warning        | 10% in 3 days   | 3d + 6h     | 1x                 |
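All three multiples in the table fall out of one formula: burn rate = (budget fraction consumed / alert window) ÷ (1 / SLO window). A sketch (function name is illustrative):

```python
def burn_rate_multiple(budget_fraction: float, window_hours: float,
                       slo_window_hours: float = 30 * 24) -> float:
    """Burn rate required to consume `budget_fraction` of the error budget
    within `window_hours`, relative to the sustainable rate for the SLO window."""
    return (budget_fraction / window_hours) * slo_window_hours

print(burn_rate_multiple(0.02, 1))       # page:   14.4x (2% in 1 hour)
print(burn_rate_multiple(0.05, 6))       # ticket:  6x   (5% in 6 hours)
print(burn_rate_multiple(0.10, 3 * 24))  # warning: 1x   (10% in 3 days)
```

Multiply each result by the SLO's error budget (0.001 for 99.9%) to get the error-rate threshold used in the alert expressions above.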
Code Examples: - Burn rate alert rules (all services) - Threshold calculations - Multi-window logic
Diagrams: - Burn rate visualization - Multi-window detection - Alert severity mapping
Deliverables: - Burn rate alert library - Threshold guide - Configuration templates
Topic 16: Latency Burn Rate¶
What will be covered: - Latency-Based Error Budgets
SLO: 95% of requests complete in <200ms
Error Budget: 5% can exceed 200ms
Burn Rate:
- Measure: % of requests >200ms
- Target: ≤5%
- Fast burn: >10% exceeding (2x budget rate)
- Latency Burn Alert
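Latency burn works like error-rate burn, but the "bad event" is a request over the threshold. A minimal sketch under the 95%/200ms SLO above (names and figures are illustrative):

```python
def latency_burn_rate(total_requests: int, slow_requests: int,
                      budget_fraction: float = 0.05) -> float:
    """Burn rate for a latency SLO: the observed fraction of requests over
    the latency threshold, divided by the fraction the budget allows (5%)."""
    if total_requests == 0:
        return 0.0
    return (slow_requests / total_requests) / budget_fraction

# 10% of requests over 200ms against a 5% budget = burning at exactly 2x
rate = latency_burn_rate(total_requests=10_000, slow_requests=1_000)
print(rate)        # 2.0
print(rate > 2.0)  # False; only >2x would count as a fast burn
```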
Code Examples: - Latency burn rate rules - Quantile calculations
Deliverables: - Latency budget guide - Alert configurations
CYCLE 9: Alertmanager Routing (~3,500 lines)¶
Topic 17: Alertmanager Configuration¶
What will be covered: - Alertmanager Setup
```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m
  # PagerDuty integration
  pagerduty_url: https://events.pagerduty.com/v2/enqueue
  # Slack integration
  slack_api_url: https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXX

# Alert routing tree
route:
  group_by: ['alertname', 'service', 'region']
  group_wait: 10s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    # Critical alerts (SEV-1) → PagerDuty + Slack
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      routes:
        - match:
            tier: tier1
          receiver: 'pagerduty-tier1'
          group_wait: 0s
          repeat_interval: 1h
    # Warning alerts (SEV-2/SEV-3) → Slack only
    - match:
        severity: warning
      receiver: 'slack-ops'
    # Info alerts → Slack low-priority
    - match:
        severity: info
      receiver: 'slack-notifications'
      group_interval: 1h
      repeat_interval: 24h

# Inhibition rules (suppress dependent alerts)
inhibit_rules:
  # If a service is down, suppress all its latency/error alerts
  - source_match:
      alertname: 'ServiceDown'
    target_match_re:
      alertname: '(HighLatency|HighErrorRate).*'
    equal: ['service', 'region']
  # If the database is down, suppress all service database alerts
  - source_match:
      alertname: 'DatabaseDown'
    target_match_re:
      alertname: '.*DatabaseConnection.*'
    equal: ['region']

# Receivers
receivers:
  - name: 'default'
    slack_configs:
      - channel: '#atp-alerts-default'
        title: 'ATP Alert: {{ .GroupLabels.alertname }}'
  - name: 'pagerduty-tier1'
    pagerduty_configs:
      - service_key: '<tier1-integration-key>'
        severity: '{{ .GroupLabels.severity }}'
        description: '{{ .CommonAnnotations.summary }}'
        details:
          firing: '{{ template "pagerduty.default.instances" . }}'
          num_firing: '{{ .Alerts.Firing | len }}'
          num_resolved: '{{ .Alerts.Resolved | len }}'
        client: 'ATP Alertmanager'
        client_url: '{{ .CommonAnnotations.dashboard }}'
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: '<critical-integration-key>'
  - name: 'slack-ops'
    slack_configs:
      - channel: '#atp-ops'
        title: '⚠️ {{ .GroupLabels.alertname }}'
        text: '{{ .CommonAnnotations.description }}'
        actions:
          - type: button
            text: 'View Runbook'
            url: '{{ .CommonAnnotations.runbook }}'
          - type: button
            text: 'View Dashboard'
            url: '{{ .CommonAnnotations.dashboard }}'
  - name: 'slack-notifications'
    slack_configs:
      - channel: '#atp-notifications'
        title: 'ℹ️ {{ .GroupLabels.alertname }}'
```
Code Examples: - Complete Alertmanager configuration - Routing logic - Receiver configurations
Diagrams: - Alert routing tree - Inhibition flow - Multi-receiver routing
Deliverables: - Alertmanager setup guide - Routing configuration - Receiver templates
Topic 18: Alert Destinations¶
What will be covered: - PagerDuty Integration - Service keys (per team/severity) - Escalation policies - Incident auto-creation - Auto-resolve on recovery
- Slack Integration
  - Channel routing (#atp-incidents, #atp-ops, #atp-notifications)
  - Alert formatting
  - Interactive buttons (runbook, dashboard, acknowledge)
  - Thread-based updates
- Jira Integration
  - Auto-create tickets for SEV-2+
  - Link to PagerDuty incident
  - Auto-close on resolution
- Webhook Integration
  - Custom automation (auto-remediation)
  - Runbook execution
  - External SIEM (Splunk, Azure Sentinel)
Code Examples: - Integration configurations - Webhook templates - Auto-remediation scripts
Diagrams: - Destination integration - Alert flow
Deliverables: - Integration guide - Configuration templates
CYCLE 10: Multi-Window SLO Monitoring (~3,000 lines)¶
Topic 19: Multi-Window Burn Rate Implementation¶
What will be covered: - Why Multi-Window?
Single Window Problem:
- Short window (5m): Noisy, alerts on transient blips
- Long window (30d): Slow, misses fast-burning issues
Multi-Window Solution:
- Combine short + long windows
- Short window detects active problem
- Long window confirms sustained burn
- Both must fire to alert (AND logic)
Example:
Alert if:
- Error rate >threshold in last 1 hour (long window)
AND
- Error rate >threshold in last 5 minutes (short window)
Benefits:
- Fast detection (5m window)
- Low false positives (1h confirmation)
- Balanced sensitivity
- Google SRE Multi-Window Burn Rates (Complete implementation for all ATP SLOs)
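The AND logic described above can be sketched outside PromQL (a hypothetical helper for illustration; production deployments express this directly in alert rules):

```python
def should_alert(error_rate_long: float, error_rate_short: float,
                 error_budget: float, burn_multiple: float) -> bool:
    """Multi-window burn-rate check: both the long (confirming) and the
    short (detecting) window must exceed budget * burn multiple."""
    threshold = error_budget * burn_multiple
    return error_rate_long > threshold and error_rate_short > threshold

# 99.9% SLO (0.1% budget) at the 14.4x page threshold = 1.44% error rate
print(should_alert(0.020, 0.025, 0.001, 14.4))   # True: sustained fast burn
print(should_alert(0.0005, 0.030, 0.001, 14.4))  # False: short spike only
```

The second call is the false-positive case the long window exists to filter: a 5-minute blip with no sustained burn behind it.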
Code Examples: - Multi-window alert rules - Burn rate calculations
Deliverables: - Multi-window guide - Implementation library
Topic 20: SLO Dashboard Design¶
What will be covered: - SLO Overview Dashboard - Error Budget Visualization - Burn Rate Graphs - Historical SLO Compliance
Code Examples: - Grafana dashboard JSON
Deliverables: - Dashboard templates
CYCLE 11: Service-Specific Alerts (~5,000 lines)¶
Topic 21: Per-Service Alert Libraries¶
What will be covered: - Ingestion Service Alerts (15+ alerts) - Query Service Alerts (12+ alerts) - Projection Service Alerts (10+ alerts) - Export Service Alerts (8+ alerts) - Integrity Service Alerts (10+ alerts) - Policy Service Alerts (6+ alerts) - Gateway Service Alerts (12+ alerts) - Admin Service Alerts (5+ alerts)
Code Examples: - Service-specific alert rules (complete)
Deliverables: - Service alert libraries (8 services)
Topic 22: Infrastructure Alerts¶
What will be covered: - Kubernetes Alerts - Database Alerts - Message Bus Alerts - Storage Alerts - Network Alerts
Code Examples: - Infrastructure alert rules
Deliverables: - Infrastructure alert library
CYCLE 12: Alert Inhibition & Grouping (~3,000 lines)¶
Topic 23: Alert Inhibition¶
What will be covered: - Dependency-Based Inhibition - Parent-Child Relationships - Cascading Failure Prevention
Code Examples: - Inhibition rules
Deliverables: - Inhibition guide
Topic 24: Alert Grouping & Deduplication¶
What will be covered: - Grouping Strategy - Deduplication Windows - Fingerprint Generation
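Deduplication rests on giving each alert a stable identity derived from its label set. A minimal sketch of the idea (not Alertmanager's actual algorithm, which hashes sorted label pairs with FNV-64a):

```python
import hashlib

def fingerprint(labels: dict[str, str]) -> str:
    """Stable identity for an alert: hash of the sorted label pairs.
    Annotations are excluded, so re-fires of the same alert deduplicate
    even when descriptions or timestamps differ."""
    canonical = "\x1e".join(f"{k}\x1f{v}" for k, v in sorted(labels.items()))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

a = fingerprint({"alertname": "HighLatency", "service": "query", "region": "eu"})
b = fingerprint({"region": "eu", "service": "query", "alertname": "HighLatency"})
print(a == b)  # True: label order does not affect identity
```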
Code Examples: - Grouping configuration
Deliverables: - Grouping guide
CYCLE 13: Dashboard Design (~4,000 lines)¶
Topic 25: Grafana Dashboards¶
What will be covered: - Dashboard Architecture - ATP Operations Dashboard - Service Health Dashboards (all services) - SLO Compliance Dashboards
Code Examples: - Complete dashboard library
Deliverables: - Dashboard templates (10+ dashboards)
Topic 26: Azure Monitor Workbooks¶
What will be covered: - Workbook Templates - KQL Queries - Application Map - Performance Analysis
Code Examples: - Workbook definitions
Deliverables: - Workbook library
CYCLE 14: Alert Fatigue Prevention (~3,000 lines)¶
Topic 27: Reducing False Positives¶
What will be covered: - Threshold Tuning - Duration Optimization - Silence Management - Alert Review Process
Code Examples: - Tuning procedures
Deliverables: - Fatigue prevention guide
Topic 28: Alert Quality Metrics¶
What will be covered: - Alert Precision & Recall - Mean Time to Acknowledge - False Positive Rate
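Precision and recall for alerts follow the standard definitions, with fired pages as predictions and real incidents as ground truth. A sketch with illustrative numbers:

```python
def alert_precision(true_positives: int, false_positives: int) -> float:
    """Precision: fraction of fired alerts that flagged a real problem."""
    fired = true_positives + false_positives
    return true_positives / fired if fired else 0.0

def alert_recall(true_positives: int, missed_incidents: int) -> float:
    """Recall: fraction of real incidents that produced an alert."""
    incidents = true_positives + missed_incidents
    return true_positives / incidents if incidents else 0.0

# 40 actionable pages, 10 noise pages, 2 incidents found only by users
print(alert_precision(40, 10))  # 0.8
print(alert_recall(40, 2))      # ~0.95
```

Low precision drives fatigue; low recall means outages reach customers first. Both should trend in the monthly alert review.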
Code Examples: - Quality tracking
Deliverables: - Quality metrics guide
CYCLE 15: On-Call Playbooks (~3,500 lines)¶
Topic 29: Runbook Integration¶
What will be covered: - Alert-to-Runbook Mapping - Playbook Templates - Diagnostic Procedures
Code Examples: - Playbook library
Deliverables: - Runbook catalog
Topic 30: Auto-Remediation¶
What will be covered: - Safe Auto-Remediation - Remediation Scripts - Validation & Rollback
Code Examples: - Auto-remediation playbooks
Deliverables: - Automation guide
CYCLE 16: SLO Reporting & Reviews (~3,000 lines)¶
Topic 31: SLO Reporting¶
What will be covered: - Monthly SLO Reports - Error Budget Consumption - Trend Analysis - Stakeholder Communication
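A monthly report reduces to comparing observed errors against what the SLO budget allowed. A minimal sketch (function and field names are illustrative):

```python
def budget_report(slo_target: float, total: int, errors: int) -> dict:
    """Monthly error-budget summary for one request-based SLO
    (e.g. slo_target=0.999 for 99.9% availability)."""
    allowed = total * (1.0 - slo_target)  # errors the budget permits
    consumed = errors / allowed if allowed else 0.0
    return {
        "allowed_errors": allowed,
        "observed_errors": errors,
        "budget_consumed_pct": round(consumed * 100, 1),
        "budget_remaining_pct": round((1.0 - consumed) * 100, 1),
    }

# 10M requests, 6,500 errors against a 10,000-error budget
report = budget_report(slo_target=0.999, total=10_000_000, errors=6_500)
print(report["budget_consumed_pct"])   # 65.0
print(report["budget_remaining_pct"])  # 35.0
```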
Code Examples: - Report generation
Deliverables: - Reporting guide
Topic 32: SLO Review Process¶
What will be covered: - Quarterly SLO Review - Target Adjustments - New SLO Proposals
Code Examples: - Review procedures
Deliverables: - Review process guide
CYCLE 17: Testing & Validation (~2,500 lines)¶
Topic 33: Alert Testing¶
What will be covered: - Alert Rule Validation - Synthetic Incidents - Chaos Testing for Alerts
Code Examples: - Testing procedures
Deliverables: - Alert testing guide
Topic 34: SLO Validation¶
What will be covered: - SLO Metric Validation - Error Budget Calculation Verification
Code Examples: - Validation scripts
Deliverables: - Validation guide
CYCLE 18: Best Practices & Governance (~3,000 lines)¶
Topic 35: SLO Best Practices¶
What will be covered: - Choosing Good SLIs - Setting Realistic SLOs - Avoiding Common Pitfalls
Deliverables: - Best practices guide
Topic 36: Alert Governance¶
What will be covered: - Alert Review Process - Alert Ownership - Alert Lifecycle Management
Deliverables: - Governance guide
Summary of Deliverables¶
Across all 18 cycles, this documentation will provide:
- SLO/SLI Framework: Fundamentals, ATP taxonomy, tier classification
- ATP SLO Catalog: Complete SLOs for all 8 services + platform SLOs
- Error Budgets: Policies, tracking, budget-driven development
- Golden Signals: Latency, errors, traffic, saturation for all services
- Metrics: Collection, naming, OTel instrumentation, Prometheus storage
- Alert Rules: 70+ rules covering availability, latency, correctness, freshness
- Burn-Rate Alerts: Multi-window fast/moderate/slow burn detection
- Alertmanager: Routing, inhibition, grouping, deduplication
- Dashboards: Grafana SLO dashboards, Azure Monitor workbooks
- Operational Excellence: Fatigue prevention, playbooks, reporting, governance
Related Documentation¶
- Runbook: Operational procedures and incident response
- Progressive Rollout: Deployment strategies
- Observability: Tracing, logging, monitoring
- Kubernetes: K8s monitoring and health checks
- Architecture: System design and SLOs
- Quality Gates: CI/CD quality enforcement
This alerts & SLOs guide provides complete definitions, implementations, and operational procedures for measuring and maintaining ATP reliability. It combines Service Level Objectives, error budgets, golden signal monitoring, intelligent burn-rate alerting, actionable runbooks, comprehensive dashboards, and continuous improvement processes to deliver predictable, compliant, and tamper-evident audit trail services at scale.