Kubernetes Infrastructure - Audit Trail Platform (ATP)
Cloud-native, auto-scaling, zero-trust — ATP runs on Azure Kubernetes Service (AKS) with namespace isolation, service mesh (mTLS), KEDA event-driven autoscaling, Pod Security Standards, Helm packaging, and FluxCD GitOps for declarative, auditable, and resilient microservice orchestration.
📋 Documentation Generation Plan
This document will be generated in 20 cycles. Current progress:
| Cycle | Topics | Estimated Lines | Status |
|---|---|---|---|
| Cycle 1 | Kubernetes Architecture & AKS Overview (1-2) | ~3,500 | ⏳ Not Started |
| Cycle 2 | Cluster Setup & Node Pools (3-4) | ~3,000 | ⏳ Not Started |
| Cycle 3 | Namespace Organization (5-6) | ~3,000 | ⏳ Not Started |
| Cycle 4 | Deployment Workloads (7-8) | ~4,000 | ⏳ Not Started |
| Cycle 5 | Service Networking (9-10) | ~3,500 | ⏳ Not Started |
| Cycle 6 | Ingress & Load Balancing (11-12) | ~3,500 | ⏳ Not Started |
| Cycle 7 | ConfigMaps & Secrets (13-14) | ~3,000 | ⏳ Not Started |
| Cycle 8 | Azure Key Vault CSI Driver (15-16) | ~3,000 | ⏳ Not Started |
| Cycle 9 | Resource Management (17-18) | ~3,500 | ⏳ Not Started |
| Cycle 10 | Horizontal Pod Autoscaler (HPA) (19-20) | ~4,000 | ⏳ Not Started |
| Cycle 11 | KEDA Event-Driven Autoscaling (21-22) | ~4,500 | ⏳ Not Started |
| Cycle 12 | Pod Security Standards (23-24) | ~3,000 | ⏳ Not Started |
| Cycle 13 | Network Policies (25-26) | ~3,000 | ⏳ Not Started |
| Cycle 14 | Service Mesh & mTLS (27-28) | ~3,500 | ⏳ Not Started |
| Cycle 15 | RBAC & Workload Identity (29-30) | ~3,000 | ⏳ Not Started |
| Cycle 16 | Helm Charts (31-32) | ~4,000 | ⏳ Not Started |
| Cycle 17 | GitOps with FluxCD (33-34) | ~3,500 | ⏳ Not Started |
| Cycle 18 | Monitoring & Observability (35-36) | ~3,000 | ⏳ Not Started |
| Cycle 19 | Operations & Troubleshooting (37-38) | ~3,000 | ⏳ Not Started |
| Cycle 20 | Best Practices & Disaster Recovery (39-40) | ~3,000 | ⏳ Not Started |
Total Estimated Lines: ~67,500
Purpose & Scope
This document provides the complete Kubernetes infrastructure guide for ATP, covering Azure Kubernetes Service (AKS) cluster setup, namespace organization, workload deployments, service networking, autoscaling, security policies, Helm charts, GitOps, and operational best practices for running ATP's microservices at scale.
Why Kubernetes for ATP?
- Cloud-Native: Container orchestration with declarative configuration
- Scalability: Auto-scaling (HPA/KEDA) based on load and events
- Resilience: Self-healing, rolling updates, health checks
- Multi-Tenancy: Namespace isolation, resource quotas, network policies
- Security: Pod Security Standards, mTLS service mesh, RBAC, Workload Identity
- Portability: Run anywhere (Azure, on-prem, multi-cloud)
- Observability: Integrated monitoring, logging, tracing
- Cost Optimization: Efficient resource utilization, scale-to-zero with KEDA
- GitOps: Declarative, auditable infrastructure as code
- Ecosystem: Rich tooling (Helm, Kustomize, Flux, Istio/Linkerd, KEDA)
ATP Kubernetes Topology
Azure Front Door (AFD) + WAF
        ↓
API Management (APIM) / Ingress Controller
        ↓
AKS Cluster (3 regions: US, EU, IL)
├── System Ring
│   ├── OTel Collector (DaemonSet)
│   ├── Mesh Control Plane (Istio/Linkerd)
│   ├── FluxCD Controllers
│   ├── KEDA Operator
│   └── Monitoring (Prometheus, Grafana)
│
└── User Ring (ATP Namespaces)
    ├── atp-gateway-ns (Gateway pods)
    ├── atp-ingest-ns (Ingestion pods)
    ├── atp-policy-ns (Policy pods)
    ├── atp-projection-ns (Projection workers, KEDA)
    ├── atp-query-ns (Query API pods)
    ├── atp-integrity-ns (Integrity workers)
    ├── atp-export-ns (Export workers, KEDA)
    ├── atp-search-ns (Search service, optional)
    └── atp-admin-ns (Admin console)
        ↓
Azure Services (Service Bus, SQL, Blob, Key Vault, Monitor)
Key Technologies
- AKS (Azure Kubernetes Service): Managed Kubernetes
- Service Mesh: Istio or Linkerd for mTLS, observability
- KEDA: Kubernetes Event-Driven Autoscaling (scale to zero)
- HPA: Horizontal Pod Autoscaler (CPU/memory/custom metrics)
- Helm: Package manager for Kubernetes
- FluxCD: GitOps continuous delivery
- Azure Key Vault CSI Driver: Secrets injection
- Azure Monitor: Container Insights, Log Analytics, Application Insights
- Prometheus & Grafana: Metrics and dashboards
- Calico or Azure Network Policy Manager (on Azure CNI): Network policies
Detailed Cycle Plan
CYCLE 1: Kubernetes Architecture & AKS Overview (~3,500 lines)
Topic 1: Kubernetes Fundamentals
What will be covered:
- What is Kubernetes?
- Container orchestration platform
- Declarative configuration (YAML)
- Desired state reconciliation
- Self-healing, auto-scaling, rolling updates
- Kubernetes Core Concepts
Cluster:
- Control Plane: API Server, Scheduler, Controller Manager, etcd
- Worker Nodes: kubelet, kube-proxy, container runtime (containerd)
Workloads:
- Pod: Smallest deployable unit (1+ containers)
- Deployment: Manages ReplicaSets, rolling updates
- StatefulSet: For stateful applications (stable identity)
- DaemonSet: One pod per node (monitoring, logging)
- Job/CronJob: Run-to-completion tasks
Networking:
- Service: Stable endpoint for pod group (ClusterIP, LoadBalancer)
- Ingress: HTTP/HTTPS routing to services
- NetworkPolicy: Firewall rules for pods
Configuration:
- ConfigMap: Non-sensitive configuration
- Secret: Sensitive data (passwords, tokens)
- PersistentVolume: Durable storage
Security:
- ServiceAccount: Pod identity
- RBAC: Role-based access control
- Pod Security Standards: Security policies
Observability:
- Logs: stdout/stderr streams
- Metrics: CPU, memory, custom metrics
- Events: Cluster events (pod scheduled, image pulled)
- Why AKS (Azure Kubernetes Service)?
Managed Control Plane:
- No manual k8s master management
- Microsoft maintains API server, etcd, scheduler
- Automatic updates and patches
Azure Integration:
- Azure CNI (VNet integration)
- Azure Monitor Container Insights
- Azure Key Vault CSI Driver
- Azure AD Workload Identity
- Azure Blob CSI (persistent volumes)
- Azure Service Bus (for KEDA triggers)
Enterprise Features:
- Availability Zones support
- Node auto-scaling
- Managed node pools
- Azure Policy for Kubernetes
- Defender for Containers (security)
Cost Optimization:
- Pay only for worker nodes
- Spot instances (low-priority workloads)
- Auto-shutdown for dev/test
Code Examples: - Kubernetes architecture diagram - AKS cluster creation (Azure CLI, Bicep, Pulumi) - kubectl basic commands
Diagrams: - Kubernetes architecture - AKS control plane vs. data plane - ATP on AKS topology
Deliverables: - Kubernetes fundamentals primer - AKS benefits for ATP - Cluster architecture overview
Topic 2: ATP AKS Cluster Architecture
What will be covered: - Cluster Configuration
Cluster Name: atp-aks-{region}-{env}
Examples:
- atp-aks-useast-prod
- atp-aks-euwest-prod
- atp-aks-ilcentral-prod
- atp-aks-useast-dev
Kubernetes Version: 1.28+ (managed, auto-upgrade minor versions)
Regions:
- Primary: East US (eastus)
- Secondary: West Europe (westeurope)
- Tertiary: Israel Central (israelcentral)
Availability Zones: 3 per region
- Zone 1, Zone 2, Zone 3
- Distribute node pools across zones
- Distribute pods across zones (pod anti-affinity)
- Node Pools Strategy
System Node Pool (npsystem):
- VM Size: Standard_D4s_v5 (4 vCPU, 16 GB RAM)
- Count: 3 (one per AZ)
- Auto-scale: No (fixed)
- Taint: CriticalAddonsOnly=true:NoSchedule
- Purpose: System add-ons and platform controllers (mesh, KEDA, FluxCD, OTel); the Kubernetes control plane itself is managed by AKS
Generic Node Pool (npgeneric):
- VM Size: Standard_D8s_v5 (8 vCPU, 32 GB RAM)
- Count: 3-30 (auto-scale)
- Taint: None
- Purpose: Stateless APIs (Gateway, Query, Admin, Policy)
I/O Node Pool (npio):
- VM Size: Standard_E8s_v5 (8 vCPU, 64 GB RAM, premium storage)
- Count: 2-50 (auto-scale)
- Taint: workload=io:NoSchedule
- Purpose: I/O-heavy (Ingestion, Projection, Export, Integrity)
Jobs Node Pool (npjobs) - Optional:
- VM Size: Standard_F16s_v2 (16 vCPU, 32 GB RAM, compute-optimized)
- Count: 0-20 (the cluster autoscaler drains the pool to zero once KEDA has scaled the workers to zero)
- Taint: workload=batch:NoSchedule
- Purpose: Export jobs, maintenance tasks, compliance reports
Spot Node Pool (npspot) - Optional:
- VM Size: Standard_D8s_v5
- Count: 0-10
- Spot Priority: Yes (up to 80% cost savings)
- Taint: kubernetes.azure.com/scalesetpriority=spot:NoSchedule
- Purpose: Non-critical workloads (dev/test projections, backfills)
- Network Configuration
CNI Plugin: Azure CNI (VNet integration)
VNet CIDR: 10.42.0.0/16
- Subnet: aks-nodes (10.42.0.0/20) - 4096 IPs for nodes
- Subnet: aks-pods (10.42.16.0/20) - 4096 IPs for pods
- Subnet: aks-services (10.42.32.0/24) - 256 IPs for services
Service CIDR: 10.43.0.0/16 (internal cluster services)
DNS Service IP: 10.43.0.10
Network Policy: Azure Network Policy or Calico
Load Balancer:
- Type: Standard Load Balancer
- Public IP: Static (for ingress controller)
- Private endpoints: For Azure services (SQL, Storage, Key Vault)
Code Examples: - AKS cluster creation (Pulumi C#) - Node pool configuration - Network setup
Diagrams: - AKS cluster architecture - Node pool distribution across AZs - Network topology
Deliverables: - AKS cluster specification - Node pool strategy - Network architecture
CYCLE 2: Cluster Setup & Node Pools (~3,000 lines)
Topic 3: AKS Cluster Provisioning
What will be covered: - Pulumi IaC for AKS
// Create AKS cluster with Pulumi
var cluster = new AzureNative.ContainerService.ManagedCluster("atp-aks-prod", new()
{
ResourceGroupName = resourceGroup.Name,
Location = location,
DnsPrefix = "atp-prod",
KubernetesVersion = "1.28",
EnableRBAC = true,
// Cluster identity (system-assigned managed identity; pod-level Workload Identity is enabled separately)
Identity = new ManagedClusterIdentityArgs
{
Type = ResourceIdentityType.SystemAssigned
},
// Network profile
NetworkProfile = new ContainerServiceNetworkProfileArgs
{
NetworkPlugin = "azure",
NetworkPolicy = "calico",
ServiceCidr = "10.43.0.0/16",
DnsServiceIP = "10.43.0.10",
LoadBalancerSku = "standard"
},
// Add-ons
AddonProfiles = new InputMap<ManagedClusterAddonProfileArgs>
{
["azureKeyvaultSecretsProvider"] = new ManagedClusterAddonProfileArgs
{
Enabled = true,
Config = new InputMap<string>
{
["enableSecretRotation"] = "true",
["rotationPollInterval"] = "2m"
}
},
["omsAgent"] = new ManagedClusterAddonProfileArgs
{
Enabled = true,
Config = new InputMap<string>
{
["logAnalyticsWorkspaceResourceID"] = logAnalyticsWorkspace.Id
}
}
},
// System node pool
AgentPoolProfiles = new[]
{
new ManagedClusterAgentPoolProfileArgs
{
Name = "npsystem",
Count = 3,
VmSize = "Standard_D4s_v5",
OsType = "Linux",
Mode = "System",
AvailabilityZones = new[] { "1", "2", "3" },
NodeTaints = new[] { "CriticalAddonsOnly=true:NoSchedule" }
}
}
});
// Add generic node pool
var genericNodePool = new AzureNative.ContainerService.AgentPool("np-generic", new()
{
ResourceGroupName = resourceGroup.Name,
ResourceName = cluster.Name,
AgentPoolName = "npgeneric",
Count = 3,
VmSize = "Standard_D8s_v5",
OsType = "Linux",
Mode = "User",
AvailabilityZones = new[] { "1", "2", "3" },
EnableAutoScaling = true,
MinCount = 3,
MaxCount = 30
});
// Add I/O node pool
var ioNodePool = new AzureNative.ContainerService.AgentPool("np-io", new()
{
ResourceGroupName = resourceGroup.Name,
ResourceName = cluster.Name,
AgentPoolName = "npio",
Count = 2,
VmSize = "Standard_E8s_v5",
OsType = "Linux",
Mode = "User",
AvailabilityZones = new[] { "1", "2", "3" },
EnableAutoScaling = true,
MinCount = 2,
MaxCount = 50,
NodeTaints = new[] { "workload=io:NoSchedule" }
});
- Azure CLI Alternative
# Create resource group
az group create --name atp-aks-prod-rg --location eastus

# Create AKS cluster
az aks create \
  --resource-group atp-aks-prod-rg \
  --name atp-aks-useast-prod \
  --node-count 3 \
  --node-vm-size Standard_D4s_v5 \
  --kubernetes-version 1.28 \
  --network-plugin azure \
  --network-policy calico \
  --enable-managed-identity \
  --enable-addons monitoring,azure-keyvault-secrets-provider \
  --enable-workload-identity \
  --enable-oidc-issuer \
  --zones 1 2 3 \
  --nodepool-name npsystem \
  --nodepool-taints CriticalAddonsOnly=true:NoSchedule

# Add generic node pool
az aks nodepool add \
  --resource-group atp-aks-prod-rg \
  --cluster-name atp-aks-useast-prod \
  --name npgeneric \
  --node-count 3 \
  --node-vm-size Standard_D8s_v5 \
  --enable-cluster-autoscaler \
  --min-count 3 \
  --max-count 30 \
  --zones 1 2 3

# Add I/O node pool
az aks nodepool add \
  --resource-group atp-aks-prod-rg \
  --cluster-name atp-aks-useast-prod \
  --name npio \
  --node-count 2 \
  --node-vm-size Standard_E8s_v5 \
  --enable-cluster-autoscaler \
  --min-count 2 \
  --max-count 50 \
  --node-taints workload=io:NoSchedule \
  --zones 1 2 3
Code Examples: - Complete AKS provisioning (Pulumi, Azure CLI, Bicep) - Node pool configurations - Cluster add-ons setup
Diagrams: - AKS provisioning workflow - Node pool architecture
Deliverables: - AKS provisioning guide - Node pool specifications - Add-ons configuration
Topic 4: Connecting to AKS Cluster
What will be covered: - kubectl Configuration
# Get AKS credentials
az aks get-credentials \
--resource-group atp-aks-prod-rg \
--name atp-aks-useast-prod
# Verify connection
kubectl cluster-info
kubectl get nodes
kubectl get namespaces
# Switch context (multiple clusters)
kubectl config get-contexts
kubectl config use-context atp-aks-useast-prod
- Access Control
- Azure AD integration
- RBAC roles (admin, developer, reader)
- kubeconfig with AAD tokens
- kubectl proxy for local access
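The reader role listed above could be expressed as a namespace-scoped binding of an Azure AD group to the built-in `view` ClusterRole. This is a minimal sketch, assuming AKS-managed Azure AD integration is enabled; the binding name and the group object ID are placeholders, not values from the ATP environment.

```yaml
# Hypothetical example: read-only access to one ATP namespace for an Azure AD group.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: atp-ingest-readers            # hypothetical binding name
  namespace: atp-ingest-ns
subjects:
  - kind: Group
    apiGroup: rbac.authorization.k8s.io
    name: "<aad-group-object-id>"     # placeholder: object ID of the Azure AD group
roleRef:
  kind: ClusterRole
  name: view                          # built-in read-only ClusterRole
  apiGroup: rbac.authorization.k8s.io
```

A developer binding would likely follow the same shape with the built-in `edit` role, and an admin binding with `admin`, each still scoped to a single namespace.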
Code Examples: - Connection setup - Context management - RBAC configuration
Deliverables: - Access guide - RBAC setup - Security best practices
CYCLE 3: Namespace Organization (~3,000 lines)
Topic 5: Namespace Strategy
What will be covered: - ATP Namespace Taxonomy
# System namespaces (managed)
- kube-system # Kubernetes core components
- kube-public # Public cluster info
- flux-system # FluxCD controllers
- istio-system # Service mesh control plane
- keda # KEDA operator
- monitoring # Prometheus, Grafana
# ATP application namespaces
- atp-gateway-ns # API Gateway (public entry point)
- atp-ingest-ns # Ingestion service
- atp-policy-ns # Policy service
- atp-projection-ns # Projection workers
- atp-query-ns # Query service
- atp-integrity-ns # Integrity service
- atp-export-ns # Export service
- atp-search-ns # Search service (optional)
- atp-admin-ns # Admin console
- Namespace Creation
# Create namespace with labels
apiVersion: v1
kind: Namespace
metadata:
  name: atp-ingest-ns
  labels:
    app.kubernetes.io/name: atp-ingestion
    app.kubernetes.io/component: backend
    app.kubernetes.io/part-of: audit-trail-platform
    environment: production
    region: us-east
    istio-injection: enabled  # Enable service mesh sidecar injection
    kyverno.io/policy-severity: high
- Namespace Isolation
- Resource quotas per namespace
- Network policies (deny by default)
- RBAC (namespace-scoped roles)
- Pod Security Standards
- Namespace Lifecycle
- Creation (GitOps via FluxCD)
- Configuration (ResourceQuota, NetworkPolicy, RBAC)
- Monitoring (per-namespace dashboards)
- Deletion (drain workloads, cleanup resources)
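The deny-by-default network policy and the Pod Security Standards items above could look roughly like the following for a single namespace. This is a sketch under stated assumptions: the policy names are hypothetical, and the `baseline` enforcement level is an assumption chosen because mesh sidecar init containers may need capabilities that `restricted` forbids.

```yaml
# Deny all ingress and egress for every pod in the namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all              # hypothetical name
  namespace: atp-ingest-ns
spec:
  podSelector: {}                     # empty selector = all pods
  policyTypes:
    - Ingress
    - Egress
---
# Re-allow DNS lookups, which the default-deny egress rule would otherwise block.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress              # hypothetical name
  namespace: atp-ingest-ns
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
---
# Pod Security Admission labels (levels are assumptions, not confirmed ATP policy).
apiVersion: v1
kind: Namespace
metadata:
  name: atp-ingest-ns
  labels:
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
```

Each service would then need explicit allow policies for its real dependencies (for example, egress to Service Bus and SQL private endpoints) on top of this baseline.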
Code Examples: - Namespace creation manifests - Resource quota configuration - RBAC binding
Diagrams: - Namespace organization - Isolation boundaries
Deliverables: - Namespace taxonomy - Creation templates - Isolation policies
Topic 6: Resource Quotas & LimitRanges
What will be covered: - Resource Quota per Namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: atp-ingest-quota
  namespace: atp-ingest-ns
spec:
  hard:
    requests.cpu: "20"        # Max 20 CPU cores requested
    requests.memory: 40Gi     # Max 40 GB memory requested
    limits.cpu: "40"          # Max 40 CPU cores limit
    limits.memory: 80Gi       # Max 80 GB memory limit
    pods: "50"                # Max 50 pods
    services: "10"            # Max 10 services
    persistentvolumeclaims: "5"
- LimitRange (Default Limits)
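A LimitRange for the same namespace might look like the sketch below. Once a ResourceQuota constrains CPU and memory requests and limits, pods that omit them are rejected, so namespace defaults keep such pods admissible. All values and the object name here are illustrative assumptions, not ATP's agreed defaults.

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: atp-ingest-limits             # hypothetical name
  namespace: atp-ingest-ns
spec:
  limits:
    - type: Container
      defaultRequest:                 # applied when a container sets no requests
        cpu: 250m
        memory: 256Mi
      default:                        # applied when a container sets no limits
        cpu: 500m
        memory: 512Mi
      min:                            # smallest values a container may declare
        cpu: 50m
        memory: 64Mi
      max:                            # largest values a container may declare
        cpu: "4"
        memory: 4Gi
```

The `max` values would need to stay at or above the largest per-service limit in the namespace, otherwise legitimate pods are rejected at admission.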
Code Examples: - ResourceQuota manifests (all namespaces) - LimitRange configuration - Quota monitoring
Diagrams: - Resource allocation - Quota enforcement
Deliverables: - Quota specifications - Limit policies - Monitoring dashboards
CYCLE 4: Deployment Workloads (~4,000 lines)
Topic 7: Deployment Manifests
What will be covered: - Deployment Anatomy
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ingestion
  namespace: atp-ingest-ns
  labels:
    app: ingestion
    version: v1
    component: backend
spec:
  # Replica management
  replicas: 3
  # Rolling update strategy
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 30%          # Allow 30% extra pods during update
      maxUnavailable: 0      # Never go below desired count
  # Pod selector
  selector:
    matchLabels:
      app: ingestion
      version: v1
  # Pod template
  template:
    metadata:
      labels:
        app: ingestion
        version: v1
        azure.workload.identity/use: "true"   # Required for Workload Identity token injection
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9091"            # Matches the "metrics" container port
        prometheus.io/path: "/metrics"
    spec:
      # Security context (Pod level)
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 2000
        seccompProfile:
          type: RuntimeDefault
      # Service account (Workload Identity)
      serviceAccountName: ingestion-sa
      # Pod anti-affinity (prefer different nodes)
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - ingestion
                topologyKey: kubernetes.io/hostname
      # Zone spread. A required zone-level anti-affinity would cap the
      # Deployment at one pod per zone (3 pods), so a spread constraint
      # is used to keep zones balanced while still allowing scale-out.
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: ingestion
      # Tolerations (for I/O node pool)
      tolerations:
        - key: "workload"
          operator: "Equal"
          value: "io"
          effect: "NoSchedule"
      # Containers
      containers:
        - name: ingestion
          image: atpacr.azurecr.io/atp/ingestion:1.2.3@sha256:abc123...
          # Security context (Container level)
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            runAsNonRoot: true
            runAsUser: 1000
            capabilities:
              drop:
                - ALL
          # Ports
          ports:
            - name: http
              containerPort: 8080
              protocol: TCP
            - name: grpc
              containerPort: 9090
              protocol: TCP
            - name: metrics
              containerPort: 9091
              protocol: TCP
          # Environment variables
          env:
            - name: ASPNETCORE_ENVIRONMENT
              value: "Production"
            - name: TENANT_CONTEXT_HEADER
              value: "X-Tenant-Id"
          # ConfigMap reference
          envFrom:
            - configMapRef:
                name: ingestion-config
          # Resource requests and limits
          resources:
            requests:
              cpu: "500m"
              memory: "1Gi"
            limits:
              cpu: "2000m"
              memory: "2Gi"
          # Liveness probe (is container alive?)
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          # Readiness probe (ready to serve traffic?)
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3
          # Startup probe (for slow-starting apps)
          startupProbe:
            httpGet:
              path: /health/startup
              port: 8080
            initialDelaySeconds: 0
            periodSeconds: 5
            failureThreshold: 30   # 30 * 5s = 150s max startup time
          # Volume mounts
          volumeMounts:
            - name: secrets-store
              mountPath: "/mnt/secrets"
              readOnly: true
            - name: temp-dir
              mountPath: "/tmp"
      # Volumes
      volumes:
        - name: secrets-store
          csi:
            driver: secrets-store.csi.k8s.io
            readOnly: true
            volumeAttributes:
              secretProviderClass: "atp-kv-secrets"
        - name: temp-dir
          emptyDir: {}
- All ATP Deployments
- Gateway, Ingestion, Policy, Projection, Query, Integrity, Export, Search, Admin
- Each with specific resource profiles
- Affinity rules for high availability
- Tolerations for node pool assignment
Code Examples: - Complete deployment manifests (all ATP services) - Security contexts - Probe configurations - Volume mounts
Diagrams: - Deployment anatomy - Pod lifecycle - Health check flow
Deliverables: - Deployment manifest library - Configuration guide - Best practices
Topic 8: StatefulSets & DaemonSets
What will be covered:
- When to Use StatefulSets vs. Deployments

| Workload Type | Use Deployment | Use StatefulSet |
|---------------|----------------|-----------------|
| Stateless APIs | ✅ | ❌ |
| Event consumers | ✅ | ❌ |
| Projection workers | ✅ | ❌ |
| Databases (managed externally) | ✅ | ❌ |
| Caching tier (Redis Sentinel) | ❌ | ✅ |
| Message brokers (Kafka, RabbitMQ) | ❌ | ✅ |
| Stateful actors (Orleans silos) | ❌ | ✅ |

- DaemonSet Use Cases
- OTel Collector: One per node for metrics/logs
- Log Forwarder: FluentBit/Fluentd for centralized logging
- Node Exporter: Prometheus node metrics
- Security Agent: Defender for Containers
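The OTel Collector use case above could be deployed along these lines. This is a minimal sketch, not ATP's actual collector configuration: the ConfigMap name, the config path, the resource values, and the image tag placeholder are all assumptions.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-collector
  namespace: monitoring
  labels:
    app: otel-collector
spec:
  selector:
    matchLabels:
      app: otel-collector
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1               # replace one node's collector at a time
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      # Tolerate every taint so tainted pools (system, I/O, spot) also get a collector.
      tolerations:
        - operator: Exists
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:<version>   # placeholder tag
          args: ["--config=/etc/otel/config.yaml"]
          ports:
            - name: otlp-grpc
              containerPort: 4317
            - name: otlp-http
              containerPort: 4318
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi
          volumeMounts:
            - name: config
              mountPath: /etc/otel
              readOnly: true
      volumes:
        - name: config
          configMap:
            name: otel-collector-config   # hypothetical ConfigMap holding config.yaml
```

The blanket toleration is what makes the "one per node" guarantee hold across tainted node pools; without it the collector would only land on untainted nodes.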
Code Examples: - StatefulSet example (if needed for ATP) - DaemonSet for OTel Collector - Persistent volume claims
Diagrams: - Workload type comparison - DaemonSet architecture
Deliverables: - Workload selection guide - DaemonSet configurations - Stateful patterns
CYCLE 5: Service Networking (~3,500 lines)
Topic 9: Kubernetes Services
What will be covered: - Service Types
# 1. ClusterIP (default, internal only)
apiVersion: v1
kind: Service
metadata:
  name: ingestion-svc
  namespace: atp-ingest-ns
spec:
  type: ClusterIP
  selector:
    app: ingestion
  ports:
    - name: http
      port: 80
      targetPort: 8080
      protocol: TCP
    - name: grpc
      port: 9090
      targetPort: 9090
      protocol: TCP
---
# 2. LoadBalancer (external, public IP)
apiVersion: v1
kind: Service
metadata:
  name: gateway-lb
  namespace: atp-gateway-ns
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "false"
spec:
  type: LoadBalancer
  selector:
    app: gateway
  ports:
    - name: https
      port: 443
      targetPort: 8443
---
# 3. Headless Service (for StatefulSets)
apiVersion: v1
kind: Service
metadata:
  name: orleans-silo-headless
  namespace: atp-orleans-ns
spec:
  type: ClusterIP
  clusterIP: None   # Headless
  selector:
    app: orleans-silo
  ports:
    - name: silo
      port: 11111
      targetPort: 11111
- Service Discovery
- DNS-based (service-name.namespace.svc.cluster.local)
- Environment variables
- Service mesh (Istio VirtualService)
- Load Balancing
- Round-robin (default)
- Session affinity (ClientIP)
- Topology-aware routing
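The session-affinity and topology-aware routing options above map onto Service fields roughly as follows. This is an illustrative sketch: the `query-svc` definition, its selector, and the 30-minute affinity timeout are assumptions, and the topology annotation shown is the form used from Kubernetes 1.27 onward.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: query-svc
  namespace: atp-query-ns
  annotations:
    # Prefer endpoints in the caller's zone when capacity allows.
    service.kubernetes.io/topology-mode: Auto
spec:
  type: ClusterIP
  selector:
    app: query                        # assumed pod label
  # Pin each client IP to the same backend pod for the timeout window.
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 1800            # assumed value (30 minutes)
  ports:
    - name: http
      port: 80
      targetPort: 8080
      protocol: TCP
```

When a service mesh sidecar handles the traffic, load balancing is typically decided by the mesh rather than by kube-proxy, so these settings mainly matter for non-meshed callers.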
Code Examples: - Service manifests (all types) - Service discovery examples - Load balancing configuration
Diagrams: - Service types comparison - DNS resolution flow - Load balancing strategies
Deliverables: - Service manifest library - Discovery guide - Load balancing patterns
Topic 10: Service Mesh Integration
What will be covered:
- Service-to-Service Communication
- mTLS encryption (automatic)
- Identity-based auth
- Traffic management (retries, timeouts, circuit breakers)
- Observability (distributed tracing)
- Istio/Linkerd Configuration
# Enable sidecar injection (namespace label)
apiVersion: v1
kind: Namespace
metadata:
  name: atp-ingest-ns
  labels:
    istio-injection: enabled  # or linkerd.io/inject: enabled
---
# VirtualService (traffic routing)
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: ingestion-vs
  namespace: atp-ingest-ns
spec:
  hosts:
    - ingestion-svc
  http:
    - route:
        - destination:
            host: ingestion-svc
            port:
              number: 80
      timeout: 30s
      retries:
        attempts: 3
        perTryTimeout: 10s
        retryOn: 5xx,reset,connect-failure,refused-stream
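The mTLS and circuit-breaker items from this topic could be expressed in Istio as sketched below. This assumes Istio rather than Linkerd; the object names and all thresholds are illustrative assumptions, not tuned ATP values.

```yaml
# Require mTLS for every workload in the namespace.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: atp-ingest-ns
spec:
  mtls:
    mode: STRICT                      # reject plaintext traffic
---
# Connection limits and outlier ejection (circuit breaking) for ingestion-svc.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: ingestion-dr                  # hypothetical name
  namespace: atp-ingest-ns
spec:
  host: ingestion-svc
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL              # use mesh-issued certificates
    connectionPool:
      http:
        http1MaxPendingRequests: 100
        maxRequestsPerConnection: 10
    outlierDetection:
      consecutive5xxErrors: 5         # eject a pod after 5 consecutive 5xx responses
      interval: 30s
      baseEjectionTime: 60s
      maxEjectionPercent: 50          # never eject more than half the pods
```

With STRICT mode in place, any client without a sidecar (for example, an ingress controller outside the mesh) can no longer reach the service, so the mode is usually rolled out namespace by namespace.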
Code Examples: - Service mesh configuration - Traffic policies - mTLS verification
Diagrams: - Service mesh architecture - mTLS flow
Deliverables: - Service mesh setup - Traffic management - Security policies
CYCLE 6: Ingress & Load Balancing (~3,500 lines)
Topic 11: Ingress Controllers
What will be covered:
- Ingress Controller Options

| Controller | Use Case | ATP Usage |
|------------|----------|-----------|
| NGINX Ingress | General-purpose HTTP/HTTPS | Dev/Test environments |
| Azure Application Gateway | WAF, Azure-native | Production (alternative to APIM) |
| Istio Ingress Gateway | Service mesh integration | Production (with Istio) |
| Traefik | Dynamic routing, middlewares | Internal services |

- NGINX Ingress Setup
# Install NGINX Ingress Controller
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm install nginx-ingress ingress-nginx/ingress-nginx \
  --namespace ingress-nginx \
  --create-namespace \
  --set controller.service.type=LoadBalancer \
  --set controller.service.annotations."service\.beta\.kubernetes\.io/azure-load-balancer-health-probe-request-path"=/healthz
- Ingress Resource
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: atp-gateway-ingress
  namespace: atp-gateway-ns
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    # Backends below listen on plain HTTP (port 80 -> 8080)
    nginx.ingress.kubernetes.io/backend-protocol: "HTTP"
    # Requests per second per client IP
    nginx.ingress.kubernetes.io/limit-rps: "100"
spec:
  # Replaces the deprecated kubernetes.io/ingress.class annotation
  ingressClassName: nginx
  tls:
    - hosts:
        - api.audittrail.example.com
      secretName: atp-gateway-tls
  # Note: an Ingress can only reference Services in its own namespace, so
  # these backends must exist in atp-gateway-ns (for example as ExternalName
  # Services that point at the per-service namespaces).
  rules:
    - host: api.audittrail.example.com
      http:
        paths:
          - path: /api/v1/ingest
            pathType: Prefix
            backend:
              service:
                name: ingestion-svc
                port:
                  number: 80
          - path: /api/v1/query
            pathType: Prefix
            backend:
              service:
                name: query-svc
                port:
                  number: 80
          - path: /api/v1/policy
            pathType: Prefix
            backend:
              service:
                name: policy-svc
                port:
                  number: 80
Code Examples: - Ingress controller installation - Ingress resource manifests - TLS configuration - Rate limiting
Diagrams: - Ingress architecture - Traffic routing flow
Deliverables: - Ingress setup guide - Routing configurations - TLS management
Topic 12: Azure Front Door + APIM Integration
What will be covered:
- External Load Balancing
- Azure Front Door (global load balancing, WAF)
- API Management (rate limiting, caching, versioning)
- Direct to Ingress Controller
- Traffic Flow
Code Examples: - AFD configuration - APIM backend configuration - End-to-end routing
Diagrams: - Complete traffic flow - Edge integration
Deliverables: - Edge integration guide - Traffic routing - Failover patterns
CYCLE 7: ConfigMaps & Secrets (~3,000 lines)
Topic 13: ConfigMap Management
What will be covered: - ConfigMap for Application Settings
apiVersion: v1
kind: ConfigMap
metadata:
  name: ingestion-config
  namespace: atp-ingest-ns
data:
  appsettings.json: |
    {
      "ApplicationName": "ATP.Ingestion",
      "Logging": {
        "LogLevel": {
          "Default": "Information"
        }
      },
      "HealthChecks": {
        "Enabled": true
      }
    }
  # Environment-specific overrides
  ASPNETCORE_ENVIRONMENT: "Production"
  TENANT_CONTEXT_HEADER: "X-Tenant-Id"
  MAX_BATCH_SIZE: "100"
- Using ConfigMaps
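Two common ways to consume the `ingestion-config` ConfigMap are a single key as an environment variable and a file projected into the container. The Pod below is a sketch for illustration only: the pod name, the `/app/config` mount path, and the image tag without a digest are assumptions.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ingestion-config-demo         # hypothetical demo pod
  namespace: atp-ingest-ns
spec:
  containers:
    - name: ingestion
      image: atpacr.azurecr.io/atp/ingestion:1.2.3
      env:
        # Single key exposed as an environment variable.
        - name: MAX_BATCH_SIZE
          valueFrom:
            configMapKeyRef:
              name: ingestion-config
              key: MAX_BATCH_SIZE
      volumeMounts:
        # appsettings.json projected as a read-only file.
        - name: app-config
          mountPath: /app/config      # assumed path
          readOnly: true
  volumes:
    - name: app-config
      configMap:
        name: ingestion-config
        items:
          - key: appsettings.json
            path: appsettings.json
```

Mounted files are refreshed by the kubelet after the ConfigMap changes, while environment variables keep their original value until the pod restarts, which matters for the dynamic-update procedures this topic plans to cover.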
Code Examples: - ConfigMap creation - Usage patterns - Dynamic updates
Deliverables: - ConfigMap templates - Usage guide - Update procedures
Topic 14: Secret Management
What will be covered:
- Kubernetes Secrets (Avoid for Production)
- Base64 encoded (not encrypted at rest by default)
- Use only for non-sensitive config
- Prefer Azure Key Vault
- Secret Types
# Opaque secret
apiVersion: v1
kind: Secret
metadata:
  name: db-credentials
  namespace: atp-ingest-ns
type: Opaque
data:
  username: YWRtaW4=      # base64("admin")
  password: cGFzc3dvcmQ=  # base64("password")
---
# TLS secret
apiVersion: v1
kind: Secret
metadata:
  name: tls-cert
type: kubernetes.io/tls
data:
  tls.crt: <base64-encoded-cert>
  tls.key: <base64-encoded-key>
Code Examples: - Secret creation - Secret usage - Rotation procedures
Deliverables: - Secret management guide - Best practices - Security considerations
CYCLE 8: Azure Key Vault CSI Driver (~3,000 lines)
Topic 15: Key Vault Integration
What will be covered: - Azure Key Vault CSI Driver
# SecretProviderClass
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: atp-kv-secrets
  namespace: atp-ingest-ns
spec:
  provider: azure
  parameters:
    usePodIdentity: "false"
    useVMManagedIdentity: "false"
    clientID: "<workload-identity-client-id>"   # Workload Identity is selected via clientID
    keyvaultName: "atp-kv-prod"
    cloudName: "AzurePublicCloud"
    tenantId: "<azure-tenant-id>"
    objects: |
      array:
        - |
          objectName: DatabasePassword
          objectType: secret
          objectVersion: ""
        - |
          objectName: ServiceBusConnectionString
          objectType: secret
          objectVersion: ""
        - |
          objectName: SigningKeyPrivate
          objectType: secret
          objectVersion: ""
- Mount Secrets in Pod
- Access Secrets in Application
Code Examples: - Complete Key Vault CSI setup - SecretProviderClass for all services - Secret rotation automation
Diagrams: - Key Vault CSI architecture - Secret mounting flow
Deliverables: - Key Vault integration guide - Secret provider configurations - Rotation procedures
Topic 16: Workload Identity (successor to AAD Pod Identity)
What will be covered: - Workload Identity Setup
# Service Account with Workload Identity
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ingestion-sa
  namespace: atp-ingest-ns
  annotations:
    azure.workload.identity/client-id: "<managed-identity-client-id>"
    azure.workload.identity/tenant-id: "<azure-tenant-id>"
- Grant Permissions
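# Create managed identity
az identity create \
  --name atp-ingestion-identity \
  --resource-group atp-aks-prod-rg

# Grant Key Vault access
az keyvault set-policy \
  --name atp-kv-prod \
  --object-id <identity-object-id> \
  --secret-permissions get list

# Federate the identity with the AKS OIDC issuer for the ingestion service account
# (Workload Identity uses a federated credential; "az aks pod-identity" belongs to
# the deprecated AAD Pod Identity feature). The credential name is illustrative.
az identity federated-credential create \
  --name ingestion-federated-credential \
  --identity-name atp-ingestion-identity \
  --resource-group atp-aks-prod-rg \
  --issuer <aks-oidc-issuer-url> \
  --subject system:serviceaccount:atp-ingest-ns:ingestion-sa \
  --audiences api://AzureADTokenExchange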
Code Examples: - Workload Identity setup - Permission grants - Pod configuration
Deliverables: - Workload Identity guide - Security configuration - Permission matrix
CYCLE 9: Resource Management (~3,500 lines)
Topic 17: Resource Requests & Limits
What will be covered:
- ATP Service Resource Profiles

| Service | CPU Request | CPU Limit | Memory Request | Memory Limit | Notes |
|---------|-------------|-----------|----------------|--------------|-------|
| Gateway | 250m | 1000m | 512Mi | 1Gi | Stateless, low CPU |
| Ingestion | 500m | 2000m | 1Gi | 2Gi | I/O intensive |
| Policy | 250m | 1000m | 512Mi | 1Gi | Cache-heavy, low CPU |
| Projection | 500m | 2000m | 1Gi | 2Gi | DB writes, batch processing |
| Query | 500m | 2000m | 1Gi | 2Gi | DB reads, caching |
| Integrity | 250m | 1000m | 512Mi | 1Gi | Crypto operations, burst |
| Export | 1000m | 4000m | 2Gi | 4Gi | Large file processing |
| Search | 500m | 2000m | 1Gi | 2Gi | Index operations |
| Admin | 250m | 500m | 256Mi | 512Mi | Lightweight, low traffic |

- Right-Sizing Resources
- Monitor actual usage (Prometheus, Azure Monitor)
- Set requests = sustained usage (around the 95th percentile of observed consumption)
- Set limits = max burst capacity
- Leave headroom for spikes
- Quality of Service (QoS) Classes
Guaranteed (requests = limits):
- Critical services (Gateway, Ingestion), provided their requests are raised to match their limits
- Predictable performance
- Last to be evicted under node resource pressure
Burstable (requests < limits):
- Most ATP services
- Can burst above requests if node has capacity
- May be throttled/evicted under pressure
BestEffort (no requests/limits):
- Avoid in production
- Lowest priority, first to be evicted
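The difference between the first two QoS classes comes down to how requests and limits relate on every container. The two Pods below are a minimal sketch of that distinction; the pod names and the image tag placeholder are hypothetical, and the resource values are borrowed from the Gateway profile for illustration.

```yaml
# Guaranteed: every container sets requests equal to limits for CPU and memory.
apiVersion: v1
kind: Pod
metadata:
  name: qos-guaranteed-demo           # hypothetical
  namespace: atp-gateway-ns
spec:
  containers:
    - name: gateway
      image: atpacr.azurecr.io/atp/gateway:<tag>   # placeholder image reference
      resources:
        requests:
          cpu: "1"
          memory: 1Gi
        limits:
          cpu: "1"
          memory: 1Gi
---
# Burstable: requests are set but lower than limits.
apiVersion: v1
kind: Pod
metadata:
  name: qos-burstable-demo            # hypothetical
  namespace: atp-gateway-ns
spec:
  containers:
    - name: gateway
      image: atpacr.azurecr.io/atp/gateway:<tag>   # placeholder image reference
      resources:
        requests:
          cpu: 250m
          memory: 512Mi
        limits:
          cpu: "1"
          memory: 1Gi
```

Kubernetes records the assigned class in the pod's `status.qosClass` field, which is a quick way to confirm that a manifest landed in the intended class.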
Code Examples: - Resource configuration for all services - Monitoring resource usage - Right-sizing analysis
Diagrams: - Resource allocation - QoS classes - Eviction priority
Deliverables: - Resource profile guide - Monitoring dashboard - Right-sizing procedures
Topic 18: Resource Quotas & Cost Control
What will be covered: - Namespace Resource Quotas - Cost Allocation by Namespace - Spot Instances for Non-Critical Workloads - Cluster Autoscaler
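For the spot-instance item, a non-critical workload could opt in to the spot node pool by tolerating its taint and requiring the matching node label. This is a sketch under stated assumptions: the `projection-backfill` workload, its image reference, and its resource values are hypothetical; the taint key and value are the ones listed for the spot pool earlier in this document.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: projection-backfill           # hypothetical non-critical workload
  namespace: atp-projection-ns
spec:
  replicas: 2
  selector:
    matchLabels:
      app: projection-backfill
  template:
    metadata:
      labels:
        app: projection-backfill
    spec:
      # Allow scheduling onto spot nodes despite their taint.
      tolerations:
        - key: kubernetes.azure.com/scalesetpriority
          operator: Equal
          value: spot
          effect: NoSchedule
      # Require spot nodes so the workload never consumes on-demand capacity.
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: kubernetes.azure.com/scalesetpriority
                    operator: In
                    values:
                      - spot
      containers:
        - name: worker
          image: atpacr.azurecr.io/atp/projection:<tag>   # placeholder image reference
          resources:
            requests:
              cpu: 500m
              memory: 1Gi
            limits:
              cpu: "2"
              memory: 2Gi
```

Because spot nodes can be reclaimed at short notice, workloads placed this way should be safe to interrupt and resume, which is why the document reserves the pool for backfills and dev/test projections.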
Code Examples: - Quota enforcement - Cost monitoring - Spot configuration
Deliverables: - Cost control guide - Quota policies - Optimization strategies
CYCLE 10: Horizontal Pod Autoscaler (HPA) (~4,000 lines)
Topic 19: HPA Configuration
What will be covered: - HPA v2 (Metrics-Based Autoscaling)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ingestion-hpa
  namespace: atp-ingest-ns
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ingestion
  minReplicas: 3
  maxReplicas: 30
  # Scale up/down behavior
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100          # Double pods in 60s if needed
          periodSeconds: 60
        - type: Pods
          value: 4            # Or add 4 pods
          periodSeconds: 60
      selectPolicy: Max       # Take max of policies
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 25           # Remove 25% of pods
          periodSeconds: 60
      selectPolicy: Min       # Take min (conservative)
  # Metrics
  metrics:
    # CPU utilization
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    # Memory utilization
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    # Custom metric (request rate)
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"
    # Custom metric (P95 latency)
    - type: Pods
      pods:
        metric:
          name: http_request_duration_p95_ms
        target:
          type: AverageValue
          averageValue: "200"
- HPA for All ATP Services
- Gateway: CPU + request rate + P95 latency
- Ingestion: CPU + ingest rate
- Query: CPU + query rate + P95 latency
- Policy: CPU + cache miss rate
- (Projection, Export use KEDA instead)
Code Examples: - Complete HPA manifests (all services) - Custom metrics integration - Scaling behavior tuning
Diagrams: - HPA architecture - Scaling decision flow - Metrics pipeline
Deliverables: - HPA configuration library - Metrics guide - Tuning procedures
Topic 20: Custom Metrics with Azure Monitor¶
What will be covered: - Azure Monitor Metrics Adapter - Prometheus Adapter - Custom Metric Queries - SLO-Based Scaling
Code Examples: - Metrics adapter setup - Custom metric definitions - SLO-driven autoscaling
Deliverables: - Metrics adapter guide - Custom metrics catalog - SLO integration
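A per-pod metric such as `http_requests_per_second` has to be published through the custom metrics API before an HPA can consume it. One way to do that is a Prometheus Adapter rule; the sketch below assumes the services export a counter named `http_requests_total` carrying `namespace` and `pod` labels, which this plan does not confirm:

```yaml
# Fragment of the Prometheus Adapter configuration (rules section)
rules:
  - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'   # assumed counter name
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "^(.*)_total$"
      as: "${1}_per_second"        # exposed as http_requests_per_second
    metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```

The 2-minute rate window is an arbitrary starting point; a shorter window reacts faster but makes scaling noisier.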
CYCLE 11: KEDA Event-Driven Autoscaling (~4,500 lines)¶
Topic 21: KEDA Architecture¶
What will be covered: - KEDA (Kubernetes Event-Driven Autoscaling)
Why KEDA for ATP?
- Scale based on queue depth (Azure Service Bus)
- Scale to zero when idle (cost savings)
- Event-driven workers (Projection, Export, Integrity)
- Batch job scheduling (ScaledJob with a cron trigger)
- KEDA Installation
- KEDA Components
- Operator: Watches ScaledObjects and ScaledJobs, creates and manages the underlying HPAs
- Metrics API Server: Exposes external (event-source) metrics to the HPA
- Admission Webhooks: Validate ScaledObjects
Code Examples: - KEDA installation - Architecture overview - Component configuration
Diagrams: - KEDA architecture - Scaling flow - Integration with HPA
Deliverables: - KEDA setup guide - Architecture reference - Component overview
Topic 22: KEDA Scalers for ATP¶
What will be covered: - Azure Service Bus Scaler (Projection Workers)
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: projection-worker-scaler
  namespace: atp-projection-ns
spec:
  scaleTargetRef:
    name: projection-worker
  pollingInterval: 10       # Check every 10 seconds
  cooldownPeriod: 120       # Wait 2 min after the last active trigger before scaling to zero
  minReplicaCount: 0        # Scale to zero when no messages
  maxReplicaCount: 50
  advanced:
    restoreToOriginalReplicaCount: false
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300
          policies:
            - type: Percent
              value: 50
              periodSeconds: 60
  triggers:
    - type: azure-servicebus
      metadata:
        namespace: sb-atp-prod
        topicName: audit.appended.v1
        subscriptionName: projection-sub
        messageCount: "400"            # Target: 400 messages per replica
        activationMessageCount: "0"    # Activate from zero once the count exceeds 0 (i.e. ≥1 message)
        cloud: AzurePublicCloud
      authenticationRef:
        name: keda-trigger-auth-asb
---
# Authentication using Workload Identity
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: keda-trigger-auth-asb
  namespace: atp-projection-ns
spec:
  podIdentity:
    provider: azure-workload
    identityId: "<managed-identity-client-id>"
- Redis Scaler (Export Jobs Queue)
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: export-worker-scaler
  namespace: atp-export-ns
spec:
  scaleTargetRef:
    name: export-worker
  minReplicaCount: 0
  maxReplicaCount: 10
  triggers:
    - type: redis
      metadata:
        addressFromEnv: REDIS_HOST   # Env var on the target container; must hold host:port
        listName: export-jobs
        listLength: "10"             # Target: 10 queued jobs per replica
        databaseIndex: "0"
      authenticationRef:
        name: keda-redis-auth
- Prometheus Scaler (Custom Metrics)
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: query-latency-scaler
  namespace: atp-query-ns
spec:
  scaleTargetRef:
    name: query
  minReplicaCount: 3
  maxReplicaCount: 30
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        metricName: http_request_duration_p95_seconds
        query: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service="query"}[2m])) by (le))
        threshold: "0.2"   # Scale up if P95 > 200ms
- Cron Scaler (Scheduled Jobs)
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: daily-compliance-report
  namespace: atp-admin-ns
spec:
  jobTargetRef:
    template:
      spec:
        containers:
          - name: report-generator
            image: atpacr.azurecr.io/atp/report-generator:1.0.0
        restartPolicy: Never
  pollingInterval: 30
  maxReplicaCount: 1
  triggers:
    - type: cron
      metadata:
        timezone: America/New_York
        start: 0 2 * * *         # Scaling window opens at 2 AM daily
        end: 0 3 * * *           # Scaling window closes at 3 AM
        desiredReplicas: "1"     # Required by the cron scaler
Code Examples: - Complete KEDA scaler library (all ATP use cases) - Authentication configurations - Scaling policies
Diagrams: - KEDA scaler types - Event-driven scaling flow - Scale-to-zero timeline
Deliverables: - KEDA scaler catalog - Configuration guide - Scaling strategies
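The Redis scaler above references a `TriggerAuthentication` named `keda-redis-auth` that the plan does not yet define. A minimal sketch of what it could look like, assuming the Redis password lives in a Kubernetes Secret; the Secret name and key are hypothetical:

```yaml
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: keda-redis-auth
  namespace: atp-export-ns
spec:
  secretTargetRef:
    - parameter: password          # Redis scaler auth parameter
      name: redis-credentials      # hypothetical Secret name
      key: redis-password          # hypothetical key within the Secret
```

If the Secret is synced from Key Vault through the CSI driver, the same reference works unchanged, since KEDA only reads the resulting Kubernetes Secret.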
CYCLE 12: Pod Security Standards (~3,000 lines)¶
Topic 23: Pod Security Standards¶
What will be covered: - Pod Security Standards (PSS)
Privileged: Unrestricted (avoid)
Baseline: Minimally restrictive (dev/test)
Restricted: Hardened (production ATP)
- Restricted Pod Security
# Namespace-level enforcement
apiVersion: v1
kind: Namespace
metadata:
  name: atp-ingest-ns
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
# Pod security context (compliant)
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  runAsGroup: 3000
  fsGroup: 2000
  seccompProfile:
    type: RuntimeDefault
# Container security context
containers:
  - name: ingestion
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      runAsNonRoot: true
      runAsUser: 1000
      capabilities:
        drop:
          - ALL
Code Examples: - Pod Security Standard configurations - Compliant pod templates - Validation policies
Diagrams: - PSS levels - Security context flow
Deliverables: - Pod security guide - Compliant templates - Validation rules
Topic 24: Runtime Security with Azure Policy¶
What will be covered: - Azure Policy for Kubernetes - OPA Gatekeeper - Admission Controller Policies - Image Scanning (Trivy, Defender)
Code Examples: - Policy definitions - Admission webhooks - Image scan integration
Deliverables: - Security policy catalog - Enforcement procedures - Scanning workflows
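As one example of an admission policy, a Gatekeeper constraint can restrict pods to images from the ATP registry. This sketch assumes the `K8sAllowedRepos` ConstraintTemplate from the community Gatekeeper library is already installed in the cluster; the constraint name and namespace scope are illustrative:

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sAllowedRepos
metadata:
  name: atp-allowed-repos          # hypothetical name
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    namespaces:
      - atp-ingest-ns
  parameters:
    repos:
      - "atpacr.azurecr.io/"       # image references must start with this prefix
```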
CYCLE 13: Network Policies (~3,000 lines)¶
Topic 25: Network Isolation¶
What will be covered: - Network Policy Fundamentals
# Default deny all ingress
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: atp-ingest-ns
spec:
  podSelector: {}
  policyTypes:
    - Ingress
---
# Allow ingress from Gateway only
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-gateway
  namespace: atp-ingest-ns
spec:
  podSelector:
    matchLabels:
      app: ingestion
  policyTypes:
    - Ingress
  ingress:
    - from:
        # namespaceSelector and podSelector in the same entry are ANDed
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: atp-gateway-ns   # label set automatically on every namespace
          podSelector:
            matchLabels:
              app: gateway
      ports:
        - protocol: TCP
          port: 8080
        - protocol: TCP
          port: 9090
---
# Allow egress to Azure SQL
# Note: once an Egress policy selects a pod, all other egress (including DNS)
# is denied unless another policy allows it.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-egress-sql
  namespace: atp-ingest-ns
spec:
  podSelector:
    matchLabels:
      app: ingestion
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 10.0.0.0/8   # Azure SQL private endpoint (narrow to its subnet where possible)
      ports:
        - protocol: TCP
          port: 1433
Code Examples: - Network policy manifests (all ATP services) - Zero-trust network model - Egress control
Diagrams: - Network policy architecture - Traffic flow with policies
Deliverables: - Network policy library - Zero-trust configuration - Traffic matrix
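Because the egress policy for Azure SQL restricts the ingestion pods to port 1433, name resolution would stop working without a companion rule. A minimal sketch of a DNS allowance, assuming the cluster DNS pods run in `kube-system` with the label `k8s-app: kube-dns` (the usual CoreDNS labelling, but worth verifying on the target cluster):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-egress-dns           # hypothetical name
  namespace: atp-ingest-ns
spec:
  podSelector: {}                  # applies to every pod in the namespace
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```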
Topic 26: Service Mesh Network Policies¶
What will be covered: - Istio AuthorizationPolicy - Linkerd Server Policies - mTLS Enforcement
Code Examples: - Mesh-native policies - mTLS verification
Deliverables: - Mesh policy guide - Security configuration
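A mesh-native counterpart to the Kubernetes NetworkPolicy could look like the Istio `AuthorizationPolicy` sketched here. It assumes the Gateway pods run under a service account named `gateway` in `atp-gateway-ns` and that the cluster uses the default `cluster.local` trust domain; neither is stated in this plan. On Istio releases before 1.22, the API version would be `security.istio.io/v1beta1`.

```yaml
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: ingestion-allow-gateway    # hypothetical name
  namespace: atp-ingest-ns
spec:
  selector:
    matchLabels:
      app: ingestion
  action: ALLOW
  rules:
    - from:
        - source:
            # mTLS identity of the caller (assumed service account)
            principals: ["cluster.local/ns/atp-gateway-ns/sa/gateway"]
      to:
        - operation:
            ports: ["8080"]
```

Unlike an IP-based NetworkPolicy, this rule is evaluated against the workload identity in the peer certificate, so it only works when mTLS is active between the two services.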
CYCLE 14: Service Mesh & mTLS (~3,500 lines)¶
Topic 27: Service Mesh Setup¶
What will be covered: - Istio Installation
# Install Istio (the built-in "default" profile is the one recommended for production)
istioctl install --set profile=default -y
# Enable sidecar injection
kubectl label namespace atp-ingest-ns istio-injection=enabled
- Linkerd Installation
- mTLS Configuration
Code Examples: - Service mesh installation - mTLS configuration - Traffic policies
Diagrams: - Service mesh architecture - mTLS certificate flow
Deliverables: - Mesh setup guide - mTLS enforcement - Traffic management
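For the mTLS configuration item, a namespace-wide strict policy in Istio is a small manifest. This is a sketch assuming Istio is the chosen mesh (the plan also lists Linkerd as an option); on Istio releases before 1.22 the API version would be `security.istio.io/v1beta1`.

```yaml
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: default                    # "default" makes it the namespace-wide policy
  namespace: atp-ingest-ns
spec:
  mtls:
    mode: STRICT                   # reject plaintext traffic to sidecar-injected pods
```

Rolling out with `mode: PERMISSIVE` first and switching to `STRICT` once all callers have sidecars avoids dropping traffic from workloads that are not yet meshed.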
Topic 28: Observability with Service Mesh¶
What will be covered: - Distributed Tracing - Traffic Metrics - Service Graph Visualization - Latency Monitoring
Code Examples: - Mesh observability configuration - Kiali/Jaeger integration
Deliverables: - Observability guide - Dashboard templates
CYCLE 15: RBAC & Workload Identity (~3,000 lines)¶
Topic 29: Kubernetes RBAC¶
What will be covered: - Role-Based Access Control
# Role (namespace-scoped)
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: atp-ingest-ns
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
---
# RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: atp-ingest-ns
subjects:
  - kind: ServiceAccount
    name: developer-sa
    namespace: atp-ingest-ns
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
---
# ClusterRole (cluster-wide)
# Wildcards grant full control, equivalent to cluster-admin; bind sparingly.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: atp-admin
rules:
  - apiGroups: ["*"]
    resources: ["*"]
    verbs: ["*"]
Code Examples: - Complete RBAC configuration - Role library - Binding templates
Deliverables: - RBAC guide - Role catalog - Access control policies
Topic 30: Azure AD Workload Identity¶
What will be covered: - Workload Identity Federation - Managed Identity Assignment - Azure Resource Access - Zero Secrets (Keyless Authentication)
Code Examples: - Workload Identity setup - Federated identity credentials - Azure resource access
Deliverables: - Workload Identity guide - Identity management - Security best practices
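On the Kubernetes side, Workload Identity comes down to an annotated service account plus a label on the pod template. A minimal sketch; the service account name is hypothetical, and the federated identity credential linking it to the managed identity is created separately in Azure:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ingestion-sa               # hypothetical name
  namespace: atp-ingest-ns
  annotations:
    azure.workload.identity/client-id: "<managed-identity-client-id>"
---
# Pod template fragment of the Deployment that uses the identity
spec:
  template:
    metadata:
      labels:
        app: ingestion
        azure.workload.identity/use: "true"   # opts the pod in to token injection
    spec:
      serviceAccountName: ingestion-sa
```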
CYCLE 16: Helm Charts (~4,000 lines)¶
Topic 31: Helm Chart Structure¶
What will be covered: - ATP Helm Chart Architecture
charts/atp/
├── Chart.yaml              # Chart metadata
├── values.yaml             # Default values
├── values.dev.yaml         # Dev environment overrides
├── values.prod.yaml        # Prod environment overrides
├── values.us.yaml          # US region overrides
├── values.eu.yaml          # EU region overrides
├── templates/
│   ├── _helpers.tpl        # Template helpers
│   ├── namespace.yaml
│   ├── gateway/
│   │   ├── deployment.yaml
│   │   ├── service.yaml
│   │   ├── hpa.yaml
│   │   └── ingress.yaml
│   ├── ingestion/
│   │   ├── deployment.yaml
│   │   ├── service.yaml
│   │   └── hpa.yaml
│   ├── projection/
│   │   ├── deployment.yaml
│   │   ├── service.yaml
│   │   └── keda-scaledobject.yaml
│   ├── query/
│   │   ├── deployment.yaml
│   │   ├── service.yaml
│   │   └── hpa.yaml
│   ├── export/
│   │   ├── deployment.yaml
│   │   ├── service.yaml
│   │   └── keda-scaledobject.yaml
│   ├── shared/
│   │   ├── configmap.yaml
│   │   ├── secret-provider-class.yaml
│   │   └── network-policies.yaml
│   └── monitoring/
│       ├── servicemonitor.yaml
│       └── prometheusrule.yaml
└── .helmignore
Code Examples: - Complete Helm chart - Template syntax - Values organization
Diagrams: - Helm chart structure - Value inheritance
Deliverables: - Helm chart repository - Templating guide - Values reference
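The `Chart.yaml` at the root of this layout carries the chart metadata. A minimal sketch for a Helm 3 chart; the description and version numbers are placeholders rather than values taken from the ATP repository:

```yaml
apiVersion: v2                     # Helm 3 chart format
name: atp
description: Audit Trail Platform microservices   # placeholder text
type: application
version: 0.1.0                     # chart version (placeholder)
appVersion: "1.2.3"                # application version (placeholder)
```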
Topic 32: Helm Deployment¶
What will be covered: - Helm Installation
# Install/upgrade ATP (later --values files override earlier ones)
helm upgrade --install atp ./charts/atp \
  --namespace atp-system \
  --create-namespace \
  --values ./charts/atp/values.prod.yaml \
  --values ./charts/atp/values.us.yaml \
  --set image.tag=1.2.3 \
  --set global.edition=enterprise \
  --wait --timeout 10m
# Verify release
helm list -n atp-system
helm status atp -n atp-system
# Rollback to revision 1 (list revisions with: helm history atp -n atp-system)
helm rollback atp 1 -n atp-system
Code Examples: - Deployment commands - Value overrides - Rollback procedures
Deliverables: - Deployment guide - Operations procedures - Rollback strategies
CYCLE 17: GitOps with FluxCD (~3,500 lines)¶
Topic 33: FluxCD Setup¶
What will be covered: - FluxCD Installation - GitRepository Source - Kustomization - HelmRelease - Automated Reconciliation
Code Examples: - FluxCD configuration - GitOps workflow
Deliverables: - FluxCD setup guide - GitOps workflow
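The GitRepository and HelmRelease resources listed above could be wired together roughly as follows. The repository URL is a placeholder, the chart path assumes the chart sits at `charts/atp` in that repository, and the API versions assume a recent Flux release (HelmRelease `v2` arrived with Flux 2.3; older installs use `v2beta2`):

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: atp
  namespace: flux-system
spec:
  interval: 1m
  url: https://example.com/org/atp-infra.git   # placeholder URL
  ref:
    branch: main
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: atp
  namespace: atp-system
spec:
  interval: 10m
  chart:
    spec:
      chart: ./charts/atp
      sourceRef:
        kind: GitRepository
        name: atp
        namespace: flux-system
      valuesFiles:                 # later files override earlier ones
        - ./charts/atp/values.yaml
        - ./charts/atp/values.prod.yaml
  values:
    image:
      tag: "1.2.3"
```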
Topic 34: Declarative Deployments¶
What will be covered: - Git as Source of Truth - Automated Sync - Drift Detection - Notification Hooks
Code Examples: - FluxCD resources - Sync configuration
Deliverables: - Declarative deployment guide - Sync policies
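Automated sync and drift correction are mostly a matter of a Flux `Kustomization` with pruning enabled. A minimal sketch; the repository path `./clusters/prod` and the `GitRepository` name `atp` are assumptions about the repo layout:

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: atp-prod                   # hypothetical name
  namespace: flux-system
spec:
  interval: 5m                     # re-apply on this cadence, reverting manual drift
  path: ./clusters/prod            # assumed path inside the Git repository
  prune: true                      # delete cluster objects removed from Git
  wait: true                       # wait for applied resources to become ready
  timeout: 3m
  sourceRef:
    kind: GitRepository
    name: atp
```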
CYCLE 18: Monitoring & Observability (~3,000 lines)¶
Topic 35: Container Insights¶
What will be covered: - Azure Monitor Container Insights - Prometheus Integration - Grafana Dashboards - Log Analytics
Code Examples: - Monitoring setup - Dashboard configurations
Deliverables: - Monitoring guide - Dashboard library
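For the Prometheus integration, scrape targets are typically declared with a `ServiceMonitor`. This sketch assumes the Prometheus Operator CRDs are installed and that the ingestion Service exposes a port named `metrics`; the port name and scrape path are hypothetical:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ingestion                  # hypothetical name
  namespace: atp-ingest-ns
spec:
  selector:
    matchLabels:
      app: ingestion               # must match labels on the Service, not the pods
  endpoints:
    - port: metrics                # assumed named port on the Service
      path: /metrics
      interval: 30s
```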
Topic 36: OpenTelemetry in Kubernetes¶
What will be covered: - OTel Collector DaemonSet - Trace/Metric/Log Collection - Azure Monitor Export
Code Examples: - OTel configuration - Collector deployment
Deliverables: - OTel setup guide - Export configuration
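A Collector configuration for this topic might be structured as below. It is a sketch that assumes the contrib distribution of the Collector (which ships the `azuremonitor` exporter) and an environment variable named `APPLICATIONINSIGHTS_CONNECTION_STRING` injected into the Collector pods; both are assumptions, not decisions recorded in this plan:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch: {}                        # batch telemetry before export
exporters:
  azuremonitor:
    connection_string: ${env:APPLICATIONINSIGHTS_CONNECTION_STRING}
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [azuremonitor]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [azuremonitor]
```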
CYCLE 19: Operations & Troubleshooting (~3,000 lines)¶
Topic 37: Operational Tasks¶
What will be covered: - Common kubectl Commands - Log Viewing - Pod Debugging - Resource Inspection
Code Examples: - Operations cookbook
Deliverables: - Operations guide - Troubleshooting procedures
Topic 38: Troubleshooting Common Issues¶
What will be covered: - Pod Not Starting - Image Pull Errors - CrashLoopBackOff - Network Issues - Resource Exhaustion
Code Examples: - Debug procedures
Deliverables: - Troubleshooting guide - Problem catalog
CYCLE 20: Best Practices & Disaster Recovery (~3,000 lines)¶
Topic 39: Kubernetes Best Practices¶
What will be covered: - Design Best Practices - Security Hardening - Performance Optimization - Cost Management
Deliverables: - Best practices handbook
Topic 40: Disaster Recovery¶
What will be covered: - Cluster Backup - Disaster Recovery Procedures - Multi-Region Failover
Deliverables: - DR guide - Failover procedures
Summary of Deliverables¶
Across all 20 cycles, this documentation will provide:
- Cluster Architecture: AKS setup, node pools, networking
- Namespaces: Organization, quotas, isolation
- Workloads: Deployments, StatefulSets, DaemonSets, Jobs
- Networking: Services, Ingress, Load Balancing, Service Mesh
- Configuration: ConfigMaps, Secrets, Key Vault CSI
- Resource Management: Requests, limits, quotas, QoS
- Autoscaling: HPA, KEDA, custom metrics
- Security: Pod Security, Network Policies, RBAC, Workload Identity
- Packaging: Helm charts, templating, values
- GitOps: FluxCD, declarative deployments
- Observability: Monitoring, logging, tracing
- Operations: kubectl, troubleshooting, DR
Related Documentation¶
- Pulumi IaC: Infrastructure provisioning
- GitOps: Continuous delivery
- Deployment Views: ATP topology
- Template Integration: Service templates
- Configuration: App configuration
- Observability: Monitoring and tracing
- Security: Security architecture
- Disaster Recovery: DR procedures
This documentation plan covers the complete Kubernetes infrastructure for ATP: AKS cluster setup and namespace organization, workload deployments and service networking, autoscaling with HPA and KEDA, security hardening with Pod Security Standards and Network Policies, service mesh integration, Helm packaging, and GitOps with FluxCD. It closes with monitoring, operational procedures, and disaster recovery, so that ATP microservices can run at scale securely, reliably, and cost-efficiently.