Kubernetes Infrastructure - Audit Trail Platform (ATP)

Cloud-native, auto-scaling, zero-trust — ATP runs on Azure Kubernetes Service (AKS) with namespace isolation, service mesh (mTLS), KEDA event-driven autoscaling, Pod Security Standards, Helm packaging, and FluxCD GitOps for declarative, auditable, and resilient microservice orchestration.

📋 Documentation Generation Plan

This document will be generated in 20 cycles. Current progress:

Cycle	Topics	Estimated Lines	Status
Cycle 1	Kubernetes Architecture & AKS Overview (1-2)	~3,500	⏳ Not Started
Cycle 2	Cluster Setup & Node Pools (3-4)	~3,000	⏳ Not Started
Cycle 3	Namespace Organization (5-6)	~3,000	⏳ Not Started
Cycle 4	Deployment Workloads (7-8)	~4,000	⏳ Not Started
Cycle 5	Service Networking (9-10)	~3,500	⏳ Not Started
Cycle 6	Ingress & Load Balancing (11-12)	~3,500	⏳ Not Started
Cycle 7	ConfigMaps & Secrets (13-14)	~3,000	⏳ Not Started
Cycle 8	Azure Key Vault CSI Driver (15-16)	~3,000	⏳ Not Started
Cycle 9	Resource Management (17-18)	~3,500	⏳ Not Started
Cycle 10	Horizontal Pod Autoscaler (HPA) (19-20)	~4,000	⏳ Not Started
Cycle 11	KEDA Event-Driven Autoscaling (21-22)	~4,500	⏳ Not Started
Cycle 12	Pod Security Standards (23-24)	~3,000	⏳ Not Started
Cycle 13	Network Policies (25-26)	~3,000	⏳ Not Started
Cycle 14	Service Mesh & mTLS (27-28)	~3,500	⏳ Not Started
Cycle 15	RBAC & Workload Identity (29-30)	~3,000	⏳ Not Started
Cycle 16	Helm Charts (31-32)	~4,000	⏳ Not Started
Cycle 17	GitOps with FluxCD (33-34)	~3,500	⏳ Not Started
Cycle 18	Monitoring & Observability (35-36)	~3,000	⏳ Not Started
Cycle 19	Operations & Troubleshooting (37-38)	~3,000	⏳ Not Started
Cycle 20	Best Practices & Disaster Recovery (39-40)	~3,000	⏳ Not Started

Total Estimated Lines: ~67,500

Purpose & Scope

This document provides the complete Kubernetes infrastructure guide for ATP, covering Azure Kubernetes Service (AKS) cluster setup, namespace organization, workload deployments, service networking, autoscaling, security policies, Helm charts, GitOps, and operational best practices for running ATP's microservices at scale.

Why Kubernetes for ATP?

Cloud-Native: Container orchestration with declarative configuration
Scalability: Auto-scaling (HPA/KEDA) based on load and events
Resilience: Self-healing, rolling updates, health checks
Multi-Tenancy: Namespace isolation, resource quotas, network policies
Security: Pod Security Standards, mTLS service mesh, RBAC, Workload Identity
Portability: Run anywhere (Azure, on-prem, multi-cloud)
Observability: Integrated monitoring, logging, tracing
Cost Optimization: Efficient resource utilization, scale-to-zero with KEDA
GitOps: Declarative, auditable infrastructure as code
Ecosystem: Rich tooling (Helm, Kustomize, Flux, Istio/Linkerd, KEDA)

ATP Kubernetes Topology

Azure Front Door (AFD) + WAF
    ↓
API Management (APIM) / Ingress Controller
    ↓
AKS Cluster (3 regions: US, EU, IL)
    ├── System Ring
    │   ├── OTel Collector (DaemonSet)
    │   ├── Mesh Control Plane (Istio/Linkerd)
    │   ├── FluxCD Controllers
    │   ├── KEDA Operator
    │   └── Monitoring (Prometheus, Grafana)
    │
    └── User Ring (ATP Namespaces)
        ├── atp-gateway-ns (Gateway pods)
        ├── atp-ingest-ns (Ingestion pods)
        ├── atp-policy-ns (Policy pods)
        ├── atp-projection-ns (Projection workers, KEDA)
        ├── atp-query-ns (Query API pods)
        ├── atp-integrity-ns (Integrity workers)
        ├── atp-export-ns (Export workers, KEDA)
        ├── atp-search-ns (Search service, optional)
        └── atp-admin-ns (Admin console)
    ↓
Azure Services (Service Bus, SQL, Blob, Key Vault, Monitor)

Key Technologies

AKS (Azure Kubernetes Service): Managed Kubernetes
Service Mesh: Istio or Linkerd for mTLS, observability
KEDA: Kubernetes Event-Driven Autoscaling (scale to zero)
HPA: Horizontal Pod Autoscaler (CPU/memory/custom metrics)
Helm: Package manager for Kubernetes
FluxCD: GitOps continuous delivery
Azure Key Vault CSI Driver: Secrets injection
Azure Monitor: Container Insights, Log Analytics, Application Insights
Prometheus & Grafana: Metrics and dashboards
Calico/Azure CNI: Network policies

Detailed Cycle Plan

CYCLE 1: Kubernetes Architecture & AKS Overview (~3,500 lines)

Topic 1: Kubernetes Fundamentals

What will be covered:
- What is Kubernetes?
  - Container orchestration platform
  - Declarative configuration (YAML)
  - Desired state reconciliation
  - Self-healing, auto-scaling, rolling updates

Kubernetes Core Concepts

Cluster:
- Control Plane: API Server, Scheduler, Controller Manager, etcd
- Worker Nodes: kubelet, kube-proxy, container runtime (containerd)

Workloads:
- Pod: Smallest deployable unit (1+ containers)
- Deployment: Manages ReplicaSets, rolling updates
- StatefulSet: For stateful applications (stable identity)
- DaemonSet: One pod per node (monitoring, logging)
- Job/CronJob: Run-to-completion tasks

Networking:
- Service: Stable endpoint for pod group (ClusterIP, LoadBalancer)
- Ingress: HTTP/HTTPS routing to services
- NetworkPolicy: Firewall rules for pods

Configuration:
- ConfigMap: Non-sensitive configuration
- Secret: Sensitive data (passwords, tokens)
- PersistentVolume: Durable storage

Security:
- ServiceAccount: Pod identity
- RBAC: Role-based access control
- Pod Security Standards: Security policies

Observability:
- Logs: stdout/stderr streams
- Metrics: CPU, memory, custom metrics
- Events: Cluster events (pod scheduled, image pulled)

Why AKS (Azure Kubernetes Service)?

Managed Control Plane:
- No manual k8s master management
- Microsoft maintains API server, etcd, scheduler
- Automatic updates and patches

Azure Integration:
- Azure CNI (VNet integration)
- Azure Monitor Container Insights
- Azure Key Vault CSI Driver
- Azure AD Workload Identity
- Azure Blob CSI (persistent volumes)
- Azure Service Bus (for KEDA triggers)

Enterprise Features:
- Availability Zones support
- Node auto-scaling
- Managed node pools
- Azure Policy for Kubernetes
- Defender for Containers (security)

Cost Optimization:
- Pay only for worker nodes
- Spot instances (low-priority workloads)
- Auto-shutdown for dev/test

Code Examples: - Kubernetes architecture diagram - AKS cluster creation (Azure CLI, Bicep, Pulumi) - kubectl basic commands

Diagrams: - Kubernetes architecture - AKS control plane vs. data plane - ATP on AKS topology

Deliverables: - Kubernetes fundamentals primer - AKS benefits for ATP - Cluster architecture overview

Topic 2: ATP AKS Cluster Architecture

What will be covered: - Cluster Configuration

Cluster Name: atp-aks-{region}-{env}
Examples:
- atp-aks-useast-prod
- atp-aks-euwest-prod
- atp-aks-ilcentral-prod
- atp-aks-useast-dev

Kubernetes Version: 1.28+ (managed, auto-upgrade minor versions)

Regions:
- Primary: East US (eastus)
- Secondary: West Europe (westeurope)
- Tertiary: Israel Central (israelcentral)

Availability Zones: 3 per region
- Zone 1, Zone 2, Zone 3
- Distribute node pools across zones
- Distribute pods across zones (pod anti-affinity)

Node Pools Strategy

System Node Pool (npsystem):
- VM Size: Standard_D4s_v5 (4 vCPU, 16 GB RAM)
- Count: 3 (one per AZ)
- Auto-scale: No (fixed)
- Taint: CriticalAddonsOnly=true:NoSchedule
- Purpose: Cluster system add-ons (CoreDNS, metrics-server), mesh control plane, KEDA, FluxCD, OTel

Generic Node Pool (npgeneric):
- VM Size: Standard_D8s_v5 (8 vCPU, 32 GB RAM)
- Count: 3-30 (auto-scale)
- Taint: None
- Purpose: Stateless APIs (Gateway, Query, Admin, Policy)

I/O Node Pool (npio):
- VM Size: Standard_E8s_v5 (8 vCPU, 64 GB RAM, premium storage)
- Count: 2-50 (auto-scale)
- Taint: workload=io:NoSchedule
- Purpose: I/O-heavy (Ingestion, Projection, Export, Integrity)

Jobs Node Pool (npjobs) - Optional:
- VM Size: Standard_F16s_v2 (16 vCPU, 32 GB RAM, compute-optimized)
- Count: 0-20 (cluster autoscaler; drains to zero once KEDA has scaled the job workloads to zero)
- Taint: workload=batch:NoSchedule
- Purpose: Export jobs, maintenance tasks, compliance reports

Spot Node Pool (npspot) - Optional:
- VM Size: Standard_D8s_v5
- Count: 0-10
- Spot Priority: Yes (up to ~80% cost savings; the actual discount varies by region and VM size)
- Taint: kubernetes.azure.com/scalesetpriority=spot:NoSchedule
- Purpose: Non-critical workloads (dev/test projections, backfills)
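The node pool figures above imply a compute envelope per cluster, which is useful when checking regional vCPU quotas. The following is a minimal sketch of that arithmetic in Python; the `NodePool` type and the short pool labels are illustrative names rather than part of ATP, and the numbers are copied from the strategy above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodePool:
    label: str           # illustrative label, not an AKS resource name
    vcpus_per_node: int  # taken from the VM size listed above
    min_nodes: int
    max_nodes: int


# Figures copied from the node pool strategy above.
POOLS = [
    NodePool("system", 4, 3, 3),     # fixed size, no autoscaling
    NodePool("generic", 8, 3, 30),
    NodePool("io", 8, 2, 50),
    NodePool("jobs", 16, 0, 20),     # optional pool
    NodePool("spot", 8, 0, 10),      # optional pool
]


def vcpu_envelope(pools):
    """Return (minimum, maximum) total vCPUs across all pools."""
    low = sum(p.vcpus_per_node * p.min_nodes for p in pools)
    high = sum(p.vcpus_per_node * p.max_nodes for p in pools)
    return low, high


if __name__ == "__main__":
    low, high = vcpu_envelope(POOLS)
    print(f"vCPU envelope per cluster: {low} to {high}")
```

If every pool reaches its maximum at the same time, the cluster needs roughly a thousand vCPUs, so the subscription's regional quota should be checked against the upper bound rather than the steady-state size.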

Network Configuration

CNI Plugin: Azure CNI (VNet integration)

VNet CIDR: 10.42.0.0/16
- Subnet: aks-nodes (10.42.0.0/20) - 4096 IPs for nodes
- Subnet: aks-pods (10.42.16.0/20) - 4096 IPs for pods
- Subnet: aks-services (10.42.32.0/24) - 256 IPs for services

Service CIDR: 10.43.0.0/16 (internal cluster services)
DNS Service IP: 10.43.0.10

Network Policy: Azure Network Policy or Calico

Load Balancer:
- Type: Standard Load Balancer
- Public IP: Static (for ingress controller)
- Private endpoints: For Azure services (SQL, Storage, Key Vault)
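The address plan above can be sanity-checked before provisioning. The sketch below uses Python's standard `ipaddress` module to confirm that each subnet sits inside the VNet, that subnets do not overlap, and that the service CIDR and DNS service IP are consistent; the function name `check_plan` is an illustrative choice.

```python
import ipaddress

# Values copied from the network configuration above.
VNET = ipaddress.ip_network("10.42.0.0/16")
SUBNETS = {
    "aks-nodes": ipaddress.ip_network("10.42.0.0/20"),
    "aks-pods": ipaddress.ip_network("10.42.16.0/20"),
    "aks-services": ipaddress.ip_network("10.42.32.0/24"),
}
SERVICE_CIDR = ipaddress.ip_network("10.43.0.0/16")
DNS_SERVICE_IP = ipaddress.ip_address("10.43.0.10")


def check_plan():
    """Return a list of problems; an empty list means the plan is consistent."""
    problems = []
    for name, net in SUBNETS.items():
        if not net.subnet_of(VNET):
            problems.append(f"{name} {net} is outside VNet {VNET}")
    names = list(SUBNETS)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if SUBNETS[first].overlaps(SUBNETS[second]):
                problems.append(f"{first} overlaps {second}")
    # The service CIDR is virtual and should not collide with VNet ranges.
    if SERVICE_CIDR.overlaps(VNET):
        problems.append("service CIDR overlaps the VNet")
    if DNS_SERVICE_IP not in SERVICE_CIDR:
        problems.append("DNS service IP is outside the service CIDR")
    return problems


if __name__ == "__main__":
    for name, net in SUBNETS.items():
        print(f"{name}: {net} -> {net.num_addresses} addresses")
    print(check_plan() or "plan is consistent")
```

The address counts printed here are raw block sizes. Azure reserves five addresses in every subnet, so the usable count per subnet is slightly lower than the 4096 and 256 quoted above.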

Code Examples: - AKS cluster creation (Pulumi C#) - Node pool configuration - Network setup

Diagrams: - AKS cluster architecture - Node pool distribution across AZs - Network topology

Deliverables: - AKS cluster specification - Node pool strategy - Network architecture

CYCLE 2: Cluster Setup & Node Pools (~3,000 lines)

Topic 3: AKS Cluster Provisioning

What will be covered: - Pulumi IaC for AKS

// Assumed usings; resourceGroup, location and logAnalyticsWorkspace are defined elsewhere in the stack
using Pulumi;
using AzureNative = Pulumi.AzureNative;
using Pulumi.AzureNative.ContainerService;
using Pulumi.AzureNative.ContainerService.Inputs;

// Create AKS cluster with Pulumi
var cluster = new AzureNative.ContainerService.ManagedCluster("atp-aks-prod", new()
{
    ResourceGroupName = resourceGroup.Name,
    Location = location,
    DnsPrefix = "atp-prod",
    KubernetesVersion = "1.28",
    EnableRBAC = true,

    // Cluster identity (system-assigned managed identity)
    Identity = new ManagedClusterIdentityArgs
    {
        Type = ResourceIdentityType.SystemAssigned
    },

    // Network profile
    NetworkProfile = new ContainerServiceNetworkProfileArgs
    {
        NetworkPlugin = "azure",
        NetworkPolicy = "calico",
        ServiceCidr = "10.43.0.0/16",
        DnsServiceIP = "10.43.0.10",
        LoadBalancerSku = "standard"
    },

    // Add-ons
    AddonProfiles = new InputMap<ManagedClusterAddonProfileArgs>
    {
        ["azureKeyvaultSecretsProvider"] = new ManagedClusterAddonProfileArgs
        {
            Enabled = true,
            Config = new InputMap<string>
            {
                ["enableSecretRotation"] = "true",
                ["rotationPollInterval"] = "2m"
            }
        },
        ["omsAgent"] = new ManagedClusterAddonProfileArgs
        {
            Enabled = true,
            Config = new InputMap<string>
            {
                ["logAnalyticsWorkspaceResourceID"] = logAnalyticsWorkspace.Id
            }
        }
    },

    // System node pool
    AgentPoolProfiles = new[]
    {
        new ManagedClusterAgentPoolProfileArgs
        {
            Name = "npsystem",
            Count = 3,
            VmSize = "Standard_D4s_v5",
            OsType = "Linux",
            Mode = "System",
            AvailabilityZones = new[] { "1", "2", "3" },
            NodeTaints = new[] { "CriticalAddonsOnly=true:NoSchedule" }
        }
    }
});

// Add generic node pool
var genericNodePool = new AzureNative.ContainerService.AgentPool("np-generic", new()
{
    ResourceGroupName = resourceGroup.Name,
    ResourceName = cluster.Name,
    AgentPoolName = "npgeneric",
    Count = 3,
    VmSize = "Standard_D8s_v5",
    OsType = "Linux",
    Mode = "User",
    AvailabilityZones = new[] { "1", "2", "3" },
    EnableAutoScaling = true,
    MinCount = 3,
    MaxCount = 30
});

// Add I/O node pool
var ioNodePool = new AzureNative.ContainerService.AgentPool("np-io", new()
{
    ResourceGroupName = resourceGroup.Name,
    ResourceName = cluster.Name,
    AgentPoolName = "npio",
    Count = 2,
    VmSize = "Standard_E8s_v5",
    OsType = "Linux",
    Mode = "User",
    AvailabilityZones = new[] { "1", "2", "3" },
    EnableAutoScaling = true,
    MinCount = 2,
    MaxCount = 50,
    NodeTaints = new[] { "workload=io:NoSchedule" }
});

Azure CLI Alternative

# Create resource group
az group create --name atp-aks-prod-rg --location eastus

# Create AKS cluster
az aks create \
    --resource-group atp-aks-prod-rg \
    --name atp-aks-useast-prod \
    --node-count 3 \
    --node-vm-size Standard_D4s_v5 \
    --kubernetes-version 1.28 \
    --network-plugin azure \
    --network-policy calico \
    --enable-managed-identity \
    --enable-addons monitoring,azure-keyvault-secrets-provider \
    --enable-workload-identity \
    --enable-oidc-issuer \
    --zones 1 2 3 \
    --nodepool-name npsystem \
    --nodepool-taints CriticalAddonsOnly=true:NoSchedule

# Add generic node pool
az aks nodepool add \
    --resource-group atp-aks-prod-rg \
    --cluster-name atp-aks-useast-prod \
    --name npgeneric \
    --node-count 3 \
    --node-vm-size Standard_D8s_v5 \
    --enable-cluster-autoscaler \
    --min-count 3 \
    --max-count 30 \
    --zones 1 2 3

# Add I/O node pool
az aks nodepool add \
    --resource-group atp-aks-prod-rg \
    --cluster-name atp-aks-useast-prod \
    --name npio \
    --node-count 2 \
    --node-vm-size Standard_E8s_v5 \
    --enable-cluster-autoscaler \
    --min-count 2 \
    --max-count 50 \
    --node-taints workload=io:NoSchedule \
    --zones 1 2 3

Code Examples: - Complete AKS provisioning (Pulumi, Azure CLI, Bicep) - Node pool configurations - Cluster add-ons setup

Diagrams: - AKS provisioning workflow - Node pool architecture

Deliverables: - AKS provisioning guide - Node pool specifications - Add-ons configuration

Topic 4: Connecting to AKS Cluster

What will be covered: - kubectl Configuration

# Get AKS credentials
az aks get-credentials \
    --resource-group atp-aks-prod-rg \
    --name atp-aks-useast-prod

# Verify connection
kubectl cluster-info
kubectl get nodes
kubectl get namespaces

# Switch context (multiple clusters)
kubectl config get-contexts
kubectl config use-context atp-aks-useast-prod

Access Control
- Azure AD integration
- RBAC roles (admin, developer, reader)
- kubeconfig with AAD tokens
- kubectl proxy for local access

Code Examples: - Connection setup - Context management - RBAC configuration

Deliverables: - Access guide - RBAC setup - Security best practices

CYCLE 3: Namespace Organization (~3,000 lines)

Topic 5: Namespace Strategy

What will be covered: - ATP Namespace Taxonomy

# System namespaces (managed)
- kube-system           # Kubernetes core components
- kube-public           # Public cluster info
- flux-system           # FluxCD controllers
- istio-system          # Service mesh control plane
- keda                  # KEDA operator
- monitoring            # Prometheus, Grafana

# ATP application namespaces
- atp-gateway-ns        # API Gateway (public entry point)
- atp-ingest-ns         # Ingestion service
- atp-policy-ns         # Policy service
- atp-projection-ns     # Projection workers
- atp-query-ns          # Query service
- atp-integrity-ns      # Integrity service
- atp-export-ns         # Export service
- atp-search-ns         # Search service (optional)
- atp-admin-ns          # Admin console

Namespace Creation

# Create namespace with labels
apiVersion: v1
kind: Namespace
metadata:
  name: atp-ingest-ns
  labels:
    app.kubernetes.io/name: atp-ingestion
    app.kubernetes.io/component: backend
    app.kubernetes.io/part-of: audit-trail-platform
    environment: production
    region: us-east
    istio-injection: enabled  # Enable service mesh sidecar injection
    kyverno.io/policy-severity: high

Namespace Isolation
- Resource quotas per namespace
- Network policies (deny by default)
- RBAC (namespace-scoped roles)
- Pod Security Standards

Namespace Lifecycle
- Creation (GitOps via FluxCD)
- Configuration (ResourceQuota, NetworkPolicy, RBAC)
- Monitoring (per-namespace dashboards)
- Deletion (drain workloads, cleanup resources)

Code Examples: - Namespace creation manifests - Resource quota configuration - RBAC binding

Diagrams: - Namespace organization - Isolation boundaries

Deliverables: - Namespace taxonomy - Creation templates - Isolation policies

Topic 6: Resource Quotas & LimitRanges

What will be covered: - Resource Quota per Namespace

apiVersion: v1
kind: ResourceQuota
metadata:
  name: atp-ingest-quota
  namespace: atp-ingest-ns
spec:
  hard:
    requests.cpu: "20"        # Max 20 CPU cores requested
    requests.memory: 40Gi     # Max 40 GB memory requested
    limits.cpu: "40"          # Max 40 CPU cores limit
    limits.memory: 80Gi       # Max 80 GB memory limit
    pods: "50"                # Max 50 pods
    services: "10"            # Max 10 services
    persistentvolumeclaims: "5"

LimitRange (Default Limits)

apiVersion: v1
kind: LimitRange
metadata:
  name: atp-ingest-limits
  namespace: atp-ingest-ns
spec:
  limits:
  - max:
      cpu: "4"
      memory: 8Gi
    min:
      cpu: "100m"
      memory: 128Mi
    default:
      cpu: "500m"
      memory: 512Mi
    defaultRequest:
      cpu: "250m"
      memory: 256Mi
    type: Container
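The ResourceQuota and LimitRange interact: the quota caps the sum of requests, while the LimitRange decides what each container requests when it declares nothing. The sketch below estimates how many single-container pods fit in the namespace under that quota. It is a simplified illustration: `parse_quantity` handles only the quantity forms used in these manifests, and both function names are assumptions made for this example.

```python
from fractions import Fraction

_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_quantity(value: str) -> Fraction:
    """Parse a small subset of Kubernetes quantities: '250m', '20', '512Mi', '40Gi'."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if value.endswith(suffix):
            return Fraction(value[: -len(suffix)]) * factor
    if value.endswith("m"):  # millicores
        return Fraction(value[:-1]) / 1000
    return Fraction(value)


def max_pods(quota: dict, per_pod: dict) -> int:
    """Largest pod count that keeps summed requests and the pod count within quota."""
    by_cpu = parse_quantity(quota["requests.cpu"]) // parse_quantity(per_pod["cpu"])
    by_memory = parse_quantity(quota["requests.memory"]) // parse_quantity(per_pod["memory"])
    return int(min(by_cpu, by_memory, int(quota["pods"])))


# Values copied from the ResourceQuota above.
QUOTA = {"requests.cpu": "20", "requests.memory": "40Gi", "pods": "50"}

if __name__ == "__main__":
    # Containers that rely on the LimitRange defaultRequest.
    print(max_pods(QUOTA, {"cpu": "250m", "memory": "256Mi"}))
    # Containers that request 500m CPU and 1Gi memory explicitly.
    print(max_pods(QUOTA, {"cpu": "500m", "memory": "1Gi"}))
```

With the default requests, the `pods: "50"` entry is the binding constraint. With larger explicit requests, CPU and memory requests run out first, so the quota and the per-service resource profiles should be reviewed together.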

Code Examples: - ResourceQuota manifests (all namespaces) - LimitRange configuration - Quota monitoring

Diagrams: - Resource allocation - Quota enforcement

Deliverables: - Quota specifications - Limit policies - Monitoring dashboards

CYCLE 4: Deployment Workloads (~4,000 lines)

Topic 7: Deployment Manifests

What will be covered: - Deployment Anatomy

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ingestion
  namespace: atp-ingest-ns
  labels:
    app: ingestion
    version: v1
    component: backend
spec:
  # Replica management
  replicas: 3

  # Rolling update strategy
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 30%          # Allow 30% extra pods during update
      maxUnavailable: 0      # Never go below desired count

  # Pod selector
  selector:
    matchLabels:
      app: ingestion
      version: v1

  # Pod template
  template:
    metadata:
      labels:
        app: ingestion
        version: v1
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9091"   # matches the "metrics" container port below
        prometheus.io/path: "/metrics"

    spec:
      # Security context (Pod level)
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 2000
        seccompProfile:
          type: RuntimeDefault

      # Service account (Workload Identity)
      serviceAccountName: ingestion-sa

      # Pod anti-affinity (prefer different nodes)
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - ingestion
              topologyKey: kubernetes.io/hostname

      # Zone spread. A second podAntiAffinity key would be a duplicate YAML key,
      # and a required per-zone anti-affinity rule would allow only one pod per
      # zone, which blocks surge pods when maxUnavailable is 0.
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: ingestion

      # Tolerations (for I/O node pool)
      tolerations:
      - key: "workload"
        operator: "Equal"
        value: "io"
        effect: "NoSchedule"

      # Containers
      containers:
      - name: ingestion
        image: atpacr.azurecr.io/atp/ingestion:1.2.3@sha256:abc123...

        # Security context (Container level)
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          runAsNonRoot: true
          runAsUser: 1000
          capabilities:
            drop:
            - ALL

        # Ports
        ports:
        - name: http
          containerPort: 8080
          protocol: TCP
        - name: grpc
          containerPort: 9090
          protocol: TCP
        - name: metrics
          containerPort: 9091
          protocol: TCP

        # Environment variables
        env:
        - name: ASPNETCORE_ENVIRONMENT
          value: "Production"
        - name: TENANT_CONTEXT_HEADER
          value: "X-Tenant-Id"

        # ConfigMap reference
        envFrom:
        - configMapRef:
            name: ingestion-config

        # Resource requests and limits
        resources:
          requests:
            cpu: "500m"
            memory: "1Gi"
          limits:
            cpu: "2000m"
            memory: "2Gi"

        # Liveness probe (is container alive?)
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3

        # Readiness probe (ready to serve traffic?)
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 3

        # Startup probe (for slow-starting apps)
        startupProbe:
          httpGet:
            path: /health/startup
            port: 8080
          initialDelaySeconds: 0
          periodSeconds: 5
          failureThreshold: 30  # 30 * 5s = 150s max startup time

        # Volume mounts
        volumeMounts:
        - name: secrets-store
          mountPath: "/mnt/secrets"
          readOnly: true
        - name: temp-dir
          mountPath: "/tmp"

      # Volumes
      volumes:
      - name: secrets-store
        csi:
          driver: secrets-store.csi.k8s.io
          readOnly: true
          volumeAttributes:
            secretProviderClass: "atp-kv-secrets"
      - name: temp-dir
        emptyDir: {}
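The rolling update settings in this manifest (`replicas: 3`, `maxSurge: 30%`, `maxUnavailable: 0`) determine how many pods can exist during a rollout. The sketch below works through that arithmetic, relying on the Kubernetes rule that a percentage `maxSurge` rounds up and a percentage `maxUnavailable` rounds down; the helper names are illustrative.

```python
def resolve(value, replicas, round_up):
    """Resolve an absolute count or a percentage string against the replica count."""
    if isinstance(value, str) and value.endswith("%"):
        scaled = replicas * int(value[:-1])
        # Integer arithmetic avoids floating-point rounding surprises.
        return -(-scaled // 100) if round_up else scaled // 100
    return int(value)


def rollout_bounds(replicas, max_surge, max_unavailable):
    """Return (maximum total pods, minimum available pods) during a rolling update."""
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    return replicas + surge, replicas - unavailable


if __name__ == "__main__":
    # Settings from the ingestion Deployment above.
    print(rollout_bounds(3, "30%", 0))
```

For the ingestion Deployment, 30% of 3 replicas rounds up to one surge pod, so a rollout briefly runs four pods and never drops below three. The namespace quota and node capacity need room for that extra pod.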

All ATP Deployments
- Gateway, Ingestion, Policy, Projection, Query, Integrity, Export, Search, Admin
- Each with specific resource profiles
- Affinity rules for high availability
- Tolerations for node pool assignment

Code Examples: - Complete deployment manifests (all ATP services) - Security contexts - Probe configurations - Volume mounts

Diagrams: - Deployment anatomy - Pod lifecycle - Health check flow

Deliverables: - Deployment manifest library - Configuration guide - Best practices

Topic 8: StatefulSets & DaemonSets

What will be covered: - When to Use StatefulSets vs. Deployments

| Workload Type | Use Deployment | Use StatefulSet |
|---------------|----------------|-----------------|
| Stateless APIs | ✅ | ❌ |
| Event consumers | ✅ | ❌ |
| Projection workers | ✅ | ❌ |
| Databases (managed externally) | ✅ | ❌ |
| Caching tier (Redis Sentinel) | ❌ | ✅ |
| Message brokers (Kafka, RabbitMQ) | ❌ | ✅ |
| Stateful actors (Orleans silos) | ❌ | ✅ |

DaemonSet Use Cases
- OTel Collector: One per node for metrics/logs
- Log Forwarder: FluentBit/Fluentd for centralized logging
- Node Exporter: Prometheus node metrics
- Security Agent: Defender for Containers

Code Examples: - StatefulSet example (if needed for ATP) - DaemonSet for OTel Collector - Persistent volume claims

Diagrams: - Workload type comparison - DaemonSet architecture

Deliverables: - Workload selection guide - DaemonSet configurations - Stateful patterns

CYCLE 5: Service Networking (~3,500 lines)

Topic 9: Kubernetes Services

What will be covered: - Service Types

# 1. ClusterIP (default, internal only)
apiVersion: v1
kind: Service
metadata:
  name: ingestion-svc
  namespace: atp-ingest-ns
spec:
  type: ClusterIP
  selector:
    app: ingestion
  ports:
  - name: http
    port: 80
    targetPort: 8080
    protocol: TCP
  - name: grpc
    port: 9090
    targetPort: 9090
    protocol: TCP

# 2. LoadBalancer (external, public IP)
apiVersion: v1
kind: Service
metadata:
  name: gateway-lb
  namespace: atp-gateway-ns
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "false"
spec:
  type: LoadBalancer
  selector:
    app: gateway
  ports:
  - name: https
    port: 443
    targetPort: 8443

# 3. Headless Service (for StatefulSets)
apiVersion: v1
kind: Service
metadata:
  name: orleans-silo-headless
  namespace: atp-orleans-ns
spec:
  type: ClusterIP
  clusterIP: None  # Headless
  selector:
    app: orleans-silo
  ports:
  - name: silo
    port: 11111
    targetPort: 11111

Service Discovery
- DNS-based (service-name.namespace.svc.cluster.local)
- Environment variables
- Service mesh (Istio VirtualService)

Load Balancing
- Round-robin (default)
- Session affinity (ClientIP)
- Topology-aware routing
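DNS-based discovery follows the `service-name.namespace.svc.cluster.local` pattern listed above. As a small illustration, the Python helpers below build that name and a base URL for a ClusterIP service; the function names are hypothetical and `cluster.local` is assumed to be the cluster domain.

```python
def service_fqdn(service: str, namespace: str, cluster_domain: str = "cluster.local") -> str:
    """Build the in-cluster DNS name for a Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def service_url(service: str, namespace: str, port: int = 80, scheme: str = "http") -> str:
    """Build a base URL that a pod in any namespace can use to reach the Service."""
    return f"{scheme}://{service_fqdn(service, namespace)}:{port}"


if __name__ == "__main__":
    # Names taken from the ClusterIP example above.
    print(service_fqdn("ingestion-svc", "atp-ingest-ns"))
    print(service_url("ingestion-svc", "atp-ingest-ns"))
```

Pods in the same namespace can use the short name `ingestion-svc`; callers in other namespaces need at least `ingestion-svc.atp-ingest-ns`, and the fully qualified form avoids depending on DNS search paths.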

Code Examples: - Service manifests (all types) - Service discovery examples - Load balancing configuration

Diagrams: - Service types comparison - DNS resolution flow - Load balancing strategies

Deliverables: - Service manifest library - Discovery guide - Load balancing patterns

Topic 10: Service Mesh Integration

What will be covered:
- Service-to-Service Communication
  - mTLS encryption (automatic)
  - Identity-based auth
  - Traffic management (retries, timeouts, circuit breakers)
  - Observability (distributed tracing)

Istio/Linkerd Configuration

# Enable sidecar injection (namespace label)
apiVersion: v1
kind: Namespace
metadata:
  name: atp-ingest-ns
  labels:
    istio-injection: enabled  # or linkerd.io/inject: enabled

# VirtualService (traffic routing)
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: ingestion-vs
  namespace: atp-ingest-ns
spec:
  hosts:
  - ingestion-svc
  http:
  - route:
    - destination:
        host: ingestion-svc
        port:
          number: 80
    timeout: 30s
    retries:
      attempts: 3
      perTryTimeout: 10s
      retryOn: 5xx,reset,connect-failure,refused-stream
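The `timeout`, `attempts`, and `perTryTimeout` values in the VirtualService interact, and it helps to check them together. The sketch below assumes that `attempts` counts retries after the initial try and ignores the short back-off between tries; both function names are illustrative.

```python
def worst_case_latency(timeout_s: float, retries: int, per_try_timeout_s: float) -> float:
    """Upper bound on how long a caller waits before the mesh gives up."""
    tries = 1 + retries  # assumption: `attempts` counts retries after the first try
    return min(timeout_s, tries * per_try_timeout_s)


def tries_that_fit(timeout_s: float, per_try_timeout_s: float) -> int:
    """How many tries can each run to their full per-try timeout within the overall timeout."""
    return int(timeout_s // per_try_timeout_s)


if __name__ == "__main__":
    # Values from the ingestion VirtualService above.
    print(worst_case_latency(30, 3, 10))
    print(tries_that_fit(30, 10))
```

With these settings the overall 30s timeout is the binding limit: if every try runs to its 10s per-try timeout, only three tries complete before the request is cut off, so the last configured retry would never finish.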

Code Examples: - Service mesh configuration - Traffic policies - mTLS verification

Diagrams: - Service mesh architecture - mTLS flow

Deliverables: - Service mesh setup - Traffic management - Security policies

CYCLE 6: Ingress & Load Balancing (~3,500 lines)

Topic 11: Ingress Controllers

What will be covered: - Ingress Controller Options

| Controller | Use Case | ATP Usage |
|------------|----------|-----------|
| NGINX Ingress | General-purpose HTTP/HTTPS | Dev/Test environments |
| Azure Application Gateway | WAF, Azure-native | Production (alternative to APIM) |
| Istio Ingress Gateway | Service mesh integration | Production (with Istio) |
| Traefik | Dynamic routing, middlewares | Internal services |

NGINX Ingress Setup

# Install NGINX Ingress Controller
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update

helm install nginx-ingress ingress-nginx/ingress-nginx \
    --namespace ingress-nginx \
    --create-namespace \
    --set controller.service.type=LoadBalancer \
    --set controller.service.annotations."service\.beta\.kubernetes\.io/azure-load-balancer-health-probe-request-path"=/healthz

Ingress Resource

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: atp-gateway-ingress
  namespace: atp-gateway-ns
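  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/backend-protocol: "HTTP"   # the backend services below expose plain HTTP on port 80
    nginx.ingress.kubernetes.io/limit-rps: "100"           # requests per second per client IP
spec:
  ingressClassName: nginx   # replaces the deprecated kubernetes.io/ingress.class annotation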
  tls:
  - hosts:
    - api.audittrail.example.com
    secretName: atp-gateway-tls
  rules:
  - host: api.audittrail.example.com
    http:
      paths:
      - path: /api/v1/ingest
        pathType: Prefix
        backend:
          service:
            name: ingestion-svc
            port:
              number: 80
      - path: /api/v1/query
        pathType: Prefix
        backend:
          service:
            name: query-svc
            port:
              number: 80
      - path: /api/v1/policy
        pathType: Prefix
        backend:
          service:
            name: policy-svc
            port:
              number: 80

Code Examples: - Ingress controller installation - Ingress resource manifests - TLS configuration - Rate limiting

Diagrams: - Ingress architecture - Traffic routing flow

Deliverables: - Ingress setup guide - Routing configurations - TLS management

Topic 12: Azure Front Door + APIM Integration

What will be covered:
- External Load Balancing
  - Azure Front Door (global load balancing, WAF)
  - API Management (rate limiting, caching, versioning)
  - Direct to Ingress Controller

Traffic Flow

Internet Client
    ↓
Azure Front Door (global edge, WAF)
    ↓
API Management (region-specific)
    ↓
AKS Ingress Controller (NGINX/Istio)
    ↓
Gateway Service (Kubernetes Service)
    ↓
Gateway Pods (via mesh)
    ↓
Backend Services (Ingestion, Query, etc.)

Code Examples: - AFD configuration - APIM backend configuration - End-to-end routing

Diagrams: - Complete traffic flow - Edge integration

Deliverables: - Edge integration guide - Traffic routing - Failover patterns

CYCLE 7: ConfigMaps & Secrets (~3,000 lines)

Topic 13: ConfigMap Management

What will be covered: - ConfigMap for Application Settings

apiVersion: v1
kind: ConfigMap
metadata:
  name: ingestion-config
  namespace: atp-ingest-ns
data:
  appsettings.json: |
    {
      "ApplicationName": "ATP.Ingestion",
      "Logging": {
        "LogLevel": {
          "Default": "Information"
        }
      },
      "HealthChecks": {
        "Enabled": true
      }
    }

  # Environment-specific overrides
  ASPNETCORE_ENVIRONMENT: "Production"
  TENANT_CONTEXT_HEADER: "X-Tenant-Id"
  MAX_BATCH_SIZE: "100"

Using ConfigMaps

# Mount as environment variables
envFrom:
- configMapRef:
    name: ingestion-config

# Mount as file
volumeMounts:
- name: config-volume
  mountPath: /app/config
volumes:
- name: config-volume
  configMap:
    name: ingestion-config

Code Examples: - ConfigMap creation - Usage patterns - Dynamic updates

Deliverables: - ConfigMap templates - Usage guide - Update procedures

Topic 14: Secret Management

What will be covered:
- Kubernetes Secrets (Avoid for Production)
  - Base64-encoded (not encrypted at rest by default)
  - Use only for non-sensitive config
  - Prefer Azure Key Vault

Secret Types

# Opaque secret
apiVersion: v1
kind: Secret
metadata:
  name: db-credentials
  namespace: atp-ingest-ns
type: Opaque
data:
  username: YWRtaW4=  # base64("admin")
  password: cGFzc3dvcmQ=  # base64("password")

# TLS secret
apiVersion: v1
kind: Secret
metadata:
  name: tls-cert
type: kubernetes.io/tls
data:
  tls.crt: <base64-encoded-cert>
  tls.key: <base64-encoded-key>
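The `data` values in a Secret are base64-encoded, which is an encoding and not encryption. The short Python sketch below decodes the sample values from the Opaque secret above with nothing but the standard library, which is why manifests like this should not hold real credentials; `decode_secret_data` is an illustrative helper name.

```python
import base64


def decode_secret_data(data: dict) -> dict:
    """Decode the base64 values of a Secret's `data` map into plain strings."""
    return {key: base64.b64decode(value).decode("utf-8") for key, value in data.items()}


# Sample values copied from the db-credentials Secret above.
SECRET_DATA = {"username": "YWRtaW4=", "password": "cGFzc3dvcmQ="}

if __name__ == "__main__":
    for key, value in decode_secret_data(SECRET_DATA).items():
        print(f"{key} = {value}")
```

Anyone with read access to the Secret object or to a committed manifest can recover the values this way, which motivates the Key Vault CSI approach covered in the next cycle.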

Code Examples: - Secret creation - Secret usage - Rotation procedures

Deliverables: - Secret management guide - Best practices - Security considerations

CYCLE 8: Azure Key Vault CSI Driver (~3,000 lines)

Topic 15: Key Vault Integration

What will be covered: - Azure Key Vault CSI Driver

# SecretProviderClass
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: atp-kv-secrets
  namespace: atp-ingest-ns
spec:
  provider: azure
  parameters:
    usePodIdentity: "false"
    useVMManagedIdentity: "false"
    clientID: "<workload-identity-client-id>"   # setting clientID selects Workload Identity authentication
    keyvaultName: "atp-kv-prod"
    cloudName: "AzurePublicCloud"
    tenantId: "<azure-tenant-id>"
    objects: |
      array:
        - |
          objectName: DatabasePassword
          objectType: secret
          objectVersion: ""
        - |
          objectName: ServiceBusConnectionString
          objectType: secret
          objectVersion: ""
        - |
          objectName: SigningKeyPrivate
          objectType: secret
          objectVersion: ""

Mount Secrets in Pod

volumes:
- name: secrets-store
  csi:
    driver: secrets-store.csi.k8s.io
    readOnly: true
    volumeAttributes:
      secretProviderClass: "atp-kv-secrets"

volumeMounts:
- name: secrets-store
  mountPath: "/mnt/secrets"
  readOnly: true

Access Secrets in Application

// Read secret from mounted path
var dbPassword = await File.ReadAllTextAsync("/mnt/secrets/DatabasePassword");

// Or use configuration provider
configuration.AddJsonFile("/mnt/secrets/appsettings-secrets.json", optional: false);

Code Examples: - Complete Key Vault CSI setup - SecretProviderClass for all services - Secret rotation automation

Diagrams: - Key Vault CSI architecture - Secret mounting flow

Deliverables: - Key Vault integration guide - Secret provider configurations - Rotation procedures

Topic 16: Workload Identity (replaces AAD Pod Identity)

What will be covered: - Workload Identity Setup

# Service Account with Workload Identity
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ingestion-sa
  namespace: atp-ingest-ns
  annotations:
    azure.workload.identity/client-id: "<managed-identity-client-id>"
    azure.workload.identity/tenant-id: "<azure-tenant-id>"

Grant Permissions

# Create managed identity
az identity create \
    --name atp-ingestion-identity \
    --resource-group atp-aks-prod-rg

# Grant Key Vault access
az keyvault set-policy \
    --name atp-kv-prod \
    --object-id <identity-object-id> \
    --secret-permissions get list

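# Federate the managed identity with the AKS OIDC issuer (Workload Identity)
# The credential name below is illustrative
az identity federated-credential create \
    --name ingestion-federated-credential \
    --identity-name atp-ingestion-identity \
    --resource-group atp-aks-prod-rg \
    --issuer <aks-oidc-issuer-url> \
    --subject system:serviceaccount:atp-ingest-ns:ingestion-sa \
    --audiences api://AzureADTokenExchange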
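
Workload Identity federation matches the subject claim of the pod's service account token, which Kubernetes formats as `system:serviceaccount:<namespace>:<name>`. The sketch below only assembles the values a federated credential would need for the ServiceAccount shown above; the function name is hypothetical and the issuer URL is a placeholder that must come from the cluster.

```python
def federated_credential(issuer_url: str, namespace: str, service_account: str) -> dict:
    """Assemble issuer, subject, and audience for a federated identity credential."""
    return {
        "issuer": issuer_url,  # the AKS cluster's OIDC issuer URL (placeholder here)
        "subject": f"system:serviceaccount:{namespace}:{service_account}",
        "audiences": ["api://AzureADTokenExchange"],  # audience used by Azure AD token exchange
    }


cred = federated_credential("<aks-oidc-issuer-url>", "atp-ingest-ns", "ingestion-sa")
print(cred["subject"])  # system:serviceaccount:atp-ingest-ns:ingestion-sa
```

A mismatch in either the namespace or the service account name makes the token exchange fail, so the subject is worth generating rather than typing by hand.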

Code Examples: - Workload Identity setup - Permission grants - Pod configuration

Deliverables: - Workload Identity guide - Security configuration - Permission matrix

CYCLE 9: Resource Management (~3,500 lines)¶

Topic 17: Resource Requests & Limits¶

What will be covered: - ATP Service Resource Profiles

| Service | CPU Request | CPU Limit | Memory Request | Memory Limit | Notes |
|---------|-------------|-----------|----------------|--------------|-------|
| Gateway | 250m | 1000m | 512Mi | 1Gi | Stateless, low CPU |
| Ingestion | 500m | 2000m | 1Gi | 2Gi | I/O intensive |
| Policy | 250m | 1000m | 512Mi | 1Gi | Cache-heavy, low CPU |
| Projection | 500m | 2000m | 1Gi | 2Gi | DB writes, batch processing |
| Query | 500m | 2000m | 1Gi | 2Gi | DB reads, caching |
| Integrity | 250m | 1000m | 512Mi | 1Gi | Crypto operations, burst |
| Export | 1000m | 4000m | 2Gi | 4Gi | Large file processing |
| Search | 500m | 2000m | 1Gi | 2Gi | Index operations |
| Admin | 250m | 500m | 256Mi | 512Mi | Lightweight, low traffic |

Right-Sizing Resources
Monitor actual usage (Prometheus, Azure Monitor)
Set requests = typical usage (95th percentile)
Set limits = max burst capacity
Leave headroom for spikes
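
The right-sizing steps above can be sketched as a small calculation. This is an illustrative sketch only: the nearest-rank percentile, the 20% headroom figure, and the function names are assumptions, not ATP policy.

```python
def percentile(samples: list[int], pct: int) -> int:
    """Nearest-rank percentile using integer arithmetic."""
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # ceiling of pct * n / 100
    return ordered[max(rank, 1) - 1]


def recommend_cpu(samples_millicores: list[int], headroom_pct: int = 20) -> dict:
    """Request = 95th percentile of observed usage; limit = peak plus headroom."""
    peak = max(samples_millicores)
    return {
        "request_m": percentile(samples_millicores, 95),
        "limit_m": (peak * (100 + headroom_pct) + 99) // 100,  # rounded up
    }


# Hypothetical usage samples in millicores, e.g. exported from Prometheus.
print(recommend_cpu(list(range(1, 101))))  # {'request_m': 95, 'limit_m': 120}
```

The same shape works for memory, with samples in MiB instead of millicores.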

Quality of Service (QoS) Classes

Guaranteed (requests = limits):
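- Intended for critical services (Gateway, Ingestion); this requires requests = limits, whereas the profiles in the table above set requests below limits
- Predictable performance
- Last to be evicted under node resource pressure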

Burstable (requests < limits):
- Most ATP services
- Can burst above requests if node has capacity
- May be throttled/evicted under pressure

BestEffort (no requests/limits):
- Avoid in production
- Lowest priority, first to be evicted
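
The three classes follow mechanically from the requests and limits on a pod's containers. The sketch below is a simplified classifier under stated assumptions: it considers only CPU and memory, and it compares quantities as written strings, so `1000m` and `1` would not be treated as equal. The function name is hypothetical.

```python
def qos_class(containers: list[dict]) -> str:
    """Classify a pod from its containers' 'requests' and 'limits' maps."""
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # When only a limit is given, the request defaults to the limit.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


gateway = [{"requests": {"cpu": "250m", "memory": "512Mi"},
            "limits": {"cpu": "1000m", "memory": "1Gi"}}]
print(qos_class(gateway))  # Burstable
```

Setting the Gateway requests equal to its limits would move it to Guaranteed.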

Code Examples: - Resource configuration for all services - Monitoring resource usage - Right-sizing analysis

Diagrams: - Resource allocation - QoS classes - Eviction priority

Deliverables: - Resource profile guide - Monitoring dashboard - Right-sizing procedures

Topic 18: Resource Quotas & Cost Control¶

What will be covered: - Namespace Resource Quotas - Cost Allocation by Namespace - Spot Instances for Non-Critical Workloads - Cluster Autoscaler

Code Examples: - Quota enforcement - Cost monitoring - Spot configuration

Deliverables: - Cost control guide - Quota policies - Optimization strategies

CYCLE 10: Horizontal Pod Autoscaler (HPA) (~4,000 lines)¶

Topic 19: HPA Configuration¶

What will be covered: - HPA v2 (Metrics-Based Autoscaling)

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ingestion-hpa
  namespace: atp-ingest-ns
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ingestion

  minReplicas: 3
  maxReplicas: 30

  # Scale up/down behavior
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100        # Double pods in 60s if needed
        periodSeconds: 60
      - type: Pods
        value: 4          # Or add 4 pods
        periodSeconds: 60
      selectPolicy: Max   # Take max of policies

    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 25         # Remove 25% of pods
        periodSeconds: 60
      selectPolicy: Min   # Take min (conservative)

  # Metrics
  metrics:
  # CPU utilization
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

  # Memory utilization
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80

  # Custom metric (request rate)
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"

  # Custom metric (P95 latency)
  - type: Pods
    pods:
      metric:
        name: http_request_duration_p95_ms
      target:
        type: AverageValue
        averageValue: "200"
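
To make the scaling arithmetic concrete, the sketch below applies the documented HPA rule, desired = ceil(currentReplicas × currentValue / targetValue), to each metric, takes the largest proposal, and then applies the rate limits from the manifest above. It is a simplification: stabilization windows and per-period history are ignored, and the function names are hypothetical.

```python
import math


def proposal(current_replicas: int, current_value: float, target_value: float,
             tolerance: float = 0.1) -> int:
    """Core HPA rule for a single metric."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no change
    return math.ceil(current_replicas * ratio)


def desired_replicas(current_replicas: int, metrics: list[tuple[float, float]],
                     min_replicas: int = 3, max_replicas: int = 30) -> int:
    """metrics holds (current value, target value) pairs; the largest proposal wins."""
    desired = max(proposal(current_replicas, cur, tgt) for cur, tgt in metrics)
    # scaleUp: +100% or +4 pods per period, selectPolicy Max -> the looser cap applies.
    up_cap = max(current_replicas * 2, current_replicas + 4)
    # scaleDown: at most 25% of pods removed per period.
    down_floor = current_replicas * 75 // 100
    desired = max(down_floor, min(up_cap, desired))
    return max(min_replicas, min(max_replicas, desired))


# 6 pods, CPU at 90% (target 70) and 250 req/s per pod (target 100):
print(desired_replicas(6, [(90, 70), (250, 100)]))  # 12
```

In this example the request-rate metric asks for 15 pods, but the scale-up policy caps one step at 12.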

HPA for All ATP Services
Gateway: CPU + request rate + P95 latency
Ingestion: CPU + ingest rate
Query: CPU + query rate + P95 latency
Policy: CPU + cache miss rate
(Projection, Export use KEDA instead)

Code Examples: - Complete HPA manifests (all services) - Custom metrics integration - Scaling behavior tuning

Diagrams: - HPA architecture - Scaling decision flow - Metrics pipeline

Deliverables: - HPA configuration library - Metrics guide - Tuning procedures

Topic 20: Custom Metrics with Azure Monitor¶

What will be covered: - Azure Monitor Metrics Adapter - Prometheus Adapter - Custom Metric Queries - SLO-Based Scaling

Code Examples: - Metrics adapter setup - Custom metric definitions - SLO-driven autoscaling

Deliverables: - Metrics adapter guide - Custom metrics catalog - SLO integration

CYCLE 11: KEDA Event-Driven Autoscaling (~4,500 lines)¶

Topic 21: KEDA Architecture¶

What will be covered: - KEDA (Kubernetes Event-Driven Autoscaling)

Why KEDA for ATP?
- Scale based on queue depth (Azure Service Bus)
- Scale to zero when idle (cost savings)
- Event-driven workers (Projection, Export, Integrity)
- Batch job scheduling (ScaledJob with a cron trigger)

KEDA Installation

# Install KEDA operator
helm repo add kedacore https://kedacore.github.io/charts
helm repo update

helm install keda kedacore/keda \
    --namespace keda \
    --create-namespace \
    --set podIdentity.azureWorkload.enabled=true

KEDA Components
Operator: Monitors ScaledObjects, creates HPAs
Metrics Server: Exposes scaler values to the HPA through the Kubernetes external metrics API
Admission Webhooks: Validates ScaledObjects

Code Examples: - KEDA installation - Architecture overview - Component configuration

Diagrams: - KEDA architecture - Scaling flow - Integration with HPA

Deliverables: - KEDA setup guide - Architecture reference - Component overview

Topic 22: KEDA Scalers for ATP¶

What will be covered: - Azure Service Bus Scaler (Projection Workers)

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: projection-worker-scaler
  namespace: atp-projection-ns
spec:
  scaleTargetRef:
    name: projection-worker

  pollingInterval: 10        # Check every 10 seconds
  cooldownPeriod: 120        # Wait 2 min before scaling down

  minReplicaCount: 0         # Scale to zero when no messages
  maxReplicaCount: 50

  advanced:
    restoreToOriginalReplicaCount: false
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300
          policies:
          - type: Percent
            value: 50
            periodSeconds: 60

  triggers:
  - type: azure-servicebus
    metadata:
      namespace: sb-atp-prod
      topicName: audit.appended.v1
      subscriptionName: projection-sub
      messageCount: "400"          # Target: 400 messages per replica
      activationMessageCount: "0"  # Scale from 0 as soon as the count exceeds 0 (≥1 message)
      cloud: AzurePublicCloud

    authenticationRef:
      name: keda-trigger-auth-asb

---
# Authentication using Workload Identity
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: keda-trigger-auth-asb
  namespace: atp-projection-ns
spec:
  podIdentity:
    provider: azure-workload
    identityId: "<managed-identity-client-id>"
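
The interaction between the activation threshold and the per-replica target can be sketched as follows. This is a simplified model, assuming the scaler becomes active only when the message count exceeds the activation value and that the HPA then targets an average of `messageCount` messages per replica; cooldown, polling, and HPA tolerance are ignored, and the function name is hypothetical.

```python
def queue_replicas(queue_length: int, target_per_replica: int, activation: int,
                   min_replicas: int, max_replicas: int) -> int:
    """Estimate the replica count for a queue-length trigger."""
    if queue_length <= activation:
        return min_replicas  # scaler inactive: fall back to the minimum (0 = scale to zero)
    wanted = -(-queue_length // target_per_replica)  # integer ceiling division
    return max(1, min_replicas, min(max_replicas, wanted))


# Projection worker settings: 400 messages per replica, 0 to 50 replicas.
for backlog in (0, 1, 4000, 100000):
    print(backlog, queue_replicas(backlog, 400, activation=0,
                                  min_replicas=0, max_replicas=50))
# 0 0
# 1 1
# 4000 10
# 100000 50
```

With an activation value of 1 instead of 0, a single waiting message would not wake the deployment: `queue_replicas(1, 400, 1, 0, 50)` returns 0.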

Redis Scaler (Export Jobs Queue)

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: export-worker-scaler
  namespace: atp-export-ns
spec:
  scaleTargetRef:
    name: export-worker
  minReplicaCount: 0
  maxReplicaCount: 10
  triggers:
  - type: redis
    metadata:
      addressFromEnv: REDIS_HOST
      listName: export-jobs
      listLength: "10"  # Target: about 10 queued jobs per replica
      databaseIndex: "0"
    authenticationRef:
      name: keda-redis-auth

Prometheus Scaler (Custom Metrics)

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: query-latency-scaler
  namespace: atp-query-ns
spec:
  scaleTargetRef:
    name: query
  minReplicaCount: 3
  maxReplicaCount: 30
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring:9090
      metricName: http_request_duration_p95_seconds
      query: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service="query"}[2m])) by (le))
      threshold: "0.2"  # Scale up if P95 > 200ms

Cron Scaler (Scheduled Jobs)

apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: daily-compliance-report
  namespace: atp-admin-ns
spec:
  jobTargetRef:
    template:
      spec:
        containers:
        - name: report-generator
          image: atpacr.azurecr.io/atp/report-generator:1.0.0
        restartPolicy: Never

  pollingInterval: 30
  maxReplicaCount: 1

  triggers:
  - type: cron
    metadata:
      timezone: America/New_York
      start: 0 2 * * *   # 2 AM daily
      end: 0 3 * * *     # Finish by 3 AM
      desiredReplicas: "1"  # Required by the cron scaler

Code Examples: - Complete KEDA scaler library (all ATP use cases) - Authentication configurations - Scaling policies

Diagrams: - KEDA scaler types - Event-driven scaling flow - Scale-to-zero timeline

Deliverables: - KEDA scaler catalog - Configuration guide - Scaling strategies

CYCLE 12: Pod Security Standards (~3,000 lines)¶

Topic 23: Pod Security Standards¶

What will be covered: - Pod Security Standards (PSS)

Privileged: Unrestricted (avoid)
Baseline: Minimally restrictive (dev/test)
Restricted: Hardened (production ATP)

Restricted Pod Security

# Namespace-level enforcement
apiVersion: v1
kind: Namespace
metadata:
  name: atp-ingest-ns
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted

# Pod security context (compliant)
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  runAsGroup: 3000
  fsGroup: 2000
  seccompProfile:
    type: RuntimeDefault

# Container security context
containers:
- name: ingestion
  securityContext:
    allowPrivilegeEscalation: false
    readOnlyRootFilesystem: true
    runAsNonRoot: true
    runAsUser: 1000
    capabilities:
      drop:
      - ALL
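
A pre-deployment check can catch non-compliant templates before the admission controller rejects them. The sketch below validates only a subset of the Restricted profile (privilege escalation, non-root, dropped capabilities, seccomp); it is not a replacement for Pod Security Admission, and the function name is hypothetical.

```python
def restricted_violations(pod_spec: dict) -> list[str]:
    """Return human-readable problems for a subset of the Restricted profile."""
    problems = []
    pod_sc = pod_spec.get("securityContext", {})
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        sc = container.get("securityContext", {})
        if sc.get("allowPrivilegeEscalation") is not False:
            problems.append(f"{name}: allowPrivilegeEscalation must be false")
        # Container-level settings override pod-level ones.
        if sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot")) is not True:
            problems.append(f"{name}: runAsNonRoot must be true")
        if "ALL" not in sc.get("capabilities", {}).get("drop", []):
            problems.append(f"{name}: capabilities.drop must include ALL")
        seccomp = sc.get("seccompProfile", pod_sc.get("seccompProfile", {}))
        if seccomp.get("type") not in ("RuntimeDefault", "Localhost"):
            problems.append(f"{name}: seccompProfile.type must be RuntimeDefault or Localhost")
    return problems


compliant = {
    "securityContext": {"runAsNonRoot": True, "seccompProfile": {"type": "RuntimeDefault"}},
    "containers": [{
        "name": "ingestion",
        "securityContext": {"allowPrivilegeEscalation": False,
                            "capabilities": {"drop": ["ALL"]}},
    }],
}
print(restricted_violations(compliant))  # []
```

A container with no securityContext at all fails every check in this subset.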

Code Examples: - Pod Security Standard configurations - Compliant pod templates - Validation policies

Diagrams: - PSS levels - Security context flow

Deliverables: - Pod security guide - Compliant templates - Validation rules

Topic 24: Runtime Security with Azure Policy¶

What will be covered: - Azure Policy for Kubernetes - OPA Gatekeeper - Admission Controller Policies - Image Scanning (Trivy, Defender)

Code Examples: - Policy definitions - Admission webhooks - Image scan integration

Deliverables: - Security policy catalog - Enforcement procedures - Scanning workflows

CYCLE 13: Network Policies (~3,000 lines)¶

Topic 25: Network Isolation¶

What will be covered: - Network Policy Fundamentals

# Default deny all ingress
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: atp-ingest-ns
spec:
  podSelector: {}
  policyTypes:
  - Ingress

---
# Allow ingress from Gateway only
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-gateway
  namespace: atp-ingest-ns
spec:
  podSelector:
    matchLabels:
      app: ingestion
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: atp-gateway-ns  # label set automatically on every namespace
      podSelector:
        matchLabels:
          app: gateway
    ports:
    - protocol: TCP
      port: 8080
    - protocol: TCP
      port: 9090

---
# Allow egress to Azure SQL
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-egress-sql
  namespace: atp-ingest-ns
spec:
  podSelector:
    matchLabels:
      app: ingestion
  policyTypes:
  - Egress
  egress:
  - to:
    - ipBlock:
        cidr: 10.0.0.0/8  # Private range containing the Azure SQL private endpoint; narrow to the endpoint subnet
    ports:
    - protocol: TCP
      port: 1433
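
How a default-deny policy and an allow policy combine can be modelled in a few lines. The sketch below is a deliberately simplified evaluator: it supports matchLabels only, expects every peer to carry both a namespaceSelector and a podSelector, and uses the automatically applied `kubernetes.io/metadata.name` namespace label. All names in it are hypothetical.

```python
def selects(selector: dict, labels: dict) -> bool:
    """An empty selector matches everything, as in Kubernetes."""
    return all(labels.get(key) == value for key, value in selector.items())


def ingress_allowed(policies: list[dict], dst_labels: dict,
                    src_ns_labels: dict, src_pod_labels: dict, port: int) -> bool:
    applicable = [p for p in policies if selects(p["podSelector"], dst_labels)]
    if not applicable:
        return True  # no policy selects the pod: all ingress is allowed
    for policy in applicable:
        for rule in policy.get("ingress", []):
            peer_ok = any(
                selects(peer["namespaceSelector"], src_ns_labels)
                and selects(peer["podSelector"], src_pod_labels)
                for peer in rule["from"]
            )
            if peer_ok and port in rule["ports"]:
                return True  # policies are additive: one matching rule is enough
    return False


POLICIES = [
    {"podSelector": {}},  # default-deny-ingress: selects every pod, allows nothing
    {"podSelector": {"app": "ingestion"},
     "ingress": [{"from": [{"namespaceSelector": {"kubernetes.io/metadata.name": "atp-gateway-ns"},
                            "podSelector": {"app": "gateway"}}],
                  "ports": [8080, 9090]}]},
]
GATEWAY_NS = {"kubernetes.io/metadata.name": "atp-gateway-ns"}
print(ingress_allowed(POLICIES, {"app": "ingestion"}, GATEWAY_NS, {"app": "gateway"}, 8080))  # True
print(ingress_allowed(POLICIES, {"app": "ingestion"}, GATEWAY_NS, {"app": "gateway"}, 1433))  # False
```

Running the same checks for every service pair is one way to generate the traffic matrix listed in the deliverables.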

Code Examples: - Network policy manifests (all ATP services) - Zero-trust network model - Egress control

Diagrams: - Network policy architecture - Traffic flow with policies

Deliverables: - Network policy library - Zero-trust configuration - Traffic matrix

Topic 26: Service Mesh Network Policies¶

What will be covered: - Istio AuthorizationPolicy - Linkerd Server Policies - mTLS Enforcement

Code Examples: - Mesh-native policies - mTLS verification

Deliverables: - Mesh policy guide - Security configuration

CYCLE 14: Service Mesh & mTLS (~3,500 lines)¶

Topic 27: Service Mesh Setup¶

What will be covered: - Istio Installation

# Install Istio (the built-in default profile is the one recommended for production)
istioctl install --set profile=default -y

# Enable sidecar injection
kubectl label namespace atp-ingest-ns istio-injection=enabled

Linkerd Installation

# Install Linkerd
linkerd install --crds | kubectl apply -f -
linkerd install | kubectl apply -f -

# Verify installation
linkerd check

# Inject sidecar
kubectl annotate namespace atp-ingest-ns linkerd.io/inject=enabled

mTLS Configuration

# Istio PeerAuthentication (enforce mTLS)
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default-mtls
  namespace: atp-ingest-ns
spec:
  mtls:
    mode: STRICT  # Enforce mTLS for all traffic

Code Examples: - Service mesh installation - mTLS configuration - Traffic policies

Diagrams: - Service mesh architecture - mTLS certificate flow

Deliverables: - Mesh setup guide - mTLS enforcement - Traffic management

Topic 28: Observability with Service Mesh¶

What will be covered: - Distributed Tracing - Traffic Metrics - Service Graph Visualization - Latency Monitoring

Code Examples: - Mesh observability configuration - Kiali/Jaeger integration

Deliverables: - Observability guide - Dashboard templates

CYCLE 15: RBAC & Workload Identity (~3,000 lines)¶

Topic 29: Kubernetes RBAC¶

What will be covered: - Role-Based Access Control

# Role (namespace-scoped)
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: atp-ingest-ns
rules:
- apiGroups: [""]
  resources: ["pods", "pods/log"]
  verbs: ["get", "list", "watch"]

---
# RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: atp-ingest-ns
subjects:
- kind: ServiceAccount
  name: developer-sa
  namespace: atp-ingest-ns
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io

---
# ClusterRole (cluster-wide); wildcard rules are equivalent to cluster-admin, so bind sparingly
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: atp-admin
rules:
- apiGroups: ["*"]
  resources: ["*"]
  verbs: ["*"]
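
RBAC rules are purely additive, so an access check reduces to finding any rule that matches the API group, resource, and verb. The sketch below models that matching for the two rule sets above; it ignores resourceNames and non-resource URLs, and the function names are hypothetical.

```python
def rule_allows(rule: dict, api_group: str, resource: str, verb: str) -> bool:
    def matches(allowed: list[str], value: str) -> bool:
        return "*" in allowed or value in allowed

    return (matches(rule["apiGroups"], api_group)
            and matches(rule["resources"], resource)
            and matches(rule["verbs"], verb))


def is_allowed(rules: list[dict], api_group: str, resource: str, verb: str) -> bool:
    """True when at least one rule grants the request (RBAC has no deny rules)."""
    return any(rule_allows(rule, api_group, resource, verb) for rule in rules)


POD_READER = [{"apiGroups": [""], "resources": ["pods", "pods/log"],
               "verbs": ["get", "list", "watch"]}]
ATP_ADMIN = [{"apiGroups": ["*"], "resources": ["*"], "verbs": ["*"]}]

print(is_allowed(POD_READER, "", "pods", "get"))       # True
print(is_allowed(POD_READER, "", "pods", "delete"))    # False
print(is_allowed(ATP_ADMIN, "apps", "deployments", "delete"))  # True
```

On a live cluster, `kubectl auth can-i` answers the same question against the real bindings.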

Code Examples: - Complete RBAC configuration - Role library - Binding templates

Deliverables: - RBAC guide - Role catalog - Access control policies

Topic 30: Azure AD Workload Identity¶

What will be covered: - Workload Identity Federation - Managed Identity Assignment - Azure Resource Access - Zero Secrets (Keyless Authentication)

Code Examples: - Workload Identity setup - Federated identity credentials - Azure resource access

Deliverables: - Workload Identity guide - Identity management - Security best practices

CYCLE 16: Helm Charts (~4,000 lines)¶

Topic 31: Helm Chart Structure¶

What will be covered: - ATP Helm Chart Architecture

charts/atp/
├── Chart.yaml              # Chart metadata
├── values.yaml             # Default values
├── values.dev.yaml         # Dev environment overrides
├── values.prod.yaml        # Prod environment overrides
├── values.us.yaml          # US region overrides
├── values.eu.yaml          # EU region overrides
├── templates/
│   ├── _helpers.tpl        # Template helpers
│   ├── namespace.yaml
│   ├── gateway/
│   │   ├── deployment.yaml
│   │   ├── service.yaml
│   │   ├── hpa.yaml
│   │   └── ingress.yaml
│   ├── ingestion/
│   │   ├── deployment.yaml
│   │   ├── service.yaml
│   │   └── hpa.yaml
│   ├── projection/
│   │   ├── deployment.yaml
│   │   ├── service.yaml
│   │   └── keda-scaledobject.yaml
│   ├── query/
│   │   ├── deployment.yaml
│   │   ├── service.yaml
│   │   └── hpa.yaml
│   ├── export/
│   │   ├── deployment.yaml
│   │   ├── service.yaml
│   │   └── keda-scaledobject.yaml
│   ├── shared/
│   │   ├── configmap.yaml
│   │   ├── secret-provider-class.yaml
│   │   └── network-policies.yaml
│   └── monitoring/
│       ├── servicemonitor.yaml
│       └── prometheusrule.yaml
└── .helmignore

Code Examples: - Complete Helm chart - Template syntax - Values organization

Diagrams: - Helm chart structure - Value inheritance

Deliverables: - Helm chart repository - Templating guide - Values reference

Topic 32: Helm Deployment¶

What will be covered: - Helm Installation

# Install/upgrade ATP
helm upgrade --install atp ./charts/atp \
    --namespace atp-system \
    --create-namespace \
    --values ./charts/atp/values.prod.yaml \
    --values ./charts/atp/values.us.yaml \
    --set image.tag=1.2.3 \
    --set global.edition=enterprise \
    --wait --timeout 10m

# Verify release
helm list -n atp-system
helm status atp -n atp-system

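# Rollback (inspect revisions first; omitting the revision rolls back to the previous release)
helm history atp -n atp-system
helm rollback atp 1 -n atp-system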
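
Helm layers values in a fixed order: the chart's values.yaml, then each `--values` file from left to right, then `--set` flags, with later sources winning and nested maps merged key by key. The sketch below mimics that precedence for illustration; the sample values are hypothetical, and Helm specifics such as deleting keys with null are ignored.

```python
import copy
from functools import reduce


def deep_merge(base: dict, override: dict) -> dict:
    """Return base with override applied; nested maps merge key by key."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def set_path(values: dict, dotted_key: str, value) -> None:
    """Apply one --set style override such as image.tag=1.2.3."""
    *parents, leaf = dotted_key.split(".")
    node = values
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = value


def effective_values(layers: list[dict], set_overrides: dict) -> dict:
    values = copy.deepcopy(reduce(deep_merge, layers, {}))
    for key, value in set_overrides.items():
        set_path(values, key, value)
    return values


defaults = {"image": {"tag": "latest", "pullPolicy": "IfNotPresent"}, "replicas": 1}
prod = {"replicas": 3}
us = {"region": "eastus"}
result = effective_values([defaults, prod, us],
                          {"image.tag": "1.2.3", "global.edition": "enterprise"})
print(result["image"])     # {'tag': '1.2.3', 'pullPolicy': 'IfNotPresent'}
print(result["replicas"])  # 3
```

Because later files win, the region file should follow the environment file only when regional settings are meant to override environment settings.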

Code Examples: - Deployment commands - Value overrides - Rollback procedures

Deliverables: - Deployment guide - Operations procedures - Rollback strategies

CYCLE 17: GitOps with FluxCD (~3,500 lines)¶

Topic 33: FluxCD Setup¶

What will be covered: - FluxCD Installation - GitRepository Source - Kustomization - HelmRelease - Automated Reconciliation

Code Examples: - FluxCD configuration - GitOps workflow

Deliverables: - FluxCD setup guide - GitOps workflow

Topic 34: Declarative Deployments¶

What will be covered: - Git as Source of Truth - Automated Sync - Drift Detection - Notification Hooks

Code Examples: - FluxCD resources - Sync configuration

Deliverables: - Declarative deployment guide - Sync policies

CYCLE 18: Monitoring & Observability (~3,000 lines)¶

Topic 35: Container Insights¶

What will be covered: - Azure Monitor Container Insights - Prometheus Integration - Grafana Dashboards - Log Analytics

Code Examples: - Monitoring setup - Dashboard configurations

Deliverables: - Monitoring guide - Dashboard library

Topic 36: OpenTelemetry in Kubernetes¶

What will be covered: - OTel Collector DaemonSet - Trace/Metric/Log Collection - Azure Monitor Export

Code Examples: - OTel configuration - Collector deployment

Deliverables: - OTel setup guide - Export configuration

CYCLE 19: Operations & Troubleshooting (~3,000 lines)¶

Topic 37: Operational Tasks¶

What will be covered: - Common kubectl Commands - Log Viewing - Pod Debugging - Resource Inspection

Code Examples: - Operations cookbook

Deliverables: - Operations guide - Troubleshooting procedures

Topic 38: Troubleshooting Common Issues¶

What will be covered: - Pod Not Starting - Image Pull Errors - CrashLoopBackOff - Network Issues - Resource Exhaustion

Code Examples: - Debug procedures

Deliverables: - Troubleshooting guide - Problem catalog

CYCLE 20: Best Practices & Disaster Recovery (~3,000 lines)¶

Topic 39: Kubernetes Best Practices¶

What will be covered: - Design Best Practices - Security Hardening - Performance Optimization - Cost Management

Deliverables: - Best practices handbook

Topic 40: Disaster Recovery¶

What will be covered: - Cluster Backup - Disaster Recovery Procedures - Multi-Region Failover

Deliverables: - DR guide - Failover procedures

Summary of Deliverables¶

Across all 20 cycles, this documentation will provide:

Cluster Architecture: AKS setup, node pools, networking
Namespaces: Organization, quotas, isolation
Workloads: Deployments, StatefulSets, DaemonSets, Jobs
Networking: Services, Ingress, Load Balancing, Service Mesh
Configuration: ConfigMaps, Secrets, Key Vault CSI
Resource Management: Requests, limits, quotas, QoS
Autoscaling: HPA, KEDA, custom metrics
Security: Pod Security, Network Policies, RBAC, Workload Identity
Packaging: Helm charts, templating, values
GitOps: FluxCD, declarative deployments
Observability: Monitoring, logging, tracing
Operations: kubectl, troubleshooting, DR

Related documentation:

Pulumi IaC: Infrastructure provisioning
GitOps: Continuous delivery
Deployment Views: ATP topology
Template Integration: Service templates
Configuration: App configuration
Observability: Monitoring and tracing
Security: Security architecture
Disaster Recovery: DR procedures

This documentation plan covers the complete Kubernetes infrastructure for ATP: AKS cluster setup and namespace organization, workload deployments, service networking, autoscaling with HPA and KEDA, security hardening with Pod Security Standards and Network Policies, service mesh integration, Helm packaging, GitOps with FluxCD, monitoring, operational procedures, and disaster recovery. Together these let ATP microservices run at scale with security, reliability, and cost-efficiency.