
Kubernetes Infrastructure - Audit Trail Platform (ATP)

Cloud-native, auto-scaling, zero-trust — ATP runs on Azure Kubernetes Service (AKS) with namespace isolation, service mesh (mTLS), KEDA event-driven autoscaling, Pod Security Standards, Helm packaging, and FluxCD GitOps for declarative, auditable, and resilient microservice orchestration.


📋 Documentation Generation Plan

This document will be generated in 20 cycles. Current progress:

| Cycle | Topics | Estimated Lines | Status |
|-------|--------|-----------------|--------|
| Cycle 1 | Kubernetes Architecture & AKS Overview (1-2) | ~3,500 | ⏳ Not Started |
| Cycle 2 | Cluster Setup & Node Pools (3-4) | ~3,000 | ⏳ Not Started |
| Cycle 3 | Namespace Organization (5-6) | ~3,000 | ⏳ Not Started |
| Cycle 4 | Deployment Workloads (7-8) | ~4,000 | ⏳ Not Started |
| Cycle 5 | Service Networking (9-10) | ~3,500 | ⏳ Not Started |
| Cycle 6 | Ingress & Load Balancing (11-12) | ~3,500 | ⏳ Not Started |
| Cycle 7 | ConfigMaps & Secrets (13-14) | ~3,000 | ⏳ Not Started |
| Cycle 8 | Azure Key Vault CSI Driver (15-16) | ~3,000 | ⏳ Not Started |
| Cycle 9 | Resource Management (17-18) | ~3,500 | ⏳ Not Started |
| Cycle 10 | Horizontal Pod Autoscaler (HPA) (19-20) | ~4,000 | ⏳ Not Started |
| Cycle 11 | KEDA Event-Driven Autoscaling (21-22) | ~4,500 | ⏳ Not Started |
| Cycle 12 | Pod Security Standards (23-24) | ~3,000 | ⏳ Not Started |
| Cycle 13 | Network Policies (25-26) | ~3,000 | ⏳ Not Started |
| Cycle 14 | Service Mesh & mTLS (27-28) | ~3,500 | ⏳ Not Started |
| Cycle 15 | RBAC & Workload Identity (29-30) | ~3,000 | ⏳ Not Started |
| Cycle 16 | Helm Charts (31-32) | ~4,000 | ⏳ Not Started |
| Cycle 17 | GitOps with FluxCD (33-34) | ~3,500 | ⏳ Not Started |
| Cycle 18 | Monitoring & Observability (35-36) | ~3,000 | ⏳ Not Started |
| Cycle 19 | Operations & Troubleshooting (37-38) | ~3,000 | ⏳ Not Started |
| Cycle 20 | Best Practices & Disaster Recovery (39-40) | ~3,000 | ⏳ Not Started |

Total Estimated Lines: ~67,500


Purpose & Scope

This document provides the complete Kubernetes infrastructure guide for ATP, covering Azure Kubernetes Service (AKS) cluster setup, namespace organization, workload deployments, service networking, autoscaling, security policies, Helm charts, GitOps, and operational best practices for running ATP's microservices at scale.

Why Kubernetes for ATP?

  1. Cloud-Native: Container orchestration with declarative configuration
  2. Scalability: Auto-scaling (HPA/KEDA) based on load and events
  3. Resilience: Self-healing, rolling updates, health checks
  4. Multi-Tenancy: Namespace isolation, resource quotas, network policies
  5. Security: Pod Security Standards, mTLS service mesh, RBAC, Workload Identity
  6. Portability: Run anywhere (Azure, on-prem, multi-cloud)
  7. Observability: Integrated monitoring, logging, tracing
  8. Cost Optimization: Efficient resource utilization, scale-to-zero with KEDA
  9. GitOps: Declarative, auditable infrastructure as code
  10. Ecosystem: Rich tooling (Helm, Kustomize, Flux, Istio/Linkerd, KEDA)

ATP Kubernetes Topology

Azure Front Door (AFD) + WAF
API Management (APIM) / Ingress Controller
AKS Cluster (3 regions: US, EU, IL)
    ├── System Ring
    │   ├── OTel Collector (DaemonSet)
    │   ├── Mesh Control Plane (Istio/Linkerd)
    │   ├── FluxCD Controllers
    │   ├── KEDA Operator
    │   └── Monitoring (Prometheus, Grafana)
    └── User Ring (ATP Namespaces)
        ├── atp-gateway-ns (Gateway pods)
        ├── atp-ingest-ns (Ingestion pods)
        ├── atp-policy-ns (Policy pods)
        ├── atp-projection-ns (Projection workers, KEDA)
        ├── atp-query-ns (Query API pods)
        ├── atp-integrity-ns (Integrity workers)
        ├── atp-export-ns (Export workers, KEDA)
        ├── atp-search-ns (Search service, optional)
        └── atp-admin-ns (Admin console)
Azure Services (Service Bus, SQL, Blob, Key Vault, Monitor)

Key Technologies

  • AKS (Azure Kubernetes Service): Managed Kubernetes
  • Service Mesh: Istio or Linkerd for mTLS, observability
  • KEDA: Kubernetes Event-Driven Autoscaling (scale to zero)
  • HPA: Horizontal Pod Autoscaler (CPU/memory/custom metrics)
  • Helm: Package manager for Kubernetes
  • FluxCD: GitOps continuous delivery
  • Azure Key Vault CSI Driver: Secrets injection
  • Azure Monitor: Container Insights, Log Analytics, Application Insights
  • Prometheus & Grafana: Metrics and dashboards
  • Calico/Azure CNI: Network policies

Detailed Cycle Plan

CYCLE 1: Kubernetes Architecture & AKS Overview (~3,500 lines)

Topic 1: Kubernetes Fundamentals

What will be covered: What is Kubernetes?

  • A container orchestration platform
  • Declarative configuration (YAML)
  • Desired state reconciliation
  • Self-healing, auto-scaling, rolling updates

  • Kubernetes Core Concepts

    Cluster:
    - Control Plane: API Server, Scheduler, Controller Manager, etcd
    - Worker Nodes: kubelet, kube-proxy, container runtime (containerd)
    
    Workloads:
    - Pod: Smallest deployable unit (1+ containers)
    - Deployment: Manages ReplicaSets, rolling updates
    - StatefulSet: For stateful applications (stable identity)
    - DaemonSet: One pod per node (monitoring, logging)
    - Job/CronJob: Run-to-completion tasks
    
    Networking:
    - Service: Stable endpoint for pod group (ClusterIP, LoadBalancer)
    - Ingress: HTTP/HTTPS routing to services
    - NetworkPolicy: Firewall rules for pods
    
    Configuration:
    - ConfigMap: Non-sensitive configuration
    - Secret: Sensitive data (passwords, tokens)
    - PersistentVolume: Durable storage
    
    Security:
    - ServiceAccount: Pod identity
    - RBAC: Role-based access control
    - Pod Security Standards: Security policies
    
    Observability:
    - Logs: stdout/stderr streams
    - Metrics: CPU, memory, custom metrics
    - Events: Cluster events (pod scheduled, image pulled)
    

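The core concepts above compose naturally: a Deployment manages replicated Pods, and a Service gives that pod group a stable endpoint. A minimal, hypothetical sketch (names and image are illustrative, not ATP's):

```yaml
# Hypothetical minimal workload: Deployment (3 replicas) + ClusterIP Service
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello
spec:
  replicas: 3                  # desired state; the controller reconciles toward it
  selector:
    matchLabels:
      app: hello
  template:
    metadata:
      labels:
        app: hello             # the Service selects pods by this label
    spec:
      containers:
      - name: hello
        image: nginx:1.25      # illustrative image
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: hello-svc
spec:
  selector:
    app: hello
  ports:
  - port: 80
    targetPort: 80
```

Applying this and then deleting one pod demonstrates self-healing: the Deployment controller immediately recreates it to restore the desired replica count.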
  • Why AKS (Azure Kubernetes Service)?

    Managed Control Plane:
    - No manual k8s master management
    - Microsoft maintains API server, etcd, scheduler
    - Automatic updates and patches
    
    Azure Integration:
    - Azure CNI (VNet integration)
    - Azure Monitor Container Insights
    - Azure Key Vault CSI Driver
    - Azure AD Workload Identity
    - Azure Blob CSI (persistent volumes)
    - Azure Service Bus (for KEDA triggers)
    
    Enterprise Features:
    - Availability Zones support
    - Node auto-scaling
    - Managed node pools
    - Azure Policy for Kubernetes
    - Defender for Containers (security)
    
    Cost Optimization:
    - Pay only for worker nodes
    - Spot instances (low-priority workloads)
    - Auto-shutdown for dev/test
    

Code Examples: - Kubernetes architecture diagram - AKS cluster creation (Azure CLI, Bicep, Pulumi) - kubectl basic commands

Diagrams: - Kubernetes architecture - AKS control plane vs. data plane - ATP on AKS topology

Deliverables: - Kubernetes fundamentals primer - AKS benefits for ATP - Cluster architecture overview


Topic 2: ATP AKS Cluster Architecture

What will be covered: - Cluster Configuration

Cluster Name: atp-aks-{region}-{env}
Examples:
- atp-aks-useast-prod
- atp-aks-euwest-prod
- atp-aks-ilcentral-prod
- atp-aks-useast-dev

Kubernetes Version: 1.28+ (managed, auto-upgrade minor versions)

Regions:
- Primary: East US (eastus)
- Secondary: West Europe (westeurope)
- Tertiary: Israel Central (israelcentral)

Availability Zones: 3 per region
- Zone 1, Zone 2, Zone 3
- Distribute node pools across zones
- Distribute pods across zones (pod anti-affinity)
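Zone distribution can also be expressed with `topologySpreadConstraints`, which degrades more gracefully than hard anti-affinity when replica counts exceed the zone count. A pod-spec fragment as a sketch, assuming the `app: ingestion` labels used later in this document:

```yaml
# Pod-spec fragment: keep replicas evenly spread across the 3 AZs
topologySpreadConstraints:
- maxSkew: 1                            # zones may differ by at most one pod
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: ScheduleAnyway     # soft constraint: prefer, don't block scheduling
  labelSelector:
    matchLabels:
      app: ingestion
```

Changing `whenUnsatisfiable` to `DoNotSchedule` makes the spread a hard requirement, at the cost of unschedulable pods during zone outages.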

  • Node Pools Strategy

    System Node Pool (np-system):
    - VM Size: Standard_D4s_v5 (4 vCPU, 16 GB RAM)
    - Count: 3 (one per AZ)
    - Auto-scale: No (fixed)
    - Taint: CriticalAddonsOnly=true:NoSchedule
    - Purpose: Control plane components, mesh, KEDA, FluxCD, OTel
    
    Generic Node Pool (np-generic):
    - VM Size: Standard_D8s_v5 (8 vCPU, 32 GB RAM)
    - Count: 3-30 (auto-scale)
    - Taint: None
    - Purpose: Stateless APIs (Gateway, Query, Admin, Policy)
    
    I/O Node Pool (np-io):
    - VM Size: Standard_E8s_v5 (8 vCPU, 64 GB RAM, premium storage)
    - Count: 2-50 (auto-scale)
    - Taint: workload=io:NoSchedule
    - Purpose: I/O-heavy (Ingestion, Projection, Export, Integrity)
    
    Jobs Node Pool (np-jobs) - Optional:
    - VM Size: Standard_F16s_v2 (16 vCPU, 32 GB RAM, compute-optimized)
    - Count: 0-20 (KEDA scale to zero)
    - Taint: workload=batch:NoSchedule
    - Purpose: Export jobs, maintenance tasks, compliance reports
    
    Spot Node Pool (np-spot) - Optional:
    - VM Size: Standard_D8s_v5
    - Count: 0-10
    - Spot Priority: Yes (up to ~80% cost savings)
    - Taint: kubernetes.azure.com/scalesetpriority=spot:NoSchedule
    - Purpose: Non-critical workloads (dev/test projections, backfills)
    

  • Network Configuration

    CNI Plugin: Azure CNI (VNet integration)
    
    VNet CIDR: 10.42.0.0/16
    - Subnet: aks-nodes (10.42.0.0/20) - 4096 IPs for nodes
    - Subnet: aks-pods (10.42.16.0/20) - 4096 IPs for pods
    - Subnet: aks-services (10.42.32.0/24) - 256 IPs for internal load balancer frontends
    
    Service CIDR: 10.43.0.0/16 (internal cluster services)
    DNS Service IP: 10.43.0.10
    
    Network Policy: Azure Network Policy or Calico
    
    Load Balancer:
    - Type: Standard Load Balancer
    - Public IP: Static (for ingress controller)
    - Private endpoints: For Azure services (SQL, Storage, Key Vault)
    

Code Examples: - AKS cluster creation (Pulumi C#) - Node pool configuration - Network setup

Diagrams: - AKS cluster architecture - Node pool distribution across AZs - Network topology

Deliverables: - AKS cluster specification - Node pool strategy - Network architecture


CYCLE 2: Cluster Setup & Node Pools (~3,000 lines)

Topic 3: AKS Cluster Provisioning

What will be covered: - Pulumi IaC for AKS

// Create AKS cluster with Pulumi
var cluster = new AzureNative.ContainerService.ManagedCluster("atp-aks-prod", new()
{
    ResourceGroupName = resourceGroup.Name,
    Location = location,
    DnsPrefix = "atp-prod",
    KubernetesVersion = "1.28",
    EnableRBAC = true,

    // Identity (Workload Identity for pods)
    Identity = new ManagedClusterIdentityArgs
    {
        Type = ResourceIdentityType.SystemAssigned
    },

    // Network profile
    NetworkProfile = new ContainerServiceNetworkProfileArgs
    {
        NetworkPlugin = "azure",
        NetworkPolicy = "calico",
        ServiceCidr = "10.43.0.0/16",
        DnsServiceIP = "10.43.0.10",
        LoadBalancerSku = "standard"
    },

    // Add-ons
    AddonProfiles = new InputMap<ManagedClusterAddonProfileArgs>
    {
        ["azureKeyvaultSecretsProvider"] = new ManagedClusterAddonProfileArgs
        {
            Enabled = true,
            Config = new InputMap<string>
            {
                ["enableSecretRotation"] = "true",
                ["rotationPollInterval"] = "2m"
            }
        },
        ["omsAgent"] = new ManagedClusterAddonProfileArgs
        {
            Enabled = true,
            Config = new InputMap<string>
            {
                ["logAnalyticsWorkspaceResourceID"] = logAnalyticsWorkspace.Id
            }
        }
    },

    // System node pool
    AgentPoolProfiles = new[]
    {
        new ManagedClusterAgentPoolProfileArgs
        {
            Name = "npsystem",
            Count = 3,
            VmSize = "Standard_D4s_v5",
            OsType = "Linux",
            Mode = "System",
            AvailabilityZones = new[] { "1", "2", "3" },
            NodeTaints = new[] { "CriticalAddonsOnly=true:NoSchedule" }
        }
    }
});

// Add generic node pool
var genericNodePool = new AzureNative.ContainerService.AgentPool("np-generic", new()
{
    ResourceGroupName = resourceGroup.Name,
    ResourceName = cluster.Name,
    AgentPoolName = "npgeneric",
    Count = 3,
    VmSize = "Standard_D8s_v5",
    OsType = "Linux",
    Mode = "User",
    AvailabilityZones = new[] { "1", "2", "3" },
    EnableAutoScaling = true,
    MinCount = 3,
    MaxCount = 30
});

// Add I/O node pool
var ioNodePool = new AzureNative.ContainerService.AgentPool("np-io", new()
{
    ResourceGroupName = resourceGroup.Name,
    ResourceName = cluster.Name,
    AgentPoolName = "npio",
    Count = 2,
    VmSize = "Standard_E8s_v5",
    OsType = "Linux",
    Mode = "User",
    AvailabilityZones = new[] { "1", "2", "3" },
    EnableAutoScaling = true,
    MinCount = 2,
    MaxCount = 50,
    NodeTaints = new[] { "workload=io:NoSchedule" }
});

  • Azure CLI Alternative
    # Create resource group
    az group create --name atp-aks-prod-rg --location eastus
    
    # Create AKS cluster
    az aks create \
        --resource-group atp-aks-prod-rg \
        --name atp-aks-useast-prod \
        --node-count 3 \
        --node-vm-size Standard_D4s_v5 \
        --kubernetes-version 1.28 \
        --network-plugin azure \
        --network-policy calico \
        --enable-managed-identity \
        --enable-addons monitoring,azure-keyvault-secrets-provider \
        --enable-workload-identity \
        --enable-oidc-issuer \
        --zones 1 2 3 \
        --nodepool-name npsystem \
        --nodepool-taints CriticalAddonsOnly=true:NoSchedule
    
    # Add generic node pool
    az aks nodepool add \
        --resource-group atp-aks-prod-rg \
        --cluster-name atp-aks-useast-prod \
        --name npgeneric \
        --node-count 3 \
        --node-vm-size Standard_D8s_v5 \
        --enable-cluster-autoscaler \
        --min-count 3 \
        --max-count 30 \
        --zones 1 2 3
    
    # Add I/O node pool
    az aks nodepool add \
        --resource-group atp-aks-prod-rg \
        --cluster-name atp-aks-useast-prod \
        --name npio \
        --node-count 2 \
        --node-vm-size Standard_E8s_v5 \
        --enable-cluster-autoscaler \
        --min-count 2 \
        --max-count 50 \
        --node-taints workload=io:NoSchedule \
        --zones 1 2 3
    

Code Examples: - Complete AKS provisioning (Pulumi, Azure CLI, Bicep) - Node pool configurations - Cluster add-ons setup

Diagrams: - AKS provisioning workflow - Node pool architecture

Deliverables: - AKS provisioning guide - Node pool specifications - Add-ons configuration


Topic 4: Connecting to AKS Cluster

What will be covered: - kubectl Configuration

# Get AKS credentials
az aks get-credentials \
    --resource-group atp-aks-prod-rg \
    --name atp-aks-useast-prod

# Verify connection
kubectl cluster-info
kubectl get nodes
kubectl get namespaces

# Switch context (multiple clusters)
kubectl config get-contexts
kubectl config use-context atp-aks-useast-prod

  • Access Control
  • Azure AD integration
  • RBAC roles (admin, developer, reader)
  • kubeconfig with AAD tokens
  • kubectl proxy for local access
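A namespace-scoped, read-only developer role bound to an Azure AD group could be sketched as follows (the group object ID is a placeholder):

```yaml
# Read-only access to common resources in atp-ingest-ns
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: atp-developer
  namespace: atp-ingest-ns
rules:
- apiGroups: ["", "apps"]
  resources: ["pods", "pods/log", "deployments", "services", "configmaps"]
  verbs: ["get", "list", "watch"]             # read-only verbs
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: atp-developer-binding
  namespace: atp-ingest-ns
subjects:
- kind: Group
  name: "<aad-developer-group-object-id>"     # placeholder: Azure AD group object ID
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: atp-developer
  apiGroup: rbac.authorization.k8s.io
```

With Azure AD integration enabled, group membership in AAD controls who receives this role; no individual kubeconfig entries are needed.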

Code Examples: - Connection setup - Context management - RBAC configuration

Deliverables: - Access guide - RBAC setup - Security best practices


CYCLE 3: Namespace Organization (~3,000 lines)

Topic 5: Namespace Strategy

What will be covered: - ATP Namespace Taxonomy

# System namespaces (managed)
- kube-system           # Kubernetes core components
- kube-public           # Public cluster info
- flux-system           # FluxCD controllers
- istio-system          # Service mesh control plane
- keda                  # KEDA operator
- monitoring            # Prometheus, Grafana

# ATP application namespaces
- atp-gateway-ns        # API Gateway (public entry point)
- atp-ingest-ns         # Ingestion service
- atp-policy-ns         # Policy service
- atp-projection-ns     # Projection workers
- atp-query-ns          # Query service
- atp-integrity-ns      # Integrity service
- atp-export-ns         # Export service
- atp-search-ns         # Search service (optional)
- atp-admin-ns          # Admin console

  • Namespace Creation

    # Create namespace with labels
    apiVersion: v1
    kind: Namespace
    metadata:
      name: atp-ingest-ns
      labels:
        app.kubernetes.io/name: atp-ingestion
        app.kubernetes.io/component: backend
        app.kubernetes.io/part-of: audit-trail-platform
        environment: production
        region: us-east
        istio-injection: enabled  # Enable service mesh sidecar injection
        kyverno.io/policy-severity: high
    

  • Namespace Isolation

  • Resource quotas per namespace
  • Network policies (deny by default)
  • RBAC (namespace-scoped roles)
  • Pod Security Standards

  • Namespace Lifecycle

  • Creation (GitOps via FluxCD)
  • Configuration (ResourceQuota, NetworkPolicy, RBAC)
  • Monitoring (per-namespace dashboards)
  • Deletion (drain workloads, cleanup resources)
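The "deny by default" posture is typically bootstrapped with an empty-selector NetworkPolicy in each namespace; a sketch for `atp-ingest-ns`:

```yaml
# Deny all ingress and egress for every pod in the namespace;
# later policies then allow specific flows explicitly.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: atp-ingest-ns
spec:
  podSelector: {}        # empty selector = all pods in the namespace
  policyTypes:
  - Ingress
  - Egress
```

Note that denying egress also blocks DNS; an explicit allow rule for kube-dns (UDP/TCP 53) is usually the first exception added.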

Code Examples: - Namespace creation manifests - Resource quota configuration - RBAC binding

Diagrams: - Namespace organization - Isolation boundaries

Deliverables: - Namespace taxonomy - Creation templates - Isolation policies


Topic 6: Resource Quotas & LimitRanges

What will be covered: - Resource Quota per Namespace

apiVersion: v1
kind: ResourceQuota
metadata:
  name: atp-ingest-quota
  namespace: atp-ingest-ns
spec:
  hard:
    requests.cpu: "20"        # Max 20 CPU cores requested
    requests.memory: 40Gi     # Max 40 GB memory requested
    limits.cpu: "40"          # Max 40 CPU cores limit
    limits.memory: 80Gi       # Max 80 GB memory limit
    pods: "50"                # Max 50 pods
    services: "10"            # Max 10 services
    persistentvolumeclaims: "5"

  • LimitRange (Default Limits)
    apiVersion: v1
    kind: LimitRange
    metadata:
      name: atp-ingest-limits
      namespace: atp-ingest-ns
    spec:
      limits:
      - max:
          cpu: "4"
          memory: 8Gi
        min:
          cpu: "100m"
          memory: 128Mi
        default:
          cpu: "500m"
          memory: 512Mi
        defaultRequest:
          cpu: "250m"
          memory: 256Mi
        type: Container
    

Code Examples: - ResourceQuota manifests (all namespaces) - LimitRange configuration - Quota monitoring

Diagrams: - Resource allocation - Quota enforcement

Deliverables: - Quota specifications - Limit policies - Monitoring dashboards


CYCLE 4: Deployment Workloads (~4,000 lines)

Topic 7: Deployment Manifests

What will be covered: - Deployment Anatomy

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ingestion
  namespace: atp-ingest-ns
  labels:
    app: ingestion
    version: v1
    component: backend
spec:
  # Replica management
  replicas: 3

  # Rolling update strategy
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 30%          # Allow 30% extra pods during update
      maxUnavailable: 0      # Never go below desired count

  # Pod selector
  selector:
    matchLabels:
      app: ingestion
      version: v1

  # Pod template
  template:
    metadata:
      labels:
        app: ingestion
        version: v1
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9091"
        prometheus.io/path: "/metrics"

    spec:
      # Security context (Pod level)
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 2000
        seccompProfile:
          type: RuntimeDefault

      # Service account (Workload Identity)
      serviceAccountName: ingestion-sa

      # Pod anti-affinity: prefer spreading across nodes,
      # require spreading across zones (one podAntiAffinity block --
      # duplicate YAML keys are invalid)
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - ingestion
              topologyKey: kubernetes.io/hostname
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - ingestion
            topologyKey: topology.kubernetes.io/zone

      # Tolerations (for I/O node pool)
      tolerations:
      - key: "workload"
        operator: "Equal"
        value: "io"
        effect: "NoSchedule"

      # Containers
      containers:
      - name: ingestion
        image: atpacr.azurecr.io/atp/ingestion:1.2.3@sha256:abc123...

        # Security context (Container level)
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          runAsNonRoot: true
          runAsUser: 1000
          capabilities:
            drop:
            - ALL

        # Ports
        ports:
        - name: http
          containerPort: 8080
          protocol: TCP
        - name: grpc
          containerPort: 9090
          protocol: TCP
        - name: metrics
          containerPort: 9091
          protocol: TCP

        # Environment variables
        env:
        - name: ASPNETCORE_ENVIRONMENT
          value: "Production"
        - name: TENANT_CONTEXT_HEADER
          value: "X-Tenant-Id"

        # ConfigMap reference
        envFrom:
        - configMapRef:
            name: ingestion-config

        # Resource requests and limits
        resources:
          requests:
            cpu: "500m"
            memory: "1Gi"
          limits:
            cpu: "2000m"
            memory: "2Gi"

        # Liveness probe (is container alive?)
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3

        # Readiness probe (ready to serve traffic?)
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 3

        # Startup probe (for slow-starting apps)
        startupProbe:
          httpGet:
            path: /health/startup
            port: 8080
          initialDelaySeconds: 0
          periodSeconds: 5
          failureThreshold: 30  # 30 * 5s = 150s max startup time

        # Volume mounts
        volumeMounts:
        - name: secrets-store
          mountPath: "/mnt/secrets"
          readOnly: true
        - name: temp-dir
          mountPath: "/tmp"

      # Volumes
      volumes:
      - name: secrets-store
        csi:
          driver: secrets-store.csi.k8s.io
          readOnly: true
          volumeAttributes:
            secretProviderClass: "atp-kv-secrets"
      - name: temp-dir
        emptyDir: {}

  • All ATP Deployments
  • Gateway, Ingestion, Policy, Projection, Query, Integrity, Export, Search, Admin
  • Each with specific resource profiles
  • Affinity rules for high availability
  • Tolerations for node pool assignment
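Anti-affinity protects against co-location, but voluntary disruptions (node drains, upgrades) also need a floor. A PodDisruptionBudget sketch matching the ingestion deployment above:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ingestion-pdb
  namespace: atp-ingest-ns
spec:
  minAvailable: 2          # with replicas: 3, at most one pod is evicted at a time
  selector:
    matchLabels:
      app: ingestion
```

During `kubectl drain` or an AKS node-image upgrade, evictions that would violate `minAvailable` are delayed until replacement pods are ready.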

Code Examples: - Complete deployment manifests (all ATP services) - Security contexts - Probe configurations - Volume mounts

Diagrams: - Deployment anatomy - Pod lifecycle - Health check flow

Deliverables: - Deployment manifest library - Configuration guide - Best practices


Topic 8: StatefulSets & DaemonSets

What will be covered: When to Use StatefulSets vs. Deployments

| Workload Type | Use Deployment | Use StatefulSet |
|---------------|----------------|-----------------|
| Stateless APIs | ✅ | ❌ |
| Event consumers | ✅ | ❌ |
| Projection workers | ✅ | ❌ |
| Databases (managed externally) | ✅ | ❌ |
| Caching tier (Redis Sentinel) | ❌ | ✅ |
| Message brokers (Kafka, RabbitMQ) | ❌ | ✅ |
| Stateful actors (Orleans silos) | ❌ | ✅ |

  • DaemonSet Use Cases
  • OTel Collector: One per node for metrics/logs
  • Log Forwarder: FluentBit/Fluentd for centralized logging
  • Node Exporter: Prometheus node metrics
  • Security Agent: Defender for Containers
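A DaemonSet for the OTel Collector might be sketched as follows (image tag and resource figures are illustrative, not the project's pinned values):

```yaml
# One collector pod per node, including tainted node pools
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-collector
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      tolerations:
      - operator: Exists           # tolerate every taint so all nodes are covered
      containers:
      - name: otel-collector
        image: otel/opentelemetry-collector-contrib:0.96.0   # illustrative tag
        resources:
          requests:
            cpu: 100m
            memory: 256Mi
          limits:
            cpu: 500m
            memory: 512Mi
```

The blanket `operator: Exists` toleration is what lets the DaemonSet land on the tainted system, I/O, and spot pools as well as the generic pool.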

Code Examples: - StatefulSet example (if needed for ATP) - DaemonSet for OTel Collector - Persistent volume claims

Diagrams: - Workload type comparison - DaemonSet architecture

Deliverables: - Workload selection guide - DaemonSet configurations - Stateful patterns


CYCLE 5: Service Networking (~3,500 lines)

Topic 9: Kubernetes Services

What will be covered: - Service Types

# 1. ClusterIP (default, internal only)
apiVersion: v1
kind: Service
metadata:
  name: ingestion-svc
  namespace: atp-ingest-ns
spec:
  type: ClusterIP
  selector:
    app: ingestion
  ports:
  - name: http
    port: 80
    targetPort: 8080
    protocol: TCP
  - name: grpc
    port: 9090
    targetPort: 9090
    protocol: TCP

# 2. LoadBalancer (external, public IP)
apiVersion: v1
kind: Service
metadata:
  name: gateway-lb
  namespace: atp-gateway-ns
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "false"
spec:
  type: LoadBalancer
  selector:
    app: gateway
  ports:
  - name: https
    port: 443
    targetPort: 8443

# 3. Headless Service (for StatefulSets)
apiVersion: v1
kind: Service
metadata:
  name: orleans-silo-headless
  namespace: atp-orleans-ns
spec:
  type: ClusterIP
  clusterIP: None  # Headless
  selector:
    app: orleans-silo
  ports:
  - name: silo
    port: 11111
    targetPort: 11111

  • Service Discovery
  • DNS-based (service-name.namespace.svc.cluster.local)
  • Environment variables
  • Service mesh (Istio VirtualService)

  • Load Balancing

  • Round-robin (default)
  • Session affinity (ClientIP)
  • Topology-aware routing
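Session affinity is configured directly on the Service; a sketch pinning clients to a pod by source IP (service name and timeout are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: query-svc
  namespace: atp-query-ns
spec:
  selector:
    app: query
  sessionAffinity: ClientIP       # route a given client IP to the same pod
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800       # 3 hours; illustrative
  ports:
  - name: http
    port: 80
    targetPort: 8080
```

Note that ClientIP affinity interacts poorly with intermediaries that SNAT traffic (load balancers, mesh sidecars); stateless services should keep the default round-robin.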

Code Examples: - Service manifests (all types) - Service discovery examples - Load balancing configuration

Diagrams: - Service types comparison - DNS resolution flow - Load balancing strategies

Deliverables: - Service manifest library - Discovery guide - Load balancing patterns


Topic 10: Service Mesh Integration

What will be covered: Service-to-Service Communication

  • mTLS encryption (automatic)
  • Identity-based auth
  • Traffic management (retries, timeouts, circuit breakers)
  • Observability (distributed tracing)

  • Istio/Linkerd Configuration
    # Enable sidecar injection (namespace label)
    apiVersion: v1
    kind: Namespace
    metadata:
      name: atp-ingest-ns
      labels:
        istio-injection: enabled  # or linkerd.io/inject: enabled
    
    # VirtualService (traffic routing)
    apiVersion: networking.istio.io/v1beta1
    kind: VirtualService
    metadata:
      name: ingestion-vs
      namespace: atp-ingest-ns
    spec:
      hosts:
      - ingestion-svc
      http:
      - route:
        - destination:
            host: ingestion-svc
            port:
              number: 80
        timeout: 30s
        retries:
          attempts: 3
          perTryTimeout: 10s
          retryOn: 5xx,reset,connect-failure,refused-stream
    
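With Istio, enforcing (not merely enabling) mTLS is done with a PeerAuthentication resource; a sketch for the ingestion namespace:

```yaml
# Reject any plaintext traffic to pods in atp-ingest-ns
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: atp-ingest-ns
spec:
  mtls:
    mode: STRICT
```

STRICT mode should be rolled out after confirming all clients carry sidecars; `PERMISSIVE` (accept both) is the usual migration stepping stone.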

Code Examples: - Service mesh configuration - Traffic policies - mTLS verification

Diagrams: - Service mesh architecture - mTLS flow

Deliverables: - Service mesh setup - Traffic management - Security policies


CYCLE 6: Ingress & Load Balancing (~3,500 lines)

Topic 11: Ingress Controllers

What will be covered: Ingress Controller Options

| Controller | Use Case | ATP Usage |
|------------|----------|-----------|
| NGINX Ingress | General-purpose HTTP/HTTPS | Dev/Test environments |
| Azure Application Gateway | WAF, Azure-native | Production (alternative to APIM) |
| Istio Ingress Gateway | Service mesh integration | Production (with Istio) |
| Traefik | Dynamic routing, middlewares | Internal services |

  • NGINX Ingress Setup

    # Install NGINX Ingress Controller
    helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
    helm repo update
    
    helm install nginx-ingress ingress-nginx/ingress-nginx \
        --namespace ingress-nginx \
        --create-namespace \
        --set controller.service.type=LoadBalancer \
        --set controller.service.annotations."service\.beta\.kubernetes\.io/azure-load-balancer-health-probe-request-path"=/healthz
    

  • Ingress Resource

    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: atp-gateway-ingress
      namespace: atp-gateway-ns
      annotations:
        cert-manager.io/cluster-issuer: "letsencrypt-prod"
        nginx.ingress.kubernetes.io/ssl-redirect: "true"
        nginx.ingress.kubernetes.io/limit-rps: "100"  # rate limit: requests/second per client IP
    spec:
      ingressClassName: nginx  # replaces the deprecated kubernetes.io/ingress.class annotation
      tls:
      - hosts:
        - api.audittrail.example.com
        secretName: atp-gateway-tls
      rules:
      - host: api.audittrail.example.com
        http:
          paths:
          - path: /api/v1/ingest
            pathType: Prefix
            backend:
              service:
                name: ingestion-svc
                port:
                  number: 80
          - path: /api/v1/query
            pathType: Prefix
            backend:
              service:
                name: query-svc
                port:
                  number: 80
          - path: /api/v1/policy
            pathType: Prefix
            backend:
              service:
                name: policy-svc
                port:
                  number: 80
    

Code Examples: - Ingress controller installation - Ingress resource manifests - TLS configuration - Rate limiting

Diagrams: - Ingress architecture - Traffic routing flow

Deliverables: - Ingress setup guide - Routing configurations - TLS management


Topic 12: Azure Front Door + APIM Integration

What will be covered: External Load Balancing

  • Azure Front Door (global load balancing, WAF)
  • API Management (rate limiting, caching, versioning)
  • Direct to Ingress Controller

  • Traffic Flow
    Internet Client
        ↓
    Azure Front Door (global edge, WAF)
        ↓
    API Management (region-specific)
        ↓
    AKS Ingress Controller (NGINX/Istio)
        ↓
    Gateway Service (Kubernetes Service)
        ↓
    Gateway Pods (via mesh)
        ↓
    Backend Services (Ingestion, Query, etc.)
    

Code Examples: - AFD configuration - APIM backend configuration - End-to-end routing

Diagrams: - Complete traffic flow - Edge integration

Deliverables: - Edge integration guide - Traffic routing - Failover patterns


CYCLE 7: ConfigMaps & Secrets (~3,000 lines)

Topic 13: ConfigMap Management

What will be covered: - ConfigMap for Application Settings

apiVersion: v1
kind: ConfigMap
metadata:
  name: ingestion-config
  namespace: atp-ingest-ns
data:
  appsettings.json: |
    {
      "ApplicationName": "ATP.Ingestion",
      "Logging": {
        "LogLevel": {
          "Default": "Information"
        }
      },
      "HealthChecks": {
        "Enabled": true
      }
    }

  # Environment-specific overrides
  ASPNETCORE_ENVIRONMENT: "Production"
  TENANT_CONTEXT_HEADER: "X-Tenant-Id"
  MAX_BATCH_SIZE: "100"

  • Using ConfigMaps
    # Mount as environment variables
    envFrom:
    - configMapRef:
        name: ingestion-config
    
    # Mount as file
    volumeMounts:
    - name: config-volume
      mountPath: /app/config
    volumes:
    - name: config-volume
      configMap:
        name: ingestion-config
    

Code Examples: - ConfigMap creation - Usage patterns - Dynamic updates

Deliverables: - ConfigMap templates - Usage guide - Update procedures


Topic 14: Secret Management

What will be covered: Kubernetes Secrets (Avoid for Production)

  • Base64 encoded (not encrypted at rest by default)
  • Use only for non-sensitive config
  • Prefer Azure Key Vault

  • Secret Types
    # Opaque secret
    apiVersion: v1
    kind: Secret
    metadata:
      name: db-credentials
      namespace: atp-ingest-ns
    type: Opaque
    data:
      username: YWRtaW4=  # base64("admin")
      password: cGFzc3dvcmQ=  # base64("password")
    
    # TLS secret
    apiVersion: v1
    kind: Secret
    metadata:
      name: tls-cert
    type: kubernetes.io/tls
    data:
      tls.crt: <base64-encoded-cert>
      tls.key: <base64-encoded-key>
    

Code Examples: - Secret creation - Secret usage - Rotation procedures

Deliverables: - Secret management guide - Best practices - Security considerations


CYCLE 8: Azure Key Vault CSI Driver (~3,000 lines)

Topic 15: Key Vault Integration

What will be covered: - Azure Key Vault CSI Driver

# SecretProviderClass
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: atp-kv-secrets
  namespace: atp-ingest-ns
spec:
  provider: azure
  parameters:
    usePodIdentity: "false"
    useVMManagedIdentity: "false"
    userAssignedIdentityID: "<workload-identity-client-id>"
    keyvaultName: "atp-kv-prod"
    cloudName: "AzurePublicCloud"
    tenantId: "<azure-tenant-id>"
    objects: |
      array:
        - |
          objectName: DatabasePassword
          objectType: secret
          objectVersion: ""
        - |
          objectName: ServiceBusConnectionString
          objectType: secret
          objectVersion: ""
        - |
          objectName: SigningKeyPrivate
          objectType: secret
          objectVersion: ""

  • Mount Secrets in Pod

    volumes:
    - name: secrets-store
      csi:
        driver: secrets-store.csi.k8s.io
        readOnly: true
        volumeAttributes:
          secretProviderClass: "atp-kv-secrets"
    
    volumeMounts:
    - name: secrets-store
      mountPath: "/mnt/secrets"
      readOnly: true
    

  • Access Secrets in Application

    // Read secret from mounted path
    var dbPassword = await File.ReadAllTextAsync("/mnt/secrets/DatabasePassword");
    
    // Or use configuration provider
    configuration.AddJsonFile("/mnt/secrets/appsettings-secrets.json", optional: false);
    

Code Examples: - Complete Key Vault CSI setup - SecretProviderClass for all services - Secret rotation automation

Diagrams: - Key Vault CSI architecture - Secret mounting flow

Deliverables: - Key Vault integration guide - Secret provider configurations - Rotation procedures


Topic 16: Workload Identity (successor to AAD Pod Identity)

What will be covered: - Workload Identity Setup

# Service Account with Workload Identity
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ingestion-sa
  namespace: atp-ingest-ns
  annotations:
    azure.workload.identity/client-id: "<managed-identity-client-id>"
    azure.workload.identity/tenant-id: "<azure-tenant-id>"

  • Grant Permissions
    # Create managed identity
    az identity create \
        --name atp-ingestion-identity \
        --resource-group atp-aks-prod-rg
    
    # Grant Key Vault access
    az keyvault set-policy \
        --name atp-kv-prod \
        --object-id <identity-object-id> \
        --secret-permissions get list
    
    # Federate with the AKS OIDC issuer (Workload Identity; replaces the
    # deprecated `az aks pod-identity add` from AAD Pod Identity)
    az identity federated-credential create \
        --name ingestion-federated-cred \
        --identity-name atp-ingestion-identity \
        --resource-group atp-aks-prod-rg \
        --issuer "$(az aks show --resource-group atp-aks-prod-rg \
            --name atp-aks-useast-prod \
            --query oidcIssuerProfile.issuerUrl -o tsv)" \
        --subject system:serviceaccount:atp-ingest-ns:ingestion-sa
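The federated credential's subject must match the Kubernetes service account exactly, or token exchange silently fails. A tiny helper (hypothetical, for illustration only) captures the required format:

```python
def federation_subject(namespace: str, service_account: str) -> str:
    # Subject string Azure AD matches against the token issued to the pod
    return f"system:serviceaccount:{namespace}:{service_account}"

subject = federation_subject("atp-ingest-ns", "ingestion-sa")
```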
    

Code Examples: - Workload Identity setup - Permission grants - Pod configuration

Deliverables: - Workload Identity guide - Security configuration - Permission matrix


CYCLE 9: Resource Management (~3,500 lines)

Topic 17: Resource Requests & Limits

What will be covered: - ATP Service Resource Profiles

| Service | CPU Request | CPU Limit | Memory Request | Memory Limit | Notes |
|---------|-------------|-----------|----------------|--------------|-------|
| Gateway | 250m | 1000m | 512Mi | 1Gi | Stateless, low CPU |
| Ingestion | 500m | 2000m | 1Gi | 2Gi | I/O intensive |
| Policy | 250m | 1000m | 512Mi | 1Gi | Cache-heavy, low CPU |
| Projection | 500m | 2000m | 1Gi | 2Gi | DB writes, batch processing |
| Query | 500m | 2000m | 1Gi | 2Gi | DB reads, caching |
| Integrity | 250m | 1000m | 512Mi | 1Gi | Crypto operations, burst |
| Export | 1000m | 4000m | 2Gi | 4Gi | Large file processing |
| Search | 500m | 2000m | 1Gi | 2Gi | Index operations |
| Admin | 250m | 500m | 256Mi | 512Mi | Lightweight, low traffic |

  • Right-Sizing Resources
  • Monitor actual usage (Prometheus, Azure Monitor)
  • Set requests = typical usage (95th percentile)
  • Set limits = max burst capacity
  • Leave headroom for spikes
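To make the p95 guidance above concrete, a nearest-rank percentile over sampled CPU usage (the sample values are invented for illustration) shows why requests should track typical load rather than the worst spike:

```python
import math

def percentile(samples, q):
    """Nearest-rank percentile -- adequate for right-sizing estimates."""
    ordered = sorted(samples)
    idx = max(0, math.ceil(q / 100 * len(ordered)) - 1)
    return ordered[idx]

# One hour of CPU samples in millicores (illustrative data, one 900m spike)
cpu_m = [200, 210, 190, 220, 230, 205, 215, 225, 240, 210,
         195, 250, 260, 270, 235, 245, 255, 280, 300, 900]

request = percentile(cpu_m, 95)   # 300m: typical usage, ignores the spike
# The limit, not the request, should leave headroom for the 900m burst.
```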

  • Quality of Service (QoS) Classes

    Guaranteed (requests = limits):
    - Critical services (Gateway, Ingestion)
    - Predictable performance
    - Never evicted for resource pressure
    
    Burstable (requests < limits):
    - Most ATP services
    - Can burst above requests if node has capacity
    - May be throttled/evicted under pressure
    
    BestEffort (no requests/limits):
    - Avoid in production
    - Lowest priority, first to be evicted
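The kubelet derives the class mechanically from requests and limits; a simplified sketch of that assignment (glossing over pod-vs-container merging nuances) makes the three outcomes above testable:

```python
def qos_class(containers):
    """Simplified QoS assignment: Guaranteed requires cpu+memory limits on
    every container with requests equal to limits; no requests or limits
    anywhere means BestEffort; everything else is Burstable."""
    if not any(c.get("requests") or c.get("limits") for c in containers):
        return "BestEffort"
    guaranteed = all(
        set(c.get("limits", {})) == {"cpu", "memory"}
        and c.get("requests", c.get("limits")) == c.get("limits")
        for c in containers
    )
    return "Guaranteed" if guaranteed else "Burstable"
```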
    

Code Examples: - Resource configuration for all services - Monitoring resource usage - Right-sizing analysis

Diagrams: - Resource allocation - QoS classes - Eviction priority

Deliverables: - Resource profile guide - Monitoring dashboard - Right-sizing procedures


Topic 18: Resource Quotas & Cost Control

What will be covered: - Namespace Resource Quotas - Cost Allocation by Namespace - Spot Instances for Non-Critical Workloads - Cluster Autoscaler

Code Examples: - Quota enforcement - Cost monitoring - Spot configuration

Deliverables: - Cost control guide - Quota policies - Optimization strategies


CYCLE 10: Horizontal Pod Autoscaler (HPA) (~4,000 lines)

Topic 19: HPA Configuration

What will be covered: - HPA v2 (Metrics-Based Autoscaling)

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ingestion-hpa
  namespace: atp-ingest-ns
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ingestion

  minReplicas: 3
  maxReplicas: 30

  # Scale up/down behavior
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100        # Double pods in 60s if needed
        periodSeconds: 60
      - type: Pods
        value: 4          # Or add 4 pods
        periodSeconds: 60
      selectPolicy: Max   # Take max of policies

    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 25         # Remove 25% of pods
        periodSeconds: 60
      selectPolicy: Min   # Take min (conservative)

  # Metrics
  metrics:
  # CPU utilization
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

  # Memory utilization
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80

  # Custom metric (request rate)
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"

  # Custom metric (P95 latency)
  - type: Pods
    pods:
      metric:
        name: http_request_duration_p95_ms
      target:
        type: AverageValue
        averageValue: "200"

  • HPA for All ATP Services
  • Gateway: CPU + request rate + P95 latency
  • Ingestion: CPU + ingest rate
  • Query: CPU + query rate + P95 latency
  • Policy: CPU + cache miss rate
  • (Projection, Export use KEDA instead)
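Every metric above feeds the same core calculation. A sketch of the HPA formula (clamping only; it ignores the readiness tolerance and the behavior rate limits) shows how a 70% CPU target drives replica counts:

```python
import math

def hpa_desired_replicas(current_replicas, current_value, target_value,
                         min_replicas=3, max_replicas=30):
    """desired = ceil(current * currentMetric / targetMetric), clamped to the
    configured bounds. Behavior policies then rate-limit how fast the target
    is approached."""
    desired = math.ceil(current_replicas * current_value / target_value)
    return max(min_replicas, min(max_replicas, desired))

scale_up = hpa_desired_replicas(5, 91, 70)    # 91% avg CPU vs 70% target -> 7
scale_dn = hpa_desired_replicas(10, 21, 70)   # light load -> floor at minReplicas
```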

Code Examples: - Complete HPA manifests (all services) - Custom metrics integration - Scaling behavior tuning

Diagrams: - HPA architecture - Scaling decision flow - Metrics pipeline

Deliverables: - HPA configuration library - Metrics guide - Tuning procedures


Topic 20: Custom Metrics with Azure Monitor

What will be covered: - Azure Monitor Metrics Adapter - Prometheus Adapter - Custom Metric Queries - SLO-Based Scaling

Code Examples: - Metrics adapter setup - Custom metric definitions - SLO-driven autoscaling

Deliverables: - Metrics adapter guide - Custom metrics catalog - SLO integration


CYCLE 11: KEDA Event-Driven Autoscaling (~4,500 lines)

Topic 21: KEDA Architecture

What will be covered: - KEDA (Kubernetes Event-Driven Autoscaling)

Why KEDA for ATP?
- Scale based on queue depth (Azure Service Bus)
- Scale to zero when idle (cost savings)
- Event-driven workers (Projection, Export, Integrity)
- Batch job scheduling (CronScaledJob)

  • KEDA Installation

    # Install KEDA operator
    helm repo add kedacore https://kedacore.github.io/charts
    helm repo update
    
    helm install keda kedacore/keda \
        --namespace keda \
        --create-namespace \
        --set podIdentity.azureWorkload.enabled=true
    

  • KEDA Components

  • Operator: Monitors ScaledObjects, creates HPAs
  • Metrics Server: Exposes custom metrics
  • Admission Webhooks: Validates ScaledObjects

Code Examples: - KEDA installation - Architecture overview - Component configuration

Diagrams: - KEDA architecture - Scaling flow - Integration with HPA

Deliverables: - KEDA setup guide - Architecture reference - Component overview


Topic 22: KEDA Scalers for ATP

What will be covered: - Azure Service Bus Scaler (Projection Workers)

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: projection-worker-scaler
  namespace: atp-projection-ns
spec:
  scaleTargetRef:
    name: projection-worker

  pollingInterval: 10        # Check every 10 seconds
  cooldownPeriod: 120        # Wait 2 min before scaling down

  minReplicaCount: 0         # Scale to zero when no messages
  maxReplicaCount: 50

  advanced:
    restoreToOriginalReplicaCount: false
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300
          policies:
          - type: Percent
            value: 50
            periodSeconds: 60

  triggers:
  - type: azure-servicebus
    metadata:
      namespace: sb-atp-prod
      topicName: audit.appended.v1
      subscriptionName: projection-sub
      messageCount: "400"          # Target: 400 messages per replica
      activationMessageCount: "0"  # Scale from 0 as soon as the backlog exceeds this
      cloud: AzurePublicCloud

    authenticationRef:
      name: keda-trigger-auth-asb

---
# Authentication using Workload Identity
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: keda-trigger-auth-asb
  namespace: atp-projection-ns
spec:
  podIdentity:
    provider: azure-workload
    identityId: "<managed-identity-client-id>"
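The trigger above makes the replica target roughly one worker per 400 messages of backlog, dropping to zero when the subscription is empty. A sketch of that arithmetic (an approximation; KEDA's real calculation flows through an HPA external metric):

```python
import math

def keda_target_replicas(queue_length, messages_per_replica=400, max_replicas=50):
    """Approximate azure-servicebus scaling: zero replicas on an empty queue,
    else ceil(backlog / messageCount), capped at maxReplicaCount."""
    if queue_length == 0:
        return 0
    return min(max_replicas, math.ceil(queue_length / messages_per_replica))

keda_target_replicas(0)        # 0 -- worker scaled to zero
keda_target_replicas(2000)     # 5 replicas for a 2,000-message backlog
keda_target_replicas(100000)   # capped at maxReplicaCount
```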

  • Redis Scaler (Export Jobs Queue)

    apiVersion: keda.sh/v1alpha1
    kind: ScaledObject
    metadata:
      name: export-worker-scaler
      namespace: atp-export-ns
    spec:
      scaleTargetRef:
        name: export-worker
      minReplicaCount: 0
      maxReplicaCount: 10
      triggers:
      - type: redis
        metadata:
          addressFromEnv: REDIS_HOST
          listName: export-jobs
          listLength: "10"  # Scale up if queue has 10+ jobs
          databaseIndex: "0"
        authenticationRef:
          name: keda-redis-auth
    

  • Prometheus Scaler (Custom Metrics)

    apiVersion: keda.sh/v1alpha1
    kind: ScaledObject
    metadata:
      name: query-latency-scaler
      namespace: atp-query-ns
    spec:
      scaleTargetRef:
        name: query
      minReplicaCount: 3
      maxReplicaCount: 30
      triggers:
      - type: prometheus
        metadata:
          serverAddress: http://prometheus.monitoring:9090
          metricName: http_request_duration_p95_seconds
          query: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service="query"}[2m])) by (le))
          threshold: "0.2"  # Scale up if P95 > 200ms
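The PromQL above interpolates a quantile from cumulative histogram buckets. A simplified reimplementation (finite buckets only, linear interpolation in the spirit of `histogram_quantile`) shows how a P95 of 0.25s would trip the 0.2s threshold:

```python
def histogram_quantile(q, buckets):
    """Quantile over cumulative (le, count) buckets with linear interpolation,
    mirroring PromQL's histogram_quantile for finite buckets."""
    total = buckets[-1][1]
    rank = q * total
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            # interpolate within this bucket, assuming uniform distribution
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return buckets[-1][0]

# 80 of 100 requests finished within 0.1s, 95 within 0.25s, all within 0.5s
p95 = histogram_quantile(0.95, [(0.1, 80), (0.25, 95), (0.5, 100)])
# p95 is 0.25s, above the 0.2s threshold -> KEDA scales the query service up
```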
    

  • Cron Scaler (Scheduled Jobs)

    apiVersion: keda.sh/v1alpha1
    kind: ScaledJob
    metadata:
      name: daily-compliance-report
      namespace: atp-admin-ns
    spec:
      jobTargetRef:
        template:
          spec:
            containers:
            - name: report-generator
              image: atpacr.azurecr.io/atp/report-generator:1.0.0
            restartPolicy: Never
    
      pollingInterval: 30
      maxReplicaCount: 1
    
      triggers:
      - type: cron
        metadata:
          timezone: America/New_York
          start: 0 2 * * *   # 2 AM daily
          end: 0 3 * * *     # Finish by 3 AM
    

Code Examples: - Complete KEDA scaler library (all ATP use cases) - Authentication configurations - Scaling policies

Diagrams: - KEDA scaler types - Event-driven scaling flow - Scale-to-zero timeline

Deliverables: - KEDA scaler catalog - Configuration guide - Scaling strategies


CYCLE 12: Pod Security Standards (~3,000 lines)

Topic 23: Pod Security Policies

What will be covered: - Pod Security Standards (PSS)

Privileged: Unrestricted (avoid)
Baseline: Minimally restrictive (dev/test)
Restricted: Hardened (production ATP)

  • Restricted Pod Security
    # Namespace-level enforcement
    apiVersion: v1
    kind: Namespace
    metadata:
      name: atp-ingest-ns
      labels:
        pod-security.kubernetes.io/enforce: restricted
        pod-security.kubernetes.io/audit: restricted
        pod-security.kubernetes.io/warn: restricted
    
    # Pod security context (compliant)
    securityContext:
      runAsNonRoot: true
      runAsUser: 1000
      runAsGroup: 3000
      fsGroup: 2000
      seccompProfile:
        type: RuntimeDefault
    
    # Container security context
    containers:
    - name: ingestion
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        runAsNonRoot: true
        runAsUser: 1000
        capabilities:
          drop:
          - ALL
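A small validator can flag non-compliant manifests before the admission controller does. This is a partial sketch of the restricted profile (assuming pod- and container-level contexts are already merged into one dict), not the full PSS rule set:

```python
RESTRICTED_CHECKS = {
    "runAsNonRoot must be true":
        lambda sc: sc.get("runAsNonRoot") is True,
    "allowPrivilegeEscalation must be false":
        lambda sc: sc.get("allowPrivilegeEscalation") is False,
    "all capabilities must be dropped":
        lambda sc: sc.get("capabilities", {}).get("drop") == ["ALL"],
    "seccompProfile must be RuntimeDefault or Localhost":
        lambda sc: sc.get("seccompProfile", {}).get("type") in ("RuntimeDefault", "Localhost"),
}

def restricted_violations(security_context):
    """Return the checks a merged securityContext fails; an empty list means
    compliant with this subset of the restricted profile."""
    return [msg for msg, ok in RESTRICTED_CHECKS.items() if not ok(security_context)]

compliant = {"runAsNonRoot": True, "allowPrivilegeEscalation": False,
             "capabilities": {"drop": ["ALL"]},
             "seccompProfile": {"type": "RuntimeDefault"}}
```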
    

Code Examples: - Pod Security Standard configurations - Compliant pod templates - Validation policies

Diagrams: - PSS levels - Security context flow

Deliverables: - Pod security guide - Compliant templates - Validation rules


Topic 24: Runtime Security with Azure Policy

What will be covered: - Azure Policy for Kubernetes - OPA Gatekeeper - Admission Controller Policies - Image Scanning (Trivy, Defender)

Code Examples: - Policy definitions - Admission webhooks - Image scan integration

Deliverables: - Security policy catalog - Enforcement procedures - Scanning workflows


CYCLE 13: Network Policies (~3,000 lines)

Topic 25: Network Isolation

What will be covered: - Network Policy Fundamentals

# Default deny all ingress
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: atp-ingest-ns
spec:
  podSelector: {}
  policyTypes:
  - Ingress

# Allow ingress from Gateway only
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-gateway
  namespace: atp-ingest-ns
spec:
  podSelector:
    matchLabels:
      app: ingestion
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: atp-gateway-ns
      podSelector:
        matchLabels:
          app: gateway
    ports:
    - protocol: TCP
      port: 8080
    - protocol: TCP
      port: 9090

# Allow egress to Azure SQL
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-egress-sql
  namespace: atp-ingest-ns
spec:
  podSelector:
    matchLabels:
      app: ingestion
  policyTypes:
  - Egress
  egress:
  - to:
    - ipBlock:
        cidr: 10.0.0.0/8  # VNet range containing the Azure SQL private endpoint (narrow in production)
    ports:
    - protocol: TCP
      port: 1433
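The ipBlock rule reduces to a CIDR-plus-port test, which the standard library expresses directly. A sketch of how a CNI plugin would evaluate the egress rule above:

```python
from ipaddress import ip_address, ip_network

SQL_CIDR = ip_network("10.0.0.0/8")   # range from the allow-egress-sql policy
SQL_PORT = 1433

def egress_allowed(dest_ip: str, dest_port: int) -> bool:
    """True only for TCP/1433 traffic into the allowed CIDR; everything else
    is dropped by the default-deny posture."""
    return ip_address(dest_ip) in SQL_CIDR and dest_port == SQL_PORT
```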

Code Examples: - Network policy manifests (all ATP services) - Zero-trust network model - Egress control

Diagrams: - Network policy architecture - Traffic flow with policies

Deliverables: - Network policy library - Zero-trust configuration - Traffic matrix


Topic 26: Service Mesh Network Policies

What will be covered: - Istio AuthorizationPolicy - Linkerd Server Policies - mTLS Enforcement

Code Examples: - Mesh-native policies - mTLS verification

Deliverables: - Mesh policy guide - Security configuration


CYCLE 14: Service Mesh & mTLS (~3,500 lines)

Topic 27: Service Mesh Setup

What will be covered: - Istio Installation

# Install Istio (the "default" profile is the one recommended for production)
istioctl install --set profile=default -y

# Enable sidecar injection
kubectl label namespace atp-ingest-ns istio-injection=enabled

  • Linkerd Installation

    # Install Linkerd
    linkerd install --crds | kubectl apply -f -
    linkerd install | kubectl apply -f -
    
    # Verify installation
    linkerd check
    
    # Inject sidecar
    kubectl annotate namespace atp-ingest-ns linkerd.io/inject=enabled
    

  • mTLS Configuration

    # Istio PeerAuthentication (enforce mTLS)
    apiVersion: security.istio.io/v1beta1
    kind: PeerAuthentication
    metadata:
      name: default-mtls
      namespace: atp-ingest-ns
    spec:
      mtls:
        mode: STRICT  # Enforce mTLS for all traffic
    

Code Examples: - Service mesh installation - mTLS configuration - Traffic policies

Diagrams: - Service mesh architecture - mTLS certificate flow

Deliverables: - Mesh setup guide - mTLS enforcement - Traffic management


Topic 28: Observability with Service Mesh

What will be covered: - Distributed Tracing - Traffic Metrics - Service Graph Visualization - Latency Monitoring

Code Examples: - Mesh observability configuration - Kiali/Jaeger integration

Deliverables: - Observability guide - Dashboard templates


CYCLE 15: RBAC & Workload Identity (~3,000 lines)

Topic 29: Kubernetes RBAC

What will be covered: - Role-Based Access Control

# Role (namespace-scoped)
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: atp-ingest-ns
rules:
- apiGroups: [""]
  resources: ["pods", "pods/log"]
  verbs: ["get", "list", "watch"]

# RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: atp-ingest-ns
subjects:
- kind: ServiceAccount
  name: developer-sa
  namespace: atp-ingest-ns
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io

# ClusterRole (cluster-wide)
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: atp-admin
rules:
- apiGroups: ["*"]
  resources: ["*"]
  verbs: ["*"]

Code Examples: - Complete RBAC configuration - Role library - Binding templates

Deliverables: - RBAC guide - Role catalog - Access control policies


Topic 30: Azure AD Workload Identity

What will be covered: - Workload Identity Federation - Managed Identity Assignment - Azure Resource Access - Zero Secrets (Keyless Authentication)

Code Examples: - Workload Identity setup - Federated identity credentials - Azure resource access

Deliverables: - Workload Identity guide - Identity management - Security best practices


CYCLE 16: Helm Charts (~4,000 lines)

Topic 31: Helm Chart Structure

What will be covered: - ATP Helm Chart Architecture

charts/atp/
├── Chart.yaml              # Chart metadata
├── values.yaml             # Default values
├── values.dev.yaml         # Dev environment overrides
├── values.prod.yaml        # Prod environment overrides
├── values.us.yaml          # US region overrides
├── values.eu.yaml          # EU region overrides
├── templates/
│   ├── _helpers.tpl        # Template helpers
│   ├── namespace.yaml
│   ├── gateway/
│   │   ├── deployment.yaml
│   │   ├── service.yaml
│   │   ├── hpa.yaml
│   │   └── ingress.yaml
│   ├── ingestion/
│   │   ├── deployment.yaml
│   │   ├── service.yaml
│   │   └── hpa.yaml
│   ├── projection/
│   │   ├── deployment.yaml
│   │   ├── service.yaml
│   │   └── keda-scaledobject.yaml
│   ├── query/
│   │   ├── deployment.yaml
│   │   ├── service.yaml
│   │   └── hpa.yaml
│   ├── export/
│   │   ├── deployment.yaml
│   │   ├── service.yaml
│   │   └── keda-scaledobject.yaml
│   ├── shared/
│   │   ├── configmap.yaml
│   │   ├── secret-provider-class.yaml
│   │   └── network-policies.yaml
│   └── monitoring/
│       ├── servicemonitor.yaml
│       └── prometheusrule.yaml
└── .helmignore

Code Examples: - Complete Helm chart - Template syntax - Values organization

Diagrams: - Helm chart structure - Value inheritance

Deliverables: - Helm chart repository - Templating guide - Values reference


Topic 32: Helm Deployment

What will be covered: - Helm Installation

# Install/upgrade ATP
helm upgrade --install atp ./charts/atp \
    --namespace atp-system \
    --create-namespace \
    --values values.prod.yaml \
    --values values.us.yaml \
    --set image.tag=1.2.3 \
    --set global.edition=enterprise \
    --wait --timeout 10m

# Verify release
helm list -n atp-system
helm status atp -n atp-system

# Rollback
helm rollback atp 1 -n atp-system
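The layered `--values` and `--set` flags above resolve left to right, later sources winning key-by-key via deep merge. A sketch of that precedence (file names and keys are illustrative, not the real chart values):

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Nested maps merge recursively; scalars from the later source win,
    mirroring how Helm layers -f files and --set flags left to right."""
    out = dict(base)
    for k, v in override.items():
        if isinstance(v, dict) and isinstance(out.get(k), dict):
            out[k] = deep_merge(out[k], v)
        else:
            out[k] = v
    return out

defaults = {"image": {"tag": "latest"}, "replicas": 3}   # values.yaml
prod     = {"replicas": 5}                               # values.prod.yaml
us       = {"region": "us-east"}                         # values.us.yaml
cli_set  = {"image": {"tag": "1.2.3"}}                   # --set image.tag=1.2.3

effective = deep_merge(deep_merge(deep_merge(defaults, prod), us), cli_set)
```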

Code Examples: - Deployment commands - Value overrides - Rollback procedures

Deliverables: - Deployment guide - Operations procedures - Rollback strategies


CYCLE 17: GitOps with FluxCD (~3,500 lines)

Topic 33: FluxCD Setup

What will be covered: - FluxCD Installation - GitRepository Source - Kustomization - HelmRelease - Automated Reconciliation

Code Examples: - FluxCD configuration - GitOps workflow

Deliverables: - FluxCD setup guide - GitOps workflow


Topic 34: Declarative Deployments

What will be covered: - Git as Source of Truth - Automated Sync - Drift Detection - Notification Hooks

Code Examples: - FluxCD resources - Sync configuration

Deliverables: - Declarative deployment guide - Sync policies


CYCLE 18: Monitoring & Observability (~3,000 lines)

Topic 35: Container Insights

What will be covered: - Azure Monitor Container Insights - Prometheus Integration - Grafana Dashboards - Log Analytics

Code Examples: - Monitoring setup - Dashboard configurations

Deliverables: - Monitoring guide - Dashboard library


Topic 36: OpenTelemetry in Kubernetes

What will be covered: - OTel Collector DaemonSet - Trace/Metric/Log Collection - Azure Monitor Export

Code Examples: - OTel configuration - Collector deployment

Deliverables: - OTel setup guide - Export configuration


CYCLE 19: Operations & Troubleshooting (~3,000 lines)

Topic 37: Operational Tasks

What will be covered: - Common kubectl Commands - Log Viewing - Pod Debugging - Resource Inspection

Code Examples: - Operations cookbook

Deliverables: - Operations guide - Troubleshooting procedures


Topic 38: Troubleshooting Common Issues

What will be covered: - Pod Not Starting - Image Pull Errors - CrashLoopBackOff - Network Issues - Resource Exhaustion

Code Examples: - Debug procedures

Deliverables: - Troubleshooting guide - Problem catalog


CYCLE 20: Best Practices & Disaster Recovery (~3,000 lines)

Topic 39: Kubernetes Best Practices

What will be covered: - Design Best Practices - Security Hardening - Performance Optimization - Cost Management

Deliverables: - Best practices handbook


Topic 40: Disaster Recovery

What will be covered: - Cluster Backup - Disaster Recovery Procedures - Multi-Region Failover

Deliverables: - DR guide - Failover procedures


Summary of Deliverables

Across all 20 cycles, this documentation will provide:

  1. Cluster Architecture: AKS setup, node pools, networking
  2. Namespaces: Organization, quotas, isolation
  3. Workloads: Deployments, StatefulSets, DaemonSets, Jobs
  4. Networking: Services, Ingress, Load Balancing, Service Mesh
  5. Configuration: ConfigMaps, Secrets, Key Vault CSI
  6. Resource Management: Requests, limits, quotas, QoS
  7. Autoscaling: HPA, KEDA, custom metrics
  8. Security: Pod Security, Network Policies, RBAC, Workload Identity
  9. Packaging: Helm charts, templating, values
  10. GitOps: FluxCD, declarative deployments
  11. Observability: Monitoring, logging, tracing
  12. Operations: kubectl, troubleshooting, DR


This documentation plan covers the complete Kubernetes infrastructure for ATP: AKS cluster setup, namespace organization, workload deployments, service networking, autoscaling with HPA and KEDA, security hardening with Pod Security Standards and Network Policies, service mesh integration, Helm packaging, GitOps with FluxCD, comprehensive monitoring, operational procedures, and disaster recovery. Together these practices let ATP's microservices run at scale with security, reliability, and cost-efficiency.