GitOps — Audit Trail Platform (ATP)

Declarative deployment with Git as the source of truth — ATP GitOps ensures that infrastructure and application state are versioned, auditable, and continuously reconciled across all Azure environments using Azure DevOps, AKS, and FluxCD.


Purpose & Scope

This document defines the GitOps deployment model for the ConnectSoft Audit Trail Platform (ATP), establishing how infrastructure and application manifests are version-controlled in Azure Repos, automatically reconciled to AKS clusters, and continuously monitored for drift with full traceability and compliance evidence using Azure-native tools and services.

What This Document Covers

GitOps Fundamentals:

  • GitOps philosophy and core principles (declarative, versioned, pulled, reconciled)
  • Comparison with traditional CI/CD (push-based) deployments
  • Benefits for audit trail requirements (immutable history, compliance, security)
  • Azure-native GitOps implementation patterns with FluxCD

Infrastructure & Repository Structure:

  • Azure Repos structure for GitOps manifests (monorepo pattern for Kubernetes manifests)
  • Branching strategies per environment (main, staging, test, dev, feature/, hotfix/)
  • Access control and RBAC for Git repositories (Azure AD integration, branch policies)
  • Naming conventions and versioning strategies (SemVer, Git tags, commit SHA)

FluxCD on Azure Kubernetes Service (AKS):

  • FluxCD installation, bootstrap, and multi-cluster setup
  • GitRepository and Kustomization custom resources for continuous reconciliation
  • Azure Repos integration (SSH keys, PAT, Azure AD Workload Identity)
  • Drift detection, self-healing, and automatic reconciliation loops

Declarative Manifests & Configuration:

  • Kubernetes manifests (Deployments, Services, ConfigMaps, Secrets, Ingress)
  • Helm charts for ATP microservices (templates, values files, dependencies, versioning)
  • Kustomize overlays for environment-specific configurations (base + overlays pattern)
  • Manifest validation, linting, and security policy enforcement

CI/CD Integration:

  • Azure Pipelines to GitOps handoff (build → test → publish → commit manifest update)
  • Automated manifest updates (image tag bumping after successful CI builds)
  • Multi-service coordination and orchestration (atomic updates across services)
  • Artifact metadata and provenance (SBOM, vulnerability scans, build attestations)

Secrets Management:

  • Azure Key Vault integration (External Secrets Operator or CSI Driver)
  • Azure AD Workload Identity for pod authentication (no credentials in Git)
  • Secret rotation procedures and zero-downtime updates
  • Compliance controls (SOC 2, GDPR, HIPAA) for secret handling and audit logging

Multi-Environment Deployment:

  • Environment-specific configurations (dev, test, staging, production, preview, hotfix)
  • Kustomize overlays and Helm values files per environment
  • Resource quotas, limits, and autoscaling policies per environment
  • Promotion workflows and approval gates (manual for staging/production)

Advanced Deployment Strategies:

  • Rolling updates (default Kubernetes strategy with maxSurge/maxUnavailable)
  • Blue-green deployments (namespace switching with traffic routing)
  • Canary releases (progressive traffic shifting with Flagger)
  • Preview environments (ephemeral namespaces per pull request)
  • Zero-downtime deployments and rollback procedures

Security & Compliance:

  • Azure Policy for Kubernetes (Pod Security Standards, network policies, resource quotas)
  • Image signing and verification (Cosign, Notary, admission controllers)
  • SBOM generation and vulnerability scanning (integrated with ACR)
  • Audit logging and compliance evidence collection (immutable Git history)

Multi-Tenancy:

  • Namespace-per-tenant isolation strategy
  • Dynamic tenant provisioning and offboarding workflows
  • Tenant-specific configurations and resource quotas
  • Cost allocation and compliance enforcement per tenant

Observability & Monitoring:

  • Azure Monitor Container Insights integration for AKS metrics
  • FluxCD metrics export to Prometheus/Grafana for reconciliation monitoring
  • Deployment tracking and DORA metrics (deployment frequency, lead time, MTTR, change failure rate)
  • Alerting on sync failures, drift detection, and health check failures

Day 2 Operations:

  • Troubleshooting GitOps issues (sync failures, drift, image pull errors, secret access failures)
  • Routine maintenance tasks (FluxCD upgrades, AKS patching, certificate renewals)
  • Disaster recovery and rollback procedures (Git revert, cluster recreation from IaC)
  • On-call runbooks and escalation paths

Governance & Training:

  • GitOps workflow ownership and change management processes
  • Developer and operations onboarding guides (Git workflow, manifest authoring)
  • Best practices catalog and reference architectures
  • Compliance reporting and audit evidence collection automation

Out of Scope

This document does NOT cover:

  • Kubernetes fundamentals — See infrastructure/kubernetes.md for AKS cluster architecture, pod design, container orchestration basics, and Kubernetes API concepts
  • Azure Pipelines (CI stage) — See ci-cd/azure-pipelines.md for build, test, security scanning, artifact publishing, and quality gate enforcement
  • Quality gates — See ci-cd/quality-gates.md for test coverage thresholds, security scanning policies, and compliance gate enforcement
  • Infrastructure provisioning (non-Kubernetes) — See infrastructure/pulumi.md for Azure SQL, Service Bus, Storage, Key Vault, and other PaaS resource provisioning
  • Application architecture — See architecture/hld.md for ATP service design, domain models, business logic, and system architecture
  • Service-specific deployment details — See individual service documentation in planning/core-services/ for service-specific configuration, dependencies, and operational characteristics
  • Observability implementation — See operations/observability.md for OpenTelemetry instrumentation, metrics collection, distributed tracing, and log aggregation
  • Backup and restore procedures — See operations/backups-restore-ediscovery.md for data backup strategies, disaster recovery, and eDiscovery procedures

Readers & Ownership

Primary Readers:

  • Platform Engineers: Implement GitOps workflows, configure FluxCD, author Kubernetes manifests, manage GitOps repository structure
  • DevOps Engineers: Integrate Azure Pipelines with GitOps, automate manifest updates, implement promotion workflows, troubleshoot CI/CD handoff
  • SRE Team: Monitor FluxCD reconciliation, respond to drift alerts, execute rollback procedures, perform incident response, conduct DR drills
  • Security Team: Review security policies, validate RBAC configurations, enforce Pod Security Standards, audit secret management, ensure compliance
  • Developers: Understand GitOps workflow, submit manifest changes via pull requests, test changes in preview environments, troubleshoot deployment issues
  • Compliance Officers: Validate audit trail completeness, review deployment approvals, ensure evidence collection for SOC 2/GDPR/HIPAA audits

Document Owner: Platform Engineering Team
Technical Reviewers: SRE Lead, Cloud Architect, Security Officer
Compliance Reviewer: Compliance Officer (for SOC 2/GDPR/HIPAA sections)
Approval Authority: CTO
Last Reviewed: 2024-10-30
Next Review: 2025-Q2 (after Cycle 6 completion — multi-environment observability)
Review Frequency: Quarterly (or after significant GitOps workflow changes)

Artifacts Produced

By following this document, teams will produce the following artifacts and deliverables:

1. GitOps Repository (atp-gitops in Azure Repos):

  • Declarative Kubernetes manifests for all 7 ATP microservices
  • Helm charts with templates, values files, and dependency specifications
  • Kustomize base manifests and environment-specific overlays (dev, test, staging, production)
  • FluxCD bootstrap configuration files (GitRepository, Kustomization resources)
  • Security policies (Pod Security Standards, Network Policies, Azure Policies)
  • Multi-tenant namespace configurations and resource quotas

2. FluxCD Installation:

  • FluxCD controllers deployed on all AKS clusters (dev, test, staging, production)
  • GitRepository resources configured for Azure Repos integration (SSH/PAT authentication)
  • Kustomization resources for each application and environment
  • Notification controllers for alerting (Slack, Teams, Azure Monitor)
  • Health assessment and drift detection configurations

3. CI/CD Integration:

  • Azure Pipelines templates for GitOps handoff (build → publish → manifest update → commit)
  • Automated manifest update scripts (image tag bumping, Helm values updates)
  • Preview environment provisioning pipelines (ephemeral namespaces per PR)
  • Multi-service coordination scripts (atomic updates across dependent services)

4. Infrastructure as Code:

  • Pulumi C# programs for AKS cluster provisioning (node pools, networking, SKUs)
  • Environment-specific Pulumi stack configurations (dev, test, staging, production)
  • Drift detection automation (scheduled reconciliation validation)
  • Disaster recovery scripts (cluster recreation from Git and IaC)

5. Secrets Management:

  • External Secrets Operator or CSI Driver installation and configuration
  • ClusterSecretStore resources (Azure Key Vault integration per environment)
  • ExternalSecret or SecretProviderClass definitions for each application
  • Azure AD Workload Identity configuration (federated credentials, ServiceAccount annotations)
  • Secret rotation runbooks and automation scripts

6. Observability & Compliance:

  • Azure Monitor dashboards for GitOps metrics (reconciliation status, drift events, deployment frequency)
  • Grafana dashboards for FluxCD monitoring (reconciliation duration, success rate, resource health)
  • Compliance evidence collection scripts (deployment receipts, approval records, Git audit trail)
  • KQL queries for audit trail analysis (who deployed what, when, why)
  • DORA metrics dashboard (deployment frequency, lead time for changes, MTTR, change failure rate)

7. Security & Policy Enforcement:

  • Azure Policy definitions for AKS (Pod Security Standards, network policies, resource limits)
  • Pod Security Admission configurations (baseline, restricted profiles)
  • Network policy templates (default deny, service-to-service rules)
  • Image signing workflows (Cosign signatures, admission controller verification)
  • RBAC configurations (ServiceAccounts, Roles, RoleBindings per service and tenant)

8. Runbooks & Documentation:

  • Troubleshooting guides (sync failures, drift resolution, image pull errors)
  • Rollback procedures (simple rollback with git revert, complex multi-service rollbacks)
  • DR test plans (cluster failure scenarios, region outage, complete platform loss)
  • Developer onboarding guides (GitOps workflow, manifest authoring, PR process)
  • Operations runbooks (routine maintenance, FluxCD upgrades, AKS patching)

What is GitOps?

Definition: GitOps is an operational framework that applies DevOps best practices—version control, collaboration, compliance, and CI/CD—to infrastructure automation and application deployment. The core principle is using Git repositories as the single source of truth for declarative infrastructure and application configurations.

Core Concept: Instead of operators running manual kubectl apply commands or CI/CD pipelines pushing changes to clusters, a GitOps agent (FluxCD, ArgoCD) running inside the Kubernetes cluster continuously pulls the desired state from Git and reconciles the actual cluster state to match it.
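The pull-and-reconcile loop described above can be sketched conceptually. This is an illustration of the control loop only, not FluxCD's actual implementation; the state strings stand in for rendered manifests and live cluster state:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Illustration of the GitOps control loop; the strings below stand in for
# rendered manifests (desired) and live cluster state (actual).
desired_state="image=atp/ingestion:v1.2.3 replicas=3"   # declared in Git
actual_state="image=atp/ingestion:v1.2.2 replicas=3"    # running in cluster

for _ in 1 2; do                        # two iterations stand in for "forever"
  if [ "$actual_state" != "$desired_state" ]; then
    echo "drift detected -> applying desired state"
    actual_state="$desired_state"       # stands in for 'kubectl apply'
  else
    echo "in sync"
  fi
done
```

Each pass compares desired against actual and converges the cluster toward Git; once they match, the loop simply reports that it is in sync.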

History & Evolution

GitOps emerged from the evolution of Infrastructure as Code (IaC) practices combined with Kubernetes' declarative nature:

Timeline:

| Year | Milestone | Impact on Industry | Relevance to ATP |
|------|-----------|--------------------|------------------|
| 2010-2014 | Infrastructure as Code (IaC) emerges | Terraform, CloudFormation, Ansible enable declarative infrastructure | Foundation for declarative Azure resources |
| 2015 | Kubernetes released (v1.0) | Declarative configuration becomes standard for container orchestration | ATP targets AKS for microservice deployment |
| 2017 | Weaveworks coins "GitOps" term | Flux v1 released as first GitOps operator for Kubernetes | GitOps pattern recognized |
| 2018 | ArgoCD released by Intuit | Alternative GitOps implementation; feature-rich UI | ArgoCD evaluated (FluxCD chosen for simplicity) |
| 2019 | OpenGitOps working group formed | CNCF standardizes 4 core GitOps principles | ATP adopts OpenGitOps principles |
| 2020 | FluxCD v2 released | Complete rewrite with modular architecture (GitOps Toolkit) | ATP uses FluxCD v2 for production |
| 2021 | Flux and Argo join CNCF | GitOps becomes a cloud-native standard (incubating projects) | Industry validation for ATP choice |
| 2022 | Azure Arc GitOps integration | Microsoft provides native GitOps support for AKS and Arc-enabled clusters | Azure-native GitOps validated |
| 2024 | Widespread adoption | CNCF surveys show 70%+ of production Kubernetes use GitOps | ATP joins industry leaders |

Why GitOps Now?:

  • Kubernetes maturity: Declarative APIs well-established; GitOps is natural evolution
  • Security focus: Zero-trust principles demand eliminating cluster credentials from CI/CD
  • Compliance: Audit trail requirements favor Git's immutable history
  • Cloud-native patterns: CNCF-endorsed pattern with mature tooling (FluxCD, ArgoCD)

Pull-Based vs Push-Based Deployment Models

Purpose: Understand the fundamental architectural difference between traditional CI/CD and GitOps deployment models.

Push-Based Deployment (Traditional CI/CD)

Architecture:

graph TD
    A[Developer] -->|1. git push| B[Source Code<br/>Repository]
    B -->|2. trigger webhook| C[CI/CD Pipeline<br/>Azure Pipelines]

    C -->|3. build| D[Compile &<br/>Test]
    D -->|4. publish| E[Docker Image]
    E -->|5. push| F[Container<br/>Registry<br/>ACR]

    C -->|6. deploy<br/>kubectl apply| G[Kubernetes<br/>Cluster]

    H[Secrets<br/>Vault] -.->|credentials<br/>stored| C

    style G fill:#ffcccc
    style C fill:#FFE5B4
    style H fill:#ffcccc

Characteristics:

  • External deployment: CI/CD pipeline (running outside cluster) has direct access to Kubernetes API via kubeconfig or service account token
  • Push on trigger: Deployment happens during pipeline execution (synchronous operation)
  • Credentials required: Pipeline needs cluster credentials stored as secrets or service connections
  • No continuous reconciliation: Cluster state checked only during deployment; drift undetected between deployments
  • Secret management: Secrets often stored in CI/CD system variables (security risk)

Workflow Example (Azure Pipelines - Push Model):

# ❌ PUSH-BASED: Pipeline has direct cluster access
- stage: Deploy_Production
  jobs:
  - deployment: DeployToAKS
    environment: ATP-Production-AKS  # Requires approval
    strategy:
      runOnce:
        deploy:
          steps:
          # Pipeline has full kubectl access to production cluster
          - task: Kubernetes@1
            displayName: 'Deploy ATP Ingestion to Production'
            inputs:
              connectionType: 'Kubernetes Service Connection'
              kubernetesServiceEndpoint: 'ATP-Production-AKS'  # ⚠️ Cluster credentials
              namespace: 'atp-production'
              command: 'apply'
              useConfigurationFile: true
              configuration: 'manifests/production/atp-ingestion.yaml'

          # Update image tag imperatively
          - task: Kubernetes@1
            inputs:
              kubernetesServiceEndpoint: 'ATP-Production-AKS'
              command: 'set'
              arguments: 'image deployment/atp-ingestion atp-ingestion=$(containerRegistry)/atp/ingestion:$(Build.BuildNumber)'

Security Concerns:

# ⚠️ SECURITY RISK: Cluster credentials stored in Azure DevOps
# Service Connection: ATP-Production-AKS
# Type: Kubernetes Service Connection
# Authentication: Service Account (has cluster-admin rights!)
# 
# Attack vectors:
# 1. Anyone with "Use" permission on service connection can deploy to production
# 2. Compromised Azure DevOps account = compromised production cluster
# 3. Service account token rotation requires updating all pipelines
# 4. Credentials visible in pipeline logs (if logging enabled)

Pull-Based Deployment (GitOps)

Architecture:

graph TD
    A[Developer] -->|1. git push| B[Source Code<br/>Repository]
    B -->|2. trigger| C[CI Pipeline<br/>Azure Pipelines]

    C -->|3. build + test| D[Docker Image]
    D -->|4. push| E[Container<br/>Registry<br/>ACR]

    C -->|5. update manifest<br/>commit + push| F[GitOps<br/>Repository]

    subgraph "Inside Kubernetes Cluster"
        G[FluxCD Agent]
        H[AKS Cluster]

        G -->|6. git pull<br/>every 1 min| F
        G -->|7. kubectl apply| H
        H -.->|8. drift<br/>detection| G
        G -.->|9. self-heal| H
    end

    I[Azure Key<br/>Vault] -->|10. secrets<br/>sync| J[External Secrets<br/>Operator]
    J -->|11. create K8s<br/>secrets| H

    K[Azure Monitor] -.->|metrics| C
    K -.->|metrics| G
    K -.->|logs| H

    style H fill:#90EE90
    style G fill:#90EE90
    style F fill:#FFE5B4

Characteristics:

  • Internal agent: GitOps operator (FluxCD) runs inside Kubernetes cluster as Deployment
  • Continuous pull: Agent polls Git repository at regular intervals (configurable: 30s to 10m)
  • No external access: Cluster credentials never leave cluster; enhanced security
  • Automatic reconciliation: Cluster state continuously compared to Git state; drift corrected automatically
  • Secret sync: Secrets managed in Azure Key Vault; synced to cluster via External Secrets Operator

Workflow Example (FluxCD - Pull Model):

# ✅ PULL-BASED: FluxCD inside cluster pulls from Git
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: atp-gitops
  namespace: flux-system
spec:
  interval: 1m  # Poll Git every 1 minute
  url: ssh://git@ssh.dev.azure.com/v3/ConnectSoft/ATP/atp-gitops
  ref:
    branch: production  # Production environment uses 'production' branch
  secretRef:
    name: azure-devops-ssh-key  # Read-only SSH key (no cluster credentials!)

---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: atp-ingestion
  namespace: flux-system
spec:
  interval: 5m  # Reconcile every 5 minutes
  path: ./apps/atp-ingestion/overlays/production
  prune: true  # Delete resources removed from Git
  sourceRef:
    kind: GitRepository
    name: atp-gitops
  healthChecks:
    - apiVersion: apps/v1
      kind: Deployment
      name: atp-ingestion
      namespace: atp-production

Security Benefits:

# ✅ NO cluster credentials outside cluster
# FluxCD ServiceAccount has RBAC permissions inside cluster
# CI pipeline NEVER touches cluster; only commits to Git

# FluxCD ServiceAccount (inside cluster)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kustomize-controller
  namespace: flux-system

---
# FluxCD RBAC (cluster-admin for reconciliation)
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kustomize-controller
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin  # Full access inside cluster only
subjects:
- kind: ServiceAccount
  name: kustomize-controller
  namespace: flux-system

Comparison Summary

| Deployment Model | When to Use | When to Avoid |
|---|---|---|
| Push-Based (Traditional CI/CD) | Small teams with simple deployments<br/>Non-Kubernetes environments<br/>Immediate feedback required<br/>Team unfamiliar with GitOps | Production Kubernetes deployments<br/>Compliance requirements (SOC 2, GDPR, HIPAA)<br/>Multi-environment with drift concerns<br/>Security-sensitive environments |
| Pull-Based (GitOps) | Production Kubernetes deployments<br/>Compliance/audit requirements<br/>Multi-cluster/multi-region<br/>Configuration drift is a concern<br/>Security-first environments | Non-Kubernetes deployments<br/>Legacy infrastructure<br/>Team unwilling to learn GitOps<br/>Immediate deployment feedback critical |

ATP Decision: ✅ GitOps (Pull-Based) for all Kubernetes deployments

Rationale:

  1. Audit trail requirement: Git provides immutable, permanent history (vs 30-90 day pipeline logs)
  2. Security requirement: Zero-trust principle; no cluster credentials outside cluster
  3. Compliance requirement: SOC 2, GDPR, HIPAA demand tamper-evident change records
  4. Multi-tenancy: Git structure enables isolated tenant configurations
  5. Operational resilience: Disaster recovery RTO reduced from 4 hours to 30 minutes

GitOps in Audit Trail Platform Context

Purpose: Explain why GitOps is essential (not just beneficial) for ATP's unique requirements.

Audit Trail Requirements

ATP provides immutable, tamper-evident audit logs for customers. The platform's own infrastructure must meet the same standards:

Requirement 1: Complete Change History

Every infrastructure change must be tracked with full attribution (who, what, when, why):

# Git history provides complete audit trail
git log --all --pretty=format:"%h | %ai | %an | %ae | %s" \
  --since="2024-01-01" \
  --grep="production"

# Example output (can be exported for SOC 2 audits):
# abc123d | 2024-10-30 14:23:45 +0000 | Alice Chen | alice.chen@connectsoft.example | feat(ingestion): upgrade to v1.2.3
# def456e | 2024-10-25 10:15:22 +0000 | Bob Smith | bob.smith@connectsoft.example | fix(query): index performance (ATP-BUG-789)
# ghi789f | 2024-10-20 16:42:11 +0000 | Carol Davis | carol.davis@connectsoft.example | scale(integrity): replicas 3→5 (ATP-INC-456)

Requirement 2: Tamper-Evidence

Git commits must be cryptographically signed to prevent tampering:

# Generate GPG key for commit signing
gpg --full-generate-key
# Select: RSA and RSA, 4096 bits, no expiration
# User ID: "Platform Team <platform-team@connectsoft.example>"

# Export public key for verification
gpg --armor --export platform-team@connectsoft.example > platform-team-gpg-public.key

# Configure Git to sign all commits
git config --global user.signingkey <GPG_KEY_ID>
git config --global commit.gpgsign true
git config --global tag.gpgsign true

# Commit with signature
git add apps/atp-ingestion/overlays/production/kustomization.yaml
git commit -S -m "feat(ingestion): upgrade to v1.2.3

- Updated image tag to v1.2.3-abc123d
- Increased memory limit 512Mi → 1Gi (performance optimization)
- Enabled tamper-evidence in production config

Relates to: ATP-EPIC-456
Approved by: architect@connectsoft.example
Tested in: Staging (2024-10-26 to 2024-10-29)"

# Verify signature
git log --show-signature -1

# Output:
# commit abc123d1234567890abcdef1234567890abcdef (HEAD -> production)
# gpg: Signature made Wed Oct 30 14:23:45 2024 UTC
# gpg:                using RSA key 1234567890ABCDEF
# gpg: Good signature from "Platform Team <platform-team@connectsoft.example>"
# Author: Platform Team <platform-team@connectsoft.example>
# Date:   Wed Oct 30 14:23:45 2024 +0000
#
#     feat(ingestion): upgrade to v1.2.3
#     ...
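A CI gate can verify that the signing policy actually holds. The sketch below runs the check against a throwaway repo containing one deliberately unsigned commit; in the real workflow it would run against the atp-gitops production branch (`%G?` is git's per-commit signature-status placeholder):

```shell
#!/usr/bin/env bash
# Sketch of a signature gate: fail if any recent commit is unsigned.
# Demo repo and commit message are illustrative.
set -euo pipefail

tmp=$(mktemp -d) && cd "$tmp" && git init -q
git -c user.name=Demo -c user.email=demo@connectsoft.example \
  commit -q --allow-empty -m "unsigned change"

# %G? prints one status letter per commit: G = good, N = none, B = bad
unsigned=$(git log --pretty='%G?' -n 20 | grep -c 'N' || true)
if [ "$unsigned" -gt 0 ]; then
  echo "FAIL: ${unsigned} unsigned commit(s) in the last 20"
else
  echo "OK: all recent commits are signed"
fi
```

Wired into a branch policy or pipeline step, a nonzero count would block the merge to production.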

Requirement 3: Long-Term Retention

Git history must be retained indefinitely for compliance (SOC 2: 1 year minimum, ATP: 7 years for parity with audit logs):

# Backup Git repository to immutable Azure Blob Storage
az storage blob upload-batch \
  --account-name atpgitbackupprod \
  --destination gitops-backups \
  --source .git/ \
  --destination-path "$(date +%Y%m%d)/" \
  --overwrite false  # Immutable: cannot overwrite

# Enable legal hold (WORM storage)
az storage container legal-hold set \
  --account-name atpgitbackupprod \
  --container-name gitops-backups \
  --tags "compliance=soc2-gdpr-hipaa" "retention=7-years"

# Retention: 7 years (matches audit log retention)
# Cost: ~$50/month for 10 GB Git history (cold storage tier)

Security Benefits

No Direct Cluster Access:

Problem Statement: Traditional CI/CD stores cluster credentials in Azure DevOps, creating security risks:

  1. Broad attack surface: Anyone with Azure DevOps access can potentially access cluster credentials
  2. Credential sprawl: Each environment/cluster needs separate service connection
  3. Rotation complexity: Updating service account tokens requires updating all pipelines
  4. Audit trail: Difficult to trace who accessed cluster credentials

GitOps Solution:

graph TD
    subgraph "Outside Cluster"
        A[Developer] -->|git push| B[Azure Repos<br/>atp-gitops]
        C[CI Pipeline] -->|update manifest<br/>commit + push| B
    end

    subgraph "Inside AKS Cluster - No External Access"
        D[FluxCD Agent]
        E[Kustomize Controller]
        F[Helm Controller]
        G[Kubernetes API]

        D -->|git pull| B
        D -->|render| E
        E -->|render| F
        F -->|kubectl apply| G
    end

    H[Azure Key Vault] -->|Workload Identity<br/>federated auth| I[External Secrets<br/>Operator]
    I -->|create secrets| G

    J[Azure Monitor] -.->|observability| D

    style G fill:#90EE90
    style D fill:#90EE90

Security Improvements:

| Security Aspect | Traditional CI/CD | GitOps | Improvement |
|---|---|---|---|
| Cluster credentials | Stored in CI/CD system | Never leave cluster | ✅ No cluster credentials outside the cluster |
| Attack surface | CI/CD system + cluster | Git repository only | ✅ Smaller attack surface |
| Credential rotation | Manual; update all pipelines | Automatic (Workload Identity) | ✅ Zero-touch rotation |
| Least privilege | Often cluster-admin for simplicity | RBAC per FluxCD controller | ✅ Principle of least privilege |
| Audit trail | Pipeline logs (ephemeral) | Git history (permanent) | ✅ Immutable audit evidence |
| Secrets in Git | Risk of accidental commit | Prevented (pre-commit hooks + PR validation) | ✅ Zero secrets in Git |
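Keeping secrets out of Git works because only secret *references* are committed; the values stay in Azure Key Vault. A minimal sketch with External Secrets Operator, assuming a ClusterSecretStore named azure-kv-production backed by Key Vault via Workload Identity (all names below are illustrative):

```yaml
# Only this reference lives in the GitOps repo; the secret value never does.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: atp-ingestion-db
  namespace: atp-production
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: azure-kv-production      # assumed store, backed by Azure Key Vault
  target:
    name: atp-ingestion-db         # Kubernetes Secret created by the operator
  data:
    - secretKey: connection-string
      remoteRef:
        key: atp-ingestion-db-connection-string   # Key Vault secret name
```

The operator resolves the reference at runtime and materializes a namespace-scoped Kubernetes Secret, so a leaked Git repository exposes no credentials.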

Separation of Duties:

ATP enforces role-based access control at multiple levels:

| Role | Azure Repos Access | AKS Cluster Access | FluxCD Admin | Azure Key Vault Access | Approval Authority |
|---|---|---|---|---|---|
| Developer | ✅ Submit PRs (feature/*) | ❌ No access | ❌ No | ❌ No | None |
| DevOps Engineer | ✅ Approve PRs (dev/test) | ⚠️ Read-only (dev/test) | ⚠️ Read-only | ❌ No | Dev/Test deployments |
| Architect | ✅ Approve PRs (staging/prod) | ⚠️ Read-only (all envs) | ⚠️ Read-only | ⚠️ Read-only (audit) | Staging/Prod deployments |
| SRE Engineer | ✅ Approve PRs (production) | ⚠️ Read-only (production) | ✅ Admin (suspend/resume reconciliation) | ⚠️ Read-only (audit) | Production deployments |
| Security Officer | ✅ Audit access (read-only) | ⚠️ Read-only (all envs) | ⚠️ Read-only | ✅ Admin (rotate secrets) | Security policy changes |
| Compliance Officer | ✅ Audit access (read-only) | ❌ No access | ❌ No | ⚠️ Read-only (audit) | None (audit only) |
| FluxCD Agent | ✅ Read-only (GitOps repo) | ✅ Full access (via ServiceAccount RBAC) | N/A | ❌ No (uses External Secrets Operator) | None (automated) |
| External Secrets Operator | ❌ No | ✅ Create secrets (namespace-scoped) | ❌ No | ✅ Read secrets (Workload Identity) | None (automated) |

RBAC Example (FluxCD ServiceAccount):

# FluxCD runs with least privilege (namespace-scoped for app deployments)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kustomize-controller-atp-apps
  namespace: flux-system

---
# Role: namespace-scoped permissions
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: flux-apps-deployer
  namespace: atp-production
rules:
- apiGroups: ["apps"]
  resources: ["deployments", "replicasets", "statefulsets", "daemonsets"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: [""]
  resources: ["services", "configmaps", "secrets", "persistentvolumeclaims"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["networking.k8s.io"]
  resources: ["ingresses", "networkpolicies"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]

---
# RoleBinding: bind ServiceAccount to Role
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: flux-apps-deployer
  namespace: atp-production
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: flux-apps-deployer
subjects:
- kind: ServiceAccount
  name: kustomize-controller-atp-apps
  namespace: flux-system

Multi-Tenancy

Tenant Isolation in Git:

ATP's namespace-per-tenant model is naturally represented in Git:

Directory Structure:

atp-gitops/
├── tenants/
│   ├── tenant-acme-corp/           # Tenant: ACME Corporation
│   │   ├── namespace.yaml          # Namespace: atp-tenant-acme
│   │   ├── resource-quota.yaml     # Limits: 10 CPU, 20 GB RAM
│   │   ├── network-policy.yaml     # Deny cross-tenant traffic
│   │   ├── rbac.yaml                # Tenant-specific RBAC
│   │   ├── config.yaml              # Data residency: US
│   │   └── kustomization.yaml      # FluxCD Kustomization
│   │
│   ├── tenant-widgets-inc/         # Tenant: Widgets Inc.
│   │   ├── namespace.yaml          # Namespace: atp-tenant-widgets
│   │   ├── resource-quota.yaml     # Limits: 5 CPU, 10 GB RAM
│   │   ├── network-policy.yaml
│   │   ├── rbac.yaml
│   │   ├── config.yaml              # Data residency: EU (GDPR)
│   │   └── kustomization.yaml
│   │
│   └── tenant-global-bank/         # Tenant: Global Bank (Enterprise)
│       ├── namespace.yaml          # Namespace: atp-tenant-global
│       ├── resource-quota.yaml     # Limits: 20 CPU, 40 GB RAM
│       ├── network-policy.yaml     # Strict isolation (financial data)
│       ├── rbac.yaml
│       ├── config.yaml              # Compliance: HIPAA + SOC 2 + GDPR
│       └── kustomization.yaml

Tenant Configuration Example:

# tenants/tenant-acme-corp/config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tenant-config
  namespace: atp-tenant-acme
data:
  # Tenant metadata
  tenant-id: "acme-corp"
  tenant-name: "ACME Corporation"
  tenant-tier: "standard"  # standard, premium, enterprise

  # Data residency
  data-residency: "us"  # us, eu, apac
  primary-region: "eastus"
  backup-region: "westus"

  # Compliance requirements
  compliance-profile: "soc2-hipaa"  # soc2, gdpr, hipaa, soc2-gdpr, soc2-hipaa, soc2-gdpr-hipaa
  retention-days: "2555"  # 7 years
  immutability-enabled: "true"
  tamper-evidence-enabled: "true"

  # Feature flags (tenant-specific)
  enable-advanced-query: "true"
  enable-ai-anomaly-detection: "false"  # Premium feature
  enable-realtime-alerts: "true"

  # Resource limits
  max-ingestion-rate-rps: "1000"  # 1000 requests per second
  max-query-rate-rps: "500"
  max-storage-gb: "1000"  # 1 TB

Benefits:

  • ✅ Isolated changes: Tenant config changes don't affect other tenants (isolated Git directories)
  • ✅ Audit trail per tenant: git log -- tenants/tenant-acme-corp/ shows all changes for ACME Corp
  • ✅ Compliance per tenant: GDPR/HIPAA requirements enforced via namespace labels and network policies
  • ✅ Cost allocation: Resource quotas enable accurate chargeback/showback per tenant
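The per-tenant audit trail can be demonstrated end to end. The sketch below builds a throwaway repo mirroring the atp-gitops tenant layout and shows that a git log scoped to one tenant directory returns only that tenant's changes (tenant names and commit messages are illustrative):

```shell
#!/usr/bin/env bash
# Sketch: per-tenant audit evidence via path-scoped git log.
set -euo pipefail

tmp=$(mktemp -d) && cd "$tmp" && git init -q
commit() { git -c user.name=Demo -c user.email=demo@connectsoft.example commit -qm "$1"; }

mkdir -p tenants/tenant-acme-corp tenants/tenant-widgets-inc
echo "tier: standard" > tenants/tenant-acme-corp/config.yaml
git add -A && commit "feat(acme): onboard ACME Corporation"
echo "tier: premium" > tenants/tenant-widgets-inc/config.yaml
git add -A && commit "feat(widgets): onboard Widgets Inc."

# Only ACME's history is returned — evidence scoped to a single tenant
git log --pretty='%s' -- tenants/tenant-acme-corp/
```

The same path-scoped query, exported from the real repository, can feed tenant-specific compliance reports without exposing other tenants' change history.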


Operational Resilience

Disaster Recovery from Git:

Scenario: Production AKS cluster destroyed (region outage, ransomware, infrastructure failure)

Recovery Steps:

#!/bin/bash
# disaster-recovery-production.sh — Recover production AKS from Git

set -euo pipefail

echo "🔴 DISASTER RECOVERY: Recreating production AKS cluster from Git + IaC"

# ──────────────────────────────────────────────────────────────────────────
# Step 1: Provision new AKS cluster with Pulumi (15 minutes)
# ──────────────────────────────────────────────────────────────────────────
echo "Step 1/5: Provisioning AKS cluster with Pulumi..."
cd infrastructure/pulumi-aks
pulumi stack select production
pulumi refresh --yes  # Detect destroyed resources
pulumi up --yes  # Recreate cluster

# ──────────────────────────────────────────────────────────────────────────
# Step 2: Configure kubectl context (1 minute)
# ──────────────────────────────────────────────────────────────────────────
echo "Step 2/5: Configuring kubectl..."
az aks get-credentials \
  --resource-group ATP-Production-EUS-RG \
  --name atp-prod-eus-aks \
  --overwrite-existing

export KUBECONFIG=~/.kube/config

# ──────────────────────────────────────────────────────────────────────────
# Step 3: Install FluxCD and bootstrap from Git (5 minutes)
# ──────────────────────────────────────────────────────────────────────────
echo "Step 3/5: Bootstrapping FluxCD..."
flux bootstrap git \
  --url=ssh://git@ssh.dev.azure.com/v3/ConnectSoft/ATP/atp-gitops \
  --branch=production \
  --path=clusters/production \
  --private-key-file=~/.ssh/azure-devops-flux \
  --author-name="Platform Team" \
  --author-email="platform-team@connectsoft.example"

# ──────────────────────────────────────────────────────────────────────────
# Step 4: Wait for FluxCD to reconcile all resources (10 minutes)
# ──────────────────────────────────────────────────────────────────────────
echo "Step 4/5: Waiting for FluxCD reconciliation..."
flux get kustomizations --watch --timeout=15m

# ──────────────────────────────────────────────────────────────────────────
# Step 5: Verify all services healthy (3 minutes)
# ──────────────────────────────────────────────────────────────────────────
echo "Step 5/5: Verifying service health..."
for service in ingestion query integrity export policy search gateway; do
  echo "Checking atp-$service..."
  kubectl wait --for=condition=available --timeout=300s \
    deployment/atp-$service -n atp-production
done

echo "✅ Disaster recovery complete!"
echo "RTO achieved: ~30 minutes"
echo "RPO: 0 minutes (Git contains complete desired state)"

RTO/RPO Targets:

| Environment | RTO Target | RTO Actual (GitOps) | RPO Target | RPO Actual (GitOps) |
|-------------|------------|---------------------|------------|---------------------|
| Dev | 4 hours | 20 minutes | 24 hours | 0 minutes |
| Test | 2 hours | 25 minutes | 12 hours | 0 minutes |
| Staging | 1 hour | 30 minutes | 4 hours | 0 minutes |
| Production | 30 minutes | 30-35 minutes | 1 hour | 0 minutes |

GitOps Impact: RPO reduced to zero (Git has complete desired state, no data loss for infrastructure config).


Summary

  • GitOps in ATP Context: Essential (not just beneficial) for ATP's audit trail, security, and compliance requirements
  • Audit Trail Requirements: Complete change history (Git log with attribution), tamper-evidence (GPG-signed commits), long-term retention (7 years in immutable Azure Blob Storage)
  • Security Benefits: No cluster credentials outside cluster, separation of duties (7 roles with RBAC matrix), secret management via Key Vault (zero secrets in Git)
  • Multi-Tenancy: Namespace-per-tenant with isolated Git directories, tenant-specific configs (data residency, compliance, resource quotas, feature flags)
  • Operational Resilience: DR RTO 30-35 minutes (Pulumi 15min + FluxCD 10min + validate 5min), RPO 0 minutes (Git has full state)
  • Rollback Simplicity: git revert triggers automatic rollback within 5-10 minutes (vs re-running pipeline)

Four Core Principles (OpenGitOps)

The OpenGitOps working group (CNCF) defines 4 core principles that any GitOps implementation must follow. ATP adheres to all four principles using FluxCD, Azure DevOps, and AKS.


Principle 1: Declarative

Definition: The desired system state is represented as declarative specifications (what you want, not how to get it). Configuration is stored in a version-controlled source (Git) rather than generated by scripts.

Key Concepts:

  • Declarative vs Imperative: Declarative describes the end state (e.g., "3 replicas, 1 GB RAM"), while imperative describes steps (e.g., "scale up by 1, set memory to 1 GB")
  • Idempotency: Applying the same declarative configuration multiple times produces the same result
  • Configuration as Code: All infrastructure and application config stored as YAML/JSON in Git

ATP Implementation:

ATP uses three layers of declarative configuration:

  1. Base Kubernetes Manifests (YAML): Raw Kubernetes resource definitions
  2. Helm Charts: Templated, parameterized manifests with values files
  3. Kustomize Overlays: Environment-specific customizations applied to base manifests

Kubernetes Deployment Manifest (Base)

Complete Example (ATP Ingestion Service):

# apps/atp-ingestion/base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
  namespace: atp-production
  labels:
    app: atp-ingestion
    component: ingestion
    tier: backend
    version: v1.2.3
    managed-by: fluxcd
spec:
  replicas: 3  # Desired state: 3 replicas
  selector:
    matchLabels:
      app: atp-ingestion
  template:
    metadata:
      labels:
        app: atp-ingestion
        version: v1.2.3
    spec:
      serviceAccountName: atp-ingestion-sa
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 2000
        seccompProfile:
          type: RuntimeDefault
      containers:
      - name: ingestion
        image: connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d
        imagePullPolicy: IfNotPresent
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          capabilities:
            drop: [ALL]
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
          limits:
            cpu: 1000m
            memory: 1Gi
        env:
        - name: ASPNETCORE_ENVIRONMENT
          value: Production
        - name: OpenTelemetry__ServiceName
          value: atp-ingestion
        envFrom:
        - configMapRef:
            name: atp-ingestion-config
        - secretRef:
            name: atp-ingestion-secrets
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 3
        volumeMounts:
        - name: tmp
          mountPath: /tmp
        - name: cache
          mountPath: /app/cache
      volumes:
      - name: tmp
        emptyDir: {}
      - name: cache
        emptyDir: {}
      imagePullSecrets:
      - name: acr-credentials

---
# apps/atp-ingestion/base/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: atp-ingestion
  namespace: atp-production
  labels:
    app: atp-ingestion
spec:
  type: ClusterIP
  ports:
  - port: 80
    targetPort: 8080
    protocol: TCP
    name: http
  selector:
    app: atp-ingestion

---
# apps/atp-ingestion/base/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: atp-ingestion-config
  namespace: atp-production
data:
  ASPNETCORE_ENVIRONMENT: "Production"
  OpenTelemetry__ServiceName: "atp-ingestion"
  OpenTelemetry__SamplingRatio: "0.1"
  Audit__EnableImmutability: "true"
  Audit__RetentionDays: "2555"

Declarative Characteristics:

  • Desired state: replicas: 3 declares the goal (not "scale by +1")
  • Idempotent: Reapplying same manifest produces same result
  • Version-controlled: Stored in Git, not generated by scripts
  • Immutable: Image tag includes commit SHA (v1.2.3-abc123d)

Helm Charts (Parameterized Declarative)

Chart Structure:

apps/atp-ingestion/helm/
├── Chart.yaml              # Chart metadata
├── values.yaml             # Default values
├── values-dev.yaml         # Dev environment overrides
├── values-production.yaml  # Production environment overrides
└── templates/
    ├── deployment.yaml     # Templated Deployment
    ├── service.yaml        # Templated Service
    ├── configmap.yaml      # Templated ConfigMap
    └── ingress.yaml        # Templated Ingress

Chart.yaml:

# apps/atp-ingestion/helm/Chart.yaml
apiVersion: v2
name: atp-ingestion
description: ATP Ingestion Service - Receives audit records via HTTP/gRPC
version: 1.2.3  # Chart version (SemVer)
appVersion: 1.2.3  # Application version
type: application

keywords:
  - audit-trail
  - ingestion
  - microservice

maintainers:
  - name: ConnectSoft Platform Team
    email: platform-team@connectsoft.example

dependencies:
  - name: redis
    version: 17.x.x
    repository: https://charts.bitnami.com/bitnami
    condition: redis.enabled

values.yaml (Default):

# apps/atp-ingestion/helm/values.yaml
# Default values for atp-ingestion chart

replicaCount: 3

image:
  repository: connectsoft.azurecr.io/atp/ingestion
  pullPolicy: IfNotPresent
  tag: ""  # Overridden by .Values.appVersion or CI pipeline

imagePullSecrets:
  - name: acr-credentials

serviceAccount:
  create: true
  annotations:
    azure.workload.identity/client-id: "12345678-1234-1234-1234-123456789abc"
  name: atp-ingestion-sa

podAnnotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "8080"
  prometheus.io/path: "/metrics"

podSecurityContext:
  runAsNonRoot: true
  runAsUser: 1000
  fsGroup: 2000
  seccompProfile:
    type: RuntimeDefault

securityContext:
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop: [ALL]

service:
  type: ClusterIP
  port: 80
  targetPort: 8080

ingress:
  enabled: true
  className: nginx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
  hosts:
    - host: ingestion.atp.connectsoft.example
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: atp-ingestion-tls
      hosts:
        - ingestion.atp.connectsoft.example

resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 1000m
    memory: 1Gi

autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
  targetMemoryUtilizationPercentage: 80

livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5

# Environment-specific configuration
env:
  ASPNETCORE_ENVIRONMENT: Production
  OpenTelemetry__ServiceName: atp-ingestion

# External Secrets Operator integration
externalSecrets:
  enabled: true
  secretStore: azure-keyvault
  secrets:
    - name: ConnectionStrings__Database
      key: sql-connection-string
    - name: ConnectionStrings__Redis
      key: redis-connection-string
    - name: ConnectionStrings__RabbitMQ
      key: rabbitmq-connection-string

# Redis sub-chart (optional dependency)
redis:
  enabled: false  # Use Azure Cache for Redis instead

Helm Template (Deployment):

# apps/atp-ingestion/helm/templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "atp-ingestion.fullname" . }}
  namespace: {{ .Release.Namespace }}
  labels:
    {{- include "atp-ingestion.labels" . | nindent 4 }}
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      {{- include "atp-ingestion.selectorLabels" . | nindent 6 }}
  template:
    metadata:
      annotations:
        {{- with .Values.podAnnotations }}
        {{- toYaml . | nindent 8 }}
        {{- end }}
      labels:
        {{- include "atp-ingestion.selectorLabels" . | nindent 8 }}
    spec:
      serviceAccountName: {{ .Values.serviceAccount.name }}
      securityContext:
        {{- toYaml .Values.podSecurityContext | nindent 8 }}
      containers:
      - name: {{ .Chart.Name }}
        image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
        imagePullPolicy: {{ .Values.image.pullPolicy }}
        securityContext:
          {{- toYaml .Values.securityContext | nindent 12 }}
        resources:
          {{- toYaml .Values.resources | nindent 12 }}
        env:
        {{- range $key, $value := .Values.env }}
        - name: {{ $key }}
          value: {{ $value | quote }}
        {{- end }}
        livenessProbe:
          {{- toYaml .Values.livenessProbe | nindent 12 }}
        readinessProbe:
          {{- toYaml .Values.readinessProbe | nindent 12 }}
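
The template above calls named helpers (`atp-ingestion.fullname`, `atp-ingestion.labels`, `atp-ingestion.selectorLabels`) that are not shown in the chart tree; by Helm convention they live in `templates/_helpers.tpl`. A minimal sketch using the standard helper idioms (label choices here are assumptions, not the ATP originals):

```yaml
{{/* apps/atp-ingestion/helm/templates/_helpers.tpl — minimal helper sketch */}}
{{- define "atp-ingestion.fullname" -}}
{{- .Chart.Name | trunc 63 | trimSuffix "-" -}}
{{- end }}

{{- define "atp-ingestion.selectorLabels" -}}
app: {{ .Chart.Name }}
{{- end }}

{{- define "atp-ingestion.labels" -}}
{{ include "atp-ingestion.selectorLabels" . }}
helm.sh/chart: {{ .Chart.Name }}-{{ .Chart.Version }}
app.kubernetes.io/version: {{ .Chart.AppVersion | quote }}
app.kubernetes.io/managed-by: {{ .Release.Service }}
{{- end }}
```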

Benefits of Helm:

  • Parameterization: One chart, multiple environments (values-dev.yaml, values-production.yaml)
  • Reusability: Chart can be used across multiple services with different values
  • Dependency management: Declare sub-charts (e.g., Redis) as dependencies
  • Versioning: Chart version and app version tracked separately
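
For reference, a sketch of what `values-production.yaml` might contain. Only keys that differ from `values.yaml` need to appear; the specific numbers below are illustrative, chosen to mirror the production resource sizing used elsewhere in this document:

```yaml
# apps/atp-ingestion/helm/values-production.yaml — illustrative production overrides
replicaCount: 5

resources:
  requests:
    cpu: 1000m
    memory: 1Gi
  limits:
    cpu: 2000m
    memory: 2Gi

env:
  ASPNETCORE_ENVIRONMENT: Production
  OpenTelemetry__SamplingRatio: "0.1"   # 10% sampling in production
```

Rendering with `helm template -f values-production.yaml` layers these on top of the defaults in `values.yaml`.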

Kustomize Overlays (Environment-Specific Customization)

Directory Structure:

apps/atp-ingestion/
├── base/                    # Base manifests (reusable)
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── configmap.yaml
│   └── kustomization.yaml
└── overlays/                # Environment-specific overlays
    ├── dev/
    │   ├── kustomization.yaml
    │   ├── deployment-patch.yaml
    │   └── configmap-patch.yaml
    ├── staging/
    │   ├── kustomization.yaml
    │   ├── deployment-patch.yaml
    │   └── hpa-patch.yaml
    └── production/
        ├── kustomization.yaml
        ├── deployment-patch.yaml
        ├── hpa-patch.yaml
        └── configmap-patch.yaml

Base Kustomization:

# apps/atp-ingestion/base/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: atp-production

resources:
  - deployment.yaml
  - service.yaml
  - configmap.yaml

commonLabels:
  app: atp-ingestion
  component: ingestion
  managed-by: fluxcd

images:
  - name: connectsoft.azurecr.io/atp/ingestion
    newTag: v1.2.3-abc123d  # Updated by CI pipeline

Production Overlay:

# apps/atp-ingestion/overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: atp-production

# Base resources
resources:
  - ../../base

# Strategic merge patches
patchesStrategicMerge:
  - deployment-patch.yaml
  - hpa-patch.yaml

# Image tag override (updated by CI pipeline)
images:
  - name: connectsoft.azurecr.io/atp/ingestion
    newTag: v1.2.3-abc123d

# ConfigMap generator (add production-specific values)
configMapGenerator:
  - name: atp-ingestion-config
    behavior: merge  # Merge with base ConfigMap
    literals:
      - ASPNETCORE_ENVIRONMENT=Production
      - OpenTelemetry__SamplingRatio=0.1
      - Audit__EnableImmutability=true
      - Audit__RetentionDays=2555

# Labels applied to all resources
commonLabels:
  environment: production
  managed-by: fluxcd
  compliance: soc2-gdpr-hipaa

# Annotations applied to all resources
commonAnnotations:
  gitops.toolkit.fluxcd.io/reconcile: enabled
  azure.connectsoft.com/cost-center: atp-production

# Replicas override (production has more replicas)
replicas:
  - name: atp-ingestion
    count: 5  # Production: 5 replicas (base has 3)

Deployment Patch (Production-specific changes):

# apps/atp-ingestion/overlays/production/deployment-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  replicas: 5  # Production: 5 replicas
  template:
    spec:
      containers:
      - name: ingestion
        resources:
          requests:
            cpu: 1000m      # Production: 1 CPU core (base: 500m)
            memory: 1Gi     # Production: 1 GB RAM (base: 512Mi)
          limits:
            cpu: 2000m      # Production: 2 CPU cores (base: 1000m)
            memory: 2Gi     # Production: 2 GB RAM (base: 1Gi)
        env:
        - name: ASPNETCORE_ENVIRONMENT
          value: Production
        - name: OpenTelemetry__SamplingRatio
          value: "0.1"  # Production: 10% sampling (dev: 100%)
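
The overlay also lists an `hpa-patch.yaml` that is not shown. A sketch, assuming a HorizontalPodAutoscaler named `atp-ingestion` exists in the base manifests (strategic merge patches require the target resource to exist); the scaling bounds are illustrative:

```yaml
# apps/atp-ingestion/overlays/production/hpa-patch.yaml — illustrative sketch
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: atp-ingestion
spec:
  minReplicas: 5    # Production floor matches the replica override above
  maxReplicas: 15   # Illustrative production ceiling
```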

Benefits of Kustomize:

  • DRY (Don't Repeat Yourself): Base manifests reused; only differences in overlays
  • Environment isolation: Each environment has isolated overlay directory
  • Strategic merge patches: Fine-grained control over what changes per environment
  • Build-time customization: No runtime templating; manifests rendered before applying

Principle 2: Versioned & Immutable

Definition: All desired states are versioned (stored in Git) and immutable (cannot be changed after commit). Changes are made by creating new versions, not modifying existing ones.

Key Concepts:

  • Git as Version Control: All manifests stored in Git with commit history
  • Immutable Git History: Commits cannot be modified (only new commits added)
  • Semantic Versioning: Version numbers follow SemVer (major.minor.patch)
  • Image Tagging: Container images tagged with version + commit SHA
  • GPG Signing: Commits cryptographically signed to prove authenticity

Git Commit Signing with GPG

Purpose: Ensure commits are tamper-evident and authentic (SOC 2, GDPR compliance).

Setup:

# Generate GPG key (one-time per developer/team)
gpg --full-generate-key
# Select:
#   - RSA and RSA (default)
#   - 4096 bits (secure)
#   - No expiration (or 2 years)
#   - Real name: "Platform Team"
#   - Email: platform-team@connectsoft.example
#   - Comment: "ATP GitOps Commits"

# List keys
gpg --list-secret-keys --keyid-format LONG

# Output:
# sec   rsa4096/1234567890ABCDEF 2024-01-15 [SC]
#       ABC123DEF4567890ABC123DEF4567890ABC123DE
# uid                 [ultimate] Platform Team <platform-team@connectsoft.example>

# Configure Git to use GPG key
git config --global user.signingkey 1234567890ABCDEF  # Use key ID from above
git config --global commit.gpgsign true  # Sign all commits automatically
git config --global tag.gpgsign true     # Sign all tags automatically

# Export public key (share with team)
gpg --armor --export 1234567890ABCDEF > platform-team-gpg-public.key

# Import public key (for verification)
gpg --import platform-team-gpg-public.key

Commit with Signature:

# Standard commit (automatically signed due to commit.gpgsign=true)
git add apps/atp-ingestion/overlays/production/kustomization.yaml
git commit -m "feat(ingestion): upgrade to v1.2.4

- Updated image tag to v1.2.4-def456e
- Increased memory limit 1Gi → 2Gi (performance optimization)
- Enabled advanced query features

Relates to: ATP-EPIC-789
Approved by: architect@connectsoft.example
Tested in: Staging (2024-10-28 to 2024-10-30)"

# Explicit signing (if auto-sign disabled)
git commit -S -m "..."

# Verify signature
git log --show-signature -1

# Output:
# commit def456e789abcdef0123456789abcdef01234567 (HEAD -> production)
# gpg: Signature made Mon Oct 30 15:30:22 2024 UTC
# gpg:                using RSA key 1234567890ABCDEF
# gpg: Good signature from "Platform Team <platform-team@connectsoft.example>"
# Author: Platform Team <platform-team@connectsoft.example>
# Date:   Mon Oct 30 15:30:22 2024 +0000
#
#     feat(ingestion): upgrade to v1.2.4

Azure DevOps Branch Policy (Require Signed Commits):

# Azure DevOps Branch Policy: Require GPG-signed commits
# Configured in Azure DevOps Portal:
# Repos > Branches > production > Branch Policies > Branch Policies
#   ✓ Require signed commits (GPG or SSH)
#   ✓ Require pull request (minimum 1 reviewer)
#   ✓ Require status checks (CI pipeline must pass)

Verify All Commits Signed (Compliance Audit):

#!/bin/bash
# verify-all-commits-signed.sh — Verify all commits in production branch are GPG-signed

BRANCH="production"
UNSIGNED_COMMITS=()

for commit in $(git log --format=%H origin/$BRANCH); do
  if ! git verify-commit $commit 2>/dev/null; then
    UNSIGNED_COMMITS+=($commit)
  fi
done

if [ ${#UNSIGNED_COMMITS[@]} -eq 0 ]; then
  echo "✅ All commits are GPG-signed"
  exit 0
else
  echo "❌ Found ${#UNSIGNED_COMMITS[@]} unsigned commits:"
  for commit in "${UNSIGNED_COMMITS[@]}"; do
    echo "  - $commit"
  done
  exit 1
fi

Semantic Versioning Strategy

Strategy: ATP uses Semantic Versioning (SemVer) for application versions: MAJOR.MINOR.PATCH

  • MAJOR: Breaking changes (API incompatibility, schema changes)
  • MINOR: New features (backward-compatible)
  • PATCH: Bug fixes (backward-compatible)
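
A small shell helper (illustrative, not part of the ATP tooling) shows how a CI script might split a `vMAJOR.MINOR.PATCH` tag into its components, e.g. to decide whether an upgrade crosses a breaking-change boundary:

```shell
# parse_semver: print "MAJOR MINOR PATCH" for a tag like "v1.2.4" or "v1.2.4-hotfix1".
parse_semver() {
  local tag="${1#v}"        # strip the leading "v" if present
  local core="${tag%%-*}"   # drop any suffix after "-" (e.g. -hotfix1)
  local major minor patch
  IFS=. read -r major minor patch <<< "$core"
  echo "$major $minor $patch"
}

parse_semver "v1.2.4"          # → 1 2 4
parse_semver "v1.2.4-hotfix1"  # → 1 2 4
```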

Version Tagging:

# Tag release in source code repository
git tag -a v1.2.4 -m "Release v1.2.4

- Feature: Advanced query API
- Bug fix: Memory leak in Redis connection pooling
- Security: Upgrade to .NET 8.0

Changelog: https://dev.azure.com/ConnectSoft/ATP/_wiki/wikis/ATP.wiki/12345/Release-Notes-v1.2.4"

git push origin v1.2.4

# CI pipeline uses tag to build Docker image
# Docker image tagged as: connectsoft.azurecr.io/atp/ingestion:v1.2.4

Version Examples:

v1.2.4    # Minor feature release
v1.2.3    # Patch release (bug fix)
v2.0.0    # Major release (breaking changes)
v1.2.4-hotfix1  # Hotfix release

Git Tags for Releases

Tag Structure:

# Production release tag
git tag -a v1.2.4 -m "Production Release v1.2.4" production
git push origin v1.2.4

# Hotfix release tag
git tag -a v1.2.4-hotfix1 -m "Hotfix: Memory leak fix" hotfix/memory-leak
git push origin v1.2.4-hotfix1

Tag Verification (Ensure Tags Match Commits):

# Verify tag points to expected commit
git tag -v v1.2.4

# Output:
# object abc123d7890def4567890abc123def4567890ab
# type commit
# tag v1.2.4
# tagger Platform Team <platform-team@connectsoft.example> 2024-10-30 16:00:00 +0000
#
# Production Release v1.2.4
# gpg: Signature made Mon Oct 30 16:00:00 2024 UTC
# gpg:                using RSA key 1234567890ABCDEF
# gpg: Good signature from "Platform Team <platform-team@connectsoft.example>"

Environment-Wide Release Tags:

# Production release tag (all services)
git tag -a release/v1.2.4 -m "Production Release v1.2.4

Services:
- atp-ingestion: v1.2.4
- atp-query: v1.3.0
- atp-integrity: v1.1.5
- atp-export: v1.0.2
- atp-policy: v1.2.0
- atp-search: v1.1.0
- atp-gateway: v1.4.0

Changelog: https://dev.azure.com/ConnectSoft/ATP/_wiki/wikis/ATP.wiki/12345/Release-Notes-v1.2.4"
git push origin release/v1.2.4

Image Tagging with Version + Commit SHA

Strategy: ATP uses immutable image tags combining version + commit SHA:

Format: {version}-{commit-sha}
Example: v1.2.4-abc123d

Where:
  - v1.2.4 = Semantic version (from Git tag)
  - abc123d = First 7 characters of Git commit SHA

Benefits:

  • ✅ Traceability: Image tag links to exact Git commit
  • ✅ Immutability: Same tag always points to same image (never overwritten)
  • ✅ Version clarity: Version number visible in tag
  • ✅ Rollback simplicity: Revert to previous version tag
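
The tag composition is trivial to script. A sketch (function name is illustrative) of how a CI step might derive the immutable tag from a version tag and a full commit SHA:

```shell
# build_image_tag: compose "<version>-<short-sha>" from a version tag and a full SHA.
build_image_tag() {
  local version="$1"   # e.g. "v1.2.4", typically from: git describe --tags --abbrev=0
  local sha="$2"       # full 40-char SHA, typically from: git rev-parse HEAD
  printf '%s-%s\n' "$version" "${sha:0:7}"   # first 7 characters of the SHA
}

build_image_tag "v1.2.4" "abc123d7890def4567890abc123def4567890ab"
# → v1.2.4-abc123d
```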

Docker Image Tagging (Azure Pipelines):

# Azure Pipelines: Tag Docker image with version + commit SHA
- task: Docker@2
  displayName: 'Build and push Docker image'
  inputs:
    containerRegistry: 'ConnectSoft-ACR'
    repository: 'atp/ingestion'
    command: 'buildAndPush'
    Dockerfile: 'src/ConnectSoft.ATP.Ingestion/Dockerfile'
    tags: |
      $(Build.BuildNumber)              # v1.2.4
      $(Build.BuildNumber)-$(Build.SourceVersion)  # Build.SourceVersion is the full 40-char SHA; truncate to 7 chars in a prior script step to get v1.2.4-abc123d
      latest                            # Latest (for dev only)

ACR Tagging Rules:

| Tag Pattern | Mutable? | Use Case | Example |
|-------------|----------|----------|---------|
| v{VERSION} | ❌ Immutable | Production releases | v1.2.4 |
| v{VERSION}-{SHA} | ❌ Immutable | Production releases (traceable) | v1.2.4-abc123d |
| latest | ✅ Mutable | Development only | latest |

Git History as Audit Trail

Compliance Report Generation:

#!/bin/bash
# generate-compliance-report.sh — Generate SOC 2 audit report from Git history

BRANCH="production"
START_DATE="2024-01-01"
END_DATE="2024-12-31"
OUTPUT_FILE="compliance-report-q4-2024.md"

echo "# GitOps Compliance Report — Q4 2024" > $OUTPUT_FILE
echo "" >> $OUTPUT_FILE
echo "**Report Period**: $START_DATE to $END_DATE" >> $OUTPUT_FILE
echo "**Branch**: $BRANCH" >> $OUTPUT_FILE
echo "" >> $OUTPUT_FILE
echo "## All Production Deployments" >> $OUTPUT_FILE
echo "" >> $OUTPUT_FILE
echo "| Commit | Timestamp | Author | Email | Description | Signature |" >> $OUTPUT_FILE
echo "|--------|-----------|--------|-------|-------------|-----------|" >> $OUTPUT_FILE

git log --format="| %h | %ai | %an | %ae | %s | %G? |" \
  --since="$START_DATE" \
  --until="$END_DATE" \
  origin/$BRANCH | \
  sed 's/ G |$/ ✅ Good |/' | \
  sed 's/ B |$/ ❌ Bad |/' | \
  sed 's/ U |$/ ⚠️ Unknown |/' | \
  sed 's/ N |$/ ❌ None |/' | \
  sed 's/ X |$/ ❌ Expired |/' \
  >> $OUTPUT_FILE

echo "" >> $OUTPUT_FILE
echo "## Summary" >> $OUTPUT_FILE
echo "" >> $OUTPUT_FILE
TOTAL=$(git log --oneline --since="$START_DATE" --until="$END_DATE" origin/$BRANCH | wc -l)
SIGNED=$(git log --show-signature --since="$START_DATE" --until="$END_DATE" origin/$BRANCH | grep -c "Good signature")
echo "- **Total Commits**: $TOTAL" >> $OUTPUT_FILE
echo "- **Signed Commits**: $SIGNED" >> $OUTPUT_FILE
echo "- **Unsigned Commits**: $((TOTAL - SIGNED))" >> $OUTPUT_FILE

echo "✅ Compliance report generated: $OUTPUT_FILE"

Output Example:

# GitOps Compliance Report — Q4 2024

**Report Period**: 2024-01-01 to 2024-12-31
**Branch**: production

## All Production Deployments

| Commit | Timestamp | Author | Email | Description | Signature |
|--------|-----------|--------|-------|-------------|-----------|
| abc123d | 2024-10-30 14:23:45 | Platform Team | platform-team@connectsoft.example | feat(ingestion): upgrade to v1.2.4 | ✅ Good |
| def456e | 2024-10-25 10:15:22 | Alice Chen | alice.chen@connectsoft.example | fix(query): resolve index issue | ✅ Good |
| ghi789f | 2024-10-20 16:42:11 | Bob Smith | bob.smith@connectsoft.example | scale(integrity): replicas 3→5 | ✅ Good |

## Summary

- **Total Commits**: 45
- **Signed Commits**: 45
- **Unsigned Commits**: 0

Long-Term Retention (7 years for compliance):

# Backup Git repository to immutable Azure Blob Storage
az storage blob upload-batch \
  --account-name atpgitbackupprod \
  --destination gitops-backups \
  --source .git/ \
  --destination-path "$(date +%Y%m%d)/" \
  --overwrite false  # Immutable: cannot overwrite

# Enable legal hold (WORM storage)
az storage container legal-hold set \
  --account-name atpgitbackupprod \
  --container-name gitops-backups \
  --tags "compliance=soc2-gdpr-hipaa" "retention=7-years"

# Retention: 7 years (matches audit log retention)
# Cost: ~$50/month for 10 GB Git history (cold storage tier)

Principle 3: Pulled Automatically

Definition: The desired state is automatically pulled from the source (Git repository) by an agent running inside the cluster, rather than being pushed by an external system.

Key Concepts:

  • Pull-Based Architecture: GitOps agent (FluxCD) inside cluster pulls from Git
  • Polling Intervals: Agent checks Git at regular intervals (e.g., every 1 minute)
  • Webhook Triggers: Optional webhooks for immediate sync (faster than polling)
  • GitRepository Resource: FluxCD custom resource that defines Git source
  • Kustomization Resource: FluxCD custom resource that defines what to deploy

FluxCD Architecture Overview

Component Diagram:

graph TD
    A[Git Repository<br/>Azure Repos] -->|git pull<br/>every 1 min| B[Source Controller<br/>flux-system namespace]

    B -->|fetch Git| C[GitRepository<br/>Custom Resource]

    C -->|notify| D[Kustomize Controller<br/>flux-system namespace]

    D -->|render manifests| E[Kustomization<br/>Custom Resource]

    E -->|kubectl apply| F[Kubernetes API<br/>AKS Cluster]

    F -.->|watch for drift| D
    D -.->|reconcile| F

    G[Helm Controller] -.->|if Helm chart| E
    H[Notification Controller] -->|alerts| I[Slack / Teams]

    style B fill:#90EE90
    style D fill:#90EE90
    style G fill:#90EE90
    style H fill:#FFE5B4

FluxCD Components:

| Component | Purpose | Namespace |
|-----------|---------|-----------|
| source-controller | Fetches Git repositories, Helm charts, OCI artifacts | flux-system |
| kustomize-controller | Renders Kustomize manifests and applies to cluster | flux-system |
| helm-controller | Installs/upgrades Helm charts | flux-system |
| notification-controller | Sends alerts to Slack, Teams, etc. | flux-system |

GitRepository Resource

Definition: Defines source of truth (Git repository URL, branch, authentication).

Example:

# clusters/production/gitrepository.yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: atp-gitops
  namespace: flux-system
spec:
  interval: 1m  # Poll Git every 1 minute
  url: ssh://git@ssh.dev.azure.com/v3/ConnectSoft/ATP/atp-gitops
  ref:
    branch: production  # Git branch to watch
  secretRef:
    name: azure-devops-ssh-key  # SSH key secret for authentication
  ignore: |
    /*.md
    !README.md
  suspend: false  # Set to true to pause reconciliation

Status (Reconciled):

# Check GitRepository status
kubectl describe gitrepository atp-gitops -n flux-system

# Output:
# Status:
#   Artifact:
#     Checksum:           abc123def4567890
#     Last Update:        2024-10-30T15:30:00Z
#     Path:               gitrepository/flux-system/atp-gitops/abc123.tar.gz
#     Revision:           production/abc123d7890def4567890abc123def4567890ab
#     URL:                http://source-controller.flux-system.svc.cluster.local./gitrepository/flux-system/atp-gitops/abc123.tar.gz
#   Conditions:
#     Last Transition Time:  2024-10-30T15:30:00Z
#     Message:               Fetched revision: production/abc123d7890def4567890abc123def4567890ab
#     Observed Generation:   1
#     Reason:                GitOperationSucceed
#     Status:                True
#     Type:                  Ready
#   Observed Generation:     1
#   URL:                     ssh://git@ssh.dev.azure.com/v3/ConnectSoft/ATP/atp-gitops

Authentication Methods:

Option 1: SSH Key (Recommended for Azure DevOps):

# Create SSH key secret
apiVersion: v1
kind: Secret
metadata:
  name: azure-devops-ssh-key
  namespace: flux-system
type: Opaque
stringData:
  identity: |
    -----BEGIN OPENSSH PRIVATE KEY-----
    b3BlbnNzaC1rZXktdjEAAAAABG5vbmUAAAAEbm9uZQNuZW5lAAAABQAAAAB...
    -----END OPENSSH PRIVATE KEY-----
  known_hosts: |
    ssh.dev.azure.com ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQC7...

---
# GitRepository references secret
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: atp-gitops
  namespace: flux-system
spec:
  url: ssh://git@ssh.dev.azure.com/v3/ConnectSoft/ATP/atp-gitops
  secretRef:
    name: azure-devops-ssh-key

Option 2: Personal Access Token (PAT) (Alternative):

# Create PAT secret
apiVersion: v1
kind: Secret
metadata:
  name: azure-devops-pat
  namespace: flux-system
type: Opaque
stringData:
  username: git
  password: <AZURE_DEVOPS_PAT>  # Token with Code (Read) permission

---
# GitRepository uses PAT
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: atp-gitops
  namespace: flux-system
spec:
  url: https://dev.azure.com/ConnectSoft/ATP/_git/atp-gitops
  secretRef:
    name: azure-devops-pat

Kustomization Resource

Definition: Defines what to deploy (path in Git repository, reconciliation interval, health checks).

Example:

# clusters/production/kustomizations/atp-ingestion.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: atp-ingestion
  namespace: flux-system
spec:
  interval: 5m  # Reconcile every 5 minutes
  path: ./apps/atp-ingestion/overlays/production  # Path in Git repository
  prune: true  # Delete resources removed from Git
  sourceRef:
    kind: GitRepository
    name: atp-gitops
    namespace: flux-system
  healthChecks:
    - apiVersion: apps/v1
      kind: Deployment
      name: atp-ingestion
      namespace: atp-production
  timeout: 10m  # Timeout for reconciliation
  retryInterval: 2m  # Retry interval on failure
  suspend: false

Status (Reconciled):

# Check Kustomization status
kubectl describe kustomization atp-ingestion -n flux-system

# Output:
# Status:
#   Conditions:
#     Last Transition Time:  2024-10-30T15:35:00Z
#     Message:               Applied revision: production/abc123d7890def4567890abc123def4567890ab
#     Observed Generation:   1
#     Reason:                ReconciliationSucceeded
#     Status:                True
#     Type:                  Ready
#   Inventory:
#     Entries:
#       Id:                   apps_v1_Deployment_atp-production_atp-ingestion
#       V:                    v1
#   Last Applied Revision:    production/abc123d7890def4567890abc123def4567890ab
#   Last Attempted Revision:  production/abc123d7890def4567890abc123def4567890ab
#   Observed Generation:      1

Automatic Sync Policies per Environment

Per-Environment Configuration:

| Environment | Git Branch | Polling Interval | Reconciliation Interval | Webhook Trigger |
|-------------|------------|------------------|-------------------------|-----------------|
| Dev | dev | 30 seconds | 1 minute | Enabled (immediate) |
| Test | test | 1 minute | 2 minutes | Enabled (immediate) |
| Staging | staging | 1 minute | 5 minutes | Disabled (manual approval) |
| Production | production | 1 minute | 5 minutes | Disabled (manual approval + 24h cooldown) |

Production Sync Policy (Conservative):

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: atp-ingestion-production
  namespace: flux-system
spec:
  interval: 5m  # Reconcile every 5 minutes (not immediate)
  path: ./apps/atp-ingestion/overlays/production
  prune: true
  sourceRef:
    kind: GitRepository
    name: atp-gitops
  # Production: Manual approval required (webhook disabled)
  # Production: No automatic sync on push (polling only)

Behavior:

  1. Developer pushes commit to production branch
  2. GitRepository polls Git every 1 minute (detects new commit)
  3. Kustomization reconciles every 5 minutes (applies changes)
  4. Total delay: up to 6 minutes (1 min poll + 5 min reconcile)
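
The worst-case figure can be sanity-checked with shell arithmetic; the commented `flux reconcile` line shows how an operator can skip the wait entirely (a sketch illustrating the policy above, not part of it):

```shell
# Worst case: a commit lands just after a poll and just after a reconcile
# run, so both intervals stack.
POLL=60        # GitRepository interval: 1m
RECONCILE=300  # Kustomization interval: 5m
echo "worst case: $(( (POLL + RECONCILE) / 60 )) minutes"

# To apply immediately instead of waiting (manual operator action):
# flux reconcile kustomization atp-ingestion-production --with-source
```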

Dev Sync Policy (Aggressive):

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: atp-ingestion-dev
  namespace: flux-system
spec:
  interval: 1m  # Reconcile every 1 minute
  path: ./apps/atp-ingestion/overlays/dev
  prune: true
  sourceRef:
    kind: GitRepository
    name: atp-gitops-dev

Polling Intervals and Webhook Triggers

Polling Configuration:

# GitRepository polling interval
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
spec:
  interval: 1m  # Minimum: 30s, Maximum: 24h

Webhook Triggers (Optional):

Purpose: Trigger immediate reconciliation when Git push occurs (faster than polling).

Azure DevOps Webhook (receives a POST on each push):

# FluxCD Receiver (webhook endpoint)
apiVersion: notification.toolkit.fluxcd.io/v1
kind: Receiver
metadata:
  name: atp-gitops-webhook
  namespace: flux-system
spec:
  type: git
  events:
    - "push"
  resources:
    - kind: GitRepository
      name: atp-gitops
  secretRef:
    name: webhook-token
  # Azure DevOps webhook URL:
  # https://<fluxcd-receiver>/hook/xyz123abc456...

Configuration in Azure DevOps:

Azure DevOps > Repos > Hooks > Add Subscription
  Name: FluxCD Webhook
  Event: Code pushed
  Filters:
    - Branch: dev, test (production excluded for safety)
  Service Hook URL: https://<fluxcd-receiver>/hook/xyz123abc456...

Benefits:

  • ✅ Faster sync: Changes applied within seconds (vs at most 6 minutes with polling)
  • ✅ Reduced Git polling: Lower load on Azure DevOps Git servers

Trade-offs:

  • ⚠️ Security: Webhook endpoint must be publicly accessible (or use Azure DevOps IP allowlist)
  • ⚠️ Production risk: Disabled for production (manual approval required)
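
A minimal sketch of provisioning the token the Receiver references (the secret name `webhook-token` matches the manifest above; the `kubectl` step is shown commented rather than executed):

```shell
# Generate a random token for the Receiver's secretRef.
TOKEN=$(openssl rand -hex 20)
echo "token length: ${#TOKEN}"

# Then create the secret FluxCD expects (cluster command, not run here):
# kubectl -n flux-system create secret generic webhook-token \
#   --from-literal=token="${TOKEN}"
```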


Principle 4: Continuously Reconciled

Definition: Software agents automatically and continuously ensure the actual system state matches the desired state (stored in Git). Any drift from the desired state is automatically corrected.

Key Concepts:

  • Drift Detection: Continuous monitoring of cluster state vs Git state
  • Self-Healing: Automatic correction of configuration drift
  • Reconciliation Loop: Periodic checks and corrections (every 1-5 minutes)
  • Drift Correction: Revert manual changes to match Git state
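
The loop these concepts describe can be reduced to a toy model (no cluster involved; real FluxCD applies the Git state via server-side apply rather than variable assignment):

```shell
# Toy reconciliation loop: desired state comes from "Git", live state has
# drifted, and the loop converges live back to desired.
desired=3
live=5                       # an operator scaled manually
if [ "$live" -ne "$desired" ]; then
  live=$desired              # real FluxCD: re-apply the Git manifests
fi
echo "live=$live"
```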

Drift Detection Mechanisms

How FluxCD Detects Drift:

  1. Periodic Reconciliation: FluxCD compares cluster state to Git state every reconciliation interval
  2. Resource Watching: Kubernetes watch API detects resource changes in real-time
  3. Inventory Tracking: FluxCD maintains inventory of applied resources (GitOps Toolkit)

Drift Detection Example:

# Scenario: Operator manually scales deployment (NOT via Git)
kubectl scale deployment atp-ingestion --replicas=5 -n atp-production

# FluxCD detects drift within 5 minutes (reconciliation interval)
flux get kustomizations

# Output:
# NAME           READY   MESSAGE
# atp-ingestion  False   Spec.Replicas drift detected: Git=3, Live=5

Drift Detection Status:

# Check drift detection status
kubectl describe kustomization atp-ingestion -n flux-system

# Output:
# Status:
#   Conditions:
#     Last Transition Time:  2024-10-30T15:40:00Z
#     Message:               Reconciliation failed: drift detected in Deployment atp-ingestion
#     Observed Generation:   1
#     Reason:                DriftDetected
#     Status:                False
#     Type:                  Ready
#   Drift:
#     Detected:              true
#     Resources:
#       - Kind:               Deployment
#         Name:               atp-ingestion
#         Namespace:          atp-production
#         Drift:              Spec.Replicas: Git=3, Live=5

Self-Healing Configuration

Automatic Drift Correction:

FluxCD automatically reverts manual changes to match Git state:

# Git state (desired): replicas=3
# Cluster state (actual): replicas=5 (manually changed)

# FluxCD reconciliation (automatic):
# 1. Detect drift: replicas=5 ≠ replicas=3
# 2. Apply Git state: kubectl scale deployment atp-ingestion --replicas=3
# 3. Cluster state matches Git state: replicas=3

Self-Healing Workflow:

graph TD
    A[Manual Change<br/>kubectl scale] -->|immediate| B[Cluster State<br/>replicas=5]
    C[Git State<br/>replicas=3] -.->|every 5 min| D[FluxCD<br/>Reconciliation]
    D -->|compare| B
    D -->|drift detected| E[FluxCD<br/>Auto-Correct]
    E -->|apply Git state| B
    B -.->|matches| C

    style E fill:#90EE90
    style D fill:#FFE5B4

Self-Healing Examples:

Example 1: Manual Replica Scaling:

# Operator manually scales deployment
kubectl scale deployment atp-ingestion --replicas=10 -n atp-production

# Within 5 minutes, FluxCD reverts to Git state
kubectl get deployment atp-ingestion -n atp-production

# Output (after reconciliation):
# NAME            READY   UP-TO-DATE   AVAILABLE   AGE
# atp-ingestion   3/3     3            3           5m
# Replicas: 3 (reverted from 10)

Example 2: Manual ConfigMap Update:

# Operator manually edits ConfigMap
kubectl edit configmap atp-ingestion-config -n atp-production
# Change: ASPNETCORE_ENVIRONMENT=Production → Development

# Within 5 minutes, FluxCD reverts to Git state
kubectl get configmap atp-ingestion-config -n atp-production -o yaml

# Output (after reconciliation):
# data:
#   ASPNETCORE_ENVIRONMENT: Production  # Reverted from Development

Example 3: Manual Resource Deletion:

# Operator accidentally deletes deployment
kubectl delete deployment atp-ingestion -n atp-production

# Within 5 minutes, FluxCD recreates deployment from Git
kubectl get deployment atp-ingestion -n atp-production

# Output (after reconciliation):
# NAME            READY   UP-TO-DATE   AVAILABLE   AGE
# atp-ingestion   3/3     3            3           30s  # Recreated

Reconciliation Loop Monitoring

Monitoring Reconciliation Status:

# Check all Kustomizations status
flux get kustomizations

# Output:
# NAME                 READY   MESSAGE   REVISION                          SUSPENDED
# atp-ingestion        True    Applied   production/abc123d                False
# atp-query            True    Applied   production/abc123d                False
# atp-integrity        False   Drift     production/abc123d                False
# atp-export           True    Applied   production/abc123d                False

# Check specific Kustomization
flux get kustomization atp-ingestion

# Output:
# NAME            READY   MESSAGE                                       REVISION                          SUSPENDED
# atp-ingestion   True    Applied revision: production/abc123d          production/abc123d                False

# Watch reconciliation in real-time
flux get kustomizations --watch

# Output (updates every few seconds):
# NAME                 READY   MESSAGE                          REVISION
# atp-integrity        False   Reconciliation in progress...    production/abc123d
# atp-integrity        True    Applied revision: abc123d        production/abc123d
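
A small helper (illustrative, not a FluxCD feature) that filters such output down to Kustomizations whose READY column is not True, e.g. for a cron check or a pipeline gate:

```shell
# Sample output in the shape `flux get kustomizations` prints; in practice
# this would be captured from the command itself.
flux_output='NAME                 READY   MESSAGE   REVISION
atp-ingestion        True    Applied   production/abc123d
atp-integrity        False   Drift     production/abc123d'

# Skip the header row, print names where READY != True.
echo "$flux_output" | awk 'NR > 1 && $2 != "True" { print $1 }'
```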

Azure Monitor Metrics (FluxCD Reconciliation):

# FluxCD exports Prometheus metrics
# Metrics available in Azure Monitor via Prometheus scraping

# Key Metrics:
# - fluxcd_kustomize_reconciliation_duration_seconds  # Time to reconcile
# - fluxcd_kustomize_reconciliation_total              # Total reconciliations
# - fluxcd_kustomize_reconciliation_success_total      # Successful reconciliations
# - fluxcd_kustomize_reconciliation_failure_total      # Failed reconciliations
# - fluxcd_source_git_duration_seconds                 # Git fetch duration

KQL Query for Reconciliation Monitoring:

// Azure Monitor Log Analytics: Query FluxCD reconciliation metrics
PrometheusMetrics_CL
| where Name_s == "fluxcd_kustomize_reconciliation_duration_seconds"
| summarize 
    avg(Value_d) by KustomizationName_s, bin(TimeGenerated, 5m)
| render timechart

Grafana Dashboard (FluxCD Reconciliation):

# Grafana dashboard config
dashboard:
  title: "FluxCD Reconciliation Status"
  panels:
    - title: "Reconciliation Duration"
      query: "fluxcd_kustomize_reconciliation_duration_seconds"
      type: "graph"

    - title: "Reconciliation Success Rate"
      query: "rate(fluxcd_kustomize_reconciliation_success_total[5m]) / rate(fluxcd_kustomize_reconciliation_total[5m])"
      type: "stat"

    - title: "Drift Detection Events"
      query: "fluxcd_kustomize_reconciliation_failure_total{reason='DriftDetected'}"
      type: "graph"

Drift Correction Strategies

Automatic Correction (Default):

FluxCD automatically corrects drift during reconciliation:

# Kustomization with automatic drift correction
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
spec:
  prune: true  # Delete resources removed from Git
  # Automatic correction: Always revert to Git state

Manual Correction (When Needed):

# Option 1: Suspend reconciliation, fix manually, resume
flux suspend kustomization atp-ingestion -n flux-system

# Fix drift manually
kubectl scale deployment atp-ingestion --replicas=3 -n atp-production

# Resume reconciliation
flux resume kustomization atp-ingestion -n flux-system

# Option 2: Update Git to match cluster state (if intentional)
git checkout production
# Update kustomization.yaml to match current cluster state
git commit -m "chore: update replicas to match current state"
git push origin production

Drift Alerting:

# FluxCD Notification for drift detection
apiVersion: notification.toolkit.fluxcd.io/v1
kind: Alert
metadata:
  name: drift-detection-alert
  namespace: flux-system
spec:
  providerRef:
    name: slack
    namespace: flux-system
  eventSeverity: warning
  eventSources:
    - kind: Kustomization
      name: "*"
  filters:
    - key: reason
      value: DriftDetected

Slack Alert Example:

⚠️ GitOps Drift Detected

Kustomization: atp-ingestion
Namespace: flux-system
Reason: DriftDetected
Resource: Deployment/atp-ingestion (atp-production)
Drift: Spec.Replicas: Git=3, Live=5

Action: FluxCD will automatically correct within 5 minutes

Drift Prevention Best Practices:

  1. Enforce Git-only Changes: RBAC prevents direct kubectl access to production
  2. Alert on Manual Changes: Monitor Kubernetes audit logs for manual changes
  3. Regular Drift Audits: Weekly checks for unexpected cluster changes
  4. Documentation: Clear guidelines that all changes must go through Git

Summary: Four Core Principles

  • Principle 1: Declarative: Desired state expressed as declarative YAML (Kubernetes, Helm, Kustomize)
  • Principle 2: Versioned & Immutable: All changes versioned in Git with GPG signatures, SemVer, immutable image tags, permanent audit trail
  • Principle 3: Pulled Automatically: FluxCD inside cluster pulls from Git (GitRepository/Kustomization), polling intervals per environment, optional webhooks
  • Principle 4: Continuously Reconciled: Automatic drift detection, self-healing configuration, reconciliation monitoring, drift correction strategies

Azure Repos Structure & Organization

Purpose: Define the repository strategy, directory structure, branching model, and access control for the ATP GitOps implementation, ensuring consistency, scalability, and compliance across all environments.


Repository Strategy: Hybrid Monorepo/Polyrepo

ATP uses a hybrid approach: polyrepo for service source code (separate repositories per microservice) and monorepo for GitOps manifests (single repository for all Kubernetes configurations).

Monorepo for GitOps Manifests

Repository: atp-gitops (Azure Repos)

Rationale:

| Aspect | Benefit | Impact |
|--------|---------|--------|
| Atomic Updates | Update multiple services in single commit/PR | Ensures consistency across services (e.g., gateway + all microservices) |
| Cross-Service Visibility | See all deployments in one place | Easier to understand dependencies and relationships |
| Shared Configurations | Common base manifests, Helm values, Kustomize bases | DRY principle; reduces duplication |
| Compliance Auditing | Single audit trail for all infrastructure changes | Simpler SOC 2/GDPR audit reports |
| Environment Consistency | Same structure across dev/test/staging/production | Easier to promote configurations between environments |
| RBAC Simplification | One repository to manage permissions | Simpler access control (vs managing 7+ repos) |

Monorepo Structure:

atp-gitops/  # Single GitOps repository (monorepo)
├── clusters/              # Per-environment FluxCD configs
├── infrastructure/        # Cluster-wide infrastructure
├── apps/                  # All ATP microservices
├── platform/              # Platform configs (RBAC, policies)
├── tenants/               # Multi-tenant configurations
├── monitoring/            # Observability stack
└── docs/                  # Documentation and runbooks

Polyrepo for Service Source Code

Repositories: Separate repositories per microservice

  • atp-ingestion (C# source code)
  • atp-query (C# source code)
  • atp-integrity (C# source code)
  • atp-export (C# source code)
  • atp-policy (C# source code)
  • atp-search (C# source code)
  • atp-gateway (C# source code)

Rationale:

| Aspect | Benefit | Impact |
|--------|---------|--------|
| Team Autonomy | Each service team owns their repository | Faster development cycles; reduced merge conflicts |
| Independent CI/CD | Separate build pipelines per service | Parallel builds; faster feedback |
| Service Isolation | Clear ownership boundaries | Easier to onboard new teams; clearer responsibilities |
| Versioning Flexibility | Each service can version independently | Allows different release cadences per service |
| Repository Size | Smaller repositories (faster clones) | Better developer experience; faster CI builds |

Workflow: Source Code → CI → GitOps Repo → FluxCD

Complete Flow:

graph LR
    subgraph "Source Code Repositories (Polyrepo)"
        A1[atp-ingestion<br/>C# source]
        A2[atp-query<br/>C# source]
        A3[atp-integrity<br/>C# source]
    end

    subgraph "CI Stage (Azure Pipelines)"
        B[Azure Pipelines<br/>Build + Test]
        B -->|1. build Docker image| C[Azure Container<br/>Registry]
        B -->|2. update manifest<br/>commit + push| D[atp-gitops<br/>Monorepo]
    end

    A1 -->|git push| B
    A2 -->|git push| B
    A3 -->|git push| B

    subgraph "GitOps Repository (Monorepo)"
        D1[apps/atp-ingestion/<br/>overlays/production]
        D2[apps/atp-query/<br/>overlays/production]
        D3[apps/atp-integrity/<br/>overlays/production]
    end

    D --> D1
    D --> D2
    D --> D3

    subgraph "CD Stage (FluxCD)"
        E[FluxCD Agent<br/>in AKS cluster]
        E -->|git pull<br/>reconcile| F[AKS Cluster<br/>Deployments]
    end

    D -->|3. FluxCD polls Git| E

    style D fill:#FFE5B4
    style E fill:#90EE90
    style F fill:#90EE90

Step-by-Step Workflow:

  1. Developer pushes to service repository:
cd atp-ingestion  # Polyrepo
git add src/ConnectSoft.ATP.Ingestion/Controllers/AuditRecordsController.cs
git commit -m "feat: add gRPC ingestion endpoint"
git push origin feature/grpc-ingestion
  2. CI pipeline triggers (Azure Pipelines):
# azure-pipelines.yml in atp-ingestion repository
- stage: Build_Test_Publish
  jobs:
  - job: BuildAndTest
    steps:
    - task: Docker@2
      inputs:
        containerRegistry: 'ConnectSoft-ACR'
        repository: 'atp/ingestion'
        command: 'buildAndPush'
        tags: |
          $(Build.BuildNumber)
          $(Build.BuildNumber)-$(Build.SourceVersion)
    - task: Bash@3
      displayName: 'Update GitOps Manifest'
      inputs:
        targetType: 'inline'
        script: |
          git clone https://dev.azure.com/ConnectSoft/ATP/_git/atp-gitops
          cd atp-gitops
          # Update image tag in kustomization.yaml
          yq eval '.images[0].newTag = "$(Build.BuildNumber)-$(Build.SourceVersion)"' \
            -i apps/atp-ingestion/overlays/production/kustomization.yaml
          git add apps/atp-ingestion/overlays/production/kustomization.yaml
          git commit -m "chore(ingestion): update to $(Build.BuildNumber)"
          git push origin production
  3. FluxCD detects Git change (within 1-5 minutes):
# FluxCD GitRepository polls Git every 1 minute
# FluxCD Kustomization reconciles every 5 minutes
# Result: New image tag applied to cluster automatically
  4. Deployment complete:
# Verify deployment
kubectl get pods -n atp-production -l app=atp-ingestion
# Output: Pods using new image tag
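
If `yq` is unavailable on the build agent, the manifest update in step 2 can be approximated with `sed`; the file contents and tag values below are illustrative stand-ins:

```shell
# Illustrative kustomization.yaml (real file lives in the atp-gitops repo).
cat > /tmp/kustomization.yaml <<'EOF'
images:
  - name: connectsoft.azurecr.io/atp/ingestion
    newTag: v1.2.3-old0000
EOF

# Patch the image tag in place (GNU sed).
NEW_TAG="v1.2.4-abc123d"
sed -i "s|newTag: .*|newTag: ${NEW_TAG}|" /tmp/kustomization.yaml

# Confirm exactly one line now carries the new tag.
grep -c "newTag: ${NEW_TAG}" /tmp/kustomization.yaml
```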

Benefits of Hybrid Approach:

  • Best of both worlds: Team autonomy (polyrepo) + consistency (monorepo)
  • Clear separation: Source code changes vs infrastructure changes
  • Atomic deployments: Update multiple services in one PR (if needed)
  • Simplified GitOps: One repository to manage permissions and branch policies

Directory Structure

Complete atp-gitops Repository Layout:

atp-gitops/
├── .github/                          # GitHub Actions (if used) or Azure DevOps templates
│   ├── workflows/
│   └── templates/
├── clusters/                         # Per-environment FluxCD bootstrap configs
│   ├── production/
│   │   ├── flux-system/             # FluxCD installation manifests
│   │   │   ├── gitrepository.yaml  # GitRepository pointing to production branch
│   │   │   ├── kustomizations.yaml # Root Kustomization pointing to /infrastructure and /apps
│   │   │   └── notifications.yaml  # Notification configs (Slack, Teams)
│   │   └── README.md
│   │
│   ├── staging/
│   │   ├── flux-system/
│   │   └── README.md
│   │
│   ├── test/
│   │   ├── flux-system/
│   │   └── README.md
│   │
│   └── dev/
│       ├── flux-system/
│       └── README.md
├── infrastructure/                   # Cluster-wide infrastructure (base + overlays)
│   ├── base/                        # Base infrastructure manifests
│   │   ├── namespaces.yaml          # All namespaces (atp-production, atp-staging, etc.)
│   │   ├── resource-quotas.yaml     # Default resource quotas
│   │   ├── network-policies.yaml    # Default network policies
│   │   ├── pod-security-standards.yaml  # Pod Security Admission configs
│   │   ├── azure-policy.yaml        # Azure Policy for Kubernetes
│   │   └── kustomization.yaml
│   │
│   └── overlays/                    # Environment-specific infrastructure
│       ├── production/
│       │   ├── kustomization.yaml
│       │   ├── resource-quotas-patch.yaml  # Production resource quotas
│       │   └── network-policies-patch.yaml # Production network policies
│       ├── staging/
│       ├── test/
│       └── dev/
├── apps/                            # ATP microservices (7 services)
│   ├── atp-ingestion/
│   │   ├── base/                    # Base manifests (reusable)
│   │   │   ├── deployment.yaml
│   │   │   ├── service.yaml
│   │   │   ├── configmap.yaml
│   │   │   ├── ingress.yaml
│   │   │   └── kustomization.yaml
│   │   │
│   │   ├── helm/                    # Helm chart (optional, alternative to base)
│   │   │   ├── Chart.yaml
│   │   │   ├── values.yaml
│   │   │   ├── values-dev.yaml
│   │   │   ├── values-production.yaml
│   │   │   └── templates/
│   │   │       ├── deployment.yaml
│   │   │       ├── service.yaml
│   │   │       └── configmap.yaml
│   │   │
│   │   └── overlays/                # Environment-specific overlays
│   │       ├── dev/
│   │       │   ├── kustomization.yaml
│   │       │   ├── deployment-patch.yaml
│   │       │   └── configmap-patch.yaml
│   │       ├── test/
│   │       ├── staging/
│   │       └── production/
│   │           ├── kustomization.yaml
│   │           ├── deployment-patch.yaml
│   │           ├── hpa-patch.yaml      # Horizontal Pod Autoscaler
│   │           └── configmap-patch.yaml
│   │
│   ├── atp-query/
│   │   ├── base/
│   │   ├── helm/
│   │   └── overlays/
│   │
│   ├── atp-integrity/
│   │   ├── base/
│   │   ├── helm/
│   │   └── overlays/
│   │
│   ├── atp-export/
│   │   ├── base/
│   │   ├── helm/
│   │   └── overlays/
│   │
│   ├── atp-policy/
│   │   ├── base/
│   │   ├── helm/
│   │   └── overlays/
│   │
│   ├── atp-search/
│   │   ├── base/
│   │   ├── helm/
│   │   └── overlays/
│   │
│   └── atp-gateway/
│       ├── base/
│       ├── helm/
│       └── overlays/
├── platform/                        # Platform configurations
│   ├── rbac/                        # Role-Based Access Control
│   │   ├── service-accounts.yaml    # ServiceAccounts for all services
│   │   ├── roles.yaml              # Namespace-scoped Roles
│   │   ├── role-bindings.yaml      # Role Bindings
│   │   └── cluster-roles.yaml      # Cluster-wide Roles
│   │
│   ├── network-policies/            # Network isolation policies
│   │   ├── default-deny.yaml       # Default deny all traffic
│   │   ├── allow-namespace-internal.yaml  # Allow within namespace
│   │   └── allow-cross-namespace.yaml     # Allow specific cross-namespace
│   │
│   ├── pod-security/                # Pod Security Standards
│   │   ├── baseline.yaml           # Baseline profile
│   │   └── restricted.yaml         # Restricted profile (production)
│   │
│   ├── resource-quotas/             # Resource quotas per namespace
│   │   ├── production.yaml
│   │   ├── staging.yaml
│   │   └── dev.yaml
│   │
│   └── azure-policy/                # Azure Policy for Kubernetes
│       ├── pod-security-standards.yaml
│       ├── resource-limits.yaml
│       └── image-registry.yaml
├── tenants/                         # Multi-tenant configurations
│   ├── tenant-acme-corp/
│   │   ├── namespace.yaml
│   │   ├── resource-quota.yaml
│   │   ├── network-policy.yaml
│   │   ├── rbac.yaml
│   │   ├── config.yaml
│   │   └── kustomization.yaml
│   │
│   ├── tenant-widgets-inc/
│   │   ├── namespace.yaml
│   │   ├── resource-quota.yaml
│   │   ├── network-policy.yaml
│   │   ├── rbac.yaml
│   │   ├── config.yaml
│   │   └── kustomization.yaml
│   │
│   └── tenant-global-bank/
│       ├── namespace.yaml
│       ├── resource-quota.yaml
│       ├── network-policy.yaml
│       ├── rbac.yaml
│       ├── config.yaml
│       └── kustomization.yaml
├── monitoring/                      # Observability stack
│   ├── prometheus/                  # Prometheus Operator manifests
│   │   ├── prometheus.yaml
│   │   ├── servicemonitor.yaml
│   │   └── alerting-rules.yaml
│   │
│   ├── grafana/                     # Grafana dashboards
│   │   ├── dashboards/
│   │   └── datasources.yaml
│   │
│   ├── fluent-bit/                  # Log collection
│   │   └── fluent-bit-config.yaml
│   │
│   └── jaeger/                      # Distributed tracing
│       └── jaeger-operator.yaml
├── docs/                            # Documentation and runbooks
│   ├── README.md                    # Repository overview
│   ├── CONTRIBUTING.md              # How to contribute
│   ├── runbooks/
│   │   ├── rollback-procedure.md
│   │   ├── disaster-recovery.md
│   │   └── troubleshooting.md
│   └── architecture/
│       ├── directory-structure.md
│       └── branching-model.md
├── .gitignore                       # Git ignore patterns
├── .pre-commit-hooks.yaml          # Pre-commit hooks (secret detection)
├── LICENSE                          # Repository license
└── README.md                        # Main README

Directory Purpose Reference

| Directory | Purpose | Example Files |
|-----------|---------|---------------|
| /clusters | FluxCD bootstrap configs per environment | gitrepository.yaml, kustomizations.yaml |
| /infrastructure | Cluster-wide infrastructure (namespaces, quotas, policies) | namespaces.yaml, resource-quotas.yaml |
| /apps | ATP microservice deployments | deployment.yaml, service.yaml, configmap.yaml |
| /platform | Platform configurations (RBAC, network policies, security) | service-accounts.yaml, network-policies.yaml |
| /tenants | Multi-tenant configurations | namespace.yaml, resource-quota.yaml |
| /monitoring | Observability stack (Prometheus, Grafana, Fluent Bit) | prometheus.yaml, grafana-dashboards/ |
| /docs | Documentation and operational runbooks | runbooks/rollback-procedure.md |

Naming Conventions

Directory Naming Standards

Pattern: lowercase-with-hyphens

| Directory Type | Naming Pattern | Example |
|----------------|----------------|---------|
| Service directories | atp-{service-name} | atp-ingestion, atp-query |
| Environment overlays | {environment} | production, staging, test, dev |
| Base directories | base | base/ |
| Helm directories | helm | helm/ |
| Tenant directories | tenant-{tenant-id} | tenant-acme-corp, tenant-widgets-inc |

File Naming Patterns

Pattern: kebab-case.yaml or kebab-case-patch.yaml

| File Type | Naming Pattern | Example |
|-----------|----------------|---------|
| Kubernetes manifests | {resource-kind}.yaml | deployment.yaml, service.yaml |
| Kustomization files | kustomization.yaml | kustomization.yaml |
| Strategic merge patches | {resource-kind}-patch.yaml | deployment-patch.yaml, hpa-patch.yaml |
| Helm values | values-{environment}.yaml | values-production.yaml, values-dev.yaml |
| ConfigMaps | {service-name}-config.yaml | atp-ingestion-config.yaml |
| Documentation | kebab-case.md | rollback-procedure.md, disaster-recovery.md |
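
The file-name pattern can be checked mechanically; this is an illustrative lint, not an official tool:

```shell
# Returns success when the name is kebab-case with a .yaml or .md extension.
is_kebab() { echo "$1" | grep -Eq '^[a-z0-9]+(-[a-z0-9]+)*\.(yaml|md)$'; }

is_kebab "deployment-patch.yaml" && echo "deployment-patch.yaml: ok"
is_kebab "DeploymentPatch.yaml"  || echo "DeploymentPatch.yaml: bad"
```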

Resource Naming Conventions (Kubernetes)

Pattern: {service-name} or {service-name}-{suffix}

| Resource Type | Naming Pattern | Example |
|---------------|----------------|---------|
| Deployments | {service-name} | atp-ingestion, atp-query |
| Services | {service-name} | atp-ingestion, atp-query |
| ConfigMaps | {service-name}-config | atp-ingestion-config |
| Secrets | {service-name}-secrets | atp-ingestion-secrets |
| Ingress | {service-name}-ingress | atp-ingestion-ingress |
| ServiceAccounts | {service-name}-sa | atp-ingestion-sa |
| HPA | {service-name}-hpa | atp-ingestion-hpa |
| NetworkPolicy | {service-name}-network-policy | atp-ingestion-network-policy |

Complete Examples for All ATP Services:

# Deployment names
atp-ingestion
atp-query
atp-integrity
atp-export
atp-policy
atp-search
atp-gateway

# Service names
atp-ingestion
atp-query
atp-integrity
atp-export
atp-policy
atp-search
atp-gateway

# ConfigMap names
atp-ingestion-config
atp-query-config
atp-integrity-config
atp-export-config
atp-policy-config
atp-search-config
atp-gateway-config

# ServiceAccount names
atp-ingestion-sa
atp-query-sa
atp-integrity-sa
atp-export-sa
atp-policy-sa
atp-search-sa
atp-gateway-sa

# Namespace names
atp-production      # All production services
atp-staging         # All staging services
atp-test            # All test services
atp-dev             # All dev services
atp-tenant-acme     # Tenant-specific namespace
atp-tenant-widgets  # Tenant-specific namespace

Label Naming:

# Standard labels (applied to all resources)
labels:
  app: atp-ingestion           # Service name
  component: ingestion         # Component name (matches service name)
  tier: backend                # Service tier (backend, frontend, database)
  version: v1.2.3              # Application version
  environment: production      # Environment (production, staging, test, dev)
  managed-by: fluxcd           # Management tool
  compliance: soc2-gdpr-hipaa  # Compliance requirements

Branching Model

ATP uses a GitOps branching model aligned with environment promotion workflow.

Environment Branches

Branch Structure:

main                    # Production (protected, requires approvals)
├── staging             # Staging environment (protected)
│   ├── test            # Test environment (protected)
│   │   └── dev         # Dev environment (unprotected, fast merge)
│   │       └── feature/*  # Feature branches (unprotected)

Branch Details:

| Branch | Environment | Purpose | Protection Level | Merge Strategy |
|--------|-------------|---------|------------------|----------------|
| main (or production) | Production | Live production environment | 🔒 Highest | Squash merge + approvals |
| staging | Staging | Pre-production testing | 🔒 High | Squash merge + approvals |
| test | Test | Integration testing | 🔒 Medium | Squash merge + approvals |
| dev | Development | Developer testing | 🔓 Low | Fast-forward merge |
| feature/* | N/A | Feature development | 🔓 None | Fast-forward merge |
| hotfix/* | Production | Emergency fixes | 🔒 High | Squash merge + expedited approval |

Branch Protection Rules (Azure DevOps):

# Azure DevOps Branch Policy Configuration
# Repos > Branches > {branch-name} > Branch Policies

main (Production):
  ✓ Require pull request (minimum 2 reviewers)
  ✓ Require approval from: Architect + SRE Lead
  ✓ Require status checks: CI pipeline must pass
  ✓ Require signed commits (GPG)
  ✓ Require merge strategy: Squash merge only
  ✓ Require minimum reviewers: 2 (including required reviewers)
  ✓ Require code review: Yes
  ✓ Build validation: CI pipeline
  ✓ Automatic reviewers: Platform Team (suggested)

staging:
  ✓ Require pull request (minimum 1 reviewer)
  ✓ Require approval from: Architect or SRE
  ✓ Require status checks: CI pipeline must pass
  ✓ Require signed commits (GPG)
  ✓ Require merge strategy: Squash merge only
  ✓ Build validation: CI pipeline

test:
  ✓ Require pull request (minimum 1 reviewer)
  ✓ Require approval from: Any DevOps Engineer
  ✓ Require status checks: CI pipeline must pass
  ✓ Require merge strategy: Squash merge preferred
  ✓ Build validation: CI pipeline

dev:
  ✓ No branch protection (fast development)
  ✓ Allow direct push
  ✓ Allow fast-forward merge

feature/*:
  ✓ No branch protection
  ✓ Allow direct push
  ✓ Allow fast-forward merge

Branch Promotion Workflow:

graph LR
    A[feature/my-feature] -->|PR + merge| B[dev]
    B -->|PR + approval| C[test]
    C -->|PR + approval| D[staging]
    D -->|PR + approval| E[main<br/>production]

    F[hotfix/critical-bug] -.->|expedited PR| E

    style E fill:#ffcccc
    style D fill:#FFE5B4
    style C fill:#FFE5B4
    style B fill:#90EE90
    style A fill:#90EE90

Example: Promoting Change from Dev → Production:

# Step 1: Create feature branch from dev
git checkout dev
git pull origin dev
git checkout -b feature/upgrade-ingestion-v1.2.4
# Make changes to apps/atp-ingestion/overlays/dev/kustomization.yaml
git commit -m "feat(ingestion): upgrade to v1.2.4 in dev"
git push origin feature/upgrade-ingestion-v1.2.4

# Step 2: Merge to dev (fast-forward, no approval needed)
git checkout dev
git merge --ff-only feature/upgrade-ingestion-v1.2.4
git push origin dev
# FluxCD automatically deploys to dev cluster

# Step 3: Create PR dev → test
# Azure DevOps: Create Pull Request from dev to test
# Requires: 1 DevOps Engineer approval
# After merge: FluxCD deploys to test cluster

# Step 4: Create PR test → staging
# Azure DevOps: Create Pull Request from test to staging
# Requires: Architect or SRE approval
# After merge: FluxCD deploys to staging cluster

# Step 5: Create PR staging → main (production)
# Azure DevOps: Create Pull Request from staging to main
# Requires: Architect + SRE Lead approval
# Requires: CI pipeline passing + signed commits
# After merge: FluxCD deploys to production cluster

Approval Requirements Matrix

| Branch | Minimum Reviewers | Required Approvers | Status Checks | GPG Signing | Merge Strategy |
|--------|-------------------|--------------------|---------------|-------------|----------------|
| main (production) | 2 | Architect + SRE Lead | ✅ Required | ✅ Required | Squash only |
| staging | 1 | Architect or SRE | ✅ Required | ✅ Required | Squash only |
| test | 1 | Any DevOps Engineer | ✅ Required | ⚠️ Preferred | Squash preferred |
| dev | 0 | None | ⚠️ Optional | ❌ Not required | Fast-forward |
| feature/* | 0 | None | ❌ Not required | ❌ Not required | Fast-forward |
| hotfix/* | 1 | Architect or SRE Lead | ✅ Required | ✅ Required | Squash only |

Versioning and Tagging

Semantic Versioning (SemVer) Strategy

Format: MAJOR.MINOR.PATCH

  • MAJOR: Breaking changes (API incompatibility, schema changes)
  • MINOR: New features (backward-compatible)
  • PATCH: Bug fixes (backward-compatible)

Example Versions:

v1.2.4    # Minor feature release
v1.2.3    # Patch release (bug fix)
v2.0.0    # Major release (breaking changes)
v1.2.4-hotfix1  # Hotfix release
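
When scripting against these tags, GNU `sort -V` (version sort) orders SemVer values correctly, including the leading `v`:

```shell
# Version-sort a few release tags; the highest version comes out last,
# so `tail -n1` would pick the latest release.
printf 'v1.2.4\nv2.0.0\nv1.2.3\n' | sort -V
```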

Service-Specific Tags

Format: {service-name}/v{VERSION}

# Tag ingestion service v1.2.4
git tag -a atp-ingestion/v1.2.4 -m "ATP Ingestion Service v1.2.4"
git push origin atp-ingestion/v1.2.4

# Tag query service v1.3.0
git tag -a atp-query/v1.3.0 -m "ATP Query Service v1.3.0"
git push origin atp-query/v1.3.0
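
A throwaway demonstration of the tag pattern in a scratch repository (temp directory, placeholder identity; nothing is pushed):

```shell
# Create a scratch repo with a single empty commit.
tmp=$(mktemp -d) && cd "$tmp"
git init -q .
git -c user.email=ci@example.com -c user.name=ci \
    commit -q --allow-empty -m "init"

# Annotated service-specific tag, then list tags matching the pattern.
git -c user.email=ci@example.com -c user.name=ci \
    tag -a atp-ingestion/v1.2.4 -m "ATP Ingestion Service v1.2.4"
git tag -l 'atp-ingestion/v*'
```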

Environment-Wide Release Tags

Format: release/v{VERSION} or release/{ENVIRONMENT}/v{VERSION}

# Production release tag
git tag -a release/v1.2.4 -m "Production Release v1.2.4

Services:
- atp-ingestion: v1.2.4
- atp-query: v1.3.0
- atp-integrity: v1.1.5
- atp-export: v1.0.2
- atp-policy: v1.2.0
- atp-search: v1.1.0
- atp-gateway: v1.4.0

Changelog: https://dev.azure.com/ConnectSoft/ATP/_wiki/wikis/ATP.wiki/12345/Release-Notes-v1.2.4"
git push origin release/v1.2.4

# Staging release tag
git tag -a release/staging/v1.2.4-rc1 -m "Staging Release Candidate v1.2.4-rc1"
git push origin release/staging/v1.2.4-rc1

Hotfix Tagging Conventions

Format: hotfix/v{VERSION}-hotfix{N} or {SERVICE}/v{VERSION}-hotfix{N}

# Service-specific hotfix
git tag -a atp-ingestion/v1.2.4-hotfix1 -m "Hotfix: Memory leak in Redis connection pooling"
git push origin atp-ingestion/v1.2.4-hotfix1

# Environment-wide hotfix
git tag -a hotfix/v1.2.4-hotfix1 -m "Production Hotfix v1.2.4-hotfix1

Critical fixes:
- atp-ingestion: Memory leak fix
- atp-gateway: Rate limiting bug fix"
git push origin hotfix/v1.2.4-hotfix1

Image Tagging in ACR

Strategy: Immutable tags combining version + commit SHA

Format: {VERSION}-{COMMIT-SHA}

# Docker image tags (in Azure Container Registry)
connectsoft.azurecr.io/atp/ingestion:v1.2.4              # Semantic version
connectsoft.azurecr.io/atp/ingestion:v1.2.4-abc123d      # Version + commit SHA (immutable)
connectsoft.azurecr.io/atp/ingestion:latest              # Latest (dev only, mutable)
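As a sketch of how the immutable tag is typically composed at build time (the helper and the hard-coded repository path are illustrative; in a pipeline the SHA would come from `git rev-parse --short HEAD`):

```shell
#!/bin/sh
# Hedged sketch: compose the immutable {VERSION}-{COMMIT-SHA} image reference.
acr_image_ref() {
  version="$1"  # e.g. v1.2.4
  sha="$2"      # short commit SHA, e.g. abc123d
  printf 'connectsoft.azurecr.io/atp/ingestion:%s-%s\n' "$version" "$sha"
}

acr_image_ref v1.2.4 abc123d
# → connectsoft.azurecr.io/atp/ingestion:v1.2.4-abc123d
```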

ACR Tagging Rules:

| Tag Pattern | Mutable? | Use Case | Example |
|---|---|---|---|
| v{VERSION} | ❌ Immutable | Production releases | v1.2.4 |
| v{VERSION}-{SHA} | ❌ Immutable | Production releases (traceable) | v1.2.4-abc123d |
| latest | ✅ Mutable | Development only | latest |
| {BRANCH} | ✅ Mutable | Feature branches | feature/grpc-ingestion |

Access Control and RBAC

Azure DevOps Repository Permissions

Permission Levels:

| Permission | Description | Typical Roles |
|---|---|---|
| Reader | Can view repository | Compliance Officers, Auditors |
| Contributor | Can create branches, submit PRs | Developers, DevOps Engineers |
| Contribute | Can push to unprotected branches | Developers (dev branch) |
| Contribute + Pull Request | Can create PRs to protected branches | Developers, DevOps Engineers |
| Admin | Full control (manage permissions, delete repo) | Platform Team Leads |

Permission Matrix:

| Role | Repository Access | Branch Access | Approval Authority |
|---|---|---|---|
| Developer | ✅ Contributor | ✅ Create PRs (dev, test) | ❌ None |
| DevOps Engineer | ✅ Contributor | ✅ Approve PRs (dev, test) | ✅ Dev/Test deployments |
| Architect | ✅ Contributor | ✅ Approve PRs (staging, production) | ✅ Staging/Prod deployments |
| SRE Engineer | ✅ Contributor | ✅ Approve PRs (production) | ✅ Production deployments |
| Security Officer | ✅ Reader | ✅ Read-only (audit) | ❌ None |
| Compliance Officer | ✅ Reader | ✅ Read-only (audit) | ❌ None |
| Platform Team | ✅ Admin | ✅ Full access (all branches) | ✅ All deployments |

Azure AD Group Mappings

Group Structure:

Azure AD Groups:
├── ATP-Platform-Team                    # Platform Team (Admin access)
├── ATP-Developers                       # All developers (Contributor)
├── ATP-DevOps-Engineers                 # DevOps Engineers (Contributor, approve dev/test)
├── ATP-Architects                       # Architects (Contributor, approve staging/prod)
├── ATP-SRE-Engineers                    # SRE Engineers (Contributor, approve production)
├── ATP-Security-Team                    # Security Team (Reader, audit access)
└── ATP-Compliance-Team                  # Compliance Team (Reader, audit access)

Azure DevOps Group Configuration:

# Azure DevOps Project Settings > Permissions > Groups
Groups:
  - name: ATP-Platform-Team
    permissions:
      - Repos: Admin
      - Build: Admin
      - Release: Admin

  - name: ATP-Developers
    permissions:
      - Repos: Contributor
      - Build: User
      - Release: User

  - name: ATP-SRE-Engineers
    permissions:
      - Repos: Contributor
      - Build: User
      - Release: Admin

Principle of Least Privilege Enforcement

Enforcement Mechanisms:

  1. Branch Protection Policies: Prevent direct pushes to protected branches
  2. Required Approvals: Multiple reviewers for production
  3. GPG Signing: All production commits must be signed
  4. Status Checks: CI pipeline must pass before merge
  5. Audit Logging: All access logged in Azure AD audit logs

Access Review Process:

  • Frequency: Quarterly access reviews
  • Owner: Platform Team Lead
  • Review Scope: Repository permissions, branch policies, approval authorities
  • Compliance: SOC 2 CC6.1 (Access Control)

Summary

  • Repository Strategy: Hybrid monorepo (GitOps manifests) + polyrepo (service source code)
  • Directory Structure: 7 main directories (/clusters, /infrastructure, /apps, /platform, /tenants, /monitoring, /docs)
  • Naming Conventions: Kebab-case for directories/files, consistent patterns for Kubernetes resources
  • Branching Model: Environment-based branches (main → staging → test → dev → feature/*) with promotion workflow
  • Versioning: SemVer for services, environment-wide release tags, hotfix conventions
  • Access Control: Azure AD group mappings, branch protection policies, least privilege enforcement

FluxCD Installation & Configuration on AKS

Purpose: Provide a complete guide for installing, configuring, and managing FluxCD on Azure Kubernetes Service (AKS) for the ATP GitOps implementation, including multi-cluster setup, Azure integration, and operational best practices.


FluxCD Architecture Overview

Definition: FluxCD is a GitOps operator for Kubernetes that automatically keeps clusters in sync with Git repositories. It consists of modular controllers that work together to provide continuous reconciliation.

FluxCD Components

Core Controllers:

| Component | Purpose | Namespace | Responsibilities |
|---|---|---|---|
| source-controller | Fetches sources (Git, Helm, OCI) | flux-system | Clones Git repos, fetches Helm charts, caches artifacts |
| kustomize-controller | Applies Kustomize manifests | flux-system | Renders Kustomize, applies to cluster, monitors drift |
| helm-controller | Manages Helm releases | flux-system | Installs/upgrades Helm charts, manages dependencies |
| notification-controller | Sends alerts/notifications | flux-system | Slack, Teams, Discord, webhook notifications |
| image-reflector-controller | Scans image repositories | flux-system | Discovers new image tags, updates image policies |
| image-automation-controller | Updates Git automatically | flux-system | Commits image tag updates back to Git |

Component Architecture:

graph TD
    A[Git Repository<br/>Azure Repos] -->|git pull| B[Source Controller]
    B -->|artifact cache| C[GitRepository<br/>CRD]

    C -->|notify| D[Kustomize Controller]
    C -->|notify| E[Helm Controller]

    D -->|render + apply| F[Kubernetes API<br/>AKS Cluster]
    E -->|install/upgrade| F

    F -.->|watch| D
    F -.->|watch| E
    D -.->|reconcile| F
    E -.->|reconcile| F

    G[Container Registry<br/>ACR] -->|scan tags| H[Image Reflector<br/>Controller]
    H -->|update| I[Image Policy<br/>CRD]
    I -->|trigger| J[Image Automation<br/>Controller]
    J -->|commit| A

    D -->|events| K[Notification<br/>Controller]
    E -->|events| K
    K -->|alerts| L[Slack / Teams /<br/>Azure Monitor]

    style B fill:#90EE90
    style D fill:#90EE90
    style E fill:#90EE90
    style H fill:#90EE90
    style J fill:#90EE90
    style K fill:#FFE5B4

How FluxCD Works: Reconciliation Loop

Reconciliation Process:

  1. Source Fetch (Source Controller):

    • Polls Git repository at configured interval (e.g., every 1 minute)
    • Clones repository and creates artifact (tar.gz)
    • Stores artifact in cluster-local cache
    • Updates GitRepository CRD status
  2. Manifest Rendering (Kustomize/Helm Controller):

    • Reads artifact from Source Controller
    • Renders Kustomize overlays or Helm templates
    • Generates final Kubernetes manifests
  3. State Comparison (Kustomize/Helm Controller):

    • Compares desired state (from Git) with actual state (in cluster)
    • Detects differences (drift detection)
    • Calculates required changes
  4. Apply Changes (Kustomize/Helm Controller):

    • Applies changes via Kubernetes API (kubectl apply)
    • Waits for resources to become ready
    • Updates Kustomization/HelmRelease CRD status
  5. Health Monitoring (Kustomize/Helm Controller):

    • Monitors resource health (Deployment, StatefulSet, etc.)
    • Reports health status in CRD status
    • Triggers notifications on failure
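The failure notifications in step 5 are configured through the notification-controller. A minimal Provider/Alert pair might look like the following (the Teams provider type, secret name, and file path are illustrative assumptions, not ATP's actual configuration):

```yaml
# clusters/production/notifications.yaml (illustrative)
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Provider
metadata:
  name: msteams
  namespace: flux-system
spec:
  type: msteams
  secretRef:
    name: msteams-webhook-url  # secret with an 'address' key; name is illustrative
---
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Alert
metadata:
  name: atp-reconcile-failures
  namespace: flux-system
spec:
  providerRef:
    name: msteams
  eventSeverity: error
  eventSources:
    - kind: Kustomization
      name: '*'
    - kind: HelmRelease
      name: '*'
```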

Reconciliation Loop Diagram:

sequenceDiagram
    participant Git as Git Repository
    participant SC as Source Controller
    participant KC as Kustomize Controller
    participant K8s as Kubernetes API

    loop Every 1 minute (GitRepository interval)
        SC->>Git: git pull
        Git-->>SC: repository contents
        SC->>SC: create artifact (tar.gz)
        SC->>SC: update GitRepository.status
    end

    loop Every 5 minutes (Kustomization interval)
        KC->>SC: fetch artifact
        SC-->>KC: artifact.tar.gz
        KC->>KC: render Kustomize
        KC->>K8s: get current state
        K8s-->>KC: current resources
        KC->>KC: compare desired vs actual

        alt Drift detected
            KC->>K8s: kubectl apply (correct drift)
            K8s-->>KC: resources updated
        end

        KC->>KC: update Kustomization.status
        KC->>KC: check health
    end

FluxCD vs ArgoCD Comparison

Feature Comparison:

| Feature | FluxCD | ArgoCD | ATP Choice |
|---|---|---|---|
| Installation | Lightweight, modular | Single deployment, heavier | FluxCD (simpler) |
| UI | Flux Dashboard (optional) | Rich web UI (default) | ⚠️ ArgoCD (better UX, but ATP uses CLI) |
| GitOps Toolkit | Modular (use only needed controllers) | Monolithic | FluxCD (flexibility) |
| Helm Support | Full support | Full support | ✅ Both |
| Kustomize Support | Native (built-in) | Native (built-in) | ✅ Both |
| Multi-Cluster | Strong (remote kubeConfig) | Strong (ApplicationSets) | ✅ Both |
| Azure Integration | Native (Workload Identity) | Requires setup | FluxCD (better Azure-native support) |
| Learning Curve | Moderate | Steeper (more features) | FluxCD (simpler) |
| CNCF Status | Graduated | Graduated | ✅ Both |
| Community | Active (CNCF) | Very active (CNCF) | ✅ Both |
| Performance | Fast (lightweight) | Good (heavier) | FluxCD (lower overhead) |
| Security | Strong (least privilege) | Strong | ✅ Both |
| Drift Detection | Excellent | Excellent | ✅ Both |

ATP Selection Rationale: ✅ FluxCD Chosen

Reasons:

  1. Azure Native Integration: Better support for Azure AD Workload Identity (zero-trust authentication)
  2. Modular Architecture: Use only needed controllers (source + kustomize), reduce cluster overhead
  3. Simpler Operation: Less complexity, easier troubleshooting
  4. Performance: Lower resource footprint (important for multi-cluster setup)
  5. CLI-First Approach: ATP team prefers CLI/Git workflow over web UI

AKS Cluster Prerequisites

Cluster Requirements

Minimum Requirements:

| Component | Requirement | Rationale |
|---|---|---|
| Kubernetes Version | 1.28+ (1.30+ recommended) | FluxCD v2 requires Kubernetes 1.25+; newer versions provide better API support |
| Cluster SKU | Standard tier (not Free tier) | Standard tier provides a financially backed uptime SLA and a production-scale control plane |
| Node Pool | 2+ nodes, 4+ vCPUs total | FluxCD controllers need resources; redundancy for HA |
| Network Plugin | Azure CNI (recommended) or kubenet | Azure CNI provides better networking for multi-tenant workloads |
| RBAC | Enabled (default) | Required for FluxCD controllers to manage cluster resources |
| Pod Security Standards | Enabled (default in 1.23+) | Required for compliance (SOC 2, GDPR, HIPAA) |

Recommended Configuration:

# Create AKS cluster with recommended settings
az aks create \
  --resource-group ATP-Production-EUS-RG \
  --name atp-prod-eus-aks \
  --kubernetes-version 1.30.0 \
  --node-count 3 \
  --node-vm-size Standard_D4s_v3 \
  --enable-cluster-autoscaler \
  --min-count 3 \
  --max-count 10 \
  --network-plugin azure \
  --network-policy azure \
  --enable-oidc-issuer \
  --enable-workload-identity \
  --enable-managed-identity \
  --enable-addons monitoring,azure-policy \
  --workspace-resource-id /subscriptions/.../resourceGroups/.../providers/Microsoft.OperationalInsights/workspaces/atp-prod-eus-logs \
  --tags environment=production compliance=soc2-gdpr-hipaa

Node Pool Configuration

System Node Pool (for FluxCD and system workloads):

# System node pool (dedicated for system pods)
az aks nodepool add \
  --resource-group ATP-Production-EUS-RG \
  --cluster-name atp-prod-eus-aks \
  --name systempool \
  --node-count 2 \
  --node-vm-size Standard_D4s_v3 \
  --mode System \
  --labels workload=system tier=platform \
  --node-taints CriticalAddonsOnly=true:NoSchedule \
  --enable-cluster-autoscaler \
  --min-count 2 \
  --max-count 4

User Node Pool (for application workloads):

# User node pool (for ATP microservices)
az aks nodepool add \
  --resource-group ATP-Production-EUS-RG \
  --cluster-name atp-prod-eus-aks \
  --name userpool \
  --node-count 3 \
  --node-vm-size Standard_D8s_v3 \
  --mode User \
  --labels workload=application tier=backend \
  --enable-cluster-autoscaler \
  --min-count 3 \
  --max-count 20

Azure Integration Setup

Required Azure Resources:

  1. Azure Container Registry (ACR): For container images
  2. Azure Key Vault: For secrets management
  3. Azure Monitor / Log Analytics: For observability
  4. Azure AD Application: For Workload Identity (optional but recommended)

Prerequisites Checklist:

#!/bin/bash
# prerequisites-check.sh — Verify all prerequisites before FluxCD installation

set -euo pipefail

echo "🔍 Checking AKS cluster prerequisites..."

# Check Kubernetes version
KUBERNETES_VERSION=$(az aks show \
  --resource-group ATP-Production-EUS-RG \
  --name atp-prod-eus-aks \
  --query kubernetesVersion -o tsv)

if [[ $(echo "$KUBERNETES_VERSION 1.28.0" | tr " " "\n" | sort -V | head -n 1) != "1.28.0" ]]; then
  echo "❌ Kubernetes version $KUBERNETES_VERSION < 1.28.0 (minimum required)"
  exit 1
else
  echo "✅ Kubernetes version: $KUBERNETES_VERSION"
fi

# Check OIDC issuer enabled
OIDC_ISSUER=$(az aks show \
  --resource-group ATP-Production-EUS-RG \
  --name atp-prod-eus-aks \
  --query "oidcIssuerProfile.enabled" -o tsv)

if [[ "$OIDC_ISSUER" != "true" ]]; then
  echo "❌ OIDC issuer not enabled (required for Workload Identity)"
  exit 1
else
  echo "✅ OIDC issuer enabled"
fi

# Check Workload Identity enabled
WORKLOAD_IDENTITY=$(az aks show \
  --resource-group ATP-Production-EUS-RG \
  --name atp-prod-eus-aks \
  --query "securityProfile.workloadIdentity.enabled" -o tsv)

if [[ "$WORKLOAD_IDENTITY" != "true" ]]; then
  echo "❌ Workload Identity not enabled (required for Azure AD integration)"
  exit 1
else
  echo "✅ Workload Identity enabled"
fi

# Check kubectl access
if ! kubectl cluster-info > /dev/null 2>&1; then
  echo "❌ kubectl not configured or cluster not accessible"
  exit 1
else
  echo "✅ kubectl configured and cluster accessible"
fi

# Check RBAC enabled
RBAC=$(az aks show \
  --resource-group ATP-Production-EUS-RG \
  --name atp-prod-eus-aks \
  --query "enableRbac" -o tsv)

if [[ "$RBAC" != "true" ]]; then
  echo "❌ RBAC not enabled (required for FluxCD)"
  exit 1
else
  echo "✅ RBAC enabled"
fi

# Check node count
NODE_COUNT=$(az aks nodepool show \
  --resource-group ATP-Production-EUS-RG \
  --cluster-name atp-prod-eus-aks \
  --name systempool \
  --query count -o tsv)

if [[ "$NODE_COUNT" -lt 2 ]]; then
  echo "❌ Node count $NODE_COUNT < 2 (minimum 2 nodes recommended)"
  exit 1
else
  echo "✅ Node count: $NODE_COUNT"
fi

echo "✅ All prerequisites met!"

FluxCD Installation

Installation via Azure CLI (AKS Flux Extension)

Prerequisites: Azure CLI with the k8s-extension extension installed

# Install k8s-extension if not already installed
az extension add --name k8s-extension

# Install the Flux cluster extension (controllers only)
az k8s-extension create \
  --resource-group ATP-Production-EUS-RG \
  --cluster-name atp-prod-eus-aks \
  --cluster-type managedClusters \
  --extension-type microsoft.flux \
  --name flux \
  --auto-upgrade-minor-version true

# Point the extension at the GitOps repository
# (requires the k8s-configuration CLI extension: az extension add --name k8s-configuration)
az k8s-configuration flux create \
  --resource-group ATP-Production-EUS-RG \
  --cluster-name atp-prod-eus-aks \
  --cluster-type managedClusters \
  --name atp-gitops \
  --namespace flux-system \
  --scope cluster \
  --url ssh://git@ssh.dev.azure.com/v3/ConnectSoft/ATP/atp-gitops \
  --branch production \
  --ssh-private-key-file ~/.ssh/azure-devops-flux \
  --kustomization name=flux-system path=./clusters/production prune=true

# Verify installation
az k8s-extension show \
  --resource-group ATP-Production-EUS-RG \
  --cluster-name atp-prod-eus-aks \
  --cluster-type managedClusters \
  --name flux

Installation via kubectl (Flux CLI)

Prerequisites: Flux CLI installed

# Install Flux CLI
curl -s https://fluxcd.io/install.sh | sudo bash

# Verify installation
flux --version

# Install FluxCD components
flux install \
  --namespace=flux-system \
  --components=source-controller,kustomize-controller,helm-controller,notification-controller \
  --export > flux-install.yaml

# Apply to cluster
kubectl apply -f flux-install.yaml

# Wait for FluxCD to be ready
kubectl wait --for=condition=ready pod \
  --all \
  --namespace flux-system \
  --timeout=300s

Installation via Helm

Using Flux Helm Chart:

# Add Flux Helm repository
helm repo add fluxcd https://fluxcd-community.github.io/helm-charts
helm repo update

# Install FluxCD via Helm
helm install flux fluxcd/flux2 \
  --namespace flux-system \
  --create-namespace \
  --set components.source-controller.enabled=true \
  --set components.kustomize-controller.enabled=true \
  --set components.helm-controller.enabled=true \
  --set components.notification-controller.enabled=true \
  --set components.image-reflector-controller.enabled=false \
  --set components.image-automation-controller.enabled=false

# Verify installation
kubectl get pods -n flux-system

Bootstrap FluxCD on AKS

Bootstrap with Azure Repos SSH:

# Generate SSH key for FluxCD (if not exists)
ssh-keygen -t rsa -b 4096 -f ~/.ssh/azure-devops-flux -N ""

# Add public key to Azure DevOps (manual step)
# Azure DevOps > User Settings > SSH Public Keys > Add Key
cat ~/.ssh/azure-devops-flux.pub

# Bootstrap FluxCD pointing to GitOps repository
flux bootstrap git \
  --url=ssh://git@ssh.dev.azure.com/v3/ConnectSoft/ATP/atp-gitops \
  --branch=production \
  --path=clusters/production \
  --private-key-file="$HOME/.ssh/azure-devops-flux" \
  --author-name="Platform Team" \
  --author-email="platform-team@connectsoft.example" \
  --components-extra=image-reflector-controller,image-automation-controller

Bootstrap Output:

► connecting to ssh://git@ssh.dev.azure.com/v3/ConnectSoft/ATP/atp-gitops
► cloning branch "production" from Git repository
► cloned branch "production" from Git repository
✔ components are healthy
✔ git repository "ssh://git@ssh.dev.azure.com/v3/ConnectSoft/ATP/atp-gitops" is ready
► generating sync manifests
✔ sync manifests pushed successfully
► applying sync manifests
✔ sync components are ready
✔ kubectl -n flux-system get gitrepository flux-system
✔ kubectl -n flux-system get kustomization flux-system

Verify Installation

Check FluxCD Components:

# Check all FluxCD pods are running
kubectl get pods -n flux-system

# Expected output:
# NAME                                      READY   STATUS    RESTARTS   AGE
# helm-controller-7d5c8b9f6d-abc12          1/1     Running   0          5m
# kustomize-controller-7d5c8b9f6d-def45     1/1     Running   0          5m
# notification-controller-7d5c8b9f6d-ghi78  1/1     Running   0          5m
# source-controller-7d5c8b9f6d-jkl90        1/1     Running   0          5m

# Check FluxCD CRDs are installed
kubectl get crds | grep fluxcd

# Expected output:
# gitrepositories.source.toolkit.fluxcd.io
# kustomizations.kustomize.toolkit.fluxcd.io
# helmreleases.helm.toolkit.fluxcd.io
# alerts.notification.toolkit.fluxcd.io
# receivers.notification.toolkit.fluxcd.io

# Verify FluxCD CLI can connect
flux check

# Expected output:
# ✔ flux 2.3.0
# ✔ Kubernetes 1.30.0 >= 1.25.0
# ✔ prerequisites are satisfied
# ✔ controllers are healthy

Azure Repos Integration

SSH Key Setup for Git Access

Generate SSH Key:

# Generate SSH key pair for FluxCD
ssh-keygen -t rsa -b 4096 -f ~/.ssh/azure-devops-flux -N "" -C "fluxcd@atp-production"

# Output files:
# ~/.ssh/azure-devops-flux      (private key)
# ~/.ssh/azure-devops-flux.pub  (public key)

Add Public Key to Azure DevOps:

# Display public key (copy this)
cat ~/.ssh/azure-devops-flux.pub

# Manual steps in Azure DevOps Portal:
# 1. Navigate to User Settings > SSH Public Keys
# 2. Click "New Key"
# 3. Paste public key
# 4. Add description: "FluxCD Production AKS Cluster"
# 5. Save

Create Kubernetes Secret:

# Create SSH key secret in flux-system namespace
kubectl create namespace flux-system --dry-run=client -o yaml | kubectl apply -f -

kubectl create secret generic azure-devops-ssh-key \
  --namespace=flux-system \
  --from-file=identity="$HOME/.ssh/azure-devops-flux" \
  --from-literal=known_hosts="$(ssh-keyscan ssh.dev.azure.com 2>/dev/null | grep ssh-rsa)"

# Verify secret created
kubectl get secret azure-devops-ssh-key -n flux-system

Azure DevOps Personal Access Token (PAT)

Alternative to SSH Key:

# Create PAT in Azure DevOps Portal:
# 1. User Settings > Personal Access Tokens > New Token
# 2. Name: "FluxCD Production AKS"
# 3. Organization: All accessible organizations
# 4. Scopes: Code (Read)
# 5. Copy token

# Create PAT secret
kubectl create secret generic azure-devops-pat \
  --namespace=flux-system \
  --from-literal=username=git \
  --from-literal=password=<AZURE_DEVOPS_PAT>

# Use PAT in GitRepository (HTTPS URL)

Azure AD Authentication (Workload Identity)

Recommended Approach (Zero-Trust):

# Create Azure AD Application for FluxCD
az ad app create --display-name "FluxCD-ATP-Production"

# Get application details
APP_ID=$(az ad app list \
  --display-name "FluxCD-ATP-Production" \
  --query "[0].appId" -o tsv)

# Create service principal
az ad sp create --id $APP_ID

# Get AKS OIDC issuer URL
OIDC_ISSUER=$(az aks show \
  --resource-group ATP-Production-EUS-RG \
  --name atp-prod-eus-aks \
  --query "oidcIssuerProfile.issuerUrl" -o tsv)

# Create federated credential for Workload Identity
az ad app federated-credential create \
  --id $APP_ID \
  --parameters '{
    "name": "fluxcd-atp-production",
    "issuer": "'$OIDC_ISSUER'",
    "subject": "system:serviceaccount:flux-system:source-controller",
    "audiences": ["api://AzureADTokenExchange"]
  }'

# Grant Azure DevOps access to application
# Azure DevOps > Project Settings > Service Connections > New Service Connection
# Type: Generic
# Authentication: Workload Identity federation

GitRepository with Workload Identity:

apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: atp-gitops
  namespace: flux-system
spec:
  url: https://dev.azure.com/ConnectSoft/ATP/_git/atp-gitops
  interval: 1m
  ref:
    branch: production
  provider: azure  # Flux v2.4+: authenticate to Azure DevOps via Workload Identity (no secret needed)

Bootstrap Configuration

Bootstrap FluxCD to Point to atp-gitops Repo

Complete Bootstrap Script:

#!/bin/bash
# bootstrap-fluxcd-production.sh — Bootstrap FluxCD on production AKS cluster

set -euo pipefail

RESOURCE_GROUP="ATP-Production-EUS-RG"
CLUSTER_NAME="atp-prod-eus-aks"
GIT_REPO_URL="ssh://git@ssh.dev.azure.com/v3/ConnectSoft/ATP/atp-gitops"
GIT_BRANCH="production"
GIT_PATH="clusters/production"
SSH_KEY_FILE="$HOME/.ssh/azure-devops-flux"  # use $HOME: a quoted ~ would not expand

echo "🚀 Bootstrapping FluxCD on production AKS cluster..."

# Get AKS credentials
az aks get-credentials \
  --resource-group $RESOURCE_GROUP \
  --name $CLUSTER_NAME \
  --overwrite-existing

# Verify cluster access
kubectl cluster-info

# Bootstrap FluxCD
flux bootstrap git \
  --url=$GIT_REPO_URL \
  --branch=$GIT_BRANCH \
  --path=$GIT_PATH \
  --private-key-file=$SSH_KEY_FILE \
  --author-name="Platform Team" \
  --author-email="platform-team@connectsoft.example" \
  --components=source-controller,kustomize-controller,helm-controller,notification-controller

echo "✅ FluxCD bootstrap complete!"

# Verify installation
echo "📋 Verifying FluxCD installation..."
flux check

# Check GitRepository
echo "📋 Checking GitRepository..."
kubectl get gitrepository flux-system -n flux-system

# Check Kustomization
echo "📋 Checking Kustomization..."
kubectl get kustomization flux-system -n flux-system

Configure GitRepository Resource

GitRepository Configuration:

# clusters/production/gitrepository.yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: atp-gitops
  namespace: flux-system
spec:
  interval: 1m  # Poll Git every 1 minute
  url: ssh://git@ssh.dev.azure.com/v3/ConnectSoft/ATP/atp-gitops
  ref:
    branch: production  # Git branch to watch
  secretRef:
    name: azure-devops-ssh-key  # SSH key secret
  ignore: |
    /*.md
    !README.md
    /docs/
  timeout: 60s  # Git clone timeout
  suspend: false  # Set to true to pause reconciliation

Configure Root Kustomization

Root Kustomization (Points to Infrastructure and Apps):

# clusters/production/kustomization.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: flux-system
  namespace: flux-system
spec:
  interval: 5m  # Reconcile every 5 minutes
  path: ./  # Root path in Git repository
  prune: true  # Delete resources removed from Git
  sourceRef:
    kind: GitRepository
    name: atp-gitops
    namespace: flux-system
  timeout: 10m  # Reconciliation timeout
  retryInterval: 2m  # Retry on failure
  healthChecks:
    - apiVersion: apps/v1
      kind: Deployment
      name: source-controller
      namespace: flux-system
    - apiVersion: apps/v1
      kind: Deployment
      name: kustomize-controller
      namespace: flux-system
  suspend: false

Child Kustomizations (Per Application):

# clusters/production/kustomizations/infrastructure.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infrastructure
  namespace: flux-system
spec:
  interval: 5m
  path: ./infrastructure/overlays/production
  prune: true
  sourceRef:
    kind: GitRepository
    name: atp-gitops
  dependsOn:
    - name: flux-system  # Wait for root Kustomization
  suspend: false

---
# clusters/production/kustomizations/apps.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 5m
  path: ./apps
  prune: true
  sourceRef:
    kind: GitRepository
    name: atp-gitops
  dependsOn:
    - name: infrastructure  # Wait for infrastructure first
  suspend: false

Namespace and RBAC Setup

Namespace Creation (via GitOps):

# infrastructure/base/namespaces.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: flux-system
  labels:
    name: flux-system
    managed-by: fluxcd
---
apiVersion: v1
kind: Namespace
metadata:
  name: atp-production
  labels:
    name: atp-production
    environment: production
    managed-by: fluxcd

RBAC for FluxCD (Bootstrap creates automatically):

# FluxCD bootstrap automatically creates these:
# - ServiceAccount: kustomize-controller (flux-system namespace)
# - ClusterRole: cluster-admin (full cluster access)
# - ClusterRoleBinding: kustomize-controller (binds ServiceAccount to ClusterRole)
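The bootstrap defaults above grant cluster-admin. Where tenant workloads need tighter scoping, a Kustomization can impersonate a namespace-scoped ServiceAccount via spec.serviceAccountName; the names and file path below are illustrative, not ATP's actual manifests:

```yaml
# tenants/atp-production/sync.yaml (illustrative)
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: atp-tenant-apps
  namespace: atp-production
spec:
  interval: 5m
  path: ./tenants/atp-production
  prune: true
  serviceAccountName: atp-tenant-reconciler  # namespace-scoped SA bound via Role/RoleBinding
  sourceRef:
    kind: GitRepository
    name: atp-gitops
    namespace: flux-system
```

With this in place, a bad tenant manifest can only affect resources the impersonated ServiceAccount is allowed to touch.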

Multi-Cluster Setup

FluxCD Architecture for Dev, Test, Staging, Production

Multi-Cluster Topology:

graph TD
    subgraph "Azure DevOps"
        A[atp-gitops<br/>Repository]
        A1[dev branch]
        A2[test branch]
        A3[staging branch]
        A4[production branch]
    end

    subgraph "Dev Environment"
        B1[AKS Dev Cluster]
        B2[FluxCD<br/>flux-system]
        B2 -->|git pull| A1
    end

    subgraph "Test Environment"
        C1[AKS Test Cluster]
        C2[FluxCD<br/>flux-system]
        C2 -->|git pull| A2
    end

    subgraph "Staging Environment"
        D1[AKS Staging Cluster]
        D2[FluxCD<br/>flux-system]
        D2 -->|git pull| A3
    end

    subgraph "Production Environment"
        E1[AKS Production EUS]
        E2[FluxCD<br/>flux-system]
        E2 -->|git pull| A4
        E3[AKS Production WUS]
        E4[FluxCD<br/>flux-system]
        E4 -->|git pull| A4
    end

    A --> A1
    A --> A2
    A --> A3
    A --> A4

    style E1 fill:#ffcccc
    style E3 fill:#ffcccc

Cluster Configuration Matrix:

| Environment | Cluster Name | Git Branch | FluxCD Namespace | Reconcile Interval |
|---|---|---|---|---|
| Dev | atp-dev-eus-aks | dev | flux-system | 1 minute |
| Test | atp-test-eus-aks | test | flux-system | 2 minutes |
| Staging | atp-staging-eus-aks | staging | flux-system | 5 minutes |
| Production EUS | atp-prod-eus-aks | production | flux-system | 5 minutes |
| Production WUS | atp-prod-wus-aks | production | flux-system | 5 minutes |
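The matrix maps one-to-one onto bootstrap invocations. The dry-run sketch below only prints abbreviated commands (resource groups, SSH key, and repository URL omitted); it is illustrative, not part of the documented tooling:

```shell
#!/bin/sh
# Dry-run sketch: print one abbreviated bootstrap command per environment
# from the matrix above (nothing is executed against Azure).
render_bootstrap() {
  cluster="$1"; branch="$2"; path="$3"
  printf 'az aks get-credentials --name %s && flux bootstrap git --branch=%s --path=%s\n' \
    "$cluster" "$branch" "$path"
}

render_bootstrap atp-dev-eus-aks     dev        clusters/dev
render_bootstrap atp-test-eus-aks    test       clusters/test
render_bootstrap atp-staging-eus-aks staging    clusters/staging
render_bootstrap atp-prod-eus-aks    production clusters/production
render_bootstrap atp-prod-wus-aks    production clusters/production
```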

Cluster-Specific Configurations

Per-Cluster GitRepository:

# clusters/production/gitrepository.yaml (Production EUS)
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: atp-gitops
  namespace: flux-system
spec:
  url: ssh://git@ssh.dev.azure.com/v3/ConnectSoft/ATP/atp-gitops
  ref:
    branch: production
  secretRef:
    name: azure-devops-ssh-key
# clusters/dev/gitrepository.yaml (Dev)
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: atp-gitops
  namespace: flux-system
spec:
  url: ssh://git@ssh.dev.azure.com/v3/ConnectSoft/ATP/atp-gitops
  ref:
    branch: dev  # Dev branch
  secretRef:
    name: azure-devops-ssh-key
  interval: 30s  # Faster polling for dev

Cross-Cluster Orchestration

Hub-and-Spoke Management (Optional, for large-scale):

# FluxCD has no built-in fleet product. For large-scale setups, a common
# pattern is a hub (management) cluster whose Flux instance reconciles
# workloads onto spoke clusters via the Kustomization spec.kubeConfig
# field, or Azure Arc-enabled GitOps across many clusters.
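For a central cluster reconciling a remote cluster, the Kustomization kubeConfig field points at a Secret holding the target cluster's kubeconfig; the names and file path below are illustrative:

```yaml
# clusters/management/apps-remote-wus.yaml (illustrative)
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-remote-wus
  namespace: flux-system
spec:
  interval: 5m
  path: ./apps/overlays/production-wus
  prune: true
  sourceRef:
    kind: GitRepository
    name: atp-gitops
  kubeConfig:
    secretRef:
      name: prod-wus-kubeconfig  # secret with a 'value' key containing the kubeconfig
```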

Regional Deployment Pattern:

# clusters/production-eus/kustomization.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-eus
  namespace: flux-system
spec:
  path: ./apps/overlays/production-eus  # EUS-specific overlay
  sourceRef:
    kind: GitRepository
    name: atp-gitops

---
# clusters/production-wus/kustomization.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-wus
  namespace: flux-system
spec:
  path: ./apps/overlays/production-wus  # WUS-specific overlay
  sourceRef:
    kind: GitRepository
    name: atp-gitops

Workload Identity Configuration

Azure AD Workload Identity for FluxCD

Create Azure AD Application:

# Create Azure AD application
az ad app create --display-name "FluxCD-ATP-Production"

APP_ID=$(az ad app list \
  --display-name "FluxCD-ATP-Production" \
  --query "[0].appId" -o tsv)

echo "Application ID: $APP_ID"

# Create service principal
SP_ID=$(az ad sp create --id $APP_ID --query id -o tsv)

# Grant permissions (example: Azure DevOps repository read access)
# --id is the Git Repositories security namespace id, --subject is the
# identity's descriptor or UPN, --token identifies the repository scope
az devops security permission update \
  --id 2e9eb7ed-3c0a-47d4-87c1-0ffdd275fd87 \
  --subject "$SP_ID" \
  --token "repoV2/<project-id>/<repo-id>" \
  --allow-bit 2 \
  --deny-bit 0

Service Principal Setup

Configure Service Principal Permissions:

# Grant Azure DevOps repository read permission
# (--id is the Git Repositories security namespace; --token scopes the grant
#  to a repository, in the form repoV2/<project-id>/<repo-id>)
az devops security permission update \
  --id 2e9eb7ed-3c0a-47d4-87c1-0ffdd275fd87 \
  --subject "$SP_ID" \
  --token "repoV2/<project-id>/<repo-id>" \
  --allow-bit 2 \
  --deny-bit 0

# Grant Azure Container Registry pull permission
az role assignment create \
  --assignee $APP_ID \
  --role AcrPull \
  --scope /subscriptions/.../resourceGroups/.../providers/Microsoft.ContainerRegistry/registries/atp-prod-acr

Federated Credentials Configuration

Configure Federated Credential:

# Get AKS OIDC issuer URL
OIDC_ISSUER=$(az aks show \
  --resource-group ATP-Production-EUS-RG \
  --name atp-prod-eus-aks \
  --query "oidcIssuerProfile.issuerUrl" -o tsv)

# Create federated credential for Source Controller
az ad app federated-credential create \
  --id $APP_ID \
  --parameters '{
    "name": "fluxcd-source-controller",
    "issuer": "'$OIDC_ISSUER'",
    "subject": "system:serviceaccount:flux-system:source-controller",
    "audiences": ["api://AzureADTokenExchange"]
  }'

# Create federated credential for Kustomize Controller
az ad app federated-credential create \
  --id $APP_ID \
  --parameters '{
    "name": "fluxcd-kustomize-controller",
    "issuer": "'$OIDC_ISSUER'",
    "subject": "system:serviceaccount:flux-system:kustomize-controller",
    "audiences": ["api://AzureADTokenExchange"]
  }'

ServiceAccount Configuration

Annotate ServiceAccounts:

# FluxCD ServiceAccount with Workload Identity
apiVersion: v1
kind: ServiceAccount
metadata:
  name: source-controller
  namespace: flux-system
  annotations:
    azure.workload.identity/client-id: "12345678-1234-1234-1234-123456789abc"
    azure.workload.identity/tenant-id: "87654321-4321-4321-4321-987654321abc"

RBAC for Azure Resources:

# Grant Key Vault Secrets User role
az role assignment create \
  --assignee $APP_ID \
  --role "Key Vault Secrets User" \
  --scope /subscriptions/.../resourceGroups/.../providers/Microsoft.KeyVault/vaults/atp-prod-kv

FluxCD Version Management

Upgrade Procedures

Check Current Version:

# Check installed FluxCD CLI version
flux --version

# Output:
# flux version 2.3.0

# Check the in-cluster controller versions (controllers are versioned
# separately from the CLI; e.g. Flux 2.3.0 ships source-controller v1.3.0)
kubectl get deployment source-controller -n flux-system -o jsonpath='{.spec.template.spec.containers[0].image}'

# Output:
# ghcr.io/fluxcd/source-controller:v1.3.0

Upgrade FluxCD:

# There is no dedicated `flux upgrade` command: upgrade the components by
# re-running `flux install` with a newer CLI (or re-run `flux bootstrap` on
# bootstrapped clusters so the manifests committed to Git are updated too)
flux install \
  --version=v2.4.0 \
  --namespace=flux-system \
  --components=source-controller,kustomize-controller,helm-controller,notification-controller \
  --export > flux-install-v2.4.0.yaml

kubectl apply -f flux-install-v2.4.0.yaml

# Wait for upgrade to complete
kubectl wait --for=condition=ready pod \
  --all \
  --namespace flux-system \
  --timeout=300s

Rollback Strategies

Rollback to Previous Version:

# Identify previous version
PREVIOUS_VERSION="v2.2.0"

# Apply previous version manifests
flux install \
  --version=$PREVIOUS_VERSION \
  --namespace=flux-system \
  --components=source-controller,kustomize-controller,helm-controller,notification-controller \
  --export | kubectl apply -f -

# Verify rollback
flux version
kubectl get pods -n flux-system

Version Compatibility Matrix

| FluxCD Version | Kubernetes Minimum | Kubernetes Recommended | Breaking Changes |
|----------------|--------------------|------------------------|------------------|
| 2.4.0 | 1.25+ | 1.28+ | None (from 2.3.x) |
| 2.3.0 | 1.25+ | 1.28+ | None (from 2.2.x) |
| 2.2.0 | 1.24+ | 1.27+ | API v1beta1 deprecated |
| 2.1.0 | 1.24+ | 1.27+ | None (from 2.0.x) |
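
An upgrade pipeline can gate on this matrix before applying new manifests. Below is a minimal pre-flight sketch in pure shell; the one-minor-version-at-a-time policy is a local convention assumed for illustration, not a FluxCD requirement:

```shell
# Pre-flight guard: allow FluxCD upgrades of at most one minor version per step.
# Assumes both versions share the same major version (local policy, not a Flux rule).
minor_of() {
  # Extract the minor component from a SemVer string like "2.3.0".
  echo "$1" | cut -d. -f2
}

upgrade_allowed() {
  local current="$1" target="$2"
  local cur_minor tgt_minor
  cur_minor=$(minor_of "$current")
  tgt_minor=$(minor_of "$target")
  # Permit same-minor patch bumps or a single minor-version step forward.
  [ $((tgt_minor - cur_minor)) -ge 0 ] && [ $((tgt_minor - cur_minor)) -le 1 ]
}

upgrade_allowed "2.2.0" "2.3.1" && echo "ok"       # one minor step: allowed
upgrade_allowed "2.2.0" "2.4.0" || echo "blocked"  # two minor steps: blocked
```

A step like this fails fast in CI instead of discovering a skipped-minor upgrade problem during reconciliation.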

Release Notes and Breaking Changes

Monitor FluxCD Releases:

# Verify cluster prerequisites and controller health before and after upgrading
flux check --components-extra=image-reflector-controller,image-automation-controller

# Review release notes for breaking changes:
# https://github.com/fluxcd/flux2/releases

Azure Monitor Integration

FluxCD Metrics Export to Prometheus

Enable Prometheus Metrics:

# FluxCD controllers expose Prometheus metrics on port 8080 (port name: http-prom).
# Flux does not create Services for its controllers, so scrape the pods directly
# with a PodMonitor (pattern from the upstream flux2-monitoring-example).
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: flux-system
  namespace: flux-system
spec:
  namespaceSelector:
    matchNames:
      - flux-system
  selector:
    matchExpressions:
      - key: app
        operator: In
        values:
          - source-controller
          - kustomize-controller
          - helm-controller
          - notification-controller
  podMetricsEndpoints:
    - port: http-prom  # Port name for Prometheus metrics
      interval: 30s
      path: /metrics

Log Forwarding to Log Analytics

Configure Container Insights:

# Enable Azure Monitor Container Insights (if not already enabled)
az aks enable-addons \
  --resource-group ATP-Production-EUS-RG \
  --name atp-prod-eus-aks \
  --addons monitoring \
  --workspace-resource-id /subscriptions/.../resourceGroups/.../providers/Microsoft.OperationalInsights/workspaces/atp-prod-eus-logs

KQL Query for FluxCD Logs:

// Azure Monitor Log Analytics: query FluxCD controller logs
// (ContainerLogV2 schema; the legacy ContainerLog table has no pod/namespace columns)
ContainerLogV2
| where PodNamespace == "flux-system"
| where PodName contains "source-controller" or PodName contains "kustomize-controller"
| where LogMessage contains "error" or LogMessage contains "warning"
| project TimeGenerated, PodName, LogMessage
| order by TimeGenerated desc

Custom Metrics and Alerts

Custom Metrics Dashboard:

# Grafana dashboard for FluxCD metrics
# (Flux exposes gotk_* and controller-runtime metrics; the names below follow
# the flux2-monitoring-example conventions)
dashboard:
  title: "FluxCD Reconciliation Metrics"
  panels:
    - title: "Reconciliation Duration"
      query: 'gotk_reconcile_duration_seconds{kind="Kustomization"}'

    - title: "Git Fetch Duration"
      query: 'gotk_reconcile_duration_seconds{kind="GitRepository"}'

    - title: "Reconciliation Success Rate"
      query: 'rate(controller_runtime_reconcile_total{result="success"}[5m]) / rate(controller_runtime_reconcile_total[5m])'

Azure Monitor Alert:

# Illustrative alert definition kept alongside the manifests. Azure Monitor
# does not read alert rules from ConfigMaps; create the actual rule from this
# definition via `az monitor scheduled-query create`, ARM, or Bicep.
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluxcd-alert
  namespace: flux-system
data:
  alert-rule.yaml: |
    alert:
      name: FluxCD Reconciliation Failure
      condition: |
        gotk_reconcile_condition{kind="Kustomization", type="Ready", status="False"} == 1
      severity: warning
      actionGroups:
        - /subscriptions/.../resourceGroups/.../providers/microsoft.insights/actionGroups/fluxcd-alerts

Dashboard Setup in Azure Monitor

Create FluxCD Dashboard:

# Export FluxCD metrics to Azure Monitor
# Metrics available via Prometheus scraping or Container Insights

# Key metrics to monitor (gotk_* names per the flux2-monitoring-example):
# - gotk_reconcile_duration_seconds           (reconcile latency, labeled by kind)
# - gotk_reconcile_condition                  (Ready/Stalled condition per resource)
# - controller_runtime_reconcile_total        (reconcile counts, with result label)
# - controller_runtime_reconcile_errors_total (reconcile error counts)

Dashboard JSON (Azure Monitor; illustrative, since Prometheus-sourced metric names surface only when collected via managed Prometheus or Container Insights):

{
  "dashboard": {
    "name": "FluxCD Reconciliation Dashboard",
    "widgets": [
      {
        "type": "metric",
        "properties": {
          "metrics": [
            {
              "namespace": "Microsoft.ContainerService/managedClusters",
              "name": "fluxcd_kustomize_reconciliation_duration_seconds",
              "aggregation": "Average"
            }
          ],
          "title": "Reconciliation Duration"
        }
      }
    ]
  }
}

Summary: FluxCD Installation & Configuration

  • FluxCD Architecture: Modular controllers (source, kustomize, helm, notification) with reconciliation loop
  • AKS Prerequisites: Kubernetes 1.25+ (1.28+ recommended), OIDC issuer, Workload Identity, RBAC enabled
  • Installation: Multiple methods (Azure CLI, kubectl, Helm), bootstrap to GitOps repository
  • Azure Repos Integration: SSH keys, PAT, or Workload Identity authentication
  • Multi-Cluster Setup: Branch-per-environment, cluster-specific configurations, regional deployment patterns
  • Workload Identity: Azure AD federated credentials for zero-trust authentication
  • Version Management: Upgrade procedures, rollback strategies, compatibility matrix
  • Azure Monitor Integration: Prometheus metrics, Log Analytics forwarding, custom alerts and dashboards

Declarative Manifest Management

Purpose: Define how ATP microservices are declared, organized, and managed using Kubernetes manifests, Helm charts, and Kustomize overlays, ensuring consistency, reusability, and environment-specific customization across all deployment environments.


Base Manifest Structure

ATP microservices use declarative Kubernetes manifests stored in Git as the single source of truth. This section covers the standard resource types and structures used for all ATP services.

Kubernetes Resource Types for ATP Services

Required Resources per Service:

| Resource Type | Purpose | Example Name | Required? |
|---------------|---------|--------------|-----------|
| Deployment | Manages pod replicas | atp-ingestion | ✅ Yes |
| Service | Exposes pods via network | atp-ingestion | ✅ Yes |
| ConfigMap | Non-sensitive configuration | atp-ingestion-config | ✅ Yes |
| Secret | Sensitive data (references) | atp-ingestion-secrets | ⚠️ Via External Secrets |
| Ingress | External HTTP/gRPC access | atp-ingestion-ingress | ⚠️ If exposed externally |
| ServiceAccount | Pod identity and RBAC | atp-ingestion-sa | ✅ Yes |
| HorizontalPodAutoscaler | Auto-scaling | atp-ingestion-hpa | ⚠️ Production only |
| NetworkPolicy | Traffic isolation | atp-ingestion-network-policy | ✅ Yes |
| PodDisruptionBudget | High availability | atp-ingestion-pdb | ⚠️ Production only |
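
The example names in the table follow a single `<service>-<suffix>` convention, which keeps resources greppable across namespaces. A sketch of a name generator that a scaffolding script might use (the suffix list mirrors the table; the helper itself is illustrative, not part of any ATP tooling):

```shell
# Derive per-resource names from a service name, mirroring the table above.
# Purely illustrative helper; ATP ships no official scaffolding CLI (assumption).
resource_name() {
  local service="$1" kind="$2"
  case "$kind" in
    deployment|service) echo "$service" ;;              # bare service name
    configmap)          echo "${service}-config" ;;
    secret)             echo "${service}-secrets" ;;
    ingress)            echo "${service}-ingress" ;;
    serviceaccount)     echo "${service}-sa" ;;
    hpa)                echo "${service}-hpa" ;;
    networkpolicy)      echo "${service}-network-policy" ;;
    pdb)                echo "${service}-pdb" ;;
  esac
}

resource_name atp-ingestion configmap   # atp-ingestion-config
```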

Deployment Manifest Structure

Complete Deployment Example (ATP Ingestion Service):

# apps/atp-ingestion/base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
  namespace: atp-production
  labels:
    app: atp-ingestion
    component: ingestion
    tier: backend
    version: v1.2.3
    managed-by: fluxcd
    compliance: soc2-gdpr-hipaa
spec:
  replicas: 3
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: atp-ingestion
  template:
    metadata:
      labels:
        app: atp-ingestion
        component: ingestion
        tier: backend
        version: v1.2.3
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
        checksum/config: "abc123def456"  # Trigger restart on ConfigMap change
        checksum/secret: "def456ghi789"  # Trigger restart on Secret change
    spec:
      serviceAccountName: atp-ingestion-sa

      # Pod Security Standards (Restricted)
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 2000
        seccompProfile:
          type: RuntimeDefault

      containers:
      - name: ingestion
        image: connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d
        imagePullPolicy: IfNotPresent

        # Container Security Context
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          runAsNonRoot: true
          runAsUser: 1000
          capabilities:
            drop: [ALL]

        # Resource Requests and Limits
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
          limits:
            cpu: 1000m
            memory: 1Gi

        # Environment Variables
        env:
        - name: ASPNETCORE_ENVIRONMENT
          value: Production
        - name: ASPNETCORE_URLS
          value: "http://+:8080"
        - name: OpenTelemetry__ServiceName
          value: atp-ingestion
        - name: DOTNET_RUNNING_IN_CONTAINER
          value: "true"

        # Environment Variables from ConfigMap
        envFrom:
        - configMapRef:
            name: atp-ingestion-config
        - secretRef:
            name: atp-ingestion-secrets

        # Ports
        ports:
        - name: http
          containerPort: 8080
          protocol: TCP
        - name: metrics
          containerPort: 9090
          protocol: TCP
        - name: grpc
          containerPort: 50051
          protocol: TCP

        # Health Checks
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          successThreshold: 1
          failureThreshold: 3

        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 3
          successThreshold: 1
          failureThreshold: 3

        startupProbe:
          httpGet:
            path: /health/startup
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 0
          periodSeconds: 5
          timeoutSeconds: 3
          successThreshold: 1
          failureThreshold: 30  # Allow up to 150 seconds for startup

        # Volume Mounts (read-only root filesystem requires writable volumes)
        volumeMounts:
        - name: tmp
          mountPath: /tmp
        - name: cache
          mountPath: /app/cache
        - name: logs
          mountPath: /app/logs

      # Volumes
      volumes:
      - name: tmp
        emptyDir: {}
      - name: cache
        emptyDir: {}
      - name: logs
        emptyDir: {}

      # Image Pull Secrets (for ACR authentication)
      imagePullSecrets:
      - name: acr-credentials

      # Termination Grace Period
      terminationGracePeriodSeconds: 30

      # DNS Policy
      dnsPolicy: ClusterFirst

      # Restart Policy
      restartPolicy: Always
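
The `checksum/config` and `checksum/secret` annotations above exist so that a ConfigMap or Secret change produces a new pod template hash and triggers a rollout. A pipeline can compute the value by hashing the rendered manifest; a minimal sketch (the file path is illustrative):

```shell
# Compute a checksum annotation value from a rendered ConfigMap manifest.
# Any stable hash works; sha256 matches the Helm checksum pattern.
config_checksum() {
  sha256sum "$1" | cut -d' ' -f1
}

# Illustrative usage with a temporary file standing in for configmap.yaml:
tmp=$(mktemp)
printf 'ASPNETCORE_ENVIRONMENT: "Production"\n' > "$tmp"
echo "checksum/config: $(config_checksum "$tmp")"
rm -f "$tmp"
```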

Service Manifest Structure

Service Example (ATP Ingestion Service):

# apps/atp-ingestion/base/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: atp-ingestion
  namespace: atp-production
  labels:
    app: atp-ingestion
    component: ingestion
    managed-by: fluxcd
spec:
  type: ClusterIP  # Internal service (use LoadBalancer for external)
  ports:
  - name: http
    port: 80
    targetPort: 8080
    protocol: TCP
  - name: metrics
    port: 9090
    targetPort: 9090
    protocol: TCP
  - name: grpc
    port: 50051
    targetPort: 50051
    protocol: TCP
  selector:
    app: atp-ingestion
  sessionAffinity: None  # No sticky sessions; use ClientIP (with sessionAffinityConfig) if stickiness is required

ConfigMap and Secret References

ConfigMap Example:

# apps/atp-ingestion/base/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: atp-ingestion-config
  namespace: atp-production
  labels:
    app: atp-ingestion
    managed-by: fluxcd
data:
  # Application Settings
  ASPNETCORE_ENVIRONMENT: "Production"
  ASPNETCORE_URLS: "http://+:8080"

  # OpenTelemetry Configuration
  OpenTelemetry__ServiceName: "atp-ingestion"
  OpenTelemetry__SamplingRatio: "0.1"
  OpenTelemetry__Exporters__Otlp__Endpoint: "http://otel-collector.observability:4317"

  # Audit Trail Configuration
  Audit__EnableImmutability: "true"
  Audit__RetentionDays: "2555"
  Audit__EnableTamperEvidence: "true"

  # Feature Flags
  Features__EnableAdvancedQuery: "true"
  Features__EnableRealTimeAlerts: "true"

  # Performance Settings
  Performance__MaxConcurrentRequests: "1000"
  Performance__RequestTimeoutSeconds: "30"

Secret Reference (External Secrets Operator):

# apps/atp-ingestion/base/externalsecret.yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: atp-ingestion-secrets
  namespace: atp-production
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: azure-keyvault-production
    kind: ClusterSecretStore
  target:
    name: atp-ingestion-secrets
    creationPolicy: Owner
  data:
  - secretKey: ConnectionStrings__Database
    remoteRef:
      key: atp-sql-connection-string-prod
  - secretKey: ConnectionStrings__Redis
    remoteRef:
      key: atp-redis-connection-string-prod
  - secretKey: ConnectionStrings__RabbitMQ
    remoteRef:
      key: atp-rabbitmq-connection-string-prod
  - secretKey: ApiKeys__IngestionApiKey
    remoteRef:
      key: atp-ingestion-api-key-prod

Ingress Configuration

Ingress Example (External Access):

# apps/atp-ingestion/base/ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: atp-ingestion-ingress
  namespace: atp-production
  labels:
    app: atp-ingestion
    managed-by: fluxcd
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
    nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
    nginx.ingress.kubernetes.io/limit-rps: "1000"
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - ingestion.atp.connectsoft.example
    secretName: atp-ingestion-tls
  rules:
  - host: ingestion.atp.connectsoft.example
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: atp-ingestion
            port:
              number: 80

ServiceAccount and RBAC

ServiceAccount Example:

# apps/atp-ingestion/base/serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: atp-ingestion-sa
  namespace: atp-production
  labels:
    app: atp-ingestion
    managed-by: fluxcd
  annotations:
    azure.workload.identity/client-id: "12345678-1234-1234-1234-123456789abc"
    azure.workload.identity/tenant-id: "87654321-4321-4321-4321-987654321abc"
automountServiceAccountToken: true

RBAC Role and RoleBinding:

# apps/atp-ingestion/base/rbac.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: atp-ingestion-role
  namespace: atp-production
rules:
- apiGroups: [""]
  resources: ["configmaps", "secrets"]
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: atp-ingestion-rolebinding
  namespace: atp-production
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: atp-ingestion-role
subjects:
- kind: ServiceAccount
  name: atp-ingestion-sa
  namespace: atp-production

Helm Charts for ATP Microservices

Helm charts provide parameterized, reusable templates for ATP microservices, enabling environment-specific customization via values files.

Chart Structure: Chart.yaml, values.yaml, templates/

Directory Structure:

apps/atp-ingestion/helm/
├── Chart.yaml                    # Chart metadata
├── values.yaml                   # Default values
├── values-dev.yaml              # Dev environment overrides
├── values-test.yaml             # Test environment overrides
├── values-staging.yaml          # Staging environment overrides
├── values-production.yaml       # Production environment overrides
├── templates/
│   ├── deployment.yaml          # Deployment template
│   ├── service.yaml             # Service template
│   ├── configmap.yaml           # ConfigMap template
│   ├── ingress.yaml             # Ingress template (optional)
│   ├── serviceaccount.yaml      # ServiceAccount template
│   ├── rbac.yaml                # RBAC template
│   ├── hpa.yaml                 # HPA template (conditional)
│   ├── networkpolicy.yaml       # NetworkPolicy template
│   └── _helpers.tpl             # Template helpers
└── charts/                      # Chart dependencies (optional)

Chart.yaml:

# apps/atp-ingestion/helm/Chart.yaml
apiVersion: v2
name: atp-ingestion
description: ATP Ingestion Service - Receives audit records via HTTP/gRPC
version: 1.2.3  # Chart version (SemVer)
appVersion: 1.2.3  # Application version
type: application

keywords:
  - audit-trail
  - ingestion
  - microservice
  - connectsoft

maintainers:
  - name: ConnectSoft Platform Team
    email: platform-team@connectsoft.example

dependencies:
  - name: redis
    version: 17.x.x
    repository: https://charts.bitnami.com/bitnami
    condition: redis.enabled
    tags:
      - atp-ingestion-redis

home: https://connectsoft.example/atp
sources:
  - https://dev.azure.com/ConnectSoft/ATP/_git/atp-ingestion

annotations:
  category: Backend
  licenses: Proprietary

values.yaml (Default):

# apps/atp-ingestion/helm/values.yaml
# Default values for atp-ingestion chart

# Replica configuration
replicaCount: 3

# Image configuration
image:
  repository: connectsoft.azurecr.io/atp/ingestion
  pullPolicy: IfNotPresent
  tag: ""  # Overridden by CI pipeline or .Values.appVersion

# Image pull secrets
imagePullSecrets:
  - name: acr-credentials

# Service account configuration
serviceAccount:
  create: true
  annotations:
    azure.workload.identity/client-id: ""
  name: atp-ingestion-sa
  automountServiceAccountToken: true

# Pod annotations
podAnnotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "8080"
  prometheus.io/path: "/metrics"

# Pod security context
podSecurityContext:
  runAsNonRoot: true
  runAsUser: 1000
  fsGroup: 2000
  seccompProfile:
    type: RuntimeDefault

# Container security context
securityContext:
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  runAsNonRoot: true
  runAsUser: 1000
  capabilities:
    drop: [ALL]

# Service configuration
service:
  type: ClusterIP
  port: 80
  targetPort: 8080
  metricsPort: 9090
  grpcPort: 50051
  annotations: {}

# Ingress configuration
ingress:
  enabled: false
  className: nginx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
  hosts:
    - host: ingestion.atp.connectsoft.example
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: atp-ingestion-tls
      hosts:
        - ingestion.atp.connectsoft.example

# Resource requests and limits
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 1000m
    memory: 1Gi

# Autoscaling configuration
autoscaling:
  enabled: false
  minReplicas: 3
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
  targetMemoryUtilizationPercentage: 80

# Health checks
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3

startupProbe:
  httpGet:
    path: /health/startup
    port: 8080
  initialDelaySeconds: 0
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 30

# Environment variables
env:
  ASPNETCORE_ENVIRONMENT: Production
  OpenTelemetry__ServiceName: atp-ingestion

# External Secrets configuration
externalSecrets:
  enabled: true
  secretStore: azure-keyvault
  secrets:
    - name: ConnectionStrings__Database
      key: sql-connection-string
    - name: ConnectionStrings__Redis
      key: redis-connection-string
    - name: ConnectionStrings__RabbitMQ
      key: rabbitmq-connection-string

# Network policy
networkPolicy:
  enabled: true
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
      - namespaceSelector:
          matchLabels:
            name: atp-production
      - podSelector:
          matchLabels:
            app: atp-gateway
  egress:
    - to:
      - namespaceSelector:
          matchLabels:
            name: kube-system
      - namespaceSelector:
          matchLabels:
            name: flux-system
      - namespaceSelector:
          matchLabels:
            name: observability

# Pod Disruption Budget
podDisruptionBudget:
  enabled: false
  minAvailable: 2

# Redis sub-chart (optional dependency)
redis:
  enabled: false  # Use Azure Cache for Redis instead

Template Best Practices

Helm Template Example (Deployment):

# apps/atp-ingestion/helm/templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "atp-ingestion.fullname" . }}
  namespace: {{ .Values.namespace | default .Release.Namespace }}
  labels:
    {{- include "atp-ingestion.labels" . | nindent 4 }}
spec:
  replicas: {{ .Values.replicaCount }}
  revisionHistoryLimit: {{ .Values.revisionHistoryLimit | default 10 }}
  selector:
    matchLabels:
      {{- include "atp-ingestion.selectorLabels" . | nindent 6 }}
  template:
    metadata:
      annotations:
        {{- with .Values.podAnnotations }}
        {{- toYaml . | nindent 8 }}
        {{- end }}
        checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}
        {{- if .Values.externalSecrets.enabled }}
        checksum/secret: {{ include (print $.Template.BasePath "/externalsecret.yaml") . | sha256sum }}
        {{- end }}
      labels:
        {{- include "atp-ingestion.selectorLabels" . | nindent 8 }}
    spec:
      serviceAccountName: {{ include "atp-ingestion.serviceAccountName" . }}
      securityContext:
        {{- toYaml .Values.podSecurityContext | nindent 8 }}
      containers:
      - name: {{ .Chart.Name }}
        image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
        imagePullPolicy: {{ .Values.image.pullPolicy }}
        securityContext:
          {{- toYaml .Values.securityContext | nindent 12 }}
        resources:
          {{- toYaml .Values.resources | nindent 12 }}
        env:
        {{- range $key, $value := .Values.env }}
        - name: {{ $key }}
          value: {{ $value | quote }}
        {{- end }}
        envFrom:
        - configMapRef:
            name: {{ include "atp-ingestion.fullname" . }}-config
        - secretRef:
            name: {{ include "atp-ingestion.fullname" . }}-secrets
        ports:
        - name: http
          containerPort: {{ .Values.service.targetPort }}
          protocol: TCP
        - name: metrics
          containerPort: {{ .Values.service.metricsPort }}
          protocol: TCP
        - name: grpc
          containerPort: {{ .Values.service.grpcPort }}
          protocol: TCP
        livenessProbe:
          {{- toYaml .Values.livenessProbe | nindent 12 }}
        readinessProbe:
          {{- toYaml .Values.readinessProbe | nindent 12 }}
        {{- if .Values.startupProbe }}
        startupProbe:
          {{- toYaml .Values.startupProbe | nindent 12 }}
        {{- end }}
        volumeMounts:
        - name: tmp
          mountPath: /tmp
        - name: cache
          mountPath: /app/cache
        - name: logs
          mountPath: /app/logs
      volumes:
      - name: tmp
        emptyDir: {}
      - name: cache
        emptyDir: {}
      - name: logs
        emptyDir: {}
      {{- with .Values.imagePullSecrets }}
      imagePullSecrets:
        {{- toYaml . | nindent 8 }}
      {{- end }}

Template Helpers (_helpers.tpl):

# apps/atp-ingestion/helm/templates/_helpers.tpl
{{/*
Expand the name of the chart.
*/}}
{{- define "atp-ingestion.name" -}}
{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" }}
{{- end }}

{{/*
Create a default fully qualified app name.
*/}}
{{- define "atp-ingestion.fullname" -}}
{{- if .Values.fullnameOverride }}
{{- .Values.fullnameOverride | trunc 63 | trimSuffix "-" }}
{{- else }}
{{- $name := default .Chart.Name .Values.nameOverride }}
{{- if contains $name .Release.Name }}
{{- .Release.Name | trunc 63 | trimSuffix "-" }}
{{- else }}
{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" }}
{{- end }}
{{- end }}
{{- end }}

{{/*
Create chart name and version as used by the chart label.
*/}}
{{- define "atp-ingestion.chart" -}}
{{- printf "%s-%s" .Chart.Name .Chart.Version | replace "+" "_" | trunc 63 | trimSuffix "-" }}
{{- end }}

{{/*
Common labels
*/}}
{{- define "atp-ingestion.labels" -}}
helm.sh/chart: {{ include "atp-ingestion.chart" . }}
{{ include "atp-ingestion.selectorLabels" . }}
{{- if .Chart.AppVersion }}
app.kubernetes.io/version: {{ .Chart.AppVersion | quote }}
{{- end }}
app.kubernetes.io/managed-by: {{ .Release.Service }}
managed-by: fluxcd
{{- end }}

{{/*
Selector labels
*/}}
{{- define "atp-ingestion.selectorLabels" -}}
app.kubernetes.io/name: {{ include "atp-ingestion.name" . }}
app.kubernetes.io/instance: {{ .Release.Name }}
{{- end }}

{{/*
Create the name of the service account to use
*/}}
{{- define "atp-ingestion.serviceAccountName" -}}
{{- if .Values.serviceAccount.create }}
{{- default (include "atp-ingestion.fullname" .) .Values.serviceAccount.name }}
{{- else }}
{{- default "default" .Values.serviceAccount.name }}
{{- end }}
{{- end }}
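
The `trunc 63 | trimSuffix "-"` pipeline in these helpers keeps generated names within the Kubernetes 63-character limit for DNS-1123 labels and label values. The equivalent logic in shell, useful when scripting name generation outside Helm:

```shell
# Truncate a name to 63 characters and strip a single trailing dash,
# mirroring Helm's `trunc 63 | trimSuffix "-"` helper pattern.
k8s_safe_name() {
  printf '%s' "$1" | cut -c1-63 | sed 's/-$//'
}

k8s_safe_name "atp-ingestion"           # short names pass through unchanged
k8s_safe_name "atp-ingestion-"          # trailing dash removed
```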

Values File Organization

values-production.yaml (Production Overrides):

# apps/atp-ingestion/helm/values-production.yaml
replicaCount: 5  # Production: 5 replicas

image:
  tag: "v1.2.3-abc123d"  # Immutable tag with commit SHA

resources:
  requests:
    cpu: 1000m      # Production: 1 CPU core
    memory: 1Gi     # Production: 1 GB RAM
  limits:
    cpu: 2000m      # Production: 2 CPU cores
    memory: 2Gi     # Production: 2 GB RAM

autoscaling:
  enabled: true
  minReplicas: 5
  maxReplicas: 20
  targetCPUUtilizationPercentage: 70
  targetMemoryUtilizationPercentage: 80

ingress:
  enabled: true
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/limit-rps: "1000"

env:
  ASPNETCORE_ENVIRONMENT: Production
  OpenTelemetry__SamplingRatio: "0.1"  # Production: 10% sampling

podDisruptionBudget:
  enabled: true
  minAvailable: 3  # Ensure at least 3 replicas available during updates

values-dev.yaml (Development Overrides):

# apps/atp-ingestion/helm/values-dev.yaml
replicaCount: 1  # Dev: 1 replica

image:
  tag: "latest"  # Dev: mutable latest tag

resources:
  requests:
    cpu: 100m     # Dev: minimal resources
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 512Mi

autoscaling:
  enabled: false  # Dev: no autoscaling

ingress:
  enabled: false  # Dev: no external ingress

env:
  ASPNETCORE_ENVIRONMENT: Development
  OpenTelemetry__SamplingRatio: "1.0"  # Dev: 100% sampling (full traces)

Chart Dependencies

Managing Dependencies:

# Update dependencies
helm dependency update apps/atp-ingestion/helm/

# Build chart with dependencies
helm package apps/atp-ingestion/helm/

# Output: atp-ingestion-1.2.3.tgz
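
`helm package` names the archive `<name>-<version>.tgz` from Chart.yaml. A CI step can derive that filename up front to locate the artifact; a minimal sketch using standard tools (assumes `name:` and `version:` lines without inline comments):

```shell
# Derive the packaged chart filename (<name>-<version>.tgz) from Chart.yaml.
chart_archive_name() {
  local chart_yaml="$1"
  local name version
  name=$(sed -n 's/^name: *//p' "$chart_yaml" | head -n1)
  version=$(sed -n 's/^version: *//p' "$chart_yaml" | head -n1)
  echo "${name}-${version}.tgz"
}

# Illustrative usage against a minimal Chart.yaml:
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
apiVersion: v2
name: atp-ingestion
version: 1.2.3
EOF
chart_archive_name "$tmp"   # atp-ingestion-1.2.3.tgz
rm -f "$tmp"
```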

Versioning and Publishing to ACR

Publish Helm Chart to ACR:

# Login to ACR
az acr login --name connectsoft

# Push Helm chart to ACR
helm push apps/atp-ingestion/helm/ oci://connectsoft.azurecr.io/helm

# Chart available at:
# oci://connectsoft.azurecr.io/helm/atp-ingestion:1.2.3

Helm Hooks for Migrations

Helm Hook Example (Database Migration):

# apps/atp-ingestion/helm/templates/migration-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ include "atp-ingestion.fullname" . }}-migration
  namespace: {{ .Values.namespace | default .Release.Namespace }}
  annotations:
    "helm.sh/hook": pre-upgrade,pre-install
    "helm.sh/hook-weight": "-5"
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
  template:
    spec:
      serviceAccountName: {{ include "atp-ingestion.serviceAccountName" . }}
      restartPolicy: Never
      containers:
      - name: migration
        image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
        command: ["dotnet", "run", "--project", "Migrate"]
        env:
        - name: ConnectionStrings__Database
          valueFrom:
            secretKeyRef:
              name: {{ include "atp-ingestion.fullname" . }}-secrets
              key: ConnectionStrings__Database

Kustomize for Environment Overlays

Kustomize enables environment-specific customization of base manifests without duplicating code.

Base + Overlay Pattern

Directory Structure:

apps/atp-ingestion/
├── base/                        # Base manifests (reusable)
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── configmap.yaml
│   ├── serviceaccount.yaml
│   ├── rbac.yaml
│   └── kustomization.yaml
└── overlays/                    # Environment-specific overlays
    ├── dev/
    │   ├── kustomization.yaml
    │   ├── deployment-patch.yaml
    │   └── configmap-patch.yaml
    ├── test/
    │   ├── kustomization.yaml
    │   └── deployment-patch.yaml
    ├── staging/
    │   ├── kustomization.yaml
    │   ├── deployment-patch.yaml
    │   └── hpa-patch.yaml
    └── production/
        ├── kustomization.yaml
        ├── deployment-patch.yaml
        ├── hpa-patch.yaml
        ├── configmap-patch.yaml
        └── networkpolicy-patch.yaml

Base Kustomization:

# apps/atp-ingestion/base/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: atp-production

resources:
  - deployment.yaml
  - service.yaml
  - configmap.yaml
  - serviceaccount.yaml
  - rbac.yaml

commonLabels:  # Also applied to selectors (immutable on live Deployments), so change with care
  app: atp-ingestion
  component: ingestion
  tier: backend
  managed-by: fluxcd

images:
  - name: connectsoft.azurecr.io/atp/ingestion
    newTag: v1.2.3-abc123d  # Updated by CI pipeline
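
The `newTag` value pairs the release SemVer with the short commit SHA, giving an immutable, traceable tag. A CI step might construct it like this (pure-shell sketch; the `kustomize edit set image` call shown in the trailing comment is the usual way to write the field back):

```shell
# Build an immutable image tag "v<semver>-<short-sha>" as used in newTag above.
image_tag() {
  local version="$1" sha="$2"
  # Keep only the first 7 characters of the commit SHA.
  printf 'v%s-%s\n' "$version" "$(printf '%s' "$sha" | cut -c1-7)"
}

TAG=$(image_tag "1.2.3" "abc123def4567890")
echo "$TAG"   # v1.2.3-abc123d

# In CI, the tag is then written into the kustomization (requires kustomize):
#   kustomize edit set image connectsoft.azurecr.io/atp/ingestion:$TAG
```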

Strategic Merge Patches

Deployment Patch (Production):

# apps/atp-ingestion/overlays/production/deployment-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  replicas: 5  # Production: 5 replicas (base has 3)
  template:
    spec:
      containers:
      - name: ingestion
        resources:
          requests:
            cpu: 1000m      # Production: 1 CPU (base: 500m)
            memory: 1Gi     # Production: 1 GB (base: 512Mi)
          limits:
            cpu: 2000m      # Production: 2 CPU (base: 1000m)
            memory: 2Gi     # Production: 2 GB (base: 1Gi)
        env:
        - name: ASPNETCORE_ENVIRONMENT
          value: Production
        - name: OpenTelemetry__SamplingRatio
          value: "0.1"  # Production: 10% sampling (dev: 100%)

JSON Patches

JSON Patch Example (Add Annotation):

# apps/atp-ingestion/overlays/production/json-patch.yaml
# Referenced from the overlay's kustomization.yaml, e.g.:
#   patches:
#     - path: json-patch.yaml
#       target:
#         kind: Deployment
#         name: atp-ingestion
- op: add
  path: /metadata/annotations/azure.connectsoft.com~1cost-center
  value: atp-production

- op: replace
  path: /spec/replicas
  value: 5

ConfigMap and Secret Generators

ConfigMap Generator:

# apps/atp-ingestion/overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - ../../base

configMapGenerator:
  - name: atp-ingestion-config
    behavior: merge  # Merge with base ConfigMap
    literals:
      - ASPNETCORE_ENVIRONMENT=Production
      - OpenTelemetry__SamplingRatio=0.1
      - Audit__EnableImmutability=true
      - Audit__RetentionDays=2555
    options:
      labels:
        environment: production

Secret Generator:

# apps/atp-ingestion/overlays/production/kustomization.yaml
secretGenerator:
  - name: atp-ingestion-secrets
    behavior: merge
    type: Opaque
    literals:
      # Kustomize base64-encodes literal values itself; shell substitution
      # such as $(echo ... | base64) does NOT run inside kustomization.yaml.
      # Avoid committing real secret values to Git.
      - ApiKeys__IngestionApiKey=secret-value

Built-In Transformers (Replicas and Images)

Kustomize replicas and images transformers (declarative field rewrites, no patch files needed):

# apps/atp-ingestion/base/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: atp-production

resources:
  - deployment.yaml

replicas:
  - name: atp-ingestion
    count: 3

images:
  - name: connectsoft.azurecr.io/atp/ingestion
    newTag: v1.2.3-abc123d

Configuration Layering Strategy

Configuration Precedence Hierarchy:

graph TD
    A[Base Configuration<br/>apps/atp-ingestion/base] -->|applies to| B[All Environments]

    B --> C[Dev Overlay<br/>overlays/dev]
    B --> D[Test Overlay<br/>overlays/test]
    B --> E[Staging Overlay<br/>overlays/staging]
    B --> F[Production Overlay<br/>overlays/production]

    C -->|customizes| G[Dev Cluster]
    D -->|customizes| H[Test Cluster]
    E -->|customizes| I[Staging Cluster]
    F -->|customizes| J[Production Cluster]

    style A fill:#FFE5B4
    style C fill:#90EE90
    style D fill:#90EE90
    style E fill:#FFE5B4
    style F fill:#ffcccc

Configuration Layers:

| Layer | Location | Purpose | Examples |
|---|---|---|---|
| Base | apps/{service}/base/ | Common configuration for all environments | Deployment structure, service ports, basic labels |
| Dev Overlay | apps/{service}/overlays/dev/ | Development-specific customization | 1 replica, minimal resources, 100% sampling |
| Test Overlay | apps/{service}/overlays/test/ | Test environment customization | 2 replicas, medium resources, 50% sampling |
| Staging Overlay | apps/{service}/overlays/staging/ | Pre-production environment | 3 replicas, production-like resources, 10% sampling |
| Production Overlay | apps/{service}/overlays/production/ | Production environment | 5+ replicas, full resources, 10% sampling, HPA |

Hierarchical Configuration Precedence

Precedence Order (highest to lowest):

  1. Overlay patches (environment-specific)
  2. Overlay ConfigMap generators (environment-specific)
  3. Base configuration (common defaults)

Example:

# Base ConfigMap
data:
  ASPNETCORE_ENVIRONMENT: "Development"  # Default

# Production Overlay ConfigMap Generator (merges)
configMapGenerator:
  - name: atp-ingestion-config
    behavior: merge
    literals:
      - ASPNETCORE_ENVIRONMENT=Production  # Overrides base

# Result: Production uses "Production", other environments use "Development"
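The merge semantics can be modeled with a plain dictionary update (a simplified illustration of `behavior: merge`, not Kustomize's actual merge code):

```python
# Simplified model of ConfigMap merge precedence: overlay literals override
# base keys; keys absent from the overlay are inherited from the base.
base = {
    "ASPNETCORE_ENVIRONMENT": "Development",  # base default
    "Audit__RetentionDays": "2555",
}
overlay = {"ASPNETCORE_ENVIRONMENT": "Production"}  # production overlay literal

merged = {**base, **overlay}  # overlay wins on conflicting keys
print(merged["ASPNETCORE_ENVIRONMENT"])  # Production
print(merged["Audit__RetentionDays"])    # 2555 (inherited from base)
```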

Image Reference Patterns

ACR Image Path Conventions

Image Path Format:

{registry}/{project}/{service}:{tag}

Examples:
connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d
connectsoft.azurecr.io/atp/query:v1.3.0-def456e
connectsoft.azurecr.io/atp/integrity:v1.1.5-ghi789f

Service Image Mapping:

| Service | ACR Path |
|---|---|
| atp-ingestion | connectsoft.azurecr.io/atp/ingestion |
| atp-query | connectsoft.azurecr.io/atp/query |
| atp-integrity | connectsoft.azurecr.io/atp/integrity |
| atp-export | connectsoft.azurecr.io/atp/export |
| atp-policy | connectsoft.azurecr.io/atp/policy |
| atp-search | connectsoft.azurecr.io/atp/search |
| atp-gateway | connectsoft.azurecr.io/atp/gateway |

Semantic Versioning in Image Tags

Tag Format:

v{VERSION}-{COMMIT-SHA}

Examples:
v1.2.3-abc123d      # Semantic version + 7-char commit SHA
v1.2.4-hotfix1-def456e  # Hotfix version + commit SHA

Tag Rules:

| Tag Pattern | Mutable? | Use Case | Example |
|---|---|---|---|
| v{VERSION}-{SHA} | ❌ Immutable | Production releases | v1.2.3-abc123d |
| v{VERSION} | ❌ Immutable | Production releases (without SHA) | v1.2.3 |
| latest | ✅ Mutable | Development only | latest |
| {BRANCH} | ✅ Mutable | Feature branches | feature/grpc-ingestion |
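These rules can be checked mechanically, e.g. as a CI gate (a hedged sketch; the regex mirrors the conventions above and is not an official format):

```python
import re

# Immutable tag: v{MAJOR}.{MINOR}.{PATCH}, optional suffix (e.g. -hotfix1),
# optional 7-char commit SHA. Pattern is illustrative, not authoritative.
TAG_RE = re.compile(r"^v\d+\.\d+\.\d+(-[a-z0-9]+)?(-[0-9a-f]{7})?$")

def is_immutable_tag(tag: str) -> bool:
    return bool(TAG_RE.match(tag))

print(is_immutable_tag("v1.2.3-abc123d"))          # True
print(is_immutable_tag("latest"))                  # False
print(is_immutable_tag("feature/grpc-ingestion"))  # False
```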

Commit SHA for Traceability

Image Tagging in Azure Pipelines:

# Azure Pipelines: Tag image with version + commit SHA
- task: Docker@2
  displayName: 'Build and push Docker image'
  inputs:
    containerRegistry: 'ConnectSoft-ACR'
    repository: 'atp/ingestion'
    command: 'buildAndPush'
    Dockerfile: 'src/ConnectSoft.ATP.Ingestion/Dockerfile'
    tags: |
      $(Build.BuildNumber)              # v1.2.3
      $(Build.BuildNumber)-$(Build.SourceVersion)  # v1.2.3-<full 40-char SHA>; derive a 7-char short SHA in a script step for v1.2.3-abc123d
      latest                            # Latest (dev only)
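`$(Build.SourceVersion)` expands to the full 40-character commit SHA, so the 7-character short form shown in the examples has to be derived, typically in a script step that sets a pipeline variable. The truncation itself is trivial (values here are illustrative):

```python
# Derive the 7-char short SHA used in image tags from the full commit SHA.
full_sha = "abc123d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0"  # e.g. $(Build.SourceVersion)
short_sha = full_sha[:7]
tag = f"v1.2.3-{short_sha}"
print(tag)  # v1.2.3-abc123d
```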

Image Pull Policies

Policy Selection:

| Policy | Behavior | Use Case |
|---|---|---|
| Always | Always pull latest image | Development (latest tag) |
| IfNotPresent | Pull only if not cached | Production (immutable tags) |
| Never | Never pull, use cached only | Air-gapped environments |

Production Configuration:

spec:
  containers:
  - name: ingestion
    image: connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d
    imagePullPolicy: IfNotPresent  # Production: use cached image (faster, immutable tag)

Development Configuration:

spec:
  containers:
  - name: ingestion
    image: connectsoft.azurecr.io/atp/ingestion:latest
    imagePullPolicy: Always  # Dev: always pull latest (mutable tag)

Resource Requests and Limits

Per-Environment Resource Specifications

Resource Configuration Matrix:

| Environment | CPU Request | CPU Limit | Memory Request | Memory Limit | Replicas |
|---|---|---|---|---|---|
| Dev | 100m | 500m | 128Mi | 512Mi | 1 |
| Test | 250m | 500m | 256Mi | 512Mi | 2 |
| Staging | 500m | 1000m | 512Mi | 1Gi | 3 |
| Production | 1000m | 2000m | 1Gi | 2Gi | 5 |

Production Resource Configuration:

# apps/atp-ingestion/overlays/production/deployment-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  template:
    spec:
      containers:
      - name: ingestion
        resources:
          requests:
            cpu: 1000m      # Guaranteed: 1 CPU core
            memory: 1Gi     # Guaranteed: 1 GB RAM
          limits:
            cpu: 2000m      # Maximum: 2 CPU cores (burst capacity)
            memory: 2Gi     # Maximum: 2 GB RAM

CPU and Memory Allocations

Sizing Guidelines:

  • CPU Request: Guaranteed CPU (scheduling decision)
  • CPU Limit: Maximum CPU (throttling threshold)
  • Memory Request: Guaranteed memory (scheduling decision)
  • Memory Limit: Maximum memory (OOMKill threshold)

Cost Optimization:

# Production: Right-sizing based on actual usage
resources:
  requests:
    cpu: 500m      # Based on 50th percentile usage
    memory: 512Mi  # Based on 50th percentile usage
  limits:
    cpu: 2000m     # Allow burst to 2x request
    memory: 2Gi    # Allow burst to 4x request (less frequent)

Resource Quota Enforcement

Namespace Resource Quota:

# infrastructure/overlays/production/resource-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: atp-production-quota
  namespace: atp-production
spec:
  hard:
    requests.cpu: "100"           # 100 CPU cores total
    requests.memory: 200Gi        # 200 GB RAM total
    limits.cpu: "200"             # 200 CPU cores max
    limits.memory: 400Gi          # 400 GB RAM max
    persistentvolumeclaims: "50"  # Max 50 PVCs
    services.loadbalancers: "2"   # Max 2 load balancers
    pods: "200"                   # Max 200 pods
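As a rough sanity check before raising `maxReplicas`, per-pod requests times replica count can be compared with the quota (a simplified sketch; the scheduler counts every pod in the namespace, not a single Deployment):

```python
# Does atp-ingestion at full scale-out fit inside the namespace quota?
replicas = 20          # HPA maxReplicas for production
cpu_request_m = 1000   # CPU request per pod (millicores)
mem_request_gi = 1     # memory request per pod (GiB)

total_cpu = replicas * cpu_request_m / 1000  # cores
total_mem = replicas * mem_request_gi        # GiB

quota_cpu, quota_mem = 100, 200  # requests.cpu / requests.memory from the quota
fits = total_cpu <= quota_cpu and total_mem <= quota_mem
print(total_cpu, total_mem, fits)  # 20.0 20 True
```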

Health Checks Configuration

Liveness Probes (Is the App Running?)

Purpose: Detect and restart crashed containers.

Liveness Probe Example:

livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
    scheme: HTTP
    httpHeaders:
    - name: Custom-Header
      value: liveness-check
  initialDelaySeconds: 30    # Wait 30s after container starts
  periodSeconds: 10          # Check every 10 seconds
  timeoutSeconds: 5          # Timeout after 5 seconds
  successThreshold: 1        # 1 success = healthy
  failureThreshold: 3        # 3 failures = restart container

Implementation (.NET Health Checks):

// C# Health Check implementation
// (UIResponseWriter comes from the AspNetCore.HealthChecks.UI.Client package)
app.MapHealthChecks("/health/live", new HealthCheckOptions
{
    Predicate = check => check.Tags.Contains("live"),
    AllowCachingResponses = false,
    ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse
});

Readiness Probes (Is the App Ready for Traffic?)

Purpose: Determine if container is ready to receive traffic.

Readiness Probe Example:

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
    scheme: HTTP
  initialDelaySeconds: 10    # Wait 10s after container starts
  periodSeconds: 5           # Check every 5 seconds
  timeoutSeconds: 3          # Timeout after 3 seconds
  successThreshold: 1        # 1 success = ready
  failureThreshold: 3        # 3 failures = remove from Service endpoints

Implementation (.NET Health Checks):

// C# Readiness Check (includes dependencies)
app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
    Predicate = check => check.Tags.Contains("ready"),
    ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse
});

// Check database connectivity
// (AddSqlServer/AddRedis come from the AspNetCore.HealthChecks.SqlServer
//  and AspNetCore.HealthChecks.Redis packages)
services.AddHealthChecks()
    .AddSqlServer(connectionString, tags: new[] { "ready" })
    .AddRedis(redisConnectionString, tags: new[] { "ready" });

Startup Probes (For Slow-Starting Apps)

Purpose: Allow slow-starting applications time to initialize.

Startup Probe Example:

startupProbe:
  httpGet:
    path: /health/startup
    port: 8080
    scheme: HTTP
  initialDelaySeconds: 0     # Start immediately
  periodSeconds: 5           # Check every 5 seconds
  timeoutSeconds: 3          # Timeout after 3 seconds
  successThreshold: 1        # 1 success = startup complete
  failureThreshold: 30       # Allow up to 150 seconds (30 * 5s) for startup

When to Use Startup Probes:

  • Applications with long initialization (database migrations, cache warming)
  • JVM-based applications (slow JIT compilation)
  • Applications loading large configuration files
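The probe parameters above bound the allowed startup time at `failureThreshold × periodSeconds` (each individual probe may additionally take up to `timeoutSeconds`); a quick check of the 150-second figure:

```python
# Maximum startup window implied by the startup probe configuration.
period_seconds = 5
failure_threshold = 30
max_startup_seconds = failure_threshold * period_seconds
print(max_startup_seconds)  # 150
```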

Probe Configuration Best Practices

Best Practices Table:

| Probe Type | Initial Delay | Period | Timeout | Failure Threshold | Rationale |
|---|---|---|---|---|---|
| Liveness | 30s | 10s | 5s | 3 | Give app time to start; detect crashes quickly |
| Readiness | 10s | 5s | 3s | 3 | Quick feedback for traffic routing |
| Startup | 0s | 5s | 3s | 30 | Allow up to 150s for slow initialization |

Probe Failure Handling:

# Liveness probe failure: Container restart
# Readiness probe failure: Remove from Service endpoints (no traffic)
# Startup probe failure: Keep checking until success or failure threshold

Pod Security Standards (PSS)

Security Context Configuration

Pod Security Context (Restricted Profile):

# apps/atp-ingestion/base/deployment.yaml
spec:
  template:
    spec:
      securityContext:
        runAsNonRoot: true        # Run as non-root user
        runAsUser: 1000           # Run as user ID 1000
        fsGroup: 2000             # File system group
        seccompProfile:           # System call filtering
          type: RuntimeDefault
        supplementalGroups: []    # No additional groups

Container Security Context:

containers:
- name: ingestion
  securityContext:
    allowPrivilegeEscalation: false  # Prevent privilege escalation
    readOnlyRootFilesystem: true     # Read-only root filesystem
    runAsNonRoot: true
    runAsUser: 1000
    capabilities:
      drop: [ALL]                    # Drop all capabilities
      # add: []                      # No additional capabilities
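These fields can also be checked programmatically, e.g. in a pre-commit hook (a minimal sketch over a parsed container spec covering a subset of the restricted profile; actual enforcement is done by Pod Security Admission):

```python
def violates_restricted(container: dict) -> list:
    """List restricted-profile violations for one container spec (simplified subset)."""
    sc = container.get("securityContext", {})
    violations = []
    if sc.get("allowPrivilegeEscalation") is not False:
        violations.append("allowPrivilegeEscalation must be false")
    if sc.get("runAsNonRoot") is not True:
        violations.append("runAsNonRoot must be true")
    if "ALL" not in sc.get("capabilities", {}).get("drop", []):
        violations.append("capabilities must drop ALL")
    return violations

compliant = {"securityContext": {
    "allowPrivilegeEscalation": False,
    "runAsNonRoot": True,
    "capabilities": {"drop": ["ALL"]},
}}
print(violates_restricted(compliant))  # []
print(len(violates_restricted({})))    # 3
```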

Pod Security Admission

Namespace Pod Security Labels:

# infrastructure/base/namespaces.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: atp-production
  labels:
    pod-security.kubernetes.io/enforce: restricted    # Enforce restricted profile
    pod-security.kubernetes.io/audit: restricted      # Audit violations
    pod-security.kubernetes.io/warn: restricted       # Warn on violations

Pod Security Levels:

| Level | Description | ATP Use |
|---|---|---|
| Privileged | Unrestricted (all capabilities) | ❌ Never |
| Baseline | Minimally restrictive | ⚠️ Legacy workloads only |
| Restricted | Highly restrictive (best practice) | ✅ Production |

Restricted, Baseline, Privileged Policies

Policy Comparison:

| Feature | Privileged | Baseline | Restricted |
|---|---|---|---|
| Host Namespaces | ✅ Allowed | ❌ Disallowed | ❌ Disallowed |
| Host Networking | ✅ Allowed | ❌ Disallowed | ❌ Disallowed |
| Privileged Containers | ✅ Allowed | ❌ Disallowed | ❌ Disallowed |
| Capabilities | ✅ All | ⚠️ Limited | ❌ Drop ALL |
| Volume Types | ✅ All | ⚠️ Limited | ⚠️ Limited |
| Run as Non-Root | ❌ Not required | ❌ Not required | ✅ Required |
| Read-Only Root FS | ❌ Not required | ❌ Not required | ✅ Required |
| Seccomp | ❌ Not required | ✅ Default | ✅ Required |

ATP Production Policy:

# infrastructure/base/pod-security-standards.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: atp-production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted

Security Best Practices for ATP

Security Checklist:

  • Run as non-root: All containers run as UID 1000+
  • Read-only root filesystem: Use emptyDir volumes for writable paths
  • Drop all capabilities: No additional Linux capabilities
  • Seccomp enabled: System call filtering (RuntimeDefault)
  • No host namespaces: No host network, PID, or IPC access
  • No privileged containers: No elevated privileges
  • Network policies: Default deny, explicit allow rules

Network Policies

Default Deny All Traffic

Default Deny NetworkPolicy:

# platform/network-policies/default-deny.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: atp-production
spec:
  podSelector: {}  # Applies to all pods
  policyTypes:
    - Ingress
    - Egress
  # No ingress rules = deny all ingress
  # No egress rules = deny all egress

Service-to-Service Communication Rules

Allow Internal Traffic (Same Namespace):

# apps/atp-ingestion/base/networkpolicy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: atp-ingestion-network-policy
  namespace: atp-production
spec:
  podSelector:
    matchLabels:
      app: atp-ingestion
  policyTypes:
    - Ingress
    - Egress

  ingress:
  # Allow from atp-gateway (API Gateway)
  - from:
    - podSelector:
        matchLabels:
          app: atp-gateway
    ports:
    - protocol: TCP
      port: 8080  # HTTP port
    - protocol: TCP
      port: 50051  # gRPC port

  # Allow from same namespace (service-to-service)
  - from:
    - namespaceSelector:
        matchLabels:
          name: atp-production
    ports:
    - protocol: TCP
      port: 8080

  egress:
  # Allow DNS
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system  # auto-applied namespace label
    ports:
    - protocol: UDP
      port: 53

  # Allow to Azure SQL (external; endpoints outside the cluster need an ipBlock --
  # an empty namespaceSelector only matches in-cluster pods)
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0  # Narrow to the Azure SQL service CIDR where possible
    ports:
    - protocol: TCP
      port: 1433  # SQL Server

  # Allow to Azure Redis (external)
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0  # Narrow to the Azure Cache for Redis CIDR where possible
    ports:
    - protocol: TCP
      port: 6380  # Redis TLS

  # Allow to observability namespace (metrics)
  - to:
    - namespaceSelector:
        matchLabels:
          name: observability
    ports:
    - protocol: TCP
      port: 4317  # OTLP gRPC

Ingress and Egress Policies

Ingress Policy (Allow External Traffic):

# apps/atp-gateway/base/networkpolicy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: atp-gateway-network-policy
  namespace: atp-production
spec:
  podSelector:
    matchLabels:
      app: atp-gateway
  policyTypes:
    - Ingress

  ingress:
  # Allow from ingress controller pods in the ingress-nginx namespace
  # (namespaceSelector and podSelector in ONE list entry are ANDed; as
  # separate entries they would be ORed and match far more than intended)
  - from:
    - namespaceSelector:
        matchLabels:
          name: ingress-nginx
      podSelector:
        matchLabels:
          app: ingress-nginx
    ports:
    - protocol: TCP
      port: 8080

DNS and Monitoring Exceptions

Egress Policy (DNS and Monitoring):

# platform/network-policies/allow-dns-monitoring.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-monitoring
  namespace: atp-production
spec:
  podSelector: {}  # Applies to all pods
  policyTypes:
    - Egress

  egress:
  # Allow DNS queries to kube-dns in kube-system
  # (namespaceSelector + podSelector combined in one entry = AND)
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system  # auto-applied namespace label
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53

  # Allow to Azure Monitor (external; use ipBlock for endpoints outside the
  # cluster -- an empty namespaceSelector only matches in-cluster pods)
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0  # Narrow to Azure Monitor endpoint ranges where possible
    ports:
    - protocol: TCP
      port: 443  # HTTPS for Azure Monitor

  # Allow to observability namespace (metrics, logs, traces)
  - to:
    - namespaceSelector:
        matchLabels:
          name: observability
    ports:
    - protocol: TCP
      port: 4317  # OTLP
    - protocol: TCP
      port: 9090  # Prometheus metrics

Horizontal Pod Autoscaler (HPA)

CPU-Based Autoscaling

HPA Configuration (CPU-Based):

# apps/atp-ingestion/overlays/production/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: atp-ingestion-hpa
  namespace: atp-production
  labels:
    app: atp-ingestion
    managed-by: fluxcd
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: atp-ingestion
  minReplicas: 5
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70  # Scale when CPU > 70%
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scaling down
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60  # Scale down by 50% per minute
    scaleUp:
      stabilizationWindowSeconds: 0  # Scale up immediately
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15  # Double replicas every 15 seconds
      - type: Pods
        value: 4
        periodSeconds: 15  # Or add 4 pods every 15 seconds
      selectPolicy: Max  # Use the policy that scales fastest
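Behind these settings, the HPA computes `desiredReplicas = ceil(currentReplicas × currentMetricValue / targetValue)`, clamped to the min/max bounds. A simplified sketch (it ignores the stabilization windows and scale-up/scale-down policies shown above):

```python
import math

def desired_replicas(current_replicas, current_util, target_util,
                     min_replicas=5, max_replicas=20):
    """Simplified HPA formula: scale proportionally to metric/target, then clamp."""
    desired = math.ceil(current_replicas * current_util / target_util)
    return max(min_replicas, min(max_replicas, desired))

print(desired_replicas(5, 140, 70))   # 10 -> CPU at 2x the 70% target doubles the pods
print(desired_replicas(10, 35, 70))   # 5  -> low load scales back to minReplicas
```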

Memory-Based Autoscaling

HPA Configuration (Memory-Based):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: atp-ingestion-hpa
  namespace: atp-production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: atp-ingestion
  minReplicas: 5
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80  # Scale when memory > 80%

Custom Metrics with KEDA

KEDA ScaledObject (Custom Metrics):

# apps/atp-ingestion/overlays/production/keda-scaledobject.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: atp-ingestion-scaler
  namespace: atp-production
spec:
  scaleTargetRef:
    name: atp-ingestion
  minReplicaCount: 5
  maxReplicaCount: 20
  triggers:
  # Scale based on CPU
  - type: cpu
    metricType: Utilization
    metadata:
      value: "70"

  # Scale based on RabbitMQ queue length
  - type: rabbitmq
    metadata:
      host: amqps://rabbitmq.example:5671
      queueName: audit-records-queue
      queueLength: "100"  # Scale when queue > 100 messages

  # Scale based on HTTP request rate
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.observability:9090
      metricName: http_requests_per_second
      threshold: "1000"  # Scale when requests > 1000/sec

Scaling Policies per Environment

Environment-Specific Scaling:

| Environment | Min Replicas | Max Replicas | Target CPU | Target Memory | Custom Metrics |
|---|---|---|---|---|---|
| Dev | 1 | 2 | N/A (no HPA) | N/A | ❌ Disabled |
| Test | 2 | 4 | 80% | 80% | ⚠️ Optional |
| Staging | 3 | 10 | 70% | 75% | ⚠️ Optional |
| Production | 5 | 20 | 70% | 80% | ✅ Enabled (KEDA) |

Manifest Validation

kubeval for Syntax Validation

kubeval Usage:

# Install kubeval
# (Note: kubeval is archived upstream; kubeconform is a maintained drop-in alternative)
brew install kubeval  # macOS
# or (Linux)
wget https://github.com/instrumenta/kubeval/releases/latest/download/kubeval-linux-amd64.tar.gz
tar xf kubeval-linux-amd64.tar.gz && sudo mv kubeval /usr/local/bin/

# Validate Kubernetes manifests
kubeval apps/atp-ingestion/base/deployment.yaml

# Validate all manifests in directory
find apps/ -name "*.yaml" -exec kubeval {} \;

# Validate with specific Kubernetes version
kubeval --kubernetes-version 1.30.0 apps/atp-ingestion/base/deployment.yaml

kube-score for Best Practices

kube-score Usage:

# Install kube-score
brew install kube-score  # macOS
# or download from https://github.com/zegl/kube-score/releases

# Score manifests (best practices check)
kube-score score apps/atp-ingestion/base/deployment.yaml

# Output:
# apps/atp-ingestion/base/deployment.yaml
# [CRITICAL] Container Image Tag
#   · Image with latest or no tag
#     Container 'ingestion' must not use the 'latest' tag
#
# [CRITICAL] Container Resources
#   · CPU limit is not set
#     Container 'ingestion' does not have a CPU limit
#
# [WARNING] Container Security Context
#   · Container does not have a read-only root filesystem
#     Container 'ingestion' can write to root filesystem

kube-score Configuration:

# Suppress specific checks with --ignore-test
# (container-image-tag: allow 'latest' in dev;
#  deployment-has-poddisruptionbudget: optional for non-critical services)
kube-score score \
  --ignore-test container-image-tag \
  --ignore-test deployment-has-poddisruptionbudget \
  apps/atp-ingestion/base/deployment.yaml

# To fail the pipeline on warnings as well as critical findings,
# add --exit-one-on-warning

Azure Policy Validation

Azure Policy for AKS (the Azure Policy add-on, built on OPA Gatekeeper) enforces compliance rules assigned at subscription or resource-group scope. The same admission rules can also be expressed in-cluster with a CEL-based ValidatingAdmissionPolicy (Kubernetes 1.30+; sketch):

# platform/azure-policy/pod-security-standards.yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: enforce-pod-security-standards
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups: [""]
      apiVersions: ["v1"]
      operations: ["CREATE", "UPDATE"]
      resources: ["pods"]
  validations:
  - expression: "has(object.spec.securityContext) && has(object.spec.securityContext.runAsNonRoot) && object.spec.securityContext.runAsNonRoot == true"
    message: "Pods must run as non-root user"
  - expression: "object.spec.containers.all(c, has(c.securityContext) && has(c.securityContext.readOnlyRootFilesystem) && c.securityContext.readOnlyRootFilesystem == true)"
    message: "Containers must have read-only root filesystem"
  - expression: "object.spec.containers.all(c, has(c.resources.limits) && 'cpu' in c.resources.limits && 'memory' in c.resources.limits)"
    message: "Containers must have CPU and memory limits"

# A ValidatingAdmissionPolicyBinding scoped to the atp-* namespaces activates the policy.

CI Pipeline Integration

Azure Pipelines Validation Stage:

# azure-pipelines.yml
- stage: Validate_Manifests
  displayName: 'Validate Kubernetes Manifests'
  jobs:
  - job: Validate
    steps:
    - task: Bash@3
      displayName: 'Install kubeval and kube-score'
      inputs:
        targetType: 'inline'
        script: |
          # Install kubeval
          curl -LO https://github.com/instrumenta/kubeval/releases/latest/download/kubeval-linux-amd64.tar.gz
          tar xf kubeval-linux-amd64.tar.gz
          sudo mv kubeval /usr/local/bin/

          # Install kube-score
          curl -LO https://github.com/zegl/kube-score/releases/latest/download/kube-score_linux_amd64.tar.gz
          tar xf kube-score_linux_amd64.tar.gz
          sudo mv kube-score /usr/local/bin/

    - task: Bash@3
      displayName: 'Validate manifests with kubeval'
      inputs:
        targetType: 'inline'
        script: |
          # Validate all base manifests
          find apps/ -name "*.yaml" -path "*/base/*" -exec kubeval --strict {} \;

    - task: Bash@3
      displayName: 'Score manifests with kube-score'
      inputs:
        targetType: 'inline'
        script: |
          # Score all base manifests
          find apps/ -name "*.yaml" -path "*/base/*" -exec kube-score score {} \;

Summary: Declarative Manifest Management

  • Base Manifest Structure: Standard Kubernetes resources (Deployment, Service, ConfigMap, Ingress, ServiceAccount, RBAC) for all ATP services
  • Helm Charts: Parameterized, reusable templates with environment-specific values files
  • Kustomize Overlays: Environment-specific customization using strategic merge patches and generators
  • Configuration Layering: Base configuration + environment overlays with clear precedence
  • Image References: ACR paths with semantic versioning and commit SHA for traceability
  • Resource Management: Per-environment resource requests/limits with cost optimization
  • Health Checks: Liveness, readiness, and startup probes for reliability
  • Pod Security Standards: Restricted profile enforcement for production workloads
  • Network Policies: Default deny with explicit service-to-service rules
  • Horizontal Pod Autoscaler: CPU/memory-based scaling with KEDA for custom metrics
  • Manifest Validation: kubeval (syntax), kube-score (best practices), Azure Policy (compliance)

Git Workflow & Environment Promotion

Purpose: Define the complete Git workflow, pull request process, environment promotion strategy, and operational procedures for managing changes through the GitOps pipeline from feature development to production deployment.


Feature Branch Development Workflow

ATP GitOps follows a GitOps-native workflow where all infrastructure and application changes flow through Git branches, pull requests, and automated validation before promotion to production.

Creating Feature Branches from Dev

Branch Creation Process:

# 1. Ensure you're on the latest dev branch
git checkout dev
git pull origin dev

# 2. Create feature branch from dev
git checkout -b feature/atp-ingestion-add-grpc-support

# 3. Verify branch creation
git branch
# Output:
# * feature/atp-ingestion-add-grpc-support
#   dev
#   main

# 4. Push feature branch to remote
git push -u origin feature/atp-ingestion-add-grpc-support

Branch Naming Conventions:

| Branch Type | Prefix | Example | Purpose |
|---|---|---|---|
| Feature | feature/ | feature/atp-query-add-cache | New functionality |
| Bugfix | bugfix/ | bugfix/atp-ingestion-memory-leak | Bug fixes |
| Hotfix | hotfix/ | hotfix/atp-gateway-security-patch | Critical production fixes |
| Documentation | docs/ | docs/gitops-troubleshooting | Documentation updates |
| Infrastructure | infra/ | infra/add-monitoring-namespace | Infrastructure changes |
| Chore | chore/ | chore/update-helm-charts | Maintenance tasks |
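The convention can be enforced with a simple pattern check in a PR pipeline (an illustrative sketch; the regex mirrors the prefixes above):

```python
import re

# Allowed prefixes followed by a lowercase kebab-case description.
BRANCH_RE = re.compile(r"^(feature|bugfix|hotfix|docs|infra|chore)/[a-z0-9][a-z0-9-]*$")

def is_valid_branch(name: str) -> bool:
    return bool(BRANCH_RE.match(name))

print(is_valid_branch("feature/atp-ingestion-add-grpc-support"))  # True
print(is_valid_branch("fix/typo"))                                # False
```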

Local Development and Testing

Local GitOps Repository Structure:

# Clone GitOps repository
git clone ssh://git@ssh.dev.azure.com/v3/ConnectSoft/ATP/atp-gitops.git
cd atp-gitops

# Navigate to service manifests
cd apps/atp-ingestion/base/

# Edit deployment manifest
vim deployment.yaml

# Validate changes locally
kubectl apply --dry-run=client -f deployment.yaml

Local Validation Tools:

# Validate YAML syntax
yamllint apps/atp-ingestion/base/deployment.yaml

# Validate Kubernetes manifests
kubeval apps/atp-ingestion/base/deployment.yaml

# Score manifests (best practices)
kube-score score apps/atp-ingestion/base/deployment.yaml

# Preview Kustomize output
kustomize build apps/atp-ingestion/overlays/dev/

# Preview Helm template output
helm template atp-ingestion apps/atp-ingestion/helm/ \
  --values apps/atp-ingestion/helm/values-dev.yaml \
  --debug

Committing Manifest Changes

Commit Process:

# Stage changes
git add apps/atp-ingestion/base/deployment.yaml

# Commit with conventional commit format
git commit -m "feat(ingestion): add gRPC endpoint configuration

- Add gRPC port (50051) to container ports
- Configure gRPC health checks
- Update service manifest for gRPC traffic

Related to: ATP-1234"

# Sign commit (required for production)
git commit -S -m "feat(ingestion): add gRPC endpoint configuration"

# Verify commit signature
git log --show-signature -1

Commit Message Format (Conventional Commits):

<type>(<scope>): <subject>

<body>

<footer>

Examples:

# Feature addition
git commit -m "feat(ingestion): add Redis cache support"

# Bug fix
git commit -m "fix(query): resolve memory leak in query service"

# Configuration change
git commit -m "chore(infra): update resource limits for production"

# Breaking change
git commit -m "feat(gateway)!: remove legacy authentication

BREAKING CHANGE: Legacy API key authentication removed.
Migrate to OAuth 2.0 before deploying this change."

# Work item reference
git commit -m "feat(integrity): implement tamper detection

Implements ATP-5678

- Add cryptographic signatures to audit records
- Validate signatures on read operations
- Store signature metadata in database"
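The header line of this format can be validated mechanically (a sketch covering the commit types used in this document, not the full Conventional Commits grammar):

```python
import re

# <type>(<scope>): <subject>, with an optional "!" marking a breaking change.
HEADER_RE = re.compile(r"^(feat|fix|chore|docs|refactor|perf|test)\([a-z-]+\)!?: .+$")

def valid_header(line: str) -> bool:
    return bool(HEADER_RE.match(line))

print(valid_header("feat(ingestion): add Redis cache support"))      # True
print(valid_header("feat(gateway)!: remove legacy authentication"))  # True
print(valid_header("update stuff"))                                  # False
```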

Syncing with Remote Repository

Sync Workflow:

# Fetch latest changes from remote
git fetch origin

# Check status
git status

# Rebase feature branch on latest dev (optional, for clean history)
git checkout feature/atp-ingestion-add-grpc-support
git rebase origin/dev

# Or merge latest dev into feature branch
git merge origin/dev

# Resolve conflicts if any
git status
# Edit conflicted files
vim apps/atp-ingestion/base/deployment.yaml
git add apps/atp-ingestion/base/deployment.yaml
git rebase --continue  # or git commit for merge

# Push changes
git push origin feature/atp-ingestion-add-grpc-support

# If rebased, force push (be careful!)
git push --force-with-lease origin feature/atp-ingestion-add-grpc-support

Pull Request Process

PR Creation in Azure Repos

Create Pull Request:

# Using Azure DevOps CLI
az repos pr create \
  --source-branch feature/atp-ingestion-add-grpc-support \
  --target-branch dev \
  --title "feat(ingestion): Add gRPC endpoint configuration" \
  --description "Adds gRPC support to ingestion service. See ATP-1234." \
  --work-items 1234 \
  --auto-complete false

# Or use Azure DevOps Portal:
# 1. Navigate to Repos > Pull Requests
# 2. Click "New Pull Request"
# 3. Select source branch (feature/atp-ingestion-add-grpc-support)
# 4. Select target branch (dev)
# 5. Fill in title and description
# 6. Link work items
# 7. Add reviewers
# 8. Create pull request

PR Template and Checklist

Pull Request Template (.azuredevops/pull_request_template.md):

## Description
<!-- Provide a clear description of the changes -->

## Type of Change
<!-- Mark applicable with [x] -->
- [ ] Feature (non-breaking change adding functionality)
- [ ] Bug fix (non-breaking change fixing an issue)
- [ ] Breaking change (fix or feature causing existing functionality to change)
- [ ] Documentation update
- [ ] Infrastructure change
- [ ] Configuration change

## Service(s) Affected
<!-- List affected services -->
- [ ] atp-ingestion
- [ ] atp-query
- [ ] atp-integrity
- [ ] atp-export
- [ ] atp-policy
- [ ] atp-search
- [ ] atp-gateway
- [ ] Infrastructure/Platform

## Environment(s) Affected
<!-- Mark applicable with [x] -->
- [ ] Dev
- [ ] Test
- [ ] Staging
- [ ] Production

## Pre-Merge Checklist
<!-- Mark applicable with [x] -->
- [ ] Code follows project style guidelines
- [ ] Self-review completed
- [ ] Comments added for complex logic
- [ ] Documentation updated
- [ ] No breaking changes (or documented)
- [ ] All CI/CD checks passing
- [ ] Manifest validation passing (kubeval, kube-score)
- [ ] Security scanning passing (OPA, Azure Policy)
- [ ] Preview environment tested (if applicable)
- [ ] Work items linked
- [ ] Signed commits (for production branches)

## Testing
<!-- Describe testing performed -->
- [ ] Local testing completed
- [ ] Preview environment tested
- [ ] Unit tests passing
- [ ] Integration tests passing
- [ ] Manual testing completed

## Rollback Plan
<!-- Describe rollback procedure if needed -->

## Related Work Items
<!-- Link related work items -->
- ATP-1234: Add gRPC endpoint to ingestion service

## Screenshots/Documentation
<!-- Add screenshots, diagrams, or documentation links -->

## Additional Notes
<!-- Any additional information reviewers should know -->

Code Review Guidelines

Review Checklist:

  1. Manifest Validation:
     - ✅ YAML syntax correct
     - ✅ Kubernetes API version valid
     - ✅ Resource names follow conventions
     - ✅ Labels and annotations present
     - ✅ Resource requests/limits set

  2. Security:
     - ✅ No secrets in plaintext
     - ✅ Pod Security Standards compliant
     - ✅ Network policies configured
     - ✅ RBAC follows least privilege

  3. Configuration:
     - ✅ Environment-specific values correct
     - ✅ Image tags immutable (not latest in prod)
     - ✅ Health checks configured
     - ✅ Resource limits appropriate

  4. Best Practices:
     - ✅ Follows GitOps principles
     - ✅ Changes are declarative
     - ✅ No hardcoded values
     - ✅ Documentation updated

Review Comments:

❌ Security Issue: Secret in plaintext
✅ Approved: Looks good, minor suggestion
⚠️ Needs Work: Please add resource limits
📝 Question: Why is this change needed?

Approval Workflow

Approval Requirements Matrix:

| Target Branch | Minimum Approvers | Required Roles | GPG Signing | Status Checks |
|---|---|---|---|---|
| dev | 1 | Developer or above | ❌ Optional | ✅ Required |
| test | 1 | Developer or above | ❌ Optional | ✅ Required |
| staging | 2 | Architect or SRE Lead | ✅ Required | ✅ Required |
| production | 2 | Architect or SRE Lead | ✅ Required | ✅ Required |
| hotfix/ | 2 | Architect or SRE Lead | ✅ Required | ✅ Required |

Azure DevOps Branch Policy Configuration (illustrative: branch policies are configured through the Azure DevOps UI or REST API; this YAML documents the intended settings):

# Branch policy for production branch
branchPolicy:
  branch: production
  minimumApproverCount: 2
  requiredApproverIds:
    - architect-team-group
    - sre-lead-group
  blockingPolicies:
    - buildValidation: true
    - mergeStrategy: squash
    - requireGpgSigning: true
    - requireWorkItemLinking: true
    - commentRequirements: true

Automated PR Validation

Manifest Linting (YAML Syntax, Helm Lint)

Azure Pipeline: PR Validation Stage:

# .azuredevops/pipelines/pr-validation.yml
trigger: none  # Only run on PR

pr:
  branches:
    include:
      - dev
      - test
      - staging
      - production
      - hotfix/*

pool:
  vmImage: 'ubuntu-latest'

stages:
- stage: Validate_Manifests
  displayName: 'Validate Kubernetes Manifests'
  jobs:
  - job: YAMLLint
    displayName: 'YAML Syntax Validation'
    steps:
    - task: UsePythonVersion@0
      inputs:
        versionSpec: '3.9'

    - script: |
        pip install yamllint
        yamllint -c .yamllint.yml apps/
      displayName: 'Validate YAML syntax'

  - job: Kubeval
    displayName: 'Kubernetes Manifest Validation'
    steps:
    - script: |
        wget -q https://github.com/instrumenta/kubeval/releases/latest/download/kubeval-linux-amd64.tar.gz
        tar xf kubeval-linux-amd64.tar.gz
        sudo mv kubeval /usr/local/bin/
        kubeval --version
      displayName: 'Install kubeval'

    - script: |
        find apps/ -name "*.yaml" -path "*/base/*" -exec kubeval --strict {} \;
      displayName: 'Validate Kubernetes manifests'

  - job: HelmLint
    displayName: 'Helm Chart Linting'
    steps:
    - task: HelmInstaller@1
      inputs:
        helmVersionToInstall: 'latest'

    - script: |
        find apps/ -name "Chart.yaml" -path "*/helm/*" | while read chart; do
          chart_dir=$(dirname "$chart")
          helm lint "$chart_dir"
        done
      displayName: 'Lint Helm charts'

  - job: KubeScore
    displayName: 'Best Practices Check'
    steps:
    - script: |
        wget -q https://github.com/zegl/kube-score/releases/latest/download/kube-score_linux_amd64.tar.gz
        tar xf kube-score_linux_amd64.tar.gz
        sudo mv kube-score /usr/local/bin/
      displayName: 'Install kube-score'

    - script: |
        find apps/ -name "*.yaml" -path "*/base/*" -exec kube-score score {} \;
      displayName: 'Score manifests for best practices'

Security Scanning (OPA Policies, Azure Policy)

OPA Policy Validation:

# .azuredevops/pipelines/pr-validation.yml
  - job: OPAPolicy
    displayName: 'OPA Policy Validation'
    steps:
    - script: |
        wget -q https://github.com/open-policy-agent/conftest/releases/latest/download/conftest_linux_amd64.tar.gz
        tar xf conftest_linux_amd64.tar.gz
        sudo mv conftest /usr/local/bin/
      displayName: 'Install conftest'

    - script: |
        find apps/ -name "*.yaml" -path "*/base/*" | while read manifest; do
          conftest test "$manifest" -p policies/
        done
      displayName: 'Validate OPA policies'

OPA Policy Examples:

# policies/pod-security.rego
package podsecurity

deny[msg] {
    input.kind == "Deployment"
    container := input.spec.template.spec.containers[_]
    not container.securityContext.runAsNonRoot

    msg := "Container must run as non-root user"
}

deny[msg] {
    input.kind == "Deployment"
    container := input.spec.template.spec.containers[_]
    not container.securityContext.readOnlyRootFilesystem

    msg := "Container must have read-only root filesystem"
}

deny[msg] {
    input.kind == "Deployment"
    container := input.spec.template.spec.containers[_]
    not container.resources.limits.cpu

    msg := "Container must have CPU limit"
}

Dry-Run Validation (Kustomize Build, Helm Template)

Dry-Run Validation:

# .azuredevops/pipelines/pr-validation.yml
  - job: DryRun
    displayName: 'Dry-Run Validation'
    steps:
    - task: Kubernetes@1
      inputs:
        connectionType: 'Azure Resource Manager'
        azureSubscriptionEndpoint: 'ATP-AKS-Connection'
        azureResourceGroup: 'ATP-Production-EUS-RG'
        kubernetesCluster: 'atp-prod-eus-aks'
        namespace: 'atp-production'
        command: 'apply'
        arguments: '--dry-run=client -f apps/atp-ingestion/base/deployment.yaml'
      displayName: 'Kubectl dry-run'

    - script: |
        # Kustomize build validation
        kustomize build apps/atp-ingestion/overlays/production/ > /dev/null
        echo "✅ Kustomize build successful"
      displayName: 'Kustomize build validation'

    - script: |
        # Helm template validation
        helm template atp-ingestion apps/atp-ingestion/helm/ \
          --values apps/atp-ingestion/helm/values-production.yaml \
          --debug > /dev/null
        echo "✅ Helm template successful"
      displayName: 'Helm template validation'

Breaking Change Detection

Breaking Change Detection Script:

#!/bin/bash
# scripts/detect-breaking-changes.sh

set -euo pipefail

BASE_BRANCH="${1:-dev}"
FEATURE_BRANCH="${2:-HEAD}"

echo "🔍 Detecting breaking changes between $BASE_BRANCH and $FEATURE_BRANCH..."

# Check for removed resources
REMOVED_RESOURCES=$(git diff --name-only --diff-filter=D "$BASE_BRANCH" "$FEATURE_BRANCH" | grep -E '\.(yaml|yml)$' || true)

if [ -n "$REMOVED_RESOURCES" ]; then
  echo "⚠️  WARNING: Resources removed:"
  echo "$REMOVED_RESOURCES"
  echo "This may be a breaking change!"
fi

# Check for API version changes
API_VERSION_CHANGES=$(git diff "$BASE_BRANCH" "$FEATURE_BRANCH" | grep -E '^\+.*apiVersion:|^\-.*apiVersion:' || true)

if [ -n "$API_VERSION_CHANGES" ]; then
  echo "⚠️  WARNING: API version changes detected:"
  echo "$API_VERSION_CHANGES"
fi

# Check for breaking change markers
BREAKING_MARKERS=$(git log --oneline "$BASE_BRANCH..$FEATURE_BRANCH" | grep -i "BREAKING" || true)

if [ -n "$BREAKING_MARKERS" ]; then
  echo "🚨 BREAKING CHANGE detected in commit messages:"
  echo "$BREAKING_MARKERS"
  exit 1
fi

echo "✅ No breaking changes detected"
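The commit-marker check above can be exercised end-to-end in a throwaway repository (the branch name and commit messages are illustrative):

```shell
# Exercise the "BREAKING" marker check from the script above in a
# throwaway repo; `dev` stands in for the base branch.
set -eu
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.email ci@example.com
git config user.name ci
git commit -q --allow-empty -m "chore: baseline"
git branch -f dev                        # dev points at the baseline
git commit -q --allow-empty -m "feat(gateway)!: BREAKING CHANGE: remove legacy auth"
# Same check the script performs between $BASE_BRANCH and $FEATURE_BRANCH:
if git log --oneline "dev..HEAD" | grep -qi "BREAKING"; then
  echo "breaking change detected"
fi
```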

Test Environment Deployment Preview

Preview Environment Deployment:

# .azuredevops/pipelines/pr-validation.yml
  - job: PreviewDeploy
    displayName: 'Preview Environment Deployment'
    # eq() does not support wildcards; match the PR source branch by prefix instead
    condition: and(succeeded(), startsWith(variables['System.PullRequest.SourceBranch'], 'refs/heads/feature/'))
    steps:
    - task: Kubernetes@1
      inputs:
        connectionType: 'Azure Resource Manager'
        azureSubscriptionEndpoint: 'ATP-AKS-Connection'
        azureResourceGroup: 'ATP-Dev-EUS-RG'
        kubernetesCluster: 'atp-dev-eus-aks'
        namespace: 'preview-$(System.PullRequest.PullRequestId)'
        command: 'apply'
        arguments: '-f apps/atp-ingestion/base/'
      displayName: 'Deploy to preview namespace'

    - script: |
        # Wait for deployment to be ready
        kubectl wait --for=condition=available \
          --timeout=300s \
          deployment/atp-ingestion \
          -n preview-$(System.PullRequest.PullRequestId)
        echo "✅ Preview deployment successful"
      displayName: 'Wait for deployment'

    - script: |
        # Run smoke tests
        kubectl exec -n preview-$(System.PullRequest.PullRequestId) \
          deployment/atp-ingestion -- \
          curl -f http://localhost:8080/health/ready
        echo "✅ Smoke tests passed"
      displayName: 'Run smoke tests'

Merge Strategies

Squash Merge (Production, Staging)

Squash Merge Configuration:

# Azure DevOps branch policy
branchPolicy:
  branch: production
  mergeStrategy: squash
  squashMergeCommitMessage: firstLine  # Use first commit message line

Squash Merge Example:

# Before squash merge (3 commits in feature branch)
git log --oneline feature/atp-ingestion-add-grpc
# abc123 feat(ingestion): add gRPC port
# def456 feat(ingestion): add gRPC health check
# ghi789 feat(ingestion): update service manifest

# After squash merge to production (1 commit)
git log --oneline production
# jkl012 feat(ingestion): add gRPC port  # Single squashed commit

Benefits of Squash Merge:

  • ✅ Clean, linear history
  • ✅ Easier rollback (single commit)
  • ✅ Simpler to review changes

Merge Commit (Test)

Merge Commit Configuration:

# Azure DevOps branch policy
branchPolicy:
  branch: test
  mergeStrategy: noFastForward  # Creates merge commit

Merge Commit Example:

# Feature branch merged to test with merge commit
git log --oneline --graph test
# *   mno345 Merge pull request #123 from feature/atp-ingestion-add-grpc
# |\
# | * abc123 feat(ingestion): add gRPC port
# | * def456 feat(ingestion): add gRPC health check
# |/
# * pqr678 Previous commit

Benefits of Merge Commit:

  • ✅ Preserves branch history
  • ✅ Clear feature boundaries
  • ✅ Useful for tracking feature development

Rebase (Dev, Optional)

Rebase Workflow:

# Rebase feature branch on latest dev
git checkout feature/atp-ingestion-add-grpc
git fetch origin
git rebase origin/dev

# Resolve conflicts if any
git status
# Edit conflicted files
vim apps/atp-ingestion/base/deployment.yaml
git add apps/atp-ingestion/base/deployment.yaml
git rebase --continue

# Force push (be careful!)
git push --force-with-lease origin feature/atp-ingestion-add-grpc

Benefits of Rebase:

  • ✅ Linear history
  • ✅ Clean, sequential commits
  • ⚠️ Requires force push (dangerous)
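The force-push caveat is worth seeing concretely: unlike `--force`, `--force-with-lease` refuses to overwrite commits you have not fetched. A sketch in throwaway repositories (the `alice`/`bob` clone names are illustrative):

```shell
# --force-with-lease rejects the push when the remote branch has moved
# past our last-known remote-tracking ref.
set -eu
tmp=$(mktemp -d)
cd "$tmp"
git init -q --bare origin.git
git clone -q origin.git alice 2>/dev/null
cd alice
git config user.email alice@example.com
git config user.name alice
git commit -q --allow-empty -m "base"
git push -q origin HEAD:refs/heads/feature   # also updates origin/feature locally
# A teammate pushes to the same branch behind alice's back:
cd "$tmp"
git clone -q origin.git bob 2>/dev/null
cd bob
git config user.email bob@example.com
git config user.name bob
git checkout -q feature 2>/dev/null
git commit -q --allow-empty -m "teammate commit"
git push -q origin feature
# Alice rewrites history without fetching first; the lease fails:
cd "$tmp/alice"
git commit -q --amend --allow-empty -m "rewritten"
if ! git push -q --force-with-lease origin HEAD:feature 2>/dev/null; then
  echo "push rejected: remote has commits we have not fetched"
fi
```

After a `git fetch`, the lease would match again and the force push would go through deliberately.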

Strategy Selection Rationale

Merge Strategy Matrix:

| Branch | Strategy | Rationale |
|--------|----------|-----------|
| dev | Merge commit or rebase | Preserve feature history, flexibility |
| test | Merge commit | Track feature development clearly |
| staging | Squash merge | Clean history, easier rollback |
| production | Squash merge | Clean, linear history essential for compliance |
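The "easier rollback" rationale for squash merging is mechanical: the whole release is one commit, so undoing it is a single `git revert`. A minimal sketch in a throwaway repository (the file contents and commit messages are illustrative):

```shell
# One squashed release commit -> one revert to roll it back.
set -eu
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.email ci@example.com
git config user.name ci
echo "replicas: 3" > deployment.yaml
git add deployment.yaml
git commit -qm "chore: baseline"
echo "replicas: 5" > deployment.yaml
git commit -qam "feat(ingestion): scale out"   # stands in for a squashed release
git revert -n HEAD                             # single revert undoes the release
git commit -qm "revert: roll back release"
grep "replicas: 3" deployment.yaml && echo "rollback ok"
```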

Environment Promotion Flow

Promotion Flow Diagram:

graph LR
    A[Feature Branch] -->|PR + Merge| B[Dev Environment]
    B -->|Automated<br/>Schedule/Tag| C[Test Environment]
    C -->|Manual Approval<br/>Regression Tests| D[Staging Environment]
    D -->|CAB Approval<br/>Change Window| E[Production Environment]

    F[Hotfix Branch] -.->|Expedited| E

    style A fill:#90EE90
    style B fill:#90EE90
    style C fill:#FFE5B4
    style D fill:#FFE5B4
    style E fill:#ffcccc
    style F fill:#ff9999

Promotion Flow Details:

| From | To | Method | Trigger | Approval Required | Automated Testing |
|------|----|--------|---------|-------------------|-------------------|
| Feature | Dev | Automatic | PR merge | ❌ No (PR approval only) | ✅ PR validation |
| Dev | Test | Automated | Schedule/Tag | ❌ No | ✅ Smoke tests |
| Test | Staging | Manual | On-demand | ✅ 2 approvers | ✅ Regression tests |
| Staging | Production | Manual | Change window | ✅ CAB (2 approvers) | ✅ Full test suite |
| Hotfix | Production | Expedited | Critical issue | ✅ 2 approvers | ✅ Hotfix tests |

Feature → Dev (Automatic After PR Merge)

Automatic Promotion Process:

# Azure Pipeline: Auto-promote to Dev
trigger:
  branches:
    include:
      - dev

pool:
  vmImage: 'ubuntu-latest'

stages:
- stage: PromoteToDev
  displayName: 'Promote to Dev Environment'
  jobs:
  - job: DeployToDev
    steps:
    - task: Kubernetes@1
      inputs:
        connectionType: 'Azure Resource Manager'
        azureSubscriptionEndpoint: 'ATP-AKS-Connection'
        azureResourceGroup: 'ATP-Dev-EUS-RG'
        kubernetesCluster: 'atp-dev-eus-aks'
        namespace: 'atp-dev'
        command: 'apply'
        arguments: '-f apps/atp-ingestion/overlays/dev/'
      displayName: 'Apply manifests to Dev cluster'

    - script: |
        # Verify deployment
        kubectl rollout status deployment/atp-ingestion -n atp-dev --timeout=300s
        echo "✅ Deployment to Dev successful"
      displayName: 'Verify deployment'

Dev → Test (Automatic, Triggered by Schedule or Tag)

Automated Promotion to Test:

# Azure Pipeline: Auto-promote to Test
schedules:
- cron: "0 2 * * *"  # Daily at 2 AM UTC
  branches:
    include:
      - dev
  displayName: 'Daily promotion to Test'

trigger:
  branches:
    include:
      - dev
  tags:
    include:
      - promote-to-test/*

pool:
  vmImage: 'ubuntu-latest'

stages:
- stage: PromoteToTest
  displayName: 'Promote to Test Environment'
  jobs:
  - job: DeployToTest
    steps:
    - script: |
        # Tag current dev commit
        git tag -a "test-$(date +%Y%m%d-%H%M%S)" -m "Promote to Test: $(Build.SourceVersion)"
        git push origin --tags
      displayName: 'Tag promotion'

    - task: Kubernetes@1
      inputs:
        connectionType: 'Azure Resource Manager'
        azureSubscriptionEndpoint: 'ATP-AKS-Connection'
        azureResourceGroup: 'ATP-Test-EUS-RG'
        kubernetesCluster: 'atp-test-eus-aks'
        namespace: 'atp-test'
        command: 'apply'
        arguments: '-f apps/atp-ingestion/overlays/test/'
      displayName: 'Apply manifests to Test cluster'

    - script: |
        # Run smoke tests
        ./scripts/run-smoke-tests.sh --environment test
      displayName: 'Run smoke tests'

Manual Trigger for Test Promotion:

# Tag dev branch to trigger promotion to test
git checkout dev
git pull origin dev
git tag -a "promote-to-test/v1.2.3" -m "Promote version 1.2.3 to Test"
git push origin --tags

Test → Staging (Manual Approval, Regression Tests)

Manual Promotion to Staging:

# Azure Pipeline: Manual promotion to Staging
trigger: none  # Manual trigger only

parameters:
- name: promoteVersion
  displayName: 'Version to Promote'
  type: string
  default: 'latest'

pool:
  vmImage: 'ubuntu-latest'

stages:
- stage: PromoteToStaging
  displayName: 'Promote to Staging Environment'
  jobs:
  - job: DeployToStaging
    steps:
    - script: |
        # Checkout test branch at specified version
        git checkout test
        git pull origin test
        git checkout "${{ parameters.promoteVersion }}"
      displayName: 'Checkout version'

    - task: Kubernetes@1
      inputs:
        connectionType: 'Azure Resource Manager'
        azureSubscriptionEndpoint: 'ATP-AKS-Connection'
        azureResourceGroup: 'ATP-Staging-EUS-RG'
        kubernetesCluster: 'atp-staging-eus-aks'
        namespace: 'atp-staging'
        command: 'apply'
        arguments: '-f apps/atp-ingestion/overlays/staging/'
      displayName: 'Apply manifests to Staging cluster'

    - script: |
        # Run full regression test suite
        ./scripts/run-regression-tests.sh --environment staging
      displayName: 'Run regression tests'

Pre-Promotion Checklist:

  • ✅ All test environment tests passing
  • ✅ Regression test suite passing
  • ✅ Performance benchmarks met
  • ✅ Security scans passing
  • ✅ Documentation updated
  • ✅ Rollback plan documented
  • ✅ 2 approvers approved

Staging → Production (CAB Approval, Change Window)

Production Promotion Process:

# Azure Pipeline: Production promotion (requires manual approval)
trigger: none  # Manual trigger only

parameters:
- name: promoteVersion
  displayName: 'Version to Promote to Production'
  type: string
  default: 'latest'
- name: changeWindow
  displayName: 'Change Window'
  type: string
  default: '2024-01-15 02:00 UTC'

pool:
  vmImage: 'ubuntu-latest'

stages:
- stage: ApprovalGate
  displayName: 'Change Advisory Board Approval'
  jobs:
  - job: WaitForApproval
    displayName: 'Wait for CAB Approval'
    pool: server
    steps:
    - task: ManualValidation@0
      timeoutInMinutes: 1440  # 24 hours
      inputs:
        notifyUsers: 'architect-team@connectsoft.example;sre-lead@connectsoft.example'
        instructions: 'Review and approve production promotion'

- stage: PromoteToProduction
  displayName: 'Promote to Production Environment'
  dependsOn: ApprovalGate
  condition: succeeded()
  jobs:
  - job: DeployToProduction
    steps:
    - script: |
        # Verify change window
        CURRENT_TIME=$(date -u +%s)
        WINDOW_START=$(date -u -d "${{ parameters.changeWindow }}" +%s)
        if [ $CURRENT_TIME -lt $WINDOW_START ]; then
          echo "⏳ Waiting for change window..."
          sleep $((WINDOW_START - CURRENT_TIME))
        fi
      displayName: 'Wait for change window'

    - script: |
        git checkout staging
        git pull origin staging
        git checkout "${{ parameters.promoteVersion }}"
      displayName: 'Checkout version'

    - task: Kubernetes@1
      inputs:
        connectionType: 'Azure Resource Manager'
        azureSubscriptionEndpoint: 'ATP-AKS-Connection'
        azureResourceGroup: 'ATP-Production-EUS-RG'
        kubernetesCluster: 'atp-prod-eus-aks'
        namespace: 'atp-production'
        command: 'apply'
        arguments: '-f apps/atp-ingestion/overlays/production/'
      displayName: 'Apply manifests to Production cluster'

    - script: |
        # Verify deployment
        kubectl rollout status deployment/atp-ingestion -n atp-production --timeout=600s
        echo "✅ Production deployment successful"
      displayName: 'Verify deployment'

    - script: |
        # Run production smoke tests
        ./scripts/run-production-smoke-tests.sh
      displayName: 'Run production smoke tests'

Change Window Schedule:

| Day | Window | Rationale |
|-----|--------|-----------|
| Monday - Thursday | 02:00 - 04:00 UTC | Low traffic period |
| Friday | No deployments | Weekend preparation |
| Saturday - Sunday | Emergency only | Minimal staffing |

Automated Promotion (Dev and Test)

Trigger Mechanisms (Schedule, Tags, Webhooks)

Schedule-Based Promotion:

# Azure Pipeline: Scheduled promotion
schedules:
- cron: "0 2 * * *"  # Daily at 2 AM UTC
  branches:
    include:
      - dev
  displayName: 'Daily Dev → Test Promotion'
  always: false  # Only if changes detected

Tag-Based Promotion:

# Create promotion tag
git checkout dev
git pull origin dev
git tag -a "promote-to-test/v1.2.3" -m "Promote version 1.2.3 to Test environment"
git push origin --tags

# Pipeline triggered automatically

Webhook Trigger:

# Azure Pipeline: Webhook trigger
resources:
  webhooks:
  - webhook: promotion-webhook
    connection: GitHubWebhook
    filters:
    - path: body.ref
      value: refs/heads/dev

Automated Testing Gates

Testing Gates Configuration:

# .azuredevops/pipelines/promotion-test-gates.yml
stages:
- stage: TestingGates
  displayName: 'Automated Testing Gates'
  jobs:
  - job: SmokeTests
    displayName: 'Smoke Tests'
    steps:
    - script: |
        ./scripts/run-smoke-tests.sh --environment test
      displayName: 'Run smoke tests'

  - job: IntegrationTests
    displayName: 'Integration Tests'
    steps:
    - script: |
        ./scripts/run-integration-tests.sh --environment test
      displayName: 'Run integration tests'

  - job: PerformanceTests
    displayName: 'Performance Benchmarks'
    steps:
    - script: |
        ./scripts/run-performance-tests.sh --environment test
      displayName: 'Run performance tests'

    - script: |
        # Validate performance metrics
        METRICS=$(cat performance-results.json)
        P95_LATENCY=$(echo "$METRICS" | jq '.p95_latency')
        if (( $(echo "$P95_LATENCY > 500" | bc -l) )); then
          echo "❌ Performance regression: P95 latency $P95_LATENCY > 500ms"
          exit 1
        fi
        echo "✅ Performance tests passed"
      displayName: 'Validate performance'

Rollback on Failure

Automatic Rollback Script:

#!/bin/bash
# scripts/auto-rollback.sh

set -euo pipefail

ENVIRONMENT="${1:-test}"
DEPLOYMENT_NAME="${2:-atp-ingestion}"
NAMESPACE="atp-${ENVIRONMENT}"

echo "🔄 Rolling back deployment $DEPLOYMENT_NAME in $NAMESPACE..."

# Get previous revision
PREVIOUS_REVISION=$(kubectl rollout history deployment/$DEPLOYMENT_NAME -n $NAMESPACE | tail -n 2 | head -n 1 | awk '{print $1}')

if [ -z "$PREVIOUS_REVISION" ]; then
  echo "❌ No previous revision found"
  exit 1
fi

# Rollback
kubectl rollout undo deployment/$DEPLOYMENT_NAME -n $NAMESPACE

# Wait for rollback
kubectl rollout status deployment/$DEPLOYMENT_NAME -n $NAMESPACE --timeout=300s

echo "✅ Rollback successful to revision $PREVIOUS_REVISION"

Notification and Alerting

Promotion Notification:

# Azure Pipeline: Notification stage
- stage: Notify
  displayName: 'Send Notifications'
  condition: always()  # Always run, even on failure
  jobs:
  - job: NotifyTeam
    steps:
    - task: Slack@1
      inputs:
        endpoint: 'ATP-Slack-Connection'
        channel: '#atp-deployments'
        message: |
          🚀 Promotion to ${{ parameters.environment }} Environment

          *Version*: ${{ parameters.promoteVersion }}
          *Status*: $(Agent.JobStatus)
          *Pipeline*: $(Build.BuildNumber)
          *Author*: $(Build.RequestedFor)

          *Changes*:
          $(git log --oneline -10)
      displayName: 'Send Slack notification'

    - task: SendEmail@1
      condition: eq(variables['Agent.JobStatus'], 'Failed')
      inputs:
        to: 'sre-oncall@connectsoft.example'
        subject: '❌ Production Promotion Failed'
        body: 'Production promotion failed. Check pipeline: $(Build.BuildUri)'
      displayName: 'Send alert email'

Manual Promotion (Staging and Production)

Approval Gates Configuration

Azure DevOps Approval Gates:

# Azure DevOps environment: Production
environments:
- name: Production
  approvals:
  - approvers:
    - architect-team@connectsoft.example
    - sre-lead@connectsoft.example
    count: 2  # Require 2 approvals
    timeoutInMinutes: 1440  # 24 hours
  checks:
  - type: AzureMonitor
    properties:
      actionGroupName: production-promotion-alerts
  - type: InvokeRESTAPI
    properties:
      url: 'https://api.connectsoft.example/change-window/validate'
      method: 'POST'

Change Advisory Board (CAB) Process

CAB Approval Checklist:

  1. Change Request Review:
     • ✅ Change description clear
     • ✅ Risk assessment completed
     • ✅ Rollback plan documented
     • ✅ Testing evidence provided
     • ✅ Impact analysis completed

  2. Technical Review:
     • ✅ Architecture review approved
     • ✅ Security review approved
     • ✅ Performance impact assessed
     • ✅ Dependency analysis completed

  3. Operational Review:
     • ✅ Runbook updated
     • ✅ Monitoring alerts configured
     • ✅ On-call engineer notified
     • ✅ Change window scheduled

CAB Meeting Agenda:

  • Review pending change requests
  • Assess risk and impact
  • Approve/reject change requests
  • Schedule change windows
  • Document decisions

Change Window Scheduling

Change Window Policy:

| Environment | Days | Time Window | Restrictions |
|-------------|------|-------------|--------------|
| Dev | Any day | 24/7 | None |
| Test | Mon-Fri | 24/7 | None |
| Staging | Mon-Thu | 02:00-04:00 UTC | No Friday deployments |
| Production | Mon-Thu | 02:00-04:00 UTC | No Friday/weekend deployments |
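The production window (Mon-Thu, 02:00-04:00 UTC) can be checked mechanically before a deploy. A minimal sketch using GNU `date`; the helper name and the epoch-override argument are assumptions for illustration:

```shell
# Succeed only when the given epoch (default: now) falls inside the
# production change window: Monday-Thursday, 02:00-04:00 UTC.
in_change_window() {
  ts="${1:-$(date -u +%s)}"
  dow=$(date -u -d "@$ts" +%u)     # 1=Monday ... 7=Sunday
  hour=$(date -u -d "@$ts" +%H)
  [ "$dow" -le 4 ] && [ "$hour" -ge 2 ] && [ "$hour" -lt 4 ]
}

# 2024-01-15 02:30 UTC is a Monday inside the window:
if in_change_window "$(date -u -d '2024-01-15 02:30 UTC' +%s)"; then
  echo "inside window"
fi
```

The same function could gate the "Wait for change window" step in the production pipeline above.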

Schedule Change Window:

# Azure DevOps CLI: Schedule change
az pipelines variable-group create \
  --name "Change-Window-2024-01-15" \
  --variables \
    changeWindowStart="2024-01-15T02:00:00Z" \
    changeWindowEnd="2024-01-15T04:00:00Z" \
    changeOwner="john.doe@connectsoft.example"

Pre-Deployment Checklists

Pre-Deployment Checklist:

## Pre-Deployment Checklist

### Technical Readiness
- [ ] All tests passing (unit, integration, E2E)
- [ ] Performance benchmarks met
- [ ] Security scans passing (SAST, DAST, dependency scan)
- [ ] Manifest validation passing
- [ ] Helm charts validated
- [ ] Kustomize builds successful

### Documentation
- [ ] Release notes updated
- [ ] API documentation updated
- [ ] Runbook updated
- [ ] Architecture diagrams updated

### Operations
- [ ] Monitoring dashboards configured
- [ ] Alerts configured
- [ ] Logging configured
- [ ] Backup strategy verified
- [ ] Rollback procedure tested

### Communication
- [ ] Stakeholders notified
- [ ] On-call engineer notified
- [ ] Support team briefed
- [ ] Customer communication prepared (if needed)

### Change Management
- [ ] Change request created
- [ ] CAB approval obtained
- [ ] Change window scheduled
- [ ] Risk assessment completed

Post-Deployment Validation

Post-Deployment Validation Script:

#!/bin/bash
# scripts/post-deployment-validation.sh

set -euo pipefail

ENVIRONMENT="${1:-production}"
NAMESPACE="atp-${ENVIRONMENT}"

echo "✅ Post-Deployment Validation for $ENVIRONMENT"

# Health checks
echo "1. Health Checks"
kubectl get pods -n $NAMESPACE -l app=atp-ingestion
kubectl wait --for=condition=ready pod -l app=atp-ingestion -n $NAMESPACE --timeout=300s

# Service endpoints
echo "2. Service Endpoints"
kubectl get endpoints atp-ingestion -n $NAMESPACE

# Metrics
echo "3. Metrics"
kubectl exec -n $NAMESPACE deployment/atp-ingestion -- \
  curl -s http://localhost:9090/metrics | grep -q "http_requests_total" && \
  echo "✅ Metrics endpoint responding"

# Smoke tests
echo "4. Smoke Tests"
./scripts/run-smoke-tests.sh --environment $ENVIRONMENT

echo "✅ Post-deployment validation complete"

Hotfix Workflow

Hotfix Workflow Diagram:

graph TD
    A[Production Issue Detected] --> B[Create Hotfix Branch<br/>from production]
    B --> C[Implement Fix<br/>in Hotfix Branch]
    C --> D[PR to Production<br/>Expedited Review]
    D --> E[2 Approvers<br/>Required]
    E --> F[Merge to Production]
    F --> G[Deploy to Production]
    G --> H[Verify Fix]
    H --> I[Back-merge to<br/>dev/test/staging]

    style A fill:#ffcccc
    style B fill:#ff9999
    style F fill:#ffcccc
    style I fill:#90EE90

Creating Hotfix Branch from Production

Hotfix Branch Creation:

# 1. Checkout production branch
git checkout production
git pull origin production

# 2. Create hotfix branch
git checkout -b hotfix/atp-gateway-security-patch-CVE-2024-1234

# 3. Push hotfix branch
git push -u origin hotfix/atp-gateway-security-patch-CVE-2024-1234

# 4. Apply fix
vim apps/atp-gateway/base/deployment.yaml
git add apps/atp-gateway/base/deployment.yaml
git commit -S -m "fix(gateway): patch security vulnerability CVE-2024-1234

URGENT: Security patch for authentication bypass vulnerability.

Related to: ATP-9999 (Critical Security Issue)"

Expedited Approval Process

Hotfix PR Creation:

# Create hotfix PR with expedited flag
az repos pr create \
  --source-branch hotfix/atp-gateway-security-patch-CVE-2024-1234 \
  --target-branch production \
  --title "🚨 HOTFIX: Security patch CVE-2024-1234" \
  --description "Urgent security fix. Requires expedited review." \
  --work-items 9999 \
  --reviewers "architect-team@connectsoft.example;sre-lead@connectsoft.example" \
  --auto-complete false \
  --bypass-policy false  # Still requires 2 approvers

Expedited Review Checklist:

  • ✅ Security issue verified (CVE, vulnerability scan)
  • ✅ Fix validated (local testing, security review)
  • ✅ Impact assessment completed
  • ✅ Rollback plan documented
  • ✅ 2 approvers from architecture/SRE teams

Testing in Hotfix Environment

Hotfix Testing:

# Deploy to hotfix test environment
kubectl apply -f apps/atp-gateway/overlays/hotfix-test/ \
  --namespace atp-hotfix-test

# Run critical path tests
./scripts/run-critical-path-tests.sh --environment hotfix-test

# Security validation
./scripts/run-security-tests.sh --environment hotfix-test \
  --focus CVE-2024-1234

Fast-Track Merge to Production

Hotfix Merge Process:

# After approval, merge hotfix
az repos pr update \
  --id <PR_ID> \
  --status completed \
  --squash true  # az repos pr update takes --squash, not --merge-strategy

# Verify merge
git checkout production
git pull origin production
git log --oneline -5

# Tag hotfix release
git tag -a "hotfix/v1.2.4-CVE-2024-1234" \
  -m "Hotfix: Security patch CVE-2024-1234"
git push origin --tags

Back-Merge to Dev/Test/Staging

Back-Merge Process:

# Back-merge to staging
git checkout staging
git pull origin staging
git merge production --no-ff -m "Merge hotfix from production: CVE-2024-1234"
git push origin staging

# Back-merge to test
git checkout test
git pull origin test
git merge production --no-ff -m "Merge hotfix from production: CVE-2024-1234"
git push origin test

# Back-merge to dev
git checkout dev
git pull origin dev
git merge production --no-ff -m "Merge hotfix from production: CVE-2024-1234"
git push origin dev

Preview Environments

Ephemeral Namespace per PR

Preview Environment Configuration:

# .azuredevops/pipelines/preview-environment.yml
trigger: none

pr:
  branches:
    include:
      - feature/*
      - bugfix/*

pool:
  vmImage: 'ubuntu-latest'

variables:
  previewNamespace: 'preview-pr-$(System.PullRequest.PullRequestId)'

stages:
- stage: CreatePreview
  displayName: 'Create Preview Environment'
  jobs:
  - job: ProvisionPreview
    steps:
    - script: |
        # Create preview namespace
        kubectl create namespace $(previewNamespace) \
          --dry-run=client -o yaml | kubectl apply -f -

        # Label namespace
        kubectl label namespace $(previewNamespace) \
          environment=preview \
          pr-id=$(System.PullRequest.PullRequestId) \
          managed-by=fluxcd

        echo "✅ Preview namespace created: $(previewNamespace)"
      displayName: 'Create preview namespace'

    - task: Kubernetes@1
      inputs:
        connectionType: 'Azure Resource Manager'
        azureSubscriptionEndpoint: 'ATP-AKS-Connection'
        azureResourceGroup: 'ATP-Dev-EUS-RG'
        kubernetesCluster: 'atp-dev-eus-aks'
        namespace: '$(previewNamespace)'
        command: 'apply'
        arguments: |
          -f apps/atp-ingestion/base/ \
          --namespace $(previewNamespace)
      displayName: 'Deploy to preview namespace'

    - script: |
        # Wait for deployment
        kubectl wait --for=condition=available \
          deployment/atp-ingestion \
          -n $(previewNamespace) \
          --timeout=300s

        # Get preview URL
        PREVIEW_URL=$(kubectl get ingress atp-ingestion \
          -n $(previewNamespace) \
          -o jsonpath='{.spec.rules[0].host}')

        echo "##vso[task.setvariable variable=PreviewUrl]$PREVIEW_URL"
        echo "✅ Preview environment ready: https://$PREVIEW_URL"
      displayName: 'Wait for deployment'

    - script: |
        # Note: `az repos pr set-vote` has no comment flag; posting the
        # URL as a PR comment requires the Pull Request Threads REST API.
        # Surface the preview URL in the pipeline log instead.
        echo "Preview environment: https://$(PreviewUrl)"
      displayName: 'Publish preview URL'

- stage: CleanupPreview
  displayName: 'Cleanup Preview Environment'
  condition: always()
  jobs:
  - job: DeletePreview
    steps:
    - script: |
        # Delete preview namespace
        kubectl delete namespace $(previewNamespace) --ignore-not-found=true
        echo "✅ Preview namespace deleted: $(previewNamespace)"
      displayName: 'Delete preview namespace'

Automatic Provisioning on PR Creation

PR Webhook Trigger:

# Azure Pipeline: Trigger on PR creation
resources:
  webhooks:
  - webhook: pr-webhook
    connection: AzureReposWebhook
    filters:
    - path: eventType
      value: git.pullrequest.created

Testing Isolated Changes

Preview Environment Testing:

# Test preview environment
PREVIEW_URL="https://preview-pr-123.ingestion.atp.connectsoft.example"

# Health check
curl "$PREVIEW_URL/health/ready"

# Smoke tests
curl "$PREVIEW_URL/api/v1/audit/records" \
  -H "Authorization: Bearer $PREVIEW_API_KEY"

# Integration tests
./scripts/run-integration-tests.sh \
  --base-url "$PREVIEW_URL" \
  --environment preview

Auto-Deletion After PR Close

Cleanup on PR Close:

# Azure Pipeline: Cleanup on PR close
resources:
  webhooks:
  - webhook: pr-webhook
    connection: AzureReposWebhook
    filters:
    - path: eventType
      value: git.pullrequest.closed

stages:
- stage: Cleanup
  displayName: 'Cleanup Preview Environment'
  jobs:
  - job: DeletePreview
    steps:
    - script: |
        # PR merge refs look like refs/pull/<id>/merge, so extract the
        # numeric segment rather than the last path component
        PR_ID=$(echo "$(Build.SourceBranch)" | sed -n 's#^refs/pull/\([0-9]*\)/merge$#\1#p')
        PREVIEW_NAMESPACE="preview-pr-${PR_ID}"

        kubectl delete namespace $PREVIEW_NAMESPACE --ignore-not-found=true
        echo "✅ Preview namespace $PREVIEW_NAMESPACE deleted"
      displayName: 'Delete preview namespace'

Git Commit Message Conventions

Conventional Commits Format

Conventional Commits Specification:

<type>(<scope>): <subject>

<body>

<footer>

Commit Types:

| Type | Description | Example |
|------|-------------|---------|
| feat | New feature | `feat(ingestion): add gRPC endpoint` |
| fix | Bug fix | `fix(query): resolve memory leak` |
| docs | Documentation | `docs(gitops): update deployment guide` |
| style | Code style (formatting) | `style(*): format YAML files` |
| refactor | Code refactoring | `refactor(gateway): simplify auth logic` |
| test | Tests | `test(integrity): add unit tests` |
| chore | Maintenance | `chore(infra): update Helm charts` |
| perf | Performance | `perf(query): optimize database queries` |
| ci | CI/CD | `ci(pipelines): add validation stage` |
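A `commit-msg` hook can enforce this header format. A minimal sketch covering the types listed above plus the `!` breaking-change marker; the regex is an assumption, a simplification of the full Conventional Commits grammar, and a real hook would read the message from `"$1"`:

```shell
# Validate the <type>(<scope>): <subject> header against the types
# documented in the table above.
pattern='^(feat|fix|docs|style|refactor|test|chore|perf|ci)(\([a-z0-9*,-]+\))?!?: .+'
check_header() {
  echo "$1" | grep -Eq "$pattern"
}

check_header "feat(ingestion): add gRPC endpoint" && echo "ok: feature"
check_header "feat(gateway)!: remove legacy authentication" && echo "ok: breaking change"
check_header "added some stuff" || echo "rejected: no type prefix"
```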

Linking to Work Items

Work Item References:

# Link to Azure DevOps work item
git commit -m "feat(ingestion): add gRPC endpoint

Related to: ATP-1234"

# Link multiple work items
git commit -m "fix(query): resolve multiple issues

Fixes: ATP-1234, ATP-5678
Closes: ATP-9999"

# Reference work item in footer
git commit -m "feat(gateway): implement OAuth 2.0

Implements ATP-2345
See also: ATP-2346"
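For release notes or hook tooling, the work item references can be extracted mechanically. A minimal sketch following the `ATP-####` pattern used in the examples above:

```shell
# Sketch: pull the ATP-#### work item references out of a commit message,
# as a commit-msg hook or release-notes script might.
extract_work_items() {
  grep -oE 'ATP-[0-9]+' <<< "$1" | sort -u | xargs
}

msg="fix(query): resolve multiple issues

Fixes: ATP-1234, ATP-5678
Closes: ATP-9999"

extract_work_items "$msg"   # ATP-1234 ATP-5678 ATP-9999
```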

Semantic Prefix Examples

Complete Commit Message Examples:

# Feature with scope
git commit -m "feat(ingestion): add Redis cache support

- Add Redis connection configuration
- Implement cache layer for audit records
- Add cache health checks

Related to: ATP-1234"

# Breaking change
git commit -m "feat(gateway)!: remove legacy authentication

BREAKING CHANGE: Legacy API key authentication removed.
Migrate to OAuth 2.0 before deploying this change.

Migration guide: https://docs.connectsoft.example/migration/oauth

Closes: ATP-5678"

# Hotfix
git commit -m "fix(gateway): patch security vulnerability CVE-2024-1234

URGENT: Security patch for authentication bypass vulnerability.

Related to: ATP-9999 (Critical Security Issue)"

# Configuration change
git commit -m "chore(infra): update resource limits for production

- Increase CPU limit to 2000m
- Increase memory limit to 2Gi
- Update HPA min replicas to 5

Related to: ATP-3456"

Commit Message Templates

Git Commit Template (.gitmessage):

# <type>(<scope>): <subject>
# 
# <body>
# 
# <footer>
#
# Type: feat, fix, docs, style, refactor, test, chore, perf, ci
# Scope: ingestion, query, gateway, integrity, export, policy, search, infra
# 
# Examples:
# feat(ingestion): add gRPC endpoint
# fix(query): resolve memory leak
# docs(gitops): update deployment guide
#
# Related work items: ATP-1234

Configure Git to Use Template:

# Set commit template
git config --global commit.template .gitmessage

# Or per repository
git config commit.template .gitmessage

Release Tagging Strategy

Tagging Releases for Production

Production Release Tagging:

# Tag production release
git checkout production
git pull origin production

# Create annotated tag
git tag -a "v1.2.3" \
  -m "Release v1.2.3

Features:
- Add gRPC endpoint to ingestion service
- Implement Redis cache for query service
- Security enhancements

Breaking Changes:
- None

Related work items: ATP-1234, ATP-5678"

# Push tags
git push origin --tags

Tag Verification:

# List tags
git tag -l "v*"

# Show tag details
git show v1.2.3

# Verify tag signature (if GPG signed)
git tag -v v1.2.3

Service-Specific vs Environment-Wide Tags

Service-Specific Tags:

# Tag specific service version
git tag -a "atp-ingestion/v1.2.3" \
  -m "ATP Ingestion Service v1.2.3"

git tag -a "atp-query/v1.3.0" \
  -m "ATP Query Service v1.3.0"

Environment-Wide Tags:

# Tag environment release
git tag -a "production/2024-01-15" \
  -m "Production Release 2024-01-15

Services:
- atp-ingestion: v1.2.3
- atp-query: v1.3.0
- atp-gateway: v1.1.5"

git tag -a "staging/2024-01-10" \
  -m "Staging Release 2024-01-10"

Tag Naming Conventions

Tag Naming Standards:

| Tag Type | Format | Example | Use Case |
|----------|--------|---------|----------|
| Release | v{MAJOR}.{MINOR}.{PATCH} | v1.2.3 | Production releases |
| Pre-release | v{MAJOR}.{MINOR}.{PATCH}-{PRERELEASE} | v1.2.3-rc1 | Release candidates |
| Service | {service}/v{VERSION} | atp-ingestion/v1.2.3 | Service-specific |
| Environment | {env}/{DATE} | production/2024-01-15 | Environment snapshots |
| Hotfix | hotfix/v{VERSION}-{ISSUE} | hotfix/v1.2.4-CVE-2024-1234 | Hotfixes |
| Promotion | promote-to-{env}/{VERSION} | promote-to-test/v1.2.3 | Promotion triggers |
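Tooling that reacts to tags (promotion pipelines, release automation) needs to tell these formats apart. A minimal classifier sketch using shell glob patterns, ordered so the more specific forms win:

```shell
# Sketch: classify a Git tag against the naming standards above.
# Patterns are ordered most-specific first (hotfix/promotion before service).
classify_tag() {
  case "$1" in
    hotfix/v*)                  echo "hotfix" ;;
    promote-to-*/*)             echo "promotion" ;;
    v[0-9]*.[0-9]*.[0-9]*-*)    echo "pre-release" ;;
    v[0-9]*.[0-9]*.[0-9]*)      echo "release" ;;
    */v[0-9]*)                  echo "service" ;;
    */[0-9][0-9][0-9][0-9]-*)   echo "environment" ;;
    *)                          echo "unknown" ;;
  esac
}

classify_tag "v1.2.3"                       # release
classify_tag "v1.2.3-rc1"                   # pre-release
classify_tag "atp-ingestion/v1.2.3"         # service
classify_tag "hotfix/v1.2.4-CVE-2024-1234"  # hotfix
```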

Automated Tag Creation

Automated Tagging in CI/CD:

# Azure Pipeline: Auto-tag on production merge
trigger:
  branches:
    include:
      - production

pool:
  vmImage: 'ubuntu-latest'

stages:
- stage: TagRelease
  displayName: 'Tag Production Release'
  jobs:
  - job: CreateTag
    steps:
    - script: |
        # Extract the release version from the Helm chart
        # (Chart.yaml carries a top-level "version:" field; a Deployment manifest does not)
        VERSION=$(grep -E '^version:' apps/atp-ingestion/helm/Chart.yaml | awk '{print $2}' | tr -d '"')

        # Create tag
        git config user.name "Azure DevOps"
        git config user.email "devops@connectsoft.example"

        git tag -a "v$VERSION" \
          -m "Release v$VERSION

        Automated release from commit $(Build.SourceVersion)
        Pipeline: $(Build.BuildNumber)"

        git push origin "v$VERSION"

        echo "✅ Tagged release: v$VERSION"
      displayName: 'Create and push tag'

Rollback via Git Revert

Simple Rollback (Single Service)

Single Service Rollback:

# 1. Identify commit to revert
git log --oneline production | grep "atp-ingestion"
# abc123 feat(ingestion): add gRPC endpoint

# 2. Revert commit
git checkout production
git pull origin production
git revert abc123 --no-edit

# 3. Push revert commit
git push origin production

# 4. Verify rollback
git log --oneline -5 production
# def456 Revert "feat(ingestion): add gRPC endpoint"
# abc123 feat(ingestion): add gRPC endpoint

Complex Rollback (Multiple Services)

Multiple Service Rollback:

# 1. Identify commits to revert
git log --oneline production | grep -E "(ingestion|query|gateway)" | head -5
# abc123 feat(ingestion): add gRPC endpoint
# def456 feat(query): add Redis cache
# ghi789 feat(gateway): update authentication

# 2. Create rollback branch
git checkout production
git pull origin production
git checkout -b rollback/2024-01-15-multiple-services

# 3. Revert commits (newest first)
git revert ghi789 --no-edit  # Gateway
git revert def456 --no-edit  # Query
git revert abc123 --no-edit  # Ingestion

# 4. Test rollback
kubectl apply --dry-run=client -f apps/ --recursive

# 5. Create PR for rollback
az repos pr create \
  --source-branch rollback/2024-01-15-multiple-services \
  --target-branch production \
  --title "ROLLBACK: Multiple services 2024-01-15" \
  --description "Revert changes to ingestion, query, and gateway services"

# 6. After approval, merge
az repos pr update --id <PR_ID> --status completed

Git Revert vs Reset

Git Revert vs Reset Comparison:

| Method | Command | History | Safety | Use Case |
|--------|---------|---------|--------|----------|
| Revert | git revert | Preserves (creates new commit) | ✅ Safe (non-destructive) | Production rollback |
| Reset | git reset --hard | Rewrites (destroys commits) | ❌ Dangerous (destructive) | Development only |

Git Reset (Development Only):

# ⚠️ WARNING: Only use in development branches!
git checkout dev
git reset --hard HEAD~3  # Remove last 3 commits
git push --force origin dev  # Force push (destructive!)

Git Revert (Production):

# ✅ Safe for production
git checkout production
git revert abc123  # Creates new commit undoing abc123
git push origin production  # Safe push

Rollback Validation

Rollback Validation Script:

#!/bin/bash
# scripts/validate-rollback.sh

set -euo pipefail

COMMIT_TO_REVERT="${1:-HEAD}"
NAMESPACE="${2:-atp-production}"

echo "🔄 Validating rollback for commit $COMMIT_TO_REVERT..."

# 1. Preview rollback changes
git revert --no-commit $COMMIT_TO_REVERT
git diff --stat

# 2. Validate manifests
find apps/ -name "*.yaml" -path "*/base/*" | while read manifest; do
  kubeval "$manifest" || exit 1
done

# 3. Dry-run apply
kubectl apply --dry-run=client -f apps/ --recursive

# 4. Check for breaking changes
git log $COMMIT_TO_REVERT -1 --pretty=format:"%B" | grep -i "BREAKING" && \
  echo "⚠️  WARNING: Reverting a breaking change!" || \
  echo "✅ No breaking changes detected"

# 5. Restore state
git reset --hard HEAD

echo "✅ Rollback validation complete"

Execute Rollback:

# Validate rollback
./scripts/validate-rollback.sh abc123 atp-production

# Execute rollback
git checkout production
git pull origin production
git revert abc123 --no-edit

# Apply to cluster
kubectl apply -f apps/atp-ingestion/overlays/production/

# Verify rollback
kubectl rollout status deployment/atp-ingestion -n atp-production
kubectl get pods -n atp-production -l app=atp-ingestion

# Run smoke tests
./scripts/run-smoke-tests.sh --environment production

echo "✅ Rollback completed and verified"

Summary: Git Workflow & Environment Promotion

  • Feature Branch Workflow: Git-centric development with conventional commits and GPG signing
  • Pull Request Process: Comprehensive PR templates, automated validation, and approval workflows
  • Automated PR Validation: YAML linting, security scanning, dry-run validation, breaking change detection
  • Merge Strategies: Squash merge (production), merge commit (test), rebase (dev)
  • Environment Promotion: Automated (dev→test), manual (test→staging→production) with CAB approval
  • Hotfix Workflow: Expedited process with back-merge to all environments
  • Preview Environments: Ephemeral namespaces per PR for isolated testing
  • Commit Conventions: Conventional commits with work item linking
  • Release Tagging: Semantic versioning with service-specific and environment-wide tags
  • Rollback Procedures: Git revert for safe production rollbacks with validation

Azure Pipelines to GitOps Handoff

Purpose: Define how Azure Pipelines (CI) hand off to the GitOps workflow by automating artifact publishing, manifest updates, and Git commits, ensuring a clear separation of concerns between build/test (CI) and deployment/reconciliation (GitOps).


Separation of Concerns

CI Pipeline Responsibilities (Build, Test, Publish)

CI Pipeline Scope (Azure Pipelines):

| Responsibility | Description | Examples |
|----------------|-------------|----------|
| Source Code Build | Compile, package applications | dotnet build, npm build |
| Unit Testing | Run unit tests, code coverage | dotnet test, jest |
| Integration Testing | Test service interactions | Test containers, API tests |
| Security Scanning | SAST, DAST, dependency scanning | Snyk, Trivy, OWASP ZAP |
| Artifact Creation | Build Docker images, NuGet packages | docker build, dotnet pack |
| Artifact Publishing | Push to registry (ACR, NuGet feed) | docker push, helm push |
| SBOM Generation | Software Bill of Materials | Syft, SPDX format |
| Metadata Recording | Build provenance, vulnerability reports | In-toto attestations |
| GitOps Manifest Update | Commit image tag updates to GitOps repo | Automated Git commits |

CI Pipeline Boundaries:

# CI Pipeline Responsibilities
✅ Build application code
✅ Run tests (unit, integration, security)
✅ Build and push container images to ACR
✅ Generate and publish SBOM
✅ Update GitOps repository with new image tags
✅ Trigger FluxCD sync (via webhook or polling)

❌ NOT: Deploy directly to Kubernetes
❌ NOT: Manage Kubernetes cluster state
❌ NOT: Handle reconciliation loops
❌ NOT: Monitor cluster health

GitOps Responsibilities (Deploy, Reconcile, Monitor)

GitOps Scope (FluxCD on AKS):

| Responsibility | Description | Examples |
|----------------|-------------|----------|
| Git Repository Watch | Poll Git repository for changes | FluxCD Source Controller |
| Manifest Rendering | Render Kustomize/Helm manifests | FluxCD Kustomize/Helm Controller |
| Cluster Deployment | Apply manifests to Kubernetes | kubectl apply (via FluxCD) |
| State Reconciliation | Detect and correct drift | Continuous reconciliation loop |
| Health Monitoring | Monitor deployment health | FluxCD health checks |
| Rollback Management | Revert to previous Git commits | Git revert operations |

GitOps Boundaries:

# GitOps Responsibilities
✅ Watch Git repository for manifest changes
✅ Render and apply Kubernetes manifests
✅ Reconcile cluster state with Git
✅ Detect and correct configuration drift
✅ Monitor deployment health
✅ Rollback via Git operations

❌ NOT: Build application code
❌ NOT: Run unit/integration tests
❌ NOT: Build container images
❌ NOT: Publish artifacts to registries

Clear Handoff Point: Artifact Publishing

Handoff Architecture:

graph LR
    A[Source Code<br/>Repository] -->|trigger| B[Azure Pipelines<br/>CI Pipeline]
    B -->|build + test| C[Container Image<br/>+ SBOM]
    C -->|push| D[Azure Container<br/>Registry ACR]
    B -->|update manifests| E[GitOps Repository<br/>Azure Repos]
    E -->|watch| F[FluxCD<br/>Source Controller]
    F -->|fetch artifacts| G[FluxCD<br/>Kustomize Controller]
    G -->|deploy| H[AKS Cluster<br/>Production]

    style B fill:#90EE90
    style D fill:#FFE5B4
    style E fill:#90EE90
    style F fill:#FFE5B4
    style G fill:#FFE5B4
    style H fill:#ffcccc

Handoff Criteria:

  1. Artifact Published: Image pushed to ACR with immutable tag
  2. SBOM Generated: Software Bill of Materials published
  3. Vulnerability Scan: Security scan results available
  4. Manifest Updated: GitOps repository contains new image tag
  5. Commit Signed: Git commit signed (for production)
  6. CI Tests Passing: All CI validation gates passed

Handoff Checklist:

## CI → GitOps Handoff Checklist

### Artifacts
- [ ] Container image built and pushed to ACR
- [ ] Image tagged with version + commit SHA (immutable)
- [ ] SBOM generated and published
- [ ] Vulnerability scan completed and results recorded

### Manifest Updates
- [ ] Image tag updated in GitOps repository
- [ ] Kustomize/Helm manifest files updated
- [ ] Changes committed to Git
- [ ] Commit signed (production only)

### Validation
- [ ] All CI tests passing
- [ ] Security scans passing
- [ ] Manifest validation passing
- [ ] Build provenance recorded
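The "Manifest Updated" criterion can be automated as a gate before handing off to FluxCD. A runnable sketch; paths and registry names are illustrative, and a sample kustomization is fabricated in a temp directory so the check is self-contained (a real pipeline would point at the checked-out GitOps repo):

```shell
# Sketch: verify handoff criterion 4 - the GitOps repo must reference
# the tag CI just pushed. Sample data is created so this runs standalone.
IMAGE_TAG="v1.2.3-abc123d"
repo=$(mktemp -d)
mkdir -p "$repo/apps/atp-ingestion/base"
cat > "$repo/apps/atp-ingestion/base/kustomization.yaml" <<EOF
images:
  - name: connectsoft.azurecr.io/atp/ingestion
    newTag: ${IMAGE_TAG}
EOF

if grep -rq "newTag: ${IMAGE_TAG}" "$repo/apps/atp-ingestion"; then
  echo "handoff ok: manifests reference ${IMAGE_TAG}"
else
  echo "handoff blocked: manifests not updated" >&2
  exit 1
fi
```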

Azure Pipeline Stages

Build: Compile, Unit Test

Build Stage:

# azure-pipelines.yml
stages:
- stage: Build
  displayName: 'Build and Test'
  jobs:
  - job: BuildApplication
    displayName: 'Build ATP Ingestion Service'
    pool:
      vmImage: 'ubuntu-latest'

    variables:
      - group: ATP-Common-Variables
      - name: BuildConfiguration
        value: 'Release'
      - name: ServiceName
        value: 'atp-ingestion'

    steps:
    # Checkout source code
    - checkout: self
      fetchDepth: 0  # Full history for version calculation

    # Setup .NET SDK
    - task: UseDotNet@2
      inputs:
        packageType: 'sdk'
        version: '8.0.x'

    # Restore dependencies
    - script: |
        dotnet restore src/ConnectSoft.ATP.Ingestion/ConnectSoft.ATP.Ingestion.csproj
      displayName: 'Restore NuGet packages'

    # Build application
    - script: |
        dotnet build src/ConnectSoft.ATP.Ingestion/ConnectSoft.ATP.Ingestion.csproj \
          --configuration $(BuildConfiguration) \
          --no-restore \
          -p:Version=$(Build.BuildNumber)
      displayName: 'Build application'

    # Run unit tests
    - script: |
        dotnet test src/ConnectSoft.ATP.Ingestion.Tests/ConnectSoft.ATP.Ingestion.Tests.csproj \
          --configuration $(BuildConfiguration) \
          --no-build \
          --collect:"XPlat Code Coverage" \
          --results-directory $(Agent.TempDirectory)/test-results \
          --logger "trx;LogFileName=test-results.trx"
      displayName: 'Run unit tests'
      continueOnError: false

    # Publish test results
    - task: PublishTestResults@2
      condition: always()
      inputs:
        testResultsFormat: 'VSTest'
        testResultsFiles: '$(Agent.TempDirectory)/test-results/**/*.trx'
        testRunTitle: 'Unit Tests - $(ServiceName)'

    # Publish code coverage
    - task: PublishCodeCoverageResults@1
      condition: always()
      inputs:
        codeCoverageTool: 'Cobertura'
        summaryFileLocation: '$(Agent.TempDirectory)/test-results/**/coverage.cobertura.xml'

Test: Integration Test, Security Scan

Test Stage:

- stage: Test
  displayName: 'Integration Tests and Security Scans'
  dependsOn: Build
  jobs:
  - job: IntegrationTests
    displayName: 'Integration Tests'
    pool:
      vmImage: 'ubuntu-latest'

    services:
      postgres: postgres
      redis: redis

    steps:
    - checkout: self

    - task: UseDotNet@2
      inputs:
        packageType: 'sdk'
        version: '8.0.x'

    - script: |
        dotnet test src/ConnectSoft.ATP.Ingestion.IntegrationTests/ \
          --configuration Release \
          --logger "trx;LogFileName=integration-test-results.trx"
      displayName: 'Run integration tests'
      env:
        ConnectionStrings__Database: $(PostgresConnectionString)
        ConnectionStrings__Redis: $(RedisConnectionString)

    - task: PublishTestResults@2
      condition: always()
      inputs:
        testResultsFormat: 'VSTest'
        testResultsFiles: '**/integration-test-results.trx'
        testRunTitle: 'Integration Tests - $(ServiceName)'

  - job: SecurityScan
    displayName: 'Security Scanning'
    pool:
      vmImage: 'ubuntu-latest'

    steps:
    - checkout: self

    # SAST (Static Application Security Testing)
    - task: SnykSecurityScan@1
      inputs:
        serviceConnectionEndpoint: 'Snyk-Service-Connection'
        testType: 'app'
        severityThreshold: 'high'

    # Dependency Vulnerability Scan
    - script: |
        dotnet list package --vulnerable --include-transitive
      displayName: 'Check for vulnerable NuGet packages'

    # Container Image Scan (note: the image is built and pushed in the Publish
    # stage; this scan only succeeds if the tag already exists in ACR)
    - script: |
        trivy image --severity HIGH,CRITICAL \
          connectsoft.azurecr.io/atp/ingestion:$(Build.BuildNumber) \
          --format json \
          --output trivy-scan-results.json
      displayName: 'Scan container image with Trivy'
      condition: and(succeeded(), ne(variables['Build.Reason'], 'PullRequest'))

Publish: Push to ACR, Generate SBOM

Publish Stage:

- stage: Publish
  displayName: 'Publish Artifacts'
  dependsOn: Test
  condition: and(succeeded(), ne(variables['Build.Reason'], 'PullRequest'))
  jobs:
  - job: BuildAndPushImage
    displayName: 'Build and Push Container Image'
    pool:
      vmImage: 'ubuntu-latest'

    variables:
      - name: ImageRepository
        value: 'connectsoft.azurecr.io/atp/ingestion'
      - name: ImageTag
        value: '$(Build.BuildNumber)-$(Build.SourceVersion)'  # e.g. 20240115.1-<full SHA>; the short v1.2.3-abc123d form is produced in "Image Tag Generation" below

    steps:
    - checkout: self

    # Login to ACR
    - task: Docker@2
      displayName: 'Login to Azure Container Registry'
      inputs:
        command: 'login'
        containerRegistry: 'ConnectSoft-ACR'

    # Build Docker image
    - task: Docker@2
      displayName: 'Build Docker image'
      inputs:
        command: 'build'
        containerRegistry: 'ConnectSoft-ACR'
        repository: 'atp/ingestion'
        dockerfile: 'src/ConnectSoft.ATP.Ingestion/Dockerfile'
        tags: |
          $(ImageTag)
          latest  # Only for dev branch
        buildContext: '$(Build.SourcesDirectory)'
        arguments: |
          --build-arg BUILD_VERSION=$(Build.BuildNumber)
          --build-arg BUILD_COMMIT=$(Build.SourceVersion)
          --build-arg BUILD_ID=$(Build.BuildId)

    # Generate SBOM
    - script: |
        # Install Syft
        curl -sSfL https://raw.githubusercontent.com/anchore/syft/main/install.sh | sh -s -- -b /usr/local/bin

        # Generate SBOM
        syft packages docker:$(ImageRepository):$(ImageTag) \
          --output spdx-json \
          --file sbom-$(ImageTag).spdx.json

        echo "✅ SBOM generated: sbom-$(ImageTag).spdx.json"
      displayName: 'Generate SBOM (Software Bill of Materials)'

    # Scan image for vulnerabilities
    - script: |
        trivy image \
          --format json \
          --output trivy-$(ImageTag).json \
          --severity HIGH,CRITICAL \
          $(ImageRepository):$(ImageTag)
      displayName: 'Scan image for vulnerabilities'

    # Push image to ACR
    - task: Docker@2
      displayName: 'Push Docker image to ACR'
      inputs:
        command: 'push'
        containerRegistry: 'ConnectSoft-ACR'
        repository: 'atp/ingestion'
        tags: |
          $(ImageTag)

    # Attach SBOM and scan results as pipeline artifacts
    - task: PublishPipelineArtifact@1
      displayName: 'Publish SBOM'
      inputs:
        targetPath: 'sbom-$(ImageTag).spdx.json'
        artifactName: 'sbom-$(ImageTag)'
        publishLocation: 'pipeline'

    - task: PublishPipelineArtifact@1
      displayName: 'Publish vulnerability scan results'
      inputs:
        targetPath: 'trivy-$(ImageTag).json'
        artifactName: 'vulnerability-scan-$(ImageTag)'
        publishLocation: 'pipeline'

    # Attach build metadata to the image. Note: `az acr repository update` does
    # not accept arbitrary metadata; attach the SBOM as an OCI referrer artifact
    # with ORAS instead (the artifact type below is an illustrative convention)
    - script: |
        oras attach \
          --artifact-type application/vnd.connectsoft.build-metadata.v1+json \
          $(ImageRepository):$(ImageTag) \
          sbom-$(ImageTag).spdx.json:application/spdx+json
      displayName: 'Attach SBOM to image as OCI referrer'

Update GitOps: Commit Manifest Changes

Update GitOps Stage:

- stage: UpdateGitOps
  displayName: 'Update GitOps Repository'
  dependsOn: Publish
  condition: and(succeeded(), ne(variables['Build.Reason'], 'PullRequest'))
  jobs:
  - job: UpdateManifests
    displayName: 'Update GitOps Manifests'
    pool:
      vmImage: 'ubuntu-latest'

    variables:
      - name: GitOpsRepoUrl
        value: 'https://dev.azure.com/ConnectSoft/ATP/_git/atp-gitops'
      # Compile-time conditional: builds from main update production, all others dev
      - name: TargetBranch
        ${{ if eq(variables['Build.SourceBranch'], 'refs/heads/main') }}:
          value: 'production'
        ${{ else }}:
          value: 'dev'

    steps:
    # Checkout GitOps repository
    - checkout: git://ATP/atp-gitops@$(TargetBranch)
      displayName: 'Checkout GitOps repository'
      path: gitops-repo

    # Install required tools
    - script: |
        # Install kustomize
        curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" | bash
        sudo mv kustomize /usr/local/bin/

        # Install yq (YAML processor)
        wget -q https://github.com/mikefarah/yq/releases/latest/download/yq_linux_amd64 -O yq
        chmod +x yq
        sudo mv yq /usr/local/bin/
      displayName: 'Install tools (kustomize, yq)'

    # Update Kustomize image tags
    - script: |
        cd gitops-repo

        # Update image tag in the Kustomize base (kustomize edit must run in
        # the directory containing kustomization.yaml)
        (cd apps/atp-ingestion/base && \
          kustomize edit set image \
            connectsoft.azurecr.io/atp/ingestion=$(ImageRepository):$(ImageTag))

        # Update image tag in all overlays (dev, test, staging, production);
        # subshells avoid error-prone "cd ../../.." bookkeeping
        for overlay in dev test staging production; do
          if [ -d "apps/atp-ingestion/overlays/$overlay" ]; then
            (cd "apps/atp-ingestion/overlays/$overlay" && \
              kustomize edit set image \
                connectsoft.azurecr.io/atp/ingestion=$(ImageRepository):$(ImageTag))
          fi
        done

        echo "✅ Updated Kustomize image tags"
      displayName: 'Update Kustomize image tags'

    # Update Helm values files
    - script: |
        cd gitops-repo

        # Update image tag in Helm values for target branch
        yq eval -i '.image.tag = "$(ImageTag)"' \
          apps/atp-ingestion/helm/values-$(TargetBranch).yaml

        # Also update default values.yaml if exists
        if [ -f "apps/atp-ingestion/helm/values.yaml" ]; then
          yq eval -i '.image.tag = "$(ImageTag)"' \
            apps/atp-ingestion/helm/values.yaml
        fi

        echo "✅ Updated Helm values files"
      displayName: 'Update Helm values files'

    # Validate updated manifests
    - script: |
        cd gitops-repo

        # Validate Kustomize builds
        kustomize build apps/atp-ingestion/overlays/$(TargetBranch)/ > /dev/null
        echo "✅ Kustomize build validation passed"

        # Validate Helm templates
        helm template atp-ingestion apps/atp-ingestion/helm/ \
          --values apps/atp-ingestion/helm/values-$(TargetBranch).yaml > /dev/null
        echo "✅ Helm template validation passed"
      displayName: 'Validate updated manifests'

    # Commit and push changes
    - script: |
        cd gitops-repo

        # Configure Git
        git config user.name "Azure DevOps Pipeline"
        git config user.email "azure-devops@connectsoft.example"

        # Check for changes
        if [ -z "$(git status --porcelain)" ]; then
          echo "ℹ️  No changes to commit"
          exit 0
        fi

        # Stage changes
        git add apps/atp-ingestion/

        # Commit with conventional commit format
        git commit -m "chore(ingestion): update image tag to $(ImageTag)

        Automated update from CI pipeline:
        - Image: $(ImageRepository):$(ImageTag)
        - Build: $(Build.BuildNumber)
        - Commit: $(Build.SourceVersion)
        - Pipeline: $(Build.BuildUri)

        Related to: $(System.PullRequest.PullRequestId)"

        # Push to GitOps repository
        git push origin $(TargetBranch)

        echo "✅ Pushed manifest updates to GitOps repository"
      displayName: 'Commit and push to GitOps repository'
      env:
        SYSTEM_ACCESSTOKEN: $(System.AccessToken)

Image Tag Generation

Semantic Version from Git Tag

Version Extraction Script:

#!/bin/bash
# scripts/extract-version.sh

set -euo pipefail

# Extract version from Git tag
VERSION_TAG=$(git describe --tags --exact-match 2>/dev/null || echo "")

if [ -n "$VERSION_TAG" ]; then
  # Use version from Git tag (e.g., v1.2.3)
  VERSION=$(echo "$VERSION_TAG" | sed 's/^v//')
  echo "✅ Version from Git tag: $VERSION"
else
  # Fallback: use the <Version> from a project file (recursive grep;
  # a "src/**" glob would require bash globstar to be enabled)
  VERSION=$(grep -rEh '<Version>.*</Version>' src --include='*.csproj' | head -1 | sed -E 's/.*<Version>(.*)<\/Version>.*/\1/')

  if [ -z "$VERSION" ]; then
    # Last resort: Use build number format
    VERSION="${BUILD_BUILDNUMBER:-1.0.0}"
  fi

  echo "⚠️  Version from project file/build number: $VERSION"
fi

echo "##vso[task.setvariable variable=Version]$VERSION"

Short Commit SHA for Traceability

Commit SHA Extraction:

# Extract short commit SHA (7 characters)
SHORT_SHA=$(git rev-parse --short=7 HEAD)
echo "Commit SHA: $SHORT_SHA"

# Example output: abc123d

Build Number for Uniqueness

Build Number Format:

# Build number format: v{MAJOR}.{MINOR}.{PATCH}.{BUILD_ID}
BUILD_NUMBER="${BUILD_BUILDNUMBER}"  # e.g., 20240115.1

# Or use Build.BuildId (unique incrementing number)
BUILD_ID="${BUILD_BUILDID}"  # e.g., 12345

Tag Format: v1.2.3-abc123d

Complete Tag Generation:

# Azure Pipeline: Generate image tag
- script: |
    # Extract version from Git tag or project file
    VERSION_TAG=$(git describe --tags --exact-match 2>/dev/null || echo "")

    if [ -n "$VERSION_TAG" ]; then
      VERSION=$(echo "$VERSION_TAG" | sed 's/^v//')
    else
      # Extract from a .csproj (recursive grep), or fall back to a default
      VERSION=$(grep -rhoP '<Version>\K[^<]+' src --include='*.csproj' | head -1)
      VERSION="${VERSION:-1.0.0}"
    fi

    # Extract short commit SHA
    SHORT_SHA=$(git rev-parse --short=7 HEAD)

    # Generate image tag: v1.2.3-abc123d
    IMAGE_TAG="v${VERSION}-${SHORT_SHA}"

    echo "Image tag: $IMAGE_TAG"
    echo "##vso[task.setvariable variable=ImageTag]$IMAGE_TAG"
    echo "##vso[task.setvariable variable=Version]$VERSION"
    echo "##vso[task.setvariable variable=ShortSha]$SHORT_SHA"
  displayName: 'Generate image tag'

Tag Format Examples:

| Source | Version | Commit SHA | Image Tag | Description |
|--------|---------|------------|-----------|-------------|
| Git Tag | v1.2.3 | abc123d | v1.2.3-abc123d | Semantic version + SHA |
| Project File | 1.2.3 | abc123d | v1.2.3-abc123d | Version from .csproj + SHA |
| Build Number | 20240115.1 | abc123d | v20240115.1-abc123d | Build number + SHA |
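The composition rule in the table reduces to one line of parameter expansion. A minimal sketch that normalizes the version (stripping any leading `v`) before appending the short SHA:

```shell
# Sketch: compose the image tag per the table above - strip a leading "v"
# if present, re-add it once, then append the short commit SHA.
make_image_tag() {
  local version="$1" short_sha="$2"
  echo "v${version#v}-${short_sha}"
}

make_image_tag "v1.2.3" "abc123d"      # v1.2.3-abc123d
make_image_tag "1.2.3" "abc123d"       # v1.2.3-abc123d
make_image_tag "20240115.1" "abc123d"  # v20240115.1-abc123d
```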

Automated Manifest Update

Pipeline Script to Update Image Tag in GitOps Repo

Manifest Update Script (scripts/update-gitops-manifests.sh):

#!/bin/bash
# scripts/update-gitops-manifests.sh

set -euo pipefail

SERVICE_NAME="${1:-atp-ingestion}"
IMAGE_REPOSITORY="${2:-connectsoft.azurecr.io/atp/ingestion}"
IMAGE_TAG="${3:-latest}"
TARGET_BRANCH="${4:-dev}"
GITOPS_REPO_PATH="${5:-gitops-repo}"

echo "🔄 Updating GitOps manifests for $SERVICE_NAME..."
echo "  Image: $IMAGE_REPOSITORY:$IMAGE_TAG"
echo "  Branch: $TARGET_BRANCH"

cd "$GITOPS_REPO_PATH"

# Update Kustomize image tags
if [ -d "apps/$SERVICE_NAME/base" ]; then
  echo "📝 Updating Kustomize base..."
  cd "apps/$SERVICE_NAME/base"

  # Update image tag using kustomize edit
  kustomize edit set image \
    "$IMAGE_REPOSITORY:$IMAGE_TAG"

  cd ../../../
fi

# Update Kustomize overlays
for overlay in dev test staging production; do
  if [ -d "apps/$SERVICE_NAME/overlays/$overlay" ]; then
    echo "📝 Updating Kustomize overlay: $overlay..."
    cd "apps/$SERVICE_NAME/overlays/$overlay"

    kustomize edit set image \
      "$IMAGE_REPOSITORY:$IMAGE_TAG"

    cd ../../../..
  fi
done

# Update Helm values files
if [ -d "apps/$SERVICE_NAME/helm" ]; then
  echo "📝 Updating Helm values..."
  cd "apps/$SERVICE_NAME/helm"

  # Update target branch values file
  if [ -f "values-${TARGET_BRANCH}.yaml" ]; then
    yq eval -i ".image.tag = \"$IMAGE_TAG\"" \
      "values-${TARGET_BRANCH}.yaml"
    echo "  ✅ Updated values-${TARGET_BRANCH}.yaml"
  fi

  # Update default values.yaml
  if [ -f "values.yaml" ]; then
    yq eval -i ".image.tag = \"$IMAGE_TAG\"" \
      "values.yaml"
    echo "  ✅ Updated values.yaml"
  fi

  cd ../../..
fi

echo "✅ Manifest update complete"

Kustomize Image Tag Replacement

Kustomize Update Example:

# apps/atp-ingestion/base/kustomization.yaml (before)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - deployment.yaml
  - service.yaml

images:
  - name: connectsoft.azurecr.io/atp/ingestion
    newTag: v1.2.2-def456e  # Old tag

# Update image tag using kustomize edit
kustomize edit set image \
  connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d

# Result: apps/atp-ingestion/base/kustomization.yaml (after)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - deployment.yaml
  - service.yaml

images:
  - name: connectsoft.azurecr.io/atp/ingestion
    newTag: v1.2.3-abc123d  # New tag

Helm Values File Update

Helm Values Update Example:

# apps/atp-ingestion/helm/values-production.yaml (before)
image:
  repository: connectsoft.azurecr.io/atp/ingestion
  tag: v1.2.2-def456e  # Old tag
  pullPolicy: IfNotPresent

# Update Helm values using yq
yq eval -i '.image.tag = "v1.2.3-abc123d"' \
  apps/atp-ingestion/helm/values-production.yaml

# Result: apps/atp-ingestion/helm/values-production.yaml (after)
image:
  repository: connectsoft.azurecr.io/atp/ingestion
  tag: v1.2.3-abc123d  # New tag
  pullPolicy: IfNotPresent

Git Commit and Push Automation

Automated Git Commit Script:

#!/bin/bash
# scripts/commit-gitops-changes.sh

set -euo pipefail

SERVICE_NAME="${1:-atp-ingestion}"
IMAGE_TAG="${2:-latest}"
TARGET_BRANCH="${3:-dev}"
BUILD_NUMBER="${4:-unknown}"
COMMIT_SHA="${5:-unknown}"
GITOPS_REPO_PATH="${6:-gitops-repo}"

cd "$GITOPS_REPO_PATH"

# Check for changes
if [ -z "$(git status --porcelain)" ]; then
  echo "ℹ️  No changes to commit"
  exit 0
fi

# Configure Git
git config user.name "Azure DevOps Pipeline"
git config user.email "azure-devops@connectsoft.example"

# Stage all changes
git add apps/$SERVICE_NAME/

# Create commit message
COMMIT_MESSAGE="chore($SERVICE_NAME): update image tag to $IMAGE_TAG

Automated update from CI pipeline:
- Service: $SERVICE_NAME
- Image Tag: $IMAGE_TAG
- Build Number: $BUILD_NUMBER
- Source Commit: $COMMIT_SHA
- Pipeline: ${BUILD_BUILDURI:-unknown}

Related to: ${SYSTEM_PULLREQUEST_PULLREQUESTID:-n/a}"

# Commit changes
if [ -n "${GPG_KEY_ID:-}" ]; then
  # Sign commit with GPG (production)
  git commit -S -m "$COMMIT_MESSAGE"
else
  # Unsigned commit (dev/test)
  git commit -m "$COMMIT_MESSAGE"
fi

# Push to target branch
git push origin "$TARGET_BRANCH"

echo "✅ Changes committed and pushed to $TARGET_BRANCH"
echo "   Commit: $(git rev-parse HEAD)"

Azure Pipeline Integration:

- script: |
    chmod +x scripts/update-gitops-manifests.sh
    chmod +x scripts/commit-gitops-changes.sh

    # Update manifests
    ./scripts/update-gitops-manifests.sh \
      atp-ingestion \
      $(ImageRepository):$(ImageTag) \
      $(TargetBranch) \
      gitops-repo

    # Commit and push
    ./scripts/commit-gitops-changes.sh \
      atp-ingestion \
      $(ImageTag) \
      $(TargetBranch) \
      $(Build.BuildNumber) \
      $(Build.SourceVersion) \
      gitops-repo
  displayName: 'Update GitOps repository'
  env:
    SYSTEM_ACCESSTOKEN: $(System.AccessToken)
    # Inject the signing key only for production promotions (compile-time condition)
    ${{ if eq(variables['TargetBranch'], 'production') }}:
      GPG_KEY_ID: $(GPG_KEY_ID)

Commit Back to GitOps Repository

Service Account Credentials (PAT or SSH)

Personal Access Token (PAT) Setup:

# Create PAT in Azure DevOps:
# 1. User Settings > Personal Access Tokens > New Token
# 2. Name: "GitOps Pipeline Service Account"
# 3. Organization: All accessible organizations
# 4. Scopes: Code (Read & Write)
# 5. Copy token

# Store PAT as Azure DevOps variable group
az pipelines variable-group create \
  --name "GitOps-Credentials" \
  --variables \
    gitopsPat="<PAT_TOKEN>" \
  --authorize true

SSH Key Setup:

# Generate SSH key for pipeline
ssh-keygen -t rsa -b 4096 -f ~/.ssh/gitops-pipeline -N ""

# Add public key to Azure DevOps
# Azure DevOps > User Settings > SSH Public Keys > New Key
cat ~/.ssh/gitops-pipeline.pub

# Store private key as Azure DevOps secret variable
az pipelines variable-group variable create \
  --group-id <GROUP_ID> \
  --name gitopsSshPrivateKey \
  --value "$(cat ~/.ssh/gitops-pipeline | base64)" \
  --secret true

Using Credentials in Pipeline:

# Option 1: Use System.AccessToken (recommended for same organization)
- checkout: git://ATP/atp-gitops@$(TargetBranch)
  displayName: 'Checkout GitOps repository'
  path: gitops-repo
  persistCredentials: true  # Use System.AccessToken

# Option 2: Use PAT from variable group
- script: |
    git config --global credential.helper store
    echo "https://PAT:$(gitopsPat)@dev.azure.com" > ~/.git-credentials
  displayName: 'Configure Git credentials'
  env:
    gitopsPat: $(gitopsPat)

# Option 3: Use SSH key
- script: |
    mkdir -p ~/.ssh
    echo "$(gitopsSshPrivateKey)" | base64 -d > ~/.ssh/id_rsa
    chmod 600 ~/.ssh/id_rsa
    ssh-keyscan ssh.dev.azure.com >> ~/.ssh/known_hosts
  displayName: 'Configure SSH key'
  env:
    gitopsSshPrivateKey: $(gitopsSshPrivateKey)

Commit Message Format

Standardized Commit Message:

<type>(<scope>): <subject>

<body>

<footer>

Example Commit Messages:

# Single service update
chore(ingestion): update image tag to v1.2.3-abc123d

Automated update from CI pipeline:
- Service: atp-ingestion
- Image Tag: v1.2.3-abc123d
- Build Number: 20240115.1
- Source Commit: abc123def456
- Pipeline: https://dev.azure.com/ConnectSoft/ATP/_build/results?buildId=12345

Related to: PR-123

# Multiple services update
chore(*): update image tags for release v1.2.3

Automated update from CI pipeline:
- Services: atp-ingestion, atp-query, atp-gateway
- Version: v1.2.3
- Build Number: 20240115.1
- Pipeline: https://dev.azure.com/ConnectSoft/ATP/_build/results?buildId=12345

Services updated:
- atp-ingestion: v1.2.3-abc123d
- atp-query: v1.2.3-def456e
- atp-gateway: v1.2.3-ghi789f

Signed Commits for Audit Trail

GPG Commit Signing Setup:

# Generate GPG key for pipeline service account
gpg --batch --gen-key <<EOF
%no-protection
Key-Type: RSA
Key-Length: 4096
Name-Real: Azure DevOps Pipeline
Name-Email: azure-devops@connectsoft.example
Expire-Date: 0
EOF

# Export public key
gpg --armor --export azure-devops@connectsoft.example > pipeline-gpg-public.key

# Export private key (base64 encoded for storage)
gpg --export-secret-keys --armor azure-devops@connectsoft.example | base64 > pipeline-gpg-private.key.b64

# Store private key as Azure DevOps secret variable

Using GPG in Pipeline:

- script: |
    # Import GPG key
    echo "$(gpgPrivateKey)" | base64 -d | gpg --batch --import
    gpg --list-secret-keys --keyid-format LONG

    # Configure Git to use GPG
    git config user.signingkey "$(gpgKeyId)"
    git config commit.gpgsign true

    # Sign commit
    git commit -S -m "$(commitMessage)"
  displayName: 'Sign and commit changes'
  env:
    gpgPrivateKey: $(gpgPrivateKey)
    gpgKeyId: $(gpgKeyId)
    commitMessage: $(commitMessage)

Branch Selection (Dev, Test, Staging, Production)

Branch Selection Logic:

# Azure Pipeline: Dynamic branch selection via conditional insertion
# (template expressions have no "endif"; use ${{ if }} / ${{ elseif }} / ${{ else }} keys)
variables:
  ${{ if eq(variables['Build.SourceBranch'], 'refs/heads/main') }}:
    TargetBranch: production
  ${{ elseif eq(variables['Build.SourceBranch'], 'refs/heads/staging') }}:
    TargetBranch: staging
  ${{ elseif eq(variables['Build.SourceBranch'], 'refs/heads/test') }}:
    TargetBranch: test
  ${{ else }}:
    TargetBranch: dev

Branch Selection Script:

#!/bin/bash
# scripts/determine-target-branch.sh

SOURCE_BRANCH="${1:-dev}"

case "$SOURCE_BRANCH" in
  refs/heads/main|main)
    TARGET_BRANCH="production"
    ;;
  refs/heads/staging|staging)
    TARGET_BRANCH="staging"
    ;;
  refs/heads/test|test)
    TARGET_BRANCH="test"
    ;;
  *)
    TARGET_BRANCH="dev"
    ;;
esac

echo "Source branch: $SOURCE_BRANCH"
echo "Target branch: $TARGET_BRANCH"
echo "##vso[task.setvariable variable=TargetBranch]$TARGET_BRANCH"

Triggering FluxCD Sync

Automatic Sync via Polling (Default)

FluxCD Polling Configuration:

# GitRepository polling interval (default: 1 minute)
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: atp-gitops
  namespace: flux-system
spec:
  interval: 1m  # Poll every 1 minute
  url: ssh://git@ssh.dev.azure.com/v3/ConnectSoft/ATP/atp-gitops
  ref:
    branch: production

Sync Timeline (Polling-Based):

T+0s:   CI pipeline commits and pushes to the GitOps repo
T+0s:   FluxCD Source Controller has just polled (worst case), so the new commit is not yet seen
T+60s:  Next poll detects the new commit and produces a new artifact
T+60s:  FluxCD Kustomize Controller notified of the new artifact
T+65s:  FluxCD Kustomize Controller reconciles the cluster
T+70s:  Kubernetes resources updated

Webhook-Based Immediate Sync (Optional)

FluxCD Webhook Receiver:

# apps/fluxcd/receiver.yaml
apiVersion: notification.toolkit.fluxcd.io/v1beta2
kind: Receiver
metadata:
  name: gitops-webhook
  namespace: flux-system
spec:
  # Azure Repos has no dedicated receiver type; the generic receiver
  # accepts any POST to the token-derived webhook path
  type: generic
  resources:
    - kind: GitRepository
      name: atp-gitops
      namespace: flux-system
  secretRef:
    name: webhook-token

Azure DevOps Webhook Configuration:

# Azure Pipeline: Trigger FluxCD webhook
- script: |
    # The receiver Service is cluster-local, so it must be exposed to the build
    # agent (e.g. via an Ingress host such as flux-webhook.connectsoft.example).
    # The token-derived path is published by Flux in the Receiver status:
    #   kubectl -n flux-system get receiver gitops-webhook -o jsonpath='{.status.webhookPath}'
    WEBHOOK_URL="https://flux-webhook.connectsoft.example$(fluxWebhookPath)"

    curl -X POST "$WEBHOOK_URL" \
      -H "Content-Type: application/json" \
      -d '{
        "ref": "refs/heads/'"$(TargetBranch)"'",
        "commits": [{
          "id": "'"$(Build.SourceVersion)"'",
          "message": "Automated manifest update"
        }]
      }'

    echo "✅ FluxCD webhook triggered"
  displayName: 'Trigger FluxCD sync webhook'

Sync Timeline (Webhook-Based):

T+0s:   CI pipeline commits to GitOps repo
T+0s:   Git commit pushed successfully
T+1s:   Azure DevOps webhook triggered
T+2s:   FluxCD Receiver receives webhook
T+2s:   FluxCD Source Controller immediately fetches Git
T+3s:   FluxCD Kustomize Controller notified
T+8s:   Kubernetes resources updated

Flux Reconcile Command (Manual)

Manual Reconciliation:

# Manual reconciliation via flux CLI
flux reconcile source git atp-gitops --namespace flux-system

# Force reconciliation (even if no changes)
flux reconcile kustomization apps --namespace flux-system --with-source

# Reconciliation status
flux get kustomizations apps --namespace flux-system

Reconciliation in Pipeline (Optional):

# Assumes the flux CLI is installed on the agent
- task: AzureCLI@2
  displayName: 'Trigger FluxCD reconciliation'
  inputs:
    azureSubscription: 'ATP-AKS-Connection'
    scriptType: bash
    scriptLocation: inlineScript
    inlineScript: |
      az aks get-credentials \
        --resource-group ATP-Production-EUS-RG \
        --name atp-prod-eus-aks \
        --overwrite-existing
      flux reconcile source git atp-gitops --namespace flux-system
  condition: and(succeeded(), eq(variables['TargetBranch'], 'production'))

Pipeline Templates for GitOps Integration

Reusable YAML Templates

GitOps Update Template (templates/gitops-update.yml):

# templates/gitops-update.yml
parameters:
- name: serviceName
  type: string
- name: imageRepository
  type: string
- name: imageTag
  type: string
- name: targetBranch
  type: string
  default: 'dev'
- name: requireGpgSigning
  type: boolean
  default: false

steps:
- checkout: git://ATP/atp-gitops@${{ parameters.targetBranch }}
  displayName: 'Checkout GitOps repository'
  path: gitops-repo
  persistCredentials: true

- script: |
    # Install tools
    curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" | bash
    sudo mv kustomize /usr/local/bin/

    wget -q https://github.com/mikefarah/yq/releases/latest/download/yq_linux_amd64 -O yq
    chmod +x yq
    sudo mv yq /usr/local/bin/
  displayName: 'Install tools'

- script: |
    cd gitops-repo

    # Update Kustomize
    if [ -d "apps/${{ parameters.serviceName }}/base" ]; then
      cd "apps/${{ parameters.serviceName }}/base"
      kustomize edit set image "${{ parameters.imageRepository }}:${{ parameters.imageTag }}"
      cd ../../../
    fi

    # Update Helm
    if [ -d "apps/${{ parameters.serviceName }}/helm" ]; then
      cd "apps/${{ parameters.serviceName }}/helm"
      if [ -f "values-${{ parameters.targetBranch }}.yaml" ]; then
        yq eval -i ".image.tag = \"${{ parameters.imageTag }}\"" \
          "values-${{ parameters.targetBranch }}.yaml"
      fi
      cd ../../../
    fi
  displayName: 'Update manifests'

- script: |
    cd gitops-repo

    if [ -z "$(git status --porcelain)" ]; then
      echo "ℹ️  No changes to commit"
      exit 0
    fi

    git config user.name "Azure DevOps Pipeline"
    git config user.email "azure-devops@connectsoft.example"

    git add apps/${{ parameters.serviceName }}/

    COMMIT_MESSAGE="chore(${{ parameters.serviceName }}): update image tag to ${{ parameters.imageTag }}

    Automated update from CI pipeline.
    Build: $(Build.BuildNumber)
    Commit: $(Build.SourceVersion)"

    if [ "${{ lower(parameters.requireGpgSigning) }}" == "true" ]; then
      echo "$(gpgPrivateKey)" | base64 -d | gpg --batch --import
      git config user.signingkey "$(gpgKeyId)"
      git config commit.gpgsign true
      git commit -S -m "$COMMIT_MESSAGE"
    else
      git commit -m "$COMMIT_MESSAGE"
    fi

    git push origin ${{ parameters.targetBranch }}
  displayName: 'Commit and push changes'
  env:
    ${{ if eq(parameters.requireGpgSigning, true) }}:
      gpgPrivateKey: $(gpgPrivateKey)
      gpgKeyId: $(gpgKeyId)

Parameterization for Different Services

Using Template in Pipeline:

# azure-pipelines.yml
resources:
  repositories:
  - repository: templates
    type: git
    name: ATP/azure-pipelines-templates

stages:
- stage: UpdateGitOps
  displayName: 'Update GitOps Repository'
  jobs:
  - job: UpdateManifests
    steps:
    - template: templates/gitops-update.yml@templates
      parameters:
        serviceName: 'atp-ingestion'
        imageRepository: 'connectsoft.azurecr.io/atp/ingestion'
        imageTag: '$(ImageTag)'
        targetBranch: '$(TargetBranch)'
        requireGpgSigning: ${{ eq(variables['TargetBranch'], 'production') }}

Multi-Service Template Usage:

- stage: UpdateGitOps
  displayName: 'Update GitOps for Multiple Services'
  jobs:
  - job: UpdateAllServices
    strategy:
      matrix:
        ingestion:
          serviceName: 'atp-ingestion'
          imageRepository: 'connectsoft.azurecr.io/atp/ingestion'
        query:
          serviceName: 'atp-query'
          imageRepository: 'connectsoft.azurecr.io/atp/query'
        gateway:
          serviceName: 'atp-gateway'
          imageRepository: 'connectsoft.azurecr.io/atp/gateway'
    steps:
    - template: templates/gitops-update.yml@templates
      parameters:
        serviceName: '$(serviceName)'
        imageRepository: '$(imageRepository)'
        imageTag: '$(ImageTag)'
        targetBranch: '$(TargetBranch)'

Template Versioning

Versioned Template Reference:

resources:
  repositories:
  - repository: templates
    type: git
    name: ATP/azure-pipelines-templates
    ref: refs/tags/v1.2.3  # Pin to specific version

stages:
- stage: UpdateGitOps
  jobs:
  - job: UpdateManifests
    steps:
    - template: templates/gitops-update.yml@templates
      parameters:
        serviceName: 'atp-ingestion'
        imageRepository: 'connectsoft.azurecr.io/atp/ingestion'
        imageTag: '$(ImageTag)'

Multi-Service Coordination

Updating Multiple Services Atomically

Atomic Multi-Service Update:

- stage: UpdateGitOps
  displayName: 'Update GitOps (Atomic)'
  jobs:
  - job: UpdateAllServices
    steps:
    - checkout: git://ATP/atp-gitops@$(TargetBranch)
      path: gitops-repo
      persistCredentials: true

    # Update all services in single commit
    - script: |
        cd gitops-repo

        # kustomize edit has no --path flag; run it inside the directory
        # that contains the kustomization.yaml

        # Update ingestion
        (cd apps/atp-ingestion/base && kustomize edit set image \
          connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d)

        # Update query
        (cd apps/atp-query/base && kustomize edit set image \
          connectsoft.azurecr.io/atp/query:v1.3.0-def456e)

        # Update gateway
        (cd apps/atp-gateway/base && kustomize edit set image \
          connectsoft.azurecr.io/atp/gateway:v1.1.5-ghi789f)

        # Single commit for all services
        git add apps/
        git commit -m "chore(*): update all service image tags

        Atomic update for release v1.2.3:
        - atp-ingestion: v1.2.3-abc123d
        - atp-query: v1.3.0-def456e
        - atp-gateway: v1.1.5-ghi789f"

        git push origin $(TargetBranch)
      displayName: 'Atomic multi-service update'

Dependency Management

Service Dependency Graph:

# services-dependencies.yaml
services:
  - name: atp-gateway
    dependsOn: []
    updateOrder: 1

  - name: atp-ingestion
    dependsOn: [atp-gateway]
    updateOrder: 2

  - name: atp-query
    dependsOn: [atp-ingestion]
    updateOrder: 3

  - name: atp-export
    dependsOn: [atp-query]
    updateOrder: 4

Dependency-Aware Update Script:

#!/bin/bash
# scripts/update-services-with-dependencies.sh

SERVICES=(
  "atp-gateway:v1.1.5-ghi789f:1"
  "atp-ingestion:v1.2.3-abc123d:2"
  "atp-query:v1.3.0-def456e:3"
)

# Sort by update order
IFS=$'\n' sorted_services=($(sort -t: -k3 <<<"${SERVICES[*]}"))
unset IFS

for service_info in "${sorted_services[@]}"; do
  IFS=':' read -r service_name image_tag update_order <<< "$service_info"

  echo "📦 Updating $service_name (order: $update_order)..."

  # Update manifest (kustomize edit has no --path flag; run it in the target directory)
  (cd "apps/$service_name/base" && \
    kustomize edit set image "connectsoft.azurecr.io/$service_name:$image_tag")

  echo "  ✅ Updated $service_name"
done

Coordinated Rollout Strategies

Staged Rollout:

- stage: CoordinatedRollout
  displayName: 'Coordinated Multi-Service Rollout'
  jobs:
  - job: Stage1Gateway
    displayName: 'Stage 1: Gateway'
    steps:
    - template: templates/gitops-update.yml@templates
      parameters:
        serviceName: 'atp-gateway'
        imageTag: '$(GatewayImageTag)'

  - job: Stage2Ingestion
    displayName: 'Stage 2: Ingestion'
    dependsOn: Stage1Gateway
    condition: succeeded()
    steps:
    - template: templates/gitops-update.yml@templates
      parameters:
        serviceName: 'atp-ingestion'
        imageTag: '$(IngestionImageTag)'

  - job: Stage3Query
    displayName: 'Stage 3: Query'
    dependsOn: Stage2Ingestion
    condition: succeeded()
    steps:
    - template: templates/gitops-update.yml@templates
      parameters:
        serviceName: 'atp-query'
        imageTag: '$(QueryImageTag)'

Artifact Metadata

SBOM (Software Bill of Materials) Generation

SBOM Generation with Syft:

- script: |
    # Install Syft
    curl -sSfL https://raw.githubusercontent.com/anchore/syft/main/install.sh | sh -s -- -b /usr/local/bin

    # Generate SBOM in SPDX format
    syft packages docker:$(ImageRepository):$(ImageTag) \
      --output spdx-json \
      --file sbom-$(ImageTag).spdx.json

    # Generate SBOM in CycloneDX format
    syft packages docker:$(ImageRepository):$(ImageTag) \
      --output cyclonedx-json \
      --file sbom-$(ImageTag).cyclonedx.json

    echo "✅ SBOM generated: sbom-$(ImageTag).spdx.json"
  displayName: 'Generate SBOM'

SBOM Structure (Example):

{
  "SPDXID": "SPDXRef-DOCUMENT",
  "spdxVersion": "SPDX-2.3",
  "name": "connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d",
  "packages": [
    {
      "SPDXID": "SPDXRef-Package-dotnet-runtime",
      "name": "dotnet-runtime",
      "versionInfo": "8.0.0",
      "downloadLocation": "NOASSERTION"
    },
    {
      "SPDXID": "SPDXRef-Package-aspnetcore",
      "name": "aspnetcore",
      "versionInfo": "8.0.0",
      "downloadLocation": "NOASSERTION"
    }
  ]
}

Vulnerability Scan Results

Vulnerability Scanning Integration:

- script: |
    # Scan image with Trivy
    trivy image \
      --format json \
      --output trivy-$(ImageTag).json \
      --severity HIGH,CRITICAL \
      $(ImageRepository):$(ImageTag)

    # Generate HTML report
    trivy image \
      --format template \
      --template "@contrib/html.tpl" \
      --output trivy-$(ImageTag).html \
      $(ImageRepository):$(ImageTag)

    # Publish scan results
    echo "##vso[task.addattachment type=Distributedtask.Core.Summary;name=Vulnerability Scan;]$PWD/trivy-$(ImageTag).html"
  displayName: 'Scan image for vulnerabilities'

Vulnerability Scan Results (Example):

{
  "SchemaVersion": 2,
  "ArtifactName": "connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d",
  "Results": [
    {
      "Target": "atp-ingestion:v1.2.3-abc123d",
      "Vulnerabilities": [
        {
          "VulnerabilityID": "CVE-2024-1234",
          "Severity": "HIGH",
          "PackageName": "aspnetcore",
          "InstalledVersion": "8.0.0",
          "FixedVersion": "8.0.1"
        }
      ]
    }
  ]
}
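
A report like the one above is typically turned into a quality gate. The sketch below is a dependency-free, grep-based approximation over a stub report (real pipelines would use `trivy image --exit-code 1 --severity CRITICAL` or `jq` instead; the file path and JSON shape here are illustrative):

```shell
# Write a stub Trivy report so the gate logic can be demonstrated end to end
cat > /tmp/trivy-report.json <<'EOF'
{"Results":[{"Vulnerabilities":[
  {"VulnerabilityID":"CVE-2024-1234","Severity":"HIGH"},
  {"VulnerabilityID":"CVE-2024-5678","Severity":"CRITICAL"}
]}]}
EOF

# Count lines containing a CRITICAL severity marker
critical_count=$(grep -c '"Severity":"CRITICAL"' /tmp/trivy-report.json || true)
echo "CRITICAL findings: $critical_count"

if [ "$critical_count" -gt 0 ]; then
  echo "❌ Build should fail: CRITICAL vulnerabilities present"
fi
```

In a real pipeline the non-zero count would map to a non-zero exit code so the stage fails before the image tag ever reaches the GitOps repository.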

Build Provenance Information

Provenance Generation (SLSA/In-Toto):

- script: |
    # Generate build provenance (SLSA v1.0)
    cat > provenance-$(ImageTag).json <<EOF
    {
      "_type": "https://in-toto.io/Statement/v1",
      "subject": [
        {
          "name": "$(ImageRepository):$(ImageTag)",
          "digest": {
            "sha256": "$(IMAGE_DIGEST)"
          }
        }
      ],
      "predicateType": "https://slsa.dev/provenance/v1",
      "predicate": {
        "buildDefinition": {
          "buildType": "https://dev.azure.com/ConnectSoft/ATP",
          "externalParameters": {
            "source": "$(Build.Repository.Uri)",
            "ref": "$(Build.SourceBranch)",
            "commit": "$(Build.SourceVersion)"
          },
          "internalParameters": {
            "pipeline": "$(Build.DefinitionName)",
            "buildId": "$(Build.BuildId)"
          },
          "resolvedDependencies": [
            {
              "uri": "$(Build.Repository.Uri)",
              "digest": {
                "gitCommit": "$(Build.SourceVersion)"
              }
            }
          ]
        },
        "runDetails": {
          "builder": {
            "id": "Azure DevOps Pipeline"
          },
          "metadata": {
            "invocationId": "$(Build.BuildId)",
            "startedOn": "$(Build.QueuedTime)",
            "finishedOn": "$(System.Agent.JobFinishTime)"
          }
        }
      }
    }
    EOF

    echo "✅ Build provenance generated"
  displayName: 'Generate build provenance'

Metadata Storage in ACR

Attach Metadata to ACR Image:

- script: |
    # `az acr repository update` does not support arbitrary key/value metadata;
    # attach the supply-chain documents to the image as OCI referrers with ORAS
    oras attach \
      --artifact-type application/spdx+json \
      connectsoft.azurecr.io/atp/ingestion:$(ImageTag) \
      sbom-$(ImageTag).spdx.json

    oras attach \
      --artifact-type application/vnd.trivy.report+json \
      connectsoft.azurecr.io/atp/ingestion:$(ImageTag) \
      trivy-$(ImageTag).json

    oras attach \
      --artifact-type application/vnd.in-toto+json \
      connectsoft.azurecr.io/atp/ingestion:$(ImageTag) \
      provenance-$(ImageTag).json

    echo "✅ Metadata attached to image"
  displayName: 'Attach metadata to ACR image'

Query Image Metadata:

# List artifacts attached to the image (SBOM, scan report, provenance)
oras discover connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d

# Show the image manifest, including its annotations
az acr manifest show \
  --registry connectsoft \
  --name atp/ingestion:v1.2.3-abc123d

Pipeline Observability

Correlation IDs Between Build and Deployment

Correlation ID Generation:

- script: |
    # Generate correlation ID
    CORRELATION_ID="build-$(Build.BuildId)-$(Build.SourceVersion)"

    echo "##vso[task.setvariable variable=CorrelationId;isOutput=true]$CORRELATION_ID"
    echo "Correlation ID: $CORRELATION_ID"
  displayName: 'Generate correlation ID'
  name: GenerateCorrelationId

# Pass correlation ID to GitOps commit
- script: |
    git commit -m "chore(ingestion): update image tag

    Correlation ID: $(GenerateCorrelationId.CorrelationId)
    Build: $(Build.BuildNumber)"
  displayName: 'Commit with correlation ID'

Correlation ID in Deployment Metadata:

# apps/atp-ingestion/base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
  annotations:
    deployment.connectsoft.com/correlation-id: "build-12345-abc123d"
    deployment.connectsoft.com/build-number: "20240115.1"
    deployment.connectsoft.com/build-uri: "https://dev.azure.com/.../builds/12345"
spec:
  template:
    metadata:
      labels:
        correlation-id: "build-12345-abc123d"
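
Tooling that needs to recover the build from these annotations can parse the `build-<buildId>-<shortSha>` format with plain shell expansion (a minimal sketch; variable names are illustrative):

```shell
# Parse the correlation ID format "build-<buildId>-<shortSha>" shown above
cid="build-12345-abc123d"

build_id=${cid#build-}      # strip the "build-" prefix -> "12345-abc123d"
build_id=${build_id%%-*}    # keep everything before the next dash -> "12345"
short_sha=${cid##*-}        # keep everything after the last dash -> "abc123d"

echo "buildId=$build_id sha=$short_sha"
```

The recovered build ID is what the linking script below writes back to the deployment, closing the loop between the pipeline run and the reconciled workload.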

Linking Azure Pipeline Runs to FluxCD Reconciliations

Link Tracking Script:

#!/bin/bash
# scripts/link-build-to-deployment.sh

CORRELATION_ID="${1:-unknown}"
BUILD_URI="${2:-unknown}"
NAMESPACE="${3:-atp-production}"

# Annotate deployment with build information
kubectl annotate deployment atp-ingestion \
  -n "$NAMESPACE" \
  deployment.connectsoft.com/build-uri="$BUILD_URI" \
  deployment.connectsoft.com/correlation-id="$CORRELATION_ID" \
  --overwrite

echo "✅ Deployment linked to build: $BUILD_URI"

Query Links:

# Query deployment for build link
kubectl get deployment atp-ingestion -n atp-production \
  -o jsonpath='{.metadata.annotations.deployment\.connectsoft\.com/build-uri}'

# Output: https://dev.azure.com/ConnectSoft/ATP/_build/results?buildId=12345

Deployment Receipt Generation

Deployment Receipt Script:

#!/bin/bash
# scripts/generate-deployment-receipt.sh

DEPLOYMENT_NAME="${1:-atp-ingestion}"
NAMESPACE="${2:-atp-production}"
CORRELATION_ID="${3:-unknown}"

# Generate deployment receipt
cat > deployment-receipt-$(date +%Y%m%d-%H%M%S).json <<EOF
{
  "deploymentId": "$(kubectl get deployment $DEPLOYMENT_NAME -n $NAMESPACE -o jsonpath='{.metadata.uid}')",
  "correlationId": "$CORRELATION_ID",
  "namespace": "$NAMESPACE",
  "deploymentName": "$DEPLOYMENT_NAME",
  "image": "$(kubectl get deployment $DEPLOYMENT_NAME -n $NAMESPACE -o jsonpath='{.spec.template.spec.containers[0].image}')",
  "replicas": $(kubectl get deployment $DEPLOYMENT_NAME -n $NAMESPACE -o jsonpath='{.spec.replicas}'),
  "deployedAt": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
  "deployedBy": "FluxCD",
  "gitCommit": "$(kubectl get deployment $DEPLOYMENT_NAME -n $NAMESPACE -o jsonpath='{.metadata.labels.app\.kubernetes\.io/version}')",
  "status": "$(kubectl get deployment $DEPLOYMENT_NAME -n $NAMESPACE -o jsonpath='{.status.conditions[?(@.type=="Available")].status}')"
}
EOF

echo "✅ Deployment receipt generated"

Metrics and Dashboards

Pipeline Metrics (Azure Monitor):

- script: |
    # There is no `az monitor metrics create`; custom metrics are published
    # through the Azure Monitor custom metrics REST API.
    # $(MonitorResourceId) is assumed to hold the full target resource ID.
    TOKEN=$(az account get-access-token \
      --resource https://monitoring.azure.com \
      --query accessToken -o tsv)

    curl -X POST \
      "https://eastus.monitoring.azure.com$(MonitorResourceId)/metrics" \
      -H "Authorization: Bearer $TOKEN" \
      -H "Content-Type: application/json" \
      -d '{
        "time": "'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'",
        "data": {
          "baseData": {
            "metric": "gitops_manifest_update_duration_seconds",
            "namespace": "ATP/GitOps",
            "dimNames": [],
            "series": [
              { "dimValues": [], "min": '"$SECONDS"', "max": '"$SECONDS"', "sum": '"$SECONDS"', "count": 1 }
            ]
          }
        }
      }'
  displayName: 'Send pipeline metrics'

KQL Query for Build-Deployment Correlation:

// Azure Monitor: Link builds to deployments
let BuildEvents = ContainerLog
| where LogEntry contains "correlation-id"
| extend CorrelationId = extract(@"correlation-id: ([^\s]+)", 1, LogEntry)
| extend BuildUri = extract(@"build-uri: ([^\s]+)", 1, LogEntry)
| project CorrelationId, BuildUri, BuildTime = TimeGenerated;

let DeploymentEvents = KubePodInventory
| where Namespace == "atp-production"
| extend CorrelationId = extract(@'"correlation-id":"([^"]+)"', 1, tostring(PodLabel))
| where isnotempty(CorrelationId)
| project CorrelationId, PodName, DeploymentTime = TimeGenerated;

BuildEvents
| join kind=inner DeploymentEvents on CorrelationId
| extend DeploymentLatency = DeploymentTime - BuildTime
| summarize avg(DeploymentLatency) by bin(DeploymentTime, 1h)

Grafana Dashboard Configuration:

# Grafana dashboard for CI/CD → GitOps handoff
dashboard:
  title: "CI/CD to GitOps Handoff Metrics"
  panels:
    - title: "Build to Deployment Latency"
      query: |
        avg(deployment_latency_seconds{namespace="atp-production"})

    - title: "GitOps Commit Frequency"
      query: |
        rate(gitops_commits_total[5m])

    - title: "FluxCD Sync Success Rate"
      query: |
        sum(rate(controller_runtime_reconcile_total{controller="kustomization",result="success"}[5m])) /
        sum(rate(controller_runtime_reconcile_total{controller="kustomization"}[5m]))

Summary: Azure Pipelines to GitOps Handoff

  • Separation of Concerns: CI builds/test/publishes artifacts; GitOps deploys/reconciles/monitors
  • Pipeline Stages: Build (compile, test), Test (integration, security), Publish (ACR, SBOM), Update GitOps (manifest commits)
  • Image Tag Generation: Semantic version + commit SHA format (v1.2.3-abc123d) for immutability and traceability
  • Automated Manifest Update: Scripts for Kustomize and Helm manifest updates with Git commit automation
  • Git Commit Automation: PAT/SSH credentials, conventional commit messages, GPG signing for production
  • FluxCD Sync Triggers: Polling (default), webhooks (immediate), manual reconciliation
  • Pipeline Templates: Reusable YAML templates with parameterization and versioning
  • Multi-Service Coordination: Atomic updates, dependency management, coordinated rollout strategies
  • Artifact Metadata: SBOM generation, vulnerability scans, build provenance, ACR metadata storage
  • Pipeline Observability: Correlation IDs, build-deployment linking, deployment receipts, metrics dashboards
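
The tag scheme summarized above (semantic version plus short commit SHA, e.g. v1.2.3-abc123d) can be sketched as a one-line helper (the function name is illustrative):

```shell
# Hypothetical helper implementing the v<semver>-<short-sha> tag format
make_image_tag() {
  local version="$1" sha="$2"
  printf 'v%s-%s\n' "$version" "${sha:0:7}"   # keep the first 7 SHA characters
}

make_image_tag 1.2.3 abc123def456   # prints v1.2.3-abc123d
```

Because the short SHA is embedded, every tag is immutable and traceable back to a single source commit even when the semantic version is reused across builds.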

Pulumi Infrastructure as Code Integration

Purpose: Define how Pulumi with C# is used to provision and manage Azure infrastructure for ATP, integrating with GitOps workflows to ensure infrastructure changes are version-controlled, tested, and deployed through the same Git-based processes as application deployments.


Pulumi Overview for Azure Resources

Why Pulumi for ATP (C# Programming Model)

ATP Infrastructure Requirements:

  • Complex Azure Resource Orchestration: AKS clusters, ACR, Key Vault, Service Bus, Storage Accounts
  • Multi-Environment Management: Dev, test, staging, production with environment-specific configurations
  • C# Development Team: Leverage existing C# expertise for infrastructure code
  • Type Safety: Strong typing and IntelliSense for Azure resources
  • Testability: Unit test infrastructure code with standard C# testing frameworks
  • Code Reusability: Create reusable components and modules

Pulumi Advantages for ATP:

| Advantage | Description | ATP Benefit |
|---|---|---|
| C# Programming Model | Write infrastructure as C# code | Leverage team's existing C# skills |
| Type Safety | Strong typing with IntelliSense | Reduce configuration errors at compile time |
| Rich Ecosystem | Access to .NET libraries and NuGet packages | Reuse existing code and patterns |
| Imperative Logic | Full programming language capabilities | Complex conditional logic, loops, functions |
| State Management | Built-in state management with locking | Safe concurrent updates |
| Multi-Language Support | C#, TypeScript, Python, Go available | Team flexibility |

Pulumi vs Terraform vs Bicep Comparison

Comparison Matrix:

| Feature | Pulumi | Terraform | Bicep |
|---|---|---|---|
| Language | C#, TypeScript, Python, Go | HCL (HashiCorp Configuration Language) | Domain-specific language (DSL) |
| Programming Model | Imperative with full language features | Declarative | Declarative |
| Type Safety | ✅ Strong typing (C#) | ❌ Limited | ✅ Type checking |
| IntelliSense | ✅ Full IDE support | ⚠️ Basic | ✅ Good |
| Testing | ✅ Unit test with standard frameworks | ⚠️ Limited | ❌ Limited |
| Code Reuse | ✅ Functions, classes, modules | ⚠️ Modules | ⚠️ Modules |
| State Management | ✅ Built-in with locking | ✅ Built-in with locking | ✅ Azure-native |
| Azure Integration | ✅ Excellent | ✅ Good | ✅ Native (Azure-only) |
| Multi-Cloud | ✅ Excellent | ✅ Excellent | ❌ Azure-only |
| Learning Curve | ✅ Low (if team knows C#) | ⚠️ Medium (HCL) | ⚠️ Medium (DSL) |

ATP Selection: Pulumi with C#

Rationale:

  1. Team Expertise: ATP team is proficient in C#, reducing learning curve
  2. Type Safety: Catch configuration errors at compile time
  3. Testability: Unit test infrastructure code with xUnit/NUnit
  4. Code Reuse: Create reusable infrastructure components
  5. Complex Logic: Handle multi-tenant, multi-region scenarios with imperative code
  6. Azure-First: Strong Azure support while maintaining multi-cloud flexibility

Infrastructure as Code Principles

IaC Best Practices for ATP:

  1. Version Control: All infrastructure code in Git (Azure Repos)
  2. Immutable Infrastructure: Recreate rather than modify when possible
  3. Idempotency: Infrastructure code can be run multiple times safely
  4. Declarative Intent: Describe desired state, not implementation steps
  5. Environment Parity: Use same code for all environments (parameterized)
  6. Code Review: Infrastructure changes require PR approval
  7. Testing: Preview changes before applying
  8. State Management: Centralized, versioned state with locking

IaC Workflow:

graph LR
    A[Developer] -->|Create PR| B[Infrastructure Code<br/>in Git]
    B -->|PR Validation| C[Pulumi Preview]
    C -->|Review| D[Manual Approval]
    D -->|Merge| E[Pulumi Up]
    E -->|Update State| F[Azure Blob Storage<br/>State Backend]
    E -->|Provision| G[Azure Resources]

    style B fill:#90EE90
    style C fill:#FFE5B4
    style D fill:#FFE5B4
    style E fill:#90EE90
    style F fill:#FFE5B4
    style G fill:#ffcccc

Pulumi Stacks for Environments

Stack per Environment (Dev, Test, Staging, Production)

Stack Organization:

atp-infrastructure/
├── Pulumi.yaml
├── Pulumi.dev.yaml
├── Pulumi.test.yaml
├── Pulumi.staging.yaml
├── Pulumi.production.yaml
├── Program.cs
└── infrastructure/
    ├── AKS.cs
    ├── ACR.cs
    ├── KeyVault.cs
    └── ServiceBus.cs

Pulumi Project Configuration (Pulumi.yaml):

name: atp-infrastructure
runtime: dotnet
description: ATP Infrastructure as Code using Pulumi with C#

Stack Configuration Examples:

Dev Stack (Pulumi.dev.yaml):

config:
  azure-native:location: eastus
  atp-infrastructure:environment: dev
  atp-infrastructure:aksNodeCount: 3
  atp-infrastructure:aksNodeVmSize: Standard_D2s_v3
  atp-infrastructure:acrSku: Basic
  atp-infrastructure:keyVaultSku: standard
  atp-infrastructure:enableMonitoring: true
  atp-infrastructure:enablePrivateEndpoint: false
  atp-infrastructure:tags:
    Environment: dev
    ManagedBy: pulumi
    Project: ATP

Production Stack (Pulumi.production.yaml):

config:
  azure-native:location: eastus
  atp-infrastructure:environment: production
  atp-infrastructure:aksNodeCount: 5
  atp-infrastructure:aksNodeVmSize: Standard_D4s_v3
  atp-infrastructure:acrSku: Premium
  atp-infrastructure:keyVaultSku: premium
  atp-infrastructure:enableMonitoring: true
  atp-infrastructure:enablePrivateEndpoint: true
  atp-infrastructure:enableGeoReplication: true
  atp-infrastructure:tags:
    Environment: production
    ManagedBy: pulumi
    Project: ATP
    Compliance: SOC2

Stack Configuration and Secrets

Stack Configuration with Secrets:

// Program.cs
using Pulumi;

class Program
{
    static Task<int> Main() => Deployment.RunAsync<ATPStack>();
}

class ATPStack : Stack
{
    public ATPStack()
    {
        var config = new Config();

        // Read configuration
        var environment = config.Require("environment");
        var location = config.Get("location") ?? "eastus";
        var nodeCount = config.GetInt32("aksNodeCount") ?? 3;
        var nodeVmSize = config.Get("aksNodeVmSize") ?? "Standard_D2s_v3";

        // Read secrets (encrypted)
        var sqlAdminPassword = config.RequireSecret("sqlAdminPassword");
        var keyVaultAccessKey = config.RequireSecret("keyVaultAccessKey");

        // Create infrastructure
        var aks = new AKSCluster(this, environment, location, nodeCount, nodeVmSize);
        var acr = new AzureContainerRegistry(this, environment, location);
        var keyVault = new KeyVault(this, environment, location, keyVaultAccessKey);
    }
}

Setting Stack Configuration:

# Set plain configuration
pulumi config set aksNodeCount 5 --stack production
pulumi config set aksNodeVmSize Standard_D4s_v3 --stack production

# Set secrets (encrypted in state)
pulumi config set --secret sqlAdminPassword "SecurePassword123!"
pulumi config set --secret keyVaultAccessKey "access-key-value"

# View configuration
pulumi config --stack production

# View a secret's decrypted value (pulumi config get decrypts automatically)
pulumi config get sqlAdminPassword --stack production

Configuration via Azure DevOps Variable Groups:

# Azure Pipeline: Set Pulumi configuration from variable groups
- script: |
    # Set plain configuration
    pulumi config set azure-native:location $(AzureLocation) --stack $(PulumiStack)
    pulumi config set aksNodeCount $(AKSNodeCount) --stack $(PulumiStack)

    # Set secrets from Azure Key Vault
    pulumi config set --secret sqlAdminPassword "$(sqlAdminPassword)" --stack $(PulumiStack)
    pulumi config set --secret keyVaultAccessKey "$(keyVaultAccessKey)" --stack $(PulumiStack)
  displayName: 'Set Pulumi stack configuration'
  env:
    PULUMI_ACCESS_TOKEN: $(PulumiAccessToken)

Stack References for Cross-Stack Dependencies

Stack Reference Example:

// Shared infrastructure stack (networking, monitoring)
class SharedStack : Stack
{
    [Output]
    public Output<string> LogAnalyticsWorkspaceId { get; set; }

    [Output]
    public Output<string> VirtualNetworkId { get; set; }

    public SharedStack()
    {
        var resourceGroup = new Resources.ResourceGroup("atp-shared-rg");

        var workspace = new OperationalInsights.Workspace("atp-shared-loganalytics", new()
        {
            ResourceGroupName = resourceGroup.Name,
            Location = resourceGroup.Location,
        });

        this.LogAnalyticsWorkspaceId = workspace.Id;
    }
}

// Application stack references shared stack
class ApplicationStack : Stack
{
    public ApplicationStack()
    {
        var sharedStack = new StackReference("ConnectSoft/atp-shared/shared");

        var logAnalyticsWorkspaceId = sharedStack.RequireOutput("LogAnalyticsWorkspaceId")
            .Apply(id => id.ToString());

        // Use shared resources
        var aks = new ContainerService.ManagedCluster("atp-aks", new()
        {
            // Reference shared Log Analytics workspace
            AddonProfiles = new()
            {
                ["omsagent"] = new()
                {
                    Enabled = true,
                    Config = new()
                    {
                        ["logAnalyticsWorkspaceResourceID"] = logAnalyticsWorkspaceId,
                    },
                },
            },
        });
    }
}

AKS Cluster Provisioning

Cluster Configuration (Node Pools, Networking, SKUs)

Complete AKS Cluster with C# Pulumi:

// infrastructure/AKS.cs
using Pulumi;
using Pulumi.AzureNative.ContainerService;
using Pulumi.AzureNative.ContainerService.Inputs;
using Pulumi.AzureNative.Network;
using Pulumi.AzureNative.Resources;

public class AKSCluster
{
    public ManagedCluster Cluster { get; }
    public Output<string> KubeConfig { get; }

    public AKSCluster(Pulumi.Stack stack, string environment, string location, 
        int nodeCount, string nodeVmSize)
    {
        var config = new Config();
        var resourceGroupName = $"atp-{environment}-rg";
        var clusterName = $"atp-{environment}-aks";

        // Resource Group
        var resourceGroup = new ResourceGroup($"atp-{environment}-rg", new()
        {
            Location = location,
            Tags = new()
            {
                { "Environment", environment },
                { "ManagedBy", "pulumi" },
            },
        });

        // Virtual Network
        var vnet = new VirtualNetwork($"atp-{environment}-vnet", new()
        {
            ResourceGroupName = resourceGroup.Name,
            Location = location,
            AddressSpace = new() { AddressPrefixes = { "10.0.0.0/16" } },
            Tags = new()
            {
                { "Environment", environment },
                { "ManagedBy", "pulumi" },
            },
        });

        // Subnet for AKS
        var aksSubnet = new Subnet($"atp-{environment}-aks-subnet", new()
        {
            ResourceGroupName = resourceGroup.Name,
            VirtualNetworkName = vnet.Name,
            AddressPrefix = "10.0.1.0/24",
            Delegations = new()
            {
                new SubnetDelegationArgs
                {
                    Name = "Microsoft.ContainerService.managedClusters",
                    ServiceDelegation = new ServiceDelegationArgs
                    {
                        Name = "Microsoft.ContainerService/managedClusters",
                    },
                },
            },
        });

        // User Assigned Managed Identity
        var identity = new ManagedIdentity.UserAssignedIdentity(
            $"atp-{environment}-aks-identity", new()
            {
                ResourceGroupName = resourceGroup.Name,
                Location = location,
            });

        // AKS Cluster
        var cluster = new ManagedCluster(clusterName, new()
        {
            ResourceGroupName = resourceGroup.Name,
            Location = location,

            // Identity
            Identity = new ManagedClusterIdentityArgs
            {
                Type = ManagedClusterIdentityType.UserAssigned,
                UserAssignedIdentities = new[]
                {
                    identity.Id,
                },
            },

            // Kubernetes Version
            KubernetesVersion = config.Get("kubernetesVersion") ?? "1.28",

            // Node Pool Configuration
            AgentPoolProfiles = new[]
            {
                new ManagedClusterAgentPoolProfileArgs
                {
                    Name = "systempool",
                    Count = nodeCount,
                    VmSize = nodeVmSize,
                    OsType = "Linux",
                    OsDiskSizeGB = 128,
                    Mode = AgentPoolMode.System,
                    EnableAutoScaling = true,
                    MinCount = 2,
                    MaxCount = 10,
                    Type = AgentPoolType.VirtualMachineScaleSets,
                    VnetSubnetId = aksSubnet.Id,
                    MaxPods = 30,
                    NodeLabels = new()
                    {
                        { "pool", "system" },
                        { "environment", environment },
                    },
                    NodeTaints = new[] { "CriticalAddonsOnly=true:NoSchedule" },
                },
                new ManagedClusterAgentPoolProfileArgs
                {
                    Name = "userpool",
                    Count = nodeCount,
                    VmSize = nodeVmSize,
                    OsType = "Linux",
                    OsDiskSizeGB = 128,
                    Mode = AgentPoolMode.User,
                    EnableAutoScaling = true,
                    MinCount = 3,
                    MaxCount = 20,
                    Type = AgentPoolType.VirtualMachineScaleSets,
                    VnetSubnetId = aksSubnet.Id,
                    MaxPods = 30,
                    NodeLabels = new()
                    {
                        { "pool", "user" },
                        { "environment", environment },
                    },
                },
            },

            // Network Profile (Azure CNI)
            NetworkProfile = new ContainerServiceNetworkProfileArgs
            {
                NetworkPlugin = "azure",
                NetworkPolicy = "azure",
                ServiceCidr = "10.1.0.0/16",
                DnsServiceIP = "10.1.0.10",
                LoadBalancerSku = "standard",
            },

            // RBAC
            EnableRBAC = true,

            // Pod Security Standards
            SecurityProfile = new ManagedClusterSecurityProfileArgs
            {
                WorkloadIdentity = new ManagedClusterSecurityProfileWorkloadIdentityArgs
                {
                    Enabled = true,
                },
            },

            // Addon Profiles
            AddonProfiles = new()
            {
                ["httpApplicationRouting"] = new ManagedClusterAddonProfileArgs
                {
                    Enabled = false,
                },
                ["omsagent"] = new ManagedClusterAddonProfileArgs
                {
                    Enabled = true,
                    Config = new()
                    {
                        ["logAnalyticsWorkspaceResourceID"] = config.Require("logAnalyticsWorkspaceId"),
                    },
                },
            },

            // Auto Upgrade Channel
            AutoUpgradeProfile = new ManagedClusterAutoUpgradeProfileArgs
            {
                UpgradeChannel = environment == "production" 
                    ? UpgradeChannel.Stable 
                    : UpgradeChannel.Rapid,
            },

            Tags = new()
            {
                { "Environment", environment },
                { "ManagedBy", "pulumi" },
                { "Project", "ATP" },
            },
        });

        this.Cluster = cluster;
        this.KubeConfig = Output.CreateSecret(
            Output.Tuple(resourceGroup.Name, cluster.Name)
                .Apply(names => GetKubeConfig(names.Item1, names.Item2)));
    }

    private static string GetKubeConfig(string resourceGroupName, string clusterName)
    {
        // Placeholder: in practice, retrieve credentials via
        // ListManagedClusterUserCredentials and hand the decoded
        // kubeconfig to Pulumi's Kubernetes provider
        return "";
    }
}

Azure CNI vs Kubenet

Network Plugin Comparison:

| Feature               | Azure CNI                        | Kubenet                        |
|-----------------------|----------------------------------|--------------------------------|
| IP address management | Pod IPs from the VNet subnet     | Pod IPs NATed through node IP  |
| Pod networking        | Direct VNet connectivity         | Overlay network                |
| Max pods per node     | Up to 250 (configurable)         | 110 (default)                  |
| Network policies      | Azure Network Policies or Calico | Calico only                    |
| VNet integration      | Native VNet integration          | Requires a route table         |
| Performance           | Lower latency                    | Slight NAT overhead            |
| Complexity            | More complex subnet planning     | Simpler setup                  |

ATP Selection: Azure CNI

Rationale:

  • ✅ Native Azure networking for better integration
  • ✅ Direct pod-to-VNet connectivity for Azure services
  • ✅ Support for Azure Network Policies
  • ✅ Better performance for high-throughput workloads
  • ✅ Required for advanced features (Private AKS, etc.)

Azure CNI Configuration:

NetworkProfile = new ContainerServiceNetworkProfileArgs
{
    NetworkPlugin = "azure",
    NetworkPolicy = "azure",  // Azure Network Policies
    ServiceCidr = "10.1.0.0/16",
    DnsServiceIP = "10.1.0.10",
    LoadBalancerSku = "standard",
    PodCidr = null,  // Not used with Azure CNI
}
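
With Azure CNI, every pod draws an IP address from the cluster subnet, so the subnet must be sized for the autoscaler maxima up front. A rough sizing check (a sketch; the node and pod maxima mirror the example pool configuration above):

```shell
# Rough Azure CNI subnet sizing: each node consumes one IP for itself
# plus one IP per pod slot (maxPods).
max_nodes=30            # system pool max (10) + user pool max (20)
max_pods_per_node=30    # MaxPods in the pool profiles above

ips_needed=$(( max_nodes * (max_pods_per_node + 1) ))
echo "IPs required at full scale: $ips_needed"

# A /24 offers 251 usable addresses (Azure reserves 5 per subnet)
usable_in_slash24=251
if [ "$ips_needed" -gt "$usable_in_slash24" ]; then
  echo "A /24 is too small at these maxima; plan a larger subnet (e.g. /22)"
fi
```

At the sample maxima this yields 930 addresses, so the example's 10.0.1.0/24 subnet only suffices if the autoscaler limits stay well below those ceilings.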

Managed Identity Setup

User Assigned Managed Identity:

// Create User Assigned Managed Identity
var identity = new ManagedIdentity.UserAssignedIdentity(
    $"atp-{environment}-aks-identity", new()
    {
        ResourceGroupName = resourceGroup.Name,
        Location = location,
        Tags = new()
        {
            { "Environment", environment },
            { "ManagedBy", "pulumi" },
        },
    });

// Grant permissions to identity
var acrRoleAssignment = new Authorization.RoleAssignment(
    "aks-acr-role-assignment", new()
    {
        PrincipalId = identity.PrincipalId,
        PrincipalType = "ServicePrincipal",
        RoleDefinitionId = "/subscriptions/{subscriptionId}/providers/Microsoft.Authorization/roleDefinitions/7f951dda-4ed3-4680-a7ca-43fe172d538d", // AcrPull
        Scope = acr.Id,
    });

Azure Monitor Integration

Container Insights Configuration:

// Log Analytics Workspace
var logAnalyticsWorkspace = new OperationalInsights.Workspace(
    $"atp-{environment}-loganalytics", new()
    {
        ResourceGroupName = resourceGroup.Name,
        Location = location,
        Sku = new OperationalInsights.WorkspaceSkuArgs
        {
            Name = "PerGB2018",
        },
    });

// Enable Container Insights on AKS
var containerInsights = new ContainerService.ManagedClusterAddonProfileArgs
{
    Enabled = true,
    Config = new()
    {
        ["logAnalyticsWorkspaceResourceID"] = logAnalyticsWorkspace.Id,
    },
};

// Add to the ManagedCluster arguments
AddonProfiles = new()
{
    ["omsagent"] = containerInsights,
}

Network Policies and Security Groups

Network Security Group:

// Network Security Group for AKS subnet
var nsg = new NetworkSecurityGroup($"atp-{environment}-aks-nsg", new()
{
    ResourceGroupName = resourceGroup.Name,
    Location = location,
    SecurityRules = new()
    {
        // Allow inbound from Load Balancer
        new SecurityRuleArgs
        {
            Name = "Allow-LoadBalancer-Inbound",
            Priority = 1000,
            Direction = "Inbound",
            Access = "Allow",
            Protocol = "Tcp",
            SourcePortRange = "*",
            DestinationPortRange = "*",
            SourceAddressPrefix = "AzureLoadBalancer",
            DestinationAddressPrefix = "*",
        },
        // Allow outbound to Internet
        new SecurityRuleArgs
        {
            Name = "Allow-Internet-Outbound",
            Priority = 1000,
            Direction = "Outbound",
            Access = "Allow",
            Protocol = "Tcp",
            SourcePortRange = "*",
            DestinationPortRange = "*",
            SourceAddressPrefix = "*",
            DestinationAddressPrefix = "Internet",
        },
    },
    Tags = new()
    {
        { "Environment", environment },
        { "ManagedBy", "pulumi" },
    },
});

// Attach the NSG to the AKS subnet (this extends the earlier subnet
// definition; a second subnet with the same prefix would conflict)
var subnetWithNsg = new Subnet($"atp-{environment}-aks-subnet", new()
{
    ResourceGroupName = resourceGroup.Name,
    VirtualNetworkName = vnet.Name,
    AddressPrefix = "10.0.1.0/24",
    NetworkSecurityGroup = new NetworkSecurityGroupArgs { Id = nsg.Id },
    Delegations = new()
    {
        new SubnetDelegationArgs
        {
            Name = "Microsoft.ContainerService.managedClusters",
            ServiceDelegation = new ServiceDelegationArgs
            {
                Name = "Microsoft.ContainerService/managedClusters",
            },
        },
    },
});

Azure Resource Provisioning

Azure Container Registry (ACR)

ACR Provisioning:

// infrastructure/ACR.cs
public class AzureContainerRegistry
{
    public ContainerRegistry.Registry Registry { get; }

    public AzureContainerRegistry(Pulumi.Stack stack, string environment, string location)
    {
        var config = new Config();
        var resourceGroupName = $"atp-{environment}-rg";
        var acrName = $"atp{environment}acr".Replace("-", ""); // ACR names must be alphanumeric

        this.Registry = new ContainerRegistry.Registry($"atp-{environment}-acr", new()
        {
            RegistryName = acrName,
            ResourceGroupName = resourceGroupName,
            Location = location,
            Sku = new ContainerRegistry.Inputs.SkuArgs
            {
                Name = config.Get("acrSku") ?? "Basic",
            },
            AdminEnabled = environment != "production", // Disable admin for production
            PublicNetworkAccess = config.GetBool("enablePrivateEndpoint") == true 
                ? "Disabled" 
                : "Enabled",
            Tags = new()
            {
                { "Environment", environment },
                { "ManagedBy", "pulumi" },
                { "Project", "ATP" },
            },
        });

        // Enable geo-replication for production
        if (environment == "production" && config.GetBool("enableGeoReplication") == true)
        {
            new ContainerRegistry.Replication($"atp-production-acr-westus2", new()
            {
                ResourceGroupName = resourceGroupName,
                RegistryName = this.Registry.Name,
                Location = "westus2",
                Tags = new()
                {
                    { "Environment", environment },
                    { "ManagedBy", "pulumi" },
                },
            });
        }
    }
}
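
ACR registry names must be 5–50 alphanumeric characters (storage account names are similarly restricted to 3–24 lowercase alphanumerics), which is why the code above strips hyphens. The normalization can be sketched in shell (a hypothetical helper, not part of the ATP codebase):

```shell
# Normalize a resource name to lowercase alphanumerics, as required
# by ACR (5-50 chars) and storage account (3-24 chars) naming rules.
sanitize() {
  printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | tr -cd 'a-z0-9'
}

acr_name=$(sanitize "atp-Production-acr")
echo "$acr_name"    # atpproductionacr
```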

Azure Key Vault

Key Vault Provisioning:

// infrastructure/KeyVault.cs
// Alias the provider namespace so it does not collide with this class name
using AzureKeyVault = Pulumi.AzureNative.KeyVault;

public class KeyVault
{
    public AzureKeyVault.Vault Vault { get; }

    public KeyVault(Pulumi.Stack stack, string environment, string location, 
        Output<string> accessKey)
    {
        var config = new Config();
        var resourceGroupName = $"atp-{environment}-rg";
        var keyVaultName = $"atp-{environment}-kv";

        // Key Vault
        this.Vault = new AzureKeyVault.Vault($"atp-{environment}-kv", new()
        {
            VaultName = keyVaultName,
            ResourceGroupName = resourceGroupName,
            Location = location,
            Properties = new AzureKeyVault.Inputs.VaultPropertiesArgs
            {
                TenantId = config.Require("tenantId"),
                Sku = new AzureKeyVault.Inputs.SkuArgs
                {
                    Family = "A",
                    Name = (config.Get("keyVaultSku") ?? "standard") == "premium"
                        ? AzureKeyVault.SkuName.Premium
                        : AzureKeyVault.SkuName.Standard,
                },
                EnabledForDeployment = false,
                EnabledForTemplateDeployment = false,
                EnabledForDiskEncryption = false,
                EnableRbacAuthorization = true,
                PublicNetworkAccess = config.GetBool("enablePrivateEndpoint") == true 
                    ? "Disabled" 
                    : "Enabled",
            },
            Tags = new()
            {
                { "Environment", environment },
                { "ManagedBy", "pulumi" },
            },
        });

        // Store initial secret
        new AzureKeyVault.Secret("keyVaultAccessKey", new()
        {
            ResourceGroupName = resourceGroupName,
            VaultName = this.Vault.Name,
            Properties = new AzureKeyVault.Inputs.SecretPropertiesArgs
            {
                Value = accessKey,
            },
        });
    }
}

Azure Service Bus

Service Bus Provisioning:

// infrastructure/ServiceBus.cs
// Alias the provider namespace so it does not collide with this class name
using SB = Pulumi.AzureNative.ServiceBus;

public class ServiceBus
{
    public SB.Namespace Namespace { get; }

    public ServiceBus(Pulumi.Stack stack, string environment, string location)
    {
        var config = new Config();
        var resourceGroupName = $"atp-{environment}-rg";
        var serviceBusName = $"atp-{environment}-sb";

        this.Namespace = new SB.Namespace($"atp-{environment}-sb", new()
        {
            NamespaceName = serviceBusName,
            ResourceGroupName = resourceGroupName,
            Location = location,
            Sku = new SB.Inputs.SBSkuArgs
            {
                Name = environment == "production" ? SB.SkuName.Premium : SB.SkuName.Standard,
                Tier = environment == "production" ? SB.SkuTier.Premium : SB.SkuTier.Standard,
            },
            Tags = new()
            {
                { "Environment", environment },
                { "ManagedBy", "pulumi" },
            },
        });

        // Create queues
        var queues = new[] { "audit-events", "export-requests", "notifications" };
        foreach (var queueName in queues)
        {
            new SB.Queue($"{environment}-{queueName}", new()
            {
                ResourceGroupName = resourceGroupName,
                NamespaceName = this.Namespace.Name,
                QueueName = queueName,
                EnablePartitioning = environment == "production",
                MaxDeliveryCount = 10,
                LockDuration = "PT5M",
                DefaultMessageTimeToLive = "P7D",
            });
        }
    }
}
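
LockDuration and DefaultMessageTimeToLive are ISO 8601 durations: PT5M is five minutes (the Service Bus maximum lock duration) and P7D is seven days. A minimal converter for the simple single-designator forms used here (a sketch; it does not handle combined designators like P1DT2H):

```shell
# Convert the simple single-designator ISO 8601 durations used above
# to seconds (minutes, hours, and days only).
to_seconds() {
  d=$1
  case "$d" in
    PT*M) v=${d#PT}; v=${v%M}; echo $(( v * 60 )) ;;
    PT*H) v=${d#PT}; v=${v%H}; echo $(( v * 3600 )) ;;
    P*D)  v=${d#P};  v=${v%D}; echo $(( v * 86400 )) ;;
    *)    echo "unsupported: $d" >&2; return 1 ;;
  esac
}

echo "PT5M = $(to_seconds PT5M) seconds"   # 300
echo "P7D  = $(to_seconds P7D) seconds"    # 604800
```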

Azure Storage Accounts (Blob, Queue)

Storage Account Provisioning:

// infrastructure/StorageAccount.cs
public class StorageAccount
{
    public Storage.StorageAccount Account { get; }

    public StorageAccount(Pulumi.Stack stack, string environment, string location)
    {
        var config = new Config();
        var resourceGroupName = $"atp-{environment}-rg";
        var storageName = $"atp{environment}st".Replace("-", ""); // Must be lowercase, alphanumeric

        this.Account = new Storage.StorageAccount($"atp-{environment}-st", new()
        {
            ResourceGroupName = resourceGroupName,
            Location = location,
            AccountName = storageName,
            Kind = "StorageV2",
            Sku = new Storage.Inputs.SkuArgs
            {
                Name = environment == "production" ? "Standard_GRS" : "Standard_LRS",
            },
            EnableHttpsTrafficOnly = true,
            AllowBlobPublicAccess = false,
            MinimumTlsVersion = "TLS1_2",
            NetworkRuleSet = new Storage.Inputs.NetworkRuleSetArgs
            {
                DefaultAction = config.GetBool("enablePrivateEndpoint") == true 
                    ? "Deny" 
                    : "Allow",
                Bypass = "AzureServices",
            },
            Tags = new()
            {
                { "Environment", environment },
                { "ManagedBy", "pulumi" },
            },
        });

        // Blob Container
        new Storage.BlobContainer("audit-trail", new()
        {
            ResourceGroupName = resourceGroupName,
            AccountName = this.Account.Name,
            ContainerName = "audit-trail",
            PublicAccess = "None",
        });

        // Queue
        new Storage.Queue("audit-processing", new()
        {
            ResourceGroupName = resourceGroupName,
            AccountName = this.Account.Name,
            QueueName = "audit-processing",
        });
    }
}

Application Insights / Log Analytics

Application Insights and Log Analytics:

// infrastructure/Monitoring.cs
public class Monitoring
{
    public OperationalInsights.Workspace LogAnalyticsWorkspace { get; }
    public Insights.Component ApplicationInsights { get; }

    public Monitoring(Pulumi.Stack stack, string environment, string location)
    {
        var config = new Config();
        var resourceGroupName = $"atp-{environment}-rg";

        // Log Analytics Workspace
        this.LogAnalyticsWorkspace = new OperationalInsights.Workspace(
            $"atp-{environment}-loganalytics", new()
            {
                ResourceGroupName = resourceGroupName,
                Location = location,
                Sku = new OperationalInsights.WorkspaceSkuArgs
                {
                    Name = "PerGB2018",
                },
                RetentionInDays = environment == "production" ? 730 : 30,
                Tags = new()
                {
                    { "Environment", environment },
                    { "ManagedBy", "pulumi" },
                },
            });

        // Application Insights
        this.ApplicationInsights = new Insights.Component($"atp-{environment}-appinsights", new()
        {
            ResourceGroupName = resourceGroupName,
            Location = location,
            Kind = "web",
            ApplicationType = "web",
            WorkspaceResourceId = this.LogAnalyticsWorkspace.Id,
            RetentionInDays = environment == "production" ? 730 : 30,
            Tags = new()
            {
                { "Environment", environment },
                { "ManagedBy", "pulumi" },
            },
        });
    }
}

Pulumi State Management

State Backend in Azure Blob Storage

Azure Blob Storage Backend Configuration:

# Initialize Pulumi with the Azure Blob Storage backend
# (the azblob URL names the container; AZURE_STORAGE_ACCOUNT selects the account)
export AZURE_STORAGE_ACCOUNT=atppulumistate
pulumi login --cloud-url azblob://pulumi-state

# Or configure in Pulumi.yaml

Backend Configuration (Pulumi.yaml):

name: atp-infrastructure
runtime: dotnet
backend:
  url: azblob://pulumi-state

Backend Setup:

# Create storage account for state (one-time setup)
az storage account create \
  --name atppulumistate \
  --resource-group atp-shared-rg \
  --location eastus \
  --sku Standard_LRS \
  --allow-blob-public-access false

# Create container
az storage container create \
  --name pulumi-state \
  --account-name atppulumistate \
  --auth-mode login

# Log in to the backend (requires AZURE_STORAGE_ACCOUNT plus a storage
# key or az CLI credentials in the environment)
export AZURE_STORAGE_ACCOUNT=atppulumistate
pulumi login --cloud-url azblob://pulumi-state

State Locking Mechanisms

State Locking:

  • Automatic Locking: Pulumi automatically locks the stack during state-mutating operations
  • Lock Storage: the azblob backend writes lock files alongside the state (under .pulumi/locks)
  • Lock Scope: locks are per stack, so concurrent updates to different stacks are unaffected
  • Lock Release: released automatically when the operation completes; a stuck lock can be cleared with pulumi cancel

Manual Lock Management:

# Inspect the current stack and its resources
pulumi stack --show-urns

# Force unlock (if stuck)
pulumi cancel --stack production

State Encryption

State Encryption at Rest:

# Enable encryption on storage account
az storage account update \
  --name atppulumistate \
  --resource-group atp-shared-rg \
  --encryption-services blob

# Use Azure Key Vault for encryption keys
az storage account update \
  --name atppulumistate \
  --resource-group atp-shared-rg \
  --encryption-key-source Microsoft.Keyvault \
  --encryption-key-vault "https://atp-shared-kv.vault.azure.net/keys/storage-encryption"

Encrypted Secrets in State:

// Secrets are automatically encrypted in state
var password = config.RequireSecret("sqlAdminPassword");
// This value is encrypted in the state file
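
After `pulumi config set --secret`, the value is written to the stack settings file as ciphertext rather than plaintext; only the stack's configured secrets provider (passphrase or KMS key) can decrypt it. The ciphertext below is illustrative:

```yaml
# Pulumi.production.yaml (excerpt)
config:
  atp-infrastructure:sqlAdminPassword:
    secure: v1:AAABt...redacted...0k=
```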

Backup and Recovery

State Backup Strategy:

# Enable blob versioning
az storage account blob-service-properties update \
  --account-name atppulumistate \
  --enable-versioning true

# Enable soft delete
az storage account blob-service-properties update \
  --account-name atppulumistate \
  --enable-delete-retention true \
  --delete-retention-days 30

# Manual backup (stack checkpoints live under .pulumi/stacks/ in the container)
az storage blob download \
  --account-name atppulumistate \
  --container-name pulumi-state \
  --name .pulumi/stacks/production.json \
  --file backup-$(date +%Y%m%d)-production.json

State Recovery:

# Restore from backup
az storage blob upload \
  --account-name atppulumistate \
  --container-name pulumi-state \
  --name .pulumi/stacks/production.json \
  --file backup-20240115-production.json \
  --overwrite

GitOps Workflow for Infrastructure

Infrastructure Changes via PR

PR Workflow for Infrastructure:

graph LR
    A[Developer] -->|Create PR| B[Infrastructure Code<br/>in Git]
    B -->|PR Validation| C[Pulumi Preview]
    C -->|Lint & Validate| D[Security Scan]
    D -->|Review| E[Manual Approval]
    E -->|Merge| F[Pulumi Up]
    F -->|Update State| G[Azure Resources<br/>Provisioned]

    style B fill:#90EE90
    style C fill:#FFE5B4
    style D fill:#FFE5B4
    style E fill:#ffcccc
    style F fill:#90EE90
    style G fill:#ffcccc

Pulumi Preview in PR Validation

Azure Pipeline: PR Validation:

# .azuredevops/pipelines/infrastructure-pr-validation.yml
trigger: none

pr:
  branches:
    include:
      - main
      - staging
      - test
      - dev

pool:
  vmImage: 'ubuntu-latest'

variables:
  - group: ATP-Infrastructure-Variables

stages:
- stage: ValidateInfrastructure
  displayName: 'Validate Infrastructure Changes'
  jobs:
  - job: PulumiPreview
    displayName: 'Pulumi Preview'
    steps:
    - checkout: self

    # Determine stack from the PR target branch
    # (in PR builds, Build.SourceBranch is refs/pull/*/merge, so map the
    # target branch instead)
    - script: |
        case "$(System.PullRequest.TargetBranch)" in
          refs/heads/main)
            STACK="production"
            ;;
          refs/heads/staging)
            STACK="staging"
            ;;
          refs/heads/test)
            STACK="test"
            ;;
          *)
            STACK="dev"
            ;;
        esac
        echo "##vso[task.setvariable variable=PulumiStack]$STACK"
      displayName: 'Determine Pulumi stack'

    # Install Pulumi
    - script: |
        curl -fsSL https://get.pulumi.com | sh
        export PATH="$HOME/.pulumi/bin:$PATH"
        pulumi version
      displayName: 'Install Pulumi'

    # Restore .NET dependencies
    - script: |
        dotnet restore
      displayName: 'Restore .NET dependencies'

    # Set stack configuration
    - script: |
        export PATH="$HOME/.pulumi/bin:$PATH"
        pulumi stack select $(PulumiStack)
        pulumi config set azure-native:location $(AzureLocation)
        pulumi config set environment $(PulumiStack)
      displayName: 'Configure Pulumi stack'
      env:
        PULUMI_ACCESS_TOKEN: $(PulumiAccessToken)
        PULUMI_CONFIG_PASSPHRASE: $(PulumiConfigPassphrase)

    # Run Pulumi preview
    - script: |
        export PATH="$HOME/.pulumi/bin:$PATH"
        pulumi preview --stack $(PulumiStack) \
          --diff \
          --json > preview-output.json
      displayName: 'Run Pulumi preview'
      continueOnError: true
      env:
        PULUMI_ACCESS_TOKEN: $(PulumiAccessToken)
        PULUMI_CONFIG_PASSPHRASE: $(PulumiConfigPassphrase)

    # Publish preview output
    - task: PublishPipelineArtifact@1
      condition: always()
      inputs:
        targetPath: 'preview-output.json'
        artifactName: 'pulumi-preview-$(PulumiStack)'
        publishLocation: 'pipeline'

    # Validate preview output
    - script: |
        if [ -s preview-output.json ]; then
          echo "✅ Preview generated successfully"
          # Check for destroy operations (require special approval)
          if grep -q '"op": *"delete"' preview-output.json; then
            echo "⚠️  WARNING: Preview contains resource deletions"
            exit 1
          fi
        else
          echo "❌ Preview failed or produced no output"
          exit 1
        fi
      displayName: 'Validate preview output'
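
The branch-to-stack mapping embedded in the pipeline above can be exercised locally as a plain shell function (a sketch mirroring the case statement):

```shell
# Map a source branch ref to a Pulumi stack name, mirroring the
# pipeline's case statement.
stack_for_branch() {
  case "$1" in
    refs/heads/main)    echo "production" ;;
    refs/heads/staging) echo "staging" ;;
    refs/heads/test)    echo "test" ;;
    *)                  echo "dev" ;;
  esac
}

echo "$(stack_for_branch refs/heads/main)"         # production
echo "$(stack_for_branch refs/heads/feature/x)"    # dev
```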

Manual Approval for Infrastructure Changes

Approval Gates:

# Azure Pipeline: Infrastructure deployment
trigger:
  branches:
    include:
      - main
      - staging

stages:
- stage: ApprovalGate
  displayName: 'Infrastructure Change Approval'
  jobs:
  - job: WaitForApproval
    displayName: 'Wait for Approval'
    pool: server
    steps:
    - task: ManualValidation@0
      timeoutInMinutes: 1440  # 24 hours
      inputs:
        notifyUsers: 'architect-team@connectsoft.example;sre-lead@connectsoft.example'
        instructions: |
          Review the Pulumi preview output before approving infrastructure changes.

          ⚠️  WARNING: Infrastructure changes can affect production services.

          Please verify:
          - Resource changes are expected
          - No unintended resource deletions
          - Configuration values are correct
          - Cost impact is acceptable

- stage: DeployInfrastructure
  displayName: 'Deploy Infrastructure'
  dependsOn: ApprovalGate
  condition: succeeded()
  jobs:
  - job: PulumiUp
    steps:
    # ... Pulumi up steps

Pulumi Up After Approval

Deployment Stage:

- stage: DeployInfrastructure
  displayName: 'Deploy Infrastructure'
  jobs:
  - job: PulumiUp
    displayName: 'Pulumi Up'
    steps:
    - checkout: self

    - script: |
        curl -fsSL https://get.pulumi.com | sh
        export PATH="$HOME/.pulumi/bin:$PATH"
      displayName: 'Install Pulumi'

    - script: |
        dotnet restore
        dotnet build
      displayName: 'Build Pulumi program'

    - script: |
        export PATH="$HOME/.pulumi/bin:$PATH"
        pulumi stack select $(PulumiStack)
        pulumi config set azure-native:location $(AzureLocation)
      displayName: 'Configure stack'
      env:
        PULUMI_ACCESS_TOKEN: $(PulumiAccessToken)
        PULUMI_CONFIG_PASSPHRASE: $(PulumiConfigPassphrase)

    - script: |
        export PATH="$HOME/.pulumi/bin:$PATH"
        pulumi up --stack $(PulumiStack) \
          --yes \
          --skip-preview
      displayName: 'Deploy infrastructure'
      env:
        PULUMI_ACCESS_TOKEN: $(PulumiAccessToken)
        PULUMI_CONFIG_PASSPHRASE: $(PulumiConfigPassphrase)

Infrastructure Drift Detection

Drift Detection Script:

#!/bin/bash
# scripts/detect-infrastructure-drift.sh

set -euo pipefail

STACK="${1:-production}"

echo "🔍 Detecting infrastructure drift for stack: $STACK"

# Refresh state
pulumi refresh --stack "$STACK" --yes

# Preview against the refreshed state; --expect-no-changes makes the
# command exit non-zero when any change (i.e. drift) is detected
if pulumi preview --stack "$STACK" --diff --expect-no-changes > drift-diff.txt 2>&1; then
  echo "✅ No infrastructure drift detected"
  exit 0
else
  echo "⚠️  Infrastructure drift detected!"
  cat drift-diff.txt

  # Send alert
  echo "🚨 Alert: Infrastructure drift detected in $STACK stack"
  exit 1
fi

Scheduled Drift Detection:

# Azure Pipeline: Scheduled drift detection
schedules:
- cron: "0 2 * * *"  # Daily at 2 AM UTC
  branches:
    include:
      - main
  displayName: 'Daily Infrastructure Drift Detection'

stages:
- stage: DriftDetection
  displayName: 'Detect Infrastructure Drift'
  jobs:
  - job: CheckDrift
    steps:
    - script: |
        ./scripts/detect-infrastructure-drift.sh production
      displayName: 'Detect drift'

Pulumi Automation API

Programmatic Infrastructure Management

Pulumi Automation API Example:

// infrastructure/Automation.cs
using System;
using System.Threading.Tasks;
using Pulumi.Automation;

public class InfrastructureAutomation
{
    public static async Task<UpdateSummary?> UpdateInfrastructureAsync(
        string stackName, string projectName)
    {
        // Create or select the stack (inline program, no separate CLI invocation)
        var stack = await LocalWorkspace.CreateOrSelectStackAsync(
            new InlineProgramArgs(projectName, stackName)
            {
                Program = PulumiFn.Create<ATPStack>(),
            });

        // Set stack configuration
        await stack.SetConfigAsync("azure-native:location", 
            new ConfigValue("eastus"));

        // Preview changes
        var preview = await stack.PreviewAsync(new PreviewOptions
        {
            OnStandardOutput = Console.WriteLine,
        });

        if (preview.ChangeSummary.ContainsKey(OperationType.Create) || 
            preview.ChangeSummary.ContainsKey(OperationType.Update))
        {
            // Deploy changes
            var update = await stack.UpAsync(new UpOptions
            {
                OnStandardOutput = Console.WriteLine,
            });

            return update.Summary;
        }

        return null;
    }
}

Dynamic Infrastructure Provisioning

Dynamic Resource Creation:

// Create resources dynamically based on configuration
var environments = new[] { "dev", "test", "staging", "production" };

foreach (var env in environments)
{
    var stack = new StackReference($"ConnectSoft/atp-infrastructure/{env}");

    // Create environment-specific resources
    var aks = new AKSCluster(stack, env, "eastus", 3, "Standard_D2s_v3");
    var acr = new AzureContainerRegistry(stack, env, "eastus");
}

Pulumi Policy as Code

Resource Validation Policies

Pulumi Policy Example:

// policies/enforce-tagging.ts
import { PolicyPack } from "@pulumi/policy";

const policies = new PolicyPack("atp-tagging-policies", {
    policies: [{
        name: "require-environment-tag",
        description: "All resources must have an Environment tag",
        enforcementLevel: "mandatory",
        validateResource: (args, reportViolation) => {
            const tags = args.props.tags || {};
            if (!tags.Environment) {
                reportViolation("Resource must have an Environment tag");
            }
        },
    }, {
        name: "require-managedby-tag",
        description: "All resources must have a ManagedBy tag",
        enforcementLevel: "mandatory",
        validateResource: (args, reportViolation) => {
            const tags = args.props.tags || {};
            if (tags.ManagedBy !== "pulumi") {
                reportViolation("Resource must have ManagedBy=pulumi tag");
            }
        },
    }],
});

Compliance Checks (Tagging, Encryption, etc.)

Compliance Policies:

// policies/compliance-policies.ts
import { PolicyPack } from "@pulumi/policy";

const policies = new PolicyPack("atp-compliance-policies", {
    policies: [{
        name: "require-https-traffic-only",
        description: "Storage accounts must require HTTPS-only traffic",
        enforcementLevel: "mandatory",
        validateResource: (args, reportViolation) => {
            if (args.type === "azure-native:storage:StorageAccount") {
                if (!args.props.enableHttpsTrafficOnly) {
                    reportViolation("Storage account must have HTTPS-only traffic enabled");
                }
            }
        },
    }, {
        name: "prevent-public-blob-access",
        description: "Storage accounts must not allow public blob access",
        enforcementLevel: "mandatory",
        validateResource: (args, reportViolation) => {
            if (args.type === "azure-native:storage:StorageAccount") {
                if (args.props.allowBlobPublicAccess) {
                    reportViolation("Storage account must not allow public blob access");
                }
            }
        },
    }],
});

Cost Controls (SKU Limits, Region Restrictions)

Cost Control Policies:

// policies/cost-control-policies.ts
import { PolicyPack } from "@pulumi/policy";

const policies = new PolicyPack("atp-cost-control-policies", {
    policies: [{
        name: "limit-aks-node-vm-size",
        description: "AKS node VM size must not exceed Standard_D4s_v3",
        enforcementLevel: "mandatory",
        validateResource: (args, reportViolation) => {
            if (args.type === "azure-native:containerservice:ManagedCluster") {
                const allowedSizes = ["Standard_D2s_v3", "Standard_D4s_v3"];
                const agentPools = args.props.agentPoolProfiles || [];
                for (const pool of agentPools) {
                    if (pool.vmSize && !allowedSizes.includes(pool.vmSize)) {
                        reportViolation(`VM size ${pool.vmSize} exceeds maximum allowed (Standard_D4s_v3)`);
                    }
                }
            }
        },
    }, {
        name: "restrict-regions",
        description: "Resources must be deployed only in approved regions",
        enforcementLevel: "mandatory",
        validateResource: (args, reportViolation) => {
            const allowedRegions = ["eastus", "westus2"];
            const location = args.props.location;
            if (location && !allowedRegions.includes(location)) {
                reportViolation(`Region ${location} is not in the approved list: ${allowedRegions.join(", ")}`);
            }
        },
    }],
});

Apply Policies:

# Enable the published policy pack for the organization
pulumi policy enable ConnectSoft/atp-policies latest

# Validate a local policy pack during preview
pulumi preview --policy-pack ./policies

Infrastructure Drift Detection

Detecting Out-of-Band Changes

Drift Detection Workflow:

#!/bin/bash
# scripts/detect-drift.sh

STACK="${1:-production}"

echo "🔄 Refreshing state to detect drift..."

# Refresh state from Azure
pulumi refresh --stack "$STACK" --yes

# Preview changes; --expect-no-changes exits non-zero when drift exists
if pulumi preview --stack "$STACK" --diff --expect-no-changes > drift-report.txt; then
  echo "✅ No drift detected"
else
  echo "⚠️  Drift detected!"
  cat drift-report.txt

  # Send alert
  echo "🚨 Infrastructure drift detected in $STACK stack" | \
    mail -s "Infrastructure Drift Alert" sre-team@connectsoft.example
fi

Pulumi Refresh and Diff

Refresh and Diff Commands:

# Refresh state from actual Azure resources
pulumi refresh --stack production

# Preview differences (drift)
pulumi preview --stack production --diff

# List stack resources with their URNs
pulumi stack --show-urns --stack production

Automated Drift Correction or Alerts

Automated Drift Correction:

# Azure Pipeline: Automated drift correction
schedules:
- cron: "0 3 * * *"  # Daily at 3 AM UTC
  branches:
    include:
      - main

stages:
- stage: DriftCorrection
  displayName: 'Automated Drift Correction'
  jobs:
  - job: CorrectDrift
    steps:
    - script: |
        pulumi refresh --stack production --yes

        if pulumi preview --stack production --diff --expect-no-changes > drift-diff.txt; then
          echo "✅ No drift detected"
        else
          # Auto-correct only when drift is limited to tags and nothing is deleted
          if grep -q "tags" drift-diff.txt && ! grep -q "delete" drift-diff.txt; then
            echo "✅ Auto-correcting drift (tags only)"
            pulumi up --stack production --yes
          else
            echo "⚠️  Manual intervention required"
            # Send alert
            exit 1
          fi
        fi
      displayName: 'Correct drift'

Disaster Recovery

Infrastructure Re-Provisioning from Git

DR Procedure:

#!/bin/bash
# scripts/disaster-recovery.sh

ENVIRONMENT="${1:-production}"
RESOURCE_GROUP="${2:-atp-production-rg}"

echo "🚨 Starting disaster recovery for $ENVIRONMENT..."

# 1. Verify Git repository is accessible
git clone https://dev.azure.com/ConnectSoft/ATP/_git/atp-infrastructure.git
cd atp-infrastructure

# 2. Configure backend first (state may be lost; restore from backup if needed)
pulumi login azblob://atp-pulumi-state

# 3. Select Pulumi stack
pulumi stack select "$ENVIRONMENT"

# 4. Re-provision infrastructure
pulumi up --stack "$ENVIRONMENT" --yes

echo "✅ Disaster recovery complete"

RTO and RPO Targets

DR Targets:

| Metric | Target | Notes |
|---|---|---|
| RTO | 4 hours | Time to restore infrastructure |
| RPO | 24 hours | Maximum acceptable data loss |
| State Recovery | 1 hour | Time to restore Pulumi state |
| Infrastructure Provisioning | 2 hours | Time to provision all resources |
| Application Deployment | 1 hour | Time to deploy applications via GitOps |
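As a sanity check, the phase-level targets (state recovery, provisioning, application deployment) should sum to no more than the overall RTO. A small illustrative sketch (the names and structure are ours, not part of the ATP tooling):

```python
# Illustrative RTO budget check; phase names and durations mirror the table above.
RTO_HOURS = 4.0

RECOVERY_PHASES = {
    "state_recovery": 1.0,               # restore Pulumi state
    "infrastructure_provisioning": 2.0,  # pulumi up
    "application_deployment": 1.0,       # GitOps sync
}

def within_rto(phases: dict, rto_hours: float) -> bool:
    """True when the summed phase durations fit inside the RTO."""
    return sum(phases.values()) <= rto_hours
```

Running `within_rto(RECOVERY_PHASES, RTO_HOURS)` confirms the published phase targets fit the 4-hour RTO exactly, with no slack for coordination overhead.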

DR Drill Procedures

DR Drill Checklist:

## Disaster Recovery Drill Checklist

### Pre-Drill
- [ ] Schedule DR drill (quarterly)
- [ ] Notify stakeholders
- [ ] Backup current state
- [ ] Document current infrastructure state

### Drill Execution
- [ ] Simulate disaster scenario
- [ ] Verify Git repository accessibility
- [ ] Restore Pulumi state (if needed)
- [ ] Re-provision infrastructure
- [ ] Verify resource provisioning
- [ ] Deploy applications via GitOps
- [ ] Run smoke tests
- [ ] Verify application functionality

### Post-Drill
- [ ] Document findings
- [ ] Update DR procedures
- [ ] Review RTO/RPO targets
- [ ] Schedule next drill

DR Drill Script:

#!/bin/bash
# scripts/dr-drill.sh

ENVIRONMENT="${1:-staging}"  # Use staging for drills

echo "🎯 Starting DR drill for $ENVIRONMENT environment..."

# 1. Backup current state
echo "📦 Backing up current state..."
az storage blob download \
  --account-name atppulumistate \
  --container-name pulumi-state \
  --name ".pulumi/stacks/$ENVIRONMENT.json" \
  --file "backup-$(date +%Y%m%d)-$ENVIRONMENT.json"

# 2. Destroy infrastructure (simulate disaster)
echo "💥 Simulating disaster (destroying infrastructure)..."
read -p "Are you sure? (yes/no): " confirm
if [ "$confirm" == "yes" ]; then
  pulumi destroy --stack "$ENVIRONMENT" --yes
fi

# 3. Re-provision from Git
echo "🔨 Re-provisioning infrastructure..."
pulumi up --stack "$ENVIRONMENT" --yes

# 4. Verify
echo "✅ DR drill complete. Verify infrastructure is operational."

Summary: Pulumi Infrastructure as Code Integration

  • Pulumi Overview: C# programming model for ATP infrastructure with type safety and testability
  • Stack Management: Environment-specific stacks (dev, test, staging, production) with configuration and secrets
  • AKS Provisioning: Complete cluster configuration with node pools, networking (Azure CNI), managed identity, Azure Monitor integration
  • Azure Resources: ACR, Key Vault, Service Bus, Storage Accounts, Application Insights/Log Analytics
  • State Management: Azure Blob Storage backend with locking, encryption, and backup/recovery
  • GitOps Workflow: Infrastructure changes via PR, Pulumi preview validation, manual approval, automated deployment
  • Automation API: Programmatic infrastructure management and dynamic provisioning
  • Policy as Code: Resource validation, compliance checks, cost controls
  • Drift Detection: Automated detection and correction of out-of-band changes
  • Disaster Recovery: Infrastructure re-provisioning from Git with RTO/RPO targets and DR drill procedures

Azure Key Vault Secret Management

Purpose: Define how Azure Key Vault is used for secure secret management in ATP, integrating with Kubernetes workloads via Workload Identity, External Secrets Operator, and CSI Driver to ensure secrets are never stored in Git and are securely injected into pods at runtime.


Azure Key Vault Architecture

Key Vault per Environment

Key Vault Organization:

| Environment | Key Vault Name | Resource Group | Purpose |
|---|---|---|---|
| Dev | atp-dev-kv | atp-dev-rg | Development secrets |
| Test | atp-test-kv | atp-test-rg | Testing secrets |
| Staging | atp-staging-kv | atp-staging-rg | Pre-production secrets |
| Production | atp-prod-kv | atp-prod-rg | Production secrets |
| Shared | atp-shared-kv | atp-shared-rg | Cross-environment secrets |

Key Vault Provisioning with Pulumi:

// infrastructure/KeyVault.cs
using Pulumi;
using KeyVault = Pulumi.AzureNative.KeyVault;
using Network = Pulumi.AzureNative.Network;

public class AtpKeyVault
{
    public KeyVault.Vault Vault { get; }

    public AtpKeyVault(string environment, string location, Input<string> subnetId)
    {
        var config = new Config();
        var resourceGroupName = $"atp-{environment}-rg";
        var keyVaultName = $"atp-{environment}-kv";

        this.Vault = new KeyVault.Vault(keyVaultName, new()
        {
            ResourceGroupName = resourceGroupName,
            Location = location,
            Properties = new KeyVault.Inputs.VaultPropertiesArgs
            {
                TenantId = config.Require("tenantId"),
                Sku = new KeyVault.Inputs.SkuArgs
                {
                    Family = KeyVault.SkuFamily.A,
                    Name = environment == "production"
                        ? KeyVault.SkuName.Premium
                        : KeyVault.SkuName.Standard,
                },
                EnabledForDeployment = false,
                EnabledForTemplateDeployment = false,
                EnabledForDiskEncryption = false,
                EnableRbacAuthorization = true,  // Use RBAC instead of access policies
                EnableSoftDelete = true,
                SoftDeleteRetentionInDays = environment == "production" ? 90 : 7,
                EnablePurgeProtection = environment == "production",
                PublicNetworkAccess = config.GetBoolean("enablePrivateEndpoint") == true 
                    ? "Disabled" 
                    : "Enabled",
            },
            Tags = new()
            {
                { "Environment", environment },
                { "ManagedBy", "pulumi" },
                { "Compliance", "SOC2" },
            },
        });

        // Private endpoint for production
        if (environment == "production" && config.GetBoolean("enablePrivateEndpoint") == true)
        {
            new Network.PrivateEndpoint($"atp-{environment}-kv-pe", new()
            {
                ResourceGroupName = resourceGroupName,
                Location = location,
                Subnet = new Network.Inputs.SubnetArgs
                {
                    Id = subnetId,
                },
                PrivateLinkServiceConnections = new[]
                {
                    new Network.Inputs.PrivateLinkServiceConnectionArgs
                    {
                        Name = "keyvault-connection",
                        PrivateLinkServiceId = this.Vault.Id,
                        GroupIds = new[] { "vault" },
                    },
                },
            });
        }
    }
}

Secret Organization and Naming

Secret Naming Conventions:

Pattern: {category}/{service}/{secret-name}
Examples:
  - connection-strings/atp-ingestion/sql-connection-string
  - api-keys/atp-gateway/stripe-api-key
  - certificates/atp-gateway/tls-cert
  - credentials/atp-query/service-account-password
  - encryption-keys/atp-integrity/data-encryption-key
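Because Key Vault object names allow only alphanumerics and hyphens, a small helper can flatten the logical path into a valid secret name and reject anything else (an illustrative sketch; the function name is ours):

```python
import re

# Key Vault object names: 1-127 chars, alphanumerics and hyphens only.
_KV_NAME = re.compile(r"^[0-9a-zA-Z-]{1,127}$")

def kv_secret_name(category: str, service: str, secret: str) -> str:
    """Flatten {category}/{service}/{secret-name} into a valid Key Vault secret name."""
    name = f"{category}-{service}-{secret}"
    if not _KV_NAME.match(name):
        raise ValueError(f"invalid Key Vault secret name: {name!r}")
    return name
```

For example, `kv_secret_name("connection-strings", "atp-ingestion", "sql-connection-string")` yields `connection-strings-atp-ingestion-sql-connection-string`, the name used in the az CLI examples below.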

Secret Categories:

atp-{env}-kv/
├── connection-strings/
│   ├── atp-ingestion/sql-connection-string
│   ├── atp-query/redis-connection-string
│   └── atp-export/blob-storage-connection-string
├── api-keys/
│   ├── atp-gateway/stripe-api-key
│   ├── atp-gateway/sendgrid-api-key
│   └── atp-search/elasticsearch-api-key
├── certificates/
│   ├── atp-gateway/tls-cert
│   └── atp-integrity/signing-cert
├── credentials/
│   ├── atp-query/service-account-password
│   └── atp-export/external-api-credentials
└── encryption-keys/
    └── atp-integrity/data-encryption-key

Secret Metadata Tags:

// Set secret with metadata
var secret = new KeyVault.Secret("sql-connection-string", new()
{
    ResourceGroupName = resourceGroupName,
    VaultName = keyVault.Name,
    Properties = new KeyVault.Inputs.SecretPropertiesArgs
    {
        Value = connectionString,
        ContentType = "application/json",
        Attributes = new KeyVault.Inputs.SecretAttributesArgs
        {
            Enabled = true,
            // azure-native expects the expiry as seconds since the Unix epoch
            Expires = (int)DateTimeOffset.UtcNow.AddYears(1).ToUnixTimeSeconds(),
        },
    },
    Tags = new()
    {
        { "Category", "connection-strings" },
        { "Service", "atp-ingestion" },
        { "Environment", environment },
        { "RotatedBy", "automation" },
        { "LastRotated", DateTimeOffset.UtcNow.ToString("O") },
    },
});

Access Policies vs RBAC

RBAC Configuration (Recommended):

// Grant Key Vault Secrets User role to AKS Workload Identity
var workloadIdentityRoleAssignment = new Authorization.RoleAssignment(
    "workload-identity-kv-secrets-user", new()
    {
        PrincipalId = workloadIdentityPrincipalId,
        PrincipalType = "ServicePrincipal",
        RoleDefinitionId = "/subscriptions/{subscriptionId}/providers/Microsoft.Authorization/roleDefinitions/4633458b-17de-408a-b874-0445c86b69e6", // Key Vault Secrets User
        Scope = keyVault.Id,
    });

// Grant Key Vault Secrets Officer role for secret rotation
var rotationRoleAssignment = new Authorization.RoleAssignment(
    "rotation-kv-secrets-officer", new()
    {
        PrincipalId = rotationServicePrincipalId,
        PrincipalType = "ServicePrincipal",
        RoleDefinitionId = "/subscriptions/{subscriptionId}/providers/Microsoft.Authorization/roleDefinitions/b86a8fe4-44ce-494c-a47a-613bb0b0c8c7", // Key Vault Secrets Officer
        Scope = keyVault.Id,
    });

RBAC vs Access Policies Comparison:

| Feature | RBAC (Recommended) | Access Policies |
|---|---|---|
| Granularity | Role-based (Key Vault Secrets User, Officer) | Permission-based (get, list, set, delete) |
| Audit Trail | ✅ Better audit logging | ⚠️ Limited |
| Centralized Management | ✅ Azure AD integration | ❌ Vault-specific |
| Least Privilege | ✅ Fine-grained roles | ⚠️ Can be overly permissive |
| Maintenance | ✅ Easier to manage | ❌ Manual per-vault |

ATP Selection: RBAC

Rationale:

  • ✅ Better audit trail for compliance
  • ✅ Centralized Azure AD management
  • ✅ Fine-grained role assignments
  • ✅ Easier to maintain and review


Secret Categories

Connection Strings (Databases, Service Bus)

SQL Database Connection String:

# Set SQL connection string secret
az keyvault secret set \
  --vault-name atp-prod-kv \
  --name "connection-strings-atp-ingestion-sql-connection-string" \
  --value "Server=tcp:atp-prod-sql.database.windows.net,1433;Initial Catalog=ATP;Persist Security Info=False;User ID=atp-ingestion;Password=SecurePassword123!;MultipleActiveResultSets=False;Encrypt=True;TrustServerCertificate=False;Connection Timeout=30;" \
  --content-type "application/json" \
  --tags Category=connection-strings Service=atp-ingestion Environment=production

Redis Connection String:

# Set Redis connection string secret
az keyvault secret set \
  --vault-name atp-prod-kv \
  --name "connection-strings-atp-query-redis-connection-string" \
  --value "atp-prod-redis.redis.cache.windows.net:6380,password=SecurePassword123!,ssl=True,abortConnect=False" \
  --content-type "application/json" \
  --tags Category=connection-strings Service=atp-query Environment=production

Service Bus Connection String:

# Set Service Bus connection string secret
az keyvault secret set \
  --vault-name atp-prod-kv \
  --name "connection-strings-atp-ingestion-servicebus-connection-string" \
  --value "Endpoint=sb://atp-prod-sb.servicebus.windows.net/;SharedAccessKeyName=RootManageSharedAccessKey;SharedAccessKey=SecureKey123!" \
  --content-type "application/json" \
  --tags Category=connection-strings Service=atp-ingestion Environment=production

API Keys and Tokens

External API Keys:

# Set Stripe API key
az keyvault secret set \
  --vault-name atp-prod-kv \
  --name "api-keys-atp-gateway-stripe-api-key" \
  --value "sk_live_51ABC123..." \
  --content-type "text/plain" \
  --tags Category=api-keys Service=atp-gateway Environment=production Provider=stripe

# Set SendGrid API key
az keyvault secret set \
  --vault-name atp-prod-kv \
  --name "api-keys-atp-gateway-sendgrid-api-key" \
  --value "SG.ABC123..." \
  --content-type "text/plain" \
  --tags Category=api-keys Service=atp-gateway Environment=production Provider=sendgrid

JWT Tokens:

# Set JWT signing key
az keyvault secret set \
  --vault-name atp-prod-kv \
  --name "api-keys-atp-gateway-jwt-signing-key" \
  --value "-----BEGIN PRIVATE KEY-----\nABC123...\n-----END PRIVATE KEY-----" \
  --content-type "application/x-pem-file" \
  --tags Category=api-keys Service=atp-gateway Environment=production Type=jwt-signing-key

Certificates (TLS, Signing)

TLS Certificate:

# Import TLS certificate from file
az keyvault certificate import \
  --vault-name atp-prod-kv \
  --name "certificates-atp-gateway-tls-cert" \
  --file tls-cert.pfx \
  --password "SecurePassword123!" \
  --tags Category=certificates Service=atp-gateway Environment=production Type=tls

# Or create certificate from CSR
az keyvault certificate create \
  --vault-name atp-prod-kv \
  --name "certificates-atp-gateway-tls-cert" \
  --policy "$(cat cert-policy.json)"

Signing Certificate:

# Import signing certificate
az keyvault certificate import \
  --vault-name atp-prod-kv \
  --name "certificates-atp-integrity-signing-cert" \
  --file signing-cert.pfx \
  --password "SecurePassword123!" \
  --tags Category=certificates Service=atp-integrity Environment=production Type=signing

Encryption Keys

Data Encryption Key:

# Create encryption key
az keyvault key create \
  --vault-name atp-prod-kv \
  --name "encryption-keys-atp-integrity-data-encryption-key" \
  --kty RSA \
  --size 2048 \
  --ops encrypt decrypt \
  --tags Category=encryption-keys Service=atp-integrity Environment=production

Credentials (Service Accounts)

Service Account Password:

# Set service account password
az keyvault secret set \
  --vault-name atp-prod-kv \
  --name "credentials-atp-query-service-account-password" \
  --value "SecurePassword123!" \
  --content-type "text/plain" \
  --tags Category=credentials Service=atp-query Environment=production Type=service-account

Workload Identity for Pods

Azure AD Workload Identity Overview

Workload Identity Architecture:

graph LR
    A[Pod] -->|Token Request| B[Azure AD<br/>OIDC Issuer]
    B -->|JWT Token| A
    A -->|Authenticate| C[Azure Key Vault]
    C -->|Return Secret| A

    style A fill:#90EE90
    style B fill:#FFE5B4
    style C fill:#ffcccc

Benefits of Workload Identity:

  • ✅ No secrets stored in Kubernetes
  • ✅ Automatic token rotation
  • ✅ Fine-grained RBAC permissions
  • ✅ Audit trail via Azure AD logs
  • ✅ No certificate management

Federated Credentials Configuration

Federated Credential Setup:

// Create User Assigned Managed Identity
var workloadIdentity = new ManagedIdentity.UserAssignedIdentity(
    "atp-workload-identity", new()
    {
        ResourceGroupName = resourceGroupName,
        Location = location,
    });

// Create federated credential for the Kubernetes ServiceAccount.
// The issuer must be the AKS cluster's OIDC issuer URL
// (az aks show --query "oidcIssuerProfile.issuerUrl").
var federatedCredential = new ManagedIdentity.FederatedIdentityCredential(
    "atp-federated-credential", new()
    {
        ResourceGroupName = resourceGroupName,
        ResourceName = workloadIdentity.Name,
        Issuer = aksOidcIssuerUrl,
        Subject = "system:serviceaccount:atp-production:atp-ingestion", // K8s ServiceAccount
        Audiences = new[] { "api://AzureADTokenExchange" },
    });

Azure CLI Setup:

# Create User Assigned Managed Identity
az identity create \
  --name atp-workload-identity \
  --resource-group atp-production-rg

# Look up the AKS cluster's OIDC issuer URL
AKS_OIDC_ISSUER=$(az aks show \
  --name atp-production-aks \
  --resource-group atp-production-rg \
  --query "oidcIssuerProfile.issuerUrl" -o tsv)

# Create federated credential
az identity federated-credential create \
  --name atp-federated-credential \
  --identity-name atp-workload-identity \
  --resource-group atp-production-rg \
  --issuer "$AKS_OIDC_ISSUER" \
  --subject "system:serviceaccount:atp-production:atp-ingestion" \
  --audiences "api://AzureADTokenExchange"

ServiceAccount Annotation

ServiceAccount with Workload Identity:

# apps/atp-ingestion/base/serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: atp-ingestion
  namespace: atp-production
  annotations:
    azure.workload.identity/client-id: "12345678-1234-1234-1234-123456789abc"  # Managed Identity Client ID
    azure.workload.identity/tenant-id: "87654321-4321-4321-4321-cba987654321"  # Azure AD Tenant ID

Pod Authentication Flow

Pod Configuration with Workload Identity:

# apps/atp-ingestion/base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
  namespace: atp-production
spec:
  template:
    metadata:
      labels:
        azure.workload.identity/use: "true"  # Enable Workload Identity
    spec:
      serviceAccountName: atp-ingestion
      containers:
      - name: atp-ingestion
        image: connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d
        env:
        # Secret will be injected via External Secrets Operator
        - name: SQL_CONNECTION_STRING
          valueFrom:
            secretKeyRef:
              name: sql-connection-string  # Created by External Secrets Operator
              key: connection-string

Authentication Flow:

  1. Pod starts with Workload Identity annotation
  2. Azure AD Workload Identity webhook injects AZURE_CLIENT_ID, AZURE_TENANT_ID, AZURE_FEDERATED_TOKEN_FILE environment variables
  3. Pod authenticates to Azure AD using federated token
  4. Azure AD returns access token
  5. Pod uses access token to access Key Vault (via External Secrets Operator or CSI Driver)
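Under the hood, step 3 is a standard OAuth2 client-credentials exchange in which the projected Kubernetes token is presented as a client assertion; the Azure SDK credentials (e.g. WorkloadIdentityCredential) perform this automatically. A sketch of the request being built (illustrative Python; no network call is made, and the function name is ours):

```python
def build_token_request(tenant_id: str, client_id: str, federated_token: str,
                        scope: str = "https://vault.azure.net/.default"):
    """Build the Azure AD token request that trades the projected Kubernetes
    service-account token for an Azure AD access token."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    form = {
        "grant_type": "client_credentials",
        "client_id": client_id,
        "scope": scope,
        # The federated (Kubernetes) token is presented as a client assertion
        "client_assertion_type": "urn:ietf:params:oauth:client-assertion-type:jwt-bearer",
        "client_assertion": federated_token,
    }
    return url, form
```

In a real pod, `tenant_id`, `client_id`, and the token file path come from the `AZURE_TENANT_ID`, `AZURE_CLIENT_ID`, and `AZURE_FEDERATED_TOKEN_FILE` environment variables injected by the webhook.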

No Secrets in Pod Specs!

❌ BAD: Plaintext Secrets in Pod Specs:

# ❌ NEVER DO THIS!
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: app
    env:
    - name: PASSWORD
      value: "PlaintextPassword123!"  # ❌ Exposed in Git!

✅ GOOD: Reference External Secret:

# ✅ Correct: Reference secret from External Secrets Operator
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: app
    env:
    - name: PASSWORD
      valueFrom:
        secretKeyRef:
          name: sql-connection-string  # Created by External Secrets Operator
          key: connection-string

External Secrets Operator

Installation and Configuration

Install External Secrets Operator:

# Add Helm repository
helm repo add external-secrets https://charts.external-secrets.io
helm repo update

# Install External Secrets Operator
helm install external-secrets \
  external-secrets/external-secrets \
  -n external-secrets-system \
  --create-namespace \
  --version 0.9.0

Verify Installation:

kubectl get pods -n external-secrets-system
# NAME                                       READY   STATUS    RESTARTS   AGE
# external-secrets-operator-7d8f9c4b5-abc123 1/1     Running   0          2m

ClusterSecretStore Setup

ClusterSecretStore for Azure Key Vault:

# infrastructure/clustersecretstore.yaml
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: azure-keyvault
spec:
  provider:
    azurekv:
      vaultUrl: "https://atp-prod-kv.vault.azure.net"
      tenantId: "87654321-4321-4321-4321-cba987654321"
      authType: WorkloadIdentity
      serviceAccountRef:
        name: external-secrets-operator
        namespace: external-secrets-system
      # Or use a Service Principal (not recommended)
      # authType: ServicePrincipal
      # authSecretRef:
      #   clientId:
      #     name: external-secrets-sp
      #     key: client-id
      #   clientSecret:
      #     name: external-secrets-sp
      #     key: client-secret

Grant Permissions to External Secrets Operator:

// Grant Key Vault Secrets User role to External Secrets Operator Workload Identity
var esoRoleAssignment = new Authorization.RoleAssignment(
    "eso-kv-secrets-user", new()
    {
        PrincipalId = externalSecretsOperatorIdentityPrincipalId,
        PrincipalType = "ServicePrincipal",
        RoleDefinitionId = "/subscriptions/{subscriptionId}/providers/Microsoft.Authorization/roleDefinitions/4633458b-17de-408a-b874-0445c86b69e6", // Key Vault Secrets User
        Scope = keyVault.Id,
    });

ExternalSecret Resources

ExternalSecret for Connection String:

# apps/atp-ingestion/base/externalsecret.yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: sql-connection-string
  namespace: atp-production
spec:
  refreshInterval: 1h  # Refresh every hour
  secretStoreRef:
    name: azure-keyvault
    kind: ClusterSecretStore
  target:
    name: sql-connection-string  # Kubernetes Secret name
    creationPolicy: Owner
    template:
      type: Opaque
      data:
        connection-string: "{{ .connectionString | toString }}"
  data:
  - secretKey: connectionString
    remoteRef:
      key: connection-strings-atp-ingestion-sql-connection-string
      version: ""  # Empty = latest version; property is omitted for plain (non-JSON) secrets

ExternalSecret for Certificate:

# apps/atp-gateway/base/externalsecret-cert.yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: tls-certificate
  namespace: atp-production
spec:
  refreshInterval: 24h
  secretStoreRef:
    name: azure-keyvault
    kind: ClusterSecretStore
  target:
    name: tls-certificate
    creationPolicy: Owner
    template:
      type: kubernetes.io/tls
      data:
        tls.crt: "{{ .certificate | b64enc }}"
        tls.key: "{{ .privateKey | b64enc }}"
  data:
  - secretKey: certificate
    remoteRef:
      key: cert/certificates-atp-gateway-tls-cert    # cert/ prefix: public certificate
  - secretKey: privateKey
    remoteRef:
      key: secret/certificates-atp-gateway-tls-cert  # secret/ prefix: PEM bundle incl. private key

ExternalSecret for Multiple Secrets:

# apps/atp-gateway/base/externalsecret-multiple.yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: gateway-secrets
  namespace: atp-production
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: azure-keyvault
    kind: ClusterSecretStore
  target:
    name: gateway-secrets
    creationPolicy: Owner
  data:
  - secretKey: stripe-api-key
    remoteRef:
      key: api-keys-atp-gateway-stripe-api-key
  - secretKey: sendgrid-api-key
    remoteRef:
      key: api-keys-atp-gateway-sendgrid-api-key
  - secretKey: jwt-signing-key
    remoteRef:
      key: api-keys-atp-gateway-jwt-signing-key

Sync Interval and Refresh

Refresh Strategies:

| Strategy | Refresh Interval | Use Case |
|---|---|---|
| Frequent | 5m | High-security, frequently rotated secrets |
| Standard | 1h | Most application secrets |
| Infrequent | 24h | Stable certificates, long-lived keys |
| On-Demand | Manual refresh | Rarely changed secrets |
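Platform tooling that generates ExternalSecret manifests can map these strategies onto `refreshInterval` values; a minimal sketch (illustrative Python, the strategy keys are ours):

```python
# Map refresh strategies from the table above to ExternalSecret refreshInterval values.
REFRESH_INTERVALS = {
    "frequent": "5m",     # high-security, frequently rotated secrets
    "standard": "1h",     # most application secrets
    "infrequent": "24h",  # stable certificates, long-lived keys
}

def refresh_interval(strategy: str) -> str:
    """Return the refreshInterval for a strategy, defaulting to standard (1h)."""
    return REFRESH_INTERVALS.get(strategy, "1h")
```

Defaulting to the standard 1h interval keeps unclassified secrets reasonably fresh without hammering Key Vault.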

Manual Refresh:

# Trigger manual refresh
kubectl annotate externalsecret sql-connection-string \
  -n atp-production \
  force-sync=$(date +%s) \
  --overwrite

ExternalSecret Status:

# Check ExternalSecret status
kubectl get externalsecret sql-connection-string -n atp-production -o yaml

# Status output:
status:
  conditions:
  - lastTransitionTime: "2024-01-15T10:00:00Z"
    message: Secret was synced
    reason: SecretSynced
    status: "True"
    type: Ready
  refreshTime: "2024-01-15T10:00:00Z"
  syncedResourceVersion: "12345"

Secret Rotation Handling

ExternalSecret with Version Tracking:

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: sql-connection-string
  namespace: atp-production
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: azure-keyvault
    kind: ClusterSecretStore
  target:
    name: sql-connection-string
    creationPolicy: Owner
    template:
      metadata:
        annotations:
          external-secrets.io/last-sync-time: "{{ .refreshTime | date \"2006-01-02T15:04:05Z07:00\" }}"
  data:
  - secretKey: connectionString
    remoteRef:
      key: connection-strings-atp-ingestion-sql-connection-string
      # Track version for rotation
      version: ""  # Empty = latest, or specify version ID

Application Secret Rotation:

// C# application: Handle secret rotation gracefully
public class SecretRotationHandler
{
    private readonly ILogger<SecretRotationHandler> _logger;
    private string _currentConnectionString;
    private readonly SemaphoreSlim _rotationLock = new(1, 1);

    public async Task<string> GetConnectionStringAsync()
    {
        // Read from mounted secret file or environment variable
        var newConnectionString = await File.ReadAllTextAsync(
            "/mnt/secrets/sql-connection-string/connection-string");

        if (_currentConnectionString != newConnectionString)
        {
            await _rotationLock.WaitAsync();
            try
            {
                if (_currentConnectionString != newConnectionString)
                {
                    _logger.LogInformation("Connection string rotated, updating connection");
                    await RotateConnectionAsync(newConnectionString);
                    _currentConnectionString = newConnectionString;
                }
            }
            finally
            {
                _rotationLock.Release();
            }
        }

        return _currentConnectionString;
    }

    private async Task RotateConnectionAsync(string newConnectionString)
    {
        // Close old connections
        // Establish new connections with new connection string
        // Zero-downtime rotation
    }
}

CSI Driver Alternative

Azure Key Vault CSI Driver

Installation:

# Install Azure Key Vault CSI Driver
helm repo add csi-secrets-store-provider-azure https://raw.githubusercontent.com/Azure/secrets-store-csi-driver-provider-azure/master/charts
helm repo update

helm install csi-secrets-store-provider-azure \
  csi-secrets-store-provider-azure/csi-secrets-store-provider-azure \
  --namespace kube-system \
  --version 1.4.0

Verify Installation:

kubectl get pods -n kube-system | grep csi-secrets-store
# NAME                                                    READY   STATUS    RESTARTS   AGE
# csi-secrets-store-provider-azure-7d8f9c4b5-abc123       1/1     Running   0          2m
# csi-secrets-store-driver-9f8e7d6c5-def456               2/2     Running   0          2m

SecretProviderClass Configuration

SecretProviderClass with Workload Identity:

# apps/atp-ingestion/base/secretproviderclass.yaml
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: atp-ingestion-secrets
  namespace: atp-production
spec:
  provider: azure
  secretObjects:
  - secretName: sql-connection-string  # Kubernetes Secret to create
    type: Opaque
    data:
    - objectName: sql-connection-string
      key: connection-string
  parameters:
    usePodIdentity: "false"
    useVMManagedIdentity: "false"
    useWorkloadIdentity: "true"  # Use Workload Identity
    workloadIdentityClientId: "12345678-1234-1234-1234-123456789abc"  # Managed Identity Client ID
    tenantId: "87654321-4321-4321-4321-cba987654321"  # Azure AD Tenant ID
    keyvaultName: "atp-prod-kv"
    objects: |
      array:
        - |
          objectName: connection-strings/atp-ingestion/sql-connection-string
          objectType: secret
          objectVersion: ""  # Empty = latest version

SecretProviderClass for Certificate:

# apps/atp-gateway/base/secretproviderclass-cert.yaml
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: tls-certificate
  namespace: atp-production
spec:
  provider: azure
  secretObjects:
  - secretName: tls-certificate
    type: kubernetes.io/tls
    data:
    - objectName: tls-cert
      key: tls.crt
    - objectName: tls-key
      key: tls.key
  parameters:
    useWorkloadIdentity: "true"
    workloadIdentityClientId: "12345678-1234-1234-1234-123456789abc"
    tenantId: "87654321-4321-4321-4321-cba987654321"
    keyvaultName: "atp-prod-kv"
    objects: |
      array:
        - |
          objectName: certificates/atp-gateway/tls-cert
          objectType: secret
          objectFormat: pfx
          objectEncoding: base64

Mounting Secrets as Volumes

Deployment with CSI Volume Mount:

# apps/atp-ingestion/base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
  namespace: atp-production
spec:
  template:
    metadata:
      labels:
        azure.workload.identity/use: "true"
    spec:
      serviceAccountName: atp-ingestion
      containers:
      - name: atp-ingestion
        image: connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d
        volumeMounts:
        - name: secrets-store
          mountPath: "/mnt/secrets-store"
          readOnly: true
        env:
        - name: SQL_CONNECTION_STRING
          valueFrom:
            secretKeyRef:
              name: sql-connection-string  # Created by SecretProviderClass secretObjects
              key: connection-string
      volumes:
      - name: secrets-store
        csi:
          driver: secrets-store.csi.k8s.io
          readOnly: true
          volumeAttributes:
            secretProviderClass: "atp-ingestion-secrets"

Automatic Rotation

Secret Rotation with CSI Driver:

Rotation is enabled at the driver level rather than per SecretProviderClass. Pass the rotation flags when installing or upgrading the Helm release:

# Enable secret rotation on the CSI driver (default poll interval is 2m)
helm upgrade csi-secrets-store-provider-azure \
  csi-secrets-store-provider-azure/csi-secrets-store-provider-azure \
  --namespace kube-system \
  --set secrets-store-csi-driver.enableSecretRotation=true \
  --set secrets-store-csi-driver.rotationPollInterval=1h

With rotation enabled, mounted secret content is refreshed automatically; the SecretProviderClass itself needs no rotation-specific fields:

apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: atp-ingestion-secrets
  namespace: atp-production
spec:
  provider: azure
  secretObjects:
  - secretName: sql-connection-string
    type: Opaque
    data:
    - objectName: sql-connection-string
      key: connection-string
  parameters:
    useWorkloadIdentity: "true"
    workloadIdentityClientId: "12345678-1234-1234-1234-123456789abc"
    tenantId: "87654321-4321-4321-4321-cba987654321"
    keyvaultName: "atp-prod-kv"
    objects: |
      array:
        - |
          objectName: connection-strings/atp-ingestion/sql-connection-string
          objectType: secret
          objectVersion: ""  # Latest version

Rotation Status:

# Check per-pod sync status (SecretProviderClassPodStatus objects)
kubectl get secretproviderclasspodstatus -n atp-production

# View mounted secrets
kubectl exec -it deployment/atp-ingestion -n atp-production -- \
  ls -la /mnt/secrets-store/

When to Use CSI vs External Secrets

Comparison Matrix:

| Feature | External Secrets Operator | CSI Driver |
|---------|---------------------------|------------|
| Secret Access | Creates Kubernetes Secrets | Mounts directly or creates Secrets |
| Rotation | Manual refresh or polling | Automatic rotation support |
| Use Case | Standard Kubernetes Secret consumption | Direct file access or Secret creation |
| Performance | Slight delay (polling) | Faster (direct mount) |
| Compatibility | ✅ Works with existing Secret consumers | ⚠️ Requires CSI volume mounts |
| Complexity | ✅ Simpler | ⚠️ More complex setup |

ATP Selection Guide:

  • External Secrets Operator: ✅ Recommended for most use cases
    • Standard Kubernetes Secret consumption
    • Works with existing applications
    • Simpler to manage

  • CSI Driver: Use when you:
    • Need direct file access to secrets
    • Require automatic rotation without polling
    • Need high-performance secret access

Secret References in Manifests

Never Plaintext Secrets in Git

❌ BAD: Plaintext Secret in Git:

# ❌ NEVER COMMIT THIS TO GIT!
apiVersion: v1
kind: Secret
metadata:
  name: sql-connection-string
data:
  connection-string: U2VjdXJlUGFzc3dvcmQxMjMh  # Base64 encoded, but still in Git!
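Base64 offers no protection at all; anyone with read access to the repository can recover the plaintext with a single command:

```shell
# Demo only: base64 is a reversible encoding, not encryption.
decoded=$(echo 'U2VjdXJlUGFzc3dvcmQxMjMh' | base64 -d)
echo "$decoded"   # → SecurePassword123!
```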

✅ GOOD: External Secret Reference:

# ✅ Correct: Reference External Secret
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: sql-connection-string
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: azure-keyvault
    kind: ClusterSecretStore
  target:
    name: sql-connection-string
  data:
  - secretKey: connectionString
    remoteRef:
      key: connection-strings/atp-ingestion/sql-connection-string

Referencing Secrets by Name

Deployment Using Secret Reference:

# apps/atp-ingestion/base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  template:
    spec:
      containers:
      - name: atp-ingestion
        image: connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d
        env:
        # Reference secret created by External Secrets Operator
        - name: SQL_CONNECTION_STRING
          valueFrom:
            secretKeyRef:
              name: sql-connection-string
              key: connection-string
        - name: REDIS_CONNECTION_STRING
          valueFrom:
            secretKeyRef:
              name: redis-connection-string
              key: connection-string
        envFrom:
        # Or use envFrom for multiple secrets
        - secretRef:
            name: gateway-secrets

Environment-Specific Secret Mappings

Environment-Specific ExternalSecret:

# apps/atp-ingestion/overlays/production/externalsecret.yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: sql-connection-string
  namespace: atp-production
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: azure-keyvault-prod  # Production Key Vault
    kind: ClusterSecretStore
  target:
    name: sql-connection-string
  data:
  - secretKey: connectionString
    remoteRef:
      key: connection-strings/atp-ingestion/sql-connection-string
      # Production-specific secret path

# apps/atp-ingestion/overlays/dev/externalsecret.yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: sql-connection-string
  namespace: atp-dev
spec:
  refreshInterval: 24h  # Less frequent refresh for dev
  secretStoreRef:
    name: azure-keyvault-dev  # Dev Key Vault
    kind: ClusterSecretStore
  target:
    name: sql-connection-string
  data:
  - secretKey: connectionString
    remoteRef:
      key: connection-strings/atp-ingestion/sql-connection-string
      # Dev-specific secret path
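The base + overlay wiring implied by the file paths above can be sketched as follows (an illustrative kustomization.yaml; paths and patch strategy are assumptions, not prescribed by ATP):

```yaml
# apps/atp-ingestion/overlays/production/kustomization.yaml (illustrative)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: atp-production
resources:
- ../../base
patches:
- path: externalsecret.yaml  # overrides the base ExternalSecret's store and refresh interval
```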

Secret Versioning

Reference Specific Secret Version:

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: sql-connection-string
spec:
  data:
  - secretKey: connectionString
    remoteRef:
      key: connection-strings/atp-ingestion/sql-connection-string
      version: "abc123def456789"  # Specific version ID

Track Secret Versions:

# List secret versions
az keyvault secret list-versions \
  --vault-name atp-prod-kv \
  --name "connection-strings/atp-ingestion/sql-connection-string" \
  --query "[].{id:id, enabled:attributes.enabled, updated:attributes.updated}"

# Output:
# [
#   {
#     "id": "https://atp-prod-kv.vault.azure.net/secrets/connection-strings/atp-ingestion/sql-connection-string/abc123def456789",
#     "enabled": true,
#     "updated": "2024-01-15T10:00:00Z"
#   }
# ]
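The version ID is the last path segment of the returned id URL; in shell it can be extracted with parameter expansion (the id value below is the hypothetical one from the output above):

```shell
# Hypothetical id copied from the list-versions output
id="https://atp-prod-kv.vault.azure.net/secrets/connection-strings/atp-ingestion/sql-connection-string/abc123def456789"
version="${id##*/}"   # parameter expansion: drop everything through the last '/'
echo "$version"       # → abc123def456789
```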

Secret Rotation Procedures

Manual Rotation Workflow

Secret Rotation Checklist:

## Manual Secret Rotation Checklist

### Pre-Rotation
- [ ] Notify team of rotation schedule
- [ ] Verify application can handle secret rotation gracefully
- [ ] Backup current secret (if needed)
- [ ] Prepare new secret value

### Rotation
- [ ] Create new secret version in Key Vault
- [ ] Test new secret in non-production environment
- [ ] Update ExternalSecret to reference new version (optional)
- [ ] Trigger ExternalSecret refresh
- [ ] Verify application picks up new secret
- [ ] Monitor application for errors

### Post-Rotation
- [ ] Verify application is functioning correctly
- [ ] Disable old secret version (don't delete yet)
- [ ] Monitor for 24-48 hours
- [ ] Delete old secret version after confirmation

Manual Rotation Script:

#!/bin/bash
# scripts/rotate-secret.sh

VAULT_NAME="${1:-atp-prod-kv}"
SECRET_NAME="${2:-connection-strings/atp-ingestion/sql-connection-string}"
NEW_SECRET_VALUE="${3:-}"

if [ -z "$NEW_SECRET_VALUE" ]; then
  echo "Usage: $0 <vault-name> <secret-name> <new-secret-value>"
  exit 1
fi

echo "🔄 Rotating secret: $SECRET_NAME"

# 1. Create new secret version
echo "📝 Creating new secret version..."
az keyvault secret set \
  --vault-name "$VAULT_NAME" \
  --name "$SECRET_NAME" \
  --value "$NEW_SECRET_VALUE" \
  --tags LastRotated="$(date -u +%Y-%m-%dT%H:%M:%SZ)"

# 2. Trigger ExternalSecret refresh
echo "🔄 Triggering ExternalSecret refresh..."
kubectl annotate externalsecret "$(basename $SECRET_NAME)" \
  -n atp-production \
  force-sync="$(date +%s)" \
  --overwrite

# 3. Verify secret was synced
echo "✅ Waiting for secret sync..."
sleep 10
kubectl get externalsecret "$(basename $SECRET_NAME)" -n atp-production

echo "✅ Secret rotation complete"

Automated Rotation with Key Vault

Azure Key Vault Automatic Rotation:

# Auto-renewal is driven by the certificate's policy (lifetime actions)
az keyvault certificate set-attributes \
  --vault-name atp-prod-kv \
  --name "certificates/atp-gateway/tls-cert" \
  --enabled true \
  --policy @rotation-policy.json

Rotation Policy (rotation-policy.json):

{
  "lifetimeActions": [
    {
      "trigger": {
        "daysBeforeExpiry": 30
      },
      "action": {
        "actionType": "Rotate"
      }
    },
    {
      "trigger": {
        "daysBeforeExpiry": 7
      },
      "action": {
        "actionType": "EmailContacts"
      }
    }
  ],
  "issuerParameters": {
    "name": "Self"
  },
  "keyProperties": {
    "exportable": true,
    "keySize": 2048,
    "keyType": "RSA",
    "reuseKey": true
  },
  "secretProperties": {
    "contentType": "application/x-pkcs12"
  }
}

Application Handling of Rotated Secrets

C# Application: Secret Rotation Handler:

// SecretRotationHandler.cs
public class SecretRotationHandler : BackgroundService
{
    private readonly ILogger<SecretRotationHandler> _logger;
    private readonly IConfiguration _configuration;
    private readonly SemaphoreSlim _rotationLock = new(1, 1);
    private string _currentSecret;
    private DateTime _lastRotationCheck = DateTime.UtcNow;

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            try
            {
                await CheckForSecretRotationAsync();
                await Task.Delay(TimeSpan.FromMinutes(5), stoppingToken); // Check every 5 minutes
            }
            catch (Exception ex)
            {
                _logger.LogError(ex, "Error checking for secret rotation");
                await Task.Delay(TimeSpan.FromMinutes(1), stoppingToken);
            }
        }
    }

    private async Task CheckForSecretRotationAsync()
    {
        // Read secret from mounted file or environment variable
        var secretPath = "/mnt/secrets-store/connection-strings/atp-ingestion/sql-connection-string";
        if (File.Exists(secretPath))
        {
            var newSecret = await File.ReadAllTextAsync(secretPath);

            if (_currentSecret != null && _currentSecret != newSecret)
            {
                _logger.LogInformation("Secret rotated, updating connection");
                await RotateSecretAsync(newSecret);
            }

            _currentSecret = newSecret;
        }
    }

    private async Task RotateSecretAsync(string newSecret)
    {
        await _rotationLock.WaitAsync();
        try
        {
            // Zero-downtime rotation:
            // 1. Create new connection with new secret
            // 2. Migrate traffic to new connection
            // 3. Close old connection

            _logger.LogInformation("Secret rotation complete");
        }
        finally
        {
            _rotationLock.Release();
        }
    }
}

Zero-Downtime Rotation

Zero-Downtime Rotation Strategy:

graph LR
    A[Old Secret<br/>Active] -->|1. New Secret<br/>Created| B[New Secret<br/>Available]
    B -->|2. New Connection<br/>Established| C[Dual Connections<br/>Active]
    C -->|3. Migrate Traffic| D[New Connection<br/>Primary]
    D -->|4. Close Old| E[New Secret<br/>Active]

    style A fill:#ffcccc
    style C fill:#FFE5B4
    style E fill:#90EE90

Implementation:

public class ZeroDowntimeSecretRotation
{
    private IDbConnection _primaryConnection;
    private IDbConnection _secondaryConnection;
    private bool _isRotating = false;

    public async Task RotateConnectionStringAsync(string newConnectionString)
    {
        if (_isRotating) return;

        _isRotating = true;
        try
        {
            // 1. Create new connection
            var newConnection = new SqlConnection(newConnectionString);
            await newConnection.OpenAsync();

            // 2. Verify new connection works
            using var testCommand = new SqlCommand("SELECT 1", newConnection);
            await testCommand.ExecuteScalarAsync();

            // 3. Set secondary connection
            _secondaryConnection = newConnection;

            // 4. Migrate traffic gradually (e.g., 10% at a time)
            await MigrateTrafficGraduallyAsync();

            // 5. Close old connection
            if (_primaryConnection != null)
            {
                _primaryConnection.Close();
                _primaryConnection.Dispose();
            }

            // 6. Promote new connection to primary
            _primaryConnection = _secondaryConnection;
            _secondaryConnection = null;
        }
        finally
        {
            _isRotating = false;
        }
    }
}

Secret Versioning and Rollback

Key Vault Secret Versions

List Secret Versions:

# List all versions of a secret
az keyvault secret list-versions \
  --vault-name atp-prod-kv \
  --name "connection-strings/atp-ingestion/sql-connection-string" \
  --query "[].{id:id, enabled:attributes.enabled, updated:attributes.updated, expires:attributes.expires}" \
  --output table

# Output:
# ID                                                              ENABLED    UPDATED                 EXPIRES
# https://atp-prod-kv.../abc123def456789                          True       2024-01-15T10:00:00Z    None
# https://atp-prod-kv.../def456ghi789012                          True       2024-01-14T10:00:00Z    None
# https://atp-prod-kv.../ghi789jkl012345                          False      2024-01-13T10:00:00Z    None  # Disabled

Get Specific Secret Version:

# Get specific version
az keyvault secret show \
  --vault-name atp-prod-kv \
  --name "connection-strings/atp-ingestion/sql-connection-string" \
  --version "def456ghi789012"

Rolling Back to Previous Secret Version

Rollback Procedure:

#!/bin/bash
# scripts/rollback-secret.sh

VAULT_NAME="${1:-atp-prod-kv}"
SECRET_NAME="${2:-connection-strings/atp-ingestion/sql-connection-string}"
VERSION_TO_ROLLBACK="${3:-}"

if [ -z "$VERSION_TO_ROLLBACK" ]; then
  echo "Usage: $0 <vault-name> <secret-name> <version-id>"
  echo "Listing available versions:"
  az keyvault secret list-versions \
    --vault-name "$VAULT_NAME" \
    --name "$SECRET_NAME" \
    --query "[].{id:id, updated:attributes.updated}" \
    --output table
  exit 1
fi

echo "🔄 Rolling back secret to version: $VERSION_TO_ROLLBACK"

# 1. Get previous version value
PREVIOUS_VALUE=$(az keyvault secret show \
  --vault-name "$VAULT_NAME" \
  --name "$SECRET_NAME" \
  --version "$VERSION_TO_ROLLBACK" \
  --query value -o tsv)

# 2. Create new version with previous value
az keyvault secret set \
  --vault-name "$VAULT_NAME" \
  --name "$SECRET_NAME" \
  --value "$PREVIOUS_VALUE" \
  --tags RollbackFrom="$VERSION_TO_ROLLBACK" RollbackAt="$(date -u +%Y-%m-%dT%H:%M:%SZ)"

# 3. Trigger ExternalSecret refresh
kubectl annotate externalsecret "$(basename $SECRET_NAME)" \
  -n atp-production \
  force-sync="$(date +%s)" \
  --overwrite

echo "✅ Secret rolled back successfully"

Coordinating Secret Changes with Deployments

Coordinated Secret and Deployment Update:

# Strategy: Update secret first, then deployment
# 1. Update ExternalSecret to reference new secret version
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: sql-connection-string
spec:
  data:
  - secretKey: connectionString
    remoteRef:
      key: connection-strings/atp-ingestion/sql-connection-string
      version: "abc123def456789"  # New version
---
# 2. Wait for secret sync
# 3. Update deployment (triggers rolling update with new secret)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  template:
    metadata:
      annotations:
        secret-version: "abc123def456789"  # Track secret version

Audit Logging

Key Vault Access Logs

Enable Diagnostic Settings:

// Enable Key Vault diagnostic logs
new Insights.DiagnosticSetting("keyvault-diagnostics", new()
{
    ResourceUri = keyVault.Id,
    WorkspaceId = logAnalyticsWorkspace.Id,
    Logs = new[]
    {
        new Insights.Inputs.LogSettingsArgs
        {
            CategoryGroup = "audit",
            Enabled = true,
            RetentionPolicy = new Insights.Inputs.RetentionPolicyArgs
            {
                Enabled = true,
                Days = environment == "production" ? 365 : 30,
            },
        },
        new Insights.Inputs.LogSettingsArgs
        {
            CategoryGroup = "allLogs",
            Enabled = true,
            RetentionPolicy = new Insights.Inputs.RetentionPolicyArgs
            {
                Enabled = true,
                Days = environment == "production" ? 365 : 30,
            },
        },
    },
    Metrics = new[]
    {
        new Insights.Inputs.MetricSettingsArgs
        {
            Category = "AllMetrics",
            Enabled = true,
            RetentionPolicy = new Insights.Inputs.RetentionPolicyArgs
            {
                Enabled = true,
                Days = environment == "production" ? 365 : 30,
            },
        },
    },
});

Monitoring Secret Access

KQL Query for Secret Access:

// Key Vault access logs
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.KEYVAULT"
| where Category == "AuditEvent"
| where OperationName == "SecretGet" or OperationName == "SecretList"
| extend SecretName = tostring(parse_json(properties_s).objectName)
| extend CallerIP = CallerIPAddress
| extend Identity = tostring(parse_json(properties_s).identity_claim_appid_g)
| project TimeGenerated, SecretName, OperationName, Identity, CallerIP, ResultSignature
| order by TimeGenerated desc

Access Pattern Analysis:

// Secret access patterns
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.KEYVAULT"
| where Category == "AuditEvent"
| where OperationName == "SecretGet"
| extend SecretName = tostring(parse_json(properties_s).objectName)
| summarize 
    AccessCount = count(),
    UniqueIdentities = dcount(parse_json(properties_s).identity_claim_appid_g),
    LastAccess = max(TimeGenerated)
    by SecretName, bin(TimeGenerated, 1h)
| order by AccessCount desc

Alerting on Unauthorized Access

Prometheus Alert Rule (assumes a custom exporter exposing the azure_keyvault_secret_access_total metric):

# alerts/keyvault-unauthorized-access.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: keyvault-unauthorized-access
  namespace: monitoring
spec:
  groups:
  - name: keyvault
    rules:
    - alert: KeyVaultUnauthorizedAccess
      expr: |
        count(
          azure_keyvault_secret_access_total{
            result="Forbidden"
          } > 0
        ) by (secret_name)
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Unauthorized access attempt to Key Vault secret"
        description: "Secret {{ $labels.secret_name }} has {{ $value }} unauthorized access attempts"

Log Analytics Alert:

// Alert query: Failed secret access
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.KEYVAULT"
| where Category == "AuditEvent"
| where OperationName == "SecretGet"
| where ResultSignature == "Forbidden"
| extend SecretName = tostring(parse_json(properties_s).objectName)
| extend Identity = tostring(parse_json(properties_s).identity_claim_appid_g)
| project TimeGenerated, SecretName, Identity

Compliance Reporting

Compliance Report Query:

// Secret access compliance report
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.KEYVAULT"
| where Category == "AuditEvent"
| where TimeGenerated > ago(30d)
| extend SecretName = tostring(parse_json(properties_s).objectName)
| extend Operation = OperationName
| extend Identity = tostring(parse_json(properties_s).identity_claim_appid_g)
| extend Result = ResultSignature
| summarize 
    TotalAccess = count(),
    SuccessfulAccess = countif(Result == "OK"),
    FailedAccess = countif(Result != "OK"),
    UniqueIdentities = dcount(Identity),
    LastAccess = max(TimeGenerated)
    by SecretName, Operation
| order by TotalAccess desc

Compliance: SOC 2, GDPR, HIPAA

Encryption at Rest in Key Vault

Key Vault Encryption:

  • Automatic Encryption: All secrets encrypted at rest by default
  • Hardware Security Module (HSM): Premium SKU uses HSM-backed keys
  • Azure Key Vault Managed HSM: Dedicated HSM for highest security

Enable HSM-Backed Keys:

// Use Premium SKU for HSM-backed keys
this.Vault = new KeyVault.Vault(keyVaultName, new()
{
    Properties = new KeyVault.Inputs.VaultPropertiesArgs
    {
        Sku = new KeyVault.Inputs.SkuArgs
        {
            Family = "A",
            Name = "premium",  // Premium SKU for HSM
        },
    },
});

Access Reviews and Audits

Regular Access Reviews:

#!/bin/bash
# scripts/access-review.sh

VAULT_NAME="${1:-atp-prod-kv}"

echo "📋 Key Vault Access Review for: $VAULT_NAME"

# List all role assignments
az role assignment list \
  --scope "/subscriptions/{subscriptionId}/resourceGroups/atp-production-rg/providers/Microsoft.KeyVault/vaults/$VAULT_NAME" \
  --query "[].{principal:principalName, role:roleDefinitionName, scope:scope}" \
  --output table

# List all secrets and their access patterns
echo "📊 Secret Access Summary:"
az keyvault secret list \
  --vault-name "$VAULT_NAME" \
  --query "[].name" -o tsv | while read secret; do
    echo "Secret: $secret"
    az keyvault secret list-versions \
      --vault-name "$VAULT_NAME" \
      --name "$secret" \
      --query "[].{updated:attributes.updated, enabled:attributes.enabled}" \
      --output table
done

Automated Access Review:

# Illustrative policy assignment: require periodic Key Vault access reviews
apiVersion: policy.azure.com/v1beta1
kind: PolicyAssignment
metadata:
  name: require-keyvault-access-reviews
spec:
  displayName: "Require Key Vault Access Reviews"
  policyDefinitionId: "/providers/Microsoft.Authorization/policyDefinitions/..."
  parameters:
    reviewFrequency: "30d"

Secret Lifecycle Management

Secret Lifecycle Policy:

# Secret lifecycle management
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: sql-connection-string
  namespace: atp-production
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: azure-keyvault
    kind: ClusterSecretStore
  target:
    name: sql-connection-string
    template:
      metadata:
        annotations:
          secret-lifecycle/created: "{{ .creationTime }}"
          secret-lifecycle/expires: "{{ .expirationTime }}"
          secret-lifecycle/rotation-policy: "30d"
  data:
  - secretKey: connectionString
    remoteRef:
      key: connection-strings/atp-ingestion/sql-connection-string

Evidence Collection for Auditors

Audit Evidence Report:

// SOC 2 / GDPR Audit Evidence: Secret Access Log
let SecretAccess = AzureDiagnostics
| where ResourceProvider == "MICROSOFT.KEYVAULT"
| where Category == "AuditEvent"
| where TimeGenerated > ago(90d)
| extend SecretName = tostring(parse_json(properties_s).objectName)
| extend Identity = tostring(parse_json(properties_s).identity_claim_appid_g)
| extend Result = ResultSignature
| extend IPAddress = CallerIPAddress
| project TimeGenerated, SecretName, Identity, Result, IPAddress, OperationName;

// Generate report
SecretAccess
| summarize 
    TotalAccess = count(),
    SuccessfulAccess = countif(Result == "OK"),
    FailedAccess = countif(Result != "OK"),
    UniqueIdentities = dcount(Identity),
    DateRange = strcat(min(TimeGenerated), " to ", max(TimeGenerated))
    by SecretName
| order by SecretName

Export Audit Logs:

# Export audit logs for compliance
az monitor log-analytics query \
  --workspace "atp-prod-loganalytics" \
  --analytics-query "
    AzureDiagnostics
    | where ResourceProvider == 'MICROSOFT.KEYVAULT'
    | where Category == 'AuditEvent'
    | where TimeGenerated > ago(90d)
    | extend SecretName = tostring(parse_json(properties_s).objectName)
    | extend Identity = tostring(parse_json(properties_s).identity_claim_appid_g)
    | project TimeGenerated, OperationName, SecretName, Identity, ResultSignature
  " \
  --output tsv > keyvault-audit-log-$(date +%Y%m%d).tsv

Summary: Azure Key Vault Secret Management

  • Key Vault Architecture: Environment-specific Key Vaults with RBAC (recommended over access policies), organized secret naming conventions
  • Secret Categories: Connection strings, API keys, certificates, encryption keys, credentials with proper tagging
  • Workload Identity: Azure AD Workload Identity for pod authentication, federated credentials, ServiceAccount annotation, no secrets in pod specs
  • External Secrets Operator: ClusterSecretStore setup, ExternalSecret resources, sync intervals, secret rotation handling
  • CSI Driver: Alternative for direct secret mounting, SecretProviderClass configuration, automatic rotation support
  • Secret References: Never plaintext secrets in Git, reference secrets by name, environment-specific mappings, secret versioning
  • Secret Rotation: Manual and automated rotation procedures, application handling of rotated secrets, zero-downtime rotation strategies
  • Secret Versioning: Key Vault secret versions, rollback procedures, coordinating secret changes with deployments
  • Audit Logging: Key Vault access logs, monitoring secret access, alerting on unauthorized access, compliance reporting
  • Compliance: Encryption at rest, access reviews, secret lifecycle management, evidence collection for SOC 2, GDPR, HIPAA audits

Security Policies & Compliance

Purpose: Define security policies, compliance controls, and enforcement mechanisms for ATP GitOps, ensuring all Kubernetes workloads, network traffic, and container images meet security standards and regulatory requirements (SOC 2, GDPR, HIPAA) through policy-as-code and automated enforcement.


Azure Policy for Kubernetes

Policy Overview and Architecture

Azure Policy for Kubernetes Architecture:

graph LR
    A[Policy Definition<br/>in Azure] -->|Assignment| B[AKS Cluster<br/>with Policy Add-on]
    B -->|Enforces| C[Admission Controller]
    C -->|Validates| D[Kubernetes Resources]
    D -->|Creates| E[Compliant Resources]
    D -.->|Violates| F[Policy Violation<br/>Blocked/Reported]

    style A fill:#90EE90
    style B fill:#FFE5B4
    style C fill:#FFE5B4
    style D fill:#ffcccc
    style E fill:#90EE90
    style F fill:#ff9999

Azure Policy Components:

| Component | Purpose | Example |
|-----------|---------|---------|
| Policy Definition | Defines the policy rule | "All pods must have resource limits" |
| Policy Assignment | Applies policy to AKS cluster | Assign to atp-prod-eus-aks |
| Policy Effect | Enforcement action | deny, audit, disabled |
| Policy Parameters | Configurable values | Minimum CPU: 100m |

Built-in Policies for AKS

Enable Azure Policy Add-on:

# Enable Azure Policy add-on on AKS
az aks enable-addons \
  --resource-group atp-production-rg \
  --name atp-prod-eus-aks \
  --addons azure-policy

# Verify installation
az aks show \
  --resource-group atp-production-rg \
  --name atp-prod-eus-aks \
  --query addonProfiles.azurepolicy

Common Built-in Policies:

// Built-in policy: Kubernetes cluster containers should only use allowed capabilities
{
  "policyDefinitionId": "/providers/Microsoft.Authorization/policyDefinitions/c26596ff-4d70-4e6a-9a30-c2506bd2f80c",
  "parameters": {
    "allowedCapabilities": {
      "value": ["NET_BIND_SERVICE"]
    },
    "requiredDropCapabilities": {
      "value": ["ALL"]
    },
    "effect": {
      "value": "Audit"
    }
  }
}

Assign Built-in Policy:

# Assign built-in policy: Container images should be deployed from trusted registries only
az policy assignment create \
  --name "aks-trusted-registries" \
  --scope "/subscriptions/{subscriptionId}/resourceGroups/atp-production-rg/providers/Microsoft.ContainerService/managedClusters/atp-prod-eus-aks" \
  --policy "/providers/Microsoft.Authorization/policyDefinitions/febd0533-8e55-448f-b837-bd0e06f16469" \
  --params '{
    "allowedContainerImagesRegex": {
      "value": "^connectsoft\\.azurecr\\.io/.*"
    },
    "effect": {
      "value": "Deny"
    }
  }'
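The trusted-registry check above comes down to an anchored regular expression over the image reference. A minimal Python sketch of the same evaluation (illustrative only; Azure Policy evaluates the regex server-side):

```python
import re

# The same regex passed to the policy assignment above; anchored so only
# images from the ConnectSoft ACR match.
ALLOWED_IMAGE_REGEX = r"^connectsoft\.azurecr\.io/.*"

def image_allowed(image: str) -> bool:
    """Return True if the image reference matches the trusted-registry regex."""
    return re.match(ALLOWED_IMAGE_REGEX, image) is not None

print(image_allowed("connectsoft.azurecr.io/atp/ingestion:v1.2.3"))  # True
print(image_allowed("docker.io/library/nginx:latest"))               # False
```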

Custom Policy Definitions

Custom Policy: Require Resource Limits:

// policies/require-resource-limits.json
{
  "properties": {
    "displayName": "ATP: Require resource limits on containers",
    "description": "Ensures all containers have CPU and memory limits set",
    "mode": "Microsoft.Kubernetes.Data",
    "metadata": {
      "category": "ATP Security"
    },
    "parameters": {
      "minCpu": {
        "type": "String",
        "metadata": {
          "displayName": "Minimum CPU limit",
          "description": "Minimum CPU limit (e.g., 100m)"
        },
        "defaultValue": "100m"
      },
      "minMemory": {
        "type": "String",
        "metadata": {
          "displayName": "Minimum memory limit",
          "description": "Minimum memory limit (e.g., 128Mi)"
        },
        "defaultValue": "128Mi"
      },
      "effect": {
        "type": "String",
        "metadata": {
          "displayName": "Effect",
          "description": "Policy effect"
        },
        "allowedValues": ["audit", "deny", "disabled"],
        "defaultValue": "deny"
      }
    },
    "policyRule": {
      "if": {
        "field": "type",
        "in": [
          "Microsoft.ContainerService/managedClusters",
          "Microsoft.Kubernetes/connectedClusters"
        ]
      },
      "then": {
        "effect": "[parameters('effect')]",
        "details": {
          "templateInfo": {
            "sourceType": "PublicURL",
            "url": "https://raw.githubusercontent.com/ConnectSoft/ATP-Policies/main/policies/require-resource-limits.yaml"
          },
          "apiGroups": ["apps"],
          "kinds": ["Deployment", "StatefulSet"],
          "excludedNamespaces": ["kube-system", "gatekeeper-system"]
        }
      }
    }
  }
}

Gatekeeper Constraint Template (Referenced by Policy):

# policies/require-resource-limits.yaml
apiVersion: templates.gatekeeper.sh/v1beta1
kind: ConstraintTemplate
metadata:
  name: k8srequiredresourcelimits
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredResourceLimits
      validation:
        openAPIV3Schema:
          type: object
          properties:
            minCpu:
              type: string
            minMemory:
              type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredresourcelimits

        violation[{"msg": msg}] {
          container := input.review.object.spec.template.spec.containers[_]
          not container.resources.limits.cpu
          msg := sprintf("Container '%v' must have CPU limit", [container.name])
        }

        violation[{"msg": msg}] {
          container := input.review.object.spec.template.spec.containers[_]
          not container.resources.limits.memory
          msg := sprintf("Container '%v' must have memory limit", [container.name])
        }

Create Custom Policy:

# Create custom policy definition
az policy definition create \
  --name "atp-require-resource-limits" \
  --display-name "ATP: Require resource limits on containers" \
  --description "Ensures all containers have CPU and memory limits set" \
  --rules policies/require-resource-limits.json \
  --params policies/require-resource-limits.parameters.json \
  --mode Microsoft.Kubernetes.Data

Policy Assignment and Enforcement

Policy Assignment with Pulumi:

// Assign Azure Policy to AKS cluster
new Authorization.PolicyAssignment("atp-require-resource-limits", new()
{
    Name = "atp-require-resource-limits",
    DisplayName = "ATP: Require resource limits",
    PolicyDefinitionId = "/providers/Microsoft.Authorization/policyDefinitions/atp-require-resource-limits",
    Scope = aksCluster.Id,
    Parameters = new()
    {
        { "minCpu", new() { Value = "100m" } },
        { "minMemory", new() { Value = "128Mi" } },
        { "effect", new() { Value = "deny" } },
    },
    EnforcementMode = "Default", // Enforced
    Identity = new Authorization.Inputs.IdentityArgs
    {
        Type = Authorization.ResourceIdentityType.SystemAssigned,
    },
});

Policy Enforcement Modes:

| Mode | Behavior | Use Case |
|---|---|---|
| Enforced | Blocks non-compliant resources | Production environments |
| DoNotEnforce | Audits only, doesn't block | Testing policy effectiveness |
| Disabled | Policy disabled | Temporary disable |

Policy Compliance Check:

# Check policy compliance
az policy state list \
  --resource "/subscriptions/{subscriptionId}/resourceGroups/atp-production-rg/providers/Microsoft.ContainerService/managedClusters/atp-prod-eus-aks" \
  --policy-assignment "atp-require-resource-limits" \
  --query "[].{resource:resourceId, complianceState:complianceState}" \
  --output table
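The same query can emit JSON (`--output json`) and be summarized programmatically. A small Python sketch, assuming only the two fields projected by the `--query` expression above (the sample records are illustrative):

```python
import json
from collections import Counter

# Sample shaped like the fields selected by the `az policy state list`
# query above; real output carries many more fields per record.
sample = json.loads("""[
  {"resourceId": "/subscriptions/xxx/.../atp-prod-eus-aks", "complianceState": "Compliant"},
  {"resourceId": "/subscriptions/xxx/.../atp-staging-aks", "complianceState": "NonCompliant"}
]""")

def compliance_summary(states: list[dict]) -> Counter:
    """Count resources per compliance state."""
    return Counter(s["complianceState"] for s in states)

summary = compliance_summary(sample)
print(summary["NonCompliant"])  # 1
```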

Pod Security Standards (PSS)

Privileged, Baseline, Restricted Profiles

Pod Security Levels:

| Level | Restrictions | ATP Use Case |
|---|---|---|
| Privileged | No restrictions | System pods only (CNI, CSI drivers) |
| Baseline | Minimal restrictions | Legacy applications |
| Restricted | Maximum restrictions | ATP production workloads |

Restricted Profile Requirements:

  • ✅ Run as non-root user
  • ✅ Read-only root filesystem
  • ✅ Drop all capabilities
  • ✅ Disallow privilege escalation
  • ✅ Seccomp profile enforced
  • ✅ AppArmor/SELinux enforced

Pod Security Admission Configuration

Enable Pod Security Admission:

# infrastructure/namespace-pod-security.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: atp-production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted

Admission Configuration:

# cluster-config/admission-configuration.yaml
apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
- name: PodSecurity
  configuration:
    apiVersion: pod-security.admission.config.k8s.io/v1
    kind: PodSecurityConfiguration
    defaults:
      enforce: "restricted"
      audit: "restricted"
      warn: "restricted"
    exemptions:
      usernames: []
      runtimeClasses: []
      namespaces:
      - kube-system
      - gatekeeper-system
      - external-secrets-system

Security Context Requirements

Deployment with Restricted Security Context:

# apps/atp-ingestion/base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  template:
    spec:
      securityContext:  # Pod-level security context
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
        seccompProfile:
          type: RuntimeDefault
        supplementalGroups: []
      containers:
      - name: atp-ingestion
        image: connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d
        securityContext:  # Container-level security context
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          runAsNonRoot: true
          runAsUser: 1000
          capabilities:
            drop:
            - ALL
            add: []  # No additional capabilities
          seccompProfile:
            type: RuntimeDefault
        volumeMounts:
        - name: tmp
          mountPath: /tmp
        - name: var-run
          mountPath: /var/run
      volumes:
      - name: tmp
        emptyDir: {}
      - name: var-run
        emptyDir: {}

Gradual Enforcement Strategy

Enforcement Strategy:

| Environment | Enforce Level | Audit Level | Warn Level | Timeline |
|---|---|---|---|---|
| Dev | Baseline | Restricted | Restricted | Immediate |
| Test | Baseline | Restricted | Restricted | Month 1 |
| Staging | Restricted | Restricted | Restricted | Month 2 |
| Production | Restricted | Restricted | Restricted | Month 3 |
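The rollout matrix above maps directly onto the three `pod-security.kubernetes.io` namespace labels. A Python sketch that renders the labels per environment (environment names and levels taken from the matrix; the helper itself is illustrative):

```python
# The rollout matrix above, as data: (enforce, audit, warn) per environment.
PSS_ROLLOUT = {
    "dev":        ("baseline",   "restricted", "restricted"),
    "test":       ("baseline",   "restricted", "restricted"),
    "staging":    ("restricted", "restricted", "restricted"),
    "production": ("restricted", "restricted", "restricted"),
}

def pss_labels(environment: str) -> dict:
    """Render the pod-security.kubernetes.io namespace labels for an environment."""
    enforce, audit, warn = PSS_ROLLOUT[environment]
    return {
        "pod-security.kubernetes.io/enforce": enforce,
        "pod-security.kubernetes.io/audit": audit,
        "pod-security.kubernetes.io/warn": warn,
    }

print(pss_labels("dev")["pod-security.kubernetes.io/enforce"])         # baseline
print(pss_labels("production")["pod-security.kubernetes.io/enforce"])  # restricted
```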

Migration Plan:

Phase 1 (Month 1): Dev and Test
- Set enforce: baseline
- Set audit: restricted
- Identify violations
- Fix applications

Phase 2 (Month 2): Staging
- Set enforce: restricted
- Fix remaining violations
- Validate applications

Phase 3 (Month 3): Production
- Set enforce: restricted
- Full enforcement

Network Policies

Default Deny All Traffic

Default Deny Network Policy:

# platform/network-policies/default-deny-all.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: atp-production
spec:
  podSelector: {}  # Match all pods
  policyTypes:
  - Ingress
  - Egress
  # No ingress or egress rules = deny all

Ingress Rules (Allow Specific Sources)

Allow Ingress from Gateway:

# apps/atp-ingestion/base/network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: atp-ingestion-ingress
  namespace: atp-production
spec:
  podSelector:
    matchLabels:
      app: atp-ingestion
  policyTypes:
  - Ingress
  ingress:
  # Allow ingress from gateway
  - from:
    - podSelector:
        matchLabels:
          app: atp-gateway
      namespaceSelector:
        matchLabels:
          name: atp-production
    ports:
    - protocol: TCP
      port: 8080
  # Allow ingress from ingress controller
  - from:
    - namespaceSelector:
        matchLabels:
          name: ingress-nginx
      podSelector:
        matchLabels:
          app.kubernetes.io/name: ingress-nginx
    ports:
    - protocol: TCP
      port: 8080
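One subtlety in the policy above: inside a single `from` entry, `podSelector` and `namespaceSelector` are ANDed (the peer must match both), while separate `from` entries are ORed. A Python sketch of that matching logic (the selector data mirrors the first entry above; the evaluator itself is a simplified illustration, `matchExpressions` omitted):

```python
def selector_matches(selector: dict, labels: dict) -> bool:
    """matchLabels semantics: every key/value in the selector must be present."""
    return all(labels.get(k) == v for k, v in selector.get("matchLabels", {}).items())

def peer_allowed(from_entries: list, pod_labels: dict, ns_labels: dict) -> bool:
    """Entries in `from` are ORed; selectors inside one entry are ANDed."""
    for entry in from_entries:
        pod_ok = "podSelector" not in entry or selector_matches(entry["podSelector"], pod_labels)
        ns_ok = "namespaceSelector" not in entry or selector_matches(entry["namespaceSelector"], ns_labels)
        if pod_ok and ns_ok:
            return True
    return False

# The first `from` entry of the ingress policy above:
from_entries = [{
    "podSelector": {"matchLabels": {"app": "atp-gateway"}},
    "namespaceSelector": {"matchLabels": {"name": "atp-production"}},
}]

print(peer_allowed(from_entries, {"app": "atp-gateway"}, {"name": "atp-production"}))  # True
print(peer_allowed(from_entries, {"app": "atp-gateway"}, {"name": "other-ns"}))        # False
```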

Egress Rules (Allow Specific Destinations)

Allow Egress to Dependencies:

# apps/atp-ingestion/base/network-policy-egress.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: atp-ingestion-egress
  namespace: atp-production
spec:
  podSelector:
    matchLabels:
      app: atp-ingestion
  policyTypes:
  - Egress
  egress:
  # Allow egress to SQL Database
  - to:
    - namespaceSelector:
        matchLabels:
          name: external-services
    ports:
    - protocol: TCP
      port: 1433  # SQL Server
  # Allow egress to Redis
  - to:
    - podSelector:
        matchLabels:
          app: redis
    ports:
    - protocol: TCP
      port: 6379
  # Allow DNS resolution
  - to:
    - namespaceSelector:
        matchLabels:
          name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
  # Allow egress to Azure services (via Private Link)
  - to:
    - namespaceSelector: {}
      podSelector: {}
    ports:
    - protocol: TCP
      port: 443
  # Allow egress to monitoring
  - to:
    - namespaceSelector:
        matchLabels:
          name: monitoring
      podSelector:
        matchLabels:
          app: prometheus
    ports:
    - protocol: TCP
      port: 9090

DNS and Monitoring Exceptions

DNS Exception:

# platform/network-policies/allow-dns.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: atp-production
spec:
  podSelector: {}  # Apply to all pods
  policyTypes:
  - Egress
  egress:
  # Allow DNS queries to CoreDNS
  - to:
    - namespaceSelector:
        matchLabels:
          name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53

Monitoring Exception:

# platform/network-policies/allow-monitoring.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-monitoring
  namespace: atp-production
spec:
  podSelector: {}  # Apply to all pods
  policyTypes:
  - Egress
  egress:
  # Allow metrics scraping
  - to:
    - namespaceSelector:
        matchLabels:
          name: monitoring
      podSelector:
        matchLabels:
          app: prometheus
    ports:
    - protocol: TCP
      port: 9090
  # Allow log forwarding
  - to:
    - namespaceSelector:
        matchLabels:
          name: logging
      podSelector:
        matchLabels:
          app: fluent-bit
    ports:
    - protocol: TCP
      port: 24224

Network Policy per Service

Service-Specific Network Policies:

# apps/atp-query/base/network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: atp-query-network-policy
  namespace: atp-production
spec:
  podSelector:
    matchLabels:
      app: atp-query
  policyTypes:
  - Ingress
  - Egress
  ingress:
  # Allow from gateway
  - from:
    - podSelector:
        matchLabels:
          app: atp-gateway
    ports:
    - protocol: TCP
      port: 8080
  # Allow from ingestion service
  - from:
    - podSelector:
        matchLabels:
          app: atp-ingestion
    ports:
    - protocol: TCP
      port: 8080
  egress:
  # Allow to Redis
  - to:
    - podSelector:
        matchLabels:
          app: redis
    ports:
    - protocol: TCP
      port: 6379
  # Allow to SQL
  - to:
    - namespaceSelector:
        matchLabels:
          name: external-services
    ports:
    - protocol: TCP
      port: 1433

OPA Gatekeeper Alternative

Open Policy Agent Overview

OPA Gatekeeper Architecture:

graph LR
    A[Policy Templates<br/>in Git] -->|Deploy| B[Gatekeeper<br/>Controller]
    B -->|Creates| C[Constraint CRDs]
    C -->|Enforces| D[Kubernetes<br/>Admission Webhook]
    D -->|Validates| E[Resource Requests]
    E -->|Allows| F[Compliant Resources]
    E -.->|Violates| G[Rejected Resources]

    style A fill:#90EE90
    style B fill:#FFE5B4
    style C fill:#FFE5B4
    style D fill:#FFE5B4
    style E fill:#ffcccc
    style F fill:#90EE90
    style G fill:#ff9999

Install Gatekeeper:

# Install Gatekeeper
kubectl apply -f https://raw.githubusercontent.com/open-policy-agent/gatekeeper/release-3.14/deploy/gatekeeper.yaml

# Verify installation
kubectl get pods -n gatekeeper-system

Gatekeeper Constraints and Templates

ConstraintTemplate: Require Resource Limits:

# policies/gatekeeper/require-resource-limits-template.yaml
apiVersion: templates.gatekeeper.sh/v1beta1
kind: ConstraintTemplate
metadata:
  name: k8srequiredresourcelimits
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredResourceLimits
      validation:
        openAPIV3Schema:
          type: object
          properties:
            cpu:
              type: string
              default: "100m"
            memory:
              type: string
              default: "128Mi"
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredresourcelimits

        violation[{"msg": msg}] {
          container := input.review.object.spec.template.spec.containers[_]
          not container.resources
          msg := sprintf("Container '%v' must have resources defined", [container.name])
        }

        violation[{"msg": msg}] {
          container := input.review.object.spec.template.spec.containers[_]
          not container.resources.limits
          msg := sprintf("Container '%v' must have resource limits", [container.name])
        }

        violation[{"msg": msg}] {
          container := input.review.object.spec.template.spec.containers[_]
          not container.resources.limits.cpu
          msg := sprintf("Container '%v' must have CPU limit", [container.name])
        }

        violation[{"msg": msg}] {
          container := input.review.object.spec.template.spec.containers[_]
          not container.resources.limits.memory
          msg := sprintf("Container '%v' must have memory limit", [container.name])
        }
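The four Rego rules above can be hard to read at first glance. This Python sketch reproduces the same checks in one pass (note the Rego rules fire independently, so a container with no `resources` block would trigger several of them; here they are collapsed for readability):

```python
def resource_limit_violations(deployment: dict) -> list[str]:
    """Single-pass equivalent of the four Rego violation rules above."""
    violations = []
    for c in deployment["spec"]["template"]["spec"]["containers"]:
        name = c["name"]
        resources = c.get("resources")
        if resources is None:
            violations.append(f"Container '{name}' must have resources defined")
            continue
        limits = resources.get("limits")
        if limits is None:
            violations.append(f"Container '{name}' must have resource limits")
            continue
        if "cpu" not in limits:
            violations.append(f"Container '{name}' must have CPU limit")
        if "memory" not in limits:
            violations.append(f"Container '{name}' must have memory limit")
    return violations

# A Deployment spec missing a memory limit:
deployment = {"spec": {"template": {"spec": {"containers": [
    {"name": "atp-ingestion", "resources": {"limits": {"cpu": "100m"}}},
]}}}}
print(resource_limit_violations(deployment))
# ["Container 'atp-ingestion' must have memory limit"]
```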

Constraint: Enforce Resource Limits:

# policies/gatekeeper/require-resource-limits-constraint.yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredResourceLimits
metadata:
  name: require-resource-limits-production
spec:
  match:
    kinds:
    - apiGroups: ["apps"]
      kinds: ["Deployment", "StatefulSet", "DaemonSet"]
    excludedNamespaces:
    - kube-system
    - gatekeeper-system
    - external-secrets-system
    - ingress-nginx
  parameters:
    cpu: "100m"
    memory: "128Mi"

Custom Policy Authoring with Rego

Rego Policy: Require Non-Root User:

# policies/gatekeeper/require-non-root.rego
package requirenonroot

violation[{"msg": msg}] {
    container := input.review.object.spec.template.spec.containers[_]
    not container.securityContext
    msg := sprintf("Container '%v' must have securityContext defined", [container.name])
}

violation[{"msg": msg}] {
    container := input.review.object.spec.template.spec.containers[_]
    container.securityContext
    not container.securityContext.runAsNonRoot
    msg := sprintf("Container '%v' must run as non-root user", [container.name])
}

violation[{"msg": msg}] {
    container := input.review.object.spec.template.spec.containers[_]
    container.securityContext
    container.securityContext.runAsNonRoot == false
    msg := sprintf("Container '%v' must run as non-root user (currently runAsNonRoot=false)", [container.name])
}

ConstraintTemplate:

# policies/gatekeeper/require-non-root-template.yaml
apiVersion: templates.gatekeeper.sh/v1beta1
kind: ConstraintTemplate
metadata:
  name: k8srequirednonroot
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredNonRoot
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package requirenonroot

        violation[{"msg": msg}] {
            container := input.review.object.spec.template.spec.containers[_]
            not container.securityContext
            msg := sprintf("Container '%v' must have securityContext defined", [container.name])
        }

        violation[{"msg": msg}] {
            container := input.review.object.spec.template.spec.containers[_]
            container.securityContext
            not container.securityContext.runAsNonRoot
            msg := sprintf("Container '%v' must run as non-root user", [container.name])
        }

Constraint:

# policies/gatekeeper/require-non-root-constraint.yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredNonRoot
metadata:
  name: require-non-root-production
spec:
  match:
    kinds:
    - apiGroups: ["apps"]
      kinds: ["Deployment", "StatefulSet"]
    namespaces:
    - atp-production
    excludedNamespaces:
    - kube-system

Integration with CI/CD

PR Validation with Gatekeeper:

# .azuredevops/pipelines/pr-validation-gatekeeper.yml
stages:
- stage: ValidateGatekeeper
  displayName: 'Validate with Gatekeeper'
  jobs:
  - job: GatekeeperValidation
    steps:
    - script: |
        # Install the OPA CLI, which the validation step uses to unit-test
        # the Rego policies
        curl -L -o opa https://openpolicyagent.org/downloads/latest/opa_linux_amd64_static
        chmod +x opa
        sudo mv opa /usr/local/bin/
      displayName: 'Install OPA CLI'

    - script: |
        # Unit-test the Rego policies
        opa test policies/gatekeeper/ -v || exit 1

        # To validate rendered manifests against the constraints themselves,
        # use `gator test` (the Gatekeeper CLI) rather than plain OPA
      displayName: 'Validate Gatekeeper policies'

Image Signing and Verification

Image Signing with Notary/Cosign

Cosign Image Signing:

# Install Cosign
wget -O cosign https://github.com/sigstore/cosign/releases/latest/download/cosign-linux-amd64
chmod +x cosign
sudo mv cosign /usr/local/bin/

# Generate signing key pair
cosign generate-key-pair

# Sign container image
cosign sign --key cosign.key \
  connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d

# Verify signature
cosign verify --key cosign.pub \
  connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d

Azure Pipeline: Image Signing:

# .azuredevops/pipelines/image-signing.yml
- stage: SignImage
  displayName: 'Sign Container Image'
  jobs:
  - job: SignWithCosign
    steps:
    - script: |
        # Install Cosign
        wget -O cosign https://github.com/sigstore/cosign/releases/latest/download/cosign-linux-amd64
        chmod +x cosign
        sudo mv cosign /usr/local/bin/
      displayName: 'Install Cosign'

    - task: AzureKeyVault@2
      inputs:
        azureSubscription: 'ATP-KeyVault-Connection'
        KeyVaultName: 'atp-shared-kv'
        SecretsFilter: 'cosign-private-key'
      displayName: 'Retrieve Cosign private key'

    - script: |
        # Sign image; cosign reads the private key from the environment
        # (--key expects a file path or an env:// reference, not raw key data)
        cosign sign --key env://COSIGN_PRIVATE_KEY \
          --yes \
          $(ImageRepository):$(ImageTag)

        echo "✅ Image signed: $(ImageRepository):$(ImageTag)"
      displayName: 'Sign container image'
      env:
        COSIGN_PRIVATE_KEY: $(cosign-private-key)
        COSIGN_PASSWORD: $(cosign-key-password)

Signature Storage in ACR

ACR Repository Configuration:

# Configure retention for untagged manifests (registry housekeeping,
# not signing itself)
az acr config retention update \
  --registry connectsoft \
  --days 30 \
  --status Enabled

# List manifest metadata for the repository (signatures appear as
# additional manifests)
az acr manifest list-metadata \
  --registry connectsoft \
  --name atp/ingestion

Cosign with ACR:

# Sign the image; cosign stores the signature in ACR alongside the image
cosign sign --key cosign.key \
  connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d

Admission Controller for Verification

Image Policy Webhook:

# platform/image-policy-webhook.yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: image-policy-webhook
webhooks:
- name: image-policy.atp.connectsoft.com
  clientConfig:
    service:
      name: image-policy-webhook
      namespace: image-policy-system
      path: "/validate"
  rules:
  - operations: ["CREATE", "UPDATE"]
    apiGroups: ["apps"]
    apiVersions: ["v1"]
    resources: ["deployments", "statefulsets", "daemonsets"]
  admissionReviewVersions: ["v1", "v1beta1"]
  sideEffects: None
  failurePolicy: Fail

Image Verification with Cosign Admission Controller:

# Install the Sigstore policy-controller via its Helm chart
helm repo add sigstore https://sigstore.github.io/helm-charts
helm repo update
helm install policy-controller sigstore/policy-controller \
  --namespace cosign-system \
  --create-namespace
Image Policy:

# platform/image-policy.yaml
apiVersion: policy.sigstore.dev/v1beta1
kind: ClusterImagePolicy
metadata:
  name: atp-image-policy
spec:
  images:
  - glob: "connectsoft.azurecr.io/atp/**"
  authorities:
  - key:
      data: |
        -----BEGIN PUBLIC KEY-----
        MFkwEwYHKoZIzj0CAQYIKoZIzj0CAQYIKoZIzj0CAQYIKoZIzj0CAQYIKoZI...
        -----END PUBLIC KEY-----
  - keyless:
      identities:
      - issuer: "https://token.actions.githubusercontent.com"
        subject: "https://github.com/ConnectSoft/ATP/.github/workflows/*"
  mode: enforce  # enforce or warn
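The `glob` field above decides which images the policy governs. This Python sketch approximates the matching with `fnmatch` (an approximation only: the policy-controller defines its own glob semantics, and `fnmatch`'s `*` happens to cross `/` boundaries, which is roughly what `**` expresses here):

```python
from fnmatch import fnmatchcase

# The image glob from the ClusterImagePolicy above.
POLICY_GLOB = "connectsoft.azurecr.io/atp/**"

def policy_applies(image: str) -> bool:
    """Approximate check of whether the ClusterImagePolicy covers an image."""
    return fnmatchcase(image, POLICY_GLOB)

print(policy_applies("connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d"))  # True
print(policy_applies("docker.io/library/nginx:latest"))                       # False
```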

Rejecting Unsigned Images

Policy Enforcement:

# With mode: enforce
# Unsigned images will be rejected at admission time
# Error: Image signature verification failed

# With mode: warn
# Unsigned images will be allowed but warnings logged

SBOM Generation and Storage

Generating SBOM During CI Build

SBOM Generation in Pipeline:

# .azuredevops/pipelines/sbom-generation.yml
- stage: GenerateSBOM
  displayName: 'Generate SBOM'
  jobs:
  - job: GenerateSBOM
    steps:
    - script: |
        # Install Syft
        curl -sSfL https://raw.githubusercontent.com/anchore/syft/main/install.sh | sh -s -- -b /usr/local/bin
        syft version
      displayName: 'Install Syft'

    - script: |
        # Generate SBOM in SPDX format
        syft packages docker:$(ImageRepository):$(ImageTag) \
          --output spdx-json \
          --file sbom-$(ImageTag).spdx.json

        # Generate SBOM in CycloneDX format
        syft packages docker:$(ImageRepository):$(ImageTag) \
          --output cyclonedx-json \
          --file sbom-$(ImageTag).cyclonedx.json

        echo "✅ SBOM generated for $(ImageRepository):$(ImageTag)"
      displayName: 'Generate SBOM'

    - script: |
        # Attach SBOM to ACR image as OCI artifact
        oras attach \
          --artifact-type application/vnd.cyclonedx+json \
          connectsoft.azurecr.io/atp/ingestion:$(ImageTag) \
          sbom-$(ImageTag).cyclonedx.json
      displayName: 'Attach SBOM to image'

SBOM Formats (CycloneDX, SPDX)

SPDX Format Example:

{
  "SPDXID": "SPDXRef-DOCUMENT",
  "spdxVersion": "SPDX-2.3",
  "name": "connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d",
  "dataLicense": "CC0-1.0",
  "documentNamespace": "https://connectsoft.example/sbom/atp-ingestion/v1.2.3-abc123d",
  "packages": [
    {
      "SPDXID": "SPDXRef-Package-dotnet-runtime",
      "name": "dotnet-runtime",
      "versionInfo": "8.0.0",
      "downloadLocation": "NOASSERTION",
      "filesAnalyzed": false,
      "licenseConcluded": "NOASSERTION"
    },
    {
      "SPDXID": "SPDXRef-Package-aspnetcore",
      "name": "aspnetcore",
      "versionInfo": "8.0.0",
      "downloadLocation": "NOASSERTION",
      "filesAnalyzed": false,
      "licenseConcluded": "NOASSERTION"
    }
  ]
}

CycloneDX Format Example:

{
  "bomFormat": "CycloneDX",
  "specVersion": "1.5",
  "version": 1,
  "metadata": {
    "timestamp": "2024-01-15T10:00:00Z",
    "tools": [
      {
        "vendor": "Anchore",
        "name": "syft",
        "version": "1.0.0"
      }
    ],
    "component": {
      "type": "container",
      "name": "atp-ingestion",
      "version": "v1.2.3-abc123d"
    }
  },
  "components": [
    {
      "type": "library",
      "name": "dotnet-runtime",
      "version": "8.0.0"
    },
    {
      "type": "library",
      "name": "aspnetcore",
      "version": "8.0.0"
    }
  ]
}
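Because CycloneDX is plain JSON, downstream tooling can inspect it without SBOM-specific libraries. A minimal Python sketch that extracts component coordinates from a BOM shaped like the example above (the trimmed document here is illustrative):

```python
import json

# A trimmed CycloneDX document matching the example above.
cyclonedx = json.loads("""{
  "bomFormat": "CycloneDX",
  "specVersion": "1.5",
  "components": [
    {"type": "library", "name": "dotnet-runtime", "version": "8.0.0"},
    {"type": "library", "name": "aspnetcore", "version": "8.0.0"}
  ]
}""")

def list_components(bom: dict) -> list[tuple[str, str]]:
    """Extract (name, version) pairs from a CycloneDX BOM."""
    return [(c["name"], c["version"]) for c in bom.get("components", [])]

print(list_components(cyclonedx))
# [('dotnet-runtime', '8.0.0'), ('aspnetcore', '8.0.0')]
```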

Storing SBOM in ACR Artifacts

Attach SBOM to ACR Image:

# Install the ORAS CLI (release assets are versioned; pin a version)
ORAS_VERSION="1.2.0"
curl -LO "https://github.com/oras-project/oras/releases/download/v${ORAS_VERSION}/oras_${ORAS_VERSION}_linux_amd64.tar.gz"
tar xf "oras_${ORAS_VERSION}_linux_amd64.tar.gz"
sudo mv oras /usr/local/bin/

# Login to ACR
az acr login --name connectsoft

# Attach SBOM as OCI artifact
oras attach \
  --artifact-type application/vnd.cyclonedx+json \
  connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d \
  sbom-v1.2.3-abc123d.cyclonedx.json

# Attach SPDX SBOM
oras attach \
  --artifact-type application/spdx+json \
  connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d \
  sbom-v1.2.3-abc123d.spdx.json

Query SBOM from ACR:

# List artifacts attached to the image (SBOMs appear as referrers)
oras discover \
  connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d

# Download an attached SBOM by the digest reported by `oras discover`
oras pull \
  "connectsoft.azurecr.io/atp/ingestion@<sbom-artifact-digest>" \
  -o ./sbom-downloaded

Vulnerability Scanning of SBOM

Scan SBOM for Vulnerabilities:

# Scan SBOM with Grype
grype sbom:sbom-$(ImageTag).cyclonedx.json \
  --output json \
  --file vulnerability-report-$(ImageTag).json

# Or scan with Trivy
trivy sbom sbom-$(ImageTag).cyclonedx.json \
  --format json \
  --output trivy-sbom-report-$(ImageTag).json

Vulnerability Scanning

Azure Defender for Containers

Enable Azure Defender:

# Enable Defender for Containers
az security pricing create \
  --name "Containers" \
  --tier "Standard"

Defender for Containers Configuration:

// Enable Defender for Containers via Pulumi
new Security.Pricing("defender-containers", new()
{
    PricingTier = "Standard",
    SubPlan = "DefenderForContainers",
});

Trivy Scanning in CI Pipeline

Trivy Vulnerability Scan:

# .azuredevops/pipelines/vulnerability-scanning.yml
- stage: VulnerabilityScan
  displayName: 'Vulnerability Scanning'
  jobs:
  - job: TrivyScan
    steps:
    - script: |
        # Install Trivy via the official install script
        curl -sfL https://raw.githubusercontent.com/aquasecurity/trivy/main/contrib/install.sh | sh -s -- -b /usr/local/bin
        trivy --version
      displayName: 'Install Trivy'

    - script: |
        # Scan container image
        trivy image \
          --format json \
          --output trivy-$(ImageTag).json \
          --severity HIGH,CRITICAL \
          --exit-code 0 \
          $(ImageRepository):$(ImageTag)
      displayName: 'Scan image for vulnerabilities'
      continueOnError: true

    - script: |
        # Generate HTML report
        trivy image \
          --format template \
          --template "@contrib/html.tpl" \
          --output trivy-$(ImageTag).html \
          --severity HIGH,CRITICAL \
          $(ImageRepository):$(ImageTag)

        # Publish report
        echo "##vso[task.addattachment type=Distributedtask.Core.Summary;name=Vulnerability Report;]$PWD/trivy-$(ImageTag).html"
      displayName: 'Generate vulnerability report'

    - script: |
        # Fail build if critical vulnerabilities found
        CRITICAL_COUNT=$(jq '[.Results[]?.Vulnerabilities[]? | select(.Severity=="CRITICAL")] | length' trivy-$(ImageTag).json)

        if [ "$CRITICAL_COUNT" -gt 0 ]; then
          echo "❌ Critical vulnerabilities found: $CRITICAL_COUNT"
          exit 1
        fi

        echo "✅ No critical vulnerabilities found"
      displayName: 'Check for critical vulnerabilities'
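The jq expression in the last step is compact but easy to misread. The same count, expressed in Python for local tooling (the `Results`/`Vulnerabilities`/`Severity` keys match Trivy's JSON output; the sample report is a minimal illustration):

```python
def critical_count(report: dict) -> int:
    """Count CRITICAL vulnerabilities across all Results entries,
    mirroring the jq expression in the pipeline step above."""
    return sum(
        1
        for result in report.get("Results") or []
        for vuln in result.get("Vulnerabilities") or []
        if vuln.get("Severity") == "CRITICAL"
    )

# Minimal report shaped like Trivy's JSON output.
report = {"Results": [{"Vulnerabilities": [
    {"VulnerabilityID": "CVE-2024-1234", "Severity": "CRITICAL"},
    {"VulnerabilityID": "CVE-2024-5678", "Severity": "HIGH"},
]}]}

print(critical_count(report))  # 1
```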

Runtime Vulnerability Detection

Trivy Operator for Runtime Scanning:

# Install Trivy Operator
helm repo add aqua https://aquasecurity.github.io/helm-charts/
helm repo update

helm install trivy-operator aqua/trivy-operator \
  --namespace trivy-system \
  --create-namespace \
  --version 0.18.0

VulnerabilityReport CRD:

# Trivy Operator automatically creates VulnerabilityReport resources
apiVersion: aquasecurity.github.io/v1alpha1
kind: VulnerabilityReport
metadata:
  name: atp-ingestion-abc123
  namespace: atp-production
report:
  artifact:
    repository: connectsoft.azurecr.io/atp/ingestion
    tag: v1.2.3-abc123d
  summary:
    criticalCount: 0
    highCount: 2
    mediumCount: 5
    lowCount: 10

Query Vulnerability Reports:

# List vulnerability reports
kubectl get vulnerabilityreports -n atp-production

# View detailed report
kubectl get vulnerabilityreport atp-ingestion-abc123 -n atp-production -o yaml

Remediation Workflows

Vulnerability Remediation Process:

graph LR
    A[Vulnerability<br/>Detected] -->|Alert| B[Security Team]
    B -->|Assess| C{Critical?}
    C -->|Yes| D[Immediate Patch]
    C -->|No| E[Schedule Patch]
    D -->|Rebuild Image| F[Rescan]
    E -->|Rebuild Image| F
    F -->|Verify| G[Deploy]

    style A fill:#ffcccc
    style D fill:#ff9999
    style F fill:#90EE90
    style G fill:#90EE90

Remediation Script:

#!/bin/bash
# scripts/remediate-vulnerability.sh

IMAGE="${1:-connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d}"
VULN_ID="${2:-CVE-2024-1234}"

echo "🔧 Remediating vulnerability: $VULN_ID in $IMAGE"

# 1. Update dependencies
# 2. Rebuild image
# 3. Rescan
trivy image --severity HIGH,CRITICAL "$IMAGE" | grep -q "$VULN_ID" && \
  echo "❌ Vulnerability still present" || \
  echo "✅ Vulnerability remediated"

RBAC Policies in Kubernetes

ServiceAccounts per ATP Service

ServiceAccount Definition:

# apps/atp-ingestion/base/serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: atp-ingestion
  namespace: atp-production
  labels:
    app: atp-ingestion
    managed-by: fluxcd

Roles and RoleBindings

Role: Service-Specific Permissions:

# apps/atp-ingestion/base/role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: atp-ingestion-role
  namespace: atp-production
rules:
# Allow read ConfigMaps in same namespace
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["get", "list", "watch"]
# Allow read Secrets in same namespace
- apiGroups: [""]
  resources: ["secrets"]
  verbs: ["get", "list"]
  resourceNames:
  - sql-connection-string
  - redis-connection-string

RoleBinding:

# apps/atp-ingestion/base/rolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: atp-ingestion-rolebinding
  namespace: atp-production
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: atp-ingestion-role
subjects:
- kind: ServiceAccount
  name: atp-ingestion
  namespace: atp-production

ClusterRoles and ClusterRoleBindings

ClusterRole: Cross-Namespace Permissions:

# platform/rbac/clusterrole-monitoring.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: atp-monitoring-reader
rules:
# Allow read pods for metrics
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
# Allow read nodes for metrics
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "list", "watch"]
  resourceNames: []

ClusterRoleBinding:

# platform/rbac/clusterrolebinding-monitoring.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: atp-monitoring-reader-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: atp-monitoring-reader
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: monitoring

Least Privilege Principle

Least Privilege RBAC Matrix:

| Service | ServiceAccount | Role | Permissions |
|---|---|---|---|
| atp-ingestion | atp-ingestion | Role (namespace-scoped) | Read ConfigMaps, read specific Secrets |
| atp-query | atp-query | Role (namespace-scoped) | Read ConfigMaps, read specific Secrets |
| prometheus | prometheus | ClusterRole | Read Pods, Nodes (cluster-wide) |
| fluent-bit | fluent-bit | ClusterRole | Read Pods, Namespaces (cluster-wide) |

RBAC Audit Script:

#!/bin/bash
# scripts/audit-rbac.sh

echo "🔍 Auditing RBAC permissions..."

# List ClusterRoleBindings that grant permissions to ServiceAccounts
kubectl get clusterrolebindings -o json | \
  jq -r '.items[] | select(.subjects[]?.kind=="ServiceAccount") | .metadata.name'

# Check ClusterRoles for wildcard verb permissions
kubectl get clusterroles -o json | \
  jq -r '.items[] | select(.rules[]?.verbs[]?=="*") | .metadata.name'

Audit Logging

Kubernetes Audit Logs

Enable Audit Logging:

# cluster-config/audit-policy.yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
# Log all requests in these stages
- level: Metadata
  namespaces: ["atp-production"]
  verbs: ["create", "update", "patch", "delete"]
- level: RequestResponse
  namespaces: ["atp-production"]
  resources:
  - group: ""
    resources: ["secrets", "configmaps"]
  verbs: ["create", "update", "patch", "delete"]
- level: None
  users: ["system:kube-proxy"]
  verbs: ["watch"]
  resources:
  - group: ""
    resources: ["endpoints", "services", "services/status"]

Configure Audit Logging on AKS:

# Note: on AKS the API server audit policy is managed by Azure; the policy
# file above applies to self-managed control planes. Collect kube-audit
# logs into Log Analytics via a diagnostic setting on the cluster resource:
az monitor diagnostic-settings create \
  --name atp-audit-logs \
  --resource "/subscriptions/{subscriptionId}/resourceGroups/atp-production-rg/providers/Microsoft.ContainerService/managedClusters/atp-prod-eus-aks" \
  --workspace "{logAnalyticsWorkspaceResourceId}" \
  --logs '[{"category": "kube-audit", "enabled": true}]'

Forwarding to Azure Monitor

Audit Log Forwarding:

# platform/audit-log-forwarder.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: audit-log-forwarder-config
  namespace: kube-system
data:
  fluent-bit.conf: |
    [INPUT]
        Name tail
        Path /var/log/audit/kube-apiserver-audit.log
        Parser json
        Tag kube-audit.*
        Refresh_Interval 5

    [OUTPUT]
        Name azure
        Match kube-audit.*
        Workspace_ID ${LOG_ANALYTICS_WORKSPACE_ID}
        Shared_Key ${LOG_ANALYTICS_SHARED_KEY}

Audit Policy Configuration

Comprehensive Audit Policy:

# cluster-config/audit-policy-comprehensive.yaml
apiVersion: audit.k8s.io/v1
kind: Policy
omitStages:
- "RequestReceived"
rules:
# Log all requests to production namespace
- level: RequestResponse
  namespaces: ["atp-production"]
- level: Metadata
  namespaces: ["atp-staging"]
# Log secret access
- level: RequestResponse
  resources:
  - group: ""
    resources: ["secrets"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
# Log RBAC changes
- level: RequestResponse
  resources:
  - group: "rbac.authorization.k8s.io"
    resources: ["roles", "rolebindings", "clusterroles", "clusterrolebindings"]
  verbs: ["create", "update", "patch", "delete"]

Log Retention and Analysis

Audit Log Retention:

// Kubernetes audit events correlated with pod inventory
let AuditEvents = AzureDiagnostics
| where Category == "kube-audit"
| extend Audit = parse_json(log_s)
| extend Verb = tostring(Audit.verb), ObjectName = tostring(Audit.objectRef.name), User = tostring(Audit.user.username)
| where Verb in ("create", "update", "delete");
KubePodInventory
| where Namespace == "atp-production"
| join kind=inner (AuditEvents) on $left.Name == $right.ObjectName
| project TimeGenerated, Verb, User, ObjectName, Namespace
| order by TimeGenerated desc

Audit Log Analysis:

// Secret access audit trail
AzureDiagnostics
| where Category == "kube-audit"
| extend Audit = parse_json(log_s)
| where tostring(Audit.objectRef.resource) == "secrets"
| extend User = tostring(Audit.user.username), Verb = tostring(Audit.verb)
| project TimeGenerated, User, Verb, SecretName = tostring(Audit.objectRef.name), Namespace = tostring(Audit.objectRef.namespace)
| order by TimeGenerated desc

Policy Enforcement via GitOps

Policy as Code in Git

Policy Organization in Git:

atp-gitops/
├── policies/
│   ├── azure-policy/
│   │   ├── require-resource-limits.json
│   │   └── require-non-root.json
│   ├── gatekeeper/
│   │   ├── constraint-templates/
│   │   │   ├── require-resource-limits-template.yaml
│   │   │   └── require-non-root-template.yaml
│   │   └── constraints/
│   │       ├── require-resource-limits-constraint.yaml
│   │       └── require-non-root-constraint.yaml
│   └── network-policies/
│       ├── default-deny-all.yaml
│       └── service-policies/
│           ├── atp-ingestion-network-policy.yaml
│           └── atp-query-network-policy.yaml

Automated Policy Application

FluxCD Kustomization for Policies:

# infrastructure/policies/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  # Azure Policy assignments
  - azure-policy/assignments.yaml

  # Gatekeeper templates
  - gatekeeper/constraint-templates/

  # Gatekeeper constraints
  - gatekeeper/constraints/

  # Network policies
  - network-policies/default-deny-all.yaml
  - network-policies/service-policies/

FluxCD GitRepository for Policies:

# clusters/production/policies-gitrepository.yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: atp-policies
  namespace: flux-system
spec:
  interval: 5m
  url: https://dev.azure.com/ConnectSoft/ATP/_git/atp-gitops
  ref:
    branch: production
  secretRef:
    name: gitops-credentials
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: policies
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: atp-policies
  path: ./policies/
  prune: true

Policy Violation Detection

Monitor Policy Violations:

# Check Azure Policy violations
az policy state list \
  --resource "/subscriptions/{subscriptionId}/resourceGroups/atp-production-rg/providers/Microsoft.ContainerService/managedClusters/atp-prod-eus-aks" \
  --filter "complianceState eq 'NonCompliant'" \
  --query "[].{resource:resourceId, policy:policyAssignmentName, reason:complianceReason}" \
  --output table

# Check Gatekeeper constraint violations
kubectl get constraints -A
kubectl describe k8srequiredresourcelimits require-resource-limits-production -n atp-production

Policy Violation Alert:

# alerts/policy-violation.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: policy-violations
  namespace: monitoring
spec:
  groups:
  - name: policy-violations
    rules:
    - alert: AzurePolicyViolation
      expr: |
        azure_policy_compliance_state{state="NonCompliant"} > 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Azure Policy violation detected"
        description: "{{ $value }} non-compliant resources detected"

Remediation Through PR

Policy Violation Remediation Workflow:

graph LR
    A[Policy Violation<br/>Detected] -->|Alert| B[Developer]
    B -->|Create PR| C[Fix Manifest]
    C -->|Merge| D[GitOps Sync]
    D -->|Apply| E[Compliant Resource]

    style A fill:#ffcccc
    style C fill:#90EE90
    style E fill:#90EE90

Remediation PR Process:

  1. Developer receives policy violation alert
  2. Create PR to fix manifest
  3. PR validation ensures compliance
  4. Merge PR triggers FluxCD sync
  5. Policy violation resolved
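
The steps above can be sketched with Git and the Azure DevOps CLI; the branch name, manifest edit, and PR title are illustrative:

```bash
# 1-2. Branch and fix the offending manifest
git checkout -b fix/policy-violation-resource-limits
# ... edit the manifest, e.g. add the missing resources.limits ...
git commit -am "Add resource limits to satisfy require-resource-limits"
git push origin fix/policy-violation-resource-limits

# 3-5. Open the PR; branch policies run validation, merge triggers FluxCD sync
az repos pr create \
  --repository atp-gitops \
  --source-branch fix/policy-violation-resource-limits \
  --target-branch production \
  --title "Fix: add resource limits (policy remediation)"
```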

Compliance Evidence Generation

Deployment Receipts with Approvals

Deployment Receipt Generation:

#!/bin/bash
# scripts/generate-deployment-receipt.sh

DEPLOYMENT_NAME="${1:-atp-ingestion}"
NAMESPACE="${2:-atp-production}"

cat > deployment-receipt-$(date +%Y%m%d-%H%M%S).json <<EOF
{
  "deploymentId": "$(kubectl get deployment $DEPLOYMENT_NAME -n $NAMESPACE -o jsonpath='{.metadata.uid}')",
  "deployedAt": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
  "deployedBy": "FluxCD",
  "gitCommit": "$(kubectl get deployment $DEPLOYMENT_NAME -n $NAMESPACE -o jsonpath='{.metadata.labels.app\.kubernetes\.io/version}')",
  "approvals": [
    {
      "approver": "architect-team@connectsoft.example",
      "approvedAt": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
      "approvalType": "CAB"
    }
  ],
  "policyCompliance": {
    "azurePolicy": "Compliant",
    "gatekeeper": "Compliant",
    "podSecurity": "Compliant"
  }
}
EOF

Security Scan Reports

Security Scan Evidence:

# Generate security scan evidence
cat > security-scan-evidence-$(date +%Y%m%d).json <<EOF
{
  "scanDate": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
  "image": "$(ImageRepository):$(ImageTag)",
  "scanner": "Trivy",
  "vulnerabilities": {
    "critical": 0,
    "high": 2,
    "medium": 5,
    "low": 10
  },
  "compliance": "Pass",
  "scanReport": "trivy-$(ImageTag).json"
}
EOF

SBOM Artifacts

SBOM Evidence:

{
  "sbomGeneratedAt": "2024-01-15T10:00:00Z",
  "image": "connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d",
  "sbomFormat": "CycloneDX",
  "sbomVersion": "1.5",
  "sbomLocation": "connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d (OCI artifact)",
  "components": 245,
  "verification": {
    "signed": true,
    "signatureVerified": true
  }
}

Policy Compliance Reports

Compliance Report Generation:

// Generate compliance report: Key Vault secret access over the last 30 days
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.KEYVAULT"
| where Category == "AuditEvent"
| where TimeGenerated > ago(30d)
| extend SecretName = tostring(parse_json(properties_s).objectName)
| summarize 
    TotalAccess = count(),
    UniqueIdentities = dcount(identity_claim_appid_g)
    by SecretName
| project SecretName, TotalAccess, UniqueIdentities, ComplianceStatus = "Compliant"

SOC 2 / GDPR / HIPAA Controls

Mapping GitOps Workflows to Controls

SOC 2 Control Mapping:

| SOC 2 Control | GitOps Implementation | Evidence |
|---|---|---|
| CC6.1 Logical Access Controls | RBAC in Kubernetes, Azure AD integration | RBAC audit logs, access reviews |
| CC6.2 Authentication | Workload Identity, Azure AD | Authentication logs |
| CC6.7 Audit Logging | Kubernetes audit logs, Key Vault logs | Log Analytics queries |
| CC7.2 Change Management | GitOps PR workflow, approvals | PR history, deployment receipts |
| CC8.1 Encryption | Secrets in Key Vault, TLS in transit | Key Vault encryption logs |

GDPR Control Mapping:

| GDPR Article | GitOps Implementation | Evidence |
|---|---|---|
| Art. 32 Security of Processing | Pod Security Standards, Network Policies | Security policy compliance reports |
| Art. 33 Breach Notification | Audit logging, alerting | Security incident logs |
| Art. 35 Data Protection Impact Assessment | SBOM, vulnerability scanning | SBOM artifacts, scan reports |

Evidence Collection Automation

Automated Evidence Collection:

# .azuredevops/pipelines/compliance-evidence.yml
- stage: CollectEvidence
  displayName: 'Collect Compliance Evidence'
  jobs:
  - job: GenerateEvidence
    steps:
    - script: |
        # Generate deployment receipt
        ./scripts/generate-deployment-receipt.sh atp-ingestion atp-production

        # Generate security scan evidence
        ./scripts/generate-security-scan-evidence.sh

        # Generate SBOM evidence
        ./scripts/generate-sbom-evidence.sh

        # Generate policy compliance report
        ./scripts/generate-policy-compliance-report.sh

        # Archive all evidence
        tar -czf compliance-evidence-$(Build.BuildNumber).tar.gz \
          deployment-receipt-*.json \
          security-scan-evidence-*.json \
          sbom-evidence-*.json \
          policy-compliance-report-*.json
      displayName: 'Collect compliance evidence'

    - task: PublishPipelineArtifact@1
      inputs:
        targetPath: 'compliance-evidence-*.tar.gz'
        artifactName: 'compliance-evidence'

Audit Trail for Compliance

Audit Trail Generation:

// Complete audit trail for compliance: deployments, policy state, secret access
let DeploymentEvents = KubePodInventory
| where Namespace == "atp-production"
| extend EventType = "Deployment", Detail = Name;

let PolicyCompliance = AzureDiagnostics
| where ResourceProvider == "MICROSOFT.AUTHORIZATION"
| where Category == "PolicyState"
| extend EventType = "PolicyCompliance", Detail = OperationName;

let SecretAccess = AzureDiagnostics
| where ResourceProvider == "MICROSOFT.KEYVAULT"
| where Category == "AuditEvent"
| extend EventType = "SecretAccess", Detail = tostring(parse_json(properties_s).objectName);

union DeploymentEvents, PolicyCompliance, SecretAccess
| project TimeGenerated, EventType, Detail
| order by TimeGenerated desc

Regular Access Reviews

Access Review Automation:

#!/bin/bash
# scripts/access-review.sh

echo "📋 Generating Access Review Report..."

# Review Kubernetes RBAC
echo "## Kubernetes RBAC Review" > access-review-$(date +%Y%m%d).md
kubectl get clusterrolebindings -o json | \
  jq -r '.items[] | select(.subjects[]?.kind=="ServiceAccount") | "\(.metadata.name): \(.subjects[]?.name)"' \
  >> access-review-$(date +%Y%m%d).md

# Review Azure Key Vault access
echo "## Key Vault Access Review" >> access-review-$(date +%Y%m%d).md
az role assignment list \
  --scope "/subscriptions/{subscriptionId}/resourceGroups/atp-production-rg/providers/Microsoft.KeyVault/vaults/atp-prod-kv" \
  --query "[].{principal:principalName, role:roleDefinitionName}" \
  --output table >> access-review-$(date +%Y%m%d).md

Summary: Security Policies & Compliance

  • Azure Policy for Kubernetes: Policy definitions, assignments, and enforcement for AKS clusters
  • Pod Security Standards: Restricted profile enforcement, Pod Security Admission configuration, security context requirements
  • Network Policies: Default deny, ingress/egress rules, DNS and monitoring exceptions, service-specific policies
  • OPA Gatekeeper: Constraint templates, custom Rego policies, CI/CD integration
  • Image Signing: Cosign signing, signature storage in ACR, admission controller verification
  • SBOM Generation: CycloneDX and SPDX formats, storage in ACR artifacts, vulnerability scanning
  • Vulnerability Scanning: Azure Defender, Trivy in CI, runtime detection, remediation workflows
  • RBAC Policies: ServiceAccounts, Roles/RoleBindings, ClusterRoles, least privilege principle
  • Audit Logging: Kubernetes audit logs, Azure Monitor forwarding, log retention and analysis
  • Policy Enforcement via GitOps: Policy as code, automated application, violation detection, remediation through PR
  • Compliance Evidence: Deployment receipts, security scan reports, SBOM artifacts, policy compliance reports
  • SOC 2/GDPR/HIPAA Controls: Control mapping, evidence collection automation, audit trails, access reviews

FluxCD Continuous Reconciliation

Purpose: Define how FluxCD continuously reconciles the desired state from Git with the live Kubernetes cluster state, including reconciliation loops, drift detection, self-healing mechanisms, health assessment, and observability to ensure ATP deployments remain aligned with Git-managed manifests.


FluxCD Reconciliation Loop

How Reconciliation Works

Reconciliation Flow:

sequenceDiagram
    participant Git as Git Repository
    participant Source as Source Controller
    participant Kustomize as Kustomize Controller
    participant K8s as Kubernetes Cluster

    Git->>Source: Poll for changes (every 1m)
    Source->>Source: Fetch latest commit
    Source->>Source: Compare with last sync
    alt Changes detected
        Source->>Source: Update GitRepository status
        Source->>Kustomize: Trigger reconciliation
    end
    Kustomize->>Source: Fetch manifest artifact
    Kustomize->>K8s: Apply manifests (kubectl apply)
    K8s->>Kustomize: Return apply result
    Kustomize->>Kustomize: Update Kustomization status
    alt Drift detected
        Kustomize->>K8s: Re-apply to correct drift
    end
    Kustomize->>Source: Report reconciliation result

Reconciliation Components:

| Component | Responsibility | Reconciliation Trigger |
|---|---|---|
| Source Controller | Monitors Git repository | Polls Git every interval (default: 1m) |
| Kustomize Controller | Applies Kustomizations | Triggered by Source Controller when changes detected |
| Helm Controller | Applies Helm releases | Triggered by Source Controller when HelmRepository changes |
| Image Automation Controller | Updates image tags | Triggered by ImagePolicy changes |
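
The controllers in this table can be inspected with the Flux CLI (assumes the `flux` binary is installed and kubeconfig targets the cluster):

```bash
# Verify all Flux controllers and CRDs are installed and healthy
flux check

# Per-controller reconciliation status of managed objects
flux get sources git
flux get kustomizations
flux get helmreleases -A
```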

Polling Interval Configuration

GitRepository Polling Interval:

# clusters/production/gitrepository.yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: atp-gitops
  namespace: flux-system
spec:
  interval: 1m  # Poll Git every 1 minute
  url: https://dev.azure.com/ConnectSoft/ATP/_git/atp-gitops
  ref:
    branch: production
  timeout: 20s
  ignore: |
    # Exclude paths from reconciliation (gitignore-style patterns)
    /docs/
    /.git/

Kustomization Reconciliation Interval:

# apps/atp-ingestion/kustomization.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: atp-ingestion
  namespace: flux-system
spec:
  interval: 5m  # Reconcile every 5 minutes
  path: ./apps/atp-ingestion
  prune: true
  sourceRef:
    kind: GitRepository
    name: atp-gitops
  dependsOn:
  - name: infrastructure
  healthChecks:
  - apiVersion: apps/v1
    kind: Deployment
    name: atp-ingestion
    namespace: atp-production

Environment-Specific Intervals:

| Environment | GitRepository Interval | Kustomization Interval | Rationale |
|---|---|---|---|
| Dev | 30s | 1m | Faster feedback loop |
| Test | 1m | 2m | Balance between speed and load |
| Staging | 2m | 5m | Reduced cluster load |
| Production | 5m | 10m | Stability over speed |
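
Rather than duplicating full Flux definitions per environment, these intervals can be patched in each cluster overlay. A sketch assuming a shared base at `../../base/flux` (paths and resource names are illustrative):

```yaml
# clusters/dev/kustomization.yaml (illustrative overlay)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base/flux
patches:
  # Shorten the Git polling interval for dev
  - target:
      kind: GitRepository
      name: atp-gitops
    patch: |
      - op: replace
        path: /spec/interval
        value: 30s
  # Shorten the reconciliation interval for dev
  - target:
      kind: Kustomization
      name: apps
    patch: |
      - op: replace
        path: /spec/interval
        value: 1m
```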

Reconciliation Triggers

Automatic Triggers:

  1. Git Commit: New commit pushed to monitored branch
  2. Polling Interval: Periodic check (even if no changes)
  3. Webhook: Immediate trigger via webhook (bypasses polling)

Webhook Configuration:

# clusters/production/receiver.yaml
apiVersion: notification.toolkit.fluxcd.io/v1
kind: Receiver
metadata:
  name: gitops-receiver
  namespace: flux-system
spec:
  type: git
  events:
  - "push"
  resources:
  - kind: GitRepository
    name: atp-gitops
  secretRef:
    name: gitops-webhook-token
---
# Azure DevOps webhook trigger
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: atp-gitops
spec:
  interval: 1m
  url: https://dev.azure.com/ConnectSoft/ATP/_git/atp-gitops
  ref:
    branch: production
  # Webhook URL: https://<cluster-ip>/hook/<token>

Manual Trigger:

# Force immediate reconciliation
flux reconcile source git atp-gitops --with-source

# Force Kustomization reconciliation
flux reconcile kustomization atp-ingestion --with-source

# Trigger all reconciliations
flux reconcile source git --all

Retry Strategies and Backoff

Retry Configuration:

# Kustomization with retry settings
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: atp-ingestion
spec:
  interval: 5m
  retryInterval: 2m  # Retry failed reconciliation every 2 minutes
  timeout: 10m  # Timeout after 10 minutes
  path: ./apps/atp-ingestion
  sourceRef:
    kind: GitRepository
    name: atp-gitops
  # Failed reconciliations are retried every retryInterval (fixed, not exponential)

Retry Behavior:

| Failure Type | Retry Interval | Behavior |
|---|---|---|
| Git fetch error | GitRepository interval | Retried on the next source poll |
| Apply error | retryInterval (e.g. 2m) | Retried at the fixed retryInterval until success |
| Health check failure | retryInterval | Retried until healthy or the timeout is reached |

Retry Status:

# Check reconciliation status and retries
kubectl get kustomization atp-ingestion -n flux-system -o jsonpath='{.status}'

# Output:
# {
#   "conditions": [{
#     "type": "Ready",
#     "status": "False",
#     "reason": "ReconciliationFailed",
#     "message": "apply failed: error applying manifests",
#     "lastTransitionTime": "2024-01-15T10:00:00Z"
#   }],
#   "lastAppliedRevision": "abc123...",
#   "lastAttemptedRevision": "def456...",
#   "observedGeneration": 1,
#   "snapshot": {...}
# }

Automated Sync Policies

Auto-Sync for Dev and Test Environments

Dev Environment Auto-Sync:

# clusters/dev/kustomization-dev.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-dev
  namespace: flux-system
spec:
  interval: 1m  # Fast sync interval
  path: ./apps
  prune: true  # Auto-delete removed resources
  wait: true  # Wait for resources to be ready
  timeout: 5m
  sourceRef:
    kind: GitRepository
    name: atp-gitops
  force: false  # Set true to recreate resources blocked by immutable fields

Test Environment Auto-Sync:

# clusters/test/kustomization-test.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-test
  namespace: flux-system
spec:
  interval: 2m
  path: ./apps
  prune: true
  wait: true
  timeout: 10m
  sourceRef:
    kind: GitRepository
    name: atp-gitops
  force: false

Manual Sync for Staging and Production

Staging Environment Manual Sync:

# clusters/staging/kustomization-staging.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-staging
  namespace: flux-system
spec:
  suspend: false  # Reconciliation enabled
  interval: 5m  # Still polls for changes
  path: ./apps
  prune: false  # Manual pruning only
  wait: true
  timeout: 15m
  sourceRef:
    kind: GitRepository
    name: atp-gitops
  # Manual sync via: flux reconcile kustomization apps-staging

Production Environment Manual Sync:

# clusters/production/kustomization-production.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-production
  namespace: flux-system
spec:
  suspend: false
  interval: 10m  # Longer interval
  path: ./apps
  prune: false  # Never auto-prune in production
  wait: true
  timeout: 20m
  sourceRef:
    kind: GitRepository
    name: atp-gitops
  # Requires explicit: flux reconcile kustomization apps-production

Manual Sync Workflow:

# 1. Review changes in Git
git log --oneline production

# 2. Trigger manual sync
flux reconcile kustomization apps-production --with-source

# 3. Monitor sync status
flux get kustomizations apps-production --watch

# 4. Verify deployment
kubectl rollout status deployment/atp-ingestion -n atp-production

Sync Options (Prune, Force, Wait)

Sync Options Reference:

| Field | Description | Use Case |
|---|---|---|
| prune | Delete resources removed from Git | Clean up unused resources |
| force | Recreate resources that cannot be patched in place | Handle immutable field changes |
| wait | Wait for applied resources to become ready | Ensure deployment success |
| timeout | Upper bound for apply and health checks | Bound reconciliation duration |
| suspend | Pause reconciliation entirely | Freeze changes during incidents |
| targetNamespace | Apply all resources into a single namespace | Simplify namespace management |

Comprehensive Sync Options:

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: atp-ingestion-production
spec:
  interval: 10m
  path: ./apps/atp-ingestion
  prune: false  # Production: manual pruning
  wait: true  # Wait for Deployment to be ready
  timeout: 20m
  retryInterval: 5m
  sourceRef:
    kind: GitRepository
    name: atp-gitops
  force: false  # Don't recreate resources on immutable-field conflicts (safer)
  healthChecks:
  - apiVersion: apps/v1
    kind: Deployment
    name: atp-ingestion
    namespace: atp-production

Per-Resource Sync Configuration

Service-Specific Sync Settings:

# apps/atp-ingestion/kustomization.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: atp-ingestion
  namespace: flux-system
spec:
  interval: 5m
  path: ./apps/atp-ingestion
  prune: true
  wait: true
  timeout: 15m
  sourceRef:
    kind: GitRepository
    name: atp-gitops
  dependsOn:
  - name: infrastructure  # Wait for infrastructure first
  - name: secrets  # Wait for secrets to be synced
  healthChecks:
  - apiVersion: apps/v1
    kind: Deployment
    name: atp-ingestion
    namespace: atp-production
  # Per-resource pruning is controlled by labeling the resource itself with
  #   kustomize.toolkit.fluxcd.io/prune: disabled

Drift Detection Mechanisms

Comparing Git State to Live Cluster

Drift Detection Flow:

graph LR
    A[Git State<br/>Manifests] -->|Compare| B[Cluster State<br/>Live Resources]
    B -->|Matches| C[No Action]
    B -->|Differs| D[Drift Detected]
    D -->|Self-Heal Enabled| E[Re-apply from Git]
    D -->|Self-Heal Disabled| F[Alert Only]
    E -->|Success| C
    E -->|Failure| G[Retry/Alert]

    style A fill:#90EE90
    style B fill:#FFE5B4
    style D fill:#ffcccc
    style E fill:#90EE90
    style F fill:#ff9999

Drift Detection Process:

  1. Fetch Git State: Source Controller fetches latest manifests
  2. Fetch Cluster State: Kustomize Controller queries Kubernetes API
  3. Compute Diff: Compare desired (Git) vs actual (Cluster) state
  4. Detect Drift: Identify differences
  5. Correct Drift: Re-apply manifests (if self-healing enabled)

Check Drift Status:

# Check for drift
flux get kustomizations atp-ingestion

# Output shows:
# NAME            READY   MESSAGE                         REVISION        SUSPENDED
# atp-ingestion   True    Applied revision: abc123def    abc123def       False

# Detailed drift information
kubectl describe kustomization atp-ingestion -n flux-system

# Events show drift detection:
# Warning  ReconciliationFailed  drift detected: Deployment replicas changed from 3 to 5
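
Drift can also be previewed without waiting for a reconciliation cycle using `flux diff`, which builds the Kustomization locally and compares it server-side against the live objects (assumes a local checkout of the GitOps repository):

```bash
# Preview the differences between Git and the cluster for a Kustomization
flux diff kustomization atp-ingestion \
  --path ./apps/atp-ingestion

# A non-zero exit code indicates the cluster differs from Git
```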

Drift Types (Manual Changes, External Controllers)

Manual Changes:

# Example: Manual replica scaling
kubectl scale deployment atp-ingestion -n atp-production --replicas=5

# FluxCD detects drift:
# Warning  ReconciliationFailed  drift detected in Deployment/atp-ingestion:
#   spec.replicas: expected 3, found 5

# With self-healing: FluxCD reverts to 3 replicas
# Without self-healing: Alert only

External Controller Changes:

# Example: HPA scales deployment
# HPA changes replicas to 4 based on CPU usage

# FluxCD behavior:
# - If replicas in Git: 3 (no replicas field)
# - HPA-managed replicas: 4
# - FluxCD: No drift (HPA takes precedence when replicas field absent)

# If Git specifies replicas: 3
# HPA-managed replicas: 4
# FluxCD: Detects drift, reverts to 3 (may conflict with HPA)
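
To avoid this conflict, the common pattern is to omit `spec.replicas` from the Git manifest and let the HPA own the replica count (a sketch; the HPA thresholds are illustrative):

```yaml
# Deployment without spec.replicas: the HPA owns scaling, FluxCD sees no drift
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
  namespace: atp-production
spec:
  # replicas intentionally omitted
  selector:
    matchLabels:
      app: atp-ingestion
  template:
    metadata:
      labels:
        app: atp-ingestion
    spec:
      containers:
      - name: atp-ingestion
        image: connectsoft.azurecr.io/atp/ingestion:v1.2.3
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: atp-ingestion
  namespace: atp-production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: atp-ingestion
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```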

Resource Annotation for Drift Ignore:

# apps/atp-ingestion/base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
  annotations:
    kustomize.toolkit.fluxcd.io/reconcile: disabled  # Flux skips this resource
spec:
  replicas: 3  # May be modified by HPA, FluxCD won't revert

Drift Detection Frequency

Drift Detection Intervals:

| Component | Detection Method | Frequency |
|---|---|---|
| GitRepository | Polls Git for changes | Every interval (default: 1m) |
| Kustomization | Compares Git state to cluster | Every interval (default: 10m) |
| Manual Trigger | Immediate comparison | On-demand via flux reconcile |

Optimized Drift Detection:

# Production: Less frequent drift checks
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-production
spec:
  interval: 10m  # Check for drift every 10 minutes
  path: ./apps
  sourceRef:
    kind: GitRepository
    name: atp-gitops

# Dev: More frequent drift checks
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-dev
spec:
  interval: 1m  # Check for drift every minute
  path: ./apps
  sourceRef:
    kind: GitRepository
    name: atp-gitops

Alerting on Drift

Drift Alert Configuration:

# alerts/drift-detection.yaml
apiVersion: notification.toolkit.fluxcd.io/v1
kind: Alert
metadata:
  name: drift-detection
  namespace: flux-system
spec:
  providerRef:
    name: azure-monitor
  eventSeverity: warning
  eventSources:
  - kind: Kustomization
    name: apps-production
    namespace: flux-system
  exclusionList:
  - ".* is ready"
  - ".*applied revision.*"

Drift Alert with Notification:

# Notification provider for Azure Monitor
apiVersion: notification.toolkit.fluxcd.io/v1
kind: Provider
metadata:
  name: azure-monitor
  namespace: flux-system
spec:
  type: generic
  address: https://api.loganalytics.io/v1/workspaces/{workspaceId}/events
  secretRef:
    name: azure-monitor-credentials

Query Drift Alerts:

// Query FluxCD drift alerts from Log Analytics
FluxCDEvents
| where EventType == "Warning"
| where Message contains "drift detected"
| project TimeGenerated, Kustomization, Message, Namespace
| order by TimeGenerated desc

Self-Healing Configuration

Automatic Revert of Manual Changes

Self-Healing Enabled (Default):

# Self-healing is enabled by default
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: atp-ingestion
spec:
  interval: 5m
  path: ./apps/atp-ingestion
  prune: true
  # Self-healing: automatically reverts manual changes
  sourceRef:
    kind: GitRepository
    name: atp-gitops

Self-Healing Behavior:

# 1. Manual change
kubectl patch deployment atp-ingestion -n atp-production \
  -p '{"spec":{"replicas":5}}'

# 2. FluxCD detects drift (within 5 minutes)
# 3. FluxCD reverts to Git state (replicas: 3)
# 4. Deployment restored to desired state

Disable Self-Healing for Specific Resource:

# apps/atp-ingestion/base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
  annotations:
    # Disable reconciliation (and thus self-healing) for this resource
    kustomize.toolkit.fluxcd.io/reconcile: disabled
spec:
  replicas: 3

Self-Heal Enable/Disable per Environment

Environment-Specific Self-Healing:

| Environment | Self-Healing | Rationale |
|---|---|---|
| Dev | ✅ Enabled | Fast feedback, automatic correction |
| Test | ✅ Enabled | Validate self-healing behavior |
| Staging | ⚠️ Selective | Enable for critical resources only |
| Production | ⚠️ Selective | Manual intervention preferred for critical changes |

Production: Selective Self-Healing:

# clusters/production/kustomization-production.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-production
spec:
  interval: 10m
  path: ./apps
  prune: false  # Manual pruning in production
  # Self-healing enabled, but prune disabled for safety
  sourceRef:
    kind: GitRepository
    name: atp-gitops
  healthChecks:
  - apiVersion: apps/v1
    kind: Deployment
    name: atp-gateway
    namespace: atp-production
  # Self-healing reverts manual changes to Gateway

Disable Self-Healing Globally:

# Suspend Kustomization (disables all reconciliation including self-healing)
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-production
spec:
  suspend: true  # Temporarily disable all reconciliation
  interval: 10m
  path: ./apps

Force Recreation of Resources

Force Recreate on Drift:

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: atp-ingestion
spec:
  interval: 5m
  path: ./apps/atp-ingestion
  force: true  # Recreate resources when in-place apply fails (immutable fields)
  sourceRef:
    kind: GitRepository
    name: atp-gitops

Force Reconciliation via Annotation:

# Request an immediate reconciliation of the Flux Kustomization
kubectl annotate kustomization atp-ingestion -n flux-system \
  reconcile.fluxcd.io/requestedAt="$(date +%s)" \
  --overwrite

# Equivalent to: flux reconcile kustomization atp-ingestion

Preserving Stateful Resources

Protect Stateful Resources from Self-Healing:

# apps/atp-query/base/statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: atp-query
  annotations:
    # Preserve manual changes: exclude this StatefulSet from reconciliation
    kustomize.toolkit.fluxcd.io/reconcile: disabled
spec:
  replicas: 3
  # ... other spec

Protect PVCs from Pruning:

# Pruning is opted out per resource via the prune: disabled label;
# the PVC below is illustrative
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: atp-query-data
  namespace: atp-production
  labels:
    kustomize.toolkit.fluxcd.io/prune: disabled  # Never garbage-collected
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 100Gi

Health Assessment

Built-in Health Checks (Deployment, StatefulSet)

Deployment Health Check:

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: atp-ingestion
spec:
  interval: 5m
  path: ./apps/atp-ingestion
  wait: true  # Wait for health checks to pass
  timeout: 15m
  sourceRef:
    kind: GitRepository
    name: atp-gitops
  healthChecks:
  - apiVersion: apps/v1
    kind: Deployment
    name: atp-ingestion
    namespace: atp-production
    # FluxCD checks:
    # - Deployment status.availableReplicas == spec.replicas
    # - All pods are Ready

StatefulSet Health Check:

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: atp-query
spec:
  interval: 5m
  path: ./apps/atp-query
  wait: true
  timeout: 20m  # Longer timeout for StatefulSet
  sourceRef:
    kind: GitRepository
    name: atp-gitops
  healthChecks:
  - apiVersion: apps/v1
    kind: StatefulSet
    name: atp-query
    namespace: atp-production
    # FluxCD checks:
    # - StatefulSet status.readyReplicas == spec.replicas
    # - All pods are Ready and in correct order

Multiple Health Checks:

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: atp-gateway
spec:
  interval: 5m
  path: ./apps/atp-gateway
  wait: true
  timeout: 15m
  sourceRef:
    kind: GitRepository
    name: atp-gitops
  healthChecks:
  # Deployment health check
  - apiVersion: apps/v1
    kind: Deployment
    name: atp-gateway
    namespace: atp-production
  # Service health check
  - apiVersion: v1
    kind: Service
    name: atp-gateway
    namespace: atp-production
  # Ingress health check
  - apiVersion: networking.k8s.io/v1
    kind: Ingress
    name: atp-gateway
    namespace: atp-production

Custom Health Checks

Custom Health Check with CRD:

# Custom health check using custom resource
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: atp-ingestion
spec:
  interval: 5m
  path: ./apps/atp-ingestion
  wait: true
  timeout: 15m
  sourceRef:
    kind: GitRepository
    name: atp-gitops
  healthChecks:
  # Custom resources can be health-assessed when their status follows kstatus
  # conventions (a Ready condition plus observedGeneration); the apiVersion
  # and kind below are illustrative
  - apiVersion: custom.health.check/v1
    kind: HealthCheck
    name: atp-ingestion-health
    namespace: atp-production

Health Check Status:

# Check health check status
kubectl get kustomization atp-ingestion -n flux-system -o jsonpath='{.status.conditions}'

# Output:
# [
#   {
#     "type": "Ready",
#     "status": "True",
#     "reason": "HealthCheckPassed",
#     "message": "all health checks passed",
#     "lastTransitionTime": "2024-01-15T10:00:00Z"
#   },
#   {
#     "type": "Healthy",
#     "status": "True",
#     "reason": "AllHealthChecksPassed",
#     "message": "Deployment/atp-ingestion is healthy"
#   }
# ]

Readiness Gates

Readiness Gate Configuration:

# apps/atp-ingestion/base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  replicas: 3
  template:
    spec:
      readinessGates:
      # Condition types are illustrative; each must be set on the Pod
      # by an external controller before the Pod counts as Ready
      - conditionType: PodHasNetwork
      - conditionType: PodHasStorage
      containers:
      - name: atp-ingestion
        image: connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5

FluxCD Health Check with Readiness Gates:

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: atp-ingestion
spec:
  interval: 5m
  path: ./apps/atp-ingestion
  wait: true
  timeout: 20m  # Longer timeout if readiness gates present
  sourceRef:
    kind: GitRepository
    name: atp-gitops
  healthChecks:
  - apiVersion: apps/v1
    kind: Deployment
    name: atp-ingestion
    namespace: atp-production
    # FluxCD waits for:
    # - Deployment ready
    # - All pods Ready
    # - All readiness gates conditions met

Timeout and Failure Thresholds

Health Check Timeout:

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: atp-ingestion
spec:
  interval: 5m
  path: ./apps/atp-ingestion
  wait: true
  timeout: 15m  # Max time to wait for health checks
  retryInterval: 2m  # Retry failed health checks every 2 minutes
  sourceRef:
    kind: GitRepository
    name: atp-gitops
  healthChecks:
  - apiVersion: apps/v1
    kind: Deployment
    name: atp-ingestion
    namespace: atp-production

Health Check Failure Handling:

| Scenario | Behavior | Action |
|---|---|---|
| Health check passes | Reconciliation succeeds | Continue normal operation |
| Health check fails (transient) | Retry up to timeout | Retry every retryInterval |
| Health check fails (permanent) | Reconciliation marked failed | Alert and manual intervention |
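When a permanent failure needs investigation, one option is to temporarily stop FluxCD from waiting on health checks (or pause reconciliation altogether) while the failing resource is fixed; a sketch using standard flux/kubectl commands (the Kustomization name is illustrative):

```shell
# Stop waiting on health checks while triaging
kubectl patch kustomization atp-ingestion -n flux-system \
  --type merge -p '{"spec":{"wait":false}}'

# Or pause reconciliation entirely
flux suspend kustomization atp-ingestion

# Restore the original behavior once the resource is healthy again
kubectl patch kustomization atp-ingestion -n flux-system \
  --type merge -p '{"spec":{"wait":true}}'
flux resume kustomization atp-ingestion
```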

Check Health Check Status:

# View health check failures
kubectl describe kustomization atp-ingestion -n flux-system

# Events show:
# Warning  ReconciliationFailed  health check failed: 
#   Deployment/atp-ingestion not ready: 2/3 replicas available

Prune Policies

Automatic Cleanup of Deleted Resources

Prune Enabled:

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-dev
spec:
  interval: 1m
  path: ./apps
  prune: true  # Auto-delete resources removed from Git
  sourceRef:
    kind: GitRepository
    name: atp-gitops

Prune Behavior:

# 1. Resource exists in Git and cluster
# apps/atp-old-service/deployment.yaml

# 2. Delete resource from Git
rm apps/atp-old-service/deployment.yaml
git commit -m "Remove old service"
git push

# 3. FluxCD detects resource removed from Git
# 4. FluxCD deletes resource from cluster (prune enabled)
# 5. Resource removed from cluster
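Before deleting manifests, the set of objects FluxCD currently manages (the prune candidates) can be inspected with the flux CLI; the Kustomization name below is illustrative:

```shell
# List every object this Kustomization owns; anything later removed
# from Git becomes a prune candidate
flux tree kustomization apps-dev -n flux-system
```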

Prune Safety (PVC, PV Protection)

Prune Safety Model:

The Kustomization API has no label allow-list for pruning; garbage-collection
exemptions are declared on the resources themselves with the
kustomize.toolkit.fluxcd.io/prune annotation, or pruning is disabled for the
whole Kustomization.

Protect PVCs from Pruning:

# apps/atp-query/base/pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: atp-query-data
  annotations:
    kustomize.toolkit.fluxcd.io/prune: disabled  # Excluded from garbage collection
  labels:
    app: atp-query
    persistent: "true"
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi

Prune Disabled at the Kustomization Level:

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-production
spec:
  interval: 10m
  path: ./apps
  prune: false  # Never garbage-collect; removals from Git require manual cleanup
  sourceRef:
    kind: GitRepository
    name: atp-gitops

Selective Pruning

Selective Prune by Namespace:

# targetNamespace sets the namespace for all namespaced objects in this
# Kustomization; pruning itself is always scoped to the Kustomization's
# own inventory, never to unrelated cluster resources
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-production
spec:
  interval: 10m
  path: ./apps
  prune: true
  targetNamespace: atp-production
  sourceRef:
    kind: GitRepository
    name: atp-gitops

Selective Prune by Resource Type:

There is no Kustomization field that excludes whole resource kinds from
pruning; instead, annotate the kinds that must survive (for example PVCs
and PVs) directly:

# apps/atp-query/base/pvc.yaml (excerpt)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: atp-query-data
  annotations:
    kustomize.toolkit.fluxcd.io/prune: disabled

Prune Validation

Dry-Run Prune:

# Preview changes, including resources that would be pruned
flux diff kustomization apps-production --path ./apps

# Output lists resources that would be created, changed, or deleted

Prune Status:

# Check prune status
kubectl get kustomization apps-production -n flux-system -o jsonpath='{.status.inventory}'

# Output:
# {
#   "entries": [
#     {"id": "apps_v1_Deployment_atp-production_atp-ingestion", "v": "v1"},
#     {"id": "v1_Service_atp-production_atp-ingestion", "v": "v1"}
#   ]
# }

# Resources not in inventory but in cluster will be pruned

Sync Ordering and Dependencies

Depends-On in Kustomization

Dependency Chain:

# 1. Infrastructure (base)
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infrastructure
  namespace: flux-system
spec:
  interval: 5m
  path: ./infrastructure
  sourceRef:
    kind: GitRepository
    name: atp-gitops
  # No dependencies

---
# 2. Secrets (depends on infrastructure)
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: secrets
  namespace: flux-system
spec:
  interval: 5m
  path: ./secrets
  sourceRef:
    kind: GitRepository
    name: atp-gitops
  dependsOn:
  - name: infrastructure  # Wait for infrastructure first

---
# 3. Apps (depends on infrastructure and secrets)
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-production
  namespace: flux-system
spec:
  interval: 10m
  path: ./apps
  sourceRef:
    kind: GitRepository
    name: atp-gitops
  dependsOn:
  - name: infrastructure
  - name: secrets  # Wait for both

Dependency Graph:

graph TD
    A[Infrastructure] --> B[Secrets]
    A --> C[Apps]
    B --> C
    C --> D[atp-ingestion]
    C --> E[atp-query]
    C --> F[atp-gateway]

    style A fill:#90EE90
    style B fill:#FFE5B4
    style C fill:#FFE5B4

Ordering Infrastructure Before Apps

Infrastructure First:

# Infrastructure Kustomization (no dependencies)
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infrastructure
  namespace: flux-system
spec:
  interval: 5m
  path: ./infrastructure
  prune: false  # Don't auto-prune infrastructure
  sourceRef:
    kind: GitRepository
    name: atp-gitops

Apps After Infrastructure:

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-production
  namespace: flux-system
spec:
  interval: 10m
  path: ./apps
  sourceRef:
    kind: GitRepository
    name: atp-gitops
  dependsOn:
  - name: infrastructure  # Ensure infrastructure ready first

Cross-Resource Dependencies

Service Dependencies:

# atp-query depends on atp-ingestion
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: atp-query
  namespace: flux-system
spec:
  interval: 5m
  path: ./apps/atp-query
  sourceRef:
    kind: GitRepository
    name: atp-gitops
  dependsOn:
  - name: atp-ingestion  # Wait for ingestion service

Cross-Namespace Dependencies:

# Apps in production namespace depend on monitoring in monitoring namespace
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-production
  namespace: flux-system
spec:
  interval: 10m
  path: ./apps
  sourceRef:
    kind: GitRepository
    name: atp-gitops
  dependsOn:
  - name: monitoring  # Wait for monitoring stack
  healthChecks:
  - apiVersion: v1
    kind: Service
    name: prometheus
    namespace: monitoring  # Cross-namespace health check

Wait for Readiness

Wait for Dependencies to be Ready:

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-production
  namespace: flux-system
spec:
  interval: 10m
  path: ./apps
  wait: true  # Wait for resources to be ready
  timeout: 20m
  sourceRef:
    kind: GitRepository
    name: atp-gitops
  dependsOn:
  - name: infrastructure
  # Waits for:
  # 1. Infrastructure Kustomization to be ready
  # 2. All health checks in infrastructure to pass
  # 3. Then proceeds with apps reconciliation

Dependency Readiness Check:

# Check dependency status
flux get kustomizations apps-production

# Shows dependency status:
# NAME              READY   MESSAGE                         REVISION
# infrastructure    True    Applied revision: abc123        abc123
# apps-production   True    Applied revision: def456        def456

# If dependency not ready:
# apps-production   False   dependency 'infrastructure' is not ready

Notification Controller

Sending Alerts to Azure Monitor

Azure Monitor Provider:

# notification/provider-azure-monitor.yaml
apiVersion: notification.toolkit.fluxcd.io/v1
kind: Provider
metadata:
  name: azure-monitor
  namespace: flux-system
spec:
  type: generic
  # FluxCD has no dedicated Azure Monitor provider type; events are POSTed as
  # JSON to a webhook, so front the Log Analytics ingestion API with a small
  # relay (an Azure Function or Logic App). The address below is illustrative.
  address: https://api.connectsoft.example/fluxcd/azure-monitor-relay
  secretRef:
    name: azure-monitor-credentials
---
# Secret with relay credentials
apiVersion: v1
kind: Secret
metadata:
  name: azure-monitor-credentials
  namespace: flux-system
type: Opaque
stringData:
  token: "{relay-token}"  # Shared secret for the relay endpoint

Alert Configuration:

# notification/alert-reconciliation.yaml
apiVersion: notification.toolkit.fluxcd.io/v1
kind: Alert
metadata:
  name: reconciliation-alerts
  namespace: flux-system
spec:
  providerRef:
    name: azure-monitor
  eventSeverity: info
  eventSources:
  - kind: Kustomization
    name: apps-production
    namespace: flux-system
  - kind: GitRepository
    name: atp-gitops
    namespace: flux-system

Slack/Teams Integration

Slack Provider:

# notification/provider-slack.yaml
apiVersion: notification.toolkit.fluxcd.io/v1
kind: Provider
metadata:
  name: slack
  namespace: flux-system
spec:
  type: slack
  channel: "#atp-alerts"
  username: fluxcd
  secretRef:
    name: slack-credentials
---
apiVersion: v1
kind: Secret
metadata:
  name: slack-credentials
  namespace: flux-system
type: Opaque
stringData:
  address: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"

Teams Provider:

# notification/provider-teams.yaml
apiVersion: notification.toolkit.fluxcd.io/v1
kind: Provider
metadata:
  name: teams
  namespace: flux-system
spec:
  type: msteams  # Native Microsoft Teams incoming-webhook provider
  address: "https://outlook.office.com/webhook/YOUR/WEBHOOK/URL"
  secretRef:
    name: teams-credentials

Alert for Slack/Teams:

apiVersion: notification.toolkit.fluxcd.io/v1
kind: Alert
metadata:
  name: production-alerts
  namespace: flux-system
spec:
  providerRef:
    name: slack  # or teams
  eventSeverity: error
  eventSources:
  - kind: Kustomization
    name: apps-production
    namespace: flux-system
  exclusionList:
  - ".* is ready"
  - ".*applied revision.*"
  summary: "Production deployment {{ .InvolvedObject.kind }} {{ .InvolvedObject.name }}"

Email Notifications

Email Provider:

The notification-controller does not speak SMTP directly; the generic provider
POSTs events over HTTPS, so email delivery goes through a relay (for example an
Azure Logic App or Function that forwards the payload to Office 365). The relay
address below is illustrative.

# notification/provider-email.yaml
apiVersion: notification.toolkit.fluxcd.io/v1
kind: Provider
metadata:
  name: email
  namespace: flux-system
spec:
  type: generic
  address: "https://api.connectsoft.example/fluxcd/email-relay"
  secretRef:
    name: email-credentials
---
apiVersion: v1
kind: Secret
metadata:
  name: email-credentials
  namespace: flux-system
type: Opaque
stringData:
  token: "{relay-token}"  # Bearer token for the relay endpoint

Email Alert:

apiVersion: notification.toolkit.fluxcd.io/v1
kind: Alert
metadata:
  name: critical-alerts-email
  namespace: flux-system
spec:
  providerRef:
    name: email
  eventSeverity: error
  eventSources:
  - kind: Kustomization
    name: apps-production
    namespace: flux-system
  # Only send critical errors via email

Custom Webhooks

Webhook Provider:

# notification/provider-webhook.yaml
apiVersion: notification.toolkit.fluxcd.io/v1
kind: Provider
metadata:
  name: custom-webhook
  namespace: flux-system
spec:
  type: generic
  address: "https://api.connectsoft.example/fluxcd/webhook"
  secretRef:
    name: webhook-credentials
---
apiVersion: v1
kind: Secret
metadata:
  name: webhook-credentials
  namespace: flux-system
type: Opaque
stringData:
  token: "{webhook-token}"

Webhook Alert:

apiVersion: notification.toolkit.fluxcd.io/v1
kind: Alert
metadata:
  name: webhook-alerts
  namespace: flux-system
spec:
  providerRef:
    name: custom-webhook
  eventSeverity: info
  eventSources:
  - kind: Kustomization
    name: apps-production
    namespace: flux-system

Handling Stuck Reconciliations

Identifying Stuck Reconciliations

Check Reconciliation Status:

# Check if Kustomization is stuck
flux get kustomizations apps-production

# Stuck indicators:
# - READY: False for extended period
# - MESSAGE: Contains "error" or "failed"
# - No recent status updates

# Detailed status
kubectl describe kustomization apps-production -n flux-system

# Check events for stuck reconciliation
kubectl get events -n flux-system \
  --field-selector involvedObject.name=apps-production \
  --sort-by='.lastTimestamp'

Common Stuck Scenarios:

| Scenario | Symptom | Resolution |
|---|---|---|
| Git fetch error | MESSAGE: git fetch failed | Check Git credentials, network |
| Apply timeout | MESSAGE: apply timeout | Increase timeout, check resource complexity |
| Health check failure | MESSAGE: health check failed | Fix failing resource, disable health check |
| Dependency stuck | MESSAGE: dependency not ready | Resolve dependency issue |

Suspending and Resuming

Suspend Reconciliation:

# Suspend to stop reconciliation
flux suspend kustomization apps-production

# Or via kubectl
kubectl patch kustomization apps-production -n flux-system \
  -p '{"spec":{"suspend":true}}'

Resume Reconciliation:

# Resume reconciliation
flux resume kustomization apps-production

# Or via kubectl
kubectl patch kustomization apps-production -n flux-system \
  -p '{"spec":{"suspend":false}}'

Suspend via Spec Field:

# suspend is a spec field (there is no FluxCD suspend annotation)
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-production
spec:
  suspend: true  # Reconciliation paused until set back to false
  interval: 10m
  path: ./apps
  sourceRef:
    kind: GitRepository
    name: atp-gitops
Force Reconciliation

Force Reconciliation:

# Force immediate reconciliation
flux reconcile kustomization apps-production --with-source

# Force with source update
flux reconcile source git atp-gitops
flux reconcile kustomization apps-production

# Force all reconciliations (the CLI reconciles one object at a time)
flux get kustomizations --no-header | awk '{print $1}' \
  | xargs -n1 flux reconcile kustomization

Force via Annotation:

# Add annotation to force reconciliation
kubectl annotate kustomization apps-production -n flux-system \
  reconcile.fluxcd.io/requestedAt="$(date +%s)" \
  --overwrite

Debugging Techniques

Enable Verbose Logging:

# Check FluxCD controller logs
kubectl logs -n flux-system \
  -l app=kustomize-controller \
  --tail=100

# Follow logs in real-time
kubectl logs -n flux-system \
  -l app=kustomize-controller \
  -f

# Filter for specific Kustomization
kubectl logs -n flux-system \
  -l app=kustomize-controller \
  | grep "apps-production"

Debug Commands:

# Check GitRepository status
flux get source git atp-gitops

# Check Kustomization status
flux get kustomizations apps-production

# Check events
kubectl get events -n flux-system \
  --field-selector involvedObject.name=apps-production

# Check resource status
kubectl get deployment atp-ingestion -n atp-production -o yaml

Dry-Run Reconciliation:

# Preview reconciliation without applying
flux diff kustomization apps-production --path ./apps

# Output shows what would change in the cluster

Observability

FluxCD Metrics in Prometheus

Enable Metrics:

# FluxCD controllers expose Prometheus metrics on the http-prom port (8080)
# at /metrics; no Service is created by default, so scrape the pods directly

# PodMonitor for Prometheus Operator
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: fluxcd-kustomize-controller
  namespace: flux-system
spec:
  selector:
    matchLabels:
      app: kustomize-controller
  podMetricsEndpoints:
  - port: http-prom
    interval: 30s
    path: /metrics

Key Metrics:

| Metric | Description |
|---|---|
| gotk_reconcile_duration_seconds | Reconciliation duration histogram (labels: kind, name) |
| controller_runtime_reconcile_total | Total reconciliations, labeled by result |
| controller_runtime_reconcile_errors_total | Reconciliation errors |
| gotk_reconcile_duration_seconds{kind="GitRepository"} | Git fetch/reconcile duration (source-controller) |

Prometheus Query Examples:

# Reconciliation success rate
sum(rate(controller_runtime_reconcile_total{result="success"}[5m]))
/
sum(rate(controller_runtime_reconcile_total[5m]))

# Average reconciliation duration
sum(rate(gotk_reconcile_duration_seconds_sum[5m]))
/
sum(rate(gotk_reconcile_duration_seconds_count[5m]))

# Error rate
sum(rate(controller_runtime_reconcile_errors_total[5m]))

Grafana Dashboards

Grafana Dashboard JSON:

{
  "dashboard": {
    "title": "FluxCD Reconciliation",
    "panels": [
      {
        "title": "Reconciliation Success Rate",
        "targets": [{
          "expr": "sum(rate(controller_runtime_reconcile_total{result=\"success\"}[5m])) / sum(rate(controller_runtime_reconcile_total[5m]))"
        }]
      },
      {
        "title": "Reconciliation Duration (avg)",
        "targets": [{
          "expr": "sum(rate(gotk_reconcile_duration_seconds_sum[5m])) / sum(rate(gotk_reconcile_duration_seconds_count[5m]))"
        }]
      },
      {
        "title": "Reconciliation Errors",
        "targets": [{
          "expr": "sum(rate(controller_runtime_reconcile_errors_total[5m]))"
        }]
      }
    ]
  }
}

Log Forwarding to Log Analytics

Fluent Bit Configuration:

# platform/logging/fluent-bit-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging
data:
  fluent-bit.conf: |
    [INPUT]
        Name tail
        Path /var/log/containers/kustomize-controller-*.log
        Parser docker
        Tag kube.fluxcd.*
        Refresh_Interval 5

    [FILTER]
        Name kubernetes
        Match kube.fluxcd.*
        Merge_Log On

    [OUTPUT]
        Name azure
        Match kube.fluxcd.*
        Customer_ID {workspace-id}
        Shared_Key {workspace-key}
        Log_Type FluxCD

Log Analytics Query:

// Query FluxCD logs (Log_Type "FluxCD" lands in the FluxCD_CL custom table)
FluxCD_CL
| where ContainerName_s contains "kustomize-controller"
| where LogMessage_s contains "reconciliation"
| project TimeGenerated, ContainerName_s, LogMessage_s
| order by TimeGenerated desc

Reconciliation Duration and Success Rate

Success Rate Monitoring:

# Overall success rate
sum(rate(controller_runtime_reconcile_total{result="success"}[5m]))
/
sum(rate(controller_runtime_reconcile_total[5m]))

# Per-controller success rate
sum by (controller) (
  rate(controller_runtime_reconcile_total{result="success"}[5m])
)
/
sum by (controller) (
  rate(controller_runtime_reconcile_total[5m])
)

Duration Monitoring:

# Average reconciliation duration
sum(rate(gotk_reconcile_duration_seconds_sum[5m]))
/
sum(rate(gotk_reconcile_duration_seconds_count[5m]))

# P95 reconciliation duration
histogram_quantile(0.95,
  sum by (le) (rate(gotk_reconcile_duration_seconds_bucket[5m]))
)

# Per-object duration (kind/name labels set by the Flux controllers)
sum by (kind, name) (rate(gotk_reconcile_duration_seconds_sum[5m]))
/
sum by (kind, name) (rate(gotk_reconcile_duration_seconds_count[5m]))

Alert on High Error Rate:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: fluxcd-reconciliation-alerts
  namespace: monitoring
spec:
  groups:
  - name: fluxcd
    rules:
    - alert: FluxCDHighErrorRate
      expr: |
        sum(rate(controller_runtime_reconcile_errors_total[5m])) > 0.1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "FluxCD reconciliation error rate is high"
        description: "{{ $value }} errors per second"

    - alert: FluxCDReconciliationSlow
      expr: |
        sum(rate(gotk_reconcile_duration_seconds_sum[5m]))
        / sum(rate(gotk_reconcile_duration_seconds_count[5m])) > 300
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "FluxCD reconciliations are taking longer than expected"
        description: "Average duration: {{ $value }}s"

Summary: FluxCD Continuous Reconciliation

  • Reconciliation Loop: Polling intervals, reconciliation triggers, retry strategies and backoff
  • Automated Sync Policies: Auto-sync for dev/test, manual sync for staging/production, sync options, per-resource sync configuration
  • Drift Detection: Comparing Git state to live cluster, drift types, detection frequency, alerting on drift
  • Self-Healing: Automatic revert of manual changes, enable/disable per environment, force recreation, preserving stateful resources
  • Health Assessment: Built-in health checks, custom health checks, readiness gates, timeout and failure thresholds
  • Prune Policies: Automatic cleanup, prune safety (PVC/PV protection), selective pruning, prune validation
  • Sync Ordering: depends-on in Kustomization, infrastructure before apps, cross-resource dependencies, wait for readiness
  • Notification Controller: Azure Monitor alerts, Slack/Teams integration, email notifications, custom webhooks
  • Handling Stuck Reconciliations: Identifying stuck reconciliations, suspending and resuming, force reconciliation, debugging techniques
  • Observability: FluxCD metrics in Prometheus, Grafana dashboards, log forwarding to Log Analytics, reconciliation duration and success rate monitoring

Multi-Environment AKS Deployment

Purpose: Define how ATP is deployed across multiple environments (dev, test, staging, production) using separate AKS clusters, environment-specific configurations, Kustomize overlays, Helm values, and FluxCD per-environment reconciliation, ensuring proper isolation, resource management, and multi-region high availability.


Environment-Specific AKS Clusters

Separate Clusters per Environment Rationale

Multi-Cluster Architecture:

graph TB
    subgraph "Production Subscription"
        PROD[Production AKS<br/>East US]
        STAGING[Staging AKS<br/>East US]
    end
    subgraph "Non-Prod Subscription"
        TEST[Test AKS<br/>East US]
        DEV[Dev AKS<br/>East US]
    end
    subgraph "Production Subscription - DR"
        PROD_DR[Production AKS<br/>West Europe]
    end

    PROD -->|Traffic| PROD_DR
    STAGING -->|Validate| PROD

    style PROD fill:#90EE90
    style PROD_DR fill:#90EE90
    style STAGING fill:#FFE5B4
    style TEST fill:#FFE5B4
    style DEV fill:#FFE5B4

Rationale for Separate Clusters:

| Aspect | Separate Clusters | Shared Cluster | ATP Decision |
|---|---|---|---|
| Isolation | ✅ Complete isolation | ⚠️ Namespace-level only | Separate Clusters |
| Security | ✅ Environment boundaries | ⚠️ Shared RBAC/network | Separate Clusters |
| Resource Management | ✅ Independent scaling | ⚠️ Shared resources | Separate Clusters |
| Cost | ❌ Higher (multiple clusters) | ✅ Lower (single cluster) | Separate Clusters (security/compliance priority) |
| Operational Complexity | ⚠️ More clusters to manage | ✅ Simpler | Separate Clusters (acceptable trade-off) |
| Blast Radius | ✅ Isolated failures | ❌ Cross-environment impact | Separate Clusters |

ATP Selection: Separate Clusters

Reasons:

- ✅ Compliance: SOC 2 requires production isolation
- ✅ Security: No risk of dev/test workloads accessing production resources
- ✅ Resource Isolation: Production resources guaranteed, not shared
- ✅ Independent Scaling: Each environment scales independently
- ✅ Rollback Safety: Production cluster unaffected by dev/test issues

Cluster Sizing and SKU Selection

Environment-Specific Cluster Sizing:

| Environment | Node Pool SKU | Node Count | CPU/Memory per Node | Total Capacity | Rationale |
|---|---|---|---|---|---|
| Dev | Standard_D4s_v3 | 2-3 nodes | 4 vCPU / 16 GB | 8-12 vCPU / 32-48 GB | Minimal resources, cost-effective |
| Test | Standard_D4s_v3 | 3-5 nodes | 4 vCPU / 16 GB | 12-20 vCPU / 48-80 GB | Integration testing needs |
| Staging | Standard_D8s_v3 | 5-10 nodes | 8 vCPU / 32 GB | 40-80 vCPU / 160-320 GB | Production-like capacity |
| Production | Standard_D16s_v3 | 10-20 nodes | 16 vCPU / 64 GB | 160-320 vCPU / 640-1280 GB | High availability, performance |
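The sizing above maps directly onto AKS node pools; a sketch of adding the production user pool with the autoscaler bounds from the table, using the Azure CLI (resource-group and cluster names are illustrative):

```shell
# Add an autoscaling user node pool sized per the table above
az aks nodepool add \
  --resource-group atp-production-rg \
  --cluster-name atp-prod-eus-aks \
  --name user \
  --mode User \
  --node-vm-size Standard_D16s_v3 \
  --enable-cluster-autoscaler \
  --min-count 10 \
  --max-count 20
```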

Production Cluster Configuration:

// infrastructure/AKS-Production.cs
public class AKSProduction
{
    public ContainerService.KubernetesCluster Cluster { get; }

    public AKSProduction(string location) // e.g. "eastus"
    {
        this.Cluster = new ContainerService.KubernetesCluster("atp-prod-eus-aks", new()
        {
            ResourceGroupName = "atp-production-rg",
            Location = location, // "eastus"
            DnsPrefix = "atp-prod-eus",
            DefaultNodePool = new ContainerService.Inputs.KubernetesClusterDefaultNodePoolArgs
            {
                Name = "system",
                NodeCount = 3,
                VmSize = "Standard_D16s_v3",
                OsDiskSizeGb = 256,
                OsDiskType = "Ephemeral",
                Type = "VirtualMachineScaleSets",
                EnableAutoScaling = true,
                MinCount = 3,
                MaxCount = 5,
                MaxPods = 110,
                NodeTaints = new[]
                {
                    "CriticalAddonsOnly=true:NoSchedule"
                },
            },
            // User node pools for workloads
            NodeResourceGroup = "atp-prod-eus-aks-nodes",
            KubernetesVersion = "1.28.0",
            NetworkProfile = new ContainerService.Inputs.KubernetesClusterNetworkProfileArgs
            {
                NetworkPlugin = "azure",
                NetworkPolicy = "azure",
                ServiceCidr = "10.0.1.0/24",
                DnsServiceIp = "10.0.1.10",
                DockerBridgeCidr = "172.17.0.1/16",
            },
            Identity = new ContainerService.Inputs.KubernetesClusterIdentityArgs
            {
                Type = "UserAssigned",
                IdentityIds = new[] { managedIdentity.Id }, // user-assigned identity created elsewhere in the stack
            },
            AzurePolicyEnabled = true,
            HttpApplicationRoutingEnabled = false,
            RoleBasedAccessControlEnabled = true,
            AzureRbacEnabled = true,
            PrivateClusterEnabled = true,
            ApiServerAuthorizedIpRanges = new[]
            {
                "10.0.0.0/16", // VNet CIDR
            },
            Tags = new()
            {
                { "Environment", "production" },
                { "CostCenter", "ATP-Production" },
                { "Compliance", "SOC2" },
            },
        });
    }
}

Networking Configuration per Environment

Environment Network Isolation:

| Environment | VNet | Subnet | Private Endpoints | Network Policies | Rationale |
|---|---|---|---|---|---|
| Dev | atp-dev-vnet | atp-dev-subnet | ❌ Disabled | ⚠️ Baseline | Cost optimization |
| Test | atp-test-vnet | atp-test-subnet | ❌ Disabled | ✅ Enforced | Test network policies |
| Staging | atp-staging-vnet | atp-staging-subnet | ✅ Enabled | ✅ Enforced | Production-like |
| Production | atp-prod-vnet | atp-prod-subnet | ✅ Enabled | ✅ Enforced | Maximum security |
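In the environments where network policies are enforced, a default-deny baseline is the usual starting point; a sketch of such a policy for the production namespace (services then declare explicit allow rules, e.g. from the ingress controller):

```yaml
# Default-deny ingress for the production namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: atp-production
spec:
  podSelector: {}  # Applies to all pods in the namespace
  policyTypes:
  - Ingress
```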

Production Network Configuration:

// Production VNet with private endpoints
var prodVNet = new Network.VirtualNetwork("atp-prod-vnet", new()
{
    ResourceGroupName = "atp-production-rg",
    Location = "eastus",
    AddressSpace = new[] { "10.1.0.0/16" },
    Subnets = new[]
    {
        new Network.Inputs.SubnetArgs
        {
            Name = "atp-prod-aks-subnet",
            AddressPrefix = "10.1.1.0/24",
            PrivateEndpointNetworkPoliciesEnabled = true,
        },
        new Network.Inputs.SubnetArgs
        {
            Name = "atp-prod-private-endpoints",
            AddressPrefix = "10.1.2.0/24",
            PrivateEndpointNetworkPoliciesEnabled = false,
        },
    },
});

Subscription Strategy (Shared vs Dedicated)

ATP Subscription Strategy:

| Environment | Subscription | Rationale |
|---|---|---|
| Dev | ATP-NonProd | Cost optimization, shared resources |
| Test | ATP-NonProd | Cost optimization, shared resources |
| Staging | ATP-Production | Production-like isolation, compliance |
| Production (East US) | ATP-Production | Production isolation, compliance |
| Production (West Europe) | ATP-Production | DR region, same subscription |

Subscription Configuration:

# List subscriptions
az account list --output table

# Set production subscription
az account set --subscription "ATP-Production"

# Set non-production subscription
az account set --subscription "ATP-NonProd"

Regional Deployment Strategy

Primary Region: East US

Primary Region Configuration:

// Primary region: East US
var primaryRegion = new AKSCluster("atp-prod-eus-aks", new()
{
    Location = "eastus",
    ResourceGroupName = "atp-production-rg",
    Environment = "production",
    ClusterSku = "Standard",
    NodePools = new[]
    {
        new NodePoolConfig
        {
            Name = "system",
            VmSize = "Standard_D16s_v3",
            MinCount = 3,
            MaxCount = 5,
        },
        new NodePoolConfig
        {
            Name = "user",
            VmSize = "Standard_D16s_v3",
            MinCount = 10,
            MaxCount = 20,
        },
    },
});

Primary Region Resources:

- ✅ Production AKS cluster
- ✅ Azure SQL Database (Primary)
- ✅ Azure Redis Cache
- ✅ Azure Service Bus
- ✅ Azure Key Vault
- ✅ Azure Container Registry (geo-replicated)

Secondary Region: West Europe

Secondary Region (DR) Configuration:

// Secondary region: West Europe (DR)
var secondaryRegion = new AKSCluster("atp-prod-weu-aks", new()
{
    Location = "westeurope",
    ResourceGroupName = "atp-production-rg",
    Environment = "production",
    ClusterSku = "Standard",
    NodePools = new[]
    {
        new NodePoolConfig
        {
            Name = "system",
            VmSize = "Standard_D16s_v3",
            MinCount = 2,
            MaxCount = 3,
        },
        new NodePoolConfig
        {
            Name = "user",
            VmSize = "Standard_D16s_v3",
            MinCount = 5,
            MaxCount = 10,
        },
    },
});

Secondary Region Resources:

- ✅ Production AKS cluster (standby/DR)
- ✅ Azure SQL Database (Geo-replica)
- ✅ Azure Redis Cache (Geo-replica)
- ✅ Azure Service Bus (DR namespace)
- ✅ Azure Key Vault (Geo-replicated)
- ✅ Azure Container Registry (geo-replicated)

Multi-Region for Production (HA/DR)

Multi-Region Architecture:

graph TB
    subgraph "East US (Primary)"
        PROD_EUS[Production AKS<br/>East US]
        SQL_EUS[SQL Primary]
        REDIS_EUS[Redis Primary]
    end
    subgraph "West Europe (DR)"
        PROD_WEU[Production AKS<br/>West Europe<br/>Standby]
        SQL_WEU[SQL Geo-Replica]
        REDIS_WEU[Redis Geo-Replica]
    end
    subgraph "Traffic Management"
        FD[Azure Front Door]
    end

    FD -->|Primary| PROD_EUS
    FD -.->|Failover| PROD_WEU
    SQL_EUS -.->|Replication| SQL_WEU
    REDIS_EUS -.->|Replication| REDIS_WEU

    style PROD_EUS fill:#90EE90
    style PROD_WEU fill:#FFE5B4
    style FD fill:#FFE5B4

Multi-Region RTO/RPO Targets:

| Component | RTO | RPO | Strategy |
|---|---|---|---|
| AKS Cluster | 1 hour | 5 minutes | GitOps-based recreation |
| SQL Database | 5 minutes | < 1 minute | Active geo-replication |
| Redis Cache | 15 minutes | < 1 minute | Geo-replication |
| Application | 5 minutes | < 1 minute | Traffic failover via Front Door |
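These targets are what a DR drill gets measured against. A minimal sketch (component names and thresholds taken from the table above) checks an observed incident against them:

```python
# Check an observed incident against the RTO/RPO targets above.
# All durations in minutes; keys are illustrative component names.
TARGETS = {
    "aks":   {"rto": 60, "rpo": 5},
    "sql":   {"rto": 5,  "rpo": 1},
    "redis": {"rto": 15, "rpo": 1},
    "app":   {"rto": 5,  "rpo": 1},
}

def meets_targets(component, downtime_min, data_loss_min):
    t = TARGETS[component]
    return downtime_min <= t["rto"] and data_loss_min <= t["rpo"]

# SQL failover took 3 minutes with ~30s of data loss -> within targets.
print(meets_targets("sql", 3, 0.5))   # True
# AKS recreation took 90 minutes -> exceeds the 1-hour RTO.
print(meets_targets("aks", 90, 2))    # False
```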

Regional Failover Mechanisms

Azure Front Door Failover:

# infrastructure/azure-front-door.yaml
apiVersion: networking.azure.com/v1
kind: FrontDoor
metadata:
  name: atp-frontdoor
spec:
  backendPools:
  - name: primary-eus
    backends:
    - address: atp-prod-eus-aks.region.cloudapp.azure.com
      enabled: true
      priority: 1
      weight: 100
    healthProbe:
      path: /health
      protocol: Http
      interval: 30
  - name: secondary-weu
    backends:
    - address: atp-prod-weu-aks.region.cloudapp.azure.com
      enabled: true
      priority: 2
      weight: 0
    healthProbe:
      path: /health
      protocol: Http
      interval: 30
  routingRules:
  - name: failover-rule
    acceptedProtocols:
    - Http
    - Https
    patternsToMatch:
    - "/*"
    routeConfiguration:
      "@odata.type": "#Microsoft.Azure.FrontDoor.Models.FrontdoorForwardingConfiguration"
      forwardingProtocol: MatchRequest
      backendPool:
        id: /subscriptions/{subscriptionId}/resourceGroups/atp-production-rg/providers/Microsoft.Network/frontDoors/atp-frontdoor/backendPools/primary-eus

Kustomize Overlays Per Environment

Base Manifests (Common)

Base Structure:

apps/
├── atp-ingestion/
│   ├── base/
│   │   ├── deployment.yaml
│   │   ├── service.yaml
│   │   ├── configmap.yaml
│   │   └── kustomization.yaml
│   ├── overlays/
│   │   ├── dev/
│   │   │   └── kustomization.yaml
│   │   ├── test/
│   │   │   └── kustomization.yaml
│   │   ├── staging/
│   │   │   └── kustomization.yaml
│   │   └── production/
│   │       └── kustomization.yaml

Base Kustomization:

# apps/atp-ingestion/base/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: atp-production

resources:
- deployment.yaml
- service.yaml
- configmap.yaml

commonLabels:
  app: atp-ingestion
  managed-by: fluxcd

images:
- name: connectsoft.azurecr.io/atp/ingestion
  newTag: v1.2.3-abc123d

Dev Overlay (Minimal Resources, Debug Logging)

Dev Overlay Configuration:

# apps/atp-ingestion/overlays/dev/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: atp-dev

resources:
- ../../base

patchesStrategicMerge:
- deployment-patch.yaml
- configmap-patch.yaml

commonLabels:
  environment: dev

images:
- name: connectsoft.azurecr.io/atp/ingestion
  newTag: latest  # Dev uses latest images

Dev Deployment Patch:

# apps/atp-ingestion/overlays/dev/deployment-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  replicas: 1  # Single replica for dev
  template:
    spec:
      containers:
      - name: atp-ingestion
        resources:
          requests:
            cpu: 100m
            memory: 256Mi
          limits:
            cpu: 500m
            memory: 512Mi
        env:
        - name: ASPNETCORE_ENVIRONMENT
          value: "Development"
        - name: Logging__LogLevel__Default
          value: "Debug"
        - name: Logging__LogLevel__Microsoft
          value: "Debug"

Dev ConfigMap Patch:

# apps/atp-ingestion/overlays/dev/configmap-patch.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: atp-ingestion-config
data:
  telemetry__sampling: "100"  # 100% sampling in dev (".__" separator: ":" is invalid in ConfigMap keys)
  feature-flags__enable-debug-mode: "true"
  feature-flags__enable-profiling: "true"

Test Overlay (Moderate Resources, Integration Tests)

Test Overlay Configuration:

# apps/atp-ingestion/overlays/test/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: atp-test

resources:
- ../../base

patchesStrategicMerge:
- deployment-patch.yaml

commonLabels:
  environment: test

images:
- name: connectsoft.azurecr.io/atp/ingestion
  newTag: v1.2.3  # Test uses tagged releases

Test Deployment Patch:

# apps/atp-ingestion/overlays/test/deployment-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  replicas: 2  # Two replicas for test
  template:
    spec:
      containers:
      - name: atp-ingestion
        resources:
          requests:
            cpu: 200m
            memory: 512Mi
          limits:
            cpu: 1000m
            memory: 1Gi
        env:
        - name: ASPNETCORE_ENVIRONMENT
          value: "Test"
        - name: Logging__LogLevel__Default
          value: "Information"

Staging Overlay (Production-Like)

Staging Overlay Configuration:

# apps/atp-ingestion/overlays/staging/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: atp-staging

resources:
- ../../base

patchesStrategicMerge:
- deployment-patch.yaml

commonLabels:
  environment: staging

images:
- name: connectsoft.azurecr.io/atp/ingestion
  newTag: v1.2.3  # Staging uses production-ready tags

Staging Deployment Patch:

# apps/atp-ingestion/overlays/staging/deployment-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  replicas: 3  # Production-like replica count
  template:
    spec:
      containers:
      - name: atp-ingestion
        resources:
          requests:
            cpu: 500m
            memory: 1Gi
          limits:
            cpu: 2000m
            memory: 2Gi
        env:
        - name: ASPNETCORE_ENVIRONMENT
          value: "Staging"
        - name: Logging__LogLevel__Default
          value: "Warning"
        - name: telemetry__sampling  # ".NET config separator; ":" is invalid in env var names
          valueFrom:
            configMapKeyRef:
              name: atp-ingestion-config
              key: telemetry__sampling

Production Overlay (Optimized, Strict Policies)

Production Overlay Configuration:

# apps/atp-ingestion/overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: atp-production

resources:
- ../../base

patchesStrategicMerge:
- deployment-patch.yaml
- network-policy-patch.yaml

commonLabels:
  environment: production
  compliance: soc2

images:
- name: connectsoft.azurecr.io/atp/ingestion
  newTag: v1.2.3-abc123d  # Production uses immutable tags

Production Deployment Patch:

# apps/atp-ingestion/overlays/production/deployment-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  replicas: 5  # High availability
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
        seccompProfile:
          type: RuntimeDefault
      containers:
      - name: atp-ingestion
        resources:
          requests:
            cpu: 1000m
            memory: 2Gi
          limits:
            cpu: 2000m
            memory: 4Gi
        env:
        - name: ASPNETCORE_ENVIRONMENT
          value: "Production"
        - name: Logging__LogLevel__Default
          value: "Error"  # Minimal logging in production
        - name: telemetry__sampling
          value: "10"  # 10% sampling
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          capabilities:
            drop:
            - ALL
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5

Helm Values Files Per Environment

values-dev.yaml

Dev Helm Values:

# charts/atp-ingestion/values-dev.yaml
replicaCount: 1

image:
  repository: connectsoft.azurecr.io/atp/ingestion
  tag: latest
  pullPolicy: Always

serviceAccount:
  create: true
  annotations:
    azure.workload.identity/client-id: "{dev-workload-identity-id}"

resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi

autoscaling:
  enabled: false  # No autoscaling in dev

environment:
  name: Development
  logging:
    level: Debug
  telemetry:
    sampling: 100  # 100% sampling
  featureFlags:
    enableDebugMode: true
    enableProfiling: true

config:
  database:
    connectionString: "{dev-sql-connection-string}"
  redis:
    connectionString: "{dev-redis-connection-string}"

ingress:
  enabled: true
  className: nginx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-staging  # Staging certs in dev
  hosts:
  - host: atp-ingestion-dev.connectsoft.example
    paths:
    - path: /
      pathType: Prefix
  tls:
  - secretName: atp-ingestion-dev-tls
    hosts:
    - atp-ingestion-dev.connectsoft.example

values-test.yaml

Test Helm Values:

# charts/atp-ingestion/values-test.yaml
replicaCount: 2

image:
  repository: connectsoft.azurecr.io/atp/ingestion
  tag: v1.2.3
  pullPolicy: IfNotPresent

resources:
  requests:
    cpu: 200m
    memory: 512Mi
  limits:
    cpu: 1000m
    memory: 1Gi

autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 4
  targetCPUUtilizationPercentage: 70

environment:
  name: Test
  logging:
    level: Information
  telemetry:
    sampling: 50  # 50% sampling

config:
  database:
    connectionString: "{test-sql-connection-string}"

values-staging.yaml

Staging Helm Values:

# charts/atp-ingestion/values-staging.yaml
replicaCount: 3

image:
  repository: connectsoft.azurecr.io/atp/ingestion
  tag: v1.2.3
  pullPolicy: IfNotPresent

resources:
  requests:
    cpu: 500m
    memory: 1Gi
  limits:
    cpu: 2000m
    memory: 2Gi

autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 6
  targetCPUUtilizationPercentage: 70

environment:
  name: Staging
  logging:
    level: Warning
  telemetry:
    sampling: 25  # 25% sampling

config:
  database:
    connectionString: "{staging-sql-connection-string}"

values-production.yaml

Production Helm Values:

# charts/atp-ingestion/values-production.yaml
replicaCount: 5

image:
  repository: connectsoft.azurecr.io/atp/ingestion
  tag: v1.2.3-abc123d  # Immutable tag
  pullPolicy: IfNotPresent

serviceAccount:
  create: true
  annotations:
    azure.workload.identity/client-id: "{prod-workload-identity-id}"

resources:
  requests:
    cpu: 1000m
    memory: 2Gi
  limits:
    cpu: 2000m
    memory: 4Gi

autoscaling:
  enabled: true
  minReplicas: 5
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
  targetMemoryUtilizationPercentage: 80

podSecurityContext:
  runAsNonRoot: true
  runAsUser: 1000
  fsGroup: 1000
  seccompProfile:
    type: RuntimeDefault

securityContext:
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop:
    - ALL

environment:
  name: Production
  logging:
    level: Error  # Minimal logging
  telemetry:
    sampling: 10  # 10% sampling
  featureFlags:
    enableDebugMode: false
    enableProfiling: false

config:
  database:
    connectionString: "{prod-sql-connection-string}"
  redis:
    connectionString: "{prod-redis-connection-string}"

ingress:
  enabled: true
  className: nginx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/rate-limit: "100"
  hosts:
  - host: atp-ingestion.connectsoft.example
    paths:
    - path: /
      pathType: Prefix
  tls:
  - secretName: atp-ingestion-tls
    hosts:
    - atp-ingestion.connectsoft.example

networkPolicy:
  enabled: true
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: atp-gateway

podDisruptionBudget:
  enabled: true
  minAvailable: 3

Value Precedence and Overrides

Helm Value Precedence (Highest to Lowest):

  1. --set command-line flags
  2. values-production.yaml (or environment-specific)
  3. values.yaml (base/default values)
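Helm applies these layers by deep-merging maps, with later sources overriding earlier ones key-by-key (scalars and lists are replaced wholesale). A small sketch of that merge, using illustrative values mirroring this chart:

```python
# Simulate Helm's value layering: later sources override earlier ones,
# merging nested maps key-by-key.
def deep_merge(base, override):
    out = dict(base)
    for k, v in override.items():
        if isinstance(v, dict) and isinstance(out.get(k), dict):
            out[k] = deep_merge(out[k], v)
        else:
            out[k] = v
    return out

values_yaml = {"replicaCount": 2, "environment": {"logging": {"level": "Information"}}}
values_production = {"replicaCount": 5, "environment": {"logging": {"level": "Error"}}}
set_flags = {"replicaCount": 10}  # --set replicaCount=10

merged = deep_merge(deep_merge(values_yaml, values_production), set_flags)
print(merged["replicaCount"])                     # 10 (--set wins)
print(merged["environment"]["logging"]["level"])  # Error (env file beats base)
```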

Deploy with Environment-Specific Values:

# Deploy to dev
helm upgrade --install atp-ingestion ./charts/atp-ingestion \
  -f charts/atp-ingestion/values.yaml \
  -f charts/atp-ingestion/values-dev.yaml \
  -n atp-dev

# Deploy to production
helm upgrade --install atp-ingestion ./charts/atp-ingestion \
  -f charts/atp-ingestion/values.yaml \
  -f charts/atp-ingestion/values-production.yaml \
  -n atp-production

# Override specific value
helm upgrade --install atp-ingestion ./charts/atp-ingestion \
  -f charts/atp-ingestion/values-production.yaml \
  --set replicaCount=10 \
  -n atp-production

FluxCD Configuration Per Environment

GitRepository per Environment (Branch Targeting)

Dev GitRepository:

# clusters/dev/gitrepository.yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: atp-gitops-dev
  namespace: flux-system
spec:
  interval: 30s  # Fast polling for dev
  url: https://dev.azure.com/ConnectSoft/ATP/_git/atp-gitops
  ref:
    branch: dev  # Dev branch
  secretRef:
    name: gitops-credentials
  ignore: |
    # .gitignore-format: do not reconcile other environments' paths
    /production/
    /staging/
    /test/

Test GitRepository:

# clusters/test/gitrepository.yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: atp-gitops-test
  namespace: flux-system
spec:
  interval: 1m
  url: https://dev.azure.com/ConnectSoft/ATP/_git/atp-gitops
  ref:
    branch: test
  secretRef:
    name: gitops-credentials

Production GitRepository:

# clusters/production/gitrepository.yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: atp-gitops-production
  namespace: flux-system
spec:
  interval: 5m  # Slower polling for production
  url: https://dev.azure.com/ConnectSoft/ATP/_git/atp-gitops
  ref:
    branch: production  # Production branch
  secretRef:
    name: gitops-credentials
  ignore: |
    # .gitignore-format: do not reconcile other environments' paths
    /dev/
    /test/
    /staging/

Kustomization per Environment

Dev Kustomization:

# clusters/dev/kustomization-apps.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-dev
  namespace: flux-system
spec:
  interval: 1m
  path: ./apps
  prune: true  # Auto-prune in dev
  wait: false  # Don't wait for readiness in dev
  timeout: 5m
  sourceRef:
    kind: GitRepository
    name: atp-gitops-dev

Production Kustomization:

# clusters/production/kustomization-apps.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-production
  namespace: flux-system
spec:
  interval: 10m
  path: ./apps
  prune: false  # Manual pruning only in production
  wait: true  # Wait for readiness
  timeout: 20m
  retryInterval: 5m
  sourceRef:
    kind: GitRepository
    name: atp-gitops-production
  dependsOn:
  - name: infrastructure
  - name: secrets
  healthChecks:
  - apiVersion: apps/v1
    kind: Deployment
    name: atp-gateway
    namespace: atp-production

Sync Policies per Environment

Environment Sync Policy Matrix:

| Environment | Auto-Sync | Prune | Wait | Timeout | Manual Approval |
|---|---|---|---|---|---|
| Dev | ✅ Yes | ✅ Yes | ❌ No | 5m | ❌ No |
| Test | ✅ Yes | ✅ Yes | ✅ Yes | 10m | ❌ No |
| Staging | ⚠️ Selective | ❌ No | ✅ Yes | 15m | ✅ Yes (1 approver) |
| Production | ❌ No | ❌ No | ✅ Yes | 20m | ✅ Yes (2 approvers, CAB) |
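The approval column in this matrix is what the release gate enforces. A sketch of that gate, with the matrix encoded as data (staging's "selective" auto-sync is simplified here to manual):

```python
# Encode the sync-policy matrix as data and gate a deployment on it.
POLICIES = {
    "dev":        {"auto_sync": True,  "prune": True,  "approvers_required": 0},
    "test":       {"auto_sync": True,  "prune": True,  "approvers_required": 0},
    "staging":    {"auto_sync": False, "prune": False, "approvers_required": 1},
    "production": {"auto_sync": False, "prune": False, "approvers_required": 2},
}

def can_deploy(env: str, approvals: int) -> bool:
    return approvals >= POLICIES[env]["approvers_required"]

print(can_deploy("dev", 0))         # True: auto-sync, no approval needed
print(can_deploy("production", 1))  # False: production requires 2 approvers
```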

Environment-Specific Reconciliation Settings

Production Reconciliation Settings:

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-production
spec:
  interval: 10m
  retryInterval: 5m
  timeout: 20m
  suspend: false  # Reconciliation enabled
  path: ./apps
  prune: false  # Never auto-prune
  wait: true  # Wait for health checks
  sourceRef:
    kind: GitRepository
    name: atp-gitops-production
  force: false  # Never delete-and-recreate resources in production

Environment-Specific Configurations

Log Levels (Debug → Error)

Environment Log Levels:

| Environment | Default Level | Microsoft Level | Log Retention |
|---|---|---|---|
| Dev | Debug | Debug | 7 days |
| Test | Information | Information | 30 days |
| Staging | Warning | Warning | 90 days |
| Production | Error | Error | 365 days |
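The `Logging__LogLevel__*` keys resolve by category prefix: the longest matching category prefix wins, falling back to `Default`. A simplified sketch of that resolution (.NET matches on dotted-name prefixes; this version uses plain string prefixes):

```python
# Resolve the effective log level for a category the way
# Logging__LogLevel__* settings do: longest matching prefix wins.
def effective_level(category: str, levels: dict) -> str:
    best, best_len = levels.get("Default"), -1
    for prefix, level in levels.items():
        if prefix != "Default" and category.startswith(prefix) and len(prefix) > best_len:
            best, best_len = level, len(prefix)
    return best

prod = {"Default": "Error", "Microsoft": "Error"}
dev = {"Default": "Debug", "Microsoft": "Debug"}

print(effective_level("Microsoft.AspNetCore.Hosting", prod))  # Error ("Microsoft" matches)
print(effective_level("Atp.Ingestion.Pipeline", dev))         # Debug (falls back to Default)
```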

Log Level Configuration:

# apps/atp-ingestion/base/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: atp-ingestion-config
data:
  Logging__LogLevel__Default: "Information"  # Base level
  Logging__LogLevel__Microsoft: "Warning"
  Logging__LogLevel__System: "Error"
---
# Production overlay patch
apiVersion: v1
kind: ConfigMap
metadata:
  name: atp-ingestion-config
data:
  Logging__LogLevel__Default: "Error"  # Override for production
  Logging__LogLevel__Microsoft: "Error"

Telemetry Sampling (100% → 10%)

Telemetry Sampling Rates:

| Environment | Sampling Rate | Rationale |
|---|---|---|
| Dev | 100% | Full visibility for debugging |
| Test | 50% | Balance between visibility and cost |
| Staging | 25% | Production-like, reduced cost |
| Production | 10% | Cost optimization, sufficient insights |
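Percentage sampling should be deterministic on the trace/operation id so that every service in a call chain makes the same keep-or-drop decision (the idea behind Application Insights fixed-rate sampling). A hash-based sketch:

```python
# Deterministic percentage sampling: hash the operation id into a bucket so
# all services in a distributed trace agree on whether to keep it.
import hashlib

def sampled(operation_id: str, rate_percent: float) -> bool:
    digest = hashlib.sha256(operation_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10000  # 0..9999
    return bucket < rate_percent * 100

# At 10% sampling, roughly one in ten operations is kept, and the decision
# for a given operation id never changes.
kept = sum(sampled(f"op-{i}", 10) for i in range(10_000))
print(f"kept {kept} of 10000 at 10% sampling")
```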

Telemetry Configuration:

# Production telemetry settings
env:
- name: telemetry__sampling
  value: "10"  # 10% sampling
- name: APPLICATIONINSIGHTS_SAMPLING_PERCENTAGE
  value: "10"

Feature Flags per Environment

Feature Flag Configuration:

# Dev feature flags
featureFlags:
  enableDebugMode: true
  enableProfiling: true
  enableDetailedMetrics: true
  enableExperimentalFeatures: true

# Production feature flags
featureFlags:
  enableDebugMode: false
  enableProfiling: false
  enableDetailedMetrics: false
  enableExperimentalFeatures: false
  enableMaintenanceMode: false

Database Connection Strings

Environment-Specific Database Connections:

# Dev database
env:
- name: ConnectionStrings__DefaultConnection
  valueFrom:
    secretKeyRef:
      name: sql-connection-string
      key: connection-string
---
# ExternalSecret for dev
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: sql-connection-string
  namespace: atp-dev
spec:
  secretStoreRef:
    name: azure-keyvault-dev
    kind: ClusterSecretStore
  data:
  - secretKey: connectionString
    remoteRef:
      key: connection-strings/atp-ingestion/sql-connection-string

External Service Endpoints

Environment-Specific Endpoints:

# Dev endpoints
config:
  externalServices:
    paymentGateway: "https://api.stripe.com/test"
    emailService: "https://api.sendgrid.com/v3/test"
    storageAccount: "https://atpdevstorage.blob.core.windows.net"

# Production endpoints
config:
  externalServices:
    paymentGateway: "https://api.stripe.com"
    emailService: "https://api.sendgrid.com/v3"
    storageAccount: "https://atpprodstorage.blob.core.windows.net"

Resource Quotas and Limits

Namespace-Level Quotas

Dev Namespace Quota:

# platform/resource-quotas/dev-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: atp-dev-quota
  namespace: atp-dev
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    persistentvolumeclaims: "5"
    pods: "20"
    services: "10"

Production Namespace Quota:

# platform/resource-quotas/production-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: atp-production-quota
  namespace: atp-production
spec:
  hard:
    requests.cpu: "100"
    requests.memory: 200Gi
    limits.cpu: "200"
    limits.memory: 400Gi
    persistentvolumeclaims: "50"
    pods: "200"
    services: "50"

CPU and Memory Limits per Environment

Resource Limit Matrix:

| Environment | CPU Request | CPU Limit | Memory Request | Memory Limit |
|---|---|---|---|---|
| Dev | 100m | 500m | 256Mi | 512Mi |
| Test | 200m | 1000m | 512Mi | 1Gi |
| Staging | 500m | 2000m | 1Gi | 2Gi |
| Production | 1000m | 2000m | 2Gi | 4Gi |
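The per-pod requests above combine with the namespace quotas to bound how many pods can actually schedule. A quick sketch of that arithmetic (the binding constraint is whichever resource runs out first; the separate `pods` quota may cap it lower still):

```python
# How many pods at a given request size fit in a namespace ResourceQuota?
def pods_that_fit(quota_cpu_m, quota_mem_mi, req_cpu_m, req_mem_mi):
    return min(quota_cpu_m // req_cpu_m, quota_mem_mi // req_mem_mi)

# Production: quota 100 CPU / 200Gi, requests 1000m / 2Gi per pod.
print(pods_that_fit(100_000, 204_800, 1_000, 2_048))  # 100
# Dev: quota 4 CPU / 8Gi, requests 100m / 256Mi per pod.
print(pods_that_fit(4_000, 8_192, 100, 256))          # 32 by resources (pods quota caps dev at 20)
```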

Storage Quotas

Storage Quota per Environment:

# Production storage quota
apiVersion: v1
kind: ResourceQuota
metadata:
  name: atp-production-storage-quota
  namespace: atp-production
spec:
  hard:
    requests.storage: 500Gi
    persistentvolumeclaims: "50"

Pod Count Limits

Pod Count Limits:

| Environment | Max Pods | Rationale |
|---|---|---|
| Dev | 20 | Minimal footprint |
| Test | 50 | Integration testing needs |
| Staging | 100 | Production-like scale |
| Production | 200 | High availability, scale |

HPA Configuration Per Environment

Min/Max Replicas per Environment

HPA Configuration Matrix:

| Environment | Min Replicas | Max Replicas | Target CPU | Target Memory |
|---|---|---|---|---|
| Dev | 1 | 2 | 70% | 80% |
| Test | 2 | 4 | 70% | 80% |
| Staging | 3 | 6 | 70% | 80% |
| Production | 5 | 10 | 70% | 80% |
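The HPA derives its desired replica count from the documented core formula, `desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)`, clamped to the min/max bounds in this matrix:

```python
import math

# The HPA core formula, clamped to an environment's min/max replicas.
def desired_replicas(current, current_util, target_util, min_r, max_r):
    desired = math.ceil(current * current_util / target_util)
    return max(min_r, min(max_r, desired))

# Production (5..10 replicas, 70% CPU target):
print(desired_replicas(5, 95, 70, 5, 10))  # 7: CPU at 95% vs 70% target
print(desired_replicas(5, 30, 70, 5, 10))  # 5: result clamped at minReplicas
```

With multiple metrics (CPU and memory here), the HPA computes this per metric and takes the largest result.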

Production HPA:

# apps/atp-ingestion/overlays/production/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: atp-ingestion-hpa
  namespace: atp-production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: atp-ingestion
  minReplicas: 5
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60
      - type: Pods
        value: 2
        periodSeconds: 60
      selectPolicy: Max

Scaling Thresholds (CPU, Memory)

Scaling Thresholds:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: atp-ingestion-hpa
spec:
  metrics:
  # CPU-based scaling
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70  # Scale up when CPU > 70%
  # Memory-based scaling
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80  # Scale up when memory > 80%

Custom Metrics with KEDA

KEDA ScaledObject:

# apps/atp-ingestion/overlays/production/keda-scaler.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: atp-ingestion-scaler
  namespace: atp-production
spec:
  scaleTargetRef:
    name: atp-ingestion
  minReplicaCount: 5
  maxReplicaCount: 10
  triggers:
  # CPU-based scaling
  - type: cpu
    metadata:
      type: Utilization
      value: "70"
  # Memory-based scaling
  - type: memory
    metadata:
      type: Utilization
      value: "80"
  # RabbitMQ queue length
  - type: rabbitmq
    metadata:
      queueName: audit-events
      queueLength: "100"
      host: "amqp://rabbitmq.atp-production:5672"
  # HTTP requests per second
  - type: prometheus
    metadata:
      serverAddress: "http://prometheus.monitoring:9090"
      metricName: http_requests_per_second
      threshold: "100"
      query: "sum(rate(http_requests_total[1m]))"

Scale-to-Zero in Dev

Dev Scale-to-Zero:

# apps/atp-ingestion/overlays/dev/keda-scaler.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: atp-ingestion-scaler
  namespace: atp-dev
spec:
  scaleTargetRef:
    name: atp-ingestion
  minReplicaCount: 0  # Scale to zero when idle
  maxReplicaCount: 2
  idleReplicaCount: 0  # Scale down to zero after inactivity
  triggers:
  # NOTE: scaling on HTTP traffic requires the KEDA HTTP add-on
  # (HTTPScaledObject); with core KEDA, activate on observed request
  # rate from Prometheus instead:
  - type: prometheus
    metadata:
      serverAddress: "http://prometheus.monitoring:9090"
      query: 'sum(rate(http_requests_total{namespace="atp-dev"}[1m]))'
      threshold: "1"
      activationThreshold: "1"
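Scale-to-zero hinges on KEDA's activation phase: while the activation metric stays at or below the activation threshold the workload sits at zero replicas; once it crosses, KEDA activates the deployment and normal HPA scaling takes over. A simplified sketch of that decision:

```python
# Simplified sketch of KEDA scale-to-zero activation semantics:
# below/at the activation threshold -> 0 replicas; above it -> at least 1,
# with the HPA scaling between 1 and maxReplicaCount.
def replica_floor(metric_value: float, activation_threshold: float) -> int:
    if metric_value <= activation_threshold:
        return 0  # idle: deployment scaled to zero
    return 1      # activated: HPA takes over from here

print(replica_floor(0, 1))  # 0: no traffic, stay at zero
print(replica_floor(5, 1))  # 1: traffic arrived, wake the deployment
```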

Multi-Region Traffic Routing

Azure Front Door Configuration

Azure Front Door Setup:

# infrastructure/azure-front-door.yaml
apiVersion: networking.azure.com/v1
kind: FrontDoor
metadata:
  name: atp-frontdoor
spec:
  resourceGroupName: atp-production-rg
  location: global
  frontendEndpoints:
  - name: atp-frontend
    hostName: atp.connectsoft.example
    sessionAffinityEnabledState: Enabled
    sessionAffinityTtlSeconds: 0
  backendPools:
  - name: primary-eus
    loadBalancingSettings:
      name: default
    healthProbeSettings:
      name: default
    backends:
    - address: atp-prod-eus-aks.region.cloudapp.azure.com
      enabled: true
      priority: 1
      weight: 100
      httpPort: 80
      httpsPort: 443
  - name: secondary-weu
    backends:
    - address: atp-prod-weu-aks.region.cloudapp.azure.com
      enabled: true
      priority: 2
      weight: 0  # Standby
      httpPort: 80
      httpsPort: 443
  routingRules:
  - name: failover-rule
    acceptedProtocols:
    - Http
    - Https
    patternsToMatch:
    - "/*"
    routeConfiguration:
      forwardingConfiguration:
        forwardingProtocol: MatchRequest
        backendPool:
          id: primary-eus
        cacheConfiguration:
          queryParameterStripDirective: StripAll
          dynamicCompression: Enabled
    frontendEndpoints:
    - atp-frontend

Traffic Manager for DNS-Based Routing

Traffic Manager Configuration:

# infrastructure/traffic-manager.yaml
apiVersion: network.azure.com/v1
kind: TrafficManagerProfile
metadata:
  name: atp-traffic-manager
spec:
  resourceGroupName: atp-production-rg
  location: global
  profileStatus: Enabled
  trafficRoutingMethod: Priority  # Failover routing
  dnsConfig:
    relativeName: atp-connectsoft
    ttl: 60
  monitorConfig:
    protocol: Https
    port: 443
    path: /health
    intervalInSeconds: 30
    timeoutInSeconds: 10
    toleratedNumberOfFailures: 3
  endpoints:
  - name: primary-eus
    target: atp-prod-eus-aks.region.cloudapp.azure.com
    type: ExternalEndpoints
    priority: 1
    weight: 100
    enabled: true
  - name: secondary-weu
    target: atp-prod-weu-aks.region.cloudapp.azure.com
    type: ExternalEndpoints
    priority: 2
    weight: 0
    enabled: true

Regional Failover Policies

Failover Policy Configuration:

# Front Door failover rules
routingRules:
- name: failover-rule
  acceptedProtocols:
  - Http
  - Https
  routeConfiguration:
    forwardingConfiguration:
      backendPool:
        id: primary-eus
      # Failover to secondary if primary unhealthy
      loadBalancingSettings:
        sampleSize: 4
        successfulSamplesRequired: 3

Health Probe Configuration:

healthProbeSettings:
- name: default
  path: /health
  protocol: Https
  intervalInSeconds: 30
  enabledState: Enabled

Health Probe Configuration

Health Probe Setup:

# Health probe for Front Door
healthProbeSettings:
- name: atp-health-probe
  path: /health/live
  protocol: Https
  intervalInSeconds: 30
  timeoutInSeconds: 10
  unhealthyThreshold: 3
  enabledState: Enabled
  healthProbeMethod: Head

Application Health Endpoint:

// Health check endpoint for multi-region routing
[ApiController]
[Route("health")]
public class HealthController : ControllerBase
{
    private readonly HealthCheckService _healthCheck;

    public HealthController(HealthCheckService healthCheck)
    {
        _healthCheck = healthCheck;
    }

    [HttpGet("live")]
    public async Task<IActionResult> Liveness()
    {
        var result = await _healthCheck.CheckHealthAsync();
        return result.Status == HealthStatus.Healthy
            ? Ok()
            : StatusCode(503);
    }

    [HttpGet("ready")]
    public async Task<IActionResult> Readiness()
    {
        // Check critical dependencies only (checks tagged "ready")
        var result = await _healthCheck.CheckHealthAsync(
            predicate: check => check.Tags.Contains("ready"));
        return result.Status == HealthStatus.Healthy
            ? Ok()
            : StatusCode(503);
    }
}

Cross-Environment Dependencies

Shared Services (Monitoring, Logging)

Shared Monitoring Stack:

# Shared monitoring namespace (single instance for all environments)
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
  labels:
    shared: "true"
---
# Prometheus (shared across environments)
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: monitoring
spec:
  selector:
    app: prometheus
  ports:
  - port: 9090
    name: http

Cross-Environment Service Access:

# ExternalName alias created in the consuming (production) namespace,
# pointing at the shared monitoring stack
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: atp-production
spec:
  type: ExternalName
  externalName: prometheus.monitoring.svc.cluster.local

Service Discovery Across Environments

Multi-Cluster Service Discovery:

# Service export (if using service mesh)
apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ServiceExport
metadata:
  name: atp-gateway
  namespace: atp-production
---
# Service import
apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ServiceImport
metadata:
  name: atp-gateway
  namespace: atp-staging
spec:
  type: ClusterSetIP
  ports:
  - port: 8080
    protocol: TCP

VNet Peering (If Needed)

VNet Peering for Cross-Environment Access:

// VNet peering between environments (if required)
var devTestPeering = new Network.VirtualNetworkPeering("dev-test-peering", new()
{
    ResourceGroupName = "atp-nonprod-rg",
    VirtualNetworkName = "atp-dev-vnet",
    RemoteVirtualNetworkId = testVNet.Id,
    AllowForwardedTraffic = true,
    AllowGatewayTransit = false,
    UseRemoteGateways = false,
});

VNet Peering Policy:

| Environment Pair | Peering | Rationale |
|---|---|---|
| Dev ↔ Test | ⚠️ Optional | Shared resources, cost optimization |
| Staging ↔ Production | ❌ No | Security isolation required |
| Production EUS ↔ Production WEU | ✅ Yes | Multi-region HA/DR |
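This policy table can be enforced in a pre-merge check on peering manifests. A sketch encoding the table as data (environment identifiers here are illustrative; anything not explicitly listed defaults to forbidden):

```python
# Enforce the VNet peering policy: unlisted pairs default to forbidden.
ALLOWED = {
    frozenset({"dev", "test"}): "optional",
    frozenset({"prod-eus", "prod-weu"}): "required",
}

def peering_policy(env_a: str, env_b: str) -> str:
    return ALLOWED.get(frozenset({env_a, env_b}), "forbidden")

print(peering_policy("dev", "test"))            # optional
print(peering_policy("staging", "prod-eus"))    # forbidden: isolation required
print(peering_policy("prod-eus", "prod-weu"))   # required: multi-region HA/DR
```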

Summary: Multi-Environment AKS Deployment

  • Environment-Specific AKS Clusters: Separate clusters per environment with rationale, cluster sizing/SKU selection, networking configuration, subscription strategy
  • Regional Deployment Strategy: Primary region (East US), secondary region (West Europe), multi-region HA/DR, regional failover mechanisms
  • Kustomize Overlays: Base manifests, dev overlay (minimal resources, debug logging), test overlay, staging overlay (production-like), production overlay (optimized, strict policies)
  • Helm Values Files: values-dev.yaml, values-test.yaml, values-staging.yaml, values-production.yaml, value precedence and overrides
  • FluxCD Configuration: GitRepository per environment (branch targeting), Kustomization per environment, sync policies per environment, environment-specific reconciliation settings
  • Environment-Specific Configurations: Log levels, telemetry sampling rates, feature flags, database connection strings, external service endpoints
  • Resource Quotas: Namespace-level quotas, CPU/memory limits per environment, storage quotas, pod count limits
  • HPA Configuration: Min/max replicas per environment, scaling thresholds, custom metrics with KEDA, scale-to-zero in dev
  • Multi-Region Traffic Routing: Azure Front Door configuration, Traffic Manager for DNS-based routing, regional failover policies, health probe configuration
  • Cross-Environment Dependencies: Shared services (monitoring, logging), service discovery across environments, VNet peering if needed

Azure Monitor Integration & Observability

Purpose: Define how Azure Monitor, Log Analytics, and Grafana are integrated with ATP GitOps workflows to provide comprehensive observability, monitoring, alerting, and compliance evidence collection, ensuring complete visibility into deployment health, FluxCD operations, and application performance across all environments.


Azure Monitor Container Insights

Enabling Container Insights on AKS

Enable Container Insights:

# Enable Container Insights on AKS cluster
az aks enable-addons \
  --resource-group atp-production-rg \
  --name atp-prod-eus-aks \
  --addons monitoring \
  --workspace-resource-id /subscriptions/{subscriptionId}/resourceGroups/atp-production-rg/providers/Microsoft.OperationalInsights/workspaces/atp-prod-loganalytics

Container Insights via Pulumi:

// Enable Container Insights
var logAnalyticsWorkspace = new OperationalInsights.Workspace("atp-prod-loganalytics", new()
{
    ResourceGroupName = "atp-production-rg",
    Location = "eastus",
    Sku = new OperationalInsights.Inputs.WorkspaceSkuArgs
    {
        Name = "PerGB2018",
    },
    RetentionInDays = environment == "production" ? 2555 : 30, // 7 years for production
    Tags = new()
    {
        { "Environment", environment },
        { "Retention", environment == "production" ? "7years" : "30days" },
    },
});

// The Container Insights addon itself is enabled on the AKS resource via its
// addon (OMS agent) profile, or with the az CLI command shown above, passing
// this workspace's resource ID (logAnalyticsWorkspace.Id).

Verify Container Insights:

# Check Container Insights status
az aks show \
  --resource-group atp-production-rg \
  --name atp-prod-eus-aks \
  --query addonProfiles.omsagent

# Check OMS agent pods
kubectl get pods -n kube-system | grep omsagent

Metrics Collection and Aggregation

Container Insights Metrics:

| Metric Category | Examples | Collection Interval |
|---|---|---|
| Node Metrics | CPU, Memory, Disk I/O, Network | 60s |
| Pod Metrics | CPU, Memory, Restart count | 60s |
| Container Metrics | CPU, Memory per container | 60s |
| Controller Metrics | Replica count, Ready replicas | 60s |
| Workload Metrics | Deployment, StatefulSet status | 60s |

Key Metrics Collected:

// Node metrics
InsightsMetrics
| where Origin == "container.azm.ms"
| where Namespace == "insights-metrics"
| where Name == "cpuUsageNanoCores"
| summarize avg(Val) by Computer, bin(TimeGenerated, 1m)

// Pod metrics
InsightsMetrics
| where Name == "podCpuUsageNanoCores"
| summarize avg(Val) by Computer, bin(TimeGenerated, 1m)

// Container restart count (restart counts live in the pod inventory table,
// not ContainerLog)
KubePodInventory
| where ContainerRestartCount > 0
| project TimeGenerated, Computer, ContainerName, ContainerRestartCount

Log Analytics Workspace Configuration

Workspace Configuration:

// Log Analytics Workspace with long retention for production
var logAnalyticsWorkspace = new OperationalInsights.Workspace("atp-prod-loganalytics", new()
{
    ResourceGroupName = "atp-production-rg",
    Location = "eastus",
    Sku = new OperationalInsights.Inputs.WorkspaceSkuArgs
    {
        Name = "PerGB2018",  // Pay-as-you-go
    },
    RetentionInDays = 2555, // 7 years for compliance; note that interactive
                            // retention caps at 730 days — longer retention relies
                            // on table-level archive (total retention) or data export
    WorkspaceCapping = new OperationalInsights.Inputs.WorkspaceCappingArgs
    {
        DailyQuotaGb = -1, // No daily ingestion cap
    },
    Tags = new()
    {
        { "Environment", "production" },
        { "Retention", "7years" },
        { "Compliance", "SOC2" },
    },
});

Workspace Strategy:

| Environment | Workspace per Environment | Shared Workspace |
|---|---|---|
| Dev/Test | ⚠️ Optional | ✅ Recommended (cost optimization) |
| Staging | ✅ Separate workspace | ⚠️ Optional |
| Production | ✅ Separate workspace | ❌ Not recommended |

ATP Workspace Strategy:

- Dev/Test: Shared atp-nonprod-loganalytics workspace
- Staging: Separate atp-staging-loganalytics workspace
- Production: Separate atp-prod-loganalytics workspace (7-year retention)

Cost Optimization (Sampling, Retention)

Cost Optimization Strategies:

| Strategy | Configuration | Impact |
|---|---|---|
| Log Sampling | 10% sampling in production | ✅ ~90% ingestion cost reduction |
| Metric Aggregation | 5-minute aggregation | ✅ Reduced data volume |
| Retention Tiers | 7 years (prod), 30 days (dev) | ✅ Cost-optimized retention |
| Data Export | Archive to Blob Storage | ✅ Cheaper long-term storage |
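The sampling and retention levers above compound: sampling cuts ingested volume, and retention determines how long each GB is billed. A rough Python sketch of the arithmetic, assuming a caller-supplied `price_per_gb` rather than any real Azure list price:

```python
def monthly_ingestion_gb(daily_gb: float, sampling_percentage: float) -> float:
    """Approximate monthly ingested volume after fixed-rate sampling."""
    return daily_gb * (sampling_percentage / 100.0) * 30

def estimated_monthly_cost(daily_gb: float, sampling_percentage: float,
                           price_per_gb: float) -> float:
    """Rough cost model: ingested volume times a caller-supplied price."""
    return monthly_ingestion_gb(daily_gb, sampling_percentage) * price_per_gb

# 100 GB/day at 10% sampling -> ~300 GB/month ingested
full = estimated_monthly_cost(100, 100, price_per_gb=1.0)
sampled = estimated_monthly_cost(100, 10, price_per_gb=1.0)
reduction = 1 - sampled / full  # ~0.9, i.e. ~90% ingestion cost reduction
```

This is only the ingestion side; long-retention (archive/export) storage is billed separately and is not modeled here.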

Production Log Sampling:

# Application Insights sampling (10% in production)
env:
- name: APPLICATIONINSIGHTS_SAMPLING_PERCENTAGE
  value: "10"

# Or via ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: appinsights-config
data:
  samplingPercentage: "10"
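
Fixed-rate sampling is typically decided deterministically from the operation (trace) id, so an entire request's telemetry is kept or dropped together. A hedged Python sketch of that idea (illustrative only, not the exact Application Insights algorithm):

```python
import hashlib

def keep_trace(operation_id: str, sampling_percentage: float) -> bool:
    """Hash-based decision: the same operation id always gets the same verdict,
    so all telemetry items for one request are sampled consistently."""
    digest = hashlib.sha256(operation_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket * 100 < sampling_percentage

# Over many traces, roughly sampling_percentage percent are kept
kept = sum(keep_trace(f"op-{i}", 10) for i in range(10_000))
```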

Log Retention Configuration:

// Production: 7-year retention
var prodWorkspace = new OperationalInsights.Workspace("atp-prod-loganalytics", new()
{
    RetentionInDays = 2555, // 7 years
});

// Dev/Test: 30-day retention
var nonprodWorkspace = new OperationalInsights.Workspace("atp-nonprod-loganalytics", new()
{
    RetentionInDays = 30,
});

Log Analytics Workspace

Workspace per Environment or Shared

Workspace Organization:

graph TB
    subgraph "Production Subscription"
        PROD_WS[atp-prod-loganalytics<br/>7-year retention]
        STAGING_WS[atp-staging-loganalytics<br/>90-day retention]
    end
    subgraph "Non-Prod Subscription"
        NONPROD_WS[atp-nonprod-loganalytics<br/>30-day retention]
    end

    PROD_EUS[Production AKS<br/>East US] --> PROD_WS
    PROD_WEU[Production AKS<br/>West Europe] --> PROD_WS
    STAGING[Staging AKS] --> STAGING_WS
    DEV[Dev AKS] --> NONPROD_WS
    TEST[Test AKS] --> NONPROD_WS

    style PROD_WS fill:#90EE90
    style STAGING_WS fill:#FFE5B4
    style NONPROD_WS fill:#FFE5B4

Workspace Matrix:

| Environment | Workspace Name | Retention | Data Sources |
|---|---|---|---|
| Dev/Test | atp-nonprod-loganalytics | 30 days | Dev AKS, Test AKS |
| Staging | atp-staging-loganalytics | 90 days | Staging AKS |
| Production | atp-prod-loganalytics | 7 years (2555 days) | Production AKS (EUS, WEU) |
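The matrix above can be encoded as a small lookup, for example when generating per-environment Pulumi stack config. A minimal sketch (environment names are the ATP ones assumed throughout this document):

```python
# Retention in days per ATP environment, matching the workspace matrix above
RETENTION_DAYS = {
    "dev": 30,
    "test": 30,
    "staging": 90,
    "production": 7 * 365,  # 2555 days, ~7 years
}

def retention_for(environment: str) -> int:
    """Look up the Log Analytics retention (in days) for an ATP environment."""
    try:
        return RETENTION_DAYS[environment]
    except KeyError:
        raise ValueError(f"unknown environment: {environment}")
```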

Log Retention Policies

Retention Policy Configuration:

// Log Analytics Workspace with retention
var logAnalyticsWorkspace = new OperationalInsights.Workspace("atp-prod-loganalytics", new()
{
    ResourceGroupName = "atp-production-rg",
    Location = "eastus",
    RetentionInDays = 2555, // 7 years for compliance
    WorkspaceCapping = new OperationalInsights.Inputs.WorkspaceCappingArgs
    {
        DailyQuotaGb = -1, // No daily ingestion cap
    },
    PublicNetworkAccessForIngestion = "Enabled",
    PublicNetworkAccessForQuery = "Enabled",
});

// Data export to Blob Storage for long-term archival
var dataExport = new OperationalInsights.DataExport("atp-prod-export", new()
{
    ResourceGroupName = "atp-production-rg",
    WorkspaceName = logAnalyticsWorkspace.Name,
    TableNames = new[]
    {
        "ContainerLog",
        "ContainerInventory",
        "InsightsMetrics",
        "AzureDiagnostics",
    },
    Destination = new OperationalInsights.Inputs.DestinationArgs
    {
        ResourceId = storageAccount.Id,
        Type = "StorageAccount",
    },
    Enabled = true,
});

Retention by Table:

| Table | Production Retention | Non-Production Retention |
|---|---|---|
| ContainerLog | 7 years | 30 days |
| InsightsMetrics | 7 years | 30 days |
| AzureDiagnostics | 7 years | 30 days |
| FluxCDLogs | 7 years | 30 days |

Kusto Query Language (KQL) Examples

Pod Restart Query:

// Pod restart count per namespace (KubePodInventory carries restart counts)
KubePodInventory
| where TimeGenerated > ago(24h)
| where ContainerRestartCount > 0
| summarize 
    RestartCount = count(),
    UniquePods = dcount(Name),
    LastRestart = max(TimeGenerated)
    by Namespace, Computer
| order by RestartCount desc

Deployment Status Query:

// Deployment status from Container Insights
InsightsMetrics
| where Origin == "container.azm.ms"
| where Name == "k8sPodCount"
| where Namespace == "atp-production"
| extend PodCount = Val
| summarize 
    TotalPods = sum(PodCount),
    AvgPods = avg(PodCount),
    MaxPods = max(PodCount)
    by Namespace, bin(TimeGenerated, 5m)
| order by TimeGenerated desc

FluxCD Reconciliation Query:

// FluxCD reconciliation events
ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "reconciliation"
| extend 
    Kustomization = extract(@"kustomization/(\S+)", 1, LogEntry, typeof(string)),
    Status = extract(@"status=(\S+)", 1, LogEntry, typeof(string)),
    Duration = extract(@"duration=(\d+\.\d+)", 1, LogEntry, typeof(real))
| summarize 
    TotalReconciliations = count(),
    AvgDuration = avg(Duration),
    MaxDuration = max(Duration),
    SuccessCount = countif(Status == "success"),
    FailureCount = countif(Status == "failure")
    by Kustomization, bin(TimeGenerated, 1h)
| order by TimeGenerated desc

Error Rate Query:

// Application error rate
ContainerLog
| where TimeGenerated > ago(1h)
| where LogEntry contains "ERROR" or LogEntry contains "Exception"
| extend 
    Service = extract(@"app=(\S+)", 1, LogEntry, typeof(string)),
    ErrorType = extract(@"(\w+Exception)", 1, LogEntry, typeof(string))
| summarize 
    ErrorCount = count(),
    UniqueErrors = dcount(ErrorType)
    by Service, Computer, bin(TimeGenerated, 5m)
| order by ErrorCount desc

Custom Log Tables

Custom Log Table: Deployment Events:

// Note: the .create / .ingest control commands below use Azure Data Explorer
// syntax and are shown for illustration. In a Log Analytics workspace, custom
// tables (suffixed _CL) are created via the Tables API and populated through
// the Logs Ingestion API with a data collection rule.
// Create custom log table for deployment events
.create table DeploymentEvents (TimeGenerated:datetime, DeploymentId:string, ServiceName:string, Environment:string, Status:string, GitCommit:string, DeployedBy:string, Duration:real)

// Ingest deployment events
.ingest inline into table DeploymentEvents <|
2024-01-15T10:00:00Z, "deployment-abc123", "atp-ingestion", "production", "success", "abc123def", "FluxCD", 45.2
2024-01-15T11:00:00Z, "deployment-def456", "atp-query", "production", "success", "def456ghi", "FluxCD", 52.8

// Query deployment events
DeploymentEvents
| where Environment == "production"
| where TimeGenerated > ago(7d)
| summarize 
    TotalDeployments = count(),
    SuccessfulDeployments = countif(Status == "success"),
    FailedDeployments = countif(Status == "failure"),
    AvgDuration = avg(Duration)
    by ServiceName, bin(TimeGenerated, 1d)

Custom Log via Azure Function:

// Azure Function to ingest deployment events.
// Note: [LogAnalyticsOutput] is not a built-in binding — it stands in here for a
// custom output binding (or a direct call to the Azure Monitor Logs Ingestion API).
[FunctionName("IngestDeploymentEvent")]
public async Task IngestDeploymentEvent(
    [EventGridTrigger] EventGridEvent eventGridEvent,
    [LogAnalyticsOutput] IAsyncCollector<LogAnalyticsEvent> logAnalytics)
{
    var deploymentEvent = JsonSerializer.Deserialize<DeploymentEvent>(eventGridEvent.Data.ToString());

    await logAnalytics.AddAsync(new LogAnalyticsEvent
    {
        TimeGenerated = DateTime.UtcNow,
        DeploymentId = deploymentEvent.DeploymentId,
        ServiceName = deploymentEvent.ServiceName,
        Environment = deploymentEvent.Environment,
        Status = deploymentEvent.Status,
        GitCommit = deploymentEvent.GitCommit,
        DeployedBy = "FluxCD",
        Duration = deploymentEvent.Duration,
    });
}

FluxCD Metrics Export

Prometheus Metrics from FluxCD

FluxCD Metrics Endpoint:

# FluxCD automatically exposes Prometheus metrics
# Endpoint: http://kustomize-controller:8080/metrics

# ServiceMonitor for Prometheus
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: fluxcd-kustomize-controller
  namespace: flux-system
spec:
  selector:
    matchLabels:
      app: kustomize-controller
  endpoints:
  - port: http-prom
    interval: 30s
    path: /metrics
    scrapeTimeout: 10s

Key FluxCD Metrics:

| Metric | Description | Labels |
|---|---|---|
| fluxcd_kustomize_reconciliation_total | Total reconciliations | status, kustomization |
| fluxcd_kustomize_reconciliation_duration_seconds | Reconciliation duration | kustomization |
| fluxcd_kustomize_reconciliation_errors_total | Reconciliation errors | kustomization, error_type |
| fluxcd_source_git_reconciliation_total | Git fetch reconciliations | status, gitrepository |
| fluxcd_source_git_reconciliation_duration_seconds | Git fetch duration | gitrepository |

Note: metric names vary by Flux version — Flux v2 controllers expose reconciliation metrics under the `gotk_` prefix (for example `gotk_reconcile_duration_seconds`), so verify the names against your installed version before building dashboards and alerts.

Metrics Scraping Configuration

Prometheus Scrape Configuration:

# Prometheus scrape config for FluxCD
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 30s
      scrape_timeout: 10s

    scrape_configs:
    - job_name: 'fluxcd-kustomize-controller'
      kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
          - flux-system
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: kustomize-controller
        action: keep
      - source_labels: [__meta_kubernetes_pod_container_port_number]
        regex: "8080"
        action: keep

    - job_name: 'fluxcd-source-controller'
      kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
          - flux-system
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: source-controller
        action: keep

Prometheus ServiceMonitor for FluxCD:

# ServiceMonitor for FluxCD controllers
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: fluxcd-controllers
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/part-of: flux
  namespaceSelector:
    matchNames:
    - flux-system
  endpoints:
  - port: http-prom
    interval: 30s
    path: /metrics
    scrapeTimeout: 10s

Key Metrics to Monitor

Critical FluxCD Metrics:

# Reconciliation success rate
sum(rate(fluxcd_kustomize_reconciliation_total{status="success"}[5m]))
/
sum(rate(fluxcd_kustomize_reconciliation_total[5m]))

# Reconciliation error rate
sum(rate(fluxcd_kustomize_reconciliation_errors_total[5m]))

# Average reconciliation duration
avg(fluxcd_kustomize_reconciliation_duration_seconds)

# Git fetch duration (indicates network issues)
avg(fluxcd_source_git_reconciliation_duration_seconds)

Per-Kustomization Metrics:

# Reconciliation success rate per Kustomization
sum by (kustomization) (
  rate(fluxcd_kustomize_reconciliation_total{status="success"}[5m])
)
/
sum by (kustomization) (
  rate(fluxcd_kustomize_reconciliation_total[5m])
)

# Reconciliation duration per Kustomization
avg by (kustomization) (
  fluxcd_kustomize_reconciliation_duration_seconds
)

Alerting on FluxCD Issues

FluxCD Alert Rules:

# alerts/fluxcd-reconciliation-alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: fluxcd-reconciliation-alerts
  namespace: monitoring
spec:
  groups:
  - name: fluxcd
    interval: 30s
    rules:
    - alert: FluxCDHighErrorRate
      expr: |
        sum(rate(fluxcd_kustomize_reconciliation_errors_total[5m])) > 0.1
      for: 5m
      labels:
        severity: warning
        component: fluxcd
      annotations:
        summary: "FluxCD reconciliation error rate is high"
        description: "{{ $value }} errors per second detected"

    - alert: FluxCDReconciliationSlow
      expr: |
        avg(fluxcd_kustomize_reconciliation_duration_seconds) > 300
      for: 10m
      labels:
        severity: warning
        component: fluxcd
      annotations:
        summary: "FluxCD reconciliations are taking longer than expected"
        description: "Average duration: {{ $value }}s"

    - alert: FluxCDGitFetchFailed
      expr: |
        increase(fluxcd_source_git_reconciliation_total{status="failure"}[5m]) > 3
      for: 5m
      labels:
        severity: critical
        component: fluxcd
      annotations:
        summary: "FluxCD Git fetch failures detected"
        description: "Git repository {{ $labels.gitrepository }} failed to fetch"

Deployment Metrics

Sync Status per Application

Application Sync Status Query:

// Sync status per application
let FluxCDEvents = ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "applied" or LogEntry contains "sync"
| extend 
    Application = extract(@"kustomization/(\S+)", 1, LogEntry, typeof(string)),
    Status = case(
        LogEntry contains "successfully applied", "Success",
        LogEntry contains "sync failed", "Failed",
        LogEntry contains "drift detected", "Drift",
        "Unknown"
    ),
    GitCommit = extract(@"revision=(\S+)", 1, LogEntry, typeof(string))
| project TimeGenerated, Application, Status, GitCommit;

FluxCDEvents
| summarize arg_max(TimeGenerated, Status, GitCommit) by Application
| project Application, LastSync = TimeGenerated, SyncStatus = Status, GitCommit
| order by LastSync desc

Sync Status Dashboard Query:

// Real-time sync status per application
ContainerLog
| where ContainerName contains "kustomize-controller"
| where TimeGenerated > ago(1h)
| extend 
    Application = extract(@"kustomization/(\S+)", 1, LogEntry, typeof(string)),
    Status = case(
        LogEntry contains "successfully applied", "Success",
        LogEntry contains "sync failed", "Failed",
        "InProgress"
    )
| summarize 
    Count = count(),
    LastSync = max(TimeGenerated)
    by Application, Status
| order by LastSync desc

Reconciliation Duration

Reconciliation Duration Metrics:

# Average reconciliation duration
avg(fluxcd_kustomize_reconciliation_duration_seconds)

# P50, P95, P99 reconciliation duration
histogram_quantile(0.50, 
  rate(fluxcd_kustomize_reconciliation_duration_seconds_bucket[5m])
)
histogram_quantile(0.95, 
  rate(fluxcd_kustomize_reconciliation_duration_seconds_bucket[5m])
)
histogram_quantile(0.99, 
  rate(fluxcd_kustomize_reconciliation_duration_seconds_bucket[5m])
)

# Per-Kustomization duration
avg by (kustomization) (
  fluxcd_kustomize_reconciliation_duration_seconds
)
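The histogram_quantile calls above estimate percentiles by linear interpolation over cumulative bucket counts. A simplified Python sketch of that estimation (the last bucket is treated as finite here, whereas Prometheus handles the +Inf bucket specially):

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: sorted list of (upper_bound, cumulative_count), mirroring the
    shape of a Prometheus *_bucket series.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # Linear interpolation within the bucket containing the rank
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# 60 reconciliations finished in <=30s, 90 in <=60s, 100 in <=120s
buckets = [(30.0, 60), (60.0, 90), (120.0, 100)]
p95 = histogram_quantile(0.95, buckets)  # falls in the 60-120s bucket
```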

KQL Query for Reconciliation Duration:

// Reconciliation duration from logs
ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "reconciliation"
| extend 
    Duration = extract(@"duration=(\d+\.\d+)", 1, LogEntry, typeof(real)),
    Kustomization = extract(@"kustomization/(\S+)", 1, LogEntry, typeof(string))
| where isnotnull(Duration)
| summarize 
    AvgDuration = avg(Duration),
    P50Duration = percentile(Duration, 50),
    P95Duration = percentile(Duration, 95),
    P99Duration = percentile(Duration, 99),
    MaxDuration = max(Duration)
    by Kustomization, bin(TimeGenerated, 1h)
| order by TimeGenerated desc

Reconciliation Failure Rate

Failure Rate Metrics:

# Reconciliation failure rate
sum(rate(fluxcd_kustomize_reconciliation_errors_total[5m]))
/
sum(rate(fluxcd_kustomize_reconciliation_total[5m]))

# Per-Kustomization failure rate
sum by (kustomization) (
  rate(fluxcd_kustomize_reconciliation_errors_total[5m])
)
/
sum by (kustomization) (
  rate(fluxcd_kustomize_reconciliation_total[5m])
)

KQL Failure Rate Query:

// Reconciliation failure rate
ContainerLog
| where ContainerName contains "kustomize-controller"
| where TimeGenerated > ago(24h)
| extend 
    Status = case(
        LogEntry contains "successfully", "Success",
        LogEntry contains "failed" or LogEntry contains "error", "Failure",
        "Unknown"
    ),
    Kustomization = extract(@"kustomization/(\S+)", 1, LogEntry, typeof(string))
| where Status != "Unknown"
| summarize 
    TotalReconciliations = count(),
    Successful = countif(Status == "Success"),
    Failed = countif(Status == "Failure"),
    FailureRate = (countif(Status == "Failure") * 100.0) / count()
    by Kustomization, bin(TimeGenerated, 1h)
| order by FailureRate desc

Drift Detection Events

Drift Detection Metrics:

# Drift detection rate
sum(rate(fluxcd_kustomize_drift_detected_total[5m]))

# Drift correction rate
sum(rate(fluxcd_kustomize_drift_corrected_total[5m]))

KQL Drift Detection Query:

// Drift detection events
ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "drift detected"
| extend 
    Kustomization = extract(@"kustomization/(\S+)", 1, LogEntry, typeof(string)),
    Resource = extract(@"resource=(\S+)", 1, LogEntry, typeof(string)),
    DriftType = extract(@"drift type=(\S+)", 1, LogEntry, typeof(string))
| summarize 
    DriftCount = count(),
    UniqueResources = dcount(Resource),
    LastDrift = max(TimeGenerated)
    by Kustomization, DriftType, bin(TimeGenerated, 1h)
| order by DriftCount desc

Deployment Frequency

Deployment Frequency Calculation:

// Deployment frequency (DORA metric)
let Deployments = ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "successfully applied"
| extend 
    Service = extract(@"kustomization/(\S+)", 1, LogEntry, typeof(string)),
    GitCommit = extract(@"revision=(\S+)", 1, LogEntry, typeof(string))
| project TimeGenerated, Service, GitCommit;

Deployments
| summarize 
    DeploymentCount = count(),
    UniqueServices = dcount(Service)
    by bin(TimeGenerated, 1d)
| extend 
    DeploymentFrequency = DeploymentCount // Deployments per day
| order by TimeGenerated desc

Prometheus Query for Deployment Frequency:

# Deployment frequency (successful reconciliations per day)
sum(increase(fluxcd_kustomize_reconciliation_total{status="success"}[1d]))
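
The frequency queries above boil down to bucketing successful-apply events by day. A small Python sketch of the same aggregation (the sample events are hypothetical):

```python
from collections import Counter
from datetime import datetime

def deployments_per_day(events):
    """events: iterable of (timestamp, service) successful-apply events,
    e.g. parsed from the reconciliation logs queried above."""
    return dict(Counter(ts.date() for ts, _svc in events))

events = [
    (datetime(2024, 1, 15, 10), "atp-ingestion"),
    (datetime(2024, 1, 15, 11), "atp-query"),
    (datetime(2024, 1, 16, 9), "atp-ingestion"),
]
freq = deployments_per_day(events)  # two deployments on the 15th, one on the 16th
```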

Application Health After Deployment

Readiness Probe Success Rate

Readiness Probe Metrics:

# Readiness probe success rate
sum(rate(kube_pod_status_ready{condition="true"}[5m]))
/
sum(rate(kube_pod_status_ready[5m]))

# Readiness probe failures
sum(rate(kube_pod_status_ready{condition="false"}[5m]))

KQL Readiness Probe Query:

// Readiness approximation from pod inventory
// (PodStatus == "Running" is used as a proxy for readiness here)
KubePodInventory
| where Namespace == "atp-production"
| summarize 
    TotalPods = dcount(Name),
    RunningPods = dcountif(Name, PodStatus == "Running")
    by Namespace, bin(TimeGenerated, 5m)
| extend ReadinessRate = (RunningPods * 100.0) / TotalPods
| order by TimeGenerated desc

Pod Restart Count

Pod Restart Metrics:

# Pod restart count
sum(increase(kube_pod_container_status_restarts_total[1h]))

# Pod restart rate per service
sum by (pod, namespace) (
  increase(kube_pod_container_status_restarts_total[1h])
)

KQL Pod Restart Query:

// Pod restart count (KubePodInventory carries per-container restart counts)
KubePodInventory
| where TimeGenerated > ago(24h)
| where ContainerRestartCount > 0
| summarize 
    RestartCount = max(ContainerRestartCount),
    RestartEvents = count(),
    LastRestart = max(TimeGenerated)
    by Computer, ContainerName, ServiceName, Namespace
| order by RestartCount desc

HTTP Error Rates

HTTP Error Rate Metrics:

# HTTP 5xx error rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# HTTP 4xx error rate
sum(rate(http_requests_total{status=~"4.."}[5m]))
/
sum(rate(http_requests_total[5m]))

KQL HTTP Error Rate Query:

// HTTP error rates from workspace-based Application Insights
// (AppRequests.ResultCode is a string, so convert before comparing)
AppRequests
| where TimeGenerated > ago(1h)
| extend IsError = toint(ResultCode) >= 400
| summarize 
    TotalRequests = count(),
    ErrorRequests = countif(IsError),
    ErrorRate = (countif(IsError) * 100.0) / count()
    by AppRoleName, bin(TimeGenerated, 5m)
| order by ErrorRate desc

Response Latency

Response Latency Metrics:

# Average response latency
avg(http_request_duration_seconds)

# P95 response latency
histogram_quantile(0.95, 
  rate(http_request_duration_seconds_bucket[5m])
)

# P99 response latency
histogram_quantile(0.99, 
  rate(http_request_duration_seconds_bucket[5m])
)

KQL Response Latency Query:

// Response latency from workspace-based Application Insights
// (DurationMs is already in milliseconds)
AppRequests
| where TimeGenerated > ago(1h)
| summarize 
    AvgLatency = avg(DurationMs),
    P50Latency = percentile(DurationMs, 50),
    P95Latency = percentile(DurationMs, 95),
    P99Latency = percentile(DurationMs, 99),
    MaxLatency = max(DurationMs)
    by AppRoleName, bin(TimeGenerated, 5m)
| order by TimeGenerated desc

Integration with Application Metrics

Custom Application Metrics:

// C# application: Export custom metrics
public class MetricsExporter
{
    private readonly IMetricsCollector _metrics;

    public void RecordDeploymentEvent(string serviceName, string gitCommit)
    {
        _metrics.IncrementCounter("atp_deployment_total", new Dictionary<string, string>
        {
            { "service", serviceName },
            { "git_commit", gitCommit },
            { "environment", "production" },
        });
    }

    public void RecordDeploymentDuration(double durationSeconds)
    {
        _metrics.RecordHistogram("atp_deployment_duration_seconds", durationSeconds);
    }
}

Prometheus Metrics Export:

// Prometheus metrics endpoint
app.UseMetricServer(); // Exposes /metrics endpoint

// Custom metrics (prometheus-net)
var deploymentCounter = Metrics.CreateCounter(
    "atp_deployment_total",
    "Total deployments",
    new[] { "service", "environment", "status" });

// Record a successful deployment
deploymentCounter.WithLabels("atp-ingestion", "production", "success").Inc();

Grafana Dashboards

FluxCD Operational Dashboard

FluxCD Dashboard JSON:

{
  "dashboard": {
    "title": "FluxCD Operational Dashboard",
    "panels": [
      {
        "title": "Reconciliation Success Rate",
        "targets": [{
          "expr": "sum(rate(fluxcd_kustomize_reconciliation_total{status=\"success\"}[5m])) / sum(rate(fluxcd_kustomize_reconciliation_total[5m]))",
          "legendFormat": "Success Rate"
        }],
        "type": "stat",
        "thresholds": {
          "steps": [
            { "value": 0, "color": "red" },
            { "value": 0.95, "color": "yellow" },
            { "value": 0.99, "color": "green" }
          ]
        }
      },
      {
        "title": "Reconciliation Duration",
        "targets": [{
          "expr": "avg(fluxcd_kustomize_reconciliation_duration_seconds)",
          "legendFormat": "Avg Duration"
        }],
        "type": "graph"
      },
      {
        "title": "Reconciliation Errors",
        "targets": [{
          "expr": "sum(rate(fluxcd_kustomize_reconciliation_errors_total[5m]))",
          "legendFormat": "Errors/sec"
        }],
        "type": "graph"
      },
      {
        "title": "Reconciliation Status by Kustomization",
        "targets": [{
          "expr": "sum by (kustomization) (rate(fluxcd_kustomize_reconciliation_total[5m]))",
          "legendFormat": "{{kustomization}}"
        }],
        "type": "bargauge"
      }
    ]
  }
}

Deployment Status Dashboard

Deployment Status Dashboard:

{
  "dashboard": {
    "title": "Deployment Status Dashboard",
    "panels": [
      {
        "title": "Deployment Frequency",
        "targets": [{
          "expr": "sum(increase(fluxcd_kustomize_reconciliation_total{status=\"success\"}[1d]))",
          "legendFormat": "Deployments/Day"
        }],
        "type": "stat"
      },
      {
        "title": "Deployment Success Rate",
        "targets": [{
          "expr": "sum(rate(fluxcd_kustomize_reconciliation_total{status=\"success\"}[1h])) / sum(rate(fluxcd_kustomize_reconciliation_total[1h]))",
          "legendFormat": "Success Rate"
        }],
        "type": "gauge"
      },
      {
        "title": "Deployment Status by Service",
        "targets": [{
          "expr": "sum by (kustomization) (fluxcd_kustomize_reconciliation_total)",
          "legendFormat": "{{kustomization}}"
        }],
        "type": "table"
      }
    ]
  }
}

Application Health Dashboard

Application Health Dashboard:

{
  "dashboard": {
    "title": "Application Health Dashboard",
    "panels": [
      {
        "title": "Pod Readiness",
        "targets": [{
          "expr": "sum(rate(kube_pod_status_ready{condition=\"true\"}[5m])) / sum(rate(kube_pod_status_ready[5m]))",
          "legendFormat": "Readiness Rate"
        }],
        "type": "stat"
      },
      {
        "title": "Pod Restart Count",
        "targets": [{
          "expr": "sum(increase(kube_pod_container_status_restarts_total[1h]))",
          "legendFormat": "Restarts"
        }],
        "type": "graph"
      },
      {
        "title": "HTTP Error Rate",
        "targets": [{
          "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))",
          "legendFormat": "5xx Error Rate"
        }],
        "type": "graph"
      },
      {
        "title": "Response Latency (P95)",
        "targets": [{
          "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
          "legendFormat": "P95 Latency"
        }],
        "type": "graph"
      }
    ]
  }
}

DORA Metrics Dashboard

DORA Metrics Dashboard:

{
  "dashboard": {
    "title": "DORA Metrics Dashboard",
    "panels": [
      {
        "title": "Deployment Frequency",
        "targets": [{
          "expr": "sum(increase(fluxcd_kustomize_reconciliation_total{status=\"success\"}[1d]))",
          "legendFormat": "Deployments/Day"
        }],
        "type": "stat"
      },
      {
        "title": "Lead Time for Changes",
        "targets": [{
          "expr": "avg(deployment_lead_time_seconds)",
          "legendFormat": "Avg Lead Time"
        }],
        "type": "stat"
      },
      {
        "title": "Mean Time to Recovery (MTTR)",
        "targets": [{
          "expr": "avg(incident_recovery_time_seconds)",
          "legendFormat": "MTTR"
        }],
        "type": "stat"
      },
      {
        "title": "Change Failure Rate",
        "targets": [{
          "expr": "sum(rate(deployment_failures_total[1d])) / sum(rate(deployments_total[1d]))",
          "legendFormat": "Failure Rate"
        }],
        "type": "gauge"
      }
    ]
  }
}

Azure Monitor Workbooks

Custom Workbooks for GitOps

GitOps Workbook Template:

{
  "version": "Notebook/1.0",
  "items": [
    {
      "type": 9,
      "content": {
        "version": "KqlParameterItem/1.0",
        "parameters": [
          {
            "id": "timeRange",
            "version": "KqlParameterItem/1.0",
            "name": "TimeRange",
            "type": 4,
            "value": {
              "durationMs": 86400000
            }
          },
          {
            "id": "environment",
            "version": "KqlParameterItem/1.0",
            "name": "Environment",
            "type": 1,
            "value": "production"
          }
        ]
      }
    },
    {
      "type": 1,
      "content": {
        "version": "TextBlock/1.0",
        "text": "## GitOps Deployment Status"
      }
    },
    {
      "type": 3,
      "content": {
        "version": "KqlItem/1.0",
        "query": "ContainerLog\n| where ContainerName contains \"kustomize-controller\"\n| where TimeGenerated > ago({TimeRange})\n| where Namespace == \"{Environment}\"\n| summarize DeploymentCount = count() by bin(TimeGenerated, 1h)\n| render timechart",
        "visualization": "timechart",
        "size": 0,
        "queryType": 0,
        "resourceType": "microsoft.operationalinsights/workspaces"
      }
    }
  ]
}

Compliance Reporting Workbooks

Compliance Workbook:

{
  "version": "Notebook/1.0",
  "items": [
    {
      "type": 1,
      "content": {
        "version": "TextBlock/1.0",
        "text": "## Compliance Audit Report"
      }
    },
    {
      "type": 3,
      "content": {
        "version": "KqlItem/1.0",
        "query": "// Deployment audit trail\nContainerLog\n| where ContainerName contains \"kustomize-controller\"\n| where LogEntry contains \"applied\"\n| extend \n    DeploymentId = extract(@\"deployment=(\\S+)\", 1, LogEntry, typeof(string)),\n    GitCommit = extract(@\"revision=(\\S+)\", 1, LogEntry, typeof(string)),\n    DeployedBy = \"FluxCD\"\n| project TimeGenerated, DeploymentId, GitCommit, DeployedBy, Namespace\n| order by TimeGenerated desc",
        "visualization": "table",
        "size": 0,
        "queryType": 0
      }
    },
    {
      "type": 3,
      "content": {
        "version": "KqlItem/1.0",
        "query": "// Policy compliance status\nAzureDiagnostics\n| where ResourceProvider == \"MICROSOFT.AUTHORIZATION\"\n| where Category == \"PolicyState\"\n| where TimeGenerated > ago(7d)\n| extend ComplianceState = tostring(parse_json(properties_s).complianceState_s)\n| summarize \n    Compliant = countif(ComplianceState == \"Compliant\"),\n    NonCompliant = countif(ComplianceState == \"NonCompliant\")\n    by bin(TimeGenerated, 1d)\n| render timechart",
        "visualization": "timechart",
        "size": 0
      }
    }
  ]
}

Cost Analysis Workbooks

Cost Analysis Workbook:

{
  "version": "Notebook/1.0",
  "items": [
    {
      "type": 1,
      "content": {
        "version": "TextBlock/1.0",
        "text": "## GitOps Cost Analysis"
      }
    },
    {
      "type": 3,
      "content": {
        "version": "KqlItem/1.0",
        "query": "// Resource usage by environment\nInsightsMetrics\n| where Origin == \"container.azm.ms\"\n| where Name == \"cpuUsageNanoCores\"\n| extend Environment = extract(@\"namespace=(\\S+)\", 1, Namespace, typeof(string))\n| summarize \n    AvgCPU = avg(Val),\n    MaxCPU = max(Val)\n    by Environment, bin(TimeGenerated, 1d)\n| render timechart",
        "visualization": "timechart",
        "size": 0
      }
    }
  ]
}

Alerting

Sync Failure Alerts

Sync Failure Alert Rule:

# alerts/fluxcd-sync-failure.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: fluxcd-sync-failure
  namespace: monitoring
spec:
  groups:
  - name: fluxcd-sync
    rules:
    - alert: FluxCDSyncFailure
      expr: |
        sum(rate(fluxcd_kustomize_reconciliation_errors_total[5m])) > 0
      for: 5m
      labels:
        severity: critical
        component: fluxcd
      annotations:
        summary: "FluxCD sync failure detected"
        description: "{{ $value }} sync failures in the last 5 minutes"

Azure Monitor Alert Rule:

{
  "location": "global",
  "properties": {
    "displayName": "FluxCD Sync Failure",
    "description": "Alert when FluxCD sync failures detected",
    "severity": 1,
    "enabled": true,
    "evaluationFrequency": "PT5M",
    "windowSize": "PT5M",
    "criteria": {
      "allOf": [{
        "odata.type": "Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria",
        "name": "SyncFailure",
        "metricName": "fluxcd_kustomize_reconciliation_errors_total",
        "operator": "GreaterThan",
        "threshold": 0,
        "timeAggregation": "Total"
      }]
    },
    "actions": []
  }
}

Drift Detection Alerts

Drift Detection Alert:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: fluxcd-drift-detection
  namespace: monitoring
spec:
  groups:
  - name: fluxcd-drift
    rules:
    - alert: FluxCDDriftDetected
      expr: |
        sum(rate(fluxcd_kustomize_drift_detected_total[5m])) > 0
      for: 5m
      labels:
        severity: warning
        component: fluxcd
      annotations:
        summary: "FluxCD drift detected"
        description: "Cluster state differs from Git state"

KQL-Based Drift Alert:

// Drift detection alert query
ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "drift detected"
| where TimeGenerated > ago(5m)
| extend 
    Kustomization = extract(@"kustomization/(\S+)", 1, LogEntry, typeof(string)),
    Resource = extract(@"resource=(\S+)", 1, LogEntry, typeof(string))
| summarize DriftCount = count() by Kustomization, Resource
| where DriftCount > 0

Deployment Failure Alerts

Deployment Failure Alert:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: deployment-failure
  namespace: monitoring
spec:
  groups:
  - name: deployments
    rules:
    - alert: DeploymentFailure
      expr: |
        sum(rate(fluxcd_kustomize_reconciliation_errors_total[10m])) > 2
      for: 10m
      labels:
        severity: critical
        component: deployment
      annotations:
        summary: "Deployment failure detected"
        description: "{{ $value }} deployment failures in the last 10 minutes"

Health Check Failure Alerts

Health Check Failure Alert:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: health-check-failure
  namespace: monitoring
spec:
  groups:
  - name: health
    rules:
    - alert: HealthCheckFailure
      expr: |
        sum(rate(kube_pod_status_ready{condition="false"}[5m])) > 0
      for: 5m
      labels:
        severity: warning
        component: health
      annotations:
        summary: "Health check failure detected"
        description: "{{ $value }} pods with failed health checks"

Alert Routing (Email, Teams, PagerDuty)

Alert Manager Configuration:

# alertmanager-config.yaml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  routes:
  - match:
      severity: critical
    receiver: 'pagerduty'
  - match:
      severity: warning
    receiver: 'teams'
  - match:
      component: fluxcd
    receiver: 'slack'

receivers:
- name: 'default'
  email_configs:
  - to: 'team@connectsoft.example'
    send_resolved: true

- name: 'teams'
  webhook_configs:
  # Teams cannot parse Alertmanager's native payload directly; route through a
  # translator such as prometheus-msteams
  - url: 'https://outlook.office.com/webhook/YOUR/WEBHOOK/URL'
    send_resolved: true

- name: 'slack'
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
    channel: '#atp-alerts'
    send_resolved: true

- name: 'pagerduty'
  pagerduty_configs:
  - service_key: 'YOUR_PAGERDUTY_KEY'
    send_resolved: true
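The routing tree above dispatches each alert to the first matching route, falling back to the default receiver, so route order decides ties. A minimal sketch of that first-match behavior (the function and route list mirror the config above but are illustrative, not part of Alertmanager):

```python
# Illustrative first-match routing, mirroring the Alertmanager route tree above.
ROUTES = [
    ({"severity": "critical"}, "pagerduty"),
    ({"severity": "warning"}, "teams"),
    ({"component": "fluxcd"}, "slack"),
]
DEFAULT_RECEIVER = "default"

def route_alert(labels: dict) -> str:
    """Return the receiver of the first route whose match labels all apply."""
    for match, receiver in ROUTES:
        if all(labels.get(k) == v for k, v in match.items()):
            return receiver
    return DEFAULT_RECEIVER
```

A critical FluxCD alert therefore pages rather than posting to Slack, because the `severity: critical` route is listed first.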

Azure Monitor Action Groups:

{
  "location": "global",
  "properties": {
    "groupShortName": "atp-alerts",
    "enabled": true,
    "emailReceivers": [{
      "name": "team-email",
      "emailAddress": "team@connectsoft.example",
      "useCommonAlertSchema": true
    }],
    "smsReceivers": [{
      "name": "oncall-sms",
      "countryCode": "1",
      "phoneNumber": "5551234567"
    }],
    "webhookReceivers": [{
      "name": "teams-webhook",
      "serviceUri": "https://outlook.office.com/webhook/YOUR/WEBHOOK/URL",
      "useCommonAlertSchema": true
    }]
  }
}

Correlation

Linking Git Commits to Deployments

Correlation via Git Commit SHA:

// Link Git commits to deployments
let GitCommits = ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "applied"
| extend 
    GitCommit = extract(@"revision=(\S+)", 1, LogEntry, typeof(string)),
    Service = extract(@"kustomization/(\S+)", 1, LogEntry, typeof(string)),
    DeploymentTime = TimeGenerated
| project GitCommit, Service, DeploymentTime;

let ApplicationMetrics = AppRequests
| extend 
    GitCommit = extract(@"git_commit=(\S+)", 1, tostring(customDimensions), typeof(string)),
    RequestTime = TimeGenerated
| project GitCommit, RequestTime, success, duration;

GitCommits
| join kind=inner ApplicationMetrics on GitCommit
| summarize 
    DeploymentCount = count(),
    AvgLatency = avg(duration),
    ErrorRate = (countif(success == false) * 100.0) / count()
    by Service, GitCommit, bin(DeploymentTime, 1h)

Deployment Correlation Script:

#!/bin/bash
# scripts/correlate-deployment.sh

GIT_COMMIT="${1:-$(git rev-parse HEAD)}"
SERVICE_NAME="${2:-atp-ingestion}"

echo "🔗 Correlating deployment for commit: $GIT_COMMIT"

# Add annotation to deployment
kubectl annotate deployment "$SERVICE_NAME" -n atp-production \
  gitops.git-commit="$GIT_COMMIT" \
  gitops.deployed-at="$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --overwrite

# Query correlation (--workspace takes the workspace customer ID GUID, not the resource name)
az monitor log-analytics query \
  --workspace "atp-prod-loganalytics" \
  --analytics-query "
    ContainerLog
    | where ContainerName contains \"kustomize-controller\"
    | where LogEntry contains \"$GIT_COMMIT\"
    | project TimeGenerated, LogEntry
  "

Linking Deployments to Application Metrics

Deployment-to-Metrics Correlation:

// Link deployments to application metrics
let Deployments = ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "successfully applied"
| extend 
    Service = extract(@"kustomization/(\S+)", 1, LogEntry, typeof(string)),
    GitCommit = extract(@"revision=(\S+)", 1, LogEntry, typeof(string)),
    DeploymentTime = TimeGenerated
| project Service, GitCommit, DeploymentTime;

let Metrics = AppRequests
| extend 
    Service = appName
| project Service, TimeGenerated, success, duration, resultCode;

Deployments
| join kind=inner Metrics on Service
| where TimeGenerated >= DeploymentTime
| where TimeGenerated <= DeploymentTime + 30m
| summarize 
    RequestCount = count(),
    ErrorRate = (countif(success == false) * 100.0) / count(),
    AvgLatency = avg(duration)
    by Service, GitCommit, bin(DeploymentTime, 5m)

Correlation IDs Throughout Stack

Correlation ID Injection:

// C#: Inject correlation ID in HTTP requests
public class CorrelationIdMiddleware
{
    private readonly RequestDelegate _next;

    public CorrelationIdMiddleware(RequestDelegate next) => _next = next;

    public async Task InvokeAsync(HttpContext context)
    {
        var correlationId = context.Request.Headers["X-Correlation-ID"].FirstOrDefault()
            ?? Guid.NewGuid().ToString();

        context.Items["CorrelationId"] = correlationId;
        context.Response.Headers["X-Correlation-ID"] = correlationId;

        using (LogContext.PushProperty("CorrelationId", correlationId))
        {
            await _next(context);
        }
    }
}

Correlation ID in Kubernetes:

# Add correlation ID to pod annotations
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  template:
    metadata:
      annotations:
        gitops.git-commit: "abc123def456"
        gitops.correlation-id: "deployment-abc123"
        gitops.deployed-at: "2024-01-15T10:00:00Z"

Distributed Tracing with Azure Application Insights

Application Insights Integration:

// Configure Application Insights with distributed tracing
services.AddApplicationInsightsTelemetry(options =>
{
    options.ConnectionString = "InstrumentationKey={key};IngestionEndpoint=https://eastus-8.in.applicationinsights.azure.com/";
    options.EnableDependencyTrackingTelemetryModule = true;
    options.EnableRequestTrackingTelemetryModule = true;
    options.EnableAdaptiveSampling = true;
    options.AdaptiveSamplingInitialSamplingPercentage = 10; // 10% in production
});

// Custom telemetry with correlation
var telemetryClient = new TelemetryClient();
telemetryClient.Context.Operation.Id = correlationId;
telemetryClient.Context.Operation.Name = "Deployment";
telemetryClient.TrackEvent("DeploymentCompleted", new Dictionary<string, string>
{
    { "Service", "atp-ingestion" },
    { "GitCommit", gitCommit },
    { "Environment", "production" },
});

Trace Correlation Query:

// Distributed trace correlation
let Traces = AppTraces
| extend 
    CorrelationId = extract(@"correlation_id=(\S+)", 1, tostring(customDimensions), typeof(string)),
    OperationId = operation_Id
| project CorrelationId, OperationId, TimeGenerated, message;

let Requests = AppRequests
| extend 
    CorrelationId = extract(@"correlation_id=(\S+)", 1, tostring(customDimensions), typeof(string)),
    OperationId = operation_Id
| project CorrelationId, OperationId, TimeGenerated, name, duration;

Traces
| join kind=inner Requests on CorrelationId
// After the join, the right-hand table's conflicting column becomes TimeGenerated1
| project CorrelationId, TraceTime = TimeGenerated, RequestTime = TimeGenerated1, RequestDuration = duration
| order by CorrelationId asc, TraceTime asc

Compliance Evidence

Deployment Audit Trail in Log Analytics

Deployment Audit Trail Query:

// Complete deployment audit trail
let DeploymentEvents = ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "applied" or LogEntry contains "sync"
| extend 
    DeploymentId = extract(@"deployment=(\S+)", 1, LogEntry, typeof(string)),
    Service = extract(@"kustomization/(\S+)", 1, LogEntry, typeof(string)),
    GitCommit = extract(@"revision=(\S+)", 1, LogEntry, typeof(string)),
    Status = case(
        LogEntry contains "successfully", "Success",
        LogEntry contains "failed", "Failed",
        "InProgress"
    ),
    DeployedBy = "FluxCD",
    DeploymentTime = TimeGenerated
| project DeploymentTime, DeploymentId, Service, GitCommit, Status, DeployedBy, Namespace;

let Approvals = AzureDiagnostics
| where ResourceProvider == "MICROSOFT.DEVOPS"
| where Category == "PullRequest"
| extend 
    GitCommit = extract(@"commit=(\\S+)", 1, properties_s, typeof(string)),
    Approver = tostring(parse_json(properties_s).approver),
    ApprovalTime = TimeGenerated
| project GitCommit, Approver, ApprovalTime;

DeploymentEvents
| join kind=leftouter Approvals on GitCommit
| project 
    DeploymentTime,
    DeploymentId,
    Service,
    GitCommit,
    Status,
    DeployedBy,
    Approver,
    ApprovalTime,
    Namespace
| order by DeploymentTime desc

Retention for 7 Years (Compliance Requirement)

7-Year Retention Configuration:

// Log Analytics workspace: interactive retention caps at 730 days,
// so the 7-year requirement is met via the Blob Storage export below
var logAnalyticsWorkspace = new OperationalInsights.Workspace("atp-prod-loganalytics", new()
{
    ResourceGroupName = "atp-production-rg",
    Location = "eastus",
    RetentionInDays = 730, // maximum interactive retention
    Tags = new()
    {
        { "Retention", "7years" },
        { "Compliance", "SOC2" },
    },
});

// Export to Blob Storage for the full 7-year retention window
var storageAccount = new Storage.Account("atp-prod-logs-backup", new()
{
    ResourceGroupName = "atp-production-rg",
    Location = "eastus",
    Kind = "StorageV2",
    SkuName = "Standard_LRS",
    AccessTier = "Cool", // account-level tier; individual blobs move to Archive via lifecycle rules
    EnableHttpsTrafficOnly = true,
    MinimumTlsVersion = "TLS1_2",
    BlobProperties = new Storage.Inputs.BlobServicePropertiesArgs
    {
        DeleteRetentionPolicy = new Storage.Inputs.DeleteRetentionPolicyArgs
        {
            Enabled = true,
            Days = 365, // soft-delete maximum; pair with an immutability policy for the 7-year hold
        },
        VersioningEnabled = true,
    },
});

Query Examples for Auditors

Auditor Query: Deployment History:

// Deployment history for auditors
ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "applied"
| where TimeGenerated > ago(365d)
| extend 
    Service = extract(@"kustomization/(\S+)", 1, LogEntry, typeof(string)),
    GitCommit = extract(@"revision=(\S+)", 1, LogEntry, typeof(string)),
    DeploymentTime = TimeGenerated
| summarize 
    DeploymentCount = count(),
    LastDeployment = max(DeploymentTime),
    UniqueServices = dcount(Service)
    by bin(TimeGenerated, 1d)
| order by TimeGenerated desc

Auditor Query: Change Approval:

// Change approval audit trail
let PullRequests = AzureDiagnostics
| where ResourceProvider == "MICROSOFT.DEVOPS"
| where Category == "PullRequest"
| extend 
    PRId = tostring(parse_json(properties_s).pullRequestId),
    GitCommit = extract(@"commit=(\S+)", 1, properties_s, typeof(string)),
    Approver = tostring(parse_json(properties_s).approver),
    ApprovalTime = TimeGenerated,
    Status = tostring(parse_json(properties_s).status)
| project PRId, GitCommit, Approver, ApprovalTime, Status;

let Deployments = ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "applied"
| extend 
    GitCommit = extract(@"revision=(\S+)", 1, LogEntry, typeof(string)),
    DeploymentTime = TimeGenerated
| project GitCommit, DeploymentTime;

PullRequests
| join kind=inner Deployments on GitCommit
| project PRId, ApprovalTime, Approver, DeploymentTime, Status
| order by ApprovalTime desc

Auditor Query: Policy Compliance:

// Policy compliance audit
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.AUTHORIZATION"
| where Category == "PolicyState"
| where TimeGenerated > ago(365d)
| extend 
    PolicyName = tostring(parse_json(properties_s).policyDefinitionName),
    ComplianceState = tostring(parse_json(properties_s).complianceState),
    ResourceId = tostring(parse_json(properties_s).resourceId)
| summarize 
    CompliantCount = countif(ComplianceState == "Compliant"),
    NonCompliantCount = countif(ComplianceState == "NonCompliant"),
    TotalChecks = count()
    by PolicyName, bin(TimeGenerated, 1d)
| extend ComplianceRate = (CompliantCount * 100.0) / TotalChecks
| order by TimeGenerated desc

Export for eDiscovery

eDiscovery Export Script:

#!/bin/bash
# scripts/export-ediscovery.sh

START_DATE="${1:-$(date -u -d '7 years ago' +%Y-%m-%dT%H:%M:%SZ)}"
END_DATE="${2:-$(date -u +%Y-%m-%dT%H:%M:%SZ)}"
OUTPUT_PATH="${3:-./ediscovery-export}"

echo "📤 Exporting compliance logs for eDiscovery: $START_DATE to $END_DATE"

# Export deployment audit trail
az monitor log-analytics query \
  --workspace "atp-prod-loganalytics" \
  --analytics-query "
    ContainerLog
    | where ContainerName contains \"kustomize-controller\"
    | where TimeGenerated between (datetime($START_DATE) .. datetime($END_DATE))
    | where LogEntry contains \"applied\" or LogEntry contains \"sync\"
    | project TimeGenerated, ContainerName, LogEntry, Namespace
  " \
  --output table > "$OUTPUT_PATH/deployment-audit-trail.csv"

# Export policy compliance
az monitor log-analytics query \
  --workspace "atp-prod-loganalytics" \
  --analytics-query "
    AzureDiagnostics
    | where ResourceProvider == \"MICROSOFT.AUTHORIZATION\"
    | where Category == \"PolicyState\"
    | where TimeGenerated between (datetime($START_DATE) .. datetime($END_DATE))
    | project TimeGenerated, Category, properties_s
  " \
  --output table > "$OUTPUT_PATH/policy-compliance.csv"

# Export to Blob Storage for long-term storage
az storage blob upload-batch \
  --destination "ediscovery-export" \
  --source "$OUTPUT_PATH" \
  --account-name "atpprodlogsbackup"

echo "✅ Export complete: $OUTPUT_PATH"

DORA Metrics

Deployment Frequency

Deployment Frequency Calculation:

// Deployment frequency (DORA metric)
// Definition: How often deployments are successfully released to production

let SuccessfulDeployments = ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "successfully applied"
| where Namespace == "atp-production"
| extend 
    Service = extract(@"kustomization/(\S+)", 1, LogEntry, typeof(string)),
    DeploymentTime = TimeGenerated
| project DeploymentTime, Service;

SuccessfulDeployments
| summarize 
    DeploymentCount = count(),
    UniqueServices = dcount(Service)
    by Day = bin(DeploymentTime, 1d)
| extend 
    DeploymentFrequency = DeploymentCount, // Deployments per day
    DORA_Level = case(
        DeploymentFrequency >= 1, "Elite", // Daily or more often
        DeploymentFrequency >= 0.142, "High", // At least weekly
        DeploymentFrequency >= 0.033, "Medium", // At least monthly
        "Low" // Less than once per month
    )
| order by Day desc

Deployment Frequency Prometheus Query:

# Deployment frequency (deployments per day)
sum(increase(fluxcd_kustomize_reconciliation_total{status="success", namespace="atp-production"}[1d]))
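The DORA banding used in the KQL above maps deployments-per-day onto the four performance levels, with thresholds at one per day, roughly one per week, and roughly one per month. That classification can be sketched as a small helper (the function name is ours, not from any DORA tooling):

```python
def dora_deployment_frequency_level(deployments_per_day: float) -> str:
    """Classify deployment frequency using the thresholds from the KQL query."""
    if deployments_per_day >= 1:        # daily or better
        return "Elite"
    if deployments_per_day >= 1 / 7:    # at least weekly (~0.142)
        return "High"
    if deployments_per_day >= 1 / 30:   # at least monthly (~0.033)
        return "Medium"
    return "Low"
```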

Lead Time for Changes

Lead Time Calculation:

// Lead time for changes (DORA metric)
// Definition: Time from code commit to production deployment

let Commits = ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "Git commit"
| extend 
    GitCommit = extract(@"commit=(\S+)", 1, LogEntry, typeof(string)),
    CommitTime = TimeGenerated
| project GitCommit, CommitTime;

let Deployments = ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "successfully applied"
| where Namespace == "atp-production"
| extend 
    GitCommit = extract(@"revision=(\S+)", 1, LogEntry, typeof(string)),
    DeploymentTime = TimeGenerated
| project GitCommit, DeploymentTime;

Commits
| join kind=inner Deployments on GitCommit
| extend LeadTimeHours = datetime_diff('hour', DeploymentTime, CommitTime)
| summarize 
    AvgLeadTime = avg(LeadTimeHours),
    P50LeadTime = percentile(LeadTimeHours, 50),
    P95LeadTime = percentile(LeadTimeHours, 95),
    DORA_Level = case(
        avg(LeadTimeHours) < 24, "Elite", // Less than 1 day
        avg(LeadTimeHours) < 168, "High", // Less than 1 week
        avg(LeadTimeHours) < 720, "Medium", // Less than 1 month
        "Low" // More than 1 month
    )
    by Day = bin(DeploymentTime, 1d)
| order by Day desc
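The lead-time join above reduces to a timestamp difference per commit, aggregated into averages and percentiles. A sketch of the same arithmetic, using hypothetical commit/deploy pairs:

```python
from datetime import datetime
from statistics import median

def lead_time_hours(commit_time: datetime, deploy_time: datetime) -> float:
    """Hours between a commit and its production deployment."""
    return (deploy_time - commit_time).total_seconds() / 3600

pairs = [  # (commit, deployed): hypothetical data
    (datetime(2024, 1, 15, 9, 0), datetime(2024, 1, 15, 11, 0)),
    (datetime(2024, 1, 15, 10, 0), datetime(2024, 1, 16, 10, 0)),
]
hours = [lead_time_hours(c, d) for c, d in pairs]
p50 = median(hours)  # 13.0 for the sample above
```

With both lead times under 24 hours, this sample would land in the "Elite" band of the query above.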

Mean Time to Recovery (MTTR)

MTTR Calculation:

// Mean Time to Recovery (MTTR) - DORA metric
// Definition: Average time to recover from a failure in production

let Incidents = ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "failed" or LogEntry contains "error"
| where Namespace == "atp-production"
| extend 
    IncidentStart = TimeGenerated,
    Service = extract(@"kustomization/(\S+)", 1, LogEntry, typeof(string))
| project IncidentStart, Service;

let Recoveries = ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "successfully applied"
| where Namespace == "atp-production"
| extend 
    RecoveryTime = TimeGenerated,
    Service = extract(@"kustomization/(\S+)", 1, LogEntry, typeof(string))
| project RecoveryTime, Service;

Incidents
| join kind=inner Recoveries on Service
| where RecoveryTime >= IncidentStart
| extend RecoveryDurationMinutes = datetime_diff('minute', RecoveryTime, IncidentStart)
| summarize 
    MTTR = avg(RecoveryDurationMinutes),
    P50MTTR = percentile(RecoveryDurationMinutes, 50),
    P95MTTR = percentile(RecoveryDurationMinutes, 95),
    IncidentCount = count(),
    DORA_Level = case(
        avg(RecoveryDurationMinutes) < 60, "Elite", // Less than 1 hour
        avg(RecoveryDurationMinutes) < 1440, "High", // Less than 1 day
        avg(RecoveryDurationMinutes) < 10080, "Medium", // Less than 1 week
        "Low" // More than 1 week
    )
    by Day = bin(IncidentStart, 1d)
| order by Day desc
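The KQL above pairs incidents with any later recovery for the same service; a tighter definition pairs each incident only with its first subsequent recovery. A sketch of that stricter pairing, with timestamps as plain minute offsets and hypothetical data:

```python
def recovery_minutes(incidents, recoveries):
    """Pair each incident with the first recovery at or after it; return durations."""
    durations = []
    ordered = sorted(recoveries)
    for start in sorted(incidents):
        nxt = next((r for r in ordered if r >= start), None)
        if nxt is not None:  # unrecovered incidents are excluded from MTTR
            durations.append(nxt - start)
    return durations

# Hypothetical timestamps in minutes: two incidents, two recoveries
durations = recovery_minutes(incidents=[100, 300], recoveries=[130, 340])
mttr = sum(durations) / len(durations)  # (30 + 40) / 2 = 35.0
```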

Change Failure Rate

Change Failure Rate Calculation:

// Change failure rate (DORA metric)
// Definition: Percentage of deployments that result in a failure in production

let AllDeployments = ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "applied"
| where Namespace == "atp-production"
| extend 
    DeploymentTime = TimeGenerated,
    Service = extract(@"kustomization/(\S+)", 1, LogEntry, typeof(string)),
    Status = case(
        LogEntry contains "successfully", "Success",
        LogEntry contains "failed", "Failed",
        "Unknown"
    )
| where Status != "Unknown"
| project DeploymentTime, Service, Status;

let Failures = ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "failed" or LogEntry contains "error"
| where Namespace == "atp-production"
| extend 
    FailureTime = TimeGenerated,
    Service = extract(@"kustomization/(\S+)", 1, LogEntry, typeof(string))
| project FailureTime, Service;

AllDeployments
| join kind=leftouter Failures on Service
// Keep every deployment; only count a failure when it lands within the window
| extend FailedWithinWindow = iff(
    isnotnull(FailureTime)
    and FailureTime between (DeploymentTime .. DeploymentTime + 1h), // Failure within 1 hour of deployment
    1, 0
  )
// Collapse the fan-out from the join back to one row per deployment
| summarize DeploymentFailed = max(FailedWithinWindow) by DeploymentTime, Service, Status
| summarize 
    TotalDeployments = count(),
    FailedDeployments = sum(DeploymentFailed),
    ChangeFailureRate = (sum(DeploymentFailed) * 100.0) / count(),
    DORA_Level = case(
        (sum(DeploymentFailed) * 100.0) / count() < 5, "Elite", // Less than 5%
        (sum(DeploymentFailed) * 100.0) / count() < 15, "High", // Less than 15%
        (sum(DeploymentFailed) * 100.0) / count() < 45, "Medium", // Less than 45%
        "Low" // More than 45%
    )
    by Day = bin(DeploymentTime, 1d)
| order by Day desc
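The windowing rule above flags a deployment as failed only when an error follows it within one hour. The same rule as a sketch, with times in minutes and hypothetical data:

```python
FAILURE_WINDOW_MIN = 60  # failure counts only within 1 hour of the deployment

def change_failure_rate(deploy_times, failure_times):
    """Percent of deployments followed by a failure within the window."""
    failed = sum(
        any(d <= f <= d + FAILURE_WINDOW_MIN for f in failure_times)
        for d in deploy_times
    )
    return failed * 100.0 / len(deploy_times)

# Four deployments, two of which see a failure within their window
rate = change_failure_rate([0, 200, 400, 600], [30, 650])  # 2 of 4 -> 50.0
```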

Dashboard and Reporting

DORA Metrics Dashboard:

{
  "dashboard": {
    "title": "DORA Metrics Dashboard",
    "panels": [
      {
        "title": "Deployment Frequency",
        "targets": [{
          "expr": "sum(increase(fluxcd_kustomize_reconciliation_total{status=\"success\", namespace=\"atp-production\"}[1d]))",
          "legendFormat": "Deployments/Day"
        }],
        "type": "stat",
        "thresholds": {
          "steps": [
            { "value": 0, "color": "red" },
            { "value": 1, "color": "yellow" },
            { "value": 7, "color": "green" }
          ]
        }
      },
      {
        "title": "Lead Time for Changes",
        "targets": [{
          "expr": "avg(deployment_lead_time_hours)",
          "legendFormat": "Avg Lead Time (hours)"
        }],
        "type": "stat"
      },
      {
        "title": "Mean Time to Recovery (MTTR)",
        "targets": [{
          "expr": "avg(incident_recovery_time_minutes)",
          "legendFormat": "MTTR (minutes)"
        }],
        "type": "stat"
      },
      {
        "title": "Change Failure Rate",
        "targets": [{
          "expr": "sum(rate(deployment_failures_total[1d])) / sum(rate(deployments_total[1d]))",
          "legendFormat": "Failure Rate %"
        }],
        "type": "gauge",
        "thresholds": {
          "steps": [
            { "value": 0, "color": "green" },
            { "value": 0.05, "color": "yellow" },
            { "value": 0.15, "color": "red" }
          ]
        }
      }
    ]
  }
}

DORA Metrics Report:

// Comprehensive DORA metrics report
// Each sub-query must project the shared schema: Day, Metric, Value, DORA_Level
let DeploymentFrequency = ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "successfully applied"
| where Namespace == "atp-production"
| summarize Value = todouble(count()) by Day = bin(TimeGenerated, 1d)
| extend 
    Metric = "DeploymentFrequency",
    DORA_Level = case(Value >= 1, "Elite", Value >= 0.142, "High", Value >= 0.033, "Medium", "Low");
// LeadTime, MTTR, and ChangeFailureRate are built from the preceding queries,
// each projected to the same Day/Metric/Value/DORA_Level schema before the union:
// let LeadTime = ...;
// let MTTR = ...;
// let ChangeFailureRate = ...;

DeploymentFrequency
// | union LeadTime, MTTR, ChangeFailureRate
| project Day, Metric, Value, DORA_Level
| order by Day desc

Summary: Azure Monitor Integration & Observability

  • Azure Monitor Container Insights: Enabling Container Insights on AKS, metrics collection and aggregation, Log Analytics workspace configuration, cost optimization (sampling, retention)
  • Log Analytics Workspace: Workspace per environment or shared strategy, log retention policies (7 years for production), KQL query examples, custom log tables
  • FluxCD Metrics Export: Prometheus metrics from FluxCD, metrics scraping configuration, key metrics to monitor, alerting on FluxCD issues
  • Deployment Metrics: Sync status per application, reconciliation duration, reconciliation failure rate, drift detection events, deployment frequency
  • Application Health: Readiness probe success rate, pod restart count, HTTP error rates, response latency, integration with application metrics
  • Grafana Dashboards: FluxCD operational dashboard, deployment status dashboard, application health dashboard, DORA metrics dashboard
  • Azure Monitor Workbooks: Custom workbooks for GitOps, compliance reporting workbooks, cost analysis workbooks
  • Alerting: Sync failure alerts, drift detection alerts, deployment failure alerts, health check failure alerts, alert routing (email, Teams, PagerDuty)
  • Correlation: Linking Git commits to deployments, linking deployments to application metrics, correlation IDs throughout stack, distributed tracing with Application Insights
  • Compliance Evidence: Deployment audit trail in Log Analytics, 7-year retention, query examples for auditors, export for eDiscovery
  • DORA Metrics: Deployment frequency, lead time for changes, mean time to recovery (MTTR), change failure rate, dashboard and reporting

Rolling Updates & Deployment Strategies

Purpose: Define deployment strategies for ATP services including rolling updates, blue-green deployments, canary releases, and progressive delivery with Flagger, ensuring zero-downtime deployments, automated rollback capabilities, and risk mitigation through gradual traffic shifting and validation gates.


Kubernetes Rolling Updates

Default Rolling Update Strategy

Rolling Update Overview:

graph LR
    subgraph "Rolling Update Process"
        V1[V1 Pods<br/>3 replicas] --> V2[V1: 2 pods<br/>V2: 1 pod]
        V2 --> V3[V1: 1 pod<br/>V2: 2 pods]
        V3 --> V4[V2 Pods<br/>3 replicas]
    end

    style V1 fill:#90EE90
    style V4 fill:#90EE90
    style V2 fill:#FFE5B4
    style V3 fill:#FFE5B4
Hold "Alt" / "Option" to enable pan & zoom

Default Rolling Update Configuration:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
  namespace: atp-production
spec:
  replicas: 5
  strategy:
    type: RollingUpdate  # Default strategy
    rollingUpdate:
      maxSurge: 1        # Allow 1 extra pod during update
      maxUnavailable: 0  # No downtime allowed
  selector:
    matchLabels:
      app: atp-ingestion
  template:
    metadata:
      labels:
        app: atp-ingestion
        version: v1.2.3
    spec:
      containers:
      - name: atp-ingestion
        image: connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10

Rolling Update Strategy Types:

| Strategy Type | Description | Use Case |
|---|---|---|
| RollingUpdate | Gradually replaces old pods with new ones | Default for ATP (zero-downtime) |
| Recreate | Terminates all old pods before creating new ones | ❌ Not recommended (downtime) |

maxSurge and maxUnavailable Settings

maxSurge and maxUnavailable Configuration:

| Configuration | maxSurge | maxUnavailable | Effect |
|---|---|---|---|
| Zero Downtime | 1 | 0 | ATP Production: always maintain service availability |
| Fast Rollout | 2 | 1 | ⚠️ Test/Dev: faster updates, slight capacity reduction |
| Conservative | 1 | 1 | ⚠️ Staging: balanced approach |

Production Configuration:

# Production: Zero-downtime rolling update
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  replicas: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # Allow 1 extra pod (total: 6 pods during update)
      maxUnavailable: 0  # Always maintain 5 ready pods

Rolling Update Math:

  • Total Pods: 5 replicas
  • maxSurge: 1 (can have 6 pods total during update)
  • maxUnavailable: 0 (must always have 5 ready pods)
  • Update Process: Replace 1 pod at a time, wait for readiness, then replace next
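The bounds above follow directly from replicas, maxSurge, and maxUnavailable. A quick sketch of the arithmetic (absolute values only; Kubernetes also accepts percentage forms, which are not modeled here):

```python
def rollout_bounds(replicas: int, max_surge: int, max_unavailable: int):
    """Return (max total pods, min ready pods) during a rolling update."""
    return replicas + max_surge, replicas - max_unavailable

# Production settings from the Deployment above: 5 replicas, surge 1, unavailable 0
assert rollout_bounds(5, 1, 0) == (6, 5)   # up to 6 pods, never fewer than 5 ready
# Dev/test settings: 2 replicas, surge 1, unavailable 1
assert rollout_bounds(2, 1, 1) == (3, 1)   # faster, but can dip to 1 ready pod
```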

Dev/Test Configuration (Faster Rollout):

# Dev/Test: Faster rollout with slight capacity reduction
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # Allow 1 extra pod
      maxUnavailable: 1  # Can temporarily reduce to 1 pod

Rolling Update Process

Rolling Update Steps:

  1. Create New Pod: Kubernetes creates a new pod with new image
  2. Wait for Readiness: New pod must pass readiness probe
  3. Add to Service: New pod receives traffic from Service
  4. Terminate Old Pod: Old pod receives SIGTERM, drains connections
  5. Repeat: Process repeats for remaining pods

Rolling Update Visualization:

sequenceDiagram
    participant K8s as Kubernetes
    participant Old as Old Pods (v1.2.2)
    participant New as New Pods (v1.2.3)
    participant Svc as Service

    Note over K8s: Start Rolling Update
    K8s->>New: Create Pod 1 (v1.2.3)
    New->>New: Wait for Readiness Probe
    New->>Svc: Register with Service
    Svc->>New: Route traffic to Pod 1
    K8s->>Old: Terminate Pod 1 (SIGTERM)
    Old->>Svc: Drain connections
    Old->>Old: Graceful shutdown

    K8s->>New: Create Pod 2 (v1.2.3)
    New->>New: Wait for Readiness Probe
    New->>Svc: Register with Service
    Svc->>New: Route traffic to Pod 2
    K8s->>Old: Terminate Pod 2 (SIGTERM)

    Note over K8s: Repeat until all pods updated
Hold "Alt" / "Option" to enable pan & zoom

Monitor Rolling Update Progress:

# Watch rolling update progress
kubectl rollout status deployment/atp-ingestion -n atp-production

# Get rollout history
kubectl rollout history deployment/atp-ingestion -n atp-production

# Describe rollout
kubectl describe deployment atp-ingestion -n atp-production

Monitoring Rollout Progress

Rollout Status Command:

# Monitor rollout in real-time
kubectl rollout status deployment/atp-ingestion -n atp-production --timeout=10m

# Output example:
# Waiting for deployment "atp-ingestion" rollout to finish: 2 of 5 updated replicas are available...
# Waiting for deployment "atp-ingestion" rollout to finish: 3 of 5 updated replicas are available...
# Waiting for deployment "atp-ingestion" rollout to finish: 4 of 5 updated replicas are available...
# deployment "atp-ingestion" successfully rolled out

Prometheus Metrics for Rollout:

# Rolling update progress
kube_deployment_status_replicas_available{deployment="atp-ingestion"} 
/ 
kube_deployment_status_replicas{deployment="atp-ingestion"}

# Old vs new pods during rollout (pod labels exposed by kube-state-metrics)
count by (label_version) (
  kube_pod_labels{namespace="atp-production", label_app="atp-ingestion"}
)

KQL Query for Rollout Status:

// Rolling update status from Container Insights
InsightsMetrics
| where Origin == "container.azm.ms"
| where Name == "k8sPodCount"
| where Namespace == "atp-production"
| extend 
    Deployment = extract(@"deployment=(\S+)", 1, Tags, typeof(string)),
    PodVersion = extract(@"version=(v\d+\.\d+\.\d+)", 1, Tags, typeof(string))
| summarize 
    PodCount = sum(Val)
    by Deployment, PodVersion, bin(TimeGenerated, 1m)
| order by TimeGenerated desc

Blue-Green Deployments

Blue-Green Concept and Benefits

Blue-Green Deployment Architecture:

graph TB
    subgraph "Traffic Router"
        ING[Ingress Controller]
    end
    subgraph "Blue Environment (Current)"
        BLUE_NS[Namespace: atp-blue]
        BLUE_SVC[Service: atp-ingestion-blue]
        BLUE_PODS[Pods: v1.2.2<br/>5 replicas]
    end
    subgraph "Green Environment (New)"
        GREEN_NS[Namespace: atp-green]
        GREEN_SVC[Service: atp-ingestion-green]
        GREEN_PODS[Pods: v1.2.3<br/>5 replicas]
    end

    ING -->|Current| BLUE_SVC
    ING -.->|Switch| GREEN_SVC
    BLUE_SVC --> BLUE_PODS
    GREEN_SVC --> GREEN_PODS

    style BLUE_PODS fill:#4A90E2
    style GREEN_PODS fill:#90EE90

Blue-Green Benefits:

| Benefit | Description | ATP Use Case |
| --- | --- | --- |
| Instant Rollback | Switch traffic back to blue instantly | ✅ Critical production updates |
| Zero Downtime | Green environment fully ready before switch | ✅ High availability requirement |
| Testing | Validate green environment before traffic | ✅ Production-like validation |
| Risk Reduction | Keep blue environment running during switch | ✅ Critical services |

Blue-Green vs Rolling Update:

| Aspect | Blue-Green | Rolling Update | ATP Decision |
| --- | --- | --- | --- |
| Downtime | ✅ Zero | ✅ Zero | ✅ Both viable |
| Rollback Speed | ✅ Instant (traffic switch) | ⚠️ Slow (re-rollout) | Blue-Green for critical |
| Resource Usage | ❌ 2x during switch | ✅ Efficient | ⚠️ Acceptable for critical services |
| Complexity | ❌ Higher | ✅ Lower | ⚠️ Blue-Green for staging/production |
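The resource-usage trade-off between the two strategies is simple arithmetic; a quick sketch with illustrative replica counts:

```shell
# Peak pod counts during a switch (illustrative values)
REPLICAS=5
MAX_SURGE=1

# Blue-green runs both environments at full size during the switch
BLUE_GREEN_PEAK=$((REPLICAS * 2))

# A rolling update only exceeds the target replica count by maxSurge
ROLLING_PEAK=$((REPLICAS + MAX_SURGE))

echo "blue-green peak pods: $BLUE_GREEN_PEAK"
echo "rolling peak pods: $ROLLING_PEAK"
```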

Implementation with Namespace Switching

Blue Namespace Configuration:

# apps/atp-ingestion/overlays/production-blue/namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: atp-production-blue
  labels:
    environment: production
    deployment-color: blue
---
# Blue Service
apiVersion: v1
kind: Service
metadata:
  name: atp-ingestion-blue
  namespace: atp-production-blue
spec:
  selector:
    app: atp-ingestion
    version: v1.2.2
  ports:
  - port: 80
    targetPort: 8080
---
# Blue Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion-blue
  namespace: atp-production-blue
spec:
  replicas: 5
  selector:
    matchLabels:
      app: atp-ingestion
      version: v1.2.2
  template:
    metadata:
      labels:
        app: atp-ingestion
        version: v1.2.2
        deployment-color: blue
    spec:
      containers:
      - name: atp-ingestion
        image: connectsoft.azurecr.io/atp/ingestion:v1.2.2-def456g
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080

Green Namespace Configuration:

# apps/atp-ingestion/overlays/production-green/namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: atp-production-green
  labels:
    environment: production
    deployment-color: green
---
# Green Service
apiVersion: v1
kind: Service
metadata:
  name: atp-ingestion-green
  namespace: atp-production-green
spec:
  selector:
    app: atp-ingestion
    version: v1.2.3
  ports:
  - port: 80
    targetPort: 8080
---
# Green Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion-green
  namespace: atp-production-green
spec:
  replicas: 5
  selector:
    matchLabels:
      app: atp-ingestion
      version: v1.2.3
  template:
    metadata:
      labels:
        app: atp-ingestion
        version: v1.2.3
        deployment-color: green
    spec:
      containers:
      - name: atp-ingestion
        image: connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080

Traffic Routing with Ingress

Ingress with Blue-Green Routing:

# Ingress routing to blue (current)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: atp-ingestion-ingress
  namespace: atp-production
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
    nginx.ingress.kubernetes.io/upstream-vhost: atp-ingestion-blue.atp-production-blue.svc.cluster.local
spec:
  ingressClassName: nginx
  rules:
  - host: atp-ingestion.connectsoft.example
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: atp-ingestion-blue
            port:
              number: 80
        # Cross-namespace service reference
        # Requires ExternalName Service in production namespace

Cross-Namespace Service Reference:

# ExternalName Service in production namespace pointing to blue
apiVersion: v1
kind: Service
metadata:
  name: atp-ingestion-blue
  namespace: atp-production
spec:
  type: ExternalName
  externalName: atp-ingestion-blue.atp-production-blue.svc.cluster.local
---
# Switch to green (update Ingress)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: atp-ingestion-ingress
  namespace: atp-production
spec:
  rules:
  - host: atp-ingestion.connectsoft.example
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: atp-ingestion-green  # Switched to green
            port:
              number: 80

Blue-Green Switch Script:

#!/bin/bash
# scripts/blue-green-switch.sh

ENVIRONMENT="${1:-production}"
CURRENT_COLOR="${2:-blue}"
NEW_COLOR="${3:-green}"

echo "🔄 Switching from $CURRENT_COLOR to $NEW_COLOR environment"

# Wait for the new-color pods to be ready BEFORE switching traffic
kubectl wait --for=condition=available --timeout=5m \
  deployment/atp-ingestion-$NEW_COLOR -n atp-$ENVIRONMENT-$NEW_COLOR

# Update Ingress to route to new color
kubectl patch ingress atp-ingestion-ingress -n atp-$ENVIRONMENT --type=json \
  -p="[{\"op\": \"replace\", \"path\": \"/spec/rules/0/http/paths/0/backend/service/name\", \"value\": \"atp-ingestion-$NEW_COLOR\"}]"

# Verify traffic is routing to the new color
kubectl get ingress atp-ingestion-ingress -n atp-$ENVIRONMENT -o jsonpath='{.spec.rules[0].http.paths[0].backend.service.name}'

echo "✅ Traffic switched to $NEW_COLOR environment"

Rollback to Blue Environment

Instant Rollback to Blue:

#!/bin/bash
# scripts/blue-green-rollback.sh

ENVIRONMENT="${1:-production}"
CURRENT_COLOR="${2:-green}"
ROLLBACK_COLOR="${3:-blue}"

echo "⏪ Rolling back to $ROLLBACK_COLOR environment"

# Switch traffic back to blue
kubectl patch ingress atp-ingestion-ingress -n atp-$ENVIRONMENT --type=json \
  -p="[{\"op\": \"replace\", \"path\": \"/spec/rules/0/http/paths/0/backend/service/name\", \"value\": \"atp-ingestion-$ROLLBACK_COLOR\"}]"

echo "✅ Rollback complete - traffic routed to $ROLLBACK_COLOR"

# Optionally scale down green environment to save resources
# kubectl scale deployment atp-ingestion-$CURRENT_COLOR -n atp-$ENVIRONMENT-$CURRENT_COLOR --replicas=0

When to Use Blue-Green

Blue-Green Deployment Decision Matrix:

| Service Type | Blue-Green Recommended? | Rationale |
| --- | --- | --- |
| Critical Services (Gateway, Authentication) | Yes | Instant rollback capability |
| Database Migrations | Yes | Test new version before traffic |
| High-Traffic Services | Yes | Reduce risk of performance issues |
| Low-Risk Updates | ⚠️ Optional | Rolling update may be sufficient |
| Resource-Constrained | ❌ No | 2x resource usage during switch |

ATP Blue-Green Strategy:

- Production Critical Services: Blue-Green for gateway, authentication, ingestion
- Production Standard Services: Rolling update sufficient
- Staging: Blue-Green for validation before production promotion
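The strategy above can be encoded as a small helper; `deploy_strategy` is a hypothetical sketch, not part of the ATP tooling, and the service names in the case arms are the examples from the strategy list:

```shell
# Pick a deployment strategy per service criticality (sketch)
deploy_strategy() {
  case "$1" in
    gateway|authentication|ingestion)
      echo "blue-green" ;;      # critical production services
    *)
      echo "rolling-update" ;;  # standard services
  esac
}

deploy_strategy gateway      # -> blue-green
deploy_strategy reporting    # -> rolling-update
```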


Canary Releases

Canary Deployment Concept

Canary Release Architecture:

graph TB
    subgraph "Traffic Router"
        ING[Ingress Controller]
        SVC[Service]
    end
    subgraph "Stable Version"
        STABLE[Stable Pods<br/>v1.2.2<br/>90% traffic]
    end
    subgraph "Canary Version"
        CANARY[Canary Pods<br/>v1.2.3<br/>10% traffic]
    end

    ING -->|90%| STABLE
    ING -->|10%| CANARY
    SVC --> STABLE
    SVC --> CANARY

    style STABLE fill:#4A90E2
    style CANARY fill:#FFD700

Canary Release Benefits:

| Benefit | Description |
| --- | --- |
| Risk Reduction | Test new version with limited traffic |
| Gradual Rollout | Increase traffic percentage gradually (10% → 50% → 100%) |
| Automated Validation | Monitor metrics and auto-rollback on issues |
| User Impact Minimization | Only a small percentage of users affected if issues occur |
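A gradual rollout pairs each canary weight with its stable complement. A minimal sketch of that schedule (the 10/50/100 steps mirror the progression described here; `shift_schedule` is a hypothetical helper):

```shell
# Weighted route pairs for the gradual shift (stable weight is the complement)
shift_schedule() {
  for CANARY in 10 50 100; do
    STABLE=$((100 - CANARY))
    echo "canary=${CANARY}% stable=${STABLE}%"
  done
}

shift_schedule
```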

Traffic Splitting Strategies

Traffic Splitting with Service Mesh (Istio):

# Istio VirtualService for canary traffic splitting
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: atp-ingestion-canary
  namespace: atp-production
spec:
  hosts:
  - atp-ingestion.connectsoft.example
  http:
  - match:
    - headers:
        canary:
          exact: "true"
    route:
    - destination:
        host: atp-ingestion
        subset: canary
      weight: 100
  - route:
    - destination:
        host: atp-ingestion
        subset: stable
      weight: 90  # 90% to stable
    - destination:
        host: atp-ingestion
        subset: canary
      weight: 10  # 10% to canary
---
# DestinationRule for stable and canary subsets
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: atp-ingestion
  namespace: atp-production
spec:
  host: atp-ingestion
  subsets:
  - name: stable
    labels:
      version: v1.2.2
  - name: canary
    labels:
      version: v1.2.3

Traffic Splitting with Nginx Ingress:

# Nginx Ingress with canary annotations
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: atp-ingestion-canary
  namespace: atp-production
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"  # 10% traffic to canary
    nginx.ingress.kubernetes.io/canary-by-header: "canary"
    nginx.ingress.kubernetes.io/canary-by-header-value: "true"
spec:
  ingressClassName: nginx
  rules:
  - host: atp-ingestion.connectsoft.example
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: atp-ingestion-canary
            port:
              number: 80
---
# Main Ingress (90% traffic)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: atp-ingestion-main
  namespace: atp-production
spec:
  ingressClassName: nginx
  rules:
  - host: atp-ingestion.connectsoft.example
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: atp-ingestion-stable
            port:
              number: 80

Service Mesh Requirement (Linkerd/Istio)

Service Mesh Comparison:

| Feature | Istio | Linkerd | ATP Selection |
| --- | --- | --- | --- |
| Traffic Splitting | ✅ Advanced | ✅ Simple | Istio (more features) |
| Observability | ✅ Comprehensive | ✅ Good | Istio |
| Resource Usage | ❌ High | ✅ Low | ⚠️ Acceptable for production |
| Learning Curve | ❌ Steep | ✅ Easy | ⚠️ Investment required |

ATP Decision: Istio (for advanced canary features)

Gradual Traffic Shift (10% → 50% → 100%)

Progressive Traffic Shift:

# Stage 1: 10% canary
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: atp-ingestion-canary
spec:
  http:
  - route:
    - destination:
        host: atp-ingestion
        subset: stable
      weight: 90
    - destination:
        host: atp-ingestion
        subset: canary
      weight: 10  # 10% canary
---
# Stage 2: 50% canary (after validation)
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: atp-ingestion-canary
spec:
  http:
  - route:
    - destination:
        host: atp-ingestion
        subset: stable
      weight: 50
    - destination:
        host: atp-ingestion
        subset: canary
      weight: 50  # 50% canary
---
# Stage 3: 100% canary (promote to stable)
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: atp-ingestion-canary
spec:
  http:
  - route:
    - destination:
        host: atp-ingestion
        subset: canary
      weight: 100  # 100% canary (new stable)

Automated Traffic Shift Script:

#!/bin/bash
# scripts/canary-traffic-shift.sh

CANARY_WEIGHT="${1:-10}"
NAMESPACE="${2:-atp-production}"

echo "🎯 Shifting $CANARY_WEIGHT% traffic to canary"

# Update VirtualService weights
# Assumes spec.http[0] holds the weighted routes: stable at index 0, canary at
# index 1 (adjust the indices if a header-match route precedes them)
kubectl patch virtualservice atp-ingestion-canary -n $NAMESPACE --type=json \
  -p="[{\"op\": \"replace\", \"path\": \"/spec/http/0/route/0/weight\", \"value\": $((100 - CANARY_WEIGHT))}, {\"op\": \"replace\", \"path\": \"/spec/http/0/route/1/weight\", \"value\": $CANARY_WEIGHT}]"

echo "✅ Traffic shifted: $CANARY_WEIGHT% canary, $((100 - CANARY_WEIGHT))% stable"

# Monitor for 5 minutes before next stage
sleep 300

Automated Canary Analysis

Canary Analysis Metrics:

# Flagger Canary with automated analysis (see Flagger section)
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: atp-ingestion
  namespace: atp-production
spec:
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99
      interval: 1m
    - name: request-duration
      thresholdRange:
        max: 500
      interval: 1m
    - name: error-rate
      thresholdRange:
        max: 1
      interval: 1m

Progressive Delivery with Flagger

Flagger Overview and Architecture

Flagger Architecture:

graph TB
    subgraph "GitOps Repository"
        GIT[Git Commit<br/>New Version]
    end
    subgraph "Flagger Controller"
        FLAGGER[Flagger<br/>Canary Controller]
        METRICS[Metrics Provider<br/>Prometheus]
    end
    subgraph "Traffic Router"
        ISTIO[Istio VirtualService]
    end
    subgraph "Deployment"
        STABLE[Stable Deployment]
        CANARY[Canary Deployment]
    end

    GIT -->|Triggers| FLAGGER
    FLAGGER -->|Creates| CANARY
    FLAGGER -->|Monitors| METRICS
    METRICS -->|Validates| FLAGGER
    FLAGGER -->|Updates| ISTIO
    ISTIO -->|Routes| STABLE
    ISTIO -->|Routes| CANARY

    style FLAGGER fill:#FFD700
    style CANARY fill:#90EE90

Flagger Benefits:

| Benefit | Description |
| --- | --- |
| Automated Rollout | Automatic canary promotion based on metrics |
| Automated Rollback | Automatic rollback on threshold violations |
| Traffic Shifting | Gradual traffic increase (10% → 50% → 100%) |
| Metric Validation | Validate latency, error rate, custom metrics |
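Given `stepWeight`, `maxWeight`, and `interval` from a Canary analysis block, the minimum time a healthy canary needs to reach `maxWeight` follows directly. A rough sketch (assumes every metric check passes and ignores the promotion phase):

```shell
# Minimum time for a healthy canary to reach maxWeight (rough estimate)
STEP_WEIGHT=10
MAX_WEIGHT=50
INTERVAL_MINUTES=1

# Number of weight increases needed
STEPS=$((MAX_WEIGHT / STEP_WEIGHT))
MINUTES=$((STEPS * INTERVAL_MINUTES))

echo "weight steps: $STEPS"
echo "minutes to reach maxWeight: $MINUTES"
```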

Installation and Configuration

Flagger Installation via Helm:

# Add Flagger Helm repository
helm repo add flagger https://flagger.app
helm repo update

# Install Flagger with Istio support
helm upgrade --install flagger flagger/flagger \
  --namespace flagger-system \
  --create-namespace \
  --set meshProvider=istio \
  --set metricsServer=http://prometheus.monitoring:9090

Flagger Installation via Pulumi:

// Install Flagger via Helm chart
var flaggerRelease = new Pulumi.Kubernetes.Helm.V3.Release("flagger", new()
{
    Chart = "flagger",
    RepositoryOpts = new Pulumi.Kubernetes.Helm.V3.Inputs.RepositoryOptsArgs
    {
        Repo = "https://flagger.app",
    },
    Namespace = "flagger-system",
    CreateNamespace = true,
    Values = new Dictionary<string, object>
    {
        { "meshProvider", "istio" },
        { "metricsServer", "http://prometheus.monitoring:9090" },
    },
});

Canary Resource Definition

Flagger Canary Configuration:

# apps/atp-ingestion/overlays/production/canary.yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: atp-ingestion
  namespace: atp-production
spec:
  # Target deployment to manage
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: atp-ingestion

  # Service to create/update
  service:
    port: 8080
    targetPort: 8080
    portDiscovery: true

  # Traffic management
  provider: istio
  trafficRouting:
    istio:
      virtualService:
        hosts:
        - atp-ingestion.connectsoft.example
        gateways:
        - public-gateway
      destinationRule:
        host: atp-ingestion
        subsets:
        - name: stable
          labels:
            version: stable
        - name: canary
          labels:
            version: canary

  # Canary analysis configuration
  analysis:
    interval: 1m           # Check metrics every 1 minute
    threshold: 5           # Max failed metric checks before rollback
    maxWeight: 50          # Maximum canary traffic (50%)
    stepWeight: 10         # Traffic increase per step (10%)
    stepWeightPromotion: 50 # Traffic step during the promotion phase
    stepWeights: [10, 20, 30, 40, 50]  # Custom traffic steps (overrides stepWeight)

    # Metrics to validate
    metrics:
    # Request success rate
    - name: request-success-rate
      thresholdRange:
        min: 99            # Minimum 99% success rate
      interval: 1m
      queryTemplate: |
        sum(rate(istio_requests_total{
          destination_workload_namespace="{{ namespace }}",
          destination_workload=~"{{ target }}",
          response_code!~"5.."
        }[1m]))
        /
        sum(rate(istio_requests_total{
          destination_workload_namespace="{{ namespace }}",
          destination_workload=~"{{ target }}"
        }[1m]))
        * 100

    # Request duration (p95 latency)
    - name: request-duration
      thresholdRange:
        max: 500           # Maximum 500ms p95 latency
      interval: 1m
      queryTemplate: |
        histogram_quantile(0.95,
          sum(rate(istio_request_duration_milliseconds_bucket{
            destination_workload_namespace="{{ namespace }}",
            destination_workload=~"{{ target }}"
          }[1m])) by (le)
        )

    # Error rate
    - name: error-rate
      thresholdRange:
        max: 1             # Maximum 1% error rate
      interval: 1m
      queryTemplate: |
        sum(rate(istio_requests_total{
          destination_workload_namespace="{{ namespace }}",
          destination_workload=~"{{ target }}",
          response_code=~"5.."
        }[1m]))
        /
        sum(rate(istio_requests_total{
          destination_workload_namespace="{{ namespace }}",
          destination_workload=~"{{ target }}"
        }[1m]))
        * 100

    # Custom business metric
    - name: business-metric-threshold
      thresholdRange:
        min: 95            # Minimum business metric value
      interval: 1m
      queryTemplate: |
        avg(rate(atp_business_metric_total{
          service="{{ target }}",
          namespace="{{ namespace }}"
        }[1m]))

  # Webhooks for pre/post deployment validation
  webhooks:
  # Pre-rollout validation (smoke tests)
  - name: smoke-tests
    type: pre-rollout
    url: http://smoke-tests.atp-production:8080/validate
    timeout: 30s
    metadata:
      type: "bash"
      cmd: "kubectl exec -n atp-production deployment/smoke-tests -- /bin/sh -c 'curl -f http://atp-ingestion-canary:8080/health || exit 1'"

  # Post-rollout validation
  - name: integration-tests
    type: rollout
    url: http://integration-tests.atp-production:8080/validate
    timeout: 2m
    metadata:
      type: "bash"
      cmd: "kubectl exec -n atp-production deployment/integration-tests -- /bin/sh -c 'curl -f http://atp-ingestion-canary:8080/api/health || exit 1'"

  # Load testing
  - name: load-test
    type: rollout
    url: http://load-test.atp-production:8080/start
    timeout: 5m
    metadata:
      cmd: "kubectl exec -n atp-production deployment/load-test -- /bin/sh -c 'artillery run test.yaml'"

Automated Rollback on Metric Thresholds

Flagger Rollback Configuration:

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: atp-ingestion
spec:
  analysis:
    # Rollback triggers
    alerts:
    # Alert on high error rate
    - name: "error-rate-high"
      severity: error
      providerRef:
        name: prometheus-alerts
        namespace: monitoring

    # Rollback on metric threshold violation
    metrics:
    - name: error-rate
      thresholdRange:
        max: 1             # Rollback if error rate > 1%
      interval: 30s
      # A failed check counts toward analysis.threshold; Flagger rolls back
      # automatically once the threshold is exceeded

  # Skip traffic increase if metrics fail
  skipAnalysis: false  # Don't skip validation

  # Automatic rollback on failure
  revertOnDeletion: true

Flagger Rollback Status:

# Check canary status
kubectl get canary atp-ingestion -n atp-production

# Inspect canary events and rollout status
kubectl describe canary atp-ingestion -n atp-production

# Check Flagger logs
kubectl logs -n flagger-system deployment/flagger -f

Integration with Service Mesh

Flagger with Istio Integration:

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: atp-ingestion
spec:
  provider: istio
  trafficRouting:
    istio:
      virtualService:
        hosts:
        - atp-ingestion.connectsoft.example
        gateways:
        - istio-system/public-gateway
      destinationRule:
        host: atp-ingestion
        subsets:
        - name: stable
          trafficPolicy:
            loadBalancer:
              consistentHash:
                httpHeaderName: X-User-ID  # Session affinity
        - name: canary
          trafficPolicy:
            loadBalancer:
              consistentHash:
                httpHeaderName: X-User-ID

Feature Flags Integration

LaunchDarkly or Azure App Configuration

Feature Flags Strategy:

| Feature Flag Provider | Pros | Cons | ATP Selection |
| --- | --- | --- | --- |
| LaunchDarkly | ✅ Advanced targeting, A/B testing | ❌ Cost, external dependency | ⚠️ Consider for advanced use cases |
| Azure App Configuration | ✅ Native Azure, integrated | ⚠️ Fewer features than LaunchDarkly | ATP Default (cost-effective) |
| Custom Solution | ✅ Full control | ❌ Maintenance overhead | ❌ Not recommended |

ATP Decision: Azure App Configuration (native Azure integration)

Feature Flag-Based Rollout

Azure App Configuration Setup:

// Configure Azure App Configuration
services.AddAzureAppConfiguration(options =>
{
    options.Connect(connectionString)
        .Select(KeyFilter.Any, LabelFilter.Null)
        .Select(KeyFilter.Any, "Production")
        .ConfigureRefresh(refresh =>
        {
            refresh.Register("FeatureFlags:CanaryEnabled", refreshAll: true)
                .SetCacheExpiration(TimeSpan.FromSeconds(30));
        });
});

Feature Flag Integration in Application:

// C#: Feature flag for canary deployment
public class FeatureFlagService
{
    private readonly IConfiguration _configuration;

    public FeatureFlagService(IConfiguration configuration)
    {
        _configuration = configuration;
    }

    public bool IsCanaryEnabled()
    {
        return _configuration.GetValue<bool>("FeatureFlags:CanaryEnabled", defaultValue: false);
    }

    public int GetCanaryTrafficPercentage()
    {
        return _configuration.GetValue<int>("FeatureFlags:CanaryTrafficPercentage", defaultValue: 0);
    }
}

// Use feature flag to control behavior
[ApiController]
[Route("[controller]")]
public class IngestionController : ControllerBase
{
    private readonly FeatureFlagService _featureFlags;

    public IngestionController(FeatureFlagService featureFlags)
    {
        _featureFlags = featureFlags;
    }

    [HttpPost("events")]
    public async Task<IActionResult> IngestEvent([FromBody] Event evt)
    {
        // New feature enabled via feature flag
        if (_featureFlags.IsCanaryEnabled())
        {
            // Use new processing logic
            await ProcessEventV2(evt);
        }
        else
        {
            // Use stable processing logic
            await ProcessEventV1(evt);
        }

        return Ok();
    }
}

Feature Flag Configuration:

# Azure App Configuration via ConfigMap (reference)
apiVersion: v1
kind: ConfigMap
metadata:
  name: feature-flags
  namespace: atp-production
data:
  FeatureFlags__CanaryEnabled: "false"
  FeatureFlags__CanaryTrafficPercentage: "0"
---
# External Secret for Azure App Configuration connection
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-config-connection
  namespace: atp-production
spec:
  secretStoreRef:
    name: azure-keyvault
    kind: ClusterSecretStore
  data:
  - secretKey: AppConfigConnectionString
    remoteRef:
      key: connection-strings/app-config-connection-string

Gradual Feature Enablement

Gradual Feature Rollout:

// Gradually enable feature for a stable percentage of users
public class FeatureFlagService
{
    private readonly IConfiguration _configuration;

    public FeatureFlagService(IConfiguration configuration)
    {
        _configuration = configuration;
    }

    public bool ShouldEnableFeature(string userId)
    {
        var percentage = GetFeatureRolloutPercentage();
        var userHash = GetUserHash(userId);
        return (userHash % 100) < percentage;
    }

    private int GetFeatureRolloutPercentage()
    {
        // Gradually increase: 10% → 25% → 50% → 100%
        return _configuration.GetValue<int>("FeatureFlags:RolloutPercentage", defaultValue: 0);
    }

    private static int GetUserHash(string userId)
    {
        // Deterministic hash (string.GetHashCode is randomized per process in .NET)
        unchecked
        {
            var hash = 23;
            foreach (var c in userId) hash = hash * 31 + c;
            return hash & int.MaxValue;
        }
    }
}
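The same bucketing logic can be sketched in shell for quick experiments; `rollout_bucket` and `should_enable` are hypothetical helpers, with `cksum` standing in for a stable hash:

```shell
# Deterministic percentage bucketing (mirrors the C# logic above)
rollout_bucket() {
  # Hash the user id into a stable bucket in 0..99
  hash=$(printf '%s' "$1" | cksum | cut -d' ' -f1)
  echo $((hash % 100))
}

should_enable() {
  bucket=$(rollout_bucket "$1")
  if [ "$bucket" -lt "$2" ]; then echo "enabled"; else echo "disabled"; fi
}

should_enable "user-42" 100   # always enabled at 100%
should_enable "user-42" 0     # always disabled at 0%
```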

Kill Switch for Problem Features

Kill Switch Implementation:

// Kill switch for emergency feature disable
public class FeatureFlagService
{
    public bool IsFeatureKilled(string featureName)
    {
        return _configuration.GetValue<bool>($"FeatureFlags:KillSwitch:{featureName}", defaultValue: false);
    }
}

// Use kill switch
if (_featureFlags.IsFeatureKilled("NewProcessingLogic"))
{
    // Immediately fall back to stable logic
    await ProcessEventV1(evt);
}
else
{
    await ProcessEventV2(evt);
}

Pre-Deployment Validation

Smoke Tests Before Traffic Routing

Smoke Test Webhook:

# Flagger pre-rollout webhook
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: atp-ingestion
spec:
  analysis:
    webhooks:
    - name: smoke-tests
      type: pre-rollout
      url: http://smoke-tests.atp-production:8080/validate
      timeout: 30s
      metadata:
        type: "bash"
        cmd: |
          kubectl exec -n atp-production deployment/smoke-tests -- /bin/sh -c '
            # Health check
            curl -f http://atp-ingestion-canary:8080/health/live || exit 1
            curl -f http://atp-ingestion-canary:8080/health/ready || exit 1

            # Basic API test
            curl -f -X POST http://atp-ingestion-canary:8080/api/events \
              -H "Content-Type: application/json" \
              -d "{\"eventType\":\"test\"}" || exit 1
          '

Smoke Test Job:

# Pre-deployment smoke test job
apiVersion: batch/v1
kind: Job
metadata:
  name: smoke-tests-pre-deploy
  namespace: atp-production
spec:
  template:
    spec:
      containers:
      - name: smoke-tests
        image: connectsoft.azurecr.io/atp/smoke-tests:latest
        env:
        - name: TARGET_URL
          value: "http://atp-ingestion-canary:8080"
        command:
        - /bin/sh
        - -c
        - |
          echo "Running smoke tests..."
          # Health checks
          curl -f $TARGET_URL/health/live || exit 1
          curl -f $TARGET_URL/health/ready || exit 1

          # API validation
          response=$(curl -s -X POST $TARGET_URL/api/events \
            -H "Content-Type: application/json" \
            -d '{"eventType":"test"}')

          if [ $? -ne 0 ]; then
            echo "Smoke tests failed"
            exit 1
          fi

          echo "Smoke tests passed"
      restartPolicy: Never

Integration Tests in Canary

Integration Test Webhook:

# Flagger rollout webhook for integration tests
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: atp-ingestion
spec:
  analysis:
    webhooks:
    - name: integration-tests
      type: rollout
      url: http://integration-tests.atp-production:8080/validate
      timeout: 5m
      metadata:
        type: "bash"
        cmd: |
          kubectl exec -n atp-production deployment/integration-tests -- /bin/sh -c '
            # Run integration test suite
            dotnet test IntegrationTests.csproj \
              --filter "Category=CanaryValidation" \
              --logger "trx;LogFileName=results.trx" \
              --results-directory /tmp/results

            # Check test results
            if [ $? -ne 0 ]; then
              echo "Integration tests failed"
              exit 1
            fi
          '

Database Migration Checks

Database Migration Validation:

# Pre-deployment database migration check
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migration-check
  namespace: atp-production
spec:
  template:
    spec:
      containers:
      - name: migration-check
        image: connectsoft.azurecr.io/atp/migration-tool:latest
        env:
        - name: CONNECTION_STRING
          valueFrom:
            secretKeyRef:
              name: sql-connection-string
              key: connection-string
        command:
        - /bin/sh
        - -c
        - |
          echo "Checking database migrations..."

          # Check if pending migrations exist
          dotnet ef migrations list --connection "$CONNECTION_STRING"

          # Generate the migration SQL without applying it (dry-run equivalent)
          dotnet ef migrations script --idempotent -o /tmp/migration.sql

          if [ $? -ne 0 ]; then
            echo "Database migration validation failed"
            exit 1
          fi

          echo "Database migrations validated"
      restartPolicy: Never

Dependency Availability Checks

Dependency Check Script:

#!/bin/bash
# scripts/pre-deployment-checks.sh

echo "🔍 Running pre-deployment validation checks..."

# Check Redis availability
redis-cli -h redis.atp-production ping || {
  echo "❌ Redis not available"
  exit 1
}

# Check SQL Database connectivity
sqlcmd -S sql-server.database.windows.net -U $DB_USER -P $DB_PASSWORD -Q "SELECT 1" || {
  echo "❌ SQL Database not accessible"
  exit 1
}

# Check Service Bus
az servicebus queue show --namespace-name atp-servicebus --resource-group atp-production --name audit-events || {
  echo "❌ Service Bus not accessible"
  exit 1
}

# Check Key Vault
az keyvault secret show --vault-name atp-keyvault --name test-secret || {
  echo "❌ Key Vault not accessible"
  exit 1
}

echo "✅ All dependency checks passed"
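Dependency probes like these are often flaky on cold starts. A small retry wrapper makes them more robust; this is a sketch, with `retry` as a hypothetical helper using a fixed 1-second backoff:

```shell
# Retry a command up to N times before giving up
retry() {
  attempts=$1; shift
  n=1
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      return 1
    fi
    n=$((n + 1))
    sleep 1
  done
}

# Example: wrap a probe that may need a few attempts
retry 3 true && echo "dependency reachable"
```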

Post-Deployment Validation

Health Check Monitoring

Post-Deployment Health Checks:

# Flagger post-rollout validation
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: atp-ingestion
spec:
  analysis:
    webhooks:
    - name: post-deployment-health
      type: post-rollout
      url: http://health-monitor.atp-production:8080/validate
      timeout: 2m
      metadata:
        type: "bash"
        cmd: |
          kubectl exec -n atp-production deployment/health-monitor -- /bin/sh -c '
            # Monitor health for 2 minutes
            for i in $(seq 1 24); do
              health=$(curl -s -o /dev/null -w "%{http_code}" http://atp-ingestion-canary:8080/health/live)
              if [ "$health" != "200" ]; then
                echo "Health check failed: $health"
                exit 1
              fi
              sleep 5
            done
            echo "Health checks passed"
          '

Error Rate Thresholds

Error Rate Validation:

# Flagger metric for error rate validation
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: atp-ingestion
spec:
  analysis:
    metrics:
    - name: error-rate
      thresholdRange:
        max: 1  # Maximum 1% error rate
      interval: 1m
      queryTemplate: |
        sum(rate(http_requests_total{
          service="{{ target }}",
          status=~"5.."
        }[1m]))
        /
        sum(rate(http_requests_total{
          service="{{ target }}"
        }[1m]))
        * 100

Latency Thresholds

Latency Validation:

# Flagger metric for latency validation
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: atp-ingestion
spec:
  analysis:
    metrics:
    - name: request-duration-p95
      thresholdRange:
        max: 500  # Maximum 500ms p95 latency
      interval: 1m
      queryTemplate: |
        histogram_quantile(0.95,
          sum(rate(http_request_duration_seconds_bucket{
            service="{{ target }}"
          }[1m])) by (le)
        ) * 1000

Business Metric Validation

Custom Business Metric:

# Flagger metric for business metric validation
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: atp-ingestion
spec:
  analysis:
    metrics:
    - name: event-processing-success-rate
      thresholdRange:
        min: 99.5  # Minimum 99.5% success rate
      interval: 1m
      queryTemplate: |
        sum(rate(atp_events_processed_total{
          service="{{ target }}",
          status="success"
        }[1m]))
        /
        sum(rate(atp_events_processed_total{
          service="{{ target }}"
        }[1m]))
        * 100

Automatic Rollback Triggers

Error Rate Exceeds Threshold

Error Rate Rollback:

# Flagger automatic rollback on error rate
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: atp-ingestion
spec:
  analysis:
    interval: 30s
    # Roll back after this many consecutive failed metric checks
    threshold: 2
    metrics:
    - name: request-success-rate  # Flagger built-in metric
      thresholdRange:
        min: 99  # Fail the check if success rate < 99% (i.e. error rate > 1%)
      interval: 30s

Prometheus Alert for Error Rate:

# PrometheusRule for error rate alert
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: canary-error-rate
  namespace: monitoring
spec:
  groups:
  - name: canary
    rules:
    - alert: CanaryHighErrorRate
      expr: |
        sum(rate(http_requests_total{status=~"5.."}[1m])) 
        / 
        sum(rate(http_requests_total[1m])) 
        > 0.01  # 1% error rate
      for: 30s
      labels:
        severity: critical
      annotations:
        summary: "Canary error rate exceeds threshold - rollback triggered"

Latency Degrades Beyond SLO

Latency Rollback:

# Flagger automatic rollback on latency degradation
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: atp-ingestion
spec:
  analysis:
    interval: 30s
    # Roll back after this many consecutive failed metric checks
    threshold: 2
    metrics:
    - name: request-duration  # Flagger built-in: P99 latency in milliseconds
      thresholdRange:
        max: 500  # Fail the check if P99 latency > 500ms
      interval: 30s

Health Checks Fail Consistently

Health Check Rollback:

# Flagger automatic rollback on health check failure
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: atp-ingestion
spec:
  analysis:
    webhooks:
    - name: health-check
      type: rollout
      url: http://health-monitor:8080/check
      timeout: 10s
      metadata:
        cmd: |
          health=$(curl -s -o /dev/null -w "%{http_code}" http://atp-ingestion-canary:8080/health/live)
          if [ "$health" != "200" ]; then
            echo "Health check failed"
            exit 1  # Failed check; repeated failures halt the canary and trigger rollback
          fi

Custom Metric-Based Rollback

Custom Metric Rollback:

# MetricTemplate for the custom business metric
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: business-metric-threshold
  namespace: flagger-system  # namespace where Flagger runs
spec:
  provider:
    type: prometheus
    address: http://prometheus.monitoring:9090
  query: |
    avg(rate(atp_business_metric_total{
      service="{{ target }}"
    }[1m]))
---
# Flagger automatic rollback on custom metric
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: atp-ingestion
spec:
  analysis:
    # Roll back after this many consecutive failed metric checks
    threshold: 2
    metrics:
    - name: business-metric-threshold
      templateRef:
        name: business-metric-threshold
        namespace: flagger-system
      thresholdRange:
        min: 95  # Fail the check if business metric < 95
      interval: 1m

Flagger Rollback Status:

# Check if rollback occurred
kubectl get canary atp-ingestion -n atp-production -o jsonpath='{.status.conditions[?(@.type=="Promoted")].status}'

# View rollback reason
kubectl describe canary atp-ingestion -n atp-production | grep -A 10 "Status"

Deployment Windows

Scheduled Maintenance Windows

Maintenance Window Configuration:

Flagger has no built-in cron scheduling for canaries. Deployment windows are enforced upstream: gate the pipeline that merges changes into the GitOps repository (shown below), or suspend reconciliation outside the window so no new revisions reach the cluster:

# Suspend reconciliation outside the maintenance window
flux suspend kustomization apps-production -n flux-system

# Resume at the start of the window (e.g. 2 AM UTC, via a scheduled job)
flux resume kustomization apps-production -n flux-system

Azure Pipeline Deployment Window:

# Azure Pipeline with deployment window
trigger: none

schedules:
- cron: "0 2 * * *"  # 2 AM UTC daily
  branches:
    include:
    - production
  always: true

pool:
  vmImage: 'ubuntu-latest'

stages:
- stage: Deploy
  condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/production'))
  jobs:
  - job: Deploy
    steps:
    - script: |
        echo "Deploying during maintenance window (2 AM UTC)"
        # Deployment steps

Change Freeze Periods

Change Freeze Configuration:

# Flagger with change freeze
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: atp-ingestion
spec:
  # Suspend canary during change freeze
  # Annotation to prevent deployments
  annotations:
    flagger.app/change-freeze: "true"

Change Freeze Script:

#!/bin/bash
# scripts/change-freeze.sh

ACTION="${1:-enable}"  # enable or disable
NAMESPACE="${2:-atp-production}"

if [ "$ACTION" == "enable" ]; then
  echo "🔒 Enabling change freeze"
  kubectl annotate canary atp-ingestion -n $NAMESPACE \
    flagger.app/change-freeze="true" \
    --overwrite

  # Suspend FluxCD reconciliations
  flux suspend kustomization apps-production -n flux-system

  echo "✅ Change freeze enabled"
elif [ "$ACTION" == "disable" ]; then
  echo "🔓 Disabling change freeze"
  kubectl annotate canary atp-ingestion -n $NAMESPACE \
    flagger.app/change-freeze- \
    --overwrite

  # Resume FluxCD reconciliations
  flux resume kustomization apps-production -n flux-system

  echo "✅ Change freeze disabled"
fi

Emergency Deployment Procedures

Emergency Deployment Bypass:

#!/bin/bash
# scripts/emergency-deploy.sh

SERVICE="${1:-atp-ingestion}"
VERSION="${2:-v1.2.3-abc123d}"
NAMESPACE="${3:-atp-production}"

echo "🚨 Emergency deployment: $SERVICE@$VERSION"

# Bypass change freeze
kubectl annotate canary $SERVICE -n $NAMESPACE \
  flagger.app/emergency-deploy="true" \
  flagger.app/change-freeze- \
  --overwrite

# Force immediate rollout (bypasses canary analysis; assumes the container
# name matches the service name)
kubectl set image deployment/$SERVICE \
  $SERVICE=connectsoft.azurecr.io/atp/$SERVICE:$VERSION \
  -n $NAMESPACE

# Watch the rollout (readiness gates still apply)
kubectl rollout status deployment/$SERVICE -n $NAMESPACE

echo "✅ Emergency deployment initiated"

Zero-Downtime Deployments

Connection Draining

Connection Draining Configuration:

# Service with session affinity; on AKS, connection draining is achieved via
# preStop hooks and terminationGracePeriodSeconds (see below), not via a
# load-balancer annotation
apiVersion: v1
kind: Service
metadata:
  name: atp-ingestion
  namespace: atp-production
spec:
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800  # 3 hours
  ports:
  - port: 80
    targetPort: 8080

Graceful Shutdown (SIGTERM Handling)

Graceful Shutdown Implementation:

// C#: Graceful shutdown handler
public class Program
{
    public static async Task Main(string[] args)
    {
        var host = CreateHostBuilder(args).Build();

        // Register graceful shutdown
        var lifetime = host.Services.GetRequiredService<IHostApplicationLifetime>();
        lifetime.ApplicationStopping.Register(() =>
        {
            Console.WriteLine("SIGTERM received, starting graceful shutdown...");

            // Stop accepting new requests
            // Wait for in-flight requests to complete
            // Close connections
            // Cleanup resources

            Console.WriteLine("Graceful shutdown complete");
        });

        await host.RunAsync();
    }
}
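The same lifecycle can be observed at the shell level. A minimal sketch (not ATP code) that traps SIGTERM — the signal the kubelet sends on pod termination — drains, and exits cleanly:

```shell
#!/bin/sh
# Simulated service: on SIGTERM, drain and exit 0 instead of dying mid-request
graceful() {
  trap 'echo "SIGTERM received, draining"; echo "graceful shutdown complete"; exit 0' TERM
  while :; do sleep 0.1; done
}

graceful &            # run the "service" in the background
pid=$!
sleep 0.3
kill -TERM "$pid"     # what the kubelet sends at pod termination
wait "$pid"
echo "exit status: $?"
```

An unhandled SIGTERM would terminate the process immediately with a non-zero status; the trap is what allows in-flight work to finish first.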

Graceful Shutdown in Kubernetes:

# Deployment with terminationGracePeriodSeconds
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60  # 60 seconds for graceful shutdown
      containers:
      - name: atp-ingestion
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/sh
              - -c
              - |
                # Drain connections
                sleep 10
                # Stop accepting new requests
                curl -X POST http://localhost:8080/admin/shutdown

Pod Disruption Budget

Pod Disruption Budget Configuration:

# Pod Disruption Budget for zero-downtime
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: atp-ingestion-pdb
  namespace: atp-production
spec:
  minAvailable: 3  # Always maintain at least 3 pods available
  selector:
    matchLabels:
      app: atp-ingestion
---
# Alternative: maxUnavailable
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: atp-ingestion-pdb
spec:
  maxUnavailable: 1  # Allow maximum 1 pod unavailable
  selector:
    matchLabels:
      app: atp-ingestion

Pod Disruption Budget Calculation:

  • Total Pods: 5 replicas
  • minAvailable: 3 pods
  • During Rolling Update: Can terminate maximum 2 pods at a time
  • Ensures: Always 3+ pods serving traffic (zero downtime)
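That budget arithmetic is simply replicas minus minAvailable; a one-line check:

```shell
#!/bin/sh
replicas=5
min_available=3

# Pods the cluster may voluntarily disrupt at once while honoring the PDB
allowed_disruptions=$((replicas - min_available))
echo "allowed disruptions: ${allowed_disruptions}"   # 2
```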

PreStop Hooks

PreStop Hook for Graceful Shutdown:

# Deployment with preStop hook
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60
      containers:
      - name: atp-ingestion
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/sh
              - -c
              - |
                echo "PreStop hook: Starting graceful shutdown..."

                # Remove from load balancer
                # Wait for connections to drain
                # Send shutdown signal to application
                curl -X POST http://localhost:8080/admin/drain || true

                # Wait for in-flight requests
                sleep 15

                echo "PreStop hook: Graceful shutdown complete"
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080
          # Remove from service endpoints when readiness fails
          periodSeconds: 5

Zero-Downtime Deployment Checklist:

## Zero-Downtime Deployment Checklist

### Pre-Deployment
- [ ] Readiness probe configured and tested
- [ ] Liveness probe configured
- [ ] Pod Disruption Budget configured (minAvailable or maxUnavailable)
- [ ] Graceful shutdown implemented (SIGTERM handler)
- [ ] PreStop hook configured
- [ ] Connection draining enabled
- [ ] terminationGracePeriodSeconds set appropriately (30-60s)

### During Deployment
- [ ] Rolling update strategy with maxUnavailable: 0
- [ ] maxSurge configured (1 or 2 extra pods)
- [ ] Monitor rollout progress (`kubectl rollout status`)
- [ ] Verify pods pass readiness probes
- [ ] Check traffic routing to new pods

### Post-Deployment
- [ ] Health checks passing
- [ ] Error rates within threshold
- [ ] Latency within SLO
- [ ] Business metrics validated
- [ ] Old pods terminated gracefully

Summary: Rolling Updates & Deployment Strategies

  • Kubernetes Rolling Updates: Default rolling update strategy, maxSurge and maxUnavailable settings, rolling update process, monitoring rollout progress
  • Blue-Green Deployments: Blue-green concept and benefits, implementation with namespace switching, traffic routing with Ingress, instant rollback, when to use blue-green
  • Canary Releases: Canary deployment concept, traffic splitting strategies (Istio/Nginx), service mesh requirement, gradual traffic shift (10% → 50% → 100%), automated canary analysis
  • Progressive Delivery with Flagger: Flagger overview and architecture, installation and configuration, canary resource definition, automated rollback on metric thresholds, integration with service mesh
  • Feature Flags Integration: LaunchDarkly or Azure App Configuration, feature flag-based rollout, gradual feature enablement, kill switch for problem features
  • Pre-Deployment Validation: Smoke tests before traffic routing, integration tests in canary, database migration checks, dependency availability checks
  • Post-Deployment Validation: Health check monitoring, error rate thresholds, latency thresholds, business metric validation
  • Automatic Rollback Triggers: Error rate exceeds threshold, latency degrades beyond SLO, health checks fail consistently, custom metric-based rollback
  • Deployment Windows: Scheduled maintenance windows, change freeze periods, emergency deployment procedures
  • Zero-Downtime Deployments: Connection draining, graceful shutdown (SIGTERM handling), Pod disruption budgets, PreStop hooks

Preview Environments (Ephemeral)

Purpose: Define how ephemeral preview environments are automatically provisioned for pull requests, used for isolated testing and validation, and automatically cleaned up after PR merge or closure, ensuring developers can test changes in a production-like environment without manual infrastructure setup while optimizing resource costs.


Preview Environment Architecture

Ephemeral Namespaces in Dev Cluster

Preview Environment Architecture:

graph TB
    subgraph "Dev AKS Cluster"
        subgraph "PR #123 Preview"
            NS1[Namespace: atp-preview-pr123]
            SVC1[Service: atp-ingestion]
            PODS1[Pods: v1.2.3<br/>1 replica]
            ING1[Ingress: pr123.preview.atp.connectsoft.example]
        end
        subgraph "PR #124 Preview"
            NS2[Namespace: atp-preview-pr124]
            SVC2[Service: atp-ingestion]
            PODS2[Pods: v1.2.4<br/>1 replica]
            ING2[Ingress: pr124.preview.atp.connectsoft.example]
        end
        subgraph "Shared Resources"
            MON[Shared Monitoring]
            DB[Shared Test DB]
        end
    end

    NS1 --> MON
    NS2 --> MON
    NS1 --> DB
    NS2 --> DB
    ING1 --> SVC1
    ING2 --> SVC2

    style NS1 fill:#90EE90
    style NS2 fill:#90EE90
    style MON fill:#FFE5B4
    style DB fill:#FFE5B4

Preview Namespace Structure:

atp-preview-pr123/
├── deployments/
│   ├── atp-ingestion/
│   ├── atp-query/
│   └── atp-gateway/
├── services/
├── ingress/
├── configmaps/
└── secrets/ (references from External Secrets)

Namespace Naming Convention:

  • Format: atp-preview-pr{PR_NUMBER}
  • Examples: atp-preview-pr123, atp-preview-pr456, atp-preview-pr789
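A small helper (hypothetical, not part of the ATP scripts) that applies and validates this convention before the PR number is interpolated into kubectl calls:

```shell
#!/bin/sh
# Build the preview namespace name from a PR number, rejecting non-numeric input
preview_namespace() {
  case "$1" in
    ''|*[!0-9]*) echo "error: PR number must be numeric" >&2; return 1 ;;
  esac
  echo "atp-preview-pr$1"
}

preview_namespace 123               # atp-preview-pr123
preview_namespace "not-a-number" || true   # rejected
```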

Resource Isolation per PR

Resource Isolation Strategy:

| Resource    | Isolation Level       | Sharing              | Rationale                     |
|-------------|-----------------------|----------------------|-------------------------------|
| Namespace   | ✅ Complete isolation | ❌ Per PR            | Complete resource isolation   |
| Deployments | ✅ Isolated           | ❌ Per PR            | Independent testing           |
| Services    | ✅ Isolated           | ❌ Per PR            | Independent service endpoints |
| Ingress     | ✅ Isolated hostname  | ❌ Per PR            | Unique preview URL            |
| ConfigMaps  | ✅ Isolated           | ❌ Per PR            | PR-specific configuration     |
| Secrets     | ⚠️ Referenced         | ✅ Shared Key Vault  | Cost optimization             |
| Database    | ⚠️ Shared/Mocked      | ✅ Shared test DB    | Cost optimization             |
| Redis       | ⚠️ Shared             | ✅ Shared test Redis | Cost optimization             |
| Monitoring  | ✅ Namespace labels   | ✅ Shared Prometheus | Cost optimization             |

Resource Isolation Configuration:

# Preview namespace with labels for isolation
apiVersion: v1
kind: Namespace
metadata:
  name: atp-preview-pr123
  labels:
    environment: preview
    preview: "true"
    pr-number: "123"
    created-by: "azure-pipelines"
    auto-cleanup: "true"
    ttl: "24h"  # Time-to-live for auto-cleanup
  annotations:
    # Timestamps contain colons, which are invalid in label values
    created-at: "2024-01-15T10:00:00Z"
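A cleanup job can evaluate that ttl label with plain epoch arithmetic. A hedged sketch, assuming the hour-suffixed format shown above and epoch timestamps:

```shell
#!/bin/sh
# True (exit 0) if now is past created + ttl
ttl_expired() {
  created_epoch=$1; ttl_hours=$2; now_epoch=$3
  [ "$now_epoch" -gt $((created_epoch + ttl_hours * 3600)) ]
}

created=1705312800   # 2024-01-15T10:00:00Z
if ttl_expired "$created" 24 $((created + 90000)); then  # 25h later
  echo "expired: eligible for auto-cleanup"
fi
```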

Cost Optimization Strategies

Cost Optimization Matrix:

| Strategy            | Implementation                    | Cost Savings           |
|---------------------|-----------------------------------|------------------------|
| Single Replica      | 1 replica vs 3 in dev             | ✅ ~67% reduction      |
| Minimal Resources   | 100m CPU, 256Mi memory            | ✅ ~80% reduction      |
| Shared Node Pool    | Use dev cluster node pool         | ✅ No additional nodes |
| Auto-Shutdown       | Scale to zero after 4h inactivity | ✅ ~60% reduction      |
| Spot Instances      | Use spot node pool                | ✅ ~90% cost reduction |
| Shared Dependencies | Shared test DB/Redis              | ✅ Significant savings |

Cost Comparison:

| Environment         | Replicas | CPU/Memory per Pod | Monthly Cost (Est.) |
|---------------------|----------|--------------------|---------------------|
| Dev                 | 3        | 500m / 1Gi         | $150                |
| Preview (Standard)  | 1        | 500m / 1Gi         | $50                 |
| Preview (Optimized) | 1        | 100m / 256Mi       | $10                 |

Lifecycle Management

Preview Environment Lifecycle:

stateDiagram-v2
    [*] --> PR_Created: Developer opens PR
    PR_Created --> Provisioning: Azure Pipeline triggered
    Provisioning --> Active: Preview ready
    Active --> Testing: Integration tests
    Testing --> Active: Tests pass
    Active --> Idle: 4h inactivity
    Idle --> Active: New activity
    Active --> Cleaning: PR merged/closed
    Idle --> Cleaning: TTL expired
    Cleaning --> [*]: Resources deleted

    Active --> Failed: Tests fail
    Failed --> Cleaning: Manual cleanup

Lifecycle States:

| State        | Description                                    | Actions                             |
|--------------|------------------------------------------------|-------------------------------------|
| Provisioning | Namespace and resources being created          | Create namespace, deploy manifests  |
| Active       | Preview environment running, receiving traffic | Monitor health, run tests           |
| Testing      | Integration tests executing                    | Execute test suite                  |
| Idle         | No activity for 4+ hours                       | Scale to zero, monitor for activity |
| Cleaning     | Resources being deleted                        | Delete namespace and all resources  |
| Failed       | Provisioning or testing failed                 | Retry or manual cleanup             |
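The table can be encoded as a simple transition function; a sketch for validating lifecycle events (state and event names follow the diagram, not an existing ATP script):

```shell
#!/bin/sh
# Next lifecycle state for (current_state, event); Unknown for invalid transitions
next_state() {
  case "$1:$2" in
    Provisioning:ready)                echo Active ;;
    Active:run_tests)                  echo Testing ;;
    Testing:tests_pass)                echo Active ;;
    Active:inactivity)                 echo Idle ;;
    Idle:activity)                     echo Active ;;
    Active:pr_closed|Idle:ttl_expired) echo Cleaning ;;
    Testing:tests_fail)                echo Failed ;;
    *)                                 echo Unknown ;;
  esac
}

next_state Active inactivity    # Idle
next_state Idle ttl_expired     # Cleaning
next_state Cleaning activity    # Unknown — cleanup is terminal
```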

Automatic Provisioning on PR Creation

Azure Pipeline Triggered by PR

PR Trigger Configuration:

# azure-pipelines-preview.yml
trigger: none  # No CI trigger

pr:
  branches:
    include:
    - dev
    - test
    - staging
  paths:
    include:
    - apps/**/*
    - infrastructure/**/*
    exclude:
    - docs/**/*

pool:
  vmImage: 'ubuntu-latest'

variables:
  - group: atp-preview-env
  - name: PR_NUMBER
    value: $(System.PullRequest.PullRequestNumber)
  - name: PR_BRANCH
    value: $(System.PullRequest.SourceBranch)
  - name: PREVIEW_NAMESPACE
    value: atp-preview-pr$(PR_NUMBER)
  - name: PREVIEW_HOSTNAME
    value: pr$(PR_NUMBER).preview.atp.connectsoft.example

stages:
- stage: ProvisionPreview
  displayName: 'Provision Preview Environment'
  condition: and(succeeded(), ne(variables['Build.Reason'], 'Manual'))
  jobs:
  - job: Provision
    displayName: 'Create Preview Environment'
    steps:
    - task: AzureCLI@2
      displayName: 'Get PR details'
      inputs:
        azureSubscription: 'ATP-NonProd-ServiceConnection'
        scriptType: 'bash'
        scriptLocation: 'inlineScript'
        inlineScript: |
          echo "PR Number: $(PR_NUMBER)"
          echo "PR Branch: $(PR_BRANCH)"
          echo "Preview Namespace: $(PREVIEW_NAMESPACE)"
          echo "Preview Hostname: $(PREVIEW_HOSTNAME)"

    - task: Bash@3
      displayName: 'Generate Preview Manifests'
      inputs:
        targetType: 'inline'
        script: |
          # Generate preview manifests
          ./scripts/generate-preview-manifests.sh \
            --pr-number $(PR_NUMBER) \
            --branch $(PR_BRANCH) \
            --namespace $(PREVIEW_NAMESPACE) \
            --hostname $(PREVIEW_HOSTNAME) \
            --output-dir ./preview-manifests

Namespace Creation Script

Namespace Creation:

#!/bin/bash
# scripts/create-preview-namespace.sh

PR_NUMBER="${1}"
NAMESPACE="atp-preview-pr${PR_NUMBER}"
TTL="${2:-24h}"  # Default 24 hours

echo "📦 Creating preview namespace: ${NAMESPACE}"

# Create namespace
kubectl create namespace "${NAMESPACE}" --dry-run=client -o yaml | \
  kubectl label --local -f - \
    environment=preview \
    preview=true \
    pr-number="${PR_NUMBER}" \
    auto-cleanup=true \
    ttl="${TTL}" \
    created-by=azure-pipelines \
    -o yaml | \
  kubectl apply -f -

# Record the creation timestamp as an annotation (colons are invalid in label values)
kubectl annotate namespace "${NAMESPACE}" \
  created-at="$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --overwrite

# Create ResourceQuota for cost control
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ResourceQuota
metadata:
  name: preview-quota
  namespace: ${NAMESPACE}
spec:
  hard:
    requests.cpu: "2"
    requests.memory: 4Gi
    limits.cpu: "4"
    limits.memory: 8Gi
    persistentvolumeclaims: "2"
    pods: "10"
    services: "5"
EOF

# Create LimitRange for default resource limits
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: LimitRange
metadata:
  name: preview-limits
  namespace: ${NAMESPACE}
spec:
  limits:
  - default:
      cpu: "100m"
      memory: "256Mi"
    defaultRequest:
      cpu: "50m"
      memory: "128Mi"
    type: Container
EOF

echo "✅ Preview namespace created: ${NAMESPACE}"
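The LimitRange defaults and the ResourceQuota interact: worst case, every pod receives the default request, so defaults times the pod quota must fit within the quota. A quick check with the values above:

```shell
#!/bin/sh
max_pods=10                  # ResourceQuota pods: "10"
default_cpu_request_m=50     # LimitRange defaultRequest cpu: 50m
quota_cpu_m=2000             # ResourceQuota requests.cpu: "2"

worst_case=$((max_pods * default_cpu_request_m))
if [ "$worst_case" -le "$quota_cpu_m" ]; then
  echo "OK: worst-case ${worst_case}m <= quota ${quota_cpu_m}m"
else
  echo "WARNING: defaults exceed quota"
fi
```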

Manifest Generation with PR-Specific Values

Preview Manifest Generation Script:

#!/bin/bash
# scripts/generate-preview-manifests.sh
# Accepts the flag-style arguments used by the pipeline

OUTPUT_DIR="./preview-manifests"

while [ $# -gt 0 ]; do
  case "$1" in
    --pr-number)  PR_NUMBER="$2";  shift 2 ;;
    --branch)     BRANCH="$2";     shift 2 ;;
    --namespace)  NAMESPACE="$2";  shift 2 ;;
    --hostname)   HOSTNAME="$2";   shift 2 ;;
    --output-dir) OUTPUT_DIR="$2"; shift 2 ;;
    *) echo "Unknown argument: $1" >&2; exit 1 ;;
  esac
done

NAMESPACE="${NAMESPACE:-atp-preview-pr${PR_NUMBER}}"

echo "🔨 Generating preview manifests for PR #${PR_NUMBER}"

mkdir -p "${OUTPUT_DIR}"

# Generate namespace
cat > "${OUTPUT_DIR}/namespace.yaml" <<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: ${NAMESPACE}
  labels:
    environment: preview
    preview: "true"
    pr-number: "${PR_NUMBER}"
    auto-cleanup: "true"
    ttl: "24h"
    created-by: "azure-pipelines"
  annotations:
    # Branch names and timestamps may contain / and :, invalid in label values
    branch: "${BRANCH}"
    created-at: "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
EOF

# Generate Kustomization with PR-specific values
cat > "${OUTPUT_DIR}/kustomization.yaml" <<EOF
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: ${NAMESPACE}

resources:
- ../../apps/atp-ingestion/base
- ../../apps/atp-query/base
- ../../apps/atp-gateway/base

commonLabels:
  environment: preview
  pr-number: "${PR_NUMBER}"

patchesStrategicMerge:
- preview-patch.yaml

images:
# Branch names are sanitized: '/' and other non-alphanumerics are invalid in image tags
- name: connectsoft.azurecr.io/atp/ingestion
  newTag: $(echo "${BRANCH}" | sed 's/[^a-zA-Z0-9.]/-/g')-$(git rev-parse --short HEAD)
- name: connectsoft.azurecr.io/atp/query
  newTag: $(echo "${BRANCH}" | sed 's/[^a-zA-Z0-9.]/-/g')-$(git rev-parse --short HEAD)
- name: connectsoft.azurecr.io/atp/gateway
  newTag: $(echo "${BRANCH}" | sed 's/[^a-zA-Z0-9.]/-/g')-$(git rev-parse --short HEAD)
EOF

# Generate preview-specific patches
cat > "${OUTPUT_DIR}/preview-patch.yaml" <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  replicas: 1  # Single replica for preview
  template:
    spec:
      containers:
      - name: atp-ingestion
        resources:
          requests:
            cpu: 100m
            memory: 256Mi
          limits:
            cpu: 500m
            memory: 512Mi
        env:
        - name: ASPNETCORE_ENVIRONMENT
          value: "Preview"
        - name: Preview__PRNumber
          value: "${PR_NUMBER}"
        - name: Preview__Hostname
          value: "${HOSTNAME}"
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: atp-ingestion-ingress
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-staging
spec:
  rules:
  - host: ${HOSTNAME}
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: atp-ingestion
            port:
              number: 80
EOF

echo "✅ Preview manifests generated in ${OUTPUT_DIR}"

FluxCD Kustomization for Preview

Preview Kustomization Resource:

# clusters/dev/preview-kustomizations/pr123-kustomization.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: preview-pr123
  namespace: flux-system
  labels:
    preview: "true"
    pr-number: "123"
spec:
  interval: 1m
  path: ./apps/preview/pr123
  prune: true  # Auto-prune in preview
  wait: false  # Don't wait for readiness
  timeout: 5m
  sourceRef:
    kind: GitRepository
    name: atp-gitops-dev
  dependsOn:
  - name: infrastructure

Dynamic Preview Kustomization Creation:

#!/bin/bash
# scripts/create-preview-kustomization.sh

PR_NUMBER="${1}"
NAMESPACE="atp-preview-pr${PR_NUMBER}"

echo "🔧 Creating FluxCD Kustomization for preview PR #${PR_NUMBER}"

# Create preview Kustomization
cat <<EOF | kubectl apply -f -
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: preview-pr${PR_NUMBER}
  namespace: flux-system
  labels:
    preview: "true"
    pr-number: "${PR_NUMBER}"
    auto-cleanup: "true"
spec:
  interval: 1m
  path: ./apps/preview/pr${PR_NUMBER}
  prune: true
  wait: false
  timeout: 5m
  sourceRef:
    kind: GitRepository
    name: atp-gitops-dev
  dependsOn:
  - name: infrastructure
EOF

echo "✅ Preview Kustomization created"

Dynamic Manifest Generation

Namespace: atp-preview-pr{number}

Dynamic Namespace Template:

# templates/preview-namespace-template.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: atp-preview-pr{{ .Values.prNumber }}
  labels:
    environment: preview
    preview: "true"
    pr-number: "{{ .Values.prNumber }}"
    auto-cleanup: "true"
    ttl: "{{ .Values.ttl | default "24h" }}"
    created-by: "azure-pipelines"
  annotations:
    # Branch names and timestamps may contain / and :, invalid in label values
    branch: "{{ .Values.branch }}"
    created-at: "{{ .Values.createdAt }}"

Namespace Generation:

# Generate namespace with PR number
PR_NUMBER=123
NAMESPACE="atp-preview-pr${PR_NUMBER}"

kubectl create namespace "${NAMESPACE}" \
  --dry-run=client -o yaml | \
  kubectl label --local -f - \
    environment=preview \
    pr-number="${PR_NUMBER}" \
    -o yaml | \
  kubectl apply -f -

Ingress Hostname: pr{number}.preview.atp.connectsoft.example

Dynamic Ingress Generation:

# Generated Ingress for PR #123
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: atp-preview-ingress
  namespace: atp-preview-pr123
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-staging
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - pr123.preview.atp.connectsoft.example
    secretName: preview-pr123-tls
  rules:
  - host: pr123.preview.atp.connectsoft.example
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: atp-ingestion
            port:
              number: 80

Hostname Generation Script:

#!/bin/bash
# scripts/generate-preview-hostname.sh

PR_NUMBER="${1}"
BASE_DOMAIN="preview.atp.connectsoft.example"

PREVIEW_HOSTNAME="pr${PR_NUMBER}.${BASE_DOMAIN}"

echo "${PREVIEW_HOSTNAME}"
# Output: pr123.preview.atp.connectsoft.example

Resource Limits (Smaller than Dev)

Preview Resource Limits:

# Preview ResourceQuota (smaller than dev)
apiVersion: v1
kind: ResourceQuota
metadata:
  name: preview-quota
  namespace: atp-preview-pr123
spec:
  hard:
    requests.cpu: "2"      # 2 CPU total (vs 8 in dev)
    requests.memory: 4Gi   # 4Gi memory (vs 16Gi in dev)
    limits.cpu: "4"        # 4 CPU limit (vs 16 in dev)
    limits.memory: 8Gi     # 8Gi limit (vs 32Gi in dev)
    pods: "10"             # 10 pods max (vs 50 in dev)
    services: "5"          # 5 services max

Preview Deployment Resource Limits:

# Preview Deployment with minimal resources
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
  namespace: atp-preview-pr123
spec:
  replicas: 1  # Single replica
  template:
    spec:
      containers:
      - name: atp-ingestion
        resources:
          requests:
            cpu: 100m      # 100m CPU (vs 500m in dev)
            memory: 256Mi  # 256Mi memory (vs 1Gi in dev)
          limits:
            cpu: 500m      # 500m CPU limit (vs 2000m in dev)
            memory: 512Mi  # 512Mi limit (vs 2Gi in dev)

Resource Comparison:

| Resource       | Dev   | Preview | Reduction |
|----------------|-------|---------|-----------|
| Replicas       | 3     | 1       | 67%       |
| CPU Request    | 500m  | 100m    | 80%       |
| Memory Request | 1Gi   | 256Mi   | 75%       |
| CPU Limit      | 2000m | 500m    | 75%       |
| Memory Limit   | 2Gi   | 512Mi   | 75%       |
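The Reduction column follows from (dev − preview) / dev; the rows can be recomputed in one pass:

```shell
#!/bin/sh
# dev/preview pairs: replicas, CPU request (m), memory request (Mi),
# CPU limit (m), memory limit (Mi)
printf '3 1\n500 100\n1024 256\n2000 500\n2048 512\n' | \
  awk '{ printf "%.0f%%\n", ($1 - $2) / $1 * 100 }'
# 67% 80% 75% 75% 75% — matching the table
```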

Image Tag from PR Branch

Image Tag Strategy:

  • Format: {BRANCH_NAME}-{SHORT_COMMIT_SHA}
  • Examples: feature-123-abc456d, bugfix-456-def789g

Image Tag Generation:

#!/bin/bash
# scripts/generate-preview-image-tag.sh

BRANCH="${1}"
COMMIT_SHA="${2:-$(git rev-parse --short HEAD)}"

# Sanitize branch name (remove special characters)
SANITIZED_BRANCH=$(echo "${BRANCH}" | sed 's/[^a-zA-Z0-9]/-/g' | tr '[:upper:]' '[:lower:]' | cut -c1-50)

IMAGE_TAG="${SANITIZED_BRANCH}-${COMMIT_SHA}"

echo "${IMAGE_TAG}"
# Output: feature-123-abc456d

Kustomize Image Override:

# apps/preview/pr123/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

images:
- name: connectsoft.azurecr.io/atp/ingestion
  newTag: feature-123-abc456d  # PR branch + commit SHA
- name: connectsoft.azurecr.io/atp/query
  newTag: feature-123-abc456d
- name: connectsoft.azurecr.io/atp/gateway
  newTag: feature-123-abc456d

FluxCD Configuration for Previews

Dynamic GitRepository per PR

Preview GitRepository:

# clusters/dev/preview-gitrepositories/pr123-gitrepository.yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: preview-pr123-git
  namespace: flux-system
  labels:
    preview: "true"
    pr-number: "123"
spec:
  interval: 30s  # Fast polling for preview
  url: https://dev.azure.com/ConnectSoft/ATP/_git/atp-gitops
  ref:
    branch: feature/123-new-feature  # PR branch
  secretRef:
    name: gitops-credentials
  ignore: |
    # .gitignore-style patterns: exclude everything, then re-include this PR's path
    /*
    !/apps/
    /apps/*
    !/apps/preview/
    /apps/preview/*
    !/apps/preview/pr123/
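Flux's spec.ignore uses .gitignore-style patterns, and gitignore semantics forbid re-including a file whose parent directory stays excluded, so each path level must be negated explicitly. The layered pattern can be verified locally with `git check-ignore` in a throwaway repository:

```shell
#!/bin/sh
# Verify the exclude-all-then-re-include pattern in a temporary repo
tmp=$(mktemp -d) && cd "$tmp" && git init -q .

cat > .gitignore <<'EOF'
/*
!/apps/
/apps/*
!/apps/preview/
/apps/preview/*
!/apps/preview/pr123/
EOF

git check-ignore -q production/deploy.yaml && echo "production: excluded"
git check-ignore -q apps/preview/pr456/deploy.yaml && echo "pr456: excluded"
git check-ignore -q apps/preview/pr123/deploy.yaml || echo "pr123: included"
```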

Dynamic GitRepository Creation:

#!/bin/bash
# scripts/create-preview-gitrepository.sh

PR_NUMBER="${1}"
PR_BRANCH="${2}"

echo "📂 Creating GitRepository for preview PR #${PR_NUMBER}"

cat <<EOF | kubectl apply -f -
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: preview-pr${PR_NUMBER}-git
  namespace: flux-system
  labels:
    preview: "true"
    pr-number: "${PR_NUMBER}"
    auto-cleanup: "true"
spec:
  interval: 30s
  url: https://dev.azure.com/ConnectSoft/ATP/_git/atp-gitops
  ref:
    branch: ${PR_BRANCH}
  secretRef:
    name: gitops-credentials
EOF

echo "✅ Preview GitRepository created"

Preview Kustomization

Preview Kustomization Configuration:

# clusters/dev/preview-kustomizations/pr123-kustomization.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: preview-pr123
  namespace: flux-system
  labels:
    preview: "true"
    pr-number: "123"
spec:
  interval: 1m
  path: ./apps/preview/pr123
  prune: true  # Auto-prune deleted resources
  wait: false  # Don't wait for readiness
  timeout: 5m
  retryInterval: 1m
  sourceRef:
    kind: GitRepository
    name: preview-pr123-git
  dependsOn:
  - name: infrastructure

Sync Policies for Preview

Preview Sync Policy:

| Policy         | Value       | Rationale                       |
|----------------|-------------|---------------------------------|
| Auto-Sync      | ✅ Enabled  | Fast feedback for developers    |
| Prune          | ✅ Enabled  | Clean up deleted resources      |
| Wait           | ❌ Disabled | Don't block on readiness        |
| Timeout        | 5m          | Fast timeout for quick feedback |
| Retry Interval | 1m          | Quick retries                   |

Sync Policy Configuration:

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: preview-pr123
spec:
  interval: 1m
  prune: true   # Auto-prune
  wait: false   # Don't wait
  timeout: 5m   # Fast timeout
  retryInterval: 1m

Health Checks and Validation

Preview Health Check:

# Health check webhook for preview
apiVersion: v1
kind: Service
metadata:
  name: preview-health-check
  namespace: atp-preview-pr123
spec:
  selector:
    app: atp-ingestion
  ports:
  - port: 8080
    targetPort: 8080
---
# Health check Job
apiVersion: batch/v1
kind: Job
metadata:
  name: preview-health-check
  namespace: atp-preview-pr123
spec:
  template:
    spec:
      containers:
      - name: health-check
        image: curlimages/curl:latest
        command:
        - /bin/sh
        - -c
        - |
          echo "Checking preview environment health..."

          # Wait for pods to be ready
          sleep 30

          # Check liveness
          curl -f http://atp-ingestion:8080/health/live || exit 1

          # Check readiness
          curl -f http://atp-ingestion:8080/health/ready || exit 1

          echo "✅ Preview environment is healthy"
      restartPolicy: Never

Resource Cleanup

Auto-Delete After PR Merge

Auto-Cleanup on PR Merge:

# Azure Pipeline: Cleanup after merge
# Note: PR triggers do not fire on PR completion; in practice this pipeline
# is invoked by a branch trigger on the target branch or by an Azure DevOps
# service hook for the "pull request updated" event.
trigger: none

pr:
  branches:
    include:
    - dev
    - test
  autoCancel: false

pool:
  vmImage: 'ubuntu-latest'

variables:
  - name: PR_NUMBER
    value: $(System.PullRequest.PullRequestNumber)
  - name: PREVIEW_NAMESPACE
    value: atp-preview-pr$(PR_NUMBER)

stages:
- stage: CleanupPreview
  displayName: 'Cleanup Preview Environment'
  condition: succeeded()
  jobs:
  - job: Cleanup
    displayName: 'Delete Preview Resources'
    steps:
    - task: Bash@3
      displayName: 'Delete Preview Namespace'
      inputs:
        targetType: 'inline'
        script: |
          ./scripts/cleanup-preview-environment.sh \
            --pr-number $(PR_NUMBER) \
            --reason "PR merged"

Auto-Delete After PR Close

Auto-Cleanup on PR Close:

#!/bin/bash
# scripts/cleanup-preview-on-close.sh

PR_NUMBER="${1}"
REASON="${2:-PR closed}"

echo "🧹 Cleaning up preview environment for PR #${PR_NUMBER}: ${REASON}"

NAMESPACE="atp-preview-pr${PR_NUMBER}"

# Delete namespace (cascades to all resources)
kubectl delete namespace "${NAMESPACE}" --wait=true --timeout=5m || true

# Delete FluxCD Kustomization
kubectl delete kustomization preview-pr${PR_NUMBER} -n flux-system || true

# Delete FluxCD GitRepository
kubectl delete gitrepository preview-pr${PR_NUMBER}-git -n flux-system || true

# Clean up GitOps manifests in Git
./scripts/cleanup-preview-manifests.sh --pr-number "${PR_NUMBER}"

echo "✅ Preview environment cleaned up"

Manual Cleanup for Stuck Resources

Manual Cleanup Script:

#!/bin/bash
# scripts/manual-cleanup-preview.sh

PR_NUMBER="${1}"

if [ -z "${PR_NUMBER}" ]; then
  echo "Usage: $0 <PR_NUMBER>"
  echo "Example: $0 123"
  exit 1
fi

NAMESPACE="atp-preview-pr${PR_NUMBER}"

echo "🧹 Manual cleanup for PR #${PR_NUMBER}"

# Force delete namespace (if stuck)
kubectl delete namespace "${NAMESPACE}" --force --grace-period=0 || true

# Wait and check if namespace still exists
sleep 10
if kubectl get namespace "${NAMESPACE}" 2>/dev/null; then
  echo "⚠️  Namespace still exists, forcing deletion..."

  # Patch namespace to remove finalizers
  kubectl patch namespace "${NAMESPACE}" -p '{"metadata":{"finalizers":[]}}' --type=merge

  # Delete again
  kubectl delete namespace "${NAMESPACE}" --force --grace-period=0
fi

# Clean up FluxCD resources
kubectl delete kustomization preview-pr${PR_NUMBER} -n flux-system --ignore-not-found=true
kubectl delete gitrepository preview-pr${PR_NUMBER}-git -n flux-system --ignore-not-found=true

# Clean up any remaining pods
kubectl delete pods --all -n "${NAMESPACE}" --force --grace-period=0 2>/dev/null || true

echo "✅ Manual cleanup complete"

List All Preview Environments:

#!/bin/bash
# scripts/list-preview-environments.sh

echo "📋 Active Preview Environments:"
echo ""

kubectl get namespaces -l preview=true --no-headers | while read -r line; do
  NAMESPACE=$(echo "$line" | awk '{print $1}')
  AGE=$(echo "$line" | awk '{print $3}')   # columns are NAME STATUS AGE
  PR_NUMBER=$(kubectl get namespace "${NAMESPACE}" -o jsonpath='{.metadata.labels.pr-number}')
  TTL=$(kubectl get namespace "${NAMESPACE}" -o jsonpath='{.metadata.labels.ttl}')

  echo "PR #${PR_NUMBER}: ${NAMESPACE}"
  echo "  Age: ${AGE}"
  echo "  TTL: ${TTL}"
  echo ""
done

Cost Tracking and Alerts

Cost Tracking:

#!/bin/bash
# scripts/track-preview-costs.sh

echo "💰 Preview Environment Cost Tracking"
echo ""

# Get all preview namespaces
kubectl get namespaces -l preview=true --no-headers | while read -r line; do
  NAMESPACE=$(echo "$line" | awk '{print $1}')
  PR_NUMBER=$(kubectl get namespace "${NAMESPACE}" -o jsonpath='{.metadata.labels.pr-number}')
  CREATED=$(kubectl get namespace "${NAMESPACE}" -o jsonpath='{.metadata.labels.created-at}')

  # Calculate hours since creation
  CREATED_TIMESTAMP=$(date -d "${CREATED}" +%s 2>/dev/null || echo "0")
  CURRENT_TIMESTAMP=$(date +%s)
  HOURS=$(( (CURRENT_TIMESTAMP - CREATED_TIMESTAMP) / 3600 ))

  # Estimate cost (assuming $0.10/hour per preview environment)
  ESTIMATED_COST=$(echo "scale=2; $HOURS * 0.10" | bc)

  echo "PR #${PR_NUMBER}: ${HOURS} hours, ~\$${ESTIMATED_COST}"
done

echo ""
echo "Total active preview environments: $(kubectl get namespaces -l preview=true --no-headers | wc -l)"

Cost Alert:

# PrometheusRule for preview cost alert
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: preview-cost-alert
  namespace: monitoring
spec:
  groups:
  - name: preview-cost
    rules:
    - alert: TooManyPreviewEnvironments
      expr: |
        count(kube_namespace_labels{label_preview="true"}) > 10
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Too many preview environments active"
        description: "{{ $value }} preview environments are active (threshold: 10)"

Cost Optimization

Shared Node Pools

Preview Node Pool Strategy:

| Strategy | Node Pool Type | Cost | Rationale |
|---|---|---|---|
| Shared with Dev ✅ | Dev node pool | ✅ Low | ✅ Recommended - No additional nodes |
| Dedicated Preview Pool ❌ | Separate pool | ❌ High | ❌ Not recommended (cost) |
| Spot Instance Pool ⚠️ | Spot nodes | ✅ Very Low | ⚠️ Consider for cost optimization |

Preview Node Pool Configuration:

# Use existing dev node pool (no additional cost)
# Preview pods scheduled on dev cluster nodes
# No dedicated node pool needed

Reduced Replica Counts (1 vs 3)

Replica Count Configuration:

# Preview Deployment: Single replica
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
  namespace: atp-preview-pr123
spec:
  replicas: 1  # Single replica (vs 3 in dev)

Cost Savings:

  • Dev: 3 replicas × $0.05/hour = $0.15/hour
  • Preview: 1 replica × $0.05/hour = $0.05/hour
  • Savings: 67% cost reduction
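
As a quick sanity check, the arithmetic behind these figures can be reproduced in shell (the $0.05/replica-hour rate is the illustrative assumption used above, not real Azure pricing):

```shell
# Reproduce the replica-count savings calculation from the bullets above
RATE=0.05  # assumed cost per replica-hour
DEV_COST=$(awk -v r="$RATE" 'BEGIN{printf "%.2f", 3 * r}')      # 3 replicas in dev
PREVIEW_COST=$(awk -v r="$RATE" 'BEGIN{printf "%.2f", 1 * r}')  # 1 replica in preview
SAVINGS=$(awk -v d="$DEV_COST" -v p="$PREVIEW_COST" 'BEGIN{printf "%.0f", (1 - p/d) * 100}')
echo "dev=\$${DEV_COST}/h preview=\$${PREVIEW_COST}/h savings=${SAVINGS}%"
```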

Minimal Resource Requests

Minimal Resource Configuration:

# Preview resources: Minimal requests
resources:
  requests:
    cpu: 100m      # 100m (vs 500m in dev) - 80% reduction
    memory: 256Mi  # 256Mi (vs 1Gi in dev) - 75% reduction
  limits:
    cpu: 500m      # 500m (vs 2000m in dev) - 75% reduction
    memory: 512Mi  # 512Mi (vs 2Gi in dev) - 75% reduction

Auto-Shutdown After 4 Hours of Inactivity

Inactivity Detection and Auto-Shutdown:

# CronJob to detect inactivity and scale to zero
apiVersion: batch/v1
kind: CronJob
metadata:
  name: preview-inactivity-check
  namespace: monitoring
spec:
  schedule: "*/15 * * * *"  # Every 15 minutes
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: inactivity-check
            image: bitnami/kubectl:latest
            command:
            - /bin/sh
            - -c
            - |
              # Check all preview namespaces (jsonpath avoids depending on jq in the image)
              kubectl get namespaces -l preview=true \
                -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' | \
                while read namespace; do

                  # Check last activity (last HTTP request)
                  LAST_ACTIVITY=$(kubectl get namespace "${namespace}" -o jsonpath='{.metadata.annotations.last-activity-time}')

                  if [ -z "${LAST_ACTIVITY}" ]; then
                    CREATED=$(kubectl get namespace "${namespace}" -o jsonpath='{.metadata.labels.created-at}')
                    LAST_ACTIVITY="${CREATED}"
                  fi

                  # Calculate hours since last activity
                  LAST_TS=$(date -d "${LAST_ACTIVITY}" +%s)
                  CURRENT_TS=$(date +%s)
                  HOURS=$(( (CURRENT_TS - LAST_TS) / 3600 ))

                  # Scale to zero if inactive for 4+ hours
                  if [ "${HOURS}" -ge 4 ]; then
                    echo "Scaling down ${namespace} (inactive for ${HOURS} hours)"

                    # Scale all deployments to zero
                    kubectl get deployments -n "${namespace}" \
                      -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' | \
                      while read deployment; do
                        kubectl scale deployment "${deployment}" -n "${namespace}" --replicas=0
                      done
                  fi
                done
          restartPolicy: OnFailure
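
The inactivity-window arithmetic inside the CronJob can be exercised on its own — a minimal sketch, assuming GNU date (as in the Debian-based kubectl image):

```shell
# Whole hours elapsed since an ISO-8601 UTC timestamp (GNU date assumed)
hours_since() {
  local last_ts now_ts
  last_ts=$(date -u -d "$1" +%s)
  now_ts=$(date -u +%s)
  echo $(( (now_ts - last_ts) / 3600 ))
}

# Simulate a namespace last touched 5 hours ago: past the 4-hour threshold
HOURS=$(hours_since "$(date -u -d '5 hours ago' +%Y-%m-%dT%H:%M:%SZ)")
if [ "${HOURS}" -ge 4 ]; then
  echo "inactive for ${HOURS}h: would scale deployments to zero"
fi
```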

Activity Tracking:

# Track activity in Ingress annotations
kubectl annotate ingress atp-preview-ingress \
  -n atp-preview-pr123 \
  last-activity-time="$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --overwrite

Spot Instances for Preview Environments

Spot Node Pool Configuration:

// Pulumi: Spot node pool for preview environments
var previewNodePool = new ContainerService.KubernetesClusterNodePool("preview-spot-pool", new()
{
    KubernetesClusterId = devCluster.Id,
    VmSize = "Standard_D4s_v3",
    NodeCount = 2,
    Priority = "Spot",
    EvictionPolicy = "Delete",
    SpotMaxPrice = 0.05, // Max $0.05/hour (vs $0.20 for regular)
    NodeTaints = new[]
    {
        "preview=true:NoSchedule"
    },
    NodeLabels = new()
    {
        { "pool", "preview-spot" },
        { "preview", "true" },
    },
});

Preview Pod Tolerations:

# Preview Deployment with spot tolerations
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
  namespace: atp-preview-pr123
spec:
  template:
    spec:
      tolerations:
      - key: preview
        operator: Equal
        value: "true"
        effect: NoSchedule
      nodeSelector:
        pool: preview-spot

Access Control

Preview URL Generation

Preview URL Format:

  • Format: https://pr{PR_NUMBER}.preview.atp.connectsoft.example
  • Examples:
      • https://pr123.preview.atp.connectsoft.example
      • https://pr456.preview.atp.connectsoft.example

URL Generation Script:

#!/bin/bash
# scripts/generate-preview-url.sh

PR_NUMBER="${1}"
BASE_DOMAIN="preview.atp.connectsoft.example"

PREVIEW_URL="https://pr${PR_NUMBER}.${BASE_DOMAIN}"

echo "${PREVIEW_URL}"
# Output: https://pr123.preview.atp.connectsoft.example

Update PR Description with Preview URL:

#!/bin/bash
# scripts/update-pr-with-preview-url.sh

PR_NUMBER="${1}"
PREVIEW_URL="${2}"

echo "🔗 Updating PR #${PR_NUMBER} with preview URL: ${PREVIEW_URL}"

# Add preview URL to PR description via Azure DevOps API
az repos pr update \
  --organization "https://dev.azure.com/ConnectSoft" \
  --project "ATP" \
  --pull-request-id "${PR_NUMBER}" \
  --description "
## 🚀 Preview Environment

Preview environment is ready for testing:

**Preview URL**: ${PREVIEW_URL}

**Status**: ✅ Active

**Services**:
- API Gateway: ${PREVIEW_URL}/gateway
- Ingestion: ${PREVIEW_URL}/ingestion
- Query: ${PREVIEW_URL}/query

**TTL**: 24 hours (auto-cleanup after PR merge/close)
"

Authentication for Preview Environments

Preview Authentication Options:

| Method | Implementation | Security | ATP Selection |
|---|---|---|---|
| No Auth | Public access | ❌ None | ❌ Not recommended |
| Basic Auth | Nginx basic auth | ⚠️ Low | ⚠️ Option for simple testing |
| OAuth/SSO | Azure AD integration | ✅ High | ✅ Recommended for production-like testing |
| IP Whitelist | Network policy | ⚠️ Moderate | ⚠️ Option for restricted access |

Basic Auth Configuration:

# Ingress with basic auth
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: atp-preview-ingress
  namespace: atp-preview-pr123
  annotations:
    nginx.ingress.kubernetes.io/auth-type: basic
    nginx.ingress.kubernetes.io/auth-secret: preview-basic-auth
    nginx.ingress.kubernetes.io/auth-realm: "Preview Environment - PR #123"
spec:
  ingressClassName: nginx
  rules:
  - host: pr123.preview.atp.connectsoft.example
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: atp-ingestion
            port:
              number: 80
---
# Basic auth secret
apiVersion: v1
kind: Secret
metadata:
  name: preview-basic-auth
  namespace: atp-preview-pr123
type: Opaque
data:
  # Must be an htpasswd-format entry (user:hashed-password), base64-encoded;
  # plain base64 of "user:password" will NOT work, and shell substitution
  # is not evaluated inside YAML.
  # Generate with: htpasswd -nb preview preview123 | base64 -w0
  auth: <base64-encoded htpasswd entry>

OAuth Configuration:

# Ingress with OAuth
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: atp-preview-ingress
  namespace: atp-preview-pr123
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-staging
    nginx.ingress.kubernetes.io/auth-url: "https://oauth2-proxy.atp-production.svc.cluster.local/oauth2/auth"
    # auth-signin must be a browser-reachable URL (example external hostname),
    # not a cluster-internal service address
    nginx.ingress.kubernetes.io/auth-signin: "https://auth.preview.atp.connectsoft.example/oauth2/start?rd=$scheme://$host$request_uri"
spec:
  ingressClassName: nginx
  rules:
  - host: pr123.preview.atp.connectsoft.example
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: atp-ingestion
            port:
              number: 80

Network Policies for Preview

Preview Network Policy:

# Network policy for preview: Allow ingress from internet
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: preview-allow-ingress
  namespace: atp-preview-pr123
spec:
  podSelector:
    matchLabels:
      app: atp-ingestion
  policyTypes:
  - Ingress
  - Egress
  ingress:
  # Allow from ingress controller
  - from:
    - namespaceSelector:
        matchLabels:
          name: ingress-nginx
    - podSelector:
        matchLabels:
          app: ingress-nginx
  # Allow from monitoring (for metrics)
  - from:
    - namespaceSelector:
        matchLabels:
          name: monitoring
  egress:
  # Allow DNS
  - to:
    - namespaceSelector:
        matchLabels:
          name: kube-system
    ports:
    - protocol: UDP
      port: 53
  # Allow to shared test database
  - to:
    - namespaceSelector:
        matchLabels:
          name: atp-test-db
    ports:
    - protocol: TCP
      port: 5432

Integration Testing in Preview

Running Integration Tests Against Preview

Integration Test Pipeline:

# azure-pipelines-preview-tests.yml
trigger: none

pr:
  branches:
    include:
    - dev

pool:
  vmImage: 'ubuntu-latest'

variables:
  - name: PR_NUMBER
    # Runtime-only variable; compile-time ${{ }} expressions cannot read it
    value: $(System.PullRequest.PullRequestNumber)
  - name: PREVIEW_URL
    value: https://pr$(PR_NUMBER).preview.atp.connectsoft.example

stages:
- stage: RunIntegrationTests
  displayName: 'Run Integration Tests Against Preview'
  jobs:
  - job: IntegrationTests
    displayName: 'Integration Tests'
    steps:
    - task: DotNetCoreCLI@2
      displayName: 'Run Integration Tests'
      inputs:
        command: 'test'
        projects: '**/IntegrationTests.csproj'
        arguments: >-
          --filter "Category=Preview"
          --logger "trx;LogFileName=results.trx"
          --results-directory $(Agent.TempDirectory)/test-results
      env:
        PreviewUrl: $(PREVIEW_URL)  # surfaced to the tests as an environment variable

    - task: PublishTestResults@2
      displayName: 'Publish Test Results'
      inputs:
        testResultsFiles: '**/*.trx'
        testRunTitle: 'Preview Integration Tests - PR #$(PR_NUMBER)'

Integration Test Configuration:

// C#: Integration test configuration (xUnit)
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;
using Xunit;

public class PreviewIntegrationTests
{
    private readonly string _previewUrl;

    public PreviewIntegrationTests()
    {
        _previewUrl = Environment.GetEnvironmentVariable("PreviewUrl")
            ?? "https://pr123.preview.atp.connectsoft.example";
    }

    [Fact]
    [Trait("Category", "Preview")]  // xUnit traits; matched by --filter "Category=Preview"
    public async Task TestIngestionService()
    {
        var client = new HttpClient
        {
            BaseAddress = new Uri(_previewUrl)
        };

        var response = await client.GetAsync("/health/ready");
        Assert.Equal(HttpStatusCode.OK, response.StatusCode);
    }
}

Database/Dependency Mocking

Mocking Strategy:

| Dependency | Strategy | Implementation |
|---|---|---|
| Database | ⚠️ Shared test DB | ✅ Real database (isolated schema) |
| Redis | ✅ Shared test Redis | ✅ Real Redis (isolated keys) |
| External APIs | ✅ Mock | ✅ WireMock or MSW |
| Service Bus | ⚠️ Shared test queue | ✅ Real Service Bus (isolated queue) |

Mocked Dependencies Configuration:

# External API mocks
apiVersion: apps/v1
kind: Deployment
metadata:
  name: external-api-mock
  namespace: atp-preview-pr123
spec:
  replicas: 1
  template:
    spec:
      containers:
      - name: wiremock
        image: wiremock/wiremock:latest
        ports:
        - containerPort: 8080
        env:
        - name: MAPPINGS_DIR
          value: /home/wiremock/mappings
        volumeMounts:
        - name: mappings
          mountPath: /home/wiremock/mappings
      volumes:
      - name: mappings
        configMap:
          name: wiremock-mappings

Shared Test Services

Shared Test Services Architecture:

graph TB
    subgraph "Shared Test Namespace"
        TEST_DB[(Shared Test DB<br/>Isolated schemas)]
        TEST_REDIS[(Shared Test Redis<br/>Isolated keys)]
        TEST_SB[Shared Test Service Bus<br/>Isolated queues]
    end
    subgraph "Preview PR #123"
        PREVIEW1[Preview Services]
    end
    subgraph "Preview PR #124"
        PREVIEW2[Preview Services]
    end

    PREVIEW1 -->|Isolated schema| TEST_DB
    PREVIEW2 -->|Isolated schema| TEST_DB
    PREVIEW1 -->|Isolated keys| TEST_REDIS
    PREVIEW2 -->|Isolated keys| TEST_REDIS
    PREVIEW1 -->|Isolated queue| TEST_SB
    PREVIEW2 -->|Isolated queue| TEST_SB

    style TEST_DB fill:#FFE5B4
    style TEST_REDIS fill:#FFE5B4
    style TEST_SB fill:#FFE5B4

Database and Dependencies

Mock Services vs Real Dependencies

Dependency Strategy Matrix:

| Dependency | Mock | Real | ATP Decision |
|---|---|---|---|
| SQL Database | ⚠️ Possible | Real (isolated schema) | ✅ Real with isolation |
| Redis | ⚠️ Possible | Real (isolated keys) | ✅ Real with isolation |
| Service Bus | ⚠️ Possible | Real (isolated queue) | ✅ Real with isolation |
| External APIs | ✅ Mock | ⚠️ Costly | ✅ Mock (WireMock) |
| Key Vault | ❌ N/A | Real | ✅ Real (shared) |

Shared Test Database Approach

Shared Test Database with Isolated Schemas:

# ExternalSecret for shared test database
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: sql-connection-preview
  namespace: atp-preview-pr123
spec:
  secretStoreRef:
    name: azure-keyvault-dev
    kind: ClusterSecretStore
  data:
  - secretKey: connectionString
    remoteRef:
      key: connection-strings/atp-test-db/preview-connection-string
      # Connection string with PR-specific schema: atp_preview_pr123

Database Schema Isolation:

-- Create isolated schema per PR
CREATE SCHEMA IF NOT EXISTS atp_preview_pr123;
GRANT ALL PRIVILEGES ON SCHEMA atp_preview_pr123 TO atp_preview_user;

-- Connection string for PR #123 (PostgreSQL; schema isolation via search_path)
-- Host=test-db.postgres.database.azure.com;Database=atp_test;SearchPath=atp_preview_pr123;...

Database Cleanup:

#!/bin/bash
# scripts/cleanup-preview-database.sh

PR_NUMBER="${1}"

echo "🗑️  Cleaning up database schema for PR #${PR_NUMBER}"

SCHEMA_NAME="atp_preview_pr${PR_NUMBER}"

# Drop schema (cascades to all objects)
psql -h test-db.postgres.database.azure.com \
  -U atp_admin \
  -d atp_test \
  -c "DROP SCHEMA IF EXISTS ${SCHEMA_NAME} CASCADE;"

echo "✅ Database schema cleaned up"
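
The cleanup script interpolates ${PR_NUMBER} straight into DDL. A small guard (hypothetical helper, not part of the scripts above) rejects anything that is not a plain number before it can reach the DROP SCHEMA statement:

```shell
# Derive the per-PR schema name, refusing non-numeric input so nothing
# hostile can be spliced into the DROP SCHEMA ... CASCADE statement.
schema_for_pr() {
  case "$1" in
    ''|*[!0-9]*) echo "invalid PR number: $1" >&2; return 1 ;;
  esac
  echo "atp_preview_pr$1"
}

schema_for_pr 123                         # prints atp_preview_pr123
schema_for_pr 'x; DROP' || echo "rejected"
```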

Ephemeral Database per Preview

Ephemeral Database Option:

# Option: Create ephemeral database per preview (costlier but more isolated)
apiVersion: v1
kind: ConfigMap
metadata:
  name: preview-db-config
  namespace: atp-preview-pr123
data:
  database-name: "atp_preview_pr123"
  create-database: "true"
  ttl: "24h"

Ephemeral Database Creation:

#!/bin/bash
# scripts/create-preview-database.sh

PR_NUMBER="${1}"

echo "📦 Creating ephemeral database for PR #${PR_NUMBER}"

DB_NAME="atp_preview_pr${PR_NUMBER}"

# Create database via Azure CLI
az sql db create \
  --resource-group atp-nonprod-rg \
  --server atp-test-sql-server \
  --name "${DB_NAME}" \
  --service-objective S0 \
  --tags \
    Environment=Preview \
    PRNumber="${PR_NUMBER}" \
    AutoCleanup=true \
    TTL="24h"

echo "✅ Ephemeral database created: ${DB_NAME}"

ATP Decision: Shared test database with isolated schemas (cost-effective, sufficient isolation)


Preview Environment Lifecycle

Creation → Testing → Validation → Deletion

Preview Lifecycle Flow:

sequenceDiagram
    participant Dev as Developer
    participant PR as Pull Request
    participant Pipeline as Azure Pipeline
    participant K8s as Kubernetes
    participant FluxCD as FluxCD
    participant Tests as Integration Tests

    Dev->>PR: Create PR
    PR->>Pipeline: Trigger preview pipeline
    Pipeline->>K8s: Create namespace
    Pipeline->>FluxCD: Create GitRepository/Kustomization
    FluxCD->>K8s: Deploy preview manifests
    K8s->>Pipeline: Preview ready
    Pipeline->>PR: Update PR with preview URL
    Pipeline->>Tests: Run integration tests
    Tests->>PR: Update PR with test results
    PR->>Pipeline: PR merged/closed
    Pipeline->>K8s: Delete namespace
    Pipeline->>FluxCD: Delete GitRepository/Kustomization
    K8s->>Pipeline: Cleanup complete

Status Reporting in PR Comments

PR Status Comment:

#!/bin/bash
# scripts/update-pr-status.sh

PR_NUMBER="${1}"
STATUS="${2}"  # provisioning, active, testing, failed, cleaning
PREVIEW_URL="${3}"

echo "📝 Updating PR #${PR_NUMBER} status: ${STATUS}"

STATUS_EMOJI=""
case "${STATUS}" in
  provisioning) STATUS_EMOJI="🔄" ;;
  active) STATUS_EMOJI="✅" ;;
  testing) STATUS_EMOJI="🧪" ;;
  failed) STATUS_EMOJI="❌" ;;
  cleaning) STATUS_EMOJI="🧹" ;;
esac

COMMENT="## ${STATUS_EMOJI} Preview Environment Status

**Status**: ${STATUS}

${PREVIEW_URL:+**Preview URL**: ${PREVIEW_URL}}

**Timestamp**: $(date -u +%Y-%m-%dT%H:%M:%SZ)
"

# Add comment to PR via Azure DevOps API
az repos pr thread create \
  --organization "https://dev.azure.com/ConnectSoft" \
  --project "ATP" \
  --pull-request-id "${PR_NUMBER}" \
  --comments "[{\"content\": \"${COMMENT}\"}]"

Preview URL in PR Description

PR Description Update:

# Azure Pipeline: Update PR description
- task: Bash@3
  displayName: 'Update PR Description'
  inputs:
    targetType: 'inline'
    script: |
      ./scripts/update-pr-description.sh \
        --pr-number $(PR_NUMBER) \
        --preview-url $(PREVIEW_URL) \
        --status active

PR Description Template:

## 🚀 Preview Environment

Preview environment has been provisioned for this PR.

### Access Information

- **Preview URL**: https://pr123.preview.atp.connectsoft.example
- **Status**: ✅ Active
- **Namespace**: `atp-preview-pr123`

### Services

- **API Gateway**: https://pr123.preview.atp.connectsoft.example/gateway
- **Ingestion Service**: https://pr123.preview.atp.connectsoft.example/ingestion
- **Query Service**: https://pr123.preview.atp.connectsoft.example/query

### Testing

Integration tests have been executed against the preview environment.

- ✅ Smoke tests: Passed
- ✅ Integration tests: Passed
- ✅ Health checks: Passed

### Cleanup

This preview environment will be automatically cleaned up when:
- PR is merged
- PR is closed
- 4 hours of inactivity (auto scale-to-zero; overall namespace TTL is 24 hours)

**Created**: 2024-01-15T10:00:00Z
**TTL**: 24 hours

Summary: Preview Environments (Ephemeral)

  • Preview Environment Architecture: Ephemeral namespaces in dev cluster, resource isolation per PR, cost optimization strategies, lifecycle management
  • Automatic Provisioning: Azure Pipeline triggered by PR, namespace creation script, manifest generation with PR-specific values, FluxCD Kustomization for preview
  • Dynamic Manifest Generation: Namespace naming (atp-preview-pr{number}), Ingress hostname (pr{number}.preview.atp.connectsoft.example), resource limits (smaller than dev), image tag from PR branch
  • FluxCD Configuration: Dynamic GitRepository per PR, preview Kustomization, sync policies for preview, health checks and validation
  • Resource Cleanup: Auto-delete after PR merge, auto-delete after PR close, manual cleanup for stuck resources, cost tracking and alerts
  • Cost Optimization: Shared node pools, reduced replica counts (1 vs 3), minimal resource requests, auto-shutdown after 4 hours inactivity, spot instances for preview environments
  • Access Control: Preview URL generation, authentication for preview environments (basic auth/OAuth), network policies for preview
  • Integration Testing: Running integration tests against preview, database/dependency mocking, shared test services
  • Database and Dependencies: Mock services vs real dependencies, shared test database approach with isolated schemas, ephemeral database per preview option
  • Preview Environment Lifecycle: Creation → testing → validation → deletion flow, status reporting in PR comments, preview URL in PR description

Rollback & Disaster Recovery

Purpose: Define rollback procedures for ATP GitOps deployments including Git-based rollbacks, progressive rollback strategies, application state recovery, database migration rollbacks, FluxCD rollback mechanisms, Azure backup integration, disaster recovery scenarios, and incident response procedures to ensure rapid recovery from failures and minimize downtime.


Git-Based Rollback

Simple Rollback: Git Revert

Git Revert for Simple Rollback:

#!/bin/bash
# scripts/rollback-simple.sh

SERVICE="${1:-atp-ingestion}"
ENVIRONMENT="${2:-production}"
NAMESPACE="atp-${ENVIRONMENT}"

echo "⏪ Rolling back ${SERVICE} in ${ENVIRONMENT}"

# Find the last deployment commit
LAST_COMMIT=$(git log --oneline --grep="deploy.*${SERVICE}" -n 1 --format="%H")

if [ -z "${LAST_COMMIT}" ]; then
  echo "❌ No deployment commit found for ${SERVICE}"
  exit 1
fi

echo "📝 Last deployment commit: ${LAST_COMMIT}"

# Revert the commit
git revert --no-edit "${LAST_COMMIT}"

# Push the revert commit
git push origin ${ENVIRONMENT}

echo "✅ Rollback committed: ${SERVICE} reverted to previous state"

# FluxCD will automatically reconcile to the new Git state

Git Revert for Multiple Commits:

#!/bin/bash
# scripts/rollback-multiple.sh

SERVICE="${1}"
ENVIRONMENT="${2:-production}"
COMMIT_COUNT="${3:-1}"  # Number of commits to revert

echo "⏪ Rolling back ${COMMIT_COUNT} commits for ${SERVICE}"

# Revert the last N commits, newest first (reverting oldest-first can conflict
# when later commits depend on the earlier ones)
git log -n ${COMMIT_COUNT} --format="%H" | while read commit; do
  echo "Reverting commit: ${commit}"
  git revert --no-edit "${commit}"
done

# Push all revert commits
git push origin ${ENVIRONMENT}

echo "✅ Rolled back ${COMMIT_COUNT} commits"
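
The per-commit loop can also be collapsed into a single range revert, which git processes newest-first automatically. A sketch in a throwaway repository (file names and commit messages are illustrative, not the ATP repo):

```shell
# Demonstrate a range revert: undo the last two "deployments" in one command
REPO=$(mktemp -d)
cd "$REPO"
git init -q
git config user.email ci@example.com
git config user.name ci

echo "v1" > manifest.yaml; git add manifest.yaml; git commit -qm "deploy v1"
echo "v2" > manifest.yaml; git commit -qam "deploy v2"
echo "v3" > manifest.yaml; git commit -qam "deploy v3"

# HEAD~2..HEAD covers the v2 and v3 commits; git revert walks them newest-first
git revert --no-edit HEAD~2..HEAD

cat manifest.yaml   # back to "v1", with history preserved (2 revert commits)
```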

Complex Rollback: Git Reset

Git Reset for Complex Rollback (Use with caution):

#!/bin/bash
# scripts/rollback-reset.sh

ENVIRONMENT="${1:-production}"
TARGET_COMMIT="${2}"  # Commit hash or tag to rollback to

if [ -z "${TARGET_COMMIT}" ]; then
  echo "Usage: $0 <environment> <commit-hash-or-tag>"
  echo "Example: $0 production v1.2.2"
  exit 1
fi

echo "⚠️  WARNING: Git reset will rewrite history"
echo "⏪ Rolling back ${ENVIRONMENT} to ${TARGET_COMMIT}"

# Verify target commit exists
if ! git cat-file -e "${TARGET_COMMIT}^{commit}" 2>/dev/null; then
  echo "❌ Target commit ${TARGET_COMMIT} not found"
  exit 1
fi

# Create backup branch before reset
BACKUP_BRANCH="${ENVIRONMENT}-backup-$(date +%Y%m%d-%H%M%S)"
git branch "${BACKUP_BRANCH}" "${ENVIRONMENT}"
echo "📦 Backup branch created: ${BACKUP_BRANCH}"

# Reset to target commit (hard reset makes the branch tree match the target;
# a soft reset would leave the newer state staged and defeat the rollback)
git checkout "${ENVIRONMENT}"
git reset --hard "${TARGET_COMMIT}"

# Force push (requires branch protection override for emergency)
git push origin "${ENVIRONMENT}" --force

echo "✅ Rollback complete: ${ENVIRONMENT} reset to ${TARGET_COMMIT}"
echo "⚠️  Backup branch: ${BACKUP_BRANCH} (keep for reference)"

ATP Recommendation: Prefer git revert over git reset (preserves history, safer for audit trail)

Rollback to Specific Commit

Rollback to Specific Commit:

#!/bin/bash
# scripts/rollback-to-commit.sh

SERVICE="${1}"
ENVIRONMENT="${2:-production}"
TARGET_COMMIT="${3}"

if [ -z "${TARGET_COMMIT}" ]; then
  echo "Usage: $0 <service> <environment> <commit-hash>"
  echo "Example: $0 atp-ingestion production abc123def456"
  exit 1
fi

echo "⏪ Rolling back ${SERVICE} to commit ${TARGET_COMMIT}"

# Checkout the target commit
git checkout "${TARGET_COMMIT}" -- "apps/${SERVICE}/"

# Check if changes exist
if git diff --quiet "${ENVIRONMENT}" -- "apps/${SERVICE}/"; then
  echo "⚠️  No changes to rollback (already at target commit)"
  exit 0
fi

# Commit the rollback
git add "apps/${SERVICE}/"
git commit -m "rollback(${SERVICE}): Revert to commit ${TARGET_COMMIT}"

# Push to environment branch
git push origin "${ENVIRONMENT}"

echo "✅ Rollback complete: ${SERVICE} reverted to ${TARGET_COMMIT}"
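
The path-level restore this script relies on (`git checkout <commit> -- <path>`) stages an older version of one directory without rewriting history. A sketch in a throwaway repository (paths and image tags are illustrative):

```shell
# Demonstrate restoring a single service's manifests to a known-good commit
REPO=$(mktemp -d)
cd "$REPO"
git init -q
git config user.email ci@example.com
git config user.name ci

mkdir -p apps/atp-ingestion
echo "image: atp-ingestion:1.0.0" > apps/atp-ingestion/deployment.yaml
git add .; git commit -qm "deploy 1.0.0"
GOOD_COMMIT=$(git rev-parse HEAD)

echo "image: atp-ingestion:1.1.0" > apps/atp-ingestion/deployment.yaml
git commit -qam "deploy 1.1.0"

# Restore just this service's manifests; the change is staged, then committed
# as a new rollback commit (history preserved, unlike git reset)
git checkout "${GOOD_COMMIT}" -- apps/atp-ingestion/
git commit -qm "rollback(atp-ingestion): revert to ${GOOD_COMMIT}"

grep "1.0.0" apps/atp-ingestion/deployment.yaml   # prints the 1.0.0 image line
```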

Rollback to Commit with Validation:

#!/bin/bash
# scripts/rollback-to-commit-validated.sh

SERVICE="${1}"
ENVIRONMENT="${2:-production}"
TARGET_COMMIT="${3}"

echo "⏪ Rolling back ${SERVICE} to commit ${TARGET_COMMIT}"

# Validate target commit
echo "🔍 Validating target commit..."
git show --no-patch --format="%H%n%an%n%ae%n%ad%n%s" "${TARGET_COMMIT}"

read -p "Continue with rollback? (yes/no): " confirm
if [ "${confirm}" != "yes" ]; then
  echo "❌ Rollback cancelled"
  exit 1
fi

# Perform rollback
./scripts/rollback-to-commit.sh "${SERVICE}" "${ENVIRONMENT}" "${TARGET_COMMIT}"

# Wait for FluxCD reconciliation
echo "⏳ Waiting for FluxCD to reconcile..."
sleep 60

# Verify rollback
./scripts/verify-rollback.sh "${SERVICE}" "${ENVIRONMENT}" "${TARGET_COMMIT}"

Rollback to Specific Tag

Rollback to Specific Tag:

#!/bin/bash
# scripts/rollback-to-tag.sh

SERVICE="${1}"
ENVIRONMENT="${2:-production}"
TARGET_TAG="${3}"  # e.g., v1.2.2

if [ -z "${TARGET_TAG}" ]; then
  echo "Usage: $0 <service> <environment> <tag>"
  echo "Example: $0 atp-ingestion production v1.2.2"
  exit 1
fi

echo "⏪ Rolling back ${SERVICE} to tag ${TARGET_TAG}"

# Verify tag exists
if ! git rev-parse "${TARGET_TAG}" >/dev/null 2>&1; then
  echo "❌ Tag ${TARGET_TAG} not found"
  echo "Available tags:"
  git tag --sort=-creatordate | head -10
  exit 1
fi

# Get commit hash for tag
TARGET_COMMIT=$(git rev-parse "${TARGET_TAG}")

echo "📦 Tag ${TARGET_TAG} points to commit ${TARGET_COMMIT}"

# Rollback to the tagged commit
./scripts/rollback-to-commit.sh "${SERVICE}" "${ENVIRONMENT}" "${TARGET_COMMIT}"

echo "✅ Rollback complete: ${SERVICE} reverted to ${TARGET_TAG} (${TARGET_COMMIT})"

List Available Tags for Rollback:

#!/bin/bash
# scripts/list-rollback-tags.sh

SERVICE="${1}"
ENVIRONMENT="${2:-production}"

echo "📋 Available rollback tags for ${SERVICE}:"
echo ""

git tag --sort=-creatordate --format="%(refname:short)|%(creatordate:iso)|%(subject)" | \
  while IFS='|' read -r tag date subject; do
    # Check if tag affects the service
    if git diff "${tag}~1" "${tag}" --name-only | grep -q "apps/${SERVICE}/"; then
      echo "  ${tag} - ${date}"
      echo "    ${subject}"
      echo ""
    fi
  done

Progressive Rollback

Rolling Back One Service at a Time

Progressive Service Rollback:

#!/bin/bash
# scripts/progressive-rollback.sh

ENVIRONMENT="${1:-production}"
SERVICES="${2}"  # Comma-separated: atp-ingestion,atp-query,atp-gateway

if [ -z "${SERVICES}" ]; then
  echo "Usage: $0 <environment> <service1,service2,service3>"
  echo "Example: $0 production atp-ingestion,atp-query,atp-gateway"
  exit 1
fi

echo "🔄 Progressive rollback: ${SERVICES} in ${ENVIRONMENT}"

# Split services into array
IFS=',' read -ra SERVICE_ARRAY <<< "${SERVICES}"

for SERVICE in "${SERVICE_ARRAY[@]}"; do
  echo ""
  echo "⏪ Rolling back ${SERVICE}..."

  # Rollback service
  ./scripts/rollback-simple.sh "${SERVICE}" "${ENVIRONMENT}"

  # Wait for reconciliation
  echo "⏳ Waiting for reconciliation (60s)..."
  sleep 60

  # Validate rollback
  echo "🔍 Validating rollback..."
  if ./scripts/verify-service-health.sh "${SERVICE}" "${ENVIRONMENT}"; then
    echo "✅ ${SERVICE} rollback validated"
  else
    echo "❌ ${SERVICE} rollback validation failed"
    read -p "Continue with next service? (yes/no): " continue
    if [ "${continue}" != "yes" ]; then
      echo "⚠️  Progressive rollback stopped"
      exit 1
    fi
  fi
done

echo ""
echo "✅ Progressive rollback complete: All services rolled back"

Rollback with Canary (Gradual Revert)

Canary Rollback Configuration:

# Rollback with Flagger canary (gradual traffic reduction)
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: atp-ingestion-rollback
  namespace: atp-production
spec:
  analysis:
    interval: 1m
    threshold: 3
    stepWeights: [75, 50, 25, 0]  # Canary traffic: 75% → 50% → 25% → 0% (full rollback)
    metrics:
    - name: error-rate
      thresholdRange:
        max: 1
      interval: 30s

Gradual Rollback Script:

#!/bin/bash
# scripts/canary-rollback.sh

SERVICE="${1}"
ENVIRONMENT="${2:-production}"
ROLLBACK_STEPS="${3:-4}"  # Number of steps

echo "🔄 Gradual canary rollback: ${SERVICE} in ${ENVIRONMENT}"

# Current canary weight (assume 100% for rollback start)
CURRENT_WEIGHT=100
STEP_SIZE=$((100 / ROLLBACK_STEPS))

for STEP in $(seq 1 ${ROLLBACK_STEPS}); do
  NEW_WEIGHT=$((CURRENT_WEIGHT - STEP_SIZE))

  echo "📊 Step ${STEP}/${ROLLBACK_STEPS}: Reducing traffic to ${NEW_WEIGHT}%"

  # Update Istio VirtualService: route weights must sum to 100, so adjust
  # both the stable (route/0) and canary (route/1) weights together
  kubectl patch virtualservice "${SERVICE}" -n "${ENVIRONMENT}" --type=json \
    -p="[{\"op\": \"replace\", \"path\": \"/spec/http/0/route/0/weight\", \"value\": $((100 - NEW_WEIGHT))}, {\"op\": \"replace\", \"path\": \"/spec/http/0/route/1/weight\", \"value\": ${NEW_WEIGHT}}]"

  # Wait and validate
  echo "⏳ Waiting 2 minutes for validation..."
  sleep 120

  # Check error rate
  ERROR_RATE=$(./scripts/get-error-rate.sh "${SERVICE}" "${ENVIRONMENT}")
  echo "📈 Error rate: ${ERROR_RATE}%"

  if (( $(echo "${ERROR_RATE} > 5" | bc -l) )); then
    echo "❌ Error rate too high, accelerating rollback"
    NEW_WEIGHT=$((NEW_WEIGHT - STEP_SIZE))
  fi

  CURRENT_WEIGHT=${NEW_WEIGHT}

  if [ ${CURRENT_WEIGHT} -le 0 ]; then
    echo "✅ Full rollback complete (0% traffic to canary)"
    break
  fi
done

echo "✅ Gradual rollback complete"
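
The traffic schedule the loop walks through can be computed in isolation (the `rollback_weights` helper is a hypothetical sketch of the step arithmetic above, not part of the scripts):

```shell
# Emit the canary weight after each rollback step: 100% down to 0% in equal steps
rollback_weights() {
  local steps="$1" w=100 step_size
  step_size=$((100 / steps))
  for _ in $(seq 1 "$steps"); do
    w=$((w - step_size))
    if [ "$w" -lt 0 ]; then w=0; fi   # clamp when 100 is not divisible by steps
    echo "$w"
  done
}

rollback_weights 4   # one weight per line: 75, 50, 25, 0
```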

Validation at Each Rollback Step

Rollback Validation Script:

#!/bin/bash
# scripts/validate-rollback-step.sh

SERVICE="${1}"
ENVIRONMENT="${2:-production}"
STEP="${3}"

echo "🔍 Validating rollback step ${STEP} for ${SERVICE}"

# Health check validation
HEALTH_STATUS=$(kubectl get deployment "${SERVICE}" -n "atp-${ENVIRONMENT}" \
  -o jsonpath='{.status.conditions[?(@.type=="Available")].status}')

if [ "${HEALTH_STATUS}" != "True" ]; then
  echo "❌ Health check failed: Deployment not available"
  exit 1
fi

# Error rate validation
ERROR_RATE=$(./scripts/get-error-rate.sh "${SERVICE}" "${ENVIRONMENT}")
ERROR_THRESHOLD=5

if (( $(echo "${ERROR_RATE} > ${ERROR_THRESHOLD}" | bc -l) )); then
  echo "❌ Error rate validation failed: ${ERROR_RATE}% > ${ERROR_THRESHOLD}%"
  exit 1
fi

# Latency validation
P95_LATENCY=$(./scripts/get-p95-latency.sh "${SERVICE}" "${ENVIRONMENT}")
LATENCY_THRESHOLD=500  # 500ms

if (( $(echo "${P95_LATENCY} > ${LATENCY_THRESHOLD}" | bc -l) )); then
  echo "❌ Latency validation failed: ${P95_LATENCY}ms > ${LATENCY_THRESHOLD}ms"
  exit 1
fi

# Readiness probe validation
READY_REPLICAS=$(kubectl get deployment "${SERVICE}" -n "atp-${ENVIRONMENT}" \
  -o jsonpath='{.status.readyReplicas}')
DESIRED_REPLICAS=$(kubectl get deployment "${SERVICE}" -n "atp-${ENVIRONMENT}" \
  -o jsonpath='{.spec.replicas}')

if [ "${READY_REPLICAS}" != "${DESIRED_REPLICAS}" ]; then
  echo "❌ Replica validation failed: ${READY_REPLICAS}/${DESIRED_REPLICAS} ready"
  exit 1
fi

echo "✅ All validation checks passed for step ${STEP}"

Application State Recovery

Handling Database Schema Changes

Database Schema Rollback Strategy:

| Migration Type | Rollback Strategy | ATP Decision |
|---|---|---|
| Add Column | Drop column (if nullable) | ✅ Safe rollback |
| Drop Column | Add column back | ⚠️ Data loss risk |
| Rename Column | Rename back | ✅ Safe rollback |
| Change Type | Revert type change | ⚠️ Data truncation risk |
| Add Table | Drop table | ✅ Safe rollback |
| Drop Table | Recreate table | ❌ Data loss |
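The decision table can be approximated mechanically when triaging a migration file. A rough sketch (a hypothetical helper, not an existing ATP script) that buckets a migration's SQL by rollback risk:

```shell
#!/bin/bash
# Hypothetical helper: bucket a migration's SQL by rollback risk, mirroring
# the decision table above (coarse keyword matching only -- a real
# classifier would parse the statements)
classify_migration_risk() {
  local sql="${1}"
  if echo "${sql}" | grep -qiE 'DROP[[:space:]]+TABLE|TRUNCATE'; then
    echo "data-loss"   # ❌ rollback cannot restore the data
  elif echo "${sql}" | grep -qiE 'DROP[[:space:]]+COLUMN|ALTER[[:space:]]+COLUMN'; then
    echo "risky"       # ⚠️ data loss or truncation possible
  else
    echo "safe"        # ✅ additive change, safe to roll back
  fi
}

classify_migration_risk "DROP TABLE AuditTrail;"   # prints: data-loss
```

A check like this belongs in the pull-request pipeline, so risky migrations are flagged before they ever reach an environment.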

Forward-Only Migrations (Preferred):

// C#: Forward-only migration (no rollback)
// Entity Framework migration: AddAuditIndex
public partial class AddAuditIndex : Migration
{
    protected override void Up(MigrationBuilder migrationBuilder)
    {
        migrationBuilder.CreateIndex(
            name: "IX_AuditTrail_Timestamp",
            table: "AuditTrail",
            column: "Timestamp");
    }

    // No Down() method - forward-only migration
    // Rollback = deploy previous version that doesn't use the index
}

Database Rollback Coordination:

#!/bin/bash
# scripts/rollback-with-db.sh

SERVICE="${1}"
ENVIRONMENT="${2:-production}"
TARGET_VERSION="${3}"

echo "🔄 Coordinated rollback: Application + Database"

# Step 1: Check if database rollback is needed
CURRENT_SCHEMA_VERSION=$(./scripts/get-db-schema-version.sh "${ENVIRONMENT}")
TARGET_SCHEMA_VERSION=$(./scripts/get-schema-version-for-tag.sh "${TARGET_VERSION}")

if [ "${CURRENT_SCHEMA_VERSION}" != "${TARGET_SCHEMA_VERSION}" ]; then
  echo "⚠️  Database schema rollback required"
  echo "  Current: ${CURRENT_SCHEMA_VERSION}"
  echo "  Target: ${TARGET_SCHEMA_VERSION}"

  read -p "Proceed with database rollback? (yes/no): " confirm
  if [ "${confirm}" != "yes" ]; then
    echo "❌ Rollback cancelled"
    exit 1
  fi

  # Rollback database schema
  ./scripts/rollback-database-schema.sh "${ENVIRONMENT}" "${TARGET_SCHEMA_VERSION}"
fi

# Step 2: Rollback application
./scripts/rollback-to-tag.sh "${SERVICE}" "${ENVIRONMENT}" "${TARGET_VERSION}"

echo "✅ Coordinated rollback complete"

Data Migration Rollback

Data Migration Rollback Strategy:

#!/bin/bash
# scripts/rollback-data-migration.sh

ENVIRONMENT="${1:-production}"
MIGRATION_ID="${2}"

echo "🔄 Rolling back data migration: ${MIGRATION_ID}"

# Check if migration has been applied
if ! ./scripts/check-migration-applied.sh "${MIGRATION_ID}" "${ENVIRONMENT}"; then
  echo "⚠️  Migration ${MIGRATION_ID} not applied, skipping rollback"
  exit 0
fi

# Execute rollback script (if exists)
ROLLBACK_SCRIPT="migrations/${MIGRATION_ID}/rollback.sql"

if [ -f "${ROLLBACK_SCRIPT}" ]; then
  echo "📝 Executing rollback script: ${ROLLBACK_SCRIPT}"
  psql -h "${DB_HOST}" -U "${DB_USER}" -d "${DB_NAME}" -f "${ROLLBACK_SCRIPT}"
else
  echo "⚠️  No rollback script found: ${ROLLBACK_SCRIPT}"
  echo "⚠️  Manual data recovery may be required"
  exit 1
fi

# Mark migration as rolled back
./scripts/mark-migration-rolled-back.sh "${MIGRATION_ID}" "${ENVIRONMENT}"

echo "✅ Data migration rollback complete"

Stateful Application Considerations

StatefulSet Rollback:

#!/bin/bash
# scripts/rollback-statefulset.sh

SERVICE="${1}"
ENVIRONMENT="${2:-production}"

echo "⏪ Rolling back StatefulSet: ${SERVICE}"

# Record the current and update revisions for the audit log
# (updateRevision is the in-flight target, not the previous revision;
# kubectl rollout undo below selects the previous revision itself)
CURRENT_REVISION=$(kubectl get statefulset "${SERVICE}" -n "atp-${ENVIRONMENT}" \
  -o jsonpath='{.status.currentRevision}')
UPDATE_REVISION=$(kubectl get statefulset "${SERVICE}" -n "atp-${ENVIRONMENT}" \
  -o jsonpath='{.status.updateRevision}')
echo "📊 currentRevision=${CURRENT_REVISION}, updateRevision=${UPDATE_REVISION}"

# Rollback to previous revision
kubectl rollout undo statefulset "${SERVICE}" -n "atp-${ENVIRONMENT}"

# Monitor rollout
kubectl rollout status statefulset "${SERVICE}" -n "atp-${ENVIRONMENT}" --timeout=10m

echo "✅ StatefulSet rollback complete"

PVC Retention During Rollback:

# StatefulSet with PVC retention
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: atp-stateful-service
spec:
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 10Gi
      # PVCs are NOT deleted on StatefulSet deletion
      # Data is preserved during rollback

Database Migration Rollback

Forward-Only Migrations (Preferred)

Forward-Only Migration Strategy:

| Approach | Pros | Cons | ATP Decision |
|---|---|---|---|
| Forward-Only | ✅ Simpler, safer | ⚠️ No automatic rollback | Preferred |
| Reversible Migrations | ✅ Can rollback | ❌ Complex, risky | ⚠️ Use sparingly |
| No Migrations | ✅ No risk | ❌ No schema changes | ❌ Not practical |

Forward-Only Migration Example:

// Entity Framework: Forward-only migration
public partial class AddAuditIndex : Migration
{
    protected override void Up(MigrationBuilder migrationBuilder)
    {
        // Add index
        migrationBuilder.CreateIndex(
            name: "IX_AuditTrail_Timestamp",
            table: "AuditTrail",
            column: "Timestamp");
    }

    // No Down() method - rollback = deploy previous app version
}

Rollback Strategy for Forward-Only Migrations:

  1. Rollback Application: Deploy previous application version (doesn't use new schema)
  2. Schema Compatibility: New schema must be backward compatible with old application
  3. Cleanup Migration: Create new migration to clean up unused schema (later)
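The compatibility rule in step 2 can be enforced mechanically if each application version records the minimum schema version it tolerates. A minimal sketch, assuming a hypothetical version map (the real check lives in `scripts/verify-schema-compatibility.sh`, which is not shown here):

```shell
#!/bin/bash
# Sketch of the forward-only compatibility rule: the live schema may be
# NEWER than an app version requires, but never OLDER than its minimum.
# The map below is hypothetical; a real mapping would be generated from
# the migration history.
declare -A MIN_SCHEMA_FOR_APP=(
  ["v1.4.0"]=12
  ["v1.5.0"]=14
)

schema_compatible_with_app() {
  local app_version="${1}" schema_version="${2}"
  local required="${MIN_SCHEMA_FOR_APP[${app_version}]:-0}"
  [ "${schema_version}" -ge "${required}" ]
}

schema_compatible_with_app "v1.4.0" 14 && echo "compatible"   # prints: compatible
```

Under this rule, rolling the application back from v1.5.0 to v1.4.0 needs no schema rollback at all, because schema 14 still satisfies v1.4.0's minimum of 12.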

Rollback Scripts (If Necessary)

Reversible Migration with Rollback:

// Entity Framework: Reversible migration (use sparingly)
public partial class RenameAuditColumn : Migration
{
    protected override void Up(MigrationBuilder migrationBuilder)
    {
        migrationBuilder.RenameColumn(
            name: "EventDate",
            table: "AuditTrail",
            newName: "Timestamp");
    }

    protected override void Down(MigrationBuilder migrationBuilder)
    {
        migrationBuilder.RenameColumn(
            name: "Timestamp",
            table: "AuditTrail",
            newName: "EventDate");
    }
}

Rollback Script:

-- migrations/20240115_AddAuditIndex/rollback.sql
-- Rollback script for the AddAuditIndex migration (T-SQL / Azure SQL syntax)

-- Drop the index
DROP INDEX IF EXISTS IX_AuditTrail_Timestamp ON AuditTrail;

-- Log rollback
INSERT INTO MigrationHistory (MigrationId, AppliedAt, RolledBackAt, Status)
VALUES ('20240115_AddAuditIndex', GETDATE(), GETDATE(), 'RolledBack');

Data Loss Prevention

Data Loss Prevention Checklist:

#!/bin/bash
# scripts/prevent-data-loss-rollback.sh

MIGRATION_ID="${1}"
ENVIRONMENT="${2:-production}"

echo "🔒 Data Loss Prevention Check for Migration: ${MIGRATION_ID}"

# Check if migration involves data deletion
if grep -q "DELETE\|DROP\|TRUNCATE" "migrations/${MIGRATION_ID}/up.sql"; then
  echo "⚠️  WARNING: Migration contains data deletion operations"

  # Create backup before rollback
  echo "📦 Creating database backup..."
  ./scripts/backup-database.sh "${ENVIRONMENT}" "pre-rollback-${MIGRATION_ID}"

  # Ask for confirmation
  read -p "Migration may cause data loss. Continue? (yes/no): " confirm
  if [ "${confirm}" != "yes" ]; then
    echo "❌ Rollback cancelled"
    exit 1
  fi
fi

# Check for dependent data
echo "🔍 Checking for dependent data..."
DEPENDENT_RECORDS=$(./scripts/check-dependent-data.sh "${MIGRATION_ID}")

if [ "${DEPENDENT_RECORDS}" -gt 0 ]; then
  echo "⚠️  WARNING: ${DEPENDENT_RECORDS} dependent records found"
  read -p "Continue with rollback? (yes/no): " confirm
  if [ "${confirm}" != "yes" ]; then
    echo "❌ Rollback cancelled"
    exit 1
  fi
fi

echo "✅ Data loss prevention checks passed"

Coordinating App Rollback with DB Rollback

Coordinated Rollback Procedure:

sequenceDiagram
    participant Admin as Administrator
    participant App as Application Rollback
    participant DB as Database Rollback
    participant FluxCD as FluxCD
    participant K8s as Kubernetes

    Admin->>App: Initiate rollback
    App->>DB: Check schema compatibility
    DB-->>App: Schema version check
    App->>DB: Rollback database (if needed)
    DB->>DB: Execute rollback script
    DB-->>App: Database rolled back
    App->>FluxCD: Revert Git commit
    FluxCD->>K8s: Reconcile to previous state
    K8s->>App: Deploy previous app version
    App->>Admin: Rollback complete

Coordinated Rollback Script:

#!/bin/bash
# scripts/coordinated-rollback.sh

SERVICE="${1}"
ENVIRONMENT="${2:-production}"
TARGET_TAG="${3}"

echo "🔄 Coordinated Application + Database Rollback"

# Step 1: Get target application version and schema version
TARGET_APP_VERSION="${TARGET_TAG}"
TARGET_SCHEMA_VERSION=$(./scripts/get-schema-version-for-tag.sh "${TARGET_TAG}")

CURRENT_SCHEMA_VERSION=$(./scripts/get-db-schema-version.sh "${ENVIRONMENT}")

echo "📊 Rollback Plan:"
echo "  Application: ${TARGET_APP_VERSION}"
echo "  Database Schema: ${CURRENT_SCHEMA_VERSION} → ${TARGET_SCHEMA_VERSION}"

# Step 2: Check schema compatibility
if [ "${CURRENT_SCHEMA_VERSION}" != "${TARGET_SCHEMA_VERSION}" ]; then
  echo "⚠️  Database schema rollback required"

  # Verify backward compatibility
  if ! ./scripts/verify-schema-compatibility.sh "${TARGET_SCHEMA_VERSION}" "${TARGET_APP_VERSION}"; then
    echo "❌ Schema version ${TARGET_SCHEMA_VERSION} not compatible with app ${TARGET_APP_VERSION}"
    exit 1
  fi

  # Step 2a: Rollback database schema first
  echo "🔄 Step 1/2: Rolling back database schema..."
  ./scripts/rollback-database-schema.sh "${ENVIRONMENT}" "${TARGET_SCHEMA_VERSION}"

  # Wait for schema rollback to complete
  sleep 30
else
  echo "✅ No database schema rollback needed"
fi

# Step 3: Rollback application
echo "🔄 Step 2/2: Rolling back application..."
./scripts/rollback-to-tag.sh "${SERVICE}" "${ENVIRONMENT}" "${TARGET_TAG}"

# Step 4: Validate rollback
echo "🔍 Validating coordinated rollback..."
./scripts/validate-rollback.sh "${SERVICE}" "${ENVIRONMENT}"

echo "✅ Coordinated rollback complete"

FluxCD Rollback

Reverting Kustomization

Revert Kustomization via Git:

#!/bin/bash
# scripts/fluxcd-rollback-kustomization.sh

KUSTOMIZATION="${1}"  # e.g., apps-production
ENVIRONMENT="${2:-production}"
TARGET_COMMIT="${3}"

echo "⏪ Rolling back Kustomization: ${KUSTOMIZATION}"

# Revert the Kustomization path in Git
git checkout "${TARGET_COMMIT}" -- "apps/" "infrastructure/"

# Commit the rollback
git add apps/ infrastructure/
git commit -m "rollback: Revert ${KUSTOMIZATION} to ${TARGET_COMMIT}"

# Push to environment branch
git push origin "${ENVIRONMENT}"

echo "✅ Kustomization rollback committed"
echo "⏳ FluxCD will reconcile automatically (polling interval: 5m)"

Suspend Kustomization for Manual Rollback:

#!/bin/bash
# scripts/suspend-kustomization.sh

KUSTOMIZATION="${1}"
NAMESPACE="${2:-flux-system}"

echo "⏸️  Suspending Kustomization: ${KUSTOMIZATION}"

# Suspend reconciliation
flux suspend kustomization "${KUSTOMIZATION}" -n "${NAMESPACE}"

# Verify suspension
kubectl get kustomization "${KUSTOMIZATION}" -n "${NAMESPACE}" \
  -o jsonpath='{.spec.suspend}'

echo "✅ Kustomization suspended (reconciliation paused)"

Resume Kustomization After Rollback:

#!/bin/bash
# scripts/resume-kustomization.sh

KUSTOMIZATION="${1}"
NAMESPACE="${2:-flux-system}"

echo "▶️  Resuming Kustomization: ${KUSTOMIZATION}"

# Resume reconciliation
flux resume kustomization "${KUSTOMIZATION}" -n "${NAMESPACE}"

echo "✅ Kustomization resumed (reconciliation active)"

Reverting HelmRelease

HelmRelease Rollback:

#!/bin/bash
# scripts/fluxcd-rollback-helmrelease.sh

HELMRELEASE="${1}"
NAMESPACE="${2:-atp-production}"

echo "⏪ Rolling back HelmRelease: ${HELMRELEASE}"

# Get current release version
CURRENT_REVISION=$(kubectl get helmrelease "${HELMRELEASE}" -n "${NAMESPACE}" \
  -o jsonpath='{.status.lastReleaseRevision}')

PREVIOUS_REVISION=$((CURRENT_REVISION - 1))

echo "📊 Current revision: ${CURRENT_REVISION}"
echo "📊 Rolling back to revision: ${PREVIOUS_REVISION}"

# Choose ONE of the following options -- they are alternatives, not steps

# Option 1 (GitOps-native, preferred): revert the Helm values in Git;
# set PREVIOUS_COMMIT to the commit that carried the known-good values
# git checkout "${PREVIOUS_COMMIT}" -- "apps/${HELMRELEASE}/values.yaml"

# Option 2: Direct Helm rollback (bypasses GitOps; FluxCD may revert it on
# the next reconciliation unless the Kustomization is suspended first)
helm rollback "${HELMRELEASE}" "${PREVIOUS_REVISION}" -n "${NAMESPACE}"

# Option 3: Patch the HelmRelease spec in place with the previous values
# kubectl patch helmrelease "${HELMRELEASE}" -n "${NAMESPACE}" --type=json \
#   -p='[{"op": "replace", "path": "/spec/values", "value": {...previous values...}}]'

echo "✅ HelmRelease rollback initiated"

HelmRelease Git-Based Rollback:

#!/bin/bash
# scripts/fluxcd-helmrelease-git-rollback.sh

HELMRELEASE="${1}"
ENVIRONMENT="${2:-production}"
TARGET_COMMIT="${3}"

echo "⏪ Git-based HelmRelease rollback: ${HELMRELEASE}"

# Revert Helm values to target commit
git checkout "${TARGET_COMMIT}" -- "apps/${HELMRELEASE}/values.yaml" \
  "apps/${HELMRELEASE}/Chart.yaml"

# Commit the rollback
git add "apps/${HELMRELEASE}/"
git commit -m "rollback(helm): Revert ${HELMRELEASE} to ${TARGET_COMMIT}"

# Push to environment branch
git push origin "${ENVIRONMENT}"

echo "✅ HelmRelease rollback committed"
echo "⏳ FluxCD will reconcile and deploy previous Helm chart version"

Suspend and Resume Reconciliation

Suspend Reconciliation:

#!/bin/bash
# scripts/suspend-reconciliation.sh

KUSTOMIZATION="${1}"
NAMESPACE="${2:-flux-system}"

echo "⏸️  Suspending reconciliation for: ${KUSTOMIZATION}"

# Suspend via Flux CLI
flux suspend kustomization "${KUSTOMIZATION}" -n "${NAMESPACE}"

# Or via kubectl
kubectl patch kustomization "${KUSTOMIZATION}" -n "${NAMESPACE}" --type=json \
  -p='[{"op": "replace", "path": "/spec/suspend", "value": true}]'

# Verify suspension
kubectl get kustomization "${KUSTOMIZATION}" -n "${NAMESPACE}" \
  -o jsonpath='{.spec.suspend}'

echo "✅ Reconciliation suspended"

Resume Reconciliation:

#!/bin/bash
# scripts/resume-reconciliation.sh

KUSTOMIZATION="${1}"
NAMESPACE="${2:-flux-system}"

echo "▶️  Resuming reconciliation for: ${KUSTOMIZATION}"

# Resume via Flux CLI
flux resume kustomization "${KUSTOMIZATION}" -n "${NAMESPACE}"

# Or via kubectl
kubectl patch kustomization "${KUSTOMIZATION}" -n "${NAMESPACE}" --type=json \
  -p='[{"op": "replace", "path": "/spec/suspend", "value": false}]'

echo "✅ Reconciliation resumed"

Manual Intervention Procedures

Manual Intervention Runbook:

#!/bin/bash
# scripts/manual-intervention-runbook.sh

INCIDENT_TYPE="${1}"  # deployment-failure, drift-detection, reconciliation-stuck

echo "🚨 Manual Intervention Runbook"
echo "Incident Type: ${INCIDENT_TYPE}"

case "${INCIDENT_TYPE}" in
  "deployment-failure")
    echo "📋 Deployment Failure Intervention:"
    echo "1. Check deployment status: kubectl get deployment -n atp-production"
    echo "2. Check pod logs: kubectl logs -n atp-production deployment/<service>"
    echo "3. Check FluxCD status: flux get kustomizations -n flux-system"
    echo "4. Suspend reconciliation: flux suspend kustomization <name> -n flux-system"
    echo "5. Manually fix issue or rollback: ./scripts/rollback-simple.sh <service> production"
    echo "6. Resume reconciliation: flux resume kustomization <name> -n flux-system"
    ;;

  "drift-detection")
    echo "📋 Drift Detection Intervention:"
    echo "1. Check drift: flux get kustomizations --watch"
    echo "2. Identify drifted resources: kubectl diff -f <manifest>"
    echo "3. Option A: Fix cluster state to match Git"
    echo "   kubectl delete <resource> (let FluxCD recreate)"
    echo "4. Option B: Update Git to match cluster state"
    echo "   git checkout <cluster-state>"
    echo "5. Force reconciliation: flux reconcile kustomization <name>"
    ;;

  "reconciliation-stuck")
    echo "📋 Stuck Reconciliation Intervention:"
    echo "1. Check Kustomization status: flux get kustomizations -n flux-system"
    echo "2. Describe for details: kubectl describe kustomization <name> -n flux-system"
    echo "3. Check logs: kubectl logs -n flux-system deployment/kustomize-controller"
    echo "4. Suspend: flux suspend kustomization <name> -n flux-system"
    echo "5. Fix issue (check GitRepository, permissions, etc.)"
    echo "6. Resume: flux resume kustomization <name> -n flux-system"
    echo "7. Force reconcile: flux reconcile kustomization <name> --with-source"
    ;;
esac

Azure Backup Integration

Backing Up AKS Resources (Velero)

Velero Installation:

# Install Velero CLI
curl -fsSL -o velero-v1.11.0-linux-amd64.tar.gz \
  https://github.com/vmware-tanzu/velero/releases/download/v1.11.0/velero-v1.11.0-linux-amd64.tar.gz
tar -xvf velero-v1.11.0-linux-amd64.tar.gz
sudo mv velero-v1.11.0-linux-amd64/velero /usr/local/bin/

# Install Velero on AKS
velero install \
  --provider azure \
  --plugins velero/velero-plugin-for-microsoft-azure:v1.7.0 \
  --bucket velero-backups \
  --secret-file ./credentials-velero \
  --backup-location-config resourceGroup=atp-production-rg,storageAccount=atpprodvelero,subscriptionId=<subscription-id> \
  --snapshot-location-config apiTimeout=5m,resourceGroup=atp-production-rg,subscriptionId=<subscription-id>

Velero Backup Configuration:

# velero/backup-schedule.yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup-production
  namespace: velero
spec:
  schedule: "0 2 * * *"  # 2 AM daily
  template:
    includedNamespaces:
    - atp-production
    excludedResources:
    - events
    - events.events.k8s.io
    ttl: 720h0m0s  # Retain backups for 30 days (Go duration; "d" is not a valid unit)
    storageLocation: default
    volumeSnapshotLocations:
    - default
    metadata:
      labels:
        environment: production
        backup-type: scheduled

Manual Backup:

#!/bin/bash
# scripts/velero-backup.sh

BACKUP_NAME="manual-backup-$(date +%Y%m%d-%H%M%S)"
NAMESPACE="${1:-atp-production}"

echo "📦 Creating Velero backup: ${BACKUP_NAME}"

# Create backup
velero backup create "${BACKUP_NAME}" \
  --include-namespaces "${NAMESPACE}" \
  --ttl 720h0m0s \
  --wait

# Verify backup
velero backup describe "${BACKUP_NAME}"

echo "✅ Backup created: ${BACKUP_NAME}"

PersistentVolume Snapshots

Volume Snapshot Configuration:

# VolumeSnapshot for StatefulSet
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: atp-stateful-data-snapshot-YYYYMMDD  # render the date before apply (e.g. envsubst); kubectl does not expand $(date ...)
  namespace: atp-production
spec:
  volumeSnapshotClassName: csi-azuredisk-vsc
  source:
    persistentVolumeClaimName: data-atp-stateful-service-0

Automated Volume Snapshots:

# Velero: Automated volume snapshots
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: volume-snapshots-production
  namespace: velero
spec:
  schedule: "0 */6 * * *"  # Every 6 hours
  template:
    includedNamespaces:
    - atp-production
    includedResources:
    - persistentvolumes
    - persistentvolumeclaims
    volumeSnapshotLocations:
    - default
    ttl: 168h0m0s  # Retain snapshots for 7 days

Etcd Backup

Etcd Backup via AKS:

#!/bin/bash
# scripts/backup-etcd.sh

RESOURCE_GROUP="${1:-atp-production-rg}"
CLUSTER_NAME="${2:-atp-prod-eus-aks}"

echo "📦 Backing up AKS etcd"

# AKS automatically backs up etcd, but we can trigger manual snapshot
# Note: etcd backup requires Azure support or cluster admin access

# Alternative: Use Velero for cluster-level backup
velero backup create "etcd-backup-$(date +%Y%m%d)" \
  --include-cluster-resources=true \
  --wait

echo "✅ Etcd backup initiated"

AKS Automatic Etcd Backup:

  • Automatic: AKS automatically backs up etcd every 8 hours
  • Retention: 30 days
  • Recovery: Available via Azure support

Backup Retention Policies

Backup Retention Configuration:

# Velero: Backup retention policy
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: production-backup-schedule
spec:
  schedule: "0 2 * * *"
  template:
    ttl: 720h0m0s  # Keep backups for 30 days
    includedNamespaces:
    - atp-production

Retention Policy by Backup Type:

| Backup Type | Retention | Rationale |
|---|---|---|
| Daily Scheduled | 30 days | Standard retention |
| Weekly Full | 90 days | Long-term retention |
| Monthly Full | 365 days | Compliance (1 year) |
| Pre-Deployment | 7 days | Short-term rollback |
| Manual Backup | 30 days | On-demand backups |
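Velero TTLs are Go duration strings, which have no day unit, so the day-based retention matrix above has to be expressed in hours (for example, 30 days becomes `720h0m0s`). A small hypothetical helper for the conversion:

```shell
#!/bin/bash
# Hypothetical helper: convert a day-based retention period into a
# Velero-compatible Go duration (Go durations support h/m/s, not d)
days_to_velero_ttl() {
  local days="${1}"
  echo "$(( days * 24 ))h0m0s"
}

days_to_velero_ttl 30   # prints: 720h0m0s
```

Usage, e.g.: `velero backup create nightly --ttl "$(days_to_velero_ttl 30)"`.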

Backup Retention Cleanup:

#!/bin/bash
# scripts/cleanup-old-backups.sh

# Velero expires backups automatically once their TTL elapses; to force-delete
# older ones, filter by creationTimestamp (velero backup delete has no age flag)
CUTOFF=$(date -d "30 days ago" +%s)
velero backup get --output json | \
  jq -r --argjson cutoff "${CUTOFF}" \
    '.items[] | select((.metadata.creationTimestamp | fromdateiso8601) < $cutoff) | .metadata.name' | \
  xargs -r -n1 velero backup delete --confirm

echo "🧹 Cleaned up backups older than 30 days"

Disaster Recovery Scenarios

Cluster Failure

Cluster Failure Recovery:

graph TB
    subgraph "Disaster: Cluster Failure"
        FAIL[AKS Cluster<br/>Failure]
    end
    subgraph "Recovery Process"
        DETECT[Detect Failure]
        ASSESS[Assess Impact]
        RECREATE[Recreate Cluster<br/>from GitOps]
        RESTORE[Restore Data<br/>from Velero]
        VALIDATE[Validate Recovery]
    end
    subgraph "Backup Sources"
        GIT[Git Repository<br/>Manifests]
        VELERO[Velero Backups<br/>State]
        ACR[ACR Images]
    end

    FAIL --> DETECT
    DETECT --> ASSESS
    ASSESS --> RECREATE
    RECREATE --> GIT
    RECREATE --> RESTORE
    RESTORE --> VELERO
    RESTORE --> VALIDATE
    VALIDATE --> ACR

    style FAIL fill:#FF6B6B
    style RECREATE fill:#90EE90
    style RESTORE fill:#90EE90

Cluster Failure Recovery Procedure:

#!/bin/bash
# scripts/recover-cluster-failure.sh

CLUSTER_NAME="${1:-atp-prod-eus-aks}"
RESOURCE_GROUP="${2:-atp-production-rg}"

echo "🚨 Cluster Failure Recovery"
echo "Cluster: ${CLUSTER_NAME}"
echo "Resource Group: ${RESOURCE_GROUP}"

# Step 1: Verify cluster is actually down
if az aks show --resource-group "${RESOURCE_GROUP}" --name "${CLUSTER_NAME}" \
  --query "provisioningState" -o tsv | grep -q "Succeeded"; then
  echo "⚠️  Cluster appears to be running. Verify the issue."
  exit 1
fi

# Step 2: Recreate cluster from Pulumi
echo "🔄 Step 1: Recreating AKS cluster from Pulumi (IaC)..."
cd infrastructure/
pulumi stack select production
pulumi up --yes

# Step 3: Wait for cluster to be ready
echo "⏳ Waiting for cluster to be ready..."
az aks wait --name "${CLUSTER_NAME}" --resource-group "${RESOURCE_GROUP}" \
  --created --timeout 1800  # timeout is in seconds; 30 is far too short for cluster creation

# Step 4: Bootstrap FluxCD
echo "🔄 Step 2: Bootstrapping FluxCD..."
az aks get-credentials --resource-group "${RESOURCE_GROUP}" --name "${CLUSTER_NAME}"
flux bootstrap git \
  --url=https://dev.azure.com/ConnectSoft/ATP/_git/atp-gitops \
  --branch=production \
  --path=./clusters/production

# Step 5: Restore from Velero backup
echo "🔄 Step 3: Restoring from Velero backup..."
LATEST_BACKUP=$(velero backup get --output json | \
  jq -r '[.items[] | select(.status.phase == "Completed")]
         | sort_by(.metadata.creationTimestamp) | last | .metadata.name')

velero restore create "restore-${CLUSTER_NAME}-$(date +%Y%m%d)" \
  --from-backup "${LATEST_BACKUP}" \
  --wait

# Step 6: Validate recovery
echo "🔍 Step 4: Validating recovery..."
./scripts/validate-cluster-health.sh

echo "✅ Cluster recovery complete"

Region Outage

Multi-Region Recovery:

#!/bin/bash
# scripts/recover-region-outage.sh

PRIMARY_REGION="${1:-eastus}"
SECONDARY_REGION="${2:-westeurope}"

echo "🚨 Region Outage Recovery"
echo "Primary Region: ${PRIMARY_REGION} (DOWN)"
echo "Secondary Region: ${SECONDARY_REGION} (DR)"

# Step 1: Failover traffic to secondary region
echo "🔄 Step 1: Failing over traffic to ${SECONDARY_REGION}..."
az network front-door backend-pool update \
  --resource-group atp-production-rg \
  --front-door-name atp-frontdoor \
  --name primary-eus \
  --backend-pool-parameters enabled=false

az network front-door backend-pool update \
  --resource-group atp-production-rg \
  --front-door-name atp-frontdoor \
  --name secondary-weu \
  --backend-pool-parameters enabled=true priority=1

# Step 2: Promote secondary database to primary (failover of the geo-replica)
echo "🔄 Step 2: Promoting secondary database..."
az sql db replica set-primary \
  --resource-group atp-production-rg \
  --server atp-prod-sql-server-weu \
  --name atp-prod-db

# Step 3: Scale up secondary cluster
echo "🔄 Step 3: Scaling up secondary cluster..."
az aks scale \
  --resource-group atp-production-rg \
  --name atp-prod-weu-aks \
  --node-count 10

# Step 4: Validate failover
echo "🔍 Step 4: Validating failover..."
./scripts/validate-failover.sh "${SECONDARY_REGION}"

echo "✅ Region failover complete"

Data Corruption

Data Corruption Recovery:

#!/bin/bash
# scripts/recover-data-corruption.sh

ENVIRONMENT="${1:-production}"
CORRUPTION_TIME="${2}"  # ISO timestamp of when corruption occurred

echo "🚨 Data Corruption Recovery"
echo "Environment: ${ENVIRONMENT}"
echo "Corruption Detected At: ${CORRUPTION_TIME}"

# Step 1: Find backup before corruption
echo "🔍 Step 1: Finding backup before corruption..."
TARGET_BACKUP=$(velero backup get --output json | \
  jq -r --arg time "${CORRUPTION_TIME}" \
    '[.items[] | select(.status.phase == "Completed")
      | select(.metadata.creationTimestamp < $time)]
     | sort_by(.metadata.creationTimestamp) | last | .metadata.name? // empty')

if [ -z "${TARGET_BACKUP}" ]; then
  echo "❌ No backup found before corruption time"
  exit 1
fi

echo "📦 Target backup: ${TARGET_BACKUP}"

# Step 2: Stop application to prevent further corruption
echo "🛑 Step 2: Stopping application..."
kubectl scale deployment --all --replicas=0 -n "atp-${ENVIRONMENT}"

# Step 3: Restore from backup
echo "🔄 Step 3: Restoring from backup..."
velero restore create "restore-corruption-$(date +%Y%m%d)" \
  --from-backup "${TARGET_BACKUP}" \
  --include-namespaces "atp-${ENVIRONMENT}" \
  --wait

# Step 4: Validate data integrity
echo "🔍 Step 4: Validating data integrity..."
./scripts/validate-data-integrity.sh "${ENVIRONMENT}"

# Step 5: Restart application
echo "▶️  Step 5: Restarting application..."
# Interim replica count; FluxCD reconciliation restores the counts defined in Git
kubectl scale deployment --all --replicas=5 -n "atp-${ENVIRONMENT}"

echo "✅ Data corruption recovery complete"

Complete Platform Loss

Complete Platform Recovery:

#!/bin/bash
# scripts/recover-complete-platform-loss.sh

echo "🚨 Complete Platform Loss Recovery"
echo "This procedure recreates the entire ATP platform from GitOps"

# Step 1: Verify GitOps repository is accessible
echo "🔍 Step 1: Verifying GitOps repository..."
if ! git ls-remote https://dev.azure.com/ConnectSoft/ATP/_git/atp-gitops >/dev/null 2>&1; then
  echo "❌ GitOps repository not accessible"
  exit 1
fi

# Step 2: Recreate infrastructure from Pulumi
echo "🔄 Step 2: Recreating infrastructure..."
cd infrastructure/
pulumi stack select production
pulumi up --yes

# Step 3: Create AKS clusters
echo "🔄 Step 3: Creating AKS clusters..."
./scripts/create-aks-clusters.sh production

# Step 4: Bootstrap FluxCD on all clusters
echo "🔄 Step 4: Bootstrapping FluxCD..."
for CLUSTER in atp-prod-eus-aks atp-prod-weu-aks; do
  echo "  Bootstrapping ${CLUSTER}..."
  az aks get-credentials --resource-group atp-production-rg --name "${CLUSTER}"
  flux bootstrap git \
    --url=https://dev.azure.com/ConnectSoft/ATP/_git/atp-gitops \
    --branch=production \
    --path=./clusters/production
done

# Step 5: Restore application state from Velero
echo "🔄 Step 5: Restoring application state..."
LATEST_BACKUP=$(velero backup get --output json | \
  jq -r '[.items[] | select(.status.phase == "Completed")]
         | sort_by(.metadata.creationTimestamp) | last | .metadata.name')

velero restore create "restore-platform-$(date +%Y%m%d)" \
  --from-backup "${LATEST_BACKUP}" \
  --wait

# Step 6: Validate platform
echo "🔍 Step 6: Validating platform recovery..."
./scripts/validate-platform.sh

echo "✅ Platform recovery complete"

RTO/RPO Targets Per Environment

RTO/RPO Targets Matrix:

| Environment | RTO (Recovery Time Objective) | RPO (Recovery Point Objective) | Rationale |
|---|---|---|---|
| Dev | 4 hours | 24 hours | Lower priority, acceptable downtime |
| Test | 2 hours | 12 hours | Moderate priority, faster recovery needed |
| Staging | 1 hour | 4 hours | Production-like, important for validation |
| Production | 30 minutes | 1 hour | Critical, minimal downtime required |

RTO/RPO Validation:

#!/bin/bash
# scripts/validate-rto-rpo.sh

ENVIRONMENT="${1:-production}"

echo "📊 RTO/RPO Validation for ${ENVIRONMENT}"

# Get target RTO/RPO from matrix
case "${ENVIRONMENT}" in
  "dev") TARGET_RTO=14400 TARGET_RPO=86400 ;;      # 4h / 24h
  "test") TARGET_RTO=7200 TARGET_RPO=43200 ;;      # 2h / 12h
  "staging") TARGET_RTO=3600 TARGET_RPO=14400 ;;   # 1h / 4h
  "production") TARGET_RTO=1800 TARGET_RPO=3600 ;; # 30m / 1h
  *) echo "❌ Unknown environment: ${ENVIRONMENT}"; exit 1 ;;
esac

echo "Target RTO: ${TARGET_RTO} seconds ($(($TARGET_RTO / 60)) minutes)"
echo "Target RPO: ${TARGET_RPO} seconds ($(($TARGET_RPO / 60)) minutes)"

# Simulate recovery and measure time
START_TIME=$(date +%s)
./scripts/recover-cluster-failure.sh
END_TIME=$(date +%s)
ACTUAL_RTO=$((END_TIME - START_TIME))

# Get latest backup timestamp
LATEST_BACKUP_TIME=$(velero backup get --output json | \
  jq -r '[.items[] | select(.status.phase == "Completed")]
         | sort_by(.metadata.creationTimestamp) | last | .metadata.creationTimestamp' | \
  xargs -I {} date -d {} +%s)
CURRENT_TIME=$(date +%s)
ACTUAL_RPO=$((CURRENT_TIME - LATEST_BACKUP_TIME))

# Validate
if [ ${ACTUAL_RTO} -le ${TARGET_RTO} ]; then
  echo "✅ RTO Met: ${ACTUAL_RTO}s <= ${TARGET_RTO}s"
else
  echo "❌ RTO Exceeded: ${ACTUAL_RTO}s > ${TARGET_RTO}s"
fi

if [ ${ACTUAL_RPO} -le ${TARGET_RPO} ]; then
  echo "✅ RPO Met: ${ACTUAL_RPO}s <= ${TARGET_RPO}s"
else
  echo "❌ RPO Exceeded: ${ACTUAL_RPO}s > ${TARGET_RPO}s"
fi

DR Testing and Drills

Quarterly DR Drills for Production

DR Drill Schedule:

| Frequency | Environment | Drill Type | Rationale |
|---|---|---|---|
| Quarterly | Production | Full DR drill | Validate production recovery procedures |
| Monthly | Staging | Partial DR drill | Test recovery procedures in production-like environment |
| Bi-weekly | Test | Automated DR test | Continuous validation |

Quarterly DR Drill Plan:

#!/bin/bash
# scripts/dr-drill-production.sh

DRILL_DATE="${1:-$(date +%Y%m%d)}"
SCENARIO="${2:-cluster-failure}"  # cluster-failure, region-outage, data-corruption

echo "🎯 Quarterly DR Drill - Production"
echo "Date: ${DRILL_DATE}"
echo "Scenario: ${SCENARIO}"

# Pre-drill checklist
echo "📋 Pre-Drill Checklist:"
echo "  [ ] Notify stakeholders"
echo "  [ ] Backup current state"
echo "  [ ] Prepare recovery scripts"
echo "  [ ] Verify backup availability"
echo "  [ ] Document baseline metrics"

# Execute drill scenario
case "${SCENARIO}" in
  "cluster-failure")
    echo "🔄 Executing cluster failure drill..."
    ./scripts/dr-drill-cluster-failure.sh
    ;;
  "region-outage")
    echo "🔄 Executing region outage drill..."
    ./scripts/dr-drill-region-outage.sh
    ;;
  "data-corruption")
    echo "🔄 Executing data corruption drill..."
    ./scripts/dr-drill-data-corruption.sh
    ;;
esac

# Post-drill validation
echo "🔍 Post-Drill Validation:"
./scripts/validate-dr-drill.sh

# Generate drill report
echo "📝 Generating drill report..."
./scripts/generate-dr-drill-report.sh "${DRILL_DATE}" "${SCENARIO}"

echo "✅ DR Drill complete"

Drill Scenarios and Checklists

DR Drill Scenarios:

| Scenario | Description | Recovery Procedure | Frequency |
|---|---|---|---|
| Cluster Failure | Complete AKS cluster failure | Recreate cluster, restore from Velero | Quarterly |
| Region Outage | Primary region unavailable | Failover to secondary region | Quarterly |
| Data Corruption | Database corruption detected | Restore from point-in-time backup | Quarterly |
| Network Isolation | Network connectivity issues | Route traffic via secondary path | Monthly |
| Storage Failure | PersistentVolume failures | Restore from volume snapshots | Monthly |

Cluster Failure Drill Checklist:

## DR Drill: Cluster Failure

### Pre-Drill
- [ ] Schedule drill with stakeholders
- [ ] Create backup before drill
- [ ] Document baseline metrics
- [ ] Notify on-call team

### Drill Execution
- [ ] Simulate cluster failure (scale cluster to 0 nodes)
- [ ] Measure detection time
- [ ] Execute recovery procedure
  - [ ] Recreate cluster from Pulumi
  - [ ] Bootstrap FluxCD
  - [ ] Restore from Velero backup
- [ ] Measure recovery time (RTO)
- [ ] Validate application health
- [ ] Verify data integrity (RPO)

### Post-Drill
- [ ] Restore cluster to normal state
- [ ] Document actual RTO/RPO
- [ ] Identify improvement opportunities
- [ ] Update runbooks
- [ ] Generate drill report

Region Outage Drill:

#!/bin/bash
# scripts/dr-drill-region-outage.sh

echo "🎯 DR Drill: Region Outage"

# Simulate region outage (disable primary region)
# NOTE: illustrative Azure CLI call — exact Front Door commands vary by SKU and CLI version
echo "🔄 Simulating region outage..."
az network front-door backend-pool update \
  --resource-group atp-production-rg \
  --front-door-name atp-frontdoor \
  --name primary-eus \
  --backend-pool-parameters enabled=false

# Execute failover and measure its duration
echo "🔄 Executing failover..."
FAILOVER_START=$(date +%s)
./scripts/recover-region-outage.sh eastus westeurope
FAILOVER_END=$(date +%s)
FAILOVER_TIME=$((FAILOVER_END - FAILOVER_START))

echo "⏱️  Failover time: ${FAILOVER_TIME} seconds"

# Validate
./scripts/validate-failover.sh westeurope

# Restore (post-drill)
echo "🔄 Restoring primary region..."
az network front-door backend-pool update \
  --resource-group atp-production-rg \
  --front-door-name atp-frontdoor \
  --name primary-eus \
  --backend-pool-parameters enabled=true priority=1

echo "✅ DR Drill complete"

Drill Report and Improvements

DR Drill Report Template:

# DR Drill Report

## Drill Information
- **Date**: 2024-01-15
- **Scenario**: Cluster Failure
- **Environment**: Production
- **Duration**: 45 minutes

## Objectives Met
- [x] RTO Target: 30 minutes (Actual: 28 minutes) ✅
- [x] RPO Target: 1 hour (Actual: 45 minutes) ✅
- [x] All services recovered successfully ✅

## Issues Identified
1. Velero restore took longer than expected (15 minutes)
2. Database restore required manual intervention

## Improvements
1. Optimize Velero restore process
2. Automate database restore procedure
3. Update runbooks with lessons learned

## Action Items
- [ ] Update recovery scripts
- [ ] Improve backup frequency
- [ ] Add automated validation steps

Generate DR Drill Report:

#!/bin/bash
# scripts/generate-dr-drill-report.sh

DRILL_DATE="${1}"
SCENARIO="${2}"
REPORT_FILE="dr-drill-report-${DRILL_DATE}-${SCENARIO}.md"

echo "📝 Generating DR Drill Report..."

cat > "${REPORT_FILE}" <<EOF
# DR Drill Report

**Date**: ${DRILL_DATE}
**Scenario**: ${SCENARIO}
**Environment**: Production

## Results

### RTO/RPO Metrics
- **Target RTO**: 30 minutes
- **Actual RTO**: $(./scripts/get-actual-rto.sh)
- **Target RPO**: 1 hour
- **Actual RPO**: $(./scripts/get-actual-rpo.sh)

### Recovery Steps
1. $(./scripts/get-recovery-step.sh 1)
2. $(./scripts/get-recovery-step.sh 2)
3. $(./scripts/get-recovery-step.sh 3)

## Lessons Learned
$(./scripts/get-drill-lessons.sh)

## Action Items
$(./scripts/get-drill-action-items.sh)
EOF

echo "✅ Report generated: ${REPORT_FILE}"

Lessons Learned Process

Lessons Learned Template:

#!/bin/bash
# scripts/capture-dr-drill-lessons.sh

DRILL_DATE="${1}"
SCENARIO="${2}"

echo "📚 Capturing Lessons Learned from DR Drill..."

cat >> "dr-lessons-learned.md" <<EOF

## DR Drill: ${SCENARIO} - ${DRILL_DATE}

### What Went Well
- Automated cluster recreation from Pulumi worked seamlessly
- FluxCD bootstrap completed quickly
- Application recovery was faster than expected

### What Could Be Improved
- Velero restore process needs optimization
- Database restore requires more automation
- Communication during drill could be better

### Action Items
1. [ ] Optimize Velero restore scripts
2. [ ] Automate database restore procedure
3. [ ] Update incident response runbook
4. [ ] Schedule follow-up drill in 3 months

### Updated Procedures
- Recovery procedure updated: ./scripts/recover-cluster-failure.sh
- Runbook updated: docs/operations/disaster-recovery.md

---
EOF

echo "✅ Lessons learned captured"

Incident Response Integration

Automated Rollback on Critical Alerts

Automated Rollback Trigger:

# PrometheusRule: Trigger automated rollback on critical alert
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: auto-rollback-trigger
  namespace: monitoring
spec:
  groups:
  - name: auto-rollback
    rules:
    - alert: CriticalErrorRate
      expr: |
        sum(rate(http_requests_total{status=~"5.."}[5m])) 
        / 
        sum(rate(http_requests_total[5m])) 
        > 0.10  # 10% error rate
      for: 2m
      labels:
        severity: critical
        auto-rollback: "true"
      annotations:
        summary: "Critical error rate detected - triggering automated rollback"
        description: "Error rate: {{ $value | humanizePercentage }}"

Automated Rollback Webhook:

# AlertManager: Configure webhook for automated rollback
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: monitoring
data:
  alertmanager.yml: |
    global:
      resolve_timeout: 5m

    route:
      receiver: 'default'
      routes:
      - match:
          auto-rollback: "true"
        receiver: 'auto-rollback'

    receivers:
    - name: 'auto-rollback'
      webhook_configs:
      - url: 'http://auto-rollback-service.monitoring:8080/rollback'
        send_resolved: false

Auto-Rollback Service:

// C#: Auto-rollback service (sketch — the Alert model and helpers such as
// ExtractServiceFromAlert, IsAutoRollbackEnabled, ExecuteRollback, and NotifyTeam
// are service-specific and omitted here)
[ApiController]
[Route("[controller]")]
public class AutoRollbackController : ControllerBase
{
    [HttpPost("rollback")]
    public async Task<IActionResult> TriggerRollback([FromBody] Alert alert)
    {
        // Parse alert to determine service
        var service = ExtractServiceFromAlert(alert);
        var environment = ExtractEnvironmentFromAlert(alert);

        // Check if auto-rollback is enabled for this service
        if (!await IsAutoRollbackEnabled(service, environment))
        {
            return Ok(new { message = "Auto-rollback disabled for this service" });
        }

        // Execute rollback
        var rollbackResult = await ExecuteRollback(service, environment);

        // Notify team
        await NotifyTeam($"Auto-rollback triggered for {service}: {rollbackResult.Status}");

        return Ok(rollbackResult);
    }
}

Incident Commander Decision Making

Incident Commander Decision Tree:

graph TD
    START[Incident Detected] --> ASSESS{Assess Impact}
    ASSESS -->|High Impact| ROLLBACK{Can Rollback?}
    ASSESS -->|Low Impact| INVESTIGATE[Investigate Root Cause]

    ROLLBACK -->|Yes| EXECUTE[Execute Rollback]
    ROLLBACK -->|No| MITIGATE[Apply Mitigation]

    EXECUTE --> VALIDATE[Validate Rollback]
    VALIDATE -->|Success| MONITOR[Monitor Recovery]
    VALIDATE -->|Failure| ESCALATE[Escalate to Senior]

    MITIGATE --> INVESTIGATE
    INVESTIGATE --> FIX[Develop Fix]
    FIX --> DEPLOY[Deploy Fix]
    DEPLOY --> VALIDATE

    MONITOR --> CLOSE[Close Incident]

    style START fill:#FF6B6B
    style EXECUTE fill:#FFD700
    style VALIDATE fill:#90EE90
    style CLOSE fill:#90EE90

Incident Commander Decision Matrix:

| Impact | Error Rate | Decision | Action |
|--------|------------|----------|--------|
| Critical | > 10% | Immediate Rollback | Execute rollback, investigate later |
| High | 5-10% | ⚠️ Investigate + Prepare Rollback | Investigate, roll back if no fix within 15 min |
| Medium | 1-5% | ⚠️ Investigate First | Investigate, roll back if it worsens |
| Low | < 1% | Monitor | Monitor, no immediate action |

Rollback vs Forward Fix Decision Tree

Rollback vs Forward Fix Decision:

#!/bin/bash
# scripts/rollback-vs-fix-decision.sh

ERROR_RATE="${1}"  # Percentage
AFFECTED_USERS="${2}"  # Number of users
HAS_FIX="${3}"  # yes/no - Do we have a fix ready?

echo "🤔 Rollback vs Forward Fix Decision"
echo "Error Rate: ${ERROR_RATE}%"
echo "Affected Users: ${AFFECTED_USERS}"
echo "Has Fix Ready: ${HAS_FIX}"

# Decision logic
if (( $(echo "${ERROR_RATE} > 10" | bc -l) )); then
  DECISION="ROLLBACK"
  REASON="Critical error rate (>10%)"
elif (( $(echo "${ERROR_RATE} > 5" | bc -l) )) && [ "${HAS_FIX}" != "yes" ]; then
  DECISION="ROLLBACK"
  REASON="High error rate (>5%) and no fix ready"
elif (( $(echo "${ERROR_RATE} > 5" | bc -l) )) && [ "${HAS_FIX}" == "yes" ]; then
  DECISION="FORWARD_FIX"
  REASON="High error rate but fix available"
elif [ "${AFFECTED_USERS}" -gt 10000 ]; then
  DECISION="ROLLBACK"
  REASON="Large number of affected users"
else
  DECISION="FORWARD_FIX"
  REASON="Low impact, proceed with fix"
fi

echo "📊 Decision: ${DECISION}"
echo "📝 Reason: ${REASON}"

case "${DECISION}" in
  "ROLLBACK")
    echo "🔄 Executing rollback..."
    ./scripts/rollback-simple.sh
    ;;
  "FORWARD_FIX")
    echo "🔧 Proceeding with forward fix..."
    ./scripts/deploy-fix.sh
    ;;
esac

Post-Incident Review

Post-Incident Review Template:

# Post-Incident Review

## Incident Summary
- **Incident ID**: INC-2024-001
- **Date**: 2024-01-15
- **Duration**: 45 minutes
- **Impact**: 5% of users affected
- **Resolution**: Rollback to previous version

## Timeline
- 10:00 AM: Incident detected (error rate spike)
- 10:05 AM: Incident declared, on-call engaged
- 10:10 AM: Root cause identified (deployment issue)
- 10:15 AM: Rollback decision made
- 10:20 AM: Rollback executed
- 10:30 AM: Rollback validated, services restored
- 10:45 AM: Incident resolved

## Root Cause
Deployment of v1.2.3 introduced memory leak causing pod restarts and increased error rate.

## Actions Taken
1. Rolled back to v1.2.2
2. Validated service health
3. Investigated root cause

## Lessons Learned
- Need better pre-deployment testing for memory issues
- Rollback procedure worked well (RTO: 20 minutes)

## Action Items
- [ ] Add memory leak detection to CI pipeline
- [ ] Improve error rate monitoring
- [ ] Update deployment procedures

Post-Incident Review Script:

#!/bin/bash
# scripts/generate-post-incident-review.sh

INCIDENT_ID="${1}"
INCIDENT_DATE="${2}"

echo "📝 Generating Post-Incident Review..."

cat > "post-incident-review-${INCIDENT_ID}.md" <<EOF
# Post-Incident Review: ${INCIDENT_ID}

**Date**: ${INCIDENT_DATE}
**Status**: Resolved

## Timeline
$(./scripts/get-incident-timeline.sh "${INCIDENT_ID}")

## Root Cause
$(./scripts/get-root-cause.sh "${INCIDENT_ID}")

## Impact
- Users Affected: $(./scripts/get-affected-users.sh "${INCIDENT_ID}")
- Error Rate: $(./scripts/get-max-error-rate.sh "${INCIDENT_ID}")%
- Duration: $(./scripts/get-incident-duration.sh "${INCIDENT_ID}")

## Resolution
$(./scripts/get-resolution.sh "${INCIDENT_ID}")

## Lessons Learned
$(./scripts/get-lessons-learned.sh "${INCIDENT_ID}")

## Action Items
$(./scripts/get-action-items.sh "${INCIDENT_ID}")
EOF

echo "✅ Post-incident review generated"

Summary: Rollback & Disaster Recovery

  • Git-Based Rollback: Simple rollback (git revert), complex rollback (git reset), rollback to specific commit, rollback to specific tag
  • Progressive Rollback: Rolling back one service at a time, rollback with canary (gradual revert), validation at each rollback step
  • Application State Recovery: Handling database schema changes, data migration rollback, stateful application considerations
  • Database Migration Rollback: Forward-only migrations (preferred), rollback scripts (if necessary), data loss prevention, coordinating app rollback with DB rollback
  • FluxCD Rollback: Reverting Kustomization, reverting HelmRelease, suspend and resume reconciliation, manual intervention procedures
  • Azure Backup Integration: Backing up AKS resources (Velero), PersistentVolume snapshots, Etcd backup, backup retention policies
  • Disaster Recovery Scenarios: Cluster failure, region outage, data corruption, complete platform loss
  • RTO/RPO Targets: Dev (RTO 4h, RPO 24h), Test (RTO 2h, RPO 12h), Staging (RTO 1h, RPO 4h), Production (RTO 30m, RPO 1h)
  • DR Testing and Drills: Quarterly DR drills for production, drill scenarios and checklists, drill report and improvements, lessons learned process
  • Incident Response Integration: Automated rollback on critical alerts, incident commander decision making, rollback vs forward fix decision tree, post-incident review

Helm Chart Development for ATP Services

Purpose: Define the standards, best practices, and procedures for developing, testing, versioning, and publishing Helm charts for ATP microservices, ensuring consistent deployment patterns, maintainable chart structures, and reliable application deployments across all environments.


Helm Chart Structure

Chart.yaml: Metadata, Version, Dependencies

Chart.yaml for ATP Service:

# charts/atp-ingestion/Chart.yaml
apiVersion: v2
name: atp-ingestion
description: A Helm chart for ATP Ingestion Service - Collects and processes audit trail events
type: application
version: 1.2.3  # Chart version (SemVer)
appVersion: "1.2.3"  # Application version (from source code)
home: https://github.com/ConnectSoft/ATP
sources:
  - https://github.com/ConnectSoft/ATP/ConnectSoft.Audit.Ingestion
maintainers:
  - name: ATP Team
    email: atp-team@connectsoft.example
keywords:
  - audit-trail
  - atp
  - ingestion
  - microservice
annotations:
  category: Microservice
  architecture: microservices
dependencies:
  - name: redis
    version: "17.15.0"
    repository: "https://charts.bitnami.com/bitnami"
    condition: redis.enabled
  - name: postgresql
    version: "12.1.9"
    repository: "https://charts.bitnami.com/bitnami"
    condition: postgresql.enabled

Chart Metadata Standards:

| Field | Required | Description | ATP Convention |
|-------|----------|-------------|----------------|
| apiVersion | ✅ Yes | Chart API version | v2 (Helm 3+) |
| name | ✅ Yes | Chart name | atp-{service-name} (kebab-case) |
| version | ✅ Yes | Chart version | SemVer (MAJOR.MINOR.PATCH) |
| appVersion | ✅ Yes | Application version | Matches source code version |
| description | ✅ Yes | Chart description | One-line service description |
| type | ⚠️ Recommended | Chart type | application (default) |
| keywords | ⚠️ Recommended | Search keywords | Include audit-trail, atp, service name |
| maintainers | ⚠️ Recommended | Maintainer info | ATP Team contact |
| dependencies | ⚠️ Optional | Chart dependencies | External charts (Redis, PostgreSQL) |
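The naming and versioning conventions in this table can be checked mechanically before a chart is published. A small Bash sketch; `chart_ok` is a hypothetical pre-publish check, not a Helm command:

```shell
# Validate ATP chart conventions: kebab-case "atp-*" name, SemVer version.
# chart_ok is a hypothetical helper for a CI pre-publish gate.
chart_ok() {
  local name="$1" version="$2"
  if [[ ! "${name}" =~ ^atp(-[a-z0-9]+)+$ ]]; then
    echo "bad name: ${name}"
    return 1
  fi
  if [[ ! "${version}" =~ ^[0-9]+\.[0-9]+\.[0-9]+$ ]]; then
    echo "bad version: ${version}"
    return 1
  fi
  echo "ok: ${name} ${version}"
}

chart_ok "atp-ingestion" "1.2.3"   # → ok: atp-ingestion 1.2.3
```

The name and version could be read from Chart.yaml (e.g. with `helm show chart`) and fed to this check in the publish pipeline.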

values.yaml: Default Values

Complete values.yaml:

# charts/atp-ingestion/values.yaml
# Default values for atp-ingestion
# This is a YAML-formatted file

# Application Configuration
replicaCount: 3
image:
  repository: connectsoft.azurecr.io/atp/ingestion
  pullPolicy: IfNotPresent
  tag: ""  # Override via --set image.tag=v1.2.3

imagePullSecrets:
  - name: acr-pull-secret

nameOverride: ""
fullnameOverride: ""

serviceAccount:
  create: true
  annotations: {}
  name: ""

podAnnotations: {}

podSecurityContext:
  fsGroup: 2000
  runAsNonRoot: true
  runAsUser: 1000

securityContext:
  allowPrivilegeEscalation: false
  capabilities:
    drop:
    - ALL
  readOnlyRootFilesystem: true

service:
  type: ClusterIP
  port: 80
  targetPort: 8080

ingress:
  enabled: false
  className: "nginx"
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
  hosts:
    - host: ingestion.atp.connectsoft.example
      paths:
        - path: /
          pathType: Prefix
  tls: []

resources:
  limits:
    cpu: 2000m
    memory: 2Gi
  requests:
    cpu: 500m
    memory: 1Gi

autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
  targetMemoryUtilizationPercentage: 80

nodeSelector: {}

tolerations: []

affinity: {}

# External Secrets
externalSecrets:
  enabled: true
  secrets:
    - name: sql-connection-string
      keyVaultName: atp-prod-kv
      secretName: connection-strings/atp-db/production

# Database Configuration
database:
  host: ""
  port: 5432
  name: atp_production
  schema: public

# Redis Configuration
redis:
  enabled: false  # Use managed Redis
  host: ""  # External Redis host
  port: 6379

# Environment Variables
env:
  - name: ASPNETCORE_ENVIRONMENT
    value: "Production"
  - name: Logging__LogLevel__Default
    value: "Information"

envFrom:
  - secretRef:
      name: app-secrets

# Health Checks
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3

startupProbe:
  httpGet:
    path: /health/startup
    port: 8080
  initialDelaySeconds: 0
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 30

# Pod Disruption Budget
podDisruptionBudget:
  enabled: true
  minAvailable: 2

# Network Policy
networkPolicy:
  enabled: true
  ingress:
    - from:
      - namespaceSelector:
          matchLabels:
            name: ingress-nginx
  egress:
    - to:
      - namespaceSelector:
          matchLabels:
            name: kube-system
      ports:
        - protocol: UDP
          port: 53

# Service Bus Configuration
serviceBus:
  connectionString: ""  # From ExternalSecret
  queueName: audit-events

# Monitoring
monitoring:
  enabled: true
  serviceMonitor:
    enabled: true
    interval: 30s
    scrapeTimeout: 10s

templates/: Resource Templates

Helm Chart Directory Structure:

charts/atp-ingestion/
├── Chart.yaml              # Chart metadata
├── values.yaml             # Default values
├── values-dev.yaml         # Dev environment overrides
├── values-test.yaml        # Test environment overrides
├── values-staging.yaml     # Staging environment overrides
├── values-production.yaml  # Production environment overrides
├── .helmignore             # Files to exclude
├── README.md               # Chart documentation
├── charts/                 # Sub-charts (dependencies)
│   └── .gitkeep
├── templates/              # Kubernetes resource templates
│   ├── _helpers.tpl        # Named templates and helpers
│   ├── deployment.yaml     # Deployment resource
│   ├── service.yaml        # Service resource
│   ├── ingress.yaml        # Ingress resource (conditional)
│   ├── serviceaccount.yaml # ServiceAccount resource
│   ├── configmap.yaml      # ConfigMap resource
│   ├── networkpolicy.yaml  # NetworkPolicy resource (conditional)
│   ├── poddisruptionbudget.yaml # PDB resource (conditional)
│   ├── servicemonitor.yaml # ServiceMonitor for Prometheus (conditional)
│   ├── externalsecret.yaml # ExternalSecret resource (conditional)
│   ├── NOTES.txt           # Post-install notes
│   ├── tests/              # Helm test templates
│   │   ├── test-connection.yaml
│   │   └── test-health.yaml
│   └── hooks/              # Helm hooks
│       ├── pre-install-migration.yaml
│       └── post-install-verification.yaml
└── values.schema.json      # JSON Schema validation (Helm loads it from the chart root)

Chart Structure Diagram:

graph TB
    subgraph "Helm Chart: atp-ingestion"
        CHART[Chart.yaml<br/>Metadata & Dependencies]
        VALUES[values.yaml<br/>Default Configuration]
        VALUES_DEV[values-dev.yaml<br/>Dev Overrides]
        VALUES_PROD[values-production.yaml<br/>Prod Overrides]

        subgraph "templates/"
            HELPERS[_helpers.tpl<br/>Named Templates]
            DEPLOY[deployment.yaml]
            SVC[service.yaml]
            INGRESS[ingress.yaml]
            SA[serviceaccount.yaml]
            NETPOL[networkpolicy.yaml]

            subgraph "tests/"
                TEST_CONN[test-connection.yaml]
                TEST_HEALTH[test-health.yaml]
            end

            subgraph "hooks/"
                HOOK_PRE[pre-install-migration.yaml]
                HOOK_POST[post-install-verification.yaml]
            end
        end

        subgraph "charts/"
            DEP_REDIS[redis/]
            DEP_POSTGRES[postgresql/]
        end
    end

    CHART --> DEPLOY
    VALUES --> DEPLOY
    VALUES_DEV --> DEPLOY
    VALUES_PROD --> DEPLOY
    HELPERS --> DEPLOY
    DEPLOY --> SVC
    SVC --> INGRESS
    CHART --> DEP_REDIS
    CHART --> DEP_POSTGRES

    style CHART fill:#FFE5B4
    style VALUES fill:#FFE5B4
    style HELPERS fill:#90EE90

charts/: Sub-charts (Dependencies)

Sub-chart Dependencies:

# Chart.yaml dependencies section
dependencies:
  - name: redis
    version: "17.15.0"
    repository: "https://charts.bitnami.com/bitnami"
    condition: redis.enabled
    tags:
      - cache
  - name: postgresql
    version: "12.1.9"
    repository: "https://charts.bitnami.com/bitnami"
    condition: postgresql.enabled
    tags:
      - database

Sub-chart Values Override:

# values.yaml - Sub-chart value overrides
redis:
  enabled: false  # Use managed Redis in production
  architecture: standalone
  auth:
    enabled: true
  master:
    persistence:
      enabled: true
      size: 8Gi
    resources:
      requests:
        memory: 256Mi
        cpu: 250m

postgresql:
  enabled: false  # Use managed PostgreSQL
  auth:
    database: atp_production
    username: atp_user
  primary:
    persistence:
      enabled: true
      size: 20Gi
    resources:
      requests:
        memory: 512Mi
        cpu: 500m

Managing Dependencies:

# Update dependencies
helm dependency update charts/atp-ingestion/

# Build dependencies
helm dependency build charts/atp-ingestion/

# List dependencies
helm dependency list charts/atp-ingestion/

.helmignore: Files to Exclude

.helmignore File:

# charts/atp-ingestion/.helmignore
# Patterns to ignore when building packages

# Git
.git/
.gitignore
.gitattributes

# CI/CD
.azuredevops/
.github/
.gitlab-ci.yml

# IDE
.vscode/
.idea/
*.swp
*.swo
*~

# Documentation (keep README.md)
docs/
*.md
!README.md

# Tests (not part of chart)
tests/
*.test.go

# Build artifacts
bin/
obj/
*.dll
*.exe

# Dependencies (managed via Chart.yaml)
charts/*.tgz

# Temporary files
*.tmp
*.log
.DS_Store

Template Best Practices

Named Templates and Helpers (_helpers.tpl)

_helpers.tpl:

# templates/_helpers.tpl
{{/*
Expand the name of the chart.
*/}}
{{- define "atp-ingestion.name" -}}
{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" }}
{{- end }}

{{/*
Create a default fully qualified app name.
*/}}
{{- define "atp-ingestion.fullname" -}}
{{- if .Values.fullnameOverride }}
{{- .Values.fullnameOverride | trunc 63 | trimSuffix "-" }}
{{- else }}
{{- $name := default .Chart.Name .Values.nameOverride }}
{{- if contains $name .Release.Name }}
{{- .Release.Name | trunc 63 | trimSuffix "-" }}
{{- else }}
{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" }}
{{- end }}
{{- end }}
{{- end }}

{{/*
Create chart name and version as used by the chart label.
*/}}
{{- define "atp-ingestion.chart" -}}
{{- printf "%s-%s" .Chart.Name .Chart.Version | replace "+" "_" | trunc 63 | trimSuffix "-" }}
{{- end }}

{{/*
Common labels
*/}}
{{- define "atp-ingestion.labels" -}}
helm.sh/chart: {{ include "atp-ingestion.chart" . }}
{{ include "atp-ingestion.selectorLabels" . }}
{{- if .Chart.AppVersion }}
app.kubernetes.io/version: {{ .Chart.AppVersion | quote }}
{{- end }}
app.kubernetes.io/managed-by: {{ .Release.Service }}
app.kubernetes.io/part-of: atp-platform
{{- end }}

{{/*
Selector labels
*/}}
{{- define "atp-ingestion.selectorLabels" -}}
app.kubernetes.io/name: {{ include "atp-ingestion.name" . }}
app.kubernetes.io/instance: {{ .Release.Name }}
{{- end }}

{{/*
Create the name of the service account to use
*/}}
{{- define "atp-ingestion.serviceAccountName" -}}
{{- if .Values.serviceAccount.create }}
{{- default (include "atp-ingestion.fullname" .) .Values.serviceAccount.name }}
{{- else }}
{{- default "default" .Values.serviceAccount.name }}
{{- end }}
{{- end }}

{{/*
Image reference
*/}}
{{- define "atp-ingestion.image" -}}
{{- $tag := .Values.image.tag | default .Chart.AppVersion }}
{{- printf "%s:%s" .Values.image.repository $tag }}
{{- end }}

{{/*
Environment variables from ConfigMap
*/}}
{{- define "atp-ingestion.envFromConfigMap" -}}
{{- if .Values.envFrom }}
{{- range .Values.envFrom }}
{{- if .configMapRef }}
- configMapRef:
    name: {{ .configMapRef.name }}
{{- end }}
{{- end }}
{{- end }}
{{- end }}

{{/*
Environment variables from Secrets
*/}}
{{- define "atp-ingestion.envFromSecret" -}}
{{- if .Values.envFrom }}
{{- range .Values.envFrom }}
{{- if .secretRef }}
- secretRef:
    name: {{ .secretRef.name }}
{{- end }}
{{- end }}
{{- end }}
{{- end }}

{{/*
Security context
*/}}
{{- define "atp-ingestion.securityContext" -}}
allowPrivilegeEscalation: false
capabilities:
  drop:
  - ALL
readOnlyRootFilesystem: {{ .Values.securityContext.readOnlyRootFilesystem | default true }}
runAsNonRoot: {{ .Values.securityContext.runAsNonRoot | default true }}
{{- if .Values.securityContext.runAsUser }}
runAsUser: {{ .Values.securityContext.runAsUser }}
{{- end }}
{{- end }}

{{/*
Pod security context
*/}}
{{- define "atp-ingestion.podSecurityContext" -}}
{{- if .Values.podSecurityContext }}
fsGroup: {{ .Values.podSecurityContext.fsGroup }}
runAsNonRoot: {{ .Values.podSecurityContext.runAsNonRoot | default true }}
{{- if .Values.podSecurityContext.runAsUser }}
runAsUser: {{ .Values.podSecurityContext.runAsUser }}
{{- end }}
{{- end }}
{{- end }}

{{/*
Resource requests and limits
*/}}
{{- define "atp-ingestion.resources" -}}
{{- if .Values.resources }}
{{- toYaml .Values.resources }}
{{- else }}
requests:
  cpu: 100m
  memory: 128Mi
limits:
  cpu: 500m
  memory: 512Mi
{{- end }}
{{- end }}

Template Functions (include, default, required)

Using Template Functions:

# templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "atp-ingestion.fullname" . }}
  labels:
    {{- include "atp-ingestion.labels" . | nindent 4 }}
spec:
  replicas: {{ .Values.replicaCount | default 3 }}
  selector:
    matchLabels:
      {{- include "atp-ingestion.selectorLabels" . | nindent 6 }}
  template:
    metadata:
      annotations:
        {{- with .Values.podAnnotations }}
        {{- toYaml . | nindent 8 }}
        {{- end }}
      labels:
        {{- include "atp-ingestion.selectorLabels" . | nindent 8 }}
    spec:
      serviceAccountName: {{ include "atp-ingestion.serviceAccountName" . }}
      securityContext:
        {{- include "atp-ingestion.podSecurityContext" . | nindent 8 }}
      containers:
      - name: {{ .Chart.Name }}
        image: "{{ include "atp-ingestion.image" . }}"
        imagePullPolicy: {{ .Values.image.pullPolicy }}
        securityContext:
          {{- include "atp-ingestion.securityContext" . | nindent 10 }}
        ports:
        - name: http
          containerPort: {{ .Values.service.targetPort | default 8080 }}
          protocol: TCP
        env:
        {{- range .Values.env }}
        - name: {{ .name }}
          value: {{ .value | quote }}
        {{- end }}
        envFrom:
        {{- include "atp-ingestion.envFromConfigMap" . | nindent 8 }}
        {{- include "atp-ingestion.envFromSecret" . | nindent 8 }}
        resources:
          {{- include "atp-ingestion.resources" . | nindent 10 }}
        livenessProbe:
          {{- toYaml .Values.livenessProbe | nindent 10 }}
        readinessProbe:
          {{- toYaml .Values.readinessProbe | nindent 10 }}
        {{- if .Values.startupProbe }}
        startupProbe:
          {{- toYaml .Values.startupProbe | nindent 10 }}
        {{- end }}

Using required Function:

# Require critical values
image:
  repository: {{ required "image.repository is required" .Values.image.repository }}
  tag: {{ required "image.tag is required" .Values.image.tag }}

Using default and coalesce:

# Default values with fallback chain
replicas: {{ .Values.replicaCount | default 3 }}
namespace: {{ .Values.namespace | default .Release.Namespace }}
tag: {{ coalesce .Values.image.tag .Chart.AppVersion "latest" }}

Flow Control (if, with, range)

Conditional Rendering:

# templates/ingress.yaml
{{- if .Values.ingress.enabled -}}
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: {{ include "atp-ingestion.fullname" . }}
  labels:
    {{- include "atp-ingestion.labels" . | nindent 4 }}
  {{- with .Values.ingress.annotations }}
  annotations:
    {{- toYaml . | nindent 4 }}
  {{- end }}
spec:
  {{- if .Values.ingress.className }}
  ingressClassName: {{ .Values.ingress.className }}
  {{- end }}
  {{- if .Values.ingress.tls }}
  tls:
    {{- range .Values.ingress.tls }}
    - hosts:
        {{- range .hosts }}
        - {{ . | quote }}
        {{- end }}
      secretName: {{ .secretName }}
    {{- end }}
  {{- end }}
  rules:
    {{- range .Values.ingress.hosts }}
    - host: {{ .host | quote }}
      http:
        paths:
          {{- range .paths }}
          - path: {{ .path }}
            pathType: {{ .pathType }}
            backend:
              service:
                name: {{ include "atp-ingestion.fullname" $ }}
                port:
                  number: {{ $.Values.service.port }}
          {{- end }}
    {{- end }}
{{- end }}

Using with for Scoped Values:

{{- with .Values.monitoring.serviceMonitor }}
{{- if .enabled }}
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: {{ include "atp-ingestion.fullname" $ }}
spec:
  selector:
    matchLabels:
      {{- include "atp-ingestion.selectorLabels" $ | nindent 6 }}
  endpoints:
  - port: http
    interval: {{ .interval | default "30s" }}
    scrapeTimeout: {{ .scrapeTimeout | default "10s" }}
{{- end }}
{{- end }}

Variable Scoping

Understanding Variable Scoping:

# Scoping with $ (root context): inside range, "." is the current list item
{{- range .Values.env }}
- name: {{ .name }}
  value: {{ .value | quote }}
  # Use $ to reach the root context, e.g. {{ $.Release.Namespace }}
{{- end }}

# Scoping with with
{{- with .Values.resources }}
limits:
  cpu: {{ .limits.cpu }}
  memory: {{ .limits.memory }}
{{- end }}

# Preserving root context in nested scopes
{{- range .Values.env }}
  {{- if eq .name "DATABASE_HOST" }}
    {{- with $.Values.database }}
    value: {{ .host }}
    {{- end }}
  {{- end }}
{{- end }}

Whitespace Management

Whitespace Control:

# Remove leading/trailing whitespace
{{- include "atp-ingestion.labels" . | nindent 4 }}
{{- if .Values.ingress.enabled -}}
# ... content ...
{{- end }}

# Trim left whitespace
{{- include "template" . }}

# Trim right whitespace
{{ include "template" . -}}

# Trim both sides
{{- include "template" . -}}

# Preserve whitespace (default)
{{ include "template" . }}

# Indent (nindent adds newline before)
{{- include "labels" . | nindent 4 }}

# Output raw (without escaping)
{{- .Values.script | nindent 8 | trim }}

Values File Organization

Hierarchical Values Structure

Values Hierarchy:

# Base values.yaml
replicaCount: 3
resources:
  limits:
    cpu: 2000m
    memory: 2Gi
  requests:
    cpu: 500m
    memory: 1Gi

# Environment-specific override (values-production.yaml)
replicaCount: 5
resources:
  limits:
    cpu: 4000m
    memory: 4Gi
  requests:
    cpu: 1000m
    memory: 2Gi

Values Precedence:

  1. User-provided values (--set, --set-file)
  2. Environment-specific values (values-production.yaml)
  3. Default values (values.yaml)
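To make the ordering concrete, the lookup for a single key can be modeled as "first non-empty value wins". A shell sketch; the `resolve` helper is an illustration, not how Helm actually merges values (Helm performs a per-key deep merge of maps):

```shell
# Model Helm value precedence for one key: --set > values-production.yaml > values.yaml.
# resolve() returns its first non-empty argument (highest-precedence source first).
resolve() {
  local v
  for v in "$@"; do
    if [ -n "${v}" ]; then
      printf '%s\n' "${v}"
      return 0
    fi
  done
  return 1
}

SET_VALUE=""        # 1. --set replicaCount=... (not supplied in this example)
ENV_VALUE="5"       # 2. values-production.yaml
DEFAULT_VALUE="3"   # 3. values.yaml

REPLICAS=$(resolve "${SET_VALUE}" "${ENV_VALUE}" "${DEFAULT_VALUE}")
echo "replicaCount=${REPLICAS}"   # → replicaCount=5
```

In practice the same ordering is applied by a command such as `helm upgrade --install atp-ingestion charts/atp-ingestion -f charts/atp-ingestion/values-production.yaml --set image.tag=v1.2.3`.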

Environment Overrides

values-dev.yaml:

# charts/atp-ingestion/values-dev.yaml
replicaCount: 1

resources:
  limits:
    cpu: 500m
    memory: 512Mi
  requests:
    cpu: 100m
    memory: 256Mi

autoscaling:
  enabled: false

env:
  - name: ASPNETCORE_ENVIRONMENT
    value: "Development"
  - name: Logging__LogLevel__Default
    value: "Debug"

ingress:
  enabled: true
  className: "nginx"
  hosts:
    - host: ingestion.dev.atp.connectsoft.example
      paths:
        - path: /

values-production.yaml:

# charts/atp-ingestion/values-production.yaml
replicaCount: 5

resources:
  limits:
    cpu: 4000m
    memory: 4Gi
  requests:
    cpu: 1000m
    memory: 2Gi

autoscaling:
  enabled: true
  minReplicas: 5
  maxReplicas: 20
  targetCPUUtilizationPercentage: 70

env:
  - name: ASPNETCORE_ENVIRONMENT
    value: "Production"
  - name: Logging__LogLevel__Default
    value: "Warning"

ingress:
  enabled: true
  className: "nginx"
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
  hosts:
    - host: ingestion.atp.connectsoft.example
      paths:
        - path: /
  tls:
    - secretName: ingestion-tls
      hosts:
        - ingestion.atp.connectsoft.example

Secret References (Never Plaintext)

External Secret Reference in Values:

# values.yaml - NEVER include plaintext secrets
externalSecrets:
  enabled: true
  secrets:
    - name: sql-connection-string
      keyVaultName: atp-prod-kv
      # Key Vault secret names allow only alphanumerics and hyphens
      secretName: connection-strings-atp-db-production
      secretKey: connectionString
    - name: redis-connection-string
      keyVaultName: atp-prod-kv
      secretName: cache-redis-connection-string
      secretKey: connectionString

# ❌ BAD: Plaintext secret in values
# secrets:
#   sqlConnectionString: "Server=..."

# ✅ GOOD: Reference to ExternalSecret
envFrom:
  - secretRef:
      name: app-secrets  # Created by ExternalSecret operator

ExternalSecret Template:

# templates/externalsecret.yaml
{{- if .Values.externalSecrets.enabled }}
{{- range .Values.externalSecrets.secrets }}
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: {{ .name }}
  namespace: {{ $.Release.Namespace }}
spec:
  secretStoreRef:
    name: azure-keyvault-{{ .keyVaultName }}  # keyVaultName comes from the current secrets entry
    kind: ClusterSecretStore
  target:
    name: {{ .name }}
    creationPolicy: Owner
  data:
  - secretKey: {{ .secretKey | default "value" }}
    remoteRef:
      key: {{ .secretName }}
{{- end }}
{{- end }}
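For the first entry in the values above, the template renders roughly as follows (the `atp-production` namespace is a hypothetical release namespace; the hyphenated remote key reflects Key Vault's allowed character set):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: sql-connection-string
  namespace: atp-production        # hypothetical Release.Namespace
spec:
  secretStoreRef:
    name: azure-keyvault-atp-prod-kv
    kind: ClusterSecretStore
  target:
    name: sql-connection-string
    creationPolicy: Owner
  data:
  - secretKey: connectionString
    remoteRef:
      key: connection-strings-atp-db-production
```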

Documentation in values.yaml Comments

Documented values.yaml:

# charts/atp-ingestion/values.yaml
# Default values for atp-ingestion Helm chart

# -- Number of replicas
replicaCount: 3

# -- Image configuration
image:
  # -- Image repository
  repository: connectsoft.azurecr.io/atp/ingestion
  # -- Image pull policy (IfNotPresent, Always, Never)
  pullPolicy: IfNotPresent
  # -- Image tag (defaults to appVersion)
  tag: ""

# -- Service account configuration
serviceAccount:
  # -- Create service account
  create: true
  # -- Service account annotations
  annotations: {}
  # -- Service account name (defaults to fullname)
  name: ""

# -- Resource requests and limits
resources:
  limits:
    # -- CPU limit (e.g., 2000m, 2)
    cpu: 2000m
    # -- Memory limit (e.g., 2Gi, 2048Mi)
    memory: 2Gi
  requests:
    # -- CPU request (e.g., 500m, 0.5)
    cpu: 500m
    # -- Memory request (e.g., 1Gi, 1024Mi)
    memory: 1Gi

# -- Horizontal Pod Autoscaler configuration
autoscaling:
  # -- Enable HPA
  enabled: true
  # -- Minimum replicas
  minReplicas: 3
  # -- Maximum replicas
  maxReplicas: 10
  # -- Target CPU utilization percentage
  targetCPUUtilizationPercentage: 70
  # -- Target memory utilization percentage
  targetMemoryUtilizationPercentage: 80
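As a sketch of how the autoscaling block above is typically consumed, a chart template along these lines (the file path and helper names follow the chart's conventions but are illustrative, not taken from the chart itself) would emit an HPA only when `autoscaling.enabled` is true:

```yaml
# templates/hpa.yaml (sketch)
{{- if .Values.autoscaling.enabled }}
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: {{ include "atp-ingestion.fullname" . }}
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: {{ include "atp-ingestion.fullname" . }}
  minReplicas: {{ .Values.autoscaling.minReplicas }}
  maxReplicas: {{ .Values.autoscaling.maxReplicas }}
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: {{ .Values.autoscaling.targetCPUUtilizationPercentage }}
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: {{ .Values.autoscaling.targetMemoryUtilizationPercentage }}
{{- end }}
```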

Chart Dependencies

Depending on Other Charts

Defining Dependencies:

# Chart.yaml
dependencies:
  - name: redis
    version: "17.15.0"
    repository: "https://charts.bitnami.com/bitnami"
    condition: redis.enabled
  - name: postgresql
    version: "12.1.9"
    repository: "https://charts.bitnami.com/bitnami"
    condition: postgresql.enabled

Dependency Management Workflow:

sequenceDiagram
    participant Dev as Developer
    participant Chart as Chart.yaml
    participant Helm as Helm CLI
    participant Repo as Chart Repository

    Dev->>Chart: Add dependency to Chart.yaml
    Dev->>Helm: helm dependency update
    Helm->>Repo: Fetch dependency chart
    Repo-->>Helm: Return chart.tgz
    Helm->>Chart: Extract to charts/ directory
    Chart-->>Dev: Dependencies ready

Sub-chart Values Override

Overriding Sub-chart Values:

# values.yaml - Override Redis sub-chart values
redis:
  enabled: true
  architecture: standalone
  auth:
    enabled: true
    password: ""  # From ExternalSecret
  master:
    persistence:
      enabled: true
      storageClass: managed-premium
      size: 8Gi
    resources:
      requests:
        memory: 256Mi
        cpu: 250m
      limits:
        memory: 512Mi
        cpu: 500m

# Override PostgreSQL sub-chart values
postgresql:
  enabled: true
  auth:
    database: atp_production
    username: atp_user
    password: ""  # From ExternalSecret
  primary:
    persistence:
      enabled: true
      storageClass: managed-premium
      size: 20Gi
    resources:
      requests:
        memory: 512Mi
        cpu: 500m
      limits:
        memory: 1Gi
        cpu: 1000m

Conditional Dependencies

Conditional Dependency Rendering:

# Chart.yaml
dependencies:
  - name: redis
    version: "17.15.0"
    repository: "https://charts.bitnami.com/bitnami"
    condition: redis.enabled
    tags:
      - cache
  - name: postgresql
    version: "12.1.9"
    repository: "https://charts.bitnami.com/bitnami"
    condition: postgresql.enabled
    tags:
      - database

# values.yaml
redis:
  enabled: false  # Don't install Redis (use managed)

postgresql:
  enabled: false  # Don't install PostgreSQL (use managed)

Enable/Disable by Tag:

# Install with cache tag only
helm install my-release ./chart --set tags.cache=true

# Install with database tag only
helm install my-release ./chart --set tags.database=true

Dependency Management Commands

Dependency Management:

# Update dependencies (download latest)
helm dependency update charts/atp-ingestion/

# Build dependencies (rebuild from Chart.lock)
helm dependency build charts/atp-ingestion/

# List dependencies
helm dependency list charts/atp-ingestion/
# Output:
# NAME          VERSION REPOSITORY                              STATUS
# redis         17.15.0 https://charts.bitnami.com/bitnami      ok
# postgresql    12.1.9  https://charts.bitnami.com/bitnami      ok

# Verify downloaded dependencies against provenance signatures
helm dependency update --verify charts/atp-ingestion/

Chart Versioning and Publishing

Chart Versioning Strategy (SemVer)

Semantic Versioning:

| Version Component | When to Increment | Example |
|---|---|---|
| MAJOR | Breaking changes (incompatible values, removed features) | 1.2.3 → 2.0.0 |
| MINOR | New features (backward compatible) | 1.2.3 → 1.3.0 |
| PATCH | Bug fixes (backward compatible) | 1.2.3 → 1.2.4 |

Chart Version Examples:

# Chart.yaml
version: 1.2.3  # Chart version (SemVer)
appVersion: "1.2.3"  # Application version

# Version bump examples:
# 1.2.3 → 1.2.4 (patch: bug fix)
# 1.2.3 → 1.3.0 (minor: new feature added)
# 1.2.3 → 2.0.0 (major: breaking change)
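The bump rules can be captured in a few lines of shell (a sketch; in practice the CI pipeline would tag releases):

```shell
#!/usr/bin/env bash
# bump <version> <major|minor|patch> → next SemVer version
bump() {
  local major minor patch
  IFS=. read -r major minor patch <<< "$1"
  case "$2" in
    major) echo "$((major + 1)).0.0" ;;
    minor) echo "${major}.$((minor + 1)).0" ;;
    patch) echo "${major}.${minor}.$((patch + 1))" ;;
  esac
}

bump 1.2.3 patch   # 1.2.4
bump 1.2.3 minor   # 1.3.0
bump 1.2.3 major   # 2.0.0
```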

Publishing to Azure Container Registry

ACR OCI Registry Login:

# Authenticate Helm to ACR (OCI registries are not added with `helm repo add`)
helm registry login connectsoft.azurecr.io \
  --username 00000000-0000-0000-0000-000000000000 \
  --password "$(az acr login --name connectsoft --expose-token --output tsv --query accessToken)"

Publishing Chart to ACR:

#!/bin/bash
# scripts/publish-chart-to-acr.sh

CHART_NAME="${1}"
CHART_PATH="charts/${CHART_NAME}"
ACR_NAME="connectsoft"
ACR_REPO="oci://${ACR_NAME}.azurecr.io/helm"

echo "📦 Publishing ${CHART_NAME} to ACR"

# Package chart
helm package "${CHART_PATH}" --destination ./dist/

# Get package name
PACKAGE=$(ls -t ./dist/${CHART_NAME}-*.tgz | head -1)

# Push to ACR
helm push "${PACKAGE}" "${ACR_REPO}"

echo "✅ Chart published: ${PACKAGE}"

Installing from ACR:

# Install directly from the OCI registry (no repo add/update step)
helm install atp-ingestion oci://connectsoft.azurecr.io/helm/atp-ingestion \
  --version 1.2.3 \
  -f values-production.yaml

Chart Repository Structure

ACR OCI Repository Structure:

connectsoft.azurecr.io/helm/
├── atp-ingestion    # OCI artifact, tags: 1.0.0, 1.1.0, 1.2.3
├── atp-query        # OCI artifact, tags: ...
└── atp-gateway      # OCI artifact, tags: ...

Helm Hooks

Pre-Install: Run Before Installation

Pre-Install Hook (Database Migration):

# templates/hooks/pre-install-migration.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ include "atp-ingestion.fullname" . }}-pre-install-migration
  annotations:
    "helm.sh/hook": pre-install
    "helm.sh/hook-weight": "-5"
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
  labels:
    {{- include "atp-ingestion.labels" . | nindent 4 }}
spec:
  template:
    metadata:
      labels:
        {{- include "atp-ingestion.selectorLabels" . | nindent 8 }}
    spec:
      restartPolicy: Never
      serviceAccountName: {{ include "atp-ingestion.serviceAccountName" . }}
      containers:
      - name: migration
        image: {{ include "atp-ingestion.image" . }}
        command:
        - dotnet
        - ConnectSoft.Audit.Ingestion.Migrations.dll
        # ASPNETCORE_ENVIRONMENT and other settings come from .Values.env
        env:
        {{- range .Values.env }}
        - name: {{ .name }}
          value: {{ .value | quote }}
        {{- end }}
        envFrom:
        {{- include "atp-ingestion.envFromSecret" . | nindent 8 }}
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 512Mi

Post-Install: Run After Installation

Post-Install Hook (Verification):

# templates/hooks/post-install-verification.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ include "atp-ingestion.fullname" . }}-post-install-verification
  annotations:
    "helm.sh/hook": post-install
    "helm.sh/hook-weight": "5"
    "helm.sh/hook-delete-policy": hook-succeeded
  labels:
    {{- include "atp-ingestion.labels" . | nindent 4 }}
spec:
  template:
    metadata:
      labels:
        {{- include "atp-ingestion.selectorLabels" . | nindent 8 }}
    spec:
      restartPolicy: Never
      containers:
      - name: verification
        image: curlimages/curl:latest
        command:
        - /bin/sh
        - -c
        - |
          echo "Verifying service health..."
          sleep 10
          curl -f http://{{ include "atp-ingestion.fullname" . }}:{{ .Values.service.port }}/health/ready || exit 1
          echo "✅ Service is healthy"

Pre-Upgrade: Run Before Upgrade

Pre-Upgrade Hook (Backup):

# templates/hooks/pre-upgrade-backup.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ include "atp-ingestion.fullname" . }}-pre-upgrade-backup
  annotations:
    "helm.sh/hook": pre-upgrade
    "helm.sh/hook-weight": "-5"
    "helm.sh/hook-delete-policy": before-hook-creation
  labels:
    {{- include "atp-ingestion.labels" . | nindent 4 }}
spec:
  template:
    metadata:
      labels:
        {{- include "atp-ingestion.selectorLabels" . | nindent 8 }}
    spec:
      restartPolicy: Never
      containers:
      - name: backup
        image: mcr.microsoft.com/azure-cli:latest
        command:
        - /bin/bash
        - -c
        - |
          echo "Creating backup before upgrade..."
          # Backup logic here
          az storage blob upload-batch \
            --destination backup \
            --source /data \
            --account-name atpstorage
          echo "✅ Backup complete"

Post-Upgrade: Run After Upgrade

Post-Upgrade Hook (Smoke Tests):

# templates/hooks/post-upgrade-smoke-tests.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ include "atp-ingestion.fullname" . }}-post-upgrade-smoke-tests
  annotations:
    "helm.sh/hook": post-upgrade
    "helm.sh/hook-weight": "5"
    "helm.sh/hook-delete-policy": hook-succeeded
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: smoke-tests
        image: mcr.microsoft.com/dotnet/sdk:8.0
        command:
        - dotnet
        - test
        - --filter
        - Category=Smoke
        env:
        - name: API_URL
          value: http://{{ include "atp-ingestion.fullname" . }}:{{ .Values.service.port }}

Pre-Delete: Run Before Deletion

Pre-Delete Hook (Cleanup):

# templates/hooks/pre-delete-cleanup.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ include "atp-ingestion.fullname" . }}-pre-delete-cleanup
  annotations:
    "helm.sh/hook": pre-delete
    "helm.sh/hook-weight": "-5"
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: cleanup
        image: mcr.microsoft.com/azure-cli:latest
        command:
        - /bin/bash
        - -c
        - |
          echo "Cleaning up resources..."
          # Cleanup logic
          echo "✅ Cleanup complete"

Hook Use Cases

Hook Execution Flow:

sequenceDiagram
    participant Helm as Helm
    participant PreInstall as Pre-Install Hook
    participant Install as Installation
    participant PostInstall as Post-Install Hook
    participant PreUpgrade as Pre-Upgrade Hook
    participant Upgrade as Upgrade
    participant PostUpgrade as Post-Upgrade Hook
    participant PreDelete as Pre-Delete Hook
    participant Delete as Deletion

    Note over Helm,Delete: Installation Flow
    Helm->>PreInstall: Execute pre-install hooks
    PreInstall-->>Helm: Migration complete
    Helm->>Install: Install resources
    Install-->>Helm: Installed
    Helm->>PostInstall: Execute post-install hooks
    PostInstall-->>Helm: Verification complete

    Note over Helm,Delete: Upgrade Flow
    Helm->>PreUpgrade: Execute pre-upgrade hooks
    PreUpgrade-->>Helm: Backup complete
    Helm->>Upgrade: Upgrade resources
    Upgrade-->>Helm: Upgraded
    Helm->>PostUpgrade: Execute post-upgrade hooks
    PostUpgrade-->>Helm: Smoke tests passed

    Note over Helm,Delete: Deletion Flow
    Helm->>PreDelete: Execute pre-delete hooks
    PreDelete-->>Helm: Cleanup complete
    Helm->>Delete: Delete resources

Hook Use Cases Table:

| Hook | Use Case | Example |
|---|---|---|
| pre-install | Database migrations, schema setup | Run EF migrations before deploying app |
| post-install | Verification, smoke tests | Verify service is healthy after install |
| pre-upgrade | Backup, data migration | Backup database before upgrade |
| post-upgrade | Smoke tests, validation | Run integration tests after upgrade |
| pre-delete | Cleanup, data export | Export data before deleting service |
| post-delete | Final cleanup | Remove temporary resources |

Testing Helm Charts

helm lint: Syntax and Best Practices

Linting Helm Charts:

# Lint chart
helm lint charts/atp-ingestion/

# Lint with strict mode
helm lint charts/atp-ingestion/ --strict

# Lint with values file
helm lint charts/atp-ingestion/ -f values-production.yaml

# Lint all charts
for chart in charts/*/; do
  echo "Linting $chart"
  helm lint "$chart"
done

helm template: Render Templates Locally

Rendering Templates:

# Render all templates
helm template my-release charts/atp-ingestion/

# Render with values
helm template my-release charts/atp-ingestion/ -f values-production.yaml

# Render specific template
helm template my-release charts/atp-ingestion/ -s templates/deployment.yaml

# Dry-run (validate without installing)
helm install my-release charts/atp-ingestion/ --dry-run --debug

# Output to file
helm template my-release charts/atp-ingestion/ > rendered-manifests.yaml

Template Validation Script:

#!/bin/bash
# scripts/validate-helm-chart.sh

CHART_PATH="${1}"
VALUES_FILE="${2}"

echo "🔍 Validating Helm chart: ${CHART_PATH}"

# Lint
echo "1. Running helm lint..."
helm lint "${CHART_PATH}" ${VALUES_FILE:+-f "${VALUES_FILE}"} || exit 1

# Template rendering
echo "2. Rendering templates..."
helm template test-release "${CHART_PATH}" ${VALUES_FILE:+-f "${VALUES_FILE}"} > /tmp/rendered.yaml || exit 1

# Validate with kubeval
echo "3. Validating Kubernetes manifests..."
kubeval /tmp/rendered.yaml || exit 1

# Validate with kube-score
echo "4. Scoring manifests..."
kube-score score /tmp/rendered.yaml || exit 1

echo "✅ Chart validation passed"

helm test: Run Tests in Cluster

Helm Test Templates:

# templates/tests/test-connection.yaml
apiVersion: v1
kind: Pod
metadata:
  name: "{{ include "atp-ingestion.fullname" . }}-test-connection"
  annotations:
    "helm.sh/hook": test
  labels:
    {{- include "atp-ingestion.selectorLabels" . | nindent 4 }}
spec:
  restartPolicy: Never
  containers:
  - name: wget
    image: busybox:1.35
    command: ['wget']
    args: ['{{ include "atp-ingestion.fullname" . }}:{{ .Values.service.port }}']
# templates/tests/test-health.yaml
apiVersion: v1
kind: Pod
metadata:
  name: "{{ include "atp-ingestion.fullname" . }}-test-health"
  annotations:
    "helm.sh/hook": test
spec:
  restartPolicy: Never
  containers:
  - name: curl-test
    image: curlimages/curl:latest
    command:
    - /bin/sh
    - -c
    - |
      curl -f http://{{ include "atp-ingestion.fullname" . }}:{{ .Values.service.port }}/health/ready || exit 1
      curl -f http://{{ include "atp-ingestion.fullname" . }}:{{ .Values.service.port }}/health/live || exit 1
      echo "✅ Health checks passed"

Running Helm Tests:

# Run tests
helm test my-release

# Run tests with logs
helm test my-release --logs

# Run tests with timeout
helm test my-release --timeout 5m

chart-testing Tool (ct)

Install chart-testing:

# Install ct
curl -LO https://github.com/helm/chart-testing/releases/download/v3.9.0/chart-testing_3.9.0_linux_amd64.tar.gz
tar -xzf chart-testing_3.9.0_linux_amd64.tar.gz
sudo mv ct /usr/local/bin/

chart-testing Configuration:

# .github/ct.yaml
chart-dirs:
  - charts
chart-repos:
  - bitnami=https://charts.bitnami.com/bitnami
target-branch: main
validate-maintainers: true
check-version-increment: true

Using chart-testing:

# Lint and validate
ct lint --charts charts/atp-ingestion/

# Install and test
ct install --charts charts/atp-ingestion/

# Lint all changed charts
ct lint --target-branch main

Integration with CI Pipeline

Azure Pipeline for Chart Testing:

# azure-pipelines-helm-charts.yml
trigger:
  branches:
    include:
    - main
  paths:
    include:
    - charts/**/*

pool:
  vmImage: 'ubuntu-latest'

steps:
- task: HelmInstaller@1
  displayName: 'Install Helm'
  inputs:
    helmVersionToInstall: '3.12.0'

- task: Bash@3
  displayName: 'Install chart-testing'
  inputs:
    targetType: 'inline'
    script: |
      curl -LO https://github.com/helm/chart-testing/releases/download/v3.9.0/chart-testing_3.9.0_linux_amd64.tar.gz
      tar -xzf chart-testing_3.9.0_linux_amd64.tar.gz
      sudo mv ct /usr/local/bin/

- task: Bash@3
  displayName: 'Lint Charts'
  inputs:
    targetType: 'inline'
    script: |
      for chart in charts/*/; do
        echo "Linting $chart"
        helm lint "$chart"
        ct lint --charts "$chart"
      done

- task: Bash@3
  displayName: 'Render Templates'
  inputs:
    targetType: 'inline'
    script: |
      for chart in charts/*/; do
        echo "Rendering $chart"
        helm template test-release "$chart" -f "$chart/values-production.yaml" > /dev/null
      done

- task: Bash@3
  displayName: 'Package Charts'
  inputs:
    targetType: 'inline'
    script: |
      mkdir -p dist
      for chart in charts/*/; do
        helm package "$chart" --destination ./dist/
      done

Helm Chart CI Pipeline

Complete CI Pipeline:

# azure-pipelines-helm-chart-ci.yml
trigger:
  branches:
    include:
    - main
    - feature/*
  paths:
    include:
    - charts/**/*

pr:
  branches:
    include:
    - main
  paths:
    include:
    - charts/**/*

pool:
  vmImage: 'ubuntu-latest'

variables:
  - group: atp-helm-charts
  - name: ACR_NAME
    value: 'connectsoft'

stages:
- stage: Lint
  displayName: 'Lint Charts'
  jobs:
  - job: Lint
    displayName: 'Lint Helm Charts'
    steps:
    - task: HelmInstaller@1
      displayName: 'Install Helm'
      inputs:
        helmVersionToInstall: '3.12.0'

    - task: Bash@3
      displayName: 'Install kubeval and kube-score'
      inputs:
        targetType: 'inline'
        script: |
          wget https://github.com/instrumenta/kubeval/releases/latest/download/kubeval-linux-amd64.tar.gz
          tar xf kubeval-linux-amd64.tar.gz
          sudo mv kubeval /usr/local/bin/

          wget https://github.com/zegl/kube-score/releases/download/v1.17.0/kube-score_1.17.0_linux_amd64.tar.gz
          tar xf kube-score_1.17.0_linux_amd64.tar.gz
          sudo mv kube-score /usr/local/bin/

    - task: Bash@3
      displayName: 'Lint and Validate Charts'
      inputs:
        targetType: 'inline'
        script: |
          for chart in charts/*/; do
            CHART_NAME=$(basename "$chart")
            echo "Linting ${CHART_NAME}..."

            # Helm lint
            helm lint "$chart" || exit 1

            # Render and validate
            helm template test-release "$chart" -f "$chart/values-production.yaml" | \
              kubeval --strict || exit 1

            # Score
            helm template test-release "$chart" -f "$chart/values-production.yaml" | \
              kube-score score - || exit 1
          done

- stage: Package
  displayName: 'Package Charts'
  condition: succeeded()
  jobs:
  - job: Package
    displayName: 'Package Helm Charts'
    steps:
    - task: HelmInstaller@1
      displayName: 'Install Helm'

    - task: Bash@3
      displayName: 'Package Charts'
      inputs:
        targetType: 'inline'
        script: |
          mkdir -p dist
          for chart in charts/*/; do
            helm package "$chart" --destination ./dist/
          done

          echo "##vso[task.setVariable variable=CHARTS_PACKAGED]true"

- stage: Publish
  displayName: 'Publish to ACR'
  condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
  jobs:
  - deployment: Publish
    displayName: 'Publish Charts to ACR'
    environment: 'Production'
    strategy:
      runOnce:
        deploy:
          steps:
          - task: AzureCLI@2
            displayName: 'Login to ACR'
            inputs:
              azureSubscription: 'ATP-Prod-ServiceConnection'
              scriptType: 'bash'
              scriptLocation: 'inlineScript'
              inlineScript: |
                az acr login --name $(ACR_NAME)

          - task: HelmInstaller@1
            displayName: 'Install Helm'

          - task: Bash@3
            displayName: 'Publish Charts'
            inputs:
              targetType: 'inline'
              script: |
                for package in dist/*.tgz; do
                  CHART_NAME=$(basename "$package" .tgz | cut -d- -f1-2)
                  echo "Publishing ${CHART_NAME}..."
                  helm push "$package" "oci://$(ACR_NAME).azurecr.io/helm"
                done
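The chart-name derivation in the publish stage assumes two-token names like `atp-ingestion`; extracted on its own it behaves like this (note that for a single-token chart such as `redis`, `-f1-2` would also capture part of the version):

```shell
# How the publish step derives the chart name from a package file
package="dist/atp-ingestion-1.2.3.tgz"   # example package path
CHART_NAME=$(basename "$package" .tgz | cut -d- -f1-2)
echo "$CHART_NAME"   # atp-ingestion
```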

Chart Documentation

README.md with Usage Instructions

Chart README Template:

# atp-ingestion

A Helm chart for ATP Ingestion Service - Collects and processes audit trail events.

## Introduction

This chart deploys the ATP Ingestion Service on a Kubernetes cluster using the Helm package manager.

## Prerequisites

- Kubernetes 1.24+
- Helm 3.8+
- Azure Container Registry access
- External Secrets Operator (for secret management)

## Installing the Chart

To install the chart with the release name `atp-ingestion`:

helm install atp-ingestion oci://connectsoft.azurecr.io/helm/atp-ingestion \
  --version 1.2.3 \
  -f values-production.yaml

Uninstalling the Chart

To uninstall/delete the atp-ingestion deployment:

helm uninstall atp-ingestion

Configuration

The following table lists the configurable parameters:

| Parameter | Description | Default |
|---|---|---|
| replicaCount | Number of replicas | 3 |
| image.repository | Image repository | connectsoft.azurecr.io/atp/ingestion |
| image.tag | Image tag | "" (defaults to appVersion) |
| resources.limits.cpu | CPU limit | 2000m |
| resources.limits.memory | Memory limit | 2Gi |
| resources.requests.cpu | CPU request | 500m |
| resources.requests.memory | Memory request | 1Gi |
| autoscaling.enabled | Enable HPA | true |
| autoscaling.minReplicas | Minimum replicas | 3 |
| autoscaling.maxReplicas | Maximum replicas | 10 |

Values Files

  • values.yaml: Default values
  • values-dev.yaml: Development environment
  • values-test.yaml: Test environment
  • values-staging.yaml: Staging environment
  • values-production.yaml: Production environment

Dependencies

  • Redis (optional, via Bitnami chart)
  • PostgreSQL (optional, via Bitnami chart)

Hooks

  • pre-install: Runs database migrations
  • post-install: Verifies service health
  • pre-upgrade: Creates backup
  • post-upgrade: Runs smoke tests
Values Schema (JSON Schema)

values.schema.json:

{
  "$schema": "http://json-schema.org/schema#",
  "type": "object",
  "properties": {
    "replicaCount": {
      "type": "integer",
      "minimum": 1,
      "maximum": 100,
      "description": "Number of replicas"
    },
    "image": {
      "type": "object",
      "properties": {
        "repository": {
          "type": "string",
          "description": "Image repository"
        },
        "tag": {
          "type": "string",
          "description": "Image tag"
        },
        "pullPolicy": {
          "type": "string",
          "enum": ["IfNotPresent", "Always", "Never"],
          "description": "Image pull policy"
        }
      },
      "required": ["repository"]
    },
    "resources": {
      "type": "object",
      "properties": {
        "limits": {
          "type": "object",
          "properties": {
            "cpu": {
              "type": "string",
              "pattern": "^[0-9]+m?$|^[0-9]+\\.[0-9]+$"
            },
            "memory": {
              "type": "string",
              "pattern": "^[0-9]+(Mi|Gi|Ti|Pi|Ei|m|K|M|G|T|P|E)$"
            }
          }
        },
        "requests": {
          "type": "object",
          "properties": {
            "cpu": {
              "type": "string",
              "pattern": "^[0-9]+m?$|^[0-9]+\\.[0-9]+$"
            },
            "memory": {
              "type": "string",
              "pattern": "^[0-9]+(Mi|Gi|Ti|Pi|Ei|m|K|M|G|T|P|E)$"
            }
          }
        }
      }
    }
  },
  "required": ["image"]
}

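The `cpu` pattern in the schema can be sanity-checked with `grep -E` (a quick sketch; when `values.schema.json` is present, `helm lint`, `helm template`, and `helm install` validate the supplied values against it automatically):

```shell
# Quick check of the schema's cpu pattern against typical values
cpu_pattern='^[0-9]+m?$|^[0-9]+\.[0-9]+$'

echo "2000m" | grep -Eq "$cpu_pattern" && echo "2000m ok"      # matches
echo "0.5"   | grep -Eq "$cpu_pattern" && echo "0.5 ok"        # matches
echo "2Gi"   | grep -Eq "$cpu_pattern" || echo "2Gi rejected"  # memory value, no match
```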
Chart Security

Scanning Charts for Vulnerabilities

Chart Security Scanning:

# Install checkov for Helm chart scanning
pip install checkov

# Scan Helm chart
checkov -d charts/atp-ingestion/ --framework helm

Policy Validation

OPA Policy for Helm Charts:

# policies/helm-chart-policy.rego
package helm

deny[msg] {
    input.kind == "Deployment"
    not input.spec.template.spec.securityContext.runAsNonRoot
    msg := "Deployment must set runAsNonRoot to true"
}

deny[msg] {
    input.kind == "Deployment"
    container := input.spec.template.spec.containers[_]
    not container.securityContext.allowPrivilegeEscalation == false
    msg := "Containers must set allowPrivilegeEscalation to false"
}

Image Scanning in Chart Images

Image Scanning in CI:

# Azure Pipeline: Scan chart images
- task: Bash@3
  displayName: 'Scan Images'
  inputs:
    targetType: 'inline'
    script: |
      # Extract images from chart
      IMAGES=$(helm template test-release charts/atp-ingestion/ | \
        grep -E 'image:' | \
        awk '{print $2}' | \
        tr -d '"')

      # Scan each image with Trivy
      for image in $IMAGES; do
        echo "Scanning $image"
        trivy image --severity HIGH,CRITICAL "$image" || exit 1
      done

Summary: Helm Chart Development for ATP Services

  • Helm Chart Structure: Chart.yaml metadata, values.yaml defaults, templates/ directory, charts/ dependencies, .helmignore exclusions
  • Template Best Practices: Named templates and helpers (_helpers.tpl), template functions (include, default, required), flow control (if, with, range), variable scoping, whitespace management
  • Values File Organization: Hierarchical values structure, default values, environment overrides (dev/test/staging/production), secret references (never plaintext), documentation in comments
  • Chart Dependencies: Depending on other charts, sub-chart values override, conditional dependencies, dependency management commands
  • Chart Versioning and Publishing: Semantic versioning (SemVer), publishing to Azure Container Registry (ACR), chart repository structure
  • Helm Hooks: Pre-install (migrations), post-install (verification), pre-upgrade (backup), post-upgrade (smoke tests), pre-delete (cleanup), hook use cases and execution flow
  • Testing Helm Charts: helm lint, helm template (render locally), helm test (run in cluster), chart-testing tool (ct), CI pipeline integration
  • Helm Chart CI Pipeline: Lint charts on PR, package charts, publish to ACR, version management
  • Chart Documentation: README.md with usage instructions, values schema (JSON Schema), changelog for versions
  • Chart Security: Scanning charts for vulnerabilities, policy validation (OPA), image scanning in chart images

Kustomize Advanced Patterns

Purpose: Define advanced Kustomize patterns, strategies, and best practices for ATP GitOps deployments including strategic merge patches, JSON patches, generators, transformers, component composition, remote bases, and FluxCD integration to enable flexible, maintainable, and reusable Kubernetes configuration management.


Kustomize Architecture

Base, Overlays, Components

Kustomize Architecture Overview:

graph TB
    subgraph "Base"
        BASE[kustomization.yaml<br/>Base Resources]
        DEPLOY_BASE[deployment.yaml]
        SVC_BASE[service.yaml]
        CM_BASE[configmap.yaml]
    end

    subgraph "Overlays"
        subgraph "Dev Overlay"
            DEV_KUST[dev/kustomization.yaml]
            DEV_PATCH[dev/patch.yaml]
        end
        subgraph "Prod Overlay"
            PROD_KUST[production/kustomization.yaml]
            PROD_PATCH[production/patch.yaml]
        end
    end

    subgraph "Components"
        COMP_KUST[components/monitoring/kustomization.yaml]
        COMP_RESOURCES[components/monitoring/resources/]
    end

    BASE --> DEPLOY_BASE
    BASE --> SVC_BASE
    BASE --> CM_BASE

    DEV_KUST -.references.-> BASE
    DEV_KUST -.uses.-> DEV_PATCH
    DEV_KUST -.includes.-> COMP_KUST

    PROD_KUST -.references.-> BASE
    PROD_KUST -.uses.-> PROD_PATCH
    PROD_KUST -.includes.-> COMP_KUST

    style BASE fill:#FFE5B4
    style DEV_KUST fill:#90EE90
    style PROD_KUST fill:#FFB6C1
    style COMP_KUST fill:#87CEEB

Directory Structure:

apps/atp-ingestion/
├── base/
│   ├── kustomization.yaml
│   ├── deployment.yaml
│   ├── service.yaml
│   └── configmap.yaml
├── overlays/
│   ├── dev/
│   │   ├── kustomization.yaml
│   │   ├── deployment-patch.yaml
│   │   └── configmap-patch.yaml
│   ├── test/
│   │   ├── kustomization.yaml
│   │   └── deployment-patch.yaml
│   ├── staging/
│   │   ├── kustomization.yaml
│   │   └── deployment-patch.yaml
│   └── production/
│       ├── kustomization.yaml
│       ├── deployment-patch.yaml
│       └── configmap-patch.yaml
└── components/
    ├── monitoring/
    │   ├── kustomization.yaml
    │   └── servicemonitor.yaml
    └── networking/
        ├── kustomization.yaml
        └── networkpolicy.yaml

Kustomization File Structure

Base kustomization.yaml:

# apps/atp-ingestion/base/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

metadata:
  name: atp-ingestion-base
  namespace: default

resources:
  - deployment.yaml
  - service.yaml
  - configmap.yaml

commonLabels:
  app: atp-ingestion
  component: ingestion
  managed-by: kustomize

commonAnnotations:
  description: "ATP Ingestion Service Base Configuration"

namespace: default

Overlay kustomization.yaml:

# apps/atp-ingestion/overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

metadata:
  name: atp-ingestion-production

resources:
  - ../../base

patchesStrategicMerge:  # deprecated in Kustomize v5; newer configs use the patches field
  - deployment-patch.yaml
  - configmap-patch.yaml

patches:
  - path: service-patch.json
    target:
      kind: Service

images:
  - name: connectsoft.azurecr.io/atp/ingestion
    newName: connectsoft.azurecr.io/atp/ingestion
    newTag: v1.2.3

replicas:
  - name: atp-ingestion
    count: 5

namespace: atp-production

commonLabels:
  environment: production

configMapGenerator:
  - name: app-config
    literals:
      - ASPNETCORE_ENVIRONMENT=Production

Resource Selection

Resource Selection in Kustomization:

# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

# Resources to include
resources:
  - deployment.yaml
  - service.yaml
  - ../../base  # Include entire base

# Components to include
components:
  - ../../components/monitoring

# Exclude resources (via selector)
# Note: Kustomize doesn't support exclude directly,
# use patches to remove resources

# Select resources by label
# (requires custom transformer or post-processing)

Transformation Order

Kustomize Transformation Pipeline:

graph LR
    BASE[Base Resources] --> COMMON[Common Labels/Annotations]
    COMMON --> NAMESPACE[Namespace Transform]
    NAMESPACE --> PREFIX[Name Prefix/Suffix]
    PREFIX --> IMAGES[Image Transform]
    IMAGES --> REPLICAS[Replica Transform]
    REPLICAS --> PATCHES[Strategic Merge Patches]
    PATCHES --> JSON[JSON Patches]
    JSON --> GENERATORS[ConfigMap/Secret Generators]
    GENERATORS --> OUTPUT[Final Output]

    style BASE fill:#FFE5B4
    style OUTPUT fill:#90EE90

Transformation Order:

  1. Load Resources: Read base resources and all referenced resources
  2. Common Labels/Annotations: Apply common labels and annotations
  3. Namespace Transform: Set namespace on all resources
  4. Name Prefix/Suffix: Apply name transformations
  5. Image Transform: Replace image references
  6. Replica Transform: Update replica counts
  7. Strategic Merge Patches: Apply strategic merge patches
  8. JSON Patches: Apply JSON patches
  9. ConfigMap/Secret Generators: Generate ConfigMaps and Secrets
  10. Replacements: Apply replacement transformations
  11. Final Output: Emit transformed resources

Strategic Merge Patches

How Strategic Merge Works

Strategic Merge Patch Overview:

Strategic merge patches use the Kubernetes strategic-merge-patch algorithm to merge patch documents into base resources, following Kubernetes-specific semantics for merging lists and maps.

Strategic Merge Process:

graph TB
    BASE[Base Resource]
    PATCH[Strategic Merge Patch]

    BASE --> MERGE{Strategic Merge}
    PATCH --> MERGE

    MERGE --> RESULT[Merged Resource]

    subgraph "Merge Semantics"
        REPLACE[Replace<br/>Explicit values]
        ADD[Add<br/>New fields]
        DELETE[Delete<br/>null values]
        ARRAY[Array Merge<br/>Strategic merge keys]
    end

    MERGE -.uses.-> REPLACE
    MERGE -.uses.-> ADD
    MERGE -.uses.-> DELETE
    MERGE -.uses.-> ARRAY

Merge Semantics (Replace, Add, Delete)

Strategic Merge Examples:

Base Deployment:

# base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  replicas: 3
  template:
    spec:
      containers:
      - name: atp-ingestion
        image: connectsoft.azurecr.io/atp/ingestion:latest
        resources:
          requests:
            cpu: 500m
            memory: 1Gi
          limits:
            cpu: 2000m
            memory: 2Gi
        env:
        - name: ASPNETCORE_ENVIRONMENT
          value: "Development"

Strategic Merge Patch (Replace):

# overlays/production/deployment-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  replicas: 5  # Replace: 3 → 5
  template:
    spec:
      containers:
      - name: atp-ingestion
        image: connectsoft.azurecr.io/atp/ingestion:v1.2.3  # Replace image
        resources:
          requests:
            cpu: 1000m  # Replace: 500m → 1000m
            memory: 2Gi  # Replace: 1Gi → 2Gi
          limits:
            cpu: 4000m  # Replace: 2000m → 4000m
            memory: 4Gi  # Replace: 2Gi → 4Gi
        env:
        - name: ASPNETCORE_ENVIRONMENT
          value: "Production"  # Replace: Development → Production

Strategic Merge Patch (Add):

# overlays/production/deployment-patch-add.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  template:
    spec:
      containers:
      - name: atp-ingestion
        env:
        # Add new environment variables
        - name: Logging__LogLevel__Default
          value: "Warning"
        - name: Telemetry__SamplingRate
          value: "0.1"
        resources:
          limits:
            # Add new resource limit
            ephemeral-storage: 10Gi

Strategic Merge Patch (Delete):

# overlays/minimal/deployment-patch-delete.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  template:
    spec:
      containers:
      - name: atp-ingestion
        env:
        # Delete by setting to null
        - name: Telemetry__SamplingRate
          value: null

Array Merging Strategies

Array Merging with Strategic Merge Keys:

Kubernetes uses strategic merge keys to identify array items for merging:

| Resource Type | Strategic Merge Key |
|---|---|
| Deployment.spec.template.spec.containers | name |
| Deployment.spec.template.spec.initContainers | name |
| Service.spec.ports | port |
| ConfigMap.data | key name (map merge) |
| Pod.spec.volumes | name |

Container Array Merge Example:

Base Deployment:

# base/deployment.yaml
spec:
  template:
    spec:
      containers:
      - name: atp-ingestion
        image: connectsoft.azurecr.io/atp/ingestion:latest
        env:
        - name: ASPNETCORE_ENVIRONMENT
          value: "Development"
      - name: sidecar
        image: connectsoft.azurecr.io/atp/sidecar:latest

Production Patch (Update Existing Container, Add New):

# overlays/production/deployment-patch.yaml
spec:
  template:
    spec:
      containers:
      # Update existing container (matched by name: "atp-ingestion")
      - name: atp-ingestion
        image: connectsoft.azurecr.io/atp/ingestion:v1.2.3
        env:
        - name: ASPNETCORE_ENVIRONMENT
          value: "Production"
        - name: Logging__LogLevel__Default
          value: "Warning"
      # Add new container
      - name: metrics-exporter
        image: prom/node-exporter:latest

Result: The atp-ingestion container is updated, sidecar remains unchanged, and metrics-exporter is added.

Common Patterns

Common Strategic Merge Patterns:

Pattern 1: Update Replicas and Resources:

# overlays/production/deployment-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  replicas: 5
  template:
    spec:
      containers:
      - name: atp-ingestion
        resources:
          requests:
            cpu: 1000m
            memory: 2Gi
          limits:
            cpu: 4000m
            memory: 4Gi

Pattern 2: Add Environment Variables:

# overlays/production/deployment-patch-env.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  template:
    spec:
      containers:
      - name: atp-ingestion
        env:
        - name: ASPNETCORE_ENVIRONMENT
          value: "Production"
        - name: Logging__LogLevel__Default
          value: "Warning"
        - name: Telemetry__SamplingRate
          value: "0.1"

Pattern 3: Add Volume Mounts:

# overlays/production/deployment-patch-volumes.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  template:
    spec:
      containers:
      - name: atp-ingestion
        volumeMounts:
        - name: config
          mountPath: /app/config
      volumes:
      - name: config
        configMap:
          name: app-config

Pattern 4: Update Service Type:

# overlays/production/service-patch.yaml
apiVersion: v1
kind: Service
metadata:
  name: atp-ingestion
spec:
  type: LoadBalancer  # Change from ClusterIP to LoadBalancer
  ports:
  - port: 80
    targetPort: 8080

JSON Patches

JSON Patch Operations (Add, Replace, Remove)

JSON Patch Operations:

| Operation | Description | Use Case |
|---|---|---|
| add | Add field or array element | Add new annotation, add new container |
| replace | Replace existing field value | Update replica count, change image tag |
| remove | Remove field or array element | Remove environment variable, remove port |
| copy | Copy value from one path to another | Copy annotation value |
| move | Move value from one path to another | Move label |
| test | Test value equality | Validate before patch |

JSON Patch Example:

# overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

patches:
  - target:
      kind: Deployment
      name: atp-ingestion
    path: deployment-patch.json

deployment-patch.json:

[
  {
    "op": "replace",
    "path": "/spec/replicas",
    "value": 5
  },
  {
    "op": "replace",
    "path": "/spec/template/spec/containers/0/image",
    "value": "connectsoft.azurecr.io/atp/ingestion:v1.2.3"
  },
  {
    "op": "add",
    "path": "/spec/template/metadata/annotations/prometheus.io~1scrape",
    "value": "true"
  },
  {
    "op": "add",
    "path": "/spec/template/spec/containers/0/env/-",
    "value": {
      "name": "Logging__LogLevel__Default",
      "value": "Warning"
    }
  },
  {
    "op": "remove",
    "path": "/spec/template/spec/containers/0/env/0"
  }
]

Path Targeting

JSON Patch Path Examples (comments are for illustration only — real JSON Patch files must not contain comments):

[
  // Replace replica count
  {
    "op": "replace",
    "path": "/spec/replicas",
    "value": 5
  },

  // Replace image in first container
  {
    "op": "replace",
    "path": "/spec/template/spec/containers/0/image",
    "value": "new-image:tag"
  },

  // Add annotation (use ~1 for /)
  {
    "op": "add",
    "path": "/metadata/annotations/prometheus.io~1scrape",
    "value": "true"
  },

  // Add to array (use - to append)
  {
    "op": "add",
    "path": "/spec/template/spec/containers/0/env/-",
    "value": {
      "name": "NEW_VAR",
      "value": "value"
    }
  },

  // Remove array element by index
  {
    "op": "remove",
    "path": "/spec/template/spec/containers/0/env/0"
  },

  // Remove field
  {
    "op": "remove",
    "path": "/spec/template/spec/containers/0/resources/limits/cpu"
  }
]

Path Escaping:

  • Use ~1 for / in path
  • Use ~0 for ~ in path
  • Example: prometheus.io/scrape → prometheus.io~1scrape
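
The escaping rule can be scripted. A minimal sketch of a helper (hypothetical function name) that applies RFC 6901 escaping in the correct order:

```shell
#!/bin/bash
# RFC 6901 JSON Pointer escaping for Kustomize JSON patch paths.
# Escape "~" first (-> ~0), then "/" (-> ~1); the reverse order
# would corrupt the "~1" sequences produced for slashes.
escape_pointer() {
  printf '%s' "$1" | sed -e 's/~/~0/g' -e 's#/#~1#g'
}

escape_pointer 'prometheus.io/scrape'; echo   # prometheus.io~1scrape
escape_pointer 'example.com/a~b'; echo        # example.com~1a~0b
```

The escaped key can then be embedded in a patch path such as `/metadata/annotations/prometheus.io~1scrape`.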

When to Use JSON Patches vs Strategic Merge

Comparison:

| Feature | Strategic Merge | JSON Patch |
|---|---|---|
| Simplicity | ✅ Easier to write and read | ⚠️ More verbose |
| Type Safety | ✅ YAML-native | ❌ JSON only |
| Array Operations | ✅ Smart merging with keys | ⚠️ Index-based |
| Precision | ⚠️ Can be ambiguous | ✅ Very precise |
| Removal | ⚠️ Requires null | ✅ Direct remove |
| ATP Preference | Preferred for most cases | ⚠️ Use for complex cases |

ATP Decision Matrix:

| Use Case | Recommended Approach |
|---|---|
| Update replicas | ✅ Strategic merge |
| Update image tag | ✅ Strategic merge |
| Add environment variables | ✅ Strategic merge |
| Remove specific array element | ✅ JSON patch |
| Add annotation with / in key | ⚠️ JSON patch (or use quotes in YAML) |
| Precise field replacement | ✅ JSON patch |
| Complex array manipulation | ⚠️ JSON patch |

ConfigMap and Secret Generators

Generating ConfigMaps from Literals

ConfigMap Generator from Literals:

# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

configMapGenerator:
  - name: app-config
    literals:
      - ASPNETCORE_ENVIRONMENT=Production
      - Logging__LogLevel__Default=Warning
      - Telemetry__SamplingRate=0.1
    options:
      labels:
        app: atp-ingestion
      annotations:
        description: "Application configuration"
      disableNameSuffixHash: false  # Include hash suffix for updates

Generated ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config-abc123  # Hash suffix added
  labels:
    app: atp-ingestion
  annotations:
    description: "Application configuration"
data:
  ASPNETCORE_ENVIRONMENT: Production
  Logging__LogLevel__Default: Warning
  Telemetry__SamplingRate: "0.1"

Generating ConfigMaps from Files

ConfigMap Generator from Files:

# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

configMapGenerator:
  - name: app-config
    files:
      - appsettings.json
      - logging.json
    options:
      disableNameSuffixHash: false

Directory Structure:

overlays/production/
├── kustomization.yaml
├── appsettings.json
└── logging.json

File Contents:

// appsettings.json
{
  "Logging": {
    "LogLevel": {
      "Default": "Warning"
    }
  },
  "Telemetry": {
    "SamplingRate": 0.1
  }
}

Generated ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config-xyz789
data:
  appsettings.json: |
    {
      "Logging": {
        "LogLevel": {
          "Default": "Warning"
        }
      }
    }
  logging.json: |
    {...}

Generating Secrets (Encrypted)

Secret Generator:

# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

secretGenerator:
  - name: app-secrets
    type: Opaque
    literals:
      - connectionString=Server=...
      - apiKey=secret-key-123
    options:
      labels:
        app: atp-ingestion
      disableNameSuffixHash: false

⚠️ Security Warning: Secrets in kustomization.yaml are base64 encoded, not encrypted. Always use External Secrets Operator or Azure Key Vault CSI Driver for production secrets.

Recommended: Secret Generator from File:

# kustomization.yaml
secretGenerator:
  - name: app-secrets
    type: Opaque
    files:
      - connectionString.txt  # Raw value; Kustomize base64-encodes it
      - apiKey.txt

Create Secret File (Raw Value):

# Write the raw value — do NOT base64-encode it yourself.
# The secretGenerator encodes file contents automatically,
# so pre-encoding would result in double encoding.
echo -n "Server=..." > connectionString.txt

Hash Suffixes for Updates

Hash Suffix Behavior:

# kustomization.yaml
configMapGenerator:
  - name: app-config
    literals:
      - KEY=VALUE
    options:
      disableNameSuffixHash: false  # Default: include hash

Hash Suffix Purpose:

  • With Hash (disableNameSuffixHash: false):
    • ConfigMap name: app-config-abc123
    • Changing content generates new hash: app-config-xyz789
    • Forces Pod restart when ConfigMap changes (rolling update)

  • Without Hash (disableNameSuffixHash: true):
    • ConfigMap name: app-config
    • Changing content updates same ConfigMap
    • Running Pods do not restart; mounted data is eventually refreshed, but environment variables stay stale

ATP Recommendation: Use hash suffixes (disableNameSuffixHash: false) to ensure Pods restart when ConfigMaps change.
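
Deployments can keep referencing the generated ConfigMap by its un-hashed name — Kustomize's name-reference transformer rewrites the reference to the hashed name in the final output. A minimal sketch:

```yaml
# deployment.yaml: reference the generator name as-is
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  template:
    spec:
      containers:
      - name: atp-ingestion
        envFrom:
        - configMapRef:
            name: app-config   # rewritten to app-config-<hash> at build time
```

Because the rewritten name changes whenever the content changes, the Pod template changes too, which is what triggers the rolling update.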


Variable Substitution

Defining Variables in kustomization.yaml

Variable Definition (note: vars is deprecated in recent Kustomize releases — the Replacements section below shows the successor mechanism):

# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

vars:
  - name: SERVICE_NAME
    objref:
      kind: Service
      name: atp-ingestion
    fieldref:
      fieldpath: metadata.name
  - name: SERVICE_PORT
    objref:
      kind: Service
      name: atp-ingestion
    fieldref:
      fieldpath: spec.ports[0].port
  - name: REPLICA_COUNT
    objref:
      kind: Deployment
      name: atp-ingestion
    fieldref:
      fieldpath: spec.replicas

Using Variables in Resources

Using Variables in Deployment:

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: $(SERVICE_NAME)
spec:
  replicas: $(REPLICA_COUNT)
  template:
    spec:
      containers:
      - name: atp-ingestion
        env:
        - name: SERVICE_NAME
          value: $(SERVICE_NAME)
        - name: SERVICE_PORT
          value: "$(SERVICE_PORT)"

Variable Substitution Process:

graph LR
    DEFINE[Define Variables<br/>in kustomization.yaml]
    REF[Reference Resources<br/>via objref]
    EXTRACT[Extract Values<br/>via fieldref]
    SUBSTITUTE[Substitute<br/>$(VAR_NAME)]
    OUTPUT[Final Resource]

    DEFINE --> REF
    REF --> EXTRACT
    EXTRACT --> SUBSTITUTE
    SUBSTITUTE --> OUTPUT

    style DEFINE fill:#FFE5B4
    style OUTPUT fill:#90EE90

Environment-Specific Variables

Environment-Specific Variable Configuration:

# overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - ../../base

vars:
  - name: ENVIRONMENT
    objref:
      kind: ConfigMap
      name: app-config
    fieldref:
      fieldpath: data.ASPNETCORE_ENVIRONMENT

configMapGenerator:
  - name: app-config
    literals:
      - ASPNETCORE_ENVIRONMENT=Production

Replacements

Replacing Values Across Resources

Replacement Configuration:

# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

replacements:
  - source:
      kind: ConfigMap
      name: app-config
      fieldPath: data.database-host
    targets:
      - select:
          kind: Deployment
        fieldPaths:
          - spec.template.spec.containers.[name=atp-ingestion].env.[name=DATABASE_HOST].value

Example: Replace Database Host:

# ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  database-host: "atp-db.database.windows.net"

# Deployment (before replacement)
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
      - name: atp-ingestion
        env:
        - name: DATABASE_HOST
          value: "placeholder"

# Replacement configuration
replacements:
  - source:
      kind: ConfigMap
      name: app-config
      fieldPath: data.database-host
    targets:
      - select:
          kind: Deployment
        fieldPaths:
          - spec.template.spec.containers.[name=atp-ingestion].env.[name=DATABASE_HOST].value

# Deployment (after replacement)
# DATABASE_HOST value becomes: "atp-db.database.windows.net"

Source and Target Configuration

Replacement Source Options (alternatives — each source references exactly one resource field):

replacements:
  # Option 1: Reference a ConfigMap/Secret field
  - source:
      kind: ConfigMap
      name: app-config
      fieldPath: data.key-name

  # Option 2: Reference a field on any other resource
  - source:
      kind: Service
      name: atp-ingestion
      fieldPath: spec.clusterIP

Note: a replacement source must reference a resource field; to inject a constant value, put it in a generated ConfigMap first.

Replacement Target Options:

targets:
  - select:
      # Select resources by kind
      kind: Deployment
      # Optional: name filter
      name: atp-ingestion
      # Optional: label selector
      labelSelector: "app=atp-ingestion"
    fieldPaths:
      # Target field path (supports array selectors)
      - spec.template.spec.containers.[name=atp-ingestion].env.[name=KEY].value
      # Multiple targets
      - metadata.annotations.config-hash
    options:
      create: true  # Create field if missing
      delimiter: "/"  # Split the target value on this delimiter (used with index)

Complex Replacement Patterns

Multiple Replacements:

# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

configMapGenerator:
  - name: app-config
    literals:
      - database-host=atp-db.database.windows.net
      - database-port=5432
      - redis-host=atp-redis.redis.cache.windows.net

replacements:
  # Replace database host
  - source:
      kind: ConfigMap
      name: app-config
      fieldPath: data.database-host
    targets:
      - select:
          kind: Deployment
        fieldPaths:
          - spec.template.spec.containers.[name=atp-ingestion].env.[name=DATABASE_HOST].value

  # Replace database port
  - source:
      kind: ConfigMap
      name: app-config
      fieldPath: data.database-port
    targets:
      - select:
          kind: Deployment
        fieldPaths:
          - spec.template.spec.containers.[name=atp-ingestion].env.[name=DATABASE_PORT].value

  # Replace Redis host
  - source:
      kind: ConfigMap
      name: app-config
      fieldPath: data.redis-host
    targets:
      - select:
          kind: Deployment
        fieldPaths:
          - spec.template.spec.containers.[name=atp-ingestion].env.[name=REDIS_HOST].value

Replacement Limitations:

Replacements copy source values verbatim — Kustomize does not transform the value in flight (for example, it cannot strip an http:// prefix). Store the value in the form the target needs, for example in the source ConfigMap:

replacements:
  - source:
      kind: ConfigMap
      name: app-config
      fieldPath: data.api-host  # Store the bare host here, without a scheme
    targets:
      - select:
          kind: Ingress
        fieldPaths:
          - spec.rules.0.host
        options:
          create: true

Remote Bases

Referencing Remote Kustomizations

Remote Base Configuration:

# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  # Git repository as base
  - git@github.com:ConnectSoft/ATP.git//apps/atp-ingestion/base?ref=v1.2.3

  # HTTPS URL
  - https://github.com/ConnectSoft/ATP.git//apps/atp-ingestion/base?ref=main

Git Repository as Base

Git Base with SSH:

# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - git@github.com:ConnectSoft/ATP.git//apps/atp-ingestion/base?ref=v1.2.3

Git Base with HTTPS:

# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - https://github.com/ConnectSoft/ATP.git//apps/atp-ingestion/base?ref=v1.2.3

HTTPS URLs for Bases

HTTPS Base URL Format:

https://<host>/<org>/<repo>.git//<path>?ref=<branch-or-tag>

Examples:

resources:
  # GitHub
  - https://github.com/ConnectSoft/ATP.git//apps/atp-ingestion/base?ref=main

  # Azure Repos
  - https://dev.azure.com/ConnectSoft/ATP/_git/atp-gitops//apps/atp-ingestion/base?ref=production

  # GitLab
  - https://gitlab.com/ConnectSoft/ATP.git//apps/atp-ingestion/base?ref=main

Version Pinning

Version Pinning Strategies:

# Option 1: Pin to tag (recommended)
resources:
  - git@github.com:ConnectSoft/ATP.git//apps/atp-ingestion/base?ref=v1.2.3

# Option 2: Pin to branch (less stable)
resources:
  - git@github.com:ConnectSoft/ATP.git//apps/atp-ingestion/base?ref=production

# Option 3: Pin to commit SHA (most stable)
resources:
  - git@github.com:ConnectSoft/ATP.git//apps/atp-ingestion/base?ref=abc123def456

ATP Recommendation: Pin to Git tags for stability, update tags during releases.
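
Tag bumps can be done mechanically during a release. A minimal sketch using sed (the file path and tag values are hypothetical):

```shell
#!/bin/bash
# Sketch: bump a pinned remote-base ref during a release.
set -e

# Hypothetical kustomization pinned to the previous release tag
cat > /tmp/kustomization.yaml <<'EOF'
resources:
  - git@github.com:ConnectSoft/ATP.git//apps/atp-ingestion/base?ref=v1.2.2
EOF

# Rewrite the pinned tag in place (GNU sed)
sed -i 's/?ref=v1\.2\.2/?ref=v1.2.3/' /tmp/kustomization.yaml

grep 'ref=' /tmp/kustomization.yaml
```

In a real release pipeline the same sed would run over every overlay that pins the base, followed by a commit and PR into the environment branch.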


Component Composition

Creating Reusable Components

Component Structure:

components/
├── monitoring/
│   ├── kustomization.yaml
│   ├── servicemonitor.yaml
│   └── prometheusrule.yaml
├── networking/
│   ├── kustomization.yaml
│   └── networkpolicy.yaml
└── security/
    ├── kustomization.yaml
    ├── podsecuritypolicy.yaml
    └── rbac.yaml

Component kustomization.yaml:

# components/monitoring/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1alpha1
kind: Component

metadata:
  name: monitoring

resources:
  - servicemonitor.yaml
  - prometheusrule.yaml

commonLabels:
  component: monitoring

Including Components in Overlays

Using Components in Overlay:

# overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - ../../base

components:
  - ../../components/monitoring
  - ../../components/networking
  - ../../components/security

patchesStrategicMerge:
  - deployment-patch.yaml

Component Composition Diagram:

graph TB
    BASE[Base<br/>kustomization.yaml]

    subgraph "Components"
        MON[Monitoring<br/>Component]
        NET[Networking<br/>Component]
        SEC[Security<br/>Component]
    end

    OVERLAY[Production Overlay<br/>kustomization.yaml]
    PATCHES[Strategic Merge<br/>Patches]

    BASE --> OVERLAY
    MON --> OVERLAY
    NET --> OVERLAY
    SEC --> OVERLAY
    PATCHES --> OVERLAY

    OVERLAY --> OUTPUT[Final Resources]

    style BASE fill:#FFE5B4
    style OVERLAY fill:#90EE90
    style OUTPUT fill:#87CEEB

Component Library for ATP

ATP Component Library:

components/
├── monitoring/
│   ├── kustomization.yaml
│   ├── servicemonitor.yaml
│   └── prometheusrule.yaml
├── networking/
│   ├── kustomization.yaml
│   ├── networkpolicy.yaml
│   └── ingress-policy.yaml
├── security/
│   ├── kustomization.yaml
│   ├── podsecuritypolicy.yaml
│   └── rbac.yaml
├── autoscaling/
│   ├── kustomization.yaml
│   └── hpa.yaml
└── observability/
    ├── kustomization.yaml
    ├── servicemonitor.yaml
    └── log-forwarding.yaml

Reusable Monitoring Component:

# components/monitoring/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1alpha1
kind: Component

metadata:
  name: monitoring

resources:
  - servicemonitor.yaml

commonLabels:
  component: monitoring

configMapGenerator:
  - name: monitoring-config
    literals:
      - scrape-interval=30s

Monitoring Component Template (the $(name) placeholders must be resolved via variable substitution or replacements before apply):

# components/monitoring/servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: $(name)-servicemonitor
spec:
  selector:
    matchLabels:
      app: $(name)
  endpoints:
  - port: http
    interval: 30s

Transformers

Label Injectors

Common Labels:

# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

commonLabels:
  app: atp-ingestion
  component: ingestion
  environment: production
  managed-by: kustomize
  version: v1.2.3  # Caution: commonLabels also lands in selectors, which are immutable on Deployments

Labels Added to All Resources:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
  labels:
    app: atp-ingestion          # ← Added
    component: ingestion        # ← Added
    environment: production     # ← Added
    managed-by: kustomize       # ← Added
    version: v1.2.3            # ← Added

Namespace Transformers

Namespace Configuration:

# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: atp-production

# All namespaced resources get namespace: atp-production
# (cluster-scoped resources such as ClusterRole are left unchanged)

Name Prefix/Suffix Transformers

Name Prefix/Suffix:

# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namePrefix: prod-  # Prefix: prod-atp-ingestion
nameSuffix: -v1    # Suffix: atp-ingestion-v1

# ATP Recommendation: Use labels/annotations for versioning instead

Image Transformers

Image Transform:

# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

images:
  - name: connectsoft.azurecr.io/atp/ingestion
    newName: connectsoft.azurecr.io/atp/ingestion
    newTag: v1.2.3
  - name: redis
    newName: connectsoft.azurecr.io/atp/redis
    digest: sha256:abc123...  # Use digest for immutability

Image Transform Process:

graph LR
    BASE[Base Resources<br/>image: latest]
    TRANSFORM[Image Transform<br/>newTag: v1.2.3]
    OUTPUT[Output Resources<br/>image: v1.2.3]

    BASE --> TRANSFORM
    TRANSFORM --> OUTPUT

    style BASE fill:#FFE5B4
    style OUTPUT fill:#90EE90

Replica Transformers

Replica Transform:

# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

replicas:
  - name: atp-ingestion
    count: 5
  - name: atp-query
    count: 3

Replica Transform Example:

# Base deployment (replicas: 3)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  replicas: 3

# After replica transform (replicas: 5)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  replicas: 5  # ← Updated

Kustomize with Helm

Combining Helm and Kustomize

Helm + Kustomize Workflow:

graph LR
    HELM[Helm Chart<br/>helm template]
    OUTPUT1[Helm Output<br/>YAML Manifests]
    KUST[Kustomize<br/>kustomize build]
    OUTPUT2[Final Output<br/>Patched Manifests]

    HELM --> OUTPUT1
    OUTPUT1 --> KUST
    KUST --> OUTPUT2

    style HELM fill:#FFE5B4
    style OUTPUT2 fill:#90EE90

helm template → kustomize build

Post-Rendering Helm Output with Kustomize:

# Step 1: Render Helm templates
helm template my-release ./charts/atp-ingestion \
  -f values-production.yaml \
  > /tmp/helm-output.yaml

# Step 2: Use Kustomize to patch Helm output
mkdir -p kustomize-overlay
cat > kustomize-overlay/kustomization.yaml <<EOF
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - /tmp/helm-output.yaml

patchesStrategicMerge:
  - production-patch.yaml
EOF

# Step 3: Build final output
kustomize build kustomize-overlay > final-manifests.yaml

Kustomization for Helm Output:

# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - helm-output.yaml  # Generated from: helm template

patchesStrategicMerge:
  - production-patch.yaml

images:
  - name: connectsoft.azurecr.io/atp/ingestion
    newTag: v1.2.3

Post-Rendering with Kustomize

Helm Post-Renderer Script:

#!/bin/bash
# scripts/helm-post-render-kustomize.sh
# Helm pipes the rendered manifests to this script's stdin and
# expects the final manifests on stdout.

KUSTOMIZE_DIR="${1:-overlays/production}"

# Save the Helm output where the overlay's kustomization.yaml
# lists it as a resource (e.g. resources: [helm-output.yaml])
cat > "${KUSTOMIZE_DIR}/helm-output.yaml"

# Build the patched manifests to stdout
kustomize build "${KUSTOMIZE_DIR}"

Use Post-Renderer in Helm:

# Install with post-renderer
helm install atp-ingestion ./charts/atp-ingestion \
  -f values-production.yaml \
  --post-renderer ./scripts/helm-post-render-kustomize.sh \
  --post-renderer-args "overlays/production"

FluxCD Kustomization CRD

FluxCD-Specific Configuration

FluxCD Kustomization CRD:

# clusters/production/kustomizations/apps-production.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-production
  namespace: flux-system
spec:
  interval: 5m
  path: ./apps/atp-ingestion/overlays/production
  prune: true
  wait: true
  timeout: 5m
  retryInterval: 1m
  sourceRef:
    kind: GitRepository
    name: atp-gitops-production
  dependsOn:
    - name: infrastructure
  healthChecks:
    - apiVersion: apps/v1
      kind: Deployment
      name: atp-ingestion
      namespace: atp-production
  postBuild:
    substitute:
      IMAGE_TAG: v1.2.3
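
Flux's postBuild.substitute performs bash-style variable substitution on the built manifests, so resources can reference the variable directly. A minimal sketch consuming the IMAGE_TAG variable from the example above (file path is hypothetical):

```yaml
# apps/atp-ingestion/overlays/production/deployment-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  template:
    spec:
      containers:
      - name: atp-ingestion
        # Flux replaces ${IMAGE_TAG} after kustomize build
        image: connectsoft.azurecr.io/atp/ingestion:${IMAGE_TAG}
```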

Kustomization CRD Fields

Kustomization CRD Reference:

| Field | Type | Description | Required |
|---|---|---|---|
| interval | duration | Reconciliation interval | ✅ Yes |
| path | string | Path to kustomization.yaml | ✅ Yes |
| prune | boolean | Delete resources not in Git | ⚠️ Optional |
| wait | boolean | Wait for resources to be ready | ⚠️ Optional |
| timeout | duration | Wait timeout | ⚠️ Optional |
| retryInterval | duration | Retry interval on failure | ⚠️ Optional |
| sourceRef | object | GitRepository reference | ✅ Yes |
| dependsOn | array | Dependency Kustomizations | ⚠️ Optional |
| healthChecks | array | Health check resources | ⚠️ Optional |
| postBuild | object | Post-build substitutions | ⚠️ Optional |
| suspend | boolean | Suspend reconciliation | ⚠️ Optional |

Integration with Git

FluxCD Kustomization with Git:

# GitRepository (source)
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: atp-gitops-production
  namespace: flux-system
spec:
  interval: 1m
  url: https://dev.azure.com/ConnectSoft/ATP/_git/atp-gitops
  ref:
    branch: production
  secretRef:
    name: gitops-credentials

---
# Kustomization (deployment target)
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-production
  namespace: flux-system
spec:
  interval: 5m
  path: ./apps/atp-ingestion/overlays/production
  sourceRef:
    kind: GitRepository
    name: atp-gitops-production  # ← References GitRepository
  prune: true
  wait: true

FluxCD Integration Diagram:

sequenceDiagram
    participant Git as Git Repository
    participant GitRepo as GitRepository CRD
    participant FluxCD as FluxCD Controller
    participant Kust as Kustomization CRD
    participant K8s as Kubernetes API

    GitRepo->>Git: Poll for changes (1m)
    Git-->>GitRepo: New commit detected
    GitRepo->>FluxCD: Trigger reconciliation
    FluxCD->>Git: Fetch kustomization.yaml
    Git-->>FluxCD: Return kustomization
    FluxCD->>Git: Fetch base + overlays
    Git-->>FluxCD: Return resources
    FluxCD->>Kust: Build kustomize (kustomize build)
    Kust->>K8s: Apply resources
    K8s-->>FluxCD: Resources applied

Testing Kustomize Configurations

kustomize build for Validation

Validate Kustomize Build:

# Build and validate
kustomize build apps/atp-ingestion/overlays/production

# Validate syntax
kustomize build apps/atp-ingestion/overlays/production > /dev/null && echo "✅ Valid"

# Validate with kubeval
kustomize build apps/atp-ingestion/overlays/production | kubeval --strict

# Validate with kube-score
kustomize build apps/atp-ingestion/overlays/production | kube-score score -

Validation Script:

#!/bin/bash
# scripts/validate-kustomize.sh

OVERLAY_PATH="${1}"

echo "🔍 Validating Kustomize: ${OVERLAY_PATH}"

# Build
echo "1. Building kustomization..."
kustomize build "${OVERLAY_PATH}" > /tmp/kustomize-output.yaml || exit 1

# Validate Kubernetes syntax
echo "2. Validating Kubernetes syntax..."
kubeval /tmp/kustomize-output.yaml --strict || exit 1

# Score manifests
echo "3. Scoring manifests..."
kube-score score /tmp/kustomize-output.yaml || exit 1

echo "✅ Kustomize validation passed"

Diff Validation Against Expected Output

Diff Validation:

#!/bin/bash
# scripts/validate-kustomize-diff.sh

OVERLAY_PATH="${1}"
EXPECTED_OUTPUT="${2}"

echo "🔍 Validating Kustomize output against expected..."

# Build current output
kustomize build "${OVERLAY_PATH}" > /tmp/current.yaml

# Compare with expected
if diff -u "${EXPECTED_OUTPUT}" /tmp/current.yaml; then
  echo "✅ Output matches expected"
else
  echo "❌ Output differs from expected"
  exit 1
fi

Golden File Testing:

# Generate golden file (expected output)
kustomize build apps/atp-ingestion/overlays/production > \
  tests/golden/production-expected.yaml

# Validate against golden file
kustomize build apps/atp-ingestion/overlays/production > /tmp/actual.yaml
diff tests/golden/production-expected.yaml /tmp/actual.yaml

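The golden-file comparison can be wrapped in a small helper so each overlay reports a single PASS/FAIL line. A sketch under assumptions: `check_golden` is a hypothetical helper name, and the demo files stand in for real `kustomize build` output:

```shell
#!/bin/bash
# Compare a rendered manifest against its golden file; print per-overlay status.
check_golden() {
  local name="$1" golden="$2" actual="$3"
  if diff -q "$golden" "$actual" >/dev/null; then
    echo "PASS ${name}"
  else
    echo "FAIL ${name}"
    diff -u "$golden" "$actual" || true
    return 1
  fi
}

# Demo with temporary files standing in for kustomize build output
tmp=$(mktemp -d)
printf 'replicas: 3\n' > "${tmp}/golden.yaml"
printf 'replicas: 3\n' > "${tmp}/actual.yaml"
check_golden production "${tmp}/golden.yaml" "${tmp}/actual.yaml"
```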
CI Pipeline Integration

Azure Pipeline for Kustomize Testing:

# azure-pipelines-kustomize-test.yml
trigger:
  branches:
    include:
    - main
  paths:
    include:
    - apps/**/*

pool:
  vmImage: 'ubuntu-latest'

steps:
- task: Bash@3
  displayName: 'Install kustomize'
  inputs:
    targetType: 'inline'
    script: |
      curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" | bash
      sudo mv kustomize /usr/local/bin/

- task: Bash@3
  displayName: 'Install kubeval and kube-score'
  inputs:
    targetType: 'inline'
    script: |
      wget https://github.com/instrumenta/kubeval/releases/latest/download/kubeval-linux-amd64.tar.gz
      tar xf kubeval-linux-amd64.tar.gz
      sudo mv kubeval /usr/local/bin/

      wget https://github.com/zegl/kube-score/releases/download/v1.17.0/kube-score_1.17.0_linux_amd64.tar.gz
      tar xf kube-score_1.17.0_linux_amd64.tar.gz
      sudo mv kube-score /usr/local/bin/

- task: Bash@3
  displayName: 'Validate Kustomize Configurations'
  inputs:
    targetType: 'inline'
    script: |
      for overlay in apps/*/overlays/*/; do
        APP_NAME=$(basename "$(dirname "$(dirname "$overlay")")")
        OVERLAY_NAME=$(basename "$overlay")
        OUTPUT="/tmp/${APP_NAME}-${OVERLAY_NAME}.yaml"
        echo "Validating overlay: ${APP_NAME}/${OVERLAY_NAME}"

        # Build
        kustomize build "$overlay" > "${OUTPUT}" || exit 1

        # Validate
        kubeval "${OUTPUT}" --strict || exit 1
        kube-score score "${OUTPUT}" || exit 1
      done

Summary: Kustomize Advanced Patterns

  • Kustomize Architecture: Base, overlays, components structure, kustomization file structure, resource selection, transformation order
  • Strategic Merge Patches: How strategic merge works, merge semantics (replace, add, delete), array merging strategies, common patterns
  • JSON Patches: JSON patch operations (add, replace, remove), path targeting, when to use JSON patches vs strategic merge
  • ConfigMap and Secret Generators: Generating ConfigMaps from literals and files, generating secrets (encrypted), hash suffixes for updates
  • Variable Substitution: Defining variables in kustomization.yaml, using variables in resources, environment-specific variables
  • Replacements: Replacing values across resources, source and target configuration, complex replacement patterns
  • Remote Bases: Referencing remote kustomizations, Git repository as base, HTTPS URLs for bases, version pinning
  • Component Composition: Creating reusable components, including components in overlays, component library for ATP
  • Transformers: Label injectors, namespace transformers, name prefix/suffix transformers, image transformers, replica transformers
  • Kustomize with Helm: Combining Helm and Kustomize, helm template → kustomize build, post-rendering with Kustomize
  • FluxCD Kustomization CRD: FluxCD-specific configuration, Kustomization CRD fields reference, integration with Git
  • Testing Kustomize Configurations: kustomize build for validation, diff validation against expected output, CI pipeline integration

Multi-Tenancy in GitOps

Purpose: Define multi-tenancy strategies, tenant isolation mechanisms, tenant-specific configurations, automated onboarding/offboarding procedures, and compliance controls for ATP's GitOps deployments, ensuring complete tenant isolation, secure resource management, and adherence to data residency and regulatory requirements (GDPR, HIPAA, SOC 2).


Tenant Isolation Strategies

Namespace per Tenant (ATP Approach)

Multi-Tenant Architecture with Namespace Isolation:

graph TB
    subgraph "Production AKS Cluster"
        subgraph "Tenant A Namespace"
            NS_A[Namespace: atp-tenant-a]
            DEPLOY_A[Deployments<br/>atp-ingestion<br/>atp-query<br/>atp-gateway]
            SVC_A[Services<br/>ClusterIP]
            DB_A[(Database Schema<br/>tenant_a)]
            SECRETS_A[Secrets<br/>tenant-a-secrets]
        end
        subgraph "Tenant B Namespace"
            NS_B[Namespace: atp-tenant-b]
            DEPLOY_B[Deployments<br/>atp-ingestion<br/>atp-query<br/>atp-gateway]
            SVC_B[Services<br/>ClusterIP]
            DB_B[(Database Schema<br/>tenant_b)]
            SECRETS_B[Secrets<br/>tenant-b-secrets]
        end
        subgraph "Tenant C Namespace"
            NS_C[Namespace: atp-tenant-c]
            DEPLOY_C[Deployments<br/>atp-ingestion<br/>atp-query<br/>atp-gateway]
            SVC_C[Services<br/>ClusterIP]
            DB_C[(Database Schema<br/>tenant_c)]
            SECRETS_C[Secrets<br/>tenant-c-secrets]
        end
        subgraph "Platform Services"
            MON[Monitoring<br/>Shared]
            INGRESS[Ingress Controller<br/>Shared]
        end
    end

    NS_A --> MON
    NS_B --> MON
    NS_C --> MON
    INGRESS --> SVC_A
    INGRESS --> SVC_B
    INGRESS --> SVC_C

    style NS_A fill:#90EE90
    style NS_B fill:#FFE5B4
    style NS_C fill:#87CEEB
    style MON fill:#DDA0DD

Namespace per Tenant Benefits:

| Aspect | Benefit | ATP Justification |
|--------|---------|--------------------|
| Resource Isolation | ✅ Complete resource isolation | Prevents resource contention |
| Network Isolation | ✅ Network policies per namespace | Ensures tenant data isolation |
| RBAC Isolation | ✅ Per-namespace RBAC | Tenant-specific access control |
| Quota Management | ✅ Resource quotas per namespace | Cost control per tenant |
| Compliance | ✅ Isolated audit logs | GDPR/HIPAA compliance |
| Data Residency | ✅ Deploy to specific regions | EU/US data residency requirements |

ATP Decision: Namespace per tenant - Complete isolation, best security, compliance-friendly

Cluster per Tenant (Not Used in ATP)

Cluster per Tenant Comparison:

| Aspect | Cluster per Tenant | Namespace per Tenant | ATP Decision |
|--------|--------------------|----------------------|--------------|
| Isolation | ✅ Maximum isolation | ⚠️ Good isolation | Namespace (sufficient) |
| Cost | ❌ Very high (separate clusters) | ✅ Low (shared cluster) | Namespace (cost-effective) |
| Management | ❌ Complex (many clusters) | ✅ Simple (one cluster) | Namespace (operational simplicity) |
| Resource Utilization | ❌ Poor (underutilized clusters) | ✅ Good (shared resources) | Namespace (efficiency) |
| Compliance | ✅ Maximum compliance | ✅ Good compliance | Namespace (sufficient) |

ATP Rationale: Cluster per tenant is overkill for ATP's requirements. Namespace isolation provides sufficient security and compliance while maintaining cost efficiency.

Shared Namespace with Labels:

# ❌ NOT RECOMMENDED: Shared namespace with labels
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
  namespace: atp-production
  labels:
    tenant: tenant-a  # Label-based separation
spec:
  # ...

Why Not Recommended:

| Issue | Impact | ATP Decision |
|-------|--------|--------------|
| No Resource Isolation | ❌ Resource contention between tenants | ❌ Not acceptable |
| Network Policy Complexity | ❌ Complex label selectors | ❌ Error-prone |
| RBAC Complexity | ❌ Difficult to enforce tenant boundaries | ❌ Security risk |
| Audit Trail | ❌ Harder to isolate tenant activities | ❌ Compliance issue |

ATP Decision: Not used - Insufficient isolation for audit trail platform requirements


Tenant-Specific Configurations

Resource Limits per Tenant

Resource Quota per Tenant:

# tenants/tenant-a/resources/resource-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-a-quota
  namespace: atp-tenant-a
  labels:
    tenant: tenant-a
    managed-by: kustomize
spec:
  hard:
    requests.cpu: "8"      # 8 CPU cores
    requests.memory: 16Gi  # 16Gi memory
    limits.cpu: "16"       # 16 CPU cores
    limits.memory: 32Gi    # 32Gi memory
    persistentvolumeclaims: "5"
    pods: "20"
    services: "10"
    configmaps: "20"
    secrets: "10"

Tenant Resource Limits Matrix:

| Tenant Tier | CPU Requests | Memory Requests | CPU Limits | Memory Limits | Max Pods | Monthly Cost (Est.) |
|-------------|--------------|-----------------|------------|---------------|----------|---------------------|
| Basic | 2 cores | 4Gi | 4 cores | 8Gi | 10 | $500 |
| Standard | 8 cores | 16Gi | 16 cores | 32Gi | 20 | $2,000 |
| Premium | 32 cores | 64Gi | 64 cores | 128Gi | 50 | $8,000 |
| Enterprise | 128 cores | 256Gi | 256 cores | 512Gi | 200 | $32,000 |

Data Residency Requirements (EU vs US)

Data Residency Configuration:

# tenants/tenant-eu/resources/namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: atp-tenant-eu
  labels:
    tenant: tenant-eu
    data-residency: "eu-west"  # EU data residency
    compliance: "gdpr"
    region: "westeurope"
    managed-by: kustomize
  annotations:
    data-residency-policy: "EU-only"
    compliance-requirements: "GDPR"

Regional Deployment Strategy:

| Region | Tenants | Compliance | AKS Cluster |
|--------|---------|------------|-------------|
| East US | US-based tenants | US regulations | atp-prod-eus-aks |
| West Europe | EU-based tenants | GDPR | atp-prod-weu-aks |

Tenant Region Assignment:

# tenants/tenant-eu/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - resources/namespace.yaml
  - resources/resource-quota.yaml
  - resources/network-policy.yaml

commonLabels:
  tenant: tenant-eu
  region: "westeurope"  # EU region
  data-residency: "eu-west"

Compliance Controls (GDPR, HIPAA)

Compliance Labels and Annotations:

# tenants/tenant-hipaa/resources/namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: atp-tenant-hipaa
  labels:
    tenant: tenant-hipaa
    compliance: "hipaa"
    data-classification: "phi"  # Protected Health Information
    encryption-required: "true"
  annotations:
    compliance-policy: "HIPAA"
    encryption-at-rest: "required"
    encryption-in-transit: "required"
    audit-logging: "required"
    data-retention: "6-years"

Compliance Configuration Matrix:

| Compliance Type | Labels | Annotations | Requirements |
|-----------------|--------|-------------|--------------|
| GDPR | compliance: gdpr, data-residency: eu-west | data-residency-policy, right-to-be-forgotten: true | EU data residency, data deletion on request |
| HIPAA | compliance: hipaa, data-classification: phi | encryption-at-rest: required, audit-logging: required | Encryption, audit logs, 6-year retention |
| SOC 2 | compliance: soc2 | audit-logging: required, access-control: required | Audit logs, access controls |

Custom Ingestion Rules

Tenant-Specific Ingestion Configuration:

# tenants/tenant-a/config/configmap-ingestion.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tenant-a-ingestion-config
  namespace: atp-tenant-a
data:
  ingestion-rules.yaml: |
    rules:
      - eventType: "audit"
        batchSize: 100
        batchTimeout: "30s"
        maxRetries: 3
      - eventType: "compliance"
        batchSize: 50
        batchTimeout: "60s"
        maxRetries: 5
    rateLimits:
      requestsPerSecond: 1000
      burstSize: 2000
    retention:
      audit: "7-years"
      compliance: "10-years"

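A pre-commit sanity check can catch a malformed ingestion-rules file before FluxCD picks it up. A sketch only: `check_ingestion_rules` is a hypothetical helper doing a simple grep for the expected top-level keys, not a full YAML parse:

```shell
#!/bin/bash
# Check that an ingestion-rules file defines the expected top-level keys.
check_ingestion_rules() {
  local file="$1" key missing=0
  for key in rules rateLimits retention; do
    grep -q "^${key}:" "$file" || { echo "missing key: ${key}"; missing=1; }
  done
  [ "$missing" -eq 0 ] && echo "ingestion rules OK"
}

# Demo against a minimal sample file
tmp=$(mktemp)
printf 'rules:\n  - eventType: audit\nrateLimits:\n  requestsPerSecond: 1000\nretention:\n  audit: 7-years\n' > "$tmp"
check_ingestion_rules "$tmp"
```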
GitOps Structure for Tenants

/tenants/{tenant-id}/ Directory Structure

Tenant Directory Structure:

atp-gitops/
├── tenants/
│   ├── tenant-a/
│   │   ├── kustomization.yaml
│   │   ├── resources/
│   │   │   ├── namespace.yaml
│   │   │   ├── resource-quota.yaml
│   │   │   ├── network-policy.yaml
│   │   │   ├── rbac.yaml
│   │   │   └── serviceaccount.yaml
│   │   ├── apps/
│   │   │   ├── ingestion/
│   │   │   │   ├── kustomization.yaml
│   │   │   │   └── deployment.yaml
│   │   │   ├── query/
│   │   │   │   ├── kustomization.yaml
│   │   │   │   └── deployment.yaml
│   │   │   └── gateway/
│   │   │       ├── kustomization.yaml
│   │   │       └── deployment.yaml
│   │   ├── config/
│   │   │   ├── configmap-ingestion.yaml
│   │   │   └── configmap-query.yaml
│   │   └── values/
│   │       ├── values-tenant-a.yaml
│   │       └── values-production.yaml
│   ├── tenant-b/
│   │   └── ...
│   └── tenant-eu/
│       └── ...

Tenant Namespace Manifest

Tenant Namespace:

# tenants/tenant-a/resources/namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: atp-tenant-a
  labels:
    tenant: tenant-a
    environment: production
    managed-by: kustomize
    compliance: "soc2"
    data-residency: "us-east"
    region: "eastus"
  annotations:
    description: "ATP Tenant A - Production Environment"
    created-by: "tenant-onboarding"
    created-at: "2024-01-15T10:00:00Z"
    owner: "tenant-a-admin@example.com"
    tier: "standard"
    cost-center: "sales"

Tenant Resource Quota

Resource Quota Configuration:

# tenants/tenant-a/resources/resource-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-a-quota
  namespace: atp-tenant-a
  labels:
    tenant: tenant-a
spec:
  hard:
    # CPU and Memory
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi

    # Storage
    persistentvolumeclaims: "5"
    requests.storage: 100Gi

    # Pod and Service Limits
    pods: "20"
    services: "10"

    # Object Counts
    configmaps: "20"
    secrets: "10"
    services.nodeports: "0"
    services.loadbalancers: "2"

Tier-Based Quota Templates:

# tenants/_templates/resource-quota-basic.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ${TENANT_ID}-quota
  namespace: atp-${TENANT_ID}
spec:
  hard:
    requests.cpu: "2"
    requests.memory: 4Gi
    limits.cpu: "4"
    limits.memory: 8Gi
    pods: "10"

# tenants/_templates/resource-quota-enterprise.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ${TENANT_ID}-quota
  namespace: atp-${TENANT_ID}
spec:
  hard:
    requests.cpu: "128"
    requests.memory: 256Gi
    limits.cpu: "256"
    limits.memory: 512Gi
    pods: "200"

Tenant Network Policy

Tenant Network Policy (Complete Isolation):

# tenants/tenant-a/resources/network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: tenant-a-isolation
  namespace: atp-tenant-a
spec:
  podSelector: {}  # Apply to all pods
  policyTypes:
  - Ingress
  - Egress

  ingress:
  # Allow from ingress controller (shared platform)
  - from:
    - namespaceSelector:
        matchLabels:
          name: ingress-nginx
      podSelector:  # Combined with namespaceSelector: ingress-nginx pods in that namespace
        matchLabels:
          app: ingress-nginx
  # Allow from monitoring namespace (shared)
  - from:
    - namespaceSelector:
        matchLabels:
          name: monitoring
  # Deny all other ingress (including other tenant namespaces)

  egress:
  # Allow DNS
  - to:
    - namespaceSelector:
        matchLabels:
          name: kube-system
    ports:
    - protocol: UDP
      port: 53
  # Allow to monitoring
  - to:
    - namespaceSelector:
        matchLabels:
          name: monitoring
  # Allow to database (external)
  - to:
    - ipBlock:
        cidr: 10.0.0.0/16  # Database subnet
    ports:
    - protocol: TCP
      port: 5432
  # Deny egress to other tenant namespaces

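The policy can be spot-checked by probing a neighbouring tenant's service from a throwaway pod; the request should time out. A sketch under assumptions: `probe` and the `DRY_RUN` switch are illustrative, and the target service URL is hypothetical:

```shell
#!/bin/bash
# Probe a target URL from a short-lived pod in the given namespace.
# DRY_RUN=1 prints the kubectl command instead of running it.
probe() {
  local from_ns="$1" target="$2" expect="$3"
  cmd=(kubectl run np-probe --rm -i --restart=Never -n "$from_ns" \
       --image=busybox -- wget -qO- --timeout=3 "$target")
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "[expect ${expect}] ${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}

# Cross-tenant traffic should be blocked; platform DNS/monitoring should work
DRY_RUN=1 probe atp-tenant-a http://atp-gateway.atp-tenant-b/healthz blocked
```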
Network Policy Isolation Diagram:

graph TB
    subgraph "Tenant A Namespace"
        POD_A1[Pod A1]
        POD_A2[Pod A2]
        NP_A[Network Policy<br/>Deny cross-tenant]
    end
    subgraph "Tenant B Namespace"
        POD_B1[Pod B1]
        POD_B2[Pod B2]
        NP_B[Network Policy<br/>Deny cross-tenant]
    end
    subgraph "Platform Namespaces"
        INGRESS[Ingress Controller]
        MON[Monitoring]
        DNS[Kube DNS]
    end

    INGRESS -->|Allowed| POD_A1
    INGRESS -->|Allowed| POD_B1
    POD_A1 -.->|Blocked| POD_B1
    POD_B1 -.->|Blocked| POD_A1
    POD_A1 -->|Allowed| DNS
    POD_B1 -->|Allowed| DNS
    POD_A1 -->|Allowed| MON
    POD_B1 -->|Allowed| MON

    style NP_A fill:#FF6B6B
    style NP_B fill:#FF6B6B
    style POD_A1 fill:#90EE90
    style POD_B1 fill:#FFE5B4

Tenant RBAC

Tenant-Specific RBAC:

# tenants/tenant-a/resources/rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: tenant-a-sa
  namespace: atp-tenant-a
  labels:
    tenant: tenant-a
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: tenant-a-role
  namespace: atp-tenant-a
rules:
  # Allow read/write to tenant namespace resources
  - apiGroups: [""]
    resources: ["configmaps", "secrets", "pods", "services"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]
  - apiGroups: ["apps"]
    resources: ["deployments", "statefulsets"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]
  # Deny access to other namespaces (implicitly denied)
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: tenant-a-rolebinding
  namespace: atp-tenant-a
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: tenant-a-role
subjects:
  - kind: ServiceAccount
    name: tenant-a-sa
    namespace: atp-tenant-a
  - kind: User
    name: tenant-a-admin@example.com
    apiGroup: rbac.authorization.k8s.io

Tenant Admin RBAC (Limited Cluster Role):

# tenants/tenant-a/resources/cluster-role-binding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: tenant-a-admin
rules:
  # Read-only access to cluster resources
  - apiGroups: [""]
    resources: ["namespaces"]
    resourceNames: ["atp-tenant-a"]  # Only tenant namespace
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: tenant-a-admin-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: tenant-a-admin
subjects:
  - kind: User
    name: tenant-a-admin@example.com
    apiGroup: rbac.authorization.k8s.io

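Tenant RBAC boundaries can be verified with `kubectl auth can-i` impersonation checks: the tenant admin should be allowed inside their own namespace and denied elsewhere. A sketch; the `rbac_check` helper and `DRY_RUN` switch are illustrative:

```shell
#!/bin/bash
# Build the kubectl auth can-i invocations used to spot-check tenant RBAC.
# DRY_RUN=1 prints the commands instead of executing them.
rbac_check() {
  local user="$1" ns="$2" verb resource
  for verb in get create; do
    for resource in deployments secrets; do
      cmd=(kubectl auth can-i "$verb" "$resource" --as="$user" -n "$ns")
      if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "${cmd[*]}"
      else
        "${cmd[@]}"
      fi
    done
  done
}

# Expect "yes" in atp-tenant-a, "no" in any other tenant namespace
DRY_RUN=1 rbac_check tenant-a-admin@example.com atp-tenant-a
```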
Dynamic Tenant Provisioning

Onboarding Script

Tenant Onboarding Automation Script:

#!/bin/bash
# scripts/onboard-tenant.sh

TENANT_ID="${1}"
TENANT_TIER="${2:-standard}"  # basic, standard, premium, enterprise
REGION="${3:-eastus}"  # eastus, westeurope
COMPLIANCE="${4:-soc2}"  # soc2, gdpr, hipaa
OWNER_EMAIL="${5}"

if [ -z "${TENANT_ID}" ] || [ -z "${OWNER_EMAIL}" ]; then
  echo "Usage: $0 <tenant-id> [tier] [region] [compliance] <owner-email>"
  echo "Example: $0 tenant-a standard eastus soc2 admin@tenant-a.example.com"
  exit 1
fi

TENANT_DIR="tenants/${TENANT_ID}"
NAMESPACE="atp-${TENANT_ID}"

echo "🏢 Onboarding tenant: ${TENANT_ID}"
echo "   Tier: ${TENANT_TIER}"
echo "   Region: ${REGION}"
echo "   Compliance: ${COMPLIANCE}"
echo "   Owner: ${OWNER_EMAIL}"

# Step 1: Create tenant directory structure
echo "📁 Step 1: Creating tenant directory structure..."
mkdir -p "${TENANT_DIR}/resources"
mkdir -p "${TENANT_DIR}/apps/ingestion"
mkdir -p "${TENANT_DIR}/apps/query"
mkdir -p "${TENANT_DIR}/apps/gateway"
mkdir -p "${TENANT_DIR}/config"
mkdir -p "${TENANT_DIR}/values"

# Step 2: Generate namespace
echo "📦 Step 2: Generating namespace..."
cat > "${TENANT_DIR}/resources/namespace.yaml" <<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: ${NAMESPACE}
  labels:
    tenant: ${TENANT_ID}
    tier: ${TENANT_TIER}
    environment: production
    region: ${REGION}
    compliance: ${COMPLIANCE}
    managed-by: kustomize
  annotations:
    description: "ATP Tenant ${TENANT_ID} - Production Environment"
    created-by: "tenant-onboarding"
    created-at: "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
    owner: "${OWNER_EMAIL}"
    tier: "${TENANT_TIER}"
EOF

# Step 3: Generate resource quota based on tier
echo "📊 Step 3: Generating resource quota..."
./scripts/generate-tenant-quota.sh "${TENANT_ID}" "${TENANT_TIER}" > "${TENANT_DIR}/resources/resource-quota.yaml"

# Step 4: Generate network policy
echo "🔒 Step 4: Generating network policy..."
./scripts/generate-tenant-network-policy.sh "${TENANT_ID}" > "${TENANT_DIR}/resources/network-policy.yaml"

# Step 5: Generate RBAC
echo "🔐 Step 5: Generating RBAC..."
./scripts/generate-tenant-rbac.sh "${TENANT_ID}" "${OWNER_EMAIL}" > "${TENANT_DIR}/resources/rbac.yaml"

# Step 6: Generate application manifests
echo "🚀 Step 6: Generating application manifests..."
./scripts/generate-tenant-apps.sh "${TENANT_ID}" "${TENANT_TIER}" "${REGION}"

# Step 7: Generate kustomization.yaml
echo "📝 Step 7: Generating kustomization.yaml..."
cat > "${TENANT_DIR}/kustomization.yaml" <<EOF
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: ${NAMESPACE}

resources:
  - resources/namespace.yaml
  - resources/resource-quota.yaml
  - resources/network-policy.yaml
  - resources/rbac.yaml
  - apps/ingestion
  - apps/query
  - apps/gateway

commonLabels:
  tenant: ${TENANT_ID}
  tier: ${TENANT_TIER}
  region: ${REGION}
  compliance: ${COMPLIANCE}
EOF

echo "✅ Tenant directory structure created: ${TENANT_DIR}"
echo ""
echo "📋 Next steps:"
echo "1. Review generated manifests: ${TENANT_DIR}/"
echo "2. Commit to Git: git add ${TENANT_DIR} && git commit -m 'feat: onboard tenant ${TENANT_ID}'"
echo "3. Push to repository: git push origin main"
echo "4. FluxCD will automatically reconcile and deploy tenant resources"

Automated Manifest Generation

Generate Tenant Quota Script:

#!/bin/bash
# scripts/generate-tenant-quota.sh

TENANT_ID="${1}"
TIER="${2:-standard}"

case "${TIER}" in
  "basic")
    CPU_REQ="2"
    MEM_REQ="4Gi"
    CPU_LIM="4"
    MEM_LIM="8Gi"
    PODS="10"
    ;;
  "standard")
    CPU_REQ="8"
    MEM_REQ="16Gi"
    CPU_LIM="16"
    MEM_LIM="32Gi"
    PODS="20"
    ;;
  "premium")
    CPU_REQ="32"
    MEM_REQ="64Gi"
    CPU_LIM="64"
    MEM_LIM="128Gi"
    PODS="50"
    ;;
  "enterprise")
    CPU_REQ="128"
    MEM_REQ="256Gi"
    CPU_LIM="256"
    MEM_LIM="512Gi"
    PODS="200"
    ;;
  *)
    echo "❌ Unknown tier: ${TIER}" >&2
    exit 1
    ;;
esac

cat <<EOF
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ${TENANT_ID}-quota
  namespace: atp-${TENANT_ID}
spec:
  hard:
    requests.cpu: "${CPU_REQ}"
    requests.memory: ${MEM_REQ}
    limits.cpu: "${CPU_LIM}"
    limits.memory: ${MEM_LIM}
    pods: "${PODS}"
    persistentvolumeclaims: "5"
    services: "10"
EOF

Generate Tenant Network Policy Script:

#!/bin/bash
# scripts/generate-tenant-network-policy.sh

TENANT_ID="${1}"
NAMESPACE="atp-${TENANT_ID}"

cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ${TENANT_ID}-isolation
  namespace: ${NAMESPACE}
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress

  ingress:
  # Allow from ingress controller
  - from:
    - namespaceSelector:
        matchLabels:
          name: ingress-nginx
  # Allow from monitoring
  - from:
    - namespaceSelector:
        matchLabels:
          name: monitoring

  egress:
  # Allow DNS
  - to:
    - namespaceSelector:
        matchLabels:
          name: kube-system
    ports:
    - protocol: UDP
      port: 53
  # Allow to monitoring
  - to:
    - namespaceSelector:
        matchLabels:
          name: monitoring
  # Allow to database (external)
  - to:
    - ipBlock:
        cidr: 10.0.0.0/16
    ports:
    - protocol: TCP
      port: 5432
EOF

Generate Tenant Apps Script:

#!/bin/bash
# scripts/generate-tenant-apps.sh

TENANT_ID="${1}"
TIER="${2}"
REGION="${3}"

TENANT_DIR="tenants/${TENANT_ID}"

# Generate ingestion app kustomization
cat > "${TENANT_DIR}/apps/ingestion/kustomization.yaml" <<EOF
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - ../../../../apps/atp-ingestion/base

patches:
  - path: deployment-patch.yaml

images:
  - name: connectsoft.azurecr.io/atp/ingestion
    newTag: v1.2.3

namespace: atp-${TENANT_ID}

commonLabels:
  tenant: ${TENANT_ID}
EOF

# Generate deployment patch
cat > "${TENANT_DIR}/apps/ingestion/deployment-patch.yaml" <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  replicas: 3
  template:
    spec:
      containers:
      - name: atp-ingestion
        env:
        - name: TENANT_ID
          value: "${TENANT_ID}"
        - name: REGION
          value: "${REGION}"
EOF

echo "✅ Application manifests generated"

Git Commit for New Tenant

Commit Tenant Configuration:

#!/bin/bash
# scripts/commit-tenant-config.sh

TENANT_ID="${1}"

if [ -z "${TENANT_ID}" ]; then
  echo "Usage: $0 <tenant-id>"
  exit 1
fi

TENANT_DIR="tenants/${TENANT_ID}"

echo "📝 Committing tenant configuration: ${TENANT_ID}"

# Add tenant directory
git add "${TENANT_DIR}"

# Commit with conventional commit format
git commit -m "feat(tenant): onboard tenant ${TENANT_ID}

- Add namespace: atp-${TENANT_ID}
- Add resource quota and network policy
- Add tenant-specific RBAC
- Add application deployments

Signed-off-by: $(git config user.name) <$(git config user.email)>" \
  --gpg-sign

# Push to repository
git push origin main

echo "✅ Tenant configuration committed and pushed"
echo "⏳ FluxCD will reconcile and deploy tenant resources automatically"

FluxCD Applies Tenant Resources

FluxCD Kustomization for Tenant:

# clusters/production/kustomizations/tenants/tenant-a.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: tenant-a
  namespace: flux-system
  labels:
    tenant: tenant-a
spec:
  interval: 5m
  path: ./tenants/tenant-a
  prune: true
  wait: true
  timeout: 10m
  sourceRef:
    kind: GitRepository
    name: atp-gitops-production
  dependsOn:
    - name: infrastructure
  healthChecks:
    - apiVersion: apps/v1
      kind: Deployment
      name: atp-ingestion
      namespace: atp-tenant-a
    - apiVersion: apps/v1
      kind: Deployment
      name: atp-query
      namespace: atp-tenant-a
    - apiVersion: apps/v1
      kind: Deployment
      name: atp-gateway
      namespace: atp-tenant-a

Auto-Create FluxCD Kustomization for Tenant:

#!/bin/bash
# scripts/create-tenant-fluxcd-kustomization.sh

TENANT_ID="${1}"
KUST_FILE="clusters/production/kustomizations/tenants/${TENANT_ID}.yaml"

cat > "${KUST_FILE}" <<EOF
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: ${TENANT_ID}
  namespace: flux-system
  labels:
    tenant: ${TENANT_ID}
spec:
  interval: 5m
  path: ./tenants/${TENANT_ID}
  prune: true
  wait: true
  timeout: 10m
  sourceRef:
    kind: GitRepository
    name: atp-gitops-production
  dependsOn:
    - name: infrastructure
EOF

kubectl apply -f "${KUST_FILE}"

echo "✅ FluxCD Kustomization created for tenant: ${TENANT_ID}"

Tenant Onboarding Automation

Step-by-Step Onboarding Process

Tenant Onboarding Workflow:

sequenceDiagram
    participant Admin as Administrator
    participant Script as Onboarding Script
    participant Git as Git Repository
    participant FluxCD as FluxCD
    participant K8s as Kubernetes

    Admin->>Script: Execute onboard-tenant.sh
    Script->>Script: Create directory structure
    Script->>Script: Generate namespace
    Script->>Script: Generate resource quota
    Script->>Script: Generate network policy
    Script->>Script: Generate RBAC
    Script->>Script: Generate app manifests
    Script->>Git: Commit tenant config
    FluxCD->>Git: Poll for changes
    FluxCD->>Git: Fetch tenant manifests
    FluxCD->>K8s: Apply namespace
    FluxCD->>K8s: Apply resource quota
    FluxCD->>K8s: Apply network policy
    FluxCD->>K8s: Apply RBAC
    FluxCD->>K8s: Deploy applications
    K8s-->>FluxCD: Resources ready
    FluxCD-->>Admin: Tenant onboarded

Complete Onboarding Automation:

#!/bin/bash
# scripts/onboard-tenant-complete.sh

TENANT_ID="${1}"
TENANT_TIER="${2:-standard}"
REGION="${3:-eastus}"
COMPLIANCE="${4:-soc2}"
OWNER_EMAIL="${5}"

if [ -z "${TENANT_ID}" ] || [ -z "${OWNER_EMAIL}" ]; then
  echo "Usage: $0 <tenant-id> [tier] [region] [compliance] <owner-email>"
  exit 1
fi

echo "🏢 Complete Tenant Onboarding: ${TENANT_ID}"
echo ""

# Step 1: Create tenant directory
echo "📁 Step 1: Creating tenant directory..."
./scripts/onboard-tenant.sh "${TENANT_ID}" "${TENANT_TIER}" "${REGION}" "${COMPLIANCE}" "${OWNER_EMAIL}" || exit 1

# Step 2: Validate manifests
echo "🔍 Step 2: Validating manifests..."
./scripts/validate-kustomize.sh "tenants/${TENANT_ID}" || exit 1

# Step 3: Commit to Git
echo "📝 Step 3: Committing to Git..."
./scripts/commit-tenant-config.sh "${TENANT_ID}" || exit 1

# Step 4: Create FluxCD Kustomization
echo "⚙️  Step 4: Creating FluxCD Kustomization..."
./scripts/create-tenant-fluxcd-kustomization.sh "${TENANT_ID}" || exit 1

# Step 5: Wait for FluxCD reconciliation
echo "⏳ Step 5: Waiting for FluxCD reconciliation..."
flux reconcile kustomization "${TENANT_ID}" --with-source -n flux-system

# Step 6: Verify tenant resources
echo "✅ Step 6: Verifying tenant resources..."
./scripts/verify-tenant-onboarding.sh "${TENANT_ID}" || exit 1

echo ""
echo "🎉 Tenant onboarding complete: ${TENANT_ID}"

Verify Tenant Onboarding:

#!/bin/bash
# scripts/verify-tenant-onboarding.sh

TENANT_ID="${1}"
NAMESPACE="atp-${TENANT_ID}"

echo "🔍 Verifying tenant onboarding: ${TENANT_ID}"

# Check namespace exists
if ! kubectl get namespace "${NAMESPACE}" >/dev/null 2>&1; then
  echo "❌ Namespace ${NAMESPACE} does not exist"
  exit 1
fi

# Check resource quota
if ! kubectl get resourcequota -n "${NAMESPACE}" >/dev/null 2>&1; then
  echo "❌ Resource quota not found"
  exit 1
fi

# Check network policy
if ! kubectl get networkpolicy -n "${NAMESPACE}" >/dev/null 2>&1; then
  echo "❌ Network policy not found"
  exit 1
fi

# Check deployments are ready
DEPLOYMENTS=("atp-ingestion" "atp-query" "atp-gateway")

for DEPLOYMENT in "${DEPLOYMENTS[@]}"; do
  if ! kubectl wait --for=condition=available --timeout=5m deployment/"${DEPLOYMENT}" -n "${NAMESPACE}"; then
    echo "❌ Deployment ${DEPLOYMENT} not ready"
    exit 1
  fi
done

echo "✅ Tenant onboarding verified: ${TENANT_ID}"

Tenant-Specific Helm Values

values-tenant-{id}.yaml

Tenant-Specific Helm Values:

# tenants/tenant-a/values/values-tenant-a.yaml
# Tenant-specific Helm values for tenant-a

replicaCount: 3

image:
  tag: v1.2.3

resources:
  limits:
    cpu: 4000m
    memory: 4Gi
  requests:
    cpu: 1000m
    memory: 2Gi

autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 10

env:
  - name: TENANT_ID
    value: "tenant-a"
  - name: TENANT_TIER
    value: "standard"
  - name: REGION
    value: "eastus"

ingress:
  enabled: true
  hosts:
    - host: tenant-a.atp.connectsoft.example
      paths:
        - path: /

database:
  host: atp-db-tenant-a.database.windows.net
  name: atp_tenant_a

redis:
  host: atp-redis-tenant-a.redis.cache.windows.net

featureFlags:
  enableAdvancedQuerying: true
  enableRealTimeEvents: true
  enableComplianceReports: true

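When these values are rendered with Helm, file order matters: later `-f` files win on conflicting keys, so environment overrides should be passed after the tenant defaults. A sketch that only prints the layered invocation; the chart path and release naming are assumptions, not part of the repository layout above:

```shell
#!/bin/bash
# Print the helm template command with values layered tenant → environment.
# Chart path (charts/atp-ingestion) and release name are illustrative.
render_tenant() {
  local tenant="$1" env="$2"
  echo helm template "atp-${tenant}" charts/atp-ingestion \
    -f "tenants/${tenant}/values/values-${tenant}.yaml" \
    -f "tenants/${tenant}/values/values-${env}.yaml"
}

render_tenant tenant-a production
```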
Override Replicas, Resources, Endpoints

Environment-Specific Tenant Values:

# tenants/tenant-a/values/values-production.yaml
# Production-specific overrides for tenant-a

replicaCount: 5

resources:
  limits:
    cpu: 8000m
    memory: 8Gi
  requests:
    cpu: 2000m
    memory: 4Gi

autoscaling:
  minReplicas: 5
  maxReplicas: 20

service:
  type: LoadBalancer

ingress:
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
  tls:
    - secretName: tenant-a-tls
      hosts:
        - tenant-a.atp.connectsoft.example

Tenant-Specific Feature Flags

Feature Flags Configuration:

# tenants/tenant-a/config/configmap-feature-flags.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tenant-a-feature-flags
  namespace: atp-tenant-a
data:
  feature-flags.json: |
    {
      "enableAdvancedQuerying": true,
      "enableRealTimeEvents": true,
      "enableComplianceReports": true,
      "enableDataExport": false,
      "enableCustomDashboards": true,
      "maxRetentionDays": 2555,
      "auditLogLevel": "Detailed"
    }

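Because the flags are embedded as a JSON string inside a ConfigMap, a syntax error only surfaces at application runtime. A small pre-commit check, sketched with python3's stdlib `json.tool` so no extra tooling is assumed:

```shell
#!/bin/bash
# Validate a feature-flags payload before committing the ConfigMap.
flags='{"enableAdvancedQuerying": true, "enableDataExport": false, "maxRetentionDays": 2555}'

if printf '%s' "$flags" | python3 -m json.tool >/dev/null 2>&1; then
  echo "feature flags: valid JSON"
else
  echo "feature flags: INVALID JSON"
  exit 1
fi
```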
Multi-Tenancy and FluxCD

Per-Tenant Kustomization

FluxCD Kustomization Per Tenant:

# clusters/production/kustomizations/tenants/tenant-a.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: tenant-a
  namespace: flux-system
  labels:
    tenant: tenant-a
    type: tenant
spec:
  interval: 5m
  path: ./tenants/tenant-a
  prune: true
  wait: true
  timeout: 10m
  retryInterval: 2m
  sourceRef:
    kind: GitRepository
    name: atp-gitops-production
  dependsOn:
    - name: infrastructure
  healthChecks:
    - apiVersion: apps/v1
      kind: Deployment
      name: atp-ingestion
      namespace: atp-tenant-a
  kustomizeFlags:
    - --load-restrictor=LoadRestrictionsNone

List All Tenant Kustomizations:

kubectl get kustomizations -n flux-system -l type=tenant

# Output:
# NAME       READY   STATUS        AGE
# tenant-a   True    Applied       5d
# tenant-b   True    Applied       10d
# tenant-eu  True    Applied       2d

Tenant Isolation in Reconciliation

Isolation Benefits:

| Aspect | Isolation Benefit | ATP Implementation |
|---|---|---|
| Reconciliation Failure | ✅ One tenant failure doesn't affect others | Separate Kustomization per tenant |
| Resource Conflicts | ✅ Namespace isolation prevents conflicts | Namespace per tenant |
| Network Isolation | ✅ Network policies prevent cross-tenant traffic | Tenant-specific network policies |
| RBAC Isolation | ✅ Tenant admin can only access their namespace | Per-namespace RBAC |

FluxCD Reconciliation Isolation:

graph TB
    subgraph "FluxCD Controller"
        RECONCILE[Reconciliation Loop]
    end
    subgraph "Git Repository"
        TENANT_A[tenants/tenant-a/]
        TENANT_B[tenants/tenant-b/]
        TENANT_C[tenants/tenant-c/]
    end
    subgraph "Kubernetes Cluster"
        KUST_A[Kustomization: tenant-a]
        KUST_B[Kustomization: tenant-b]
        KUST_C[Kustomization: tenant-c]
        NS_A[Namespace: atp-tenant-a]
        NS_B[Namespace: atp-tenant-b]
        NS_C[Namespace: atp-tenant-c]
    end

    RECONCILE --> TENANT_A
    RECONCILE --> TENANT_B
    RECONCILE --> TENANT_C

    TENANT_A --> KUST_A
    TENANT_B --> KUST_B
    TENANT_C --> KUST_C

    KUST_A --> NS_A
    KUST_B --> NS_B
    KUST_C --> NS_C

    style KUST_A fill:#90EE90
    style KUST_B fill:#FFE5B4
    style KUST_C fill:#87CEEB

Failure Isolation (One Tenant Doesn't Affect Others)

Failure Isolation Example:

# Tenant A reconciliation fails; list all tenant Kustomizations to compare
kubectl get kustomizations -n flux-system -l type=tenant
# Output:
# NAME       READY   STATUS        AGE
# tenant-a   False   Failed        5d
# tenant-b   True    Applied       10d  ← Still working
# tenant-eu  True    Applied       2d   ← Still working

Independent Reconciliation:

  • ✅ Tenant A failure does not affect Tenant B or Tenant C
  • ✅ Each tenant Kustomization reconciles independently
  • ✅ Namespace isolation prevents resource conflicts
  • ✅ Network policies prevent cross-tenant traffic

Tenant Offboarding

Data Deletion Procedures

GDPR Right to be Forgotten:

#!/bin/bash
# scripts/offboard-tenant.sh

TENANT_ID="${1}"
REASON="${2:-tenant-request}"
GDPR_REQUEST="${3:-false}"  # true if GDPR right-to-be-forgotten

if [ -z "${TENANT_ID}" ]; then
  echo "Usage: $0 <tenant-id> [reason] [gdpr-request]"
  echo "Example: $0 tenant-a tenant-request true"
  exit 1
fi

NAMESPACE="atp-${TENANT_ID}"
TENANT_DIR="tenants/${TENANT_ID}"

echo "🗑️  Offboarding tenant: ${TENANT_ID}"
echo "   Reason: ${REASON}"
echo "   GDPR Request: ${GDPR_REQUEST}"

# Step 1: Export tenant data (if GDPR request)
if [ "${GDPR_REQUEST}" = "true" ]; then
  echo "📦 Step 1: Exporting tenant data for GDPR compliance..."
  ./scripts/export-tenant-data.sh "${TENANT_ID}" || exit 1
fi

# Step 2: Delete database data
echo "🗄️  Step 2: Deleting database data..."
./scripts/delete-tenant-database.sh "${TENANT_ID}" || exit 1

# Step 3: Delete Azure resources
echo "☁️  Step 3: Deleting Azure resources..."
./scripts/delete-tenant-azure-resources.sh "${TENANT_ID}" || exit 1

# Step 4: Delete Kubernetes namespace (deletes all resources)
echo "📦 Step 4: Deleting Kubernetes namespace..."
kubectl delete namespace "${NAMESPACE}" --wait=true --timeout=10m || true

# Step 5: Remove tenant configuration from Git
echo "📝 Step 5: Removing tenant configuration from Git..."
git rm -r "${TENANT_DIR}" || true
# Also remove the FluxCD Kustomization manifest; otherwise Flux recreates the
# Kustomization deleted in Step 6 on the next reconciliation
git rm "clusters/production/kustomizations/tenants/${TENANT_ID}.yaml" || true
git commit -m "feat(tenant): offboard tenant ${TENANT_ID}

- Remove tenant namespace and resources
- Delete tenant data
- Reason: ${REASON}
- GDPR Request: ${GDPR_REQUEST}

Signed-off-by: $(git config user.name) <$(git config user.email)>" \
  --gpg-sign

git push origin main

# Step 6: Delete FluxCD Kustomization
echo "⚙️  Step 6: Deleting FluxCD Kustomization..."
kubectl delete kustomization "${TENANT_ID}" -n flux-system || true

echo "✅ Tenant offboarding complete: ${TENANT_ID}"

Namespace Cleanup

Namespace Cleanup Script:

#!/bin/bash
# scripts/cleanup-tenant-namespace.sh

TENANT_ID="${1}"
NAMESPACE="atp-${TENANT_ID}"

echo "🧹 Cleaning up namespace: ${NAMESPACE}"

# Delete all resources in namespace
kubectl delete all --all -n "${NAMESPACE}" --wait=true --timeout=5m || true

# Delete PVCs
kubectl delete pvc --all -n "${NAMESPACE}" --wait=true --timeout=5m || true

# Delete secrets and configmaps
kubectl delete secrets --all -n "${NAMESPACE}" || true
kubectl delete configmaps --all -n "${NAMESPACE}" || true

# Delete network policies
kubectl delete networkpolicies --all -n "${NAMESPACE}" || true

# Delete namespace
kubectl delete namespace "${NAMESPACE}" --wait=true --timeout=5m || true

echo "✅ Namespace cleanup complete"

Git Commit to Remove Tenant

Remove Tenant from Git:

#!/bin/bash
# scripts/remove-tenant-from-git.sh

TENANT_ID="${1}"
REASON="${2}"

TENANT_DIR="tenants/${TENANT_ID}"

echo "📝 Removing tenant from Git: ${TENANT_ID}"

# Remove tenant directory
git rm -r "${TENANT_DIR}" || true

# Remove FluxCD Kustomization
git rm "clusters/production/kustomizations/tenants/${TENANT_ID}.yaml" || true

# Commit removal
git commit -m "feat(tenant): remove tenant ${TENANT_ID}

- Remove tenant namespace configuration
- Remove tenant FluxCD Kustomization
- Reason: ${REASON}

Signed-off-by: $(git config user.name) <$(git config user.email)>" \
  --gpg-sign

# Push to repository
git push origin main

echo "✅ Tenant removed from Git"

Compliance with GDPR (Right to be Forgotten)

GDPR Data Deletion Procedure:

#!/bin/bash
# scripts/gdpr-data-deletion.sh

TENANT_ID="${1}"

echo "🔒 GDPR Data Deletion Request: ${TENANT_ID}"

# Step 1: Export data (for audit trail)
echo "📦 Step 1: Exporting data for audit trail..."
./scripts/export-tenant-data.sh "${TENANT_ID}" \
  --output "exports/tenant-${TENANT_ID}-$(date +%Y%m%d).json"

# Step 2: Verify export
if [ ! -f "exports/tenant-${TENANT_ID}-$(date +%Y%m%d).json" ]; then
  echo "❌ Data export failed"
  exit 1
fi

# Step 3: Delete database records
echo "🗄️  Step 3: Deleting database records..."
./scripts/delete-tenant-database.sh "${TENANT_ID}" --confirm || exit 1

# Step 4: Delete blob storage
echo "💾 Step 4: Deleting blob storage..."
./scripts/delete-tenant-blob-storage.sh "${TENANT_ID}" || exit 1

# Step 5: Delete logs
echo "📋 Step 5: Deleting logs..."
./scripts/delete-tenant-logs.sh "${TENANT_ID}" || exit 1

# Step 6: Offboard tenant
echo "🗑️  Step 6: Offboarding tenant..."
./scripts/offboard-tenant.sh "${TENANT_ID}" "gdpr-request" "true" || exit 1

# Step 7: Generate deletion certificate
echo "📜 Step 7: Generating deletion certificate..."
cat > "certificates/gdpr-deletion-${TENANT_ID}-$(date +%Y%m%d).md" <<EOF
# GDPR Data Deletion Certificate

**Tenant ID**: ${TENANT_ID}
**Date**: $(date -u +%Y-%m-%dT%H:%M:%SZ)
**Request Type**: Right to be Forgotten (GDPR Article 17)

## Data Deletion Summary

- ✅ Database records deleted
- ✅ Blob storage deleted
- ✅ Logs deleted
- ✅ Kubernetes resources deleted
- ✅ Git configuration removed

## Export Location

- Data exported to: exports/tenant-${TENANT_ID}-$(date +%Y%m%d).json
- Retention: 7 years (legal requirement)

## Verification

All data related to tenant ${TENANT_ID} has been permanently deleted
from ATP systems in compliance with GDPR Article 17.
EOF

echo "✅ GDPR data deletion complete"
echo "📜 Deletion certificate: certificates/gdpr-deletion-${TENANT_ID}-$(date +%Y%m%d).md"

Tenant Cost Allocation

Namespace-Level Resource Tagging

Resource Tagging for Cost Allocation:

# tenants/tenant-a/resources/namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: atp-tenant-a
  labels:
    tenant: tenant-a
    cost-center: "sales"
    business-unit: "enterprise"
    project: "audit-trail-platform"
    environment: "production"
    tier: "standard"
  annotations:
    cost-allocation: "tenant-a"
    billing-account: "account-12345"
    owner: "tenant-a-admin@example.com"

Azure Resource Tagging:

# Tag Azure resources for tenant
az aks update \
  --resource-group atp-production-rg \
  --name atp-prod-eus-aks \
  --tags \
    Tenant=tenant-a \
    CostCenter=sales \
    Environment=production

Cost Reporting per Tenant

Cost Reporting Script:

#!/bin/bash
# scripts/tenant-cost-report.sh

TENANT_ID="${1}"
START_DATE="${2:-$(date -d '30 days ago' +%Y-%m-%d)}"
END_DATE="${3:-$(date +%Y-%m-%d)}"

echo "💰 Cost Report for Tenant: ${TENANT_ID}"
echo "   Period: ${START_DATE} to ${END_DATE}"
echo ""

# Query Azure Cost Management API
az consumption usage list \
  --start-date "${START_DATE}" \
  --end-date "${END_DATE}" \
  --query "[?tags.Tenant=='${TENANT_ID}'].{Instance:instanceName, Cost:pretaxCost}" \
  --output table

# Calculate total cost
TOTAL_COST=$(az consumption usage list \
  --start-date "${START_DATE}" \
  --end-date "${END_DATE}" \
  --query "[?tags.Tenant=='${TENANT_ID}'].pretaxCost" \
  --output tsv | \
  awk '{sum+=$1} END {print sum}')

echo ""
echo "Total Cost: \$${TOTAL_COST}"
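The `az consumption usage list` call above returns one row per metered resource; the awk pipeline filters rows by the `Tenant` tag and sums `pretaxCost`. The same aggregation, sketched in Python over illustrative records (instance names and costs are made up):

```python
# Illustrative usage records shaped like the fields the script queries
usage = [
    {"instanceName": "aks-userpool-vm0", "pretaxCost": 120.50, "tags": {"Tenant": "tenant-a"}},
    {"instanceName": "atp-db-tenant-a",  "pretaxCost": 310.00, "tags": {"Tenant": "tenant-a"}},
    {"instanceName": "atp-db-tenant-b",  "pretaxCost": 280.00, "tags": {"Tenant": "tenant-b"}},
    {"instanceName": "shared-lb",        "pretaxCost": 45.00,  "tags": {}},  # untagged: excluded
]

def tenant_cost(records, tenant_id: str) -> float:
    """Sum pretax cost of all records tagged with the given tenant."""
    return round(sum(r["pretaxCost"] for r in records
                     if r["tags"].get("Tenant") == tenant_id), 2)

print(tenant_cost(usage, "tenant-a"))  # 430.5
print(tenant_cost(usage, "tenant-b"))  # 280.0
```

Untagged shared resources fall through this filter, which is why consistent tagging (previous section) is a precondition for accurate per-tenant reporting.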

Chargeback/Showback Models

Chargeback Model Configuration:

# tenants/tenant-a/config/cost-allocation.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tenant-a-cost-allocation
  namespace: atp-tenant-a
data:
  chargeback-model: "showback"  # or "chargeback"
  billing-frequency: "monthly"
  cost-center: "sales"
  business-unit: "enterprise"
  billing-contact: "finance@example.com"

  cost-breakdown.yaml: |
    resources:
      compute:
        cpu-requests: 0.05  # $0.05 per CPU-hour
        memory-requests: 0.01  # $0.01 per GiB-hour
      storage:
        persistent-volumes: 0.10  # $0.10 per GiB-month
      network:
        egress: 0.09  # $0.09 per GB
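With the unit rates from `cost-breakdown.yaml`, a tenant's monthly showback amount is metered usage multiplied by rate, summed across resource types. A sketch using those rates (the usage figures in the example call are illustrative):

```python
# Unit rates from cost-breakdown.yaml
RATES = {
    "cpu_hour": 0.05,         # $ per CPU-hour (cpu-requests)
    "memory_gib_hour": 0.01,  # $ per GiB-hour (memory-requests)
    "pv_gib_month": 0.10,     # $ per GiB-month (persistent-volumes)
    "egress_gb": 0.09,        # $ per GB egress
}

def monthly_charge(cpu_hours: float, mem_gib_hours: float,
                   pv_gib: float, egress_gb: float) -> float:
    """Showback amount for one month of metered usage."""
    return round(cpu_hours * RATES["cpu_hour"]
                 + mem_gib_hours * RATES["memory_gib_hour"]
                 + pv_gib * RATES["pv_gib_month"]
                 + egress_gb * RATES["egress_gb"], 2)

# e.g. 3 pods x 1 CPU / 2 GiB requests for ~730 h, 500 GiB of PVs, 200 GB egress
print(monthly_charge(cpu_hours=3 * 730, mem_gib_hours=6 * 730,
                     pv_gib=500, egress_gb=200))  # 221.3
```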

Cost Allocation Diagram:

graph TB
    subgraph "Cluster Costs"
        CLUSTER[AKS Cluster<br/>$10,000/month]
    end
    subgraph "Tenant Costs"
        TENANT_A[Tenant A<br/>$2,000/month<br/>20%]
        TENANT_B[Tenant B<br/>$5,000/month<br/>50%]
        TENANT_C[Tenant C<br/>$3,000/month<br/>30%]
    end

    CLUSTER --> TENANT_A
    CLUSTER --> TENANT_B
    CLUSTER --> TENANT_C

    style TENANT_A fill:#90EE90
    style TENANT_B fill:#FFE5B4
    style TENANT_C fill:#87CEEB

Compliance Per Tenant

SOC 2, GDPR, HIPAA Configurations

Compliance Configuration per Tenant:

# tenants/tenant-a/resources/compliance-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tenant-a-compliance
  namespace: atp-tenant-a
data:
  compliance-type: "soc2"
  audit-logging: "enabled"
  data-retention-years: "7"
  encryption-at-rest: "required"
  encryption-in-transit: "required"

  soc2-controls.yaml: |
    controls:
      - id: "CC6.1"
        name: "Logical and Physical Access Controls"
        status: "implemented"
      - id: "CC7.2"
        name: "System Operations"
        status: "implemented"

GDPR Tenant Configuration:

# tenants/tenant-eu/resources/compliance-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tenant-eu-compliance
  namespace: atp-tenant-eu
data:
  compliance-type: "gdpr"
  data-residency: "eu-west"
  right-to-be-forgotten: "enabled"
  data-export: "enabled"
  audit-logging: "enabled"
  data-retention-years: "7"

  gdpr-requirements.yaml: |
    requirements:
      - article: "17"
        name: "Right to Erasure"
        implementation: "automated-deletion"
      - article: "20"
        name: "Data Portability"
        implementation: "data-export-api"
      - article: "30"
        name: "Records of Processing"
        implementation: "audit-logs"

HIPAA Tenant Configuration:

# tenants/tenant-hipaa/resources/compliance-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tenant-hipaa-compliance
  namespace: atp-tenant-hipaa
data:
  compliance-type: "hipaa"
  data-classification: "phi"
  encryption-at-rest: "required"
  encryption-in-transit: "required"
  audit-logging: "required"
  access-control: "required"
  data-retention-years: "6"

  hipaa-requirements.yaml: |
    requirements:
      - section: "164.312(a)(1)"
        name: "Access Control"
        implementation: "rbac"
      - section: "164.312(e)(1)"
        name: "Transmission Security"
        implementation: "tls-encryption"
      - section: "164.312(c)(1)"
        name: "Integrity"
        implementation: "audit-logs"

Tenant-Specific Audit Logs

Audit Log Configuration:

# tenants/tenant-a/resources/audit-policy.yaml
apiVersion: audit.k8s.io/v1
kind: Policy
metadata:
  name: tenant-a-audit
rules:
  # Audit all API requests in tenant namespace
  - level: Metadata
    namespaces: ["atp-tenant-a"]
    verbs: ["*"]
    resources:
      - group: "*"
        resources: ["*"]

  # Audit secret access
  - level: RequestResponse
    namespaces: ["atp-tenant-a"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]
    resources:
      - group: ""
        resources: ["secrets"]

Audit Log Query for Tenant:

// Log Analytics: Query tenant-specific audit logs
AuditLogs
| where Namespace == "atp-tenant-a"
| where TimeGenerated >= ago(7d)
| summarize 
    EventCount = count(),
    UniqueUsers = dcount(UserIdentity),
    UniqueResources = dcount(ResourceName)
    by Tenant = Namespace, bin(TimeGenerated, 1d)
| render timechart

Data Residency Enforcement

Data Residency Policy:

# tenants/tenant-eu/resources/data-residency-policy.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tenant-eu-data-residency
  namespace: atp-tenant-eu
data:
  policy.yaml: |
    data-residency:
      region: "westeurope"
      allowed-regions:
        - "westeurope"
        - "northeurope"
      prohibited-regions:
        - "eastus"
        - "westus"
      enforcement:
        database: "required"
        storage: "required"
        backups: "required"
        logs: "required"

Enforce Data Residency with Azure Policy:

# Azure Policy: Enforce EU data residency
apiVersion: templates.azure.com/v1beta1
kind: PolicyTemplate
metadata:
  name: enforce-eu-data-residency
properties:
  displayName: "Enforce EU Data Residency for Tenant EU"
  description: "Ensure all resources for tenant-eu are deployed in EU regions"
  policyRule:
    if:
      allOf:
      - field: "Microsoft.Resources/subscriptions/resourceGroups/tags['tenant']"
        equals: "tenant-eu"
      - not:
          field: "location"
          in: ["westeurope", "northeurope"]
    then:
      effect: "deny"
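The policy rule above denies any resource tagged `tenant: tenant-eu` whose location falls outside the allowed EU regions. Its allOf/not/in decision logic can be sketched as (the resource dicts are illustrative):

```python
ALLOWED_REGIONS = {"westeurope", "northeurope"}

def residency_effect(resource: dict) -> str:
    """Mirror the policy rule: deny tenant-eu resources outside EU regions."""
    is_tenant_eu = resource.get("tags", {}).get("tenant") == "tenant-eu"
    in_allowed_region = resource.get("location") in ALLOWED_REGIONS
    if is_tenant_eu and not in_allowed_region:   # allOf: [tag match, not(in allowed)]
        return "deny"
    return "allow"

print(residency_effect({"location": "westeurope", "tags": {"tenant": "tenant-eu"}}))  # allow
print(residency_effect({"location": "eastus",     "tags": {"tenant": "tenant-eu"}}))  # deny
print(residency_effect({"location": "eastus",     "tags": {"tenant": "tenant-a"}}))   # allow
```

Note the rule only constrains resources carrying the `tenant-eu` tag; other tenants' resources pass through untouched.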

Retention Policies

Retention Policy Configuration:

# tenants/tenant-a/config/retention-policy.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tenant-a-retention-policy
  namespace: atp-tenant-a
data:
  retention-policy.yaml: |
    policies:
      audit-logs:
        retention-days: 2555  # 7 years
        archive-after-days: 365
        archive-location: "az://atp-audit-archive/tenant-a"
      compliance-data:
        retention-days: 3650  # 10 years
        archive-after-days: 730
        archive-location: "az://atp-compliance-archive/tenant-a"
      operational-logs:
        retention-days: 90
        archive-after-days: 30
        archive-location: "az://atp-ops-archive/tenant-a"
    deletion:
      automated: true
      grace-period-days: 30
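Given the audit-log policy above, a record's age determines whether it stays in hot storage, gets archived, or becomes eligible for deletion once the grace period has elapsed. A minimal sketch of that lifecycle decision:

```python
# Mirrors the audit-logs entry of retention-policy.yaml
POLICY = {
    "retention_days": 2555,      # 7 years
    "archive_after_days": 365,
    "grace_period_days": 30,
}

def lifecycle_state(age_days: int, policy: dict = POLICY) -> str:
    """Classify a record by age: hot -> archived -> grace-period -> delete."""
    if age_days >= policy["retention_days"] + policy["grace_period_days"]:
        return "delete"
    if age_days >= policy["retention_days"]:
        return "grace-period"
    if age_days >= policy["archive_after_days"]:
        return "archived"
    return "hot"

print(lifecycle_state(100))   # hot
print(lifecycle_state(400))   # archived
print(lifecycle_state(2560))  # grace-period
print(lifecycle_state(2600))  # delete
```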

Retention Policy by Compliance Type:

| Compliance Type | Retention Period | Rationale |
|---|---|---|
| SOC 2 | 7 years | SOC 2 requirement |
| GDPR | 7 years | Legal/regulatory requirement |
| HIPAA | 6 years | HIPAA requirement |
| General | 1 year | Standard retention |

Summary: Multi-Tenancy in GitOps

  • Tenant Isolation Strategies: Namespace per tenant (ATP approach), cluster per tenant (not used), shared namespace with labels (not recommended)
  • Tenant-Specific Configurations: Resource limits per tenant (tier-based), data residency requirements (EU vs US), compliance controls (GDPR, HIPAA, SOC 2), custom ingestion rules
  • GitOps Structure for Tenants: /tenants/{tenant-id}/ directory structure, tenant namespace manifest, tenant resource quota, tenant network policy, tenant RBAC
  • Dynamic Tenant Provisioning: Onboarding script, automated manifest generation, Git commit for new tenant, FluxCD applies tenant resources
  • Tenant Onboarding Automation: Step-by-step onboarding process (8 steps), complete automation script, verification script
  • Tenant-Specific Helm Values: values-tenant-{id}.yaml, override replicas/resources/endpoints, tenant-specific feature flags
  • Multi-Tenancy and FluxCD: Per-tenant Kustomization, tenant isolation in reconciliation, failure isolation (one tenant doesn't affect others)
  • Tenant Offboarding: Data deletion procedures, namespace cleanup, Git commit to remove tenant, compliance with GDPR (right to be forgotten)
  • Tenant Cost Allocation: Namespace-level resource tagging, cost reporting per tenant, chargeback/showback models
  • Compliance Per Tenant: SOC 2/GDPR/HIPAA configurations, tenant-specific audit logs, data residency enforcement, retention policies

Cost Optimization in GitOps

Purpose: Define cost optimization strategies, resource right-sizing, autoscaling configurations, automated shutdown procedures, Azure Cost Management integration, and FinOps practices for ATP's GitOps deployments, ensuring optimal resource utilization, cost efficiency, and cost transparency across all environments while maintaining performance and reliability requirements.


AKS Cost Optimization

Node Pool Sizing (Right-Sized VMs)

Node Pool Sizing Strategy:

| Environment | Node Pool Type | VM SKU | Node Count (Min/Max) | Monthly Cost (Est.) | Use Case |
|---|---|---|---|---|---|
| Production | System | Standard_D4s_v3 | 3/10 | $1,500 | System pods, monitoring |
| Production | User | Standard_D8s_v3 | 5/20 | $6,000 | Application workloads |
| Staging | User | Standard_D4s_v3 | 2/8 | $1,200 | Staging workloads |
| Test | User | Standard_D2s_v3 | 1/4 | $300 | Test workloads |
| Dev | User | Standard_D2s_v3 | 1/4 | $300 | Development workloads |

Pulumi C# Node Pool Configuration:

// infrastructure/NodePools.cs
using Pulumi;
using Pulumi.AzureNative.ContainerService;
using Pulumi.AzureNative.ContainerService.Inputs;

public class AKSNodePools
{
    public static ManagedClusterAgentPoolProfileArgs CreateProductionSystemPool()
    {
        return new ManagedClusterAgentPoolProfileArgs
        {
            Name = "systempool",
            Count = 3,
            VmSize = "Standard_D4s_v3",  // 4 vCPUs, 16 GiB
            OsType = "Linux",
            OsDiskSizeGB = 128,
            Mode = AgentPoolMode.System,
            EnableAutoScaling = true,
            MinCount = 3,
            MaxCount = 10,
            MaxPods = 30,
            EnableNodePublicIP = false,
            ScaleSetPriority = ScaleSetPriority.Regular,
            ScaleSetEvictionPolicy = ScaleSetEvictionPolicy.Delete,
            Tags = new InputMap<string>
            {
                { "Environment", "production" },
                { "NodePoolType", "system" },
                { "CostCenter", "infrastructure" }
            }
        };
    }

    public static ManagedClusterAgentPoolProfileArgs CreateProductionUserPool()
    {
        return new ManagedClusterAgentPoolProfileArgs
        {
            Name = "userpool",
            Count = 5,
            VmSize = "Standard_D8s_v3",  // 8 vCPUs, 32 GiB
            OsType = "Linux",
            OsDiskSizeGB = 256,
            Mode = AgentPoolMode.User,
            EnableAutoScaling = true,
            MinCount = 5,
            MaxCount = 20,
            MaxPods = 50,
            EnableNodePublicIP = false,
            Tags = new InputMap<string>
            {
                { "Environment", "production" },
                { "NodePoolType", "user" },
                { "CostCenter", "applications" }
            }
        };
    }

    public static ManagedClusterAgentPoolProfileArgs CreateDevSpotPool()
    {
        return new ManagedClusterAgentPoolProfileArgs
        {
            Name = "spotpool",
            Count = 1,
            VmSize = "Standard_D2s_v3",  // 2 vCPUs, 8 GiB
            OsType = "Linux",
            OsDiskSizeGB = 64,
            Mode = AgentPoolMode.User,
            EnableAutoScaling = true,
            MinCount = 0,  // Scale to zero
            MaxCount = 4,
            MaxPods = 30,
            EnableNodePublicIP = false,
            ScaleSetPriority = ScaleSetPriority.Spot,
            ScaleSetEvictionPolicy = ScaleSetEvictionPolicy.Delete,
            SpotMaxPrice = 0.05,  // Max $0.05 per hour (80% discount)
            Tags = new InputMap<string>
            {
                { "Environment", "development" },
                { "NodePoolType", "spot" },
                { "CostCenter", "development" }
            }
        };
    }
}

Spot Instances for Dev/Test

Spot Instance Configuration:

# clusters/production/node-pools/spot-pool.yaml
apiVersion: containerservice.azure.com/v1
kind: ManagedClusterAgentPoolProfile
metadata:
  name: spotpool
spec:
  name: spotpool
  count: 1
  vmSize: Standard_D2s_v3
  osType: Linux
  osDiskSizeGB: 64
  mode: User
  enableAutoScaling: true
  minCount: 0
  maxCount: 4
  scaleSetPriority: Spot
  scaleSetEvictionPolicy: Delete
  spotMaxPrice: 0.05
  nodeLabels:
    workload: non-production
    cost-optimized: "true"
  nodeTaints:
    - key: kubernetes.azure.com/scalesetpriority
      value: spot
      effect: NoSchedule

Pod Tolerations for Spot Nodes:

# apps/atp-ingestion/base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  template:
    spec:
      tolerations:
      - key: kubernetes.azure.com/scalesetpriority
        operator: Equal
        value: spot
        effect: NoSchedule
      nodeSelector:
        workload: non-production
      containers:
      - name: atp-ingestion
        # ...

Spot Instance Cost Savings:

| VM SKU | Regular Price | Spot Price (~80% discount) | Monthly Savings |
|---|---|---|---|
| Standard_D2s_v3 | $0.096/hour | $0.019/hour | ~$55/month |
| Standard_D4s_v3 | $0.192/hour | $0.038/hour | ~$111/month |
| Standard_D8s_v3 | $0.384/hour | $0.077/hour | ~$221/month |
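The savings column is simply the hourly price gap times roughly 730 hours per month; the table's figures round slightly differently than this quick check:

```python
HOURS_PER_MONTH = 730  # average hours in a month

def monthly_savings(regular_per_hour: float, spot_per_hour: float) -> int:
    """Approximate monthly savings of running one node on spot pricing."""
    return round((regular_per_hour - spot_per_hour) * HOURS_PER_MONTH)

print(monthly_savings(0.096, 0.019))  # 56  (Standard_D2s_v3)
print(monthly_savings(0.192, 0.038))  # 112 (Standard_D4s_v3)
print(monthly_savings(0.384, 0.077))  # 224 (Standard_D8s_v3)
```

Remember these savings only apply to workloads that tolerate eviction, which is why the spot pool carries a `NoSchedule` taint and is reserved for non-production workloads.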

Reserved Instances for Production

Reserved Instance Configuration:

#!/bin/bash
# scripts/purchase-reserved-instances.sh

# Purchase 1-year reserved instances for production.
# Note: reservation purchases use the 'reservations' CLI extension
# (az extension add --name reservations) or the Azure portal;
# flags may vary by CLI version.
az reservations reservation-order purchase \
  --reservation-order-id "${RESERVATION_ORDER_ID}" \
  --reserved-resource-type VirtualMachines \
  --instance-flexibility On \
  --billing-scope-id "/subscriptions/${SUBSCRIPTION_ID}" \
  --term P1Y \
  --billing-plan Upfront \
  --quantity 10 \
  --sku Standard_D8s_v3 \
  --location eastus \
  --applied-scope-type Shared \
  --display-name "ATP Production AKS Nodes - D8s_v3"

echo "✅ Reserved instances purchased (up to 72% discount)"

Reserved Instance Cost Savings:

| Commitment | Discount | Monthly Cost (10x D8s_v3) | Savings vs Pay-as-you-go |
|---|---|---|---|
| 1 Year | ~42% | $2,227 | $1,653/month |
| 3 Years | ~72% | $1,282 | $2,598/month |

ATP Recommendation: Use 1-year reserved instances for production user pool nodes.

Cluster Autoscaler Configuration

Cluster Autoscaler Setup:

# clusters/production/kustomizations/cluster-autoscaler.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: cluster-autoscaler
  namespace: flux-system
spec:
  interval: 5m
  path: ./platform/cluster-autoscaler
  sourceRef:
    kind: GitRepository
    name: atp-gitops-production

Cluster Autoscaler Deployment:

# platform/cluster-autoscaler/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      serviceAccountName: cluster-autoscaler
      containers:
      - image: mcr.microsoft.com/oss/kubernetes/autoscaler/cluster-autoscaler:v1.27.3
        name: cluster-autoscaler
        resources:
          limits:
            cpu: 100m
            memory: 600Mi
          requests:
            cpu: 100m
            memory: 600Mi
        command:
        - ./cluster-autoscaler
        - --v=4
        - --stderrthreshold=info
        - --cloud-provider=azure
        - --skip-nodes-with-local-storage=false
        - --expander=least-waste  # Prefer nodes that waste least resources
        - --node-group-auto-discovery=label:cluster-autoscaler-enabled=true
        - --balance-similar-node-groups
        - --scale-down-enabled=true
        - --scale-down-delay-after-add=10m
        - --scale-down-unneeded-time=10m
        - --scale-down-utilization-threshold=0.5
        - --max-node-provision-time=15m
        env:
        - name: ARM_SUBSCRIPTION_ID
          valueFrom:
            secretKeyRef:
              name: cluster-autoscaler-secrets
              key: subscription-id
        - name: ARM_RESOURCE_GROUP
          value: atp-production-rg
        - name: ARM_TENANT_ID
          valueFrom:
            secretKeyRef:
              name: cluster-autoscaler-secrets
              key: tenant-id
        - name: ARM_CLIENT_ID
          valueFrom:
            secretKeyRef:
              name: cluster-autoscaler-secrets
              key: client-id
        - name: ARM_CLIENT_SECRET
          valueFrom:
            secretKeyRef:
              name: cluster-autoscaler-secrets
              key: client-secret

Cluster Autoscaler Cost Optimization Settings:

| Setting | Value | Rationale |
|---|---|---|
| expander | least-waste | Prefer node pools that waste least resources |
| scale-down-delay-after-add | 10m | Wait before scaling down after adding nodes |
| scale-down-unneeded-time | 10m | Node must be unneeded for 10m before removal |
| scale-down-utilization-threshold | 0.5 | Scale down if node utilization < 50% |
| balance-similar-node-groups | true | Balance pods across similar node groups |

Resource Right-Sizing

Analyzing Actual Resource Usage

Resource Usage Analysis Script:

#!/bin/bash
# scripts/analyze-resource-usage.sh

NAMESPACE="${1:-all}"

echo "📊 Resource Usage Analysis"
echo "=========================="

if [ "${NAMESPACE}" = "all" ]; then
  echo "Analyzing all namespaces..."
  # With --all-namespaces the first column is NAMESPACE, so CPU/memory shift to $4/$5
  kubectl top pods --all-namespaces --containers | \
    awk 'NR>1 {cpu+=$4; memory+=$5} END {print "Total CPU: " cpu "m"; print "Total Memory: " memory "Mi"}'
else
  echo "Analyzing namespace: ${NAMESPACE}"
  kubectl top pods -n "${NAMESPACE}" --containers | \
    awk 'NR>1 {cpu+=$3; memory+=$4} END {print "Total CPU: " cpu "m"; print "Total Memory: " memory "Mi"}'
fi

echo ""
echo "Resource Requests vs Limits:"
if [ "${NAMESPACE}" = "all" ]; then
  SCOPE="--all-namespaces"
else
  SCOPE="-n ${NAMESPACE}"
fi
kubectl get pods ${SCOPE} -o json | \
  jq -r '.items[] | "\(.metadata.name): CPU req=\(.spec.containers[0].resources.requests.cpu // "none") limit=\(.spec.containers[0].resources.limits.cpu // "none"), Memory req=\(.spec.containers[0].resources.requests.memory // "none") limit=\(.spec.containers[0].resources.limits.memory // "none")"'

Azure Monitor Metrics Query:

// Log Analytics: Pod resource usage analysis (7-day average per pod)
Perf
| where ObjectName == "K8SContainer"
| where CounterName in ("cpuUsageNanoCores", "memoryWorkingSetBytes")
| where TimeGenerated >= ago(7d)
| summarize 
    AvgValue = avg(CounterValue) by CounterName, Namespace, PodName
| extend CpuUsageCores = case(
    CounterName == "cpuUsageNanoCores", AvgValue / 1000000000,
    0.0
),
MemoryUsageMiB = case(
    CounterName == "memoryWorkingSetBytes", AvgValue / 1024 / 1024,
    0.0
)
| summarize 
    AvgCpuCores = max(CpuUsageCores),
    AvgMemoryMiB = max(MemoryUsageMiB)
    by Namespace, PodName
| order by AvgCpuCores desc

Adjusting Requests and Limits

Resource Right-Sizing Workflow:

graph LR
    COLLECT[Collect Metrics<br/>7 days]
    ANALYZE[Analyze Usage<br/>P95/P99]
    RECOMMEND[Generate<br/>Recommendations]
    UPDATE[Update Manifests<br/>in Git]
    DEPLOY[Deploy via<br/>GitOps]
    MONITOR[Monitor<br/>Performance]

    COLLECT --> ANALYZE
    ANALYZE --> RECOMMEND
    RECOMMEND --> UPDATE
    UPDATE --> DEPLOY
    DEPLOY --> MONITOR
    MONITOR --> COLLECT

    style COLLECT fill:#FFE5B4
    style DEPLOY fill:#90EE90

Resource Right-Sizing Recommendations Script:

#!/bin/bash
# scripts/generate-right-sizing-recommendations.sh

NAMESPACE="${1}"
OUTPUT_FILE="${2:-right-sizing-recommendations.yaml}"

echo "📊 Generating right-sizing recommendations for: ${NAMESPACE}"

# Query Prometheus metrics (7-day average)
PROMETHEUS_URL="http://prometheus-kube-prometheus-prometheus.monitoring:9090"

cat > /tmp/prometheus-queries.txt <<EOF
# Average CPU usage (7 days)
avg_over_time(rate(container_cpu_usage_seconds_total{namespace="${NAMESPACE}"}[5m])[7d:1h])

# Average memory usage (7 days)
avg_over_time(container_memory_working_set_bytes{namespace="${NAMESPACE}"}[7d:1h])

# P95 CPU usage
quantile_over_time(0.95, rate(container_cpu_usage_seconds_total{namespace="${NAMESPACE}"}[5m])[7d:1h])

# P95 Memory usage
quantile_over_time(0.95, container_memory_working_set_bytes{namespace="${NAMESPACE}"}[7d:1h])
EOF

# Generate recommendations (simplified)
cat > "${OUTPUT_FILE}" <<EOF
# Right-Sizing Recommendations for ${NAMESPACE}
# Generated: $(date -u +%Y-%m-%dT%H:%M:%SZ)
# Based on: 7-day average usage

recommendations:
  - deployment: atp-ingestion
    namespace: ${NAMESPACE}
    containers:
      - name: atp-ingestion
        resources:
          requests:
            cpu: "500m"  # Based on P95 usage: 400m
            memory: "1Gi"  # Based on P95 usage: 800Mi
          limits:
            cpu: "2000m"  # 4x requests (burst capacity)
            memory: "2Gi"  # 2x requests
EOF

echo "✅ Recommendations generated: ${OUTPUT_FILE}"

Vertical Pod Autoscaler (VPA)

VPA Installation:

# Install VPA from the kubernetes/autoscaler repository
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh

VPA Configuration:

# apps/atp-ingestion/base/vpa.yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: atp-ingestion-vpa
  namespace: atp-production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: atp-ingestion
  updatePolicy:
    updateMode: "Auto"  # or "Recreate" or "Off"
  resourcePolicy:
    containerPolicies:
    - containerName: atp-ingestion
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: 4000m
        memory: 8Gi
      controlledResources: ["cpu", "memory"]
      controlledValues: RequestsAndLimits

VPA Modes:

| Mode | Description | Use Case |
|---|---|---|
| Auto | Automatically update requests/limits | Dev/Test (with restart) |
| Recreate | Update on pod recreation | Staging |
| Off | Only generate recommendations | Production (manual review) |

ATP Recommendation: Use Off mode in production to generate recommendations, then manually review and apply via GitOps.

Recommendations from Azure Advisor

Azure Advisor Cost Recommendations:

#!/bin/bash
# scripts/get-azure-advisor-cost-recommendations.sh

echo "💰 Azure Advisor Cost Recommendations"
echo "======================================"

# Get cost recommendations
az advisor recommendation list \
  --category Cost \
  --output table

# Get specific right-sizing recommendations
az advisor recommendation list \
  --category Cost \
  --resource-group atp-production-rg \
  --query "[?category=='Cost' && impact=='High'].{Name:shortDescription.problem, Impact:impact, PotentialSavings:extendedProperties.potentialSavings}" \
  --output table

echo ""
echo "📊 Right-sizing recommendations:"
az advisor recommendation list \
  --category Cost \
  --resource-group atp-production-rg \
  --query "[?recommendationTypeId=='b0b0a0a0-0a0a-0a0a-0a0a-0a0a0a0a0a0a'].{CurrentSKU:extendedProperties.currentSku, RecommendedSKU:extendedProperties.recommendedSku, EstimatedSavings:extendedProperties.estimatedMonthlySavings}" \
  --output table

Horizontal Pod Autoscaler (HPA)

CPU-Based Scaling

CPU-Based HPA Configuration:

# apps/atp-ingestion/base/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: atp-ingestion-hpa
  namespace: atp-production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: atp-ingestion
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70  # Scale when CPU > 70%
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # 5 min stabilization
      policies:
      - type: Percent
        value: 50  # Scale down by 50%
        periodSeconds: 60
      - type: Pods
        value: 2
        periodSeconds: 60
      selectPolicy: Min  # Use most conservative policy
    scaleUp:
      stabilizationWindowSeconds: 0  # Scale up immediately
      policies:
      - type: Percent
        value: 100  # Double pods
        periodSeconds: 30
      - type: Pods
        value: 4
        periodSeconds: 30
      selectPolicy: Max  # Use most aggressive policy

Cost-Optimized HPA Settings:

| Setting | Value | Rationale |
| --- | --- | --- |
| averageUtilization | 70% | Allow higher CPU before scaling (cost efficiency) |
| scaleDown.stabilizationWindowSeconds | 300s | Prevent rapid scale-down (cost savings) |
| scaleDown.selectPolicy | Min | Use conservative scale-down (cost savings) |
| scaleUp.selectPolicy | Max | Aggressive scale-up (performance) |
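How the two scaleDown policies combine under selectPolicy: Min can be sketched numerically. This is illustrative only; the starting replica count of 10 is an assumption, the policy values come from the manifest above:

```shell
# How the HPA combines the two scaleDown policies under selectPolicy: Min
# (assumed starting point: 10 current replicas; policy values from the manifest above)
replicas=10
by_percent=$(( replicas * 50 / 100 ))   # Percent policy: remove up to 50% -> 5 pods per period
by_pods=2                               # Pods policy: remove up to 2 pods per period
allowed=$(( by_percent < by_pods ? by_percent : by_pods ))  # Min -> most conservative
echo "scale-down allowed this period: ${allowed} pods"
```

With selectPolicy: Max (as in the scaleUp block), the same numbers would permit the larger change instead.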

Memory-Based Scaling

Memory-Based HPA:

# apps/atp-ingestion/base/hpa-memory.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: atp-ingestion-hpa-memory
  namespace: atp-production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: atp-ingestion
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80  # Scale when memory > 80%
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 600  # 10 min (memory is sticky)
      policies:
      - type: Percent
        value: 25  # Conservative scale-down
        periodSeconds: 120

Custom Metrics with KEDA

KEDA ScaledObject for Cost Optimization:

# apps/atp-ingestion/base/keda-scaledobject.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: atp-ingestion-keda
  namespace: atp-production
spec:
  scaleTargetRef:
    name: atp-ingestion
  minReplicaCount: 3
  maxReplicaCount: 20
  cooldownPeriod: 300  # 5 min cooldown
  idleReplicaCount: 0  # dev overlays only: CPU/memory triggers cannot scale to zero; remove in production
  triggers:
  # CPU-based scaling
  - type: cpu
    metadata:
      type: Utilization
      value: "70"
  # Memory-based scaling
  - type: memory
    metadata:
      type: Utilization
      value: "80"
  # Custom metric: Queue length
  - type: azure-servicebus
    metadata:
      queueName: atp-ingestion-queue
      messageCount: "100"  # Scale when > 100 messages
      connectionFromEnv: SERVICEBUS_CONNECTION_STRING
  # HTTP request rate
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring:9090
      metricName: http_requests_per_second
      threshold: "1000"
      query: sum(rate(http_requests_total{service="atp-ingestion"}[1m]))

KEDA Cost Optimization Settings:

| Setting | Value | Use Case |
| --- | --- | --- |
| cooldownPeriod | 300s | Prevent rapid scaling (cost savings) |
| idleReplicaCount | 0 | Dev environments (scale to zero) |
| minReplicaCount | 3 | Production (always available) |

Scaling Policies for Cost Efficiency

Cost-Efficient Scaling Strategy:

graph TB
    METRICS[Pod Metrics<br/>CPU/Memory]
    HPA[Horizontal Pod Autoscaler]
    DECISION{Scale Decision}

    SCALE_UP[Scale Up<br/>Aggressive]
    SCALE_DOWN[Scale Down<br/>Conservative]

    METRICS --> HPA
    HPA --> DECISION
    DECISION -->|High Load| SCALE_UP
    DECISION -->|Low Load| SCALE_DOWN

    SCALE_UP --> PERFORMANCE[Performance Priority]
    SCALE_DOWN --> COST[Cost Priority]

    style SCALE_UP fill:#90EE90
    style SCALE_DOWN fill:#FFE5B4
    style PERFORMANCE fill:#90EE90
    style COST fill:#FFB6C1

Environment-Specific Scaling Policies:

| Environment | Min Replicas | Max Replicas | Scale-Down Delay | Rationale |
| --- | --- | --- | --- | --- |
| Production | 3 | 50 | 10 min | Performance > Cost |
| Staging | 2 | 20 | 5 min | Balanced |
| Test | 1 | 10 | 3 min | Cost > Performance |
| Dev | 0 | 5 | 1 min | Cost optimization |

Cluster Autoscaler

Adding Nodes Based on Demand

Cluster Autoscaler Behavior:

sequenceDiagram
    participant Pod as Pod (Pending)
    participant CA as Cluster Autoscaler
    participant AKS as AKS Node Pool
    participant VM as New VM Node

    Pod->>CA: Pod cannot be scheduled
    CA->>CA: Check node pool capacity
    CA->>AKS: Scale up node pool
    AKS->>VM: Provision new VM
    VM->>Pod: Pod scheduled on new node
    Pod->>CA: Pod running

Cluster Autoscaler Configuration:

# platform/cluster-autoscaler/configmap.yaml
# Conceptual mapping: on AKS the managed cluster autoscaler is tuned via
# "az aks update --cluster-autoscaler-profile", not a self-managed ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-config
  namespace: kube-system
data:
  config.yaml: |
    nodeGroups:
      - name: userpool
        minSize: 5
        maxSize: 20
        scaleDownDelayAfterAdd: 10m
        scaleDownUnneededTime: 10m
        scaleDownUtilizationThreshold: 0.5
    scaleDownEnabled: true
    maxNodeProvisionTime: 15m
    balanceSimilarNodeGroups: true
    expander: least-waste

Removing Idle Nodes

Scale-Down Conditions:

| Condition | Value | Rationale |
| --- | --- | --- |
| scaleDownDelayAfterAdd | 10m | Wait before removing newly added nodes |
| scaleDownUnneededTime | 10m | Node must be unneeded for 10 minutes |
| scaleDownUtilizationThreshold | 0.5 | Node utilization < 50% before removal |
| maxEmptyBulkDelete | 10 | Remove up to 10 idle nodes at once |
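A node becomes a scale-down candidate only when both the utilization and unneeded-time conditions hold. A minimal sketch of that test, with sample numbers (45% utilized, unneeded for 12 minutes) against the thresholds above:

```shell
# Sketch of the Cluster Autoscaler removal test using the thresholds above
# (sample node: 45% utilized, unneeded for 12 minutes; both values are assumptions)
utilization_pct=45
unneeded_minutes=12
threshold_pct=50          # scaleDownUtilizationThreshold: 0.5
unneeded_required=10      # scaleDownUnneededTime: 10m
if [ "${utilization_pct}" -lt "${threshold_pct}" ] && \
   [ "${unneeded_minutes}" -ge "${unneeded_required}" ]; then
  candidate=yes
else
  candidate=no
fi
echo "scale-down candidate: ${candidate}"
```

The PodDisruptionBudget below can still block the actual eviction even when a node passes this test.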

Pod Disruption Budget Protection:

# apps/atp-ingestion/base/pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: atp-ingestion-pdb
  namespace: atp-production
spec:
  minAvailable: 2  # Always keep 2 pods running
  selector:
    matchLabels:
      app: atp-ingestion

Scale-Down Delays and Thresholds

Cost-Optimized Scale-Down Configuration:

# Cluster Autoscaler: Aggressive scale-down (cost savings)
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-cost-optimized
  namespace: kube-system
data:
  config.yaml: |
    scaleDownDelayAfterAdd: 5m  # Reduced from 10m
    scaleDownUnneededTime: 5m  # Reduced from 10m
    scaleDownUtilizationThreshold: 0.4  # More aggressive (40%)
    scaleDownGpuUtilizationThreshold: 0.4
    maxEmptyBulkDelete: 20  # Remove more nodes at once
    scaleDownEnabled: true

Node Affinity and Taints

Node Affinity for Cost Optimization:

# apps/atp-ingestion/base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          # Prefer spot nodes (cost savings)
          - weight: 100
            preference:
              matchExpressions:
              - key: kubernetes.azure.com/scalesetpriority
                operator: In
                values:
                - spot
          # Prefer smaller nodes (cost efficiency)
          - weight: 50
            preference:
              matchExpressions:
              - key: node.kubernetes.io/instance-type
                operator: In
                values:
                - Standard_D2s_v3
                - Standard_D4s_v3
      tolerations:
      # Allow scheduling on spot nodes
      - key: kubernetes.azure.com/scalesetpriority
        operator: Equal
        value: spot
        effect: NoSchedule

Development Environment Auto-Shutdown

Schedule-Based Scaling to Zero

CronJob for Auto-Shutdown:

# platform/auto-shutdown/cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: dev-auto-shutdown
  namespace: kube-system
spec:
  schedule: "0 20 * * 1-5"  # 8 PM Monday-Friday
  timeZone: "America/New_York"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: dev-shutdown-sa
          containers:
          - name: kubectl
            image: bitnami/kubectl:latest
            command:
            - /bin/sh
            - -c
            - |
              # Scale down dev namespaces
              for ns in atp-dev atp-test; do
                for deployment in $(kubectl get deployments -n $ns -o name); do
                  kubectl scale $deployment -n $ns --replicas=0
                done
              done
              echo "✅ Dev environments scaled down at $(date)"
          restartPolicy: OnFailure
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: dev-auto-startup
  namespace: kube-system
spec:
  schedule: "0 8 * * 1-5"  # 8 AM Monday-Friday
  timeZone: "America/New_York"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: dev-startup-sa
          containers:
          - name: kubectl
            image: bitnami/kubectl:latest
            command:
            - /bin/sh
            - -c
            - |
              # Scale up dev namespaces
              for ns in atp-dev atp-test; do
                for deployment in $(kubectl get deployments -n $ns -o name); do
                  kubectl scale $deployment -n $ns --replicas=1
                done
              done
              echo "✅ Dev environments scaled up at $(date)"
          restartPolicy: OnFailure

Scaling Down Replicas at Night/Weekends

Auto-Shutdown Script:

#!/bin/bash
# scripts/auto-shutdown-dev-environments.sh

NAMESPACES=("atp-dev" "atp-test")
SHUTDOWN_TIME="20:00"  # 8 PM
STARTUP_TIME="08:00"   # 8 AM

CURRENT_HOUR=$(date +%H)
CURRENT_DAY=$(date +%u)  # 1=Monday, 7=Sunday

# Check if it's weekend
if [ "${CURRENT_DAY}" -eq 6 ] || [ "${CURRENT_DAY}" -eq 7 ]; then
  echo "📴 Weekend: Scaling down all dev environments..."
  for NS in "${NAMESPACES[@]}"; do
    kubectl get deployments -n "${NS}" -o name | \
      xargs -I {} kubectl scale {} -n "${NS}" --replicas=0
  done
  exit 0
fi

# Check if it's shutdown time (8 PM - 8 AM)
if [ "${CURRENT_HOUR}" -ge 20 ] || [ "${CURRENT_HOUR}" -lt 8 ]; then
  echo "🌙 Night time: Scaling down dev environments..."
  for NS in "${NAMESPACES[@]}"; do
    kubectl get deployments -n "${NS}" -o name | \
      xargs -I {} kubectl scale {} -n "${NS}" --replicas=0
  done
else
  echo "☀️  Day time: Ensuring dev environments are running..."
  for NS in "${NAMESPACES[@]}"; do
    kubectl get deployments -n "${NS}" -o name | \
      xargs -I {} kubectl scale {} -n "${NS}" --replicas=1
  done
fi
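The shutdown-window logic in the script reduces to a small predicate (hours in 24h format, day 1 = Monday through 7 = Sunday) that can be tested in isolation; a sketch:

```shell
# Returns 0 (true) when dev environments should be scaled down:
# weekends, or weekdays between 20:00 and 08:00 (same window as the script above)
should_shutdown() {
  local hour=$1 day=$2
  if [ "${day}" -ge 6 ]; then return 0; fi                      # Saturday/Sunday
  if [ "${hour}" -ge 20 ] || [ "${hour}" -lt 8 ]; then return 0; fi
  return 1
}

should_shutdown 21 3 && echo "Wed 21:00 -> shut down"
should_shutdown 10 2 || echo "Tue 10:00 -> keep running"
should_shutdown 12 7 && echo "Sun 12:00 -> shut down"
```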

Wake-Up Procedures

Wake-Up Script:

#!/bin/bash
# scripts/wake-up-dev-environments.sh

NAMESPACES=("atp-dev" "atp-test")

echo "☀️  Waking up dev environments..."

for NS in "${NAMESPACES[@]}"; do
  echo "  - Scaling up namespace: ${NS}"

  # Scale up deployments
  kubectl get deployments -n "${NS}" -o name | \
    xargs -I {} kubectl scale {} -n "${NS}" --replicas=1

  # Wait for pods to be ready
  echo "  - Waiting for pods to be ready..."
  kubectl wait --for=condition=available --timeout=5m \
    deployment --all -n "${NS}"
done

echo "✅ Dev environments are ready"

Cost Savings Calculation

Auto-Shutdown Cost Savings:

| Environment | Running Hours/Day | Running Hours/Week | Monthly Cost (Running) | Monthly Cost (Shutdown) | Savings |
| --- | --- | --- | --- | --- | --- |
| Dev | 12 | 60 | $300 | $120 | $180/month (60%) |
| Test | 12 | 60 | $300 | $120 | $180/month (60%) |
| Total | - | - | $600 | $240 | $360/month |

Cost Savings Formula (nightly shutdown only):

Monthly Savings = (24 hours - Running hours) / 24 hours × Monthly Cost
Monthly Savings = (24 - 12) / 24 × $300 = $150/month per environment

Adding weekend shutdown cuts running time to 60 of 168 hours per week, raising theoretical savings to about 64%; the table above uses a conservative 60% figure ($180/month per environment).
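Folding weekend shutdown into the calculation (the table's figures assume it) gives a larger number than the weekday-only formula. A quick integer-arithmetic check under those assumptions (12 h/day on weekdays, fully off on weekends, $300/month always-on baseline):

```shell
# Savings check including weekend shutdown: environments run 12 h/day on
# weekdays only, against a $300/month always-on baseline (example pricing)
monthly_cost=300
weekly_hours=168
running_hours=$(( 12 * 5 ))                                   # 60 h/week
savings_pct=$(( (weekly_hours - running_hours) * 100 / weekly_hours ))
savings=$(( monthly_cost * savings_pct / 100 ))
echo "savings: ~${savings_pct}% => ~\$${savings}/month per environment"
```

Integer arithmetic truncates; the table's 60% / $180 is the same estimate rounded down conservatively.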

Azure Cost Management Integration

Cost Tracking per Environment

Cost Tracking Dashboard Query:

// Log Analytics: Cost tracking per environment
AzureCost
| where TimeGenerated >= ago(30d)
| where Tags contains "Environment"
| extend Environment = tostring(Tags.Environment)
| extend Service = tostring(Tags.Service)
| summarize 
    TotalCost = sum(Cost),
    AvgDailyCost = avg(Cost)
    by Environment, bin(TimeGenerated, 1d)
| render timechart

Cost Tracking Script:

#!/bin/bash
# scripts/track-costs-by-environment.sh

ENVIRONMENT="${1:-all}"
START_DATE="${2:-$(date -d '30 days ago' +%Y-%m-%d)}"
END_DATE="${3:-$(date +%Y-%m-%d)}"

echo "💰 Cost Tracking: ${ENVIRONMENT}"
echo "   Period: ${START_DATE} to ${END_DATE}"
echo ""

if [ "${ENVIRONMENT}" = "all" ]; then
  ENVIRONMENTS=("production" "staging" "test" "development")
else
  ENVIRONMENTS=("${ENVIRONMENT}")
fi

for ENV in "${ENVIRONMENTS[@]}"; do
  echo "📊 ${ENV}:"

  COST=$(az consumption usage list \
    --start-date "${START_DATE}" \
    --end-date "${END_DATE}" \
    --query "[?tags.Environment=='${ENV}'].pretaxCost" \
    --output tsv | \
    awk '{sum+=$1} END {printf "%.2f", sum}')

  echo "   Total Cost: \$${COST}"

  # Daily average
  DAYS=$(( ($(date -d "${END_DATE}" +%s) - $(date -d "${START_DATE}" +%s)) / 86400 ))
  AVG_DAILY=$(echo "scale=2; ${COST} / ${DAYS}" | bc)
  echo "   Avg Daily: \$${AVG_DAILY}"

  echo ""
done

Budget Alerts

Budget Configuration:

#!/bin/bash
# scripts/create-budget-alert.sh

BUDGET_NAME="${1}"
AMOUNT="${2}"
RESOURCE_GROUP="${3}"
EMAIL="${4}"

az consumption budget create \
  --budget-name "${BUDGET_NAME}" \
  --amount "${AMOUNT}" \
  --time-grain Monthly \
  --start-date "$(date +%Y-%m-01)" \
  --end-date "$(date -d '+1 year' +%Y-%m-01)" \
  --category Cost \
  --resource-group "${RESOURCE_GROUP}"

# Note: "az consumption budget create" does not accept alert thresholds on the
# command line; configure the 50%/80%/100% notifications for "${EMAIL}" via the
# Azure portal, the Budgets REST API, or infrastructure-as-code (see the Pulumi
# example below).

echo "✅ Budget created: ${BUDGET_NAME} (\$${AMOUNT}/month)"

Budget Alert Configuration:

// infrastructure/Budgets.cs (Pulumi C# example, conceptual)
var productionBudget = new Budget("atp-production-budget", new BudgetArgs
{
    BudgetName = "atp-production-monthly",
    Amount = 10000.0,  // $10,000/month
    TimeGrain = "Monthly",
    StartDate = DateTime.Now.ToString("yyyy-MM-01"),
    Category = "Cost",
    Notifications = new[]
    {
        new BudgetNotificationArgs
        {
            Threshold = 50,  // 50% of budget
            ThresholdType = "Actual",
            Operator = "GreaterThan",
            ContactEmails = new[] { "finance@example.com" }
        },
        new BudgetNotificationArgs
        {
            Threshold = 80,  // 80% of budget
            ThresholdType = "Actual",
            Operator = "GreaterThan",
            ContactEmails = new[] { "finance@example.com", "ops@example.com" }
        },
        new BudgetNotificationArgs
        {
            Threshold = 100,  // 100% of budget
            ThresholdType = "Actual",
            Operator = "GreaterThan",
            ContactEmails = new[] { "finance@example.com", "ops@example.com", "cto@example.com" }
        }
    }
});

Cost Anomaly Detection

Cost Anomaly Detection:

#!/bin/bash
# scripts/detect-cost-anomalies.sh

THRESHOLD="${1:-0.2}"  # 20% increase threshold

echo "🔍 Detecting cost anomalies..."

# Get current month cost
CURRENT_MONTH=$(date +%Y-%m)
CURRENT_COST=$(az consumption usage list \
  --start-date "${CURRENT_MONTH}-01" \
  --end-date "$(date +%Y-%m-%d)" \
  --query "[].pretaxCost" \
  --output tsv | \
  awk '{sum+=$1} END {print sum}')

# Get last month cost
LAST_MONTH=$(date -d '1 month ago' +%Y-%m)
LAST_MONTH_COST=$(az consumption usage list \
  --start-date "${LAST_MONTH}-01" \
  --end-date "${LAST_MONTH}-$(date -d "${LAST_MONTH}-01 +1 month -1 day" +%d)" \
  --query "[].pretaxCost" \
  --output tsv | \
  awk '{sum+=$1} END {print sum}')

# Calculate increase percentage (guard against missing last-month data)
if [ -z "${LAST_MONTH_COST}" ] || [ "${LAST_MONTH_COST}" = "0" ]; then
  echo "⚠️  No cost data for ${LAST_MONTH}; cannot compute increase"
  exit 0
fi
INCREASE=$(echo "scale=2; (${CURRENT_COST} - ${LAST_MONTH_COST}) / ${LAST_MONTH_COST} * 100" | bc)

if (( $(echo "${INCREASE} > ${THRESHOLD} * 100" | bc -l) )); then
  echo "⚠️  Cost anomaly detected!"
  echo "   Current month: \$${CURRENT_COST}"
  echo "   Last month: \$${LAST_MONTH_COST}"
  echo "   Increase: ${INCREASE}%"
  echo "   Threshold: $(echo "${THRESHOLD} * 100" | bc)%"

  # Send alert
  echo "Sending alert to finance@example.com..."
else
  echo "✅ No cost anomalies detected"
  echo "   Increase: ${INCREASE}%"
fi
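The threshold comparison itself can be done in pure shell integer arithmetic, avoiding the bc dependency. A sketch with sample month-over-month figures (all values assumed, whole dollars):

```shell
# Anomaly test in pure shell arithmetic: flag when the month-over-month
# increase exceeds threshold_pct (sample costs; real values come from az consumption)
current_cost=1300
last_cost=1000
threshold_pct=20
increase_pct=$(( (current_cost - last_cost) * 100 / last_cost ))
if [ "${increase_pct}" -gt "${threshold_pct}" ]; then
  verdict="anomaly"
else
  verdict="ok"
fi
echo "increase: ${increase_pct}% -> ${verdict}"
```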

Cost Optimization Recommendations

Azure Advisor Cost Recommendations:

#!/bin/bash
# scripts/get-cost-optimization-recommendations.sh

echo "💰 Azure Advisor Cost Optimization Recommendations"
echo "=================================================="

# Get all cost recommendations
az advisor recommendation list \
  --category Cost \
  --query "[].{Name:shortDescription.problem, Impact:impact, ResourceGroup:resourceGroup, PotentialSavings:extendedProperties.potentialSavings}" \
  --output table

echo ""
echo "📊 Top 10 Cost Savings Opportunities:"
az advisor recommendation list \
  --category Cost \
  --query "[?impact=='High' || impact=='Medium'].{Name:shortDescription.problem, Impact:impact, PotentialSavings:extendedProperties.potentialSavings, ResourceId:resourceId}" \
  --output table | head -n 12  # 10 rows plus the 2-line table header

Cost Allocation

Tags per Environment, Service, Tenant

Comprehensive Tagging Strategy:

# Resource tagging template
tags:
  Environment: production | staging | test | development
  Service: atp-ingestion | atp-query | atp-gateway | platform
  Tenant: tenant-a | tenant-b | tenant-eu | shared
  CostCenter: sales | engineering | operations
  BusinessUnit: enterprise | smb
  Project: audit-trail-platform
  Owner: team-name@example.com
  ManagedBy: gitops | terraform | pulumi
  AutoShutdown: true | false
  Criticality: critical | high | medium | low

Tagging in Pulumi:

// infrastructure/Tags.cs
public static class ResourceTags
{
    public static InputMap<string> ProductionTags(string service, string costCenter)
    {
        return new InputMap<string>
        {
            { "Environment", "production" },
            { "Service", service },
            { "CostCenter", costCenter },
            { "Project", "audit-trail-platform" },
            { "ManagedBy", "pulumi" },
            { "Criticality", "critical" },
            { "AutoShutdown", "false" }
        };
    }

    public static InputMap<string> DevelopmentTags(string service)
    {
        return new InputMap<string>
        {
            { "Environment", "development" },
            { "Service", service },
            { "CostCenter", "engineering" },
            { "Project", "audit-trail-platform" },
            { "ManagedBy", "pulumi" },
            { "Criticality", "low" },
            { "AutoShutdown", "true" }
        };
    }
}

Namespace-Level Cost Reporting

Namespace Cost Reporting:

#!/bin/bash
# scripts/namespace-cost-report.sh

NAMESPACE="${1}"
START_DATE="${2:-$(date -d '30 days ago' +%Y-%m-%d)}"
END_DATE="${3:-$(date +%Y-%m-%d)}"

echo "💰 Cost Report for Namespace: ${NAMESPACE}"
echo "   Period: ${START_DATE} to ${END_DATE}"
echo ""

# Get pods in namespace
PODS=$(kubectl get pods -n "${NAMESPACE}" -o json | jq -r '.items[].metadata.name')

TOTAL_CPU=0
TOTAL_MEMORY=0

for POD in ${PODS}; do
  # Get CPU and memory requests (assumes requests are expressed as
  # millicores, e.g. "500m", and whole Gi, e.g. "2Gi"; other units
  # such as "1" cores or "512Mi" would need extra normalization)
  CPU=$(kubectl get pod "${POD}" -n "${NAMESPACE}" -o json | \
    jq -r '.spec.containers[].resources.requests.cpu // "0"' | \
    sed 's/m//' | awk '{sum+=$1} END {print sum+0}')
  MEMORY=$(kubectl get pod "${POD}" -n "${NAMESPACE}" -o json | \
    jq -r '.spec.containers[].resources.requests.memory // "0"' | \
    sed 's/Gi//' | awk '{sum+=$1} END {print sum+0}')

  TOTAL_CPU=$((TOTAL_CPU + CPU))
  TOTAL_MEMORY=$((TOTAL_MEMORY + MEMORY))
done

echo "Resource Requests:"
echo "  CPU: ${TOTAL_CPU}m cores"
echo "  Memory: ${TOTAL_MEMORY}Gi"
echo ""

# Estimate cost (example pricing)
CPU_COST=$(echo "scale=2; ${TOTAL_CPU} / 1000 * 0.096 * 24 * 30" | bc)
MEMORY_COST=$(echo "scale=2; ${TOTAL_MEMORY} * 0.01 * 24 * 30" | bc)
TOTAL_COST=$(echo "scale=2; ${CPU_COST} + ${MEMORY_COST}" | bc)

echo "Estimated Monthly Cost:"
echo "  CPU: \$${CPU_COST}"
echo "  Memory: \$${MEMORY_COST}"
echo "  Total: \$${TOTAL_COST}"
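The bc expressions above follow this arithmetic; the same estimate can be checked in integer cents. Sample requests (2000m CPU, 8 Gi memory) and the script's example rates ($0.096 per core-hour, $0.01 per GiB-hour) are assumptions:

```shell
# Monthly cost estimate in integer cents for sample requests over a
# 720-hour month, using the example rates from the script above
cpu_m=2000
mem_gi=8
hours=720
cpu_cents=$(( cpu_m * 96 * hours / 10000 ))   # 9.6 cents/core-h: cpu_m/1000 cores * 9.6 * hours
mem_cents=$(( mem_gi * 1 * hours ))           # 1 cent/GiB-h
total_cents=$(( cpu_cents + mem_cents ))
printf 'CPU: $%d.%02d  Memory: $%d.%02d  Total: $%d.%02d\n' \
  $(( cpu_cents / 100 )) $(( cpu_cents % 100 )) \
  $(( mem_cents / 100 )) $(( mem_cents % 100 )) \
  $(( total_cents / 100 )) $(( total_cents % 100 ))
```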

Chargeback to Teams

Team Chargeback Report:

#!/bin/bash
# scripts/team-chargeback-report.sh

TEAM="${1:-all}"
MONTH="${2:-$(date +%Y-%m)}"

echo "💰 Team Chargeback Report: ${TEAM}"
echo "   Month: ${MONTH}"
echo ""

if [ "${TEAM}" = "all" ]; then
  TEAMS=("engineering" "sales" "operations")
else
  TEAMS=("${TEAM}")
fi

for T in "${TEAMS[@]}"; do
  echo "📊 ${T}:"

  # Get costs for team's resources
  COST=$(az consumption usage list \
    --start-date "${MONTH}-01" \
    --end-date "${MONTH}-$(date -d "${MONTH}-01 +1 month -1 day" +%d)" \
    --query "[?tags.CostCenter=='${T}'].pretaxCost" \
    --output tsv | \
    awk '{sum+=$1} END {printf "%.2f", sum}')

  echo "   Total Cost: \$${COST}"
  echo ""
done

Showback Reporting

Showback Dashboard Query:

// Log Analytics: Showback report
AzureCost
| where TimeGenerated >= ago(30d)
| extend CostCenter = tostring(Tags.CostCenter)
| extend Service = tostring(Tags.Service)
| extend Environment = tostring(Tags.Environment)
| summarize 
    TotalCost = sum(Cost),
    ResourceCount = count()
    by CostCenter, Service, Environment
| render barchart

Resource Cleanup Automation

Deleting Unused Images in ACR

ACR Cleanup Script:

#!/bin/bash
# scripts/cleanup-acr-images.sh

ACR_NAME="${1}"
KEEP_DAYS="${2:-30}"  # Keep images from last 30 days
KEEP_TAGS="${3:-10}"  # Keep 10 most recent tags per repository

echo "🧹 Cleaning up unused ACR images: ${ACR_NAME}"
echo "   Keep days: ${KEEP_DAYS}"
echo "   Keep tags per repo: ${KEEP_TAGS}"
echo ""

# Get all repositories
REPOS=$(az acr repository list --name "${ACR_NAME}" --output tsv)

for REPO in ${REPOS}; do
  echo "📦 Repository: ${REPO}"

  # Get the KEEP_TAGS most recent tags (newest first)
  TAGS=$(az acr repository show-tags \
    --name "${ACR_NAME}" \
    --repository "${REPO}" \
    --orderby time_desc \
    --output tsv | head -n "${KEEP_TAGS}")

  # Get tags older than KEEP_DAYS (--detail is required for lastUpdateTime)
  OLD_TAGS=$(az acr repository show-tags \
    --name "${ACR_NAME}" \
    --repository "${REPO}" \
    --detail \
    --query "[?lastUpdateTime < '$(date -d "${KEEP_DAYS} days ago" -u +%Y-%m-%dT%H:%M:%SZ)'].name" \
    --output tsv)

  # Delete old tags (but keep the KEEP_TAGS most recent)
  for TAG in ${OLD_TAGS}; do
    # Exact-match check so e.g. "v1.1" does not also protect "v1.1.0"
    if ! echo "${TAGS}" | grep -Fxq "${TAG}"; then
      echo "  🗑️  Deleting: ${REPO}:${TAG}"
      az acr repository delete \
        --name "${ACR_NAME}" \
        --image "${REPO}:${TAG}" \
        --yes || true
    fi
  done
done

echo "✅ ACR cleanup complete"
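The retention rule the script applies (delete a tag only if it is older than KEEP_DAYS and not among the KEEP_TAGS newest) can be exercised in isolation. Tag names and ages below are sample data, newest first:

```shell
# Illustrative retention logic matching the script above: a tag is deleted
# only when it is older than KEEP_DAYS AND outside the KEEP_TAGS newest.
# Sample data (newest first); ages are in days.
KEEP_DAYS=30; KEEP_TAGS=2
TAGS=("v1.4.0" "v1.3.0" "v1.2.0" "v1.1.0")
AGES=(5 40 45 60)
DELETED=()
for i in "${!TAGS[@]}"; do
  if [ "${AGES[$i]}" -gt "${KEEP_DAYS}" ] && [ "$i" -ge "${KEEP_TAGS}" ]; then
    DELETED+=("${TAGS[$i]}")
  fi
done
echo "would delete: ${DELETED[*]}"  # v1.3.0 is old but protected by the keep-count
```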

Removing Old PersistentVolumes

PV Cleanup Script:

#!/bin/bash
# scripts/cleanup-old-pvs.sh

NAMESPACE="${1:-all}"
AGE_DAYS="${2:-30}"  # Delete PVs older than 30 days

echo "🧹 Cleaning up old PersistentVolumes"
echo "   Namespace: ${NAMESPACE}"
echo "   Age threshold: ${AGE_DAYS} days"
echo ""

if [ "${NAMESPACE}" = "all" ]; then
  PVS=$(kubectl get pv -o json | \
    jq -r ".items[] | select(.status.phase == \"Released\" or .status.phase == \"Failed\") | .metadata.name")
else
  PVS=$(kubectl get pv -o json | \
    jq -r ".items[] | select(.spec.claimRef.namespace == \"${NAMESPACE}\" and (.status.phase == \"Released\" or .status.phase == \"Failed\")) | .metadata.name")
fi

for PV in ${PVS}; do
  # Get PV creation timestamp
  CREATED=$(kubectl get pv "${PV}" -o jsonpath='{.metadata.creationTimestamp}')
  CREATED_EPOCH=$(date -d "${CREATED}" +%s)
  AGE_EPOCH=$(date -d "${AGE_DAYS} days ago" +%s)

  if [ "${CREATED_EPOCH}" -lt "${AGE_EPOCH}" ]; then
    echo "🗑️  Deleting old PV: ${PV} (created: ${CREATED})"
    kubectl delete pv "${PV}" || true
  fi
done

echo "✅ PV cleanup complete"

Cleaning Up Completed Jobs

Job Cleanup Script:

#!/bin/bash
# scripts/cleanup-completed-jobs.sh

NAMESPACE="${1:-all}"
AGE_HOURS="${2:-24}"  # Delete jobs older than 24 hours

echo "🧹 Cleaning up completed Jobs"
echo "   Namespace: ${NAMESPACE}"
echo "   Age threshold: ${AGE_HOURS} hours"
echo ""

if [ "${NAMESPACE}" = "all" ]; then
  NAMESPACES=$(kubectl get namespaces -o jsonpath='{.items[*].metadata.name}')
else
  NAMESPACES=("${NAMESPACE}")
fi

for NS in ${NAMESPACES}; do
  # Get completed/failed jobs
  JOBS=$(kubectl get jobs -n "${NS}" -o json | \
    jq -r ".items[] | select(.status.succeeded == 1 or .status.failed > 0) | .metadata.name")

  for JOB in ${JOBS}; do
    # Get job completion time
    COMPLETION_TIME=$(kubectl get job "${JOB}" -n "${NS}" -o jsonpath='{.status.completionTime}')
    if [ -n "${COMPLETION_TIME}" ]; then
      COMPLETION_EPOCH=$(date -d "${COMPLETION_TIME}" +%s)
      AGE_EPOCH=$(date -d "${AGE_HOURS} hours ago" +%s)

      if [ "${COMPLETION_EPOCH}" -lt "${AGE_EPOCH}" ]; then
        echo "🗑️  Deleting completed job: ${NS}/${JOB}"
        kubectl delete job "${JOB}" -n "${NS}" || true
      fi
    fi
  done
done

echo "✅ Job cleanup complete"

Snapshot Cleanup

Snapshot Cleanup Script:

#!/bin/bash
# scripts/cleanup-old-snapshots.sh

RESOURCE_GROUP="${1}"
AGE_DAYS="${2:-7}"  # Keep snapshots from last 7 days

echo "🧹 Cleaning up old snapshots: ${RESOURCE_GROUP}"
echo "   Age threshold: ${AGE_DAYS} days"
echo ""

# Delete snapshots older than AGE_DAYS (name + creation time, tab-separated;
# a while-read loop is used because word-splitting the tsv output would
# separate each name from its timestamp)
az snapshot list \
  --resource-group "${RESOURCE_GROUP}" \
  --query "[?timeCreated < '$(date -d "${AGE_DAYS} days ago" -u +%Y-%m-%dT%H:%M:%SZ)'].[name, timeCreated]" \
  --output tsv | \
while IFS=$'\t' read -r NAME TIME; do
  echo "🗑️  Deleting snapshot: ${NAME} (created: ${TIME})"
  az snapshot delete \
    --resource-group "${RESOURCE_GROUP}" \
    --name "${NAME}" \
    --yes || true
done

echo "✅ Snapshot cleanup complete"

Azure Advisor Recommendations

Reviewing Cost Recommendations

Review Azure Advisor Recommendations:

#!/bin/bash
# scripts/review-azure-advisor-recommendations.sh

echo "💰 Azure Advisor Cost Recommendations"
echo "======================================"

# Get all cost recommendations
az advisor recommendation list \
  --category Cost \
  --output table

echo ""
echo "📊 High Impact Recommendations:"
az advisor recommendation list \
  --category Cost \
  --query "[?impact=='High'].{Name:shortDescription.problem, ResourceGroup:resourceGroup, PotentialSavings:extendedProperties.potentialSavings}" \
  --output table

Implementing Right-Sizing Suggestions

Right-Sizing Implementation:

#!/bin/bash
# scripts/implement-right-sizing.sh

RECOMMENDATION_ID="${1}"

if [ -z "${RECOMMENDATION_ID}" ]; then
  echo "Usage: $0 <recommendation-id>"
  echo ""
  echo "Available recommendations:"
  az advisor recommendation list \
    --category Cost \
    --query "[].{ID:id, Name:shortDescription.problem, CurrentSKU:extendedProperties.currentSku, RecommendedSKU:extendedProperties.recommendedSku}" \
    --output table
  exit 1
fi

echo "📊 Implementing right-sizing recommendation: ${RECOMMENDATION_ID}"

# Get recommendation details
RECOMMENDATION=$(az advisor recommendation show \
  --id "${RECOMMENDATION_ID}")

CURRENT_SKU=$(echo "${RECOMMENDATION}" | jq -r '.extendedProperties.currentSku')
RECOMMENDED_SKU=$(echo "${RECOMMENDATION}" | jq -r '.extendedProperties.recommendedSku')
RESOURCE_ID=$(echo "${RECOMMENDATION}" | jq -r '.resourceId')

echo "  Current SKU: ${CURRENT_SKU}"
echo "  Recommended SKU: ${RECOMMENDED_SKU}"
echo "  Resource: ${RESOURCE_ID}"
echo ""

read -p "Apply this recommendation? (yes/no): " CONFIRM

if [ "${CONFIRM}" = "yes" ]; then
  echo "🔧 Applying right-sizing..."

  # Determine resource type and update
  if echo "${RESOURCE_ID}" | grep -q "Microsoft.Compute/virtualMachines"; then
    RESOURCE_GROUP=$(echo "${RESOURCE_ID}" | cut -d'/' -f5)
    VM_NAME=$(echo "${RESOURCE_ID}" | cut -d'/' -f9)

    echo "  Updating VM: ${VM_NAME}"
    az vm resize \
      --resource-group "${RESOURCE_GROUP}" \
      --name "${VM_NAME}" \
      --size "${RECOMMENDED_SKU}"
  else
    echo "  Resource type not yet supported for automatic resizing"
    echo "  Please apply manually: ${RESOURCE_ID}"
  fi
else
  echo "❌ Right-sizing not applied"
fi

SKU Optimization

SKU Optimization Analysis:

#!/bin/bash
# scripts/analyze-sku-optimization.sh

echo "📊 SKU Optimization Analysis"
echo "============================"

# Get all VMs and their current SKUs
az vm list \
  --query "[].{Name:name, ResourceGroup:resourceGroup, Size:hardwareProfile.vmSize}" \
  --output table

echo ""
echo "💰 Cost comparison (example VMs):"
echo "  Standard_D4s_v3 (4 vCPU, 16 GiB): \$0.192/hour = \$140/month"
echo "  Standard_D8s_v3 (8 vCPU, 32 GiB): \$0.384/hour = \$280/month"
echo "  Standard_D16s_v3 (16 vCPU, 64 GiB): \$0.768/hour = \$561/month"
echo ""
echo "💡 Recommendations:"
echo "  - Right-size based on actual usage (P95 metrics)"
echo "  - Use Reserved Instances for production (up to 72% discount)"
echo "  - Use Spot Instances for dev/test (up to 80% discount)"
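Applying the best-case discount figures above to the D4s_v3 example rate gives the monthly spread; pricing is illustrative and actual Reserved/Spot rates vary by region and term:

```shell
# Monthly cost of one Standard_D4s_v3 node at the example rate above
# ($0.192/h, 730 h/month), with best-case RI (72% off) and Spot (80% off)
monthly_cents=$(( 192 * 730 / 10 ))        # 19.2 cents/h * 730 h = 14016 cents
ri_cents=$(( monthly_cents * 28 / 100 ))   # pay 28% after the 72% RI discount
spot_cents=$(( monthly_cents * 20 / 100 )) # pay 20% after the 80% Spot discount
printf 'On-demand: $%d.%02d  RI: $%d.%02d  Spot: $%d.%02d\n' \
  $(( monthly_cents / 100 )) $(( monthly_cents % 100 )) \
  $(( ri_cents / 100 )) $(( ri_cents % 100 )) \
  $(( spot_cents / 100 )) $(( spot_cents % 100 ))
```

This matches the ~$140/month on-demand figure quoted in the SKU comparison above.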

FinOps Practices

Cost Monitoring Dashboards

FinOps Dashboard Configuration:

# monitoring/dashboards/finops-dashboard.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: finops-dashboard
  namespace: monitoring
data:
  dashboard.json: |
    {
      "dashboard": {
        "title": "ATP FinOps Dashboard",
        "panels": [
          {
            "title": "Monthly Cost by Environment",
            "targets": [
              {
                "expr": "sum(azure_cost_total{environment=~\"production|staging|test|development\"}) by (environment)",
                "legendFormat": "{{environment}}"
              }
            ]
          },
          {
            "title": "Cost Trend (30 days)",
            "targets": [
              {
                "expr": "sum(rate(azure_cost_total[1d]))",
                "legendFormat": "Daily Cost"
              }
            ]
          },
          {
            "title": "Resource Utilization vs Cost",
            "targets": [
              {
                "expr": "sum(container_cpu_usage_seconds_total) / sum(container_resource_requests_cpu_seconds_total) * 100",
                "legendFormat": "CPU Utilization %"
              },
              {
                "expr": "sum(container_memory_working_set_bytes) / sum(container_resource_requests_memory_bytes) * 100",
                "legendFormat": "Memory Utilization %"
              }
            ]
          }
        ]
      }
    }

Monthly Cost Reviews

Monthly Cost Review Script:

#!/bin/bash
# scripts/monthly-cost-review.sh

MONTH="${1:-$(date -d '1 month ago' +%Y-%m)}"

echo "💰 Monthly Cost Review: ${MONTH}"
echo "================================"
echo ""

# Total cost
TOTAL_COST=$(az consumption usage list \
  --start-date "${MONTH}-01" \
  --end-date "${MONTH}-$(date -d "${MONTH}-01 +1 month -1 day" +%d)" \
  --query "[].pretaxCost" \
  --output tsv | \
  awk '{sum+=$1} END {printf "%.2f", sum}')

echo "📊 Total Cost: \$${TOTAL_COST}"
echo ""

# Cost by environment
echo "Cost by Environment:"
az consumption usage list \
  --start-date "${MONTH}-01" \
  --end-date "${MONTH}-$(date -d "${MONTH}-01 +1 month -1 day" +%d)" \
  --query "[].{Environment:tags.Environment, Cost:pretaxCost}" \
  --output tsv | \
  awk '{cost[$1]+=$2} END {for (env in cost) printf "  %s: $%.2f\n", env, cost[env]}'

echo ""

# Cost by service
echo "Cost by Service:"
az consumption usage list \
  --start-date "${MONTH}-01" \
  --end-date "${MONTH}-$(date -d "${MONTH}-01 +1 month -1 day" +%d)" \
  --query "[].{Service:tags.Service, Cost:pretaxCost}" \
  --output tsv | \
  awk '{cost[$1]+=$2} END {for (svc in cost) printf "  %s: $%.2f\n", svc, cost[svc]}'

echo ""

# Top 10 resources by cost
echo "Top 10 Resources by Cost:"
az consumption usage list \
  --start-date "${MONTH}-01" \
  --end-date "${MONTH}-$(date -d "${MONTH}-01 +1 month -1 day" +%d)" \
  --query "[].{Resource:instanceName, Cost:pretaxCost}" \
  --output tsv | \
  sort -k2 -nr | head -n 10

Budget Forecasting

Budget Forecast Script:

#!/bin/bash
# scripts/budget-forecast.sh

CURRENT_MONTH=$(date +%Y-%m)
LAST_MONTH=$(date -d '1 month ago' +%Y-%m)

echo "📈 Budget Forecast"
echo "=================="
echo ""

# Get last 3 months of costs
for i in {2..0}; do
  MONTH=$(date -d "${i} months ago" +%Y-%m)
  COST=$(az consumption usage list \
    --start-date "${MONTH}-01" \
    --end-date "${MONTH}-$(date -d "${MONTH}-01 +1 month -1 day" +%d)" \
    --query "[].pretaxCost" \
    --output tsv | \
    awk '{sum+=$1} END {printf "%.2f", sum}')

  echo "${MONTH}: \$${COST}"
done

echo ""

# Forecast next month (simple average)
CURRENT_COST=$(az consumption usage list \
  --start-date "${CURRENT_MONTH}-01" \
  --end-date "$(date +%Y-%m-%d)" \
  --query "[].pretaxCost" \
  --output tsv | \
  awk '{sum+=$1} END {printf "%.2f", sum}')

DAYS_IN_MONTH=$(date -d "$(date +%Y-%m-01) +1 month -1 day" +%d)
DAYS_ELAPSED=$(date +%d)
FORECAST=$(echo "scale=2; ${CURRENT_COST} / ${DAYS_ELAPSED} * ${DAYS_IN_MONTH}" | bc)

echo "📊 Forecast for $(date -d '+1 month' +%Y-%m): \$${FORECAST}"
echo "   Based on current month trend"
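
The forecast is a plain linear extrapolation: month-to-date spend divided by days elapsed, scaled to the full month. With sample numbers (not real costs), and awk standing in for bc:

```shell
#!/bin/bash
# Linear forecast: month-to-date cost / days elapsed * days in month
CURRENT_COST=1500.00
DAYS_ELAPSED=15
DAYS_IN_MONTH=30

FORECAST=$(awk -v c="$CURRENT_COST" -v e="$DAYS_ELAPSED" -v m="$DAYS_IN_MONTH" \
  'BEGIN {printf "%.2f", c / e * m}')

echo "Forecast: \$${FORECAST}"  # 1500 / 15 * 30 = 3000.00
```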

Cost Optimization KPIs

Cost Optimization KPI Dashboard:

// Log Analytics: Cost Optimization KPIs
let CostData = AzureCost
| where TimeGenerated >= ago(30d)
| extend Environment = tostring(Tags.Environment)
| extend Service = tostring(Tags.Service)
| summarize TotalCost = sum(Cost) by Environment, Service, bin(TimeGenerated, 1d);

// KPI 1: Cost per Environment
CostData
| summarize 
    TotalCost = sum(TotalCost),
    AvgDailyCost = avg(TotalCost)
    by Environment
| extend KPI = "Cost per Environment";

// KPI 2: Resource Utilization vs Cost
union (
    Perf
    | where ObjectName == "K8SContainer"
    | where CounterName == "cpuUsageNanoCores"
    | summarize AvgCpu = avg(CounterValue) by Namespace, bin(TimeGenerated, 1d)
),
(
    AzureCost
    | where TimeGenerated >= ago(30d)
    | extend Namespace = tostring(Tags.Namespace)
    | summarize Cost = sum(Cost) by Namespace, bin(TimeGenerated, 1d)
)
| summarize 
    AvgCpu = max(AvgCpu),
    TotalCost = max(Cost)
    by Namespace, bin(TimeGenerated, 1d)
| extend Efficiency = TotalCost / (AvgCpu / 1000000000)
| extend KPI = "Cost Efficiency"
| render timechart

Cost Optimization KPIs:

| KPI | Target | Current | Status |
|-----|--------|---------|--------|
| Cost per Environment | < $5,000/month | $4,200 | ✅ |
| Resource Utilization | > 70% | 65% | ⚠️ |
| Cost per Transaction | < $0.01 | $0.008 | ✅ |
| Waste (Unused Resources) | < 10% | 12% | ⚠️ |
| Reserved Instance Coverage | > 80% | 75% | ⚠️ |
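
The Status column is mechanical: each KPI is a threshold comparison in one direction or the other. A throwaway helper (thresholds taken from the table above; `kpi_status` is not an ATP tool):

```shell
#!/bin/bash
# kpi_status <current> <target> <max|min>
# "max": current must stay below target; "min": current must stay above it
kpi_status() {
  awk -v c="$1" -v t="$2" -v d="$3" \
    'BEGIN { ok = (d == "max") ? (c < t) : (c > t); print (ok ? "OK" : "WARN") }'
}

kpi_status 4200 5000 max   # Cost per Environment -> OK
kpi_status 65 70 min       # Resource Utilization -> WARN
kpi_status 12 10 max       # Waste -> WARN
```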

Summary: Cost Optimization in GitOps

  • AKS Cost Optimization: Node pool sizing (right-sized VMs), spot instances for dev/test (80% discount), reserved instances for production (up to 72% discount), cluster autoscaler configuration
  • Resource Right-Sizing: Analyzing actual resource usage (7-day metrics), adjusting requests and limits, Vertical Pod Autoscaler (VPA), recommendations from Azure Advisor
  • Horizontal Pod Autoscaler (HPA): CPU-based scaling (70% threshold), memory-based scaling, custom metrics with KEDA, scaling policies for cost efficiency (conservative scale-down)
  • Cluster Autoscaler: Adding nodes based on demand, removing idle nodes (50% utilization threshold), scale-down delays and thresholds, node affinity and taints for spot instances
  • Development Environment Auto-Shutdown: Schedule-based scaling to zero (8 PM - 8 AM, weekends), scaling down replicas at night/weekends, wake-up procedures, cost savings calculation (60% savings)
  • Azure Cost Management Integration: Cost tracking per environment, budget alerts (50%, 80%, 100% thresholds), cost anomaly detection, cost optimization recommendations
  • Cost Allocation: Tags per environment/service/tenant, namespace-level cost reporting, chargeback to teams, showback reporting
  • Resource Cleanup Automation: Deleting unused images in ACR (30-day retention), removing old PersistentVolumes, cleaning up completed jobs (24-hour retention), snapshot cleanup (7-day retention)
  • Azure Advisor Recommendations: Reviewing cost recommendations, implementing right-sizing suggestions, SKU optimization
  • FinOps Practices: Cost monitoring dashboards, monthly cost reviews, budget forecasting, cost optimization KPIs (utilization, waste, efficiency)

Networking & Service Mesh

Purpose: Define networking architecture, ingress controller configuration, certificate management, network policies, service mesh selection and implementation, mTLS, traffic management, and multi-cluster networking strategies for ATP's GitOps deployments, ensuring secure, scalable, and observable service-to-service communication across all environments.


AKS Networking Models

kubenet (Basic Networking)

kubenet Networking Overview:

graph TB
    subgraph "AKS Cluster (kubenet)"
        POD1[Pod 1<br/>10.244.0.0/24]
        POD2[Pod 2<br/>10.244.1.0/24]
        KUBENET[kubenet Plugin<br/>Overlay Network]
    end
    subgraph "Azure VNet"
        VNET[VNet<br/>10.0.0.0/16]
        SUBNET[Subnet<br/>10.0.1.0/24]
    end

    POD1 --> KUBENET
    POD2 --> KUBENET
    KUBENET --> SUBNET
    SUBNET --> VNET

    style KUBENET fill:#FFE5B4
    style SUBNET fill:#90EE90

kubenet Characteristics:

| Aspect | kubenet | Description |
|--------|---------|-------------|
| Pod IP Addresses | Overlay network | Pods get IPs from the overlay (10.244.0.0/16) |
| VNet Integration | Limited | Pod IPs not routable from the VNet |
| IP Address Limit | Limited by nodes | ~250 pods per node |
| Network Policies | ✅ Supported | NetworkPolicy resources |
| Azure Integration | ⚠️ Limited | Requires routing tables |
| Complexity | ✅ Simple | Easier to set up |

Azure CNI (Advanced Networking)

Azure CNI Networking Overview:

graph TB
    subgraph "AKS Cluster (Azure CNI)"
        POD1[Pod 1<br/>10.0.1.10]
        POD2[Pod 2<br/>10.0.1.11]
        AZCNI[Azure CNI<br/>Direct VNet Integration]
    end
    subgraph "Azure VNet"
        VNET[VNet<br/>10.0.0.0/16]
        SUBNET[Subnet<br/>10.0.1.0/24]
        ROUTE[Route Tables]
        NSG[Network Security Groups]
    end

    POD1 --> AZCNI
    POD2 --> AZCNI
    AZCNI --> SUBNET
    SUBNET --> VNET
    SUBNET --> ROUTE
    SUBNET --> NSG

    style AZCNI fill:#90EE90
    style SUBNET fill:#87CEEB

Azure CNI Characteristics:

| Aspect | Azure CNI | Description |
|--------|-----------|-------------|
| Pod IP Addresses | VNet IPs | Pods get IPs directly from the VNet subnet |
| VNet Integration | ✅ Full | Pod IPs routable from the VNet |
| IP Address Limit | Limited by subnet size | Large subnet required |
| Network Policies | ✅ Supported | Azure Network Policy or Calico |
| Azure Integration | ✅ Full | Direct integration with Azure services |
| Complexity | ⚠️ Complex | More configuration required |

Comparison and Selection

kubenet vs Azure CNI Comparison:

| Feature | kubenet | Azure CNI | ATP Selection |
|---------|---------|-----------|---------------|
| Pod IP Management | Overlay network | VNet IP addresses | ✅ Azure CNI (VNet integration) |
| VNet Integration | Limited | Full | ✅ Azure CNI (required for ATP) |
| IP Address Limits | ~250 pods/node | Subnet size | ✅ Azure CNI (more IPs) |
| Network Policies | ✅ Supported | ✅ Supported | ✅ Azure CNI |
| Azure Services | ⚠️ Routing required | ✅ Direct access | ✅ Azure CNI |
| Setup Complexity | ✅ Simple | ⚠️ Complex | ✅ Azure CNI (accept complexity) |
| Multi-Tenancy | ⚠️ Limited | ✅ Better isolation | ✅ Azure CNI |

ATP Decision: Azure CNI - Required for multi-tenancy, VNet integration, direct Azure service access, and network isolation per tenant namespace.

Pulumi C# AKS Configuration with Azure CNI:

// infrastructure/AKS.cs
var aksCluster = new ManagedCluster("atp-production-aks", new ManagedClusterArgs
{
    ResourceGroupName = resourceGroup.Name,
    Location = location,
    DnsPrefix = "atp-prod",
    KubernetesVersion = "1.27.3",

    // Azure CNI networking
    NetworkProfile = new ManagedClusterNetworkProfileArgs
    {
        NetworkPlugin = NetworkPlugin.Azure,
        NetworkPolicy = NetworkPolicy.Azure,
        ServiceCidr = "10.2.0.0/16",  // Service CIDR
        DnsServiceIP = "10.2.0.10",
        PodCidr = null,  // Not used with Azure CNI
        LoadBalancerSku = LoadBalancerSku.Standard,
        OutboundType = OutboundType.LoadBalancer,
        LoadBalancerProfile = new ManagedClusterLoadBalancerProfileArgs
        {
            ManagedOutboundIPs = new ManagedClusterLoadBalancerProfileManagedOutboundIPsArgs
            {
                Count = 2
            }
        }
    },

    AgentPoolProfiles = new[]
    {
        new ManagedClusterAgentPoolProfileArgs
        {
            Name = "systempool",
            VmSize = "Standard_D4s_v3",
            Count = 3,
            OsType = "Linux",
            VnetSubnetId = subnet.Id,  // Subnet for pods (large enough)
            MaxPods = 50,
            Mode = AgentPoolMode.System,
            EnableAutoScaling = true,
            MinCount = 3,
            MaxCount = 10
        }
    }
});

Subnet Sizing for Azure CNI:

| Node Count | Pods per Node | Required Subnet Size | Example CIDR |
|------------|---------------|----------------------|--------------|
| 5 nodes | 50 pods | /23 (512 addresses) | 10.0.0.0/23 |
| 20 nodes | 50 pods | /21 (2,048 addresses) | 10.0.0.0/21 |
| 100 nodes | 50 pods | /19 (8,192 addresses) | 10.0.0.0/19 |

Subnet Calculation:

Required IPs = (Node count × Max pods per node) + Node count + 5 (Azure-reserved)
Example: (20 × 50) + 20 + 5 = 1,025 IPs → /21 subnet (2,048 addresses; a /22 provides only 1,024, one short)
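
The sizing rule can be wrapped in a small helper that also picks the smallest prefix whose address count covers the requirement (a sketch; the node counts and pod limits are illustrative):

```shell
#!/bin/bash
# Required IPs = nodes * max pods per node + nodes + 5 (Azure-reserved)
required_ips() { echo $(( $1 * $2 + $1 + 5 )); }

# Smallest /prefix whose size (2^(32-prefix)) covers the requirement
subnet_prefix() {
  local need=$1 prefix=32 size=1
  while [ "$size" -lt "$need" ]; do
    size=$(( size * 2 ))
    prefix=$(( prefix - 1 ))
  done
  echo "/${prefix}"
}

NEED=$(required_ips 20 50)
echo "20 nodes x 50 pods -> ${NEED} IPs -> $(subnet_prefix "$NEED")"  # 1025 IPs -> /21
```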


Ingress Controllers

NGINX Ingress Controller (ATP Choice)

NGINX Ingress Architecture:

graph TB
    subgraph "Internet"
        USER[Users]
    end
    subgraph "Azure Load Balancer"
        LB[Load Balancer<br/>Public IP]
    end
    subgraph "AKS Cluster"
        subgraph "ingress-nginx namespace"
            NGINX1[NGINX Pod 1<br/>Replica 1]
            NGINX2[NGINX Pod 2<br/>Replica 2]
            NGINX_SVC[NGINX Service<br/>LoadBalancer]
        end
        subgraph "Application Namespaces"
            APP1[ATP Ingestion<br/>Service]
            APP2[ATP Query<br/>Service]
            APP3[ATP Gateway<br/>Service]
        end
    end

    USER --> LB
    LB --> NGINX_SVC
    NGINX_SVC --> NGINX1
    NGINX_SVC --> NGINX2
    NGINX1 --> APP1
    NGINX1 --> APP2
    NGINX2 --> APP3
    NGINX2 --> APP1

    style NGINX1 fill:#90EE90
    style NGINX2 fill:#90EE90
    style APP1 fill:#FFE5B4

NGINX Ingress Installation via Helm:

#!/bin/bash
# scripts/install-nginx-ingress.sh

# Add NGINX Ingress Helm repository
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update

# Install NGINX Ingress Controller
helm install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx \
  --create-namespace \
  --set controller.replicaCount=2 \
  --set controller.nodeSelector."kubernetes\.io/os"=linux \
  --set controller.service.type=LoadBalancer \
  --set controller.service.annotations."service\.beta\.kubernetes\.io/azure-load-balancer-health-probe-request-path"=/healthz \
  --set controller.service.externalTrafficPolicy=Local \
  --set controller.resources.requests.cpu=100m \
  --set controller.resources.requests.memory=128Mi \
  --set controller.resources.limits.cpu=500m \
  --set controller.resources.limits.memory=512Mi \
  --set controller.metrics.enabled=true \
  --set controller.podSecurityPolicy.enabled=false

echo "✅ NGINX Ingress Controller installed"
echo "   Waiting for LoadBalancer IP..."
kubectl wait --namespace ingress-nginx \
  --for=condition=ready pod \
  --selector=app.kubernetes.io/component=controller \
  --timeout=300s

# Get LoadBalancer IP
EXTERNAL_IP=$(kubectl get svc ingress-nginx-controller -n ingress-nginx -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo "   External IP: ${EXTERNAL_IP}"

NGINX Ingress via FluxCD:

# platform/ingress-nginx/helmrelease.yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: ingress-nginx
  namespace: ingress-nginx
spec:
  interval: 5m
  chart:
    spec:
      chart: ingress-nginx
      sourceRef:
        kind: HelmRepository
        name: ingress-nginx
      interval: 1h
  values:
    controller:
      replicaCount: 2
      nodeSelector:
        kubernetes.io/os: linux
      service:
        type: LoadBalancer
        annotations:
          service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path: /healthz
        externalTrafficPolicy: Local
      resources:
        requests:
          cpu: 100m
          memory: 128Mi
        limits:
          cpu: 500m
          memory: 512Mi
      metrics:
        enabled: true
        serviceMonitor:
          enabled: true
      podSecurityPolicy:
        enabled: false

Azure Application Gateway Ingress (Alternative)

Azure Application Gateway Ingress Controller (AGIC):

# platform/application-gateway/helmrelease.yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: ingress-appgw
  namespace: ingress-appgw
spec:
  interval: 5m
  chart:
    spec:
      chart: ingress-azure
      sourceRef:
        kind: HelmRepository
        name: application-gateway-kubernetes-ingress
      interval: 1h
  values:
    appgw:
      subscriptionId: ${AZURE_SUBSCRIPTION_ID}
      resourceGroup: atp-production-rg
      name: atp-prod-appgw
      usePrivateIP: false
    armAuth:
      type: aadPodIdentity
      identityResourceID: /subscriptions/${AZURE_SUBSCRIPTION_ID}/resourcegroups/${RESOURCE_GROUP}/providers/Microsoft.ManagedIdentity/userAssignedIdentities/agic-identity
      identityClientID: ${AGIC_IDENTITY_CLIENT_ID}
    rbac:
      enabled: true

AGIC vs NGINX Comparison:

| Feature | NGINX Ingress | Azure Application Gateway | ATP Selection |
|---------|---------------|---------------------------|---------------|
| Cost | ✅ Lower | ⚠️ Higher (dedicated gateway) | ✅ NGINX |
| WAF | ⚠️ External (Cloudflare) | ✅ Built-in | ⚠️ NGINX (accept trade-off) |
| SSL Termination | ✅ Supported | ✅ Supported | ✅ Both |
| Path-based Routing | ✅ Supported | ✅ Supported | ✅ Both |
| Azure Integration | ⚠️ Basic | ✅ Full | ⚠️ NGINX (sufficient) |
| ATP Decision | ✅ Selected | ❌ Not selected | ✅ NGINX |

ATP Decision: NGINX Ingress Controller - Lower cost, sufficient features, simpler management, standard Kubernetes ingress.

Installation and Configuration

NGINX Ingress Configuration:

# platform/ingress-nginx/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
data:
  # Connection settings
  worker-processes: "auto"
  worker-connections: "16384"
  max-worker-open-files: "65535"

  # Timeouts
  proxy-connect-timeout: "60"
  proxy-send-timeout: "60"
  proxy-read-timeout: "60"

  # SSL
  ssl-protocols: "TLSv1.2 TLSv1.3"
  ssl-ciphers: "ECDHE-ECDSA-AES128-GCM-SHA256,ECDHE-RSA-AES128-GCM-SHA256"
  ssl-prefer-server-ciphers: "true"

  # Logging
  log-format-upstream: '$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" $request_length $request_time [$proxy_upstream_name] [$proxy_alternative_upstream_name] $upstream_addr $upstream_response_length $upstream_response_time $upstream_status $req_id'

  # Rate limiting
  enable-brotli: "true"
  use-forwarded-headers: "true"
  compute-full-forwarded-for: "true"

TLS Termination

TLS Termination in NGINX:

# apps/atp-gateway/ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: atp-gateway-ingress
  namespace: atp-production
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
    nginx.ingress.kubernetes.io/proxy-body-size: "10m"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - api.atp.connectsoft.example
    - gateway.atp.connectsoft.example
    secretName: atp-gateway-tls
  rules:
  - host: api.atp.connectsoft.example
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: atp-gateway
            port:
              number: 80

Certificate Management

cert-manager Overview

cert-manager Architecture:

graph TB
    subgraph "Kubernetes Cluster"
        INGRESS[Ingress<br/>TLS Secret]
        CERT_MGR[cert-manager<br/>Controller]
        CERT[cert-manager<br/>Certificate CRD]
        CLUSTER_ISSUER[ClusterIssuer<br/>Let's Encrypt]
    end
    subgraph "Let's Encrypt"
        LE[Let's Encrypt<br/>API]
        CHALLENGE[HTTP-01 Challenge]
    end
    subgraph "DNS"
        TXT[TXT Record<br/>DNS-01 Challenge]
    end

    INGRESS --> CERT
    CERT --> CERT_MGR
    CERT_MGR --> CLUSTER_ISSUER
    CLUSTER_ISSUER --> LE
    LE --> CHALLENGE
    LE --> TXT
    CERT_MGR --> CERT
    CERT --> INGRESS

    style CERT_MGR fill:#90EE90
    style CLUSTER_ISSUER fill:#FFE5B4

cert-manager Installation:

#!/bin/bash
# scripts/install-cert-manager.sh

# CRDs are installed by the Helm chart itself (installCRDs=true below),
# so a separate "kubectl apply" of the CRD manifest is not needed

# Add cert-manager Helm repository
helm repo add jetstack https://charts.jetstack.io
helm repo update

# Install cert-manager
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --version v1.13.0 \
  --set installCRDs=true \
  --set global.leaderElection.namespace=cert-manager \
  --set resources.requests.cpu=100m \
  --set resources.requests.memory=128Mi

echo "✅ cert-manager installed"
kubectl wait --for=condition=ready pod \
  --all -n cert-manager \
  --timeout=300s

cert-manager via FluxCD:

# platform/cert-manager/helmrelease.yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: cert-manager
  namespace: cert-manager
spec:
  interval: 5m
  chart:
    spec:
      chart: cert-manager
      sourceRef:
        kind: HelmRepository
        name: jetstack
      version: v1.13.0
  values:
    installCRDs: true
    global:
      leaderElection:
        namespace: cert-manager
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
    webhook:
      resources:
        requests:
          cpu: 50m
          memory: 64Mi

Let's Encrypt Integration

Let's Encrypt ClusterIssuer (HTTP-01 Challenge):

# platform/cert-manager/clusterissuer-letsencrypt-prod.yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: devops@connectsoft.example
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
    - http01:
        ingress:
          class: nginx
          podTemplate:
            spec:
              nodeSelector:
                kubernetes.io/os: linux

Let's Encrypt ClusterIssuer (DNS-01 Challenge for Wildcard):

# platform/cert-manager/clusterissuer-letsencrypt-dns.yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-dns
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: devops@connectsoft.example
    privateKeySecretRef:
      name: letsencrypt-dns
    solvers:
    - dns01:
        azureDNS:
          clientID: ${AZURE_CLIENT_ID}
          clientSecretSecretRef:
            name: azure-dns-secret
            key: client-secret
          subscriptionID: ${AZURE_SUBSCRIPTION_ID}
          tenantID: ${AZURE_TENANT_ID}
          resourceGroupName: atp-production-rg
          hostedZoneName: connectsoft.example
          environment: AzurePublicCloud

Let's Encrypt Staging ClusterIssuer:

# platform/cert-manager/clusterissuer-letsencrypt-staging.yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: devops@connectsoft.example
    privateKeySecretRef:
      name: letsencrypt-staging
    solvers:
    - http01:
        ingress:
          class: nginx

ClusterIssuer Configuration

ClusterIssuer Configuration Matrix:

| ClusterIssuer | Challenge Type | Use Case | Rate Limits |
|---------------|----------------|----------|-------------|
| letsencrypt-prod | HTTP-01 | Production domains | 50 certs/week/domain |
| letsencrypt-staging | HTTP-01 | Testing | 300 certs/week/domain |
| letsencrypt-dns | DNS-01 | Wildcard certificates | 50 certs/week/domain |

Certificate Resource:

# apps/atp-gateway/certificate.yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: atp-gateway-tls
  namespace: atp-production
spec:
  secretName: atp-gateway-tls
  issuerRef:
    name: letsencrypt-dns  # wildcard entries require a DNS-01 solver
    kind: ClusterIssuer
  commonName: api.atp.connectsoft.example
  dnsNames:
  - api.atp.connectsoft.example
  - gateway.atp.connectsoft.example
  - "*.atp.connectsoft.example"  # wildcards must be quoted in YAML
  duration: 2160h  # 90 days
  renewBefore: 720h  # Renew 30 days before expiration
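
cert-manager durations are hour-denominated; a quick conversion confirms the schedule: a 90-day certificate with `renewBefore: 720h` renews around day 60 (`hours_to_days` is just a local throwaway helper):

```shell
#!/bin/bash
# Convert cert-manager hour-denominated durations to days
hours_to_days() { echo $(( ${1%h} / 24 )); }

echo "duration:    $(hours_to_days 2160h) days"   # 90
echo "renewBefore: $(hours_to_days 720h) days"    # 30
echo "renewal at day $(( $(hours_to_days 2160h) - $(hours_to_days 720h) ))"  # 60
```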

Automatic Certificate Renewal

Certificate Renewal Flow:

sequenceDiagram
    participant Cert as Certificate
    participant CM as cert-manager
    participant LE as Let's Encrypt
    participant NGINX as NGINX Ingress

    Cert->>CM: Certificate expires in 30 days
    CM->>LE: Request renewal
    LE->>CM: Challenge request
    CM->>NGINX: Create challenge ingress
    NGINX->>LE: Serve challenge
    LE->>CM: Validate challenge
    CM->>LE: Get new certificate
    LE->>CM: Issue certificate
    CM->>Cert: Update TLS secret
    Cert->>NGINX: Reload with new cert

Certificate Status Check:

#!/bin/bash
# scripts/check-certificate-status.sh

NAMESPACE="${1:-all}"

echo "🔍 Certificate Status Check"
echo "============================"

if [ "${NAMESPACE}" = "all" ]; then
  CERTIFICATES=$(kubectl get certificates --all-namespaces -o json | \
    jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name)"')
else
  CERTIFICATES=$(kubectl get certificates -n "${NAMESPACE}" -o json | \
    jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name)"')
fi

for CERT in ${CERTIFICATES}; do
  NS=$(echo "${CERT}" | cut -d'/' -f1)
  NAME=$(echo "${CERT}" | cut -d'/' -f2)

  STATUS=$(kubectl get certificate "${NAME}" -n "${NS}" -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
  AGE=$(kubectl get certificate "${NAME}" -n "${NS}" -o jsonpath='{.metadata.creationTimestamp}')
  NOT_AFTER=$(kubectl get certificate "${NAME}" -n "${NS}" -o jsonpath='{.status.notAfter}')

  if [ "${STATUS}" = "True" ]; then
    echo "✅ ${NS}/${NAME}: Ready"
    if [ -n "${NOT_AFTER}" ]; then
      DAYS_UNTIL_EXPIRY=$(( ($(date -d "${NOT_AFTER}" +%s) - $(date +%s)) / 86400 ))
      echo "   Expires in: ${DAYS_UNTIL_EXPIRY} days"
    fi
  else
    echo "❌ ${NS}/${NAME}: Not Ready"
    kubectl describe certificate "${NAME}" -n "${NS}" | grep -A 5 "Status:"
  fi
done

Certificate Monitoring

Certificate Expiration Alert:

# monitoring/alerts/certificate-expiration.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: certificate-expiration
  namespace: monitoring
spec:
  groups:
  - name: certificate
    interval: 1h
    rules:
    - alert: CertificateExpiringSoon
      expr: (certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 < 30
      for: 1h
      labels:
        severity: warning
      annotations:
        summary: "Certificate expiring soon"
        description: "Certificate {{ $labels.name }} in namespace {{ $labels.namespace }} expires in {{ $value }} days"

    - alert: CertificateExpiringVerySoon
      expr: (certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 < 7
      for: 1h
      labels:
        severity: critical
      annotations:
        summary: "Certificate expiring very soon"
        description: "Certificate {{ $labels.name }} in namespace {{ $labels.namespace }} expires in {{ $value }} days"

Network Policies

Default Deny All Policy

Default Deny All Network Policy:

# platform/network-policies/default-deny-all.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: atp-production
spec:
  podSelector: {}  # Match all pods
  policyTypes:
  - Ingress
  - Egress
  # No rules = deny all traffic

Apply Default Deny to All Namespaces:

#!/bin/bash
# scripts/apply-default-deny-policy.sh

NAMESPACES=("atp-production" "atp-staging" "atp-test")

for NS in "${NAMESPACES[@]}"; do
  echo "Applying default deny policy to namespace: ${NS}"

  kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: ${NS}
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
EOF
done

echo "✅ Default deny policies applied"

Service-to-Service Allow Rules

Service-to-Service Communication:

# apps/atp-gateway/network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: atp-gateway-network-policy
  namespace: atp-production
spec:
  podSelector:
    matchLabels:
      app: atp-gateway
  policyTypes:
  - Ingress
  - Egress

  ingress:
  # Allow from the ingress controller (namespaceSelector and podSelector are
  # combined in one "from" entry so both must match; separate entries would OR them)
  - from:
    - namespaceSelector:
        matchLabels:
          name: ingress-nginx
      podSelector:
        matchLabels:
          app.kubernetes.io/name: ingress-nginx
    ports:
    - protocol: TCP
      port: 8080

  # Allow from other ATP services
  - from:
    - podSelector:
        matchLabels:
          app: atp-ingestion
    - podSelector:
        matchLabels:
          app: atp-query
    ports:
    - protocol: TCP
      port: 8080

  egress:
  # Allow to ATP services
  - to:
    - podSelector:
        matchLabels:
          app: atp-ingestion
    - podSelector:
        matchLabels:
          app: atp-query
    ports:
    - protocol: TCP
      port: 8080

  # Allow to external services (database, Redis, etc.)
  - to:
    - ipBlock:
        cidr: 10.0.0.0/16  # Azure VNet
    ports:
    - protocol: TCP
      port: 5432  # PostgreSQL
    - protocol: TCP
      port: 6380  # Redis

Ingress and Egress Rules

Ingress Allow Rules:

# apps/atp-ingestion/network-policy-ingress.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: atp-ingestion-allow-ingress
  namespace: atp-production
spec:
  podSelector:
    matchLabels:
      app: atp-ingestion
  policyTypes:
  - Ingress

  ingress:
  # Allow from gateway
  - from:
    - podSelector:
        matchLabels:
          app: atp-gateway
    ports:
    - protocol: TCP
      port: 8080

  # Allow from query service
  - from:
    - podSelector:
        matchLabels:
          app: atp-query
    ports:
    - protocol: TCP
      port: 8080

Egress Allow Rules:

# apps/atp-ingestion/network-policy-egress.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: atp-ingestion-allow-egress
  namespace: atp-production
spec:
  podSelector:
    matchLabels:
      app: atp-ingestion
  policyTypes:
  - Egress

  egress:
  # Allow DNS
  - to:
    - namespaceSelector:
        matchLabels:
          name: kube-system
    ports:
    - protocol: UDP
      port: 53

  # Allow to database
  - to:
    - ipBlock:
        cidr: 10.0.2.0/24  # Database subnet
    ports:
    - protocol: TCP
      port: 5432

  # Allow to Redis
  - to:
    - ipBlock:
        cidr: 10.0.3.0/24  # Redis subnet
    ports:
    - protocol: TCP
      port: 6380

  # Allow to Azure Service Bus (public endpoint; tighten with the ServiceBus
  # service tag at the NSG level where possible)
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0
    ports:
    - protocol: TCP
      port: 5671  # AMQP over TLS
    - protocol: TCP
      port: 443   # AMQP over WebSockets / HTTPS

DNS Exceptions

DNS Exception in Network Policy:

# platform/network-policies/dns-exception.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: atp-production
spec:
  podSelector: {}
  policyTypes:
  - Egress

  egress:
  # Allow DNS queries to kube-dns only (selectors combined so both must match)
  - to:
    - namespaceSelector:
        matchLabels:
          name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53

Monitoring and Logging Exceptions

Monitoring Exception:

# platform/network-policies/monitoring-exception.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-monitoring
  namespace: atp-production
spec:
  podSelector: {}
  policyTypes:
  - Egress

  egress:
  # Allow to Prometheus (selectors combined so both must match)
  - to:
    - namespaceSelector:
        matchLabels:
          name: monitoring
      podSelector:
        matchLabels:
          app: prometheus
    ports:
    - protocol: TCP
      port: 9090

  # Allow to Grafana (selectors combined so both must match)
  - to:
    - namespaceSelector:
        matchLabels:
          name: monitoring
      podSelector:
        matchLabels:
          app: grafana
    ports:
    - protocol: TCP
      port: 3000

  # Allow to Azure Monitor (Log Analytics)
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0
    ports:
    - protocol: TCP
      port: 443

Service Mesh Options

Linkerd (Lightweight, ATP Preference)

Linkerd Architecture:

graph TB
    subgraph "Service A Pod"
        APP_A[Application]
        PROXY_A[Linkerd Proxy<br/>sidecar]
    end
    subgraph "Service B Pod"
        APP_B[Application]
        PROXY_B[Linkerd Proxy<br/>sidecar]
    end
    subgraph "Linkerd Control Plane"
        DEST[destination]
        IDENTITY[identity]
        PROXY_INJECTOR[proxy-injector]
    end

    APP_A <--> PROXY_A
    APP_B <--> PROXY_B
    PROXY_A <--mTLS--> PROXY_B
    PROXY_A --> DEST
    PROXY_B --> DEST
    PROXY_A --> IDENTITY
    PROXY_B --> IDENTITY

    style PROXY_A fill:#90EE90
    style PROXY_B fill:#90EE90
    style DEST fill:#FFE5B4

Linkerd Installation:

#!/bin/bash
# scripts/install-linkerd.sh

# Install Linkerd CLI
curl -sL https://run.linkerd.io/install-edge | sh
export PATH=$PATH:$HOME/.linkerd2/bin

# Verify installation
linkerd version --client

# Check cluster prerequisites
linkerd check --pre

# Install Linkerd control plane
linkerd install | kubectl apply -f -

# Wait for control plane to be ready
linkerd check

# Install Linkerd Viz (observability)
linkerd viz install | kubectl apply -f -

# Install Linkerd Multicluster (if needed)
# linkerd multicluster install | kubectl apply -f -

echo "✅ Linkerd installed"

Linkerd via FluxCD:

# platform/linkerd/helmrelease.yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: linkerd-control-plane
  namespace: linkerd
spec:
  interval: 5m
  chart:
    spec:
      chart: linkerd-control-plane
      sourceRef:
        kind: HelmRepository
        name: linkerd
      version: 1.14.0
  values:
    identity:
      issuer:
        tls:
          crtPEM: |
            # Issuer certificate PEM
          keyPEM: |
            # Issuer key PEM
    proxyInjector:
      enabled: true
    destination:
      enabled: true

Istio (Feature-Rich, Complex)

Istio vs Linkerd Comparison:

| Feature | Linkerd | Istio | ATP Selection |
|---------|---------|-------|---------------|
| Size | ✅ Lightweight (~50MB) | ⚠️ Heavy (~500MB) | ✅ Linkerd |
| Learning Curve | ✅ Simple | ⚠️ Complex | ✅ Linkerd |
| mTLS | ✅ Automatic | ✅ Automatic | ✅ Linkerd |
| Traffic Management | ✅ Supported | ✅ Rich features | ⚠️ Linkerd (sufficient) |
| Observability | ✅ Built-in | ✅ Built-in | ✅ Linkerd |
| Resource Usage | ✅ Low | ⚠️ High | ✅ Linkerd |
| ATP Decision | ✅ Selected | ❌ Not selected | ✅ Linkerd |

ATP Decision: Linkerd - Lightweight, simple, sufficient features, low resource usage, better fit for ATP's requirements.

Open Service Mesh (Azure-Native)

Open Service Mesh (OSM) Overview:

| Feature | OSM | Linkerd | ATP Selection |
|---------|-----|---------|---------------|
| Azure Integration | ✅ Native | ⚠️ Generic | ⚠️ Linkerd (sufficient) |
| Maturity | ⚠️ Newer | ✅ Mature | ✅ Linkerd |
| Community | ⚠️ Smaller | ✅ Large | ✅ Linkerd |
| Features | ✅ Good | ✅ Good | ✅ Linkerd |

ATP Decision: Linkerd - More mature, larger community, proven in production, sufficient Azure integration.

Comparison and Selection

Service Mesh Selection Matrix:

| Criteria | Weight | Linkerd | Istio | OSM | Winner |
|----------|--------|---------|-------|-----|--------|
| Simplicity | High | 9 | 4 | 7 | ✅ Linkerd |
| Resource Usage | High | 9 | 5 | 7 | ✅ Linkerd |
| Features | Medium | 7 | 9 | 7 | ⚠️ Istio |
| Maturity | High | 9 | 9 | 6 | ✅ Linkerd/Istio |
| ATP Decision | — | ✅ Selected | — | — | Linkerd |

mTLS Between Services

Automatic mTLS with Service Mesh

Linkerd Automatic mTLS:

# Linkerd automatically enables mTLS for all injected services
# No configuration required - works out of the box

# Example: Service with Linkerd proxy injection
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
  namespace: atp-production
spec:
  template:
    metadata:
      annotations:
        linkerd.io/inject: enabled
    spec:
      containers:
      - name: atp-ingestion
        image: connectsoft.azurecr.io/atp/ingestion:v1.2.3
        # Linkerd proxy automatically injected as sidecar

Verify mTLS Status:

# Check success rates and traffic stats for all deployments
linkerd viz stat deploy -n atp-production

# Show which edges between workloads are secured by mTLS
linkerd viz edges deploy -n atp-production

# Tap live traffic (each request is annotated with tls=true/false)
linkerd viz tap deploy/atp-gateway -n atp-production

Certificate Rotation

Linkerd Certificate Rotation:

#!/bin/bash
# scripts/rotate-linkerd-certificates.sh
# Sketch: rotates the Linkerd issuer certificate with the smallstep CLI,
# assuming the trust anchor files ca.crt / ca.key are available locally.

echo "🔄 Rotating Linkerd issuer certificate..."

# Generate a new issuer certificate signed by the existing trust anchor
step certificate create identity.linkerd.cluster.local issuer.crt issuer.key \
  --profile intermediate-ca --not-after 8760h \
  --ca ca.crt --ca-key ca.key --no-password --insecure

# Roll out the new issuer credentials
linkerd upgrade \
  --identity-issuer-certificate-file=issuer.crt \
  --identity-issuer-key-file=issuer.key | kubectl apply -f -

# Verify rotation
linkerd check --proxy

echo "✅ Linkerd issuer certificate rotated"

Automatic Certificate Rotation:

Linkerd automatically rotates the proxies' leaf certificates before expiration (default validity: 24 hours). The trust anchor and issuer certificates are not rotated automatically; rotate them manually as above, or delegate issuance to cert-manager.

# Linkerd issuer credentials (stored in a Secret, not a ConfigMap)
apiVersion: v1
kind: Secret
metadata:
  name: linkerd-identity-issuer
  namespace: linkerd
data:
  crt.pem: <base64-encoded issuer certificate>
  key.pem: <base64-encoded issuer key>

Identity and Authorization

Linkerd Authorization Policy:

# apps/atp-gateway/authorization-policy.yaml
apiVersion: policy.linkerd.io/v1beta1
kind: Server
metadata:
  name: atp-gateway-server
  namespace: atp-production
spec:
  podSelector:
    matchLabels:
      app: atp-gateway
  port: 8080
  proxyProtocol: HTTP/1
---
apiVersion: policy.linkerd.io/v1alpha1
kind: MeshTLSAuthentication
metadata:
  name: atp-mesh-mtls
  namespace: atp-production
spec:
  identities:
  - "*.atp-production.serviceaccount.identity.linkerd.cluster.local"
---
apiVersion: policy.linkerd.io/v1alpha1
kind: AuthorizationPolicy
metadata:
  name: atp-gateway-authz
  namespace: atp-production
spec:
  targetRef:
    group: policy.linkerd.io
    kind: Server
    name: atp-gateway-server
  requiredAuthenticationRefs:  # Only mTLS-verified mesh identities are authorized
  - group: policy.linkerd.io
    kind: MeshTLSAuthentication
    name: atp-mesh-mtls
  # Network-based restrictions (e.g. VNet CIDR 10.0.0.0/16) use a
  # separate NetworkAuthentication resource referenced the same way.

Traffic Management

Canary Routing with Service Mesh

Linkerd TrafficSplit for Canary:

# apps/atp-ingestion/canary-trafficsplit.yaml
apiVersion: split.smi-spec.io/v1alpha1
kind: TrafficSplit
metadata:
  name: atp-ingestion-canary
  namespace: atp-production
spec:
  service: atp-ingestion
  backends:
  - service: atp-ingestion-stable
    weight: 90  # 90% traffic to stable
  - service: atp-ingestion-canary
    weight: 10  # 10% traffic to canary
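
The weight-based routing a TrafficSplit performs can be sketched as a weighted random choice over backends. A simulation (not Linkerd's implementation) using the 90/10 split above:

```python
import random

# Backends and weights mirroring the TrafficSplit manifest above
backends = [("atp-ingestion-stable", 90), ("atp-ingestion-canary", 10)]

def pick_backend(rng, backends):
    """Pick a backend with probability proportional to its weight."""
    total = sum(weight for _, weight in backends)
    point = rng.uniform(0, total)
    for name, weight in backends:
        if point < weight:
            return name
        point -= weight
    return backends[-1][0]  # Guard against floating-point edge cases

rng = random.Random(42)  # Seeded for reproducibility
counts = {"atp-ingestion-stable": 0, "atp-ingestion-canary": 0}
for _ in range(10_000):
    counts[pick_backend(rng, backends)] += 1
print(counts)
```

Over 10,000 simulated requests, roughly 10% land on the canary backend, as the weights dictate.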

Canary Deployment Strategy:

graph TB
    INGRESS[Ingress<br/>100% Traffic]
    TRAFFIC_SPLIT[TrafficSplit<br/>90/10 Split]
    STABLE[Stable Service<br/>90% Traffic]
    CANARY[Canary Service<br/>10% Traffic]

    INGRESS --> TRAFFIC_SPLIT
    TRAFFIC_SPLIT --> STABLE
    TRAFFIC_SPLIT --> CANARY

    style TRAFFIC_SPLIT fill:#FFE5B4
    style STABLE fill:#90EE90
    style CANARY fill:#FFB6C1

Circuit Breakers

Linkerd Circuit Breaking (Failure Accrual):

Linkerd does not express circuit breaking in ServiceProfile resources. Since Linkerd 2.13 it is configured as "failure accrual" via annotations on the destination Service:

# apps/atp-ingestion/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: atp-ingestion
  namespace: atp-production
  annotations:
    balancer.linkerd.io/failure-accrual: "consecutive"  # Eject endpoint after consecutive failures
    balancer.linkerd.io/failure-accrual-consecutive-max-failures: "5"
    balancer.linkerd.io/failure-accrual-consecutive-min-penalty: "30s"  # Initial back-off
    balancer.linkerd.io/failure-accrual-consecutive-max-penalty: "60s"  # Back-off ceiling
spec:
  selector:
    app: atp-ingestion
  ports:
  - port: 8080
    targetPort: 8080

Retry Policies

Linkerd Retry Policy:

# apps/atp-gateway/retry-policy.yaml
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  name: atp-ingestion.atp-production.svc.cluster.local  # Name must be the service FQDN
  namespace: atp-production
spec:
  routes:
  - name: default
    condition:
      method: POST
      pathRegex: "/api/ingestion"
    isRetryable: true  # Retrying a POST assumes the handler is idempotent
    timeout: 30s
  retryBudget:  # Budget applies to the whole profile, not per-route
    retryRatio: 0.2  # Retries may add at most 20% extra load
    minRetriesPerSecond: 10
    ttl: 10s
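
A retry budget caps retry-generated load relative to regular traffic: roughly a floor of minRetriesPerSecond plus retryRatio times the observed request rate. Illustrative arithmetic (not Linkerd's internal algorithm):

```python
def allowed_retries_per_second(request_rate, retry_ratio=0.2,
                               min_retries_per_second=10):
    """Approximate retry allowance under a retry budget.

    The budget permits retries worth retry_ratio of the regular
    request rate, plus a floor of min_retries_per_second so that
    low-traffic services can still retry.
    """
    return min_retries_per_second + retry_ratio * request_rate

# At 500 rps, the budget above allows roughly 110 retries/second
print(allowed_retries_per_second(500))
```

The floor matters for quiet services: at zero regular traffic the budget still permits 10 retries per second.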

Timeout Configuration

Linkerd Timeout Policy:

# apps/atp-gateway/timeout-policy.yaml
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  name: atp-query.atp-production.svc.cluster.local  # Name must be the service FQDN
  namespace: atp-production
spec:
  routes:
  - name: query-route
    condition:
      method: GET
      pathRegex: "/api/query/.*"
    timeout: 5s  # 5 second timeout
  - name: export-route
    condition:
      method: GET
      pathRegex: "/api/export/.*"
    timeout: 60s  # 60 second timeout for exports

Observability with Service Mesh

Distributed Tracing

Linkerd Distributed Tracing:

Recent Linkerd versions enable tracing through the linkerd-jaeger extension (`linkerd jaeger install | kubectl apply -f -`). The ConfigMap below is a sketch of the equivalent proxy tracing settings:

# platform/linkerd/tracing-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: linkerd-config
  namespace: linkerd
data:
  config.yaml: |
    tracing:
      enabled: true
      collectorSvcAddr: "jaeger-collector.monitoring:14268"
      collectorSvcAccount: "linkerd-collector"

Linkerd + Jaeger Integration:

# platform/linkerd/jaeger-integration.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger-collector
  namespace: monitoring
spec:
  selector:  # Deployments require a selector matching the pod labels
    matchLabels:
      app: jaeger-collector
  template:
    metadata:
      labels:
        app: jaeger-collector
    spec:
      containers:
      - name: jaeger-collector
        image: jaegertracing/jaeger-collector:1.54  # Pin a version instead of :latest
        env:
        - name: SPAN_STORAGE_TYPE
          value: "elasticsearch"
        - name: ES_SERVER_URLS
          value: "http://elasticsearch.monitoring:9200"

Metrics and Dashboards

Linkerd Metrics:

# View service metrics
linkerd viz stat deploy -n atp-production

# View top services
linkerd viz top deploy -n atp-production

# View per-route metrics (requires a ServiceProfile)
linkerd viz routes deploy/atp-ingestion -n atp-production

Linkerd Grafana Dashboard:

# platform/linkerd/grafana-dashboard.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: linkerd-dashboard
  namespace: monitoring
data:
  linkerd-dashboard.json: |
    {
      "dashboard": {
        "title": "Linkerd Service Mesh",
        "panels": [
          {
            "title": "Request Rate",
            "targets": [
              {
                "expr": "sum(rate(linkerd_proxy_http_requests_total{deployment=\"$deployment\"}[1m]))",
                "legendFormat": "{{deployment}}"
              }
            ]
          },
          {
            "title": "P50 Latency",
            "targets": [
              {
                "expr": "histogram_quantile(0.5, sum(rate(linkerd_proxy_http_request_duration_seconds_bucket{deployment=\"$deployment\"}[1m])) by (le, deployment))",
                "legendFormat": "{{deployment}}"
              }
            ]
          }
        ]
      }
    }

Service Topology Visualization

Linkerd Viz (Topology View):

# Open Linkerd Viz dashboard
linkerd viz dashboard

# View service topology
linkerd viz edges deploy -n atp-production

# Tap live traffic
linkerd viz tap deploy/atp-gateway -n atp-production

Service Mesh GitOps Integration

Mesh Configuration in Git

Linkerd Configuration in GitOps:

atp-gitops/
├── platform/
│   ├── linkerd/
│   │   ├── kustomization.yaml
│   │   ├── control-plane.yaml
│   │   ├── service-profiles/
│   │   │   ├── atp-gateway.yaml
│   │   │   ├── atp-ingestion.yaml
│   │   │   └── atp-query.yaml
│   │   ├── authorization-policies/
│   │   │   └── default-policy.yaml
│   │   └── trafficsplits/
│   │       └── canary-split.yaml

Linkerd Kustomization:

# platform/linkerd/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - control-plane.yaml
  - service-profiles/
  - authorization-policies/
  - trafficsplits/

commonLabels:
  managed-by: kustomize

TrafficSplit Resources

TrafficSplit in GitOps:

# apps/atp-ingestion/overlays/production/trafficsplit.yaml
apiVersion: split.smi-spec.io/v1alpha1
kind: TrafficSplit
metadata:
  name: atp-ingestion-split
  namespace: atp-production
spec:
  service: atp-ingestion
  backends:
  - service: atp-ingestion-v1
    weight: 90
  - service: atp-ingestion-v2
    weight: 10

FluxCD Kustomization for TrafficSplit:

# clusters/production/kustomizations/apps-production.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-production
  namespace: flux-system
spec:
  interval: 5m
  path: ./apps/atp-ingestion/overlays/production
  sourceRef:
    kind: GitRepository
    name: atp-gitops-production
  # TrafficSplit resources included in path

SMI (Service Mesh Interface)

SMI Resources Supported by Linkerd:

| SMI Resource | Linkerd Support | Use Case |
|---|---|---|
| TrafficSplit | ✅ Supported | Canary deployments |
| TrafficTarget | ✅ Supported | Access control |
| HTTPRouteGroup | ✅ Supported | HTTP routing rules |
| TCPRoute | ⚠️ Limited | TCP routing |

SMI TrafficTarget Example:

# apps/atp-gateway/smi-traffic-target.yaml
apiVersion: access.smi-spec.io/v1alpha3
kind: TrafficTarget
metadata:
  name: atp-gateway-to-ingestion
  namespace: atp-production
spec:
  destination:
    kind: ServiceAccount
    name: atp-ingestion
    namespace: atp-production
  sources:
  - kind: ServiceAccount
    name: atp-gateway
    namespace: atp-production
  rules:
  - kind: HTTPRouteGroup
    name: atp-ingestion-routes
    matches:
    - ingestion-api
---
apiVersion: specs.smi-spec.io/v1alpha4
kind: HTTPRouteGroup
metadata:
  name: atp-ingestion-routes
  namespace: atp-production
spec:
  matches:
  - name: ingestion-api
    methods:
    - GET
    - POST
    pathRegex: "/api/ingestion/.*"
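
The TrafficTarget above authorizes only requests matching the HTTPRouteGroup. The match semantics (method listed, full path matching the regex) can be sketched as:

```python
import re

# Route definition mirroring the atp-ingestion-routes resource above
route = {
    "name": "ingestion-api",
    "methods": ["GET", "POST"],
    "pathRegex": r"/api/ingestion/.*",
}

def request_matches(route, method, path):
    # A request matches when its method is listed and the entire
    # path matches the route's regex (fullmatch, not search).
    return method in route["methods"] and re.fullmatch(route["pathRegex"], path) is not None

print(request_matches(route, "POST", "/api/ingestion/events"))    # matches
print(request_matches(route, "DELETE", "/api/ingestion/events"))  # method not listed
```

Requests failing the match (wrong method or path) fall outside the TrafficTarget and are denied under a default-deny policy.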

Multi-Cluster Networking

VNet Peering Between Environments

VNet Peering Configuration:

// infrastructure/VNetPeering.cs
using Pulumi;
using Pulumi.AzureNative.Network;

public class VNetPeering
{
    public static VirtualNetworkPeering CreatePeering(
        VirtualNetwork sourceVNet,
        VirtualNetwork targetVNet,
        ResourceGroup resourceGroup,
        string peeringName)
    {
        return new VirtualNetworkPeering($"peering-{peeringName}", new VirtualNetworkPeeringArgs
        {
            ResourceGroupName = resourceGroup.Name,
            VirtualNetworkName = sourceVNet.Name,
            RemoteVirtualNetworkId = targetVNet.Id,
            AllowVirtualNetworkAccess = true,
            AllowForwardedTraffic = true,
            AllowGatewayTransit = false,
            UseRemoteGateways = false
        });
    }
}

VNet Peering Between Production and Staging:

#!/bin/bash
# scripts/create-vnet-peering.sh

SOURCE_RG="${1}"
SOURCE_VNET="${2}"
TARGET_RG="${3}"
TARGET_VNET="${4}"

echo "🔗 Creating VNet peering: ${SOURCE_VNET} <-> ${TARGET_VNET}"

# Get VNet IDs
SOURCE_VNET_ID=$(az network vnet show \
  --resource-group "${SOURCE_RG}" \
  --name "${SOURCE_VNET}" \
  --query id -o tsv)

TARGET_VNET_ID=$(az network vnet show \
  --resource-group "${TARGET_RG}" \
  --name "${TARGET_VNET}" \
  --query id -o tsv)

# Create peering from source to target
az network vnet peering create \
  --resource-group "${SOURCE_RG}" \
  --name "${SOURCE_VNET}-to-${TARGET_VNET}" \
  --vnet-name "${SOURCE_VNET}" \
  --remote-vnet "${TARGET_VNET_ID}" \
  --allow-vnet-access \
  --allow-forwarded-traffic

# Create peering from target to source
az network vnet peering create \
  --resource-group "${TARGET_RG}" \
  --name "${TARGET_VNET}-to-${SOURCE_VNET}" \
  --vnet-name "${TARGET_VNET}" \
  --remote-vnet "${SOURCE_VNET_ID}" \
  --allow-vnet-access \
  --allow-forwarded-traffic

echo "✅ VNet peering created"

Azure Virtual WAN

Virtual WAN Architecture:

graph TB
    subgraph "Virtual WAN Hub"
        VWAN[Azure Virtual WAN<br/>Hub]
    end
    subgraph "Production VNet"
        PROD_VNET[Production VNet<br/>10.0.0.0/16]
        PROD_AKS[Production AKS]
    end
    subgraph "Staging VNet"
        STAGE_VNET[Staging VNet<br/>10.1.0.0/16]
        STAGE_AKS[Staging AKS]
    end
    subgraph "On-Premises"
        ONPREM[On-Premises<br/>Network]
    end

    PROD_VNET --> VWAN
    STAGE_VNET --> VWAN
    ONPREM --> VWAN
    VWAN --> PROD_VNET
    VWAN --> STAGE_VNET
    VWAN --> ONPREM

    style VWAN fill:#90EE90

Virtual WAN Configuration (Pulumi C# concept):

// infrastructure/VirtualWAN.cs
var virtualWan = new VirtualWan("atp-vwan", new VirtualWanArgs
{
    ResourceGroupName = resourceGroup.Name,
    Location = location,
    Type = "Standard",
    AllowBranchToBranchTraffic = true,
    AllowVnetToVnetTraffic = true
});

var virtualHub = new VirtualHub("atp-vhub", new VirtualHubArgs
{
    ResourceGroupName = resourceGroup.Name,
    Location = location,
    VirtualWanId = virtualWan.Id,
    AddressPrefix = "10.100.0.0/24"
});

Cross-Cluster Service Discovery

Linkerd Multi-Cluster Service Discovery:

#!/bin/bash
# scripts/setup-linkerd-multicluster.sh

# Install the Linkerd multicluster extension on both clusters
linkerd --context=production multicluster install | kubectl --context=production apply -f -
linkerd --context=staging multicluster install | kubectl --context=staging apply -f -

# Generate Link credentials from the staging cluster and apply them
# to the production cluster, which will mirror staging's services
linkerd --context=staging multicluster link --cluster-name staging | \
  kubectl --context=production apply -f -

# Verify multicluster status
linkerd --context=production multicluster check

echo "✅ Multi-cluster service discovery configured"

Service Export/Import (Kubernetes Multi-Cluster Services):

# apps/atp-gateway/service-export.yaml
apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ServiceExport
metadata:
  name: atp-gateway
  namespace: atp-production
spec: {}

---
# In staging cluster: Service Import
apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ServiceImport
metadata:
  name: atp-gateway-production
  namespace: atp-staging
spec:
  type: ClusterSetIP
  ports:
  - port: 8080
    protocol: TCP

Multi-Cluster Mesh

Linkerd Multi-Cluster Mesh:

graph TB
    subgraph "Production Cluster"
        PROD_CTRL[Linkerd Control Plane]
        PROD_SVC[ATP Services]
    end
    subgraph "Staging Cluster"
        STAGE_CTRL[Linkerd Control Plane]
        STAGE_SVC[ATP Services]
    end
    subgraph "Linkerd Multicluster"
        GATEWAY[Service Mirror<br/>Gateway]
    end

    PROD_CTRL <--> GATEWAY
    STAGE_CTRL <--> GATEWAY
    PROD_SVC <--mTLS--> STAGE_SVC

    style GATEWAY fill:#FFE5B4

Multi-Cluster Mesh Configuration:

# platform/linkerd/multicluster/link-staging.yaml
# A Link resource is normally generated by `linkerd multicluster link`
# rather than hand-authored; the fields below are a sketch.
apiVersion: multicluster.linkerd.io/v1alpha1
kind: Link
metadata:
  name: staging
  namespace: linkerd-multicluster
spec:
  targetClusterName: staging
  targetClusterDomain: cluster.local
  clusterCredentialsSecret: cluster-credentials-staging
  gatewayAddress: <staging-gateway-lb-address>
  gatewayPort: "4143"
  gatewayIdentity: linkerd-gateway.linkerd-multicluster.serviceaccount.identity.linkerd.cluster.local

Summary: Networking & Service Mesh

  • AKS Networking Models: Azure CNI selected (VNet integration, multi-tenancy), kubenet comparison, subnet sizing for Azure CNI
  • Ingress Controllers: NGINX Ingress Controller (ATP choice), Azure Application Gateway comparison, installation and configuration, TLS termination
  • Certificate Management: cert-manager overview, Let's Encrypt integration (HTTP-01, DNS-01), ClusterIssuer configuration, automatic certificate renewal, certificate monitoring
  • Network Policies: Default deny all policy, service-to-service allow rules, ingress and egress rules, DNS exceptions, monitoring and logging exceptions
  • Service Mesh Options: Linkerd selected (lightweight, ATP preference), Istio comparison, Open Service Mesh comparison, selection matrix
  • mTLS Between Services: Automatic mTLS with Linkerd, certificate rotation, identity and authorization policies
  • Traffic Management: Canary routing with TrafficSplit, circuit breakers, retry policies, timeout configuration
  • Observability with Service Mesh: Distributed tracing (Jaeger), metrics and dashboards (Linkerd Viz), service topology visualization
  • Service Mesh GitOps Integration: Mesh configuration in Git, TrafficSplit resources, SMI (Service Mesh Interface) support
  • Multi-Cluster Networking: VNet peering between environments, Azure Virtual WAN, cross-cluster service discovery, multi-cluster mesh with Linkerd

Storage & StatefulSets in GitOps

Purpose: Define storage architecture, PersistentVolumes and PersistentVolumeClaims, StatefulSet deployment patterns, database deployments, backup and restore procedures, volume snapshots, data migration strategies, and disaster recovery for persistent data in ATP's GitOps deployments, ensuring reliable, scalable, and recoverable stateful workloads.


Persistent Volumes (PV) and Claims (PVC)

PV and PVC Concepts

Persistent Volume (PV) vs Persistent Volume Claim (PVC):

graph TB
    subgraph "Storage Provider"
        AZDISK[Azure Disk<br/>or Azure Files]
    end
    subgraph "Kubernetes Cluster"
        PV[PersistentVolume<br/>Cluster Resource]
        PVC[PersistentVolumeClaim<br/>Namespace Resource]
        POD[Pod<br/>Application]
    end

    AZDISK --> PV
    PVC --> PV
    POD --> PVC

    style PV fill:#FFE5B4
    style PVC fill:#90EE90
    style POD fill:#87CEEB

PV and PVC Relationship:

| Resource | Scope | Purpose | Managed By |
|---|---|---|---|
| PersistentVolume (PV) | Cluster-wide | Represents actual storage | Admin/Storage Provisioner |
| PersistentVolumeClaim (PVC) | Namespace | Request for storage | Developer/Application |
| StorageClass | Cluster-wide | Defines storage provisioner | Admin |

PVC Lifecycle:

  1. Create PVC → Kubernetes matches an available PV or dynamically provisions one
  2. Bind → PVC bound to PV
  3. Use → Pod mounts PVC
  4. Release → PVC deleted; PV enters the Released state
  5. Reclaim → PV handled according to its reclaim policy (Retain or Delete)

Dynamic Provisioning

Dynamic Provisioning Flow:

sequenceDiagram
    participant Dev as Developer
    participant K8s as Kubernetes API
    participant SC as StorageClass
    participant Prov as Provisioner
    participant Azure as Azure Disk/Files
    participant Pod as Pod

    Dev->>K8s: Create PVC
    K8s->>SC: Match StorageClass
    SC->>Prov: Provision volume
    Prov->>Azure: Create Azure Disk/File
    Azure-->>Prov: Volume created
    Prov->>K8s: Create PV
    K8s->>K8s: Bind PVC to PV
    Dev->>K8s: Create Pod with PVC
    K8s->>Pod: Mount volume

Dynamic Provisioning Example:

# apps/atp-ingestion/pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: atp-ingestion-data
  namespace: atp-production
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: managed-premium  # Triggers dynamic provisioning
  resources:
    requests:
      storage: 100Gi

Static vs Dynamic Provisioning:

| Provisioning Type | Use Case | ATP Preference |
|---|---|---|
| Static | Pre-created PVs, manual management | ❌ Not used |
| Dynamic | On-demand PV creation via StorageClass | ✅ Preferred |

ATP Decision: Use dynamic provisioning for all workloads - simpler, scalable, automated.

Storage Classes

StorageClass Definition:

# platform/storage/storageclass-premium.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-premium
  annotations:
    storageclass.kubernetes.io/is-default-class: "false"
provisioner: disk.csi.azure.com
parameters:
  skuname: Premium_LRS  # Premium SSD
  kind: managed
  cachingMode: ReadOnly
  diskEncryptionSetID: /subscriptions/${SUBSCRIPTION_ID}/resourceGroups/${RG}/providers/Microsoft.Compute/diskEncryptionSets/atp-disk-encryption
allowVolumeExpansion: true
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer  # Wait until pod is scheduled

StorageClass Options:

# Standard HDD
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-standard
provisioner: disk.csi.azure.com
parameters:
  skuname: Standard_LRS  # Standard HDD
  kind: managed
reclaimPolicy: Delete
volumeBindingMode: Immediate

---
# Premium SSD
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-premium
provisioner: disk.csi.azure.com
parameters:
  skuname: Premium_LRS  # Premium SSD
  kind: managed
reclaimPolicy: Retain

---
# Azure Files (SMB)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azurefile-csi
provisioner: file.csi.azure.com
parameters:
  skuname: Premium_LRS  # Premium Files
  storageAccount: atpstorageaccount  # Optional: specific storage account
reclaimPolicy: Delete
allowVolumeExpansion: true

---
# Azure Files (NFS)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azurefile-csi-nfs
provisioner: file.csi.azure.com
parameters:
  protocol: nfs
  skuname: Premium_LRS
reclaimPolicy: Delete

Access Modes

PVC Access Modes:

| Access Mode | Description | Use Case | Supported By |
|---|---|---|---|
| ReadWriteOnce (RWO) | Single node read-write | Single pod, databases | Azure Disk |
| ReadOnlyMany (ROX) | Multiple nodes read-only | Shared config, read-only data | Azure Files |
| ReadWriteMany (RWX) | Multiple nodes read-write | Shared storage, file shares | Azure Files |
| ReadWriteOncePod (RWOP) | Single pod read-write | Strict single-pod access (Kubernetes 1.22+) | Azure Disk |

Access Mode Selection:

# Single pod (database)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
spec:
  accessModes:
  - ReadWriteOnce  # Single pod mount
  storageClassName: managed-premium
  resources:
    requests:
      storage: 500Gi

---
# Multiple pods (shared files)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-storage
spec:
  accessModes:
  - ReadWriteMany  # Multiple pods can mount
  storageClassName: azurefile-csi
  resources:
    requests:
      storage: 100Gi

Azure Disk vs Azure Files

Azure Disk (Block Storage, Single Mount)

Azure Disk Characteristics:

| Aspect | Azure Disk | Description |
|---|---|---|
| Type | Block storage | Direct-attached disk |
| Mount | Single pod | RWO (ReadWriteOnce) |
| Performance | ✅ High IOPS | Up to 20,000 IOPS (Premium SSD) |
| Latency | ✅ Low latency | < 1ms |
| Use Case | Databases, single-pod apps | PostgreSQL, MongoDB, Redis |

Azure Disk StorageClass:

# platform/storage/storageclass-premium-disk.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-premium
  annotations:
    storageclass.kubernetes.io/is-default-class: "false"
provisioner: disk.csi.azure.com
parameters:
  skuname: Premium_LRS
  kind: managed
  cachingMode: ReadOnly  # Optimize for database workloads
  diskEncryptionSetID: ${DISK_ENCRYPTION_SET_ID}
allowVolumeExpansion: true
reclaimPolicy: Retain  # Keep data on PVC deletion
volumeBindingMode: WaitForFirstConsumer  # Zone-aware scheduling

Azure Files (Shared Storage, Multi-Mount)

Azure Files Characteristics:

| Aspect | Azure Files | Description |
|---|---|---|
| Type | File storage | Network file share |
| Mount | Multiple pods | RWX (ReadWriteMany) |
| Protocol | SMB or NFS | Protocol selection |
| Performance | ⚠️ Lower per-client IOPS | Up to 100,000 IOPS per share (Premium) |
| Latency | ⚠️ Higher latency | Network latency |
| Use Case | Shared files, config | Content storage, logs |

Azure Files StorageClass:

# platform/storage/storageclass-premium-files.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azurefile-premium
provisioner: file.csi.azure.com
parameters:
  skuname: Premium_LRS
  protocol: smb  # or "nfs"
  # Optional: specific storage account
  # storageAccount: atpstorageaccount
reclaimPolicy: Delete
allowVolumeExpansion: true

Performance Characteristics

Performance Comparison:

| Storage Type | SKU | IOPS | Throughput | Latency | ATP Use Case |
|---|---|---|---|---|---|
| Azure Disk (Premium SSD) | Premium_LRS | 20,000 | 900 MB/s | < 1ms | ✅ Databases |
| Azure Disk (Standard SSD) | StandardSSD_LRS | 6,000 | 750 MB/s | < 5ms | ⚠️ Dev/Test |
| Azure Files (Premium) | Premium_LRS | 100,000 | 10,240 MiB/s | < 10ms | ✅ Shared storage |
| Azure Files (Standard) | Standard_LRS | 1,000 | 60 MiB/s | < 20ms | ⚠️ Dev/Test |

ATP Performance Requirements:

  • Database workloads: Premium SSD (Azure Disk) - High IOPS, low latency
  • Shared files: Premium Files (Azure Files) - Multiple mounts, good performance
  • Dev/Test: Standard SSD (Azure Disk) - Cost-effective

Cost Comparison

Storage Cost Comparison (per GB/month):

| Storage Type | SKU | Cost (East US) | Use Case |
|---|---|---|---|
| Azure Disk Premium SSD | Premium_LRS | $0.17/GB | Production databases |
| Azure Disk Standard SSD | StandardSSD_LRS | $0.06/GB | Dev/Test databases |
| Azure Files Premium | Premium_LRS | $0.19/GB | Production file shares |
| Azure Files Standard | Standard_LRS | $0.06/GB | Dev/Test file shares |

Cost Optimization Strategy:

  • Production databases: Premium SSD (required for performance)
  • Dev/Test databases: Standard SSD (cost savings)
  • Shared storage: Premium Files for production, Standard for dev/test

Use Case Selection

Storage Selection Matrix:

| Use Case | Recommended Storage | Access Mode | Rationale |
|---|---|---|---|
| PostgreSQL | Azure Disk Premium | RWO | High IOPS, single pod |
| MongoDB | Azure Disk Premium | RWO | High IOPS, single pod |
| Redis | Azure Disk Premium | RWO | Low latency, single pod |
| Shared Logs | Azure Files Premium | RWX | Multiple pods, shared access |
| Config Files | Azure Files Standard | RWX | Low cost, shared access |
| Backups | Azure Files Premium | RWX | Multiple pods, shared access |

ATP Decision Matrix:

| Component | Storage Type | StorageClass | Size |
|---|---|---|---|
| PostgreSQL | Azure Disk | managed-premium | 500Gi |
| MongoDB | Azure Disk | managed-premium | 1Ti |
| Redis | Azure Disk | managed-premium | 100Gi |
| Shared Logs | Azure Files | azurefile-premium | 500Gi |
| Backups | Azure Files | azurefile-premium | 2Ti |
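
Combining the decision matrix with the per-GB prices from the cost comparison gives a rough monthly storage bill. A sketch (treating 1Ti as 1024Gi and Gi ≈ GB; prices are the illustrative East US figures above):

```python
# Illustrative $/GB/month figures from the cost comparison table
PRICE_PER_GB = {"managed-premium": 0.17, "azurefile-premium": 0.19}

# ATP components from the decision matrix (sizes converted to Gi)
components = [
    ("PostgreSQL",  "managed-premium",   500),
    ("MongoDB",     "managed-premium",   1024),  # 1Ti
    ("Redis",       "managed-premium",   100),
    ("Shared Logs", "azurefile-premium", 500),
    ("Backups",     "azurefile-premium", 2048),  # 2Ti
]

def monthly_cost(components):
    return sum(PRICE_PER_GB[storage_class] * size_gb
               for _, storage_class, size_gb in components)

total = monthly_cost(components)
print(f"${total:.2f}/month")
```

This is a capacity-only estimate; actual Azure billing adds transactions, snapshots, and redundancy options.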

Storage Classes

Performance Tiers (Standard, Premium, Ultra)

Storage Class Performance Tiers:

# Standard HDD (Lowest cost, lowest performance)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-standard
provisioner: disk.csi.azure.com
parameters:
  skuname: Standard_LRS
  kind: managed
reclaimPolicy: Delete
volumeBindingMode: Immediate

---
# Standard SSD (Balanced)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-standard-ssd
provisioner: disk.csi.azure.com
parameters:
  skuname: StandardSSD_LRS
  kind: managed
reclaimPolicy: Delete
volumeBindingMode: Immediate

---
# Premium SSD (High performance)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-premium
provisioner: disk.csi.azure.com
parameters:
  skuname: Premium_LRS
  kind: managed
  cachingMode: ReadOnly
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer

---
# Ultra SSD (Highest performance)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-ultra
provisioner: disk.csi.azure.com
parameters:
  skuname: UltraSSD_LRS
  kind: managed
  cachingMode: None  # Ultra SSD doesn't support caching
  diskIopsReadWrite: "5000"  # IOPS limit
  diskMbpsReadWrite: "200"  # Throughput limit (MB/s)
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer

Performance Tier Comparison:

| Tier | SKU | IOPS | Throughput | Latency | Cost | ATP Use Case |
|---|---|---|---|---|---|---|
| Standard HDD | Standard_LRS | 500 | 60 MB/s | Variable | $0.04/GB | ❌ Not used |
| Standard SSD | StandardSSD_LRS | 6,000 | 750 MB/s | < 5ms | $0.06/GB | ✅ Dev/Test |
| Premium SSD | Premium_LRS | 20,000 | 900 MB/s | < 1ms | $0.17/GB | ✅ Production |
| Ultra SSD | UltraSSD_LRS | 160,000 | 2,000 MB/s | < 0.5ms | $0.24/GB | ⚠️ High-performance only |

ATP Decision: Use Premium SSD for production databases, Standard SSD for dev/test.
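
The tier decision can be expressed as a small selector. The 20,000 IOPS threshold comes from the Premium SSD ceiling in the comparison table; everything else is an illustrative assumption:

```python
# Map a workload profile to one of the StorageClasses defined above.
def select_disk_tier(iops_needed, production=True):
    if iops_needed > 20_000:
        return "managed-ultra"        # Beyond Premium SSD's ceiling
    if production:
        return "managed-premium"      # Premium SSD for production
    return "managed-standard-ssd"     # Standard SSD for dev/test

print(select_disk_tier(5_000, production=True))
print(select_disk_tier(5_000, production=False))
print(select_disk_tier(50_000))
```

Encoding the rule keeps environment overlays consistent: the same workload gets Premium SSD in production and Standard SSD in dev/test without ad-hoc choices.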

Encryption Configuration

Encryption at Rest with Disk Encryption Set:

# platform/storage/storageclass-encrypted.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-premium-encrypted
provisioner: disk.csi.azure.com
parameters:
  skuname: Premium_LRS
  kind: managed
  diskEncryptionSetID: /subscriptions/${SUBSCRIPTION_ID}/resourceGroups/${RG}/providers/Microsoft.Compute/diskEncryptionSets/atp-disk-encryption
  cachingMode: ReadOnly
allowVolumeExpansion: true
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer

Pulumi C# Disk Encryption Set:

// infrastructure/DiskEncryption.cs
var diskEncryptionSet = new DiskEncryptionSet("atp-disk-encryption", new DiskEncryptionSetArgs
{
    ResourceGroupName = resourceGroup.Name,
    Location = location,
    Identity = new EncryptionSetIdentityArgs
    {
        Type = "SystemAssigned"
    },
    ActiveKey = new KeyVaultAndKeyReferenceArgs
    {
        KeyUrl = keyVaultKey.Uri,
        SourceVault = new SourceVaultArgs
        {
            Id = keyVault.Id
        }
    },
    EncryptionType = "EncryptionAtRestWithCustomerKey"
});

// Grant Key Vault access to the Disk Encryption Set's managed identity
// (attach these args to the Key Vault's AccessPolicies; shown standalone for brevity)
var keyVaultAccessPolicy = new KeyVaultAccessPolicyArgs
{
    TenantId = tenantId,
    ObjectId = diskEncryptionSet.Identity.PrincipalId,
    Permissions = new KeyVaultPermissionsArgs
    {
        Keys = new[] { "Get", "WrapKey", "UnwrapKey" }
    }
};

Snapshot Support

Volume Snapshot Class:

# platform/storage/volumesnapshotclass.yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: azure-disk-snapshot
driver: disk.csi.azure.com
deletionPolicy: Retain  # or Delete
parameters:
  incremental: "true"  # Incremental snapshots (cost-effective)
  resourceGroup: atp-production-rg  # Resource group where snapshots are created

Create Volume Snapshot:

# apps/atp-ingestion/volumesnapshot.yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-data-snapshot-20240115
  namespace: atp-production
spec:
  volumeSnapshotClassName: azure-disk-snapshot
  source:
    persistentVolumeClaimName: postgres-data

Provisioner Settings

Azure Disk CSI Driver Provisioner Settings:

# platform/storage/storageclass-advanced.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-premium-advanced
provisioner: disk.csi.azure.com
parameters:
  skuname: Premium_LRS
  kind: managed
  cachingMode: ReadOnly  # ReadOnly, ReadWrite, None
  diskEncryptionSetID: ${DISK_ENCRYPTION_SET_ID}
  diskIOPSReadWrite: "5000"  # Optional: IOPS limit
  diskMBpsReadWrite: "200"  # Optional: Throughput limit
  networkAccessPolicy: "DenyAll"  # DenyAll, AllowPrivate, AllowAll
  publicNetworkAccess: "Disabled"
allowVolumeExpansion: true
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer  # Zone-aware

Volume Binding Modes:

| Binding Mode | Description | Use Case | ATP Selection |
|---|---|---|---|
| Immediate | Bind immediately | Static provisioning | ❌ Not used |
| WaitForFirstConsumer | Bind when pod scheduled | Zone-aware, topology | ✅ Preferred |

ATP Recommendation: Use WaitForFirstConsumer for zone-aware scheduling and topology constraints.


StatefulSet Deployment Patterns

Ordered Deployment and Scaling

StatefulSet Ordered Deployment:

# apps/postgresql/statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgresql
  namespace: atp-production
spec:
  serviceName: postgresql
  replicas: 3
  podManagementPolicy: OrderedReady  # Sequential creation (default)
  # podManagementPolicy: Parallel  # Parallel creation (optional)
  selector:
    matchLabels:
      app: postgresql
  template:
    metadata:
      labels:
        app: postgresql
    spec:
      containers:
      - name: postgresql
        image: postgres:15
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: managed-premium
      resources:
        requests:
          storage: 100Gi

StatefulSet Scaling Order:

sequenceDiagram
    participant K8s as Kubernetes
    participant Pod0 as postgresql-0
    participant Pod1 as postgresql-1
    participant Pod2 as postgresql-2

    K8s->>Pod0: Create and wait for Ready
    Pod0-->>K8s: Ready
    K8s->>Pod1: Create and wait for Ready
    Pod1-->>K8s: Ready
    K8s->>Pod2: Create and wait for Ready
    Pod2-->>K8s: Ready

Ordered Scaling Behavior:

  • Scale Up: Creates pods sequentially (0, 1, 2...)
  • Scale Down: Deletes pods in reverse order (2, 1, 0...)
  • Ensures: Each pod is ready before creating the next
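
The ordinal rules above can be sketched as two small functions:

```python
# StatefulSet ordinal ordering: pods are created 0..N-1 in sequence,
# and removed in reverse, N-1..0.
def scale_up_order(replicas):
    """Order in which pod ordinals are created when scaling up."""
    return list(range(replicas))

def scale_down_order(replicas):
    """Order in which pod ordinals are removed when scaling down."""
    return list(range(replicas - 1, -1, -1))

print(scale_up_order(3))    # ordinals created first to last
print(scale_down_order(3))  # ordinals removed first to last
```

With OrderedReady pod management, each ordinal must report Ready before the next one in the sequence is acted on.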

Stable Network Identity

Headless Service for StatefulSet:

# apps/postgresql/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: postgresql
  namespace: atp-production
spec:
  clusterIP: None  # Headless service
  selector:
    app: postgresql
  ports:
  - port: 5432
    name: postgresql

Stable Network Identity:

# StatefulSet pods get stable DNS names
# postgresql-0.postgresql.atp-production.svc.cluster.local
# postgresql-1.postgresql.atp-production.svc.cluster.local
# postgresql-2.postgresql.atp-production.svc.cluster.local
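
The DNS names follow the fixed pattern `<pod>.<service>.<namespace>.svc.<cluster-domain>`; a small helper can enumerate them for any replica count:

```python
# Build the stable DNS names StatefulSet pods get behind a headless service.
def statefulset_pod_fqdns(name, service, namespace, replicas,
                          cluster_domain="cluster.local"):
    return [
        f"{name}-{i}.{service}.{namespace}.svc.{cluster_domain}"
        for i in range(replicas)
    ]

for fqdn in statefulset_pod_fqdns("postgresql", "postgresql",
                                  "atp-production", 3):
    print(fqdn)
```

Because the names depend only on the StatefulSet name and ordinal, clients can address a specific replica (e.g. the primary) without service discovery lookups.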

Accessing StatefulSet Pods:

# Access specific pod
psql -h postgresql-0.postgresql.atp-production.svc.cluster.local

# Access any pod via service
psql -h postgresql.atp-production.svc.cluster.local

Persistent Storage per Pod

StatefulSet with Volume Claim Templates:

# apps/postgresql/statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgresql
  namespace: atp-production
spec:
  serviceName: postgresql
  replicas: 3
  selector:
    matchLabels:
      app: postgresql
  template:
    metadata:
      labels:
        app: postgresql
    spec:
      containers:
      - name: postgresql
        image: postgres:15
        env:
        - name: PGDATA
          # Use a subdirectory: mounting an Azure Disk directly at
          # /var/lib/postgresql/data leaves a lost+found directory
          # that causes initdb to fail
          value: /var/lib/postgresql/data/pgdata
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
        - name: config
          mountPath: /etc/postgresql
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: managed-premium
      resources:
        requests:
          storage: 100Gi
  - metadata:
      name: config
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: managed-premium
      resources:
        requests:
          storage: 10Gi

PVCs Created Automatically:

data-postgresql-0  # Persistent volume for pod 0
data-postgresql-1  # Persistent volume for pod 1
data-postgresql-2  # Persistent volume for pod 2
config-postgresql-0
config-postgresql-1
config-postgresql-2

Headless Service Configuration

Headless Service Pattern:

# apps/postgresql/service-headless.yaml
apiVersion: v1
kind: Service
metadata:
  name: postgresql
  namespace: atp-production
spec:
  clusterIP: None  # Headless - no load balancing
  selector:
    app: postgresql
  ports:
  - port: 5432
    targetPort: 5432
    name: postgresql

Service Discovery with Headless Service:

# StatefulSet pod discovery
apiVersion: v1
kind: Service
metadata:
  name: postgresql-read
  namespace: atp-production
spec:
  selector:
    app: postgresql
    role: replica  # Read replicas only
  ports:
  - port: 5432
    name: postgresql

---
# StatefulSet pod discovery
apiVersion: v1
kind: Service
metadata:
  name: postgresql-write
  namespace: atp-production
spec:
  selector:
    app: postgresql
    role: primary  # Primary only
  ports:
  - port: 5432
    name: postgresql

Database Deployments in Kubernetes

PostgreSQL Operator

PostgreSQL Operator (Crunchy Data):

#!/bin/bash
# scripts/install-postgres-operator.sh
# Crunchy Data publishes PGO (the Postgres Operator) as Kustomize manifests
# in the postgres-operator-examples repository; no Helm repository URL is
# assumed here.

git clone https://github.com/CrunchyData/postgres-operator-examples.git
cd postgres-operator-examples

# Create the postgres-operator namespace and install PGO
kubectl apply -k kustomize/install/namespace
kubectl apply --server-side -k kustomize/install/default

echo "✅ PostgreSQL Operator (PGO) installed"

PostgreSQL Cluster via Operator:

# apps/postgresql/postgrescluster.yaml
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: atp-postgres
  namespace: atp-production
spec:
  image: registry.developers.crunchydata.com/crunchydata/crunchy-postgres:ubi8-15.4-0
  postgresVersion: 15
  instances:
  - name: instance1
    replicas: 3
    dataVolumeClaimSpec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 500Gi
      storageClassName: managed-premium
    resources:
      requests:
        cpu: 2000m
        memory: 4Gi
      limits:
        cpu: 4000m
        memory: 8Gi
  backups:
    pgbackrest:
      image: registry.developers.crunchydata.com/crunchydata/crunchy-pgbackrest:ubi8-2.47-0
      repos:
      - name: repo1
        volume:
          volumeClaimSpec:
            accessModes:
            - ReadWriteOnce
            resources:
              requests:
                storage: 1Ti
            storageClassName: managed-premium
      global:
        repo1-retention-full: "7"
        repo1-retention-full-type: count
  monitoring:
    pgMonitor:
      exporter:
        image: registry.developers.crunchydata.com/crunchydata/crunchy-postgres-exporter:ubi8-5.3.0-0

MongoDB Operator

MongoDB Community Operator:

#!/bin/bash
# scripts/install-mongodb-operator.sh

# Install MongoDB Community Operator CRD
kubectl apply -f https://raw.githubusercontent.com/mongodb/mongodb-kubernetes-operator/master/config/crd/bases/mongodbcommunity.mongodb.com_mongodbcommunity.yaml

# Install operator deployment (the operator's RBAC manifests under
# config/rbac/ must also be applied — omitted here for brevity)
kubectl create namespace mongodb-operator
kubectl apply -f https://raw.githubusercontent.com/mongodb/mongodb-kubernetes-operator/master/config/manager/manager.yaml -n mongodb-operator

echo "✅ MongoDB Operator installed"

MongoDB ReplicaSet via Operator:

# apps/mongodb/mongodbcommunity.yaml
apiVersion: mongodbcommunity.mongodb.com/v1
kind: MongoDBCommunity
metadata:
  name: atp-mongodb
  namespace: atp-production
spec:
  members: 3
  type: ReplicaSet
  version: "7.0.0"
  security:
    authentication:
      modes: ["SCRAM"]
  users:
  - name: atp-user
    db: admin
    passwordSecretRef:
      name: mongodb-password
    roles:
    - name: readWriteAnyDatabase
      db: admin
  additionalMongodConfig:
    storage.wiredTiger.engineConfig.journalCompressor: snappy
    storage.wiredTiger.collectionConfig.blockCompressor: snappy
  statefulSet:
    spec:
      template:
        spec:
          containers:
          - name: mongod  # Container name used by the operator
            resources:
              requests:
                cpu: 2000m
                memory: 4Gi
              limits:
                cpu: 4000m
                memory: 8Gi
      volumeClaimTemplates:
      - metadata:
          name: data-volume
        spec:
          accessModes:
          - ReadWriteOnce
          storageClassName: managed-premium
          resources:
            requests:
              storage: 500Gi

Redis Deployment

Redis StatefulSet:

# apps/redis/statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis
  namespace: atp-production
spec:
  serviceName: redis
  replicas: 3
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
      - name: redis
        image: redis:7-alpine
        command:
        - redis-server
        - /etc/redis/redis.conf
        - --appendonly
        - "yes"
        ports:
        - containerPort: 6379
          name: redis
        volumeMounts:
        - name: data
          mountPath: /data
        - name: config
          mountPath: /etc/redis
        resources:
          requests:
            cpu: 500m
            memory: 1Gi
          limits:
            cpu: 2000m
            memory: 2Gi
      volumes:
      - name: config
        configMap:
          name: redis-config  # Provides redis.conf (ConfigMap name assumed)
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes:
      - ReadWriteOnce
      storageClassName: managed-premium
      resources:
        requests:
          storage: 100Gi

Redis Sentinel Configuration:

# apps/redis/redis-sentinel.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis-sentinel
  namespace: atp-production
spec:
  serviceName: redis-sentinel
  replicas: 3
  selector:
    matchLabels:
      app: redis-sentinel
  template:
    metadata:
      labels:
        app: redis-sentinel
    spec:
      containers:
      - name: sentinel
        image: redis:7-alpine
        command:
        - redis-sentinel
        - /etc/redis/sentinel.conf
        ports:
        - containerPort: 26379
          name: sentinel
        volumeMounts:
        - name: config
          mountPath: /etc/redis
      volumes:
      - name: config
        configMap:
          name: redis-sentinel-config  # Provides sentinel.conf (ConfigMap name assumed)

StatefulSet vs Managed Service Decision

Kubernetes vs Azure Managed Services:

| Aspect | Kubernetes (StatefulSet) | Azure Managed Service | ATP Decision |
|---|---|---|---|
| PostgreSQL | PostgreSQL Operator | Azure Database for PostgreSQL | ✅ Managed Service (recommended) |
| MongoDB | MongoDB Operator | Azure Cosmos DB (MongoDB API) | ✅ Managed Service (recommended) |
| Redis | Redis StatefulSet | Azure Cache for Redis | ✅ Managed Service (recommended) |
| Control | ✅ Full control | ⚠️ Limited | ⚠️ Acceptable trade-off |
| Operations | ⚠️ Self-managed | ✅ Managed | ✅ Managed Service |
| Cost | ⚠️ Higher (infra + ops) | ✅ Lower (includes ops) | ✅ Managed Service |
| ATP Decision | ⚠️ Dev/Test only | ✅ Production | Managed Services |

ATP Decision: Use Azure managed services for production databases (Azure Database for PostgreSQL, Azure Cosmos DB, Azure Cache for Redis): lower operational overhead, better SLAs, and automated backups. Use Kubernetes StatefulSets only for dev/test environments.
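One way to keep application manifests identical across environments while production uses managed services is an `ExternalName` Service that aliases the managed endpoint. A sketch — the hostname below is a placeholder:

```yaml
# Alias the managed database behind a stable in-cluster Service name, so
# applications resolve "postgresql" in every environment regardless of
# whether the backend is a StatefulSet or a managed service
apiVersion: v1
kind: Service
metadata:
  name: postgresql
  namespace: atp-production
spec:
  type: ExternalName
  # Placeholder FQDN — replace with the actual Azure Database for
  # PostgreSQL server hostname
  externalName: atp-postgres.postgres.database.azure.com
```

In dev/test overlays the same Service name can instead point at the in-cluster StatefulSet, so connection strings never change between environments.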


Backup and Restore Procedures

Velero for Cluster Backups

Velero Installation:

#!/bin/bash
# scripts/install-velero.sh

# Install Velero CLI
curl -fsSL -o velero-v1.12.0-linux-amd64.tar.gz \
  https://github.com/vmware-tanzu/velero/releases/download/v1.12.0/velero-v1.12.0-linux-amd64.tar.gz
tar -xvf velero-v1.12.0-linux-amd64.tar.gz
sudo mv velero-v1.12.0-linux-amd64/velero /usr/local/bin/

# Create Azure Storage Account for Velero backups
az storage account create \
  --name atpvelerobackups \
  --resource-group atp-production-rg \
  --sku Standard_LRS \
  --location eastus

# Create blob container
az storage container create \
  --name velero \
  --account-name atpvelerobackups

# Install Velero
velero install \
  --provider azure \
  --plugins velero/velero-plugin-for-microsoft-azure:v1.7.0 \
  --bucket velero \
  --secret-file ./credentials-velero \
  --backup-location-config resourceGroup=atp-production-rg,storageAccount=atpvelerobackups \
  --snapshot-location-config apiTimeout=5m,resourceGroup=atp-production-rg \
  --use-volume-snapshots=true

echo "✅ Velero installed"

Velero via Helm:

# platform/velero/helmrelease.yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: velero
  namespace: velero
spec:
  interval: 5m
  chart:
    spec:
      chart: velero
      sourceRef:
        kind: HelmRepository
        name: vmware-tanzu
      version: 5.1.1
  values:
    configuration:
      provider: azure
      backupStorageLocation:
        bucket: velero
        config:
          resourceGroup: atp-production-rg
          storageAccount: atpvelerobackups
      volumeSnapshotLocation:
        config:
          apiTimeout: 5m
          resourceGroup: atp-production-rg
    initContainers:
    - name: velero-plugin-for-microsoft-azure
      image: velero/velero-plugin-for-microsoft-azure:v1.7.0
      volumeMounts:
      - mountPath: /target
        name: plugins
    credentials:
      secretContents:
        cloud: |
          # Azure credentials

Volume Snapshots

Velero Backup with Volume Snapshots:

#!/bin/bash
# scripts/create-velero-backup.sh

BACKUP_NAME="atp-production-backup-$(date +%Y%m%d-%H%M%S)"
NAMESPACE="atp-production"

echo "💾 Creating Velero backup: ${BACKUP_NAME}"

# Create backup (--include-namespaces scopes the backup; the velero CLI's
# own --namespace flag points at where Velero is installed, so it is not
# needed here)
velero backup create "${BACKUP_NAME}" \
  --include-namespaces "${NAMESPACE}" \
  --snapshot-volumes \
  --wait

# Check backup status
velero backup describe "${BACKUP_NAME}"

echo "✅ Backup created: ${BACKUP_NAME}"

Scheduled Backups:

# platform/velero/backup-schedule.yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-production-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"  # 2 AM daily
  template:
    includedNamespaces:
    - atp-production
    snapshotVolumes: true
    ttl: 720h  # 30 days retention
    metadata:
      labels:
        backup-type: daily
        environment: production

Backup Scheduling

Backup Schedule Configuration:

# platform/velero/schedules.yaml
# NOTE: Schedules are applied directly as Velero custom resources;
# wrapping them in a ConfigMap would have no effect
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-production-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"  # 2 AM daily
  template:
    includedNamespaces:
    - atp-production
    snapshotVolumes: true
    ttl: 720h  # 30 days
---
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: weekly-production-backup
  namespace: velero
spec:
  schedule: "0 3 * * 0"  # 3 AM Sunday
  template:
    includedNamespaces:
    - atp-production
    snapshotVolumes: true
    ttl: 2160h  # 90 days

Restore Procedures

Velero Restore Procedure:

#!/bin/bash
# scripts/restore-from-velero.sh

BACKUP_NAME="${1}"
NAMESPACE="${2:-atp-production}"

if [ -z "${BACKUP_NAME}" ]; then
  echo "Usage: $0 <backup-name> [namespace]"
  echo ""
  echo "Available backups:"
  velero backup get
  exit 1
fi

echo "🔄 Restoring from backup: ${BACKUP_NAME}"

# List backups
velero backup get

# Restore from backup
velero restore create "restore-${BACKUP_NAME}-$(date +%Y%m%d-%H%M%S)" \
  --from-backup "${BACKUP_NAME}" \
  --namespace-mappings "${NAMESPACE}:${NAMESPACE}-restored" \
  --wait

echo "✅ Restore initiated"
echo "   Check status: velero restore get"

Restore to Different Namespace:

# Restore production backup to test namespace
velero restore create restore-production-to-test \
  --from-backup daily-production-backup-20240115 \
  --namespace-mappings atp-production:atp-test \
  --wait

Volume Snapshots

Creating Snapshots

Manual Volume Snapshot:

# apps/postgresql/volumesnapshot-manual.yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-data-snapshot-20240115
  namespace: atp-production
spec:
  volumeSnapshotClassName: azure-disk-snapshot-premium
  source:
    persistentVolumeClaimName: data-postgresql-0  # PVC from the "data" volumeClaimTemplate

Create Snapshot Script:

#!/bin/bash
# scripts/create-volume-snapshot.sh

PVC_NAME="${1}"
NAMESPACE="${2}"
SNAPSHOT_NAME="${3:-${PVC_NAME}-snapshot-$(date +%Y%m%d-%H%M%S)}"

if [ -z "${PVC_NAME}" ] || [ -z "${NAMESPACE}" ]; then
  echo "Usage: $0 <pvc-name> <namespace> [snapshot-name]"
  exit 1
fi

echo "📸 Creating volume snapshot: ${SNAPSHOT_NAME}"

kubectl apply -f - <<EOF
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: ${SNAPSHOT_NAME}
  namespace: ${NAMESPACE}
spec:
  volumeSnapshotClassName: azure-disk-snapshot-premium
  source:
    persistentVolumeClaimName: ${PVC_NAME}
EOF

# Wait for snapshot to be ready (VolumeSnapshot exposes status.readyToUse
# rather than a Ready condition)
kubectl wait volumesnapshot/${SNAPSHOT_NAME} \
  -n "${NAMESPACE}" \
  --for=jsonpath='{.status.readyToUse}'=true \
  --timeout=300s

echo "✅ Snapshot created: ${SNAPSHOT_NAME}"

Snapshot Classes

Snapshot Class Configuration:

# platform/storage/volumesnapshotclass-premium.yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: azure-disk-snapshot-premium
driver: disk.csi.azure.com
deletionPolicy: Retain  # Keep snapshot after PVC deletion
parameters:
  incremental: "true"  # Incremental snapshots
  resourceGroup: atp-production-rg
  # Optional: specific storage account for snapshots
  # storageAccount: atpsnapshots

---
# platform/storage/volumesnapshotclass-standard.yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: azure-disk-snapshot-standard
driver: disk.csi.azure.com
deletionPolicy: Delete  # Delete snapshot when PVC is deleted
parameters:
  incremental: "true"
  resourceGroup: atp-production-rg

Snapshot Class Selection:

| SnapshotClass | Deletion Policy | Use Case | ATP Selection |
|---|---|---|---|
| azure-disk-snapshot-premium | Retain | Production backups | ✅ Production |
| azure-disk-snapshot-standard | Delete | Dev/Test snapshots | ✅ Dev/Test |

Azure Backup Integration

Azure Backup for AKS Volumes:

#!/bin/bash
# scripts/setup-azure-backup.sh

# Create Recovery Services Vault
az backup vault create \
  --name atp-backup-vault \
  --resource-group atp-production-rg \
  --location eastus

# Enable backup for an Azure Files share used by AKS workloads
az backup protection enable-for-azurefileshare \
  --vault-name atp-backup-vault \
  --resource-group atp-production-rg \
  --storage-account atpstorageaccount \
  --azure-file-share postgres-backup \
  --policy-name DefaultPolicy

echo "✅ Azure Backup configured"

Backup Policy:

# Create backup policy (daily, 30-day retention)
# NOTE: policy JSON abbreviated — the full schema has additional required
# fields (see `az backup policy create --help`)
az backup policy create \
  --vault-name atp-backup-vault \
  --resource-group atp-production-rg \
  --name daily-policy \
  --backup-management-type AzureStorage \
  --workload-type AzureFileShare \
  --policy '{
    "name": "daily-policy",
    "recoveryPointType": "FileSystemConsistent",
    "schedulePolicy": {
      "scheduleRunFrequency": "Daily",
      "scheduleRunTimes": ["02:00"]
    },
    "retentionPolicy": {
      "dailySchedule": {
        "retentionDuration": {
          "count": 30,
          "durationType": "Days"
        }
      }
    }
  }'

Snapshot Retention

Snapshot Retention Policies:

| Environment | Retention Period | Rationale |
|---|---|---|
| Production | 90 days | Long-term recovery |
| Staging | 30 days | Shorter retention |
| Test | 7 days | Minimal retention |
| Dev | 3 days | Very short retention |
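Velero `ttl` values are expressed in hours; the retention periods above translate to `ttl` strings as follows (a quick arithmetic sketch):

```shell
#!/bin/bash
# Convert retention periods (days) to the hour-based TTL strings Velero expects
for days in 90 30 7 3; do
  echo "${days} days -> ttl: $((days * 24))h"
done
# 90 days -> 2160h, 30 days -> 720h, 7 days -> 168h, 3 days -> 72h
```

These are the same `ttl` values (`2160h`, `720h`, `168h`, `72h`) used in the Schedule manifests elsewhere in this section.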

Automated Snapshot Cleanup:

#!/bin/bash
# scripts/cleanup-old-snapshots.sh

NAMESPACE="${1:-atp-production}"
RETENTION_DAYS="${2:-30}"

echo "🧹 Cleaning up old snapshots (older than ${RETENTION_DAYS} days)..."

# Get all snapshots older than retention period
OLD_SNAPSHOTS=$(kubectl get volumesnapshot -n "${NAMESPACE}" -o json | \
  jq -r ".items[] | select(.metadata.creationTimestamp < \"$(date -d "${RETENTION_DAYS} days ago" -u +%Y-%m-%dT%H:%M:%SZ)\") | .metadata.name")

for SNAPSHOT in ${OLD_SNAPSHOTS}; do
  echo "🗑️  Deleting snapshot: ${SNAPSHOT}"
  kubectl delete volumesnapshot "${SNAPSHOT}" -n "${NAMESPACE}" || true
done

echo "✅ Snapshot cleanup complete"

Data Migration Strategies

Migrating Data Between Versions

Database Migration Strategy:

sequenceDiagram
    participant Old as Old Version<br/>PostgreSQL 14
    participant Snapshot as Volume Snapshot
    participant New as New Version<br/>PostgreSQL 15
    participant Data as Data Migration

    Old->>Snapshot: Create snapshot
    Snapshot->>New: Clone volume
    New->>Data: Mount snapshot
    Data->>New: Migrate schema
    New->>Data: Migrate data
    Data->>New: Validate

PostgreSQL Version Migration:

#!/bin/bash
# scripts/migrate-postgres-version.sh

OLD_VERSION="14"
NEW_VERSION="15"
NAMESPACE="atp-production"
PVC_NAME="data-postgresql-0"  # PVC from the "data" volumeClaimTemplate

echo "🔄 Migrating PostgreSQL ${OLD_VERSION} → ${NEW_VERSION}"

# Step 1: Create snapshot of current data
echo "📸 Step 1: Creating snapshot..."
kubectl apply -f - <<EOF
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-migration-snapshot
  namespace: ${NAMESPACE}
spec:
  volumeSnapshotClassName: azure-disk-snapshot-premium
  source:
    persistentVolumeClaimName: ${PVC_NAME}
EOF

kubectl wait volumesnapshot/postgres-migration-snapshot \
  -n "${NAMESPACE}" \
  --for=jsonpath='{.status.readyToUse}'=true \
  --timeout=300s

# Step 2: Scale down old StatefulSet
echo "⏸️  Step 2: Scaling down old StatefulSet..."
kubectl scale statefulset postgresql --replicas=0 -n "${NAMESPACE}"

# Step 3: Create new StatefulSet from snapshot
echo "🆕 Step 3: Creating new StatefulSet from snapshot..."
# (Create new StatefulSet YAML with PostgreSQL ${NEW_VERSION})

# Step 4: Restore data from snapshot
echo "📥 Step 4: Restoring data from snapshot..."
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data-postgresql-0-new
  namespace: ${NAMESPACE}
spec:
  dataSource:
    name: postgres-migration-snapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
  - ReadWriteOnce
  storageClassName: managed-premium
  resources:
    requests:
      storage: 500Gi
EOF

echo "✅ Migration initiated"

Volume Cloning

Volume Clone from Snapshot:

# apps/postgresql/pvc-from-snapshot.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data-clone
  namespace: atp-production
spec:
  dataSource:
    name: postgres-data-snapshot-20240115
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
  - ReadWriteOnce
  storageClassName: managed-premium
  resources:
    requests:
      storage: 500Gi  # Must be >= snapshot size

Clone Volume Script:

#!/bin/bash
# scripts/clone-volume-from-snapshot.sh

SNAPSHOT_NAME="${1}"
NEW_PVC_NAME="${2}"
NAMESPACE="${3:-atp-production}"
STORAGE_SIZE="${4:-500Gi}"

if [ -z "${SNAPSHOT_NAME}" ] || [ -z "${NEW_PVC_NAME}" ]; then
  echo "Usage: $0 <snapshot-name> <new-pvc-name> [namespace] [storage-size]"
  exit 1
fi

echo "📋 Cloning volume from snapshot: ${SNAPSHOT_NAME}"

kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ${NEW_PVC_NAME}
  namespace: ${NAMESPACE}
spec:
  dataSource:
    name: ${SNAPSHOT_NAME}
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
  - ReadWriteOnce
  storageClassName: managed-premium
  resources:
    requests:
      storage: ${STORAGE_SIZE}
EOF

echo "✅ Volume clone created: ${NEW_PVC_NAME}"

Zero-Downtime Migrations

Zero-Downtime Migration Strategy:

sequenceDiagram
    participant App as Application
    participant OldDB as Old DB<br/>Primary
    participant NewDB as New DB<br/>Replica
    participant Sync as Data Sync

    App->>OldDB: Write traffic
    OldDB->>Sync: Stream changes
    Sync->>NewDB: Apply changes
    NewDB->>NewDB: Validate sync
    NewDB->>App: Switch traffic
    App->>NewDB: Write traffic

PostgreSQL Logical Replication for Zero-Downtime:

# apps/postgresql/migration-replica.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgresql-new
  namespace: atp-production
spec:
  serviceName: postgresql-new
  replicas: 1
  selector:
    matchLabels:
      app: postgresql-new
  template:
    metadata:
      labels:
        app: postgresql-new
    spec:
      containers:
      - name: postgresql
        image: postgres:15
        env:
        # NOTE: illustrative only — the official postgres image does not
        # read these variables; use an image or operator that implements
        # replica bootstrapping (or configure primary_conninfo manually)
        - name: POSTGRES_REPLICATION_MODE
          value: "replica"
        - name: POSTGRES_PRIMARY_HOST
          value: "postgresql.atp-production.svc.cluster.local"
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes:
      - ReadWriteOnce
      storageClassName: managed-premium
      resources:
        requests:
          storage: 500Gi

GitOps Considerations for Stateful Apps

Careful Rollback Procedures

StatefulSet Rollback Strategy:

# StatefulSet with update strategy
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgresql
  namespace: atp-production
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 0  # Update all pods (reduce gradually for staged rollout)
  # OR
  # updateStrategy:
  #   type: OnDelete  # Manual update control

Staged Rollout for StatefulSets:

#!/bin/bash
# scripts/staged-statefulset-rollout.sh

STATEFULSET="${1}"
NAMESPACE="${2:-atp-production}"
PARTITION="${3:-2}"  # Keep 2 pods on old version

echo "🔄 Staged rollout for StatefulSet: ${STATEFULSET}"
echo "   Keeping ${PARTITION} pods on old version"

# Set partition (only pods >= partition index will be updated)
kubectl patch statefulset "${STATEFULSET}" -n "${NAMESPACE}" \
  --type='json' \
  -p="[{\"op\": \"replace\", \"path\": \"/spec/updateStrategy/rollingUpdate/partition\", \"value\": ${PARTITION}}]"

# Update image
kubectl set image statefulset/${STATEFULSET} \
  -n "${NAMESPACE}" \
  postgresql=postgres:15

# Gradually reduce the partition so remaining pods update one at a time
for ((i = PARTITION; i >= 0; i--)); do
  echo "  Updating partition: ${i}"
  kubectl patch statefulset "${STATEFULSET}" -n "${NAMESPACE}" \
    --type='json' \
    -p="[{\"op\": \"replace\", \"path\": \"/spec/updateStrategy/rollingUpdate/partition\", \"value\": ${i}}]"

  # Wait for pod to be ready
  kubectl wait --for=condition=ready pod/${STATEFULSET}-${i} \
    -n "${NAMESPACE}" \
    --timeout=300s

  sleep 60  # Wait before next update
done

echo "✅ Staged rollout complete"

No Auto-Prune for PVCs

FluxCD Kustomization with Prune Safety:

# clusters/production/kustomizations/stateful-apps-production.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: stateful-apps-production
  namespace: flux-system
spec:
  interval: 5m
  path: ./apps/postgresql/overlays/production
  prune: true  # Enable pruning for stateless resources
  # Flux has no per-Kustomization "keep" list; to protect PVCs from
  # pruning, label or annotate them with
  # kustomize.toolkit.fluxcd.io/prune: disabled (see below)
  sourceRef:
    kind: GitRepository
    name: atp-gitops-production

PVC Protection Labels:

# apps/postgresql/pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
  namespace: atp-production
  labels:
    app: postgresql
    component: database
    kustomize.toolkit.fluxcd.io/prune: disabled  # Exclude from Flux pruning
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: managed-premium
  resources:
    requests:
      storage: 500Gi

StatefulSet Update Strategies

Update Strategy Options:

| Strategy | Description | Use Case | ATP Selection |
|---|---|---|---|
| RollingUpdate | Update pods sequentially | Production (controlled) | ✅ Production |
| OnDelete | Update only when pod deleted | Manual control | ⚠️ Staging (manual) |
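The `rollingUpdate.partition` field controls which ordinals an update touches: only pods with an ordinal greater than or equal to the partition are moved to the new revision. A small sketch of that rule, using a hypothetical 3-replica StatefulSet:

```shell
#!/bin/bash
# Which pods does a RollingUpdate with a given partition touch?
# Rule: pods with ordinal >= partition are updated; lower ordinals stay.
REPLICAS=3
PARTITION=2

for ((i = REPLICAS - 1; i >= 0; i--)); do
  if (( i >= PARTITION )); then
    echo "postgresql-${i}: updated to new revision"
  else
    echo "postgresql-${i}: stays on old revision"
  fi
done
# With partition=2: only postgresql-2 is updated; 0 and 1 stay old
```

Lowering the partition step by step (2 → 1 → 0) therefore produces a staged rollout, one pod at a time.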

StatefulSet Update Strategy Configuration:

# apps/postgresql/statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgresql
  namespace: atp-production
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 0  # Start updating from index 0
      # partition: 2  # Keep pods 0-1 on old version, update 2+
  # OR for manual control
  # updateStrategy:
  #   type: OnDelete

Data Persistence Across Deployments

PVC Retention Policy:

# StorageClass with Retain policy
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-premium
provisioner: disk.csi.azure.com
parameters:
  skuname: Premium_LRS
reclaimPolicy: Retain  # Keep PV when PVC is deleted
volumeBindingMode: WaitForFirstConsumer

PVC Lifecycle with Retain Policy:

sequenceDiagram
    participant Dev as Developer
    participant PVC as PVC
    participant PV as PV
    participant Azure as Azure Disk

    Dev->>PVC: Delete PVC
    PVC->>PV: Release (Retain)
    PV->>Azure: Keep disk (not deleted)
    Azure->>PV: Data preserved
    Dev->>PV: Reuse PV with new PVC

Reusing Retained PV:

# Reuse existing PV with new PVC
# NOTE: a Released PV keeps its old claimRef; clear it first so the new
# PVC can bind, e.g.:
#   kubectl patch pv pv-abc123 --type json \
#     -p '[{"op":"remove","path":"/spec/claimRef"}]'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data-restored
  namespace: atp-production
spec:
  volumeName: pv-abc123  # Reference existing PV
  accessModes:
  - ReadWriteOnce
  storageClassName: managed-premium
  resources:
    requests:
      storage: 500Gi

Disaster Recovery for Persistent Data

Backup Frequency per Environment

Backup Schedule Matrix:

| Environment | Frequency | Retention | Rationale |
|---|---|---|---|
| Production | Every 6 hours | 90 days | High availability, long-term recovery |
| Staging | Daily | 30 days | Moderate retention |
| Test | Weekly | 14 days | Minimal retention |
| Dev | Manual only | 7 days | Cost optimization |

Automated Backup Schedules:

# platform/velero/schedules-production.yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: production-backup-6h
  namespace: velero
spec:
  schedule: "0 */6 * * *"  # Every 6 hours
  template:
    includedNamespaces:
    - atp-production
    snapshotVolumes: true
    ttl: 2160h  # 90 days
---
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: production-backup-daily
  namespace: velero
spec:
  schedule: "0 2 * * *"  # 2 AM daily
  template:
    includedNamespaces:
    - atp-production
    snapshotVolumes: true
    ttl: 720h  # 30 days
---
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: production-backup-weekly
  namespace: velero
spec:
  schedule: "0 3 * * 0"  # 3 AM Sunday
  template:
    includedNamespaces:
    - atp-production
    snapshotVolumes: true
    ttl: 2160h  # 90 days

Cross-Region Replication

Cross-Region Backup Replication:

#!/bin/bash
# scripts/setup-cross-region-backup.sh

PRIMARY_REGION="eastus"
SECONDARY_REGION="westeurope"

# Create backup storage in secondary region
az storage account create \
  --name atpvelerobackupseu \
  --resource-group atp-production-rg-eu \
  --sku Standard_LRS \
  --location "${SECONDARY_REGION}" \
  --allow-blob-public-access false

# Enable change feed and versioning on the primary backup account
az storage account blob-service-properties update \
  --account-name atpvelerobackups \
  --resource-group atp-production-rg \
  --enable-change-feed true \
  --enable-versioning true

# Enable geo-redundant replication (RA-GRS) and harden the account
az storage account update \
  --name atpvelerobackups \
  --resource-group atp-production-rg \
  --sku Standard_RAGRS \
  --allow-blob-public-access false \
  --min-tls-version TLS1_2

echo "✅ Cross-region backup replication configured"

Velero with Multiple Backup Locations:

# platform/velero/backup-locations.yaml
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: backup-primary
  namespace: velero
spec:
  provider: azure
  objectStorage:
    bucket: velero
    prefix: primary
  config:
    resourceGroup: atp-production-rg
    storageAccount: atpvelerobackups
---
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: backup-secondary
  namespace: velero
spec:
  provider: azure
  objectStorage:
    bucket: velero
    prefix: secondary
  config:
    resourceGroup: atp-production-rg-eu
    storageAccount: atpvelerobackupseu

RPO Targets for Databases

RPO (Recovery Point Objective) Targets:

| Environment | RPO Target | Backup Frequency | Actual RPO |
|---|---|---|---|
| Production | ≤ 6 hours | Every 6 hours | 6 hours |
| Staging | ≤ 24 hours | Daily | 24 hours |
| Test | ≤ 7 days | Weekly | 7 days |
| Dev | N/A | Manual | N/A |

RPO Validation:

#!/bin/bash
# scripts/validate-rpo.sh

NAMESPACE="atp-production"
RPO_TARGET_HOURS=6

echo "🔍 Validating RPO compliance..."

# Get latest backup
LATEST_BACKUP=$(velero backup get --namespace velero | \
  grep "${NAMESPACE}" | \
  sort -k4 -r | \
  head -n 1 | \
  awk '{print $1}')

if [ -z "${LATEST_BACKUP}" ]; then
  echo "❌ No backups found"
  exit 1
fi

# Get backup creation time from the Backup custom resource (more robust
# than parsing "velero backup describe" text output)
BACKUP_TIME=$(kubectl get backups.velero.io "${LATEST_BACKUP}" -n velero \
  -o jsonpath='{.metadata.creationTimestamp}')

BACKUP_EPOCH=$(date -d "${BACKUP_TIME}" +%s)
CURRENT_EPOCH=$(date +%s)
AGE_HOURS=$(( (CURRENT_EPOCH - BACKUP_EPOCH) / 3600 ))

if [ "${AGE_HOURS}" -gt "${RPO_TARGET_HOURS}" ]; then
  echo "⚠️  RPO violation: Latest backup is ${AGE_HOURS} hours old (target: ${RPO_TARGET_HOURS}h)"
  exit 1
else
  echo "✅ RPO compliant: Latest backup is ${AGE_HOURS} hours old (target: ${RPO_TARGET_HOURS}h)"
fi

DR Testing for Stateful Apps

DR Test Procedure:

#!/bin/bash
# scripts/dr-test-stateful-apps.sh

BACKUP_NAME="${1}"
TEST_NAMESPACE="atp-production-dr-test"

if [ -z "${BACKUP_NAME}" ]; then
  echo "Usage: $0 <backup-name>"
  echo ""
  echo "Available backups:"
  velero backup get --namespace velero
  exit 1
fi

echo "🧪 DR Test: Restoring backup ${BACKUP_NAME} to test namespace"

# Step 1: Restore backup to test namespace
echo "📥 Step 1: Restoring backup..."
velero restore create "dr-test-${BACKUP_NAME}-$(date +%Y%m%d-%H%M%S)" \
  --from-backup "${BACKUP_NAME}" \
  --namespace-mappings atp-production:${TEST_NAMESPACE} \
  --wait

# Step 2: Verify restored resources
echo "✅ Step 2: Verifying restored resources..."
kubectl get statefulsets -n "${TEST_NAMESPACE}"
kubectl get pvc -n "${TEST_NAMESPACE}"

# Step 3: Test database connectivity
echo "🔌 Step 3: Testing database connectivity..."
kubectl run postgresql-test \
  -n "${TEST_NAMESPACE}" \
  --image=postgres:15 \
  --rm -it --restart=Never \
  -- psql -h postgresql.${TEST_NAMESPACE}.svc.cluster.local -U postgres -c "SELECT version();"

# Step 4: Cleanup
echo "🧹 Step 4: Cleaning up test namespace..."
read -p "Delete test namespace ${TEST_NAMESPACE}? (yes/no): " CONFIRM
if [ "${CONFIRM}" = "yes" ]; then
  kubectl delete namespace "${TEST_NAMESPACE}"
  echo "✅ DR test complete and cleaned up"
else
  echo "⚠️  Test namespace retained: ${TEST_NAMESPACE}"
fi

DR Test Checklist:

## DR Test Checklist

### Pre-Test
- [ ] Backup exists and is valid
- [ ] Test namespace created
- [ ] Test resources allocated

### Test Execution
- [ ] Backup restored successfully
- [ ] StatefulSets recreated
- [ ] PVCs restored
- [ ] Pods running and healthy
- [ ] Database accessible
- [ ] Data integrity verified

### Post-Test
- [ ] Test results documented
- [ ] Test namespace cleaned up
- [ ] Lessons learned captured

Summary: Storage & StatefulSets in GitOps

  • Persistent Volumes (PV) and Claims (PVC): PV and PVC concepts, dynamic provisioning (ATP preference), storage classes, access modes (RWO, RWX, ROX)
  • Azure Disk vs Azure Files: Azure Disk (block storage, single mount) for databases, Azure Files (shared storage, multi-mount) for shared files, performance characteristics comparison, cost comparison, use case selection matrix
  • Storage Classes: Performance tiers (Standard, Premium, Ultra), encryption configuration with Disk Encryption Set, snapshot support, provisioner settings (binding modes, expansion)
  • StatefulSet Deployment Patterns: Ordered deployment and scaling, stable network identity with headless services, persistent storage per pod (volume claim templates), headless service configuration
  • Database Deployments in Kubernetes: PostgreSQL operator (Crunchy Data), MongoDB operator, Redis deployment, StatefulSet vs managed service decision (ATP: managed services for production)
  • Backup and Restore Procedures: Velero for cluster backups, volume snapshots, backup scheduling (6h/daily/weekly), restore procedures (to same/different namespace)
  • Volume Snapshots: Creating snapshots (manual/automated), snapshot classes (Retain/Delete policies), Azure Backup integration, snapshot retention policies per environment
  • Data Migration Strategies: Migrating data between versions (PostgreSQL 14→15 example), volume cloning from snapshots, zero-downtime migrations with logical replication
  • GitOps Considerations for Stateful Apps: Careful rollback procedures (staged rollout), no auto-prune for PVCs (explicit labels), StatefulSet update strategies (RollingUpdate/OnDelete), data persistence across deployments (Retain policy)
  • Disaster Recovery for Persistent Data: Backup frequency per environment (6h/daily/weekly), cross-region replication, RPO targets for databases (< 1 hour production), DR testing for stateful apps (restore validation)

Troubleshooting GitOps Issues

Purpose: Define comprehensive troubleshooting procedures, debugging tools, common error patterns, and escalation procedures for ATP's GitOps deployments, enabling rapid identification and resolution of issues across FluxCD, Kubernetes resources, networking, secrets, health checks, and performance.


FluxCD Sync Failures

Authentication Issues (Git Credentials)

Common Authentication Errors:

# Check GitRepository authentication status
kubectl get gitrepository -n flux-system -o yaml

# Describe GitRepository to see authentication errors
kubectl describe gitrepository atp-gitops-production -n flux-system

# Check secret for Git credentials
kubectl get secret git-credentials -n flux-system -o yaml

# View FluxCD logs for authentication errors
kubectl logs -n flux-system -l app=source-controller --tail=100 | grep -i "auth\|error\|failed"

Troubleshooting Git Authentication:

#!/bin/bash
# scripts/troubleshoot-git-auth.sh

GIT_REPO="${1:-atp-gitops-production}"
NAMESPACE="${2:-flux-system}"

echo "🔍 Troubleshooting Git authentication for: ${GIT_REPO}"

# Step 1: Check GitRepository status
echo "📋 Step 1: Checking GitRepository status..."
kubectl get gitrepository "${GIT_REPO}" -n "${NAMESPACE}" -o jsonpath='{.status.conditions[*]}' | jq .

# Step 2: Check if secret exists
echo "🔐 Step 2: Checking Git credentials secret..."
SECRET_NAME=$(kubectl get gitrepository "${GIT_REPO}" -n "${NAMESPACE}" -o jsonpath='{.spec.secretRef.name}')
if [ -n "${SECRET_NAME}" ]; then
  echo "   Secret name: ${SECRET_NAME}"
  kubectl get secret "${SECRET_NAME}" -n "${NAMESPACE}" || echo "   ❌ Secret not found"
else
  echo "   ⚠️  No secret reference found"
fi

# Step 3: Check source controller logs
echo "📜 Step 3: Checking source controller logs..."
kubectl logs -n "${NAMESPACE}" -l app=source-controller --tail=50 | grep -i "${GIT_REPO}"

# Step 4: Test Git connectivity manually
echo "🌐 Step 4: Testing Git connectivity..."
GIT_URL=$(kubectl get gitrepository "${GIT_REPO}" -n "${NAMESPACE}" -o jsonpath='{.spec.url}')
echo "   Git URL: ${GIT_URL}"

Fix Git Authentication:

# Regenerate Git credentials (PAT)
# Note: Azure DevOps PATs cannot be created with the Azure CLI; generate one in
# Azure DevOps (User settings → Personal access tokens, Code Read scope) or via
# the PAT Lifecycle Management REST API, then export it:
PAT="<new-pat>"

# Update secret
kubectl create secret generic git-credentials \
  --from-literal=username=${USERNAME} \
  --from-literal=password=${PAT} \
  -n flux-system \
  --dry-run=client -o yaml | kubectl apply -f -

# Reconcile GitRepository
flux reconcile source git atp-gitops-production -n flux-system
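Before rotating the in-cluster secret, it helps to confirm the new PAT works at all from a workstation. A minimal sketch (the `ORG`/`PROJECT` URL is a placeholder; Azure DevOps accepts an empty username with the PAT as the Basic auth password):

```shell
# Build the Basic auth header value Azure DevOps expects for PAT authentication.
# The username may be empty; only the PAT (as password) matters.
basic_auth_header() {
  printf '%s' ":$1" | base64 | tr -d '\n'
}

# Probe the repo API with the header (network call shown commented out):
# curl -s -o /dev/null -w '%{http_code}' \
#   -H "Authorization: Basic $(basic_auth_header "$PAT")" \
#   "https://dev.azure.com/ORG/PROJECT/_apis/git/repositories?api-version=7.0"
```

A `203` or `401` response indicates the PAT is invalid or expired; `200` means authentication works and the problem lies in the cluster-side secret.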

Manifest Syntax Errors

Detecting Manifest Syntax Errors:

#!/bin/bash
# scripts/check-manifest-syntax.sh

PATH_TO_CHECK="${1:-.}"

echo "🔍 Checking manifest syntax in: ${PATH_TO_CHECK}"

# Check YAML syntax
find "${PATH_TO_CHECK}" -name "*.yaml" -o -name "*.yml" | while read -r file; do
  echo "Checking: ${file}"
  # Use yamllint or kubeval
  yamllint "${file}" 2>/dev/null || echo "   ⚠️  YAML syntax error in ${file}"
done

# Validate Kubernetes manifests
kubeval --directories "${PATH_TO_CHECK}" --ignore-missing-schemas || echo "   ⚠️  Kubernetes manifest validation errors"

Common Syntax Errors:

| Error Type | Example | Fix |
|---|---|---|
| Indentation | `key:value` | Use proper YAML indentation (spaces, not tabs) |
| Missing colon | `key value` | Use `key: value` |
| Invalid type | `replicas: "3"` (string) | Use `replicas: 3` (integer) |
| Invalid enum | `type: Invalid` | Use a valid Kubernetes enum value |
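Tab indentation is the most common of these errors and is easy to catch before committing. A minimal pre-commit sketch (the helper name is illustrative):

```shell
# check_yaml_tabs: print any lines in a YAML file that contain a tab character
# (YAML forbids tabs for indentation). Exits 0 if tabs are found, 1 otherwise,
# following grep semantics.
check_yaml_tabs() {
  grep -n "$(printf '\t')" "$1"
}

# Usage: check_yaml_tabs manifests/deployment.yaml && echo "tabs found — fix before commit"
```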

Fix Manifest Syntax:

# Validate before committing
kubectl apply --dry-run=client -f manifests/

# Use kubeval for validation
kubectl kustomize . | kubeval

# Use FluxCD validation
flux check --pre

Resource Conflicts (Already Exists)

Identifying Resource Conflicts:

#!/bin/bash
# scripts/check-resource-conflicts.sh

NAMESPACE="${1:-atp-production}"

echo "🔍 Checking for resource conflicts in namespace: ${NAMESPACE}"

# Check for resources not managed by FluxCD
echo "📋 Resources not managed by FluxCD:"
kubectl get all -n "${NAMESPACE}" -o json | \
  jq -r '.items[] | select(.metadata.labels."kustomize.toolkit.fluxcd.io/name" == null) | "\(.kind)/\(.metadata.name)"'

# Check Kustomization status
echo "📦 Kustomization status:"
kubectl get kustomization -n flux-system -o wide | grep "${NAMESPACE}"

# Check for "already exists" errors in FluxCD logs
echo "🚨 Checking FluxCD logs for conflicts:"
kubectl logs -n flux-system -l app=kustomize-controller --tail=100 | \
  grep -i "already exists\|conflict\|error"

Resolve Resource Conflicts:

# Option 1: Adopt existing resource (add FluxCD labels)
kubectl label <resource-type>/<resource-name> kustomize.toolkit.fluxcd.io/name=atp-apps \
  kustomize.toolkit.fluxcd.io/namespace=flux-system \
  -n atp-production

# Option 2: Delete conflicting resource (if safe)
kubectl delete deployment conflicting-deployment -n atp-production

# Option 3: Suspend Kustomization, fix, then resume
flux suspend kustomization atp-apps-production -n flux-system
# Fix the conflict
flux resume kustomization atp-apps-production -n flux-system

Timeout Errors

Troubleshooting Timeout Errors:

#!/bin/bash
# scripts/troubleshoot-timeout.sh

RESOURCE="${1}"
NAMESPACE="${2:-flux-system}"

echo "⏱️  Troubleshooting timeout for: ${RESOURCE}"

# Check resource status
kubectl get kustomization "${RESOURCE}" -n "${NAMESPACE}" -o yaml | grep -A 5 "conditions:"

# Check reconciliation timeout settings
kubectl get kustomization "${RESOURCE}" -n "${NAMESPACE}" -o jsonpath='{.spec.timeout}'

# View detailed status
flux get kustomization "${RESOURCE}" -n "${NAMESPACE}"

# Check for stuck reconciliations
kubectl get kustomization -n "${NAMESPACE}" -o json | \
  jq -r '.items[] | select(.status.conditions[].status == "False") | "\(.metadata.name): \(.status.conditions[].message)"'

Increase Timeout:

# clusters/production/kustomizations/apps-production.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-production
  namespace: flux-system
spec:
  interval: 5m
  timeout: 10m  # Increase timeout from default 5m
  path: ./apps/atp-gateway/overlays/production
  sourceRef:
    kind: GitRepository
    name: atp-gitops-production
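When tuning `timeout` against `interval`, it helps to compare the two numerically. A small sketch that converts single-unit Go-style durations, as FluxCD uses, to seconds (compound values such as `1h30m` are not handled):

```shell
# flux_duration_to_seconds: convert a single-unit Go-style duration
# ("90s", "5m", "1h") to seconds so timeout and interval can be compared.
flux_duration_to_seconds() {
  local d="$1" n unit
  n="${d%?}"          # everything but the last character
  unit="${d#"$n"}"    # the unit suffix
  case "$unit" in
    s) echo "$n" ;;
    m) echo $(( n * 60 )) ;;
    h) echo $(( n * 3600 )) ;;
    *) echo "unsupported duration: $d" >&2; return 1 ;;
  esac
}

# Usage: warn when the timeout is shorter than the reconciliation interval
# [ "$(flux_duration_to_seconds 10m)" -ge "$(flux_duration_to_seconds 5m)" ] || echo "timeout too short"
```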

Network Connectivity Issues

Check Network Connectivity:

#!/bin/bash
# scripts/troubleshoot-network.sh

echo "🌐 Troubleshooting network connectivity..."

# Test Git repository access
GIT_URL=$(kubectl get gitrepository atp-gitops-production -n flux-system -o jsonpath='{.spec.url}')
echo "Testing Git URL: ${GIT_URL}"

# FluxCD controller images are distroless (no shell), so test DNS and HTTPS
# reachability from a temporary debug pod instead
kubectl run net-test --image=curlimages/curl --rm -it --restart=Never -n flux-system -- \
  sh -c "nslookup dev.azure.com; curl -sI ${GIT_URL}"

# Check proxy settings
kubectl get deployment source-controller -n flux-system -o yaml | grep -i proxy

Common Network Issues:

| Issue | Symptom | Fix |
|---|---|---|
| DNS Resolution | `could not resolve host` | Check CoreDNS, network policies |
| Firewall | `connection timeout` | Allow Git repository IPs in NSG |
| Proxy | `proxy authentication required` | Configure proxy in source controller |
| VNet Peering | `network unreachable` | Verify VNet peering configuration |

Drift Detection and Resolution

Identifying Drifted Resources

Detect Drift:

#!/bin/bash
# scripts/detect-drift.sh

NAMESPACE="${1:-atp-production}"
KUSTOMIZATION="${2:-apps-production}"

echo "🔍 Detecting drift in namespace: ${NAMESPACE}"

# Check Kustomization drift status
flux get kustomization "${KUSTOMIZATION}" -n flux-system

# Force reconciliation and check for drift
flux reconcile kustomization "${KUSTOMIZATION}" -n flux-system --with-source

# List resources tracked in the Kustomization inventory
kubectl get kustomization "${KUSTOMIZATION}" -n flux-system -o json | \
  jq -r '.status.inventory.entries[].id'

# Compare Git state with cluster state (--path points at the kustomization directory)
flux diff kustomization "${KUSTOMIZATION}" -n flux-system \
  --path ./apps/atp-gateway/overlays/production

Drift Detection Query (KQL):

// Log Analytics: Detect FluxCD drift events
KubePodInventory
| where Namespace == "flux-system"
| where Name contains "kustomize-controller"
| join kind=inner (
    ContainerLog
    | where LogEntry contains "drift" or LogEntry contains "diff"
    | project TimeGenerated, LogEntry, ContainerID
) on ContainerID
| project TimeGenerated, LogEntry
| order by TimeGenerated desc

Manual Changes Detection

Detect Manual Changes:

#!/bin/bash
# scripts/detect-manual-changes.sh

NAMESPACE="${1:-atp-production}"

echo "🔍 Detecting manually modified resources..."

# Find resources without FluxCD labels
kubectl get all -n "${NAMESPACE}" -o json | \
  jq -r '.items[] | select(.metadata.labels."kustomize.toolkit.fluxcd.io/name" == null) | 
    "\(.kind)/\(.metadata.name) - Not managed by FluxCD"'

# Find resources applied with kubectl (kubectl apply sets the
# last-applied-configuration annotation; FluxCD's server-side apply does not)
kubectl get all -n "${NAMESPACE}" -o json | \
  jq -r '.items[] | select(.metadata.annotations."kubectl.kubernetes.io/last-applied-configuration" != null) |
    "\(.kind)/\(.metadata.name) - Applied with kubectl (possible manual change)"'

# Check Git commit history for resource
RESOURCE="${2}"
if [ -n "${RESOURCE}" ]; then
  git log --all --oneline --grep="${RESOURCE}" -- manifests/
fi

Revert Drift vs Accept Change

Decision Tree for Drift Resolution:

graph TD
    START[Detect Drift] --> CHECK{Type of Change?}
    CHECK -->|Critical Config| REVERT[Force Revert]
    CHECK -->|Performance Tuning| ACCEPT[Accept & Commit]
    CHECK -->|Debugging Change| DECIDE{Production?}

    REVERT --> RECONCILE[Reconcile Resource]
    ACCEPT --> COMMIT[Commit to Git]
    DECIDE -->|Yes| REVERT
    DECIDE -->|No| ACCEPT

    RECONCILE --> VERIFY[Verify Fixed]
    COMMIT --> VERIFY
    VERIFY --> DONE[Complete]

    style REVERT fill:#FFB6C1
    style ACCEPT fill:#90EE90

Force Revert Drift:

#!/bin/bash
# scripts/revert-drift.sh

RESOURCE_TYPE="${1}"  # e.g., deployment
RESOURCE_NAME="${2}"
NAMESPACE="${3:-atp-production}"

echo "🔄 Reverting drift for ${RESOURCE_TYPE}/${RESOURCE_NAME}"

# Diff desired state from Git (--path points at the kustomization directory, not a resource)
flux diff kustomization apps-production -n flux-system \
  --path ./apps/atp-gateway/overlays/production | grep "${RESOURCE_NAME}"

# Force reconciliation
flux reconcile kustomization apps-production -n flux-system --with-source

# Verify reverted (expected-state.yaml: the manifest rendered from Git,
# e.g. via 'flux build kustomization apps-production --path <dir>')
kubectl get "${RESOURCE_TYPE}" "${RESOURCE_NAME}" -n "${NAMESPACE}" -o yaml | \
  diff - expected-state.yaml

Accept and Commit Drift:

#!/bin/bash
# scripts/accept-drift.sh

RESOURCE_TYPE="${1}"
RESOURCE_NAME="${2}"
NAMESPACE="${3:-atp-production}"

echo "✅ Accepting drift for ${RESOURCE_TYPE}/${RESOURCE_NAME}"

# Export current state (strip runtime fields such as status, uid,
# resourceVersion and managedFields before committing)
kubectl get "${RESOURCE_TYPE}" "${RESOURCE_NAME}" -n "${NAMESPACE}" -o yaml > \
  manifests/apps/atp-gateway/base/${RESOURCE_TYPE}-${RESOURCE_NAME}.yaml

# Commit to Git
git add manifests/
git commit -m "Accept drift: ${RESOURCE_TYPE}/${RESOURCE_NAME} in ${NAMESPACE}"
git push

# Reconcile to sync
flux reconcile source git atp-gitops-production -n flux-system

Investigating Drift Causes

Drift Investigation Workflow:

#!/bin/bash
# scripts/investigate-drift.sh

RESOURCE="${1}"
NAMESPACE="${2:-atp-production}"

echo "🔬 Investigating drift cause for: ${RESOURCE}"

# Step 1: Check resource history
echo "📜 Step 1: Resource change history..."
kubectl get events -n "${NAMESPACE}" --field-selector involvedObject.name="${RESOURCE}" --sort-by='.lastTimestamp'

# Step 2: Check audit logs
echo "📋 Step 2: Kubernetes audit logs..."
# Query Azure Monitor Log Analytics for audit logs
cat <<EOF
AzureLogAnalytics Query:
AzureActivity
| where ResourceProvider == "Microsoft.ContainerService"
| where OperationName contains "write"
| where Properties contains "${RESOURCE}"
| order by TimeGenerated desc
EOF

# Step 3: Check FluxCD reconciliation history
echo "🔄 Step 3: FluxCD reconciliation history..."
kubectl get kustomization -n flux-system -o json | \
  jq -r '.items[] | select(.status.inventory.entries[]?.id | contains("'"${RESOURCE}"'")) |
    "\(.metadata.name): last applied revision \(.status.lastAppliedRevision)"'

# Step 4: Compare with Git
echo "📦 Step 4: Compare with Git state..."
flux diff kustomization apps-production -n flux-system | grep "${RESOURCE}"

Image Pull Errors

ACR Authentication Failures

Troubleshooting ACR Authentication:

#!/bin/bash
# scripts/troubleshoot-acr-auth.sh

NAMESPACE="${1:-atp-production}"
POD_NAME="${2}"

echo "🔐 Troubleshooting ACR authentication..."

# Check image pull secrets
echo "📋 Image pull secrets:"
kubectl get secret -n "${NAMESPACE}" | grep -i "docker\|acr\|registry"

# Check pod's image pull secret
if [ -n "${POD_NAME}" ]; then
  echo "🔍 Pod image pull secrets:"
  kubectl get pod "${POD_NAME}" -n "${NAMESPACE}" -o jsonpath='{.spec.imagePullSecrets[*].name}'

  # Image pulls happen on the node via containerd, not inside app containers,
  # so check recent pull events for the pod instead
  echo "🌐 Recent image pull events for pod:"
  kubectl get events -n "${NAMESPACE}" --field-selector involvedObject.name="${POD_NAME}" | grep -i "pull" || true
fi

# Check ACR authentication
ACR_NAME="${3:-connectsoft}"  # registry name, not the login server FQDN
echo "🔑 Checking ACR access..."
az acr repository list --name "${ACR_NAME}" --output table

Fix ACR Authentication:

# Create ACR pull secret from admin credentials (troubleshooting only; prefer
# the AKS-ACR integration or Workload Identity for production)
az acr login --name connectsoft

# Create Kubernetes secret
kubectl create secret docker-registry acr-secret \
  --docker-server=connectsoft.azurecr.io \
  --docker-username=00000000-0000-0000-0000-000000000000 \
  --docker-password=$(az acr credential show --name connectsoft --query passwords[0].value -o tsv) \
  -n atp-production \
  --dry-run=client -o yaml | kubectl apply -f -

# Add to default service account
kubectl patch serviceaccount default -n atp-production -p '{"imagePullSecrets":[{"name":"acr-secret"}]}'

Image Not Found

Troubleshooting Missing Images:

#!/bin/bash
# scripts/troubleshoot-image-not-found.sh

IMAGE="${1}"
NAMESPACE="${2:-atp-production}"

echo "🔍 Troubleshooting image not found: ${IMAGE}"

# Check if image exists in ACR
REGISTRY=$(echo "${IMAGE}" | cut -d'/' -f1)
ACR_NAME="${REGISTRY%%.*}"  # az acr expects the registry name, not the FQDN
REPO_TAG=$(echo "${IMAGE}" | cut -d'/' -f2-)
REPO=$(echo "${REPO_TAG}" | cut -d':' -f1)
TAG=$(echo "${REPO_TAG}" | cut -d':' -f2)

echo "ACR: ${ACR_NAME}"
echo "Repository: ${REPO}"
echo "Tag: ${TAG}"

# Check ACR repository
az acr repository show --name "${ACR_NAME}" --repository "${REPO}" || \
  echo "❌ Repository not found"

# List tags
az acr repository show-tags --name "${ACR_NAME}" --repository "${REPO}" --output table

# Check pods with ImagePullBackOff
echo "🚨 Pods with ImagePullBackOff:"
kubectl get pods -n "${NAMESPACE}" -o json | \
  jq -r '.items[] | select(.status.containerStatuses[]?.state.waiting.reason == "ImagePullBackOff") |
    "\(.metadata.name): \(.spec.containers[].image)"'
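The ad-hoc `cut`-based parsing above is fragile once reused elsewhere; a slightly more robust sketch of the same split (assumes a registry-qualified reference; digest references like `@sha256:...` are out of scope):

```shell
# parse_image_ref: split a registry-qualified OCI image reference into
# registry, repository and tag, defaulting the tag to "latest". Handles
# nested repository paths such as connectsoft.azurecr.io/atp/gateway:v1.2.3.
parse_image_ref() {
  local image="$1" registry rest repo tag
  registry="${image%%/*}"   # everything before the first slash
  rest="${image#*/}"        # repository path plus optional tag
  case "$rest" in
    *:*) repo="${rest%%:*}"; tag="${rest##*:}" ;;
    *)   repo="$rest"; tag="latest" ;;
  esac
  echo "registry=$registry repo=$repo tag=$tag"
}
```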

Fix Missing Image:

# Rebuild and push image
docker build -t connectsoft.azurecr.io/atp/gateway:v1.2.3 .
docker push connectsoft.azurecr.io/atp/gateway:v1.2.3

# Update manifest
kustomize edit set image connectsoft.azurecr.io/atp/gateway:v1.2.3

# Commit and push
git add .
git commit -m "Fix: Update image tag to v1.2.3"
git push

ImagePullBackOff Troubleshooting

Diagnose ImagePullBackOff:

#!/bin/bash
# scripts/diagnose-imagepullbackoff.sh

NAMESPACE="${1:-atp-production}"

echo "🚨 Diagnosing ImagePullBackOff errors..."

# Find pods with ImagePullBackOff
kubectl get pods -n "${NAMESPACE}" -o json | \
  jq -r '.items[] | select(.status.containerStatuses[]?.state.waiting.reason == "ImagePullBackOff") |
    "Pod: \(.metadata.name)\n  Image: \(.spec.containers[].image)\n  Reason: \(.status.containerStatuses[].state.waiting.reason)\n  Message: \(.status.containerStatuses[].state.waiting.message)\n---"'

# Describe pod for details
PODS=$(kubectl get pods -n "${NAMESPACE}" -o json | \
  jq -r '.items[] | select(.status.containerStatuses[]?.state.waiting.reason == "ImagePullBackOff") | .metadata.name')

for POD in ${PODS}; do
  echo "📋 Details for pod: ${POD}"
  kubectl describe pod "${POD}" -n "${NAMESPACE}" | grep -A 10 "Events:"
done

# Check events
kubectl get events -n "${NAMESPACE}" --sort-by='.lastTimestamp' | grep -i "pull\|image\|backoff"

Common ImagePullBackOff Causes:

| Cause | Symptom | Fix |
|---|---|---|
| Image doesn't exist | `manifest unknown` | Rebuild and push image |
| Authentication failed | `unauthorized` | Fix ACR credentials |
| Network issue | `timeout` | Check network policies, DNS |
| Wrong tag | `not found` | Update image tag in manifest |
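When triaging many pods at once, the mapping in the table above can be automated. A sketch (the function name and match patterns are illustrative, not exhaustive):

```shell
# classify_pull_error: map the waiting-state message of a failing container
# to a likely root cause, mirroring the table above.
classify_pull_error() {
  case "$1" in
    *"manifest unknown"*|*"not found"*)          echo "image-or-tag-missing" ;;
    *unauthorized*|*"authentication required"*)  echo "acr-auth-failure" ;;
    *"i/o timeout"*|*timeout*)                   echo "network-issue" ;;
    *)                                           echo "unknown" ;;
  esac
}

# Usage: feed it the message field extracted by the jq query above, e.g.
# classify_pull_error "$(kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[0].state.waiting.message}')"
```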

Resource Conflicts

"Already Exists" Errors

Resolve "Already Exists" Errors:

#!/bin/bash
# scripts/resolve-already-exists.sh

RESOURCE_TYPE="${1}"
RESOURCE_NAME="${2}"
NAMESPACE="${3:-atp-production}"

echo "🔧 Resolving 'already exists' error for ${RESOURCE_TYPE}/${RESOURCE_NAME}"

# Check if resource exists
if kubectl get "${RESOURCE_TYPE}" "${RESOURCE_NAME}" -n "${NAMESPACE}" &>/dev/null; then
  echo "✅ Resource exists"

  # Check if managed by FluxCD
  MANAGED=$(kubectl get "${RESOURCE_TYPE}" "${RESOURCE_NAME}" -n "${NAMESPACE}" -o jsonpath='{.metadata.labels.kustomize\.toolkit\.fluxcd\.io/name}')

  if [ -z "${MANAGED}" ]; then
    echo "⚠️  Resource not managed by FluxCD"
    echo "   Options:"
    echo "   1. Adopt resource: kubectl label ${RESOURCE_TYPE} ${RESOURCE_NAME} kustomize.toolkit.fluxcd.io/name=apps-production -n ${NAMESPACE}"
    echo "   2. Delete resource: kubectl delete ${RESOURCE_TYPE} ${RESOURCE_NAME} -n ${NAMESPACE}"
  else
    echo "✅ Resource managed by FluxCD: ${MANAGED}"
    echo "   Force reconciliation: flux reconcile kustomization ${MANAGED} -n flux-system"
  fi
else
  echo "❌ Resource does not exist"
fi

Immutable Field Errors

Handle Immutable Field Changes:

#!/bin/bash
# scripts/handle-immutable-fields.sh

RESOURCE_TYPE="${1}"
RESOURCE_NAME="${2}"
NAMESPACE="${3:-atp-production}"

echo "🔒 Handling immutable field changes for ${RESOURCE_TYPE}/${RESOURCE_NAME}"

# Common immutable fields
# - Deployment: selector, template.labels
# - Service: clusterIP (if set)
# - StatefulSet: volumeClaimTemplates

# For immutable fields, delete and recreate
echo "⚠️  Immutable field detected. Need to delete and recreate."

# Step 1: Export current resource
kubectl get "${RESOURCE_TYPE}" "${RESOURCE_NAME}" -n "${NAMESPACE}" -o yaml > \
  backup-${RESOURCE_NAME}.yaml

# Step 2: Delete resource
kubectl delete "${RESOURCE_TYPE}" "${RESOURCE_NAME}" -n "${NAMESPACE}"

# Step 3: Reconcile to recreate
flux reconcile kustomization apps-production -n flux-system --with-source

echo "✅ Resource recreated"

Owner Reference Conflicts

Resolve Owner Reference Conflicts:

#!/bin/bash
# scripts/resolve-owner-conflicts.sh

RESOURCE="${1}"
NAMESPACE="${2:-atp-production}"

echo "🔗 Resolving owner reference conflicts for: ${RESOURCE}"

# Check owner references
kubectl get "${RESOURCE}" -n "${NAMESPACE}" -o jsonpath='{.metadata.ownerReferences[*].kind}'

# Remove conflicting owner reference
kubectl patch "${RESOURCE}" -n "${NAMESPACE}" --type=json \
  -p='[{"op": "remove", "path": "/metadata/ownerReferences"}]'

# Or adopt resource properly
kubectl label "${RESOURCE}" -n "${NAMESPACE}" \
  kustomize.toolkit.fluxcd.io/name=apps-production \
  kustomize.toolkit.fluxcd.io/namespace=flux-system

Secret Access Failures

Workload Identity Misconfiguration

Troubleshoot Workload Identity:

#!/bin/bash
# scripts/troubleshoot-workload-identity.sh

NAMESPACE="${1:-atp-production}"
SERVICE_ACCOUNT="${2:-default}"

echo "🔐 Troubleshooting Workload Identity..."

# Check ServiceAccount annotations
echo "📋 ServiceAccount annotations:"
kubectl get serviceaccount "${SERVICE_ACCOUNT}" -n "${NAMESPACE}" -o jsonpath='{.metadata.annotations}' | jq .

# Check federated credentials in Azure AD
AZURE_CLIENT_ID=$(kubectl get serviceaccount "${SERVICE_ACCOUNT}" -n "${NAMESPACE}" -o jsonpath='{.metadata.annotations.azure\.workload\.identity/client-id}')
echo "Azure Client ID: ${AZURE_CLIENT_ID}"

# Check pods opted in to Workload Identity (via the azure.workload.identity/use label)
echo "📦 Pods using Workload Identity:"
kubectl get pods -n "${NAMESPACE}" -o json | \
  jq -r '.items[] | select(.metadata.labels."azure.workload.identity/use" == "true") |
    "Pod: \(.metadata.name)\n  ServiceAccount: \(.spec.serviceAccountName)"'

# Test token acquisition from pod
POD=$(kubectl get pod -n "${NAMESPACE}" -l app=atp-gateway -o jsonpath='{.items[0].metadata.name}')
if [ -n "${POD}" ]; then
  echo "🧪 Testing token acquisition from pod: ${POD}"
  kubectl exec -it "${POD}" -n "${NAMESPACE}" -- \
    cat /var/run/secrets/azure/tokens/azure-identity-token 2>&1 || echo "❌ Token not available"
fi

Key Vault Permission Issues

Check Key Vault Permissions:

#!/bin/bash
# scripts/check-keyvault-permissions.sh

KEY_VAULT="${1:-atp-keyvault}"
IDENTITY="${2}"  # Managed identity client ID

echo "🔑 Checking Key Vault permissions..."

# Check access policies
az keyvault show --name "${KEY_VAULT}" --query "properties.accessPolicies[].objectId"

# Check RBAC permissions (SUBSCRIPTION_ID and RG must be set in the environment)
if [ -n "${IDENTITY}" ]; then
  echo "Checking RBAC for identity: ${IDENTITY}"
  az role assignment list --assignee "${IDENTITY}" --scope "/subscriptions/${SUBSCRIPTION_ID}/resourceGroups/${RG}/providers/Microsoft.KeyVault/vaults/${KEY_VAULT}"
fi

# Test secret access
SECRET_NAME="test-secret"
az keyvault secret show --vault-name "${KEY_VAULT}" --name "${SECRET_NAME}" || \
  echo "❌ Cannot access secret: ${SECRET_NAME}"

ExternalSecret Sync Failures

Troubleshoot ExternalSecret:

#!/bin/bash
# scripts/troubleshoot-externalsecret.sh

SECRET_NAME="${1}"
NAMESPACE="${2:-atp-production}"

echo "🔍 Troubleshooting ExternalSecret: ${SECRET_NAME}"

# Check ExternalSecret status
kubectl get externalsecret "${SECRET_NAME}" -n "${NAMESPACE}" -o yaml | \
  grep -A 20 "status:"

# Check ClusterSecretStore
STORE=$(kubectl get externalsecret "${SECRET_NAME}" -n "${NAMESPACE}" -o jsonpath='{.spec.secretStoreRef.name}')
echo "SecretStore: ${STORE}"
kubectl get clustersecretstore "${STORE}" -o yaml

# Check external-secrets operator logs
kubectl logs -n external-secrets-system -l app.kubernetes.io/name=external-secrets --tail=100 | \
  grep -i "${SECRET_NAME}"

# Force refresh
kubectl annotate externalsecret "${SECRET_NAME}" -n "${NAMESPACE}" \
  force-sync=$(date +%s) --overwrite

Health Check Failures

Readiness Probe Timeouts

Troubleshoot Readiness Probes:

#!/bin/bash
# scripts/troubleshoot-readiness.sh

POD="${1}"
NAMESPACE="${2:-atp-production}"

echo "🏥 Troubleshooting readiness probe for pod: ${POD}"

# Check pod status
kubectl get pod "${POD}" -n "${NAMESPACE}" -o yaml | \
  grep -A 10 "readinessProbe:"

# Check probe configuration
kubectl get pod "${POD}" -n "${NAMESPACE}" -o jsonpath='{.spec.containers[*].readinessProbe}' | jq .

# Test probe endpoint manually
ENDPOINT=$(kubectl get pod "${POD}" -n "${NAMESPACE}" -o jsonpath='{.spec.containers[0].readinessProbe.httpGet.path}')
PORT=$(kubectl get pod "${POD}" -n "${NAMESPACE}" -o jsonpath='{.spec.containers[0].readinessProbe.httpGet.port}')

echo "Testing endpoint: http://localhost:${PORT}${ENDPOINT}"
kubectl exec -it "${POD}" -n "${NAMESPACE}" -- \
  curl -f http://localhost:${PORT}${ENDPOINT} || echo "❌ Probe endpoint failed"

# Check events
kubectl describe pod "${POD}" -n "${NAMESPACE}" | grep -A 5 "Events:"

Liveness Probe Failures

Troubleshoot Liveness Probes:

#!/bin/bash
# scripts/troubleshoot-liveness.sh

POD="${1}"
NAMESPACE="${2:-atp-production}"

echo "💓 Troubleshooting liveness probe for pod: ${POD}"

# Check if pod is restarting
RESTARTS=$(kubectl get pod "${POD}" -n "${NAMESPACE}" -o jsonpath='{.status.containerStatuses[0].restartCount}')
echo "Restart count: ${RESTARTS}"

# Check liveness probe config
kubectl get pod "${POD}" -n "${NAMESPACE}" -o jsonpath='{.spec.containers[*].livenessProbe}' | jq .

# Check previous container logs (if restarted)
if [ "${RESTARTS}" -gt 0 ]; then
  echo "📜 Previous container logs:"
  kubectl logs "${POD}" -n "${NAMESPACE}" --previous --tail=50
fi

# Check current logs
echo "📜 Current container logs:"
kubectl logs "${POD}" -n "${NAMESPACE}" --tail=50

Debugging Health Endpoints

Health Endpoint Debugging:

#!/bin/bash
# scripts/debug-health-endpoint.sh

POD="${1}"
NAMESPACE="${2:-atp-production}"
ENDPOINT="${3:-/health}"

echo "🔍 Debugging health endpoint: ${ENDPOINT}"

# Get pod IP
POD_IP=$(kubectl get pod "${POD}" -n "${NAMESPACE}" -o jsonpath='{.status.podIP}')
echo "Pod IP: ${POD_IP}"

# Test from within pod
kubectl exec -it "${POD}" -n "${NAMESPACE}" -- \
  curl -v http://localhost:8080${ENDPOINT} 2>&1

# Test from another pod
kubectl run debug-pod --image=curlimages/curl --rm -it --restart=Never -n "${NAMESPACE}" -- \
  curl -v http://${POD_IP}:8080${ENDPOINT}

# Check application logs
kubectl logs "${POD}" -n "${NAMESPACE}" --tail=100 | grep -i "health\|ready\|startup"

Networking Issues

Service Discovery Failures

Troubleshoot Service Discovery:

#!/bin/bash
# scripts/troubleshoot-service-discovery.sh

SERVICE="${1}"
NAMESPACE="${2:-atp-production}"

echo "🌐 Troubleshooting service discovery for: ${SERVICE}"

# Check service exists
kubectl get service "${SERVICE}" -n "${NAMESPACE}"

# Check endpoints
kubectl get endpoints "${SERVICE}" -n "${NAMESPACE}"

# Test DNS resolution
POD=$(kubectl get pod -n "${NAMESPACE}" -o jsonpath='{.items[0].metadata.name}')
kubectl exec -it "${POD}" -n "${NAMESPACE}" -- \
  nslookup "${SERVICE}.${NAMESPACE}.svc.cluster.local"

# Test service connectivity
kubectl run test-pod --image=curlimages/curl --rm -it --restart=Never -n "${NAMESPACE}" -- \
  curl -v http://${SERVICE}.${NAMESPACE}.svc.cluster.local:8080

DNS Resolution Problems

Troubleshoot DNS:

#!/bin/bash
# scripts/troubleshoot-dns.sh

NAMESPACE="${1:-atp-production}"

echo "🔍 Troubleshooting DNS resolution..."

# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50

# Test DNS from pod
POD=$(kubectl get pod -n "${NAMESPACE}" -o jsonpath='{.items[0].metadata.name}')
kubectl exec -it "${POD}" -n "${NAMESPACE}" -- \
  nslookup kubernetes.default.svc.cluster.local

# Check DNS configuration
kubectl get configmap coredns -n kube-system -o yaml
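When constructing in-cluster DNS names by hand, note that pod A records use the pod IP with dots replaced by dashes, not the pod name. A sketch of both forms, assuming the default `cluster.local` domain:

```shell
# svc_fqdn: DNS name for a Service: <service>.<namespace>.svc.cluster.local
svc_fqdn() {
  echo "$1.$2.svc.cluster.local"
}

# pod_fqdn: DNS name for a Pod, built from its IP (dots become dashes):
# <ip-with-dashes>.<namespace>.pod.cluster.local
pod_fqdn() {
  echo "$(echo "$1" | tr '.' '-').$2.pod.cluster.local"
}

# Usage: resolve a pod's IP first, then build the name:
# POD_IP=$(kubectl get pod <pod> -n <ns> -o jsonpath='{.status.podIP}')
# nslookup "$(pod_fqdn "$POD_IP" <ns>)"
```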

Network Policy Blocking Traffic

Troubleshoot Network Policies:

#!/bin/bash
# scripts/troubleshoot-network-policy.sh

NAMESPACE="${1:-atp-production}"

echo "🔒 Troubleshooting network policies..."

# List network policies
kubectl get networkpolicies -n "${NAMESPACE}"

# Check if default deny is blocking
kubectl get networkpolicy default-deny-all -n "${NAMESPACE}" && \
  echo "⚠️  Default deny policy found"

# Test connectivity between pods
SOURCE_POD=$(kubectl get pod -n "${NAMESPACE}" -l app=atp-gateway -o jsonpath='{.items[0].metadata.name}')
TARGET_POD=$(kubectl get pod -n "${NAMESPACE}" -l app=atp-ingestion -o jsonpath='{.items[0].metadata.name}')

if [ -n "${SOURCE_POD}" ] && [ -n "${TARGET_POD}" ]; then
  # Pod DNS records use the IP with dashes, not the pod name, so curl the IP directly
  TARGET_IP=$(kubectl get pod "${TARGET_POD}" -n "${NAMESPACE}" -o jsonpath='{.status.podIP}')
  echo "Testing connectivity from ${SOURCE_POD} to ${TARGET_POD} (${TARGET_IP})"
  kubectl exec -it "${SOURCE_POD}" -n "${NAMESPACE}" -- \
    curl -v http://${TARGET_IP}:8080 || \
    echo "❌ Connection blocked"
fi

# Last resort: remove network policies for testing (caution: affects all traffic
# in the namespace; if the policies are managed in Git, FluxCD re-creates them
# on the next reconciliation)
echo "To test without network policies:"
echo "kubectl delete networkpolicy -n ${NAMESPACE} --all"

Performance Issues

Slow Reconciliation

Troubleshoot Slow Reconciliation:

#!/bin/bash
# scripts/troubleshoot-slow-reconciliation.sh

KUSTOMIZATION="${1:-apps-production}"

echo "⏱️  Troubleshooting slow reconciliation..."

# Check reconciliation duration
kubectl get kustomization "${KUSTOMIZATION}" -n flux-system -o json | \
  jq -r '.status.conditions[] | select(.type == "Ready") | 
    "Last reconciliation: \(.lastTransitionTime)\nMessage: \(.message)"'

# Check reconciliation interval
INTERVAL=$(kubectl get kustomization "${KUSTOMIZATION}" -n flux-system -o jsonpath='{.spec.interval}')
echo "Reconciliation interval: ${INTERVAL}"

# Check number of resources
RESOURCE_COUNT=$(kubectl get kustomization "${KUSTOMIZATION}" -n flux-system -o json | \
  jq '.status.inventory.entries | length')
echo "Number of resources: ${RESOURCE_COUNT}"

# Check for large manifests
echo "Checking manifest sizes..."
find manifests/ -name "*.yaml" -exec wc -l {} \; | sort -rn | head -5

# Force reconciliation and measure time
echo "Forcing reconciliation and measuring time..."
time flux reconcile kustomization "${KUSTOMIZATION}" -n flux-system --with-source

Resource Contention

Check Resource Contention:

#!/bin/bash
# scripts/check-resource-contention.sh

NAMESPACE="${1:-atp-production}"

echo "📊 Checking resource contention..."

# Check node resources
kubectl top nodes

# Check pod resources
kubectl top pods -n "${NAMESPACE}"

# Check for resource quotas
kubectl get resourcequota -n "${NAMESPACE}"

# Check for limit ranges
kubectl get limitrange -n "${NAMESPACE}"

# Find pods requesting too many resources
kubectl get pods -n "${NAMESPACE}" -o json | \
  jq -r '.items[] | select(.spec.containers[].resources.requests.cpu != null) |
    "\(.metadata.name): CPU=\(.spec.containers[].resources.requests.cpu) Memory=\(.spec.containers[].resources.requests.memory)"'
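Summing the CPU requests printed above requires normalizing Kubernetes quantities first. A minimal sketch (whole cores and millicores only; fractional values such as `0.5` are not handled):

```shell
# cpu_to_millicores: normalize a Kubernetes CPU quantity ("500m" or "2")
# to millicores so per-pod requests can be summed and compared.
cpu_to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;              # already millicores
    *)  echo $(( $1 * 1000 )) ;;      # whole cores -> millicores
  esac
}

# Usage: total=$(( $(cpu_to_millicores 500m) + $(cpu_to_millicores 2) ))
```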

Debugging Tools

kubectl Commands

Essential kubectl Commands:

# Get resources
kubectl get all -n atp-production
kubectl get pods -n atp-production -o wide
kubectl get events -n atp-production --sort-by='.lastTimestamp'

# Describe resources
kubectl describe pod <pod-name> -n atp-production
kubectl describe deployment <deployment> -n atp-production

# View logs
kubectl logs <pod-name> -n atp-production
kubectl logs <pod-name> -n atp-production --previous  # Previous container
kubectl logs -l app=atp-gateway -n atp-production --tail=100

# Execute commands
kubectl exec -it <pod-name> -n atp-production -- /bin/sh
kubectl exec <pod-name> -n atp-production -- env

# Port forwarding
kubectl port-forward <pod-name> 8080:8080 -n atp-production

# Debugging
kubectl run debug-pod --image=curlimages/curl --rm -it --restart=Never -n atp-production
kubectl debug <pod-name> -n atp-production -it --image=busybox

Flux CLI Commands

Essential Flux Commands:

# Check FluxCD status
flux check
flux get all -A

# Get resources
flux get sources git -A
flux get kustomizations -A
flux get helmreleases -A

# Reconcile resources
flux reconcile source git atp-gitops-production -n flux-system
flux reconcile kustomization apps-production -n flux-system --with-source
flux reconcile helmrelease ingress-nginx -n ingress-nginx

# Suspend/Resume
flux suspend kustomization apps-production -n flux-system
flux resume kustomization apps-production -n flux-system

# Diff and dry-run
flux diff kustomization apps-production -n flux-system
flux build kustomization apps-production -n flux-system

# View logs
flux logs --kind=Kustomization --name=apps-production -n flux-system

Azure CLI for AKS Debugging

Azure CLI AKS Commands:

# Get cluster credentials
az aks get-credentials --resource-group atp-production-rg --name atp-production-aks

# Get cluster info
az aks show --resource-group atp-production-rg --name atp-production-aks

# List node pools
az aks nodepool list --resource-group atp-production-rg --cluster-name atp-production-aks

# Scale node pool
az aks nodepool scale \
  --resource-group atp-production-rg \
  --cluster-name atp-production-aks \
  --name systempool \
  --node-count 5

# Get admin credentials (bypasses Azure AD) for break-glass troubleshooting
az aks get-credentials --resource-group atp-production-rg --name atp-production-aks --admin
kubectl get nodes

Log Analysis in Log Analytics

KQL Queries for Troubleshooting:

// FluxCD reconciliation failures
ContainerLog
| where Namespace == "flux-system"
| where LogEntry contains "error" or LogEntry contains "failed"
| where LogEntry contains "reconcile"
| project TimeGenerated, PodName, LogEntry
| order by TimeGenerated desc

// Pod restart analysis
KubePodInventory
| where Namespace == "atp-production"
| where ContainerRestartCount > 0
| project TimeGenerated, Namespace, PodName, ContainerRestartCount
| order by ContainerRestartCount desc

// Image pull errors
ContainerLog
| where LogEntry contains "ImagePullBackOff" or LogEntry contains "ErrImagePull"
| project TimeGenerated, Namespace, PodName, LogEntry
| order by TimeGenerated desc

// Health check failures
ContainerLog
| where LogEntry contains "readiness probe failed" or LogEntry contains "liveness probe failed"
| project TimeGenerated, Namespace, PodName, LogEntry
| order by TimeGenerated desc

Common Error Patterns

Error Catalog

Common Errors and Solutions:

| Error | Cause | Solution |
|---|---|---|
| ImagePullBackOff | Image not found or auth failed | Check ACR credentials, verify image exists |
| CrashLoopBackOff | Application crashing | Check application logs, health endpoints |
| Pending pod | Insufficient resources | Check node capacity, resource quotas |
| ErrImagePull | Cannot pull image | Fix ACR authentication, network policies |
| CreateContainerConfigError | Secret/config not found | Check secret exists, mount paths |
| Readiness probe failed | Health endpoint not ready | Check application startup, probe config |
| Network policy blocking | Traffic blocked | Update network policy rules |
| PVC pending | Storage class not found | Check StorageClass exists |
| Reconcile timeout | Too many resources | Increase timeout, optimize manifests |
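The catalog above can be condensed into a small triage helper for first-response scripts. This is an illustrative sketch only — the `suggest_fix` function and its messages are assumptions, not part of kubectl or any tool:

```shell
#!/bin/bash
# suggest_fix: map a pod error reason (as shown by `kubectl get pods`)
# to the first remediation step from the error catalog above.
# Illustrative helper, not a real tool.
suggest_fix() {
  case "$1" in
    ImagePullBackOff|ErrImagePull)
      echo "Check ACR credentials and verify the image tag exists" ;;
    CrashLoopBackOff)
      echo "Check application logs and health endpoints" ;;
    Pending)
      echo "Check node capacity and resource quotas" ;;
    CreateContainerConfigError)
      echo "Check that referenced Secrets and ConfigMaps exist" ;;
    *)
      echo "Unknown reason: describe the pod and inspect events" ;;
  esac
}

suggest_fix "ImagePullBackOff"  # prints: Check ACR credentials and verify the image tag exists
```

A wrapper could feed it the STATUS column of `kubectl get pods` for quick on-call triage.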

Decision Tree for Common Errors:

graph TD
    START[Pod Not Running] --> CHECK{Error Type?}
    CHECK -->|ImagePullBackOff| IMAGE[Check ACR Auth<br/>Verify Image Exists]
    CHECK -->|CrashLoopBackOff| LOGS[Check Logs<br/>Health Endpoints]
    CHECK -->|Pending| RESOURCES[Check Resources<br/>Node Capacity]
    CHECK -->|NotReady| PROBE[Check Probes<br/>Application Health]

    IMAGE --> FIX1[Fix Credentials<br/>Rebuild Image]
    LOGS --> FIX2[Fix Application<br/>Update Config]
    RESOURCES --> FIX3[Scale Nodes<br/>Adjust Requests]
    PROBE --> FIX4[Fix Endpoints<br/>Adjust Probes]

    FIX1 --> RECONCILE[Reconcile]
    FIX2 --> RECONCILE
    FIX3 --> RECONCILE
    FIX4 --> RECONCILE
    RECONCILE --> DONE[Verify Fixed]

    style START fill:#FFB6C1
    style DONE fill:#90EE90

Escalation Procedures

When to Escalate

Escalation Triggers:

| Severity | Criteria | Response Time |
|---|---|---|
| P0 - Critical | Production down, data loss | Immediate (15 min) |
| P1 - High | Partial outage, degraded performance | 1 hour |
| P2 - Medium | Non-critical issue, workaround available | 4 hours |
| P3 - Low | Minor issue, cosmetic | Next business day |

Escalation Decision Tree:

graph TD
    START[Issue Detected] --> IMPACT{Impact?}
    IMPACT -->|Production Down| P0[P0 - Escalate Immediately]
    IMPACT -->|Degraded Service| P1[P1 - Escalate within 1h]
    IMPACT -->|Workaround Available| P2[P2 - Escalate within 4h]
    IMPACT -->|Minor Issue| P3[P3 - Next Business Day]

    P0 --> ONCALL[Page On-Call Engineer]
    P1 --> TEAM[Notify Team Lead]
    P2 --> TICKET[Create Ticket]
    P3 --> BACKLOG[Add to Backlog]

    style P0 fill:#FF0000
    style P1 fill:#FFA500
    style P2 fill:#FFFF00
    style P3 fill:#90EE90

Information to Collect

Pre-Escalation Checklist:

#!/bin/bash
# scripts/collect-debug-info.sh

ISSUE="${1}"
NAMESPACE="${2:-atp-production}"

echo "📋 Collecting debug information for escalation..."

# Create debug directory
DEBUG_DIR="debug-$(date +%Y%m%d-%H%M%S)"
mkdir -p "${DEBUG_DIR}"

# Cluster info
kubectl cluster-info > "${DEBUG_DIR}/cluster-info.txt"
kubectl get nodes -o wide > "${DEBUG_DIR}/nodes.txt"

# Resource status
kubectl get all -n "${NAMESPACE}" > "${DEBUG_DIR}/resources.txt"
kubectl get events -n "${NAMESPACE}" --sort-by='.lastTimestamp' > "${DEBUG_DIR}/events.txt"

# FluxCD status
flux get all -A > "${DEBUG_DIR}/flux-status.txt"
kubectl get kustomization -A -o yaml > "${DEBUG_DIR}/kustomizations.yaml"

# Logs
kubectl logs -n flux-system -l app=kustomize-controller --tail=200 > "${DEBUG_DIR}/flux-logs.txt"
kubectl logs -n "${NAMESPACE}" --all-containers --tail=100 > "${DEBUG_DIR}/app-logs.txt"

# Network policies
kubectl get networkpolicies -n "${NAMESPACE}" -o yaml > "${DEBUG_DIR}/network-policies.yaml"

# Secrets (sanitized)
kubectl get secrets -n "${NAMESPACE}" -o json | \
  jq '.items[] | {name: .metadata.name, type: .type}' > "${DEBUG_DIR}/secrets-list.json"

# Package debug info
tar -czf "${DEBUG_DIR}.tar.gz" "${DEBUG_DIR}"
echo "✅ Debug information collected: ${DEBUG_DIR}.tar.gz"

Incident Severity Levels

Severity Level Definitions:

| Level | Description | Examples | Response |
|---|---|---|---|
| P0 - Critical | Complete service outage, data loss risk | All pods down, database inaccessible | Immediate, on-call escalation |
| P1 - High | Partial outage, significant impact | 50% pods down, slow response times | 1 hour, team notification |
| P2 - Medium | Degraded service, workaround available | Single service down, minor features broken | 4 hours, ticket creation |
| P3 - Low | Minor issue, no user impact | Documentation issue, cosmetic bug | Next business day, backlog |
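The severity-to-response-time mapping above can also live in automation (e.g. a paging script). A minimal sketch, assuming the hypothetical function name `response_time`:

```shell
#!/bin/bash
# response_time: look up the target response time for a severity level,
# mirroring the severity table above (illustrative helper).
response_time() {
  case "$1" in
    P0) echo "Immediate (15 min)" ;;
    P1) echo "1 hour" ;;
    P2) echo "4 hours" ;;
    P3) echo "Next business day" ;;
    *)  echo "Unknown severity"; return 1 ;;
  esac
}

response_time P1   # prints: 1 hour
```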

Summary: Troubleshooting GitOps Issues

  • FluxCD Sync Failures: Authentication issues (Git credentials), manifest syntax errors, resource conflicts (already exists), timeout errors, network connectivity issues
  • Drift Detection and Resolution: Identifying drifted resources, manual changes detection, revert drift vs accept change decision tree, investigating drift causes
  • Image Pull Errors: ACR authentication failures, image not found, ImagePullBackOff troubleshooting with diagnostic scripts
  • Resource Conflicts: "already exists" errors, immutable field errors, owner reference conflicts, resolution strategies
  • Secret Access Failures: Workload Identity misconfiguration, Key Vault permission issues, ExternalSecret sync failures
  • Health Check Failures: Readiness probe timeouts, liveness probe failures, debugging health endpoints
  • Networking Issues: Service discovery failures, DNS resolution problems, network policy blocking traffic
  • Performance Issues: Slow reconciliation, resource contention, high CPU/memory usage, throttling and rate limiting
  • Debugging Tools: kubectl commands (get, describe, logs, exec), Flux CLI commands (get, reconcile, suspend), Azure CLI for AKS, Log Analytics KQL queries
  • Common Error Patterns: Error catalog with solutions, decision trees for troubleshooting, known issues and workarounds
  • Escalation Procedures: When to escalate (severity levels), who to escalate to, information to collect (pre-escalation checklist), incident severity levels (P0-P3)

Day 2 Operations & Maintenance

Purpose: Define comprehensive day 2 operations, maintenance procedures, upgrade processes, capacity planning, security patching, performance tuning, and operational excellence practices for ATP's GitOps deployments, ensuring reliable, secure, and efficient long-term platform operations.


Routine Maintenance Tasks

Daily: Monitoring Checks, Alert Review

Daily Maintenance Checklist:

#!/bin/bash
# scripts/daily-maintenance-check.sh

echo "📋 Daily Maintenance Checklist - $(date +%Y-%m-%d)"
echo "=============================================="

# Check cluster health
echo "🏥 1. Cluster Health Check"
kubectl get nodes
kubectl get pods --all-namespaces | grep -v Running | grep -v Completed

# Check FluxCD status
echo "🔄 2. FluxCD Status"
# READY column shows True/False; flag anything not ready
flux get all -A | grep -w False && echo "   ⚠️  Some FluxCD resources not ready" || echo "   ✅ All FluxCD resources ready"

# Review critical alerts
echo "🚨 3. Critical Alerts Review"
# Query Azure Monitor for critical alerts from last 24 hours
cat <<EOF
Azure Monitor Query:
AzureMetrics
| where TimeGenerated > ago(24h)
| where MetricName contains "error" or MetricName contains "failure"
| where Value > 0
| summarize count() by MetricName
| order by count_ desc
EOF

# Check certificate expiration
echo "🔐 4. Certificate Expiration Check"
kubectl get certificates --all-namespaces -o json | \
  jq -r '.items[] | select(.status.conditions[]?.status == "True") |
    "\(.metadata.namespace)/\(.metadata.name): Expires \(.status.notAfter)"' | \
  while read cert; do
    EXPIRY=$(echo "${cert}" | cut -d' ' -f3-)
    DAYS_LEFT=$(( ($(date -d "${EXPIRY}" +%s) - $(date +%s)) / 86400 ))
    if [ "${DAYS_LEFT}" -lt 30 ]; then
      echo "   ⚠️  ${cert} (${DAYS_LEFT} days remaining)"
    fi
  done

# Check backup status
echo "💾 5. Backup Status"
velero backup get --namespace velero | head -6  # header + 5 most recent

# Check resource utilization
echo "📊 6. Resource Utilization"
kubectl top nodes
kubectl top pods -n atp-production --sort-by=memory | head -10

echo "✅ Daily checks complete"

Daily Alert Review Procedure:

## Daily Alert Review Process

### Critical Alerts (P0/P1)
1. Review all critical alerts from last 24 hours
2. Verify alerts are actionable (not false positives)
3. Document any new alert patterns
4. Escalate unresolved critical alerts

### Warning Alerts
1. Review warning alerts weekly (not daily)
2. Tune alert thresholds if needed
3. Document patterns for capacity planning

### Alert Noise Reduction
1. Disable or adjust noisy alerts
2. Add alert grouping rules
3. Update alert runbooks

Weekly: Capacity Planning, Cost Review

Weekly Maintenance Checklist:

#!/bin/bash
# scripts/weekly-maintenance-check.sh

echo "📋 Weekly Maintenance Checklist - Week $(date +%V)"
echo "=============================================="

# Capacity planning
echo "📈 1. Capacity Planning Review"
# Check node utilization trends
kubectl top nodes
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.capacity.cpu,MEMORY:.status.capacity.memory

# Check pod density
POD_COUNT=$(kubectl get pods --all-namespaces --field-selector=status.phase=Running --no-headers | wc -l)
NODE_COUNT=$(kubectl get nodes --no-headers | wc -l)
AVG_PODS_PER_NODE=$((POD_COUNT / NODE_COUNT))
echo "   Average pods per node: ${AVG_PODS_PER_NODE}"

# Storage growth trend
echo "💾 2. Storage Growth Analysis"
kubectl get pvc --all-namespaces -o json | \
  jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name): \(.status.capacity.storage)"' | \
  sort | uniq -c

# Cost review
echo "💰 3. Cost Review"
cat <<EOF
Azure Cost Management Query:
UsageDetails
| where TimeGenerated > ago(7d)
| where ResourceGroup contains "atp-production"
| summarize TotalCost=sum(Cost) by ResourceType
| order by TotalCost desc
EOF

# Review pending updates
echo "🔄 4. Pending Updates Review"
# Review source revisions and chart versions for pending updates
flux get sources all -A
flux get helmreleases -A

# Review failed deployments
echo "❌ 5. Failed Deployment Review"
kubectl get pods --all-namespaces --field-selector=status.phase!=Running,status.phase!=Succeeded -o wide

echo "✅ Weekly checks complete"
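The `AVG_PODS_PER_NODE` arithmetic in the script above uses integer division, which truncates (29 pods across 10 nodes reports 2). A rounded variant, isolated here as a standalone function (`avg_pods_per_node` is an illustrative name):

```shell
#!/bin/bash
# avg_pods_per_node: pods / nodes rounded to the nearest integer,
# instead of the truncating $((POD_COUNT / NODE_COUNT)).
avg_pods_per_node() {
  local pods="$1" nodes="$2"
  # Adding half the divisor before dividing rounds to nearest
  echo $(( (pods + nodes / 2) / nodes ))
}

avg_pods_per_node 29 10   # prints: 3
```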

Weekly Capacity Planning Report:

#!/bin/bash
# scripts/capacity-planning-report.sh

OUTPUT_FILE="capacity-report-$(date +%Y%m%d).md"

cat > "${OUTPUT_FILE}" <<EOF
# Capacity Planning Report - $(date +%Y-%m-%d)

## Node Utilization

\`\`\`
$(kubectl top nodes)
\`\`\`

## Pod Distribution

\`\`\`
$(kubectl get pods --all-namespaces -o wide | awk '{print $1, $8}' | sort | uniq -c)
\`\`\`

## Storage Usage

\`\`\`
$(kubectl get pvc --all-namespaces -o json | jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name): \(.status.capacity.storage)"')
\`\`\`

## Recommendations

- Review node pool sizes based on utilization trends
- Plan for expected growth in next quarter
- Consider right-sizing underutilized resources
EOF

echo "✅ Report generated: ${OUTPUT_FILE}"

Monthly: Security Patches, Access Reviews

Monthly Maintenance Checklist:

#!/bin/bash
# scripts/monthly-maintenance-check.sh

echo "📋 Monthly Maintenance Checklist - $(date +%Y-%m)"
echo "=============================================="

# Security patches
echo "🔒 1. Security Patch Review"
# Check for available Kubernetes version upgrades
az aks get-upgrades --resource-group atp-production-rg --name atp-production-aks

# Check container image vulnerabilities
echo "   Scanning for vulnerabilities..."
# Use Trivy or Azure Defender to scan images

# Access reviews
echo "👥 2. Access Reviews"
# List all RBAC bindings
kubectl get rolebindings --all-namespaces -o json | \
  jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name): \(.subjects[].name)"'

kubectl get clusterrolebindings -o json | \
  jq -r '.items[] | "\(.metadata.name): \(.subjects[].name)"'

# Review ServiceAccount usage
echo "   ServiceAccount review..."
kubectl get serviceaccounts --all-namespaces -o json | \
  jq -r '.items[] | select(.metadata.name != "default") | "\(.metadata.namespace)/\(.metadata.name)"'

# Backup verification
echo "💾 3. Backup Verification"
# Test restore from latest backup
velero backup get --namespace velero | head -5

# Compliance check
echo "✅ 4. Compliance Check"
# Check network policies are applied
kubectl get networkpolicies --all-namespaces | wc -l

# Check pod security standards
kubectl get pods --all-namespaces -o json | \
  jq -r '.items[] | select(.spec.securityContext == null) | "\(.metadata.namespace)/\(.metadata.name): Missing security context"'

echo "✅ Monthly checks complete"

Quarterly: DR Drills, Policy Updates

Quarterly Maintenance Schedule:

gantt
    title Quarterly Maintenance Calendar
    dateFormat YYYY-MM-DD
    section Q1
    DR Drill Production           :2024-01-15, 1d
    Policy Review                 :2024-01-20, 2d
    Security Audit                :2024-01-25, 3d
    section Q2
    DR Drill Production           :2024-04-15, 1d
    Policy Review                 :2024-04-20, 2d
    Security Audit                :2024-04-25, 3d
    section Q3
    DR Drill Production           :2024-07-15, 1d
    Policy Review                 :2024-07-20, 2d
    Security Audit                :2024-07-25, 3d
    section Q4
    DR Drill Production           :2024-10-15, 1d
    Policy Review                 :2024-10-20, 2d
    Security Audit                :2024-10-25, 3d

Quarterly DR Drill Procedure:

#!/bin/bash
# scripts/quarterly-dr-drill.sh

QUARTER="${1:-Q1}"
YEAR="${2:-2024}"

echo "🧪 Quarterly DR Drill - ${QUARTER} ${YEAR}"
echo "========================================="

# Step 1: Select random backup
echo "📥 Step 1: Selecting test backup..."
BACKUP=$(velero backup get --namespace velero | grep atp-production | tail -5 | shuf -n 1 | awk '{print $1}')
echo "   Using backup: ${BACKUP}"

# Step 2: Restore to test namespace
echo "🔄 Step 2: Restoring to test namespace..."
# Kubernetes namespaces and restore names must be lowercase DNS-1123 labels,
# so lowercase the quarter/year before building them
QY=$(echo "${QUARTER}-${YEAR}" | tr '[:upper:]' '[:lower:]')
TEST_NS="atp-production-dr-test-${QY}"
velero restore create "dr-drill-${QY}-$(date +%Y%m%d)" \
  --from-backup "${BACKUP}" \
  --namespace-mappings "atp-production:${TEST_NS}" \
  --wait

# Step 3: Validate restore
echo "✅ Step 3: Validating restore..."
kubectl get all -n "${TEST_NS}"
kubectl get pvc -n "${TEST_NS}"

# Step 4: Test application functionality
echo "🧪 Step 4: Testing application..."
# Run smoke tests against restored environment

# Step 5: Document results
echo "📝 Step 5: Documenting results..."
cat > "dr-drill-report-${QUARTER}-${YEAR}.md" <<EOF
# DR Drill Report - ${QUARTER} ${YEAR}

**Date**: $(date +%Y-%m-%d)
**Backup Used**: ${BACKUP}
**Test Namespace**: ${TEST_NS}

## Results

- Restore: ✅ Success
- Application Functionality: ✅ Verified
- Data Integrity: ✅ Verified

## Lessons Learned

- [Add lessons learned here]

## Recommendations

- [Add recommendations here]
EOF

# Step 6: Cleanup
read -p "Delete test namespace ${TEST_NS}? (yes/no): " CONFIRM
if [ "${CONFIRM}" = "yes" ]; then
  kubectl delete namespace "${TEST_NS}"
  echo "✅ Test namespace deleted"
fi

echo "✅ DR drill complete"
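Namespace names built from variables like `${QUARTER}` must be valid DNS-1123 labels (lowercase alphanumeric and `-`, starting and ending alphanumeric, at most 63 characters). A small validator, sketched here under the hypothetical name `valid_ns_name`, can catch a bad `TEST_NS` before the restore runs:

```shell
#!/bin/bash
# valid_ns_name: return 0 if the argument is a valid Kubernetes
# namespace name (DNS-1123 label: lowercase alphanumeric and '-',
# starts/ends alphanumeric, max 63 chars).
valid_ns_name() {
  local name="$1"
  [ "${#name}" -le 63 ] || return 1
  printf '%s' "${name}" | grep -Eq '^[a-z0-9]([a-z0-9-]*[a-z0-9])?$'
}

valid_ns_name "atp-production-dr-test-q1-2024" && echo valid
valid_ns_name "atp-production-dr-test-Q1-2024" || echo "invalid (uppercase)"
```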

FluxCD Upgrades

Upgrade Planning

FluxCD Upgrade Planning Checklist:

## FluxCD Upgrade Planning

### Pre-Upgrade
- [ ] Review FluxCD release notes
- [ ] Check breaking changes
- [ ] Test in dev environment first
- [ ] Schedule maintenance window
- [ ] Notify stakeholders
- [ ] Prepare rollback plan

### Upgrade Steps
1. Backup current FluxCD configuration
2. Upgrade CLI tools
3. Test upgrade in dev
4. Schedule production upgrade
5. Execute upgrade
6. Validate functionality
7. Monitor for issues

### Post-Upgrade
- [ ] Verify all Kustomizations working
- [ ] Check GitRepository connections
- [ ] Validate HelmReleases
- [ ] Review reconciliation logs
- [ ] Update documentation

FluxCD Version Compatibility Matrix:

| FluxCD Version | Kubernetes Min | Kubernetes Max | Breaking Changes |
|---|---|---|---|
| 2.1.x | 1.24+ | 1.27 | None |
| 2.2.x | 1.24+ | 1.28 | CRD changes |
| 2.3.x | 1.25+ | 1.29 | API version updates |
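A pre-flight check can encode the matrix so upgrade scripts fail fast on an unsupported pairing. This sketch hard-codes the table values above for illustration (`flux_supports` and `k8s_minor` are invented names; always confirm ranges against the FluxCD release notes):

```shell
#!/bin/bash
# k8s_minor: extract the minor version number from "1.27" or "1.27.3".
k8s_minor() { echo "$1" | cut -d. -f2; }

# flux_supports: check a Kubernetes version against the min/max columns
# of the compatibility matrix above (values hard-coded for illustration).
flux_supports() {
  local flux="$1" k8s="$2" min max
  case "${flux}" in
    2.1.*) min=24; max=27 ;;
    2.2.*) min=24; max=28 ;;
    2.3.*) min=25; max=29 ;;
    *) return 2 ;;   # unknown FluxCD release
  esac
  local minor; minor=$(k8s_minor "${k8s}")
  [ "${minor}" -ge "${min}" ] && [ "${minor}" -le "${max}" ]
}

flux_supports 2.2.3 1.28 && echo "supported"
flux_supports 2.3.0 1.24 || echo "unsupported"
```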

Testing in Dev/Test First

Test Upgrade Procedure:

#!/bin/bash
# scripts/test-fluxcd-upgrade.sh

TARGET_VERSION="${1:-2.2.0}"
NAMESPACE="${2:-flux-system}"

echo "🧪 Testing FluxCD upgrade to ${TARGET_VERSION}"

# Step 1: Backup current configuration
echo "💾 Step 1: Backing up current configuration..."
kubectl get gitrepository,kustomization,helmrelease -n "${NAMESPACE}" -o yaml > \
  flux-backup-$(date +%Y%m%d).yaml

# Step 2: Install new FluxCD CLI
echo "⬇️  Step 2: Installing FluxCD CLI..."
# Note: the install script fetches the latest CLI release; pin the version
# per the Flux installation docs if it must match ${TARGET_VERSION}
curl -s https://fluxcd.io/install.sh | sudo bash
flux version

# Step 3: Upgrade FluxCD
echo "🔄 Step 3: Upgrading FluxCD..."
flux install --version=${TARGET_VERSION} --namespace="${NAMESPACE}"

# Step 4: Wait for controllers to be ready
echo "⏳ Step 4: Waiting for controllers..."
kubectl wait --for=condition=ready pod -l app=source-controller -n "${NAMESPACE}" --timeout=300s
kubectl wait --for=condition=ready pod -l app=kustomize-controller -n "${NAMESPACE}" --timeout=300s
kubectl wait --for=condition=ready pod -l app=helm-controller -n "${NAMESPACE}" --timeout=300s

# Step 5: Validate functionality
echo "✅ Step 5: Validating functionality..."
flux check
flux get all -A

# Step 6: Test reconciliation
echo "🔄 Step 6: Testing reconciliation..."
flux reconcile source git atp-gitops-dev -n "${NAMESPACE}"  # --with-source applies only to kustomization/helmrelease reconciles

echo "✅ Upgrade test complete"

Upgrade Procedure

Production Upgrade Runbook:

#!/bin/bash
# scripts/upgrade-fluxcd-production.sh

TARGET_VERSION="${1}"
MAINTENANCE_WINDOW="${2}"  # e.g., "2024-01-15 02:00"

if [ -z "${TARGET_VERSION}" ]; then
  echo "Usage: $0 <target-version> [maintenance-window]"
  exit 1
fi

echo "🔄 FluxCD Production Upgrade to ${TARGET_VERSION}"
echo "Maintenance Window: ${MAINTENANCE_WINDOW}"

# Pre-upgrade checklist
echo "📋 Pre-Upgrade Checklist"
echo "1. Backup all FluxCD resources"
kubectl get gitrepository,kustomization,helmrelease -A -o yaml > \
  flux-production-backup-$(date +%Y%m%d-%H%M%S).yaml

echo "2. Verify all Kustomizations are healthy"
# READY column shows True/False; abort if anything is not ready
if flux get kustomizations -A | grep -w False; then
  echo "⚠️  Some Kustomizations not ready"
  exit 1
fi

echo "3. Suspend auto-reconciliation for critical resources"
# flux suspend kustomization critical-apps-production -n flux-system

# Upgrade
echo "🔄 Upgrading FluxCD..."
flux install --version=${TARGET_VERSION} --namespace=flux-system

# Wait for readiness
echo "⏳ Waiting for controllers to be ready..."
kubectl wait --for=condition=ready pod -l app=source-controller -n flux-system --timeout=300s
kubectl wait --for=condition=ready pod -l app=kustomize-controller -n flux-system --timeout=300s
kubectl wait --for=condition=ready pod -l app=helm-controller -n flux-system --timeout=300s

# Resume reconciliation
# flux resume kustomization critical-apps-production -n flux-system

# Validate
echo "✅ Validating upgrade..."
flux check
flux get all -A

# Force reconciliation (flux reconcile requires an explicit resource name)
echo "🔄 Forcing reconciliation..."
flux reconcile source git atp-gitops-production -n flux-system
flux reconcile kustomization apps-production -n flux-system --with-source

echo "✅ Upgrade complete"

Rollback Plan

FluxCD Rollback Procedure:

#!/bin/bash
# scripts/rollback-fluxcd.sh

PREVIOUS_VERSION="${1}"
BACKUP_FILE="${2}"

if [ -z "${PREVIOUS_VERSION}" ] || [ -z "${BACKUP_FILE}" ]; then
  echo "Usage: $0 <previous-version> <backup-file>"
  exit 1
fi

echo "⏪ Rolling back FluxCD to ${PREVIOUS_VERSION}"

# Step 1: Suspend all reconciliation
echo "⏸️  Suspending reconciliation..."
# Suspend every resource in the namespace (repeat per namespace as needed)
flux suspend kustomization --all -n flux-system
flux suspend helmrelease --all -n flux-system

# Step 2: Uninstall current FluxCD
echo "🗑️  Uninstalling current FluxCD..."
flux uninstall --silent

# Step 3: Install previous version
echo "⬇️  Installing previous version..."
flux install --version=${PREVIOUS_VERSION} --namespace=flux-system

# Step 4: Restore configuration from backup
echo "📥 Restoring configuration..."
kubectl apply -f "${BACKUP_FILE}"

# Step 5: Resume reconciliation
echo "▶️  Resuming reconciliation..."
flux resume kustomization --all -n flux-system
flux resume helmrelease --all -n flux-system

# Step 6: Validate
echo "✅ Validating rollback..."
flux check
flux get all -A

echo "✅ Rollback complete"

Post-Upgrade Validation

Post-Upgrade Validation Checklist:

#!/bin/bash
# scripts/validate-fluxcd-upgrade.sh

echo "✅ Post-Upgrade Validation"

# Check FluxCD version
echo "📋 1. FluxCD Version"
flux version

# Check all controllers are ready
echo "🏥 2. Controller Health"
flux check

# Verify all sources are ready
echo "📦 3. Source Status"
flux get sources all -A | grep -w False && echo "⚠️  Some sources not ready"

# Verify all Kustomizations are ready
echo "🔄 4. Kustomization Status"
flux get kustomizations -A | grep -w False && echo "⚠️  Some Kustomizations not ready"

# Verify all HelmReleases are ready
echo "📦 5. HelmRelease Status"
flux get helmreleases -A | grep -w False && echo "⚠️  Some HelmReleases not ready"

# Test reconciliation
echo "🔄 6. Testing Reconciliation"
flux reconcile source git atp-gitops-production -n flux-system
flux reconcile kustomization apps-production -n flux-system --with-source

# Check for errors in logs
echo "📜 7. Checking for Errors"
kubectl logs -n flux-system -l app=kustomize-controller --tail=100 | grep -i error

echo "✅ Validation complete"

AKS Cluster Patching

Kubernetes Version Upgrades

AKS Upgrade Planning:

#!/bin/bash
# scripts/plan-aks-upgrade.sh

CLUSTER="${1:-atp-production-aks}"
RESOURCE_GROUP="${2:-atp-production-rg}"

echo "📋 AKS Upgrade Planning for ${CLUSTER}"

# Check current version
CURRENT_VERSION=$(az aks show \
  --resource-group "${RESOURCE_GROUP}" \
  --name "${CLUSTER}" \
  --query kubernetesVersion -o tsv)
echo "Current version: ${CURRENT_VERSION}"

# Check available upgrades
echo "Available upgrades:"
az aks get-upgrades \
  --resource-group "${RESOURCE_GROUP}" \
  --name "${CLUSTER}" \
  --output table

# Check node pool versions
echo "Node pool versions:"
az aks nodepool list \
  --resource-group "${RESOURCE_GROUP}" \
  --cluster-name "${CLUSTER}" \
  --query '[].{Name:name, Version:orchestratorVersion}' \
  --output table

AKS Upgrade Procedure:

#!/bin/bash
# scripts/upgrade-aks-cluster.sh

CLUSTER="${1:-atp-production-aks}"
RESOURCE_GROUP="${2:-atp-production-rg}"
TARGET_VERSION="${3}"

if [ -z "${TARGET_VERSION}" ]; then
  echo "Usage: $0 <cluster> <resource-group> <target-version>"
  exit 1
fi

echo "🔄 Upgrading AKS cluster to ${TARGET_VERSION}"

# Step 1: Pre-upgrade validation
echo "📋 Step 1: Pre-upgrade validation..."
kubectl get nodes
kubectl get pods --all-namespaces | grep -v Running | grep -v Completed

# Step 2: Upgrade control plane
echo "⬆️  Step 2: Upgrading control plane..."
az aks upgrade \
  --resource-group "${RESOURCE_GROUP}" \
  --name "${CLUSTER}" \
  --kubernetes-version "${TARGET_VERSION}" \
  --control-plane-only

# Step 3: Wait for control plane upgrade
echo "⏳ Step 3: Waiting for control plane..."
az aks show \
  --resource-group "${RESOURCE_GROUP}" \
  --name "${CLUSTER}" \
  --query "{Status:powerState.code, Version:kubernetesVersion}" \
  --output table

# Step 4: Upgrade node pools
echo "⬆️  Step 4: Upgrading node pools..."
NODEPOOLS=$(az aks nodepool list \
  --resource-group "${RESOURCE_GROUP}" \
  --cluster-name "${CLUSTER}" \
  --query '[].name' -o tsv)

for POOL in ${NODEPOOLS}; do
  echo "   Upgrading node pool: ${POOL}"
  az aks nodepool upgrade \
    --resource-group "${RESOURCE_GROUP}" \
    --cluster-name "${CLUSTER}" \
    --name "${POOL}" \
    --kubernetes-version "${TARGET_VERSION}" \
    --max-surge 33%
done

# Step 5: Post-upgrade validation
echo "✅ Step 5: Post-upgrade validation..."
kubectl get nodes
kubectl get pods --all-namespaces | grep -v Running | grep -v Completed

echo "✅ Upgrade complete"

Node OS Patching

Node OS Patching Schedule:

# platform/node-patching/cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: node-os-patch-check
  namespace: kube-system
spec:
  schedule: "0 2 * * 0"  # Every Sunday at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: patch-check
            image: mcr.microsoft.com/aks/aks-tools:latest
            command:
            - /bin/sh
            - -c
            - |
              echo "Checking for available OS patches..."
              az aks nodepool show \
                --resource-group ${RESOURCE_GROUP} \
                --cluster-name ${CLUSTER_NAME} \
                --name systempool \
                --query "nodeImageVersion" -o tsv
          restartPolicy: OnFailure

Node Pool Rotation

Node Pool Rotation for Zero-Downtime Patching:

#!/bin/bash
# scripts/rotate-nodepool.sh

CLUSTER="${1:-atp-production-aks}"
RESOURCE_GROUP="${2:-atp-production-rg}"
NODEPOOL="${3:-systempool}"

echo "🔄 Rotating node pool: ${NODEPOOL}"

# Step 1: Create new node pool
echo "➕ Step 1: Creating new node pool..."
# AKS Linux node pool names must be lowercase alphanumeric, max 12 chars,
# so derive a short name rather than appending "-new"
NEW_POOL="$(echo "${NODEPOOL}" | cut -c1-9)new"
az aks nodepool add \
  --resource-group "${RESOURCE_GROUP}" \
  --cluster-name "${CLUSTER}" \
  --name "${NEW_POOL}" \
  --node-count 3 \
  --node-vm-size Standard_D4s_v3 \
  --max-surge 33%

# Step 2: Cordon old nodes
echo "🚫 Step 2: Cordoning old nodes..."
OLD_NODES=$(kubectl get nodes -l agentpool=${NODEPOOL} -o jsonpath='{.items[*].metadata.name}')
for NODE in ${OLD_NODES}; do
  kubectl cordon "${NODE}"
done

# Step 3: Drain old nodes
echo "💧 Step 3: Draining old nodes..."
for NODE in ${OLD_NODES}; do
  kubectl drain "${NODE}" --ignore-daemonsets --delete-emptydir-data --grace-period=300
done

# Step 4: Delete old node pool
echo "🗑️  Step 4: Deleting old node pool..."
az aks nodepool delete \
  --resource-group "${RESOURCE_GROUP}" \
  --cluster-name "${CLUSTER}" \
  --name "${NODEPOOL}"

# Step 5: Verify new pool (AKS node pools cannot be renamed; the
# replacement pool keeps its own name)
echo "✅ Step 5: Verifying new pool..."
az aks nodepool show \
  --resource-group "${RESOURCE_GROUP}" \
  --cluster-name "${CLUSTER}" \
  --name "${NEW_POOL}" \
  --query '{Name:name, Count:count, Status:provisioningState}' \
  --output table

echo "✅ Node pool rotation complete"
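Because AKS Linux node pool names are restricted to lowercase alphanumeric characters starting with a letter, 1-12 characters, it is worth validating a constructed pool name before calling `az aks nodepool add`. A sketch, using the invented helper name `valid_pool_name`:

```shell
#!/bin/bash
# valid_pool_name: return 0 if the argument is a valid AKS Linux
# node pool name (starts with a lowercase letter, lowercase
# alphanumeric only, 1-12 characters).
valid_pool_name() {
  printf '%s' "$1" | grep -Eq '^[a-z][a-z0-9]{0,11}$'
}

valid_pool_name "systempool2" && echo ok
valid_pool_name "systempool-new" || echo "invalid: hyphen and over 12 chars"
```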

Certificate Renewals

Monitoring Certificate Expiration

Certificate Expiration Monitoring:

#!/bin/bash
# scripts/monitor-certificate-expiration.sh

WARNING_DAYS="${1:-30}"
CRITICAL_DAYS="${2:-7}"

echo "🔐 Monitoring Certificate Expiration"

kubectl get certificates --all-namespaces -o json | \
  jq -r '.items[] | select(.status.conditions[]?.type == "Ready" and .status.conditions[]?.status == "True") |
    "\(.metadata.namespace)/\(.metadata.name)|\(.status.notAfter)"' | \
  while IFS='|' read -r CERT EXPIRY; do
    if [ -n "${EXPIRY}" ]; then
      EXPIRY_EPOCH=$(date -d "${EXPIRY}" +%s)
      CURRENT_EPOCH=$(date +%s)
      DAYS_LEFT=$(( (EXPIRY_EPOCH - CURRENT_EPOCH) / 86400 ))

      if [ "${DAYS_LEFT}" -lt "${CRITICAL_DAYS}" ]; then
        echo "🔴 CRITICAL: ${CERT} expires in ${DAYS_LEFT} days"
      elif [ "${DAYS_LEFT}" -lt "${WARNING_DAYS}" ]; then
        echo "🟡 WARNING: ${CERT} expires in ${DAYS_LEFT} days"
      else
        echo "✅ OK: ${CERT} expires in ${DAYS_LEFT} days"
      fi
    fi
  done
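The threshold comparison in the monitoring script above can be isolated into a pure function, which makes the warning/critical boundaries easy to unit-test. The function name `classify_expiry` is an illustrative assumption; the default thresholds mirror the script's (7 days critical, 30 days warning):

```shell
#!/bin/bash
# classify_expiry: map days-until-expiry to a status string, using the
# same defaults as the monitoring script above.
classify_expiry() {
  local days_left="$1" critical="${2:-7}" warning="${3:-30}"
  if [ "${days_left}" -lt "${critical}" ]; then
    echo "CRITICAL"
  elif [ "${days_left}" -lt "${warning}" ]; then
    echo "WARNING"
  else
    echo "OK"
  fi
}

classify_expiry 5    # prints: CRITICAL
classify_expiry 14   # prints: WARNING
classify_expiry 60   # prints: OK
```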

Certificate Expiration Alert (PrometheusRule):

# monitoring/alerts/certificate-expiration.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: certificate-expiration
  namespace: monitoring
spec:
  groups:
  - name: certificate
    interval: 1h
    rules:
    - alert: CertificateExpiringSoon
      expr: (certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 < 30
      for: 1h
      labels:
        severity: warning
      annotations:
        summary: "Certificate expiring soon"
        description: "Certificate {{ $labels.name }} in {{ $labels.namespace }} expires in {{ $value }} days"

    - alert: CertificateExpiringVerySoon
      expr: (certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 < 7
      for: 1h
      labels:
        severity: critical
      annotations:
        summary: "Certificate expiring very soon"
        description: "Certificate {{ $labels.name }} in {{ $labels.namespace }} expires in {{ $value }} days"

Automatic Renewal with cert-manager

cert-manager Automatic Renewal Configuration:

# apps/atp-gateway/certificate.yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: atp-gateway-tls
  namespace: atp-production
spec:
  secretName: atp-gateway-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  commonName: api.atp.connectsoft.example
  dnsNames:
  - api.atp.connectsoft.example
  duration: 2160h  # 90 days
  renewBefore: 720h  # Renew 30 days before expiration (automatic)
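cert-manager expects `duration` and `renewBefore` as hour-based duration strings, so a one-line converter avoids arithmetic slips when translating policies stated in days (2160h and 720h above are 90 and 30 days). A trivial sketch with the invented helper name `days_to_hours`:

```shell
#!/bin/bash
# days_to_hours: convert days to the "Nh" duration strings used by
# cert-manager's duration/renewBefore fields.
days_to_hours() { echo "$(( $1 * 24 ))h"; }

days_to_hours 90   # prints: 2160h
days_to_hours 30   # prints: 720h
```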

Manual Renewal Procedures

Manual Certificate Renewal:

#!/bin/bash
# scripts/manual-certificate-renewal.sh

CERT_NAME="${1}"
NAMESPACE="${2:-atp-production}"

echo "🔄 Manually renewing certificate: ${CERT_NAME}"

# Trigger re-issuance with the cert-manager CLI (preferred over deleting the
# Certificate object, which is only re-created if GitOps manages it)
cmctl renew "${CERT_NAME}" -n "${NAMESPACE}"

# Wait for renewal
echo "⏳ Waiting for renewal..."
kubectl wait certificate "${CERT_NAME}" -n "${NAMESPACE}" \
  --for=condition=Ready --timeout=300s

# Check new certificate status
kubectl get certificate "${CERT_NAME}" -n "${NAMESPACE}"

Monitoring and Alerting Review

Reviewing Alert Noise

Alert Noise Analysis:

// Log Analytics: Analyze alert frequency
AzureActivity
| where OperationName == "Microsoft.Insights/metricAlerts/write"
| where TimeGenerated > ago(30d)
| summarize AlertCount=count() by Resource, AlertName
| order by AlertCount desc
| take 20

Alert Tuning Script:

#!/bin/bash
# scripts/tune-alerts.sh

ALERT_NAME="${1}"

echo "🎚️  Tuning alert: ${ALERT_NAME}"

# Query alert frequency
echo "📊 Alert frequency (last 30 days):"
# Use Azure Monitor API or Azure CLI

# Identify false positives
echo "❌ False positives to address:"
# Manual review required

# Adjust threshold
echo "⚙️  Current threshold: [threshold]"
echo "Suggested threshold: [new-threshold]"

# Update alert rule
az monitor metrics alert update \
  --name "${ALERT_NAME}" \
  --resource-group atp-production-rg \
  --condition "avg Percentage CPU > 80"  # Example

Adding New Alerts

Alert Creation Template:

# monitoring/alerts/template.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: atp-application-alerts
  namespace: monitoring
spec:
  groups:
  - name: application
    interval: 1m
    rules:
    - alert: HighErrorRate
      expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
      for: 5m
      labels:
        severity: warning
        team: atp
      annotations:
        summary: "High error rate detected"
        description: "Error rate is {{ $value }} errors/sec for {{ $labels.service }}"

    - alert: HighLatency
      expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
      for: 10m
      labels:
        severity: warning
        team: atp
      annotations:
        summary: "High latency detected"
        description: "P95 latency is {{ $value }}s for {{ $labels.service }}"

Capacity Planning

Resource Usage Trend Analysis:

// Log Analytics: Node CPU utilization trend
Perf
| where ObjectName == "K8SNode"
| where CounterName == "cpuUsageNanoCores"
| where TimeGenerated > ago(90d)
| summarize AvgCPU=avg(CounterValue), MaxCPU=max(CounterValue) by bin(TimeGenerated, 1d), Computer
| render timechart

// Pod memory usage trend
Perf
| where ObjectName == "K8SContainer"
| where CounterName == "memoryWorkingSetBytes"
| where TimeGenerated > ago(90d)
| summarize AvgMemory=avg(CounterValue) by bin(TimeGenerated, 1d), InstanceName
| render timechart

Capacity Planning Report:

#!/bin/bash
# scripts/capacity-planning-report.sh

OUTPUT="capacity-planning-$(date +%Y%m).md"

cat > "${OUTPUT}" <<EOF
# Capacity Planning Report - $(date "+%B %Y")

## Current Utilization

### Nodes
\`\`\`
$(kubectl top nodes)
\`\`\`

### Pods per Node
- Running pods: $(kubectl get pods --all-namespaces --field-selector=status.phase=Running --no-headers | wc -l) across $(kubectl get nodes --no-headers | wc -l) nodes

## Trends (Last 90 Days)

[Insert trend charts from Log Analytics]

## Projections (Next 6 Months)

Based on current growth trends:
- Expected pod growth: X%
- Expected storage growth: Y%
- Expected cost increase: Z%

## Recommendations

1. **Node Pool Scaling**: [Recommendation]
2. **Storage**: [Recommendation]
3. **Resource Right-Sizing**: [Recommendation]
4. **Cost Optimization**: [Recommendation]
EOF

echo "✅ Report generated: ${OUTPUT}"

Security Patching

Container Base Image Updates

Base Image Update Procedure:

#!/bin/bash
# scripts/update-base-images.sh

echo "🔒 Scanning for base image updates..."

# Scan all images in ACR
az acr repository list --name connectsoft --output tsv | \
  while read repo; do
    echo "Scanning: ${repo}"
    az acr task run \
      --registry connectsoft \
      --name update-base-images \
      --context https://github.com/connectsoft/atp.git
  done

# Check for vulnerabilities
az acr repository show \
  --name connectsoft \
  --repository atp/gateway \
  --query "properties.manifest" -o json

Vulnerability Remediation

Vulnerability Remediation Workflow:

graph TD
    START[Vulnerability Detected] --> SCAN[Scan Images]
    SCAN --> SEVERITY{Severity?}
    SEVERITY -->|Critical| IMMEDIATE[Immediate Remediation]
    SEVERITY -->|High| PRIORITY[Priority Remediation]
    SEVERITY -->|Medium| SCHEDULED[Scheduled Remediation]
    SEVERITY -->|Low| BACKLOG[Add to Backlog]

    IMMEDIATE --> PATCH[Apply Patch]
    PRIORITY --> PATCH
    SCHEDULED --> PATCH

    PATCH --> TEST[Test in Dev/Test]
    TEST --> DEPLOY[Deploy to Production]
    DEPLOY --> VERIFY[Verify Fix]

    style IMMEDIATE fill:#FF0000
    style PRIORITY fill:#FFA500
    style SCHEDULED fill:#FFFF00
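The severity branches in the workflow above can be captured in a small triage helper; the track names below are illustrative assumptions, not mandated SLA values:

```shell
#!/bin/sh
# Map a vulnerability severity to its remediation track, mirroring the
# workflow diagram. Track names are illustrative assumptions.
remediation_track() {
  case "$1" in
    Critical) echo "immediate" ;;   # patch now via the expedited change process
    High)     echo "priority" ;;    # patch within the current sprint
    Medium)   echo "scheduled" ;;   # patch in the next maintenance window
    Low)      echo "backlog" ;;     # track and fix opportunistically
    *)        echo "unknown"; return 1 ;;
  esac
}

remediation_track Critical   # prints: immediate
remediation_track Medium     # prints: scheduled
```

Whatever the track, every path still converges on the same patch → test → deploy → verify sequence shown in the diagram.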

Configuration Drift Audits

Scheduled Drift Detection Runs

Automated Drift Detection:

# platform/drift-detection/cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: drift-detection
  namespace: flux-system
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: drift-detection
            image: fluxcd/flux-cli:latest
            command:
            - /bin/sh
            - -c
            - |
              echo "Running drift detection..."
              flux diff kustomization apps-production -n flux-system > /tmp/drift-report.txt
              if [ -s /tmp/drift-report.txt ]; then
                echo "⚠️  Drift detected!"
                cat /tmp/drift-report.txt
                # Send alert
              else
                echo "✅ No drift detected"
              fi
          restartPolicy: OnFailure

Performance Tuning

Reconciliation Interval Optimization

Optimize Reconciliation Intervals:

# Production: Less frequent (reduce load)
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-production
  namespace: flux-system
spec:
  interval: 10m  # Production: 10 minutes
  path: ./apps/atp-gateway/overlays/production

---
# Dev: More frequent (faster feedback)
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-dev
  namespace: flux-system
spec:
  interval: 1m  # Dev: 1 minute
  path: ./apps/atp-gateway/overlays/dev

Documentation Updates

Documentation Maintenance Checklist

## Documentation Maintenance

### Weekly
- [ ] Update runbooks with lessons learned
- [ ] Document new procedures

### Monthly
- [ ] Review and update architecture diagrams
- [ ] Update troubleshooting guides
- [ ] Review and archive outdated docs

### Quarterly
- [ ] Comprehensive documentation audit
- [ ] Update all procedures
- [ ] Knowledge base cleanup

Team Training

Onboarding Checklist

## New Team Member Onboarding

### Week 1
- [ ] Access to Azure DevOps
- [ ] Access to AKS clusters
- [ ] GitOps repository access
- [ ] Review architecture documentation

### Week 2
- [ ] Hands-on GitOps exercises
- [ ] Troubleshooting practice
- [ ] Shadow on-call rotation

### Week 3
- [ ] Independent task assignment
- [ ] Code review participation
- [ ] Documentation contribution

On-Call Procedures

On-Call Rotation Schedule

gantt
    title On-Call Rotation Schedule
    dateFormat YYYY-MM-DD
    section Team A
    Engineer 1 On-Call      :2024-01-01, 7d
    Engineer 2 On-Call      :2024-01-08, 7d
    section Team B
    Engineer 3 On-Call      :2024-01-15, 7d
    Engineer 4 On-Call      :2024-01-22, 7d
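The rotation above can be computed rather than hard-coded; a minimal sketch, where the engineer names and the four-week cycle are assumptions taken from the chart:

```shell
#!/bin/sh
# Return the on-call engineer for a given week number in a fixed
# four-person weekly rotation (names taken from the chart above).
oncall_for_week() {
  week="$1"                      # whole weeks since the rotation epoch
  set -- "Engineer 1" "Engineer 2" "Engineer 3" "Engineer 4"
  idx=$(( (week % 4) + 1 ))
  eval "echo \"\${${idx}}\""     # pick the idx-th positional parameter
}

oncall_for_week 0   # prints: Engineer 1
oncall_for_week 5   # prints: Engineer 2
```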

On-Call Handoff Procedure

## On-Call Handoff Checklist

### Daily Handoff
- [ ] Review incidents from last 24 hours
- [ ] Check for unresolved issues
- [ ] Review scheduled maintenance
- [ ] Verify alert configurations

### Weekly Handoff
- [ ] Review week's incidents
- [ ] Document lessons learned
- [ ] Update runbooks
- [ ] Share knowledge with team

Post-Incident Review Template

## Post-Incident Review (PIR)

**Incident ID**: [ID]
**Date**: [Date]
**Severity**: [P0/P1/P2/P3]
**Duration**: [Duration]
**Impact**: [Description]

### Timeline
- [Time] - Issue detected
- [Time] - Escalation
- [Time] - Resolution

### Root Cause
[Root cause analysis]

### Actions Taken
[Steps taken to resolve]

### Lessons Learned
- [Lesson 1]
- [Lesson 2]

### Action Items
- [ ] [Action item 1]
- [ ] [Action item 2]

### Prevention
[How to prevent similar incidents]

Summary: Day 2 Operations & Maintenance

  • Routine Maintenance Tasks: Daily (monitoring checks, alert review), weekly (capacity planning, cost review), monthly (security patches, access reviews), quarterly (DR drills, policy updates) with automated checklists
  • FluxCD Upgrades: Upgrade planning, testing in dev/test first, upgrade procedure, rollback plan, post-upgrade validation
  • AKS Cluster Patching: Kubernetes version upgrades, node OS patching, upgrade scheduling, node pool rotation, validation and rollback
  • Certificate Renewals: Monitoring certificate expiration, automatic renewal with cert-manager, manual renewal procedures, certificate rotation testing
  • Monitoring and Alerting Review: Reviewing alert noise, tuning thresholds, disabling false positives, adding new alerts
  • Capacity Planning: Monitoring resource usage trends, node pool scaling decisions, storage growth planning, cost forecasting
  • Security Patching: OS security updates, container base image updates, dependency updates, vulnerability remediation workflow
  • Configuration Drift Audits: Scheduled drift detection runs, comparing Git to live state, identifying configuration inconsistencies, remediation procedures
  • Performance Tuning: Reconciliation interval optimization, resource request/limit tuning, autoscaling adjustments, database query optimization
  • Documentation Updates: Keeping runbooks current, updating architecture diagrams, recording lessons learned, knowledge base maintenance
  • Team Training: Onboarding new team members, knowledge sharing sessions, hands-on exercises, certification paths
  • On-Call Procedures: On-call rotation schedule, handoff procedures, escalation paths, post-incident reviews with templates

Compliance & Audit Evidence Collection

Purpose: Define the compliance controls and audit evidence collection procedures for ATP's GitOps deployments: SOC 2 Type II control mappings, GDPR compliance workflows, HIPAA audit trail requirements, Change Advisory Board (CAB) processes, deployment receipts, and automated compliance reporting. Together these ensure regulatory compliance and provide a complete audit trail for every platform change.


SOC 2 Type II Controls

CC8.1: Change Management

Change Management Control Requirements:

| Requirement | GitOps Implementation | Evidence |
|---|---|---|
| Authorized Changes | PR approval required | PR approval records in Azure DevOps |
| Change Testing | Automated tests in CI | Test results in Azure Pipelines |
| Change Documentation | Git commit messages, PR descriptions | Git history, PR records |
| Change Approval | Required approvals before merge | Approval timestamps and identities |
| Change Review | Code review process | Review comments and approvals |

GitOps Workflow Mapping to CC8.1:

graph LR
    START[Developer Creates PR] --> REVIEW[Code Review]
    REVIEW --> APPROVE{Approval<br/>Required?}
    APPROVE -->|Yes| CAB[CAB Approval]
    APPROVE -->|No| AUTO[Automated Tests]
    CAB --> AUTO
    AUTO --> MERGE[Merge to Main]
    MERGE --> DEPLOY[FluxCD Deploys]
    DEPLOY --> EVIDENCE[Audit Evidence<br/>Generated]

    style CAB fill:#FFE5B4
    style EVIDENCE fill:#90EE90

Evidence Collection for CC8.1:

#!/bin/bash
# scripts/collect-change-management-evidence.sh

PR_ID="${1}"
DATE_RANGE="${2:-30d}"

echo "📋 Collecting Change Management Evidence for PR ${PR_ID}"

# Get PR details
az repos pr show \
  --id "${PR_ID}" \
  --organization ${ORG} \
  --project ${PROJECT} \
  --output json > "change-evidence-pr-${PR_ID}.json"

# Extract evidence
cat "change-evidence-pr-${PR_ID}.json" | jq '{
  pr_id: .pullRequestId,
  title: .title,
  created_by: .createdBy.uniqueName,
  created_date: .creationDate,
  reviewers: [.reviewers[] | {name: .uniqueName, vote: .vote, date: .votedForDate}],
  status: .status,
  merge_status: .mergeStatus,
  closed_date: .closedDate,
  linked_work_items: [.workItemRefs[].id]
}'

# Get commit history
echo "📜 Commit History:"
az repos pr commits \
  --id "${PR_ID}" \
  --organization ${ORG} \
  --project ${PROJECT} \
  --output table

# Get build/test results
echo "🧪 Build/Test Results:"
# Derive the PR's merge commit from the evidence collected above
PR_COMMIT_SHA=$(jq -r '.lastMergeCommit.commitId' "change-evidence-pr-${PR_ID}.json")
az pipelines runs list \
  --organization ${ORG} \
  --project ${PROJECT} \
  --query "[?sourceVersion == '${PR_COMMIT_SHA}']" \
  --output table

CC6.1: Logical and Physical Access

Access Control Requirements:

| Requirement | Implementation | Evidence |
|---|---|---|
| Access Reviews | Quarterly RBAC reviews | Access review reports |
| Least Privilege | RBAC in Kubernetes, Azure AD | RBAC manifests in Git |
| Access Logging | Kubernetes audit logs, Azure AD logs | Log Analytics queries |
| Access Termination | Automated offboarding | Offboarding logs |

Access Review Evidence Collection:

#!/bin/bash
# scripts/collect-access-review-evidence.sh

REVIEW_DATE="${1:-$(date +%Y-%m-%d)}"

echo "👥 Collecting Access Review Evidence - ${REVIEW_DATE}"

# Review Kubernetes RBAC
echo "📋 Kubernetes RBAC Review:"
kubectl get rolebindings,clusterrolebindings --all-namespaces -o json > \
  "rbac-review-${REVIEW_DATE}.json"

# Review Azure AD groups
echo "🔐 Azure AD Group Memberships:"
az ad group member list \
  --group "atp-developers" \
  --output table > "azure-ad-access-${REVIEW_DATE}.txt"

# Review Key Vault access
echo "🔑 Key Vault Access Policies:"
az keyvault show \
  --name atp-keyvault \
  --query "properties.accessPolicies" \
  -o json > "keyvault-access-${REVIEW_DATE}.json"

# Generate access review report
cat > "access-review-report-${REVIEW_DATE}.md" <<EOF
# Access Review Report - ${REVIEW_DATE}

## Kubernetes RBAC

\`\`\`
$(kubectl get rolebindings,clusterrolebindings --all-namespaces --no-headers | wc -l) bindings reviewed
\`\`\`

## Azure AD Access

\`\`\`
$(az ad group list --query "length(@)" -o tsv) groups reviewed
\`\`\`

## Key Vault Access

\`\`\`
$(az keyvault show --name atp-keyvault --query "length(properties.accessPolicies)" -o tsv) access policies reviewed
\`\`\`

## Findings

- [ ] All access is justified
- [ ] No orphaned accounts
- [ ] Least privilege enforced
- [ ] Access terminated for offboarded users

## Reviewer

**Name**: [Reviewer Name]
**Date**: ${REVIEW_DATE}
**Signature**: [Digital Signature]
EOF

echo "✅ Access review evidence collected"

CC7.2: System Monitoring

System Monitoring Requirements:

| Requirement | Implementation | Evidence |
|---|---|---|
| Monitoring Coverage | Azure Monitor, Prometheus | Monitoring dashboards |
| Alert Configuration | Alert rules in Git | Alert manifests |
| Log Retention | 7-year retention in Log Analytics | Retention policies |
| Incident Response | Automated alerts, on-call | Incident logs |

Monitoring Evidence Collection:

// Log Analytics: System Monitoring Evidence
// Query for monitoring coverage evidence
Perf
| where TimeGenerated > ago(30d)
| summarize 
    MetricCount=count_distinct(CounterName),
    ResourceCount=count_distinct(Computer),
    DataPoints=count()
| extend EvidenceType="Monitoring Coverage"
| project EvidenceType, MetricCount, ResourceCount, DataPoints, TimeGenerated=now()

// Alert configuration evidence
Alert
| where TimeGenerated > ago(30d)
| summarize AlertCount=count(), UniqueAlerts=dcount(AlertName)
| extend EvidenceType="Alert Configuration"
| project EvidenceType, AlertCount, UniqueAlerts, TimeGenerated=now()

GitOps Workflow Mapping to Controls

SOC 2 Control Mapping Matrix:

| Control | GitOps Workflow | Evidence Source | Retention |
|---|---|---|---|
| CC8.1 - Change Management | PR approval, code review | Azure DevOps PR records | 7 years |
| CC6.1 - Access Control | RBAC manifests in Git | Git history, access reviews | 7 years |
| CC7.2 - Monitoring | Monitoring manifests in Git | Log Analytics, dashboards | 7 years |
| CC7.3 - System Operations | GitOps reconciliation | FluxCD logs, deployment receipts | 7 years |
| CC6.6 - Logical Access | Workload Identity, RBAC | Kubernetes audit logs | 7 years |
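Before an audit, the mapping matrix above can drive a quick evidence-presence check; the control-to-directory layout used here is a hypothetical convention, not part of the platform:

```shell
#!/bin/sh
# Report whether an evidence directory exists for a given SOC 2 control.
# The control→directory mapping below is a hypothetical convention.
evidence_status() {
  control="$1"; dir="$2"
  if [ -d "$dir" ]; then
    echo "present ${control}"
  else
    echo "missing ${control}"
  fi
}

# One entry per control from the mapping matrix (paths are assumptions)
for entry in \
  "CC8.1:evidence/change-management" \
  "CC6.1:evidence/access-reviews" \
  "CC7.2:evidence/monitoring"; do
  evidence_status "${entry%%:*}" "${entry#*:}"
done
```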

SOC 2 Evidence Collection Dashboard:

# monitoring/compliance/soc2-dashboard.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: soc2-evidence-dashboard
  namespace: monitoring
data:
  dashboard.json: |
    {
      "dashboard": {
        "title": "SOC 2 Compliance Evidence",
        "panels": [
          {
            "title": "Change Management (CC8.1)",
            "targets": [
              {
                "expr": "count(azure_devops_pr_approvals_total)",
                "legendFormat": "PR Approvals"
              }
            ]
          },
          {
            "title": "Access Reviews (CC6.1)",
            "targets": [
              {
                "expr": "count(kubernetes_rbac_bindings_total)",
                "legendFormat": "RBAC Bindings"
              }
            ]
          },
          {
            "title": "Monitoring Coverage (CC7.2)",
            "targets": [
              {
                "expr": "count(azure_monitor_metrics_total)",
                "legendFormat": "Monitored Resources"
              }
            ]
          }
        ]
      }
    }

GDPR Compliance

Right to be Forgotten (Tenant Offboarding)

GDPR Right to be Forgotten Procedure:

#!/bin/bash
# scripts/gdpr-tenant-offboarding.sh

TENANT_ID="${1}"
REQUESTOR="${2}"
REQUEST_DATE="${3:-$(date +%Y-%m-%d)}"

if [ -z "${TENANT_ID}" ] || [ -z "${REQUESTOR}" ]; then
  echo "Usage: $0 <tenant-id> <requestor-email> [request-date]"
  exit 1
fi

echo "🗑️  GDPR Right to be Forgotten Request"
echo "Tenant: ${TENANT_ID}"
echo "Request Date: ${REQUEST_DATE}"
echo "Requestor: ${REQUESTOR}"

# Step 1: Verify request authorization
echo "✅ Step 1: Verifying request authorization..."
# Verify requestor has authority to request deletion

# Step 2: Export tenant data (for record keeping)
echo "📥 Step 2: Exporting tenant data..."
kubectl get all -n "tenant-${TENANT_ID}" -o yaml > \
  "gdpr-export-${TENANT_ID}-${REQUEST_DATE}.yaml"

# Step 3: Delete tenant data
echo "🗑️  Step 3: Deleting tenant data..."
# Delete tenant namespace
kubectl delete namespace "tenant-${TENANT_ID}"

# Delete tenant secrets from Key Vault
az keyvault secret list \
  --vault-name atp-keyvault \
  --query "[?contains(name, 'tenant-${TENANT_ID}')].name" -o tsv | \
  while read secret; do
    az keyvault secret delete --vault-name atp-keyvault --name "${secret}"
  done

# Delete tenant data from databases
# (Specific implementation depends on database type)

# Step 4: Remove from GitOps
echo "📝 Step 4: Removing tenant from GitOps..."
git rm -r "tenants/${TENANT_ID}/"
git commit -m "GDPR: Remove tenant ${TENANT_ID} per request on ${REQUEST_DATE}"
git push

# Step 5: Generate deletion certificate
echo "📜 Step 5: Generating deletion certificate..."
cat > "gdpr-deletion-certificate-${TENANT_ID}-${REQUEST_DATE}.md" <<EOF
# GDPR Data Deletion Certificate

**Tenant ID**: ${TENANT_ID}
**Request Date**: ${REQUEST_DATE}
**Requestor**: ${REQUESTOR}
**Completion Date**: $(date +%Y-%m-%d)

## Deletion Confirmation

✅ Tenant namespace deleted: tenant-${TENANT_ID}
✅ Secrets deleted from Key Vault
✅ Data deleted from databases
✅ GitOps configuration removed
✅ Backup data purged (where applicable)

## Data Retention Exception

The following data is retained for legal/compliance purposes:
- Audit logs (7-year retention)
- Financial transaction records (as required by law)

## Certification

I certify that all tenant data has been deleted per GDPR Article 17 (Right to be Forgotten) requirements, except where retention is required by law.

**Signed**: [Authorized Person]
**Date**: $(date +%Y-%m-%d)
EOF

echo "✅ GDPR deletion complete"

Data Residency Enforcement

Data Residency Configuration:

# tenants/tenant-eu/labels.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: tenant-eu
  labels:
    data-residency: "EU"
    gdpr: "true"
    region: "westeurope"
  annotations:
    compliance/data-residency: "EU Only"
    compliance/gdpr: "true"

Data Residency Policy Enforcement:

#!/bin/bash
# scripts/verify-data-residency.sh

TENANT_ID="${1}"
REQUIRED_REGION="${2:-EU}"

echo "🌍 Verifying Data Residency for Tenant: ${TENANT_ID}"

# Check namespace labels
RESIDENCY=$(kubectl get namespace "tenant-${TENANT_ID}" \
  -o jsonpath='{.metadata.labels.data-residency}')

if [ "${RESIDENCY}" != "${REQUIRED_REGION}" ]; then
  echo "❌ Data residency violation: Expected ${REQUIRED_REGION}, found ${RESIDENCY}"
  exit 1
fi

# Check Pod placement (node labels)
NODES=$(kubectl get nodes -l region=${REQUIRED_REGION} -o jsonpath='{.items[*].metadata.name}')
if [ -z "${NODES}" ]; then
  echo "⚠️  No nodes in region ${REQUIRED_REGION}"
fi

# Check PersistentVolume placement
PVC_REGIONS=$(kubectl get pvc -n "tenant-${TENANT_ID}" -o json | \
  jq -r '.items[].metadata.annotations."volume.kubernetes.io/selected-node"')

echo "✅ Data residency verified: ${RESIDENCY}"

Audit Logs and Retention

GDPR Audit Log Retention:

# platform/compliance/audit-log-retention.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: audit-log-retention
  namespace: monitoring
data:
  retention-policy.yaml: |
    # GDPR Audit Log Retention Policy
    retention:
      default: 7y  # 7-year retention per GDPR requirements
      compliance:
        gdpr: 7y
        soc2: 7y
        hipaa: 7y
    storage:
      backend: azure-blob
      account: atpauditlogs
      container: audit-logs
      immutability: true  # Immutable storage
      encryption: true

Audit Log Export for GDPR:

#!/bin/bash
# scripts/export-gdpr-audit-logs.sh

TENANT_ID="${1}"
START_DATE="${2}"
END_DATE="${3}"

echo "📥 Exporting GDPR Audit Logs for Tenant: ${TENANT_ID}"

# Query Log Analytics for tenant-specific audit logs
az monitor log-analytics query \
  --workspace ${LOG_ANALYTICS_WORKSPACE_ID} \
  --analytics-query "
    KubernetesAudit
    | where Namespace == 'tenant-${TENANT_ID}'
    | where TimeGenerated between (datetime('${START_DATE}') .. datetime('${END_DATE}'))
    | project TimeGenerated, User, Action, Resource, ResponseCode
    | order by TimeGenerated asc
  " \
  --output table > "gdpr-audit-logs-${TENANT_ID}-${START_DATE}-${END_DATE}.csv"

echo "✅ Audit logs exported"

Privacy by Design

Privacy by Design Implementation:

# platform/compliance/privacy-by-design.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: privacy-by-design-config
  namespace: atp-production
data:
  principles.yaml: |
    # Privacy by Design Principles
    principles:
      - principle: Proactive not Reactive
        implementation: Default privacy settings, data minimization
      - principle: Privacy as Default
        implementation: Encryption at rest and in transit, minimal data collection
      - principle: Privacy Embedded into Design
        implementation: Privacy considerations in architecture
      - principle: Full Functionality
        implementation: Privacy without sacrificing functionality
      - principle: End-to-End Security
        implementation: Encryption, access controls, audit logging
      - principle: Visibility and Transparency
        implementation: Audit logs, privacy notices, data subject rights
      - principle: Respect for User Privacy
        implementation: User consent, data deletion, portability

HIPAA Audit Trail

Access Logs

HIPAA Access Log Configuration:

# platform/compliance/hipaa-audit-policy.yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
- level: Metadata
  # Note: audit-policy namespace entries match literally; wildcards are not
  # expanded, so each HIPAA namespace must be listed explicitly.
  namespaces: ["tenant-hipaa-*"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  resources:
  - group: "*"
    resources: ["*"]

- level: RequestResponse
  namespaces: ["tenant-hipaa-*"]
  verbs: ["create", "update", "patch", "delete"]
  resources:
  - group: "*"
    resources: ["secrets", "configmaps", "persistentvolumeclaims"]

HIPAA Access Log Query:

// Log Analytics: HIPAA Access Logs
KubernetesAudit
| where Namespace startswith "tenant-hipaa"
| where TimeGenerated > ago(30d)
| where Verb in ("get", "list", "watch", "create", "update", "delete")
| project 
    TimeGenerated,
    User,
    Verb,
    Resource,
    Namespace,
    ResponseCode,
    RequestObject,
    ResponseObject
| order by TimeGenerated desc

Deployment Logs

HIPAA Deployment Audit Trail:

#!/bin/bash
# scripts/generate-hipaa-deployment-log.sh

DEPLOYMENT="${1}"
NAMESPACE="${2:-tenant-hipaa-production}"

echo "📋 Generating HIPAA Deployment Audit Trail"

# Collect deployment evidence
cat > "hipaa-deployment-${DEPLOYMENT}-$(date +%Y%m%d).md" <<EOF
# HIPAA Deployment Audit Trail

**Deployment**: ${DEPLOYMENT}
**Namespace**: ${NAMESPACE}
**Date**: $(date +%Y-%m-%d)
**Time**: $(date +%H:%M:%S)

## Pre-Deployment Verification

- [ ] Change approved by authorized personnel
- [ ] Security scan passed
- [ ] Encryption verified
- [ ] Access controls verified

## Deployment Details

**Image**: $(kubectl get deployment ${DEPLOYMENT} -n ${NAMESPACE} -o jsonpath='{.spec.template.spec.containers[0].image}')
**Git Commit**: $(git rev-parse HEAD)
**PR Number**: $(git log -1 --pretty=format:"%s" | grep -oP 'PR #\K\d+')
**Deployed By**: $(az ad signed-in-user show --query userPrincipalName -o tsv)

## Post-Deployment Verification

- [ ] Deployment successful
- [ ] Health checks passing
- [ ] Encryption operational
- [ ] Access logs enabled

## HIPAA Compliance

- [ ] Audit logging enabled
- [ ] Encryption at rest verified
- [ ] Encryption in transit verified
- [ ] Access controls enforced
- [ ] PHI data handling verified
EOF

echo "✅ HIPAA deployment audit trail generated"

Encryption Verification

HIPAA Encryption Verification:

#!/bin/bash
# scripts/verify-hipaa-encryption.sh

NAMESPACE="${1:-tenant-hipaa-production}"

echo "🔐 Verifying HIPAA Encryption Requirements"

# Check PVC encryption
echo "💾 Persistent Volume Encryption:"
kubectl get pvc -n "${NAMESPACE}" -o json | \
  jq -r '.items[] | "\(.metadata.name): \(.spec.storageClassName)"' | \
  while read pvc; do
    SC=$(echo "${pvc}" | cut -d':' -f2 | xargs)
    ENCRYPTED=$(kubectl get storageclass "${SC}" -o jsonpath='{.parameters.diskEncryptionSetID}')
    if [ -n "${ENCRYPTED}" ]; then
      echo "   ✅ ${pvc}: Encrypted"
    else
      echo "   ❌ ${pvc}: Not encrypted"
    fi
  done

# Check TLS/in-transit encryption
echo "🔒 In-Transit Encryption:"
kubectl get ingress -n "${NAMESPACE}" -o json | \
  jq -r '.items[] | select(.spec.tls == null) | "\(.metadata.name): Missing TLS"'

# Flag non-Opaque secret types for manual review
echo "🔑 Secret Encryption:"
kubectl get secrets -n "${NAMESPACE}" -o json | \
  jq -r '.items[] | select(.type != "Opaque") | "\(.metadata.name): \(.type)"'

echo "✅ Encryption verification complete"

Incident Response Documentation

HIPAA Incident Response Template:

## HIPAA Incident Report

**Incident ID**: [ID]
**Date Discovered**: [Date]
**Date Reported**: [Date] (within 60 days)
**Severity**: [Low/Medium/High/Critical]

### Incident Description
[Description of the incident]

### PHI Impact Assessment
- [ ] No PHI affected
- [ ] PHI accessed but not compromised
- [ ] PHI compromised (breach)

### Affected Systems
- [List affected systems]

### Actions Taken
1. [Action 1]
2. [Action 2]

### Remediation
[Remediation steps]

### Breach Notification
- [ ] HHS notified (if breach > 500 individuals)
- [ ] Affected individuals notified (if breach)
- [ ] Media notification (if breach > 500 individuals)

### Lessons Learned
[Lessons learned]

### Prevention
[Prevention measures]

Change Advisory Board (CAB) Process

When CAB Approval is Required

CAB Approval Requirements:

| Change Type | CAB Required? | Rationale |
|---|---|---|
| Production Deployment | ✅ Yes | Production impact |
| Infrastructure Changes | ✅ Yes | Platform stability |
| Security Updates | ⚠️ Expedited | Security risk |
| Hotfixes | ⚠️ Post-deployment | Urgency |
| Dev/Test Changes | ❌ No | No production impact |
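The approval requirements above reduce to a small decision helper; the environment and change-type spellings are assumptions for illustration:

```shell
#!/bin/sh
# Decide the approval path for a change, following the CAB requirements table.
# Environment and change-type spellings are illustrative assumptions.
cab_path() {
  environment="$1"; change_type="$2"
  case "$environment" in
    dev|test) echo "no-cab"; return ;;            # no production impact
    staging)  echo "team-lead-review"; return ;;  # lighter-weight review
  esac
  # Production changes
  case "$change_type" in
    security) echo "expedited-cab" ;;        # security risk, fast-tracked
    hotfix)   echo "post-deployment-cab" ;;  # reviewed after the fact
    *)        echo "regular-cab" ;;          # standard production change
  esac
}

cab_path production deployment   # prints: regular-cab
cab_path dev deployment          # prints: no-cab
```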

CAB Approval Decision Tree:

graph TD
    START[Change Request] --> ENV{Environment?}
    ENV -->|Production| CAB_REQUIRED[CAB Approval Required]
    ENV -->|Staging| REVIEW[Team Lead Review]
    ENV -->|Dev/Test| AUTO[No Approval Needed]

    CAB_REQUIRED --> SEVERITY{Severity?}
    SEVERITY -->|Critical| EXPEDITED[Expedited CAB]
    SEVERITY -->|Normal| REGULAR[Regular CAB]

    REGULAR --> MEETING[CAB Meeting]
    EXPEDITED --> APPROVAL[Expedited Approval]

    style CAB_REQUIRED fill:#FFE5B4
    style EXPEDITED fill:#FFB6C1

CAB Meeting Schedule

CAB Meeting Schedule:

| Meeting Type | Frequency | Day | Time |
|---|---|---|---|
| Regular CAB | Weekly | Tuesday | 10:00 AM |
| Expedited CAB | As needed | Any | Within 24 hours |
| Emergency CAB | As needed | Any | Immediate |

Change Request Template

CAB Change Request Template:

## Change Request Form

**CR Number**: CR-YYYY-XXX
**Date**: [Date]
**Requestor**: [Name, Email]
**Change Type**: [Standard/Emergency/Expedited]

### Change Summary
**Title**: [Change title]
**Description**: [Detailed description]

### Business Justification
[Why is this change needed?]

### Technical Details
- **Services Affected**: [List services]
- **Environments**: [Dev/Test/Staging/Production]
- **Expected Duration**: [Duration]
- **Rollback Plan**: [Rollback procedure]

### Risk Assessment
- **Risk Level**: [Low/Medium/High/Critical]
- **Potential Impact**: [Impact description]
- **Mitigation**: [Mitigation steps]

### Testing
- [ ] Tested in Dev
- [ ] Tested in Test
- [ ] Tested in Staging
- [ ] Rollback tested

### Approval
- [ ] Technical Lead Approval
- [ ] CAB Approval
- [ ] Change Manager Approval

### Implementation
**Scheduled Date**: [Date]
**Scheduled Time**: [Time]
**Change Window**: [Window]

### Post-Implementation
- [ ] Implementation successful
- [ ] Verification completed
- [ ] Documentation updated

CAB Review Criteria

CAB Review Criteria Checklist:

## CAB Review Criteria

### Change Completeness
- [ ] Change request form complete
- [ ] Technical details provided
- [ ] Testing completed
- [ ] Rollback plan documented

### Risk Assessment
- [ ] Risk level appropriate
- [ ] Impact assessment complete
- [ ] Mitigation plan adequate

### Compliance
- [ ] Change documented in Git
- [ ] Approval trail maintained
- [ ] Audit requirements met

### Schedule
- [ ] Change window appropriate
- [ ] Stakeholders notified
- [ ] Resources available

Approval Documentation

CAB Approval Record:

# changes/cr-2024-001-approval.yaml
apiVersion: compliance.atp.connectsoft.io/v1
kind: ChangeApproval
metadata:
  name: cr-2024-001
  namespace: atp-production
spec:
  changeRequest:
    number: CR-2024-001
    title: "Upgrade PostgreSQL to version 15"
    requestor: "john.doe@connectsoft.example"
    date: "2024-01-15"
  cabApproval:
    approved: true
    approvalDate: "2024-01-18"
    approvedBy:
    - name: "Jane Smith"
      role: "CAB Chair"
      signature: "[Digital Signature]"
    - name: "Bob Johnson"
      role: "Technical Lead"
      signature: "[Digital Signature]"
  implementation:
    scheduledDate: "2024-01-25"
    scheduledTime: "02:00 UTC"
    changeWindow: "02:00-04:00 UTC"

Deployment Approval Records

PR Approvals in Azure DevOps

Extract PR Approval Records:

#!/bin/bash
# scripts/extract-pr-approvals.sh

PR_ID="${1}"
PROJECT="${2:-atp-gitops}"

echo "📋 Extracting PR Approval Records for PR ${PR_ID}"

# Get PR details with approvals
az repos pr show \
  --id "${PR_ID}" \
  --organization ${ORG} \
  --project ${PROJECT} \
  --include-work-item-refs \
  --output json | \
  jq '{
    pr_id: .pullRequestId,
    title: .title,
    created_by: .createdBy.displayName,
    created_date: .creationDate,
    status: .status,
    reviewers: [.reviewers[] | {
      name: .displayName,
      email: .uniqueName,
      vote: .vote,
      vote_date: .votedForDate,
      is_required: .isRequired
    }],
    completion_options: .completionOptions,
    work_item_refs: [.workItemRefs[] | {
      id: .id,
      title: .title,
      url: .url
    }]
  }' > "pr-approval-${PR_ID}.json"

# Generate approval certificate
cat > "pr-approval-certificate-${PR_ID}.md" <<EOF
# PR Approval Certificate

**PR Number**: ${PR_ID}
**Title**: $(jq -r '.title' pr-approval-${PR_ID}.json)
**Created**: $(jq -r '.created_date' pr-approval-${PR_ID}.json)
**Merge Commit Message**: $(jq -r '.completion_options.mergeCommitMessage' pr-approval-${PR_ID}.json)

## Approvers

$(jq -r '.reviewers[] | "- **\(.name)** (\(.email)) - Vote: \(.vote) - Date: \(.vote_date)"' pr-approval-${PR_ID}.json)

## Approval Status

$(jq -r 'if .reviewers | all(.vote >= 10) then "✅ Approved" else "❌ Not Approved" end' pr-approval-${PR_ID}.json)

## Linked Work Items

$(jq -r '.work_item_refs[] | "- [\(.id)] \(.title) - \(.url)"' pr-approval-${PR_ID}.json)

## Audit Trail

This PR approval record is maintained for 7 years per SOC 2 and GDPR requirements.
EOF

echo "✅ Approval records extracted"

Approver Identity and Timestamp

Approval Evidence Structure:

{
  "approval_record": {
    "pr_id": 12345,
    "pr_title": "Deploy ATP Gateway v1.2.3 to Production",
    "approvals": [
      {
        "approver": {
          "name": "Jane Smith",
          "email": "jane.smith@connectsoft.example",
          "azure_ad_id": "a1b2c3d4-..."
        },
        "approval": {
          "vote": 10,
          "vote_date": "2024-01-20T10:30:00Z",
          "comment": "Approved after review",
          "timestamp": "2024-01-20T10:30:15Z"
        },
        "signature": {
          "method": "Azure DevOps",
          "hash": "sha256:abc123...",
          "verified": true
        }
      }
    ],
    "merged_by": {
      "name": "John Doe",
      "email": "john.doe@connectsoft.example",
      "merge_date": "2024-01-20T11:00:00Z"
    }
  }
}
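The `vote` values in the record above follow Azure DevOps reviewer-vote semantics (10 through -10). A small helper to render them for human-readable reports:

```shell
#!/bin/sh
# Translate an Azure DevOps reviewer vote value into a readable label.
vote_label() {
  case "$1" in
    10)  echo "approved" ;;
    5)   echo "approved with suggestions" ;;
    0)   echo "no vote" ;;
    -5)  echo "waiting for author" ;;
    -10) echo "rejected" ;;
    *)   echo "unknown vote: $1" ;;
  esac
}

vote_label 10    # prints: approved
vote_label -10   # prints: rejected
```

This is why the approval-certificate script earlier treats `vote >= 10` as a full approval: only a vote of exactly 10 counts as "approved" without reservations.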

Justification and Risk Assessment

PR Justification Template:

## PR Justification

**PR**: #12345
**Title**: Deploy ATP Gateway v1.2.3 to Production

### Business Justification
[Why is this deployment needed?]

### Technical Justification
[Technical reasons for the change]

### Risk Assessment
- **Risk Level**: Medium
- **Potential Impact**: Service restart (5 minutes downtime)
- **Mitigation**: Rolling update, health checks

### Testing Completed
- [x] Unit tests passed
- [x] Integration tests passed
- [x] Staging deployment successful
- [x] Smoke tests passed

### Rollback Plan
[Rollback procedure if deployment fails]

### Approval Required
- [x] Technical Lead
- [x] CAB (for production)

#### Work Item Linking

**Link PR to Work Items**:

#!/bin/bash
# scripts/link-pr-to-workitems.sh

PR_ID="${1}"
WORK_ITEM_IDS="${2}"  # Space-separated work item IDs

echo "🔗 Linking PR ${PR_ID} to work items: ${WORK_ITEM_IDS}"

for WI_ID in ${WORK_ITEM_IDS}; do
  echo "   Linking to work item: ${WI_ID}"
  az repos pr work-item add \
    --id "${PR_ID}" \
    --work-item-id "${WI_ID}" \
    --organization ${ORG} \
    --project ${PROJECT}
done

# Verify links
echo "✅ Verifying links..."
az repos pr show \
  --id "${PR_ID}" \
  --organization ${ORG} \
  --project ${PROJECT} \
  --include-work-item-refs \
  --query "workItemRefs[].id" \
  --output table

### Git Commit History as Audit Evidence

#### Signed Commits (GPG)

**GPG Signing Configuration**:

#!/bin/bash
# scripts/setup-gpg-signing.sh

echo "🔐 Setting up GPG signing for Git commits"

# Generate GPG key (if not exists)
if ! gpg --list-secret-keys --keyid-format LONG | grep -q "sec"; then
  echo "Generating new GPG key..."
  gpg --full-generate-key
fi

# Get GPG key ID
GPG_KEY_ID=$(gpg --list-secret-keys --keyid-format LONG | \
  grep "^sec" | \
  sed -n 's/.*\/\([A-Z0-9]\{16\}\).*/\1/p' | \
  head -1)

echo "GPG Key ID: ${GPG_KEY_ID}"

# Configure Git to use GPG signing
git config --global user.signingkey "${GPG_KEY_ID}"
git config --global commit.gpgsign true

# Add GPG key to GitHub/Azure DevOps
echo "Add this public key to Azure DevOps:"
gpg --armor --export "${GPG_KEY_ID}"

echo "✅ GPG signing configured"

**Verify Signed Commits**:

#!/bin/bash
# scripts/verify-signed-commits.sh

COMMIT_RANGE="${1:-HEAD~10..HEAD}"

echo "✅ Verifying signed commits in range: ${COMMIT_RANGE}"

git log --pretty="format:%H|%G?|%aN|%s" "${COMMIT_RANGE}" | \
  while IFS='|' read -r commit signature author subject; do
    case "${signature}" in
      "G")
        echo "✅ ${commit}: Good signature (${author})"
        ;;
      "U")
        echo "⚠️  ${commit}: Good signature, unknown validity (${author})"
        ;;
      "X")
        echo "⚠️  ${commit}: Good signature, expired (${author})"
        ;;
      "Y")
        echo "⚠️  ${commit}: Good signature, expired key (${author})"
        ;;
      "R")
        echo "❌ ${commit}: Good signature, revoked key (${author})"
        ;;
      "B")
        echo "❌ ${commit}: Bad signature (${author})"
        ;;
      "E")
        echo "❌ ${commit}: Cannot verify (${author})"
        ;;
      "N")
        echo "❌ ${commit}: No signature (${author})"
        ;;
      *)
        echo "❓ ${commit}: Unknown status ${signature} (${author})"
        ;;
    esac
  done
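The `%G?` placeholder handled above has a fixed set of status codes (per `git log`'s pretty-format documentation). A small Python sketch of the same mapping, useful when post-processing the log output elsewhere:

```python
# %G? signature-status codes from `git log --pretty`, per git's
# pretty-format documentation.
SIGNATURE_STATUS = {
    "G": "Good signature",
    "B": "Bad signature",
    "U": "Good signature, unknown validity",
    "X": "Good signature, expired",
    "Y": "Good signature, expired key",
    "R": "Good signature, revoked key",
    "E": "Cannot verify (e.g. missing key)",
    "N": "No signature",
}

def describe(code: str) -> str:
    return SIGNATURE_STATUS.get(code, f"Unknown status: {code}")

print(describe("G"))  # Good signature
```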

#### Commit Message Standards

**Conventional Commits for Audit Trail**:

## Commit Message Format
`<type>(<scope>): <subject>`

### Types
- `feat`: New feature
- `fix`: Bug fix
- `docs`: Documentation
- `chore`: Maintenance
- `refactor`: Code refactoring

### Examples
feat(gateway): Add authentication middleware

Implements JWT token validation for API gateway.

Linked to: WI-12345
Approved by: Jane Smith

---

fix(ingestion): Resolve memory leak in event processor

Fixes issue where event processor was not releasing memory.

Linked to: WI-12346
CAB Approved: CR-2024-001

**Enforce Commit Message Standards**:
# platform/gitops/commit-msg-hook.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: commit-msg-hook
  namespace: flux-system
data:
  commit-msg: |
    #!/bin/sh
    # Commit message hook to enforce standards

    COMMIT_MSG_FILE=$1
    COMMIT_MSG=$(cat $COMMIT_MSG_FILE)

    # Check for conventional commit format
    if ! echo "${COMMIT_MSG}" | grep -qE "^(feat|fix|docs|chore|refactor)(\(.+\))?:"; then
      echo "❌ Commit message must follow conventional commits format"
      echo "   Format: <type>(<scope>): <subject>"
      exit 1
    fi

    # Check for work item reference
    if ! echo "${COMMIT_MSG}" | grep -qiE "(WI-|AB#|#)[0-9]+"; then
      echo "⚠️  Warning: No work item reference found"
    fi

    exit 0
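The same checks the hook performs can also be run outside Git, e.g. in a PR pipeline step. A minimal Python sketch mirroring the hook's two regexes; the function name is illustrative:

```python
import re

# Mirrors the commit-msg hook above: reject messages that do not start
# with a conventional-commit type, and warn when no work item is linked.
TYPE_RE = re.compile(r"^(feat|fix|docs|chore|refactor)(\(.+\))?:")
WORK_ITEM_RE = re.compile(r"(WI-|AB#|#)\d+", re.IGNORECASE)

def check_commit_message(msg: str):
    """Return (ok, warnings); ok=False blocks the commit."""
    if not TYPE_RE.match(msg):
        return False, ["message must follow <type>(<scope>): <subject>"]
    warnings = [] if WORK_ITEM_RE.search(msg) else ["no work item reference found"]
    return True, warnings

ok, warnings = check_commit_message("feat(gateway): Add auth middleware\n\nLinked to: WI-12345")
print(ok, warnings)  # True []
```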
---

### Deployment Receipts

#### Deployment Receipt Template

**Deployment Receipt Structure**:
# deployment-receipts/deployment-20240120-143022.yaml
apiVersion: compliance.atp.connectsoft.io/v1
kind: DeploymentReceipt
metadata:
  name: deployment-atp-gateway-20240120-143022
  namespace: atp-production
  creationTimestamp: "2024-01-20T14:30:22Z"
spec:
  deployment:
    id: "dep-20240120-143022"
    service: "atp-gateway"
    environment: "production"
    cluster: "atp-production-aks"
    namespace: "atp-production"
  what:
    image: "connectsoft.azurecr.io/atp/gateway:v1.2.3"
    git_commit: "abc123def456..."
    git_branch: "main"
    pr_number: 12345
  when:
    deployed_at: "2024-01-20T14:30:22Z"
    deployed_by: "FluxCD"
    reconciliation_time: "2024-01-20T14:30:25Z"
  who:
    approved_by:
    - name: "Jane Smith"
      email: "jane.smith@connectsoft.example"
      role: "Technical Lead"
      approval_date: "2024-01-20T10:30:00Z"
    - name: "Bob Johnson"
      email: "bob.johnson@connectsoft.example"
      role: "CAB Member"
      approval_date: "2024-01-20T11:00:00Z"
    merged_by:
      name: "John Doe"
      email: "john.doe@connectsoft.example"
      merge_date: "2024-01-20T12:00:00Z"
  why:
    work_items:
    - id: "WI-12345"
      title: "Add authentication middleware"
      url: "https://dev.azure.com/..."
    change_request: "CR-2024-001"
    justification: "Add JWT authentication for API security"
  where:
    region: "eastus"
    cluster: "atp-production-aks"
    namespace: "atp-production"
  evidence:
    pr_approval: "pr-approval-12345.json"
    security_scan: "security-scan-abc123.json"
    test_results: "test-results-abc123.json"
    sbom: "sbom-gateway-v1.2.3.json"
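Before a receipt is accepted as evidence, it helps to confirm it actually answers the what/when/who/why/where questions. A Python sketch, assuming the receipt has been parsed into a dict with the `spec` layout shown above:

```python
# Sketch: check a DeploymentReceipt (parsed to a dict) for the
# what/when/who/why/where sections used in the template above.
REQUIRED_SECTIONS = ("what", "when", "who", "why", "where")

def missing_sections(receipt: dict) -> list:
    spec = receipt.get("spec", {})
    return [s for s in REQUIRED_SECTIONS if not spec.get(s)]

receipt = {"spec": {
    "what": {"image": "connectsoft.azurecr.io/atp/gateway:v1.2.3"},
    "when": {"deployed_at": "2024-01-20T14:30:22Z"},
    "who": {"approved_by": [{"name": "Jane Smith"}]},
    "why": {"work_items": [{"id": "WI-12345"}]},
    "where": {"region": "eastus"},
}}
print(missing_sections(receipt))  # [] -> receipt is complete
```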
#### Automated Deployment Receipt Generation

**Deployment Receipt Generation Script**:
#!/bin/bash
# scripts/generate-deployment-receipt.sh

DEPLOYMENT="${1}"
NAMESPACE="${2:-atp-production}"
IMAGE="${3}"

echo "📜 Generating Deployment Receipt"

# Get deployment details
DEPLOYMENT_DATA=$(kubectl get deployment "${DEPLOYMENT}" -n "${NAMESPACE}" -o json)
IMAGE_TAG=$(echo "${DEPLOYMENT_DATA}" | jq -r '.spec.template.spec.containers[0].image')
GIT_COMMIT=$(echo "${IMAGE_TAG}" | cut -d':' -f2)

# Get PR information from Git commit
PR_INFO=$(git log --grep="${GIT_COMMIT}" --format="%s" | head -1)
PR_NUMBER=$(echo "${PR_INFO}" | grep -oP 'PR #\K\d+' || echo "")

# Get approval information
if [ -n "${PR_NUMBER}" ]; then
  APPROVALS=$(az repos pr show \
    --id "${PR_NUMBER}" \
    --organization ${ORG} \
    --project ${PROJECT} \
    --query "reviewers[?vote>=10]" \
    -o json)
fi

# Generate deployment receipt
cat > "deployment-receipt-${DEPLOYMENT}-$(date +%Y%m%d-%H%M%S).yaml" <<EOF
apiVersion: compliance.atp.connectsoft.io/v1
kind: DeploymentReceipt
metadata:
  name: deployment-${DEPLOYMENT}-$(date +%Y%m%d-%H%M%S)
  namespace: ${NAMESPACE}
  creationTimestamp: "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
spec:
  deployment:
    id: "dep-$(date +%Y%m%d-%H%M%S)"
    service: "${DEPLOYMENT}"
    environment: "${NAMESPACE}"
    cluster: "$(kubectl config current-context)"
    namespace: "${NAMESPACE}"
  what:
    image: "${IMAGE_TAG}"
    git_commit: "${GIT_COMMIT}"
    git_branch: "$(git branch --show-current)"
    pr_number: "${PR_NUMBER}"
  when:
    deployed_at: "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
    deployed_by: "FluxCD"
  who:
    approved_by:
$(echo "${APPROVALS}" | jq -r '.[] | "    - name: \"\(.displayName)\"\n      email: \"\(.uniqueName)\"\n      approval_date: \"\(.votedForDate)\""')
  why:
    work_items:
    - id: "$(git log -1 --pretty=format:"%s" | grep -oP 'WI-\K\d+' || echo "N/A")"
  where:
    region: "$(kubectl get nodes -o jsonpath='{.items[0].metadata.labels.topology\.kubernetes\.io/region}')"
    cluster: "$(kubectl config current-context)"
    namespace: "${NAMESPACE}"
EOF

echo "✅ Deployment receipt generated"
---

### Security Scan Results

#### Vulnerability Reports

**Vulnerability Scan Evidence Collection**:
#!/bin/bash
# scripts/collect-vulnerability-evidence.sh

IMAGE="${1}"
SCAN_DATE="${2:-$(date +%Y-%m-%d)}"

echo "🔒 Collecting Vulnerability Scan Evidence"

# Run Trivy scan
trivy image --format json --output "vulnerability-scan-${IMAGE}-${SCAN_DATE}.json" "${IMAGE}"

# Generate summary report
trivy image --format table "${IMAGE}" > "vulnerability-summary-${IMAGE}-${SCAN_DATE}.txt"

# Extract critical vulnerabilities
jq '[.Results[]?.Vulnerabilities[]? | select(.Severity == "CRITICAL")]' \
  "vulnerability-scan-${IMAGE}-${SCAN_DATE}.json" > \
  "vulnerability-critical-${IMAGE}-${SCAN_DATE}.json"

# Generate evidence document
cat > "vulnerability-evidence-${IMAGE}-${SCAN_DATE}.md" <<EOF
# Vulnerability Scan Evidence

**Image**: ${IMAGE}
**Scan Date**: ${SCAN_DATE}
**Scanner**: Trivy

## Summary

- Total Vulnerabilities: $(jq '[.Results[]?.Vulnerabilities[]?] | length' vulnerability-scan-${IMAGE}-${SCAN_DATE}.json)
- Critical: $(jq '[.Results[]?.Vulnerabilities[]? | select(.Severity == "CRITICAL")] | length' vulnerability-scan-${IMAGE}-${SCAN_DATE}.json)
- High: $(jq '[.Results[]?.Vulnerabilities[]? | select(.Severity == "HIGH")] | length' vulnerability-scan-${IMAGE}-${SCAN_DATE}.json)

## Critical Vulnerabilities

$(jq -r '.Results[]?.Vulnerabilities[]? | select(.Severity == "CRITICAL") | "- \(.VulnerabilityID): \(.Title)"' vulnerability-scan-${IMAGE}-${SCAN_DATE}.json)

## Remediation Status

- [ ] All critical vulnerabilities remediated
- [ ] High vulnerabilities reviewed
- [ ] Risk assessment completed

## Approval

**Reviewed By**: [Reviewer Name]
**Date**: ${SCAN_DATE}
**Approval**: [Approved/Rejected with Justification]
EOF

echo "✅ Vulnerability evidence collected"
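The jq severity tallies above can equally be computed in code when building dashboards. A Python sketch over Trivy's JSON report layout (`Results[].Vulnerabilities[]`); the sample report is illustrative:

```python
from collections import Counter

# Sketch: tally vulnerabilities by severity from a Trivy JSON report,
# matching the jq expressions in the script above. `or []` guards
# against null Results/Vulnerabilities, which Trivy may emit.
def severity_counts(report: dict) -> Counter:
    counts = Counter()
    for result in report.get("Results") or []:
        for vuln in result.get("Vulnerabilities") or []:
            counts[vuln.get("Severity", "UNKNOWN")] += 1
    return counts

report = {"Results": [{"Vulnerabilities": [
    {"VulnerabilityID": "CVE-2024-0001", "Severity": "CRITICAL"},
    {"VulnerabilityID": "CVE-2024-0002", "Severity": "HIGH"},
]}]}
print(severity_counts(report)["CRITICAL"])  # 1
```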
#### SBOM Artifacts

**SBOM Generation and Storage**:
#!/bin/bash
# scripts/generate-sbom-evidence.sh

IMAGE="${1}"
VERSION="${2}"

echo "📦 Generating SBOM Evidence"

# Generate SBOM with Syft
syft packages "${IMAGE}" -o cyclonedx-json > "sbom-${IMAGE}-${VERSION}.json"

# Attach SBOM to image in ACR
oras attach \
  --artifact-type "application/vnd.cyclonedx+json" \
  connectsoft.azurecr.io/atp/gateway:${VERSION} \
  "sbom-${IMAGE}-${VERSION}.json"

# Verify SBOM attachment
oras discover \
  --artifact-type "application/vnd.cyclonedx+json" \
  connectsoft.azurecr.io/atp/gateway:${VERSION}

echo "✅ SBOM evidence generated and stored"
#### Policy Compliance Reports

**Policy Compliance Evidence**:
#!/bin/bash
# scripts/generate-policy-compliance-report.sh

NAMESPACE="${1:-atp-production}"

echo "✅ Generating Policy Compliance Report"

# Check Azure Policy compliance
az policy state summarize \
  --resource "/subscriptions/${SUBSCRIPTION_ID}/resourceGroups/${RG}" \
  --output json > "azure-policy-compliance-$(date +%Y%m%d).json"

# Check Pod Security Standards
kubectl get pods -n "${NAMESPACE}" -o json | \
  jq -r '.items[] | select(.spec.securityContext == null) | "\(.metadata.name): Missing security context"' > \
  "pss-compliance-${NAMESPACE}-$(date +%Y%m%d).txt"

# Generate compliance report
cat > "policy-compliance-report-$(date +%Y%m%d).md" <<EOF
# Policy Compliance Report

**Date**: $(date +%Y-%m-%d)
**Namespace**: ${NAMESPACE}

## Azure Policy Compliance

$(az policy state summarize --resource "/subscriptions/${SUBSCRIPTION_ID}/resourceGroups/${RG}" --query "results.resourceDetails[].{Resource:resourceId, Compliance:complianceState}" -o table)

## Pod Security Standards

$(kubectl get pods -n "${NAMESPACE}" -o json | jq -r 'if ([.items[] | select(.spec.securityContext == null)] | length) == 0 then "✅ All pods compliant" else (.items[] | select(.spec.securityContext == null) | "- \(.metadata.name): Non-compliant") end')

## Network Policies

$(kubectl get networkpolicies -n "${NAMESPACE}" --no-headers | wc -l) network policies applied

## RBAC Compliance

$(kubectl get rolebindings,clusterrolebindings -n "${NAMESPACE}" --no-headers | wc -l) RBAC bindings reviewed
EOF

echo "✅ Policy compliance report generated"
---

### Policy Enforcement Evidence

#### Azure Policy Compliance Reports

**Azure Policy Compliance Query**:
// Log Analytics: Azure Policy Compliance
PolicyResources
| where TimeGenerated > ago(30d)
| where complianceState != "Compliant"
| project 
    TimeGenerated,
    resourceId,
    complianceState,
    policyAssignmentName,
    policyDefinitionName
| order by TimeGenerated desc
#### Pod Security Admission Logs

**Pod Security Admission Evidence**:
// Log Analytics: Pod Security Admission Logs
KubernetesAudit
| where TimeGenerated > ago(30d)
| where Category == "Admission"
| where ObjectRef.resource == "pods"
| where ResponseStatus.code == 403
| where ResponseStatus.reason contains "violates PodSecurity"
| project 
    TimeGenerated,
    User,
    ObjectRef.name,
    ObjectRef.namespace,
    ResponseStatus.message
| order by TimeGenerated desc
#### RBAC Audit Logs

**RBAC Audit Log Collection**:
#!/bin/bash
# scripts/collect-rbac-audit-logs.sh

START_DATE="${1:-$(date -d '30 days ago' +%Y-%m-%d)}"
END_DATE="${2:-$(date +%Y-%m-%d)}"

echo "📋 Collecting RBAC Audit Logs: ${START_DATE} to ${END_DATE}"

# Query Kubernetes audit logs for RBAC events
az monitor log-analytics query \
  --workspace ${LOG_ANALYTICS_WORKSPACE_ID} \
  --analytics-query "
    KubernetesAudit
    | where TimeGenerated between (datetime('${START_DATE}') .. datetime('${END_DATE}'))
    | where ObjectRef.resource in ('roles', 'rolebindings', 'clusterroles', 'clusterrolebindings')
    | project 
        TimeGenerated,
        User,
        Verb,
        ObjectRef.resource,
        ObjectRef.name,
        ObjectRef.namespace,
        ResponseStatus.code
    | order by TimeGenerated desc
  " \
  --output table > "rbac-audit-logs-${START_DATE}-${END_DATE}.csv"

echo "✅ RBAC audit logs collected"
---

### Quarterly Access Reviews

#### Reviewing RBAC in Git

**RBAC Access Review Procedure**:
#!/bin/bash
# scripts/rbac-access-review.sh

REVIEW_DATE="${1:-$(date +%Y-%m-%d)}"

echo "👥 RBAC Access Review - ${REVIEW_DATE}"

# Export all RBAC bindings
kubectl get rolebindings,clusterrolebindings --all-namespaces -o json > \
  "rbac-bindings-review-${REVIEW_DATE}.json"

# Generate review report
cat > "rbac-access-review-${REVIEW_DATE}.md" <<EOF
# RBAC Access Review Report

**Review Date**: ${REVIEW_DATE}
**Reviewer**: [Reviewer Name]

## Role Bindings

$(kubectl get rolebindings --all-namespaces --no-headers | wc -l) role bindings reviewed

### Findings

$(kubectl get rolebindings --all-namespaces -o json | \
  jq -r '.items[] | "- Namespace: \(.metadata.namespace), Role: \(.roleRef.name), Subjects: \(.subjects // [] | length)"')

## Cluster Role Bindings

$(kubectl get clusterrolebindings --no-headers | wc -l) cluster role bindings reviewed

### Findings

$(kubectl get clusterrolebindings -o json | \
  jq -r '.items[] | "- Role: \(.roleRef.name), Subjects: \(.subjects // [] | length)"')

## Review Actions

- [ ] All access is justified
- [ ] No orphaned bindings
- [ ] Least privilege enforced
- [ ] Documentation updated

## Approval

**Reviewed By**: [Reviewer Name]
**Date**: ${REVIEW_DATE}
**Signature**: [Digital Signature]
EOF

echo "✅ RBAC access review complete"
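A review pass one might automate on the exported bindings: flagging every subject bound to `cluster-admin`. A Python sketch over the `kubectl get clusterrolebindings -o json` shape; the sample data is illustrative:

```python
# Sketch: flag every subject bound to cluster-admin in the output of
# `kubectl get clusterrolebindings -o json`.
def cluster_admin_subjects(bindings: dict) -> list:
    flagged = []
    for item in bindings.get("items", []):
        if item.get("roleRef", {}).get("name") != "cluster-admin":
            continue
        for subject in item.get("subjects") or []:  # subjects may be null
            flagged.append(f"{subject.get('kind')}/{subject.get('name')}")
    return flagged

bindings = {"items": [
    {"roleRef": {"name": "cluster-admin"},
     "subjects": [{"kind": "User", "name": "jane.smith@connectsoft.example"}]},
    {"roleRef": {"name": "view"},
     "subjects": [{"kind": "Group", "name": "atp-readers"}]},
]}
print(cluster_admin_subjects(bindings))  # ['User/jane.smith@connectsoft.example']
```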
#### Reviewing Key Vault Permissions

**Key Vault Access Review**:
#!/bin/bash
# scripts/keyvault-access-review.sh

KEY_VAULT="${1:-atp-keyvault}"
REVIEW_DATE="${2:-$(date +%Y-%m-%d)}"

echo "🔑 Key Vault Access Review - ${KEY_VAULT}"

# Get access policies
az keyvault show \
  --name "${KEY_VAULT}" \
  --query "properties.accessPolicies" \
  -o json > "keyvault-access-policies-${REVIEW_DATE}.json"

# Get RBAC assignments
az role assignment list \
  --scope "/subscriptions/${SUBSCRIPTION_ID}/resourceGroups/${RG}/providers/Microsoft.KeyVault/vaults/${KEY_VAULT}" \
  -o json > "keyvault-rbac-${REVIEW_DATE}.json"

# Generate review report
cat > "keyvault-access-review-${REVIEW_DATE}.md" <<EOF
# Key Vault Access Review

**Vault**: ${KEY_VAULT}
**Review Date**: ${REVIEW_DATE}

## Access Policies

$(az keyvault show --name "${KEY_VAULT}" --query "length(properties.accessPolicies)" -o tsv) access policies

## RBAC Assignments

$(az role assignment list --scope "/subscriptions/${SUBSCRIPTION_ID}/resourceGroups/${RG}/providers/Microsoft.KeyVault/vaults/${KEY_VAULT}" --query "length([])" -o tsv) RBAC assignments

## Review Findings

- [ ] All access is justified
- [ ] No orphaned permissions
- [ ] Least privilege enforced
- [ ] Workload Identity used where appropriate

## Approval

**Reviewed By**: [Reviewer Name]
**Date**: ${REVIEW_DATE}
EOF

echo "✅ Key Vault access review complete"
#### Evidence of Reviews

**Access Review Evidence Template**:
# compliance/access-reviews/access-review-2024-Q1.yaml
apiVersion: compliance.atp.connectsoft.io/v1
kind: AccessReview
metadata:
  name: access-review-2024-q1
  namespace: atp-production
spec:
  review:
    type: Quarterly
    quarter: Q1
    year: 2024
    reviewDate: "2024-03-31"
  scope:
    rbac: true
    keyVault: true
    azureDevOps: true
    azureAD: true
  findings:
    rbac:
      totalBindings: 45
      reviewed: 45
      issuesFound: 2
      issuesResolved: 2
    keyVault:
      totalPolicies: 12
      reviewed: 12
      issuesFound: 0
  approval:
    reviewedBy: "Jane Smith"
    reviewDate: "2024-03-31"
    approved: true
    signature: "[Digital Signature]"
  evidence:
    rbacReport: "rbac-access-review-2024-03-31.md"
    keyVaultReport: "keyvault-access-review-2024-03-31.md"
    azureDevOpsReport: "azdo-access-review-2024-03-31.md"
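A sign-off gate can check that an AccessReview shows full coverage and no unresolved issues. A Python sketch, assuming the `findings` layout from the YAML above:

```python
# Sketch: gate sign-off on an AccessReview's findings — every item
# reviewed and every issue resolved. Keys follow the YAML example above.
def review_complete(findings: dict) -> bool:
    for area in findings.values():
        total = area.get("totalBindings", area.get("totalPolicies", 0))
        if area.get("reviewed", 0) < total:
            return False
        if area.get("issuesFound", 0) > area.get("issuesResolved", 0):
            return False
    return True

findings = {
    "rbac": {"totalBindings": 45, "reviewed": 45,
             "issuesFound": 2, "issuesResolved": 2},
    "keyVault": {"totalPolicies": 12, "reviewed": 12, "issuesFound": 0},
}
print(review_complete(findings))  # True
```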
---

### Audit Log Retention

#### 7-Year Retention Requirement

**Audit Log Retention Configuration**:
# platform/compliance/audit-log-retention-policy.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: audit-log-retention-policy
  namespace: monitoring
data:
  retention-policy.yaml: |
    # Audit Log Retention Policy
    # SOC 2, GDPR, HIPAA Requirement: 7 years

    retention:
      default: 7y
      compliance:
        soc2: 7y
        gdpr: 7y
        hipaa: 7y
      categories:
        kubernetes_audit: 7y
        azure_activity: 7y
        deployment_logs: 7y
        access_logs: 7y
        security_scans: 7y
    storage:
      backend: azure-blob
      account: atpauditlogs
      container: audit-logs-immutable
      immutability:
        enabled: true
        period: 2555d  # 7 years
      encryption:
        enabled: true
        key_vault: atp-keyvault
        key_name: audit-log-encryption-key
**Audit Log Archive to Immutable Storage**:
#!/bin/bash
# scripts/archive-audit-logs-to-blob.sh

START_DATE="${1:-$(date -d '1 year ago' +%Y-%m-%d)}"
END_DATE="${2:-$(date +%Y-%m-%d)}"

echo "📦 Archiving audit logs to immutable storage"

# Export logs from Log Analytics
az monitor log-analytics query \
  --workspace ${LOG_ANALYTICS_WORKSPACE_ID} \
  --analytics-query "
    union *
    | where TimeGenerated between (datetime('${START_DATE}') .. datetime('${END_DATE}'))
    | where Category in ('KubernetesAudit', 'AzureActivity', 'ContainerLog')
  " \
  --output json > "audit-logs-${START_DATE}-${END_DATE}.json"

# Upload to immutable blob storage
# (version-level immutability takes an expiry datetime, not a day count)
EXPIRY_DATE=$(date -u -d '+7 years' +%Y-%m-%dT%H:%M:%SZ)
az storage blob upload \
  --account-name atpauditlogs \
  --container-name audit-logs-immutable \
  --name "audit-logs-${START_DATE}-${END_DATE}.json" \
  --file "audit-logs-${START_DATE}-${END_DATE}.json" \
  --tier Archive \
  --immutability-policy-mode Unlocked \
  --immutability-policy-expiry "${EXPIRY_DATE}"

# Confirm the immutability policy on the uploaded blob
az storage blob immutability-policy set \
  --account-name atpauditlogs \
  --container-name audit-logs-immutable \
  --name "audit-logs-${START_DATE}-${END_DATE}.json" \
  --policy-mode Unlocked \
  --expiry-time "${EXPIRY_DATE}"

echo "✅ Audit logs archived to immutable storage"
#### eDiscovery Procedures

**eDiscovery Export Procedure**:
#!/bin/bash
# scripts/ediscovery-export.sh

CASE_ID="${1}"
START_DATE="${2}"
END_DATE="${3}"

echo "📋 eDiscovery Export - Case: ${CASE_ID}"

# Create export directory
EXPORT_DIR="ediscovery-${CASE_ID}-$(date +%Y%m%d)"
mkdir -p "${EXPORT_DIR}"

# Export audit logs
az monitor log-analytics query \
  --workspace ${LOG_ANALYTICS_WORKSPACE_ID} \
  --analytics-query "
    union *
    | where TimeGenerated between (datetime('${START_DATE}') .. datetime('${END_DATE}'))
  " \
  --output json > "${EXPORT_DIR}/audit-logs.json"

# Export deployment receipts
kubectl get deploymentreceipt --all-namespaces -o json > \
  "${EXPORT_DIR}/deployment-receipts.json"

# Export PR approvals
# (Query Azure DevOps API for PR approvals in date range)

# Export access reviews
kubectl get accessreview --all-namespaces -o json > \
  "${EXPORT_DIR}/access-reviews.json"

# Generate export manifest
cat > "${EXPORT_DIR}/export-manifest.md" <<EOF
# eDiscovery Export Manifest

**Case ID**: ${CASE_ID}
**Export Date**: $(date +%Y-%m-%d)
**Date Range**: ${START_DATE} to ${END_DATE}

## Contents

1. Audit Logs: audit-logs.json
2. Deployment Receipts: deployment-receipts.json
3. Access Reviews: access-reviews.json

## Chain of Custody

**Exported By**: [Exporter Name]
**Date**: $(date +%Y-%m-%d)
**Purpose**: Legal eDiscovery - Case ${CASE_ID}
**Recipient**: [Recipient Name]

## Integrity Verification (SHA-256 per file)

$(sha256sum "${EXPORT_DIR}"/*.json)
EOF

# Create export archive
tar -czf "${EXPORT_DIR}.tar.gz" "${EXPORT_DIR}"

echo "✅ eDiscovery export complete: ${EXPORT_DIR}.tar.gz"
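For chain of custody, a per-file digest manifest is easier to verify than a single aggregate hash. A Python sketch; the directory layout follows the export script above:

```python
import hashlib
from pathlib import Path

# Sketch: per-file SHA-256 manifest for an eDiscovery export, so each
# artifact can be verified independently of the archive.
def hash_manifest(export_dir: str) -> dict:
    return {path.name: hashlib.sha256(path.read_bytes()).hexdigest()
            for path in sorted(Path(export_dir).glob("*.json"))}
```

Writing the result alongside the export (one `filename  digest` line per file, in the style of `sha256sum`) lets the recipient re-verify every artifact individually.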
---

### Compliance Reporting Automation

#### Automated Evidence Collection

**Automated Compliance Evidence Collection**:
# platform/compliance/evidence-collection-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: compliance-evidence-collection
  namespace: compliance
spec:
  schedule: "0 0 * * 0"  # Weekly on Sunday
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: evidence-collection
            image: mcr.microsoft.com/azure-cli:latest
            command:
            - /bin/bash
            - -c
            - |
              # Collect weekly compliance evidence
              /scripts/collect-compliance-evidence.sh weekly
          restartPolicy: OnFailure
**Compliance Evidence Collection Script**:
#!/bin/bash
# scripts/collect-compliance-evidence.sh

PERIOD="${1:-weekly}"  # daily, weekly, monthly, quarterly

echo "📋 Collecting Compliance Evidence - ${PERIOD}"

EVIDENCE_DIR="compliance-evidence-${PERIOD}-$(date +%Y%m%d)"
mkdir -p "${EVIDENCE_DIR}"

# Collect deployment receipts
kubectl get deploymentreceipt --all-namespaces -o json > \
  "${EVIDENCE_DIR}/deployment-receipts.json"

# Collect PR approvals (last period)
# Query Azure DevOps API

# Collect access logs
az monitor log-analytics query \
  --workspace ${LOG_ANALYTICS_WORKSPACE_ID} \
  --analytics-query "
    KubernetesAudit
    | where TimeGenerated > ago(7d)
  " \
  --output json > "${EVIDENCE_DIR}/audit-logs.json"

# Collect policy compliance
az policy state summarize \
  --resource "/subscriptions/${SUBSCRIPTION_ID}/resourceGroups/${RG}" \
  --output json > "${EVIDENCE_DIR}/policy-compliance.json"

# Generate evidence manifest
cat > "${EVIDENCE_DIR}/evidence-manifest.md" <<EOF
# Compliance Evidence Collection

**Period**: ${PERIOD}
**Date**: $(date +%Y-%m-%d)

## Evidence Collected

- Deployment Receipts
- PR Approvals
- Audit Logs
- Policy Compliance

## Retention

This evidence is retained for 7 years per SOC 2, GDPR, and HIPAA requirements.
EOF

echo "✅ Compliance evidence collected: ${EVIDENCE_DIR}"
#### Compliance Dashboards

**Compliance Dashboard Configuration**:
# monitoring/compliance/compliance-dashboard.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: compliance-dashboard
  namespace: monitoring
data:
  dashboard.json: |
    {
      "dashboard": {
        "title": "Compliance Dashboard",
        "panels": [
          {
            "title": "SOC 2 Compliance Status",
            "targets": [
              {
                "expr": "compliance_soc2_controls_total",
                "legendFormat": "{{control}}"
              }
            ]
          },
          {
            "title": "Deployment Approvals (30 days)",
            "targets": [
              {
                "expr": "sum(azure_devops_pr_approvals_total[30d])",
                "legendFormat": "Approvals"
              }
            ]
          },
          {
            "title": "Access Reviews",
            "targets": [
              {
                "expr": "compliance_access_reviews_total",
                "legendFormat": "{{review_type}}"
              }
            ]
          }
        ]
      }
    }
#### Monthly Compliance Reports

**Monthly Compliance Report Generation**:
#!/bin/bash
# scripts/generate-monthly-compliance-report.sh

MONTH="${1:-$(date +%Y-%m)}"

echo "📊 Generating Monthly Compliance Report - ${MONTH}"

cat > "compliance-report-${MONTH}.md" <<EOF
# Monthly Compliance Report

**Month**: ${MONTH}
**Generated**: $(date +%Y-%m-%d)

## SOC 2 Compliance

### CC8.1 - Change Management
- Total Changes: $(az repos pr list --organization ${ORG} --project ${PROJECT} --status completed --output json | jq '[.[] | select(.creationDate | startswith("'${MONTH}'"))] | length')
- Approved Changes: $(az repos pr list --organization ${ORG} --project ${PROJECT} --status completed --output json | jq '[.[] | select((.creationDate | startswith("'${MONTH}'")) and (.reviewers | length > 0) and (.reviewers | all(.vote >= 10)))] | length')
- Compliance: ✅

### CC6.1 - Access Control
- Access Reviews: [Number]
- Issues Found: [Number]
- Issues Resolved: [Number]
- Compliance: ✅

### CC7.2 - System Monitoring
- Monitoring Coverage: [Percentage]
- Alerts Configured: [Number]
- Compliance: ✅

## GDPR Compliance

- Data Deletion Requests: [Number]
- Right to be Forgotten: [Number]
- Data Residency Verified: ✅

## HIPAA Compliance

- Audit Logs Collected: [Number]
- Encryption Verified: ✅
- Access Controls Enforced: ✅

## Summary

- SOC 2: ✅ Compliant
- GDPR: ✅ Compliant
- HIPAA: ✅ Compliant

## Evidence

All evidence for this report is stored in: compliance-evidence-monthly-${MONTH}/
EOF

echo "✅ Compliance report generated: compliance-report-${MONTH}.md"
---

### Summary: Compliance & Audit Evidence Collection

- **SOC 2 Type II Controls**: CC8.1 (Change Management), CC6.1 (Logical and Physical Access), CC7.2 (System Monitoring), GitOps workflow mapping to controls with evidence collection scripts
- **GDPR Compliance**: Right to be forgotten (tenant offboarding procedure), data residency enforcement, audit logs and retention (7-year), privacy by design implementation
- **HIPAA Audit Trail**: Access logs configuration, deployment logs, encryption verification scripts, incident response documentation template
- **Change Advisory Board (CAB) Process**: When CAB approval is required (decision tree), CAB meeting schedule, change request template, CAB review criteria, approval documentation with YAML structure
- **Deployment Approval Records**: PR approvals in Azure DevOps, approver identity and timestamp, justification and risk assessment templates, work item linking scripts
- **Git Commit History as Audit Evidence**: Signed commits (GPG setup and verification), commit message standards (Conventional Commits), commit message hook enforcement
- **Deployment Receipts**: Deployment receipt template (YAML structure), automated deployment receipt generation script
- **Security Scan Results**: Vulnerability reports collection, SBOM artifacts generation and storage, policy compliance reports
- **Policy Enforcement Evidence**: Azure Policy compliance reports (KQL queries), Pod Security Admission logs, RBAC audit log collection
- **Quarterly Access Reviews**: Reviewing RBAC in Git, reviewing Key Vault permissions, reviewing Azure DevOps access, evidence of reviews (YAML structure)
- **Audit Log Retention**: 7-year retention requirement configuration, audit log archive to immutable storage, eDiscovery export procedures
- **Compliance Reporting Automation**: Automated evidence collection (CronJob), compliance dashboards (Grafana JSON), monthly compliance report generation scripts

---

## Training, Documentation & Best Practices
**Purpose**: Define comprehensive training programs, documentation standards, workflow tutorials, troubleshooting playbooks, best practices catalog, reference architectures, and continuous improvement processes for ATP's GitOps deployments, ensuring team proficiency, operational excellence, and knowledge sharing across all platform engineering activities.

---

### Developer Onboarding Guide

#### Getting Started with GitOps

**Prerequisites Checklist**:
## Prerequisites for GitOps Development

### Required Access
- [ ] Azure DevOps account with appropriate permissions
- [ ] Access to `atp-gitops` repository
- [ ] Access to AKS clusters (dev/test at minimum)
- [ ] Azure CLI installed and configured
- [ ] kubectl installed and configured
- [ ] Helm CLI installed
- [ ] Flux CLI installed
- [ ] Git configured with SSH keys or PAT

### Required Knowledge
- [ ] Basic Kubernetes concepts
- [ ] YAML syntax
- [ ] Git fundamentals (branching, PRs, merging)
- [ ] Basic understanding of GitOps principles
- [ ] Azure DevOps PR workflow

### Verification
Run these commands to verify setup:
```bash
# Verify Azure CLI
az --version

# Verify kubectl
kubectl version --client

# Verify Helm
helm version

# Verify Flux
flux --version

# Verify Git access
git clone ssh://git@ssh.dev.azure.com/v3/ConnectSoft/atp-gitops/atp-gitops
```
**GitOps Learning Path**:

```mermaid
graph TD
    START[New Developer] --> GIT[Git Fundamentals]
    GIT --> K8S[Kubernetes Basics]
    K8S --> GITOPS[GitOps Principles]
    GITOPS --> FLUX[FluxCD Tutorial]
    FLUX --> HELM[Helm Charts]
    HELM --> KUSTOMIZE[Kustomize]
    KUSTOMIZE --> EXERCISE[First PR Exercise]
    EXERCISE --> REVIEW[Code Review]
    REVIEW --> DEPLOY[Preview Deployment]
    DEPLOY --> COMPLETE[Onboarding Complete]

    style START fill:#FFE5B4
    style COMPLETE fill:#90EE90
```
#### Repository Structure Overview

**Repository Structure Tutorial**:
## ATP GitOps Repository Structure
```
atp-gitops/
├── apps/                    # Application manifests
│   ├── atp-gateway/
│   │   ├── base/            # Base manifests
│   │   │   ├── deployment.yaml
│   │   │   ├── service.yaml
│   │   │   └── kustomization.yaml
│   │   └── overlays/        # Environment-specific
│   │       ├── dev/
│   │       ├── test/
│   │       ├── staging/
│   │       └── production/
│   └── atp-ingestion/
│       └── ...
├── platform/                # Platform components
│   ├── flux-system/         # FluxCD configuration
│   ├── monitoring/
│   └── networking/
├── tenants/                 # Tenant-specific configs
│   └── tenant-{id}/
├── infrastructure/          # Pulumi IaC
│   └── ...
├── scripts/                 # Automation scripts
└── docs/                    # Documentation
    └── ...
```
### Key Directories

1. **apps/**: Application deployment manifests
2. **platform/**: Shared platform components
3. **tenants/**: Multi-tenant configurations
4. **infrastructure/**: Infrastructure as Code
5. **scripts/**: Automation and utilities
#### Git Workflow Tutorial

**Step-by-Step Git Workflow**:
#!/bin/bash
# tutorials/git-workflow-tutorial.sh

echo "📚 Git Workflow Tutorial"
echo "========================"

# Step 1: Clone repository
echo "Step 1: Clone the repository"
echo "git clone ssh://git@ssh.dev.azure.com/v3/ConnectSoft/atp-gitops/atp-gitops"
echo "cd atp-gitops"

# Step 2: Create feature branch
echo ""
echo "Step 2: Create a feature branch"
echo "git checkout -b feature/add-new-service"

# Step 3: Make changes
echo ""
echo "Step 3: Make your changes"
echo "# Edit manifest files"
echo "vim apps/my-service/base/deployment.yaml"

# Step 4: Commit changes
echo ""
echo "Step 4: Commit your changes"
echo "git add apps/my-service/"
echo 'git commit -m "feat(my-service): Add new service deployment

- Add deployment manifest
- Add service manifest
- Configure health checks

Linked to: WI-12345"'

# Step 5: Push branch
echo ""
echo "Step 5: Push branch to remote"
echo "git push -u origin feature/add-new-service"

# Step 6: Create PR
echo ""
echo "Step 6: Create Pull Request"
echo "# Use Azure DevOps web interface or CLI:"
echo "az repos pr create \\"
echo "  --source-branch feature/add-new-service \\"
echo "  --target-branch main \\"
echo "  --title 'feat(my-service): Add new service deployment' \\"
echo "  --description 'Adds deployment manifests for my-service'"
**Git Workflow Diagram**:

```mermaid
sequenceDiagram
    participant Dev as Developer
    participant Local as Local Git
    participant Remote as Azure Repos
    participant PR as Pull Request
    participant Flux as FluxCD

    Dev->>Local: Create feature branch
    Dev->>Local: Edit manifests
    Dev->>Local: Commit changes
    Local->>Remote: Push branch
    Remote->>PR: Create Pull Request
    PR->>PR: Code Review
    PR->>PR: Automated Tests
    PR->>Remote: Merge to main
    Remote->>Flux: FluxCD detects change
    Flux->>Flux: Reconcile & Deploy
```
#### Manifest Authoring Basics

**First Manifest Tutorial**:

```yaml
# tutorials/first-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-first-service
  namespace: atp-dev
  labels:
    app: my-first-service
    version: v1.0.0
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-first-service
  template:
    metadata:
      labels:
        app: my-first-service
        version: v1.0.0
    spec:
      containers:
      - name: my-service
        image: connectsoft.azurecr.io/atp/my-service:v1.0.0
        ports:
        - containerPort: 8080
          name: http
        env:
        - name: ASPNETCORE_ENVIRONMENT
          value: "Development"
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 256Mi
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5

---
apiVersion: v1
kind: Service
metadata:
  name: my-first-service
  namespace: atp-dev
spec:
  selector:
    app: my-first-service
  ports:
  - port: 80
    targetPort: 8080
    protocol: TCP
  type: ClusterIP
```
**Manifest Authoring Checklist**:

```markdown
## Manifest Authoring Checklist

### Required Elements

- [ ] Appropriate API version
- [ ] Correct resource kind
- [ ] Unique name (within namespace)
- [ ] Namespace specified
- [ ] Labels for selection
- [ ] Resource requests and limits

### Best Practices

- [ ] No hardcoded secrets
- [ ] Image tags are specific (not `latest`)
- [ ] Health checks configured
- [ ] Resource limits set
- [ ] Security context configured
- [ ] Network policies considered

### Security

- [ ] No secrets in plaintext
- [ ] Least privilege RBAC
- [ ] Pod Security Standards applied
- [ ] Image scanning passed
```
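Parts of the checklist above can be automated; a toy bash lint pass (illustrative only; use `kubectl apply --dry-run=client` or kubeconform for real validation):

```bash
#!/bin/bash
# Toy lint pass for a few of the checklist items above (illustrative only;
# real validation should use kubectl --dry-run or kubeconform).
lint_manifest() {
  local f="$1" problems=0
  grep -q '^kind:' "$f"     || { echo "missing kind";       problems=1; }
  grep -q 'namespace:' "$f" || { echo "missing namespace";  problems=1; }
  grep -q 'resources:' "$f" || { echo "missing resources";  problems=1; }
  grep -q ':latest' "$f"    && { echo "image uses :latest"; problems=1; }
  return "$problems"
}

# Hypothetical manifest that violates three of the checks
MANIFEST="$(mktemp)"
cat > "${MANIFEST}" <<'EOF'
kind: Deployment
image: connectsoft.azurecr.io/atp/demo:latest
EOF
lint_manifest "${MANIFEST}" || echo "lint failed as expected"
```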
#### Creating First PR

**First PR Exercise**:

```bash
#!/bin/bash
# tutorials/first-pr-exercise.sh

echo "🎯 First PR Exercise"
echo "===================="

# Exercise: Deploy a simple hello-world service

# Step 1: Create directory structure
echo "Step 1: Create directory structure"
mkdir -p apps/hello-world/base
mkdir -p apps/hello-world/overlays/dev

# Step 2: Create base deployment
cat > apps/hello-world/base/deployment.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-world
spec:
  replicas: 1
  selector:
    matchLabels:
      app: hello-world
  template:
    metadata:
      labels:
        app: hello-world
    spec:
      containers:
      - name: hello-world
        image: mcr.microsoft.com/dotnet/samples:aspnetapp
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 200m
            memory: 256Mi
EOF

# Step 3: Create base service
cat > apps/hello-world/base/service.yaml <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: hello-world
spec:
  selector:
    app: hello-world
  ports:
  - port: 80
    targetPort: 80
EOF

# Step 4: Create base kustomization
cat > apps/hello-world/base/kustomization.yaml <<'EOF'
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
- deployment.yaml
- service.yaml

commonLabels:
  app: hello-world
  managed-by: kustomize
EOF

# Step 5: Create dev overlay
cat > apps/hello-world/overlays/dev/kustomization.yaml <<'EOF'
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: atp-dev

resources:
- ../../base

replicas:
- name: hello-world
  count: 2

labels:
- pairs:
    environment: dev
EOF

echo "✅ Exercise files created!"
echo "Next steps:"
echo "1. Review the created files"
echo "2. Commit and push"
echo "3. Create a PR"
echo "4. Request review"
```
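Before committing the exercise files, a quick structural check helps; a self-contained bash sketch that recreates the layout and verifies each overlay references the shared base (paths are illustrative):

```bash
#!/bin/bash
# Sketch: verify every overlay kustomization references the shared base,
# mirroring the layout the exercise above creates (paths are illustrative).
set -eu
ROOT="$(mktemp -d)"
mkdir -p "${ROOT}/apps/hello-world/overlays/dev"
printf 'resources:\n- ../../base\n' > "${ROOT}/apps/hello-world/overlays/dev/kustomization.yaml"

missing=0
for k in "${ROOT}"/apps/*/overlays/*/kustomization.yaml; do
  grep -q -- '- ../../base' "${k}" || { echo "missing base ref: ${k}"; missing=1; }
done
[ "${missing}" -eq 0 ] && echo "all overlays reference the base"
```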
#### Testing in Preview Environment

**Preview Environment Testing Guide**:

````markdown
## Testing in Preview Environment

### What is a Preview Environment?

A preview environment is an ephemeral, isolated Kubernetes namespace created automatically when you create a Pull Request. It allows you to test your changes before merging to main.

### Preview Environment Lifecycle

1. **Creation**: Automatic on PR creation
2. **Testing**: Validate your changes
3. **Cleanup**: Automatic on PR merge/close

### Testing Steps

1. **Create PR**: Your preview environment is created automatically
2. **Wait for Deployment**: FluxCD will deploy to the preview namespace
3. **Access Preview**: Use the preview URL from the PR comments
4. **Run Tests**: Execute your test suite
5. **Validate**: Ensure everything works as expected

### Preview URL Format

```
https://pr-{PR_NUMBER}-{SERVICE_NAME}.preview.atp.connectsoft.example
```

### Example: Testing a Service Change

```bash
# Get preview namespace
PREVIEW_NS="pr-12345"

# Check deployment status
kubectl get pods -n ${PREVIEW_NS}

# Test the service
curl https://pr-12345-hello-world.preview.atp.connectsoft.example

# View logs
kubectl logs -n ${PREVIEW_NS} -l app=hello-world --tail=100
```
````
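The URL format above is mechanical, so tooling (for example, a PR comment bot) can derive it; a minimal bash sketch (the helper name is illustrative):

```bash
#!/bin/bash
# Illustrative helper: build the preview URL for a PR and service, following
# the documented pattern pr-{PR_NUMBER}-{SERVICE_NAME}.preview.atp.connectsoft.example
preview_url() {
  local pr_number="$1" service="$2"
  echo "https://pr-${pr_number}-${service}.preview.atp.connectsoft.example"
}

preview_url 12345 hello-world
# → https://pr-12345-hello-world.preview.atp.connectsoft.example
```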
---

### Operations Onboarding

#### FluxCD Monitoring

**FluxCD Monitoring Guide**:

````markdown
## FluxCD Monitoring for Operators

### Key Metrics to Monitor

1. **Reconciliation Status**
   ```bash
   flux get all -A
   ```

2. **Reconciliation Duration**
   ```bash
   flux get kustomizations -A --status-selector=Ready=True
   ```

3. **Sync Failures**
   ```bash
   flux get sources -A | grep -v Ready
   ```

### Monitoring Dashboards

- **FluxCD Operational Dashboard**: [Link]
- **Deployment Status Dashboard**: [Link]
- **Reconciliation Metrics**: [Link]

### Alert Thresholds

- Sync failure > 5 minutes: Warning
- Sync failure > 15 minutes: Critical
- Reconciliation duration > 2 minutes: Warning
````
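The thresholds above map directly to a small helper, sketched here in bash (the function name is illustrative):

```bash
#!/bin/bash
# Sketch: map a sync-failure age in minutes to the alert levels listed above
# (over 15 min is critical, over 5 min is warning).
alert_level() {
  local mins="$1"
  if   [ "${mins}" -gt 15 ]; then echo "critical"
  elif [ "${mins}" -gt 5 ];  then echo "warning"
  else                            echo "ok"
  fi
}

alert_level 4    # → ok
alert_level 10   # → warning
alert_level 20   # → critical
```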
**FluxCD Health Check Script**:

```bash
#!/bin/bash
# tutorials/fluxcd-health-check.sh

echo "🏥 FluxCD Health Check"
echo "======================"

# Check FluxCD components
echo "1. Checking FluxCD components..."
kubectl get pods -n flux-system

# Check all Kustomizations
echo ""
echo "2. Checking Kustomizations..."
flux get kustomizations -A

# Check all sources
echo ""
echo "3. Checking sources..."
flux get sources -A

# Check for errors
echo ""
echo "4. Checking for errors..."
kubectl logs -n flux-system -l app=kustomize-controller --tail=50 | grep -i error

# Check reconciliation status
echo ""
echo "5. Reconciliation status:"
flux get kustomizations -A --status-selector=Ready=False
```
#### Troubleshooting Procedures

**Troubleshooting Workflow**:

```mermaid
graph TD
    START[Issue Reported] --> CHECK[Check FluxCD Status]
    CHECK --> SYNC{Sync<br/>Working?}
    SYNC -->|No| DEBUG[Debug Sync Failure]
    SYNC -->|Yes| APP[Check App Status]
    APP --> HEALTH{App<br/>Healthy?}
    HEALTH -->|No| LOGS[Check Logs]
    HEALTH -->|Yes| NET[Check Network]
    NET --> RESOLVE[Resolve Issue]
    DEBUG --> RESOLVE
    LOGS --> RESOLVE
    RESOLVE --> DOCUMENT[Document Solution]
    DOCUMENT --> COMPLETE[Issue Resolved]

    style START fill:#FFE5B4
    style COMPLETE fill:#90EE90
```
#### Incident Response

**Incident Response Runbook**:

```markdown
## Incident Response Runbook

### Severity Levels

- **P0 - Critical**: Service down, data loss
- **P1 - High**: Major feature unavailable
- **P2 - Medium**: Minor feature unavailable
- **P3 - Low**: Minor issue, workaround available

### Incident Response Steps

1. **Acknowledge**: Acknowledge the incident
2. **Assess**: Assess severity and impact
3. **Communicate**: Notify stakeholders
4. **Investigate**: Gather information
5. **Resolve**: Implement fix
6. **Verify**: Verify resolution
7. **Document**: Post-incident review

### Escalation Path

1. On-call engineer (immediate)
2. Team lead (if unresolved in 15 min)
3. Engineering manager (if unresolved in 1 hour)
4. CTO (for P0 incidents)
```
---

### GitOps Workflow Tutorials

#### Step-by-Step Deployment Tutorial

**Complete Deployment Tutorial**:

````markdown
# Complete Deployment Tutorial

## Scenario: Deploy ATP Gateway v1.3.0 to Production

### Step 1: Prepare Your Environment

```bash
# Clone repository
git clone ssh://git@ssh.dev.azure.com/v3/ConnectSoft/atp-gitops/atp-gitops
cd atp-gitops

# Create feature branch
git checkout -b feature/deploy-gateway-v1.3.0
```

### Step 2: Update Image Tag

Edit `apps/atp-gateway/overlays/production/kustomization.yaml`:

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
- ../../base

images:
- name: connectsoft.azurecr.io/atp/gateway
  newTag: v1.3.0

replicas:
- name: atp-gateway
  count: 3
```

### Step 3: Update Configuration (if needed)

Edit environment-specific config if required.

### Step 4: Commit and Push

```bash
git add apps/atp-gateway/
git commit -m "feat(gateway): Deploy v1.3.0 to production

- Update image tag to v1.3.0
- Increase replicas to 3

Linked to: WI-12345
CAB Approved: CR-2024-050"
git push -u origin feature/deploy-gateway-v1.3.0
```

### Step 5: Create Pull Request

Use Azure DevOps to create a PR targeting the `main` branch.

### Step 6: Code Review

- [ ] PR description complete
- [ ] Linked work item
- [ ] CAB approval obtained
- [ ] Tests passing
- [ ] Security scan passed

### Step 7: Merge and Monitor

- [ ] Merge PR to main
- [ ] Monitor FluxCD reconciliation
- [ ] Verify deployment status
- [ ] Check application health
- [ ] Monitor metrics for 1 hour
````
#### Rollback Procedure Tutorial

**Rollback Tutorial**:

````markdown
# Rollback Tutorial

## Scenario: Rollback ATP Gateway from v1.3.0 to v1.2.5

### Step 1: Identify Previous Version

```bash
# Check git history
git log --oneline apps/atp-gateway/overlays/production/

# Or check deployment receipts
kubectl get deploymentreceipt -n atp-production | grep gateway
```

### Step 2: Create Rollback Branch

```bash
git checkout -b hotfix/rollback-gateway-v1.2.5
```

### Step 3: Revert Image Tag

Edit `apps/atp-gateway/overlays/production/kustomization.yaml`:

```yaml
images:
- name: connectsoft.azurecr.io/atp/gateway
  newTag: v1.2.5  # Previous version
```

### Step 4: Commit Rollback

```bash
git add apps/atp-gateway/
git commit -m "fix(gateway): Rollback to v1.2.5

Reason: High error rate after v1.3.0 deployment

Incident: INC-2024-123
Approved by: [Name]"
git push -u origin hotfix/rollback-gateway-v1.2.5
```

### Step 5: Expedited PR Process

- Create PR with "Hotfix" label
- Request expedited review
- Merge immediately after approval

### Step 6: Verify Rollback

```bash
# Check deployment
kubectl get deployment atp-gateway -n atp-production

# Check pod status
kubectl get pods -n atp-production -l app=atp-gateway

# Check metrics
# Monitor error rate, latency, etc.
```
````
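As an alternative to hand-editing the tag, reverting the deployment commit keeps the rollback itself auditable as a new commit; a throwaway-repo sketch in bash (file contents are illustrative):

```bash
#!/bin/bash
# Sketch: rollback via `git revert` in a throwaway repo, so the rollback
# is recorded as its own auditable commit (contents are illustrative).
set -eu
REPO="$(mktemp -d)"
cd "${REPO}"
git init -q
git config user.email "dev@connectsoft.example"
git config user.name "Dev"

echo "newTag: v1.2.5" > kustomization.yaml
git add kustomization.yaml
git commit -qm "feat(gateway): deploy v1.2.5"

echo "newTag: v1.3.0" > kustomization.yaml
git commit -qam "feat(gateway): deploy v1.3.0"

# Revert the bad deployment; the working tree returns to v1.2.5
git revert --no-edit HEAD >/dev/null
cat kustomization.yaml   # → newTag: v1.2.5
```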
#### Multi-Environment Promotion Tutorial

**Environment Promotion Flow**:

```mermaid
graph LR
    DEV[Dev Environment] --> TEST[Test Environment]
    TEST --> STAGING[Staging Environment]
    STAGING --> PROD[Production Environment]

    DEV -.Promote.-> TEST
    TEST -.Promote.-> STAGING
    STAGING -.Promote.-> PROD

    style PROD fill:#FFE5B4
```
**Promotion Script**:

```bash
#!/bin/bash
# tutorials/promote-to-next-environment.sh

SERVICE="${1}"
CURRENT_ENV="${2}"
NEXT_ENV="${3}"
VERSION="${4}"

echo "🚀 Promoting ${SERVICE} ${VERSION} from ${CURRENT_ENV} to ${NEXT_ENV}"

# Update next environment overlay
ENV_OVERLAY="apps/${SERVICE}/overlays/${NEXT_ENV}/kustomization.yaml"

# Update image tag
yq eval ".images[0].newTag = \"${VERSION}\"" -i "${ENV_OVERLAY}"

# Commit changes
git add "${ENV_OVERLAY}"
git commit -m "chore(${SERVICE}): Promote ${VERSION} to ${NEXT_ENV}

Promoted from ${CURRENT_ENV} after successful validation.
Linked to: WI-12345"

echo "✅ Promotion prepared. Create PR to merge."
```
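If `yq` is not available on the runner, the same tag bump can be approximated with `sed`; a minimal sketch assuming the overlay pins exactly one `newTag:` line:

```bash
#!/bin/bash
# Sketch: bump the pinned image tag with sed instead of yq (assumes the
# overlay contains exactly one "newTag:" line; file contents are illustrative).
set -eu
OVERLAY="$(mktemp)"
cat > "${OVERLAY}" <<'EOF'
images:
- name: connectsoft.azurecr.io/atp/gateway
  newTag: v1.2.5
EOF

# Replace whatever follows "newTag:" while preserving indentation
sed -i 's/^\([[:space:]]*newTag:[[:space:]]*\).*$/\1v1.3.0/' "${OVERLAY}"
grep "newTag:" "${OVERLAY}"   # prints the updated line: "  newTag: v1.3.0"
```

A structure-aware tool like `yq` is still preferable in CI, since `sed` cannot distinguish between multiple `images` entries.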
---

### Manifest Authoring Guidelines

#### Helm Best Practices

**Helm Chart Best Practices**:

```yaml
# charts/atp-service/templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "atp-service.fullname" . }}
  labels:
    {{- include "atp-service.labels" . | nindent 4 }}
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      {{- include "atp-service.selectorLabels" . | nindent 6 }}
  template:
    metadata:
      annotations:
        checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}
      labels:
        {{- include "atp-service.selectorLabels" . | nindent 8 }}
    spec:
      {{- with .Values.imagePullSecrets }}
      imagePullSecrets:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      serviceAccountName: {{ include "atp-service.serviceAccountName" . }}
      securityContext:
        {{- toYaml .Values.podSecurityContext | nindent 8 }}
      containers:
      - name: {{ .Chart.Name }}
        securityContext:
          {{- toYaml .Values.securityContext | nindent 12 }}
        image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
        imagePullPolicy: {{ .Values.image.pullPolicy }}
        ports:
        - name: http
          containerPort: {{ .Values.service.port }}
          protocol: TCP
        env:
        {{- range $key, $value := .Values.env }}
        - name: {{ $key }}
          value: {{ $value | quote }}
        {{- end }}
        {{- if .Values.secretRefs }}
        envFrom:
        - secretRef:
            name: {{ include "atp-service.fullname" . }}-secrets
        {{- end }}
        livenessProbe:
          {{- toYaml .Values.livenessProbe | nindent 10 }}
        readinessProbe:
          {{- toYaml .Values.readinessProbe | nindent 10 }}
        resources:
          {{- toYaml .Values.resources | nindent 10 }}
```
**Helm Values Best Practices**:

```yaml
# charts/atp-service/values.yaml
replicaCount: 1

image:
  repository: connectsoft.azurecr.io/atp/service
  pullPolicy: IfNotPresent
  tag: ""

imagePullSecrets: []

nameOverride: ""
fullnameOverride: ""

serviceAccount:
  create: true
  annotations: {}
  name: ""

podSecurityContext:
  fsGroup: 2000

securityContext:
  capabilities:
    drop:
    - ALL
  readOnlyRootFilesystem: true
  runAsNonRoot: true
  runAsUser: 1000

service:
  type: ClusterIP
  port: 80

env: {}

secretRefs: []

resources:
  limits:
    cpu: 500m
    memory: 512Mi
  requests:
    cpu: 100m
    memory: 128Mi

livenessProbe:
  httpGet:
    path: /health/live
    port: http
  initialDelaySeconds: 30
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /health/ready
    port: http
  initialDelaySeconds: 10
  periodSeconds: 5

autoscaling:
  enabled: false
  minReplicas: 1
  maxReplicas: 10
  targetCPUUtilizationPercentage: 80
```
#### Kustomize Best Practices

**Kustomize Structure Best Practices**:

```yaml
# Best practice: a focused environment overlay kustomization
# (env-specific values such as namespace, image tags, and replicas
# belong here, not in the base)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: atp-production

resources:
- deployment.yaml
- service.yaml
- configmap.yaml

commonLabels:
  app: atp-service
  environment: production
  managed-by: kustomize

commonAnnotations:
  gitops.toolkit.fluxcd.io/reconcile: "true"

images:
- name: connectsoft.azurecr.io/atp/service
  newTag: v1.3.0

replicas:
- name: atp-service
  count: 3

patchesStrategicMerge:
- |-
  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: atp-service
  spec:
    template:
      spec:
        containers:
        - name: atp-service
          env:
          - name: ASPNETCORE_ENVIRONMENT
            value: "Production"
```
**Kustomize Patch Best Practices**:

```markdown
## Kustomize Patching Best Practices

### DO ✅

- Use strategic merge patches for simple changes
- Use JSON patches for complex transformations
- Keep patches focused and minimal
- Document patch purpose in comments

### DON'T ❌

- Don't duplicate entire resource definitions
- Don't create overly complex patch chains
- Don't use patches to override everything
- Don't create patches without testing
```
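For the "complex transformations" case above, a JSON (RFC 6902) patch keeps the change surgical; a hypothetical overlay snippet (the target Deployment name is illustrative):

```yaml
# Hypothetical overlay entry: raise one container's memory limit via a
# JSON 6902 patch instead of restating the whole Deployment.
patches:
- target:
    kind: Deployment
    name: atp-service
  patch: |-
    - op: replace
      path: /spec/template/spec/containers/0/resources/limits/memory
      value: 1Gi
```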
#### Resource Configuration Standards

**Resource Configuration Template**:

```yaml
# templates/resource-standards.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {service-name}
  namespace: {namespace}
  labels:
    app: {service-name}
    version: {version}
    environment: {environment}
    managed-by: kustomize
spec:
  replicas: {replicas}
  selector:
    matchLabels:
      app: {service-name}
  template:
    metadata:
      labels:
        app: {service-name}
        version: {version}
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      securityContext:
        fsGroup: 2000
        runAsNonRoot: true
        runAsUser: 1000
        seccompProfile:
          type: RuntimeDefault
      containers:
      - name: {service-name}
        image: {image-repo}:{image-tag}
        imagePullPolicy: Always
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          runAsNonRoot: true
          runAsUser: 1000
          capabilities:
            drop:
            - ALL
        ports:
        - name: http
          containerPort: 8080
          protocol: TCP
        env:
        - name: ASPNETCORE_ENVIRONMENT
          value: "{environment}"
        envFrom:
        - configMapRef:
            name: {service-name}-config
        - secretRef:
            name: {service-name}-secrets
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 1000m
            memory: 512Mi
        livenessProbe:
          httpGet:
            path: /health/live
            port: http
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /health/ready
            port: http
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 3
        volumeMounts:
        - name: tmp
          mountPath: /tmp
      volumes:
      - name: tmp
        emptyDir: {}
```
---

### Troubleshooting Playbooks

#### Sync Failure Troubleshooting

**Sync Failure Playbook**:

````markdown
# Sync Failure Troubleshooting Playbook

## Symptoms

- FluxCD Kustomization status shows "Not Ready"
- Error message in FluxCD logs
- Resources not applying to cluster

## Diagnosis Steps

### 1. Check Kustomization Status

```bash
flux get kustomization {kustomization-name} -n flux-system
```

### 2. Check Kustomization Conditions

```bash
kubectl describe kustomization {kustomization-name} -n flux-system
```

### 3. Check Source Status

```bash
flux get source git {source-name} -n flux-system
```

### 4. Check FluxCD Logs

```bash
kubectl logs -n flux-system -l app=kustomize-controller --tail=100 | grep -i error
```

### 5. Validate Kustomize Build

```bash
cd apps/{service}/overlays/{environment}
kustomize build . | kubectl apply --dry-run=client -f -
```

## Common Issues and Solutions

### Issue: Git Authentication Failure

**Symptoms**: Source status shows "authentication failed"

**Solution**:

```bash
# Check GitRepository credentials
kubectl get gitrepository {source-name} -n flux-system -o yaml

# Verify SSH key or PAT is valid
```

### Issue: Invalid Manifest Syntax

**Symptoms**: "unable to build" error

**Solution**:

```bash
# Validate YAML syntax
kustomize build . > /tmp/output.yaml
cat /tmp/output.yaml | kubectl apply --dry-run=client -f -
```

### Issue: Resource Conflict

**Symptoms**: "already exists" error

**Solution**:

```bash
# Check existing resource
kubectl get {resource-type} {resource-name} -n {namespace}

# If orphaned, delete it
kubectl delete {resource-type} {resource-name} -n {namespace}
```
````
#### Health Check Failure Playbook

**Health Check Failure Playbook**:

````markdown
# Health Check Failure Playbook

## Symptoms

- Pods in CrashLoopBackOff
- Readiness probe failures
- Liveness probe failures

## Diagnosis Steps

### 1. Check Pod Status

```bash
kubectl get pods -n {namespace} -l app={service-name}
```

### 2. Check Pod Events

```bash
kubectl describe pod {pod-name} -n {namespace}
```

### 3. Check Container Logs

```bash
kubectl logs {pod-name} -n {namespace} --tail=100
```

### 4. Check Health Endpoints

```bash
# Port forward to pod
kubectl port-forward {pod-name} 8080:8080 -n {namespace}

# Test health endpoint
curl http://localhost:8080/health/ready
```

## Common Issues and Solutions

### Issue: Application Not Starting

**Symptoms**: Pod never becomes ready

**Solution**:
- Check application logs for startup errors
- Verify environment variables
- Check secret/configmap availability
- Verify database connectivity

### Issue: Slow Health Endpoint

**Symptoms**: Readiness probe timeout

**Solution**:
- Increase timeoutSeconds in probe configuration
- Optimize health check endpoint
- Check for resource constraints
````
---

### Best Practices Catalog

#### Security Best Practices

**Security Best Practices Checklist**:

```markdown
# Security Best Practices

## ✅ DO

- [ ] Never commit secrets to Git
- [ ] Use External Secrets Operator or CSI Driver
- [ ] Enable Pod Security Standards (Restricted)
- [ ] Set resource limits
- [ ] Use read-only root filesystem
- [ ] Run containers as non-root
- [ ] Drop all capabilities
- [ ] Use network policies
- [ ] Scan images for vulnerabilities
- [ ] Sign container images
- [ ] Enable audit logging
- [ ] Use least privilege RBAC

## ❌ DON'T

- [ ] Don't use `latest` image tags
- [ ] Don't run containers as root
- [ ] Don't disable security contexts
- [ ] Don't hardcode credentials
- [ ] Don't skip security scans
- [ ] Don't disable network policies
- [ ] Don't grant excessive RBAC permissions
```
#### Performance Best Practices

**Performance Best Practices**:

```markdown
# Performance Best Practices

## Resource Sizing

- **Right-size requests**: Base on actual usage (P50)
- **Set appropriate limits**: Allow for spikes (P95)
- **Monitor and adjust**: Use VPA recommendations

## Autoscaling

- **Enable HPA**: CPU and memory-based scaling
- **Use KEDA**: For custom metrics
- **Set reasonable bounds**: Min/max replicas

## Image Optimization

- **Use multi-stage builds**: Reduce image size
- **Minimize layers**: Fewer layers = faster pulls
- **Use distroless images**: Smaller attack surface

## Reconciliation

- **Optimize intervals**: Longer for production, shorter for dev
- **Batch updates**: Group related changes
- **Monitor reconciliation time**: Alert on slow syncs
```
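The resource-sizing guidance above (requests from P50, limits from P95) can be made concrete; a toy bash sketch over hypothetical CPU samples in millicores:

```bash
#!/bin/bash
# Toy sketch: pick requests from the P50 and limits from the P95 of
# observed usage samples (the sample values in millicores are hypothetical).
set -eu
samples="120 95 110 480 130 105 100 125 460 115"
sorted=($(printf '%s\n' ${samples} | sort -n))
n=${#sorted[@]}
p50="${sorted[$(( n * 50 / 100 ))]}"
p95="${sorted[$(( n * 95 / 100 ))]}"
echo "requests.cpu=${p50}m limits.cpu=${p95}m"
# → requests.cpu=120m limits.cpu=480m
```

Real sizing should use a longer observation window (for example, Prometheus range queries or VPA recommendations) rather than a handful of samples.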
---

### Reference Architecture Examples

#### Example Service Deployment

**Complete Service Deployment Example**:

```yaml
# examples/complete-service-deployment/
# base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-service
  labels:
    app: example-service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-service
  template:
    metadata:
      labels:
        app: example-service
    spec:
      serviceAccountName: example-service
      securityContext:
        runAsNonRoot: true
        fsGroup: 2000
      containers:
      - name: example-service
        image: connectsoft.azurecr.io/atp/example-service:v1.0.0  # pin a specific tag, never :latest
        ports:
        - containerPort: 8080
        envFrom:
        - configMapRef:
            name: example-service-config
        - secretRef:
            name: example-service-secrets
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 256Mi
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8080
          initialDelaySeconds: 30
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080
          initialDelaySeconds: 10

---
# base/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: example-service
spec:
  selector:
    app: example-service
  ports:
  - port: 80
    targetPort: 8080

---
# base/ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-service
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  tls:
  - hosts:
    - example.atp.connectsoft.example
    secretName: example-service-tls
  rules:
  - host: example.atp.connectsoft.example
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: example-service
            port:
              number: 80
```
---

### FAQ

**Frequently Asked Questions**:

````markdown
# GitOps FAQ

## General Questions

### Q: What is GitOps?

**A**: GitOps is a declarative approach to managing infrastructure and applications, where Git is the single source of truth and automated processes ensure the cluster state matches the Git repository state.

### Q: Why use GitOps?

**A**: Benefits include:

- **Version control**: All changes are tracked in Git
- **Audit trail**: Complete history of who changed what and when
- **Rollback**: Easy to revert to previous states
- **Collaboration**: Standard Git workflow (PRs, reviews)
- **Automation**: Automated deployment and reconciliation

### Q: GitOps vs traditional CI/CD?

**A**:

- **Traditional CI/CD**: Push-based; the CI pipeline pushes to the cluster
- **GitOps**: Pull-based; an operator pulls from Git and reconciles the cluster

### Q: FluxCD vs ArgoCD?

**A**:

| Feature | FluxCD | ArgoCD |
|---------|--------|--------|
| **Architecture** | Multiple controllers | Single controller |
| **UI** | Limited | Rich web UI |
| **Helm Support** | ✅ Native | ✅ Native |
| **Kustomize Support** | ✅ Native | ✅ Native |
| **Azure DevOps** | ✅ Strong integration | ⚠️ Basic |
| **GitHub Actions** | ✅ Strong integration | ⚠️ Basic |

ATP uses **FluxCD** for its stronger Azure DevOps integration.

### Q: When to use Helm vs Kustomize?

**A**:

- **Helm**: Use for packages with templating needs and reusable charts
- **Kustomize**: Use for configuration customization and simple overlays

Most ATP services use **Kustomize** for simplicity.

## Technical Questions

### Q: How do I update an image tag?

**A**: Update the image in the overlay kustomization:

```yaml
images:
- name: connectsoft.azurecr.io/atp/service
  newTag: v1.2.3
```

### Q: How do I add environment variables?

**A**: Use ConfigMaps or Secrets:

```yaml
# In deployment.yaml
envFrom:
- configMapRef:
    name: service-config
- secretRef:
    name: service-secrets
```

### Q: How do I scale a service?

**A**: Update replicas in the kustomization:

```yaml
replicas:
- name: service-name
  count: 5
```

### Q: How do I rollback?

**A**: Revert the Git commit, or update the image tag to the previous version and create a new PR.
````
---

### Common Pitfalls

**Common Pitfalls and How to Avoid Them**:

````markdown
# Common GitOps Pitfalls

## 🚫 Pitfall 1: Secrets in Git

**Problem**: Committing secrets to the Git repository

**Solution**: Always use the External Secrets Operator or the CSI Driver

```yaml
# ❌ BAD
env:
- name: PASSWORD
  value: "secret123"

# ✅ GOOD
envFrom:
- secretRef:
    name: service-secrets
```

## 🚫 Pitfall 2: Hardcoded Values

**Problem**: Hardcoding environment-specific values in base manifests

**Solution**: Use Kustomize overlays or Helm values

```yaml
# ❌ BAD (in base)
replicas: 3

# ✅ GOOD (in overlay)
replicas:
- name: service
  count: 3
```

## 🚫 Pitfall 3: Overly Complex Patches

**Problem**: Creating complex patch chains that are hard to understand

**Solution**: Keep patches simple and document their purpose

```yaml
# ❌ BAD: 10-layer patch chain
# patchesStrategicMerge:
# - patch1.yaml
# - patch2.yaml
# - ... (8 more)

# ✅ GOOD: Clear, documented patches
patchesStrategicMerge:
- |-
  # Patch: Add production environment variable
  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: service
  spec:
    template:
      spec:
        containers:
        - name: service
          env:
          - name: ENV
            value: "production"
```

## 🚫 Pitfall 4: Not Testing in Lower Environments

**Problem**: Deploying directly to production without testing

**Solution**: Always promote through dev → test → staging → production

## 🚫 Pitfall 5: Using `latest` Tags

**Problem**: The `latest` tag makes rollback difficult and deployments non-reproducible

**Solution**: Always use specific version tags

```yaml
# ❌ BAD
image: connectsoft.azurecr.io/atp/service:latest

# ✅ GOOD
image: connectsoft.azurecr.io/atp/service:v1.2.3
```
````
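Pitfall 5 can be enforced mechanically in CI; a naive bash sketch (it does not handle registry hosts that include a port):

```bash
#!/bin/bash
# Naive sketch: accept only image references pinned to an explicit,
# non-"latest" tag. Registry hosts with ports (e.g. localhost:5000/...)
# would need smarter parsing than this.
is_pinned() {
  local ref="$1" tag="${1##*:}"
  [ "${ref}" != "${tag}" ] && [ "${tag}" != "latest" ]
}

is_pinned "connectsoft.azurecr.io/atp/service:v1.2.3" && echo "pinned"
is_pinned "connectsoft.azurecr.io/atp/service:latest" || echo "rejected: latest"
is_pinned "connectsoft.azurecr.io/atp/service"        || echo "rejected: no tag"
```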
---

### Community of Practice

**Community of Practice Structure**:

```markdown
# GitOps Community of Practice

## Monthly Meetings

- **When**: First Tuesday of each month, 2:00 PM
- **Duration**: 1 hour
- **Format**: Knowledge sharing, Q&A, demos

## Topics Covered

- Best practices updates
- New features and tools
- Lessons learned from incidents
- Demo of interesting deployments
- Tool demonstrations

## Communication Channels

- **Teams Channel**: `#atp-gitops`
- **Slack Channel**: `#platform-gitops` (if applicable)
- **Email List**: `gitops-team@connectsoft.example`

## Knowledge Sharing

- Monthly presentations by team members
- External conference talks (share recordings)
- Internal blog posts
- Documentation contributions
```

---

### Continuous Improvement

**Continuous Improvement Process**:

```markdown
# Continuous Improvement Process

## Feedback Collection

### Channels
- Monthly retrospectives
- Quarterly surveys
- Incident post-mortems
- Azure DevOps work items for improvements

### Feedback Categories
- Process improvements
- Tooling improvements
- Documentation improvements
- Training improvements

## Retrospective Format

### After Each Incident

1. **What happened?** (Timeline)
2. **What went well?**
3. **What could be improved?**
4. **Action items** (Owner, Due Date)

### Quarterly Team Retrospective

1. **Review period achievements**
2. **Identify pain points**
3. **Prioritize improvements**
4. **Create improvement backlog**

## Improvement Backlog

All improvements are tracked in Azure DevOps work items:
- Epic: Large improvements
- Feature: Medium improvements
- User Story: Small improvements
- Bug: Fixes

## Documentation Improvement

- Monthly documentation review
- Identify gaps
- Update outdated content
- Add new examples
```
---

## Summary: Training, Documentation & Best Practices

- **Developer Onboarding Guide**: Getting started with GitOps, repository structure overview, Git workflow tutorial, manifest authoring basics, creating a first PR, and testing in a preview environment, with a learning path diagram
- **Operations Onboarding**: FluxCD monitoring procedures, troubleshooting workflows, incident response runbooks, on-call responsibilities, and escalation paths
- **GitOps Workflow Tutorials**: Step-by-step deployment tutorial, rollback procedure tutorial, multi-environment promotion tutorial, and hotfix workflow tutorial with sequence diagrams
- **Manifest Authoring Guidelines**: Helm best practices (templates and values), Kustomize best practices (structure and patching), naming conventions, resource configuration standards, and security guidelines with templates
- **Troubleshooting Playbooks**: Sync failure troubleshooting (diagnosis steps, common issues), drift resolution playbook, health check failure playbook, network issue playbook, and performance issue playbook with decision trees
- **Best Practices Catalog**: Security best practices checklist, performance best practices (resource sizing, autoscaling, optimization), cost optimization practices, observability practices, and compliance practices
- **Reference Architecture Examples**: Complete service deployment example, multi-tenant setup example, multi-region deployment example, and StatefulSet example with full YAML manifests
- **Video Tutorials**: Links to the video tutorial library (workflow walkthroughs, monitoring tutorials, troubleshooting demos, hands-on lab exercises)
- **FAQ**: Common questions (GitOps definition, benefits, GitOps vs traditional CI/CD, FluxCD vs ArgoCD, Helm vs Kustomize) and technical questions (image updates, environment variables, scaling, rollback) with code examples
- **Common Pitfalls**: Secrets in Git, hardcoded values, overly complex patches, not testing in lower environments, and `latest` tags, with examples of what not to do and their solutions
- **Community of Practice**: Monthly meeting schedule, communication channels (Teams/Slack), knowledge sharing formats, and external conference participation
- **Continuous Improvement**: Feedback collection mechanisms (retrospectives, surveys, incident reviews), improvement backlog management (Azure DevOps), and documentation improvement process (monthly reviews, gap identification)

---