GitOps — Audit Trail Platform (ATP)

Declarative deployment with Git as the source of truth — ATP GitOps ensures that infrastructure and application state are versioned, auditable, and continuously reconciled across all Azure environments using Azure DevOps, AKS, and FluxCD.


Purpose & Scope

This document defines the GitOps deployment model for the ConnectSoft Audit Trail Platform (ATP), establishing how infrastructure and application manifests are version-controlled in Azure Repos, automatically reconciled to AKS clusters, and continuously monitored for drift with full traceability and compliance evidence using Azure-native tools and services.

What This Document Covers

GitOps Fundamentals:

  • GitOps philosophy and core principles (declarative, versioned, pulled, reconciled)
  • Comparison with traditional CI/CD (push-based) deployments
  • Benefits for audit trail requirements (immutable history, compliance, security)
  • Azure-native GitOps implementation patterns with FluxCD

Infrastructure & Repository Structure:

  • Azure Repos structure for GitOps manifests (monorepo pattern for Kubernetes manifests)
  • Branching strategies per environment (main, staging, test, dev, feature/, hotfix/)
  • Access control and RBAC for Git repositories (Azure AD integration, branch policies)
  • Naming conventions and versioning strategies (SemVer, Git tags, commit SHA)

FluxCD on Azure Kubernetes Service (AKS):

  • FluxCD installation, bootstrap, and multi-cluster setup
  • GitRepository and Kustomization custom resources for continuous reconciliation
  • Azure Repos integration (SSH keys, PAT, Azure AD Workload Identity)
  • Drift detection, self-healing, and automatic reconciliation loops

Declarative Manifests & Configuration:

  • Kubernetes manifests (Deployments, Services, ConfigMaps, Secrets, Ingress)
  • Helm charts for ATP microservices (templates, values files, dependencies, versioning)
  • Kustomize overlays for environment-specific configurations (base + overlays pattern)
  • Manifest validation, linting, and security policy enforcement

CI/CD Integration:

  • Azure Pipelines to GitOps handoff (build → test → publish → commit manifest update)
  • Automated manifest updates (image tag bumping after successful CI builds)
  • Multi-service coordination and orchestration (atomic updates across services)
  • Artifact metadata and provenance (SBOM, vulnerability scans, build attestations)

Secrets Management:

  • Azure Key Vault integration (External Secrets Operator or CSI Driver)
  • Azure AD Workload Identity for pod authentication (no credentials in Git)
  • Secret rotation procedures and zero-downtime updates
  • Compliance controls (SOC 2, GDPR, HIPAA) for secret handling and audit logging

Multi-Environment Deployment:

  • Environment-specific configurations (dev, test, staging, production, preview, hotfix)
  • Kustomize overlays and Helm values files per environment
  • Resource quotas, limits, and autoscaling policies per environment
  • Promotion workflows and approval gates (manual for staging/production)

Advanced Deployment Strategies:

  • Rolling updates (default Kubernetes strategy with maxSurge/maxUnavailable)
  • Blue-green deployments (namespace switching with traffic routing)
  • Canary releases (progressive traffic shifting with Flagger)
  • Preview environments (ephemeral namespaces per pull request)
  • Zero-downtime deployments and rollback procedures

Security & Compliance:

  • Azure Policy for Kubernetes (Pod Security Standards, network policies, resource quotas)
  • Image signing and verification (Cosign, Notary, admission controllers)
  • SBOM generation and vulnerability scanning (integrated with ACR)
  • Audit logging and compliance evidence collection (immutable Git history)

Multi-Tenancy:

  • Namespace-per-tenant isolation strategy
  • Dynamic tenant provisioning and offboarding workflows
  • Tenant-specific configurations and resource quotas
  • Cost allocation and compliance enforcement per tenant

Observability & Monitoring:

  • Azure Monitor Container Insights integration for AKS metrics
  • FluxCD metrics export to Prometheus/Grafana for reconciliation monitoring
  • Deployment tracking and DORA metrics (deployment frequency, lead time, MTTR, change failure rate)
  • Alerting on sync failures, drift detection, and health check failures

Day 2 Operations:

  • Troubleshooting GitOps issues (sync failures, drift, image pull errors, secret access failures)
  • Routine maintenance tasks (FluxCD upgrades, AKS patching, certificate renewals)
  • Disaster recovery and rollback procedures (Git revert, cluster recreation from IaC)
  • On-call runbooks and escalation paths

Governance & Training:

  • GitOps workflow ownership and change management processes
  • Developer and operations onboarding guides (Git workflow, manifest authoring)
  • Best practices catalog and reference architectures
  • Compliance reporting and audit evidence collection automation

Out of Scope

This document does NOT cover:

  • Kubernetes fundamentals — See infrastructure/kubernetes.md for AKS cluster architecture, pod design, container orchestration basics, and Kubernetes API concepts
  • Azure Pipelines (CI stage) — See ci-cd/azure-pipelines.md for build, test, security scanning, artifact publishing, and quality gate enforcement
  • Quality gates — See ci-cd/quality-gates.md for test coverage thresholds, security scanning policies, and compliance gate enforcement
  • Infrastructure provisioning (non-Kubernetes) — See infrastructure/pulumi.md for Azure SQL, Service Bus, Storage, Key Vault, and other PaaS resource provisioning
  • Application architecture — See architecture/hld.md for ATP service design, domain models, business logic, and system architecture
  • Service-specific deployment details — See individual service documentation in planning/core-services/ for service-specific configuration, dependencies, and operational characteristics
  • Observability implementation — See operations/observability.md for OpenTelemetry instrumentation, metrics collection, distributed tracing, and log aggregation
  • Backup and restore procedures — See operations/backups-restore-ediscovery.md for data backup strategies, disaster recovery, and eDiscovery procedures

Readers & Ownership

Primary Readers:

  • Platform Engineers: Implement GitOps workflows, configure FluxCD, author Kubernetes manifests, manage GitOps repository structure
  • DevOps Engineers: Integrate Azure Pipelines with GitOps, automate manifest updates, implement promotion workflows, troubleshoot CI/CD handoff
  • SRE Team: Monitor FluxCD reconciliation, respond to drift alerts, execute rollback procedures, perform incident response, conduct DR drills
  • Security Team: Review security policies, validate RBAC configurations, enforce Pod Security Standards, audit secret management, ensure compliance
  • Developers: Understand GitOps workflow, submit manifest changes via pull requests, test changes in preview environments, troubleshoot deployment issues
  • Compliance Officers: Validate audit trail completeness, review deployment approvals, ensure evidence collection for SOC 2/GDPR/HIPAA audits

Document Owner: Platform Engineering Team
Technical Reviewers: SRE Lead, Cloud Architect, Security Officer
Compliance Reviewer: Compliance Officer (for SOC 2/GDPR/HIPAA sections)
Approval Authority: CTO
Last Reviewed: 2024-10-30
Next Review: 2025-Q2 (after Cycle 6 completion — multi-environment observability)
Review Frequency: Quarterly (or after significant GitOps workflow changes)

Artifacts Produced

By following this document, teams will produce the following artifacts and deliverables:

1. GitOps Repository (atp-gitops in Azure Repos):

  • Declarative Kubernetes manifests for all 7 ATP microservices
  • Helm charts with templates, values files, and dependency specifications
  • Kustomize base manifests and environment-specific overlays (dev, test, staging, production)
  • FluxCD bootstrap configuration files (GitRepository, Kustomization resources)
  • Security policies (Pod Security Standards, Network Policies, Azure Policies)
  • Multi-tenant namespace configurations and resource quotas

2. FluxCD Installation:

  • FluxCD controllers deployed on all AKS clusters (dev, test, staging, production)
  • GitRepository resources configured for Azure Repos integration (SSH/PAT authentication)
  • Kustomization resources for each application and environment
  • Notification controllers for alerting (Slack, Teams, Azure Monitor)
  • Health assessment and drift detection configurations

3. CI/CD Integration:

  • Azure Pipelines templates for GitOps handoff (build → publish → manifest update → commit)
  • Automated manifest update scripts (image tag bumping, Helm values updates)
  • Preview environment provisioning pipelines (ephemeral namespaces per PR)
  • Multi-service coordination scripts (atomic updates across dependent services)

4. Infrastructure as Code:

  • Pulumi C# programs for AKS cluster provisioning (node pools, networking, SKUs)
  • Environment-specific Pulumi stack configurations (dev, test, staging, production)
  • Drift detection automation (scheduled reconciliation validation)
  • Disaster recovery scripts (cluster recreation from Git and IaC)

5. Secrets Management:

  • External Secrets Operator or CSI Driver installation and configuration
  • ClusterSecretStore resources (Azure Key Vault integration per environment)
  • ExternalSecret or SecretProviderClass definitions for each application
  • Azure AD Workload Identity configuration (federated credentials, ServiceAccount annotations)
  • Secret rotation runbooks and automation scripts

6. Observability & Compliance:

  • Azure Monitor dashboards for GitOps metrics (reconciliation status, drift events, deployment frequency)
  • Grafana dashboards for FluxCD monitoring (reconciliation duration, success rate, resource health)
  • Compliance evidence collection scripts (deployment receipts, approval records, Git audit trail)
  • KQL queries for audit trail analysis (who deployed what, when, why)
  • DORA metrics dashboard (deployment frequency, lead time for changes, MTTR, change failure rate)

7. Security & Policy Enforcement:

  • Azure Policy definitions for AKS (Pod Security Standards, network policies, resource limits)
  • Pod Security Admission configurations (baseline, restricted profiles)
  • Network policy templates (default deny, service-to-service rules)
  • Image signing workflows (Cosign signatures, admission controller verification)
  • RBAC configurations (ServiceAccounts, Roles, RoleBindings per service and tenant)

8. Runbooks & Documentation:

  • Troubleshooting guides (sync failures, drift resolution, image pull errors)
  • Rollback procedures (simple rollback with git revert, complex multi-service rollbacks)
  • DR test plans (cluster failure scenarios, region outage, complete platform loss)
  • Developer onboarding guides (GitOps workflow, manifest authoring, PR process)
  • Operations runbooks (routine maintenance, FluxCD upgrades, AKS patching)

What is GitOps?

Definition: GitOps is an operational framework that applies DevOps best practices—version control, collaboration, compliance, and CI/CD—to infrastructure automation and application deployment. The core principle is using Git repositories as the single source of truth for declarative infrastructure and application configurations.

Core Concept: Instead of operators running manual kubectl apply commands or CI/CD pipelines pushing changes to clusters, a GitOps agent (FluxCD, ArgoCD) running inside the Kubernetes cluster continuously pulls the desired state from Git and reconciles the actual cluster state to match it.
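The pull-and-reconcile loop described above can be sketched conceptually. This is an illustration of the control loop only, not FluxCD's actual implementation; the state strings stand in for rendered manifests and live cluster state:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Illustration of the GitOps control loop; the strings below stand in for
# rendered manifests (desired) and live cluster state (actual).
desired_state="image=atp/ingestion:v1.2.3 replicas=3"   # declared in Git
actual_state="image=atp/ingestion:v1.2.2 replicas=3"    # running in cluster

for _ in 1 2; do                        # two iterations stand in for "forever"
  if [ "$actual_state" != "$desired_state" ]; then
    echo "drift detected -> applying desired state"
    actual_state="$desired_state"       # stands in for 'kubectl apply'
  else
    echo "in sync"
  fi
done
```

Each pass compares desired against actual and converges the cluster toward Git; once they match, the loop simply reports that it is in sync.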

History & Evolution

GitOps emerged from the evolution of Infrastructure as Code (IaC) practices combined with Kubernetes' declarative nature:

Timeline:

| Year | Milestone | Impact on Industry | Relevance to ATP |
|------|-----------|--------------------|------------------|
| 2010-2014 | Infrastructure as Code (IaC) emerges | Terraform, CloudFormation, Ansible enable declarative infrastructure | Foundation for declarative Azure resources |
| 2015 | Kubernetes released (v1.0) | Declarative configuration becomes standard for container orchestration | ATP targets AKS for microservice deployment |
| 2017 | Weaveworks coins "GitOps" term | Flux v1 released as first GitOps operator for Kubernetes | GitOps pattern recognized |
| 2018 | ArgoCD released by Intuit | Alternative GitOps implementation; feature-rich UI | ArgoCD evaluated (FluxCD chosen for simplicity) |
| 2019 | OpenGitOps working group formed | CNCF standardizes 4 core GitOps principles | ATP adopts OpenGitOps principles |
| 2020 | FluxCD v2 released | Complete rewrite with modular architecture (GitOps Toolkit) | ATP uses FluxCD v2 for production |
| 2021 | Flux and Argo join CNCF | GitOps becomes a cloud-native standard (incubating projects) | Industry validation for ATP choice |
| 2022 | Azure Arc GitOps integration | Microsoft provides native GitOps support for AKS and Arc-enabled clusters | Azure-native GitOps validated |
| 2024 | Widespread adoption | CNCF surveys show 70%+ of production Kubernetes use GitOps | ATP joins industry leaders |

Why GitOps Now?:

  • Kubernetes maturity: Declarative APIs well-established; GitOps is natural evolution
  • Security focus: Zero-trust principles demand eliminating cluster credentials from CI/CD
  • Compliance: Audit trail requirements favor Git's immutable history
  • Cloud-native patterns: CNCF-endorsed pattern with mature tooling (FluxCD, ArgoCD)

Pull-Based vs Push-Based Deployment Models

Purpose: Understand the fundamental architectural difference between traditional CI/CD and GitOps deployment models.

Push-Based Deployment (Traditional CI/CD)

Architecture:

graph TD
    A[Developer] -->|1. git push| B[Source Code<br/>Repository]
    B -->|2. trigger webhook| C[CI/CD Pipeline<br/>Azure Pipelines]

    C -->|3. build| D[Compile &<br/>Test]
    D -->|4. publish| E[Docker Image]
    E -->|5. push| F[Container<br/>Registry<br/>ACR]

    C -->|6. deploy<br/>kubectl apply| G[Kubernetes<br/>Cluster]

    H[Secrets<br/>Vault] -.->|credentials<br/>stored| C

    style G fill:#ffcccc
    style C fill:#FFE5B4
    style H fill:#ffcccc

Characteristics:

  • External deployment: CI/CD pipeline (running outside cluster) has direct access to Kubernetes API via kubeconfig or service account token
  • Push on trigger: Deployment happens during pipeline execution (synchronous operation)
  • Credentials required: Pipeline needs cluster credentials stored as secrets or service connections
  • No continuous reconciliation: Cluster state checked only during deployment; drift undetected between deployments
  • Secret management: Secrets often stored in CI/CD system variables (security risk)

Workflow Example (Azure Pipelines - Push Model):

# ❌ PUSH-BASED: Pipeline has direct cluster access
- stage: Deploy_Production
  jobs:
  - deployment: DeployToAKS
    environment: ATP-Production-AKS  # Requires approval
    strategy:
      runOnce:
        deploy:
          steps:
          # Pipeline has full kubectl access to production cluster
          - task: Kubernetes@1
            displayName: 'Deploy ATP Ingestion to Production'
            inputs:
              connectionType: 'Kubernetes Service Connection'
              kubernetesServiceEndpoint: 'ATP-Production-AKS'  # ⚠️ Cluster credentials
              namespace: 'atp-production'
              command: 'apply'
              useConfigurationFile: true
              configuration: 'manifests/production/atp-ingestion.yaml'

          # Update image tag imperatively
          - task: Kubernetes@1
            inputs:
              kubernetesServiceEndpoint: 'ATP-Production-AKS'
              command: 'set'
              arguments: 'image deployment/atp-ingestion atp-ingestion=$(containerRegistry)/atp/ingestion:$(Build.BuildNumber)'

Security Concerns:

# ⚠️ SECURITY RISK: Cluster credentials stored in Azure DevOps
# Service Connection: ATP-Production-AKS
# Type: Kubernetes Service Connection
# Authentication: Service Account (has cluster-admin rights!)
# 
# Attack vectors:
# 1. Anyone with "Use" permission on service connection can deploy to production
# 2. Compromised Azure DevOps account = compromised production cluster
# 3. Service account token rotation requires updating all pipelines
# 4. Credentials visible in pipeline logs (if logging enabled)

Pull-Based Deployment (GitOps)

Architecture:

graph TD
    A[Developer] -->|1. git push| B[Source Code<br/>Repository]
    B -->|2. trigger| C[CI Pipeline<br/>Azure Pipelines]

    C -->|3. build + test| D[Docker Image]
    D -->|4. push| E[Container<br/>Registry<br/>ACR]

    C -->|5. update manifest<br/>commit + push| F[GitOps<br/>Repository]

    subgraph "Inside Kubernetes Cluster"
        G[FluxCD Agent]
        H[AKS Cluster]

        G -->|6. git pull<br/>every 1 min| F
        G -->|7. kubectl apply| H
        H -.->|8. drift<br/>detection| G
        G -.->|9. self-heal| H
    end

    I[Azure Key<br/>Vault] -->|10. secrets<br/>sync| J[External Secrets<br/>Operator]
    J -->|11. create K8s<br/>secrets| H

    K[Azure Monitor] -.->|metrics| C
    K -.->|metrics| G
    K -.->|logs| H

    style H fill:#90EE90
    style G fill:#90EE90
    style F fill:#FFE5B4

Characteristics:

  • Internal agent: GitOps operator (FluxCD) runs inside Kubernetes cluster as Deployment
  • Continuous pull: Agent polls Git repository at regular intervals (configurable: 30s to 10m)
  • No external access: Cluster credentials never leave cluster; enhanced security
  • Automatic reconciliation: Cluster state continuously compared to Git state; drift corrected automatically
  • Secret sync: Secrets managed in Azure Key Vault; synced to cluster via External Secrets Operator

Workflow Example (FluxCD - Pull Model):

# ✅ PULL-BASED: FluxCD inside cluster pulls from Git
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: atp-gitops
  namespace: flux-system
spec:
  interval: 1m  # Poll Git every 1 minute
  url: ssh://git@ssh.dev.azure.com/v3/ConnectSoft/ATP/atp-gitops
  ref:
    branch: production  # Production environment uses 'production' branch
  secretRef:
    name: azure-devops-ssh-key  # Read-only SSH key (no cluster credentials!)

---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: atp-ingestion
  namespace: flux-system
spec:
  interval: 5m  # Reconcile every 5 minutes
  path: ./apps/atp-ingestion/overlays/production
  prune: true  # Delete resources removed from Git
  sourceRef:
    kind: GitRepository
    name: atp-gitops
  healthChecks:
    - apiVersion: apps/v1
      kind: Deployment
      name: atp-ingestion
      namespace: atp-production

Security Benefits:

# ✅ NO cluster credentials outside cluster
# FluxCD ServiceAccount has RBAC permissions inside cluster
# CI pipeline NEVER touches cluster; only commits to Git

# FluxCD ServiceAccount (inside cluster)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kustomize-controller
  namespace: flux-system

---
# FluxCD RBAC (cluster-admin for reconciliation)
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kustomize-controller
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin  # Full access inside cluster only
subjects:
- kind: ServiceAccount
  name: kustomize-controller
  namespace: flux-system

Comparison Summary

| Deployment Model | When to Use | When to Avoid |
|---|---|---|
| Push-Based (Traditional CI/CD) | Small teams with simple deployments<br/>Non-Kubernetes environments<br/>Immediate feedback required<br/>Team unfamiliar with GitOps | Production Kubernetes deployments<br/>Compliance requirements (SOC 2, GDPR, HIPAA)<br/>Multi-environment with drift concerns<br/>Security-sensitive environments |
| Pull-Based (GitOps) | Production Kubernetes deployments<br/>Compliance/audit requirements<br/>Multi-cluster/multi-region<br/>Configuration drift is a concern<br/>Security-first environments | Non-Kubernetes deployments<br/>Legacy infrastructure<br/>Team unwilling to learn GitOps<br/>Immediate deployment feedback critical |

ATP Decision: ✅ GitOps (Pull-Based) for all Kubernetes deployments

Rationale:

  1. Audit trail requirement: Git provides immutable, permanent history (vs 30-90 day pipeline logs)
  2. Security requirement: Zero-trust principle; no cluster credentials outside cluster
  3. Compliance requirement: SOC 2, GDPR, HIPAA demand tamper-evident change records
  4. Multi-tenancy: Git structure enables isolated tenant configurations
  5. Operational resilience: Disaster recovery RTO reduced from 4 hours to 30 minutes

GitOps in Audit Trail Platform Context

Purpose: Explain why GitOps is essential (not just beneficial) for ATP's unique requirements.

Audit Trail Requirements

ATP provides immutable, tamper-evident audit logs for customers. The platform's own infrastructure must meet the same standards:

Requirement 1: Complete Change History

Every infrastructure change must be tracked with full attribution (who, what, when, why):

# Git history provides complete audit trail
git log --all --pretty=format:"%h | %ai | %an | %ae | %s" \
  --since="2024-01-01" \
  --grep="production"

# Example output (can be exported for SOC 2 audits):
# abc123d | 2024-10-30 14:23:45 +0000 | Alice Chen | alice.chen@connectsoft.example | feat(ingestion): upgrade to v1.2.3
# def456e | 2024-10-25 10:15:22 +0000 | Bob Smith | bob.smith@connectsoft.example | fix(query): index performance (ATP-BUG-789)
# ghi789f | 2024-10-20 16:42:11 +0000 | Carol Davis | carol.davis@connectsoft.example | scale(integrity): replicas 3→5 (ATP-INC-456)

Requirement 2: Tamper-Evidence

Git commits must be cryptographically signed to prevent tampering:

# Generate GPG key for commit signing
gpg --full-generate-key
# Select: RSA and RSA, 4096 bits, no expiration
# User ID: "Platform Team <platform-team@connectsoft.example>"

# Export public key for verification
gpg --armor --export platform-team@connectsoft.example > platform-team-gpg-public.key

# Configure Git to sign all commits
git config --global user.signingkey <GPG_KEY_ID>
git config --global commit.gpgsign true
git config --global tag.gpgsign true

# Commit with signature
git add apps/atp-ingestion/overlays/production/kustomization.yaml
git commit -S -m "feat(ingestion): upgrade to v1.2.3

- Updated image tag to v1.2.3-abc123d
- Increased memory limit 512Mi → 1Gi (performance optimization)
- Enabled tamper-evidence in production config

Relates to: ATP-EPIC-456
Approved by: architect@connectsoft.example
Tested in: Staging (2024-10-26 to 2024-10-29)"

# Verify signature
git log --show-signature -1

# Output:
# commit abc123d1234567890abcdef1234567890abcdef (HEAD -> production)
# gpg: Signature made Wed Oct 30 14:23:45 2024 UTC
# gpg:                using RSA key 1234567890ABCDEF
# gpg: Good signature from "Platform Team <platform-team@connectsoft.example>"
# Author: Platform Team <platform-team@connectsoft.example>
# Date:   Wed Oct 30 14:23:45 2024 +0000
#
#     feat(ingestion): upgrade to v1.2.3
#     ...
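A CI gate can verify that the signing policy actually holds. The sketch below runs the check against a throwaway repo containing one deliberately unsigned commit; in the real workflow it would run against the atp-gitops production branch (`%G?` is git's per-commit signature-status placeholder):

```shell
#!/usr/bin/env bash
# Sketch of a signature gate: fail if any recent commit is unsigned.
# Demo repo and commit message are illustrative.
set -euo pipefail

tmp=$(mktemp -d) && cd "$tmp" && git init -q
git -c user.name=Demo -c user.email=demo@connectsoft.example \
  commit -q --allow-empty -m "unsigned change"

# %G? prints one status letter per commit: G = good, N = none, B = bad
unsigned=$(git log --pretty='%G?' -n 20 | grep -c 'N' || true)
if [ "$unsigned" -gt 0 ]; then
  echo "FAIL: ${unsigned} unsigned commit(s) in the last 20"
else
  echo "OK: all recent commits are signed"
fi
```

Wired into a branch policy or pipeline step, a nonzero count would block the merge to production.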

Requirement 3: Long-Term Retention

Git history must be retained indefinitely for compliance (SOC 2: 1 year minimum, ATP: 7 years for parity with audit logs):

# Backup Git repository to immutable Azure Blob Storage
az storage blob upload-batch \
  --account-name atpgitbackupprod \
  --destination gitops-backups \
  --source .git/ \
  --destination-path "$(date +%Y%m%d)/" \
  --overwrite false  # Immutable: cannot overwrite

# Enable legal hold (WORM storage)
az storage container legal-hold set \
  --account-name atpgitbackupprod \
  --container-name gitops-backups \
  --tags "compliance=soc2-gdpr-hipaa" "retention=7-years"

# Retention: 7 years (matches audit log retention)
# Cost: ~$50/month for 10 GB Git history (cold storage tier)

Security Benefits

No Direct Cluster Access:

Problem Statement: Traditional CI/CD stores cluster credentials in Azure DevOps, creating security risks:

  1. Broad attack surface: Anyone with Azure DevOps access can potentially access cluster credentials
  2. Credential sprawl: Each environment/cluster needs separate service connection
  3. Rotation complexity: Updating service account tokens requires updating all pipelines
  4. Audit trail: Difficult to trace who accessed cluster credentials

GitOps Solution:

graph TD
    subgraph "Outside Cluster"
        A[Developer] -->|git push| B[Azure Repos<br/>atp-gitops]
        C[CI Pipeline] -->|update manifest<br/>commit + push| B
    end

    subgraph "Inside AKS Cluster - No External Access"
        D[FluxCD Agent]
        E[Kustomize Controller]
        F[Helm Controller]
        G[Kubernetes API]

        D -->|git pull| B
        D -->|render| E
        E -->|render| F
        F -->|kubectl apply| G
    end

    H[Azure Key Vault] -->|Workload Identity<br/>federated auth| I[External Secrets<br/>Operator]
    I -->|create secrets| G

    J[Azure Monitor] -.->|observability| D

    style G fill:#90EE90
    style D fill:#90EE90

Security Improvements:

| Security Aspect | Traditional CI/CD | GitOps | Improvement |
|---|---|---|---|
| Cluster credentials | Stored in CI/CD system | Never leave cluster | ✅ No cluster credentials outside the cluster |
| Attack surface | CI/CD system + cluster | Git repository only | ✅ Smaller attack surface |
| Credential rotation | Manual; update all pipelines | Automatic (Workload Identity) | ✅ Zero-touch rotation |
| Least privilege | Often cluster-admin for simplicity | RBAC per FluxCD controller | ✅ Principle of least privilege |
| Audit trail | Pipeline logs (ephemeral) | Git history (permanent) | ✅ Immutable audit evidence |
| Secrets in Git | Risk of accidental commit | Prevented (pre-commit hooks + PR validation) | ✅ Zero secrets in Git |
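Keeping secrets out of Git works because only secret *references* are committed; the values stay in Azure Key Vault. A minimal sketch with External Secrets Operator, assuming a ClusterSecretStore named azure-kv-production backed by Key Vault via Workload Identity (all names below are illustrative):

```yaml
# Only this reference lives in the GitOps repo; the secret value never does.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: atp-ingestion-db
  namespace: atp-production
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: azure-kv-production      # assumed store, backed by Azure Key Vault
  target:
    name: atp-ingestion-db         # Kubernetes Secret created by the operator
  data:
    - secretKey: connection-string
      remoteRef:
        key: atp-ingestion-db-connection-string   # Key Vault secret name
```

The operator resolves the reference at runtime and materializes a namespace-scoped Kubernetes Secret, so a leaked Git repository exposes no credentials.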

Separation of Duties:

ATP enforces role-based access control at multiple levels:

| Role | Azure Repos Access | AKS Cluster Access | FluxCD Admin | Azure Key Vault Access | Approval Authority |
|---|---|---|---|---|---|
| Developer | ✅ Submit PRs (feature/*) | ❌ No access | ❌ No | ❌ No | None |
| DevOps Engineer | ✅ Approve PRs (dev/test) | ⚠️ Read-only (dev/test) | ⚠️ Read-only | ❌ No | Dev/Test deployments |
| Architect | ✅ Approve PRs (staging/prod) | ⚠️ Read-only (all envs) | ⚠️ Read-only | ⚠️ Read-only (audit) | Staging/Prod deployments |
| SRE Engineer | ✅ Approve PRs (production) | ⚠️ Read-only (production) | ✅ Admin (suspend/resume reconciliation) | ⚠️ Read-only (audit) | Production deployments |
| Security Officer | ✅ Audit access (read-only) | ⚠️ Read-only (all envs) | ⚠️ Read-only | ✅ Admin (rotate secrets) | Security policy changes |
| Compliance Officer | ✅ Audit access (read-only) | ❌ No access | ❌ No | ⚠️ Read-only (audit) | None (audit only) |
| FluxCD Agent | ✅ Read-only (GitOps repo) | ✅ Full access (via ServiceAccount RBAC) | N/A | ❌ No (uses External Secrets Operator) | None (automated) |
| External Secrets Operator | ❌ No | ✅ Create secrets (namespace-scoped) | ❌ No | ✅ Read secrets (Workload Identity) | None (automated) |

RBAC Example (FluxCD ServiceAccount):

# FluxCD runs with least privilege (namespace-scoped for app deployments)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kustomize-controller-atp-apps
  namespace: flux-system

---
# Role: namespace-scoped permissions
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: flux-apps-deployer
  namespace: atp-production
rules:
- apiGroups: ["apps"]
  resources: ["deployments", "replicasets", "statefulsets", "daemonsets"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: [""]
  resources: ["services", "configmaps", "secrets", "persistentvolumeclaims"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["networking.k8s.io"]
  resources: ["ingresses", "networkpolicies"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]

---
# RoleBinding: bind ServiceAccount to Role
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: flux-apps-deployer
  namespace: atp-production
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: flux-apps-deployer
subjects:
- kind: ServiceAccount
  name: kustomize-controller-atp-apps
  namespace: flux-system

Multi-Tenancy

Tenant Isolation in Git:

ATP's namespace-per-tenant model is naturally represented in Git:

Directory Structure:

atp-gitops/
├── tenants/
│   ├── tenant-acme-corp/           # Tenant: ACME Corporation
│   │   ├── namespace.yaml          # Namespace: atp-tenant-acme
│   │   ├── resource-quota.yaml     # Limits: 10 CPU, 20 GB RAM
│   │   ├── network-policy.yaml     # Deny cross-tenant traffic
│   │   ├── rbac.yaml                # Tenant-specific RBAC
│   │   ├── config.yaml              # Data residency: US
│   │   └── kustomization.yaml      # FluxCD Kustomization
│   │
│   ├── tenant-widgets-inc/         # Tenant: Widgets Inc.
│   │   ├── namespace.yaml          # Namespace: atp-tenant-widgets
│   │   ├── resource-quota.yaml     # Limits: 5 CPU, 10 GB RAM
│   │   ├── network-policy.yaml
│   │   ├── rbac.yaml
│   │   ├── config.yaml              # Data residency: EU (GDPR)
│   │   └── kustomization.yaml
│   │
│   └── tenant-global-bank/         # Tenant: Global Bank (Enterprise)
│       ├── namespace.yaml          # Namespace: atp-tenant-global
│       ├── resource-quota.yaml     # Limits: 20 CPU, 40 GB RAM
│       ├── network-policy.yaml     # Strict isolation (financial data)
│       ├── rbac.yaml
│       ├── config.yaml              # Compliance: HIPAA + SOC 2 + GDPR
│       └── kustomization.yaml

Tenant Configuration Example:

# tenants/tenant-acme-corp/config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tenant-config
  namespace: atp-tenant-acme
data:
  # Tenant metadata
  tenant-id: "acme-corp"
  tenant-name: "ACME Corporation"
  tenant-tier: "standard"  # standard, premium, enterprise

  # Data residency
  data-residency: "us"  # us, eu, apac
  primary-region: "eastus"
  backup-region: "westus"

  # Compliance requirements
  compliance-profile: "soc2-hipaa"  # soc2, gdpr, hipaa, soc2-gdpr, soc2-hipaa, soc2-gdpr-hipaa
  retention-days: "2555"  # 7 years
  immutability-enabled: "true"
  tamper-evidence-enabled: "true"

  # Feature flags (tenant-specific)
  enable-advanced-query: "true"
  enable-ai-anomaly-detection: "false"  # Premium feature
  enable-realtime-alerts: "true"

  # Resource limits
  max-ingestion-rate-rps: "1000"  # 1000 requests per second
  max-query-rate-rps: "500"
  max-storage-gb: "1000"  # 1 TB

Benefits:

  • ✅ Isolated changes: Tenant config changes don't affect other tenants (isolated Git directories)
  • ✅ Audit trail per tenant: git log -- tenants/tenant-acme-corp/ shows all changes for ACME Corp
  • ✅ Compliance per tenant: GDPR/HIPAA requirements enforced via namespace labels and network policies
  • ✅ Cost allocation: Resource quotas enable accurate chargeback/showback per tenant
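The per-tenant audit trail can be demonstrated end to end. The sketch below builds a throwaway repo mirroring the atp-gitops tenant layout and shows that a git log scoped to one tenant directory returns only that tenant's changes (tenant names and commit messages are illustrative):

```shell
#!/usr/bin/env bash
# Sketch: per-tenant audit evidence via path-scoped git log.
set -euo pipefail

tmp=$(mktemp -d) && cd "$tmp" && git init -q
commit() { git -c user.name=Demo -c user.email=demo@connectsoft.example commit -qm "$1"; }

mkdir -p tenants/tenant-acme-corp tenants/tenant-widgets-inc
echo "tier: standard" > tenants/tenant-acme-corp/config.yaml
git add -A && commit "feat(acme): onboard ACME Corporation"
echo "tier: premium" > tenants/tenant-widgets-inc/config.yaml
git add -A && commit "feat(widgets): onboard Widgets Inc."

# Only ACME's history is returned — evidence scoped to a single tenant
git log --pretty='%s' -- tenants/tenant-acme-corp/
```

The same path-scoped query, exported from the real repository, can feed tenant-specific compliance reports without exposing other tenants' change history.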


Operational Resilience

Disaster Recovery from Git:

Scenario: Production AKS cluster destroyed (region outage, ransomware, infrastructure failure)

Recovery Steps:

#!/bin/bash
# disaster-recovery-production.sh — Recover production AKS from Git

set -euo pipefail

echo "🔴 DISASTER RECOVERY: Recreating production AKS cluster from Git + IaC"

# ──────────────────────────────────────────────────────────────────────────
# Step 1: Provision new AKS cluster with Pulumi (15 minutes)
# ──────────────────────────────────────────────────────────────────────────
echo "Step 1/5: Provisioning AKS cluster with Pulumi..."
cd infrastructure/pulumi-aks
pulumi stack select production
pulumi refresh --yes  # Detect destroyed resources
pulumi up --yes  # Recreate cluster

# ──────────────────────────────────────────────────────────────────────────
# Step 2: Configure kubectl context (1 minute)
# ──────────────────────────────────────────────────────────────────────────
echo "Step 2/5: Configuring kubectl..."
az aks get-credentials \
  --resource-group ATP-Production-EUS-RG \
  --name atp-prod-eus-aks \
  --overwrite-existing

export KUBECONFIG=~/.kube/config

# ──────────────────────────────────────────────────────────────────────────
# Step 3: Install FluxCD and bootstrap from Git (5 minutes)
# ──────────────────────────────────────────────────────────────────────────
echo "Step 3/5: Bootstrapping FluxCD..."
flux bootstrap git \
  --url=ssh://git@ssh.dev.azure.com/v3/ConnectSoft/ATP/atp-gitops \
  --branch=production \
  --path=clusters/production \
  --private-key-file=~/.ssh/azure-devops-flux \
  --author-name="Platform Team" \
  --author-email="platform-team@connectsoft.example"

# ──────────────────────────────────────────────────────────────────────────
# Step 4: Wait for FluxCD to reconcile all resources (10 minutes)
# ──────────────────────────────────────────────────────────────────────────
echo "Step 4/5: Waiting for FluxCD reconciliation..."
flux get kustomizations --watch --timeout=15m

# ──────────────────────────────────────────────────────────────────────────
# Step 5: Verify all services healthy (3 minutes)
# ──────────────────────────────────────────────────────────────────────────
echo "Step 5/5: Verifying service health..."
for service in ingestion query integrity export policy search gateway; do
  echo "Checking atp-$service..."
  kubectl wait --for=condition=available --timeout=300s \
    deployment/atp-$service -n atp-production
done

echo "✅ Disaster recovery complete!"
echo "RTO achieved: ~30 minutes"
echo "RPO: 0 minutes (Git contains complete desired state)"

RTO/RPO Targets:

| Environment | RTO Target | RTO Actual (GitOps) | RPO Target | RPO Actual (GitOps) |
|-------------|------------|---------------------|------------|---------------------|
| Dev | 4 hours | 20 minutes | 24 hours | 0 minutes |
| Test | 2 hours | 25 minutes | 12 hours | 0 minutes |
| Staging | 1 hour | 30 minutes | 4 hours | 0 minutes |
| Production | 30 minutes | 30-35 minutes | 1 hour | 0 minutes |

GitOps Impact: RPO reduced to zero (Git has complete desired state, no data loss for infrastructure config).


Summary

  • GitOps in ATP Context: Essential (not just beneficial) for ATP's audit trail, security, and compliance requirements
  • Audit Trail Requirements: Complete change history (Git log with attribution), tamper-evidence (GPG-signed commits), long-term retention (7 years in immutable Azure Blob Storage)
  • Security Benefits: No cluster credentials outside cluster, separation of duties (7 roles with RBAC matrix), secret management via Key Vault (zero secrets in Git)
  • Multi-Tenancy: Namespace-per-tenant with isolated Git directories, tenant-specific configs (data residency, compliance, resource quotas, feature flags)
  • Operational Resilience: DR RTO 30-35 minutes (Pulumi 15min + FluxCD 10min + validate 5min), RPO 0 minutes (Git has full state)
  • Rollback Simplicity: git revert triggers automatic rollback within 5-10 minutes (vs re-running pipeline)

Four Core Principles (OpenGitOps)

The OpenGitOps working group (CNCF) defines 4 core principles that any GitOps implementation must follow. ATP adheres to all four principles using FluxCD, Azure DevOps, and AKS.


Principle 1: Declarative

Definition: The desired system state is represented as declarative specifications (what you want, not how to get it). Configuration is stored in a version-controlled source (Git) rather than generated by scripts.

Key Concepts:

  • Declarative vs Imperative: Declarative describes the end state (e.g., "3 replicas, 1 GB RAM"), while imperative describes steps (e.g., "scale up by 1, set memory to 1 GB")
  • Idempotency: Applying the same declarative configuration multiple times produces the same result
  • Configuration as Code: All infrastructure and application config stored as YAML/JSON in Git

ATP Implementation:

ATP uses three layers of declarative configuration:

  1. Base Kubernetes Manifests (YAML): Raw Kubernetes resource definitions
  2. Helm Charts: Templated, parameterized manifests with values files
  3. Kustomize Overlays: Environment-specific customizations applied to base manifests

Kubernetes Deployment Manifest (Base)

Complete Example (ATP Ingestion Service):

# apps/atp-ingestion/base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
  namespace: atp-production
  labels:
    app: atp-ingestion
    component: ingestion
    tier: backend
    version: v1.2.3
    managed-by: fluxcd
spec:
  replicas: 3  # Desired state: 3 replicas
  selector:
    matchLabels:
      app: atp-ingestion
  template:
    metadata:
      labels:
        app: atp-ingestion
        version: v1.2.3
    spec:
      serviceAccountName: atp-ingestion-sa
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 2000
        seccompProfile:
          type: RuntimeDefault
      containers:
      - name: ingestion
        image: connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d
        imagePullPolicy: IfNotPresent
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          capabilities:
            drop: [ALL]
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
          limits:
            cpu: 1000m
            memory: 1Gi
        env:
        - name: ASPNETCORE_ENVIRONMENT
          value: Production
        - name: OpenTelemetry__ServiceName
          value: atp-ingestion
        envFrom:
        - configMapRef:
            name: atp-ingestion-config
        - secretRef:
            name: atp-ingestion-secrets
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 3
        volumeMounts:
        - name: tmp
          mountPath: /tmp
        - name: cache
          mountPath: /app/cache
      volumes:
      - name: tmp
        emptyDir: {}
      - name: cache
        emptyDir: {}
      imagePullSecrets:
      - name: acr-credentials

---
# apps/atp-ingestion/base/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: atp-ingestion
  namespace: atp-production
  labels:
    app: atp-ingestion
spec:
  type: ClusterIP
  ports:
  - port: 80
    targetPort: 8080
    protocol: TCP
    name: http
  selector:
    app: atp-ingestion

---
# apps/atp-ingestion/base/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: atp-ingestion-config
  namespace: atp-production
data:
  ASPNETCORE_ENVIRONMENT: "Production"
  OpenTelemetry__ServiceName: "atp-ingestion"
  OpenTelemetry__SamplingRatio: "0.1"
  Audit__EnableImmutability: "true"
  Audit__RetentionDays: "2555"

Declarative Characteristics:

  • Desired state: replicas: 3 declares the goal (not "scale by +1")
  • Idempotent: Reapplying same manifest produces same result
  • Version-controlled: Stored in Git, not generated by scripts
  • Immutable: Image tag includes commit SHA (v1.2.3-abc123d)

Helm Charts (Parameterized Declarative)

Chart Structure:

apps/atp-ingestion/helm/
├── Chart.yaml              # Chart metadata
├── values.yaml             # Default values
├── values-dev.yaml         # Dev environment overrides
├── values-production.yaml  # Production environment overrides
└── templates/
    ├── deployment.yaml     # Templated Deployment
    ├── service.yaml        # Templated Service
    ├── configmap.yaml      # Templated ConfigMap
    └── ingress.yaml        # Templated Ingress

Chart.yaml:

# apps/atp-ingestion/helm/Chart.yaml
apiVersion: v2
name: atp-ingestion
description: ATP Ingestion Service - Receives audit records via HTTP/gRPC
version: 1.2.3  # Chart version (SemVer)
appVersion: 1.2.3  # Application version
type: application

keywords:
  - audit-trail
  - ingestion
  - microservice

maintainers:
  - name: ConnectSoft Platform Team
    email: platform-team@connectsoft.example

dependencies:
  - name: redis
    version: 17.x.x
    repository: https://charts.bitnami.com/bitnami
    condition: redis.enabled

values.yaml (Default):

# apps/atp-ingestion/helm/values.yaml
# Default values for atp-ingestion chart

replicaCount: 3

image:
  repository: connectsoft.azurecr.io/atp/ingestion
  pullPolicy: IfNotPresent
  tag: ""  # Overridden by .Values.appVersion or CI pipeline

imagePullSecrets:
  - name: acr-credentials

serviceAccount:
  create: true
  annotations:
    azure.workload.identity/client-id: "12345678-1234-1234-1234-123456789abc"
  name: atp-ingestion-sa

podAnnotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "8080"
  prometheus.io/path: "/metrics"

podSecurityContext:
  runAsNonRoot: true
  runAsUser: 1000
  fsGroup: 2000
  seccompProfile:
    type: RuntimeDefault

securityContext:
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop: [ALL]

service:
  type: ClusterIP
  port: 80
  targetPort: 8080

ingress:
  enabled: true
  className: nginx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
  hosts:
    - host: ingestion.atp.connectsoft.example
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: atp-ingestion-tls
      hosts:
        - ingestion.atp.connectsoft.example

resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 1000m
    memory: 1Gi

autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
  targetMemoryUtilizationPercentage: 80

livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5

# Environment-specific configuration
env:
  ASPNETCORE_ENVIRONMENT: Production
  OpenTelemetry__ServiceName: atp-ingestion

# External Secrets Operator integration
externalSecrets:
  enabled: true
  secretStore: azure-keyvault
  secrets:
    - name: ConnectionStrings__Database
      key: sql-connection-string
    - name: ConnectionStrings__Redis
      key: redis-connection-string
    - name: ConnectionStrings__RabbitMQ
      key: rabbitmq-connection-string

# Redis sub-chart (optional dependency)
redis:
  enabled: false  # Use Azure Cache for Redis instead

Helm Template (Deployment):

# apps/atp-ingestion/helm/templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "atp-ingestion.fullname" . }}
  namespace: {{ .Release.Namespace }}
  labels:
    {{- include "atp-ingestion.labels" . | nindent 4 }}
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      {{- include "atp-ingestion.selectorLabels" . | nindent 6 }}
  template:
    metadata:
      annotations:
        {{- with .Values.podAnnotations }}
        {{- toYaml . | nindent 8 }}
        {{- end }}
      labels:
        {{- include "atp-ingestion.selectorLabels" . | nindent 8 }}
    spec:
      serviceAccountName: {{ .Values.serviceAccount.name }}
      securityContext:
        {{- toYaml .Values.podSecurityContext | nindent 8 }}
      containers:
      - name: {{ .Chart.Name }}
        image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
        imagePullPolicy: {{ .Values.image.pullPolicy }}
        securityContext:
          {{- toYaml .Values.securityContext | nindent 12 }}
        resources:
          {{- toYaml .Values.resources | nindent 12 }}
        env:
        {{- range $key, $value := .Values.env }}
        - name: {{ $key }}
          value: {{ $value | quote }}
        {{- end }}
        livenessProbe:
          {{- toYaml .Values.livenessProbe | nindent 12 }}
        readinessProbe:
          {{- toYaml .Values.readinessProbe | nindent 12 }}
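
The template above calls named helpers (`atp-ingestion.fullname`, `atp-ingestion.labels`, `atp-ingestion.selectorLabels`) that are not shown in the chart tree; by Helm convention they live in `templates/_helpers.tpl`. A minimal sketch using the standard helper idioms (label choices here are assumptions, not the ATP originals):

```yaml
{{/* apps/atp-ingestion/helm/templates/_helpers.tpl — minimal helper sketch */}}
{{- define "atp-ingestion.fullname" -}}
{{- .Chart.Name | trunc 63 | trimSuffix "-" -}}
{{- end }}

{{- define "atp-ingestion.selectorLabels" -}}
app: {{ .Chart.Name }}
{{- end }}

{{- define "atp-ingestion.labels" -}}
{{ include "atp-ingestion.selectorLabels" . }}
helm.sh/chart: {{ .Chart.Name }}-{{ .Chart.Version }}
app.kubernetes.io/version: {{ .Chart.AppVersion | quote }}
app.kubernetes.io/managed-by: {{ .Release.Service }}
{{- end }}
```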

Benefits of Helm:

  • Parameterization: One chart, multiple environments (values-dev.yaml, values-production.yaml)
  • Reusability: Chart can be used across multiple services with different values
  • Dependency management: Declare sub-charts (e.g., Redis) as dependencies
  • Versioning: Chart version and app version tracked separately
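
For reference, a sketch of what `values-production.yaml` might contain. Only keys that differ from `values.yaml` need to appear; the specific numbers below are illustrative, chosen to mirror the production resource sizing used elsewhere in this document:

```yaml
# apps/atp-ingestion/helm/values-production.yaml — illustrative production overrides
replicaCount: 5

resources:
  requests:
    cpu: 1000m
    memory: 1Gi
  limits:
    cpu: 2000m
    memory: 2Gi

env:
  ASPNETCORE_ENVIRONMENT: Production
  OpenTelemetry__SamplingRatio: "0.1"   # 10% sampling in production
```

Rendering with `helm template -f values-production.yaml` layers these on top of the defaults in `values.yaml`.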

Kustomize Overlays (Environment-Specific Customization)

Directory Structure:

apps/atp-ingestion/
├── base/                    # Base manifests (reusable)
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── configmap.yaml
│   └── kustomization.yaml
└── overlays/                # Environment-specific overlays
    ├── dev/
    │   ├── kustomization.yaml
    │   ├── deployment-patch.yaml
    │   └── configmap-patch.yaml
    ├── staging/
    │   ├── kustomization.yaml
    │   ├── deployment-patch.yaml
    │   └── hpa-patch.yaml
    └── production/
        ├── kustomization.yaml
        ├── deployment-patch.yaml
        ├── hpa-patch.yaml
        └── configmap-patch.yaml

Base Kustomization:

# apps/atp-ingestion/base/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: atp-production

resources:
  - deployment.yaml
  - service.yaml
  - configmap.yaml

commonLabels:
  app: atp-ingestion
  component: ingestion
  managed-by: fluxcd

images:
  - name: connectsoft.azurecr.io/atp/ingestion
    newTag: v1.2.3-abc123d  # Updated by CI pipeline

Production Overlay:

# apps/atp-ingestion/overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: atp-production

# Base resources
resources:
  - ../../base

# Strategic merge patches
patchesStrategicMerge:
  - deployment-patch.yaml
  - hpa-patch.yaml

# Image tag override (updated by CI pipeline)
images:
  - name: connectsoft.azurecr.io/atp/ingestion
    newTag: v1.2.3-abc123d

# ConfigMap generator (add production-specific values)
configMapGenerator:
  - name: atp-ingestion-config
    behavior: merge  # Merge with base ConfigMap
    literals:
      - ASPNETCORE_ENVIRONMENT=Production
      - OpenTelemetry__SamplingRatio=0.1
      - Audit__EnableImmutability=true
      - Audit__RetentionDays=2555

# Labels applied to all resources
commonLabels:
  environment: production
  managed-by: fluxcd
  compliance: soc2-gdpr-hipaa

# Annotations applied to all resources
commonAnnotations:
  gitops.toolkit.fluxcd.io/reconcile: enabled
  azure.connectsoft.com/cost-center: atp-production

# Replicas override (production has more replicas)
replicas:
  - name: atp-ingestion
    count: 5  # Production: 5 replicas (base has 3)

Deployment Patch (Production-specific changes):

# apps/atp-ingestion/overlays/production/deployment-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  replicas: 5  # Production: 5 replicas
  template:
    spec:
      containers:
      - name: ingestion
        resources:
          requests:
            cpu: 1000m      # Production: 1 CPU core (base: 500m)
            memory: 1Gi     # Production: 1 GB RAM (base: 512Mi)
          limits:
            cpu: 2000m      # Production: 2 CPU cores (base: 1000m)
            memory: 2Gi     # Production: 2 GB RAM (base: 1Gi)
        env:
        - name: ASPNETCORE_ENVIRONMENT
          value: Production
        - name: OpenTelemetry__SamplingRatio
          value: "0.1"  # Production: 10% sampling (dev: 100%)
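
The overlay also lists an `hpa-patch.yaml` that is not shown. A sketch, assuming a HorizontalPodAutoscaler named `atp-ingestion` exists in the base manifests (strategic merge patches require the target resource to exist); the scaling bounds are illustrative:

```yaml
# apps/atp-ingestion/overlays/production/hpa-patch.yaml — illustrative sketch
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: atp-ingestion
spec:
  minReplicas: 5    # Production floor matches the replica override above
  maxReplicas: 15   # Illustrative production ceiling
```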

Benefits of Kustomize:

  • DRY (Don't Repeat Yourself): Base manifests reused; only differences in overlays
  • Environment isolation: Each environment has isolated overlay directory
  • Strategic merge patches: Fine-grained control over what changes per environment
  • Build-time customization: No runtime templating; manifests rendered before applying

Principle 2: Versioned & Immutable

Definition: All desired states are versioned (stored in Git) and immutable (cannot be changed after commit). Changes are made by creating new versions, not modifying existing ones.

Key Concepts:

  • Git as Version Control: All manifests stored in Git with commit history
  • Immutable Git History: Commits cannot be modified (only new commits added)
  • Semantic Versioning: Version numbers follow SemVer (major.minor.patch)
  • Image Tagging: Container images tagged with version + commit SHA
  • GPG Signing: Commits cryptographically signed to prove authenticity

Git Commit Signing with GPG

Purpose: Ensure commits are tamper-evident and authentic (SOC 2, GDPR compliance).

Setup:

# Generate GPG key (one-time per developer/team)
gpg --full-generate-key
# Select:
#   - RSA and RSA (default)
#   - 4096 bits (secure)
#   - No expiration (or 2 years)
#   - Real name: "Platform Team"
#   - Email: platform-team@connectsoft.example
#   - Comment: "ATP GitOps Commits"

# List keys
gpg --list-secret-keys --keyid-format LONG

# Output:
# sec   rsa4096/1234567890ABCDEF 2024-01-15 [SC]
#       ABC123DEF4567890ABC123DEF4567890ABC123DE
# uid                 [ultimate] Platform Team <platform-team@connectsoft.example>

# Configure Git to use GPG key
git config --global user.signingkey 1234567890ABCDEF  # Use key ID from above
git config --global commit.gpgsign true  # Sign all commits automatically
git config --global tag.gpgsign true     # Sign all tags automatically

# Export public key (share with team)
gpg --armor --export 1234567890ABCDEF > platform-team-gpg-public.key

# Import public key (for verification)
gpg --import platform-team-gpg-public.key

Commit with Signature:

# Standard commit (automatically signed due to commit.gpgsign=true)
git add apps/atp-ingestion/overlays/production/kustomization.yaml
git commit -m "feat(ingestion): upgrade to v1.2.4

- Updated image tag to v1.2.4-def456e
- Increased memory limit 1Gi → 2Gi (performance optimization)
- Enabled advanced query features

Relates to: ATP-EPIC-789
Approved by: architect@connectsoft.example
Tested in: Staging (2024-10-28 to 2024-10-30)"

# Explicit signing (if auto-sign disabled)
git commit -S -m "..."

# Verify signature
git log --show-signature -1

# Output:
# commit def456e789abcdef0123456789abcdef01234567 (HEAD -> production)
# gpg: Signature made Mon Oct 30 15:30:22 2024 UTC
# gpg:                using RSA key 1234567890ABCDEF
# gpg: Good signature from "Platform Team <platform-team@connectsoft.example>"
# Author: Platform Team <platform-team@connectsoft.example>
# Date:   Mon Oct 30 15:30:22 2024 +0000
#
#     feat(ingestion): upgrade to v1.2.4

Azure DevOps Branch Policy (Require Signed Commits):

# Azure DevOps Branch Policy: Require GPG-signed commits
# Configured in Azure DevOps Portal:
# Repos > Branches > production > Branch Policies > Branch Policies
#   ✓ Require signed commits (GPG or SSH)
#   ✓ Require pull request (minimum 1 reviewer)
#   ✓ Require status checks (CI pipeline must pass)

Verify All Commits Signed (Compliance Audit):

#!/bin/bash
# verify-all-commits-signed.sh — Verify all commits in production branch are GPG-signed

BRANCH="production"
UNSIGNED_COMMITS=()

for commit in $(git log --format=%H origin/$BRANCH); do
  if ! git verify-commit $commit 2>/dev/null; then
    UNSIGNED_COMMITS+=($commit)
  fi
done

if [ ${#UNSIGNED_COMMITS[@]} -eq 0 ]; then
  echo "✅ All commits are GPG-signed"
  exit 0
else
  echo "❌ Found ${#UNSIGNED_COMMITS[@]} unsigned commits:"
  for commit in "${UNSIGNED_COMMITS[@]}"; do
    echo "  - $commit"
  done
  exit 1
fi

Semantic Versioning Strategy

Strategy: ATP uses Semantic Versioning (SemVer) for application versions: MAJOR.MINOR.PATCH

  • MAJOR: Breaking changes (API incompatibility, schema changes)
  • MINOR: New features (backward-compatible)
  • PATCH: Bug fixes (backward-compatible)
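
A small shell helper (illustrative, not part of the ATP tooling) shows how a CI script might split a `vMAJOR.MINOR.PATCH` tag into its components, e.g. to decide whether an upgrade crosses a breaking-change boundary:

```shell
# parse_semver: print "MAJOR MINOR PATCH" for a tag like "v1.2.4" or "v1.2.4-hotfix1".
parse_semver() {
  local tag="${1#v}"        # strip the leading "v" if present
  local core="${tag%%-*}"   # drop any suffix after "-" (e.g. -hotfix1)
  local major minor patch
  IFS=. read -r major minor patch <<< "$core"
  echo "$major $minor $patch"
}

parse_semver "v1.2.4"          # → 1 2 4
parse_semver "v1.2.4-hotfix1"  # → 1 2 4
```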

Version Tagging:

# Tag release in source code repository
git tag -a v1.2.4 -m "Release v1.2.4

- Feature: Advanced query API
- Bug fix: Memory leak in Redis connection pooling
- Security: Upgrade to .NET 8.0

Changelog: https://dev.azure.com/ConnectSoft/ATP/_wiki/wikis/ATP.wiki/12345/Release-Notes-v1.2.4"

git push origin v1.2.4

# CI pipeline uses tag to build Docker image
# Docker image tagged as: connectsoft.azurecr.io/atp/ingestion:v1.2.4

Version Examples:

v1.2.4    # Minor feature release
v1.2.3    # Patch release (bug fix)
v2.0.0    # Major release (breaking changes)
v1.2.4-hotfix1  # Hotfix release

Git Tags for Releases

Tag Structure:

# Production release tag
git tag -a v1.2.4 -m "Production Release v1.2.4" production
git push origin v1.2.4

# Hotfix release tag
git tag -a v1.2.4-hotfix1 -m "Hotfix: Memory leak fix" hotfix/memory-leak
git push origin v1.2.4-hotfix1

Tag Verification (Ensure Tags Match Commits):

# Verify tag points to expected commit
git tag -v v1.2.4

# Output:
# object abc123d7890def4567890abc123def4567890ab
# type commit
# tag v1.2.4
# tagger Platform Team <platform-team@connectsoft.example> 2024-10-30 16:00:00 +0000
#
# Production Release v1.2.4
# gpg: Signature made Mon Oct 30 16:00:00 2024 UTC
# gpg:                using RSA key 1234567890ABCDEF
# gpg: Good signature from "Platform Team <platform-team@connectsoft.example>"

Environment-Wide Release Tags:

# Production release tag (all services)
git tag -a release/v1.2.4 -m "Production Release v1.2.4

Services:
- atp-ingestion: v1.2.4
- atp-query: v1.3.0
- atp-integrity: v1.1.5
- atp-export: v1.0.2
- atp-policy: v1.2.0
- atp-search: v1.1.0
- atp-gateway: v1.4.0

Changelog: https://dev.azure.com/ConnectSoft/ATP/_wiki/wikis/ATP.wiki/12345/Release-Notes-v1.2.4"
git push origin release/v1.2.4

Image Tagging with Version + Commit SHA

Strategy: ATP uses immutable image tags combining version + commit SHA:

Format: {version}-{commit-sha}
Example: v1.2.4-abc123d

Where:
  - v1.2.4 = Semantic version (from Git tag)
  - abc123d = First 7 characters of Git commit SHA

Benefits:

  • ✅ Traceability: Image tag links to exact Git commit
  • ✅ Immutability: Same tag always points to same image (never overwritten)
  • ✅ Version clarity: Version number visible in tag
  • ✅ Rollback simplicity: Revert to previous version tag
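
The tag composition is trivial to script. A sketch (function name is illustrative) of how a CI step might derive the immutable tag from a version tag and a full commit SHA:

```shell
# build_image_tag: compose "<version>-<short-sha>" from a version tag and a full SHA.
build_image_tag() {
  local version="$1"   # e.g. "v1.2.4", typically from: git describe --tags --abbrev=0
  local sha="$2"       # full 40-char SHA, typically from: git rev-parse HEAD
  printf '%s-%s\n' "$version" "${sha:0:7}"   # first 7 characters of the SHA
}

build_image_tag "v1.2.4" "abc123d7890def4567890abc123def4567890ab"
# → v1.2.4-abc123d
```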

Docker Image Tagging (Azure Pipelines):

# Azure Pipelines: Tag Docker image with version + commit SHA
- task: Docker@2
  displayName: 'Build and push Docker image'
  inputs:
    containerRegistry: 'ConnectSoft-ACR'
    repository: 'atp/ingestion'
    command: 'buildAndPush'
    Dockerfile: 'src/ConnectSoft.ATP.Ingestion/Dockerfile'
    tags: |
      $(Build.BuildNumber)              # v1.2.4
      $(Build.BuildNumber)-$(Build.SourceVersion)  # Build.SourceVersion is the full 40-char SHA; truncate to 7 chars in a prior script step to get v1.2.4-abc123d
      latest                            # Latest (for dev only)

ACR Tagging Rules:

| Tag Pattern | Mutable? | Use Case | Example |
|-------------|----------|----------|---------|
| v{VERSION} | ❌ Immutable | Production releases | v1.2.4 |
| v{VERSION}-{SHA} | ❌ Immutable | Production releases (traceable) | v1.2.4-abc123d |
| latest | ✅ Mutable | Development only | latest |

Git History as Audit Trail

Compliance Report Generation:

#!/bin/bash
# generate-compliance-report.sh — Generate SOC 2 audit report from Git history

BRANCH="production"
START_DATE="2024-01-01"
END_DATE="2024-12-31"
OUTPUT_FILE="compliance-report-q4-2024.md"

echo "# GitOps Compliance Report — Q4 2024" > $OUTPUT_FILE
echo "" >> $OUTPUT_FILE
echo "**Report Period**: $START_DATE to $END_DATE" >> $OUTPUT_FILE
echo "**Branch**: $BRANCH" >> $OUTPUT_FILE
echo "" >> $OUTPUT_FILE
echo "## All Production Deployments" >> $OUTPUT_FILE
echo "" >> $OUTPUT_FILE
echo "| Commit | Timestamp | Author | Email | Description | Signature |" >> $OUTPUT_FILE
echo "|--------|-----------|--------|-------|-------------|-----------|" >> $OUTPUT_FILE

git log --format="| %h | %ai | %an | %ae | %s | %G? |" \
  --since="$START_DATE" \
  --until="$END_DATE" \
  origin/$BRANCH | \
  sed 's/ G |$/ ✅ Good |/' | \
  sed 's/ B |$/ ❌ Bad |/' | \
  sed 's/ U |$/ ⚠️ Unknown |/' | \
  sed 's/ N |$/ ❌ None |/' | \
  sed 's/ X |$/ ❌ Expired |/' \
  >> $OUTPUT_FILE

echo "" >> $OUTPUT_FILE
echo "## Summary" >> $OUTPUT_FILE
echo "" >> $OUTPUT_FILE
TOTAL=$(git log --oneline --since="$START_DATE" --until="$END_DATE" origin/$BRANCH | wc -l)
SIGNED=$(git log --show-signature --since="$START_DATE" --until="$END_DATE" origin/$BRANCH | grep -c "Good signature")
echo "- **Total Commits**: $TOTAL" >> $OUTPUT_FILE
echo "- **Signed Commits**: $SIGNED" >> $OUTPUT_FILE
echo "- **Unsigned Commits**: $((TOTAL - SIGNED))" >> $OUTPUT_FILE

echo "✅ Compliance report generated: $OUTPUT_FILE"

Output Example:

# GitOps Compliance Report — Q4 2024

**Report Period**: 2024-01-01 to 2024-12-31
**Branch**: production

## All Production Deployments

| Commit | Timestamp | Author | Email | Description | Signature |
|--------|-----------|--------|-------|-------------|-----------|
| abc123d | 2024-10-30 14:23:45 | Platform Team | platform-team@connectsoft.example | feat(ingestion): upgrade to v1.2.4 | ✅ Good |
| def456e | 2024-10-25 10:15:22 | Alice Chen | alice.chen@connectsoft.example | fix(query): resolve index issue | ✅ Good |
| ghi789f | 2024-10-20 16:42:11 | Bob Smith | bob.smith@connectsoft.example | scale(integrity): replicas 3→5 | ✅ Good |

## Summary

- **Total Commits**: 45
- **Signed Commits**: 45
- **Unsigned Commits**: 0

Long-Term Retention (7 years for compliance):

# Backup Git repository to immutable Azure Blob Storage
az storage blob upload-batch \
  --account-name atpgitbackupprod \
  --destination gitops-backups \
  --source .git/ \
  --destination-path "$(date +%Y%m%d)/" \
  --overwrite false  # Immutable: cannot overwrite

# Enable legal hold (WORM storage)
az storage container legal-hold set \
  --account-name atpgitbackupprod \
  --container-name gitops-backups \
  --tags "compliance=soc2-gdpr-hipaa" "retention=7-years"

# Retention: 7 years (matches audit log retention)
# Cost: ~$50/month for 10 GB Git history (cold storage tier)

Principle 3: Pulled Automatically

Definition: The desired state is automatically pulled from the source (Git repository) by an agent running inside the cluster, rather than being pushed by an external system.

Key Concepts:

  • Pull-Based Architecture: GitOps agent (FluxCD) inside cluster pulls from Git
  • Polling Intervals: Agent checks Git at regular intervals (e.g., every 1 minute)
  • Webhook Triggers: Optional webhooks for immediate sync (faster than polling)
  • GitRepository Resource: FluxCD custom resource that defines Git source
  • Kustomization Resource: FluxCD custom resource that defines what to deploy

FluxCD Architecture Overview

Component Diagram:

graph TD
    A[Git Repository<br/>Azure Repos] -->|git pull<br/>every 1 min| B[Source Controller<br/>flux-system namespace]

    B -->|fetch Git| C[GitRepository<br/>Custom Resource]

    C -->|notify| D[Kustomize Controller<br/>flux-system namespace]

    D -->|render manifests| E[Kustomization<br/>Custom Resource]

    E -->|kubectl apply| F[Kubernetes API<br/>AKS Cluster]

    F -.->|watch for drift| D
    D -.->|reconcile| F

    G[Helm Controller] -.->|if Helm chart| E
    H[Notification Controller] -->|alerts| I[Slack / Teams]

    style B fill:#90EE90
    style D fill:#90EE90
    style G fill:#90EE90
    style H fill:#FFE5B4

FluxCD Components:

| Component | Purpose | Namespace |
|-----------|---------|-----------|
| source-controller | Fetches Git repositories, Helm charts, OCI artifacts | flux-system |
| kustomize-controller | Renders Kustomize manifests and applies to cluster | flux-system |
| helm-controller | Installs/upgrades Helm charts | flux-system |
| notification-controller | Sends alerts to Slack, Teams, etc. | flux-system |

GitRepository Resource

Definition: Defines source of truth (Git repository URL, branch, authentication).

Example:

# clusters/production/gitrepository.yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: atp-gitops
  namespace: flux-system
spec:
  interval: 1m  # Poll Git every 1 minute
  url: ssh://git@ssh.dev.azure.com/v3/ConnectSoft/ATP/atp-gitops
  ref:
    branch: production  # Git branch to watch
  secretRef:
    name: azure-devops-ssh-key  # SSH key secret for authentication
  ignore: |
    /*.md
    !README.md
  suspend: false  # Set to true to pause reconciliation

Status (Reconciled):

# Check GitRepository status
kubectl describe gitrepository atp-gitops -n flux-system

# Output:
# Status:
#   Artifact:
#     Checksum:           abc123def4567890
#     Last Update:        2024-10-30T15:30:00Z
#     Path:               gitrepository/flux-system/atp-gitops/abc123.tar.gz
#     Revision:           production/abc123d7890def4567890abc123def4567890ab
#     URL:                http://source-controller.flux-system.svc.cluster.local./gitrepository/flux-system/atp-gitops/abc123.tar.gz
#   Conditions:
#     Last Transition Time:  2024-10-30T15:30:00Z
#     Message:               Fetched revision: production/abc123d7890def4567890abc123def4567890ab
#     Observed Generation:   1
#     Reason:                GitOperationSucceed
#     Status:                True
#     Type:                  Ready
#   Observed Generation:     1
#   URL:                     ssh://git@ssh.dev.azure.com/v3/ConnectSoft/ATP/atp-gitops

Authentication Methods:

Option 1: SSH Key (Recommended for Azure DevOps):

# Create SSH key secret
apiVersion: v1
kind: Secret
metadata:
  name: azure-devops-ssh-key
  namespace: flux-system
type: Opaque
stringData:
  identity: |
    -----BEGIN OPENSSH PRIVATE KEY-----
    b3BlbnNzaC1rZXktdjEAAAAABG5vbmUAAAAEbm9uZQNuZW5lAAAABQAAAAB...
    -----END OPENSSH PRIVATE KEY-----
  known_hosts: |
    ssh.dev.azure.com ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQC7...

---
# GitRepository references secret
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: atp-gitops
  namespace: flux-system
spec:
  url: ssh://git@ssh.dev.azure.com/v3/ConnectSoft/ATP/atp-gitops
  secretRef:
    name: azure-devops-ssh-key

Option 2: Personal Access Token (PAT) (Alternative):

# Create PAT secret
apiVersion: v1
kind: Secret
metadata:
  name: azure-devops-pat
  namespace: flux-system
type: Opaque
stringData:
  username: git
  password: <AZURE_DEVOPS_PAT>  # Token with Code (Read) permission

---
# GitRepository uses PAT
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: atp-gitops
  namespace: flux-system
spec:
  url: https://dev.azure.com/ConnectSoft/ATP/_git/atp-gitops
  secretRef:
    name: azure-devops-pat

Kustomization Resource

Definition: Defines what to deploy (path in Git repository, reconciliation interval, health checks).

Example:

# clusters/production/kustomizations/atp-ingestion.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: atp-ingestion
  namespace: flux-system
spec:
  interval: 5m  # Reconcile every 5 minutes
  path: ./apps/atp-ingestion/overlays/production  # Path in Git repository
  prune: true  # Delete resources removed from Git
  sourceRef:
    kind: GitRepository
    name: atp-gitops
    namespace: flux-system
  healthChecks:
    - apiVersion: apps/v1
      kind: Deployment
      name: atp-ingestion
      namespace: atp-production
  timeout: 10m  # Timeout for reconciliation
  retryInterval: 2m  # Retry interval on failure
  suspend: false

Status (Reconciled):

# Check Kustomization status
kubectl describe kustomization atp-ingestion -n flux-system

# Output:
# Status:
#   Conditions:
#     Last Transition Time:  2024-10-30T15:35:00Z
#     Message:               Applied revision: production/abc123d7890def4567890abc123def4567890ab
#     Observed Generation:   1
#     Reason:                ReconciliationSucceeded
#     Status:                True
#     Type:                  Ready
#   Inventory:
#     Entries:
#       Id:                   apps_v1_Deployment_atp-production_atp-ingestion
#       V:                    v1
#   Last Applied Revision:    production/abc123d7890def4567890abc123def4567890ab
#   Last Attempted Revision:  production/abc123d7890def4567890abc123def4567890ab
#   Observed Generation:      1

Automatic Sync Policies per Environment

Per-Environment Configuration:

| Environment | Git Branch | Polling Interval | Reconciliation Interval | Webhook Trigger |
|-------------|------------|------------------|-------------------------|-----------------|
| Dev | dev | 30 seconds | 1 minute | Enabled (immediate) |
| Test | test | 1 minute | 2 minutes | Enabled (immediate) |
| Staging | staging | 1 minute | 5 minutes | Disabled (manual approval) |
| Production | production | 1 minute | 5 minutes | Disabled (manual approval + 24h cooldown) |

Production Sync Policy (Conservative):

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: atp-ingestion-production
  namespace: flux-system
spec:
  interval: 5m  # Reconcile every 5 minutes (not immediate)
  path: ./apps/atp-ingestion/overlays/production
  prune: true
  sourceRef:
    kind: GitRepository
    name: atp-gitops
  # Production: Manual approval required (webhook disabled)
  # Production: No automatic sync on push (polling only)

Behavior:

  1. Developer pushes commit to production branch
  2. GitRepository polls Git every 1 minute (detects new commit)
  3. Kustomization reconciles every 5 minutes (applies changes)
  4. Total delay: up to 6 minutes (1 min poll + 5 min reconcile)
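
The worst-case figure can be sanity-checked with shell arithmetic; the commented `flux reconcile` line shows how an operator can skip the wait entirely (a sketch illustrating the policy above, not part of it):

```shell
# Worst case: a commit lands just after a poll and just after a reconcile
# run, so both intervals stack.
POLL=60        # GitRepository interval: 1m
RECONCILE=300  # Kustomization interval: 5m
echo "worst case: $(( (POLL + RECONCILE) / 60 )) minutes"

# To apply immediately instead of waiting (manual operator action):
# flux reconcile kustomization atp-ingestion-production --with-source
```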

Dev Sync Policy (Aggressive):

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: atp-ingestion-dev
  namespace: flux-system
spec:
  interval: 1m  # Reconcile every 1 minute
  path: ./apps/atp-ingestion/overlays/dev
  prune: true
  sourceRef:
    kind: GitRepository
    name: atp-gitops-dev

Polling Intervals and Webhook Triggers

Polling Configuration:

# GitRepository polling interval
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
spec:
  interval: 1m  # Minimum: 30s, Maximum: 24h

Webhook Triggers (Optional):

Purpose: Trigger immediate reconciliation when Git push occurs (faster than polling).

Azure DevOps Webhook (receives a POST on each push):

# FluxCD Receiver (webhook endpoint)
apiVersion: notification.toolkit.fluxcd.io/v1
kind: Receiver
metadata:
  name: atp-gitops-webhook
  namespace: flux-system
spec:
  type: git
  events:
    - "push"
  resources:
    - kind: GitRepository
      name: atp-gitops
  secretRef:
    name: webhook-token
  # Azure DevOps webhook URL:
  # https://<fluxcd-receiver>/hook/xyz123abc456...

Configuration in Azure DevOps:

Azure DevOps > Repos > Hooks > Add Subscription
  Name: FluxCD Webhook
  Event: Code pushed
  Filters:
    - Branch: dev, test (production excluded for safety)
  Service Hook URL: https://<fluxcd-receiver>/hook/xyz123abc456...

Benefits:

  • ✅ Faster sync: Changes applied within seconds (vs at most 6 minutes with polling)
  • ✅ Reduced Git polling: Lower load on Azure DevOps Git servers

Trade-offs:

  • ⚠️ Security: Webhook endpoint must be publicly accessible (or use Azure DevOps IP allowlist)
  • ⚠️ Production risk: Disabled for production (manual approval required)
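
A minimal sketch of provisioning the token the Receiver references (the secret name `webhook-token` matches the manifest above; the `kubectl` step is shown commented rather than executed):

```shell
# Generate a random token for the Receiver's secretRef.
TOKEN=$(openssl rand -hex 20)
echo "token length: ${#TOKEN}"

# Then create the secret FluxCD expects (cluster command, not run here):
# kubectl -n flux-system create secret generic webhook-token \
#   --from-literal=token="${TOKEN}"
```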


Principle 4: Continuously Reconciled

Definition: Software agents automatically and continuously ensure the actual system state matches the desired state (stored in Git). Any drift from the desired state is automatically corrected.

Key Concepts:

  • Drift Detection: Continuous monitoring of cluster state vs Git state
  • Self-Healing: Automatic correction of configuration drift
  • Reconciliation Loop: Periodic checks and corrections (every 1-5 minutes)
  • Drift Correction: Revert manual changes to match Git state
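
The loop these concepts describe can be reduced to a toy model (no cluster involved; real FluxCD applies the Git state via server-side apply rather than variable assignment):

```shell
# Toy reconciliation loop: desired state comes from "Git", live state has
# drifted, and the loop converges live back to desired.
desired=3
live=5                       # an operator scaled manually
if [ "$live" -ne "$desired" ]; then
  live=$desired              # real FluxCD: re-apply the Git manifests
fi
echo "live=$live"
```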

Drift Detection Mechanisms

How FluxCD Detects Drift:

  1. Periodic Reconciliation: FluxCD compares cluster state to Git state every reconciliation interval
  2. Resource Watching: Kubernetes watch API detects resource changes in real-time
  3. Inventory Tracking: FluxCD maintains inventory of applied resources (GitOps Toolkit)

Drift Detection Example:

# Scenario: Operator manually scales deployment (NOT via Git)
kubectl scale deployment atp-ingestion --replicas=5 -n atp-production

# FluxCD detects drift within 5 minutes (reconciliation interval)
flux get kustomizations

# Output:
# NAME           READY   MESSAGE
# atp-ingestion  False   Spec.Replicas drift detected: Git=3, Live=5

Drift Detection Status:

# Check drift detection status
kubectl describe kustomization atp-ingestion -n flux-system

# Output:
# Status:
#   Conditions:
#     Last Transition Time:  2024-10-30T15:40:00Z
#     Message:               Reconciliation failed: drift detected in Deployment atp-ingestion
#     Observed Generation:   1
#     Reason:                DriftDetected
#     Status:                False
#     Type:                  Ready
#   Drift:
#     Detected:              true
#     Resources:
#       - Kind:               Deployment
#         Name:               atp-ingestion
#         Namespace:          atp-production
#         Drift:              Spec.Replicas: Git=3, Live=5

Self-Healing Configuration

Automatic Drift Correction:

FluxCD automatically reverts manual changes to match Git state:

# Git state (desired): replicas=3
# Cluster state (actual): replicas=5 (manually changed)

# FluxCD reconciliation (automatic):
# 1. Detect drift: replicas=5 ≠ replicas=3
# 2. Apply Git state: kubectl scale deployment atp-ingestion --replicas=3
# 3. Cluster state matches Git state: replicas=3

Self-Healing Workflow:

graph TD
    A[Manual Change<br/>kubectl scale] -->|immediate| B[Cluster State<br/>replicas=5]
    C[Git State<br/>replicas=3] -.->|every 5 min| D[FluxCD<br/>Reconciliation]
    D -->|compare| B
    D -->|drift detected| E[FluxCD<br/>Auto-Correct]
    E -->|apply Git state| B
    B -.->|matches| C

    style E fill:#90EE90
    style D fill:#FFE5B4

Self-Healing Examples:

Example 1: Manual Replica Scaling:

# Operator manually scales deployment
kubectl scale deployment atp-ingestion --replicas=10 -n atp-production

# Within 5 minutes, FluxCD reverts to Git state
kubectl get deployment atp-ingestion -n atp-production

# Output (after reconciliation):
# NAME            READY   UP-TO-DATE   AVAILABLE   AGE
# atp-ingestion   3/3     3            3           5m
# Replicas: 3 (reverted from 10)

Example 2: Manual ConfigMap Update:

# Operator manually edits ConfigMap
kubectl edit configmap atp-ingestion-config -n atp-production
# Change: ASPNETCORE_ENVIRONMENT=Production → Development

# Within 5 minutes, FluxCD reverts to Git state
kubectl get configmap atp-ingestion-config -n atp-production -o yaml

# Output (after reconciliation):
# data:
#   ASPNETCORE_ENVIRONMENT: Production  # Reverted from Development

Example 3: Manual Resource Deletion:

# Operator accidentally deletes deployment
kubectl delete deployment atp-ingestion -n atp-production

# Within 5 minutes, FluxCD recreates deployment from Git
kubectl get deployment atp-ingestion -n atp-production

# Output (after reconciliation):
# NAME            READY   UP-TO-DATE   AVAILABLE   AGE
# atp-ingestion   3/3     3            3           30s  # Recreated

Reconciliation Loop Monitoring

Monitoring Reconciliation Status:

# Check all Kustomizations status
flux get kustomizations

# Output:
# NAME                 READY   MESSAGE   REVISION                          SUSPENDED
# atp-ingestion        True    Applied   production/abc123d                False
# atp-query            True    Applied   production/abc123d                False
# atp-integrity        False   Drift     production/abc123d                False
# atp-export           True    Applied   production/abc123d                False

# Check specific Kustomization
flux get kustomization atp-ingestion

# Output:
# NAME            READY   MESSAGE                                       REVISION                          SUSPENDED
# atp-ingestion   True    Applied revision: production/abc123d          production/abc123d                False

# Watch reconciliation in real-time
flux get kustomizations --watch

# Output (updates every few seconds):
# NAME                 READY   MESSAGE                          REVISION
# atp-integrity        False   Reconciliation in progress...    production/abc123d
# atp-integrity        True    Applied revision: abc123d        production/abc123d
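
A small helper (illustrative, not a FluxCD feature) that filters such output down to Kustomizations whose READY column is not True, e.g. for a cron check or a pipeline gate:

```shell
# Sample output in the shape `flux get kustomizations` prints; in practice
# this would be captured from the command itself.
flux_output='NAME                 READY   MESSAGE   REVISION
atp-ingestion        True    Applied   production/abc123d
atp-integrity        False   Drift     production/abc123d'

# Skip the header row, print names where READY != True.
echo "$flux_output" | awk 'NR > 1 && $2 != "True" { print $1 }'
```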

Azure Monitor Metrics (FluxCD Reconciliation):

# FluxCD exports Prometheus metrics
# Metrics available in Azure Monitor via Prometheus scraping

# Key Metrics:
# - fluxcd_kustomize_reconciliation_duration_seconds  # Time to reconcile
# - fluxcd_kustomize_reconciliation_total              # Total reconciliations
# - fluxcd_kustomize_reconciliation_success_total      # Successful reconciliations
# - fluxcd_kustomize_reconciliation_failure_total      # Failed reconciliations
# - fluxcd_source_git_duration_seconds                 # Git fetch duration

KQL Query for Reconciliation Monitoring:

// Azure Monitor Log Analytics: Query FluxCD reconciliation metrics
PrometheusMetrics_CL
| where Name_s == "fluxcd_kustomize_reconciliation_duration_seconds"
| summarize 
    avg(Value_d) by KustomizationName_s, bin(TimeGenerated, 5m)
| render timechart

Grafana Dashboard (FluxCD Reconciliation):

# Grafana dashboard config
dashboard:
  title: "FluxCD Reconciliation Status"
  panels:
    - title: "Reconciliation Duration"
      query: "fluxcd_kustomize_reconciliation_duration_seconds"
      type: "graph"

    - title: "Reconciliation Success Rate"
      query: "rate(fluxcd_kustomize_reconciliation_success_total[5m]) / rate(fluxcd_kustomize_reconciliation_total[5m])"
      type: "stat"

    - title: "Drift Detection Events"
      query: "fluxcd_kustomize_reconciliation_failure_total{reason='DriftDetected'}"
      type: "graph"

Drift Correction Strategies

Automatic Correction (Default):

FluxCD automatically corrects drift during reconciliation:

# Kustomization with automatic drift correction
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
spec:
  prune: true  # Delete resources removed from Git
  # Automatic correction: Always revert to Git state

Manual Correction (When Needed):

# Option 1: Suspend reconciliation, fix manually, resume
flux suspend kustomization atp-ingestion -n flux-system

# Fix drift manually
kubectl scale deployment atp-ingestion --replicas=3 -n atp-production

# Resume reconciliation
flux resume kustomization atp-ingestion -n flux-system

# Option 2: Update Git to match cluster state (if intentional)
git checkout production
# Update kustomization.yaml to match current cluster state
git commit -m "chore: update replicas to match current state"
git push origin production

Drift Alerting:

# FluxCD Notification for drift detection
apiVersion: notification.toolkit.fluxcd.io/v1
kind: Alert
metadata:
  name: drift-detection-alert
  namespace: flux-system
spec:
  providerRef:
    name: slack
    namespace: flux-system
  eventSeverity: warning
  eventSources:
    - kind: Kustomization
      name: "*"
  filters:
    - key: reason
      value: DriftDetected

Slack Alert Example:

⚠️ GitOps Drift Detected

Kustomization: atp-ingestion
Namespace: flux-system
Reason: DriftDetected
Resource: Deployment/atp-ingestion (atp-production)
Drift: Spec.Replicas: Git=3, Live=5

Action: FluxCD will automatically correct within 5 minutes

Drift Prevention Best Practices:

  1. Enforce Git-only Changes: RBAC prevents direct kubectl access to production
  2. Alert on Manual Changes: Monitor Kubernetes audit logs for manual changes
  3. Regular Drift Audits: Weekly checks for unexpected cluster changes
  4. Documentation: Clear guidelines that all changes must go through Git

Summary: Four Core Principles

  • Principle 1: Declarative: Desired state expressed as declarative YAML (Kubernetes, Helm, Kustomize)
  • Principle 2: Versioned & Immutable: All changes versioned in Git with GPG signatures, SemVer, immutable image tags, permanent audit trail
  • Principle 3: Pulled Automatically: FluxCD inside cluster pulls from Git (GitRepository/Kustomization), polling intervals per environment, optional webhooks
  • Principle 4: Continuously Reconciled: Automatic drift detection, self-healing configuration, reconciliation monitoring, drift correction strategies

Azure Repos Structure & Organization

Purpose: Define the repository strategy, directory structure, branching model, and access control for the ATP GitOps implementation, ensuring consistency, scalability, and compliance across all environments.


Repository Strategy: Hybrid Monorepo/Polyrepo

ATP uses a hybrid approach: polyrepo for service source code (separate repositories per microservice) and monorepo for GitOps manifests (single repository for all Kubernetes configurations).

Monorepo for GitOps Manifests

Repository: atp-gitops (Azure Repos)

Rationale:

| Aspect | Benefit | Impact |
|--------|---------|--------|
| Atomic Updates | Update multiple services in single commit/PR | Ensures consistency across services (e.g., gateway + all microservices) |
| Cross-Service Visibility | See all deployments in one place | Easier to understand dependencies and relationships |
| Shared Configurations | Common base manifests, Helm values, Kustomize bases | DRY principle; reduces duplication |
| Compliance Auditing | Single audit trail for all infrastructure changes | Simpler SOC 2/GDPR audit reports |
| Environment Consistency | Same structure across dev/test/staging/production | Easier to promote configurations between environments |
| RBAC Simplification | One repository to manage permissions | Simpler access control (vs managing 7+ repos) |

Monorepo Structure:

atp-gitops/  # Single GitOps repository (monorepo)
├── clusters/              # Per-environment FluxCD configs
├── infrastructure/        # Cluster-wide infrastructure
├── apps/                  # All ATP microservices
├── platform/              # Platform configs (RBAC, policies)
├── tenants/               # Multi-tenant configurations
├── monitoring/            # Observability stack
└── docs/                  # Documentation and runbooks

Polyrepo for Service Source Code

Repositories: Separate repositories per microservice

  • atp-ingestion (C# source code)
  • atp-query (C# source code)
  • atp-integrity (C# source code)
  • atp-export (C# source code)
  • atp-policy (C# source code)
  • atp-search (C# source code)
  • atp-gateway (C# source code)

Rationale:

| Aspect | Benefit | Impact |
|--------|---------|--------|
| Team Autonomy | Each service team owns their repository | Faster development cycles; reduced merge conflicts |
| Independent CI/CD | Separate build pipelines per service | Parallel builds; faster feedback |
| Service Isolation | Clear ownership boundaries | Easier to onboard new teams; clearer responsibilities |
| Versioning Flexibility | Each service can version independently | Allows different release cadences per service |
| Repository Size | Smaller repositories (faster clones) | Better developer experience; faster CI builds |

Workflow: Source Code → CI → GitOps Repo → FluxCD

Complete Flow:

graph LR
    subgraph "Source Code Repositories (Polyrepo)"
        A1[atp-ingestion<br/>C# source]
        A2[atp-query<br/>C# source]
        A3[atp-integrity<br/>C# source]
    end

    subgraph "CI Stage (Azure Pipelines)"
        B[Azure Pipelines<br/>Build + Test]
        B -->|1. build Docker image| C[Azure Container<br/>Registry]
        B -->|2. update manifest<br/>commit + push| D[atp-gitops<br/>Monorepo]
    end

    A1 -->|git push| B
    A2 -->|git push| B
    A3 -->|git push| B

    subgraph "GitOps Repository (Monorepo)"
        D1[apps/atp-ingestion/<br/>overlays/production]
        D2[apps/atp-query/<br/>overlays/production]
        D3[apps/atp-integrity/<br/>overlays/production]
    end

    D --> D1
    D --> D2
    D --> D3

    subgraph "CD Stage (FluxCD)"
        E[FluxCD Agent<br/>in AKS cluster]
        E -->|git pull<br/>reconcile| F[AKS Cluster<br/>Deployments]
    end

    D -->|3. FluxCD polls Git| E

    style D fill:#FFE5B4
    style E fill:#90EE90
    style F fill:#90EE90

Step-by-Step Workflow:

  1. Developer pushes to service repository:
cd atp-ingestion  # Polyrepo
git add src/ConnectSoft.ATP.Ingestion/Controllers/AuditRecordsController.cs
git commit -m "feat: add gRPC ingestion endpoint"
git push origin feature/grpc-ingestion
  2. CI pipeline triggers (Azure Pipelines):
# azure-pipelines.yml in atp-ingestion repository
- stage: Build_Test_Publish
  jobs:
  - job: BuildAndTest
    steps:
    - task: Docker@2
      inputs:
        containerRegistry: 'ConnectSoft-ACR'
        repository: 'atp/ingestion'
        command: 'buildAndPush'
        tags: |
          $(Build.BuildNumber)
          $(Build.BuildNumber)-$(Build.SourceVersion)
    - task: Bash@3
      displayName: 'Update GitOps Manifest'
      inputs:
        targetType: 'inline'
        script: |
          git clone https://dev.azure.com/ConnectSoft/ATP/_git/atp-gitops
          cd atp-gitops
          # Update image tag in kustomization.yaml
          yq eval '.images[0].newTag = "$(Build.BuildNumber)-$(Build.SourceVersion)"' \
            -i apps/atp-ingestion/overlays/production/kustomization.yaml
          git add apps/atp-ingestion/overlays/production/kustomization.yaml
          git commit -m "chore(ingestion): update to $(Build.BuildNumber)"
          git push origin production
  3. FluxCD detects Git change (within 1-5 minutes):
# FluxCD GitRepository polls Git every 1 minute
# FluxCD Kustomization reconciles every 5 minutes
# Result: New image tag applied to cluster automatically
  4. Deployment complete:
# Verify deployment
kubectl get pods -n atp-production -l app=atp-ingestion
# Output: Pods using new image tag
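
If `yq` is unavailable on the build agent, the manifest update in step 2 can be approximated with `sed`; the file contents and tag values below are illustrative stand-ins:

```shell
# Illustrative kustomization.yaml (real file lives in the atp-gitops repo).
cat > /tmp/kustomization.yaml <<'EOF'
images:
  - name: connectsoft.azurecr.io/atp/ingestion
    newTag: v1.2.3-old0000
EOF

# Patch the image tag in place (GNU sed).
NEW_TAG="v1.2.4-abc123d"
sed -i "s|newTag: .*|newTag: ${NEW_TAG}|" /tmp/kustomization.yaml

# Confirm exactly one line now carries the new tag.
grep -c "newTag: ${NEW_TAG}" /tmp/kustomization.yaml
```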

Benefits of Hybrid Approach:

  • Best of both worlds: Team autonomy (polyrepo) + consistency (monorepo)
  • Clear separation: Source code changes vs infrastructure changes
  • Atomic deployments: Update multiple services in one PR (if needed)
  • Simplified GitOps: One repository to manage permissions and branch policies

Directory Structure

Complete atp-gitops Repository Layout:

atp-gitops/
├── .github/                          # GitHub Actions (if used) or Azure DevOps templates
│   ├── workflows/
│   └── templates/
├── clusters/                         # Per-environment FluxCD bootstrap configs
│   ├── production/
│   │   ├── flux-system/             # FluxCD installation manifests
│   │   │   ├── gitrepository.yaml  # GitRepository pointing to production branch
│   │   │   ├── kustomizations.yaml # Root Kustomization pointing to /infrastructure and /apps
│   │   │   └── notifications.yaml  # Notification configs (Slack, Teams)
│   │   └── README.md
│   │
│   ├── staging/
│   │   ├── flux-system/
│   │   └── README.md
│   │
│   ├── test/
│   │   ├── flux-system/
│   │   └── README.md
│   │
│   └── dev/
│       ├── flux-system/
│       └── README.md
├── infrastructure/                   # Cluster-wide infrastructure (base + overlays)
│   ├── base/                        # Base infrastructure manifests
│   │   ├── namespaces.yaml          # All namespaces (atp-production, atp-staging, etc.)
│   │   ├── resource-quotas.yaml     # Default resource quotas
│   │   ├── network-policies.yaml    # Default network policies
│   │   ├── pod-security-standards.yaml  # Pod Security Admission configs
│   │   ├── azure-policy.yaml        # Azure Policy for Kubernetes
│   │   └── kustomization.yaml
│   │
│   └── overlays/                    # Environment-specific infrastructure
│       ├── production/
│       │   ├── kustomization.yaml
│       │   ├── resource-quotas-patch.yaml  # Production resource quotas
│       │   └── network-policies-patch.yaml # Production network policies
│       ├── staging/
│       ├── test/
│       └── dev/
├── apps/                            # ATP microservices (7 services)
│   ├── atp-ingestion/
│   │   ├── base/                    # Base manifests (reusable)
│   │   │   ├── deployment.yaml
│   │   │   ├── service.yaml
│   │   │   ├── configmap.yaml
│   │   │   ├── ingress.yaml
│   │   │   └── kustomization.yaml
│   │   │
│   │   ├── helm/                    # Helm chart (optional, alternative to base)
│   │   │   ├── Chart.yaml
│   │   │   ├── values.yaml
│   │   │   ├── values-dev.yaml
│   │   │   ├── values-production.yaml
│   │   │   └── templates/
│   │   │       ├── deployment.yaml
│   │   │       ├── service.yaml
│   │   │       └── configmap.yaml
│   │   │
│   │   └── overlays/                # Environment-specific overlays
│   │       ├── dev/
│   │       │   ├── kustomization.yaml
│   │       │   ├── deployment-patch.yaml
│   │       │   └── configmap-patch.yaml
│   │       ├── test/
│   │       ├── staging/
│   │       └── production/
│   │           ├── kustomization.yaml
│   │           ├── deployment-patch.yaml
│   │           ├── hpa-patch.yaml      # Horizontal Pod Autoscaler
│   │           └── configmap-patch.yaml
│   │
│   ├── atp-query/
│   │   ├── base/
│   │   ├── helm/
│   │   └── overlays/
│   │
│   ├── atp-integrity/
│   │   ├── base/
│   │   ├── helm/
│   │   └── overlays/
│   │
│   ├── atp-export/
│   │   ├── base/
│   │   ├── helm/
│   │   └── overlays/
│   │
│   ├── atp-policy/
│   │   ├── base/
│   │   ├── helm/
│   │   └── overlays/
│   │
│   ├── atp-search/
│   │   ├── base/
│   │   ├── helm/
│   │   └── overlays/
│   │
│   └── atp-gateway/
│       ├── base/
│       ├── helm/
│       └── overlays/
├── platform/                        # Platform configurations
│   ├── rbac/                        # Role-Based Access Control
│   │   ├── service-accounts.yaml    # ServiceAccounts for all services
│   │   ├── roles.yaml              # Namespace-scoped Roles
│   │   ├── role-bindings.yaml      # Role Bindings
│   │   └── cluster-roles.yaml      # Cluster-wide Roles
│   │
│   ├── network-policies/            # Network isolation policies
│   │   ├── default-deny.yaml       # Default deny all traffic
│   │   ├── allow-namespace-internal.yaml  # Allow within namespace
│   │   └── allow-cross-namespace.yaml     # Allow specific cross-namespace
│   │
│   ├── pod-security/                # Pod Security Standards
│   │   ├── baseline.yaml           # Baseline profile
│   │   └── restricted.yaml         # Restricted profile (production)
│   │
│   ├── resource-quotas/             # Resource quotas per namespace
│   │   ├── production.yaml
│   │   ├── staging.yaml
│   │   └── dev.yaml
│   │
│   └── azure-policy/                # Azure Policy for Kubernetes
│       ├── pod-security-standards.yaml
│       ├── resource-limits.yaml
│       └── image-registry.yaml
├── tenants/                         # Multi-tenant configurations
│   ├── tenant-acme-corp/
│   │   ├── namespace.yaml
│   │   ├── resource-quota.yaml
│   │   ├── network-policy.yaml
│   │   ├── rbac.yaml
│   │   ├── config.yaml
│   │   └── kustomization.yaml
│   │
│   ├── tenant-widgets-inc/
│   │   ├── namespace.yaml
│   │   ├── resource-quota.yaml
│   │   ├── network-policy.yaml
│   │   ├── rbac.yaml
│   │   ├── config.yaml
│   │   └── kustomization.yaml
│   │
│   └── tenant-global-bank/
│       ├── namespace.yaml
│       ├── resource-quota.yaml
│       ├── network-policy.yaml
│       ├── rbac.yaml
│       ├── config.yaml
│       └── kustomization.yaml
├── monitoring/                      # Observability stack
│   ├── prometheus/                  # Prometheus Operator manifests
│   │   ├── prometheus.yaml
│   │   ├── servicemonitor.yaml
│   │   └── alerting-rules.yaml
│   │
│   ├── grafana/                     # Grafana dashboards
│   │   ├── dashboards/
│   │   └── datasources.yaml
│   │
│   ├── fluent-bit/                  # Log collection
│   │   └── fluent-bit-config.yaml
│   │
│   └── jaeger/                      # Distributed tracing
│       └── jaeger-operator.yaml
├── docs/                            # Documentation and runbooks
│   ├── README.md                    # Repository overview
│   ├── CONTRIBUTING.md              # How to contribute
│   ├── runbooks/
│   │   ├── rollback-procedure.md
│   │   ├── disaster-recovery.md
│   │   └── troubleshooting.md
│   └── architecture/
│       ├── directory-structure.md
│       └── branching-model.md
├── .gitignore                       # Git ignore patterns
├── .pre-commit-hooks.yaml          # Pre-commit hooks (secret detection)
├── LICENSE                          # Repository license
└── README.md                        # Main README

Directory Purpose Reference

| Directory | Purpose | Example Files |
|-----------|---------|---------------|
| /clusters | FluxCD bootstrap configs per environment | gitrepository.yaml, kustomizations.yaml |
| /infrastructure | Cluster-wide infrastructure (namespaces, quotas, policies) | namespaces.yaml, resource-quotas.yaml |
| /apps | ATP microservice deployments | deployment.yaml, service.yaml, configmap.yaml |
| /platform | Platform configurations (RBAC, network policies, security) | service-accounts.yaml, network-policies.yaml |
| /tenants | Multi-tenant configurations | namespace.yaml, resource-quota.yaml |
| /monitoring | Observability stack (Prometheus, Grafana, Fluent Bit) | prometheus.yaml, grafana-dashboards/ |
| /docs | Documentation and operational runbooks | runbooks/rollback-procedure.md |

Naming Conventions

Directory Naming Standards

Pattern: lowercase-with-hyphens

| Directory Type | Naming Pattern | Example |
|----------------|----------------|---------|
| Service directories | atp-{service-name} | atp-ingestion, atp-query |
| Environment overlays | {environment} | production, staging, test, dev |
| Base directories | base | base/ |
| Helm directories | helm | helm/ |
| Tenant directories | tenant-{tenant-id} | tenant-acme-corp, tenant-widgets-inc |

File Naming Patterns

Pattern: kebab-case.yaml or kebab-case-patch.yaml

| File Type | Naming Pattern | Example |
|-----------|----------------|---------|
| Kubernetes manifests | {resource-kind}.yaml | deployment.yaml, service.yaml |
| Kustomization files | kustomization.yaml | kustomization.yaml |
| Strategic merge patches | {resource-kind}-patch.yaml | deployment-patch.yaml, hpa-patch.yaml |
| Helm values | values-{environment}.yaml | values-production.yaml, values-dev.yaml |
| ConfigMaps | {service-name}-config.yaml | atp-ingestion-config.yaml |
| Documentation | kebab-case.md | rollback-procedure.md, disaster-recovery.md |
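
The file-name pattern can be checked mechanically; this is an illustrative lint, not an official tool:

```shell
# Returns success when the name is kebab-case with a .yaml or .md extension.
is_kebab() { echo "$1" | grep -Eq '^[a-z0-9]+(-[a-z0-9]+)*\.(yaml|md)$'; }

is_kebab "deployment-patch.yaml" && echo "deployment-patch.yaml: ok"
is_kebab "DeploymentPatch.yaml"  || echo "DeploymentPatch.yaml: bad"
```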

Resource Naming Conventions (Kubernetes)

Pattern: {service-name} or {service-name}-{suffix}

| Resource Type | Naming Pattern | Example |
|---------------|----------------|---------|
| Deployments | {service-name} | atp-ingestion, atp-query |
| Services | {service-name} | atp-ingestion, atp-query |
| ConfigMaps | {service-name}-config | atp-ingestion-config |
| Secrets | {service-name}-secrets | atp-ingestion-secrets |
| Ingress | {service-name}-ingress | atp-ingestion-ingress |
| ServiceAccounts | {service-name}-sa | atp-ingestion-sa |
| HPA | {service-name}-hpa | atp-ingestion-hpa |
| NetworkPolicy | {service-name}-network-policy | atp-ingestion-network-policy |

Complete Examples for All ATP Services:

# Deployment names
atp-ingestion
atp-query
atp-integrity
atp-export
atp-policy
atp-search
atp-gateway

# Service names
atp-ingestion
atp-query
atp-integrity
atp-export
atp-policy
atp-search
atp-gateway

# ConfigMap names
atp-ingestion-config
atp-query-config
atp-integrity-config
atp-export-config
atp-policy-config
atp-search-config
atp-gateway-config

# ServiceAccount names
atp-ingestion-sa
atp-query-sa
atp-integrity-sa
atp-export-sa
atp-policy-sa
atp-search-sa
atp-gateway-sa

# Namespace names
atp-production      # All production services
atp-staging         # All staging services
atp-test            # All test services
atp-dev             # All dev services
atp-tenant-acme     # Tenant-specific namespace
atp-tenant-widgets  # Tenant-specific namespace

Label Naming:

# Standard labels (applied to all resources)
labels:
  app: atp-ingestion           # Service name
  component: ingestion         # Component name (matches service name)
  tier: backend                # Service tier (backend, frontend, database)
  version: v1.2.3              # Application version
  environment: production      # Environment (production, staging, test, dev)
  managed-by: fluxcd           # Management tool
  compliance: soc2-gdpr-hipaa  # Compliance requirements

Branching Model

ATP uses a GitOps branching model aligned with environment promotion workflow.

Environment Branches

Branch Structure:

main                    # Production (protected, requires approvals)
├── staging             # Staging environment (protected)
│   ├── test            # Test environment (protected)
│   │   └── dev         # Dev environment (unprotected, fast merge)
│   │       └── feature/*  # Feature branches (unprotected)

Branch Details:

| Branch | Environment | Purpose | Protection Level | Merge Strategy |
|--------|-------------|---------|------------------|----------------|
| main (or production) | Production | Live production environment | 🔒 Highest | Squash merge + approvals |
| staging | Staging | Pre-production testing | 🔒 High | Squash merge + approvals |
| test | Test | Integration testing | 🔒 Medium | Squash merge + approvals |
| dev | Development | Developer testing | 🔓 Low | Fast-forward merge |
| feature/* | N/A | Feature development | 🔓 None | Fast-forward merge |
| hotfix/* | Production | Emergency fixes | 🔒 High | Squash merge + expedited approval |

Branch Protection Rules (Azure DevOps):

# Azure DevOps Branch Policy Configuration
# Repos > Branches > {branch-name} > Branch Policies

main (Production):
  ✓ Require pull request (minimum 2 reviewers)
  ✓ Require approval from: Architect + SRE Lead
  ✓ Require status checks: CI pipeline must pass
  ✓ Require signed commits (GPG)
  ✓ Require merge strategy: Squash merge only
  ✓ Require minimum reviewers: 2 (including required reviewers)
  ✓ Require code review: Yes
  ✓ Build validation: CI pipeline
  ✓ Automatic reviewers: Platform Team (suggested)

staging:
  ✓ Require pull request (minimum 1 reviewer)
  ✓ Require approval from: Architect or SRE
  ✓ Require status checks: CI pipeline must pass
  ✓ Require signed commits (GPG)
  ✓ Require merge strategy: Squash merge only
  ✓ Build validation: CI pipeline

test:
  ✓ Require pull request (minimum 1 reviewer)
  ✓ Require approval from: Any DevOps Engineer
  ✓ Require status checks: CI pipeline must pass
  ✓ Require merge strategy: Squash merge preferred
  ✓ Build validation: CI pipeline

dev:
  ✓ No branch protection (fast development)
  ✓ Allow direct push
  ✓ Allow fast-forward merge

feature/*:
  ✓ No branch protection
  ✓ Allow direct push
  ✓ Allow fast-forward merge

Branch Promotion Workflow:

graph LR
    A[feature/my-feature] -->|PR + merge| B[dev]
    B -->|PR + approval| C[test]
    C -->|PR + approval| D[staging]
    D -->|PR + approval| E[main<br/>production]

    F[hotfix/critical-bug] -.->|expedited PR| E

    style E fill:#ffcccc
    style D fill:#FFE5B4
    style C fill:#FFE5B4
    style B fill:#90EE90
    style A fill:#90EE90

Example: Promoting Change from Dev → Production:

# Step 1: Create feature branch from dev
git checkout dev
git pull origin dev
git checkout -b feature/upgrade-ingestion-v1.2.4
# Make changes to apps/atp-ingestion/overlays/dev/kustomization.yaml
git commit -m "feat(ingestion): upgrade to v1.2.4 in dev"
git push origin feature/upgrade-ingestion-v1.2.4

# Step 2: Merge to dev (fast-forward, no approval needed)
git checkout dev
git merge --ff-only feature/upgrade-ingestion-v1.2.4
git push origin dev
# FluxCD automatically deploys to dev cluster

# Step 3: Create PR dev → test
# Azure DevOps: Create Pull Request from dev to test
# Requires: 1 DevOps Engineer approval
# After merge: FluxCD deploys to test cluster

# Step 4: Create PR test → staging
# Azure DevOps: Create Pull Request from test to staging
# Requires: Architect or SRE approval
# After merge: FluxCD deploys to staging cluster

# Step 5: Create PR staging → main (production)
# Azure DevOps: Create Pull Request from staging to main
# Requires: Architect + SRE Lead approval
# Requires: CI pipeline passing + signed commits
# After merge: FluxCD deploys to production cluster

Approval Requirements Matrix

| Branch | Minimum Reviewers | Required Approvers | Status Checks | GPG Signing | Merge Strategy |
|--------|-------------------|--------------------|---------------|-------------|----------------|
| main (production) | 2 | Architect + SRE Lead | ✅ Required | ✅ Required | Squash only |
| staging | 1 | Architect or SRE | ✅ Required | ✅ Required | Squash only |
| test | 1 | Any DevOps Engineer | ✅ Required | ⚠️ Preferred | Squash preferred |
| dev | 0 | None | ⚠️ Optional | ❌ Not required | Fast-forward |
| feature/* | 0 | None | ❌ Not required | ❌ Not required | Fast-forward |
| hotfix/* | 1 | Architect or SRE Lead | ✅ Required | ✅ Required | Squash only |

Versioning and Tagging

Semantic Versioning (SemVer) Strategy

Format: MAJOR.MINOR.PATCH

  • MAJOR: Breaking changes (API incompatibility, schema changes)
  • MINOR: New features (backward-compatible)
  • PATCH: Bug fixes (backward-compatible)

Example Versions:

v1.2.4    # Minor feature release
v1.2.3    # Patch release (bug fix)
v2.0.0    # Major release (breaking changes)
v1.2.4-hotfix1  # Hotfix release
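
When scripting against these tags, GNU `sort -V` (version sort) orders SemVer values correctly, including the leading `v`:

```shell
# Version-sort a few release tags; the highest version comes out last,
# so `tail -n1` would pick the latest release.
printf 'v1.2.4\nv2.0.0\nv1.2.3\n' | sort -V
```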

Service-Specific Tags

Format: {service-name}/v{VERSION}

# Tag ingestion service v1.2.4
git tag -a atp-ingestion/v1.2.4 -m "ATP Ingestion Service v1.2.4"
git push origin atp-ingestion/v1.2.4

# Tag query service v1.3.0
git tag -a atp-query/v1.3.0 -m "ATP Query Service v1.3.0"
git push origin atp-query/v1.3.0
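
A throwaway demonstration of the tag pattern in a scratch repository (temp directory, placeholder identity; nothing is pushed):

```shell
# Create a scratch repo with a single empty commit.
tmp=$(mktemp -d) && cd "$tmp"
git init -q .
git -c user.email=ci@example.com -c user.name=ci \
    commit -q --allow-empty -m "init"

# Annotated service-specific tag, then list tags matching the pattern.
git -c user.email=ci@example.com -c user.name=ci \
    tag -a atp-ingestion/v1.2.4 -m "ATP Ingestion Service v1.2.4"
git tag -l 'atp-ingestion/v*'
```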

Environment-Wide Release Tags

Format: release/v{VERSION} or release/{ENVIRONMENT}/v{VERSION}

# Production release tag
git tag -a release/v1.2.4 -m "Production Release v1.2.4

Services:
- atp-ingestion: v1.2.4
- atp-query: v1.3.0
- atp-integrity: v1.1.5
- atp-export: v1.0.2
- atp-policy: v1.2.0
- atp-search: v1.1.0
- atp-gateway: v1.4.0

Changelog: https://dev.azure.com/ConnectSoft/ATP/_wiki/wikis/ATP.wiki/12345/Release-Notes-v1.2.4"
git push origin release/v1.2.4

# Staging release tag
git tag -a release/staging/v1.2.4-rc1 -m "Staging Release Candidate v1.2.4-rc1"
git push origin release/staging/v1.2.4-rc1

Hotfix Tagging Conventions

Format: hotfix/v{VERSION}-hotfix{N} or {SERVICE}/v{VERSION}-hotfix{N}

# Service-specific hotfix
git tag -a atp-ingestion/v1.2.4-hotfix1 -m "Hotfix: Memory leak in Redis connection pooling"
git push origin atp-ingestion/v1.2.4-hotfix1

# Environment-wide hotfix
git tag -a hotfix/v1.2.4-hotfix1 -m "Production Hotfix v1.2.4-hotfix1

Critical fixes:
- atp-ingestion: Memory leak fix
- atp-gateway: Rate limiting bug fix"
git push origin hotfix/v1.2.4-hotfix1

Image Tagging in ACR

Strategy: Immutable tags combining version + commit SHA

Format: {VERSION}-{COMMIT-SHA}

# Docker image tags (in Azure Container Registry)
connectsoft.azurecr.io/atp/ingestion:v1.2.4              # Semantic version
connectsoft.azurecr.io/atp/ingestion:v1.2.4-abc123d      # Version + commit SHA (immutable)
connectsoft.azurecr.io/atp/ingestion:latest              # Latest (dev only, mutable)
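As a sketch of how the immutable tag is typically composed at build time (the helper and the hard-coded repository path are illustrative; in a pipeline the SHA would come from `git rev-parse --short HEAD`):

```shell
#!/bin/sh
# Hedged sketch: compose the immutable {VERSION}-{COMMIT-SHA} image reference.
acr_image_ref() {
  version="$1"  # e.g. v1.2.4
  sha="$2"      # short commit SHA, e.g. abc123d
  printf 'connectsoft.azurecr.io/atp/ingestion:%s-%s\n' "$version" "$sha"
}

acr_image_ref v1.2.4 abc123d
# → connectsoft.azurecr.io/atp/ingestion:v1.2.4-abc123d
```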

ACR Tagging Rules:

| Tag Pattern | Mutable? | Use Case | Example |
|---|---|---|---|
| v{VERSION} | ❌ Immutable | Production releases | v1.2.4 |
| v{VERSION}-{SHA} | ❌ Immutable | Production releases (traceable) | v1.2.4-abc123d |
| latest | ✅ Mutable | Development only | latest |
| {BRANCH} | ✅ Mutable | Feature branches | feature/grpc-ingestion |

Access Control and RBAC

Azure DevOps Repository Permissions

Permission Levels:

| Permission | Description | Typical Roles |
|---|---|---|
| Reader | Can view repository | Compliance Officers, Auditors |
| Contributor | Can create branches, submit PRs | Developers, DevOps Engineers |
| Contribute | Can push to unprotected branches | Developers (dev branch) |
| Contribute + Pull Request | Can create PRs to protected branches | Developers, DevOps Engineers |
| Admin | Full control (manage permissions, delete repo) | Platform Team Leads |

Permission Matrix:

| Role | Repository Access | Branch Access | Approval Authority |
|---|---|---|---|
| Developer | ✅ Contributor | ✅ Create PRs (dev, test) | ❌ None |
| DevOps Engineer | ✅ Contributor | ✅ Approve PRs (dev, test) | ✅ Dev/Test deployments |
| Architect | ✅ Contributor | ✅ Approve PRs (staging, production) | ✅ Staging/Prod deployments |
| SRE Engineer | ✅ Contributor | ✅ Approve PRs (production) | ✅ Production deployments |
| Security Officer | ✅ Reader | ✅ Read-only (audit) | ❌ None |
| Compliance Officer | ✅ Reader | ✅ Read-only (audit) | ❌ None |
| Platform Team | ✅ Admin | ✅ Full access (all branches) | ✅ All deployments |

Azure AD Group Mappings

Group Structure:

Azure AD Groups:
├── ATP-Platform-Team                    # Platform Team (Admin access)
├── ATP-Developers                       # All developers (Contributor)
├── ATP-DevOps-Engineers                 # DevOps Engineers (Contributor, approve dev/test)
├── ATP-Architects                       # Architects (Contributor, approve staging/prod)
├── ATP-SRE-Engineers                    # SRE Engineers (Contributor, approve production)
├── ATP-Security-Team                    # Security Team (Reader, audit access)
└── ATP-Compliance-Team                  # Compliance Team (Reader, audit access)

Azure DevOps Group Configuration:

# Azure DevOps Project Settings > Permissions > Groups
Groups:
  - name: ATP-Platform-Team
    permissions:
      - Repos: Admin
      - Build: Admin
      - Release: Admin

  - name: ATP-Developers
    permissions:
      - Repos: Contributor
      - Build: User
      - Release: User

  - name: ATP-SRE-Engineers
    permissions:
      - Repos: Contributor
      - Build: User
      - Release: Admin

Principle of Least Privilege Enforcement

Enforcement Mechanisms:

  1. Branch Protection Policies: Prevent direct pushes to protected branches
  2. Required Approvals: Multiple reviewers for production
  3. GPG Signing: All production commits must be signed
  4. Status Checks: CI pipeline must pass before merge
  5. Audit Logging: All access logged in Azure AD audit logs

Access Review Process:

  • Frequency: Quarterly access reviews
  • Owner: Platform Team Lead
  • Review Scope: Repository permissions, branch policies, approval authorities
  • Compliance: SOC 2 CC6.1 (Access Control)

Summary

  • Repository Strategy: Hybrid monorepo (GitOps manifests) + polyrepo (service source code)
  • Directory Structure: 7 main directories (/clusters, /infrastructure, /apps, /platform, /tenants, /monitoring, /docs)
  • Naming Conventions: Kebab-case for directories/files, consistent patterns for Kubernetes resources
  • Branching Model: Environment-based branches (main → staging → test → dev → feature/*) with promotion workflow
  • Versioning: SemVer for services, environment-wide release tags, hotfix conventions
  • Access Control: Azure AD group mappings, branch protection policies, least privilege enforcement

FluxCD Installation & Configuration on AKS

Purpose: Provide a complete guide for installing, configuring, and managing FluxCD on Azure Kubernetes Service (AKS) for the ATP GitOps implementation, including multi-cluster setup, Azure integration, and operational best practices.


FluxCD Architecture Overview

Definition: FluxCD is a GitOps operator for Kubernetes that automatically keeps clusters in sync with Git repositories. It consists of modular controllers that work together to provide continuous reconciliation.

FluxCD Components

Core Controllers:

| Component | Purpose | Namespace | Responsibilities |
|---|---|---|---|
| source-controller | Fetches sources (Git, Helm, OCI) | flux-system | Clones Git repos, fetches Helm charts, caches artifacts |
| kustomize-controller | Applies Kustomize manifests | flux-system | Renders Kustomize, applies to cluster, monitors drift |
| helm-controller | Manages Helm releases | flux-system | Installs/upgrades Helm charts, manages dependencies |
| notification-controller | Sends alerts/notifications | flux-system | Slack, Teams, Discord, webhook notifications |
| image-reflector-controller | Scans image repositories | flux-system | Discovers new image tags, updates image policies |
| image-automation-controller | Updates Git automatically | flux-system | Commits image tag updates back to Git |

Component Architecture:

graph TD
    A[Git Repository<br/>Azure Repos] -->|git pull| B[Source Controller]
    B -->|artifact cache| C[GitRepository<br/>CRD]

    C -->|notify| D[Kustomize Controller]
    C -->|notify| E[Helm Controller]

    D -->|render + apply| F[Kubernetes API<br/>AKS Cluster]
    E -->|install/upgrade| F

    F -.->|watch| D
    F -.->|watch| E
    D -.->|reconcile| F
    E -.->|reconcile| F

    G[Container Registry<br/>ACR] -->|scan tags| H[Image Reflector<br/>Controller]
    H -->|update| I[Image Policy<br/>CRD]
    I -->|trigger| J[Image Automation<br/>Controller]
    J -->|commit| A

    D -->|events| K[Notification<br/>Controller]
    E -->|events| K
    K -->|alerts| L[Slack / Teams /<br/>Azure Monitor]

    style B fill:#90EE90
    style D fill:#90EE90
    style E fill:#90EE90
    style H fill:#90EE90
    style J fill:#90EE90
    style K fill:#FFE5B4

How FluxCD Works: Reconciliation Loop

Reconciliation Process:

  1. Source Fetch (Source Controller):

    • Polls Git repository at configured interval (e.g., every 1 minute)
    • Clones repository and creates artifact (tar.gz)
    • Stores artifact in cluster-local cache
    • Updates GitRepository CRD status
  2. Manifest Rendering (Kustomize/Helm Controller):

    • Reads artifact from Source Controller
    • Renders Kustomize overlays or Helm templates
    • Generates final Kubernetes manifests
  3. State Comparison (Kustomize/Helm Controller):

    • Compares desired state (from Git) with actual state (in cluster)
    • Detects differences (drift detection)
    • Calculates required changes
  4. Apply Changes (Kustomize/Helm Controller):

    • Applies changes via Kubernetes API (kubectl apply)
    • Waits for resources to become ready
    • Updates Kustomization/HelmRelease CRD status
  5. Health Monitoring (Kustomize/Helm Controller):

    • Monitors resource health (Deployment, StatefulSet, etc.)
    • Reports health status in CRD status
    • Triggers notifications on failure
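The failure notifications in step 5 are configured through the notification-controller. A minimal Provider/Alert pair might look like the following (the Teams provider type, secret name, and file path are illustrative assumptions, not ATP's actual configuration):

```yaml
# clusters/production/notifications.yaml (illustrative)
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Provider
metadata:
  name: msteams
  namespace: flux-system
spec:
  type: msteams
  secretRef:
    name: msteams-webhook-url  # secret with an 'address' key; name is illustrative
---
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Alert
metadata:
  name: atp-reconcile-failures
  namespace: flux-system
spec:
  providerRef:
    name: msteams
  eventSeverity: error
  eventSources:
    - kind: Kustomization
      name: '*'
    - kind: HelmRelease
      name: '*'
```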

Reconciliation Loop Diagram:

sequenceDiagram
    participant Git as Git Repository
    participant SC as Source Controller
    participant KC as Kustomize Controller
    participant K8s as Kubernetes API

    loop Every 1 minute (GitRepository interval)
        SC->>Git: git pull
        Git-->>SC: repository contents
        SC->>SC: create artifact (tar.gz)
        SC->>SC: update GitRepository.status
    end

    loop Every 5 minutes (Kustomization interval)
        KC->>SC: fetch artifact
        SC-->>KC: artifact.tar.gz
        KC->>KC: render Kustomize
        KC->>K8s: get current state
        K8s-->>KC: current resources
        KC->>KC: compare desired vs actual

        alt Drift detected
            KC->>K8s: kubectl apply (correct drift)
            K8s-->>KC: resources updated
        end

        KC->>KC: update Kustomization.status
        KC->>KC: check health
    end

FluxCD vs ArgoCD Comparison

Feature Comparison:

| Feature | FluxCD | ArgoCD | ATP Choice |
|---|---|---|---|
| Installation | Lightweight, modular | Single deployment, heavier | FluxCD (simpler) |
| UI | Flux Dashboard (optional) | Rich web UI (default) | ⚠️ ArgoCD (better UX, but ATP uses CLI) |
| GitOps Toolkit | Modular (use only needed controllers) | Monolithic | FluxCD (flexibility) |
| Helm Support | Full support | Full support | ✅ Both |
| Kustomize Support | Native (built-in) | Native (built-in) | ✅ Both |
| Multi-Cluster | Strong (remote kubeConfig) | Strong (ApplicationSets) | ✅ Both |
| Azure Integration | Native (Workload Identity) | Requires setup | FluxCD (better Azure-native support) |
| Learning Curve | Moderate | Steeper (more features) | FluxCD (simpler) |
| CNCF Status | Graduated | Graduated | ✅ Both |
| Community | Active (CNCF) | Very active (CNCF) | ✅ Both |
| Performance | Fast (lightweight) | Good (heavier) | FluxCD (lower overhead) |
| Security | Strong (least privilege) | Strong | ✅ Both |
| Drift Detection | Excellent | Excellent | ✅ Both |

ATP Selection Rationale: ✅ FluxCD Chosen

Reasons:

  1. Azure Native Integration: Better support for Azure AD Workload Identity (zero-trust authentication)
  2. Modular Architecture: Use only needed controllers (source + kustomize), reduce cluster overhead
  3. Simpler Operation: Less complexity, easier troubleshooting
  4. Performance: Lower resource footprint (important for multi-cluster setup)
  5. CLI-First Approach: ATP team prefers CLI/Git workflow over web UI

AKS Cluster Prerequisites

Cluster Requirements

Minimum Requirements:

| Component | Requirement | Rationale |
|---|---|---|
| Kubernetes Version | 1.28+ (1.30+ recommended) | FluxCD v2 requires Kubernetes 1.25+; newer versions provide better API support |
| Cluster SKU | Standard tier (not Free tier) | Standard tier provides a financially backed uptime SLA and a production-scale control plane |
| Node Pool | 2+ nodes, 4+ vCPUs total | FluxCD controllers need resources; redundancy for HA |
| Network Plugin | Azure CNI (recommended) or kubenet | Azure CNI provides better networking for multi-tenant workloads |
| RBAC | Enabled (default) | Required for FluxCD controllers to manage cluster resources |
| Pod Security Standards | Enabled (default in 1.23+) | Required for compliance (SOC 2, GDPR, HIPAA) |

Recommended Configuration:

# Create AKS cluster with recommended settings
az aks create \
  --resource-group ATP-Production-EUS-RG \
  --name atp-prod-eus-aks \
  --kubernetes-version 1.30.0 \
  --node-count 3 \
  --node-vm-size Standard_D4s_v3 \
  --enable-cluster-autoscaler \
  --min-count 3 \
  --max-count 10 \
  --network-plugin azure \
  --network-policy azure \
  --enable-oidc-issuer \
  --enable-workload-identity \
  --enable-managed-identity \
  --enable-addons monitoring,azure-policy \
  --workspace-resource-id /subscriptions/.../resourceGroups/.../providers/Microsoft.OperationalInsights/workspaces/atp-prod-eus-logs \
  --tags environment=production compliance=soc2-gdpr-hipaa

Node Pool Configuration

System Node Pool (for FluxCD and system workloads):

# System node pool (dedicated for system pods)
az aks nodepool add \
  --resource-group ATP-Production-EUS-RG \
  --cluster-name atp-prod-eus-aks \
  --name systempool \
  --node-count 2 \
  --node-vm-size Standard_D4s_v3 \
  --mode System \
  --labels workload=system tier=platform \
  --node-taints CriticalAddonsOnly=true:NoSchedule \
  --enable-cluster-autoscaler \
  --min-count 2 \
  --max-count 4

User Node Pool (for application workloads):

# User node pool (for ATP microservices)
az aks nodepool add \
  --resource-group ATP-Production-EUS-RG \
  --cluster-name atp-prod-eus-aks \
  --name userpool \
  --node-count 3 \
  --node-vm-size Standard_D8s_v3 \
  --mode User \
  --labels workload=application tier=backend \
  --enable-cluster-autoscaler \
  --min-count 3 \
  --max-count 20

Azure Integration Setup

Required Azure Resources:

  1. Azure Container Registry (ACR): For container images
  2. Azure Key Vault: For secrets management
  3. Azure Monitor / Log Analytics: For observability
  4. Azure AD Application: For Workload Identity (optional but recommended)

Prerequisites Checklist:

#!/bin/bash
# prerequisites-check.sh — Verify all prerequisites before FluxCD installation

set -euo pipefail

echo "🔍 Checking AKS cluster prerequisites..."

# Check Kubernetes version
KUBERNETES_VERSION=$(az aks show \
  --resource-group ATP-Production-EUS-RG \
  --name atp-prod-eus-aks \
  --query kubernetesVersion -o tsv)

if [[ $(echo "$KUBERNETES_VERSION 1.28.0" | tr " " "\n" | sort -V | head -n 1) != "1.28.0" ]]; then
  echo "❌ Kubernetes version $KUBERNETES_VERSION < 1.28.0 (minimum required)"
  exit 1
else
  echo "✅ Kubernetes version: $KUBERNETES_VERSION"
fi

# Check OIDC issuer enabled
OIDC_ISSUER=$(az aks show \
  --resource-group ATP-Production-EUS-RG \
  --name atp-prod-eus-aks \
  --query "oidcIssuerProfile.enabled" -o tsv)

if [[ "$OIDC_ISSUER" != "true" ]]; then
  echo "❌ OIDC issuer not enabled (required for Workload Identity)"
  exit 1
else
  echo "✅ OIDC issuer enabled"
fi

# Check Workload Identity enabled
WORKLOAD_IDENTITY=$(az aks show \
  --resource-group ATP-Production-EUS-RG \
  --name atp-prod-eus-aks \
  --query "securityProfile.workloadIdentity.enabled" -o tsv)

if [[ "$WORKLOAD_IDENTITY" != "true" ]]; then
  echo "❌ Workload Identity not enabled (required for Azure AD integration)"
  exit 1
else
  echo "✅ Workload Identity enabled"
fi

# Check kubectl access
if ! kubectl cluster-info > /dev/null 2>&1; then
  echo "❌ kubectl not configured or cluster not accessible"
  exit 1
else
  echo "✅ kubectl configured and cluster accessible"
fi

# Check RBAC enabled
RBAC=$(az aks show \
  --resource-group ATP-Production-EUS-RG \
  --name atp-prod-eus-aks \
  --query "enableRbac" -o tsv)

if [[ "$RBAC" != "true" ]]; then
  echo "❌ RBAC not enabled (required for FluxCD)"
  exit 1
else
  echo "✅ RBAC enabled"
fi

# Check node count
NODE_COUNT=$(az aks nodepool show \
  --resource-group ATP-Production-EUS-RG \
  --cluster-name atp-prod-eus-aks \
  --name systempool \
  --query count -o tsv)

if [[ "$NODE_COUNT" -lt 2 ]]; then
  echo "❌ Node count $NODE_COUNT < 2 (minimum 2 nodes recommended)"
  exit 1
else
  echo "✅ Node count: $NODE_COUNT"
fi

echo "✅ All prerequisites met!"

FluxCD Installation

Installation via Azure CLI (AKS Flux Extension)

Prerequisites: Azure CLI with the k8s-extension extension installed

# Install k8s-extension if not already installed
az extension add --name k8s-extension

# Install the Flux cluster extension (controllers only)
az k8s-extension create \
  --resource-group ATP-Production-EUS-RG \
  --cluster-name atp-prod-eus-aks \
  --cluster-type managedClusters \
  --extension-type microsoft.flux \
  --name flux \
  --auto-upgrade-minor-version true

# Point the extension at the GitOps repository
# (requires the k8s-configuration CLI extension: az extension add --name k8s-configuration)
az k8s-configuration flux create \
  --resource-group ATP-Production-EUS-RG \
  --cluster-name atp-prod-eus-aks \
  --cluster-type managedClusters \
  --name atp-gitops \
  --namespace flux-system \
  --scope cluster \
  --url ssh://git@ssh.dev.azure.com/v3/ConnectSoft/ATP/atp-gitops \
  --branch production \
  --ssh-private-key-file ~/.ssh/azure-devops-flux \
  --kustomization name=flux-system path=./clusters/production prune=true

# Verify installation
az k8s-extension show \
  --resource-group ATP-Production-EUS-RG \
  --cluster-name atp-prod-eus-aks \
  --cluster-type managedClusters \
  --name flux

Installation via kubectl (Flux CLI)

Prerequisites: Flux CLI installed

# Install Flux CLI
curl -s https://fluxcd.io/install.sh | sudo bash

# Verify installation
flux --version

# Install FluxCD components
flux install \
  --namespace=flux-system \
  --components=source-controller,kustomize-controller,helm-controller,notification-controller \
  --export > flux-install.yaml

# Apply to cluster
kubectl apply -f flux-install.yaml

# Wait for FluxCD to be ready
kubectl wait --for=condition=ready pod \
  --all \
  --namespace flux-system \
  --timeout=300s

Installation via Helm

Using Flux Helm Chart:

# Add Flux Helm repository
helm repo add fluxcd https://fluxcd-community.github.io/helm-charts
helm repo update

# Install FluxCD via Helm
helm install flux fluxcd/flux2 \
  --namespace flux-system \
  --create-namespace \
  --set components.source-controller.enabled=true \
  --set components.kustomize-controller.enabled=true \
  --set components.helm-controller.enabled=true \
  --set components.notification-controller.enabled=true \
  --set components.image-reflector-controller.enabled=false \
  --set components.image-automation-controller.enabled=false

# Verify installation
kubectl get pods -n flux-system

Bootstrap FluxCD on AKS

Bootstrap with Azure Repos SSH:

# Generate SSH key for FluxCD (if not exists)
ssh-keygen -t rsa -b 4096 -f ~/.ssh/azure-devops-flux -N ""

# Add public key to Azure DevOps (manual step)
# Azure DevOps > User Settings > SSH Public Keys > Add Key
cat ~/.ssh/azure-devops-flux.pub

# Bootstrap FluxCD pointing to GitOps repository
flux bootstrap git \
  --url=ssh://git@ssh.dev.azure.com/v3/ConnectSoft/ATP/atp-gitops \
  --branch=production \
  --path=clusters/production \
  --private-key-file="$HOME/.ssh/azure-devops-flux" \
  --author-name="Platform Team" \
  --author-email="platform-team@connectsoft.example" \
  --components-extra=image-reflector-controller,image-automation-controller

Bootstrap Output:

► connecting to ssh://git@ssh.dev.azure.com/v3/ConnectSoft/ATP/atp-gitops
► cloning branch "production" from Git repository
► cloned branch "production" from Git repository
✔ components are healthy
✔ git repository "ssh://git@ssh.dev.azure.com/v3/ConnectSoft/ATP/atp-gitops" is ready
► generating sync manifests
✔ sync manifests pushed successfully
► applying sync manifests
✔ sync components are ready
✔ kubectl -n flux-system get gitrepository flux-system
✔ kubectl -n flux-system get kustomization flux-system

Verify Installation

Check FluxCD Components:

# Check all FluxCD pods are running
kubectl get pods -n flux-system

# Expected output:
# NAME                                      READY   STATUS    RESTARTS   AGE
# helm-controller-7d5c8b9f6d-abc12          1/1     Running   0          5m
# kustomize-controller-7d5c8b9f6d-def45     1/1     Running   0          5m
# notification-controller-7d5c8b9f6d-ghi78  1/1     Running   0          5m
# source-controller-7d5c8b9f6d-jkl90        1/1     Running   0          5m

# Check FluxCD CRDs are installed
kubectl get crds | grep fluxcd

# Expected output:
# gitrepositories.source.toolkit.fluxcd.io
# kustomizations.kustomize.toolkit.fluxcd.io
# helmreleases.helm.toolkit.fluxcd.io
# alerts.notification.toolkit.fluxcd.io
# receivers.notification.toolkit.fluxcd.io

# Verify FluxCD CLI can connect
flux check

# Expected output:
# ✔ flux 2.3.0
# ✔ Kubernetes 1.30.0 >= 1.25.0
# ✔ prerequisites are satisfied
# ✔ controllers are healthy

Azure Repos Integration

SSH Key Setup for Git Access

Generate SSH Key:

# Generate SSH key pair for FluxCD
ssh-keygen -t rsa -b 4096 -f ~/.ssh/azure-devops-flux -N "" -C "fluxcd@atp-production"

# Output files:
# ~/.ssh/azure-devops-flux      (private key)
# ~/.ssh/azure-devops-flux.pub  (public key)

Add Public Key to Azure DevOps:

# Display public key (copy this)
cat ~/.ssh/azure-devops-flux.pub

# Manual steps in Azure DevOps Portal:
# 1. Navigate to User Settings > SSH Public Keys
# 2. Click "New Key"
# 3. Paste public key
# 4. Add description: "FluxCD Production AKS Cluster"
# 5. Save

Create Kubernetes Secret:

# Create SSH key secret in flux-system namespace
kubectl create namespace flux-system --dry-run=client -o yaml | kubectl apply -f -

kubectl create secret generic azure-devops-ssh-key \
  --namespace=flux-system \
  --from-file=identity="$HOME/.ssh/azure-devops-flux" \
  --from-literal=known_hosts="$(ssh-keyscan ssh.dev.azure.com 2>/dev/null | grep ssh-rsa)"

# Verify secret created
kubectl get secret azure-devops-ssh-key -n flux-system

Azure DevOps Personal Access Token (PAT)

Alternative to SSH Key:

# Create PAT in Azure DevOps Portal:
# 1. User Settings > Personal Access Tokens > New Token
# 2. Name: "FluxCD Production AKS"
# 3. Organization: All accessible organizations
# 4. Scopes: Code (Read)
# 5. Copy token

# Create PAT secret
kubectl create secret generic azure-devops-pat \
  --namespace=flux-system \
  --from-literal=username=git \
  --from-literal=password=<AZURE_DEVOPS_PAT>

# Use PAT in GitRepository (HTTPS URL)

Azure AD Authentication (Workload Identity)

Recommended Approach (Zero-Trust):

# Create Azure AD Application for FluxCD
az ad app create --display-name "FluxCD-ATP-Production"

# Get application details
APP_ID=$(az ad app list \
  --display-name "FluxCD-ATP-Production" \
  --query "[0].appId" -o tsv)

# Create service principal
az ad sp create --id $APP_ID

# Get AKS OIDC issuer URL
OIDC_ISSUER=$(az aks show \
  --resource-group ATP-Production-EUS-RG \
  --name atp-prod-eus-aks \
  --query "oidcIssuerProfile.issuerUrl" -o tsv)

# Create federated credential for Workload Identity
az ad app federated-credential create \
  --id $APP_ID \
  --parameters '{
    "name": "fluxcd-atp-production",
    "issuer": "'$OIDC_ISSUER'",
    "subject": "system:serviceaccount:flux-system:source-controller",
    "audiences": ["api://AzureADTokenExchange"]
  }'

# Grant Azure DevOps access to application
# Azure DevOps > Project Settings > Service Connections > New Service Connection
# Type: Generic
# Authentication: Workload Identity federation

GitRepository with Workload Identity:

apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: atp-gitops
  namespace: flux-system
spec:
  url: https://dev.azure.com/ConnectSoft/ATP/_git/atp-gitops
  interval: 1m
  ref:
    branch: production
  provider: azure  # Flux v2.4+: authenticate to Azure DevOps via Workload Identity (no secret needed)

Bootstrap Configuration

Bootstrap FluxCD to Point to atp-gitops Repo

Complete Bootstrap Script:

#!/bin/bash
# bootstrap-fluxcd-production.sh — Bootstrap FluxCD on production AKS cluster

set -euo pipefail

RESOURCE_GROUP="ATP-Production-EUS-RG"
CLUSTER_NAME="atp-prod-eus-aks"
GIT_REPO_URL="ssh://git@ssh.dev.azure.com/v3/ConnectSoft/ATP/atp-gitops"
GIT_BRANCH="production"
GIT_PATH="clusters/production"
SSH_KEY_FILE="$HOME/.ssh/azure-devops-flux"  # use $HOME: a quoted ~ would not expand

echo "🚀 Bootstrapping FluxCD on production AKS cluster..."

# Get AKS credentials
az aks get-credentials \
  --resource-group $RESOURCE_GROUP \
  --name $CLUSTER_NAME \
  --overwrite-existing

# Verify cluster access
kubectl cluster-info

# Bootstrap FluxCD
flux bootstrap git \
  --url=$GIT_REPO_URL \
  --branch=$GIT_BRANCH \
  --path=$GIT_PATH \
  --private-key-file=$SSH_KEY_FILE \
  --author-name="Platform Team" \
  --author-email="platform-team@connectsoft.example" \
  --components=source-controller,kustomize-controller,helm-controller,notification-controller

echo "✅ FluxCD bootstrap complete!"

# Verify installation
echo "📋 Verifying FluxCD installation..."
flux check

# Check GitRepository
echo "📋 Checking GitRepository..."
kubectl get gitrepository flux-system -n flux-system

# Check Kustomization
echo "📋 Checking Kustomization..."
kubectl get kustomization flux-system -n flux-system

Configure GitRepository Resource

GitRepository Configuration:

# clusters/production/gitrepository.yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: atp-gitops
  namespace: flux-system
spec:
  interval: 1m  # Poll Git every 1 minute
  url: ssh://git@ssh.dev.azure.com/v3/ConnectSoft/ATP/atp-gitops
  ref:
    branch: production  # Git branch to watch
  secretRef:
    name: azure-devops-ssh-key  # SSH key secret
  ignore: |
    /*.md
    !README.md
    /docs/
  timeout: 60s  # Git clone timeout
  suspend: false  # Set to true to pause reconciliation

Configure Root Kustomization

Root Kustomization (Points to Infrastructure and Apps):

# clusters/production/kustomization.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: flux-system
  namespace: flux-system
spec:
  interval: 5m  # Reconcile every 5 minutes
  path: ./  # Root path in Git repository
  prune: true  # Delete resources removed from Git
  sourceRef:
    kind: GitRepository
    name: atp-gitops
    namespace: flux-system
  timeout: 10m  # Reconciliation timeout
  retryInterval: 2m  # Retry on failure
  healthChecks:
    - apiVersion: apps/v1
      kind: Deployment
      name: source-controller
      namespace: flux-system
    - apiVersion: apps/v1
      kind: Deployment
      name: kustomize-controller
      namespace: flux-system
  suspend: false

Child Kustomizations (Per Application):

# clusters/production/kustomizations/infrastructure.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infrastructure
  namespace: flux-system
spec:
  interval: 5m
  path: ./infrastructure/overlays/production
  prune: true
  sourceRef:
    kind: GitRepository
    name: atp-gitops
  dependsOn:
    - name: flux-system  # Wait for root Kustomization
  suspend: false

---
# clusters/production/kustomizations/apps.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 5m
  path: ./apps
  prune: true
  sourceRef:
    kind: GitRepository
    name: atp-gitops
  dependsOn:
    - name: infrastructure  # Wait for infrastructure first
  suspend: false

Namespace and RBAC Setup

Namespace Creation (via GitOps):

# infrastructure/base/namespaces.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: flux-system
  labels:
    name: flux-system
    managed-by: fluxcd
---
apiVersion: v1
kind: Namespace
metadata:
  name: atp-production
  labels:
    name: atp-production
    environment: production
    managed-by: fluxcd

RBAC for FluxCD (Bootstrap creates automatically):

# FluxCD bootstrap automatically creates these:
# - ServiceAccount: kustomize-controller (flux-system namespace)
# - ClusterRole: cluster-admin (full cluster access)
# - ClusterRoleBinding: kustomize-controller (binds ServiceAccount to ClusterRole)
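The bootstrap defaults above grant cluster-admin. Where tenant workloads need tighter scoping, a Kustomization can impersonate a namespace-scoped ServiceAccount via spec.serviceAccountName; the names and file path below are illustrative, not ATP's actual manifests:

```yaml
# tenants/atp-production/sync.yaml (illustrative)
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: atp-tenant-apps
  namespace: atp-production
spec:
  interval: 5m
  path: ./tenants/atp-production
  prune: true
  serviceAccountName: atp-tenant-reconciler  # namespace-scoped SA bound via Role/RoleBinding
  sourceRef:
    kind: GitRepository
    name: atp-gitops
    namespace: flux-system
```

With this in place, a bad tenant manifest can only affect resources the impersonated ServiceAccount is allowed to touch.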

Multi-Cluster Setup

FluxCD Architecture for Dev, Test, Staging, Production

Multi-Cluster Topology:

graph TD
    subgraph "Azure DevOps"
        A[atp-gitops<br/>Repository]
        A1[dev branch]
        A2[test branch]
        A3[staging branch]
        A4[production branch]
    end

    subgraph "Dev Environment"
        B1[AKS Dev Cluster]
        B2[FluxCD<br/>flux-system]
        B2 -->|git pull| A1
    end

    subgraph "Test Environment"
        C1[AKS Test Cluster]
        C2[FluxCD<br/>flux-system]
        C2 -->|git pull| A2
    end

    subgraph "Staging Environment"
        D1[AKS Staging Cluster]
        D2[FluxCD<br/>flux-system]
        D2 -->|git pull| A3
    end

    subgraph "Production Environment"
        E1[AKS Production EUS]
        E2[FluxCD<br/>flux-system]
        E2 -->|git pull| A4
        E3[AKS Production WUS]
        E4[FluxCD<br/>flux-system]
        E4 -->|git pull| A4
    end

    A --> A1
    A --> A2
    A --> A3
    A --> A4

    style E1 fill:#ffcccc
    style E3 fill:#ffcccc

Cluster Configuration Matrix:

| Environment | Cluster Name | Git Branch | FluxCD Namespace | Reconcile Interval |
|---|---|---|---|---|
| Dev | atp-dev-eus-aks | dev | flux-system | 1 minute |
| Test | atp-test-eus-aks | test | flux-system | 2 minutes |
| Staging | atp-staging-eus-aks | staging | flux-system | 5 minutes |
| Production EUS | atp-prod-eus-aks | production | flux-system | 5 minutes |
| Production WUS | atp-prod-wus-aks | production | flux-system | 5 minutes |
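The matrix maps one-to-one onto bootstrap invocations. The dry-run sketch below only prints abbreviated commands (resource groups, SSH key, and repository URL omitted); it is illustrative, not part of the documented tooling:

```shell
#!/bin/sh
# Dry-run sketch: print one abbreviated bootstrap command per environment
# from the matrix above (nothing is executed against Azure).
render_bootstrap() {
  cluster="$1"; branch="$2"; path="$3"
  printf 'az aks get-credentials --name %s && flux bootstrap git --branch=%s --path=%s\n' \
    "$cluster" "$branch" "$path"
}

render_bootstrap atp-dev-eus-aks     dev        clusters/dev
render_bootstrap atp-test-eus-aks    test       clusters/test
render_bootstrap atp-staging-eus-aks staging    clusters/staging
render_bootstrap atp-prod-eus-aks    production clusters/production
render_bootstrap atp-prod-wus-aks    production clusters/production
```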

Cluster-Specific Configurations

Per-Cluster GitRepository:

# clusters/production/gitrepository.yaml (Production EUS)
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: atp-gitops
  namespace: flux-system
spec:
  url: ssh://git@ssh.dev.azure.com/v3/ConnectSoft/ATP/atp-gitops
  ref:
    branch: production
  secretRef:
    name: azure-devops-ssh-key
# clusters/dev/gitrepository.yaml (Dev)
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: atp-gitops
  namespace: flux-system
spec:
  url: ssh://git@ssh.dev.azure.com/v3/ConnectSoft/ATP/atp-gitops
  ref:
    branch: dev  # Dev branch
  secretRef:
    name: azure-devops-ssh-key
  interval: 30s  # Faster polling for dev

Cross-Cluster Orchestration

Hub-and-Spoke Management (Optional, for large-scale):

# FluxCD has no built-in fleet product. For large-scale setups, a common
# pattern is a hub (management) cluster whose Flux instance reconciles
# workloads onto spoke clusters via the Kustomization spec.kubeConfig
# field, or Azure Arc-enabled GitOps across many clusters.
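For a central cluster reconciling a remote cluster, the Kustomization kubeConfig field points at a Secret holding the target cluster's kubeconfig; the names and file path below are illustrative:

```yaml
# clusters/management/apps-remote-wus.yaml (illustrative)
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-remote-wus
  namespace: flux-system
spec:
  interval: 5m
  path: ./apps/overlays/production-wus
  prune: true
  sourceRef:
    kind: GitRepository
    name: atp-gitops
  kubeConfig:
    secretRef:
      name: prod-wus-kubeconfig  # secret with a 'value' key containing the kubeconfig
```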

Regional Deployment Pattern:

# clusters/production-eus/kustomization.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-eus
  namespace: flux-system
spec:
  path: ./apps/overlays/production-eus  # EUS-specific overlay
  sourceRef:
    kind: GitRepository
    name: atp-gitops

---
# clusters/production-wus/kustomization.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-wus
  namespace: flux-system
spec:
  path: ./apps/overlays/production-wus  # WUS-specific overlay
  sourceRef:
    kind: GitRepository
    name: atp-gitops

Workload Identity Configuration

Azure AD Workload Identity for FluxCD

Create Azure AD Application:

# Create Azure AD application
az ad app create --display-name "FluxCD-ATP-Production"

APP_ID=$(az ad app list \
  --display-name "FluxCD-ATP-Production" \
  --query "[0].appId" -o tsv)

echo "Application ID: $APP_ID"

# Create service principal
SP_ID=$(az ad sp create --id $APP_ID --query id -o tsv)

# Grant permissions (example: Azure DevOps repository read access)
# --id is the Git Repositories security namespace id, --subject is the
# identity's descriptor or UPN, --token identifies the repository scope
az devops security permission update \
  --id 2e9eb7ed-3c0a-47d4-87c1-0ffdd275fd87 \
  --subject "$SP_ID" \
  --token "repoV2/<project-id>/<repo-id>" \
  --allow-bit 2 \
  --deny-bit 0

Service Principal Setup

Configure Service Principal Permissions:

# Grant Azure DevOps repository read permission
# (--id is the Git Repositories security namespace; --token scopes the grant
#  to a repository, in the form repoV2/<project-id>/<repo-id>)
az devops security permission update \
  --id 2e9eb7ed-3c0a-47d4-87c1-0ffdd275fd87 \
  --subject "$SP_ID" \
  --token "repoV2/<project-id>/<repo-id>" \
  --allow-bit 2 \
  --deny-bit 0

# Grant Azure Container Registry pull permission
az role assignment create \
  --assignee $APP_ID \
  --role AcrPull \
  --scope /subscriptions/.../resourceGroups/.../providers/Microsoft.ContainerRegistry/registries/atp-prod-acr

Federated Credentials Configuration

Configure Federated Credential:

# Get AKS OIDC issuer URL
OIDC_ISSUER=$(az aks show \
  --resource-group ATP-Production-EUS-RG \
  --name atp-prod-eus-aks \
  --query "oidcIssuerProfile.issuerUrl" -o tsv)

# Create federated credential for Source Controller
az ad app federated-credential create \
  --id $APP_ID \
  --parameters '{
    "name": "fluxcd-source-controller",
    "issuer": "'$OIDC_ISSUER'",
    "subject": "system:serviceaccount:flux-system:source-controller",
    "audiences": ["api://AzureADTokenExchange"]
  }'

# Create federated credential for Kustomize Controller
az ad app federated-credential create \
  --id $APP_ID \
  --parameters '{
    "name": "fluxcd-kustomize-controller",
    "issuer": "'$OIDC_ISSUER'",
    "subject": "system:serviceaccount:flux-system:kustomize-controller",
    "audiences": ["api://AzureADTokenExchange"]
  }'

ServiceAccount Configuration

Annotate ServiceAccounts:

# FluxCD ServiceAccount with Workload Identity
apiVersion: v1
kind: ServiceAccount
metadata:
  name: source-controller
  namespace: flux-system
  annotations:
    azure.workload.identity/client-id: "12345678-1234-1234-1234-123456789abc"
    azure.workload.identity/tenant-id: "87654321-4321-4321-4321-987654321abc"

RBAC for Azure Resources:

# Grant Key Vault Secrets User role
az role assignment create \
  --assignee $APP_ID \
  --role "Key Vault Secrets User" \
  --scope /subscriptions/.../resourceGroups/.../providers/Microsoft.KeyVault/vaults/atp-prod-kv

FluxCD Version Management

Upgrade Procedures

Check Current Version:

# Check installed FluxCD CLI version
flux --version

# Output:
# flux version 2.3.0

# Check the in-cluster controller versions (controllers are versioned
# separately from the CLI; e.g. Flux 2.3.0 ships source-controller v1.3.0)
kubectl get deployment source-controller -n flux-system -o jsonpath='{.spec.template.spec.containers[0].image}'

# Output:
# ghcr.io/fluxcd/source-controller:v1.3.0

Upgrade FluxCD:

# There is no dedicated `flux upgrade` command: upgrade the components by
# re-running `flux install` with a newer CLI (or re-run `flux bootstrap` on
# bootstrapped clusters so the manifests committed to Git are updated too)
flux install \
  --version=v2.4.0 \
  --namespace=flux-system \
  --components=source-controller,kustomize-controller,helm-controller,notification-controller \
  --export > flux-install-v2.4.0.yaml

kubectl apply -f flux-install-v2.4.0.yaml

# Wait for upgrade to complete
kubectl wait --for=condition=ready pod \
  --all \
  --namespace flux-system \
  --timeout=300s

Rollback Strategies

Rollback to Previous Version:

# Identify previous version
PREVIOUS_VERSION="v2.2.0"

# Apply previous version manifests
flux install \
  --version=$PREVIOUS_VERSION \
  --namespace=flux-system \
  --components=source-controller,kustomize-controller,helm-controller,notification-controller \
  --export | kubectl apply -f -

# Verify rollback
flux version
kubectl get pods -n flux-system

Version Compatibility Matrix

| FluxCD Version | Kubernetes Minimum | Kubernetes Recommended | Breaking Changes |
|----------------|--------------------|------------------------|------------------|
| 2.4.0 | 1.25+ | 1.28+ | None (from 2.3.x) |
| 2.3.0 | 1.25+ | 1.28+ | None (from 2.2.x) |
| 2.2.0 | 1.24+ | 1.27+ | API v1beta1 deprecated |
| 2.1.0 | 1.24+ | 1.27+ | None (from 2.0.x) |
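
An upgrade pipeline can gate on this matrix before applying new manifests. Below is a minimal pre-flight sketch in pure shell; the one-minor-version-at-a-time policy is a local convention assumed for illustration, not a FluxCD requirement:

```shell
# Pre-flight guard: allow FluxCD upgrades of at most one minor version per step.
# Assumes both versions share the same major version (local policy, not a Flux rule).
minor_of() {
  # Extract the minor component from a SemVer string like "2.3.0".
  echo "$1" | cut -d. -f2
}

upgrade_allowed() {
  local current="$1" target="$2"
  local cur_minor tgt_minor
  cur_minor=$(minor_of "$current")
  tgt_minor=$(minor_of "$target")
  # Permit same-minor patch bumps or a single minor-version step forward.
  [ $((tgt_minor - cur_minor)) -ge 0 ] && [ $((tgt_minor - cur_minor)) -le 1 ]
}

upgrade_allowed "2.2.0" "2.3.1" && echo "ok"       # one minor step: allowed
upgrade_allowed "2.2.0" "2.4.0" || echo "blocked"  # two minor steps: blocked
```

A step like this fails fast in CI instead of discovering a skipped-minor upgrade problem during reconciliation.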

Release Notes and Breaking Changes

Monitor FluxCD Releases:

# Verify cluster prerequisites and controller health before and after upgrading
flux check --components-extra=image-reflector-controller,image-automation-controller

# Review release notes for breaking changes:
# https://github.com/fluxcd/flux2/releases

Azure Monitor Integration

FluxCD Metrics Export to Prometheus

Enable Prometheus Metrics:

# FluxCD controllers expose Prometheus metrics on port 8080 (port name: http-prom).
# Flux does not create Services for its controllers, so scrape the pods directly
# with a PodMonitor (pattern from the upstream flux2-monitoring-example).
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: flux-system
  namespace: flux-system
spec:
  namespaceSelector:
    matchNames:
      - flux-system
  selector:
    matchExpressions:
      - key: app
        operator: In
        values:
          - source-controller
          - kustomize-controller
          - helm-controller
          - notification-controller
  podMetricsEndpoints:
    - port: http-prom  # Port name for Prometheus metrics
      interval: 30s
      path: /metrics

Log Forwarding to Log Analytics

Configure Container Insights:

# Enable Azure Monitor Container Insights (if not already enabled)
az aks enable-addons \
  --resource-group ATP-Production-EUS-RG \
  --name atp-prod-eus-aks \
  --addons monitoring \
  --workspace-resource-id /subscriptions/.../resourceGroups/.../providers/Microsoft.OperationalInsights/workspaces/atp-prod-eus-logs

KQL Query for FluxCD Logs:

// Azure Monitor Log Analytics: query FluxCD controller logs
// (ContainerLogV2 schema; the legacy ContainerLog table has no pod/namespace columns)
ContainerLogV2
| where PodNamespace == "flux-system"
| where PodName contains "source-controller" or PodName contains "kustomize-controller"
| where LogMessage contains "error" or LogMessage contains "warning"
| project TimeGenerated, PodName, LogMessage
| order by TimeGenerated desc

Custom Metrics and Alerts

Custom Metrics Dashboard:

# Grafana dashboard for FluxCD metrics
# (Flux exposes gotk_* and controller-runtime metrics; the names below follow
# the flux2-monitoring-example conventions)
dashboard:
  title: "FluxCD Reconciliation Metrics"
  panels:
    - title: "Reconciliation Duration"
      query: 'gotk_reconcile_duration_seconds{kind="Kustomization"}'

    - title: "Git Fetch Duration"
      query: 'gotk_reconcile_duration_seconds{kind="GitRepository"}'

    - title: "Reconciliation Success Rate"
      query: 'rate(controller_runtime_reconcile_total{result="success"}[5m]) / rate(controller_runtime_reconcile_total[5m])'

Azure Monitor Alert:

# Illustrative alert definition kept alongside the manifests. Azure Monitor
# does not read alert rules from ConfigMaps; create the actual rule from this
# definition via `az monitor scheduled-query create`, ARM, or Bicep.
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluxcd-alert
  namespace: flux-system
data:
  alert-rule.yaml: |
    alert:
      name: FluxCD Reconciliation Failure
      condition: |
        gotk_reconcile_condition{kind="Kustomization", type="Ready", status="False"} == 1
      severity: warning
      actionGroups:
        - /subscriptions/.../resourceGroups/.../providers/microsoft.insights/actionGroups/fluxcd-alerts

Dashboard Setup in Azure Monitor

Create FluxCD Dashboard:

# Export FluxCD metrics to Azure Monitor
# Metrics available via Prometheus scraping or Container Insights

# Key metrics to monitor (gotk_* names per the flux2-monitoring-example):
# - gotk_reconcile_duration_seconds           (reconcile latency, labeled by kind)
# - gotk_reconcile_condition                  (Ready/Stalled condition per resource)
# - controller_runtime_reconcile_total        (reconcile counts, with result label)
# - controller_runtime_reconcile_errors_total (reconcile error counts)

Dashboard JSON (Azure Monitor; illustrative, since Prometheus-sourced metric names surface only when collected via managed Prometheus or Container Insights):

{
  "dashboard": {
    "name": "FluxCD Reconciliation Dashboard",
    "widgets": [
      {
        "type": "metric",
        "properties": {
          "metrics": [
            {
              "namespace": "Microsoft.ContainerService/managedClusters",
              "name": "fluxcd_kustomize_reconciliation_duration_seconds",
              "aggregation": "Average"
            }
          ],
          "title": "Reconciliation Duration"
        }
      }
    ]
  }
}

Summary: FluxCD Installation & Configuration

  • FluxCD Architecture: Modular controllers (source, kustomize, helm, notification) with reconciliation loop
  • AKS Prerequisites: Kubernetes 1.25+ (1.28+ recommended), OIDC issuer, Workload Identity, RBAC enabled
  • Installation: Multiple methods (Azure CLI, kubectl, Helm), bootstrap to GitOps repository
  • Azure Repos Integration: SSH keys, PAT, or Workload Identity authentication
  • Multi-Cluster Setup: Branch-per-environment, cluster-specific configurations, regional deployment patterns
  • Workload Identity: Azure AD federated credentials for zero-trust authentication
  • Version Management: Upgrade procedures, rollback strategies, compatibility matrix
  • Azure Monitor Integration: Prometheus metrics, Log Analytics forwarding, custom alerts and dashboards

Declarative Manifest Management

Purpose: Define how ATP microservices are declared, organized, and managed using Kubernetes manifests, Helm charts, and Kustomize overlays, ensuring consistency, reusability, and environment-specific customization across all deployment environments.


Base Manifest Structure

ATP microservices use declarative Kubernetes manifests stored in Git as the single source of truth. This section covers the standard resource types and structures used for all ATP services.

Kubernetes Resource Types for ATP Services

Required Resources per Service:

| Resource Type | Purpose | Example Name | Required? |
|---------------|---------|--------------|-----------|
| Deployment | Manages pod replicas | atp-ingestion | ✅ Yes |
| Service | Exposes pods via network | atp-ingestion | ✅ Yes |
| ConfigMap | Non-sensitive configuration | atp-ingestion-config | ✅ Yes |
| Secret | Sensitive data (references) | atp-ingestion-secrets | ⚠️ Via External Secrets |
| Ingress | External HTTP/gRPC access | atp-ingestion-ingress | ⚠️ If exposed externally |
| ServiceAccount | Pod identity and RBAC | atp-ingestion-sa | ✅ Yes |
| HorizontalPodAutoscaler | Auto-scaling | atp-ingestion-hpa | ⚠️ Production only |
| NetworkPolicy | Traffic isolation | atp-ingestion-network-policy | ✅ Yes |
| PodDisruptionBudget | High availability | atp-ingestion-pdb | ⚠️ Production only |
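
The example names in the table follow a single `<service>-<suffix>` convention, which keeps resources greppable across namespaces. A sketch of a name generator that a scaffolding script might use (the suffix list mirrors the table; the helper itself is illustrative, not part of any ATP tooling):

```shell
# Derive per-resource names from a service name, mirroring the table above.
# Purely illustrative helper; ATP ships no official scaffolding CLI (assumption).
resource_name() {
  local service="$1" kind="$2"
  case "$kind" in
    deployment|service) echo "$service" ;;              # bare service name
    configmap)          echo "${service}-config" ;;
    secret)             echo "${service}-secrets" ;;
    ingress)            echo "${service}-ingress" ;;
    serviceaccount)     echo "${service}-sa" ;;
    hpa)                echo "${service}-hpa" ;;
    networkpolicy)      echo "${service}-network-policy" ;;
    pdb)                echo "${service}-pdb" ;;
  esac
}

resource_name atp-ingestion configmap   # atp-ingestion-config
```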

Deployment Manifest Structure

Complete Deployment Example (ATP Ingestion Service):

# apps/atp-ingestion/base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
  namespace: atp-production
  labels:
    app: atp-ingestion
    component: ingestion
    tier: backend
    version: v1.2.3
    managed-by: fluxcd
    compliance: soc2-gdpr-hipaa
spec:
  replicas: 3
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: atp-ingestion
  template:
    metadata:
      labels:
        app: atp-ingestion
        component: ingestion
        tier: backend
        version: v1.2.3
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
        checksum/config: "abc123def456"  # Trigger restart on ConfigMap change
        checksum/secret: "def456ghi789"  # Trigger restart on Secret change
    spec:
      serviceAccountName: atp-ingestion-sa

      # Pod Security Standards (Restricted)
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 2000
        seccompProfile:
          type: RuntimeDefault

      containers:
      - name: ingestion
        image: connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d
        imagePullPolicy: IfNotPresent

        # Container Security Context
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          runAsNonRoot: true
          runAsUser: 1000
          capabilities:
            drop: [ALL]

        # Resource Requests and Limits
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
          limits:
            cpu: 1000m
            memory: 1Gi

        # Environment Variables
        env:
        - name: ASPNETCORE_ENVIRONMENT
          value: Production
        - name: ASPNETCORE_URLS
          value: "http://+:8080"
        - name: OpenTelemetry__ServiceName
          value: atp-ingestion
        - name: DOTNET_RUNNING_IN_CONTAINER
          value: "true"

        # Environment Variables from ConfigMap
        envFrom:
        - configMapRef:
            name: atp-ingestion-config
        - secretRef:
            name: atp-ingestion-secrets

        # Ports
        ports:
        - name: http
          containerPort: 8080
          protocol: TCP
        - name: metrics
          containerPort: 9090
          protocol: TCP
        - name: grpc
          containerPort: 50051
          protocol: TCP

        # Health Checks
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          successThreshold: 1
          failureThreshold: 3

        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 3
          successThreshold: 1
          failureThreshold: 3

        startupProbe:
          httpGet:
            path: /health/startup
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 0
          periodSeconds: 5
          timeoutSeconds: 3
          successThreshold: 1
          failureThreshold: 30  # Allow up to 150 seconds for startup

        # Volume Mounts (read-only root filesystem requires writable volumes)
        volumeMounts:
        - name: tmp
          mountPath: /tmp
        - name: cache
          mountPath: /app/cache
        - name: logs
          mountPath: /app/logs

      # Volumes
      volumes:
      - name: tmp
        emptyDir: {}
      - name: cache
        emptyDir: {}
      - name: logs
        emptyDir: {}

      # Image Pull Secrets (for ACR authentication)
      imagePullSecrets:
      - name: acr-credentials

      # Termination Grace Period
      terminationGracePeriodSeconds: 30

      # DNS Policy
      dnsPolicy: ClusterFirst

      # Restart Policy
      restartPolicy: Always
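
The `checksum/config` and `checksum/secret` annotations above exist so that a ConfigMap or Secret change produces a new pod template hash and triggers a rollout. A pipeline can compute the value by hashing the rendered manifest; a minimal sketch (the file path is illustrative):

```shell
# Compute a checksum annotation value from a rendered ConfigMap manifest.
# Any stable hash works; sha256 matches the Helm checksum pattern.
config_checksum() {
  sha256sum "$1" | cut -d' ' -f1
}

# Illustrative usage with a temporary file standing in for configmap.yaml:
tmp=$(mktemp)
printf 'ASPNETCORE_ENVIRONMENT: "Production"\n' > "$tmp"
echo "checksum/config: $(config_checksum "$tmp")"
rm -f "$tmp"
```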

Service Manifest Structure

Service Example (ATP Ingestion Service):

# apps/atp-ingestion/base/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: atp-ingestion
  namespace: atp-production
  labels:
    app: atp-ingestion
    component: ingestion
    managed-by: fluxcd
spec:
  type: ClusterIP  # Internal service (use LoadBalancer for external)
  ports:
  - name: http
    port: 80
    targetPort: 8080
    protocol: TCP
  - name: metrics
    port: 9090
    targetPort: 9090
    protocol: TCP
  - name: grpc
    port: 50051
    targetPort: 50051
    protocol: TCP
  selector:
    app: atp-ingestion
  sessionAffinity: None  # No sticky sessions; use ClientIP (with sessionAffinityConfig) if stickiness is required

ConfigMap and Secret References

ConfigMap Example:

# apps/atp-ingestion/base/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: atp-ingestion-config
  namespace: atp-production
  labels:
    app: atp-ingestion
    managed-by: fluxcd
data:
  # Application Settings
  ASPNETCORE_ENVIRONMENT: "Production"
  ASPNETCORE_URLS: "http://+:8080"

  # OpenTelemetry Configuration
  OpenTelemetry__ServiceName: "atp-ingestion"
  OpenTelemetry__SamplingRatio: "0.1"
  OpenTelemetry__Exporters__Otlp__Endpoint: "http://otel-collector.observability:4317"

  # Audit Trail Configuration
  Audit__EnableImmutability: "true"
  Audit__RetentionDays: "2555"
  Audit__EnableTamperEvidence: "true"

  # Feature Flags
  Features__EnableAdvancedQuery: "true"
  Features__EnableRealTimeAlerts: "true"

  # Performance Settings
  Performance__MaxConcurrentRequests: "1000"
  Performance__RequestTimeoutSeconds: "30"

Secret Reference (External Secrets Operator):

# apps/atp-ingestion/base/externalsecret.yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: atp-ingestion-secrets
  namespace: atp-production
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: azure-keyvault-production
    kind: ClusterSecretStore
  target:
    name: atp-ingestion-secrets
    creationPolicy: Owner
  data:
  - secretKey: ConnectionStrings__Database
    remoteRef:
      key: atp-sql-connection-string-prod
  - secretKey: ConnectionStrings__Redis
    remoteRef:
      key: atp-redis-connection-string-prod
  - secretKey: ConnectionStrings__RabbitMQ
    remoteRef:
      key: atp-rabbitmq-connection-string-prod
  - secretKey: ApiKeys__IngestionApiKey
    remoteRef:
      key: atp-ingestion-api-key-prod

Ingress Configuration

Ingress Example (External Access):

# apps/atp-ingestion/base/ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: atp-ingestion-ingress
  namespace: atp-production
  labels:
    app: atp-ingestion
    managed-by: fluxcd
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
    nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
    nginx.ingress.kubernetes.io/limit-rps: "1000"
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - ingestion.atp.connectsoft.example
    secretName: atp-ingestion-tls
  rules:
  - host: ingestion.atp.connectsoft.example
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: atp-ingestion
            port:
              number: 80

ServiceAccount and RBAC

ServiceAccount Example:

# apps/atp-ingestion/base/serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: atp-ingestion-sa
  namespace: atp-production
  labels:
    app: atp-ingestion
    managed-by: fluxcd
  annotations:
    azure.workload.identity/client-id: "12345678-1234-1234-1234-123456789abc"
    azure.workload.identity/tenant-id: "87654321-4321-4321-4321-987654321abc"
automountServiceAccountToken: true

RBAC Role and RoleBinding:

# apps/atp-ingestion/base/rbac.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: atp-ingestion-role
  namespace: atp-production
rules:
- apiGroups: [""]
  resources: ["configmaps", "secrets"]
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: atp-ingestion-rolebinding
  namespace: atp-production
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: atp-ingestion-role
subjects:
- kind: ServiceAccount
  name: atp-ingestion-sa
  namespace: atp-production

Helm Charts for ATP Microservices

Helm charts provide parameterized, reusable templates for ATP microservices, enabling environment-specific customization via values files.

Chart Structure: Chart.yaml, values.yaml, templates/

Directory Structure:

apps/atp-ingestion/helm/
├── Chart.yaml                    # Chart metadata
├── values.yaml                   # Default values
├── values-dev.yaml              # Dev environment overrides
├── values-test.yaml             # Test environment overrides
├── values-staging.yaml          # Staging environment overrides
├── values-production.yaml       # Production environment overrides
├── templates/
│   ├── deployment.yaml          # Deployment template
│   ├── service.yaml             # Service template
│   ├── configmap.yaml           # ConfigMap template
│   ├── ingress.yaml             # Ingress template (optional)
│   ├── serviceaccount.yaml      # ServiceAccount template
│   ├── rbac.yaml                # RBAC template
│   ├── hpa.yaml                 # HPA template (conditional)
│   ├── networkpolicy.yaml       # NetworkPolicy template
│   └── _helpers.tpl             # Template helpers
└── charts/                      # Chart dependencies (optional)

Chart.yaml:

# apps/atp-ingestion/helm/Chart.yaml
apiVersion: v2
name: atp-ingestion
description: ATP Ingestion Service - Receives audit records via HTTP/gRPC
version: 1.2.3  # Chart version (SemVer)
appVersion: 1.2.3  # Application version
type: application

keywords:
  - audit-trail
  - ingestion
  - microservice
  - connectsoft

maintainers:
  - name: ConnectSoft Platform Team
    email: platform-team@connectsoft.example

dependencies:
  - name: redis
    version: 17.x.x
    repository: https://charts.bitnami.com/bitnami
    condition: redis.enabled
    tags:
      - atp-ingestion-redis

home: https://connectsoft.example/atp
sources:
  - https://dev.azure.com/ConnectSoft/ATP/_git/atp-ingestion

annotations:
  category: Backend
  licenses: Proprietary

values.yaml (Default):

# apps/atp-ingestion/helm/values.yaml
# Default values for atp-ingestion chart

# Replica configuration
replicaCount: 3

# Image configuration
image:
  repository: connectsoft.azurecr.io/atp/ingestion
  pullPolicy: IfNotPresent
  tag: ""  # Overridden by CI pipeline or .Values.appVersion

# Image pull secrets
imagePullSecrets:
  - name: acr-credentials

# Service account configuration
serviceAccount:
  create: true
  annotations:
    azure.workload.identity/client-id: ""
  name: atp-ingestion-sa
  automountServiceAccountToken: true

# Pod annotations
podAnnotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "8080"
  prometheus.io/path: "/metrics"

# Pod security context
podSecurityContext:
  runAsNonRoot: true
  runAsUser: 1000
  fsGroup: 2000
  seccompProfile:
    type: RuntimeDefault

# Container security context
securityContext:
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  runAsNonRoot: true
  runAsUser: 1000
  capabilities:
    drop: [ALL]

# Service configuration
service:
  type: ClusterIP
  port: 80
  targetPort: 8080
  metricsPort: 9090
  grpcPort: 50051
  annotations: {}

# Ingress configuration
ingress:
  enabled: false
  className: nginx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
  hosts:
    - host: ingestion.atp.connectsoft.example
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: atp-ingestion-tls
      hosts:
        - ingestion.atp.connectsoft.example

# Resource requests and limits
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 1000m
    memory: 1Gi

# Autoscaling configuration
autoscaling:
  enabled: false
  minReplicas: 3
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
  targetMemoryUtilizationPercentage: 80

# Health checks
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3

startupProbe:
  httpGet:
    path: /health/startup
    port: 8080
  initialDelaySeconds: 0
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 30

# Environment variables
env:
  ASPNETCORE_ENVIRONMENT: Production
  OpenTelemetry__ServiceName: atp-ingestion

# External Secrets configuration
externalSecrets:
  enabled: true
  secretStore: azure-keyvault
  secrets:
    - name: ConnectionStrings__Database
      key: sql-connection-string
    - name: ConnectionStrings__Redis
      key: redis-connection-string
    - name: ConnectionStrings__RabbitMQ
      key: rabbitmq-connection-string

# Network policy
networkPolicy:
  enabled: true
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
      - namespaceSelector:
          matchLabels:
            name: atp-production
      - podSelector:
          matchLabels:
            app: atp-gateway
  egress:
    - to:
      - namespaceSelector:
          matchLabels:
            name: kube-system
      - namespaceSelector:
          matchLabels:
            name: flux-system
      - namespaceSelector:
          matchLabels:
            name: observability

# Pod Disruption Budget
podDisruptionBudget:
  enabled: false
  minAvailable: 2

# Redis sub-chart (optional dependency)
redis:
  enabled: false  # Use Azure Cache for Redis instead

Template Best Practices

Helm Template Example (Deployment):

# apps/atp-ingestion/helm/templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "atp-ingestion.fullname" . }}
  namespace: {{ .Values.namespace | default .Release.Namespace }}
  labels:
    {{- include "atp-ingestion.labels" . | nindent 4 }}
spec:
  replicas: {{ .Values.replicaCount }}
  revisionHistoryLimit: {{ .Values.revisionHistoryLimit | default 10 }}
  selector:
    matchLabels:
      {{- include "atp-ingestion.selectorLabels" . | nindent 6 }}
  template:
    metadata:
      annotations:
        {{- with .Values.podAnnotations }}
        {{- toYaml . | nindent 8 }}
        {{- end }}
        checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}
        {{- if .Values.externalSecrets.enabled }}
        checksum/secret: {{ include (print $.Template.BasePath "/externalsecret.yaml") . | sha256sum }}
        {{- end }}
      labels:
        {{- include "atp-ingestion.selectorLabels" . | nindent 8 }}
    spec:
      serviceAccountName: {{ include "atp-ingestion.serviceAccountName" . }}
      securityContext:
        {{- toYaml .Values.podSecurityContext | nindent 8 }}
      containers:
      - name: {{ .Chart.Name }}
        image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
        imagePullPolicy: {{ .Values.image.pullPolicy }}
        securityContext:
          {{- toYaml .Values.securityContext | nindent 12 }}
        resources:
          {{- toYaml .Values.resources | nindent 12 }}
        env:
        {{- range $key, $value := .Values.env }}
        - name: {{ $key }}
          value: {{ $value | quote }}
        {{- end }}
        envFrom:
        - configMapRef:
            name: {{ include "atp-ingestion.fullname" . }}-config
        - secretRef:
            name: {{ include "atp-ingestion.fullname" . }}-secrets
        ports:
        - name: http
          containerPort: {{ .Values.service.targetPort }}
          protocol: TCP
        - name: metrics
          containerPort: {{ .Values.service.metricsPort }}
          protocol: TCP
        - name: grpc
          containerPort: {{ .Values.service.grpcPort }}
          protocol: TCP
        livenessProbe:
          {{- toYaml .Values.livenessProbe | nindent 12 }}
        readinessProbe:
          {{- toYaml .Values.readinessProbe | nindent 12 }}
        {{- if .Values.startupProbe }}
        startupProbe:
          {{- toYaml .Values.startupProbe | nindent 12 }}
        {{- end }}
        volumeMounts:
        - name: tmp
          mountPath: /tmp
        - name: cache
          mountPath: /app/cache
        - name: logs
          mountPath: /app/logs
      volumes:
      - name: tmp
        emptyDir: {}
      - name: cache
        emptyDir: {}
      - name: logs
        emptyDir: {}
      {{- with .Values.imagePullSecrets }}
      imagePullSecrets:
        {{- toYaml . | nindent 8 }}
      {{- end }}

Template Helpers (_helpers.tpl):

# apps/atp-ingestion/helm/templates/_helpers.tpl
{{/*
Expand the name of the chart.
*/}}
{{- define "atp-ingestion.name" -}}
{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" }}
{{- end }}

{{/*
Create a default fully qualified app name.
*/}}
{{- define "atp-ingestion.fullname" -}}
{{- if .Values.fullnameOverride }}
{{- .Values.fullnameOverride | trunc 63 | trimSuffix "-" }}
{{- else }}
{{- $name := default .Chart.Name .Values.nameOverride }}
{{- if contains $name .Release.Name }}
{{- .Release.Name | trunc 63 | trimSuffix "-" }}
{{- else }}
{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" }}
{{- end }}
{{- end }}
{{- end }}

{{/*
Create chart name and version as used by the chart label.
*/}}
{{- define "atp-ingestion.chart" -}}
{{- printf "%s-%s" .Chart.Name .Chart.Version | replace "+" "_" | trunc 63 | trimSuffix "-" }}
{{- end }}

{{/*
Common labels
*/}}
{{- define "atp-ingestion.labels" -}}
helm.sh/chart: {{ include "atp-ingestion.chart" . }}
{{ include "atp-ingestion.selectorLabels" . }}
{{- if .Chart.AppVersion }}
app.kubernetes.io/version: {{ .Chart.AppVersion | quote }}
{{- end }}
app.kubernetes.io/managed-by: {{ .Release.Service }}
managed-by: fluxcd
{{- end }}

{{/*
Selector labels
*/}}
{{- define "atp-ingestion.selectorLabels" -}}
app.kubernetes.io/name: {{ include "atp-ingestion.name" . }}
app.kubernetes.io/instance: {{ .Release.Name }}
{{- end }}

{{/*
Create the name of the service account to use
*/}}
{{- define "atp-ingestion.serviceAccountName" -}}
{{- if .Values.serviceAccount.create }}
{{- default (include "atp-ingestion.fullname" .) .Values.serviceAccount.name }}
{{- else }}
{{- default "default" .Values.serviceAccount.name }}
{{- end }}
{{- end }}
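
The `trunc 63 | trimSuffix "-"` pipeline in these helpers keeps generated names within the Kubernetes 63-character limit for DNS-1123 labels and label values. The equivalent logic in shell, useful when scripting name generation outside Helm:

```shell
# Truncate a name to 63 characters and strip a single trailing dash,
# mirroring Helm's `trunc 63 | trimSuffix "-"` helper pattern.
k8s_safe_name() {
  printf '%s' "$1" | cut -c1-63 | sed 's/-$//'
}

k8s_safe_name "atp-ingestion"           # short names pass through unchanged
k8s_safe_name "atp-ingestion-"          # trailing dash removed
```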

Values File Organization

values-production.yaml (Production Overrides):

# apps/atp-ingestion/helm/values-production.yaml
replicaCount: 5  # Production: 5 replicas

image:
  tag: "v1.2.3-abc123d"  # Immutable tag with commit SHA

resources:
  requests:
    cpu: 1000m      # Production: 1 CPU core
    memory: 1Gi     # Production: 1 GB RAM
  limits:
    cpu: 2000m      # Production: 2 CPU cores
    memory: 2Gi     # Production: 2 GB RAM

autoscaling:
  enabled: true
  minReplicas: 5
  maxReplicas: 20
  targetCPUUtilizationPercentage: 70
  targetMemoryUtilizationPercentage: 80

ingress:
  enabled: true
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/limit-rps: "1000"

env:
  ASPNETCORE_ENVIRONMENT: Production
  OpenTelemetry__SamplingRatio: "0.1"  # Production: 10% sampling

podDisruptionBudget:
  enabled: true
  minAvailable: 3  # Ensure at least 3 replicas available during updates

values-dev.yaml (Development Overrides):

# apps/atp-ingestion/helm/values-dev.yaml
replicaCount: 1  # Dev: 1 replica

image:
  tag: "latest"  # Dev: mutable latest tag

resources:
  requests:
    cpu: 100m     # Dev: minimal resources
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 512Mi

autoscaling:
  enabled: false  # Dev: no autoscaling

ingress:
  enabled: false  # Dev: no external ingress

env:
  ASPNETCORE_ENVIRONMENT: Development
  OpenTelemetry__SamplingRatio: "1.0"  # Dev: 100% sampling (full traces)

Chart Dependencies

Managing Dependencies:

# Update dependencies
helm dependency update apps/atp-ingestion/helm/

# Build chart with dependencies
helm package apps/atp-ingestion/helm/

# Output: atp-ingestion-1.2.3.tgz
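
`helm package` names the archive `<name>-<version>.tgz` from Chart.yaml. A CI step can derive that filename up front to locate the artifact; a minimal sketch using standard tools (assumes `name:` and `version:` lines without inline comments):

```shell
# Derive the packaged chart filename (<name>-<version>.tgz) from Chart.yaml.
chart_archive_name() {
  local chart_yaml="$1"
  local name version
  name=$(sed -n 's/^name: *//p' "$chart_yaml" | head -n1)
  version=$(sed -n 's/^version: *//p' "$chart_yaml" | head -n1)
  echo "${name}-${version}.tgz"
}

# Illustrative usage against a minimal Chart.yaml:
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
apiVersion: v2
name: atp-ingestion
version: 1.2.3
EOF
chart_archive_name "$tmp"   # atp-ingestion-1.2.3.tgz
rm -f "$tmp"
```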

Versioning and Publishing to ACR

Publish Helm Chart to ACR:

# Login to ACR
az acr login --name connectsoft

# Push Helm chart to ACR
helm push apps/atp-ingestion/helm/ oci://connectsoft.azurecr.io/helm

# Chart available at:
# oci://connectsoft.azurecr.io/helm/atp-ingestion:1.2.3

Helm Hooks for Migrations

Helm Hook Example (Database Migration):

# apps/atp-ingestion/helm/templates/migration-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ include "atp-ingestion.fullname" . }}-migration
  namespace: {{ .Values.namespace | default .Release.Namespace }}
  annotations:
    "helm.sh/hook": pre-upgrade,pre-install
    "helm.sh/hook-weight": "-5"
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
  template:
    spec:
      serviceAccountName: {{ include "atp-ingestion.serviceAccountName" . }}
      restartPolicy: Never
      containers:
      - name: migration
        image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
        command: ["dotnet", "run", "--project", "Migrate"]
        env:
        - name: ConnectionStrings__Database
          valueFrom:
            secretKeyRef:
              name: {{ include "atp-ingestion.fullname" . }}-secrets
              key: ConnectionStrings__Database

Kustomize for Environment Overlays

Kustomize enables environment-specific customization of base manifests without duplicating code.

Base + Overlay Pattern

Directory Structure:

apps/atp-ingestion/
├── base/                        # Base manifests (reusable)
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── configmap.yaml
│   ├── serviceaccount.yaml
│   ├── rbac.yaml
│   └── kustomization.yaml
└── overlays/                    # Environment-specific overlays
    ├── dev/
    │   ├── kustomization.yaml
    │   ├── deployment-patch.yaml
    │   └── configmap-patch.yaml
    ├── test/
    │   ├── kustomization.yaml
    │   └── deployment-patch.yaml
    ├── staging/
    │   ├── kustomization.yaml
    │   ├── deployment-patch.yaml
    │   └── hpa-patch.yaml
    └── production/
        ├── kustomization.yaml
        ├── deployment-patch.yaml
        ├── hpa-patch.yaml
        ├── configmap-patch.yaml
        └── networkpolicy-patch.yaml

Base Kustomization:

# apps/atp-ingestion/base/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: atp-production

resources:
  - deployment.yaml
  - service.yaml
  - configmap.yaml
  - serviceaccount.yaml
  - rbac.yaml

commonLabels:  # Also applied to selectors (immutable on live Deployments), so change with care
  app: atp-ingestion
  component: ingestion
  tier: backend
  managed-by: fluxcd

images:
  - name: connectsoft.azurecr.io/atp/ingestion
    newTag: v1.2.3-abc123d  # Updated by CI pipeline
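
The `newTag` value pairs the release SemVer with the short commit SHA, giving an immutable, traceable tag. A CI step might construct it like this (pure-shell sketch; the `kustomize edit set image` call shown in the trailing comment is the usual way to write the field back):

```shell
# Build an immutable image tag "v<semver>-<short-sha>" as used in newTag above.
image_tag() {
  local version="$1" sha="$2"
  # Keep only the first 7 characters of the commit SHA.
  printf 'v%s-%s\n' "$version" "$(printf '%s' "$sha" | cut -c1-7)"
}

TAG=$(image_tag "1.2.3" "abc123def4567890")
echo "$TAG"   # v1.2.3-abc123d

# In CI, the tag is then written into the kustomization (requires kustomize):
#   kustomize edit set image connectsoft.azurecr.io/atp/ingestion:$TAG
```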

Strategic Merge Patches

Deployment Patch (Production):

# apps/atp-ingestion/overlays/production/deployment-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  replicas: 5  # Production: 5 replicas (base has 3)
  template:
    spec:
      containers:
      - name: ingestion
        resources:
          requests:
            cpu: 1000m      # Production: 1 CPU (base: 500m)
            memory: 1Gi     # Production: 1 GB (base: 512Mi)
          limits:
            cpu: 2000m      # Production: 2 CPU (base: 1000m)
            memory: 2Gi     # Production: 2 GB (base: 1Gi)
        env:
        - name: ASPNETCORE_ENVIRONMENT
          value: Production
        - name: OpenTelemetry__SamplingRatio
          value: "0.1"  # Production: 10% sampling (dev: 100%)

JSON Patches

JSON Patch Example (Add Annotation):

# apps/atp-ingestion/overlays/production/json-patch.yaml
# Referenced from the overlay's kustomization.yaml, e.g.:
#   patches:
#     - path: json-patch.yaml
#       target:
#         kind: Deployment
#         name: atp-ingestion
- op: add
  path: /metadata/annotations/azure.connectsoft.com~1cost-center
  value: atp-production

- op: replace
  path: /spec/replicas
  value: 5

ConfigMap and Secret Generators

ConfigMap Generator:

# apps/atp-ingestion/overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - ../../base

configMapGenerator:
  - name: atp-ingestion-config
    behavior: merge  # Merge with base ConfigMap
    literals:
      - ASPNETCORE_ENVIRONMENT=Production
      - OpenTelemetry__SamplingRatio=0.1
      - Audit__EnableImmutability=true
      - Audit__RetentionDays=2555
    options:
      labels:
        environment: production

Secret Generator:

# apps/atp-ingestion/overlays/production/kustomization.yaml
secretGenerator:
  - name: atp-ingestion-secrets
    behavior: merge
    type: Opaque
    literals:
      # Kustomize base64-encodes literal values itself; shell substitution
      # such as $(echo ... | base64) does NOT run inside kustomization.yaml.
      # Avoid committing real secret values to Git.
      - ApiKeys__IngestionApiKey=secret-value

Built-In Transformers (Replicas and Images)

Kustomize replicas and images transformers (declarative field rewrites, no patch files needed):

# apps/atp-ingestion/base/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: atp-production

resources:
  - deployment.yaml

replicas:
  - name: atp-ingestion
    count: 3

images:
  - name: connectsoft.azurecr.io/atp/ingestion
    newTag: v1.2.3-abc123d

Configuration Layering Strategy

Configuration Precedence Hierarchy:

graph TD
    A[Base Configuration<br/>apps/atp-ingestion/base] -->|applies to| B[All Environments]

    B --> C[Dev Overlay<br/>overlays/dev]
    B --> D[Test Overlay<br/>overlays/test]
    B --> E[Staging Overlay<br/>overlays/staging]
    B --> F[Production Overlay<br/>overlays/production]

    C -->|customizes| G[Dev Cluster]
    D -->|customizes| H[Test Cluster]
    E -->|customizes| I[Staging Cluster]
    F -->|customizes| J[Production Cluster]

    style A fill:#FFE5B4
    style C fill:#90EE90
    style D fill:#90EE90
    style E fill:#FFE5B4
    style F fill:#ffcccc

Configuration Layers:

| Layer | Location | Purpose | Examples |
|---|---|---|---|
| Base | apps/{service}/base/ | Common configuration for all environments | Deployment structure, service ports, basic labels |
| Dev Overlay | apps/{service}/overlays/dev/ | Development-specific customization | 1 replica, minimal resources, 100% sampling |
| Test Overlay | apps/{service}/overlays/test/ | Test environment customization | 2 replicas, medium resources, 50% sampling |
| Staging Overlay | apps/{service}/overlays/staging/ | Pre-production environment | 3 replicas, production-like resources, 10% sampling |
| Production Overlay | apps/{service}/overlays/production/ | Production environment | 5+ replicas, full resources, 10% sampling, HPA |

Hierarchical Configuration Precedence

Precedence Order (highest to lowest):

  1. Overlay patches (environment-specific)
  2. Overlay ConfigMap generators (environment-specific)
  3. Base configuration (common defaults)

Example:

# Base ConfigMap
data:
  ASPNETCORE_ENVIRONMENT: "Development"  # Default

# Production Overlay ConfigMap Generator (merges)
configMapGenerator:
  - name: atp-ingestion-config
    behavior: merge
    literals:
      - ASPNETCORE_ENVIRONMENT=Production  # Overrides base

# Result: Production uses "Production", other environments use "Development"
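The merge semantics can be modeled with a plain dictionary update (a simplified illustration of `behavior: merge`, not Kustomize's actual merge code):

```python
# Simplified model of ConfigMap merge precedence: overlay literals override
# base keys; keys absent from the overlay are inherited from the base.
base = {
    "ASPNETCORE_ENVIRONMENT": "Development",  # base default
    "Audit__RetentionDays": "2555",
}
overlay = {"ASPNETCORE_ENVIRONMENT": "Production"}  # production overlay literal

merged = {**base, **overlay}  # overlay wins on conflicting keys
print(merged["ASPNETCORE_ENVIRONMENT"])  # Production
print(merged["Audit__RetentionDays"])    # 2555 (inherited from base)
```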

Image Reference Patterns

ACR Image Path Conventions

Image Path Format:

{registry}/{project}/{service}:{tag}

Examples:
connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d
connectsoft.azurecr.io/atp/query:v1.3.0-def456e
connectsoft.azurecr.io/atp/integrity:v1.1.5-ghi789f

Service Image Mapping:

| Service | ACR Path |
|---|---|
| atp-ingestion | connectsoft.azurecr.io/atp/ingestion |
| atp-query | connectsoft.azurecr.io/atp/query |
| atp-integrity | connectsoft.azurecr.io/atp/integrity |
| atp-export | connectsoft.azurecr.io/atp/export |
| atp-policy | connectsoft.azurecr.io/atp/policy |
| atp-search | connectsoft.azurecr.io/atp/search |
| atp-gateway | connectsoft.azurecr.io/atp/gateway |

Semantic Versioning in Image Tags

Tag Format:

v{VERSION}-{COMMIT-SHA}

Examples:
v1.2.3-abc123d      # Semantic version + 7-char commit SHA
v1.2.4-hotfix1-def456e  # Hotfix version + commit SHA

Tag Rules:

| Tag Pattern | Mutable? | Use Case | Example |
|---|---|---|---|
| v{VERSION}-{SHA} | ❌ Immutable | Production releases | v1.2.3-abc123d |
| v{VERSION} | ❌ Immutable | Production releases (without SHA) | v1.2.3 |
| latest | ✅ Mutable | Development only | latest |
| {BRANCH} | ✅ Mutable | Feature branches | feature/grpc-ingestion |
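These rules can be checked mechanically, e.g. as a CI gate (a hedged sketch; the regex mirrors the conventions above and is not an official format):

```python
import re

# Immutable tag: v{MAJOR}.{MINOR}.{PATCH}, optional suffix (e.g. -hotfix1),
# optional 7-char commit SHA. Pattern is illustrative, not authoritative.
TAG_RE = re.compile(r"^v\d+\.\d+\.\d+(-[a-z0-9]+)?(-[0-9a-f]{7})?$")

def is_immutable_tag(tag: str) -> bool:
    return bool(TAG_RE.match(tag))

print(is_immutable_tag("v1.2.3-abc123d"))          # True
print(is_immutable_tag("latest"))                  # False
print(is_immutable_tag("feature/grpc-ingestion"))  # False
```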

Commit SHA for Traceability

Image Tagging in Azure Pipelines:

# Azure Pipelines: Tag image with version + commit SHA
- task: Docker@2
  displayName: 'Build and push Docker image'
  inputs:
    containerRegistry: 'ConnectSoft-ACR'
    repository: 'atp/ingestion'
    command: 'buildAndPush'
    Dockerfile: 'src/ConnectSoft.ATP.Ingestion/Dockerfile'
    tags: |
      $(Build.BuildNumber)              # v1.2.3
      $(Build.BuildNumber)-$(Build.SourceVersion)  # v1.2.3-<full 40-char SHA>; derive a 7-char short SHA in a script step for v1.2.3-abc123d
      latest                            # Latest (dev only)
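`$(Build.SourceVersion)` expands to the full 40-character commit SHA, so the 7-character short form shown in the examples has to be derived, typically in a script step that sets a pipeline variable. The truncation itself is trivial (values here are illustrative):

```python
# Derive the 7-char short SHA used in image tags from the full commit SHA.
full_sha = "abc123d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0"  # e.g. $(Build.SourceVersion)
short_sha = full_sha[:7]
tag = f"v1.2.3-{short_sha}"
print(tag)  # v1.2.3-abc123d
```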

Image Pull Policies

Policy Selection:

| Policy | Behavior | Use Case |
|---|---|---|
| Always | Always pull latest image | Development (latest tag) |
| IfNotPresent | Pull only if not cached | Production (immutable tags) |
| Never | Never pull, use cached only | Air-gapped environments |

Production Configuration:

spec:
  containers:
  - name: ingestion
    image: connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d
    imagePullPolicy: IfNotPresent  # Production: use cached image (faster, immutable tag)

Development Configuration:

spec:
  containers:
  - name: ingestion
    image: connectsoft.azurecr.io/atp/ingestion:latest
    imagePullPolicy: Always  # Dev: always pull latest (mutable tag)

Resource Requests and Limits

Per-Environment Resource Specifications

Resource Configuration Matrix:

| Environment | CPU Request | CPU Limit | Memory Request | Memory Limit | Replicas |
|---|---|---|---|---|---|
| Dev | 100m | 500m | 128Mi | 512Mi | 1 |
| Test | 250m | 500m | 256Mi | 512Mi | 2 |
| Staging | 500m | 1000m | 512Mi | 1Gi | 3 |
| Production | 1000m | 2000m | 1Gi | 2Gi | 5 |

Production Resource Configuration:

# apps/atp-ingestion/overlays/production/deployment-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  template:
    spec:
      containers:
      - name: ingestion
        resources:
          requests:
            cpu: 1000m      # Guaranteed: 1 CPU core
            memory: 1Gi     # Guaranteed: 1 GB RAM
          limits:
            cpu: 2000m      # Maximum: 2 CPU cores (burst capacity)
            memory: 2Gi     # Maximum: 2 GB RAM

CPU and Memory Allocations

Sizing Guidelines:

  • CPU Request: Guaranteed CPU (scheduling decision)
  • CPU Limit: Maximum CPU (throttling threshold)
  • Memory Request: Guaranteed memory (scheduling decision)
  • Memory Limit: Maximum memory (OOMKill threshold)

Cost Optimization:

# Production: Right-sizing based on actual usage
resources:
  requests:
    cpu: 500m      # Based on 50th percentile usage
    memory: 512Mi  # Based on 50th percentile usage
  limits:
    cpu: 2000m     # Allow burst to 2x request
    memory: 2Gi    # Allow burst to 4x request (less frequent)

Resource Quota Enforcement

Namespace Resource Quota:

# infrastructure/overlays/production/resource-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: atp-production-quota
  namespace: atp-production
spec:
  hard:
    requests.cpu: "100"           # 100 CPU cores total
    requests.memory: 200Gi        # 200 GB RAM total
    limits.cpu: "200"             # 200 CPU cores max
    limits.memory: 400Gi          # 400 GB RAM max
    persistentvolumeclaims: "50"  # Max 50 PVCs
    services.loadbalancers: "2"   # Max 2 load balancers
    pods: "200"                   # Max 200 pods
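As a rough sanity check before raising `maxReplicas`, per-pod requests times replica count can be compared with the quota (a simplified sketch; the scheduler counts every pod in the namespace, not a single Deployment):

```python
# Does atp-ingestion at full scale-out fit inside the namespace quota?
replicas = 20          # HPA maxReplicas for production
cpu_request_m = 1000   # CPU request per pod (millicores)
mem_request_gi = 1     # memory request per pod (GiB)

total_cpu = replicas * cpu_request_m / 1000  # cores
total_mem = replicas * mem_request_gi        # GiB

quota_cpu, quota_mem = 100, 200  # requests.cpu / requests.memory from the quota
fits = total_cpu <= quota_cpu and total_mem <= quota_mem
print(total_cpu, total_mem, fits)  # 20.0 20 True
```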

Health Checks Configuration

Liveness Probes (Is the App Running?)

Purpose: Detect and restart crashed containers.

Liveness Probe Example:

livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
    scheme: HTTP
    httpHeaders:
    - name: Custom-Header
      value: liveness-check
  initialDelaySeconds: 30    # Wait 30s after container starts
  periodSeconds: 10          # Check every 10 seconds
  timeoutSeconds: 5          # Timeout after 5 seconds
  successThreshold: 1        # 1 success = healthy
  failureThreshold: 3        # 3 failures = restart container

Implementation (.NET Health Checks):

// C# Health Check implementation
// (UIResponseWriter comes from the AspNetCore.HealthChecks.UI.Client package)
app.MapHealthChecks("/health/live", new HealthCheckOptions
{
    Predicate = check => check.Tags.Contains("live"),
    AllowCachingResponses = false,
    ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse
});

Readiness Probes (Is the App Ready for Traffic?)

Purpose: Determine if container is ready to receive traffic.

Readiness Probe Example:

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
    scheme: HTTP
  initialDelaySeconds: 10    # Wait 10s after container starts
  periodSeconds: 5           # Check every 5 seconds
  timeoutSeconds: 3          # Timeout after 3 seconds
  successThreshold: 1        # 1 success = ready
  failureThreshold: 3        # 3 failures = remove from Service endpoints

Implementation (.NET Health Checks):

// C# Readiness Check (includes dependencies)
app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
    Predicate = check => check.Tags.Contains("ready"),
    ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse
});

// Check database connectivity
// (AddSqlServer/AddRedis come from the AspNetCore.HealthChecks.SqlServer
//  and AspNetCore.HealthChecks.Redis packages)
services.AddHealthChecks()
    .AddSqlServer(connectionString, tags: new[] { "ready" })
    .AddRedis(redisConnectionString, tags: new[] { "ready" });

Startup Probes (For Slow-Starting Apps)

Purpose: Allow slow-starting applications time to initialize.

Startup Probe Example:

startupProbe:
  httpGet:
    path: /health/startup
    port: 8080
    scheme: HTTP
  initialDelaySeconds: 0     # Start immediately
  periodSeconds: 5           # Check every 5 seconds
  timeoutSeconds: 3          # Timeout after 3 seconds
  successThreshold: 1        # 1 success = startup complete
  failureThreshold: 30       # Allow up to 150 seconds (30 * 5s) for startup

When to Use Startup Probes:

  • Applications with long initialization (database migrations, cache warming)
  • JVM-based applications (slow JIT compilation)
  • Applications loading large configuration files
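The probe parameters above bound the allowed startup time at `failureThreshold × periodSeconds` (each individual probe may additionally take up to `timeoutSeconds`); a quick check of the 150-second figure:

```python
# Maximum startup window implied by the startup probe configuration.
period_seconds = 5
failure_threshold = 30
max_startup_seconds = failure_threshold * period_seconds
print(max_startup_seconds)  # 150
```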

Probe Configuration Best Practices

Best Practices Table:

| Probe Type | Initial Delay | Period | Timeout | Failure Threshold | Rationale |
|---|---|---|---|---|---|
| Liveness | 30s | 10s | 5s | 3 | Give app time to start; detect crashes quickly |
| Readiness | 10s | 5s | 3s | 3 | Quick feedback for traffic routing |
| Startup | 0s | 5s | 3s | 30 | Allow up to 150s for slow initialization |

Probe Failure Handling:

# Liveness probe failure: Container restart
# Readiness probe failure: Remove from Service endpoints (no traffic)
# Startup probe failure: Keep checking until success or failure threshold

Pod Security Standards (PSS)

Security Context Configuration

Pod Security Context (Restricted Profile):

# apps/atp-ingestion/base/deployment.yaml
spec:
  template:
    spec:
      securityContext:
        runAsNonRoot: true        # Run as non-root user
        runAsUser: 1000           # Run as user ID 1000
        fsGroup: 2000             # File system group
        seccompProfile:           # System call filtering
          type: RuntimeDefault
        supplementalGroups: []    # No additional groups

Container Security Context:

containers:
- name: ingestion
  securityContext:
    allowPrivilegeEscalation: false  # Prevent privilege escalation
    readOnlyRootFilesystem: true     # Read-only root filesystem
    runAsNonRoot: true
    runAsUser: 1000
    capabilities:
      drop: [ALL]                    # Drop all capabilities
      # add: []                      # No additional capabilities
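These fields can also be checked programmatically, e.g. in a pre-commit hook (a minimal sketch over a parsed container spec covering a subset of the restricted profile; actual enforcement is done by Pod Security Admission):

```python
def violates_restricted(container: dict) -> list:
    """List restricted-profile violations for one container spec (simplified subset)."""
    sc = container.get("securityContext", {})
    violations = []
    if sc.get("allowPrivilegeEscalation") is not False:
        violations.append("allowPrivilegeEscalation must be false")
    if sc.get("runAsNonRoot") is not True:
        violations.append("runAsNonRoot must be true")
    if "ALL" not in sc.get("capabilities", {}).get("drop", []):
        violations.append("capabilities must drop ALL")
    return violations

compliant = {"securityContext": {
    "allowPrivilegeEscalation": False,
    "runAsNonRoot": True,
    "capabilities": {"drop": ["ALL"]},
}}
print(violates_restricted(compliant))  # []
print(len(violates_restricted({})))    # 3
```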

Pod Security Admission

Namespace Pod Security Labels:

# infrastructure/base/namespaces.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: atp-production
  labels:
    pod-security.kubernetes.io/enforce: restricted    # Enforce restricted profile
    pod-security.kubernetes.io/audit: restricted      # Audit violations
    pod-security.kubernetes.io/warn: restricted       # Warn on violations

Pod Security Levels:

| Level | Description | ATP Use |
|---|---|---|
| Privileged | Unrestricted (all capabilities) | ❌ Never |
| Baseline | Minimally restrictive | ⚠️ Legacy workloads only |
| Restricted | Highly restrictive (best practice) | ✅ Production |

Restricted, Baseline, Privileged Policies

Policy Comparison:

| Feature | Privileged | Baseline | Restricted |
|---|---|---|---|
| Host Namespaces | ✅ Allowed | ❌ Disallowed | ❌ Disallowed |
| Host Networking | ✅ Allowed | ❌ Disallowed | ❌ Disallowed |
| Privileged Containers | ✅ Allowed | ❌ Disallowed | ❌ Disallowed |
| Capabilities | ✅ All | ⚠️ Limited | ❌ Drop ALL |
| Volume Types | ✅ All | ⚠️ Limited | ⚠️ Limited |
| Run as Non-Root | ❌ Not required | ❌ Not required | ✅ Required |
| Read-Only Root FS | ❌ Not required | ❌ Not required | ✅ Required |
| Seccomp | ❌ Not required | ✅ Default | ✅ Required |

ATP Production Policy:

# infrastructure/base/pod-security-standards.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: atp-production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted

Security Best Practices for ATP

Security Checklist:

  • Run as non-root: All containers run as UID 1000+
  • Read-only root filesystem: Use emptyDir volumes for writable paths
  • Drop all capabilities: No additional Linux capabilities
  • Seccomp enabled: System call filtering (RuntimeDefault)
  • No host namespaces: No host network, PID, or IPC access
  • No privileged containers: No elevated privileges
  • Network policies: Default deny, explicit allow rules

Network Policies

Default Deny All Traffic

Default Deny NetworkPolicy:

# platform/network-policies/default-deny.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: atp-production
spec:
  podSelector: {}  # Applies to all pods
  policyTypes:
    - Ingress
    - Egress
  # No ingress rules = deny all ingress
  # No egress rules = deny all egress

Service-to-Service Communication Rules

Allow Internal Traffic (Same Namespace):

# apps/atp-ingestion/base/networkpolicy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: atp-ingestion-network-policy
  namespace: atp-production
spec:
  podSelector:
    matchLabels:
      app: atp-ingestion
  policyTypes:
    - Ingress
    - Egress

  ingress:
  # Allow from atp-gateway (API Gateway)
  - from:
    - podSelector:
        matchLabels:
          app: atp-gateway
    ports:
    - protocol: TCP
      port: 8080  # HTTP port
    - protocol: TCP
      port: 50051  # gRPC port

  # Allow from same namespace (service-to-service)
  - from:
    - namespaceSelector:
        matchLabels:
          name: atp-production
    ports:
    - protocol: TCP
      port: 8080

  egress:
  # Allow DNS
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system  # auto-applied namespace label
    ports:
    - protocol: UDP
      port: 53

  # Allow to Azure SQL (external; endpoints outside the cluster need an ipBlock --
  # an empty namespaceSelector only matches in-cluster pods)
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0  # Narrow to the Azure SQL service CIDR where possible
    ports:
    - protocol: TCP
      port: 1433  # SQL Server

  # Allow to Azure Redis (external)
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0  # Narrow to the Azure Cache for Redis CIDR where possible
    ports:
    - protocol: TCP
      port: 6380  # Redis TLS

  # Allow to observability namespace (metrics)
  - to:
    - namespaceSelector:
        matchLabels:
          name: observability
    ports:
    - protocol: TCP
      port: 4317  # OTLP gRPC

Ingress and Egress Policies

Ingress Policy (Allow External Traffic):

# apps/atp-gateway/base/networkpolicy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: atp-gateway-network-policy
  namespace: atp-production
spec:
  podSelector:
    matchLabels:
      app: atp-gateway
  policyTypes:
    - Ingress

  ingress:
  # Allow from ingress controller pods in the ingress-nginx namespace
  # (namespaceSelector and podSelector in ONE list entry are ANDed; as
  # separate entries they would be ORed and match far more than intended)
  - from:
    - namespaceSelector:
        matchLabels:
          name: ingress-nginx
      podSelector:
        matchLabels:
          app: ingress-nginx
    ports:
    - protocol: TCP
      port: 8080

DNS and Monitoring Exceptions

Egress Policy (DNS and Monitoring):

# platform/network-policies/allow-dns-monitoring.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-monitoring
  namespace: atp-production
spec:
  podSelector: {}  # Applies to all pods
  policyTypes:
    - Egress

  egress:
  # Allow DNS queries to kube-dns in kube-system
  # (namespaceSelector + podSelector combined in one entry = AND)
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system  # auto-applied namespace label
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53

  # Allow to Azure Monitor (external; use ipBlock for endpoints outside the
  # cluster -- an empty namespaceSelector only matches in-cluster pods)
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0  # Narrow to Azure Monitor endpoint ranges where possible
    ports:
    - protocol: TCP
      port: 443  # HTTPS for Azure Monitor

  # Allow to observability namespace (metrics, logs, traces)
  - to:
    - namespaceSelector:
        matchLabels:
          name: observability
    ports:
    - protocol: TCP
      port: 4317  # OTLP
    - protocol: TCP
      port: 9090  # Prometheus metrics

Horizontal Pod Autoscaler (HPA)

CPU-Based Autoscaling

HPA Configuration (CPU-Based):

# apps/atp-ingestion/overlays/production/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: atp-ingestion-hpa
  namespace: atp-production
  labels:
    app: atp-ingestion
    managed-by: fluxcd
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: atp-ingestion
  minReplicas: 5
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70  # Scale when CPU > 70%
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scaling down
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60  # Scale down by 50% per minute
    scaleUp:
      stabilizationWindowSeconds: 0  # Scale up immediately
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15  # Double replicas every 15 seconds
      - type: Pods
        value: 4
        periodSeconds: 15  # Or add 4 pods every 15 seconds
      selectPolicy: Max  # Use the policy that scales fastest
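Behind these settings, the HPA computes `desiredReplicas = ceil(currentReplicas × currentMetricValue / targetValue)`, clamped to the min/max bounds. A simplified sketch (it ignores the stabilization windows and scale-up/scale-down policies shown above):

```python
import math

def desired_replicas(current_replicas, current_util, target_util,
                     min_replicas=5, max_replicas=20):
    """Simplified HPA formula: scale proportionally to metric/target, then clamp."""
    desired = math.ceil(current_replicas * current_util / target_util)
    return max(min_replicas, min(max_replicas, desired))

print(desired_replicas(5, 140, 70))   # 10 -> CPU at 2x the 70% target doubles the pods
print(desired_replicas(10, 35, 70))   # 5  -> low load scales back to minReplicas
```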

Memory-Based Autoscaling

HPA Configuration (Memory-Based):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: atp-ingestion-hpa
  namespace: atp-production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: atp-ingestion
  minReplicas: 5
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80  # Scale when memory > 80%

Custom Metrics with KEDA

KEDA ScaledObject (Custom Metrics):

# apps/atp-ingestion/overlays/production/keda-scaledobject.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: atp-ingestion-scaler
  namespace: atp-production
spec:
  scaleTargetRef:
    name: atp-ingestion
  minReplicaCount: 5
  maxReplicaCount: 20
  triggers:
  # Scale based on CPU
  - type: cpu
    metricType: Utilization
    metadata:
      value: "70"

  # Scale based on RabbitMQ queue length
  - type: rabbitmq
    metadata:
      host: amqps://rabbitmq.example:5671
      queueName: audit-records-queue
      queueLength: "100"  # Scale when queue > 100 messages

  # Scale based on HTTP request rate
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.observability:9090
      metricName: http_requests_per_second
      threshold: "1000"  # Scale when requests > 1000/sec

Scaling Policies per Environment

Environment-Specific Scaling:

| Environment | Min Replicas | Max Replicas | Target CPU | Target Memory | Custom Metrics |
|---|---|---|---|---|---|
| Dev | 1 | 2 | N/A (no HPA) | N/A | ❌ Disabled |
| Test | 2 | 4 | 80% | 80% | ⚠️ Optional |
| Staging | 3 | 10 | 70% | 75% | ⚠️ Optional |
| Production | 5 | 20 | 70% | 80% | ✅ Enabled (KEDA) |

Manifest Validation

kubeval for Syntax Validation

kubeval Usage:

# Install kubeval
# (Note: kubeval is archived upstream; kubeconform is a maintained drop-in alternative)
brew install kubeval  # macOS
# or (Linux)
wget https://github.com/instrumenta/kubeval/releases/latest/download/kubeval-linux-amd64.tar.gz
tar xf kubeval-linux-amd64.tar.gz && sudo mv kubeval /usr/local/bin/

# Validate Kubernetes manifests
kubeval apps/atp-ingestion/base/deployment.yaml

# Validate all manifests in directory
find apps/ -name "*.yaml" -exec kubeval {} \;

# Validate with specific Kubernetes version
kubeval --kubernetes-version 1.30.0 apps/atp-ingestion/base/deployment.yaml

kube-score for Best Practices

kube-score Usage:

# Install kube-score
brew install kube-score  # macOS
# or download from https://github.com/zegl/kube-score/releases

# Score manifests (best practices check)
kube-score score apps/atp-ingestion/base/deployment.yaml

# Output:
# apps/atp-ingestion/base/deployment.yaml
# [CRITICAL] Container Image Tag
#   · Image with latest or no tag
#     Container 'ingestion' must not use the 'latest' tag
#
# [CRITICAL] Container Resources
#   · CPU limit is not set
#     Container 'ingestion' does not have a CPU limit
#
# [WARNING] Container Security Context
#   · Container does not have a read-only root filesystem
#     Container 'ingestion' can write to root filesystem

kube-score Configuration:

# Suppress specific checks with --ignore-test
# (container-image-tag: allow 'latest' in dev;
#  deployment-has-poddisruptionbudget: optional for non-critical services)
kube-score score \
  --ignore-test container-image-tag \
  --ignore-test deployment-has-poddisruptionbudget \
  apps/atp-ingestion/base/deployment.yaml

# To fail the pipeline on warnings as well as critical findings,
# add --exit-one-on-warning

Azure Policy Validation

Azure Policy for AKS (the Azure Policy add-on, built on OPA Gatekeeper) enforces compliance rules assigned at subscription or resource-group scope. The same admission rules can also be expressed in-cluster with a CEL-based ValidatingAdmissionPolicy (Kubernetes 1.30+; sketch):

# platform/azure-policy/pod-security-standards.yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: enforce-pod-security-standards
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups: [""]
      apiVersions: ["v1"]
      operations: ["CREATE", "UPDATE"]
      resources: ["pods"]
  validations:
  - expression: "has(object.spec.securityContext) && has(object.spec.securityContext.runAsNonRoot) && object.spec.securityContext.runAsNonRoot == true"
    message: "Pods must run as non-root user"
  - expression: "object.spec.containers.all(c, has(c.securityContext) && has(c.securityContext.readOnlyRootFilesystem) && c.securityContext.readOnlyRootFilesystem == true)"
    message: "Containers must have read-only root filesystem"
  - expression: "object.spec.containers.all(c, has(c.resources.limits) && 'cpu' in c.resources.limits && 'memory' in c.resources.limits)"
    message: "Containers must have CPU and memory limits"

# A ValidatingAdmissionPolicyBinding scoped to the atp-* namespaces activates the policy.

CI Pipeline Integration

Azure Pipelines Validation Stage:

# azure-pipelines.yml
- stage: Validate_Manifests
  displayName: 'Validate Kubernetes Manifests'
  jobs:
  - job: Validate
    steps:
    - task: Bash@3
      displayName: 'Install kubeval and kube-score'
      inputs:
        targetType: 'inline'
        script: |
          # Install kubeval
          curl -LO https://github.com/instrumenta/kubeval/releases/latest/download/kubeval-linux-amd64.tar.gz
          tar xf kubeval-linux-amd64.tar.gz
          sudo mv kubeval /usr/local/bin/

          # Install kube-score
          curl -LO https://github.com/zegl/kube-score/releases/latest/download/kube-score_linux_amd64.tar.gz
          tar xf kube-score_linux_amd64.tar.gz
          sudo mv kube-score /usr/local/bin/

    - task: Bash@3
      displayName: 'Validate manifests with kubeval'
      inputs:
        targetType: 'inline'
        script: |
          # Validate all base manifests
          find apps/ -name "*.yaml" -path "*/base/*" -exec kubeval --strict {} \;

    - task: Bash@3
      displayName: 'Score manifests with kube-score'
      inputs:
        targetType: 'inline'
        script: |
          # Score all base manifests
          find apps/ -name "*.yaml" -path "*/base/*" -exec kube-score score {} \;

Summary: Declarative Manifest Management

  • Base Manifest Structure: Standard Kubernetes resources (Deployment, Service, ConfigMap, Ingress, ServiceAccount, RBAC) for all ATP services
  • Helm Charts: Parameterized, reusable templates with environment-specific values files
  • Kustomize Overlays: Environment-specific customization using strategic merge patches and generators
  • Configuration Layering: Base configuration + environment overlays with clear precedence
  • Image References: ACR paths with semantic versioning and commit SHA for traceability
  • Resource Management: Per-environment resource requests/limits with cost optimization
  • Health Checks: Liveness, readiness, and startup probes for reliability
  • Pod Security Standards: Restricted profile enforcement for production workloads
  • Network Policies: Default deny with explicit service-to-service rules
  • Horizontal Pod Autoscaler: CPU/memory-based scaling with KEDA for custom metrics
  • Manifest Validation: kubeval (syntax), kube-score (best practices), Azure Policy (compliance)

Git Workflow & Environment Promotion

Purpose: Define the complete Git workflow, pull request process, environment promotion strategy, and operational procedures for managing changes through the GitOps pipeline from feature development to production deployment.


Feature Branch Development Workflow

ATP GitOps follows a GitOps-native workflow where all infrastructure and application changes flow through Git branches, pull requests, and automated validation before promotion to production.

Creating Feature Branches from Dev

Branch Creation Process:

# 1. Ensure you're on the latest dev branch
git checkout dev
git pull origin dev

# 2. Create feature branch from dev
git checkout -b feature/atp-ingestion-add-grpc-support

# 3. Verify branch creation
git branch
# Output:
# * feature/atp-ingestion-add-grpc-support
#   dev
#   main

# 4. Push feature branch to remote
git push -u origin feature/atp-ingestion-add-grpc-support

Branch Naming Conventions:

| Branch Type | Prefix | Example | Purpose |
|---|---|---|---|
| Feature | feature/ | feature/atp-query-add-cache | New functionality |
| Bugfix | bugfix/ | bugfix/atp-ingestion-memory-leak | Bug fixes |
| Hotfix | hotfix/ | hotfix/atp-gateway-security-patch | Critical production fixes |
| Documentation | docs/ | docs/gitops-troubleshooting | Documentation updates |
| Infrastructure | infra/ | infra/add-monitoring-namespace | Infrastructure changes |
| Chore | chore/ | chore/update-helm-charts | Maintenance tasks |
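The convention can be enforced with a simple pattern check in a PR pipeline (an illustrative sketch; the regex mirrors the prefixes above):

```python
import re

# Allowed prefixes followed by a lowercase kebab-case description.
BRANCH_RE = re.compile(r"^(feature|bugfix|hotfix|docs|infra|chore)/[a-z0-9][a-z0-9-]*$")

def is_valid_branch(name: str) -> bool:
    return bool(BRANCH_RE.match(name))

print(is_valid_branch("feature/atp-ingestion-add-grpc-support"))  # True
print(is_valid_branch("fix/typo"))                                # False
```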

Local Development and Testing

Local GitOps Repository Structure:

# Clone GitOps repository
git clone ssh://git@ssh.dev.azure.com/v3/ConnectSoft/ATP/atp-gitops.git
cd atp-gitops

# Navigate to service manifests
cd apps/atp-ingestion/base/

# Edit deployment manifest
vim deployment.yaml

# Validate changes locally
kubectl apply --dry-run=client -f deployment.yaml

Local Validation Tools:

# Validate YAML syntax
yamllint apps/atp-ingestion/base/deployment.yaml

# Validate Kubernetes manifests
kubeval apps/atp-ingestion/base/deployment.yaml

# Score manifests (best practices)
kube-score score apps/atp-ingestion/base/deployment.yaml

# Preview Kustomize output
kustomize build apps/atp-ingestion/overlays/dev/

# Preview Helm template output
helm template atp-ingestion apps/atp-ingestion/helm/ \
  --values apps/atp-ingestion/helm/values-dev.yaml \
  --debug

Committing Manifest Changes

Commit Process:

# Stage changes
git add apps/atp-ingestion/base/deployment.yaml

# Commit with conventional commit format
git commit -m "feat(ingestion): add gRPC endpoint configuration

- Add gRPC port (50051) to container ports
- Configure gRPC health checks
- Update service manifest for gRPC traffic

Related to: ATP-1234"

# Sign commit (required for production)
git commit -S -m "feat(ingestion): add gRPC endpoint configuration"

# Verify commit signature
git log --show-signature -1

Commit Message Format (Conventional Commits):

<type>(<scope>): <subject>

<body>

<footer>

Examples:

# Feature addition
git commit -m "feat(ingestion): add Redis cache support"

# Bug fix
git commit -m "fix(query): resolve memory leak in query service"

# Configuration change
git commit -m "chore(infra): update resource limits for production"

# Breaking change
git commit -m "feat(gateway)!: remove legacy authentication

BREAKING CHANGE: Legacy API key authentication removed.
Migrate to OAuth 2.0 before deploying this change."

# Work item reference
git commit -m "feat(integrity): implement tamper detection

Implements ATP-5678

- Add cryptographic signatures to audit records
- Validate signatures on read operations
- Store signature metadata in database"
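The header line of this format can be validated mechanically (a sketch covering the commit types used in this document, not the full Conventional Commits grammar):

```python
import re

# <type>(<scope>): <subject>, with an optional "!" marking a breaking change.
HEADER_RE = re.compile(r"^(feat|fix|chore|docs|refactor|perf|test)\([a-z-]+\)!?: .+$")

def valid_header(line: str) -> bool:
    return bool(HEADER_RE.match(line))

print(valid_header("feat(ingestion): add Redis cache support"))      # True
print(valid_header("feat(gateway)!: remove legacy authentication"))  # True
print(valid_header("update stuff"))                                  # False
```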

Syncing with Remote Repository

Sync Workflow:

# Fetch latest changes from remote
git fetch origin

# Check status
git status

# Rebase feature branch on latest dev (optional, for clean history)
git checkout feature/atp-ingestion-add-grpc-support
git rebase origin/dev

# Or merge latest dev into feature branch
git merge origin/dev

# Resolve conflicts if any
git status
# Edit conflicted files
vim apps/atp-ingestion/base/deployment.yaml
git add apps/atp-ingestion/base/deployment.yaml
git rebase --continue  # or git commit for merge

# Push changes
git push origin feature/atp-ingestion-add-grpc-support

# If rebased, force push (be careful!)
git push --force-with-lease origin feature/atp-ingestion-add-grpc-support

Pull Request Process

PR Creation in Azure Repos

Create Pull Request:

# Using Azure DevOps CLI
az repos pr create \
  --source-branch feature/atp-ingestion-add-grpc-support \
  --target-branch dev \
  --title "feat(ingestion): Add gRPC endpoint configuration" \
  --description "Adds gRPC support to ingestion service. See ATP-1234." \
  --work-items 1234 \
  --auto-complete false

# Or use Azure DevOps Portal:
# 1. Navigate to Repos > Pull Requests
# 2. Click "New Pull Request"
# 3. Select source branch (feature/atp-ingestion-add-grpc-support)
# 4. Select target branch (dev)
# 5. Fill in title and description
# 6. Link work items
# 7. Add reviewers
# 8. Create pull request

PR Template and Checklist

Pull Request Template (.azuredevops/pull_request_template.md):

## Description
<!-- Provide a clear description of the changes -->

## Type of Change
<!-- Mark applicable with [x] -->
- [ ] Feature (non-breaking change adding functionality)
- [ ] Bug fix (non-breaking change fixing an issue)
- [ ] Breaking change (fix or feature causing existing functionality to change)
- [ ] Documentation update
- [ ] Infrastructure change
- [ ] Configuration change

## Service(s) Affected
<!-- List affected services -->
- [ ] atp-ingestion
- [ ] atp-query
- [ ] atp-integrity
- [ ] atp-export
- [ ] atp-policy
- [ ] atp-search
- [ ] atp-gateway
- [ ] Infrastructure/Platform

## Environment(s) Affected
<!-- Mark applicable with [x] -->
- [ ] Dev
- [ ] Test
- [ ] Staging
- [ ] Production

## Pre-Merge Checklist
<!-- Mark applicable with [x] -->
- [ ] Code follows project style guidelines
- [ ] Self-review completed
- [ ] Comments added for complex logic
- [ ] Documentation updated
- [ ] No breaking changes (or documented)
- [ ] All CI/CD checks passing
- [ ] Manifest validation passing (kubeval, kube-score)
- [ ] Security scanning passing (OPA, Azure Policy)
- [ ] Preview environment tested (if applicable)
- [ ] Work items linked
- [ ] Signed commits (for production branches)

## Testing
<!-- Describe testing performed -->
- [ ] Local testing completed
- [ ] Preview environment tested
- [ ] Unit tests passing
- [ ] Integration tests passing
- [ ] Manual testing completed

## Rollback Plan
<!-- Describe rollback procedure if needed -->

## Related Work Items
<!-- Link related work items -->
- ATP-1234: Add gRPC endpoint to ingestion service

## Screenshots/Documentation
<!-- Add screenshots, diagrams, or documentation links -->

## Additional Notes
<!-- Any additional information reviewers should know -->

Code Review Guidelines

Review Checklist:

  1. Manifest Validation:
     - ✅ YAML syntax correct
     - ✅ Kubernetes API version valid
     - ✅ Resource names follow conventions
     - ✅ Labels and annotations present
     - ✅ Resource requests/limits set

  2. Security:
     - ✅ No secrets in plaintext
     - ✅ Pod Security Standards compliant
     - ✅ Network policies configured
     - ✅ RBAC follows least privilege

  3. Configuration:
     - ✅ Environment-specific values correct
     - ✅ Image tags immutable (not latest in prod)
     - ✅ Health checks configured
     - ✅ Resource limits appropriate

  4. Best Practices:
     - ✅ Follows GitOps principles
     - ✅ Changes are declarative
     - ✅ No hardcoded values
     - ✅ Documentation updated

Review Comments:

❌ Security Issue: Secret in plaintext
✅ Approved: Looks good, minor suggestion
⚠️ Needs Work: Please add resource limits
📝 Question: Why is this change needed?

Approval Workflow

Approval Requirements Matrix:

| Target Branch | Minimum Approvers | Required Roles | GPG Signing | Status Checks |
|---|---|---|---|---|
| dev | 1 | Developer or above | ❌ Optional | ✅ Required |
| test | 1 | Developer or above | ❌ Optional | ✅ Required |
| staging | 2 | Architect or SRE Lead | ✅ Required | ✅ Required |
| production | 2 | Architect or SRE Lead | ✅ Required | ✅ Required |
| hotfix/ | 2 | Architect or SRE Lead | ✅ Required | ✅ Required |

Azure DevOps Branch Policy Configuration (illustrative: branch policies are configured through the Azure DevOps UI or REST API; this YAML documents the intended settings):

# Branch policy for production branch
branchPolicy:
  branch: production
  minimumApproverCount: 2
  requiredApproverIds:
    - architect-team-group
    - sre-lead-group
  blockingPolicies:
    - buildValidation: true
    - mergeStrategy: squash
    - requireGpgSigning: true
    - requireWorkItemLinking: true
    - commentRequirements: true

Automated PR Validation

Manifest Linting (YAML Syntax, Helm Lint)

Azure Pipeline: PR Validation Stage:

# .azuredevops/pipelines/pr-validation.yml
trigger: none  # Only run on PR

pr:
  branches:
    include:
      - dev
      - test
      - staging
      - production
      - hotfix/*

pool:
  vmImage: 'ubuntu-latest'

stages:
- stage: Validate_Manifests
  displayName: 'Validate Kubernetes Manifests'
  jobs:
  - job: YAMLLint
    displayName: 'YAML Syntax Validation'
    steps:
    - task: UsePythonVersion@0
      inputs:
        versionSpec: '3.9'

    - script: |
        pip install yamllint
        yamllint -c .yamllint.yml apps/
      displayName: 'Validate YAML syntax'

  - job: Kubeval
    displayName: 'Kubernetes Manifest Validation'
    steps:
    - script: |
        wget -q https://github.com/instrumenta/kubeval/releases/latest/download/kubeval-linux-amd64.tar.gz
        tar xf kubeval-linux-amd64.tar.gz
        sudo mv kubeval /usr/local/bin/
        kubeval --version
      displayName: 'Install kubeval'

    - script: |
        find apps/ -name "*.yaml" -path "*/base/*" -exec kubeval --strict {} \;
      displayName: 'Validate Kubernetes manifests'

  - job: HelmLint
    displayName: 'Helm Chart Linting'
    steps:
    - task: HelmInstaller@1
      inputs:
        helmVersionToInstall: 'latest'

    - script: |
        find apps/ -name "Chart.yaml" -path "*/helm/*" | while read chart; do
          chart_dir=$(dirname "$chart")
          helm lint "$chart_dir"
        done
      displayName: 'Lint Helm charts'

  - job: KubeScore
    displayName: 'Best Practices Check'
    steps:
    - script: |
        wget -q https://github.com/zegl/kube-score/releases/latest/download/kube-score_linux_amd64.tar.gz
        tar xf kube-score_linux_amd64.tar.gz
        sudo mv kube-score /usr/local/bin/
      displayName: 'Install kube-score'

    - script: |
        find apps/ -name "*.yaml" -path "*/base/*" -exec kube-score score {} \;
      displayName: 'Score manifests for best practices'

Security Scanning (OPA Policies, Azure Policy)

OPA Policy Validation:

# .azuredevops/pipelines/pr-validation.yml
  - job: OPAPolicy
    displayName: 'OPA Policy Validation'
    steps:
    - script: |
        wget -q https://github.com/open-policy-agent/conftest/releases/latest/download/conftest_linux_amd64.tar.gz
        tar xf conftest_linux_amd64.tar.gz
        sudo mv conftest /usr/local/bin/
      displayName: 'Install conftest'

    - script: |
        find apps/ -name "*.yaml" -path "*/base/*" | while read manifest; do
          conftest test "$manifest" -p policies/
        done
      displayName: 'Validate OPA policies'

OPA Policy Examples:

# policies/pod-security.rego
package podsecurity

deny[msg] {
    input.kind == "Deployment"
    container := input.spec.template.spec.containers[_]
    not container.securityContext.runAsNonRoot

    msg := "Container must run as non-root user"
}

deny[msg] {
    input.kind == "Deployment"
    container := input.spec.template.spec.containers[_]
    not container.securityContext.readOnlyRootFilesystem

    msg := "Container must have read-only root filesystem"
}

deny[msg] {
    input.kind == "Deployment"
    container := input.spec.template.spec.containers[_]
    not container.resources.limits.cpu

    msg := "Container must have CPU limit"
}

Dry-Run Validation (Kustomize Build, Helm Template)

Dry-Run Validation:

# .azuredevops/pipelines/pr-validation.yml
  - job: DryRun
    displayName: 'Dry-Run Validation'
    steps:
    - task: Kubernetes@1
      inputs:
        connectionType: 'Azure Resource Manager'
        azureSubscriptionEndpoint: 'ATP-AKS-Connection'
        azureResourceGroup: 'ATP-Production-EUS-RG'
        kubernetesCluster: 'atp-prod-eus-aks'
        namespace: 'atp-production'
        command: 'apply'
        arguments: '--dry-run=client -f apps/atp-ingestion/base/deployment.yaml'
      displayName: 'Kubectl dry-run'

    - script: |
        # Kustomize build validation
        kustomize build apps/atp-ingestion/overlays/production/ > /dev/null
        echo "✅ Kustomize build successful"
      displayName: 'Kustomize build validation'

    - script: |
        # Helm template validation
        helm template atp-ingestion apps/atp-ingestion/helm/ \
          --values apps/atp-ingestion/helm/values-production.yaml \
          --debug > /dev/null
        echo "✅ Helm template successful"
      displayName: 'Helm template validation'

Breaking Change Detection

Breaking Change Detection Script:

#!/bin/bash
# scripts/detect-breaking-changes.sh

set -euo pipefail

BASE_BRANCH="${1:-dev}"
FEATURE_BRANCH="${2:-HEAD}"

echo "🔍 Detecting breaking changes between $BASE_BRANCH and $FEATURE_BRANCH..."

# Check for removed resources
REMOVED_RESOURCES=$(git diff --name-only --diff-filter=D "$BASE_BRANCH" "$FEATURE_BRANCH" | grep -E '\.(yaml|yml)$' || true)

if [ -n "$REMOVED_RESOURCES" ]; then
  echo "⚠️  WARNING: Resources removed:"
  echo "$REMOVED_RESOURCES"
  echo "This may be a breaking change!"
fi

# Check for API version changes
API_VERSION_CHANGES=$(git diff "$BASE_BRANCH" "$FEATURE_BRANCH" | grep -E '^\+.*apiVersion:|^\-.*apiVersion:' || true)

if [ -n "$API_VERSION_CHANGES" ]; then
  echo "⚠️  WARNING: API version changes detected:"
  echo "$API_VERSION_CHANGES"
fi

# Check for breaking change markers
BREAKING_MARKERS=$(git log --oneline "$BASE_BRANCH..$FEATURE_BRANCH" | grep -i "BREAKING" || true)

if [ -n "$BREAKING_MARKERS" ]; then
  echo "🚨 BREAKING CHANGE detected in commit messages:"
  echo "$BREAKING_MARKERS"
  exit 1
fi

echo "✅ No breaking changes detected"
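The commit-marker check above can be exercised end-to-end in a throwaway repository (the branch name and commit messages are illustrative):

```shell
# Exercise the "BREAKING" marker check from the script above in a
# throwaway repo; `dev` stands in for the base branch.
set -eu
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.email ci@example.com
git config user.name ci
git commit -q --allow-empty -m "chore: baseline"
git branch -f dev                        # dev points at the baseline
git commit -q --allow-empty -m "feat(gateway)!: BREAKING CHANGE: remove legacy auth"
# Same check the script performs between $BASE_BRANCH and $FEATURE_BRANCH:
if git log --oneline "dev..HEAD" | grep -qi "BREAKING"; then
  echo "breaking change detected"
fi
```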

Test Environment Deployment Preview

Preview Environment Deployment:

# .azuredevops/pipelines/pr-validation.yml
  - job: PreviewDeploy
    displayName: 'Preview Environment Deployment'
    # eq() does not support wildcards; match the PR source branch by prefix instead
    condition: and(succeeded(), startsWith(variables['System.PullRequest.SourceBranch'], 'refs/heads/feature/'))
    steps:
    - task: Kubernetes@1
      inputs:
        connectionType: 'Azure Resource Manager'
        azureSubscriptionEndpoint: 'ATP-AKS-Connection'
        azureResourceGroup: 'ATP-Dev-EUS-RG'
        kubernetesCluster: 'atp-dev-eus-aks'
        namespace: 'preview-$(System.PullRequest.PullRequestId)'
        command: 'apply'
        arguments: '-f apps/atp-ingestion/base/'
      displayName: 'Deploy to preview namespace'

    - script: |
        # Wait for deployment to be ready
        kubectl wait --for=condition=available \
          --timeout=300s \
          deployment/atp-ingestion \
          -n preview-$(System.PullRequest.PullRequestId)
        echo "✅ Preview deployment successful"
      displayName: 'Wait for deployment'

    - script: |
        # Run smoke tests
        kubectl exec -n preview-$(System.PullRequest.PullRequestId) \
          deployment/atp-ingestion -- \
          curl -f http://localhost:8080/health/ready
        echo "✅ Smoke tests passed"
      displayName: 'Run smoke tests'

Merge Strategies

Squash Merge (Production, Staging)

Squash Merge Configuration:

# Azure DevOps branch policy
branchPolicy:
  branch: production
  mergeStrategy: squash
  squashMergeCommitMessage: firstLine  # Use first commit message line

Squash Merge Example:

# Before squash merge (3 commits in feature branch)
git log --oneline feature/atp-ingestion-add-grpc
# abc123 feat(ingestion): add gRPC port
# def456 feat(ingestion): add gRPC health check
# ghi789 feat(ingestion): update service manifest

# After squash merge to production (1 commit)
git log --oneline production
# jkl012 feat(ingestion): add gRPC port  # Single squashed commit

Benefits of Squash Merge:

  • ✅ Clean, linear history
  • ✅ Easier rollback (single commit)
  • ✅ Simpler to review changes

Merge Commit (Test)

Merge Commit Configuration:

# Azure DevOps branch policy
branchPolicy:
  branch: test
  mergeStrategy: noFastForward  # Creates merge commit

Merge Commit Example:

# Feature branch merged to test with merge commit
git log --oneline --graph test
# *   mno345 Merge pull request #123 from feature/atp-ingestion-add-grpc
# |\
# | * abc123 feat(ingestion): add gRPC port
# | * def456 feat(ingestion): add gRPC health check
# |/
# * pqr678 Previous commit

Benefits of Merge Commit:

  • ✅ Preserves branch history
  • ✅ Clear feature boundaries
  • ✅ Useful for tracking feature development

Rebase (Dev, Optional)

Rebase Workflow:

# Rebase feature branch on latest dev
git checkout feature/atp-ingestion-add-grpc
git fetch origin
git rebase origin/dev

# Resolve conflicts if any
git status
# Edit conflicted files
vim apps/atp-ingestion/base/deployment.yaml
git add apps/atp-ingestion/base/deployment.yaml
git rebase --continue

# Force push (be careful!)
git push --force-with-lease origin feature/atp-ingestion-add-grpc

Benefits of Rebase:

  • ✅ Linear history
  • ✅ Clean, sequential commits
  • ⚠️ Requires force push (dangerous)
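The force-push caveat is worth seeing concretely: unlike `--force`, `--force-with-lease` refuses to overwrite commits you have not fetched. A sketch in throwaway repositories (the `alice`/`bob` clone names are illustrative):

```shell
# --force-with-lease rejects the push when the remote branch has moved
# past our last-known remote-tracking ref.
set -eu
tmp=$(mktemp -d)
cd "$tmp"
git init -q --bare origin.git
git clone -q origin.git alice 2>/dev/null
cd alice
git config user.email alice@example.com
git config user.name alice
git commit -q --allow-empty -m "base"
git push -q origin HEAD:refs/heads/feature   # also updates origin/feature locally
# A teammate pushes to the same branch behind alice's back:
cd "$tmp"
git clone -q origin.git bob 2>/dev/null
cd bob
git config user.email bob@example.com
git config user.name bob
git checkout -q feature 2>/dev/null
git commit -q --allow-empty -m "teammate commit"
git push -q origin feature
# Alice rewrites history without fetching first; the lease fails:
cd "$tmp/alice"
git commit -q --amend --allow-empty -m "rewritten"
if ! git push -q --force-with-lease origin HEAD:feature 2>/dev/null; then
  echo "push rejected: remote has commits we have not fetched"
fi
```

After a `git fetch`, the lease would match again and the force push would go through deliberately.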

Strategy Selection Rationale

Merge Strategy Matrix:

| Branch | Strategy | Rationale |
|--------|----------|-----------|
| dev | Merge commit or rebase | Preserve feature history, flexibility |
| test | Merge commit | Track feature development clearly |
| staging | Squash merge | Clean history, easier rollback |
| production | Squash merge | Clean, linear history essential for compliance |
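The "easier rollback" rationale for squash merging is mechanical: the whole release is one commit, so undoing it is a single `git revert`. A minimal sketch in a throwaway repository (the file contents and commit messages are illustrative):

```shell
# One squashed release commit -> one revert to roll it back.
set -eu
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.email ci@example.com
git config user.name ci
echo "replicas: 3" > deployment.yaml
git add deployment.yaml
git commit -qm "chore: baseline"
echo "replicas: 5" > deployment.yaml
git commit -qam "feat(ingestion): scale out"   # stands in for a squashed release
git revert -n HEAD                             # single revert undoes the release
git commit -qm "revert: roll back release"
grep "replicas: 3" deployment.yaml && echo "rollback ok"
```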

Environment Promotion Flow

Promotion Flow Diagram:

graph LR
    A[Feature Branch] -->|PR + Merge| B[Dev Environment]
    B -->|Automated<br/>Schedule/Tag| C[Test Environment]
    C -->|Manual Approval<br/>Regression Tests| D[Staging Environment]
    D -->|CAB Approval<br/>Change Window| E[Production Environment]

    F[Hotfix Branch] -.->|Expedited| E

    style A fill:#90EE90
    style B fill:#90EE90
    style C fill:#FFE5B4
    style D fill:#FFE5B4
    style E fill:#ffcccc
    style F fill:#ff9999

Promotion Flow Details:

| From | To | Method | Trigger | Approval Required | Automated Testing |
|------|----|--------|---------|-------------------|-------------------|
| Feature | Dev | Automatic | PR merge | ❌ No (PR approval only) | ✅ PR validation |
| Dev | Test | Automated | Schedule/Tag | ❌ No | ✅ Smoke tests |
| Test | Staging | Manual | On-demand | ✅ 2 approvers | ✅ Regression tests |
| Staging | Production | Manual | Change window | ✅ CAB (2 approvers) | ✅ Full test suite |
| Hotfix | Production | Expedited | Critical issue | ✅ 2 approvers | ✅ Hotfix tests |

Feature → Dev (Automatic After PR Merge)

Automatic Promotion Process:

# Azure Pipeline: Auto-promote to Dev
trigger:
  branches:
    include:
      - dev

pool:
  vmImage: 'ubuntu-latest'

stages:
- stage: PromoteToDev
  displayName: 'Promote to Dev Environment'
  jobs:
  - job: DeployToDev
    steps:
    - task: Kubernetes@1
      inputs:
        connectionType: 'Azure Resource Manager'
        azureSubscriptionEndpoint: 'ATP-AKS-Connection'
        azureResourceGroup: 'ATP-Dev-EUS-RG'
        kubernetesCluster: 'atp-dev-eus-aks'
        namespace: 'atp-dev'
        command: 'apply'
        arguments: '-f apps/atp-ingestion/overlays/dev/'
      displayName: 'Apply manifests to Dev cluster'

    - script: |
        # Verify deployment
        kubectl rollout status deployment/atp-ingestion -n atp-dev --timeout=300s
        echo "✅ Deployment to Dev successful"
      displayName: 'Verify deployment'

Dev → Test (Automatic, Triggered by Schedule or Tag)

Automated Promotion to Test:

# Azure Pipeline: Auto-promote to Test
schedules:
- cron: "0 2 * * *"  # Daily at 2 AM UTC
  branches:
    include:
      - dev
  displayName: 'Daily promotion to Test'

trigger:
  branches:
    include:
      - dev
  tags:
    include:
      - promote-to-test/*

pool:
  vmImage: 'ubuntu-latest'

stages:
- stage: PromoteToTest
  displayName: 'Promote to Test Environment'
  jobs:
  - job: DeployToTest
    steps:
    - script: |
        # Tag current dev commit
        git tag -a "test-$(date +%Y%m%d-%H%M%S)" -m "Promote to Test: $(Build.SourceVersion)"
        git push origin --tags
      displayName: 'Tag promotion'

    - task: Kubernetes@1
      inputs:
        connectionType: 'Azure Resource Manager'
        azureSubscriptionEndpoint: 'ATP-AKS-Connection'
        azureResourceGroup: 'ATP-Test-EUS-RG'
        kubernetesCluster: 'atp-test-eus-aks'
        namespace: 'atp-test'
        command: 'apply'
        arguments: '-f apps/atp-ingestion/overlays/test/'
      displayName: 'Apply manifests to Test cluster'

    - script: |
        # Run smoke tests
        ./scripts/run-smoke-tests.sh --environment test
      displayName: 'Run smoke tests'

Manual Trigger for Test Promotion:

# Tag dev branch to trigger promotion to test
git checkout dev
git pull origin dev
git tag -a "promote-to-test/v1.2.3" -m "Promote version 1.2.3 to Test"
git push origin --tags

Test → Staging (Manual Approval, Regression Tests)

Manual Promotion to Staging:

# Azure Pipeline: Manual promotion to Staging
trigger: none  # Manual trigger only

parameters:
- name: promoteVersion
  displayName: 'Version to Promote'
  type: string
  default: 'latest'

pool:
  vmImage: 'ubuntu-latest'

stages:
- stage: PromoteToStaging
  displayName: 'Promote to Staging Environment'
  jobs:
  - job: DeployToStaging
    steps:
    - script: |
        # Checkout test branch at specified version
        git checkout test
        git pull origin test
        git checkout "${{ parameters.promoteVersion }}"
      displayName: 'Checkout version'

    - task: Kubernetes@1
      inputs:
        connectionType: 'Azure Resource Manager'
        azureSubscriptionEndpoint: 'ATP-AKS-Connection'
        azureResourceGroup: 'ATP-Staging-EUS-RG'
        kubernetesCluster: 'atp-staging-eus-aks'
        namespace: 'atp-staging'
        command: 'apply'
        arguments: '-f apps/atp-ingestion/overlays/staging/'
      displayName: 'Apply manifests to Staging cluster'

    - script: |
        # Run full regression test suite
        ./scripts/run-regression-tests.sh --environment staging
      displayName: 'Run regression tests'

Pre-Promotion Checklist:

  • ✅ All test environment tests passing
  • ✅ Regression test suite passing
  • ✅ Performance benchmarks met
  • ✅ Security scans passing
  • ✅ Documentation updated
  • ✅ Rollback plan documented
  • ✅ 2 approvers approved

Staging → Production (CAB Approval, Change Window)

Production Promotion Process:

# Azure Pipeline: Production promotion (requires manual approval)
trigger: none  # Manual trigger only

parameters:
- name: promoteVersion
  displayName: 'Version to Promote to Production'
  type: string
  default: 'latest'
- name: changeWindow
  displayName: 'Change Window'
  type: string
  default: '2024-01-15 02:00 UTC'

pool:
  vmImage: 'ubuntu-latest'

stages:
- stage: ApprovalGate
  displayName: 'Change Advisory Board Approval'
  jobs:
  - job: WaitForApproval
    displayName: 'Wait for CAB Approval'
    pool: server
    steps:
    - task: ManualValidation@0
      timeoutInMinutes: 1440  # 24 hours
      inputs:
        notifyUsers: 'architect-team@connectsoft.example;sre-lead@connectsoft.example'
        instructions: 'Review and approve production promotion'

- stage: PromoteToProduction
  displayName: 'Promote to Production Environment'
  dependsOn: ApprovalGate
  condition: succeeded()
  jobs:
  - job: DeployToProduction
    steps:
    - script: |
        # Verify change window
        CURRENT_TIME=$(date -u +%s)
        WINDOW_START=$(date -u -d "${{ parameters.changeWindow }}" +%s)
        if [ $CURRENT_TIME -lt $WINDOW_START ]; then
          echo "⏳ Waiting for change window..."
          sleep $((WINDOW_START - CURRENT_TIME))
        fi
      displayName: 'Wait for change window'

    - script: |
        git checkout staging
        git pull origin staging
        git checkout "${{ parameters.promoteVersion }}"
      displayName: 'Checkout version'

    - task: Kubernetes@1
      inputs:
        connectionType: 'Azure Resource Manager'
        azureSubscriptionEndpoint: 'ATP-AKS-Connection'
        azureResourceGroup: 'ATP-Production-EUS-RG'
        kubernetesCluster: 'atp-prod-eus-aks'
        namespace: 'atp-production'
        command: 'apply'
        arguments: '-f apps/atp-ingestion/overlays/production/'
      displayName: 'Apply manifests to Production cluster'

    - script: |
        # Verify deployment
        kubectl rollout status deployment/atp-ingestion -n atp-production --timeout=600s
        echo "✅ Production deployment successful"
      displayName: 'Verify deployment'

    - script: |
        # Run production smoke tests
        ./scripts/run-production-smoke-tests.sh
      displayName: 'Run production smoke tests'

Change Window Schedule:

| Day | Window | Rationale |
|-----|--------|-----------|
| Monday - Thursday | 02:00 - 04:00 UTC | Low traffic period |
| Friday | No deployments | Weekend preparation |
| Saturday - Sunday | Emergency only | Minimal staffing |

Automated Promotion (Dev and Test)

Trigger Mechanisms (Schedule, Tags, Webhooks)

Schedule-Based Promotion:

# Azure Pipeline: Scheduled promotion
schedules:
- cron: "0 2 * * *"  # Daily at 2 AM UTC
  branches:
    include:
      - dev
  displayName: 'Daily Dev → Test Promotion'
  always: false  # Only if changes detected

Tag-Based Promotion:

# Create promotion tag
git checkout dev
git pull origin dev
git tag -a "promote-to-test/v1.2.3" -m "Promote version 1.2.3 to Test environment"
git push origin --tags

# Pipeline triggered automatically

Webhook Trigger:

# Azure Pipeline: Webhook trigger
resources:
  webhooks:
  - webhook: promotion-webhook
    connection: GitHubWebhook
    filters:
    - path: body.ref
      value: refs/heads/dev

Automated Testing Gates

Testing Gates Configuration:

# .azuredevops/pipelines/promotion-test-gates.yml
stages:
- stage: TestingGates
  displayName: 'Automated Testing Gates'
  jobs:
  - job: SmokeTests
    displayName: 'Smoke Tests'
    steps:
    - script: |
        ./scripts/run-smoke-tests.sh --environment test
      displayName: 'Run smoke tests'

  - job: IntegrationTests
    displayName: 'Integration Tests'
    steps:
    - script: |
        ./scripts/run-integration-tests.sh --environment test
      displayName: 'Run integration tests'

  - job: PerformanceTests
    displayName: 'Performance Benchmarks'
    steps:
    - script: |
        ./scripts/run-performance-tests.sh --environment test
      displayName: 'Run performance tests'

    - script: |
        # Validate performance metrics
        METRICS=$(cat performance-results.json)
        P95_LATENCY=$(echo "$METRICS" | jq '.p95_latency')
        if (( $(echo "$P95_LATENCY > 500" | bc -l) )); then
          echo "❌ Performance regression: P95 latency $P95_LATENCY > 500ms"
          exit 1
        fi
        echo "✅ Performance tests passed"
      displayName: 'Validate performance'

Rollback on Failure

Automatic Rollback Script:

#!/bin/bash
# scripts/auto-rollback.sh

set -euo pipefail

ENVIRONMENT="${1:-test}"
DEPLOYMENT_NAME="${2:-atp-ingestion}"
NAMESPACE="atp-${ENVIRONMENT}"

echo "🔄 Rolling back deployment $DEPLOYMENT_NAME in $NAMESPACE..."

# Get previous revision
PREVIOUS_REVISION=$(kubectl rollout history deployment/$DEPLOYMENT_NAME -n $NAMESPACE | tail -n 2 | head -n 1 | awk '{print $1}')

if [ -z "$PREVIOUS_REVISION" ]; then
  echo "❌ No previous revision found"
  exit 1
fi

# Rollback
kubectl rollout undo deployment/$DEPLOYMENT_NAME -n $NAMESPACE

# Wait for rollback
kubectl rollout status deployment/$DEPLOYMENT_NAME -n $NAMESPACE --timeout=300s

echo "✅ Rollback successful to revision $PREVIOUS_REVISION"

Notification and Alerting

Promotion Notification:

# Azure Pipeline: Notification stage
- stage: Notify
  displayName: 'Send Notifications'
  condition: always()  # Always run, even on failure
  jobs:
  - job: NotifyTeam
    steps:
    - task: Slack@1
      inputs:
        endpoint: 'ATP-Slack-Connection'
        channel: '#atp-deployments'
        message: |
          🚀 Promotion to ${{ parameters.environment }} Environment

          *Version*: ${{ parameters.promoteVersion }}
          *Status*: $(Agent.JobStatus)
          *Pipeline*: $(Build.BuildNumber)
          *Author*: $(Build.RequestedFor)

          *Changes*:
          $(git log --oneline -10)
      displayName: 'Send Slack notification'

    - task: SendEmail@1
      condition: eq(variables['Agent.JobStatus'], 'Failed')
      inputs:
        to: 'sre-oncall@connectsoft.example'
        subject: '❌ Production Promotion Failed'
        body: 'Production promotion failed. Check pipeline: $(Build.BuildUri)'
      displayName: 'Send alert email'

Manual Promotion (Staging and Production)

Approval Gates Configuration

Azure DevOps Approval Gates:

# Azure DevOps environment: Production
environments:
- name: Production
  approvals:
  - approvers:
    - architect-team@connectsoft.example
    - sre-lead@connectsoft.example
    count: 2  # Require 2 approvals
    timeoutInMinutes: 1440  # 24 hours
  checks:
  - type: AzureMonitor
    properties:
      actionGroupName: production-promotion-alerts
  - type: InvokeRESTAPI
    properties:
      url: 'https://api.connectsoft.example/change-window/validate'
      method: 'POST'

Change Advisory Board (CAB) Process

CAB Approval Checklist:

  1. Change Request Review:
     • ✅ Change description clear
     • ✅ Risk assessment completed
     • ✅ Rollback plan documented
     • ✅ Testing evidence provided
     • ✅ Impact analysis completed

  2. Technical Review:
     • ✅ Architecture review approved
     • ✅ Security review approved
     • ✅ Performance impact assessed
     • ✅ Dependency analysis completed

  3. Operational Review:
     • ✅ Runbook updated
     • ✅ Monitoring alerts configured
     • ✅ On-call engineer notified
     • ✅ Change window scheduled

CAB Meeting Agenda:

  • Review pending change requests
  • Assess risk and impact
  • Approve/reject change requests
  • Schedule change windows
  • Document decisions

Change Window Scheduling

Change Window Policy:

| Environment | Days | Time Window | Restrictions |
|-------------|------|-------------|--------------|
| Dev | Any day | 24/7 | None |
| Test | Mon-Fri | 24/7 | None |
| Staging | Mon-Thu | 02:00-04:00 UTC | No Friday deployments |
| Production | Mon-Thu | 02:00-04:00 UTC | No Friday/weekend deployments |
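The production window (Mon-Thu, 02:00-04:00 UTC) can be checked mechanically before a deploy. A minimal sketch using GNU `date`; the helper name and the epoch-override argument are assumptions for illustration:

```shell
# Succeed only when the given epoch (default: now) falls inside the
# production change window: Monday-Thursday, 02:00-04:00 UTC.
in_change_window() {
  ts="${1:-$(date -u +%s)}"
  dow=$(date -u -d "@$ts" +%u)     # 1=Monday ... 7=Sunday
  hour=$(date -u -d "@$ts" +%H)
  [ "$dow" -le 4 ] && [ "$hour" -ge 2 ] && [ "$hour" -lt 4 ]
}

# 2024-01-15 02:30 UTC is a Monday inside the window:
if in_change_window "$(date -u -d '2024-01-15 02:30 UTC' +%s)"; then
  echo "inside window"
fi
```

The same function could gate the "Wait for change window" step in the production pipeline above.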

Schedule Change Window:

# Azure DevOps CLI: Schedule change
az pipelines variable-group create \
  --name "Change-Window-2024-01-15" \
  --variables \
    changeWindowStart="2024-01-15T02:00:00Z" \
    changeWindowEnd="2024-01-15T04:00:00Z" \
    changeOwner="john.doe@connectsoft.example"

Pre-Deployment Checklists

Pre-Deployment Checklist:

## Pre-Deployment Checklist

### Technical Readiness
- [ ] All tests passing (unit, integration, E2E)
- [ ] Performance benchmarks met
- [ ] Security scans passing (SAST, DAST, dependency scan)
- [ ] Manifest validation passing
- [ ] Helm charts validated
- [ ] Kustomize builds successful

### Documentation
- [ ] Release notes updated
- [ ] API documentation updated
- [ ] Runbook updated
- [ ] Architecture diagrams updated

### Operations
- [ ] Monitoring dashboards configured
- [ ] Alerts configured
- [ ] Logging configured
- [ ] Backup strategy verified
- [ ] Rollback procedure tested

### Communication
- [ ] Stakeholders notified
- [ ] On-call engineer notified
- [ ] Support team briefed
- [ ] Customer communication prepared (if needed)

### Change Management
- [ ] Change request created
- [ ] CAB approval obtained
- [ ] Change window scheduled
- [ ] Risk assessment completed

Post-Deployment Validation

Post-Deployment Validation Script:

#!/bin/bash
# scripts/post-deployment-validation.sh

set -euo pipefail

ENVIRONMENT="${1:-production}"
NAMESPACE="atp-${ENVIRONMENT}"

echo "✅ Post-Deployment Validation for $ENVIRONMENT"

# Health checks
echo "1. Health Checks"
kubectl get pods -n $NAMESPACE -l app=atp-ingestion
kubectl wait --for=condition=ready pod -l app=atp-ingestion -n $NAMESPACE --timeout=300s

# Service endpoints
echo "2. Service Endpoints"
kubectl get endpoints atp-ingestion -n $NAMESPACE

# Metrics
echo "3. Metrics"
kubectl exec -n $NAMESPACE deployment/atp-ingestion -- \
  curl -s http://localhost:9090/metrics | grep -q "http_requests_total" && \
  echo "✅ Metrics endpoint responding"

# Smoke tests
echo "4. Smoke Tests"
./scripts/run-smoke-tests.sh --environment $ENVIRONMENT

echo "✅ Post-deployment validation complete"

Hotfix Workflow

Hotfix Workflow Diagram:

graph TD
    A[Production Issue Detected] --> B[Create Hotfix Branch<br/>from production]
    B --> C[Implement Fix<br/>in Hotfix Branch]
    C --> D[PR to Production<br/>Expedited Review]
    D --> E[2 Approvers<br/>Required]
    E --> F[Merge to Production]
    F --> G[Deploy to Production]
    G --> H[Verify Fix]
    H --> I[Back-merge to<br/>dev/test/staging]

    style A fill:#ffcccc
    style B fill:#ff9999
    style F fill:#ffcccc
    style I fill:#90EE90

Creating Hotfix Branch from Production

Hotfix Branch Creation:

# 1. Checkout production branch
git checkout production
git pull origin production

# 2. Create hotfix branch
git checkout -b hotfix/atp-gateway-security-patch-CVE-2024-1234

# 3. Push hotfix branch
git push -u origin hotfix/atp-gateway-security-patch-CVE-2024-1234

# 4. Apply fix
vim apps/atp-gateway/base/deployment.yaml
git add apps/atp-gateway/base/deployment.yaml
git commit -S -m "fix(gateway): patch security vulnerability CVE-2024-1234

URGENT: Security patch for authentication bypass vulnerability.

Related to: ATP-9999 (Critical Security Issue)"

Expedited Approval Process

Hotfix PR Creation:

# Create hotfix PR with expedited flag
az repos pr create \
  --source-branch hotfix/atp-gateway-security-patch-CVE-2024-1234 \
  --target-branch production \
  --title "🚨 HOTFIX: Security patch CVE-2024-1234" \
  --description "Urgent security fix. Requires expedited review." \
  --work-items 9999 \
  --reviewers "architect-team@connectsoft.example;sre-lead@connectsoft.example" \
  --auto-complete false \
  --bypass-policy false  # Still requires 2 approvers

Expedited Review Checklist:

  • ✅ Security issue verified (CVE, vulnerability scan)
  • ✅ Fix validated (local testing, security review)
  • ✅ Impact assessment completed
  • ✅ Rollback plan documented
  • ✅ 2 approvers from architecture/SRE teams

Testing in Hotfix Environment

Hotfix Testing:

# Deploy to hotfix test environment
kubectl apply -f apps/atp-gateway/overlays/hotfix-test/ \
  --namespace atp-hotfix-test

# Run critical path tests
./scripts/run-critical-path-tests.sh --environment hotfix-test

# Security validation
./scripts/run-security-tests.sh --environment hotfix-test \
  --focus CVE-2024-1234

Fast-Track Merge to Production

Hotfix Merge Process:

# After approval, merge hotfix
az repos pr update \
  --id <PR_ID> \
  --status completed \
  --squash true  # az repos pr update takes --squash, not --merge-strategy

# Verify merge
git checkout production
git pull origin production
git log --oneline -5

# Tag hotfix release
git tag -a "hotfix/v1.2.4-CVE-2024-1234" \
  -m "Hotfix: Security patch CVE-2024-1234"
git push origin --tags

Back-Merge to Dev/Test/Staging

Back-Merge Process:

# Back-merge to staging
git checkout staging
git pull origin staging
git merge production --no-ff -m "Merge hotfix from production: CVE-2024-1234"
git push origin staging

# Back-merge to test
git checkout test
git pull origin test
git merge production --no-ff -m "Merge hotfix from production: CVE-2024-1234"
git push origin test

# Back-merge to dev
git checkout dev
git pull origin dev
git merge production --no-ff -m "Merge hotfix from production: CVE-2024-1234"
git push origin dev

Preview Environments

Ephemeral Namespace per PR

Preview Environment Configuration:

# .azuredevops/pipelines/preview-environment.yml
trigger: none

pr:
  branches:
    include:
      - feature/*
      - bugfix/*

pool:
  vmImage: 'ubuntu-latest'

variables:
  previewNamespace: 'preview-pr-$(System.PullRequest.PullRequestId)'

stages:
- stage: CreatePreview
  displayName: 'Create Preview Environment'
  jobs:
  - job: ProvisionPreview
    steps:
    - script: |
        # Create preview namespace
        kubectl create namespace $(previewNamespace) \
          --dry-run=client -o yaml | kubectl apply -f -

        # Label namespace
        kubectl label namespace $(previewNamespace) \
          environment=preview \
          pr-id=$(System.PullRequest.PullRequestId) \
          managed-by=fluxcd

        echo "✅ Preview namespace created: $(previewNamespace)"
      displayName: 'Create preview namespace'

    - task: Kubernetes@1
      inputs:
        connectionType: 'Azure Resource Manager'
        azureSubscriptionEndpoint: 'ATP-AKS-Connection'
        azureResourceGroup: 'ATP-Dev-EUS-RG'
        kubernetesCluster: 'atp-dev-eus-aks'
        namespace: '$(previewNamespace)'
        command: 'apply'
        arguments: |
          -f apps/atp-ingestion/base/ \
          --namespace $(previewNamespace)
      displayName: 'Deploy to preview namespace'

    - script: |
        # Wait for deployment
        kubectl wait --for=condition=available \
          deployment/atp-ingestion \
          -n $(previewNamespace) \
          --timeout=300s

        # Get preview URL
        PREVIEW_URL=$(kubectl get ingress atp-ingestion \
          -n $(previewNamespace) \
          -o jsonpath='{.spec.rules[0].host}')

        echo "##vso[task.setvariable variable=PreviewUrl]$PREVIEW_URL"
        echo "✅ Preview environment ready: https://$PREVIEW_URL"
      displayName: 'Wait for deployment'

    - script: |
        # Note: `az repos pr set-vote` has no comment flag; posting the
        # URL as a PR comment requires the Pull Request Threads REST API.
        # Surface the preview URL in the pipeline log instead.
        echo "Preview environment: https://$(PreviewUrl)"
      displayName: 'Publish preview URL'

- stage: CleanupPreview
  displayName: 'Cleanup Preview Environment'
  condition: always()
  jobs:
  - job: DeletePreview
    steps:
    - script: |
        # Delete preview namespace
        kubectl delete namespace $(previewNamespace) --ignore-not-found=true
        echo "✅ Preview namespace deleted: $(previewNamespace)"
      displayName: 'Delete preview namespace'

Automatic Provisioning on PR Creation

PR Webhook Trigger:

# Azure Pipeline: Trigger on PR creation
resources:
  webhooks:
  - webhook: pr-webhook
    connection: AzureReposWebhook
    filters:
    - path: eventType
      value: git.pullrequest.created

Testing Isolated Changes

Preview Environment Testing:

# Test preview environment
PREVIEW_URL="https://preview-pr-123.ingestion.atp.connectsoft.example"

# Health check
curl "$PREVIEW_URL/health/ready"

# Smoke tests
curl "$PREVIEW_URL/api/v1/audit/records" \
  -H "Authorization: Bearer $PREVIEW_API_KEY"

# Integration tests
./scripts/run-integration-tests.sh \
  --base-url "$PREVIEW_URL" \
  --environment preview

Auto-Deletion After PR Close

Cleanup on PR Close:

# Azure Pipeline: Cleanup on PR close
resources:
  webhooks:
  - webhook: pr-webhook
    connection: AzureReposWebhook
    filters:
    - path: eventType
      value: git.pullrequest.closed

stages:
- stage: Cleanup
  displayName: 'Cleanup Preview Environment'
  jobs:
  - job: DeletePreview
    steps:
    - script: |
        # PR merge refs look like refs/pull/<id>/merge, so extract the
        # numeric segment rather than the last path component
        PR_ID=$(echo "$(Build.SourceBranch)" | sed -n 's#^refs/pull/\([0-9]*\)/merge$#\1#p')
        PREVIEW_NAMESPACE="preview-pr-${PR_ID}"

        kubectl delete namespace $PREVIEW_NAMESPACE --ignore-not-found=true
        echo "✅ Preview namespace $PREVIEW_NAMESPACE deleted"
      displayName: 'Delete preview namespace'

Git Commit Message Conventions

Conventional Commits Format

Conventional Commits Specification:

<type>(<scope>): <subject>

<body>

<footer>

Commit Types:

| Type | Description | Example |
|------|-------------|---------|
| feat | New feature | `feat(ingestion): add gRPC endpoint` |
| fix | Bug fix | `fix(query): resolve memory leak` |
| docs | Documentation | `docs(gitops): update deployment guide` |
| style | Code style (formatting) | `style(*): format YAML files` |
| refactor | Code refactoring | `refactor(gateway): simplify auth logic` |
| test | Tests | `test(integrity): add unit tests` |
| chore | Maintenance | `chore(infra): update Helm charts` |
| perf | Performance | `perf(query): optimize database queries` |
| ci | CI/CD | `ci(pipelines): add validation stage` |
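A `commit-msg` hook can enforce this header format. A minimal sketch covering the types listed above plus the `!` breaking-change marker; the regex is an assumption, a simplification of the full Conventional Commits grammar, and a real hook would read the message from `"$1"`:

```shell
# Validate the <type>(<scope>): <subject> header against the types
# documented in the table above.
pattern='^(feat|fix|docs|style|refactor|test|chore|perf|ci)(\([a-z0-9*,-]+\))?!?: .+'
check_header() {
  echo "$1" | grep -Eq "$pattern"
}

check_header "feat(ingestion): add gRPC endpoint" && echo "ok: feature"
check_header "feat(gateway)!: remove legacy authentication" && echo "ok: breaking change"
check_header "added some stuff" || echo "rejected: no type prefix"
```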

Linking to Work Items

Work Item References:

# Link to Azure DevOps work item
git commit -m "feat(ingestion): add gRPC endpoint

Related to: ATP-1234"

# Link multiple work items
git commit -m "fix(query): resolve multiple issues

Fixes: ATP-1234, ATP-5678
Closes: ATP-9999"

# Reference work item in footer
git commit -m "feat(gateway): implement OAuth 2.0

Implements ATP-2345
See also: ATP-2346"
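For release notes or hook tooling, the work item references can be extracted mechanically. A minimal sketch following the `ATP-####` pattern used in the examples above:

```shell
# Sketch: pull the ATP-#### work item references out of a commit message,
# as a commit-msg hook or release-notes script might.
extract_work_items() {
  grep -oE 'ATP-[0-9]+' <<< "$1" | sort -u | xargs
}

msg="fix(query): resolve multiple issues

Fixes: ATP-1234, ATP-5678
Closes: ATP-9999"

extract_work_items "$msg"   # ATP-1234 ATP-5678 ATP-9999
```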

Semantic Prefix Examples

Complete Commit Message Examples:

# Feature with scope
git commit -m "feat(ingestion): add Redis cache support

- Add Redis connection configuration
- Implement cache layer for audit records
- Add cache health checks

Related to: ATP-1234"

# Breaking change
git commit -m "feat(gateway)!: remove legacy authentication

BREAKING CHANGE: Legacy API key authentication removed.
Migrate to OAuth 2.0 before deploying this change.

Migration guide: https://docs.connectsoft.example/migration/oauth

Closes: ATP-5678"

# Hotfix
git commit -m "fix(gateway): patch security vulnerability CVE-2024-1234

URGENT: Security patch for authentication bypass vulnerability.

Related to: ATP-9999 (Critical Security Issue)"

# Configuration change
git commit -m "chore(infra): update resource limits for production

- Increase CPU limit to 2000m
- Increase memory limit to 2Gi
- Update HPA min replicas to 5

Related to: ATP-3456"

Commit Message Templates

Git Commit Template (.gitmessage):

# <type>(<scope>): <subject>
# 
# <body>
# 
# <footer>
#
# Type: feat, fix, docs, style, refactor, test, chore, perf, ci
# Scope: ingestion, query, gateway, integrity, export, policy, search, infra
# 
# Examples:
# feat(ingestion): add gRPC endpoint
# fix(query): resolve memory leak
# docs(gitops): update deployment guide
#
# Related work items: ATP-1234

Configure Git to Use Template:

# Set commit template
git config --global commit.template .gitmessage

# Or per repository
git config commit.template .gitmessage

Release Tagging Strategy

Tagging Releases for Production

Production Release Tagging:

# Tag production release
git checkout production
git pull origin production

# Create annotated tag
git tag -a "v1.2.3" \
  -m "Release v1.2.3

Features:
- Add gRPC endpoint to ingestion service
- Implement Redis cache for query service
- Security enhancements

Breaking Changes:
- None

Related work items: ATP-1234, ATP-5678"

# Push tags
git push origin --tags

Tag Verification:

# List tags
git tag -l "v*"

# Show tag details
git show v1.2.3

# Verify tag signature (if GPG signed)
git tag -v v1.2.3

Service-Specific vs Environment-Wide Tags

Service-Specific Tags:

# Tag specific service version
git tag -a "atp-ingestion/v1.2.3" \
  -m "ATP Ingestion Service v1.2.3"

git tag -a "atp-query/v1.3.0" \
  -m "ATP Query Service v1.3.0"

Environment-Wide Tags:

# Tag environment release
git tag -a "production/2024-01-15" \
  -m "Production Release 2024-01-15

Services:
- atp-ingestion: v1.2.3
- atp-query: v1.3.0
- atp-gateway: v1.1.5"

git tag -a "staging/2024-01-10" \
  -m "Staging Release 2024-01-10"

Tag Naming Conventions

Tag Naming Standards:

| Tag Type | Format | Example | Use Case |
|----------|--------|---------|----------|
| Release | v{MAJOR}.{MINOR}.{PATCH} | v1.2.3 | Production releases |
| Pre-release | v{MAJOR}.{MINOR}.{PATCH}-{PRERELEASE} | v1.2.3-rc1 | Release candidates |
| Service | {service}/v{VERSION} | atp-ingestion/v1.2.3 | Service-specific |
| Environment | {env}/{DATE} | production/2024-01-15 | Environment snapshots |
| Hotfix | hotfix/v{VERSION}-{ISSUE} | hotfix/v1.2.4-CVE-2024-1234 | Hotfixes |
| Promotion | promote-to-{env}/{VERSION} | promote-to-test/v1.2.3 | Promotion triggers |
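Tooling that reacts to tags (promotion pipelines, release automation) needs to tell these formats apart. A minimal classifier sketch using shell glob patterns, ordered so the more specific forms win:

```shell
# Sketch: classify a Git tag against the naming standards above.
# Patterns are ordered most-specific first (hotfix/promotion before service).
classify_tag() {
  case "$1" in
    hotfix/v*)                  echo "hotfix" ;;
    promote-to-*/*)             echo "promotion" ;;
    v[0-9]*.[0-9]*.[0-9]*-*)    echo "pre-release" ;;
    v[0-9]*.[0-9]*.[0-9]*)      echo "release" ;;
    */v[0-9]*)                  echo "service" ;;
    */[0-9][0-9][0-9][0-9]-*)   echo "environment" ;;
    *)                          echo "unknown" ;;
  esac
}

classify_tag "v1.2.3"                       # release
classify_tag "v1.2.3-rc1"                   # pre-release
classify_tag "atp-ingestion/v1.2.3"         # service
classify_tag "hotfix/v1.2.4-CVE-2024-1234"  # hotfix
```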

Automated Tag Creation

Automated Tagging in CI/CD:

# Azure Pipeline: Auto-tag on production merge
trigger:
  branches:
    include:
      - production

pool:
  vmImage: 'ubuntu-latest'

stages:
- stage: TagRelease
  displayName: 'Tag Production Release'
  jobs:
  - job: CreateTag
    steps:
    - script: |
        # Extract the release version from the Helm chart
        # (Chart.yaml carries a top-level "version:" field; a Deployment manifest does not)
        VERSION=$(grep -E '^version:' apps/atp-ingestion/helm/Chart.yaml | awk '{print $2}' | tr -d '"')

        # Create tag
        git config user.name "Azure DevOps"
        git config user.email "devops@connectsoft.example"

        git tag -a "v$VERSION" \
          -m "Release v$VERSION

        Automated release from commit $(Build.SourceVersion)
        Pipeline: $(Build.BuildNumber)"

        git push origin "v$VERSION"

        echo "✅ Tagged release: v$VERSION"
      displayName: 'Create and push tag'

Rollback via Git Revert

Simple Rollback (Single Service)

Single Service Rollback:

# 1. Identify commit to revert
git log --oneline production | grep "atp-ingestion"
# abc123 feat(ingestion): add gRPC endpoint

# 2. Revert commit
git checkout production
git pull origin production
git revert abc123 --no-edit

# 3. Push revert commit
git push origin production

# 4. Verify rollback
git log --oneline -5 production
# def456 Revert "feat(ingestion): add gRPC endpoint"
# abc123 feat(ingestion): add gRPC endpoint

Complex Rollback (Multiple Services)

Multiple Service Rollback:

# 1. Identify commits to revert
git log --oneline production | grep -E "(ingestion|query|gateway)" | head -5
# abc123 feat(ingestion): add gRPC endpoint
# def456 feat(query): add Redis cache
# ghi789 feat(gateway): update authentication

# 2. Create rollback branch
git checkout production
git pull origin production
git checkout -b rollback/2024-01-15-multiple-services

# 3. Revert commits (newest first)
git revert ghi789 --no-edit  # Gateway
git revert def456 --no-edit  # Query
git revert abc123 --no-edit  # Ingestion

# 4. Test rollback
kubectl apply --dry-run=client -f apps/ --recursive

# 5. Create PR for rollback
az repos pr create \
  --source-branch rollback/2024-01-15-multiple-services \
  --target-branch production \
  --title "ROLLBACK: Multiple services 2024-01-15" \
  --description "Revert changes to ingestion, query, and gateway services"

# 6. After approval, merge
az repos pr update --id <PR_ID> --status completed

Git Revert vs Reset

Git Revert vs Reset Comparison:

| Method | Command | History | Safety | Use Case |
|--------|---------|---------|--------|----------|
| Revert | git revert | Preserves (creates new commit) | ✅ Safe (non-destructive) | Production rollback |
| Reset | git reset --hard | Rewrites (destroys commits) | ❌ Dangerous (destructive) | Development only |

Git Reset (Development Only):

# ⚠️ WARNING: Only use in development branches!
git checkout dev
git reset --hard HEAD~3  # Remove last 3 commits
git push --force origin dev  # Force push (destructive!)

Git Revert (Production):

# ✅ Safe for production
git checkout production
git revert abc123  # Creates new commit undoing abc123
git push origin production  # Safe push

Rollback Validation

Rollback Validation Script:

#!/bin/bash
# scripts/validate-rollback.sh

set -euo pipefail

COMMIT_TO_REVERT="${1:-HEAD}"
NAMESPACE="${2:-atp-production}"

echo "🔄 Validating rollback for commit $COMMIT_TO_REVERT..."

# 1. Preview rollback changes
git revert --no-commit $COMMIT_TO_REVERT
git diff --stat

# 2. Validate manifests
find apps/ -name "*.yaml" -path "*/base/*" | while read manifest; do
  kubeval "$manifest" || exit 1
done

# 3. Dry-run apply
kubectl apply --dry-run=client -f apps/ --recursive

# 4. Check for breaking changes
git log $COMMIT_TO_REVERT -1 --pretty=format:"%B" | grep -i "BREAKING" && \
  echo "⚠️  WARNING: Reverting a breaking change!" || \
  echo "✅ No breaking changes detected"

# 5. Restore state
git reset --hard HEAD

echo "✅ Rollback validation complete"

Execute Rollback:

# Validate rollback
./scripts/validate-rollback.sh abc123 atp-production

# Execute rollback
git checkout production
git pull origin production
git revert abc123 --no-edit

# Apply to cluster
kubectl apply -f apps/atp-ingestion/overlays/production/

# Verify rollback
kubectl rollout status deployment/atp-ingestion -n atp-production
kubectl get pods -n atp-production -l app=atp-ingestion

# Run smoke tests
./scripts/run-smoke-tests.sh --environment production

echo "✅ Rollback completed and verified"

Summary: Git Workflow & Environment Promotion

  • Feature Branch Workflow: Git-centric development with conventional commits and GPG signing
  • Pull Request Process: Comprehensive PR templates, automated validation, and approval workflows
  • Automated PR Validation: YAML linting, security scanning, dry-run validation, breaking change detection
  • Merge Strategies: Squash merge (production), merge commit (test), rebase (dev)
  • Environment Promotion: Automated (dev→test), manual (test→staging→production) with CAB approval
  • Hotfix Workflow: Expedited process with back-merge to all environments
  • Preview Environments: Ephemeral namespaces per PR for isolated testing
  • Commit Conventions: Conventional commits with work item linking
  • Release Tagging: Semantic versioning with service-specific and environment-wide tags
  • Rollback Procedures: Git revert for safe production rollbacks with validation

Azure Pipelines to GitOps Handoff

Purpose: Define how Azure Pipelines (CI) hand off to the GitOps workflow by automating artifact publishing, manifest updates, and Git commits, ensuring a clear separation of concerns between build/test (CI) and deployment/reconciliation (GitOps).


Separation of Concerns

CI Pipeline Responsibilities (Build, Test, Publish)

CI Pipeline Scope (Azure Pipelines):

| Responsibility | Description | Examples |
|----------------|-------------|----------|
| Source Code Build | Compile, package applications | dotnet build, npm build |
| Unit Testing | Run unit tests, code coverage | dotnet test, jest |
| Integration Testing | Test service interactions | Test containers, API tests |
| Security Scanning | SAST, DAST, dependency scanning | Snyk, Trivy, OWASP ZAP |
| Artifact Creation | Build Docker images, NuGet packages | docker build, dotnet pack |
| Artifact Publishing | Push to registry (ACR, NuGet feed) | docker push, helm push |
| SBOM Generation | Software Bill of Materials | Syft, SPDX format |
| Metadata Recording | Build provenance, vulnerability reports | In-toto attestations |
| GitOps Manifest Update | Commit image tag updates to GitOps repo | Automated Git commits |

CI Pipeline Boundaries:

# CI Pipeline Responsibilities
✅ Build application code
✅ Run tests (unit, integration, security)
✅ Build and push container images to ACR
✅ Generate and publish SBOM
✅ Update GitOps repository with new image tags
✅ Trigger FluxCD sync (via webhook or polling)

❌ NOT: Deploy directly to Kubernetes
❌ NOT: Manage Kubernetes cluster state
❌ NOT: Handle reconciliation loops
❌ NOT: Monitor cluster health

GitOps Responsibilities (Deploy, Reconcile, Monitor)

GitOps Scope (FluxCD on AKS):

| Responsibility | Description | Examples |
|----------------|-------------|----------|
| Git Repository Watch | Poll Git repository for changes | FluxCD Source Controller |
| Manifest Rendering | Render Kustomize/Helm manifests | FluxCD Kustomize/Helm Controller |
| Cluster Deployment | Apply manifests to Kubernetes | kubectl apply (via FluxCD) |
| State Reconciliation | Detect and correct drift | Continuous reconciliation loop |
| Health Monitoring | Monitor deployment health | FluxCD health checks |
| Rollback Management | Revert to previous Git commits | Git revert operations |

GitOps Boundaries:

# GitOps Responsibilities
✅ Watch Git repository for manifest changes
✅ Render and apply Kubernetes manifests
✅ Reconcile cluster state with Git
✅ Detect and correct configuration drift
✅ Monitor deployment health
✅ Rollback via Git operations

❌ NOT: Build application code
❌ NOT: Run unit/integration tests
❌ NOT: Build container images
❌ NOT: Publish artifacts to registries

Clear Handoff Point: Artifact Publishing

Handoff Architecture:

graph LR
    A[Source Code<br/>Repository] -->|trigger| B[Azure Pipelines<br/>CI Pipeline]
    B -->|build + test| C[Container Image<br/>+ SBOM]
    C -->|push| D[Azure Container<br/>Registry ACR]
    B -->|update manifests| E[GitOps Repository<br/>Azure Repos]
    E -->|watch| F[FluxCD<br/>Source Controller]
    F -->|fetch artifacts| G[FluxCD<br/>Kustomize Controller]
    G -->|deploy| H[AKS Cluster<br/>Production]

    style B fill:#90EE90
    style D fill:#FFE5B4
    style E fill:#90EE90
    style F fill:#FFE5B4
    style G fill:#FFE5B4
    style H fill:#ffcccc

Handoff Criteria:

  1. Artifact Published: Image pushed to ACR with immutable tag
  2. SBOM Generated: Software Bill of Materials published
  3. Vulnerability Scan: Security scan results available
  4. Manifest Updated: GitOps repository contains new image tag
  5. Commit Signed: Git commit signed (for production)
  6. CI Tests Passing: All CI validation gates passed

Handoff Checklist:

## CI → GitOps Handoff Checklist

### Artifacts
- [ ] Container image built and pushed to ACR
- [ ] Image tagged with version + commit SHA (immutable)
- [ ] SBOM generated and published
- [ ] Vulnerability scan completed and results recorded

### Manifest Updates
- [ ] Image tag updated in GitOps repository
- [ ] Kustomize/Helm manifest files updated
- [ ] Changes committed to Git
- [ ] Commit signed (production only)

### Validation
- [ ] All CI tests passing
- [ ] Security scans passing
- [ ] Manifest validation passing
- [ ] Build provenance recorded
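The "Manifest Updated" criterion can be automated as a gate before handing off to FluxCD. A runnable sketch; paths and registry names are illustrative, and a sample kustomization is fabricated in a temp directory so the check is self-contained (a real pipeline would point at the checked-out GitOps repo):

```shell
# Sketch: verify handoff criterion 4 - the GitOps repo must reference
# the tag CI just pushed. Sample data is created so this runs standalone.
IMAGE_TAG="v1.2.3-abc123d"
repo=$(mktemp -d)
mkdir -p "$repo/apps/atp-ingestion/base"
cat > "$repo/apps/atp-ingestion/base/kustomization.yaml" <<EOF
images:
  - name: connectsoft.azurecr.io/atp/ingestion
    newTag: ${IMAGE_TAG}
EOF

if grep -rq "newTag: ${IMAGE_TAG}" "$repo/apps/atp-ingestion"; then
  echo "handoff ok: manifests reference ${IMAGE_TAG}"
else
  echo "handoff blocked: manifests not updated" >&2
  exit 1
fi
```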

Azure Pipeline Stages

Build: Compile, Unit Test

Build Stage:

# azure-pipelines.yml
stages:
- stage: Build
  displayName: 'Build and Test'
  jobs:
  - job: BuildApplication
    displayName: 'Build ATP Ingestion Service'
    pool:
      vmImage: 'ubuntu-latest'

    variables:
      - group: ATP-Common-Variables
      - name: BuildConfiguration
        value: 'Release'
      - name: ServiceName
        value: 'atp-ingestion'

    steps:
    # Checkout source code
    - checkout: self
      fetchDepth: 0  # Full history for version calculation

    # Setup .NET SDK
    - task: UseDotNet@2
      inputs:
        packageType: 'sdk'
        version: '8.0.x'

    # Restore dependencies
    - script: |
        dotnet restore src/ConnectSoft.ATP.Ingestion/ConnectSoft.ATP.Ingestion.csproj
      displayName: 'Restore NuGet packages'

    # Build application
    - script: |
        dotnet build src/ConnectSoft.ATP.Ingestion/ConnectSoft.ATP.Ingestion.csproj \
          --configuration $(BuildConfiguration) \
          --no-restore \
          -p:Version=$(Build.BuildNumber)
      displayName: 'Build application'

    # Run unit tests
    - script: |
        dotnet test src/ConnectSoft.ATP.Ingestion.Tests/ConnectSoft.ATP.Ingestion.Tests.csproj \
          --configuration $(BuildConfiguration) \
          --no-build \
          --collect:"XPlat Code Coverage" \
          --results-directory $(Agent.TempDirectory)/test-results \
          --logger "trx;LogFileName=test-results.trx"
      displayName: 'Run unit tests'
      continueOnError: false

    # Publish test results
    - task: PublishTestResults@2
      condition: always()
      inputs:
        testResultsFormat: 'VSTest'
        testResultsFiles: '$(Agent.TempDirectory)/test-results/**/*.trx'
        testRunTitle: 'Unit Tests - $(ServiceName)'

    # Publish code coverage
    - task: PublishCodeCoverageResults@1
      condition: always()
      inputs:
        codeCoverageTool: 'Cobertura'
        summaryFileLocation: '$(Agent.TempDirectory)/test-results/**/coverage.cobertura.xml'

Test: Integration Test, Security Scan

Test Stage:

- stage: Test
  displayName: 'Integration Tests and Security Scans'
  dependsOn: Build
  jobs:
  - job: IntegrationTests
    displayName: 'Integration Tests'
    pool:
      vmImage: 'ubuntu-latest'

    services:
      postgres: postgres
      redis: redis

    steps:
    - checkout: self

    - task: UseDotNet@2
      inputs:
        packageType: 'sdk'
        version: '8.0.x'

    - script: |
        dotnet test src/ConnectSoft.ATP.Ingestion.IntegrationTests/ \
          --configuration Release \
          --logger "trx;LogFileName=integration-test-results.trx"
      displayName: 'Run integration tests'
      env:
        ConnectionStrings__Database: $(PostgresConnectionString)
        ConnectionStrings__Redis: $(RedisConnectionString)

    - task: PublishTestResults@2
      condition: always()
      inputs:
        testResultsFormat: 'VSTest'
        testResultsFiles: '**/integration-test-results.trx'
        testRunTitle: 'Integration Tests - $(ServiceName)'

  - job: SecurityScan
    displayName: 'Security Scanning'
    pool:
      vmImage: 'ubuntu-latest'

    steps:
    - checkout: self

    # SAST (Static Application Security Testing)
    - task: SnykSecurityScan@1
      inputs:
        serviceConnectionEndpoint: 'Snyk-Service-Connection'
        testType: 'app'
        severityThreshold: 'high'

    # Dependency Vulnerability Scan
    - script: |
        dotnet list package --vulnerable --include-transitive
      displayName: 'Check for vulnerable NuGet packages'

    # Container Image Scan (note: the image is built and pushed in the Publish
    # stage; this scan only succeeds if the tag already exists in ACR)
    - script: |
        trivy image --severity HIGH,CRITICAL \
          connectsoft.azurecr.io/atp/ingestion:$(Build.BuildNumber) \
          --format json \
          --output trivy-scan-results.json
      displayName: 'Scan container image with Trivy'
      condition: and(succeeded(), ne(variables['Build.Reason'], 'PullRequest'))

Publish: Push to ACR, Generate SBOM

Publish Stage:

- stage: Publish
  displayName: 'Publish Artifacts'
  dependsOn: Test
  condition: and(succeeded(), ne(variables['Build.Reason'], 'PullRequest'))
  jobs:
  - job: BuildAndPushImage
    displayName: 'Build and Push Container Image'
    pool:
      vmImage: 'ubuntu-latest'

    variables:
      - name: ImageRepository
        value: 'connectsoft.azurecr.io/atp/ingestion'
      - name: ImageTag
        value: '$(Build.BuildNumber)-$(Build.SourceVersion)'  # e.g. 20240115.1-<full SHA>; the short v1.2.3-abc123d form is produced in "Image Tag Generation" below

    steps:
    - checkout: self

    # Login to ACR
    - task: Docker@2
      displayName: 'Login to Azure Container Registry'
      inputs:
        command: 'login'
        containerRegistry: 'ConnectSoft-ACR'

    # Build Docker image
    - task: Docker@2
      displayName: 'Build Docker image'
      inputs:
        command: 'build'
        containerRegistry: 'ConnectSoft-ACR'
        repository: 'atp/ingestion'
        dockerfile: 'src/ConnectSoft.ATP.Ingestion/Dockerfile'
        tags: |
          $(ImageTag)
          latest  # Only for dev branch
        buildContext: '$(Build.SourcesDirectory)'
        arguments: |
          --build-arg BUILD_VERSION=$(Build.BuildNumber)
          --build-arg BUILD_COMMIT=$(Build.SourceVersion)
          --build-arg BUILD_ID=$(Build.BuildId)

    # Generate SBOM
    - script: |
        # Install Syft
        curl -sSfL https://raw.githubusercontent.com/anchore/syft/main/install.sh | sh -s -- -b /usr/local/bin

        # Generate SBOM
        syft packages docker:$(ImageRepository):$(ImageTag) \
          --output spdx-json \
          --file sbom-$(ImageTag).spdx.json

        echo "✅ SBOM generated: sbom-$(ImageTag).spdx.json"
      displayName: 'Generate SBOM (Software Bill of Materials)'

    # Scan image for vulnerabilities
    - script: |
        trivy image \
          --format json \
          --output trivy-$(ImageTag).json \
          --severity HIGH,CRITICAL \
          $(ImageRepository):$(ImageTag)
      displayName: 'Scan image for vulnerabilities'

    # Push image to ACR
    - task: Docker@2
      displayName: 'Push Docker image to ACR'
      inputs:
        command: 'push'
        containerRegistry: 'ConnectSoft-ACR'
        repository: 'atp/ingestion'
        tags: |
          $(ImageTag)

    # Attach SBOM and scan results as pipeline artifacts
    - task: PublishPipelineArtifact@1
      displayName: 'Publish SBOM'
      inputs:
        targetPath: 'sbom-$(ImageTag).spdx.json'
        artifactName: 'sbom-$(ImageTag)'
        publishLocation: 'pipeline'

    - task: PublishPipelineArtifact@1
      displayName: 'Publish vulnerability scan results'
      inputs:
        targetPath: 'trivy-$(ImageTag).json'
        artifactName: 'vulnerability-scan-$(ImageTag)'
        publishLocation: 'pipeline'

    # Attach build metadata to the image. Note: `az acr repository update` does
    # not accept arbitrary metadata; attach the SBOM as an OCI referrer artifact
    # with ORAS instead (the artifact type below is an illustrative convention)
    - script: |
        oras attach \
          --artifact-type application/vnd.connectsoft.build-metadata.v1+json \
          $(ImageRepository):$(ImageTag) \
          sbom-$(ImageTag).spdx.json:application/spdx+json
      displayName: 'Attach SBOM to image as OCI referrer'

Update GitOps: Commit Manifest Changes

Update GitOps Stage:

- stage: UpdateGitOps
  displayName: 'Update GitOps Repository'
  dependsOn: Publish
  condition: and(succeeded(), ne(variables['Build.Reason'], 'PullRequest'))
  jobs:
  - job: UpdateManifests
    displayName: 'Update GitOps Manifests'
    pool:
      vmImage: 'ubuntu-latest'

    variables:
      - name: GitOpsRepoUrl
        value: 'https://dev.azure.com/ConnectSoft/ATP/_git/atp-gitops'
      # Compile-time conditional: builds from main update production, all others dev
      - name: TargetBranch
        ${{ if eq(variables['Build.SourceBranch'], 'refs/heads/main') }}:
          value: 'production'
        ${{ else }}:
          value: 'dev'

    steps:
    # Checkout GitOps repository
    - checkout: git://ATP/atp-gitops@$(TargetBranch)
      displayName: 'Checkout GitOps repository'
      path: gitops-repo

    # Install required tools
    - script: |
        # Install kustomize
        curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" | bash
        sudo mv kustomize /usr/local/bin/

        # Install yq (YAML processor)
        wget -q https://github.com/mikefarah/yq/releases/latest/download/yq_linux_amd64 -O yq
        chmod +x yq
        sudo mv yq /usr/local/bin/
      displayName: 'Install tools (kustomize, yq)'

    # Update Kustomize image tags
    - script: |
        cd gitops-repo

        # Update image tag in the Kustomize base (kustomize edit must run in
        # the directory containing kustomization.yaml)
        (cd apps/atp-ingestion/base && \
          kustomize edit set image \
            connectsoft.azurecr.io/atp/ingestion=$(ImageRepository):$(ImageTag))

        # Update image tag in all overlays (dev, test, staging, production);
        # subshells avoid error-prone "cd ../../.." bookkeeping
        for overlay in dev test staging production; do
          if [ -d "apps/atp-ingestion/overlays/$overlay" ]; then
            (cd "apps/atp-ingestion/overlays/$overlay" && \
              kustomize edit set image \
                connectsoft.azurecr.io/atp/ingestion=$(ImageRepository):$(ImageTag))
          fi
        done

        echo "✅ Updated Kustomize image tags"
      displayName: 'Update Kustomize image tags'

    # Update Helm values files
    - script: |
        cd gitops-repo

        # Update image tag in Helm values for target branch
        yq eval -i '.image.tag = "$(ImageTag)"' \
          apps/atp-ingestion/helm/values-$(TargetBranch).yaml

        # Also update default values.yaml if exists
        if [ -f "apps/atp-ingestion/helm/values.yaml" ]; then
          yq eval -i '.image.tag = "$(ImageTag)"' \
            apps/atp-ingestion/helm/values.yaml
        fi

        echo "✅ Updated Helm values files"
      displayName: 'Update Helm values files'

    # Validate updated manifests
    - script: |
        cd gitops-repo

        # Validate Kustomize builds
        kustomize build apps/atp-ingestion/overlays/$(TargetBranch)/ > /dev/null
        echo "✅ Kustomize build validation passed"

        # Validate Helm templates
        helm template atp-ingestion apps/atp-ingestion/helm/ \
          --values apps/atp-ingestion/helm/values-$(TargetBranch).yaml > /dev/null
        echo "✅ Helm template validation passed"
      displayName: 'Validate updated manifests'

    # Commit and push changes
    - script: |
        cd gitops-repo

        # Configure Git
        git config user.name "Azure DevOps Pipeline"
        git config user.email "azure-devops@connectsoft.example"

        # Check for changes
        if [ -z "$(git status --porcelain)" ]; then
          echo "ℹ️  No changes to commit"
          exit 0
        fi

        # Stage changes
        git add apps/atp-ingestion/

        # Commit with conventional commit format
        git commit -m "chore(ingestion): update image tag to $(ImageTag)

        Automated update from CI pipeline:
        - Image: $(ImageRepository):$(ImageTag)
        - Build: $(Build.BuildNumber)
        - Commit: $(Build.SourceVersion)
        - Pipeline: $(Build.BuildUri)

        Related to: $(System.PullRequest.PullRequestId)"

        # Push to GitOps repository
        git push origin $(TargetBranch)

        echo "✅ Pushed manifest updates to GitOps repository"
      displayName: 'Commit and push to GitOps repository'
      env:
        SYSTEM_ACCESSTOKEN: $(System.AccessToken)

Image Tag Generation

Semantic Version from Git Tag

Version Extraction Script:

#!/bin/bash
# scripts/extract-version.sh

set -euo pipefail

# Extract version from Git tag
VERSION_TAG=$(git describe --tags --exact-match 2>/dev/null || echo "")

if [ -n "$VERSION_TAG" ]; then
  # Use version from Git tag (e.g., v1.2.3)
  VERSION=$(echo "$VERSION_TAG" | sed 's/^v//')
  echo "✅ Version from Git tag: $VERSION"
else
  # Fallback: use the <Version> from a project file (recursive grep;
  # a "src/**" glob would require bash globstar to be enabled)
  VERSION=$(grep -rEh '<Version>.*</Version>' src --include='*.csproj' | head -1 | sed -E 's/.*<Version>(.*)<\/Version>.*/\1/')

  if [ -z "$VERSION" ]; then
    # Last resort: Use build number format
    VERSION="${BUILD_BUILDNUMBER:-1.0.0}"
  fi

  echo "⚠️  Version from project file/build number: $VERSION"
fi

echo "##vso[task.setvariable variable=Version]$VERSION"

Short Commit SHA for Traceability

Commit SHA Extraction:

# Extract short commit SHA (7 characters)
SHORT_SHA=$(git rev-parse --short=7 HEAD)
echo "Commit SHA: $SHORT_SHA"

# Example output: abc123d

Build Number for Uniqueness

Build Number Format:

# Build number format: v{MAJOR}.{MINOR}.{PATCH}.{BUILD_ID}
BUILD_NUMBER="${BUILD_BUILDNUMBER}"  # e.g., 20240115.1

# Or use Build.BuildId (unique incrementing number)
BUILD_ID="${BUILD_BUILDID}"  # e.g., 12345

Tag Format: v1.2.3-abc123d

Complete Tag Generation:

# Azure Pipeline: Generate image tag
- script: |
    # Extract version from Git tag or project file
    VERSION_TAG=$(git describe --tags --exact-match 2>/dev/null || echo "")

    if [ -n "$VERSION_TAG" ]; then
      VERSION=$(echo "$VERSION_TAG" | sed 's/^v//')
    else
      # Extract from a .csproj (recursive grep), or fall back to a default
      VERSION=$(grep -rhoP '<Version>\K[^<]+' src --include='*.csproj' | head -1)
      VERSION="${VERSION:-1.0.0}"
    fi

    # Extract short commit SHA
    SHORT_SHA=$(git rev-parse --short=7 HEAD)

    # Generate image tag: v1.2.3-abc123d
    IMAGE_TAG="v${VERSION}-${SHORT_SHA}"

    echo "Image tag: $IMAGE_TAG"
    echo "##vso[task.setvariable variable=ImageTag]$IMAGE_TAG"
    echo "##vso[task.setvariable variable=Version]$VERSION"
    echo "##vso[task.setvariable variable=ShortSha]$SHORT_SHA"
  displayName: 'Generate image tag'

Tag Format Examples:

| Source | Version | Commit SHA | Image Tag | Description |
|--------|---------|------------|-----------|-------------|
| Git Tag | v1.2.3 | abc123d | v1.2.3-abc123d | Semantic version + SHA |
| Project File | 1.2.3 | abc123d | v1.2.3-abc123d | Version from .csproj + SHA |
| Build Number | 20240115.1 | abc123d | v20240115.1-abc123d | Build number + SHA |
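The composition rule in the table reduces to one line of parameter expansion. A minimal sketch that normalizes the version (stripping any leading `v`) before appending the short SHA:

```shell
# Sketch: compose the image tag per the table above - strip a leading "v"
# if present, re-add it once, then append the short commit SHA.
make_image_tag() {
  local version="$1" short_sha="$2"
  echo "v${version#v}-${short_sha}"
}

make_image_tag "v1.2.3" "abc123d"      # v1.2.3-abc123d
make_image_tag "1.2.3" "abc123d"       # v1.2.3-abc123d
make_image_tag "20240115.1" "abc123d"  # v20240115.1-abc123d
```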

Automated Manifest Update

Pipeline Script to Update Image Tag in GitOps Repo

Manifest Update Script (scripts/update-gitops-manifests.sh):

#!/bin/bash
# scripts/update-gitops-manifests.sh

set -euo pipefail

SERVICE_NAME="${1:-atp-ingestion}"
IMAGE_REPOSITORY="${2:-connectsoft.azurecr.io/atp/ingestion}"
IMAGE_TAG="${3:-latest}"
TARGET_BRANCH="${4:-dev}"
GITOPS_REPO_PATH="${5:-gitops-repo}"

echo "🔄 Updating GitOps manifests for $SERVICE_NAME..."
echo "  Image: $IMAGE_REPOSITORY:$IMAGE_TAG"
echo "  Branch: $TARGET_BRANCH"

cd "$GITOPS_REPO_PATH"

# Update Kustomize image tags
if [ -d "apps/$SERVICE_NAME/base" ]; then
  echo "📝 Updating Kustomize base..."
  cd "apps/$SERVICE_NAME/base"

  # Update image tag using kustomize edit
  kustomize edit set image \
    "$IMAGE_REPOSITORY:$IMAGE_TAG"

  cd ../../../
fi

# Update Kustomize overlays
for overlay in dev test staging production; do
  if [ -d "apps/$SERVICE_NAME/overlays/$overlay" ]; then
    echo "📝 Updating Kustomize overlay: $overlay..."
    cd "apps/$SERVICE_NAME/overlays/$overlay"

    kustomize edit set image \
      "$IMAGE_REPOSITORY:$IMAGE_TAG"

    cd ../../../..
  fi
done

# Update Helm values files
if [ -d "apps/$SERVICE_NAME/helm" ]; then
  echo "📝 Updating Helm values..."
  cd "apps/$SERVICE_NAME/helm"

  # Update target branch values file
  if [ -f "values-${TARGET_BRANCH}.yaml" ]; then
    yq eval -i ".image.tag = \"$IMAGE_TAG\"" \
      "values-${TARGET_BRANCH}.yaml"
    echo "  ✅ Updated values-${TARGET_BRANCH}.yaml"
  fi

  # Update default values.yaml
  if [ -f "values.yaml" ]; then
    yq eval -i ".image.tag = \"$IMAGE_TAG\"" \
      "values.yaml"
    echo "  ✅ Updated values.yaml"
  fi

  cd ../../..
fi

echo "✅ Manifest update complete"

Kustomize Image Tag Replacement

Kustomize Update Example:

# apps/atp-ingestion/base/kustomization.yaml (before)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - deployment.yaml
  - service.yaml

images:
  - name: connectsoft.azurecr.io/atp/ingestion
    newTag: v1.2.2-def456e  # Old tag

# Update image tag using kustomize edit
kustomize edit set image \
  connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d

# Result: apps/atp-ingestion/base/kustomization.yaml (after)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - deployment.yaml
  - service.yaml

images:
  - name: connectsoft.azurecr.io/atp/ingestion
    newTag: v1.2.3-abc123d  # New tag

Helm Values File Update

Helm Values Update Example:

# apps/atp-ingestion/helm/values-production.yaml (before)
image:
  repository: connectsoft.azurecr.io/atp/ingestion
  tag: v1.2.2-def456e  # Old tag
  pullPolicy: IfNotPresent

# Update Helm values using yq
yq eval -i '.image.tag = "v1.2.3-abc123d"' \
  apps/atp-ingestion/helm/values-production.yaml

# Result: apps/atp-ingestion/helm/values-production.yaml (after)
image:
  repository: connectsoft.azurecr.io/atp/ingestion
  tag: v1.2.3-abc123d  # New tag
  pullPolicy: IfNotPresent

Git Commit and Push Automation

Automated Git Commit Script:

#!/bin/bash
# scripts/commit-gitops-changes.sh

set -euo pipefail

SERVICE_NAME="${1:-atp-ingestion}"
IMAGE_TAG="${2:-latest}"
TARGET_BRANCH="${3:-dev}"
BUILD_NUMBER="${4:-unknown}"
COMMIT_SHA="${5:-unknown}"
GITOPS_REPO_PATH="${6:-gitops-repo}"

cd "$GITOPS_REPO_PATH"

# Check for changes
if [ -z "$(git status --porcelain)" ]; then
  echo "ℹ️  No changes to commit"
  exit 0
fi

# Configure Git
git config user.name "Azure DevOps Pipeline"
git config user.email "azure-devops@connectsoft.example"

# Stage all changes
git add apps/$SERVICE_NAME/

# Create commit message
COMMIT_MESSAGE="chore($SERVICE_NAME): update image tag to $IMAGE_TAG

Automated update from CI pipeline:
- Service: $SERVICE_NAME
- Image Tag: $IMAGE_TAG
- Build Number: $BUILD_NUMBER
- Source Commit: $COMMIT_SHA
- Pipeline: ${BUILD_BUILDURI:-unknown}

Related to: ${SYSTEM_PULLREQUEST_PULLREQUESTID:-n/a}"

# Commit changes
if [ -n "${GPG_KEY_ID:-}" ]; then
  # Sign commit with GPG (production)
  git commit -S -m "$COMMIT_MESSAGE"
else
  # Unsigned commit (dev/test)
  git commit -m "$COMMIT_MESSAGE"
fi

# Push to target branch
git push origin "$TARGET_BRANCH"

echo "✅ Changes committed and pushed to $TARGET_BRANCH"
echo "   Commit: $(git rev-parse HEAD)"

Azure Pipeline Integration:

- script: |
    chmod +x scripts/update-gitops-manifests.sh
    chmod +x scripts/commit-gitops-changes.sh

    # Update manifests
    ./scripts/update-gitops-manifests.sh \
      atp-ingestion \
      $(ImageRepository):$(ImageTag) \
      $(TargetBranch) \
      gitops-repo

    # Commit and push
    ./scripts/commit-gitops-changes.sh \
      atp-ingestion \
      $(ImageTag) \
      $(TargetBranch) \
      $(Build.BuildNumber) \
      $(Build.SourceVersion) \
      gitops-repo
  displayName: 'Update GitOps repository'
  env:
    SYSTEM_ACCESSTOKEN: $(System.AccessToken)
    # Inject the signing key only for production promotions (compile-time condition)
    ${{ if eq(variables['TargetBranch'], 'production') }}:
      GPG_KEY_ID: $(GPG_KEY_ID)

Commit Back to GitOps Repository

Service Account Credentials (PAT or SSH)

Personal Access Token (PAT) Setup:

# Create PAT in Azure DevOps:
# 1. User Settings > Personal Access Tokens > New Token
# 2. Name: "GitOps Pipeline Service Account"
# 3. Organization: All accessible organizations
# 4. Scopes: Code (Read & Write)
# 5. Copy token

# Store PAT as Azure DevOps variable group
az pipelines variable-group create \
  --name "GitOps-Credentials" \
  --variables \
    gitopsPat="<PAT_TOKEN>" \
  --authorize true

SSH Key Setup:

# Generate SSH key for pipeline
ssh-keygen -t rsa -b 4096 -f ~/.ssh/gitops-pipeline -N ""

# Add public key to Azure DevOps
# Azure DevOps > User Settings > SSH Public Keys > New Key
cat ~/.ssh/gitops-pipeline.pub

# Store private key as Azure DevOps secret variable
az pipelines variable-group variable create \
  --group-id <GROUP_ID> \
  --name gitopsSshPrivateKey \
  --value "$(cat ~/.ssh/gitops-pipeline | base64)" \
  --secret true

Using Credentials in Pipeline:

# Option 1: Use System.AccessToken (recommended for same organization)
- checkout: git://ATP/atp-gitops@$(TargetBranch)
  displayName: 'Checkout GitOps repository'
  path: gitops-repo
  persistCredentials: true  # Use System.AccessToken

# Option 2: Use PAT from variable group
- script: |
    git config --global credential.helper store
    echo "https://PAT:$(gitopsPat)@dev.azure.com" > ~/.git-credentials
  displayName: 'Configure Git credentials'
  env:
    gitopsPat: $(gitopsPat)

# Option 3: Use SSH key
- script: |
    mkdir -p ~/.ssh
    echo "$(gitopsSshPrivateKey)" | base64 -d > ~/.ssh/id_rsa
    chmod 600 ~/.ssh/id_rsa
    ssh-keyscan ssh.dev.azure.com >> ~/.ssh/known_hosts
  displayName: 'Configure SSH key'
  env:
    gitopsSshPrivateKey: $(gitopsSshPrivateKey)

Commit Message Format

Standardized Commit Message:

<type>(<scope>): <subject>

<body>

<footer>

Example Commit Messages:

# Single service update
chore(ingestion): update image tag to v1.2.3-abc123d

Automated update from CI pipeline:
- Service: atp-ingestion
- Image Tag: v1.2.3-abc123d
- Build Number: 20240115.1
- Source Commit: abc123def456
- Pipeline: https://dev.azure.com/ConnectSoft/ATP/_build/results?buildId=12345

Related to: PR-123

# Multiple services update
chore(*): update image tags for release v1.2.3

Automated update from CI pipeline:
- Services: atp-ingestion, atp-query, atp-gateway
- Version: v1.2.3
- Build Number: 20240115.1
- Pipeline: https://dev.azure.com/ConnectSoft/ATP/_build/results?buildId=12345

Services updated:
- atp-ingestion: v1.2.3-abc123d
- atp-query: v1.2.3-def456e
- atp-gateway: v1.2.3-ghi789f

Signed Commits for Audit Trail

GPG Commit Signing Setup:

# Generate GPG key for pipeline service account
gpg --batch --gen-key <<EOF
%no-protection
Key-Type: RSA
Key-Length: 4096
Name-Real: Azure DevOps Pipeline
Name-Email: azure-devops@connectsoft.example
Expire-Date: 0
EOF

# Export public key
gpg --armor --export azure-devops@connectsoft.example > pipeline-gpg-public.key

# Export private key (base64 encoded for storage)
gpg --export-secret-keys --armor azure-devops@connectsoft.example | base64 > pipeline-gpg-private.key.b64

# Store private key as Azure DevOps secret variable

Using GPG in Pipeline:

- script: |
    # Import GPG key
    echo "$(gpgPrivateKey)" | base64 -d | gpg --batch --import
    gpg --list-secret-keys --keyid-format LONG

    # Configure Git to use GPG
    git config user.signingkey "$(gpgKeyId)"
    git config commit.gpgsign true

    # Sign commit
    git commit -S -m "$(commitMessage)"
  displayName: 'Sign and commit changes'
  env:
    gpgPrivateKey: $(gpgPrivateKey)
    gpgKeyId: $(gpgKeyId)
    commitMessage: $(commitMessage)

Branch Selection (Dev, Test, Staging, Production)

Branch Selection Logic:

# Azure Pipeline: Dynamic branch selection via conditional insertion
# (template expressions have no "endif"; use ${{ if }} / ${{ elseif }} / ${{ else }} keys)
variables:
  ${{ if eq(variables['Build.SourceBranch'], 'refs/heads/main') }}:
    TargetBranch: production
  ${{ elseif eq(variables['Build.SourceBranch'], 'refs/heads/staging') }}:
    TargetBranch: staging
  ${{ elseif eq(variables['Build.SourceBranch'], 'refs/heads/test') }}:
    TargetBranch: test
  ${{ else }}:
    TargetBranch: dev

Branch Selection Script:

#!/bin/bash
# scripts/determine-target-branch.sh

SOURCE_BRANCH="${1:-dev}"

case "$SOURCE_BRANCH" in
  refs/heads/main|main)
    TARGET_BRANCH="production"
    ;;
  refs/heads/staging|staging)
    TARGET_BRANCH="staging"
    ;;
  refs/heads/test|test)
    TARGET_BRANCH="test"
    ;;
  *)
    TARGET_BRANCH="dev"
    ;;
esac

echo "Source branch: $SOURCE_BRANCH"
echo "Target branch: $TARGET_BRANCH"
echo "##vso[task.setvariable variable=TargetBranch]$TARGET_BRANCH"

Triggering FluxCD Sync

Automatic Sync via Polling (Default)

FluxCD Polling Configuration:

# GitRepository polling interval (default: 1 minute)
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: atp-gitops
  namespace: flux-system
spec:
  interval: 1m  # Poll every 1 minute
  url: ssh://git@ssh.dev.azure.com/v3/ConnectSoft/ATP/atp-gitops
  ref:
    branch: production

Sync Timeline (Polling-Based):

T+0s:   CI pipeline commits and pushes to the GitOps repo
T+0s:   FluxCD Source Controller has just polled (worst case), so the new commit is not yet seen
T+60s:  Next poll detects the new commit and produces a new artifact
T+60s:  FluxCD Kustomize Controller notified of the new artifact
T+65s:  FluxCD Kustomize Controller reconciles the cluster
T+70s:  Kubernetes resources updated

Webhook-Based Immediate Sync (Optional)

FluxCD Webhook Receiver:

# apps/fluxcd/receiver.yaml
apiVersion: notification.toolkit.fluxcd.io/v1beta2
kind: Receiver
metadata:
  name: gitops-webhook
  namespace: flux-system
spec:
  # Azure Repos has no dedicated receiver type; the generic receiver
  # accepts any POST to the token-derived webhook path
  type: generic
  resources:
    - kind: GitRepository
      name: atp-gitops
      namespace: flux-system
  secretRef:
    name: webhook-token

Azure DevOps Webhook Configuration:

# Azure Pipeline: Trigger FluxCD webhook
- script: |
    # The receiver Service is cluster-local, so it must be exposed to the build
    # agent (e.g. via an Ingress host such as flux-webhook.connectsoft.example).
    # The token-derived path is published by Flux in the Receiver status:
    #   kubectl -n flux-system get receiver gitops-webhook -o jsonpath='{.status.webhookPath}'
    WEBHOOK_URL="https://flux-webhook.connectsoft.example$(fluxWebhookPath)"

    curl -X POST "$WEBHOOK_URL" \
      -H "Content-Type: application/json" \
      -d '{
        "ref": "refs/heads/'"$(TargetBranch)"'",
        "commits": [{
          "id": "'"$(Build.SourceVersion)"'",
          "message": "Automated manifest update"
        }]
      }'

    echo "✅ FluxCD webhook triggered"
  displayName: 'Trigger FluxCD sync webhook'

Sync Timeline (Webhook-Based):

T+0s:   CI pipeline commits to GitOps repo
T+0s:   Git commit pushed successfully
T+1s:   Azure DevOps webhook triggered
T+2s:   FluxCD Receiver receives webhook
T+2s:   FluxCD Source Controller immediately fetches Git
T+3s:   FluxCD Kustomize Controller notified
T+8s:   Kubernetes resources updated

Flux Reconcile Command (Manual)

Manual Reconciliation:

# Manual reconciliation via flux CLI
flux reconcile source git atp-gitops --namespace flux-system

# Force reconciliation (even if no changes)
flux reconcile kustomization apps --namespace flux-system --with-source

# Reconciliation status
flux get kustomizations apps --namespace flux-system

Reconciliation in Pipeline (Optional):

# Assumes the flux CLI is installed on the agent
- task: AzureCLI@2
  displayName: 'Trigger FluxCD reconciliation'
  inputs:
    azureSubscription: 'ATP-AKS-Connection'
    scriptType: bash
    scriptLocation: inlineScript
    inlineScript: |
      az aks get-credentials \
        --resource-group ATP-Production-EUS-RG \
        --name atp-prod-eus-aks \
        --overwrite-existing
      flux reconcile source git atp-gitops --namespace flux-system
  condition: and(succeeded(), eq(variables['TargetBranch'], 'production'))

Pipeline Templates for GitOps Integration

Reusable YAML Templates

GitOps Update Template (templates/gitops-update.yml):

# templates/gitops-update.yml
parameters:
- name: serviceName
  type: string
- name: imageRepository
  type: string
- name: imageTag
  type: string
- name: targetBranch
  type: string
  default: 'dev'
- name: requireGpgSigning
  type: boolean
  default: false

steps:
- checkout: git://ATP/atp-gitops@${{ parameters.targetBranch }}
  displayName: 'Checkout GitOps repository'
  path: gitops-repo
  persistCredentials: true

- script: |
    # Install tools
    curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" | bash
    sudo mv kustomize /usr/local/bin/

    wget -q https://github.com/mikefarah/yq/releases/latest/download/yq_linux_amd64 -O yq
    chmod +x yq
    sudo mv yq /usr/local/bin/
  displayName: 'Install tools'

- script: |
    cd gitops-repo

    # Update Kustomize
    if [ -d "apps/${{ parameters.serviceName }}/base" ]; then
      cd "apps/${{ parameters.serviceName }}/base"
      kustomize edit set image "${{ parameters.imageRepository }}:${{ parameters.imageTag }}"
      cd ../../../
    fi

    # Update Helm
    if [ -d "apps/${{ parameters.serviceName }}/helm" ]; then
      cd "apps/${{ parameters.serviceName }}/helm"
      if [ -f "values-${{ parameters.targetBranch }}.yaml" ]; then
        yq eval -i ".image.tag = \"${{ parameters.imageTag }}\"" \
          "values-${{ parameters.targetBranch }}.yaml"
      fi
      cd ../../../
    fi
  displayName: 'Update manifests'

- script: |
    cd gitops-repo

    if [ -z "$(git status --porcelain)" ]; then
      echo "ℹ️  No changes to commit"
      exit 0
    fi

    git config user.name "Azure DevOps Pipeline"
    git config user.email "azure-devops@connectsoft.example"

    git add apps/${{ parameters.serviceName }}/

    COMMIT_MESSAGE="chore(${{ parameters.serviceName }}): update image tag to ${{ parameters.imageTag }}

    Automated update from CI pipeline.
    Build: $(Build.BuildNumber)
    Commit: $(Build.SourceVersion)"

    if [ "${{ lower(parameters.requireGpgSigning) }}" == "true" ]; then
      echo "$(gpgPrivateKey)" | base64 -d | gpg --batch --import
      git config user.signingkey "$(gpgKeyId)"
      git config commit.gpgsign true
      git commit -S -m "$COMMIT_MESSAGE"
    else
      git commit -m "$COMMIT_MESSAGE"
    fi

    git push origin ${{ parameters.targetBranch }}
  displayName: 'Commit and push changes'
  env:
    ${{ if eq(parameters.requireGpgSigning, true) }}:
      gpgPrivateKey: $(gpgPrivateKey)
      gpgKeyId: $(gpgKeyId)

Parameterization for Different Services

Using Template in Pipeline:

# azure-pipelines.yml
resources:
  repositories:
  - repository: templates
    type: git
    name: ATP/azure-pipelines-templates

stages:
- stage: UpdateGitOps
  displayName: 'Update GitOps Repository'
  jobs:
  - job: UpdateManifests
    steps:
    - template: templates/gitops-update.yml@templates
      parameters:
        serviceName: 'atp-ingestion'
        imageRepository: 'connectsoft.azurecr.io/atp/ingestion'
        imageTag: '$(ImageTag)'
        targetBranch: '$(TargetBranch)'
        requireGpgSigning: ${{ eq(variables['TargetBranch'], 'production') }}

Multi-Service Template Usage:

- stage: UpdateGitOps
  displayName: 'Update GitOps for Multiple Services'
  jobs:
  - job: UpdateAllServices
    strategy:
      matrix:
        ingestion:
          serviceName: 'atp-ingestion'
          imageRepository: 'connectsoft.azurecr.io/atp/ingestion'
        query:
          serviceName: 'atp-query'
          imageRepository: 'connectsoft.azurecr.io/atp/query'
        gateway:
          serviceName: 'atp-gateway'
          imageRepository: 'connectsoft.azurecr.io/atp/gateway'
    steps:
    - template: templates/gitops-update.yml@templates
      parameters:
        serviceName: '$(serviceName)'
        imageRepository: '$(imageRepository)'
        imageTag: '$(ImageTag)'
        targetBranch: '$(TargetBranch)'

Template Versioning

Versioned Template Reference:

resources:
  repositories:
  - repository: templates
    type: git
    name: ATP/azure-pipelines-templates
    ref: refs/tags/v1.2.3  # Pin to specific version

stages:
- stage: UpdateGitOps
  jobs:
  - job: UpdateManifests
    steps:
    - template: templates/gitops-update.yml@templates
      parameters:
        serviceName: 'atp-ingestion'
        imageRepository: 'connectsoft.azurecr.io/atp/ingestion'
        imageTag: '$(ImageTag)'

Multi-Service Coordination

Updating Multiple Services Atomically

Atomic Multi-Service Update:

- stage: UpdateGitOps
  displayName: 'Update GitOps (Atomic)'
  jobs:
  - job: UpdateAllServices
    steps:
    - checkout: git://ATP/atp-gitops@$(TargetBranch)
      path: gitops-repo
      persistCredentials: true

    # Update all services in single commit
    - script: |
        cd gitops-repo

        # kustomize edit has no --path flag; run it inside the directory
        # that contains the kustomization.yaml

        # Update ingestion
        (cd apps/atp-ingestion/base && kustomize edit set image \
          connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d)

        # Update query
        (cd apps/atp-query/base && kustomize edit set image \
          connectsoft.azurecr.io/atp/query:v1.3.0-def456e)

        # Update gateway
        (cd apps/atp-gateway/base && kustomize edit set image \
          connectsoft.azurecr.io/atp/gateway:v1.1.5-ghi789f)

        # Single commit for all services
        git add apps/
        git commit -m "chore(*): update all service image tags

        Atomic update for release v1.2.3:
        - atp-ingestion: v1.2.3-abc123d
        - atp-query: v1.3.0-def456e
        - atp-gateway: v1.1.5-ghi789f"

        git push origin $(TargetBranch)
      displayName: 'Atomic multi-service update'

Dependency Management

Service Dependency Graph:

# services-dependencies.yaml
services:
  - name: atp-gateway
    dependsOn: []
    updateOrder: 1

  - name: atp-ingestion
    dependsOn: [atp-gateway]
    updateOrder: 2

  - name: atp-query
    dependsOn: [atp-ingestion]
    updateOrder: 3

  - name: atp-export
    dependsOn: [atp-query]
    updateOrder: 4

Dependency-Aware Update Script:

#!/bin/bash
# scripts/update-services-with-dependencies.sh

SERVICES=(
  "atp-gateway:v1.1.5-ghi789f:1"
  "atp-ingestion:v1.2.3-abc123d:2"
  "atp-query:v1.3.0-def456e:3"
)

# Sort by update order
IFS=$'\n' sorted_services=($(sort -t: -k3 <<<"${SERVICES[*]}"))
unset IFS

for service_info in "${sorted_services[@]}"; do
  IFS=':' read -r service_name image_tag update_order <<< "$service_info"

  echo "📦 Updating $service_name (order: $update_order)..."

  # Update manifest (kustomize edit has no --path flag; run it in the target directory)
  (cd "apps/$service_name/base" && \
    kustomize edit set image "connectsoft.azurecr.io/$service_name:$image_tag")

  echo "  ✅ Updated $service_name"
done

Coordinated Rollout Strategies

Staged Rollout:

- stage: CoordinatedRollout
  displayName: 'Coordinated Multi-Service Rollout'
  jobs:
  - job: Stage1Gateway
    displayName: 'Stage 1: Gateway'
    steps:
    - template: templates/gitops-update.yml@templates
      parameters:
        serviceName: 'atp-gateway'
        imageTag: '$(GatewayImageTag)'

  - job: Stage2Ingestion
    displayName: 'Stage 2: Ingestion'
    dependsOn: Stage1Gateway
    condition: succeeded()
    steps:
    - template: templates/gitops-update.yml@templates
      parameters:
        serviceName: 'atp-ingestion'
        imageTag: '$(IngestionImageTag)'

  - job: Stage3Query
    displayName: 'Stage 3: Query'
    dependsOn: Stage2Ingestion
    condition: succeeded()
    steps:
    - template: templates/gitops-update.yml@templates
      parameters:
        serviceName: 'atp-query'
        imageTag: '$(QueryImageTag)'

Artifact Metadata

SBOM (Software Bill of Materials) Generation

SBOM Generation with Syft:

- script: |
    # Install Syft
    curl -sSfL https://raw.githubusercontent.com/anchore/syft/main/install.sh | sh -s -- -b /usr/local/bin

    # Generate SBOM in SPDX format
    syft packages docker:$(ImageRepository):$(ImageTag) \
      --output spdx-json \
      --file sbom-$(ImageTag).spdx.json

    # Generate SBOM in CycloneDX format
    syft packages docker:$(ImageRepository):$(ImageTag) \
      --output cyclonedx-json \
      --file sbom-$(ImageTag).cyclonedx.json

    echo "✅ SBOM generated: sbom-$(ImageTag).spdx.json"
  displayName: 'Generate SBOM'

SBOM Structure (Example):

{
  "SPDXID": "SPDXRef-DOCUMENT",
  "spdxVersion": "SPDX-2.3",
  "name": "connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d",
  "packages": [
    {
      "SPDXID": "SPDXRef-Package-dotnet-runtime",
      "name": "dotnet-runtime",
      "versionInfo": "8.0.0",
      "downloadLocation": "NOASSERTION"
    },
    {
      "SPDXID": "SPDXRef-Package-aspnetcore",
      "name": "aspnetcore",
      "versionInfo": "8.0.0",
      "downloadLocation": "NOASSERTION"
    }
  ]
}

Vulnerability Scan Results

Vulnerability Scanning Integration:

- script: |
    # Scan image with Trivy
    trivy image \
      --format json \
      --output trivy-$(ImageTag).json \
      --severity HIGH,CRITICAL \
      $(ImageRepository):$(ImageTag)

    # Generate HTML report
    trivy image \
      --format template \
      --template "@contrib/html.tpl" \
      --output trivy-$(ImageTag).html \
      $(ImageRepository):$(ImageTag)

    # Publish scan results
    echo "##vso[task.addattachment type=Distributedtask.Core.Summary;name=Vulnerability Scan;]$PWD/trivy-$(ImageTag).html"
  displayName: 'Scan image for vulnerabilities'

Vulnerability Scan Results (Example):

{
  "SchemaVersion": 2,
  "ArtifactName": "connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d",
  "Results": [
    {
      "Target": "atp-ingestion:v1.2.3-abc123d",
      "Vulnerabilities": [
        {
          "VulnerabilityID": "CVE-2024-1234",
          "Severity": "HIGH",
          "PackageName": "aspnetcore",
          "InstalledVersion": "8.0.0",
          "FixedVersion": "8.0.1"
        }
      ]
    }
  ]
}
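
A report like the one above is typically turned into a quality gate. The sketch below is a dependency-free, grep-based approximation over a stub report (real pipelines would use `trivy image --exit-code 1 --severity CRITICAL` or `jq` instead; the file path and JSON shape here are illustrative):

```shell
# Write a stub Trivy report so the gate logic can be demonstrated end to end
cat > /tmp/trivy-report.json <<'EOF'
{"Results":[{"Vulnerabilities":[
  {"VulnerabilityID":"CVE-2024-1234","Severity":"HIGH"},
  {"VulnerabilityID":"CVE-2024-5678","Severity":"CRITICAL"}
]}]}
EOF

# Count lines containing a CRITICAL severity marker
critical_count=$(grep -c '"Severity":"CRITICAL"' /tmp/trivy-report.json || true)
echo "CRITICAL findings: $critical_count"

if [ "$critical_count" -gt 0 ]; then
  echo "❌ Build should fail: CRITICAL vulnerabilities present"
fi
```

In a real pipeline the non-zero count would map to a non-zero exit code so the stage fails before the image tag ever reaches the GitOps repository.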

Build Provenance Information

Provenance Generation (SLSA/In-Toto):

- script: |
    # Generate build provenance (SLSA v1.0)
    cat > provenance-$(ImageTag).json <<EOF
    {
      "_type": "https://in-toto.io/Statement/v1",
      "subject": [
        {
          "name": "$(ImageRepository):$(ImageTag)",
          "digest": {
            "sha256": "$(IMAGE_DIGEST)"
          }
        }
      ],
      "predicateType": "https://slsa.dev/provenance/v1",
      "predicate": {
        "buildDefinition": {
          "buildType": "https://dev.azure.com/ConnectSoft/ATP",
          "externalParameters": {
            "source": "$(Build.Repository.Uri)",
            "ref": "$(Build.SourceBranch)",
            "commit": "$(Build.SourceVersion)"
          },
          "internalParameters": {
            "pipeline": "$(Build.DefinitionName)",
            "buildId": "$(Build.BuildId)"
          },
          "resolvedDependencies": [
            {
              "uri": "$(Build.Repository.Uri)",
              "digest": {
                "gitCommit": "$(Build.SourceVersion)"
              }
            }
          ]
        },
        "runDetails": {
          "builder": {
            "id": "Azure DevOps Pipeline"
          },
          "metadata": {
            "invocationId": "$(Build.BuildId)",
            "startedOn": "$(Build.QueuedTime)",
            "finishedOn": "$(System.Agent.JobFinishTime)"
          }
        }
      }
    }
    EOF

    echo "✅ Build provenance generated"
  displayName: 'Generate build provenance'

Metadata Storage in ACR

Attach Metadata to ACR Image:

- script: |
    # `az acr repository update` does not support arbitrary key/value metadata;
    # attach the supply-chain documents to the image as OCI referrers with ORAS
    oras attach \
      --artifact-type application/spdx+json \
      connectsoft.azurecr.io/atp/ingestion:$(ImageTag) \
      sbom-$(ImageTag).spdx.json

    oras attach \
      --artifact-type application/vnd.trivy.report+json \
      connectsoft.azurecr.io/atp/ingestion:$(ImageTag) \
      trivy-$(ImageTag).json

    oras attach \
      --artifact-type application/vnd.in-toto+json \
      connectsoft.azurecr.io/atp/ingestion:$(ImageTag) \
      provenance-$(ImageTag).json

    echo "✅ Metadata attached to image"
  displayName: 'Attach metadata to ACR image'

Query Image Metadata:

# List artifacts attached to the image (SBOM, scan report, provenance)
oras discover connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d

# Show the image manifest, including its annotations
az acr manifest show \
  --registry connectsoft \
  --name atp/ingestion:v1.2.3-abc123d

Pipeline Observability

Correlation IDs Between Build and Deployment

Correlation ID Generation:

- script: |
    # Generate correlation ID
    CORRELATION_ID="build-$(Build.BuildId)-$(Build.SourceVersion)"

    echo "##vso[task.setvariable variable=CorrelationId;isOutput=true]$CORRELATION_ID"
    echo "Correlation ID: $CORRELATION_ID"
  displayName: 'Generate correlation ID'
  name: GenerateCorrelationId

# Pass correlation ID to GitOps commit
- script: |
    git commit -m "chore(ingestion): update image tag

    Correlation ID: $(GenerateCorrelationId.CorrelationId)
    Build: $(Build.BuildNumber)"
  displayName: 'Commit with correlation ID'

Correlation ID in Deployment Metadata:

# apps/atp-ingestion/base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
  annotations:
    deployment.connectsoft.com/correlation-id: "build-12345-abc123d"
    deployment.connectsoft.com/build-number: "20240115.1"
    deployment.connectsoft.com/build-uri: "https://dev.azure.com/.../builds/12345"
spec:
  template:
    metadata:
      labels:
        correlation-id: "build-12345-abc123d"
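
Tooling that needs to recover the build from these annotations can parse the `build-<buildId>-<shortSha>` format with plain shell expansion (a minimal sketch; variable names are illustrative):

```shell
# Parse the correlation ID format "build-<buildId>-<shortSha>" shown above
cid="build-12345-abc123d"

build_id=${cid#build-}      # strip the "build-" prefix -> "12345-abc123d"
build_id=${build_id%%-*}    # keep everything before the next dash -> "12345"
short_sha=${cid##*-}        # keep everything after the last dash -> "abc123d"

echo "buildId=$build_id sha=$short_sha"
```

The recovered build ID is what the linking script below writes back to the deployment, closing the loop between the pipeline run and the reconciled workload.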

Linking Azure Pipeline Runs to FluxCD Reconciliations

Link Tracking Script:

#!/bin/bash
# scripts/link-build-to-deployment.sh

CORRELATION_ID="${1:-unknown}"
BUILD_URI="${2:-unknown}"
NAMESPACE="${3:-atp-production}"

# Annotate deployment with build information
kubectl annotate deployment atp-ingestion \
  -n "$NAMESPACE" \
  deployment.connectsoft.com/build-uri="$BUILD_URI" \
  deployment.connectsoft.com/correlation-id="$CORRELATION_ID" \
  --overwrite

echo "✅ Deployment linked to build: $BUILD_URI"

Query Links:

# Query deployment for build link
kubectl get deployment atp-ingestion -n atp-production \
  -o jsonpath='{.metadata.annotations.deployment\.connectsoft\.com/build-uri}'

# Output: https://dev.azure.com/ConnectSoft/ATP/_build/results?buildId=12345

Deployment Receipt Generation

Deployment Receipt Script:

#!/bin/bash
# scripts/generate-deployment-receipt.sh

DEPLOYMENT_NAME="${1:-atp-ingestion}"
NAMESPACE="${2:-atp-production}"
CORRELATION_ID="${3:-unknown}"

# Generate deployment receipt
cat > deployment-receipt-$(date +%Y%m%d-%H%M%S).json <<EOF
{
  "deploymentId": "$(kubectl get deployment $DEPLOYMENT_NAME -n $NAMESPACE -o jsonpath='{.metadata.uid}')",
  "correlationId": "$CORRELATION_ID",
  "namespace": "$NAMESPACE",
  "deploymentName": "$DEPLOYMENT_NAME",
  "image": "$(kubectl get deployment $DEPLOYMENT_NAME -n $NAMESPACE -o jsonpath='{.spec.template.spec.containers[0].image}')",
  "replicas": $(kubectl get deployment $DEPLOYMENT_NAME -n $NAMESPACE -o jsonpath='{.spec.replicas}'),
  "deployedAt": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
  "deployedBy": "FluxCD",
  "gitCommit": "$(kubectl get deployment $DEPLOYMENT_NAME -n $NAMESPACE -o jsonpath='{.metadata.labels.app\.kubernetes\.io/version}')",
  "status": "$(kubectl get deployment $DEPLOYMENT_NAME -n $NAMESPACE -o jsonpath='{.status.conditions[?(@.type=="Available")].status}')"
}
EOF

echo "✅ Deployment receipt generated"

Metrics and Dashboards

Pipeline Metrics (Azure Monitor):

- script: |
    # There is no `az monitor metrics create`; custom metrics are published
    # through the Azure Monitor custom metrics REST API.
    # $(MonitorResourceId) is assumed to hold the full target resource ID.
    TOKEN=$(az account get-access-token \
      --resource https://monitoring.azure.com \
      --query accessToken -o tsv)

    curl -X POST \
      "https://eastus.monitoring.azure.com$(MonitorResourceId)/metrics" \
      -H "Authorization: Bearer $TOKEN" \
      -H "Content-Type: application/json" \
      -d '{
        "time": "'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'",
        "data": {
          "baseData": {
            "metric": "gitops_manifest_update_duration_seconds",
            "namespace": "ATP/GitOps",
            "dimNames": [],
            "series": [
              { "dimValues": [], "min": '"$SECONDS"', "max": '"$SECONDS"', "sum": '"$SECONDS"', "count": 1 }
            ]
          }
        }
      }'
  displayName: 'Send pipeline metrics'

KQL Query for Build-Deployment Correlation:

// Azure Monitor: Link builds to deployments
let BuildEvents = ContainerLog
| where LogEntry contains "correlation-id"
| extend CorrelationId = extract(@"correlation-id: ([^\s]+)", 1, LogEntry)
| extend BuildUri = extract(@"build-uri: ([^\s]+)", 1, LogEntry)
| project CorrelationId, BuildUri, BuildTime = TimeGenerated;

let DeploymentEvents = KubePodInventory
| where Namespace == "atp-production"
| extend CorrelationId = extract(@'"correlation-id":"([^"]+)"', 1, tostring(PodLabel))
| where isnotempty(CorrelationId)
| project CorrelationId, PodName, DeploymentTime = TimeGenerated;

BuildEvents
| join kind=inner DeploymentEvents on CorrelationId
| extend DeploymentLatency = DeploymentTime - BuildTime
| summarize avg(DeploymentLatency) by bin(DeploymentTime, 1h)

Grafana Dashboard Configuration:

# Grafana dashboard for CI/CD → GitOps handoff
dashboard:
  title: "CI/CD to GitOps Handoff Metrics"
  panels:
    - title: "Build to Deployment Latency"
      query: |
        avg(deployment_latency_seconds{namespace="atp-production"})

    - title: "GitOps Commit Frequency"
      query: |
        rate(gitops_commits_total[5m])

    - title: "FluxCD Sync Success Rate"
      query: |
        sum(rate(controller_runtime_reconcile_total{controller="kustomization",result="success"}[5m])) /
        sum(rate(controller_runtime_reconcile_total{controller="kustomization"}[5m]))

Summary: Azure Pipelines to GitOps Handoff

  • Separation of Concerns: CI builds/test/publishes artifacts; GitOps deploys/reconciles/monitors
  • Pipeline Stages: Build (compile, test), Test (integration, security), Publish (ACR, SBOM), Update GitOps (manifest commits)
  • Image Tag Generation: Semantic version + commit SHA format (v1.2.3-abc123d) for immutability and traceability
  • Automated Manifest Update: Scripts for Kustomize and Helm manifest updates with Git commit automation
  • Git Commit Automation: PAT/SSH credentials, conventional commit messages, GPG signing for production
  • FluxCD Sync Triggers: Polling (default), webhooks (immediate), manual reconciliation
  • Pipeline Templates: Reusable YAML templates with parameterization and versioning
  • Multi-Service Coordination: Atomic updates, dependency management, coordinated rollout strategies
  • Artifact Metadata: SBOM generation, vulnerability scans, build provenance, ACR metadata storage
  • Pipeline Observability: Correlation IDs, build-deployment linking, deployment receipts, metrics dashboards
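
The tag scheme summarized above (semantic version plus short commit SHA, e.g. v1.2.3-abc123d) can be sketched as a one-line helper (the function name is illustrative):

```shell
# Hypothetical helper implementing the v<semver>-<short-sha> tag format
make_image_tag() {
  local version="$1" sha="$2"
  printf 'v%s-%s\n' "$version" "${sha:0:7}"   # keep the first 7 SHA characters
}

make_image_tag 1.2.3 abc123def456   # prints v1.2.3-abc123d
```

Because the short SHA is embedded, every tag is immutable and traceable back to a single source commit even when the semantic version is reused across builds.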

Pulumi Infrastructure as Code Integration

Purpose: Define how Pulumi with C# is used to provision and manage Azure infrastructure for ATP, integrating with GitOps workflows to ensure infrastructure changes are version-controlled, tested, and deployed through the same Git-based processes as application deployments.


Pulumi Overview for Azure Resources

Why Pulumi for ATP (C# Programming Model)

ATP Infrastructure Requirements:

  • Complex Azure Resource Orchestration: AKS clusters, ACR, Key Vault, Service Bus, Storage Accounts
  • Multi-Environment Management: Dev, test, staging, production with environment-specific configurations
  • C# Development Team: Leverage existing C# expertise for infrastructure code
  • Type Safety: Strong typing and IntelliSense for Azure resources
  • Testability: Unit test infrastructure code with standard C# testing frameworks
  • Code Reusability: Create reusable components and modules

Pulumi Advantages for ATP:

| Advantage | Description | ATP Benefit |
|---|---|---|
| C# Programming Model | Write infrastructure as C# code | Leverage team's existing C# skills |
| Type Safety | Strong typing with IntelliSense | Reduce configuration errors at compile time |
| Rich Ecosystem | Access to .NET libraries and NuGet packages | Reuse existing code and patterns |
| Imperative Logic | Full programming language capabilities | Complex conditional logic, loops, functions |
| State Management | Built-in state management with locking | Safe concurrent updates |
| Multi-Language Support | C#, TypeScript, Python, Go available | Team flexibility |

Pulumi vs Terraform vs Bicep Comparison

Comparison Matrix:

| Feature | Pulumi | Terraform | Bicep |
|---|---|---|---|
| Language | C#, TypeScript, Python, Go | HCL (HashiCorp Configuration Language) | Domain-specific language (DSL) |
| Programming Model | Imperative with full language features | Declarative | Declarative |
| Type Safety | ✅ Strong typing (C#) | ❌ Limited | ✅ Type checking |
| IntelliSense | ✅ Full IDE support | ⚠️ Basic | ✅ Good |
| Testing | ✅ Unit test with standard frameworks | ⚠️ Limited | ❌ Limited |
| Code Reuse | ✅ Functions, classes, modules | ⚠️ Modules | ⚠️ Modules |
| State Management | ✅ Built-in with locking | ✅ Built-in with locking | ✅ Azure-native |
| Azure Integration | ✅ Excellent | ✅ Good | ✅ Native (Azure-only) |
| Multi-Cloud | ✅ Excellent | ✅ Excellent | ❌ Azure-only |
| Learning Curve | ✅ Low (if team knows C#) | ⚠️ Medium (HCL) | ⚠️ Medium (DSL) |

ATP Selection: Pulumi with C#

Rationale:

  1. Team Expertise: ATP team is proficient in C#, reducing learning curve
  2. Type Safety: Catch configuration errors at compile time
  3. Testability: Unit test infrastructure code with xUnit/NUnit
  4. Code Reuse: Create reusable infrastructure components
  5. Complex Logic: Handle multi-tenant, multi-region scenarios with imperative code
  6. Azure-First: Strong Azure support while maintaining multi-cloud flexibility

Infrastructure as Code Principles

IaC Best Practices for ATP:

  1. Version Control: All infrastructure code in Git (Azure Repos)
  2. Immutable Infrastructure: Recreate rather than modify when possible
  3. Idempotency: Infrastructure code can be run multiple times safely
  4. Declarative Intent: Describe desired state, not implementation steps
  5. Environment Parity: Use same code for all environments (parameterized)
  6. Code Review: Infrastructure changes require PR approval
  7. Testing: Preview changes before applying
  8. State Management: Centralized, versioned state with locking

IaC Workflow:

graph LR
    A[Developer] -->|Create PR| B[Infrastructure Code<br/>in Git]
    B -->|PR Validation| C[Pulumi Preview]
    C -->|Review| D[Manual Approval]
    D -->|Merge| E[Pulumi Up]
    E -->|Update State| F[Azure Blob Storage<br/>State Backend]
    E -->|Provision| G[Azure Resources]

    style B fill:#90EE90
    style C fill:#FFE5B4
    style D fill:#FFE5B4
    style E fill:#90EE90
    style F fill:#FFE5B4
    style G fill:#ffcccc

Pulumi Stacks for Environments

Stack per Environment (Dev, Test, Staging, Production)

Stack Organization:

atp-infrastructure/
├── Pulumi.yaml
├── Pulumi.dev.yaml
├── Pulumi.test.yaml
├── Pulumi.staging.yaml
├── Pulumi.production.yaml
├── Program.cs
└── infrastructure/
    ├── AKS.cs
    ├── ACR.cs
    ├── KeyVault.cs
    └── ServiceBus.cs

Pulumi Project Configuration (Pulumi.yaml):

name: atp-infrastructure
runtime: dotnet
description: ATP Infrastructure as Code using Pulumi with C#

Stack Configuration Examples:

Dev Stack (Pulumi.dev.yaml):

config:
  azure-native:location: eastus
  atp-infrastructure:environment: dev
  atp-infrastructure:aksNodeCount: 3
  atp-infrastructure:aksNodeVmSize: Standard_D2s_v3
  atp-infrastructure:acrSku: Basic
  atp-infrastructure:keyVaultSku: standard
  atp-infrastructure:enableMonitoring: true
  atp-infrastructure:enablePrivateEndpoint: false
  atp-infrastructure:tags:
    Environment: dev
    ManagedBy: pulumi
    Project: ATP

Production Stack (Pulumi.production.yaml):

config:
  azure-native:location: eastus
  atp-infrastructure:environment: production
  atp-infrastructure:aksNodeCount: 5
  atp-infrastructure:aksNodeVmSize: Standard_D4s_v3
  atp-infrastructure:acrSku: Premium
  atp-infrastructure:keyVaultSku: premium
  atp-infrastructure:enableMonitoring: true
  atp-infrastructure:enablePrivateEndpoint: true
  atp-infrastructure:enableGeoReplication: true
  atp-infrastructure:tags:
    Environment: production
    ManagedBy: pulumi
    Project: ATP
    Compliance: SOC2

Stack Configuration and Secrets

Stack Configuration with Secrets:

// Program.cs
using Pulumi;

class Program
{
    static Task<int> Main() => Deployment.RunAsync<ATPStack>();
}

class ATPStack : Stack
{
    public ATPStack()
    {
        var config = new Config();

        // Read configuration
        var environment = config.Require("environment");
        var location = config.Get("location") ?? "eastus";
        var nodeCount = config.GetInt32("aksNodeCount") ?? 3;
        var nodeVmSize = config.Get("aksNodeVmSize") ?? "Standard_D2s_v3";

        // Read secrets (encrypted)
        var sqlAdminPassword = config.RequireSecret("sqlAdminPassword");
        var keyVaultAccessKey = config.RequireSecret("keyVaultAccessKey");

        // Create infrastructure
        var aks = new AKSCluster(this, environment, location, nodeCount, nodeVmSize);
        var acr = new AzureContainerRegistry(this, environment, location);
        var keyVault = new KeyVault(this, environment, location, keyVaultAccessKey);
    }
}

Setting Stack Configuration:

# Set plain configuration
pulumi config set aksNodeCount 5 --stack production
pulumi config set aksNodeVmSize Standard_D4s_v3 --stack production

# Set secrets (encrypted in state)
pulumi config set --secret sqlAdminPassword "SecurePassword123!"
pulumi config set --secret keyVaultAccessKey "access-key-value"

# View configuration
pulumi config --stack production

# View a secret's decrypted value (pulumi config get decrypts automatically)
pulumi config get sqlAdminPassword --stack production

Configuration via Azure DevOps Variable Groups:

# Azure Pipeline: Set Pulumi configuration from variable groups
- script: |
    # Set plain configuration
    pulumi config set azure-native:location $(AzureLocation) --stack $(PulumiStack)
    pulumi config set aksNodeCount $(AKSNodeCount) --stack $(PulumiStack)

    # Set secrets from Azure Key Vault
    pulumi config set --secret sqlAdminPassword "$(sqlAdminPassword)" --stack $(PulumiStack)
    pulumi config set --secret keyVaultAccessKey "$(keyVaultAccessKey)" --stack $(PulumiStack)
  displayName: 'Set Pulumi stack configuration'
  env:
    PULUMI_ACCESS_TOKEN: $(PulumiAccessToken)

Stack References for Cross-Stack Dependencies

Stack Reference Example:

// Shared infrastructure stack (networking, monitoring)
class SharedStack : Stack
{
    [Output]
    public Output<string> LogAnalyticsWorkspaceId { get; set; }

    [Output]
    public Output<string> VirtualNetworkId { get; set; }

    public SharedStack()
    {
        var resourceGroup = new Resources.ResourceGroup("atp-shared-rg");

        var workspace = new OperationalInsights.Workspace("atp-shared-loganalytics", new()
        {
            ResourceGroupName = resourceGroup.Name,
            Location = resourceGroup.Location,
        });

        this.LogAnalyticsWorkspaceId = workspace.Id;
    }
}

// Application stack references shared stack
class ApplicationStack : Stack
{
    public ApplicationStack()
    {
        var sharedStack = new StackReference("ConnectSoft/atp-shared/shared");

        var logAnalyticsWorkspaceId = sharedStack.RequireOutput("LogAnalyticsWorkspaceId")
            .Apply(id => id.ToString());

        // Use shared resources
        var aks = new ContainerService.ManagedCluster("atp-aks", new()
        {
            // Reference shared Log Analytics workspace
            AddonProfiles = new()
            {
                ["omsagent"] = new()
                {
                    Enabled = true,
                    Config = new()
                    {
                        ["logAnalyticsWorkspaceResourceID"] = logAnalyticsWorkspaceId,
                    },
                },
            },
        });
    }
}

AKS Cluster Provisioning

Cluster Configuration (Node Pools, Networking, SKUs)

Complete AKS Cluster with C# Pulumi:

// infrastructure/AKS.cs
using Pulumi;
using Pulumi.AzureNative.ContainerService;
using Pulumi.AzureNative.ContainerService.Inputs;
using Pulumi.AzureNative.Network;
using Pulumi.AzureNative.Resources;

public class AKSCluster
{
    public ManagedCluster Cluster { get; }
    public Output<string> KubeConfig { get; }

    public AKSCluster(Pulumi.Stack stack, string environment, string location, 
        int nodeCount, string nodeVmSize)
    {
        var config = new Config();
        var resourceGroupName = $"atp-{environment}-rg";
        var clusterName = $"atp-{environment}-aks";

        // Resource Group
        var resourceGroup = new ResourceGroup($"atp-{environment}-rg", new()
        {
            Location = location,
            Tags = new()
            {
                { "Environment", environment },
                { "ManagedBy", "pulumi" },
            },
        });

        // Virtual Network
        var vnet = new VirtualNetwork($"atp-{environment}-vnet", new()
        {
            ResourceGroupName = resourceGroup.Name,
            Location = location,
            AddressSpace = new() { AddressPrefixes = { "10.0.0.0/16" } },
            Tags = new()
            {
                { "Environment", environment },
                { "ManagedBy", "pulumi" },
            },
        });

        // Subnet for AKS
        var aksSubnet = new Subnet($"atp-{environment}-aks-subnet", new()
        {
            ResourceGroupName = resourceGroup.Name,
            VirtualNetworkName = vnet.Name,
            AddressPrefix = "10.0.1.0/24",
            Delegations = new()
            {
                new SubnetDelegationArgs
                {
                    Name = "Microsoft.ContainerService.managedClusters",
                    ServiceDelegation = new ServiceDelegationArgs
                    {
                        Name = "Microsoft.ContainerService/managedClusters",
                    },
                },
            },
        });

        // User Assigned Managed Identity
        var identity = new ManagedIdentity.UserAssignedIdentity(
            $"atp-{environment}-aks-identity", new()
            {
                ResourceGroupName = resourceGroup.Name,
                Location = location,
            });

        // AKS Cluster
        var cluster = new ManagedCluster(clusterName, new()
        {
            ResourceGroupName = resourceGroup.Name,
            Location = location,

            // Identity
            Identity = new ManagedClusterIdentityArgs
            {
                Type = ManagedClusterIdentityType.UserAssigned,
                UserAssignedIdentities = new[]
                {
                    identity.Id,
                },
            },

            // Kubernetes Version
            KubernetesVersion = config.Get("kubernetesVersion") ?? "1.28",

            // Node Pool Configuration
            AgentPoolProfiles = new[]
            {
                new ManagedClusterAgentPoolProfileArgs
                {
                    Name = "systempool",
                    Count = nodeCount,
                    VmSize = nodeVmSize,
                    OsType = "Linux",
                    OsDiskSizeGB = 128,
                    Mode = AgentPoolMode.System,
                    EnableAutoScaling = true,
                    MinCount = 2,
                    MaxCount = 10,
                    Type = AgentPoolType.VirtualMachineScaleSets,
                    VnetSubnetId = aksSubnet.Id,
                    MaxPods = 30,
                    NodeLabels = new()
                    {
                        { "pool", "system" },
                        { "environment", environment },
                    },
                    NodeTaints = new[] { "CriticalAddonsOnly=true:NoSchedule" },
                },
                new ManagedClusterAgentPoolProfileArgs
                {
                    Name = "userpool",
                    Count = nodeCount,
                    VmSize = nodeVmSize,
                    OsType = "Linux",
                    OsDiskSizeGB = 128,
                    Mode = AgentPoolMode.User,
                    EnableAutoScaling = true,
                    MinCount = 3,
                    MaxCount = 20,
                    Type = AgentPoolType.VirtualMachineScaleSets,
                    VnetSubnetId = aksSubnet.Id,
                    MaxPods = 30,
                    NodeLabels = new()
                    {
                        { "pool", "user" },
                        { "environment", environment },
                    },
                },
            },

            // Network Profile (Azure CNI)
            NetworkProfile = new ContainerServiceNetworkProfileArgs
            {
                NetworkPlugin = "azure",
                NetworkPolicy = "azure",
                ServiceCidr = "10.1.0.0/16",
                DnsServiceIP = "10.1.0.10",
                LoadBalancerSku = "standard",
            },

            // RBAC
            EnableRBAC = true,

            // Pod Security Standards
            SecurityProfile = new ManagedClusterSecurityProfileArgs
            {
                WorkloadIdentity = new ManagedClusterSecurityProfileWorkloadIdentityArgs
                {
                    Enabled = true,
                },
            },

            // Addon Profiles
            AddonProfiles = new()
            {
                ["httpApplicationRouting"] = new ManagedClusterAddonProfileArgs
                {
                    Enabled = false,
                },
                ["omsagent"] = new ManagedClusterAddonProfileArgs
                {
                    Enabled = true,
                    Config = new()
                    {
                        ["logAnalyticsWorkspaceResourceID"] = config.Require("logAnalyticsWorkspaceId"),
                    },
                },
            },

            // Auto Upgrade Channel
            AutoUpgradeProfile = new ManagedClusterAutoUpgradeProfileArgs
            {
                UpgradeChannel = environment == "production" 
                    ? UpgradeChannel.Stable 
                    : UpgradeChannel.Rapid,
            },

            Tags = new()
            {
                { "Environment", environment },
                { "ManagedBy", "pulumi" },
                { "Project", "ATP" },
            },
        });

        this.Cluster = cluster;
        this.KubeConfig = Output.CreateSecret(
            Output.Tuple(resourceGroup.Name, cluster.Name)
                .Apply(names => GetKubeConfig(names.Item1, names.Item2)));
    }

    private static string GetKubeConfig(string resourceGroupName, string clusterName)
    {
        // Placeholder: in practice, retrieve credentials via
        // ListManagedClusterUserCredentials and hand the decoded
        // kubeconfig to Pulumi's Kubernetes provider
        return "";
    }
}

Azure CNI vs Kubenet

Network Plugin Comparison:

| Feature               | Azure CNI                        | Kubenet                        |
|-----------------------|----------------------------------|--------------------------------|
| IP address management | Pod IPs from the VNet subnet     | Pod IPs NATed through node IP  |
| Pod networking        | Direct VNet connectivity         | Overlay network                |
| Max pods per node     | Up to 250 (configurable)         | 110 (default)                  |
| Network policies      | Azure Network Policies or Calico | Calico only                    |
| VNet integration      | Native VNet integration          | Requires a route table         |
| Performance           | Lower latency                    | Slight NAT overhead            |
| Complexity            | More complex subnet planning     | Simpler setup                  |

ATP Selection: Azure CNI

Rationale:

  • ✅ Native Azure networking for better integration
  • ✅ Direct pod-to-VNet connectivity for Azure services
  • ✅ Support for Azure Network Policies
  • ✅ Better performance for high-throughput workloads
  • ✅ Required for advanced features (Private AKS, etc.)

Azure CNI Configuration:

NetworkProfile = new ContainerServiceNetworkProfileArgs
{
    NetworkPlugin = "azure",
    NetworkPolicy = "azure",  // Azure Network Policies
    ServiceCidr = "10.1.0.0/16",
    DnsServiceIP = "10.1.0.10",
    LoadBalancerSku = "standard",
    PodCidr = null,  // Not used with Azure CNI
}
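
With Azure CNI, every pod draws an IP address from the cluster subnet, so the subnet must be sized for the autoscaler maxima up front. A rough sizing check (a sketch; the node and pod maxima mirror the example pool configuration above):

```shell
# Rough Azure CNI subnet sizing: each node consumes one IP for itself
# plus one IP per pod slot (maxPods).
max_nodes=30            # system pool max (10) + user pool max (20)
max_pods_per_node=30    # MaxPods in the pool profiles above

ips_needed=$(( max_nodes * (max_pods_per_node + 1) ))
echo "IPs required at full scale: $ips_needed"

# A /24 offers 251 usable addresses (Azure reserves 5 per subnet)
usable_in_slash24=251
if [ "$ips_needed" -gt "$usable_in_slash24" ]; then
  echo "A /24 is too small at these maxima; plan a larger subnet (e.g. /22)"
fi
```

At the sample maxima this yields 930 addresses, so the example's 10.0.1.0/24 subnet only suffices if the autoscaler limits stay well below those ceilings.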

Managed Identity Setup

User Assigned Managed Identity:

// Create User Assigned Managed Identity
var identity = new ManagedIdentity.UserAssignedIdentity(
    $"atp-{environment}-aks-identity", new()
    {
        ResourceGroupName = resourceGroup.Name,
        Location = location,
        Tags = new()
        {
            { "Environment", environment },
            { "ManagedBy", "pulumi" },
        },
    });

// Grant permissions to identity
var acrRoleAssignment = new Authorization.RoleAssignment(
    "aks-acr-role-assignment", new()
    {
        PrincipalId = identity.PrincipalId,
        PrincipalType = "ServicePrincipal",
        RoleDefinitionId = "/subscriptions/{subscriptionId}/providers/Microsoft.Authorization/roleDefinitions/7f951dda-4ed3-4680-a7ca-43fe172d538d", // AcrPull
        Scope = acr.Id,
    });

Azure Monitor Integration

Container Insights Configuration:

// Log Analytics Workspace
var logAnalyticsWorkspace = new OperationalInsights.Workspace(
    $"atp-{environment}-loganalytics", new()
    {
        ResourceGroupName = resourceGroup.Name,
        Location = location,
        Sku = new OperationalInsights.WorkspaceSkuArgs
        {
            Name = "PerGB2018",
        },
    });

// Enable Container Insights on AKS
var containerInsights = new ContainerService.ManagedClusterAddonProfileArgs
{
    Enabled = true,
    Config = new()
    {
        ["logAnalyticsWorkspaceResourceID"] = logAnalyticsWorkspace.Id,
    },
};

// Add to the ManagedCluster arguments
AddonProfiles = new()
{
    ["omsagent"] = containerInsights,
}

Network Policies and Security Groups

Network Security Group:

// Network Security Group for AKS subnet
var nsg = new NetworkSecurityGroup($"atp-{environment}-aks-nsg", new()
{
    ResourceGroupName = resourceGroup.Name,
    Location = location,
    SecurityRules = new()
    {
        // Allow inbound from Load Balancer
        new SecurityRuleArgs
        {
            Name = "Allow-LoadBalancer-Inbound",
            Priority = 1000,
            Direction = "Inbound",
            Access = "Allow",
            Protocol = "Tcp",
            SourcePortRange = "*",
            DestinationPortRange = "*",
            SourceAddressPrefix = "AzureLoadBalancer",
            DestinationAddressPrefix = "*",
        },
        // Allow outbound to Internet
        new SecurityRuleArgs
        {
            Name = "Allow-Internet-Outbound",
            Priority = 1000,
            Direction = "Outbound",
            Access = "Allow",
            Protocol = "Tcp",
            SourcePortRange = "*",
            DestinationPortRange = "*",
            SourceAddressPrefix = "*",
            DestinationAddressPrefix = "Internet",
        },
    },
    Tags = new()
    {
        { "Environment", environment },
        { "ManagedBy", "pulumi" },
    },
});

// Attach the NSG to the AKS subnet (this extends the earlier subnet
// definition; a second subnet with the same prefix would conflict)
var subnetWithNsg = new Subnet($"atp-{environment}-aks-subnet", new()
{
    ResourceGroupName = resourceGroup.Name,
    VirtualNetworkName = vnet.Name,
    AddressPrefix = "10.0.1.0/24",
    NetworkSecurityGroup = new NetworkSecurityGroupArgs { Id = nsg.Id },
    Delegations = new()
    {
        new SubnetDelegationArgs
        {
            Name = "Microsoft.ContainerService.managedClusters",
            ServiceDelegation = new ServiceDelegationArgs
            {
                Name = "Microsoft.ContainerService/managedClusters",
            },
        },
    },
});

Azure Resource Provisioning

Azure Container Registry (ACR)

ACR Provisioning:

// infrastructure/ACR.cs
public class AzureContainerRegistry
{
    public ContainerRegistry.Registry Registry { get; }

    public AzureContainerRegistry(Pulumi.Stack stack, string environment, string location)
    {
        var config = new Config();
        var resourceGroupName = $"atp-{environment}-rg";
        var acrName = $"atp{environment}acr".Replace("-", ""); // ACR names must be alphanumeric

        this.Registry = new ContainerRegistry.Registry($"atp-{environment}-acr", new()
        {
            RegistryName = acrName,
            ResourceGroupName = resourceGroupName,
            Location = location,
            Sku = new ContainerRegistry.Inputs.SkuArgs
            {
                Name = config.Get("acrSku") ?? "Basic",
            },
            AdminEnabled = environment != "production", // Disable admin for production
            PublicNetworkAccess = config.GetBool("enablePrivateEndpoint") == true 
                ? "Disabled" 
                : "Enabled",
            Tags = new()
            {
                { "Environment", environment },
                { "ManagedBy", "pulumi" },
                { "Project", "ATP" },
            },
        });

        // Enable geo-replication for production
        if (environment == "production" && config.GetBool("enableGeoReplication") == true)
        {
            new ContainerRegistry.Replication($"atp-production-acr-westus2", new()
            {
                ResourceGroupName = resourceGroupName,
                RegistryName = this.Registry.Name,
                Location = "westus2",
                Tags = new()
                {
                    { "Environment", environment },
                    { "ManagedBy", "pulumi" },
                },
            });
        }
    }
}
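
ACR registry names must be 5–50 alphanumeric characters (storage account names are similarly restricted to 3–24 lowercase alphanumerics), which is why the code above strips hyphens. The normalization can be sketched in shell (a hypothetical helper, not part of the ATP codebase):

```shell
# Normalize a resource name to lowercase alphanumerics, as required
# by ACR (5-50 chars) and storage account (3-24 chars) naming rules.
sanitize() {
  printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | tr -cd 'a-z0-9'
}

acr_name=$(sanitize "atp-Production-acr")
echo "$acr_name"    # atpproductionacr
```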

Azure Key Vault

Key Vault Provisioning:

// infrastructure/KeyVault.cs
// Alias the provider namespace so it does not collide with this class name
using AzureKeyVault = Pulumi.AzureNative.KeyVault;

public class KeyVault
{
    public AzureKeyVault.Vault Vault { get; }

    public KeyVault(Pulumi.Stack stack, string environment, string location, 
        Output<string> accessKey)
    {
        var config = new Config();
        var resourceGroupName = $"atp-{environment}-rg";
        var keyVaultName = $"atp-{environment}-kv";

        // Key Vault
        this.Vault = new AzureKeyVault.Vault($"atp-{environment}-kv", new()
        {
            VaultName = keyVaultName,
            ResourceGroupName = resourceGroupName,
            Location = location,
            Properties = new AzureKeyVault.Inputs.VaultPropertiesArgs
            {
                TenantId = config.Require("tenantId"),
                Sku = new AzureKeyVault.Inputs.SkuArgs
                {
                    Family = "A",
                    Name = (config.Get("keyVaultSku") ?? "standard") == "premium"
                        ? AzureKeyVault.SkuName.Premium
                        : AzureKeyVault.SkuName.Standard,
                },
                EnabledForDeployment = false,
                EnabledForTemplateDeployment = false,
                EnabledForDiskEncryption = false,
                EnableRbacAuthorization = true,
                PublicNetworkAccess = config.GetBool("enablePrivateEndpoint") == true 
                    ? "Disabled" 
                    : "Enabled",
            },
            Tags = new()
            {
                { "Environment", environment },
                { "ManagedBy", "pulumi" },
            },
        });

        // Store initial secret
        new AzureKeyVault.Secret("keyVaultAccessKey", new()
        {
            ResourceGroupName = resourceGroupName,
            VaultName = this.Vault.Name,
            Properties = new AzureKeyVault.Inputs.SecretPropertiesArgs
            {
                Value = accessKey,
            },
        });
    }
}

Azure Service Bus

Service Bus Provisioning:

// infrastructure/ServiceBus.cs
// Alias the provider namespace so it does not collide with this class name
using SB = Pulumi.AzureNative.ServiceBus;

public class ServiceBus
{
    public SB.Namespace Namespace { get; }

    public ServiceBus(Pulumi.Stack stack, string environment, string location)
    {
        var config = new Config();
        var resourceGroupName = $"atp-{environment}-rg";
        var serviceBusName = $"atp-{environment}-sb";

        this.Namespace = new SB.Namespace($"atp-{environment}-sb", new()
        {
            NamespaceName = serviceBusName,
            ResourceGroupName = resourceGroupName,
            Location = location,
            Sku = new SB.Inputs.SBSkuArgs
            {
                Name = environment == "production" ? SB.SkuName.Premium : SB.SkuName.Standard,
                Tier = environment == "production" ? SB.SkuTier.Premium : SB.SkuTier.Standard,
            },
            Tags = new()
            {
                { "Environment", environment },
                { "ManagedBy", "pulumi" },
            },
        });

        // Create queues
        var queues = new[] { "audit-events", "export-requests", "notifications" };
        foreach (var queueName in queues)
        {
            new SB.Queue($"{environment}-{queueName}", new()
            {
                ResourceGroupName = resourceGroupName,
                NamespaceName = this.Namespace.Name,
                QueueName = queueName,
                EnablePartitioning = environment == "production",
                MaxDeliveryCount = 10,
                LockDuration = "PT5M",
                DefaultMessageTimeToLive = "P7D",
            });
        }
    }
}
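
LockDuration and DefaultMessageTimeToLive are ISO 8601 durations: PT5M is five minutes (the Service Bus maximum lock duration) and P7D is seven days. A minimal converter for the simple single-designator forms used here (a sketch; it does not handle combined designators like P1DT2H):

```shell
# Convert the simple single-designator ISO 8601 durations used above
# to seconds (minutes, hours, and days only).
to_seconds() {
  d=$1
  case "$d" in
    PT*M) v=${d#PT}; v=${v%M}; echo $(( v * 60 )) ;;
    PT*H) v=${d#PT}; v=${v%H}; echo $(( v * 3600 )) ;;
    P*D)  v=${d#P};  v=${v%D}; echo $(( v * 86400 )) ;;
    *)    echo "unsupported: $d" >&2; return 1 ;;
  esac
}

echo "PT5M = $(to_seconds PT5M) seconds"   # 300
echo "P7D  = $(to_seconds P7D) seconds"    # 604800
```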

Azure Storage Accounts (Blob, Queue)

Storage Account Provisioning:

// infrastructure/StorageAccount.cs
public class StorageAccount
{
    public Storage.StorageAccount Account { get; }

    public StorageAccount(Pulumi.Stack stack, string environment, string location)
    {
        var config = new Config();
        var resourceGroupName = $"atp-{environment}-rg";
        var storageName = $"atp{environment}st".Replace("-", ""); // Must be lowercase, alphanumeric

        this.Account = new Storage.StorageAccount($"atp-{environment}-st", new()
        {
            ResourceGroupName = resourceGroupName,
            Location = location,
            AccountName = storageName,
            Kind = "StorageV2",
            Sku = new Storage.Inputs.SkuArgs
            {
                Name = environment == "production" ? "Standard_GRS" : "Standard_LRS",
            },
            EnableHttpsTrafficOnly = true,
            AllowBlobPublicAccess = false,
            MinimumTlsVersion = "TLS1_2",
            NetworkRuleSet = new Storage.Inputs.NetworkRuleSetArgs
            {
                DefaultAction = config.GetBool("enablePrivateEndpoint") == true 
                    ? "Deny" 
                    : "Allow",
                Bypass = "AzureServices",
            },
            Tags = new()
            {
                { "Environment", environment },
                { "ManagedBy", "pulumi" },
            },
        });

        // Blob Container
        new Storage.BlobContainer("audit-trail", new()
        {
            ResourceGroupName = resourceGroupName,
            AccountName = this.Account.Name,
            ContainerName = "audit-trail",
            PublicAccess = "None",
        });

        // Queue
        new Storage.Queue("audit-processing", new()
        {
            ResourceGroupName = resourceGroupName,
            AccountName = this.Account.Name,
            QueueName = "audit-processing",
        });
    }
}

Application Insights / Log Analytics

Application Insights and Log Analytics:

// infrastructure/Monitoring.cs
public class Monitoring
{
    public OperationalInsights.Workspace LogAnalyticsWorkspace { get; }
    public Insights.Component ApplicationInsights { get; }

    public Monitoring(Pulumi.Stack stack, string environment, string location)
    {
        var config = new Config();
        var resourceGroupName = $"atp-{environment}-rg";

        // Log Analytics Workspace
        this.LogAnalyticsWorkspace = new OperationalInsights.Workspace(
            $"atp-{environment}-loganalytics", new()
            {
                ResourceGroupName = resourceGroupName,
                Location = location,
                Sku = new OperationalInsights.WorkspaceSkuArgs
                {
                    Name = "PerGB2018",
                },
                RetentionInDays = environment == "production" ? 730 : 30,
                Tags = new()
                {
                    { "Environment", environment },
                    { "ManagedBy", "pulumi" },
                },
            });

        // Application Insights
        this.ApplicationInsights = new Insights.Component($"atp-{environment}-appinsights", new()
        {
            ResourceGroupName = resourceGroupName,
            Location = location,
            Kind = "web",
            ApplicationType = "web",
            WorkspaceResourceId = this.LogAnalyticsWorkspace.Id,
            RetentionInDays = environment == "production" ? 730 : 30,
            Tags = new()
            {
                { "Environment", environment },
                { "ManagedBy", "pulumi" },
            },
        });
    }
}

Pulumi State Management

State Backend in Azure Blob Storage

Azure Blob Storage Backend Configuration:

# Initialize Pulumi with the Azure Blob Storage backend
# (the azblob URL names the container; AZURE_STORAGE_ACCOUNT selects the account)
export AZURE_STORAGE_ACCOUNT=atppulumistate
pulumi login --cloud-url azblob://pulumi-state

# Or configure in Pulumi.yaml

Backend Configuration (Pulumi.yaml):

name: atp-infrastructure
runtime: dotnet
backend:
  url: azblob://pulumi-state

Backend Setup:

# Create storage account for state (one-time setup)
az storage account create \
  --name atppulumistate \
  --resource-group atp-shared-rg \
  --location eastus \
  --sku Standard_LRS \
  --allow-blob-public-access false

# Create container
az storage container create \
  --name pulumi-state \
  --account-name atppulumistate \
  --auth-mode login

# Log in to the backend (requires AZURE_STORAGE_ACCOUNT plus a storage
# key or az CLI credentials in the environment)
export AZURE_STORAGE_ACCOUNT=atppulumistate
pulumi login --cloud-url azblob://pulumi-state

State Locking Mechanisms

State Locking:

  • Automatic Locking: Pulumi automatically locks the stack during state-mutating operations
  • Lock Storage: the azblob backend writes lock files alongside the state (under .pulumi/locks)
  • Lock Scope: locks are per stack, so concurrent updates to different stacks are unaffected
  • Lock Release: released automatically when the operation completes; a stuck lock can be cleared with pulumi cancel

Manual Lock Management:

# Inspect the current stack and its resources
pulumi stack --show-urns

# Force unlock (if stuck)
pulumi cancel --stack production

State Encryption

State Encryption at Rest:

# Enable encryption on storage account
az storage account update \
  --name atppulumistate \
  --resource-group atp-shared-rg \
  --encryption-services blob

# Use Azure Key Vault for encryption keys
az storage account update \
  --name atppulumistate \
  --resource-group atp-shared-rg \
  --encryption-key-source Microsoft.Keyvault \
  --encryption-key-vault "https://atp-shared-kv.vault.azure.net/keys/storage-encryption"

Encrypted Secrets in State:

// Secrets are automatically encrypted in state
var password = config.RequireSecret("sqlAdminPassword");
// This value is encrypted in the state file
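
After `pulumi config set --secret`, the value is written to the stack settings file as ciphertext rather than plaintext; only the stack's configured secrets provider (passphrase or KMS key) can decrypt it. The ciphertext below is illustrative:

```yaml
# Pulumi.production.yaml (excerpt)
config:
  atp-infrastructure:sqlAdminPassword:
    secure: v1:AAABt...redacted...0k=
```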

Backup and Recovery

State Backup Strategy:

# Enable blob versioning
az storage account blob-service-properties update \
  --account-name atppulumistate \
  --enable-versioning true

# Enable soft delete
az storage account blob-service-properties update \
  --account-name atppulumistate \
  --enable-delete-retention true \
  --delete-retention-days 30

# Manual backup (stack checkpoints live under .pulumi/stacks/ in the container)
az storage blob download \
  --account-name atppulumistate \
  --container-name pulumi-state \
  --name .pulumi/stacks/production.json \
  --file backup-$(date +%Y%m%d)-production.json

State Recovery:

# Restore from backup
az storage blob upload \
  --account-name atppulumistate \
  --container-name pulumi-state \
  --name .pulumi/stacks/production.json \
  --file backup-20240115-production.json \
  --overwrite

GitOps Workflow for Infrastructure

Infrastructure Changes via PR

PR Workflow for Infrastructure:

graph LR
    A[Developer] -->|Create PR| B[Infrastructure Code<br/>in Git]
    B -->|PR Validation| C[Pulumi Preview]
    C -->|Lint & Validate| D[Security Scan]
    D -->|Review| E[Manual Approval]
    E -->|Merge| F[Pulumi Up]
    F -->|Update State| G[Azure Resources<br/>Provisioned]

    style B fill:#90EE90
    style C fill:#FFE5B4
    style D fill:#FFE5B4
    style E fill:#ffcccc
    style F fill:#90EE90
    style G fill:#ffcccc

Pulumi Preview in PR Validation

Azure Pipeline: PR Validation:

# .azuredevops/pipelines/infrastructure-pr-validation.yml
trigger: none

pr:
  branches:
    include:
      - main
      - staging
      - test
      - dev

pool:
  vmImage: 'ubuntu-latest'

variables:
  - group: ATP-Infrastructure-Variables

stages:
- stage: ValidateInfrastructure
  displayName: 'Validate Infrastructure Changes'
  jobs:
  - job: PulumiPreview
    displayName: 'Pulumi Preview'
    steps:
    - checkout: self

    # Determine stack from the PR target branch
    # (in PR builds, Build.SourceBranch is refs/pull/*/merge, so map the
    # target branch instead)
    - script: |
        case "$(System.PullRequest.TargetBranch)" in
          refs/heads/main)
            STACK="production"
            ;;
          refs/heads/staging)
            STACK="staging"
            ;;
          refs/heads/test)
            STACK="test"
            ;;
          *)
            STACK="dev"
            ;;
        esac
        echo "##vso[task.setvariable variable=PulumiStack]$STACK"
      displayName: 'Determine Pulumi stack'

    # Install Pulumi
    - script: |
        curl -fsSL https://get.pulumi.com | sh
        export PATH="$HOME/.pulumi/bin:$PATH"
        pulumi version
      displayName: 'Install Pulumi'

    # Restore .NET dependencies
    - script: |
        dotnet restore
      displayName: 'Restore .NET dependencies'

    # Set stack configuration
    - script: |
        export PATH="$HOME/.pulumi/bin:$PATH"
        pulumi stack select $(PulumiStack)
        pulumi config set azure-native:location $(AzureLocation)
        pulumi config set environment $(PulumiStack)
      displayName: 'Configure Pulumi stack'
      env:
        PULUMI_ACCESS_TOKEN: $(PulumiAccessToken)
        PULUMI_CONFIG_PASSPHRASE: $(PulumiConfigPassphrase)

    # Run Pulumi preview
    - script: |
        export PATH="$HOME/.pulumi/bin:$PATH"
        pulumi preview --stack $(PulumiStack) \
          --diff \
          --json > preview-output.json
      displayName: 'Run Pulumi preview'
      continueOnError: true
      env:
        PULUMI_ACCESS_TOKEN: $(PulumiAccessToken)
        PULUMI_CONFIG_PASSPHRASE: $(PulumiConfigPassphrase)

    # Publish preview output
    - task: PublishPipelineArtifact@1
      condition: always()
      inputs:
        targetPath: 'preview-output.json'
        artifactName: 'pulumi-preview-$(PulumiStack)'
        publishLocation: 'pipeline'

    # Validate preview output
    - script: |
        if [ -s preview-output.json ]; then
          echo "✅ Preview generated successfully"
          # Check for destroy operations (require special approval)
          if grep -q '"op": *"delete"' preview-output.json; then
            echo "⚠️  WARNING: Preview contains resource deletions"
            exit 1
          fi
        else
          echo "❌ Preview failed or produced no output"
          exit 1
        fi
      displayName: 'Validate preview output'
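
The branch-to-stack mapping embedded in the pipeline above can be exercised locally as a plain shell function (a sketch mirroring the case statement):

```shell
# Map a source branch ref to a Pulumi stack name, mirroring the
# pipeline's case statement.
stack_for_branch() {
  case "$1" in
    refs/heads/main)    echo "production" ;;
    refs/heads/staging) echo "staging" ;;
    refs/heads/test)    echo "test" ;;
    *)                  echo "dev" ;;
  esac
}

echo "$(stack_for_branch refs/heads/main)"         # production
echo "$(stack_for_branch refs/heads/feature/x)"    # dev
```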

Manual Approval for Infrastructure Changes

Approval Gates:

# Azure Pipeline: Infrastructure deployment
trigger:
  branches:
    include:
      - main
      - staging

stages:
- stage: ApprovalGate
  displayName: 'Infrastructure Change Approval'
  jobs:
  - job: WaitForApproval
    displayName: 'Wait for Approval'
    pool: server
    steps:
    - task: ManualValidation@0
      timeoutInMinutes: 1440  # 24 hours
      inputs:
        notifyUsers: 'architect-team@connectsoft.example;sre-lead@connectsoft.example'
        instructions: |
          Review the Pulumi preview output before approving infrastructure changes.

          ⚠️  WARNING: Infrastructure changes can affect production services.

          Please verify:
          - Resource changes are expected
          - No unintended resource deletions
          - Configuration values are correct
          - Cost impact is acceptable

- stage: DeployInfrastructure
  displayName: 'Deploy Infrastructure'
  dependsOn: ApprovalGate
  condition: succeeded()
  jobs:
  - job: PulumiUp
    steps:
    # ... Pulumi up steps

Pulumi Up After Approval

Deployment Stage:

- stage: DeployInfrastructure
  displayName: 'Deploy Infrastructure'
  jobs:
  - job: PulumiUp
    displayName: 'Pulumi Up'
    steps:
    - checkout: self

    - script: |
        curl -fsSL https://get.pulumi.com | sh
        export PATH="$HOME/.pulumi/bin:$PATH"
      displayName: 'Install Pulumi'

    - script: |
        dotnet restore
        dotnet build
      displayName: 'Build Pulumi program'

    - script: |
        export PATH="$HOME/.pulumi/bin:$PATH"
        pulumi stack select $(PulumiStack)
        pulumi config set azure-native:location $(AzureLocation)
      displayName: 'Configure stack'
      env:
        PULUMI_ACCESS_TOKEN: $(PulumiAccessToken)
        PULUMI_CONFIG_PASSPHRASE: $(PulumiConfigPassphrase)

    - script: |
        export PATH="$HOME/.pulumi/bin:$PATH"
        pulumi up --stack $(PulumiStack) \
          --yes \
          --skip-preview
      displayName: 'Deploy infrastructure'
      env:
        PULUMI_ACCESS_TOKEN: $(PulumiAccessToken)
        PULUMI_CONFIG_PASSPHRASE: $(PulumiConfigPassphrase)

Infrastructure Drift Detection

Drift Detection Script:

#!/bin/bash
# scripts/detect-infrastructure-drift.sh

set -euo pipefail

STACK="${1:-production}"

echo "🔍 Detecting infrastructure drift for stack: $STACK"

# Refresh state
pulumi refresh --stack "$STACK" --yes

# Preview against the refreshed state; --expect-no-changes makes the
# command exit non-zero when any change (i.e. drift) is detected
if pulumi preview --stack "$STACK" --diff --expect-no-changes > drift-diff.txt 2>&1; then
  echo "✅ No infrastructure drift detected"
  exit 0
else
  echo "⚠️  Infrastructure drift detected!"
  cat drift-diff.txt

  # Send alert
  echo "🚨 Alert: Infrastructure drift detected in $STACK stack"
  exit 1
fi

Scheduled Drift Detection:

# Azure Pipeline: Scheduled drift detection
schedules:
- cron: "0 2 * * *"  # Daily at 2 AM UTC
  branches:
    include:
      - main
  displayName: 'Daily Infrastructure Drift Detection'

stages:
- stage: DriftDetection
  displayName: 'Detect Infrastructure Drift'
  jobs:
  - job: CheckDrift
    steps:
    - script: |
        ./scripts/detect-infrastructure-drift.sh production
      displayName: 'Detect drift'

Pulumi Automation API

Programmatic Infrastructure Management

Pulumi Automation API Example:

// infrastructure/Automation.cs
using System;
using System.Threading.Tasks;
using Pulumi.Automation;

public class InfrastructureAutomation
{
    public static async Task<UpdateSummary?> UpdateInfrastructureAsync(
        string stackName, string projectName)
    {
        // Create or select the stack (inline program, no separate CLI invocation)
        var stack = await LocalWorkspace.CreateOrSelectStackAsync(
            new InlineProgramArgs(projectName, stackName)
            {
                Program = PulumiFn.Create<ATPStack>(),
            });

        // Set stack configuration
        await stack.SetConfigAsync("azure-native:location", 
            new ConfigValue("eastus"));

        // Preview changes
        var preview = await stack.PreviewAsync(new PreviewOptions
        {
            OnStandardOutput = Console.WriteLine,
        });

        if (preview.ChangeSummary.ContainsKey(OperationType.Create) || 
            preview.ChangeSummary.ContainsKey(OperationType.Update))
        {
            // Deploy changes
            var update = await stack.UpAsync(new UpOptions
            {
                OnStandardOutput = Console.WriteLine,
            });

            return update.Summary;
        }

        return null;
    }
}

Dynamic Infrastructure Provisioning

Dynamic Resource Creation:

// Create resources dynamically based on configuration
var environments = new[] { "dev", "test", "staging", "production" };

foreach (var env in environments)
{
    var stack = new StackReference($"ConnectSoft/atp-infrastructure/{env}");

    // Create environment-specific resources
    var aks = new AKSCluster(stack, env, "eastus", 3, "Standard_D2s_v3");
    var acr = new AzureContainerRegistry(stack, env, "eastus");
}

Pulumi Policy as Code

Resource Validation Policies

Pulumi Policy Example:

// policies/enforce-tagging.ts
import { PolicyPack } from "@pulumi/policy";

const policies = new PolicyPack("atp-tagging-policies", {
    policies: [{
        name: "require-environment-tag",
        description: "All resources must have an Environment tag",
        enforcementLevel: "mandatory",
        validateResource: (args, reportViolation) => {
            const tags = args.props.tags || {};
            if (!tags.Environment) {
                reportViolation("Resource must have an Environment tag");
            }
        },
    }, {
        name: "require-managedby-tag",
        description: "All resources must have a ManagedBy tag",
        enforcementLevel: "mandatory",
        validateResource: (args, reportViolation) => {
            const tags = args.props.tags || {};
            if (tags.ManagedBy !== "pulumi") {
                reportViolation("Resource must have ManagedBy=pulumi tag");
            }
        },
    }],
});

Compliance Checks (Tagging, Encryption, etc.)

Compliance Policies:

// policies/compliance-policies.ts
import { PolicyPack } from "@pulumi/policy";

const policies = new PolicyPack("atp-compliance-policies", {
    policies: [{
        name: "require-https-traffic-only",
        description: "Storage accounts must require HTTPS-only traffic",
        enforcementLevel: "mandatory",
        validateResource: (args, reportViolation) => {
            if (args.type === "azure-native:storage:StorageAccount") {
                if (!args.props.enableHttpsTrafficOnly) {
                    reportViolation("Storage account must have HTTPS-only traffic enabled");
                }
            }
        },
    }, {
        name: "prevent-public-blob-access",
        description: "Storage accounts must not allow public blob access",
        enforcementLevel: "mandatory",
        validateResource: (args, reportViolation) => {
            if (args.type === "azure-native:storage:StorageAccount") {
                if (args.props.allowBlobPublicAccess) {
                    reportViolation("Storage account must not allow public blob access");
                }
            }
        },
    }],
});

Cost Controls (SKU Limits, Region Restrictions)

Cost Control Policies:

// policies/cost-control-policies.ts
import { PolicyPack } from "@pulumi/policy";

const policies = new PolicyPack("atp-cost-control-policies", {
    policies: [{
        name: "limit-aks-node-vm-size",
        description: "AKS node VM size must not exceed Standard_D4s_v3",
        enforcementLevel: "mandatory",
        validateResource: (args, reportViolation) => {
            if (args.type === "azure-native:containerservice:ManagedCluster") {
                const allowedSizes = ["Standard_D2s_v3", "Standard_D4s_v3"];
                const agentPools = args.props.agentPoolProfiles || [];
                for (const pool of agentPools) {
                    if (pool.vmSize && !allowedSizes.includes(pool.vmSize)) {
                        reportViolation(`VM size ${pool.vmSize} exceeds maximum allowed (Standard_D4s_v3)`);
                    }
                }
            }
        },
    }, {
        name: "restrict-regions",
        description: "Resources must be deployed only in approved regions",
        enforcementLevel: "mandatory",
        validateResource: (args, reportViolation) => {
            const allowedRegions = ["eastus", "westus2"];
            const location = args.props.location;
            if (location && !allowedRegions.includes(location)) {
                reportViolation(`Region ${location} is not in the approved list: ${allowedRegions.join(", ")}`);
            }
        },
    }],
});

Apply Policies:

# Enable the published policy pack for the organization
pulumi policy enable ConnectSoft/atp-policies latest

# Validate a local policy pack during preview
pulumi preview --policy-pack ./policies

Infrastructure Drift Detection

Detecting Out-of-Band Changes

Drift Detection Workflow:

#!/bin/bash
# scripts/detect-drift.sh

STACK="${1:-production}"

echo "🔄 Refreshing state to detect drift..."

# Refresh state from Azure
pulumi refresh --stack "$STACK" --yes

# Preview changes; --expect-no-changes exits non-zero when drift exists
if pulumi preview --stack "$STACK" --diff --expect-no-changes > drift-report.txt; then
  echo "✅ No drift detected"
else
  echo "⚠️  Drift detected!"
  cat drift-report.txt

  # Send alert
  echo "🚨 Infrastructure drift detected in $STACK stack" | \
    mail -s "Infrastructure Drift Alert" sre-team@connectsoft.example
fi

Pulumi Refresh and Diff

Refresh and Diff Commands:

# Refresh state from actual Azure resources
pulumi refresh --stack production

# Preview differences (drift)
pulumi preview --stack production --diff

# List stack resources with their URNs
pulumi stack --show-urns --stack production

Automated Drift Correction or Alerts

Automated Drift Correction:

# Azure Pipeline: Automated drift correction
schedules:
- cron: "0 3 * * *"  # Daily at 3 AM UTC
  branches:
    include:
      - main

stages:
- stage: DriftCorrection
  displayName: 'Automated Drift Correction'
  jobs:
  - job: CorrectDrift
    steps:
    - script: |
        pulumi refresh --stack production --yes

        if pulumi preview --stack production --diff --expect-no-changes > drift-diff.txt; then
          echo "✅ No drift detected"
        else
          # Auto-correct only when drift is limited to tags and nothing is deleted
          if grep -q "tags" drift-diff.txt && ! grep -q "delete" drift-diff.txt; then
            echo "✅ Auto-correcting drift (tags only)"
            pulumi up --stack production --yes
          else
            echo "⚠️  Manual intervention required"
            # Send alert
            exit 1
          fi
        fi
      displayName: 'Correct drift'

Disaster Recovery

Infrastructure Re-Provisioning from Git

DR Procedure:

#!/bin/bash
# scripts/disaster-recovery.sh

ENVIRONMENT="${1:-production}"
RESOURCE_GROUP="${2:-atp-production-rg}"

echo "🚨 Starting disaster recovery for $ENVIRONMENT..."

# 1. Verify Git repository is accessible
git clone https://dev.azure.com/ConnectSoft/ATP/_git/atp-infrastructure.git
cd atp-infrastructure

# 2. Configure backend first (state may be lost; restore from backup if needed)
pulumi login azblob://atp-pulumi-state

# 3. Select Pulumi stack
pulumi stack select "$ENVIRONMENT"

# 4. Re-provision infrastructure
pulumi up --stack "$ENVIRONMENT" --yes

echo "✅ Disaster recovery complete"

RTO and RPO Targets

DR Targets:

| Metric | Target | Notes |
|---|---|---|
| RTO | 4 hours | Time to restore infrastructure |
| RPO | 24 hours | Maximum acceptable data loss |
| State Recovery | 1 hour | Time to restore Pulumi state |
| Infrastructure Provisioning | 2 hours | Time to provision all resources |
| Application Deployment | 1 hour | Time to deploy applications via GitOps |
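As a sanity check, the phase-level targets (state recovery, provisioning, application deployment) should sum to no more than the overall RTO. A small illustrative sketch (the names and structure are ours, not part of the ATP tooling):

```python
# Illustrative RTO budget check; phase names and durations mirror the table above.
RTO_HOURS = 4.0

RECOVERY_PHASES = {
    "state_recovery": 1.0,               # restore Pulumi state
    "infrastructure_provisioning": 2.0,  # pulumi up
    "application_deployment": 1.0,       # GitOps sync
}

def within_rto(phases: dict, rto_hours: float) -> bool:
    """True when the summed phase durations fit inside the RTO."""
    return sum(phases.values()) <= rto_hours
```

Running `within_rto(RECOVERY_PHASES, RTO_HOURS)` confirms the published phase targets fit the 4-hour RTO exactly, with no slack for coordination overhead.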

DR Drill Procedures

DR Drill Checklist:

## Disaster Recovery Drill Checklist

### Pre-Drill
- [ ] Schedule DR drill (quarterly)
- [ ] Notify stakeholders
- [ ] Backup current state
- [ ] Document current infrastructure state

### Drill Execution
- [ ] Simulate disaster scenario
- [ ] Verify Git repository accessibility
- [ ] Restore Pulumi state (if needed)
- [ ] Re-provision infrastructure
- [ ] Verify resource provisioning
- [ ] Deploy applications via GitOps
- [ ] Run smoke tests
- [ ] Verify application functionality

### Post-Drill
- [ ] Document findings
- [ ] Update DR procedures
- [ ] Review RTO/RPO targets
- [ ] Schedule next drill

DR Drill Script:

#!/bin/bash
# scripts/dr-drill.sh

ENVIRONMENT="${1:-staging}"  # Use staging for drills

echo "🎯 Starting DR drill for $ENVIRONMENT environment..."

# 1. Backup current state
echo "📦 Backing up current state..."
az storage blob download \
  --account-name atppulumistate \
  --container-name pulumi-state \
  --name ".pulumi/stacks/$ENVIRONMENT.json" \
  --file "backup-$(date +%Y%m%d)-$ENVIRONMENT.json"

# 2. Destroy infrastructure (simulate disaster)
echo "💥 Simulating disaster (destroying infrastructure)..."
read -p "Are you sure? (yes/no): " confirm
if [ "$confirm" == "yes" ]; then
  pulumi destroy --stack "$ENVIRONMENT" --yes
fi

# 3. Re-provision from Git
echo "🔨 Re-provisioning infrastructure..."
pulumi up --stack "$ENVIRONMENT" --yes

# 4. Verify
echo "✅ DR drill complete. Verify infrastructure is operational."

Summary: Pulumi Infrastructure as Code Integration

  • Pulumi Overview: C# programming model for ATP infrastructure with type safety and testability
  • Stack Management: Environment-specific stacks (dev, test, staging, production) with configuration and secrets
  • AKS Provisioning: Complete cluster configuration with node pools, networking (Azure CNI), managed identity, Azure Monitor integration
  • Azure Resources: ACR, Key Vault, Service Bus, Storage Accounts, Application Insights/Log Analytics
  • State Management: Azure Blob Storage backend with locking, encryption, and backup/recovery
  • GitOps Workflow: Infrastructure changes via PR, Pulumi preview validation, manual approval, automated deployment
  • Automation API: Programmatic infrastructure management and dynamic provisioning
  • Policy as Code: Resource validation, compliance checks, cost controls
  • Drift Detection: Automated detection and correction of out-of-band changes
  • Disaster Recovery: Infrastructure re-provisioning from Git with RTO/RPO targets and DR drill procedures

Azure Key Vault Secret Management

Purpose: Define how Azure Key Vault is used for secure secret management in ATP, integrating with Kubernetes workloads via Workload Identity, External Secrets Operator, and CSI Driver to ensure secrets are never stored in Git and are securely injected into pods at runtime.


Azure Key Vault Architecture

Key Vault per Environment

Key Vault Organization:

| Environment | Key Vault Name | Resource Group | Purpose |
|---|---|---|---|
| Dev | atp-dev-kv | atp-dev-rg | Development secrets |
| Test | atp-test-kv | atp-test-rg | Testing secrets |
| Staging | atp-staging-kv | atp-staging-rg | Pre-production secrets |
| Production | atp-prod-kv | atp-prod-rg | Production secrets |
| Shared | atp-shared-kv | atp-shared-rg | Cross-environment secrets |

Key Vault Provisioning with Pulumi:

// infrastructure/KeyVault.cs
using Pulumi;
using KeyVault = Pulumi.AzureNative.KeyVault;
using Network = Pulumi.AzureNative.Network;

public class AtpKeyVault
{
    public KeyVault.Vault Vault { get; }

    public AtpKeyVault(string environment, string location, Input<string> subnetId)
    {
        var config = new Config();
        var resourceGroupName = $"atp-{environment}-rg";
        var keyVaultName = $"atp-{environment}-kv";

        this.Vault = new KeyVault.Vault(keyVaultName, new()
        {
            ResourceGroupName = resourceGroupName,
            Location = location,
            Properties = new KeyVault.Inputs.VaultPropertiesArgs
            {
                TenantId = config.Require("tenantId"),
                Sku = new KeyVault.Inputs.SkuArgs
                {
                    Family = KeyVault.SkuFamily.A,
                    Name = environment == "production"
                        ? KeyVault.SkuName.Premium
                        : KeyVault.SkuName.Standard,
                },
                EnabledForDeployment = false,
                EnabledForTemplateDeployment = false,
                EnabledForDiskEncryption = false,
                EnableRbacAuthorization = true,  // Use RBAC instead of access policies
                EnableSoftDelete = true,
                SoftDeleteRetentionInDays = environment == "production" ? 90 : 7,
                EnablePurgeProtection = environment == "production",
                PublicNetworkAccess = config.GetBoolean("enablePrivateEndpoint") == true 
                    ? "Disabled" 
                    : "Enabled",
            },
            Tags = new()
            {
                { "Environment", environment },
                { "ManagedBy", "pulumi" },
                { "Compliance", "SOC2" },
            },
        });

        // Private endpoint for production
        if (environment == "production" && config.GetBoolean("enablePrivateEndpoint") == true)
        {
            new Network.PrivateEndpoint($"atp-{environment}-kv-pe", new()
            {
                ResourceGroupName = resourceGroupName,
                Location = location,
                Subnet = new Network.Inputs.SubnetArgs
                {
                    Id = subnetId,
                },
                PrivateLinkServiceConnections = new[]
                {
                    new Network.Inputs.PrivateLinkServiceConnectionArgs
                    {
                        Name = "keyvault-connection",
                        PrivateLinkServiceId = this.Vault.Id,
                        GroupIds = new[] { "vault" },
                    },
                },
            });
        }
    }
}

Secret Organization and Naming

Secret Naming Conventions:

Pattern: {category}/{service}/{secret-name}
Examples:
  - connection-strings/atp-ingestion/sql-connection-string
  - api-keys/atp-gateway/stripe-api-key
  - certificates/atp-gateway/tls-cert
  - credentials/atp-query/service-account-password
  - encryption-keys/atp-integrity/data-encryption-key
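Because Key Vault object names allow only alphanumerics and hyphens, a small helper can flatten the logical path into a valid secret name and reject anything else (an illustrative sketch; the function name is ours):

```python
import re

# Key Vault object names: 1-127 chars, alphanumerics and hyphens only.
_KV_NAME = re.compile(r"^[0-9a-zA-Z-]{1,127}$")

def kv_secret_name(category: str, service: str, secret: str) -> str:
    """Flatten {category}/{service}/{secret-name} into a valid Key Vault secret name."""
    name = f"{category}-{service}-{secret}"
    if not _KV_NAME.match(name):
        raise ValueError(f"invalid Key Vault secret name: {name!r}")
    return name
```

For example, `kv_secret_name("connection-strings", "atp-ingestion", "sql-connection-string")` yields `connection-strings-atp-ingestion-sql-connection-string`, the name used in the az CLI examples below.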

Secret Categories:

atp-{env}-kv/
├── connection-strings/
│   ├── atp-ingestion/sql-connection-string
│   ├── atp-query/redis-connection-string
│   └── atp-export/blob-storage-connection-string
├── api-keys/
│   ├── atp-gateway/stripe-api-key
│   ├── atp-gateway/sendgrid-api-key
│   └── atp-search/elasticsearch-api-key
├── certificates/
│   ├── atp-gateway/tls-cert
│   └── atp-integrity/signing-cert
├── credentials/
│   ├── atp-query/service-account-password
│   └── atp-export/external-api-credentials
└── encryption-keys/
    └── atp-integrity/data-encryption-key

Secret Metadata Tags:

// Set secret with metadata
var secret = new KeyVault.Secret("sql-connection-string", new()
{
    ResourceGroupName = resourceGroupName,
    VaultName = keyVault.Name,
    Properties = new KeyVault.Inputs.SecretPropertiesArgs
    {
        Value = connectionString,
        ContentType = "application/json",
        Attributes = new KeyVault.Inputs.SecretAttributesArgs
        {
            Enabled = true,
            // azure-native expects the expiry as seconds since the Unix epoch
            Expires = (int)DateTimeOffset.UtcNow.AddYears(1).ToUnixTimeSeconds(),
        },
    },
    Tags = new()
    {
        { "Category", "connection-strings" },
        { "Service", "atp-ingestion" },
        { "Environment", environment },
        { "RotatedBy", "automation" },
        { "LastRotated", DateTimeOffset.UtcNow.ToString("O") },
    },
});

Access Policies vs RBAC

RBAC Configuration (Recommended):

// Grant Key Vault Secrets User role to AKS Workload Identity
var workloadIdentityRoleAssignment = new Authorization.RoleAssignment(
    "workload-identity-kv-secrets-user", new()
    {
        PrincipalId = workloadIdentityPrincipalId,
        PrincipalType = "ServicePrincipal",
        RoleDefinitionId = "/subscriptions/{subscriptionId}/providers/Microsoft.Authorization/roleDefinitions/4633458b-17de-408a-b874-0445c86b69e6", // Key Vault Secrets User
        Scope = keyVault.Id,
    });

// Grant Key Vault Secrets Officer role for secret rotation
var rotationRoleAssignment = new Authorization.RoleAssignment(
    "rotation-kv-secrets-officer", new()
    {
        PrincipalId = rotationServicePrincipalId,
        PrincipalType = "ServicePrincipal",
        RoleDefinitionId = "/subscriptions/{subscriptionId}/providers/Microsoft.Authorization/roleDefinitions/b86a8fe4-44ce-494c-a47a-613bb0b0c8c7", // Key Vault Secrets Officer
        Scope = keyVault.Id,
    });

RBAC vs Access Policies Comparison:

| Feature | RBAC (Recommended) | Access Policies |
|---|---|---|
| Granularity | Role-based (Key Vault Secrets User, Officer) | Permission-based (get, list, set, delete) |
| Audit Trail | ✅ Better audit logging | ⚠️ Limited |
| Centralized Management | ✅ Azure AD integration | ❌ Vault-specific |
| Least Privilege | ✅ Fine-grained roles | ⚠️ Can be overly permissive |
| Maintenance | ✅ Easier to manage | ❌ Manual per-vault |

ATP Selection: RBAC

Rationale:

  • ✅ Better audit trail for compliance
  • ✅ Centralized Azure AD management
  • ✅ Fine-grained role assignments
  • ✅ Easier to maintain and review


Secret Categories

Connection Strings (Databases, Service Bus)

SQL Database Connection String:

# Set SQL connection string secret
az keyvault secret set \
  --vault-name atp-prod-kv \
  --name "connection-strings-atp-ingestion-sql-connection-string" \
  --value "Server=tcp:atp-prod-sql.database.windows.net,1433;Initial Catalog=ATP;Persist Security Info=False;User ID=atp-ingestion;Password=SecurePassword123!;MultipleActiveResultSets=False;Encrypt=True;TrustServerCertificate=False;Connection Timeout=30;" \
  --content-type "application/json" \
  --tags Category=connection-strings Service=atp-ingestion Environment=production

Redis Connection String:

# Set Redis connection string secret
az keyvault secret set \
  --vault-name atp-prod-kv \
  --name "connection-strings-atp-query-redis-connection-string" \
  --value "atp-prod-redis.redis.cache.windows.net:6380,password=SecurePassword123!,ssl=True,abortConnect=False" \
  --content-type "application/json" \
  --tags Category=connection-strings Service=atp-query Environment=production

Service Bus Connection String:

# Set Service Bus connection string secret
az keyvault secret set \
  --vault-name atp-prod-kv \
  --name "connection-strings-atp-ingestion-servicebus-connection-string" \
  --value "Endpoint=sb://atp-prod-sb.servicebus.windows.net/;SharedAccessKeyName=RootManageSharedAccessKey;SharedAccessKey=SecureKey123!" \
  --content-type "application/json" \
  --tags Category=connection-strings Service=atp-ingestion Environment=production

API Keys and Tokens

External API Keys:

# Set Stripe API key
az keyvault secret set \
  --vault-name atp-prod-kv \
  --name "api-keys-atp-gateway-stripe-api-key" \
  --value "sk_live_51ABC123..." \
  --content-type "text/plain" \
  --tags Category=api-keys Service=atp-gateway Environment=production Provider=stripe

# Set SendGrid API key
az keyvault secret set \
  --vault-name atp-prod-kv \
  --name "api-keys-atp-gateway-sendgrid-api-key" \
  --value "SG.ABC123..." \
  --content-type "text/plain" \
  --tags Category=api-keys Service=atp-gateway Environment=production Provider=sendgrid

JWT Tokens:

# Set JWT signing key
az keyvault secret set \
  --vault-name atp-prod-kv \
  --name "api-keys-atp-gateway-jwt-signing-key" \
  --value "-----BEGIN PRIVATE KEY-----\nABC123...\n-----END PRIVATE KEY-----" \
  --content-type "application/x-pem-file" \
  --tags Category=api-keys Service=atp-gateway Environment=production Type=jwt-signing-key

Certificates (TLS, Signing)

TLS Certificate:

# Import TLS certificate from file
az keyvault certificate import \
  --vault-name atp-prod-kv \
  --name "certificates-atp-gateway-tls-cert" \
  --file tls-cert.pfx \
  --password "SecurePassword123!" \
  --tags Category=certificates Service=atp-gateway Environment=production Type=tls

# Or create certificate from CSR
az keyvault certificate create \
  --vault-name atp-prod-kv \
  --name "certificates-atp-gateway-tls-cert" \
  --policy "$(cat cert-policy.json)"

Signing Certificate:

# Import signing certificate
az keyvault certificate import \
  --vault-name atp-prod-kv \
  --name "certificates-atp-integrity-signing-cert" \
  --file signing-cert.pfx \
  --password "SecurePassword123!" \
  --tags Category=certificates Service=atp-integrity Environment=production Type=signing

Encryption Keys

Data Encryption Key:

# Create encryption key
az keyvault key create \
  --vault-name atp-prod-kv \
  --name "encryption-keys-atp-integrity-data-encryption-key" \
  --kty RSA \
  --size 2048 \
  --ops encrypt decrypt \
  --tags Category=encryption-keys Service=atp-integrity Environment=production

Credentials (Service Accounts)

Service Account Password:

# Set service account password
az keyvault secret set \
  --vault-name atp-prod-kv \
  --name "credentials-atp-query-service-account-password" \
  --value "SecurePassword123!" \
  --content-type "text/plain" \
  --tags Category=credentials Service=atp-query Environment=production Type=service-account

Workload Identity for Pods

Azure AD Workload Identity Overview

Workload Identity Architecture:

graph LR
    A[Pod] -->|Token Request| B[Azure AD<br/>OIDC Issuer]
    B -->|JWT Token| A
    A -->|Authenticate| C[Azure Key Vault]
    C -->|Return Secret| A

    style A fill:#90EE90
    style B fill:#FFE5B4
    style C fill:#ffcccc

Benefits of Workload Identity:

  • ✅ No secrets stored in Kubernetes
  • ✅ Automatic token rotation
  • ✅ Fine-grained RBAC permissions
  • ✅ Audit trail via Azure AD logs
  • ✅ No certificate management

Federated Credentials Configuration

Federated Credential Setup:

// Create User Assigned Managed Identity
var workloadIdentity = new ManagedIdentity.UserAssignedIdentity(
    "atp-workload-identity", new()
    {
        ResourceGroupName = resourceGroupName,
        Location = location,
    });

// Create federated credential for the Kubernetes ServiceAccount.
// The issuer must be the AKS cluster's OIDC issuer URL
// (az aks show --query "oidcIssuerProfile.issuerUrl").
var federatedCredential = new ManagedIdentity.FederatedIdentityCredential(
    "atp-federated-credential", new()
    {
        ResourceGroupName = resourceGroupName,
        ResourceName = workloadIdentity.Name,
        Issuer = aksOidcIssuerUrl,
        Subject = "system:serviceaccount:atp-production:atp-ingestion", // K8s ServiceAccount
        Audiences = new[] { "api://AzureADTokenExchange" },
    });

Azure CLI Setup:

# Create User Assigned Managed Identity
az identity create \
  --name atp-workload-identity \
  --resource-group atp-production-rg

# Look up the AKS cluster's OIDC issuer URL
AKS_OIDC_ISSUER=$(az aks show \
  --name atp-production-aks \
  --resource-group atp-production-rg \
  --query "oidcIssuerProfile.issuerUrl" -o tsv)

# Create federated credential
az identity federated-credential create \
  --name atp-federated-credential \
  --identity-name atp-workload-identity \
  --resource-group atp-production-rg \
  --issuer "$AKS_OIDC_ISSUER" \
  --subject "system:serviceaccount:atp-production:atp-ingestion" \
  --audiences "api://AzureADTokenExchange"

ServiceAccount Annotation

ServiceAccount with Workload Identity:

# apps/atp-ingestion/base/serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: atp-ingestion
  namespace: atp-production
  annotations:
    azure.workload.identity/client-id: "12345678-1234-1234-1234-123456789abc"  # Managed Identity Client ID
    azure.workload.identity/tenant-id: "87654321-4321-4321-4321-cba987654321"  # Azure AD Tenant ID

Pod Authentication Flow

Pod Configuration with Workload Identity:

# apps/atp-ingestion/base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
  namespace: atp-production
spec:
  template:
    metadata:
      labels:
        azure.workload.identity/use: "true"  # Enable Workload Identity
    spec:
      serviceAccountName: atp-ingestion
      containers:
      - name: atp-ingestion
        image: connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d
        env:
        # Secret will be injected via External Secrets Operator
        - name: SQL_CONNECTION_STRING
          valueFrom:
            secretKeyRef:
              name: sql-connection-string  # Created by External Secrets Operator
              key: connection-string

Authentication Flow:

  1. Pod starts with Workload Identity annotation
  2. Azure AD Workload Identity webhook injects AZURE_CLIENT_ID, AZURE_TENANT_ID, AZURE_FEDERATED_TOKEN_FILE environment variables
  3. Pod authenticates to Azure AD using federated token
  4. Azure AD returns access token
  5. Pod uses access token to access Key Vault (via External Secrets Operator or CSI Driver)
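Under the hood, step 3 is a standard OAuth2 client-credentials exchange in which the projected Kubernetes token is presented as a client assertion; the Azure SDK credentials (e.g. WorkloadIdentityCredential) perform this automatically. A sketch of the request being built (illustrative Python; no network call is made, and the function name is ours):

```python
def build_token_request(tenant_id: str, client_id: str, federated_token: str,
                        scope: str = "https://vault.azure.net/.default"):
    """Build the Azure AD token request that trades the projected Kubernetes
    service-account token for an Azure AD access token."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    form = {
        "grant_type": "client_credentials",
        "client_id": client_id,
        "scope": scope,
        # The federated (Kubernetes) token is presented as a client assertion
        "client_assertion_type": "urn:ietf:params:oauth:client-assertion-type:jwt-bearer",
        "client_assertion": federated_token,
    }
    return url, form
```

In a real pod, `tenant_id`, `client_id`, and the token file path come from the `AZURE_TENANT_ID`, `AZURE_CLIENT_ID`, and `AZURE_FEDERATED_TOKEN_FILE` environment variables injected by the webhook.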

No Secrets in Pod Specs!

❌ BAD: Plaintext Secrets in Pod Specs:

# ❌ NEVER DO THIS!
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: app
    env:
    - name: PASSWORD
      value: "PlaintextPassword123!"  # ❌ Exposed in Git!

✅ GOOD: Reference External Secret:

# ✅ Correct: Reference secret from External Secrets Operator
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: app
    env:
    - name: PASSWORD
      valueFrom:
        secretKeyRef:
          name: sql-connection-string  # Created by External Secrets Operator
          key: connection-string

External Secrets Operator

Installation and Configuration

Install External Secrets Operator:

# Add Helm repository
helm repo add external-secrets https://charts.external-secrets.io
helm repo update

# Install External Secrets Operator
helm install external-secrets \
  external-secrets/external-secrets \
  -n external-secrets-system \
  --create-namespace \
  --version 0.9.0

Verify Installation:

kubectl get pods -n external-secrets-system
# NAME                                       READY   STATUS    RESTARTS   AGE
# external-secrets-operator-7d8f9c4b5-abc123 1/1     Running   0          2m

ClusterSecretStore Setup

ClusterSecretStore for Azure Key Vault:

# infrastructure/clustersecretstore.yaml
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: azure-keyvault
spec:
  provider:
    azurekv:
      vaultUrl: "https://atp-prod-kv.vault.azure.net"
      tenantId: "87654321-4321-4321-4321-cba987654321"
      authType: WorkloadIdentity
      serviceAccountRef:
        name: external-secrets-operator
        namespace: external-secrets-system
      # Or use a Service Principal (not recommended)
      # authType: ServicePrincipal
      # authSecretRef:
      #   clientId:
      #     name: external-secrets-sp
      #     key: client-id
      #   clientSecret:
      #     name: external-secrets-sp
      #     key: client-secret

Grant Permissions to External Secrets Operator:

// Grant Key Vault Secrets User role to External Secrets Operator Workload Identity
var esoRoleAssignment = new Authorization.RoleAssignment(
    "eso-kv-secrets-user", new()
    {
        PrincipalId = externalSecretsOperatorIdentityPrincipalId,
        PrincipalType = "ServicePrincipal",
        RoleDefinitionId = "/subscriptions/{subscriptionId}/providers/Microsoft.Authorization/roleDefinitions/4633458b-17de-408a-b874-0445c86b69e6", // Key Vault Secrets User
        Scope = keyVault.Id,
    });

ExternalSecret Resources

ExternalSecret for Connection String:

# apps/atp-ingestion/base/externalsecret.yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: sql-connection-string
  namespace: atp-production
spec:
  refreshInterval: 1h  # Refresh every hour
  secretStoreRef:
    name: azure-keyvault
    kind: ClusterSecretStore
  target:
    name: sql-connection-string  # Kubernetes Secret name
    creationPolicy: Owner
    template:
      type: Opaque
      data:
        connection-string: "{{ .connectionString | toString }}"
  data:
  - secretKey: connectionString
    remoteRef:
      key: connection-strings-atp-ingestion-sql-connection-string
      version: ""  # Empty = latest version; property is omitted for plain (non-JSON) secrets

ExternalSecret for Certificate:

# apps/atp-gateway/base/externalsecret-cert.yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: tls-certificate
  namespace: atp-production
spec:
  refreshInterval: 24h
  secretStoreRef:
    name: azure-keyvault
    kind: ClusterSecretStore
  target:
    name: tls-certificate
    creationPolicy: Owner
    template:
      type: kubernetes.io/tls
      data:
        tls.crt: "{{ .certificate | b64enc }}"
        tls.key: "{{ .privateKey | b64enc }}"
  data:
  - secretKey: certificate
    remoteRef:
      key: cert/certificates-atp-gateway-tls-cert    # cert/ prefix: public certificate
  - secretKey: privateKey
    remoteRef:
      key: secret/certificates-atp-gateway-tls-cert  # secret/ prefix: PEM bundle incl. private key

ExternalSecret for Multiple Secrets:

# apps/atp-gateway/base/externalsecret-multiple.yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: gateway-secrets
  namespace: atp-production
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: azure-keyvault
    kind: ClusterSecretStore
  target:
    name: gateway-secrets
    creationPolicy: Owner
  data:
  - secretKey: stripe-api-key
    remoteRef:
      key: api-keys-atp-gateway-stripe-api-key
  - secretKey: sendgrid-api-key
    remoteRef:
      key: api-keys-atp-gateway-sendgrid-api-key
  - secretKey: jwt-signing-key
    remoteRef:
      key: api-keys-atp-gateway-jwt-signing-key

Sync Interval and Refresh

Refresh Strategies:

| Strategy | Refresh Interval | Use Case |
|---|---|---|
| Frequent | 5m | High-security, frequently rotated secrets |
| Standard | 1h | Most application secrets |
| Infrequent | 24h | Stable certificates, long-lived keys |
| On-Demand | Manual refresh | Rarely changed secrets |
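Platform tooling that generates ExternalSecret manifests can map these strategies onto `refreshInterval` values; a minimal sketch (illustrative Python, the strategy keys are ours):

```python
# Map refresh strategies from the table above to ExternalSecret refreshInterval values.
REFRESH_INTERVALS = {
    "frequent": "5m",     # high-security, frequently rotated secrets
    "standard": "1h",     # most application secrets
    "infrequent": "24h",  # stable certificates, long-lived keys
}

def refresh_interval(strategy: str) -> str:
    """Return the refreshInterval for a strategy, defaulting to standard (1h)."""
    return REFRESH_INTERVALS.get(strategy, "1h")
```

Defaulting to the standard 1h interval keeps unclassified secrets reasonably fresh without hammering Key Vault.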

Manual Refresh:

# Trigger manual refresh
kubectl annotate externalsecret sql-connection-string \
  -n atp-production \
  force-sync=$(date +%s) \
  --overwrite

ExternalSecret Status:

# Check ExternalSecret status
kubectl get externalsecret sql-connection-string -n atp-production -o yaml

# Status output:
status:
  conditions:
  - lastTransitionTime: "2024-01-15T10:00:00Z"
    message: Secret was synced
    reason: SecretSynced
    status: "True"
    type: Ready
  refreshTime: "2024-01-15T10:00:00Z"
  syncedResourceVersion: "12345"

Secret Rotation Handling

ExternalSecret with Version Tracking:

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: sql-connection-string
  namespace: atp-production
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: azure-keyvault
    kind: ClusterSecretStore
  target:
    name: sql-connection-string
    creationPolicy: Owner
    template:
      metadata:
        annotations:
          external-secrets.io/last-sync-time: "{{ .refreshTime | date \"2006-01-02T15:04:05Z07:00\" }}"
  data:
  - secretKey: connectionString
    remoteRef:
      key: connection-strings-atp-ingestion-sql-connection-string
      # Track version for rotation
      version: ""  # Empty = latest, or specify version ID

Application Secret Rotation:

// C# application: Handle secret rotation gracefully
public class SecretRotationHandler
{
    private readonly ILogger<SecretRotationHandler> _logger;
    private string _currentConnectionString;
    private readonly SemaphoreSlim _rotationLock = new(1, 1);

    public async Task<string> GetConnectionStringAsync()
    {
        // Read from mounted secret file or environment variable
        var newConnectionString = await File.ReadAllTextAsync(
            "/mnt/secrets/sql-connection-string/connection-string");

        if (_currentConnectionString != newConnectionString)
        {
            await _rotationLock.WaitAsync();
            try
            {
                if (_currentConnectionString != newConnectionString)
                {
                    _logger.LogInformation("Connection string rotated, updating connection");
                    await RotateConnectionAsync(newConnectionString);
                    _currentConnectionString = newConnectionString;
                }
            }
            finally
            {
                _rotationLock.Release();
            }
        }

        return _currentConnectionString;
    }

    private async Task RotateConnectionAsync(string newConnectionString)
    {
        // Close old connections
        // Establish new connections with new connection string
        // Zero-downtime rotation
    }
}

CSI Driver Alternative

Azure Key Vault CSI Driver

Installation:

# Install Azure Key Vault CSI Driver
helm repo add csi-secrets-store-provider-azure https://raw.githubusercontent.com/Azure/secrets-store-csi-driver-provider-azure/master/charts
helm repo update

helm install csi-secrets-store-provider-azure \
  csi-secrets-store-provider-azure/csi-secrets-store-provider-azure \
  --namespace kube-system \
  --version 1.4.0

Verify Installation:

kubectl get pods -n kube-system | grep csi-secrets-store
# NAME                                                    READY   STATUS    RESTARTS   AGE
# csi-secrets-store-provider-azure-7d8f9c4b5-abc123       1/1     Running   0          2m
# csi-secrets-store-driver-9f8e7d6c5-def456               2/2     Running   0          2m

SecretProviderClass Configuration

SecretProviderClass with Workload Identity:

# apps/atp-ingestion/base/secretproviderclass.yaml
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: atp-ingestion-secrets
  namespace: atp-production
spec:
  provider: azure
  secretObjects:
  - secretName: sql-connection-string  # Kubernetes Secret to create
    type: Opaque
    data:
    - objectName: sql-connection-string
      key: connection-string
  parameters:
    usePodIdentity: "false"
    useVMManagedIdentity: "false"
    useWorkloadIdentity: "true"  # Use Workload Identity
    workloadIdentityClientId: "12345678-1234-1234-1234-123456789abc"  # Managed Identity Client ID
    tenantId: "87654321-4321-4321-4321-cba987654321"  # Azure AD Tenant ID
    keyvaultName: "atp-prod-kv"
    objects: |
      array:
        - |
          objectName: connection-strings/atp-ingestion/sql-connection-string
          objectType: secret
          objectVersion: ""  # Empty = latest version

SecretProviderClass for Certificate:

# apps/atp-gateway/base/secretproviderclass-cert.yaml
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: tls-certificate
  namespace: atp-production
spec:
  provider: azure
  secretObjects:
  - secretName: tls-certificate
    type: kubernetes.io/tls
    data:
    - objectName: tls-cert
      key: tls.crt
    - objectName: tls-key
      key: tls.key
  parameters:
    useWorkloadIdentity: "true"
    workloadIdentityClientId: "12345678-1234-1234-1234-123456789abc"
    tenantId: "87654321-4321-4321-4321-cba987654321"
    keyvaultName: "atp-prod-kv"
    objects: |
      array:
        - |
          objectName: certificates/atp-gateway/tls-cert
          objectType: secret
          objectFormat: pfx
          objectEncoding: base64

Mounting Secrets as Volumes

Deployment with CSI Volume Mount:

# apps/atp-ingestion/base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
  namespace: atp-production
spec:
  template:
    metadata:
      labels:
        azure.workload.identity/use: "true"
    spec:
      serviceAccountName: atp-ingestion
      containers:
      - name: atp-ingestion
        image: connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d
        volumeMounts:
        - name: secrets-store
          mountPath: "/mnt/secrets-store"
          readOnly: true
        env:
        - name: SQL_CONNECTION_STRING
          valueFrom:
            secretKeyRef:
              name: sql-connection-string  # Created by SecretProviderClass secretObjects
              key: connection-string
      volumes:
      - name: secrets-store
        csi:
          driver: secrets-store.csi.k8s.io
          readOnly: true
          volumeAttributes:
            secretProviderClass: "atp-ingestion-secrets"

Automatic Rotation

Secret Rotation with CSI Driver:

Rotation is enabled at the driver level rather than per SecretProviderClass. Pass the rotation flags when installing or upgrading the Helm release:

# Enable secret rotation on the CSI driver (default poll interval is 2m)
helm upgrade csi-secrets-store-provider-azure \
  csi-secrets-store-provider-azure/csi-secrets-store-provider-azure \
  --namespace kube-system \
  --set secrets-store-csi-driver.enableSecretRotation=true \
  --set secrets-store-csi-driver.rotationPollInterval=1h

With rotation enabled, mounted secret content is refreshed automatically; the SecretProviderClass itself needs no rotation-specific fields:

apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: atp-ingestion-secrets
  namespace: atp-production
spec:
  provider: azure
  secretObjects:
  - secretName: sql-connection-string
    type: Opaque
    data:
    - objectName: sql-connection-string
      key: connection-string
  parameters:
    useWorkloadIdentity: "true"
    workloadIdentityClientId: "12345678-1234-1234-1234-123456789abc"
    tenantId: "87654321-4321-4321-4321-cba987654321"
    keyvaultName: "atp-prod-kv"
    objects: |
      array:
        - |
          objectName: connection-strings/atp-ingestion/sql-connection-string
          objectType: secret
          objectVersion: ""  # Latest version

Rotation Status:

# Check per-pod sync status (SecretProviderClassPodStatus objects)
kubectl get secretproviderclasspodstatus -n atp-production

# View mounted secrets
kubectl exec -it deployment/atp-ingestion -n atp-production -- \
  ls -la /mnt/secrets-store/

When to Use CSI vs External Secrets

Comparison Matrix:

| Feature | External Secrets Operator | CSI Driver |
|---------|---------------------------|------------|
| Secret Access | Creates Kubernetes Secrets | Mounts directly or creates Secrets |
| Rotation | Manual refresh or polling | Automatic rotation support |
| Use Case | Standard Kubernetes Secret consumption | Direct file access or Secret creation |
| Performance | Slight delay (polling) | Faster (direct mount) |
| Compatibility | ✅ Works with existing Secret consumers | ⚠️ Requires CSI volume mounts |
| Complexity | ✅ Simpler | ⚠️ More complex setup |

ATP Selection Guide:

  • External Secrets Operator: ✅ Recommended for most use cases
    • Standard Kubernetes Secret consumption
    • Works with existing applications
    • Simpler to manage

  • CSI Driver: Use when you:
    • Need direct file access to secrets
    • Require automatic rotation without polling
    • Need high-performance secret access

Secret References in Manifests

Never Plaintext Secrets in Git

❌ BAD: Plaintext Secret in Git:

# ❌ NEVER COMMIT THIS TO GIT!
apiVersion: v1
kind: Secret
metadata:
  name: sql-connection-string
data:
  connection-string: U2VjdXJlUGFzc3dvcmQxMjMh  # Base64 encoded, but still in Git!
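Base64 offers no protection at all; anyone with read access to the repository can recover the plaintext with a single command:

```shell
# Demo only: base64 is a reversible encoding, not encryption.
decoded=$(echo 'U2VjdXJlUGFzc3dvcmQxMjMh' | base64 -d)
echo "$decoded"   # → SecurePassword123!
```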

✅ GOOD: External Secret Reference:

# ✅ Correct: Reference External Secret
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: sql-connection-string
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: azure-keyvault
    kind: ClusterSecretStore
  target:
    name: sql-connection-string
  data:
  - secretKey: connectionString
    remoteRef:
      key: connection-strings/atp-ingestion/sql-connection-string

Referencing Secrets by Name

Deployment Using Secret Reference:

# apps/atp-ingestion/base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  template:
    spec:
      containers:
      - name: atp-ingestion
        image: connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d
        env:
        # Reference secret created by External Secrets Operator
        - name: SQL_CONNECTION_STRING
          valueFrom:
            secretKeyRef:
              name: sql-connection-string
              key: connection-string
        - name: REDIS_CONNECTION_STRING
          valueFrom:
            secretKeyRef:
              name: redis-connection-string
              key: connection-string
        envFrom:
        # Or use envFrom for multiple secrets
        - secretRef:
            name: gateway-secrets

Environment-Specific Secret Mappings

Environment-Specific ExternalSecret:

# apps/atp-ingestion/overlays/production/externalsecret.yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: sql-connection-string
  namespace: atp-production
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: azure-keyvault-prod  # Production Key Vault
    kind: ClusterSecretStore
  target:
    name: sql-connection-string
  data:
  - secretKey: connectionString
    remoteRef:
      key: connection-strings/atp-ingestion/sql-connection-string
      # Production-specific secret path

# apps/atp-ingestion/overlays/dev/externalsecret.yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: sql-connection-string
  namespace: atp-dev
spec:
  refreshInterval: 24h  # Less frequent refresh for dev
  secretStoreRef:
    name: azure-keyvault-dev  # Dev Key Vault
    kind: ClusterSecretStore
  target:
    name: sql-connection-string
  data:
  - secretKey: connectionString
    remoteRef:
      key: connection-strings/atp-ingestion/sql-connection-string
      # Dev-specific secret path
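The base + overlay wiring implied by the file paths above can be sketched as follows (an illustrative kustomization.yaml; paths and patch strategy are assumptions, not prescribed by ATP):

```yaml
# apps/atp-ingestion/overlays/production/kustomization.yaml (illustrative)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: atp-production
resources:
- ../../base
patches:
- path: externalsecret.yaml  # overrides the base ExternalSecret's store and refresh interval
```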

Secret Versioning

Reference Specific Secret Version:

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: sql-connection-string
spec:
  data:
  - secretKey: connectionString
    remoteRef:
      key: connection-strings/atp-ingestion/sql-connection-string
      version: "abc123def456789"  # Specific version ID

Track Secret Versions:

# List secret versions
az keyvault secret list-versions \
  --vault-name atp-prod-kv \
  --name "connection-strings/atp-ingestion/sql-connection-string" \
  --query "[].{id:id, enabled:attributes.enabled, updated:attributes.updated}"

# Output:
# [
#   {
#     "id": "https://atp-prod-kv.vault.azure.net/secrets/connection-strings/atp-ingestion/sql-connection-string/abc123def456789",
#     "enabled": true,
#     "updated": "2024-01-15T10:00:00Z"
#   }
# ]
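The version ID is the last path segment of the returned id URL; in shell it can be extracted with parameter expansion (the id value below is the hypothetical one from the output above):

```shell
# Hypothetical id copied from the list-versions output
id="https://atp-prod-kv.vault.azure.net/secrets/connection-strings/atp-ingestion/sql-connection-string/abc123def456789"
version="${id##*/}"   # parameter expansion: drop everything through the last '/'
echo "$version"       # → abc123def456789
```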

Secret Rotation Procedures

Manual Rotation Workflow

Secret Rotation Checklist:

## Manual Secret Rotation Checklist

### Pre-Rotation
- [ ] Notify team of rotation schedule
- [ ] Verify application can handle secret rotation gracefully
- [ ] Backup current secret (if needed)
- [ ] Prepare new secret value

### Rotation
- [ ] Create new secret version in Key Vault
- [ ] Test new secret in non-production environment
- [ ] Update ExternalSecret to reference new version (optional)
- [ ] Trigger ExternalSecret refresh
- [ ] Verify application picks up new secret
- [ ] Monitor application for errors

### Post-Rotation
- [ ] Verify application is functioning correctly
- [ ] Disable old secret version (don't delete yet)
- [ ] Monitor for 24-48 hours
- [ ] Delete old secret version after confirmation

Manual Rotation Script:

#!/bin/bash
# scripts/rotate-secret.sh

VAULT_NAME="${1:-atp-prod-kv}"
SECRET_NAME="${2:-connection-strings/atp-ingestion/sql-connection-string}"
NEW_SECRET_VALUE="${3:-}"

if [ -z "$NEW_SECRET_VALUE" ]; then
  echo "Usage: $0 <vault-name> <secret-name> <new-secret-value>"
  exit 1
fi

echo "🔄 Rotating secret: $SECRET_NAME"

# 1. Create new secret version
echo "📝 Creating new secret version..."
az keyvault secret set \
  --vault-name "$VAULT_NAME" \
  --name "$SECRET_NAME" \
  --value "$NEW_SECRET_VALUE" \
  --tags LastRotated="$(date -u +%Y-%m-%dT%H:%M:%SZ)"

# 2. Trigger ExternalSecret refresh
echo "🔄 Triggering ExternalSecret refresh..."
kubectl annotate externalsecret "$(basename $SECRET_NAME)" \
  -n atp-production \
  force-sync="$(date +%s)" \
  --overwrite

# 3. Verify secret was synced
echo "✅ Waiting for secret sync..."
sleep 10
kubectl get externalsecret "$(basename $SECRET_NAME)" -n atp-production

echo "✅ Secret rotation complete"

Automated Rotation with Key Vault

Azure Key Vault Automatic Rotation:

# Auto-renewal is driven by the certificate's policy (lifetime actions)
az keyvault certificate set-attributes \
  --vault-name atp-prod-kv \
  --name "certificates/atp-gateway/tls-cert" \
  --enabled true \
  --policy @rotation-policy.json

Rotation Policy (rotation-policy.json):

{
  "lifetimeActions": [
    {
      "trigger": {
        "daysBeforeExpiry": 30
      },
      "action": {
        "actionType": "Rotate"
      }
    },
    {
      "trigger": {
        "daysBeforeExpiry": 7
      },
      "action": {
        "actionType": "EmailContacts"
      }
    }
  ],
  "issuerParameters": {
    "name": "Self"
  },
  "keyProperties": {
    "exportable": true,
    "keySize": 2048,
    "keyType": "RSA",
    "reuseKey": true
  },
  "secretProperties": {
    "contentType": "application/x-pkcs12"
  }
}

Application Handling of Rotated Secrets

C# Application: Secret Rotation Handler:

// SecretRotationHandler.cs
public class SecretRotationHandler : BackgroundService
{
    private readonly ILogger<SecretRotationHandler> _logger;
    private readonly IConfiguration _configuration;
    private readonly SemaphoreSlim _rotationLock = new(1, 1);
    private string _currentSecret;
    private DateTime _lastRotationCheck = DateTime.UtcNow;

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            try
            {
                await CheckForSecretRotationAsync();
                await Task.Delay(TimeSpan.FromMinutes(5), stoppingToken); // Check every 5 minutes
            }
            catch (Exception ex)
            {
                _logger.LogError(ex, "Error checking for secret rotation");
                await Task.Delay(TimeSpan.FromMinutes(1), stoppingToken);
            }
        }
    }

    private async Task CheckForSecretRotationAsync()
    {
        // Read secret from mounted file or environment variable
        var secretPath = "/mnt/secrets-store/connection-strings/atp-ingestion/sql-connection-string";
        if (File.Exists(secretPath))
        {
            var newSecret = await File.ReadAllTextAsync(secretPath);

            if (_currentSecret != null && _currentSecret != newSecret)
            {
                _logger.LogInformation("Secret rotated, updating connection");
                await RotateSecretAsync(newSecret);
            }

            _currentSecret = newSecret;
        }
    }

    private async Task RotateSecretAsync(string newSecret)
    {
        await _rotationLock.WaitAsync();
        try
        {
            // Zero-downtime rotation:
            // 1. Create new connection with new secret
            // 2. Migrate traffic to new connection
            // 3. Close old connection

            _logger.LogInformation("Secret rotation complete");
        }
        finally
        {
            _rotationLock.Release();
        }
    }
}

Zero-Downtime Rotation

Zero-Downtime Rotation Strategy:

graph LR
    A[Old Secret<br/>Active] -->|1. New Secret<br/>Created| B[New Secret<br/>Available]
    B -->|2. New Connection<br/>Established| C[Dual Connections<br/>Active]
    C -->|3. Migrate Traffic| D[New Connection<br/>Primary]
    D -->|4. Close Old| E[New Secret<br/>Active]

    style A fill:#ffcccc
    style C fill:#FFE5B4
    style E fill:#90EE90

Implementation:

public class ZeroDowntimeSecretRotation
{
    private IDbConnection _primaryConnection;
    private IDbConnection _secondaryConnection;
    private bool _isRotating = false;

    public async Task RotateConnectionStringAsync(string newConnectionString)
    {
        if (_isRotating) return;

        _isRotating = true;
        try
        {
            // 1. Create new connection
            var newConnection = new SqlConnection(newConnectionString);
            await newConnection.OpenAsync();

            // 2. Verify new connection works
            using var testCommand = new SqlCommand("SELECT 1", newConnection);
            await testCommand.ExecuteScalarAsync();

            // 3. Set secondary connection
            _secondaryConnection = newConnection;

            // 4. Migrate traffic gradually (e.g., 10% at a time)
            await MigrateTrafficGraduallyAsync();

            // 5. Close old connection
            if (_primaryConnection != null)
            {
                _primaryConnection.Close();
                _primaryConnection.Dispose();
            }

            // 6. Promote new connection to primary
            _primaryConnection = _secondaryConnection;
            _secondaryConnection = null;
        }
        finally
        {
            _isRotating = false;
        }
    }
}

Secret Versioning and Rollback

Key Vault Secret Versions

List Secret Versions:

# List all versions of a secret
az keyvault secret list-versions \
  --vault-name atp-prod-kv \
  --name "connection-strings/atp-ingestion/sql-connection-string" \
  --query "[].{id:id, enabled:attributes.enabled, updated:attributes.updated, expires:attributes.expires}" \
  --output table

# Output:
# ID                                                              ENABLED    UPDATED                 EXPIRES
# https://atp-prod-kv.../abc123def456789                          True       2024-01-15T10:00:00Z    None
# https://atp-prod-kv.../def456ghi789012                          True       2024-01-14T10:00:00Z    None
# https://atp-prod-kv.../ghi789jkl012345                          False      2024-01-13T10:00:00Z    None  # Disabled

Get Specific Secret Version:

# Get specific version
az keyvault secret show \
  --vault-name atp-prod-kv \
  --name "connection-strings/atp-ingestion/sql-connection-string" \
  --version "def456ghi789012"

Rolling Back to Previous Secret Version

Rollback Procedure:

#!/bin/bash
# scripts/rollback-secret.sh

VAULT_NAME="${1:-atp-prod-kv}"
SECRET_NAME="${2:-connection-strings/atp-ingestion/sql-connection-string}"
VERSION_TO_ROLLBACK="${3:-}"

if [ -z "$VERSION_TO_ROLLBACK" ]; then
  echo "Usage: $0 <vault-name> <secret-name> <version-id>"
  echo "Listing available versions:"
  az keyvault secret list-versions \
    --vault-name "$VAULT_NAME" \
    --name "$SECRET_NAME" \
    --query "[].{id:id, updated:attributes.updated}" \
    --output table
  exit 1
fi

echo "🔄 Rolling back secret to version: $VERSION_TO_ROLLBACK"

# 1. Get previous version value
PREVIOUS_VALUE=$(az keyvault secret show \
  --vault-name "$VAULT_NAME" \
  --name "$SECRET_NAME" \
  --version "$VERSION_TO_ROLLBACK" \
  --query value -o tsv)

# 2. Create new version with previous value
az keyvault secret set \
  --vault-name "$VAULT_NAME" \
  --name "$SECRET_NAME" \
  --value "$PREVIOUS_VALUE" \
  --tags RollbackFrom="$VERSION_TO_ROLLBACK" RollbackAt="$(date -u +%Y-%m-%dT%H:%M:%SZ)"

# 3. Trigger ExternalSecret refresh
kubectl annotate externalsecret "$(basename $SECRET_NAME)" \
  -n atp-production \
  force-sync="$(date +%s)" \
  --overwrite

echo "✅ Secret rolled back successfully"

Coordinating Secret Changes with Deployments

Coordinated Secret and Deployment Update:

# Strategy: Update secret first, then deployment
# 1. Update ExternalSecret to reference new secret version
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: sql-connection-string
spec:
  data:
  - secretKey: connectionString
    remoteRef:
      key: connection-strings/atp-ingestion/sql-connection-string
      version: "abc123def456789"  # New version
---
# 2. Wait for secret sync
# 3. Update deployment (triggers rolling update with new secret)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  template:
    metadata:
      annotations:
        secret-version: "abc123def456789"  # Track secret version

Audit Logging

Key Vault Access Logs

Enable Diagnostic Settings:

// Enable Key Vault diagnostic logs
new Insights.DiagnosticSetting("keyvault-diagnostics", new()
{
    ResourceUri = keyVault.Id,
    WorkspaceId = logAnalyticsWorkspace.Id,
    Logs = new[]
    {
        new Insights.Inputs.LogSettingsArgs
        {
            CategoryGroup = "audit",
            Enabled = true,
            RetentionPolicy = new Insights.Inputs.RetentionPolicyArgs
            {
                Enabled = true,
                Days = environment == "production" ? 365 : 30,
            },
        },
        new Insights.Inputs.LogSettingsArgs
        {
            CategoryGroup = "allLogs",
            Enabled = true,
            RetentionPolicy = new Insights.Inputs.RetentionPolicyArgs
            {
                Enabled = true,
                Days = environment == "production" ? 365 : 30,
            },
        },
    },
    Metrics = new[]
    {
        new Insights.Inputs.MetricSettingsArgs
        {
            Category = "AllMetrics",
            Enabled = true,
            RetentionPolicy = new Insights.Inputs.RetentionPolicyArgs
            {
                Enabled = true,
                Days = environment == "production" ? 365 : 30,
            },
        },
    },
});

Monitoring Secret Access

KQL Query for Secret Access:

// Key Vault access logs
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.KEYVAULT"
| where Category == "AuditEvent"
| where OperationName == "SecretGet" or OperationName == "SecretList"
| extend SecretName = tostring(parse_json(properties_s).objectName)
| extend CallerIP = CallerIPAddress
| extend Identity = tostring(parse_json(properties_s).identity_claim_appid_g)
| project TimeGenerated, SecretName, OperationName, Identity, CallerIP, ResultSignature
| order by TimeGenerated desc

Access Pattern Analysis:

// Secret access patterns
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.KEYVAULT"
| where Category == "AuditEvent"
| where OperationName == "SecretGet"
| extend SecretName = tostring(parse_json(properties_s).objectName)
| summarize 
    AccessCount = count(),
    UniqueIdentities = dcount(parse_json(properties_s).identity_claim_appid_g),
    LastAccess = max(TimeGenerated)
    by SecretName, bin(TimeGenerated, 1h)
| order by AccessCount desc

Alerting on Unauthorized Access

Prometheus Alert Rule (assumes a custom exporter exposing the azure_keyvault_secret_access_total metric):

# alerts/keyvault-unauthorized-access.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: keyvault-unauthorized-access
  namespace: monitoring
spec:
  groups:
  - name: keyvault
    rules:
    - alert: KeyVaultUnauthorizedAccess
      expr: |
        count(
          azure_keyvault_secret_access_total{
            result="Forbidden"
          } > 0
        ) by (secret_name)
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Unauthorized access attempt to Key Vault secret"
        description: "Secret {{ $labels.secret_name }} has {{ $value }} unauthorized access attempts"

Log Analytics Alert:

// Alert query: Failed secret access
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.KEYVAULT"
| where Category == "AuditEvent"
| where OperationName == "SecretGet"
| where ResultSignature == "Forbidden"
| extend SecretName = tostring(parse_json(properties_s).objectName)
| extend Identity = tostring(parse_json(properties_s).identity_claim_appid_g)
| project TimeGenerated, SecretName, Identity

Compliance Reporting

Compliance Report Query:

// Secret access compliance report
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.KEYVAULT"
| where Category == "AuditEvent"
| where TimeGenerated > ago(30d)
| extend SecretName = tostring(parse_json(properties_s).objectName)
| extend Operation = OperationName
| extend Identity = tostring(parse_json(properties_s).identity_claim_appid_g)
| extend Result = ResultSignature
| summarize 
    TotalAccess = count(),
    SuccessfulAccess = countif(Result == "OK"),
    FailedAccess = countif(Result != "OK"),
    UniqueIdentities = dcount(Identity),
    LastAccess = max(TimeGenerated)
    by SecretName, Operation
| order by TotalAccess desc

Compliance: SOC 2, GDPR, HIPAA

Encryption at Rest in Key Vault

Key Vault Encryption:

  • Automatic Encryption: All secrets encrypted at rest by default
  • Hardware Security Module (HSM): Premium SKU uses HSM-backed keys
  • Azure Key Vault Managed HSM: Dedicated HSM for highest security

Enable HSM-Backed Keys:

// Use Premium SKU for HSM-backed keys
this.Vault = new KeyVault.Vault(keyVaultName, new()
{
    Properties = new KeyVault.Inputs.VaultPropertiesArgs
    {
        Sku = new KeyVault.Inputs.SkuArgs
        {
            Family = "A",
            Name = "premium",  // Premium SKU for HSM
        },
    },
});

Access Reviews and Audits

Regular Access Reviews:

#!/bin/bash
# scripts/access-review.sh

VAULT_NAME="${1:-atp-prod-kv}"

echo "📋 Key Vault Access Review for: $VAULT_NAME"

# List all role assignments
az role assignment list \
  --scope "/subscriptions/{subscriptionId}/resourceGroups/atp-production-rg/providers/Microsoft.KeyVault/vaults/$VAULT_NAME" \
  --query "[].{principal:principalName, role:roleDefinitionName, scope:scope}" \
  --output table

# List all secrets and their access patterns
echo "📊 Secret Access Summary:"
az keyvault secret list \
  --vault-name "$VAULT_NAME" \
  --query "[].name" -o tsv | while read secret; do
    echo "Secret: $secret"
    az keyvault secret list-versions \
      --vault-name "$VAULT_NAME" \
      --name "$secret" \
      --query "[].{updated:attributes.updated, enabled:attributes.enabled}" \
      --output table
done

Automated Access Review:

# Illustrative policy assignment: require periodic Key Vault access reviews
apiVersion: policy.azure.com/v1beta1
kind: PolicyAssignment
metadata:
  name: require-keyvault-access-reviews
spec:
  displayName: "Require Key Vault Access Reviews"
  policyDefinitionId: "/providers/Microsoft.Authorization/policyDefinitions/..."
  parameters:
    reviewFrequency: "30d"

Secret Lifecycle Management

Secret Lifecycle Policy:

# Secret lifecycle management
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: sql-connection-string
  namespace: atp-production
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: azure-keyvault
    kind: ClusterSecretStore
  target:
    name: sql-connection-string
    template:
      metadata:
        annotations:
          secret-lifecycle/created: "{{ .creationTime }}"
          secret-lifecycle/expires: "{{ .expirationTime }}"
          secret-lifecycle/rotation-policy: "30d"
  data:
  - secretKey: connectionString
    remoteRef:
      key: connection-strings/atp-ingestion/sql-connection-string

Evidence Collection for Auditors

Audit Evidence Report:

// SOC 2 / GDPR Audit Evidence: Secret Access Log
let SecretAccess = AzureDiagnostics
| where ResourceProvider == "MICROSOFT.KEYVAULT"
| where Category == "AuditEvent"
| where TimeGenerated > ago(90d)
| extend SecretName = tostring(parse_json(properties_s).objectName)
| extend Identity = tostring(parse_json(properties_s).identity_claim_appid_g)
| extend Result = ResultSignature
| extend IPAddress = CallerIPAddress
| project TimeGenerated, SecretName, Identity, Result, IPAddress, OperationName;

// Generate report
SecretAccess
| summarize 
    TotalAccess = count(),
    SuccessfulAccess = countif(Result == "OK"),
    FailedAccess = countif(Result != "OK"),
    UniqueIdentities = dcount(Identity),
    DateRange = strcat(min(TimeGenerated), " to ", max(TimeGenerated))
    by SecretName
| order by SecretName

Export Audit Logs:

# Export audit logs for compliance
az monitor log-analytics query \
  --workspace "atp-prod-loganalytics" \
  --analytics-query "
    AzureDiagnostics
    | where ResourceProvider == 'MICROSOFT.KEYVAULT'
    | where Category == 'AuditEvent'
    | where TimeGenerated > ago(90d)
    | extend SecretName = tostring(parse_json(properties_s).objectName)
    | extend Identity = tostring(parse_json(properties_s).identity_claim_appid_g)
    | project TimeGenerated, OperationName, SecretName, Identity, ResultSignature
  " \
  --output tsv > keyvault-audit-log-$(date +%Y%m%d).tsv

Summary: Azure Key Vault Secret Management

  • Key Vault Architecture: Environment-specific Key Vaults with RBAC (recommended over access policies), organized secret naming conventions
  • Secret Categories: Connection strings, API keys, certificates, encryption keys, credentials with proper tagging
  • Workload Identity: Azure AD Workload Identity for pod authentication, federated credentials, ServiceAccount annotation, no secrets in pod specs
  • External Secrets Operator: ClusterSecretStore setup, ExternalSecret resources, sync intervals, secret rotation handling
  • CSI Driver: Alternative for direct secret mounting, SecretProviderClass configuration, automatic rotation support
  • Secret References: Never plaintext secrets in Git, reference secrets by name, environment-specific mappings, secret versioning
  • Secret Rotation: Manual and automated rotation procedures, application handling of rotated secrets, zero-downtime rotation strategies
  • Secret Versioning: Key Vault secret versions, rollback procedures, coordinating secret changes with deployments
  • Audit Logging: Key Vault access logs, monitoring secret access, alerting on unauthorized access, compliance reporting
  • Compliance: Encryption at rest, access reviews, secret lifecycle management, evidence collection for SOC 2, GDPR, HIPAA audits

Security Policies & Compliance

Purpose: Define security policies, compliance controls, and enforcement mechanisms for ATP GitOps, ensuring all Kubernetes workloads, network traffic, and container images meet security standards and regulatory requirements (SOC 2, GDPR, HIPAA) through policy-as-code and automated enforcement.


Azure Policy for Kubernetes

Policy Overview and Architecture

Azure Policy for Kubernetes Architecture:

graph LR
    A[Policy Definition<br/>in Azure] -->|Assignment| B[AKS Cluster<br/>with Policy Add-on]
    B -->|Enforces| C[Admission Controller]
    C -->|Validates| D[Kubernetes Resources]
    D -->|Creates| E[Compliant Resources]
    D -.->|Violates| F[Policy Violation<br/>Blocked/Reported]

    style A fill:#90EE90
    style B fill:#FFE5B4
    style C fill:#FFE5B4
    style D fill:#ffcccc
    style E fill:#90EE90
    style F fill:#ff9999

Azure Policy Components:

| Component | Purpose | Example |
|-----------|---------|---------|
| Policy Definition | Defines the policy rule | "All pods must have resource limits" |
| Policy Assignment | Applies policy to AKS cluster | Assign to atp-prod-eus-aks |
| Policy Effect | Enforcement action | deny, audit, disabled |
| Policy Parameters | Configurable values | Minimum CPU: 100m |

Built-in Policies for AKS

Enable Azure Policy Add-on:

# Enable Azure Policy add-on on AKS
az aks enable-addons \
  --resource-group atp-production-rg \
  --name atp-prod-eus-aks \
  --addons azure-policy

# Verify installation
az aks show \
  --resource-group atp-production-rg \
  --name atp-prod-eus-aks \
  --query addonProfiles.azurepolicy

Common Built-in Policies:

// Built-in policy: Kubernetes cluster containers should only use allowed capabilities
{
  "policyDefinitionId": "/providers/Microsoft.Authorization/policyDefinitions/c26596ff-4d70-4e6a-9a30-c2506bd2f80c",
  "parameters": {
    "allowedCapabilities": {
      "value": ["NET_BIND_SERVICE"]
    },
    "requiredDropCapabilities": {
      "value": ["ALL"]
    },
    "effect": {
      "value": "Audit"
    }
  }
}

Assign Built-in Policy:

# Assign built-in policy: Container images should be deployed from trusted registries only
az policy assignment create \
  --name "aks-trusted-registries" \
  --scope "/subscriptions/{subscriptionId}/resourceGroups/atp-production-rg/providers/Microsoft.ContainerService/managedClusters/atp-prod-eus-aks" \
  --policy "/providers/Microsoft.Authorization/policyDefinitions/febd0533-8e55-448f-b837-bd0e06f16469" \
  --params '{
    "allowedContainerImagesRegex": {
      "value": "^connectsoft\\.azurecr\\.io/.*"
    },
    "effect": {
      "value": "Deny"
    }
  }'
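The trusted-registry check above comes down to an anchored regular expression over the image reference. A minimal Python sketch of the same evaluation (illustrative only; Azure Policy evaluates the regex server-side):

```python
import re

# The same regex passed to the policy assignment above; anchored so only
# images from the ConnectSoft ACR match.
ALLOWED_IMAGE_REGEX = r"^connectsoft\.azurecr\.io/.*"

def image_allowed(image: str) -> bool:
    """Return True if the image reference matches the trusted-registry regex."""
    return re.match(ALLOWED_IMAGE_REGEX, image) is not None

print(image_allowed("connectsoft.azurecr.io/atp/ingestion:v1.2.3"))  # True
print(image_allowed("docker.io/library/nginx:latest"))               # False
```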

Custom Policy Definitions

Custom Policy: Require Resource Limits:

// policies/require-resource-limits.json
{
  "properties": {
    "displayName": "ATP: Require resource limits on containers",
    "description": "Ensures all containers have CPU and memory limits set",
    "mode": "Microsoft.Kubernetes.Data",
    "metadata": {
      "category": "ATP Security"
    },
    "parameters": {
      "minCpu": {
        "type": "String",
        "metadata": {
          "displayName": "Minimum CPU limit",
          "description": "Minimum CPU limit (e.g., 100m)"
        },
        "defaultValue": "100m"
      },
      "minMemory": {
        "type": "String",
        "metadata": {
          "displayName": "Minimum memory limit",
          "description": "Minimum memory limit (e.g., 128Mi)"
        },
        "defaultValue": "128Mi"
      },
      "effect": {
        "type": "String",
        "metadata": {
          "displayName": "Effect",
          "description": "Policy effect"
        },
        "allowedValues": ["audit", "deny", "disabled"],
        "defaultValue": "deny"
      }
    },
    "policyRule": {
      "if": {
        "field": "type",
        "in": [
          "Microsoft.ContainerService/managedClusters",
          "Microsoft.Kubernetes/connectedClusters"
        ]
      },
      "then": {
        "effect": "[parameters('effect')]",
        "details": {
          "templateInfo": {
            "sourceType": "PublicURL",
            "url": "https://raw.githubusercontent.com/ConnectSoft/ATP-Policies/main/policies/require-resource-limits.yaml"
          },
          "apiGroups": ["apps"],
          "kinds": ["Deployment", "StatefulSet"],
          "excludedNamespaces": ["kube-system", "gatekeeper-system"]
        }
      }
    }
  }
}

Gatekeeper Constraint Template (Referenced by Policy):

# policies/require-resource-limits.yaml
apiVersion: templates.gatekeeper.sh/v1beta1
kind: ConstraintTemplate
metadata:
  name: k8srequiredresourcelimits
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredResourceLimits
      validation:
        openAPIV3Schema:
          type: object
          properties:
            minCpu:
              type: string
            minMemory:
              type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredresourcelimits

        violation[{"msg": msg}] {
          container := input.review.object.spec.template.spec.containers[_]
          not container.resources.limits.cpu
          msg := sprintf("Container '%v' must have CPU limit", [container.name])
        }

        violation[{"msg": msg}] {
          container := input.review.object.spec.template.spec.containers[_]
          not container.resources.limits.memory
          msg := sprintf("Container '%v' must have memory limit", [container.name])
        }

Create Custom Policy:

# Create custom policy definition
az policy definition create \
  --name "atp-require-resource-limits" \
  --display-name "ATP: Require resource limits on containers" \
  --description "Ensures all containers have CPU and memory limits set" \
  --rules policies/require-resource-limits.json \
  --params policies/require-resource-limits.parameters.json \
  --mode Microsoft.Kubernetes.Data

Policy Assignment and Enforcement

Policy Assignment with Pulumi:

// Assign Azure Policy to AKS cluster
new Authorization.PolicyAssignment("atp-require-resource-limits", new()
{
    Name = "atp-require-resource-limits",
    DisplayName = "ATP: Require resource limits",
    PolicyDefinitionId = "/providers/Microsoft.Authorization/policyDefinitions/atp-require-resource-limits",
    Scope = aksCluster.Id,
    Parameters = new()
    {
        { "minCpu", new() { Value = "100m" } },
        { "minMemory", new() { Value = "128Mi" } },
        { "effect", new() { Value = "deny" } },
    },
    EnforcementMode = "Default", // Enforced
    Identity = new Authorization.Inputs.IdentityArgs
    {
        Type = Authorization.ResourceIdentityType.SystemAssigned,
    },
});

Policy Enforcement Modes:

| Mode | Behavior | Use Case |
|---|---|---|
| Enforced | Blocks non-compliant resources | Production environments |
| DoNotEnforce | Audits only, doesn't block | Testing policy effectiveness |
| Disabled | Policy disabled | Temporary disable |

Policy Compliance Check:

# Check policy compliance
az policy state list \
  --resource "/subscriptions/{subscriptionId}/resourceGroups/atp-production-rg/providers/Microsoft.ContainerService/managedClusters/atp-prod-eus-aks" \
  --policy-assignment "atp-require-resource-limits" \
  --query "[].{resource:resourceId, complianceState:complianceState}" \
  --output table
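The same query can emit JSON (`--output json`) and be summarized programmatically. A small Python sketch, assuming only the two fields projected by the `--query` expression above (the sample records are illustrative):

```python
import json
from collections import Counter

# Sample shaped like the fields selected by the `az policy state list`
# query above; real output carries many more fields per record.
sample = json.loads("""[
  {"resourceId": "/subscriptions/xxx/.../atp-prod-eus-aks", "complianceState": "Compliant"},
  {"resourceId": "/subscriptions/xxx/.../atp-staging-aks", "complianceState": "NonCompliant"}
]""")

def compliance_summary(states: list[dict]) -> Counter:
    """Count resources per compliance state."""
    return Counter(s["complianceState"] for s in states)

summary = compliance_summary(sample)
print(summary["NonCompliant"])  # 1
```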

Pod Security Standards (PSS)

Privileged, Baseline, Restricted Profiles

Pod Security Levels:

| Level | Restrictions | ATP Use Case |
|---|---|---|
| Privileged | No restrictions | System pods only (CNI, CSI drivers) |
| Baseline | Minimal restrictions | Legacy applications |
| Restricted | Maximum restrictions | ATP production workloads |

Restricted Profile Requirements:

  • ✅ Run as non-root user
  • ✅ Read-only root filesystem
  • ✅ Drop all capabilities
  • ✅ Disallow privilege escalation
  • ✅ Seccomp profile enforced
  • ✅ AppArmor/SELinux enforced

Pod Security Admission Configuration

Enable Pod Security Admission:

# infrastructure/namespace-pod-security.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: atp-production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted

Admission Configuration:

# cluster-config/admission-configuration.yaml
apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
- name: PodSecurity
  configuration:
    apiVersion: pod-security.admission.config.k8s.io/v1
    kind: PodSecurityConfiguration
    defaults:
      enforce: "restricted"
      audit: "restricted"
      warn: "restricted"
    exemptions:
      usernames: []
      runtimeClasses: []
      namespaces:
      - kube-system
      - gatekeeper-system
      - external-secrets-system

Security Context Requirements

Deployment with Restricted Security Context:

# apps/atp-ingestion/base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  template:
    spec:
      securityContext:  # Pod-level security context
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
        seccompProfile:
          type: RuntimeDefault
        supplementalGroups: []
      containers:
      - name: atp-ingestion
        image: connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d
        securityContext:  # Container-level security context
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          runAsNonRoot: true
          runAsUser: 1000
          capabilities:
            drop:
            - ALL
            add: []  # No additional capabilities
          seccompProfile:
            type: RuntimeDefault
        volumeMounts:
        - name: tmp
          mountPath: /tmp
        - name: var-run
          mountPath: /var/run
      volumes:
      - name: tmp
        emptyDir: {}
      - name: var-run
        emptyDir: {}

Gradual Enforcement Strategy

Enforcement Strategy:

| Environment | Enforce Level | Audit Level | Warn Level | Timeline |
|---|---|---|---|---|
| Dev | Baseline | Restricted | Restricted | Immediate |
| Test | Baseline | Restricted | Restricted | Month 1 |
| Staging | Restricted | Restricted | Restricted | Month 2 |
| Production | Restricted | Restricted | Restricted | Month 3 |
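The rollout matrix above maps directly onto the three `pod-security.kubernetes.io` namespace labels. A Python sketch that renders the labels per environment (environment names and levels taken from the matrix; the helper itself is illustrative):

```python
# The rollout matrix above, as data: (enforce, audit, warn) per environment.
PSS_ROLLOUT = {
    "dev":        ("baseline",   "restricted", "restricted"),
    "test":       ("baseline",   "restricted", "restricted"),
    "staging":    ("restricted", "restricted", "restricted"),
    "production": ("restricted", "restricted", "restricted"),
}

def pss_labels(environment: str) -> dict:
    """Render the pod-security.kubernetes.io namespace labels for an environment."""
    enforce, audit, warn = PSS_ROLLOUT[environment]
    return {
        "pod-security.kubernetes.io/enforce": enforce,
        "pod-security.kubernetes.io/audit": audit,
        "pod-security.kubernetes.io/warn": warn,
    }

print(pss_labels("dev")["pod-security.kubernetes.io/enforce"])         # baseline
print(pss_labels("production")["pod-security.kubernetes.io/enforce"])  # restricted
```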

Migration Plan:

Phase 1 (Month 1): Dev and Test
- Set enforce: baseline
- Set audit: restricted
- Identify violations
- Fix applications

Phase 2 (Month 2): Staging
- Set enforce: restricted
- Fix remaining violations
- Validate applications

Phase 3 (Month 3): Production
- Set enforce: restricted
- Full enforcement

Network Policies

Default Deny All Traffic

Default Deny Network Policy:

# platform/network-policies/default-deny-all.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: atp-production
spec:
  podSelector: {}  # Match all pods
  policyTypes:
  - Ingress
  - Egress
  # No ingress or egress rules = deny all

Ingress Rules (Allow Specific Sources)

Allow Ingress from Gateway:

# apps/atp-ingestion/base/network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: atp-ingestion-ingress
  namespace: atp-production
spec:
  podSelector:
    matchLabels:
      app: atp-ingestion
  policyTypes:
  - Ingress
  ingress:
  # Allow ingress from gateway
  - from:
    - podSelector:
        matchLabels:
          app: atp-gateway
      namespaceSelector:
        matchLabels:
          name: atp-production
    ports:
    - protocol: TCP
      port: 8080
  # Allow ingress from ingress controller
  - from:
    - namespaceSelector:
        matchLabels:
          name: ingress-nginx
      podSelector:
        matchLabels:
          app.kubernetes.io/name: ingress-nginx
    ports:
    - protocol: TCP
      port: 8080
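One subtlety in the policy above: inside a single `from` entry, `podSelector` and `namespaceSelector` are ANDed (the peer must match both), while separate `from` entries are ORed. A Python sketch of that matching logic (the selector data mirrors the first entry above; the evaluator itself is a simplified illustration, `matchExpressions` omitted):

```python
def selector_matches(selector: dict, labels: dict) -> bool:
    """matchLabels semantics: every key/value in the selector must be present."""
    return all(labels.get(k) == v for k, v in selector.get("matchLabels", {}).items())

def peer_allowed(from_entries: list, pod_labels: dict, ns_labels: dict) -> bool:
    """Entries in `from` are ORed; selectors inside one entry are ANDed."""
    for entry in from_entries:
        pod_ok = "podSelector" not in entry or selector_matches(entry["podSelector"], pod_labels)
        ns_ok = "namespaceSelector" not in entry or selector_matches(entry["namespaceSelector"], ns_labels)
        if pod_ok and ns_ok:
            return True
    return False

# The first `from` entry of the ingress policy above:
from_entries = [{
    "podSelector": {"matchLabels": {"app": "atp-gateway"}},
    "namespaceSelector": {"matchLabels": {"name": "atp-production"}},
}]

print(peer_allowed(from_entries, {"app": "atp-gateway"}, {"name": "atp-production"}))  # True
print(peer_allowed(from_entries, {"app": "atp-gateway"}, {"name": "other-ns"}))        # False
```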

Egress Rules (Allow Specific Destinations)

Allow Egress to Dependencies:

# apps/atp-ingestion/base/network-policy-egress.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: atp-ingestion-egress
  namespace: atp-production
spec:
  podSelector:
    matchLabels:
      app: atp-ingestion
  policyTypes:
  - Egress
  egress:
  # Allow egress to SQL Database
  - to:
    - namespaceSelector:
        matchLabels:
          name: external-services
    ports:
    - protocol: TCP
      port: 1433  # SQL Server
  # Allow egress to Redis
  - to:
    - podSelector:
        matchLabels:
          app: redis
    ports:
    - protocol: TCP
      port: 6379
  # Allow DNS resolution
  - to:
    - namespaceSelector:
        matchLabels:
          name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
  # Allow egress to Azure services (via Private Link)
  - to:
    - namespaceSelector: {}
      podSelector: {}
    ports:
    - protocol: TCP
      port: 443
  # Allow egress to monitoring
  - to:
    - namespaceSelector:
        matchLabels:
          name: monitoring
      podSelector:
        matchLabels:
          app: prometheus
    ports:
    - protocol: TCP
      port: 9090

DNS and Monitoring Exceptions

DNS Exception:

# platform/network-policies/allow-dns.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: atp-production
spec:
  podSelector: {}  # Apply to all pods
  policyTypes:
  - Egress
  egress:
  # Allow DNS queries to CoreDNS
  - to:
    - namespaceSelector:
        matchLabels:
          name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53

Monitoring Exception:

# platform/network-policies/allow-monitoring.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-monitoring
  namespace: atp-production
spec:
  podSelector: {}  # Apply to all pods
  policyTypes:
  - Egress
  egress:
  # Allow metrics scraping
  - to:
    - namespaceSelector:
        matchLabels:
          name: monitoring
      podSelector:
        matchLabels:
          app: prometheus
    ports:
    - protocol: TCP
      port: 9090
  # Allow log forwarding
  - to:
    - namespaceSelector:
        matchLabels:
          name: logging
      podSelector:
        matchLabels:
          app: fluent-bit
    ports:
    - protocol: TCP
      port: 24224

Network Policy per Service

Service-Specific Network Policies:

# apps/atp-query/base/network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: atp-query-network-policy
  namespace: atp-production
spec:
  podSelector:
    matchLabels:
      app: atp-query
  policyTypes:
  - Ingress
  - Egress
  ingress:
  # Allow from gateway
  - from:
    - podSelector:
        matchLabels:
          app: atp-gateway
    ports:
    - protocol: TCP
      port: 8080
  # Allow from ingestion service
  - from:
    - podSelector:
        matchLabels:
          app: atp-ingestion
    ports:
    - protocol: TCP
      port: 8080
  egress:
  # Allow to Redis
  - to:
    - podSelector:
        matchLabels:
          app: redis
    ports:
    - protocol: TCP
      port: 6379
  # Allow to SQL
  - to:
    - namespaceSelector:
        matchLabels:
          name: external-services
    ports:
    - protocol: TCP
      port: 1433

OPA Gatekeeper Alternative

Open Policy Agent Overview

OPA Gatekeeper Architecture:

graph LR
    A[Policy Templates<br/>in Git] -->|Deploy| B[Gatekeeper<br/>Controller]
    B -->|Creates| C[Constraint CRDs]
    C -->|Enforces| D[Kubernetes<br/>Admission Webhook]
    D -->|Validates| E[Resource Requests]
    E -->|Allows| F[Compliant Resources]
    E -.->|Violates| G[Rejected Resources]

    style A fill:#90EE90
    style B fill:#FFE5B4
    style C fill:#FFE5B4
    style D fill:#FFE5B4
    style E fill:#ffcccc
    style F fill:#90EE90
    style G fill:#ff9999

Install Gatekeeper:

# Install Gatekeeper
kubectl apply -f https://raw.githubusercontent.com/open-policy-agent/gatekeeper/release-3.14/deploy/gatekeeper.yaml

# Verify installation
kubectl get pods -n gatekeeper-system

Gatekeeper Constraints and Templates

ConstraintTemplate: Require Resource Limits:

# policies/gatekeeper/require-resource-limits-template.yaml
apiVersion: templates.gatekeeper.sh/v1beta1
kind: ConstraintTemplate
metadata:
  name: k8srequiredresourcelimits
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredResourceLimits
      validation:
        openAPIV3Schema:
          type: object
          properties:
            cpu:
              type: string
              default: "100m"
            memory:
              type: string
              default: "128Mi"
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredresourcelimits

        violation[{"msg": msg}] {
          container := input.review.object.spec.template.spec.containers[_]
          not container.resources
          msg := sprintf("Container '%v' must have resources defined", [container.name])
        }

        violation[{"msg": msg}] {
          container := input.review.object.spec.template.spec.containers[_]
          not container.resources.limits
          msg := sprintf("Container '%v' must have resource limits", [container.name])
        }

        violation[{"msg": msg}] {
          container := input.review.object.spec.template.spec.containers[_]
          not container.resources.limits.cpu
          msg := sprintf("Container '%v' must have CPU limit", [container.name])
        }

        violation[{"msg": msg}] {
          container := input.review.object.spec.template.spec.containers[_]
          not container.resources.limits.memory
          msg := sprintf("Container '%v' must have memory limit", [container.name])
        }
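The four Rego rules above can be hard to read at first glance. This Python sketch reproduces the same checks in one pass (note the Rego rules fire independently, so a container with no `resources` block would trigger several of them; here they are collapsed for readability):

```python
def resource_limit_violations(deployment: dict) -> list[str]:
    """Single-pass equivalent of the four Rego violation rules above."""
    violations = []
    for c in deployment["spec"]["template"]["spec"]["containers"]:
        name = c["name"]
        resources = c.get("resources")
        if resources is None:
            violations.append(f"Container '{name}' must have resources defined")
            continue
        limits = resources.get("limits")
        if limits is None:
            violations.append(f"Container '{name}' must have resource limits")
            continue
        if "cpu" not in limits:
            violations.append(f"Container '{name}' must have CPU limit")
        if "memory" not in limits:
            violations.append(f"Container '{name}' must have memory limit")
    return violations

# A Deployment spec missing a memory limit:
deployment = {"spec": {"template": {"spec": {"containers": [
    {"name": "atp-ingestion", "resources": {"limits": {"cpu": "100m"}}},
]}}}}
print(resource_limit_violations(deployment))
# ["Container 'atp-ingestion' must have memory limit"]
```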

Constraint: Enforce Resource Limits:

# policies/gatekeeper/require-resource-limits-constraint.yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredResourceLimits
metadata:
  name: require-resource-limits-production
spec:
  match:
    kinds:
    - apiGroups: ["apps"]
      kinds: ["Deployment", "StatefulSet", "DaemonSet"]
    excludedNamespaces:
    - kube-system
    - gatekeeper-system
    - external-secrets-system
    - ingress-nginx
  parameters:
    cpu: "100m"
    memory: "128Mi"

Custom Policy Authoring with Rego

Rego Policy: Require Non-Root User:

# policies/gatekeeper/require-non-root.rego
package requirenonroot

violation[{"msg": msg}] {
    container := input.review.object.spec.template.spec.containers[_]
    not container.securityContext
    msg := sprintf("Container '%v' must have securityContext defined", [container.name])
}

violation[{"msg": msg}] {
    container := input.review.object.spec.template.spec.containers[_]
    container.securityContext
    not container.securityContext.runAsNonRoot
    msg := sprintf("Container '%v' must run as non-root user", [container.name])
}

violation[{"msg": msg}] {
    container := input.review.object.spec.template.spec.containers[_]
    container.securityContext
    container.securityContext.runAsNonRoot == false
    msg := sprintf("Container '%v' must run as non-root user (currently runAsNonRoot=false)", [container.name])
}

ConstraintTemplate:

# policies/gatekeeper/require-non-root-template.yaml
apiVersion: templates.gatekeeper.sh/v1beta1
kind: ConstraintTemplate
metadata:
  name: k8srequirednonroot
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredNonRoot
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package requirenonroot

        violation[{"msg": msg}] {
            container := input.review.object.spec.template.spec.containers[_]
            not container.securityContext
            msg := sprintf("Container '%v' must have securityContext defined", [container.name])
        }

        violation[{"msg": msg}] {
            container := input.review.object.spec.template.spec.containers[_]
            container.securityContext
            not container.securityContext.runAsNonRoot
            msg := sprintf("Container '%v' must run as non-root user", [container.name])
        }

Constraint:

# policies/gatekeeper/require-non-root-constraint.yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredNonRoot
metadata:
  name: require-non-root-production
spec:
  match:
    kinds:
    - apiGroups: ["apps"]
      kinds: ["Deployment", "StatefulSet"]
    namespaces:
    - atp-production
    excludedNamespaces:
    - kube-system

Integration with CI/CD

PR Validation with Gatekeeper:

# .azuredevops/pipelines/pr-validation-gatekeeper.yml
stages:
- stage: ValidateGatekeeper
  displayName: 'Validate with Gatekeeper'
  jobs:
  - job: GatekeeperValidation
    steps:
    - script: |
        # Install the OPA CLI, which the validation step uses to unit-test
        # the Rego policies
        curl -L -o opa https://openpolicyagent.org/downloads/latest/opa_linux_amd64_static
        chmod +x opa
        sudo mv opa /usr/local/bin/
      displayName: 'Install OPA CLI'

    - script: |
        # Unit-test the Rego policies
        opa test policies/gatekeeper/ -v || exit 1

        # To validate rendered manifests against the constraints themselves,
        # use `gator test` (the Gatekeeper CLI) rather than plain OPA
      displayName: 'Validate Gatekeeper policies'

Image Signing and Verification

Image Signing with Notary/Cosign

Cosign Image Signing:

# Install Cosign
wget -O cosign https://github.com/sigstore/cosign/releases/latest/download/cosign-linux-amd64
chmod +x cosign
sudo mv cosign /usr/local/bin/

# Generate signing key pair
cosign generate-key-pair

# Sign container image
cosign sign --key cosign.key \
  connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d

# Verify signature
cosign verify --key cosign.pub \
  connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d

Azure Pipeline: Image Signing:

# .azuredevops/pipelines/image-signing.yml
- stage: SignImage
  displayName: 'Sign Container Image'
  jobs:
  - job: SignWithCosign
    steps:
    - script: |
        # Install Cosign
        wget -O cosign https://github.com/sigstore/cosign/releases/latest/download/cosign-linux-amd64
        chmod +x cosign
        sudo mv cosign /usr/local/bin/
      displayName: 'Install Cosign'

    - task: AzureKeyVault@2
      inputs:
        azureSubscription: 'ATP-KeyVault-Connection'
        KeyVaultName: 'atp-shared-kv'
        SecretsFilter: 'cosign-private-key'
      displayName: 'Retrieve Cosign private key'

    - script: |
        # Sign image; cosign reads the private key from the environment
        # (--key expects a file path or an env:// reference, not raw key data)
        cosign sign --key env://COSIGN_PRIVATE_KEY \
          --yes \
          $(ImageRepository):$(ImageTag)

        echo "✅ Image signed: $(ImageRepository):$(ImageTag)"
      displayName: 'Sign container image'
      env:
        COSIGN_PRIVATE_KEY: $(cosign-private-key)
        COSIGN_PASSWORD: $(cosign-key-password)

Signature Storage in ACR

ACR Repository Configuration:

# Configure retention for untagged manifests (registry housekeeping,
# not signing itself)
az acr config retention update \
  --registry connectsoft \
  --days 30 \
  --status Enabled

# List manifest metadata for the repository (signatures appear as
# additional manifests)
az acr manifest list-metadata \
  --registry connectsoft \
  --name atp/ingestion

Cosign with ACR:

# Sign the image; cosign stores the signature in ACR alongside the image
cosign sign --key cosign.key \
  connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d

Admission Controller for Verification

Image Policy Webhook:

# platform/image-policy-webhook.yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: image-policy-webhook
webhooks:
- name: image-policy.atp.connectsoft.com
  clientConfig:
    service:
      name: image-policy-webhook
      namespace: image-policy-system
      path: "/validate"
  rules:
  - operations: ["CREATE", "UPDATE"]
    apiGroups: ["apps"]
    apiVersions: ["v1"]
    resources: ["deployments", "statefulsets", "daemonsets"]
  admissionReviewVersions: ["v1", "v1beta1"]
  sideEffects: None
  failurePolicy: Fail

Image Verification with Cosign Admission Controller:

# Install the Sigstore policy-controller via its Helm chart
helm repo add sigstore https://sigstore.github.io/helm-charts
helm repo update
helm install policy-controller sigstore/policy-controller \
  --namespace cosign-system \
  --create-namespace
Image Policy:

# platform/image-policy.yaml
apiVersion: policy.sigstore.dev/v1beta1
kind: ClusterImagePolicy
metadata:
  name: atp-image-policy
spec:
  images:
  - glob: "connectsoft.azurecr.io/atp/**"
  authorities:
  - key:
      data: |
        -----BEGIN PUBLIC KEY-----
        MFkwEwYHKoZIzj0CAQYIKoZIzj0CAQYIKoZIzj0CAQYIKoZIzj0CAQYIKoZI...
        -----END PUBLIC KEY-----
  - keyless:
      identities:
      - issuer: "https://token.actions.githubusercontent.com"
        subject: "https://github.com/ConnectSoft/ATP/.github/workflows/*"
  mode: enforce  # enforce or warn
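The `glob` field above decides which images the policy governs. This Python sketch approximates the matching with `fnmatch` (an approximation only: the policy-controller defines its own glob semantics, and `fnmatch`'s `*` happens to cross `/` boundaries, which is roughly what `**` expresses here):

```python
from fnmatch import fnmatchcase

# The image glob from the ClusterImagePolicy above.
POLICY_GLOB = "connectsoft.azurecr.io/atp/**"

def policy_applies(image: str) -> bool:
    """Approximate check of whether the ClusterImagePolicy covers an image."""
    return fnmatchcase(image, POLICY_GLOB)

print(policy_applies("connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d"))  # True
print(policy_applies("docker.io/library/nginx:latest"))                       # False
```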

Rejecting Unsigned Images

Policy Enforcement:

# With mode: enforce
# Unsigned images will be rejected at admission time
# Error: Image signature verification failed

# With mode: warn
# Unsigned images will be allowed but warnings logged

SBOM Generation and Storage

Generating SBOM During CI Build

SBOM Generation in Pipeline:

# .azuredevops/pipelines/sbom-generation.yml
- stage: GenerateSBOM
  displayName: 'Generate SBOM'
  jobs:
  - job: GenerateSBOM
    steps:
    - script: |
        # Install Syft
        curl -sSfL https://raw.githubusercontent.com/anchore/syft/main/install.sh | sh -s -- -b /usr/local/bin
        syft version
      displayName: 'Install Syft'

    - script: |
        # Generate SBOM in SPDX format
        syft packages docker:$(ImageRepository):$(ImageTag) \
          --output spdx-json \
          --file sbom-$(ImageTag).spdx.json

        # Generate SBOM in CycloneDX format
        syft packages docker:$(ImageRepository):$(ImageTag) \
          --output cyclonedx-json \
          --file sbom-$(ImageTag).cyclonedx.json

        echo "✅ SBOM generated for $(ImageRepository):$(ImageTag)"
      displayName: 'Generate SBOM'

    - script: |
        # Attach SBOM to ACR image as OCI artifact
        oras attach \
          --artifact-type application/vnd.cyclonedx+json \
          connectsoft.azurecr.io/atp/ingestion:$(ImageTag) \
          sbom-$(ImageTag).cyclonedx.json
      displayName: 'Attach SBOM to image'

SBOM Formats (CycloneDX, SPDX)

SPDX Format Example:

{
  "SPDXID": "SPDXRef-DOCUMENT",
  "spdxVersion": "SPDX-2.3",
  "name": "connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d",
  "dataLicense": "CC0-1.0",
  "documentNamespace": "https://connectsoft.example/sbom/atp-ingestion/v1.2.3-abc123d",
  "packages": [
    {
      "SPDXID": "SPDXRef-Package-dotnet-runtime",
      "name": "dotnet-runtime",
      "versionInfo": "8.0.0",
      "downloadLocation": "NOASSERTION",
      "filesAnalyzed": false,
      "licenseConcluded": "NOASSERTION"
    },
    {
      "SPDXID": "SPDXRef-Package-aspnetcore",
      "name": "aspnetcore",
      "versionInfo": "8.0.0",
      "downloadLocation": "NOASSERTION",
      "filesAnalyzed": false,
      "licenseConcluded": "NOASSERTION"
    }
  ]
}

CycloneDX Format Example:

{
  "bomFormat": "CycloneDX",
  "specVersion": "1.5",
  "version": 1,
  "metadata": {
    "timestamp": "2024-01-15T10:00:00Z",
    "tools": [
      {
        "vendor": "Anchore",
        "name": "syft",
        "version": "1.0.0"
      }
    ],
    "component": {
      "type": "container",
      "name": "atp-ingestion",
      "version": "v1.2.3-abc123d"
    }
  },
  "components": [
    {
      "type": "library",
      "name": "dotnet-runtime",
      "version": "8.0.0"
    },
    {
      "type": "library",
      "name": "aspnetcore",
      "version": "8.0.0"
    }
  ]
}
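Because CycloneDX is plain JSON, downstream tooling can inspect it without SBOM-specific libraries. A minimal Python sketch that extracts component coordinates from a BOM shaped like the example above (the trimmed document here is illustrative):

```python
import json

# A trimmed CycloneDX document matching the example above.
cyclonedx = json.loads("""{
  "bomFormat": "CycloneDX",
  "specVersion": "1.5",
  "components": [
    {"type": "library", "name": "dotnet-runtime", "version": "8.0.0"},
    {"type": "library", "name": "aspnetcore", "version": "8.0.0"}
  ]
}""")

def list_components(bom: dict) -> list[tuple[str, str]]:
    """Extract (name, version) pairs from a CycloneDX BOM."""
    return [(c["name"], c["version"]) for c in bom.get("components", [])]

print(list_components(cyclonedx))
# [('dotnet-runtime', '8.0.0'), ('aspnetcore', '8.0.0')]
```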

Storing SBOM in ACR Artifacts

Attach SBOM to ACR Image:

# Install the ORAS CLI (release assets are versioned; pin a version)
ORAS_VERSION="1.2.0"
curl -LO "https://github.com/oras-project/oras/releases/download/v${ORAS_VERSION}/oras_${ORAS_VERSION}_linux_amd64.tar.gz"
tar xf "oras_${ORAS_VERSION}_linux_amd64.tar.gz"
sudo mv oras /usr/local/bin/

# Login to ACR
az acr login --name connectsoft

# Attach SBOM as OCI artifact
oras attach \
  --artifact-type application/vnd.cyclonedx+json \
  connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d \
  sbom-v1.2.3-abc123d.cyclonedx.json

# Attach SPDX SBOM
oras attach \
  --artifact-type application/spdx+json \
  connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d \
  sbom-v1.2.3-abc123d.spdx.json

Query SBOM from ACR:

# List artifacts attached to the image (SBOMs appear as referrers)
oras discover \
  connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d

# Download an attached SBOM by the digest reported by `oras discover`
oras pull \
  "connectsoft.azurecr.io/atp/ingestion@<sbom-artifact-digest>" \
  -o ./sbom-downloaded

Vulnerability Scanning of SBOM

Scan SBOM for Vulnerabilities:

# Scan SBOM with Grype
grype sbom:sbom-$(ImageTag).cyclonedx.json \
  --output json \
  --file vulnerability-report-$(ImageTag).json

# Or scan with Trivy
trivy sbom sbom-$(ImageTag).cyclonedx.json \
  --format json \
  --output trivy-sbom-report-$(ImageTag).json

Vulnerability Scanning

Azure Defender for Containers

Enable Azure Defender:

# Enable Defender for Containers
az security pricing create \
  --name "Containers" \
  --tier "Standard"

Defender for Containers Configuration:

// Enable Defender for Containers via Pulumi
new Security.Pricing("defender-containers", new()
{
    PricingTier = "Standard",
    SubPlan = "DefenderForContainers",
});

Trivy Scanning in CI Pipeline

Trivy Vulnerability Scan:

# .azuredevops/pipelines/vulnerability-scanning.yml
- stage: VulnerabilityScan
  displayName: 'Vulnerability Scanning'
  jobs:
  - job: TrivyScan
    steps:
    - script: |
        # Install Trivy via the official install script
        curl -sfL https://raw.githubusercontent.com/aquasecurity/trivy/main/contrib/install.sh | sh -s -- -b /usr/local/bin
        trivy --version
      displayName: 'Install Trivy'

    - script: |
        # Scan container image
        trivy image \
          --format json \
          --output trivy-$(ImageTag).json \
          --severity HIGH,CRITICAL \
          --exit-code 0 \
          $(ImageRepository):$(ImageTag)
      displayName: 'Scan image for vulnerabilities'
      continueOnError: true

    - script: |
        # Generate HTML report
        trivy image \
          --format template \
          --template "@contrib/html.tpl" \
          --output trivy-$(ImageTag).html \
          --severity HIGH,CRITICAL \
          $(ImageRepository):$(ImageTag)

        # Publish report
        echo "##vso[task.addattachment type=Distributedtask.Core.Summary;name=Vulnerability Report;]$PWD/trivy-$(ImageTag).html"
      displayName: 'Generate vulnerability report'

    - script: |
        # Fail build if critical vulnerabilities found
        CRITICAL_COUNT=$(jq '[.Results[]?.Vulnerabilities[]? | select(.Severity=="CRITICAL")] | length' trivy-$(ImageTag).json)

        if [ "$CRITICAL_COUNT" -gt 0 ]; then
          echo "❌ Critical vulnerabilities found: $CRITICAL_COUNT"
          exit 1
        fi

        echo "✅ No critical vulnerabilities found"
      displayName: 'Check for critical vulnerabilities'
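The jq expression in the last step is compact but easy to misread. The same count, expressed in Python for local tooling (the `Results`/`Vulnerabilities`/`Severity` keys match Trivy's JSON output; the sample report is a minimal illustration):

```python
def critical_count(report: dict) -> int:
    """Count CRITICAL vulnerabilities across all Results entries,
    mirroring the jq expression in the pipeline step above."""
    return sum(
        1
        for result in report.get("Results") or []
        for vuln in result.get("Vulnerabilities") or []
        if vuln.get("Severity") == "CRITICAL"
    )

# Minimal report shaped like Trivy's JSON output.
report = {"Results": [{"Vulnerabilities": [
    {"VulnerabilityID": "CVE-2024-1234", "Severity": "CRITICAL"},
    {"VulnerabilityID": "CVE-2024-5678", "Severity": "HIGH"},
]}]}

print(critical_count(report))  # 1
```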

Runtime Vulnerability Detection

Trivy Operator for Runtime Scanning:

# Install Trivy Operator
helm repo add aqua https://aquasecurity.github.io/helm-charts/
helm repo update

helm install trivy-operator aqua/trivy-operator \
  --namespace trivy-system \
  --create-namespace \
  --version 0.18.0

VulnerabilityReport CRD:

# Trivy Operator automatically creates VulnerabilityReport resources
apiVersion: aquasecurity.github.io/v1alpha1
kind: VulnerabilityReport
metadata:
  name: atp-ingestion-abc123
  namespace: atp-production
report:
  artifact:
    repository: connectsoft.azurecr.io/atp/ingestion
    tag: v1.2.3-abc123d
  summary:
    criticalCount: 0
    highCount: 2
    mediumCount: 5
    lowCount: 10

Query Vulnerability Reports:

# List vulnerability reports
kubectl get vulnerabilityreports -n atp-production

# View detailed report
kubectl get vulnerabilityreport atp-ingestion-abc123 -n atp-production -o yaml

Remediation Workflows

Vulnerability Remediation Process:

graph LR
    A[Vulnerability<br/>Detected] -->|Alert| B[Security Team]
    B -->|Assess| C{Critical?}
    C -->|Yes| D[Immediate Patch]
    C -->|No| E[Schedule Patch]
    D -->|Rebuild Image| F[Rescan]
    E -->|Rebuild Image| F
    F -->|Verify| G[Deploy]

    style A fill:#ffcccc
    style D fill:#ff9999
    style F fill:#90EE90
    style G fill:#90EE90

Remediation Script:

#!/bin/bash
# scripts/remediate-vulnerability.sh

IMAGE="${1:-connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d}"
VULN_ID="${2:-CVE-2024-1234}"

echo "🔧 Remediating vulnerability: $VULN_ID in $IMAGE"

# 1. Update dependencies
# 2. Rebuild image
# 3. Rescan
trivy image --severity HIGH,CRITICAL "$IMAGE" | grep -q "$VULN_ID" && \
  echo "❌ Vulnerability still present" || \
  echo "✅ Vulnerability remediated"

RBAC Policies in Kubernetes

ServiceAccounts per ATP Service

ServiceAccount Definition:

# apps/atp-ingestion/base/serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: atp-ingestion
  namespace: atp-production
  labels:
    app: atp-ingestion
    managed-by: fluxcd

Roles and RoleBindings

Role: Service-Specific Permissions:

# apps/atp-ingestion/base/role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: atp-ingestion-role
  namespace: atp-production
rules:
# Allow read ConfigMaps in same namespace
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["get", "list", "watch"]
# Allow read Secrets in same namespace
- apiGroups: [""]
  resources: ["secrets"]
  verbs: ["get", "list"]
  resourceNames:
  - sql-connection-string
  - redis-connection-string

RoleBinding:

# apps/atp-ingestion/base/rolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: atp-ingestion-rolebinding
  namespace: atp-production
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: atp-ingestion-role
subjects:
- kind: ServiceAccount
  name: atp-ingestion
  namespace: atp-production

ClusterRoles and ClusterRoleBindings

ClusterRole: Cross-Namespace Permissions:

# platform/rbac/clusterrole-monitoring.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: atp-monitoring-reader
rules:
# Allow read pods for metrics
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
# Allow read nodes for metrics
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "list", "watch"]
  resourceNames: []

ClusterRoleBinding:

# platform/rbac/clusterrolebinding-monitoring.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: atp-monitoring-reader-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: atp-monitoring-reader
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: monitoring

Least Privilege Principle

Least Privilege RBAC Matrix:

| Service | ServiceAccount | Role | Permissions |
|---|---|---|---|
| atp-ingestion | atp-ingestion | Role (namespace-scoped) | Read ConfigMaps, read specific Secrets |
| atp-query | atp-query | Role (namespace-scoped) | Read ConfigMaps, read specific Secrets |
| prometheus | prometheus | ClusterRole | Read Pods, Nodes (cluster-wide) |
| fluent-bit | fluent-bit | ClusterRole | Read Pods, Namespaces (cluster-wide) |

RBAC Audit Script:

#!/bin/bash
# scripts/audit-rbac.sh

echo "🔍 Auditing RBAC permissions..."

# List ClusterRoleBindings that grant permissions to ServiceAccounts
kubectl get clusterrolebindings -o json | \
  jq -r '.items[] | select(.subjects[]?.kind=="ServiceAccount") | .metadata.name'

# Check ClusterRoles for wildcard verb permissions
kubectl get clusterroles -o json | \
  jq -r '.items[] | select(.rules[]?.verbs[]?=="*") | .metadata.name'

Audit Logging

Kubernetes Audit Logs

Enable Audit Logging:

# cluster-config/audit-policy.yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
# Log all requests in these stages
- level: Metadata
  namespaces: ["atp-production"]
  verbs: ["create", "update", "patch", "delete"]
- level: RequestResponse
  namespaces: ["atp-production"]
  resources:
  - group: ""
    resources: ["secrets", "configmaps"]
  verbs: ["create", "update", "patch", "delete"]
- level: None
  users: ["system:kube-proxy"]
  verbs: ["watch"]
  resources:
  - group: ""
    resources: ["endpoints", "services", "services/status"]

Configure Audit Logging on AKS:

# Note: on AKS the API server audit policy is managed by Azure; the policy
# file above applies to self-managed control planes. Collect kube-audit
# logs into Log Analytics via a diagnostic setting on the cluster resource:
az monitor diagnostic-settings create \
  --name atp-audit-logs \
  --resource "/subscriptions/{subscriptionId}/resourceGroups/atp-production-rg/providers/Microsoft.ContainerService/managedClusters/atp-prod-eus-aks" \
  --workspace "{logAnalyticsWorkspaceResourceId}" \
  --logs '[{"category": "kube-audit", "enabled": true}]'

Forwarding to Azure Monitor

Audit Log Forwarding:

# platform/audit-log-forwarder.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: audit-log-forwarder-config
  namespace: kube-system
data:
  fluent-bit.conf: |
    [INPUT]
        Name tail
        Path /var/log/audit/kube-apiserver-audit.log
        Parser json
        Tag kube-audit.*
        Refresh_Interval 5

    [OUTPUT]
        Name azure
        Match kube-audit.*
        Workspace_ID ${LOG_ANALYTICS_WORKSPACE_ID}
        Shared_Key ${LOG_ANALYTICS_SHARED_KEY}

Audit Policy Configuration

Comprehensive Audit Policy:

# cluster-config/audit-policy-comprehensive.yaml
apiVersion: audit.k8s.io/v1
kind: Policy
omitStages:
- "RequestReceived"
rules:
# Log all requests to production namespace
- level: RequestResponse
  namespaces: ["atp-production"]
- level: Metadata
  namespaces: ["atp-staging"]
# Log secret access
- level: RequestResponse
  resources:
  - group: ""
    resources: ["secrets"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
# Log RBAC changes
- level: RequestResponse
  resources:
  - group: "rbac.authorization.k8s.io"
    resources: ["roles", "rolebindings", "clusterroles", "clusterrolebindings"]
  verbs: ["create", "update", "patch", "delete"]

Log Retention and Analysis

Audit Log Retention:

// Kubernetes audit events correlated with pod inventory
let AuditEvents = AzureDiagnostics
| where Category == "kube-audit"
| extend Audit = parse_json(log_s)
| extend Verb = tostring(Audit.verb), ObjectName = tostring(Audit.objectRef.name), User = tostring(Audit.user.username)
| where Verb in ("create", "update", "delete");
KubePodInventory
| where Namespace == "atp-production"
| join kind=inner (AuditEvents) on $left.Name == $right.ObjectName
| project TimeGenerated, Verb, User, ObjectName, Namespace
| order by TimeGenerated desc

Audit Log Analysis:

// Secret access audit trail
AzureDiagnostics
| where Category == "kube-audit"
| extend Audit = parse_json(log_s)
| where tostring(Audit.objectRef.resource) == "secrets"
| extend User = tostring(Audit.user.username), Verb = tostring(Audit.verb)
| project TimeGenerated, User, Verb, SecretName = tostring(Audit.objectRef.name), Namespace = tostring(Audit.objectRef.namespace)
| order by TimeGenerated desc

Policy Enforcement via GitOps

Policy as Code in Git

Policy Organization in Git:

atp-gitops/
├── policies/
│   ├── azure-policy/
│   │   ├── require-resource-limits.json
│   │   └── require-non-root.json
│   ├── gatekeeper/
│   │   ├── constraint-templates/
│   │   │   ├── require-resource-limits-template.yaml
│   │   │   └── require-non-root-template.yaml
│   │   └── constraints/
│   │       ├── require-resource-limits-constraint.yaml
│   │       └── require-non-root-constraint.yaml
│   └── network-policies/
│       ├── default-deny-all.yaml
│       └── service-policies/
│           ├── atp-ingestion-network-policy.yaml
│           └── atp-query-network-policy.yaml

Automated Policy Application

FluxCD Kustomization for Policies:

# infrastructure/policies/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  # Azure Policy assignments
  - azure-policy/assignments.yaml

  # Gatekeeper templates
  - gatekeeper/constraint-templates/

  # Gatekeeper constraints
  - gatekeeper/constraints/

  # Network policies
  - network-policies/default-deny-all.yaml
  - network-policies/service-policies/

FluxCD GitRepository for Policies:

# clusters/production/policies-gitrepository.yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: atp-policies
  namespace: flux-system
spec:
  interval: 5m
  url: https://dev.azure.com/ConnectSoft/ATP/_git/atp-gitops
  ref:
    branch: production
  secretRef:
    name: gitops-credentials
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: policies
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: atp-policies
  path: ./policies/
  prune: true

Policy Violation Detection

Monitor Policy Violations:

# Check Azure Policy violations
az policy state list \
  --resource "/subscriptions/{subscriptionId}/resourceGroups/atp-production-rg/providers/Microsoft.ContainerService/managedClusters/atp-prod-eus-aks" \
  --filter "complianceState eq 'NonCompliant'" \
  --query "[].{resource:resourceId, policy:policyAssignmentName, reason:complianceReason}" \
  --output table

# Check Gatekeeper constraint violations
kubectl get constraints -A
kubectl describe k8srequiredresourcelimits require-resource-limits-production -n atp-production

Policy Violation Alert:

# alerts/policy-violation.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: policy-violations
  namespace: monitoring
spec:
  groups:
  - name: policy-violations
    rules:
    - alert: AzurePolicyViolation
      expr: |
        azure_policy_compliance_state{state="NonCompliant"} > 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Azure Policy violation detected"
        description: "{{ $value }} non-compliant resources detected"

Remediation Through PR

Policy Violation Remediation Workflow:

graph LR
    A[Policy Violation<br/>Detected] -->|Alert| B[Developer]
    B -->|Create PR| C[Fix Manifest]
    C -->|Merge| D[GitOps Sync]
    D -->|Apply| E[Compliant Resource]

    style A fill:#ffcccc
    style C fill:#90EE90
    style E fill:#90EE90

Remediation PR Process:

  1. Developer receives policy violation alert
  2. Create PR to fix manifest
  3. PR validation ensures compliance
  4. Merge PR triggers FluxCD sync
  5. Policy violation resolved
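
The steps above can be sketched with Git and the Azure DevOps CLI; the branch name, manifest edit, and PR title are illustrative:

```bash
# 1-2. Branch and fix the offending manifest
git checkout -b fix/policy-violation-resource-limits
# ... edit the manifest, e.g. add the missing resources.limits ...
git commit -am "Add resource limits to satisfy require-resource-limits"
git push origin fix/policy-violation-resource-limits

# 3-5. Open the PR; branch policies run validation, merge triggers FluxCD sync
az repos pr create \
  --repository atp-gitops \
  --source-branch fix/policy-violation-resource-limits \
  --target-branch production \
  --title "Fix: add resource limits (policy remediation)"
```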

Compliance Evidence Generation

Deployment Receipts with Approvals

Deployment Receipt Generation:

#!/bin/bash
# scripts/generate-deployment-receipt.sh

DEPLOYMENT_NAME="${1:-atp-ingestion}"
NAMESPACE="${2:-atp-production}"

cat > deployment-receipt-$(date +%Y%m%d-%H%M%S).json <<EOF
{
  "deploymentId": "$(kubectl get deployment $DEPLOYMENT_NAME -n $NAMESPACE -o jsonpath='{.metadata.uid}')",
  "deployedAt": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
  "deployedBy": "FluxCD",
  "gitCommit": "$(kubectl get deployment $DEPLOYMENT_NAME -n $NAMESPACE -o jsonpath='{.metadata.labels.app\.kubernetes\.io/version}')",
  "approvals": [
    {
      "approver": "architect-team@connectsoft.example",
      "approvedAt": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
      "approvalType": "CAB"
    }
  ],
  "policyCompliance": {
    "azurePolicy": "Compliant",
    "gatekeeper": "Compliant",
    "podSecurity": "Compliant"
  }
}
EOF

Security Scan Reports

Security Scan Evidence:

# Generate security scan evidence
cat > security-scan-evidence-$(date +%Y%m%d).json <<EOF
{
  "scanDate": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
  "image": "$(ImageRepository):$(ImageTag)",
  "scanner": "Trivy",
  "vulnerabilities": {
    "critical": 0,
    "high": 2,
    "medium": 5,
    "low": 10
  },
  "compliance": "Pass",
  "scanReport": "trivy-$(ImageTag).json"
}
EOF

SBOM Artifacts

SBOM Evidence:

{
  "sbomGeneratedAt": "2024-01-15T10:00:00Z",
  "image": "connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d",
  "sbomFormat": "CycloneDX",
  "sbomVersion": "1.5",
  "sbomLocation": "connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d (OCI artifact)",
  "components": 245,
  "verification": {
    "signed": true,
    "signatureVerified": true
  }
}

Policy Compliance Reports

Compliance Report Generation:

// Generate compliance report: Key Vault secret access over the last 30 days
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.KEYVAULT"
| where Category == "AuditEvent"
| where TimeGenerated > ago(30d)
| extend SecretName = tostring(parse_json(properties_s).objectName)
| summarize 
    TotalAccess = count(),
    UniqueIdentities = dcount(identity_claim_appid_g)
    by SecretName
| project SecretName, TotalAccess, UniqueIdentities, ComplianceStatus = "Compliant"

SOC 2 / GDPR / HIPAA Controls

Mapping GitOps Workflows to Controls

SOC 2 Control Mapping:

| SOC 2 Control | GitOps Implementation | Evidence |
|---|---|---|
| CC6.1 Logical Access Controls | RBAC in Kubernetes, Azure AD integration | RBAC audit logs, access reviews |
| CC6.2 Authentication | Workload Identity, Azure AD | Authentication logs |
| CC6.7 Audit Logging | Kubernetes audit logs, Key Vault logs | Log Analytics queries |
| CC7.2 Change Management | GitOps PR workflow, approvals | PR history, deployment receipts |
| CC8.1 Encryption | Secrets in Key Vault, TLS in transit | Key Vault encryption logs |

GDPR Control Mapping:

| GDPR Article | GitOps Implementation | Evidence |
|---|---|---|
| Art. 32 Security of Processing | Pod Security Standards, Network Policies | Security policy compliance reports |
| Art. 33 Breach Notification | Audit logging, alerting | Security incident logs |
| Art. 35 Data Protection Impact Assessment | SBOM, vulnerability scanning | SBOM artifacts, scan reports |

Evidence Collection Automation

Automated Evidence Collection:

# .azuredevops/pipelines/compliance-evidence.yml
- stage: CollectEvidence
  displayName: 'Collect Compliance Evidence'
  jobs:
  - job: GenerateEvidence
    steps:
    - script: |
        # Generate deployment receipt
        ./scripts/generate-deployment-receipt.sh atp-ingestion atp-production

        # Generate security scan evidence
        ./scripts/generate-security-scan-evidence.sh

        # Generate SBOM evidence
        ./scripts/generate-sbom-evidence.sh

        # Generate policy compliance report
        ./scripts/generate-policy-compliance-report.sh

        # Archive all evidence
        tar -czf compliance-evidence-$(Build.BuildNumber).tar.gz \
          deployment-receipt-*.json \
          security-scan-evidence-*.json \
          sbom-evidence-*.json \
          policy-compliance-report-*.json
      displayName: 'Collect compliance evidence'

    - task: PublishPipelineArtifact@1
      inputs:
        targetPath: 'compliance-evidence-*.tar.gz'
        artifactName: 'compliance-evidence'

Audit Trail for Compliance

Audit Trail Generation:

// Complete audit trail for compliance: deployments, policy state, secret access
let DeploymentEvents = KubePodInventory
| where Namespace == "atp-production"
| extend EventType = "Deployment", Detail = Name;

let PolicyCompliance = AzureDiagnostics
| where ResourceProvider == "MICROSOFT.AUTHORIZATION"
| where Category == "PolicyState"
| extend EventType = "PolicyCompliance", Detail = OperationName;

let SecretAccess = AzureDiagnostics
| where ResourceProvider == "MICROSOFT.KEYVAULT"
| where Category == "AuditEvent"
| extend EventType = "SecretAccess", Detail = tostring(parse_json(properties_s).objectName);

union DeploymentEvents, PolicyCompliance, SecretAccess
| project TimeGenerated, EventType, Detail
| order by TimeGenerated desc

Regular Access Reviews

Access Review Automation:

#!/bin/bash
# scripts/access-review.sh

echo "📋 Generating Access Review Report..."

# Review Kubernetes RBAC
echo "## Kubernetes RBAC Review" > access-review-$(date +%Y%m%d).md
kubectl get clusterrolebindings -o json | \
  jq -r '.items[] | select(.subjects[]?.kind=="ServiceAccount") | "\(.metadata.name): \(.subjects[]?.name)"' \
  >> access-review-$(date +%Y%m%d).md

# Review Azure Key Vault access
echo "## Key Vault Access Review" >> access-review-$(date +%Y%m%d).md
az role assignment list \
  --scope "/subscriptions/{subscriptionId}/resourceGroups/atp-production-rg/providers/Microsoft.KeyVault/vaults/atp-prod-kv" \
  --query "[].{principal:principalName, role:roleDefinitionName}" \
  --output table >> access-review-$(date +%Y%m%d).md

Summary: Security Policies & Compliance

  • Azure Policy for Kubernetes: Policy definitions, assignments, and enforcement for AKS clusters
  • Pod Security Standards: Restricted profile enforcement, Pod Security Admission configuration, security context requirements
  • Network Policies: Default deny, ingress/egress rules, DNS and monitoring exceptions, service-specific policies
  • OPA Gatekeeper: Constraint templates, custom Rego policies, CI/CD integration
  • Image Signing: Cosign signing, signature storage in ACR, admission controller verification
  • SBOM Generation: CycloneDX and SPDX formats, storage in ACR artifacts, vulnerability scanning
  • Vulnerability Scanning: Azure Defender, Trivy in CI, runtime detection, remediation workflows
  • RBAC Policies: ServiceAccounts, Roles/RoleBindings, ClusterRoles, least privilege principle
  • Audit Logging: Kubernetes audit logs, Azure Monitor forwarding, log retention and analysis
  • Policy Enforcement via GitOps: Policy as code, automated application, violation detection, remediation through PR
  • Compliance Evidence: Deployment receipts, security scan reports, SBOM artifacts, policy compliance reports
  • SOC 2/GDPR/HIPAA Controls: Control mapping, evidence collection automation, audit trails, access reviews

FluxCD Continuous Reconciliation

Purpose: Define how FluxCD continuously reconciles the desired state from Git with the live Kubernetes cluster state, including reconciliation loops, drift detection, self-healing mechanisms, health assessment, and observability to ensure ATP deployments remain aligned with Git-managed manifests.


FluxCD Reconciliation Loop

How Reconciliation Works

Reconciliation Flow:

sequenceDiagram
    participant Git as Git Repository
    participant Source as Source Controller
    participant Kustomize as Kustomize Controller
    participant K8s as Kubernetes Cluster

    Git->>Source: Poll for changes (every 1m)
    Source->>Source: Fetch latest commit
    Source->>Source: Compare with last sync
    alt Changes detected
        Source->>Source: Update GitRepository status
        Source->>Kustomize: Trigger reconciliation
    end
    Kustomize->>Source: Fetch manifest artifact
    Kustomize->>K8s: Apply manifests (kubectl apply)
    K8s->>Kustomize: Return apply result
    Kustomize->>Kustomize: Update Kustomization status
    alt Drift detected
        Kustomize->>K8s: Re-apply to correct drift
    end
    Kustomize->>Source: Report reconciliation result

Reconciliation Components:

| Component | Responsibility | Reconciliation Trigger |
|---|---|---|
| Source Controller | Monitors Git repository | Polls Git every interval (default: 1m) |
| Kustomize Controller | Applies Kustomizations | Triggered by Source Controller when changes detected |
| Helm Controller | Applies Helm releases | Triggered by Source Controller when HelmRepository changes |
| Image Automation Controller | Updates image tags | Triggered by ImagePolicy changes |
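
The controllers in this table can be inspected with the Flux CLI (assumes the `flux` binary is installed and kubeconfig targets the cluster):

```bash
# Verify all Flux controllers and CRDs are installed and healthy
flux check

# Per-controller reconciliation status of managed objects
flux get sources git
flux get kustomizations
flux get helmreleases -A
```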

Polling Interval Configuration

GitRepository Polling Interval:

# clusters/production/gitrepository.yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: atp-gitops
  namespace: flux-system
spec:
  interval: 1m  # Poll Git every 1 minute
  url: https://dev.azure.com/ConnectSoft/ATP/_git/atp-gitops
  ref:
    branch: production
  timeout: 20s
  ignore: |
    # Exclude paths from reconciliation (gitignore-style patterns)
    /docs/
    /.git/

Kustomization Reconciliation Interval:

# apps/atp-ingestion/kustomization.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: atp-ingestion
  namespace: flux-system
spec:
  interval: 5m  # Reconcile every 5 minutes
  path: ./apps/atp-ingestion
  prune: true
  sourceRef:
    kind: GitRepository
    name: atp-gitops
  dependsOn:
  - name: infrastructure
  healthChecks:
  - apiVersion: apps/v1
    kind: Deployment
    name: atp-ingestion
    namespace: atp-production

Environment-Specific Intervals:

| Environment | GitRepository Interval | Kustomization Interval | Rationale |
|---|---|---|---|
| Dev | 30s | 1m | Faster feedback loop |
| Test | 1m | 2m | Balance between speed and load |
| Staging | 2m | 5m | Reduced cluster load |
| Production | 5m | 10m | Stability over speed |
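
Rather than duplicating full Flux definitions per environment, these intervals can be patched in each cluster overlay. A sketch assuming a shared base at `../../base/flux` (paths and resource names are illustrative):

```yaml
# clusters/dev/kustomization.yaml (illustrative overlay)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base/flux
patches:
  # Shorten the Git polling interval for dev
  - target:
      kind: GitRepository
      name: atp-gitops
    patch: |
      - op: replace
        path: /spec/interval
        value: 30s
  # Shorten the reconciliation interval for dev
  - target:
      kind: Kustomization
      name: apps
    patch: |
      - op: replace
        path: /spec/interval
        value: 1m
```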

Reconciliation Triggers

Automatic Triggers:

  1. Git Commit: New commit pushed to monitored branch
  2. Polling Interval: Periodic check (even if no changes)
  3. Webhook: Immediate trigger via webhook (bypasses polling)

Webhook Configuration:

# clusters/production/receiver.yaml
apiVersion: notification.toolkit.fluxcd.io/v1
kind: Receiver
metadata:
  name: gitops-receiver
  namespace: flux-system
spec:
  type: git
  events:
  - "push"
  resources:
  - kind: GitRepository
    name: atp-gitops
  secretRef:
    name: gitops-webhook-token
---
# Azure DevOps webhook trigger
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: atp-gitops
spec:
  interval: 1m
  url: https://dev.azure.com/ConnectSoft/ATP/_git/atp-gitops
  ref:
    branch: production
  # Webhook URL: https://<cluster-ip>/hook/<token>

Manual Trigger:

# Force immediate reconciliation
flux reconcile source git atp-gitops --with-source

# Force Kustomization reconciliation
flux reconcile kustomization atp-ingestion --with-source

# Trigger all reconciliations
flux reconcile source git --all

Retry Strategies and Backoff

Retry Configuration:

# Kustomization with retry settings
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: atp-ingestion
spec:
  interval: 5m
  retryInterval: 2m  # Retry failed reconciliation every 2 minutes
  timeout: 10m  # Timeout after 10 minutes
  path: ./apps/atp-ingestion
  sourceRef:
    kind: GitRepository
    name: atp-gitops
  # Failed reconciliations are retried every retryInterval (fixed, not exponential)

Retry Behavior:

| Failure Type | Retry Interval | Behavior |
|---|---|---|
| Git fetch error | GitRepository interval | Retried on the next source poll |
| Apply error | retryInterval (e.g. 2m) | Retried at the fixed retryInterval until success |
| Health check failure | retryInterval | Retried until healthy or the timeout is reached |

Retry Status:

# Check reconciliation status and retries
kubectl get kustomization atp-ingestion -n flux-system -o jsonpath='{.status}'

# Output:
# {
#   "conditions": [{
#     "type": "Ready",
#     "status": "False",
#     "reason": "ReconciliationFailed",
#     "message": "apply failed: error applying manifests",
#     "lastTransitionTime": "2024-01-15T10:00:00Z"
#   }],
#   "lastAppliedRevision": "abc123...",
#   "lastAttemptedRevision": "def456...",
#   "observedGeneration": 1,
#   "snapshot": {...}
# }

Automated Sync Policies

Auto-Sync for Dev and Test Environments

Dev Environment Auto-Sync:

# clusters/dev/kustomization-dev.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-dev
  namespace: flux-system
spec:
  interval: 1m  # Fast sync interval
  path: ./apps
  prune: true  # Auto-delete removed resources
  wait: true  # Wait for resources to be ready
  timeout: 5m
  sourceRef:
    kind: GitRepository
    name: atp-gitops
  force: false  # Set true to recreate resources blocked by immutable fields

Test Environment Auto-Sync:

# clusters/test/kustomization-test.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-test
  namespace: flux-system
spec:
  interval: 2m
  path: ./apps
  prune: true
  wait: true
  timeout: 10m
  sourceRef:
    kind: GitRepository
    name: atp-gitops
  force: false

Manual Sync for Staging and Production

Staging Environment Manual Sync:

# clusters/staging/kustomization-staging.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-staging
  namespace: flux-system
spec:
  suspend: false  # Reconciliation enabled
  interval: 5m  # Still polls for changes
  path: ./apps
  prune: false  # Manual pruning only
  wait: true
  timeout: 15m
  sourceRef:
    kind: GitRepository
    name: atp-gitops
  # Manual sync via: flux reconcile kustomization apps-staging

Production Environment Manual Sync:

# clusters/production/kustomization-production.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-production
  namespace: flux-system
spec:
  suspend: false
  interval: 10m  # Longer interval
  path: ./apps
  prune: false  # Never auto-prune in production
  wait: true
  timeout: 20m
  sourceRef:
    kind: GitRepository
    name: atp-gitops
  # Requires explicit: flux reconcile kustomization apps-production

Manual Sync Workflow:

# 1. Review changes in Git
git log --oneline production

# 2. Trigger manual sync
flux reconcile kustomization apps-production --with-source

# 3. Monitor sync status
flux get kustomizations apps-production --watch

# 4. Verify deployment
kubectl rollout status deployment/atp-ingestion -n atp-production

Sync Options (Prune, Force, Wait)

Sync Options Reference:

| Field | Description | Use Case |
|---|---|---|
| prune | Delete resources removed from Git | Clean up unused resources |
| force | Recreate resources that cannot be patched in place | Handle immutable field changes |
| wait | Wait for applied resources to become ready | Ensure deployment success |
| timeout | Upper bound for apply and health checks | Bound reconciliation duration |
| suspend | Pause reconciliation entirely | Freeze changes during incidents |
| targetNamespace | Apply all resources into a single namespace | Simplify namespace management |

Comprehensive Sync Options:

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: atp-ingestion-production
spec:
  interval: 10m
  path: ./apps/atp-ingestion
  prune: false  # Production: manual pruning
  wait: true  # Wait for Deployment to be ready
  timeout: 20m
  retryInterval: 5m
  sourceRef:
    kind: GitRepository
    name: atp-gitops
  force: false  # Don't recreate resources on immutable-field conflicts (safer)
  healthChecks:
  - apiVersion: apps/v1
    kind: Deployment
    name: atp-ingestion
    namespace: atp-production

Per-Resource Sync Configuration

Service-Specific Sync Settings:

# apps/atp-ingestion/kustomization.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: atp-ingestion
  namespace: flux-system
spec:
  interval: 5m
  path: ./apps/atp-ingestion
  prune: true
  wait: true
  timeout: 15m
  sourceRef:
    kind: GitRepository
    name: atp-gitops
  dependsOn:
  - name: infrastructure  # Wait for infrastructure first
  - name: secrets  # Wait for secrets to be synced
  healthChecks:
  - apiVersion: apps/v1
    kind: Deployment
    name: atp-ingestion
    namespace: atp-production
  # Per-resource pruning is controlled by labeling the resource itself with
  #   kustomize.toolkit.fluxcd.io/prune: disabled

Drift Detection Mechanisms

Comparing Git State to Live Cluster

Drift Detection Flow:

graph LR
    A[Git State<br/>Manifests] -->|Compare| B[Cluster State<br/>Live Resources]
    B -->|Matches| C[No Action]
    B -->|Differs| D[Drift Detected]
    D -->|Self-Heal Enabled| E[Re-apply from Git]
    D -->|Self-Heal Disabled| F[Alert Only]
    E -->|Success| C
    E -->|Failure| G[Retry/Alert]

    style A fill:#90EE90
    style B fill:#FFE5B4
    style D fill:#ffcccc
    style E fill:#90EE90
    style F fill:#ff9999

Drift Detection Process:

  1. Fetch Git State: Source Controller fetches latest manifests
  2. Fetch Cluster State: Kustomize Controller queries Kubernetes API
  3. Compute Diff: Compare desired (Git) vs actual (Cluster) state
  4. Detect Drift: Identify differences
  5. Correct Drift: Re-apply manifests (if self-healing enabled)

Check Drift Status:

# Check for drift
flux get kustomizations atp-ingestion

# Output shows:
# NAME            READY   MESSAGE                         REVISION        SUSPENDED
# atp-ingestion   True    Applied revision: abc123def    abc123def       False

# Detailed drift information
kubectl describe kustomization atp-ingestion -n flux-system

# Events show drift detection:
# Warning  ReconciliationFailed  drift detected: Deployment replicas changed from 3 to 5
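
Drift can also be previewed without waiting for a reconciliation cycle using `flux diff`, which builds the Kustomization locally and compares it server-side against the live objects (assumes a local checkout of the GitOps repository):

```bash
# Preview the differences between Git and the cluster for a Kustomization
flux diff kustomization atp-ingestion \
  --path ./apps/atp-ingestion

# A non-zero exit code indicates the cluster differs from Git
```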

Drift Types (Manual Changes, External Controllers)

Manual Changes:

# Example: Manual replica scaling
kubectl scale deployment atp-ingestion -n atp-production --replicas=5

# FluxCD detects drift:
# Warning  ReconciliationFailed  drift detected in Deployment/atp-ingestion:
#   spec.replicas: expected 3, found 5

# With self-healing: FluxCD reverts to 3 replicas
# Without self-healing: Alert only

External Controller Changes:

# Example: HPA scales deployment
# HPA changes replicas to 4 based on CPU usage

# FluxCD behavior:
# - If replicas in Git: 3 (no replicas field)
# - HPA-managed replicas: 4
# - FluxCD: No drift (HPA takes precedence when replicas field absent)

# If Git specifies replicas: 3
# HPA-managed replicas: 4
# FluxCD: Detects drift, reverts to 3 (may conflict with HPA)
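
To avoid this conflict, the common pattern is to omit `spec.replicas` from the Git manifest and let the HPA own the replica count (a sketch; the HPA thresholds are illustrative):

```yaml
# Deployment without spec.replicas: the HPA owns scaling, FluxCD sees no drift
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
  namespace: atp-production
spec:
  # replicas intentionally omitted
  selector:
    matchLabels:
      app: atp-ingestion
  template:
    metadata:
      labels:
        app: atp-ingestion
    spec:
      containers:
      - name: atp-ingestion
        image: connectsoft.azurecr.io/atp/ingestion:v1.2.3
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: atp-ingestion
  namespace: atp-production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: atp-ingestion
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```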

Resource Annotation for Drift Ignore:

# apps/atp-ingestion/base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
  annotations:
    kustomize.toolkit.fluxcd.io/reconcile: disabled  # Flux skips this resource
spec:
  replicas: 3  # May be modified by HPA, FluxCD won't revert

Drift Detection Frequency

Drift Detection Intervals:

| Component | Detection Method | Frequency |
|---|---|---|
| GitRepository | Polls Git for changes | Every interval (default: 1m) |
| Kustomization | Compares Git state to cluster | Every interval (default: 10m) |
| Manual Trigger | Immediate comparison | On-demand via flux reconcile |

Optimized Drift Detection:

# Production: Less frequent drift checks
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-production
spec:
  interval: 10m  # Check for drift every 10 minutes
  path: ./apps
  sourceRef:
    kind: GitRepository
    name: atp-gitops

# Dev: More frequent drift checks
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-dev
spec:
  interval: 1m  # Check for drift every minute
  path: ./apps
  sourceRef:
    kind: GitRepository
    name: atp-gitops

Alerting on Drift

Drift Alert Configuration:

# alerts/drift-detection.yaml
apiVersion: notification.toolkit.fluxcd.io/v1
kind: Alert
metadata:
  name: drift-detection
  namespace: flux-system
spec:
  providerRef:
    name: azure-monitor
  eventSeverity: warning
  eventSources:
  - kind: Kustomization
    name: apps-production
    namespace: flux-system
  exclusionList:
  - ".* is ready"
  - ".*applied revision.*"

Drift Alert with Notification:

# Notification provider for Azure Monitor
apiVersion: notification.toolkit.fluxcd.io/v1
kind: Provider
metadata:
  name: azure-monitor
  namespace: flux-system
spec:
  type: generic
  address: https://api.loganalytics.io/v1/workspaces/{workspaceId}/events
  secretRef:
    name: azure-monitor-credentials

Query Drift Alerts:

// Query FluxCD drift alerts from Log Analytics
FluxCDEvents
| where EventType == "Warning"
| where Message contains "drift detected"
| project TimeGenerated, Kustomization, Message, Namespace
| order by TimeGenerated desc

Self-Healing Configuration

Automatic Revert of Manual Changes

Self-Healing Enabled (Default):

# Self-healing is enabled by default
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: atp-ingestion
spec:
  interval: 5m
  path: ./apps/atp-ingestion
  prune: true
  # Self-healing: automatically reverts manual changes
  sourceRef:
    kind: GitRepository
    name: atp-gitops

Self-Healing Behavior:

# 1. Manual change
kubectl patch deployment atp-ingestion -n atp-production \
  -p '{"spec":{"replicas":5}}'

# 2. FluxCD detects drift (within 5 minutes)
# 3. FluxCD reverts to Git state (replicas: 3)
# 4. Deployment restored to desired state

Disable Self-Healing for Specific Resource:

# apps/atp-ingestion/base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
  annotations:
    # Disable reconciliation (and thus self-healing) for this resource
    kustomize.toolkit.fluxcd.io/reconcile: disabled
spec:
  replicas: 3

Self-Heal Enable/Disable per Environment

Environment-Specific Self-Healing:

| Environment | Self-Healing | Rationale |
|---|---|---|
| Dev | ✅ Enabled | Fast feedback, automatic correction |
| Test | ✅ Enabled | Validate self-healing behavior |
| Staging | ⚠️ Selective | Enable for critical resources only |
| Production | ⚠️ Selective | Manual intervention preferred for critical changes |

Production: Selective Self-Healing:

# clusters/production/kustomization-production.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-production
spec:
  interval: 10m
  path: ./apps
  prune: false  # Manual pruning in production
  # Self-healing enabled, but prune disabled for safety
  sourceRef:
    kind: GitRepository
    name: atp-gitops
  healthChecks:
  - apiVersion: apps/v1
    kind: Deployment
    name: atp-gateway
    namespace: atp-production
  # Self-healing reverts manual changes to Gateway

Disable Self-Healing Globally:

# Suspend Kustomization (disables all reconciliation including self-healing)
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-production
spec:
  suspend: true  # Temporarily disable all reconciliation
  interval: 10m
  path: ./apps

Force Recreation of Resources

Force Recreate on Drift:

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: atp-ingestion
spec:
  interval: 5m
  path: ./apps/atp-ingestion
  force: true  # Recreate resources when in-place apply fails (immutable fields)
  sourceRef:
    kind: GitRepository
    name: atp-gitops

Force Reconciliation via Annotation:

# Request an immediate reconciliation of the Flux Kustomization
kubectl annotate kustomization atp-ingestion -n flux-system \
  reconcile.fluxcd.io/requestedAt="$(date +%s)" \
  --overwrite

# Equivalent to: flux reconcile kustomization atp-ingestion

Preserving Stateful Resources

Protect Stateful Resources from Self-Healing:

# apps/atp-query/base/statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: atp-query
  annotations:
    # Preserve manual changes: exclude this StatefulSet from reconciliation
    kustomize.toolkit.fluxcd.io/reconcile: disabled
spec:
  replicas: 3
  # ... other spec

Protect PVCs from Pruning:

# Pruning is opted out per resource via the prune: disabled label;
# the PVC below is illustrative
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: atp-query-data
  namespace: atp-production
  labels:
    kustomize.toolkit.fluxcd.io/prune: disabled  # Never garbage-collected
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 100Gi

Health Assessment

Built-in Health Checks (Deployment, StatefulSet)

Deployment Health Check:

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: atp-ingestion
spec:
  interval: 5m
  path: ./apps/atp-ingestion
  wait: true  # Wait for health checks to pass
  timeout: 15m
  sourceRef:
    kind: GitRepository
    name: atp-gitops
  healthChecks:
  - apiVersion: apps/v1
    kind: Deployment
    name: atp-ingestion
    namespace: atp-production
    # FluxCD checks:
    # - Deployment status.availableReplicas == spec.replicas
    # - All pods are Ready

StatefulSet Health Check:

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: atp-query
spec:
  interval: 5m
  path: ./apps/atp-query
  wait: true
  timeout: 20m  # Longer timeout for StatefulSet
  sourceRef:
    kind: GitRepository
    name: atp-gitops
  healthChecks:
  - apiVersion: apps/v1
    kind: StatefulSet
    name: atp-query
    namespace: atp-production
    # FluxCD checks:
    # - StatefulSet status.readyReplicas == spec.replicas
    # - All pods are Ready and in correct order

Multiple Health Checks:

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: atp-gateway
spec:
  interval: 5m
  path: ./apps/atp-gateway
  wait: true
  timeout: 15m
  sourceRef:
    kind: GitRepository
    name: atp-gitops
  healthChecks:
  # Deployment health check
  - apiVersion: apps/v1
    kind: Deployment
    name: atp-gateway
    namespace: atp-production
  # Service health check
  - apiVersion: v1
    kind: Service
    name: atp-gateway
    namespace: atp-production
  # Ingress health check
  - apiVersion: networking.k8s.io/v1
    kind: Ingress
    name: atp-gateway
    namespace: atp-production

Custom Health Checks

Custom Health Check with CRD:

# Custom health check using custom resource
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: atp-ingestion
spec:
  interval: 5m
  path: ./apps/atp-ingestion
  wait: true
  timeout: 15m
  sourceRef:
    kind: GitRepository
    name: atp-gitops
  healthChecks:
  # Custom resources can be health-assessed when their status follows kstatus
  # conventions (a Ready condition plus observedGeneration); the apiVersion
  # and kind below are illustrative
  - apiVersion: custom.health.check/v1
    kind: HealthCheck
    name: atp-ingestion-health
    namespace: atp-production

Health Check Status:

# Check health check status
kubectl get kustomization atp-ingestion -n flux-system -o jsonpath='{.status.conditions}'

# Output:
# [
#   {
#     "type": "Ready",
#     "status": "True",
#     "reason": "HealthCheckPassed",
#     "message": "all health checks passed",
#     "lastTransitionTime": "2024-01-15T10:00:00Z"
#   },
#   {
#     "type": "Healthy",
#     "status": "True",
#     "reason": "AllHealthChecksPassed",
#     "message": "Deployment/atp-ingestion is healthy"
#   }
# ]

Readiness Gates

Readiness Gate Configuration:

# apps/atp-ingestion/base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  replicas: 3
  template:
    spec:
      readinessGates:
      # Condition types are illustrative; each must be set on the Pod
      # by an external controller before the Pod counts as Ready
      - conditionType: PodHasNetwork
      - conditionType: PodHasStorage
      containers:
      - name: atp-ingestion
        image: connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5

FluxCD Health Check with Readiness Gates:

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: atp-ingestion
spec:
  interval: 5m
  path: ./apps/atp-ingestion
  wait: true
  timeout: 20m  # Longer timeout if readiness gates present
  sourceRef:
    kind: GitRepository
    name: atp-gitops
  healthChecks:
  - apiVersion: apps/v1
    kind: Deployment
    name: atp-ingestion
    namespace: atp-production
    # FluxCD waits for:
    # - Deployment ready
    # - All pods Ready
    # - All readiness gates conditions met

Timeout and Failure Thresholds

Health Check Timeout:

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: atp-ingestion
spec:
  interval: 5m
  path: ./apps/atp-ingestion
  wait: true
  timeout: 15m  # Max time to wait for health checks
  retryInterval: 2m  # Retry failed health checks every 2 minutes
  sourceRef:
    kind: GitRepository
    name: atp-gitops
  healthChecks:
  - apiVersion: apps/v1
    kind: Deployment
    name: atp-ingestion
    namespace: atp-production

Health Check Failure Handling:

| Scenario | Behavior | Action |
|---|---|---|
| Health check passes | Reconciliation succeeds | Continue normal operation |
| Health check fails (transient) | Retry up to timeout | Retry every retryInterval |
| Health check fails (permanent) | Reconciliation marked failed | Alert and manual intervention |
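When a permanent failure needs investigation, one option is to temporarily stop FluxCD from waiting on health checks (or pause reconciliation altogether) while the failing resource is fixed; a sketch using standard flux/kubectl commands (the Kustomization name is illustrative):

```shell
# Stop waiting on health checks while triaging
kubectl patch kustomization atp-ingestion -n flux-system \
  --type merge -p '{"spec":{"wait":false}}'

# Or pause reconciliation entirely
flux suspend kustomization atp-ingestion

# Restore the original behavior once the resource is healthy again
kubectl patch kustomization atp-ingestion -n flux-system \
  --type merge -p '{"spec":{"wait":true}}'
flux resume kustomization atp-ingestion
```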

Check Health Check Status:

# View health check failures
kubectl describe kustomization atp-ingestion -n flux-system

# Events show:
# Warning  ReconciliationFailed  health check failed: 
#   Deployment/atp-ingestion not ready: 2/3 replicas available

Prune Policies

Automatic Cleanup of Deleted Resources

Prune Enabled:

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-dev
spec:
  interval: 1m
  path: ./apps
  prune: true  # Auto-delete resources removed from Git
  sourceRef:
    kind: GitRepository
    name: atp-gitops

Prune Behavior:

# 1. Resource exists in Git and cluster
# apps/atp-old-service/deployment.yaml

# 2. Delete resource from Git
rm apps/atp-old-service/deployment.yaml
git commit -m "Remove old service"
git push

# 3. FluxCD detects resource removed from Git
# 4. FluxCD deletes resource from cluster (prune enabled)
# 5. Resource removed from cluster
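Before deleting manifests, the set of objects FluxCD currently manages (the prune candidates) can be inspected with the flux CLI; the Kustomization name below is illustrative:

```shell
# List every object this Kustomization owns; anything later removed
# from Git becomes a prune candidate
flux tree kustomization apps-dev -n flux-system
```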

Prune Safety (PVC, PV Protection)

Prune Safety Model:

The Kustomization API has no label allow-list for pruning; garbage-collection
exemptions are declared on the resources themselves with the
kustomize.toolkit.fluxcd.io/prune annotation, or pruning is disabled for the
whole Kustomization.

Protect PVCs from Pruning:

# apps/atp-query/base/pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: atp-query-data
  annotations:
    kustomize.toolkit.fluxcd.io/prune: disabled  # Excluded from garbage collection
  labels:
    app: atp-query
    persistent: "true"
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi

Prune Disabled at the Kustomization Level:

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-production
spec:
  interval: 10m
  path: ./apps
  prune: false  # Never garbage-collect; removals from Git require manual cleanup
  sourceRef:
    kind: GitRepository
    name: atp-gitops

Selective Pruning

Selective Prune by Namespace:

# targetNamespace sets the namespace for all namespaced objects in this
# Kustomization; pruning itself is always scoped to the Kustomization's
# own inventory, never to unrelated cluster resources
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-production
spec:
  interval: 10m
  path: ./apps
  prune: true
  targetNamespace: atp-production
  sourceRef:
    kind: GitRepository
    name: atp-gitops

Selective Prune by Resource Type:

There is no Kustomization field that excludes whole resource kinds from
pruning; instead, annotate the kinds that must survive (for example PVCs
and PVs) directly:

# apps/atp-query/base/pvc.yaml (excerpt)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: atp-query-data
  annotations:
    kustomize.toolkit.fluxcd.io/prune: disabled

Prune Validation

Dry-Run Prune:

# Preview changes, including resources that would be pruned
flux diff kustomization apps-production --path ./apps

# Output lists resources that would be created, changed, or deleted

Prune Status:

# Check prune status
kubectl get kustomization apps-production -n flux-system -o jsonpath='{.status.inventory}'

# Output:
# {
#   "entries": [
#     {"id": "apps_v1_Deployment_atp-production_atp-ingestion", "v": "v1"},
#     {"id": "v1_Service_atp-production_atp-ingestion", "v": "v1"}
#   ]
# }

# Resources not in inventory but in cluster will be pruned

Sync Ordering and Dependencies

Depends-On in Kustomization

Dependency Chain:

# 1. Infrastructure (base)
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infrastructure
  namespace: flux-system
spec:
  interval: 5m
  path: ./infrastructure
  sourceRef:
    kind: GitRepository
    name: atp-gitops
  # No dependencies

---
# 2. Secrets (depends on infrastructure)
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: secrets
  namespace: flux-system
spec:
  interval: 5m
  path: ./secrets
  sourceRef:
    kind: GitRepository
    name: atp-gitops
  dependsOn:
  - name: infrastructure  # Wait for infrastructure first

---
# 3. Apps (depends on infrastructure and secrets)
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-production
  namespace: flux-system
spec:
  interval: 10m
  path: ./apps
  sourceRef:
    kind: GitRepository
    name: atp-gitops
  dependsOn:
  - name: infrastructure
  - name: secrets  # Wait for both

Dependency Graph:

graph TD
    A[Infrastructure] --> B[Secrets]
    A --> C[Apps]
    B --> C
    C --> D[atp-ingestion]
    C --> E[atp-query]
    C --> F[atp-gateway]

    style A fill:#90EE90
    style B fill:#FFE5B4
    style C fill:#FFE5B4

Ordering Infrastructure Before Apps

Infrastructure First:

# Infrastructure Kustomization (no dependencies)
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infrastructure
  namespace: flux-system
spec:
  interval: 5m
  path: ./infrastructure
  prune: false  # Don't auto-prune infrastructure
  sourceRef:
    kind: GitRepository
    name: atp-gitops

Apps After Infrastructure:

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-production
  namespace: flux-system
spec:
  interval: 10m
  path: ./apps
  sourceRef:
    kind: GitRepository
    name: atp-gitops
  dependsOn:
  - name: infrastructure  # Ensure infrastructure ready first

Cross-Resource Dependencies

Service Dependencies:

# atp-query depends on atp-ingestion
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: atp-query
  namespace: flux-system
spec:
  interval: 5m
  path: ./apps/atp-query
  sourceRef:
    kind: GitRepository
    name: atp-gitops
  dependsOn:
  - name: atp-ingestion  # Wait for ingestion service

Cross-Namespace Dependencies:

# Apps in production namespace depend on monitoring in monitoring namespace
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-production
  namespace: flux-system
spec:
  interval: 10m
  path: ./apps
  sourceRef:
    kind: GitRepository
    name: atp-gitops
  dependsOn:
  - name: monitoring  # Wait for monitoring stack
  healthChecks:
  - apiVersion: v1
    kind: Service
    name: prometheus
    namespace: monitoring  # Cross-namespace health check

Wait for Readiness

Wait for Dependencies to be Ready:

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-production
  namespace: flux-system
spec:
  interval: 10m
  path: ./apps
  wait: true  # Wait for resources to be ready
  timeout: 20m
  sourceRef:
    kind: GitRepository
    name: atp-gitops
  dependsOn:
  - name: infrastructure
  # Waits for:
  # 1. Infrastructure Kustomization to be ready
  # 2. All health checks in infrastructure to pass
  # 3. Then proceeds with apps reconciliation

Dependency Readiness Check:

# Check dependency status
flux get kustomizations apps-production

# Shows dependency status:
# NAME              READY   MESSAGE                         REVISION
# infrastructure    True    Applied revision: abc123        abc123
# apps-production   True    Applied revision: def456        def456

# If dependency not ready:
# apps-production   False   dependency 'infrastructure' is not ready

Notification Controller

Sending Alerts to Azure Monitor

Azure Monitor Provider:

# notification/provider-azure-monitor.yaml
apiVersion: notification.toolkit.fluxcd.io/v1
kind: Provider
metadata:
  name: azure-monitor
  namespace: flux-system
spec:
  type: generic
  # FluxCD has no dedicated Azure Monitor provider type; events are POSTed as
  # JSON to a webhook, so front the Log Analytics ingestion API with a small
  # relay (an Azure Function or Logic App). The address below is illustrative.
  address: https://api.connectsoft.example/fluxcd/azure-monitor-relay
  secretRef:
    name: azure-monitor-credentials
---
# Secret with relay credentials
apiVersion: v1
kind: Secret
metadata:
  name: azure-monitor-credentials
  namespace: flux-system
type: Opaque
stringData:
  token: "{relay-token}"  # Shared secret for the relay endpoint

Alert Configuration:

# notification/alert-reconciliation.yaml
apiVersion: notification.toolkit.fluxcd.io/v1
kind: Alert
metadata:
  name: reconciliation-alerts
  namespace: flux-system
spec:
  providerRef:
    name: azure-monitor
  eventSeverity: info
  eventSources:
  - kind: Kustomization
    name: apps-production
    namespace: flux-system
  - kind: GitRepository
    name: atp-gitops
    namespace: flux-system

Slack/Teams Integration

Slack Provider:

# notification/provider-slack.yaml
apiVersion: notification.toolkit.fluxcd.io/v1
kind: Provider
metadata:
  name: slack
  namespace: flux-system
spec:
  type: slack
  channel: "#atp-alerts"
  username: fluxcd
  secretRef:
    name: slack-credentials
---
apiVersion: v1
kind: Secret
metadata:
  name: slack-credentials
  namespace: flux-system
type: Opaque
stringData:
  address: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"

Teams Provider:

# notification/provider-teams.yaml
apiVersion: notification.toolkit.fluxcd.io/v1
kind: Provider
metadata:
  name: teams
  namespace: flux-system
spec:
  type: msteams  # Native Microsoft Teams incoming-webhook provider
  address: "https://outlook.office.com/webhook/YOUR/WEBHOOK/URL"
  secretRef:
    name: teams-credentials

Alert for Slack/Teams:

apiVersion: notification.toolkit.fluxcd.io/v1
kind: Alert
metadata:
  name: production-alerts
  namespace: flux-system
spec:
  providerRef:
    name: slack  # or teams
  eventSeverity: error
  eventSources:
  - kind: Kustomization
    name: apps-production
    namespace: flux-system
  exclusionList:
  - ".* is ready"
  - ".*applied revision.*"
  summary: "Production deployment {{ .InvolvedObject.kind }} {{ .InvolvedObject.name }}"

Email Notifications

Email Provider:

The notification-controller does not speak SMTP directly; the generic provider
POSTs events over HTTPS, so email delivery goes through a relay (for example an
Azure Logic App or Function that forwards the payload to Office 365). The relay
address below is illustrative.

# notification/provider-email.yaml
apiVersion: notification.toolkit.fluxcd.io/v1
kind: Provider
metadata:
  name: email
  namespace: flux-system
spec:
  type: generic
  address: "https://api.connectsoft.example/fluxcd/email-relay"
  secretRef:
    name: email-credentials
---
apiVersion: v1
kind: Secret
metadata:
  name: email-credentials
  namespace: flux-system
type: Opaque
stringData:
  token: "{relay-token}"  # Bearer token for the relay endpoint

Email Alert:

apiVersion: notification.toolkit.fluxcd.io/v1
kind: Alert
metadata:
  name: critical-alerts-email
  namespace: flux-system
spec:
  providerRef:
    name: email
  eventSeverity: error
  eventSources:
  - kind: Kustomization
    name: apps-production
    namespace: flux-system
  # Only send critical errors via email

Custom Webhooks

Webhook Provider:

# notification/provider-webhook.yaml
apiVersion: notification.toolkit.fluxcd.io/v1
kind: Provider
metadata:
  name: custom-webhook
  namespace: flux-system
spec:
  type: generic
  address: "https://api.connectsoft.example/fluxcd/webhook"
  secretRef:
    name: webhook-credentials
---
apiVersion: v1
kind: Secret
metadata:
  name: webhook-credentials
  namespace: flux-system
type: Opaque
stringData:
  token: "{webhook-token}"

Webhook Alert:

apiVersion: notification.toolkit.fluxcd.io/v1
kind: Alert
metadata:
  name: webhook-alerts
  namespace: flux-system
spec:
  providerRef:
    name: custom-webhook
  eventSeverity: info
  eventSources:
  - kind: Kustomization
    name: apps-production
    namespace: flux-system

Handling Stuck Reconciliations

Identifying Stuck Reconciliations

Check Reconciliation Status:

# Check if Kustomization is stuck
flux get kustomizations apps-production

# Stuck indicators:
# - READY: False for extended period
# - MESSAGE: Contains "error" or "failed"
# - No recent status updates

# Detailed status
kubectl describe kustomization apps-production -n flux-system

# Check events for stuck reconciliation
kubectl get events -n flux-system \
  --field-selector involvedObject.name=apps-production \
  --sort-by='.lastTimestamp'

Common Stuck Scenarios:

| Scenario | Symptom | Resolution |
|---|---|---|
| Git fetch error | MESSAGE: git fetch failed | Check Git credentials, network |
| Apply timeout | MESSAGE: apply timeout | Increase timeout, check resource complexity |
| Health check failure | MESSAGE: health check failed | Fix failing resource, disable health check |
| Dependency stuck | MESSAGE: dependency not ready | Resolve dependency issue |

Suspending and Resuming

Suspend Reconciliation:

# Suspend to stop reconciliation
flux suspend kustomization apps-production

# Or via kubectl
kubectl patch kustomization apps-production -n flux-system \
  -p '{"spec":{"suspend":true}}'

Resume Reconciliation:

# Resume reconciliation
flux resume kustomization apps-production

# Or via kubectl
kubectl patch kustomization apps-production -n flux-system \
  -p '{"spec":{"suspend":false}}'

Suspend via Spec Field:

# suspend is a spec field (there is no FluxCD suspend annotation)
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-production
spec:
  suspend: true  # Reconciliation paused until set back to false
  interval: 10m
  path: ./apps
  sourceRef:
    kind: GitRepository
    name: atp-gitops
Force Reconciliation

Force Reconciliation:

# Force immediate reconciliation
flux reconcile kustomization apps-production --with-source

# Force with source update
flux reconcile source git atp-gitops
flux reconcile kustomization apps-production

# Force all reconciliations (the CLI reconciles one object at a time)
flux get kustomizations --no-header | awk '{print $1}' \
  | xargs -n1 flux reconcile kustomization

Force via Annotation:

# Add annotation to force reconciliation
kubectl annotate kustomization apps-production -n flux-system \
  reconcile.fluxcd.io/requestedAt="$(date +%s)" \
  --overwrite

Debugging Techniques

Enable Verbose Logging:

# Check FluxCD controller logs
kubectl logs -n flux-system \
  -l app=kustomize-controller \
  --tail=100

# Follow logs in real-time
kubectl logs -n flux-system \
  -l app=kustomize-controller \
  -f

# Filter for specific Kustomization
kubectl logs -n flux-system \
  -l app=kustomize-controller \
  | grep "apps-production"

Debug Commands:

# Check GitRepository status
flux get source git atp-gitops

# Check Kustomization status
flux get kustomizations apps-production

# Check events
kubectl get events -n flux-system \
  --field-selector involvedObject.name=apps-production

# Check resource status
kubectl get deployment atp-ingestion -n atp-production -o yaml

Dry-Run Reconciliation:

# Preview reconciliation without applying
flux diff kustomization apps-production --path ./apps

# Output shows what would change in the cluster

Observability

FluxCD Metrics in Prometheus

Enable Metrics:

# FluxCD controllers expose Prometheus metrics on the http-prom port (8080)
# at /metrics; no Service is created by default, so scrape the pods directly

# PodMonitor for Prometheus Operator
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: fluxcd-kustomize-controller
  namespace: flux-system
spec:
  selector:
    matchLabels:
      app: kustomize-controller
  podMetricsEndpoints:
  - port: http-prom
    interval: 30s
    path: /metrics

Key Metrics:

| Metric | Description |
|---|---|
| gotk_reconcile_duration_seconds | Reconciliation duration histogram (labels: kind, name) |
| controller_runtime_reconcile_total | Total reconciliations, labeled by result |
| controller_runtime_reconcile_errors_total | Reconciliation errors |
| gotk_reconcile_duration_seconds{kind="GitRepository"} | Git fetch/reconcile duration (source-controller) |

Prometheus Query Examples:

# Reconciliation success rate
sum(rate(controller_runtime_reconcile_total{result="success"}[5m]))
/
sum(rate(controller_runtime_reconcile_total[5m]))

# Average reconciliation duration
sum(rate(gotk_reconcile_duration_seconds_sum[5m]))
/
sum(rate(gotk_reconcile_duration_seconds_count[5m]))

# Error rate
sum(rate(controller_runtime_reconcile_errors_total[5m]))

Grafana Dashboards

Grafana Dashboard JSON:

{
  "dashboard": {
    "title": "FluxCD Reconciliation",
    "panels": [
      {
        "title": "Reconciliation Success Rate",
        "targets": [{
          "expr": "sum(rate(controller_runtime_reconcile_total{result=\"success\"}[5m])) / sum(rate(controller_runtime_reconcile_total[5m]))"
        }]
      },
      {
        "title": "Reconciliation Duration (avg)",
        "targets": [{
          "expr": "sum(rate(gotk_reconcile_duration_seconds_sum[5m])) / sum(rate(gotk_reconcile_duration_seconds_count[5m]))"
        }]
      },
      {
        "title": "Reconciliation Errors",
        "targets": [{
          "expr": "sum(rate(controller_runtime_reconcile_errors_total[5m]))"
        }]
      }
    ]
  }
}

Log Forwarding to Log Analytics

Fluent Bit Configuration:

# platform/logging/fluent-bit-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging
data:
  fluent-bit.conf: |
    [INPUT]
        Name tail
        Path /var/log/containers/kustomize-controller-*.log
        Parser docker
        Tag kube.fluxcd.*
        Refresh_Interval 5

    [FILTER]
        Name kubernetes
        Match kube.fluxcd.*
        Merge_Log On

    [OUTPUT]
        Name azure
        Match kube.fluxcd.*
        Customer_ID {workspace-id}
        Shared_Key {workspace-key}
        Log_Type FluxCD

Log Analytics Query:

// Query FluxCD logs (Log_Type "FluxCD" lands in the FluxCD_CL custom table)
FluxCD_CL
| where ContainerName_s contains "kustomize-controller"
| where LogMessage_s contains "reconciliation"
| project TimeGenerated, ContainerName_s, LogMessage_s
| order by TimeGenerated desc

Reconciliation Duration and Success Rate

Success Rate Monitoring:

# Overall success rate
sum(rate(controller_runtime_reconcile_total{result="success"}[5m]))
/
sum(rate(controller_runtime_reconcile_total[5m]))

# Per-controller success rate
sum by (controller) (
  rate(controller_runtime_reconcile_total{result="success"}[5m])
)
/
sum by (controller) (
  rate(controller_runtime_reconcile_total[5m])
)

Duration Monitoring:

# Average reconciliation duration
sum(rate(gotk_reconcile_duration_seconds_sum[5m]))
/
sum(rate(gotk_reconcile_duration_seconds_count[5m]))

# P95 reconciliation duration
histogram_quantile(0.95,
  sum by (le) (rate(gotk_reconcile_duration_seconds_bucket[5m]))
)

# Per-object duration (kind/name labels set by the Flux controllers)
sum by (kind, name) (rate(gotk_reconcile_duration_seconds_sum[5m]))
/
sum by (kind, name) (rate(gotk_reconcile_duration_seconds_count[5m]))

Alert on High Error Rate:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: fluxcd-reconciliation-alerts
  namespace: monitoring
spec:
  groups:
  - name: fluxcd
    rules:
    - alert: FluxCDHighErrorRate
      expr: |
        sum(rate(controller_runtime_reconcile_errors_total[5m])) > 0.1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "FluxCD reconciliation error rate is high"
        description: "{{ $value }} errors per second"

    - alert: FluxCDReconciliationSlow
      expr: |
        sum(rate(gotk_reconcile_duration_seconds_sum[5m]))
        / sum(rate(gotk_reconcile_duration_seconds_count[5m])) > 300
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "FluxCD reconciliations are taking longer than expected"
        description: "Average duration: {{ $value }}s"

Summary: FluxCD Continuous Reconciliation

  • Reconciliation Loop: Polling intervals, reconciliation triggers, retry strategies and backoff
  • Automated Sync Policies: Auto-sync for dev/test, manual sync for staging/production, sync options, per-resource sync configuration
  • Drift Detection: Comparing Git state to live cluster, drift types, detection frequency, alerting on drift
  • Self-Healing: Automatic revert of manual changes, enable/disable per environment, force recreation, preserving stateful resources
  • Health Assessment: Built-in health checks, custom health checks, readiness gates, timeout and failure thresholds
  • Prune Policies: Automatic cleanup, prune safety (PVC/PV protection), selective pruning, prune validation
  • Sync Ordering: depends-on in Kustomization, infrastructure before apps, cross-resource dependencies, wait for readiness
  • Notification Controller: Azure Monitor alerts, Slack/Teams integration, email notifications, custom webhooks
  • Handling Stuck Reconciliations: Identifying stuck reconciliations, suspending and resuming, force reconciliation, debugging techniques
  • Observability: FluxCD metrics in Prometheus, Grafana dashboards, log forwarding to Log Analytics, reconciliation duration and success rate monitoring

Multi-Environment AKS Deployment

Purpose: Define how ATP is deployed across multiple environments (dev, test, staging, production) using separate AKS clusters, environment-specific configurations, Kustomize overlays, Helm values, and FluxCD per-environment reconciliation, ensuring proper isolation, resource management, and multi-region high availability.


Environment-Specific AKS Clusters

Separate Clusters per Environment Rationale

Multi-Cluster Architecture:

graph TB
    subgraph "Production Subscription"
        PROD[Production AKS<br/>East US]
        STAGING[Staging AKS<br/>East US]
    end
    subgraph "Non-Prod Subscription"
        TEST[Test AKS<br/>East US]
        DEV[Dev AKS<br/>East US]
    end
    subgraph "Production Subscription - DR"
        PROD_DR[Production AKS<br/>West Europe]
    end

    PROD -->|Traffic| PROD_DR
    STAGING -->|Validate| PROD

    style PROD fill:#90EE90
    style PROD_DR fill:#90EE90
    style STAGING fill:#FFE5B4
    style TEST fill:#FFE5B4
    style DEV fill:#FFE5B4

Rationale for Separate Clusters:

| Aspect | Separate Clusters | Shared Cluster | ATP Decision |
|---|---|---|---|
| Isolation | ✅ Complete isolation | ⚠️ Namespace-level only | Separate Clusters |
| Security | ✅ Environment boundaries | ⚠️ Shared RBAC/network | Separate Clusters |
| Resource Management | ✅ Independent scaling | ⚠️ Shared resources | Separate Clusters |
| Cost | ❌ Higher (multiple clusters) | ✅ Lower (single cluster) | Separate Clusters (security/compliance priority) |
| Operational Complexity | ⚠️ More clusters to manage | ✅ Simpler | Separate Clusters (acceptable trade-off) |
| Blast Radius | ✅ Isolated failures | ❌ Cross-environment impact | Separate Clusters |

ATP Selection: Separate Clusters

Reasons:

- ✅ Compliance: SOC 2 requires production isolation
- ✅ Security: No risk of dev/test workloads accessing production resources
- ✅ Resource Isolation: Production resources guaranteed, not shared
- ✅ Independent Scaling: Each environment scales independently
- ✅ Rollback Safety: Production cluster unaffected by dev/test issues

Cluster Sizing and SKU Selection

Environment-Specific Cluster Sizing:

| Environment | Node Pool SKU | Node Count | CPU/Memory per Node | Total Capacity | Rationale |
|---|---|---|---|---|---|
| Dev | Standard_D4s_v3 | 2-3 nodes | 4 vCPU / 16 GB | 8-12 vCPU / 32-48 GB | Minimal resources, cost-effective |
| Test | Standard_D4s_v3 | 3-5 nodes | 4 vCPU / 16 GB | 12-20 vCPU / 48-80 GB | Integration testing needs |
| Staging | Standard_D8s_v3 | 5-10 nodes | 8 vCPU / 32 GB | 40-80 vCPU / 160-320 GB | Production-like capacity |
| Production | Standard_D16s_v3 | 10-20 nodes | 16 vCPU / 64 GB | 160-320 vCPU / 640-1280 GB | High availability, performance |
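The sizing above maps directly onto AKS node pools; a sketch of adding the production user pool with the autoscaler bounds from the table, using the Azure CLI (resource-group and cluster names are illustrative):

```shell
# Add an autoscaling user node pool sized per the table above
az aks nodepool add \
  --resource-group atp-production-rg \
  --cluster-name atp-prod-eus-aks \
  --name user \
  --mode User \
  --node-vm-size Standard_D16s_v3 \
  --enable-cluster-autoscaler \
  --min-count 10 \
  --max-count 20
```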

Production Cluster Configuration:

// infrastructure/AKS-Production.cs
public class AKSProduction
{
    public ContainerService.KubernetesCluster Cluster { get; }

    public AKSProduction(string location) // e.g. "eastus"
    {
        this.Cluster = new ContainerService.KubernetesCluster("atp-prod-eus-aks", new()
        {
            ResourceGroupName = "atp-production-rg",
            Location = location, // "eastus"
            DnsPrefix = "atp-prod-eus",
            DefaultNodePool = new ContainerService.Inputs.KubernetesClusterDefaultNodePoolArgs
            {
                Name = "system",
                NodeCount = 3,
                VmSize = "Standard_D16s_v3",
                OsDiskSizeGb = 256,
                OsDiskType = "Ephemeral",
                Type = "VirtualMachineScaleSets",
                EnableAutoScaling = true,
                MinCount = 3,
                MaxCount = 5,
                MaxPods = 110,
                NodeTaints = new[]
                {
                    "CriticalAddonsOnly=true:NoSchedule"
                },
            },
            // User node pools for workloads
            NodeResourceGroup = "atp-prod-eus-aks-nodes",
            KubernetesVersion = "1.28.0",
            NetworkProfile = new ContainerService.Inputs.KubernetesClusterNetworkProfileArgs
            {
                NetworkPlugin = "azure",
                NetworkPolicy = "azure",
                ServiceCidr = "10.0.1.0/24",
                DnsServiceIp = "10.0.1.10",
                DockerBridgeCidr = "172.17.0.1/16",
            },
            Identity = new ContainerService.Inputs.KubernetesClusterIdentityArgs
            {
                Type = "UserAssigned",
                IdentityIds = new[] { managedIdentity.Id }, // user-assigned identity created elsewhere in the stack
            },
            AzurePolicyEnabled = true,
            HttpApplicationRoutingEnabled = false,
            RoleBasedAccessControlEnabled = true,
            AzureRbacEnabled = true,
            PrivateClusterEnabled = true,
            ApiServerAuthorizedIpRanges = new[]
            {
                "10.0.0.0/16", // VNet CIDR
            },
            Tags = new()
            {
                { "Environment", "production" },
                { "CostCenter", "ATP-Production" },
                { "Compliance", "SOC2" },
            },
        });
    }
}

Networking Configuration per Environment

Environment Network Isolation:

| Environment | VNet | Subnet | Private Endpoints | Network Policies | Rationale |
|---|---|---|---|---|---|
| Dev | atp-dev-vnet | atp-dev-subnet | ❌ Disabled | ⚠️ Baseline | Cost optimization |
| Test | atp-test-vnet | atp-test-subnet | ❌ Disabled | ✅ Enforced | Test network policies |
| Staging | atp-staging-vnet | atp-staging-subnet | ✅ Enabled | ✅ Enforced | Production-like |
| Production | atp-prod-vnet | atp-prod-subnet | ✅ Enabled | ✅ Enforced | Maximum security |
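In the environments where network policies are enforced, a default-deny baseline is the usual starting point; a sketch of such a policy for the production namespace (services then declare explicit allow rules, e.g. from the ingress controller):

```yaml
# Default-deny ingress for the production namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: atp-production
spec:
  podSelector: {}  # Applies to all pods in the namespace
  policyTypes:
  - Ingress
```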

Production Network Configuration:

// Production VNet with private endpoints
var prodVNet = new Network.VirtualNetwork("atp-prod-vnet", new()
{
    ResourceGroupName = "atp-production-rg",
    Location = "eastus",
    AddressSpace = new[] { "10.1.0.0/16" },
    Subnets = new[]
    {
        new Network.Inputs.SubnetArgs
        {
            Name = "atp-prod-aks-subnet",
            AddressPrefix = "10.1.1.0/24",
            PrivateEndpointNetworkPoliciesEnabled = true,
        },
        new Network.Inputs.SubnetArgs
        {
            Name = "atp-prod-private-endpoints",
            AddressPrefix = "10.1.2.0/24",
            PrivateEndpointNetworkPoliciesEnabled = false,
        },
    },
});

Subscription Strategy (Shared vs Dedicated)

ATP Subscription Strategy:

| Environment | Subscription | Rationale |
|---|---|---|
| Dev | ATP-NonProd | Cost optimization, shared resources |
| Test | ATP-NonProd | Cost optimization, shared resources |
| Staging | ATP-Production | Production-like isolation, compliance |
| Production (East US) | ATP-Production | Production isolation, compliance |
| Production (West Europe) | ATP-Production | DR region, same subscription |

Subscription Configuration:

# List subscriptions
az account list --output table

# Set production subscription
az account set --subscription "ATP-Production"

# Set non-production subscription
az account set --subscription "ATP-NonProd"

Regional Deployment Strategy

Primary Region: East US

Primary Region Configuration:

// Primary region: East US
var primaryRegion = new AKSCluster("atp-prod-eus-aks", new()
{
    Location = "eastus",
    ResourceGroupName = "atp-production-rg",
    Environment = "production",
    ClusterSku = "Standard",
    NodePools = new[]
    {
        new NodePoolConfig
        {
            Name = "system",
            VmSize = "Standard_D16s_v3",
            MinCount = 3,
            MaxCount = 5,
        },
        new NodePoolConfig
        {
            Name = "user",
            VmSize = "Standard_D16s_v3",
            MinCount = 10,
            MaxCount = 20,
        },
    },
});

Primary Region Resources:

- ✅ Production AKS cluster
- ✅ Azure SQL Database (Primary)
- ✅ Azure Redis Cache
- ✅ Azure Service Bus
- ✅ Azure Key Vault
- ✅ Azure Container Registry (geo-replicated)

Secondary Region: West Europe

Secondary Region (DR) Configuration:

// Secondary region: West Europe (DR)
var secondaryRegion = new AKSCluster("atp-prod-weu-aks", new()
{
    Location = "westeurope",
    ResourceGroupName = "atp-production-rg",
    Environment = "production",
    ClusterSku = "Standard",
    NodePools = new[]
    {
        new NodePoolConfig
        {
            Name = "system",
            VmSize = "Standard_D16s_v3",
            MinCount = 2,
            MaxCount = 3,
        },
        new NodePoolConfig
        {
            Name = "user",
            VmSize = "Standard_D16s_v3",
            MinCount = 5,
            MaxCount = 10,
        },
    },
});

Secondary Region Resources:

- ✅ Production AKS cluster (standby/DR)
- ✅ Azure SQL Database (Geo-replica)
- ✅ Azure Redis Cache (Geo-replica)
- ✅ Azure Service Bus (DR namespace)
- ✅ Azure Key Vault (Geo-replicated)
- ✅ Azure Container Registry (geo-replicated)

Multi-Region for Production (HA/DR)

Multi-Region Architecture:

graph TB
    subgraph "East US (Primary)"
        PROD_EUS[Production AKS<br/>East US]
        SQL_EUS[SQL Primary]
        REDIS_EUS[Redis Primary]
    end
    subgraph "West Europe (DR)"
        PROD_WEU[Production AKS<br/>West Europe<br/>Standby]
        SQL_WEU[SQL Geo-Replica]
        REDIS_WEU[Redis Geo-Replica]
    end
    subgraph "Traffic Management"
        FD[Azure Front Door]
    end

    FD -->|Primary| PROD_EUS
    FD -.->|Failover| PROD_WEU
    SQL_EUS -.->|Replication| SQL_WEU
    REDIS_EUS -.->|Replication| REDIS_WEU

    style PROD_EUS fill:#90EE90
    style PROD_WEU fill:#FFE5B4
    style FD fill:#FFE5B4

Multi-Region RTO/RPO Targets:

| Component | RTO | RPO | Strategy |
|---|---|---|---|
| AKS Cluster | 1 hour | 5 minutes | GitOps-based recreation |
| SQL Database | 5 minutes | < 1 minute | Active geo-replication |
| Redis Cache | 15 minutes | < 1 minute | Geo-replication |
| Application | 5 minutes | < 1 minute | Traffic failover via Front Door |
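These targets are what a DR drill gets measured against. A minimal sketch (component names and thresholds taken from the table above) checks an observed incident against them:

```python
# Check an observed incident against the RTO/RPO targets above.
# All durations in minutes; keys are illustrative component names.
TARGETS = {
    "aks":   {"rto": 60, "rpo": 5},
    "sql":   {"rto": 5,  "rpo": 1},
    "redis": {"rto": 15, "rpo": 1},
    "app":   {"rto": 5,  "rpo": 1},
}

def meets_targets(component, downtime_min, data_loss_min):
    t = TARGETS[component]
    return downtime_min <= t["rto"] and data_loss_min <= t["rpo"]

# SQL failover took 3 minutes with ~30s of data loss -> within targets.
print(meets_targets("sql", 3, 0.5))   # True
# AKS recreation took 90 minutes -> exceeds the 1-hour RTO.
print(meets_targets("aks", 90, 2))    # False
```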

Regional Failover Mechanisms

Azure Front Door Failover:

# infrastructure/azure-front-door.yaml
apiVersion: networking.azure.com/v1
kind: FrontDoor
metadata:
  name: atp-frontdoor
spec:
  backendPools:
  - name: primary-eus
    backends:
    - address: atp-prod-eus-aks.region.cloudapp.azure.com
      enabled: true
      priority: 1
      weight: 100
    healthProbe:
      path: /health
      protocol: Http
      interval: 30
  - name: secondary-weu
    backends:
    - address: atp-prod-weu-aks.region.cloudapp.azure.com
      enabled: true
      priority: 2
      weight: 0
    healthProbe:
      path: /health
      protocol: Http
      interval: 30
  routingRules:
  - name: failover-rule
    acceptedProtocols:
    - Http
    - Https
    patternsToMatch:
    - "/*"
    routeConfiguration:
      "@odata.type": "#Microsoft.Azure.FrontDoor.Models.FrontdoorForwardingConfiguration"
      forwardingProtocol: MatchRequest
      backendPool:
        id: /subscriptions/{subscriptionId}/resourceGroups/atp-production-rg/providers/Microsoft.Network/frontDoors/atp-frontdoor/backendPools/primary-eus

Kustomize Overlays Per Environment

Base Manifests (Common)

Base Structure:

apps/
├── atp-ingestion/
│   ├── base/
│   │   ├── deployment.yaml
│   │   ├── service.yaml
│   │   ├── configmap.yaml
│   │   └── kustomization.yaml
│   ├── overlays/
│   │   ├── dev/
│   │   │   └── kustomization.yaml
│   │   ├── test/
│   │   │   └── kustomization.yaml
│   │   ├── staging/
│   │   │   └── kustomization.yaml
│   │   └── production/
│   │       └── kustomization.yaml

Base Kustomization:

# apps/atp-ingestion/base/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: atp-production

resources:
- deployment.yaml
- service.yaml
- configmap.yaml

commonLabels:
  app: atp-ingestion
  managed-by: fluxcd

images:
- name: connectsoft.azurecr.io/atp/ingestion
  newTag: v1.2.3-abc123d

Dev Overlay (Minimal Resources, Debug Logging)

Dev Overlay Configuration:

# apps/atp-ingestion/overlays/dev/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: atp-dev

resources:
- ../../base

patchesStrategicMerge:
- deployment-patch.yaml
- configmap-patch.yaml

commonLabels:
  environment: dev

images:
- name: connectsoft.azurecr.io/atp/ingestion
  newTag: latest  # Dev uses latest images

Dev Deployment Patch:

# apps/atp-ingestion/overlays/dev/deployment-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  replicas: 1  # Single replica for dev
  template:
    spec:
      containers:
      - name: atp-ingestion
        resources:
          requests:
            cpu: 100m
            memory: 256Mi
          limits:
            cpu: 500m
            memory: 512Mi
        env:
        - name: ASPNETCORE_ENVIRONMENT
          value: "Development"
        - name: Logging__LogLevel__Default
          value: "Debug"
        - name: Logging__LogLevel__Microsoft
          value: "Debug"

Dev ConfigMap Patch:

# apps/atp-ingestion/overlays/dev/configmap-patch.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: atp-ingestion-config
data:
  telemetry__sampling: "100"  # 100% sampling in dev (".__" separator: ":" is invalid in ConfigMap keys)
  feature-flags__enable-debug-mode: "true"
  feature-flags__enable-profiling: "true"

Test Overlay (Moderate Resources, Integration Tests)

Test Overlay Configuration:

# apps/atp-ingestion/overlays/test/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: atp-test

resources:
- ../../base

patchesStrategicMerge:
- deployment-patch.yaml

commonLabels:
  environment: test

images:
- name: connectsoft.azurecr.io/atp/ingestion
  newTag: v1.2.3  # Test uses tagged releases

Test Deployment Patch:

# apps/atp-ingestion/overlays/test/deployment-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  replicas: 2  # Two replicas for test
  template:
    spec:
      containers:
      - name: atp-ingestion
        resources:
          requests:
            cpu: 200m
            memory: 512Mi
          limits:
            cpu: 1000m
            memory: 1Gi
        env:
        - name: ASPNETCORE_ENVIRONMENT
          value: "Test"
        - name: Logging__LogLevel__Default
          value: "Information"

Staging Overlay (Production-Like)

Staging Overlay Configuration:

# apps/atp-ingestion/overlays/staging/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: atp-staging

resources:
- ../../base

patchesStrategicMerge:
- deployment-patch.yaml

commonLabels:
  environment: staging

images:
- name: connectsoft.azurecr.io/atp/ingestion
  newTag: v1.2.3  # Staging uses production-ready tags

Staging Deployment Patch:

# apps/atp-ingestion/overlays/staging/deployment-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  replicas: 3  # Production-like replica count
  template:
    spec:
      containers:
      - name: atp-ingestion
        resources:
          requests:
            cpu: 500m
            memory: 1Gi
          limits:
            cpu: 2000m
            memory: 2Gi
        env:
        - name: ASPNETCORE_ENVIRONMENT
          value: "Staging"
        - name: Logging__LogLevel__Default
          value: "Warning"
        - name: telemetry__sampling  # ".NET config separator; ":" is invalid in env var names
          valueFrom:
            configMapKeyRef:
              name: atp-ingestion-config
              key: telemetry__sampling

Production Overlay (Optimized, Strict Policies)

Production Overlay Configuration:

# apps/atp-ingestion/overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: atp-production

resources:
- ../../base

patchesStrategicMerge:
- deployment-patch.yaml
- network-policy-patch.yaml

commonLabels:
  environment: production
  compliance: soc2

images:
- name: connectsoft.azurecr.io/atp/ingestion
  newTag: v1.2.3-abc123d  # Production uses immutable tags

Production Deployment Patch:

# apps/atp-ingestion/overlays/production/deployment-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  replicas: 5  # High availability
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
        seccompProfile:
          type: RuntimeDefault
      containers:
      - name: atp-ingestion
        resources:
          requests:
            cpu: 1000m
            memory: 2Gi
          limits:
            cpu: 2000m
            memory: 4Gi
        env:
        - name: ASPNETCORE_ENVIRONMENT
          value: "Production"
        - name: Logging__LogLevel__Default
          value: "Error"  # Minimal logging in production
        - name: telemetry__sampling
          value: "10"  # 10% sampling
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          capabilities:
            drop:
            - ALL
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5

Helm Values Files Per Environment

values-dev.yaml

Dev Helm Values:

# charts/atp-ingestion/values-dev.yaml
replicaCount: 1

image:
  repository: connectsoft.azurecr.io/atp/ingestion
  tag: latest
  pullPolicy: Always

serviceAccount:
  create: true
  annotations:
    azure.workload.identity/client-id: "{dev-workload-identity-id}"

resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi

autoscaling:
  enabled: false  # No autoscaling in dev

environment:
  name: Development
  logging:
    level: Debug
  telemetry:
    sampling: 100  # 100% sampling
  featureFlags:
    enableDebugMode: true
    enableProfiling: true

config:
  database:
    connectionString: "{dev-sql-connection-string}"
  redis:
    connectionString: "{dev-redis-connection-string}"

ingress:
  enabled: true
  className: nginx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-staging  # Staging certs in dev
  hosts:
  - host: atp-ingestion-dev.connectsoft.example
    paths:
    - path: /
      pathType: Prefix
  tls:
  - secretName: atp-ingestion-dev-tls
    hosts:
    - atp-ingestion-dev.connectsoft.example

values-test.yaml

Test Helm Values:

# charts/atp-ingestion/values-test.yaml
replicaCount: 2

image:
  repository: connectsoft.azurecr.io/atp/ingestion
  tag: v1.2.3
  pullPolicy: IfNotPresent

resources:
  requests:
    cpu: 200m
    memory: 512Mi
  limits:
    cpu: 1000m
    memory: 1Gi

autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 4
  targetCPUUtilizationPercentage: 70

environment:
  name: Test
  logging:
    level: Information
  telemetry:
    sampling: 50  # 50% sampling

config:
  database:
    connectionString: "{test-sql-connection-string}"

values-staging.yaml

Staging Helm Values:

# charts/atp-ingestion/values-staging.yaml
replicaCount: 3

image:
  repository: connectsoft.azurecr.io/atp/ingestion
  tag: v1.2.3
  pullPolicy: IfNotPresent

resources:
  requests:
    cpu: 500m
    memory: 1Gi
  limits:
    cpu: 2000m
    memory: 2Gi

autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 6
  targetCPUUtilizationPercentage: 70

environment:
  name: Staging
  logging:
    level: Warning
  telemetry:
    sampling: 25  # 25% sampling

config:
  database:
    connectionString: "{staging-sql-connection-string}"

values-production.yaml

Production Helm Values:

# charts/atp-ingestion/values-production.yaml
replicaCount: 5

image:
  repository: connectsoft.azurecr.io/atp/ingestion
  tag: v1.2.3-abc123d  # Immutable tag
  pullPolicy: IfNotPresent

serviceAccount:
  create: true
  annotations:
    azure.workload.identity/client-id: "{prod-workload-identity-id}"

resources:
  requests:
    cpu: 1000m
    memory: 2Gi
  limits:
    cpu: 2000m
    memory: 4Gi

autoscaling:
  enabled: true
  minReplicas: 5
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
  targetMemoryUtilizationPercentage: 80

podSecurityContext:
  runAsNonRoot: true
  runAsUser: 1000
  fsGroup: 1000
  seccompProfile:
    type: RuntimeDefault

securityContext:
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop:
    - ALL

environment:
  name: Production
  logging:
    level: Error  # Minimal logging
  telemetry:
    sampling: 10  # 10% sampling
  featureFlags:
    enableDebugMode: false
    enableProfiling: false

config:
  database:
    connectionString: "{prod-sql-connection-string}"
  redis:
    connectionString: "{prod-redis-connection-string}"

ingress:
  enabled: true
  className: nginx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/rate-limit: "100"
  hosts:
  - host: atp-ingestion.connectsoft.example
    paths:
    - path: /
      pathType: Prefix
  tls:
  - secretName: atp-ingestion-tls
    hosts:
    - atp-ingestion.connectsoft.example

networkPolicy:
  enabled: true
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: atp-gateway

podDisruptionBudget:
  enabled: true
  minAvailable: 3

Value Precedence and Overrides

Helm Value Precedence (Highest to Lowest):

  1. --set command-line flags
  2. values-production.yaml (or environment-specific)
  3. values.yaml (base/default values)
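Helm applies these layers by deep-merging maps, with later sources overriding earlier ones key-by-key (scalars and lists are replaced wholesale). A small sketch of that merge, using illustrative values mirroring this chart:

```python
# Simulate Helm's value layering: later sources override earlier ones,
# merging nested maps key-by-key.
def deep_merge(base, override):
    out = dict(base)
    for k, v in override.items():
        if isinstance(v, dict) and isinstance(out.get(k), dict):
            out[k] = deep_merge(out[k], v)
        else:
            out[k] = v
    return out

values_yaml = {"replicaCount": 2, "environment": {"logging": {"level": "Information"}}}
values_production = {"replicaCount": 5, "environment": {"logging": {"level": "Error"}}}
set_flags = {"replicaCount": 10}  # --set replicaCount=10

merged = deep_merge(deep_merge(values_yaml, values_production), set_flags)
print(merged["replicaCount"])                     # 10 (--set wins)
print(merged["environment"]["logging"]["level"])  # Error (env file beats base)
```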

Deploy with Environment-Specific Values:

# Deploy to dev
helm upgrade --install atp-ingestion ./charts/atp-ingestion \
  -f charts/atp-ingestion/values.yaml \
  -f charts/atp-ingestion/values-dev.yaml \
  -n atp-dev

# Deploy to production
helm upgrade --install atp-ingestion ./charts/atp-ingestion \
  -f charts/atp-ingestion/values.yaml \
  -f charts/atp-ingestion/values-production.yaml \
  -n atp-production

# Override specific value
helm upgrade --install atp-ingestion ./charts/atp-ingestion \
  -f charts/atp-ingestion/values-production.yaml \
  --set replicaCount=10 \
  -n atp-production

FluxCD Configuration Per Environment

GitRepository per Environment (Branch Targeting)

Dev GitRepository:

# clusters/dev/gitrepository.yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: atp-gitops-dev
  namespace: flux-system
spec:
  interval: 30s  # Fast polling for dev
  url: https://dev.azure.com/ConnectSoft/ATP/_git/atp-gitops
  ref:
    branch: dev  # Dev branch
  secretRef:
    name: gitops-credentials
  ignore: |
    # .gitignore-format: do not reconcile other environments' paths
    /production/
    /staging/
    /test/

Test GitRepository:

# clusters/test/gitrepository.yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: atp-gitops-test
  namespace: flux-system
spec:
  interval: 1m
  url: https://dev.azure.com/ConnectSoft/ATP/_git/atp-gitops
  ref:
    branch: test
  secretRef:
    name: gitops-credentials

Production GitRepository:

# clusters/production/gitrepository.yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: atp-gitops-production
  namespace: flux-system
spec:
  interval: 5m  # Slower polling for production
  url: https://dev.azure.com/ConnectSoft/ATP/_git/atp-gitops
  ref:
    branch: production  # Production branch
  secretRef:
    name: gitops-credentials
  ignore: |
    # .gitignore-format: do not reconcile other environments' paths
    /dev/
    /test/
    /staging/

Kustomization per Environment

Dev Kustomization:

# clusters/dev/kustomization-apps.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-dev
  namespace: flux-system
spec:
  interval: 1m
  path: ./apps
  prune: true  # Auto-prune in dev
  wait: false  # Don't wait for readiness in dev
  timeout: 5m
  sourceRef:
    kind: GitRepository
    name: atp-gitops-dev

Production Kustomization:

# clusters/production/kustomization-apps.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-production
  namespace: flux-system
spec:
  interval: 10m
  path: ./apps
  prune: false  # Manual pruning only in production
  wait: true  # Wait for readiness
  timeout: 20m
  retryInterval: 5m
  sourceRef:
    kind: GitRepository
    name: atp-gitops-production
  dependsOn:
  - name: infrastructure
  - name: secrets
  healthChecks:
  - apiVersion: apps/v1
    kind: Deployment
    name: atp-gateway
    namespace: atp-production

Sync Policies per Environment

Environment Sync Policy Matrix:

| Environment | Auto-Sync | Prune | Wait | Timeout | Manual Approval |
|---|---|---|---|---|---|
| Dev | ✅ Yes | ✅ Yes | ❌ No | 5m | ❌ No |
| Test | ✅ Yes | ✅ Yes | ✅ Yes | 10m | ❌ No |
| Staging | ⚠️ Selective | ❌ No | ✅ Yes | 15m | ✅ Yes (1 approver) |
| Production | ❌ No | ❌ No | ✅ Yes | 20m | ✅ Yes (2 approvers, CAB) |
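The approval column in this matrix is what the release gate enforces. A sketch of that gate, with the matrix encoded as data (staging's "selective" auto-sync is simplified here to manual):

```python
# Encode the sync-policy matrix as data and gate a deployment on it.
POLICIES = {
    "dev":        {"auto_sync": True,  "prune": True,  "approvers_required": 0},
    "test":       {"auto_sync": True,  "prune": True,  "approvers_required": 0},
    "staging":    {"auto_sync": False, "prune": False, "approvers_required": 1},
    "production": {"auto_sync": False, "prune": False, "approvers_required": 2},
}

def can_deploy(env: str, approvals: int) -> bool:
    return approvals >= POLICIES[env]["approvers_required"]

print(can_deploy("dev", 0))         # True: auto-sync, no approval needed
print(can_deploy("production", 1))  # False: production requires 2 approvers
```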

Environment-Specific Reconciliation Settings

Production Reconciliation Settings:

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-production
spec:
  interval: 10m
  retryInterval: 5m
  timeout: 20m
  suspend: false  # Reconciliation enabled
  path: ./apps
  prune: false  # Never auto-prune
  wait: true  # Wait for health checks
  sourceRef:
    kind: GitRepository
    name: atp-gitops-production
  force: false  # Never delete-and-recreate resources in production

Environment-Specific Configurations

Log Levels (Debug → Error)

Environment Log Levels:

| Environment | Default Level | Microsoft Level | Log Retention |
|---|---|---|---|
| Dev | Debug | Debug | 7 days |
| Test | Information | Information | 30 days |
| Staging | Warning | Warning | 90 days |
| Production | Error | Error | 365 days |
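The `Logging__LogLevel__*` keys resolve by category prefix: the longest matching category prefix wins, falling back to `Default`. A simplified sketch of that resolution (.NET matches on dotted-name prefixes; this version uses plain string prefixes):

```python
# Resolve the effective log level for a category the way
# Logging__LogLevel__* settings do: longest matching prefix wins.
def effective_level(category: str, levels: dict) -> str:
    best, best_len = levels.get("Default"), -1
    for prefix, level in levels.items():
        if prefix != "Default" and category.startswith(prefix) and len(prefix) > best_len:
            best, best_len = level, len(prefix)
    return best

prod = {"Default": "Error", "Microsoft": "Error"}
dev = {"Default": "Debug", "Microsoft": "Debug"}

print(effective_level("Microsoft.AspNetCore.Hosting", prod))  # Error ("Microsoft" matches)
print(effective_level("Atp.Ingestion.Pipeline", dev))         # Debug (falls back to Default)
```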

Log Level Configuration:

# apps/atp-ingestion/base/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: atp-ingestion-config
data:
  Logging__LogLevel__Default: "Information"  # Base level
  Logging__LogLevel__Microsoft: "Warning"
  Logging__LogLevel__System: "Error"
---
# Production overlay patch
apiVersion: v1
kind: ConfigMap
metadata:
  name: atp-ingestion-config
data:
  Logging__LogLevel__Default: "Error"  # Override for production
  Logging__LogLevel__Microsoft: "Error"

Telemetry Sampling (100% → 10%)

Telemetry Sampling Rates:

| Environment | Sampling Rate | Rationale |
|---|---|---|
| Dev | 100% | Full visibility for debugging |
| Test | 50% | Balance between visibility and cost |
| Staging | 25% | Production-like, reduced cost |
| Production | 10% | Cost optimization, sufficient insights |
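Percentage sampling should be deterministic on the trace/operation id so that every service in a call chain makes the same keep-or-drop decision (the idea behind Application Insights fixed-rate sampling). A hash-based sketch:

```python
# Deterministic percentage sampling: hash the operation id into a bucket so
# all services in a distributed trace agree on whether to keep it.
import hashlib

def sampled(operation_id: str, rate_percent: float) -> bool:
    digest = hashlib.sha256(operation_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10000  # 0..9999
    return bucket < rate_percent * 100

# At 10% sampling, roughly one in ten operations is kept, and the decision
# for a given operation id never changes.
kept = sum(sampled(f"op-{i}", 10) for i in range(10_000))
print(f"kept {kept} of 10000 at 10% sampling")
```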

Telemetry Configuration:

# Production telemetry settings
env:
- name: telemetry__sampling
  value: "10"  # 10% sampling
- name: APPLICATIONINSIGHTS_SAMPLING_PERCENTAGE
  value: "10"

Feature Flags per Environment

Feature Flag Configuration:

# Dev feature flags
featureFlags:
  enableDebugMode: true
  enableProfiling: true
  enableDetailedMetrics: true
  enableExperimentalFeatures: true

# Production feature flags
featureFlags:
  enableDebugMode: false
  enableProfiling: false
  enableDetailedMetrics: false
  enableExperimentalFeatures: false
  enableMaintenanceMode: false

Database Connection Strings

Environment-Specific Database Connections:

# Dev database
env:
- name: ConnectionStrings__DefaultConnection
  valueFrom:
    secretKeyRef:
      name: sql-connection-string
      key: connection-string
---
# ExternalSecret for dev
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: sql-connection-string
  namespace: atp-dev
spec:
  secretStoreRef:
    name: azure-keyvault-dev
    kind: ClusterSecretStore
  data:
  - secretKey: connectionString
    remoteRef:
      key: connection-strings/atp-ingestion/sql-connection-string

External Service Endpoints

Environment-Specific Endpoints:

# Dev endpoints
config:
  externalServices:
    paymentGateway: "https://api.stripe.com/test"
    emailService: "https://api.sendgrid.com/v3/test"
    storageAccount: "https://atpdevstorage.blob.core.windows.net"

# Production endpoints
config:
  externalServices:
    paymentGateway: "https://api.stripe.com"
    emailService: "https://api.sendgrid.com/v3"
    storageAccount: "https://atpprodstorage.blob.core.windows.net"

Resource Quotas and Limits

Namespace-Level Quotas

Dev Namespace Quota:

# platform/resource-quotas/dev-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: atp-dev-quota
  namespace: atp-dev
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    persistentvolumeclaims: "5"
    pods: "20"
    services: "10"

Production Namespace Quota:

# platform/resource-quotas/production-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: atp-production-quota
  namespace: atp-production
spec:
  hard:
    requests.cpu: "100"
    requests.memory: 200Gi
    limits.cpu: "200"
    limits.memory: 400Gi
    persistentvolumeclaims: "50"
    pods: "200"
    services: "50"

CPU and Memory Limits per Environment

Resource Limit Matrix:

| Environment | CPU Request | CPU Limit | Memory Request | Memory Limit |
|---|---|---|---|---|
| Dev | 100m | 500m | 256Mi | 512Mi |
| Test | 200m | 1000m | 512Mi | 1Gi |
| Staging | 500m | 2000m | 1Gi | 2Gi |
| Production | 1000m | 2000m | 2Gi | 4Gi |
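The per-pod requests above combine with the namespace quotas to bound how many pods can actually schedule. A quick sketch of that arithmetic (the binding constraint is whichever resource runs out first; the separate `pods` quota may cap it lower still):

```python
# How many pods at a given request size fit in a namespace ResourceQuota?
def pods_that_fit(quota_cpu_m, quota_mem_mi, req_cpu_m, req_mem_mi):
    return min(quota_cpu_m // req_cpu_m, quota_mem_mi // req_mem_mi)

# Production: quota 100 CPU / 200Gi, requests 1000m / 2Gi per pod.
print(pods_that_fit(100_000, 204_800, 1_000, 2_048))  # 100
# Dev: quota 4 CPU / 8Gi, requests 100m / 256Mi per pod.
print(pods_that_fit(4_000, 8_192, 100, 256))          # 32 by resources (pods quota caps dev at 20)
```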

Storage Quotas

Storage Quota per Environment:

# Production storage quota
apiVersion: v1
kind: ResourceQuota
metadata:
  name: atp-production-storage-quota
  namespace: atp-production
spec:
  hard:
    requests.storage: 500Gi
    persistentvolumeclaims: "50"

Pod Count Limits

Pod Count Limits:

| Environment | Max Pods | Rationale |
|---|---|---|
| Dev | 20 | Minimal footprint |
| Test | 50 | Integration testing needs |
| Staging | 100 | Production-like scale |
| Production | 200 | High availability, scale |

HPA Configuration Per Environment

Min/Max Replicas per Environment

HPA Configuration Matrix:

| Environment | Min Replicas | Max Replicas | Target CPU | Target Memory |
|---|---|---|---|---|
| Dev | 1 | 2 | 70% | 80% |
| Test | 2 | 4 | 70% | 80% |
| Staging | 3 | 6 | 70% | 80% |
| Production | 5 | 10 | 70% | 80% |
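The HPA derives its desired replica count from the documented core formula, `desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)`, clamped to the min/max bounds in this matrix:

```python
import math

# The HPA core formula, clamped to an environment's min/max replicas.
def desired_replicas(current, current_util, target_util, min_r, max_r):
    desired = math.ceil(current * current_util / target_util)
    return max(min_r, min(max_r, desired))

# Production (5..10 replicas, 70% CPU target):
print(desired_replicas(5, 95, 70, 5, 10))  # 7: CPU at 95% vs 70% target
print(desired_replicas(5, 30, 70, 5, 10))  # 5: result clamped at minReplicas
```

With multiple metrics (CPU and memory here), the HPA computes this per metric and takes the largest result.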

Production HPA:

# apps/atp-ingestion/overlays/production/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: atp-ingestion-hpa
  namespace: atp-production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: atp-ingestion
  minReplicas: 5
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60
      - type: Pods
        value: 2
        periodSeconds: 60
      selectPolicy: Max

Scaling Thresholds (CPU, Memory)

Scaling Thresholds:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: atp-ingestion-hpa
spec:
  metrics:
  # CPU-based scaling
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70  # Scale up when CPU > 70%
  # Memory-based scaling
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80  # Scale up when memory > 80%

Custom Metrics with KEDA

KEDA ScaledObject:

# apps/atp-ingestion/overlays/production/keda-scaler.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: atp-ingestion-scaler
  namespace: atp-production
spec:
  scaleTargetRef:
    name: atp-ingestion
  minReplicaCount: 5
  maxReplicaCount: 10
  triggers:
  # CPU-based scaling
  - type: cpu
    metadata:
      type: Utilization
      value: "70"
  # Memory-based scaling
  - type: memory
    metadata:
      type: Utilization
      value: "80"
  # RabbitMQ queue length
  - type: rabbitmq
    metadata:
      queueName: audit-events
      queueLength: "100"
      host: "amqp://rabbitmq.atp-production:5672"
  # HTTP requests per second
  - type: prometheus
    metadata:
      serverAddress: "http://prometheus.monitoring:9090"
      metricName: http_requests_per_second
      threshold: "100"
      query: "sum(rate(http_requests_total[1m]))"

Scale-to-Zero in Dev

Dev Scale-to-Zero:

# apps/atp-ingestion/overlays/dev/keda-scaler.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: atp-ingestion-scaler
  namespace: atp-dev
spec:
  scaleTargetRef:
    name: atp-ingestion
  minReplicaCount: 0  # Scale to zero when idle
  maxReplicaCount: 2
  idleReplicaCount: 0  # Scale down to zero after inactivity
  triggers:
  # NOTE: scaling on HTTP traffic requires the KEDA HTTP add-on
  # (HTTPScaledObject); with core KEDA, activate on observed request
  # rate from Prometheus instead:
  - type: prometheus
    metadata:
      serverAddress: "http://prometheus.monitoring:9090"
      query: 'sum(rate(http_requests_total{namespace="atp-dev"}[1m]))'
      threshold: "1"
      activationThreshold: "1"
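Scale-to-zero hinges on KEDA's activation phase: while the activation metric stays at or below the activation threshold the workload sits at zero replicas; once it crosses, KEDA activates the deployment and normal HPA scaling takes over. A simplified sketch of that decision:

```python
# Simplified sketch of KEDA scale-to-zero activation semantics:
# below/at the activation threshold -> 0 replicas; above it -> at least 1,
# with the HPA scaling between 1 and maxReplicaCount.
def replica_floor(metric_value: float, activation_threshold: float) -> int:
    if metric_value <= activation_threshold:
        return 0  # idle: deployment scaled to zero
    return 1      # activated: HPA takes over from here

print(replica_floor(0, 1))  # 0: no traffic, stay at zero
print(replica_floor(5, 1))  # 1: traffic arrived, wake the deployment
```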

Multi-Region Traffic Routing

Azure Front Door Configuration

Azure Front Door Setup:

# infrastructure/azure-front-door.yaml
apiVersion: networking.azure.com/v1
kind: FrontDoor
metadata:
  name: atp-frontdoor
spec:
  resourceGroupName: atp-production-rg
  location: global
  frontendEndpoints:
  - name: atp-frontend
    hostName: atp.connectsoft.example
    sessionAffinityEnabledState: Enabled
    sessionAffinityTtlSeconds: 0
  backendPools:
  - name: primary-eus
    loadBalancingSettings:
      name: default
    healthProbeSettings:
      name: default
    backends:
    - address: atp-prod-eus-aks.region.cloudapp.azure.com
      enabled: true
      priority: 1
      weight: 100
      httpPort: 80
      httpsPort: 443
  - name: secondary-weu
    backends:
    - address: atp-prod-weu-aks.region.cloudapp.azure.com
      enabled: true
      priority: 2
      weight: 0  # Standby
      httpPort: 80
      httpsPort: 443
  routingRules:
  - name: failover-rule
    acceptedProtocols:
    - Http
    - Https
    patternsToMatch:
    - "/*"
    routeConfiguration:
      forwardingConfiguration:
        forwardingProtocol: MatchRequest
        backendPool:
          id: primary-eus
        cacheConfiguration:
          queryParameterStripDirective: StripAll
          dynamicCompression: Enabled
    frontendEndpoints:
    - atp-frontend

Traffic Manager for DNS-Based Routing

Traffic Manager Configuration:

# infrastructure/traffic-manager.yaml
apiVersion: network.azure.com/v1
kind: TrafficManagerProfile
metadata:
  name: atp-traffic-manager
spec:
  resourceGroupName: atp-production-rg
  location: global
  profileStatus: Enabled
  trafficRoutingMethod: Priority  # Failover routing
  dnsConfig:
    relativeName: atp-connectsoft
    ttl: 60
  monitorConfig:
    protocol: Https
    port: 443
    path: /health
    intervalInSeconds: 30
    timeoutInSeconds: 10
    toleratedNumberOfFailures: 3
  endpoints:
  - name: primary-eus
    target: atp-prod-eus-aks.region.cloudapp.azure.com
    type: ExternalEndpoints
    priority: 1
    weight: 100
    enabled: true
  - name: secondary-weu
    target: atp-prod-weu-aks.region.cloudapp.azure.com
    type: ExternalEndpoints
    priority: 2
    weight: 0
    enabled: true

Regional Failover Policies

Failover Policy Configuration:

# Front Door failover rules
routingRules:
- name: failover-rule
  acceptedProtocols:
  - Http
  - Https
  routeConfiguration:
    forwardingConfiguration:
      backendPool:
        id: primary-eus
      # Failover to secondary if primary unhealthy
      loadBalancingSettings:
        sampleSize: 4
        successfulSamplesRequired: 3

Health Probe Configuration:

healthProbeSettings:
- name: default
  path: /health
  protocol: Https
  intervalInSeconds: 30
  enabledState: Enabled

Health Probe Configuration

Health Probe Setup:

# Health probe for Front Door
healthProbeSettings:
- name: atp-health-probe
  path: /health/live
  protocol: Https
  intervalInSeconds: 30
  timeoutInSeconds: 10
  unhealthyThreshold: 3
  enabledState: Enabled
  healthProbeMethod: Head

Application Health Endpoint:

// Health check endpoint for multi-region routing
[ApiController]
[Route("health")]
public class HealthController : ControllerBase
{
    private readonly HealthCheckService _healthCheck;

    public HealthController(HealthCheckService healthCheck)
    {
        _healthCheck = healthCheck;
    }

    [HttpGet("live")]
    public async Task<IActionResult> Liveness()
    {
        var result = await _healthCheck.CheckHealthAsync();
        return result.Status == HealthStatus.Healthy
            ? Ok()
            : StatusCode(503);
    }

    [HttpGet("ready")]
    public async Task<IActionResult> Readiness()
    {
        // Check critical dependencies only (checks tagged "ready")
        var result = await _healthCheck.CheckHealthAsync(
            predicate: check => check.Tags.Contains("ready"));
        return result.Status == HealthStatus.Healthy
            ? Ok()
            : StatusCode(503);
    }
}

Cross-Environment Dependencies

Shared Services (Monitoring, Logging)

Shared Monitoring Stack:

# Shared monitoring namespace (single instance for all environments)
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
  labels:
    shared: "true"
---
# Prometheus (shared across environments)
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: monitoring
spec:
  selector:
    app: prometheus
  ports:
  - port: 9090
    name: http

Cross-Environment Service Access:

# ExternalName alias created in the consuming (production) namespace,
# pointing at the shared monitoring stack
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: atp-production
spec:
  type: ExternalName
  externalName: prometheus.monitoring.svc.cluster.local

Service Discovery Across Environments

Multi-Cluster Service Discovery:

# Service export (if using service mesh)
apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ServiceExport
metadata:
  name: atp-gateway
  namespace: atp-production
---
# Service import
apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ServiceImport
metadata:
  name: atp-gateway
  namespace: atp-staging
spec:
  type: ClusterSetIP
  ports:
  - port: 8080
    protocol: TCP

VNet Peering (If Needed)

VNet Peering for Cross-Environment Access:

// VNet peering between environments (if required)
var devTestPeering = new Network.VirtualNetworkPeering("dev-test-peering", new()
{
    ResourceGroupName = "atp-nonprod-rg",
    VirtualNetworkName = "atp-dev-vnet",
    RemoteVirtualNetworkId = testVNet.Id,
    AllowForwardedTraffic = true,
    AllowGatewayTransit = false,
    UseRemoteGateways = false,
});

VNet Peering Policy:

| Environment Pair | Peering | Rationale |
|---|---|---|
| Dev ↔ Test | ⚠️ Optional | Shared resources, cost optimization |
| Staging ↔ Production | ❌ No | Security isolation required |
| Production EUS ↔ Production WEU | ✅ Yes | Multi-region HA/DR |
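This policy table can be enforced in a pre-merge check on peering manifests. A sketch encoding the table as data (environment identifiers here are illustrative; anything not explicitly listed defaults to forbidden):

```python
# Enforce the VNet peering policy: unlisted pairs default to forbidden.
ALLOWED = {
    frozenset({"dev", "test"}): "optional",
    frozenset({"prod-eus", "prod-weu"}): "required",
}

def peering_policy(env_a: str, env_b: str) -> str:
    return ALLOWED.get(frozenset({env_a, env_b}), "forbidden")

print(peering_policy("dev", "test"))            # optional
print(peering_policy("staging", "prod-eus"))    # forbidden: isolation required
print(peering_policy("prod-eus", "prod-weu"))   # required: multi-region HA/DR
```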

Summary: Multi-Environment AKS Deployment

  • Environment-Specific AKS Clusters: Separate clusters per environment with rationale, cluster sizing/SKU selection, networking configuration, subscription strategy
  • Regional Deployment Strategy: Primary region (East US), secondary region (West Europe), multi-region HA/DR, regional failover mechanisms
  • Kustomize Overlays: Base manifests, dev overlay (minimal resources, debug logging), test overlay, staging overlay (production-like), production overlay (optimized, strict policies)
  • Helm Values Files: values-dev.yaml, values-test.yaml, values-staging.yaml, values-production.yaml, value precedence and overrides
  • FluxCD Configuration: GitRepository per environment (branch targeting), Kustomization per environment, sync policies per environment, environment-specific reconciliation settings
  • Environment-Specific Configurations: Log levels, telemetry sampling rates, feature flags, database connection strings, external service endpoints
  • Resource Quotas: Namespace-level quotas, CPU/memory limits per environment, storage quotas, pod count limits
  • HPA Configuration: Min/max replicas per environment, scaling thresholds, custom metrics with KEDA, scale-to-zero in dev
  • Multi-Region Traffic Routing: Azure Front Door configuration, Traffic Manager for DNS-based routing, regional failover policies, health probe configuration
  • Cross-Environment Dependencies: Shared services (monitoring, logging), service discovery across environments, VNet peering if needed

Azure Monitor Integration & Observability

Purpose: Define how Azure Monitor, Log Analytics, and Grafana are integrated with ATP GitOps workflows to provide comprehensive observability, monitoring, alerting, and compliance evidence collection, ensuring complete visibility into deployment health, FluxCD operations, and application performance across all environments.


Azure Monitor Container Insights

Enabling Container Insights on AKS

Enable Container Insights:

# Enable Container Insights on AKS cluster
az aks enable-addons \
  --resource-group atp-production-rg \
  --name atp-prod-eus-aks \
  --addons monitoring \
  --workspace-resource-id /subscriptions/{subscriptionId}/resourceGroups/atp-production-rg/providers/Microsoft.OperationalInsights/workspaces/atp-prod-loganalytics

Container Insights via Pulumi:

// Enable Container Insights
var logAnalyticsWorkspace = new OperationalInsights.Workspace("atp-prod-loganalytics", new()
{
    ResourceGroupName = "atp-production-rg",
    Location = "eastus",
    Sku = new OperationalInsights.Inputs.WorkspaceSkuArgs
    {
        Name = "PerGB2018",
    },
    RetentionInDays = environment == "production" ? 2555 : 30, // 7 years for production
    Tags = new()
    {
        { "Environment", environment },
        { "Retention", environment == "production" ? "7years" : "30days" },
    },
});

// The Container Insights addon itself is enabled on the AKS resource via its
// addon (OMS agent) profile, or with the az CLI command shown above, passing
// this workspace's resource ID (logAnalyticsWorkspace.Id).

Verify Container Insights:

# Check Container Insights status
az aks show \
  --resource-group atp-production-rg \
  --name atp-prod-eus-aks \
  --query addonProfiles.omsagent

# Check OMS agent pods
kubectl get pods -n kube-system | grep omsagent

Metrics Collection and Aggregation

Container Insights Metrics:

| Metric Category | Examples | Collection Interval |
|---|---|---|
| Node Metrics | CPU, Memory, Disk I/O, Network | 60s |
| Pod Metrics | CPU, Memory, Restart count | 60s |
| Container Metrics | CPU, Memory per container | 60s |
| Controller Metrics | Replica count, Ready replicas | 60s |
| Workload Metrics | Deployment, StatefulSet status | 60s |

Key Metrics Collected:

// Node metrics
InsightsMetrics
| where Origin == "container.azm.ms"
| where Namespace == "insights-metrics"
| where Name == "cpuUsageNanoCores"
| summarize avg(Val) by Computer, bin(TimeGenerated, 1m)

// Pod metrics
InsightsMetrics
| where Name == "podCpuUsageNanoCores"
| summarize avg(Val) by Computer, bin(TimeGenerated, 1m)

// Container restart count (restart counts live in the pod inventory table,
// not ContainerLog)
KubePodInventory
| where ContainerRestartCount > 0
| project TimeGenerated, Computer, ContainerName, ContainerRestartCount

Log Analytics Workspace Configuration

Workspace Configuration:

// Log Analytics Workspace with long retention for production
var logAnalyticsWorkspace = new OperationalInsights.Workspace("atp-prod-loganalytics", new()
{
    ResourceGroupName = "atp-production-rg",
    Location = "eastus",
    Sku = new OperationalInsights.Inputs.WorkspaceSkuArgs
    {
        Name = "PerGB2018",  // Pay-as-you-go
    },
    RetentionInDays = 2555, // 7 years for compliance; note that interactive
                            // retention caps at 730 days — longer retention relies
                            // on table-level archive (total retention) or data export
    WorkspaceCapping = new OperationalInsights.Inputs.WorkspaceCappingArgs
    {
        DailyQuotaGb = -1, // No daily ingestion cap
    },
    Tags = new()
    {
        { "Environment", "production" },
        { "Retention", "7years" },
        { "Compliance", "SOC2" },
    },
});

Workspace Strategy:

| Environment | Workspace per Environment | Shared Workspace |
|---|---|---|
| Dev/Test | ⚠️ Optional | ✅ Recommended (cost optimization) |
| Staging | ✅ Separate workspace | ⚠️ Optional |
| Production | ✅ Separate workspace | ❌ Not recommended |

ATP Workspace Strategy:

- Dev/Test: Shared atp-nonprod-loganalytics workspace
- Staging: Separate atp-staging-loganalytics workspace
- Production: Separate atp-prod-loganalytics workspace (7-year retention)

Cost Optimization (Sampling, Retention)

Cost Optimization Strategies:

| Strategy | Configuration | Impact |
|---|---|---|
| Log Sampling | 10% sampling in production | ✅ ~90% ingestion cost reduction |
| Metric Aggregation | 5-minute aggregation | ✅ Reduced data volume |
| Retention Tiers | 7 years (prod), 30 days (dev) | ✅ Cost-optimized retention |
| Data Export | Archive to Blob Storage | ✅ Cheaper long-term storage |
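The sampling and retention levers above compound: sampling cuts ingested volume, and retention determines how long each GB is billed. A rough Python sketch of the arithmetic, assuming a caller-supplied `price_per_gb` rather than any real Azure list price:

```python
def monthly_ingestion_gb(daily_gb: float, sampling_percentage: float) -> float:
    """Approximate monthly ingested volume after fixed-rate sampling."""
    return daily_gb * (sampling_percentage / 100.0) * 30

def estimated_monthly_cost(daily_gb: float, sampling_percentage: float,
                           price_per_gb: float) -> float:
    """Rough cost model: ingested volume times a caller-supplied price."""
    return monthly_ingestion_gb(daily_gb, sampling_percentage) * price_per_gb

# 100 GB/day at 10% sampling -> ~300 GB/month ingested
full = estimated_monthly_cost(100, 100, price_per_gb=1.0)
sampled = estimated_monthly_cost(100, 10, price_per_gb=1.0)
reduction = 1 - sampled / full  # ~0.9, i.e. ~90% ingestion cost reduction
```

This is only the ingestion side; long-retention (archive/export) storage is billed separately and is not modeled here.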

Production Log Sampling:

# Application Insights sampling (10% in production)
env:
- name: APPLICATIONINSIGHTS_SAMPLING_PERCENTAGE
  value: "10"

# Or via ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: appinsights-config
data:
  samplingPercentage: "10"
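
Fixed-rate sampling is typically decided deterministically from the operation (trace) id, so an entire request's telemetry is kept or dropped together. A hedged Python sketch of that idea (illustrative only, not the exact Application Insights algorithm):

```python
import hashlib

def keep_trace(operation_id: str, sampling_percentage: float) -> bool:
    """Hash-based decision: the same operation id always gets the same verdict,
    so all telemetry items for one request are sampled consistently."""
    digest = hashlib.sha256(operation_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket * 100 < sampling_percentage

# Over many traces, roughly sampling_percentage percent are kept
kept = sum(keep_trace(f"op-{i}", 10) for i in range(10_000))
```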

Log Retention Configuration:

// Production: 7-year retention
var prodWorkspace = new OperationalInsights.Workspace("atp-prod-loganalytics", new()
{
    RetentionInDays = 2555, // 7 years
});

// Dev/Test: 30-day retention
var nonprodWorkspace = new OperationalInsights.Workspace("atp-nonprod-loganalytics", new()
{
    RetentionInDays = 30,
});

Log Analytics Workspace

Workspace per Environment or Shared

Workspace Organization:

graph TB
    subgraph "Production Subscription"
        PROD_WS[atp-prod-loganalytics<br/>7-year retention]
        STAGING_WS[atp-staging-loganalytics<br/>90-day retention]
    end
    subgraph "Non-Prod Subscription"
        NONPROD_WS[atp-nonprod-loganalytics<br/>30-day retention]
    end

    PROD_EUS[Production AKS<br/>East US] --> PROD_WS
    PROD_WEU[Production AKS<br/>West Europe] --> PROD_WS
    STAGING[Staging AKS] --> STAGING_WS
    DEV[Dev AKS] --> NONPROD_WS
    TEST[Test AKS] --> NONPROD_WS

    style PROD_WS fill:#90EE90
    style STAGING_WS fill:#FFE5B4
    style NONPROD_WS fill:#FFE5B4

Workspace Matrix:

| Environment | Workspace Name | Retention | Data Sources |
|---|---|---|---|
| Dev/Test | atp-nonprod-loganalytics | 30 days | Dev AKS, Test AKS |
| Staging | atp-staging-loganalytics | 90 days | Staging AKS |
| Production | atp-prod-loganalytics | 7 years (2555 days) | Production AKS (EUS, WEU) |
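The matrix above can be encoded as a small lookup, for example when generating per-environment Pulumi stack config. A minimal sketch (environment names are the ATP ones assumed throughout this document):

```python
# Retention in days per ATP environment, matching the workspace matrix above
RETENTION_DAYS = {
    "dev": 30,
    "test": 30,
    "staging": 90,
    "production": 7 * 365,  # 2555 days, ~7 years
}

def retention_for(environment: str) -> int:
    """Look up the Log Analytics retention (in days) for an ATP environment."""
    try:
        return RETENTION_DAYS[environment]
    except KeyError:
        raise ValueError(f"unknown environment: {environment}")
```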

Log Retention Policies

Retention Policy Configuration:

// Log Analytics Workspace with retention
var logAnalyticsWorkspace = new OperationalInsights.Workspace("atp-prod-loganalytics", new()
{
    ResourceGroupName = "atp-production-rg",
    Location = "eastus",
    RetentionInDays = 2555, // 7 years for compliance
    WorkspaceCapping = new OperationalInsights.Inputs.WorkspaceCappingArgs
    {
        DailyQuotaGb = -1, // No daily ingestion cap
    },
    PublicNetworkAccessForIngestion = "Enabled",
    PublicNetworkAccessForQuery = "Enabled",
});

// Data export to Blob Storage for long-term archival
var dataExport = new OperationalInsights.DataExport("atp-prod-export", new()
{
    ResourceGroupName = "atp-production-rg",
    WorkspaceName = logAnalyticsWorkspace.Name,
    TableNames = new[]
    {
        "ContainerLog",
        "ContainerInventory",
        "InsightsMetrics",
        "AzureDiagnostics",
    },
    Destination = new OperationalInsights.Inputs.DestinationArgs
    {
        ResourceId = storageAccount.Id,
        Type = "StorageAccount",
    },
    Enabled = true,
});

Retention by Table:

| Table | Production Retention | Non-Production Retention |
|---|---|---|
| ContainerLog | 7 years | 30 days |
| InsightsMetrics | 7 years | 30 days |
| AzureDiagnostics | 7 years | 30 days |
| FluxCDLogs | 7 years | 30 days |

Kusto Query Language (KQL) Examples

Pod Restart Query:

// Pod restart count per namespace (KubePodInventory carries restart counts)
KubePodInventory
| where TimeGenerated > ago(24h)
| where ContainerRestartCount > 0
| summarize 
    RestartCount = count(),
    UniquePods = dcount(Name),
    LastRestart = max(TimeGenerated)
    by Namespace, Computer
| order by RestartCount desc

Deployment Status Query:

// Deployment status from Container Insights
InsightsMetrics
| where Origin == "container.azm.ms"
| where Name == "k8sPodCount"
| where Namespace == "atp-production"
| extend PodCount = Val
| summarize 
    TotalPods = sum(PodCount),
    AvgPods = avg(PodCount),
    MaxPods = max(PodCount)
    by Namespace, bin(TimeGenerated, 5m)
| order by TimeGenerated desc

FluxCD Reconciliation Query:

// FluxCD reconciliation events
ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "reconciliation"
| extend 
    Kustomization = extract(@"kustomization/(\S+)", 1, LogEntry, typeof(string)),
    Status = extract(@"status=(\S+)", 1, LogEntry, typeof(string)),
    Duration = extract(@"duration=(\d+\.\d+)", 1, LogEntry, typeof(real))
| summarize 
    TotalReconciliations = count(),
    AvgDuration = avg(Duration),
    MaxDuration = max(Duration),
    SuccessCount = countif(Status == "success"),
    FailureCount = countif(Status == "failure")
    by Kustomization, bin(TimeGenerated, 1h)
| order by TimeGenerated desc

Error Rate Query:

// Application error rate
ContainerLog
| where TimeGenerated > ago(1h)
| where LogEntry contains "ERROR" or LogEntry contains "Exception"
| extend 
    Service = extract(@"app=(\S+)", 1, LogEntry, typeof(string)),
    ErrorType = extract(@"(\w+Exception)", 1, LogEntry, typeof(string))
| summarize 
    ErrorCount = count(),
    UniqueErrors = dcount(ErrorType)
    by Service, Computer, bin(TimeGenerated, 5m)
| order by ErrorCount desc

Custom Log Tables

Custom Log Table: Deployment Events:

// Note: the .create / .ingest control commands below use Azure Data Explorer
// syntax and are shown for illustration. In a Log Analytics workspace, custom
// tables (suffixed _CL) are created via the Tables API and populated through
// the Logs Ingestion API with a data collection rule.
// Create custom log table for deployment events
.create table DeploymentEvents (TimeGenerated:datetime, DeploymentId:string, ServiceName:string, Environment:string, Status:string, GitCommit:string, DeployedBy:string, Duration:real)

// Ingest deployment events
.ingest inline into table DeploymentEvents <|
2024-01-15T10:00:00Z, "deployment-abc123", "atp-ingestion", "production", "success", "abc123def", "FluxCD", 45.2
2024-01-15T11:00:00Z, "deployment-def456", "atp-query", "production", "success", "def456ghi", "FluxCD", 52.8

// Query deployment events
DeploymentEvents
| where Environment == "production"
| where TimeGenerated > ago(7d)
| summarize 
    TotalDeployments = count(),
    SuccessfulDeployments = countif(Status == "success"),
    FailedDeployments = countif(Status == "failure"),
    AvgDuration = avg(Duration)
    by ServiceName, bin(TimeGenerated, 1d)

Custom Log via Azure Function:

// Azure Function to ingest deployment events.
// Note: [LogAnalyticsOutput] is not a built-in binding — it stands in here for a
// custom output binding (or a direct call to the Azure Monitor Logs Ingestion API).
[FunctionName("IngestDeploymentEvent")]
public async Task IngestDeploymentEvent(
    [EventGridTrigger] EventGridEvent eventGridEvent,
    [LogAnalyticsOutput] IAsyncCollector<LogAnalyticsEvent> logAnalytics)
{
    var deploymentEvent = JsonSerializer.Deserialize<DeploymentEvent>(eventGridEvent.Data.ToString());

    await logAnalytics.AddAsync(new LogAnalyticsEvent
    {
        TimeGenerated = DateTime.UtcNow,
        DeploymentId = deploymentEvent.DeploymentId,
        ServiceName = deploymentEvent.ServiceName,
        Environment = deploymentEvent.Environment,
        Status = deploymentEvent.Status,
        GitCommit = deploymentEvent.GitCommit,
        DeployedBy = "FluxCD",
        Duration = deploymentEvent.Duration,
    });
}

FluxCD Metrics Export

Prometheus Metrics from FluxCD

FluxCD Metrics Endpoint:

# FluxCD automatically exposes Prometheus metrics
# Endpoint: http://kustomize-controller:8080/metrics

# ServiceMonitor for Prometheus
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: fluxcd-kustomize-controller
  namespace: flux-system
spec:
  selector:
    matchLabels:
      app: kustomize-controller
  endpoints:
  - port: http-prom
    interval: 30s
    path: /metrics
    scrapeTimeout: 10s

Key FluxCD Metrics:

| Metric | Description | Labels |
|---|---|---|
| fluxcd_kustomize_reconciliation_total | Total reconciliations | status, kustomization |
| fluxcd_kustomize_reconciliation_duration_seconds | Reconciliation duration | kustomization |
| fluxcd_kustomize_reconciliation_errors_total | Reconciliation errors | kustomization, error_type |
| fluxcd_source_git_reconciliation_total | Git fetch reconciliations | status, gitrepository |
| fluxcd_source_git_reconciliation_duration_seconds | Git fetch duration | gitrepository |

Note: metric names vary by Flux version — Flux v2 controllers expose reconciliation metrics under the `gotk_` prefix (for example `gotk_reconcile_duration_seconds`), so verify the names against your installed version before building dashboards and alerts.

Metrics Scraping Configuration

Prometheus Scrape Configuration:

# Prometheus scrape config for FluxCD
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 30s
      scrape_timeout: 10s

    scrape_configs:
    - job_name: 'fluxcd-kustomize-controller'
      kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
          - flux-system
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: kustomize-controller
        action: keep
      - source_labels: [__meta_kubernetes_pod_container_port_number]
        regex: "8080"
        action: keep

    - job_name: 'fluxcd-source-controller'
      kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
          - flux-system
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: source-controller
        action: keep

Prometheus ServiceMonitor for FluxCD:

# ServiceMonitor for FluxCD controllers
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: fluxcd-controllers
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/part-of: flux
  namespaceSelector:
    matchNames:
    - flux-system
  endpoints:
  - port: http-prom
    interval: 30s
    path: /metrics
    scrapeTimeout: 10s

Key Metrics to Monitor

Critical FluxCD Metrics:

# Reconciliation success rate
sum(rate(fluxcd_kustomize_reconciliation_total{status="success"}[5m]))
/
sum(rate(fluxcd_kustomize_reconciliation_total[5m]))

# Reconciliation error rate
sum(rate(fluxcd_kustomize_reconciliation_errors_total[5m]))

# Average reconciliation duration
avg(fluxcd_kustomize_reconciliation_duration_seconds)

# Git fetch duration (indicates network issues)
avg(fluxcd_source_git_reconciliation_duration_seconds)

Per-Kustomization Metrics:

# Reconciliation success rate per Kustomization
sum by (kustomization) (
  rate(fluxcd_kustomize_reconciliation_total{status="success"}[5m])
)
/
sum by (kustomization) (
  rate(fluxcd_kustomize_reconciliation_total[5m])
)

# Reconciliation duration per Kustomization
avg by (kustomization) (
  fluxcd_kustomize_reconciliation_duration_seconds
)

Alerting on FluxCD Issues

FluxCD Alert Rules:

# alerts/fluxcd-reconciliation-alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: fluxcd-reconciliation-alerts
  namespace: monitoring
spec:
  groups:
  - name: fluxcd
    interval: 30s
    rules:
    - alert: FluxCDHighErrorRate
      expr: |
        sum(rate(fluxcd_kustomize_reconciliation_errors_total[5m])) > 0.1
      for: 5m
      labels:
        severity: warning
        component: fluxcd
      annotations:
        summary: "FluxCD reconciliation error rate is high"
        description: "{{ $value }} errors per second detected"

    - alert: FluxCDReconciliationSlow
      expr: |
        avg(fluxcd_kustomize_reconciliation_duration_seconds) > 300
      for: 10m
      labels:
        severity: warning
        component: fluxcd
      annotations:
        summary: "FluxCD reconciliations are taking longer than expected"
        description: "Average duration: {{ $value }}s"

    - alert: FluxCDGitFetchFailed
      expr: |
        increase(fluxcd_source_git_reconciliation_total{status="failure"}[5m]) > 3
      for: 5m
      labels:
        severity: critical
        component: fluxcd
      annotations:
        summary: "FluxCD Git fetch failures detected"
        description: "Git repository {{ $labels.gitrepository }} failed to fetch"

Deployment Metrics

Sync Status per Application

Application Sync Status Query:

// Sync status per application
let FluxCDEvents = ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "applied" or LogEntry contains "sync"
| extend 
    Application = extract(@"kustomization/(\S+)", 1, LogEntry, typeof(string)),
    Status = case(
        LogEntry contains "successfully applied", "Success",
        LogEntry contains "sync failed", "Failed",
        LogEntry contains "drift detected", "Drift",
        "Unknown"
    ),
    GitCommit = extract(@"revision=(\S+)", 1, LogEntry, typeof(string))
| project TimeGenerated, Application, Status, GitCommit;

FluxCDEvents
| summarize arg_max(TimeGenerated, Status, GitCommit) by Application
| project Application, LastSync = TimeGenerated, SyncStatus = Status, GitCommit
| order by LastSync desc

Sync Status Dashboard Query:

// Real-time sync status per application
ContainerLog
| where ContainerName contains "kustomize-controller"
| where TimeGenerated > ago(1h)
| extend 
    Application = extract(@"kustomization/(\S+)", 1, LogEntry, typeof(string)),
    Status = case(
        LogEntry contains "successfully applied", "Success",
        LogEntry contains "sync failed", "Failed",
        "InProgress"
    )
| summarize 
    Count = count(),
    LastSync = max(TimeGenerated)
    by Application, Status
| order by LastSync desc

Reconciliation Duration

Reconciliation Duration Metrics:

# Average reconciliation duration
avg(fluxcd_kustomize_reconciliation_duration_seconds)

# P50, P95, P99 reconciliation duration
histogram_quantile(0.50, 
  rate(fluxcd_kustomize_reconciliation_duration_seconds_bucket[5m])
)
histogram_quantile(0.95, 
  rate(fluxcd_kustomize_reconciliation_duration_seconds_bucket[5m])
)
histogram_quantile(0.99, 
  rate(fluxcd_kustomize_reconciliation_duration_seconds_bucket[5m])
)

# Per-Kustomization duration
avg by (kustomization) (
  fluxcd_kustomize_reconciliation_duration_seconds
)
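The histogram_quantile calls above estimate percentiles by linear interpolation over cumulative bucket counts. A simplified Python sketch of that estimation (the last bucket is treated as finite here, whereas Prometheus handles the +Inf bucket specially):

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: sorted list of (upper_bound, cumulative_count), mirroring the
    shape of a Prometheus *_bucket series.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # Linear interpolation within the bucket containing the rank
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# 60 reconciliations finished in <=30s, 90 in <=60s, 100 in <=120s
buckets = [(30.0, 60), (60.0, 90), (120.0, 100)]
p95 = histogram_quantile(0.95, buckets)  # falls in the 60-120s bucket
```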

KQL Query for Reconciliation Duration:

// Reconciliation duration from logs
ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "reconciliation"
| extend 
    Duration = extract(@"duration=(\d+\.\d+)", 1, LogEntry, typeof(real)),
    Kustomization = extract(@"kustomization/(\S+)", 1, LogEntry, typeof(string))
| where isnotnull(Duration)
| summarize 
    AvgDuration = avg(Duration),
    P50Duration = percentile(Duration, 50),
    P95Duration = percentile(Duration, 95),
    P99Duration = percentile(Duration, 99),
    MaxDuration = max(Duration)
    by Kustomization, bin(TimeGenerated, 1h)
| order by TimeGenerated desc

Reconciliation Failure Rate

Failure Rate Metrics:

# Reconciliation failure rate
sum(rate(fluxcd_kustomize_reconciliation_errors_total[5m]))
/
sum(rate(fluxcd_kustomize_reconciliation_total[5m]))

# Per-Kustomization failure rate
sum by (kustomization) (
  rate(fluxcd_kustomize_reconciliation_errors_total[5m])
)
/
sum by (kustomization) (
  rate(fluxcd_kustomize_reconciliation_total[5m])
)

KQL Failure Rate Query:

// Reconciliation failure rate
ContainerLog
| where ContainerName contains "kustomize-controller"
| where TimeGenerated > ago(24h)
| extend 
    Status = case(
        LogEntry contains "successfully", "Success",
        LogEntry contains "failed" or LogEntry contains "error", "Failure",
        "Unknown"
    ),
    Kustomization = extract(@"kustomization/(\S+)", 1, LogEntry, typeof(string))
| where Status != "Unknown"
| summarize 
    TotalReconciliations = count(),
    Successful = countif(Status == "Success"),
    Failed = countif(Status == "Failure"),
    FailureRate = (countif(Status == "Failure") * 100.0) / count()
    by Kustomization, bin(TimeGenerated, 1h)
| order by FailureRate desc

Drift Detection Events

Drift Detection Metrics:

# Drift detection rate
sum(rate(fluxcd_kustomize_drift_detected_total[5m]))

# Drift correction rate
sum(rate(fluxcd_kustomize_drift_corrected_total[5m]))

KQL Drift Detection Query:

// Drift detection events
ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "drift detected"
| extend 
    Kustomization = extract(@"kustomization/(\S+)", 1, LogEntry, typeof(string)),
    Resource = extract(@"resource=(\S+)", 1, LogEntry, typeof(string)),
    DriftType = extract(@"drift type=(\S+)", 1, LogEntry, typeof(string))
| summarize 
    DriftCount = count(),
    UniqueResources = dcount(Resource),
    LastDrift = max(TimeGenerated)
    by Kustomization, DriftType, bin(TimeGenerated, 1h)
| order by DriftCount desc

Deployment Frequency

Deployment Frequency Calculation:

// Deployment frequency (DORA metric)
let Deployments = ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "successfully applied"
| extend 
    Service = extract(@"kustomization/(\S+)", 1, LogEntry, typeof(string)),
    GitCommit = extract(@"revision=(\S+)", 1, LogEntry, typeof(string))
| project TimeGenerated, Service, GitCommit;

Deployments
| summarize 
    DeploymentCount = count(),
    UniqueServices = dcount(Service)
    by bin(TimeGenerated, 1d)
| extend 
    DeploymentFrequency = DeploymentCount // Deployments per day
| order by TimeGenerated desc

Prometheus Query for Deployment Frequency:

# Deployment frequency (successful reconciliations per day)
sum(increase(fluxcd_kustomize_reconciliation_total{status="success"}[1d]))
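
The frequency queries above boil down to bucketing successful-apply events by day. A small Python sketch of the same aggregation (the sample events are hypothetical):

```python
from collections import Counter
from datetime import datetime

def deployments_per_day(events):
    """events: iterable of (timestamp, service) successful-apply events,
    e.g. parsed from the reconciliation logs queried above."""
    return dict(Counter(ts.date() for ts, _svc in events))

events = [
    (datetime(2024, 1, 15, 10), "atp-ingestion"),
    (datetime(2024, 1, 15, 11), "atp-query"),
    (datetime(2024, 1, 16, 9), "atp-ingestion"),
]
freq = deployments_per_day(events)  # two deployments on the 15th, one on the 16th
```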

Application Health After Deployment

Readiness Probe Success Rate

Readiness Probe Metrics:

# Readiness probe success rate
sum(rate(kube_pod_status_ready{condition="true"}[5m]))
/
sum(rate(kube_pod_status_ready[5m]))

# Readiness probe failures
sum(rate(kube_pod_status_ready{condition="false"}[5m]))

KQL Readiness Probe Query:

// Readiness approximation from pod inventory
// (PodStatus == "Running" is used as a proxy for readiness here)
KubePodInventory
| where Namespace == "atp-production"
| summarize 
    TotalPods = dcount(Name),
    RunningPods = dcountif(Name, PodStatus == "Running")
    by Namespace, bin(TimeGenerated, 5m)
| extend ReadinessRate = (RunningPods * 100.0) / TotalPods
| order by TimeGenerated desc

Pod Restart Count

Pod Restart Metrics:

# Pod restart count
sum(increase(kube_pod_container_status_restarts_total[1h]))

# Pod restart rate per service
sum by (pod, namespace) (
  increase(kube_pod_container_status_restarts_total[1h])
)

KQL Pod Restart Query:

// Pod restart count (KubePodInventory carries per-container restart counts)
KubePodInventory
| where TimeGenerated > ago(24h)
| where ContainerRestartCount > 0
| summarize 
    RestartCount = max(ContainerRestartCount),
    RestartEvents = count(),
    LastRestart = max(TimeGenerated)
    by Computer, ContainerName, ServiceName, Namespace
| order by RestartCount desc

HTTP Error Rates

HTTP Error Rate Metrics:

# HTTP 5xx error rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# HTTP 4xx error rate
sum(rate(http_requests_total{status=~"4.."}[5m]))
/
sum(rate(http_requests_total[5m]))

KQL HTTP Error Rate Query:

// HTTP error rates from workspace-based Application Insights
// (AppRequests.ResultCode is a string, so convert before comparing)
AppRequests
| where TimeGenerated > ago(1h)
| extend IsError = toint(ResultCode) >= 400
| summarize 
    TotalRequests = count(),
    ErrorRequests = countif(IsError),
    ErrorRate = (countif(IsError) * 100.0) / count()
    by AppRoleName, bin(TimeGenerated, 5m)
| order by ErrorRate desc

Response Latency

Response Latency Metrics:

# Average response latency
avg(http_request_duration_seconds)

# P95 response latency
histogram_quantile(0.95, 
  rate(http_request_duration_seconds_bucket[5m])
)

# P99 response latency
histogram_quantile(0.99, 
  rate(http_request_duration_seconds_bucket[5m])
)

KQL Response Latency Query:

// Response latency from workspace-based Application Insights
// (DurationMs is already in milliseconds)
AppRequests
| where TimeGenerated > ago(1h)
| summarize 
    AvgLatency = avg(DurationMs),
    P50Latency = percentile(DurationMs, 50),
    P95Latency = percentile(DurationMs, 95),
    P99Latency = percentile(DurationMs, 99),
    MaxLatency = max(DurationMs)
    by AppRoleName, bin(TimeGenerated, 5m)
| order by TimeGenerated desc

Integration with Application Metrics

Custom Application Metrics:

// C# application: Export custom metrics
public class MetricsExporter
{
    private readonly IMetricsCollector _metrics;

    public void RecordDeploymentEvent(string serviceName, string gitCommit)
    {
        _metrics.IncrementCounter("atp_deployment_total", new Dictionary<string, string>
        {
            { "service", serviceName },
            { "git_commit", gitCommit },
            { "environment", "production" },
        });
    }

    public void RecordDeploymentDuration(double durationSeconds)
    {
        _metrics.RecordHistogram("atp_deployment_duration_seconds", durationSeconds);
    }
}

Prometheus Metrics Export:

// Prometheus metrics endpoint
app.UseMetricServer(); // Exposes /metrics endpoint

// Custom metrics (prometheus-net)
var deploymentCounter = Metrics.CreateCounter(
    "atp_deployment_total",
    "Total deployments",
    new[] { "service", "environment", "status" });

// Record a successful deployment
deploymentCounter.WithLabels("atp-ingestion", "production", "success").Inc();

Grafana Dashboards

FluxCD Operational Dashboard

FluxCD Dashboard JSON:

{
  "dashboard": {
    "title": "FluxCD Operational Dashboard",
    "panels": [
      {
        "title": "Reconciliation Success Rate",
        "targets": [{
          "expr": "sum(rate(fluxcd_kustomize_reconciliation_total{status=\"success\"}[5m])) / sum(rate(fluxcd_kustomize_reconciliation_total[5m]))",
          "legendFormat": "Success Rate"
        }],
        "type": "stat",
        "thresholds": {
          "steps": [
            { "value": 0, "color": "red" },
            { "value": 0.95, "color": "yellow" },
            { "value": 0.99, "color": "green" }
          ]
        }
      },
      {
        "title": "Reconciliation Duration",
        "targets": [{
          "expr": "avg(fluxcd_kustomize_reconciliation_duration_seconds)",
          "legendFormat": "Avg Duration"
        }],
        "type": "graph"
      },
      {
        "title": "Reconciliation Errors",
        "targets": [{
          "expr": "sum(rate(fluxcd_kustomize_reconciliation_errors_total[5m]))",
          "legendFormat": "Errors/sec"
        }],
        "type": "graph"
      },
      {
        "title": "Reconciliation Status by Kustomization",
        "targets": [{
          "expr": "sum by (kustomization) (rate(fluxcd_kustomize_reconciliation_total[5m]))",
          "legendFormat": "{{kustomization}}"
        }],
        "type": "bargauge"
      }
    ]
  }
}

Deployment Status Dashboard

Deployment Status Dashboard:

{
  "dashboard": {
    "title": "Deployment Status Dashboard",
    "panels": [
      {
        "title": "Deployment Frequency",
        "targets": [{
          "expr": "sum(increase(fluxcd_kustomize_reconciliation_total{status=\"success\"}[1d]))",
          "legendFormat": "Deployments/Day"
        }],
        "type": "stat"
      },
      {
        "title": "Deployment Success Rate",
        "targets": [{
          "expr": "sum(rate(fluxcd_kustomize_reconciliation_total{status=\"success\"}[1h])) / sum(rate(fluxcd_kustomize_reconciliation_total[1h]))",
          "legendFormat": "Success Rate"
        }],
        "type": "gauge"
      },
      {
        "title": "Deployment Status by Service",
        "targets": [{
          "expr": "sum by (kustomization) (fluxcd_kustomize_reconciliation_total)",
          "legendFormat": "{{kustomization}}"
        }],
        "type": "table"
      }
    ]
  }
}

Application Health Dashboard

Application Health Dashboard:

{
  "dashboard": {
    "title": "Application Health Dashboard",
    "panels": [
      {
        "title": "Pod Readiness",
        "targets": [{
          "expr": "sum(rate(kube_pod_status_ready{condition=\"true\"}[5m])) / sum(rate(kube_pod_status_ready[5m]))",
          "legendFormat": "Readiness Rate"
        }],
        "type": "stat"
      },
      {
        "title": "Pod Restart Count",
        "targets": [{
          "expr": "sum(increase(kube_pod_container_status_restarts_total[1h]))",
          "legendFormat": "Restarts"
        }],
        "type": "graph"
      },
      {
        "title": "HTTP Error Rate",
        "targets": [{
          "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))",
          "legendFormat": "5xx Error Rate"
        }],
        "type": "graph"
      },
      {
        "title": "Response Latency (P95)",
        "targets": [{
          "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
          "legendFormat": "P95 Latency"
        }],
        "type": "graph"
      }
    ]
  }
}

DORA Metrics Dashboard

DORA Metrics Dashboard:

{
  "dashboard": {
    "title": "DORA Metrics Dashboard",
    "panels": [
      {
        "title": "Deployment Frequency",
        "targets": [{
          "expr": "sum(increase(fluxcd_kustomize_reconciliation_total{status=\"success\"}[1d]))",
          "legendFormat": "Deployments/Day"
        }],
        "type": "stat"
      },
      {
        "title": "Lead Time for Changes",
        "targets": [{
          "expr": "avg(deployment_lead_time_seconds)",
          "legendFormat": "Avg Lead Time"
        }],
        "type": "stat"
      },
      {
        "title": "Mean Time to Recovery (MTTR)",
        "targets": [{
          "expr": "avg(incident_recovery_time_seconds)",
          "legendFormat": "MTTR"
        }],
        "type": "stat"
      },
      {
        "title": "Change Failure Rate",
        "targets": [{
          "expr": "sum(rate(deployment_failures_total[1d])) / sum(rate(deployments_total[1d]))",
          "legendFormat": "Failure Rate"
        }],
        "type": "gauge"
      }
    ]
  }
}

Azure Monitor Workbooks

Custom Workbooks for GitOps

GitOps Workbook Template:

{
  "version": "Notebook/1.0",
  "items": [
    {
      "type": 9,
      "content": {
        "version": "KqlParameterItem/1.0",
        "parameters": [
          {
            "id": "timeRange",
            "version": "KqlParameterItem/1.0",
            "name": "TimeRange",
            "type": 4,
            "value": {
              "durationMs": 86400000
            }
          },
          {
            "id": "environment",
            "version": "KqlParameterItem/1.0",
            "name": "Environment",
            "type": 1,
            "value": "production"
          }
        ]
      }
    },
    {
      "type": 1,
      "content": {
        "version": "TextBlock/1.0",
        "text": "## GitOps Deployment Status"
      }
    },
    {
      "type": 3,
      "content": {
        "version": "KqlItem/1.0",
        "query": "ContainerLog\n| where ContainerName contains \"kustomize-controller\"\n| where TimeGenerated > ago({TimeRange})\n| where Namespace == \"{Environment}\"\n| summarize DeploymentCount = count() by bin(TimeGenerated, 1h)\n| render timechart",
        "visualization": "timechart",
        "size": 0,
        "queryType": 0,
        "resourceType": "microsoft.operationalinsights/workspaces"
      }
    }
  ]
}

Compliance Reporting Workbooks

Compliance Workbook:

{
  "version": "Notebook/1.0",
  "items": [
    {
      "type": 1,
      "content": {
        "version": "TextBlock/1.0",
        "text": "## Compliance Audit Report"
      }
    },
    {
      "type": 3,
      "content": {
        "version": "KqlItem/1.0",
        "query": "// Deployment audit trail\nContainerLog\n| where ContainerName contains \"kustomize-controller\"\n| where LogEntry contains \"applied\"\n| extend \n    DeploymentId = extract(@\"deployment=(\\S+)\", 1, LogEntry, typeof(string)),\n    GitCommit = extract(@\"revision=(\\S+)\", 1, LogEntry, typeof(string)),\n    DeployedBy = \"FluxCD\"\n| project TimeGenerated, DeploymentId, GitCommit, DeployedBy, Namespace\n| order by TimeGenerated desc",
        "visualization": "table",
        "size": 0,
        "queryType": 0
      }
    },
    {
      "type": 3,
      "content": {
        "version": "KqlItem/1.0",
        "query": "// Policy compliance status\nAzureDiagnostics\n| where ResourceProvider == \"MICROSOFT.AUTHORIZATION\"\n| where Category == \"PolicyState\"\n| where TimeGenerated > ago(7d)\n| extend ComplianceState = tostring(parse_json(properties_s).complianceState_s)\n| summarize \n    Compliant = countif(ComplianceState == \"Compliant\"),\n    NonCompliant = countif(ComplianceState == \"NonCompliant\")\n    by bin(TimeGenerated, 1d)\n| render timechart",
        "visualization": "timechart",
        "size": 0
      }
    }
  ]
}

Cost Analysis Workbooks

Cost Analysis Workbook:

{
  "version": "Notebook/1.0",
  "items": [
    {
      "type": 1,
      "content": {
        "version": "TextBlock/1.0",
        "text": "## GitOps Cost Analysis"
      }
    },
    {
      "type": 3,
      "content": {
        "version": "KqlItem/1.0",
        "query": "// Resource usage by environment\nInsightsMetrics\n| where Origin == \"container.azm.ms\"\n| where Name == \"cpuUsageNanoCores\"\n| extend Environment = extract(@\"namespace=(\\S+)\", 1, Namespace, typeof(string))\n| summarize \n    AvgCPU = avg(Val),\n    MaxCPU = max(Val)\n    by Environment, bin(TimeGenerated, 1d)\n| render timechart",
        "visualization": "timechart",
        "size": 0
      }
    }
  ]
}

Alerting

Sync Failure Alerts

Sync Failure Alert Rule:

# alerts/fluxcd-sync-failure.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: fluxcd-sync-failure
  namespace: monitoring
spec:
  groups:
  - name: fluxcd-sync
    rules:
    - alert: FluxCDSyncFailure
      expr: |
        sum(rate(fluxcd_kustomize_reconciliation_errors_total[5m])) > 0
      for: 5m
      labels:
        severity: critical
        component: fluxcd
      annotations:
        summary: "FluxCD sync failure detected"
        description: "{{ $value }} sync failures in the last 5 minutes"

Azure Monitor Alert Rule:

{
  "location": "global",
  "properties": {
    "displayName": "FluxCD Sync Failure",
    "description": "Alert when FluxCD sync failures detected",
    "severity": 1,
    "enabled": true,
    "evaluationFrequency": "PT5M",
    "windowSize": "PT5M",
    "criteria": {
      "allOf": [{
        "odata.type": "Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria",
        "name": "SyncFailure",
        "metricName": "fluxcd_kustomize_reconciliation_errors_total",
        "operator": "GreaterThan",
        "threshold": 0,
        "timeAggregation": "Total"
      }]
    },
    "actions": []
  }
}

Drift Detection Alerts

Drift Detection Alert:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: fluxcd-drift-detection
  namespace: monitoring
spec:
  groups:
  - name: fluxcd-drift
    rules:
    - alert: FluxCDDriftDetected
      expr: |
        sum(rate(fluxcd_kustomize_drift_detected_total[5m])) > 0
      for: 5m
      labels:
        severity: warning
        component: fluxcd
      annotations:
        summary: "FluxCD drift detected"
        description: "Cluster state differs from Git state"

KQL-Based Drift Alert:

// Drift detection alert query
ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "drift detected"
| where TimeGenerated > ago(5m)
| extend 
    Kustomization = extract(@"kustomization/(\S+)", 1, LogEntry, typeof(string)),
    Resource = extract(@"resource=(\S+)", 1, LogEntry, typeof(string))
| summarize DriftCount = count() by Kustomization, Resource
| where DriftCount > 0

Deployment Failure Alerts

Deployment Failure Alert:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: deployment-failure
  namespace: monitoring
spec:
  groups:
  - name: deployments
    rules:
    - alert: DeploymentFailure
      expr: |
        sum(rate(fluxcd_kustomize_reconciliation_errors_total[10m])) > 2
      for: 10m
      labels:
        severity: critical
        component: deployment
      annotations:
        summary: "Deployment failure detected"
        description: "{{ $value }} deployment failures in the last 10 minutes"

Health Check Failure Alerts

Health Check Failure Alert:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: health-check-failure
  namespace: monitoring
spec:
  groups:
  - name: health
    rules:
    - alert: HealthCheckFailure
      expr: |
        sum(rate(kube_pod_status_ready{condition="false"}[5m])) > 0
      for: 5m
      labels:
        severity: warning
        component: health
      annotations:
        summary: "Health check failure detected"
        description: "{{ $value }} pods with failed health checks"

Alert Routing (Email, Teams, PagerDuty)

Alert Manager Configuration:

# alertmanager-config.yaml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  routes:
  - match:
      severity: critical
    receiver: 'pagerduty'
  - match:
      severity: warning
    receiver: 'teams'
  - match:
      component: fluxcd
    receiver: 'slack'

receivers:
- name: 'default'
  email_configs:
  - to: 'team@connectsoft.example'
    send_resolved: true

- name: 'teams'
  webhook_configs:
  # Teams cannot parse Alertmanager's native payload directly; route through a
  # translator such as prometheus-msteams
  - url: 'https://outlook.office.com/webhook/YOUR/WEBHOOK/URL'
    send_resolved: true

- name: 'slack'
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
    channel: '#atp-alerts'
    send_resolved: true

- name: 'pagerduty'
  pagerduty_configs:
  - service_key: 'YOUR_PAGERDUTY_KEY'
    send_resolved: true
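The routing tree above dispatches each alert to the first matching route, falling back to the default receiver, so route order decides ties. A minimal sketch of that first-match behavior (the function and route list mirror the config above but are illustrative, not part of Alertmanager):

```python
# Illustrative first-match routing, mirroring the Alertmanager route tree above.
ROUTES = [
    ({"severity": "critical"}, "pagerduty"),
    ({"severity": "warning"}, "teams"),
    ({"component": "fluxcd"}, "slack"),
]
DEFAULT_RECEIVER = "default"

def route_alert(labels: dict) -> str:
    """Return the receiver of the first route whose match labels all apply."""
    for match, receiver in ROUTES:
        if all(labels.get(k) == v for k, v in match.items()):
            return receiver
    return DEFAULT_RECEIVER
```

A critical FluxCD alert therefore pages rather than posting to Slack, because the `severity: critical` route is listed first.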

Azure Monitor Action Groups:

{
  "location": "global",
  "properties": {
    "groupShortName": "atp-alerts",
    "enabled": true,
    "emailReceivers": [{
      "name": "team-email",
      "emailAddress": "team@connectsoft.example",
      "useCommonAlertSchema": true
    }],
    "smsReceivers": [{
      "name": "oncall-sms",
      "countryCode": "1",
      "phoneNumber": "5551234567"
    }],
    "webhookReceivers": [{
      "name": "teams-webhook",
      "serviceUri": "https://outlook.office.com/webhook/YOUR/WEBHOOK/URL",
      "useCommonAlertSchema": true
    }]
  }
}

Correlation

Linking Git Commits to Deployments

Correlation via Git Commit SHA:

// Link Git commits to deployments
let GitCommits = ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "applied"
| extend 
    GitCommit = extract(@"revision=(\S+)", 1, LogEntry, typeof(string)),
    Service = extract(@"kustomization/(\S+)", 1, LogEntry, typeof(string)),
    DeploymentTime = TimeGenerated
| project GitCommit, Service, DeploymentTime;

let ApplicationMetrics = AppRequests
| extend 
    GitCommit = extract(@"git_commit=(\S+)", 1, tostring(customDimensions), typeof(string)),
    RequestTime = TimeGenerated
| project GitCommit, RequestTime, success, duration;

GitCommits
| join kind=inner ApplicationMetrics on GitCommit
| summarize 
    DeploymentCount = count(),
    AvgLatency = avg(duration),
    ErrorRate = (countif(success == false) * 100.0) / count()
    by Service, GitCommit, bin(DeploymentTime, 1h)

Deployment Correlation Script:

#!/bin/bash
# scripts/correlate-deployment.sh

GIT_COMMIT="${1:-$(git rev-parse HEAD)}"
SERVICE_NAME="${2:-atp-ingestion}"

echo "🔗 Correlating deployment for commit: $GIT_COMMIT"

# Add annotation to deployment
kubectl annotate deployment "$SERVICE_NAME" -n atp-production \
  gitops.git-commit="$GIT_COMMIT" \
  gitops.deployed-at="$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --overwrite

# Query correlation (--workspace takes the workspace customer ID GUID, not the resource name)
az monitor log-analytics query \
  --workspace "atp-prod-loganalytics" \
  --analytics-query "
    ContainerLog
    | where ContainerName contains \"kustomize-controller\"
    | where LogEntry contains \"$GIT_COMMIT\"
    | project TimeGenerated, LogEntry
  "

Linking Deployments to Application Metrics

Deployment-to-Metrics Correlation:

// Link deployments to application metrics
let Deployments = ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "successfully applied"
| extend 
    Service = extract(@"kustomization/(\S+)", 1, LogEntry, typeof(string)),
    GitCommit = extract(@"revision=(\S+)", 1, LogEntry, typeof(string)),
    DeploymentTime = TimeGenerated
| project Service, GitCommit, DeploymentTime;

let Metrics = AppRequests
| extend 
    Service = appName
| project Service, TimeGenerated, success, duration, resultCode;

Deployments
| join kind=inner Metrics on Service
| where TimeGenerated >= DeploymentTime
| where TimeGenerated <= DeploymentTime + 30m
| summarize 
    RequestCount = count(),
    ErrorRate = (countif(success == false) * 100.0) / count(),
    AvgLatency = avg(duration)
    by Service, GitCommit, bin(DeploymentTime, 5m)

Correlation IDs Throughout Stack

Correlation ID Injection:

// C#: Inject correlation ID in HTTP requests
public class CorrelationIdMiddleware
{
    private readonly RequestDelegate _next;

    public CorrelationIdMiddleware(RequestDelegate next) => _next = next;

    public async Task InvokeAsync(HttpContext context)
    {
        var correlationId = context.Request.Headers["X-Correlation-ID"].FirstOrDefault()
            ?? Guid.NewGuid().ToString();

        context.Items["CorrelationId"] = correlationId;
        context.Response.Headers["X-Correlation-ID"] = correlationId;

        using (LogContext.PushProperty("CorrelationId", correlationId))
        {
            await _next(context);
        }
    }
}

Correlation ID in Kubernetes:

# Add correlation ID to pod annotations
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  template:
    metadata:
      annotations:
        gitops.git-commit: "abc123def456"
        gitops.correlation-id: "deployment-abc123"
        gitops.deployed-at: "2024-01-15T10:00:00Z"

Distributed Tracing with Azure Application Insights

Application Insights Integration:

// Configure Application Insights with distributed tracing
services.AddApplicationInsightsTelemetry(options =>
{
    options.ConnectionString = "InstrumentationKey={key};IngestionEndpoint=https://eastus-8.in.applicationinsights.azure.com/";
    options.EnableDependencyTrackingTelemetryModule = true;
    options.EnableRequestTrackingTelemetryModule = true;
    options.EnableAdaptiveSampling = true;
    options.AdaptiveSamplingInitialSamplingPercentage = 10; // 10% in production
});

// Custom telemetry with correlation
var telemetryClient = new TelemetryClient();
telemetryClient.Context.Operation.Id = correlationId;
telemetryClient.Context.Operation.Name = "Deployment";
telemetryClient.TrackEvent("DeploymentCompleted", new Dictionary<string, string>
{
    { "Service", "atp-ingestion" },
    { "GitCommit", gitCommit },
    { "Environment", "production" },
});

Trace Correlation Query:

// Distributed trace correlation
let Traces = AppTraces
| extend 
    CorrelationId = extract(@"correlation_id=(\S+)", 1, tostring(customDimensions), typeof(string)),
    OperationId = operation_Id
| project CorrelationId, OperationId, TimeGenerated, message;

let Requests = AppRequests
| extend 
    CorrelationId = extract(@"correlation_id=(\S+)", 1, tostring(customDimensions), typeof(string)),
    OperationId = operation_Id
| project CorrelationId, OperationId, TimeGenerated, name, duration;

Traces
| join kind=inner Requests on CorrelationId
// After the join, the right-hand table's conflicting column becomes TimeGenerated1
| project CorrelationId, TraceTime = TimeGenerated, RequestTime = TimeGenerated1, RequestDuration = duration
| order by CorrelationId asc, TraceTime asc

Compliance Evidence

Deployment Audit Trail in Log Analytics

Deployment Audit Trail Query:

// Complete deployment audit trail
let DeploymentEvents = ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "applied" or LogEntry contains "sync"
| extend 
    DeploymentId = extract(@"deployment=(\S+)", 1, LogEntry, typeof(string)),
    Service = extract(@"kustomization/(\S+)", 1, LogEntry, typeof(string)),
    GitCommit = extract(@"revision=(\S+)", 1, LogEntry, typeof(string)),
    Status = case(
        LogEntry contains "successfully", "Success",
        LogEntry contains "failed", "Failed",
        "InProgress"
    ),
    DeployedBy = "FluxCD",
    DeploymentTime = TimeGenerated
| project DeploymentTime, DeploymentId, Service, GitCommit, Status, DeployedBy, Namespace;

let Approvals = AzureDiagnostics
| where ResourceProvider == "MICROSOFT.DEVOPS"
| where Category == "PullRequest"
| extend 
    GitCommit = extract(@"commit=(\\S+)", 1, properties_s, typeof(string)),
    Approver = tostring(parse_json(properties_s).approver),
    ApprovalTime = TimeGenerated
| project GitCommit, Approver, ApprovalTime;

DeploymentEvents
| join kind=leftouter Approvals on GitCommit
| project 
    DeploymentTime,
    DeploymentId,
    Service,
    GitCommit,
    Status,
    DeployedBy,
    Approver,
    ApprovalTime,
    Namespace
| order by DeploymentTime desc

Retention for 7 Years (Compliance Requirement)

7-Year Retention Configuration:

// Log Analytics workspace: interactive retention caps at 730 days,
// so the 7-year requirement is met via the Blob Storage export below
var logAnalyticsWorkspace = new OperationalInsights.Workspace("atp-prod-loganalytics", new()
{
    ResourceGroupName = "atp-production-rg",
    Location = "eastus",
    RetentionInDays = 730, // maximum interactive retention
    Tags = new()
    {
        { "Retention", "7years" },
        { "Compliance", "SOC2" },
    },
});

// Export to Blob Storage for the full 7-year retention window
var storageAccount = new Storage.Account("atp-prod-logs-backup", new()
{
    ResourceGroupName = "atp-production-rg",
    Location = "eastus",
    Kind = "StorageV2",
    SkuName = "Standard_LRS",
    AccessTier = "Cool", // account-level tier; individual blobs move to Archive via lifecycle rules
    EnableHttpsTrafficOnly = true,
    MinimumTlsVersion = "TLS1_2",
    BlobProperties = new Storage.Inputs.BlobServicePropertiesArgs
    {
        DeleteRetentionPolicy = new Storage.Inputs.DeleteRetentionPolicyArgs
        {
            Enabled = true,
            Days = 365, // soft-delete maximum; pair with an immutability policy for the 7-year hold
        },
        VersioningEnabled = true,
    },
});

Query Examples for Auditors

Auditor Query: Deployment History:

// Deployment history for auditors
ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "applied"
| where TimeGenerated > ago(365d)
| extend 
    Service = extract(@"kustomization/(\S+)", 1, LogEntry, typeof(string)),
    GitCommit = extract(@"revision=(\S+)", 1, LogEntry, typeof(string)),
    DeploymentTime = TimeGenerated
| summarize 
    DeploymentCount = count(),
    LastDeployment = max(DeploymentTime),
    UniqueServices = dcount(Service)
    by bin(TimeGenerated, 1d)
| order by TimeGenerated desc

Auditor Query: Change Approval:

// Change approval audit trail
let PullRequests = AzureDiagnostics
| where ResourceProvider == "MICROSOFT.DEVOPS"
| where Category == "PullRequest"
| extend 
    PRId = tostring(parse_json(properties_s).pullRequestId),
    GitCommit = extract(@"commit=(\S+)", 1, properties_s, typeof(string)),
    Approver = tostring(parse_json(properties_s).approver),
    ApprovalTime = TimeGenerated,
    Status = tostring(parse_json(properties_s).status)
| project PRId, GitCommit, Approver, ApprovalTime, Status;

let Deployments = ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "applied"
| extend 
    GitCommit = extract(@"revision=(\S+)", 1, LogEntry, typeof(string)),
    DeploymentTime = TimeGenerated
| project GitCommit, DeploymentTime;

PullRequests
| join kind=inner Deployments on GitCommit
| project PRId, ApprovalTime, Approver, DeploymentTime, Status
| order by ApprovalTime desc

Auditor Query: Policy Compliance:

// Policy compliance audit
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.AUTHORIZATION"
| where Category == "PolicyState"
| where TimeGenerated > ago(365d)
| extend 
    PolicyName = tostring(parse_json(properties_s).policyDefinitionName),
    ComplianceState = tostring(parse_json(properties_s).complianceState),
    ResourceId = tostring(parse_json(properties_s).resourceId)
| summarize 
    CompliantCount = countif(ComplianceState == "Compliant"),
    NonCompliantCount = countif(ComplianceState == "NonCompliant"),
    TotalChecks = count()
    by PolicyName, bin(TimeGenerated, 1d)
| extend ComplianceRate = (CompliantCount * 100.0) / TotalChecks
| order by TimeGenerated desc

Export for eDiscovery

eDiscovery Export Script:

#!/bin/bash
# scripts/export-ediscovery.sh

START_DATE="${1:-$(date -u -d '7 years ago' +%Y-%m-%dT%H:%M:%SZ)}"
END_DATE="${2:-$(date -u +%Y-%m-%dT%H:%M:%SZ)}"
OUTPUT_PATH="${3:-./ediscovery-export}"

echo "📤 Exporting compliance logs for eDiscovery: $START_DATE to $END_DATE"

# Export deployment audit trail
az monitor log-analytics query \
  --workspace "atp-prod-loganalytics" \
  --analytics-query "
    ContainerLog
    | where ContainerName contains \"kustomize-controller\"
    | where TimeGenerated between (datetime($START_DATE) .. datetime($END_DATE))
    | where LogEntry contains \"applied\" or LogEntry contains \"sync\"
    | project TimeGenerated, ContainerName, LogEntry, Namespace
  " \
  --output table > "$OUTPUT_PATH/deployment-audit-trail.csv"

# Export policy compliance
az monitor log-analytics query \
  --workspace "atp-prod-loganalytics" \
  --analytics-query "
    AzureDiagnostics
    | where ResourceProvider == \"MICROSOFT.AUTHORIZATION\"
    | where Category == \"PolicyState\"
    | where TimeGenerated between (datetime($START_DATE) .. datetime($END_DATE))
    | project TimeGenerated, Category, properties_s
  " \
  --output table > "$OUTPUT_PATH/policy-compliance.csv"

# Export to Blob Storage for long-term storage
az storage blob upload-batch \
  --destination "ediscovery-export" \
  --source "$OUTPUT_PATH" \
  --account-name "atpprodlogsbackup"

echo "✅ Export complete: $OUTPUT_PATH"

DORA Metrics

Deployment Frequency

Deployment Frequency Calculation:

// Deployment frequency (DORA metric)
// Definition: How often deployments are successfully released to production

let SuccessfulDeployments = ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "successfully applied"
| where Namespace == "atp-production"
| extend 
    Service = extract(@"kustomization/(\S+)", 1, LogEntry, typeof(string)),
    DeploymentTime = TimeGenerated
| project DeploymentTime, Service;

SuccessfulDeployments
| summarize 
    DeploymentCount = count(),
    UniqueServices = dcount(Service)
    by Day = bin(DeploymentTime, 1d)
| extend 
    DeploymentFrequency = DeploymentCount, // Deployments per day
    DORA_Level = case(
        DeploymentFrequency >= 1, "Elite", // Daily or more often
        DeploymentFrequency >= 0.142, "High", // At least weekly
        DeploymentFrequency >= 0.033, "Medium", // At least monthly
        "Low" // Less than once per month
    )
| order by Day desc

Deployment Frequency Prometheus Query:

# Deployment frequency (deployments per day)
sum(increase(fluxcd_kustomize_reconciliation_total{status="success", namespace="atp-production"}[1d]))
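The DORA banding used in the KQL above maps deployments-per-day onto the four performance levels, with thresholds at one per day, roughly one per week, and roughly one per month. That classification can be sketched as a small helper (the function name is ours, not from any DORA tooling):

```python
def dora_deployment_frequency_level(deployments_per_day: float) -> str:
    """Classify deployment frequency using the thresholds from the KQL query."""
    if deployments_per_day >= 1:        # daily or better
        return "Elite"
    if deployments_per_day >= 1 / 7:    # at least weekly (~0.142)
        return "High"
    if deployments_per_day >= 1 / 30:   # at least monthly (~0.033)
        return "Medium"
    return "Low"
```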

Lead Time for Changes

Lead Time Calculation:

// Lead time for changes (DORA metric)
// Definition: Time from code commit to production deployment

let Commits = ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "Git commit"
| extend 
    GitCommit = extract(@"commit=(\S+)", 1, LogEntry, typeof(string)),
    CommitTime = TimeGenerated
| project GitCommit, CommitTime;

let Deployments = ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "successfully applied"
| where Namespace == "atp-production"
| extend 
    GitCommit = extract(@"revision=(\S+)", 1, LogEntry, typeof(string)),
    DeploymentTime = TimeGenerated
| project GitCommit, DeploymentTime;

Commits
| join kind=inner Deployments on GitCommit
| extend LeadTimeHours = datetime_diff('hour', DeploymentTime, CommitTime)
| summarize 
    AvgLeadTime = avg(LeadTimeHours),
    P50LeadTime = percentile(LeadTimeHours, 50),
    P95LeadTime = percentile(LeadTimeHours, 95),
    DORA_Level = case(
        avg(LeadTimeHours) < 24, "Elite", // Less than 1 day
        avg(LeadTimeHours) < 168, "High", // Less than 1 week
        avg(LeadTimeHours) < 720, "Medium", // Less than 1 month
        "Low" // More than 1 month
    )
    by Day = bin(DeploymentTime, 1d)
| order by Day desc
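The lead-time join above reduces to a timestamp difference per commit, aggregated into averages and percentiles. A sketch of the same arithmetic, using hypothetical commit/deploy pairs:

```python
from datetime import datetime
from statistics import median

def lead_time_hours(commit_time: datetime, deploy_time: datetime) -> float:
    """Hours between a commit and its production deployment."""
    return (deploy_time - commit_time).total_seconds() / 3600

pairs = [  # (commit, deployed): hypothetical data
    (datetime(2024, 1, 15, 9, 0), datetime(2024, 1, 15, 11, 0)),
    (datetime(2024, 1, 15, 10, 0), datetime(2024, 1, 16, 10, 0)),
]
hours = [lead_time_hours(c, d) for c, d in pairs]
p50 = median(hours)  # 13.0 for the sample above
```

With both lead times under 24 hours, this sample would land in the "Elite" band of the query above.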

Mean Time to Recovery (MTTR)

MTTR Calculation:

// Mean Time to Recovery (MTTR) - DORA metric
// Definition: Average time to recover from a failure in production

let Incidents = ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "failed" or LogEntry contains "error"
| where Namespace == "atp-production"
| extend 
    IncidentStart = TimeGenerated,
    Service = extract(@"kustomization/(\S+)", 1, LogEntry, typeof(string))
| project IncidentStart, Service;

let Recoveries = ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "successfully applied"
| where Namespace == "atp-production"
| extend 
    RecoveryTime = TimeGenerated,
    Service = extract(@"kustomization/(\S+)", 1, LogEntry, typeof(string))
| project RecoveryTime, Service;

Incidents
| join kind=inner Recoveries on Service
| where RecoveryTime >= IncidentStart
| extend RecoveryDurationMinutes = datetime_diff('minute', RecoveryTime, IncidentStart)
| summarize 
    MTTR = avg(RecoveryDurationMinutes),
    P50MTTR = percentile(RecoveryDurationMinutes, 50),
    P95MTTR = percentile(RecoveryDurationMinutes, 95),
    IncidentCount = count(),
    DORA_Level = case(
        avg(RecoveryDurationMinutes) < 60, "Elite", // Less than 1 hour
        avg(RecoveryDurationMinutes) < 1440, "High", // Less than 1 day
        avg(RecoveryDurationMinutes) < 10080, "Medium", // Less than 1 week
        "Low" // More than 1 week
    )
    by Day = bin(IncidentStart, 1d)
| order by Day desc
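The KQL above pairs incidents with any later recovery for the same service; a tighter definition pairs each incident only with its first subsequent recovery. A sketch of that stricter pairing, with timestamps as plain minute offsets and hypothetical data:

```python
def recovery_minutes(incidents, recoveries):
    """Pair each incident with the first recovery at or after it; return durations."""
    durations = []
    ordered = sorted(recoveries)
    for start in sorted(incidents):
        nxt = next((r for r in ordered if r >= start), None)
        if nxt is not None:  # unrecovered incidents are excluded from MTTR
            durations.append(nxt - start)
    return durations

# Hypothetical timestamps in minutes: two incidents, two recoveries
durations = recovery_minutes(incidents=[100, 300], recoveries=[130, 340])
mttr = sum(durations) / len(durations)  # (30 + 40) / 2 = 35.0
```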

Change Failure Rate

Change Failure Rate Calculation:

// Change failure rate (DORA metric)
// Definition: Percentage of deployments that result in a failure in production

let AllDeployments = ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "applied"
| where Namespace == "atp-production"
| extend 
    DeploymentTime = TimeGenerated,
    Service = extract(@"kustomization/(\S+)", 1, LogEntry, typeof(string)),
    Status = case(
        LogEntry contains "successfully", "Success",
        LogEntry contains "failed", "Failed",
        "Unknown"
    )
| where Status != "Unknown"
| project DeploymentTime, Service, Status;

let Failures = ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "failed" or LogEntry contains "error"
| where Namespace == "atp-production"
| extend 
    FailureTime = TimeGenerated,
    Service = extract(@"kustomization/(\S+)", 1, LogEntry, typeof(string))
| project FailureTime, Service;

AllDeployments
| join kind=leftouter Failures on Service
// Keep every deployment; only count a failure when it lands within the window
| extend FailedWithinWindow = iff(
    isnotnull(FailureTime)
    and FailureTime between (DeploymentTime .. DeploymentTime + 1h), // Failure within 1 hour of deployment
    1, 0
  )
// Collapse the fan-out from the join back to one row per deployment
| summarize DeploymentFailed = max(FailedWithinWindow) by DeploymentTime, Service, Status
| summarize 
    TotalDeployments = count(),
    FailedDeployments = sum(DeploymentFailed),
    ChangeFailureRate = (sum(DeploymentFailed) * 100.0) / count(),
    DORA_Level = case(
        (sum(DeploymentFailed) * 100.0) / count() < 5, "Elite", // Less than 5%
        (sum(DeploymentFailed) * 100.0) / count() < 15, "High", // Less than 15%
        (sum(DeploymentFailed) * 100.0) / count() < 45, "Medium", // Less than 45%
        "Low" // More than 45%
    )
    by Day = bin(DeploymentTime, 1d)
| order by Day desc
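The windowing rule above flags a deployment as failed only when an error follows it within one hour. The same rule as a sketch, with times in minutes and hypothetical data:

```python
FAILURE_WINDOW_MIN = 60  # failure counts only within 1 hour of the deployment

def change_failure_rate(deploy_times, failure_times):
    """Percent of deployments followed by a failure within the window."""
    failed = sum(
        any(d <= f <= d + FAILURE_WINDOW_MIN for f in failure_times)
        for d in deploy_times
    )
    return failed * 100.0 / len(deploy_times)

# Four deployments, two of which see a failure within their window
rate = change_failure_rate([0, 200, 400, 600], [30, 650])  # 2 of 4 -> 50.0
```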

Dashboard and Reporting

DORA Metrics Dashboard:

{
  "dashboard": {
    "title": "DORA Metrics Dashboard",
    "panels": [
      {
        "title": "Deployment Frequency",
        "targets": [{
          "expr": "sum(increase(fluxcd_kustomize_reconciliation_total{status=\"success\", namespace=\"atp-production\"}[1d]))",
          "legendFormat": "Deployments/Day"
        }],
        "type": "stat",
        "thresholds": {
          "steps": [
            { "value": 0, "color": "red" },
            { "value": 1, "color": "yellow" },
            { "value": 7, "color": "green" }
          ]
        }
      },
      {
        "title": "Lead Time for Changes",
        "targets": [{
          "expr": "avg(deployment_lead_time_hours)",
          "legendFormat": "Avg Lead Time (hours)"
        }],
        "type": "stat"
      },
      {
        "title": "Mean Time to Recovery (MTTR)",
        "targets": [{
          "expr": "avg(incident_recovery_time_minutes)",
          "legendFormat": "MTTR (minutes)"
        }],
        "type": "stat"
      },
      {
        "title": "Change Failure Rate",
        "targets": [{
          "expr": "sum(rate(deployment_failures_total[1d])) / sum(rate(deployments_total[1d]))",
          "legendFormat": "Failure Rate %"
        }],
        "type": "gauge",
        "thresholds": {
          "steps": [
            { "value": 0, "color": "green" },
            { "value": 0.05, "color": "yellow" },
            { "value": 0.15, "color": "red" }
          ]
        }
      }
    ]
  }
}

DORA Metrics Report:

// Comprehensive DORA metrics report
// Each sub-query must project the shared schema: Day, Metric, Value, DORA_Level
let DeploymentFrequency = ContainerLog
| where ContainerName contains "kustomize-controller"
| where LogEntry contains "successfully applied"
| where Namespace == "atp-production"
| summarize Value = todouble(count()) by Day = bin(TimeGenerated, 1d)
| extend 
    Metric = "DeploymentFrequency",
    DORA_Level = case(Value >= 1, "Elite", Value >= 0.142, "High", Value >= 0.033, "Medium", "Low");
// LeadTime, MTTR, and ChangeFailureRate are built from the preceding queries,
// each projected to the same Day/Metric/Value/DORA_Level schema before the union:
// let LeadTime = ...;
// let MTTR = ...;
// let ChangeFailureRate = ...;

DeploymentFrequency
// | union LeadTime, MTTR, ChangeFailureRate
| project Day, Metric, Value, DORA_Level
| order by Day desc

Summary: Azure Monitor Integration & Observability

  • Azure Monitor Container Insights: Enabling Container Insights on AKS, metrics collection and aggregation, Log Analytics workspace configuration, cost optimization (sampling, retention)
  • Log Analytics Workspace: Workspace per environment or shared strategy, log retention policies (7 years for production), KQL query examples, custom log tables
  • FluxCD Metrics Export: Prometheus metrics from FluxCD, metrics scraping configuration, key metrics to monitor, alerting on FluxCD issues
  • Deployment Metrics: Sync status per application, reconciliation duration, reconciliation failure rate, drift detection events, deployment frequency
  • Application Health: Readiness probe success rate, pod restart count, HTTP error rates, response latency, integration with application metrics
  • Grafana Dashboards: FluxCD operational dashboard, deployment status dashboard, application health dashboard, DORA metrics dashboard
  • Azure Monitor Workbooks: Custom workbooks for GitOps, compliance reporting workbooks, cost analysis workbooks
  • Alerting: Sync failure alerts, drift detection alerts, deployment failure alerts, health check failure alerts, alert routing (email, Teams, PagerDuty)
  • Correlation: Linking Git commits to deployments, linking deployments to application metrics, correlation IDs throughout stack, distributed tracing with Application Insights
  • Compliance Evidence: Deployment audit trail in Log Analytics, 7-year retention, query examples for auditors, export for eDiscovery
  • DORA Metrics: Deployment frequency, lead time for changes, mean time to recovery (MTTR), change failure rate, dashboard and reporting

Rolling Updates & Deployment Strategies

Purpose: Define deployment strategies for ATP services including rolling updates, blue-green deployments, canary releases, and progressive delivery with Flagger, ensuring zero-downtime deployments, automated rollback capabilities, and risk mitigation through gradual traffic shifting and validation gates.


Kubernetes Rolling Updates

Default Rolling Update Strategy

Rolling Update Overview:

graph LR
    subgraph "Rolling Update Process"
        V1[V1 Pods<br/>3 replicas] --> V2[V1: 2 pods<br/>V2: 1 pod]
        V2 --> V3[V1: 1 pod<br/>V2: 2 pods]
        V3 --> V4[V2 Pods<br/>3 replicas]
    end

    style V1 fill:#90EE90
    style V4 fill:#90EE90
    style V2 fill:#FFE5B4
    style V3 fill:#FFE5B4
Hold "Alt" / "Option" to enable pan & zoom

Default Rolling Update Configuration:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
  namespace: atp-production
spec:
  replicas: 5
  strategy:
    type: RollingUpdate  # Default strategy
    rollingUpdate:
      maxSurge: 1        # Allow 1 extra pod during update
      maxUnavailable: 0  # No downtime allowed
  selector:
    matchLabels:
      app: atp-ingestion
  template:
    metadata:
      labels:
        app: atp-ingestion
        version: v1.2.3
    spec:
      containers:
      - name: atp-ingestion
        image: connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10

Rolling Update Strategy Types:

| Strategy Type | Description | Use Case |
|---|---|---|
| RollingUpdate | Gradually replaces old pods with new ones | Default for ATP (zero-downtime) |
| Recreate | Terminates all old pods before creating new ones | ❌ Not recommended (downtime) |

maxSurge and maxUnavailable Settings

maxSurge and maxUnavailable Configuration:

| Configuration | maxSurge | maxUnavailable | Effect |
|---|---|---|---|
| Zero Downtime | 1 | 0 | ATP Production: always maintain service availability |
| Fast Rollout | 2 | 1 | ⚠️ Test/Dev: faster updates, slight capacity reduction |
| Conservative | 1 | 1 | ⚠️ Staging: balanced approach |

Production Configuration:

# Production: Zero-downtime rolling update
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  replicas: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # Allow 1 extra pod (total: 6 pods during update)
      maxUnavailable: 0  # Always maintain 5 ready pods

Rolling Update Math:

  • Total Pods: 5 replicas
  • maxSurge: 1 (can have 6 pods total during update)
  • maxUnavailable: 0 (must always have 5 ready pods)
  • Update Process: Replace 1 pod at a time, wait for readiness, then replace next
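The bounds above follow directly from replicas, maxSurge, and maxUnavailable. A quick sketch of the arithmetic (absolute values only; Kubernetes also accepts percentage forms, which are not modeled here):

```python
def rollout_bounds(replicas: int, max_surge: int, max_unavailable: int):
    """Return (max total pods, min ready pods) during a rolling update."""
    return replicas + max_surge, replicas - max_unavailable

# Production settings from the Deployment above: 5 replicas, surge 1, unavailable 0
assert rollout_bounds(5, 1, 0) == (6, 5)   # up to 6 pods, never fewer than 5 ready
# Dev/test settings: 2 replicas, surge 1, unavailable 1
assert rollout_bounds(2, 1, 1) == (3, 1)   # faster, but can dip to 1 ready pod
```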

Dev/Test Configuration (Faster Rollout):

# Dev/Test: Faster rollout with slight capacity reduction
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # Allow 1 extra pod
      maxUnavailable: 1  # Can temporarily reduce to 1 pod

Rolling Update Process

Rolling Update Steps:

  1. Create New Pod: Kubernetes creates a new pod with new image
  2. Wait for Readiness: New pod must pass readiness probe
  3. Add to Service: New pod receives traffic from Service
  4. Terminate Old Pod: Old pod receives SIGTERM, drains connections
  5. Repeat: Process repeats for remaining pods

Rolling Update Visualization:

sequenceDiagram
    participant K8s as Kubernetes
    participant Old as Old Pods (v1.2.2)
    participant New as New Pods (v1.2.3)
    participant Svc as Service

    Note over K8s: Start Rolling Update
    K8s->>New: Create Pod 1 (v1.2.3)
    New->>New: Wait for Readiness Probe
    New->>Svc: Register with Service
    Svc->>New: Route traffic to Pod 1
    K8s->>Old: Terminate Pod 1 (SIGTERM)
    Old->>Svc: Drain connections
    Old->>Old: Graceful shutdown

    K8s->>New: Create Pod 2 (v1.2.3)
    New->>New: Wait for Readiness Probe
    New->>Svc: Register with Service
    Svc->>New: Route traffic to Pod 2
    K8s->>Old: Terminate Pod 2 (SIGTERM)

    Note over K8s: Repeat until all pods updated
Hold "Alt" / "Option" to enable pan & zoom

Monitor Rolling Update Progress:

# Watch rolling update progress
kubectl rollout status deployment/atp-ingestion -n atp-production

# Get rollout history
kubectl rollout history deployment/atp-ingestion -n atp-production

# Describe rollout
kubectl describe deployment atp-ingestion -n atp-production

Monitoring Rollout Progress

Rollout Status Command:

# Monitor rollout in real-time
kubectl rollout status deployment/atp-ingestion -n atp-production --timeout=10m

# Output example:
# Waiting for deployment "atp-ingestion" rollout to finish: 2 of 5 updated replicas are available...
# Waiting for deployment "atp-ingestion" rollout to finish: 3 of 5 updated replicas are available...
# Waiting for deployment "atp-ingestion" rollout to finish: 4 of 5 updated replicas are available...
# deployment "atp-ingestion" successfully rolled out

Prometheus Metrics for Rollout:

# Rolling update progress
kube_deployment_status_replicas_available{deployment="atp-ingestion"} 
/ 
kube_deployment_status_replicas{deployment="atp-ingestion"}

# Old vs new pods during rollout (pod labels exposed by kube-state-metrics)
count by (label_version) (
  kube_pod_labels{namespace="atp-production", label_app="atp-ingestion"}
)

KQL Query for Rollout Status:

// Rolling update status from Container Insights
InsightsMetrics
| where Origin == "container.azm.ms"
| where Name == "k8sPodCount"
| where Namespace == "atp-production"
| extend 
    Deployment = extract(@"deployment=(\S+)", 1, Tags, typeof(string)),
    PodVersion = extract(@"version=(v\d+\.\d+\.\d+)", 1, Tags, typeof(string))
| summarize 
    PodCount = sum(Val)
    by Deployment, PodVersion, bin(TimeGenerated, 1m)
| order by TimeGenerated desc

Blue-Green Deployments

Blue-Green Concept and Benefits

Blue-Green Deployment Architecture:

graph TB
    subgraph "Traffic Router"
        ING[Ingress Controller]
    end
    subgraph "Blue Environment (Current)"
        BLUE_NS[Namespace: atp-blue]
        BLUE_SVC[Service: atp-ingestion-blue]
        BLUE_PODS[Pods: v1.2.2<br/>5 replicas]
    end
    subgraph "Green Environment (New)"
        GREEN_NS[Namespace: atp-green]
        GREEN_SVC[Service: atp-ingestion-green]
        GREEN_PODS[Pods: v1.2.3<br/>5 replicas]
    end

    ING -->|Current| BLUE_SVC
    ING -.->|Switch| GREEN_SVC
    BLUE_SVC --> BLUE_PODS
    GREEN_SVC --> GREEN_PODS

    style BLUE_PODS fill:#4A90E2
    style GREEN_PODS fill:#90EE90

Blue-Green Benefits:

| Benefit | Description | ATP Use Case |
| --- | --- | --- |
| Instant Rollback | Switch traffic back to blue instantly | ✅ Critical production updates |
| Zero Downtime | Green environment fully ready before switch | ✅ High availability requirement |
| Testing | Validate green environment before traffic | ✅ Production-like validation |
| Risk Reduction | Keep blue environment running during switch | ✅ Critical services |

Blue-Green vs Rolling Update:

| Aspect | Blue-Green | Rolling Update | ATP Decision |
| --- | --- | --- | --- |
| Downtime | ✅ Zero | ✅ Zero | ✅ Both viable |
| Rollback Speed | ✅ Instant (traffic switch) | ⚠️ Slow (re-rollout) | Blue-Green for critical |
| Resource Usage | ❌ 2x during switch | ✅ Efficient | ⚠️ Acceptable for critical services |
| Complexity | ❌ Higher | ✅ Lower | ⚠️ Blue-Green for staging/production |
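The resource-usage trade-off between the two strategies is simple arithmetic; a quick sketch with illustrative replica counts:

```shell
# Peak pod counts during a switch (illustrative values)
REPLICAS=5
MAX_SURGE=1

# Blue-green runs both environments at full size during the switch
BLUE_GREEN_PEAK=$((REPLICAS * 2))

# A rolling update only exceeds the target replica count by maxSurge
ROLLING_PEAK=$((REPLICAS + MAX_SURGE))

echo "blue-green peak pods: $BLUE_GREEN_PEAK"
echo "rolling peak pods: $ROLLING_PEAK"
```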

Implementation with Namespace Switching

Blue Namespace Configuration:

# apps/atp-ingestion/overlays/production-blue/namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: atp-production-blue
  labels:
    environment: production
    deployment-color: blue
---
# Blue Service
apiVersion: v1
kind: Service
metadata:
  name: atp-ingestion-blue
  namespace: atp-production-blue
spec:
  selector:
    app: atp-ingestion
    version: v1.2.2
  ports:
  - port: 80
    targetPort: 8080
---
# Blue Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion-blue
  namespace: atp-production-blue
spec:
  replicas: 5
  selector:
    matchLabels:
      app: atp-ingestion
      version: v1.2.2
  template:
    metadata:
      labels:
        app: atp-ingestion
        version: v1.2.2
        deployment-color: blue
    spec:
      containers:
      - name: atp-ingestion
        image: connectsoft.azurecr.io/atp/ingestion:v1.2.2-def456g
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080

Green Namespace Configuration:

# apps/atp-ingestion/overlays/production-green/namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: atp-production-green
  labels:
    environment: production
    deployment-color: green
---
# Green Service
apiVersion: v1
kind: Service
metadata:
  name: atp-ingestion-green
  namespace: atp-production-green
spec:
  selector:
    app: atp-ingestion
    version: v1.2.3
  ports:
  - port: 80
    targetPort: 8080
---
# Green Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion-green
  namespace: atp-production-green
spec:
  replicas: 5
  selector:
    matchLabels:
      app: atp-ingestion
      version: v1.2.3
  template:
    metadata:
      labels:
        app: atp-ingestion
        version: v1.2.3
        deployment-color: green
    spec:
      containers:
      - name: atp-ingestion
        image: connectsoft.azurecr.io/atp/ingestion:v1.2.3-abc123d
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080

Traffic Routing with Ingress

Ingress with Blue-Green Routing:

# Ingress routing to blue (current)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: atp-ingestion-ingress
  namespace: atp-production
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
    nginx.ingress.kubernetes.io/upstream-vhost: atp-ingestion-blue.atp-production-blue.svc.cluster.local
spec:
  ingressClassName: nginx
  rules:
  - host: atp-ingestion.connectsoft.example
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: atp-ingestion-blue
            port:
              number: 80
        # Cross-namespace service reference
        # Requires ExternalName Service in production namespace

Cross-Namespace Service Reference:

# ExternalName Service in production namespace pointing to blue
apiVersion: v1
kind: Service
metadata:
  name: atp-ingestion-blue
  namespace: atp-production
spec:
  type: ExternalName
  externalName: atp-ingestion-blue.atp-production-blue.svc.cluster.local
---
# Switch to green (update Ingress)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: atp-ingestion-ingress
  namespace: atp-production
spec:
  rules:
  - host: atp-ingestion.connectsoft.example
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: atp-ingestion-green  # Switched to green
            port:
              number: 80

Blue-Green Switch Script:

#!/bin/bash
# scripts/blue-green-switch.sh

ENVIRONMENT="${1:-production}"
CURRENT_COLOR="${2:-blue}"
NEW_COLOR="${3:-green}"

echo "🔄 Switching from $CURRENT_COLOR to $NEW_COLOR environment"

# Wait for the new-color pods to be ready BEFORE switching traffic
kubectl wait --for=condition=available --timeout=5m \
  deployment/atp-ingestion-$NEW_COLOR -n atp-$ENVIRONMENT-$NEW_COLOR

# Update Ingress to route to new color
kubectl patch ingress atp-ingestion-ingress -n atp-$ENVIRONMENT --type=json \
  -p="[{\"op\": \"replace\", \"path\": \"/spec/rules/0/http/paths/0/backend/service/name\", \"value\": \"atp-ingestion-$NEW_COLOR\"}]"

# Verify traffic is routing to the new color
kubectl get ingress atp-ingestion-ingress -n atp-$ENVIRONMENT -o jsonpath='{.spec.rules[0].http.paths[0].backend.service.name}'

echo "✅ Traffic switched to $NEW_COLOR environment"

Rollback to Blue Environment

Instant Rollback to Blue:

#!/bin/bash
# scripts/blue-green-rollback.sh

ENVIRONMENT="${1:-production}"
CURRENT_COLOR="${2:-green}"
ROLLBACK_COLOR="${3:-blue}"

echo "⏪ Rolling back to $ROLLBACK_COLOR environment"

# Switch traffic back to blue
kubectl patch ingress atp-ingestion-ingress -n atp-$ENVIRONMENT --type=json \
  -p="[{\"op\": \"replace\", \"path\": \"/spec/rules/0/http/paths/0/backend/service/name\", \"value\": \"atp-ingestion-$ROLLBACK_COLOR\"}]"

echo "✅ Rollback complete - traffic routed to $ROLLBACK_COLOR"

# Optionally scale down green environment to save resources
# kubectl scale deployment atp-ingestion-$CURRENT_COLOR -n atp-$ENVIRONMENT-$CURRENT_COLOR --replicas=0

When to Use Blue-Green

Blue-Green Deployment Decision Matrix:

| Service Type | Blue-Green Recommended? | Rationale |
| --- | --- | --- |
| Critical Services (Gateway, Authentication) | Yes | Instant rollback capability |
| Database Migrations | Yes | Test new version before traffic |
| High-Traffic Services | Yes | Reduce risk of performance issues |
| Low-Risk Updates | ⚠️ Optional | Rolling update may be sufficient |
| Resource-Constrained | ❌ No | 2x resource usage during switch |

ATP Blue-Green Strategy:

- Production Critical Services: Blue-Green for gateway, authentication, ingestion
- Production Standard Services: Rolling update sufficient
- Staging: Blue-Green for validation before production promotion
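The strategy above can be encoded as a small helper; `deploy_strategy` is a hypothetical sketch, not part of the ATP tooling, and the service names in the case arms are the examples from the strategy list:

```shell
# Pick a deployment strategy per service criticality (sketch)
deploy_strategy() {
  case "$1" in
    gateway|authentication|ingestion)
      echo "blue-green" ;;      # critical production services
    *)
      echo "rolling-update" ;;  # standard services
  esac
}

deploy_strategy gateway      # -> blue-green
deploy_strategy reporting    # -> rolling-update
```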


Canary Releases

Canary Deployment Concept

Canary Release Architecture:

graph TB
    subgraph "Traffic Router"
        ING[Ingress Controller]
        SVC[Service]
    end
    subgraph "Stable Version"
        STABLE[Stable Pods<br/>v1.2.2<br/>90% traffic]
    end
    subgraph "Canary Version"
        CANARY[Canary Pods<br/>v1.2.3<br/>10% traffic]
    end

    ING -->|90%| STABLE
    ING -->|10%| CANARY
    SVC --> STABLE
    SVC --> CANARY

    style STABLE fill:#4A90E2
    style CANARY fill:#FFD700

Canary Release Benefits:

| Benefit | Description |
| --- | --- |
| Risk Reduction | Test new version with limited traffic |
| Gradual Rollout | Increase traffic percentage gradually (10% → 50% → 100%) |
| Automated Validation | Monitor metrics and auto-rollback on issues |
| User Impact Minimization | Only a small percentage of users affected if issues occur |
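A gradual rollout pairs each canary weight with its stable complement. A minimal sketch of that schedule (the 10/50/100 steps mirror the progression described here; `shift_schedule` is a hypothetical helper):

```shell
# Weighted route pairs for the gradual shift (stable weight is the complement)
shift_schedule() {
  for CANARY in 10 50 100; do
    STABLE=$((100 - CANARY))
    echo "canary=${CANARY}% stable=${STABLE}%"
  done
}

shift_schedule
```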

Traffic Splitting Strategies

Traffic Splitting with Service Mesh (Istio):

# Istio VirtualService for canary traffic splitting
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: atp-ingestion-canary
  namespace: atp-production
spec:
  hosts:
  - atp-ingestion.connectsoft.example
  http:
  - match:
    - headers:
        canary:
          exact: "true"
    route:
    - destination:
        host: atp-ingestion
        subset: canary
      weight: 100
  - route:
    - destination:
        host: atp-ingestion
        subset: stable
      weight: 90  # 90% to stable
    - destination:
        host: atp-ingestion
        subset: canary
      weight: 10  # 10% to canary
---
# DestinationRule for stable and canary subsets
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: atp-ingestion
  namespace: atp-production
spec:
  host: atp-ingestion
  subsets:
  - name: stable
    labels:
      version: v1.2.2
  - name: canary
    labels:
      version: v1.2.3

Traffic Splitting with Nginx Ingress:

# Nginx Ingress with canary annotations
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: atp-ingestion-canary
  namespace: atp-production
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"  # 10% traffic to canary
    nginx.ingress.kubernetes.io/canary-by-header: "canary"
    nginx.ingress.kubernetes.io/canary-by-header-value: "true"
spec:
  ingressClassName: nginx
  rules:
  - host: atp-ingestion.connectsoft.example
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: atp-ingestion-canary
            port:
              number: 80
---
# Main Ingress (90% traffic)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: atp-ingestion-main
  namespace: atp-production
spec:
  ingressClassName: nginx
  rules:
  - host: atp-ingestion.connectsoft.example
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: atp-ingestion-stable
            port:
              number: 80

Service Mesh Requirement (Linkerd/Istio)

Service Mesh Comparison:

| Feature | Istio | Linkerd | ATP Selection |
| --- | --- | --- | --- |
| Traffic Splitting | ✅ Advanced | ✅ Simple | Istio (more features) |
| Observability | ✅ Comprehensive | ✅ Good | Istio |
| Resource Usage | ❌ High | ✅ Low | ⚠️ Acceptable for production |
| Learning Curve | ❌ Steep | ✅ Easy | ⚠️ Investment required |

ATP Decision: Istio (for advanced canary features)

Gradual Traffic Shift (10% → 50% → 100%)

Progressive Traffic Shift:

# Stage 1: 10% canary
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: atp-ingestion-canary
spec:
  http:
  - route:
    - destination:
        host: atp-ingestion
        subset: stable
      weight: 90
    - destination:
        host: atp-ingestion
        subset: canary
      weight: 10  # 10% canary
---
# Stage 2: 50% canary (after validation)
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: atp-ingestion-canary
spec:
  http:
  - route:
    - destination:
        host: atp-ingestion
        subset: stable
      weight: 50
    - destination:
        host: atp-ingestion
        subset: canary
      weight: 50  # 50% canary
---
# Stage 3: 100% canary (promote to stable)
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: atp-ingestion-canary
spec:
  http:
  - route:
    - destination:
        host: atp-ingestion
        subset: canary
      weight: 100  # 100% canary (new stable)

Automated Traffic Shift Script:

#!/bin/bash
# scripts/canary-traffic-shift.sh

CANARY_WEIGHT="${1:-10}"
NAMESPACE="${2:-atp-production}"

echo "🎯 Shifting $CANARY_WEIGHT% traffic to canary"

# Update VirtualService weights
# Assumes spec.http[0] holds the weighted routes: stable at index 0, canary at
# index 1 (adjust the indices if a header-match route precedes them)
kubectl patch virtualservice atp-ingestion-canary -n $NAMESPACE --type=json \
  -p="[{\"op\": \"replace\", \"path\": \"/spec/http/0/route/0/weight\", \"value\": $((100 - CANARY_WEIGHT))}, {\"op\": \"replace\", \"path\": \"/spec/http/0/route/1/weight\", \"value\": $CANARY_WEIGHT}]"

echo "✅ Traffic shifted: $CANARY_WEIGHT% canary, $((100 - CANARY_WEIGHT))% stable"

# Monitor for 5 minutes before next stage
sleep 300

Automated Canary Analysis

Canary Analysis Metrics:

# Flagger Canary with automated analysis (see Flagger section)
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: atp-ingestion
  namespace: atp-production
spec:
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99
      interval: 1m
    - name: request-duration
      thresholdRange:
        max: 500
      interval: 1m
    - name: error-rate
      thresholdRange:
        max: 1
      interval: 1m

Progressive Delivery with Flagger

Flagger Overview and Architecture

Flagger Architecture:

graph TB
    subgraph "GitOps Repository"
        GIT[Git Commit<br/>New Version]
    end
    subgraph "Flagger Controller"
        FLAGGER[Flagger<br/>Canary Controller]
        METRICS[Metrics Provider<br/>Prometheus]
    end
    subgraph "Traffic Router"
        ISTIO[Istio VirtualService]
    end
    subgraph "Deployment"
        STABLE[Stable Deployment]
        CANARY[Canary Deployment]
    end

    GIT -->|Triggers| FLAGGER
    FLAGGER -->|Creates| CANARY
    FLAGGER -->|Monitors| METRICS
    METRICS -->|Validates| FLAGGER
    FLAGGER -->|Updates| ISTIO
    ISTIO -->|Routes| STABLE
    ISTIO -->|Routes| CANARY

    style FLAGGER fill:#FFD700
    style CANARY fill:#90EE90

Flagger Benefits:

| Benefit | Description |
| --- | --- |
| Automated Rollout | Automatic canary promotion based on metrics |
| Automated Rollback | Automatic rollback on threshold violations |
| Traffic Shifting | Gradual traffic increase (10% → 50% → 100%) |
| Metric Validation | Validate latency, error rate, custom metrics |
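Given `stepWeight`, `maxWeight`, and `interval` from a Canary analysis block, the minimum time a healthy canary needs to reach `maxWeight` follows directly. A rough sketch (assumes every metric check passes and ignores the promotion phase):

```shell
# Minimum time for a healthy canary to reach maxWeight (rough estimate)
STEP_WEIGHT=10
MAX_WEIGHT=50
INTERVAL_MINUTES=1

# Number of weight increases needed
STEPS=$((MAX_WEIGHT / STEP_WEIGHT))
MINUTES=$((STEPS * INTERVAL_MINUTES))

echo "weight steps: $STEPS"
echo "minutes to reach maxWeight: $MINUTES"
```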

Installation and Configuration

Flagger Installation via Helm:

# Add Flagger Helm repository
helm repo add flagger https://flagger.app
helm repo update

# Install Flagger with Istio support
helm upgrade --install flagger flagger/flagger \
  --namespace flagger-system \
  --create-namespace \
  --set meshProvider=istio \
  --set metricsServer=http://prometheus.monitoring:9090

Flagger Installation via Pulumi:

// Install Flagger via Helm chart
var flaggerRelease = new Pulumi.Kubernetes.Helm.V3.Release("flagger", new()
{
    Chart = "flagger",
    RepositoryOpts = new Pulumi.Kubernetes.Helm.V3.Inputs.RepositoryOptsArgs
    {
        Repo = "https://flagger.app",
    },
    Namespace = "flagger-system",
    CreateNamespace = true,
    Values = new Dictionary<string, object>
    {
        { "meshProvider", "istio" },
        { "metricsServer", "http://prometheus.monitoring:9090" },
    },
});

Canary Resource Definition

Flagger Canary Configuration:

# apps/atp-ingestion/overlays/production/canary.yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: atp-ingestion
  namespace: atp-production
spec:
  # Target deployment to manage
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: atp-ingestion

  # Service to create/update
  service:
    port: 8080
    targetPort: 8080
    portDiscovery: true

  # Traffic management
  provider: istio
  trafficRouting:
    istio:
      virtualService:
        hosts:
        - atp-ingestion.connectsoft.example
        gateways:
        - public-gateway
      destinationRule:
        host: atp-ingestion
        subsets:
        - name: stable
          labels:
            version: stable
        - name: canary
          labels:
            version: canary

  # Canary analysis configuration
  analysis:
    interval: 1m           # Check metrics every 1 minute
    threshold: 5           # Max failed metric checks before rollback
    maxWeight: 50          # Maximum canary traffic (50%)
    stepWeight: 10         # Traffic increase per step (10%)
    stepWeightPromotion: 50 # Traffic step during the promotion phase
    stepWeights: [10, 20, 30, 40, 50]  # Custom traffic steps (overrides stepWeight)

    # Metrics to validate
    metrics:
    # Request success rate
    - name: request-success-rate
      thresholdRange:
        min: 99            # Minimum 99% success rate
      interval: 1m
      queryTemplate: |
        sum(rate(istio_requests_total{
          destination_workload_namespace="{{ namespace }}",
          destination_workload=~"{{ target }}",
          response_code!~"5.."
        }[1m]))
        /
        sum(rate(istio_requests_total{
          destination_workload_namespace="{{ namespace }}",
          destination_workload=~"{{ target }}"
        }[1m]))
        * 100

    # Request duration (p95 latency)
    - name: request-duration
      thresholdRange:
        max: 500           # Maximum 500ms p95 latency
      interval: 1m
      queryTemplate: |
        histogram_quantile(0.95,
          sum(rate(istio_request_duration_milliseconds_bucket{
            destination_workload_namespace="{{ namespace }}",
            destination_workload=~"{{ target }}"
          }[1m])) by (le)
        )

    # Error rate
    - name: error-rate
      thresholdRange:
        max: 1             # Maximum 1% error rate
      interval: 1m
      queryTemplate: |
        sum(rate(istio_requests_total{
          destination_workload_namespace="{{ namespace }}",
          destination_workload=~"{{ target }}",
          response_code=~"5.."
        }[1m]))
        /
        sum(rate(istio_requests_total{
          destination_workload_namespace="{{ namespace }}",
          destination_workload=~"{{ target }}"
        }[1m]))
        * 100

    # Custom business metric
    - name: business-metric-threshold
      thresholdRange:
        min: 95            # Minimum business metric value
      interval: 1m
      queryTemplate: |
        avg(rate(atp_business_metric_total{
          service="{{ target }}",
          namespace="{{ namespace }}"
        }[1m]))

  # Webhooks for pre/post deployment validation
  webhooks:
  # Pre-rollout validation (smoke tests)
  - name: smoke-tests
    type: pre-rollout
    url: http://smoke-tests.atp-production:8080/validate
    timeout: 30s
    metadata:
      type: "bash"
      cmd: "kubectl exec -n atp-production deployment/smoke-tests -- /bin/sh -c 'curl -f http://atp-ingestion-canary:8080/health || exit 1'"

  # Post-rollout validation
  - name: integration-tests
    type: rollout
    url: http://integration-tests.atp-production:8080/validate
    timeout: 2m
    metadata:
      type: "bash"
      cmd: "kubectl exec -n atp-production deployment/integration-tests -- /bin/sh -c 'curl -f http://atp-ingestion-canary:8080/api/health || exit 1'"

  # Load testing
  - name: load-test
    type: rollout
    url: http://load-test.atp-production:8080/start
    timeout: 5m
    metadata:
      cmd: "kubectl exec -n atp-production deployment/load-test -- /bin/sh -c 'artillery run test.yaml'"

Automated Rollback on Metric Thresholds

Flagger Rollback Configuration:

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: atp-ingestion
spec:
  analysis:
    # Rollback triggers
    alerts:
    # Alert on high error rate
    - name: "error-rate-high"
      severity: error
      providerRef:
        name: prometheus-alerts
        namespace: monitoring

    # Rollback on metric threshold violation
    metrics:
    - name: error-rate
      thresholdRange:
        max: 1             # Rollback if error rate > 1%
      interval: 30s
      # A failed check counts toward analysis.threshold; Flagger rolls back
      # automatically once the threshold is exceeded

  # Skip traffic increase if metrics fail
  skipAnalysis: false  # Don't skip validation

  # Automatic rollback on failure
  revertOnDeletion: true

Flagger Rollback Status:

# Check canary status
kubectl get canary atp-ingestion -n atp-production

# Inspect canary events and rollout status
kubectl describe canary atp-ingestion -n atp-production

# Check Flagger logs
kubectl logs -n flagger-system deployment/flagger -f

Integration with Service Mesh

Flagger with Istio Integration:

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: atp-ingestion
spec:
  provider: istio
  trafficRouting:
    istio:
      virtualService:
        hosts:
        - atp-ingestion.connectsoft.example
        gateways:
        - istio-system/public-gateway
      destinationRule:
        host: atp-ingestion
        subsets:
        - name: stable
          trafficPolicy:
            loadBalancer:
              consistentHash:
                httpHeaderName: X-User-ID  # Session affinity
        - name: canary
          trafficPolicy:
            loadBalancer:
              consistentHash:
                httpHeaderName: X-User-ID

Feature Flags Integration

LaunchDarkly or Azure App Configuration

Feature Flags Strategy:

| Feature Flag Provider | Pros | Cons | ATP Selection |
| --- | --- | --- | --- |
| LaunchDarkly | ✅ Advanced targeting, A/B testing | ❌ Cost, external dependency | ⚠️ Consider for advanced use cases |
| Azure App Configuration | ✅ Native Azure, integrated | ⚠️ Fewer features than LaunchDarkly | ATP Default (cost-effective) |
| Custom Solution | ✅ Full control | ❌ Maintenance overhead | ❌ Not recommended |

ATP Decision: Azure App Configuration (native Azure integration)

Feature Flag-Based Rollout

Azure App Configuration Setup:

// Configure Azure App Configuration
services.AddAzureAppConfiguration(options =>
{
    options.Connect(connectionString)
        .Select(KeyFilter.Any, LabelFilter.Null)
        .Select(KeyFilter.Any, "Production")
        .ConfigureRefresh(refresh =>
        {
            refresh.Register("FeatureFlags:CanaryEnabled", refreshAll: true)
                .SetCacheExpiration(TimeSpan.FromSeconds(30));
        });
});

Feature Flag Integration in Application:

// C#: Feature flag for canary deployment
public class FeatureFlagService
{
    private readonly IConfiguration _configuration;

    public FeatureFlagService(IConfiguration configuration)
    {
        _configuration = configuration;
    }

    public bool IsCanaryEnabled()
    {
        return _configuration.GetValue<bool>("FeatureFlags:CanaryEnabled", defaultValue: false);
    }

    public int GetCanaryTrafficPercentage()
    {
        return _configuration.GetValue<int>("FeatureFlags:CanaryTrafficPercentage", defaultValue: 0);
    }
}

// Use feature flag to control behavior
[ApiController]
[Route("[controller]")]
public class IngestionController : ControllerBase
{
    private readonly FeatureFlagService _featureFlags;

    public IngestionController(FeatureFlagService featureFlags)
    {
        _featureFlags = featureFlags;
    }

    [HttpPost("events")]
    public async Task<IActionResult> IngestEvent([FromBody] Event evt)
    {
        // New feature enabled via feature flag
        if (_featureFlags.IsCanaryEnabled())
        {
            // Use new processing logic
            await ProcessEventV2(evt);
        }
        else
        {
            // Use stable processing logic
            await ProcessEventV1(evt);
        }

        return Ok();
    }
}

Feature Flag Configuration:

# Azure App Configuration via ConfigMap (reference)
apiVersion: v1
kind: ConfigMap
metadata:
  name: feature-flags
  namespace: atp-production
data:
  FeatureFlags__CanaryEnabled: "false"
  FeatureFlags__CanaryTrafficPercentage: "0"
---
# External Secret for Azure App Configuration connection
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-config-connection
  namespace: atp-production
spec:
  secretStoreRef:
    name: azure-keyvault
    kind: ClusterSecretStore
  data:
  - secretKey: AppConfigConnectionString
    remoteRef:
      key: connection-strings/app-config-connection-string

Gradual Feature Enablement

Gradual Feature Rollout:

// Gradually enable feature for a stable percentage of users
public class FeatureFlagService
{
    private readonly IConfiguration _configuration;

    public FeatureFlagService(IConfiguration configuration)
    {
        _configuration = configuration;
    }

    public bool ShouldEnableFeature(string userId)
    {
        var percentage = GetFeatureRolloutPercentage();
        var userHash = GetUserHash(userId);
        return (userHash % 100) < percentage;
    }

    private int GetFeatureRolloutPercentage()
    {
        // Gradually increase: 10% → 25% → 50% → 100%
        return _configuration.GetValue<int>("FeatureFlags:RolloutPercentage", defaultValue: 0);
    }

    private static int GetUserHash(string userId)
    {
        // Deterministic hash (string.GetHashCode is randomized per process in .NET)
        unchecked
        {
            var hash = 23;
            foreach (var c in userId) hash = hash * 31 + c;
            return hash & int.MaxValue;
        }
    }
}
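The same bucketing logic can be sketched in shell for quick experiments; `rollout_bucket` and `should_enable` are hypothetical helpers, with `cksum` standing in for a stable hash:

```shell
# Deterministic percentage bucketing (mirrors the C# logic above)
rollout_bucket() {
  # Hash the user id into a stable bucket in 0..99
  hash=$(printf '%s' "$1" | cksum | cut -d' ' -f1)
  echo $((hash % 100))
}

should_enable() {
  bucket=$(rollout_bucket "$1")
  if [ "$bucket" -lt "$2" ]; then echo "enabled"; else echo "disabled"; fi
}

should_enable "user-42" 100   # always enabled at 100%
should_enable "user-42" 0     # always disabled at 0%
```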

Kill Switch for Problem Features

Kill Switch Implementation:

// Kill switch for emergency feature disable
public class FeatureFlagService
{
    public bool IsFeatureKilled(string featureName)
    {
        return _configuration.GetValue<bool>($"FeatureFlags:KillSwitch:{featureName}", defaultValue: false);
    }
}

// Use kill switch
if (_featureFlags.IsFeatureKilled("NewProcessingLogic"))
{
    // Immediately fall back to stable logic
    await ProcessEventV1(evt);
}
else
{
    await ProcessEventV2(evt);
}

Pre-Deployment Validation

Smoke Tests Before Traffic Routing

Smoke Test Webhook:

# Flagger pre-rollout webhook
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: atp-ingestion
spec:
  analysis:
    webhooks:
    - name: smoke-tests
      type: pre-rollout
      url: http://smoke-tests.atp-production:8080/validate
      timeout: 30s
      metadata:
        type: "bash"
        cmd: |
          kubectl exec -n atp-production deployment/smoke-tests -- /bin/sh -c '
            # Health check
            curl -f http://atp-ingestion-canary:8080/health/live || exit 1
            curl -f http://atp-ingestion-canary:8080/health/ready || exit 1

            # Basic API test
            curl -f -X POST http://atp-ingestion-canary:8080/api/events \
              -H "Content-Type: application/json" \
              -d "{\"eventType\":\"test\"}" || exit 1
          '

Smoke Test Job:

# Pre-deployment smoke test job
apiVersion: batch/v1
kind: Job
metadata:
  name: smoke-tests-pre-deploy
  namespace: atp-production
spec:
  template:
    spec:
      containers:
      - name: smoke-tests
        image: connectsoft.azurecr.io/atp/smoke-tests:latest
        env:
        - name: TARGET_URL
          value: "http://atp-ingestion-canary:8080"
        command:
        - /bin/sh
        - -c
        - |
          echo "Running smoke tests..."
          # Health checks
          curl -f $TARGET_URL/health/live || exit 1
          curl -f $TARGET_URL/health/ready || exit 1

          # API validation
          response=$(curl -s -X POST $TARGET_URL/api/events \
            -H "Content-Type: application/json" \
            -d '{"eventType":"test"}')

          if [ $? -ne 0 ]; then
            echo "Smoke tests failed"
            exit 1
          fi

          echo "Smoke tests passed"
      restartPolicy: Never

Integration Tests in Canary

Integration Test Webhook:

# Flagger rollout webhook for integration tests
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: atp-ingestion
spec:
  analysis:
    webhooks:
    - name: integration-tests
      type: rollout
      url: http://integration-tests.atp-production:8080/validate
      timeout: 5m
      metadata:
        type: "bash"
        cmd: |
          kubectl exec -n atp-production deployment/integration-tests -- /bin/sh -c '
            # Run integration test suite
            dotnet test IntegrationTests.csproj \
              --filter "Category=CanaryValidation" \
              --logger "trx;LogFileName=results.trx" \
              --results-directory /tmp/results

            # Check test results
            if [ $? -ne 0 ]; then
              echo "Integration tests failed"
              exit 1
            fi
          '

Database Migration Checks

Database Migration Validation:

# Pre-deployment database migration check
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migration-check
  namespace: atp-production
spec:
  template:
    spec:
      containers:
      - name: migration-check
        image: connectsoft.azurecr.io/atp/migration-tool:latest
        env:
        - name: CONNECTION_STRING
          valueFrom:
            secretKeyRef:
              name: sql-connection-string
              key: connection-string
        command:
        - /bin/sh
        - -c
        - |
          echo "Checking database migrations..."

          # Check if pending migrations exist
          dotnet ef migrations list --connection "$CONNECTION_STRING"

          # Generate the migration SQL without applying it (dry-run equivalent)
          dotnet ef migrations script --idempotent -o /tmp/migration.sql

          if [ $? -ne 0 ]; then
            echo "Database migration validation failed"
            exit 1
          fi

          echo "Database migrations validated"
      restartPolicy: Never

Dependency Availability Checks

Dependency Check Script:

#!/bin/bash
# scripts/pre-deployment-checks.sh

echo "🔍 Running pre-deployment validation checks..."

# Check Redis availability
redis-cli -h redis.atp-production ping || {
  echo "❌ Redis not available"
  exit 1
}

# Check SQL Database connectivity
sqlcmd -S sql-server.database.windows.net -U $DB_USER -P $DB_PASSWORD -Q "SELECT 1" || {
  echo "❌ SQL Database not accessible"
  exit 1
}

# Check Service Bus
az servicebus queue show --namespace-name atp-servicebus --resource-group atp-production --name audit-events || {
  echo "❌ Service Bus not accessible"
  exit 1
}

# Check Key Vault
az keyvault secret show --vault-name atp-keyvault --name test-secret || {
  echo "❌ Key Vault not accessible"
  exit 1
}

echo "✅ All dependency checks passed"
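Dependency probes like these are often flaky on cold starts. A small retry wrapper makes them more robust; this is a sketch, with `retry` as a hypothetical helper using a fixed 1-second backoff:

```shell
# Retry a command up to N times before giving up
retry() {
  attempts=$1; shift
  n=1
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      return 1
    fi
    n=$((n + 1))
    sleep 1
  done
}

# Example: wrap a probe that may need a few attempts
retry 3 true && echo "dependency reachable"
```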

Post-Deployment Validation

Health Check Monitoring

Post-Deployment Health Checks:

# Flagger post-rollout validation
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: atp-ingestion
spec:
  analysis:
    webhooks:
    - name: post-deployment-health
      type: post-rollout
      url: http://health-monitor.atp-production:8080/validate
      timeout: 2m
      metadata:
        type: "bash"
        cmd: |
          kubectl exec -n atp-production deployment/health-monitor -- /bin/sh -c '
            # Monitor health for 2 minutes
            for i in $(seq 1 24); do
              health=$(curl -s -o /dev/null -w "%{http_code}" http://atp-ingestion-canary:8080/health/live)
              if [ "$health" != "200" ]; then
                echo "Health check failed: $health"
                exit 1
              fi
              sleep 5
            done
            echo "Health checks passed"
          '

Error Rate Thresholds

Error Rate Validation:

# Flagger metric for error rate validation
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: atp-ingestion
spec:
  analysis:
    metrics:
    - name: error-rate
      thresholdRange:
        max: 1  # Maximum 1% error rate
      interval: 1m
      queryTemplate: |
        sum(rate(http_requests_total{
          service="{{ target }}",
          status=~"5.."
        }[1m]))
        /
        sum(rate(http_requests_total{
          service="{{ target }}"
        }[1m]))
        * 100

Latency Thresholds

Latency Validation:

# Flagger metric for latency validation
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: atp-ingestion
spec:
  analysis:
    metrics:
    - name: request-duration-p95
      thresholdRange:
        max: 500  # Maximum 500ms p95 latency
      interval: 1m
      queryTemplate: |
        histogram_quantile(0.95,
          sum(rate(http_request_duration_seconds_bucket{
            service="{{ target }}"
          }[1m])) by (le)
        ) * 1000

Business Metric Validation

Custom Business Metric:

# Flagger metric for business metric validation
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: atp-ingestion
spec:
  analysis:
    metrics:
    - name: event-processing-success-rate
      thresholdRange:
        min: 99.5  # Minimum 99.5% success rate
      interval: 1m
      queryTemplate: |
        sum(rate(atp_events_processed_total{
          service="{{ target }}",
          status="success"
        }[1m]))
        /
        sum(rate(atp_events_processed_total{
          service="{{ target }}"
        }[1m]))
        * 100

Automatic Rollback Triggers

Error Rate Exceeds Threshold

Error Rate Rollback:

# Flagger automatic rollback on error rate
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: atp-ingestion
spec:
  analysis:
    interval: 30s
    # Roll back after this many consecutive failed metric checks
    threshold: 2
    metrics:
    - name: request-success-rate  # Flagger built-in metric
      thresholdRange:
        min: 99  # Fail the check if success rate < 99% (i.e. error rate > 1%)
      interval: 30s

Prometheus Alert for Error Rate:

# PrometheusRule for error rate alert
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: canary-error-rate
  namespace: monitoring
spec:
  groups:
  - name: canary
    rules:
    - alert: CanaryHighErrorRate
      expr: |
        sum(rate(http_requests_total{status=~"5.."}[1m])) 
        / 
        sum(rate(http_requests_total[1m])) 
        > 0.01  # 1% error rate
      for: 30s
      labels:
        severity: critical
      annotations:
        summary: "Canary error rate exceeds threshold - rollback triggered"

Latency Degrades Beyond SLO

Latency Rollback:

# Flagger automatic rollback on latency degradation
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: atp-ingestion
spec:
  analysis:
    interval: 30s
    # Roll back after this many consecutive failed metric checks
    threshold: 2
    metrics:
    - name: request-duration  # Flagger built-in: P99 latency in milliseconds
      thresholdRange:
        max: 500  # Fail the check if P99 latency > 500ms
      interval: 30s

Health Checks Fail Consistently

Health Check Rollback:

# Flagger automatic rollback on health check failure
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: atp-ingestion
spec:
  analysis:
    webhooks:
    - name: health-check
      type: rollout
      url: http://health-monitor:8080/check
      timeout: 10s
      metadata:
        cmd: |
          health=$(curl -s -o /dev/null -w "%{http_code}" http://atp-ingestion-canary:8080/health/live)
          if [ "$health" != "200" ]; then
            echo "Health check failed"
            exit 1  # Failed check; repeated failures halt the canary and trigger rollback
          fi

Custom Metric-Based Rollback

Custom Metric Rollback:

# MetricTemplate for the custom business metric
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: business-metric-threshold
  namespace: flagger-system  # namespace where Flagger runs
spec:
  provider:
    type: prometheus
    address: http://prometheus.monitoring:9090
  query: |
    avg(rate(atp_business_metric_total{
      service="{{ target }}"
    }[1m]))
---
# Flagger automatic rollback on custom metric
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: atp-ingestion
spec:
  analysis:
    # Roll back after this many consecutive failed metric checks
    threshold: 2
    metrics:
    - name: business-metric-threshold
      templateRef:
        name: business-metric-threshold
        namespace: flagger-system
      thresholdRange:
        min: 95  # Fail the check if business metric < 95
      interval: 1m

Flagger Rollback Status:

# Check if rollback occurred
kubectl get canary atp-ingestion -n atp-production -o jsonpath='{.status.conditions[?(@.type=="Promoted")].status}'

# View rollback reason
kubectl describe canary atp-ingestion -n atp-production | grep -A 10 "Status"

Deployment Windows

Scheduled Maintenance Windows

Maintenance Window Configuration:

Flagger has no built-in cron scheduling for canaries. Deployment windows are enforced upstream: gate the pipeline that merges changes into the GitOps repository (shown below), or suspend reconciliation outside the window so no new revisions reach the cluster:

# Suspend reconciliation outside the maintenance window
flux suspend kustomization apps-production -n flux-system

# Resume at the start of the window (e.g. 2 AM UTC, via a scheduled job)
flux resume kustomization apps-production -n flux-system

Azure Pipeline Deployment Window:

# Azure Pipeline with deployment window
trigger: none

schedules:
- cron: "0 2 * * *"  # 2 AM UTC daily
  branches:
    include:
    - production
  always: true

pool:
  vmImage: 'ubuntu-latest'

stages:
- stage: Deploy
  condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/production'))
  jobs:
  - job: Deploy
    steps:
    - script: |
        echo "Deploying during maintenance window (2 AM UTC)"
        # Deployment steps

Change Freeze Periods

Change Freeze Configuration:

# Flagger with change freeze
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: atp-ingestion
spec:
  # Suspend canary during change freeze
  # Annotation to prevent deployments
  annotations:
    flagger.app/change-freeze: "true"

Change Freeze Script:

#!/bin/bash
# scripts/change-freeze.sh

ACTION="${1:-enable}"  # enable or disable
NAMESPACE="${2:-atp-production}"

if [ "$ACTION" == "enable" ]; then
  echo "🔒 Enabling change freeze"
  kubectl annotate canary atp-ingestion -n $NAMESPACE \
    flagger.app/change-freeze="true" \
    --overwrite

  # Suspend FluxCD reconciliations
  flux suspend kustomization apps-production -n flux-system

  echo "✅ Change freeze enabled"
elif [ "$ACTION" == "disable" ]; then
  echo "🔓 Disabling change freeze"
  kubectl annotate canary atp-ingestion -n $NAMESPACE \
    flagger.app/change-freeze- \
    --overwrite

  # Resume FluxCD reconciliations
  flux resume kustomization apps-production -n flux-system

  echo "✅ Change freeze disabled"
fi

Emergency Deployment Procedures

Emergency Deployment Bypass:

#!/bin/bash
# scripts/emergency-deploy.sh

SERVICE="${1:-atp-ingestion}"
VERSION="${2:-v1.2.3-abc123d}"
NAMESPACE="${3:-atp-production}"

echo "🚨 Emergency deployment: $SERVICE@$VERSION"

# Bypass change freeze
kubectl annotate canary $SERVICE -n $NAMESPACE \
  flagger.app/emergency-deploy="true" \
  flagger.app/change-freeze- \
  --overwrite

# Force immediate rollout (bypasses canary analysis; assumes the container
# name matches the service name)
kubectl set image deployment/$SERVICE \
  $SERVICE=connectsoft.azurecr.io/atp/$SERVICE:$VERSION \
  -n $NAMESPACE

# Watch the rollout (readiness gates still apply)
kubectl rollout status deployment/$SERVICE -n $NAMESPACE

echo "✅ Emergency deployment initiated"

Zero-Downtime Deployments

Connection Draining

Connection Draining Configuration:

# Service with session affinity; on AKS, connection draining is achieved via
# preStop hooks and terminationGracePeriodSeconds (see below), not via a
# load-balancer annotation
apiVersion: v1
kind: Service
metadata:
  name: atp-ingestion
  namespace: atp-production
spec:
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800  # 3 hours
  ports:
  - port: 80
    targetPort: 8080

Graceful Shutdown (SIGTERM Handling)

Graceful Shutdown Implementation:

// C#: Graceful shutdown handler
public class Program
{
    public static async Task Main(string[] args)
    {
        var host = CreateHostBuilder(args).Build();

        // Register graceful shutdown
        var lifetime = host.Services.GetRequiredService<IHostApplicationLifetime>();
        lifetime.ApplicationStopping.Register(() =>
        {
            Console.WriteLine("SIGTERM received, starting graceful shutdown...");

            // Stop accepting new requests
            // Wait for in-flight requests to complete
            // Close connections
            // Cleanup resources

            Console.WriteLine("Graceful shutdown complete");
        });

        await host.RunAsync();
    }
}
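The same lifecycle can be observed at the shell level. A minimal sketch (not ATP code) that traps SIGTERM — the signal the kubelet sends on pod termination — drains, and exits cleanly:

```shell
#!/bin/sh
# Simulated service: on SIGTERM, drain and exit 0 instead of dying mid-request
graceful() {
  trap 'echo "SIGTERM received, draining"; echo "graceful shutdown complete"; exit 0' TERM
  while :; do sleep 0.1; done
}

graceful &            # run the "service" in the background
pid=$!
sleep 0.3
kill -TERM "$pid"     # what the kubelet sends at pod termination
wait "$pid"
echo "exit status: $?"
```

An unhandled SIGTERM would terminate the process immediately with a non-zero status; the trap is what allows in-flight work to finish first.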

Graceful Shutdown in Kubernetes:

# Deployment with terminationGracePeriodSeconds
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60  # 60 seconds for graceful shutdown
      containers:
      - name: atp-ingestion
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/sh
              - -c
              - |
                # Drain connections
                sleep 10
                # Stop accepting new requests
                curl -X POST http://localhost:8080/admin/shutdown

Pod Disruption Budget

Pod Disruption Budget Configuration:

# Pod Disruption Budget for zero-downtime
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: atp-ingestion-pdb
  namespace: atp-production
spec:
  minAvailable: 3  # Always maintain at least 3 pods available
  selector:
    matchLabels:
      app: atp-ingestion
---
# Alternative: maxUnavailable
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: atp-ingestion-pdb
spec:
  maxUnavailable: 1  # Allow maximum 1 pod unavailable
  selector:
    matchLabels:
      app: atp-ingestion

Pod Disruption Budget Calculation:

  • Total Pods: 5 replicas
  • minAvailable: 3 pods
  • During Rolling Update: Can terminate maximum 2 pods at a time
  • Ensures: Always 3+ pods serving traffic (zero downtime)
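That budget arithmetic is simply replicas minus minAvailable; a one-line check:

```shell
#!/bin/sh
replicas=5
min_available=3

# Pods the cluster may voluntarily disrupt at once while honoring the PDB
allowed_disruptions=$((replicas - min_available))
echo "allowed disruptions: ${allowed_disruptions}"   # 2
```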

PreStop Hooks

PreStop Hook for Graceful Shutdown:

# Deployment with preStop hook
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60
      containers:
      - name: atp-ingestion
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/sh
              - -c
              - |
                echo "PreStop hook: Starting graceful shutdown..."

                # Remove from load balancer
                # Wait for connections to drain
                # Send shutdown signal to application
                curl -X POST http://localhost:8080/admin/drain || true

                # Wait for in-flight requests
                sleep 15

                echo "PreStop hook: Graceful shutdown complete"
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080
          # Remove from service endpoints when readiness fails
          periodSeconds: 5

Zero-Downtime Deployment Checklist:

## Zero-Downtime Deployment Checklist

### Pre-Deployment
- [ ] Readiness probe configured and tested
- [ ] Liveness probe configured
- [ ] Pod Disruption Budget configured (minAvailable or maxUnavailable)
- [ ] Graceful shutdown implemented (SIGTERM handler)
- [ ] PreStop hook configured
- [ ] Connection draining enabled
- [ ] terminationGracePeriodSeconds set appropriately (30-60s)

### During Deployment
- [ ] Rolling update strategy with maxUnavailable: 0
- [ ] maxSurge configured (1 or 2 extra pods)
- [ ] Monitor rollout progress (`kubectl rollout status`)
- [ ] Verify pods pass readiness probes
- [ ] Check traffic routing to new pods

### Post-Deployment
- [ ] Health checks passing
- [ ] Error rates within threshold
- [ ] Latency within SLO
- [ ] Business metrics validated
- [ ] Old pods terminated gracefully

Summary: Rolling Updates & Deployment Strategies

  • Kubernetes Rolling Updates: Default rolling update strategy, maxSurge and maxUnavailable settings, rolling update process, monitoring rollout progress
  • Blue-Green Deployments: Blue-green concept and benefits, implementation with namespace switching, traffic routing with Ingress, instant rollback, when to use blue-green
  • Canary Releases: Canary deployment concept, traffic splitting strategies (Istio/Nginx), service mesh requirement, gradual traffic shift (10% → 50% → 100%), automated canary analysis
  • Progressive Delivery with Flagger: Flagger overview and architecture, installation and configuration, canary resource definition, automated rollback on metric thresholds, integration with service mesh
  • Feature Flags Integration: LaunchDarkly or Azure App Configuration, feature flag-based rollout, gradual feature enablement, kill switch for problem features
  • Pre-Deployment Validation: Smoke tests before traffic routing, integration tests in canary, database migration checks, dependency availability checks
  • Post-Deployment Validation: Health check monitoring, error rate thresholds, latency thresholds, business metric validation
  • Automatic Rollback Triggers: Error rate exceeds threshold, latency degrades beyond SLO, health checks fail consistently, custom metric-based rollback
  • Deployment Windows: Scheduled maintenance windows, change freeze periods, emergency deployment procedures
  • Zero-Downtime Deployments: Connection draining, graceful shutdown (SIGTERM handling), Pod disruption budgets, PreStop hooks

Preview Environments (Ephemeral)

Purpose: Define how ephemeral preview environments are automatically provisioned for pull requests, used for isolated testing and validation, and automatically cleaned up after PR merge or closure, ensuring developers can test changes in a production-like environment without manual infrastructure setup while optimizing resource costs.


Preview Environment Architecture

Ephemeral Namespaces in Dev Cluster

Preview Environment Architecture:

graph TB
    subgraph "Dev AKS Cluster"
        subgraph "PR #123 Preview"
            NS1[Namespace: atp-preview-pr123]
            SVC1[Service: atp-ingestion]
            PODS1[Pods: v1.2.3<br/>1 replica]
            ING1[Ingress: pr123.preview.atp.connectsoft.example]
        end
        subgraph "PR #124 Preview"
            NS2[Namespace: atp-preview-pr124]
            SVC2[Service: atp-ingestion]
            PODS2[Pods: v1.2.4<br/>1 replica]
            ING2[Ingress: pr124.preview.atp.connectsoft.example]
        end
        subgraph "Shared Resources"
            MON[Shared Monitoring]
            DB[Shared Test DB]
        end
    end

    NS1 --> MON
    NS2 --> MON
    NS1 --> DB
    NS2 --> DB
    ING1 --> SVC1
    ING2 --> SVC2

    style NS1 fill:#90EE90
    style NS2 fill:#90EE90
    style MON fill:#FFE5B4
    style DB fill:#FFE5B4

Preview Namespace Structure:

atp-preview-pr123/
├── deployments/
│   ├── atp-ingestion/
│   ├── atp-query/
│   └── atp-gateway/
├── services/
├── ingress/
├── configmaps/
└── secrets/ (references from External Secrets)

Namespace Naming Convention:

  • Format: atp-preview-pr{PR_NUMBER}
  • Examples: atp-preview-pr123, atp-preview-pr456, atp-preview-pr789
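A small helper (hypothetical, not part of the ATP scripts) that applies and validates this convention before the PR number is interpolated into kubectl calls:

```shell
#!/bin/sh
# Build the preview namespace name from a PR number, rejecting non-numeric input
preview_namespace() {
  case "$1" in
    ''|*[!0-9]*) echo "error: PR number must be numeric" >&2; return 1 ;;
  esac
  echo "atp-preview-pr$1"
}

preview_namespace 123               # atp-preview-pr123
preview_namespace "not-a-number" || true   # rejected
```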

Resource Isolation per PR

Resource Isolation Strategy:

| Resource    | Isolation Level       | Sharing              | Rationale                     |
|-------------|-----------------------|----------------------|-------------------------------|
| Namespace   | ✅ Complete isolation | ❌ Per PR            | Complete resource isolation   |
| Deployments | ✅ Isolated           | ❌ Per PR            | Independent testing           |
| Services    | ✅ Isolated           | ❌ Per PR            | Independent service endpoints |
| Ingress     | ✅ Isolated hostname  | ❌ Per PR            | Unique preview URL            |
| ConfigMaps  | ✅ Isolated           | ❌ Per PR            | PR-specific configuration     |
| Secrets     | ⚠️ Referenced         | ✅ Shared Key Vault  | Cost optimization             |
| Database    | ⚠️ Shared/Mocked      | ✅ Shared test DB    | Cost optimization             |
| Redis       | ⚠️ Shared             | ✅ Shared test Redis | Cost optimization             |
| Monitoring  | ✅ Namespace labels   | ✅ Shared Prometheus | Cost optimization             |

Resource Isolation Configuration:

# Preview namespace with labels for isolation
apiVersion: v1
kind: Namespace
metadata:
  name: atp-preview-pr123
  labels:
    environment: preview
    preview: "true"
    pr-number: "123"
    created-by: "azure-pipelines"
    auto-cleanup: "true"
    ttl: "24h"  # Time-to-live for auto-cleanup
  annotations:
    # Timestamps contain colons, which are invalid in label values
    created-at: "2024-01-15T10:00:00Z"
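A cleanup job can evaluate that ttl label with plain epoch arithmetic. A hedged sketch, assuming the hour-suffixed format shown above and epoch timestamps:

```shell
#!/bin/sh
# True (exit 0) if now is past created + ttl
ttl_expired() {
  created_epoch=$1; ttl_hours=$2; now_epoch=$3
  [ "$now_epoch" -gt $((created_epoch + ttl_hours * 3600)) ]
}

created=1705312800   # 2024-01-15T10:00:00Z
if ttl_expired "$created" 24 $((created + 90000)); then  # 25h later
  echo "expired: eligible for auto-cleanup"
fi
```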

Cost Optimization Strategies

Cost Optimization Matrix:

| Strategy            | Implementation                    | Cost Savings           |
|---------------------|-----------------------------------|------------------------|
| Single Replica      | 1 replica vs 3 in dev             | ✅ ~67% reduction      |
| Minimal Resources   | 100m CPU, 256Mi memory            | ✅ ~80% reduction      |
| Shared Node Pool    | Use dev cluster node pool         | ✅ No additional nodes |
| Auto-Shutdown       | Scale to zero after 4h inactivity | ✅ ~60% reduction      |
| Spot Instances      | Use spot node pool                | ✅ ~90% cost reduction |
| Shared Dependencies | Shared test DB/Redis              | ✅ Significant savings |

Cost Comparison:

| Environment         | Replicas | CPU/Memory per Pod | Monthly Cost (Est.) |
|---------------------|----------|--------------------|---------------------|
| Dev                 | 3        | 500m / 1Gi         | $150                |
| Preview (Standard)  | 1        | 500m / 1Gi         | $50                 |
| Preview (Optimized) | 1        | 100m / 256Mi       | $10                 |

Lifecycle Management

Preview Environment Lifecycle:

stateDiagram-v2
    [*] --> PR_Created: Developer opens PR
    PR_Created --> Provisioning: Azure Pipeline triggered
    Provisioning --> Active: Preview ready
    Active --> Testing: Integration tests
    Testing --> Active: Tests pass
    Active --> Idle: 4h inactivity
    Idle --> Active: New activity
    Active --> Cleaning: PR merged/closed
    Idle --> Cleaning: TTL expired
    Cleaning --> [*]: Resources deleted

    Active --> Failed: Tests fail
    Failed --> Cleaning: Manual cleanup

Lifecycle States:

| State        | Description                                    | Actions                             |
|--------------|------------------------------------------------|-------------------------------------|
| Provisioning | Namespace and resources being created          | Create namespace, deploy manifests  |
| Active       | Preview environment running, receiving traffic | Monitor health, run tests           |
| Testing      | Integration tests executing                    | Execute test suite                  |
| Idle         | No activity for 4+ hours                       | Scale to zero, monitor for activity |
| Cleaning     | Resources being deleted                        | Delete namespace and all resources  |
| Failed       | Provisioning or testing failed                 | Retry or manual cleanup             |
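The table can be encoded as a simple transition function; a sketch for validating lifecycle events (state and event names follow the diagram, not an existing ATP script):

```shell
#!/bin/sh
# Next lifecycle state for (current_state, event); Unknown for invalid transitions
next_state() {
  case "$1:$2" in
    Provisioning:ready)                echo Active ;;
    Active:run_tests)                  echo Testing ;;
    Testing:tests_pass)                echo Active ;;
    Active:inactivity)                 echo Idle ;;
    Idle:activity)                     echo Active ;;
    Active:pr_closed|Idle:ttl_expired) echo Cleaning ;;
    Testing:tests_fail)                echo Failed ;;
    *)                                 echo Unknown ;;
  esac
}

next_state Active inactivity    # Idle
next_state Idle ttl_expired     # Cleaning
next_state Cleaning activity    # Unknown — cleanup is terminal
```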

Automatic Provisioning on PR Creation

Azure Pipeline Triggered by PR

PR Trigger Configuration:

# azure-pipelines-preview.yml
trigger: none  # No CI trigger

pr:
  branches:
    include:
    - dev
    - test
    - staging
  paths:
    include:
    - apps/**/*
    - infrastructure/**/*
    exclude:
    - docs/**/*

pool:
  vmImage: 'ubuntu-latest'

variables:
  - group: atp-preview-env
  - name: PR_NUMBER
    value: $(System.PullRequest.PullRequestNumber)
  - name: PR_BRANCH
    value: $(System.PullRequest.SourceBranch)
  - name: PREVIEW_NAMESPACE
    value: atp-preview-pr$(PR_NUMBER)
  - name: PREVIEW_HOSTNAME
    value: pr$(PR_NUMBER).preview.atp.connectsoft.example

stages:
- stage: ProvisionPreview
  displayName: 'Provision Preview Environment'
  condition: and(succeeded(), ne(variables['Build.Reason'], 'Manual'))
  jobs:
  - job: Provision
    displayName: 'Create Preview Environment'
    steps:
    - task: AzureCLI@2
      displayName: 'Get PR details'
      inputs:
        azureSubscription: 'ATP-NonProd-ServiceConnection'
        scriptType: 'bash'
        scriptLocation: 'inlineScript'
        inlineScript: |
          echo "PR Number: $(PR_NUMBER)"
          echo "PR Branch: $(PR_BRANCH)"
          echo "Preview Namespace: $(PREVIEW_NAMESPACE)"
          echo "Preview Hostname: $(PREVIEW_HOSTNAME)"

    - task: Bash@3
      displayName: 'Generate Preview Manifests'
      inputs:
        targetType: 'inline'
        script: |
          # Generate preview manifests
          ./scripts/generate-preview-manifests.sh \
            --pr-number $(PR_NUMBER) \
            --branch $(PR_BRANCH) \
            --namespace $(PREVIEW_NAMESPACE) \
            --hostname $(PREVIEW_HOSTNAME) \
            --output-dir ./preview-manifests

Namespace Creation Script

Namespace Creation:

#!/bin/bash
# scripts/create-preview-namespace.sh

PR_NUMBER="${1}"
NAMESPACE="atp-preview-pr${PR_NUMBER}"
TTL="${2:-24h}"  # Default 24 hours

echo "📦 Creating preview namespace: ${NAMESPACE}"

# Create namespace
kubectl create namespace "${NAMESPACE}" --dry-run=client -o yaml | \
  kubectl label --local -f - \
    environment=preview \
    preview=true \
    pr-number="${PR_NUMBER}" \
    auto-cleanup=true \
    ttl="${TTL}" \
    created-by=azure-pipelines \
    -o yaml | \
  kubectl apply -f -

# Record the creation timestamp as an annotation (colons are invalid in label values)
kubectl annotate namespace "${NAMESPACE}" \
  created-at="$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --overwrite

# Create ResourceQuota for cost control
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ResourceQuota
metadata:
  name: preview-quota
  namespace: ${NAMESPACE}
spec:
  hard:
    requests.cpu: "2"
    requests.memory: 4Gi
    limits.cpu: "4"
    limits.memory: 8Gi
    persistentvolumeclaims: "2"
    pods: "10"
    services: "5"
EOF

# Create LimitRange for default resource limits
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: LimitRange
metadata:
  name: preview-limits
  namespace: ${NAMESPACE}
spec:
  limits:
  - default:
      cpu: "100m"
      memory: "256Mi"
    defaultRequest:
      cpu: "50m"
      memory: "128Mi"
    type: Container
EOF

echo "✅ Preview namespace created: ${NAMESPACE}"
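The LimitRange defaults and the ResourceQuota interact: worst case, every pod receives the default request, so defaults times the pod quota must fit within the quota. A quick check with the values above:

```shell
#!/bin/sh
max_pods=10                  # ResourceQuota pods: "10"
default_cpu_request_m=50     # LimitRange defaultRequest cpu: 50m
quota_cpu_m=2000             # ResourceQuota requests.cpu: "2"

worst_case=$((max_pods * default_cpu_request_m))
if [ "$worst_case" -le "$quota_cpu_m" ]; then
  echo "OK: worst-case ${worst_case}m <= quota ${quota_cpu_m}m"
else
  echo "WARNING: defaults exceed quota"
fi
```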

Manifest Generation with PR-Specific Values

Preview Manifest Generation Script:

#!/bin/bash
# scripts/generate-preview-manifests.sh
# Accepts the flag-style arguments used by the pipeline

OUTPUT_DIR="./preview-manifests"

while [ $# -gt 0 ]; do
  case "$1" in
    --pr-number)  PR_NUMBER="$2";  shift 2 ;;
    --branch)     BRANCH="$2";     shift 2 ;;
    --namespace)  NAMESPACE="$2";  shift 2 ;;
    --hostname)   HOSTNAME="$2";   shift 2 ;;
    --output-dir) OUTPUT_DIR="$2"; shift 2 ;;
    *) echo "Unknown argument: $1" >&2; exit 1 ;;
  esac
done

NAMESPACE="${NAMESPACE:-atp-preview-pr${PR_NUMBER}}"

echo "🔨 Generating preview manifests for PR #${PR_NUMBER}"

mkdir -p "${OUTPUT_DIR}"

# Generate namespace
cat > "${OUTPUT_DIR}/namespace.yaml" <<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: ${NAMESPACE}
  labels:
    environment: preview
    preview: "true"
    pr-number: "${PR_NUMBER}"
    auto-cleanup: "true"
    ttl: "24h"
    created-by: "azure-pipelines"
  annotations:
    # Branch names and timestamps may contain / and :, invalid in label values
    branch: "${BRANCH}"
    created-at: "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
EOF

# Generate Kustomization with PR-specific values
cat > "${OUTPUT_DIR}/kustomization.yaml" <<EOF
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: ${NAMESPACE}

resources:
- ../../apps/atp-ingestion/base
- ../../apps/atp-query/base
- ../../apps/atp-gateway/base

commonLabels:
  environment: preview
  pr-number: "${PR_NUMBER}"

patchesStrategicMerge:
- preview-patch.yaml

images:
# Branch names are sanitized: '/' and other non-alphanumerics are invalid in image tags
- name: connectsoft.azurecr.io/atp/ingestion
  newTag: $(echo "${BRANCH}" | sed 's/[^a-zA-Z0-9.]/-/g')-$(git rev-parse --short HEAD)
- name: connectsoft.azurecr.io/atp/query
  newTag: $(echo "${BRANCH}" | sed 's/[^a-zA-Z0-9.]/-/g')-$(git rev-parse --short HEAD)
- name: connectsoft.azurecr.io/atp/gateway
  newTag: $(echo "${BRANCH}" | sed 's/[^a-zA-Z0-9.]/-/g')-$(git rev-parse --short HEAD)
EOF

# Generate preview-specific patches
cat > "${OUTPUT_DIR}/preview-patch.yaml" <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  replicas: 1  # Single replica for preview
  template:
    spec:
      containers:
      - name: atp-ingestion
        resources:
          requests:
            cpu: 100m
            memory: 256Mi
          limits:
            cpu: 500m
            memory: 512Mi
        env:
        - name: ASPNETCORE_ENVIRONMENT
          value: "Preview"
        - name: Preview__PRNumber
          value: "${PR_NUMBER}"
        - name: Preview__Hostname
          value: "${HOSTNAME}"
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: atp-ingestion-ingress
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-staging
spec:
  rules:
  - host: ${HOSTNAME}
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: atp-ingestion
            port:
              number: 80
EOF

echo "✅ Preview manifests generated in ${OUTPUT_DIR}"

FluxCD Kustomization for Preview

Preview Kustomization Resource:

# clusters/dev/preview-kustomizations/pr123-kustomization.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: preview-pr123
  namespace: flux-system
  labels:
    preview: "true"
    pr-number: "123"
spec:
  interval: 1m
  path: ./apps/preview/pr123
  prune: true  # Auto-prune in preview
  wait: false  # Don't wait for readiness
  timeout: 5m
  sourceRef:
    kind: GitRepository
    name: atp-gitops-dev
  dependsOn:
  - name: infrastructure

Dynamic Preview Kustomization Creation:

#!/bin/bash
# scripts/create-preview-kustomization.sh

PR_NUMBER="${1}"
NAMESPACE="atp-preview-pr${PR_NUMBER}"

echo "🔧 Creating FluxCD Kustomization for preview PR #${PR_NUMBER}"

# Create preview Kustomization
cat <<EOF | kubectl apply -f -
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: preview-pr${PR_NUMBER}
  namespace: flux-system
  labels:
    preview: "true"
    pr-number: "${PR_NUMBER}"
    auto-cleanup: "true"
spec:
  interval: 1m
  path: ./apps/preview/pr${PR_NUMBER}
  prune: true
  wait: false
  timeout: 5m
  sourceRef:
    kind: GitRepository
    name: atp-gitops-dev
  dependsOn:
  - name: infrastructure
EOF

echo "✅ Preview Kustomization created"

Dynamic Manifest Generation

Namespace: atp-preview-pr{number}

Dynamic Namespace Template:

# templates/preview-namespace-template.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: atp-preview-pr{{ .Values.prNumber }}
  labels:
    environment: preview
    preview: "true"
    pr-number: "{{ .Values.prNumber }}"
    auto-cleanup: "true"
    ttl: "{{ .Values.ttl | default "24h" }}"
    created-by: "azure-pipelines"
  annotations:
    # Branch names and timestamps may contain / and :, invalid in label values
    branch: "{{ .Values.branch }}"
    created-at: "{{ .Values.createdAt }}"

Namespace Generation:

# Generate namespace with PR number
PR_NUMBER=123
NAMESPACE="atp-preview-pr${PR_NUMBER}"

kubectl create namespace "${NAMESPACE}" \
  --dry-run=client -o yaml | \
  kubectl label --local -f - \
    environment=preview \
    pr-number="${PR_NUMBER}" \
    -o yaml | \
  kubectl apply -f -

Ingress Hostname: pr{number}.preview.atp.connectsoft.example

Dynamic Ingress Generation:

# Generated Ingress for PR #123
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: atp-preview-ingress
  namespace: atp-preview-pr123
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-staging
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - pr123.preview.atp.connectsoft.example
    secretName: preview-pr123-tls
  rules:
  - host: pr123.preview.atp.connectsoft.example
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: atp-ingestion
            port:
              number: 80

Hostname Generation Script:

#!/bin/bash
# scripts/generate-preview-hostname.sh

PR_NUMBER="${1}"
BASE_DOMAIN="preview.atp.connectsoft.example"

PREVIEW_HOSTNAME="pr${PR_NUMBER}.${BASE_DOMAIN}"

echo "${PREVIEW_HOSTNAME}"
# Output: pr123.preview.atp.connectsoft.example

Resource Limits (Smaller than Dev)

Preview Resource Limits:

# Preview ResourceQuota (smaller than dev)
apiVersion: v1
kind: ResourceQuota
metadata:
  name: preview-quota
  namespace: atp-preview-pr123
spec:
  hard:
    requests.cpu: "2"      # 2 CPU total (vs 8 in dev)
    requests.memory: 4Gi   # 4Gi memory (vs 16Gi in dev)
    limits.cpu: "4"        # 4 CPU limit (vs 16 in dev)
    limits.memory: 8Gi     # 8Gi limit (vs 32Gi in dev)
    pods: "10"             # 10 pods max (vs 50 in dev)
    services: "5"          # 5 services max

Preview Deployment Resource Limits:

# Preview Deployment with minimal resources
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
  namespace: atp-preview-pr123
spec:
  replicas: 1  # Single replica
  template:
    spec:
      containers:
      - name: atp-ingestion
        resources:
          requests:
            cpu: 100m      # 100m CPU (vs 500m in dev)
            memory: 256Mi  # 256Mi memory (vs 1Gi in dev)
          limits:
            cpu: 500m      # 500m CPU limit (vs 2000m in dev)
            memory: 512Mi  # 512Mi limit (vs 2Gi in dev)

Resource Comparison:

| Resource       | Dev   | Preview | Reduction |
|----------------|-------|---------|-----------|
| Replicas       | 3     | 1       | 67%       |
| CPU Request    | 500m  | 100m    | 80%       |
| Memory Request | 1Gi   | 256Mi   | 75%       |
| CPU Limit      | 2000m | 500m    | 75%       |
| Memory Limit   | 2Gi   | 512Mi   | 75%       |
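The Reduction column follows from (dev − preview) / dev; the rows can be recomputed in one pass:

```shell
#!/bin/sh
# dev/preview pairs: replicas, CPU request (m), memory request (Mi),
# CPU limit (m), memory limit (Mi)
printf '3 1\n500 100\n1024 256\n2000 500\n2048 512\n' | \
  awk '{ printf "%.0f%%\n", ($1 - $2) / $1 * 100 }'
# 67% 80% 75% 75% 75% — matching the table
```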

Image Tag from PR Branch

Image Tag Strategy:

  • Format: {BRANCH_NAME}-{SHORT_COMMIT_SHA}
  • Examples: feature-123-abc456d, bugfix-456-def789g

Image Tag Generation:

#!/bin/bash
# scripts/generate-preview-image-tag.sh

BRANCH="${1}"
COMMIT_SHA="${2:-$(git rev-parse --short HEAD)}"

# Sanitize branch name (remove special characters)
SANITIZED_BRANCH=$(echo "${BRANCH}" | sed 's/[^a-zA-Z0-9]/-/g' | tr '[:upper:]' '[:lower:]' | cut -c1-50)

IMAGE_TAG="${SANITIZED_BRANCH}-${COMMIT_SHA}"

echo "${IMAGE_TAG}"
# Output: feature-123-abc456d

Kustomize Image Override:

# apps/preview/pr123/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

images:
- name: connectsoft.azurecr.io/atp/ingestion
  newTag: feature-123-abc456d  # PR branch + commit SHA
- name: connectsoft.azurecr.io/atp/query
  newTag: feature-123-abc456d
- name: connectsoft.azurecr.io/atp/gateway
  newTag: feature-123-abc456d

FluxCD Configuration for Previews

Dynamic GitRepository per PR

Preview GitRepository:

# clusters/dev/preview-gitrepositories/pr123-gitrepository.yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: preview-pr123-git
  namespace: flux-system
  labels:
    preview: "true"
    pr-number: "123"
spec:
  interval: 30s  # Fast polling for preview
  url: https://dev.azure.com/ConnectSoft/ATP/_git/atp-gitops
  ref:
    branch: feature/123-new-feature  # PR branch
  secretRef:
    name: gitops-credentials
  ignore: |
    # .gitignore-style patterns: exclude everything, then re-include this PR's path
    /*
    !/apps/
    /apps/*
    !/apps/preview/
    /apps/preview/*
    !/apps/preview/pr123/
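Flux's spec.ignore uses .gitignore-style patterns, and gitignore semantics forbid re-including a file whose parent directory stays excluded, so each path level must be negated explicitly. The layered pattern can be verified locally with `git check-ignore` in a throwaway repository:

```shell
#!/bin/sh
# Verify the exclude-all-then-re-include pattern in a temporary repo
tmp=$(mktemp -d) && cd "$tmp" && git init -q .

cat > .gitignore <<'EOF'
/*
!/apps/
/apps/*
!/apps/preview/
/apps/preview/*
!/apps/preview/pr123/
EOF

git check-ignore -q production/deploy.yaml && echo "production: excluded"
git check-ignore -q apps/preview/pr456/deploy.yaml && echo "pr456: excluded"
git check-ignore -q apps/preview/pr123/deploy.yaml || echo "pr123: included"
```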

Dynamic GitRepository Creation:

#!/bin/bash
# scripts/create-preview-gitrepository.sh

PR_NUMBER="${1}"
PR_BRANCH="${2}"

echo "📂 Creating GitRepository for preview PR #${PR_NUMBER}"

cat <<EOF | kubectl apply -f -
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: preview-pr${PR_NUMBER}-git
  namespace: flux-system
  labels:
    preview: "true"
    pr-number: "${PR_NUMBER}"
    auto-cleanup: "true"
spec:
  interval: 30s
  url: https://dev.azure.com/ConnectSoft/ATP/_git/atp-gitops
  ref:
    branch: ${PR_BRANCH}
  secretRef:
    name: gitops-credentials
EOF

echo "✅ Preview GitRepository created"

Preview Kustomization

Preview Kustomization Configuration:

# clusters/dev/preview-kustomizations/pr123-kustomization.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: preview-pr123
  namespace: flux-system
  labels:
    preview: "true"
    pr-number: "123"
spec:
  interval: 1m
  path: ./apps/preview/pr123
  prune: true  # Auto-prune deleted resources
  wait: false  # Don't wait for readiness
  timeout: 5m
  retryInterval: 1m
  sourceRef:
    kind: GitRepository
    name: preview-pr123-git
  dependsOn:
  - name: infrastructure

Sync Policies for Preview

Preview Sync Policy:

| Policy         | Value       | Rationale                       |
|----------------|-------------|---------------------------------|
| Auto-Sync      | ✅ Enabled  | Fast feedback for developers    |
| Prune          | ✅ Enabled  | Clean up deleted resources      |
| Wait           | ❌ Disabled | Don't block on readiness        |
| Timeout        | 5m          | Fast timeout for quick feedback |
| Retry Interval | 1m          | Quick retries                   |

Sync Policy Configuration:

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: preview-pr123
spec:
  interval: 1m
  prune: true   # Auto-prune
  wait: false   # Don't wait
  timeout: 5m   # Fast timeout
  retryInterval: 1m

Health Checks and Validation

Preview Health Check:

# Health check webhook for preview
apiVersion: v1
kind: Service
metadata:
  name: preview-health-check
  namespace: atp-preview-pr123
spec:
  selector:
    app: atp-ingestion
  ports:
  - port: 8080
    targetPort: 8080
---
# Health check Job
apiVersion: batch/v1
kind: Job
metadata:
  name: preview-health-check
  namespace: atp-preview-pr123
spec:
  template:
    spec:
      containers:
      - name: health-check
        image: curlimages/curl:latest
        command:
        - /bin/sh
        - -c
        - |
          echo "Checking preview environment health..."

          # Wait for pods to be ready
          sleep 30

          # Check liveness
          curl -f http://atp-ingestion:8080/health/live || exit 1

          # Check readiness
          curl -f http://atp-ingestion:8080/health/ready || exit 1

          echo "✅ Preview environment is healthy"
      restartPolicy: Never

Resource Cleanup

Auto-Delete After PR Merge

Auto-Cleanup on PR Merge:

# Azure Pipeline: Cleanup after merge
# Note: PR triggers do not fire on PR completion; in practice this pipeline
# is invoked by a branch trigger on the target branch or by an Azure DevOps
# service hook for the "pull request updated" event.
trigger: none

pr:
  branches:
    include:
    - dev
    - test
  autoCancel: false

pool:
  vmImage: 'ubuntu-latest'

variables:
  - name: PR_NUMBER
    value: $(System.PullRequest.PullRequestNumber)
  - name: PREVIEW_NAMESPACE
    value: atp-preview-pr$(PR_NUMBER)

stages:
- stage: CleanupPreview
  displayName: 'Cleanup Preview Environment'
  condition: succeeded()
  jobs:
  - job: Cleanup
    displayName: 'Delete Preview Resources'
    steps:
    - task: Bash@3
      displayName: 'Delete Preview Namespace'
      inputs:
        targetType: 'inline'
        script: |
          ./scripts/cleanup-preview-environment.sh \
            --pr-number $(PR_NUMBER) \
            --reason "PR merged"

Auto-Delete After PR Close

Auto-Cleanup on PR Close:

#!/bin/bash
# scripts/cleanup-preview-on-close.sh

PR_NUMBER="${1}"
REASON="${2:-PR closed}"

echo "🧹 Cleaning up preview environment for PR #${PR_NUMBER}: ${REASON}"

NAMESPACE="atp-preview-pr${PR_NUMBER}"

# Delete namespace (cascades to all resources)
kubectl delete namespace "${NAMESPACE}" --wait=true --timeout=5m || true

# Delete FluxCD Kustomization
kubectl delete kustomization preview-pr${PR_NUMBER} -n flux-system || true

# Delete FluxCD GitRepository
kubectl delete gitrepository preview-pr${PR_NUMBER}-git -n flux-system || true

# Clean up GitOps manifests in Git
./scripts/cleanup-preview-manifests.sh --pr-number "${PR_NUMBER}"

echo "✅ Preview environment cleaned up"

Manual Cleanup for Stuck Resources

Manual Cleanup Script:

#!/bin/bash
# scripts/manual-cleanup-preview.sh

PR_NUMBER="${1}"

if [ -z "${PR_NUMBER}" ]; then
  echo "Usage: $0 <PR_NUMBER>"
  echo "Example: $0 123"
  exit 1
fi

NAMESPACE="atp-preview-pr${PR_NUMBER}"

echo "🧹 Manual cleanup for PR #${PR_NUMBER}"

# Force delete namespace (if stuck)
kubectl delete namespace "${NAMESPACE}" --force --grace-period=0 || true

# Wait and check if namespace still exists
sleep 10
if kubectl get namespace "${NAMESPACE}" 2>/dev/null; then
  echo "⚠️  Namespace still exists, forcing deletion..."

  # Patch namespace to remove finalizers
  kubectl patch namespace "${NAMESPACE}" -p '{"metadata":{"finalizers":[]}}' --type=merge

  # Delete again
  kubectl delete namespace "${NAMESPACE}" --force --grace-period=0
fi

# Clean up FluxCD resources
kubectl delete kustomization preview-pr${PR_NUMBER} -n flux-system --ignore-not-found=true
kubectl delete gitrepository preview-pr${PR_NUMBER}-git -n flux-system --ignore-not-found=true

# Clean up any remaining pods
kubectl delete pods --all -n "${NAMESPACE}" --force --grace-period=0 2>/dev/null || true

echo "✅ Manual cleanup complete"

List All Preview Environments:

#!/bin/bash
# scripts/list-preview-environments.sh

echo "📋 Active Preview Environments:"
echo ""

kubectl get namespaces -l preview=true --no-headers | while read -r line; do
  NAMESPACE=$(echo "$line" | awk '{print $1}')
  AGE=$(echo "$line" | awk '{print $3}')   # columns are NAME STATUS AGE
  PR_NUMBER=$(kubectl get namespace "${NAMESPACE}" -o jsonpath='{.metadata.labels.pr-number}')
  TTL=$(kubectl get namespace "${NAMESPACE}" -o jsonpath='{.metadata.labels.ttl}')

  echo "PR #${PR_NUMBER}: ${NAMESPACE}"
  echo "  Age: ${AGE}"
  echo "  TTL: ${TTL}"
  echo ""
done

Cost Tracking and Alerts

Cost Tracking:

#!/bin/bash
# scripts/track-preview-costs.sh

echo "💰 Preview Environment Cost Tracking"
echo ""

# Get all preview namespaces
kubectl get namespaces -l preview=true --no-headers | while read -r line; do
  NAMESPACE=$(echo "$line" | awk '{print $1}')
  PR_NUMBER=$(kubectl get namespace "${NAMESPACE}" -o jsonpath='{.metadata.labels.pr-number}')
  CREATED=$(kubectl get namespace "${NAMESPACE}" -o jsonpath='{.metadata.labels.created-at}')

  # Calculate hours since creation
  CREATED_TIMESTAMP=$(date -d "${CREATED}" +%s 2>/dev/null || echo "0")
  CURRENT_TIMESTAMP=$(date +%s)
  HOURS=$(( (CURRENT_TIMESTAMP - CREATED_TIMESTAMP) / 3600 ))

  # Estimate cost (assuming $0.10/hour per preview environment)
  ESTIMATED_COST=$(echo "scale=2; $HOURS * 0.10" | bc)

  echo "PR #${PR_NUMBER}: ${HOURS} hours, ~\$${ESTIMATED_COST}"
done

echo ""
echo "Total active preview environments: $(kubectl get namespaces -l preview=true --no-headers | wc -l)"

Cost Alert:

# PrometheusRule for preview cost alert
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: preview-cost-alert
  namespace: monitoring
spec:
  groups:
  - name: preview-cost
    rules:
    - alert: TooManyPreviewEnvironments
      expr: |
        count(kube_namespace_labels{label_preview="true"}) > 10
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Too many preview environments active"
        description: "{{ $value }} preview environments are active (threshold: 10)"

Cost Optimization

Shared Node Pools

Preview Node Pool Strategy:

| Strategy | Node Pool Type | Cost | Rationale |
|---|---|---|---|
| Shared with Dev ✅ | Dev node pool | ✅ Low | ✅ Recommended - No additional nodes |
| Dedicated Preview Pool ❌ | Separate pool | ❌ High | ❌ Not recommended (cost) |
| Spot Instance Pool ⚠️ | Spot nodes | ✅ Very Low | ⚠️ Consider for cost optimization |

Preview Node Pool Configuration:

# Use existing dev node pool (no additional cost)
# Preview pods scheduled on dev cluster nodes
# No dedicated node pool needed

Reduced Replica Counts (1 vs 3)

Replica Count Configuration:

# Preview Deployment: Single replica
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
  namespace: atp-preview-pr123
spec:
  replicas: 1  # Single replica (vs 3 in dev)

Cost Savings:

  • Dev: 3 replicas × $0.05/hour = $0.15/hour
  • Preview: 1 replica × $0.05/hour = $0.05/hour
  • Savings: 67% cost reduction
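
As a quick sanity check, the arithmetic behind these figures can be reproduced in shell (the $0.05/replica-hour rate is the illustrative assumption used above, not real Azure pricing):

```shell
# Reproduce the replica-count savings calculation from the bullets above
RATE=0.05  # assumed cost per replica-hour
DEV_COST=$(awk -v r="$RATE" 'BEGIN{printf "%.2f", 3 * r}')      # 3 replicas in dev
PREVIEW_COST=$(awk -v r="$RATE" 'BEGIN{printf "%.2f", 1 * r}')  # 1 replica in preview
SAVINGS=$(awk -v d="$DEV_COST" -v p="$PREVIEW_COST" 'BEGIN{printf "%.0f", (1 - p/d) * 100}')
echo "dev=\$${DEV_COST}/h preview=\$${PREVIEW_COST}/h savings=${SAVINGS}%"
```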

Minimal Resource Requests

Minimal Resource Configuration:

# Preview resources: Minimal requests
resources:
  requests:
    cpu: 100m      # 100m (vs 500m in dev) - 80% reduction
    memory: 256Mi  # 256Mi (vs 1Gi in dev) - 75% reduction
  limits:
    cpu: 500m      # 500m (vs 2000m in dev) - 75% reduction
    memory: 512Mi  # 512Mi (vs 2Gi in dev) - 75% reduction

Auto-Shutdown After 4 Hours of Inactivity

Inactivity Detection and Auto-Shutdown:

# CronJob to detect inactivity and scale to zero
apiVersion: batch/v1
kind: CronJob
metadata:
  name: preview-inactivity-check
  namespace: monitoring
spec:
  schedule: "*/15 * * * *"  # Every 15 minutes
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: inactivity-check
            image: bitnami/kubectl:latest
            command:
            - /bin/sh
            - -c
            - |
              # Check all preview namespaces (jsonpath avoids depending on jq in the image)
              kubectl get namespaces -l preview=true \
                -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' | \
                while read namespace; do

                  # Check last activity (last HTTP request)
                  LAST_ACTIVITY=$(kubectl get namespace "${namespace}" -o jsonpath='{.metadata.annotations.last-activity-time}')

                  if [ -z "${LAST_ACTIVITY}" ]; then
                    CREATED=$(kubectl get namespace "${namespace}" -o jsonpath='{.metadata.labels.created-at}')
                    LAST_ACTIVITY="${CREATED}"
                  fi

                  # Calculate hours since last activity
                  LAST_TS=$(date -d "${LAST_ACTIVITY}" +%s)
                  CURRENT_TS=$(date +%s)
                  HOURS=$(( (CURRENT_TS - LAST_TS) / 3600 ))

                  # Scale to zero if inactive for 4+ hours
                  if [ "${HOURS}" -ge 4 ]; then
                    echo "Scaling down ${namespace} (inactive for ${HOURS} hours)"

                    # Scale all deployments to zero
                    kubectl get deployments -n "${namespace}" \
                      -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' | \
                      while read deployment; do
                        kubectl scale deployment "${deployment}" -n "${namespace}" --replicas=0
                      done
                  fi
                done
          restartPolicy: OnFailure
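
The inactivity-window arithmetic inside the CronJob can be exercised on its own — a minimal sketch, assuming GNU date (as in the Debian-based kubectl image):

```shell
# Whole hours elapsed since an ISO-8601 UTC timestamp (GNU date assumed)
hours_since() {
  local last_ts now_ts
  last_ts=$(date -u -d "$1" +%s)
  now_ts=$(date -u +%s)
  echo $(( (now_ts - last_ts) / 3600 ))
}

# Simulate a namespace last touched 5 hours ago: past the 4-hour threshold
HOURS=$(hours_since "$(date -u -d '5 hours ago' +%Y-%m-%dT%H:%M:%SZ)")
if [ "${HOURS}" -ge 4 ]; then
  echo "inactive for ${HOURS}h: would scale deployments to zero"
fi
```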

Activity Tracking:

# Track activity in Ingress annotations
kubectl annotate ingress atp-preview-ingress \
  -n atp-preview-pr123 \
  last-activity-time="$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --overwrite

Spot Instances for Preview Environments

Spot Node Pool Configuration:

// Pulumi: Spot node pool for preview environments
var previewNodePool = new ContainerService.KubernetesClusterNodePool("preview-spot-pool", new()
{
    KubernetesClusterId = devCluster.Id,
    VmSize = "Standard_D4s_v3",
    NodeCount = 2,
    Priority = "Spot",
    EvictionPolicy = "Delete",
    SpotMaxPrice = 0.05, // Max $0.05/hour (vs $0.20 for regular)
    NodeTaints = new[]
    {
        "preview=true:NoSchedule"
    },
    NodeLabels = new()
    {
        { "pool", "preview-spot" },
        { "preview", "true" },
    },
});

Preview Pod Tolerations:

# Preview Deployment with spot tolerations
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
  namespace: atp-preview-pr123
spec:
  template:
    spec:
      tolerations:
      - key: preview
        operator: Equal
        value: "true"
        effect: NoSchedule
      nodeSelector:
        pool: preview-spot

Access Control

Preview URL Generation

Preview URL Format:

  • Format: https://pr{PR_NUMBER}.preview.atp.connectsoft.example
  • Examples:
      • https://pr123.preview.atp.connectsoft.example
      • https://pr456.preview.atp.connectsoft.example

URL Generation Script:

#!/bin/bash
# scripts/generate-preview-url.sh

PR_NUMBER="${1}"
BASE_DOMAIN="preview.atp.connectsoft.example"

PREVIEW_URL="https://pr${PR_NUMBER}.${BASE_DOMAIN}"

echo "${PREVIEW_URL}"
# Output: https://pr123.preview.atp.connectsoft.example

Update PR Description with Preview URL:

#!/bin/bash
# scripts/update-pr-with-preview-url.sh

PR_NUMBER="${1}"
PREVIEW_URL="${2}"

echo "🔗 Updating PR #${PR_NUMBER} with preview URL: ${PREVIEW_URL}"

# Add preview URL to PR description via Azure DevOps API
az repos pr update \
  --organization "https://dev.azure.com/ConnectSoft" \
  --project "ATP" \
  --pull-request-id "${PR_NUMBER}" \
  --description "
## 🚀 Preview Environment

Preview environment is ready for testing:

**Preview URL**: ${PREVIEW_URL}

**Status**: ✅ Active

**Services**:
- API Gateway: ${PREVIEW_URL}/gateway
- Ingestion: ${PREVIEW_URL}/ingestion
- Query: ${PREVIEW_URL}/query

**TTL**: 24 hours (auto-cleanup after PR merge/close)
"

Authentication for Preview Environments

Preview Authentication Options:

| Method | Implementation | Security | ATP Selection |
|---|---|---|---|
| No Auth | Public access | ❌ None | ❌ Not recommended |
| Basic Auth | Nginx basic auth | ⚠️ Low | ⚠️ Option for simple testing |
| OAuth/SSO | Azure AD integration | ✅ High | ✅ Recommended for production-like testing |
| IP Whitelist | Network policy | ⚠️ Moderate | ⚠️ Option for restricted access |

Basic Auth Configuration:

# Ingress with basic auth
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: atp-preview-ingress
  namespace: atp-preview-pr123
  annotations:
    nginx.ingress.kubernetes.io/auth-type: basic
    nginx.ingress.kubernetes.io/auth-secret: preview-basic-auth
    nginx.ingress.kubernetes.io/auth-realm: "Preview Environment - PR #123"
spec:
  ingressClassName: nginx
  rules:
  - host: pr123.preview.atp.connectsoft.example
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: atp-ingestion
            port:
              number: 80
---
# Basic auth secret
apiVersion: v1
kind: Secret
metadata:
  name: preview-basic-auth
  namespace: atp-preview-pr123
type: Opaque
data:
  # Must be an htpasswd-format entry (user:hashed-password), base64-encoded;
  # plain base64 of "user:password" will NOT work, and shell substitution
  # is not evaluated inside YAML.
  # Generate with: htpasswd -nb preview preview123 | base64 -w0
  auth: <base64-encoded htpasswd entry>

OAuth Configuration:

# Ingress with OAuth
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: atp-preview-ingress
  namespace: atp-preview-pr123
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-staging
    nginx.ingress.kubernetes.io/auth-url: "https://oauth2-proxy.atp-production.svc.cluster.local/oauth2/auth"
    # auth-signin must be a browser-reachable URL (example external hostname),
    # not a cluster-internal service address
    nginx.ingress.kubernetes.io/auth-signin: "https://auth.preview.atp.connectsoft.example/oauth2/start?rd=$scheme://$host$request_uri"
spec:
  ingressClassName: nginx
  rules:
  - host: pr123.preview.atp.connectsoft.example
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: atp-ingestion
            port:
              number: 80

Network Policies for Preview

Preview Network Policy:

# Network policy for preview: Allow ingress from internet
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: preview-allow-ingress
  namespace: atp-preview-pr123
spec:
  podSelector:
    matchLabels:
      app: atp-ingestion
  policyTypes:
  - Ingress
  - Egress
  ingress:
  # Allow from ingress controller
  - from:
    - namespaceSelector:
        matchLabels:
          name: ingress-nginx
    - podSelector:
        matchLabels:
          app: ingress-nginx
  # Allow from monitoring (for metrics)
  - from:
    - namespaceSelector:
        matchLabels:
          name: monitoring
  egress:
  # Allow DNS
  - to:
    - namespaceSelector:
        matchLabels:
          name: kube-system
    ports:
    - protocol: UDP
      port: 53
  # Allow to shared test database
  - to:
    - namespaceSelector:
        matchLabels:
          name: atp-test-db
    ports:
    - protocol: TCP
      port: 5432

Integration Testing in Preview

Running Integration Tests Against Preview

Integration Test Pipeline:

# azure-pipelines-preview-tests.yml
trigger: none

pr:
  branches:
    include:
    - dev

pool:
  vmImage: 'ubuntu-latest'

variables:
  - name: PR_NUMBER
    # Runtime-only variable; compile-time ${{ }} expressions cannot read it
    value: $(System.PullRequest.PullRequestNumber)
  - name: PREVIEW_URL
    value: https://pr$(PR_NUMBER).preview.atp.connectsoft.example

stages:
- stage: RunIntegrationTests
  displayName: 'Run Integration Tests Against Preview'
  jobs:
  - job: IntegrationTests
    displayName: 'Integration Tests'
    steps:
    - task: DotNetCoreCLI@2
      displayName: 'Run Integration Tests'
      inputs:
        command: 'test'
        projects: '**/IntegrationTests.csproj'
        arguments: >-
          --filter "Category=Preview"
          --logger "trx;LogFileName=results.trx"
          --results-directory $(Agent.TempDirectory)/test-results
      env:
        PreviewUrl: $(PREVIEW_URL)  # surfaced to the tests as an environment variable

    - task: PublishTestResults@2
      displayName: 'Publish Test Results'
      inputs:
        testResultsFiles: '**/*.trx'
        testRunTitle: 'Preview Integration Tests - PR #$(PR_NUMBER)'

Integration Test Configuration:

// C#: Integration test configuration (xUnit)
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;
using Xunit;

public class PreviewIntegrationTests
{
    private readonly string _previewUrl;

    public PreviewIntegrationTests()
    {
        _previewUrl = Environment.GetEnvironmentVariable("PreviewUrl")
            ?? "https://pr123.preview.atp.connectsoft.example";
    }

    [Fact]
    [Trait("Category", "Preview")]  // xUnit traits; matched by --filter "Category=Preview"
    public async Task TestIngestionService()
    {
        var client = new HttpClient
        {
            BaseAddress = new Uri(_previewUrl)
        };

        var response = await client.GetAsync("/health/ready");
        Assert.Equal(HttpStatusCode.OK, response.StatusCode);
    }
}

Database/Dependency Mocking

Mocking Strategy:

| Dependency | Strategy | Implementation |
|---|---|---|
| Database | ⚠️ Shared test DB | ✅ Real database (isolated schema) |
| Redis | ✅ Shared test Redis | ✅ Real Redis (isolated keys) |
| External APIs | ✅ Mock | ✅ WireMock or MSW |
| Service Bus | ⚠️ Shared test queue | ✅ Real Service Bus (isolated queue) |

Mocked Dependencies Configuration:

# External API mocks
apiVersion: apps/v1
kind: Deployment
metadata:
  name: external-api-mock
  namespace: atp-preview-pr123
spec:
  replicas: 1
  template:
    spec:
      containers:
      - name: wiremock
        image: wiremock/wiremock:latest
        ports:
        - containerPort: 8080
        env:
        - name: MAPPINGS_DIR
          value: /home/wiremock/mappings
        volumeMounts:
        - name: mappings
          mountPath: /home/wiremock/mappings
      volumes:
      - name: mappings
        configMap:
          name: wiremock-mappings

Shared Test Services

Shared Test Services Architecture:

graph TB
    subgraph "Shared Test Namespace"
        TEST_DB[(Shared Test DB<br/>Isolated schemas)]
        TEST_REDIS[(Shared Test Redis<br/>Isolated keys)]
        TEST_SB[Shared Test Service Bus<br/>Isolated queues]
    end
    subgraph "Preview PR #123"
        PREVIEW1[Preview Services]
    end
    subgraph "Preview PR #124"
        PREVIEW2[Preview Services]
    end

    PREVIEW1 -->|Isolated schema| TEST_DB
    PREVIEW2 -->|Isolated schema| TEST_DB
    PREVIEW1 -->|Isolated keys| TEST_REDIS
    PREVIEW2 -->|Isolated keys| TEST_REDIS
    PREVIEW1 -->|Isolated queue| TEST_SB
    PREVIEW2 -->|Isolated queue| TEST_SB

    style TEST_DB fill:#FFE5B4
    style TEST_REDIS fill:#FFE5B4
    style TEST_SB fill:#FFE5B4

Database and Dependencies

Mock Services vs Real Dependencies

Dependency Strategy Matrix:

| Dependency | Mock | Real | ATP Decision |
|---|---|---|---|
| SQL Database | ⚠️ Possible | Real (isolated schema) | ✅ Real with isolation |
| Redis | ⚠️ Possible | Real (isolated keys) | ✅ Real with isolation |
| Service Bus | ⚠️ Possible | Real (isolated queue) | ✅ Real with isolation |
| External APIs | ✅ Mock | ⚠️ Costly | ✅ Mock (WireMock) |
| Key Vault | ❌ N/A | Real | ✅ Real (shared) |

Shared Test Database Approach

Shared Test Database with Isolated Schemas:

# ExternalSecret for shared test database
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: sql-connection-preview
  namespace: atp-preview-pr123
spec:
  secretStoreRef:
    name: azure-keyvault-dev
    kind: ClusterSecretStore
  data:
  - secretKey: connectionString
    remoteRef:
      key: connection-strings/atp-test-db/preview-connection-string
      # Connection string with PR-specific schema: atp_preview_pr123

Database Schema Isolation:

-- Create isolated schema per PR
CREATE SCHEMA IF NOT EXISTS atp_preview_pr123;
GRANT ALL PRIVILEGES ON SCHEMA atp_preview_pr123 TO atp_preview_user;

-- Connection string for PR #123 (PostgreSQL; schema isolation via search_path)
-- Host=test-db.postgres.database.azure.com;Database=atp_test;SearchPath=atp_preview_pr123;...

Database Cleanup:

#!/bin/bash
# scripts/cleanup-preview-database.sh

PR_NUMBER="${1}"

echo "🗑️  Cleaning up database schema for PR #${PR_NUMBER}"

SCHEMA_NAME="atp_preview_pr${PR_NUMBER}"

# Drop schema (cascades to all objects)
psql -h test-db.postgres.database.azure.com \
  -U atp_admin \
  -d atp_test \
  -c "DROP SCHEMA IF EXISTS ${SCHEMA_NAME} CASCADE;"

echo "✅ Database schema cleaned up"
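
The cleanup script interpolates ${PR_NUMBER} straight into DDL. A small guard (hypothetical helper, not part of the scripts above) rejects anything that is not a plain number before it can reach the DROP SCHEMA statement:

```shell
# Derive the per-PR schema name, refusing non-numeric input so nothing
# hostile can be spliced into the DROP SCHEMA ... CASCADE statement.
schema_for_pr() {
  case "$1" in
    ''|*[!0-9]*) echo "invalid PR number: $1" >&2; return 1 ;;
  esac
  echo "atp_preview_pr$1"
}

schema_for_pr 123                         # prints atp_preview_pr123
schema_for_pr 'x; DROP' || echo "rejected"
```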

Ephemeral Database per Preview

Ephemeral Database Option:

# Option: Create ephemeral database per preview (costlier but more isolated)
apiVersion: v1
kind: ConfigMap
metadata:
  name: preview-db-config
  namespace: atp-preview-pr123
data:
  database-name: "atp_preview_pr123"
  create-database: "true"
  ttl: "24h"

Ephemeral Database Creation:

#!/bin/bash
# scripts/create-preview-database.sh

PR_NUMBER="${1}"

echo "📦 Creating ephemeral database for PR #${PR_NUMBER}"

DB_NAME="atp_preview_pr${PR_NUMBER}"

# Create database via Azure CLI
az sql db create \
  --resource-group atp-nonprod-rg \
  --server atp-test-sql-server \
  --name "${DB_NAME}" \
  --service-objective S0 \
  --tags \
    Environment=Preview \
    PRNumber="${PR_NUMBER}" \
    AutoCleanup=true \
    TTL="24h"

echo "✅ Ephemeral database created: ${DB_NAME}"

ATP Decision: Shared test database with isolated schemas (cost-effective, sufficient isolation)


Preview Environment Lifecycle

Creation → Testing → Validation → Deletion

Preview Lifecycle Flow:

sequenceDiagram
    participant Dev as Developer
    participant PR as Pull Request
    participant Pipeline as Azure Pipeline
    participant K8s as Kubernetes
    participant FluxCD as FluxCD
    participant Tests as Integration Tests

    Dev->>PR: Create PR
    PR->>Pipeline: Trigger preview pipeline
    Pipeline->>K8s: Create namespace
    Pipeline->>FluxCD: Create GitRepository/Kustomization
    FluxCD->>K8s: Deploy preview manifests
    K8s->>Pipeline: Preview ready
    Pipeline->>PR: Update PR with preview URL
    Pipeline->>Tests: Run integration tests
    Tests->>PR: Update PR with test results
    PR->>Pipeline: PR merged/closed
    Pipeline->>K8s: Delete namespace
    Pipeline->>FluxCD: Delete GitRepository/Kustomization
    K8s->>Pipeline: Cleanup complete

Status Reporting in PR Comments

PR Status Comment:

#!/bin/bash
# scripts/update-pr-status.sh

PR_NUMBER="${1}"
STATUS="${2}"  # provisioning, active, testing, failed, cleaning
PREVIEW_URL="${3}"

echo "📝 Updating PR #${PR_NUMBER} status: ${STATUS}"

STATUS_EMOJI=""
case "${STATUS}" in
  provisioning) STATUS_EMOJI="🔄" ;;
  active) STATUS_EMOJI="✅" ;;
  testing) STATUS_EMOJI="🧪" ;;
  failed) STATUS_EMOJI="❌" ;;
  cleaning) STATUS_EMOJI="🧹" ;;
esac

COMMENT="## ${STATUS_EMOJI} Preview Environment Status

**Status**: ${STATUS}

${PREVIEW_URL:+**Preview URL**: ${PREVIEW_URL}}

**Timestamp**: $(date -u +%Y-%m-%dT%H:%M:%SZ)
"

# Add comment to PR via Azure DevOps API
az repos pr thread create \
  --organization "https://dev.azure.com/ConnectSoft" \
  --project "ATP" \
  --pull-request-id "${PR_NUMBER}" \
  --comments "[{\"content\": \"${COMMENT}\"}]"

Preview URL in PR Description

PR Description Update:

# Azure Pipeline: Update PR description
- task: Bash@3
  displayName: 'Update PR Description'
  inputs:
    targetType: 'inline'
    script: |
      ./scripts/update-pr-description.sh \
        --pr-number $(PR_NUMBER) \
        --preview-url $(PREVIEW_URL) \
        --status active

PR Description Template:

## 🚀 Preview Environment

Preview environment has been provisioned for this PR.

### Access Information

- **Preview URL**: https://pr123.preview.atp.connectsoft.example
- **Status**: ✅ Active
- **Namespace**: `atp-preview-pr123`

### Services

- **API Gateway**: https://pr123.preview.atp.connectsoft.example/gateway
- **Ingestion Service**: https://pr123.preview.atp.connectsoft.example/ingestion
- **Query Service**: https://pr123.preview.atp.connectsoft.example/query

### Testing

Integration tests have been executed against the preview environment.

- ✅ Smoke tests: Passed
- ✅ Integration tests: Passed
- ✅ Health checks: Passed

### Cleanup

This preview environment will be automatically cleaned up when:
- PR is merged
- PR is closed
- 4 hours of inactivity (auto scale-to-zero; overall namespace TTL is 24 hours)

**Created**: 2024-01-15T10:00:00Z
**TTL**: 24 hours

Summary: Preview Environments (Ephemeral)

  • Preview Environment Architecture: Ephemeral namespaces in dev cluster, resource isolation per PR, cost optimization strategies, lifecycle management
  • Automatic Provisioning: Azure Pipeline triggered by PR, namespace creation script, manifest generation with PR-specific values, FluxCD Kustomization for preview
  • Dynamic Manifest Generation: Namespace naming (atp-preview-pr{number}), Ingress hostname (pr{number}.preview.atp.connectsoft.example), resource limits (smaller than dev), image tag from PR branch
  • FluxCD Configuration: Dynamic GitRepository per PR, preview Kustomization, sync policies for preview, health checks and validation
  • Resource Cleanup: Auto-delete after PR merge, auto-delete after PR close, manual cleanup for stuck resources, cost tracking and alerts
  • Cost Optimization: Shared node pools, reduced replica counts (1 vs 3), minimal resource requests, auto-shutdown after 4 hours inactivity, spot instances for preview environments
  • Access Control: Preview URL generation, authentication for preview environments (basic auth/OAuth), network policies for preview
  • Integration Testing: Running integration tests against preview, database/dependency mocking, shared test services
  • Database and Dependencies: Mock services vs real dependencies, shared test database approach with isolated schemas, ephemeral database per preview option
  • Preview Environment Lifecycle: Creation → testing → validation → deletion flow, status reporting in PR comments, preview URL in PR description

Rollback & Disaster Recovery

Purpose: Define rollback procedures for ATP GitOps deployments including Git-based rollbacks, progressive rollback strategies, application state recovery, database migration rollbacks, FluxCD rollback mechanisms, Azure backup integration, disaster recovery scenarios, and incident response procedures to ensure rapid recovery from failures and minimize downtime.


Git-Based Rollback

Simple Rollback: Git Revert

Git Revert for Simple Rollback:

#!/bin/bash
# scripts/rollback-simple.sh

SERVICE="${1:-atp-ingestion}"
ENVIRONMENT="${2:-production}"
NAMESPACE="atp-${ENVIRONMENT}"

echo "⏪ Rolling back ${SERVICE} in ${ENVIRONMENT}"

# Find the last deployment commit
LAST_COMMIT=$(git log --oneline --grep="deploy.*${SERVICE}" -n 1 --format="%H")

if [ -z "${LAST_COMMIT}" ]; then
  echo "❌ No deployment commit found for ${SERVICE}"
  exit 1
fi

echo "📝 Last deployment commit: ${LAST_COMMIT}"

# Revert the commit
git revert --no-edit "${LAST_COMMIT}"

# Push the revert commit
git push origin ${ENVIRONMENT}

echo "✅ Rollback committed: ${SERVICE} reverted to previous state"

# FluxCD will automatically reconcile to the new Git state

Git Revert for Multiple Commits:

#!/bin/bash
# scripts/rollback-multiple.sh

SERVICE="${1}"
ENVIRONMENT="${2:-production}"
COMMIT_COUNT="${3:-1}"  # Number of commits to revert

echo "⏪ Rolling back ${COMMIT_COUNT} commits for ${SERVICE}"

# Revert the last N commits, newest first (reverting oldest-first can conflict
# when later commits depend on the earlier ones)
git log -n ${COMMIT_COUNT} --format="%H" | while read commit; do
  echo "Reverting commit: ${commit}"
  git revert --no-edit "${commit}"
done

# Push all revert commits
git push origin ${ENVIRONMENT}

echo "✅ Rolled back ${COMMIT_COUNT} commits"
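
The per-commit loop can also be collapsed into a single range revert, which git processes newest-first automatically. A sketch in a throwaway repository (file names and commit messages are illustrative, not the ATP repo):

```shell
# Demonstrate a range revert: undo the last two "deployments" in one command
REPO=$(mktemp -d)
cd "$REPO"
git init -q
git config user.email ci@example.com
git config user.name ci

echo "v1" > manifest.yaml; git add manifest.yaml; git commit -qm "deploy v1"
echo "v2" > manifest.yaml; git commit -qam "deploy v2"
echo "v3" > manifest.yaml; git commit -qam "deploy v3"

# HEAD~2..HEAD covers the v2 and v3 commits; git revert walks them newest-first
git revert --no-edit HEAD~2..HEAD

cat manifest.yaml   # back to "v1", with history preserved (2 revert commits)
```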

Complex Rollback: Git Reset

Git Reset for Complex Rollback (Use with caution):

#!/bin/bash
# scripts/rollback-reset.sh

ENVIRONMENT="${1:-production}"
TARGET_COMMIT="${2}"  # Commit hash or tag to rollback to

if [ -z "${TARGET_COMMIT}" ]; then
  echo "Usage: $0 <environment> <commit-hash-or-tag>"
  echo "Example: $0 production v1.2.2"
  exit 1
fi

echo "⚠️  WARNING: Git reset will rewrite history"
echo "⏪ Rolling back ${ENVIRONMENT} to ${TARGET_COMMIT}"

# Verify target commit exists
if ! git cat-file -e "${TARGET_COMMIT}^{commit}" 2>/dev/null; then
  echo "❌ Target commit ${TARGET_COMMIT} not found"
  exit 1
fi

# Create backup branch before reset
BACKUP_BRANCH="${ENVIRONMENT}-backup-$(date +%Y%m%d-%H%M%S)"
git branch "${BACKUP_BRANCH}" "${ENVIRONMENT}"
echo "📦 Backup branch created: ${BACKUP_BRANCH}"

# Reset to target commit (hard reset makes the branch tree match the target;
# a soft reset would leave the newer state staged and defeat the rollback)
git checkout "${ENVIRONMENT}"
git reset --hard "${TARGET_COMMIT}"

# Force push (requires branch protection override for emergency)
git push origin "${ENVIRONMENT}" --force

echo "✅ Rollback complete: ${ENVIRONMENT} reset to ${TARGET_COMMIT}"
echo "⚠️  Backup branch: ${BACKUP_BRANCH} (keep for reference)"

ATP Recommendation: Prefer git revert over git reset (preserves history, safer for audit trail)

Rollback to Specific Commit

Rollback to Specific Commit:

#!/bin/bash
# scripts/rollback-to-commit.sh

SERVICE="${1}"
ENVIRONMENT="${2:-production}"
TARGET_COMMIT="${3}"

if [ -z "${TARGET_COMMIT}" ]; then
  echo "Usage: $0 <service> <environment> <commit-hash>"
  echo "Example: $0 atp-ingestion production abc123def456"
  exit 1
fi

echo "⏪ Rolling back ${SERVICE} to commit ${TARGET_COMMIT}"

# Checkout the target commit
git checkout "${TARGET_COMMIT}" -- "apps/${SERVICE}/"

# Check if changes exist
if git diff --quiet "${ENVIRONMENT}" -- "apps/${SERVICE}/"; then
  echo "⚠️  No changes to rollback (already at target commit)"
  exit 0
fi

# Commit the rollback
git add "apps/${SERVICE}/"
git commit -m "rollback(${SERVICE}): Revert to commit ${TARGET_COMMIT}"

# Push to environment branch
git push origin "${ENVIRONMENT}"

echo "✅ Rollback complete: ${SERVICE} reverted to ${TARGET_COMMIT}"
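
The path-level restore this script relies on (`git checkout <commit> -- <path>`) stages an older version of one directory without rewriting history. A sketch in a throwaway repository (paths and image tags are illustrative):

```shell
# Demonstrate restoring a single service's manifests to a known-good commit
REPO=$(mktemp -d)
cd "$REPO"
git init -q
git config user.email ci@example.com
git config user.name ci

mkdir -p apps/atp-ingestion
echo "image: atp-ingestion:1.0.0" > apps/atp-ingestion/deployment.yaml
git add .; git commit -qm "deploy 1.0.0"
GOOD_COMMIT=$(git rev-parse HEAD)

echo "image: atp-ingestion:1.1.0" > apps/atp-ingestion/deployment.yaml
git commit -qam "deploy 1.1.0"

# Restore just this service's manifests; the change is staged, then committed
# as a new rollback commit (history preserved, unlike git reset)
git checkout "${GOOD_COMMIT}" -- apps/atp-ingestion/
git commit -qm "rollback(atp-ingestion): revert to ${GOOD_COMMIT}"

grep "1.0.0" apps/atp-ingestion/deployment.yaml   # prints the 1.0.0 image line
```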

Rollback to Commit with Validation:

#!/bin/bash
# scripts/rollback-to-commit-validated.sh

SERVICE="${1}"
ENVIRONMENT="${2:-production}"
TARGET_COMMIT="${3}"

echo "⏪ Rolling back ${SERVICE} to commit ${TARGET_COMMIT}"

# Validate target commit
echo "🔍 Validating target commit..."
git show --no-patch --format="%H%n%an%n%ae%n%ad%n%s" "${TARGET_COMMIT}"

read -p "Continue with rollback? (yes/no): " confirm
if [ "${confirm}" != "yes" ]; then
  echo "❌ Rollback cancelled"
  exit 1
fi

# Perform rollback
./scripts/rollback-to-commit.sh "${SERVICE}" "${ENVIRONMENT}" "${TARGET_COMMIT}"

# Wait for FluxCD reconciliation
echo "⏳ Waiting for FluxCD to reconcile..."
sleep 60

# Verify rollback
./scripts/verify-rollback.sh "${SERVICE}" "${ENVIRONMENT}" "${TARGET_COMMIT}"

Rollback to Specific Tag

Rollback to Specific Tag:

#!/bin/bash
# scripts/rollback-to-tag.sh

SERVICE="${1}"
ENVIRONMENT="${2:-production}"
TARGET_TAG="${3}"  # e.g., v1.2.2

if [ -z "${TARGET_TAG}" ]; then
  echo "Usage: $0 <service> <environment> <tag>"
  echo "Example: $0 atp-ingestion production v1.2.2"
  exit 1
fi

echo "⏪ Rolling back ${SERVICE} to tag ${TARGET_TAG}"

# Verify tag exists
if ! git rev-parse "${TARGET_TAG}" >/dev/null 2>&1; then
  echo "❌ Tag ${TARGET_TAG} not found"
  echo "Available tags:"
  git tag --sort=-creatordate | head -10
  exit 1
fi

# Get commit hash for tag
TARGET_COMMIT=$(git rev-parse "${TARGET_TAG}")

echo "📦 Tag ${TARGET_TAG} points to commit ${TARGET_COMMIT}"

# Rollback to the tagged commit
./scripts/rollback-to-commit.sh "${SERVICE}" "${ENVIRONMENT}" "${TARGET_COMMIT}"

echo "✅ Rollback complete: ${SERVICE} reverted to ${TARGET_TAG} (${TARGET_COMMIT})"

List Available Tags for Rollback:

#!/bin/bash
# scripts/list-rollback-tags.sh

SERVICE="${1}"
ENVIRONMENT="${2:-production}"

echo "📋 Available rollback tags for ${SERVICE}:"
echo ""

git tag --sort=-creatordate --format="%(refname:short)|%(creatordate:iso)|%(subject)" | \
  while IFS='|' read -r tag date subject; do
    # Check if tag affects the service
    if git diff "${tag}~1" "${tag}" --name-only | grep -q "apps/${SERVICE}/"; then
      echo "  ${tag} - ${date}"
      echo "    ${subject}"
      echo ""
    fi
  done

Progressive Rollback

Rolling Back One Service at a Time

Progressive Service Rollback:

#!/bin/bash
# scripts/progressive-rollback.sh

ENVIRONMENT="${1:-production}"
SERVICES="${2}"  # Comma-separated: atp-ingestion,atp-query,atp-gateway

if [ -z "${SERVICES}" ]; then
  echo "Usage: $0 <environment> <service1,service2,service3>"
  echo "Example: $0 production atp-ingestion,atp-query,atp-gateway"
  exit 1
fi

echo "🔄 Progressive rollback: ${SERVICES} in ${ENVIRONMENT}"

# Split services into array
IFS=',' read -ra SERVICE_ARRAY <<< "${SERVICES}"

for SERVICE in "${SERVICE_ARRAY[@]}"; do
  echo ""
  echo "⏪ Rolling back ${SERVICE}..."

  # Rollback service
  ./scripts/rollback-simple.sh "${SERVICE}" "${ENVIRONMENT}"

  # Wait for reconciliation
  echo "⏳ Waiting for reconciliation (60s)..."
  sleep 60

  # Validate rollback
  echo "🔍 Validating rollback..."
  if ./scripts/verify-service-health.sh "${SERVICE}" "${ENVIRONMENT}"; then
    echo "✅ ${SERVICE} rollback validated"
  else
    echo "❌ ${SERVICE} rollback validation failed"
    read -p "Continue with next service? (yes/no): " continue
    if [ "${continue}" != "yes" ]; then
      echo "⚠️  Progressive rollback stopped"
      exit 1
    fi
  fi
done

echo ""
echo "✅ Progressive rollback complete: All services rolled back"

Rollback with Canary (Gradual Revert)

Canary Rollback Configuration:

# Rollback with Flagger canary (gradual traffic reduction)
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: atp-ingestion-rollback
  namespace: atp-production
spec:
  analysis:
    interval: 1m
    threshold: 3
    stepWeights: [75, 50, 25, 0]  # Canary traffic: 75% → 50% → 25% → 0% (full rollback)
    metrics:
    - name: error-rate
      thresholdRange:
        max: 1
      interval: 30s

Gradual Rollback Script:

#!/bin/bash
# scripts/canary-rollback.sh

SERVICE="${1}"
ENVIRONMENT="${2:-production}"
ROLLBACK_STEPS="${3:-4}"  # Number of steps

echo "🔄 Gradual canary rollback: ${SERVICE} in ${ENVIRONMENT}"

# Current canary weight (assume 100% for rollback start)
CURRENT_WEIGHT=100
STEP_SIZE=$((100 / ROLLBACK_STEPS))

for STEP in $(seq 1 ${ROLLBACK_STEPS}); do
  NEW_WEIGHT=$((CURRENT_WEIGHT - STEP_SIZE))

  echo "📊 Step ${STEP}/${ROLLBACK_STEPS}: Reducing traffic to ${NEW_WEIGHT}%"

  # Update Istio VirtualService: route weights must sum to 100, so adjust
  # both the stable (route/0) and canary (route/1) weights together
  kubectl patch virtualservice "${SERVICE}" -n "${ENVIRONMENT}" --type=json \
    -p="[{\"op\": \"replace\", \"path\": \"/spec/http/0/route/0/weight\", \"value\": $((100 - NEW_WEIGHT))}, {\"op\": \"replace\", \"path\": \"/spec/http/0/route/1/weight\", \"value\": ${NEW_WEIGHT}}]"

  # Wait and validate
  echo "⏳ Waiting 2 minutes for validation..."
  sleep 120

  # Check error rate
  ERROR_RATE=$(./scripts/get-error-rate.sh "${SERVICE}" "${ENVIRONMENT}")
  echo "📈 Error rate: ${ERROR_RATE}%"

  if (( $(echo "${ERROR_RATE} > 5" | bc -l) )); then
    echo "❌ Error rate too high, accelerating rollback"
    NEW_WEIGHT=$((NEW_WEIGHT - STEP_SIZE))
  fi

  CURRENT_WEIGHT=${NEW_WEIGHT}

  if [ ${CURRENT_WEIGHT} -le 0 ]; then
    echo "✅ Full rollback complete (0% traffic to canary)"
    break
  fi
done

echo "✅ Gradual rollback complete"
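
The traffic schedule the loop walks through can be computed in isolation (the `rollback_weights` helper is a hypothetical sketch of the step arithmetic above, not part of the scripts):

```shell
# Emit the canary weight after each rollback step: 100% down to 0% in equal steps
rollback_weights() {
  local steps="$1" w=100 step_size
  step_size=$((100 / steps))
  for _ in $(seq 1 "$steps"); do
    w=$((w - step_size))
    if [ "$w" -lt 0 ]; then w=0; fi   # clamp when 100 is not divisible by steps
    echo "$w"
  done
}

rollback_weights 4   # one weight per line: 75, 50, 25, 0
```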

Validation at Each Rollback Step

Rollback Validation Script:

#!/bin/bash
# scripts/validate-rollback-step.sh

SERVICE="${1}"
ENVIRONMENT="${2:-production}"
STEP="${3}"

echo "🔍 Validating rollback step ${STEP} for ${SERVICE}"

# Health check validation
HEALTH_STATUS=$(kubectl get deployment "${SERVICE}" -n "atp-${ENVIRONMENT}" \
  -o jsonpath='{.status.conditions[?(@.type=="Available")].status}')

if [ "${HEALTH_STATUS}" != "True" ]; then
  echo "❌ Health check failed: Deployment not available"
  exit 1
fi

# Error rate validation
ERROR_RATE=$(./scripts/get-error-rate.sh "${SERVICE}" "${ENVIRONMENT}")
ERROR_THRESHOLD=5

if (( $(echo "${ERROR_RATE} > ${ERROR_THRESHOLD}" | bc -l) )); then
  echo "❌ Error rate validation failed: ${ERROR_RATE}% > ${ERROR_THRESHOLD}%"
  exit 1
fi

# Latency validation
P95_LATENCY=$(./scripts/get-p95-latency.sh "${SERVICE}" "${ENVIRONMENT}")
LATENCY_THRESHOLD=500  # 500ms

if (( $(echo "${P95_LATENCY} > ${LATENCY_THRESHOLD}" | bc -l) )); then
  echo "❌ Latency validation failed: ${P95_LATENCY}ms > ${LATENCY_THRESHOLD}ms"
  exit 1
fi

# Readiness probe validation
READY_REPLICAS=$(kubectl get deployment "${SERVICE}" -n "atp-${ENVIRONMENT}" \
  -o jsonpath='{.status.readyReplicas}')
DESIRED_REPLICAS=$(kubectl get deployment "${SERVICE}" -n "atp-${ENVIRONMENT}" \
  -o jsonpath='{.spec.replicas}')

if [ "${READY_REPLICAS}" != "${DESIRED_REPLICAS}" ]; then
  echo "❌ Replica validation failed: ${READY_REPLICAS}/${DESIRED_REPLICAS} ready"
  exit 1
fi

echo "✅ All validation checks passed for step ${STEP}"

Application State Recovery

Handling Database Schema Changes

Database Schema Rollback Strategy:

| Migration Type | Rollback Strategy | ATP Decision |
|---|---|---|
| Add Column | Drop column (if nullable) | ✅ Safe rollback |
| Drop Column | Add column back | ⚠️ Data loss risk |
| Rename Column | Rename back | ✅ Safe rollback |
| Change Type | Revert type change | ⚠️ Data truncation risk |
| Add Table | Drop table | ✅ Safe rollback |
| Drop Table | Recreate table | ❌ Data loss |
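The decision table can be approximated mechanically when triaging a migration file. A rough sketch (a hypothetical helper, not an existing ATP script) that buckets a migration's SQL by rollback risk:

```shell
#!/bin/bash
# Hypothetical helper: bucket a migration's SQL by rollback risk, mirroring
# the decision table above (coarse keyword matching only -- a real
# classifier would parse the statements)
classify_migration_risk() {
  local sql="${1}"
  if echo "${sql}" | grep -qiE 'DROP[[:space:]]+TABLE|TRUNCATE'; then
    echo "data-loss"   # ❌ rollback cannot restore the data
  elif echo "${sql}" | grep -qiE 'DROP[[:space:]]+COLUMN|ALTER[[:space:]]+COLUMN'; then
    echo "risky"       # ⚠️ data loss or truncation possible
  else
    echo "safe"        # ✅ additive change, safe to roll back
  fi
}

classify_migration_risk "DROP TABLE AuditTrail;"   # prints: data-loss
```

A check like this belongs in the pull-request pipeline, so risky migrations are flagged before they ever reach an environment.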

Forward-Only Migrations (Preferred):

// C#: Forward-only migration (no rollback)
// Entity Framework migration: AddAuditIndex
public partial class AddAuditIndex : Migration
{
    protected override void Up(MigrationBuilder migrationBuilder)
    {
        migrationBuilder.CreateIndex(
            name: "IX_AuditTrail_Timestamp",
            table: "AuditTrail",
            column: "Timestamp");
    }

    // No Down() method - forward-only migration
    // Rollback = deploy previous version that doesn't use the index
}

Database Rollback Coordination:

#!/bin/bash
# scripts/rollback-with-db.sh

SERVICE="${1}"
ENVIRONMENT="${2:-production}"
TARGET_VERSION="${3}"

echo "🔄 Coordinated rollback: Application + Database"

# Step 1: Check if database rollback is needed
CURRENT_SCHEMA_VERSION=$(./scripts/get-db-schema-version.sh "${ENVIRONMENT}")
TARGET_SCHEMA_VERSION=$(./scripts/get-schema-version-for-tag.sh "${TARGET_VERSION}")

if [ "${CURRENT_SCHEMA_VERSION}" != "${TARGET_SCHEMA_VERSION}" ]; then
  echo "⚠️  Database schema rollback required"
  echo "  Current: ${CURRENT_SCHEMA_VERSION}"
  echo "  Target: ${TARGET_SCHEMA_VERSION}"

  read -p "Proceed with database rollback? (yes/no): " confirm
  if [ "${confirm}" != "yes" ]; then
    echo "❌ Rollback cancelled"
    exit 1
  fi

  # Rollback database schema
  ./scripts/rollback-database-schema.sh "${ENVIRONMENT}" "${TARGET_SCHEMA_VERSION}"
fi

# Step 2: Rollback application
./scripts/rollback-to-tag.sh "${SERVICE}" "${ENVIRONMENT}" "${TARGET_VERSION}"

echo "✅ Coordinated rollback complete"

Data Migration Rollback

Data Migration Rollback Strategy:

#!/bin/bash
# scripts/rollback-data-migration.sh

ENVIRONMENT="${1:-production}"
MIGRATION_ID="${2}"

echo "🔄 Rolling back data migration: ${MIGRATION_ID}"

# Check if migration has been applied
if ! ./scripts/check-migration-applied.sh "${MIGRATION_ID}" "${ENVIRONMENT}"; then
  echo "⚠️  Migration ${MIGRATION_ID} not applied, skipping rollback"
  exit 0
fi

# Execute rollback script (if exists)
ROLLBACK_SCRIPT="migrations/${MIGRATION_ID}/rollback.sql"

if [ -f "${ROLLBACK_SCRIPT}" ]; then
  echo "📝 Executing rollback script: ${ROLLBACK_SCRIPT}"
  psql -h "${DB_HOST}" -U "${DB_USER}" -d "${DB_NAME}" -f "${ROLLBACK_SCRIPT}"
else
  echo "⚠️  No rollback script found: ${ROLLBACK_SCRIPT}"
  echo "⚠️  Manual data recovery may be required"
  exit 1
fi

# Mark migration as rolled back
./scripts/mark-migration-rolled-back.sh "${MIGRATION_ID}" "${ENVIRONMENT}"

echo "✅ Data migration rollback complete"

Stateful Application Considerations

StatefulSet Rollback:

#!/bin/bash
# scripts/rollback-statefulset.sh

SERVICE="${1}"
ENVIRONMENT="${2:-production}"

echo "⏪ Rolling back StatefulSet: ${SERVICE}"

# Record the current and update revisions for the audit log
# (updateRevision is the in-flight target, not the previous revision;
# kubectl rollout undo below selects the previous revision itself)
CURRENT_REVISION=$(kubectl get statefulset "${SERVICE}" -n "atp-${ENVIRONMENT}" \
  -o jsonpath='{.status.currentRevision}')
UPDATE_REVISION=$(kubectl get statefulset "${SERVICE}" -n "atp-${ENVIRONMENT}" \
  -o jsonpath='{.status.updateRevision}')
echo "📊 currentRevision=${CURRENT_REVISION}, updateRevision=${UPDATE_REVISION}"

# Rollback to previous revision
kubectl rollout undo statefulset "${SERVICE}" -n "atp-${ENVIRONMENT}"

# Monitor rollout
kubectl rollout status statefulset "${SERVICE}" -n "atp-${ENVIRONMENT}" --timeout=10m

echo "✅ StatefulSet rollback complete"

PVC Retention During Rollback:

# StatefulSet with PVC retention
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: atp-stateful-service
spec:
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 10Gi
      # PVCs are NOT deleted on StatefulSet deletion
      # Data is preserved during rollback

Database Migration Rollback

Forward-Only Migrations (Preferred)

Forward-Only Migration Strategy:

| Approach | Pros | Cons | ATP Decision |
|---|---|---|---|
| Forward-Only | ✅ Simpler, safer | ⚠️ No automatic rollback | Preferred |
| Reversible Migrations | ✅ Can rollback | ❌ Complex, risky | ⚠️ Use sparingly |
| No Migrations | ✅ No risk | ❌ No schema changes | ❌ Not practical |

Forward-Only Migration Example:

// Entity Framework: Forward-only migration
public partial class AddAuditIndex : Migration
{
    protected override void Up(MigrationBuilder migrationBuilder)
    {
        // Add index
        migrationBuilder.CreateIndex(
            name: "IX_AuditTrail_Timestamp",
            table: "AuditTrail",
            column: "Timestamp");
    }

    // No Down() method - rollback = deploy previous app version
}

Rollback Strategy for Forward-Only Migrations:

  1. Rollback Application: Deploy previous application version (doesn't use new schema)
  2. Schema Compatibility: New schema must be backward compatible with old application
  3. Cleanup Migration: Create new migration to clean up unused schema (later)
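The compatibility rule in step 2 can be enforced mechanically if each application version records the minimum schema version it tolerates. A minimal sketch, assuming a hypothetical version map (the real check lives in `scripts/verify-schema-compatibility.sh`, which is not shown here):

```shell
#!/bin/bash
# Sketch of the forward-only compatibility rule: the live schema may be
# NEWER than an app version requires, but never OLDER than its minimum.
# The map below is hypothetical; a real mapping would be generated from
# the migration history.
declare -A MIN_SCHEMA_FOR_APP=(
  ["v1.4.0"]=12
  ["v1.5.0"]=14
)

schema_compatible_with_app() {
  local app_version="${1}" schema_version="${2}"
  local required="${MIN_SCHEMA_FOR_APP[${app_version}]:-0}"
  [ "${schema_version}" -ge "${required}" ]
}

schema_compatible_with_app "v1.4.0" 14 && echo "compatible"   # prints: compatible
```

Under this rule, rolling the application back from v1.5.0 to v1.4.0 needs no schema rollback at all, because schema 14 still satisfies v1.4.0's minimum of 12.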

Rollback Scripts (If Necessary)

Reversible Migration with Rollback:

// Entity Framework: Reversible migration (use sparingly)
public partial class RenameAuditColumn : Migration
{
    protected override void Up(MigrationBuilder migrationBuilder)
    {
        migrationBuilder.RenameColumn(
            name: "EventDate",
            table: "AuditTrail",
            newName: "Timestamp");
    }

    protected override void Down(MigrationBuilder migrationBuilder)
    {
        migrationBuilder.RenameColumn(
            name: "Timestamp",
            table: "AuditTrail",
            newName: "EventDate");
    }
}

Rollback Script:

-- migrations/20240115_AddAuditIndex/rollback.sql
-- Rollback script for the AddAuditIndex migration (T-SQL / Azure SQL syntax)

-- Drop the index
DROP INDEX IF EXISTS IX_AuditTrail_Timestamp ON AuditTrail;

-- Log rollback
INSERT INTO MigrationHistory (MigrationId, AppliedAt, RolledBackAt, Status)
VALUES ('20240115_AddAuditIndex', GETDATE(), GETDATE(), 'RolledBack');

Data Loss Prevention

Data Loss Prevention Checklist:

#!/bin/bash
# scripts/prevent-data-loss-rollback.sh

MIGRATION_ID="${1}"
ENVIRONMENT="${2:-production}"

echo "🔒 Data Loss Prevention Check for Migration: ${MIGRATION_ID}"

# Check if migration involves data deletion
if grep -q "DELETE\|DROP\|TRUNCATE" "migrations/${MIGRATION_ID}/up.sql"; then
  echo "⚠️  WARNING: Migration contains data deletion operations"

  # Create backup before rollback
  echo "📦 Creating database backup..."
  ./scripts/backup-database.sh "${ENVIRONMENT}" "pre-rollback-${MIGRATION_ID}"

  # Ask for confirmation
  read -p "Migration may cause data loss. Continue? (yes/no): " confirm
  if [ "${confirm}" != "yes" ]; then
    echo "❌ Rollback cancelled"
    exit 1
  fi
fi

# Check for dependent data
echo "🔍 Checking for dependent data..."
DEPENDENT_RECORDS=$(./scripts/check-dependent-data.sh "${MIGRATION_ID}")

if [ "${DEPENDENT_RECORDS}" -gt 0 ]; then
  echo "⚠️  WARNING: ${DEPENDENT_RECORDS} dependent records found"
  read -p "Continue with rollback? (yes/no): " confirm
  if [ "${confirm}" != "yes" ]; then
    echo "❌ Rollback cancelled"
    exit 1
  fi
fi

echo "✅ Data loss prevention checks passed"

Coordinating App Rollback with DB Rollback

Coordinated Rollback Procedure:

sequenceDiagram
    participant Admin as Administrator
    participant App as Application Rollback
    participant DB as Database Rollback
    participant FluxCD as FluxCD
    participant K8s as Kubernetes

    Admin->>App: Initiate rollback
    App->>DB: Check schema compatibility
    DB-->>App: Schema version check
    App->>DB: Rollback database (if needed)
    DB->>DB: Execute rollback script
    DB-->>App: Database rolled back
    App->>FluxCD: Revert Git commit
    FluxCD->>K8s: Reconcile to previous state
    K8s->>App: Deploy previous app version
    App->>Admin: Rollback complete

Coordinated Rollback Script:

#!/bin/bash
# scripts/coordinated-rollback.sh

SERVICE="${1}"
ENVIRONMENT="${2:-production}"
TARGET_TAG="${3}"

echo "🔄 Coordinated Application + Database Rollback"

# Step 1: Get target application version and schema version
TARGET_APP_VERSION="${TARGET_TAG}"
TARGET_SCHEMA_VERSION=$(./scripts/get-schema-version-for-tag.sh "${TARGET_TAG}")

CURRENT_SCHEMA_VERSION=$(./scripts/get-db-schema-version.sh "${ENVIRONMENT}")

echo "📊 Rollback Plan:"
echo "  Application: ${TARGET_APP_VERSION}"
echo "  Database Schema: ${CURRENT_SCHEMA_VERSION} → ${TARGET_SCHEMA_VERSION}"

# Step 2: Check schema compatibility
if [ "${CURRENT_SCHEMA_VERSION}" != "${TARGET_SCHEMA_VERSION}" ]; then
  echo "⚠️  Database schema rollback required"

  # Verify backward compatibility
  if ! ./scripts/verify-schema-compatibility.sh "${TARGET_SCHEMA_VERSION}" "${TARGET_APP_VERSION}"; then
    echo "❌ Schema version ${TARGET_SCHEMA_VERSION} not compatible with app ${TARGET_APP_VERSION}"
    exit 1
  fi

  # Step 2a: Rollback database schema first
  echo "🔄 Step 1/2: Rolling back database schema..."
  ./scripts/rollback-database-schema.sh "${ENVIRONMENT}" "${TARGET_SCHEMA_VERSION}"

  # Wait for schema rollback to complete
  sleep 30
else
  echo "✅ No database schema rollback needed"
fi

# Step 3: Rollback application
echo "🔄 Step 2/2: Rolling back application..."
./scripts/rollback-to-tag.sh "${SERVICE}" "${ENVIRONMENT}" "${TARGET_TAG}"

# Step 4: Validate rollback
echo "🔍 Validating coordinated rollback..."
./scripts/validate-rollback.sh "${SERVICE}" "${ENVIRONMENT}"

echo "✅ Coordinated rollback complete"

FluxCD Rollback

Reverting Kustomization

Revert Kustomization via Git:

#!/bin/bash
# scripts/fluxcd-rollback-kustomization.sh

KUSTOMIZATION="${1}"  # e.g., apps-production
ENVIRONMENT="${2:-production}"
TARGET_COMMIT="${3}"

echo "⏪ Rolling back Kustomization: ${KUSTOMIZATION}"

# Revert the Kustomization path in Git
git checkout "${TARGET_COMMIT}" -- "apps/" "infrastructure/"

# Commit the rollback
git add apps/ infrastructure/
git commit -m "rollback: Revert ${KUSTOMIZATION} to ${TARGET_COMMIT}"

# Push to environment branch
git push origin "${ENVIRONMENT}"

echo "✅ Kustomization rollback committed"
echo "⏳ FluxCD will reconcile automatically (polling interval: 5m)"

Suspend Kustomization for Manual Rollback:

#!/bin/bash
# scripts/suspend-kustomization.sh

KUSTOMIZATION="${1}"
NAMESPACE="${2:-flux-system}"

echo "⏸️  Suspending Kustomization: ${KUSTOMIZATION}"

# Suspend reconciliation
flux suspend kustomization "${KUSTOMIZATION}" -n "${NAMESPACE}"

# Verify suspension
kubectl get kustomization "${KUSTOMIZATION}" -n "${NAMESPACE}" \
  -o jsonpath='{.spec.suspend}'

echo "✅ Kustomization suspended (reconciliation paused)"

Resume Kustomization After Rollback:

#!/bin/bash
# scripts/resume-kustomization.sh

KUSTOMIZATION="${1}"
NAMESPACE="${2:-flux-system}"

echo "▶️  Resuming Kustomization: ${KUSTOMIZATION}"

# Resume reconciliation
flux resume kustomization "${KUSTOMIZATION}" -n "${NAMESPACE}"

echo "✅ Kustomization resumed (reconciliation active)"

Reverting HelmRelease

HelmRelease Rollback:

#!/bin/bash
# scripts/fluxcd-rollback-helmrelease.sh

HELMRELEASE="${1}"
NAMESPACE="${2:-atp-production}"

echo "⏪ Rolling back HelmRelease: ${HELMRELEASE}"

# Get current release version
CURRENT_REVISION=$(kubectl get helmrelease "${HELMRELEASE}" -n "${NAMESPACE}" \
  -o jsonpath='{.status.lastReleaseRevision}')

PREVIOUS_REVISION=$((CURRENT_REVISION - 1))

echo "📊 Current revision: ${CURRENT_REVISION}"
echo "📊 Rolling back to revision: ${PREVIOUS_REVISION}"

# Choose ONE of the following options -- they are alternatives, not steps

# Option 1 (GitOps-native, preferred): revert the Helm values in Git;
# set PREVIOUS_COMMIT to the commit that carried the known-good values
# git checkout "${PREVIOUS_COMMIT}" -- "apps/${HELMRELEASE}/values.yaml"

# Option 2: Direct Helm rollback (bypasses GitOps; FluxCD may revert it on
# the next reconciliation unless the Kustomization is suspended first)
helm rollback "${HELMRELEASE}" "${PREVIOUS_REVISION}" -n "${NAMESPACE}"

# Option 3: Patch the HelmRelease spec in place with the previous values
# kubectl patch helmrelease "${HELMRELEASE}" -n "${NAMESPACE}" --type=json \
#   -p='[{"op": "replace", "path": "/spec/values", "value": {...previous values...}}]'

echo "✅ HelmRelease rollback initiated"

HelmRelease Git-Based Rollback:

#!/bin/bash
# scripts/fluxcd-helmrelease-git-rollback.sh

HELMRELEASE="${1}"
ENVIRONMENT="${2:-production}"
TARGET_COMMIT="${3}"

echo "⏪ Git-based HelmRelease rollback: ${HELMRELEASE}"

# Revert Helm values to target commit
git checkout "${TARGET_COMMIT}" -- "apps/${HELMRELEASE}/values.yaml" \
  "apps/${HELMRELEASE}/Chart.yaml"

# Commit the rollback
git add "apps/${HELMRELEASE}/"
git commit -m "rollback(helm): Revert ${HELMRELEASE} to ${TARGET_COMMIT}"

# Push to environment branch
git push origin "${ENVIRONMENT}"

echo "✅ HelmRelease rollback committed"
echo "⏳ FluxCD will reconcile and deploy previous Helm chart version"

Suspend and Resume Reconciliation

Suspend Reconciliation:

#!/bin/bash
# scripts/suspend-reconciliation.sh

KUSTOMIZATION="${1}"
NAMESPACE="${2:-flux-system}"

echo "⏸️  Suspending reconciliation for: ${KUSTOMIZATION}"

# Suspend via Flux CLI
flux suspend kustomization "${KUSTOMIZATION}" -n "${NAMESPACE}"

# Or via kubectl
kubectl patch kustomization "${KUSTOMIZATION}" -n "${NAMESPACE}" --type=json \
  -p='[{"op": "replace", "path": "/spec/suspend", "value": true}]'

# Verify suspension
kubectl get kustomization "${KUSTOMIZATION}" -n "${NAMESPACE}" \
  -o jsonpath='{.spec.suspend}'

echo "✅ Reconciliation suspended"

Resume Reconciliation:

#!/bin/bash
# scripts/resume-reconciliation.sh

KUSTOMIZATION="${1}"
NAMESPACE="${2:-flux-system}"

echo "▶️  Resuming reconciliation for: ${KUSTOMIZATION}"

# Resume via Flux CLI
flux resume kustomization "${KUSTOMIZATION}" -n "${NAMESPACE}"

# Or via kubectl
kubectl patch kustomization "${KUSTOMIZATION}" -n "${NAMESPACE}" --type=json \
  -p='[{"op": "replace", "path": "/spec/suspend", "value": false}]'

echo "✅ Reconciliation resumed"

Manual Intervention Procedures

Manual Intervention Runbook:

#!/bin/bash
# scripts/manual-intervention-runbook.sh

INCIDENT_TYPE="${1}"  # deployment-failure, drift-detection, reconciliation-stuck

echo "🚨 Manual Intervention Runbook"
echo "Incident Type: ${INCIDENT_TYPE}"

case "${INCIDENT_TYPE}" in
  "deployment-failure")
    echo "📋 Deployment Failure Intervention:"
    echo "1. Check deployment status: kubectl get deployment -n atp-production"
    echo "2. Check pod logs: kubectl logs -n atp-production deployment/<service>"
    echo "3. Check FluxCD status: flux get kustomizations -n flux-system"
    echo "4. Suspend reconciliation: flux suspend kustomization <name> -n flux-system"
    echo "5. Manually fix issue or rollback: ./scripts/rollback-simple.sh <service> production"
    echo "6. Resume reconciliation: flux resume kustomization <name> -n flux-system"
    ;;

  "drift-detection")
    echo "📋 Drift Detection Intervention:"
    echo "1. Check drift: flux get kustomizations --watch"
    echo "2. Identify drifted resources: kubectl diff -f <manifest>"
    echo "3. Option A: Fix cluster state to match Git"
    echo "   kubectl delete <resource> (let FluxCD recreate)"
    echo "4. Option B: Update Git to match cluster state"
    echo "   git checkout <cluster-state>"
    echo "5. Force reconciliation: flux reconcile kustomization <name>"
    ;;

  "reconciliation-stuck")
    echo "📋 Stuck Reconciliation Intervention:"
    echo "1. Check Kustomization status: flux get kustomizations -n flux-system"
    echo "2. Describe for details: kubectl describe kustomization <name> -n flux-system"
    echo "3. Check logs: kubectl logs -n flux-system deployment/kustomize-controller"
    echo "4. Suspend: flux suspend kustomization <name> -n flux-system"
    echo "5. Fix issue (check GitRepository, permissions, etc.)"
    echo "6. Resume: flux resume kustomization <name> -n flux-system"
    echo "7. Force reconcile: flux reconcile kustomization <name> --with-source"
    ;;
esac

Azure Backup Integration

Backing Up AKS Resources (Velero)

Velero Installation:

# Install Velero CLI
curl -fsSL -o velero-v1.11.0-linux-amd64.tar.gz \
  https://github.com/vmware-tanzu/velero/releases/download/v1.11.0/velero-v1.11.0-linux-amd64.tar.gz
tar -xvf velero-v1.11.0-linux-amd64.tar.gz
sudo mv velero-v1.11.0-linux-amd64/velero /usr/local/bin/

# Install Velero on AKS
velero install \
  --provider azure \
  --plugins velero/velero-plugin-for-microsoft-azure:v1.7.0 \
  --bucket velero-backups \
  --secret-file ./credentials-velero \
  --backup-location-config resourceGroup=atp-production-rg,storageAccount=atpprodvelero,subscriptionId=<subscription-id> \
  --snapshot-location-config apiTimeout=5m,resourceGroup=atp-production-rg,subscriptionId=<subscription-id>

Velero Backup Configuration:

# velero/backup-schedule.yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup-production
  namespace: velero
spec:
  schedule: "0 2 * * *"  # 2 AM daily
  template:
    includedNamespaces:
    - atp-production
    excludedResources:
    - events
    - events.events.k8s.io
    ttl: 720h0m0s  # Retain backups for 30 days (Go duration; "d" is not a valid unit)
    storageLocation: default
    volumeSnapshotLocations:
    - default
    metadata:
      labels:
        environment: production
        backup-type: scheduled

Manual Backup:

#!/bin/bash
# scripts/velero-backup.sh

BACKUP_NAME="manual-backup-$(date +%Y%m%d-%H%M%S)"
NAMESPACE="${1:-atp-production}"

echo "📦 Creating Velero backup: ${BACKUP_NAME}"

# Create backup
velero backup create "${BACKUP_NAME}" \
  --include-namespaces "${NAMESPACE}" \
  --ttl 720h0m0s \
  --wait

# Verify backup
velero backup describe "${BACKUP_NAME}"

echo "✅ Backup created: ${BACKUP_NAME}"

PersistentVolume Snapshots

Volume Snapshot Configuration:

# VolumeSnapshot for StatefulSet
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: atp-stateful-data-snapshot-YYYYMMDD  # render the date before apply (e.g. envsubst); kubectl does not expand $(date ...)
  namespace: atp-production
spec:
  volumeSnapshotClassName: csi-azuredisk-vsc
  source:
    persistentVolumeClaimName: data-atp-stateful-service-0

Automated Volume Snapshots:

# Velero: Automated volume snapshots
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: volume-snapshots-production
  namespace: velero
spec:
  schedule: "0 */6 * * *"  # Every 6 hours
  template:
    includedNamespaces:
    - atp-production
    includedResources:
    - persistentvolumes
    - persistentvolumeclaims
    volumeSnapshotLocations:
    - default
    ttl: 168h0m0s  # Retain snapshots for 7 days

Etcd Backup

Etcd Backup via AKS:

#!/bin/bash
# scripts/backup-etcd.sh

RESOURCE_GROUP="${1:-atp-production-rg}"
CLUSTER_NAME="${2:-atp-prod-eus-aks}"

echo "📦 Backing up AKS etcd"

# AKS automatically backs up etcd, but we can trigger manual snapshot
# Note: etcd backup requires Azure support or cluster admin access

# Alternative: Use Velero for cluster-level backup
velero backup create "etcd-backup-$(date +%Y%m%d)" \
  --include-cluster-resources=true \
  --wait

echo "✅ Etcd backup initiated"

AKS Automatic Etcd Backup:

  • Automatic: AKS automatically backs up etcd every 8 hours
  • Retention: 30 days
  • Recovery: Available via Azure support

Backup Retention Policies

Backup Retention Configuration:

# Velero: Backup retention policy
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: production-backup-schedule
spec:
  schedule: "0 2 * * *"
  template:
    ttl: 720h0m0s  # Keep backups for 30 days
    includedNamespaces:
    - atp-production

Retention Policy by Backup Type:

| Backup Type | Retention | Rationale |
|---|---|---|
| Daily Scheduled | 30 days | Standard retention |
| Weekly Full | 90 days | Long-term retention |
| Monthly Full | 365 days | Compliance (1 year) |
| Pre-Deployment | 7 days | Short-term rollback |
| Manual Backup | 30 days | On-demand backups |
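Velero TTLs are Go duration strings, which have no day unit, so the day-based retention matrix above has to be expressed in hours (for example, 30 days becomes `720h0m0s`). A small hypothetical helper for the conversion:

```shell
#!/bin/bash
# Hypothetical helper: convert a day-based retention period into a
# Velero-compatible Go duration (Go durations support h/m/s, not d)
days_to_velero_ttl() {
  local days="${1}"
  echo "$(( days * 24 ))h0m0s"
}

days_to_velero_ttl 30   # prints: 720h0m0s
```

Usage, e.g.: `velero backup create nightly --ttl "$(days_to_velero_ttl 30)"`.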

Backup Retention Cleanup:

#!/bin/bash
# scripts/cleanup-old-backups.sh

# Velero expires backups automatically once their TTL elapses; to force-delete
# older ones, filter by creationTimestamp (velero backup delete has no age flag)
CUTOFF=$(date -d "30 days ago" +%s)
velero backup get --output json | \
  jq -r --argjson cutoff "${CUTOFF}" \
    '.items[] | select((.metadata.creationTimestamp | fromdateiso8601) < $cutoff) | .metadata.name' | \
  xargs -r -n1 velero backup delete --confirm

echo "🧹 Cleaned up backups older than 30 days"

Disaster Recovery Scenarios

Cluster Failure

Cluster Failure Recovery:

graph TB
    subgraph "Disaster: Cluster Failure"
        FAIL[AKS Cluster<br/>Failure]
    end
    subgraph "Recovery Process"
        DETECT[Detect Failure]
        ASSESS[Assess Impact]
        RECREATE[Recreate Cluster<br/>from GitOps]
        RESTORE[Restore Data<br/>from Velero]
        VALIDATE[Validate Recovery]
    end
    subgraph "Backup Sources"
        GIT[Git Repository<br/>Manifests]
        VELERO[Velero Backups<br/>State]
        ACR[ACR Images]
    end

    FAIL --> DETECT
    DETECT --> ASSESS
    ASSESS --> RECREATE
    RECREATE --> GIT
    RECREATE --> RESTORE
    RESTORE --> VELERO
    RESTORE --> VALIDATE
    VALIDATE --> ACR

    style FAIL fill:#FF6B6B
    style RECREATE fill:#90EE90
    style RESTORE fill:#90EE90

Cluster Failure Recovery Procedure:

#!/bin/bash
# scripts/recover-cluster-failure.sh

CLUSTER_NAME="${1:-atp-prod-eus-aks}"
RESOURCE_GROUP="${2:-atp-production-rg}"

echo "🚨 Cluster Failure Recovery"
echo "Cluster: ${CLUSTER_NAME}"
echo "Resource Group: ${RESOURCE_GROUP}"

# Step 1: Verify cluster is actually down
if az aks show --resource-group "${RESOURCE_GROUP}" --name "${CLUSTER_NAME}" \
  --query "provisioningState" -o tsv | grep -q "Succeeded"; then
  echo "⚠️  Cluster appears to be running. Verify the issue."
  exit 1
fi

# Step 2: Recreate cluster from Pulumi
echo "🔄 Step 1: Recreating AKS cluster from Pulumi (IaC)..."
cd infrastructure/
pulumi stack select production
pulumi up --yes

# Step 3: Wait for cluster to be ready
echo "⏳ Waiting for cluster to be ready..."
az aks wait --name "${CLUSTER_NAME}" --resource-group "${RESOURCE_GROUP}" \
  --created --timeout 1800  # timeout is in seconds; 30 is far too short for cluster creation

# Step 4: Bootstrap FluxCD
echo "🔄 Step 2: Bootstrapping FluxCD..."
az aks get-credentials --resource-group "${RESOURCE_GROUP}" --name "${CLUSTER_NAME}"
flux bootstrap git \
  --url=https://dev.azure.com/ConnectSoft/ATP/_git/atp-gitops \
  --branch=production \
  --path=./clusters/production

# Step 5: Restore from Velero backup
echo "🔄 Step 3: Restoring from Velero backup..."
LATEST_BACKUP=$(velero backup get --output json | \
  jq -r '[.items[] | select(.status.phase == "Completed")]
         | sort_by(.metadata.creationTimestamp) | last | .metadata.name')

velero restore create "restore-${CLUSTER_NAME}-$(date +%Y%m%d)" \
  --from-backup "${LATEST_BACKUP}" \
  --wait

# Step 6: Validate recovery
echo "🔍 Step 4: Validating recovery..."
./scripts/validate-cluster-health.sh

echo "✅ Cluster recovery complete"

Region Outage

Multi-Region Recovery:

#!/bin/bash
# scripts/recover-region-outage.sh

PRIMARY_REGION="${1:-eastus}"
SECONDARY_REGION="${2:-westeurope}"

echo "🚨 Region Outage Recovery"
echo "Primary Region: ${PRIMARY_REGION} (DOWN)"
echo "Secondary Region: ${SECONDARY_REGION} (DR)"

# Step 1: Failover traffic to secondary region
echo "🔄 Step 1: Failing over traffic to ${SECONDARY_REGION}..."
az network front-door backend-pool update \
  --resource-group atp-production-rg \
  --front-door-name atp-frontdoor \
  --name primary-eus \
  --backend-pool-parameters enabled=false

az network front-door backend-pool update \
  --resource-group atp-production-rg \
  --front-door-name atp-frontdoor \
  --name secondary-weu \
  --backend-pool-parameters enabled=true priority=1

# Step 2: Promote secondary database to primary (failover of the geo-replica)
echo "🔄 Step 2: Promoting secondary database..."
az sql db replica set-primary \
  --resource-group atp-production-rg \
  --server atp-prod-sql-server-weu \
  --name atp-prod-db

# Step 3: Scale up secondary cluster
echo "🔄 Step 3: Scaling up secondary cluster..."
az aks scale \
  --resource-group atp-production-rg \
  --name atp-prod-weu-aks \
  --node-count 10

# Step 4: Validate failover
echo "🔍 Step 4: Validating failover..."
./scripts/validate-failover.sh "${SECONDARY_REGION}"

echo "✅ Region failover complete"

Data Corruption

Data Corruption Recovery:

#!/bin/bash
# scripts/recover-data-corruption.sh

ENVIRONMENT="${1:-production}"
CORRUPTION_TIME="${2}"  # ISO timestamp of when corruption occurred

echo "🚨 Data Corruption Recovery"
echo "Environment: ${ENVIRONMENT}"
echo "Corruption Detected At: ${CORRUPTION_TIME}"

# Step 1: Find backup before corruption
echo "🔍 Step 1: Finding backup before corruption..."
TARGET_BACKUP=$(velero backup get --output json | \
  jq -r --arg time "${CORRUPTION_TIME}" \
    '[.items[] | select(.status.phase == "Completed")
      | select(.metadata.creationTimestamp < $time)]
     | sort_by(.metadata.creationTimestamp) | last | .metadata.name? // empty')

if [ -z "${TARGET_BACKUP}" ]; then
  echo "❌ No backup found before corruption time"
  exit 1
fi

echo "📦 Target backup: ${TARGET_BACKUP}"

# Step 2: Stop application to prevent further corruption
echo "🛑 Step 2: Stopping application..."
kubectl scale deployment --all --replicas=0 -n "atp-${ENVIRONMENT}"

# Step 3: Restore from backup
echo "🔄 Step 3: Restoring from backup..."
velero restore create "restore-corruption-$(date +%Y%m%d)" \
  --from-backup "${TARGET_BACKUP}" \
  --include-namespaces "atp-${ENVIRONMENT}" \
  --wait

# Step 4: Validate data integrity
echo "🔍 Step 4: Validating data integrity..."
./scripts/validate-data-integrity.sh "${ENVIRONMENT}"

# Step 5: Restart application
echo "▶️  Step 5: Restarting application..."
# Interim replica count; FluxCD reconciliation restores the counts defined in Git
kubectl scale deployment --all --replicas=5 -n "atp-${ENVIRONMENT}"

echo "✅ Data corruption recovery complete"

Complete Platform Loss

Complete Platform Recovery:

#!/bin/bash
# scripts/recover-complete-platform-loss.sh

echo "🚨 Complete Platform Loss Recovery"
echo "This procedure recreates the entire ATP platform from GitOps"

# Step 1: Verify GitOps repository is accessible
echo "🔍 Step 1: Verifying GitOps repository..."
if ! git ls-remote https://dev.azure.com/ConnectSoft/ATP/_git/atp-gitops >/dev/null 2>&1; then
  echo "❌ GitOps repository not accessible"
  exit 1
fi

# Step 2: Recreate infrastructure from Pulumi
echo "🔄 Step 2: Recreating infrastructure..."
cd infrastructure/
pulumi stack select production
pulumi up --yes

# Step 3: Create AKS clusters
echo "🔄 Step 3: Creating AKS clusters..."
./scripts/create-aks-clusters.sh production

# Step 4: Bootstrap FluxCD on all clusters
echo "🔄 Step 4: Bootstrapping FluxCD..."
for CLUSTER in atp-prod-eus-aks atp-prod-weu-aks; do
  echo "  Bootstrapping ${CLUSTER}..."
  az aks get-credentials --resource-group atp-production-rg --name "${CLUSTER}"
  flux bootstrap git \
    --url=https://dev.azure.com/ConnectSoft/ATP/_git/atp-gitops \
    --branch=production \
    --path=./clusters/production
done

# Step 5: Restore application state from Velero
echo "🔄 Step 5: Restoring application state..."
LATEST_BACKUP=$(velero backup get --output json | \
  jq -r '[.items[] | select(.status.phase == "Completed")]
         | sort_by(.metadata.creationTimestamp) | last | .metadata.name')

velero restore create "restore-platform-$(date +%Y%m%d)" \
  --from-backup "${LATEST_BACKUP}" \
  --wait

# Step 6: Validate platform
echo "🔍 Step 6: Validating platform recovery..."
./scripts/validate-platform.sh

echo "✅ Platform recovery complete"

RTO/RPO Targets Per Environment

RTO/RPO Targets Matrix:

| Environment | RTO (Recovery Time Objective) | RPO (Recovery Point Objective) | Rationale |
|---|---|---|---|
| Dev | 4 hours | 24 hours | Lower priority, acceptable downtime |
| Test | 2 hours | 12 hours | Moderate priority, faster recovery needed |
| Staging | 1 hour | 4 hours | Production-like, important for validation |
| Production | 30 minutes | 1 hour | Critical, minimal downtime required |

RTO/RPO Validation:

#!/bin/bash
# scripts/validate-rto-rpo.sh

ENVIRONMENT="${1:-production}"

echo "📊 RTO/RPO Validation for ${ENVIRONMENT}"

# Get target RTO/RPO from matrix
case "${ENVIRONMENT}" in
  "dev") TARGET_RTO=14400 TARGET_RPO=86400 ;;      # 4h / 24h
  "test") TARGET_RTO=7200 TARGET_RPO=43200 ;;      # 2h / 12h
  "staging") TARGET_RTO=3600 TARGET_RPO=14400 ;;   # 1h / 4h
  "production") TARGET_RTO=1800 TARGET_RPO=3600 ;; # 30m / 1h
  *) echo "❌ Unknown environment: ${ENVIRONMENT}"; exit 1 ;;
esac

echo "Target RTO: ${TARGET_RTO} seconds ($(($TARGET_RTO / 60)) minutes)"
echo "Target RPO: ${TARGET_RPO} seconds ($(($TARGET_RPO / 60)) minutes)"

# Simulate recovery and measure time
START_TIME=$(date +%s)
./scripts/recover-cluster-failure.sh
END_TIME=$(date +%s)
ACTUAL_RTO=$((END_TIME - START_TIME))

# Get latest backup timestamp
LATEST_BACKUP_TIME=$(velero backup get --output json | \
  jq -r '[.items[] | select(.status.phase == "Completed")]
         | sort_by(.metadata.creationTimestamp) | last | .metadata.creationTimestamp' | \
  xargs -I {} date -d {} +%s)
CURRENT_TIME=$(date +%s)
ACTUAL_RPO=$((CURRENT_TIME - LATEST_BACKUP_TIME))

# Validate
if [ ${ACTUAL_RTO} -le ${TARGET_RTO} ]; then
  echo "✅ RTO Met: ${ACTUAL_RTO}s <= ${TARGET_RTO}s"
else
  echo "❌ RTO Exceeded: ${ACTUAL_RTO}s > ${TARGET_RTO}s"
fi

if [ ${ACTUAL_RPO} -le ${TARGET_RPO} ]; then
  echo "✅ RPO Met: ${ACTUAL_RPO}s <= ${TARGET_RPO}s"
else
  echo "❌ RPO Exceeded: ${ACTUAL_RPO}s > ${TARGET_RPO}s"
fi

DR Testing and Drills

Quarterly DR Drills for Production

DR Drill Schedule:

| Frequency | Environment | Drill Type | Rationale |
|---|---|---|---|
| Quarterly | Production | Full DR drill | Validate production recovery procedures |
| Monthly | Staging | Partial DR drill | Test recovery procedures in production-like environment |
| Bi-weekly | Test | Automated DR test | Continuous validation |

Quarterly DR Drill Plan:

#!/bin/bash
# scripts/dr-drill-production.sh

DRILL_DATE="${1:-$(date +%Y%m%d)}"
SCENARIO="${2:-cluster-failure}"  # cluster-failure, region-outage, data-corruption

echo "🎯 Quarterly DR Drill - Production"
echo "Date: ${DRILL_DATE}"
echo "Scenario: ${SCENARIO}"

# Pre-drill checklist
echo "📋 Pre-Drill Checklist:"
echo "  [ ] Notify stakeholders"
echo "  [ ] Backup current state"
echo "  [ ] Prepare recovery scripts"
echo "  [ ] Verify backup availability"
echo "  [ ] Document baseline metrics"

# Execute drill scenario
case "${SCENARIO}" in
  "cluster-failure")
    echo "🔄 Executing cluster failure drill..."
    ./scripts/dr-drill-cluster-failure.sh
    ;;
  "region-outage")
    echo "🔄 Executing region outage drill..."
    ./scripts/dr-drill-region-outage.sh
    ;;
  "data-corruption")
    echo "🔄 Executing data corruption drill..."
    ./scripts/dr-drill-data-corruption.sh
    ;;
esac

# Post-drill validation
echo "🔍 Post-Drill Validation:"
./scripts/validate-dr-drill.sh

# Generate drill report
echo "📝 Generating drill report..."
./scripts/generate-dr-drill-report.sh "${DRILL_DATE}" "${SCENARIO}"

echo "✅ DR Drill complete"

Drill Scenarios and Checklists

DR Drill Scenarios:

| Scenario | Description | Recovery Procedure | Frequency |
|---|---|---|---|
| Cluster Failure | Complete AKS cluster failure | Recreate cluster, restore from Velero | Quarterly |
| Region Outage | Primary region unavailable | Failover to secondary region | Quarterly |
| Data Corruption | Database corruption detected | Restore from point-in-time backup | Quarterly |
| Network Isolation | Network connectivity issues | Route traffic via secondary path | Monthly |
| Storage Failure | PersistentVolume failures | Restore from volume snapshots | Monthly |

Cluster Failure Drill Checklist:

## DR Drill: Cluster Failure

### Pre-Drill
- [ ] Schedule drill with stakeholders
- [ ] Create backup before drill
- [ ] Document baseline metrics
- [ ] Notify on-call team

### Drill Execution
- [ ] Simulate cluster failure (scale cluster to 0 nodes)
- [ ] Measure detection time
- [ ] Execute recovery procedure
  - [ ] Recreate cluster from Pulumi
  - [ ] Bootstrap FluxCD
  - [ ] Restore from Velero backup
- [ ] Measure recovery time (RTO)
- [ ] Validate application health
- [ ] Verify data integrity (RPO)

### Post-Drill
- [ ] Restore cluster to normal state
- [ ] Document actual RTO/RPO
- [ ] Identify improvement opportunities
- [ ] Update runbooks
- [ ] Generate drill report

Region Outage Drill:

#!/bin/bash
# scripts/dr-drill-region-outage.sh

echo "🎯 DR Drill: Region Outage"

# Simulate region outage (disable primary region)
# NOTE: illustrative Azure CLI call — exact Front Door commands vary by SKU and CLI version
echo "🔄 Simulating region outage..."
az network front-door backend-pool update \
  --resource-group atp-production-rg \
  --front-door-name atp-frontdoor \
  --name primary-eus \
  --backend-pool-parameters enabled=false

# Execute failover and measure its duration
echo "🔄 Executing failover..."
FAILOVER_START=$(date +%s)
./scripts/recover-region-outage.sh eastus westeurope
FAILOVER_END=$(date +%s)
FAILOVER_TIME=$((FAILOVER_END - FAILOVER_START))

echo "⏱️  Failover time: ${FAILOVER_TIME} seconds"

# Validate
./scripts/validate-failover.sh westeurope

# Restore (post-drill)
echo "🔄 Restoring primary region..."
az network front-door backend-pool update \
  --resource-group atp-production-rg \
  --front-door-name atp-frontdoor \
  --name primary-eus \
  --backend-pool-parameters enabled=true priority=1

echo "✅ DR Drill complete"

Drill Report and Improvements

DR Drill Report Template:

# DR Drill Report

## Drill Information
- **Date**: 2024-01-15
- **Scenario**: Cluster Failure
- **Environment**: Production
- **Duration**: 45 minutes

## Objectives Met
- [x] RTO Target: 30 minutes (Actual: 28 minutes) ✅
- [x] RPO Target: 1 hour (Actual: 45 minutes) ✅
- [x] All services recovered successfully ✅

## Issues Identified
1. Velero restore took longer than expected (15 minutes)
2. Database restore required manual intervention

## Improvements
1. Optimize Velero restore process
2. Automate database restore procedure
3. Update runbooks with lessons learned

## Action Items
- [ ] Update recovery scripts
- [ ] Improve backup frequency
- [ ] Add automated validation steps

Generate DR Drill Report:

#!/bin/bash
# scripts/generate-dr-drill-report.sh

DRILL_DATE="${1}"
SCENARIO="${2}"
REPORT_FILE="dr-drill-report-${DRILL_DATE}-${SCENARIO}.md"

echo "📝 Generating DR Drill Report..."

cat > "${REPORT_FILE}" <<EOF
# DR Drill Report

**Date**: ${DRILL_DATE}
**Scenario**: ${SCENARIO}
**Environment**: Production

## Results

### RTO/RPO Metrics
- **Target RTO**: 30 minutes
- **Actual RTO**: $(./scripts/get-actual-rto.sh)
- **Target RPO**: 1 hour
- **Actual RPO**: $(./scripts/get-actual-rpo.sh)

### Recovery Steps
1. $(./scripts/get-recovery-step.sh 1)
2. $(./scripts/get-recovery-step.sh 2)
3. $(./scripts/get-recovery-step.sh 3)

## Lessons Learned
$(./scripts/get-drill-lessons.sh)

## Action Items
$(./scripts/get-drill-action-items.sh)
EOF

echo "✅ Report generated: ${REPORT_FILE}"

Lessons Learned Process

Lessons Learned Template:

#!/bin/bash
# scripts/capture-dr-drill-lessons.sh

DRILL_DATE="${1}"
SCENARIO="${2}"

echo "📚 Capturing Lessons Learned from DR Drill..."

cat >> "dr-lessons-learned.md" <<EOF

## DR Drill: ${SCENARIO} - ${DRILL_DATE}

### What Went Well
- Automated cluster recreation from Pulumi worked seamlessly
- FluxCD bootstrap completed quickly
- Application recovery was faster than expected

### What Could Be Improved
- Velero restore process needs optimization
- Database restore requires more automation
- Communication during drill could be better

### Action Items
1. [ ] Optimize Velero restore scripts
2. [ ] Automate database restore procedure
3. [ ] Update incident response runbook
4. [ ] Schedule follow-up drill in 3 months

### Updated Procedures
- Recovery procedure updated: ./scripts/recover-cluster-failure.sh
- Runbook updated: docs/operations/disaster-recovery.md

---
EOF

echo "✅ Lessons learned captured"

Incident Response Integration

Automated Rollback on Critical Alerts

Automated Rollback Trigger:

# PrometheusRule: Trigger automated rollback on critical alert
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: auto-rollback-trigger
  namespace: monitoring
spec:
  groups:
  - name: auto-rollback
    rules:
    - alert: CriticalErrorRate
      expr: |
        sum(rate(http_requests_total{status=~"5.."}[5m])) 
        / 
        sum(rate(http_requests_total[5m])) 
        > 0.10  # 10% error rate
      for: 2m
      labels:
        severity: critical
        auto-rollback: "true"
      annotations:
        summary: "Critical error rate detected - triggering automated rollback"
        description: "Error rate: {{ $value | humanizePercentage }}"

Automated Rollback Webhook:

# AlertManager: Configure webhook for automated rollback
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: monitoring
data:
  alertmanager.yml: |
    global:
      resolve_timeout: 5m

    route:
      receiver: 'default'
      routes:
      - match:
          auto-rollback: "true"
        receiver: 'auto-rollback'

    receivers:
    - name: 'auto-rollback'
      webhook_configs:
      - url: 'http://auto-rollback-service.monitoring:8080/rollback'
        send_resolved: false

Auto-Rollback Service:

// C#: Auto-rollback service (sketch — the Alert model and helpers such as
// ExtractServiceFromAlert, IsAutoRollbackEnabled, ExecuteRollback, and NotifyTeam
// are service-specific and omitted here)
[ApiController]
[Route("[controller]")]
public class AutoRollbackController : ControllerBase
{
    [HttpPost("rollback")]
    public async Task<IActionResult> TriggerRollback([FromBody] Alert alert)
    {
        // Parse alert to determine service
        var service = ExtractServiceFromAlert(alert);
        var environment = ExtractEnvironmentFromAlert(alert);

        // Check if auto-rollback is enabled for this service
        if (!await IsAutoRollbackEnabled(service, environment))
        {
            return Ok(new { message = "Auto-rollback disabled for this service" });
        }

        // Execute rollback
        var rollbackResult = await ExecuteRollback(service, environment);

        // Notify team
        await NotifyTeam($"Auto-rollback triggered for {service}: {rollbackResult.Status}");

        return Ok(rollbackResult);
    }
}

Incident Commander Decision Making

Incident Commander Decision Tree:

graph TD
    START[Incident Detected] --> ASSESS{Assess Impact}
    ASSESS -->|High Impact| ROLLBACK{Can Rollback?}
    ASSESS -->|Low Impact| INVESTIGATE[Investigate Root Cause]

    ROLLBACK -->|Yes| EXECUTE[Execute Rollback]
    ROLLBACK -->|No| MITIGATE[Apply Mitigation]

    EXECUTE --> VALIDATE[Validate Rollback]
    VALIDATE -->|Success| MONITOR[Monitor Recovery]
    VALIDATE -->|Failure| ESCALATE[Escalate to Senior]

    MITIGATE --> INVESTIGATE
    INVESTIGATE --> FIX[Develop Fix]
    FIX --> DEPLOY[Deploy Fix]
    DEPLOY --> VALIDATE

    MONITOR --> CLOSE[Close Incident]

    style START fill:#FF6B6B
    style EXECUTE fill:#FFD700
    style VALIDATE fill:#90EE90
    style CLOSE fill:#90EE90

Incident Commander Decision Matrix:

| Impact | Error Rate | Decision | Action |
|--------|------------|----------|--------|
| Critical | > 10% | Immediate Rollback | Execute rollback, investigate later |
| High | 5-10% | ⚠️ Investigate + Prepare Rollback | Investigate, roll back if no fix within 15 min |
| Medium | 1-5% | ⚠️ Investigate First | Investigate, roll back if it worsens |
| Low | < 1% | Monitor | Monitor, no immediate action |

Rollback vs Forward Fix Decision Tree

Rollback vs Forward Fix Decision:

#!/bin/bash
# scripts/rollback-vs-fix-decision.sh

ERROR_RATE="${1}"  # Percentage
AFFECTED_USERS="${2}"  # Number of users
HAS_FIX="${3}"  # yes/no - Do we have a fix ready?

echo "🤔 Rollback vs Forward Fix Decision"
echo "Error Rate: ${ERROR_RATE}%"
echo "Affected Users: ${AFFECTED_USERS}"
echo "Has Fix Ready: ${HAS_FIX}"

# Decision logic
if (( $(echo "${ERROR_RATE} > 10" | bc -l) )); then
  DECISION="ROLLBACK"
  REASON="Critical error rate (>10%)"
elif (( $(echo "${ERROR_RATE} > 5" | bc -l) )) && [ "${HAS_FIX}" != "yes" ]; then
  DECISION="ROLLBACK"
  REASON="High error rate (>5%) and no fix ready"
elif (( $(echo "${ERROR_RATE} > 5" | bc -l) )) && [ "${HAS_FIX}" == "yes" ]; then
  DECISION="FORWARD_FIX"
  REASON="High error rate but fix available"
elif [ "${AFFECTED_USERS}" -gt 10000 ]; then
  DECISION="ROLLBACK"
  REASON="Large number of affected users"
else
  DECISION="FORWARD_FIX"
  REASON="Low impact, proceed with fix"
fi

echo "📊 Decision: ${DECISION}"
echo "📝 Reason: ${REASON}"

case "${DECISION}" in
  "ROLLBACK")
    echo "🔄 Executing rollback..."
    ./scripts/rollback-simple.sh
    ;;
  "FORWARD_FIX")
    echo "🔧 Proceeding with forward fix..."
    ./scripts/deploy-fix.sh
    ;;
esac

Post-Incident Review

Post-Incident Review Template:

# Post-Incident Review

## Incident Summary
- **Incident ID**: INC-2024-001
- **Date**: 2024-01-15
- **Duration**: 45 minutes
- **Impact**: 5% of users affected
- **Resolution**: Rollback to previous version

## Timeline
- 10:00 AM: Incident detected (error rate spike)
- 10:05 AM: Incident declared, on-call engaged
- 10:10 AM: Root cause identified (deployment issue)
- 10:15 AM: Rollback decision made
- 10:20 AM: Rollback executed
- 10:30 AM: Rollback validated, services restored
- 10:45 AM: Incident resolved

## Root Cause
Deployment of v1.2.3 introduced memory leak causing pod restarts and increased error rate.

## Actions Taken
1. Rolled back to v1.2.2
2. Validated service health
3. Investigated root cause

## Lessons Learned
- Need better pre-deployment testing for memory issues
- Rollback procedure worked well (RTO: 20 minutes)

## Action Items
- [ ] Add memory leak detection to CI pipeline
- [ ] Improve error rate monitoring
- [ ] Update deployment procedures

Post-Incident Review Script:

#!/bin/bash
# scripts/generate-post-incident-review.sh

INCIDENT_ID="${1}"
INCIDENT_DATE="${2}"

echo "📝 Generating Post-Incident Review..."

cat > "post-incident-review-${INCIDENT_ID}.md" <<EOF
# Post-Incident Review: ${INCIDENT_ID}

**Date**: ${INCIDENT_DATE}
**Status**: Resolved

## Timeline
$(./scripts/get-incident-timeline.sh "${INCIDENT_ID}")

## Root Cause
$(./scripts/get-root-cause.sh "${INCIDENT_ID}")

## Impact
- Users Affected: $(./scripts/get-affected-users.sh "${INCIDENT_ID}")
- Error Rate: $(./scripts/get-max-error-rate.sh "${INCIDENT_ID}")%
- Duration: $(./scripts/get-incident-duration.sh "${INCIDENT_ID}")

## Resolution
$(./scripts/get-resolution.sh "${INCIDENT_ID}")

## Lessons Learned
$(./scripts/get-lessons-learned.sh "${INCIDENT_ID}")

## Action Items
$(./scripts/get-action-items.sh "${INCIDENT_ID}")
EOF

echo "✅ Post-incident review generated"

Summary: Rollback & Disaster Recovery

  • Git-Based Rollback: Simple rollback (git revert), complex rollback (git reset), rollback to specific commit, rollback to specific tag
  • Progressive Rollback: Rolling back one service at a time, rollback with canary (gradual revert), validation at each rollback step
  • Application State Recovery: Handling database schema changes, data migration rollback, stateful application considerations
  • Database Migration Rollback: Forward-only migrations (preferred), rollback scripts (if necessary), data loss prevention, coordinating app rollback with DB rollback
  • FluxCD Rollback: Reverting Kustomization, reverting HelmRelease, suspend and resume reconciliation, manual intervention procedures
  • Azure Backup Integration: Backing up AKS resources (Velero), PersistentVolume snapshots, Etcd backup, backup retention policies
  • Disaster Recovery Scenarios: Cluster failure, region outage, data corruption, complete platform loss
  • RTO/RPO Targets: Dev (RTO 4h, RPO 24h), Test (RTO 2h, RPO 12h), Staging (RTO 1h, RPO 4h), Production (RTO 30m, RPO 1h)
  • DR Testing and Drills: Quarterly DR drills for production, drill scenarios and checklists, drill report and improvements, lessons learned process
  • Incident Response Integration: Automated rollback on critical alerts, incident commander decision making, rollback vs forward fix decision tree, post-incident review

Helm Chart Development for ATP Services

Purpose: Define the standards, best practices, and procedures for developing, testing, versioning, and publishing Helm charts for ATP microservices, ensuring consistent deployment patterns, maintainable chart structures, and reliable application deployments across all environments.


Helm Chart Structure

Chart.yaml: Metadata, Version, Dependencies

Chart.yaml for ATP Service:

# charts/atp-ingestion/Chart.yaml
apiVersion: v2
name: atp-ingestion
description: A Helm chart for ATP Ingestion Service - Collects and processes audit trail events
type: application
version: 1.2.3  # Chart version (SemVer)
appVersion: "1.2.3"  # Application version (from source code)
home: https://github.com/ConnectSoft/ATP
sources:
  - https://github.com/ConnectSoft/ATP/ConnectSoft.Audit.Ingestion
maintainers:
  - name: ATP Team
    email: atp-team@connectsoft.example
keywords:
  - audit-trail
  - atp
  - ingestion
  - microservice
annotations:
  category: Microservice
  architecture: microservices
dependencies:
  - name: redis
    version: "17.15.0"
    repository: "https://charts.bitnami.com/bitnami"
    condition: redis.enabled
  - name: postgresql
    version: "12.1.9"
    repository: "https://charts.bitnami.com/bitnami"
    condition: postgresql.enabled

Chart Metadata Standards:

| Field | Required | Description | ATP Convention |
|-------|----------|-------------|----------------|
| apiVersion | ✅ Yes | Chart API version | v2 (Helm 3+) |
| name | ✅ Yes | Chart name | atp-{service-name} (kebab-case) |
| version | ✅ Yes | Chart version | SemVer (MAJOR.MINOR.PATCH) |
| appVersion | ✅ Yes | Application version | Matches source code version |
| description | ✅ Yes | Chart description | One-line service description |
| type | ⚠️ Recommended | Chart type | application (default) |
| keywords | ⚠️ Recommended | Search keywords | Include audit-trail, atp, service name |
| maintainers | ⚠️ Recommended | Maintainer info | ATP Team contact |
| dependencies | ⚠️ Optional | Chart dependencies | External charts (Redis, PostgreSQL) |
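The naming and versioning conventions in this table can be checked mechanically before a chart is published. A small Bash sketch; `chart_ok` is a hypothetical pre-publish check, not a Helm command:

```shell
# Validate ATP chart conventions: kebab-case "atp-*" name, SemVer version.
# chart_ok is a hypothetical helper for a CI pre-publish gate.
chart_ok() {
  local name="$1" version="$2"
  if [[ ! "${name}" =~ ^atp(-[a-z0-9]+)+$ ]]; then
    echo "bad name: ${name}"
    return 1
  fi
  if [[ ! "${version}" =~ ^[0-9]+\.[0-9]+\.[0-9]+$ ]]; then
    echo "bad version: ${version}"
    return 1
  fi
  echo "ok: ${name} ${version}"
}

chart_ok "atp-ingestion" "1.2.3"   # → ok: atp-ingestion 1.2.3
```

The name and version could be read from Chart.yaml (e.g. with `helm show chart`) and fed to this check in the publish pipeline.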

values.yaml: Default Values

Complete values.yaml:

# charts/atp-ingestion/values.yaml
# Default values for atp-ingestion
# This is a YAML-formatted file

# Application Configuration
replicaCount: 3
image:
  repository: connectsoft.azurecr.io/atp/ingestion
  pullPolicy: IfNotPresent
  tag: ""  # Override via --set image.tag=v1.2.3

imagePullSecrets:
  - name: acr-pull-secret

nameOverride: ""
fullnameOverride: ""

serviceAccount:
  create: true
  annotations: {}
  name: ""

podAnnotations: {}

podSecurityContext:
  fsGroup: 2000
  runAsNonRoot: true
  runAsUser: 1000

securityContext:
  allowPrivilegeEscalation: false
  capabilities:
    drop:
    - ALL
  readOnlyRootFilesystem: true

service:
  type: ClusterIP
  port: 80
  targetPort: 8080

ingress:
  enabled: false
  className: "nginx"
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
  hosts:
    - host: ingestion.atp.connectsoft.example
      paths:
        - path: /
          pathType: Prefix
  tls: []

resources:
  limits:
    cpu: 2000m
    memory: 2Gi
  requests:
    cpu: 500m
    memory: 1Gi

autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
  targetMemoryUtilizationPercentage: 80

nodeSelector: {}

tolerations: []

affinity: {}

# External Secrets
externalSecrets:
  enabled: true
  secrets:
    - name: sql-connection-string
      keyVaultName: atp-prod-kv
      secretName: connection-strings/atp-db/production

# Database Configuration
database:
  host: ""
  port: 5432
  name: atp_production
  schema: public

# Redis Configuration
redis:
  enabled: false  # Use managed Redis
  host: ""  # External Redis host
  port: 6379

# Environment Variables
env:
  - name: ASPNETCORE_ENVIRONMENT
    value: "Production"
  - name: Logging__LogLevel__Default
    value: "Information"

envFrom:
  - secretRef:
      name: app-secrets

# Health Checks
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3

startupProbe:
  httpGet:
    path: /health/startup
    port: 8080
  initialDelaySeconds: 0
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 30

# Pod Disruption Budget
podDisruptionBudget:
  enabled: true
  minAvailable: 2

# Network Policy
networkPolicy:
  enabled: true
  ingress:
    - from:
      - namespaceSelector:
          matchLabels:
            name: ingress-nginx
  egress:
    - to:
      - namespaceSelector:
          matchLabels:
            name: kube-system
      ports:
        - protocol: UDP
          port: 53

# Service Bus Configuration
serviceBus:
  connectionString: ""  # From ExternalSecret
  queueName: audit-events

# Monitoring
monitoring:
  enabled: true
  serviceMonitor:
    enabled: true
    interval: 30s
    scrapeTimeout: 10s

templates/: Resource Templates

Helm Chart Directory Structure:

charts/atp-ingestion/
├── Chart.yaml              # Chart metadata
├── values.yaml             # Default values
├── values-dev.yaml         # Dev environment overrides
├── values-test.yaml        # Test environment overrides
├── values-staging.yaml     # Staging environment overrides
├── values-production.yaml  # Production environment overrides
├── .helmignore             # Files to exclude
├── README.md               # Chart documentation
├── charts/                 # Sub-charts (dependencies)
│   └── .gitkeep
├── templates/              # Kubernetes resource templates
│   ├── _helpers.tpl        # Named templates and helpers
│   ├── deployment.yaml     # Deployment resource
│   ├── service.yaml        # Service resource
│   ├── ingress.yaml        # Ingress resource (conditional)
│   ├── serviceaccount.yaml # ServiceAccount resource
│   ├── configmap.yaml      # ConfigMap resource
│   ├── networkpolicy.yaml  # NetworkPolicy resource (conditional)
│   ├── poddisruptionbudget.yaml # PDB resource (conditional)
│   ├── servicemonitor.yaml # ServiceMonitor for Prometheus (conditional)
│   ├── externalsecret.yaml # ExternalSecret resource (conditional)
│   ├── NOTES.txt           # Post-install notes
│   ├── tests/              # Helm test templates
│   │   ├── test-connection.yaml
│   │   └── test-health.yaml
│   └── hooks/              # Helm hooks
│       ├── pre-install-migration.yaml
│       └── post-install-verification.yaml
└── values.schema.json      # JSON Schema validation (Helm loads it from the chart root)

Chart Structure Diagram:

graph TB
    subgraph "Helm Chart: atp-ingestion"
        CHART[Chart.yaml<br/>Metadata & Dependencies]
        VALUES[values.yaml<br/>Default Configuration]
        VALUES_DEV[values-dev.yaml<br/>Dev Overrides]
        VALUES_PROD[values-production.yaml<br/>Prod Overrides]

        subgraph "templates/"
            HELPERS[_helpers.tpl<br/>Named Templates]
            DEPLOY[deployment.yaml]
            SVC[service.yaml]
            INGRESS[ingress.yaml]
            SA[serviceaccount.yaml]
            NETPOL[networkpolicy.yaml]

            subgraph "tests/"
                TEST_CONN[test-connection.yaml]
                TEST_HEALTH[test-health.yaml]
            end

            subgraph "hooks/"
                HOOK_PRE[pre-install-migration.yaml]
                HOOK_POST[post-install-verification.yaml]
            end
        end

        subgraph "charts/"
            DEP_REDIS[redis/]
            DEP_POSTGRES[postgresql/]
        end
    end

    CHART --> DEPLOY
    VALUES --> DEPLOY
    VALUES_DEV --> DEPLOY
    VALUES_PROD --> DEPLOY
    HELPERS --> DEPLOY
    DEPLOY --> SVC
    SVC --> INGRESS
    CHART --> DEP_REDIS
    CHART --> DEP_POSTGRES

    style CHART fill:#FFE5B4
    style VALUES fill:#FFE5B4
    style HELPERS fill:#90EE90

charts/: Sub-charts (Dependencies)

Sub-chart Dependencies:

# Chart.yaml dependencies section
dependencies:
  - name: redis
    version: "17.15.0"
    repository: "https://charts.bitnami.com/bitnami"
    condition: redis.enabled
    tags:
      - cache
  - name: postgresql
    version: "12.1.9"
    repository: "https://charts.bitnami.com/bitnami"
    condition: postgresql.enabled
    tags:
      - database

Sub-chart Values Override:

# values.yaml - Sub-chart value overrides
redis:
  enabled: false  # Use managed Redis in production
  architecture: standalone
  auth:
    enabled: true
  master:
    persistence:
      enabled: true
      size: 8Gi
    resources:
      requests:
        memory: 256Mi
        cpu: 250m

postgresql:
  enabled: false  # Use managed PostgreSQL
  auth:
    database: atp_production
    username: atp_user
  primary:
    persistence:
      enabled: true
      size: 20Gi
    resources:
      requests:
        memory: 512Mi
        cpu: 500m

Managing Dependencies:

# Update dependencies
helm dependency update charts/atp-ingestion/

# Build dependencies
helm dependency build charts/atp-ingestion/

# List dependencies
helm dependency list charts/atp-ingestion/

.helmignore: Files to Exclude

.helmignore File:

# charts/atp-ingestion/.helmignore
# Patterns to ignore when building packages

# Git
.git/
.gitignore
.gitattributes

# CI/CD
.azuredevops/
.github/
.gitlab-ci.yml

# IDE
.vscode/
.idea/
*.swp
*.swo
*~

# Documentation (keep README.md)
docs/
*.md
!README.md

# Tests (not part of chart)
tests/
*.test.go

# Build artifacts
bin/
obj/
*.dll
*.exe

# Dependencies (managed via Chart.yaml)
charts/*.tgz

# Temporary files
*.tmp
*.log
.DS_Store

Template Best Practices

Named Templates and Helpers (_helpers.tpl)

_helpers.tpl:

# templates/_helpers.tpl
{{/*
Expand the name of the chart.
*/}}
{{- define "atp-ingestion.name" -}}
{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" }}
{{- end }}

{{/*
Create a default fully qualified app name.
*/}}
{{- define "atp-ingestion.fullname" -}}
{{- if .Values.fullnameOverride }}
{{- .Values.fullnameOverride | trunc 63 | trimSuffix "-" }}
{{- else }}
{{- $name := default .Chart.Name .Values.nameOverride }}
{{- if contains $name .Release.Name }}
{{- .Release.Name | trunc 63 | trimSuffix "-" }}
{{- else }}
{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" }}
{{- end }}
{{- end }}
{{- end }}

{{/*
Create chart name and version as used by the chart label.
*/}}
{{- define "atp-ingestion.chart" -}}
{{- printf "%s-%s" .Chart.Name .Chart.Version | replace "+" "_" | trunc 63 | trimSuffix "-" }}
{{- end }}

{{/*
Common labels
*/}}
{{- define "atp-ingestion.labels" -}}
helm.sh/chart: {{ include "atp-ingestion.chart" . }}
{{ include "atp-ingestion.selectorLabels" . }}
{{- if .Chart.AppVersion }}
app.kubernetes.io/version: {{ .Chart.AppVersion | quote }}
{{- end }}
app.kubernetes.io/managed-by: {{ .Release.Service }}
app.kubernetes.io/part-of: atp-platform
{{- end }}

{{/*
Selector labels
*/}}
{{- define "atp-ingestion.selectorLabels" -}}
app.kubernetes.io/name: {{ include "atp-ingestion.name" . }}
app.kubernetes.io/instance: {{ .Release.Name }}
{{- end }}

{{/*
Create the name of the service account to use
*/}}
{{- define "atp-ingestion.serviceAccountName" -}}
{{- if .Values.serviceAccount.create }}
{{- default (include "atp-ingestion.fullname" .) .Values.serviceAccount.name }}
{{- else }}
{{- default "default" .Values.serviceAccount.name }}
{{- end }}
{{- end }}

{{/*
Image reference
*/}}
{{- define "atp-ingestion.image" -}}
{{- $tag := .Values.image.tag | default .Chart.AppVersion }}
{{- printf "%s:%s" .Values.image.repository $tag }}
{{- end }}

{{/*
Environment variables from ConfigMap
*/}}
{{- define "atp-ingestion.envFromConfigMap" -}}
{{- if .Values.envFrom }}
{{- range .Values.envFrom }}
{{- if .configMapRef }}
- configMapRef:
    name: {{ .configMapRef.name }}
{{- end }}
{{- end }}
{{- end }}
{{- end }}

{{/*
Environment variables from Secrets
*/}}
{{- define "atp-ingestion.envFromSecret" -}}
{{- if .Values.envFrom }}
{{- range .Values.envFrom }}
{{- if .secretRef }}
- secretRef:
    name: {{ .secretRef.name }}
{{- end }}
{{- end }}
{{- end }}
{{- end }}

{{/*
Security context
*/}}
{{- define "atp-ingestion.securityContext" -}}
allowPrivilegeEscalation: false
capabilities:
  drop:
  - ALL
readOnlyRootFilesystem: {{ .Values.securityContext.readOnlyRootFilesystem | default true }}
runAsNonRoot: {{ .Values.securityContext.runAsNonRoot | default true }}
{{- if .Values.securityContext.runAsUser }}
runAsUser: {{ .Values.securityContext.runAsUser }}
{{- end }}
{{- end }}

{{/*
Pod security context
*/}}
{{- define "atp-ingestion.podSecurityContext" -}}
{{- if .Values.podSecurityContext }}
fsGroup: {{ .Values.podSecurityContext.fsGroup }}
runAsNonRoot: {{ .Values.podSecurityContext.runAsNonRoot | default true }}
{{- if .Values.podSecurityContext.runAsUser }}
runAsUser: {{ .Values.podSecurityContext.runAsUser }}
{{- end }}
{{- end }}
{{- end }}

{{/*
Resource requests and limits
*/}}
{{- define "atp-ingestion.resources" -}}
{{- if .Values.resources }}
{{- toYaml .Values.resources }}
{{- else }}
requests:
  cpu: 100m
  memory: 128Mi
limits:
  cpu: 500m
  memory: 512Mi
{{- end }}
{{- end }}

Template Functions (include, default, required)

Using Template Functions:

# templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "atp-ingestion.fullname" . }}
  labels:
    {{- include "atp-ingestion.labels" . | nindent 4 }}
spec:
  replicas: {{ .Values.replicaCount | default 3 }}
  selector:
    matchLabels:
      {{- include "atp-ingestion.selectorLabels" . | nindent 6 }}
  template:
    metadata:
      annotations:
        {{- with .Values.podAnnotations }}
        {{- toYaml . | nindent 8 }}
        {{- end }}
      labels:
        {{- include "atp-ingestion.selectorLabels" . | nindent 8 }}
    spec:
      serviceAccountName: {{ include "atp-ingestion.serviceAccountName" . }}
      securityContext:
        {{- include "atp-ingestion.podSecurityContext" . | nindent 8 }}
      containers:
      - name: {{ .Chart.Name }}
        image: "{{ include "atp-ingestion.image" . }}"
        imagePullPolicy: {{ .Values.image.pullPolicy }}
        securityContext:
          {{- include "atp-ingestion.securityContext" . | nindent 10 }}
        ports:
        - name: http
          containerPort: {{ .Values.service.targetPort | default 8080 }}
          protocol: TCP
        env:
        {{- range .Values.env }}
        - name: {{ .name }}
          value: {{ .value | quote }}
        {{- end }}
        envFrom:
        {{- include "atp-ingestion.envFromConfigMap" . | nindent 8 }}
        {{- include "atp-ingestion.envFromSecret" . | nindent 8 }}
        resources:
          {{- include "atp-ingestion.resources" . | nindent 10 }}
        livenessProbe:
          {{- toYaml .Values.livenessProbe | nindent 10 }}
        readinessProbe:
          {{- toYaml .Values.readinessProbe | nindent 10 }}
        {{- if .Values.startupProbe }}
        startupProbe:
          {{- toYaml .Values.startupProbe | nindent 10 }}
        {{- end }}

Using required Function:

# Require critical values
image:
  repository: {{ required "image.repository is required" .Values.image.repository }}
  tag: {{ required "image.tag is required" .Values.image.tag }}

Using default and coalesce:

# Default values with fallback chain
replicas: {{ .Values.replicaCount | default 3 }}
namespace: {{ .Values.namespace | default .Release.Namespace }}
tag: {{ coalesce .Values.image.tag .Chart.AppVersion "latest" }}

Flow Control (if, with, range)

Conditional Rendering:

# templates/ingress.yaml
{{- if .Values.ingress.enabled -}}
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: {{ include "atp-ingestion.fullname" . }}
  labels:
    {{- include "atp-ingestion.labels" . | nindent 4 }}
  {{- with .Values.ingress.annotations }}
  annotations:
    {{- toYaml . | nindent 4 }}
  {{- end }}
spec:
  {{- if .Values.ingress.className }}
  ingressClassName: {{ .Values.ingress.className }}
  {{- end }}
  {{- if .Values.ingress.tls }}
  tls:
    {{- range .Values.ingress.tls }}
    - hosts:
        {{- range .hosts }}
        - {{ . | quote }}
        {{- end }}
      secretName: {{ .secretName }}
    {{- end }}
  {{- end }}
  rules:
    {{- range .Values.ingress.hosts }}
    - host: {{ .host | quote }}
      http:
        paths:
          {{- range .paths }}
          - path: {{ .path }}
            pathType: {{ .pathType }}
            backend:
              service:
                name: {{ include "atp-ingestion.fullname" $ }}
                port:
                  number: {{ $.Values.service.port }}
          {{- end }}
    {{- end }}
{{- end }}

Using with for Scoped Values:

{{- with .Values.monitoring.serviceMonitor }}
{{- if .enabled }}
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: {{ include "atp-ingestion.fullname" $ }}
spec:
  selector:
    matchLabels:
      {{- include "atp-ingestion.selectorLabels" $ | nindent 6 }}
  endpoints:
  - port: http
    interval: {{ .interval | default "30s" }}
    scrapeTimeout: {{ .scrapeTimeout | default "10s" }}
{{- end }}
{{- end }}

Variable Scoping

Understanding Variable Scoping:

# Scoping with $ (root context): inside range, "." is the current list item
{{- range .Values.env }}
- name: {{ .name }}
  value: {{ .value | quote }}
  # Use $ to reach the root context, e.g. {{ $.Release.Namespace }}
{{- end }}

# Scoping with with
{{- with .Values.resources }}
limits:
  cpu: {{ .limits.cpu }}
  memory: {{ .limits.memory }}
{{- end }}

# Preserving root context in nested scopes
{{- range .Values.env }}
  {{- if eq .name "DATABASE_HOST" }}
    {{- with $.Values.database }}
    value: {{ .host }}
    {{- end }}
  {{- end }}
{{- end }}

Whitespace Management

Whitespace Control:

# Remove leading/trailing whitespace
{{- include "atp-ingestion.labels" . | nindent 4 }}
{{- if .Values.ingress.enabled -}}
# ... content ...
{{- end }}

# Trim left whitespace
{{- include "template" . }}

# Trim right whitespace
{{ include "template" . -}}

# Trim both sides
{{- include "template" . -}}

# Preserve whitespace (default)
{{ include "template" . }}

# Indent (nindent adds newline before)
{{- include "labels" . | nindent 4 }}

# Output raw (without escaping)
{{- .Values.script | nindent 8 | trim }}

Values File Organization

Hierarchical Values Structure

Values Hierarchy:

# Base values.yaml
replicaCount: 3
resources:
  limits:
    cpu: 2000m
    memory: 2Gi
  requests:
    cpu: 500m
    memory: 1Gi

# Environment-specific override (values-production.yaml)
replicaCount: 5
resources:
  limits:
    cpu: 4000m
    memory: 4Gi
  requests:
    cpu: 1000m
    memory: 2Gi

Values Precedence:

  1. User-provided values (--set, --set-file)
  2. Environment-specific values (values-production.yaml)
  3. Default values (values.yaml)
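To make the ordering concrete, the lookup for a single key can be modeled as "first non-empty value wins". A shell sketch; the `resolve` helper is an illustration, not how Helm actually merges values (Helm performs a per-key deep merge of maps):

```shell
# Model Helm value precedence for one key: --set > values-production.yaml > values.yaml.
# resolve() returns its first non-empty argument (highest-precedence source first).
resolve() {
  local v
  for v in "$@"; do
    if [ -n "${v}" ]; then
      printf '%s\n' "${v}"
      return 0
    fi
  done
  return 1
}

SET_VALUE=""        # 1. --set replicaCount=... (not supplied in this example)
ENV_VALUE="5"       # 2. values-production.yaml
DEFAULT_VALUE="3"   # 3. values.yaml

REPLICAS=$(resolve "${SET_VALUE}" "${ENV_VALUE}" "${DEFAULT_VALUE}")
echo "replicaCount=${REPLICAS}"   # → replicaCount=5
```

In practice the same ordering is applied by a command such as `helm upgrade --install atp-ingestion charts/atp-ingestion -f charts/atp-ingestion/values-production.yaml --set image.tag=v1.2.3`.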

Environment Overrides

values-dev.yaml:

# charts/atp-ingestion/values-dev.yaml
replicaCount: 1

resources:
  limits:
    cpu: 500m
    memory: 512Mi
  requests:
    cpu: 100m
    memory: 256Mi

autoscaling:
  enabled: false

env:
  - name: ASPNETCORE_ENVIRONMENT
    value: "Development"
  - name: Logging__LogLevel__Default
    value: "Debug"

ingress:
  enabled: true
  className: "nginx"
  hosts:
    - host: ingestion.dev.atp.connectsoft.example
      paths:
        - path: /

values-production.yaml:

# charts/atp-ingestion/values-production.yaml
replicaCount: 5

resources:
  limits:
    cpu: 4000m
    memory: 4Gi
  requests:
    cpu: 1000m
    memory: 2Gi

autoscaling:
  enabled: true
  minReplicas: 5
  maxReplicas: 20
  targetCPUUtilizationPercentage: 70

env:
  - name: ASPNETCORE_ENVIRONMENT
    value: "Production"
  - name: Logging__LogLevel__Default
    value: "Warning"

ingress:
  enabled: true
  className: "nginx"
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
  hosts:
    - host: ingestion.atp.connectsoft.example
      paths:
        - path: /
  tls:
    - secretName: ingestion-tls
      hosts:
        - ingestion.atp.connectsoft.example

Secret References (Never Plaintext)

External Secret Reference in Values:

# values.yaml - NEVER include plaintext secrets
externalSecrets:
  enabled: true
  secrets:
    - name: sql-connection-string
      keyVaultName: atp-prod-kv
      # Key Vault secret names allow only alphanumerics and hyphens
      secretName: connection-strings-atp-db-production
      secretKey: connectionString
    - name: redis-connection-string
      keyVaultName: atp-prod-kv
      secretName: cache-redis-connection-string
      secretKey: connectionString

# ❌ BAD: Plaintext secret in values
# secrets:
#   sqlConnectionString: "Server=..."

# ✅ GOOD: Reference to ExternalSecret
envFrom:
  - secretRef:
      name: app-secrets  # Created by ExternalSecret operator

ExternalSecret Template:

# templates/externalsecret.yaml
{{- if .Values.externalSecrets.enabled }}
{{- range .Values.externalSecrets.secrets }}
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: {{ .name }}
  namespace: {{ $.Release.Namespace }}
spec:
  secretStoreRef:
    name: azure-keyvault-{{ .keyVaultName }}  # keyVaultName comes from the current secrets entry
    kind: ClusterSecretStore
  target:
    name: {{ .name }}
    creationPolicy: Owner
  data:
  - secretKey: {{ .secretKey | default "value" }}
    remoteRef:
      key: {{ .secretName }}
{{- end }}
{{- end }}
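For the first entry in the values above, the template renders roughly as follows (the `atp-production` namespace is a hypothetical release namespace; the hyphenated remote key reflects Key Vault's allowed character set):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: sql-connection-string
  namespace: atp-production        # hypothetical Release.Namespace
spec:
  secretStoreRef:
    name: azure-keyvault-atp-prod-kv
    kind: ClusterSecretStore
  target:
    name: sql-connection-string
    creationPolicy: Owner
  data:
  - secretKey: connectionString
    remoteRef:
      key: connection-strings-atp-db-production
```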

Documentation in values.yaml Comments

Documented values.yaml:

# charts/atp-ingestion/values.yaml
# Default values for atp-ingestion Helm chart

# -- Number of replicas
replicaCount: 3

# -- Image configuration
image:
  # -- Image repository
  repository: connectsoft.azurecr.io/atp/ingestion
  # -- Image pull policy (IfNotPresent, Always, Never)
  pullPolicy: IfNotPresent
  # -- Image tag (defaults to appVersion)
  tag: ""

# -- Service account configuration
serviceAccount:
  # -- Create service account
  create: true
  # -- Service account annotations
  annotations: {}
  # -- Service account name (defaults to fullname)
  name: ""

# -- Resource requests and limits
resources:
  limits:
    # -- CPU limit (e.g., 2000m, 2)
    cpu: 2000m
    # -- Memory limit (e.g., 2Gi, 2048Mi)
    memory: 2Gi
  requests:
    # -- CPU request (e.g., 500m, 0.5)
    cpu: 500m
    # -- Memory request (e.g., 1Gi, 1024Mi)
    memory: 1Gi

# -- Horizontal Pod Autoscaler configuration
autoscaling:
  # -- Enable HPA
  enabled: true
  # -- Minimum replicas
  minReplicas: 3
  # -- Maximum replicas
  maxReplicas: 10
  # -- Target CPU utilization percentage
  targetCPUUtilizationPercentage: 70
  # -- Target memory utilization percentage
  targetMemoryUtilizationPercentage: 80
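As a sketch of how the autoscaling block above is typically consumed, a chart template along these lines (the file path and helper names follow the chart's conventions but are illustrative, not taken from the chart itself) would emit an HPA only when `autoscaling.enabled` is true:

```yaml
# templates/hpa.yaml (sketch)
{{- if .Values.autoscaling.enabled }}
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: {{ include "atp-ingestion.fullname" . }}
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: {{ include "atp-ingestion.fullname" . }}
  minReplicas: {{ .Values.autoscaling.minReplicas }}
  maxReplicas: {{ .Values.autoscaling.maxReplicas }}
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: {{ .Values.autoscaling.targetCPUUtilizationPercentage }}
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: {{ .Values.autoscaling.targetMemoryUtilizationPercentage }}
{{- end }}
```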

Chart Dependencies

Depending on Other Charts

Defining Dependencies:

# Chart.yaml
dependencies:
  - name: redis
    version: "17.15.0"
    repository: "https://charts.bitnami.com/bitnami"
    condition: redis.enabled
  - name: postgresql
    version: "12.1.9"
    repository: "https://charts.bitnami.com/bitnami"
    condition: postgresql.enabled

Dependency Management Workflow:

sequenceDiagram
    participant Dev as Developer
    participant Chart as Chart.yaml
    participant Helm as Helm CLI
    participant Repo as Chart Repository

    Dev->>Chart: Add dependency to Chart.yaml
    Dev->>Helm: helm dependency update
    Helm->>Repo: Fetch dependency chart
    Repo-->>Helm: Return chart.tgz
    Helm->>Chart: Extract to charts/ directory
    Chart-->>Dev: Dependencies ready

Sub-chart Values Override

Overriding Sub-chart Values:

# values.yaml - Override Redis sub-chart values
redis:
  enabled: true
  architecture: standalone
  auth:
    enabled: true
    password: ""  # From ExternalSecret
  master:
    persistence:
      enabled: true
      storageClass: managed-premium
      size: 8Gi
    resources:
      requests:
        memory: 256Mi
        cpu: 250m
      limits:
        memory: 512Mi
        cpu: 500m

# Override PostgreSQL sub-chart values
postgresql:
  enabled: true
  auth:
    database: atp_production
    username: atp_user
    password: ""  # From ExternalSecret
  primary:
    persistence:
      enabled: true
      storageClass: managed-premium
      size: 20Gi
    resources:
      requests:
        memory: 512Mi
        cpu: 500m
      limits:
        memory: 1Gi
        cpu: 1000m

Conditional Dependencies

Conditional Dependency Rendering:

# Chart.yaml
dependencies:
  - name: redis
    version: "17.15.0"
    repository: "https://charts.bitnami.com/bitnami"
    condition: redis.enabled
    tags:
      - cache
  - name: postgresql
    version: "12.1.9"
    repository: "https://charts.bitnami.com/bitnami"
    condition: postgresql.enabled
    tags:
      - database

# values.yaml
redis:
  enabled: false  # Don't install Redis (use managed)

postgresql:
  enabled: false  # Don't install PostgreSQL (use managed)

Enable/Disable by Tag:

# Install with cache tag only
helm install my-release ./chart --set tags.cache=true

# Install with database tag only
helm install my-release ./chart --set tags.database=true

Dependency Management Commands

Dependency Management:

# Update dependencies (download latest)
helm dependency update charts/atp-ingestion/

# Build dependencies (rebuild from Chart.lock)
helm dependency build charts/atp-ingestion/

# List dependencies
helm dependency list charts/atp-ingestion/
# Output:
# NAME          VERSION REPOSITORY                              STATUS
# redis         17.15.0 https://charts.bitnami.com/bitnami      ok
# postgresql    12.1.9  https://charts.bitnami.com/bitnami      ok

# Verify downloaded dependencies against provenance signatures
helm dependency update --verify charts/atp-ingestion/

Chart Versioning and Publishing

Chart Versioning Strategy (SemVer)

Semantic Versioning:

| Version Component | When to Increment | Example |
|---|---|---|
| MAJOR | Breaking changes (incompatible values, removed features) | 1.2.3 → 2.0.0 |
| MINOR | New features (backward compatible) | 1.2.3 → 1.3.0 |
| PATCH | Bug fixes (backward compatible) | 1.2.3 → 1.2.4 |

Chart Version Examples:

# Chart.yaml
version: 1.2.3  # Chart version (SemVer)
appVersion: "1.2.3"  # Application version

# Version bump examples:
# 1.2.3 → 1.2.4 (patch: bug fix)
# 1.2.3 → 1.3.0 (minor: new feature added)
# 1.2.3 → 2.0.0 (major: breaking change)
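The bump rules can be captured in a few lines of shell (a sketch; in practice the CI pipeline would tag releases):

```shell
#!/usr/bin/env bash
# bump <version> <major|minor|patch> → next SemVer version
bump() {
  local major minor patch
  IFS=. read -r major minor patch <<< "$1"
  case "$2" in
    major) echo "$((major + 1)).0.0" ;;
    minor) echo "${major}.$((minor + 1)).0" ;;
    patch) echo "${major}.${minor}.$((patch + 1))" ;;
  esac
}

bump 1.2.3 patch   # 1.2.4
bump 1.2.3 minor   # 1.3.0
bump 1.2.3 major   # 2.0.0
```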

Publishing to Azure Container Registry

ACR OCI Registry Login:

# Authenticate Helm to ACR (OCI registries are not added with `helm repo add`)
helm registry login connectsoft.azurecr.io \
  --username 00000000-0000-0000-0000-000000000000 \
  --password "$(az acr login --name connectsoft --expose-token --output tsv --query accessToken)"

Publishing Chart to ACR:

#!/bin/bash
# scripts/publish-chart-to-acr.sh

CHART_NAME="${1}"
CHART_PATH="charts/${CHART_NAME}"
ACR_NAME="connectsoft"
ACR_REPO="oci://${ACR_NAME}.azurecr.io/helm"

echo "📦 Publishing ${CHART_NAME} to ACR"

# Package chart
helm package "${CHART_PATH}" --destination ./dist/

# Get package name
PACKAGE=$(ls -t ./dist/${CHART_NAME}-*.tgz | head -1)

# Push to ACR
helm push "${PACKAGE}" "${ACR_REPO}"

echo "✅ Chart published: ${PACKAGE}"

Installing from ACR:

# Install directly from the OCI registry (no repo add/update step)
helm install atp-ingestion oci://connectsoft.azurecr.io/helm/atp-ingestion \
  --version 1.2.3 \
  -f values-production.yaml

Chart Repository Structure

ACR OCI Repository Structure:

connectsoft.azurecr.io/helm/
├── atp-ingestion    # OCI artifact, tags: 1.0.0, 1.1.0, 1.2.3
├── atp-query        # OCI artifact, tags: ...
└── atp-gateway      # OCI artifact, tags: ...

Helm Hooks

Pre-Install: Run Before Installation

Pre-Install Hook (Database Migration):

# templates/hooks/pre-install-migration.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ include "atp-ingestion.fullname" . }}-pre-install-migration
  annotations:
    "helm.sh/hook": pre-install
    "helm.sh/hook-weight": "-5"
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
  labels:
    {{- include "atp-ingestion.labels" . | nindent 4 }}
spec:
  template:
    metadata:
      labels:
        {{- include "atp-ingestion.selectorLabels" . | nindent 8 }}
    spec:
      restartPolicy: Never
      serviceAccountName: {{ include "atp-ingestion.serviceAccountName" . }}
      containers:
      - name: migration
        image: {{ include "atp-ingestion.image" . }}
        command:
        - dotnet
        - ConnectSoft.Audit.Ingestion.Migrations.dll
        # ASPNETCORE_ENVIRONMENT and other settings come from .Values.env
        env:
        {{- range .Values.env }}
        - name: {{ .name }}
          value: {{ .value | quote }}
        {{- end }}
        envFrom:
        {{- include "atp-ingestion.envFromSecret" . | nindent 8 }}
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 512Mi

Post-Install: Run After Installation

Post-Install Hook (Verification):

# templates/hooks/post-install-verification.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ include "atp-ingestion.fullname" . }}-post-install-verification
  annotations:
    "helm.sh/hook": post-install
    "helm.sh/hook-weight": "5"
    "helm.sh/hook-delete-policy": hook-succeeded
  labels:
    {{- include "atp-ingestion.labels" . | nindent 4 }}
spec:
  template:
    metadata:
      labels:
        {{- include "atp-ingestion.selectorLabels" . | nindent 8 }}
    spec:
      restartPolicy: Never
      containers:
      - name: verification
        image: curlimages/curl:latest
        command:
        - /bin/sh
        - -c
        - |
          echo "Verifying service health..."
          sleep 10
          curl -f http://{{ include "atp-ingestion.fullname" . }}:{{ .Values.service.port }}/health/ready || exit 1
          echo "✅ Service is healthy"

Pre-Upgrade: Run Before Upgrade

Pre-Upgrade Hook (Backup):

# templates/hooks/pre-upgrade-backup.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ include "atp-ingestion.fullname" . }}-pre-upgrade-backup
  annotations:
    "helm.sh/hook": pre-upgrade
    "helm.sh/hook-weight": "-5"
    "helm.sh/hook-delete-policy": before-hook-creation
  labels:
    {{- include "atp-ingestion.labels" . | nindent 4 }}
spec:
  template:
    metadata:
      labels:
        {{- include "atp-ingestion.selectorLabels" . | nindent 8 }}
    spec:
      restartPolicy: Never
      containers:
      - name: backup
        image: mcr.microsoft.com/azure-cli:latest
        command:
        - /bin/bash
        - -c
        - |
          echo "Creating backup before upgrade..."
          # Backup logic here
          az storage blob upload-batch \
            --destination backup \
            --source /data \
            --account-name atpstorage
          echo "✅ Backup complete"

Post-Upgrade: Run After Upgrade

Post-Upgrade Hook (Smoke Tests):

# templates/hooks/post-upgrade-smoke-tests.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ include "atp-ingestion.fullname" . }}-post-upgrade-smoke-tests
  annotations:
    "helm.sh/hook": post-upgrade
    "helm.sh/hook-weight": "5"
    "helm.sh/hook-delete-policy": hook-succeeded
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: smoke-tests
        image: mcr.microsoft.com/dotnet/sdk:8.0
        command:
        - dotnet
        - test
        - --filter
        - Category=Smoke
        env:
        - name: API_URL
          value: http://{{ include "atp-ingestion.fullname" . }}:{{ .Values.service.port }}

Pre-Delete: Run Before Deletion

Pre-Delete Hook (Cleanup):

# templates/hooks/pre-delete-cleanup.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ include "atp-ingestion.fullname" . }}-pre-delete-cleanup
  annotations:
    "helm.sh/hook": pre-delete
    "helm.sh/hook-weight": "-5"
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: cleanup
        image: mcr.microsoft.com/azure-cli:latest
        command:
        - /bin/bash
        - -c
        - |
          echo "Cleaning up resources..."
          # Cleanup logic
          echo "✅ Cleanup complete"

Hook Use Cases

Hook Execution Flow:

sequenceDiagram
    participant Helm as Helm
    participant PreInstall as Pre-Install Hook
    participant Install as Installation
    participant PostInstall as Post-Install Hook
    participant PreUpgrade as Pre-Upgrade Hook
    participant Upgrade as Upgrade
    participant PostUpgrade as Post-Upgrade Hook
    participant PreDelete as Pre-Delete Hook
    participant Delete as Deletion

    Note over Helm,Delete: Installation Flow
    Helm->>PreInstall: Execute pre-install hooks
    PreInstall-->>Helm: Migration complete
    Helm->>Install: Install resources
    Install-->>Helm: Installed
    Helm->>PostInstall: Execute post-install hooks
    PostInstall-->>Helm: Verification complete

    Note over Helm,Delete: Upgrade Flow
    Helm->>PreUpgrade: Execute pre-upgrade hooks
    PreUpgrade-->>Helm: Backup complete
    Helm->>Upgrade: Upgrade resources
    Upgrade-->>Helm: Upgraded
    Helm->>PostUpgrade: Execute post-upgrade hooks
    PostUpgrade-->>Helm: Smoke tests passed

    Note over Helm,Delete: Deletion Flow
    Helm->>PreDelete: Execute pre-delete hooks
    PreDelete-->>Helm: Cleanup complete
    Helm->>Delete: Delete resources

Hook Use Cases Table:

| Hook | Use Case | Example |
|---|---|---|
| pre-install | Database migrations, schema setup | Run EF migrations before deploying app |
| post-install | Verification, smoke tests | Verify service is healthy after install |
| pre-upgrade | Backup, data migration | Backup database before upgrade |
| post-upgrade | Smoke tests, validation | Run integration tests after upgrade |
| pre-delete | Cleanup, data export | Export data before deleting service |
| post-delete | Final cleanup | Remove temporary resources |

Testing Helm Charts

helm lint: Syntax and Best Practices

Linting Helm Charts:

# Lint chart
helm lint charts/atp-ingestion/

# Lint with strict mode
helm lint charts/atp-ingestion/ --strict

# Lint with values file
helm lint charts/atp-ingestion/ -f values-production.yaml

# Lint all charts
for chart in charts/*/; do
  echo "Linting $chart"
  helm lint "$chart"
done

helm template: Render Templates Locally

Rendering Templates:

# Render all templates
helm template my-release charts/atp-ingestion/

# Render with values
helm template my-release charts/atp-ingestion/ -f values-production.yaml

# Render specific template
helm template my-release charts/atp-ingestion/ -s templates/deployment.yaml

# Dry-run (validate without installing)
helm install my-release charts/atp-ingestion/ --dry-run --debug

# Output to file
helm template my-release charts/atp-ingestion/ > rendered-manifests.yaml

Template Validation Script:

#!/bin/bash
# scripts/validate-helm-chart.sh

CHART_PATH="${1}"
VALUES_FILE="${2}"

echo "🔍 Validating Helm chart: ${CHART_PATH}"

# Lint
echo "1. Running helm lint..."
helm lint "${CHART_PATH}" ${VALUES_FILE:+-f "${VALUES_FILE}"} || exit 1

# Template rendering
echo "2. Rendering templates..."
helm template test-release "${CHART_PATH}" ${VALUES_FILE:+-f "${VALUES_FILE}"} > /tmp/rendered.yaml || exit 1

# Validate with kubeval
echo "3. Validating Kubernetes manifests..."
kubeval /tmp/rendered.yaml || exit 1

# Validate with kube-score
echo "4. Scoring manifests..."
kube-score score /tmp/rendered.yaml || exit 1

echo "✅ Chart validation passed"

helm test: Run Tests in Cluster

Helm Test Templates:

# templates/tests/test-connection.yaml
apiVersion: v1
kind: Pod
metadata:
  name: "{{ include "atp-ingestion.fullname" . }}-test-connection"
  annotations:
    "helm.sh/hook": test
  labels:
    {{- include "atp-ingestion.selectorLabels" . | nindent 4 }}
spec:
  restartPolicy: Never
  containers:
  - name: wget
    image: busybox:1.35
    command: ['wget']
    args: ['{{ include "atp-ingestion.fullname" . }}:{{ .Values.service.port }}']
# templates/tests/test-health.yaml
apiVersion: v1
kind: Pod
metadata:
  name: "{{ include "atp-ingestion.fullname" . }}-test-health"
  annotations:
    "helm.sh/hook": test
spec:
  restartPolicy: Never
  containers:
  - name: curl-test
    image: curlimages/curl:latest
    command:
    - /bin/sh
    - -c
    - |
      curl -f http://{{ include "atp-ingestion.fullname" . }}:{{ .Values.service.port }}/health/ready || exit 1
      curl -f http://{{ include "atp-ingestion.fullname" . }}:{{ .Values.service.port }}/health/live || exit 1
      echo "✅ Health checks passed"

Running Helm Tests:

# Run tests
helm test my-release

# Run tests with logs
helm test my-release --logs

# Run tests with timeout
helm test my-release --timeout 5m

chart-testing Tool (ct)

Install chart-testing:

# Install ct
curl -LO https://github.com/helm/chart-testing/releases/download/v3.9.0/chart-testing_3.9.0_linux_amd64.tar.gz
tar -xzf chart-testing_3.9.0_linux_amd64.tar.gz
sudo mv ct /usr/local/bin/

chart-testing Configuration:

# .github/ct.yaml
chart-dirs:
  - charts
chart-repos:
  - bitnami=https://charts.bitnami.com/bitnami
target-branch: main
validate-maintainers: true
check-version-increment: true

Using chart-testing:

# Lint and validate
ct lint --charts charts/atp-ingestion/

# Install and test
ct install --charts charts/atp-ingestion/

# Lint all changed charts
ct lint --target-branch main

Integration with CI Pipeline

Azure Pipeline for Chart Testing:

# azure-pipelines-helm-charts.yml
trigger:
  branches:
    include:
    - main
  paths:
    include:
    - charts/**/*

pool:
  vmImage: 'ubuntu-latest'

steps:
- task: HelmInstaller@1
  displayName: 'Install Helm'
  inputs:
    helmVersionToInstall: '3.12.0'

- task: Bash@3
  displayName: 'Install chart-testing'
  inputs:
    targetType: 'inline'
    script: |
      curl -LO https://github.com/helm/chart-testing/releases/download/v3.9.0/chart-testing_3.9.0_linux_amd64.tar.gz
      tar -xzf chart-testing_3.9.0_linux_amd64.tar.gz
      sudo mv ct /usr/local/bin/

- task: Bash@3
  displayName: 'Lint Charts'
  inputs:
    targetType: 'inline'
    script: |
      for chart in charts/*/; do
        echo "Linting $chart"
        helm lint "$chart"
        ct lint --charts "$chart"
      done

- task: Bash@3
  displayName: 'Render Templates'
  inputs:
    targetType: 'inline'
    script: |
      for chart in charts/*/; do
        echo "Rendering $chart"
        helm template test-release "$chart" -f "$chart/values-production.yaml" > /dev/null
      done

- task: Bash@3
  displayName: 'Package Charts'
  inputs:
    targetType: 'inline'
    script: |
      mkdir -p dist
      for chart in charts/*/; do
        helm package "$chart" --destination ./dist/
      done

Helm Chart CI Pipeline

Complete CI Pipeline:

# azure-pipelines-helm-chart-ci.yml
trigger:
  branches:
    include:
    - main
    - feature/*
  paths:
    include:
    - charts/**/*

pr:
  branches:
    include:
    - main
  paths:
    include:
    - charts/**/*

pool:
  vmImage: 'ubuntu-latest'

variables:
  - group: atp-helm-charts
  - name: ACR_NAME
    value: 'connectsoft'

stages:
- stage: Lint
  displayName: 'Lint Charts'
  jobs:
  - job: Lint
    displayName: 'Lint Helm Charts'
    steps:
    - task: HelmInstaller@1
      displayName: 'Install Helm'
      inputs:
        helmVersionToInstall: '3.12.0'

    - task: Bash@3
      displayName: 'Install kubeval and kube-score'
      inputs:
        targetType: 'inline'
        script: |
          wget https://github.com/instrumenta/kubeval/releases/latest/download/kubeval-linux-amd64.tar.gz
          tar xf kubeval-linux-amd64.tar.gz
          sudo mv kubeval /usr/local/bin/

          wget https://github.com/zegl/kube-score/releases/download/v1.17.0/kube-score_1.17.0_linux_amd64.tar.gz
          tar xf kube-score_1.17.0_linux_amd64.tar.gz
          sudo mv kube-score /usr/local/bin/

    - task: Bash@3
      displayName: 'Lint and Validate Charts'
      inputs:
        targetType: 'inline'
        script: |
          for chart in charts/*/; do
            CHART_NAME=$(basename "$chart")
            echo "Linting ${CHART_NAME}..."

            # Helm lint
            helm lint "$chart" || exit 1

            # Render and validate
            helm template test-release "$chart" -f "$chart/values-production.yaml" | \
              kubeval --strict || exit 1

            # Score
            helm template test-release "$chart" -f "$chart/values-production.yaml" | \
              kube-score score - || exit 1
          done

- stage: Package
  displayName: 'Package Charts'
  condition: succeeded()
  jobs:
  - job: Package
    displayName: 'Package Helm Charts'
    steps:
    - task: HelmInstaller@1
      displayName: 'Install Helm'

    - task: Bash@3
      displayName: 'Package Charts'
      inputs:
        targetType: 'inline'
        script: |
          mkdir -p dist
          for chart in charts/*/; do
            helm package "$chart" --destination ./dist/
          done

          echo "##vso[task.setVariable variable=CHARTS_PACKAGED]true"

- stage: Publish
  displayName: 'Publish to ACR'
  condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
  jobs:
  - deployment: Publish
    displayName: 'Publish Charts to ACR'
    environment: 'Production'
    strategy:
      runOnce:
        deploy:
          steps:
          - task: AzureCLI@2
            displayName: 'Login to ACR'
            inputs:
              azureSubscription: 'ATP-Prod-ServiceConnection'
              scriptType: 'bash'
              scriptLocation: 'inlineScript'
              inlineScript: |
                az acr login --name $(ACR_NAME)

          - task: HelmInstaller@1
            displayName: 'Install Helm'

          - task: Bash@3
            displayName: 'Publish Charts'
            inputs:
              targetType: 'inline'
              script: |
                for package in dist/*.tgz; do
                  CHART_NAME=$(basename "$package" .tgz | cut -d- -f1-2)
                  echo "Publishing ${CHART_NAME}..."
                  helm push "$package" "oci://$(ACR_NAME).azurecr.io/helm"
                done
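The chart-name derivation in the publish stage assumes two-token names like `atp-ingestion`; extracted on its own it behaves like this (note that for a single-token chart such as `redis`, `-f1-2` would also capture part of the version):

```shell
# How the publish step derives the chart name from a package file
package="dist/atp-ingestion-1.2.3.tgz"   # example package path
CHART_NAME=$(basename "$package" .tgz | cut -d- -f1-2)
echo "$CHART_NAME"   # atp-ingestion
```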

Chart Documentation

README.md with Usage Instructions

Chart README Template:

# atp-ingestion

A Helm chart for ATP Ingestion Service - Collects and processes audit trail events.

## Introduction

This chart deploys the ATP Ingestion Service on a Kubernetes cluster using the Helm package manager.

## Prerequisites

- Kubernetes 1.24+
- Helm 3.8+
- Azure Container Registry access
- External Secrets Operator (for secret management)

## Installing the Chart

To install the chart with the release name `atp-ingestion`:

helm install atp-ingestion oci://connectsoft.azurecr.io/helm/atp-ingestion \
  --version 1.2.3 \
  -f values-production.yaml

Uninstalling the Chart

To uninstall/delete the atp-ingestion deployment:

helm uninstall atp-ingestion

Configuration

The following table lists the configurable parameters:

| Parameter | Description | Default |
|---|---|---|
| replicaCount | Number of replicas | 3 |
| image.repository | Image repository | connectsoft.azurecr.io/atp/ingestion |
| image.tag | Image tag | "" (defaults to appVersion) |
| resources.limits.cpu | CPU limit | 2000m |
| resources.limits.memory | Memory limit | 2Gi |
| resources.requests.cpu | CPU request | 500m |
| resources.requests.memory | Memory request | 1Gi |
| autoscaling.enabled | Enable HPA | true |
| autoscaling.minReplicas | Minimum replicas | 3 |
| autoscaling.maxReplicas | Maximum replicas | 10 |

Values Files

  • values.yaml: Default values
  • values-dev.yaml: Development environment
  • values-test.yaml: Test environment
  • values-staging.yaml: Staging environment
  • values-production.yaml: Production environment

Dependencies

  • Redis (optional, via Bitnami chart)
  • PostgreSQL (optional, via Bitnami chart)

Hooks

  • pre-install: Runs database migrations
  • post-install: Verifies service health
  • pre-upgrade: Creates backup
  • post-upgrade: Runs smoke tests
Values Schema (JSON Schema)

values.schema.json:

{
  "$schema": "http://json-schema.org/schema#",
  "type": "object",
  "properties": {
    "replicaCount": {
      "type": "integer",
      "minimum": 1,
      "maximum": 100,
      "description": "Number of replicas"
    },
    "image": {
      "type": "object",
      "properties": {
        "repository": {
          "type": "string",
          "description": "Image repository"
        },
        "tag": {
          "type": "string",
          "description": "Image tag"
        },
        "pullPolicy": {
          "type": "string",
          "enum": ["IfNotPresent", "Always", "Never"],
          "description": "Image pull policy"
        }
      },
      "required": ["repository"]
    },
    "resources": {
      "type": "object",
      "properties": {
        "limits": {
          "type": "object",
          "properties": {
            "cpu": {
              "type": "string",
              "pattern": "^[0-9]+m?$|^[0-9]+\\.[0-9]+$"
            },
            "memory": {
              "type": "string",
              "pattern": "^[0-9]+(Mi|Gi|Ti|Pi|Ei|m|K|M|G|T|P|E)$"
            }
          }
        },
        "requests": {
          "type": "object",
          "properties": {
            "cpu": {
              "type": "string",
              "pattern": "^[0-9]+m?$|^[0-9]+\\.[0-9]+$"
            },
            "memory": {
              "type": "string",
              "pattern": "^[0-9]+(Mi|Gi|Ti|Pi|Ei|m|K|M|G|T|P|E)$"
            }
          }
        }
      }
    }
  },
  "required": ["image"]
}

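The `cpu` pattern in the schema can be sanity-checked with `grep -E` (a quick sketch; when `values.schema.json` is present, `helm lint`, `helm template`, and `helm install` validate the supplied values against it automatically):

```shell
# Quick check of the schema's cpu pattern against typical values
cpu_pattern='^[0-9]+m?$|^[0-9]+\.[0-9]+$'

echo "2000m" | grep -Eq "$cpu_pattern" && echo "2000m ok"      # matches
echo "0.5"   | grep -Eq "$cpu_pattern" && echo "0.5 ok"        # matches
echo "2Gi"   | grep -Eq "$cpu_pattern" || echo "2Gi rejected"  # memory value, no match
```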
Chart Security

Scanning Charts for Vulnerabilities

Chart Security Scanning:

# Install checkov for Helm chart scanning
pip install checkov

# Scan Helm chart
checkov -d charts/atp-ingestion/ --framework helm

Policy Validation

OPA Policy for Helm Charts:

# policies/helm-chart-policy.rego
package helm

deny[msg] {
    input.kind == "Deployment"
    not input.spec.template.spec.securityContext.runAsNonRoot
    msg := "Deployment must set runAsNonRoot to true"
}

deny[msg] {
    input.kind == "Deployment"
    container := input.spec.template.spec.containers[_]
    not container.securityContext.allowPrivilegeEscalation == false
    msg := "Containers must set allowPrivilegeEscalation to false"
}

Image Scanning in Chart Images

Image Scanning in CI:

# Azure Pipeline: Scan chart images
- task: Bash@3
  displayName: 'Scan Images'
  inputs:
    targetType: 'inline'
    script: |
      # Extract images from chart
      IMAGES=$(helm template test-release charts/atp-ingestion/ | \
        grep -E 'image:' | \
        awk '{print $2}' | \
        tr -d '"')

      # Scan each image with Trivy
      for image in $IMAGES; do
        echo "Scanning $image"
        trivy image --severity HIGH,CRITICAL "$image" || exit 1
      done

Summary: Helm Chart Development for ATP Services

  • Helm Chart Structure: Chart.yaml metadata, values.yaml defaults, templates/ directory, charts/ dependencies, .helmignore exclusions
  • Template Best Practices: Named templates and helpers (_helpers.tpl), template functions (include, default, required), flow control (if, with, range), variable scoping, whitespace management
  • Values File Organization: Hierarchical values structure, default values, environment overrides (dev/test/staging/production), secret references (never plaintext), documentation in comments
  • Chart Dependencies: Depending on other charts, sub-chart values override, conditional dependencies, dependency management commands
  • Chart Versioning and Publishing: Semantic versioning (SemVer), publishing to Azure Container Registry (ACR), chart repository structure
  • Helm Hooks: Pre-install (migrations), post-install (verification), pre-upgrade (backup), post-upgrade (smoke tests), pre-delete (cleanup), hook use cases and execution flow
  • Testing Helm Charts: helm lint, helm template (render locally), helm test (run in cluster), chart-testing tool (ct), CI pipeline integration
  • Helm Chart CI Pipeline: Lint charts on PR, package charts, publish to ACR, version management
  • Chart Documentation: README.md with usage instructions, values schema (JSON Schema), changelog for versions
  • Chart Security: Scanning charts for vulnerabilities, policy validation (OPA), image scanning in chart images

Kustomize Advanced Patterns

Purpose: Define advanced Kustomize patterns, strategies, and best practices for ATP GitOps deployments including strategic merge patches, JSON patches, generators, transformers, component composition, remote bases, and FluxCD integration to enable flexible, maintainable, and reusable Kubernetes configuration management.


Kustomize Architecture

Base, Overlays, Components

Kustomize Architecture Overview:

graph TB
    subgraph "Base"
        BASE[kustomization.yaml<br/>Base Resources]
        DEPLOY_BASE[deployment.yaml]
        SVC_BASE[service.yaml]
        CM_BASE[configmap.yaml]
    end

    subgraph "Overlays"
        subgraph "Dev Overlay"
            DEV_KUST[dev/kustomization.yaml]
            DEV_PATCH[dev/patch.yaml]
        end
        subgraph "Prod Overlay"
            PROD_KUST[production/kustomization.yaml]
            PROD_PATCH[production/patch.yaml]
        end
    end

    subgraph "Components"
        COMP_KUST[components/monitoring/kustomization.yaml]
        COMP_RESOURCES[components/monitoring/resources/]
    end

    BASE --> DEPLOY_BASE
    BASE --> SVC_BASE
    BASE --> CM_BASE

    DEV_KUST -.references.-> BASE
    DEV_KUST -.uses.-> DEV_PATCH
    DEV_KUST -.includes.-> COMP_KUST

    PROD_KUST -.references.-> BASE
    PROD_KUST -.uses.-> PROD_PATCH
    PROD_KUST -.includes.-> COMP_KUST

    style BASE fill:#FFE5B4
    style DEV_KUST fill:#90EE90
    style PROD_KUST fill:#FFB6C1
    style COMP_KUST fill:#87CEEB

Directory Structure:

apps/atp-ingestion/
├── base/
│   ├── kustomization.yaml
│   ├── deployment.yaml
│   ├── service.yaml
│   └── configmap.yaml
├── overlays/
│   ├── dev/
│   │   ├── kustomization.yaml
│   │   ├── deployment-patch.yaml
│   │   └── configmap-patch.yaml
│   ├── test/
│   │   ├── kustomization.yaml
│   │   └── deployment-patch.yaml
│   ├── staging/
│   │   ├── kustomization.yaml
│   │   └── deployment-patch.yaml
│   └── production/
│       ├── kustomization.yaml
│       ├── deployment-patch.yaml
│       └── configmap-patch.yaml
└── components/
    ├── monitoring/
    │   ├── kustomization.yaml
    │   └── servicemonitor.yaml
    └── networking/
        ├── kustomization.yaml
        └── networkpolicy.yaml

Kustomization File Structure

Base kustomization.yaml:

# apps/atp-ingestion/base/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

metadata:
  name: atp-ingestion-base
  namespace: default

resources:
  - deployment.yaml
  - service.yaml
  - configmap.yaml

commonLabels:
  app: atp-ingestion
  component: ingestion
  managed-by: kustomize

commonAnnotations:
  description: "ATP Ingestion Service Base Configuration"

namespace: default

Overlay kustomization.yaml:

# apps/atp-ingestion/overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

metadata:
  name: atp-ingestion-production

resources:
  - ../../base

patchesStrategicMerge:  # deprecated in Kustomize v5; newer configs use the patches field
  - deployment-patch.yaml
  - configmap-patch.yaml

patches:
  - path: service-patch.json
    target:
      kind: Service

images:
  - name: connectsoft.azurecr.io/atp/ingestion
    newName: connectsoft.azurecr.io/atp/ingestion
    newTag: v1.2.3

replicas:
  - name: atp-ingestion
    count: 5

namespace: atp-production

commonLabels:
  environment: production

configMapGenerator:
  - name: app-config
    literals:
      - ASPNETCORE_ENVIRONMENT=Production

Resource Selection

Resource Selection in Kustomization:

# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

# Resources to include
resources:
  - deployment.yaml
  - service.yaml
  - ../../base  # Include entire base

# Components to include
components:
  - ../../components/monitoring

# Exclude resources (via selector)
# Note: Kustomize doesn't support exclude directly,
# use patches to remove resources

# Select resources by label
# (requires custom transformer or post-processing)

Transformation Order

Kustomize Transformation Pipeline:

graph LR
    BASE[Base Resources] --> COMMON[Common Labels/Annotations]
    COMMON --> NAMESPACE[Namespace Transform]
    NAMESPACE --> PREFIX[Name Prefix/Suffix]
    PREFIX --> IMAGES[Image Transform]
    IMAGES --> REPLICAS[Replica Transform]
    REPLICAS --> PATCHES[Strategic Merge Patches]
    PATCHES --> JSON[JSON Patches]
    JSON --> GENERATORS[ConfigMap/Secret Generators]
    GENERATORS --> OUTPUT[Final Output]

    style BASE fill:#FFE5B4
    style OUTPUT fill:#90EE90

Transformation Order:

  1. Load Resources: Read base resources and all referenced resources
  2. Common Labels/Annotations: Apply common labels and annotations
  3. Namespace Transform: Set namespace on all resources
  4. Name Prefix/Suffix: Apply name transformations
  5. Image Transform: Replace image references
  6. Replica Transform: Update replica counts
  7. Strategic Merge Patches: Apply strategic merge patches
  8. JSON Patches: Apply JSON patches
  9. ConfigMap/Secret Generators: Generate ConfigMaps and Secrets
  10. Replacements: Apply replacement transformations
  11. Final Output: Emit transformed resources

Strategic Merge Patches

How Strategic Merge Works

Strategic Merge Patch Overview:

Strategic merge patches use the Kubernetes strategic-merge-patch algorithm to merge patch documents into base resources, following Kubernetes-specific semantics for merging lists and maps.

Strategic Merge Process:

graph TB
    BASE[Base Resource]
    PATCH[Strategic Merge Patch]

    BASE --> MERGE{Strategic Merge}
    PATCH --> MERGE

    MERGE --> RESULT[Merged Resource]

    subgraph "Merge Semantics"
        REPLACE[Replace<br/>Explicit values]
        ADD[Add<br/>New fields]
        DELETE[Delete<br/>null values]
        ARRAY[Array Merge<br/>Strategic merge keys]
    end

    MERGE -.uses.-> REPLACE
    MERGE -.uses.-> ADD
    MERGE -.uses.-> DELETE
    MERGE -.uses.-> ARRAY

Merge Semantics (Replace, Add, Delete)

Strategic Merge Examples:

Base Deployment:

# base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  replicas: 3
  template:
    spec:
      containers:
      - name: atp-ingestion
        image: connectsoft.azurecr.io/atp/ingestion:latest
        resources:
          requests:
            cpu: 500m
            memory: 1Gi
          limits:
            cpu: 2000m
            memory: 2Gi
        env:
        - name: ASPNETCORE_ENVIRONMENT
          value: "Development"

Strategic Merge Patch (Replace):

# overlays/production/deployment-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  replicas: 5  # Replace: 3 → 5
  template:
    spec:
      containers:
      - name: atp-ingestion
        image: connectsoft.azurecr.io/atp/ingestion:v1.2.3  # Replace image
        resources:
          requests:
            cpu: 1000m  # Replace: 500m → 1000m
            memory: 2Gi  # Replace: 1Gi → 2Gi
          limits:
            cpu: 4000m  # Replace: 2000m → 4000m
            memory: 4Gi  # Replace: 2Gi → 4Gi
        env:
        - name: ASPNETCORE_ENVIRONMENT
          value: "Production"  # Replace: Development → Production

Strategic Merge Patch (Add):

# overlays/production/deployment-patch-add.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  template:
    spec:
      containers:
      - name: atp-ingestion
        env:
        # Add new environment variables
        - name: Logging__LogLevel__Default
          value: "Warning"
        - name: Telemetry__SamplingRate
          value: "0.1"
        resources:
          limits:
            # Add new resource limit
            ephemeral-storage: 10Gi

Strategic Merge Patch (Delete):

# overlays/minimal/deployment-patch-delete.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  template:
    spec:
      containers:
      - name: atp-ingestion
        env:
        # Delete by setting to null
        - name: Telemetry__SamplingRate
          value: null

Array Merging Strategies

Array Merging with Strategic Merge Keys:

Kubernetes uses strategic merge keys to identify array items for merging:

| Resource Type | Strategic Merge Key |
|---|---|
| Deployment.spec.template.spec.containers | name |
| Deployment.spec.template.spec.initContainers | name |
| Service.spec.ports | port |
| ConfigMap.data | key name (map merge) |
| Pod.spec.volumes | name |

Container Array Merge Example:

Base Deployment:

# base/deployment.yaml
spec:
  template:
    spec:
      containers:
      - name: atp-ingestion
        image: connectsoft.azurecr.io/atp/ingestion:latest
        env:
        - name: ASPNETCORE_ENVIRONMENT
          value: "Development"
      - name: sidecar
        image: connectsoft.azurecr.io/atp/sidecar:latest

Production Patch (Update Existing Container, Add New):

# overlays/production/deployment-patch.yaml
spec:
  template:
    spec:
      containers:
      # Update existing container (matched by name: "atp-ingestion")
      - name: atp-ingestion
        image: connectsoft.azurecr.io/atp/ingestion:v1.2.3
        env:
        - name: ASPNETCORE_ENVIRONMENT
          value: "Production"
        - name: Logging__LogLevel__Default
          value: "Warning"
      # Add new container
      - name: metrics-exporter
        image: prom/node-exporter:latest

Result: The atp-ingestion container is updated, sidecar remains unchanged, and metrics-exporter is added.

Common Patterns

Common Strategic Merge Patterns:

Pattern 1: Update Replicas and Resources:

# overlays/production/deployment-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  replicas: 5
  template:
    spec:
      containers:
      - name: atp-ingestion
        resources:
          requests:
            cpu: 1000m
            memory: 2Gi
          limits:
            cpu: 4000m
            memory: 4Gi

Pattern 2: Add Environment Variables:

# overlays/production/deployment-patch-env.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  template:
    spec:
      containers:
      - name: atp-ingestion
        env:
        - name: ASPNETCORE_ENVIRONMENT
          value: "Production"
        - name: Logging__LogLevel__Default
          value: "Warning"
        - name: Telemetry__SamplingRate
          value: "0.1"

Pattern 3: Add Volume Mounts:

# overlays/production/deployment-patch-volumes.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  template:
    spec:
      containers:
      - name: atp-ingestion
        volumeMounts:
        - name: config
          mountPath: /app/config
      volumes:
      - name: config
        configMap:
          name: app-config

Pattern 4: Update Service Type:

# overlays/production/service-patch.yaml
apiVersion: v1
kind: Service
metadata:
  name: atp-ingestion
spec:
  type: LoadBalancer  # Change from ClusterIP to LoadBalancer
  ports:
  - port: 80
    targetPort: 8080

JSON Patches

JSON Patch Operations (Add, Replace, Remove)

JSON Patch Operations:

| Operation | Description | Use Case |
|---|---|---|
| add | Add field or array element | Add new annotation, add new container |
| replace | Replace existing field value | Update replica count, change image tag |
| remove | Remove field or array element | Remove environment variable, remove port |
| copy | Copy value from one path to another | Copy annotation value |
| move | Move value from one path to another | Move label |
| test | Test value equality | Validate before patch |

JSON Patch Example:

# overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

patches:
  - target:
      kind: Deployment
      name: atp-ingestion
    path: deployment-patch.json

deployment-patch.json:

[
  {
    "op": "replace",
    "path": "/spec/replicas",
    "value": 5
  },
  {
    "op": "replace",
    "path": "/spec/template/spec/containers/0/image",
    "value": "connectsoft.azurecr.io/atp/ingestion:v1.2.3"
  },
  {
    "op": "add",
    "path": "/spec/template/metadata/annotations/prometheus.io~1scrape",
    "value": "true"
  },
  {
    "op": "add",
    "path": "/spec/template/spec/containers/0/env/-",
    "value": {
      "name": "Logging__LogLevel__Default",
      "value": "Warning"
    }
  },
  {
    "op": "remove",
    "path": "/spec/template/spec/containers/0/env/0"
  }
]

Path Targeting

JSON Patch Path Examples (comments are for illustration only — real JSON Patch files must not contain comments):

[
  // Replace replica count
  {
    "op": "replace",
    "path": "/spec/replicas",
    "value": 5
  },

  // Replace image in first container
  {
    "op": "replace",
    "path": "/spec/template/spec/containers/0/image",
    "value": "new-image:tag"
  },

  // Add annotation (use ~1 for /)
  {
    "op": "add",
    "path": "/metadata/annotations/prometheus.io~1scrape",
    "value": "true"
  },

  // Add to array (use - to append)
  {
    "op": "add",
    "path": "/spec/template/spec/containers/0/env/-",
    "value": {
      "name": "NEW_VAR",
      "value": "value"
    }
  },

  // Remove array element by index
  {
    "op": "remove",
    "path": "/spec/template/spec/containers/0/env/0"
  },

  // Remove field
  {
    "op": "remove",
    "path": "/spec/template/spec/containers/0/resources/limits/cpu"
  }
]

Path Escaping:

  • Use ~1 for / in path
  • Use ~0 for ~ in path
  • Example: prometheus.io/scrape → prometheus.io~1scrape
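
The escaping rule can be scripted. A minimal sketch of a helper (hypothetical function name) that applies RFC 6901 escaping in the correct order:

```shell
#!/bin/bash
# RFC 6901 JSON Pointer escaping for Kustomize JSON patch paths.
# Escape "~" first (-> ~0), then "/" (-> ~1); the reverse order
# would corrupt the "~1" sequences produced for slashes.
escape_pointer() {
  printf '%s' "$1" | sed -e 's/~/~0/g' -e 's#/#~1#g'
}

escape_pointer 'prometheus.io/scrape'; echo   # prometheus.io~1scrape
escape_pointer 'example.com/a~b'; echo        # example.com~1a~0b
```

The escaped key can then be embedded in a patch path such as `/metadata/annotations/prometheus.io~1scrape`.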

When to Use JSON Patches vs Strategic Merge

Comparison:

| Feature | Strategic Merge | JSON Patch |
|---|---|---|
| Simplicity | ✅ Easier to write and read | ⚠️ More verbose |
| Type Safety | ✅ YAML-native | ❌ JSON only |
| Array Operations | ✅ Smart merging with keys | ⚠️ Index-based |
| Precision | ⚠️ Can be ambiguous | ✅ Very precise |
| Removal | ⚠️ Requires null | ✅ Direct remove |
| ATP Preference | Preferred for most cases | ⚠️ Use for complex cases |

ATP Decision Matrix:

| Use Case | Recommended Approach |
|---|---|
| Update replicas | ✅ Strategic merge |
| Update image tag | ✅ Strategic merge |
| Add environment variables | ✅ Strategic merge |
| Remove specific array element | ✅ JSON patch |
| Add annotation with / in key | ⚠️ JSON patch (or use quotes in YAML) |
| Precise field replacement | ✅ JSON patch |
| Complex array manipulation | ⚠️ JSON patch |

ConfigMap and Secret Generators

Generating ConfigMaps from Literals

ConfigMap Generator from Literals:

# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

configMapGenerator:
  - name: app-config
    literals:
      - ASPNETCORE_ENVIRONMENT=Production
      - Logging__LogLevel__Default=Warning
      - Telemetry__SamplingRate=0.1
    options:
      labels:
        app: atp-ingestion
      annotations:
        description: "Application configuration"
      disableNameSuffixHash: false  # Include hash suffix for updates

Generated ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config-abc123  # Hash suffix added
  labels:
    app: atp-ingestion
  annotations:
    description: "Application configuration"
data:
  ASPNETCORE_ENVIRONMENT: Production
  Logging__LogLevel__Default: Warning
  Telemetry__SamplingRate: "0.1"

Generating ConfigMaps from Files

ConfigMap Generator from Files:

# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

configMapGenerator:
  - name: app-config
    files:
      - appsettings.json
      - logging.json
    options:
      disableNameSuffixHash: false

Directory Structure:

overlays/production/
├── kustomization.yaml
├── appsettings.json
└── logging.json

File Contents:

// appsettings.json
{
  "Logging": {
    "LogLevel": {
      "Default": "Warning"
    }
  },
  "Telemetry": {
    "SamplingRate": 0.1
  }
}

Generated ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config-xyz789
data:
  appsettings.json: |
    {
      "Logging": {
        "LogLevel": {
          "Default": "Warning"
        }
      }
    }
  logging.json: |
    {...}

Generating Secrets (Encrypted)

Secret Generator:

# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

secretGenerator:
  - name: app-secrets
    type: Opaque
    literals:
      - connectionString=Server=...
      - apiKey=secret-key-123
    options:
      labels:
        app: atp-ingestion
      disableNameSuffixHash: false

⚠️ Security Warning: Secrets in kustomization.yaml are base64 encoded, not encrypted. Always use External Secrets Operator or Azure Key Vault CSI Driver for production secrets.

Recommended: Secret Generator from File:

# kustomization.yaml
secretGenerator:
  - name: app-secrets
    type: Opaque
    files:
      - connectionString.txt  # Raw value; Kustomize base64-encodes it
      - apiKey.txt

Create Secret File (Raw Value):

# Write the raw value — do NOT base64-encode it yourself.
# The secretGenerator encodes file contents automatically,
# so pre-encoding would result in double encoding.
echo -n "Server=..." > connectionString.txt

Hash Suffixes for Updates

Hash Suffix Behavior:

# kustomization.yaml
configMapGenerator:
  - name: app-config
    literals:
      - KEY=VALUE
    options:
      disableNameSuffixHash: false  # Default: include hash

Hash Suffix Purpose:

  • With Hash (disableNameSuffixHash: false):
    • ConfigMap name: app-config-abc123
    • Changing content generates new hash: app-config-xyz789
    • Forces Pod restart when ConfigMap changes (rolling update)

  • Without Hash (disableNameSuffixHash: true):
    • ConfigMap name: app-config
    • Changing content updates same ConfigMap
    • Running Pods do not restart; mounted data is eventually refreshed, but environment variables stay stale

ATP Recommendation: Use hash suffixes (disableNameSuffixHash: false) to ensure Pods restart when ConfigMaps change.
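
Deployments can keep referencing the generated ConfigMap by its un-hashed name — Kustomize's name-reference transformer rewrites the reference to the hashed name in the final output. A minimal sketch:

```yaml
# deployment.yaml: reference the generator name as-is
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  template:
    spec:
      containers:
      - name: atp-ingestion
        envFrom:
        - configMapRef:
            name: app-config   # rewritten to app-config-<hash> at build time
```

Because the rewritten name changes whenever the content changes, the Pod template changes too, which is what triggers the rolling update.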


Variable Substitution

Defining Variables in kustomization.yaml

Variable Definition (note: vars is deprecated in recent Kustomize releases — the Replacements section below shows the successor mechanism):

# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

vars:
  - name: SERVICE_NAME
    objref:
      kind: Service
      name: atp-ingestion
    fieldref:
      fieldpath: metadata.name
  - name: SERVICE_PORT
    objref:
      kind: Service
      name: atp-ingestion
    fieldref:
      fieldpath: spec.ports[0].port
  - name: REPLICA_COUNT
    objref:
      kind: Deployment
      name: atp-ingestion
    fieldref:
      fieldpath: spec.replicas

Using Variables in Resources

Using Variables in Deployment:

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: $(SERVICE_NAME)
spec:
  replicas: $(REPLICA_COUNT)
  template:
    spec:
      containers:
      - name: atp-ingestion
        env:
        - name: SERVICE_NAME
          value: $(SERVICE_NAME)
        - name: SERVICE_PORT
          value: "$(SERVICE_PORT)"

Variable Substitution Process:

graph LR
    DEFINE[Define Variables<br/>in kustomization.yaml]
    REF[Reference Resources<br/>via objref]
    EXTRACT[Extract Values<br/>via fieldref]
    SUBSTITUTE[Substitute<br/>$(VAR_NAME)]
    OUTPUT[Final Resource]

    DEFINE --> REF
    REF --> EXTRACT
    EXTRACT --> SUBSTITUTE
    SUBSTITUTE --> OUTPUT

    style DEFINE fill:#FFE5B4
    style OUTPUT fill:#90EE90

Environment-Specific Variables

Environment-Specific Variable Configuration:

# overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - ../../base

vars:
  - name: ENVIRONMENT
    objref:
      kind: ConfigMap
      name: app-config
    fieldref:
      fieldpath: data.ASPNETCORE_ENVIRONMENT

configMapGenerator:
  - name: app-config
    literals:
      - ASPNETCORE_ENVIRONMENT=Production

Replacements

Replacing Values Across Resources

Replacement Configuration:

# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

replacements:
  - source:
      kind: ConfigMap
      name: app-config
      fieldPath: data.database-host
    targets:
      - select:
          kind: Deployment
        fieldPaths:
          - spec.template.spec.containers.[name=atp-ingestion].env.[name=DATABASE_HOST].value

Example: Replace Database Host:

# ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  database-host: "atp-db.database.windows.net"

# Deployment (before replacement)
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
      - name: atp-ingestion
        env:
        - name: DATABASE_HOST
          value: "placeholder"

# Replacement configuration
replacements:
  - source:
      kind: ConfigMap
      name: app-config
      fieldPath: data.database-host
    targets:
      - select:
          kind: Deployment
        fieldPaths:
          - spec.template.spec.containers.[name=atp-ingestion].env.[name=DATABASE_HOST].value

# Deployment (after replacement)
# DATABASE_HOST value becomes: "atp-db.database.windows.net"

Source and Target Configuration

Replacement Source Options (alternatives — each source references exactly one resource field):

replacements:
  # Option 1: Reference a ConfigMap/Secret field
  - source:
      kind: ConfigMap
      name: app-config
      fieldPath: data.key-name

  # Option 2: Reference a field on any other resource
  - source:
      kind: Service
      name: atp-ingestion
      fieldPath: spec.clusterIP

Note: a replacement source must reference a resource field; to inject a constant value, put it in a generated ConfigMap first.

Replacement Target Options:

targets:
  - select:
      # Select resources by kind
      kind: Deployment
      # Optional: name filter
      name: atp-ingestion
      # Optional: label selector
      labelSelector: "app=atp-ingestion"
    fieldPaths:
      # Target field path (supports array selectors)
      - spec.template.spec.containers.[name=atp-ingestion].env.[name=KEY].value
      # Multiple targets
      - metadata.annotations.config-hash
    options:
      create: true  # Create field if missing
      delimiter: "/"  # Split the target value on this delimiter (used with index)

Complex Replacement Patterns

Multiple Replacements:

# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

configMapGenerator:
  - name: app-config
    literals:
      - database-host=atp-db.database.windows.net
      - database-port=5432
      - redis-host=atp-redis.redis.cache.windows.net

replacements:
  # Replace database host
  - source:
      kind: ConfigMap
      name: app-config
      fieldPath: data.database-host
    targets:
      - select:
          kind: Deployment
        fieldPaths:
          - spec.template.spec.containers.[name=atp-ingestion].env.[name=DATABASE_HOST].value

  # Replace database port
  - source:
      kind: ConfigMap
      name: app-config
      fieldPath: data.database-port
    targets:
      - select:
          kind: Deployment
        fieldPaths:
          - spec.template.spec.containers.[name=atp-ingestion].env.[name=DATABASE_PORT].value

  # Replace Redis host
  - source:
      kind: ConfigMap
      name: app-config
      fieldPath: data.redis-host
    targets:
      - select:
          kind: Deployment
        fieldPaths:
          - spec.template.spec.containers.[name=atp-ingestion].env.[name=REDIS_HOST].value

Replacement Limitations:

Replacements copy source values verbatim — Kustomize does not transform the value in flight (for example, it cannot strip an http:// prefix). Store the value in the form the target needs, for example in the source ConfigMap:

replacements:
  - source:
      kind: ConfigMap
      name: app-config
      fieldPath: data.api-host  # Store the bare host here, without a scheme
    targets:
      - select:
          kind: Ingress
        fieldPaths:
          - spec.rules.0.host
        options:
          create: true

Remote Bases

Referencing Remote Kustomizations

Remote Base Configuration:

# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  # Git repository as base
  - git@github.com:ConnectSoft/ATP.git//apps/atp-ingestion/base?ref=v1.2.3

  # HTTPS URL
  - https://github.com/ConnectSoft/ATP.git//apps/atp-ingestion/base?ref=main

Git Repository as Base

Git Base with SSH:

# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - git@github.com:ConnectSoft/ATP.git//apps/atp-ingestion/base?ref=v1.2.3

Git Base with HTTPS:

# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - https://github.com/ConnectSoft/ATP.git//apps/atp-ingestion/base?ref=v1.2.3

HTTPS URLs for Bases

HTTPS Base URL Format:

https://<host>/<org>/<repo>.git//<path>?ref=<branch-or-tag>

Examples:

resources:
  # GitHub
  - https://github.com/ConnectSoft/ATP.git//apps/atp-ingestion/base?ref=main

  # Azure Repos
  - https://dev.azure.com/ConnectSoft/ATP/_git/atp-gitops//apps/atp-ingestion/base?ref=production

  # GitLab
  - https://gitlab.com/ConnectSoft/ATP.git//apps/atp-ingestion/base?ref=main

Version Pinning

Version Pinning Strategies:

# Option 1: Pin to tag (recommended)
resources:
  - git@github.com:ConnectSoft/ATP.git//apps/atp-ingestion/base?ref=v1.2.3

# Option 2: Pin to branch (less stable)
resources:
  - git@github.com:ConnectSoft/ATP.git//apps/atp-ingestion/base?ref=production

# Option 3: Pin to commit SHA (most stable)
resources:
  - git@github.com:ConnectSoft/ATP.git//apps/atp-ingestion/base?ref=abc123def456

ATP Recommendation: Pin to Git tags for stability, update tags during releases.
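
Tag bumps can be done mechanically during a release. A minimal sketch using sed (the file path and tag values are hypothetical):

```shell
#!/bin/bash
# Sketch: bump a pinned remote-base ref during a release.
set -e

# Hypothetical kustomization pinned to the previous release tag
cat > /tmp/kustomization.yaml <<'EOF'
resources:
  - git@github.com:ConnectSoft/ATP.git//apps/atp-ingestion/base?ref=v1.2.2
EOF

# Rewrite the pinned tag in place (GNU sed)
sed -i 's/?ref=v1\.2\.2/?ref=v1.2.3/' /tmp/kustomization.yaml

grep 'ref=' /tmp/kustomization.yaml
```

In a real release pipeline the same sed would run over every overlay that pins the base, followed by a commit and PR into the environment branch.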


Component Composition

Creating Reusable Components

Component Structure:

components/
├── monitoring/
│   ├── kustomization.yaml
│   ├── servicemonitor.yaml
│   └── prometheusrule.yaml
├── networking/
│   ├── kustomization.yaml
│   └── networkpolicy.yaml
└── security/
    ├── kustomization.yaml
    ├── podsecuritypolicy.yaml
    └── rbac.yaml

Component kustomization.yaml:

# components/monitoring/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1alpha1
kind: Component

metadata:
  name: monitoring

resources:
  - servicemonitor.yaml
  - prometheusrule.yaml

commonLabels:
  component: monitoring

Including Components in Overlays

Using Components in Overlay:

# overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - ../../base

components:
  - ../../components/monitoring
  - ../../components/networking
  - ../../components/security

patchesStrategicMerge:
  - deployment-patch.yaml

Component Composition Diagram:

graph TB
    BASE[Base<br/>kustomization.yaml]

    subgraph "Components"
        MON[Monitoring<br/>Component]
        NET[Networking<br/>Component]
        SEC[Security<br/>Component]
    end

    OVERLAY[Production Overlay<br/>kustomization.yaml]
    PATCHES[Strategic Merge<br/>Patches]

    BASE --> OVERLAY
    MON --> OVERLAY
    NET --> OVERLAY
    SEC --> OVERLAY
    PATCHES --> OVERLAY

    OVERLAY --> OUTPUT[Final Resources]

    style BASE fill:#FFE5B4
    style OVERLAY fill:#90EE90
    style OUTPUT fill:#87CEEB

Component Library for ATP

ATP Component Library:

components/
├── monitoring/
│   ├── kustomization.yaml
│   ├── servicemonitor.yaml
│   └── prometheusrule.yaml
├── networking/
│   ├── kustomization.yaml
│   ├── networkpolicy.yaml
│   └── ingress-policy.yaml
├── security/
│   ├── kustomization.yaml
│   ├── podsecuritypolicy.yaml
│   └── rbac.yaml
├── autoscaling/
│   ├── kustomization.yaml
│   └── hpa.yaml
└── observability/
    ├── kustomization.yaml
    ├── servicemonitor.yaml
    └── log-forwarding.yaml

Reusable Monitoring Component:

# components/monitoring/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1alpha1
kind: Component

metadata:
  name: monitoring

resources:
  - servicemonitor.yaml

commonLabels:
  component: monitoring

configMapGenerator:
  - name: monitoring-config
    literals:
      - scrape-interval=30s

Monitoring Component Template (the $(name) placeholders must be resolved via variable substitution or replacements before apply):

# components/monitoring/servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: $(name)-servicemonitor
spec:
  selector:
    matchLabels:
      app: $(name)
  endpoints:
  - port: http
    interval: 30s

Transformers

Label Injectors

Common Labels:

# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

commonLabels:
  app: atp-ingestion
  component: ingestion
  environment: production
  managed-by: kustomize
  version: v1.2.3  # Caution: commonLabels also lands in selectors, which are immutable on Deployments

Labels Added to All Resources:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
  labels:
    app: atp-ingestion          # ← Added
    component: ingestion        # ← Added
    environment: production     # ← Added
    managed-by: kustomize       # ← Added
    version: v1.2.3            # ← Added

Namespace Transformers

Namespace Configuration:

# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: atp-production

# All namespaced resources get namespace: atp-production
# (cluster-scoped resources such as ClusterRole are left unchanged)

Name Prefix/Suffix Transformers

Name Prefix/Suffix:

# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namePrefix: prod-  # Prefix: prod-atp-ingestion
nameSuffix: -v1    # Suffix: atp-ingestion-v1

# ATP Recommendation: Use labels/annotations for versioning instead

Image Transformers

Image Transform:

# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

images:
  - name: connectsoft.azurecr.io/atp/ingestion
    newName: connectsoft.azurecr.io/atp/ingestion
    newTag: v1.2.3
  - name: redis
    newName: connectsoft.azurecr.io/atp/redis
    digest: sha256:abc123...  # Use digest for immutability

Image Transform Process:

graph LR
    BASE[Base Resources<br/>image: latest]
    TRANSFORM[Image Transform<br/>newTag: v1.2.3]
    OUTPUT[Output Resources<br/>image: v1.2.3]

    BASE --> TRANSFORM
    TRANSFORM --> OUTPUT

    style BASE fill:#FFE5B4
    style OUTPUT fill:#90EE90

Replica Transformers

Replica Transform:

# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

replicas:
  - name: atp-ingestion
    count: 5
  - name: atp-query
    count: 3

Replica Transform Example:

# Base deployment (replicas: 3)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  replicas: 3

# After replica transform (replicas: 5)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  replicas: 5  # ← Updated

Kustomize with Helm

Combining Helm and Kustomize

Helm + Kustomize Workflow:

graph LR
    HELM[Helm Chart<br/>helm template]
    OUTPUT1[Helm Output<br/>YAML Manifests]
    KUST[Kustomize<br/>kustomize build]
    OUTPUT2[Final Output<br/>Patched Manifests]

    HELM --> OUTPUT1
    OUTPUT1 --> KUST
    KUST --> OUTPUT2

    style HELM fill:#FFE5B4
    style OUTPUT2 fill:#90EE90

helm template → kustomize build

Post-Rendering Helm Output with Kustomize:

# Step 1: Render Helm templates
helm template my-release ./charts/atp-ingestion \
  -f values-production.yaml \
  > /tmp/helm-output.yaml

# Step 2: Use Kustomize to patch Helm output
mkdir -p kustomize-overlay
cat > kustomize-overlay/kustomization.yaml <<EOF
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - /tmp/helm-output.yaml

patchesStrategicMerge:
  - production-patch.yaml
EOF

# Step 3: Build final output
kustomize build kustomize-overlay > final-manifests.yaml

Kustomization for Helm Output:

# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - helm-output.yaml  # Generated from: helm template

patchesStrategicMerge:
  - production-patch.yaml

images:
  - name: connectsoft.azurecr.io/atp/ingestion
    newTag: v1.2.3

Post-Rendering with Kustomize

Helm Post-Renderer Script:

#!/bin/bash
# scripts/helm-post-render-kustomize.sh
# Helm pipes the rendered manifests to this script's stdin and
# expects the final manifests on stdout.

KUSTOMIZE_DIR="${1:-overlays/production}"

# Save the Helm output where the overlay's kustomization.yaml
# lists it as a resource (e.g. resources: [helm-output.yaml])
cat > "${KUSTOMIZE_DIR}/helm-output.yaml"

# Build the patched manifests to stdout
kustomize build "${KUSTOMIZE_DIR}"

Use Post-Renderer in Helm:

# Install with post-renderer
helm install atp-ingestion ./charts/atp-ingestion \
  -f values-production.yaml \
  --post-renderer ./scripts/helm-post-render-kustomize.sh \
  --post-renderer-args "overlays/production"

FluxCD Kustomization CRD

FluxCD-Specific Configuration

FluxCD Kustomization CRD:

# clusters/production/kustomizations/apps-production.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-production
  namespace: flux-system
spec:
  interval: 5m
  path: ./apps/atp-ingestion/overlays/production
  prune: true
  wait: true
  timeout: 5m
  retryInterval: 1m
  sourceRef:
    kind: GitRepository
    name: atp-gitops-production
  dependsOn:
    - name: infrastructure
  healthChecks:
    - apiVersion: apps/v1
      kind: Deployment
      name: atp-ingestion
      namespace: atp-production
  postBuild:
    substitute:
      IMAGE_TAG: v1.2.3
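
Flux's postBuild.substitute performs bash-style variable substitution on the built manifests, so resources can reference the variable directly. A minimal sketch consuming the IMAGE_TAG variable from the example above (file path is hypothetical):

```yaml
# apps/atp-ingestion/overlays/production/deployment-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  template:
    spec:
      containers:
      - name: atp-ingestion
        # Flux replaces ${IMAGE_TAG} after kustomize build
        image: connectsoft.azurecr.io/atp/ingestion:${IMAGE_TAG}
```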

Kustomization CRD Fields

Kustomization CRD Reference:

| Field | Type | Description | Required |
|---|---|---|---|
| interval | duration | Reconciliation interval | ✅ Yes |
| path | string | Path to kustomization.yaml | ✅ Yes |
| prune | boolean | Delete resources not in Git | ⚠️ Optional |
| wait | boolean | Wait for resources to be ready | ⚠️ Optional |
| timeout | duration | Wait timeout | ⚠️ Optional |
| retryInterval | duration | Retry interval on failure | ⚠️ Optional |
| sourceRef | object | GitRepository reference | ✅ Yes |
| dependsOn | array | Dependency Kustomizations | ⚠️ Optional |
| healthChecks | array | Health check resources | ⚠️ Optional |
| postBuild | object | Post-build substitutions | ⚠️ Optional |
| suspend | boolean | Suspend reconciliation | ⚠️ Optional |

Integration with Git

FluxCD Kustomization with Git:

# GitRepository (source)
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: atp-gitops-production
  namespace: flux-system
spec:
  interval: 1m
  url: https://dev.azure.com/ConnectSoft/ATP/_git/atp-gitops
  ref:
    branch: production
  secretRef:
    name: gitops-credentials

---
# Kustomization (deployment target)
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-production
  namespace: flux-system
spec:
  interval: 5m
  path: ./apps/atp-ingestion/overlays/production
  sourceRef:
    kind: GitRepository
    name: atp-gitops-production  # ← References GitRepository
  prune: true
  wait: true

FluxCD Integration Diagram:

sequenceDiagram
    participant Git as Git Repository
    participant GitRepo as GitRepository CRD
    participant FluxCD as FluxCD Controller
    participant Kust as Kustomization CRD
    participant K8s as Kubernetes API

    GitRepo->>Git: Poll for changes (1m)
    Git-->>GitRepo: New commit detected
    GitRepo->>FluxCD: Trigger reconciliation
    FluxCD->>Git: Fetch kustomization.yaml
    Git-->>FluxCD: Return kustomization
    FluxCD->>Git: Fetch base + overlays
    Git-->>FluxCD: Return resources
    FluxCD->>Kust: Build kustomize (kustomize build)
    Kust->>K8s: Apply resources
    K8s-->>FluxCD: Resources applied

Testing Kustomize Configurations

kustomize build for Validation

Validate Kustomize Build:

# Build and validate
kustomize build apps/atp-ingestion/overlays/production

# Validate syntax
kustomize build apps/atp-ingestion/overlays/production > /dev/null && echo "✅ Valid"

# Validate with kubeval
kustomize build apps/atp-ingestion/overlays/production | kubeval --strict

# Validate with kube-score
kustomize build apps/atp-ingestion/overlays/production | kube-score score -

Validation Script:

#!/bin/bash
# scripts/validate-kustomize.sh

OVERLAY_PATH="${1}"

echo "🔍 Validating Kustomize: ${OVERLAY_PATH}"

# Build
echo "1. Building kustomization..."
kustomize build "${OVERLAY_PATH}" > /tmp/kustomize-output.yaml || exit 1

# Validate Kubernetes syntax
echo "2. Validating Kubernetes syntax..."
kubeval /tmp/kustomize-output.yaml --strict || exit 1

# Score manifests
echo "3. Scoring manifests..."
kube-score score /tmp/kustomize-output.yaml || exit 1

echo "✅ Kustomize validation passed"

Diff Validation Against Expected Output

Diff Validation:

#!/bin/bash
# scripts/validate-kustomize-diff.sh

OVERLAY_PATH="${1}"
EXPECTED_OUTPUT="${2}"

echo "🔍 Validating Kustomize output against expected..."

# Build current output
kustomize build "${OVERLAY_PATH}" > /tmp/current.yaml

# Compare with expected
if diff -u "${EXPECTED_OUTPUT}" /tmp/current.yaml; then
  echo "✅ Output matches expected"
else
  echo "❌ Output differs from expected"
  exit 1
fi

Golden File Testing:

# Generate golden file (expected output)
kustomize build apps/atp-ingestion/overlays/production > \
  tests/golden/production-expected.yaml

# Validate against golden file
kustomize build apps/atp-ingestion/overlays/production > /tmp/actual.yaml
diff tests/golden/production-expected.yaml /tmp/actual.yaml

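The golden-file comparison can be wrapped in a small helper so each overlay reports a single PASS/FAIL line. A sketch under assumptions: `check_golden` is a hypothetical helper name, and the demo files stand in for real `kustomize build` output:

```shell
#!/bin/bash
# Compare a rendered manifest against its golden file; print per-overlay status.
check_golden() {
  local name="$1" golden="$2" actual="$3"
  if diff -q "$golden" "$actual" >/dev/null; then
    echo "PASS ${name}"
  else
    echo "FAIL ${name}"
    diff -u "$golden" "$actual" || true
    return 1
  fi
}

# Demo with temporary files standing in for kustomize build output
tmp=$(mktemp -d)
printf 'replicas: 3\n' > "${tmp}/golden.yaml"
printf 'replicas: 3\n' > "${tmp}/actual.yaml"
check_golden production "${tmp}/golden.yaml" "${tmp}/actual.yaml"
```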
CI Pipeline Integration

Azure Pipeline for Kustomize Testing:

# azure-pipelines-kustomize-test.yml
trigger:
  branches:
    include:
    - main
  paths:
    include:
    - apps/**/*

pool:
  vmImage: 'ubuntu-latest'

steps:
- task: Bash@3
  displayName: 'Install kustomize'
  inputs:
    targetType: 'inline'
    script: |
      curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" | bash
      sudo mv kustomize /usr/local/bin/

- task: Bash@3
  displayName: 'Install kubeval and kube-score'
  inputs:
    targetType: 'inline'
    script: |
      wget https://github.com/instrumenta/kubeval/releases/latest/download/kubeval-linux-amd64.tar.gz
      tar xf kubeval-linux-amd64.tar.gz
      sudo mv kubeval /usr/local/bin/

      wget https://github.com/zegl/kube-score/releases/download/v1.17.0/kube-score_1.17.0_linux_amd64.tar.gz
      tar xf kube-score_1.17.0_linux_amd64.tar.gz
      sudo mv kube-score /usr/local/bin/

- task: Bash@3
  displayName: 'Validate Kustomize Configurations'
  inputs:
    targetType: 'inline'
    script: |
      for overlay in apps/*/overlays/*/; do
        APP_NAME=$(basename "$(dirname "$(dirname "$overlay")")")
        OVERLAY_NAME=$(basename "$overlay")
        OUTPUT="/tmp/${APP_NAME}-${OVERLAY_NAME}.yaml"
        echo "Validating overlay: ${APP_NAME}/${OVERLAY_NAME}"

        # Build
        kustomize build "$overlay" > "${OUTPUT}" || exit 1

        # Validate
        kubeval "${OUTPUT}" --strict || exit 1
        kube-score score "${OUTPUT}" || exit 1
      done

Summary: Kustomize Advanced Patterns

  • Kustomize Architecture: Base, overlays, components structure, kustomization file structure, resource selection, transformation order
  • Strategic Merge Patches: How strategic merge works, merge semantics (replace, add, delete), array merging strategies, common patterns
  • JSON Patches: JSON patch operations (add, replace, remove), path targeting, when to use JSON patches vs strategic merge
  • ConfigMap and Secret Generators: Generating ConfigMaps from literals and files, generating secrets (encrypted), hash suffixes for updates
  • Variable Substitution: Defining variables in kustomization.yaml, using variables in resources, environment-specific variables
  • Replacements: Replacing values across resources, source and target configuration, complex replacement patterns
  • Remote Bases: Referencing remote kustomizations, Git repository as base, HTTPS URLs for bases, version pinning
  • Component Composition: Creating reusable components, including components in overlays, component library for ATP
  • Transformers: Label injectors, namespace transformers, name prefix/suffix transformers, image transformers, replica transformers
  • Kustomize with Helm: Combining Helm and Kustomize, helm template → kustomize build, post-rendering with Kustomize
  • FluxCD Kustomization CRD: FluxCD-specific configuration, Kustomization CRD fields reference, integration with Git
  • Testing Kustomize Configurations: kustomize build for validation, diff validation against expected output, CI pipeline integration

Multi-Tenancy in GitOps

Purpose: Define multi-tenancy strategies, tenant isolation mechanisms, tenant-specific configurations, automated onboarding/offboarding procedures, and compliance controls for ATP's GitOps deployments, ensuring complete tenant isolation, secure resource management, and adherence to data residency and regulatory requirements (GDPR, HIPAA, SOC 2).


Tenant Isolation Strategies

Namespace per Tenant (ATP Approach)

Multi-Tenant Architecture with Namespace Isolation:

graph TB
    subgraph "Production AKS Cluster"
        subgraph "Tenant A Namespace"
            NS_A[Namespace: atp-tenant-a]
            DEPLOY_A[Deployments<br/>atp-ingestion<br/>atp-query<br/>atp-gateway]
            SVC_A[Services<br/>ClusterIP]
            DB_A[(Database Schema<br/>tenant_a)]
            SECRETS_A[Secrets<br/>tenant-a-secrets]
        end
        subgraph "Tenant B Namespace"
            NS_B[Namespace: atp-tenant-b]
            DEPLOY_B[Deployments<br/>atp-ingestion<br/>atp-query<br/>atp-gateway]
            SVC_B[Services<br/>ClusterIP]
            DB_B[(Database Schema<br/>tenant_b)]
            SECRETS_B[Secrets<br/>tenant-b-secrets]
        end
        subgraph "Tenant C Namespace"
            NS_C[Namespace: atp-tenant-c]
            DEPLOY_C[Deployments<br/>atp-ingestion<br/>atp-query<br/>atp-gateway]
            SVC_C[Services<br/>ClusterIP]
            DB_C[(Database Schema<br/>tenant_c)]
            SECRETS_C[Secrets<br/>tenant-c-secrets]
        end
        subgraph "Platform Services"
            MON[Monitoring<br/>Shared]
            INGRESS[Ingress Controller<br/>Shared]
        end
    end

    NS_A --> MON
    NS_B --> MON
    NS_C --> MON
    INGRESS --> SVC_A
    INGRESS --> SVC_B
    INGRESS --> SVC_C

    style NS_A fill:#90EE90
    style NS_B fill:#FFE5B4
    style NS_C fill:#87CEEB
    style MON fill:#DDA0DD

Namespace per Tenant Benefits:

| Aspect | Benefit | ATP Justification |
|--------|---------|--------------------|
| Resource Isolation | ✅ Complete resource isolation | Prevents resource contention |
| Network Isolation | ✅ Network policies per namespace | Ensures tenant data isolation |
| RBAC Isolation | ✅ Per-namespace RBAC | Tenant-specific access control |
| Quota Management | ✅ Resource quotas per namespace | Cost control per tenant |
| Compliance | ✅ Isolated audit logs | GDPR/HIPAA compliance |
| Data Residency | ✅ Deploy to specific regions | EU/US data residency requirements |

ATP Decision: Namespace per tenant - Complete isolation, best security, compliance-friendly

Cluster per Tenant (Not Used in ATP)

Cluster per Tenant Comparison:

| Aspect | Cluster per Tenant | Namespace per Tenant | ATP Decision |
|--------|--------------------|----------------------|--------------|
| Isolation | ✅ Maximum isolation | ⚠️ Good isolation | Namespace (sufficient) |
| Cost | ❌ Very high (separate clusters) | ✅ Low (shared cluster) | Namespace (cost-effective) |
| Management | ❌ Complex (many clusters) | ✅ Simple (one cluster) | Namespace (operational simplicity) |
| Resource Utilization | ❌ Poor (underutilized clusters) | ✅ Good (shared resources) | Namespace (efficiency) |
| Compliance | ✅ Maximum compliance | ✅ Good compliance | Namespace (sufficient) |

ATP Rationale: Cluster per tenant is overkill for ATP's requirements. Namespace isolation provides sufficient security and compliance while maintaining cost efficiency.

Shared Namespace with Labels:

# ❌ NOT RECOMMENDED: Shared namespace with labels
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
  namespace: atp-production
  labels:
    tenant: tenant-a  # Label-based separation
spec:
  # ...

Why Not Recommended:

| Issue | Impact | ATP Decision |
|-------|--------|--------------|
| No Resource Isolation | ❌ Resource contention between tenants | ❌ Not acceptable |
| Network Policy Complexity | ❌ Complex label selectors | ❌ Error-prone |
| RBAC Complexity | ❌ Difficult to enforce tenant boundaries | ❌ Security risk |
| Audit Trail | ❌ Harder to isolate tenant activities | ❌ Compliance issue |

ATP Decision: Not used - Insufficient isolation for audit trail platform requirements


Tenant-Specific Configurations

Resource Limits per Tenant

Resource Quota per Tenant:

# tenants/tenant-a/resources/resource-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-a-quota
  namespace: atp-tenant-a
  labels:
    tenant: tenant-a
    managed-by: kustomize
spec:
  hard:
    requests.cpu: "8"      # 8 CPU cores
    requests.memory: 16Gi  # 16Gi memory
    limits.cpu: "16"       # 16 CPU cores
    limits.memory: 32Gi    # 32Gi memory
    persistentvolumeclaims: "5"
    pods: "20"
    services: "10"
    configmaps: "20"
    secrets: "10"

Tenant Resource Limits Matrix:

| Tenant Tier | CPU Requests | Memory Requests | CPU Limits | Memory Limits | Max Pods | Monthly Cost (Est.) |
|-------------|--------------|-----------------|------------|---------------|----------|---------------------|
| Basic | 2 cores | 4Gi | 4 cores | 8Gi | 10 | $500 |
| Standard | 8 cores | 16Gi | 16 cores | 32Gi | 20 | $2,000 |
| Premium | 32 cores | 64Gi | 64 cores | 128Gi | 50 | $8,000 |
| Enterprise | 128 cores | 256Gi | 256 cores | 512Gi | 200 | $32,000 |

Data Residency Requirements (EU vs US)

Data Residency Configuration:

# tenants/tenant-eu/resources/namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: atp-tenant-eu
  labels:
    tenant: tenant-eu
    data-residency: "eu-west"  # EU data residency
    compliance: "gdpr"
    region: "westeurope"
    managed-by: kustomize
  annotations:
    data-residency-policy: "EU-only"
    compliance-requirements: "GDPR"

Regional Deployment Strategy:

| Region | Tenants | Compliance | AKS Cluster |
|--------|---------|------------|-------------|
| East US | US-based tenants | US regulations | atp-prod-eus-aks |
| West Europe | EU-based tenants | GDPR | atp-prod-weu-aks |

Tenant Region Assignment:

# tenants/tenant-eu/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - resources/namespace.yaml
  - resources/resource-quota.yaml
  - resources/network-policy.yaml

commonLabels:
  tenant: tenant-eu
  region: "westeurope"  # EU region
  data-residency: "eu-west"

Compliance Controls (GDPR, HIPAA)

Compliance Labels and Annotations:

# tenants/tenant-hipaa/resources/namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: atp-tenant-hipaa
  labels:
    tenant: tenant-hipaa
    compliance: "hipaa"
    data-classification: "phi"  # Protected Health Information
    encryption-required: "true"
  annotations:
    compliance-policy: "HIPAA"
    encryption-at-rest: "required"
    encryption-in-transit: "required"
    audit-logging: "required"
    data-retention: "6-years"

Compliance Configuration Matrix:

| Compliance Type | Labels | Annotations | Requirements |
|-----------------|--------|-------------|--------------|
| GDPR | compliance: gdpr, data-residency: eu-west | data-residency-policy, right-to-be-forgotten: true | EU data residency, data deletion on request |
| HIPAA | compliance: hipaa, data-classification: phi | encryption-at-rest: required, audit-logging: required | Encryption, audit logs, 6-year retention |
| SOC 2 | compliance: soc2 | audit-logging: required, access-control: required | Audit logs, access controls |

Custom Ingestion Rules

Tenant-Specific Ingestion Configuration:

# tenants/tenant-a/config/configmap-ingestion.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tenant-a-ingestion-config
  namespace: atp-tenant-a
data:
  ingestion-rules.yaml: |
    rules:
      - eventType: "audit"
        batchSize: 100
        batchTimeout: "30s"
        maxRetries: 3
      - eventType: "compliance"
        batchSize: 50
        batchTimeout: "60s"
        maxRetries: 5
    rateLimits:
      requestsPerSecond: 1000
      burstSize: 2000
    retention:
      audit: "7-years"
      compliance: "10-years"

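A pre-commit sanity check can catch a malformed ingestion-rules file before FluxCD picks it up. A sketch only: `check_ingestion_rules` is a hypothetical helper doing a simple grep for the expected top-level keys, not a full YAML parse:

```shell
#!/bin/bash
# Check that an ingestion-rules file defines the expected top-level keys.
check_ingestion_rules() {
  local file="$1" key missing=0
  for key in rules rateLimits retention; do
    grep -q "^${key}:" "$file" || { echo "missing key: ${key}"; missing=1; }
  done
  [ "$missing" -eq 0 ] && echo "ingestion rules OK"
}

# Demo against a minimal sample file
tmp=$(mktemp)
printf 'rules:\n  - eventType: audit\nrateLimits:\n  requestsPerSecond: 1000\nretention:\n  audit: 7-years\n' > "$tmp"
check_ingestion_rules "$tmp"
```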
GitOps Structure for Tenants

/tenants/{tenant-id}/ Directory Structure

Tenant Directory Structure:

atp-gitops/
├── tenants/
│   ├── tenant-a/
│   │   ├── kustomization.yaml
│   │   ├── resources/
│   │   │   ├── namespace.yaml
│   │   │   ├── resource-quota.yaml
│   │   │   ├── network-policy.yaml
│   │   │   ├── rbac.yaml
│   │   │   └── serviceaccount.yaml
│   │   ├── apps/
│   │   │   ├── ingestion/
│   │   │   │   ├── kustomization.yaml
│   │   │   │   └── deployment.yaml
│   │   │   ├── query/
│   │   │   │   ├── kustomization.yaml
│   │   │   │   └── deployment.yaml
│   │   │   └── gateway/
│   │   │       ├── kustomization.yaml
│   │   │       └── deployment.yaml
│   │   ├── config/
│   │   │   ├── configmap-ingestion.yaml
│   │   │   └── configmap-query.yaml
│   │   └── values/
│   │       ├── values-tenant-a.yaml
│   │       └── values-production.yaml
│   ├── tenant-b/
│   │   └── ...
│   └── tenant-eu/
│       └── ...

Tenant Namespace Manifest

Tenant Namespace:

# tenants/tenant-a/resources/namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: atp-tenant-a
  labels:
    tenant: tenant-a
    environment: production
    managed-by: kustomize
    compliance: "soc2"
    data-residency: "us-east"
    region: "eastus"
  annotations:
    description: "ATP Tenant A - Production Environment"
    created-by: "tenant-onboarding"
    created-at: "2024-01-15T10:00:00Z"
    owner: "tenant-a-admin@example.com"
    tier: "standard"
    cost-center: "sales"

Tenant Resource Quota

Resource Quota Configuration:

# tenants/tenant-a/resources/resource-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-a-quota
  namespace: atp-tenant-a
  labels:
    tenant: tenant-a
spec:
  hard:
    # CPU and Memory
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi

    # Storage
    persistentvolumeclaims: "5"
    requests.storage: 100Gi

    # Pod and Service Limits
    pods: "20"
    services: "10"

    # Object Counts
    configmaps: "20"
    secrets: "10"
    services.nodeports: "0"
    services.loadbalancers: "2"

Tier-Based Quota Templates:

# tenants/_templates/resource-quota-basic.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ${TENANT_ID}-quota
  namespace: atp-${TENANT_ID}
spec:
  hard:
    requests.cpu: "2"
    requests.memory: 4Gi
    limits.cpu: "4"
    limits.memory: 8Gi
    pods: "10"

# tenants/_templates/resource-quota-enterprise.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ${TENANT_ID}-quota
  namespace: atp-${TENANT_ID}
spec:
  hard:
    requests.cpu: "128"
    requests.memory: 256Gi
    limits.cpu: "256"
    limits.memory: 512Gi
    pods: "200"

Tenant Network Policy

Tenant Network Policy (Complete Isolation):

# tenants/tenant-a/resources/network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: tenant-a-isolation
  namespace: atp-tenant-a
spec:
  podSelector: {}  # Apply to all pods
  policyTypes:
  - Ingress
  - Egress

  ingress:
  # Allow from ingress controller (shared platform)
  - from:
    - namespaceSelector:
        matchLabels:
          name: ingress-nginx
      podSelector:  # Combined with namespaceSelector: ingress-nginx pods in that namespace
        matchLabels:
          app: ingress-nginx
  # Allow from monitoring namespace (shared)
  - from:
    - namespaceSelector:
        matchLabels:
          name: monitoring
  # Deny all other ingress (including other tenant namespaces)

  egress:
  # Allow DNS
  - to:
    - namespaceSelector:
        matchLabels:
          name: kube-system
    ports:
    - protocol: UDP
      port: 53
  # Allow to monitoring
  - to:
    - namespaceSelector:
        matchLabels:
          name: monitoring
  # Allow to database (external)
  - to:
    - ipBlock:
        cidr: 10.0.0.0/16  # Database subnet
    ports:
    - protocol: TCP
      port: 5432
  # Deny egress to other tenant namespaces

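The policy can be spot-checked by probing a neighbouring tenant's service from a throwaway pod; the request should time out. A sketch under assumptions: `probe` and the `DRY_RUN` switch are illustrative, and the target service URL is hypothetical:

```shell
#!/bin/bash
# Probe a target URL from a short-lived pod in the given namespace.
# DRY_RUN=1 prints the kubectl command instead of running it.
probe() {
  local from_ns="$1" target="$2" expect="$3"
  cmd=(kubectl run np-probe --rm -i --restart=Never -n "$from_ns" \
       --image=busybox -- wget -qO- --timeout=3 "$target")
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "[expect ${expect}] ${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}

# Cross-tenant traffic should be blocked; platform DNS/monitoring should work
DRY_RUN=1 probe atp-tenant-a http://atp-gateway.atp-tenant-b/healthz blocked
```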
Network Policy Isolation Diagram:

graph TB
    subgraph "Tenant A Namespace"
        POD_A1[Pod A1]
        POD_A2[Pod A2]
        NP_A[Network Policy<br/>Deny cross-tenant]
    end
    subgraph "Tenant B Namespace"
        POD_B1[Pod B1]
        POD_B2[Pod B2]
        NP_B[Network Policy<br/>Deny cross-tenant]
    end
    subgraph "Platform Namespaces"
        INGRESS[Ingress Controller]
        MON[Monitoring]
        DNS[Kube DNS]
    end

    INGRESS -->|Allowed| POD_A1
    INGRESS -->|Allowed| POD_B1
    POD_A1 -.->|Blocked| POD_B1
    POD_B1 -.->|Blocked| POD_A1
    POD_A1 -->|Allowed| DNS
    POD_B1 -->|Allowed| DNS
    POD_A1 -->|Allowed| MON
    POD_B1 -->|Allowed| MON

    style NP_A fill:#FF6B6B
    style NP_B fill:#FF6B6B
    style POD_A1 fill:#90EE90
    style POD_B1 fill:#FFE5B4

Tenant RBAC

Tenant-Specific RBAC:

# tenants/tenant-a/resources/rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: tenant-a-sa
  namespace: atp-tenant-a
  labels:
    tenant: tenant-a
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: tenant-a-role
  namespace: atp-tenant-a
rules:
  # Allow read/write to tenant namespace resources
  - apiGroups: [""]
    resources: ["configmaps", "secrets", "pods", "services"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]
  - apiGroups: ["apps"]
    resources: ["deployments", "statefulsets"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]
  # Deny access to other namespaces (implicitly denied)
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: tenant-a-rolebinding
  namespace: atp-tenant-a
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: tenant-a-role
subjects:
  - kind: ServiceAccount
    name: tenant-a-sa
    namespace: atp-tenant-a
  - kind: User
    name: tenant-a-admin@example.com
    apiGroup: rbac.authorization.k8s.io

Tenant Admin RBAC (Limited Cluster Role):

# tenants/tenant-a/resources/cluster-role-binding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: tenant-a-admin
rules:
  # Read-only access to cluster resources
  - apiGroups: [""]
    resources: ["namespaces"]
    resourceNames: ["atp-tenant-a"]  # Only tenant namespace
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: tenant-a-admin-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: tenant-a-admin
subjects:
  - kind: User
    name: tenant-a-admin@example.com
    apiGroup: rbac.authorization.k8s.io

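Tenant RBAC boundaries can be verified with `kubectl auth can-i` impersonation checks: the tenant admin should be allowed inside their own namespace and denied elsewhere. A sketch; the `rbac_check` helper and `DRY_RUN` switch are illustrative:

```shell
#!/bin/bash
# Build the kubectl auth can-i invocations used to spot-check tenant RBAC.
# DRY_RUN=1 prints the commands instead of executing them.
rbac_check() {
  local user="$1" ns="$2" verb resource
  for verb in get create; do
    for resource in deployments secrets; do
      cmd=(kubectl auth can-i "$verb" "$resource" --as="$user" -n "$ns")
      if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "${cmd[*]}"
      else
        "${cmd[@]}"
      fi
    done
  done
}

# Expect "yes" in atp-tenant-a, "no" in any other tenant namespace
DRY_RUN=1 rbac_check tenant-a-admin@example.com atp-tenant-a
```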
Dynamic Tenant Provisioning

Onboarding Script

Tenant Onboarding Automation Script:

#!/bin/bash
# scripts/onboard-tenant.sh

TENANT_ID="${1}"
TENANT_TIER="${2:-standard}"  # basic, standard, premium, enterprise
REGION="${3:-eastus}"  # eastus, westeurope
COMPLIANCE="${4:-soc2}"  # soc2, gdpr, hipaa
OWNER_EMAIL="${5}"

if [ -z "${TENANT_ID}" ] || [ -z "${OWNER_EMAIL}" ]; then
  echo "Usage: $0 <tenant-id> [tier] [region] [compliance] <owner-email>"
  echo "Example: $0 tenant-a standard eastus soc2 admin@tenant-a.example.com"
  exit 1
fi

TENANT_DIR="tenants/${TENANT_ID}"
NAMESPACE="atp-${TENANT_ID}"

echo "🏢 Onboarding tenant: ${TENANT_ID}"
echo "   Tier: ${TENANT_TIER}"
echo "   Region: ${REGION}"
echo "   Compliance: ${COMPLIANCE}"
echo "   Owner: ${OWNER_EMAIL}"

# Step 1: Create tenant directory structure
echo "📁 Step 1: Creating tenant directory structure..."
mkdir -p "${TENANT_DIR}/resources"
mkdir -p "${TENANT_DIR}/apps/ingestion"
mkdir -p "${TENANT_DIR}/apps/query"
mkdir -p "${TENANT_DIR}/apps/gateway"
mkdir -p "${TENANT_DIR}/config"
mkdir -p "${TENANT_DIR}/values"

# Step 2: Generate namespace
echo "📦 Step 2: Generating namespace..."
cat > "${TENANT_DIR}/resources/namespace.yaml" <<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: ${NAMESPACE}
  labels:
    tenant: ${TENANT_ID}
    tier: ${TENANT_TIER}
    environment: production
    region: ${REGION}
    compliance: ${COMPLIANCE}
    managed-by: kustomize
  annotations:
    description: "ATP Tenant ${TENANT_ID} - Production Environment"
    created-by: "tenant-onboarding"
    created-at: "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
    owner: "${OWNER_EMAIL}"
    tier: "${TENANT_TIER}"
EOF

# Step 3: Generate resource quota based on tier
echo "📊 Step 3: Generating resource quota..."
./scripts/generate-tenant-quota.sh "${TENANT_ID}" "${TENANT_TIER}" > "${TENANT_DIR}/resources/resource-quota.yaml"

# Step 4: Generate network policy
echo "🔒 Step 4: Generating network policy..."
./scripts/generate-tenant-network-policy.sh "${TENANT_ID}" > "${TENANT_DIR}/resources/network-policy.yaml"

# Step 5: Generate RBAC
echo "🔐 Step 5: Generating RBAC..."
./scripts/generate-tenant-rbac.sh "${TENANT_ID}" "${OWNER_EMAIL}" > "${TENANT_DIR}/resources/rbac.yaml"

# Step 6: Generate application manifests
echo "🚀 Step 6: Generating application manifests..."
./scripts/generate-tenant-apps.sh "${TENANT_ID}" "${TENANT_TIER}" "${REGION}"

# Step 7: Generate kustomization.yaml
echo "📝 Step 7: Generating kustomization.yaml..."
cat > "${TENANT_DIR}/kustomization.yaml" <<EOF
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: ${NAMESPACE}

resources:
  - resources/namespace.yaml
  - resources/resource-quota.yaml
  - resources/network-policy.yaml
  - resources/rbac.yaml
  - apps/ingestion
  - apps/query
  - apps/gateway

commonLabels:
  tenant: ${TENANT_ID}
  tier: ${TENANT_TIER}
  region: ${REGION}
  compliance: ${COMPLIANCE}
EOF

echo "✅ Tenant directory structure created: ${TENANT_DIR}"
echo ""
echo "📋 Next steps:"
echo "1. Review generated manifests: ${TENANT_DIR}/"
echo "2. Commit to Git: git add ${TENANT_DIR} && git commit -m 'feat: onboard tenant ${TENANT_ID}'"
echo "3. Push to repository: git push origin main"
echo "4. FluxCD will automatically reconcile and deploy tenant resources"

Automated Manifest Generation

Generate Tenant Quota Script:

#!/bin/bash
# scripts/generate-tenant-quota.sh

TENANT_ID="${1}"
TIER="${2:-standard}"

case "${TIER}" in
  "basic")
    CPU_REQ="2"
    MEM_REQ="4Gi"
    CPU_LIM="4"
    MEM_LIM="8Gi"
    PODS="10"
    ;;
  "standard")
    CPU_REQ="8"
    MEM_REQ="16Gi"
    CPU_LIM="16"
    MEM_LIM="32Gi"
    PODS="20"
    ;;
  "premium")
    CPU_REQ="32"
    MEM_REQ="64Gi"
    CPU_LIM="64"
    MEM_LIM="128Gi"
    PODS="50"
    ;;
  "enterprise")
    CPU_REQ="128"
    MEM_REQ="256Gi"
    CPU_LIM="256"
    MEM_LIM="512Gi"
    PODS="200"
    ;;
  *)
    echo "❌ Unknown tier: ${TIER}" >&2
    exit 1
    ;;
esac

cat <<EOF
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ${TENANT_ID}-quota
  namespace: atp-${TENANT_ID}
spec:
  hard:
    requests.cpu: "${CPU_REQ}"
    requests.memory: ${MEM_REQ}
    limits.cpu: "${CPU_LIM}"
    limits.memory: ${MEM_LIM}
    pods: "${PODS}"
    persistentvolumeclaims: "5"
    services: "10"
EOF

Generate Tenant Network Policy Script:

#!/bin/bash
# scripts/generate-tenant-network-policy.sh

TENANT_ID="${1}"
NAMESPACE="atp-${TENANT_ID}"

cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ${TENANT_ID}-isolation
  namespace: ${NAMESPACE}
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress

  ingress:
  # Allow from ingress controller
  - from:
    - namespaceSelector:
        matchLabels:
          name: ingress-nginx
  # Allow from monitoring
  - from:
    - namespaceSelector:
        matchLabels:
          name: monitoring

  egress:
  # Allow DNS
  - to:
    - namespaceSelector:
        matchLabels:
          name: kube-system
    ports:
    - protocol: UDP
      port: 53
  # Allow to monitoring
  - to:
    - namespaceSelector:
        matchLabels:
          name: monitoring
  # Allow to database (external)
  - to:
    - ipBlock:
        cidr: 10.0.0.0/16
    ports:
    - protocol: TCP
      port: 5432
EOF

Generate Tenant Apps Script:

#!/bin/bash
# scripts/generate-tenant-apps.sh

TENANT_ID="${1}"
TIER="${2}"
REGION="${3}"

TENANT_DIR="tenants/${TENANT_ID}"

# Generate ingestion app kustomization
cat > "${TENANT_DIR}/apps/ingestion/kustomization.yaml" <<EOF
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - ../../../../apps/atp-ingestion/base

patches:
  - path: deployment-patch.yaml

images:
  - name: connectsoft.azurecr.io/atp/ingestion
    newTag: v1.2.3

namespace: atp-${TENANT_ID}

commonLabels:
  tenant: ${TENANT_ID}
EOF

# Generate deployment patch
cat > "${TENANT_DIR}/apps/ingestion/deployment-patch.yaml" <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  replicas: 3
  template:
    spec:
      containers:
      - name: atp-ingestion
        env:
        - name: TENANT_ID
          value: "${TENANT_ID}"
        - name: REGION
          value: "${REGION}"
EOF

echo "✅ Application manifests generated"

Git Commit for New Tenant

Commit Tenant Configuration:

#!/bin/bash
# scripts/commit-tenant-config.sh

TENANT_ID="${1}"

if [ -z "${TENANT_ID}" ]; then
  echo "Usage: $0 <tenant-id>"
  exit 1
fi

TENANT_DIR="tenants/${TENANT_ID}"

echo "📝 Committing tenant configuration: ${TENANT_ID}"

# Add tenant directory
git add "${TENANT_DIR}"

# Commit with conventional commit format
git commit -m "feat(tenant): onboard tenant ${TENANT_ID}

- Add namespace: atp-${TENANT_ID}
- Add resource quota and network policy
- Add tenant-specific RBAC
- Add application deployments

Signed-off-by: $(git config user.name) <$(git config user.email)>" \
  --gpg-sign

# Push to repository
git push origin main

echo "✅ Tenant configuration committed and pushed"
echo "⏳ FluxCD will reconcile and deploy tenant resources automatically"

FluxCD Applies Tenant Resources

FluxCD Kustomization for Tenant:

# clusters/production/kustomizations/tenants/tenant-a.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: tenant-a
  namespace: flux-system
  labels:
    tenant: tenant-a
spec:
  interval: 5m
  path: ./tenants/tenant-a
  prune: true
  wait: true
  timeout: 10m
  sourceRef:
    kind: GitRepository
    name: atp-gitops-production
  dependsOn:
    - name: infrastructure
  healthChecks:
    - apiVersion: apps/v1
      kind: Deployment
      name: atp-ingestion
      namespace: atp-tenant-a
    - apiVersion: apps/v1
      kind: Deployment
      name: atp-query
      namespace: atp-tenant-a
    - apiVersion: apps/v1
      kind: Deployment
      name: atp-gateway
      namespace: atp-tenant-a

Auto-Create FluxCD Kustomization for Tenant:

#!/bin/bash
# scripts/create-tenant-fluxcd-kustomization.sh

TENANT_ID="${1}"
KUST_FILE="clusters/production/kustomizations/tenants/${TENANT_ID}.yaml"

cat > "${KUST_FILE}" <<EOF
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: ${TENANT_ID}
  namespace: flux-system
  labels:
    tenant: ${TENANT_ID}
spec:
  interval: 5m
  path: ./tenants/${TENANT_ID}
  prune: true
  wait: true
  timeout: 10m
  sourceRef:
    kind: GitRepository
    name: atp-gitops-production
  dependsOn:
    - name: infrastructure
EOF

kubectl apply -f "${KUST_FILE}"

echo "✅ FluxCD Kustomization created for tenant: ${TENANT_ID}"

Tenant Onboarding Automation

Step-by-Step Onboarding Process

Tenant Onboarding Workflow:

sequenceDiagram
    participant Admin as Administrator
    participant Script as Onboarding Script
    participant Git as Git Repository
    participant FluxCD as FluxCD
    participant K8s as Kubernetes

    Admin->>Script: Execute onboard-tenant.sh
    Script->>Script: Create directory structure
    Script->>Script: Generate namespace
    Script->>Script: Generate resource quota
    Script->>Script: Generate network policy
    Script->>Script: Generate RBAC
    Script->>Script: Generate app manifests
    Script->>Git: Commit tenant config
    FluxCD->>Git: Poll for changes
    FluxCD->>Git: Fetch tenant manifests
    FluxCD->>K8s: Apply namespace
    FluxCD->>K8s: Apply resource quota
    FluxCD->>K8s: Apply network policy
    FluxCD->>K8s: Apply RBAC
    FluxCD->>K8s: Deploy applications
    K8s-->>FluxCD: Resources ready
    FluxCD-->>Admin: Tenant onboarded

Complete Onboarding Automation:

#!/bin/bash
# scripts/onboard-tenant-complete.sh

TENANT_ID="${1}"
TENANT_TIER="${2:-standard}"
REGION="${3:-eastus}"
COMPLIANCE="${4:-soc2}"
OWNER_EMAIL="${5}"

if [ -z "${TENANT_ID}" ] || [ -z "${OWNER_EMAIL}" ]; then
  echo "Usage: $0 <tenant-id> [tier] [region] [compliance] <owner-email>"
  exit 1
fi

echo "🏢 Complete Tenant Onboarding: ${TENANT_ID}"
echo ""

# Step 1: Create tenant directory
echo "📁 Step 1: Creating tenant directory..."
./scripts/onboard-tenant.sh "${TENANT_ID}" "${TENANT_TIER}" "${REGION}" "${COMPLIANCE}" "${OWNER_EMAIL}" || exit 1

# Step 2: Validate manifests
echo "🔍 Step 2: Validating manifests..."
./scripts/validate-kustomize.sh "tenants/${TENANT_ID}" || exit 1

# Step 3: Commit to Git
echo "📝 Step 3: Committing to Git..."
./scripts/commit-tenant-config.sh "${TENANT_ID}" || exit 1

# Step 4: Create FluxCD Kustomization
echo "⚙️  Step 4: Creating FluxCD Kustomization..."
./scripts/create-tenant-fluxcd-kustomization.sh "${TENANT_ID}" || exit 1

# Step 5: Wait for FluxCD reconciliation
echo "⏳ Step 5: Waiting for FluxCD reconciliation..."
flux reconcile kustomization "${TENANT_ID}" --with-source -n flux-system

# Step 6: Verify tenant resources
echo "✅ Step 6: Verifying tenant resources..."
./scripts/verify-tenant-onboarding.sh "${TENANT_ID}" || exit 1

echo ""
echo "🎉 Tenant onboarding complete: ${TENANT_ID}"

Verify Tenant Onboarding:

#!/bin/bash
# scripts/verify-tenant-onboarding.sh

TENANT_ID="${1}"
NAMESPACE="atp-${TENANT_ID}"

echo "🔍 Verifying tenant onboarding: ${TENANT_ID}"

# Check namespace exists
if ! kubectl get namespace "${NAMESPACE}" >/dev/null 2>&1; then
  echo "❌ Namespace ${NAMESPACE} does not exist"
  exit 1
fi

# Check resource quota
if ! kubectl get resourcequota -n "${NAMESPACE}" >/dev/null 2>&1; then
  echo "❌ Resource quota not found"
  exit 1
fi

# Check network policy
if ! kubectl get networkpolicy -n "${NAMESPACE}" >/dev/null 2>&1; then
  echo "❌ Network policy not found"
  exit 1
fi

# Check deployments are ready
DEPLOYMENTS=("atp-ingestion" "atp-query" "atp-gateway")

for DEPLOYMENT in "${DEPLOYMENTS[@]}"; do
  if ! kubectl wait --for=condition=available --timeout=5m deployment/"${DEPLOYMENT}" -n "${NAMESPACE}"; then
    echo "❌ Deployment ${DEPLOYMENT} not ready"
    exit 1
  fi
done

echo "✅ Tenant onboarding verified: ${TENANT_ID}"

Tenant-Specific Helm Values

values-tenant-{id}.yaml

Tenant-Specific Helm Values:

# tenants/tenant-a/values/values-tenant-a.yaml
# Tenant-specific Helm values for tenant-a

replicaCount: 3

image:
  tag: v1.2.3

resources:
  limits:
    cpu: 4000m
    memory: 4Gi
  requests:
    cpu: 1000m
    memory: 2Gi

autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 10

env:
  - name: TENANT_ID
    value: "tenant-a"
  - name: TENANT_TIER
    value: "standard"
  - name: REGION
    value: "eastus"

ingress:
  enabled: true
  hosts:
    - host: tenant-a.atp.connectsoft.example
      paths:
        - path: /

database:
  host: atp-db-tenant-a.database.windows.net
  name: atp_tenant_a

redis:
  host: atp-redis-tenant-a.redis.cache.windows.net

featureFlags:
  enableAdvancedQuerying: true
  enableRealTimeEvents: true
  enableComplianceReports: true

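When these values are rendered with Helm, file order matters: later `-f` files win on conflicting keys, so environment overrides should be passed after the tenant defaults. A sketch that only prints the layered invocation; the chart path and release naming are assumptions, not part of the repository layout above:

```shell
#!/bin/bash
# Print the helm template command with values layered tenant → environment.
# Chart path (charts/atp-ingestion) and release name are illustrative.
render_tenant() {
  local tenant="$1" env="$2"
  echo helm template "atp-${tenant}" charts/atp-ingestion \
    -f "tenants/${tenant}/values/values-${tenant}.yaml" \
    -f "tenants/${tenant}/values/values-${env}.yaml"
}

render_tenant tenant-a production
```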
Override Replicas, Resources, Endpoints

Environment-Specific Tenant Values:

# tenants/tenant-a/values/values-production.yaml
# Production-specific overrides for tenant-a

replicaCount: 5

resources:
  limits:
    cpu: 8000m
    memory: 8Gi
  requests:
    cpu: 2000m
    memory: 4Gi

autoscaling:
  minReplicas: 5
  maxReplicas: 20

service:
  type: LoadBalancer

ingress:
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
  tls:
    - secretName: tenant-a-tls
      hosts:
        - tenant-a.atp.connectsoft.example

Tenant-Specific Feature Flags

Feature Flags Configuration:

# tenants/tenant-a/config/configmap-feature-flags.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tenant-a-feature-flags
  namespace: atp-tenant-a
data:
  feature-flags.json: |
    {
      "enableAdvancedQuerying": true,
      "enableRealTimeEvents": true,
      "enableComplianceReports": true,
      "enableDataExport": false,
      "enableCustomDashboards": true,
      "maxRetentionDays": 2555,
      "auditLogLevel": "Detailed"
    }

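Because the flags are embedded as a JSON string inside a ConfigMap, a syntax error only surfaces at application runtime. A small pre-commit check, sketched with python3's stdlib `json.tool` so no extra tooling is assumed:

```shell
#!/bin/bash
# Validate a feature-flags payload before committing the ConfigMap.
flags='{"enableAdvancedQuerying": true, "enableDataExport": false, "maxRetentionDays": 2555}'

if printf '%s' "$flags" | python3 -m json.tool >/dev/null 2>&1; then
  echo "feature flags: valid JSON"
else
  echo "feature flags: INVALID JSON"
  exit 1
fi
```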
Multi-Tenancy and FluxCD

Per-Tenant Kustomization

FluxCD Kustomization Per Tenant:

# clusters/production/kustomizations/tenants/tenant-a.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: tenant-a
  namespace: flux-system
  labels:
    tenant: tenant-a
    type: tenant
spec:
  interval: 5m
  path: ./tenants/tenant-a
  prune: true
  wait: true
  timeout: 10m
  retryInterval: 2m
  sourceRef:
    kind: GitRepository
    name: atp-gitops-production
  dependsOn:
    - name: infrastructure
  healthChecks:
    - apiVersion: apps/v1
      kind: Deployment
      name: atp-ingestion
      namespace: atp-tenant-a
  kustomizeFlags:
    - --load-restrictor=LoadRestrictionsNone

List All Tenant Kustomizations:

kubectl get kustomizations -n flux-system -l type=tenant

# Output:
# NAME       READY   STATUS        AGE
# tenant-a   True    Applied       5d
# tenant-b   True    Applied       10d
# tenant-eu  True    Applied       2d

Tenant Isolation in Reconciliation

Isolation Benefits:

| Aspect | Isolation Benefit | ATP Implementation |
|---|---|---|
| Reconciliation Failure | ✅ One tenant failure doesn't affect others | Separate Kustomization per tenant |
| Resource Conflicts | ✅ Namespace isolation prevents conflicts | Namespace per tenant |
| Network Isolation | ✅ Network policies prevent cross-tenant traffic | Tenant-specific network policies |
| RBAC Isolation | ✅ Tenant admin can only access their namespace | Per-namespace RBAC |

FluxCD Reconciliation Isolation:

graph TB
    subgraph "FluxCD Controller"
        RECONCILE[Reconciliation Loop]
    end
    subgraph "Git Repository"
        TENANT_A[tenants/tenant-a/]
        TENANT_B[tenants/tenant-b/]
        TENANT_C[tenants/tenant-c/]
    end
    subgraph "Kubernetes Cluster"
        KUST_A[Kustomization: tenant-a]
        KUST_B[Kustomization: tenant-b]
        KUST_C[Kustomization: tenant-c]
        NS_A[Namespace: atp-tenant-a]
        NS_B[Namespace: atp-tenant-b]
        NS_C[Namespace: atp-tenant-c]
    end

    RECONCILE --> TENANT_A
    RECONCILE --> TENANT_B
    RECONCILE --> TENANT_C

    TENANT_A --> KUST_A
    TENANT_B --> KUST_B
    TENANT_C --> KUST_C

    KUST_A --> NS_A
    KUST_B --> NS_B
    KUST_C --> NS_C

    style KUST_A fill:#90EE90
    style KUST_B fill:#FFE5B4
    style KUST_C fill:#87CEEB

Failure Isolation (One Tenant Doesn't Affect Others)

Failure Isolation Example:

# Tenant A reconciliation fails; list all tenant Kustomizations to compare
kubectl get kustomizations -n flux-system -l type=tenant
# Output:
# NAME       READY   STATUS        AGE
# tenant-a   False   Failed        5d
# tenant-b   True    Applied       10d  ← Still working
# tenant-eu  True    Applied       2d   ← Still working

Independent Reconciliation:

  • ✅ Tenant A failure does not affect Tenant B or Tenant C
  • ✅ Each tenant Kustomization reconciles independently
  • ✅ Namespace isolation prevents resource conflicts
  • ✅ Network policies prevent cross-tenant traffic

Tenant Offboarding

Data Deletion Procedures

GDPR Right to be Forgotten:

#!/bin/bash
# scripts/offboard-tenant.sh

TENANT_ID="${1}"
REASON="${2:-tenant-request}"
GDPR_REQUEST="${3:-false}"  # true if GDPR right-to-be-forgotten

if [ -z "${TENANT_ID}" ]; then
  echo "Usage: $0 <tenant-id> [reason] [gdpr-request]"
  echo "Example: $0 tenant-a tenant-request true"
  exit 1
fi

NAMESPACE="atp-${TENANT_ID}"
TENANT_DIR="tenants/${TENANT_ID}"

echo "🗑️  Offboarding tenant: ${TENANT_ID}"
echo "   Reason: ${REASON}"
echo "   GDPR Request: ${GDPR_REQUEST}"

# Step 1: Export tenant data (if GDPR request)
if [ "${GDPR_REQUEST}" = "true" ]; then
  echo "📦 Step 1: Exporting tenant data for GDPR compliance..."
  ./scripts/export-tenant-data.sh "${TENANT_ID}" || exit 1
fi

# Step 2: Delete database data
echo "🗄️  Step 2: Deleting database data..."
./scripts/delete-tenant-database.sh "${TENANT_ID}" || exit 1

# Step 3: Delete Azure resources
echo "☁️  Step 3: Deleting Azure resources..."
./scripts/delete-tenant-azure-resources.sh "${TENANT_ID}" || exit 1

# Step 4: Delete Kubernetes namespace (deletes all resources)
echo "📦 Step 4: Deleting Kubernetes namespace..."
kubectl delete namespace "${NAMESPACE}" --wait=true --timeout=10m || true

# Step 5: Remove tenant configuration from Git
echo "📝 Step 5: Removing tenant configuration from Git..."
git rm -r "${TENANT_DIR}" || true
# Also remove the FluxCD Kustomization manifest; otherwise Flux recreates the
# Kustomization deleted in Step 6 on the next reconciliation
git rm "clusters/production/kustomizations/tenants/${TENANT_ID}.yaml" || true
git commit -m "feat(tenant): offboard tenant ${TENANT_ID}

- Remove tenant namespace and resources
- Delete tenant data
- Reason: ${REASON}
- GDPR Request: ${GDPR_REQUEST}

Signed-off-by: $(git config user.name) <$(git config user.email)>" \
  --gpg-sign

git push origin main

# Step 6: Delete FluxCD Kustomization
echo "⚙️  Step 6: Deleting FluxCD Kustomization..."
kubectl delete kustomization "${TENANT_ID}" -n flux-system || true

echo "✅ Tenant offboarding complete: ${TENANT_ID}"

Namespace Cleanup

Namespace Cleanup Script:

#!/bin/bash
# scripts/cleanup-tenant-namespace.sh

TENANT_ID="${1}"
NAMESPACE="atp-${TENANT_ID}"

echo "🧹 Cleaning up namespace: ${NAMESPACE}"

# Delete all resources in namespace
kubectl delete all --all -n "${NAMESPACE}" --wait=true --timeout=5m || true

# Delete PVCs
kubectl delete pvc --all -n "${NAMESPACE}" --wait=true --timeout=5m || true

# Delete secrets and configmaps
kubectl delete secrets --all -n "${NAMESPACE}" || true
kubectl delete configmaps --all -n "${NAMESPACE}" || true

# Delete network policies
kubectl delete networkpolicies --all -n "${NAMESPACE}" || true

# Delete namespace
kubectl delete namespace "${NAMESPACE}" --wait=true --timeout=5m || true

echo "✅ Namespace cleanup complete"

Git Commit to Remove Tenant

Remove Tenant from Git:

#!/bin/bash
# scripts/remove-tenant-from-git.sh

TENANT_ID="${1}"
REASON="${2}"

TENANT_DIR="tenants/${TENANT_ID}"

echo "📝 Removing tenant from Git: ${TENANT_ID}"

# Remove tenant directory
git rm -r "${TENANT_DIR}" || true

# Remove FluxCD Kustomization
git rm "clusters/production/kustomizations/tenants/${TENANT_ID}.yaml" || true

# Commit removal
git commit -m "feat(tenant): remove tenant ${TENANT_ID}

- Remove tenant namespace configuration
- Remove tenant FluxCD Kustomization
- Reason: ${REASON}

Signed-off-by: $(git config user.name) <$(git config user.email)>" \
  --gpg-sign

# Push to repository
git push origin main

echo "✅ Tenant removed from Git"

Compliance with GDPR (Right to be Forgotten)

GDPR Data Deletion Procedure:

#!/bin/bash
# scripts/gdpr-data-deletion.sh

TENANT_ID="${1}"

echo "🔒 GDPR Data Deletion Request: ${TENANT_ID}"

# Step 1: Export data (for audit trail)
echo "📦 Step 1: Exporting data for audit trail..."
./scripts/export-tenant-data.sh "${TENANT_ID}" \
  --output "exports/tenant-${TENANT_ID}-$(date +%Y%m%d).json"

# Step 2: Verify export
if [ ! -f "exports/tenant-${TENANT_ID}-$(date +%Y%m%d).json" ]; then
  echo "❌ Data export failed"
  exit 1
fi

# Step 3: Delete database records
echo "🗄️  Step 3: Deleting database records..."
./scripts/delete-tenant-database.sh "${TENANT_ID}" --confirm || exit 1

# Step 4: Delete blob storage
echo "💾 Step 4: Deleting blob storage..."
./scripts/delete-tenant-blob-storage.sh "${TENANT_ID}" || exit 1

# Step 5: Delete logs
echo "📋 Step 5: Deleting logs..."
./scripts/delete-tenant-logs.sh "${TENANT_ID}" || exit 1

# Step 6: Offboard tenant
echo "🗑️  Step 6: Offboarding tenant..."
./scripts/offboard-tenant.sh "${TENANT_ID}" "gdpr-request" "true" || exit 1

# Step 7: Generate deletion certificate
echo "📜 Step 7: Generating deletion certificate..."
cat > "certificates/gdpr-deletion-${TENANT_ID}-$(date +%Y%m%d).md" <<EOF
# GDPR Data Deletion Certificate

**Tenant ID**: ${TENANT_ID}
**Date**: $(date -u +%Y-%m-%dT%H:%M:%SZ)
**Request Type**: Right to be Forgotten (GDPR Article 17)

## Data Deletion Summary

- ✅ Database records deleted
- ✅ Blob storage deleted
- ✅ Logs deleted
- ✅ Kubernetes resources deleted
- ✅ Git configuration removed

## Export Location

- Data exported to: exports/tenant-${TENANT_ID}-$(date +%Y%m%d).json
- Retention: 7 years (legal requirement)

## Verification

All data related to tenant ${TENANT_ID} has been permanently deleted
from ATP systems in compliance with GDPR Article 17.
EOF

echo "✅ GDPR data deletion complete"
echo "📜 Deletion certificate: certificates/gdpr-deletion-${TENANT_ID}-$(date +%Y%m%d).md"

Tenant Cost Allocation

Namespace-Level Resource Tagging

Resource Tagging for Cost Allocation:

# tenants/tenant-a/resources/namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: atp-tenant-a
  labels:
    tenant: tenant-a
    cost-center: "sales"
    business-unit: "enterprise"
    project: "audit-trail-platform"
    environment: "production"
    tier: "standard"
  annotations:
    cost-allocation: "tenant-a"
    billing-account: "account-12345"
    owner: "tenant-a-admin@example.com"

Azure Resource Tagging:

# Tag Azure resources for tenant
az aks update \
  --resource-group atp-production-rg \
  --name atp-prod-eus-aks \
  --tags \
    Tenant=tenant-a \
    CostCenter=sales \
    Environment=production

Cost Reporting per Tenant

Cost Reporting Script:

#!/bin/bash
# scripts/tenant-cost-report.sh

TENANT_ID="${1}"
START_DATE="${2:-$(date -d '30 days ago' +%Y-%m-%d)}"
END_DATE="${3:-$(date +%Y-%m-%d)}"

echo "💰 Cost Report for Tenant: ${TENANT_ID}"
echo "   Period: ${START_DATE} to ${END_DATE}"
echo ""

# Query Azure Cost Management API
az consumption usage list \
  --start-date "${START_DATE}" \
  --end-date "${END_DATE}" \
  --query "[?tags.Tenant=='${TENANT_ID}'].{Instance:instanceName, Cost:pretaxCost}" \
  --output table

# Calculate total cost
TOTAL_COST=$(az consumption usage list \
  --start-date "${START_DATE}" \
  --end-date "${END_DATE}" \
  --query "[?tags.Tenant=='${TENANT_ID}'].pretaxCost" \
  --output tsv | \
  awk '{sum+=$1} END {print sum}')

echo ""
echo "Total Cost: \$${TOTAL_COST}"
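The `az consumption usage list` call above returns one row per metered resource; the awk pipeline filters rows by the `Tenant` tag and sums `pretaxCost`. The same aggregation, sketched in Python over illustrative records (instance names and costs are made up):

```python
# Illustrative usage records shaped like the fields the script queries
usage = [
    {"instanceName": "aks-userpool-vm0", "pretaxCost": 120.50, "tags": {"Tenant": "tenant-a"}},
    {"instanceName": "atp-db-tenant-a",  "pretaxCost": 310.00, "tags": {"Tenant": "tenant-a"}},
    {"instanceName": "atp-db-tenant-b",  "pretaxCost": 280.00, "tags": {"Tenant": "tenant-b"}},
    {"instanceName": "shared-lb",        "pretaxCost": 45.00,  "tags": {}},  # untagged: excluded
]

def tenant_cost(records, tenant_id: str) -> float:
    """Sum pretax cost of all records tagged with the given tenant."""
    return round(sum(r["pretaxCost"] for r in records
                     if r["tags"].get("Tenant") == tenant_id), 2)

print(tenant_cost(usage, "tenant-a"))  # 430.5
print(tenant_cost(usage, "tenant-b"))  # 280.0
```

Untagged shared resources fall through this filter, which is why consistent tagging (previous section) is a precondition for accurate per-tenant reporting.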

Chargeback/Showback Models

Chargeback Model Configuration:

# tenants/tenant-a/config/cost-allocation.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tenant-a-cost-allocation
  namespace: atp-tenant-a
data:
  chargeback-model: "showback"  # or "chargeback"
  billing-frequency: "monthly"
  cost-center: "sales"
  business-unit: "enterprise"
  billing-contact: "finance@example.com"

  cost-breakdown.yaml: |
    resources:
      compute:
        cpu-requests: 0.05  # $0.05 per CPU-hour
        memory-requests: 0.01  # $0.01 per GiB-hour
      storage:
        persistent-volumes: 0.10  # $0.10 per GiB-month
      network:
        egress: 0.09  # $0.09 per GB
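With the unit rates from `cost-breakdown.yaml`, a tenant's monthly showback amount is metered usage multiplied by rate, summed across resource types. A sketch using those rates (the usage figures in the example call are illustrative):

```python
# Unit rates from cost-breakdown.yaml
RATES = {
    "cpu_hour": 0.05,         # $ per CPU-hour (cpu-requests)
    "memory_gib_hour": 0.01,  # $ per GiB-hour (memory-requests)
    "pv_gib_month": 0.10,     # $ per GiB-month (persistent-volumes)
    "egress_gb": 0.09,        # $ per GB egress
}

def monthly_charge(cpu_hours: float, mem_gib_hours: float,
                   pv_gib: float, egress_gb: float) -> float:
    """Showback amount for one month of metered usage."""
    return round(cpu_hours * RATES["cpu_hour"]
                 + mem_gib_hours * RATES["memory_gib_hour"]
                 + pv_gib * RATES["pv_gib_month"]
                 + egress_gb * RATES["egress_gb"], 2)

# e.g. 3 pods x 1 CPU / 2 GiB requests for ~730 h, 500 GiB of PVs, 200 GB egress
print(monthly_charge(cpu_hours=3 * 730, mem_gib_hours=6 * 730,
                     pv_gib=500, egress_gb=200))  # 221.3
```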

Cost Allocation Diagram:

graph TB
    subgraph "Cluster Costs"
        CLUSTER[AKS Cluster<br/>$10,000/month]
    end
    subgraph "Tenant Costs"
        TENANT_A[Tenant A<br/>$2,000/month<br/>20%]
        TENANT_B[Tenant B<br/>$5,000/month<br/>50%]
        TENANT_C[Tenant C<br/>$3,000/month<br/>30%]
    end

    CLUSTER --> TENANT_A
    CLUSTER --> TENANT_B
    CLUSTER --> TENANT_C

    style TENANT_A fill:#90EE90
    style TENANT_B fill:#FFE5B4
    style TENANT_C fill:#87CEEB

Compliance Per Tenant

SOC 2, GDPR, HIPAA Configurations

Compliance Configuration per Tenant:

# tenants/tenant-a/resources/compliance-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tenant-a-compliance
  namespace: atp-tenant-a
data:
  compliance-type: "soc2"
  audit-logging: "enabled"
  data-retention-years: "7"
  encryption-at-rest: "required"
  encryption-in-transit: "required"

  soc2-controls.yaml: |
    controls:
      - id: "CC6.1"
        name: "Logical and Physical Access Controls"
        status: "implemented"
      - id: "CC7.2"
        name: "System Operations"
        status: "implemented"

GDPR Tenant Configuration:

# tenants/tenant-eu/resources/compliance-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tenant-eu-compliance
  namespace: atp-tenant-eu
data:
  compliance-type: "gdpr"
  data-residency: "eu-west"
  right-to-be-forgotten: "enabled"
  data-export: "enabled"
  audit-logging: "enabled"
  data-retention-years: "7"

  gdpr-requirements.yaml: |
    requirements:
      - article: "17"
        name: "Right to Erasure"
        implementation: "automated-deletion"
      - article: "20"
        name: "Data Portability"
        implementation: "data-export-api"
      - article: "30"
        name: "Records of Processing"
        implementation: "audit-logs"

HIPAA Tenant Configuration:

# tenants/tenant-hipaa/resources/compliance-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tenant-hipaa-compliance
  namespace: atp-tenant-hipaa
data:
  compliance-type: "hipaa"
  data-classification: "phi"
  encryption-at-rest: "required"
  encryption-in-transit: "required"
  audit-logging: "required"
  access-control: "required"
  data-retention-years: "6"

  hipaa-requirements.yaml: |
    requirements:
      - section: "164.312(a)(1)"
        name: "Access Control"
        implementation: "rbac"
      - section: "164.312(e)(1)"
        name: "Transmission Security"
        implementation: "tls-encryption"
      - section: "164.312(c)(1)"
        name: "Integrity"
        implementation: "audit-logs"

Tenant-Specific Audit Logs

Audit Log Configuration:

# tenants/tenant-a/resources/audit-policy.yaml
apiVersion: audit.k8s.io/v1
kind: Policy
metadata:
  name: tenant-a-audit
rules:
  # Audit all API requests in tenant namespace
  - level: Metadata
    namespaces: ["atp-tenant-a"]
    verbs: ["*"]
    resources:
      - group: "*"
        resources: ["*"]

  # Audit secret access
  - level: RequestResponse
    namespaces: ["atp-tenant-a"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]
    resources:
      - group: ""
        resources: ["secrets"]

Audit Log Query for Tenant:

// Log Analytics: Query tenant-specific audit logs
AuditLogs
| where Namespace == "atp-tenant-a"
| where TimeGenerated >= ago(7d)
| summarize 
    EventCount = count(),
    UniqueUsers = dcount(UserIdentity),
    UniqueResources = dcount(ResourceName)
    by Tenant = Namespace, bin(TimeGenerated, 1d)
| render timechart

Data Residency Enforcement

Data Residency Policy:

# tenants/tenant-eu/resources/data-residency-policy.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tenant-eu-data-residency
  namespace: atp-tenant-eu
data:
  policy.yaml: |
    data-residency:
      region: "westeurope"
      allowed-regions:
        - "westeurope"
        - "northeurope"
      prohibited-regions:
        - "eastus"
        - "westus"
      enforcement:
        database: "required"
        storage: "required"
        backups: "required"
        logs: "required"

Enforce Data Residency with Azure Policy:

# Azure Policy: Enforce EU data residency
apiVersion: templates.azure.com/v1beta1
kind: PolicyTemplate
metadata:
  name: enforce-eu-data-residency
properties:
  displayName: "Enforce EU Data Residency for Tenant EU"
  description: "Ensure all resources for tenant-eu are deployed in EU regions"
  policyRule:
    if:
      allOf:
      - field: "Microsoft.Resources/subscriptions/resourceGroups/tags['tenant']"
        equals: "tenant-eu"
      - not:
          field: "location"
          in: ["westeurope", "northeurope"]
    then:
      effect: "deny"
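The policy rule above denies any resource tagged `tenant: tenant-eu` whose location falls outside the allowed EU regions. Its allOf/not/in decision logic can be sketched as (the resource dicts are illustrative):

```python
ALLOWED_REGIONS = {"westeurope", "northeurope"}

def residency_effect(resource: dict) -> str:
    """Mirror the policy rule: deny tenant-eu resources outside EU regions."""
    is_tenant_eu = resource.get("tags", {}).get("tenant") == "tenant-eu"
    in_allowed_region = resource.get("location") in ALLOWED_REGIONS
    if is_tenant_eu and not in_allowed_region:   # allOf: [tag match, not(in allowed)]
        return "deny"
    return "allow"

print(residency_effect({"location": "westeurope", "tags": {"tenant": "tenant-eu"}}))  # allow
print(residency_effect({"location": "eastus",     "tags": {"tenant": "tenant-eu"}}))  # deny
print(residency_effect({"location": "eastus",     "tags": {"tenant": "tenant-a"}}))   # allow
```

Note the rule only constrains resources carrying the `tenant-eu` tag; other tenants' resources pass through untouched.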

Retention Policies

Retention Policy Configuration:

# tenants/tenant-a/config/retention-policy.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tenant-a-retention-policy
  namespace: atp-tenant-a
data:
  retention-policy.yaml: |
    policies:
      audit-logs:
        retention-days: 2555  # 7 years
        archive-after-days: 365
        archive-location: "az://atp-audit-archive/tenant-a"
      compliance-data:
        retention-days: 3650  # 10 years
        archive-after-days: 730
        archive-location: "az://atp-compliance-archive/tenant-a"
      operational-logs:
        retention-days: 90
        archive-after-days: 30
        archive-location: "az://atp-ops-archive/tenant-a"
    deletion:
      automated: true
      grace-period-days: 30
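Given the audit-log policy above, a record's age determines whether it stays in hot storage, gets archived, or becomes eligible for deletion once the grace period has elapsed. A minimal sketch of that lifecycle decision:

```python
# Mirrors the audit-logs entry of retention-policy.yaml
POLICY = {
    "retention_days": 2555,      # 7 years
    "archive_after_days": 365,
    "grace_period_days": 30,
}

def lifecycle_state(age_days: int, policy: dict = POLICY) -> str:
    """Classify a record by age: hot -> archived -> grace-period -> delete."""
    if age_days >= policy["retention_days"] + policy["grace_period_days"]:
        return "delete"
    if age_days >= policy["retention_days"]:
        return "grace-period"
    if age_days >= policy["archive_after_days"]:
        return "archived"
    return "hot"

print(lifecycle_state(100))   # hot
print(lifecycle_state(400))   # archived
print(lifecycle_state(2560))  # grace-period
print(lifecycle_state(2600))  # delete
```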

Retention Policy by Compliance Type:

| Compliance Type | Retention Period | Rationale |
|---|---|---|
| SOC 2 | 7 years | SOC 2 requirement |
| GDPR | 7 years | Legal/regulatory requirement |
| HIPAA | 6 years | HIPAA requirement |
| General | 1 year | Standard retention |

Summary: Multi-Tenancy in GitOps

  • Tenant Isolation Strategies: Namespace per tenant (ATP approach), cluster per tenant (not used), shared namespace with labels (not recommended)
  • Tenant-Specific Configurations: Resource limits per tenant (tier-based), data residency requirements (EU vs US), compliance controls (GDPR, HIPAA, SOC 2), custom ingestion rules
  • GitOps Structure for Tenants: /tenants/{tenant-id}/ directory structure, tenant namespace manifest, tenant resource quota, tenant network policy, tenant RBAC
  • Dynamic Tenant Provisioning: Onboarding script, automated manifest generation, Git commit for new tenant, FluxCD applies tenant resources
  • Tenant Onboarding Automation: Step-by-step onboarding process (8 steps), complete automation script, verification script
  • Tenant-Specific Helm Values: values-tenant-{id}.yaml, override replicas/resources/endpoints, tenant-specific feature flags
  • Multi-Tenancy and FluxCD: Per-tenant Kustomization, tenant isolation in reconciliation, failure isolation (one tenant doesn't affect others)
  • Tenant Offboarding: Data deletion procedures, namespace cleanup, Git commit to remove tenant, compliance with GDPR (right to be forgotten)
  • Tenant Cost Allocation: Namespace-level resource tagging, cost reporting per tenant, chargeback/showback models
  • Compliance Per Tenant: SOC 2/GDPR/HIPAA configurations, tenant-specific audit logs, data residency enforcement, retention policies

Cost Optimization in GitOps

Purpose: Define cost optimization strategies, resource right-sizing, autoscaling configurations, automated shutdown procedures, Azure Cost Management integration, and FinOps practices for ATP's GitOps deployments, ensuring optimal resource utilization, cost efficiency, and cost transparency across all environments while maintaining performance and reliability requirements.


AKS Cost Optimization

Node Pool Sizing (Right-Sized VMs)

Node Pool Sizing Strategy:

| Environment | Node Pool Type | VM SKU | Node Count (Min/Max) | Monthly Cost (Est.) | Use Case |
|---|---|---|---|---|---|
| Production | System | Standard_D4s_v3 | 3/10 | $1,500 | System pods, monitoring |
| Production | User | Standard_D8s_v3 | 5/20 | $6,000 | Application workloads |
| Staging | User | Standard_D4s_v3 | 2/8 | $1,200 | Staging workloads |
| Test | User | Standard_D2s_v3 | 1/4 | $300 | Test workloads |
| Dev | User | Standard_D2s_v3 | 1/4 | $300 | Development workloads |

Pulumi C# Node Pool Configuration:

// infrastructure/NodePools.cs
using Pulumi;
using Pulumi.AzureNative.ContainerService;
using Pulumi.AzureNative.ContainerService.Inputs;

public class AKSNodePools
{
    public static ManagedClusterAgentPoolProfileArgs CreateProductionSystemPool()
    {
        return new ManagedClusterAgentPoolProfileArgs
        {
            Name = "systempool",
            Count = 3,
            VmSize = "Standard_D4s_v3",  // 4 vCPUs, 16 GiB
            OsType = "Linux",
            OsDiskSizeGB = 128,
            Mode = AgentPoolMode.System,
            EnableAutoScaling = true,
            MinCount = 3,
            MaxCount = 10,
            MaxPods = 30,
            EnableNodePublicIP = false,
            ScaleSetPriority = ScaleSetPriority.Regular,
            ScaleSetEvictionPolicy = ScaleSetEvictionPolicy.Delete,
            Tags = new InputMap<string>
            {
                { "Environment", "production" },
                { "NodePoolType", "system" },
                { "CostCenter", "infrastructure" }
            }
        };
    }

    public static ManagedClusterAgentPoolProfileArgs CreateProductionUserPool()
    {
        return new ManagedClusterAgentPoolProfileArgs
        {
            Name = "userpool",
            Count = 5,
            VmSize = "Standard_D8s_v3",  // 8 vCPUs, 32 GiB
            OsType = "Linux",
            OsDiskSizeGB = 256,
            Mode = AgentPoolMode.User,
            EnableAutoScaling = true,
            MinCount = 5,
            MaxCount = 20,
            MaxPods = 50,
            EnableNodePublicIP = false,
            Tags = new InputMap<string>
            {
                { "Environment", "production" },
                { "NodePoolType", "user" },
                { "CostCenter", "applications" }
            }
        };
    }

    public static ManagedClusterAgentPoolProfileArgs CreateDevSpotPool()
    {
        return new ManagedClusterAgentPoolProfileArgs
        {
            Name = "spotpool",
            Count = 1,
            VmSize = "Standard_D2s_v3",  // 2 vCPUs, 8 GiB
            OsType = "Linux",
            OsDiskSizeGB = 64,
            Mode = AgentPoolMode.User,
            EnableAutoScaling = true,
            MinCount = 0,  // Scale to zero
            MaxCount = 4,
            MaxPods = 30,
            EnableNodePublicIP = false,
            ScaleSetPriority = ScaleSetPriority.Spot,
            ScaleSetEvictionPolicy = ScaleSetEvictionPolicy.Delete,
            SpotMaxPrice = 0.05,  // Max $0.05 per hour (80% discount)
            Tags = new InputMap<string>
            {
                { "Environment", "development" },
                { "NodePoolType", "spot" },
                { "CostCenter", "development" }
            }
        };
    }
}

Spot Instances for Dev/Test

Spot Instance Configuration:

# clusters/production/node-pools/spot-pool.yaml
apiVersion: containerservice.azure.com/v1
kind: ManagedClusterAgentPoolProfile
metadata:
  name: spotpool
spec:
  name: spotpool
  count: 1
  vmSize: Standard_D2s_v3
  osType: Linux
  osDiskSizeGB: 64
  mode: User
  enableAutoScaling: true
  minCount: 0
  maxCount: 4
  scaleSetPriority: Spot
  scaleSetEvictionPolicy: Delete
  spotMaxPrice: 0.05
  nodeLabels:
    workload: non-production
    cost-optimized: "true"
  nodeTaints:
    - key: kubernetes.azure.com/scalesetpriority
      value: spot
      effect: NoSchedule

Pod Tolerations for Spot Nodes:

# apps/atp-ingestion/base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  template:
    spec:
      tolerations:
      - key: kubernetes.azure.com/scalesetpriority
        operator: Equal
        value: spot
        effect: NoSchedule
      nodeSelector:
        workload: non-production
      containers:
      - name: atp-ingestion
        # ...

Spot Instance Cost Savings:

| VM SKU | Regular Price | Spot Price (~80% discount) | Monthly Savings |
|---|---|---|---|
| Standard_D2s_v3 | $0.096/hour | $0.019/hour | ~$55/month |
| Standard_D4s_v3 | $0.192/hour | $0.038/hour | ~$111/month |
| Standard_D8s_v3 | $0.384/hour | $0.077/hour | ~$221/month |
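The savings column is simply the hourly price gap times roughly 730 hours per month; the table's figures round slightly differently than this quick check:

```python
HOURS_PER_MONTH = 730  # average hours in a month

def monthly_savings(regular_per_hour: float, spot_per_hour: float) -> int:
    """Approximate monthly savings of running one node on spot pricing."""
    return round((regular_per_hour - spot_per_hour) * HOURS_PER_MONTH)

print(monthly_savings(0.096, 0.019))  # 56  (Standard_D2s_v3)
print(monthly_savings(0.192, 0.038))  # 112 (Standard_D4s_v3)
print(monthly_savings(0.384, 0.077))  # 224 (Standard_D8s_v3)
```

Remember these savings only apply to workloads that tolerate eviction, which is why the spot pool carries a `NoSchedule` taint and is reserved for non-production workloads.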

Reserved Instances for Production

Reserved Instance Configuration:

#!/bin/bash
# scripts/purchase-reserved-instances.sh

# Purchase 1-year reserved instances for production.
# Note: reservation purchases use the 'reservations' CLI extension
# (az extension add --name reservations) or the Azure portal;
# flags may vary by CLI version.
az reservations reservation-order purchase \
  --reservation-order-id "${RESERVATION_ORDER_ID}" \
  --reserved-resource-type VirtualMachines \
  --instance-flexibility On \
  --billing-scope-id "/subscriptions/${SUBSCRIPTION_ID}" \
  --term P1Y \
  --billing-plan Upfront \
  --quantity 10 \
  --sku Standard_D8s_v3 \
  --location eastus \
  --applied-scope-type Shared \
  --display-name "ATP Production AKS Nodes - D8s_v3"

echo "✅ Reserved instances purchased (up to 72% discount)"

Reserved Instance Cost Savings:

| Commitment | Discount | Monthly Cost (10x D8s_v3) | Savings vs Pay-as-you-go |
|---|---|---|---|
| 1 Year | ~42% | $2,227 | $1,653/month |
| 3 Years | ~72% | $1,282 | $2,598/month |

ATP Recommendation: Use 1-year reserved instances for production user pool nodes.

Cluster Autoscaler Configuration

Cluster Autoscaler Setup:

# clusters/production/kustomizations/cluster-autoscaler.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: cluster-autoscaler
  namespace: flux-system
spec:
  interval: 5m
  path: ./platform/cluster-autoscaler
  sourceRef:
    kind: GitRepository
    name: atp-gitops-production

Cluster Autoscaler Deployment:

# platform/cluster-autoscaler/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      serviceAccountName: cluster-autoscaler
      containers:
      - image: mcr.microsoft.com/oss/kubernetes/autoscaler/cluster-autoscaler:v1.27.3
        name: cluster-autoscaler
        resources:
          limits:
            cpu: 100m
            memory: 600Mi
          requests:
            cpu: 100m
            memory: 600Mi
        command:
        - ./cluster-autoscaler
        - --v=4
        - --stderrthreshold=info
        - --cloud-provider=azure
        - --skip-nodes-with-local-storage=false
        - --expander=least-waste  # Prefer nodes that waste least resources
        - --node-group-auto-discovery=label:cluster-autoscaler-enabled=true
        - --balance-similar-node-groups
        - --scale-down-enabled=true
        - --scale-down-delay-after-add=10m
        - --scale-down-unneeded-time=10m
        - --scale-down-utilization-threshold=0.5
        - --max-node-provision-time=15m
        env:
        - name: ARM_SUBSCRIPTION_ID
          valueFrom:
            secretKeyRef:
              name: cluster-autoscaler-secrets
              key: subscription-id
        - name: ARM_RESOURCE_GROUP
          value: atp-production-rg
        - name: ARM_TENANT_ID
          valueFrom:
            secretKeyRef:
              name: cluster-autoscaler-secrets
              key: tenant-id
        - name: ARM_CLIENT_ID
          valueFrom:
            secretKeyRef:
              name: cluster-autoscaler-secrets
              key: client-id
        - name: ARM_CLIENT_SECRET
          valueFrom:
            secretKeyRef:
              name: cluster-autoscaler-secrets
              key: client-secret

Cluster Autoscaler Cost Optimization Settings:

| Setting | Value | Rationale |
|---|---|---|
| expander | least-waste | Prefer node pools that waste least resources |
| scale-down-delay-after-add | 10m | Wait before scaling down after adding nodes |
| scale-down-unneeded-time | 10m | Node must be unneeded for 10m before removal |
| scale-down-utilization-threshold | 0.5 | Scale down if node utilization < 50% |
| balance-similar-node-groups | true | Balance pods across similar node groups |

Resource Right-Sizing

Analyzing Actual Resource Usage

Resource Usage Analysis Script:

#!/bin/bash
# scripts/analyze-resource-usage.sh

NAMESPACE="${1:-all}"

echo "📊 Resource Usage Analysis"
echo "=========================="

if [ "${NAMESPACE}" = "all" ]; then
  echo "Analyzing all namespaces..."
  # With --all-namespaces the first column is NAMESPACE, so CPU/memory shift to $4/$5
  kubectl top pods --all-namespaces --containers | \
    awk 'NR>1 {cpu+=$4; memory+=$5} END {print "Total CPU: " cpu "m"; print "Total Memory: " memory "Mi"}'
else
  echo "Analyzing namespace: ${NAMESPACE}"
  kubectl top pods -n "${NAMESPACE}" --containers | \
    awk 'NR>1 {cpu+=$3; memory+=$4} END {print "Total CPU: " cpu "m"; print "Total Memory: " memory "Mi"}'
fi

echo ""
echo "Resource Requests vs Limits:"
if [ "${NAMESPACE}" = "all" ]; then
  SCOPE="--all-namespaces"
else
  SCOPE="-n ${NAMESPACE}"
fi
kubectl get pods ${SCOPE} -o json | \
  jq -r '.items[] | "\(.metadata.name): CPU req=\(.spec.containers[0].resources.requests.cpu // "none") limit=\(.spec.containers[0].resources.limits.cpu // "none"), Memory req=\(.spec.containers[0].resources.requests.memory // "none") limit=\(.spec.containers[0].resources.limits.memory // "none")"'

Azure Monitor Metrics Query:

// Log Analytics: Pod resource usage analysis (7-day average per pod)
Perf
| where ObjectName == "K8SContainer"
| where CounterName in ("cpuUsageNanoCores", "memoryWorkingSetBytes")
| where TimeGenerated >= ago(7d)
| summarize 
    AvgValue = avg(CounterValue) by CounterName, Namespace, PodName
| extend CpuUsageCores = case(
    CounterName == "cpuUsageNanoCores", AvgValue / 1000000000,
    0.0
),
MemoryUsageMiB = case(
    CounterName == "memoryWorkingSetBytes", AvgValue / 1024 / 1024,
    0.0
)
| summarize 
    AvgCpuCores = max(CpuUsageCores),
    AvgMemoryMiB = max(MemoryUsageMiB)
    by Namespace, PodName
| order by AvgCpuCores desc

Adjusting Requests and Limits

Resource Right-Sizing Workflow:

graph LR
    COLLECT[Collect Metrics<br/>7 days]
    ANALYZE[Analyze Usage<br/>P95/P99]
    RECOMMEND[Generate<br/>Recommendations]
    UPDATE[Update Manifests<br/>in Git]
    DEPLOY[Deploy via<br/>GitOps]
    MONITOR[Monitor<br/>Performance]

    COLLECT --> ANALYZE
    ANALYZE --> RECOMMEND
    RECOMMEND --> UPDATE
    UPDATE --> DEPLOY
    DEPLOY --> MONITOR
    MONITOR --> COLLECT

    style COLLECT fill:#FFE5B4
    style DEPLOY fill:#90EE90

Resource Right-Sizing Recommendations Script:

#!/bin/bash
# scripts/generate-right-sizing-recommendations.sh

NAMESPACE="${1}"
OUTPUT_FILE="${2:-right-sizing-recommendations.yaml}"

echo "📊 Generating right-sizing recommendations for: ${NAMESPACE}"

# Query Prometheus metrics (7-day average)
PROMETHEUS_URL="http://prometheus-kube-prometheus-prometheus.monitoring:9090"

cat > /tmp/prometheus-queries.txt <<EOF
# Average CPU usage (7 days)
avg_over_time(rate(container_cpu_usage_seconds_total{namespace="${NAMESPACE}"}[5m])[7d:1h])

# Average memory usage (7 days)
avg_over_time(container_memory_working_set_bytes{namespace="${NAMESPACE}"}[7d:1h])

# P95 CPU usage
quantile_over_time(0.95, rate(container_cpu_usage_seconds_total{namespace="${NAMESPACE}"}[5m])[7d:1h])

# P95 Memory usage
quantile_over_time(0.95, container_memory_working_set_bytes{namespace="${NAMESPACE}"}[7d:1h])
EOF

# Generate recommendations (simplified)
cat > "${OUTPUT_FILE}" <<EOF
# Right-Sizing Recommendations for ${NAMESPACE}
# Generated: $(date -u +%Y-%m-%dT%H:%M:%SZ)
# Based on: 7-day average usage

recommendations:
  - deployment: atp-ingestion
    namespace: ${NAMESPACE}
    containers:
      - name: atp-ingestion
        resources:
          requests:
            cpu: "500m"  # Based on P95 usage: 400m
            memory: "1Gi"  # Based on P95 usage: 800Mi
          limits:
            cpu: "2000m"  # 4x requests (burst capacity)
            memory: "2Gi"  # 2x requests
EOF

echo "✅ Recommendations generated: ${OUTPUT_FILE}"

Vertical Pod Autoscaler (VPA)

VPA Installation:

# Install VPA from the kubernetes/autoscaler repository
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh

VPA Configuration:

# apps/atp-ingestion/base/vpa.yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: atp-ingestion-vpa
  namespace: atp-production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: atp-ingestion
  updatePolicy:
    updateMode: "Auto"  # or "Recreate" or "Off"
  resourcePolicy:
    containerPolicies:
    - containerName: atp-ingestion
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: 4000m
        memory: 8Gi
      controlledResources: ["cpu", "memory"]
      controlledValues: RequestsAndLimits

VPA Modes:

| Mode | Description | Use Case |
|---|---|---|
| Auto | Automatically update requests/limits | Dev/Test (with restart) |
| Recreate | Update on pod recreation | Staging |
| Off | Only generate recommendations | Production (manual review) |

ATP Recommendation: Use Off mode in production to generate recommendations, then manually review and apply via GitOps.

Recommendations from Azure Advisor

Azure Advisor Cost Recommendations:

#!/bin/bash
# scripts/get-azure-advisor-cost-recommendations.sh

echo "💰 Azure Advisor Cost Recommendations"
echo "======================================"

# Get cost recommendations
az advisor recommendation list \
  --category Cost \
  --output table

# Get specific right-sizing recommendations
az advisor recommendation list \
  --category Cost \
  --resource-group atp-production-rg \
  --query "[?category=='Cost' && impact=='High'].{Name:shortDescription.problem, Impact:impact, PotentialSavings:extendedProperties.potentialSavings}" \
  --output table

echo ""
echo "📊 Right-sizing recommendations:"
az advisor recommendation list \
  --category Cost \
  --resource-group atp-production-rg \
  --query "[?recommendationTypeId=='b0b0a0a0-0a0a-0a0a-0a0a-0a0a0a0a0a0a'].{CurrentSKU:extendedProperties.currentSku, RecommendedSKU:extendedProperties.recommendedSku, EstimatedSavings:extendedProperties.estimatedMonthlySavings}" \
  --output table

Horizontal Pod Autoscaler (HPA)

CPU-Based Scaling

CPU-Based HPA Configuration:

# apps/atp-ingestion/base/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: atp-ingestion-hpa
  namespace: atp-production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: atp-ingestion
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70  # Scale when CPU > 70%
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # 5 min stabilization
      policies:
      - type: Percent
        value: 50  # Scale down by 50%
        periodSeconds: 60
      - type: Pods
        value: 2
        periodSeconds: 60
      selectPolicy: Min  # Use most conservative policy
    scaleUp:
      stabilizationWindowSeconds: 0  # Scale up immediately
      policies:
      - type: Percent
        value: 100  # Double pods
        periodSeconds: 30
      - type: Pods
        value: 4
        periodSeconds: 30
      selectPolicy: Max  # Use most aggressive policy

Cost-Optimized HPA Settings:

| Setting | Value | Rationale |
| --- | --- | --- |
| averageUtilization | 70% | Allow higher CPU before scaling (cost efficiency) |
| scaleDown.stabilizationWindowSeconds | 300s | Prevent rapid scale-down (cost savings) |
| scaleDown.selectPolicy | Min | Use conservative scale-down (cost savings) |
| scaleUp.selectPolicy | Max | Aggressive scale-up (performance) |
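How the two scaleDown policies combine under selectPolicy: Min can be sketched numerically. This is illustrative only; the starting replica count of 10 is an assumption, the policy values come from the manifest above:

```shell
# How the HPA combines the two scaleDown policies under selectPolicy: Min
# (assumed starting point: 10 current replicas; policy values from the manifest above)
replicas=10
by_percent=$(( replicas * 50 / 100 ))   # Percent policy: remove up to 50% -> 5 pods per period
by_pods=2                               # Pods policy: remove up to 2 pods per period
allowed=$(( by_percent < by_pods ? by_percent : by_pods ))  # Min -> most conservative
echo "scale-down allowed this period: ${allowed} pods"
```

With selectPolicy: Max (as in the scaleUp block), the same numbers would permit the larger change instead.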

Memory-Based Scaling

Memory-Based HPA:

# apps/atp-ingestion/base/hpa-memory.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: atp-ingestion-hpa-memory
  namespace: atp-production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: atp-ingestion
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80  # Scale when memory > 80%
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 600  # 10 min (memory is sticky)
      policies:
      - type: Percent
        value: 25  # Conservative scale-down
        periodSeconds: 120

Custom Metrics with KEDA

KEDA ScaledObject for Cost Optimization:

# apps/atp-ingestion/base/keda-scaledobject.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: atp-ingestion-keda
  namespace: atp-production
spec:
  scaleTargetRef:
    name: atp-ingestion
  minReplicaCount: 3
  maxReplicaCount: 20
  cooldownPeriod: 300  # 5 min cooldown
  idleReplicaCount: 0  # dev overlays only: CPU/memory triggers cannot scale to zero; remove in production
  triggers:
  # CPU-based scaling
  - type: cpu
    metadata:
      type: Utilization
      value: "70"
  # Memory-based scaling
  - type: memory
    metadata:
      type: Utilization
      value: "80"
  # Custom metric: Queue length
  - type: azure-servicebus
    metadata:
      queueName: atp-ingestion-queue
      messageCount: "100"  # Scale when > 100 messages
      connectionFromEnv: SERVICEBUS_CONNECTION_STRING
  # HTTP request rate
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring:9090
      metricName: http_requests_per_second
      threshold: "1000"
      query: sum(rate(http_requests_total{service="atp-ingestion"}[1m]))

KEDA Cost Optimization Settings:

| Setting | Value | Use Case |
| --- | --- | --- |
| cooldownPeriod | 300s | Prevent rapid scaling (cost savings) |
| idleReplicaCount | 0 | Dev environments (scale to zero) |
| minReplicaCount | 3 | Production (always available) |

Scaling Policies for Cost Efficiency

Cost-Efficient Scaling Strategy:

graph TB
    METRICS[Pod Metrics<br/>CPU/Memory]
    HPA[Horizontal Pod Autoscaler]
    DECISION{Scale Decision}

    SCALE_UP[Scale Up<br/>Aggressive]
    SCALE_DOWN[Scale Down<br/>Conservative]

    METRICS --> HPA
    HPA --> DECISION
    DECISION -->|High Load| SCALE_UP
    DECISION -->|Low Load| SCALE_DOWN

    SCALE_UP --> PERFORMANCE[Performance Priority]
    SCALE_DOWN --> COST[Cost Priority]

    style SCALE_UP fill:#90EE90
    style SCALE_DOWN fill:#FFE5B4
    style PERFORMANCE fill:#90EE90
    style COST fill:#FFB6C1

Environment-Specific Scaling Policies:

| Environment | Min Replicas | Max Replicas | Scale-Down Delay | Rationale |
| --- | --- | --- | --- | --- |
| Production | 3 | 50 | 10 min | Performance > Cost |
| Staging | 2 | 20 | 5 min | Balanced |
| Test | 1 | 10 | 3 min | Cost > Performance |
| Dev | 0 | 5 | 1 min | Cost optimization |

Cluster Autoscaler

Adding Nodes Based on Demand

Cluster Autoscaler Behavior:

sequenceDiagram
    participant Pod as Pod (Pending)
    participant CA as Cluster Autoscaler
    participant AKS as AKS Node Pool
    participant VM as New VM Node

    Pod->>CA: Pod cannot be scheduled
    CA->>CA: Check node pool capacity
    CA->>AKS: Scale up node pool
    AKS->>VM: Provision new VM
    VM->>Pod: Pod scheduled on new node
    Pod->>CA: Pod running

Cluster Autoscaler Configuration:

# platform/cluster-autoscaler/configmap.yaml
# Conceptual mapping: on AKS the managed cluster autoscaler is tuned via
# "az aks update --cluster-autoscaler-profile", not a self-managed ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-config
  namespace: kube-system
data:
  config.yaml: |
    nodeGroups:
      - name: userpool
        minSize: 5
        maxSize: 20
        scaleDownDelayAfterAdd: 10m
        scaleDownUnneededTime: 10m
        scaleDownUtilizationThreshold: 0.5
    scaleDownEnabled: true
    maxNodeProvisionTime: 15m
    balanceSimilarNodeGroups: true
    expander: least-waste

Removing Idle Nodes

Scale-Down Conditions:

| Condition | Value | Rationale |
| --- | --- | --- |
| scaleDownDelayAfterAdd | 10m | Wait before removing newly added nodes |
| scaleDownUnneededTime | 10m | Node must be unneeded for 10 minutes |
| scaleDownUtilizationThreshold | 0.5 | Node utilization < 50% before removal |
| maxEmptyBulkDelete | 10 | Remove up to 10 idle nodes at once |
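A node becomes a scale-down candidate only when both the utilization and unneeded-time conditions hold. A minimal sketch of that test, with sample numbers (45% utilized, unneeded for 12 minutes) against the thresholds above:

```shell
# Sketch of the Cluster Autoscaler removal test using the thresholds above
# (sample node: 45% utilized, unneeded for 12 minutes; both values are assumptions)
utilization_pct=45
unneeded_minutes=12
threshold_pct=50          # scaleDownUtilizationThreshold: 0.5
unneeded_required=10      # scaleDownUnneededTime: 10m
if [ "${utilization_pct}" -lt "${threshold_pct}" ] && \
   [ "${unneeded_minutes}" -ge "${unneeded_required}" ]; then
  candidate=yes
else
  candidate=no
fi
echo "scale-down candidate: ${candidate}"
```

The PodDisruptionBudget below can still block the actual eviction even when a node passes this test.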

Pod Disruption Budget Protection:

# apps/atp-ingestion/base/pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: atp-ingestion-pdb
  namespace: atp-production
spec:
  minAvailable: 2  # Always keep 2 pods running
  selector:
    matchLabels:
      app: atp-ingestion

Scale-Down Delays and Thresholds

Cost-Optimized Scale-Down Configuration:

# Cluster Autoscaler: Aggressive scale-down (cost savings)
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-cost-optimized
  namespace: kube-system
data:
  config.yaml: |
    scaleDownDelayAfterAdd: 5m  # Reduced from 10m
    scaleDownUnneededTime: 5m  # Reduced from 10m
    scaleDownUtilizationThreshold: 0.4  # More aggressive (40%)
    scaleDownGpuUtilizationThreshold: 0.4
    maxEmptyBulkDelete: 20  # Remove more nodes at once
    scaleDownEnabled: true

Node Affinity and Taints

Node Affinity for Cost Optimization:

# apps/atp-ingestion/base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          # Prefer spot nodes (cost savings)
          - weight: 100
            preference:
              matchExpressions:
              - key: kubernetes.azure.com/scalesetpriority
                operator: In
                values:
                - spot
          # Prefer smaller nodes (cost efficiency)
          - weight: 50
            preference:
              matchExpressions:
              - key: node.kubernetes.io/instance-type
                operator: In
                values:
                - Standard_D2s_v3
                - Standard_D4s_v3
      tolerations:
      # Allow scheduling on spot nodes
      - key: kubernetes.azure.com/scalesetpriority
        operator: Equal
        value: spot
        effect: NoSchedule

Development Environment Auto-Shutdown

Schedule-Based Scaling to Zero

CronJob for Auto-Shutdown:

# platform/auto-shutdown/cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: dev-auto-shutdown
  namespace: kube-system
spec:
  schedule: "0 20 * * 1-5"  # 8 PM Monday-Friday
  timeZone: "America/New_York"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: dev-shutdown-sa
          containers:
          - name: kubectl
            image: bitnami/kubectl:latest
            command:
            - /bin/sh
            - -c
            - |
              # Scale down dev namespaces
              for ns in atp-dev atp-test; do
                for deployment in $(kubectl get deployments -n $ns -o name); do
                  kubectl scale $deployment -n $ns --replicas=0
                done
              done
              echo "✅ Dev environments scaled down at $(date)"
          restartPolicy: OnFailure
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: dev-auto-startup
  namespace: kube-system
spec:
  schedule: "0 8 * * 1-5"  # 8 AM Monday-Friday
  timeZone: "America/New_York"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: dev-startup-sa
          containers:
          - name: kubectl
            image: bitnami/kubectl:latest
            command:
            - /bin/sh
            - -c
            - |
              # Scale up dev namespaces
              for ns in atp-dev atp-test; do
                for deployment in $(kubectl get deployments -n $ns -o name); do
                  kubectl scale $deployment -n $ns --replicas=1
                done
              done
              echo "✅ Dev environments scaled up at $(date)"
          restartPolicy: OnFailure

Scaling Down Replicas at Night/Weekends

Auto-Shutdown Script:

#!/bin/bash
# scripts/auto-shutdown-dev-environments.sh

NAMESPACES=("atp-dev" "atp-test")
SHUTDOWN_TIME="20:00"  # 8 PM
STARTUP_TIME="08:00"   # 8 AM

CURRENT_HOUR=$(date +%H)
CURRENT_DAY=$(date +%u)  # 1=Monday, 7=Sunday

# Check if it's weekend
if [ "${CURRENT_DAY}" -eq 6 ] || [ "${CURRENT_DAY}" -eq 7 ]; then
  echo "📴 Weekend: Scaling down all dev environments..."
  for NS in "${NAMESPACES[@]}"; do
    kubectl get deployments -n "${NS}" -o name | \
      xargs -I {} kubectl scale {} -n "${NS}" --replicas=0
  done
  exit 0
fi

# Check if it's shutdown time (8 PM - 8 AM)
if [ "${CURRENT_HOUR}" -ge 20 ] || [ "${CURRENT_HOUR}" -lt 8 ]; then
  echo "🌙 Night time: Scaling down dev environments..."
  for NS in "${NAMESPACES[@]}"; do
    kubectl get deployments -n "${NS}" -o name | \
      xargs -I {} kubectl scale {} -n "${NS}" --replicas=0
  done
else
  echo "☀️  Day time: Ensuring dev environments are running..."
  for NS in "${NAMESPACES[@]}"; do
    kubectl get deployments -n "${NS}" -o name | \
      xargs -I {} kubectl scale {} -n "${NS}" --replicas=1
  done
fi
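The shutdown-window logic in the script reduces to a small predicate (hours in 24h format, day 1 = Monday through 7 = Sunday) that can be tested in isolation; a sketch:

```shell
# Returns 0 (true) when dev environments should be scaled down:
# weekends, or weekdays between 20:00 and 08:00 (same window as the script above)
should_shutdown() {
  local hour=$1 day=$2
  if [ "${day}" -ge 6 ]; then return 0; fi                      # Saturday/Sunday
  if [ "${hour}" -ge 20 ] || [ "${hour}" -lt 8 ]; then return 0; fi
  return 1
}

should_shutdown 21 3 && echo "Wed 21:00 -> shut down"
should_shutdown 10 2 || echo "Tue 10:00 -> keep running"
should_shutdown 12 7 && echo "Sun 12:00 -> shut down"
```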

Wake-Up Procedures

Wake-Up Script:

#!/bin/bash
# scripts/wake-up-dev-environments.sh

NAMESPACES=("atp-dev" "atp-test")

echo "☀️  Waking up dev environments..."

for NS in "${NAMESPACES[@]}"; do
  echo "  - Scaling up namespace: ${NS}"

  # Scale up deployments
  kubectl get deployments -n "${NS}" -o name | \
    xargs -I {} kubectl scale {} -n "${NS}" --replicas=1

  # Wait for pods to be ready
  echo "  - Waiting for pods to be ready..."
  kubectl wait --for=condition=available --timeout=5m \
    deployment --all -n "${NS}"
done

echo "✅ Dev environments are ready"

Cost Savings Calculation

Auto-Shutdown Cost Savings:

| Environment | Running Hours/Day | Running Hours/Week | Monthly Cost (Running) | Monthly Cost (Shutdown) | Savings |
| --- | --- | --- | --- | --- | --- |
| Dev | 12 | 60 | $300 | $120 | $180/month (60%) |
| Test | 12 | 60 | $300 | $120 | $180/month (60%) |
| Total | - | - | $600 | $240 | $360/month |

Cost Savings Formula (nightly shutdown only):

Monthly Savings = (24 hours - Running hours) / 24 hours × Monthly Cost
Monthly Savings = (24 - 12) / 24 × $300 = $150/month per environment

Adding weekend shutdown cuts running time to 60 of 168 hours per week, raising theoretical savings to about 64%; the table above uses a conservative 60% figure ($180/month per environment).
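Folding weekend shutdown into the calculation (the table's figures assume it) gives a larger number than the weekday-only formula. A quick integer-arithmetic check under those assumptions (12 h/day on weekdays, fully off on weekends, $300/month always-on baseline):

```shell
# Savings check including weekend shutdown: environments run 12 h/day on
# weekdays only, against a $300/month always-on baseline (example pricing)
monthly_cost=300
weekly_hours=168
running_hours=$(( 12 * 5 ))                                   # 60 h/week
savings_pct=$(( (weekly_hours - running_hours) * 100 / weekly_hours ))
savings=$(( monthly_cost * savings_pct / 100 ))
echo "savings: ~${savings_pct}% => ~\$${savings}/month per environment"
```

Integer arithmetic truncates; the table's 60% / $180 is the same estimate rounded down conservatively.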

Azure Cost Management Integration

Cost Tracking per Environment

Cost Tracking Dashboard Query:

// Log Analytics: Cost tracking per environment
AzureCost
| where TimeGenerated >= ago(30d)
| where Tags contains "Environment"
| extend Environment = tostring(Tags.Environment)
| extend Service = tostring(Tags.Service)
| summarize 
    TotalCost = sum(Cost),
    AvgDailyCost = avg(Cost)
    by Environment, bin(TimeGenerated, 1d)
| render timechart

Cost Tracking Script:

#!/bin/bash
# scripts/track-costs-by-environment.sh

ENVIRONMENT="${1:-all}"
START_DATE="${2:-$(date -d '30 days ago' +%Y-%m-%d)}"
END_DATE="${3:-$(date +%Y-%m-%d)}"

echo "💰 Cost Tracking: ${ENVIRONMENT}"
echo "   Period: ${START_DATE} to ${END_DATE}"
echo ""

if [ "${ENVIRONMENT}" = "all" ]; then
  ENVIRONMENTS=("production" "staging" "test" "development")
else
  ENVIRONMENTS=("${ENVIRONMENT}")
fi

for ENV in "${ENVIRONMENTS[@]}"; do
  echo "📊 ${ENV}:"

  COST=$(az consumption usage list \
    --start-date "${START_DATE}" \
    --end-date "${END_DATE}" \
    --query "[?tags.Environment=='${ENV}'].pretaxCost" \
    --output tsv | \
    awk '{sum+=$1} END {printf "%.2f", sum}')

  echo "   Total Cost: \$${COST}"

  # Daily average
  DAYS=$(( ($(date -d "${END_DATE}" +%s) - $(date -d "${START_DATE}" +%s)) / 86400 ))
  AVG_DAILY=$(echo "scale=2; ${COST} / ${DAYS}" | bc)
  echo "   Avg Daily: \$${AVG_DAILY}"

  echo ""
done

Budget Alerts

Budget Configuration:

#!/bin/bash
# scripts/create-budget-alert.sh

BUDGET_NAME="${1}"
AMOUNT="${2}"
RESOURCE_GROUP="${3}"
EMAIL="${4}"

az consumption budget create \
  --budget-name "${BUDGET_NAME}" \
  --amount "${AMOUNT}" \
  --time-grain Monthly \
  --start-date "$(date +%Y-%m-01)" \
  --end-date "$(date -d '+1 year' +%Y-%m-01)" \
  --category Cost \
  --resource-group "${RESOURCE_GROUP}"

# Note: "az consumption budget create" does not accept alert thresholds on the
# command line; configure the 50%/80%/100% notifications for "${EMAIL}" via the
# Azure portal, the Budgets REST API, or infrastructure-as-code (see the Pulumi
# example below).

echo "✅ Budget created: ${BUDGET_NAME} (\$${AMOUNT}/month)"

Budget Alert Configuration:

// infrastructure/Budgets.cs (Pulumi C# example, conceptual)
var productionBudget = new Budget("atp-production-budget", new BudgetArgs
{
    BudgetName = "atp-production-monthly",
    Amount = 10000.0,  // $10,000/month
    TimeGrain = "Monthly",
    StartDate = DateTime.Now.ToString("yyyy-MM-01"),
    Category = "Cost",
    Notifications = new[]
    {
        new BudgetNotificationArgs
        {
            Threshold = 50,  // 50% of budget
            ThresholdType = "Actual",
            Operator = "GreaterThan",
            ContactEmails = new[] { "finance@example.com" }
        },
        new BudgetNotificationArgs
        {
            Threshold = 80,  // 80% of budget
            ThresholdType = "Actual",
            Operator = "GreaterThan",
            ContactEmails = new[] { "finance@example.com", "ops@example.com" }
        },
        new BudgetNotificationArgs
        {
            Threshold = 100,  // 100% of budget
            ThresholdType = "Actual",
            Operator = "GreaterThan",
            ContactEmails = new[] { "finance@example.com", "ops@example.com", "cto@example.com" }
        }
    }
});

Cost Anomaly Detection

Cost Anomaly Detection:

#!/bin/bash
# scripts/detect-cost-anomalies.sh

THRESHOLD="${1:-0.2}"  # 20% increase threshold

echo "🔍 Detecting cost anomalies..."

# Get current month cost
CURRENT_MONTH=$(date +%Y-%m)
CURRENT_COST=$(az consumption usage list \
  --start-date "${CURRENT_MONTH}-01" \
  --end-date "$(date +%Y-%m-%d)" \
  --query "[].pretaxCost" \
  --output tsv | \
  awk '{sum+=$1} END {print sum}')

# Get last month cost
LAST_MONTH=$(date -d '1 month ago' +%Y-%m)
LAST_MONTH_COST=$(az consumption usage list \
  --start-date "${LAST_MONTH}-01" \
  --end-date "${LAST_MONTH}-$(date -d "${LAST_MONTH}-01 +1 month -1 day" +%d)" \
  --query "[].pretaxCost" \
  --output tsv | \
  awk '{sum+=$1} END {print sum}')

# Calculate increase percentage (guard against missing last-month data)
if [ -z "${LAST_MONTH_COST}" ] || [ "${LAST_MONTH_COST}" = "0" ]; then
  echo "⚠️  No cost data for ${LAST_MONTH}; cannot compute increase"
  exit 0
fi
INCREASE=$(echo "scale=2; (${CURRENT_COST} - ${LAST_MONTH_COST}) / ${LAST_MONTH_COST} * 100" | bc)

if (( $(echo "${INCREASE} > ${THRESHOLD} * 100" | bc -l) )); then
  echo "⚠️  Cost anomaly detected!"
  echo "   Current month: \$${CURRENT_COST}"
  echo "   Last month: \$${LAST_MONTH_COST}"
  echo "   Increase: ${INCREASE}%"
  echo "   Threshold: $(echo "${THRESHOLD} * 100" | bc)%"

  # Send alert
  echo "Sending alert to finance@example.com..."
else
  echo "✅ No cost anomalies detected"
  echo "   Increase: ${INCREASE}%"
fi
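The threshold comparison itself can be done in pure shell integer arithmetic, avoiding the bc dependency. A sketch with sample month-over-month figures (all values assumed, whole dollars):

```shell
# Anomaly test in pure shell arithmetic: flag when the month-over-month
# increase exceeds threshold_pct (sample costs; real values come from az consumption)
current_cost=1300
last_cost=1000
threshold_pct=20
increase_pct=$(( (current_cost - last_cost) * 100 / last_cost ))
if [ "${increase_pct}" -gt "${threshold_pct}" ]; then
  verdict="anomaly"
else
  verdict="ok"
fi
echo "increase: ${increase_pct}% -> ${verdict}"
```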

Cost Optimization Recommendations

Azure Advisor Cost Recommendations:

#!/bin/bash
# scripts/get-cost-optimization-recommendations.sh

echo "💰 Azure Advisor Cost Optimization Recommendations"
echo "=================================================="

# Get all cost recommendations
az advisor recommendation list \
  --category Cost \
  --query "[].{Name:shortDescription.problem, Impact:impact, ResourceGroup:resourceGroup, PotentialSavings:extendedProperties.potentialSavings}" \
  --output table

echo ""
echo "📊 Top 10 Cost Savings Opportunities:"
az advisor recommendation list \
  --category Cost \
  --query "[?impact=='High' || impact=='Medium'].{Name:shortDescription.problem, Impact:impact, PotentialSavings:extendedProperties.potentialSavings, ResourceId:resourceId}" \
  --output table | head -n 12  # 10 rows plus the 2-line table header

Cost Allocation

Tags per Environment, Service, Tenant

Comprehensive Tagging Strategy:

# Resource tagging template
tags:
  Environment: production | staging | test | development
  Service: atp-ingestion | atp-query | atp-gateway | platform
  Tenant: tenant-a | tenant-b | tenant-eu | shared
  CostCenter: sales | engineering | operations
  BusinessUnit: enterprise | smb
  Project: audit-trail-platform
  Owner: team-name@example.com
  ManagedBy: gitops | terraform | pulumi
  AutoShutdown: true | false
  Criticality: critical | high | medium | low

Tagging in Pulumi:

// infrastructure/Tags.cs
public static class ResourceTags
{
    public static InputMap<string> ProductionTags(string service, string costCenter)
    {
        return new InputMap<string>
        {
            { "Environment", "production" },
            { "Service", service },
            { "CostCenter", costCenter },
            { "Project", "audit-trail-platform" },
            { "ManagedBy", "pulumi" },
            { "Criticality", "critical" },
            { "AutoShutdown", "false" }
        };
    }

    public static InputMap<string> DevelopmentTags(string service)
    {
        return new InputMap<string>
        {
            { "Environment", "development" },
            { "Service", service },
            { "CostCenter", "engineering" },
            { "Project", "audit-trail-platform" },
            { "ManagedBy", "pulumi" },
            { "Criticality", "low" },
            { "AutoShutdown", "true" }
        };
    }
}

Namespace-Level Cost Reporting

Namespace Cost Reporting:

#!/bin/bash
# scripts/namespace-cost-report.sh

NAMESPACE="${1}"
START_DATE="${2:-$(date -d '30 days ago' +%Y-%m-%d)}"
END_DATE="${3:-$(date +%Y-%m-%d)}"

echo "💰 Cost Report for Namespace: ${NAMESPACE}"
echo "   Period: ${START_DATE} to ${END_DATE}"
echo ""

# Get pods in namespace
PODS=$(kubectl get pods -n "${NAMESPACE}" -o json | jq -r '.items[].metadata.name')

TOTAL_CPU=0
TOTAL_MEMORY=0

for POD in ${PODS}; do
  # Get CPU and memory requests (assumes requests are expressed as
  # millicores, e.g. "500m", and whole Gi, e.g. "2Gi"; other units
  # such as "1" cores or "512Mi" would need extra normalization)
  CPU=$(kubectl get pod "${POD}" -n "${NAMESPACE}" -o json | \
    jq -r '.spec.containers[].resources.requests.cpu // "0"' | \
    sed 's/m//' | awk '{sum+=$1} END {print sum+0}')
  MEMORY=$(kubectl get pod "${POD}" -n "${NAMESPACE}" -o json | \
    jq -r '.spec.containers[].resources.requests.memory // "0"' | \
    sed 's/Gi//' | awk '{sum+=$1} END {print sum+0}')

  TOTAL_CPU=$((TOTAL_CPU + CPU))
  TOTAL_MEMORY=$((TOTAL_MEMORY + MEMORY))
done

echo "Resource Requests:"
echo "  CPU: ${TOTAL_CPU}m cores"
echo "  Memory: ${TOTAL_MEMORY}Gi"
echo ""

# Estimate cost (example pricing)
CPU_COST=$(echo "scale=2; ${TOTAL_CPU} / 1000 * 0.096 * 24 * 30" | bc)
MEMORY_COST=$(echo "scale=2; ${TOTAL_MEMORY} * 0.01 * 24 * 30" | bc)
TOTAL_COST=$(echo "scale=2; ${CPU_COST} + ${MEMORY_COST}" | bc)

echo "Estimated Monthly Cost:"
echo "  CPU: \$${CPU_COST}"
echo "  Memory: \$${MEMORY_COST}"
echo "  Total: \$${TOTAL_COST}"
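The bc expressions above follow this arithmetic; the same estimate can be checked in integer cents. Sample requests (2000m CPU, 8 Gi memory) and the script's example rates ($0.096 per core-hour, $0.01 per GiB-hour) are assumptions:

```shell
# Monthly cost estimate in integer cents for sample requests over a
# 720-hour month, using the example rates from the script above
cpu_m=2000
mem_gi=8
hours=720
cpu_cents=$(( cpu_m * 96 * hours / 10000 ))   # 9.6 cents/core-h: cpu_m/1000 cores * 9.6 * hours
mem_cents=$(( mem_gi * 1 * hours ))           # 1 cent/GiB-h
total_cents=$(( cpu_cents + mem_cents ))
printf 'CPU: $%d.%02d  Memory: $%d.%02d  Total: $%d.%02d\n' \
  $(( cpu_cents / 100 )) $(( cpu_cents % 100 )) \
  $(( mem_cents / 100 )) $(( mem_cents % 100 )) \
  $(( total_cents / 100 )) $(( total_cents % 100 ))
```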

Chargeback to Teams

Team Chargeback Report:

#!/bin/bash
# scripts/team-chargeback-report.sh

TEAM="${1:-all}"
MONTH="${2:-$(date +%Y-%m)}"

echo "💰 Team Chargeback Report: ${TEAM}"
echo "   Month: ${MONTH}"
echo ""

if [ "${TEAM}" = "all" ]; then
  TEAMS=("engineering" "sales" "operations")
else
  TEAMS=("${TEAM}")
fi

for T in "${TEAMS[@]}"; do
  echo "📊 ${T}:"

  # Get costs for team's resources
  COST=$(az consumption usage list \
    --start-date "${MONTH}-01" \
    --end-date "${MONTH}-$(date -d "${MONTH}-01 +1 month -1 day" +%d)" \
    --query "[?tags.CostCenter=='${T}'].pretaxCost" \
    --output tsv | \
    awk '{sum+=$1} END {printf "%.2f", sum}')

  echo "   Total Cost: \$${COST}"
  echo ""
done

Showback Reporting

Showback Dashboard Query:

// Log Analytics: Showback report
AzureCost
| where TimeGenerated >= ago(30d)
| extend CostCenter = tostring(Tags.CostCenter)
| extend Service = tostring(Tags.Service)
| extend Environment = tostring(Tags.Environment)
| summarize 
    TotalCost = sum(Cost),
    ResourceCount = count()
    by CostCenter, Service, Environment
| render barchart

Resource Cleanup Automation

Deleting Unused Images in ACR

ACR Cleanup Script:

#!/bin/bash
# scripts/cleanup-acr-images.sh

ACR_NAME="${1}"
KEEP_DAYS="${2:-30}"  # Keep images from last 30 days
KEEP_TAGS="${3:-10}"  # Keep 10 most recent tags per repository

echo "🧹 Cleaning up unused ACR images: ${ACR_NAME}"
echo "   Keep days: ${KEEP_DAYS}"
echo "   Keep tags per repo: ${KEEP_TAGS}"
echo ""

# Get all repositories
REPOS=$(az acr repository list --name "${ACR_NAME}" --output tsv)

for REPO in ${REPOS}; do
  echo "📦 Repository: ${REPO}"

  # Get the KEEP_TAGS most recent tags (newest first)
  TAGS=$(az acr repository show-tags \
    --name "${ACR_NAME}" \
    --repository "${REPO}" \
    --orderby time_desc \
    --output tsv | head -n "${KEEP_TAGS}")

  # Get tags older than KEEP_DAYS (--detail is required for lastUpdateTime)
  OLD_TAGS=$(az acr repository show-tags \
    --name "${ACR_NAME}" \
    --repository "${REPO}" \
    --detail \
    --query "[?lastUpdateTime < '$(date -d "${KEEP_DAYS} days ago" -u +%Y-%m-%dT%H:%M:%SZ)'].name" \
    --output tsv)

  # Delete old tags (but keep the KEEP_TAGS most recent)
  for TAG in ${OLD_TAGS}; do
    # Exact-match check so e.g. "v1.1" does not also protect "v1.1.0"
    if ! echo "${TAGS}" | grep -Fxq "${TAG}"; then
      echo "  🗑️  Deleting: ${REPO}:${TAG}"
      az acr repository delete \
        --name "${ACR_NAME}" \
        --image "${REPO}:${TAG}" \
        --yes || true
    fi
  done
done

echo "✅ ACR cleanup complete"
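The retention rule the script applies (delete a tag only if it is older than KEEP_DAYS and not among the KEEP_TAGS newest) can be exercised in isolation. Tag names and ages below are sample data, newest first:

```shell
# Illustrative retention logic matching the script above: a tag is deleted
# only when it is older than KEEP_DAYS AND outside the KEEP_TAGS newest.
# Sample data (newest first); ages are in days.
KEEP_DAYS=30; KEEP_TAGS=2
TAGS=("v1.4.0" "v1.3.0" "v1.2.0" "v1.1.0")
AGES=(5 40 45 60)
DELETED=()
for i in "${!TAGS[@]}"; do
  if [ "${AGES[$i]}" -gt "${KEEP_DAYS}" ] && [ "$i" -ge "${KEEP_TAGS}" ]; then
    DELETED+=("${TAGS[$i]}")
  fi
done
echo "would delete: ${DELETED[*]}"  # v1.3.0 is old but protected by the keep-count
```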

Removing Old PersistentVolumes

PV Cleanup Script:

#!/bin/bash
# scripts/cleanup-old-pvs.sh

NAMESPACE="${1:-all}"
AGE_DAYS="${2:-30}"  # Delete PVs older than 30 days

echo "🧹 Cleaning up old PersistentVolumes"
echo "   Namespace: ${NAMESPACE}"
echo "   Age threshold: ${AGE_DAYS} days"
echo ""

if [ "${NAMESPACE}" = "all" ]; then
  PVS=$(kubectl get pv -o json | \
    jq -r ".items[] | select(.status.phase == \"Released\" or .status.phase == \"Failed\") | .metadata.name")
else
  PVS=$(kubectl get pv -o json | \
    jq -r ".items[] | select(.spec.claimRef.namespace == \"${NAMESPACE}\" and (.status.phase == \"Released\" or .status.phase == \"Failed\")) | .metadata.name")
fi

for PV in ${PVS}; do
  # Get PV creation timestamp
  CREATED=$(kubectl get pv "${PV}" -o jsonpath='{.metadata.creationTimestamp}')
  CREATED_EPOCH=$(date -d "${CREATED}" +%s)
  AGE_EPOCH=$(date -d "${AGE_DAYS} days ago" +%s)

  if [ "${CREATED_EPOCH}" -lt "${AGE_EPOCH}" ]; then
    echo "🗑️  Deleting old PV: ${PV} (created: ${CREATED})"
    kubectl delete pv "${PV}" || true
  fi
done

echo "✅ PV cleanup complete"

Cleaning Up Completed Jobs

Job Cleanup Script:

#!/bin/bash
# scripts/cleanup-completed-jobs.sh

NAMESPACE="${1:-all}"
AGE_HOURS="${2:-24}"  # Delete jobs older than 24 hours

echo "🧹 Cleaning up completed Jobs"
echo "   Namespace: ${NAMESPACE}"
echo "   Age threshold: ${AGE_HOURS} hours"
echo ""

if [ "${NAMESPACE}" = "all" ]; then
  NAMESPACES=$(kubectl get namespaces -o jsonpath='{.items[*].metadata.name}')
else
  NAMESPACES=("${NAMESPACE}")
fi

for NS in ${NAMESPACES}; do
  # Get completed/failed jobs
  JOBS=$(kubectl get jobs -n "${NS}" -o json | \
    jq -r ".items[] | select(.status.succeeded == 1 or .status.failed > 0) | .metadata.name")

  for JOB in ${JOBS}; do
    # Get job completion time
    COMPLETION_TIME=$(kubectl get job "${JOB}" -n "${NS}" -o jsonpath='{.status.completionTime}')
    if [ -n "${COMPLETION_TIME}" ]; then
      COMPLETION_EPOCH=$(date -d "${COMPLETION_TIME}" +%s)
      AGE_EPOCH=$(date -d "${AGE_HOURS} hours ago" +%s)

      if [ "${COMPLETION_EPOCH}" -lt "${AGE_EPOCH}" ]; then
        echo "🗑️  Deleting completed job: ${NS}/${JOB}"
        kubectl delete job "${JOB}" -n "${NS}" || true
      fi
    fi
  done
done

echo "✅ Job cleanup complete"

Snapshot Cleanup

Snapshot Cleanup Script:

#!/bin/bash
# scripts/cleanup-old-snapshots.sh

RESOURCE_GROUP="${1}"
AGE_DAYS="${2:-7}"  # Keep snapshots from last 7 days

echo "🧹 Cleaning up old snapshots: ${RESOURCE_GROUP}"
echo "   Age threshold: ${AGE_DAYS} days"
echo ""

# Delete snapshots older than AGE_DAYS (name + creation time, tab-separated;
# a while-read loop is used because word-splitting the tsv output would
# separate each name from its timestamp)
az snapshot list \
  --resource-group "${RESOURCE_GROUP}" \
  --query "[?timeCreated < '$(date -d "${AGE_DAYS} days ago" -u +%Y-%m-%dT%H:%M:%SZ)'].[name, timeCreated]" \
  --output tsv | \
while IFS=$'\t' read -r NAME TIME; do
  echo "🗑️  Deleting snapshot: ${NAME} (created: ${TIME})"
  az snapshot delete \
    --resource-group "${RESOURCE_GROUP}" \
    --name "${NAME}" \
    --yes || true
done

echo "✅ Snapshot cleanup complete"

Azure Advisor Recommendations

Reviewing Cost Recommendations

Review Azure Advisor Recommendations:

#!/bin/bash
# scripts/review-azure-advisor-recommendations.sh

echo "💰 Azure Advisor Cost Recommendations"
echo "======================================"

# Get all cost recommendations
az advisor recommendation list \
  --category Cost \
  --output table

echo ""
echo "📊 High Impact Recommendations:"
az advisor recommendation list \
  --category Cost \
  --query "[?impact=='High'].{Name:shortDescription.problem, ResourceGroup:resourceGroup, PotentialSavings:extendedProperties.potentialSavings}" \
  --output table

Implementing Right-Sizing Suggestions

Right-Sizing Implementation:

#!/bin/bash
# scripts/implement-right-sizing.sh

RECOMMENDATION_ID="${1}"

if [ -z "${RECOMMENDATION_ID}" ]; then
  echo "Usage: $0 <recommendation-id>"
  echo ""
  echo "Available recommendations:"
  az advisor recommendation list \
    --category Cost \
    --query "[].{ID:id, Name:shortDescription.problem, CurrentSKU:extendedProperties.currentSku, RecommendedSKU:extendedProperties.recommendedSku}" \
    --output table
  exit 1
fi

echo "📊 Implementing right-sizing recommendation: ${RECOMMENDATION_ID}"

# Get recommendation details
RECOMMENDATION=$(az advisor recommendation show \
  --id "${RECOMMENDATION_ID}")

CURRENT_SKU=$(echo "${RECOMMENDATION}" | jq -r '.extendedProperties.currentSku')
RECOMMENDED_SKU=$(echo "${RECOMMENDATION}" | jq -r '.extendedProperties.recommendedSku')
RESOURCE_ID=$(echo "${RECOMMENDATION}" | jq -r '.resourceId')

echo "  Current SKU: ${CURRENT_SKU}"
echo "  Recommended SKU: ${RECOMMENDED_SKU}"
echo "  Resource: ${RESOURCE_ID}"
echo ""

read -p "Apply this recommendation? (yes/no): " CONFIRM

if [ "${CONFIRM}" = "yes" ]; then
  echo "🔧 Applying right-sizing..."

  # Determine resource type and update
  if echo "${RESOURCE_ID}" | grep -q "Microsoft.Compute/virtualMachines"; then
    RESOURCE_GROUP=$(echo "${RESOURCE_ID}" | cut -d'/' -f5)
    VM_NAME=$(echo "${RESOURCE_ID}" | cut -d'/' -f9)

    echo "  Updating VM: ${VM_NAME}"
    az vm resize \
      --resource-group "${RESOURCE_GROUP}" \
      --name "${VM_NAME}" \
      --size "${RECOMMENDED_SKU}"
  else
    echo "  Resource type not yet supported for automatic resizing"
    echo "  Please apply manually: ${RESOURCE_ID}"
  fi
else
  echo "❌ Right-sizing not applied"
fi

SKU Optimization

SKU Optimization Analysis:

#!/bin/bash
# scripts/analyze-sku-optimization.sh

echo "📊 SKU Optimization Analysis"
echo "============================"

# Get all VMs and their current SKUs
az vm list \
  --query "[].{Name:name, ResourceGroup:resourceGroup, Size:hardwareProfile.vmSize}" \
  --output table

echo ""
echo "💰 Cost comparison (example VMs):"
echo "  Standard_D4s_v3 (4 vCPU, 16 GiB): \$0.192/hour = \$140/month"
echo "  Standard_D8s_v3 (8 vCPU, 32 GiB): \$0.384/hour = \$280/month"
echo "  Standard_D16s_v3 (16 vCPU, 64 GiB): \$0.768/hour = \$561/month"
echo ""
echo "💡 Recommendations:"
echo "  - Right-size based on actual usage (P95 metrics)"
echo "  - Use Reserved Instances for production (up to 72% discount)"
echo "  - Use Spot Instances for dev/test (up to 80% discount)"
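Applying the best-case discount figures above to the D4s_v3 example rate gives the monthly spread; pricing is illustrative and actual Reserved/Spot rates vary by region and term:

```shell
# Monthly cost of one Standard_D4s_v3 node at the example rate above
# ($0.192/h, 730 h/month), with best-case RI (72% off) and Spot (80% off)
monthly_cents=$(( 192 * 730 / 10 ))        # 19.2 cents/h * 730 h = 14016 cents
ri_cents=$(( monthly_cents * 28 / 100 ))   # pay 28% after the 72% RI discount
spot_cents=$(( monthly_cents * 20 / 100 )) # pay 20% after the 80% Spot discount
printf 'On-demand: $%d.%02d  RI: $%d.%02d  Spot: $%d.%02d\n' \
  $(( monthly_cents / 100 )) $(( monthly_cents % 100 )) \
  $(( ri_cents / 100 )) $(( ri_cents % 100 )) \
  $(( spot_cents / 100 )) $(( spot_cents % 100 ))
```

This matches the ~$140/month on-demand figure quoted in the SKU comparison above.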

FinOps Practices

Cost Monitoring Dashboards

FinOps Dashboard Configuration:

# monitoring/dashboards/finops-dashboard.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: finops-dashboard
  namespace: monitoring
data:
  dashboard.json: |
    {
      "dashboard": {
        "title": "ATP FinOps Dashboard",
        "panels": [
          {
            "title": "Monthly Cost by Environment",
            "targets": [
              {
                "expr": "sum(azure_cost_total{environment=~\"production|staging|test|development\"}) by (environment)",
                "legendFormat": "{{environment}}"
              }
            ]
          },
          {
            "title": "Cost Trend (30 days)",
            "targets": [
              {
                "expr": "sum(rate(azure_cost_total[1d]))",
                "legendFormat": "Daily Cost"
              }
            ]
          },
          {
            "title": "Resource Utilization vs Cost",
            "targets": [
              {
                "expr": "sum(container_cpu_usage_seconds_total) / sum(container_resource_requests_cpu_seconds_total) * 100",
                "legendFormat": "CPU Utilization %"
              },
              {
                "expr": "sum(container_memory_working_set_bytes) / sum(container_resource_requests_memory_bytes) * 100",
                "legendFormat": "Memory Utilization %"
              }
            ]
          }
        ]
      }
    }

Monthly Cost Reviews

Monthly Cost Review Script:

#!/bin/bash
# scripts/monthly-cost-review.sh

MONTH="${1:-$(date -d '1 month ago' +%Y-%m)}"

echo "💰 Monthly Cost Review: ${MONTH}"
echo "================================"
echo ""

# Total cost
TOTAL_COST=$(az consumption usage list \
  --start-date "${MONTH}-01" \
  --end-date "${MONTH}-$(date -d "${MONTH}-01 +1 month -1 day" +%d)" \
  --query "[].pretaxCost" \
  --output tsv | \
  awk '{sum+=$1} END {printf "%.2f", sum}')

echo "📊 Total Cost: \$${TOTAL_COST}"
echo ""

# Cost by environment
echo "Cost by Environment:"
az consumption usage list \
  --start-date "${MONTH}-01" \
  --end-date "${MONTH}-$(date -d "${MONTH}-01 +1 month -1 day" +%d)" \
  --query "[].{Environment:tags.Environment, Cost:pretaxCost}" \
  --output tsv | \
  awk '{cost[$1]+=$2} END {for (env in cost) printf "  %s: $%.2f\n", env, cost[env]}'

echo ""

# Cost by service
echo "Cost by Service:"
az consumption usage list \
  --start-date "${MONTH}-01" \
  --end-date "${MONTH}-$(date -d "${MONTH}-01 +1 month -1 day" +%d)" \
  --query "[].{Service:tags.Service, Cost:pretaxCost}" \
  --output tsv | \
  awk '{cost[$1]+=$2} END {for (svc in cost) printf "  %s: $%.2f\n", svc, cost[svc]}'

echo ""

# Top 10 resources by cost
echo "Top 10 Resources by Cost:"
az consumption usage list \
  --start-date "${MONTH}-01" \
  --end-date "${MONTH}-$(date -d "${MONTH}-01 +1 month -1 day" +%d)" \
  --query "[].{Resource:instanceName, Cost:pretaxCost}" \
  --output tsv | \
  sort -k2 -nr | head -n 10

Budget Forecasting

Budget Forecast Script:

#!/bin/bash
# scripts/budget-forecast.sh

CURRENT_MONTH=$(date +%Y-%m)
LAST_MONTH=$(date -d '1 month ago' +%Y-%m)

echo "📈 Budget Forecast"
echo "=================="
echo ""

# Get last 3 months of costs
for i in {2..0}; do
  MONTH=$(date -d "${i} months ago" +%Y-%m)
  COST=$(az consumption usage list \
    --start-date "${MONTH}-01" \
    --end-date "${MONTH}-$(date -d "${MONTH}-01 +1 month -1 day" +%d)" \
    --query "[].pretaxCost" \
    --output tsv | \
    awk '{sum+=$1} END {printf "%.2f", sum}')

  echo "${MONTH}: \$${COST}"
done

echo ""

# Forecast next month (simple average)
CURRENT_COST=$(az consumption usage list \
  --start-date "${CURRENT_MONTH}-01" \
  --end-date "$(date +%Y-%m-%d)" \
  --query "[].pretaxCost" \
  --output tsv | \
  awk '{sum+=$1} END {printf "%.2f", sum}')

DAYS_IN_MONTH=$(date -d "$(date +%Y-%m-01) +1 month -1 day" +%d)
DAYS_ELAPSED=$(date +%d)
FORECAST=$(echo "scale=2; ${CURRENT_COST} / ${DAYS_ELAPSED} * ${DAYS_IN_MONTH}" | bc)

echo "📊 Forecast for $(date -d '+1 month' +%Y-%m): \$${FORECAST}"
echo "   Based on current month trend"
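
The forecast is a plain linear extrapolation: month-to-date spend divided by days elapsed, scaled to the full month. With sample numbers (not real costs), and awk standing in for bc:

```shell
#!/bin/bash
# Linear forecast: month-to-date cost / days elapsed * days in month
CURRENT_COST=1500.00
DAYS_ELAPSED=15
DAYS_IN_MONTH=30

FORECAST=$(awk -v c="$CURRENT_COST" -v e="$DAYS_ELAPSED" -v m="$DAYS_IN_MONTH" \
  'BEGIN {printf "%.2f", c / e * m}')

echo "Forecast: \$${FORECAST}"  # 1500 / 15 * 30 = 3000.00
```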

Cost Optimization KPIs

Cost Optimization KPI Dashboard:

// Log Analytics: Cost Optimization KPIs
let CostData = AzureCost
| where TimeGenerated >= ago(30d)
| extend Environment = tostring(Tags.Environment)
| extend Service = tostring(Tags.Service)
| summarize TotalCost = sum(Cost) by Environment, Service, bin(TimeGenerated, 1d);

// KPI 1: Cost per Environment
CostData
| summarize 
    TotalCost = sum(TotalCost),
    AvgDailyCost = avg(TotalCost)
    by Environment
| extend KPI = "Cost per Environment";

// KPI 2: Resource Utilization vs Cost
union (
    Perf
    | where ObjectName == "K8SContainer"
    | where CounterName == "cpuUsageNanoCores"
    | summarize AvgCpu = avg(CounterValue) by Namespace, bin(TimeGenerated, 1d)
),
(
    AzureCost
    | where TimeGenerated >= ago(30d)
    | extend Namespace = tostring(Tags.Namespace)
    | summarize Cost = sum(Cost) by Namespace, bin(TimeGenerated, 1d)
)
| summarize 
    AvgCpu = max(AvgCpu),
    TotalCost = max(Cost)
    by Namespace, bin(TimeGenerated, 1d)
| extend Efficiency = TotalCost / (AvgCpu / 1000000000)
| extend KPI = "Cost Efficiency"
| render timechart

Cost Optimization KPIs:

| KPI | Target | Current | Status |
|-----|--------|---------|--------|
| Cost per Environment | < $5,000/month | $4,200 | ✅ |
| Resource Utilization | > 70% | 65% | ⚠️ |
| Cost per Transaction | < $0.01 | $0.008 | ✅ |
| Waste (Unused Resources) | < 10% | 12% | ⚠️ |
| Reserved Instance Coverage | > 80% | 75% | ⚠️ |
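
The Status column is mechanical: each KPI is a threshold comparison in one direction or the other. A throwaway helper (thresholds taken from the table above; `kpi_status` is not an ATP tool):

```shell
#!/bin/bash
# kpi_status <current> <target> <max|min>
# "max": current must stay below target; "min": current must stay above it
kpi_status() {
  awk -v c="$1" -v t="$2" -v d="$3" \
    'BEGIN { ok = (d == "max") ? (c < t) : (c > t); print (ok ? "OK" : "WARN") }'
}

kpi_status 4200 5000 max   # Cost per Environment -> OK
kpi_status 65 70 min       # Resource Utilization -> WARN
kpi_status 12 10 max       # Waste -> WARN
```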

Summary: Cost Optimization in GitOps

  • AKS Cost Optimization: Node pool sizing (right-sized VMs), spot instances for dev/test (80% discount), reserved instances for production (up to 72% discount), cluster autoscaler configuration
  • Resource Right-Sizing: Analyzing actual resource usage (7-day metrics), adjusting requests and limits, Vertical Pod Autoscaler (VPA), recommendations from Azure Advisor
  • Horizontal Pod Autoscaler (HPA): CPU-based scaling (70% threshold), memory-based scaling, custom metrics with KEDA, scaling policies for cost efficiency (conservative scale-down)
  • Cluster Autoscaler: Adding nodes based on demand, removing idle nodes (50% utilization threshold), scale-down delays and thresholds, node affinity and taints for spot instances
  • Development Environment Auto-Shutdown: Schedule-based scaling to zero (8 PM - 8 AM, weekends), scaling down replicas at night/weekends, wake-up procedures, cost savings calculation (60% savings)
  • Azure Cost Management Integration: Cost tracking per environment, budget alerts (50%, 80%, 100% thresholds), cost anomaly detection, cost optimization recommendations
  • Cost Allocation: Tags per environment/service/tenant, namespace-level cost reporting, chargeback to teams, showback reporting
  • Resource Cleanup Automation: Deleting unused images in ACR (30-day retention), removing old PersistentVolumes, cleaning up completed jobs (24-hour retention), snapshot cleanup (7-day retention)
  • Azure Advisor Recommendations: Reviewing cost recommendations, implementing right-sizing suggestions, SKU optimization
  • FinOps Practices: Cost monitoring dashboards, monthly cost reviews, budget forecasting, cost optimization KPIs (utilization, waste, efficiency)

Networking & Service Mesh

Purpose: Define networking architecture, ingress controller configuration, certificate management, network policies, service mesh selection and implementation, mTLS, traffic management, and multi-cluster networking strategies for ATP's GitOps deployments, ensuring secure, scalable, and observable service-to-service communication across all environments.


AKS Networking Models

kubenet (Basic Networking)

kubenet Networking Overview:

graph TB
    subgraph "AKS Cluster (kubenet)"
        POD1[Pod 1<br/>10.244.0.0/24]
        POD2[Pod 2<br/>10.244.1.0/24]
        KUBENET[kubenet Plugin<br/>Overlay Network]
    end
    subgraph "Azure VNet"
        VNET[VNet<br/>10.0.0.0/16]
        SUBNET[Subnet<br/>10.0.1.0/24]
    end

    POD1 --> KUBENET
    POD2 --> KUBENET
    KUBENET --> SUBNET
    SUBNET --> VNET

    style KUBENET fill:#FFE5B4
    style SUBNET fill:#90EE90

kubenet Characteristics:

| Aspect | kubenet | Description |
|--------|---------|-------------|
| Pod IP Addresses | Overlay network | Pods get IPs from the overlay (10.244.0.0/16) |
| VNet Integration | Limited | Pod IPs not routable from the VNet |
| IP Address Limit | Limited by nodes | ~250 pods per node |
| Network Policies | ✅ Supported | NetworkPolicy resources |
| Azure Integration | ⚠️ Limited | Requires routing tables |
| Complexity | ✅ Simple | Easier to set up |

Azure CNI (Advanced Networking)

Azure CNI Networking Overview:

graph TB
    subgraph "AKS Cluster (Azure CNI)"
        POD1[Pod 1<br/>10.0.1.10]
        POD2[Pod 2<br/>10.0.1.11]
        AZCNI[Azure CNI<br/>Direct VNet Integration]
    end
    subgraph "Azure VNet"
        VNET[VNet<br/>10.0.0.0/16]
        SUBNET[Subnet<br/>10.0.1.0/24]
        ROUTE[Route Tables]
        NSG[Network Security Groups]
    end

    POD1 --> AZCNI
    POD2 --> AZCNI
    AZCNI --> SUBNET
    SUBNET --> VNET
    SUBNET --> ROUTE
    SUBNET --> NSG

    style AZCNI fill:#90EE90
    style SUBNET fill:#87CEEB

Azure CNI Characteristics:

| Aspect | Azure CNI | Description |
|--------|-----------|-------------|
| Pod IP Addresses | VNet IPs | Pods get IPs directly from the VNet subnet |
| VNet Integration | ✅ Full | Pod IPs routable from the VNet |
| IP Address Limit | Limited by subnet size | Large subnet required |
| Network Policies | ✅ Supported | Azure Network Policy or Calico |
| Azure Integration | ✅ Full | Direct integration with Azure services |
| Complexity | ⚠️ Complex | More configuration required |

Comparison and Selection

kubenet vs Azure CNI Comparison:

| Feature | kubenet | Azure CNI | ATP Selection |
|---------|---------|-----------|---------------|
| Pod IP Management | Overlay network | VNet IP addresses | ✅ Azure CNI (VNet integration) |
| VNet Integration | Limited | Full | ✅ Azure CNI (required for ATP) |
| IP Address Limits | ~250 pods/node | Subnet size | ✅ Azure CNI (more IPs) |
| Network Policies | ✅ Supported | ✅ Supported | ✅ Azure CNI |
| Azure Services | ⚠️ Routing required | ✅ Direct access | ✅ Azure CNI |
| Setup Complexity | ✅ Simple | ⚠️ Complex | ✅ Azure CNI (accept complexity) |
| Multi-Tenancy | ⚠️ Limited | ✅ Better isolation | ✅ Azure CNI |

ATP Decision: Azure CNI - Required for multi-tenancy, VNet integration, direct Azure service access, and network isolation per tenant namespace.

Pulumi C# AKS Configuration with Azure CNI:

// infrastructure/AKS.cs
var aksCluster = new ManagedCluster("atp-production-aks", new ManagedClusterArgs
{
    ResourceGroupName = resourceGroup.Name,
    Location = location,
    DnsPrefix = "atp-prod",
    KubernetesVersion = "1.27.3",

    // Azure CNI networking
    NetworkProfile = new ManagedClusterNetworkProfileArgs
    {
        NetworkPlugin = NetworkPlugin.Azure,
        NetworkPolicy = NetworkPolicy.Azure,
        ServiceCidr = "10.2.0.0/16",  // Service CIDR
        DnsServiceIP = "10.2.0.10",
        PodCidr = null,  // Not used with Azure CNI
        LoadBalancerSku = LoadBalancerSku.Standard,
        OutboundType = OutboundType.LoadBalancer,
        LoadBalancerProfile = new ManagedClusterLoadBalancerProfileArgs
        {
            ManagedOutboundIPs = new ManagedClusterLoadBalancerProfileManagedOutboundIPsArgs
            {
                Count = 2
            }
        }
    },

    AgentPoolProfiles = new[]
    {
        new ManagedClusterAgentPoolProfileArgs
        {
            Name = "systempool",
            VmSize = "Standard_D4s_v3",
            Count = 3,
            OsType = "Linux",
            VnetSubnetId = subnet.Id,  // Subnet for pods (large enough)
            MaxPods = 50,
            Mode = AgentPoolMode.System,
            EnableAutoScaling = true,
            MinCount = 3,
            MaxCount = 10
        }
    }
});

Subnet Sizing for Azure CNI:

| Node Count | Pods per Node | Required Subnet Size | Example CIDR |
|------------|---------------|----------------------|--------------|
| 5 nodes | 50 pods | /23 (512 addresses) | 10.0.0.0/23 |
| 20 nodes | 50 pods | /21 (2,048 addresses) | 10.0.0.0/21 |
| 100 nodes | 50 pods | /19 (8,192 addresses) | 10.0.0.0/19 |

Subnet Calculation:

Required IPs = (Node count × Max pods per node) + Node count + 5 (Azure-reserved)
Example: (20 × 50) + 20 + 5 = 1,025 IPs → /21 subnet (2,048 addresses; a /22 provides only 1,024, one short)
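
The sizing rule can be wrapped in a small helper that also picks the smallest prefix whose address count covers the requirement (a sketch; the node counts and pod limits are illustrative):

```shell
#!/bin/bash
# Required IPs = nodes * max pods per node + nodes + 5 (Azure-reserved)
required_ips() { echo $(( $1 * $2 + $1 + 5 )); }

# Smallest /prefix whose size (2^(32-prefix)) covers the requirement
subnet_prefix() {
  local need=$1 prefix=32 size=1
  while [ "$size" -lt "$need" ]; do
    size=$(( size * 2 ))
    prefix=$(( prefix - 1 ))
  done
  echo "/${prefix}"
}

NEED=$(required_ips 20 50)
echo "20 nodes x 50 pods -> ${NEED} IPs -> $(subnet_prefix "$NEED")"  # 1025 IPs -> /21
```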


Ingress Controllers

NGINX Ingress Controller (ATP Choice)

NGINX Ingress Architecture:

graph TB
    subgraph "Internet"
        USER[Users]
    end
    subgraph "Azure Load Balancer"
        LB[Load Balancer<br/>Public IP]
    end
    subgraph "AKS Cluster"
        subgraph "ingress-nginx namespace"
            NGINX1[NGINX Pod 1<br/>Replica 1]
            NGINX2[NGINX Pod 2<br/>Replica 2]
            NGINX_SVC[NGINX Service<br/>LoadBalancer]
        end
        subgraph "Application Namespaces"
            APP1[ATP Ingestion<br/>Service]
            APP2[ATP Query<br/>Service]
            APP3[ATP Gateway<br/>Service]
        end
    end

    USER --> LB
    LB --> NGINX_SVC
    NGINX_SVC --> NGINX1
    NGINX_SVC --> NGINX2
    NGINX1 --> APP1
    NGINX1 --> APP2
    NGINX2 --> APP3
    NGINX2 --> APP1

    style NGINX1 fill:#90EE90
    style NGINX2 fill:#90EE90
    style APP1 fill:#FFE5B4

NGINX Ingress Installation via Helm:

#!/bin/bash
# scripts/install-nginx-ingress.sh

# Add NGINX Ingress Helm repository
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update

# Install NGINX Ingress Controller
helm install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx \
  --create-namespace \
  --set controller.replicaCount=2 \
  --set controller.nodeSelector."kubernetes\.io/os"=linux \
  --set controller.service.type=LoadBalancer \
  --set controller.service.annotations."service\.beta\.kubernetes\.io/azure-load-balancer-health-probe-request-path"=/healthz \
  --set controller.service.externalTrafficPolicy=Local \
  --set controller.resources.requests.cpu=100m \
  --set controller.resources.requests.memory=128Mi \
  --set controller.resources.limits.cpu=500m \
  --set controller.resources.limits.memory=512Mi \
  --set controller.metrics.enabled=true \
  --set controller.podSecurityPolicy.enabled=false

echo "✅ NGINX Ingress Controller installed"
echo "   Waiting for LoadBalancer IP..."
kubectl wait --namespace ingress-nginx \
  --for=condition=ready pod \
  --selector=app.kubernetes.io/component=controller \
  --timeout=300s

# Get LoadBalancer IP
EXTERNAL_IP=$(kubectl get svc ingress-nginx-controller -n ingress-nginx -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo "   External IP: ${EXTERNAL_IP}"

NGINX Ingress via FluxCD:

# platform/ingress-nginx/helmrelease.yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: ingress-nginx
  namespace: ingress-nginx
spec:
  interval: 5m
  chart:
    spec:
      chart: ingress-nginx
      sourceRef:
        kind: HelmRepository
        name: ingress-nginx
      interval: 1h
  values:
    controller:
      replicaCount: 2
      nodeSelector:
        kubernetes.io/os: linux
      service:
        type: LoadBalancer
        annotations:
          service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path: /healthz
        externalTrafficPolicy: Local
      resources:
        requests:
          cpu: 100m
          memory: 128Mi
        limits:
          cpu: 500m
          memory: 512Mi
      metrics:
        enabled: true
        serviceMonitor:
          enabled: true
      podSecurityPolicy:
        enabled: false

Azure Application Gateway Ingress (Alternative)

Azure Application Gateway Ingress Controller (AGIC):

# platform/application-gateway/helmrelease.yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: ingress-appgw
  namespace: ingress-appgw
spec:
  interval: 5m
  chart:
    spec:
      chart: ingress-azure
      sourceRef:
        kind: HelmRepository
        name: application-gateway-kubernetes-ingress
      interval: 1h
  values:
    appgw:
      subscriptionId: ${AZURE_SUBSCRIPTION_ID}
      resourceGroup: atp-production-rg
      name: atp-prod-appgw
      usePrivateIP: false
    armAuth:
      type: aadPodIdentity
      identityResourceID: /subscriptions/${AZURE_SUBSCRIPTION_ID}/resourcegroups/${RESOURCE_GROUP}/providers/Microsoft.ManagedIdentity/userAssignedIdentities/agic-identity
      identityClientID: ${AGIC_IDENTITY_CLIENT_ID}
    rbac:
      enabled: true

AGIC vs NGINX Comparison:

| Feature | NGINX Ingress | Azure Application Gateway | ATP Selection |
|---------|---------------|---------------------------|---------------|
| Cost | ✅ Lower | ⚠️ Higher (dedicated gateway) | ✅ NGINX |
| WAF | ⚠️ External (Cloudflare) | ✅ Built-in | ⚠️ NGINX (accept trade-off) |
| SSL Termination | ✅ Supported | ✅ Supported | ✅ Both |
| Path-based Routing | ✅ Supported | ✅ Supported | ✅ Both |
| Azure Integration | ⚠️ Basic | ✅ Full | ⚠️ NGINX (sufficient) |
| ATP Decision | ✅ Selected | ❌ Not selected | ✅ NGINX |

ATP Decision: NGINX Ingress Controller - Lower cost, sufficient features, simpler management, standard Kubernetes ingress.

Installation and Configuration

NGINX Ingress Configuration:

# platform/ingress-nginx/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
data:
  # Connection settings
  worker-processes: "auto"
  worker-connections: "16384"
  max-worker-open-files: "65535"

  # Timeouts
  proxy-connect-timeout: "60"
  proxy-send-timeout: "60"
  proxy-read-timeout: "60"

  # SSL
  ssl-protocols: "TLSv1.2 TLSv1.3"
  ssl-ciphers: "ECDHE-ECDSA-AES128-GCM-SHA256,ECDHE-RSA-AES128-GCM-SHA256"
  ssl-prefer-server-ciphers: "true"

  # Logging
  log-format-upstream: '$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" $request_length $request_time [$proxy_upstream_name] [$proxy_alternative_upstream_name] $upstream_addr $upstream_response_length $upstream_response_time $upstream_status $req_id'

  # Rate limiting
  enable-brotli: "true"
  use-forwarded-headers: "true"
  compute-full-forwarded-for: "true"

TLS Termination

TLS Termination in NGINX:

# apps/atp-gateway/ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: atp-gateway-ingress
  namespace: atp-production
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
    nginx.ingress.kubernetes.io/proxy-body-size: "10m"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - api.atp.connectsoft.example
    - gateway.atp.connectsoft.example
    secretName: atp-gateway-tls
  rules:
  - host: api.atp.connectsoft.example
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: atp-gateway
            port:
              number: 80

Certificate Management

cert-manager Overview

cert-manager Architecture:

graph TB
    subgraph "Kubernetes Cluster"
        INGRESS[Ingress<br/>TLS Secret]
        CERT_MGR[cert-manager<br/>Controller]
        CERT[cert-manager<br/>Certificate CRD]
        CLUSTER_ISSUER[ClusterIssuer<br/>Let's Encrypt]
    end
    subgraph "Let's Encrypt"
        LE[Let's Encrypt<br/>API]
        CHALLENGE[HTTP-01 Challenge]
    end
    subgraph "DNS"
        TXT[TXT Record<br/>DNS-01 Challenge]
    end

    INGRESS --> CERT
    CERT --> CERT_MGR
    CERT_MGR --> CLUSTER_ISSUER
    CLUSTER_ISSUER --> LE
    LE --> CHALLENGE
    LE --> TXT
    CERT_MGR --> CERT
    CERT --> INGRESS

    style CERT_MGR fill:#90EE90
    style CLUSTER_ISSUER fill:#FFE5B4

cert-manager Installation:

#!/bin/bash
# scripts/install-cert-manager.sh

# CRDs are installed by the Helm chart itself (installCRDs=true below),
# so a separate "kubectl apply" of the CRD manifest is not needed

# Add cert-manager Helm repository
helm repo add jetstack https://charts.jetstack.io
helm repo update

# Install cert-manager
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --version v1.13.0 \
  --set installCRDs=true \
  --set global.leaderElection.namespace=cert-manager \
  --set resources.requests.cpu=100m \
  --set resources.requests.memory=128Mi

echo "✅ cert-manager installed"
kubectl wait --for=condition=ready pod \
  --all -n cert-manager \
  --timeout=300s

cert-manager via FluxCD:

# platform/cert-manager/helmrelease.yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: cert-manager
  namespace: cert-manager
spec:
  interval: 5m
  chart:
    spec:
      chart: cert-manager
      sourceRef:
        kind: HelmRepository
        name: jetstack
      version: v1.13.0
  values:
    installCRDs: true
    global:
      leaderElection:
        namespace: cert-manager
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
    webhook:
      resources:
        requests:
          cpu: 50m
          memory: 64Mi

Let's Encrypt Integration

Let's Encrypt ClusterIssuer (HTTP-01 Challenge):

# platform/cert-manager/clusterissuer-letsencrypt-prod.yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: devops@connectsoft.example
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
    - http01:
        ingress:
          class: nginx
          podTemplate:
            spec:
              nodeSelector:
                kubernetes.io/os: linux

Let's Encrypt ClusterIssuer (DNS-01 Challenge for Wildcard):

# platform/cert-manager/clusterissuer-letsencrypt-dns.yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-dns
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: devops@connectsoft.example
    privateKeySecretRef:
      name: letsencrypt-dns
    solvers:
    - dns01:
        azureDNS:
          clientID: ${AZURE_CLIENT_ID}
          clientSecretSecretRef:
            name: azure-dns-secret
            key: client-secret
          subscriptionID: ${AZURE_SUBSCRIPTION_ID}
          tenantID: ${AZURE_TENANT_ID}
          resourceGroupName: atp-production-rg
          hostedZoneName: connectsoft.example
          environment: AzurePublicCloud

Let's Encrypt Staging ClusterIssuer:

# platform/cert-manager/clusterissuer-letsencrypt-staging.yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: devops@connectsoft.example
    privateKeySecretRef:
      name: letsencrypt-staging
    solvers:
    - http01:
        ingress:
          class: nginx

ClusterIssuer Configuration

ClusterIssuer Configuration Matrix:

| ClusterIssuer | Challenge Type | Use Case | Rate Limits |
|---------------|----------------|----------|-------------|
| letsencrypt-prod | HTTP-01 | Production domains | 50 certs/week/domain |
| letsencrypt-staging | HTTP-01 | Testing | 300 certs/week/domain |
| letsencrypt-dns | DNS-01 | Wildcard certificates | 50 certs/week/domain |

Certificate Resource:

# apps/atp-gateway/certificate.yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: atp-gateway-tls
  namespace: atp-production
spec:
  secretName: atp-gateway-tls
  issuerRef:
    name: letsencrypt-dns  # wildcard entries require a DNS-01 solver
    kind: ClusterIssuer
  commonName: api.atp.connectsoft.example
  dnsNames:
  - api.atp.connectsoft.example
  - gateway.atp.connectsoft.example
  - "*.atp.connectsoft.example"  # wildcards must be quoted in YAML
  duration: 2160h  # 90 days
  renewBefore: 720h  # Renew 30 days before expiration
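
cert-manager durations are hour-denominated; a quick conversion confirms the schedule: a 90-day certificate with `renewBefore: 720h` renews around day 60 (`hours_to_days` is just a local throwaway helper):

```shell
#!/bin/bash
# Convert cert-manager hour-denominated durations to days
hours_to_days() { echo $(( ${1%h} / 24 )); }

echo "duration:    $(hours_to_days 2160h) days"   # 90
echo "renewBefore: $(hours_to_days 720h) days"    # 30
echo "renewal at day $(( $(hours_to_days 2160h) - $(hours_to_days 720h) ))"  # 60
```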

Automatic Certificate Renewal

Certificate Renewal Flow:

sequenceDiagram
    participant Cert as Certificate
    participant CM as cert-manager
    participant LE as Let's Encrypt
    participant NGINX as NGINX Ingress

    Cert->>CM: Certificate expires in 30 days
    CM->>LE: Request renewal
    LE->>CM: Challenge request
    CM->>NGINX: Create challenge ingress
    NGINX->>LE: Serve challenge
    LE->>CM: Validate challenge
    CM->>LE: Get new certificate
    LE->>CM: Issue certificate
    CM->>Cert: Update TLS secret
    Cert->>NGINX: Reload with new cert

Certificate Status Check:

#!/bin/bash
# scripts/check-certificate-status.sh

NAMESPACE="${1:-all}"

echo "🔍 Certificate Status Check"
echo "============================"

if [ "${NAMESPACE}" = "all" ]; then
  CERTIFICATES=$(kubectl get certificates --all-namespaces -o json | \
    jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name)"')
else
  CERTIFICATES=$(kubectl get certificates -n "${NAMESPACE}" -o json | \
    jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name)"')
fi

for CERT in ${CERTIFICATES}; do
  NS=$(echo "${CERT}" | cut -d'/' -f1)
  NAME=$(echo "${CERT}" | cut -d'/' -f2)

  STATUS=$(kubectl get certificate "${NAME}" -n "${NS}" -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
  AGE=$(kubectl get certificate "${NAME}" -n "${NS}" -o jsonpath='{.metadata.creationTimestamp}')
  NOT_AFTER=$(kubectl get certificate "${NAME}" -n "${NS}" -o jsonpath='{.status.notAfter}')

  if [ "${STATUS}" = "True" ]; then
    echo "✅ ${NS}/${NAME}: Ready"
    if [ -n "${NOT_AFTER}" ]; then
      DAYS_UNTIL_EXPIRY=$(( ($(date -d "${NOT_AFTER}" +%s) - $(date +%s)) / 86400 ))
      echo "   Expires in: ${DAYS_UNTIL_EXPIRY} days"
    fi
  else
    echo "❌ ${NS}/${NAME}: Not Ready"
    kubectl describe certificate "${NAME}" -n "${NS}" | grep -A 5 "Status:"
  fi
done

Certificate Monitoring

Certificate Expiration Alert:

# monitoring/alerts/certificate-expiration.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: certificate-expiration
  namespace: monitoring
spec:
  groups:
  - name: certificate
    interval: 1h
    rules:
    - alert: CertificateExpiringSoon
      expr: (certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 < 30
      for: 1h
      labels:
        severity: warning
      annotations:
        summary: "Certificate expiring soon"
        description: "Certificate {{ $labels.name }} in namespace {{ $labels.namespace }} expires in {{ $value }} days"

    - alert: CertificateExpiringVerySoon
      expr: (certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 < 7
      for: 1h
      labels:
        severity: critical
      annotations:
        summary: "Certificate expiring very soon"
        description: "Certificate {{ $labels.name }} in namespace {{ $labels.namespace }} expires in {{ $value }} days"

Network Policies

Default Deny All Policy

Default Deny All Network Policy:

# platform/network-policies/default-deny-all.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: atp-production
spec:
  podSelector: {}  # Match all pods
  policyTypes:
  - Ingress
  - Egress
  # No rules = deny all traffic

Apply Default Deny to All Namespaces:

#!/bin/bash
# scripts/apply-default-deny-policy.sh

NAMESPACES=("atp-production" "atp-staging" "atp-test")

for NS in "${NAMESPACES[@]}"; do
  echo "Applying default deny policy to namespace: ${NS}"

  kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: ${NS}
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
EOF
done

echo "✅ Default deny policies applied"

Service-to-Service Allow Rules

Service-to-Service Communication:

# apps/atp-gateway/network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: atp-gateway-network-policy
  namespace: atp-production
spec:
  podSelector:
    matchLabels:
      app: atp-gateway
  policyTypes:
  - Ingress
  - Egress

  ingress:
  # Allow from the ingress controller (namespaceSelector and podSelector are
  # combined in one "from" entry so both must match; separate entries would OR them)
  - from:
    - namespaceSelector:
        matchLabels:
          name: ingress-nginx
      podSelector:
        matchLabels:
          app.kubernetes.io/name: ingress-nginx
    ports:
    - protocol: TCP
      port: 8080

  # Allow from other ATP services
  - from:
    - podSelector:
        matchLabels:
          app: atp-ingestion
    - podSelector:
        matchLabels:
          app: atp-query
    ports:
    - protocol: TCP
      port: 8080

  egress:
  # Allow to ATP services
  - to:
    - podSelector:
        matchLabels:
          app: atp-ingestion
    - podSelector:
        matchLabels:
          app: atp-query
    ports:
    - protocol: TCP
      port: 8080

  # Allow to external services (database, Redis, etc.)
  - to:
    - ipBlock:
        cidr: 10.0.0.0/16  # Azure VNet
    ports:
    - protocol: TCP
      port: 5432  # PostgreSQL
    - protocol: TCP
      port: 6380  # Redis

Ingress and Egress Rules

Ingress Allow Rules:

# apps/atp-ingestion/network-policy-ingress.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: atp-ingestion-allow-ingress
  namespace: atp-production
spec:
  podSelector:
    matchLabels:
      app: atp-ingestion
  policyTypes:
  - Ingress

  ingress:
  # Allow from gateway
  - from:
    - podSelector:
        matchLabels:
          app: atp-gateway
    ports:
    - protocol: TCP
      port: 8080

  # Allow from query service
  - from:
    - podSelector:
        matchLabels:
          app: atp-query
    ports:
    - protocol: TCP
      port: 8080

Egress Allow Rules:

# apps/atp-ingestion/network-policy-egress.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: atp-ingestion-allow-egress
  namespace: atp-production
spec:
  podSelector:
    matchLabels:
      app: atp-ingestion
  policyTypes:
  - Egress

  egress:
  # Allow DNS
  - to:
    - namespaceSelector:
        matchLabels:
          name: kube-system
    ports:
    - protocol: UDP
      port: 53

  # Allow to database
  - to:
    - ipBlock:
        cidr: 10.0.2.0/24  # Database subnet
    ports:
    - protocol: TCP
      port: 5432

  # Allow to Redis
  - to:
    - ipBlock:
        cidr: 10.0.3.0/24  # Redis subnet
    ports:
    - protocol: TCP
      port: 6380

  # Allow to Azure Service Bus (public endpoint; tighten with the ServiceBus
  # service tag at the NSG level where possible)
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0
    ports:
    - protocol: TCP
      port: 5671  # AMQP over TLS
    - protocol: TCP
      port: 443   # AMQP over WebSockets / HTTPS

DNS Exceptions

DNS Exception in Network Policy:

# platform/network-policies/dns-exception.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: atp-production
spec:
  podSelector: {}
  policyTypes:
  - Egress

  egress:
  # Allow DNS queries to kube-dns only (selectors combined so both must match)
  - to:
    - namespaceSelector:
        matchLabels:
          name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53

Monitoring and Logging Exceptions

Monitoring Exception:

# platform/network-policies/monitoring-exception.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-monitoring
  namespace: atp-production
spec:
  podSelector: {}
  policyTypes:
  - Egress

  egress:
  # Allow to Prometheus (selectors combined so both must match)
  - to:
    - namespaceSelector:
        matchLabels:
          name: monitoring
      podSelector:
        matchLabels:
          app: prometheus
    ports:
    - protocol: TCP
      port: 9090

  # Allow to Grafana (selectors combined so both must match)
  - to:
    - namespaceSelector:
        matchLabels:
          name: monitoring
      podSelector:
        matchLabels:
          app: grafana
    ports:
    - protocol: TCP
      port: 3000

  # Allow to Azure Monitor (Log Analytics)
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0
    ports:
    - protocol: TCP
      port: 443

Service Mesh Options

Linkerd (Lightweight, ATP Preference)

Linkerd Architecture:

graph TB
    subgraph "Service A Pod"
        APP_A[Application]
        PROXY_A[Linkerd Proxy<br/>sidecar]
    end
    subgraph "Service B Pod"
        APP_B[Application]
        PROXY_B[Linkerd Proxy<br/>sidecar]
    end
    subgraph "Linkerd Control Plane"
        DEST[destination]
        IDENTITY[identity]
        PROXY_INJECTOR[proxy-injector]
    end

    APP_A <--> PROXY_A
    APP_B <--> PROXY_B
    PROXY_A <--mTLS--> PROXY_B
    PROXY_A --> DEST
    PROXY_B --> DEST
    PROXY_A --> IDENTITY
    PROXY_B --> IDENTITY

    style PROXY_A fill:#90EE90
    style PROXY_B fill:#90EE90
    style DEST fill:#FFE5B4

Linkerd Installation:

#!/bin/bash
# scripts/install-linkerd.sh

# Install Linkerd CLI
curl -sL https://run.linkerd.io/install-edge | sh
export PATH=$PATH:$HOME/.linkerd2/bin

# Verify installation
linkerd version --client

# Check cluster prerequisites
linkerd check --pre

# Install Linkerd control plane
linkerd install | kubectl apply -f -

# Wait for control plane to be ready
linkerd check

# Install Linkerd Viz (observability)
linkerd viz install | kubectl apply -f -

# Install Linkerd Multicluster (if needed)
# linkerd multicluster install | kubectl apply -f -

echo "✅ Linkerd installed"

Linkerd via FluxCD:

# platform/linkerd/helmrelease.yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: linkerd-control-plane
  namespace: linkerd
spec:
  interval: 5m
  chart:
    spec:
      chart: linkerd-control-plane
      sourceRef:
        kind: HelmRepository
        name: linkerd
      version: 1.14.0
  values:
    identity:
      issuer:
        tls:
          crtPEM: |
            # Issuer certificate PEM
          keyPEM: |
            # Issuer key PEM
    proxyInjector:
      enabled: true
    destination:
      enabled: true

Istio (Feature-Rich, Complex)

Istio vs Linkerd Comparison:

| Feature | Linkerd | Istio | ATP Selection |
|---------|---------|-------|---------------|
| Size | ✅ Lightweight (~50MB) | ⚠️ Heavy (~500MB) | ✅ Linkerd |
| Learning Curve | ✅ Simple | ⚠️ Complex | ✅ Linkerd |
| mTLS | ✅ Automatic | ✅ Automatic | ✅ Linkerd |
| Traffic Management | ✅ Supported | ✅ Rich features | ⚠️ Linkerd (sufficient) |
| Observability | ✅ Built-in | ✅ Built-in | ✅ Linkerd |
| Resource Usage | ✅ Low | ⚠️ High | ✅ Linkerd |
| ATP Decision | ✅ Selected | ❌ Not selected | ✅ Linkerd |

ATP Decision: Linkerd - Lightweight, simple, sufficient features, low resource usage, better fit for ATP's requirements.

Open Service Mesh (Azure-Native)

Open Service Mesh (OSM) Overview:

| Feature | OSM | Linkerd | ATP Selection |
|---------|-----|---------|---------------|
| Azure Integration | ✅ Native | ⚠️ Generic | ⚠️ Linkerd (sufficient) |
| Maturity | ⚠️ Newer | ✅ Mature | ✅ Linkerd |
| Community | ⚠️ Smaller | ✅ Large | ✅ Linkerd |
| Features | ✅ Good | ✅ Good | ✅ Linkerd |

ATP Decision: Linkerd - More mature, larger community, proven in production, sufficient Azure integration.

Comparison and Selection

Service Mesh Selection Matrix:

| Criteria | Weight | Linkerd | Istio | OSM | Winner |
|----------|--------|---------|-------|-----|--------|
| Simplicity | High | 9 | 4 | 7 | ✅ Linkerd |
| Resource Usage | High | 9 | 5 | 7 | ✅ Linkerd |
| Features | Medium | 7 | 9 | 7 | ⚠️ Istio |
| Maturity | High | 9 | 9 | 6 | ✅ Linkerd/Istio |
| ATP Decision | — | ✅ Selected | — | — | Linkerd |

mTLS Between Services

Automatic mTLS with Service Mesh

Linkerd Automatic mTLS:

# Linkerd automatically enables mTLS for all injected services
# No configuration required - works out of the box

# Example: Service with Linkerd proxy injection
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atp-ingestion
  namespace: atp-production
spec:
  template:
    metadata:
      annotations:
        linkerd.io/inject: enabled
    spec:
      containers:
      - name: atp-ingestion
        image: connectsoft.azurecr.io/atp/ingestion:v1.2.3
        # Linkerd proxy automatically injected as sidecar

Verify mTLS Status:

# Check success rates and traffic stats for all deployments
linkerd viz stat deploy -n atp-production

# Show which edges between workloads are secured by mTLS
linkerd viz edges deploy -n atp-production

# Tap live traffic (each request is annotated with tls=true/false)
linkerd viz tap deploy/atp-gateway -n atp-production

Certificate Rotation

Linkerd Certificate Rotation:

#!/bin/bash
# scripts/rotate-linkerd-certificates.sh
# Sketch: rotates the Linkerd issuer certificate with the smallstep CLI,
# assuming the trust anchor files ca.crt / ca.key are available locally.

echo "🔄 Rotating Linkerd issuer certificate..."

# Generate a new issuer certificate signed by the existing trust anchor
step certificate create identity.linkerd.cluster.local issuer.crt issuer.key \
  --profile intermediate-ca --not-after 8760h \
  --ca ca.crt --ca-key ca.key --no-password --insecure

# Roll out the new issuer credentials
linkerd upgrade \
  --identity-issuer-certificate-file=issuer.crt \
  --identity-issuer-key-file=issuer.key | kubectl apply -f -

# Verify rotation
linkerd check --proxy

echo "✅ Linkerd issuer certificate rotated"

Automatic Certificate Rotation:

Linkerd automatically rotates the proxies' leaf certificates before expiration (default validity: 24 hours). The trust anchor and issuer certificates are not rotated automatically; rotate them manually as above, or delegate issuance to cert-manager.

# Linkerd issuer credentials (stored in a Secret, not a ConfigMap)
apiVersion: v1
kind: Secret
metadata:
  name: linkerd-identity-issuer
  namespace: linkerd
data:
  crt.pem: <base64-encoded issuer certificate>
  key.pem: <base64-encoded issuer key>

Identity and Authorization

Linkerd Authorization Policy:

# apps/atp-gateway/authorization-policy.yaml
apiVersion: policy.linkerd.io/v1beta1
kind: Server
metadata:
  name: atp-gateway-server
  namespace: atp-production
spec:
  podSelector:
    matchLabels:
      app: atp-gateway
  port: 8080
  proxyProtocol: HTTP/1
---
apiVersion: policy.linkerd.io/v1alpha1
kind: MeshTLSAuthentication
metadata:
  name: atp-mesh-mtls
  namespace: atp-production
spec:
  identities:
  - "*.atp-production.serviceaccount.identity.linkerd.cluster.local"
---
apiVersion: policy.linkerd.io/v1alpha1
kind: AuthorizationPolicy
metadata:
  name: atp-gateway-authz
  namespace: atp-production
spec:
  targetRef:
    group: policy.linkerd.io
    kind: Server
    name: atp-gateway-server
  requiredAuthenticationRefs:  # Only mTLS-verified mesh identities are authorized
  - group: policy.linkerd.io
    kind: MeshTLSAuthentication
    name: atp-mesh-mtls
  # Network-based restrictions (e.g. VNet CIDR 10.0.0.0/16) use a
  # separate NetworkAuthentication resource referenced the same way.

Traffic Management

Canary Routing with Service Mesh

Linkerd TrafficSplit for Canary:

# apps/atp-ingestion/canary-trafficsplit.yaml
apiVersion: split.smi-spec.io/v1alpha1
kind: TrafficSplit
metadata:
  name: atp-ingestion-canary
  namespace: atp-production
spec:
  service: atp-ingestion
  backends:
  - service: atp-ingestion-stable
    weight: 90  # 90% traffic to stable
  - service: atp-ingestion-canary
    weight: 10  # 10% traffic to canary
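
The weight-based routing a TrafficSplit performs can be sketched as a weighted random choice over backends. A simulation (not Linkerd's implementation) using the 90/10 split above:

```python
import random

# Backends and weights mirroring the TrafficSplit manifest above
backends = [("atp-ingestion-stable", 90), ("atp-ingestion-canary", 10)]

def pick_backend(rng, backends):
    """Pick a backend with probability proportional to its weight."""
    total = sum(weight for _, weight in backends)
    point = rng.uniform(0, total)
    for name, weight in backends:
        if point < weight:
            return name
        point -= weight
    return backends[-1][0]  # Guard against floating-point edge cases

rng = random.Random(42)  # Seeded for reproducibility
counts = {"atp-ingestion-stable": 0, "atp-ingestion-canary": 0}
for _ in range(10_000):
    counts[pick_backend(rng, backends)] += 1
print(counts)
```

Over 10,000 simulated requests, roughly 10% land on the canary backend, as the weights dictate.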

Canary Deployment Strategy:

graph TB
    INGRESS[Ingress<br/>100% Traffic]
    TRAFFIC_SPLIT[TrafficSplit<br/>90/10 Split]
    STABLE[Stable Service<br/>90% Traffic]
    CANARY[Canary Service<br/>10% Traffic]

    INGRESS --> TRAFFIC_SPLIT
    TRAFFIC_SPLIT --> STABLE
    TRAFFIC_SPLIT --> CANARY

    style TRAFFIC_SPLIT fill:#FFE5B4
    style STABLE fill:#90EE90
    style CANARY fill:#FFB6C1

Circuit Breakers

Linkerd Circuit Breaking (Failure Accrual):

Linkerd does not express circuit breaking in ServiceProfile resources. Since Linkerd 2.13 it is configured as "failure accrual" via annotations on the destination Service:

# apps/atp-ingestion/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: atp-ingestion
  namespace: atp-production
  annotations:
    balancer.linkerd.io/failure-accrual: "consecutive"  # Eject endpoint after consecutive failures
    balancer.linkerd.io/failure-accrual-consecutive-max-failures: "5"
    balancer.linkerd.io/failure-accrual-consecutive-min-penalty: "30s"  # Initial back-off
    balancer.linkerd.io/failure-accrual-consecutive-max-penalty: "60s"  # Back-off ceiling
spec:
  selector:
    app: atp-ingestion
  ports:
  - port: 8080
    targetPort: 8080

Retry Policies

Linkerd Retry Policy:

# apps/atp-gateway/retry-policy.yaml
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  name: atp-ingestion.atp-production.svc.cluster.local  # Name must be the service FQDN
  namespace: atp-production
spec:
  routes:
  - name: default
    condition:
      method: POST
      pathRegex: "/api/ingestion"
    isRetryable: true  # Retrying a POST assumes the handler is idempotent
    timeout: 30s
  retryBudget:  # Budget applies to the whole profile, not per-route
    retryRatio: 0.2  # Retries may add at most 20% extra load
    minRetriesPerSecond: 10
    ttl: 10s
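
A retry budget caps retry-generated load relative to regular traffic: roughly a floor of minRetriesPerSecond plus retryRatio times the observed request rate. Illustrative arithmetic (not Linkerd's internal algorithm):

```python
def allowed_retries_per_second(request_rate, retry_ratio=0.2,
                               min_retries_per_second=10):
    """Approximate retry allowance under a retry budget.

    The budget permits retries worth retry_ratio of the regular
    request rate, plus a floor of min_retries_per_second so that
    low-traffic services can still retry.
    """
    return min_retries_per_second + retry_ratio * request_rate

# At 500 rps, the budget above allows roughly 110 retries/second
print(allowed_retries_per_second(500))
```

The floor matters for quiet services: at zero regular traffic the budget still permits 10 retries per second.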

Timeout Configuration

Linkerd Timeout Policy:

# apps/atp-gateway/timeout-policy.yaml
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  name: atp-query.atp-production.svc.cluster.local  # Name must be the service FQDN
  namespace: atp-production
spec:
  routes:
  - name: query-route
    condition:
      method: GET
      pathRegex: "/api/query/.*"
    timeout: 5s  # 5 second timeout
  - name: export-route
    condition:
      method: GET
      pathRegex: "/api/export/.*"
    timeout: 60s  # 60 second timeout for exports

Observability with Service Mesh

Distributed Tracing

Linkerd Distributed Tracing:

Recent Linkerd versions enable tracing through the linkerd-jaeger extension (`linkerd jaeger install | kubectl apply -f -`). The ConfigMap below is a sketch of the equivalent proxy tracing settings:

# platform/linkerd/tracing-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: linkerd-config
  namespace: linkerd
data:
  config.yaml: |
    tracing:
      enabled: true
      collectorSvcAddr: "jaeger-collector.monitoring:14268"
      collectorSvcAccount: "linkerd-collector"

Linkerd + Jaeger Integration:

# platform/linkerd/jaeger-integration.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger-collector
  namespace: monitoring
spec:
  selector:  # Deployments require a selector matching the pod labels
    matchLabels:
      app: jaeger-collector
  template:
    metadata:
      labels:
        app: jaeger-collector
    spec:
      containers:
      - name: jaeger-collector
        image: jaegertracing/jaeger-collector:1.54  # Pin a version instead of :latest
        env:
        - name: SPAN_STORAGE_TYPE
          value: "elasticsearch"
        - name: ES_SERVER_URLS
          value: "http://elasticsearch.monitoring:9200"

Metrics and Dashboards

Linkerd Metrics:

# View service metrics
linkerd viz stat deploy -n atp-production

# View top services
linkerd viz top deploy -n atp-production

# View per-route metrics (requires a ServiceProfile)
linkerd viz routes deploy/atp-ingestion -n atp-production

Linkerd Grafana Dashboard:

# platform/linkerd/grafana-dashboard.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: linkerd-dashboard
  namespace: monitoring
data:
  linkerd-dashboard.json: |
    {
      "dashboard": {
        "title": "Linkerd Service Mesh",
        "panels": [
          {
            "title": "Request Rate",
            "targets": [
              {
                "expr": "sum(rate(linkerd_proxy_http_requests_total{deployment=\"$deployment\"}[1m]))",
                "legendFormat": "{{deployment}}"
              }
            ]
          },
          {
            "title": "P50 Latency",
            "targets": [
              {
                "expr": "histogram_quantile(0.5, sum(rate(linkerd_proxy_http_request_duration_seconds_bucket{deployment=\"$deployment\"}[1m])) by (le, deployment))",
                "legendFormat": "{{deployment}}"
              }
            ]
          }
        ]
      }
    }

Service Topology Visualization

Linkerd Viz (Topology View):

# Open Linkerd Viz dashboard
linkerd viz dashboard

# View service topology
linkerd viz edges deploy -n atp-production

# Tap live traffic
linkerd viz tap deploy/atp-gateway -n atp-production

Service Mesh GitOps Integration

Mesh Configuration in Git

Linkerd Configuration in GitOps:

atp-gitops/
├── platform/
│   ├── linkerd/
│   │   ├── kustomization.yaml
│   │   ├── control-plane.yaml
│   │   ├── service-profiles/
│   │   │   ├── atp-gateway.yaml
│   │   │   ├── atp-ingestion.yaml
│   │   │   └── atp-query.yaml
│   │   ├── authorization-policies/
│   │   │   └── default-policy.yaml
│   │   └── trafficsplits/
│   │       └── canary-split.yaml

Linkerd Kustomization:

# platform/linkerd/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - control-plane.yaml
  - service-profiles/
  - authorization-policies/
  - trafficsplits/

commonLabels:
  managed-by: kustomize

TrafficSplit Resources

TrafficSplit in GitOps:

# apps/atp-ingestion/overlays/production/trafficsplit.yaml
apiVersion: split.smi-spec.io/v1alpha1
kind: TrafficSplit
metadata:
  name: atp-ingestion-split
  namespace: atp-production
spec:
  service: atp-ingestion
  backends:
  - service: atp-ingestion-v1
    weight: 90
  - service: atp-ingestion-v2
    weight: 10

FluxCD Kustomization for TrafficSplit:

# clusters/production/kustomizations/apps-production.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-production
  namespace: flux-system
spec:
  interval: 5m
  path: ./apps/atp-ingestion/overlays/production
  sourceRef:
    kind: GitRepository
    name: atp-gitops-production
  # TrafficSplit resources included in path

SMI (Service Mesh Interface)

SMI Resources Supported by Linkerd:

| SMI Resource | Linkerd Support | Use Case |
|---|---|---|
| TrafficSplit | ✅ Supported | Canary deployments |
| TrafficTarget | ✅ Supported | Access control |
| HTTPRouteGroup | ✅ Supported | HTTP routing rules |
| TCPRoute | ⚠️ Limited | TCP routing |

SMI TrafficTarget Example:

# apps/atp-gateway/smi-traffic-target.yaml
apiVersion: access.smi-spec.io/v1alpha3
kind: TrafficTarget
metadata:
  name: atp-gateway-to-ingestion
  namespace: atp-production
spec:
  destination:
    kind: ServiceAccount
    name: atp-ingestion
    namespace: atp-production
  sources:
  - kind: ServiceAccount
    name: atp-gateway
    namespace: atp-production
  rules:
  - kind: HTTPRouteGroup
    name: atp-ingestion-routes
    matches:
    - ingestion-api
---
apiVersion: specs.smi-spec.io/v1alpha4
kind: HTTPRouteGroup
metadata:
  name: atp-ingestion-routes
  namespace: atp-production
spec:
  matches:
  - name: ingestion-api
    methods:
    - GET
    - POST
    pathRegex: "/api/ingestion/.*"
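
The TrafficTarget above authorizes only requests matching the HTTPRouteGroup. The match semantics (method listed, full path matching the regex) can be sketched as:

```python
import re

# Route definition mirroring the atp-ingestion-routes resource above
route = {
    "name": "ingestion-api",
    "methods": ["GET", "POST"],
    "pathRegex": r"/api/ingestion/.*",
}

def request_matches(route, method, path):
    # A request matches when its method is listed and the entire
    # path matches the route's regex (fullmatch, not search).
    return method in route["methods"] and re.fullmatch(route["pathRegex"], path) is not None

print(request_matches(route, "POST", "/api/ingestion/events"))    # matches
print(request_matches(route, "DELETE", "/api/ingestion/events"))  # method not listed
```

Requests failing the match (wrong method or path) fall outside the TrafficTarget and are denied under a default-deny policy.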

Multi-Cluster Networking

VNet Peering Between Environments

VNet Peering Configuration:

// infrastructure/VNetPeering.cs
using Pulumi;
using Pulumi.AzureNative.Network;

public class VNetPeering
{
    public static VirtualNetworkPeering CreatePeering(
        VirtualNetwork sourceVNet,
        VirtualNetwork targetVNet,
        ResourceGroup resourceGroup,
        string peeringName)
    {
        return new VirtualNetworkPeering($"peering-{peeringName}", new VirtualNetworkPeeringArgs
        {
            ResourceGroupName = resourceGroup.Name,
            VirtualNetworkName = sourceVNet.Name,
            RemoteVirtualNetworkId = targetVNet.Id,
            AllowVirtualNetworkAccess = true,
            AllowForwardedTraffic = true,
            AllowGatewayTransit = false,
            UseRemoteGateways = false
        });
    }
}

VNet Peering Between Production and Staging:

#!/bin/bash
# scripts/create-vnet-peering.sh

SOURCE_RG="${1}"
SOURCE_VNET="${2}"
TARGET_RG="${3}"
TARGET_VNET="${4}"

echo "🔗 Creating VNet peering: ${SOURCE_VNET} <-> ${TARGET_VNET}"

# Get VNet IDs
SOURCE_VNET_ID=$(az network vnet show \
  --resource-group "${SOURCE_RG}" \
  --name "${SOURCE_VNET}" \
  --query id -o tsv)

TARGET_VNET_ID=$(az network vnet show \
  --resource-group "${TARGET_RG}" \
  --name "${TARGET_VNET}" \
  --query id -o tsv)

# Create peering from source to target
az network vnet peering create \
  --resource-group "${SOURCE_RG}" \
  --name "${SOURCE_VNET}-to-${TARGET_VNET}" \
  --vnet-name "${SOURCE_VNET}" \
  --remote-vnet "${TARGET_VNET_ID}" \
  --allow-vnet-access \
  --allow-forwarded-traffic

# Create peering from target to source
az network vnet peering create \
  --resource-group "${TARGET_RG}" \
  --name "${TARGET_VNET}-to-${SOURCE_VNET}" \
  --vnet-name "${TARGET_VNET}" \
  --remote-vnet "${SOURCE_VNET_ID}" \
  --allow-vnet-access \
  --allow-forwarded-traffic

echo "✅ VNet peering created"

Azure Virtual WAN

Virtual WAN Architecture:

graph TB
    subgraph "Virtual WAN Hub"
        VWAN[Azure Virtual WAN<br/>Hub]
    end
    subgraph "Production VNet"
        PROD_VNET[Production VNet<br/>10.0.0.0/16]
        PROD_AKS[Production AKS]
    end
    subgraph "Staging VNet"
        STAGE_VNET[Staging VNet<br/>10.1.0.0/16]
        STAGE_AKS[Staging AKS]
    end
    subgraph "On-Premises"
        ONPREM[On-Premises<br/>Network]
    end

    PROD_VNET --> VWAN
    STAGE_VNET --> VWAN
    ONPREM --> VWAN
    VWAN --> PROD_VNET
    VWAN --> STAGE_VNET
    VWAN --> ONPREM

    style VWAN fill:#90EE90

Virtual WAN Configuration (Pulumi C# concept):

// infrastructure/VirtualWAN.cs
var virtualWan = new VirtualWan("atp-vwan", new VirtualWanArgs
{
    ResourceGroupName = resourceGroup.Name,
    Location = location,
    Type = "Standard",
    AllowBranchToBranchTraffic = true,
    AllowVnetToVnetTraffic = true
});

var virtualHub = new VirtualHub("atp-vhub", new VirtualHubArgs
{
    ResourceGroupName = resourceGroup.Name,
    Location = location,
    VirtualWanId = virtualWan.Id,
    AddressPrefix = "10.100.0.0/24"
});

Cross-Cluster Service Discovery

Linkerd Multi-Cluster Service Discovery:

#!/bin/bash
# scripts/setup-linkerd-multicluster.sh

# Install the Linkerd multicluster extension on both clusters
linkerd --context=production multicluster install | kubectl --context=production apply -f -
linkerd --context=staging multicluster install | kubectl --context=staging apply -f -

# Generate Link credentials from the staging cluster and apply them
# to the production cluster, which will mirror staging's services
linkerd --context=staging multicluster link --cluster-name staging | \
  kubectl --context=production apply -f -

# Verify multicluster status
linkerd --context=production multicluster check

echo "✅ Multi-cluster service discovery configured"

Service Export/Import (Kubernetes Multi-Cluster Services):

# apps/atp-gateway/service-export.yaml
apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ServiceExport
metadata:
  name: atp-gateway
  namespace: atp-production
spec: {}

---
# In staging cluster: Service Import
apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ServiceImport
metadata:
  name: atp-gateway-production
  namespace: atp-staging
spec:
  type: ClusterSetIP
  ports:
  - port: 8080
    protocol: TCP

Multi-Cluster Mesh

Linkerd Multi-Cluster Mesh:

graph TB
    subgraph "Production Cluster"
        PROD_CTRL[Linkerd Control Plane]
        PROD_SVC[ATP Services]
    end
    subgraph "Staging Cluster"
        STAGE_CTRL[Linkerd Control Plane]
        STAGE_SVC[ATP Services]
    end
    subgraph "Linkerd Multicluster"
        GATEWAY[Service Mirror<br/>Gateway]
    end

    PROD_CTRL <--> GATEWAY
    STAGE_CTRL <--> GATEWAY
    PROD_SVC <--mTLS--> STAGE_SVC

    style GATEWAY fill:#FFE5B4

Multi-Cluster Mesh Configuration:

# platform/linkerd/multicluster/link-staging.yaml
# A Link resource is normally generated by `linkerd multicluster link`
# rather than hand-authored; the fields below are a sketch.
apiVersion: multicluster.linkerd.io/v1alpha1
kind: Link
metadata:
  name: staging
  namespace: linkerd-multicluster
spec:
  targetClusterName: staging
  targetClusterDomain: cluster.local
  clusterCredentialsSecret: cluster-credentials-staging
  gatewayAddress: <staging-gateway-lb-address>
  gatewayPort: "4143"
  gatewayIdentity: linkerd-gateway.linkerd-multicluster.serviceaccount.identity.linkerd.cluster.local

Summary: Networking & Service Mesh

  • AKS Networking Models: Azure CNI selected (VNet integration, multi-tenancy), kubenet comparison, subnet sizing for Azure CNI
  • Ingress Controllers: NGINX Ingress Controller (ATP choice), Azure Application Gateway comparison, installation and configuration, TLS termination
  • Certificate Management: cert-manager overview, Let's Encrypt integration (HTTP-01, DNS-01), ClusterIssuer configuration, automatic certificate renewal, certificate monitoring
  • Network Policies: Default deny all policy, service-to-service allow rules, ingress and egress rules, DNS exceptions, monitoring and logging exceptions
  • Service Mesh Options: Linkerd selected (lightweight, ATP preference), Istio comparison, Open Service Mesh comparison, selection matrix
  • mTLS Between Services: Automatic mTLS with Linkerd, certificate rotation, identity and authorization policies
  • Traffic Management: Canary routing with TrafficSplit, circuit breakers, retry policies, timeout configuration
  • Observability with Service Mesh: Distributed tracing (Jaeger), metrics and dashboards (Linkerd Viz), service topology visualization
  • Service Mesh GitOps Integration: Mesh configuration in Git, TrafficSplit resources, SMI (Service Mesh Interface) support
  • Multi-Cluster Networking: VNet peering between environments, Azure Virtual WAN, cross-cluster service discovery, multi-cluster mesh with Linkerd

Storage & StatefulSets in GitOps

Purpose: Define storage architecture, PersistentVolumes and PersistentVolumeClaims, StatefulSet deployment patterns, database deployments, backup and restore procedures, volume snapshots, data migration strategies, and disaster recovery for persistent data in ATP's GitOps deployments, ensuring reliable, scalable, and recoverable stateful workloads.


Persistent Volumes (PV) and Claims (PVC)

PV and PVC Concepts

Persistent Volume (PV) vs Persistent Volume Claim (PVC):

graph TB
    subgraph "Storage Provider"
        AZDISK[Azure Disk<br/>or Azure Files]
    end
    subgraph "Kubernetes Cluster"
        PV[PersistentVolume<br/>Cluster Resource]
        PVC[PersistentVolumeClaim<br/>Namespace Resource]
        POD[Pod<br/>Application]
    end

    AZDISK --> PV
    PVC --> PV
    POD --> PVC

    style PV fill:#FFE5B4
    style PVC fill:#90EE90
    style POD fill:#87CEEB

PV and PVC Relationship:

| Resource | Scope | Purpose | Managed By |
|---|---|---|---|
| PersistentVolume (PV) | Cluster-wide | Represents actual storage | Admin/Storage Provisioner |
| PersistentVolumeClaim (PVC) | Namespace | Request for storage | Developer/Application |
| StorageClass | Cluster-wide | Defines storage provisioner | Admin |

PVC Lifecycle:

  1. Create PVC → Kubernetes matches an available PV or dynamically provisions one
  2. Bind → PVC bound to PV
  3. Use → Pod mounts PVC
  4. Release → PVC deleted; PV enters the Released state
  5. Reclaim → PV handled according to its reclaim policy (Retain or Delete)

Dynamic Provisioning

Dynamic Provisioning Flow:

sequenceDiagram
    participant Dev as Developer
    participant K8s as Kubernetes API
    participant SC as StorageClass
    participant Prov as Provisioner
    participant Azure as Azure Disk/Files
    participant Pod as Pod

    Dev->>K8s: Create PVC
    K8s->>SC: Match StorageClass
    SC->>Prov: Provision volume
    Prov->>Azure: Create Azure Disk/File
    Azure-->>Prov: Volume created
    Prov->>K8s: Create PV
    K8s->>K8s: Bind PVC to PV
    Dev->>K8s: Create Pod with PVC
    K8s->>Pod: Mount volume

Dynamic Provisioning Example:

# apps/atp-ingestion/pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: atp-ingestion-data
  namespace: atp-production
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: managed-premium  # Triggers dynamic provisioning
  resources:
    requests:
      storage: 100Gi

Static vs Dynamic Provisioning:

| Provisioning Type | Use Case | ATP Preference |
|---|---|---|
| Static | Pre-created PVs, manual management | ❌ Not used |
| Dynamic | On-demand PV creation via StorageClass | ✅ Preferred |

ATP Decision: Use dynamic provisioning for all workloads - simpler, scalable, automated.

Storage Classes

StorageClass Definition:

# platform/storage/storageclass-premium.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-premium
  annotations:
    storageclass.kubernetes.io/is-default-class: "false"
provisioner: disk.csi.azure.com
parameters:
  skuname: Premium_LRS  # Premium SSD
  kind: managed
  cachingMode: ReadOnly
  diskEncryptionSetID: /subscriptions/${SUBSCRIPTION_ID}/resourceGroups/${RG}/providers/Microsoft.Compute/diskEncryptionSets/atp-disk-encryption
allowVolumeExpansion: true
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer  # Wait until pod is scheduled

StorageClass Options:

# Standard HDD
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-standard
provisioner: disk.csi.azure.com
parameters:
  skuname: Standard_LRS  # Standard HDD
  kind: managed
reclaimPolicy: Delete
volumeBindingMode: Immediate

---
# Premium SSD
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-premium
provisioner: disk.csi.azure.com
parameters:
  skuname: Premium_LRS  # Premium SSD
  kind: managed
reclaimPolicy: Retain

---
# Azure Files (SMB)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azurefile-csi
provisioner: file.csi.azure.com
parameters:
  skuname: Premium_LRS  # Premium Files
  storageAccount: atpstorageaccount  # Optional: specific storage account
reclaimPolicy: Delete
allowVolumeExpansion: true

---
# Azure Files (NFS)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azurefile-csi-nfs
provisioner: file.csi.azure.com
parameters:
  protocol: nfs
  skuname: Premium_LRS
reclaimPolicy: Delete

Access Modes

PVC Access Modes:

| Access Mode | Description | Use Case | Supported By |
|---|---|---|---|
| ReadWriteOnce (RWO) | Single node read-write | Single pod, databases | Azure Disk |
| ReadOnlyMany (ROX) | Multiple nodes read-only | Shared config, read-only data | Azure Files |
| ReadWriteMany (RWX) | Multiple nodes read-write | Shared storage, file shares | Azure Files |
| ReadWriteOncePod (RWOP) | Single pod read-write | Strict single-pod access (Kubernetes 1.22+) | Azure Disk |

Access Mode Selection:

# Single pod (database)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
spec:
  accessModes:
  - ReadWriteOnce  # Single pod mount
  storageClassName: managed-premium
  resources:
    requests:
      storage: 500Gi

---
# Multiple pods (shared files)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-storage
spec:
  accessModes:
  - ReadWriteMany  # Multiple pods can mount
  storageClassName: azurefile-csi
  resources:
    requests:
      storage: 100Gi

Azure Disk vs Azure Files

Azure Disk (Block Storage, Single Mount)

Azure Disk Characteristics:

| Aspect | Azure Disk | Description |
|---|---|---|
| Type | Block storage | Direct-attached disk |
| Mount | Single pod | RWO (ReadWriteOnce) |
| Performance | ✅ High IOPS | Up to 20,000 IOPS (Premium SSD) |
| Latency | ✅ Low latency | < 1ms |
| Use Case | Databases, single-pod apps | PostgreSQL, MongoDB, Redis |

Azure Disk StorageClass:

# platform/storage/storageclass-premium-disk.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-premium
  annotations:
    storageclass.kubernetes.io/is-default-class: "false"
provisioner: disk.csi.azure.com
parameters:
  skuname: Premium_LRS
  kind: managed
  cachingMode: ReadOnly  # Optimize for database workloads
  diskEncryptionSetID: ${DISK_ENCRYPTION_SET_ID}
allowVolumeExpansion: true
reclaimPolicy: Retain  # Keep data on PVC deletion
volumeBindingMode: WaitForFirstConsumer  # Zone-aware scheduling

Azure Files (Shared Storage, Multi-Mount)

Azure Files Characteristics:

| Aspect | Azure Files | Description |
|---|---|---|
| Type | File storage | Network file share |
| Mount | Multiple pods | RWX (ReadWriteMany) |
| Protocol | SMB or NFS | Protocol selection |
| Performance | ⚠️ Lower per-client IOPS | Up to 100,000 IOPS per share (Premium) |
| Latency | ⚠️ Higher latency | Network latency |
| Use Case | Shared files, config | Content storage, logs |

Azure Files StorageClass:

# platform/storage/storageclass-premium-files.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azurefile-premium
provisioner: file.csi.azure.com
parameters:
  skuname: Premium_LRS
  protocol: smb  # or "nfs"
  # Optional: specific storage account
  # storageAccount: atpstorageaccount
reclaimPolicy: Delete
allowVolumeExpansion: true

Performance Characteristics

Performance Comparison:

| Storage Type | SKU | IOPS | Throughput | Latency | ATP Use Case |
|---|---|---|---|---|---|
| Azure Disk (Premium SSD) | Premium_LRS | 20,000 | 900 MB/s | < 1ms | ✅ Databases |
| Azure Disk (Standard SSD) | StandardSSD_LRS | 6,000 | 750 MB/s | < 5ms | ⚠️ Dev/Test |
| Azure Files (Premium) | Premium_LRS | 100,000 | 10,240 MiB/s | < 10ms | ✅ Shared storage |
| Azure Files (Standard) | Standard_LRS | 1,000 | 60 MiB/s | < 20ms | ⚠️ Dev/Test |

ATP Performance Requirements:

  • Database workloads: Premium SSD (Azure Disk) - High IOPS, low latency
  • Shared files: Premium Files (Azure Files) - Multiple mounts, good performance
  • Dev/Test: Standard SSD (Azure Disk) - Cost-effective

Cost Comparison

Storage Cost Comparison (per GB/month):

| Storage Type | SKU | Cost (East US) | Use Case |
|---|---|---|---|
| Azure Disk Premium SSD | Premium_LRS | $0.17/GB | Production databases |
| Azure Disk Standard SSD | StandardSSD_LRS | $0.06/GB | Dev/Test databases |
| Azure Files Premium | Premium_LRS | $0.19/GB | Production file shares |
| Azure Files Standard | Standard_LRS | $0.06/GB | Dev/Test file shares |

Cost Optimization Strategy:

  • Production databases: Premium SSD (required for performance)
  • Dev/Test databases: Standard SSD (cost savings)
  • Shared storage: Premium Files for production, Standard for dev/test

Use Case Selection

Storage Selection Matrix:

| Use Case | Recommended Storage | Access Mode | Rationale |
|---|---|---|---|
| PostgreSQL | Azure Disk Premium | RWO | High IOPS, single pod |
| MongoDB | Azure Disk Premium | RWO | High IOPS, single pod |
| Redis | Azure Disk Premium | RWO | Low latency, single pod |
| Shared Logs | Azure Files Premium | RWX | Multiple pods, shared access |
| Config Files | Azure Files Standard | RWX | Low cost, shared access |
| Backups | Azure Files Premium | RWX | Multiple pods, shared access |

ATP Decision Matrix:

| Component | Storage Type | StorageClass | Size |
|---|---|---|---|
| PostgreSQL | Azure Disk | managed-premium | 500Gi |
| MongoDB | Azure Disk | managed-premium | 1Ti |
| Redis | Azure Disk | managed-premium | 100Gi |
| Shared Logs | Azure Files | azurefile-premium | 500Gi |
| Backups | Azure Files | azurefile-premium | 2Ti |
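
Combining the decision matrix with the per-GB prices from the cost comparison gives a rough monthly storage bill. A sketch (treating 1Ti as 1024Gi and Gi ≈ GB; prices are the illustrative East US figures above):

```python
# Illustrative $/GB/month figures from the cost comparison table
PRICE_PER_GB = {"managed-premium": 0.17, "azurefile-premium": 0.19}

# ATP components from the decision matrix (sizes converted to Gi)
components = [
    ("PostgreSQL",  "managed-premium",   500),
    ("MongoDB",     "managed-premium",   1024),  # 1Ti
    ("Redis",       "managed-premium",   100),
    ("Shared Logs", "azurefile-premium", 500),
    ("Backups",     "azurefile-premium", 2048),  # 2Ti
]

def monthly_cost(components):
    return sum(PRICE_PER_GB[storage_class] * size_gb
               for _, storage_class, size_gb in components)

total = monthly_cost(components)
print(f"${total:.2f}/month")
```

This is a capacity-only estimate; actual Azure billing adds transactions, snapshots, and redundancy options.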

Storage Classes

Performance Tiers (Standard, Premium, Ultra)

Storage Class Performance Tiers:

# Standard HDD (Lowest cost, lowest performance)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-standard
provisioner: disk.csi.azure.com
parameters:
  skuname: Standard_LRS
  kind: managed
reclaimPolicy: Delete
volumeBindingMode: Immediate

---
# Standard SSD (Balanced)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-standard-ssd
provisioner: disk.csi.azure.com
parameters:
  skuname: StandardSSD_LRS
  kind: managed
reclaimPolicy: Delete
volumeBindingMode: Immediate

---
# Premium SSD (High performance)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-premium
provisioner: disk.csi.azure.com
parameters:
  skuname: Premium_LRS
  kind: managed
  cachingMode: ReadOnly
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer

---
# Ultra SSD (Highest performance)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-ultra
provisioner: disk.csi.azure.com
parameters:
  skuname: UltraSSD_LRS
  kind: managed
  cachingMode: None  # Ultra SSD doesn't support caching
  diskIopsReadWrite: "5000"  # IOPS limit
  diskMbpsReadWrite: "200"  # Throughput limit (MB/s)
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer

Performance Tier Comparison:

| Tier | SKU | IOPS | Throughput | Latency | Cost | ATP Use Case |
|---|---|---|---|---|---|---|
| Standard HDD | Standard_LRS | 500 | 60 MB/s | Variable | $0.04/GB | ❌ Not used |
| Standard SSD | StandardSSD_LRS | 6,000 | 750 MB/s | < 5ms | $0.06/GB | ✅ Dev/Test |
| Premium SSD | Premium_LRS | 20,000 | 900 MB/s | < 1ms | $0.17/GB | ✅ Production |
| Ultra SSD | UltraSSD_LRS | 160,000 | 2,000 MB/s | < 0.5ms | $0.24/GB | ⚠️ High-performance only |

ATP Decision: Use Premium SSD for production databases, Standard SSD for dev/test.
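
The tier decision can be expressed as a small selector. The 20,000 IOPS threshold comes from the Premium SSD ceiling in the comparison table; everything else is an illustrative assumption:

```python
# Map a workload profile to one of the StorageClasses defined above.
def select_disk_tier(iops_needed, production=True):
    if iops_needed > 20_000:
        return "managed-ultra"        # Beyond Premium SSD's ceiling
    if production:
        return "managed-premium"      # Premium SSD for production
    return "managed-standard-ssd"     # Standard SSD for dev/test

print(select_disk_tier(5_000, production=True))
print(select_disk_tier(5_000, production=False))
print(select_disk_tier(50_000))
```

Encoding the rule keeps environment overlays consistent: the same workload gets Premium SSD in production and Standard SSD in dev/test without ad-hoc choices.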

Encryption Configuration

Encryption at Rest with Disk Encryption Set:

# platform/storage/storageclass-encrypted.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-premium-encrypted
provisioner: disk.csi.azure.com
parameters:
  skuname: Premium_LRS
  kind: managed
  diskEncryptionSetID: /subscriptions/${SUBSCRIPTION_ID}/resourceGroups/${RG}/providers/Microsoft.Compute/diskEncryptionSets/atp-disk-encryption
  cachingMode: ReadOnly
allowVolumeExpansion: true
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer

Pulumi C# Disk Encryption Set:

// infrastructure/DiskEncryption.cs
var diskEncryptionSet = new DiskEncryptionSet("atp-disk-encryption", new DiskEncryptionSetArgs
{
    ResourceGroupName = resourceGroup.Name,
    Location = location,
    Identity = new EncryptionSetIdentityArgs
    {
        Type = "SystemAssigned"
    },
    ActiveKey = new KeyVaultAndKeyReferenceArgs
    {
        KeyUrl = keyVaultKey.Uri,
        SourceVault = new SourceVaultArgs
        {
            Id = keyVault.Id
        }
    },
    EncryptionType = "EncryptionAtRestWithCustomerKey"
});

// Grant Key Vault access to the Disk Encryption Set's managed identity
// (attach these args to the Key Vault's AccessPolicies; shown standalone for brevity)
var keyVaultAccessPolicy = new KeyVaultAccessPolicyArgs
{
    TenantId = tenantId,
    ObjectId = diskEncryptionSet.Identity.PrincipalId,
    Permissions = new KeyVaultPermissionsArgs
    {
        Keys = new[] { "Get", "WrapKey", "UnwrapKey" }
    }
};

Snapshot Support

Volume Snapshot Class:

# platform/storage/volumesnapshotclass.yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: azure-disk-snapshot
driver: disk.csi.azure.com
deletionPolicy: Retain  # or Delete
parameters:
  incremental: "true"  # Incremental snapshots (cost-effective)
  resourceGroup: atp-production-rg  # Resource group where snapshots are created

Create Volume Snapshot:

# apps/atp-ingestion/volumesnapshot.yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-data-snapshot-20240115
  namespace: atp-production
spec:
  volumeSnapshotClassName: azure-disk-snapshot
  source:
    persistentVolumeClaimName: postgres-data

Provisioner Settings

Azure Disk CSI Driver Provisioner Settings:

# platform/storage/storageclass-advanced.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-premium-advanced
provisioner: disk.csi.azure.com
parameters:
  skuname: Premium_LRS
  kind: managed
  cachingMode: ReadOnly  # ReadOnly, ReadWrite, None
  diskEncryptionSetID: ${DISK_ENCRYPTION_SET_ID}
  diskIOPSReadWrite: "5000"  # Optional: IOPS limit
  diskMBpsReadWrite: "200"  # Optional: Throughput limit
  networkAccessPolicy: "DenyAll"  # DenyAll, AllowPrivate, AllowAll
  publicNetworkAccess: "Disabled"
allowVolumeExpansion: true
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer  # Zone-aware

Volume Binding Modes:

| Binding Mode | Description | Use Case | ATP Selection |
|---|---|---|---|
| Immediate | Bind immediately | Static provisioning | ❌ Not used |
| WaitForFirstConsumer | Bind when pod scheduled | Zone-aware, topology | ✅ Preferred |

ATP Recommendation: Use WaitForFirstConsumer for zone-aware scheduling and topology constraints.


StatefulSet Deployment Patterns

Ordered Deployment and Scaling

StatefulSet Ordered Deployment:

# apps/postgresql/statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgresql
  namespace: atp-production
spec:
  serviceName: postgresql
  replicas: 3
  podManagementPolicy: OrderedReady  # Sequential creation (default)
  # podManagementPolicy: Parallel  # Parallel creation (optional)
  selector:
    matchLabels:
      app: postgresql
  template:
    metadata:
      labels:
        app: postgresql
    spec:
      containers:
      - name: postgresql
        image: postgres:15
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: managed-premium
      resources:
        requests:
          storage: 100Gi

StatefulSet Scaling Order:

sequenceDiagram
    participant K8s as Kubernetes
    participant Pod0 as postgresql-0
    participant Pod1 as postgresql-1
    participant Pod2 as postgresql-2

    K8s->>Pod0: Create and wait for Ready
    Pod0-->>K8s: Ready
    K8s->>Pod1: Create and wait for Ready
    Pod1-->>K8s: Ready
    K8s->>Pod2: Create and wait for Ready
    Pod2-->>K8s: Ready

Ordered Scaling Behavior:

  • Scale Up: Creates pods sequentially (0, 1, 2...)
  • Scale Down: Deletes pods in reverse order (2, 1, 0...)
  • Ensures: Each pod is ready before creating the next
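
The ordinal rules above can be sketched as two small functions:

```python
# StatefulSet ordinal ordering: pods are created 0..N-1 in sequence,
# and removed in reverse, N-1..0.
def scale_up_order(replicas):
    """Order in which pod ordinals are created when scaling up."""
    return list(range(replicas))

def scale_down_order(replicas):
    """Order in which pod ordinals are removed when scaling down."""
    return list(range(replicas - 1, -1, -1))

print(scale_up_order(3))    # ordinals created first to last
print(scale_down_order(3))  # ordinals removed first to last
```

With OrderedReady pod management, each ordinal must report Ready before the next one in the sequence is acted on.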

Stable Network Identity

Headless Service for StatefulSet:

# apps/postgresql/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: postgresql
  namespace: atp-production
spec:
  clusterIP: None  # Headless service
  selector:
    app: postgresql
  ports:
  - port: 5432
    name: postgresql

Stable Network Identity:

# StatefulSet pods get stable DNS names
# postgresql-0.postgresql.atp-production.svc.cluster.local
# postgresql-1.postgresql.atp-production.svc.cluster.local
# postgresql-2.postgresql.atp-production.svc.cluster.local
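
The DNS names follow the fixed pattern `<pod>.<service>.<namespace>.svc.<cluster-domain>`; a small helper can enumerate them for any replica count:

```python
# Build the stable DNS names StatefulSet pods get behind a headless service.
def statefulset_pod_fqdns(name, service, namespace, replicas,
                          cluster_domain="cluster.local"):
    return [
        f"{name}-{i}.{service}.{namespace}.svc.{cluster_domain}"
        for i in range(replicas)
    ]

for fqdn in statefulset_pod_fqdns("postgresql", "postgresql",
                                  "atp-production", 3):
    print(fqdn)
```

Because the names depend only on the StatefulSet name and ordinal, clients can address a specific replica (e.g. the primary) without service discovery lookups.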

Accessing StatefulSet Pods:

# Access specific pod
psql -h postgresql-0.postgresql.atp-production.svc.cluster.local

# Access any pod via service
psql -h postgresql.atp-production.svc.cluster.local

Persistent Storage per Pod

StatefulSet with Volume Claim Templates:

# apps/postgresql/statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgresql
  namespace: atp-production
spec:
  serviceName: postgresql
  replicas: 3
  selector:
    matchLabels:
      app: postgresql
  template:
    metadata:
      labels:
        app: postgresql
    spec:
      containers:
      - name: postgresql
        image: postgres:15
        env:
        - name: PGDATA
          # Use a subdirectory: mounting an Azure Disk directly at
          # /var/lib/postgresql/data leaves a lost+found directory
          # that causes initdb to fail
          value: /var/lib/postgresql/data/pgdata
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
        - name: config
          mountPath: /etc/postgresql
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: managed-premium
      resources:
        requests:
          storage: 100Gi
  - metadata:
      name: config
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: managed-premium
      resources:
        requests:
          storage: 10Gi

PVCs Created Automatically:

data-postgresql-0  # Persistent volume for pod 0
data-postgresql-1  # Persistent volume for pod 1
data-postgresql-2  # Persistent volume for pod 2
config-postgresql-0
config-postgresql-1
config-postgresql-2

Headless Service Configuration

Headless Service Pattern:

# apps/postgresql/service-headless.yaml
apiVersion: v1
kind: Service
metadata:
  name: postgresql
  namespace: atp-production
spec:
  clusterIP: None  # Headless - no load balancing
  selector:
    app: postgresql
  ports:
  - port: 5432
    targetPort: 5432
    name: postgresql

Service Discovery with Headless Service:

# StatefulSet pod discovery
apiVersion: v1
kind: Service
metadata:
  name: postgresql-read
  namespace: atp-production
spec:
  selector:
    app: postgresql
    role: replica  # Read replicas only
  ports:
  - port: 5432
    name: postgresql

---
# StatefulSet pod discovery
apiVersion: v1
kind: Service
metadata:
  name: postgresql-write
  namespace: atp-production
spec:
  selector:
    app: postgresql
    role: primary  # Primary only
  ports:
  - port: 5432
    name: postgresql

Database Deployments in Kubernetes

PostgreSQL Operator

PostgreSQL Operator (Crunchy Data):

#!/bin/bash
# scripts/install-postgres-operator.sh
# Crunchy Data publishes PGO (the Postgres Operator) as Kustomize manifests
# in the postgres-operator-examples repository; no Helm repository URL is
# assumed here.

git clone https://github.com/CrunchyData/postgres-operator-examples.git
cd postgres-operator-examples

# Create the postgres-operator namespace and install PGO
kubectl apply -k kustomize/install/namespace
kubectl apply --server-side -k kustomize/install/default

echo "✅ PostgreSQL Operator (PGO) installed"

PostgreSQL Cluster via Operator:

# apps/postgresql/postgrescluster.yaml
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: atp-postgres
  namespace: atp-production
spec:
  image: registry.developers.crunchydata.com/crunchydata/crunchy-postgres:ubi8-15.4-0
  postgresVersion: 15
  instances:
  - name: instance1
    replicas: 3
    dataVolumeClaimSpec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 500Gi
      storageClassName: managed-premium
    resources:
      requests:
        cpu: 2000m
        memory: 4Gi
      limits:
        cpu: 4000m
        memory: 8Gi
  backups:
    pgbackrest:
      image: registry.developers.crunchydata.com/crunchydata/crunchy-pgbackrest:ubi8-2.47-0
      repos:
      - name: repo1
        volume:
          volumeClaimSpec:
            accessModes:
            - ReadWriteOnce
            resources:
              requests:
                storage: 1Ti
            storageClassName: managed-premium
      global:
        repo1-retention-full: "7"
        repo1-retention-full-type: count
  monitoring:
    pgMonitor:
      exporter:
        image: registry.developers.crunchydata.com/crunchydata/crunchy-postgres-exporter:ubi8-5.3.0-0

MongoDB Operator

MongoDB Community Operator:

#!/bin/bash
# scripts/install-mongodb-operator.sh

# Install MongoDB Community Operator CRD
kubectl apply -f https://raw.githubusercontent.com/mongodb/mongodb-kubernetes-operator/master/config/crd/bases/mongodbcommunity.mongodb.com_mongodbcommunity.yaml

# Install operator deployment (the operator's RBAC manifests under
# config/rbac/ must also be applied — omitted here for brevity)
kubectl create namespace mongodb-operator
kubectl apply -f https://raw.githubusercontent.com/mongodb/mongodb-kubernetes-operator/master/config/manager/manager.yaml -n mongodb-operator

echo "✅ MongoDB Operator installed"

MongoDB ReplicaSet via Operator:

# apps/mongodb/mongodbcommunity.yaml
apiVersion: mongodbcommunity.mongodb.com/v1
kind: MongoDBCommunity
metadata:
  name: atp-mongodb
  namespace: atp-production
spec:
  members: 3
  type: ReplicaSet
  version: "7.0.0"
  security:
    authentication:
      modes: ["SCRAM"]
  users:
  - name: atp-user
    db: admin
    passwordSecretRef:
      name: mongodb-password
    roles:
    - name: readWriteAnyDatabase
      db: admin
  additionalMongodConfig:
    storage.wiredTiger.engineConfig.journalCompressor: snappy
    storage.wiredTiger.collectionConfig.blockCompressor: snappy
  statefulSet:
    spec:
      template:
        spec:
          containers:
          - name: mongod  # Container name used by the operator
            resources:
              requests:
                cpu: 2000m
                memory: 4Gi
              limits:
                cpu: 4000m
                memory: 8Gi
      volumeClaimTemplates:
      - metadata:
          name: data-volume
        spec:
          accessModes:
          - ReadWriteOnce
          storageClassName: managed-premium
          resources:
            requests:
              storage: 500Gi

Redis Deployment

Redis StatefulSet:

# apps/redis/statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis
  namespace: atp-production
spec:
  serviceName: redis
  replicas: 3
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
      - name: redis
        image: redis:7-alpine
        command:
        - redis-server
        - /etc/redis/redis.conf
        - --appendonly
        - "yes"
        ports:
        - containerPort: 6379
          name: redis
        volumeMounts:
        - name: data
          mountPath: /data
        - name: config
          mountPath: /etc/redis
        resources:
          requests:
            cpu: 500m
            memory: 1Gi
          limits:
            cpu: 2000m
            memory: 2Gi
      volumes:
      - name: config
        configMap:
          name: redis-config  # Provides redis.conf (ConfigMap name assumed)
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes:
      - ReadWriteOnce
      storageClassName: managed-premium
      resources:
        requests:
          storage: 100Gi

Redis Sentinel Configuration:

# apps/redis/redis-sentinel.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis-sentinel
  namespace: atp-production
spec:
  serviceName: redis-sentinel
  replicas: 3
  selector:
    matchLabels:
      app: redis-sentinel
  template:
    metadata:
      labels:
        app: redis-sentinel
    spec:
      containers:
      - name: sentinel
        image: redis:7-alpine
        command:
        - redis-sentinel
        - /etc/redis/sentinel.conf
        ports:
        - containerPort: 26379
          name: sentinel
        volumeMounts:
        - name: config
          mountPath: /etc/redis
      volumes:
      - name: config
        configMap:
          name: redis-sentinel-config  # Provides sentinel.conf (ConfigMap name assumed)

StatefulSet vs Managed Service Decision

Kubernetes vs Azure Managed Services:

| Aspect | Kubernetes (StatefulSet) | Azure Managed Service | ATP Decision |
|---|---|---|---|
| PostgreSQL | PostgreSQL Operator | Azure Database for PostgreSQL | ✅ Managed Service (recommended) |
| MongoDB | MongoDB Operator | Azure Cosmos DB (MongoDB API) | ✅ Managed Service (recommended) |
| Redis | Redis StatefulSet | Azure Cache for Redis | ✅ Managed Service (recommended) |
| Control | ✅ Full control | ⚠️ Limited | ⚠️ Acceptable trade-off |
| Operations | ⚠️ Self-managed | ✅ Managed | ✅ Managed Service |
| Cost | ⚠️ Higher (infra + ops) | ✅ Lower (includes ops) | ✅ Managed Service |
| ATP Decision | ⚠️ Dev/Test only | ✅ Production | Managed Services |

ATP Decision: Use Azure managed services for production databases (Azure Database for PostgreSQL, Azure Cosmos DB, Azure Cache for Redis): lower operational overhead, better SLAs, and automated backups. Use Kubernetes StatefulSets only for dev/test environments.
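One way to keep application manifests identical across environments while production uses managed services is an `ExternalName` Service that aliases the managed endpoint. A sketch — the hostname below is a placeholder:

```yaml
# Alias the managed database behind a stable in-cluster Service name, so
# applications resolve "postgresql" in every environment regardless of
# whether the backend is a StatefulSet or a managed service
apiVersion: v1
kind: Service
metadata:
  name: postgresql
  namespace: atp-production
spec:
  type: ExternalName
  # Placeholder FQDN — replace with the actual Azure Database for
  # PostgreSQL server hostname
  externalName: atp-postgres.postgres.database.azure.com
```

In dev/test overlays the same Service name can instead point at the in-cluster StatefulSet, so connection strings never change between environments.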


Backup and Restore Procedures

Velero for Cluster Backups

Velero Installation:

#!/bin/bash
# scripts/install-velero.sh

# Install Velero CLI
curl -fsSL -o velero-v1.12.0-linux-amd64.tar.gz \
  https://github.com/vmware-tanzu/velero/releases/download/v1.12.0/velero-v1.12.0-linux-amd64.tar.gz
tar -xvf velero-v1.12.0-linux-amd64.tar.gz
sudo mv velero-v1.12.0-linux-amd64/velero /usr/local/bin/

# Create Azure Storage Account for Velero backups
az storage account create \
  --name atpvelerobackups \
  --resource-group atp-production-rg \
  --sku Standard_LRS \
  --location eastus

# Create blob container
az storage container create \
  --name velero \
  --account-name atpvelerobackups

# Install Velero
velero install \
  --provider azure \
  --plugins velero/velero-plugin-for-microsoft-azure:v1.7.0 \
  --bucket velero \
  --secret-file ./credentials-velero \
  --backup-location-config resourceGroup=atp-production-rg,storageAccount=atpvelerobackups \
  --snapshot-location-config apiTimeout=5m,resourceGroup=atp-production-rg \
  --use-volume-snapshots=true

echo "✅ Velero installed"

Velero via Helm:

# platform/velero/helmrelease.yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: velero
  namespace: velero
spec:
  interval: 5m
  chart:
    spec:
      chart: velero
      sourceRef:
        kind: HelmRepository
        name: vmware-tanzu
      version: 5.1.1
  values:
    configuration:
      provider: azure
      backupStorageLocation:
        bucket: velero
        config:
          resourceGroup: atp-production-rg
          storageAccount: atpvelerobackups
      volumeSnapshotLocation:
        config:
          apiTimeout: 5m
          resourceGroup: atp-production-rg
    initContainers:
    - name: velero-plugin-for-microsoft-azure
      image: velero/velero-plugin-for-microsoft-azure:v1.7.0
      volumeMounts:
      - mountPath: /target
        name: plugins
    credentials:
      secretContents:
        cloud: |
          # Azure credentials

Volume Snapshots

Velero Backup with Volume Snapshots:

#!/bin/bash
# scripts/create-velero-backup.sh

BACKUP_NAME="atp-production-backup-$(date +%Y%m%d-%H%M%S)"
NAMESPACE="atp-production"

echo "💾 Creating Velero backup: ${BACKUP_NAME}"

# Create backup (--include-namespaces scopes the backup; the velero CLI's
# own --namespace flag points at where Velero is installed, so it is not
# needed here)
velero backup create "${BACKUP_NAME}" \
  --include-namespaces "${NAMESPACE}" \
  --snapshot-volumes \
  --wait

# Check backup status
velero backup describe "${BACKUP_NAME}"

echo "✅ Backup created: ${BACKUP_NAME}"

Scheduled Backups:

# platform/velero/backup-schedule.yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-production-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"  # 2 AM daily
  template:
    includedNamespaces:
    - atp-production
    snapshotVolumes: true
    ttl: 720h  # 30 days retention
    metadata:
      labels:
        backup-type: daily
        environment: production

Backup Scheduling

Backup Schedule Configuration:

# platform/velero/schedules.yaml
# NOTE: Schedules are applied directly as Velero custom resources;
# wrapping them in a ConfigMap would have no effect
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-production-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"  # 2 AM daily
  template:
    includedNamespaces:
    - atp-production
    snapshotVolumes: true
    ttl: 720h  # 30 days
---
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: weekly-production-backup
  namespace: velero
spec:
  schedule: "0 3 * * 0"  # 3 AM Sunday
  template:
    includedNamespaces:
    - atp-production
    snapshotVolumes: true
    ttl: 2160h  # 90 days

Restore Procedures

Velero Restore Procedure:

#!/bin/bash
# scripts/restore-from-velero.sh

BACKUP_NAME="${1}"
NAMESPACE="${2:-atp-production}"

if [ -z "${BACKUP_NAME}" ]; then
  echo "Usage: $0 <backup-name> [namespace]"
  echo ""
  echo "Available backups:"
  velero backup get
  exit 1
fi

echo "🔄 Restoring from backup: ${BACKUP_NAME}"

# List backups
velero backup get

# Restore from backup
velero restore create "restore-${BACKUP_NAME}-$(date +%Y%m%d-%H%M%S)" \
  --from-backup "${BACKUP_NAME}" \
  --namespace-mappings "${NAMESPACE}:${NAMESPACE}-restored" \
  --wait

echo "✅ Restore initiated"
echo "   Check status: velero restore get"

Restore to Different Namespace:

# Restore production backup to test namespace
velero restore create restore-production-to-test \
  --from-backup daily-production-backup-20240115 \
  --namespace-mappings atp-production:atp-test \
  --wait

Volume Snapshots

Creating Snapshots

Manual Volume Snapshot:

# apps/postgresql/volumesnapshot-manual.yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-data-snapshot-20240115
  namespace: atp-production
spec:
  volumeSnapshotClassName: azure-disk-snapshot-premium
  source:
    persistentVolumeClaimName: data-postgresql-0  # PVC from the "data" volumeClaimTemplate

Create Snapshot Script:

#!/bin/bash
# scripts/create-volume-snapshot.sh

PVC_NAME="${1}"
NAMESPACE="${2}"
SNAPSHOT_NAME="${3:-${PVC_NAME}-snapshot-$(date +%Y%m%d-%H%M%S)}"

if [ -z "${PVC_NAME}" ] || [ -z "${NAMESPACE}" ]; then
  echo "Usage: $0 <pvc-name> <namespace> [snapshot-name]"
  exit 1
fi

echo "📸 Creating volume snapshot: ${SNAPSHOT_NAME}"

kubectl apply -f - <<EOF
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: ${SNAPSHOT_NAME}
  namespace: ${NAMESPACE}
spec:
  volumeSnapshotClassName: azure-disk-snapshot-premium
  source:
    persistentVolumeClaimName: ${PVC_NAME}
EOF

# Wait for snapshot to be ready (VolumeSnapshot exposes status.readyToUse
# rather than a Ready condition)
kubectl wait volumesnapshot/${SNAPSHOT_NAME} \
  -n "${NAMESPACE}" \
  --for=jsonpath='{.status.readyToUse}'=true \
  --timeout=300s

echo "✅ Snapshot created: ${SNAPSHOT_NAME}"

Snapshot Classes

Snapshot Class Configuration:

# platform/storage/volumesnapshotclass-premium.yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: azure-disk-snapshot-premium
driver: disk.csi.azure.com
deletionPolicy: Retain  # Keep snapshot after PVC deletion
parameters:
  incremental: "true"  # Incremental snapshots
  resourceGroup: atp-production-rg
  # Optional: specific storage account for snapshots
  # storageAccount: atpsnapshots

---
# platform/storage/volumesnapshotclass-standard.yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: azure-disk-snapshot-standard
driver: disk.csi.azure.com
deletionPolicy: Delete  # Delete snapshot when PVC is deleted
parameters:
  incremental: "true"
  resourceGroup: atp-production-rg

Snapshot Class Selection:

| SnapshotClass | Deletion Policy | Use Case | ATP Selection |
|---|---|---|---|
| azure-disk-snapshot-premium | Retain | Production backups | ✅ Production |
| azure-disk-snapshot-standard | Delete | Dev/Test snapshots | ✅ Dev/Test |

Azure Backup Integration

Azure Backup for AKS Volumes:

#!/bin/bash
# scripts/setup-azure-backup.sh

# Create Recovery Services Vault
az backup vault create \
  --name atp-backup-vault \
  --resource-group atp-production-rg \
  --location eastus

# Enable backup for an Azure Files share used by AKS workloads
az backup protection enable-for-azurefileshare \
  --vault-name atp-backup-vault \
  --resource-group atp-production-rg \
  --storage-account atpstorageaccount \
  --azure-file-share postgres-backup \
  --policy-name DefaultPolicy

echo "✅ Azure Backup configured"

Backup Policy:

# Create backup policy (daily, 30-day retention)
# NOTE: policy JSON abbreviated — the full schema has additional required
# fields (see `az backup policy create --help`)
az backup policy create \
  --vault-name atp-backup-vault \
  --resource-group atp-production-rg \
  --name daily-policy \
  --backup-management-type AzureStorage \
  --workload-type AzureFileShare \
  --policy '{
    "name": "daily-policy",
    "recoveryPointType": "FileSystemConsistent",
    "schedulePolicy": {
      "scheduleRunFrequency": "Daily",
      "scheduleRunTimes": ["02:00"]
    },
    "retentionPolicy": {
      "dailySchedule": {
        "retentionDuration": {
          "count": 30,
          "durationType": "Days"
        }
      }
    }
  }'

Snapshot Retention

Snapshot Retention Policies:

| Environment | Retention Period | Rationale |
|---|---|---|
| Production | 90 days | Long-term recovery |
| Staging | 30 days | Shorter retention |
| Test | 7 days | Minimal retention |
| Dev | 3 days | Very short retention |
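Velero `ttl` values are expressed in hours; the retention periods above translate to `ttl` strings as follows (a quick arithmetic sketch):

```shell
#!/bin/bash
# Convert retention periods (days) to the hour-based TTL strings Velero expects
for days in 90 30 7 3; do
  echo "${days} days -> ttl: $((days * 24))h"
done
# 90 days -> 2160h, 30 days -> 720h, 7 days -> 168h, 3 days -> 72h
```

These are the same `ttl` values (`2160h`, `720h`, `168h`, `72h`) used in the Schedule manifests elsewhere in this section.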

Automated Snapshot Cleanup:

#!/bin/bash
# scripts/cleanup-old-snapshots.sh

NAMESPACE="${1:-atp-production}"
RETENTION_DAYS="${2:-30}"

echo "🧹 Cleaning up old snapshots (older than ${RETENTION_DAYS} days)..."

# Get all snapshots older than retention period
OLD_SNAPSHOTS=$(kubectl get volumesnapshot -n "${NAMESPACE}" -o json | \
  jq -r ".items[] | select(.metadata.creationTimestamp < \"$(date -d "${RETENTION_DAYS} days ago" -u +%Y-%m-%dT%H:%M:%SZ)\") | .metadata.name")

for SNAPSHOT in ${OLD_SNAPSHOTS}; do
  echo "🗑️  Deleting snapshot: ${SNAPSHOT}"
  kubectl delete volumesnapshot "${SNAPSHOT}" -n "${NAMESPACE}" || true
done

echo "✅ Snapshot cleanup complete"

Data Migration Strategies

Migrating Data Between Versions

Database Migration Strategy:

sequenceDiagram
    participant Old as Old Version<br/>PostgreSQL 14
    participant Snapshot as Volume Snapshot
    participant New as New Version<br/>PostgreSQL 15
    participant Data as Data Migration

    Old->>Snapshot: Create snapshot
    Snapshot->>New: Clone volume
    New->>Data: Mount snapshot
    Data->>New: Migrate schema
    New->>Data: Migrate data
    Data->>New: Validate

PostgreSQL Version Migration:

#!/bin/bash
# scripts/migrate-postgres-version.sh

OLD_VERSION="14"
NEW_VERSION="15"
NAMESPACE="atp-production"
PVC_NAME="data-postgresql-0"  # PVC from the "data" volumeClaimTemplate

echo "🔄 Migrating PostgreSQL ${OLD_VERSION} → ${NEW_VERSION}"

# Step 1: Create snapshot of current data
echo "📸 Step 1: Creating snapshot..."
kubectl apply -f - <<EOF
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-migration-snapshot
  namespace: ${NAMESPACE}
spec:
  volumeSnapshotClassName: azure-disk-snapshot-premium
  source:
    persistentVolumeClaimName: ${PVC_NAME}
EOF

kubectl wait volumesnapshot/postgres-migration-snapshot \
  -n "${NAMESPACE}" \
  --for=jsonpath='{.status.readyToUse}'=true \
  --timeout=300s

# Step 2: Scale down old StatefulSet
echo "⏸️  Step 2: Scaling down old StatefulSet..."
kubectl scale statefulset postgresql --replicas=0 -n "${NAMESPACE}"

# Step 3: Create new StatefulSet from snapshot
echo "🆕 Step 3: Creating new StatefulSet from snapshot..."
# (Create new StatefulSet YAML with PostgreSQL ${NEW_VERSION})

# Step 4: Restore data from snapshot
echo "📥 Step 4: Restoring data from snapshot..."
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data-postgresql-0-new
  namespace: ${NAMESPACE}
spec:
  dataSource:
    name: postgres-migration-snapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
  - ReadWriteOnce
  storageClassName: managed-premium
  resources:
    requests:
      storage: 500Gi
EOF

echo "✅ Migration initiated"

Volume Cloning

Volume Clone from Snapshot:

# apps/postgresql/pvc-from-snapshot.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data-clone
  namespace: atp-production
spec:
  dataSource:
    name: postgres-data-snapshot-20240115
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
  - ReadWriteOnce
  storageClassName: managed-premium
  resources:
    requests:
      storage: 500Gi  # Must be >= snapshot size

Clone Volume Script:

#!/bin/bash
# scripts/clone-volume-from-snapshot.sh

SNAPSHOT_NAME="${1}"
NEW_PVC_NAME="${2}"
NAMESPACE="${3:-atp-production}"
STORAGE_SIZE="${4:-500Gi}"

if [ -z "${SNAPSHOT_NAME}" ] || [ -z "${NEW_PVC_NAME}" ]; then
  echo "Usage: $0 <snapshot-name> <new-pvc-name> [namespace] [storage-size]"
  exit 1
fi

echo "📋 Cloning volume from snapshot: ${SNAPSHOT_NAME}"

kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ${NEW_PVC_NAME}
  namespace: ${NAMESPACE}
spec:
  dataSource:
    name: ${SNAPSHOT_NAME}
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
  - ReadWriteOnce
  storageClassName: managed-premium
  resources:
    requests:
      storage: ${STORAGE_SIZE}
EOF

echo "✅ Volume clone created: ${NEW_PVC_NAME}"

Zero-Downtime Migrations

Zero-Downtime Migration Strategy:

sequenceDiagram
    participant App as Application
    participant OldDB as Old DB<br/>Primary
    participant NewDB as New DB<br/>Replica
    participant Sync as Data Sync

    App->>OldDB: Write traffic
    OldDB->>Sync: Stream changes
    Sync->>NewDB: Apply changes
    NewDB->>NewDB: Validate sync
    NewDB->>App: Switch traffic
    App->>NewDB: Write traffic

PostgreSQL Logical Replication for Zero-Downtime:

# apps/postgresql/migration-replica.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgresql-new
  namespace: atp-production
spec:
  serviceName: postgresql-new
  replicas: 1
  selector:
    matchLabels:
      app: postgresql-new
  template:
    metadata:
      labels:
        app: postgresql-new
    spec:
      containers:
      - name: postgresql
        image: postgres:15
        env:
        # NOTE: illustrative only — the official postgres image does not
        # read these variables; use an image or operator that implements
        # replica bootstrapping (or configure primary_conninfo manually)
        - name: POSTGRES_REPLICATION_MODE
          value: "replica"
        - name: POSTGRES_PRIMARY_HOST
          value: "postgresql.atp-production.svc.cluster.local"
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes:
      - ReadWriteOnce
      storageClassName: managed-premium
      resources:
        requests:
          storage: 500Gi

GitOps Considerations for Stateful Apps

Careful Rollback Procedures

StatefulSet Rollback Strategy:

# StatefulSet with update strategy
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgresql
  namespace: atp-production
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 0  # Update all pods (reduce gradually for staged rollout)
  # OR
  # updateStrategy:
  #   type: OnDelete  # Manual update control

Staged Rollout for StatefulSets:

#!/bin/bash
# scripts/staged-statefulset-rollout.sh

STATEFULSET="${1}"
NAMESPACE="${2:-atp-production}"
PARTITION="${3:-2}"  # Keep 2 pods on old version

echo "🔄 Staged rollout for StatefulSet: ${STATEFULSET}"
echo "   Keeping ${PARTITION} pods on old version"

# Set partition (only pods >= partition index will be updated)
kubectl patch statefulset "${STATEFULSET}" -n "${NAMESPACE}" \
  --type='json' \
  -p="[{\"op\": \"replace\", \"path\": \"/spec/updateStrategy/rollingUpdate/partition\", \"value\": ${PARTITION}}]"

# Update image
kubectl set image statefulset/${STATEFULSET} \
  -n "${NAMESPACE}" \
  postgresql=postgres:15

# Gradually reduce the partition so remaining pods update one at a time
for ((i = PARTITION; i >= 0; i--)); do
  echo "  Updating partition: ${i}"
  kubectl patch statefulset "${STATEFULSET}" -n "${NAMESPACE}" \
    --type='json' \
    -p="[{\"op\": \"replace\", \"path\": \"/spec/updateStrategy/rollingUpdate/partition\", \"value\": ${i}}]"

  # Wait for pod to be ready
  kubectl wait --for=condition=ready pod/${STATEFULSET}-${i} \
    -n "${NAMESPACE}" \
    --timeout=300s

  sleep 60  # Wait before next update
done

echo "✅ Staged rollout complete"

No Auto-Prune for PVCs

FluxCD Kustomization with Prune Safety:

# clusters/production/kustomizations/stateful-apps-production.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: stateful-apps-production
  namespace: flux-system
spec:
  interval: 5m
  path: ./apps/postgresql/overlays/production
  prune: true  # Enable pruning for stateless resources
  # Flux has no per-Kustomization "keep" list; to protect PVCs from
  # pruning, label or annotate them with
  # kustomize.toolkit.fluxcd.io/prune: disabled (see below)
  sourceRef:
    kind: GitRepository
    name: atp-gitops-production

PVC Protection Labels:

# apps/postgresql/pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
  namespace: atp-production
  labels:
    app: postgresql
    component: database
    kustomize.toolkit.fluxcd.io/prune: disabled  # Exclude from Flux pruning
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: managed-premium
  resources:
    requests:
      storage: 500Gi

StatefulSet Update Strategies

Update Strategy Options:

| Strategy | Description | Use Case | ATP Selection |
|---|---|---|---|
| RollingUpdate | Update pods sequentially | Production (controlled) | ✅ Production |
| OnDelete | Update only when pod deleted | Manual control | ⚠️ Staging (manual) |
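The `rollingUpdate.partition` field controls which ordinals an update touches: only pods with an ordinal greater than or equal to the partition are moved to the new revision. A small sketch of that rule, using a hypothetical 3-replica StatefulSet:

```shell
#!/bin/bash
# Which pods does a RollingUpdate with a given partition touch?
# Rule: pods with ordinal >= partition are updated; lower ordinals stay.
REPLICAS=3
PARTITION=2

for ((i = REPLICAS - 1; i >= 0; i--)); do
  if (( i >= PARTITION )); then
    echo "postgresql-${i}: updated to new revision"
  else
    echo "postgresql-${i}: stays on old revision"
  fi
done
# With partition=2: only postgresql-2 is updated; 0 and 1 stay old
```

Lowering the partition step by step (2 → 1 → 0) therefore produces a staged rollout, one pod at a time.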

StatefulSet Update Strategy Configuration:

# apps/postgresql/statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgresql
  namespace: atp-production
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 0  # Start updating from index 0
      # partition: 2  # Keep pods 0-1 on old version, update 2+
  # OR for manual control
  # updateStrategy:
  #   type: OnDelete

Data Persistence Across Deployments

PVC Retention Policy:

# StorageClass with Retain policy
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-premium
provisioner: disk.csi.azure.com
parameters:
  skuname: Premium_LRS
reclaimPolicy: Retain  # Keep PV when PVC is deleted
volumeBindingMode: WaitForFirstConsumer

PVC Lifecycle with Retain Policy:

sequenceDiagram
    participant Dev as Developer
    participant PVC as PVC
    participant PV as PV
    participant Azure as Azure Disk

    Dev->>PVC: Delete PVC
    PVC->>PV: Release (Retain)
    PV->>Azure: Keep disk (not deleted)
    Azure->>PV: Data preserved
    Dev->>PV: Reuse PV with new PVC

Reusing Retained PV:

# Reuse existing PV with new PVC
# NOTE: a Released PV keeps its old claimRef; clear it first so the new
# PVC can bind, e.g.:
#   kubectl patch pv pv-abc123 --type json \
#     -p '[{"op":"remove","path":"/spec/claimRef"}]'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data-restored
  namespace: atp-production
spec:
  volumeName: pv-abc123  # Reference existing PV
  accessModes:
  - ReadWriteOnce
  storageClassName: managed-premium
  resources:
    requests:
      storage: 500Gi

Disaster Recovery for Persistent Data

Backup Frequency per Environment

Backup Schedule Matrix:

| Environment | Frequency | Retention | Rationale |
|---|---|---|---|
| Production | Every 6 hours | 90 days | High availability, long-term recovery |
| Staging | Daily | 30 days | Moderate retention |
| Test | Weekly | 14 days | Minimal retention |
| Dev | Manual only | 7 days | Cost optimization |

Automated Backup Schedules:

# platform/velero/schedules-production.yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: production-backup-6h
  namespace: velero
spec:
  schedule: "0 */6 * * *"  # Every 6 hours
  template:
    includedNamespaces:
    - atp-production
    snapshotVolumes: true
    ttl: 2160h  # 90 days
---
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: production-backup-daily
  namespace: velero
spec:
  schedule: "0 2 * * *"  # 2 AM daily
  template:
    includedNamespaces:
    - atp-production
    snapshotVolumes: true
    ttl: 720h  # 30 days
---
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: production-backup-weekly
  namespace: velero
spec:
  schedule: "0 3 * * 0"  # 3 AM Sunday
  template:
    includedNamespaces:
    - atp-production
    snapshotVolumes: true
    ttl: 2160h  # 90 days

Cross-Region Replication

Cross-Region Backup Replication:

#!/bin/bash
# scripts/setup-cross-region-backup.sh

PRIMARY_REGION="eastus"
SECONDARY_REGION="westeurope"

# Create backup storage in secondary region
az storage account create \
  --name atpvelerobackupseu \
  --resource-group atp-production-rg-eu \
  --sku Standard_LRS \
  --location "${SECONDARY_REGION}" \
  --allow-blob-public-access false

# Enable change feed and versioning on the primary backup account
az storage account blob-service-properties update \
  --account-name atpvelerobackups \
  --resource-group atp-production-rg \
  --enable-change-feed true \
  --enable-versioning true

# Enable geo-redundant replication (RA-GRS) and harden the account
az storage account update \
  --name atpvelerobackups \
  --resource-group atp-production-rg \
  --sku Standard_RAGRS \
  --allow-blob-public-access false \
  --min-tls-version TLS1_2

echo "✅ Cross-region backup replication configured"

Velero with Multiple Backup Locations:

# platform/velero/backup-locations.yaml
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: backup-primary
  namespace: velero
spec:
  provider: azure
  objectStorage:
    bucket: velero
    prefix: primary
  config:
    resourceGroup: atp-production-rg
    storageAccount: atpvelerobackups
---
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: backup-secondary
  namespace: velero
spec:
  provider: azure
  objectStorage:
    bucket: velero
    prefix: secondary
  config:
    resourceGroup: atp-production-rg-eu
    storageAccount: atpvelerobackupseu

RPO Targets for Databases

RPO (Recovery Point Objective) Targets:

| Environment | RPO Target | Backup Frequency | Actual RPO |
|---|---|---|---|
| Production | ≤ 6 hours | Every 6 hours | 6 hours |
| Staging | ≤ 24 hours | Daily | 24 hours |
| Test | ≤ 7 days | Weekly | 7 days |
| Dev | N/A | Manual | N/A |

RPO Validation:

#!/bin/bash
# scripts/validate-rpo.sh

NAMESPACE="atp-production"
RPO_TARGET_HOURS=6

echo "🔍 Validating RPO compliance..."

# Get latest backup
LATEST_BACKUP=$(velero backup get --namespace velero | \
  grep "${NAMESPACE}" | \
  sort -k4 -r | \
  head -n 1 | \
  awk '{print $1}')

if [ -z "${LATEST_BACKUP}" ]; then
  echo "❌ No backups found"
  exit 1
fi

# Get backup creation time from the Backup custom resource (more robust
# than parsing "velero backup describe" text output)
BACKUP_TIME=$(kubectl get backups.velero.io "${LATEST_BACKUP}" -n velero \
  -o jsonpath='{.metadata.creationTimestamp}')

BACKUP_EPOCH=$(date -d "${BACKUP_TIME}" +%s)
CURRENT_EPOCH=$(date +%s)
AGE_HOURS=$(( (CURRENT_EPOCH - BACKUP_EPOCH) / 3600 ))

if [ "${AGE_HOURS}" -gt "${RPO_TARGET_HOURS}" ]; then
  echo "⚠️  RPO violation: Latest backup is ${AGE_HOURS} hours old (target: ${RPO_TARGET_HOURS}h)"
  exit 1
else
  echo "✅ RPO compliant: Latest backup is ${AGE_HOURS} hours old (target: ${RPO_TARGET_HOURS}h)"
fi

DR Testing for Stateful Apps

DR Test Procedure:

#!/bin/bash
# scripts/dr-test-stateful-apps.sh

BACKUP_NAME="${1}"
TEST_NAMESPACE="atp-production-dr-test"

if [ -z "${BACKUP_NAME}" ]; then
  echo "Usage: $0 <backup-name>"
  echo ""
  echo "Available backups:"
  velero backup get --namespace velero
  exit 1
fi

echo "🧪 DR Test: Restoring backup ${BACKUP_NAME} to test namespace"

# Step 1: Restore backup to test namespace
echo "📥 Step 1: Restoring backup..."
velero restore create "dr-test-${BACKUP_NAME}-$(date +%Y%m%d-%H%M%S)" \
  --from-backup "${BACKUP_NAME}" \
  --namespace-mappings atp-production:${TEST_NAMESPACE} \
  --wait

# Step 2: Verify restored resources
echo "✅ Step 2: Verifying restored resources..."
kubectl get statefulsets -n "${TEST_NAMESPACE}"
kubectl get pvc -n "${TEST_NAMESPACE}"

# Step 3: Test database connectivity
echo "🔌 Step 3: Testing database connectivity..."
kubectl run postgresql-test \
  -n "${TEST_NAMESPACE}" \
  --image=postgres:15 \
  --rm -it --restart=Never \
  -- psql -h postgresql.${TEST_NAMESPACE}.svc.cluster.local -U postgres -c "SELECT version();"

# Step 4: Cleanup
echo "🧹 Step 4: Cleaning up test namespace..."
read -p "Delete test namespace ${TEST_NAMESPACE}? (yes/no): " CONFIRM
if [ "${CONFIRM}" = "yes" ]; then
  kubectl delete namespace "${TEST_NAMESPACE}"
  echo "✅ DR test complete and cleaned up"
else
  echo "⚠️  Test namespace retained: ${TEST_NAMESPACE}"
fi

DR Test Checklist:

## DR Test Checklist

### Pre-Test
- [ ] Backup exists and is valid
- [ ] Test namespace created
- [ ] Test resources allocated

### Test Execution
- [ ] Backup restored successfully
- [ ] StatefulSets recreated
- [ ] PVCs restored
- [ ] Pods running and healthy
- [ ] Database accessible
- [ ] Data integrity verified

### Post-Test
- [ ] Test results documented
- [ ] Test namespace cleaned up
- [ ] Lessons learned captured

Summary: Storage & StatefulSets in GitOps

  • Persistent Volumes (PV) and Claims (PVC): PV and PVC concepts, dynamic provisioning (ATP preference), storage classes, access modes (RWO, RWX, ROX)
  • Azure Disk vs Azure Files: Azure Disk (block storage, single mount) for databases, Azure Files (shared storage, multi-mount) for shared files, performance characteristics comparison, cost comparison, use case selection matrix
  • Storage Classes: Performance tiers (Standard, Premium, Ultra), encryption configuration with Disk Encryption Set, snapshot support, provisioner settings (binding modes, expansion)
  • StatefulSet Deployment Patterns: Ordered deployment and scaling, stable network identity with headless services, persistent storage per pod (volume claim templates), headless service configuration
  • Database Deployments in Kubernetes: PostgreSQL operator (Crunchy Data), MongoDB operator, Redis deployment, StatefulSet vs managed service decision (ATP: managed services for production)
  • Backup and Restore Procedures: Velero for cluster backups, volume snapshots, backup scheduling (6h/daily/weekly), restore procedures (to same/different namespace)
  • Volume Snapshots: Creating snapshots (manual/automated), snapshot classes (Retain/Delete policies), Azure Backup integration, snapshot retention policies per environment
  • Data Migration Strategies: Migrating data between versions (PostgreSQL 14→15 example), volume cloning from snapshots, zero-downtime migrations with logical replication
  • GitOps Considerations for Stateful Apps: Careful rollback procedures (staged rollout), no auto-prune for PVCs (explicit labels), StatefulSet update strategies (RollingUpdate/OnDelete), data persistence across deployments (Retain policy)
  • Disaster Recovery for Persistent Data: Backup frequency per environment (6h/daily/weekly), cross-region replication, RPO targets for databases (< 1 hour production), DR testing for stateful apps (restore validation)

Troubleshooting GitOps Issues

Purpose: Define comprehensive troubleshooting procedures, debugging tools, common error patterns, and escalation procedures for ATP's GitOps deployments, enabling rapid identification and resolution of issues across FluxCD, Kubernetes resources, networking, secrets, health checks, and performance.


FluxCD Sync Failures

Authentication Issues (Git Credentials)

Common Authentication Errors:

# Check GitRepository authentication status
kubectl get gitrepository -n flux-system -o yaml

# Describe GitRepository to see authentication errors
kubectl describe gitrepository atp-gitops-production -n flux-system

# Check secret for Git credentials
kubectl get secret git-credentials -n flux-system -o yaml

# View FluxCD logs for authentication errors
kubectl logs -n flux-system -l app=source-controller --tail=100 | grep -i "auth\|error\|failed"

Troubleshooting Git Authentication:

#!/bin/bash
# scripts/troubleshoot-git-auth.sh

GIT_REPO="${1:-atp-gitops-production}"
NAMESPACE="${2:-flux-system}"

echo "🔍 Troubleshooting Git authentication for: ${GIT_REPO}"

# Step 1: Check GitRepository status
echo "📋 Step 1: Checking GitRepository status..."
kubectl get gitrepository "${GIT_REPO}" -n "${NAMESPACE}" -o jsonpath='{.status.conditions[*]}' | jq .

# Step 2: Check if secret exists
echo "🔐 Step 2: Checking Git credentials secret..."
SECRET_NAME=$(kubectl get gitrepository "${GIT_REPO}" -n "${NAMESPACE}" -o jsonpath='{.spec.secretRef.name}')
if [ -n "${SECRET_NAME}" ]; then
  echo "   Secret name: ${SECRET_NAME}"
  kubectl get secret "${SECRET_NAME}" -n "${NAMESPACE}" || echo "   ❌ Secret not found"
else
  echo "   ⚠️  No secret reference found"
fi

# Step 3: Check source controller logs
echo "📜 Step 3: Checking source controller logs..."
kubectl logs -n "${NAMESPACE}" -l app=source-controller --tail=50 | grep -i "${GIT_REPO}"

# Step 4: Test Git connectivity manually
echo "🌐 Step 4: Testing Git connectivity..."
GIT_URL=$(kubectl get gitrepository "${GIT_REPO}" -n "${NAMESPACE}" -o jsonpath='{.spec.url}')
echo "   Git URL: ${GIT_URL}"

Fix Git Authentication:

# Regenerate Git credentials (PAT)
# Note: Azure DevOps PATs cannot be created with the Azure CLI; generate one in
# Azure DevOps (User settings → Personal access tokens, Code Read scope) or via
# the PAT Lifecycle Management REST API, then export it:
PAT="<new-pat>"

# Update secret
kubectl create secret generic git-credentials \
  --from-literal=username=${USERNAME} \
  --from-literal=password=${PAT} \
  -n flux-system \
  --dry-run=client -o yaml | kubectl apply -f -

# Reconcile GitRepository
flux reconcile source git atp-gitops-production -n flux-system
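Before rotating the in-cluster secret, it helps to confirm the new PAT works at all from a workstation. A minimal sketch (the `ORG`/`PROJECT` URL is a placeholder; Azure DevOps accepts an empty username with the PAT as the Basic auth password):

```shell
# Build the Basic auth header value Azure DevOps expects for PAT authentication.
# The username may be empty; only the PAT (as password) matters.
basic_auth_header() {
  printf '%s' ":$1" | base64 | tr -d '\n'
}

# Probe the repo API with the header (network call shown commented out):
# curl -s -o /dev/null -w '%{http_code}' \
#   -H "Authorization: Basic $(basic_auth_header "$PAT")" \
#   "https://dev.azure.com/ORG/PROJECT/_apis/git/repositories?api-version=7.0"
```

A `203` or `401` response indicates the PAT is invalid or expired; `200` means authentication works and the problem lies in the cluster-side secret.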

Manifest Syntax Errors

Detecting Manifest Syntax Errors:

#!/bin/bash
# scripts/check-manifest-syntax.sh

PATH_TO_CHECK="${1:-.}"

echo "🔍 Checking manifest syntax in: ${PATH_TO_CHECK}"

# Check YAML syntax
find "${PATH_TO_CHECK}" -name "*.yaml" -o -name "*.yml" | while read -r file; do
  echo "Checking: ${file}"
  # Use yamllint or kubeval
  yamllint "${file}" 2>/dev/null || echo "   ⚠️  YAML syntax error in ${file}"
done

# Validate Kubernetes manifests
kubeval --directories "${PATH_TO_CHECK}" --ignore-missing-schemas || echo "   ⚠️  Kubernetes manifest validation errors"

Common Syntax Errors:

| Error Type | Example | Fix |
|---|---|---|
| Indentation | `key:value` | Use proper YAML indentation (spaces, not tabs) |
| Missing colon | `key value` | Use `key: value` |
| Invalid type | `replicas: "3"` (string) | Use `replicas: 3` (integer) |
| Invalid enum | `type: Invalid` | Use a valid Kubernetes enum value |
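Tab indentation is the most common of these errors and is easy to catch before committing. A minimal pre-commit sketch (the helper name is illustrative):

```shell
# check_yaml_tabs: print any lines in a YAML file that contain a tab character
# (YAML forbids tabs for indentation). Exits 0 if tabs are found, 1 otherwise,
# following grep semantics.
check_yaml_tabs() {
  grep -n "$(printf '\t')" "$1"
}

# Usage: check_yaml_tabs manifests/deployment.yaml && echo "tabs found — fix before commit"
```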

Fix Manifest Syntax:

# Validate before committing
kubectl apply --dry-run=client -f manifests/

# Use kubeval for validation
kubectl kustomize . | kubeval

# Use FluxCD validation
flux check --pre

Resource Conflicts (Already Exists)

Identifying Resource Conflicts:

#!/bin/bash
# scripts/check-resource-conflicts.sh

NAMESPACE="${1:-atp-production}"

echo "🔍 Checking for resource conflicts in namespace: ${NAMESPACE}"

# Check for resources not managed by FluxCD
echo "📋 Resources not managed by FluxCD:"
kubectl get all -n "${NAMESPACE}" -o json | \
  jq -r '.items[] | select(.metadata.labels."kustomize.toolkit.fluxcd.io/name" == null) | "\(.kind)/\(.metadata.name)"'

# Check Kustomization status
echo "📦 Kustomization status:"
kubectl get kustomization -n flux-system -o wide | grep "${NAMESPACE}"

# Check for "already exists" errors in FluxCD logs
echo "🚨 Checking FluxCD logs for conflicts:"
kubectl logs -n flux-system -l app=kustomize-controller --tail=100 | \
  grep -i "already exists\|conflict\|error"

Resolve Resource Conflicts:

# Option 1: Adopt existing resource (add FluxCD labels)
kubectl label <resource-type>/<resource-name> kustomize.toolkit.fluxcd.io/name=atp-apps \
  kustomize.toolkit.fluxcd.io/namespace=flux-system \
  -n atp-production

# Option 2: Delete conflicting resource (if safe)
kubectl delete deployment conflicting-deployment -n atp-production

# Option 3: Suspend Kustomization, fix, then resume
flux suspend kustomization atp-apps-production -n flux-system
# Fix the conflict
flux resume kustomization atp-apps-production -n flux-system

Timeout Errors

Troubleshooting Timeout Errors:

#!/bin/bash
# scripts/troubleshoot-timeout.sh

RESOURCE="${1}"
NAMESPACE="${2:-flux-system}"

echo "⏱️  Troubleshooting timeout for: ${RESOURCE}"

# Check resource status
kubectl get kustomization "${RESOURCE}" -n "${NAMESPACE}" -o yaml | grep -A 5 "conditions:"

# Check reconciliation timeout settings
kubectl get kustomization "${RESOURCE}" -n "${NAMESPACE}" -o jsonpath='{.spec.timeout}'

# View detailed status
flux get kustomization "${RESOURCE}" -n "${NAMESPACE}"

# Check for stuck reconciliations
kubectl get kustomization -n "${NAMESPACE}" -o json | \
  jq -r '.items[] | select(.status.conditions[].status == "False") | "\(.metadata.name): \(.status.conditions[].message)"'

Increase Timeout:

# clusters/production/kustomizations/apps-production.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-production
  namespace: flux-system
spec:
  interval: 5m
  timeout: 10m  # Increase timeout from default 5m
  path: ./apps/atp-gateway/overlays/production
  sourceRef:
    kind: GitRepository
    name: atp-gitops-production
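When tuning `timeout` against `interval`, it helps to compare the two numerically. A small sketch that converts single-unit Go-style durations, as FluxCD uses, to seconds (compound values such as `1h30m` are not handled):

```shell
# flux_duration_to_seconds: convert a single-unit Go-style duration
# ("90s", "5m", "1h") to seconds so timeout and interval can be compared.
flux_duration_to_seconds() {
  local d="$1" n unit
  n="${d%?}"          # everything but the last character
  unit="${d#"$n"}"    # the unit suffix
  case "$unit" in
    s) echo "$n" ;;
    m) echo $(( n * 60 )) ;;
    h) echo $(( n * 3600 )) ;;
    *) echo "unsupported duration: $d" >&2; return 1 ;;
  esac
}

# Usage: warn when the timeout is shorter than the reconciliation interval
# [ "$(flux_duration_to_seconds 10m)" -ge "$(flux_duration_to_seconds 5m)" ] || echo "timeout too short"
```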

Network Connectivity Issues

Check Network Connectivity:

#!/bin/bash
# scripts/troubleshoot-network.sh

echo "🌐 Troubleshooting network connectivity..."

# Test Git repository access
GIT_URL=$(kubectl get gitrepository atp-gitops-production -n flux-system -o jsonpath='{.spec.url}')
echo "Testing Git URL: ${GIT_URL}"

# FluxCD controller images are distroless (no shell), so test DNS and HTTPS
# reachability from a temporary debug pod instead
kubectl run net-test --image=curlimages/curl --rm -it --restart=Never -n flux-system -- \
  sh -c "nslookup dev.azure.com; curl -sI ${GIT_URL}"

# Check proxy settings
kubectl get deployment source-controller -n flux-system -o yaml | grep -i proxy

Common Network Issues:

| Issue | Symptom | Fix |
|---|---|---|
| DNS Resolution | `could not resolve host` | Check CoreDNS, network policies |
| Firewall | `connection timeout` | Allow Git repository IPs in NSG |
| Proxy | `proxy authentication required` | Configure proxy in source controller |
| VNet Peering | `network unreachable` | Verify VNet peering configuration |

Drift Detection and Resolution

Identifying Drifted Resources

Detect Drift:

#!/bin/bash
# scripts/detect-drift.sh

NAMESPACE="${1:-atp-production}"
KUSTOMIZATION="${2:-apps-production}"

echo "🔍 Detecting drift in namespace: ${NAMESPACE}"

# Check Kustomization drift status
flux get kustomization "${KUSTOMIZATION}" -n flux-system

# Force reconciliation and check for drift
flux reconcile kustomization "${KUSTOMIZATION}" -n flux-system --with-source

# List resources tracked in the Kustomization inventory
kubectl get kustomization "${KUSTOMIZATION}" -n flux-system -o json | \
  jq -r '.status.inventory.entries[].id'

# Compare Git state with cluster state (--path points at the kustomization directory)
flux diff kustomization "${KUSTOMIZATION}" -n flux-system \
  --path ./apps/atp-gateway/overlays/production

Drift Detection Query (KQL):

// Log Analytics: Detect FluxCD drift events
KubePodInventory
| where Namespace == "flux-system"
| where Name contains "kustomize-controller"
| join kind=inner (
    ContainerLog
    | where LogEntry contains "drift" or LogEntry contains "diff"
    | project TimeGenerated, LogEntry, ContainerID
) on ContainerID
| project TimeGenerated, LogEntry
| order by TimeGenerated desc

Manual Changes Detection

Detect Manual Changes:

#!/bin/bash
# scripts/detect-manual-changes.sh

NAMESPACE="${1:-atp-production}"

echo "🔍 Detecting manually modified resources..."

# Find resources without FluxCD labels
kubectl get all -n "${NAMESPACE}" -o json | \
  jq -r '.items[] | select(.metadata.labels."kustomize.toolkit.fluxcd.io/name" == null) | 
    "\(.kind)/\(.metadata.name) - Not managed by FluxCD"'

# Find resources applied with kubectl (kubectl apply sets the
# last-applied-configuration annotation; FluxCD's server-side apply does not)
kubectl get all -n "${NAMESPACE}" -o json | \
  jq -r '.items[] | select(.metadata.annotations."kubectl.kubernetes.io/last-applied-configuration" != null) |
    "\(.kind)/\(.metadata.name) - Applied with kubectl (possible manual change)"'

# Check Git commit history for resource
RESOURCE="${2}"
if [ -n "${RESOURCE}" ]; then
  git log --all --oneline --grep="${RESOURCE}" -- manifests/
fi

Revert Drift vs Accept Change

Decision Tree for Drift Resolution:

graph TD
    START[Detect Drift] --> CHECK{Type of Change?}
    CHECK -->|Critical Config| REVERT[Force Revert]
    CHECK -->|Performance Tuning| ACCEPT[Accept & Commit]
    CHECK -->|Debugging Change| DECIDE{Production?}

    REVERT --> RECONCILE[Reconcile Resource]
    ACCEPT --> COMMIT[Commit to Git]
    DECIDE -->|Yes| REVERT
    DECIDE -->|No| ACCEPT

    RECONCILE --> VERIFY[Verify Fixed]
    COMMIT --> VERIFY
    VERIFY --> DONE[Complete]

    style REVERT fill:#FFB6C1
    style ACCEPT fill:#90EE90

Force Revert Drift:

#!/bin/bash
# scripts/revert-drift.sh

RESOURCE_TYPE="${1}"  # e.g., deployment
RESOURCE_NAME="${2}"
NAMESPACE="${3:-atp-production}"

echo "🔄 Reverting drift for ${RESOURCE_TYPE}/${RESOURCE_NAME}"

# Diff desired state from Git (--path points at the kustomization directory, not a resource)
flux diff kustomization apps-production -n flux-system \
  --path ./apps/atp-gateway/overlays/production | grep "${RESOURCE_NAME}"

# Force reconciliation
flux reconcile kustomization apps-production -n flux-system --with-source

# Verify reverted (expected-state.yaml: the manifest rendered from Git,
# e.g. via 'flux build kustomization apps-production --path <dir>')
kubectl get "${RESOURCE_TYPE}" "${RESOURCE_NAME}" -n "${NAMESPACE}" -o yaml | \
  diff - expected-state.yaml

Accept and Commit Drift:

#!/bin/bash
# scripts/accept-drift.sh

RESOURCE_TYPE="${1}"
RESOURCE_NAME="${2}"
NAMESPACE="${3:-atp-production}"

echo "✅ Accepting drift for ${RESOURCE_TYPE}/${RESOURCE_NAME}"

# Export current state (strip runtime fields such as status, uid,
# resourceVersion and managedFields before committing)
kubectl get "${RESOURCE_TYPE}" "${RESOURCE_NAME}" -n "${NAMESPACE}" -o yaml > \
  manifests/apps/atp-gateway/base/${RESOURCE_TYPE}-${RESOURCE_NAME}.yaml

# Commit to Git
git add manifests/
git commit -m "Accept drift: ${RESOURCE_TYPE}/${RESOURCE_NAME} in ${NAMESPACE}"
git push

# Reconcile to sync
flux reconcile source git atp-gitops-production -n flux-system

Investigating Drift Causes

Drift Investigation Workflow:

#!/bin/bash
# scripts/investigate-drift.sh

RESOURCE="${1}"
NAMESPACE="${2:-atp-production}"

echo "🔬 Investigating drift cause for: ${RESOURCE}"

# Step 1: Check resource history
echo "📜 Step 1: Resource change history..."
kubectl get events -n "${NAMESPACE}" --field-selector involvedObject.name="${RESOURCE}" --sort-by='.lastTimestamp'

# Step 2: Check audit logs
echo "📋 Step 2: Kubernetes audit logs..."
# Query Azure Monitor Log Analytics for audit logs
cat <<EOF
AzureLogAnalytics Query:
AzureActivity
| where ResourceProvider == "Microsoft.ContainerService"
| where OperationName contains "write"
| where Properties contains "${RESOURCE}"
| order by TimeGenerated desc
EOF

# Step 3: Check FluxCD reconciliation history
echo "🔄 Step 3: FluxCD reconciliation history..."
kubectl get kustomization -n flux-system -o json | \
  jq -r '.items[] | select(.status.inventory.entries[]?.id | contains("'"${RESOURCE}"'")) |
    "\(.metadata.name): last applied revision \(.status.lastAppliedRevision)"'

# Step 4: Compare with Git
echo "📦 Step 4: Compare with Git state..."
flux diff kustomization apps-production -n flux-system | grep "${RESOURCE}"

Image Pull Errors

ACR Authentication Failures

Troubleshooting ACR Authentication:

#!/bin/bash
# scripts/troubleshoot-acr-auth.sh

NAMESPACE="${1:-atp-production}"
POD_NAME="${2}"

echo "🔐 Troubleshooting ACR authentication..."

# Check image pull secrets
echo "📋 Image pull secrets:"
kubectl get secret -n "${NAMESPACE}" | grep -i "docker\|acr\|registry"

# Check pod's image pull secret
if [ -n "${POD_NAME}" ]; then
  echo "🔍 Pod image pull secrets:"
  kubectl get pod "${POD_NAME}" -n "${NAMESPACE}" -o jsonpath='{.spec.imagePullSecrets[*].name}'

  # Image pulls happen on the node via containerd, not inside app containers,
  # so check recent pull events for the pod instead
  echo "🌐 Recent image pull events for pod:"
  kubectl get events -n "${NAMESPACE}" --field-selector involvedObject.name="${POD_NAME}" | grep -i "pull" || true
fi

# Check ACR authentication
ACR_NAME="${3:-connectsoft}"  # registry name, not the login server FQDN
echo "🔑 Checking ACR access..."
az acr repository list --name "${ACR_NAME}" --output table

Fix ACR Authentication:

# Create ACR pull secret from admin credentials (troubleshooting only; prefer
# the AKS-ACR integration or Workload Identity for production)
az acr login --name connectsoft

# Create Kubernetes secret
kubectl create secret docker-registry acr-secret \
  --docker-server=connectsoft.azurecr.io \
  --docker-username=00000000-0000-0000-0000-000000000000 \
  --docker-password=$(az acr credential show --name connectsoft --query passwords[0].value -o tsv) \
  -n atp-production \
  --dry-run=client -o yaml | kubectl apply -f -

# Add to default service account
kubectl patch serviceaccount default -n atp-production -p '{"imagePullSecrets":[{"name":"acr-secret"}]}'

Image Not Found

Troubleshooting Missing Images:

#!/bin/bash
# scripts/troubleshoot-image-not-found.sh

IMAGE="${1}"
NAMESPACE="${2:-atp-production}"

echo "🔍 Troubleshooting image not found: ${IMAGE}"

# Check if image exists in ACR
REGISTRY=$(echo "${IMAGE}" | cut -d'/' -f1)
ACR_NAME="${REGISTRY%%.*}"  # az acr expects the registry name, not the FQDN
REPO_TAG=$(echo "${IMAGE}" | cut -d'/' -f2-)
REPO=$(echo "${REPO_TAG}" | cut -d':' -f1)
TAG=$(echo "${REPO_TAG}" | cut -d':' -f2)

echo "ACR: ${ACR_NAME}"
echo "Repository: ${REPO}"
echo "Tag: ${TAG}"

# Check ACR repository
az acr repository show --name "${ACR_NAME}" --repository "${REPO}" || \
  echo "❌ Repository not found"

# List tags
az acr repository show-tags --name "${ACR_NAME}" --repository "${REPO}" --output table

# Check pods with ImagePullBackOff
echo "🚨 Pods with ImagePullBackOff:"
kubectl get pods -n "${NAMESPACE}" -o json | \
  jq -r '.items[] | select(.status.containerStatuses[]?.state.waiting.reason == "ImagePullBackOff") |
    "\(.metadata.name): \(.spec.containers[].image)"'
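The ad-hoc `cut`-based parsing above is fragile once reused elsewhere; a slightly more robust sketch of the same split (assumes a registry-qualified reference; digest references like `@sha256:...` are out of scope):

```shell
# parse_image_ref: split a registry-qualified OCI image reference into
# registry, repository and tag, defaulting the tag to "latest". Handles
# nested repository paths such as connectsoft.azurecr.io/atp/gateway:v1.2.3.
parse_image_ref() {
  local image="$1" registry rest repo tag
  registry="${image%%/*}"   # everything before the first slash
  rest="${image#*/}"        # repository path plus optional tag
  case "$rest" in
    *:*) repo="${rest%%:*}"; tag="${rest##*:}" ;;
    *)   repo="$rest"; tag="latest" ;;
  esac
  echo "registry=$registry repo=$repo tag=$tag"
}
```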

Fix Missing Image:

# Rebuild and push image
docker build -t connectsoft.azurecr.io/atp/gateway:v1.2.3 .
docker push connectsoft.azurecr.io/atp/gateway:v1.2.3

# Update manifest
kustomize edit set image connectsoft.azurecr.io/atp/gateway:v1.2.3

# Commit and push
git add .
git commit -m "Fix: Update image tag to v1.2.3"
git push

ImagePullBackOff Troubleshooting

Diagnose ImagePullBackOff:

#!/bin/bash
# scripts/diagnose-imagepullbackoff.sh

NAMESPACE="${1:-atp-production}"

echo "🚨 Diagnosing ImagePullBackOff errors..."

# Find pods with ImagePullBackOff
kubectl get pods -n "${NAMESPACE}" -o json | \
  jq -r '.items[] | select(.status.containerStatuses[]?.state.waiting.reason == "ImagePullBackOff") |
    "Pod: \(.metadata.name)\n  Image: \(.spec.containers[].image)\n  Reason: \(.status.containerStatuses[].state.waiting.reason)\n  Message: \(.status.containerStatuses[].state.waiting.message)\n---"'

# Describe pod for details
PODS=$(kubectl get pods -n "${NAMESPACE}" -o json | \
  jq -r '.items[] | select(.status.containerStatuses[]?.state.waiting.reason == "ImagePullBackOff") | .metadata.name')

for POD in ${PODS}; do
  echo "📋 Details for pod: ${POD}"
  kubectl describe pod "${POD}" -n "${NAMESPACE}" | grep -A 10 "Events:"
done

# Check events
kubectl get events -n "${NAMESPACE}" --sort-by='.lastTimestamp' | grep -i "pull\|image\|backoff"

Common ImagePullBackOff Causes:

| Cause | Symptom | Fix |
|---|---|---|
| Image doesn't exist | `manifest unknown` | Rebuild and push image |
| Authentication failed | `unauthorized` | Fix ACR credentials |
| Network issue | `timeout` | Check network policies, DNS |
| Wrong tag | `not found` | Update image tag in manifest |
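When triaging many pods at once, the mapping in the table above can be automated. A sketch (the function name and match patterns are illustrative, not exhaustive):

```shell
# classify_pull_error: map the waiting-state message of a failing container
# to a likely root cause, mirroring the table above.
classify_pull_error() {
  case "$1" in
    *"manifest unknown"*|*"not found"*)          echo "image-or-tag-missing" ;;
    *unauthorized*|*"authentication required"*)  echo "acr-auth-failure" ;;
    *"i/o timeout"*|*timeout*)                   echo "network-issue" ;;
    *)                                           echo "unknown" ;;
  esac
}

# Usage: feed it the message field extracted by the jq query above, e.g.
# classify_pull_error "$(kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[0].state.waiting.message}')"
```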

Resource Conflicts

"Already Exists" Errors

Resolve "Already Exists" Errors:

#!/bin/bash
# scripts/resolve-already-exists.sh

RESOURCE_TYPE="${1}"
RESOURCE_NAME="${2}"
NAMESPACE="${3:-atp-production}"

echo "🔧 Resolving 'already exists' error for ${RESOURCE_TYPE}/${RESOURCE_NAME}"

# Check if resource exists
if kubectl get "${RESOURCE_TYPE}" "${RESOURCE_NAME}" -n "${NAMESPACE}" &>/dev/null; then
  echo "✅ Resource exists"

  # Check if managed by FluxCD
  MANAGED=$(kubectl get "${RESOURCE_TYPE}" "${RESOURCE_NAME}" -n "${NAMESPACE}" -o jsonpath='{.metadata.labels.kustomize\.toolkit\.fluxcd\.io/name}')

  if [ -z "${MANAGED}" ]; then
    echo "⚠️  Resource not managed by FluxCD"
    echo "   Options:"
    echo "   1. Adopt resource: kubectl label ${RESOURCE_TYPE} ${RESOURCE_NAME} kustomize.toolkit.fluxcd.io/name=apps-production -n ${NAMESPACE}"
    echo "   2. Delete resource: kubectl delete ${RESOURCE_TYPE} ${RESOURCE_NAME} -n ${NAMESPACE}"
  else
    echo "✅ Resource managed by FluxCD: ${MANAGED}"
    echo "   Force reconciliation: flux reconcile kustomization ${MANAGED} -n flux-system"
  fi
else
  echo "❌ Resource does not exist"
fi

Immutable Field Errors

Handle Immutable Field Changes:

#!/bin/bash
# scripts/handle-immutable-fields.sh

RESOURCE_TYPE="${1}"
RESOURCE_NAME="${2}"
NAMESPACE="${3:-atp-production}"

echo "🔒 Handling immutable field changes for ${RESOURCE_TYPE}/${RESOURCE_NAME}"

# Common immutable fields
# - Deployment: selector, template.labels
# - Service: clusterIP (if set)
# - StatefulSet: volumeClaimTemplates

# For immutable fields, delete and recreate
echo "⚠️  Immutable field detected. Need to delete and recreate."

# Step 1: Export current resource
kubectl get "${RESOURCE_TYPE}" "${RESOURCE_NAME}" -n "${NAMESPACE}" -o yaml > \
  backup-${RESOURCE_NAME}.yaml

# Step 2: Delete resource
kubectl delete "${RESOURCE_TYPE}" "${RESOURCE_NAME}" -n "${NAMESPACE}"

# Step 3: Reconcile to recreate
flux reconcile kustomization apps-production -n flux-system --with-source

echo "✅ Resource recreated"

Owner Reference Conflicts

Resolve Owner Reference Conflicts:

#!/bin/bash
# scripts/resolve-owner-conflicts.sh

RESOURCE="${1}"
NAMESPACE="${2:-atp-production}"

echo "🔗 Resolving owner reference conflicts for: ${RESOURCE}"

# Check owner references
kubectl get "${RESOURCE}" -n "${NAMESPACE}" -o jsonpath='{.metadata.ownerReferences[*].kind}'

# Remove conflicting owner reference
kubectl patch "${RESOURCE}" -n "${NAMESPACE}" --type=json \
  -p='[{"op": "remove", "path": "/metadata/ownerReferences"}]'

# Or adopt resource properly
kubectl label "${RESOURCE}" -n "${NAMESPACE}" \
  kustomize.toolkit.fluxcd.io/name=apps-production \
  kustomize.toolkit.fluxcd.io/namespace=flux-system

Secret Access Failures

Workload Identity Misconfiguration

Troubleshoot Workload Identity:

#!/bin/bash
# scripts/troubleshoot-workload-identity.sh

NAMESPACE="${1:-atp-production}"
SERVICE_ACCOUNT="${2:-default}"

echo "🔐 Troubleshooting Workload Identity..."

# Check ServiceAccount annotations
echo "📋 ServiceAccount annotations:"
kubectl get serviceaccount "${SERVICE_ACCOUNT}" -n "${NAMESPACE}" -o jsonpath='{.metadata.annotations}' | jq .

# Check federated credentials in Azure AD
AZURE_CLIENT_ID=$(kubectl get serviceaccount "${SERVICE_ACCOUNT}" -n "${NAMESPACE}" -o jsonpath='{.metadata.annotations.azure\.workload\.identity/client-id}')
echo "Azure Client ID: ${AZURE_CLIENT_ID}"

# Check pods opted in to Workload Identity (via the azure.workload.identity/use label)
echo "📦 Pods using Workload Identity:"
kubectl get pods -n "${NAMESPACE}" -o json | \
  jq -r '.items[] | select(.metadata.labels."azure.workload.identity/use" == "true") |
    "Pod: \(.metadata.name)\n  ServiceAccount: \(.spec.serviceAccountName)"'

# Test token acquisition from pod
POD=$(kubectl get pod -n "${NAMESPACE}" -l app=atp-gateway -o jsonpath='{.items[0].metadata.name}')
if [ -n "${POD}" ]; then
  echo "🧪 Testing token acquisition from pod: ${POD}"
  kubectl exec -it "${POD}" -n "${NAMESPACE}" -- \
    cat /var/run/secrets/azure/tokens/azure-identity-token 2>&1 || echo "❌ Token not available"
fi

Key Vault Permission Issues

Check Key Vault Permissions:

#!/bin/bash
# scripts/check-keyvault-permissions.sh

KEY_VAULT="${1:-atp-keyvault}"
IDENTITY="${2}"  # Managed identity client ID

echo "🔑 Checking Key Vault permissions..."

# Check access policies
az keyvault show --name "${KEY_VAULT}" --query "properties.accessPolicies[].objectId"

# Check RBAC permissions (SUBSCRIPTION_ID and RG must be set in the environment)
if [ -n "${IDENTITY}" ]; then
  echo "Checking RBAC for identity: ${IDENTITY}"
  az role assignment list --assignee "${IDENTITY}" --scope "/subscriptions/${SUBSCRIPTION_ID}/resourceGroups/${RG}/providers/Microsoft.KeyVault/vaults/${KEY_VAULT}"
fi

# Test secret access
SECRET_NAME="test-secret"
az keyvault secret show --vault-name "${KEY_VAULT}" --name "${SECRET_NAME}" || \
  echo "❌ Cannot access secret: ${SECRET_NAME}"

ExternalSecret Sync Failures

Troubleshoot ExternalSecret:

#!/bin/bash
# scripts/troubleshoot-externalsecret.sh

SECRET_NAME="${1}"
NAMESPACE="${2:-atp-production}"

echo "🔍 Troubleshooting ExternalSecret: ${SECRET_NAME}"

# Check ExternalSecret status
kubectl get externalsecret "${SECRET_NAME}" -n "${NAMESPACE}" -o yaml | \
  grep -A 20 "status:"

# Check ClusterSecretStore
STORE=$(kubectl get externalsecret "${SECRET_NAME}" -n "${NAMESPACE}" -o jsonpath='{.spec.secretStoreRef.name}')
echo "SecretStore: ${STORE}"
kubectl get clustersecretstore "${STORE}" -o yaml

# Check external-secrets operator logs
kubectl logs -n external-secrets-system -l app.kubernetes.io/name=external-secrets --tail=100 | \
  grep -i "${SECRET_NAME}"

# Force refresh
kubectl annotate externalsecret "${SECRET_NAME}" -n "${NAMESPACE}" \
  force-sync=$(date +%s) --overwrite

Health Check Failures

Readiness Probe Timeouts

Troubleshoot Readiness Probes:

#!/bin/bash
# scripts/troubleshoot-readiness.sh

POD="${1}"
NAMESPACE="${2:-atp-production}"

echo "🏥 Troubleshooting readiness probe for pod: ${POD}"

# Check pod status
kubectl get pod "${POD}" -n "${NAMESPACE}" -o yaml | \
  grep -A 10 "readinessProbe:"

# Check probe configuration
kubectl get pod "${POD}" -n "${NAMESPACE}" -o jsonpath='{.spec.containers[*].readinessProbe}' | jq .

# Test probe endpoint manually
ENDPOINT=$(kubectl get pod "${POD}" -n "${NAMESPACE}" -o jsonpath='{.spec.containers[0].readinessProbe.httpGet.path}')
PORT=$(kubectl get pod "${POD}" -n "${NAMESPACE}" -o jsonpath='{.spec.containers[0].readinessProbe.httpGet.port}')

echo "Testing endpoint: http://localhost:${PORT}${ENDPOINT}"
kubectl exec -it "${POD}" -n "${NAMESPACE}" -- \
  curl -f http://localhost:${PORT}${ENDPOINT} || echo "❌ Probe endpoint failed"

# Check events
kubectl describe pod "${POD}" -n "${NAMESPACE}" | grep -A 5 "Events:"

Liveness Probe Failures

Troubleshoot Liveness Probes:

#!/bin/bash
# scripts/troubleshoot-liveness.sh

POD="${1}"
NAMESPACE="${2:-atp-production}"

echo "💓 Troubleshooting liveness probe for pod: ${POD}"

# Check if pod is restarting
RESTARTS=$(kubectl get pod "${POD}" -n "${NAMESPACE}" -o jsonpath='{.status.containerStatuses[0].restartCount}')
echo "Restart count: ${RESTARTS}"

# Check liveness probe config
kubectl get pod "${POD}" -n "${NAMESPACE}" -o jsonpath='{.spec.containers[*].livenessProbe}' | jq .

# Check previous container logs (if restarted)
if [ "${RESTARTS}" -gt 0 ]; then
  echo "📜 Previous container logs:"
  kubectl logs "${POD}" -n "${NAMESPACE}" --previous --tail=50
fi

# Check current logs
echo "📜 Current container logs:"
kubectl logs "${POD}" -n "${NAMESPACE}" --tail=50

Debugging Health Endpoints

Health Endpoint Debugging:

#!/bin/bash
# scripts/debug-health-endpoint.sh

POD="${1}"
NAMESPACE="${2:-atp-production}"
ENDPOINT="${3:-/health}"

echo "🔍 Debugging health endpoint: ${ENDPOINT}"

# Get pod IP
POD_IP=$(kubectl get pod "${POD}" -n "${NAMESPACE}" -o jsonpath='{.status.podIP}')
echo "Pod IP: ${POD_IP}"

# Test from within pod
kubectl exec -it "${POD}" -n "${NAMESPACE}" -- \
  curl -v http://localhost:8080${ENDPOINT} 2>&1

# Test from another pod
kubectl run debug-pod --image=curlimages/curl --rm -it --restart=Never -n "${NAMESPACE}" -- \
  curl -v http://${POD_IP}:8080${ENDPOINT}

# Check application logs
kubectl logs "${POD}" -n "${NAMESPACE}" --tail=100 | grep -i "health\|ready\|startup"

Networking Issues

Service Discovery Failures

Troubleshoot Service Discovery:

#!/bin/bash
# scripts/troubleshoot-service-discovery.sh

SERVICE="${1}"
NAMESPACE="${2:-atp-production}"

echo "🌐 Troubleshooting service discovery for: ${SERVICE}"

# Check service exists
kubectl get service "${SERVICE}" -n "${NAMESPACE}"

# Check endpoints
kubectl get endpoints "${SERVICE}" -n "${NAMESPACE}"

# Test DNS resolution
POD=$(kubectl get pod -n "${NAMESPACE}" -o jsonpath='{.items[0].metadata.name}')
kubectl exec -it "${POD}" -n "${NAMESPACE}" -- \
  nslookup "${SERVICE}.${NAMESPACE}.svc.cluster.local"

# Test service connectivity
kubectl run test-pod --image=curlimages/curl --rm -it --restart=Never -n "${NAMESPACE}" -- \
  curl -v http://${SERVICE}.${NAMESPACE}.svc.cluster.local:8080

DNS Resolution Problems

Troubleshoot DNS:

#!/bin/bash
# scripts/troubleshoot-dns.sh

NAMESPACE="${1:-atp-production}"

echo "🔍 Troubleshooting DNS resolution..."

# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50

# Test DNS from pod
POD=$(kubectl get pod -n "${NAMESPACE}" -o jsonpath='{.items[0].metadata.name}')
kubectl exec -it "${POD}" -n "${NAMESPACE}" -- \
  nslookup kubernetes.default.svc.cluster.local

# Check DNS configuration
kubectl get configmap coredns -n kube-system -o yaml
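When constructing in-cluster DNS names by hand, note that pod A records use the pod IP with dots replaced by dashes, not the pod name. A sketch of both forms, assuming the default `cluster.local` domain:

```shell
# svc_fqdn: DNS name for a Service: <service>.<namespace>.svc.cluster.local
svc_fqdn() {
  echo "$1.$2.svc.cluster.local"
}

# pod_fqdn: DNS name for a Pod, built from its IP (dots become dashes):
# <ip-with-dashes>.<namespace>.pod.cluster.local
pod_fqdn() {
  echo "$(echo "$1" | tr '.' '-').$2.pod.cluster.local"
}

# Usage: resolve a pod's IP first, then build the name:
# POD_IP=$(kubectl get pod <pod> -n <ns> -o jsonpath='{.status.podIP}')
# nslookup "$(pod_fqdn "$POD_IP" <ns>)"
```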

Network Policy Blocking Traffic

Troubleshoot Network Policies:

#!/bin/bash
# scripts/troubleshoot-network-policy.sh

NAMESPACE="${1:-atp-production}"

echo "🔒 Troubleshooting network policies..."

# List network policies
kubectl get networkpolicies -n "${NAMESPACE}"

# Check if default deny is blocking
kubectl get networkpolicy default-deny-all -n "${NAMESPACE}" && \
  echo "⚠️  Default deny policy found"

# Test connectivity between pods
SOURCE_POD=$(kubectl get pod -n "${NAMESPACE}" -l app=atp-gateway -o jsonpath='{.items[0].metadata.name}')
TARGET_POD=$(kubectl get pod -n "${NAMESPACE}" -l app=atp-ingestion -o jsonpath='{.items[0].metadata.name}')

if [ -n "${SOURCE_POD}" ] && [ -n "${TARGET_POD}" ]; then
  # Pod DNS records use the IP with dashes, not the pod name, so curl the IP directly
  TARGET_IP=$(kubectl get pod "${TARGET_POD}" -n "${NAMESPACE}" -o jsonpath='{.status.podIP}')
  echo "Testing connectivity from ${SOURCE_POD} to ${TARGET_POD} (${TARGET_IP})"
  kubectl exec -it "${SOURCE_POD}" -n "${NAMESPACE}" -- \
    curl -v http://${TARGET_IP}:8080 || \
    echo "❌ Connection blocked"
fi

# Last resort: remove network policies for testing (caution: affects all traffic
# in the namespace; if the policies are managed in Git, FluxCD re-creates them
# on the next reconciliation)
echo "To test without network policies:"
echo "kubectl delete networkpolicy -n ${NAMESPACE} --all"

Performance Issues

Slow Reconciliation

Troubleshoot Slow Reconciliation:

#!/bin/bash
# scripts/troubleshoot-slow-reconciliation.sh

KUSTOMIZATION="${1:-apps-production}"

echo "⏱️  Troubleshooting slow reconciliation..."

# Check reconciliation duration
kubectl get kustomization "${KUSTOMIZATION}" -n flux-system -o json | \
  jq -r '.status.conditions[] | select(.type == "Ready") | 
    "Last reconciliation: \(.lastTransitionTime)\nMessage: \(.message)"'

# Check reconciliation interval
INTERVAL=$(kubectl get kustomization "${KUSTOMIZATION}" -n flux-system -o jsonpath='{.spec.interval}')
echo "Reconciliation interval: ${INTERVAL}"

# Check number of resources
RESOURCE_COUNT=$(kubectl get kustomization "${KUSTOMIZATION}" -n flux-system -o json | \
  jq '.status.inventory.entries | length')
echo "Number of resources: ${RESOURCE_COUNT}"

# Check for large manifests
echo "Checking manifest sizes..."
find manifests/ -name "*.yaml" -exec wc -l {} \; | sort -rn | head -5

# Force reconciliation and measure time
echo "Forcing reconciliation and measuring time..."
time flux reconcile kustomization "${KUSTOMIZATION}" -n flux-system --with-source

Resource Contention

Check Resource Contention:

#!/bin/bash
# scripts/check-resource-contention.sh

NAMESPACE="${1:-atp-production}"

echo "📊 Checking resource contention..."

# Check node resources
kubectl top nodes

# Check pod resources
kubectl top pods -n "${NAMESPACE}"

# Check for resource quotas
kubectl get resourcequota -n "${NAMESPACE}"

# Check for limit ranges
kubectl get limitrange -n "${NAMESPACE}"

# Find pods requesting too many resources
kubectl get pods -n "${NAMESPACE}" -o json | \
  jq -r '.items[] | select(.spec.containers[].resources.requests.cpu != null) |
    "\(.metadata.name): CPU=\(.spec.containers[].resources.requests.cpu) Memory=\(.spec.containers[].resources.requests.memory)"'
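Summing the CPU requests printed above requires normalizing Kubernetes quantities first. A minimal sketch (whole cores and millicores only; fractional values such as `0.5` are not handled):

```shell
# cpu_to_millicores: normalize a Kubernetes CPU quantity ("500m" or "2")
# to millicores so per-pod requests can be summed and compared.
cpu_to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;              # already millicores
    *)  echo $(( $1 * 1000 )) ;;      # whole cores -> millicores
  esac
}

# Usage: total=$(( $(cpu_to_millicores 500m) + $(cpu_to_millicores 2) ))
```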

Debugging Tools

kubectl Commands

Essential kubectl Commands:

# Get resources
kubectl get all -n atp-production
kubectl get pods -n atp-production -o wide
kubectl get events -n atp-production --sort-by='.lastTimestamp'

# Describe resources
kubectl describe pod <pod-name> -n atp-production
kubectl describe deployment <deployment> -n atp-production

# View logs
kubectl logs <pod-name> -n atp-production
kubectl logs <pod-name> -n atp-production --previous  # Previous container
kubectl logs -l app=atp-gateway -n atp-production --tail=100

# Execute commands
kubectl exec -it <pod-name> -n atp-production -- /bin/sh
kubectl exec <pod-name> -n atp-production -- env

# Port forwarding
kubectl port-forward <pod-name> 8080:8080 -n atp-production

# Debugging
kubectl run debug-pod --image=curlimages/curl --rm -it --restart=Never -n atp-production
kubectl debug <pod-name> -n atp-production -it --image=busybox

Flux CLI Commands

Essential Flux Commands:

# Check FluxCD status
flux check
flux get all -A

# Get resources
flux get sources git -A
flux get kustomizations -A
flux get helmreleases -A

# Reconcile resources
flux reconcile source git atp-gitops-production -n flux-system
flux reconcile kustomization apps-production -n flux-system --with-source
flux reconcile helmrelease ingress-nginx -n ingress-nginx

# Suspend/Resume
flux suspend kustomization apps-production -n flux-system
flux resume kustomization apps-production -n flux-system

# Diff and dry-run
flux diff kustomization apps-production -n flux-system
flux build kustomization apps-production -n flux-system

# View logs
flux logs --kind=Kustomization --name=apps-production -n flux-system

Azure CLI for AKS Debugging

Azure CLI AKS Commands:

# Get cluster credentials
az aks get-credentials --resource-group atp-production-rg --name atp-production-aks

# Get cluster info
az aks show --resource-group atp-production-rg --name atp-production-aks

# List node pools
az aks nodepool list --resource-group atp-production-rg --cluster-name atp-production-aks

# Scale node pool
az aks nodepool scale \
  --resource-group atp-production-rg \
  --cluster-name atp-production-aks \
  --name systempool \
  --node-count 5

# Get admin credentials (bypasses Azure AD) for break-glass troubleshooting
az aks get-credentials --resource-group atp-production-rg --name atp-production-aks --admin
kubectl get nodes

Log Analysis in Log Analytics

KQL Queries for Troubleshooting:

// FluxCD reconciliation failures
ContainerLog
| where Namespace == "flux-system"
| where LogEntry contains "error" or LogEntry contains "failed"
| where LogEntry contains "reconcile"
| project TimeGenerated, PodName, LogEntry
| order by TimeGenerated desc

// Pod restart analysis
KubePodInventory
| where Namespace == "atp-production"
| where ContainerRestartCount > 0
| project TimeGenerated, Namespace, PodName, ContainerRestartCount
| order by ContainerRestartCount desc

// Image pull errors
ContainerLog
| where LogEntry contains "ImagePullBackOff" or LogEntry contains "ErrImagePull"
| project TimeGenerated, Namespace, PodName, LogEntry
| order by TimeGenerated desc

// Health check failures
ContainerLog
| where LogEntry contains "readiness probe failed" or LogEntry contains "liveness probe failed"
| project TimeGenerated, Namespace, PodName, LogEntry
| order by TimeGenerated desc

Common Error Patterns

Error Catalog

Common Errors and Solutions:

| Error | Cause | Solution |
|---|---|---|
| ImagePullBackOff | Image not found or auth failed | Check ACR credentials, verify image exists |
| CrashLoopBackOff | Application crashing | Check application logs, health endpoints |
| Pending pod | Insufficient resources | Check node capacity, resource quotas |
| ErrImagePull | Cannot pull image | Fix ACR authentication, network policies |
| CreateContainerConfigError | Secret/config not found | Check secret exists, mount paths |
| Readiness probe failed | Health endpoint not ready | Check application startup, probe config |
| Network policy blocking | Traffic blocked | Update network policy rules |
| PVC pending | Storage class not found | Check StorageClass exists |
| Reconcile timeout | Too many resources | Increase timeout, optimize manifests |
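The catalog above can be condensed into a small triage helper for first-response scripts. This is an illustrative sketch only — the `suggest_fix` function and its messages are assumptions, not part of kubectl or any tool:

```shell
#!/bin/bash
# suggest_fix: map a pod error reason (as shown by `kubectl get pods`)
# to the first remediation step from the error catalog above.
# Illustrative helper, not a real tool.
suggest_fix() {
  case "$1" in
    ImagePullBackOff|ErrImagePull)
      echo "Check ACR credentials and verify the image tag exists" ;;
    CrashLoopBackOff)
      echo "Check application logs and health endpoints" ;;
    Pending)
      echo "Check node capacity and resource quotas" ;;
    CreateContainerConfigError)
      echo "Check that referenced Secrets and ConfigMaps exist" ;;
    *)
      echo "Unknown reason: describe the pod and inspect events" ;;
  esac
}

suggest_fix "ImagePullBackOff"  # prints: Check ACR credentials and verify the image tag exists
```

A wrapper could feed it the STATUS column of `kubectl get pods` for quick on-call triage.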

Decision Tree for Common Errors:

graph TD
    START[Pod Not Running] --> CHECK{Error Type?}
    CHECK -->|ImagePullBackOff| IMAGE[Check ACR Auth<br/>Verify Image Exists]
    CHECK -->|CrashLoopBackOff| LOGS[Check Logs<br/>Health Endpoints]
    CHECK -->|Pending| RESOURCES[Check Resources<br/>Node Capacity]
    CHECK -->|NotReady| PROBE[Check Probes<br/>Application Health]

    IMAGE --> FIX1[Fix Credentials<br/>Rebuild Image]
    LOGS --> FIX2[Fix Application<br/>Update Config]
    RESOURCES --> FIX3[Scale Nodes<br/>Adjust Requests]
    PROBE --> FIX4[Fix Endpoints<br/>Adjust Probes]

    FIX1 --> RECONCILE[Reconcile]
    FIX2 --> RECONCILE
    FIX3 --> RECONCILE
    FIX4 --> RECONCILE
    RECONCILE --> DONE[Verify Fixed]

    style START fill:#FFB6C1
    style DONE fill:#90EE90

Escalation Procedures

When to Escalate

Escalation Triggers:

| Severity | Criteria | Response Time |
|---|---|---|
| P0 - Critical | Production down, data loss | Immediate (15 min) |
| P1 - High | Partial outage, degraded performance | 1 hour |
| P2 - Medium | Non-critical issue, workaround available | 4 hours |
| P3 - Low | Minor issue, cosmetic | Next business day |

Escalation Decision Tree:

graph TD
    START[Issue Detected] --> IMPACT{Impact?}
    IMPACT -->|Production Down| P0[P0 - Escalate Immediately]
    IMPACT -->|Degraded Service| P1[P1 - Escalate within 1h]
    IMPACT -->|Workaround Available| P2[P2 - Escalate within 4h]
    IMPACT -->|Minor Issue| P3[P3 - Next Business Day]

    P0 --> ONCALL[Page On-Call Engineer]
    P1 --> TEAM[Notify Team Lead]
    P2 --> TICKET[Create Ticket]
    P3 --> BACKLOG[Add to Backlog]

    style P0 fill:#FF0000
    style P1 fill:#FFA500
    style P2 fill:#FFFF00
    style P3 fill:#90EE90

Information to Collect

Pre-Escalation Checklist:

#!/bin/bash
# scripts/collect-debug-info.sh

ISSUE="${1}"
NAMESPACE="${2:-atp-production}"

echo "📋 Collecting debug information for escalation..."

# Create debug directory
DEBUG_DIR="debug-$(date +%Y%m%d-%H%M%S)"
mkdir -p "${DEBUG_DIR}"

# Cluster info
kubectl cluster-info > "${DEBUG_DIR}/cluster-info.txt"
kubectl get nodes -o wide > "${DEBUG_DIR}/nodes.txt"

# Resource status
kubectl get all -n "${NAMESPACE}" > "${DEBUG_DIR}/resources.txt"
kubectl get events -n "${NAMESPACE}" --sort-by='.lastTimestamp' > "${DEBUG_DIR}/events.txt"

# FluxCD status
flux get all -A > "${DEBUG_DIR}/flux-status.txt"
kubectl get kustomization -A -o yaml > "${DEBUG_DIR}/kustomizations.yaml"

# Logs
kubectl logs -n flux-system -l app=kustomize-controller --tail=200 > "${DEBUG_DIR}/flux-logs.txt"
kubectl logs -n "${NAMESPACE}" --all-containers --tail=100 > "${DEBUG_DIR}/app-logs.txt"

# Network policies
kubectl get networkpolicies -n "${NAMESPACE}" -o yaml > "${DEBUG_DIR}/network-policies.yaml"

# Secrets (sanitized)
kubectl get secrets -n "${NAMESPACE}" -o json | \
  jq '.items[] | {name: .metadata.name, type: .type}' > "${DEBUG_DIR}/secrets-list.json"

# Package debug info
tar -czf "${DEBUG_DIR}.tar.gz" "${DEBUG_DIR}"
echo "✅ Debug information collected: ${DEBUG_DIR}.tar.gz"

Incident Severity Levels

Severity Level Definitions:

| Level | Description | Examples | Response |
|---|---|---|---|
| P0 - Critical | Complete service outage, data loss risk | All pods down, database inaccessible | Immediate, on-call escalation |
| P1 - High | Partial outage, significant impact | 50% pods down, slow response times | 1 hour, team notification |
| P2 - Medium | Degraded service, workaround available | Single service down, minor features broken | 4 hours, ticket creation |
| P3 - Low | Minor issue, no user impact | Documentation issue, cosmetic bug | Next business day, backlog |
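The severity-to-response-time mapping above can also live in automation (e.g. a paging script). A minimal sketch, assuming the hypothetical function name `response_time`:

```shell
#!/bin/bash
# response_time: look up the target response time for a severity level,
# mirroring the severity table above (illustrative helper).
response_time() {
  case "$1" in
    P0) echo "Immediate (15 min)" ;;
    P1) echo "1 hour" ;;
    P2) echo "4 hours" ;;
    P3) echo "Next business day" ;;
    *)  echo "Unknown severity"; return 1 ;;
  esac
}

response_time P1   # prints: 1 hour
```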

Summary: Troubleshooting GitOps Issues

  • FluxCD Sync Failures: Authentication issues (Git credentials), manifest syntax errors, resource conflicts (already exists), timeout errors, network connectivity issues
  • Drift Detection and Resolution: Identifying drifted resources, manual changes detection, revert drift vs accept change decision tree, investigating drift causes
  • Image Pull Errors: ACR authentication failures, image not found, ImagePullBackOff troubleshooting with diagnostic scripts
  • Resource Conflicts: "already exists" errors, immutable field errors, owner reference conflicts, resolution strategies
  • Secret Access Failures: Workload Identity misconfiguration, Key Vault permission issues, ExternalSecret sync failures
  • Health Check Failures: Readiness probe timeouts, liveness probe failures, debugging health endpoints
  • Networking Issues: Service discovery failures, DNS resolution problems, network policy blocking traffic
  • Performance Issues: Slow reconciliation, resource contention, high CPU/memory usage, throttling and rate limiting
  • Debugging Tools: kubectl commands (get, describe, logs, exec), Flux CLI commands (get, reconcile, suspend), Azure CLI for AKS, Log Analytics KQL queries
  • Common Error Patterns: Error catalog with solutions, decision trees for troubleshooting, known issues and workarounds
  • Escalation Procedures: When to escalate (severity levels), who to escalate to, information to collect (pre-escalation checklist), incident severity levels (P0-P3)

Day 2 Operations & Maintenance

Purpose: Define comprehensive day 2 operations, maintenance procedures, upgrade processes, capacity planning, security patching, performance tuning, and operational excellence practices for ATP's GitOps deployments, ensuring reliable, secure, and efficient long-term platform operations.


Routine Maintenance Tasks

Daily: Monitoring Checks, Alert Review

Daily Maintenance Checklist:

#!/bin/bash
# scripts/daily-maintenance-check.sh

echo "📋 Daily Maintenance Checklist - $(date +%Y-%m-%d)"
echo "=============================================="

# Check cluster health
echo "🏥 1. Cluster Health Check"
kubectl get nodes
kubectl get pods --all-namespaces | grep -v Running | grep -v Completed

# Check FluxCD status
echo "🔄 2. FluxCD Status"
# READY column shows True/False; flag anything not ready
flux get all -A | grep -w False && echo "   ⚠️  Some FluxCD resources not ready" || echo "   ✅ All FluxCD resources ready"

# Review critical alerts
echo "🚨 3. Critical Alerts Review"
# Query Azure Monitor for critical alerts from last 24 hours
cat <<EOF
Azure Monitor Query:
AzureMetrics
| where TimeGenerated > ago(24h)
| where MetricName contains "error" or MetricName contains "failure"
| where Value > 0
| summarize count() by MetricName
| order by count_ desc
EOF

# Check certificate expiration
echo "🔐 4. Certificate Expiration Check"
kubectl get certificates --all-namespaces -o json | \
  jq -r '.items[] | select(.status.conditions[]?.status == "True") |
    "\(.metadata.namespace)/\(.metadata.name): Expires \(.status.notAfter)"' | \
  while read cert; do
    EXPIRY=$(echo "${cert}" | cut -d' ' -f3-)
    DAYS_LEFT=$(( ($(date -d "${EXPIRY}" +%s) - $(date +%s)) / 86400 ))
    if [ "${DAYS_LEFT}" -lt 30 ]; then
      echo "   ⚠️  ${cert} (${DAYS_LEFT} days remaining)"
    fi
  done

# Check backup status
echo "💾 5. Backup Status"
velero backup get --namespace velero | head -6  # header + 5 most recent

# Check resource utilization
echo "📊 6. Resource Utilization"
kubectl top nodes
kubectl top pods -n atp-production --sort-by=memory | head -10

echo "✅ Daily checks complete"

Daily Alert Review Procedure:

## Daily Alert Review Process

### Critical Alerts (P0/P1)
1. Review all critical alerts from last 24 hours
2. Verify alerts are actionable (not false positives)
3. Document any new alert patterns
4. Escalate unresolved critical alerts

### Warning Alerts
1. Review warning alerts weekly (not daily)
2. Tune alert thresholds if needed
3. Document patterns for capacity planning

### Alert Noise Reduction
1. Disable or adjust noisy alerts
2. Add alert grouping rules
3. Update alert runbooks

Weekly: Capacity Planning, Cost Review

Weekly Maintenance Checklist:

#!/bin/bash
# scripts/weekly-maintenance-check.sh

echo "📋 Weekly Maintenance Checklist - Week $(date +%V)"
echo "=============================================="

# Capacity planning
echo "📈 1. Capacity Planning Review"
# Check node utilization trends
kubectl top nodes
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.capacity.cpu,MEMORY:.status.capacity.memory

# Check pod density
POD_COUNT=$(kubectl get pods --all-namespaces --field-selector=status.phase=Running --no-headers | wc -l)
NODE_COUNT=$(kubectl get nodes --no-headers | wc -l)
AVG_PODS_PER_NODE=$((POD_COUNT / NODE_COUNT))
echo "   Average pods per node: ${AVG_PODS_PER_NODE}"

# Storage growth trend
echo "💾 2. Storage Growth Analysis"
kubectl get pvc --all-namespaces -o json | \
  jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name): \(.status.capacity.storage)"' | \
  sort | uniq -c

# Cost review
echo "💰 3. Cost Review"
cat <<EOF
Azure Cost Management Query:
UsageDetails
| where TimeGenerated > ago(7d)
| where ResourceGroup contains "atp-production"
| summarize TotalCost=sum(Cost) by ResourceType
| order by TotalCost desc
EOF

# Review pending updates
echo "🔄 4. Pending Updates Review"
# Review source revisions and chart versions for pending updates
flux get sources all -A
flux get helmreleases -A

# Review failed deployments
echo "❌ 5. Failed Deployment Review"
kubectl get pods --all-namespaces --field-selector=status.phase!=Running,status.phase!=Succeeded -o wide

echo "✅ Weekly checks complete"
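The `AVG_PODS_PER_NODE` arithmetic in the script above uses integer division, which truncates (29 pods across 10 nodes reports 2). A rounded variant, isolated here as a standalone function (`avg_pods_per_node` is an illustrative name):

```shell
#!/bin/bash
# avg_pods_per_node: pods / nodes rounded to the nearest integer,
# instead of the truncating $((POD_COUNT / NODE_COUNT)).
avg_pods_per_node() {
  local pods="$1" nodes="$2"
  # Adding half the divisor before dividing rounds to nearest
  echo $(( (pods + nodes / 2) / nodes ))
}

avg_pods_per_node 29 10   # prints: 3
```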

Weekly Capacity Planning Report:

#!/bin/bash
# scripts/capacity-planning-report.sh

OUTPUT_FILE="capacity-report-$(date +%Y%m%d).md"

cat > "${OUTPUT_FILE}" <<EOF
# Capacity Planning Report - $(date +%Y-%m-%d)

## Node Utilization

\`\`\`
$(kubectl top nodes)
\`\`\`

## Pod Distribution

\`\`\`
$(kubectl get pods --all-namespaces -o wide | awk '{print $1, $8}' | sort | uniq -c)
\`\`\`

## Storage Usage

\`\`\`
$(kubectl get pvc --all-namespaces -o json | jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name): \(.status.capacity.storage)"')
\`\`\`

## Recommendations

- Review node pool sizes based on utilization trends
- Plan for expected growth in next quarter
- Consider right-sizing underutilized resources
EOF

echo "✅ Report generated: ${OUTPUT_FILE}"

Monthly: Security Patches, Access Reviews

Monthly Maintenance Checklist:

#!/bin/bash
# scripts/monthly-maintenance-check.sh

echo "📋 Monthly Maintenance Checklist - $(date +%Y-%m)"
echo "=============================================="

# Security patches
echo "🔒 1. Security Patch Review"
# Check for available Kubernetes version upgrades
az aks get-upgrades --resource-group atp-production-rg --name atp-production-aks

# Check container image vulnerabilities
echo "   Scanning for vulnerabilities..."
# Use Trivy or Azure Defender to scan images

# Access reviews
echo "👥 2. Access Reviews"
# List all RBAC bindings
kubectl get rolebindings --all-namespaces -o json | \
  jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name): \(.subjects[].name)"'

kubectl get clusterrolebindings -o json | \
  jq -r '.items[] | "\(.metadata.name): \(.subjects[].name)"'

# Review ServiceAccount usage
echo "   ServiceAccount review..."
kubectl get serviceaccounts --all-namespaces -o json | \
  jq -r '.items[] | select(.metadata.name != "default") | "\(.metadata.namespace)/\(.metadata.name)"'

# Backup verification
echo "💾 3. Backup Verification"
# Test restore from latest backup
velero backup get --namespace velero | head -5

# Compliance check
echo "✅ 4. Compliance Check"
# Check network policies are applied
kubectl get networkpolicies --all-namespaces | wc -l

# Check pod security standards
kubectl get pods --all-namespaces -o json | \
  jq -r '.items[] | select(.spec.securityContext == null) | "\(.metadata.namespace)/\(.metadata.name): Missing security context"'

echo "✅ Monthly checks complete"

Quarterly: DR Drills, Policy Updates

Quarterly Maintenance Schedule:

gantt
    title Quarterly Maintenance Calendar
    dateFormat YYYY-MM-DD
    section Q1
    DR Drill Production           :2024-01-15, 1d
    Policy Review                 :2024-01-20, 2d
    Security Audit                :2024-01-25, 3d
    section Q2
    DR Drill Production           :2024-04-15, 1d
    Policy Review                 :2024-04-20, 2d
    Security Audit                :2024-04-25, 3d
    section Q3
    DR Drill Production           :2024-07-15, 1d
    Policy Review                 :2024-07-20, 2d
    Security Audit                :2024-07-25, 3d
    section Q4
    DR Drill Production           :2024-10-15, 1d
    Policy Review                 :2024-10-20, 2d
    Security Audit                :2024-10-25, 3d

Quarterly DR Drill Procedure:

#!/bin/bash
# scripts/quarterly-dr-drill.sh

QUARTER="${1:-Q1}"
YEAR="${2:-2024}"

echo "🧪 Quarterly DR Drill - ${QUARTER} ${YEAR}"
echo "========================================="

# Step 1: Select random backup
echo "📥 Step 1: Selecting test backup..."
BACKUP=$(velero backup get --namespace velero | grep atp-production | tail -5 | shuf -n 1 | awk '{print $1}')
echo "   Using backup: ${BACKUP}"

# Step 2: Restore to test namespace
echo "🔄 Step 2: Restoring to test namespace..."
# Kubernetes namespaces and restore names must be lowercase DNS-1123 labels,
# so lowercase the quarter/year before building them
QY=$(echo "${QUARTER}-${YEAR}" | tr '[:upper:]' '[:lower:]')
TEST_NS="atp-production-dr-test-${QY}"
velero restore create "dr-drill-${QY}-$(date +%Y%m%d)" \
  --from-backup "${BACKUP}" \
  --namespace-mappings "atp-production:${TEST_NS}" \
  --wait

# Step 3: Validate restore
echo "✅ Step 3: Validating restore..."
kubectl get all -n "${TEST_NS}"
kubectl get pvc -n "${TEST_NS}"

# Step 4: Test application functionality
echo "🧪 Step 4: Testing application..."
# Run smoke tests against restored environment

# Step 5: Document results
echo "📝 Step 5: Documenting results..."
cat > "dr-drill-report-${QUARTER}-${YEAR}.md" <<EOF
# DR Drill Report - ${QUARTER} ${YEAR}

**Date**: $(date +%Y-%m-%d)
**Backup Used**: ${BACKUP}
**Test Namespace**: ${TEST_NS}

## Results

- Restore: ✅ Success
- Application Functionality: ✅ Verified
- Data Integrity: ✅ Verified

## Lessons Learned

- [Add lessons learned here]

## Recommendations

- [Add recommendations here]
EOF

# Step 6: Cleanup
read -p "Delete test namespace ${TEST_NS}? (yes/no): " CONFIRM
if [ "${CONFIRM}" = "yes" ]; then
  kubectl delete namespace "${TEST_NS}"
  echo "✅ Test namespace deleted"
fi

echo "✅ DR drill complete"
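Namespace names built from variables like `${QUARTER}` must be valid DNS-1123 labels (lowercase alphanumeric and `-`, starting and ending alphanumeric, at most 63 characters). A small validator, sketched here under the hypothetical name `valid_ns_name`, can catch a bad `TEST_NS` before the restore runs:

```shell
#!/bin/bash
# valid_ns_name: return 0 if the argument is a valid Kubernetes
# namespace name (DNS-1123 label: lowercase alphanumeric and '-',
# starts/ends alphanumeric, max 63 chars).
valid_ns_name() {
  local name="$1"
  [ "${#name}" -le 63 ] || return 1
  printf '%s' "${name}" | grep -Eq '^[a-z0-9]([a-z0-9-]*[a-z0-9])?$'
}

valid_ns_name "atp-production-dr-test-q1-2024" && echo valid
valid_ns_name "atp-production-dr-test-Q1-2024" || echo "invalid (uppercase)"
```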

FluxCD Upgrades

Upgrade Planning

FluxCD Upgrade Planning Checklist:

## FluxCD Upgrade Planning

### Pre-Upgrade
- [ ] Review FluxCD release notes
- [ ] Check breaking changes
- [ ] Test in dev environment first
- [ ] Schedule maintenance window
- [ ] Notify stakeholders
- [ ] Prepare rollback plan

### Upgrade Steps
1. Backup current FluxCD configuration
2. Upgrade CLI tools
3. Test upgrade in dev
4. Schedule production upgrade
5. Execute upgrade
6. Validate functionality
7. Monitor for issues

### Post-Upgrade
- [ ] Verify all Kustomizations working
- [ ] Check GitRepository connections
- [ ] Validate HelmReleases
- [ ] Review reconciliation logs
- [ ] Update documentation

FluxCD Version Compatibility Matrix:

| FluxCD Version | Kubernetes Min | Kubernetes Max | Breaking Changes |
|---|---|---|---|
| 2.1.x | 1.24+ | 1.27 | None |
| 2.2.x | 1.24+ | 1.28 | CRD changes |
| 2.3.x | 1.25+ | 1.29 | API version updates |
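A pre-flight check can encode the matrix so upgrade scripts fail fast on an unsupported pairing. This sketch hard-codes the table values above for illustration (`flux_supports` and `k8s_minor` are invented names; always confirm ranges against the FluxCD release notes):

```shell
#!/bin/bash
# k8s_minor: extract the minor version number from "1.27" or "1.27.3".
k8s_minor() { echo "$1" | cut -d. -f2; }

# flux_supports: check a Kubernetes version against the min/max columns
# of the compatibility matrix above (values hard-coded for illustration).
flux_supports() {
  local flux="$1" k8s="$2" min max
  case "${flux}" in
    2.1.*) min=24; max=27 ;;
    2.2.*) min=24; max=28 ;;
    2.3.*) min=25; max=29 ;;
    *) return 2 ;;   # unknown FluxCD release
  esac
  local minor; minor=$(k8s_minor "${k8s}")
  [ "${minor}" -ge "${min}" ] && [ "${minor}" -le "${max}" ]
}

flux_supports 2.2.3 1.28 && echo "supported"
flux_supports 2.3.0 1.24 || echo "unsupported"
```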

Testing in Dev/Test First

Test Upgrade Procedure:

#!/bin/bash
# scripts/test-fluxcd-upgrade.sh

TARGET_VERSION="${1:-2.2.0}"
NAMESPACE="${2:-flux-system}"

echo "🧪 Testing FluxCD upgrade to ${TARGET_VERSION}"

# Step 1: Backup current configuration
echo "💾 Step 1: Backing up current configuration..."
kubectl get gitrepository,kustomization,helmrelease -n "${NAMESPACE}" -o yaml > \
  flux-backup-$(date +%Y%m%d).yaml

# Step 2: Install new FluxCD CLI
echo "⬇️  Step 2: Installing FluxCD CLI..."
# Note: the install script fetches the latest CLI release; pin the version
# per the Flux installation docs if it must match ${TARGET_VERSION}
curl -s https://fluxcd.io/install.sh | sudo bash
flux version

# Step 3: Upgrade FluxCD
echo "🔄 Step 3: Upgrading FluxCD..."
flux install --version=${TARGET_VERSION} --namespace="${NAMESPACE}"

# Step 4: Wait for controllers to be ready
echo "⏳ Step 4: Waiting for controllers..."
kubectl wait --for=condition=ready pod -l app=source-controller -n "${NAMESPACE}" --timeout=300s
kubectl wait --for=condition=ready pod -l app=kustomize-controller -n "${NAMESPACE}" --timeout=300s
kubectl wait --for=condition=ready pod -l app=helm-controller -n "${NAMESPACE}" --timeout=300s

# Step 5: Validate functionality
echo "✅ Step 5: Validating functionality..."
flux check
flux get all -A

# Step 6: Test reconciliation
echo "🔄 Step 6: Testing reconciliation..."
flux reconcile source git atp-gitops-dev -n "${NAMESPACE}"  # --with-source applies only to kustomization/helmrelease reconciles

echo "✅ Upgrade test complete"

Upgrade Procedure

Production Upgrade Runbook:

#!/bin/bash
# scripts/upgrade-fluxcd-production.sh

TARGET_VERSION="${1}"
MAINTENANCE_WINDOW="${2}"  # e.g., "2024-01-15 02:00"

if [ -z "${TARGET_VERSION}" ]; then
  echo "Usage: $0 <target-version> [maintenance-window]"
  exit 1
fi

echo "🔄 FluxCD Production Upgrade to ${TARGET_VERSION}"
echo "Maintenance Window: ${MAINTENANCE_WINDOW}"

# Pre-upgrade checklist
echo "📋 Pre-Upgrade Checklist"
echo "1. Backup all FluxCD resources"
kubectl get gitrepository,kustomization,helmrelease -A -o yaml > \
  flux-production-backup-$(date +%Y%m%d-%H%M%S).yaml

echo "2. Verify all Kustomizations are healthy"
# READY column shows True/False; abort if anything is not ready
if flux get kustomizations -A | grep -w False; then
  echo "⚠️  Some Kustomizations not ready"
  exit 1
fi

echo "3. Suspend auto-reconciliation for critical resources"
# flux suspend kustomization critical-apps-production -n flux-system

# Upgrade
echo "🔄 Upgrading FluxCD..."
flux install --version=${TARGET_VERSION} --namespace=flux-system

# Wait for readiness
echo "⏳ Waiting for controllers to be ready..."
kubectl wait --for=condition=ready pod -l app=source-controller -n flux-system --timeout=300s
kubectl wait --for=condition=ready pod -l app=kustomize-controller -n flux-system --timeout=300s
kubectl wait --for=condition=ready pod -l app=helm-controller -n flux-system --timeout=300s

# Resume reconciliation
# flux resume kustomization critical-apps-production -n flux-system

# Validate
echo "✅ Validating upgrade..."
flux check
flux get all -A

# Force reconciliation (flux reconcile requires an explicit resource name)
echo "🔄 Forcing reconciliation..."
flux reconcile source git atp-gitops-production -n flux-system
flux reconcile kustomization apps-production -n flux-system --with-source

echo "✅ Upgrade complete"

Rollback Plan

FluxCD Rollback Procedure:

#!/bin/bash
# scripts/rollback-fluxcd.sh

PREVIOUS_VERSION="${1}"
BACKUP_FILE="${2}"

if [ -z "${PREVIOUS_VERSION}" ] || [ -z "${BACKUP_FILE}" ]; then
  echo "Usage: $0 <previous-version> <backup-file>"
  exit 1
fi

echo "⏪ Rolling back FluxCD to ${PREVIOUS_VERSION}"

# Step 1: Suspend all reconciliation
echo "⏸️  Suspending reconciliation..."
# Suspend every resource in the namespace (repeat per namespace as needed)
flux suspend kustomization --all -n flux-system
flux suspend helmrelease --all -n flux-system

# Step 2: Uninstall current FluxCD
echo "🗑️  Uninstalling current FluxCD..."
flux uninstall --silent

# Step 3: Install previous version
echo "⬇️  Installing previous version..."
flux install --version=${PREVIOUS_VERSION} --namespace=flux-system

# Step 4: Restore configuration from backup
echo "📥 Restoring configuration..."
kubectl apply -f "${BACKUP_FILE}"

# Step 5: Resume reconciliation
echo "▶️  Resuming reconciliation..."
flux resume kustomization --all -n flux-system
flux resume helmrelease --all -n flux-system

# Step 6: Validate
echo "✅ Validating rollback..."
flux check
flux get all -A

echo "✅ Rollback complete"

Post-Upgrade Validation

Post-Upgrade Validation Checklist:

#!/bin/bash
# scripts/validate-fluxcd-upgrade.sh

echo "✅ Post-Upgrade Validation"

# Check FluxCD version
echo "📋 1. FluxCD Version"
flux version

# Check all controllers are ready
echo "🏥 2. Controller Health"
flux check

# Verify all sources are ready
echo "📦 3. Source Status"
flux get sources all -A | grep -w False && echo "⚠️  Some sources not ready"

# Verify all Kustomizations are ready
echo "🔄 4. Kustomization Status"
flux get kustomizations -A | grep -w False && echo "⚠️  Some Kustomizations not ready"

# Verify all HelmReleases are ready
echo "📦 5. HelmRelease Status"
flux get helmreleases -A | grep -w False && echo "⚠️  Some HelmReleases not ready"

# Test reconciliation
echo "🔄 6. Testing Reconciliation"
flux reconcile source git atp-gitops-production -n flux-system
flux reconcile kustomization apps-production -n flux-system --with-source

# Check for errors in logs
echo "📜 7. Checking for Errors"
kubectl logs -n flux-system -l app=kustomize-controller --tail=100 | grep -i error

echo "✅ Validation complete"

AKS Cluster Patching

Kubernetes Version Upgrades

AKS Upgrade Planning:

#!/bin/bash
# scripts/plan-aks-upgrade.sh

CLUSTER="${1:-atp-production-aks}"
RESOURCE_GROUP="${2:-atp-production-rg}"

echo "📋 AKS Upgrade Planning for ${CLUSTER}"

# Check current version
CURRENT_VERSION=$(az aks show \
  --resource-group "${RESOURCE_GROUP}" \
  --name "${CLUSTER}" \
  --query kubernetesVersion -o tsv)
echo "Current version: ${CURRENT_VERSION}"

# Check available upgrades
echo "Available upgrades:"
az aks get-upgrades \
  --resource-group "${RESOURCE_GROUP}" \
  --name "${CLUSTER}" \
  --output table

# Check node pool versions
echo "Node pool versions:"
az aks nodepool list \
  --resource-group "${RESOURCE_GROUP}" \
  --cluster-name "${CLUSTER}" \
  --query '[].{Name:name, Version:orchestratorVersion}' \
  --output table

AKS Upgrade Procedure:

#!/bin/bash
# scripts/upgrade-aks-cluster.sh

CLUSTER="${1:-atp-production-aks}"
RESOURCE_GROUP="${2:-atp-production-rg}"
TARGET_VERSION="${3}"

if [ -z "${TARGET_VERSION}" ]; then
  echo "Usage: $0 <cluster> <resource-group> <target-version>"
  exit 1
fi

echo "🔄 Upgrading AKS cluster to ${TARGET_VERSION}"

# Step 1: Pre-upgrade validation
echo "📋 Step 1: Pre-upgrade validation..."
kubectl get nodes
kubectl get pods --all-namespaces | grep -v Running | grep -v Completed

# Step 2: Upgrade control plane
echo "⬆️  Step 2: Upgrading control plane..."
az aks upgrade \
  --resource-group "${RESOURCE_GROUP}" \
  --name "${CLUSTER}" \
  --kubernetes-version "${TARGET_VERSION}" \
  --control-plane-only

# Step 3: Wait for control plane upgrade
echo "⏳ Step 3: Waiting for control plane..."
az aks show \
  --resource-group "${RESOURCE_GROUP}" \
  --name "${CLUSTER}" \
  --query "{Status:powerState.code, Version:kubernetesVersion}" \
  --output table

# Step 4: Upgrade node pools
echo "⬆️  Step 4: Upgrading node pools..."
NODEPOOLS=$(az aks nodepool list \
  --resource-group "${RESOURCE_GROUP}" \
  --cluster-name "${CLUSTER}" \
  --query '[].name' -o tsv)

for POOL in ${NODEPOOLS}; do
  echo "   Upgrading node pool: ${POOL}"
  az aks nodepool upgrade \
    --resource-group "${RESOURCE_GROUP}" \
    --cluster-name "${CLUSTER}" \
    --name "${POOL}" \
    --kubernetes-version "${TARGET_VERSION}" \
    --max-surge 33%
done

# Step 5: Post-upgrade validation
echo "✅ Step 5: Post-upgrade validation..."
kubectl get nodes
kubectl get pods --all-namespaces | grep -v Running | grep -v Completed

echo "✅ Upgrade complete"

Node OS Patching

Node OS Patching Schedule:

# platform/node-patching/cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: node-os-patch-check
  namespace: kube-system
spec:
  schedule: "0 2 * * 0"  # Every Sunday at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: patch-check
            image: mcr.microsoft.com/aks/aks-tools:latest
            command:
            - /bin/sh
            - -c
            - |
              echo "Checking for available OS patches..."
              az aks nodepool show \
                --resource-group ${RESOURCE_GROUP} \
                --cluster-name ${CLUSTER_NAME} \
                --name systempool \
                --query "nodeImageVersion" -o tsv
          restartPolicy: OnFailure

Node Pool Rotation

Node Pool Rotation for Zero-Downtime Patching:

#!/bin/bash
# scripts/rotate-nodepool.sh

CLUSTER="${1:-atp-production-aks}"
RESOURCE_GROUP="${2:-atp-production-rg}"
NODEPOOL="${3:-systempool}"

echo "🔄 Rotating node pool: ${NODEPOOL}"

# Step 1: Create new node pool
echo "➕ Step 1: Creating new node pool..."
# AKS Linux node pool names must be lowercase alphanumeric, max 12 chars,
# so derive a short name rather than appending "-new"
NEW_POOL="$(echo "${NODEPOOL}" | cut -c1-9)new"
az aks nodepool add \
  --resource-group "${RESOURCE_GROUP}" \
  --cluster-name "${CLUSTER}" \
  --name "${NEW_POOL}" \
  --node-count 3 \
  --node-vm-size Standard_D4s_v3 \
  --max-surge 33%

# Step 2: Cordon old nodes
echo "🚫 Step 2: Cordoning old nodes..."
OLD_NODES=$(kubectl get nodes -l agentpool=${NODEPOOL} -o jsonpath='{.items[*].metadata.name}')
for NODE in ${OLD_NODES}; do
  kubectl cordon "${NODE}"
done

# Step 3: Drain old nodes
echo "💧 Step 3: Draining old nodes..."
for NODE in ${OLD_NODES}; do
  kubectl drain "${NODE}" --ignore-daemonsets --delete-emptydir-data --grace-period=300
done

# Step 4: Delete old node pool
echo "🗑️  Step 4: Deleting old node pool..."
az aks nodepool delete \
  --resource-group "${RESOURCE_GROUP}" \
  --cluster-name "${CLUSTER}" \
  --name "${NODEPOOL}"

# Step 5: Verify new pool (AKS node pools cannot be renamed; the
# replacement pool keeps its own name)
echo "✅ Step 5: Verifying new pool..."
az aks nodepool show \
  --resource-group "${RESOURCE_GROUP}" \
  --cluster-name "${CLUSTER}" \
  --name "${NEW_POOL}" \
  --query '{Name:name, Count:count, Status:provisioningState}' \
  --output table

echo "✅ Node pool rotation complete"
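Because AKS Linux node pool names are restricted to lowercase alphanumeric characters starting with a letter, 1-12 characters, it is worth validating a constructed pool name before calling `az aks nodepool add`. A sketch, using the invented helper name `valid_pool_name`:

```shell
#!/bin/bash
# valid_pool_name: return 0 if the argument is a valid AKS Linux
# node pool name (starts with a lowercase letter, lowercase
# alphanumeric only, 1-12 characters).
valid_pool_name() {
  printf '%s' "$1" | grep -Eq '^[a-z][a-z0-9]{0,11}$'
}

valid_pool_name "systempool2" && echo ok
valid_pool_name "systempool-new" || echo "invalid: hyphen and over 12 chars"
```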

Certificate Renewals

Monitoring Certificate Expiration

Certificate Expiration Monitoring:

#!/bin/bash
# scripts/monitor-certificate-expiration.sh

WARNING_DAYS="${1:-30}"
CRITICAL_DAYS="${2:-7}"

echo "🔐 Monitoring Certificate Expiration"

kubectl get certificates --all-namespaces -o json | \
  jq -r '.items[] | select(.status.conditions[]?.type == "Ready" and .status.conditions[]?.status == "True") |
    "\(.metadata.namespace)/\(.metadata.name)|\(.status.notAfter)"' | \
  while IFS='|' read -r CERT EXPIRY; do
    if [ -n "${EXPIRY}" ]; then
      EXPIRY_EPOCH=$(date -d "${EXPIRY}" +%s)
      CURRENT_EPOCH=$(date +%s)
      DAYS_LEFT=$(( (EXPIRY_EPOCH - CURRENT_EPOCH) / 86400 ))

      if [ "${DAYS_LEFT}" -lt "${CRITICAL_DAYS}" ]; then
        echo "🔴 CRITICAL: ${CERT} expires in ${DAYS_LEFT} days"
      elif [ "${DAYS_LEFT}" -lt "${WARNING_DAYS}" ]; then
        echo "🟡 WARNING: ${CERT} expires in ${DAYS_LEFT} days"
      else
        echo "✅ OK: ${CERT} expires in ${DAYS_LEFT} days"
      fi
    fi
  done
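The threshold comparison in the monitoring script above can be isolated into a pure function, which makes the warning/critical boundaries easy to unit-test. The function name `classify_expiry` is an illustrative assumption; the default thresholds mirror the script's (7 days critical, 30 days warning):

```shell
#!/bin/bash
# classify_expiry: map days-until-expiry to a status string, using the
# same defaults as the monitoring script above.
classify_expiry() {
  local days_left="$1" critical="${2:-7}" warning="${3:-30}"
  if [ "${days_left}" -lt "${critical}" ]; then
    echo "CRITICAL"
  elif [ "${days_left}" -lt "${warning}" ]; then
    echo "WARNING"
  else
    echo "OK"
  fi
}

classify_expiry 5    # prints: CRITICAL
classify_expiry 14   # prints: WARNING
classify_expiry 60   # prints: OK
```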

Certificate Expiration Alert (PrometheusRule):

# monitoring/alerts/certificate-expiration.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: certificate-expiration
  namespace: monitoring
spec:
  groups:
  - name: certificate
    interval: 1h
    rules:
    - alert: CertificateExpiringSoon
      expr: (certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 < 30
      for: 1h
      labels:
        severity: warning
      annotations:
        summary: "Certificate expiring soon"
        description: "Certificate {{ $labels.name }} in {{ $labels.namespace }} expires in {{ $value }} days"

    - alert: CertificateExpiringVerySoon
      expr: (certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 < 7
      for: 1h
      labels:
        severity: critical
      annotations:
        summary: "Certificate expiring very soon"
        description: "Certificate {{ $labels.name }} in {{ $labels.namespace }} expires in {{ $value }} days"

Automatic Renewal with cert-manager

cert-manager Automatic Renewal Configuration:

# apps/atp-gateway/certificate.yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: atp-gateway-tls
  namespace: atp-production
spec:
  secretName: atp-gateway-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  commonName: api.atp.connectsoft.example
  dnsNames:
  - api.atp.connectsoft.example
  duration: 2160h  # 90 days
  renewBefore: 720h  # Renew 30 days before expiration (automatic)
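cert-manager expects `duration` and `renewBefore` as hour-based duration strings, so a one-line converter avoids arithmetic slips when translating policies stated in days (2160h and 720h above are 90 and 30 days). A trivial sketch with the invented helper name `days_to_hours`:

```shell
#!/bin/bash
# days_to_hours: convert days to the "Nh" duration strings used by
# cert-manager's duration/renewBefore fields.
days_to_hours() { echo "$(( $1 * 24 ))h"; }

days_to_hours 90   # prints: 2160h
days_to_hours 30   # prints: 720h
```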

Manual Renewal Procedures

Manual Certificate Renewal:

#!/bin/bash
# scripts/manual-certificate-renewal.sh

CERT_NAME="${1}"
NAMESPACE="${2:-atp-production}"

echo "🔄 Manually renewing certificate: ${CERT_NAME}"

# Trigger re-issuance with the cert-manager CLI (preferred over deleting the
# Certificate object, which is only re-created if GitOps manages it)
cmctl renew "${CERT_NAME}" -n "${NAMESPACE}"

# Wait for renewal
echo "⏳ Waiting for renewal..."
kubectl wait certificate "${CERT_NAME}" -n "${NAMESPACE}" \
  --for=condition=Ready --timeout=300s

# Check new certificate status
kubectl get certificate "${CERT_NAME}" -n "${NAMESPACE}"

Monitoring and Alerting Review

Reviewing Alert Noise

Alert Noise Analysis:

// Log Analytics: Analyze alert frequency
AzureActivity
| where OperationName == "Microsoft.Insights/metricAlerts/write"
| where TimeGenerated > ago(30d)
| summarize AlertCount=count() by Resource, AlertName
| order by AlertCount desc
| take 20

Alert Tuning Script:

#!/bin/bash
# scripts/tune-alerts.sh

ALERT_NAME="${1}"

echo "🎚️  Tuning alert: ${ALERT_NAME}"

# Query alert frequency
echo "📊 Alert frequency (last 30 days):"
# Use Azure Monitor API or Azure CLI

# Identify false positives
echo "❌ False positives to address:"
# Manual review required

# Adjust threshold
echo "⚙️  Current threshold: [threshold]"
echo "Suggested threshold: [new-threshold]"

# Update alert rule
az monitor metrics alert update \
  --name "${ALERT_NAME}" \
  --resource-group atp-production-rg \
  --condition "avg Percentage CPU > 80"  # Example

Adding New Alerts

Alert Creation Template:

# monitoring/alerts/template.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: atp-application-alerts
  namespace: monitoring
spec:
  groups:
  - name: application
    interval: 1m
    rules:
    - alert: HighErrorRate
      expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
      for: 5m
      labels:
        severity: warning
        team: atp
      annotations:
        summary: "High error rate detected"
        description: "Error rate is {{ $value }} errors/sec for {{ $labels.service }}"

    - alert: HighLatency
      expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
      for: 10m
      labels:
        severity: warning
        team: atp
      annotations:
        summary: "High latency detected"
        description: "P95 latency is {{ $value }}s for {{ $labels.service }}"

Capacity Planning

Resource Usage Trend Analysis:

// Log Analytics: Node CPU utilization trend
Perf
| where ObjectName == "K8SNode"
| where CounterName == "cpuUsageNanoCores"
| where TimeGenerated > ago(90d)
| summarize AvgCPU=avg(CounterValue), MaxCPU=max(CounterValue) by bin(TimeGenerated, 1d), Computer
| render timechart

// Pod memory usage trend
Perf
| where ObjectName == "K8SContainer"
| where CounterName == "memoryWorkingSetBytes"
| where TimeGenerated > ago(90d)
| summarize AvgMemory=avg(CounterValue) by bin(TimeGenerated, 1d), InstanceName
| render timechart

Capacity Planning Report:

#!/bin/bash
# scripts/capacity-planning-report.sh

OUTPUT="capacity-planning-$(date +%Y%m).md"

cat > "${OUTPUT}" <<EOF
# Capacity Planning Report - $(date "+%B %Y")

## Current Utilization

### Nodes
\`\`\`
$(kubectl top nodes)
\`\`\`

### Pods per Node
- Running pods: $(kubectl get pods --all-namespaces --field-selector=status.phase=Running --no-headers | wc -l) across $(kubectl get nodes --no-headers | wc -l) nodes

## Trends (Last 90 Days)

[Insert trend charts from Log Analytics]

## Projections (Next 6 Months)

Based on current growth trends:
- Expected pod growth: X%
- Expected storage growth: Y%
- Expected cost increase: Z%

## Recommendations

1. **Node Pool Scaling**: [Recommendation]
2. **Storage**: [Recommendation]
3. **Resource Right-Sizing**: [Recommendation]
4. **Cost Optimization**: [Recommendation]
EOF

echo "✅ Report generated: ${OUTPUT}"

Security Patching

Container Base Image Updates

Base Image Update Procedure:

#!/bin/bash
# scripts/update-base-images.sh

echo "🔒 Scanning for base image updates..."

# Scan all images in ACR
az acr repository list --name connectsoft --output tsv | \
  while read repo; do
    echo "Scanning: ${repo}"
    az acr task run \
      --registry connectsoft \
      --name update-base-images \
      --context https://github.com/connectsoft/atp.git
  done

# Check for vulnerabilities
az acr repository show \
  --name connectsoft \
  --repository atp/gateway \
  --query "properties.manifest" -o json

Vulnerability Remediation

Vulnerability Remediation Workflow:

graph TD
    START[Vulnerability Detected] --> SCAN[Scan Images]
    SCAN --> SEVERITY{Severity?}
    SEVERITY -->|Critical| IMMEDIATE[Immediate Remediation]
    SEVERITY -->|High| PRIORITY[Priority Remediation]
    SEVERITY -->|Medium| SCHEDULED[Scheduled Remediation]
    SEVERITY -->|Low| BACKLOG[Add to Backlog]

    IMMEDIATE --> PATCH[Apply Patch]
    PRIORITY --> PATCH
    SCHEDULED --> PATCH

    PATCH --> TEST[Test in Dev/Test]
    TEST --> DEPLOY[Deploy to Production]
    DEPLOY --> VERIFY[Verify Fix]

    style IMMEDIATE fill:#FF0000
    style PRIORITY fill:#FFA500
    style SCHEDULED fill:#FFFF00
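The severity branches in the workflow above can be captured in a small triage helper; the track names below are illustrative assumptions, not mandated SLA values:

```shell
#!/bin/sh
# Map a vulnerability severity to its remediation track, mirroring the
# workflow diagram. Track names are illustrative assumptions.
remediation_track() {
  case "$1" in
    Critical) echo "immediate" ;;   # patch now via the expedited change process
    High)     echo "priority" ;;    # patch within the current sprint
    Medium)   echo "scheduled" ;;   # patch in the next maintenance window
    Low)      echo "backlog" ;;     # track and fix opportunistically
    *)        echo "unknown"; return 1 ;;
  esac
}

remediation_track Critical   # prints: immediate
remediation_track Medium     # prints: scheduled
```

Whatever the track, every path still converges on the same patch → test → deploy → verify sequence shown in the diagram.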

Configuration Drift Audits

Scheduled Drift Detection Runs

Automated Drift Detection:

# platform/drift-detection/cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: drift-detection
  namespace: flux-system
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: drift-detection
            image: fluxcd/flux-cli:latest
            command:
            - /bin/sh
            - -c
            - |
              echo "Running drift detection..."
              flux diff kustomization apps-production -n flux-system > /tmp/drift-report.txt
              if [ -s /tmp/drift-report.txt ]; then
                echo "⚠️  Drift detected!"
                cat /tmp/drift-report.txt
                # Send alert
              else
                echo "✅ No drift detected"
              fi
          restartPolicy: OnFailure

Performance Tuning

Reconciliation Interval Optimization

Optimize Reconciliation Intervals:

# Production: Less frequent (reduce load)
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-production
  namespace: flux-system
spec:
  interval: 10m  # Production: 10 minutes
  path: ./apps/atp-gateway/overlays/production

---
# Dev: More frequent (faster feedback)
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-dev
  namespace: flux-system
spec:
  interval: 1m  # Dev: 1 minute
  path: ./apps/atp-gateway/overlays/dev

Documentation Updates

Documentation Maintenance Checklist

## Documentation Maintenance

### Weekly
- [ ] Update runbooks with lessons learned
- [ ] Document new procedures

### Monthly
- [ ] Review and update architecture diagrams
- [ ] Update troubleshooting guides
- [ ] Review and archive outdated docs

### Quarterly
- [ ] Comprehensive documentation audit
- [ ] Update all procedures
- [ ] Knowledge base cleanup

Team Training

Onboarding Checklist

## New Team Member Onboarding

### Week 1
- [ ] Access to Azure DevOps
- [ ] Access to AKS clusters
- [ ] GitOps repository access
- [ ] Review architecture documentation

### Week 2
- [ ] Hands-on GitOps exercises
- [ ] Troubleshooting practice
- [ ] Shadow on-call rotation

### Week 3
- [ ] Independent task assignment
- [ ] Code review participation
- [ ] Documentation contribution

On-Call Procedures

On-Call Rotation Schedule

gantt
    title On-Call Rotation Schedule
    dateFormat YYYY-MM-DD
    section Team A
    Engineer 1 On-Call      :2024-01-01, 7d
    Engineer 2 On-Call      :2024-01-08, 7d
    section Team B
    Engineer 3 On-Call      :2024-01-15, 7d
    Engineer 4 On-Call      :2024-01-22, 7d
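The rotation above can be computed rather than hard-coded; a minimal sketch, where the engineer names and the four-week cycle are assumptions taken from the chart:

```shell
#!/bin/sh
# Return the on-call engineer for a given week number in a fixed
# four-person weekly rotation (names taken from the chart above).
oncall_for_week() {
  week="$1"                      # whole weeks since the rotation epoch
  set -- "Engineer 1" "Engineer 2" "Engineer 3" "Engineer 4"
  idx=$(( (week % 4) + 1 ))
  eval "echo \"\${${idx}}\""     # pick the idx-th positional parameter
}

oncall_for_week 0   # prints: Engineer 1
oncall_for_week 5   # prints: Engineer 2
```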

On-Call Handoff Procedure

## On-Call Handoff Checklist

### Daily Handoff
- [ ] Review incidents from last 24 hours
- [ ] Check for unresolved issues
- [ ] Review scheduled maintenance
- [ ] Verify alert configurations

### Weekly Handoff
- [ ] Review week's incidents
- [ ] Document lessons learned
- [ ] Update runbooks
- [ ] Share knowledge with team

Post-Incident Review Template

## Post-Incident Review (PIR)

**Incident ID**: [ID]
**Date**: [Date]
**Severity**: [P0/P1/P2/P3]
**Duration**: [Duration]
**Impact**: [Description]

### Timeline
- [Time] - Issue detected
- [Time] - Escalation
- [Time] - Resolution

### Root Cause
[Root cause analysis]

### Actions Taken
[Steps taken to resolve]

### Lessons Learned
- [Lesson 1]
- [Lesson 2]

### Action Items
- [ ] [Action item 1]
- [ ] [Action item 2]

### Prevention
[How to prevent similar incidents]

Summary: Day 2 Operations & Maintenance

  • Routine Maintenance Tasks: Daily (monitoring checks, alert review), weekly (capacity planning, cost review), monthly (security patches, access reviews), quarterly (DR drills, policy updates) with automated checklists
  • FluxCD Upgrades: Upgrade planning, testing in dev/test first, upgrade procedure, rollback plan, post-upgrade validation
  • AKS Cluster Patching: Kubernetes version upgrades, node OS patching, upgrade scheduling, node pool rotation, validation and rollback
  • Certificate Renewals: Monitoring certificate expiration, automatic renewal with cert-manager, manual renewal procedures, certificate rotation testing
  • Monitoring and Alerting Review: Reviewing alert noise, tuning thresholds, disabling false positives, adding new alerts
  • Capacity Planning: Monitoring resource usage trends, node pool scaling decisions, storage growth planning, cost forecasting
  • Security Patching: OS security updates, container base image updates, dependency updates, vulnerability remediation workflow
  • Configuration Drift Audits: Scheduled drift detection runs, comparing Git to live state, identifying configuration inconsistencies, remediation procedures
  • Performance Tuning: Reconciliation interval optimization, resource request/limit tuning, autoscaling adjustments, database query optimization
  • Documentation Updates: Keeping runbooks current, updating architecture diagrams, recording lessons learned, knowledge base maintenance
  • Team Training: Onboarding new team members, knowledge sharing sessions, hands-on exercises, certification paths
  • On-Call Procedures: On-call rotation schedule, handoff procedures, escalation paths, post-incident reviews with templates

Compliance & Audit Evidence Collection

Purpose: Define the compliance controls and audit evidence collection procedures for ATP's GitOps deployments: SOC 2 Type II control mappings, GDPR compliance workflows, HIPAA audit trail requirements, Change Advisory Board (CAB) processes, deployment receipts, and automated compliance reporting. Together these ensure regulatory compliance and provide a complete audit trail for every platform change.


SOC 2 Type II Controls

CC8.1: Change Management

Change Management Control Requirements:

| Requirement | GitOps Implementation | Evidence |
|---|---|---|
| Authorized Changes | PR approval required | PR approval records in Azure DevOps |
| Change Testing | Automated tests in CI | Test results in Azure Pipelines |
| Change Documentation | Git commit messages, PR descriptions | Git history, PR records |
| Change Approval | Required approvals before merge | Approval timestamps and identities |
| Change Review | Code review process | Review comments and approvals |

GitOps Workflow Mapping to CC8.1:

graph LR
    START[Developer Creates PR] --> REVIEW[Code Review]
    REVIEW --> APPROVE{Approval<br/>Required?}
    APPROVE -->|Yes| CAB[CAB Approval]
    APPROVE -->|No| AUTO[Automated Tests]
    CAB --> AUTO
    AUTO --> MERGE[Merge to Main]
    MERGE --> DEPLOY[FluxCD Deploys]
    DEPLOY --> EVIDENCE[Audit Evidence<br/>Generated]

    style CAB fill:#FFE5B4
    style EVIDENCE fill:#90EE90

Evidence Collection for CC8.1:

#!/bin/bash
# scripts/collect-change-management-evidence.sh

PR_ID="${1}"
DATE_RANGE="${2:-30d}"

echo "📋 Collecting Change Management Evidence for PR ${PR_ID}"

# Get PR details
az repos pr show \
  --id "${PR_ID}" \
  --organization ${ORG} \
  --project ${PROJECT} \
  --output json > "change-evidence-pr-${PR_ID}.json"

# Extract evidence
cat "change-evidence-pr-${PR_ID}.json" | jq '{
  pr_id: .pullRequestId,
  title: .title,
  created_by: .createdBy.uniqueName,
  created_date: .creationDate,
  reviewers: [.reviewers[] | {name: .uniqueName, vote: .vote, date: .votedForDate}],
  status: .status,
  merge_status: .mergeStatus,
  closed_date: .closedDate,
  linked_work_items: [.workItemRefs[].id]
}'

# Get commit history
echo "📜 Commit History:"
az repos pr commits \
  --id "${PR_ID}" \
  --organization ${ORG} \
  --project ${PROJECT} \
  --output table

# Get build/test results
echo "🧪 Build/Test Results:"
# Derive the PR's merge commit from the evidence collected above
PR_COMMIT_SHA=$(jq -r '.lastMergeCommit.commitId' "change-evidence-pr-${PR_ID}.json")
az pipelines runs list \
  --organization ${ORG} \
  --project ${PROJECT} \
  --query "[?sourceVersion == '${PR_COMMIT_SHA}']" \
  --output table

CC6.1: Logical and Physical Access

Access Control Requirements:

| Requirement | Implementation | Evidence |
|---|---|---|
| Access Reviews | Quarterly RBAC reviews | Access review reports |
| Least Privilege | RBAC in Kubernetes, Azure AD | RBAC manifests in Git |
| Access Logging | Kubernetes audit logs, Azure AD logs | Log Analytics queries |
| Access Termination | Automated offboarding | Offboarding logs |

Access Review Evidence Collection:

#!/bin/bash
# scripts/collect-access-review-evidence.sh

REVIEW_DATE="${1:-$(date +%Y-%m-%d)}"

echo "👥 Collecting Access Review Evidence - ${REVIEW_DATE}"

# Review Kubernetes RBAC
echo "📋 Kubernetes RBAC Review:"
kubectl get rolebindings,clusterrolebindings --all-namespaces -o json > \
  "rbac-review-${REVIEW_DATE}.json"

# Review Azure AD groups
echo "🔐 Azure AD Group Memberships:"
az ad group member list \
  --group "atp-developers" \
  --output table > "azure-ad-access-${REVIEW_DATE}.txt"

# Review Key Vault access
echo "🔑 Key Vault Access Policies:"
az keyvault show \
  --name atp-keyvault \
  --query "properties.accessPolicies" \
  -o json > "keyvault-access-${REVIEW_DATE}.json"

# Generate access review report
cat > "access-review-report-${REVIEW_DATE}.md" <<EOF
# Access Review Report - ${REVIEW_DATE}

## Kubernetes RBAC

\`\`\`
$(kubectl get rolebindings,clusterrolebindings --all-namespaces --no-headers | wc -l) bindings reviewed
\`\`\`

## Azure AD Access

\`\`\`
$(az ad group list --query "length(@)" -o tsv) groups reviewed
\`\`\`

## Key Vault Access

\`\`\`
$(az keyvault show --name atp-keyvault --query "length(properties.accessPolicies)" -o tsv) access policies reviewed
\`\`\`

## Findings

- [ ] All access is justified
- [ ] No orphaned accounts
- [ ] Least privilege enforced
- [ ] Access terminated for offboarded users

## Reviewer

**Name**: [Reviewer Name]
**Date**: ${REVIEW_DATE}
**Signature**: [Digital Signature]
EOF

echo "✅ Access review evidence collected"

CC7.2: System Monitoring

System Monitoring Requirements:

| Requirement | Implementation | Evidence |
|---|---|---|
| Monitoring Coverage | Azure Monitor, Prometheus | Monitoring dashboards |
| Alert Configuration | Alert rules in Git | Alert manifests |
| Log Retention | 7-year retention in Log Analytics | Retention policies |
| Incident Response | Automated alerts, on-call | Incident logs |

Monitoring Evidence Collection:

// Log Analytics: System Monitoring Evidence
// Query for monitoring coverage evidence
Perf
| where TimeGenerated > ago(30d)
| summarize 
    MetricCount=count_distinct(CounterName),
    ResourceCount=count_distinct(Computer),
    DataPoints=count()
| extend EvidenceType="Monitoring Coverage"
| project EvidenceType, MetricCount, ResourceCount, DataPoints, TimeGenerated=now()

// Alert configuration evidence
Alert
| where TimeGenerated > ago(30d)
| summarize AlertCount=count(), UniqueAlerts=dcount(AlertName)
| extend EvidenceType="Alert Configuration"
| project EvidenceType, AlertCount, UniqueAlerts, TimeGenerated=now()

GitOps Workflow Mapping to Controls

SOC 2 Control Mapping Matrix:

| Control | GitOps Workflow | Evidence Source | Retention |
|---|---|---|---|
| CC8.1 - Change Management | PR approval, code review | Azure DevOps PR records | 7 years |
| CC6.1 - Access Control | RBAC manifests in Git | Git history, access reviews | 7 years |
| CC7.2 - Monitoring | Monitoring manifests in Git | Log Analytics, dashboards | 7 years |
| CC7.3 - System Operations | GitOps reconciliation | FluxCD logs, deployment receipts | 7 years |
| CC6.6 - Logical Access | Workload Identity, RBAC | Kubernetes audit logs | 7 years |
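Before an audit, the mapping matrix above can drive a quick evidence-presence check; the control-to-directory layout used here is a hypothetical convention, not part of the platform:

```shell
#!/bin/sh
# Report whether an evidence directory exists for a given SOC 2 control.
# The control→directory mapping below is a hypothetical convention.
evidence_status() {
  control="$1"; dir="$2"
  if [ -d "$dir" ]; then
    echo "present ${control}"
  else
    echo "missing ${control}"
  fi
}

# One entry per control from the mapping matrix (paths are assumptions)
for entry in \
  "CC8.1:evidence/change-management" \
  "CC6.1:evidence/access-reviews" \
  "CC7.2:evidence/monitoring"; do
  evidence_status "${entry%%:*}" "${entry#*:}"
done
```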

SOC 2 Evidence Collection Dashboard:

# monitoring/compliance/soc2-dashboard.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: soc2-evidence-dashboard
  namespace: monitoring
data:
  dashboard.json: |
    {
      "dashboard": {
        "title": "SOC 2 Compliance Evidence",
        "panels": [
          {
            "title": "Change Management (CC8.1)",
            "targets": [
              {
                "expr": "count(azure_devops_pr_approvals_total)",
                "legendFormat": "PR Approvals"
              }
            ]
          },
          {
            "title": "Access Reviews (CC6.1)",
            "targets": [
              {
                "expr": "count(kubernetes_rbac_bindings_total)",
                "legendFormat": "RBAC Bindings"
              }
            ]
          },
          {
            "title": "Monitoring Coverage (CC7.2)",
            "targets": [
              {
                "expr": "count(azure_monitor_metrics_total)",
                "legendFormat": "Monitored Resources"
              }
            ]
          }
        ]
      }
    }

GDPR Compliance

Right to be Forgotten (Tenant Offboarding)

GDPR Right to be Forgotten Procedure:

#!/bin/bash
# scripts/gdpr-tenant-offboarding.sh

TENANT_ID="${1}"
REQUESTOR="${2}"
REQUEST_DATE="${3:-$(date +%Y-%m-%d)}"

if [ -z "${TENANT_ID}" ] || [ -z "${REQUESTOR}" ]; then
  echo "Usage: $0 <tenant-id> <requestor-email> [request-date]"
  exit 1
fi

echo "🗑️  GDPR Right to be Forgotten Request"
echo "Tenant: ${TENANT_ID}"
echo "Request Date: ${REQUEST_DATE}"
echo "Requestor: ${REQUESTOR}"

# Step 1: Verify request authorization
echo "✅ Step 1: Verifying request authorization..."
# Verify requestor has authority to request deletion

# Step 2: Export tenant data (for record keeping)
echo "📥 Step 2: Exporting tenant data..."
kubectl get all -n "tenant-${TENANT_ID}" -o yaml > \
  "gdpr-export-${TENANT_ID}-${REQUEST_DATE}.yaml"

# Step 3: Delete tenant data
echo "🗑️  Step 3: Deleting tenant data..."
# Delete tenant namespace
kubectl delete namespace "tenant-${TENANT_ID}"

# Delete tenant secrets from Key Vault
az keyvault secret list \
  --vault-name atp-keyvault \
  --query "[?contains(name, 'tenant-${TENANT_ID}')].name" -o tsv | \
  while read secret; do
    az keyvault secret delete --vault-name atp-keyvault --name "${secret}"
  done

# Delete tenant data from databases
# (Specific implementation depends on database type)

# Step 4: Remove from GitOps
echo "📝 Step 4: Removing tenant from GitOps..."
git rm -r "tenants/${TENANT_ID}/"
git commit -m "GDPR: Remove tenant ${TENANT_ID} per request on ${REQUEST_DATE}"
git push

# Step 5: Generate deletion certificate
echo "📜 Step 5: Generating deletion certificate..."
cat > "gdpr-deletion-certificate-${TENANT_ID}-${REQUEST_DATE}.md" <<EOF
# GDPR Data Deletion Certificate

**Tenant ID**: ${TENANT_ID}
**Request Date**: ${REQUEST_DATE}
**Requestor**: ${REQUESTOR}
**Completion Date**: $(date +%Y-%m-%d)

## Deletion Confirmation

✅ Tenant namespace deleted: tenant-${TENANT_ID}
✅ Secrets deleted from Key Vault
✅ Data deleted from databases
✅ GitOps configuration removed
✅ Backup data purged (where applicable)

## Data Retention Exception

The following data is retained for legal/compliance purposes:
- Audit logs (7-year retention)
- Financial transaction records (as required by law)

## Certification

I certify that all tenant data has been deleted per GDPR Article 17 (Right to be Forgotten) requirements, except where retention is required by law.

**Signed**: [Authorized Person]
**Date**: $(date +%Y-%m-%d)
EOF

echo "✅ GDPR deletion complete"

Data Residency Enforcement

Data Residency Configuration:

# tenants/tenant-eu/labels.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: tenant-eu
  labels:
    data-residency: "EU"
    gdpr: "true"
    region: "westeurope"
  annotations:
    compliance/data-residency: "EU Only"
    compliance/gdpr: "true"

Data Residency Policy Enforcement:

#!/bin/bash
# scripts/verify-data-residency.sh

TENANT_ID="${1}"
REQUIRED_REGION="${2:-EU}"

echo "🌍 Verifying Data Residency for Tenant: ${TENANT_ID}"

# Check namespace labels
RESIDENCY=$(kubectl get namespace "tenant-${TENANT_ID}" \
  -o jsonpath='{.metadata.labels.data-residency}')

if [ "${RESIDENCY}" != "${REQUIRED_REGION}" ]; then
  echo "❌ Data residency violation: Expected ${REQUIRED_REGION}, found ${RESIDENCY}"
  exit 1
fi

# Check Pod placement (node labels)
NODES=$(kubectl get nodes -l region=${REQUIRED_REGION} -o jsonpath='{.items[*].metadata.name}')
if [ -z "${NODES}" ]; then
  echo "⚠️  No nodes in region ${REQUIRED_REGION}"
fi

# Check PersistentVolume placement
PVC_REGIONS=$(kubectl get pvc -n "tenant-${TENANT_ID}" -o json | \
  jq -r '.items[].metadata.annotations."volume.kubernetes.io/selected-node"')

echo "✅ Data residency verified: ${RESIDENCY}"

Audit Logs and Retention

GDPR Audit Log Retention:

# platform/compliance/audit-log-retention.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: audit-log-retention
  namespace: monitoring
data:
  retention-policy.yaml: |
    # GDPR Audit Log Retention Policy
    retention:
      default: 7y  # 7-year retention per GDPR requirements
      compliance:
        gdpr: 7y
        soc2: 7y
        hipaa: 7y
    storage:
      backend: azure-blob
      account: atpauditlogs
      container: audit-logs
      immutability: true  # Immutable storage
      encryption: true

Audit Log Export for GDPR:

#!/bin/bash
# scripts/export-gdpr-audit-logs.sh

TENANT_ID="${1}"
START_DATE="${2}"
END_DATE="${3}"

echo "📥 Exporting GDPR Audit Logs for Tenant: ${TENANT_ID}"

# Query Log Analytics for tenant-specific audit logs
az monitor log-analytics query \
  --workspace ${LOG_ANALYTICS_WORKSPACE_ID} \
  --analytics-query "
    KubernetesAudit
    | where Namespace == 'tenant-${TENANT_ID}'
    | where TimeGenerated between (datetime('${START_DATE}') .. datetime('${END_DATE}'))
    | project TimeGenerated, User, Action, Resource, ResponseCode
    | order by TimeGenerated asc
  " \
  --output table > "gdpr-audit-logs-${TENANT_ID}-${START_DATE}-${END_DATE}.csv"

echo "✅ Audit logs exported"

Privacy by Design

Privacy by Design Implementation:

# platform/compliance/privacy-by-design.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: privacy-by-design-config
  namespace: atp-production
data:
  principles.yaml: |
    # Privacy by Design Principles
    principles:
      - principle: Proactive not Reactive
        implementation: Default privacy settings, data minimization
      - principle: Privacy as Default
        implementation: Encryption at rest and in transit, minimal data collection
      - principle: Privacy Embedded into Design
        implementation: Privacy considerations in architecture
      - principle: Full Functionality
        implementation: Privacy without sacrificing functionality
      - principle: End-to-End Security
        implementation: Encryption, access controls, audit logging
      - principle: Visibility and Transparency
        implementation: Audit logs, privacy notices, data subject rights
      - principle: Respect for User Privacy
        implementation: User consent, data deletion, portability

HIPAA Audit Trail

Access Logs

HIPAA Access Log Configuration:

# platform/compliance/hipaa-audit-policy.yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
- level: Metadata
  # Note: audit-policy namespace entries match literally; wildcards are not
  # expanded, so each HIPAA namespace must be listed explicitly.
  namespaces: ["tenant-hipaa-*"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  resources:
  - group: "*"
    resources: ["*"]

- level: RequestResponse
  namespaces: ["tenant-hipaa-*"]
  verbs: ["create", "update", "patch", "delete"]
  resources:
  - group: "*"
    resources: ["secrets", "configmaps", "persistentvolumeclaims"]

HIPAA Access Log Query:

// Log Analytics: HIPAA Access Logs
KubernetesAudit
| where Namespace startswith "tenant-hipaa"
| where TimeGenerated > ago(30d)
| where Verb in ("get", "list", "watch", "create", "update", "delete")
| project 
    TimeGenerated,
    User,
    Verb,
    Resource,
    Namespace,
    ResponseCode,
    RequestObject,
    ResponseObject
| order by TimeGenerated desc

Deployment Logs

HIPAA Deployment Audit Trail:

#!/bin/bash
# scripts/generate-hipaa-deployment-log.sh

DEPLOYMENT="${1}"
NAMESPACE="${2:-tenant-hipaa-production}"

echo "📋 Generating HIPAA Deployment Audit Trail"

# Collect deployment evidence
cat > "hipaa-deployment-${DEPLOYMENT}-$(date +%Y%m%d).md" <<EOF
# HIPAA Deployment Audit Trail

**Deployment**: ${DEPLOYMENT}
**Namespace**: ${NAMESPACE}
**Date**: $(date +%Y-%m-%d)
**Time**: $(date +%H:%M:%S)

## Pre-Deployment Verification

- [ ] Change approved by authorized personnel
- [ ] Security scan passed
- [ ] Encryption verified
- [ ] Access controls verified

## Deployment Details

**Image**: $(kubectl get deployment ${DEPLOYMENT} -n ${NAMESPACE} -o jsonpath='{.spec.template.spec.containers[0].image}')
**Git Commit**: $(git rev-parse HEAD)
**PR Number**: $(git log -1 --pretty=format:"%s" | grep -oP 'PR #\K\d+')
**Deployed By**: $(az ad signed-in-user show --query userPrincipalName -o tsv)

## Post-Deployment Verification

- [ ] Deployment successful
- [ ] Health checks passing
- [ ] Encryption operational
- [ ] Access logs enabled

## HIPAA Compliance

- [ ] Audit logging enabled
- [ ] Encryption at rest verified
- [ ] Encryption in transit verified
- [ ] Access controls enforced
- [ ] PHI data handling verified
EOF

echo "✅ HIPAA deployment audit trail generated"

Encryption Verification

HIPAA Encryption Verification:

#!/bin/bash
# scripts/verify-hipaa-encryption.sh

NAMESPACE="${1:-tenant-hipaa-production}"

echo "🔐 Verifying HIPAA Encryption Requirements"

# Check PVC encryption
echo "💾 Persistent Volume Encryption:"
kubectl get pvc -n "${NAMESPACE}" -o json | \
  jq -r '.items[] | "\(.metadata.name): \(.spec.storageClassName)"' | \
  while read pvc; do
    SC=$(echo "${pvc}" | cut -d':' -f2 | xargs)
    ENCRYPTED=$(kubectl get storageclass "${SC}" -o jsonpath='{.parameters.diskEncryptionSetID}')
    if [ -n "${ENCRYPTED}" ]; then
      echo "   ✅ ${pvc}: Encrypted"
    else
      echo "   ❌ ${pvc}: Not encrypted"
    fi
  done

# Check TLS/in-transit encryption
echo "🔒 In-Transit Encryption:"
kubectl get ingress -n "${NAMESPACE}" -o json | \
  jq -r '.items[] | select(.spec.tls == null) | "\(.metadata.name): Missing TLS"'

# Flag non-Opaque secret types for manual review
echo "🔑 Secret Encryption:"
kubectl get secrets -n "${NAMESPACE}" -o json | \
  jq -r '.items[] | select(.type != "Opaque") | "\(.metadata.name): \(.type)"'

echo "✅ Encryption verification complete"

Incident Response Documentation

HIPAA Incident Response Template:

## HIPAA Incident Report

**Incident ID**: [ID]
**Date Discovered**: [Date]
**Date Reported**: [Date] (within 60 days)
**Severity**: [Low/Medium/High/Critical]

### Incident Description
[Description of the incident]

### PHI Impact Assessment
- [ ] No PHI affected
- [ ] PHI accessed but not compromised
- [ ] PHI compromised (breach)

### Affected Systems
- [List affected systems]

### Actions Taken
1. [Action 1]
2. [Action 2]

### Remediation
[Remediation steps]

### Breach Notification
- [ ] HHS notified (if breach > 500 individuals)
- [ ] Affected individuals notified (if breach)
- [ ] Media notification (if breach > 500 individuals)

### Lessons Learned
[Lessons learned]

### Prevention
[Prevention measures]

Change Advisory Board (CAB) Process

When CAB Approval is Required

CAB Approval Requirements:

| Change Type | CAB Required? | Rationale |
|---|---|---|
| Production Deployment | ✅ Yes | Production impact |
| Infrastructure Changes | ✅ Yes | Platform stability |
| Security Updates | ⚠️ Expedited | Security risk |
| Hotfixes | ⚠️ Post-deployment | Urgency |
| Dev/Test Changes | ❌ No | No production impact |
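The approval requirements above reduce to a small decision helper; the environment and change-type spellings are assumptions for illustration:

```shell
#!/bin/sh
# Decide the approval path for a change, following the CAB requirements table.
# Environment and change-type spellings are illustrative assumptions.
cab_path() {
  environment="$1"; change_type="$2"
  case "$environment" in
    dev|test) echo "no-cab"; return ;;            # no production impact
    staging)  echo "team-lead-review"; return ;;  # lighter-weight review
  esac
  # Production changes
  case "$change_type" in
    security) echo "expedited-cab" ;;        # security risk, fast-tracked
    hotfix)   echo "post-deployment-cab" ;;  # reviewed after the fact
    *)        echo "regular-cab" ;;          # standard production change
  esac
}

cab_path production deployment   # prints: regular-cab
cab_path dev deployment          # prints: no-cab
```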

CAB Approval Decision Tree:

graph TD
    START[Change Request] --> ENV{Environment?}
    ENV -->|Production| CAB_REQUIRED[CAB Approval Required]
    ENV -->|Staging| REVIEW[Team Lead Review]
    ENV -->|Dev/Test| AUTO[No Approval Needed]

    CAB_REQUIRED --> SEVERITY{Severity?}
    SEVERITY -->|Critical| EXPEDITED[Expedited CAB]
    SEVERITY -->|Normal| REGULAR[Regular CAB]

    REGULAR --> MEETING[CAB Meeting]
    EXPEDITED --> APPROVAL[Expedited Approval]

    style CAB_REQUIRED fill:#FFE5B4
    style EXPEDITED fill:#FFB6C1

CAB Meeting Schedule

CAB Meeting Schedule:

| Meeting Type | Frequency | Day | Time |
|---|---|---|---|
| Regular CAB | Weekly | Tuesday | 10:00 AM |
| Expedited CAB | As needed | Any | Within 24 hours |
| Emergency CAB | As needed | Any | Immediate |

Change Request Template

CAB Change Request Template:

## Change Request Form

**CR Number**: CR-YYYY-XXX
**Date**: [Date]
**Requestor**: [Name, Email]
**Change Type**: [Standard/Emergency/Expedited]

### Change Summary
**Title**: [Change title]
**Description**: [Detailed description]

### Business Justification
[Why is this change needed?]

### Technical Details
- **Services Affected**: [List services]
- **Environments**: [Dev/Test/Staging/Production]
- **Expected Duration**: [Duration]
- **Rollback Plan**: [Rollback procedure]

### Risk Assessment
- **Risk Level**: [Low/Medium/High/Critical]
- **Potential Impact**: [Impact description]
- **Mitigation**: [Mitigation steps]

### Testing
- [ ] Tested in Dev
- [ ] Tested in Test
- [ ] Tested in Staging
- [ ] Rollback tested

### Approval
- [ ] Technical Lead Approval
- [ ] CAB Approval
- [ ] Change Manager Approval

### Implementation
**Scheduled Date**: [Date]
**Scheduled Time**: [Time]
**Change Window**: [Window]

### Post-Implementation
- [ ] Implementation successful
- [ ] Verification completed
- [ ] Documentation updated

CAB Review Criteria

CAB Review Criteria Checklist:

## CAB Review Criteria

### Change Completeness
- [ ] Change request form complete
- [ ] Technical details provided
- [ ] Testing completed
- [ ] Rollback plan documented

### Risk Assessment
- [ ] Risk level appropriate
- [ ] Impact assessment complete
- [ ] Mitigation plan adequate

### Compliance
- [ ] Change documented in Git
- [ ] Approval trail maintained
- [ ] Audit requirements met

### Schedule
- [ ] Change window appropriate
- [ ] Stakeholders notified
- [ ] Resources available

Approval Documentation

CAB Approval Record:

# changes/cr-2024-001-approval.yaml
apiVersion: compliance.atp.connectsoft.io/v1
kind: ChangeApproval
metadata:
  name: cr-2024-001
  namespace: atp-production
spec:
  changeRequest:
    number: CR-2024-001
    title: "Upgrade PostgreSQL to version 15"
    requestor: "john.doe@connectsoft.example"
    date: "2024-01-15"
  cabApproval:
    approved: true
    approvalDate: "2024-01-18"
    approvedBy:
    - name: "Jane Smith"
      role: "CAB Chair"
      signature: "[Digital Signature]"
    - name: "Bob Johnson"
      role: "Technical Lead"
      signature: "[Digital Signature]"
  implementation:
    scheduledDate: "2024-01-25"
    scheduledTime: "02:00 UTC"
    changeWindow: "02:00-04:00 UTC"

Deployment Approval Records

PR Approvals in Azure DevOps

Extract PR Approval Records:

#!/bin/bash
# scripts/extract-pr-approvals.sh

PR_ID="${1}"
PROJECT="${2:-atp-gitops}"

echo "📋 Extracting PR Approval Records for PR ${PR_ID}"

# Get PR details with approvals
az repos pr show \
  --id "${PR_ID}" \
  --organization ${ORG} \
  --project ${PROJECT} \
  --include-work-item-refs \
  --output json | \
  jq '{
    pr_id: .pullRequestId,
    title: .title,
    created_by: .createdBy.displayName,
    created_date: .creationDate,
    status: .status,
    reviewers: [.reviewers[] | {
      name: .displayName,
      email: .uniqueName,
      vote: .vote,
      vote_date: .votedForDate,
      is_required: .isRequired
    }],
    completion_options: .completionOptions,
    work_item_refs: [.workItemRefs[] | {
      id: .id,
      title: .title,
      url: .url
    }]
  }' > "pr-approval-${PR_ID}.json"

# Generate approval certificate
cat > "pr-approval-certificate-${PR_ID}.md" <<EOF
# PR Approval Certificate

**PR Number**: ${PR_ID}
**Title**: $(jq -r '.title' pr-approval-${PR_ID}.json)
**Created**: $(jq -r '.created_date' pr-approval-${PR_ID}.json)
**Merge Commit Message**: $(jq -r '.completion_options.mergeCommitMessage' pr-approval-${PR_ID}.json)

## Approvers

$(jq -r '.reviewers[] | "- **\(.name)** (\(.email)) - Vote: \(.vote) - Date: \(.vote_date)"' pr-approval-${PR_ID}.json)

## Approval Status

$(jq -r 'if .reviewers | all(.vote >= 10) then "✅ Approved" else "❌ Not Approved" end' pr-approval-${PR_ID}.json)

## Linked Work Items

$(jq -r '.work_item_refs[] | "- [\(.id)] \(.title) - \(.url)"' pr-approval-${PR_ID}.json)

## Audit Trail

This PR approval record is maintained for 7 years per SOC 2 and GDPR requirements.
EOF

echo "✅ Approval records extracted"

Approver Identity and Timestamp

Approval Evidence Structure:

{
  "approval_record": {
    "pr_id": 12345,
    "pr_title": "Deploy ATP Gateway v1.2.3 to Production",
    "approvals": [
      {
        "approver": {
          "name": "Jane Smith",
          "email": "jane.smith@connectsoft.example",
          "azure_ad_id": "a1b2c3d4-..."
        },
        "approval": {
          "vote": 10,
          "vote_date": "2024-01-20T10:30:00Z",
          "comment": "Approved after review",
          "timestamp": "2024-01-20T10:30:15Z"
        },
        "signature": {
          "method": "Azure DevOps",
          "hash": "sha256:abc123...",
          "verified": true
        }
      }
    ],
    "merged_by": {
      "name": "John Doe",
      "email": "john.doe@connectsoft.example",
      "merge_date": "2024-01-20T11:00:00Z"
    }
  }
}
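The `vote` values in the record above follow Azure DevOps reviewer-vote semantics (10 through -10). A small helper to render them for human-readable reports:

```shell
#!/bin/sh
# Translate an Azure DevOps reviewer vote value into a readable label.
vote_label() {
  case "$1" in
    10)  echo "approved" ;;
    5)   echo "approved with suggestions" ;;
    0)   echo "no vote" ;;
    -5)  echo "waiting for author" ;;
    -10) echo "rejected" ;;
    *)   echo "unknown vote: $1" ;;
  esac
}

vote_label 10    # prints: approved
vote_label -10   # prints: rejected
```

This is why the approval-certificate script earlier treats `vote >= 10` as a full approval: only a vote of exactly 10 counts as "approved" without reservations.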

Justification and Risk Assessment

PR Justification Template:

## PR Justification

**PR**: #12345
**Title**: Deploy ATP Gateway v1.2.3 to Production

### Business Justification
[Why is this deployment needed?]

### Technical Justification
[Technical reasons for the change]

### Risk Assessment
- **Risk Level**: Medium
- **Potential Impact**: Service restart (5 minutes downtime)
- **Mitigation**: Rolling update, health checks

### Testing Completed
- [x] Unit tests passed
- [x] Integration tests passed
- [x] Staging deployment successful
- [x] Smoke tests passed

### Rollback Plan
[Rollback procedure if deployment fails]

### Approval Required
- [x] Technical Lead
- [x] CAB (for production)

#### Work Item Linking

**Link PR to Work Items**:

#!/bin/bash
# scripts/link-pr-to-workitems.sh

PR_ID="${1}"
WORK_ITEM_IDS="${2}"  # Space-separated work item IDs

echo "🔗 Linking PR ${PR_ID} to work items: ${WORK_ITEM_IDS}"

for WI_ID in ${WORK_ITEM_IDS}; do
  echo "   Linking to work item: ${WI_ID}"
  az repos pr work-item add \
    --id "${PR_ID}" \
    --work-item-id "${WI_ID}" \
    --organization ${ORG} \
    --project ${PROJECT}
done

# Verify links
echo "✅ Verifying links..."
az repos pr show \
  --id "${PR_ID}" \
  --organization ${ORG} \
  --project ${PROJECT} \
  --include-work-item-refs \
  --query "workItemRefs[].id" \
  --output table

### Git Commit History as Audit Evidence

#### Signed Commits (GPG)

**GPG Signing Configuration**:

#!/bin/bash
# scripts/setup-gpg-signing.sh

echo "🔐 Setting up GPG signing for Git commits"

# Generate GPG key (if not exists)
if ! gpg --list-secret-keys --keyid-format LONG | grep -q "sec"; then
  echo "Generating new GPG key..."
  gpg --full-generate-key
fi

# Get GPG key ID
GPG_KEY_ID=$(gpg --list-secret-keys --keyid-format LONG | \
  grep "^sec" | \
  sed -n 's/.*\/\([A-Z0-9]\{16\}\).*/\1/p' | \
  head -1)

echo "GPG Key ID: ${GPG_KEY_ID}"

# Configure Git to use GPG signing
git config --global user.signingkey "${GPG_KEY_ID}"
git config --global commit.gpgsign true

# Add GPG key to GitHub/Azure DevOps
echo "Add this public key to Azure DevOps:"
gpg --armor --export "${GPG_KEY_ID}"

echo "✅ GPG signing configured"

**Verify Signed Commits**:

#!/bin/bash
# scripts/verify-signed-commits.sh

COMMIT_RANGE="${1:-HEAD~10..HEAD}"

echo "✅ Verifying signed commits in range: ${COMMIT_RANGE}"

git log --pretty="format:%H|%G?|%aN|%s" "${COMMIT_RANGE}" | \
  while IFS='|' read -r commit signature author subject; do
    case "${signature}" in
      "G")
        echo "✅ ${commit}: Good signature (${author})"
        ;;
      "U")
        echo "⚠️  ${commit}: Good signature, unknown validity (${author})"
        ;;
      "X")
        echo "⚠️  ${commit}: Good signature, expired (${author})"
        ;;
      "Y")
        echo "⚠️  ${commit}: Good signature, expired key (${author})"
        ;;
      "R")
        echo "❌ ${commit}: Good signature, revoked key (${author})"
        ;;
      "B")
        echo "❌ ${commit}: Bad signature (${author})"
        ;;
      "E")
        echo "❌ ${commit}: Cannot verify (${author})"
        ;;
      "N")
        echo "❌ ${commit}: No signature (${author})"
        ;;
      *)
        echo "❓ ${commit}: Unknown status ${signature} (${author})"
        ;;
    esac
  done
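The `%G?` placeholder handled above has a fixed set of status codes (per `git log`'s pretty-format documentation). A small Python sketch of the same mapping, useful when post-processing the log output elsewhere:

```python
# %G? signature-status codes from `git log --pretty`, per git's
# pretty-format documentation.
SIGNATURE_STATUS = {
    "G": "Good signature",
    "B": "Bad signature",
    "U": "Good signature, unknown validity",
    "X": "Good signature, expired",
    "Y": "Good signature, expired key",
    "R": "Good signature, revoked key",
    "E": "Cannot verify (e.g. missing key)",
    "N": "No signature",
}

def describe(code: str) -> str:
    return SIGNATURE_STATUS.get(code, f"Unknown status: {code}")

print(describe("G"))  # Good signature
```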

#### Commit Message Standards

**Conventional Commits for Audit Trail**:

## Commit Message Format
`<type>(<scope>): <subject>`

### Types
- `feat`: New feature
- `fix`: Bug fix
- `docs`: Documentation
- `chore`: Maintenance
- `refactor`: Code refactoring

### Examples
feat(gateway): Add authentication middleware

Implements JWT token validation for API gateway.

Linked to: WI-12345
Approved by: Jane Smith

---

fix(ingestion): Resolve memory leak in event processor

Fixes issue where event processor was not releasing memory.

Linked to: WI-12346
CAB Approved: CR-2024-001

**Enforce Commit Message Standards**:
# platform/gitops/commit-msg-hook.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: commit-msg-hook
  namespace: flux-system
data:
  commit-msg: |
    #!/bin/sh
    # Commit message hook to enforce standards

    COMMIT_MSG_FILE=$1
    COMMIT_MSG=$(cat $COMMIT_MSG_FILE)

    # Check for conventional commit format
    if ! echo "${COMMIT_MSG}" | grep -qE "^(feat|fix|docs|chore|refactor)(\(.+\))?:"; then
      echo "❌ Commit message must follow conventional commits format"
      echo "   Format: <type>(<scope>): <subject>"
      exit 1
    fi

    # Check for work item reference
    if ! echo "${COMMIT_MSG}" | grep -qiE "(WI-|AB#|#)[0-9]+"; then
      echo "⚠️  Warning: No work item reference found"
    fi

    exit 0
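The same checks the hook performs can also be run outside Git, e.g. in a PR pipeline step. A minimal Python sketch mirroring the hook's two regexes; the function name is illustrative:

```python
import re

# Mirrors the commit-msg hook above: reject messages that do not start
# with a conventional-commit type, and warn when no work item is linked.
TYPE_RE = re.compile(r"^(feat|fix|docs|chore|refactor)(\(.+\))?:")
WORK_ITEM_RE = re.compile(r"(WI-|AB#|#)\d+", re.IGNORECASE)

def check_commit_message(msg: str):
    """Return (ok, warnings); ok=False blocks the commit."""
    if not TYPE_RE.match(msg):
        return False, ["message must follow <type>(<scope>): <subject>"]
    warnings = [] if WORK_ITEM_RE.search(msg) else ["no work item reference found"]
    return True, warnings

ok, warnings = check_commit_message("feat(gateway): Add auth middleware\n\nLinked to: WI-12345")
print(ok, warnings)  # True []
```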
---

### Deployment Receipts

#### Deployment Receipt Template

**Deployment Receipt Structure**:
# deployment-receipts/deployment-20240120-143022.yaml
apiVersion: compliance.atp.connectsoft.io/v1
kind: DeploymentReceipt
metadata:
  name: deployment-atp-gateway-20240120-143022
  namespace: atp-production
  creationTimestamp: "2024-01-20T14:30:22Z"
spec:
  deployment:
    id: "dep-20240120-143022"
    service: "atp-gateway"
    environment: "production"
    cluster: "atp-production-aks"
    namespace: "atp-production"
  what:
    image: "connectsoft.azurecr.io/atp/gateway:v1.2.3"
    git_commit: "abc123def456..."
    git_branch: "main"
    pr_number: 12345
  when:
    deployed_at: "2024-01-20T14:30:22Z"
    deployed_by: "FluxCD"
    reconciliation_time: "2024-01-20T14:30:25Z"
  who:
    approved_by:
    - name: "Jane Smith"
      email: "jane.smith@connectsoft.example"
      role: "Technical Lead"
      approval_date: "2024-01-20T10:30:00Z"
    - name: "Bob Johnson"
      email: "bob.johnson@connectsoft.example"
      role: "CAB Member"
      approval_date: "2024-01-20T11:00:00Z"
    merged_by:
      name: "John Doe"
      email: "john.doe@connectsoft.example"
      merge_date: "2024-01-20T12:00:00Z"
  why:
    work_items:
    - id: "WI-12345"
      title: "Add authentication middleware"
      url: "https://dev.azure.com/..."
    change_request: "CR-2024-001"
    justification: "Add JWT authentication for API security"
  where:
    region: "eastus"
    cluster: "atp-production-aks"
    namespace: "atp-production"
  evidence:
    pr_approval: "pr-approval-12345.json"
    security_scan: "security-scan-abc123.json"
    test_results: "test-results-abc123.json"
    sbom: "sbom-gateway-v1.2.3.json"
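Before a receipt is accepted as evidence, it helps to confirm it actually answers the what/when/who/why/where questions. A Python sketch, assuming the receipt has been parsed into a dict with the `spec` layout shown above:

```python
# Sketch: check a DeploymentReceipt (parsed to a dict) for the
# what/when/who/why/where sections used in the template above.
REQUIRED_SECTIONS = ("what", "when", "who", "why", "where")

def missing_sections(receipt: dict) -> list:
    spec = receipt.get("spec", {})
    return [s for s in REQUIRED_SECTIONS if not spec.get(s)]

receipt = {"spec": {
    "what": {"image": "connectsoft.azurecr.io/atp/gateway:v1.2.3"},
    "when": {"deployed_at": "2024-01-20T14:30:22Z"},
    "who": {"approved_by": [{"name": "Jane Smith"}]},
    "why": {"work_items": [{"id": "WI-12345"}]},
    "where": {"region": "eastus"},
}}
print(missing_sections(receipt))  # [] -> receipt is complete
```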
#### Automated Deployment Receipt Generation

**Deployment Receipt Generation Script**:
#!/bin/bash
# scripts/generate-deployment-receipt.sh

DEPLOYMENT="${1}"
NAMESPACE="${2:-atp-production}"
IMAGE="${3}"

echo "📜 Generating Deployment Receipt"

# Get deployment details
DEPLOYMENT_DATA=$(kubectl get deployment "${DEPLOYMENT}" -n "${NAMESPACE}" -o json)
IMAGE_TAG=$(echo "${DEPLOYMENT_DATA}" | jq -r '.spec.template.spec.containers[0].image')
GIT_COMMIT=$(echo "${IMAGE_TAG}" | cut -d':' -f2)

# Get PR information from Git commit
PR_INFO=$(git log --grep="${GIT_COMMIT}" --format="%s" | head -1)
PR_NUMBER=$(echo "${PR_INFO}" | grep -oP 'PR #\K\d+' || echo "")

# Get approval information
if [ -n "${PR_NUMBER}" ]; then
  APPROVALS=$(az repos pr show \
    --id "${PR_NUMBER}" \
    --organization ${ORG} \
    --project ${PROJECT} \
    --query "reviewers[?vote>=10]" \
    -o json)
fi

# Generate deployment receipt
cat > "deployment-receipt-${DEPLOYMENT}-$(date +%Y%m%d-%H%M%S).yaml" <<EOF
apiVersion: compliance.atp.connectsoft.io/v1
kind: DeploymentReceipt
metadata:
  name: deployment-${DEPLOYMENT}-$(date +%Y%m%d-%H%M%S)
  namespace: ${NAMESPACE}
  creationTimestamp: "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
spec:
  deployment:
    id: "dep-$(date +%Y%m%d-%H%M%S)"
    service: "${DEPLOYMENT}"
    environment: "${NAMESPACE}"
    cluster: "$(kubectl config current-context)"
    namespace: "${NAMESPACE}"
  what:
    image: "${IMAGE_TAG}"
    git_commit: "${GIT_COMMIT}"
    git_branch: "$(git branch --show-current)"
    pr_number: "${PR_NUMBER}"
  when:
    deployed_at: "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
    deployed_by: "FluxCD"
  who:
    approved_by:
$(echo "${APPROVALS}" | jq -r '.[] | "    - name: \"\(.displayName)\"\n      email: \"\(.uniqueName)\"\n      approval_date: \"\(.votedForDate)\""')
  why:
    work_items:
    - id: "$(git log -1 --pretty=format:"%s" | grep -oP 'WI-\K\d+' || echo "N/A")"
  where:
    region: "$(kubectl get nodes -o jsonpath='{.items[0].metadata.labels.topology\.kubernetes\.io/region}')"
    cluster: "$(kubectl config current-context)"
    namespace: "${NAMESPACE}"
EOF

echo "✅ Deployment receipt generated"
---

### Security Scan Results

#### Vulnerability Reports

**Vulnerability Scan Evidence Collection**:
#!/bin/bash
# scripts/collect-vulnerability-evidence.sh

IMAGE="${1}"
SCAN_DATE="${2:-$(date +%Y-%m-%d)}"

echo "🔒 Collecting Vulnerability Scan Evidence"

# Run Trivy scan
trivy image --format json --output "vulnerability-scan-${IMAGE}-${SCAN_DATE}.json" "${IMAGE}"

# Generate summary report
trivy image --format table "${IMAGE}" > "vulnerability-summary-${IMAGE}-${SCAN_DATE}.txt"

# Extract critical vulnerabilities
jq '[.Results[]?.Vulnerabilities[]? | select(.Severity == "CRITICAL")]' \
  "vulnerability-scan-${IMAGE}-${SCAN_DATE}.json" > \
  "vulnerability-critical-${IMAGE}-${SCAN_DATE}.json"

# Generate evidence document
cat > "vulnerability-evidence-${IMAGE}-${SCAN_DATE}.md" <<EOF
# Vulnerability Scan Evidence

**Image**: ${IMAGE}
**Scan Date**: ${SCAN_DATE}
**Scanner**: Trivy

## Summary

- Total Vulnerabilities: $(jq '[.Results[]?.Vulnerabilities[]?] | length' vulnerability-scan-${IMAGE}-${SCAN_DATE}.json)
- Critical: $(jq '[.Results[]?.Vulnerabilities[]? | select(.Severity == "CRITICAL")] | length' vulnerability-scan-${IMAGE}-${SCAN_DATE}.json)
- High: $(jq '[.Results[]?.Vulnerabilities[]? | select(.Severity == "HIGH")] | length' vulnerability-scan-${IMAGE}-${SCAN_DATE}.json)

## Critical Vulnerabilities

$(jq -r '.Results[]?.Vulnerabilities[]? | select(.Severity == "CRITICAL") | "- \(.VulnerabilityID): \(.Title)"' vulnerability-scan-${IMAGE}-${SCAN_DATE}.json)

## Remediation Status

- [ ] All critical vulnerabilities remediated
- [ ] High vulnerabilities reviewed
- [ ] Risk assessment completed

## Approval

**Reviewed By**: [Reviewer Name]
**Date**: ${SCAN_DATE}
**Approval**: [Approved/Rejected with Justification]
EOF

echo "✅ Vulnerability evidence collected"
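The jq severity tallies above can equally be computed in code when building dashboards. A Python sketch over Trivy's JSON report layout (`Results[].Vulnerabilities[]`); the sample report is illustrative:

```python
from collections import Counter

# Sketch: tally vulnerabilities by severity from a Trivy JSON report,
# matching the jq expressions in the script above. `or []` guards
# against null Results/Vulnerabilities, which Trivy may emit.
def severity_counts(report: dict) -> Counter:
    counts = Counter()
    for result in report.get("Results") or []:
        for vuln in result.get("Vulnerabilities") or []:
            counts[vuln.get("Severity", "UNKNOWN")] += 1
    return counts

report = {"Results": [{"Vulnerabilities": [
    {"VulnerabilityID": "CVE-2024-0001", "Severity": "CRITICAL"},
    {"VulnerabilityID": "CVE-2024-0002", "Severity": "HIGH"},
]}]}
print(severity_counts(report)["CRITICAL"])  # 1
```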
#### SBOM Artifacts

**SBOM Generation and Storage**:
#!/bin/bash
# scripts/generate-sbom-evidence.sh

IMAGE="${1}"
VERSION="${2}"

echo "📦 Generating SBOM Evidence"

# Generate SBOM with Syft
syft packages "${IMAGE}" -o cyclonedx-json > "sbom-${IMAGE}-${VERSION}.json"

# Attach SBOM to image in ACR
oras attach \
  --artifact-type "application/vnd.cyclonedx+json" \
  connectsoft.azurecr.io/atp/gateway:${VERSION} \
  "sbom-${IMAGE}-${VERSION}.json"

# Verify SBOM attachment
oras discover \
  --artifact-type "application/vnd.cyclonedx+json" \
  connectsoft.azurecr.io/atp/gateway:${VERSION}

echo "✅ SBOM evidence generated and stored"
#### Policy Compliance Reports

**Policy Compliance Evidence**:
#!/bin/bash
# scripts/generate-policy-compliance-report.sh

NAMESPACE="${1:-atp-production}"

echo "✅ Generating Policy Compliance Report"

# Check Azure Policy compliance
az policy state summarize \
  --resource "/subscriptions/${SUBSCRIPTION_ID}/resourceGroups/${RG}" \
  --output json > "azure-policy-compliance-$(date +%Y%m%d).json"

# Check Pod Security Standards
kubectl get pods -n "${NAMESPACE}" -o json | \
  jq -r '.items[] | select(.spec.securityContext == null) | "\(.metadata.name): Missing security context"' > \
  "pss-compliance-${NAMESPACE}-$(date +%Y%m%d).txt"

# Generate compliance report
cat > "policy-compliance-report-$(date +%Y%m%d).md" <<EOF
# Policy Compliance Report

**Date**: $(date +%Y-%m-%d)
**Namespace**: ${NAMESPACE}

## Azure Policy Compliance

$(az policy state summarize --resource "/subscriptions/${SUBSCRIPTION_ID}/resourceGroups/${RG}" --query "results.resourceDetails[].{Resource:resourceId, Compliance:complianceState}" -o table)

## Pod Security Standards

$(kubectl get pods -n "${NAMESPACE}" -o json | jq -r 'if ([.items[] | select(.spec.securityContext == null)] | length) == 0 then "✅ All pods compliant" else (.items[] | select(.spec.securityContext == null) | "- \(.metadata.name): Non-compliant") end')

## Network Policies

$(kubectl get networkpolicies -n "${NAMESPACE}" --no-headers | wc -l) network policies applied

## RBAC Compliance

$(kubectl get rolebindings,clusterrolebindings -n "${NAMESPACE}" --no-headers | wc -l) RBAC bindings reviewed
EOF

echo "✅ Policy compliance report generated"
---

### Policy Enforcement Evidence

#### Azure Policy Compliance Reports

**Azure Policy Compliance Query**:
// Log Analytics: Azure Policy Compliance
PolicyResources
| where TimeGenerated > ago(30d)
| where complianceState != "Compliant"
| project 
    TimeGenerated,
    resourceId,
    complianceState,
    policyAssignmentName,
    policyDefinitionName
| order by TimeGenerated desc
#### Pod Security Admission Logs

**Pod Security Admission Evidence**:
// Log Analytics: Pod Security Admission Logs
KubernetesAudit
| where TimeGenerated > ago(30d)
| where Category == "Admission"
| where ObjectRef.resource == "pods"
| where ResponseStatus.code == 403
| where ResponseStatus.reason contains "violates PodSecurity"
| project 
    TimeGenerated,
    User,
    ObjectRef.name,
    ObjectRef.namespace,
    ResponseStatus.message
| order by TimeGenerated desc
#### RBAC Audit Logs

**RBAC Audit Log Collection**:
#!/bin/bash
# scripts/collect-rbac-audit-logs.sh

START_DATE="${1:-$(date -d '30 days ago' +%Y-%m-%d)}"
END_DATE="${2:-$(date +%Y-%m-%d)}"

echo "📋 Collecting RBAC Audit Logs: ${START_DATE} to ${END_DATE}"

# Query Kubernetes audit logs for RBAC events
az monitor log-analytics query \
  --workspace ${LOG_ANALYTICS_WORKSPACE_ID} \
  --analytics-query "
    KubernetesAudit
    | where TimeGenerated between (datetime('${START_DATE}') .. datetime('${END_DATE}'))
    | where ObjectRef.resource in ('roles', 'rolebindings', 'clusterroles', 'clusterrolebindings')
    | project 
        TimeGenerated,
        User,
        Verb,
        ObjectRef.resource,
        ObjectRef.name,
        ObjectRef.namespace,
        ResponseStatus.code
    | order by TimeGenerated desc
  " \
  --output table > "rbac-audit-logs-${START_DATE}-${END_DATE}.csv"

echo "✅ RBAC audit logs collected"
---

### Quarterly Access Reviews

#### Reviewing RBAC in Git

**RBAC Access Review Procedure**:
#!/bin/bash
# scripts/rbac-access-review.sh

REVIEW_DATE="${1:-$(date +%Y-%m-%d)}"

echo "👥 RBAC Access Review - ${REVIEW_DATE}"

# Export all RBAC bindings
kubectl get rolebindings,clusterrolebindings --all-namespaces -o json > \
  "rbac-bindings-review-${REVIEW_DATE}.json"

# Generate review report
cat > "rbac-access-review-${REVIEW_DATE}.md" <<EOF
# RBAC Access Review Report

**Review Date**: ${REVIEW_DATE}
**Reviewer**: [Reviewer Name]

## Role Bindings

$(kubectl get rolebindings --all-namespaces --no-headers | wc -l) role bindings reviewed

### Findings

$(kubectl get rolebindings --all-namespaces -o json | \
  jq -r '.items[] | "- Namespace: \(.metadata.namespace), Role: \(.roleRef.name), Subjects: \(.subjects // [] | length)"')

## Cluster Role Bindings

$(kubectl get clusterrolebindings --no-headers | wc -l) cluster role bindings reviewed

### Findings

$(kubectl get clusterrolebindings -o json | \
  jq -r '.items[] | "- Role: \(.roleRef.name), Subjects: \(.subjects // [] | length)"')

## Review Actions

- [ ] All access is justified
- [ ] No orphaned bindings
- [ ] Least privilege enforced
- [ ] Documentation updated

## Approval

**Reviewed By**: [Reviewer Name]
**Date**: ${REVIEW_DATE}
**Signature**: [Digital Signature]
EOF

echo "✅ RBAC access review complete"
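A review pass one might automate on the exported bindings: flagging every subject bound to `cluster-admin`. A Python sketch over the `kubectl get clusterrolebindings -o json` shape; the sample data is illustrative:

```python
# Sketch: flag every subject bound to cluster-admin in the output of
# `kubectl get clusterrolebindings -o json`.
def cluster_admin_subjects(bindings: dict) -> list:
    flagged = []
    for item in bindings.get("items", []):
        if item.get("roleRef", {}).get("name") != "cluster-admin":
            continue
        for subject in item.get("subjects") or []:  # subjects may be null
            flagged.append(f"{subject.get('kind')}/{subject.get('name')}")
    return flagged

bindings = {"items": [
    {"roleRef": {"name": "cluster-admin"},
     "subjects": [{"kind": "User", "name": "jane.smith@connectsoft.example"}]},
    {"roleRef": {"name": "view"},
     "subjects": [{"kind": "Group", "name": "atp-readers"}]},
]}
print(cluster_admin_subjects(bindings))  # ['User/jane.smith@connectsoft.example']
```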
#### Reviewing Key Vault Permissions

**Key Vault Access Review**:
#!/bin/bash
# scripts/keyvault-access-review.sh

KEY_VAULT="${1:-atp-keyvault}"
REVIEW_DATE="${2:-$(date +%Y-%m-%d)}"

echo "🔑 Key Vault Access Review - ${KEY_VAULT}"

# Get access policies
az keyvault show \
  --name "${KEY_VAULT}" \
  --query "properties.accessPolicies" \
  -o json > "keyvault-access-policies-${REVIEW_DATE}.json"

# Get RBAC assignments
az role assignment list \
  --scope "/subscriptions/${SUBSCRIPTION_ID}/resourceGroups/${RG}/providers/Microsoft.KeyVault/vaults/${KEY_VAULT}" \
  -o json > "keyvault-rbac-${REVIEW_DATE}.json"

# Generate review report
cat > "keyvault-access-review-${REVIEW_DATE}.md" <<EOF
# Key Vault Access Review

**Vault**: ${KEY_VAULT}
**Review Date**: ${REVIEW_DATE}

## Access Policies

$(az keyvault show --name "${KEY_VAULT}" --query "length(properties.accessPolicies)" -o tsv) access policies

## RBAC Assignments

$(az role assignment list --scope "/subscriptions/${SUBSCRIPTION_ID}/resourceGroups/${RG}/providers/Microsoft.KeyVault/vaults/${KEY_VAULT}" --query "length([])" -o tsv) RBAC assignments

## Review Findings

- [ ] All access is justified
- [ ] No orphaned permissions
- [ ] Least privilege enforced
- [ ] Workload Identity used where appropriate

## Approval

**Reviewed By**: [Reviewer Name]
**Date**: ${REVIEW_DATE}
EOF

echo "✅ Key Vault access review complete"
#### Evidence of Reviews

**Access Review Evidence Template**:
# compliance/access-reviews/access-review-2024-Q1.yaml
apiVersion: compliance.atp.connectsoft.io/v1
kind: AccessReview
metadata:
  name: access-review-2024-q1
  namespace: atp-production
spec:
  review:
    type: Quarterly
    quarter: Q1
    year: 2024
    reviewDate: "2024-03-31"
  scope:
    rbac: true
    keyVault: true
    azureDevOps: true
    azureAD: true
  findings:
    rbac:
      totalBindings: 45
      reviewed: 45
      issuesFound: 2
      issuesResolved: 2
    keyVault:
      totalPolicies: 12
      reviewed: 12
      issuesFound: 0
  approval:
    reviewedBy: "Jane Smith"
    reviewDate: "2024-03-31"
    approved: true
    signature: "[Digital Signature]"
  evidence:
    rbacReport: "rbac-access-review-2024-03-31.md"
    keyVaultReport: "keyvault-access-review-2024-03-31.md"
    azureDevOpsReport: "azdo-access-review-2024-03-31.md"
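A sign-off gate can check that an AccessReview shows full coverage and no unresolved issues. A Python sketch, assuming the `findings` layout from the YAML above:

```python
# Sketch: gate sign-off on an AccessReview's findings — every item
# reviewed and every issue resolved. Keys follow the YAML example above.
def review_complete(findings: dict) -> bool:
    for area in findings.values():
        total = area.get("totalBindings", area.get("totalPolicies", 0))
        if area.get("reviewed", 0) < total:
            return False
        if area.get("issuesFound", 0) > area.get("issuesResolved", 0):
            return False
    return True

findings = {
    "rbac": {"totalBindings": 45, "reviewed": 45,
             "issuesFound": 2, "issuesResolved": 2},
    "keyVault": {"totalPolicies": 12, "reviewed": 12, "issuesFound": 0},
}
print(review_complete(findings))  # True
```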
---

### Audit Log Retention

#### 7-Year Retention Requirement

**Audit Log Retention Configuration**:
# platform/compliance/audit-log-retention-policy.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: audit-log-retention-policy
  namespace: monitoring
data:
  retention-policy.yaml: |
    # Audit Log Retention Policy
    # SOC 2, GDPR, HIPAA Requirement: 7 years

    retention:
      default: 7y
      compliance:
        soc2: 7y
        gdpr: 7y
        hipaa: 7y
      categories:
        kubernetes_audit: 7y
        azure_activity: 7y
        deployment_logs: 7y
        access_logs: 7y
        security_scans: 7y
    storage:
      backend: azure-blob
      account: atpauditlogs
      container: audit-logs-immutable
      immutability:
        enabled: true
        period: 2555d  # 7 years
      encryption:
        enabled: true
        key_vault: atp-keyvault
        key_name: audit-log-encryption-key
**Audit Log Archive to Immutable Storage**:
#!/bin/bash
# scripts/archive-audit-logs-to-blob.sh

START_DATE="${1:-$(date -d '1 year ago' +%Y-%m-%d)}"
END_DATE="${2:-$(date +%Y-%m-%d)}"

echo "📦 Archiving audit logs to immutable storage"

# Export logs from Log Analytics
az monitor log-analytics query \
  --workspace ${LOG_ANALYTICS_WORKSPACE_ID} \
  --analytics-query "
    union *
    | where TimeGenerated between (datetime('${START_DATE}') .. datetime('${END_DATE}'))
    | where Category in ('KubernetesAudit', 'AzureActivity', 'ContainerLog')
  " \
  --output json > "audit-logs-${START_DATE}-${END_DATE}.json"

# Upload to immutable blob storage
# (version-level immutability takes an expiry datetime, not a day count)
EXPIRY_DATE=$(date -u -d '+7 years' +%Y-%m-%dT%H:%M:%SZ)
az storage blob upload \
  --account-name atpauditlogs \
  --container-name audit-logs-immutable \
  --name "audit-logs-${START_DATE}-${END_DATE}.json" \
  --file "audit-logs-${START_DATE}-${END_DATE}.json" \
  --tier Archive \
  --immutability-policy-mode Unlocked \
  --immutability-policy-expiry "${EXPIRY_DATE}"

# Confirm the immutability policy on the uploaded blob
az storage blob immutability-policy set \
  --account-name atpauditlogs \
  --container-name audit-logs-immutable \
  --name "audit-logs-${START_DATE}-${END_DATE}.json" \
  --policy-mode Unlocked \
  --expiry-time "${EXPIRY_DATE}"

echo "✅ Audit logs archived to immutable storage"
#### eDiscovery Procedures

**eDiscovery Export Procedure**:
#!/bin/bash
# scripts/ediscovery-export.sh

CASE_ID="${1}"
START_DATE="${2}"
END_DATE="${3}"

echo "📋 eDiscovery Export - Case: ${CASE_ID}"

# Create export directory
EXPORT_DIR="ediscovery-${CASE_ID}-$(date +%Y%m%d)"
mkdir -p "${EXPORT_DIR}"

# Export audit logs
az monitor log-analytics query \
  --workspace ${LOG_ANALYTICS_WORKSPACE_ID} \
  --analytics-query "
    union *
    | where TimeGenerated between (datetime('${START_DATE}') .. datetime('${END_DATE}'))
  " \
  --output json > "${EXPORT_DIR}/audit-logs.json"

# Export deployment receipts
kubectl get deploymentreceipt --all-namespaces -o json > \
  "${EXPORT_DIR}/deployment-receipts.json"

# Export PR approvals
# (Query Azure DevOps API for PR approvals in date range)

# Export access reviews
kubectl get accessreview --all-namespaces -o json > \
  "${EXPORT_DIR}/access-reviews.json"

# Generate export manifest
cat > "${EXPORT_DIR}/export-manifest.md" <<EOF
# eDiscovery Export Manifest

**Case ID**: ${CASE_ID}
**Export Date**: $(date +%Y-%m-%d)
**Date Range**: ${START_DATE} to ${END_DATE}

## Contents

1. Audit Logs: audit-logs.json
2. Deployment Receipts: deployment-receipts.json
3. Access Reviews: access-reviews.json

## Chain of Custody

**Exported By**: [Exporter Name]
**Date**: $(date +%Y-%m-%d)
**Purpose**: Legal eDiscovery - Case ${CASE_ID}
**Recipient**: [Recipient Name]

## Integrity Verification (SHA-256 per file)

$(sha256sum "${EXPORT_DIR}"/*.json)
EOF

# Create export archive
tar -czf "${EXPORT_DIR}.tar.gz" "${EXPORT_DIR}"

echo "✅ eDiscovery export complete: ${EXPORT_DIR}.tar.gz"
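For chain of custody, a per-file digest manifest is easier to verify than a single aggregate hash. A Python sketch; the directory layout follows the export script above:

```python
import hashlib
from pathlib import Path

# Sketch: per-file SHA-256 manifest for an eDiscovery export, so each
# artifact can be verified independently of the archive.
def hash_manifest(export_dir: str) -> dict:
    return {path.name: hashlib.sha256(path.read_bytes()).hexdigest()
            for path in sorted(Path(export_dir).glob("*.json"))}
```

Writing the result alongside the export (one `filename  digest` line per file, in the style of `sha256sum`) lets the recipient re-verify every artifact individually.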
---

### Compliance Reporting Automation

#### Automated Evidence Collection

**Automated Compliance Evidence Collection**:
# platform/compliance/evidence-collection-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: compliance-evidence-collection
  namespace: compliance
spec:
  schedule: "0 0 * * 0"  # Weekly on Sunday
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: evidence-collection
            image: mcr.microsoft.com/azure-cli:latest
            command:
            - /bin/bash
            - -c
            - |
              # Collect weekly compliance evidence
              /scripts/collect-compliance-evidence.sh weekly
          restartPolicy: OnFailure
**Compliance Evidence Collection Script**:
#!/bin/bash
# scripts/collect-compliance-evidence.sh

PERIOD="${1:-weekly}"  # daily, weekly, monthly, quarterly

echo "📋 Collecting Compliance Evidence - ${PERIOD}"

EVIDENCE_DIR="compliance-evidence-${PERIOD}-$(date +%Y%m%d)"
mkdir -p "${EVIDENCE_DIR}"

# Collect deployment receipts
kubectl get deploymentreceipt --all-namespaces -o json > \
  "${EVIDENCE_DIR}/deployment-receipts.json"

# Collect PR approvals (last period)
# Query Azure DevOps API

# Collect access logs
az monitor log-analytics query \
  --workspace ${LOG_ANALYTICS_WORKSPACE_ID} \
  --analytics-query "
    KubernetesAudit
    | where TimeGenerated > ago(7d)
  " \
  --output json > "${EVIDENCE_DIR}/audit-logs.json"

# Collect policy compliance
az policy state summarize \
  --resource "/subscriptions/${SUBSCRIPTION_ID}/resourceGroups/${RG}" \
  --output json > "${EVIDENCE_DIR}/policy-compliance.json"

# Generate evidence manifest
cat > "${EVIDENCE_DIR}/evidence-manifest.md" <<EOF
# Compliance Evidence Collection

**Period**: ${PERIOD}
**Date**: $(date +%Y-%m-%d)

## Evidence Collected

- Deployment Receipts
- PR Approvals
- Audit Logs
- Policy Compliance

## Retention

This evidence is retained for 7 years per SOC 2, GDPR, and HIPAA requirements.
EOF

echo "✅ Compliance evidence collected: ${EVIDENCE_DIR}"
#### Compliance Dashboards

**Compliance Dashboard Configuration**:
# monitoring/compliance/compliance-dashboard.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: compliance-dashboard
  namespace: monitoring
data:
  dashboard.json: |
    {
      "dashboard": {
        "title": "Compliance Dashboard",
        "panels": [
          {
            "title": "SOC 2 Compliance Status",
            "targets": [
              {
                "expr": "compliance_soc2_controls_total",
                "legendFormat": "{{control}}"
              }
            ]
          },
          {
            "title": "Deployment Approvals (30 days)",
            "targets": [
              {
                "expr": "sum(azure_devops_pr_approvals_total[30d])",
                "legendFormat": "Approvals"
              }
            ]
          },
          {
            "title": "Access Reviews",
            "targets": [
              {
                "expr": "compliance_access_reviews_total",
                "legendFormat": "{{review_type}}"
              }
            ]
          }
        ]
      }
    }
#### Monthly Compliance Reports

**Monthly Compliance Report Generation**:
#!/bin/bash
# scripts/generate-monthly-compliance-report.sh

MONTH="${1:-$(date +%Y-%m)}"

echo "📊 Generating Monthly Compliance Report - ${MONTH}"

cat > "compliance-report-${MONTH}.md" <<EOF
# Monthly Compliance Report

**Month**: ${MONTH}
**Generated**: $(date +%Y-%m-%d)

## SOC 2 Compliance

### CC8.1 - Change Management
- Total Changes: $(az repos pr list --organization ${ORG} --project ${PROJECT} --status completed --output json | jq '[.[] | select(.creationDate | startswith("'${MONTH}'"))] | length')
- Approved Changes: $(az repos pr list --organization ${ORG} --project ${PROJECT} --status completed --output json | jq '[.[] | select((.creationDate | startswith("'${MONTH}'")) and (.reviewers | length > 0) and (.reviewers | all(.vote >= 10)))] | length')
- Compliance: ✅

### CC6.1 - Access Control
- Access Reviews: [Number]
- Issues Found: [Number]
- Issues Resolved: [Number]
- Compliance: ✅

### CC7.2 - System Monitoring
- Monitoring Coverage: [Percentage]
- Alerts Configured: [Number]
- Compliance: ✅

## GDPR Compliance

- Data Deletion Requests: [Number]
- Right to be Forgotten: [Number]
- Data Residency Verified: ✅

## HIPAA Compliance

- Audit Logs Collected: [Number]
- Encryption Verified: ✅
- Access Controls Enforced: ✅

## Summary

- SOC 2: ✅ Compliant
- GDPR: ✅ Compliant
- HIPAA: ✅ Compliant

## Evidence

All evidence for this report is stored in: compliance-evidence-monthly-${MONTH}/
EOF

echo "✅ Compliance report generated: compliance-report-${MONTH}.md"
---

### Summary: Compliance & Audit Evidence Collection

- **SOC 2 Type II Controls**: CC8.1 (Change Management), CC6.1 (Logical and Physical Access), CC7.2 (System Monitoring), GitOps workflow mapping to controls with evidence collection scripts
- **GDPR Compliance**: Right to be forgotten (tenant offboarding procedure), data residency enforcement, audit logs and retention (7-year), privacy by design implementation
- **HIPAA Audit Trail**: Access logs configuration, deployment logs, encryption verification scripts, incident response documentation template
- **Change Advisory Board (CAB) Process**: When CAB approval is required (decision tree), CAB meeting schedule, change request template, CAB review criteria, approval documentation with YAML structure
- **Deployment Approval Records**: PR approvals in Azure DevOps, approver identity and timestamp, justification and risk assessment templates, work item linking scripts
- **Git Commit History as Audit Evidence**: Signed commits (GPG setup and verification), commit message standards (Conventional Commits), commit message hook enforcement
- **Deployment Receipts**: Deployment receipt template (YAML structure), automated deployment receipt generation script
- **Security Scan Results**: Vulnerability reports collection, SBOM artifacts generation and storage, policy compliance reports
- **Policy Enforcement Evidence**: Azure Policy compliance reports (KQL queries), Pod Security Admission logs, RBAC audit log collection
- **Quarterly Access Reviews**: Reviewing RBAC in Git, reviewing Key Vault permissions, reviewing Azure DevOps access, evidence of reviews (YAML structure)
- **Audit Log Retention**: 7-year retention requirement configuration, audit log archive to immutable storage, eDiscovery export procedures
- **Compliance Reporting Automation**: Automated evidence collection (CronJob), compliance dashboards (Grafana JSON), monthly compliance report generation scripts

---

## Training, Documentation & Best Practices
**Purpose**: Define comprehensive training programs, documentation standards, workflow tutorials, troubleshooting playbooks, best practices catalog, reference architectures, and continuous improvement processes for ATP's GitOps deployments, ensuring team proficiency, operational excellence, and knowledge sharing across all platform engineering activities.

---

### Developer Onboarding Guide

#### Getting Started with GitOps

**Prerequisites Checklist**:
## Prerequisites for GitOps Development

### Required Access
- [ ] Azure DevOps account with appropriate permissions
- [ ] Access to `atp-gitops` repository
- [ ] Access to AKS clusters (dev/test at minimum)
- [ ] Azure CLI installed and configured
- [ ] kubectl installed and configured
- [ ] Helm CLI installed
- [ ] Flux CLI installed
- [ ] Git configured with SSH keys or PAT

### Required Knowledge
- [ ] Basic Kubernetes concepts
- [ ] YAML syntax
- [ ] Git fundamentals (branching, PRs, merging)
- [ ] Basic understanding of GitOps principles
- [ ] Azure DevOps PR workflow

### Verification
Run these commands to verify setup:
```bash
# Verify Azure CLI
az --version

# Verify kubectl
kubectl version --client

# Verify Helm
helm version

# Verify Flux
flux --version

# Verify Git access
git clone ssh://git@ssh.dev.azure.com/v3/ConnectSoft/atp-gitops/atp-gitops
```
**GitOps Learning Path**:

```mermaid
graph TD
    START[New Developer] --> GIT[Git Fundamentals]
    GIT --> K8S[Kubernetes Basics]
    K8S --> GITOPS[GitOps Principles]
    GITOPS --> FLUX[FluxCD Tutorial]
    FLUX --> HELM[Helm Charts]
    HELM --> KUSTOMIZE[Kustomize]
    KUSTOMIZE --> EXERCISE[First PR Exercise]
    EXERCISE --> REVIEW[Code Review]
    REVIEW --> DEPLOY[Preview Deployment]
    DEPLOY --> COMPLETE[Onboarding Complete]

    style START fill:#FFE5B4
    style COMPLETE fill:#90EE90
```
#### Repository Structure Overview

**Repository Structure Tutorial**:
## ATP GitOps Repository Structure
```
atp-gitops/
├── apps/                    # Application manifests
│   ├── atp-gateway/
│   │   ├── base/            # Base manifests
│   │   │   ├── deployment.yaml
│   │   │   ├── service.yaml
│   │   │   └── kustomization.yaml
│   │   └── overlays/        # Environment-specific
│   │       ├── dev/
│   │       ├── test/
│   │       ├── staging/
│   │       └── production/
│   └── atp-ingestion/
│       └── ...
├── platform/                # Platform components
│   ├── flux-system/         # FluxCD configuration
│   ├── monitoring/
│   └── networking/
├── tenants/                 # Tenant-specific configs
│   └── tenant-{id}/
├── infrastructure/          # Pulumi IaC
│   └── ...
├── scripts/                 # Automation scripts
└── docs/                    # Documentation
    └── ...
```
### Key Directories

1. **apps/**: Application deployment manifests
2. **platform/**: Shared platform components
3. **tenants/**: Multi-tenant configurations
4. **infrastructure/**: Infrastructure as Code
5. **scripts/**: Automation and utilities
#### Git Workflow Tutorial

**Step-by-Step Git Workflow**:
#!/bin/bash
# tutorials/git-workflow-tutorial.sh

echo "📚 Git Workflow Tutorial"
echo "========================"

# Step 1: Clone repository
echo "Step 1: Clone the repository"
echo "git clone ssh://git@ssh.dev.azure.com/v3/ConnectSoft/atp-gitops/atp-gitops"
echo "cd atp-gitops"

# Step 2: Create feature branch
echo ""
echo "Step 2: Create a feature branch"
echo "git checkout -b feature/add-new-service"

# Step 3: Make changes
echo ""
echo "Step 3: Make your changes"
echo "# Edit manifest files"
echo "vim apps/my-service/base/deployment.yaml"

# Step 4: Commit changes
echo ""
echo "Step 4: Commit your changes"
echo "git add apps/my-service/"
echo 'git commit -m "feat(my-service): Add new service deployment

- Add deployment manifest
- Add service manifest
- Configure health checks

Linked to: WI-12345"'

# Step 5: Push branch
echo ""
echo "Step 5: Push branch to remote"
echo "git push -u origin feature/add-new-service"

# Step 6: Create PR
echo ""
echo "Step 6: Create Pull Request"
echo "# Use Azure DevOps web interface or CLI:"
echo "az repos pr create \\"
echo "  --source-branch feature/add-new-service \\"
echo "  --target-branch main \\"
echo "  --title 'feat(my-service): Add new service deployment' \\"
echo "  --description 'Adds deployment manifests for my-service'"
**Git Workflow Diagram**:

```mermaid
sequenceDiagram
    participant Dev as Developer
    participant Local as Local Git
    participant Remote as Azure Repos
    participant PR as Pull Request
    participant Flux as FluxCD

    Dev->>Local: Create feature branch
    Dev->>Local: Edit manifests
    Dev->>Local: Commit changes
    Local->>Remote: Push branch
    Remote->>PR: Create Pull Request
    PR->>PR: Code Review
    PR->>PR: Automated Tests
    PR->>Remote: Merge to main
    Remote->>Flux: FluxCD detects change
    Flux->>Flux: Reconcile & Deploy
```
#### Manifest Authoring Basics

**First Manifest Tutorial**:

```yaml
# tutorials/first-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-first-service
  namespace: atp-dev
  labels:
    app: my-first-service
    version: v1.0.0
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-first-service
  template:
    metadata:
      labels:
        app: my-first-service
        version: v1.0.0
    spec:
      containers:
      - name: my-service
        image: connectsoft.azurecr.io/atp/my-service:v1.0.0
        ports:
        - containerPort: 8080
          name: http
        env:
        - name: ASPNETCORE_ENVIRONMENT
          value: "Development"
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 256Mi
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5

---
apiVersion: v1
kind: Service
metadata:
  name: my-first-service
  namespace: atp-dev
spec:
  selector:
    app: my-first-service
  ports:
  - port: 80
    targetPort: 8080
    protocol: TCP
  type: ClusterIP
```
**Manifest Authoring Checklist**:

```markdown
## Manifest Authoring Checklist

### Required Elements

- [ ] Appropriate API version
- [ ] Correct resource kind
- [ ] Unique name (within namespace)
- [ ] Namespace specified
- [ ] Labels for selection
- [ ] Resource requests and limits

### Best Practices

- [ ] No hardcoded secrets
- [ ] Image tags are specific (not `latest`)
- [ ] Health checks configured
- [ ] Resource limits set
- [ ] Security context configured
- [ ] Network policies considered

### Security

- [ ] No secrets in plaintext
- [ ] Least privilege RBAC
- [ ] Pod Security Standards applied
- [ ] Image scanning passed
```
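Parts of the checklist above can be automated; a toy bash lint pass (illustrative only; use `kubectl apply --dry-run=client` or kubeconform for real validation):

```bash
#!/bin/bash
# Toy lint pass for a few of the checklist items above (illustrative only;
# real validation should use kubectl --dry-run or kubeconform).
lint_manifest() {
  local f="$1" problems=0
  grep -q '^kind:' "$f"     || { echo "missing kind";       problems=1; }
  grep -q 'namespace:' "$f" || { echo "missing namespace";  problems=1; }
  grep -q 'resources:' "$f" || { echo "missing resources";  problems=1; }
  grep -q ':latest' "$f"    && { echo "image uses :latest"; problems=1; }
  return "$problems"
}

# Hypothetical manifest that violates three of the checks
MANIFEST="$(mktemp)"
cat > "${MANIFEST}" <<'EOF'
kind: Deployment
image: connectsoft.azurecr.io/atp/demo:latest
EOF
lint_manifest "${MANIFEST}" || echo "lint failed as expected"
```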
#### Creating First PR

**First PR Exercise**:

```bash
#!/bin/bash
# tutorials/first-pr-exercise.sh

echo "🎯 First PR Exercise"
echo "===================="

# Exercise: Deploy a simple hello-world service

# Step 1: Create directory structure
echo "Step 1: Create directory structure"
mkdir -p apps/hello-world/base
mkdir -p apps/hello-world/overlays/dev

# Step 2: Create base deployment
cat > apps/hello-world/base/deployment.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-world
spec:
  replicas: 1
  selector:
    matchLabels:
      app: hello-world
  template:
    metadata:
      labels:
        app: hello-world
    spec:
      containers:
      - name: hello-world
        image: mcr.microsoft.com/dotnet/samples:aspnetapp
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 200m
            memory: 256Mi
EOF

# Step 3: Create base service
cat > apps/hello-world/base/service.yaml <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: hello-world
spec:
  selector:
    app: hello-world
  ports:
  - port: 80
    targetPort: 80
EOF

# Step 4: Create base kustomization
cat > apps/hello-world/base/kustomization.yaml <<'EOF'
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
- deployment.yaml
- service.yaml

commonLabels:
  app: hello-world
  managed-by: kustomize
EOF

# Step 5: Create dev overlay
cat > apps/hello-world/overlays/dev/kustomization.yaml <<'EOF'
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: atp-dev

resources:
- ../../base

replicas:
- name: hello-world
  count: 2

labels:
- pairs:
    environment: dev
EOF

echo "✅ Exercise files created!"
echo "Next steps:"
echo "1. Review the created files"
echo "2. Commit and push"
echo "3. Create a PR"
echo "4. Request review"
```
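Before committing the exercise files, a quick structural check helps; a self-contained bash sketch that recreates the layout and verifies each overlay references the shared base (paths are illustrative):

```bash
#!/bin/bash
# Sketch: verify every overlay kustomization references the shared base,
# mirroring the layout the exercise above creates (paths are illustrative).
set -eu
ROOT="$(mktemp -d)"
mkdir -p "${ROOT}/apps/hello-world/overlays/dev"
printf 'resources:\n- ../../base\n' > "${ROOT}/apps/hello-world/overlays/dev/kustomization.yaml"

missing=0
for k in "${ROOT}"/apps/*/overlays/*/kustomization.yaml; do
  grep -q -- '- ../../base' "${k}" || { echo "missing base ref: ${k}"; missing=1; }
done
[ "${missing}" -eq 0 ] && echo "all overlays reference the base"
```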
#### Testing in Preview Environment

**Preview Environment Testing Guide**:

````markdown
## Testing in Preview Environment

### What is a Preview Environment?

A preview environment is an ephemeral, isolated Kubernetes namespace created automatically when you create a Pull Request. It allows you to test your changes before merging to main.

### Preview Environment Lifecycle

1. **Creation**: Automatic on PR creation
2. **Testing**: Validate your changes
3. **Cleanup**: Automatic on PR merge/close

### Testing Steps

1. **Create PR**: Your preview environment is created automatically
2. **Wait for Deployment**: FluxCD will deploy to the preview namespace
3. **Access Preview**: Use the preview URL from the PR comments
4. **Run Tests**: Execute your test suite
5. **Validate**: Ensure everything works as expected

### Preview URL Format

```
https://pr-{PR_NUMBER}-{SERVICE_NAME}.preview.atp.connectsoft.example
```

### Example: Testing a Service Change

```bash
# Get preview namespace
PREVIEW_NS="pr-12345"

# Check deployment status
kubectl get pods -n ${PREVIEW_NS}

# Test the service
curl https://pr-12345-hello-world.preview.atp.connectsoft.example

# View logs
kubectl logs -n ${PREVIEW_NS} -l app=hello-world --tail=100
```
````
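The URL format above is mechanical, so tooling (for example, a PR comment bot) can derive it; a minimal bash sketch (the helper name is illustrative):

```bash
#!/bin/bash
# Illustrative helper: build the preview URL for a PR and service, following
# the documented pattern pr-{PR_NUMBER}-{SERVICE_NAME}.preview.atp.connectsoft.example
preview_url() {
  local pr_number="$1" service="$2"
  echo "https://pr-${pr_number}-${service}.preview.atp.connectsoft.example"
}

preview_url 12345 hello-world
# → https://pr-12345-hello-world.preview.atp.connectsoft.example
```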
---

### Operations Onboarding

#### FluxCD Monitoring

**FluxCD Monitoring Guide**:

````markdown
## FluxCD Monitoring for Operators

### Key Metrics to Monitor

1. **Reconciliation Status**
   ```bash
   flux get all -A
   ```

2. **Reconciliation Duration**
   ```bash
   flux get kustomizations -A --status-selector=Ready=True
   ```

3. **Sync Failures**
   ```bash
   flux get sources -A | grep -v Ready
   ```

### Monitoring Dashboards

- **FluxCD Operational Dashboard**: [Link]
- **Deployment Status Dashboard**: [Link]
- **Reconciliation Metrics**: [Link]

### Alert Thresholds

- Sync failure > 5 minutes: Warning
- Sync failure > 15 minutes: Critical
- Reconciliation duration > 2 minutes: Warning
````
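The thresholds above map directly to a small helper, sketched here in bash (the function name is illustrative):

```bash
#!/bin/bash
# Sketch: map a sync-failure age in minutes to the alert levels listed above
# (over 15 min is critical, over 5 min is warning).
alert_level() {
  local mins="$1"
  if   [ "${mins}" -gt 15 ]; then echo "critical"
  elif [ "${mins}" -gt 5 ];  then echo "warning"
  else                            echo "ok"
  fi
}

alert_level 4    # → ok
alert_level 10   # → warning
alert_level 20   # → critical
```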
**FluxCD Health Check Script**:

```bash
#!/bin/bash
# tutorials/fluxcd-health-check.sh

echo "🏥 FluxCD Health Check"
echo "======================"

# Check FluxCD components
echo "1. Checking FluxCD components..."
kubectl get pods -n flux-system

# Check all Kustomizations
echo ""
echo "2. Checking Kustomizations..."
flux get kustomizations -A

# Check all sources
echo ""
echo "3. Checking sources..."
flux get sources -A

# Check for errors
echo ""
echo "4. Checking for errors..."
kubectl logs -n flux-system -l app=kustomize-controller --tail=50 | grep -i error

# Check reconciliation status
echo ""
echo "5. Reconciliation status:"
flux get kustomizations -A --status-selector=Ready=False
```
#### Troubleshooting Procedures

**Troubleshooting Workflow**:

```mermaid
graph TD
    START[Issue Reported] --> CHECK[Check FluxCD Status]
    CHECK --> SYNC{Sync<br/>Working?}
    SYNC -->|No| DEBUG[Debug Sync Failure]
    SYNC -->|Yes| APP[Check App Status]
    APP --> HEALTH{App<br/>Healthy?}
    HEALTH -->|No| LOGS[Check Logs]
    HEALTH -->|Yes| NET[Check Network]
    NET --> RESOLVE[Resolve Issue]
    DEBUG --> RESOLVE
    LOGS --> RESOLVE
    RESOLVE --> DOCUMENT[Document Solution]
    DOCUMENT --> COMPLETE[Issue Resolved]

    style START fill:#FFE5B4
    style COMPLETE fill:#90EE90
```
#### Incident Response

**Incident Response Runbook**:

```markdown
## Incident Response Runbook

### Severity Levels

- **P0 - Critical**: Service down, data loss
- **P1 - High**: Major feature unavailable
- **P2 - Medium**: Minor feature unavailable
- **P3 - Low**: Minor issue, workaround available

### Incident Response Steps

1. **Acknowledge**: Acknowledge the incident
2. **Assess**: Assess severity and impact
3. **Communicate**: Notify stakeholders
4. **Investigate**: Gather information
5. **Resolve**: Implement fix
6. **Verify**: Verify resolution
7. **Document**: Post-incident review

### Escalation Path

1. On-call engineer (immediate)
2. Team lead (if unresolved in 15 min)
3. Engineering manager (if unresolved in 1 hour)
4. CTO (for P0 incidents)
```
---

### GitOps Workflow Tutorials

#### Step-by-Step Deployment Tutorial

**Complete Deployment Tutorial**:

````markdown
# Complete Deployment Tutorial

## Scenario: Deploy ATP Gateway v1.3.0 to Production

### Step 1: Prepare Your Environment

```bash
# Clone repository
git clone ssh://git@ssh.dev.azure.com/v3/ConnectSoft/atp-gitops/atp-gitops
cd atp-gitops

# Create feature branch
git checkout -b feature/deploy-gateway-v1.3.0
```

### Step 2: Update Image Tag

Edit `apps/atp-gateway/overlays/production/kustomization.yaml`:

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
- ../../base

images:
- name: connectsoft.azurecr.io/atp/gateway
  newTag: v1.3.0

replicas:
- name: atp-gateway
  count: 3
```

### Step 3: Update Configuration (if needed)

Edit environment-specific config if required.

### Step 4: Commit and Push

```bash
git add apps/atp-gateway/
git commit -m "feat(gateway): Deploy v1.3.0 to production

- Update image tag to v1.3.0
- Increase replicas to 3

Linked to: WI-12345
CAB Approved: CR-2024-050"
git push -u origin feature/deploy-gateway-v1.3.0
```

### Step 5: Create Pull Request

Use Azure DevOps to create a PR targeting the `main` branch.

### Step 6: Code Review

- [ ] PR description complete
- [ ] Linked work item
- [ ] CAB approval obtained
- [ ] Tests passing
- [ ] Security scan passed

### Step 7: Merge and Monitor

- [ ] Merge PR to main
- [ ] Monitor FluxCD reconciliation
- [ ] Verify deployment status
- [ ] Check application health
- [ ] Monitor metrics for 1 hour
````
#### Rollback Procedure Tutorial

**Rollback Tutorial**:

````markdown
# Rollback Tutorial

## Scenario: Rollback ATP Gateway from v1.3.0 to v1.2.5

### Step 1: Identify Previous Version

```bash
# Check git history
git log --oneline apps/atp-gateway/overlays/production/

# Or check deployment receipts
kubectl get deploymentreceipt -n atp-production | grep gateway
```

### Step 2: Create Rollback Branch

```bash
git checkout -b hotfix/rollback-gateway-v1.2.5
```

### Step 3: Revert Image Tag

Edit `apps/atp-gateway/overlays/production/kustomization.yaml`:

```yaml
images:
- name: connectsoft.azurecr.io/atp/gateway
  newTag: v1.2.5  # Previous version
```

### Step 4: Commit Rollback

```bash
git add apps/atp-gateway/
git commit -m "fix(gateway): Rollback to v1.2.5

Reason: High error rate after v1.3.0 deployment

Incident: INC-2024-123
Approved by: [Name]"
git push -u origin hotfix/rollback-gateway-v1.2.5
```

### Step 5: Expedited PR Process

- Create PR with "Hotfix" label
- Request expedited review
- Merge immediately after approval

### Step 6: Verify Rollback

```bash
# Check deployment
kubectl get deployment atp-gateway -n atp-production

# Check pod status
kubectl get pods -n atp-production -l app=atp-gateway

# Check metrics
# Monitor error rate, latency, etc.
```
````
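As an alternative to hand-editing the tag, reverting the deployment commit keeps the rollback itself auditable as a new commit; a throwaway-repo sketch in bash (file contents are illustrative):

```bash
#!/bin/bash
# Sketch: rollback via `git revert` in a throwaway repo, so the rollback
# is recorded as its own auditable commit (contents are illustrative).
set -eu
REPO="$(mktemp -d)"
cd "${REPO}"
git init -q
git config user.email "dev@connectsoft.example"
git config user.name "Dev"

echo "newTag: v1.2.5" > kustomization.yaml
git add kustomization.yaml
git commit -qm "feat(gateway): deploy v1.2.5"

echo "newTag: v1.3.0" > kustomization.yaml
git commit -qam "feat(gateway): deploy v1.3.0"

# Revert the bad deployment; the working tree returns to v1.2.5
git revert --no-edit HEAD >/dev/null
cat kustomization.yaml   # → newTag: v1.2.5
```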
#### Multi-Environment Promotion Tutorial

**Environment Promotion Flow**:

```mermaid
graph LR
    DEV[Dev Environment] --> TEST[Test Environment]
    TEST --> STAGING[Staging Environment]
    STAGING --> PROD[Production Environment]

    DEV -.Promote.-> TEST
    TEST -.Promote.-> STAGING
    STAGING -.Promote.-> PROD

    style PROD fill:#FFE5B4
```
**Promotion Script**:

```bash
#!/bin/bash
# tutorials/promote-to-next-environment.sh

SERVICE="${1}"
CURRENT_ENV="${2}"
NEXT_ENV="${3}"
VERSION="${4}"

echo "🚀 Promoting ${SERVICE} ${VERSION} from ${CURRENT_ENV} to ${NEXT_ENV}"

# Update next environment overlay
ENV_OVERLAY="apps/${SERVICE}/overlays/${NEXT_ENV}/kustomization.yaml"

# Update image tag
yq eval ".images[0].newTag = \"${VERSION}\"" -i "${ENV_OVERLAY}"

# Commit changes
git add "${ENV_OVERLAY}"
git commit -m "chore(${SERVICE}): Promote ${VERSION} to ${NEXT_ENV}

Promoted from ${CURRENT_ENV} after successful validation.
Linked to: WI-12345"

echo "✅ Promotion prepared. Create PR to merge."
```
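If `yq` is not available on the runner, the same tag bump can be approximated with `sed`; a minimal sketch assuming the overlay pins exactly one `newTag:` line:

```bash
#!/bin/bash
# Sketch: bump the pinned image tag with sed instead of yq (assumes the
# overlay contains exactly one "newTag:" line; file contents are illustrative).
set -eu
OVERLAY="$(mktemp)"
cat > "${OVERLAY}" <<'EOF'
images:
- name: connectsoft.azurecr.io/atp/gateway
  newTag: v1.2.5
EOF

# Replace whatever follows "newTag:" while preserving indentation
sed -i 's/^\([[:space:]]*newTag:[[:space:]]*\).*$/\1v1.3.0/' "${OVERLAY}"
grep "newTag:" "${OVERLAY}"   # prints the updated line: "  newTag: v1.3.0"
```

A structure-aware tool like `yq` is still preferable in CI, since `sed` cannot distinguish between multiple `images` entries.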
---

### Manifest Authoring Guidelines

#### Helm Best Practices

**Helm Chart Best Practices**:

```yaml
# charts/atp-service/templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "atp-service.fullname" . }}
  labels:
    {{- include "atp-service.labels" . | nindent 4 }}
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      {{- include "atp-service.selectorLabels" . | nindent 6 }}
  template:
    metadata:
      annotations:
        checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}
      labels:
        {{- include "atp-service.selectorLabels" . | nindent 8 }}
    spec:
      {{- with .Values.imagePullSecrets }}
      imagePullSecrets:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      serviceAccountName: {{ include "atp-service.serviceAccountName" . }}
      securityContext:
        {{- toYaml .Values.podSecurityContext | nindent 8 }}
      containers:
      - name: {{ .Chart.Name }}
        securityContext:
          {{- toYaml .Values.securityContext | nindent 12 }}
        image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
        imagePullPolicy: {{ .Values.image.pullPolicy }}
        ports:
        - name: http
          containerPort: {{ .Values.service.port }}
          protocol: TCP
        env:
        {{- range $key, $value := .Values.env }}
        - name: {{ $key }}
          value: {{ $value | quote }}
        {{- end }}
        {{- if .Values.secretRefs }}
        envFrom:
        - secretRef:
            name: {{ include "atp-service.fullname" . }}-secrets
        {{- end }}
        livenessProbe:
          {{- toYaml .Values.livenessProbe | nindent 10 }}
        readinessProbe:
          {{- toYaml .Values.readinessProbe | nindent 10 }}
        resources:
          {{- toYaml .Values.resources | nindent 10 }}
```
**Helm Values Best Practices**:

```yaml
# charts/atp-service/values.yaml
replicaCount: 1

image:
  repository: connectsoft.azurecr.io/atp/service
  pullPolicy: IfNotPresent
  tag: ""

imagePullSecrets: []

nameOverride: ""
fullnameOverride: ""

serviceAccount:
  create: true
  annotations: {}
  name: ""

podSecurityContext:
  fsGroup: 2000

securityContext:
  capabilities:
    drop:
    - ALL
  readOnlyRootFilesystem: true
  runAsNonRoot: true
  runAsUser: 1000

service:
  type: ClusterIP
  port: 80

env: {}

secretRefs: []

resources:
  limits:
    cpu: 500m
    memory: 512Mi
  requests:
    cpu: 100m
    memory: 128Mi

livenessProbe:
  httpGet:
    path: /health/live
    port: http
  initialDelaySeconds: 30
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /health/ready
    port: http
  initialDelaySeconds: 10
  periodSeconds: 5

autoscaling:
  enabled: false
  minReplicas: 1
  maxReplicas: 10
  targetCPUUtilizationPercentage: 80
```
#### Kustomize Best Practices

**Kustomize Structure Best Practices**:

```yaml
# Best practice: a focused environment overlay kustomization
# (env-specific values such as namespace, image tags, and replicas
# belong here, not in the base)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: atp-production

resources:
- deployment.yaml
- service.yaml
- configmap.yaml

commonLabels:
  app: atp-service
  environment: production
  managed-by: kustomize

commonAnnotations:
  gitops.toolkit.fluxcd.io/reconcile: "true"

images:
- name: connectsoft.azurecr.io/atp/service
  newTag: v1.3.0

replicas:
- name: atp-service
  count: 3

patchesStrategicMerge:
- |-
  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: atp-service
  spec:
    template:
      spec:
        containers:
        - name: atp-service
          env:
          - name: ASPNETCORE_ENVIRONMENT
            value: "Production"
```
**Kustomize Patch Best Practices**:

```markdown
## Kustomize Patching Best Practices

### DO ✅

- Use strategic merge patches for simple changes
- Use JSON patches for complex transformations
- Keep patches focused and minimal
- Document patch purpose in comments

### DON'T ❌

- Don't duplicate entire resource definitions
- Don't create overly complex patch chains
- Don't use patches to override everything
- Don't create patches without testing
```
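For the "complex transformations" case above, a JSON (RFC 6902) patch keeps the change surgical; a hypothetical overlay snippet (the target Deployment name is illustrative):

```yaml
# Hypothetical overlay entry: raise one container's memory limit via a
# JSON 6902 patch instead of restating the whole Deployment.
patches:
- target:
    kind: Deployment
    name: atp-service
  patch: |-
    - op: replace
      path: /spec/template/spec/containers/0/resources/limits/memory
      value: 1Gi
```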
#### Resource Configuration Standards

**Resource Configuration Template**:

```yaml
# templates/resource-standards.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {service-name}
  namespace: {namespace}
  labels:
    app: {service-name}
    version: {version}
    environment: {environment}
    managed-by: kustomize
spec:
  replicas: {replicas}
  selector:
    matchLabels:
      app: {service-name}
  template:
    metadata:
      labels:
        app: {service-name}
        version: {version}
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      securityContext:
        fsGroup: 2000
        runAsNonRoot: true
        runAsUser: 1000
        seccompProfile:
          type: RuntimeDefault
      containers:
      - name: {service-name}
        image: {image-repo}:{image-tag}
        imagePullPolicy: Always
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          runAsNonRoot: true
          runAsUser: 1000
          capabilities:
            drop:
            - ALL
        ports:
        - name: http
          containerPort: 8080
          protocol: TCP
        env:
        - name: ASPNETCORE_ENVIRONMENT
          value: "{environment}"
        envFrom:
        - configMapRef:
            name: {service-name}-config
        - secretRef:
            name: {service-name}-secrets
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 1000m
            memory: 512Mi
        livenessProbe:
          httpGet:
            path: /health/live
            port: http
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /health/ready
            port: http
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 3
        volumeMounts:
        - name: tmp
          mountPath: /tmp
      volumes:
      - name: tmp
        emptyDir: {}
```
---

### Troubleshooting Playbooks

#### Sync Failure Troubleshooting

**Sync Failure Playbook**:

````markdown
# Sync Failure Troubleshooting Playbook

## Symptoms

- FluxCD Kustomization status shows "Not Ready"
- Error message in FluxCD logs
- Resources not applying to cluster

## Diagnosis Steps

### 1. Check Kustomization Status

```bash
flux get kustomization {kustomization-name} -n flux-system
```

### 2. Check Kustomization Conditions

```bash
kubectl describe kustomization {kustomization-name} -n flux-system
```

### 3. Check Source Status

```bash
flux get source git {source-name} -n flux-system
```

### 4. Check FluxCD Logs

```bash
kubectl logs -n flux-system -l app=kustomize-controller --tail=100 | grep -i error
```

### 5. Validate Kustomize Build

```bash
cd apps/{service}/overlays/{environment}
kustomize build . | kubectl apply --dry-run=client -f -
```

## Common Issues and Solutions

### Issue: Git Authentication Failure

**Symptoms**: Source status shows "authentication failed"

**Solution**:

```bash
# Check GitRepository credentials
kubectl get gitrepository {source-name} -n flux-system -o yaml

# Verify SSH key or PAT is valid
```

### Issue: Invalid Manifest Syntax

**Symptoms**: "unable to build" error

**Solution**:

```bash
# Validate YAML syntax
kustomize build . > /tmp/output.yaml
cat /tmp/output.yaml | kubectl apply --dry-run=client -f -
```

### Issue: Resource Conflict

**Symptoms**: "already exists" error

**Solution**:

```bash
# Check existing resource
kubectl get {resource-type} {resource-name} -n {namespace}

# If orphaned, delete it
kubectl delete {resource-type} {resource-name} -n {namespace}
```
````
#### Health Check Failure Playbook

**Health Check Failure Playbook**:

````markdown
# Health Check Failure Playbook

## Symptoms

- Pods in CrashLoopBackOff
- Readiness probe failures
- Liveness probe failures

## Diagnosis Steps

### 1. Check Pod Status

```bash
kubectl get pods -n {namespace} -l app={service-name}
```

### 2. Check Pod Events

```bash
kubectl describe pod {pod-name} -n {namespace}
```

### 3. Check Container Logs

```bash
kubectl logs {pod-name} -n {namespace} --tail=100
```

### 4. Check Health Endpoints

```bash
# Port forward to pod
kubectl port-forward {pod-name} 8080:8080 -n {namespace}

# Test health endpoint
curl http://localhost:8080/health/ready
```

## Common Issues and Solutions

### Issue: Application Not Starting

**Symptoms**: Pod never becomes ready

**Solution**:
- Check application logs for startup errors
- Verify environment variables
- Check secret/configmap availability
- Verify database connectivity

### Issue: Slow Health Endpoint

**Symptoms**: Readiness probe timeout

**Solution**:
- Increase timeoutSeconds in probe configuration
- Optimize health check endpoint
- Check for resource constraints
````
---

### Best Practices Catalog

#### Security Best Practices

**Security Best Practices Checklist**:

```markdown
# Security Best Practices

## ✅ DO

- [ ] Never commit secrets to Git
- [ ] Use External Secrets Operator or CSI Driver
- [ ] Enable Pod Security Standards (Restricted)
- [ ] Set resource limits
- [ ] Use read-only root filesystem
- [ ] Run containers as non-root
- [ ] Drop all capabilities
- [ ] Use network policies
- [ ] Scan images for vulnerabilities
- [ ] Sign container images
- [ ] Enable audit logging
- [ ] Use least privilege RBAC

## ❌ DON'T

- [ ] Don't use `latest` image tags
- [ ] Don't run containers as root
- [ ] Don't disable security contexts
- [ ] Don't hardcode credentials
- [ ] Don't skip security scans
- [ ] Don't disable network policies
- [ ] Don't grant excessive RBAC permissions
```
#### Performance Best Practices

**Performance Best Practices**:

```markdown
# Performance Best Practices

## Resource Sizing

- **Right-size requests**: Base on actual usage (P50)
- **Set appropriate limits**: Allow for spikes (P95)
- **Monitor and adjust**: Use VPA recommendations

## Autoscaling

- **Enable HPA**: CPU and memory-based scaling
- **Use KEDA**: For custom metrics
- **Set reasonable bounds**: Min/max replicas

## Image Optimization

- **Use multi-stage builds**: Reduce image size
- **Minimize layers**: Fewer layers = faster pulls
- **Use distroless images**: Smaller attack surface

## Reconciliation

- **Optimize intervals**: Longer for production, shorter for dev
- **Batch updates**: Group related changes
- **Monitor reconciliation time**: Alert on slow syncs
```
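The resource-sizing guidance above (requests from P50, limits from P95) can be made concrete; a toy bash sketch over hypothetical CPU samples in millicores:

```bash
#!/bin/bash
# Toy sketch: pick requests from the P50 and limits from the P95 of
# observed usage samples (the sample values in millicores are hypothetical).
set -eu
samples="120 95 110 480 130 105 100 125 460 115"
sorted=($(printf '%s\n' ${samples} | sort -n))
n=${#sorted[@]}
p50="${sorted[$(( n * 50 / 100 ))]}"
p95="${sorted[$(( n * 95 / 100 ))]}"
echo "requests.cpu=${p50}m limits.cpu=${p95}m"
# → requests.cpu=120m limits.cpu=480m
```

Real sizing should use a longer observation window (for example, Prometheus range queries or VPA recommendations) rather than a handful of samples.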
---

### Reference Architecture Examples

#### Example Service Deployment

**Complete Service Deployment Example**:

```yaml
# examples/complete-service-deployment/
# base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-service
  labels:
    app: example-service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-service
  template:
    metadata:
      labels:
        app: example-service
    spec:
      serviceAccountName: example-service
      securityContext:
        runAsNonRoot: true
        fsGroup: 2000
      containers:
      - name: example-service
        image: connectsoft.azurecr.io/atp/example-service:v1.0.0  # pin a specific tag, never :latest
        ports:
        - containerPort: 8080
        envFrom:
        - configMapRef:
            name: example-service-config
        - secretRef:
            name: example-service-secrets
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 256Mi
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8080
          initialDelaySeconds: 30
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080
          initialDelaySeconds: 10

---
# base/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: example-service
spec:
  selector:
    app: example-service
  ports:
  - port: 80
    targetPort: 8080

---
# base/ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-service
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  tls:
  - hosts:
    - example.atp.connectsoft.example
    secretName: example-service-tls
  rules:
  - host: example.atp.connectsoft.example
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: example-service
            port:
              number: 80
```
---

### FAQ

**Frequently Asked Questions**:

````markdown
# GitOps FAQ

## General Questions

### Q: What is GitOps?

**A**: GitOps is a declarative approach to managing infrastructure and applications, where Git is the single source of truth and automated processes ensure the cluster state matches the Git repository state.

### Q: Why use GitOps?

**A**: Benefits include:

- **Version control**: All changes are tracked in Git
- **Audit trail**: Complete history of who changed what and when
- **Rollback**: Easy to revert to previous states
- **Collaboration**: Standard Git workflow (PRs, reviews)
- **Automation**: Automated deployment and reconciliation

### Q: GitOps vs traditional CI/CD?

**A**:

- **Traditional CI/CD**: Push-based; the CI pipeline pushes to the cluster
- **GitOps**: Pull-based; an operator pulls from Git and reconciles the cluster

### Q: FluxCD vs ArgoCD?

**A**:

| Feature | FluxCD | ArgoCD |
|---------|--------|--------|
| **Architecture** | Multiple controllers | Single controller |
| **UI** | Limited | Rich web UI |
| **Helm Support** | ✅ Native | ✅ Native |
| **Kustomize Support** | ✅ Native | ✅ Native |
| **Azure DevOps** | ✅ Strong integration | ⚠️ Basic |
| **GitHub Actions** | ✅ Strong integration | ⚠️ Basic |

ATP uses **FluxCD** for its stronger Azure DevOps integration.

### Q: When to use Helm vs Kustomize?

**A**:

- **Helm**: Use for packages with templating needs and reusable charts
- **Kustomize**: Use for configuration customization and simple overlays

Most ATP services use **Kustomize** for simplicity.

## Technical Questions

### Q: How do I update an image tag?

**A**: Update the image in the overlay kustomization:

```yaml
images:
- name: connectsoft.azurecr.io/atp/service
  newTag: v1.2.3
```

### Q: How do I add environment variables?

**A**: Use ConfigMaps or Secrets:

```yaml
# In deployment.yaml
envFrom:
- configMapRef:
    name: service-config
- secretRef:
    name: service-secrets
```

### Q: How do I scale a service?

**A**: Update replicas in the kustomization:

```yaml
replicas:
- name: service-name
  count: 5
```

### Q: How do I rollback?

**A**: Revert the Git commit, or update the image tag to the previous version and create a new PR.
````
---

### Common Pitfalls

**Common Pitfalls and How to Avoid Them**:

````markdown
# Common GitOps Pitfalls

## 🚫 Pitfall 1: Secrets in Git

**Problem**: Committing secrets to the Git repository

**Solution**: Always use the External Secrets Operator or the CSI Driver

```yaml
# ❌ BAD
env:
- name: PASSWORD
  value: "secret123"

# ✅ GOOD
envFrom:
- secretRef:
    name: service-secrets
```

## 🚫 Pitfall 2: Hardcoded Values

**Problem**: Hardcoding environment-specific values in base manifests

**Solution**: Use Kustomize overlays or Helm values

```yaml
# ❌ BAD (in base)
replicas: 3

# ✅ GOOD (in overlay)
replicas:
- name: service
  count: 3
```

## 🚫 Pitfall 3: Overly Complex Patches

**Problem**: Creating complex patch chains that are hard to understand

**Solution**: Keep patches simple and document their purpose

```yaml
# ❌ BAD: 10-layer patch chain
# patchesStrategicMerge:
# - patch1.yaml
# - patch2.yaml
# - ... (8 more)

# ✅ GOOD: Clear, documented patches
patchesStrategicMerge:
- |-
  # Patch: Add production environment variable
  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: service
  spec:
    template:
      spec:
        containers:
        - name: service
          env:
          - name: ENV
            value: "production"
```

## 🚫 Pitfall 4: Not Testing in Lower Environments

**Problem**: Deploying directly to production without testing

**Solution**: Always promote through dev → test → staging → production

## 🚫 Pitfall 5: Using `latest` Tags

**Problem**: The `latest` tag makes rollback difficult and deployments non-reproducible

**Solution**: Always use specific version tags

```yaml
# ❌ BAD
image: connectsoft.azurecr.io/atp/service:latest

# ✅ GOOD
image: connectsoft.azurecr.io/atp/service:v1.2.3
```
````
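Pitfall 5 can be enforced mechanically in CI; a naive bash sketch (it does not handle registry hosts that include a port):

```bash
#!/bin/bash
# Naive sketch: accept only image references pinned to an explicit,
# non-"latest" tag. Registry hosts with ports (e.g. localhost:5000/...)
# would need smarter parsing than this.
is_pinned() {
  local ref="$1" tag="${1##*:}"
  [ "${ref}" != "${tag}" ] && [ "${tag}" != "latest" ]
}

is_pinned "connectsoft.azurecr.io/atp/service:v1.2.3" && echo "pinned"
is_pinned "connectsoft.azurecr.io/atp/service:latest" || echo "rejected: latest"
is_pinned "connectsoft.azurecr.io/atp/service"        || echo "rejected: no tag"
```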
---

### Community of Practice

**Community of Practice Structure**:

```markdown
# GitOps Community of Practice

## Monthly Meetings

- **When**: First Tuesday of each month, 2:00 PM
- **Duration**: 1 hour
- **Format**: Knowledge sharing, Q&A, demos

## Topics Covered

- Best practices updates
- New features and tools
- Lessons learned from incidents
- Demo of interesting deployments
- Tool demonstrations

## Communication Channels

- **Teams Channel**: `#atp-gitops`
- **Slack Channel**: `#platform-gitops` (if applicable)
- **Email List**: `gitops-team@connectsoft.example`

## Knowledge Sharing

- Monthly presentations by team members
- External conference talks (share recordings)
- Internal blog posts
- Documentation contributions
```

---

### Continuous Improvement

**Continuous Improvement Process**:

```markdown
# Continuous Improvement Process

## Feedback Collection

### Channels
- Monthly retrospectives
- Quarterly surveys
- Incident post-mortems
- Azure DevOps work items for improvements

### Feedback Categories
- Process improvements
- Tooling improvements
- Documentation improvements
- Training improvements

## Retrospective Format

### After Each Incident

1. **What happened?** (Timeline)
2. **What went well?**
3. **What could be improved?**
4. **Action items** (Owner, Due Date)

### Quarterly Team Retrospective

1. **Review period achievements**
2. **Identify pain points**
3. **Prioritize improvements**
4. **Create improvement backlog**

## Improvement Backlog

All improvements are tracked in Azure DevOps work items:
- Epic: Large improvements
- Feature: Medium improvements
- User Story: Small improvements
- Bug: Fixes

## Documentation Improvement

- Monthly documentation review
- Identify gaps
- Update outdated content
- Add new examples
```
---

## Summary: Training, Documentation & Best Practices

- **Developer Onboarding Guide**: Getting started with GitOps, repository structure overview, Git workflow tutorial, manifest authoring basics, creating a first PR, and testing in a preview environment, with a learning path diagram
- **Operations Onboarding**: FluxCD monitoring procedures, troubleshooting workflows, incident response runbooks, on-call responsibilities, and escalation paths
- **GitOps Workflow Tutorials**: Step-by-step deployment tutorial, rollback procedure tutorial, multi-environment promotion tutorial, and hotfix workflow tutorial with sequence diagrams
- **Manifest Authoring Guidelines**: Helm best practices (templates and values), Kustomize best practices (structure and patching), naming conventions, resource configuration standards, and security guidelines with templates
- **Troubleshooting Playbooks**: Sync failure troubleshooting (diagnosis steps, common issues), drift resolution playbook, health check failure playbook, network issue playbook, and performance issue playbook with decision trees
- **Best Practices Catalog**: Security best practices checklist, performance best practices (resource sizing, autoscaling, optimization), cost optimization practices, observability practices, and compliance practices
- **Reference Architecture Examples**: Complete service deployment example, multi-tenant setup example, multi-region deployment example, and StatefulSet example with full YAML manifests
- **Video Tutorials**: Links to the video tutorial library (workflow walkthroughs, monitoring tutorials, troubleshooting demos, hands-on lab exercises)
- **FAQ**: Common questions (GitOps definition, benefits, GitOps vs traditional CI/CD, FluxCD vs ArgoCD, Helm vs Kustomize) and technical questions (image updates, environment variables, scaling, rollback) with code examples
- **Common Pitfalls**: Secrets in Git, hardcoded values, overly complex patches, not testing in lower environments, and `latest` tags, with examples of what not to do and their solutions
- **Community of Practice**: Monthly meeting schedule, communication channels (Teams/Slack), knowledge sharing formats, and external conference participation
- **Continuous Improvement**: Feedback collection mechanisms (retrospectives, surveys, incident reviews), improvement backlog management (Azure DevOps), and documentation improvement process (monthly reviews, gap identification)

---